Feb-28-2018, 09:21 AM
(This post was last modified: Feb-28-2018, 09:21 AM by digitalmatic7.)
I've created a very simple bot, and I'm having trouble threading it.
It loads a list of URLs from links.csv (sample list here: https://pastebin.com/QiH3qpRD)
Then it scrapes the rank of each URL from the Alexa API.
The problem is that all the threads are handling the same URL. Can you guys help me figure out where I went wrong in my code?
![[Image: lA0Zf1_BQEGJWIIwSIHrJg.png]](https://image.prntscr.com/image/lA0Zf1_BQEGJWIIwSIHrJg.png)
Do I need to use "multiprocessing.Pool"?
from __future__ import print_function
import threading
import time  # needed for time.sleep() in worker()
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# CREATE URL LIST FROM CSV
df = pd.read_csv('links.csv', header=0)  # df = dataframe
df.insert(1, 'Alexa Rank:', "")  # create new column

# GET URL TOTAL FROM CSV
url_total = len(df.index)
print()
print('Total URLS Loaded:', url_total, "- Task Starting...")
print()
url_total = len(df.index) - 1  # index of the last URL in the list


def worker(Id):
    time.sleep(0.3)
    # COUNTER TO INCREMENT THROUGH URL_LIST
    list_counter = 0
    while list_counter <= url_total:
        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[list_counter, 0],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
        html = scrape.content
        soup = BeautifulSoup(html, 'lxml')
        rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))  # scrape alexa rank
        rank = rank[0]
        df.iloc[list_counter, 1] = rank  # add to dataframe
        print(u"\u2713", '-', list_counter, '-', df.iloc[list_counter, 0], '-', "Alexa Rank:", rank)
        list_counter = list_counter + 1


def main():
    threads = []
    for i in range(4):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()
    print("Main has spawned all the threads")
    for t in threads:
        t.join()

main()

EDIT: I think I've figured out what I was doing wrong. Concurrent instead of parallel. I'm playing with some new code.
***
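For what it's worth, in the code above every thread runs the same `worker`, and each one starts its own `list_counter` at 0 and walks the whole list, which would explain why all four threads handle the same URLs. One possible way to hand each URL to exactly one thread is a thread pool. This is only a rough sketch, assuming Python 3 (`concurrent.futures` is in the standard library there). `fetch_rank` and `scrape_all` are made-up names, and `fetch_rank` is a dummy stand-in for the real Alexa request so the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_rank(url):
    """Hypothetical stand-in for the Alexa request; returns a fake rank."""
    # Assumption: the real bot would call requests.get() here and parse
    # the rank out of the response, as worker() does in the post.
    return len(url)


def scrape_all(urls, fetch=fetch_rank, max_workers=4):
    """Call fetch once per URL on a pool of threads.

    pool.map hands each URL to exactly one worker thread and returns
    the results in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    sample = ["example.com", "python.org", "wikipedia.org"]
    for url, rank in zip(sample, scrape_all(sample)):
        print(url, "-", rank)
```

In the real bot the results could probably be written back from the main thread once `scrape_all` returns, for example `df['Alexa Rank:'] = scrape_all(df.iloc[:, 0].tolist(), fetch=my_fetch)` with `my_fetch` being whatever function wraps the request. That way the worker threads never write to the dataframe at the same time.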
