Apr-12-2019, 04:09 PM
I am trying to get one simple bit of data from several thousand scraped files.
I want to do this using concurrent futures, but I am having a bit of an issue.
I created a sample containing just 10 files for testing, and it looks like this:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from bs4 import BeautifulSoup
import os


class Scrape3:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        filepath = Path('./html')
        citylist = [
            ['Andover'],
            ['Berlin'],
            ['Brooklyn'],
            ['Burlington'],
            ['Colchester'],
            ['Groton'],
            ['Hartland'],
            ['Kent'],
            ['Manchester'],
            ['Marlborough']
        ]
        for city in citylist:
            city.append(filepath / f'{city[0]}_page1.html')
        # for item in citylist:
        #     print(f'{item[0]}, {item[1].resolve()}')
        self.numpages = []
        # self.test_parse(citylist)
        self.get_numpages(citylist)
        print(f'numpages: {self.numpages}')

    def parse(self, city):
        with city[1].open('rb') as fp:
            page = fp.read()
        soup = BeautifulSoup(page, 'lxml')
        return [city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]

    def test_parse(self, citylist):
        for city in citylist:
            print(self.parse(city))

    def get_numpages(self, citylist):
        ex = ThreadPoolExecutor(max_workers=10)
        for city in citylist:
            wait_for = [
                ex.submit(self.parse(city))
            ]
            for f in as_completed(wait_for):
                self.numpages.append(f.result)


if __name__ == '__main__':
    Scrape3()

It all appears to function properly until it comes to what's getting stored in self.numpages.
I expected a list of lists, each containing city name and number of pages, but what I get is:
Output:
numpages: [<bound method Future.result of <Future at 0x7f23423502e8 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f23413fe390 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340b930b8 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340320e80 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f23402fe278 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340bbbf60 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234029d240 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234022e198 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234029dcc0 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234034f0f0 state=finished raised TypeError>>]

I am missing something. Don't know what's creating the TypeError. Anybody know what that might be?

I tested the return statement (by running test_parse) and it does what it's supposed to do:
[city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]
Output:
['Andover', '18']
['Berlin', '91']
['Brooklyn', '76']
['Burlington', '59']
['Colchester', '77']
['Groton', '92']
['Hartland', '1']
['Kent', '23']
['Manchester', '278']
['Marlborough', '39']
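
A possible explanation, judging only from the output shown: ex.submit(self.parse(city)) calls parse right away and hands the returned list to submit(), so the worker thread ends up trying to call a list, which would account for the TypeError stored in each future. Separately, f.result without parentheses appends the bound method rather than the value. The sketch below reproduces both effects in isolation. The parse function and the city data here are made-up stand-ins (no HTML files or BeautifulSoup involved), so treat it as an illustration of the pattern, not a drop-in replacement.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse(city):
    """Hypothetical stand-in for Scrape3.parse; no files are read."""
    name, pages = city
    return [name, str(pages)]


# Made-up sample data, not the real page counts from the post.
cities = [("Alpha", 3), ("Beta", 7), ("Gamma", 1)]

with ThreadPoolExecutor(max_workers=3) as ex:
    # Pattern from the post: parse(...) runs immediately, so submit()
    # receives the returned list instead of a callable.
    bad = ex.submit(parse(cities[0]))
    # exception() waits for the future; the worker failed trying to call a list.
    bad_error = bad.exception()

    # Pass the callable and its argument separately, submit everything
    # before waiting, and call result() with parentheses.
    futures = [ex.submit(parse, city) for city in cities]
    numpages = [f.result() for f in as_completed(futures)]

print(type(bad_error).__name__)
# as_completed yields in completion order, so sort for a stable display.
print(sorted(numpages))
```

In the class from the post, the equivalent change would presumably be ex.submit(self.parse, city) and f.result(), with all futures gathered into one list before the as_completed loop so the work actually overlaps.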
