little parser-script crashes after doing good work for some time

apollo · Feb-03-2021, 10:48 AM

hello dear all - good day Smile

i have some code where i do not know why this permanently runs into an error...

import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 5)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())

note: this fetches data (meta-data) from the WP-plugins. So i gather some meta information about the wordpress-plugins. Originally the code fetches the firts 50 pages like so:

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:

see the page-range it originally should run over 50 pages. But - i have scaled doown the resuits for a smaller range - for only a few pages.

question :could this scaling down cause some issues="

look forward to hear from you

regards apollo Smile

see below the errors

Extracting https://wordpress.org/plugins/envira-gallery-lite/
Extracting https://wordpress.org/plugins/foobox-image-lightbox/
Extracting https://wordpress.org/plugins/all-in-one-favicon/
Extracting https://wordpress.org/plugins/wp-force-ssl/
Extracting https://wordpress.org/plugins/login-lockdown/
Extracting https://wordpress.org/plugins/post-type-switcher/Extracting https://wordpress.org/plugins/cyr3lat/
Extracting https://wordpress.org/plugins/wpfront-scroll-top/

Extracting https://wordpress.org/plugins/gdpr-cookie-compliance/Extracting https://wordpress.org/plugins/mw-wp-form/Extracting https://wordpress.org/plugins/cf7-conditional-fields/
Extracting https://wordpress.org/plugins/genesis-simple-edits/Extracting https://wordpress.org/plugins/user-switching/



Extracting https://wordpress.org/plugins/wp-rollback/
Extracting https://wordpress.org/plugins/woocommerce-google-analytics-integration/
Extracting https://wordpress.org/plugins/wp-crontrol/
Extracting https://wordpress.org/plugins/wp-retina-2x/
Extracting https://wordpress.org/plugins/https-redirection/
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-bcc70c1246ab> in <module>
     42 
     43 for future in futures1:
---> 44     print(future.result())

~\devel\IDE\lib\concurrent\futures\_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433 
    434             self._condition.wait(timeout)

~\devel\IDE\lib\concurrent\futures\_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

~\devel\IDE\lib\concurrent\futures\thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

<ipython-input-2-bcc70c1246ab> in parser(url)
     30         r = req.get(url)
     31         soup = BeautifulSoup(r.content, 'html.parser')
---> 32         target = [item.get_text(strip=True, separator=" ") for item in soup.find(
     33             "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
     34         head = [soup.find("h1", class_="plugin-title").text]

AttributeError: 'NoneType' object has no attribute 'find_next'

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Webscrape script, add to run every day at a specific time	jjcooper	2	2,428	Jun-07-2024, 03:46 AM Last Post: sawtooth500
	Script does not work on Linux server	scrapemasta	0	1,908	Mar-22-2024, 11:07 AM Last Post: scrapemasta
	how to add a login to a bs4 parser-script	apollo	10	13,710	Jul-02-2020, 10:47 PM Last Post: snippsat
	Python-selenium script for automated web-login does not work	hectorKJ	2	6,890	Sep-10-2019, 01:29 PM Last Post: buran

little parser-script crashes after doing good work for some time

User Panel Messages

Announcements