BeautifulSoup not parsing other URLs

giddyhead · Feb-23-2022, 05:35 PM

Hello again everyone. I have the following issue. The script gets as far as the second page, for example "https://www.startpage.com/lookup/?search=position%202&version=NUM2200", and then returns to the first page, https://www.startpage.com/lookup/?search=position%201&version=NUM2200. I used urljoin to combine the base URL with the relative href of the next page, but the script keeps cycling back and forth between page 1 and page 2. Why does it do that, and what can I do to fix it? Thanks.

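import re
from urllib.parse import urljoin

from lxml import etree
import html5lib  # parser backend used by BeautifulSoup below
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.startpage.com"  # assumed: site root taken from the start URL
url = "https://www.startpage.com/lookup/?search=position%201&version=NUM2200"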


while True:
     
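     request = requests.get(url) # Fetch the current page
     
     soup = BeautifulSoup(request.content, 'html5lib') # Parse the page content
     dom = etree.HTML(str(soup)) # Build an lxml tree for the XPath lookup
     # Join the base URL and the relative href to get the full URL of the next page.
     # Note: this XPath selects a link by position only, so if the layout shifts
     # (e.g. a "previous" link comes first on page 2) it may point back to page 1.
     url = urljoin(BASE_URL, dom.xpath('/html/body/div[2]/div/section/div[3]/div/div[2]/section/div[1]/div[1]/div[1]/a')[0].get("href"))
     print('This is the next url', url)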
   
         
     skip_words = {'cook Book', 'Study Tools', 'Explore More', 'WayPlus', 'Store'} # Navigation labels to filter out
     for a in soup.find_all("span", {'class': re.compile(r'^text')}): # Spans whose class starts with "text"
          if a.text not in skip_words:
               print('\n', a.text)
     
               

           #with open(f'{chp}.txt', 'w', encoding='utf-8') as f:
            #f.write(chp+'\n'+i.text)
   
     print('This is url', url)
     
     if url == 'https://www.startpage.com/lookup/?search=position%207&version=NUM2203': # Page to stop at (exact match, not a substring test)
          break # Break out of loop
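
One possible way to stop the back-and-forth, whatever the cause turns out to be, is to remember which pages were already fetched and never follow a link that leads back to one of them. The sketch below only illustrates that idea and is not a confirmed fix: the function name pick_next_url and the hrefs in page_links are made up for the example (the real page structure is not known here), and no network request is made.

```python
from urllib.parse import urljoin


def pick_next_url(current_url, hrefs, visited):
    """Return the first link that was not crawled yet, or None if all were seen."""
    for href in hrefs:
        # Resolve the relative href against the page it was found on.
        candidate = urljoin(current_url, href)
        if candidate not in visited:
            return candidate
    return None


# Hypothetical hrefs standing in for what the navigation block might contain.
page_links = {
    "https://www.startpage.com/lookup/?search=position%201&version=NUM2200": [
        "/lookup/?search=position%202&version=NUM2200",
    ],
    "https://www.startpage.com/lookup/?search=position%202&version=NUM2200": [
        "/lookup/?search=position%201&version=NUM2200",  # assumed "previous" link
        "/lookup/?search=position%203&version=NUM2200",  # assumed "next" link
    ],
}

url = "https://www.startpage.com/lookup/?search=position%201&version=NUM2200"
visited = set()
order = []

# Stop when there is no unvisited link or the sample data has no entry for the page.
while url is not None and url in page_links:
    visited.add(url)
    order.append(url)
    url = pick_next_url(url, page_links[url], visited)

print(order)
print("stopped at", url)
```

In the real script, the hrefs would come from the links found on each fetched page, and the visited set would replace the reliance on a single positional XPath.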
