Problem with scraping Website

giddyhead · (This post was last modified: Jun-22-2022, 01:43 AM by giddyhead.)

Good day, everyone. I am having trouble with web scraping and I am not sure why my script does not scrape anything. I am using XPath and also BeautifulSoup to get the next page URL as a test, but neither approach works. What am I doing wrong?

import requests
from lxml import etree
import html5lib
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time, re
import csv
import time

start = time.time()

print('Starting Program')       
base ="https://www.studylight.org/lexicons/eng/hebrew/1.html"
url = "https://www.studylight.org/lexicons/eng/hebrew/1.html"

while True:
     
     request = requests.get(urljoin(base,url)) #Get URL server status
     soup = BeautifulSoup(request.content, 'html5lib') #Pass url content to Soup
     
     dom = etree.HTML(str(soup)) #Ini etree
     url = dom.xpath('/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a') #Find Next Page URL
     url2 = urljoin(base,url)

     urltest2 = soup.find_all("span", class_="greek-hebrew fs-21") #Find next url
     print('Test First url', url2,' Test number 2 ' , urltest2)
     # #for line in soup.find_all('a'):
     #       #print(urljoin(base,line.text))#.get('href'))

     if url2 in 'https://www.studylight.org/lexicons/eng/hebrew/3.html':  # Page to Stop
          break  # Break out of loop

print('Program Completed')

AhanaSharma · (This post was last modified: Mar-11-2024, 05:31 PM by Larz60+.)

I found a couple of issues in your code:

The absolute XPath '/html/body/div[1]/div[3]/div[2]/div[4]/form/div/div[3]/div[2]/a' may not match the next-page link. A path that long, counted from the document root, stops matching as soon as the parsed tree differs from the one it was copied from. Check that it really selects the anchor tag (<a>) that holds the link to the next page.

xpath() returns a list of elements, not a URL string, so urljoin(base, url) cannot work on the result directly. Take the first element from the list and read its href attribute before joining it with the base URL.
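As a sketch of that second point, the snippet below runs against an inline HTML string rather than the live site. The markup, the next-link class, and the first_href helper are made up for illustration and are not taken from studylight.org; the only thing it is meant to show is that xpath() gives back a list that has to be unpacked before urljoin can use it.

```python
from urllib.parse import urljoin
from lxml import etree

# Hypothetical markup for the example, not copied from studylight.org
SAMPLE_HTML = (
    '<html><body><div>'
    '<a class="next-link" href="/lexicons/eng/hebrew/2.html">Next</a>'
    '</div></body></html>'
)
BASE = "https://www.studylight.org/lexicons/eng/hebrew/1.html"


def first_href(dom, expression, base):
    """Return the first href matched by the XPath expression as an absolute URL, or None."""
    matches = dom.xpath(expression)  # a list, even when only one node matches
    if not matches:
        return None  # nothing matched, so there is no next page to follow
    return urljoin(base, matches[0])


dom = etree.HTML(SAMPLE_HTML)
# Ending the expression in /@href selects the attribute value instead of the element
next_url = first_href(dom, '//a[@class="next-link"]/@href', BASE)
missing = first_href(dom, '//a[@class="no-such-class"]/@href', BASE)
print(next_url)
print(missing)
```

A short relative expression such as the one above tends to survive layout changes better than a full path from /html, but the right expression for the real page still has to be worked out from its actual markup.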

Here's a revised version of your code:

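import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

start = time.time()

print('Starting Program')
base = "https://www.studylight.org/lexicons/eng/hebrew/1.html"
stop = "https://www.studylight.org/lexicons/eng/hebrew/3.html"  # Page to stop at
max_pages = 10  # Safety cap so a bad selector cannot crawl the whole site
url = base
visited = set()  # Pages already fetched, so the loop never requests one twice

while url and url != stop and len(visited) < max_pages:
    visited.add(url)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop with an error on a bad HTTP status
    soup = BeautifulSoup(response.content, 'html5lib')

    # Every link into the Hebrew lexicon, made absolute. This selector is broad;
    # narrow it once you know which element holds the "next" link.
    links = [urljoin(url, a['href']) for a in soup.select('a[href^="/lexicons/eng/hebrew/"]')]
    # Follow the first link that has not been fetched yet; None ends the loop
    url = next((link for link in links if link not in visited), None)
    print('Next Page URL:', url)

print('Program Completed in', round(time.time() - start, 2), 'seconds')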
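One more thing worth checking is the stop test in the original loop, which uses in. Between two strings, in is a substring test, so it is true for any fragment of the stop URL, including an empty string. A minimal sketch of the difference, where should_stop is a hypothetical helper name and not part of the original script:

```python
STOP_URL = "https://www.studylight.org/lexicons/eng/hebrew/3.html"


def should_stop(current_url, stop_url=STOP_URL):
    """Exact comparison: stop only when the current page is the stop page."""
    return current_url == stop_url


# Substring test, as in the original loop: true for any fragment of the stop URL
print("" in STOP_URL)                              # True
print("https://www.studylight.org" in STOP_URL)    # True

# Exact test: true only for the stop page itself
print(should_stop("https://www.studylight.org"))   # False
print(should_stop(STOP_URL))                       # True
```

With the substring form, a failed lookup that leaves the URL empty would end the loop on the first pass and hide the real problem.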

Larz60+ wrote Mar-11-2024, 05:31 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Tags have been added this time. Please use BBCode tags on future posts.
