Need help with XPath using requests, time, urllib.request and BeautifulSoup

spacedog · Apr-24-2021, 12:32 AM

I have an xpath expression that I know works. Using the URL:
https://www.yellowpages.com/houston-tx/m...1657186981

and XPath:
//div[@class='sales-info']/H1[1]

Should return this:
Spector Ivan

My code is posted below. Can anyone please explain why it doesn't work here?
It works using scrapy, but I cannot multi-thread in scrapy, so I'm looking for an alternative.

Thanks.

import requests,time,urllib.request, concurrent.futures, pandas as pd  # proxy checker < https://stackoverflow.com/questions/765305/proxy-check-in-python >
from bs4 import BeautifulSoup
import time
from lxml import html

url = 'https://www.yellowpages.com/houston-tx/mip/spector-ivan-11449879?lid=1001657186981'

proxy_handler = urllib.request.ProxyHandler({'http': '149.19.32.99:8082'})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

pg=urllib.request.urlopen(url) 

soup = BeautifulSoup(pg,'lxml')

tree = html.fromstring(soup.prettify())
testdata = tree.xpath("//div[@class='sales-info']/H1[1]")
print('XPath data: ', testdata)

bowlofred · Apr-24-2021, 12:52 AM

Maybe something more like...?

>>> tree.xpath("//div[@class='sales-info']/h1/text()")[0]
'\n        Spector  Ivan\n       '
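The likely reason the original expression comes back empty: lxml's HTML parser lowercases element names, and XPath name tests are case-sensitive, so H1 never matches while h1 does. Here is a small sketch of that, using made-up markup as a stand-in for the real page (SAMPLE is hypothetical, not the actual Yellow Pages HTML):

```python
from lxml import html

# Hypothetical stand-in for the page markup; the real page is not reproduced here.
SAMPLE = """
<html><body>
  <div class="sales-info">
    <H1>
        Spector  Ivan
    </H1>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The HTML parser stores the tag as "h1", so the uppercase name test finds nothing.
upper = tree.xpath("//div[@class='sales-info']/H1[1]")

# The lowercase name test matches the element.
lower = tree.xpath("//div[@class='sales-info']/h1[1]")

# Collapse the surrounding whitespace to get a clean name.
name = " ".join(lower[0].text_content().split())

print(upper)
print(name)
```

Note that without /text() the expression returns element objects rather than strings, which is why printing the list directly does not show the name.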

spacedog · (This post was last modified: Apr-24-2021, 01:48 AM by spacedog.)

Thanks but that didn't do it:
IndexError: list index out of range

bowlofred · Apr-24-2021, 02:48 AM

Odd, I just changed that one line and it "works" for me.

...
#testdata = tree.xpath("//div[@class='sales-info']/H1[1]")
testdata = tree.xpath("//div[@class='sales-info']/h1/text()")[0]
print('XPath data: ', testdata)

Output:
XPath data:
        Spector  Ivan
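If the expression still matches nothing on your end, it is the [0] on an empty list that raises the IndexError. I can only guess why your result is empty (one possibility is that the response you receive is not the listing page), but a guard makes that case easier to spot. A minimal sketch, assuming a hypothetical helper first_text (not part of lxml) and made-up markup for both pages:

```python
from lxml import html


def first_text(tree, expr):
    """Return the first XPath result as whitespace-collapsed text, or None if nothing matched."""
    results = tree.xpath(expr)
    if not results:
        return None
    return " ".join(str(results[0]).split())


EXPR = "//div[@class='sales-info']/h1/text()"

# Hypothetical markup: one page that has the heading, one that does not.
good_page = html.fromstring(
    "<html><body><div class='sales-info'><h1>\n  Spector  Ivan\n </h1></div></body></html>"
)
other_page = html.fromstring(
    "<html><body><div class='error'><p>Access denied</p></div></body></html>"
)

print(first_text(good_page, EXPR))
print(first_text(other_page, EXPR))
```

When the helper returns None, printing the start of the raw response is a quick way to see what was actually fetched.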
