Saving html page and reloading into selenium while developing all xpaths

**Larz60+** · (This post was last modified: Sep-10-2018, 10:26 AM by Larz60+.)

I wanted to save the homepage that I was scraping so that I wouldn't have to fetch it every time I was making changes during development.
First I tried this (filename is a pathlib Path object):

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# spath (project path helper) and url are assumed to be defined elsewhere
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/home/Larz60p/Drivers/chromedriver')
# Fetch the page and save the source as Selenium sees it
browser.get(url)
html = browser.page_source
with filename.open('w') as fp:
    fp.write(html)
time.sleep(2)

Then, when reading the saved file back:

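chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'

browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/home/Larz60p/Drivers/chromedriver')
# Load the saved copy from disk instead of fetching the live page
path = f'file://{filename.resolve()}'
browser.get(path)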

So far so good. But when I tried to extract information with xpath, I didn't get an error, yet I couldn't find what I was looking for either. I wasn't able to determine what the issue was.
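One small thing that can be ruled out in the read-back step is the hand-built file:// URL. The sketch below is not from the thread and is probably not the cause of the missing xpath matches; it only shows that pathlib's as_uri() percent-encodes characters such as spaces, which plain string formatting leaves alone. The file_url helper and the demo path are made up for illustration.

```python
from pathlib import Path, PurePosixPath


def file_url(path: Path) -> str:
    """Return a file:// URL for a local path, percent-encoding special characters."""
    # resolve() makes the path absolute, which as_uri() requires.
    return path.resolve().as_uri()


# A hypothetical path containing a space, to show the difference.
# PurePosixPath is used only so the demo prints the same thing on any OS.
demo = PurePosixPath('/tmp/saved pages/homepage.html')
print(f'file://{demo}')  # the space is left unescaped
print(demo.as_uri())     # the space is percent-encoded
```

With the thread's variables this would be browser.get(file_url(filename)).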

So...
This is a bit of a hack, but it works without flaw:

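import requests

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')

filename = spath.savedhtmlpath / 'homepage.html'
# Download the raw page with requests only if there is no saved copy yet
if not filename.exists():
    response = requests.get(url)
    if response.status_code == 200:
        with filename.open('wb') as fp:
            fp.write(response.content)

browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/home/Larz60p/Drivers/chromedriver')
# Let Selenium load the saved copy from disk
path = f'file://{filename.resolve()}'
browser.get(path)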

I am almost satisfied using this method, especially as I will only use it for development, but my gut tells me there is a better way.
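The "download once, then reuse the saved copy" idea can be pulled into a small helper. This is only a sketch of the same hack, not a better method: cached_page and its fetch parameter are hypothetical names, and the fetch function is passed in so that the helper does not care whether the bytes come from requests or anywhere else.

```python
from pathlib import Path
from typing import Callable


def cached_page(filename: Path, fetch: Callable[[], bytes]) -> Path:
    """Save the page to filename only if no saved copy exists; return the path."""
    if not filename.exists():
        # Create the target folder if needed, then write the raw bytes.
        filename.parent.mkdir(parents=True, exist_ok=True)
        filename.write_bytes(fetch())
    return filename


# Possible use with the variables from the post (assumed to exist):
#
# def fetch_homepage() -> bytes:
#     response = requests.get(url)
#     response.raise_for_status()
#     return response.content
#
# saved = cached_page(filename, fetch_homepage)
# browser.get(saved.resolve().as_uri())
```

Deleting the saved file forces a fresh download on the next run.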

Does anyone know of a better solution, or why 'html = browser.page_source' doesn't work?

***metulburr*** · (This post was last modified: Sep-10-2018, 11:13 AM by metulburr.)

(Sep-10-2018, 10:26 AM)Larz60+ Wrote: or why 'html = browser.page_source' doesn't work?

That is the normal method I use for getting the source. There might be something weird with whatever site you are going to (like an iframe). You do of course have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.
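One way to act on the "make sure the page fully loads" advice is to poll the source until it stops changing before saving it. The helper below is a rough sketch, not metulburr's method: wait_for_stable_source and its parameters are hypothetical, and a page that keeps updating itself will simply hit the timeout. Selenium's own explicit waits (WebDriverWait) are the usual alternative when a specific element can be waited for.

```python
import time
from typing import Callable


def wait_for_stable_source(get_source: Callable[[], str],
                           interval: float = 0.5,
                           stable_checks: int = 3,
                           timeout: float = 30.0) -> str:
    """Poll get_source() until it returns the same text several times in a row."""
    deadline = time.monotonic() + timeout
    last = get_source()
    unchanged = 0
    while unchanged < stable_checks:
        if time.monotonic() > deadline:
            raise TimeoutError('page source was still changing at timeout')
        time.sleep(interval)
        current = get_source()
        if current == last:
            unchanged += 1
        else:
            # The page changed, so start counting again from this version.
            last, unchanged = current, 0
    return last


# Possible use with the browser object from the first post (assumed to exist):
#
# browser.get(url)
# html = wait_for_stable_source(lambda: browser.page_source)
```

Taking a callable instead of the browser itself keeps the helper independent of Selenium, so it can be tried out with a fake source.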

**Larz60+** · Sep-10-2018, 12:05 PM

I tried it with a couple of pages, and I couldn't find anything using xpath. Once I downloaded with requests and saved that, the same xpath worked fine.

Quote: There might be something weird with whatever site you are going to (like an iframe). You do of course have to make sure that the page fully loads between browser.get and browser.page_source. I've been bitten by that so many times.

There is something definitely weird about this site. It was built with Wix, and is full of Ajax (I think) code.
It took a while to get through that, but the hack seems to work fine. I'm curious as to why; perhaps later I'll try to diff the two files, but there's no time for that now.
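For the diff mentioned above, the standard library's difflib is enough for a first look. This is a minimal sketch with a hypothetical diff_html helper; it compares line by line, so if either saved file is minified onto a few very long lines the output will not be very readable.

```python
import difflib
from pathlib import Path
from typing import List


def diff_html(path_a: Path, path_b: Path, context: int = 2) -> List[str]:
    """Return a unified diff (as a list of lines) between two saved HTML files."""
    # errors='replace' avoids a crash if one file has an unexpected encoding.
    a = path_a.read_text(encoding='utf-8', errors='replace').splitlines()
    b = path_b.read_text(encoding='utf-8', errors='replace').splitlines()
    return list(difflib.unified_diff(a, b,
                                     fromfile=str(path_a),
                                     tofile=str(path_b),
                                     n=context,
                                     lineterm=''))


# Possible use, with two hypothetical file names:
#
# for line in diff_html(Path('homepage_selenium.html'), Path('homepage_requests.html')):
#     print(line)
```

An empty list means the two files have identical lines.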

***snippsat*** · (This post was last modified: Sep-10-2018, 01:15 PM by snippsat.)

(Sep-10-2018, 12:05 PM)Larz60+ Wrote: It was built with wix, and full of Ajax (I think) code.

It will be a problem getting all of the source from Wix.
Here’s what Wix says.

Quote:Your Wix site and all of its content is hosted exclusively on Wix’s servers, and cannot be transferred elsewhere.
Specifically, it is not possible to export or embed files, pages or sites, created using the Wix Editor or ADI, to another external destination or host.

jonathanwhite1 · Feb-04-2021, 07:01 AM

It's a good decision. I agree with this.
