I wanted to save the homepage that I was scraping so that I wouldn't have to fetch it every time I was making changes during development.
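First I tried this (filename is a pathlib Path object):

import time
import requests  # only needed for the workaround further down
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()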
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/home/Larz60p/Drivers//chromedriver')
#--| Parse
browser.get(url)
html = browser.page_source
with filename.open('w') as fp:
    fp.write(html)
time.sleep(2)

where filename is a pathlib Path, and then, when reading the page back:

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'
path = f'file://{filename.resolve()}'
browser.get(path)

So far so good. But when I tried to extract information with XPath, I didn't get an error, yet I couldn't find what I was looking for either. I wasn't able to determine what the issue was.

So...
This is a bit of a hack, but it works without flaw:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
filename = spath.savedhtmlpath / 'homepage.html'
if not filename.exists():
    # Fetch the raw HTML once and cache it on disk
    response = requests.get(url)
    if response.status_code == 200:
        with filename.open('wb') as fp:
            fp.write(response.content)
path = f'file://{filename.resolve()}'
browser.get(path)

I am almost satisfied with this method, especially as I will only use it for development, but my gut tells me there is a better way.

Does anyone know of a better solution, or why 'html = browser.page_source' doesn't work?
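
For what it's worth, the fetch-once-then-reuse idea could be wrapped in a small helper. This is only a sketch under a few assumptions: cached_page_uri and its fetch parameter are names made up for illustration, the download is passed in as a callable so the helper itself has no network dependency, and it builds the file URL with Path.as_uri() from pathlib rather than the f-string, since as_uri() also percent-encodes characters such as spaces.

```python
from pathlib import Path
from typing import Callable


def cached_page_uri(url: str, cache_file: Path, fetch: Callable[[str], bytes]) -> str:
    """Return a file:// URI for a local copy of url, downloading it only once.

    fetch is a hypothetical callable that takes the URL and returns the raw
    page bytes; it is only called when cache_file does not exist yet.
    """
    if not cache_file.exists():
        # Create the cache directory on first use.
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        # Write the bytes unchanged so no text encoding is guessed on the way to disk.
        cache_file.write_bytes(fetch(url))
    # as_uri() requires an absolute path, hence resolve() first.
    return cache_file.resolve().as_uri()
```

With requests, fetch could be something like lambda u: requests.get(u).content, and the result would go straight into browser.get(cached_page_uri(url, filename, fetch)). A real fetch function should probably also check the status code, as the code above does.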
