Jan-11-2019, 09:38 AM
Hi all.
I'm experiencing a problem while scraping information from this URL.
The problem is that mechanize seems to change the times when it retrieves the HTML source: every time comes out one hour early. I think it might depend on some local configuration on my system (I live in Italy, and the site might use another time zone).
That said, I could not solve the problem, so I am asking for some help :)
This is a brief working extract of my code:
from __future__ import print_function
from bs4 import BeautifulSoup
import regex as re
import mechanize
from datetime import datetime
URL_PAGE = 'https://www.myfxbook.com/forex-economic-calendar'
# retrieve html code
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
html_content = br.open(URL_PAGE).read()
# soup
soup = BeautifulSoup(html_content, "html.parser")
#regex for extraction
cal_row_re = re.compile(r'^calRow.*') # <-- name
date_re = re.compile(r'\w+\s?\d+:\d+') # <-- date
#extracting events
CalEvents = soup.find_all(id=cal_row_re)
for singleEvent in CalEvents:
    date = singleEvent.find(text=date_re).strip()
    eventName = singleEvent.find(class_='noUnderline').get_text().strip()
    print(date, eventName, sep=';')

Thank you in advance
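One way to test the time-zone hypothesis is to shift the scraped times after extraction and compare them with what the browser shows. The sketch below is only an illustration under stated assumptions: `shift_clock` is a hypothetical helper (not part of mechanize or BeautifulSoup), the scraped text is assumed to contain a 24-hour `HH:MM` time, and the +1 hour offset is a guess based on the one-hour difference described above.

```python
from datetime import datetime, timedelta
import re

# Assumption: the scraped string contains a 24-hour clock time such as "09:30".
TIME_RE = re.compile(r'(\d{1,2}):(\d{2})')


def shift_clock(text, hours):
    """Shift the first HH:MM found in `text` by `hours`.

    Returns a tuple (shifted 'HH:MM', day carry), where the day carry is
    -1, 0 or 1 when the shift crosses midnight.
    """
    match = TIME_RE.search(text)
    if match is None:
        raise ValueError('no HH:MM time found in %r' % text)
    # The calendar date is a placeholder; only the clock time and the
    # midnight carry matter here.
    base = datetime(2019, 1, 11, int(match.group(1)), int(match.group(2)))
    shifted = base + timedelta(hours=hours)
    return shifted.strftime('%H:%M'), (shifted.date() - base.date()).days


# Assumed offset: +1 hour, matching the one-hour lag seen in the scraped data.
print(shift_clock('Jan 11, 09:30', 1))
print(shift_clock('23:15', 1))
```

A fixed offset like this ignores daylight saving time, so it is a diagnostic aid only. If the site turns out to serve times in a known zone, converting with proper time-zone data would be more robust.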
