Update 1-4-2018
- All tested with Python 3.6.4
- Added more Selenium material and a headless mode setup
- Added a final project which plays songs on SoundCloud
- Link to Web-Scraping part 1
In part 2 we do some practice and look at how to scrape pages that use JavaScript.
Scrape and download:
Start by doing some stuff with xkcd.
[Image: AL3Z2m.jpg]
Using a CSS selector, select_one('#ctitle'), for the text and find() for the image link.

    import requests
    from bs4 import BeautifulSoup
    import webbrowser
    import os

    url = 'http://xkcd.com/1/'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    text = soup.select_one('#ctitle').text
    link = soup.find('div', id='comic').find('img').get('src')
    # src is protocol-relative ('//imgs.xkcd.com/...'), so add a scheme
    link = link.replace('//', 'http://')

    # Image title and link
    print('{}\n{}'.format(text, link))

    # Download image
    img_name = os.path.basename(link)
    img = requests.get(link)
    with open(img_name, 'wb') as f_out:
        f_out.write(img.content)

    # Open image in browser or default image viewer
    webbrowser.open_new_tab(img_name)

Output:

    Barrel - Part 1
    http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

Loop over pages and get images:
xkcd has a simple page structure: xkcd.com/1/, xkcd.com/2/ ... etc.
So we can loop over the pages and get the images; just set start and stop.

    import requests
    from bs4 import BeautifulSoup
    import os

    def image_down(start_img, stop_img):
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            url_get = requests.get(url)
            soup = BeautifulSoup(url_get.content, 'html.parser')
            link = soup.find('div', id='comic').find('img').get('src')
            link = link.replace('//', 'http://')
            img_name = os.path.basename(link)
            try:
                img = requests.get(link)
                with open(img_name, 'wb') as f_out:
                    f_out.write(img.content)
            except Exception:
                # Just want images, don't care about errors
                pass

    if __name__ == '__main__':
        start_img = 1
        stop_img = 20
        image_down(start_img, stop_img)

Speed it up a lot with concurrent.futures:
concurrent.futures has a minimalistic API for threading and multiprocessing.
Only one word has to change to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing).
Downloading 200 images (start_img=1, stop_img=200) takes about 1.10 minutes with the code above.
This version presses the time down to about 10 seconds for 200 images.
It makes all the links and runs 20 tasks in parallel with ProcessPoolExecutor (multiprocessing).

    import requests
    from bs4 import BeautifulSoup
    import concurrent.futures
    import os

    def image_down(url):
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://')
        img_name = os.path.basename(link)
        try:
            img = requests.get(link)
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)
        except Exception:
            # Just want images, don't care about errors
            pass

    if __name__ == '__main__':
        start_img = 1
        stop_img = 200
        with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
            for numb in range(start_img, stop_img):
                url = 'http://xkcd.com/{}/'.format(numb)
                executor.submit(image_down, url)

JavaScript, why do I not get all content
JavaScript is used all over the web because of its unique position to run in the browser (client side).
This can make parsing more difficult, because Requests/bs4/lxml cannot get what is executed/rendered by JavaScript.
There are ways to overcome this; here I am going to use Selenium.
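
To make the limitation above concrete, here is a minimal sketch that needs no network and no browser. The HTML string, the "result" id and the ResultText class are all made up for illustration, and the standard library's html.parser stands in for any parser that does not run scripts (bs4 and lxml are in the same position here).

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the HTML the server sends,
# and a script fills it in only when a browser runs it.
PAGE = """
<html><body>
<div id="result"></div>
<script>
document.getElementById("result").textContent = "1 minute";
</script>
</body></html>
"""


class ResultText(HTMLParser):
    """Collect the text found inside <div id="result">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == 'result':
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = ResultText()
parser.feed(PAGE)
# The parser only sees the markup as sent; the script is never executed,
# so the div is still empty.
print(repr(parser.text))
```

A browser driven by Selenium would run the script first, so its page_source would contain the filled-in text.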
Installation
Example with How Secure Is My Password?
This site gives real-time info using JavaScript; going to enter the password 123hello with Selenium.
Then give the source code to BeautifulSoup for parsing.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    import time

    browser = webdriver.Chrome()
    '''
    #-- Firefox
    caps = webdriver.DesiredCapabilities().FIREFOX
    caps["marionette"] = True
    browser = webdriver.Firefox(capabilities=caps)
    '''
    url = 'https://howsecureismypassword.net/'
    browser.get(url)
    inputElement = browser.find_elements_by_class_name("password-input")[0]
    inputElement.send_keys("123hello")
    inputElement.send_keys(Keys.RETURN)
    time.sleep(5) #seconds

    # Give source code to BeautifulSoup
    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # Get JavaScript info from site
    top_text = soup.select_one('.result__text.result__before')
    crack_time = soup.select_one('.result__text.result__time')
    bottom_text = soup.select_one('.result__text.result__after')
    print(top_text.text)
    print(crack_time.text)
    print(bottom_text.text)
    time.sleep(5) #seconds
    browser.close()

Output:

    It would take a computer about
    1 minute
    to crack your password

Headless (not loading browser):
Both Chrome and Firefox now offer a headless mode with their newer drivers.
This means that the browser does not start (visibly) as it did in the example above.
Going to look at a simple setup for both Chrome and Firefox.
Firefox:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    import time

    #--| Setup
    options = Options()
    options.set_headless(headless=True)
    caps = webdriver.DesiredCapabilities().FIREFOX
    caps["marionette"] = True
    browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"path to geckodriver")
    #--| Parse
    browser.get('https://www.python.org/')
    time.sleep(2)
    t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
    print(t.text)
    browser.quit()

Output:

    # Python 3: Fibonacci series up to n

Chrome:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    #--| Setup
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--log-level=3')
    browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'path to chromedriver')
    #--| Parse
    browser.get('https://www.python.org/')
    time.sleep(2)
    t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
    print(t.text)
    browser.quit()

Output:

    # Python 3: Fibonacci series up to n

Final projects:
![[Image: Qr8P7Q.png]](https://imageshack.com/a/img923/747/Qr8P7Q.png)
Here we loop over the most played tracks on SoundCloud this week.
First we have to activate mouse-over on the play button (ActionChains/hover), then click the play button.

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    import time

    def play_song(how_many_songs, time_to__play):
        browser = webdriver.Chrome()
        url = 'https://soundcloud.com/charts/top?genre=all-music&country=all-countries'
        browser.get(url)
        time.sleep(3)
        for song_number in range(1, how_many_songs+1):
            play = browser.find_elements_by_xpath('//*[@id="content"]/div/div/div[1]/div[2]/div/div[3]/ul/li[{}]/div/div[2]/div[2]/a'.format(song_number))[0]
            # Hover first so the play button becomes active, then click it
            hover = ActionChains(browser).move_to_element(play)
            hover.perform()
            play.click()
            time.sleep(time_to__play)
        browser.quit()

    if __name__ == '__main__':
        how_many_songs = 5
        time_to__play = 15 # sec
        play_song(how_many_songs, time_to__play)
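
The Selenium examples above pause with fixed time.sleep() calls while JavaScript fills in the page. One alternative is to poll until a condition holds, with a timeout. Selenium has its own helper for this (WebDriverWait in selenium.webdriver.support.ui); the sketch below is not that API, just the same idea in plain Python so it runs without a browser. The names wait_until and content_loaded are hypothetical, and the "page" is simulated with a timer.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() until it returns a truthy value or the timeout expires.

    Hypothetical helper for illustration: returns the truthy value,
    or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll)


# Stand-in for "content rendered by JavaScript": ready after 0.3 seconds.
ready_at = time.monotonic() + 0.3


def content_loaded():
    return 'loaded' if time.monotonic() >= ready_at else None


print(wait_until(content_loaded, timeout=2.0))
```

With a real browser the condition would be a function that checks whether the element you need has appeared, so the script moves on as soon as the page is ready instead of always waiting the full sleep time.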
