[Python / BS4] How to Scrape

digitalmatic7 · Oct-19-2017, 07:36 AM

I'm new to programming and am having trouble scraping with BS4.

I'm the webmaster for a popular website (I can't share it here, but it uses the Disqus comments platform).

I want to scrape the vote count and the message text of top comments within a set range (for example, comments with 20-200 upvotes).

I noticed that:

The vote count should be easy to scrape, since the upvote count appears in the class of the 'a' element, for example "count-116".
The problem is that the 'a' element isn't linked to the message text in any way I can see.

I've been playing around with some code working on an example site, but so far no success:

from bs4 import BeautifulSoup
import urllib.request
import re

scrape = urllib.request.urlopen('https://disqus.com/home/discussion/channel-discussdisqus/disqus_leaderboard_what_are_the_best_sports_websites/').read()
#soup = BeautifulSoup(scrape,'lxml')
soup = BeautifulSoup(scrape, 'html.parser')

for elem in soup.find_all('a', src=re.compile('count-116')):
    print (elem['src'])

^ This was my attempt to scrape the 'a' element that contains 'count-116'. I was going to run it in a while loop, incrementing the number each time:

count-20
count-21
count-22

...but sadly it doesn't work.

Can anyone help me understand the proper way?

**Larz60+** · Oct-19-2017, 07:57 AM

see in the tutorials section:
Web scraping Part 1
Web scraping Part 2

digitalmatic7 · (This post was last modified: Oct-19-2017, 08:25 AM by digitalmatic7.)

(Oct-19-2017, 07:57 AM)Larz60+ Wrote: see in the tutorials section:
Web scraping Part 1
Web scraping Part 2

Thanks! Great resources!!

Any advice on how to scrape the message if the likes fall in the specified numerical range?

I'm confident I can scrape the number of likes after playing around with the code for a while, but how would I scrape something that has no unique identifier? I need to connect the likes to the message. It's the logic or process of doing it that's really confusing to me.

wavic · Oct-19-2017, 09:45 AM

post_message = soup.find('div', class_='post-message') # target the div
paras = post_message.find_all('p') # get all 'p' tags from that div

If there are many div elements, do this in a for loop:

post_messages = soup.find_all('div', class_='post-message') # post_messages holds many divs to iterate over
for post_message in post_messages:
    paras = post_message.find_all('p')

About the likes: do the same as above, but start by scraping all divs with class 'post-body'. For each of those, scrape the divs with class 'post-message', and for each of those, scrape all the p tags.

After getting the p tags for each post-body div, scrape the a tag with the votes from the same post-body div. That way the votes and the message come from the same comment. Perhaps this content is generated with JavaScript, in which case you have to get the page content using selenium.

You will need to install PhantomJS to run the example below as written, but you can use Chrome or Firefox instead.

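from bs4 import BeautifulSoup
from selenium import webdriver

# url must be set to the address of the page you want to scrape
driver = webdriver.PhantomJS() # webdriver.Firefox() or webdriver.Chrome()
driver.get(url)
html = driver.page_source # the HTML after JavaScript has run

soup = BeautifulSoup(html, 'lxml')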
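
The post-body approach described above can be sketched roughly as follows. This is only an illustration: the HTML is a made-up sample modelled on the class names mentioned in this thread ('post-body', 'post-message', 'count-116'), the 'vote-up' class is invented, and `comments_in_range` is a hypothetical helper name. The real Disqus markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is an assumption.
SAMPLE_HTML = """
<ul>
  <li class="post"><div class="post-body">
    <div class="post-message"><p>Great list, thanks!</p></div>
    <a class="vote-up count-116" href="#">116</a>
  </div></li>
  <li class="post"><div class="post-body">
    <div class="post-message"><p>First line.</p><p>Second line.</p></div>
    <a class="vote-up count-33" href="#">33</a>
  </div></li>
  <li class="post"><div class="post-body">
    <div class="post-message"><p>Nobody liked this.</p></div>
    <a class="vote-up count-0" href="#">0</a>
  </div></li>
</ul>
"""

# Matches a single class such as "count-116" and captures the number.
COUNT_CLASS = re.compile(r'^count-(\d+)$')


def comments_in_range(html, low=20, high=200):
    """Return (votes, message) pairs for comments with low <= votes <= high."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # Each post-body div holds both the message and the vote link,
    # so searching inside it keeps the two connected.
    for body in soup.find_all('div', class_='post-body'):
        vote_link = body.find('a', class_=COUNT_CLASS)
        message = body.find('div', class_='post-message')
        if vote_link is None or message is None:
            continue
        votes = None
        for css_class in vote_link.get('class', []):
            match = COUNT_CLASS.match(css_class)
            if match:
                votes = int(match.group(1))
                break
        if votes is not None and low <= votes <= high:
            text = ' '.join(p.get_text(strip=True) for p in message.find_all('p'))
            results.append((votes, text))
    return results


for votes, text in comments_in_range(SAMPLE_HTML):
    print(votes, text)
```

Reading the number out of the class name means there is no need to loop over count-20, count-21, and so on: one pass over the posts is enough, and the range check happens in Python.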

***metulburr*** · (This post was last modified: Oct-19-2017, 11:22 AM by metulburr.)

You have to use selenium, because if you turn off JavaScript in your browser, nothing loads. Also, the entire page is in an iframe. It took me a while to figure out why the scraping wasn't working. Here is an example of getting past their anti-bot measures. I like to use Chrome/Firefox at first, so that I can easily troubleshoot while looking at the code the browser is actually getting, and then switch over to PhantomJS afterwards to make it headless.

But once you get past the JavaScript and the iframe that trip you up, you can scrape like normal.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = 'https://disqus.com/home/discussion/channel-discussdisqus/disqus_leaderboard_what_are_the_best_sports_websites/'

driver = webdriver.Chrome('/home/metulburr/chromedriver')
driver.set_window_position(0,0)

driver.get(URL)
time.sleep(3)
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')
section = soup.find('section', {'id':'conversation'})
posts = section.find_all('li',{'class':'post'})
for post in posts:
    print(post.find('div', {'class':'post-message '}).p)

If for some reason you want to switch out of the iframe and back to the original page:

driver.switch_to.default_content()

digitalmatic7 · (This post was last modified: Oct-19-2017, 11:21 PM by digitalmatic7.)

Thanks for all the help and advice guys! I have a knack for creating projects that are way above my skill level. Seeing how you solve problems really motivates me a lot to learn.

I would have never thought to get selenium to work with BS4 like that. That's pretty interesting! Also the switch to iframe command is something I never knew about.

Found some great info here too: https://www.guru99.com/handling-iframes-selenium.html

I'm really confused by how that page loads in my normal chrome browser. The Iframe doesn't even show up in the source code for the main html document, how does that make sense? Instead it has its own source code which I can't even access in Firefox. There's a special option in Chrome to see it.

I'm going to play around with the code you guys provided, thanks again!

***metulburr*** · (This post was last modified: Oct-20-2017, 01:01 AM by metulburr.)

In Chrome, if you right click -> Inspect -> Console and open the drop-down that starts with "top", you will see the ID of the iframe.

But my browser didn't show the iframe option via right-clicking either. The way I found out about the iframe was that I printed the page source that selenium was using and looked at it. Once I saw an iframe tag, I knew what the problem was. Printing the source never fails me. I couldn't say much about Firefox, as I mostly use Chrome and PhantomJS.

If selenium can't find a tag and you know the selector is correct, then the next culprit is usually an iframe. Most of the time sites will have fewer than 3 iframes on a single page, and you don't even have to bother with the ID.
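
A rough sketch of that page-source check, assuming you already have the HTML as a string (in the selenium script above it would be driver.page_source). The sample HTML, the iframe id, and the `list_iframes` helper name are all made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page source; on a real run, pass driver.page_source instead.
SAMPLE_SOURCE = """
<html><body>
  <div id="content">Article text</div>
  <iframe id="comments-frame" src="https://example.com/embed"></iframe>
</body></html>
"""


def list_iframes(page_source):
    """Return (id, src) for every iframe tag found in the page source."""
    soup = BeautifulSoup(page_source, 'html.parser')
    return [(frame.get('id'), frame.get('src')) for frame in soup.find_all('iframe')]


for frame_id, src in list_iframes(SAMPLE_SOURCE):
    print(frame_id, src)
```

If this prints anything, the content you are after may live inside one of those frames, and selenium needs to switch into it before the tags become visible.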

***snippsat*** · (This post was last modified: Oct-20-2017, 11:52 PM by snippsat.)

Sometimes there are several ways to solve a task like this.
Looking closer at it, all the posts are stored as JSON.
So it is easier to just parse the JSON with Requests.
"Load more comments" corresponds to the cursor parameter in the URL: cursor=0 (first 50 posts), cursor=1 (load 50 new posts).
Example: take out the likes from the first 2 posts.

import requests

url = 'https://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=4946429135&forum=channel-discussdisqus&order=popular&cursor=0%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F'
r = requests.get(url)
post = r.json()
print(post["response"][0]['likes'])
print(post["response"][1]['likes'])

Output:
116
33
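
Building on that, here is a minimal sketch of filtering the parsed JSON by the 20-200 range from the original question. It works on an already-parsed dict, so no network call is made. The "response" list and the 'likes' key come from the example above; the 'raw_message' key, the sample data, and the `popular_posts` helper name are assumptions, so check the real response for the actual field that holds the comment text.

```python
def popular_posts(payload, low=20, high=200):
    """Return (likes, message) pairs for posts with low <= likes <= high."""
    selected = []
    for post in payload.get('response', []):
        likes = post.get('likes', 0)
        if low <= likes <= high:
            # 'raw_message' is an assumed field name for the comment text.
            selected.append((likes, post.get('raw_message', '')))
    return selected


# Made-up sample shaped like the output of r.json() in the example above.
sample_payload = {
    'response': [
        {'likes': 116, 'raw_message': 'First sample comment'},
        {'likes': 33, 'raw_message': 'Second sample comment'},
        {'likes': 4, 'raw_message': 'Too few likes'},
    ]
}

for likes, message in popular_posts(sample_payload):
    print(likes, message)
```

With the real data you would pass r.json() instead of sample_payload. Because each JSON object carries both the likes and the text, the problem of connecting the two goes away.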

***metulburr*** · Oct-21-2017, 12:00 AM

Nice!!!

How did you know there was a JSON for it?

***snippsat*** · Oct-21-2017, 12:55 AM

(Oct-21-2017, 12:00 AM)metulburr Wrote: How did you know there was a JSON for it?

I did not know. I inspected the site, and the clue lies in the network traffic.

The Disqus comment plug-in system is really large, with 2 billion monthly unique views,
so they must have some structure, and JSON is often the choice for a web API.
They also use Python Django for the back end and JavaScript for the front end.
