I'm working on a review scraper and testing some code as a proof of concept that what I want can be done with requests_html. I'm running into an issue I don't understand: on seemingly random pages I get `AttributeError: 'NoneType' object has no attribute 'text'`, even though the page IS valid.
When I call each page individually, I get all the information I ask for: 10 sets of data corresponding to the 10 reviews per page, across 129 pages. But on certain pages, in this case page 30 and the last page, 129, the selectors stop returning what I asked for and return None instead:
  File "c:\Programs\Python\requests-html_test\test2.py", line 61, in <module>
    print(amz.get_reviews(reviews))
  File "c:\Programs\Python\requests-html_test\test2.py", line 27, in get_reviews
    body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip() # exchange newlines with a space
AttributeError: 'NoneType' object has no attribute 'text'
When I inspect the elements on the affected pages, I see no change in the HTML or in the CSS selectors I'm targeting.
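The traceback shows `.text` being read from `None`: in requests_html, `find(..., first=True)` returns `None` when the selector matches nothing inside that element. Below is a minimal sketch of a guard around that call. The helper `safe_text` and the stand-in class `FakeElement` are hypothetical names introduced only for illustration (neither is part of requests_html), so the snippet runs without a network call:

```python
from typing import Dict, Optional


class FakeElement:
    """Hypothetical stand-in for a requests_html Element, for illustration only."""

    def __init__(self, text: str = "", children: Optional[Dict[str, "FakeElement"]] = None):
        self.text = text
        self._children = children or {}

    def find(self, selector: str, first: bool = False):
        # Mimics find(..., first=True): the first match, or None if nothing matches.
        match = self._children.get(selector)
        if first:
            return match
        return [match] if match is not None else []


def safe_text(element, selector: str, default: str = "") -> str:
    """Return the text of the first match, or `default` when the selector matches nothing."""
    match = element.find(selector, first=True)
    if match is None:
        return default
    return match.text.replace("\n", " ").strip()


BODY = "span[data-hook=review-body] span"
with_body = FakeElement(children={BODY: FakeElement("Great\nproduct ")})
without_body = FakeElement()  # a review in which the body selector matches nothing

print(safe_text(with_body, BODY))           # Great product
print(repr(safe_text(without_body, BODY)))  # ''
```

A guard like this would not explain why the selector misses on some reviews, but it should let the loop finish so the affected reviews can be identified.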
This is the code I'm testing:
from requests_html import HTMLSession
import time


class Reviews:
    def __init__(self, asin, title) -> None:
        self.asin = asin
        self.title = title
        self.pagedata = HTMLSession()
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
        self.url = f'https://www.amazon.com/{self.title}/reviews/{self.asin}/ref=cm_cr_othr_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='

    def pagination(self, page):
        r = self.pagedata.get(self.url + str(page))  # construct review url with current page
        return r.html.find('div[data-hook=review]')  # get all review data

    def get_reviews(self, reviews):  # collects data from reviews, and appends them to total
        total = []
        for review in reviews:
            title = review.find('a[data-hook=review-title]', first=True).text
            rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n', '').strip()  # remove newlines for more compact formatting
            data = {  # collecting data from the for loop
                'title': title,
                'rating': rating,
                'body': body[:100]
            }
            total.append(data)
        return total


if __name__ == '__main__':
    with open('user_url.txt', "r") as file:  # opens a text file to pull the item's store page URL
        user_url = file.read()  # get url from txt file
    # print(user_url)
    _, _, _, title, _, asin, *_ = user_url.split("/")  # pulling <asin> and item <title> from given URL
    amz = Reviews(asin, title)  # call with asin and title, needed to construct reviews page
    results = []  # to gather collected data
    for x in range(1, 29):  # pagination
        print('getting page ', x)
        time.sleep(1.0)  # a pause to test if slowing things down helps
        reviews = amz.pagination(x)
        results.append(amz.get_reviews(reviews))  # collecting reviews
    # reviews = amz.pagination(30)  # to test pulling each page individually
    # print(amz.get_reviews(reviews))
    print(results)

These are the relevant elements from the first page I can't seem to parse:
![[Image: 28LK4nM.png]](https://i.imgur.com/28LK4nM.png)
Any help you can give would be appreciated.
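
To narrow down which review on a failing page is responsible, one option is to check every selector against every review and report the misses instead of stopping at the first exception. This is a rough sketch, not a definitive diagnosis: `missing_fields` and `StubReview` are hypothetical names, the selectors are copied from the code in this post, and the stub only imitates the `find(..., first=True)` behaviour so the example runs offline:

```python
from typing import Dict, Iterable, List, Tuple

# Selectors copied from get_reviews() in the code being tested.
SELECTORS: Dict[str, str] = {
    "title": "a[data-hook=review-title]",
    "rating": "i[data-hook=review-star-rating] span",
    "body": "span[data-hook=review-body] span",
}


def missing_fields(reviews: Iterable, selectors: Dict[str, str] = SELECTORS) -> List[Tuple[int, str]]:
    """Return (review index, field name) for every selector that matched nothing."""
    missing = []
    for index, review in enumerate(reviews):
        for field, selector in selectors.items():
            if review.find(selector, first=True) is None:
                missing.append((index, field))
    return missing


class StubReview:
    """Hypothetical stand-in for a review element; `present` lists the selectors that match."""

    def __init__(self, present):
        self._present = set(present)

    def find(self, selector, first=False):
        # Any non-None object counts as a match for this check.
        return object() if selector in self._present else None


complete = StubReview(SELECTORS.values())
no_body = StubReview([SELECTORS["title"], SELECTORS["rating"]])
print(missing_fields([complete, no_body]))  # [(1, 'body')]
```

Against the real pages this could be called as `missing_fields(amz.pagination(30))`. Printing the `.html` of a review at a reported index should then show the markup that the selector failed to match.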
