May-13-2023, 03:30 AM
Hi guys,
I'm practising some simple web scraping on a website that has different categories, each with a number of pages. What I'm trying to do is keep a list of categories and, once the code gets to the end of one category's set of pages, move on to the next category and start scraping from that category's page 1.
The code I have written gets to the end of a category, but if that category has, say, 3 pages, then when it moves to the next category it starts at page 3, whereas I need it to start back at page 1.
My code is:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from time import sleep
from random2 import randint

cats = ['bessey', 'kreg']
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
product_url = []
product_name = []
start_page_number = 1
data = {'Product': product_name, 'Product URL': product_url}

for cat in cats:
    while True:
        url = f'https://www.example.com/brands/{cat}?PageProduct={start_page_number}&PageSizeProduct=48'
        print(url)
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        section = soup.find('section', class_='layout-maincontent product-list-grid-template')
        product_grid = section.find('div', id='product-grid')
        more_pages = product_grid.find('div', class_='product')
        if more_pages is None:
            break
        products = product_grid.find_all('div', class_='product')
        # print(url)
        for product in products:
            # Find Product Direct URL Link
            productlisttitle = product.find('div', class_='cv-zone-product-4')
            urlref = productlisttitle.find('a')['href']
            urllink = 'https://www.example.com' + urlref
            product_url.append(urllink)
        start_page_number += 1

Could someone please show me how to iterate through the list of categories so that each category starts at page 1? Thanking you.
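
For illustration, here is a minimal sketch of the behaviour being asked for, assuming the page counter only needs to be set back to 1 at the top of each pass through the category loop. The names collect_product_urls, fetch_page, FAKE_SITE and fake_fetch_page are hypothetical and not part of the code above. fetch_page stands in for the requests/BeautifulSoup step and is assumed to return an empty list when a page has no products. The fake data lets the sketch run without touching a real site.

```python
from typing import Callable, Dict, List


def collect_product_urls(
    cats: List[str],
    fetch_page: Callable[[str, int], List[str]],
) -> Dict[str, List[str]]:
    """Walk every page of every category, restarting at page 1 each time.

    fetch_page is a hypothetical helper (an assumption, not in the original
    code) that returns the product URLs found on one page, or an empty list
    when the page has no products.
    """
    urls_by_cat: Dict[str, List[str]] = {}
    for cat in cats:
        page_number = 1  # reset here, once per category
        urls_by_cat[cat] = []
        while True:
            found = fetch_page(cat, page_number)
            if not found:
                break  # no products on this page: category is finished
            urls_by_cat[cat].extend(found)
            page_number += 1  # advance only within the current category
    return urls_by_cat


# Stand-in data so the sketch runs without any network access.
FAKE_SITE = {
    ('bessey', 1): ['/p/b1', '/p/b2'],
    ('bessey', 2): ['/p/b3'],
    ('kreg', 1): ['/p/k1'],
}
visited = []  # records every (category, page) pair that was requested


def fake_fetch_page(cat: str, page_number: int) -> List[str]:
    visited.append((cat, page_number))
    return FAKE_SITE.get((cat, page_number), [])


result = collect_product_urls(['bessey', 'kreg'], fake_fetch_page)
print(visited)
print(result)
```

The point of the sketch is only where the counter is initialised: because page_number is assigned inside the for loop and before the while loop, the second category begins at page 1 regardless of how many pages the first one had.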
