Scraping with BeautifulSoup

Prince_Bhatia · (This post was last modified: Sep-06-2017, 09:08 PM by Prince_Bhatia.)

Hi,

I am trying to scrape the website "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

I want to scrape the product name, its price, and the image link.

I had some success, but with one problem: the name, price, and image end up spread across every cell, so the formatting is very poor.

Can someone help me amend the code so that I get the name in the name column, the price in the price column, and the image in the image column?

from urllib.request import urlopen
from bs4 import BeautifulSoup

#page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
#html = urlopen(page_url)
#bs0bj = BeautifulSoup(html, "html.parser")

#page_details = bs0bj.find_all("div", {"class":"item-container"})

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

#for i in page_details:
#    Item_Name = i.find("a", {"class":"item-title"})
#    Price = i.find("li", {"class":"price-current"})
#    Image = i.find("img")
#    Name_item = Item_Name.get_text()
#    Prin = Price.get_text()
#    imgf = Image["src"]# to get the key src 
#    f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf))
#f.close()

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"})
        Image = i.find("img")
        Name_item = Item_Name.get_text()
        Prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf)+ "\n")
f.close()

I am attaching the Excel file too. Also, what are the newer ways to save data to CSV? Can someone help me with that, with code too?

***metulburr*** · Sep-06-2017, 11:47 PM

It looks like there are newlines somewhere in the strings you are writing, and they are messing up the CSV file. Find the newlines and remove them before writing to the file. If they are before the first character or after the last character, you can use str.strip() to remove them.
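
A minimal sketch of what str.strip() does and does not remove. The title string here is a made-up example, not actual data from the page:

```python
# Hypothetical scraped title: a newline at each end and one in the middle.
raw_title = "\nMSI GeForce GT 730\nVideo Card\n"

stripped = raw_title.strip()

# strip() only removes leading and trailing whitespace,
# so the newline in the middle of the string is still there.
print(repr(stripped))
```

If the CSV rows are still broken after strip(), that suggests the newline sits inside the string rather than at either end.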

Prince_Bhatia · (This post was last modified: Sep-07-2017, 09:12 AM by Prince_Bhatia.)

Nope, no success with strip, sir. I am unable to even find the newline. I tried everything but had no success, and I am not sure how to solve it.

**Larz60+** · (This post was last modified: Sep-07-2017, 11:03 AM by Larz60+.)

to detect any symbols:

load the html page into notepad++
select View-->Show Symbol-->Show All Characters

the EOL and other characters will be highlighted
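
If you would rather check from Python than from an editor, repr() makes the same hidden characters visible. This is only a sketch; show_hidden is a hypothetical helper name and the sample string is made up:

```python
def show_hidden(text):
    """Return the text with control characters shown as escape sequences."""
    return repr(text)


# Made-up sample containing a carriage return, a newline, and a tab.
sample = "GT 730\r\n  Video Card\t"

# Prints the string with \r, \n and \t written out instead of rendered.
print(show_hidden(sample))
```

Calling it on Name_item or Prin just before f.write should show where any stray newline is hiding.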

***metulburr*** · Sep-07-2017, 11:15 AM

It looks like the newline is within the string, not at the beginning or the end.

Quote:

//images10.newegg.com/NeweggImage/ProductImageCompressAll300/A85V_1_20170906967475116.jpg

Refurbished: MSI GeForce GT 730 DirectX 12 N730K-2GD5LP/OC 2GB 64-Bit GDDR5 PCI Express 2.0 x16 HDCP Ready V$
deo Card

There is no comma in here so this appears to be one element.

Also, I just noticed after running your program over and over that it triggered a captcha for me, causing your script to fail.

Prince_Bhatia · Sep-07-2017, 11:21 AM


I even tried putting commas in the f.write at the bottom, but the result was the same. What should I do? How do I solve it?

***metulburr*** · Sep-07-2017, 11:49 AM

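Adding commas is not going to change it, because the newlines in the content are being passed to your CSV file and are creating new rows.

You could split the strings to "remove" the newlines and then join them back together before writing to the file. Note that split() with no arguments splits on any whitespace, so spaces are dropped too; join with ' ' instead of '' if you want to keep the words separated.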

>>> ''.join('text\ntest'.split())
'texttest'

or replace the newlines in the string

>>> "line 1\nline 2\n...".replace('\n', '')
'line 1line 2...'
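
Putting the two ideas together, here is a small sketch that collapses every run of whitespace (newlines included) into a single space, so the words stay separated. clean_field is a hypothetical helper name and the title is an invented example:

```python
def clean_field(text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines); joining with a single space keeps
    # the words apart while removing the line breaks.
    return " ".join(text.split())


# Invented example of a title with a newline and extra spaces inside it.
messy = "Refurbished: MSI GeForce GT 730\n   Video Card\n"
print(clean_field(messy))
```

In the scraper this would be applied to Name_item and Prin before they are written to the file.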

***snippsat*** · (This post was last modified: Sep-07-2017, 02:50 PM by snippsat.)

You have to be more exact to get clean data before you loop and write.
In this example you get a clean price out,
and https: is added so the image links will work.

from urllib.request import urlopen
from bs4 import BeautifulSoup

page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
html = urlopen(page_url)
bs0bj = BeautifulSoup(html, "html.parser")
page_details = bs0bj.find_all("div", {"class":"item-container"})
for i in page_details:
    Item_Name = i.find("a", {"class":"item-title"})
    Price = i.find("li", {"class":"price-current"})
    Image = i.find("img")
    Name_item = Item_Name.get_text()
    imgf = Image["src"]

    # Fix
    #print(Name_item)
    print(Price.find('strong').text)
    #print('https:{}'.format(imgf))

Output:
179
479
589
579
479
559
469
489
...

Printing the image link gives:

Output:
https://images10.newegg.com/ProductImageCompressAll300/14-487-292-06.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-321-S99.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-319-S99.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-318-S99.jpg
.........
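
As an alternative to prefixing https: by hand, urljoin from the standard library can resolve a src that starts with // against the page URL. This is only a sketch; absolute_url is a hypothetical helper name, and PAGE_URL is the listing URL from this thread with the query string left off:

```python
from urllib.parse import urljoin

# Listing page from this thread, query string omitted for brevity.
PAGE_URL = "https://www.newegg.com/Product/ProductList.aspx"


def absolute_url(src, base=PAGE_URL):
    # A src beginning with "//" is scheme-relative: urljoin fills in
    # the scheme of the base URL and keeps the host from src.
    return urljoin(base, src)


print(absolute_url("//images10.newegg.com/ProductImageCompressAll300/14-487-292-06.jpg"))
```

A src that is already a full URL passes through unchanged, so the same call is safe for both cases.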

Prince_Bhatia · (This post was last modified: Sep-07-2017, 06:34 PM by Prince_Bhatia.)

Alright, I got it solved, guys. The fix is to replace "," with "|" in the product name while writing.

Below is the code

from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"}).find('strong')
        Image = i.find("img")
        Name_item = Item_Name.get_text().strip()
        prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        

        print(Name_item)
        print(prin)
        print('https:{}'.format(imgf))
        f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
f.close()

Thank you so much, everybody who helped with this code. This is a great platform for all Python lovers who want to become great programmers.

I am also attaching the end result
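
On the earlier question about other ways to save data as CSV: the standard library's csv module quotes any field that contains a comma or a newline, so the product name would not need "," replaced with "|". A minimal sketch follows; the row is invented sample data, write_rows is a hypothetical helper name, and io.StringIO stands in for the real file so the result can be printed:

```python
import csv
import io

# Invented sample row; the product name deliberately contains a comma.
rows = [
    ("MSI GeForce GTX 1060, 6GB", "179", "https://images10.newegg.com/example.jpg"),
]


def write_rows(handle, rows):
    # csv.writer quotes fields containing commas, quotes or newlines,
    # so each value stays in its own column.
    writer = csv.writer(handle)
    writer.writerow(["Item_Name", "Price", "Image"])
    writer.writerows(rows)


# In-memory buffer used in place of a file for this demonstration.
buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())
```

For a real file, the handle would be opened with open("Scrapedetails.csv", "w", newline="", encoding="utf-8"); newline="" is what the csv documentation recommends so that no extra blank lines appear on Windows.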
