Cleaning HTML data using Jupyter Notebook

jacob1986 · (This post was last modified: Mar-04-2021, 10:22 PM by jacob1986.)

I need help cleaning up data extracted from HTML. The output shows every word as a separate list item, with commas in between (a small example is shown below). My full code is at the bottom of this post and can also be found at https://github.com/aaron1986/Coursera_Ca...tats.ipynb

['Defence',
'Clean',
'sheets',
'13',
'Goals',
'Conceded',
'11',

Instead, I would like the data to look like this:

[Defence,
Clean sheets 13,
Goals Conceded 11,
]

import requests
import pandas as pd
import numpy as np
import seaborn as sns

from urllib.request import urlopen
from bs4 import BeautifulSoup
# --- next notebook cell ---
main_url = 'xxxxxxxx'
result = requests.get(main_url)
result.text
# --- next notebook cell ---
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())
# --- next notebook cell ---
new = soup.find("ul", class_="normalStatList")
new.get_text()
# --- next notebook cell ---
new2 = new.get_text().replace('\n', ' ').split()
new2

***snippsat*** · (This post was last modified: Mar-04-2021, 10:07 PM by snippsat.)

I guess you are using BeautifulSoup.
Doing it like this messes up the original structure, because split() also breaks each sentence into single words.
As you don't show the HTML, it's not easy to help.
Here is a quick example; note that the sentences don't get split up.

from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')

>>> ptag = soup.find_all('p')
>>> ptag
[<p>This is a paragraph</p>, <p>blue car</p>]
>>> 
>>> for t in ptag:
...     print(t.text)     
...     
This is a paragraph
blue car
>>> lst = [t.text for t in ptag]
>>> lst
['This is a paragraph', 'blue car']
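
To put the two approaches side by side, here is a small sketch on the same sample HTML. It assumes the built-in 'html.parser' (so lxml is not required), and the variable names words and phrases are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

# Splitting the whole page text on whitespace breaks every phrase into single words
words = soup.get_text().split()

# Taking the text tag by tag keeps each phrase together
phrases = [p.get_text(strip=True) for p in soup.find_all('p')]

print(words)
print(phrases)
```

The first list has one item per word, which is the same effect as in the original question; the second has one item per tag.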

jacob1986 · Mar-04-2021, 10:23 PM

I have updated my post with full code.

***snippsat*** · Mar-04-2021, 11:14 PM

Here is an example for the first normalStat; you can try to figure out the loop yourself.

import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
stat = soup.find('div', class_="statsListBlock")

>>> norm = soup.find(class_="normalStat")
>>> text = norm.select_one('.stat').text.strip()
>>> text
'Clean sheets   \n      13'
>>> " ".join(text.split())
'Clean sheets 13'
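
The " ".join(text.split()) step works on any string, with or without BeautifulSoup. A minimal sketch follows; the helper name squeeze is made up for illustration and is not part of any library.

```python
def squeeze(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty strings,
    # so joining the pieces with a single space normalises the spacing
    return " ".join(text.split())

raw = 'Clean sheets   \n      13'
print(squeeze(raw))  # Clean sheets 13
```

Unlike replace('\n', ' '), this also removes the long runs of spaces that the HTML indentation leaves behind.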

jacob1986 · Mar-05-2021, 08:52 PM

Hi, thank you for the reply. I have tried to write the loop, but I cannot seem to loop over all the '.stat' fields together.

***snippsat*** · Mar-05-2021, 09:39 PM

Try this.

import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
norm_stat = soup.find_all(class_='normalStat')
for tag in norm_stat:
    temp = tag.select_one('.stat').text.strip()
    result = " ".join(temp.split())
    print(result)

Output:
Clean sheets 13
Goals Conceded 11
Tackles 19
Tackle success % 63%
Last man tackles 0
Blocked shots 1
Interceptions 24
Clearances 68
Headed Clearance 36
.....
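
If a list is wanted instead of printed lines, the same loop can be written as a list comprehension. The sketch below runs offline on a cut-down guess at the page markup; the nesting and the class name allStatContainer are assumptions for illustration, and only normalStatList, normalStat and stat come from the code in this thread.

```python
from bs4 import BeautifulSoup

# Assumed markup: a simplified guess at the page structure, for illustration only
html = '''
<ul class="normalStatList">
  <li class="normalStat"><span class="stat">Clean sheets
      <span class="allStatContainer">13</span></span></li>
  <li class="normalStat"><span class="stat">Goals Conceded
      <span class="allStatContainer">11</span></span></li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# One cleaned string per normalStat element, collected into a list
stats = [
    " ".join(tag.select_one('.stat').text.split())
    for tag in soup.find_all(class_='normalStat')
]
print(stats)
```

Each list item then holds a whole label and its value, rather than one word per item.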

jacob1986 · Mar-05-2021, 10:13 PM

Thank-you. It was the 'select_one' part that was confusing me.

***snippsat*** · Mar-05-2021, 10:44 PM

(Mar-05-2021, 10:13 PM)jacob1986 Wrote: Thank-you. It was the 'select_one' part that was confusing me.

For info, select() and select_one() give you the full power of CSS selectors.
Many forget about this powerful feature of BS and just stick to find() and find_all().
As you can see, the two approaches can be mixed.
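
A short sketch of mixing the two styles, again on made-up HTML so it runs offline. The nesting shown here is an assumption, not the real page; only the class names statsListBlock, normalStatList, normalStat and stat appear in this thread.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration only
html = '''
<div class="statsListBlock">
  <ul class="normalStatList">
    <li class="normalStat"><span class="stat">Tackles <span>19</span></span></li>
    <li class="normalStat"><span class="stat">Interceptions <span>24</span></span></li>
  </ul>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# find() narrows the search to one block, then select() runs a CSS selector inside it
block = soup.find('div', class_='statsListBlock')
rows = block.select('li.normalStat > span.stat')

# The same elements picked out with a single CSS selector, without find()
same_rows = soup.select('div.statsListBlock li.normalStat > span.stat')

result = [" ".join(r.get_text().split()) for r in rows]
print(result)
```

The child combinator (>) keeps the inner value span from being matched on its own, which is awkward to express with find_all() alone.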
