Getting data from a <td> tag using urllib and BeautifulSoup

KuroBuster · Aug-18-2021, 02:58 AM

I would like to create a program to check for updates on a regular basis.
In the "VMware ESXi" release notes, the version is in a table (i.e., in a <td> tag).
To do this, I want to fetch the page with urllib and then use BeautifulSoup to extract the information in the <td> tag,
so I wrote the following code, but it printed "None".

import urllib.request
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the request is not rejected as a bot
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
root = 'https://kb.vmware.com/s/article/2143832'
request = urllib.request.Request(root, headers=header)
response = urllib.request.urlopen(request).read().decode('utf-8')
# Name the parser explicitly instead of letting bs4 guess one
soup = BeautifulSoup(response, 'html.parser')
# find() returns the first <td> in the downloaded HTML, or None if there is none
corn_soup = soup.find('td')

print(corn_soup)

I think I'm accessing the site correctly, but I don't think I'm getting the information I need in the soup.

***snippsat*** · (This post was last modified: Aug-18-2021, 05:31 PM by snippsat.)

(Aug-18-2021, 02:58 AM)KuroBuster Wrote: I think I'm accessing the site correctly, but I don't think I'm getting the information I need in the soup.

The information is generated by JavaScript, so it is not in the HTML that urllib downloads. Selenium is one option.
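A quick way to confirm this kind of diagnosis is to count the tags in the raw HTML before any JavaScript runs. The sketch below uses only the standard library; `shell_html` is a made-up stand-in for a JavaScript-rendered page, not the actual VMware response, and `TagCounter` and `count_tag` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of <tag> start tags found in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical page shell: the table would only be built later by app.js
shell_html = """<html><head><script src="app.js"></script></head>
<body><div id="content"></div></body></html>"""

print(count_tag(shell_html, "td"))      # no <td> in the static HTML
print(count_tag(shell_html, "script"))  # but a script that would create it
```

If the count for `td` is zero on the text returned by `urlopen(...).read()`, then `soup.find('td')` has nothing to find, whichever parser is used.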

Another, more advanced way is to look at the page source and at what is sent over the network.
Here I catch the JSON response. As advice, use Requests rather than urllib.

import requests
from pprint import pprint

url = 'https://kb.vmware.com/services/apexrest/v1/article?docid=2143832&lang=en_us'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
response = requests.get(url, headers=headers)
data = response.json()
products = data['meta']['articleProducts']['relatedProducts']
versions = data['meta']['articleProducts']['relatedVersions']
pprint(products)
print('-' * 30)
pprint(versions)

Output:
['VMware vSphere ESXi',
 'VMware vSphere ESX',
 'VMware vSphere',
 'VMware ESXi',
 'VMware ESX Server',
 'VMware ESX']
------------------------------
['VMware vSphere ESXi 7.0.0',
 'VMware vSphere ESXi 6.7',
 'VMware vSphere ESXi 6.5',
 'VMware vSphere ESXi 6.0',
 'VMware vSphere ESXi 5.5',
 'VMware vSphere ESXi 5.1',
 'VMware vSphere ESXi 5.0',
 'VMware vSphere ESX 4.x',
 'VMware ESXi 4.1.x Installable',
 'VMware ESXi 4.1.x Embedded',
 'VMware ESXi 4.0.x Installable',
 'VMware ESXi 4.0.x Embedded',
 'VMware ESX Server 3.5.x',
 'VMware ESX Server 3.0.x',
 'VMware ESX Server 2.5.x',
 'VMware ESX Server 2.1.x',
 'VMware ESX Server 2.0.x',
 'VMware ESX Server 1.x',
 'VMware ESX Server 1.5.x']
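For the original goal of checking for updates on a regular basis, one possible approach is to compare the `relatedVersions` list against what was seen on the previous run. This is only a sketch: `extract_versions` and `new_versions` are hypothetical helpers, `sample` is a trimmed copy of the structure shown above rather than a live response, and `previous` stands in for whatever was saved last time.

```python
def extract_versions(data):
    """Return the relatedVersions list from the JSON payload, or [] if absent."""
    return (
        data.get('meta', {})
            .get('articleProducts', {})
            .get('relatedVersions', [])
    )


def new_versions(current, previous):
    """Return the entries of current that were not in previous, keeping order."""
    seen = set(previous)
    return [version for version in current if version not in seen]


# Trimmed sample with the same nesting as the response above (not live data)
sample = {
    'meta': {
        'articleProducts': {
            'relatedVersions': [
                'VMware vSphere ESXi 7.0.0',
                'VMware vSphere ESXi 6.7',
            ]
        }
    }
}

# Hypothetical list saved by an earlier run
previous = ['VMware vSphere ESXi 6.7']

print(new_versions(extract_versions(sample), previous))
```

In a real check, `sample` would be replaced by `response.json()` from the code above, and `previous` would be loaded from a file and rewritten after each run.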

KuroBuster · Aug-20-2021, 07:53 AM

Ohh exactly what I was looking for!
Thanks!
