Mar-01-2020, 06:56 PM
Hello to all,
I'm trying to clean an HTML file that has repeated paragraphs within the body. The input file and the expected output are shown below.
Input.html https://jsfiddle.net/97ptc0Lh/4/
Output.html https://jsfiddle.net/97ptc0Lh/1/
I've been trying the following code using BeautifulSoup, but I don't know why it is not working: the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove. I already asked here, but there has not been much progress.
from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")
Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)
print(CleanHtml)

Thanks in advance for any help.
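One likely cause, offered as a guess rather than a confirmed diagnosis: iterating over soup.html visits only its direct children (typically head and body), so the individual paragraphs are never compared with each other. Below is a minimal sketch of one way to drop the repeats. It assumes that two paragraphs count as "repeated" when their text is identical after collapsing whitespace. The names duplicate_indices and remove_duplicate_paragraphs are hypothetical helpers, not part of BeautifulSoup. The built-in "html.parser" is used only to avoid an extra dependency; "html5lib" as in the post should behave the same for this purpose.

```python
def duplicate_indices(keys):
    """Return the indices of items that repeat an earlier item in `keys`."""
    seen = set()
    repeats = []
    for index, key in enumerate(keys):
        if key in seen:
            repeats.append(index)
        else:
            seen.add(key)
    return repeats


def remove_duplicate_paragraphs(html):
    """Return `html` with repeated <p> elements removed (first copy kept)."""
    # Imported here so duplicate_indices stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # find_all searches the whole tree, unlike iterating over soup.html.
    paragraphs = soup.find_all("p")
    # Assumption: paragraphs are duplicates when their normalized text matches.
    keys = [" ".join(p.get_text().split()) for p in paragraphs]
    for index in duplicate_indices(keys):
        paragraphs[index].decompose()  # remove the repeated tag from the tree
    return str(soup)
```

To try it on the file from the post, something like print(remove_duplicate_paragraphs(open("Input.html", encoding="utf-8").read())) should work, assuming the file is UTF-8 encoded. If paragraphs should only count as repeated when their markup also matches, str(p) could be used as the key instead of the text.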
