Python Forum
[SOLVED] [BeautifulSoup] A few questions
#1
Question 
Hello,

I have the following questions about using BS:

What's the difference between...
1. .decompose() and .extract()? And how to remove empty lines after running either?
2. soup.select_one() and soup.kml.Document.name?
3. .string and .text?

Thank you.

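import pathlib

from bs4 import BeautifulSoup

INPUTFILE = "blah.kml"
PATH = pathlib.Path(INPUTFILE).parent
BASENAME = pathlib.Path(INPUTFILE).stem
EXTENSION = pathlib.Path(INPUTFILE).suffix
OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"

# features='xml' requires lxml to be installed
with open(INPUTFILE, 'r', encoding='utf-8') as file:
	soup = BeautifulSoup(file, features='xml')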

for snippet in soup.find_all(['description', 'ExtendedData']):
	#snippet.decompose()
	snippet.extract()
soup.smooth()

name = soup.select_one("kml > Document > name")
#WHY BAD? name = soup.kml.Document.name
if name:
	print("Name found")

	name.string = BASENAME
	
	#AttributeError: property 'text' of 'Tag' object has no setter
	#name.text = BASENAME
else:
	print("No name")

	name = soup.new_tag("name")

	#name.text = BASENAME
	name.string = BASENAME

	doc = soup.kml.Document
	doc.insert(0,name)

with open(OUTPUTFILE, "w",encoding='utf-8') as file:
	file.write(str(soup))	
--
Edit: Here's how to remove empty lines after using decompose()/extract():
with open(OUTPUTFILE, "w",encoding='utf-8') as file:
	#file.write(str(soup))
	file.write(str(BeautifulSoup(str(soup), features='xml')))
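A possible alternative to re-parsing, offered only as a sketch: after removing the tags and calling smooth(), collapse each whitespace-only string back to a single newline. The markup and tag names below are made up for illustration, and html.parser is used so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not the poster's KML file
markup = "<doc>\n<name>A</name>\n<description>x</description>\n<point>1</point>\n</doc>"
soup = BeautifulSoup(markup, "html.parser")

for tag in soup.find_all("description"):
    tag.decompose()

# Merge the two newline strings left side by side by the removal
soup.smooth()

# Collapse whitespace-only strings that contain a line break
for text in soup.find_all(string=True):
    if not text.strip() and "\n" in text:
        text.replace_with("\n")

result = str(soup)
print(result)
```

This edits the tree in place, so the document is serialized once instead of being parsed a second time.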
#2
extract() removes the tag from the tree and returns it, so you end up with what looks like two trees: the original tree, which no longer contains the extracted tag, and the extracted tag, which you can assign to a variable and keep using.
The beautifulsoup4 documentation has an example of this:
>>> from bs4 import BeautifulSoup
>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> a_tag = soup.a
>>> i_tag = soup.i.extract()
>>> a_tag
<a href="http://example.com/">I linked to </a>
>>> i_tag
<i>example.com</i>
>>> print(i_tag.parent)
None
>>> my_string = i_tag.string.extract()
>>> my_string
'example.com'
>>> print(my_string.parent)
None
>>> i_tag
<i></i>
>>>
decompose(), on the other hand, completely destroys the tag.
The beautifulsoup4 documentation also has an example of this:
>>> from bs4 import BeautifulSoup
>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> a_tag = soup.a
>>> i_tag = soup.i
>>> 
>>> i_tag.decompose()
>>> i_tag
<></>
>>> a_tag
<a href="http://example.com/">I linked to </a>
>>>
As you can see, the i_tag is gone from the a_tag, and the i_tag itself has been destroyed. With extract(), it would also have been removed from the a_tag, but the i_tag itself would have remained usable.
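A minimal sketch of the practical difference, reusing the markup from the examples above: an extracted tag can be put back into the tree, while a decomposed one cannot be reused. It assumes a Beautiful Soup release recent enough to have the .decomposed property.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): the tag leaves the tree but stays usable
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
after_extract = str(soup.a)
soup.a.append(i_tag)  # the extracted tag can be re-inserted
restored = str(soup.a)

# decompose(): the tag leaves the tree and is destroyed
soup2 = BeautifulSoup(markup, "html.parser")
doomed = soup2.i
doomed.decompose()
after_decompose = str(soup2.a)
is_destroyed = doomed.decomposed

print(after_extract)
print(restored)
print(after_decompose, is_destroyed)
```

If the removed element is never needed again, either method leaves the remaining tree in the same state.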

I thought I'd try Grok4 for the remainder of your question,
here's the response:
`
Thank you for the clear questions! Here are the answers (credit for the detailed explanation goes to Grok 4 by xAI):

2. Difference between soup.select_one() and direct attribute access like soup.kml.Document.name

• select_one() uses CSS selectors → reliable, readable, works like selectors in browsers
• Direct access (soup.kml.Document.name) is a BeautifulSoup shortcut: tag.foo is equivalent to tag.find("foo")
– very convenient for known XML formats (KML, Atom, RSS…)
– but fragile: it cannot be used when the tag name is a Python keyword (soup.class is a syntax error) or collides with an existing Tag attribute such as name, string, text, parent or contents. That is why soup.kml.Document.name returns the string "Document" (the tag's own name) instead of the <name> child
– when several tags share the same name, it silently returns only the first match (or None if there is none)

Best practice: use select_one() or find() in almost all real code. Use the dot notation only for quick one-off scripts with predictable XML/KML.

Example (both return the text of the <name> element that is a direct child of <Document>):
soup.select_one('kml > Document > name').text
soup.kml.Document.find('name', recursive=False).string # dot notation down to Document, then find()
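A small sketch of the collision with the .name attribute. The markup is a made-up, KML-like snippet, and html.parser is used so that lxml is not required; html.parser lowercases tag names, hence document instead of Document.

```python
from bs4 import BeautifulSoup

# Hypothetical KML-like sample, lowercase because of html.parser
markup = (
    "<kml><document><name>Old title</name>"
    "<placemark><name>Spot</name></placemark></document></kml>"
)
soup = BeautifulSoup(markup, "html.parser")
doc = soup.kml.document

# .name is the tag's own name, not its <name> child
tag_name = doc.name

# Two ways to reach the <name> element directly under <document>
via_select = soup.select_one("kml > document > name")
via_find = doc.find("name", recursive=False)

print(tag_name)
print(via_select.string, via_find.string)
```

Both lookups point at the same element here; the dot shortcut cannot reach it because name is already taken.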

3. Difference between .string and .text

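tag.string
• Returns a NavigableString if the tag has exactly ONE child and that child is plain text
• If the only child is another tag, returns that child's .string
• Returns None if the tag has more than one child (for example text mixed with <br>, <span>, etc.)
• Can be assigned to (tag.string = "new text"), which replaces the tag's contents

tag.text (or tag.get_text())
• Always returns a string
• Recursively collects ALL text inside the tag and its descendants
• Read-only: assigning to it raises AttributeError
• This is what you want 99% of the time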

Examples:
<p>Hello <b>bold <i>italic</i></b> world</p>

tag.string → None
tag.text → "Hello bold italic world"

<name>My Place</name>

Both .string and .text return "My Place" (because no nesting)

So: use .text unless you specifically need to check that a tag contains only plain text.
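A minimal sketch of these behaviours, including the AttributeError from the first post. The second element is an invented variation that wraps the text in a single <b> child; html.parser is used to avoid needing lxml.

```python
from bs4 import BeautifulSoup

markup = (
    "<p>Hello <b>bold <i>italic</i></b> world</p>"
    "<name><b>My Place</b></name>"
)
soup = BeautifulSoup(markup, "html.parser")

p = soup.p
mixed_string = p.string  # several children, so no single string
mixed_text = p.text      # all descendant text joined together

# soup.name would be the soup's own name, so use find()
name = soup.find("name")
nested_string = name.string  # follows the single <b> child

# .string can be assigned, .text cannot
name.string = "New Place"
try:
    name.text = "Other"
    text_is_settable = True
except AttributeError:
    text_is_settable = False

print(mixed_string, mixed_text, nested_string, name, text_is_settable)
```

Assigning to .string replaces everything inside the tag, so the <b> wrapper is gone afterwards.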

Hope this helps!
(Explanation courtesy of Grok 4 – Larz60+ just relaying the wisdom )
`
#3
Thanks much.

I forgot that "type()" is the way to find what each output means.

I also found that "name" is already an attribute of every Tag object (it holds the tag's own name), so "soup.kml.Document.name" returns… Document's name instead of a pointer to its "name" sub-element.

This works