bpo-14465: xml.etree.ElementTree pretty printing by dzeban · Pull Request #4016 · python/cpython

dzeban · 2017-10-17T07:47:11Z

This PR adds pretty printing to the xml.etree.ElementTree module, namely to the tostring, tostringlist functions and write method of the ElementTree object.

Pretty printing feature saves backward compatibility in a sense that old output won't be changed.

https://bugs.python.org/issue14465

vadmium · 2017-10-17T20:06:36Z

   is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
   *short_empty_elements* has the same meaning as in :meth:`ElementTree.write`.
+   *pretty* enables pretty printing with default indent. Keyword-only *indent*
+   allows to set custom indent.


allows to set → allows setting (or allows one to set, etc)

vadmium · 2017-10-17T20:13:12Z

      emitted as a single self-closed tag, otherwise they are emitted as a pair
-      of start/end tags.
+      of start/end tags. *indent* enables and set indentation level for the
+      human readable output.


enables and sets
human-readable

dzeban · 2017-10-18T12:17:20Z

@vadmium Thanks for the doc review - I've fixed the mistakes.

scoder · 2017-10-20T10:58:10Z

   generate a Unicode string (otherwise, a bytestring is generated).  *method*
   is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
   *short_empty_elements* has the same meaning as in :meth:`ElementTree.write`.
+   *pretty* enables pretty printing with default indent. Keyword-only *indent*


Please name it pretty_print instead for lxml compatibility.

scoder · 2017-10-20T11:03:59Z

Does this slow down the normal serialisation in any noticeable way?

dzeban · 2017-10-21T21:13:19Z

As for serialization performance - here are some simple tests. I've tested with current master and my branch with and without pretty printing. I did tostring serialization on 207KB xml file.

$ git checkout master
$ ./configure --enable-optimizations
$ make
$ ./python 
Python 3.7.0a2+ (heads/master:73c4708, Oct 22 2017, 00:03:16) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"
20 loops, best of 5: 12.4 msec per loop

$ git checkout etree-pretty-print 
$ ./configure --enable-optimizations
$ make

$ ./python 
Python 3.7.0a1+ (heads/etree-pretty-print:f114e68, Oct 21 2017, 23:35:43) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"                                                                                                                           
20 loops, best of 5: 14.2 msec per loop

$ ./python -m timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml', pretty_print=True)"
20 loops, best of 5: 16.7 msec per loop

So the new version is 14% slower. Pretty printing is 17% slower than non pretty printing. Overall, pretty printing enabled code is 34% slower than original version from master.

I'm not sure if it's really scary. Do we have such comparison for lxml?

vstinner · 2017-10-23T17:27:57Z

./python -m timeit

I suggest you to use perf timeit, timeit is not reliable.
http://perf.readthedocs.io/en/latest/cli.html#timeit-cmd

dzeban · 2017-10-24T07:40:41Z

Here are the benchmarks using perf module:

$ # Master branch
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 
'unicode', 'xml')"
.....................
Mean +- std dev: 12.8 ms +- 0.3 ms

$ # etree-pretty-print branch without pretty printing
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml')"
.....................
Mean +- std dev: 14.3 ms +- 0.4 ms

$ # etree-pretty-print branch with pretty printing
$ ./python -m perf timeit -s "import xml.etree.ElementTree as ET; tree = ET.parse('xml.xml'); root = tree.getroot()" "ET.tostring(root, 'unicode', 'xml', pretty_print=True)"
.....................
Mean +- std dev: 15.7 ms +- 0.5 ms

This one shows that new version is 12% slower. Pretty printing adds 10%.

I don't think that slowing down because of pretty printing is a big deal because if you need fast serialization you don't need a pretty output. I'm gonna see how lxml behaves with and without pretty printing, cause I think I'll see the same percentage.

What I'm worried about is 12% slowing against master version. Should we really stress it? Because I think if you really need to serialize a lot of XML and need it fast you'll gonna use lxml anyway.

Stevoisiak · 2018-04-24T14:58:17Z

@dzeban I think it's unfair to assume users will always use lxml over xml.etree.

xml.etree is built into Python
Many Python guides teach xml.etree over lxml
The API for xml.etree is arguably better documented
xml.etree allows sub-classes of Element, which lxml strongly discourages.

scoder · 2018-04-24T20:35:21Z

new version is 12% slower

I don't see how a rare and "mostly for debugging" feature like pretty printing could possibly justify such a big slow-down for a time critical feature like serialisation.

Many Python guides teach xml.etree over lxml

"Many" seems highly exaggerated.

The API for xml.etree is arguably better documented

A surprising argument, given that almost all of the documentation (and tutorials and books and ...) for xml.etree applies to lxml.etree just as well.

xml.etree allows sub-classes of Element, which lxml strongly discourages.

You shouldn't believe everything that random people write on the Internet. Why do you think there is a dedicated subclassing API in lxml?

Stevoisiak · 2018-04-25T16:12:26Z

@scoder Valid counterpoints. Though, on your last point:

You shouldn't believe everything that random people write on the Internet.

I was referencing an answer I wrote myself based on the documentation for lxml.etree.ElementBase.

Why do you think there is a dedicated subclassing API in lxml?

~~Is there a dedicated subclassing API in lxml? I would be glad to amend my point if this were the case.~~

Upon further review, it appears lxml.etree.ElementBase does support subclassing.

Stevoisiak · 2018-06-04T20:26:59Z

While I'm in favor of adding an option for pretty printing, I'm not sure it's worth a 12% decrease in performance.

Could the performance impact be lessened if the indent attribute was replaced with a constant?

dzeban · 2018-06-05T08:05:26Z

@Stevoisiak do you mean to remove indent as an argument and make it a module level constant like indent = ' ' so the user couldn't change it? I could give it a shot but don't think that it's what users expect from Python APIs.

mcepl · 2018-08-17T15:52:50Z

While I'm in favor of adding an option for pretty printing, I'm not sure it's worth a 12% decrease in performance.

@Stevoisiak This feels like a serious ungrateful bikeshedding to me. Could we first add this feature, which will finally remove for me the last and only reason why to use lxml at all (and close six years old bug)? Then you can fiddle with the performance. After all, if you really care about performance so much, than you probably won't beat lxml with its dependence on libxml2 which is the fastest XML library around.

scoder · 2018-08-18T04:51:23Z

lxml ... is the fastest XML library around.

That's true for many use cases but not for all. ElementTree is actually very memory efficient and very fast on certain API operations. Building trees is something that can be done very fast in ElementTree but is not quite as fast in lxml. If the serialisation then slows it down so much afterwards, that is a problem. I would rather like to see the serialisation performance of ElementTree improved, definitely not further slowed down.

mcepl · 2018-08-18T08:10:28Z

That's true for many use cases but not for all.

The point I wanted to make is, that the question of performance of the ElementTree as found in the standard library is quite orthogonal to whether I would still have to pepper my code with silliness like:

# from http://infix.se/2007/02/06/gentlemen-indent-your-xml
    def __xmlIndent(self,elem,level=0):
        i = "\n" + level*"  "
        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + "  "
            for e in elem:
                self.__xmlIndent(e, level+1)
                if not e.tail or not e.tail.strip():
                    e.tail = i + "  "
            if not e.tail or not e.tail.strip():
                e.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i

Performance is the never ending story (https://bugs.python.org/issue14465#msg323690 promises more holy wars over C versus Cython versus some other hobby horse) and this particular issue will never be fixed.

scoder · 2018-08-19T15:40:45Z

Performance is the never ending story

Well, you can't just drop a performance regression on everyone just to get a feature implemented that many people won't use, and that is particularly uninteresting for performance relevant cases (where the amount of data is large and you wouldn't want to inflate it further). This is simply the wrong tradeoff.

pepper my code with silliness

What about providing a pull request that takes that code and either a) adds it as an indent(elem) function to xml.etree.ElementTree, or b) adds it to the documentation as a recipe? ISTM that the first would solve all problems: it's available in the stdlib for everyone to use, and it's out of the way for everyone who cares more about speed than indentation. You could even optimise it for memory efficiency along the way, by interning and reusing the indentation strings.

mcepl · 2018-08-25T11:10:35Z

Well, you can't just drop a performance regression on everyone just to get a feature implemented that many people won't use

Well, of course, the pretty_print defaults to False, so it would happen only on request.

ISTM that the first would solve all problems: it's available in the stdlib for everyone to use, and it's out of the way for everyone who cares more about speed than indentation.

It wouldn't solve all problems. I would really prefer having compatible API (of course, again, defaulting to False). And yes, if there is a chance that PR is accepted, I will gladly use my function as the base for the pretty_print option.

scoder · 2018-08-25T11:16:32Z

the pretty_print defaults to False, so it would happen only on request.

Ah, but the performance regression of this PR was measured to be 12% when it the option is disabled and another 10% when enabling the option. I wouldn't mind the pretty printing case being slower, but I am arguing against making the normal case so much slower.

scoder · 2019-08-15T13:09:12Z

Rejecting this PR since it slows down the normal (non pretty-printed) serialisation.

dzeban added 9 commits October 17, 2017 10:34

ElementTree pretty printing for xml serializer

0dec4a3

Add docstrings to tostring and write method

f98e3de

Add pretty printing to the tostringlist function

5e3e950

Add pretty printing to HTML serializer

8810d90

Add pretty print test for xml serialization

137d03f

Add pretty print test for html serialization

1a3867d

Add reference output for html serialization test

2e83705

Update documentation

b693517

Add to ACKS as per contribution guidelines

5592088

the-knights-who-say-ni added the CLA signed label Oct 17, 2017

bedevere-bot added the awaiting review label Oct 17, 2017

dzeban mentioned this pull request Oct 17, 2017

xml.etree pretty printing dzeban/cpython#2

Open

Add NEWS entry

18c1ee0

vstinner requested review from eliben, serhiy-storchaka and vstinner October 17, 2017 13:57

vadmium reviewed Oct 17, 2017

View reviewed changes

Fix doc mistakes

e5373e2

serhiy-storchaka self-assigned this Oct 18, 2017

scoder reviewed Oct 20, 2017

View reviewed changes

Rename pretty argument to pretty_print

f114e68

dzeban added 2 commits October 28, 2017 13:15

Respect HTML whitespaces inside script and style

760bbed

Rewrite HTML serializer test to work on Windows

1e9149c

dzeban force-pushed the etree-pretty-print branch from 97783fb to 1e9149c Compare October 28, 2017 10:20

mcepl mentioned this pull request Aug 25, 2018

bpo-14465: Provide simple prett printing for XML and ETree API #8933

Closed

scoder closed this Aug 15, 2019

serhiy-storchaka removed their assignment Sep 8, 2023

Uh oh!

Conversation

dzeban commented Oct 17, 2017 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vadmium Oct 17, 2017

Choose a reason for hiding this comment

Uh oh!

vadmium Oct 17, 2017

Choose a reason for hiding this comment

Uh oh!

dzeban commented Oct 18, 2017

Uh oh!

scoder Oct 20, 2017

Choose a reason for hiding this comment

Uh oh!

scoder commented Oct 20, 2017

Uh oh!

dzeban commented Oct 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vstinner commented Oct 23, 2017

Uh oh!

dzeban commented Oct 24, 2017

Uh oh!

Stevoisiak commented Apr 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scoder commented Apr 24, 2018

Uh oh!

Stevoisiak commented Apr 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Stevoisiak commented Jun 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dzeban commented Jun 5, 2018

Uh oh!

mcepl commented Aug 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scoder commented Aug 18, 2018

Uh oh!

mcepl commented Aug 18, 2018

Uh oh!

scoder commented Aug 19, 2018

Uh oh!

mcepl commented Aug 25, 2018

Uh oh!

scoder commented Aug 25, 2018

Uh oh!

scoder commented Aug 15, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

dzeban commented Oct 17, 2017 •

edited by bedevere-bot

Loading

dzeban commented Oct 21, 2017 •

edited

Loading

Stevoisiak commented Apr 24, 2018 •

edited

Loading

Stevoisiak commented Apr 25, 2018 •

edited

Loading

Stevoisiak commented Jun 4, 2018 •

edited

Loading

mcepl commented Aug 17, 2018 •

edited

Loading