bpo-34043: Optimize tarfile uncompress performance #8089
tarfile._Stream has two buffers, one for compressed and one for uncompressed data. These buffers are not aligned, so unnecessary bytes slicing happens for every chunk that is read. This commit bypasses the compressed buffering. In this benchmark [1], user time drops from 300ms to 250ms. [1]: https://bugs.python.org/msg320763
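To illustrate the idea, here is a minimal sketch of a read loop that feeds chunks from the file object straight into the decompressor and keeps only one buffer of leftover uncompressed data. It is not the actual tarfile code: `SketchStream`, its attribute layout, the use of zlib, and the `bufsize=512` value are all assumptions made for this example.

```python
import io
import zlib


class SketchStream:
    """Hypothetical, simplified stand-in for tarfile._Stream (not the real class)."""

    def __init__(self, fileobj, bufsize=16 * 1024):
        self.fileobj = fileobj
        self.bufsize = bufsize
        self.buf = b""    # leftover compressed bytes, if any
        self.dbuf = b""   # leftover uncompressed bytes
        self.cmp = zlib.decompressobj()

    def read(self, size):
        """Return up to size uncompressed bytes."""
        chunks = [self.dbuf]
        count = len(self.dbuf)
        while count < size:
            if self.buf:
                # Drain any leftover compressed bytes first.
                buf = self.buf
                self.buf = b""
            else:
                # Read whole chunks from the file object directly, so no
                # slicing is needed on the compressed side.
                buf = self.fileobj.read(self.bufsize)
                if not buf:
                    break
            data = self.cmp.decompress(buf)
            chunks.append(data)
            count += len(data)
        joined = b"".join(chunks)
        # Only the uncompressed side is sliced.
        self.dbuf = joined[size:]
        return joined[:size]


# Usage: round-trip a zlib-compressed payload through the sketch.
payload = b"tarfile" * 10000
stream = SketchStream(io.BytesIO(zlib.compress(payload)), bufsize=512)
out = b""
while True:
    piece = stream.read(4096)
    if not piece:
        break
    out += piece
```

In this sketch the compressed chunk is passed to the decompressor as-is, and slicing happens only once per call, on the uncompressed data.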
| # Skip underlaying buffer to avoid unaligned double | ||
| # buffering. | ||
| if self.buf: | ||
| buf, self.buf = self.buf, b"" |
nitpick: I would prefer to do that on two lines, but it's just a matter of taste.
vstinner
left a comment
LGTM. I just saw a typo in a comment.
| buf = self.__read(self.bufsize) | ||
| if not buf: | ||
| break | ||
| # Skip underlaying buffer to avoid unaligned double |
| If size is not defined, return all bytes of the stream | ||
| up to EOF. | ||
| """ | ||
| if size is None: |
I have only one question. Why was this branch removed?
This issue is a follow-up to bpo-34010 (GH-8020).
See also https://bugs.python.org/issue34010#msg321040
Thank you. Are instances of _Stream never leaked to the end user? If so, this change LGTM.
It can be accessed via TarFile.fileobj.
But using it directly is not pragmatic, and TarFile.fileobj.read() raised TypeError because of the "".join() bug.
So it cannot have been used that way in previous versions.
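For context, a small sketch of why a `"".join()` call fails on bytes chunks. It assumes the bug mentioned above is a str separator being used to join bytes objects; the `chunks` list is made up for illustration.

```python
chunks = [b"abc", b"def"]  # hypothetical chunks read from a stream

try:
    "".join(chunks)  # str separator with bytes items
except TypeError as exc:
    error = exc  # str.join() rejects bytes items

joined = b"".join(chunks)  # a bytes separator works
```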
I saw that the class is private, but I didn't notice that TarFile.fileobj is public. Maybe just replace the assertion with a regular if/raise, just in case? Maybe raise a NotImplementedError?
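A minimal sketch of what the if/raise variant could look like. `SketchReader` is a hypothetical wrapper, not the real `tarfile._Stream`, and the error message is invented for this example.

```python
import io


class SketchReader:
    """Hypothetical wrapper used only to illustrate the check."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        # Unlike an assert statement, this check is not stripped when
        # Python runs with -O.
        if size is None:
            raise NotImplementedError("reading up to EOF (size=None) is not supported")
        return self.fileobj.read(size)


reader = SketchReader(io.BytesIO(b"hello"))
first = reader.read(2)
try:
    reader.read(None)
except NotImplementedError as exc:
    raised = exc
```

An explicit raise keeps the check active in optimized mode, which matters if the object can be reached through a public attribute.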
https://bugs.python.org/issue34043