Message 414968 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	JelleZijlstra, tim.peters
Date	2022-03-12.03:24:49
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1647055489.9.0.896345569782.issue46990@roundup.psfhosted.org>
In-reply-to

Content
Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM. Overallocating by 7 burned 560 million bytes. Which is unusual. Usually a split result is short-lived, consumed once then thrown away. OTOH, the overwhelming motivation for overallocating at all is to acheive O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead. So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If it's out of stuff to do before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently (letting PyList_Append deal with it - overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).

Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM. Overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then thrown away.

OTOH, the overwhelming motivation for overallocating at all is to acheive O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If it's out of stuff to do before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently (letting PyList_Append deal with it - overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).

History
Date	User	Action	Args
2022-03-12 03:24:49	tim.peters	set	recipients: + tim.peters, JelleZijlstra
2022-03-12 03:24:49	tim.peters	set	messageid: <1647055489.9.0.896345569782.issue46990@roundup.psfhosted.org>
2022-03-12 03:24:49	tim.peters	link	issue46990 messages
2022-03-12 03:24:49	tim.peters	create