Issue 31759: re wont recover nor fail on runaway regular expression

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/75940

classification

Title:	re wont recover nor fail on runaway regular expression
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Raphaël Riel, ezio.melotti, mrabarnett, tim.peters
Priority:	normal	Keywords:

Created on 2017-10-11 14:56 by Raphaël Riel, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
re_backtracking.py	Raphaël Riel, 2017-10-11 14:56	Incremental regex testing

Messages (7)
msg304146 - (view)	Author: Raphaël Riel (Raphaël Riel)	Date: 2017-10-11 14:56
re won't raise nor return when working with Runaway Regular Expression. It will compute "almost" indefinitely. Although I'm pretty sure it may complete sometime, it's definetly looks like it's stuck. ``` > python -VVVV Python 3.6.2 (default, Aug 23 2017, 14:57:08) [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] ``` Reproduce with attached file. Should there be a (configurable?) limit on the number of steps involved in the process. Or some warnings and/or hard limit that raises exception? https://pythex.org/ will fail with a HTTP502 BadGateway (server taking too long to respond) https://regex101.com/ python's tester seems to set a limit for this case. I can't say how they managed this.
msg304165 - (view)	Author: Raphaël Riel (Raphaël Riel)	Date: 2017-10-11 17:14
Results for my local computer: ``` Attempting test_01 Done in 0.12ms Attempting test_02 Done in 1.52ms Attempting test_03 Done in 26.24ms Attempting test_04 Done in 432.32ms Attempting test_05 Done in 886.3ms Attempting test_06 Done in 1757.07ms Attempting test_07 Done in 3388.92ms Attempting test_08 Done in 6669.02ms Attempting test_09 Done in 13088.37ms Attempting test_10 Done in 26600.23ms Attempting test_11 Done in 6722.9ms Attempting test_12 Done in 211192.82ms Attempting test_13 Done in 6850465.09ms Attempting test_14 ^CTraceback (most recent call last): File "/Users/rriel/Desktop/re_backtracking.py", line 26, in <module> bad_expression.sub("IN_GROUP", statement) KeyboardInterrupt ``` Cancelled test_14...
msg304169 - (view)	Author: Tim Peters (tim.peters) *	Date: 2017-10-11 18:48
Well, the problem in the regexp is this part: "\d+,? ?". You're not _requiring_ that strings of digits be separated by a comma or blank, you're only _allowing_ them to be so separated. A solid string of digits is matched by this, and so the enclosing + requires the engine, when backtracking, to consider every possible way of breaking the solid string of digits into one or more solid strings of digits too. If there are n digits in the digit string, there are 2*(n-1) ways to do so. Overall the regexp can never match because "NULL" never gets matched. So all possibilities will be tried. We can compute how many attempts will be made like so: prod = 1 for c in re.findall("\d+\|NULL", statement): if c == "NULL": break prod = 2**(len(c) - 1) print(format(prod, ",")) Attempting test_01 256 Attempting test_02 4,096 Attempting test_03 65,536 Attempting test_04 1,048,576 Attempting test_05 2,097,152 Attempting test_06 4,194,304 Attempting test_07 8,388,608 Attempting test_08 16,777,216 Attempting test_09 33,554,432 Attempting test_10 67,108,864 Attempting test_11 16,777,216 Attempting test_12 536,870,912 Attempting test_13 17,179,869,184 Attempting test_14 549,755,813,888 Note, e.g., this "predicts" test_08 will take about the same time as test_11 (both require 16,777,216 attempts). Which is what you actually saw happen. And that's why you gave up on test_14: it will require over half a trillion failing attempts before it ends, which will take roughly 30 times longer than it failed to match test_13. If we were building a regexp _tester_, we'd spin off a new process to run the regexp, and kill the process if it took "too long". But that's not we're doing - library functions do what you tell them to do ;-) In this case, it's easily repaired. For example, replace "\d+,? ?" by "\d+(?=[ ,)]),? ?" This uses a lookahead assertion to _require_ that a digit string is followed by a blank, comma, or right parenthesis. Which prevents exponential-time backtracking in failing cases.
msg304177 - (view)	Author: Raphaël Riel (Raphaël Riel)	Date: 2017-10-11 19:49
Thanks Tim! Pretty nice answer that I can learn from! Thanks for your time. I definitely knew my Regex was broken, yet I was surprised the interpreter/library didn't gave up/error after some(several million) steps. Some other language seems to just assume there will be no match (Source: some play on https://regex101.com/), and I don't think this is a valid approach. Should there be a WARNing logged on a defined soft-limit? I know this is low-level "re" library, and your point is pretty valid about the fact the lib should be doing what it's told to do. I was mostly looking for opinion about WARNs on soft limits and maybe errors on a hard-limit to debug/avoid this kind of false-hang situation. Feel free to close/wont-fix/not-a-bug this issue! And thanks again for your kind answer to my initial issue!
msg304183 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2017-10-11 21:32
You shouldn't assume that just because it takes a long time on one implementation that it'll take a long time on all of the others, because it's sometimes possible to include additional checks to reduce the problem. (I doubt you could eliminate the problem entirely, however.) My regex module, for example, includes some additional checks, and it seems to be OK with these tests.
msg304197 - (view)	Author: Tim Peters (tim.peters) *	Date: 2017-10-12 01:52
Sure! The OP was obviously asking about the engine that ships with Python, so that's what I talked about. Raphaël, Matthew develops an excellent replacement ("regex") for Python's re module, which you can install via, e.g., "pip install regex" (or, on Windows, "python -m pip install regex"). More info here: https://pypi.python.org/pypi/regex/ Matthew, will that become Python's standard offering some day? I'd be in favor of that! It has many advantages, although it doesn't always avoid exponential-time backtracking in failing cases. For example, `re` and `regex` both take exponential time to fail to match the regexp: "((xy)+)+$" against strings of the form: "xy" * i + "y" Increase `i` by 1, and both take about twice as long to fail to match (meaning either .match or .search), and `re` is actually quicker on my box (3.6.3 on 64-bit Win10). In any case, I'm closing this, since there's no concrete idea on the table for a change to `re` that would actually help (e.g., people ignore warnings, and there's really no way to _guess_ whether a regexp is "taking too long" to begin with - if it's taking minutes, people immediately discover the hangup already when they interrupt the program and see that it's trying to match a regexp).
msg304361 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2017-10-13 21:15
@Tim: the regex module includes some extra checks to reduce the chance of excessive backtracking. In the case of the OP's example, they seem to be working. However, it's difficult to know when adding such checks will help, and your example is one case where they are being done but aren't helping, with the result that it's slower.

History
Date	User	Action	Args
2022-04-11 14:58:53	admin	set	github: 75940
2017-10-13 21:15:40	mrabarnett	set	messages: + msg304361
2017-10-12 01:52:17	tim.peters	set	status: open -> closed resolution: wont fix messages: + msg304197 stage: resolved
2017-10-11 21:32:31	mrabarnett	set	messages: + msg304183
2017-10-11 19:49:25	Raphaël Riel	set	messages: + msg304177
2017-10-11 18:48:56	tim.peters	set	nosy: + tim.peters messages: + msg304169
2017-10-11 17:14:49	Raphaël Riel	set	messages: + msg304165
2017-10-11 14:56:13	Raphaël Riel	create