Issue31759
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2017-10-11 14:56 by Raphaël Riel, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| re_backtracking.py | Raphaël Riel, 2017-10-11 14:56 | Incremental regex testing | ||
| Messages (7) | |||
|---|---|---|---|
| msg304146 - (view) | Author: Raphaël Riel (Raphaël Riel) | Date: 2017-10-11 14:56 | |
re won't raise nor return when working with Runaway Regular Expression. It will compute "almost" indefinitely. Although I'm pretty sure it *may* complete sometime, it's definetly looks like it's stuck. ``` > python -VVVV Python 3.6.2 (default, Aug 23 2017, 14:57:08) [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] ``` Reproduce with attached file. Should there be a (configurable?) limit on the number of steps involved in the process. Or some warnings and/or hard limit that raises exception? https://pythex.org/ will fail with a HTTP502 BadGateway (server taking too long to respond) https://regex101.com/ python's tester seems to set a limit for this case. I can't say how they managed this. |
|||
| msg304165 - (view) | Author: Raphaël Riel (Raphaël Riel) | Date: 2017-10-11 17:14 | |
Results for my local computer:
```
Attempting test_01
Done in 0.12ms
Attempting test_02
Done in 1.52ms
Attempting test_03
Done in 26.24ms
Attempting test_04
Done in 432.32ms
Attempting test_05
Done in 886.3ms
Attempting test_06
Done in 1757.07ms
Attempting test_07
Done in 3388.92ms
Attempting test_08
Done in 6669.02ms
Attempting test_09
Done in 13088.37ms
Attempting test_10
Done in 26600.23ms
Attempting test_11
Done in 6722.9ms
Attempting test_12
Done in 211192.82ms
Attempting test_13
Done in 6850465.09ms
Attempting test_14
^CTraceback (most recent call last):
File "/Users/rriel/Desktop/re_backtracking.py", line 26, in <module>
bad_expression.sub("IN_GROUP", statement)
KeyboardInterrupt
```
Cancelled test_14...
|
|||
| msg304169 - (view) | Author: Tim Peters (tim.peters) * ![]() |
Date: 2017-10-11 18:48 | |
Well, the problem in the regexp is this part: "\d+,? ?". You're not _requiring_ that strings of digits be separated by a comma or blank, you're only _allowing_ them to be so separated. A solid string of digits is matched by this, and so the enclosing + requires the engine, when backtracking, to consider every possible way of breaking the solid string of digits into one or more solid strings of digits too. If there are n digits in the digit string, there are 2**(n-1) ways to do so.
Overall the regexp can never match because "NULL" never gets matched. So all possibilities will be tried. We can compute how many attempts will be made like so:
prod = 1
for c in re.findall("\d+|NULL", statement):
if c == "NULL":
break
prod *= 2**(len(c) - 1)
print(format(prod, ","))
Attempting test_01
256
Attempting test_02
4,096
Attempting test_03
65,536
Attempting test_04
1,048,576
Attempting test_05
2,097,152
Attempting test_06
4,194,304
Attempting test_07
8,388,608
Attempting test_08
16,777,216
Attempting test_09
33,554,432
Attempting test_10
67,108,864
Attempting test_11
16,777,216
Attempting test_12
536,870,912
Attempting test_13
17,179,869,184
Attempting test_14
549,755,813,888
Note, e.g., this "predicts" test_08 will take about the same time as test_11 (both require 16,777,216 attempts). Which is what you actually saw happen.
And that's why you gave up on test_14: it will require over half a trillion failing attempts before it ends, which will take roughly 30 times longer than it failed to match test_13.
If we were building a regexp _tester_, we'd spin off a new process to run the regexp, and kill the process if it took "too long". But that's not we're doing - library functions do what you tell them to do ;-)
In this case, it's easily repaired. For example, replace
"\d+,? ?"
by
"\d+(?=[ ,)]),? ?"
This uses a lookahead assertion to _require_ that a digit string is followed by a blank, comma, or right parenthesis. Which prevents exponential-time backtracking in failing cases.
|
|||
| msg304177 - (view) | Author: Raphaël Riel (Raphaël Riel) | Date: 2017-10-11 19:49 | |
Thanks Tim! Pretty nice answer that I can learn from! Thanks for your time. I definitely knew my Regex was broken, yet I was surprised the interpreter/library didn't gave up/error after some(several million) steps. Some other language seems to just assume there will be no match (Source: some play on https://regex101.com/), and I don't think this is a valid approach. Should there be a WARNing logged on a defined soft-limit? I know this is low-level "re" library, and your point is pretty valid about the fact the lib should be doing what it's told to do. I was mostly looking for opinion about WARNs on soft limits and maybe errors on a hard-limit to debug/avoid this kind of false-hang situation. Feel free to close/wont-fix/not-a-bug this issue! And thanks again for your kind answer to my initial issue! |
|||
| msg304183 - (view) | Author: Matthew Barnett (mrabarnett) * ![]() |
Date: 2017-10-11 21:32 | |
You shouldn't assume that just because it takes a long time on one implementation that it'll take a long time on all of the others, because it's sometimes possible to include additional checks to reduce the problem. (I doubt you could eliminate the problem entirely, however.) My regex module, for example, includes some additional checks, and it seems to be OK with these tests. |
|||
| msg304197 - (view) | Author: Tim Peters (tim.peters) * ![]() |
Date: 2017-10-12 01:52 | |
Sure! The OP was obviously asking about the engine that ships with Python, so that's what I talked about.
Raphaël, Matthew develops an excellent replacement ("regex") for Python's re module, which you can install via, e.g., "pip install regex" (or, on Windows, "python -m pip install regex"). More info here:
https://pypi.python.org/pypi/regex/
Matthew, will that become Python's standard offering some day? I'd be in favor of that! It has many advantages, although it doesn't always avoid exponential-time backtracking in failing cases. For example, `re` and `regex` both take exponential time to fail to match the regexp:
"((xy)+)+$"
against strings of the form:
"xy" * i + "y"
Increase `i` by 1, and both take about twice as long to fail to match (meaning either .match or .search), and `re` is actually quicker on my box (3.6.3 on 64-bit Win10).
In any case, I'm closing this, since there's no concrete idea on the table for a change to `re` that would actually help (e.g., people ignore warnings, and there's really no way to _guess_ whether a regexp is "taking too long" to begin with - if it's taking minutes, people immediately discover the hangup already when they interrupt the program and see that it's trying to match a regexp).
|
|||
| msg304361 - (view) | Author: Matthew Barnett (mrabarnett) * ![]() |
Date: 2017-10-13 21:15 | |
@Tim: the regex module includes some extra checks to reduce the chance of excessive backtracking. In the case of the OP's example, they seem to be working. However, it's difficult to know when adding such checks will help, and your example is one case where they are being done but aren't helping, with the result that it's slower. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:53 | admin | set | github: 75940 |
| 2017-10-13 21:15:40 | mrabarnett | set | messages: + msg304361 |
| 2017-10-12 01:52:17 | tim.peters | set | status: open -> closed resolution: wont fix messages: + msg304197 stage: resolved |
| 2017-10-11 21:32:31 | mrabarnett | set | messages: + msg304183 |
| 2017-10-11 19:49:25 | Raphaël Riel | set | messages: + msg304177 |
| 2017-10-11 18:48:56 | tim.peters | set | nosy:
+ tim.peters messages: + msg304169 |
| 2017-10-11 17:14:49 | Raphaël Riel | set | messages: + msg304165 |
| 2017-10-11 14:56:13 | Raphaël Riel | create | |

