forked from zpoint/CPython-Internals
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathre.md
More file actions
389 lines (269 loc) · 10.8 KB
/
Copy pathre.md
File metadata and controls
389 lines (269 loc) · 10.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
# re
# contents
* [related file](#related-file)
* [how regex work](#how-regex-work)
* [parse](#parse)
* [compile](#compile)
* [match](#match)
* [code detail](#code-detail)
* [sre_ucs1_search](#sre_ucs1_search)
* [fifo cache](#fifo-cache)
* [read more](#read-more)
# related file
* cpython/Lib/re.py
* cpython/Lib/sre_compile.py
* cpython/Lib/sre_constants.py
* cpython/Lib/sre_parse.py
* cpython/Modules/sre.h
* cpython/Modules/sre_constants.h
* cpython/Modules/sre_lib.h
* cpython/Modules/_sre.c
* cpython/Modules/clinic/_sre.c.h
# how regex work
what happened when you execute the following code ?
```python3
import re
re.search("abc\d+abc", "xxabc123abcd") # re.DEBUG)
```
the call stack

the overview of different phases of **re.search**

let's see what's going on in each step
## parse
the core code in **parse** phase is in **cpython/Lib/sre_parse.py**
there is a class named **Tokenizer** to split your regex text into tokens, and **_parse** function is in charge of binding the token with the **sre_constants** code
```python3
# pseudo code
tokenizer = Tokenizer(pattern, flags)
while True:
next_token = tokenizer.get_next_token()
if not next_token:
break
parse(next_token)
```
let's run the parse phase with regex pattern `abc\d+abc`
```python3
next_token: a
rest part of token: bc\d+abc
parse result: [(LITERAL, 97)]
next_token: b
rest part of token: c\d+abc
parse result: [(LITERAL, 97), (LITERAL, 98)]
next_token: c
rest part of token: \d+abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
next_token: \d
rest part of token: +abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (IN, [(CATEGORY, CATEGORY_DIGIT)])]
```
now, the **next_token** is `'+'`, the `parse` function will pop whatever in the last parse result, and insert in back with the repeat symbol
```python3
item = parse_result[-1:]
min, max = 1, MAXREPEAT
parse_result[-1] = (MAX_REPEAT, (min, max, item))
next_token: +
rest part of token: abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])]))]
next_token: a
rest part of token: bc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97)]
next_token: b
rest part of token: c
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98)]
next_token: c
rest part of token: None
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
```
## compile
this is the input of the **compile** phase
```python3
[(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
```
the compile function in **sre_compile.py**
```python3
def _code(p, flags):
flags = p.state.flags | flags
code = []
# compile info block
_compile_info(code, p, flags)
# compile the pattern
_compile(code, p.data, flags)
code.append(SUCCESS)
return code
```
the **info block** has the following structure `[INFO, length_of_prefix_code, mask_indicate_whether_has_prefix, min_length_for_the_pattern, max_length_for_the_pattern, prefix1, prefix2..., prefixn, overlap_table_for_prefix]`
after the **_compile_info** function, code becomes
```python3
code == [INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0]
```
12 means the length of the prefix part excluding the first **INFO** element is 12
1 is the mask value, 1 indicate that the current pattern has a literal prefix
7 is the minimal length of the pattern `"abc\d+abc"`, i.e. length of `abc1abc` is 7
MAXREPEAT is the maximum length of the pattern `"abc\d+abc"`, constant **MAXREPEAT** indicate the +∞ (actually, the value is finite)
the first 3 is the length of the literal prefix, (`"abc"` here, which is 3)
the second 3 is the length of the prefix_skip, prefix_skip here has the same length as the prefix
the following 97, 98, 99 is the prefix `"abc"`
and the following 0, 0, 0 here is generated by the **_generate_overlap_table**
after compile the **info block**, the **pattern** will be compiled in `_compile(code, p.data, flags)`
```python3
for op, av in pattern:
compile_and_fill_code_according_to_op_and_av()
# pattern here is: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
op: LITERAL av: 97
code: [LITERAL, 97]
op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98]
op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99]
op: MAX_REPEAT av: (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])
```
the handling process of MAX_REPEAT is a little bit different from others, the code snippet handling this situation is shown
```python3
if op is MAX_REPEAT:
emit(REPEAT_ONE)
else:
emit(MIN_REPEAT_ONE)
skip = _len(code); emit(0)
emit(av[0])
emit(av[1])
# position 1
_compile(code, av[2], flags) # recursive call
emit(SUCCESS)
code[skip] = _len(code) - skip # change the length
# position 2
```
the **code** object in position 1
```python3
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 0, 1, MAXREPEAT]
```
after the recursive call, **code** object in position 2, the length now becomes 9
```python3
op: IN av: [(CATEGORY, CATEGORY_DIGIT)]
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS]
op: LITERAL av: 97
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97]
op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98]
op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]
```
combine the **_compile_info** and **_compile** together, the code to the next phase is
```python3
[INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0, LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]
```
## match
the match phase is written in c
it's a big for loop, act as a state machine, every complied opcode pass down here will be passed to this loop, the loop will exam the pattern according to the opcode, if mismatch in any state, go to fail
if success, go on, there will be another article talking about the match phase(if I have time)
```c
for (;;) {
switch (pattern[i]) {
case SRE_OP_MARK:xxx
case SRE_OP_LITERAL:xxx
case SRE_OP_REPEAT_ONE:xxx
...
}
}
```
# code detail
## sre_ucs1_search
follow the call stack to here
```python3
status = sre_search(&state, PatternObject_GetCode(self));
```
the defination of **sre_search** is
the search will delegate to different function according to the real size of the **unicode** object, for detail of memory usage of **unicode** object, please refer to [how unicode implemented](https://github.com/zpoint/CPython-Internals/blob/master/BasicObject/str/str.md)
```python3
LOCAL(Py_ssize_t)
sre_search(SRE_STATE* state, SRE_CODE* pattern)
{
if (state->charsize == 1)
return sre_ucs1_search(state, pattern);
if (state->charsize == 2)
return sre_ucs2_search(state, pattern);
assert(state->charsize == 4);
return sre_ucs4_search(state, pattern);
}
```
if we search for **sre_ucs1_search** in the folder, we can't find anything
```python3
find . -name '*' -exec grep -nHr "sre_ucs1_search" {} \;
Binary file ./libpython3.8m.a matches
./Modules/_sre.c:572: return sre_ucs1_search(state, pattern);
```
the trick here is for reusing the common part in **sre_lib.h** with different definitions
```c
/* generate 8-bit version */
#define SRE_CHAR Py_UCS1
#define SIZEOF_SRE_CHAR 1
#define SRE(F) sre_ucs1_##F
#include "sre_lib.h"
/* generate 16-bit unicode version */
#define SRE_CHAR Py_UCS2
#define SIZEOF_SRE_CHAR 2
#define SRE(F) sre_ucs2_##F
#include "sre_lib.h"
/* generate 32-bit unicode version */
#define SRE_CHAR Py_UCS4
#define SIZEOF_SRE_CHAR 4
#define SRE(F) sre_ucs4_##F
#include "sre_lib.h"
```
we can find the defination of **search** in **sre_lib.h**
```python3
LOCAL(Py_ssize_t)
SRE(search)(SRE_STATE* state, SRE_CODE* pattern)
```
which will be expanded to three different form
```python3
inline Py_ssize_t sre_ucs1_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs2_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs4_search(SRE_STATE* state, SRE_CODE* pattern)
```
when I go inside the expanded **sre_ucs1_search**, I found the [match](#match) phase
# fifo cache
when you called the **re.compile**, there will be a FIFO cache in the python level
**_cache** is of type dictionary, and because type dict is now ordered([dict implementation](https://github.com/zpoint/CPython-Internals/blob/master/BasicObject/dict/dict.md#delete-in-entries)), it act as a fifo queue
```python3
>>> r1 = re.compile("a\d+b")
>>> id(r1)
4409698992
>>> r2 = re.compile("b\d+b")
>>> id(r2)
4409699184
>>> r3 = re.compile("a\d+b")
>>> id(r3)
4409698992 # same as r1
>>> r4 = re.compile("b\d+b")
>>> id(r4)
4409699184 # same as r2
```
the cache mechanism is written in python
```python3
_cache = {} # ordered!
_MAXCACHE = 512
def _compile(pattern, flags):
# omit
try:
return _cache[type(pattern), pattern, flags]
except KeyError:
pass
# omit
p = sre_compile.compile(pattern, flags)
if not (flags & DEBUG):
if len(_cache) >= _MAXCACHE:
# Drop the oldest item
try:
del _cache[next(iter(_cache))]
except (StopIteration, RuntimeError, KeyError):
pass
_cache[type(pattern), pattern, flags] = p
return p
```
# read more
* [ccs.neu.edu->sre](http://www.ccs.neu.edu/home/shivers/papers/sre.txt)
* [Python's Hidden Regular Expression Gems](http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems/)
* [Understanding Python’s SRE structure](https://blog.labix.org/2003/06/16/understanding-pythons-sre-structure)
* [Get Started with Regex: Regular Expressions Make Easy](https://www.whoishostingthis.com/resources/regex/)
* [Comparing regular expressions in Perl, Python, and Emacs](https://www.johndcook.com/blog/regex-perl-python-emacs/)