CPython-Internals/Modules/re/re.md at develop · NilashishC/CPython-Internals

389 lines (269 loc) · 10.8 KB
* [related file](#related-file)
* [how regex work](#how-regex-work)
    * [parse](#parse)
    * [compile](#compile)
    * [match](#match)
* [code detail](#code-detail)
    * [sre_ucs1_search](#sre_ucs1_search)
    * [fifo cache](#fifo-cache)
* [read more](#read-more)
# related file
* cpython/Lib/re.py
* cpython/Lib/sre_compile.py
* cpython/Lib/sre_constants.py
* cpython/Lib/sre_parse.py
* cpython/Modules/sre.h
* cpython/Modules/sre_constants.h
* cpython/Modules/sre_lib.h
* cpython/Modules/_sre.c
* cpython/Modules/clinic/_sre.c.h
# how regex work
what happened when you execute the following code ?
re.search("abc\d+abc", "xxabc123abcd") # re.DEBUG)
the call stack
![call_stack](https://github.com/zpoint/CPython-Internals/blob/master/Modules/re/call_stack.png)
the overview of different phases of **re.search**
![overview](https://github.com/zpoint/CPython-Internals/blob/master/Modules/re/overview.png)
let's see what's going on in each step
the core code in **parse** phase is in **cpython/Lib/sre_parse.py**
there is a class named **Tokenizer** to split your regex text into tokens, and **_parse** function is in charge of binding the token with the **sre_constants** code
# pseudo code
tokenizer = Tokenizer(pattern, flags)
while True:
    next_token = tokenizer.get_next_token()
    if not next_token:
        break
    parse(next_token)
let's run the parse phase with regex pattern `abc\d+abc`
next_token: a
rest part of token: bc\d+abc
parse result: [(LITERAL, 97)]
next_token: b
rest part of token: c\d+abc
parse result: [(LITERAL, 97), (LITERAL, 98)]
next_token: c
rest part of token: \d+abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
next_token: \d
rest part of token: +abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (IN, [(CATEGORY, CATEGORY_DIGIT)])]
now, the **next_token** is `'+'`, the `parse` function will pop whatever in the last parse result, and insert in back with the repeat symbol
item = parse_result[-1:]
min, max = 1, MAXREPEAT
parse_result[-1] = (MAX_REPEAT, (min, max, item))
next_token: +
rest part of token: abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])]))]
next_token: a
rest part of token: bc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97)]
next_token: b
rest part of token: c
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98)]
next_token: c
rest part of token: None
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
this is the input of the **compile** phase
[(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
the compile function in **sre_compile.py**
def _code(p, flags):
    flags = p.state.flags | flags
    code = []
    # compile info block
    _compile_info(code, p, flags)
    # compile the pattern
    _compile(code, p.data, flags)
    code.append(SUCCESS)
    return code
the **info block** has the following structure `[INFO, length_of_prefix_code, mask_indicate_whether_has_prefix, min_length_for_the_pattern, max_length_for_the_pattern, prefix1, prefix2..., prefixn, overlap_table_for_prefix]`
after the **_compile_info** function, code becomes
code == [INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0]
12 means the length of the prefix part excluding the first **INFO** element is 12
1 is the mask value, 1 indicate that the current pattern has a literal prefix
7 is the minimal length of the pattern `"abc\d+abc"`, i.e. length of `abc1abc` is 7
MAXREPEAT is the maximum length of the pattern `"abc\d+abc"`, constant **MAXREPEAT** indicate the +∞ (actually, the value is finite)
the first 3 is the length of the literal prefix, (`"abc"` here, which is 3)
the second 3 is the length of the prefix_skip, prefix_skip here has the same length as the prefix
the following 97, 98, 99 is the prefix `"abc"`
and the following 0, 0, 0 here is generated by the **_generate_overlap_table**
after compile the **info block**, the **pattern** will be compiled in `_compile(code, p.data, flags)`
for op, av in pattern:
    compile_and_fill_code_according_to_op_and_av()
# pattern here is: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
op: LITERAL av: 97
code: [LITERAL, 97]
op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98]
op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99]
op: MAX_REPEAT av: (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])
the handling process of MAX_REPEAT is a little bit different from others, the code snippet handling this situation is shown
if op is MAX_REPEAT:
    emit(REPEAT_ONE)
    emit(MIN_REPEAT_ONE)
skip = _len(code); emit(0)
emit(av[0])
emit(av[1])
# position 1
_compile(code, av[2], flags) # recursive call
emit(SUCCESS)
code[skip] = _len(code) - skip # change the length
# position 2
the **code** object in position 1
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 0, 1, MAXREPEAT]
after the recursive call, **code** object in position 2, the length now becomes 9
op: IN av: [(CATEGORY, CATEGORY_DIGIT)]
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS]
op: LITERAL av: 97
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97]
op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98]
op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]
combine the **_compile_info** and **_compile** together, the code to the next phase is
[INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0, LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]
the match phase is written in c
it's a big for loop, act as a state machine, every complied opcode pass down here will be passed to this loop, the loop will exam the pattern according to the opcode, if mismatch in any state, go to fail
if success, go on, there will be another article talking about the match phase(if I have time)
    switch (pattern[i]) {
        case SRE_OP_MARK:xxx
        case SRE_OP_LITERAL:xxx
        case SRE_OP_REPEAT_ONE:xxx
        ...
# code detail
## sre_ucs1_search
follow the call stack to here
status = sre_search(&state, PatternObject_GetCode(self));
the defination of **sre_search** is
the search will delegate to different function according to the real size of the **unicode** object, for detail of memory usage of **unicode** object, please refer to [how unicode implemented](https://github.com/zpoint/CPython-Internals/blob/master/BasicObject/str/str.md)
LOCAL(Py_ssize_t)
sre_search(SRE_STATE* state, SRE_CODE* pattern)
    if (state->charsize == 1)
        return sre_ucs1_search(state, pattern);
    if (state->charsize == 2)
        return sre_ucs2_search(state, pattern);
    assert(state->charsize == 4);
    return sre_ucs4_search(state, pattern);
if we search for **sre_ucs1_search** in the folder, we can't find anything
find . -name '*' -exec grep -nHr "sre_ucs1_search" {} \;
Binary file ./libpython3.8m.a matches
./Modules/_sre.c:572:        return sre_ucs1_search(state, pattern);
the trick here is for reusing the common part in **sre_lib.h** with different definitions
/* generate 8-bit version */
#define SRE_CHAR Py_UCS1
#define SIZEOF_SRE_CHAR 1
#define SRE(F) sre_ucs1_##F
#include "sre_lib.h"
/* generate 16-bit unicode version */
#define SRE_CHAR Py_UCS2
#define SIZEOF_SRE_CHAR 2
#define SRE(F) sre_ucs2_##F
#include "sre_lib.h"
/* generate 32-bit unicode version */
#define SRE_CHAR Py_UCS4
#define SIZEOF_SRE_CHAR 4
#define SRE(F) sre_ucs4_##F
#include "sre_lib.h"
we can find the defination of **search** in **sre_lib.h**
LOCAL(Py_ssize_t)
SRE(search)(SRE_STATE* state, SRE_CODE* pattern)
which will be expanded to three different form
inline Py_ssize_t sre_ucs1_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs2_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs4_search(SRE_STATE* state, SRE_CODE* pattern)
when I go inside the expanded **sre_ucs1_search**, I found the [match](#match) phase
# fifo cache
when you called the **re.compile**, there will be a FIFO cache in the python level
**_cache** is of type dictionary, and because type dict is now ordered([dict implementation](https://github.com/zpoint/CPython-Internals/blob/master/BasicObject/dict/dict.md#delete-in-entries)), it act as a fifo queue
>>> r1 = re.compile("a\d+b")
>>> r2 = re.compile("b\d+b")
>>> r3 = re.compile("a\d+b")
4409698992 # same as r1
>>> r4 = re.compile("b\d+b")
4409699184 # same as r2
the cache mechanism is written in python
_cache = {}  # ordered!
_MAXCACHE = 512
def _compile(pattern, flags):
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
        _cache[type(pattern), pattern, flags] = p
    return p
# read more
* [ccs.neu.edu->sre](http://www.ccs.neu.edu/home/shivers/papers/sre.txt)
* [Python's Hidden Regular Expression Gems](http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems/)
* [Understanding Python’s SRE structure](https://blog.labix.org/2003/06/16/understanding-pythons-sre-structure)
* [Get Started with Regex: Regular Expressions Make Easy](https://www.whoishostingthis.com/resources/regex/)
* [Comparing regular expressions in Perl, Python, and Emacs](https://www.johndcook.com/blog/regex-perl-python-emacs/)
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

re.md

Latest commit

History

re.md

File metadata and controls