Commit cf4b8b9

committed
Improved the documentation
1 parent bfd0ee8 commit cf4b8b9

4 files changed

Lines changed: 174 additions & 78 deletions

File tree

docs/features.md

Lines changed: 0 additions & 77 deletions
@@ -330,83 +330,6 @@ Another example using 5-rounds base58:
330330

331331
-----
332332

333-
### Guess-decode an arbitrary input
334-
335-
This is done by trying encodings using the breadth-first tree search algorithm. It stops when a given condition (by default, all characters must be printable), in the form of a function applied to the decoded string at the current depth, is met. It returns two results: the decoded string and a tuple with the related encoding names in order of application. The following parameters can be entered:
336-
337-
- `stop_func`: can be a function or a regular expression to be matched (automatically converted to a function that uses the `re` module) ; by default, checks if all input characters are printable.
338-
- `max_depth`: the maximum depth for the tree search ; by default 5.
339-
- `codec_categories`: a string indicating a codec [category](#list-codecs) or a list of [category](#list-codecs) strings ; by default, `None`, meaning the whole [categories](#list-codecs) (very slow).
340-
- `found`: a list or tuple of currently found encodings, this can be used to save time if the first decoding steps are known ; by default, an empty tuple.
341-
342-
343-
A simple example for a 1-stage base64-encoded string:
344-
345-
```python
346-
>>> codext.guess("VGhpcyBpcyBhIHRlc3Q=")
347-
('This is a test', ('base64',))
348-
```
349-
350-
An example of a 2-stages base64- then base62-encoded string:
351-
352-
```python
353-
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7")
354-
('FKU2Ng7lJbR>.IHuzLDv17eLhE6', ('barbie',))
355-
```
356-
357-
In the second example, we can see that the given encoded string is not decoded as expected. This is the case because the (default) stop condition is too broad and stops if all the characters of the output are printable. If we have a prior knowledge on what we should expect, we can input a simple string or a regex:
358-
359-
```python
360-
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test")
361-
('This is a test', ('base62', 'base64'))
362-
```
363-
364-
Instead of a string, we can also pass a function. For this purpose, standard stop functions are predefined in the `stopfunc` submodule. So, we can for instance use `stopfunc.lang_en` to stop when we find something that is English (only works if [`langdetect`](https://pypi.org/project/langdetect/) is installed, which is willingly NOT in the requirements of this package). Note that working this way gives lots of false positives if the text is very short like in the example case. That's why the `codec_categories` argument is used to only consider baseX codecs. This is also demonstrated in the next examples.
365-
366-
```python
367-
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", codext.stopfunc.lang_en, codec_categories="base")
368-
('This is a test', ('base62', 'base64'))
369-
```
370-
371-
If we know the first encoding, we can set this in the `found` parameter to save time:
372-
373-
```python
374-
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", found=["base62"])
375-
('This is a test', ('base62', 'base64'))
376-
```
377-
378-
If we are sure that only `base` (which is a valid [category](#list-codecs)) encodings are used, we can restrict the tree search using the `codec_categories` parameter to save time:
379-
380-
```python
381-
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", codec_categories="base")
382-
('This is a test', ('base62', 'base64'))
383-
```
384-
385-
Another example of 2-stages encoded string:
386-
387-
```python
388-
>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test")
389-
('this is a test', ('base64', 'morse'))
390-
>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test", codec_categories=["base", "language"])
391-
('this is a test', ('base64', 'morse'))
392-
```
393-
394-
When multiple results are expected, `stop` and `show` arguments can be used respectively to avoid stopping while finding a result and to display the intermediate result.
395-
396-
!!! warning "Computation time"
397-
398-
Note that, in the very last examples, the first call takes much longer than the second one but requires no knowledge about the possible [categories](#list-codecs) of encodings.
399-
400-
!!! note "Stop functions"
401-
402-
Currently, a few standard stop functions are provided with the `stopfunc` submodule:
403-
404-
- `flag`: searches for the pattern "`[Ff][Ll1][Aa4@][Gg9]`" (either UTF-8 or UTF-16)
405-
- `lang_**`: checks if the given lang (any from the [`PROFILES_DIRECTORY`](https://github.com/Mimino666/langdetect/tree/master/langdetect/profiles) of the [`langdetect` module](https://github.com/Mimino666/langdetect) if it is installed) is detected (note that it first checks if all characters are printable)
406-
- `printables`: checks that every output character is in the set of printables
407-
408-
-----
409-
410333
### Hooked `codecs` functions
411334

412335
In order to select the right de/encoding function and avoid any conflict, the native `codecs` library registers search functions (using the `register(search_function)` function), called in order of registration while searching for a codec.
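
As a minimal sketch of the standard mechanism described above, the following registers a toy search function with the native `codecs` module. The `reverse_demo` codec and its helper functions are hypothetical names made up for illustration; they are not part of `codext`.

```python
import codecs


def _reverse_encode(text, errors="strict"):
    # Toy encoder (hypothetical): reverse the text, then emit ASCII bytes.
    return text[::-1].encode("ascii", errors), len(text)


def _reverse_decode(data, errors="strict"):
    # Toy decoder (hypothetical): read ASCII bytes, then reverse the text.
    return bytes(data).decode("ascii", errors)[::-1], len(data)


def search_function(name):
    # Search functions receive the normalized codec name and return either
    # a CodecInfo object or None when they do not know the codec.
    if name == "reverse_demo":
        return codecs.CodecInfo(_reverse_encode, _reverse_decode, name="reverse_demo")
    return None


# Search functions are consulted in order of registration.
codecs.register(search_function)

encoded = codecs.encode("abc", "reverse_demo")
decoded = codecs.decode(encoded, "reverse_demo")
```

Returning `None` for unknown names lets the lookup fall through to the next registered search function.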

docs/guessing.md

Lines changed: 172 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,172 @@
1+
## Guess Mode
2+
3+
For decoding multiple layers of codecs, `codext` features a guess mode that relies on a classic Artificial Intelligence search algorithm, the Breadth-First tree Search (BFS). In many cases, the default parameters are sufficient for guess-decoding, but some inputs may require parameter tuning.
4+
5+
-----
6+
7+
### Parameters
8+
9+
BFS stops when a given condition, in the form of a function applied to the decoded string at the current depth, is met. As the examples below show, the result is a dictionary that maps a tuple of encoding names, in order of application, to the decoded string.
10+
11+
The following parameters are tunable:
12+
13+
- `stop_func`: can be a function or a regular expression to be matched (automatically converted to a function that uses the `re` module) ; by default, checks if all input characters are printable.
14+
- `min_depth`: the minimum depth for the tree search (avoids the overhead of evaluating the stop function at depths where the decoded output cannot be the right result) ; by default 0.
15+
- `max_depth`: the maximum depth for the tree search ; by default 5.
16+
- `codec_categories`: a string indicating a codec [category](#list-codecs) or a list of [category](#list-codecs) strings ; by default, `None`, meaning all [categories](#list-codecs) (very slow).
17+
- `found`: a list or tuple of currently found encodings that can be used to save time if the first decoding steps are known ; by default, an empty tuple.
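
To illustrate how these parameters interact, here is a minimal sketch of a breadth-first guess-decoder built only on the standard library. It is not the `codext` implementation: the `DECODERS` table, `printable` and `bfs_guess` are hypothetical names, only two codecs are tried, and the choice to test the stop function on decoded candidates only (never on the raw input) is an assumption of this sketch.

```python
import base64
import string
from collections import deque

# Hypothetical decoder table for illustration; codext ships many more codecs.
DECODERS = {
    "base64": lambda s: base64.b64decode(s, validate=True).decode("ascii"),
    "base16": lambda s: base64.b16decode(s).decode("ascii"),
}


def printable(s):
    # Simplistic stand-in for a "printables" stop condition.
    return all(c in string.printable for c in s)


def bfs_guess(data, stop_func=printable, min_depth=0, max_depth=5):
    # Each queue item holds a candidate string and the codecs applied so far.
    queue = deque([(data, ())])
    while queue:
        current, path = queue.popleft()
        if len(path) >= max_depth:
            continue  # do not expand nodes beyond the maximum depth
        for name, decode in DECODERS.items():
            try:
                decoded = decode(current)
            except ValueError:  # covers binascii.Error and UnicodeDecodeError
                continue
            new_path = path + (name,)
            # The stop function is only evaluated from min_depth onwards.
            if len(new_path) >= min_depth and stop_func(decoded):
                return {new_path: decoded}
            queue.append((decoded, new_path))
    return {}


encoded = base64.b16encode(base64.b64encode(b"This is a test")).decode("ascii")
print(bfs_guess(encoded, lambda s: "test" in s))
# {('base16', 'base64'): 'This is a test'}
```

Because the queue is processed level by level, shorter codec chains are always reported before longer ones, which is why a stop condition that is too broad ends the search after a single layer.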
18+
19+
A simple example for a 1-stage base64-encoded string:
20+
21+
```python
22+
>>> codext.guess("VGhpcyBpcyBhIHRlc3Q=")
23+
{('base64',): 'This is a test'}
24+
```
25+
26+
An example of a 2-stage base64- then base62-encoded string:
27+
28+
```python
29+
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7")
30+
{('base62',): 'VGhpcyBpcyBhIHRlc3Q='}
31+
```
32+
33+
In the second example, the given encoded string is not decoded as expected: only the first layer is removed. This happens because the (default) stop condition is too broad and stops as soon as all the characters of the output are printable. If we have prior knowledge of what to expect, we can pass a simple string or a regex as the stop condition:
34+
35+
!!! note "Default stop function"
36+
37+
:::python
38+
>>> codext.stopfunc.default.__name__
39+
'...'
40+
41+
The output depends on whether you have a language detection backend library installed ; see section [*Natural Language Detection*](#natural-language-detection). If no such library is installed, the default function is "`text`".
42+
43+
```python
44+
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test")
45+
{('base62', 'base64'): 'This is a test'}
46+
```
47+
48+
In this example, the string "*test*" is converted to a function that uses this string as a regular expression. Instead of a string, we can also pass a function. For this purpose, standard [stop functions](#available-stop-functions) are predefined. For instance, we can use `stopfunc.lang_en` to stop when we find something that is detected as English. Note that this approach yields many false positives when the text is very short, as in the example. This is why the `codec_categories` argument is used to only consider baseX codecs. This is also demonstrated in the next examples.
49+
50+
```python
51+
>>> codext.stopfunc._reload_lang("langdetect")
52+
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", codext.stopfunc.lang_en, codec_categories="base")
53+
{('base62', 'base64'): 'This is a test'}
54+
```
55+
56+
If we know the first encoding, we can set this in the `found` parameter to save time:
57+
58+
```python
59+
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", found=["base62"])
60+
{('base62', 'base64'): 'This is a test'}
61+
```
62+
63+
If we are sure that only `base` (which is a valid [category](#list-codecs)) encodings are used, we can restrict the tree search using the `codec_categories` parameter to save time:
64+
65+
```python
66+
>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", codec_categories="base")
67+
{('base62', 'base64'): 'This is a test'}
68+
```
69+
70+
Another example of a 2-stage encoded string:
71+
72+
```python
73+
>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test")
74+
{('base64', 'morse'): 'this is a test'}
75+
>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test", codec_categories=["base", "language"])
76+
{('base64', 'morse'): 'this is a test'}
77+
```
78+
79+
When multiple results are expected, the `stop` and `show` arguments can be used, respectively, to keep searching after a result is found and to display intermediate results.
80+
81+
!!! warning "Computation time"
82+
83+
Note that, in the very last examples, the first call takes much longer than the second one but requires no knowledge about the possible [categories](#list-codecs) of encodings.
84+
85+
-----
86+
87+
### Available Stop Functions
88+
89+
A few stop functions are predefined in the `stopfunc` submodule.
90+
91+
```python
92+
>>> import codext
93+
>>> dir(codext.stopfunc)
94+
['LANG_BACKEND', 'LANG_BACKENDS', ..., '_reload_lang', 'default', 'flag', ..., 'printables', 'regex', 'text']
95+
```
96+
97+
Currently, the following stop functions are provided:
98+
99+
- `flag`: searches for the pattern "`[Ff][Ll1][Aa4@][Gg9]`" (either UTF-8 or UTF-16)
100+
- `lang_**`: checks if the given lang is detected (note that it first checks if all characters are text ; see `text` hereafter)
101+
- `printables`: checks that every output character is in the set of printables
102+
- `regex(pattern)`: takes one argument, the regular expression, for checking a string against the given pattern
103+
- `text`: checks for printables and an entropy less than 4.6 (empirically determined)
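
The `text` criterion can be approximated as follows. This is a sketch under assumptions, not the code used by `codext`: it computes the Shannon entropy over characters (the library may compute it differently, for example over bytes), and `shannon_entropy` and `looks_like_text` are hypothetical names.

```python
import math
import string
from collections import Counter


def shannon_entropy(s):
    # Shannon entropy in bits per character; 0.0 for an empty string.
    if not s:
        return 0.0
    total = len(s)
    return -sum(n / total * math.log2(n / total) for n in Counter(s).values())


def looks_like_text(s, max_entropy=4.6):
    # Printable characters only, and entropy below the documented threshold.
    return all(c in string.printable for c in s) and shannon_entropy(s) < max_entropy
```

Short strings always have a low entropy (at most the base-2 logarithm of their length), so a criterion of this kind mostly discriminates on longer outputs.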
104+
105+
A stop function can be used as the second argument of the `guess` function or as a keyword-argument, as shown in the following examples:
106+
107+
```python
108+
>>> codext.guess("...", codext.stopfunc.text)
109+
[...]
110+
>>> codext.guess("...", [...], stop_func=codext.stopfunc.text)
111+
[...]
112+
```
113+
114+
When a string is given, it is automatically converted to a `regex` stop function.
115+
116+
```python
117+
>>> s = codext.encode("pattern testing", "leetspeak")
118+
>>> s
119+
'p4773rn 73571n9'
120+
>>> stop_func = codext.stopfunc.regex("p[a4@][t7]{2}[e3]rn")
121+
>>> stop_func(s)
122+
True
123+
>>> codext.guess(s, stop_func)
124+
[...]
125+
```
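
The string-to-function conversion can be pictured with the sketch below. The helper name `to_stop_func` is hypothetical, and the use of `re.search` (a match anywhere in the output) is an assumption inferred from the earlier example in which "*test*" matches 'This is a test'.

```python
import re


def to_stop_func(stop):
    # Callables are used as-is; strings are compiled to a regular expression.
    if callable(stop):
        return stop
    pattern = re.compile(stop)

    def _stop(text):
        # Assumption: the pattern may match anywhere in the decoded output.
        return pattern.search(text) is not None

    return _stop
```

With such a helper, `to_stop_func("test")` and a hand-written function that looks for "test" would be interchangeable as stop conditions.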
126+
127+
Additionally, a simple stop function is predefined for CTF players, matching various spellings of the word *flag*. Alternatively, a pattern can always be used when flags have a particular format.
128+
129+
```python
130+
>>> codext.stopfunc.flag("test string")
131+
False
132+
>>> codext.stopfunc.flag("test f1@9")
133+
True
134+
>>> codext.stopfunc.regex(r"^CTF\{.*?\}$")("CTF{098f6bcd4621d373cade4e832627b4f6}")
135+
True
136+
```
137+
138+
The particular type of stop function `lang_**` is explained in the [next section](#natural-language-detection).
139+
140+
-----
141+
142+
### Natural Language Detection
143+
144+
In many cases, we are trying to decode inputs into readable text, so it is necessary to narrow the scope when searching for valid decoded outputs. Since matching printables, or even text (defined above as printables with an entropy of less than 4.6), is too broad for many cases, natural language detection can be very useful. In `codext`, this relies on Natural Language Processing (NLP) backend libraries, which are loaded only if they have been installed separately.
145+
146+
Currently, the following backends are supported, in order of precedence (this order was empirically determined by testing):
147+
148+
- [`langid`](https://github.com/saffsd/langid.py): *Standalone Language Identification (LangID) tool.*
149+
- [`langdetect`](https://github.com/Mimino666/langdetect): *Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python.*
150+
- [`pycld2`](https://github.com/aboSamoor/pycld2): *Python bindings for the Compact Language Detect 2 (CLD2).*
151+
- [`cld3`](https://github.com/bsolomon1124/pycld3): *Python bindings to the Compact Language Detector v3 (CLD3).*
152+
- [`textblob`](https://github.com/sloria/TextBlob): *Python (2 and 3) library for processing textual data.*
153+
154+
`codext` checks which of these libraries are installed and uses the first one found, in the order given above, by default. This sets up the `stopfunc.default` for the guess mode. This behavior keeps language detection optional and avoids imposing multiple specific requirements that serve the same purpose.
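
The "first installed backend wins" idea can be sketched as below. This is an illustration rather than the `codext` code: `BACKENDS` and `first_available` are hypothetical names, and the module names are assumed to equal the backend names listed above.

```python
import importlib.util

# Backend names from the list above, in order of precedence.
BACKENDS = ["langid", "langdetect", "pycld2", "cld3", "textblob"]


def first_available(names=BACKENDS):
    # Return the first module name that can be found, without importing it.
    for name in names:
        if importlib.util.find_spec(name) is not None:
            return name
    return None  # no language detection backend installed
```

Checking with `find_spec` instead of importing keeps the cost low when several heavy libraries are installed but only one is needed.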
155+
156+
While loaded, the default backend can be switched to another one by using the `_reload_lang` function:
157+
158+
```python
159+
>>> codext.stopfunc._reload_lang("pycld2") # this loads pycld2 and attaches lang_** functions to the stopfunc submodule
160+
>>> codext.stopfunc._reload_lang() # this unloads any loaded backend
161+
```
162+
163+
Each time a backend is loaded, it gets `lang_**` stop functions attached to the `stopfunc` submodule for each supported language.
164+
165+
-----
166+
167+
### Ranking Heuristic
168+
169+
!!! warning "Work in progress"
170+
171+
This part is still in progress and shall be improved with better features and/or using machine learning.
172+

docs/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
## Introduction
22

3-
Codext, contraction of "*codecs*" and "*extension*", is a tiny library that gathers a few additional encodings for use with [`codecs`](https://docs.python.org/3/library/codecs.html). While imported, it registers new encodings to a proxy codecs registry for making the encodings available from the `codecs.(decode|encode|open)` calls.
3+
Codext, contraction of "*codecs*" and "*extension*", is a library that gathers many additional encodings for use with [`codecs`](https://docs.python.org/3/library/codecs.html). While imported, it registers new encodings to an extended codecs registry for making the encodings available from the `codecs.(decode|encode|open)` API. It also features [CLI tools](./cli.html) and a [guess mode](./features.html#guess-decode-an-arbitrary-input) for decoding multiple layers of codecs.
44

55
### Setup
66

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ docs_dir: docs
55
nav:
66
- Introduction: index.md
77
- Features: features.md
8+
- 'Guess mode': guessing.md
89
- Encodings:
910
- Base: enc/base.md
1011
- Binary: enc/binary.md
