Improved the documentation

dhondta · dhondta · commit cf4b8b9fc5e5 · 2022-01-09T05:22:37.000+01:00
diff --git a/docs/features.md b/docs/features.md
@@ -330,83 +330,6 @@ Another example using 5-rounds base58:
 
 -----
 
-### Guess-decode an arbitrary input
-
-This is done by trying encodings using the breadth-first tree search algorithm. It stops when a given condition (by default, all characters must be printable), in the form of a function applied to the decoded string at the current depth, is met. It returns two results: the decoded string and a tuple with the related encoding names in order of application. The following parameters can be entered:
-
-- `stop_func`: can be a function or a regular expression to be matched (automatically converted to a function that uses the `re` module) ; by default, checks if all input characters are printable.
-- `max_depth`: the maximum depth for the tree search ; by default 5.
-- `codec_categories`: a string indicating a codec [category](#list-codecs) or a list of [category](#list-codecs) strings ; by default, `None`, meaning the whole [categories](#list-codecs) (very slow).
-- `found`: a list or tuple of currently found encodings, this can be used to save time if the first decoding steps are known ; by default, an empty tuple.
-
-
-A simple example for a 1-stage base64-encoded string:
-
-```python
->>> codext.guess("VGhpcyBpcyBhIHRlc3Q=")
-('This is a test', ('base64',))
-```
-
-An example of a 2-stages base64- then base62-encoded string:
-
-```python
->>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7")
-('FKU2Ng7lJbR>.IHuzLDv17eLhE6', ('barbie',))
-```
-
-In the second example, we can see that the given encoded string is not decoded as expected. This is the case because the (default) stop condition is too broad and stops if all the characters of the output are printable. If we have a prior knowledge on what we should expect, we can input a simple string or a regex:
-
-```python
->>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test")
-('This is a test', ('base62', 'base64'))
-```
-
-Instead of a string, we can also pass a function. For this purpose, standard stop functions are predefined in the `stopfunc` submodule. So, we can for instance use `stopfunc.lang_en` to stop when we find something that is English (only works if [`langdetect`](https://pypi.org/project/langdetect/) is installed, which is willingly NOT in the requirements of this package). Note that working this way gives lots of false positives if the text is very short like in the example case. That's why the `codec_categories` argument is used to only consider baseX codecs. This is also demonstrated in the next examples.
-
-```python
->>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", codext.stopfunc.lang_en, codec_categories="base")
-('This is a test', ('base62', 'base64'))
-```
-
-If we know the first encoding, we can set this in the `found` parameter to save time:
-
-```python
->>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", found=["base62"])
-('This is a test', ('base62', 'base64'))
-```
-
-If we are sure that only `base` (which is a valid [category](#list-codecs)) encodings are used, we can restrict the tree search using the `codec_categories` parameter to save time:
-
-```python
->>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", codec_categories="base")
-('This is a test', ('base62', 'base64'))
-```
-
-Another example of 2-stages encoded string:
-
-```python
->>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test")
-('this is a test', ('base64', 'morse'))
->>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test", codec_categories=["base", "language"])
-('this is a test', ('base64', 'morse'))
-```
-
-When multiple results are expected, `stop` and `show` arguments can be used respectively to avoid stopping while finding a result and to display the intermediate result.
-
-!!! warning "Computation time"
-    
-    Note that, in the very last examples, the first call takes much longer than the second one but requires no knowledge about the possible [categories](#list-codecs) of encodings.
-
-!!! note "Stop functions"
-    
-    Currently, a few standard stop functions are provided with the `stopfunc` submodule:
-    
-    - `flag`:       searches for the pattern "`[Ff][Ll1][Aa4@][Gg9]`" (either UTF-8 or UTF-16)
-    - `lang_**`:    checks if the given lang (any from the [`PROFILES_DIRECTORY`](https://github.com/Mimino666/langdetect/tree/master/langdetect/profiles) of the [`langdetect` module](https://github.com/Mimino666/langdetect) if it is installed) is detected (note that it first checks if all characters are printable)
-    - `printables`: checks that every output character is in the set of printables
-
------
-
 ### Hooked `codecs` functions
 
 In order to select the right de/encoding function and avoid any conflict, the native `codecs` library registers search functions (using the `register(search_function)` function), called in order of registration while searching for a codec.
diff --git a/docs/guessing.md b/docs/guessing.md
@@ -0,0 +1,172 @@
+## Guess Mode
+
+For decoding multiple layers of codecs, `codext` features a guess mode relying on an Artificial Intelligence algorithm, the Breadth-First tree Search (BFS). For many cases, the default parameters are sufficient for guess-decoding things. But it may require parameters tuning.
+
+-----
+
+### Parameters
+
+BFS stops when a given condition, in the form of a function applied to the decoded string at the current depth, is met. It returns two results: the decoded string and a tuple with the related encoding names in order of application.
+
+The following parameters are tunable:
+
+- `stop_func`: can be a function or a regular expression to be matched (automatically converted to a function that uses the `re` module) ; by default, checks if all input characters are printable.
+- `min_depth`: the minimum depth for the tree search (allows to avoid a bit of overhead while checking the current decoded output at a depth with the stop function when we are sure it should not be the right result) ; by default 0.
+- `max_depth`: the maximum depth for the tree search ; by default 5.
+- `codec_categories`: a string indicating a codec [category](#list-codecs) or a list of [category](#list-codecs) strings ; by default, `None`, meaning the whole [categories](#list-codecs) (very slow).
+- `found`: a list or tuple of currently found encodings that can be used to save time if the first decoding steps are known ; by default, an empty tuple.
+
+A simple example for a 1-stage base64-encoded string:
+
+```python
+>>> codext.guess("VGhpcyBpcyBhIHRlc3Q=")
+{('base64',): 'This is a test'}
+```
+
+An example of a 2-stages base64- then base62-encoded string:
+
+```python
+>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7")
+{('base62',): 'VGhpcyBpcyBhIHRlc3Q='}
+```
+
+In the second example, we can see that the given encoded string is not decoded as expected. This is the case because the (default) stop condition is too broad and stops if all the characters of the output are printable. If we have a prior knowledge on what we should expect, we can input a simple string or a regex:
+
+!!! note "Default stop function"
+    
+        :::python
+        >>> codext.stopfunc.default.__name__
+        '...'
+    
+    The output depends on whether you have a language detection backend library installed ; see section [*Natural Language Detection*](#natural-language-detection). If no such library is installed, the default function is "`text`".
+
+```python
+>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test")
+{('base62', 'base64'): 'This is a test'}
+```
+
+In this example, the string "*test*" is converted to a function that uses this string as regular expression. Instead of a string, we can also pass a function. For this purpose, standard [stop functions](#available-stop-functions) are predefined. So, we can for instance use `stopfunc.lang_en` to stop when we find something that is English. Note that working this way gives lots of false positives if the text is very short like in the example case. That's why the `codec_categories` argument is used to only consider baseX codecs. This is also demonstrated in the next examples.
+
+```python
+>>> codext.stopfunc._reload_lang("langdetect")
+>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", codext.stopfunc.lang_en, codec_categories="base")
+('This is a test', ('base62', 'base64'))
+```
+
+If we know the first encoding, we can set this in the `found` parameter to save time:
+
+```python
+>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", found=["base62"])
+('This is a test', ('base62', 'base64'))
+```
+
+If we are sure that only `base` (which is a valid [category](#list-codecs)) encodings are used, we can restrict the tree search using the `codec_categories` parameter to save time:
+
+```python
+>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", codec_categories="base")
+('This is a test', ('base62', 'base64'))
+```
+
+Another example of 2-stages encoded string:
+
+```python
+>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test")
+('this is a test', ('base64', 'morse'))
+>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test", codec_categories=["base", "language"])
+('this is a test', ('base64', 'morse'))
+```
+
+When multiple results are expected, `stop` and `show` arguments can be used respectively to avoid stopping while finding a result and to display the intermediate result.
+
+!!! warning "Computation time"
+    
+    Note that, in the very last examples, the first call takes much longer than the second one but requires no knowledge about the possible [categories](#list-codecs) of encodings.
+
+-----
+
+### Available Stop Functions
+
+A few stop functions are predefined in the `stopfunc` submodule.
+
+```python
+>>> import codext
+>>> dir(codext.stopfunc)
+['LANG_BACKEND', 'LANG_BACKENDS', ..., '_reload_lang', 'default', 'flag', ..., 'printables', 'regex', 'text']
+```
+
+Currently, the following stop functions are provided:
+
+- `flag`:           searches for the pattern "`[Ff][Ll1][Aa4@][Gg9]`" (either UTF-8 or UTF-16)
+- `lang_**`:        checks if the given lang is detected (note that it first checks if all characters are text ; see `text` hereafter)
+- `printables`:     checks that every output character is in the set of printables
+- `regex(pattern)`: takes one argument, the regular expression, for checking a string against the given pattern
+- `text`:           checks for printables and an entropy less than 4.6 (empirically determined)
+
+A stop function can be used as the second argument of the `guess` function or as a keyword-argument, as shown in the following examples:
+
+```python
+>>> codext.guess("...", codext.stopfunc.text)
+[...]
+>>> codext.guess("...", [...], stop_func=codext.stopfunc.text)
+[...]
+```
+
+When a string is given, it is automatically converted to a `regex` stop function.
+
+```python
+>>> s = codext.encode("pattern testing", "leetspeak")
+>>> s
+'p4773rn 73571n9'
+>>> stop_func = codext.stopfunc.regex("p[a4@][t7]{2}[e3]rn")
+>>> stop_func(s)
+True
+>>> codext.guess(s, stop_func)
+[...]
+```
+
+Additionally, a simple stop function is predefined for CTF players, matching various declinations of the word *flag*. Alternatively, a pattern can always be used when flags have a particular format.
+
+```python
+>>> codext.stopfunc.flag("test string")
+False
+>>> codext.stopfunc.flag("test f1@9")
+True
+>>> codext.stopfunc.regex(r"^CTF\{.*?\}$")("CTF{098f6bcd4621d373cade4e832627b4f6}")
+True
+```
+
+The particular type of stop function `lang_**` is explained in the [next section](#natural-language-detection).
+
+-----
+
+### Natural Language Detection
+
+As in many cases, we are trying to decode inputs to readable text, it is necessary to narrow the scope while searching for valid decoded outputs. As matching printables and even text (as defined here before as printables with an entropy of less than 4.6) is too broad for many cases, it may be very useful to apply natural language detection. In `codext`, this is done by relying on Natural Language Processing (NLP) backend libraries, loaded only if they were separately installed.
+
+Currently, the following backends are supported, in order of precedence (this order was empirically determined by testing):
+
+- [`langid`](https://github.com/saffsd/langid.py): *Standalone Language Identification (LangID) tool.*
+- [`langdetect`](https://github.com/Mimino666/langdetect): *Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python.*
+- [`pycld2`](https://github.com/aboSamoor/pycld2): *Python bindings for the Compact Langauge Detect 2 (CLD2).*
+- [`cld3`](https://github.com/bsolomon1124/pycld3): *Python bindings to the Compact Language Detector v3 (CLD3).*
+- [`textblob`](https://github.com/sloria/TextBlob): *Python (2 and 3) library for processing textual data.*
+
+The way NLP is used is to check that these libraries exist and to take the first one by default. This sets up the `stopfunc.default` for the guess mode. This behavior aims to keep language detection as optional and to avoid multiple specific requirements having the same purpose.
+
+While loaded, the default backend can be switched to another one by using the `_reload_lang` function:
+    
+```python
+>>> codext.stopfunc._reload_lang("pycld2")  # this loads pycld2 and attaches lang_** functions to the stopfunc submodule
+>>> codext.stopfunc._reload_lang()          # this unloads any loaded backend
+```
+
+Each time a backend is loaded, it gets `lang_**` stop functions attached to the `stopfunc` submodule for each supported language.
+
+-----
+
+### Ranking Heuristic
+
+!!! warning "Work in progress"
+    
+    This part is still in progress and shall be improved with better features and/or using machine learning.
+
diff --git a/docs/index.md b/docs/index.md
@@ -1,6 +1,6 @@
 ## Introduction
 
-Codext, contraction of "*codecs*" and "*extension*", is a tiny library that gathers a few additional encodings for use with [`codecs`](https://docs.python.org/3/library/codecs.html). While imported, it registers new encodings to a proxy codecs registry for making the encodings available from the `codecs.(decode|encode|open)` calls.
+Codext, contraction of "*codecs*" and "*extension*", is a library that gathers many additional encodings for use with [`codecs`](https://docs.python.org/3/library/codecs.html). While imported, it registers new encodings to an extended codecs registry for making the encodings available from the `codecs.(decode|encode|open)` API. It also features [CLI tools](./cli.html) and a [guess mode](./features.html#guess-decode-an-arbitrary-input) for decoding mutliple layers of codecs.
 
 ### Setup
 
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -5,6 +5,7 @@ docs_dir: docs
 nav:
   - Introduction: index.md
   - Features: features.md
+  - 'Guess mode': guessing.md
   - Encodings:
     - Base: enc/base.md
     - Binary: enc/binary.md