codeanalyzer-python (`canpy`)

A Python static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.

canpy is a static analyzer for Python built on Jedi, with optional CodeQL-resolved call edges and Tree-sitter parsing. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json — a symbol table plus a call graph — and can project that same analysis into a Neo4j property graph. It is the Python backend behind CLDK, mirroring its TypeScript (cants) and Java siblings.

Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.

Features

Symbol table — modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
Call graph — Jedi's lexical resolver by default (level 1), with optional PyCG-resolved edges merged in at --analysis-level 2 (provenance-tagged, coupling-aware sharding for large apps).
Dataflow graphs (level 3) — native, whole-program dependence graphs built from Python's own ast: per-callable exceptional CFGs and PDGs (control + data dependence), stitched into a Horwitz–Reps–Binkley SDG with parameter/summary edges, emitted as the program_graphs section at --analysis-level 3 and queryable with a context-sensitive backward slicer.
Neo4j output — project the analysis into a labeled property graph: a self-contained graph.cypher snapshot, or an incremental push to a live database over Bolt.
Versioned schema — a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
Incremental cache — per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
Compact output — canonical analysis.json, or binary analysis.msgpack for smaller artifacts.

Installation

Prerequisites

Python 3.10 or newer.

A C toolchain and the venv / development headers — the analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:

# Ubuntu / Debian
sudo apt install python3-venv python3-dev build-essential

# Fedora / RHEL / CentOS
sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel

# macOS
xcode-select --install

Install via pip (PyPI)

pip install codeanalyzer-python
canpy --help

For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:

pip install 'codeanalyzer-python[neo4j]'

Install via shell script

Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh

Install via Homebrew

brew install codellm-devkit/tap/codeanalyzer-python

The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).

Build from source

This project uses uv for dependency management.

git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help

Usage

canpy --input /path/to/python/project

With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with --format msgpack) in that directory.

Options

$ canpy --help

 Usage: canpy [OPTIONS] COMMAND [ARGS]...

 Static Analysis on Python source code using Jedi, PyCG and Tree sitter.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --version                                                              Show the canpy version    │
│                                                                        and exit.                 │
│ --input                -i                     PATH                     Path to the project root  │
│                                                                        directory (not required   │
│                                                                        for --emit schema).       │
│ --output               -o                     PATH                     Output directory for      │
│                                                                        artifacts.                │
│ --format               -f                     [json|msgpack]           Output format for --emit  │
│                                                                        json: json or msgpack.    │
│                                                                        [default: json]           │
│ --emit                                        [json|neo4j|schema]      Output target: json       │
│                                                                        (analysis.json, default)  │
│                                                                        | neo4j (graph.cypher or  │
│                                                                        live Bolt push) | schema  │
│                                                                        (the Neo4j schema.json    │
│                                                                        contract).                │
│                                                                        [default: json]           │
│ --app-name                                    TEXT                     Logical application name  │
│                                                                        for the graph             │
│                                                                        :PyApplication anchor     │
│                                                                        (default: input dir       │
│                                                                        name).                    │
│ --neo4j-uri                                   TEXT                     Push the graph to a live  │
│                                                                        Neo4j over Bolt           │
│                                                                        (incremental); omit to    │
│                                                                        write graph.cypher.       │
│                                                                        [env var: NEO4J_URI]      │
│ --neo4j-user                                  TEXT                     Neo4j username.           │
│                                                                        [env var: NEO4J_USERNAME] │
│                                                                        [default: neo4j]          │
│ --neo4j-password                              TEXT                     Neo4j password. Prefer    │
│                                                                        the env var over the flag │
│                                                                        (the flag is visible in   │
│                                                                        shell history / process   │
│                                                                        list).                    │
│                                                                        [env var: NEO4J_PASSWORD] │
│                                                                        [default: neo4j]          │
│ --neo4j-database                              TEXT                     Neo4j database name       │
│                                                                        (default: server          │
│                                                                        default).                 │
│                                                                        [env var: NEO4J_DATABASE] │
│ --analysis-level       -a                     INTEGER RANGE [1<=x<=3]  Analysis depth: 1=symbol  │
│                                                                        table+Jedi call graph,    │
│                                                                        2=+PyCG call graph,       │
│                                                                        3=+native dataflow graphs │
│                                                                        (CFG/PDG/SDG).            │
│                                                                        [default: 1]              │
│ --graphs                                      TEXT                     Level 3 only:             │
│                                                                        comma-separated           │
│                                                                        program-graph sections to │
│                                                                        emit (cfg, dfg, pdg,      │
│                                                                        sdg). Default: all. `dfg` │
│                                                                        emits the PDG's data      │
│                                                                        edges only; `sdg` implies │
│                                                                        the dependence edges it   │
│                                                                        stitches.                 │
│                                                                        [default:                 │
│                                                                        cfg,dfg,pdg,sdg]          │
│ --graph-field-depth                           INTEGER RANGE [x>=1]     Level 3 only: k-limit on  │
│                                                                        access-path depth         │
│                                                                        (x.f.g.h with k=3 becomes │
│                                                                        x.f.g.*). Mandatory bound │
│                                                                        — it is what guarantees   │
│                                                                        the interprocedural       │
│                                                                        fixpoint terminates.      │
│                                                                        [default: 3]              │
│ --ray                      --no-ray                                    Enable Ray for            │
│                                                                        distributed analysis.     │
│                                                                        [default: no-ray]         │
│ --eager                    --lazy                                      Enable eager or lazy      │
│                                                                        analysis. Defaults to     │
│                                                                        lazy.                     │
│                                                                        [default: lazy]           │
│ --skip-tests               --include-tests                             Skip test files in        │
│                                                                        analysis.                 │
│                                                                        [default: skip-tests]     │
│ --no-venv                  --venv                                      Skip virtualenv creation  │
│                                                                        and dependency            │
│                                                                        installation; resolve     │
│                                                                        imports against the       │
│                                                                        ambient Python            │
│                                                                        environment instead.      │
│                                                                        [default: venv]           │
│ --file-name                                   PATH                     Analyze only the          │
│                                                                        specified file (relative  │
│                                                                        to input directory).      │
│ --cache-dir            -c                     PATH                     Directory to store        │
│                                                                        analysis cache. Defaults  │
│                                                                        to '.codeanalyzer' in the │
│                                                                        input directory.          │
│ --clear-cache              --keep-cache                                Clear cache after         │
│                                                                        analysis. By default,     │
│                                                                        cache is retained.        │
│                                                                        [default: keep-cache]     │
│                        -v                     INTEGER                  Increase verbosity: -v,   │
│                                                                        -vv, -vvv                 │
│                                                                        [default: 0]              │
│ --pycg-shard               --no-pycg-shard                             Shard PyCG call-graph     │
│                                                                        analysis by Python        │
│                                                                        package (level 2 only).   │
│                                                                        When the project exceeds  │
│                                                                        the 500-file ceiling,     │
│                                                                        PyCG is run independently │
│                                                                        per top-level package     │
│                                                                        with cross-package        │
│                                                                        imports treated as ghost  │
│                                                                        nodes. Without this flag, │
│                                                                        projects over the ceiling │
│                                                                        fall back to Jedi-only    │
│                                                                        edges.                    │
│                                                                        [default: no-pycg-shard]  │
│ --pycg-shard-ceiling                          INTEGER RANGE [x>=1]     Maximum files per shard   │
│                                                                        when --pycg-shard is      │
│                                                                        active (default 100).     │
│                                                                        Shards exceeding this     │
│                                                                        limit are skipped; their  │
│                                                                        call edges are omitted    │
│                                                                        from the call graph (Jedi │
│                                                                        edges for those packages  │
│                                                                        are still included).      │
│                                                                        Lower values are safer    │
│                                                                        for packages with deep    │
│                                                                        class hierarchies or      │
│                                                                        heavy import graphs.      │
│                                                                        [default: 100]            │
│ --pycg-shard-timeout                          INTEGER RANGE [x>=0]     Per-shard wall-clock      │
│                                                                        timeout in seconds when   │
│                                                                        --pycg-shard is active    │
│                                                                        (default 120). A shard    │
│                                                                        that exceeds this limit   │
│                                                                        is skipped gracefully.    │
│                                                                        PyCG's fixpoint is        │
│                                                                        bimodal: it either        │
│                                                                        converges quickly or      │
│                                                                        diverges indefinitely, so │
│                                                                        the timeout acts as a     │
│                                                                        final safety net after    │
│                                                                        the file-count ceiling.   │
│                                                                        Set to 0 to disable.      │
│                                                                        POSIX only (macOS /       │
│                                                                        Linux); ignored on        │
│                                                                        Windows.                  │
│                                                                        [default: 120]            │
│ --pycg-shard-strategy                         [jedi|package]           How --pycg-shard groups   │
│                                                                        files (level 2 only).     │
│                                                                        'jedi' (default)          │
│                                                                        partitions the Jedi       │
│                                                                        module-dependency graph   │
│                                                                        (SCC + Louvain) so        │
│                                                                        tightly-coupled modules   │
│                                                                        co-compute and few call   │
│                                                                        edges are severed between │
│                                                                        shards; import cycles are │
│                                                                        never split. 'package'    │
│                                                                        uses the legacy           │
│                                                                        one-shard-per-package-di… │
│                                                                        grouping.                 │
│                                                                        [default: jedi]           │
│ --pycg-max-iter                               INTEGER RANGE [x>=-1]    Cap on PyCG's fixpoint    │
│                                                                        passes per shard/project  │
│                                                                        (level 2; default 50).    │
│                                                                        PyCG iterates until its   │
│                                                                        points-to state stops     │
│                                                                        changing, but its         │
│                                                                        access-path domain has no │
│                                                                        convergence bound, so     │
│                                                                        heavy metaclass/mixin     │
│                                                                        code (e.g. an ORM) can    │
│                                                                        loop with each pass       │
│                                                                        costing seconds. The cap  │
│                                                                        returns a                 │
│                                                                        sound-but-incomplete call │
│                                                                        graph instead of looping  │
│                                                                        until the timeout kills   │
│                                                                        it. Set to -1 for PyCG's  │
│                                                                        unbounded                 │
│                                                                        run-to-convergence        │
│                                                                        behaviour.                │
│                                                                        [default: 50]             │
│ --help                                                                 Show this message and     │
│                                                                        exit.                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Examples

Basic analysis to stdout, or to a file:

canpy --input ./my-python-project                        # compact JSON on stdout
canpy --input ./my-python-project --output ./out         # → ./out/analysis.json

Binary output (msgpack):

canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack

Resolve extra call edges with CodeQL:
```
canpy --input ./my-python-project --codeql
```
By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills resolved callees Jedi could not resolve. CodeQL integration is experimental; the CLI is downloaded into <cache_dir>/codeql/ on first use.

Emit a Neo4j snapshot, or push to a live database:

canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
canpy --input ./my-python-project --emit neo4j \
  --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret

Emit the Neo4j schema contract:

canpy --emit schema                   # print schema.json to stdout (no project needed)
canpy --emit schema --output ./out    # → ./out/schema.json

Force a clean rebuild with a custom cache directory:

canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache

Native dataflow graphs (level 3) — CFG/PDG/SDG + slicing:
```
canpy --input ./my-python-project -a 3 --output ./out          # + program_graphs section
canpy --input ./my-python-project -a 3 --graphs cfg,pdg        # scope the emitted sections
canpy --input ./my-python-project -a 3 --graph-field-depth 2   # tighter access-path k-limit
```
Level 3 also enriches the Neo4j projection (--emit neo4j) with the CPG overlay (:PyCFGNode nodes and PY_CFG_NEXT/PY_CDG/PY_DDG/PY_PARAM_IN/PY_PARAM_OUT/ PY_SUMMARY edges — the cross-language dataflow vocabulary, PY_-namespaced like every other row family so multi-language databases never mingle analyzers' edges).

Analysis levels

Level	Flag	What it adds	Cost
1	`-a 1` (default)	Symbol table + Jedi resolver call graph	Cheap
2	`-a 2`	PyCG call-graph enrichment (provenance-merged)	Moderate
3	`-a 3`	Native CFG/PDG/SDG (`program_graphs`) + CPG Neo4j overlay + backward slicing	Heavy, whole-program

Levels are cumulative — -a 3 includes level 2's call graph (the SDG is stitched over it). Nothing at level 3 runs unless requested: -a 1/-a 2 timings and output are unaffected.

Architecture & Tooling

Locked level-3 substrate decisions (#67):

CFG source: hand-built from the stdlib ast module — the same parse the symbol-table builder uses, so graph nodes join back to symbol-table signatures by construction. One synthetic ENTRY/EXIT per callable, statement-level nodes keyed (signature, node_id) in source-span order, exceptional edges first-class.
Def-use source: hand-built reaching definitions (classic forward worklist) over k-limited access paths (--graph-field-depth, default 3) — no usable SSA library exists for Python.
Points-to oracle: a type-based may-alias MVP stub — two access paths may alias iff their suffixes are prefix-compatible and their bases' Jedi-inferred types are compatible (unknown types conservatively alias). Frozen behind may_alias(); upgrading to a real points-to substrate is staged follow-up work. Call dispatch comes from the merged Jedi(+PyCG) call graph, treated as a frozen oracle.
Summaries: relational formal-in → formal-out flows composed bottom-up over the Tarjan SCC condensation of the call graph, monotone fixpoint within SCCs; globals ride as extra formals (<global>:module::name), closure captures bind at definition sites.
Clients: backward slicing ships in-process (two-phase context-sensitive HRB traversal, codeanalyzer.dataflow.slicing). Taint is deliberately left to the CLDK SDK: once the SDG is emitted it is language-independent labeled reachability.
Precision posture: sound-leaning and over-approximate — prefer false positives to missed flows. Known unsoundness (documented, not silently absorbed): eval/exec, reflection (getattr/setattr with dynamic names), monkey-patching, C extensions, import side effects, and module top-level statements (globals are modeled as formals instead).

Output targets

canpy builds one analysis in memory and can emit it three ways (--emit):

`analysis.json` (default)

A PyApplication document — the canonical CLDK contract:

{
  "symbol_table":   { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph":     [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ],
  "program_graphs": { /* -a 3 only: schema_version, k_limit, per-callable { cfg, pdg, param_nodes }, sdg_edges */ }
}

By default this is printed to stdout in JSON; with --output it is written to analysis.json (or analysis.msgpack with --format msgpack, a more compact binary format).

Neo4j graph

--emit neo4j projects the same analysis into a labeled property graph. Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance, decorators, and call sites are relationships:

Without --neo4j-uri — writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
With --neo4j-uri — pushes to a live Neo4j over Bolt incrementally: only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra. Every graph carries a schema_version on its :PyApplication node.

Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.

The connection options also read from the standard Neo4j environment variables — NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE — when the corresponding flag is omitted (an explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the process list:

export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j     # credentials picked up from the environment

Schema contract

--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels, relationships, properties, constraints, and indexes). It needs no project and is checked into the repo as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the codeanalyzer-typescript backend.

A UML of the analysis.json schema (the PyApplication containment tree) is checked in as schema-uml.drawio, and the property-graph schema as neo4j-schema.drawio.

Development

This project uses uv.

uv sync --all-groups
uv run canpy --input /path/to/project           # run from source
uv run canpy --emit schema > schema.neo4j.json  # regenerate the checked-in schema contract
uv run python scripts/update_readme.py          # regenerate the canpy --help block above
uv run pytest                                   # run the test suite

The Neo4j schema-conformance test always runs. The Neo4j bolt integration test spins up a real Neo4j via Testcontainers and is opt-in — it needs a container runtime (Docker or Podman) and is enabled with an environment variable:

RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s

License

Apache 2.0 — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 133 Commits
.claude		.claude
.github		.github
codeanalyzer		codeanalyzer
docs/assets		docs/assets
packaging		packaging
scripts		scripts
test		test
.envrc		.envrc
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
neo4j-schema.drawio		neo4j-schema.drawio
pyproject.toml		pyproject.toml
schema-uml.drawio		schema-uml.drawio
schema.neo4j.json		schema.neo4j.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

codeanalyzer-python (`canpy`)

Table of Contents

Features

Installation

Prerequisites

Install via pip (PyPI)

Install via shell script

Install via Homebrew

Build from source

Usage

Options

Examples

Analysis levels

Architecture & Tooling

Output targets

`analysis.json` (default)

Neo4j graph

Schema contract

Development

License

About

Uh oh!

Releases 18

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

codeanalyzer-python (canpy)

Table of Contents

Features

Installation

Prerequisites

Install via pip (PyPI)

Install via shell script

Install via Homebrew

Build from source

Usage

Options

Examples

Analysis levels

Architecture & Tooling

Output targets

analysis.json (default)

Neo4j graph

Schema contract

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 18

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

codeanalyzer-python (`canpy`)

`analysis.json` (default)

Packages