
What is codeanalyzer-python?

codeanalyzer-python is a static-analysis tool for Python source code. You point it at a project directory and it produces one typed artifact — a PyApplication — that captures the project’s symbol table (modules, classes, callables, fields), its call graph (who-calls-whom), and its framework entrypoints (the routes, tasks, and commands a framework dispatches into). You stop grepping source by hand and start querying a structured model of the program.

It builds one analysis in memory and can emit it three ways: as the default analysis.json, as a Neo4j property graph (a queryable, persistent system of record), or as the version-stamped schema contract that describes that graph. The graph is the same PyApplication projected onto labeled nodes and typed relationships — so a whole portfolio of applications can live in one database and be traversed with Cypher instead of parsed out of giant JSON blobs.

[Figure: graph schema diagram. Node labels (with properties): :PyApplication (name, schema_version), :PyModule (module_name, content_hash), :PyClass :PySymbol (name, base_classes), :PyAttribute (name, type), :PyCallable :PySymbol (signature, cyclomatic_complexity), :PyDecorator (name), :PyCallSite (method_name, receiver_type), :PyExternal (name, module). Relationship types: PY_HAS_MODULE, PY_DECLARES, PY_HAS_ATTRIBUTE, PY_HAS_METHOD, PY_DECORATED_BY, PY_HAS_CALLSITE, PY_RESOLVES_TO, PY_CALLS.]
The analysis is a Neo4j property graph: every node carries a label (its color) and properties; every relationship carries a type. The dashed ring marks an entrypoint; the PY_CALLS edge is the resolved call graph.

It is the Python backend behind CLDK, the multilingual analysis SDK — the same role canjava plays for Java. You can use it through CLDK’s typed facade, or directly: as a CLI that writes analysis.json or a property graph, or as a Python library that hands you PyApplication objects.

Every run follows the same shape: point at a project, build the artifact, choose an output target.

  1. Point at a project. Run canpy --input ./my-project. The tool discovers every .py file (test files are excluded by default) and creates an isolated virtual environment so that dependencies resolve.

  2. Build a PyApplication. Jedi and Tree-sitter extract the symbol table; a call graph is derived from it; optional CodeQL resolution and a pluggable pass pipeline enrich it with extra edges and entrypoints.

  3. Emit it. One analysis, three targets via --emit: json (the default analysis.json / msgpack), neo4j (a graph.cypher snapshot or a live incremental Bolt push), or schema (the machine-readable, version-stamped graph schema). The same typed model underlies all three.
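The three steps above map onto a single command line. As a minimal sketch, the helper below assembles that command from Python using only the flags named on this page (--input, --emit, --output, --app-name, --neo4j-uri, --neo4j-user). The function name build_canpy_command and its parameters are hypothetical, and it only builds the argument list; it does not check which flag combinations canpy itself accepts.

```python
from typing import List, Optional

EMIT_TARGETS = {"json", "neo4j", "schema"}  # the three --emit targets listed above


def build_canpy_command(
    input_dir: str,
    emit: str = "json",
    output_dir: Optional[str] = None,
    app_name: Optional[str] = None,
    neo4j_uri: Optional[str] = None,
    neo4j_user: Optional[str] = None,
) -> List[str]:
    """Assemble a canpy argv list (hypothetical helper, not part of canpy)."""
    if emit not in EMIT_TARGETS:
        raise ValueError(f"unknown emit target: {emit!r}")

    cmd = ["canpy", "--input", input_dir, "--emit", emit]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    if app_name is not None:
        cmd += ["--app-name", app_name]
    if neo4j_uri is not None:
        # With a URI the neo4j target pushes over Bolt; without one it
        # writes graph.cypher instead.
        cmd += ["--neo4j-uri", neo4j_uri]
        if neo4j_user is not None:
            cmd += ["--neo4j-user", neo4j_user]
    return cmd


snapshot_cmd = build_canpy_command(
    "./my-service", emit="neo4j", output_dir="./out", app_name="my-service"
)
print(" ".join(snapshot_cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would execute it, assuming canpy is installed and on the PATH.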

flowchart LR
    A["canpy --input"] --> B[Symbol table<br/>Jedi + Tree-sitter]
    B --> C[Call graph<br/>Jedi edges]
    B -.->|--codeql| D[CodeQL edges]
    C --> E[Analysis passes<br/>entrypoints + synthetic edges]
    D -.-> E
    E --> F["PyApplication<br/>(in memory)"]
    F -->|"--emit json"| G["analysis.json / msgpack"]
    F -->|"--emit neo4j"| H["Labeled property graph"]
    F -->|"--emit schema"| K["schema.json<br/>(version contract)"]
    H -->|"no --neo4j-uri"| I["graph.cypher<br/>(self-contained snapshot)"]
    H -->|"--neo4j-uri"| J["Live Bolt push<br/>(incremental)"]

The artifact is a single PyApplication with four top-level pieces:

| Field | Type | What it holds |
| --- | --- | --- |
| symbol_table | Dict[str, PyModule] | One PyModule per source file: its imports, classes, functions, and module-level variables. |
| call_graph | List[PyCallEdge] | Identity-keyed source -> target edges (by PyCallable.signature) with a weight and provenance. |
| entrypoints | Dict[str, List[PyEntrypoint]] | Framework-dispatched roots, keyed by framework name. |
| external_symbols | Dict[str, ...] | First-class library/built-in targets (signature -> {name, module}) the call graph reaches but doesn’t own. |
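To make the shape above concrete, here is a minimal sketch that turns the call_graph list into a reverse index answering "who calls this signature?". It assumes each serialized edge is a JSON object with source and target keys, as the table's description suggests; the exact key names in analysis.json are an assumption, and the sample signatures are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_callers_index(analysis: dict) -> Dict[str, List[str]]:
    """Map each target signature to the signatures that call it.

    Assumes edges look like {"source": ..., "target": ...}; adjust the
    keys if the real analysis.json names them differently.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for edge in analysis.get("call_graph", []):
        index[edge["target"]].append(edge["source"])
    return dict(index)


# Hand-written stand-in for a loaded analysis.json (hypothetical content).
sample_analysis = {
    "symbol_table": {},
    "call_graph": [
        {"source": "app.main", "target": "app.load", "weight": 1},
        {"source": "app.cli", "target": "app.load", "weight": 2},
        {"source": "app.load", "target": "json.loads", "weight": 1},
    ],
    "entrypoints": {},
    "external_symbols": {"json.loads": {"name": "loads", "module": "json"}},
}

callers = build_callers_index(sample_analysis)
print(callers["app.load"])
```

With a real file, the dictionary would come from json.load on analysis.json instead of the inline sample.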

analysis.json is one file per project: to ask anything, a consumer loads the whole blob into memory, and it doesn’t compose across a portfolio. --emit neo4j projects the very same in-memory PyApplication into a labeled property graph instead — a queryable, persistent store that many applications can share.

Every node label is Py-prefixed and every relationship type is PY_-prefixed (:PyClass, PY_CALLS, and so on) so the Java, TypeScript, and Python analyzers can write into one database without label or relationship-type collisions. Declarations (:PyClass, :PyCallable, :PyExternal) are keyed by their signature under a shared :PySymbol label. Each graph hangs off a single :PyApplication anchor named by --app-name, and carries a schema_version (currently 1.1.0) on that node.

# Project one application into a live Neo4j graph
canpy --input ./my-service --emit neo4j --app-name my-service \
--neo4j-uri bolt://localhost:7687 --neo4j-user neo4j

--emit neo4j picks its writer based solely on whether --neo4j-uri is set. With --neo4j-uri, as in the command above, it pushes incrementally to a live database over Bolt.

Without --neo4j-uri, canpy writes a self-contained graph.cypher: the constraints and indexes, a scoped wipe of just this app’s prior subtree, then batched UNWIND … MERGE statements for every node and edge. It needs no extra dependencies and expresses the full truth of the analysis. Load it with cypher-shell:

canpy --input ./my-service --emit neo4j --app-name my-service --output ./out
cypher-shell < ./out/graph.cypher

--app-name is the multi-tenant key. It names the single :PyApplication root node (uniqueness-constrained) and scopes every mutation to that anchor: the snapshot wipe only touches MATCH (a:PyApplication {name: <app>}) and its module subtree, and the Bolt full-run prune is scoped to (:PyApplication {name})-[:PY_HAS_MODULE]->(:PyModule). Pushing app B can never delete app A’s modules from a shared database. When omitted it defaults to the basename of --input.

So one Neo4j database can hold a whole portfolio — each application anchored at its own :PyApplication node, sharing :PyExternal / :PyPackage / :PyDecorator nodes — and cross-service questions become a Cypher traversal instead of a stack of JSON files. For example, every callable across every loaded app that calls a given external symbol:

MATCH (caller:PyCallable)-[:PY_CALLS]->(ext:PyExternal {name: "subprocess.run"})
MATCH (app:PyApplication)-[:PY_HAS_MODULE]->(:PyModule)-[:PY_DECLARES*]->(caller)
RETURN app.name AS application, caller.signature AS caller
ORDER BY application
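The same traversal can be run from a script. The sketch below parameterizes the query above and groups the rows by application; it is written against any session-like object whose run(query, parameters) yields records indexable by column name, which is how a session from the official neo4j Python driver behaves. The helper names are hypothetical, and query_live_graph assumes the neo4j package is installed and a populated database is reachable.

```python
from collections import defaultdict
from typing import Dict, List

# The query from the text, with the external symbol name as a parameter.
CALLERS_OF_EXTERNAL = """
MATCH (caller:PyCallable)-[:PY_CALLS]->(ext:PyExternal {name: $name})
MATCH (app:PyApplication)-[:PY_HAS_MODULE]->(:PyModule)-[:PY_DECLARES*]->(caller)
RETURN app.name AS application, caller.signature AS caller
ORDER BY application
"""


def callers_by_application(session, external_name: str) -> Dict[str, List[str]]:
    """Run the query on a session-like object and group callers per app."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in session.run(CALLERS_OF_EXTERNAL, {"name": external_name}):
        grouped[record["application"]].append(record["caller"])
    return dict(grouped)


def query_live_graph(uri: str, user: str, password: str, external_name: str):
    """Connect with the official driver (pip install neo4j) and run the query."""
    from neo4j import GraphDatabase  # imported lazily so the helper above has no dependency

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return callers_by_application(session, external_name)
```

Keeping the grouping logic separate from the connection code lets it be exercised against a stub session without a running database.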

The graph splits analysis from consumption. The analyzer is the producer: run it out-of-band as a CI step or a Kubernetes Job / CronJob that pushes incrementally to a managed or clustered Neo4j (Aura, Enterprise) over Bolt. Because pushes are content-hash incremental, re-running on each commit rewrites only the modules that changed.
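The idea behind a content-hash incremental push can be sketched in a few lines: hash each module's source, compare against the hash already stored on the graph, and touch only the modules whose hashes differ. This is an illustration of the general technique, not canpy's implementation; the use of SHA-256 over raw file bytes and the function names are assumptions.

```python
import hashlib
from typing import Dict, Set, Tuple


def content_hash(source: bytes) -> str:
    """Hash a module's source (SHA-256 is an assumed choice)."""
    return hashlib.sha256(source).hexdigest()


def plan_incremental_push(
    current_sources: Dict[str, bytes],
    stored_hashes: Dict[str, str],
) -> Tuple[Set[str], Set[str]]:
    """Return (modules to rewrite, modules to prune).

    current_sources: module name -> source bytes in the working tree.
    stored_hashes:   module name -> content hash recorded by the last push.
    """
    to_write = {
        name
        for name, source in current_sources.items()
        if stored_hashes.get(name) != content_hash(source)
    }
    # Modules recorded earlier that no longer exist in the project.
    to_prune = set(stored_hashes) - set(current_sources)
    return to_write, to_prune


stored = {
    "pkg.a": content_hash(b"x = 1\n"),
    "pkg.b": content_hash(b"y = 2\n"),
    "pkg.gone": content_hash(b""),
}
current = {"pkg.a": b"x = 1\n", "pkg.b": b"y = 3\n", "pkg.new": b"z = 0\n"}
to_write, to_prune = plan_incremental_push(current, stored)
print(sorted(to_write), sorted(to_prune))
```

Unchanged modules (pkg.a here) fall out of both sets, which is why a re-run on each commit stays cheap.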

Everything that reads the graph — agents, dashboards, the CLDK Python SDK — is a lightweight, read-only client that scales independently of the heavier analysis pods, needs only the Bolt URI and read-only credentials, and shares the versioned schema_version contract stamped on each :PyApplication. Analysis is produced once, centrally; reads fan out cheaply from there.
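A read-only client could use the stamped schema_version as a guard before querying. The sketch below assumes a semantic-versioning-style policy (same major version, and a graph minor version at least as new as the one the client was written against). That policy is an assumption for illustration; the text above does not state the compatibility rules, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string such as '1.1.0' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(graph_version: str, client_version: str) -> bool:
    """Assumed policy: majors must match and the graph must not be older
    (by minor version) than what the client expects."""
    graph = parse_version(graph_version)
    client = parse_version(client_version)
    return graph[0] == client[0] and graph[1] >= client[1]


print(is_compatible("1.1.0", "1.0.0"))
```

A client would read schema_version from its :PyApplication node and refuse to proceed, or degrade gracefully, when the check fails.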

Here is the payoff. CLDK has a read-only Neo4j backend. Point the Python facade at the Bolt URI with a Neo4jConnectionConfig, and it reconstructs the same typed PyApplication — the same PyModule symbol table, the same PyCallEdge call graph, the same networkx DiGraph — as the in-process analyzer, with no JDK, no native binary, and no project source on the consumer. It only needs the graph and read-only credentials.

The application_name you pass here is the same string as the producer’s --app-name; it scopes every query to that one app’s subgraph.

# Python project: read-only Neo4j backend (graph populated out of band)
# pip install cldk[neo4j]
from cldk import CLDK
from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

analysis = CLDK.python(
    backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        username="neo4j",
        password="neo4j",
        application_name="my-service",  # matches canpy --app-name
    ),
)

classes = analysis.get_classes()  # Dict[str, PyClass]
cg = analysis.get_call_graph()  # networkx.DiGraph keyed by callable signatures

# Print each class signature with the names of its methods
for sig, cls in classes.items():
    print(sig, list(cls.methods))

The CLI can also write the analysis as JSON directly:

# Write analysis.json to ./out
canpy --input ./my-project --output ./out
# Or stream JSON to stdout (no --output)
canpy --input ./my-project | jq '.entrypoints'

Without analysis, a code LLM asked “what calls this function?” has to crawl: file read after file read, grep after grep, burning tokens on an answer it still can’t be sure of. codeanalyzer-python resolves that once, statically, into a graph, so the answer is a lookup, not a guess. Jedi gives you that for free on every run; CodeQL deepens it when dynamic dispatch and third-party calls matter; the pass pipeline surfaces the framework roots that make reachability questions meaningful.
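As an illustration of why framework roots matter, the sketch below runs a breadth-first traversal over call edges starting from a set of entrypoint signatures and reports everything reachable. It takes plain (source, target) pairs and root signatures as input, so it makes no assumption about how PyEntrypoint is serialized; the function name and sample signatures are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable_from(
    roots: Iterable[str],
    edges: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return every signature reachable from the roots, roots included."""
    adjacency: Dict[str, List[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    seen: Set[str] = set(roots)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for callee in adjacency.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


edges = [
    ("api.get_user", "db.fetch"),
    ("db.fetch", "db.connect"),
    ("legacy.report", "db.fetch"),
    ("db.connect", "db.fetch"),  # a cycle, to show the traversal terminates
]
live = reachable_from(["api.get_user"], edges)
print(sorted(live))
```

Anything declared in the project but absent from the result (legacy.report in this sample) is a candidate for unreachable code, subject to whatever dynamic calls the static graph misses.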

With --emit neo4j, that resolved graph stops being a per-run artifact and becomes shared infrastructure: produced once by a CI/Kubernetes job, queried cheaply and concurrently by every agent and tool that needs it — across a whole portfolio, in one Cypher traversal.