⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing moves to the SDK too
This issue already (correctly) put taint out of scope — see "OUT OF SCOPE" below. The family-wide boundary now standard (cldk-forge PR #7, dataflow-graphs.md § provider/client boundary; reference codellm-devkit/codeanalyzer-java#171) extends that to backward slicing as well: slicing is a reachability query over the SDG, same as taint, so it lives in the SDK, not the analyzer.
Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 3 (backward slicing as an SDG query), PART 3 item 12 (backward slicing + its expected-set gate), the "client gates" in the DoD, and the "slicing" step in DELIVERY. (The "OUT OF SCOPE → SDK-side taint" note already anticipated this — #229 is that SDK home, now covering slicing too.)
Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9), and the CPG projection (GOAL 2). The intraprocedural PDG construction and its correctness are still built and gated here; only the slice query over the finished SDG moves to the SDK. Retitled: dropped "and slicing".
PROBLEM
codeanalyzer-python today emits the level-1 symbol table and Jedi resolver call
graph, plus level-2 PyCG call-graph enrichment. It has no dataflow: no CFG, no
dependence graphs, no way to answer "what does this value affect". This issue
adds level 3 — native, whole-program dependence graphs built from Python's own
ast module, per the CLDK dataflow-graphs contract — and exposes backward
slicing as a query over them.
Native is the constraint: everything runs in-process in the analyzer's own
ecosystem. No external analysis engines, no subprocess to a foreign toolchain.
GOALS (the contract, in one list)
- Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json
(program_graphs, schema_version'd, keyed by canonical (signature, node_id)),
gated by -a 3 / --graphs.
- Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as
new node labels / edge types; additive schema.neo4j.json bump.
- Expose backward slicing (two-phase context-sensitive HRB traversal) as an
SDG query.
- Hold the cross-language parity clause: shared node kinds / edge types /
JSON shapes; Python-specific additions are additive and recorded in
.claude/SCHEMA_DECISIONS.md.
- Keep
-a 1/-a 2 timings unaffected; summaries are content-hash-cacheable
with recorded dependency structure (incrementality is later, but its hooks
are now).
- Deterministic output: implement sequentially first and pass every gate;
per-callable parallel fan-out (the repo already ships Ray for exactly this
shape) is staged as follow-up work, differential-tested against the
sequential output.
OUT OF SCOPE (deliberately)
- Taint analysis: once the SDG is emitted, taint is labeled reachability over
DDG ∪ PARAM_IN/OUT ∪ SUMMARY with sanitizer blocking — language-independent
graph traversal that the CLDK SDK will own once, shared across analyzers.
Only the source/sink/sanitizer model packs are language-specific, and those
ship as data on the SDK side too.
- Incremental re-analysis: aspirational; dependency structure is recorded from
the start so it can be switched on later without a rewrite.
SUBSTRATE DECISIONS (locked)
- CFG source: hand-built from the stdlib
ast module — the same parse
the symbol-table builder already uses; no second parse
tree, no second identity mapping.
- Def-use source: hand-built reaching definitions (classic forward
worklist); no usable SSA library exists for Python that
matches our node-identity needs.
- Points-to oracle: type-based may-alias MVP stub — two access paths may
alias iff their types are compatible, using Jedi's
already-inferred types where available; unknown types
conservatively may-alias. Sound-leaning but imprecise;
upgrading to a real points-to substrate is staged as
follow-up work. Call dispatch comes from the existing
merged Jedi(+PyCG) call graph, treated as a frozen
oracle.
- Identity mapping: graph nodes are keyed by (signature, node_id) — the same
PyCallable.signature keys as symbol_table / call_graph;
node_id is the source-span-order statement index within
the callable (synthetic ENTRY = 0, EXIT = last CFG node).
- Precision posture: sound-leaning, over-approximate; flow-sensitive on
locals, heap precision capped by the type-based oracle;
k-limited access paths (--graph-field-depth, default 3).
PART 1 — INTRAPROCEDURAL GRAPHS
- Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT,
multi-exit normalized; explicit lowering rules for the Python checklist:
try/except/else/finally, with (implicit try/finally), comprehensions
(own scope, modeled atomically), generators/yield, async/await,
decorators (call-site fact, not CFG), raise, assert, short-circuit
boolean operators, break/continue.
- Dominators + post-dominators (Cooper–Harper–Kennedy iterative); synthetic
edge for infinite loops; control dependence via the post-dominance
frontier walk (Ferrante–Ottenstein–Warren).
- Access-path variable model (k-limited, bases tagged local/param/self/
global/capture); reaching definitions → DDG edges.
- PDG assembly; the intraprocedural backward-slice gate on the fixture
(hand-computed expected node set, exact match).
PART 2 — INTERPROCEDURAL
- Frozen oracles: the merged Jedi(+PyCG) call graph for dispatch; the
type-based may-alias stub for heap writes. Tarjan SCC condensation of the
call graph as the bottom-up processing order.
- Global/module state (module bindings read/written across functions)
recorded as extra summary inputs/outputs.
- Function summaries: per-callable formal-in → formal-out reachability over
PDG + current callee summaries, composed bottom-up over the condensation
DAG; monotone fixpoint within SCCs; k-limiting mandatory for termination.
(Whole-function reachability instead of hammock-region decomposition —
regions are an incrementality optimization, staged with incrementality.)
- External/library callees default to conservative pass-through: every
argument flows to the return value and every mutable argument.
- SDG assembly: actual/formal in/out nodes, CALL / PARAM_IN / PARAM_OUT
edges, SUMMARY edges from the composed summaries; globals ride as extra
formals; closure captures modeled as capture formals bound at the
definition site.
PART 3 — EMISSION AND CLIENTS
program_graphs section in analysis.json per the contract; --graphs cfg,dfg,pdg,sdg selector and --graph-field-depth with strict flag
validation; -a 3 extends the existing cumulative level semantics
(level 3 ⊇ level 2's call graph, which the SDG needs).
- CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/
SUMMARY/HAS_CFG_NODE in the neo4j/ subpackage; additive schema.neo4j.json
bump; conformance test extended.
- Backward slicing (two-phase context-sensitive traversal) as an SDG query
with an exact expected-set gate on the fixture.
CAVEATS AND KNOWN RISKS
- Python is the weakest-oracle language for points-to: no in-process
Andersen library exists. The type-based stub is intentionally
over-approximate; do not trade soundness for a lower false-positive rate —
ranking/pruning is downstream's job.
- Inherited unsoundness in Python: eval/exec, reflection (getattr/setattr
with dynamic names), monkey-patching, C extensions. Documented in the
README, not silently absorbed.
- Termination: the interprocedural fixpoint requires k-limiting (mandatory
knob); summary domains are finite formal-in/formal-out sets.
- Cost: whole-program summary construction is orders of magnitude slower
than level 1; everything is gated behind -a 3 and -a 1/-a 2 timings are
CI-checked to be unaffected.
- Determinism: node ids are assigned in source-span order during sequential
construction; emission is collect-then-sort. Parallel fan-out lands later,
differential-tested against sequential output.
DELIVERY
Single feature branch with stage-per-commit history (fixture → CFG → dominance
→ def-use → PDG → oracles → summaries/SDG → slicing → schema/CLI emission →
CPG projection → docs), one PR referencing this issue. Follow-ups staged as
separate issues/PRs: real points-to substrate, per-callable parallel fan-out,
incremental re-analysis, SDK-side taint.
VERIFICATION / DEFINITION OF DONE
- Every gate in the dataflow-construction spec passes on the fixture (CFG,
dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
sets, not "non-empty".
- Fixture covers the full Python lowering checklist plus the shared fixture
minimums (aliasing, SCC recursion, multi-file flow, closure capture,
module-global flow).
- analysis.json with -a 3 validates against the program_graphs models;
parity clause holds (no renamed/repurposed shared vocabulary).
- Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count
matches JSON; no dangling edges (deferred-edge gate).
- -a 1 / -a 2 wall-clock unchanged within noise on the benchmark fixture,
and their output is byte-identical to pre-change output.
PROBLEM
codeanalyzer-python today emits the level-1 symbol table and Jedi resolver call
graph, plus level-2 PyCG call-graph enrichment. It has no dataflow: no CFG, no
dependence graphs, no way to answer "what does this value affect". This issue
adds level 3 — native, whole-program dependence graphs built from Python's own
astmodule, per the CLDK dataflow-graphs contract — and exposes backwardslicing as a query over them.
Native is the constraint: everything runs in-process in the analyzer's own
ecosystem. No external analysis engines, no subprocess to a foreign toolchain.
GOALS (the contract, in one list)
(
program_graphs, schema_version'd, keyed by canonical (signature, node_id)),gated by
-a 3/--graphs.new node labels / edge types; additive schema.neo4j.json bump.
SDG query.
JSON shapes; Python-specific additions are additive and recorded in
.claude/SCHEMA_DECISIONS.md.
-a 1/-a 2timings unaffected; summaries are content-hash-cacheablewith recorded dependency structure (incrementality is later, but its hooks
are now).
per-callable parallel fan-out (the repo already ships Ray for exactly this
shape) is staged as follow-up work, differential-tested against the
sequential output.
OUT OF SCOPE (deliberately)
DDG ∪ PARAM_IN/OUT ∪ SUMMARY with sanitizer blocking — language-independent
graph traversal that the CLDK SDK will own once, shared across analyzers.
Only the source/sink/sanitizer model packs are language-specific, and those
ship as data on the SDK side too.
the start so it can be switched on later without a rewrite.
SUBSTRATE DECISIONS (locked)
astmodule — the same parsethe symbol-table builder already uses; no second parse
tree, no second identity mapping.
worklist); no usable SSA library exists for Python that
matches our node-identity needs.
alias iff their types are compatible, using Jedi's
already-inferred types where available; unknown types
conservatively may-alias. Sound-leaning but imprecise;
upgrading to a real points-to substrate is staged as
follow-up work. Call dispatch comes from the existing
merged Jedi(+PyCG) call graph, treated as a frozen
oracle.
PyCallable.signature keys as symbol_table / call_graph;
node_id is the source-span-order statement index within
the callable (synthetic ENTRY = 0, EXIT = last CFG node).
locals, heap precision capped by the type-based oracle;
k-limited access paths (--graph-field-depth, default 3).
PART 1 — INTRAPROCEDURAL GRAPHS
multi-exit normalized; explicit lowering rules for the Python checklist:
try/except/else/finally, with (implicit try/finally), comprehensions
(own scope, modeled atomically), generators/yield, async/await,
decorators (call-site fact, not CFG), raise, assert, short-circuit
boolean operators, break/continue.
edge for infinite loops; control dependence via the post-dominance
frontier walk (Ferrante–Ottenstein–Warren).
global/capture); reaching definitions → DDG edges.
(hand-computed expected node set, exact match).
PART 2 — INTERPROCEDURAL
type-based may-alias stub for heap writes. Tarjan SCC condensation of the
call graph as the bottom-up processing order.
recorded as extra summary inputs/outputs.
PDG + current callee summaries, composed bottom-up over the condensation
DAG; monotone fixpoint within SCCs; k-limiting mandatory for termination.
(Whole-function reachability instead of hammock-region decomposition —
regions are an incrementality optimization, staged with incrementality.)
argument flows to the return value and every mutable argument.
edges, SUMMARY edges from the composed summaries; globals ride as extra
formals; closure captures modeled as capture formals bound at the
definition site.
PART 3 — EMISSION AND CLIENTS
program_graphssection in analysis.json per the contract;--graphs cfg,dfg,pdg,sdgselector and--graph-field-depthwith strict flagvalidation;
-a 3extends the existing cumulative level semantics(level 3 ⊇ level 2's call graph, which the SDG needs).
SUMMARY/HAS_CFG_NODE in the neo4j/ subpackage; additive schema.neo4j.json
bump; conformance test extended.
with an exact expected-set gate on the fixture.
CAVEATS AND KNOWN RISKS
Andersen library exists. The type-based stub is intentionally
over-approximate; do not trade soundness for a lower false-positive rate —
ranking/pruning is downstream's job.
with dynamic names), monkey-patching, C extensions. Documented in the
README, not silently absorbed.
knob); summary domains are finite formal-in/formal-out sets.
than level 1; everything is gated behind -a 3 and -a 1/-a 2 timings are
CI-checked to be unaffected.
construction; emission is collect-then-sort. Parallel fan-out lands later,
differential-tested against sequential output.
DELIVERY
Single feature branch with stage-per-commit history (fixture → CFG → dominance
→ def-use → PDG → oracles → summaries/SDG → slicing → schema/CLI emission →
CPG projection → docs), one PR referencing this issue. Follow-ups staged as
separate issues/PRs: real points-to substrate, per-callable parallel fan-out,
incremental re-analysis, SDK-side taint.
VERIFICATION / DEFINITION OF DONE
dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
sets, not "non-empty".
minimums (aliasing, SCC recursion, multi-file flow, closure capture,
module-global flow).
parity clause holds (no renamed/repurposed shared vocabulary).
matches JSON; no dangling edges (deferred-edge gate).
and their output is byte-identical to pre-change output.