Full system dependency graph at analysis level 3 (WALA 1.6.10) #172

Open
rahlk wants to merge 4 commits into main from minor/issue-171-full-SDG

rahlk (Collaborator) commented Jul 2, 2026

Closes #171.

What

  • WALA 1.6.7 → 1.6.10 (latest release). Existing suite passes; -a 2 output on the call-graph-test fixture is identical before/after (call graph and symbol table compared JSON-normalized against main).
  • New analysis level 3 (-a 3): restores and completes the WALA slicer SDG that was removed in dcdeb2c, exposing the full system dependency graph — control + data dependence — as two new analysis.json sections:
    • system_dependency_graph — method-level dependence edges (CONTROL_DEP/DATA_DEP + statement kinds + weight). Validates against the Python SDK's existing JApplication.system_dependency_graph: List[JGraphEdges] model with zero SDK changes (verified with JApplication.model_validate on the fixture output).
    • program_graphs — statement-level graphs per the CLDK level-3 dataflow contract, keyed by (signature, node_id) (ENTRY = 0, SSA instructions in iindex order, EXIT = last): per-callable CFG (nodes with source lines; fallthrough/true/false/switch_case/loop_back/exception/return edges) and PDG (CDG/DDG edges), plus cross-function sdg_edges (CALL/PARAM_IN/PARAM_OUT). Output is deterministically sorted.
  • New flags (strictly validated — unknown values exit non-zero with a clear message, never a silent fallback):
    • --graphs cfg,pdg,sdg — scope the emitted program_graphs sections (default all; requires -a 3).
    • --sdg-data-deps no-heap|full — slicer data-dependence depth. Default no-heap (NO_HEAP_NO_EXCEPTIONS + NO_EXCEPTIONAL_EDGES, the fast pre-removal settings); full opts into heap-carried dependence.
  • Neo4j follows the same contract: --emit neo4j defaults to the full SDG analysis (-a 3) — an explicit -a dials down — and method-level SDG edges project as J_CONTROL_DEP / J_DATA_DEP / J_HEAP_DATA_DEP relationships between :JCallable nodes (props weight/source_kind/destination_kind, same resolved-gating as J_CALLS). The dependence kind rides in the relationship type because the writers MERGE one relationship per (type, source, target) — a pair with both a control and a data dependence must keep both edges. WALA's Dependency enum is closed (exactly those three kinds), so the vocabulary is total. schema.neo4j.json bumps additively to 1.1.0.
  • Design decisions (level mapping, slicer options, RTA builder, node identity, edge-kind derivations, Neo4j edge encoding) are recorded in .claude/SCHEMA_DECISIONS.md; agent guide (CLAUDE.md + AGENTS.md symlink) and README documentation added.
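The Neo4j edge-encoding argument above can be illustrated with a minimal sketch. It assumes a writer that keeps exactly one relationship per (type, source, target), as the description states; `merge_edges` and the generic `J_DEP` type are hypothetical names used only for this illustration, not the analyzer's writer code.

```python
# Hypothetical in-memory stand-in for a writer that MERGEs one relationship
# per (type, source, target). Not the analyzer's actual Neo4j writer.
def merge_edges(edges, type_of):
    merged = {}
    for edge in edges:
        key = (type_of(edge), edge["source"], edge["target"])
        merged[key] = edge  # a later edge with the same key replaces the earlier one
    return merged


# Same method pair, two dependence kinds (as on the fixture).
edges = [
    {"source": "helloString()", "target": "log()", "kind": "CONTROL_DEP"},
    {"source": "helloString()", "target": "log()", "kind": "DATA_DEP"},
]

# Kind kept as a property under one generic type (hypothetical J_DEP): the pair collapses.
generic = merge_edges(edges, lambda e: "J_DEP")

# Kind encoded in the relationship type: both edges survive.
typed = merge_edges(edges, lambda e: "J_" + e["kind"])

print(len(generic), len(typed))  # 1 2
```

With the kind in the relationship type, the merge key differs between the control and the data dependence, so neither overwrites the other.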

What it looks like

On the call-graph-test fixture (helloString() → log() → loglog(), helloString() → getName()):

system_dependency_graph:
  helloString() -> log()         [CONTROL_DEP]  NORMAL > METHOD_ENTRY
  helloString() -> log()         [DATA_DEP]     PARAM_CALLER > PARAM_CALLEE
  getName()     -> helloString() [DATA_DEP]     NORMAL_RET_CALLEE > NORMAL_RET_CALLER
  ...
sdg_edges:
  org.example.User.helloString()#1 -> org.example.User.log()#0         [CALL]
  org.example.User.helloString()#1 -> org.example.User.log()#0         [PARAM_IN]
  org.example.User.getName()#3     -> org.example.User.helloString()#2 [PARAM_OUT]

PARAM_OUT correctly appears only for the non-void callee; every call gets an exceptional CFG edge; the PDG shows the def-use chain getName() result → concat → return → EXIT.
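As a rough sketch of how a consumer might check these properties, the snippet below re-encodes the three sdg_edges rows from the listing as tuples. The tuple layout and the helper names (`split_endpoint`, `calls_target_entry`, `param_out_sources`) are assumptions for illustration; the actual analysis.json shape may differ.

```python
# Rows copied from the sdg_edges listing; the (source, target, kind) tuple
# layout is an assumption for illustration, not the analysis.json schema.
SDG_EDGES = [
    ("org.example.User.helloString()#1", "org.example.User.log()#0", "CALL"),
    ("org.example.User.helloString()#1", "org.example.User.log()#0", "PARAM_IN"),
    ("org.example.User.getName()#3", "org.example.User.helloString()#2", "PARAM_OUT"),
]


def split_endpoint(endpoint):
    """Split 'signature#node_id' into (signature, node_id)."""
    signature, _, node_id = endpoint.rpartition("#")
    return signature, int(node_id)


def calls_target_entry(edges):
    """True if every CALL edge lands on node 0 (ENTRY) of the callee."""
    return all(split_endpoint(dst)[1] == 0 for _, dst, kind in edges if kind == "CALL")


def param_out_sources(edges):
    """Signatures of callees that return a value to a caller via PARAM_OUT."""
    return {split_endpoint(src)[0] for src, _, kind in edges if kind == "PARAM_OUT"}


print(calls_target_entry(SDG_EDGES))  # True
print(param_out_sources(SDG_EDGES))   # {'org.example.User.getName()'}
```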

Verification

  • -a 2 parity gate: call_graph/symbol_table are identical to main on the fixture after JSON normalization (WALA bump + refactor are invisible below level 3).
  • SDK model gate: -a 3 output validates against cldk.models.java.JApplication.
  • New integration tests (Testcontainers, alongside callGraphShouldHaveKnownEdges):
    • fullSystemDependencyGraphShouldBeEmittedAtAnalysisLevelThree — concrete edge assertions (control dep log()→loglog(), data dep helloString()→getName(), CALL → callee #0, PARAM_OUT from getName()), single-ENTRY/EXIT CFG gate, ENTRY-anchored CDG + DDG presence, and a no-dangling gate over every sdg_edges endpoint.
    • analysisLevelTwoShouldNotEmitSdgSections — level 2 stays call-graph-only.
    • invalidGraphSelectorShouldFailFast — unknown --graphs value exits non-zero with a clear error.
  • Neo4j gates: GraphProjectorSystemDepTest (both dependence kinds survive between the same pair; unknown kinds skipped; unresolved endpoints gated out) and Neo4jSchemaConformanceTest (projector emits only cataloged types; checked-in schema.neo4j.json regenerated and current). End-to-end: --emit neo4j with no -a on the fixture produces a graph.cypher with 3 J_CALLS + 3 J_CONTROL_DEP + 4 J_DATA_DEP rows.
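The fail-fast behavior that invalidGraphSelectorShouldFailFast guards can be sketched as follows. This is a Python illustration of the strict-validation idea, not the analyzer's Java implementation; `parse_graphs`, `main`, the exit code 2, and the error wording are all assumptions (the description only promises a non-zero exit and a clear message).

```python
import sys

ALLOWED_GRAPHS = ("cfg", "pdg", "sdg")


def parse_graphs(value=None):
    """Return the selected graph kinds; None means the default (all)."""
    if value is None:
        return set(ALLOWED_GRAPHS)
    selected = [part.strip() for part in value.split(",")]
    unknown = [part for part in selected if part not in ALLOWED_GRAPHS]
    if unknown:
        # Reject the whole selector instead of silently dropping bad entries.
        raise ValueError(
            f"unknown --graphs value(s): {', '.join(unknown)}; "
            f"expected a comma-separated subset of {', '.join(ALLOWED_GRAPHS)}"
        )
    return set(selected)


def main(argv):
    """Return a process exit code: 0 on success, 2 (assumed) on a bad selector."""
    try:
        graphs = parse_graphs(argv[0] if argv else None)
    except ValueError as err:
        print(f"error: {err}", file=sys.stderr)
        return 2
    print(sorted(graphs))
    return 0
```

Calling `main(["cfg,bogus"])` reports the unknown value on stderr and returns the non-zero code, while `main([])` falls back to the documented default of all three graph kinds.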

Out of scope (tracked in #171)

SUMMARY edges, slicing/taint clients, the Neo4j CPG projection of the level-3 graphs, and --jobs parallelism. SDK-side adoption (default -a 3, shared ProgramGraphs models, SCIP adaptation) is codellm-devkit/python-sdk#228.

rahlk added 4 commits July 1, 2026 22:19
Restores the WALA slicer SDG removed in dcdeb2c and completes it as a new
analysis level. At -a 3, analysis.json gains:

- system_dependency_graph: method-level CONTROL_DEP/DATA_DEP edges in the
  JGraphEdges shape the Python SDK already models (source/target callable,
  statement kinds, weight).
- program_graphs: statement-level graphs keyed by (signature, node_id) with
  ENTRY = 0, SSA instructions in iindex order, EXIT = last: a CFG per callable
  (source lines; fallthrough/true/false/switch_case/loop_back/exception/return
  edge kinds) and a PDG (CDG/DDG edges), plus cross-function sdg_edges
  (CALL/PARAM_IN/PARAM_OUT). Output is deterministically sorted.

New flags, strictly validated (unknown values exit non-zero, no fallback):
--graphs cfg,pdg,sdg scopes the program_graphs sections; --sdg-data-deps
no-heap|full picks the slicer depth (default NO_HEAP_NO_EXCEPTIONS +
NO_EXCEPTIONAL_EDGES; full opts into heap-carried dependence). The RTA
builder and levels 1/2 output are unchanged.

The Neo4j projection follows: --emit neo4j now defaults to -a 3 (an explicit
-a dials down) and method-level SDG edges project as J_CONTROL_DEP /
J_DATA_DEP / J_HEAP_DATA_DEP relationships between :JCallable nodes with the
same resolved-gating as J_CALLS. The dependence kind rides in the relationship
type because the writers MERGE one relationship per (type, source, target).
Neo4j schema contract bumps additively to 1.1.0.

Issue #171
README gains the full-SDG section (-a 3, --graphs, --sdg-data-deps, known
unsoundness) and the Neo4j default; CLAUDE.md (with AGENTS.md symlink) is the
contributor/agent guide; .claude/SCHEMA_DECISIONS.md records the level-3 and
Neo4j-projection design decisions. Repo .gitignore un-ignores these three past
a global gitignore rule.

Issue #171
…aph provider

Taint and slicing are frontend-side reachability queries over the emitted
universal graph (program_graphs + system_dependency_graph); the analyzer never
runs client analyses. Graph substrate additions (per-argument PARAM nodes,
SUMMARY edges) remain analyzer-side.

Issue #171
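To make "reachability queries over the emitted graph" concrete, here is a minimal sketch of a forward query over the method-level edges shown in the system_dependency_graph listing. The tuple encoding and `forward_reachable` are hypothetical; a real frontend would read the edges from analysis.json and would likely work at statement level.

```python
from collections import defaultdict, deque

# Method-level edges from the system_dependency_graph listing,
# re-encoded as (source, target, kind) tuples for illustration.
EDGES = [
    ("helloString()", "log()", "CONTROL_DEP"),
    ("helloString()", "log()", "DATA_DEP"),
    ("getName()", "helloString()", "DATA_DEP"),
]


def forward_reachable(edges, start, kinds=None):
    """Breadth-first search over dependence edges, optionally filtered by kind."""
    successors = defaultdict(set)
    for src, dst, kind in edges:
        if kinds is None or kind in kinds:
            successors[src].add(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


# A taint-style query follows data dependence only.
print(sorted(forward_reachable(EDGES, "getName()", {"DATA_DEP"})))
# ['helloString()', 'log()']
```

Restricting `kinds` to `{"CONTROL_DEP"}` from the same start returns an empty set on these edges, which is the kind of distinction a slicing or taint client would draw on the frontend side.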
Development

Successfully merging this pull request may close these issues.

Full system dependency graph: update WALA to 1.6.10 and expose control+data dependence at analysis level 3
