Level 2: replace CodeQL with PyCG, add coupling-aware adaptive sharding #50

Merged: rahlk merged 14 commits into main from feat/jedi-shard-planner on Jun 27, 2026

@rahlk rahlk commented Jun 27, 2026


Replaces CodeQL with PyCG as the level 2 call graph backend, and adds coupling-aware sharding so PyCG scales to large apps.

Motivation and Context

PyCG does not scale past a few hundred files. A uniform file-count ceiling forces every shard to be small, which severs many call edges and hurts recall, just to tame the few shards that diverge. This PR instead shards by Jedi module coupling and recovers diverging shards by re-sharding only those.

How Has This Been Tested?

Unit tests for the planner, dep exclusion, max_iter, and the adaptive loop (14 tests, all pass).

End to end on a real app. Benchmark app: Odoo, 1028 modules, level 2, Ray. PyCG edges recovered:

configuration                       PyCG edges   files lost   wall time
uniform ceiling 100, timeout 90s    13302        ~100         96s
uniform ceiling 100, timeout 300s   17149        ~100         307s
adaptive (start 100)                22210        20           760s

Adaptive recovers 22210 edges (+30% over the best uniform run), losing only 20 of 1028 files instead of a whole 100-file shard.

Breaking Changes

Yes.

--codeql / --no-codeql   removed, replaced by --analysis-level {1,2}
edge provenance          "codeql" literal becomes "pycg"
new dependency           pycg (Apache 2.0)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the Codellm-Devkit Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Sharding algorithm:

plan = scc_louvain_shards(jedi_module_graph, budget=ceiling)
edges, budget = [], ceiling
while plan.shards:
    converged, runaways = run_pycg(plan.shards, timeout)   # symlink-bounded, Ray
    edges += converged
    if not runaways: break
    budget = max(floor, budget // 2)
    if budget did not shrink or max rounds hit:
        runaways -> jedi_only ; break
    next = []
    for r in runaways:
        sub = scc_louvain_shards(r.files, budget)   # re-shard that runaway alone
        if sub did not split: r -> jedi_only ; continue
        next += sub.shards
    plan.shards = next
return coalesce(edges)
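
For readers who want to trace the loop, here is a minimal runnable sketch of the pseudocode above. The names `adaptive_shards`, `run`, `replan`, and `max_rounds`, and the choice to coalesce by de-duplicating and sorting, are hypothetical stand-ins rather than the PR's actual API. The runner and planner are injected so the loop can be driven with stubs.

```python
from typing import Callable, List, Sequence, Tuple

Shard = Tuple[str, ...]   # a shard is a tuple of file paths
Edge = Tuple[str, str]    # (caller, callee)


def adaptive_shards(
    initial: Sequence[Shard],
    run: Callable[[Sequence[Shard]], Tuple[List[Edge], List[Shard]]],
    replan: Callable[[Shard, int], List[Shard]],
    ceiling: int = 100,
    floor: int = 10,
    max_rounds: int = 8,   # assumed cap; the PR only says "max rounds"
) -> Tuple[List[Edge], List[str]]:
    """Return (edges, jedi_only_files) after adaptive re-sharding."""
    edges: List[Edge] = []
    jedi_only: List[str] = []
    shards, budget, rounds = list(initial), ceiling, 0
    while shards:
        converged, runaways = run(shards)   # stub for the PyCG batch run
        edges.extend(converged)
        rounds += 1
        if not runaways:
            break
        new_budget = max(floor, budget // 2)
        if new_budget == budget or rounds >= max_rounds:
            # Budget cannot shrink further, or round cap reached.
            for shard in runaways:
                jedi_only.extend(shard)
            break
        budget = new_budget
        next_shards: List[Shard] = []
        for shard in runaways:
            sub = replan(shard, budget)     # re-shard this runaway alone
            if len(sub) <= 1:               # atomic: it did not split
                jedi_only.extend(shard)
                continue
            next_shards.extend(sub)
        shards = next_shards
    # "coalesce" is assumed here to mean de-duplicate and sort.
    return sorted(set(edges)), sorted(jedi_only)
```

Only the runaways are carried into the next round, so shards that converged at a coarse budget keep all of their intra-shard edges.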

New flags:

--analysis-level {1,2}                 default 1
--pycg-shard / --no-pycg-shard         shard level 2 on large projects
--pycg-shard-strategy {jedi,package}   jedi (default) uses the planner
--pycg-shard-ceiling (default 100)     starting budget per shard
--pycg-shard-timeout (default 120)     per-shard wall clock
--pycg-max-iter (default 50)           caps PyCG fixpoint passes

Caveats:

  1. Adaptive wall time is higher (760s vs 307s for the best uniform run). Decomposition rounds run in sequence, and each round waits the full timeout for its runaways before re-sharding. Tunable later (shorter per-round timeout, overlapping rounds).
  2. 20 files stay Jedi-only. They are a true PyCG divergence (the ORM metaclass core), not a bug. PyCG has no convergence guarantee on its field-sensitive access paths.
  3. Numbers are from one app (Odoo). They will vary by codebase.
  4. timeout and max_iter are the only guards against PyCG running forever. With timeout 0 and max_iter -1, a divergent shard never returns.

sinha108 and others added 14 commits June 18, 2026 14:29
CodeQL is incompatible with open-source distribution (proprietary CLI,
licensed query packs). Replace the using_codeql: bool option with
analysis_level: int (1=symbol table only, 2=+call graph). Remove the
entire codeanalyzer/semantic_analysis/codeql/ module and all CLI flags,
__enter__ setup, and helper methods that depended on it.

Provenance literal updated: "codeql" -> "pycg" in PyCallEdge schema.
CLI flag updated: --codeql/--no-codeql -> -a/--analysis-level.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Wire PyCG as the call graph engine for analysis_level >= 2. PyCG's iterative
name-pointer analysis recovers locally-scoped function calls, closures, and
higher-order patterns that Jedi's type-inference misses. Edges from both
backends are merged; edges seen by both carry provenance=["jedi","pycg"].

Entry-point filter excludes .codeanalyzer, venv, site-packages and other
non-project directories so PyCG only analyses the project's own source.

Result on test fixture: 6 edges (vs. 2 Jedi-only), recovering all
locally-scoped function calls.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
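
The provenance merge described in this commit can be illustrated with a small sketch. The function name `merge_edges` and the plain `(source, target)` tuple shape are assumptions for illustration. The real PyCallEdge schema is not shown in this PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (caller qualified name, callee qualified name)


def merge_edges(by_backend: Dict[str, Iterable[Edge]]) -> Dict[Edge, List[str]]:
    """Union edges from several backends, recording which backends saw each."""
    provenance: Dict[Edge, Set[str]] = defaultdict(set)
    for backend, edges in by_backend.items():
        for edge in edges:
            provenance[edge].add(backend)
    # Sort both the edges and the backend names so output is deterministic.
    return {edge: sorted(names) for edge, names in sorted(provenance.items())}
```

An edge reported by both backends ends up with both names in its provenance list, while an edge only PyCG recovers carries `["pycg"]` alone.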
- Introduces --pycg-shard/--no-pycg-shard to run PyCG independently per
  Python package root instead of over the whole project, with cross-package
  imports treated as ghost nodes.
- Adds --pycg-shard-ceiling (default 100) to skip shards with too many
  files, and --pycg-shard-timeout (default 120s) as a final
  safety net for packages whose pointer fixpoint diverges indefinitely.
- Adds test fixtures (decorators_and_hof, class_hierarchy, async_patterns,
  Flask 3.0.3, requests 2.31.0) and corresponding CLI tests with PyCG-
  specific edge assertions.

Verified on a 6086-file project: 74,008 PyCG edges produced across 748/753
shards; 5 deep-OO framework shards timed out and were gracefully skipped.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
- Fix regression: Jedi call-graph edges are now always built at
  analysis_level >= 1 (level 1 = Jedi only, level 2 = Jedi + PyCG).
- Add filter_external_edges() in call_graph.py: drops edges where both
  source and target are outside the app namespace, using the full
  recursive callable walk (inner_callables, inner_classes) so nested
  functions and closures are correctly treated as app symbols.
- Apply filter unconditionally after call graph construction in core.py.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
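
A rough sketch of the external-edge filter follows. The `Symbol` dataclass and the `_sketch` suffix are hypothetical. Only the field names `inner_callables` and `inner_classes` come from the commit message, and the real symbol-table types may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Symbol:
    """Hypothetical stand-in for a symbol-table entry (function or class)."""
    qualified_name: str
    inner_callables: List["Symbol"] = field(default_factory=list)
    inner_classes: List["Symbol"] = field(default_factory=list)


def walk(symbols: Iterable[Symbol]) -> Iterator[str]:
    """Yield every qualified name, recursing into nested callables and classes."""
    for sym in symbols:
        yield sym.qualified_name
        yield from walk(sym.inner_callables)
        yield from walk(sym.inner_classes)


def filter_external_edges_sketch(
    edges: Iterable[Tuple[str, str]], app_symbols: Iterable[Symbol]
) -> List[Tuple[str, str]]:
    """Keep an edge if at least one endpoint belongs to the app namespace."""
    app = set(walk(app_symbols))
    return [(src, dst) for src, dst in edges if src in app or dst in app]
```

Because the walk is recursive, a closure such as `app.outer.inner` counts as an app symbol, so an edge into it is kept rather than mistaken for an external call.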
Incorporates Neo4j emit target, --emit/--app-name/--neo4j-* CLI options,
EmitTarget enum, _install_into_venv helper, uv dependency, canpy rename,
and _compute_external_symbols from main. Retains PyCG as analysis level 2
backend (--analysis-level, --pycg-shard, --pycg-shard-ceiling,
--pycg-shard-timeout) and filter_external_edges from this branch.
CodeQL is kept as an optional augmentation pass (--codeql/--no-codeql)
that enriches call sites before Jedi runs; PyCG adds further edges at
level 2 on top of the Jedi+CodeQL merge.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Strips the using_codeql flag, --codeql/--no-codeql CLI option, CodeQL
__enter__ setup block, and codeql_edges call from analyze() that were
brought in when merging main. CodeQL is incompatible with open-source
distribution (proprietary CLI, licensed query packs); this branch uses
PyCG as the level-2 call-graph backend instead.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
- Remove --no-codeql from test_no_venv_skips_virtualenv (flag no longer exists)
- Update level-1 CLI tests to assert call_graph non-empty (Jedi edges now
  always produced at level 1; previous assertion was written before that fix)
- Replace 'codeql' provenance literal in sample_graph_app.py with 'pycg'
  (PyCallEdge schema only allows jedi/pycg/joern after CodeQL removal)

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
When --ray and --pycg-shard are both active, PyCG shards are submitted
as Ray remote tasks simultaneously instead of running sequentially.
Per-shard timeout is enforced via ray.wait(timeout=N) + ray.cancel at
the orchestrator level.

Key changes:
- _pycg_shard_worker: picklable module-level function that runs PyCG
  in a Ray worker and returns (src, dst, weight) tuples
- PyCG._build_sharded_ray: submits all eligible shards as ray.remote
  tasks, collects results with ray.wait(num_returns=N, timeout=T),
  cancels and logs stragglers, then runs the same dedup/merge as the
  sequential path
- PyCG.__init__: new using_ray parameter (default False)
- core._get_pycg_call_graph: passes using_ray=self.using_ray to PyCG

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Timing logs (INFO, visible at -vv) — consistent ✅ Phase: N unit in X.Xs:

PyCG shard progress bars (sequential and Ray-parallel modes) matching

The Ray collection loop is restructured from a single ray.wait(N) call
to a deadline-based ray.wait(1) loop.
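
The deadline-based collection pattern can be sketched with the standard library's `concurrent.futures` as a stand-in for Ray. This is a swapped-in technique and not the PR's code. With Ray, the analogous calls would presumably be `ray.wait(pending, num_returns=1, timeout=remaining)` and `ray.cancel` on the stragglers. The function name `collect_until_deadline` is hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Dict, List, Tuple


def collect_until_deadline(
    tasks: Dict[Future, str], timeout_s: float
) -> Tuple[List[str], List[str]]:
    """Collect tasks one at a time until a single wall-clock deadline passes."""
    deadline = time.monotonic() + timeout_s
    pending = set(tasks)
    finished: List[str] = []
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # Wake up as soon as any one task completes (or the deadline hits),
        # which is the point where a progress bar could be advanced.
        done, pending = wait(pending, timeout=remaining, return_when=FIRST_COMPLETED)
        finished.extend(tasks[f] for f in done)
    stragglers = [tasks[f] for f in pending]
    for f in pending:
        f.cancel()  # best effort: a thread that is already running is not stopped
    return finished, stragglers
```

Waiting for one completion at a time lets the orchestrator report progress per shard while still enforcing one overall deadline.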

Fix double progress-bar render.

Fix venv warn-and-continue: _install_into_venv callers now catch
CalledProcessError and emit a WARNING, so a failing pip install
(e.g. psycopg2 needing compiled C extensions on odoo) no longer
aborts the analysis.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Sharding lets PyCG (level 2) scale past its ~500-file ceiling by analysing
the project in independent pieces. The existing scheme shards one-per-package
with a flat file-count ceiling, which is blind to call coupling: it severs
heavily-interacting modules (their cross-shard edges become ghost nodes PyCG
never resolves) and drops oversized packages wholesale.

Add a coupling-aware planner that partitions the module-dependency graph
*derived from the Jedi call graph already computed at level 1*:

  1. project Jedi callable->callable edges onto a weighted module DiGraph;
  2. condense strongly-connected components (import cycles become atomic and
     are never split across shards);
  3. cluster with Louvain so tightly-coupled modules co-compute;
  4. enforce the per-shard file budget (re-partition oversized communities,
     then merge/first-fit-pack the remainder to recover edges and cut count).

The reported cut_ratio (fraction of Jedi edge weight crossing shard
boundaries) is an upper bound on PyCG edges lost to sharding; on a synthetic
worst case it drops from 0.55 (per-package) to 0.03.
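
To make step 1 and the cut_ratio metric concrete, here is a small pure-Python sketch. The function names, the `module_of` lookup, and the decision to drop intra-module edges from the weights are assumptions. The SCC, Louvain, and budget steps are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple

ModuleEdge = Tuple[str, str]


def project_to_modules(
    call_edges: Iterable[Tuple[str, str]], module_of: Dict[str, str]
) -> Dict[ModuleEdge, int]:
    """Collapse callable->callable edges into weighted module->module edges."""
    weights: Dict[ModuleEdge, int] = defaultdict(int)
    for src, dst in call_edges:
        m_src, m_dst = module_of[src], module_of[dst]
        if m_src != m_dst:          # assumption: intra-module calls are ignored
            weights[(m_src, m_dst)] += 1
    return dict(weights)


def cut_ratio(weights: Dict[ModuleEdge, int], shards: Sequence[Set[str]]) -> float:
    """Fraction of module edge weight whose endpoints land in different shards."""
    shard_of = {module: i for i, shard in enumerate(shards) for module in shard}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    cut = sum(w for (a, b), w in weights.items() if shard_of[a] != shard_of[b])
    return cut / total
```

A plan that keeps the heavily connected pair of modules together scores a lower cut_ratio than one that separates them, which is the quantity the planner tries to minimise.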

Wire it into PyCG behind --pycg-shard-strategy {jedi,package} (default jedi).
Because planner shards are arbitrary file sets rather than directories, each
runs through a temporary symlink mini-project (_shard_symlink_root) so PyCG's
own package-root bound confines analysis to the shard and emits
project-relative edge names with no prefix rewrite.
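
A minimal sketch of the symlink mini-project idea, using only the standard library. The name `shard_symlink_root_sketch` and its signature are hypothetical, and details such as how missing `__init__.py` files are handled are not covered here.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterable, Iterator


@contextmanager
def shard_symlink_root_sketch(
    project_dir: Path, shard_files: Iterable[str]
) -> Iterator[Path]:
    """Yield a temp dir that mirrors only the shard's files as symlinks."""
    root = Path(tempfile.mkdtemp(prefix="pycg-shard-"))
    try:
        for rel in shard_files:
            link = root / rel
            link.parent.mkdir(parents=True, exist_ok=True)
            # Preserve the project-relative layout so derived names match.
            link.symlink_to((project_dir / rel).resolve())
        yield root
    finally:
        # rmtree unlinks the symlinks themselves and leaves the targets alone.
        shutil.rmtree(root, ignore_errors=True)
```

Since the temporary root contains nothing but the shard's files at their original relative paths, a tool bounded to that root sees exactly the shard and nothing else.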

Thread the level-1 Jedi edges through core -> _get_pycg_call_graph ->
build_call_graph_edges to feed the planner. Ray parallelism falls back to
sequential under the jedi strategy for now.

Add test/test_shard_planner.py (graph projection, SCC atomicity, budget,
single-assignment, cut-ratio vs naive, determinism).
Materialise each planned file-set shard as a symlink mini-project up front
(the trees must outlive their remote tasks), submit one Ray task per shard,
and collect against a single wall-clock deadline (Ray workers can't use
SIGALRM, so the timeout is enforced at the orchestrator, mirroring
_build_sharded_ray). Symlink trees are cleaned up once the batch completes.

Factor _materialize_shard_root out of the _shard_symlink_root context manager
so both the sequential and Ray paths share tree construction. Under --ray the
jedi strategy now parallelises instead of falling back to sequential.
PyModule.module_name is only the file stem (py_file.stem), which collides
heavily across a real project — every __init__.py, models.py, views.py shares
a name. Keying the partition graph by module_name collapsed all same-stem
files into a single node and, via the last-wins module_name->file_path map,
silently dropped every colliding file but one from the shards.

Observed on odoo: a 1028-file symbol table produced a graph of only 399
nodes (4 fat shards), so ~600 files were never handed to PyCG.

Key graph nodes by file_path (unique) instead; carry module_name as a node
attribute for readable reporting. plan_shards now emits file-path shards
directly (no name->file remap) with a parallel module_shards name view.

Add a regression test asserting every file lands in exactly one shard under
stem collisions, and update graph tests for file-keyed nodes.
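
The collision is easy to reproduce in isolation. The sketch below uses made-up paths and hypothetical helper names and only demonstrates why a stem-keyed map loses files while a path-keyed one does not.

```python
from pathlib import Path
from typing import Dict, Iterable


def key_by_stem(paths: Iterable[str]) -> Dict[str, str]:
    """Buggy keying: later files silently overwrite earlier ones with the same stem."""
    return {Path(p).stem: p for p in paths}


def key_by_path(paths: Iterable[str]) -> Dict[str, str]:
    """Fixed keying: the path is unique, and the stem is kept only as a label."""
    return {p: Path(p).stem for p in paths}
```

With two packages that each contain `models.py` and `__init__.py`, the stem-keyed map holds two entries for four files, while the path-keyed map holds all four.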
…n-tree deps

Two robustness fixes for level-2 PyCG, motivated by odoo divergence analysis.

1. max_iter cap (--pycg-max-iter, default 50). PyCG runs its PostProcessor
   fixpoint with max_iter=-1 (until convergence). Its abstract domain is
   field-sensitive access paths with no k-limiting/widening, so on heavy
   metaclass/mixin code the def set balloons (measured: 23 odoo ORM files ->
   7.3k defs pass 0, 8.4k pass 1) and convergence may need many O(defs^2)
   passes. Capping passes returns a sound-but-incomplete graph and guarantees
   termination even with --pycg-shard-timeout 0 (which previously hung forever
   on a single diverging shard). Threaded through _run_pycg_batch and the Ray
   worker. Note: the wall-clock timeout is still the guard for shards whose
   individual passes exceed it.

2. Dependency exclusion. PyCG bounds analysis to its package dir via
   "if mod_dir not in mod.__file__". The whole-project path used
   package=project_dir, but an in-tree .codeanalyzer venv / site-packages
   lives under project_dir, so PyCG followed imports into dependencies and
   exploded. Run the whole-project path inside a symlink mini-project (as the
   shards already do) whose root mirrors only the SKIP_DIRS-filtered source,
   so deps resolve outside mod_dir and stay ghost nodes.

Add test/test_pycg_sharding.py (max_iter threading; in-tree dep stays a ghost
and its internals are never analysed).
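
The effect of a pass cap on a fixpoint loop can be shown generically. This sketch is not PyCG's PostProcessor. `run_fixpoint` and `step` are hypothetical names, and only the convention that a negative `max_iter` means "until convergence" is taken from the commit message.

```python
from typing import Callable, FrozenSet, Tuple

State = FrozenSet[int]


def run_fixpoint(
    step: Callable[[State], State], state: State, max_iter: int = 50
) -> Tuple[State, int, bool]:
    """Apply `step` until the state stops changing or the pass cap is hit.

    Returns (state, passes_run, converged). A negative max_iter disables
    the cap, so a diverging `step` would then loop forever.
    """
    passes = 0
    while max_iter < 0 or passes < max_iter:
        new_state = step(state)
        passes += 1
        if new_state == state:
            return state, passes, True
        state = new_state
    return state, passes, False
```

With the cap in place a diverging analysis returns whatever it has accumulated after `max_iter` passes and flags that it did not converge, instead of relying on the wall-clock timeout alone.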
A uniform shard ceiling forces a global choice: small shards everywhere
(high cut, low recall) just to tame the few that diverge. Instead, start
coarse and re-shard only the shards that time out.

Algorithm: plan shards with SCC + Louvain at the ceiling, run each through
PyCG, and treat any timed-out shard as a runaway. Re-partition that runaway's
files alone at half the budget and re-run. Repeat down to a floor (10 files).
Files that still diverge at the floor, or form an atomic cycle that will not
split, fall back to Jedi-only coverage.

Refactor the planned executor into a reusable primitive that returns
(edges, runaways), used by both the sequential and Ray paths, and drive it
from an adaptive loop.

Odoo benchmark (1028 modules, level 2, Ray): 22210 PyCG edges, up from 17149
for the best uniform ceiling, with only 20 of 1028 files irreducible. Cost is
wall time (about 12.7 min) since rounds run in sequence.

Add a unit test driving the adaptive loop with a stubbed runner.
@rahlk rahlk merged commit 8898e4e into main Jun 27, 2026
@rahlk rahlk deleted the feat/jedi-shard-planner branch June 27, 2026 20:38