PyCG sharding: Jedi planner + adaptive decomposition of runaways by rahlk · Pull Request #49 · codellm-devkit/codeanalyzer-python

rahlk · 2026-06-27T15:27:10Z

Shards PyCG (level 2) by Jedi module coupling instead of a flat file count, then recovers shards that diverge by re-sharding them.

How sharding works now:

plan = scc_louvain_shards(jedi_module_graph, budget=ceiling)
edges, budget = [], ceiling
while plan.shards:
    converged, runaways = run_pycg(plan.shards, timeout)   # symlink-bounded, Ray
    edges += converged
    if not runaways: break
    budget = max(floor, budget // 2)
    if budget did not shrink or max rounds hit:
        runaways -> jedi_only ; break
    next = []
    for r in runaways:
        sub = scc_louvain_shards(r.files, budget)   # re-shard that runaway alone
        if sub did not split: r -> jedi_only ; continue
        next += sub.shards
    plan.shards = budget = next
return coalesce(edges)

Why: a uniform ceiling forces every shard small (severs many edges) just to tame the few that diverge. Starting coarse keeps the cut low, and only the shards that time out get split.

Telemetry (benchmark app: Odoo, 1028 modules, level 2, Ray). PyCG edges recovered:

uniform ceiling 100, timeout 90s    13302 edges   ~100 files lost   96s
uniform ceiling 100, timeout 300s   17149 edges   ~100 files lost   307s
adaptive (start 100)                22210 edges     20 files lost    760s

Adaptive recovers 22210 edges (+30% over the best uniform run), and loses only 20 of 1028 files instead of a whole 100-file shard.

Also in this branch:

--pycg-shard-strategy {jedi,package}   jedi (default) uses the planner
--pycg-max-iter (default 50)           caps PyCG fixpoint passes
dep exclusion                          runs PyCG in a symlink mini-project so an
                                       in-tree .codeanalyzer venv is not analyzed
file-keyed module graph                module_name is only the file stem, so it
                                       collided and dropped files; key by path

Caveats:

Wall time is higher (760s vs 300s). Decomposition rounds run in sequence, and each round waits the full timeout for its runaways before re-sharding. Tunable later (shorter per-round timeout, overlap rounds).
20 files stay Jedi-only. They are a true PyCG divergence (the ORM metaclass core), not a bug. PyCG has no convergence guarantee on its field-sensitive access paths.
Numbers are from one app (Odoo). They will vary by codebase.
timeout and max_iter are the only guards against PyCG running forever. With timeout 0 and max_iter -1, a divergent shard never returns.

Sharding lets PyCG (level 2) scale past its ~500-file ceiling by analysing the project in independent pieces. The existing scheme shards one-per-package with a flat file-count ceiling, which is blind to call coupling: it severs heavily-interacting modules (their cross-shard edges become ghost nodes PyCG never resolves) and drops oversized packages wholesale. Add a coupling-aware planner that partitions the module-dependency graph *derived from the Jedi call graph already computed at level 1*: 1. project Jedi callable->callable edges onto a weighted module DiGraph; 2. condense strongly-connected components (import cycles become atomic and are never split across shards); 3. cluster with Louvain so tightly-coupled modules co-compute; 4. enforce the per-shard file budget (re-partition oversized communities, then merge/first-fit-pack the remainder to recover edges and cut count). The reported cut_ratio (fraction of Jedi edge weight crossing shard boundaries) is an upper bound on PyCG edges lost to sharding; on a synthetic worst case it drops from 0.55 (per-package) to 0.03. Wire it into PyCG behind --pycg-shard-strategy {jedi,package} (default jedi). Because planner shards are arbitrary file sets rather than directories, each runs through a temporary symlink mini-project (_shard_symlink_root) so PyCG's own package-root bound confines analysis to the shard and emits project-relative edge names with no prefix rewrite. Thread the level-1 Jedi edges through core -> _get_pycg_call_graph -> build_call_graph_edges to feed the planner. Ray parallelism falls back to sequential under the jedi strategy for now. Add test/test_shard_planner.py (graph projection, SCC atomicity, budget, single-assignment, cut-ratio vs naive, determinism).

Materialise each planned file-set shard as a symlink mini-project up front (the trees must outlive their remote tasks), submit one Ray task per shard, and collect against a single wall-clock deadline (Ray workers can't use SIGALRM, so the timeout is enforced at the orchestrator, mirroring _build_sharded_ray). Symlink trees are cleaned up once the batch completes. Factor _materialize_shard_root out of the _shard_symlink_root context manager so both the sequential and Ray paths share tree construction. Under --ray the jedi strategy now parallelises instead of falling back to sequential.

PyModule.module_name is only the file stem (py_file.stem), which collides heavily across a real project — every __init__.py, models.py, views.py shares a name. Keying the partition graph by module_name collapsed all same-stem files into a single node and, via the last-wins module_name->file_path map, silently dropped every colliding file but one from the shards. Observed on odoo: a 1028-file symbol table produced a graph of only 399 nodes (4 fat shards), so ~600 files were never handed to PyCG. Key graph nodes by file_path (unique) instead; carry module_name as a node attribute for readable reporting. plan_shards now emits file-path shards directly (no name->file remap) with a parallel module_shards name view. Add a regression test asserting every file lands in exactly one shard under stem collisions, and update graph tests for file-keyed nodes.

…n-tree deps Two robustness fixes for level-2 PyCG, motivated by odoo divergence analysis. 1. max_iter cap (--pycg-max-iter, default 50). PyCG runs its PostProcessor fixpoint with max_iter=-1 (until convergence). Its abstract domain is field-sensitive access paths with no k-limiting/widening, so on heavy metaclass/mixin code the def set balloons (measured: 23 odoo ORM files -> 7.3k defs pass 0, 8.4k pass 1) and convergence may need many O(defs^2) passes. Capping passes returns a sound-but-incomplete graph and guarantees termination even with --pycg-shard-timeout 0 (which previously hung forever on a single diverging shard). Threaded through _run_pycg_batch and the Ray worker. Note: the wall-clock timeout is still the guard for shards whose individual passes exceed it. 2. Dependency exclusion. PyCG bounds analysis to its package dir via "if mod_dir not in mod.__file__". The whole-project path used package=project_dir, but an in-tree .codeanalyzer venv / site-packages lives under project_dir, so PyCG followed imports into dependencies and exploded. Run the whole-project path inside a symlink mini-project (as the shards already do) whose root mirrors only the SKIP_DIRS-filtered source, so deps resolve outside mod_dir and stay ghost nodes. Add test/test_pycg_sharding.py (max_iter threading; in-tree dep stays a ghost and its internals are never analysed).

A uniform shard ceiling forces a global choice: small shards everywhere (high cut, low recall) just to tame the few that diverge. Instead, start coarse and re-shard only the shards that time out. Algorithm: plan shards with SCC + Louvain at the ceiling, run each through PyCG, and treat any timed-out shard as a runaway. Re-partition that runaway's files alone at half the budget and re-run. Repeat down to a floor (10 files). Files that still diverge at the floor, or form an atomic cycle that will not split, fall back to Jedi-only coverage. Refactor the planned executor into a reusable primitive that returns (edges, runaways), used by both the sequential and Ray paths, and drive it from an adaptive loop. Odoo benchmark (1028 modules, level 2, Ray): 22210 PyCG edges, up from 17149 for the best uniform ceiling, with only 20 of 1028 files irreducible. Cost is wall time (about 12.7 min) since rounds run in sequence. Add a unit test driving the adaptive loop with a stubbed runner.

…deql Merge pull request #49 from codellm-devkit/feat/jedi-shard-planner

rahlk added 5 commits June 26, 2026 23:37

rahlk merged commit 1a2cdcc into feat/level2-pycg-remove-codeql Jun 27, 2026

rahlk added a commit that referenced this pull request Jun 27, 2026

Merge pull request #51 from codellm-devkit/feat/level2-pycg-remove-co…

e07887c

…deql Merge pull request #49 from codellm-devkit/feat/jedi-shard-planner

rahlk deleted the feat/jedi-shard-planner branch June 27, 2026 20:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PyCG sharding: Jedi planner + adaptive decomposition of runaways#49

PyCG sharding: Jedi planner + adaptive decomposition of runaways#49
rahlk merged 5 commits into
feat/level2-pycg-remove-codeqlfrom
feat/jedi-shard-planner

rahlk commented Jun 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rahlk commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rahlk commented Jun 27, 2026 •

edited

Loading