Skip to content

feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency#66

Open
stealthwhizz wants to merge 5 commits into
open-gitagent:mainfrom
stealthwhizz:feat/compression-clean
Open

feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency#66
stealthwhizz wants to merge 5 commits into
open-gitagent:mainfrom
stealthwhizz:feat/compression-clean

Conversation

@stealthwhizz

Copy link
Copy Markdown

Summary

  • Doc-converter — gitagent can now read PDF, DOCX, XLSX, PPTX files. Converts to clean markdown automatically on read(). Previously returned [Binary file].
  • CCR (Chunked Reversible Retrieval) — documents above 8K tokens return a lightweight outline first. Agent fetches only the sections it needs via read_doc_section, preventing large docs from flooding context on every turn.
  • Benchmark scripttest-live-benchmark.mjs compares two successful runs: full markdown in prompt (no CCR) vs outline + fetch (with CCR). Supports Anthropic, OpenAI, and Lyzr.

What changed

File Change
src/tools/doc-converter.ts New — PDF/DOCX/XLSX/PPTX → markdown, 50MB file size guard, explicit errors
src/tools/doc-store.ts New — CCR chunk store, read_doc_section with has_refs, LRU soft cap (10MB)
src/tools/read.ts Modified — doc-converter + CCR pipeline wired in, mode="outline" dry-run
src/tools/shared.ts Modified — mode="outline" added to read schema
src/tools/index.ts Modified — read_doc_section tool registered with has_refs response
test-live-benchmark.mjs New — live before/after benchmark, multi-provider

API surface (locked)

read(path, mode="outline") → outline + token estimates (dry-run, no content loaded)
read_doc_section(doc_id, section_id) → { content, tokens, has_refs: [section_id, ...] }

has_refs lets the agent decide follow-up fetches in one hop without re-parsing content.

Known limitations (v2)

Closes #65

- SmartCrusher: compress JSON tool outputs (removes nulls, deduplicates arrays, truncates strings)
- CodeCompressor: strip comments and blank lines from source files (TS, JS, Py, Go, Rust, Java, C++)
- CacheAligner: order system prompt static-first to maximize provider KV cache hits
- doc-converter: convert PDF, DOCX, XLSX, PPTX to clean markdown via pdf-parse, mammoth, xlsx, jszip
- doc-store + CCR: chunk large docs into sections, return outline first, agent fetches on demand
- Disk cache for converted docs at memory/.doc-cache/ with source-newer-than-cache invalidation
- Fix: check isConvertible() by extension before isBinary() — PDFs without null bytes were bypassing conversion
- Add benchmark scripts: test-compression.mjs, test-live-benchmark.mjs, test-token-proof.mjs
…on errors

- read_doc_section now returns has_refs: list of section IDs referenced
  in fetched content so agent can decide follow-up fetches in one hop
- read(path, mode="outline") dry-run returns outline + token estimates
  without loading content — lets agent plan section fetches upfront
- DocStore: LRU soft cap (10MB) evicts least-recently-accessed docs to
  prevent OOM in long sessions reading many large files
- doc-converter: hard cap at 50MB — returns explicit error instead of
  stalling silently on very large files
- doc-converter: surfaces explicit error message on conversion failure
  instead of returning null (agent can now tell the user what went wrong)
- read.ts: cache invalidation comment for mtime+inode edge case
Copilot AI review requested due to automatic review settings July 2, 2026 11:05

@shreyas-lyzr shreyas-lyzr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid implementation — the doc-converter + CCR pipeline is well-structured and matches the agreed spec. Four issues below that need fixing before merge: one is a runtime crash, one is a key-collision bug, one is a section-ID false-positive, and one is a path mismatch that will silently break read_doc_section on relative paths. Everything else is in good shape.

@@ -0,0 +1,242 @@
import { extname, basename } from "path";
import { estimateTokens } from "../compact.js";
import { PDFParse } from "pdf-parse";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import { PDFParse } is wrong for pdf-parse v2. The package's default export is a plain async function, not a named class. This will throw TypeError: PDFParse is not a constructor at runtime on any PDF read.

The correct usage is:

Suggested change
import { PDFParse } from "pdf-parse";
import pdfParse from "pdf-parse";

And in pdfToMarkdown:

async function pdfToMarkdown(buffer: Buffer): Promise<string> {
	const result = await pdfParse(buffer);
	return result.text.trim();
}

Worth adding a quick smoke test (even just node -e "require('pdf-parse')(Buffer.alloc(0)).catch(() => {})") to catch API mismatches like this before CI.

Comment thread src/tools/read.ts
return Math.ceil(s.length / 4);
}

function cacheKey(absolutePath: string): string {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cacheKey replaces /, \\, and : with _, which causes path collisions. For example:

  • /tmp/a/b.pdf and /tmp/a_b.pdf both produce _tmp_a_b.pdf.md

A stale or wrong cached conversion will be silently served with no error. Use a collision-resistant key — a hash of the absolute path, or at minimum double the separator:

Suggested change
function cacheKey(absolutePath: string): string {
function cacheKey(absolutePath: string): string {
// Use a hash to avoid collisions from path separator substitution
const hash = absolutePath.split("").reduce((h, c) => (Math.imul(31, h) + c.charCodeAt(0)) | 0, 0);
const safe = absolutePath.replace(/[/\\:]/g, "_").slice(-60);
return `${Math.abs(hash).toString(36)}_${safe}.md`;
}

Comment thread src/tools/doc-store.ts Outdated
doc.lastAccessed = Date.now();

// Extract section IDs referenced in this chunk's content (e.g. "see s3" or "[s3]")
const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_refs regex /\[?(s\d+)\]?/g matches any substring of the form sN — including words like s3ction, s10me, or session3. Running this against real document text will produce noisy false-positive refs that lead the agent to fetch unrelated sections.

Tighten the match to whole-word / bracket-delimited occurrences:

Suggested change
const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]
const has_refs = [...chunk.content.matchAll(/(?<![a-zA-Z])\[s(\d+)\]|(?<![a-zA-Z])s(\d+)(?!\w)/g)]
.map((m) => m[1] ?? m[2])
.filter((id) => {
const sectionId = `s${id}`;
return sectionId !== chunkId && doc.chunks.some((c) => c.id === sectionId);
});

Or, simpler and safer for v1: only match the explicit bracket form [sN] which is what the outline output uses, and skip the bare-word match entirely.

Comment thread src/tools/index.ts Outdated
},
execute: async (_toolCallId: string, params: unknown) => {
const { path, section_id } = params as { path: string; section_id: string };
const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The altPath construction uses string concatenation instead of path.resolve(), which can produce a path like /cwd/./docs/report.pdf when the user passes ./docs/report.pdf. Meanwhile read.ts stores the key using resolve(cwd, path) which normalizes to /cwd/docs/report.pdf. These won't match in the Map, so docStore.fetch(path, section_id) and docStore.fetch(altPath, section_id) both return null and the agent gets [No section found].

Suggested change
const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");
const altPath = resolve(config.dir, path);

(Import resolve from "path" at the top of index.ts.) This makes the lookup key consistent with what read.ts stores.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a document-reading and context-cost optimization pipeline to gitagent: binary office/PDF docs can be converted to readable markdown on read(), and large documents can be served via CCR (outline-first + section fetch) to avoid repeatedly flooding prompt context across turns.

Changes:

  • Added a doc converter for PDF/DOCX/XLSX/PPTX (plus HTML) that outputs cleaned markdown with a 50MB guard and caching.
  • Added CCR chunking + an in-session DocStore, and registered a read_doc_section retrieval tool.
  • Added multiple token benchmark/proof scripts and introduced JSON/code/system-prompt “compression” utilities.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/tools/doc-converter.ts New document conversion + cleanup pipeline to markdown.
src/tools/doc-store.ts New CCR chunk store with outline generation, section fetch, and LRU eviction.
src/tools/read.ts Wires conversion + CCR into read(), adds caching and optional compression passes.
src/tools/shared.ts Adds mode="outline" option to the read tool schema.
src/tools/index.ts Registers read_doc_section tool and wires DocStore/CostTracker into read.
src/sdk.ts Instantiates DocStore and passes it into builtin tool creation for sessions.
src/loader.ts Builds system prompt via CacheAligner to improve KV-cache efficiency.
src/tools/cli.ts Compresses JSON CLI output via SmartCrusher before returning to the model.
src/cost-tracker.ts Tracks doc-conversion “token savings” at the session level.
src/compression/smart-crusher.ts New JSON compression utility (SmartCrusher).
src/compression/code-compressor.ts New source-code noise stripping utility (CodeCompressor).
src/compression/cache-aligner.ts New system-prompt prefix alignment utility (CacheAligner).
src/compression/index.ts Barrel exports for the compression utilities.
test-live-benchmark.mjs New live benchmark script (provider-aware) comparing CCR vs no-CCR.
test-token-proof.mjs New “meeting demo” token proof script for CCR savings.
test-compression.mjs New benchmark covering SmartCrusher/CodeCompressor/doc-converter/CacheAligner.
package.json Adds new dependencies for conversion + compression features.
package-lock.json Lockfile updates for the newly added dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/tools/read.ts Outdated
Comment on lines +89 to +93
// ── mode="outline": dry-run for CCR docs ──────────────────────
// Returns outline + token estimates without fetching content.
if (mode === "outline") {
if (docStore?.has(absolutePath)) {
const outline = docStore.getOutline(absolutePath)!;
Comment thread src/tools/index.ts
Comment on lines +82 to +88
const refsNote = result.has_refs.length > 0
? `\n\n[References other sections: ${result.has_refs.join(", ")} — fetch them if needed]`
: "";
return {
content: [{ type: "text", text: `${result.content}${refsNote}` }],
details: undefined,
};
Comment thread src/tools/index.ts
Comment on lines +73 to +75
const { path, section_id } = params as { path: string; section_id: string };
const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");
const result = docStore.fetch(path, section_id) ?? docStore.fetch(altPath, section_id);
Comment thread src/tools/doc-store.ts Outdated
Comment on lines +55 to +57
const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]
.map((m) => m[1])
.filter((id) => id !== chunkId && doc.chunks.some((c) => c.id === id));
Comment on lines +16 to +21
const CONVERTIBLE = new Set([
".pdf", ".docx", ".doc",
".xlsx", ".xls",
".pptx", ".ppt",
".html", ".htm",
]);
Comment thread src/loader.ts Outdated
Comment on lines +383 to +387
const efficiency = Math.round(cacheEfficiency(aligned) * 100);
console.error(
`[compression] System prompt: ${aligned.staticTokens + aligned.dynamicTokens} tokens` +
` | static prefix: ${aligned.staticTokens} tokens (${efficiency}% cache-eligible)`,
);
Comment thread package.json
Comment on lines 64 to 66
"@sinclair/typebox": "^0.34.41",
"headroom-ai": "^0.22.4",
"js-yaml": "^4.1.0",
Comment thread src/tools/doc-store.ts
Comment on lines +33 to +37
store(filePath: string, chunks: DocChunk[]): void {
// Evict existing entry's char count before overwriting
const existing = this.docs.get(filePath);
if (existing) {
this.totalChars -= existing.chunks.reduce((s, c) => s + c.content.length, 0);
…n, legacy formats, logging

- read.ts: cacheKey uses hash to prevent collisions between paths that differ only in separators
- doc-store.ts: has_refs regex requires [sN] bracket syntax to avoid false positives from words like session3
- index.ts: use resolve() for altPath so relative paths match stored keys
- read.ts: mode=outline now converts and chunks the doc if not already in store instead of returning an error
- doc-converter.ts: remove .doc/.ppt — mammoth/JSZip cannot reliably parse legacy binary formats
- loader.ts: gate system prompt metrics behind GITAGENT_DEBUG env var
- package.json: remove unused headroom-ai dependency

@shreyas-lyzr shreyas-lyzr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four issues from the previous review are resolved:

  1. PDFParse import — pdf-parse v2.4.5 exports PDFParse as a named class, so the import and constructor call are correct for the pinned version. The previous concern was based on v1 behavior; no longer applies.

  2. Cache key collision — fixed with a djb2 hash appended to the path. /tmp/a/b.pdf and /tmp/a_b.pdf now produce distinct keys.

  3. has_refs false positives — fixed by requiring bracket syntax [sN] and filtering against actual chunk IDs from the store.

  4. Path mismatch in read_doc_section — fixed by using resolve(config.dir, path) for altPath and trying both the raw and resolved paths.

Implementation matches the agreed spec. Good to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency

3 participants