feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency by stealthwhizz · Pull Request #66 · open-gitagent/gitagent

stealthwhizz · 2026-07-02T11:05:44Z

Summary

Doc-converter — gitagent can now read PDF, DOCX, XLSX, PPTX files. Converts to clean markdown automatically on read(). Previously returned [Binary file].
CCR (Chunked Reversible Retrieval) — documents above 8K tokens return a lightweight outline first. Agent fetches only the sections it needs via read_doc_section, preventing large docs from flooding context on every turn.
Benchmark script — test-live-benchmark.mjs compares two successful runs: full markdown in prompt (no CCR) vs outline + fetch (with CCR). Supports Anthropic, OpenAI, and Lyzr.

What changed

File	Change
`src/tools/doc-converter.ts`	New — PDF/DOCX/XLSX/PPTX → markdown, 50MB file size guard, explicit errors
`src/tools/doc-store.ts`	New — CCR chunk store, `read_doc_section` with `has_refs`, LRU soft cap (10MB)
`src/tools/read.ts`	Modified — doc-converter + CCR pipeline wired in, `mode="outline"` dry-run
`src/tools/shared.ts`	Modified — `mode="outline"` added to read schema
`src/tools/index.ts`	Modified — `read_doc_section` tool registered with `has_refs` response
`test-live-benchmark.mjs`	New — live before/after benchmark, multi-provider

API surface (locked)

read(path, mode="outline") → outline + token estimates (dry-run, no content loaded)
read_doc_section(doc_id, section_id) → { content, tokens, has_refs: [section_id, ...] }

has_refs lets the agent decide follow-up fetches in one hop without re-parsing content.

Known limitations (v2)

Cross-section dependency resolution — agent's responsibility for now
Streaming / incremental conversion for very large PDFs
Multi-user persistence across sessions
SmartCrusher, CodeCompressor, CacheAligner — tracked in issue feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency #65

Closes #65

- SmartCrusher: compress JSON tool outputs (removes nulls, deduplicates arrays, truncates strings) - CodeCompressor: strip comments and blank lines from source files (TS, JS, Py, Go, Rust, Java, C++) - CacheAligner: order system prompt static-first to maximize provider KV cache hits - doc-converter: convert PDF, DOCX, XLSX, PPTX to clean markdown via pdf-parse, mammoth, xlsx, jszip - doc-store + CCR: chunk large docs into sections, return outline first, agent fetches on demand - Disk cache for converted docs at memory/.doc-cache/ with source-newer-than-cache invalidation - Fix: check isConvertible() by extension before isBinary() — PDFs without null bytes were bypassing conversion - Add benchmark scripts: test-compression.mjs, test-live-benchmark.mjs, test-token-proof.mjs

…on errors - read_doc_section now returns has_refs: list of section IDs referenced in fetched content so agent can decide follow-up fetches in one hop - read(path, mode="outline") dry-run returns outline + token estimates without loading content — lets agent plan section fetches upfront - DocStore: LRU soft cap (10MB) evicts least-recently-accessed docs to prevent OOM in long sessions reading many large files - doc-converter: hard cap at 50MB — returns explicit error instead of stalling silently on very large files - doc-converter: surfaces explicit error message on conversion failure instead of returning null (agent can now tell the user what went wrong) - read.ts: cache invalidation comment for mtime+inode edge case

…pic, OpenAI, Lyzr)

shreyas-lyzr

Solid implementation — the doc-converter + CCR pipeline is well-structured and matches the agreed spec. Four issues below that need fixing before merge: one is a runtime crash, one is a key-collision bug, one is a section-ID false-positive, and one is a path mismatch that will silently break read_doc_section on relative paths. Everything else is in good shape.

shreyas-lyzr · 2026-07-02T11:09:24Z

@@ -0,0 +1,242 @@
+import { extname, basename } from "path";
+import { estimateTokens } from "../compact.js";
+import { PDFParse } from "pdf-parse";


The import { PDFParse } is wrong for pdf-parse v2. The package's default export is a plain async function, not a named class. This will throw TypeError: PDFParse is not a constructor at runtime on any PDF read.

The correct usage is:

Suggested change

import { PDFParse } from "pdf-parse";

import pdfParse from "pdf-parse";

And in pdfToMarkdown:

async function pdfToMarkdown(buffer: Buffer): Promise<string> { const result = await pdfParse(buffer); return result.text.trim(); }

Worth adding a quick smoke test (even just node -e "require('pdf-parse')(Buffer.alloc(0)).catch(() => {})") to catch API mismatches like this before CI.

shreyas-lyzr · 2026-07-02T11:09:24Z

+	return Math.ceil(s.length / 4);
+}
+
+function cacheKey(absolutePath: string): string {


cacheKey replaces /, \\, and : with _, which causes path collisions. For example:

/tmp/a/b.pdf and /tmp/a_b.pdf both produce _tmp_a_b.pdf.md

A stale or wrong cached conversion will be silently served with no error. Use a collision-resistant key — a hash of the absolute path, or at minimum double the separator:

Suggested change

function cacheKey(absolutePath: string): string {

function cacheKey(absolutePath: string): string {

// Use a hash to avoid collisions from path separator substitution

const hash = absolutePath.split("").reduce((h, c) => (Math.imul(31, h) + c.charCodeAt(0)) | 0, 0);

const safe = absolutePath.replace(/[/\\:]/g, "_").slice(-60);

return `${Math.abs(hash).toString(36)}_${safe}.md`;

}

shreyas-lyzr · 2026-07-02T11:09:24Z

+		doc.lastAccessed = Date.now();
+
+		// Extract section IDs referenced in this chunk's content (e.g. "see s3" or "[s3]")
+		const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]


The has_refs regex /\[?(s\d+)\]?/g matches any substring of the form sN — including words like s3ction, s10me, or session3. Running this against real document text will produce noisy false-positive refs that lead the agent to fetch unrelated sections.

Tighten the match to whole-word / bracket-delimited occurrences:

Suggested change

const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]

const has_refs = [...chunk.content.matchAll(/(?<![a-zA-Z])\[s(\d+)\]|(?<![a-zA-Z])s(\d+)(?!\w)/g)]

.map((m) => m[1] ?? m[2])

.filter((id) => {

const sectionId = `s${id}`;

return sectionId !== chunkId && doc.chunks.some((c) => c.id === sectionId);

});

Or, simpler and safer for v1: only match the explicit bracket form [sN] which is what the outline output uses, and skip the bare-word match entirely.

shreyas-lyzr · 2026-07-02T11:09:24Z

+			},
+			execute: async (_toolCallId: string, params: unknown) => {
+				const { path, section_id } = params as { path: string; section_id: string };
+				const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");


The altPath construction uses string concatenation instead of path.resolve(), which can produce a path like /cwd/./docs/report.pdf when the user passes ./docs/report.pdf. Meanwhile read.ts stores the key using resolve(cwd, path) which normalizes to /cwd/docs/report.pdf. These won't match in the Map, so docStore.fetch(path, section_id) and docStore.fetch(altPath, section_id) both return null and the agent gets [No section found].

Suggested change

const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");

const altPath = resolve(config.dir, path);

(Import resolve from "path" at the top of index.ts.) This makes the lookup key consistent with what read.ts stores.

Copilot

Pull request overview

This PR adds a document-reading and context-cost optimization pipeline to gitagent: binary office/PDF docs can be converted to readable markdown on read(), and large documents can be served via CCR (outline-first + section fetch) to avoid repeatedly flooding prompt context across turns.

Changes:

Added a doc converter for PDF/DOCX/XLSX/PPTX (plus HTML) that outputs cleaned markdown with a 50MB guard and caching.
Added CCR chunking + an in-session DocStore, and registered a read_doc_section retrieval tool.
Added multiple token benchmark/proof scripts and introduced JSON/code/system-prompt “compression” utilities.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/tools/doc-converter.ts`	New document conversion + cleanup pipeline to markdown.
`src/tools/doc-store.ts`	New CCR chunk store with outline generation, section fetch, and LRU eviction.
`src/tools/read.ts`	Wires conversion + CCR into `read()`, adds caching and optional compression passes.
`src/tools/shared.ts`	Adds `mode="outline"` option to the `read` tool schema.
`src/tools/index.ts`	Registers `read_doc_section` tool and wires `DocStore`/`CostTracker` into `read`.
`src/sdk.ts`	Instantiates `DocStore` and passes it into builtin tool creation for sessions.
`src/loader.ts`	Builds system prompt via `CacheAligner` to improve KV-cache efficiency.
`src/tools/cli.ts`	Compresses JSON CLI output via SmartCrusher before returning to the model.
`src/cost-tracker.ts`	Tracks doc-conversion “token savings” at the session level.
`src/compression/smart-crusher.ts`	New JSON compression utility (SmartCrusher).
`src/compression/code-compressor.ts`	New source-code noise stripping utility (CodeCompressor).
`src/compression/cache-aligner.ts`	New system-prompt prefix alignment utility (CacheAligner).
`src/compression/index.ts`	Barrel exports for the compression utilities.
`test-live-benchmark.mjs`	New live benchmark script (provider-aware) comparing CCR vs no-CCR.
`test-token-proof.mjs`	New “meeting demo” token proof script for CCR savings.
`test-compression.mjs`	New benchmark covering SmartCrusher/CodeCompressor/doc-converter/CacheAligner.
`package.json`	Adds new dependencies for conversion + compression features.
`package-lock.json`	Lockfile updates for the newly added dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+			// ── mode="outline": dry-run for CCR docs ──────────────────────
+			// Returns outline + token estimates without fetching content.
+			if (mode === "outline") {
+				if (docStore?.has(absolutePath)) {
+					const outline = docStore.getOutline(absolutePath)!;


+				const refsNote = result.has_refs.length > 0
+					? `\n\n[References other sections: ${result.has_refs.join(", ")} — fetch them if needed]`
+					: "";
+				return {
+					content: [{ type: "text", text: `${result.content}${refsNote}` }],
+					details: undefined,
+				};


+				const { path, section_id } = params as { path: string; section_id: string };
+				const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");
+				const result = docStore.fetch(path, section_id) ?? docStore.fetch(altPath, section_id);


+		const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]
+			.map((m) => m[1])
+			.filter((id) => id !== chunkId && doc.chunks.some((c) => c.id === id));


+const CONVERTIBLE = new Set([
+	".pdf", ".docx", ".doc",
+	".xlsx", ".xls",
+	".pptx", ".ppt",
+	".html", ".htm",
+]);


+	const efficiency = Math.round(cacheEfficiency(aligned) * 100);
+	console.error(
+		`[compression] System prompt: ${aligned.staticTokens + aligned.dynamicTokens} tokens` +
+		` | static prefix: ${aligned.staticTokens} tokens (${efficiency}% cache-eligible)`,
+	);


    "@sinclair/typebox": "^0.34.41",
+    "headroom-ai": "^0.22.4",
    "js-yaml": "^4.1.0",


+	store(filePath: string, chunks: DocChunk[]): void {
+		// Evict existing entry's char count before overwriting
+		const existing = this.docs.get(filePath);
+		if (existing) {
+			this.totalChars -= existing.chunks.reduce((s, c) => s + c.content.length, 0);


…n, legacy formats, logging - read.ts: cacheKey uses hash to prevent collisions between paths that differ only in separators - doc-store.ts: has_refs regex requires [sN] bracket syntax to avoid false positives from words like session3 - index.ts: use resolve() for altPath so relative paths match stored keys - read.ts: mode=outline now converts and chunks the doc if not already in store instead of returning an error - doc-converter.ts: remove .doc/.ppt — mammoth/JSZip cannot reliably parse legacy binary formats - loader.ts: gate system prompt metrics behind GITAGENT_DEBUG env var - package.json: remove unused headroom-ai dependency

shreyas-lyzr

All four issues from the previous review are resolved:

PDFParse import — pdf-parse v2.4.5 exports PDFParse as a named class, so the import and constructor call are correct for the pinned version. The previous concern was based on v1 behavior; no longer applies.
Cache key collision — fixed with a djb2 hash appended to the path. /tmp/a/b.pdf and /tmp/a_b.pdf now produce distinct keys.
has_refs false positives — fixed by requiring bracket syntax [sN] and filtering against actual chunk IDs from the store.
Path mismatch in read_doc_section — fixed by using resolve(config.dir, path) for altPath and trying both the raw and resolved paths.

Implementation matches the agreed spec. Good to merge.

stealthwhizz added 4 commits July 1, 2026 12:58

feat(benchmark): correct methodology + multi-provider support (Anthro…

2ad404a

…pic, OpenAI, Lyzr)

Merge branch 'open-gitagent:main' into feat/compression-clean

cc320f6

Copilot AI review requested due to automatic review settings July 2, 2026 11:05

Copilot started reviewing on behalf of stealthwhizz July 2, 2026 11:06 View session

shreyas-lyzr requested changes Jul 2, 2026

View reviewed changes

Copilot AI reviewed Jul 2, 2026

View reviewed changes

shreyas-lyzr approved these changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency#66

feat: doc-converter + CCR — PDF/DOCX/XLSX/PPTX reading and large doc token efficiency#66
stealthwhizz wants to merge 5 commits into
open-gitagent:mainfrom
stealthwhizz:feat/compression-clean

stealthwhizz commented Jul 2, 2026

Uh oh!

shreyas-lyzr left a comment

Uh oh!

shreyas-lyzr Jul 2, 2026

Uh oh!

shreyas-lyzr Jul 2, 2026

Uh oh!

shreyas-lyzr Jul 2, 2026

Uh oh!

shreyas-lyzr Jul 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

shreyas-lyzr left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	import { PDFParse } from "pdf-parse";
	import pdfParse from "pdf-parse";

-function cacheKey(absolutePath: string): string {
+function cacheKey(absolutePath: string): string {
+	// Use a hash to avoid collisions from path separator substitution
+	const hash = absolutePath.split("").reduce((h, c) => (Math.imul(31, h) + c.charCodeAt(0)) | 0, 0);
+	const safe = absolutePath.replace(/[/\\:]/g, "_").slice(-60);
+	return `${Math.abs(hash).toString(36)}_${safe}.md`;
+}

-		const has_refs = [...chunk.content.matchAll(/\[?(s\d+)\]?/g)]
+const has_refs = [...chunk.content.matchAll(/(?<![a-zA-Z])\[s(\d+)\]|(?<![a-zA-Z])s(\d+)(?!\w)/g)]
+	.map((m) => m[1] ?? m[2])
+	.filter((id) => {
+		const sectionId = `s${id}`;
+		return sectionId !== chunkId && doc.chunks.some((c) => c.id === sectionId);
+	});

	const altPath = path.startsWith("/") ? path : `${config.dir}/${path}`.replace(/\\/g, "/");
	const altPath = resolve(config.dir, path);

Uh oh!

Conversation

stealthwhizz commented Jul 2, 2026

Summary

What changed

API surface (locked)

Known limitations (v2)

Uh oh!

shreyas-lyzr left a comment

Choose a reason for hiding this comment

Uh oh!

shreyas-lyzr Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

shreyas-lyzr Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

shreyas-lyzr Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

shreyas-lyzr Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

shreyas-lyzr left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants