A minimal, eval-first RAG service over your own documents. It retrieves with real
local embeddings, answers with inline [1][2] citations, and refuses when the
context does not support an answer. Every claim below is produced by code in this
repo.
The numbers below are produced by the bundled eval (python -m docquery.eval)
against a labeled fixture corpus, using the real all-MiniLM-L6-v2 embeddings and
the deterministic Fake LLM client (no API key needed):
| Metric | Score | Notes |
|---|---|---|
| Retrieval recall@k | 1.000 | k=4, 8 in-scope questions |
| Citation accuracy | 0.875 | share of answers whose cited chunks all come from the supporting document |
| Refusal accuracy | 1.000 | 4 out-of-scope questions all refused |
| Answer keyword accuracy | 0.625 | expected keyword present in the extractive answer |
The eval writes this exact table to eval_report.md. The answer synthesis uses
the deterministic Fake client, so the run is reproducible with no API key. If
the MiniLM model cannot be downloaded, the eval falls back to a deterministic
offline embedder and still scores recall@k 1.000 and refusal accuracy 1.000
on this corpus (citation accuracy rises to 1.000 there because the lexical
backend ranks the supporting chunk first on every in-scope question). See
Limitations for why answer keyword accuracy is 0.625.
No API key is needed. The first run downloads the ~90 MB MiniLM model once, and every later run is fully offline. With no network, the pipeline falls back to the offline embedder automatically.
python -m venv .venv && . .venv/bin/activate
pip install -e .
python -m docquery.eval # writes eval_report.md, prints the table
python -m docquery.cli ask "What is the largest planet?" -d fixtures/corpus

Expected ask output (real MiniLM backend; the * marks chunks the answer
cited):
Jupiter is the largest planet in the Solar System. [1] Saturn is best known for its prominent ring system, made of ice and rock
particles. [2]
(top similarity 0.707)
Sources:
[1]* solar_system.md (score 0.707)
[2]* solar_system.md (score 0.492)
[3] solar_system.md (score 0.444)
[4] solar_system.md (score 0.430)
An out-of-scope question is refused instead of answered:
python -m docquery.cli ask "What is the capital of Australia?" -d fixtures/corpus
# I don't have enough context to answer that.
# (refused; top similarity 0.187 < 0.25)

pip:
pip install -e . # core: numpy, pypdf, sentence-transformers
pip install -e ".[anthropic]" # real Claude answer synthesis
pip install -e ".[api]" # FastAPI /ask endpoint
pip install -e ".[dev]" # pytest + ruff

uv:
uv venv && uv pip install -e ".[dev]"

flowchart TD
subgraph Ingest["Ingest / Index (build time)"]
A[".md / .txt / .pdf files"] --> B["chunk_text<br/>paragraph-aware,<br/>max_chars=600, overlap=80"]
B --> C["embed chunks<br/>MiniLM, hashing fallback"]
C --> D[("VectorIndex<br/>unit-norm matrix")]
end
Q["question"] --> E["embed query"]
D --> R
E --> R["retrieve top-k<br/>cosine, k=4 descending"]
R --> G{"results and<br/>top_score >= 0.25?"}
G -->|"no / weak"| REF1["refuse<br/>(out of scope)"]
G -->|"yes"| L["LLM answer<br/>Anthropic / FakeClient"]
L --> H{"LLM returned<br/>refusal text?"}
H -->|"yes"| REF2["refuse"]
H -->|"no"| CIT["parse [n] citations"]
CIT --> ANS["answer with<br/>inline citations"]
Documents are chunked, embedded, and stored in an in-memory cosine index. A query retrieves the top-k chunks; if the top score clears the threshold, an LLM answers with inline [n] citations, and otherwise the pipeline refuses. A second refusal path covers the case where the LLM itself reports that it lacks the context.
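The two refusal paths and the citation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in pipeline.py: the names answer_or_refuse, parse_citations, and the synthesize callable are hypothetical, and only the refusal message and the 0.25 threshold are taken from this README.

```python
import re
from typing import Callable, List, Tuple

REFUSAL = "I don't have enough context to answer that."


def parse_citations(answer: str, n_chunks: int) -> List[int]:
    """Return distinct, in-range [n] markers in order of first appearance."""
    seen: List[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= n_chunks and n not in seen:
            seen.append(n)
    return seen


def answer_or_refuse(
    scored_chunks: List[Tuple[str, float]],
    synthesize: Callable[[List[str]], str],
    threshold: float = 0.25,
) -> Tuple[str, List[int]]:
    """scored_chunks holds (text, cosine score) pairs, best score first."""
    # Refusal path 1: nothing retrieved, or the best match is too weak.
    if not scored_chunks or scored_chunks[0][1] < threshold:
        return REFUSAL, []
    texts = [text for text, _ in scored_chunks]
    answer = synthesize(texts)
    # Refusal path 2: the LLM itself reports that the context is insufficient.
    if REFUSAL.lower() in answer.lower():
        return REFUSAL, []
    return answer, parse_citations(answer, len(texts))


# Hypothetical usage with a stand-in for the LLM client.
chunks = [("Jupiter is the largest planet.", 0.707), ("Saturn has rings.", 0.492)]
print(answer_or_refuse(chunks, lambda texts: f"{texts[0]} [1]"))
print(answer_or_refuse([("unrelated text", 0.187)], lambda texts: "unused"))
```

Checking the threshold before calling the LLM means a weak retrieval never costs a model call, while the second check still catches cases where the retrieved context scores well but does not actually contain the answer.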
The pipeline is five small steps, each in its own module:
- Ingest (chunking.py) reads .md, .txt, and .pdf (PDF via pypdf), then splits each document into overlapping chunks that keep paragraphs whole when they fit.
- Embed (embeddings.py) encodes chunks with sentence-transformers all-MiniLM-L6-v2, which runs offline on CPU once cached. If the model cannot be loaded (no cache and no network), it falls back to a deterministic hashing embedder so the pipeline still works with zero network. The fallback is not semantically rich; it exists for reproducibility.
- Index (index.py) stores unit-normalized vectors in a numpy matrix. A query is one matrix-vector product (cosine similarity); top-k is exact. No FAISS, no external service.
- Synthesize (llm.py) sends the question and the numbered context to an LLMClient. AnthropicClient (default model claude-sonnet-4-6) is the real adapter; the Anthropic SDK is imported lazily, so the package imports with no SDK and no key. FakeClient is deterministic and extractive, and powers both the tests and the offline demo.
- Ground and refuse (pipeline.py) returns the answer with parsed citation indices, and refuses with "I don't have enough context to answer that." when the top similarity is below the threshold.
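As an illustration of the Ingest step, here is one way paragraph-aware chunking with overlap could look. It is a sketch, not the repo's implementation: only the name chunk_text and the defaults max_chars=600 and overlap=80 come from the diagram above, while the greedy packing strategy and the hard split for oversized paragraphs are assumptions.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 600, overlap: int = 80) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    sep = "\n\n"
    room = max_chars - overlap - len(sep)  # space left after an overlap prefix
    if overlap < 0 or room <= 0:
        raise ValueError("overlap must be non-negative and well below max_chars")

    # Step 1: one piece per paragraph; hard-split only paragraphs that cannot fit.
    pieces: List[str] = []
    for para in (p.strip() for p in text.split(sep)):
        if not para:
            continue
        if len(para) <= max_chars:
            pieces.append(para)
        else:
            pieces.extend(para[i:i + room] for i in range(0, len(para), room))

    # Step 2: pack pieces greedily; seed each new chunk with the previous tail.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph fits whole
            continue
        chunks.append(current)
        tail = current[-overlap:] if overlap else ""
        seeded = tail + sep + piece
        # Keep the paragraph whole even if that means dropping the overlap.
        current = seeded if tail and len(seeded) <= max_chars else piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical usage: three 300-character paragraphs become three chunks.
doc = "\n\n".join(["a" * 300, "b" * 300, "c" * 300])
for chunk in chunk_text(doc):
    print(len(chunk))
```

The overlap here is a raw character tail, so it can start mid-word. A production chunker would more likely cut the tail at a sentence or word boundary.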
export ANTHROPIC_API_KEY=sk-ant-...
python -m docquery.cli ask "What does the HTTP 404 status code mean?" \
  -d fixtures/corpus --anthropic --model claude-sonnet-4-6

Available models: claude-opus-4-8, claude-sonnet-4-6 (default, for cost),
claude-haiku-4-5-20251001.
pip install -e ".[api]"
uvicorn docquery.api:app
curl -s localhost:8000/ask -H 'content-type: application/json' \
  -d '{"question": "Which planet is closest to the Sun?"}'

eval.py runs the full pipeline against fixtures/qa.json, a labeled Q/A set
with both in-scope and out-of-scope questions, and computes:
- Retrieval recall@k: did the chunk from the labeled source document appear in the top-k?
- Citation accuracy: does every [n] the answer cites resolve to a chunk drawn from the question's supporting document?
- Refusal accuracy: were the out-of-scope questions refused?
- Answer keyword accuracy: did the answer contain the expected keyword?
All four run offline with the Fake client and local embeddings, so the report is reproducible. Add your own documents under a directory and your own labeled questions to measure your corpus the same way.
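The four metrics above reduce to simple per-question checks averaged over the relevant subset. The sketch below shows that arithmetic on hand-written records. The Result fields and the score function are hypothetical and do not reflect the actual layout of eval.py or fixtures/qa.json. Treating an answer with no citations as a citation failure is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """One evaluated question. All field names here are hypothetical."""
    in_scope: bool
    source: Optional[str]    # labeled supporting document, None if out of scope
    keyword: Optional[str]   # expected answer keyword, None if out of scope
    retrieved: List[str]     # source document of each top-k chunk, in rank order
    cited: List[int]         # 1-based [n] indices parsed from the answer
    answer: str = ""
    refused: bool = False


def _mean(flags: List[bool]) -> float:
    return sum(flags) / len(flags) if flags else 0.0


def _citations_ok(r: Result) -> bool:
    # Every cited index must resolve to a chunk from the supporting document.
    return bool(r.cited) and all(
        1 <= n <= len(r.retrieved) and r.retrieved[n - 1] == r.source
        for n in r.cited
    )


def score(results: List[Result]) -> Dict[str, float]:
    inside = [r for r in results if r.in_scope]
    outside = [r for r in results if not r.in_scope]
    return {
        "recall_at_k": _mean([r.source in r.retrieved for r in inside]),
        "citation_accuracy": _mean([_citations_ok(r) for r in inside]),
        "refusal_accuracy": _mean([r.refused for r in outside]),
        "keyword_accuracy": _mean([
            r.keyword is not None and r.keyword.lower() in r.answer.lower()
            for r in inside
        ]),
    }


# Hypothetical records: two in-scope questions and one out-of-scope question.
demo = [
    Result(True, "a.md", "jupiter", ["a.md", "b.md"], [1], "Jupiter is largest. [1]"),
    Result(True, "a.md", "mercury", ["b.md", "a.md"], [1, 2], "Saturn has rings. [1][2]"),
    Result(False, None, None, ["b.md"], [], refused=True),
]
print(score(demo))
```

In the second record the supporting document is retrieved, so recall counts it, but the answer cites a chunk from another document and misses the keyword, so both citation accuracy and keyword accuracy drop to 0.5 for this toy set.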
pip install -e ".[dev]"
pytest

The suite (32 tests) covers chunking, retrieval ordering, citation mapping, the refusal threshold, and the eval producing real numbers. It runs with no network and no API key: a deterministic Fake LLM client handles synthesis, and the tests force the offline hashing embedder so no model download is required.
- Answer keyword accuracy is 0.625 offline. The deterministic FakeClient is extractive: it stitches the first sentence of the top chunks together. When the answer phrase lives in a later sentence, the exact keyword can be missing even though retrieval and citations are correct. With AnthropicClient the synthesized answer phrases the result directly, which raises this metric; the repo reports the offline number because it is the one anyone can reproduce without a key.
- The hashing fallback embedder is lexical, not semantic. It matches on token overlap, so paraphrases with no shared words retrieve poorly. It exists only so the pipeline and eval run with no network. The real all-MiniLM-L6-v2 backend handles paraphrase; install it and let get_embedder() pick it up.
- Retrieval is brute-force cosine over an in-memory matrix. This is the right choice for small, self-contained corpora and keeps the dependency surface to numpy. It is not built for millions of chunks.
- The refusal threshold (0.25) is a single global cutoff tuned on the fixture corpus. A different corpus or embedder may want a different value; it is a constructor argument and a CLI flag.
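To make the lexical and brute-force limitations in the list above concrete, here is a small standard-library sketch of a hashing embedder with exhaustive cosine ranking. It is an assumption about how such a fallback could work, not the code in embeddings.py or index.py: hash_embed, cosine, top_k, the 256-dimension size, and the SHA-256 bucketing are all hypothetical choices, and the repo itself uses a numpy matrix rather than Python lists.

```python
import hashlib
import math
import re
from typing import List, Tuple


def hash_embed(text: str, dim: int = 256) -> List[float]:
    """Count tokens into hashed buckets, then scale to unit length."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib keeps bucketing stable across runs (built-in hash() is salted).
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: str, chunks: List[str], k: int = 4) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the k best (exact, O(n))."""
    q = hash_embed(query)
    scored = [(cosine(q, hash_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical usage: shared words score well, a paraphrase does not.
corpus = ["Jupiter is the largest planet", "HTTP 404 means not found"]
print(top_k("largest planet", corpus, k=2))
print(cosine(hash_embed("largest planet"), hash_embed("biggest world")))
```

The last line compares two phrases with no tokens in common, so the score is 0 unless two tokens happen to collide in the same bucket. That is the behaviour a semantic model such as all-MiniLM-L6-v2 avoids.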
MIT. Copyright (c) 2026 Allan Paulo de Souza. See LICENSE.