allanps/docquery

docquery

A minimal, eval-first RAG service over your own documents. It retrieves with real local embeddings, answers with inline [1][2] citations, and refuses when the context does not support an answer. Every claim below is produced by code in this repo.

Results

Produced by the bundled eval (python -m docquery.eval) against a labeled fixture corpus, with the real all-MiniLM-L6-v2 embeddings and the deterministic Fake LLM client (no API key needed):

| Metric | Score | Notes |
| --- | --- | --- |
| Retrieval recall@k | 1.000 | k=4, 8 in-scope questions |
| Citation accuracy | 0.875 | every cited chunk supports the claim |
| Refusal accuracy | 1.000 | 4 out-of-scope questions all refused |
| Answer keyword accuracy | 0.625 | expected keyword present in the extractive answer |

The eval writes this exact table to eval_report.md. Because answer synthesis uses the deterministic Fake client, the run is reproducible with no API key. If the MiniLM model cannot be downloaded, the eval falls back to a deterministic offline embedder and still scores recall@k 1.000 and refusal accuracy 1.000 on this corpus. Citation accuracy rises to 1.000 in that mode because the lexical backend ranks the supporting chunk first on every in-scope question. See Limitations for why answer keyword accuracy is 0.625.

Quickstart

No API key is required. The first run downloads the ~90 MB MiniLM model once, after which everything runs offline; with no network, it falls back to the offline embedder automatically.

python -m venv .venv && . .venv/bin/activate
pip install -e .
python -m docquery.eval                 # writes eval_report.md, prints the table
python -m docquery.cli ask "What is the largest planet?" -d fixtures/corpus

Expected ask output (real MiniLM backend; the * marks chunks the answer cited):

Jupiter is the largest planet in the Solar System. [1] Saturn is best known for its prominent ring system, made of ice and rock
particles. [2]

(top similarity 0.707)
Sources:
  [1]* solar_system.md  (score 0.707)
  [2]* solar_system.md  (score 0.492)
  [3]  solar_system.md  (score 0.444)
  [4]  solar_system.md  (score 0.430)

An out-of-scope question is refused instead of answered:

python -m docquery.cli ask "What is the capital of Australia?" -d fixtures/corpus
# I don't have enough context to answer that.
# (refused; top similarity 0.187 < 0.25)

Install

pip:

pip install -e .                 # core: numpy, pypdf, sentence-transformers
pip install -e ".[anthropic]"    # real Claude answer synthesis
pip install -e ".[api]"          # FastAPI /ask endpoint
pip install -e ".[dev]"          # pytest + ruff

uv:

uv venv && uv pip install -e ".[dev]"

Architecture

flowchart TD
    subgraph Ingest["Ingest / Index (build time)"]
        A[".md / .txt / .pdf files"] --> B["chunk_text<br/>paragraph-aware,<br/>max_chars=600, overlap=80"]
        B --> C["embed chunks<br/>MiniLM, hashing fallback"]
        C --> D[("VectorIndex<br/>unit-norm matrix")]
    end

    Q["question"] --> E["embed query"]
    D --> R
    E --> R["retrieve top-k<br/>cosine, k=4 descending"]
    R --> G{"results and<br/>top_score >= 0.25?"}

    G -->|"no / weak"| REF1["refuse<br/>(out of scope)"]
    G -->|"yes"| L["LLM answer<br/>Anthropic / FakeClient"]
    L --> H{"LLM returned<br/>refusal text?"}
    H -->|"yes"| REF2["refuse"]
    H -->|"no"| CIT["parse [n] citations"]
    CIT --> ANS["answer with<br/>inline citations"]

Documents are chunked, embedded, and stored in an in-memory cosine index; a query retrieves the top-k chunks and, if the top score clears the threshold, an LLM answers with inline [n] citations, otherwise the pipeline refuses. A second refusal path covers the case where the LLM itself reports it lacks the context.
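
The control flow described above can be sketched in a few lines. This is a minimal illustration, not the code in pipeline.py: `answer_or_refuse`, `Result`, and the `llm_answer` callable are hypothetical names, and the rule that out-of-range citation numbers are dropped is an assumption. Only the 0.25 threshold, the refusal sentence, and the `[n]` citation format come from this README.

```python
import re
from dataclasses import dataclass, field

REFUSAL = "I don't have enough context to answer that."
CITATION_RE = re.compile(r"\[(\d+)\]")


@dataclass
class Result:
    answer: str
    refused: bool
    cited: list[int] = field(default_factory=list)  # 1-based chunk numbers


def answer_or_refuse(scores, llm_answer, threshold=0.25):
    """scores: similarities of the retrieved chunks, best first.
    llm_answer: zero-argument callable returning the model's text (hypothetical hook)."""
    # First refusal path: nothing retrieved, or the best match is too weak.
    if not scores or scores[0] < threshold:
        return Result(REFUSAL, refused=True)
    text = llm_answer()
    # Second refusal path: the model itself reports that it lacks context.
    if REFUSAL.lower() in text.lower():
        return Result(REFUSAL, refused=True)
    # Keep citation numbers that point at a retrieved chunk, in first-seen order.
    cited = []
    for match in CITATION_RE.finditer(text):
        n = int(match.group(1))
        if 1 <= n <= len(scores) and n not in cited:
            cited.append(n)
    return Result(text, refused=False, cited=cited)


ok = answer_or_refuse(
    [0.707, 0.492, 0.444, 0.430],
    lambda: "Jupiter is the largest planet in the Solar System. [1]",
)
print(ok.refused, ok.cited)  # False [1]

weak = answer_or_refuse([0.187], lambda: "never called")
print(weak.refused)  # True
```

Checking the retrieval score before calling the model means a weak match costs no LLM call at all, which is one reason to keep the two refusal paths separate.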

How it works

The pipeline is five small steps, each in its own module:

  1. Ingest (chunking.py) reads .md, .txt, and .pdf (PDF via pypdf), then splits each document into overlapping chunks that keep paragraphs whole when they fit.
  2. Embed (embeddings.py) encodes chunks with sentence-transformers all-MiniLM-L6-v2, which runs offline on CPU once cached. If the model cannot be loaded (no cache and no network), it falls back to a deterministic hashing embedder so the pipeline still works with zero network. The fallback is not semantically rich; it exists for reproducibility.
  3. Index (index.py) stores unit-normalized vectors in a numpy matrix. A query is one matrix-vector product (cosine similarity); top-k is exact. No FAISS, no external service.
  4. Synthesize (llm.py) sends the question and the numbered context to an LLMClient. AnthropicClient (default model claude-sonnet-4-6) is the real adapter; the Anthropic SDK is imported lazily, so the package imports with no SDK and no key. FakeClient is deterministic and extractive, and powers both the tests and the offline demo.
  5. Ground and refuse (pipeline.py) returns the answer with parsed citation indices. It refuses with "I don't have enough context to answer that." when the top similarity is below the threshold, or when the LLM itself returns a refusal.
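
As an illustration of step 1, here is one plausible reading of "paragraph-aware, max_chars=600, overlap=80". It is a sketch under stated assumptions, not the contents of chunking.py: the name `chunk_text_sketch` is hypothetical, paragraphs are assumed to be separated by blank lines, and the overlap is assumed to apply only when a single paragraph is too long to fit in one chunk.

```python
def chunk_text_sketch(text, max_chars=600, overlap=80):
    """Pack whole paragraphs greedily; window over paragraphs that cannot fit."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph fits, keep it whole
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
            current = ""
        if len(para) <= max_chars:
            current = para  # start a new chunk with this paragraph
            continue
        # Oversized paragraph: slide a window so neighbours share `overlap` chars.
        step = max_chars - overlap
        for start in range(0, len(para), step):
            chunks.append(para[start:start + max_chars])
            if start + max_chars >= len(para):
                break
    if current:
        chunks.append(current)
    return chunks


doc = "Short intro.\n\nAnother short paragraph.\n\n" + "x" * 1000
for chunk in chunk_text_sketch(doc):
    print(len(chunk))
```

With the defaults, the two short paragraphs share one chunk and the 1000-character paragraph becomes two windows whose boundary overlaps by 80 characters.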

Using Claude for real answers

export ANTHROPIC_API_KEY=sk-ant-...
python -m docquery.cli ask "What does the HTTP 404 status code mean?" \
    -d fixtures/corpus --anthropic --model claude-sonnet-4-6

Available models: claude-opus-4-8, claude-sonnet-4-6 (default, for cost), claude-haiku-4-5-20251001.

HTTP API

pip install -e ".[api]"
uvicorn docquery.api:app
curl -s localhost:8000/ask -H 'content-type: application/json' \
    -d '{"question": "Which planet is closest to the Sun?"}'

The eval

eval.py runs the full pipeline against fixtures/qa.json, a labeled Q/A set with both in-scope and out-of-scope questions, and computes:

  • Retrieval recall@k: did the chunk from the labeled source document appear in the top-k?
  • Citation accuracy: does every [n] the answer cites resolve to a chunk drawn from the question's supporting document?
  • Refusal accuracy: were the out-of-scope questions refused?
  • Answer keyword accuracy: did the answer contain the expected keyword?

All four run offline with the Fake client and local embeddings, so the report is reproducible. Add your own documents under a directory and your own labeled questions to measure your corpus the same way.
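
The four metrics reduce to simple ratios. The sketch below shows one way to compute them; it is not the code in eval.py. The record fields (`in_scope`, `source`, `retrieved_docs`, `cited_docs`, `refused`, `answer`, `keyword`) and the file name `http.md` are hypothetical, and the schema of fixtures/qa.json may differ. It also assumes that an answer with no citations counts as a citation miss.

```python
def mean(flags):
    """Fraction of true values; 0.0 for an empty sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def score(records):
    in_scope = [r for r in records if r["in_scope"]]
    out_of_scope = [r for r in records if not r["in_scope"]]
    return {
        # Did the labeled source document appear among the top-k results?
        "recall_at_k": mean(r["source"] in r["retrieved_docs"] for r in in_scope),
        # Does every cited chunk come from the labeled source document?
        "citation_accuracy": mean(
            bool(r["cited_docs"]) and all(d == r["source"] for d in r["cited_docs"])
            for r in in_scope
        ),
        # Were the out-of-scope questions refused?
        "refusal_accuracy": mean(r["refused"] for r in out_of_scope),
        # Does the answer contain the expected keyword (case-insensitive)?
        "keyword_accuracy": mean(
            r["keyword"].lower() in r["answer"].lower() for r in in_scope
        ),
    }


records = [
    {"in_scope": True, "source": "solar_system.md",
     "retrieved_docs": ["solar_system.md", "solar_system.md"],
     "cited_docs": ["solar_system.md"], "refused": False,
     "answer": "Jupiter is the largest planet.", "keyword": "Jupiter"},
    {"in_scope": True, "source": "http.md",
     "retrieved_docs": ["http.md", "solar_system.md"],
     "cited_docs": ["http.md", "solar_system.md"], "refused": False,
     "answer": "The server could not find the resource.", "keyword": "404"},
    {"in_scope": False, "refused": True},
]
print(score(records))
# {'recall_at_k': 1.0, 'citation_accuracy': 0.5, 'refusal_accuracy': 1.0, 'keyword_accuracy': 0.5}
```

Recall, citation, and keyword accuracy are averaged over in-scope questions only, which is consistent with the reported 0.875 and 0.625 being multiples of 1/8 for the 8 in-scope questions.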

Tests

pip install -e ".[dev]"
pytest

The suite (32 tests) covers chunking, retrieval ordering, citation mapping, the refusal threshold, and the eval producing real numbers. It runs with no network and no API key: a deterministic Fake LLM client handles synthesis, and the tests force the offline hashing embedder so no model download is required.

Limitations

  • Answer keyword accuracy is 0.625 offline. The deterministic FakeClient is extractive: it stitches the first sentence of the top chunks together. When the answer phrase lives in a later sentence, the exact keyword can be missing even though retrieval and citations are correct. With AnthropicClient the synthesized answer phrases the result directly, which raises this metric; the repo reports the offline number because it is the one anyone can reproduce without a key.
  • The hashing fallback embedder is lexical, not semantic. It matches on token overlap, so paraphrases with no shared words retrieve poorly. It exists only so the pipeline and eval run with no network. The real all-MiniLM-L6-v2 backend handles paraphrase; install it and let get_embedder() pick it up.
  • Retrieval is brute-force cosine over an in-memory matrix. This is the right choice for small, self-contained corpora and keeps the dependency surface to numpy. It is not built for millions of chunks.
  • The refusal threshold (0.25) is a single global cutoff tuned on the fixture corpus. A different corpus or embedder may need a different value; it is exposed as a constructor argument and a CLI flag.
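
To make the second and third limitations concrete, here is a small lexical hashing embedder paired with brute-force cosine search. It is a sketch under assumptions rather than the code in embeddings.py or index.py: `hash_embed`, `cosine`, and `top_k` are hypothetical names, SHA-256 bucketing is just one way to get a deterministic hash, and sparse dicts stand in for the numpy matrix so the example needs only the standard library. The HTTP sentence is an illustrative chunk, not a quote from the fixture corpus.

```python
import hashlib
import math
import re

DIM = 2 ** 20  # bucket count; large so unrelated tokens rarely collide


def hash_embed(text, dim=DIM):
    """Map text to a sparse unit vector: {bucket index: weight}."""
    counts = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        counts[bucket] = counts.get(bucket, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {b: v / norm for b, v in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, which equals their cosine."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(i, 0.0) for i, w in a.items())


def top_k(query, chunks, k=4):
    """Score every chunk against the query (brute force) and keep the best k."""
    q = hash_embed(query)
    scored = [(cosine(q, hash_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "Jupiter is the largest planet in the Solar System.",
    "HTTP 404 means the requested resource was not found.",
]
# Shared tokens ("largest", "planet") put the Jupiter chunk first.
print(top_k("largest planet", chunks, k=1))
# A paraphrase with no shared tokens scores 0.0, barring a rare bucket collision.
print(top_k("biggest world", chunks, k=1))
```

Every query scores every chunk, so cost grows linearly with corpus size. That is fine for a small corpus and is the reason this design does not scale to millions of chunks.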

License

MIT. Copyright (c) 2026 Allan Paulo de Souza. See LICENSE.
