allanps/README.md
LinkedIn: allan-paulo

Allan Paulo de Souza

I build small LLM systems in Python. The common thread: put the guarantees in code, call the model only at typed edges, and report numbers a reader can reproduce. Every result below comes from a test run in the linked repo.

he/him. ML and genomics background; now working on agents, retrieval, and evaluation.

Python · pydantic · pytest · FastAPI · MIT

Start here: agentcore is the clearest example of the pattern; docquery applies it to retrieval.


Selected work

The repos share one idea: do not trust the model on the inside. Validate it, measure it, and refuse when it cannot answer.

| Repo | What it does | Measured |
| --- | --- | --- |
| agentcore | A template for reliable agents. The agent is a deterministic state machine; the LLM is called only at named, typed edges, and code owns transitions, routing, SLAs, and PII redaction. Worked example: support-ticket triage. | 0 of 336 adversarial / garbage model outputs broke a core invariant. 449 tests pass offline. |
| docquery | Eval-first RAG (retrieval-augmented generation) over your documents. Grounded answers with inline [1][2] citations and a refusal path when retrieval is weak. Default embeddings are all-MiniLM-L6-v2, which runs offline on CPU once cached; a deterministic hashing fallback keeps the demo running with no network. | recall@k 1.00, refusal accuracy 1.00, citation accuracy 0.875 on a labeled fixture. Answer-keyword accuracy 0.625, not rounded up. 35 tests. |
| llm-evals-mini | A small LLM-eval and guardrails harness. Runs an LLM-as-judge against human labels and reports Cohen's kappa, so you can see how much the judge agrees with people before it gates anything. Adds schema validation and a CI regression gate. | The bundled 20-row fixture runs a deterministic fake judge offline. Point it at your own labels and a real judge, where kappa drops below 1.0 and means something. 28 tests. |
| structext | Typed structured extraction from any provider. Pydantic schema in, validated object out, with automatic re-ask on invalid JSON and token/cost logging. | On invalid JSON it re-asks with the validation error; the test_retry_on_invalid_then_valid test covers the malformed-then-valid path. 15 tests. |
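
For readers new to the statistic llm-evals-mini reports, here is a minimal sketch of Cohen's kappa for a judge and a human labelling the same items. It is illustrative, not the repo's code: the function name `cohens_kappa`, the six example labels, and the choice to return 1.0 when chance agreement is exactly 1 are all assumptions.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa for two raters over the same items (hypothetical helper)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: for each label, multiply the two raters' marginal rates.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    p_e = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here (0/0); returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration only; these are not results from any repo.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail"]
print(round(cohens_kappa(judge, human), 3))  # 0.667
```

Raw agreement here is 5/6, but half of that is expected by chance given the two raters' label rates, so kappa comes out lower than the raw figure. That gap is the reason to report kappa rather than accuracy before a judge gates anything.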

Two applied PoCs sit alongside these: redacting-pii (an LLM pipeline that redacts PII from transcripts, with a prompt-improvement loop) and llmomics (LLM-driven generation of bioinformatics pipelines, from the genomics background).

Every repo is Python, MIT, and provider-agnostic. Anthropic Claude is the default adapter; a deterministic fake client makes the tests and demos run offline with no API key.
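
A rough sketch of how a provider seam, a deterministic fake client, and a re-ask loop can fit together. Everything named here (`LLMClient`, `FakeClient`, `Invoice`, `parse_invoice`, `extract`) is hypothetical, and the validation uses stdlib `json` plus a dataclass instead of pydantic so the example runs with no dependencies; the repos' actual interfaces may differ.

```python
import json
from dataclasses import dataclass
from typing import Protocol


class LLMClient(Protocol):
    """The provider seam: anything with complete() can be plugged in."""

    def complete(self, prompt: str) -> str: ...


class FakeClient:
    """Deterministic stand-in that replays scripted responses in order."""

    def __init__(self, responses: list[str]) -> None:
        self._responses = list(responses)
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self._responses.pop(0)


@dataclass(frozen=True)
class Invoice:
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Validate raw model output; raises ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor, total = data.get("vendor"), data.get("total")
    if not isinstance(vendor, str) or not vendor:
        raise ValueError("vendor must be a non-empty string")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return Invoice(vendor=vendor, total=float(total))


def extract(client: LLMClient, text: str, max_attempts: int = 2) -> Invoice:
    prompt = f"Return JSON with keys vendor and total for:\n{text}"
    for _ in range(max_attempts):
        raw = client.complete(prompt)
        try:
            return parse_invoice(raw)
        except ValueError as err:
            # Re-ask, feeding the validation error back to the model.
            prompt = f"{prompt}\n\nYour last reply was invalid: {err}. Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Scripted malformed-then-valid path, with no network and no API key.
client = FakeClient(['{"vendor": "Acme", "total": ', '{"vendor": "Acme", "total": 12.5}'])
invoice = extract(client, "Acme invoice, total 12.50")
print(invoice)  # Invoice(vendor='Acme', total=12.5)
```

Because the fake replays fixed strings, the retry path is exercised the same way on every run, which is what lets a test assert on it offline.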


How agentcore splits the model from the rules

The LLM only proposes values at three typed edges (classify, assess, write). A core function validates each proposal before any decision is made, and code owns every transition.

flowchart LR
  NEW[ticket in] --> CAT{categorize}
  CAT -->|enum, unknown to other| ROUTE{route}
  ROUTE -->|priority, queue, SLA| DRAFT{draft reply}
  DRAFT -->|redact PII, cap length| CLOSED[closed]
  E1([classify]) -.proposes.-> CAT
  E2([assess]) -.proposes.-> ROUTE
  E3([write]) -.proposes.-> DRAFT

The demo runs four tickets offline with the fake client. On the security ticket the model proposed priority=low; the core forced HIGH, routed to the security queue, and flagged human review. On a separate how-to ticket the core scrubbed an email and a card number from the drafted reply. That run is a test, not a screenshot:

Terminal session: agentcore runs offline with the fake client. On a security ticket the model returns priority=low and the core forces HIGH, routes to the security queue, and requires human review. On a how-to ticket the drafted reply has the email and card redacted. pytest reports 449 passed and 0 of 336 adversarial model outputs broke a core invariant.
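
The behavior described above can be approximated in a few lines. This is a sketch under stated assumptions, not agentcore's implementation: the enum members, the `general` queue, the NORMAL default, the 500-character cap, and both regexes are hypothetical. Only the outcomes (unknown categories map to other, security forces HIGH with human review, PII is scrubbed and the reply length is capped) come from the description.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    BILLING = "billing"
    HOW_TO = "how_to"
    SECURITY = "security"
    OTHER = "other"


class Priority(Enum):
    LOW = "low"
    NORMAL = "normal"
    HIGH = "high"


@dataclass(frozen=True)
class Route:
    priority: Priority
    queue: str
    needs_human: bool


def categorize(proposed: str) -> Category:
    """Accept the model's proposal only if it is a known enum value."""
    try:
        return Category(proposed.strip().lower())
    except ValueError:
        return Category.OTHER


def route(category: Category, proposed_priority: str) -> Route:
    """Code owns routing: a security ticket ignores the proposed priority."""
    if category is Category.SECURITY:
        return Route(Priority.HIGH, "security", needs_human=True)
    try:
        priority = Priority(proposed_priority.strip().lower())
    except ValueError:
        priority = Priority.NORMAL  # assumed default for unparseable proposals
    return Route(priority, "general", needs_human=False)


# Illustrative patterns only; regex redaction misses formats it does not match.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def finalize_reply(draft: str, max_chars: int = 500) -> str:
    """Redact PII from the drafted reply, then cap its length."""
    text = EMAIL.sub("[email]", draft)
    text = CARD.sub("[card]", text)
    return text[:max_chars]


print(route(categorize("Security"), "low"))
print(finalize_reply("Mail bob@example.com, card 4111 1111 1111 1111."))
# Mail [email], card [card].
```

The model's strings never reach a decision directly: each one is parsed into an enum or discarded, so a garbage proposal can change a value but not break an invariant.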


Writing

Short design-decision writeups: the problem, the option I chose, what I measured, and where it breaks.

| Date | Post |
| --- | --- |
| 2026 | A deterministic core with the LLM at the edges |
| 2026 | docquery: a small RAG service that cites its sources and refuses when retrieval is weak |
| 2026 | A small harness for calibrating an LLM-as-judge before it gates anything |

How I build

  • I write the eval before the feature and report what the test prints, including the number below 1.0.
  • The number lives next to the code that produces it. Each repo has a results table and a one-command way to reproduce it offline.
  • Limitations go in the README rather than being left out. Regex PII redaction misses formats it does not match. docquery's no-network fallback is a deterministic hashing embedder that is not semantically meaningful, and the reported retrieval numbers use the real MiniLM model. A 20-row fixture is a demo, not a benchmark.
  • I keep a provider seam with a deterministic fake, so the whole suite runs with no key.
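
To make the hashing-fallback limitation above concrete, here is a minimal sketch of a deterministic hashing embedder with a refusal threshold. It is one plausible reading, not docquery's code: `hash_embed`, `answer`, the 64-dimension size, and the 0.3 cutoff are all assumptions. It matches on shared tokens only, which is why such a fallback is not semantically meaningful.

```python
import hashlib
import math
import re

REFUSAL = "I can't answer that from the indexed documents."


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Bucket each token by a stable hash, then L2-normalize the counts."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit-length (or all zeros), so the dot product suffices.
    return sum(x * y for x, y in zip(a, b))


def answer(query: str, docs: list[str], min_score: float = 0.3) -> str:
    """Return the best-matching document with a citation, or refuse."""
    if not docs:
        return REFUSAL
    q = hash_embed(query)
    best_score, best_index = max(
        (cosine(q, hash_embed(doc)), i) for i, doc in enumerate(docs)
    )
    if best_score < min_score:
        return REFUSAL  # retrieval too weak to ground an answer
    return f"{docs[best_index]} [{best_index + 1}]"


docs = ["refunds take five days", "reset your password in settings"]
print(answer("refunds take five days", docs))
print(answer("???", docs))  # no tokens, so every score is 0.0 and the call refuses
```

Using `hashlib` instead of the built-in `hash()` keeps the buckets identical across processes, since Python randomizes string hashes per run by default.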

Open to AI engineering and applied ML work, remote. Reach me on LinkedIn.

Pinned

  1. agentcore (Public)

    Deterministic core, LLM at the edges: a tiny template for agents whose guarantees live in code, not in the prompt.

    Python

  2. docquery (Public)

    Minimal, eval-first RAG over your documents: grounded answers with inline citations, a refusal path, and a reproducible eval report.

    Python

  3. llm-evals-mini (Public)

    A tiny, honest LLM-eval and guardrails harness: calibrated LLM-as-judge, schema validation, and a CI regression gate.

    Python

  4. redacting-pii (Public)

    PoC: redacting PII from interview transcripts

    Python

  5. structext (Public)

    Reliable typed extraction from any LLM: pydantic schema in, validated object out, with automatic re-ask on validation failure.

    Python

  6. llmomics (Public)

    LLM-driven generation of bioinformatics pipelines.

    Python 1