allanps/agentcore

agentcore

Put the hard guarantees in code, not in the prompt. The agent is a deterministic state machine; the LLM is called only at named, typed edges. A model that returns garbage cannot break a transition, a routing rule, an SLA, or a guardrail.

from agentcore import triage, Ticket, FakeLLMClient

# Offline, no API key. The fake model returns priority "low" for a security ticket.
client = FakeLLMClient(responses=[
    {"category": "security"},
    {"priority": "low"},                       # the model lowballs it...
    {"reply": "ping me at admin@acme.com"},     # ...and leaks an email
])

d = triage(client, Ticket(id="T-9", subject="account takeover", body="someone is in my account"))

print(d.state.value)        # closed
print(d.priority.value)     # high      <- core forced HIGH, ignored the model
print(d.queue)              # security  <- security routing, not the default
print(d.requires_human)     # True      <- mandatory human review
print(d.reply)              # ping me at [REDACTED_EMAIL]  <- PII scrubbed

The model said "low priority, no PII problem." The deterministic core overrode all of it. That is the whole idea.

Without vs with a deterministic core

A pure-prompt agent puts the rules in the system prompt: "classify the ticket, route security issues to the security team, redact PII, never auto-close an account issue without a human." Every one of those is a request the model may or may not honor. A jailbreak, a bad sample, schema drift, or a model upgrade can silently break any of them, and you only find out in production.

agentcore keeps the model on the outside. It is asked three narrow questions (what category, what priority, draft a reply) and nothing it answers is trusted until code validates it:

| Concern | Pure-prompt agent | Deterministic core |
| --- | --- | --- |
| State transitions | Model "remembers" the flow | Enforced table; illegal jumps raise |
| Category | Whatever string the model emits | Coerced to a fixed enum, unknown -> other |
| Routing / SLA | Described in the prompt | Pure functions over validated enums |
| Security escalation | "Please prioritize security" | Forced HIGH + security queue + human review, in code |
| PII in replies | "Don't include PII" | Regex redaction before the reply leaves the system |
| Reply length | "Keep it short" | Hard character cap |
| Testable offline | No (needs the model) | Yes (model faked) |
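
As a rough illustration of the "Category" and "Security escalation" rows, coercion and the forced-HIGH rule might look like the sketch below. The names `coerce_category` and `coerce_priority` appear in the architecture diagram, but the enum members other than security, other, low, and high, the `MEDIUM` fallback, and the helper `effective_priority` are assumptions made for this example, not the contents of core.py.

```python
from enum import Enum


class Category(Enum):
    # Illustrative members; the real enum in core.py may differ.
    BILLING = "billing"
    SECURITY = "security"
    OTHER = "other"


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def coerce_category(raw: object) -> Category:
    """Map any raw model output onto the fixed enum; unknown values become OTHER."""
    if isinstance(raw, str):
        try:
            return Category(raw.strip().lower())
        except ValueError:
            pass
    return Category.OTHER


def coerce_priority(raw: object) -> Priority:
    """Same idea for priority; the MEDIUM fallback is an assumption of this sketch."""
    if isinstance(raw, str):
        try:
            return Priority(raw.strip().lower())
        except ValueError:
            pass
    return Priority.MEDIUM


def effective_priority(category: Category, raw_priority: object) -> Priority:
    """Hypothetical helper: security tickets are HIGH regardless of the model."""
    if category is Category.SECURITY:
        return Priority.HIGH
    return coerce_priority(raw_priority)
```

Because the rule is an ordinary branch over a validated enum, a model that answers "low" for a security ticket has no way to influence the result.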

Architecture

stateDiagram-v2
    direction LR
    [*] --> NEW

    NEW --> CATEGORIZED: classify_intent (LLM)<br/>core.apply_categorization<br/>coerce_category -> OTHER
    CATEGORIZED --> ROUTED: assess_priority (LLM)<br/>core.route<br/>coerce_priority, queue, SLA, human-flag
    ROUTED --> DRAFTED: draft_reply (LLM)<br/>core.attach_reply<br/>redact PII, cap 1200 chars
    DRAFTED --> CLOSED: core.close (rules only)
    CLOSED --> [*]

    note right of NEW
        LLM is called only at the three labeled edges.
        Each returns a raw, untrusted value.
        core.py coerces it to a valid enum and applies
        routing / SLA / guardrail rules before the state
        advances via transition().
    end note

A ticket moves through a linear state machine whose transitions are enforced in core.py. The LLM is consulted only at the three labeled edges in edges.py; each edge returns an untrusted raw value that the core coerces to a valid enum and validates before the state advances.
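
A minimal sketch of an enforced transition table, assuming a frozen `Decision` dataclass and an exception named `IllegalTransition` (both hypothetical here; only the five state names and the `transition` function name come from this README):

```python
from dataclasses import dataclass, replace
from enum import Enum


class State(Enum):
    NEW = "new"
    CATEGORIZED = "categorized"
    ROUTED = "routed"
    DRAFTED = "drafted"
    CLOSED = "closed"


# Each state lists the only states it may move to; the pipeline is linear.
ALLOWED = {
    State.NEW: {State.CATEGORIZED},
    State.CATEGORIZED: {State.ROUTED},
    State.ROUTED: {State.DRAFTED},
    State.DRAFTED: {State.CLOSED},
    State.CLOSED: set(),
}


class IllegalTransition(Exception):
    """Raised when code attempts a jump that the table does not allow."""


@dataclass(frozen=True)
class Decision:
    # The real Decision carries more fields (priority, queue, reply, ...).
    state: State = State.NEW


def transition(decision: Decision, target: State) -> Decision:
    """Return a new Decision in the target state, or raise on an illegal jump."""
    if target not in ALLOWED[decision.state]:
        raise IllegalTransition(f"{decision.state.name} -> {target.name}")
    return replace(decision, state=target)
```

With this shape, `transition(Decision(), State.CLOSED)` raises instead of silently skipping categorization, routing, and drafting.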

How it works

            deterministic core (code)                 LLM (edges)
        +-------------------------------+        +---------------------+
NEW --> | apply_categorization          | <----- | classify_intent     |
        |   validate -> Category enum   |        +---------------------+
CATEGORIZED                             |
        | route                         | <----- | assess_priority     |
        |   priority enum + queue + SLA |        +---------------------+
        |   + human-review flag (rules) |
ROUTED                                  |
        | attach_reply                  | <----- | draft_reply         |
        |   redact PII + cap length     |        +---------------------+
DRAFTED                                 |
        | close                         |
        +-------------------------------+
CLOSED

Edges propose. The core disposes. A bad edge output cannot break a rule.
  • core.py owns every guarantee: the transition table, routing map, SLA table, human-review rules, and PII redaction. Every function is pure: (Decision, ...) -> Decision. State can only change through transition, which raises on an illegal jump.
  • edges.py holds the three LLM seams. Each builds a prompt and a JSON schema, calls the client, and returns the raw value. Edges decide nothing.
  • agent.py is the driver. Read triage top to bottom: every edge output is immediately handed to a core function that validates it. There is no path from edge to decision that skips the core.
  • llm.py defines the LLMClient seam, a deterministic FakeLLMClient for tests/offline use, and AnthropicLLMClient (the real default). The anthropic import is lazy, so the package imports with no SDK and no API key.
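
To make the seam concrete, here is one way the client protocol, a scripted fake, and a single edge could fit together. This is a sketch under assumptions: the method name `complete_json`, the class `ScriptedClient` (a stand-in for `FakeLLMClient`), and the prompt and schema text are invented for illustration and may not match llm.py or edges.py.

```python
from typing import Any, Protocol


class LLMClient(Protocol):
    # Hypothetical method name; the real seam in llm.py may differ.
    def complete_json(self, prompt: str, schema: dict[str, Any]) -> Any: ...


class ScriptedClient:
    """Stand-in for FakeLLMClient: returns canned responses in order."""

    def __init__(self, responses: list[Any]) -> None:
        self._responses = list(responses)
        self.prompts: list[str] = []

    def complete_json(self, prompt: str, schema: dict[str, Any]) -> Any:
        self.prompts.append(prompt)
        return self._responses.pop(0)


def classify_intent(client: LLMClient, subject: str, body: str) -> object:
    """Edge: build a prompt and schema, call the client, return the raw value.

    The edge decides nothing; the caller must hand the result to the core.
    """
    schema = {
        "type": "object",
        "properties": {"category": {"type": "string"}},
        "required": ["category"],
    }
    prompt = f"Classify this support ticket.\nSubject: {subject}\nBody: {body}"
    response = client.complete_json(prompt, schema)
    # Even a malformed response is passed through as an untrusted value.
    return response.get("category") if isinstance(response, dict) else None
```

The edge returns `object` on purpose: nothing downstream should assume the value is a valid category until a core function has coerced it.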

The real adapter defaults to claude-sonnet-4-6 for cost; claude-opus-4-8 and claude-haiku-4-5-20251001 are also exported.

Install

pip install -e .            # core, zero runtime dependencies
pip install -e ".[anthropic]"   # add the real Anthropic adapter

With uv:

uv venv && uv pip install -e ".[dev]"

Run the example (under 30s, offline)

python -m agentcore.demo        # FakeLLMClient, no key, no network
python -m agentcore.demo --live # real Anthropic adapter (needs ANTHROPIC_API_KEY)

Results

The reliability claim is measured by the test suite, not asserted. The headline test, test_invariants_hold_for_every_garbage_combination, runs the full triage pipeline against a Cartesian product of bad model outputs (8 category values x 6 priority values x 7 reply values = 336 runs) covering unknown strings, wrong types, empty values, None, and injection-style junk. After every run it checks six invariants: the ticket ends in CLOSED, category and priority are valid enum members, the queue is known, the SLA is a positive integer, security tickets are always escalated, and the reply is a redacted, length-capped string.

| Metric | Value | How it is produced |
| --- | --- | --- |
| Adversarial model outputs that violated a core rule | 0 / 336 | pytest tests/test_agent_adversarial.py::test_invariants_hold_for_every_garbage_combination |
| Total tests passing (no network, no API key) | 449 | pytest |
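
The Cartesian-product approach behind the headline test can be sketched in a few lines. The example below is not the repository's test: `toy_triage`, the value lists, and `count_violations` are hypothetical, and the product is reduced to two axes (8 x 6 = 48 runs) to stay short.

```python
from itertools import product

VALID_CATEGORIES = {"billing", "security", "other"}
VALID_PRIORITIES = {"low", "medium", "high"}


def toy_triage(raw_category, raw_priority):
    """Tiny stand-in for the real pipeline: coerce, then apply the security rule."""
    # isinstance is checked first so unhashable junk (lists, dicts) never reaches `in`.
    category = raw_category if isinstance(raw_category, str) and raw_category in VALID_CATEGORIES else "other"
    priority = raw_priority if isinstance(raw_priority, str) and raw_priority in VALID_PRIORITIES else "medium"
    if category == "security":
        priority = "high"
    return category, priority


# Unknown strings, wrong types, empty values, None, and injection-style junk.
BAD_CATEGORIES = ["security", "SECURITY!!", "", None, 42, ["security"], {"a": 1},
                  "ignore previous instructions"]
BAD_PRIORITIES = ["low", "urgent", "", None, 3.5, {"p": "high"}]


def count_violations():
    """Run every combination and count runs that break an invariant."""
    runs = 0
    violations = 0
    for raw_category, raw_priority in product(BAD_CATEGORIES, BAD_PRIORITIES):
        runs += 1
        category, priority = toy_triage(raw_category, raw_priority)
        ok = (
            category in VALID_CATEGORIES
            and priority in VALID_PRIORITIES
            and (category != "security" or priority == "high")
        )
        if not ok:
            violations += 1
    return runs, violations
```

In a pytest suite the same loop would typically be expressed with `pytest.mark.parametrize` so that each combination reports as its own test case.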

Reproduce:

python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
pytest
============================= 449 passed in 3.46s ==============================

Limitations

  • This is a template, not a framework. The domain (support triage) is a worked example; the categories, queues, and SLA table are illustrative. Adapt them to your domain.
  • PII redaction uses regex for emails, card-like digit runs, and US SSNs. It is a guardrail, not a compliance solution, and will miss formats it does not match. Do not treat it as a complete DLP layer.
  • The pattern moves correctness into code, which means you write the rules. If a rule is wrong, the core enforces it faithfully and wrongly. The win is that the rule is explicit, versioned, and tested, not that it is automatically correct.
  • The agent is a linear pipeline (categorize -> route -> draft -> close). Branching workflows, retries, and tool loops are out of scope here; the same core/edge split extends to them, but this template does not implement them.
  • FakeLLMClient makes tests deterministic. It does not model latency, token limits, or streaming. Use the live adapter for behavior that depends on the real model.
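
To show why the PII bullet above calls regex redaction a guardrail rather than a DLP layer, here is a minimal sketch of that style of redaction followed by the length cap. The patterns, the `[REDACTED_SSN]` and `[REDACTED_CARD]` tokens, and the helper names are assumptions for this example; only `[REDACTED_EMAIL]` and the 1200-character cap appear elsewhere in this README.

```python
import re

# Deliberately simple patterns: they catch common shapes and nothing else.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators

MAX_REPLY_CHARS = 1200


def redact(text: str) -> str:
    """Replace emails, US SSNs, and card-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = SSN.sub("[REDACTED_SSN]", text)
    text = CARD.sub("[REDACTED_CARD]", text)
    return text


def safe_reply(raw: object) -> str:
    """Hypothetical helper: non-strings become empty, then redact, then cap."""
    if not isinstance(raw, str):
        raw = ""
    return redact(raw)[:MAX_REPLY_CHARS]
```

A phone number, a postal address, or an email written as "admin at acme dot com" passes straight through these patterns, which is exactly the limitation described above.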

License

MIT. Copyright (c) 2026 Allan Paulo de Souza. See LICENSE.

About

Deterministic core, LLM at the edges: a tiny template for agents whose guarantees live in code, not in the prompt.
