Put the hard guarantees in code, not in the prompt. The agent is a deterministic state machine; the LLM is called only at named, typed edges. A model that returns garbage cannot break a transition, a routing rule, an SLA, or a guardrail.
```python
from agentcore import triage, Ticket, FakeLLMClient

# Offline, no API key. The fake model returns priority "low" for a security ticket.
client = FakeLLMClient(responses=[
    {"category": "security"},
    {"priority": "low"},                      # the model lowballs it...
    {"reply": "ping me at admin@acme.com"},   # ...and leaks an email
])

d = triage(client, Ticket(id="T-9", subject="account takeover", body="someone is in my account"))
print(d.state.value)      # closed
print(d.priority.value)   # high      <- core forced HIGH, ignored the model
print(d.queue)            # security  <- security routing, not the default
print(d.requires_human)   # True      <- mandatory human review
print(d.reply)            # ping me at [REDACTED_EMAIL]  <- PII scrubbed
```

The model said "low priority, no PII problem." The deterministic core overrode all of it. That is the whole idea.
A pure-prompt agent puts the rules in the system prompt: "classify the ticket, route security issues to the security team, redact PII, never auto-close an account issue without a human." Every one of those is a request the model may or may not honor. A jailbreak, a bad sample, a schema drift, or a model upgrade can silently break any of them, and you find out in production.
agentcore keeps the model on the outside. It is asked three narrow questions (what category, what priority, draft a reply) and nothing it answers is trusted until code validates it:
| Concern | Pure-prompt agent | Deterministic core |
|---|---|---|
| State transitions | Model "remembers" the flow | Enforced table; illegal jumps raise |
| Category | Whatever string the model emits | Coerced to a fixed enum, unknown -> other |
| Routing / SLA | Described in the prompt | Pure functions over validated enums |
| Security escalation | "Please prioritize security" | Forced HIGH + security queue + human review, in code |
| PII in replies | "Don't include PII" | Regex redaction before the reply leaves the system |
| Reply length | "Keep it short" | Hard character cap |
| Testable offline | No (needs the model) | Yes (model faked) |
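The "coerced to a fixed enum, unknown -> other" row can be sketched as follows. This is a minimal illustration, not the package's actual code: the `billing` and `technical` members and the exact signature of `coerce_category` are assumptions made for the example.

```python
from enum import Enum


class Category(Enum):
    # "security" and "other" appear in this README; the rest are illustrative.
    BILLING = "billing"
    SECURITY = "security"
    TECHNICAL = "technical"
    OTHER = "other"


def coerce_category(raw: object) -> Category:
    """Map an untrusted model output onto the enum; anything unknown becomes OTHER."""
    if not isinstance(raw, str):
        return Category.OTHER
    try:
        return Category(raw.strip().lower())
    except ValueError:
        return Category.OTHER


print(coerce_category("Security"))    # Category.SECURITY
print(coerce_category("DROP TABLE"))  # Category.OTHER
print(coerce_category(None))          # Category.OTHER
```

Because the function never raises and always returns an enum member, every rule downstream can be written against `Category` without defensive checks.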
```mermaid
stateDiagram-v2
    direction LR
    [*] --> NEW
    NEW --> CATEGORIZED: classify_intent (LLM)<br/>core.apply_categorization<br/>coerce_category -> OTHER
    CATEGORIZED --> ROUTED: assess_priority (LLM)<br/>core.route<br/>coerce_priority, queue, SLA, human-flag
    ROUTED --> DRAFTED: draft_reply (LLM)<br/>core.attach_reply<br/>redact PII, cap 1200 chars
    DRAFTED --> CLOSED: core.close (rules only)
    CLOSED --> [*]
    note right of NEW
        LLM is called only at the three labeled edges.
        Each returns a raw, untrusted value.
        core.py coerces it to a valid enum and applies
        routing / SLA / guardrail rules before the state
        advances via transition().
    end note
```
A ticket moves through a linear state machine whose transitions are enforced in core.py. The LLM is consulted only at the three labeled edges in edges.py; each edge returns an untrusted raw value that the core coerces to a valid enum and validates before the state advances.
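An enforced transition table of the kind described above might look like the sketch below. The state names come from the diagram; the `ALLOWED` mapping, the `IllegalTransition` exception, and the signature of `transition` are hypothetical and only illustrate the idea.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    CATEGORIZED = "categorized"
    ROUTED = "routed"
    DRAFTED = "drafted"
    CLOSED = "closed"


# Each state lists the only states it may advance to (a linear pipeline).
ALLOWED = {
    State.NEW: {State.CATEGORIZED},
    State.CATEGORIZED: {State.ROUTED},
    State.ROUTED: {State.DRAFTED},
    State.DRAFTED: {State.CLOSED},
    State.CLOSED: set(),
}


class IllegalTransition(Exception):
    """Raised when code attempts a jump that the table does not allow."""


def transition(current: State, target: State) -> State:
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {target.name} is not allowed")
    return target


state = transition(State.NEW, State.CATEGORIZED)
try:
    transition(state, State.CLOSED)  # skipping ROUTED and DRAFTED
except IllegalTransition as exc:
    print(exc)  # CATEGORIZED -> CLOSED is not allowed
```

Nothing the model returns is consulted here, so no model output can skip a step.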
```text
              deterministic core (code)                  LLM (edges)
          +-------------------------------+        +---------------------+
  NEW --> | apply_categorization          | <----- | classify_intent     |
          |   validate -> Category enum   |        +---------------------+
  CATEGORIZED                             |
          | route                         | <----- | assess_priority     |
          |   priority enum + queue + SLA |        +---------------------+
          |   + human-review flag (rules) |
  ROUTED                                  |
          | attach_reply                  | <----- | draft_reply         |
          |   redact PII + cap length     |        +---------------------+
  DRAFTED                                 |
          | close                         |
          +-------------------------------+
  CLOSED
```
Edges propose. The core disposes. A bad edge output cannot break a rule.
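As a rough sketch of "the core disposes", here is what a pure routing step could look like. The `Decision` fields, the `general` queue, the MEDIUM fallback, and the SLA hours are assumptions chosen for illustration; only the security escalation behavior (HIGH, security queue, human review) is taken from this README.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Category(Enum):
    SECURITY = "security"
    OTHER = "other"


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Illustrative SLA table (hours); the real values are not given in this README.
SLA_HOURS = {Priority.LOW: 72, Priority.MEDIUM: 24, Priority.HIGH: 4}


@dataclass(frozen=True)
class Decision:
    category: Category
    priority: Priority = Priority.MEDIUM
    queue: str = "general"
    sla_hours: int = SLA_HOURS[Priority.MEDIUM]
    requires_human: bool = False


def coerce_priority(raw: object) -> Priority:
    """Untrusted value -> enum; anything unrecognized falls back to MEDIUM."""
    if isinstance(raw, str):
        try:
            return Priority(raw.strip().lower())
        except ValueError:
            pass
    return Priority.MEDIUM


def route(decision: Decision, raw_priority: object) -> Decision:
    """Pure: returns a new Decision and never mutates its input."""
    priority = coerce_priority(raw_priority)
    queue, needs_human = "general", False
    if decision.category is Category.SECURITY:
        # The escalation rule lives in code, so the model's "low" is overridden.
        priority, queue, needs_human = Priority.HIGH, "security", True
    return replace(
        decision,
        priority=priority,
        queue=queue,
        sla_hours=SLA_HOURS[priority],
        requires_human=needs_human,
    )


d = route(Decision(category=Category.SECURITY), "low")
print(d.priority.value, d.queue, d.requires_human)  # high security True
```

The frozen dataclass plus `dataclasses.replace` keeps the function pure: the input decision is left untouched and a new one is returned.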
- `core.py` owns every guarantee: the transition table, routing map, SLA table, human-review rules, and PII redaction. Every function is pure: `(Decision, ...) -> Decision`. State can only change through `transition`, which raises on an illegal jump.
- `edges.py` holds the three LLM seams. Each builds a prompt and a JSON schema, calls the client, and returns the raw value. Edges decide nothing.
- `agent.py` is the driver. Read `triage` top to bottom: every edge output is immediately handed to a core function that validates it. There is no path from edge to decision that skips the core.
- `llm.py` defines the `LLMClient` seam, a deterministic `FakeLLMClient` for tests/offline use, and `AnthropicLLMClient` (the real default). The `anthropic` import is lazy, so the package imports with no SDK and no API key.
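To make the seam concrete, the sketch below shows one way a client protocol, a canned-response fake, and an edge could fit together. The `complete` method name, its signature, the prompt text, and the `QueueFakeClient` class are all assumptions for illustration; the package's real `LLMClient` and `FakeLLMClient` may be shaped differently.

```python
from typing import Any, Protocol


class LLMClient(Protocol):
    # Hypothetical method name and signature, chosen for this sketch.
    def complete(self, prompt: str, schema: dict[str, Any]) -> dict[str, Any]: ...


class QueueFakeClient:
    """Deterministic stand-in: returns canned responses in order."""

    def __init__(self, responses: list[dict[str, Any]]) -> None:
        self._responses = list(responses)

    def complete(self, prompt: str, schema: dict[str, Any]) -> dict[str, Any]:
        return self._responses.pop(0)


def classify_intent(client: LLMClient, subject: str, body: str) -> Any:
    """Edge: build a prompt and schema, call the client, return the raw value."""
    prompt = f"Classify this support ticket.\nSubject: {subject}\nBody: {body}"
    schema = {
        "type": "object",
        "properties": {"category": {"type": "string"}},
        "required": ["category"],
    }
    # No validation here on purpose: the edge decides nothing.
    return client.complete(prompt, schema).get("category")


client = QueueFakeClient([{"category": "security"}])
raw = classify_intent(client, "account takeover", "someone is in my account")
print(raw)  # security
```

The edge returns whatever came back, including `None` when the key is missing; turning that into a valid enum is the core's job.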
The real adapter defaults to claude-sonnet-4-6 for cost; claude-opus-4-8 and claude-haiku-4-5-20251001 are also exported.
```bash
pip install -e .                # core, zero runtime dependencies
pip install -e ".[anthropic]"   # add the real Anthropic adapter
```

With uv:

```bash
uv venv && uv pip install -e ".[dev]"
```

Run the demo:

```bash
python -m agentcore.demo          # FakeLLMClient, no key, no network
python -m agentcore.demo --live   # real Anthropic adapter (needs ANTHROPIC_API_KEY)
```

The reliability claim is measured by the test suite, not asserted. The headline test, test_invariants_hold_for_every_garbage_combination, runs the full triage pipeline against a Cartesian product of bad model outputs (8 category values x 6 priority values x 7 reply values = 336 runs) covering unknown strings, wrong types, empty values, None, and injection-style junk. After every run it checks six invariants: the ticket ends in CLOSED, category and priority are valid enum members, the queue is known, the SLA is a positive integer, security tickets are always escalated, and the reply is a redacted, length-capped string.
| Metric | Value | How it is produced |
|---|---|---|
| Adversarial model outputs that violated a core rule | 0 / 336 | pytest tests/test_agent_adversarial.py::test_invariants_hold_for_every_garbage_combination |
| Total tests passing (no network, no API key) | 449 | pytest |
Reproduce:
```bash
python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
pytest
```

```text
============================= 449 passed in 3.46s ==============================
```
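The shape of the adversarial sweep can be sketched with `itertools.product`. Everything below is a scaled-down stand-in: `triage_stub`, the input lists, and the MEDIUM fallback are hypothetical, and the lists are much smaller than the 8 x 6 x 7 grid used by the real test. Only the 1200-character cap and the kinds of invariants are taken from this README.

```python
from itertools import product

VALID_CATEGORIES = {"billing", "security", "other"}
VALID_PRIORITIES = {"low", "medium", "high"}
MAX_REPLY_CHARS = 1200


def triage_stub(category, priority, reply):
    """Stand-in for the real pipeline: coerce every untrusted value."""
    if not (isinstance(category, str) and category in VALID_CATEGORIES):
        category = "other"
    if not (isinstance(priority, str) and priority in VALID_PRIORITIES):
        priority = "medium"
    if category == "security":
        priority = "high"  # escalation rule, independent of the model
    reply = reply if isinstance(reply, str) else ""
    return {"category": category, "priority": priority, "reply": reply[:MAX_REPLY_CHARS]}


# Illustrative garbage inputs: unknown strings, wrong types, empty values, None.
BAD_CATEGORIES = ["security", "SECURITY!!", "", None, 42]
BAD_PRIORITIES = ["low", "urgent-ish", None, 3.5]
BAD_REPLIES = ["ok", "", None, "x" * 5000]

runs = 0
for category, priority, reply in product(BAD_CATEGORIES, BAD_PRIORITIES, BAD_REPLIES):
    result = triage_stub(category, priority, reply)
    assert result["category"] in VALID_CATEGORIES
    assert result["priority"] in VALID_PRIORITIES
    assert result["category"] != "security" or result["priority"] == "high"
    assert isinstance(result["reply"], str) and len(result["reply"]) <= MAX_REPLY_CHARS
    runs += 1

print(runs)  # 5 * 4 * 4 = 80
```

Sweeping the full product, rather than a handful of hand-picked cases, is what lets the invariants be stated as "holds for every combination tried".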
- This is a template, not a framework. The domain (support triage) is a worked example; the categories, queues, and SLA table are illustrative. Adapt them to your domain.
- PII redaction uses regex for emails, card-like digit runs, and US SSNs. It is a guardrail, not a compliance solution, and will miss formats it does not match. Do not treat it as a complete DLP layer.
- The pattern moves correctness into code, which means you write the rules. If a rule is wrong, the core enforces it faithfully and wrongly. The win is that the rule is explicit, versioned, and tested, not that it is automatically correct.
- The agent is a linear pipeline (categorize -> route -> draft -> close). Branching workflows, retries, and tool loops are out of scope here; the same core/edge split extends to them, but this template does not implement them.
- `FakeLLMClient` makes tests deterministic. It does not model latency, token limits, or streaming. Use the live adapter for behavior that depends on the real model.
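To illustrate the PII redaction limitation noted in the list above, here is a minimal regex-based sketch. The patterns, the `[REDACTED_SSN]` and `[REDACTED_CARD]` placeholders, and the `redact` function are assumptions for this example; only `[REDACTED_EMAIL]` and the 1200-character cap appear elsewhere in this README.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # US SSN, dashed form only
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")      # 13 to 19 digits, optional separators
MAX_REPLY_CHARS = 1200


def redact(text: str) -> str:
    """Scrub known PII shapes, then enforce the hard length cap."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = SSN.sub("[REDACTED_SSN]", text)
    text = CARD.sub("[REDACTED_CARD]", text)
    return text[:MAX_REPLY_CHARS]


print(redact("ping me at admin@acme.com"))    # ping me at [REDACTED_EMAIL]
print(redact("card 4111 1111 1111 1111 ok"))  # card [REDACTED_CARD] ok
print(redact("ssn 123 45 6789"))              # ssn 123 45 6789  (format not matched)
```

The last call shows the caveat in practice: a space-separated SSN matches none of the patterns and passes through unchanged.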
MIT. Copyright (c) 2026 Allan Paulo de Souza. See LICENSE.