PatchSmith is a research platform for evaluating AI software-maintenance agents.
Give it a repository, an issue, and a test command. PatchSmith retrieves likely context, lets an agent propose a bounded patch, runs validation in a sandbox, and writes down the evidence: diffs, traces, stdout, stderr, timing, reports, and model-cost metadata when a live model is used.
The goal is not to pretend every passing test means "the agent fixed it." PatchSmith is built to answer the more useful question: what happened, why did it happen, and can someone audit the repair attempt later?
PatchSmith is active R&D code.
The seeded benchmark lane is the stable development path. It is useful for testing retrieval, scaffold behavior, sandbox execution, reporting, and release gates. The public GitHub issue lane exists, but it is still calibration work. Treat public-issue repair results as experimental unless they come with a saved artifact directory that includes the repository state, issue spec, reproduction command, patch, validation output, and model metadata.
Current public status is ready_with_caveats: the offline seeded-suite demo is
coherent, and live LLM calibration evidence exists for saved DeepAgents
gpt-5.4-mini runs in that benchmark lane only. Do not generalize those results
to public-issue repair quality without matching reproduction and validation
artifacts.
- Clones or copies target repositories into controlled workspaces.
- Indexes files, symbols, and issue text for context retrieval.
- Supports native keyword, hybrid, graph, and ctxhelm context providers.
- Runs repair attempts through deterministic baselines and the deepagents runtime adapter.
- Includes a native DeepAgents planner with file reads, todo state, a skills-backed repair contract, structured patch output, a patch-review subagent, and sandbox-feedback retries.
- Applies model output through PatchSmith's own bounded text-replacement gate.
- Runs local or Docker sandbox validation with command-policy checks.
- Produces Markdown, JSON, HTML, trace, diff, stdout, stderr, timing, and cost artifacts.
- Ships local quality gates for tests, static checks, package build, release hygiene, demo readiness, and artifact indexing.
Most coding-agent demos compress the whole story into one number: did the final test pass?
That is too coarse for repair research. A run can fail because retrieval missed the right file, the prompt scaffold asked the wrong thing, the model produced an unsafe edit, the reproduction command was weak, the sandbox was misconfigured, or the issue was never reproducible in the first place.
PatchSmith keeps those pieces separate. It is meant for comparing repair systems without hiding setup failures, weak validation, or lucky patches.
- It is not an autonomous GitHub issue fixer.
- It is not a hosted repair service.
- It is not proof that a focused test pass equals upstream acceptance.
- It is not a leaderboard claim against SWE-agent, OpenHands, or coding models.
- It is not a replacement for human review.
The model gets to propose. PatchSmith owns the mutation.
issue + repository + test command
-> clone or copy repository
-> index files and symbols
-> retrieve candidate context
-> run a repair runtime
-> apply one bounded patch
-> run policy-checked validation
-> write reports, traces, logs, diffs, and metrics
PatchSmith does not let a model write freely into the repository. A runtime can inspect context and propose an edit. PatchSmith applies the final patch through its own gate and records the result.
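The "model proposes, PatchSmith owns the mutation" rule can be pictured as a small gate around a single text replacement. The following is a minimal sketch, not PatchSmith's actual gate: the names ProposedEdit, PatchRejected, and apply_bounded_edit, and the size limit, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedEdit:
    """Hypothetical shape of a model-proposed edit: one old span, one new span."""
    path: str
    old_text: str
    new_text: str


class PatchRejected(Exception):
    """Raised when a proposed edit fails the gate."""


def apply_bounded_edit(source: str, edit: ProposedEdit, max_changed_chars: int = 2000) -> str:
    """Apply exactly one replacement, or reject the edit.

    The checks and the size limit are illustrative assumptions.
    """
    if not edit.old_text:
        raise PatchRejected("old_text must not be empty")
    if edit.old_text == edit.new_text:
        raise PatchRejected("edit is a no-op")
    if len(edit.old_text) + len(edit.new_text) > max_changed_chars:
        raise PatchRejected("edit exceeds the size bound")
    occurrences = source.count(edit.old_text)
    if occurrences != 1:
        # An ambiguous or missing span is rejected instead of guessed at.
        raise PatchRejected(f"old_text must match exactly once, found {occurrences}")
    return source.replace(edit.old_text, edit.new_text, 1)
```

Requiring the old span to match exactly once is one way to keep the mutation bounded: the runtime never chooses between several candidate locations on the model's behalf.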
git clone https://github.com/thromel/patchsmith.git
cd patchsmith
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
For native DeepAgents experiments:
python -m pip install -e ".[dev,deepagents]"
Run the test suite:
PYTHONPATH=src python -m pytest -q
Run PatchSmith as a terminal coding agent against the current repository:
patchsmith agent "Fix the failing parser test" \
--test-command "python -m pytest tests/test_parser.py -q"
Run the canonical five-minute demo:
patchsmith demo seeded-logic-bug
Then inspect the generated run directory:
patchsmith inspect artifacts/demo/seeded_logic_bug/runs/<run_id>
The demo writes report.md, final.diff, traces.jsonl, validation logs,
metadata.json, and context/selected_files.json. A static preview lives in
docs/sample_artifacts/seeded_logic_bug/.
Preflight the same command before spending on a model call:
patchsmith agent "Fix the failing parser test" \
--test-command "python -m pytest tests/test_parser.py -q" \
--preflight \
--json
Preflight mode validates the task text, selected context/provider settings,
DeepAgents response/token budget, native DeepAgents prompt/tool token headroom,
reasoning-model token headroom, optional apply target, and whether
OPENAI_API_KEY is set in the environment. It does not call a model.
Start an interactive agent session:
patchsmith chat \
--test-command "python -m pytest tests/test_parser.py -q"
patchsmith agent --interactive starts the same session shell. Inside chat,
plain text is treated as a repair task in act mode, while /mode plan makes
plain text run /preflight only until the pending planned task is approved
with go ahead or /run, discarded with cancel plan, or until /mode act is
selected again. Slash commands keep the session keyboard-first. Obvious control
phrases such as what next?, show status, review diff, apply check,
go ahead, cancel plan, and apply it route to their matching slash
commands instead of starting the wrong operation.
Actual patchsmith chat, patchsmith agent --interactive, and one-shot
patchsmith agent runs perform a live OpenAI model availability/auth preflight
before starting DeepAgents. /preflight remains a no-model-call local
diagnostic. Use --skip-model-preflight only for offline harnesses or deliberate
provider experiments.
- /preflight <task> validates config without a model call.
- /mode [act|plan] switches plain-text behavior between repair runs and preflight-only planning.
- /run executes the pending plan-mode task; /run <task> executes an explicit task immediately.
- /cancel [plan] discards the pending plan-mode task without clearing the rest of the session.
- /context add <path[#symbol]> pins a repo-relative file or symbol hint for later runs.
- /context show, /context remove <path>, and /context clear inspect or update those forced context hints.
- /model [id|clear] shows, sets, or clears the DeepAgents model override for later runs in the session.
- /budget [responses <n>|tokens <n>|set <responses> <tokens>|clear] shows or changes the session response/token caps. Use -1 to disable a cap.
- /doctor checks local DeepAgents dependencies, API-key presence, model selection, budget caps, native DeepAgents prompt/tool token headroom, reasoning-model token headroom, and apply readiness before a run.
- /cost summarizes tasks, runs, validation count, run errors, model calls, responses, tokens, and estimated session cost from the transcript.
- /metrics summarizes transcript-derived process metrics: manual preflights, automatic run preflights, runs, validation rate, apply success rate, custom command count, context/model/budget/permission updates, verify runs, diff views/reviews, instruction and memory views, feedback views/updates, run evidence views, timeline views, next recommendations, apply approvals/rejections/blocks, checkpoints/restores, model tokens, and cost per validated run.
- /gate [validated|clean|reviewed|applied|cost <usd>] evaluates the current transcript against built-in evidence gates. reviewed requires a validated clean run, deterministic diff review, no high-risk diff review, and a ready /apply check.
- /trace or /evidence summarizes the last run's report, trace, diff, changed files, model usage, and cost without leaving chat.
- /verify [command] reruns the configured or explicit allowed test command through PatchSmith's sandbox command policy and records the result.
- /run <task> records the same readiness checks as /preflight, runs the live model preflight supplied by the CLI, then starts the DeepAgents repair loop.
- /status shows session, model, budget, context, and last-run artifact state.
- /history lists tasks from the current session.
- /timeline [n] shows the recent transcript event trail for task, run, gate, trace, checkpoint, feedback, hook, and config activity.
- /next recommends the next evidence-backed action from the transcript, such as approving or cancelling a pending plan-mode task, trace review, gating, apply, checkpoint, export, breaking a repeated failure loop, or turning a rejected apply decision into feedback before retrying.
- /sessions lists resumable chat transcripts under the current artifacts directory with usage, validation, error, and cost summaries.
- /commands lists project custom slash commands loaded from .patchsmith/commands/*.md.
- /hooks lists project lifecycle hooks loaded from .patchsmith/hooks.json.
- /agents lists project agent profiles loaded from .patchsmith/agents/*.md.
- /agent [name|clear] shows, selects, or clears the active project agent profile for later runs.
- /instructions [show|reload|clear] shows, reloads, or disables AGENTS.md/CLAUDE.md-style project instructions for later runs.
- /memory [show|reload|clear] shows the same loaded project memory through a Claude Code-style command and records separate memory-view transcript evidence.
- /plan [show|set|add|start|done|block|skip|pending|clear] manages a transcripted session plan. Active plan items are included as context for later runs.
- /feedback <text>, /feedback show, and /feedback clear capture human-in-the-loop guidance for later runs without changing the base task.
- /checkpoint [label], /checkpoints, and /restore <id|label> save, list, and restore chat state without rewriting the transcript.
- /permissions [show|apply auto|apply manual|dirty allow|dirty deny] shows or changes apply intent and whether dirty worktrees are accepted by apply operations. In interactive chat, apply auto is artifact-first: after a run, PatchSmith records apply_auto_deferred and requires /diff review, /apply check, and /apply instead of mutating immediately.
- /approve apply <reason> records an explicit human approval for the current reviewed diff after /apply check; it is required before applying high-risk reviewed diffs.
- /reject apply <reason> records an explicit human denial for the current reviewed diff after /apply check; /apply stays blocked until a newer approval supersedes it, and /next recommends feeding the denial back into a retry instead of applying against the transcript.
- /compact [note] records a deterministic transcript summary and clears the in-memory task history while preserving last-run artifact pointers.
- /clear clears in-memory task, run, and apply state while retaining the transcript for audit and resume.
- /export [path] writes a Markdown session report from the JSONL transcript.
- /diff shows the last generated diff path, /diff stat summarizes files and line counts, /diff show [lines] prints a bounded diff preview, and /diff review runs deterministic patch-risk checks before apply.
- /apply check runs the same local repo, dirty-worktree, empty-diff, and git apply --check safeguards without mutating the working tree.
- /apply applies the last generated diff only after the current transcript has a non-high-risk /diff review and a ready /apply check; it then runs the same safe git apply checks used by patchsmith agent --apply. High-risk reviews require an explicit /approve apply <reason> event after the ready apply check before mutation is allowed; a newer /reject apply <reason> blocks mutation until superseded by a newer approval.
- /rewind or /undo reverses the last generated diff through git apply --reverse --check and then git apply --reverse.
- /exit or /quit ends the session.
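The routing of plain text and control phrases described for chat can be sketched as a small lookup in front of the slash-command dispatcher. This is an illustrative sketch only: the function name route and the exact phrase-to-command mapping are assumptions inferred from the command names above, not taken from PatchSmith's source.

```python
# Assumed mapping from control phrases to slash commands (inferred, not confirmed).
CONTROL_PHRASES = {
    "what next?": "/next",
    "show status": "/status",
    "review diff": "/diff review",
    "apply check": "/apply check",
    "go ahead": "/run",
    "cancel plan": "/cancel plan",
    "apply it": "/apply",
}


def route(line: str, mode: str = "act") -> str:
    """Turn one line of chat input into the slash command it should run."""
    text = line.strip()
    if text.startswith("/"):
        return text  # explicit slash commands pass through unchanged
    normalized = " ".join(text.lower().split())
    if normalized in CONTROL_PHRASES:
        return CONTROL_PHRASES[normalized]
    if mode == "plan":
        return f"/preflight {text}"  # plan mode: plain text only preflights
    if mode == "act":
        return f"/run {text}"  # act mode: plain text is a repair task
    raise ValueError(f"unknown mode: {mode}")
```

Checking control phrases before the mode-specific fallback is what keeps a phrase like "apply it" from being started as a new repair task.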
Project custom commands are Markdown prompt templates under
.patchsmith/commands/. The filename becomes the slash command name, and
subdirectories become namespaces: .patchsmith/commands/review.md maps to
/review, while .patchsmith/commands/bench/live.md maps to /bench:live.
Optional frontmatter fields such as description and argument_hint appear in
/commands and --list-commands --json. PatchSmith substitutes $ARGUMENTS,
{{arguments}}, or {{ args }} with the text after the slash command, records
the command source in the transcript, and then runs the rendered prompt through
the same DeepAgents runner, trace, diff, cost, and apply machinery as /run.
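As a rough illustration of the naming and substitution rules above, the sketch below maps a path under .patchsmith/commands/ to its slash command and renders a template. The helper names command_name and render_command are hypothetical, and the regular expression's tolerance for extra whitespace inside the braces is an assumption.

```python
import re
from pathlib import PurePosixPath

# Placeholders named in the text: $ARGUMENTS, {{arguments}}, {{ args }}.
# Accepting any amount of whitespace inside the braces is an assumption.
_PLACEHOLDER = re.compile(r"\$ARGUMENTS|\{\{\s*arguments\s*\}\}|\{\{\s*args\s*\}\}")


def command_name(relative_path: str) -> str:
    """Map a path relative to .patchsmith/commands/ to a slash command name."""
    path = PurePosixPath(relative_path)
    if path.suffix != ".md":
        raise ValueError("custom commands are Markdown files")
    # Subdirectories become namespaces joined with ":".
    return "/" + ":".join(path.with_suffix("").parts)


def render_command(template: str, arguments: str) -> str:
    """Substitute every supported placeholder with the text after the command."""
    # A function replacement avoids backslash-escape handling in the arguments.
    return _PLACEHOLDER.sub(lambda _match: arguments, template)
```

For example, a template stored at bench/live.md containing "Run the live lane for $ARGUMENTS" would be invoked as /bench:live followed by the argument text.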
Project agent profiles are Markdown specialist definitions under
.patchsmith/agents/. They are PatchSmith's lightweight version of file-based
subagents: the profile can set model/budget/context defaults and prepend
specialist instructions to the next bounded repair prompt, but it still runs
through the same safe runner and transcript path.
---
description: Verification-focused repair mode
model: gpt-5-mini
subagents: inline
max_context_files: 3
max_model_responses: 4
max_model_tokens: 90000
test_command: pytest tests/test_parser.py -q
context_paths: |
  - src/parser.py#parse
  - tests/test_parser.py
---
Localize the failure before editing.
Reject broad rewrites.
Use /agents to list profiles and /agent verifier to select one inside
chat. One-shot runs can apply the same profile with --agent-profile verifier.
Selecting a profile is transcripted, replayed on resume, included in runtime
metadata, and counted in /metrics.
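To make the profile layout concrete, here is a minimal sketch of separating the frontmatter from the specialist instructions. It is deliberately simplified and is not PatchSmith's loader: the function name split_profile is hypothetical, only single-line key: value fields are read, and multi-line values such as context_paths are skipped rather than parsed.

```python
def split_profile(text: str) -> tuple[dict[str, str], str]:
    """Split a profile into flat frontmatter fields and the instruction body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()  # no frontmatter: everything is instructions
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, text.strip()  # unterminated frontmatter is treated as body
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if line.startswith((" ", "-")) or ":" not in line:
            continue  # continuation of a multi-line value; skipped in this sketch
        key, _, value = line.partition(":")
        value = value.strip()
        if value and value != "|":
            fields[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body
```

A real loader would more likely hand the frontmatter to a YAML parser; the hand-rolled version here only keeps the example free of third-party imports.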
PatchSmith automatically loads concise project instruction files from the repo root before a chat or one-shot agent run:
- AGENTS.md
- CLAUDE.md
- CODEX.md
- GEMINI.md
- .cursorrules
- .patchsmith/instructions.md
The loaded instruction context is size-capped, listed by /instructions, stored
in the session transcript, and included in runtime metadata. Extra repo-relative
files can be added with --instruction-path, and automatic instruction loading
can be disabled with --no-agent-instructions. These files guide the repair
prompt, but they do not override PatchSmith's safe runner, patch safety checks,
hooks, validation, or apply policy.
Session plans are lightweight task lists for complex work:
/plan set inspect parser; write focused test; run validation
/plan start 1
/run Fix the parser edge case
/plan done 1
/plan show
Plans use explicit statuses: pending, in_progress, completed, blocked,
and skipped. PatchSmith stores plan changes in the transcript, restores them
on resume, includes active plan state in the next repair prompt, and counts plan
views/updates in /metrics.
Rewind is PatchSmith's lightweight checkpoint path for applied agent patches.
It is intentionally diff-scoped: /rewind reverses the last generated diff
instead of resetting the whole repository. That preserves unrelated worktree
state while still giving the user a quick way to back out a bad applied patch.
Rewind attempts and successes are transcripted and included in /metrics.
Session gates let you check whether an interactive session is ready to promote without leaving the shell:
/gate validated
/gate clean
/gate reviewed
/gate applied
/gate cost 0.05
validated requires at least one validated run. clean additionally requires
100% validation rate and zero run errors. reviewed adds deterministic diff
risk-review evidence, no high-risk diff reviews, and a ready non-mutating apply
check. applied requires successful apply attempts. cost <usd> caps cost per
validated run. Gate results are recorded in the transcript and counted in
/metrics; saved sessions can still be gated offline with
patchsmith chat --session-gate.
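The gate definitions above stack on each other, which a short sketch can make explicit. The SessionMetrics fields and the gate_passes function are hypothetical names, and two details are assumptions: that applied needs only one successful apply, and that the cost gate fails when there are no validated runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionMetrics:
    """Hypothetical transcript-derived counters; field names are assumptions."""
    runs: int = 0
    validated_runs: int = 0
    run_errors: int = 0
    diff_reviews: int = 0
    high_risk_diff_reviews: int = 0
    ready_apply_checks: int = 0
    successful_applies: int = 0
    cost_usd: float = 0.0


def gate_passes(metrics: SessionMetrics, gate: str, max_cost_usd: Optional[float] = None) -> bool:
    """Evaluate one named gate against the session counters."""
    validated = metrics.validated_runs >= 1
    # clean: 100% validation rate and zero run errors, on top of validated.
    clean = validated and metrics.validated_runs == metrics.runs and metrics.run_errors == 0
    if gate == "validated":
        return validated
    if gate == "clean":
        return clean
    if gate == "reviewed":
        return (
            clean
            and metrics.diff_reviews >= 1
            and metrics.high_risk_diff_reviews == 0
            and metrics.ready_apply_checks >= 1
        )
    if gate == "applied":
        return metrics.successful_applies >= 1
    if gate == "cost":
        if max_cost_usd is None:
            raise ValueError("the cost gate needs a cap")
        # Cost is capped per validated run, so no validated run means no pass.
        return validated and metrics.cost_usd / metrics.validated_runs <= max_cost_usd
    raise ValueError(f"unknown gate: {gate}")
```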
Project hooks are deterministic shell commands configured in
.patchsmith/hooks.json. PatchSmith passes a JSON envelope on stdin and records
hook outcomes in the chat transcript. Blocking hooks fail closed before
expensive or risky actions such as model runs and patch application:
{
  "hooks": {
    "PreRun": [
      {
        "name": "budget-guard",
        "matcher": "benchmark|live",
        "command": "python scripts/check_budget.py",
        "timeout_seconds": 10
      }
    ],
    "PreApply": [
      {
        "name": "review-before-apply",
        "command": "python scripts/check_diff_policy.py"
      }
    ]
  }
}
Hook commands can print {"decision": "block", "reason": "..."} to stdout or
exit non-zero to stop the lifecycle event. Supported chat lifecycle events are
SessionStart, SessionEnd, UserPromptSubmit, UserPromptExpansion,
PreRun, PostRun, PreApply, and PostApply. PreRun and PreApply are
the main safety gates; post-event hooks are recorded as evidence but cannot
undo completed work.
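A blocking hook such as the budget guard configured above could look roughly like the sketch below. The decision format comes from the text; everything else is an assumption, in particular the envelope key estimated_cost_usd, the default cap, and the function names decide and main.

```python
import json
import sys
from typing import Optional, TextIO


def decide(envelope: dict, max_cost_usd: float = 1.0) -> Optional[dict]:
    """Return a block decision, or None to let the lifecycle event continue.

    The "estimated_cost_usd" key is a hypothetical envelope field.
    """
    cost = float(envelope.get("estimated_cost_usd", 0.0))
    if cost > max_cost_usd:
        return {
            "decision": "block",
            "reason": f"estimated cost {cost:.2f} USD exceeds {max_cost_usd:.2f} USD",
        }
    return None


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> int:
    """Read the JSON envelope from stdin and print a decision only when blocking."""
    envelope = json.load(stdin)
    decision = decide(envelope)
    if decision is not None:
        stdout.write(json.dumps(decision) + "\n")
    return 0
```

A script file would end by calling raise SystemExit(main()). Because malformed input raises and exits non-zero, the hook also fails closed when the envelope cannot be parsed.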
Chat sessions write JSONL transcripts under artifacts/chat_sessions/ so runs,
manual preflights, automatic run preflights, context changes, usage/cost
fields, apply decisions, and follow-up tasks remain auditable.
Resume a saved chat transcript by session id:
patchsmith chat --resume 20260614T195042Z-4be8450e
List saved chat sessions without starting the shell:
patchsmith chat --list-sessions
List project custom commands without starting the shell:
patchsmith chat --list-commands
patchsmith chat --list-commands --json
List project agent profiles without starting the shell:
patchsmith chat --list-agents
patchsmith chat --list-agents --json
List project instruction files without starting the shell:
patchsmith chat --list-instructions
patchsmith chat --list-instructions --json
List project hooks without starting the shell:
patchsmith chat --list-hooks
patchsmith chat --list-hooks --json
Print saved session metrics without starting the shell:
patchsmith chat --session-metrics 20260614T195042Z-4be8450e --json
Print the deterministic next recommendation for a saved session:
patchsmith chat --session-next 20260614T195042Z-4be8450e --json
Gate a saved session for CI or benchmark promotion:
patchsmith chat \
--session-gate 20260614T195042Z-4be8450e \
--require-validated-run \
--min-validation-rate 1.0 \
--require-diff-review \
--max-high-risk-diff-reviews 0 \
--require-ready-apply-check \
--max-cost-per-validated-run-usd 0.05 \
--max-run-errors 0
Export a saved transcript without resuming the shell:
patchsmith chat \
--export-session 20260614T195042Z-4be8450e \
--export-path artifacts/chat_sessions/20260614T195042Z-4be8450e.md
Run a scripted chat session:
patchsmith chat --script scripts/live-smoke.patchsmith
The script file contains the same lines you would type interactively, such as
/doctor, /model gpt-5-mini, /budget set 4 60000, /review target, or
/exit. This is useful for repeatable smoke runs and CI-friendly dry runs;
commands in the script still use the same chat transcript, preflight, runner,
trace, diff, cost, and apply boundaries.
The agent command defaults to:
- --repo .
- --runtime deepagents
- --planner deepagents
- --context-provider native_hybrid
- --deepagents-subagents auto
- --max-model-responses 12
- --max-model-tokens 200000
It writes the proposed patch, trace, validation output, and report under
artifacts/runs/<run_id>/. This is intentionally artifact-first: the current
CLI does not mutate the user's working tree unless asked.
To apply the generated diff back to a clean local Git worktree:
patchsmith agent "Fix the failing parser test" \
--test-command "python -m pytest tests/test_parser.py -q" \
--apply
--apply is intentionally explicit. It requires --repo to be a local Git
repository, rejects dirty worktrees by default, checks the generated diff with
git apply --check, and then applies it with git apply. Inside interactive
chat, /apply is stricter: it first requires the session to record
non-high-risk /diff review evidence and a ready /apply check. Pass
--allow-dirty-apply only when you intentionally want to apply into a worktree
that already has uncommitted changes.
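The apply sequence described above (refuse dirty worktrees, check the diff, then apply it) can be sketched with plain git subprocess calls. This is an illustration under assumptions, not the CLI's implementation: apply_diff and the injectable run parameter are hypothetical, and the sketch assumes the process is already running inside the target repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def _run_git(args: Sequence[str]) -> subprocess.CompletedProcess:
    """Run one git subcommand in the current directory and capture its output."""
    return subprocess.run(["git", *args], capture_output=True, text=True)


def apply_diff(diff_path: str, allow_dirty: bool = False, run: Runner = _run_git) -> None:
    """Check, then apply, a diff in the current Git worktree."""
    status = run(["status", "--porcelain"])
    if status.returncode != 0:
        raise RuntimeError("not a usable Git repository")
    if status.stdout.strip() and not allow_dirty:
        raise RuntimeError("worktree is dirty; refusing to apply")
    # Dry run first so a failing diff never leaves a half-applied tree.
    if run(["apply", "--check", diff_path]).returncode != 0:
        raise RuntimeError("git apply --check failed; nothing was changed")
    if run(["apply", diff_path]).returncode != 0:
        raise RuntimeError("git apply failed")
```

Passing the runner as a parameter keeps the ordering of the safeguards testable without touching a real repository.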
Run a deterministic repair on a seeded bug:
PYTHONPATH=src python -m patchsmith.cli run \
--repo evals/tasks/seeded_bugs_v1/task_001_logic_bug/repo \
--issue-file evals/tasks/seeded_bugs_v1/task_001_logic_bug/issue.md \
--test-command "python3 -m pytest" \
--runtime heuristic \
--context-provider native_hybrid \
--artifacts-dir artifacts \
--json
Expected behavior: PatchSmith edits the seeded task repository, runs the pytest
command, and writes a run report under artifacts/runs/<run_id>/.
If you already know a likely file, force it into the context:
PYTHONPATH=src python -m patchsmith.cli run \
--repo path/to/repo \
--issue-file path/to/issue.md \
--context-provider native_hybrid \
--context-path "src/package/module.py#suspected_symbol" \
--json
--context-path can be repeated. PatchSmith strips the optional #symbol
suffix before reading the file, but keeps the full hint in the issue text for
repair flows that provide reviewed hints.
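The handling of the optional #symbol suffix amounts to splitting the hint once while keeping the original string. A minimal sketch, with ContextHint and parse_context_path as hypothetical names:

```python
from typing import NamedTuple, Optional


class ContextHint(NamedTuple):
    path: str               # file path used for reading
    symbol: Optional[str]   # optional symbol after "#", if any
    raw: str                # full hint, kept for the issue text


def parse_context_path(hint: str) -> ContextHint:
    """Split "path#symbol" into its parts without discarding the original hint."""
    path, sep, symbol = hint.partition("#")
    if not path:
        raise ValueError("context path must name a file")
    return ContextHint(path=path, symbol=symbol if sep and symbol else None, raw=hint)
```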
PatchSmith has two DeepAgents modes:
- runtime=deepagents, planner=heuristic: adapter and scaffold compatibility.
- runtime=deepagents, planner=deepagents: native DeepAgents planning with a live OpenAI-compatible chat model.
The native planner seeds DeepAgents with a state-backed virtual filesystem,
read-only file permissions, a compact /.patchsmith/repair-interface.md
run manifest, a /.patchsmith/acceptance-rubric.md verifier checklist, a
/.patchsmith/repo-instructions.md scoped repository-instruction manifest, a
PatchSmith repair skill, a durable memory file, structured PatchPlan output,
a failure-localizer subagent, and a patch-reviewer subagent for ambiguous
or feedback-driven repairs. The repair interface is the
agent-computer interface for a run: it lists required manifests, mounted source
paths, subagent routing mode, and output constraints before the model explores
source files. The acceptance rubric is generated from the issue, mounted files,
preferred targets, validation fixtures, and unsafe-patch exclusions; it gives
the model a task-local checklist to verify the selected path, old span,
validation claim, and patch shape before final output. Repo instructions are
discovered only from AGENTS.md-style files at the repository root and ancestors
of mounted context paths, capped before mounting, and framed as scoped
constraints rather than permission for broad exploration. Native
DeepAgents plans must also include compact localization fields that name the
failing runtime mechanism and justify why the selected file/span controls it.
For cost experiments, PATCHSMITH_DEEPAGENTS_SUBAGENTS=inline disables the
DeepAgents subagents and requires the main planner to do localization and review
inline. Keep this mode experimental until saved live artifacts show lower
response count, token use, and cost without reducing validation.
PATCHSMITH_DEEPAGENTS_SUBAGENTS=auto is the safer ablation: it keeps
subagents for retries, reviewed source hints, validation fixtures, and
multi-context repairs, but disables them for simple single-control-point runs.
The default remains full.
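The three subagent modes can be read as a small routing decision. The sketch below is an interpretation of the description above, not PatchSmith's routing code: use_subagents, its parameters, and the reason strings are all hypothetical.

```python
def use_subagents(
    mode: str,
    *,
    attempt: int,
    reviewed_hints: int,
    validation_fixtures: int,
    context_files: int,
) -> tuple[bool, list[str]]:
    """Decide whether subagents run, plus the reasons behind the decision."""
    if mode == "full":
        return True, ["mode=full"]
    if mode == "inline":
        return False, ["mode=inline"]
    if mode != "auto":
        raise ValueError(f"unknown subagent mode: {mode}")
    reasons = []
    if attempt > 1:
        reasons.append("retry")
    if reviewed_hints > 0:
        reasons.append("reviewed_source_hints")
    if validation_fixtures > 0:
        reasons.append("validation_fixtures")
    if context_files > 1:
        reasons.append("multi_context")
    # With no trigger, auto treats the run as a simple single-control-point repair.
    return bool(reasons), reasons or ["single_control_point"]
```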
PatchSmith only requests reasoning.encrypted_content for OpenAI model ids that
are expected to support reasoning items, such as gpt-5*, o1*, o3*, and
o4*. Non-reasoning models such as gpt-4.1-mini still use the Responses API
without that include flag, avoiding provider-side compatibility failures. Set
PATCHSMITH_DEEPAGENTS_ENCRYPTED_REASONING=enabled or
PATCHSMITH_DEEPAGENTS_ENCRYPTED_REASONING=off only for explicit provider
capability experiments; the default auto mode is recorded in DeepAgents
contract metadata.
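The model-id rule above can be expressed as a prefix match with an explicit override. This is a sketch under assumptions: the function names are hypothetical, and the mode strings mirror the environment values mentioned in the text (auto, enabled, off).

```python
from fnmatch import fnmatchcase

# Model-id patterns named in the text as expected to support reasoning items.
REASONING_MODEL_PATTERNS = ("gpt-5*", "o1*", "o3*", "o4*")


def wants_encrypted_reasoning(model_id: str, mode: str = "auto") -> bool:
    """Decide whether to request reasoning.encrypted_content for a model."""
    if mode == "enabled":
        return True
    if mode == "off":
        return False
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode}")
    return any(fnmatchcase(model_id, pattern) for pattern in REASONING_MODEL_PATTERNS)


def include_fields(model_id: str, mode: str = "auto") -> list[str]:
    """Build the include list for a Responses API request."""
    return ["reasoning.encrypted_content"] if wants_encrypted_reasoning(model_id, mode) else []
```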
Preflight a model before spending money:
OPENAI_API_KEY=... \
PYTHONPATH=src python -m patchsmith.cli openai-model-preflight \
--model <model> \
--json
Run the native DeepAgents planner on the seeded benchmark:
OPENAI_API_KEY=... \
PATCHSMITH_DEEPAGENTS_MODEL=<model> \
PYTHONPATH=src python -m patchsmith.cli eval-repair \
--dataset evals/tasks/seeded_bugs_v1 \
--runtime deepagents \
--planner deepagents \
--max-retries 1 \
--max-tasks 10 \
--context-provider native_hybrid \
--output artifacts/experiments/deepagents_native_repair_eval_v1 \
--json
Do not cite model performance from a README command. Use the saved artifact
directory for the exact run: model name, account, prompt, dataset, commit, diff,
logs, and validation output all matter.
Keep --max-tasks set on live runs unless you are intentionally expanding the
benchmark and budget.
Optional context-budget experiments:
PATCHSMITH_DEEPAGENTS_CONTEXT_MODE=span \
PATCHSMITH_DEEPAGENTS_CONTEXT_WINDOW_LINES=80 \
PATCHSMITH_DEEPAGENTS_SUBAGENTS=auto \
PYTHONPATH=src python -m patchsmith.cli execute-public-issue-repairs \
--deepagents-max-context-files 2 ...
--deepagents-max-context-files limits the number of repository files mounted
into the DeepAgents virtual filesystem while preserving reviewed source hints,
validation fixtures, and strong target-localization signals such as
symbol-qualified control points first. The default 0 keeps the full retrieved
context because smaller mounts can make the model ask for more reasoning tokens;
use this knob only when the saved trace and suite gate prove a net improvement.
PATCHSMITH_DEEPAGENTS_CONTEXT_MODE=span is a separate first-attempt
compression lane: mounted repository paths stay stable, but each mounted source
file is narrowed to a focused line window around matched symbols, runtime-cache
cues, or reviewed source hints. Keep it opt-in until a saved live suite shows
lower tokens without losing target alignment or validation.
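Span mode narrows a mounted file to a window of lines around a match. The sketch below shows one way to compute such a window; it is an assumption-laden illustration in which line_window and narrow_to_symbol are hypothetical names, matching is a plain substring search, and a file with no match is left whole.

```python
def line_window(lines: list[str], match_index: int, window_lines: int = 80) -> tuple[int, int]:
    """Return a [start, end) range of up to window_lines around match_index."""
    if not 0 <= match_index < len(lines):
        raise IndexError("match_index is outside the file")
    half = window_lines // 2
    start = max(0, match_index - half)
    end = min(len(lines), start + window_lines)
    # Near the end of the file, slide the window back to keep its full size.
    start = max(0, end - window_lines)
    return start, end


def narrow_to_symbol(source: str, symbol: str, window_lines: int = 80) -> str:
    """Keep only the window around the first line that mentions the symbol."""
    lines = source.splitlines()
    for index, line in enumerate(lines):
        if symbol in line:
            start, end = line_window(lines, index, window_lines)
            return "\n".join(lines[start:end])
    return source  # no match: keep the full file
```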
Complex benchmark
reports include DeepAgents virtual-file count, context-cap usage, tokens, and
cost so these experiments can be compared from saved artifacts. They also
report repair-interface and acceptance-rubric manifest tasks plus read-first
rates, and repo-instructions manifest tasks plus read-first rate, so
interface/verifier/context changes are visible in benchmark summaries instead
of requiring manual trace inspection. Trajectory reports keep the legacy agent
trajectory score stable and expose contextual-verifier coverage as a separate
rate, so verifier adoption can be compared without silently moving older score
thresholds. Use min_contextual_verifier_rate or
--min-contextual-verifier-rate when a complex suite must prove verifier
coverage from saved traces; refresh-evidence exposes the same gate as
--complex-suite-min-contextual-verifier-rate. Use
min_repo_instructions_manifest_rate and
min_repo_instructions_read_first_rate when a context-policy lane must prove
scoped AGENTS.md-style repository guidance was mounted and read before source
edits. Use
evals/issue_corpora/public_issue_smoke_v1/complex_suite_verifier.template.json
for the next rubric-enabled live lane; the older complex_suite.template.json
remains the historical pre-rubric baseline.
Public issue repair summaries and rows also record actual model calls, tokens,
and estimated cost; rows that exceed configured post-run live-cost,
response-count, or token-count caps after execution are not counted as
validated claims.
Use --max-actual-model-responses and --max-actual-model-tokens on live
public-issue runs when a benchmark lane needs hard claim limits for DeepAgents'
internal call volume. PatchSmith mounts those limits into
/.patchsmith/repair-interface.md as a resource budget so the agent sees the
claim boundary before exploring source, and the native DeepAgents planner now
installs an active response-budget callback that blocks the next model call once
the configured response count is exhausted. Token caps are still checked from
recorded provider usage after each response, so token overages remain failed
claims even when the final patch passes tests. Reasoning-model runs can spend
tens of thousands of tokens before producing visible output, so calibrate
initial token caps from a small preflight lane before treating low caps as
product defaults. For response ceilings of six or
fewer, PatchSmith also switches the repair interface into budget-critical mode:
generic memory/skill reads are no longer required, the first preferred
source/symbol is mounted as a compact Fast Patch Packet, and the prompt asks the
agent to return a structured PatchPlan as soon as the controlling branch is
clear. Add --deepagents-subagents auto for budgeted calibration lanes; it
keeps retries eligible for subagents but uses compact inline
localization/review on the first attempt.
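The difference between the two caps described above (responses blocked before the next call, tokens checked after each response) can be sketched as a small budget object. The class and method names are hypothetical, and treating a negative cap as disabled is an assumption borrowed from the /budget convention.

```python
class ResponseBudgetExceeded(RuntimeError):
    """Raised before a model call that would exceed the response cap."""


class ResponseBudget:
    def __init__(self, max_responses: int, max_tokens: int) -> None:
        self.max_responses = max_responses  # negative means no cap (assumption)
        self.max_tokens = max_tokens        # negative means no cap (assumption)
        self.responses = 0
        self.tokens = 0

    @property
    def budget_critical(self) -> bool:
        """Response ceilings of six or fewer switch to the compact interface."""
        return 0 <= self.max_responses <= 6

    def before_call(self) -> None:
        """Block the next model call once the response count is exhausted."""
        if 0 <= self.max_responses <= self.responses:
            raise ResponseBudgetExceeded("response budget exhausted")

    def after_call(self, tokens_used: int) -> None:
        """Record provider-reported usage after a response arrives."""
        self.responses += 1
        self.tokens += tokens_used

    @property
    def token_overage(self) -> bool:
        """Token caps are only detectable after the fact, from recorded usage."""
        return self.max_tokens >= 0 and self.tokens > self.max_tokens
```

A run with token_overage set would be treated as a failed claim even if its final patch passes tests, which matches the post-run check described above.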
The subagent mode is reported in the DeepAgents contract as subagent_mode
with a subagent_routing reason list. The same contract records
repair_interface_manifest_path and repair_interface_manifest_read_first so
saved artifacts prove which run interface the model received. Use auto or
inline only for side-by-side calibration runs, not as default performance
claims.
Validate the seeded dataset:
PYTHONPATH=src python -m patchsmith.cli validate-dataset \
--dataset evals/tasks/seeded_bugs_v1 \
--output artifacts/experiments/seeded_dataset_validation_v1 \
--json
Compare retrieval providers:
PYTHONPATH=src python -m patchsmith.cli eval-retrieval \
--dataset evals/tasks/seeded_bugs_v1 \
--context-provider native \
--context-provider native_hybrid \
--context-provider native_graph \
--context-provider ctxhelm_cli \
--output artifacts/experiments/retrieval_eval_v1 \
--json
Run a DeepAgents compatibility repair benchmark:
PYTHONPATH=src python -m patchsmith.cli eval-repair \
--dataset evals/tasks/seeded_bugs_v1 \
--runtime deepagents \
--planner heuristic \
--context-provider native_hybrid \
--output artifacts/experiments/deepagents_compatibility_repair_eval_v1 \
--json
Compare scaffold variants:
PYTHONPATH=src python -m patchsmith.cli eval-scaffold \
--dataset evals/tasks/seeded_bugs_v1 \
--variant agentless \
--variant heuristic \
--variant deepagents \
--context-provider native_hybrid \
--output artifacts/experiments/scaffold_comparison_v1 \
--json
Summarize a completed public-issue repair lane as a complex benchmark:
PYTHONPATH=src python -m patchsmith.cli eval-complex \
--attempt-dir artifacts/experiments/public_issue_corpus_v1 \
--benchmark public_issue_smoke_v1_latest_all \
--output artifacts/experiments/complex_deepagents_public_issue_smoke_v1_latest_all \
--json
eval-complex reads saved repair-attempt artifacts, traces, and reports. It
does not run repositories, execute tests, or call a model provider.
Aggregate multiple saved public-issue repair lanes into a gated complex suite:
PYTHONPATH=src python -m patchsmith.cli eval-complex-suite \
--suite-spec evals/issue_corpora/public_issue_smoke_v1/complex_suite.template.json \
--validate-only \
--json
PYTHONPATH=src python -m patchsmith.cli eval-complex-suite \
--suite-spec evals/issue_corpora/public_issue_smoke_v1/complex_suite.template.json \
--json
The suite spec contains the saved attempt directories, output directory, and
gate thresholds. Selected cost/token/response caps measure the chosen best
attempt per task; attempted cost/token/response caps measure total spend across
all evaluated attempts, so exploratory changes that validate but waste model
budget or internal DeepAgents calls still trip the suite gate. Max
attempted/selected task caps catch single-task cost, token, or response-count
outliers that an aggregate average can hide. Suites can also set
min_target_alignment_rate to require final patches to stay inside paths
localized by explicit target candidates or by DeepAgents' structured
failure-localization rationale for the selected patch plan. Complex summaries
also report selected-attempt context-efficiency proxies: virtual files, virtual
files per validated task, tokens per virtual file, and responses per virtual
file. They also report selected context-target recall and precision whenever
the saved trace includes both localized target paths and DeepAgents mounted
source paths. Suite gates can cap the proxy metrics with
max_selected_virtual_files_per_validated_task,
max_selected_tokens_per_virtual_file, and
max_selected_responses_per_virtual_file, and can require target/context
coverage with min_selected_context_target_recall and
min_selected_context_target_precision. Complex reports also expose a
trace-derived progress score so failed long-horizon attempts can distinguish
reproduction-only, patch-generated, target-aligned, quality-warning, and clean
validated stages. Use min_selected_progress_score when a suite must enforce
that retained attempts reached a minimum partial-progress floor. They also emit
deterministic failure-class counts for triage buckets such as validated,
quality-risk, preflight-blocked, reproduction-failed, no-patch,
target-misaligned, runtime/tool failure, retry-exhausted, and validation-failed;
these labels are artifact-derived benchmark signals, not human root-cause
annotations. Complex reports also aggregate a HarnessFix-style
harness_layer label so failed attempts can be triaged by the implicated
layer: budget, model, sandbox, preflight, reproduction, planning, context,
patch quality, retry, runtime, validation, or orchestration.
DeepAgents feedback retries also carry a narrower runtime
retry_failure_class in the retry artifact and trace payload so the next
attempt sees whether it is handling validation failure, safety-gate rejection,
quality risk, repeated-target failure, or missing validation before editing.
Complex summaries aggregate those retry classes as retry_failure_class_counts
so retry-policy experiments can be compared from saved traces. They also emit
process-quality labels and flags derived from the same trace. A validated patch
can still be marked process-risky when the trace lacks verification, uses an
unclassified retry, churns through repeated failed events, or edits again after
successful verification; use process_quality_label_counts,
process_quality_flag_counts, and process_risky_validated_tasks when
checking for AgentLens-style lucky-pass risk.
Suites can enforce those process diagnostics with min_process_quality_score
and max_process_risky_validated_tasks.
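The process-risk conditions listed above can be illustrated with a small sketch. It assumes a simplified, hypothetical event shape (`TraceEvent`) and hypothetical flag names; PatchSmith's real trace schema and flag vocabulary may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical, simplified trace event (not PatchSmith's real schema)."""
    kind: str                          # "edit", "verify", or "retry"
    ok: bool = True                    # whether the step succeeded
    retry_class: Optional[str] = None  # None means the retry was unclassified


def process_quality_flags(events: List[TraceEvent],
                          max_failed_events: int = 3) -> Set[str]:
    """Derive illustrative process-risk flags from an ordered event list."""
    flags: Set[str] = set()
    # The trace never ran a verification step.
    if not any(e.kind == "verify" for e in events):
        flags.add("missing_verification")
    # A retry happened without a recorded failure class.
    if any(e.kind == "retry" and e.retry_class is None for e in events):
        flags.add("unclassified_retry")
    # Too many failed events: the attempt churned before converging.
    if sum(1 for e in events if not e.ok) >= max_failed_events:
        flags.add("repeated_failure_churn")
    # Any edit after a successful verification invalidates that evidence.
    verified = False
    for e in events:
        if e.kind == "verify" and e.ok:
            verified = True
        elif e.kind == "edit" and verified:
            flags.add("edit_after_verification")
    return flags


def is_process_risky(validated: bool, flags: Set[str]) -> bool:
    """A validated task counts as process-risky when any flag is raised."""
    return validated and bool(flags)
```

Under these assumptions, a trace that edits, verifies successfully, and then edits again yields only the `edit_after_verification` flag, which is the lucky-pass pattern the diagnostics are meant to surface.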
Verifier lanes can also require the
task-local acceptance rubric to be mounted and read before final output with
min_acceptance_rubric_manifest_rate and
min_acceptance_rubric_read_first_rate. They can also require the selected
patch to satisfy deterministic rubric-alignment proxies with
min_acceptance_rubric_alignment_rate, which checks that the patch was
rubric-backed, target-aligned, mounted-context bounded, and free of patch-quality
warnings. The refresh-evidence CLI exposes the
same gates as --complex-suite-min-acceptance-rubric-manifest-rate and
--complex-suite-min-acceptance-rubric-read-first-rate, plus
--complex-suite-min-acceptance-rubric-alignment-rate.
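As a rough illustration of the rubric-alignment proxy, the sketch below treats a selected patch as aligned only when all four conditions hold and compares the resulting rate against a threshold. The `SelectedPatch` fields and the function names are hypothetical, chosen for the example rather than taken from PatchSmith's report schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SelectedPatch:
    """Hypothetical per-task summary; field names are illustrative only."""
    rubric_backed: bool           # the acceptance rubric was mounted and read
    target_aligned: bool          # edits stay inside localized target paths
    within_mounted_context: bool  # edits stay inside mounted source paths
    quality_warnings: int         # count of patch-quality warnings


def is_rubric_aligned(patch: SelectedPatch) -> bool:
    """All four deterministic proxies must hold at once."""
    return (patch.rubric_backed
            and patch.target_aligned
            and patch.within_mounted_context
            and patch.quality_warnings == 0)


def alignment_rate(patches: List[SelectedPatch]) -> float:
    """Fraction of selected patches that satisfy every proxy."""
    if not patches:
        return 0.0  # assumption: an empty suite cannot satisfy a rate gate
    return sum(is_rubric_aligned(p) for p in patches) / len(patches)


def gate_passes(patches: List[SelectedPatch], min_rate: float) -> bool:
    """Compare the observed rate to a min_acceptance_rubric_alignment_rate-style threshold."""
    return alignment_rate(patches) >= min_rate
```

Because the check is a conjunction, one patch-quality warning is enough to drop a task from the aligned set even when the patch is otherwise on target.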
refresh-evidence --complex-suite-spec ... can regenerate the same suite report
and complex_benchmark_suite_gate.json from saved attempt artifacts without
spending live-model tokens. The validation-only command checks the suite
interface first: attempt directories, required result files, output path, and
gate-threshold count. When verifier or acceptance-rubric gates are requested
and older saved attempts lack that evidence, the suite also emits
verifier_contract_rerun follow-up candidates and a shell-ready runbook for
rubric-backed DeepAgents reruns.
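The difference between selected and attempted caps is easiest to see with numbers. The sketch below is a simplified model that assumes aggregate caps are sums of cost; the `Attempt` record, the cap names, and the failure labels are hypothetical, and the real gate may aggregate differently (for example, by average).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Attempt:
    """Hypothetical saved-attempt record used only for this example."""
    task_id: str
    cost_usd: float
    selected: bool  # True for the best attempt chosen for its task


def check_cost_caps(attempts: List[Attempt],
                    max_selected_cost: float,
                    max_attempted_cost: float,
                    max_attempted_task_cost: float) -> List[str]:
    """Return the names of the caps that were exceeded (empty list = pass)."""
    failures: List[str] = []

    # Selected spend: only the chosen best attempt per task.
    selected_total = sum(a.cost_usd for a in attempts if a.selected)
    if selected_total > max_selected_cost:
        failures.append("selected_cost")

    # Attempted spend: every evaluated attempt, including discarded ones.
    attempted_total = sum(a.cost_usd for a in attempts)
    if attempted_total > max_attempted_cost:
        failures.append("attempted_cost")

    # Per-task cap: catches a single expensive task that a total can hide.
    per_task: Dict[str, float] = {}
    for a in attempts:
        per_task[a.task_id] = per_task.get(a.task_id, 0.0) + a.cost_usd
    if per_task and max(per_task.values()) > max_attempted_task_cost:
        failures.append("max_attempted_task_cost")

    return failures
```

In this model, a suite whose selected attempts are cheap can still fail the gate when discarded exploratory attempts push attempted spend over its cap.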
Build an artifact index:
PYTHONPATH=src python -m patchsmith.cli index-artifacts \
--artifacts-dir artifacts \
--output artifacts/experiments/index.md \
--json-output artifacts/experiments/index.json \
--html-output artifacts/experiments/index.html \
--run-detail-output-dir artifacts/experiments/run-details \
--json
The public issue smoke lane lives under
evals/issue_corpora/public_issue_smoke_v1.
It has separate gates for corpus validation, repository preflight, source-free context preview, task materialization, focused test planning, setup validation, reproduction evidence, repair readiness, and repair attempts.
That separation is intentional. A public issue repair should only count after the failing behavior is reproduced, a patch is generated, and validation passes. Passing setup checks are useful evidence, but they do not prove repair quality.
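The counting rule can be sketched as a small predicate. The `IssueEvidence` record and the stage names below are hypothetical simplifications of the lane's gates, not the actual result schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueEvidence:
    """Hypothetical per-issue evidence flags, one per simplified stage."""
    setup_ok: bool
    reproduced: bool
    patch_generated: bool
    validation_passed: bool


STAGES = ("setup", "reproduction", "patch", "validation")


def counts_as_repair(evidence: IssueEvidence) -> bool:
    """A repair counts only with reproduction, a patch, and passing validation."""
    return (evidence.reproduced
            and evidence.patch_generated
            and evidence.validation_passed)


def furthest_stage(evidence: IssueEvidence) -> str:
    """Report the last stage reached in order, stopping at the first failure."""
    checks = (evidence.setup_ok, evidence.reproduced,
              evidence.patch_generated, evidence.validation_passed)
    reached = "none"
    for name, ok in zip(STAGES, checks):
        if not ok:
            break
        reached = name
    return reached
```

Under this sketch, an issue whose setup passes but whose failure was never reproduced reports `setup` as its furthest stage and does not count as a repair, even if a later validation run happens to pass.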
Build the seeded smoke image:
docker build -f docker/seeded-smoke.Dockerfile -t patchsmith-seeded-smoke:py312 .
Run with Docker isolation:
PYTHONPATH=src python -m patchsmith.cli run \
--repo evals/tasks/seeded_bugs_v1/task_001_logic_bug/repo \
--issue-file evals/tasks/seeded_bugs_v1/task_001_logic_bug/issue.md \
--test-command "python3 -m pytest" \
--runtime heuristic \
--context-provider native_hybrid \
--sandbox-mode docker \
--sandbox-image patchsmith-seeded-smoke:py312 \
--artifacts-dir artifacts \
--json
Docker mode disables implicit image pulls, disables network by default, drops
capabilities, mounts the repository at /workspace, applies resource limits,
and records the selected sandbox in the trace.
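For readers who want to see what such isolation looks like at the Docker CLI level, here is a minimal sketch that builds a `docker run` argument list with comparable settings. The flags are standard Docker options, but the helper name, the default limits, and the assumption that PatchSmith's sandbox uses these exact flags are illustrative only.

```python
import os
import shlex
from typing import List


def docker_run_argv(image: str,
                    repo_dir: str,
                    command: str,
                    memory: str = "1g",
                    cpus: str = "1.0",
                    allow_network: bool = False) -> List[str]:
    """Build an isolated `docker run` argv (hypothetical helper, not PatchSmith's API)."""
    argv = [
        "docker", "run", "--rm",
        "--pull=never",          # never pull implicitly; the image must exist locally
        "--cap-drop=ALL",        # drop all Linux capabilities
        "--memory", memory,      # resource limits (values here are assumptions)
        "--cpus", cpus,
        # Bind mounts need an absolute host path.
        "--volume", f"{os.path.abspath(repo_dir)}:/workspace",
        "--workdir", "/workspace",
    ]
    if not allow_network:
        argv += ["--network", "none"]  # no network unless explicitly allowed
    # Split the test command into arguments without invoking a shell.
    argv += [image, *shlex.split(command)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv)` without `shell=True`, which keeps the test command from being reinterpreted by a host shell.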
Run the local release gate:
PYTHONPATH=src python -m patchsmith.cli quality-gate \
--project-root . \
--artifacts-dir artifacts \
--output artifacts/experiments/quality_gate.md \
--json-output artifacts/experiments/quality_gate.json \
--logs-dir artifacts/experiments/quality_gate_logs \
--json
The gate runs compile checks, whitespace checks, the full pytest suite, and a package build. CI also runs Ruff, Ruff format check, mypy, compile checks, pytest, and package build.
src/patchsmith/
  cli/                    CLI commands
  evaluation/             seeded and public-issue evaluation flows
  observability/          artifact index, failure reports, renderers
  portfolio/              readiness, release, and demo reports
  runtime/                agent runtime adapters
  deepagents_planner.py   native DeepAgents planner
  deepagents_prompts.py   native DeepAgents prompts, memory, and skill text
  deepagents_schema.py    native DeepAgents structured output schema
  retrieval.py            native retrieval providers
  sandbox.py              local and Docker command execution
docs/                     architecture, safety, evaluation, and runbook docs
evals/                    seeded tasks and public issue corpora
adr/                      architecture decision records
experiments/              experiment plans
templates/                report and ADR templates
PYTHONPATH=src python -m pytest -q
PYTHONPATH=src python -m patchsmith.cli demo seeded-logic-bug
PYTHONPATH=src python -m patchsmith.cli quality-gate --json
PYTHONPATH=src python -m patchsmith.cli project-status --json
PYTHONPATH=src python -m patchsmith.cli demo-readiness --json
- Artifact gallery
- Failure taxonomy
- Benchmark manifest schema
- Runtime fairness checks
- Comparison matrix
- Changelog
- Calibrate live DeepAgents runs against the seeded benchmark.
- Expand public-issue reproduction coverage without treating setup success as repair success.
- Add more model/provider cost accounting.
- Improve artifact comparison for retrieval, patch quality, retries, and validation strength.
- Keep the runtime adapters small enough to read and test.
MIT. See LICENSE.