fix: default CODEBUFF_VERBOSE and CODEBUFF_PROMPT_LOG to disabled, fix prompt logger truncation #1

Open
reillyse wants to merge 214 commits into main from reillyse/hippo-integration


Conversation

@reillyse

Owner

Changes

  • CODEBUFF_VERBOSE: Default changed from TTY-based fallback to disabled (false). Only enabled when explicitly set to a non-0 value.
  • CODEBUFF_PROMPT_LOG: Default changed from enabled to disabled when unset. Only enabled when set to 1, true, or a custom path.
  • Prompt logger truncation fix: Added post-append truncateIfNeeded() call so large entries that push the log past 5MB are properly truncated.
  • Documentation: Updated docs/CLAUDE_OAUTH_DEPLOYMENT.md env vars table with new defaults for both CODEBUFF_VERBOSE and CODEBUFF_PROMPT_LOG.
  • Tests: Updated cli-lite prompt-logger tests to expect disabled-by-default behavior.
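
The new defaults and the post-append truncation can be sketched roughly as follows. This is a minimal illustration, not the code in cli-lite/src/prompt-logger.ts: the helper names, DEFAULT_LOG_PATH, the treatment of empty, "0", and "false" values, and the keep-the-newest-tail truncation policy are all assumptions, and string length stands in for the 5MB byte limit.

```typescript
// Sketch only: helper names, DEFAULT_LOG_PATH, and the handling of
// empty / "0" / "false" values are assumptions, not the PR's code.
type Env = Record<string, string | undefined>;

// CODEBUFF_VERBOSE: disabled unless explicitly set to a non-0 value.
function isVerboseEnabled(env: Env): boolean {
  const raw = (env['CODEBUFF_VERBOSE'] ?? '').trim();
  return raw !== '' && raw !== '0';
}

type PromptLogSetting = { enabled: false } | { enabled: true; path: string };

const DEFAULT_LOG_PATH = 'prompt-log.txt'; // hypothetical default location

// CODEBUFF_PROMPT_LOG: disabled when unset; "1"/"true" use the default
// path; any other value is treated as a custom path.
function resolvePromptLog(env: Env): PromptLogSetting {
  const raw = (env['CODEBUFF_PROMPT_LOG'] ?? '').trim();
  const lower = raw.toLowerCase();
  if (raw === '' || raw === '0' || lower === 'false') return { enabled: false };
  if (raw === '1' || lower === 'true') return { enabled: true, path: DEFAULT_LOG_PATH };
  return { enabled: true, path: raw };
}

const MAX_LOG_CHARS = 5 * 1024 * 1024; // stands in for the 5MB limit

// Keeps the newest tail of the log (assumed truncation policy).
function truncateIfNeeded(log: string, max: number = MAX_LOG_CHARS): string {
  return log.length > max ? log.slice(log.length - max) : log;
}

// Truncate before appending, then again afterwards so that one large
// entry cannot leave the log over the limit.
function appendEntry(log: string, entry: string, max: number = MAX_LOG_CHARS): string {
  const appended = truncateIfNeeded(log, max) + entry;
  return truncateIfNeeded(appended, max);
}
```

With only a pre-append check, a log just under the limit plus one oversized entry stays over the limit until the next write; the second call closes that gap.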

Files Changed

  • cli-lite/src/index.ts
  • cli-lite/src/prompt-logger.ts
  • cli-lite/src/repl.ts
  • cli-lite/src/__tests__/prompt-logger.test.ts
  • cli/src/utils/prompt-logger.ts
  • docs/CLAUDE_OAUTH_DEPLOYMENT.md

brandonkachen and others added 30 commits February 4, 2026 13:51
…mitation

PostgreSQL ADD VALUE for enums is not visible within the same transaction,
so the UPDATE statements need to run in a separate migration after the
enum value is committed.

- 0039: Add referral_legacy enum value + is_legacy column (DEFAULT true)
- 0040: Backfill credit_ledger with referral_legacy type
… issue

drizzle-kit migrate runs all pending migrations in a single transaction,
so the new enum value is not committed when the UPDATE tries to use it.

Moved backfill to standalone script: scripts/backfill-referral-legacy.sql
Run this manually after migration 0039 is deployed.
The trailing comma after the last entry caused drizzle-kit migrate to fail
with a JSON parse error in CI.
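
As a quick illustration of why the trailing comma is fatal: strict JSON parsers reject it. The sketch below uses JSON.parse with a made-up entries list (the real file contents are not shown in this PR); only the trailing comma matters.

```typescript
// Hypothetical file contents; the only difference is the trailing comma.
const withTrailingComma = '{"entries": [{"idx": 39}, {"idx": 40},]}';
const withoutTrailingComma = '{"entries": [{"idx": 39}, {"idx": 40}]}';

// Returns true when the text is valid JSON, false on a syntax error.
function parsesAsJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(parsesAsJson(withTrailingComma)); // false
console.log(parsesAsJson(withoutTrailingComma)); // true
```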
…er production deployments"

This reverts commit d6d19fa.
Co-authored-by: brandonkachen <brandonchenjiacheng@gmail.com> and Codebuff!
…bled at all, which matches backend semantics
…r schema

- Remove stripe_price_id column from user table (migration 0041)
- Remove STRIPE_USAGE_PRICE_ID from env schemas and defaults
- Stop auto-subscribing new users to Pure Usage on signup
- Remove stripe_price_id from TypeScript types (user.ts, next-auth.d.ts, typed.d.ts)
- Clean up test utilities and eval configs
- Delete scripts/update-stripe-subscriptions.ts
jahooma and others added 30 commits April 28, 2026 12:23
- Remove 'review' from FREEBUFF_REMOVED_COMMAND_IDS and FREEBUFF_REMOVED_COMMANDS
- Enable CHATGPT_OAUTH_ENABLED flag
Unify the OAuth connect-command surface so both commands are available in every build and only accept their fully-qualified form.

- Expose /connect:chatgpt in the regular Codebuff build (previously freebuff-only)

- Drop the /chatgpt short alias; /connect:chatgpt is the only name

- Drop the /claude short alias; /connect:claude is the only name

- Update /plan and /review freebuff-only guards and help banner to point at /connect:chatgpt

Backfills an OpenSpec change documenting the invariants as a new cli-slash-commands capability, already archived at openspec/changes/archive/2026-04-23-cli-connect-commands-cleanup/ with the main spec synced to openspec/specs/cli-slash-commands/spec.md.
Adds packages/agent-runtime/src/util/normalize-conversation.ts, a 4-pass
normalization step that runs immediately before every LLM call (in
run-agent-step.ts) and at every STEP_ALL / GENERATE_N yield boundary (in
run-programmatic-step.ts, added in the previous commit).

Enforced invariants:
 1. No consecutive assistant-role messages (invalid prefill shape)
 2. No orphan tool_result (preceding assistant must carry matching id)
 3. Every tool_use has an immediately-following matching tool_result
 4. Conversation does not end with a trailing assistant before a model call

Pass ordering matters: merge consecutive assistants → drop orphan
tool_results → synthesize missing tool_results → handle trailing assistant.
The merge-first step protects pass 2/3 against the [assistant(A),
assistant(B), tool_result A, tool_result B] shape where A and B belong to
the same turn; merging produces [assistant(A+B), tool_result A, tool_result
B] which is valid.

Severity split on repair emission:
 - emitRepair (warn only, never throws): consecutive-assistant merge and
   orphan-tool_result drop. These are lossless cleanup of shapes that
   occur naturally at runtime (e.g., successive end_turn calls with
   excludeToolFromMessageHistory: true) or trivially-safe dropping with no
   data fabrication.
 - emitSynthesis (throws in strict mode): synthesized missing tool_result
   and trailing-assistant continuation. These fabricate conversation data
   to keep the model happy, indicating an upstream bug that should surface
   at PR time.

Strict mode is opt-in via CODEBUFF_STRICT_CONVERSATION=1; default is
'repair'. In repair mode it returns a new array with structured WARN logs
('conversation.shape.repaired' event) per repair. Idempotent on
already-valid conversations — the conversationHasViolation() fast path
returns the original array unchanged with no telemetry.

Wiring in run-agent-step.ts: normalizeConversation runs once on
agentState.messageHistory just before both LLM chokepoints (promptAiSdk
n-responses path and getAgentStreamFromTemplate streaming path).
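
The pass ordering described above can be sketched as follows. This is a simplified illustration, not the contents of normalize-conversation.ts: the Message shape, helper names, and placeholder texts are assumptions, and the WARN logging and conversationHasViolation() fast path are omitted. Strict mode throws only on the two synthesis passes, matching the severity split.

```typescript
// Simplified sketch; message shape and helper names are hypothetical.
type Message =
  | { role: 'user'; text: string }
  | { role: 'assistant'; text: string; toolUseIds: string[] }
  | { role: 'tool'; toolUseId: string; text: string };
type Mode = 'repair' | 'strict';

// Pass 1: merge consecutive assistant messages (lossless).
function mergeAssistants(msgs: Message[]): Message[] {
  const out: Message[] = [];
  for (const m of msgs) {
    const prev = out[out.length - 1];
    if (m.role === 'assistant' && prev !== undefined && prev.role === 'assistant') {
      out[out.length - 1] = {
        role: 'assistant',
        text: [prev.text, m.text].filter((t) => t !== '').join('\n'),
        toolUseIds: [...prev.toolUseIds, ...m.toolUseIds],
      };
    } else {
      out.push(m);
    }
  }
  return out;
}

// Pass 2: drop tool results whose preceding assistant has no matching id.
function dropOrphanResults(msgs: Message[]): Message[] {
  const out: Message[] = [];
  let open = new Set<string>();
  for (const m of msgs) {
    if (m.role === 'tool') {
      if (open.delete(m.toolUseId)) out.push(m); // unmatched results are dropped
      continue;
    }
    open = new Set(m.role === 'assistant' ? m.toolUseIds : []);
    out.push(m);
  }
  return out;
}

// Pass 3: synthesize a result for every tool use left unanswered.
function synthesizeMissingResults(msgs: Message[], mode: Mode): Message[] {
  const out: Message[] = [];
  let pending: string[] = [];
  const flush = (): void => {
    if (pending.length > 0 && mode === 'strict') {
      throw new Error(`missing tool_result for ${pending.join(', ')}`);
    }
    for (const id of pending) {
      out.push({ role: 'tool', toolUseId: id, text: '[synthesized: no result recorded]' });
    }
    pending = [];
  };
  for (const m of msgs) {
    if (m.role === 'tool') {
      pending = pending.filter((id) => id !== m.toolUseId);
      out.push(m);
      continue;
    }
    flush(); // a non-tool message closes the previous result block
    out.push(m);
    if (m.role === 'assistant') pending = [...m.toolUseIds];
  }
  flush();
  return out;
}

// Pass 4: never hand the model a conversation ending on an assistant turn.
function handleTrailingAssistant(msgs: Message[], mode: Mode): Message[] {
  const last = msgs[msgs.length - 1];
  if (last === undefined || last.role !== 'assistant') return msgs;
  if (mode === 'strict') throw new Error('conversation ends with an assistant message');
  return [...msgs, { role: 'user', text: '[synthesized: continue]' }];
}

function normalizeConversationSketch(msgs: Message[], mode: Mode = 'repair'): Message[] {
  const merged = mergeAssistants(msgs);
  const withoutOrphans = dropOrphanResults(merged);
  const completed = synthesizeMissingResults(withoutOrphans, mode);
  return handleTrailingAssistant(completed, mode);
}
```

Running the merge first is what keeps the [assistant(A), assistant(B), tool_result A, tool_result B] shape intact: without it, pass 2 would see tool_result A directly after assistant(B) and drop it as an orphan.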
Adds a Sparrow-only OpenTelemetry instrumentation layer targeting Honeycomb.
Core module lives at common/src/sparrow/telemetry/ and emits a hierarchical
span tree (prompt > agent.run > agent.step > gen_ai.chat + tool.call) with
real OTel context propagation via trace.setSpan + context.with.

- tracer-provider: BasicTracerProvider + BatchSpanProcessor + OTLP HTTP exporter
- span-helpers: withSpan, withPromptSpan, withAgentRunSpan, withAgentStepSpan,
  recordLlmCall, recordToolCall (idempotent finish)
- context-harvester: 5s TTL cache of git/cwd/branch/Linear-key context
- cost-rollup: token and USD aggregation up the span tree
- attributes: Honeycomb attribute naming conventions
- sparrow-config: persisted JSON at ~/.config/manicode/sparrow-config.json
- 7 test files covering init/shutdown/flush/privacy/context-cache/cost-
  rollup/OAuth-fallback/coverage-gaps (104 passing tests)
- docs/telemetry.md and HONEYCOMB_* + SPARROW_TELEMETRY_CAPTURE_PROMPTS
  env var documentation

Telemetry is a silent no-op when HONEYCOMB_API_KEY is absent. Content (prompt
bodies, tool args, tool output) is never captured by default.
Adds /telemetry CLI command with subcommands: enable, disable, status, dataset,
capture-prompts, debug, init, shutdown, flush. Commands hot-reconfig the tracer
provider in-process without a CLI restart (flush + shutdown + re-init).

- cli/src/commands/telemetry.ts: command implementation
- cli/src/commands/command-registry.ts: registration
- cli/src/index.tsx: initTelemetry on CLI startup
- cli/src/utils/renderer-cleanup.ts: flush + shutdown on exit so in-flight
  spans reach Honeycomb before process termination
Adds SPARROW-tagged telemetry hooks at 5 upstream-file touchpoints. All
failures are swallowed so telemetry cannot break user workflows.

- main-prompt.ts: withPromptSpan wraps the top-level prompt execution; every
  user turn produces one root prompt span with harvested git/cwd/Linear context
- run-agent-step.ts: withAgentRunSpan wraps each agent orchestration;
  withAgentStepSpan wraps each runAgentStep with step_number + agent_id
- tool-executor.ts: recordToolCall around executeToolCall and
  executeCustomToolCall. Integrated onto upstream atomic-pair contract
  (commit 3d59dc9) — span opened before try/catch, finished exactly once on
  each success/failure path. spawn_agents tool.call spans record child.agent_id
  linkage to spawned agent.run spans. Aborted custom-tool short-circuit skips
  span to avoid zero-duration noise.
- llm.ts: recordLlmCall around LLM dispatch with route/model/cost/tokens.
  Fixes cost-override bug: use !== undefined (not truthy check) so $0 cost
  overrides still emit spans with correct cost.
- run.ts: plumbs telemetry context through the SDK run entry point
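
The cost-override fix in llm.ts comes down to the difference between a truthy check and an explicit undefined check. A minimal sketch, with a hypothetical LlmCallCost shape and function names (only the undefined-versus-falsy distinction comes from the commit):

```typescript
// Hypothetical shape; the real llm.ts types are not shown in this PR.
interface LlmCallCost {
  computedCost: number;
  costOverride?: number;
}

// Buggy variant: a $0 override is falsy, so it is silently ignored.
function resolveCostTruthy(call: LlmCallCost): number {
  return call.costOverride ? call.costOverride : call.computedCost;
}

// Fixed variant: only a missing override falls back to the computed cost.
function resolveCost(call: LlmCallCost): number {
  return call.costOverride !== undefined ? call.costOverride : call.computedCost;
}

const freeCall: LlmCallCost = { computedCost: 0.12, costOverride: 0 };
console.log(resolveCostTruthy(freeCall)); // 0.12 (override lost)
console.log(resolveCost(freeCall)); // 0
```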
…ANGES

- openspec/changes/sparrow-telemetry/: full proposal with design, tasks (42/50
  complete; remaining 8 are live Honeycomb verification + full-regression),
  and specs. Strict-validates.
- SPARROW_CHANGES.md: enumerates every upstream-file touchpoint (main-prompt,
  run-agent-step, tool-executor, llm, run) and new Sparrow-only files for
  future upstream-merge audits.
Every user turn is already wrapped in a `withPromptSpan`, the root
`prompt` span that owns the full agent.run/agent.step/gen_ai.chat/
tool.call hierarchy. When that span closes, the BatchSpanProcessor has
the full trace queued, but previously we had to wait up to 5s for its
timer (or for the CLI to exit / a manual `/telemetry flush`) before it
shipped to Honeycomb.

This patch fires a non-blocking `flushTelemetry(2000)` in a finally
block after the prompt span ends, so every turn reaches Honeycomb
within ~1–2 seconds of the CLI finishing its response. Fire-and-forget
(never awaited), so it adds zero latency to the turn result; errors
are swallowed so telemetry can never break user workflows.

Changes:
- common/src/sparrow/telemetry/span-helpers.ts: withPromptSpan now
  wraps the withSpan call in try/finally and kicks off a fire-and-
  forget flush after the span has ended.
- common/src/sparrow/telemetry/__tests__/integration.test.ts: 3 new
  tests covering success path, throwing callback, and flush-error
  swallowing.
- docs/telemetry.md: new "When spans are flushed" section documenting
  the 5 flush triggers (turn-end + 5s timer + batch size + CLI exit +
  config changes).
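
The try/finally plus fire-and-forget pattern described in this commit might look roughly like the sketch below. The wrapper name and the injected flush function are assumptions made to keep the example self-contained; the real withPromptSpan also opens and closes the span, which is omitted here.

```typescript
// Sketch only: the flush function is injected so the example runs
// without any telemetry SDK. Names are hypothetical.
type Flush = (timeoutMs: number) => Promise<void>;

async function withPromptSpanSketch<T>(
  run: () => Promise<T>,
  flushTelemetry: Flush,
): Promise<T> {
  try {
    // In the real helper the prompt span wraps this call.
    return await run();
  } finally {
    // Fire-and-forget: never awaited, so it adds no latency to the
    // turn result. Both synchronous throws and rejections are swallowed.
    void Promise.resolve()
      .then(() => flushTelemetry(2000))
      .catch(() => {});
  }
}
```

Because the flush sits in finally, it is attempted on both the success path and when the callback throws, and a failing flush never changes what the caller sees.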
- Add codebuff.oauth_account_id (sha256 hash of OAuth refresh/access token) so two Claude or ChatGPT subscriptions on the same machine can be distinguished in Honeycomb dashboards.

- Propagate user.email, user.name, host.name from the per-prompt harvest cache onto every gen_ai.chat span at creation, so token totals can be sliced by user/machine without joining through trace IDs.

- Add deriveOAuthAccountId helper + 6 unit tests covering stability, distinctness, env-var fallback (refreshToken= case), and the one-way property of the hash.

- Add .agents/skills/honeycomb-usage-report/ skill that produces a daily + aggregate breakdown of LLM usage by route and model over the past 7 days.
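
A one-way account identifier of this kind can be derived with Node's built-in crypto module. The sketch below is illustrative: the credentials shape and the preference for the refresh token over the access token are assumptions, and the real deriveOAuthAccountId (including its env-var fallback) may differ.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical credentials shape for illustration.
interface OAuthCredentialsSketch {
  refreshToken?: string;
  accessToken?: string;
}

// Returns a stable sha256 hex digest of the token, or undefined when no
// token is available. The token itself never appears in the output.
function deriveOAuthAccountIdSketch(creds: OAuthCredentialsSketch): string | undefined {
  const token = creds.refreshToken || creds.accessToken; // assumed preference order
  if (!token) return undefined;
  return createHash('sha256').update(token).digest('hex');
}
```

The same token always maps to the same id, so two subscriptions on one machine show up as two distinct values in a dashboard without exposing either credential.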
….chat

Adds two new boolean attributes on every gen_ai.chat span to enable
silent-fallback observability:

- codebuff.chatgpt_oauth_eligible: tri-state (true | false | unset).
  true when the openai/* model is in the ChatGPT OAuth allowlist;
  false when openai/* but not allowlisted (e.g. gpt-5-nano);
  unset for non-openai models.
- codebuff.claude_oauth_eligible: binary (true | unset).
  true for any anthropic/* or claude-* model recognized by isClaudeModel.

Computed centrally in recordLlmCall() so all streaming paths emit
consistently. Decoupled from codebuff.route so dashboards can count
'could have used OAuth subscription but didn't' with:
  <attr>_eligible = true AND codebuff.route = codebuff_backend

Includes tests covering allowlist hits, non-allowlist openai/*,
non-OpenAI/non-Claude models, missing model, and the silent-fallback
query combination.
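
The tri-state and binary attributes can be modelled by leaving a key out of the attribute map when it is "unset". The following is a sketch under stated assumptions: the allowlist contents are hypothetical, and isClaudeModelSketch only mirrors the anthropic/* or claude-* rule quoted above rather than the real isClaudeModel.

```typescript
// Hypothetical allowlist; the real list lives elsewhere in the codebase.
const CHATGPT_OAUTH_ALLOWLIST_SKETCH = new Set(['openai/gpt-5.2']);

function isClaudeModelSketch(model: string): boolean {
  return model.startsWith('anthropic/') || model.startsWith('claude-');
}

// Builds the eligibility attributes for a gen_ai.chat span.
// A missing key represents the "unset" state.
function eligibilityAttributes(model: string | undefined): Record<string, boolean> {
  const attrs: Record<string, boolean> = {};
  if (model === undefined) return attrs;
  if (model.startsWith('openai/')) {
    attrs['codebuff.chatgpt_oauth_eligible'] = CHATGPT_OAUTH_ALLOWLIST_SKETCH.has(model);
  }
  if (isClaudeModelSketch(model)) {
    attrs['codebuff.claude_oauth_eligible'] = true;
  }
  return attrs;
}

// The silent-fallback query: eligible for OAuth but routed to the backend.
function isSilentFallback(attrs: Record<string, boolean>, route: string): boolean {
  const eligible =
    attrs['codebuff.chatgpt_oauth_eligible'] === true ||
    attrs['codebuff.claude_oauth_eligible'] === true;
  return eligible && route === 'codebuff_backend';
}
```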

🤖 Generated with Codebuff
Co-Authored-By: Codebuff <noreply@codebuff.com>
…chat.tsx imports

Post-cherry-pick fixup for the telemetry/cache-debug/ChatGPT OAuth range:

- bun.lock: resync after bun install pulled in @opentelemetry/exporter-trace-otlp-http and upgraded @opentelemetry/{resources,sdk-trace-base,core,context-async-hooks} to v2 (resourceFromAttributes export)

- common/src/util/messages.ts: add errorToolResult() helper used by packages/agent-runtime/src/util/normalize-conversation.ts to fabricate missing tool_result entries

- cli/src/chat.tsx: add missing imports for BottomStatusLine, useClaudeQuotaQuery, getClaudeOAuthStatus that were referenced but never imported
file-picker handleSteps uses read_files before yielding STEP, but read_files
was not in toolNames. The LLM would see read_files in message history and
try to call it, triggering "Tool read_files is not currently available".

Same issue for file-lister with read_subtree.

- Add read_files to file-picker toolNames
- Add read_subtree to file-lister toolNames
- Regenerate cli-lite bundled agents
…t docs 4.6

- Opus: anthropic/claude-opus-4.6 → anthropic/claude-opus-4.8
- GPT-5: openai/gpt-5.1 → openai/gpt-5.2
- Grok: x-ai/grok-4.1-fast → x-ai/grok-4.3
- Haiku: anthropic/claude-3.5-haiku-20241022 → anthropic/claude-haiku-4.5
- Docs: updated claude-sonnet-4.5 → claude-sonnet-4.6 in all examples

Updated model-config.ts constants, bundled-agents generated files,
oauth mappings, agent-definition type unions, .agents/ files,
OpenAI pricing tables, and all user-facing documentation.
- Add onBeforeSubagentPrompt/onAfterSubagentComplete hooks to AgentRuntimeDeps
- Wire hooks through agent-runtime, SDK, and both CLI/cli-lite
- Add getSubagentHippoContext (3s timeout) and storeSubagentResultToHippo
- Enrich commander, file-picker, opus-agent, gpt-5-agent with hippo context
- Fix cancelled outputs stored as success (now classified as failure)
- Fix misleading error message in getSubagentHippoContext
- Normalize agent IDs with getShortAgentId for fully-qualified ID support
- Cap injected context at 1500 chars to prevent token bloat
- Add warn-level logging for hook failures
- Add hippo storage schema and accuracy eval design docs
- Add cli-lite/scripts/stamp-version.ts to stamp git SHA into package.json
- Add stamp:version npm script
- Revert package.json version to base 0.1.0 (script handles stamping)
The build script was installing the binary to `npm config get prefix`
which could differ from where the shell resolves `codebuff` (e.g. nvm
bin directory). Now detects the existing binary location via `which`,
falling back to npm prefix for first-time installs.
- buildOutputDescription no longer pads --output with file lists (cli + cli-lite); files tracked via --files-changed
- add buildSubagentOutputDescription: mine narrative from output/message keys, handle plain strings, errors, and cancelled (surface reason instead of "Completed (Ns)")
- trim SUBAGENT_NARRATIVE_KEYS to the keys codebuff subagents actually emit (output, message)
- pass actual hippo error into logHippoPrompt metadata on subagent failure
- export + add 21 unit tests for buildSubagentOutputDescription
- Prune ALLOWED_MODEL_PREFIXES to ['anthropic', 'openai'] only
- Remove deepseekModels/DeepseekModel and CURRENT_GROK_MODEL constants
- Add CURRENT_HAIKU_MODEL ('anthropic/claude-haiku-4.5') for utility agents
- Add CURRENT_GPT5_MINI_MODEL ('openai/gpt-5-mini') for research agents
- Switch file-picker, file-lister, commander-lite → Haiku (Claude OAuth)
- Switch researcher-web, researcher-docs → GPT-5-mini (ChatGPT OAuth)
- Update FREE_MODE_AGENT_MODELS allowlist to match new model assignments
- Add openai/gpt-5-mini to OPENROUTER_TO_OPENAI_MODEL_MAP
- Prune ModelName type unions (3 copies) to OpenAI + Anthropic only
- Switch getModelForMode experimental/ask from Gemini Pro → Claude Opus
- Regenerate cli-lite bundled-agents with updated model assignments
- Update file-picker tests to expect Haiku instead of Grok/Gemini

🤖 Generated with Codebuff
Co-Authored-By: Codebuff <noreply@codebuff.com>
Grok is no longer used by any agent (file-lister migrated to claude-haiku-4.5).
Removing the dangling x-ai/grok-4 definition so grok can never be reintroduced
from this repo. Empties nonCacheableModels (its only entry was the grok model).
Print truncated request/response content for tool calls and truncated prompt+params on subagent start/finish. Make the truncation limit configurable via CODEBUFF_DEBUG_TRUNCATE (default 500, 0 disables). Fix finish-event cost display: totalCost is credits (1 credit = $0.01), now shown via formatCredits as e.g. "44 credits ($0.44)" instead of "$44.0000". Add unit tests in output.test.ts.
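
The credit formatting and configurable truncation can be illustrated with a small sketch. The function bodies are assumptions (the real formatCredits and the debug output code are not shown here); the 1 credit = $0.01 conversion, the default of 500, and "0 disables" come from the commit message, while the truncation suffix and the fallback for invalid values are made up.

```typescript
// 1 credit = $0.01, so dollars = credits / 100.
function formatCreditsSketch(credits: number): string {
  return `${credits} credits ($${(credits / 100).toFixed(2)})`;
}

const DEFAULT_TRUNCATE_LIMIT = 500;

// Reads CODEBUFF_DEBUG_TRUNCATE; falls back to 500 when unset or invalid
// (the invalid-value fallback is an assumption).
function resolveTruncateLimit(env: Record<string, string | undefined>): number {
  const raw = (env['CODEBUFF_DEBUG_TRUNCATE'] ?? '').trim();
  if (raw === '') return DEFAULT_TRUNCATE_LIMIT;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : DEFAULT_TRUNCATE_LIMIT;
}

// A limit of 0 disables truncation entirely.
function truncateForDebug(text: string, limit: number): string {
  if (limit === 0 || text.length <= limit) return text;
  return `${text.slice(0, limit)}... (${text.length - limit} more chars)`;
}

console.log(formatCreditsSketch(44)); // 44 credits ($0.44)
```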
Refresh bundled-agents.generated.ts so the deployed cli-lite binary no longer references the removed google/gemini-2.0-flash-001 model. file-picker/file-lister/commander-lite use claude-haiku-4.5; researchers use gpt-5-mini. Pushing triggers weft to rebuild the agent image with the up-to-date binary.
…dError + nested cause)

isTransientApiError now (A) recursively unwraps error.cause (cycle-guarded) so a transient 529/overload nested inside a wrapper error is recognized, and (B) treats AI_NoOutputGeneratedError as transient since mid-stream provider overloads surface that way. Retries remain capped by MAX_STEP_RETRIES at the call site. Adds unit tests (error-transient.test.ts) and integration tests (loop-agent-steps.test.ts).
Adds describeTransientApiError() and getTransientStatusCode() helpers in common/src/util/error.ts so the retry notice shows a meaningful reason (e.g. "Response stream interrupted (no output)" for AI_NoOutputGeneratedError, or "Transient API error (529)" walking the cause chain) instead of a vague message. run-agent-step.ts uses the helper in its retry notice. Adds unit tests and an integration assertion.
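
A cycle-guarded walk of the error.cause chain, as described in the two commits above, could look like this. It is a sketch rather than the code in common/src/util/error.ts: the helper names carry a Sketch suffix, the statusCode property name is an assumption, and the set of transient status codes contains only the 529 mentioned in the commit.

```typescript
// Only 529 is named in the commit; other codes would be assumptions.
const TRANSIENT_STATUS_CODES = new Set<number>([529]);

// Walks error -> error.cause -> ... and returns the first non-undefined
// value produced by `pick`. The `seen` set guards against cause cycles.
function findInCauseChain<T>(
  error: unknown,
  pick: (e: Record<string, unknown>) => T | undefined,
): T | undefined {
  const seen = new Set<object>();
  let current: unknown = error;
  while (typeof current === 'object' && current !== null && !seen.has(current)) {
    seen.add(current);
    const record = current as Record<string, unknown>;
    const found = pick(record);
    if (found !== undefined) return found;
    current = record['cause'];
  }
  return undefined;
}

function getTransientStatusCodeSketch(error: unknown): number | undefined {
  return findInCauseChain(error, (e) => {
    const code = e['statusCode']; // assumed property name
    return typeof code === 'number' && TRANSIENT_STATUS_CODES.has(code) ? code : undefined;
  });
}

function isNoOutputErrorSketch(error: unknown): boolean {
  const hit = findInCauseChain(error, (e) =>
    e['name'] === 'AI_NoOutputGeneratedError' ? true : undefined,
  );
  return hit === true;
}

function isTransientApiErrorSketch(error: unknown): boolean {
  return isNoOutputErrorSketch(error) || getTransientStatusCodeSketch(error) !== undefined;
}

// Produces the reason shown in the retry notice, or undefined when the
// error is not transient.
function describeTransientApiErrorSketch(error: unknown): string | undefined {
  if (isNoOutputErrorSketch(error)) return 'Response stream interrupted (no output)';
  const code = getTransientStatusCodeSketch(error);
  return code !== undefined ? `Transient API error (${code})` : undefined;
}
```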
Wrap subagent execution in a per-subagent timeout (10 min) so a stalled
LLM stream can no longer hang the parent's Promise.allSettled join
indefinitely. On timeout the child is aborted, a subagent_finish event is
emitted (so the UI never shows a dangling 'started' agent), and the
active agentState is attached to SubagentTimeoutError so partial credits
are still aggregated. Applies to both fan-out and inline spawns.
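
One way to sketch the per-subagent timeout is a Promise.race between the subagent run and a timer that aborts the child. Everything below is illustrative: the Sketch-suffixed names, the state shape, and the onFinish callback (standing in for the subagent_finish event) are assumptions, and the real implementation may be structured differently.

```typescript
// Hypothetical state shape; only "partial credits survive a timeout"
// comes from the commit message.
interface AgentStateSketch {
  creditsUsed: number;
}

class SubagentTimeoutErrorSketch extends Error {
  agentState: AgentStateSketch;
  constructor(agentState: AgentStateSketch, timeoutMs: number) {
    super(`Subagent timed out after ${timeoutMs}ms`);
    this.name = 'SubagentTimeoutErrorSketch';
    this.agentState = agentState;
  }
}

type FinishStatus = 'completed' | 'timed_out' | 'failed';

async function runSubagentWithTimeout<T>(
  run: (signal: AbortSignal, state: AgentStateSketch) => Promise<T>,
  timeoutMs: number,
  onFinish: (status: FinishStatus) => void,
): Promise<T> {
  const controller = new AbortController();
  const state: AgentStateSketch = { creditsUsed: 0 };
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // stop the stalled child
      reject(new SubagentTimeoutErrorSketch(state, timeoutMs));
    }, timeoutMs);
  });
  try {
    const result = await Promise.race([run(controller.signal, state), timeout]);
    onFinish('completed');
    return result;
  } catch (err) {
    // A finish event is always emitted so no agent is left 'started'.
    onFinish(err instanceof SubagentTimeoutErrorSketch ? 'timed_out' : 'failed');
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Because every child now settles within the timeout, a parent joining with Promise.allSettled always completes, and it can read agentState off the rejection reason to aggregate partial credits.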
9 participants