feat: stop/start and autoscale previews to 0#6488

Open
chronark wants to merge 5 commits into main from idle-previews


@chronark chronark commented Jun 18, 2026

Collaborator

Previously, preview deployments stuck around for a very long time: our cron job only stopped them after 6 hours without requests. That meant we were paying for a lot of idle compute, especially for workspaces that push code frequently.

This changes a few things:

  • idle threshold is lowered to 1h
  • preview deployments on the same branch get stopped almost immediately when a new deployment is ready
  • there's a manual start and stop button in the dashboard
  • the deployment states standby and archived are consolidated into stopped; neither one was ever used

CleanShot 2026-06-19 at 07.14.42.mp4

We'll also do wake-on-request soon, but it would've made this PR significantly larger and I wanted to prioritize shipping this.

@vercel

vercel Bot commented Jun 18, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dashboard Ready Ready Preview, Comment Jun 19, 2026 5:34am
design Ready Ready Preview, Comment Jun 19, 2026 5:34am


@mintlify

mintlify Bot commented Jun 18, 2026

Contributor

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
engineering 🟢 Ready View Preview Jun 18, 2026, 1:20 PM


@mintlify

mintlify Bot commented Jun 18, 2026

Contributor

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
unkey 🟢 Ready View Preview Jun 18, 2026, 1:27 PM

@pullfrog pullfrog Bot left a comment

Contributor


Important

The destructive desired_state enum change ships without a data-backfill migration, and existing rows almost certainly contain standby. Please confirm the production migration plan before merging — details below. Everything else is minor.

Reviewed changes — full review of the preview stop/start + idle-to-zero feature: enum consolidation, the relocated every-minute idle cron, branch-displacement spin-down, and the manual stop/wake stack across Restate, connect RPC, tRPC, and dashboard.

  • Consolidate desired_state to running/stopped — drops standby and archived from the MySQL enum, the Drizzle schema, and the proto state machine (ChangeDesiredState), and points all callers at the single STOPPED value.
  • Relocate idle scaledown to a cron handler — moves the scan from the deploy worker into idlepreview/handler.go, lowers the idle window to 1h, and schedules it every minute as a singleton-keyed CronService object.
  • Spin down previous branch deployments — when a preview becomes ready, spinDownPreviousDeployments schedules sibling running deployments on the same branch to stop after a 1-minute grace via the new ListRunningDeploymentsByBranch query.
  • Manual stop/wake — adds StopDeployment/WakeDeployment Restate handlers, public connect RPCs, tRPC routers with audit logging, UI dialogs, and canStop/canWake eligibility; wake polls instance health inline until ready or regionReadyTimeout.
  • Add Overwrite to ScheduleDesiredStateChange — lets explicit user intent (swapLiveDeployment, manual stop) replace a pending transition while the idle cron and branch spin-down yield with Overwrite:false.

⚠️ Enum drop has no data-backfill migration and standby was in active use

The desired_state enum is narrowed from three values to two, but there is no migration in the PR to rewrite existing rows. The description states neither old value "was ever used", which does not hold for standby: the previous swapLiveDeployment, promote_handler, and the old idle cron all scheduled demoted deployments to STANDBY, so any production database that has ever promoted or rolled back a deployment has live standby rows.

Dropping an enum member while such rows exist is unsafe: in MySQL strict mode the ALTER fails outright, and in non-strict mode out-of-range values are silently coerced to the empty string '', which then fails every Go enum scan on read.

Technical details
# Enum drop has no data-backfill migration

## Affected sites
- `pkg/mysql/schema/deployments.sql:21` — enum narrowed to `('running','stopped')`. This is the declarative dev schema (applied fresh via `dev/Dockerfile.mysql` initdb), so dev is fine; production is not covered by any in-repo migration mechanism.
- `web/internal/db/src/schema/deployments.ts:60` — Drizzle schema updated to match, no accompanying migration file.
- `svc/ctrl/worker/deploy/deploy_handler.go` (was `STANDBY` in `swapLiveDeployment`), `svc/ctrl/worker/deploy/promote_handler.go:160` (was `STANDBY`) — prior code that wrote `standby` to demoted deployments, confirming the value is populated in prod.

## Required outcome
- A forward migration runs `UPDATE deployments SET desired_state = 'stopped' WHERE desired_state IN ('standby','archived')` BEFORE the `ALTER TABLE ... MODIFY COLUMN desired_state ENUM('running','stopped') ...`, and the code that reads/writes `stopped` is sequenced after the `ALTER` lands.

## Open questions for the human
- How are production MySQL schema changes applied for this repo (the schema dir is declarative for dev only)? Whoever owns that path needs the backfill + ALTER ordering above.
- Confirm whether `archived` was genuinely never written; even if so, `standby` clearly was.


Comment thread svc/ctrl/worker/deploy/deploy_handler.go
Comment thread pkg/db/queries/deployment_list_running_by_branch.sql

@ogzhanolguncu ogzhanolguncu left a comment

Collaborator


Happy path works but I have some questions.

When we make a new production deployment, the old deployment can be moved to stopped in swapLiveDeployment. But wake.go and wake.ts don't guard against production, and neither does getDeploymentActionEligibility. So technically we can wake a production deployment, and that might disrupt the system?
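One way such a guard could look, as a rough sketch only: the `deployment` struct, its field names, and the `"preview"`/`"production"` values are assumptions for illustration and are not taken from the PR. Whether production deployments should be wakeable at all is the open question here.

```go
package main

import "fmt"

// deployment is a hypothetical, trimmed-down view of a deployment row.
type deployment struct {
	Environment  string // assumed values: "preview" or "production"
	DesiredState string // "running" or "stopped"
}

// canWake (sketch) only allows waking stopped preview deployments, so a
// production deployment demoted by a swap cannot be woken by hand.
func canWake(d deployment) bool {
	return d.Environment == "preview" && d.DesiredState == "stopped"
}

func main() {
	preview := deployment{Environment: "preview", DesiredState: "stopped"}
	production := deployment{Environment: "production", DesiredState: "stopped"}

	fmt.Println("preview wakeable:", canWake(preview))
	fmt.Println("production wakeable:", canWake(production))
}
```

The same predicate would need to be enforced server-side as well as in the dashboard eligibility check, since the UI check alone does not stop a direct RPC call.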

}
return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("failed to load deployment: %w", err))
}

Collaborator


nit:

assert.Equal(deployment.Status, db.DeploymentsStatusReady, "deployment is not running"),
assert.Equal(deployment.DesiredState, db.DeploymentsDesiredStateRunning, "deployment is not running"),

What if we make those checks first so we don't run an additional query for no reason? If it's not running, there is no point in looking up the environment, right?

Request(
&hydrav1.ScheduleDesiredStateChangeRequest{
DelayMillis: 0,
State: hydrav1.DeploymentDesiredState_DEPLOYMENT_DESIRED_STATE_RUNNING,

Collaborator


Shouldn't we set overwrite: true here so that wake overrides any pending transition? Right now it defaults to false, so wake is a no-op if something is already pending (e.g. a cron stop), which means the wake might not actually win. Stop already uses overwrite: true.
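To make the concern concrete, here is a toy in-memory model of the overwrite semantics as described in this thread. It is an assumption-laden sketch: `scheduler` and `schedule` are hypothetical and do not mirror the real ScheduleDesiredStateChange handler, which runs on Restate.

```go
package main

import "fmt"

type desiredState string

const (
	running desiredState = "running"
	stopped desiredState = "stopped"
)

// scheduler (hypothetical) holds at most one pending transition per deployment.
type scheduler struct {
	pending map[string]desiredState
}

func newScheduler() *scheduler {
	return &scheduler{pending: map[string]desiredState{}}
}

// schedule records a transition and reports whether it was accepted.
// With overwrite=false it yields to any transition that is already pending.
func (s *scheduler) schedule(id string, state desiredState, overwrite bool) bool {
	if _, ok := s.pending[id]; ok && !overwrite {
		return false
	}
	s.pending[id] = state
	return true
}

func main() {
	s := newScheduler()

	// The idle cron queues a stop without overwriting.
	s.schedule("d1", stopped, false)

	// A wake without overwrite yields to the pending stop.
	fmt.Println("wake accepted:", s.schedule("d1", running, false), "pending:", s.pending["d1"])

	// A wake with overwrite replaces the pending stop.
	fmt.Println("wake accepted:", s.schedule("d1", running, true), "pending:", s.pending["d1"])
}
```

In this model the user's wake only wins when it overwrites, which is the behavior the comment above asks for.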


2 participants