Troubleshooting
Symptom table
| Symptom | Likely cause | Fix |
|---|---|---|
| Local `next dev` logs `[shadow-canary] middleware passthrough (VERCEL_ENV != 'production')` on every request | `VERCEL_GIT_REPO_SLUG` not set locally | Run `vercel env pull` or add `VERCEL_GIT_REPO_SLUG=<your-repo-slug>` to `.env.local`. Middleware degrades to passthrough in dev, so behavior is correct; the warning tells you the Edge Config key can’t be derived |
| Production deploy returns 500 on every request with `VERCEL_GIT_REPO_SLUG is not set` | The Vercel project isn’t linked to a Git repo, or the env var was explicitly overridden to empty | Re-link the Git repo in Vercel Project Settings > Git, then redeploy. `VERCEL_GIT_REPO_SLUG` is auto-injected on every linked-project deploy |
| 404 on JS/CSS chunks after deploy | Skew Protection is OFF | Enable Skew Protection in Vercel Project Settings (Pro/Enterprise required) |
| `/admin` shows “unconfigured” or fails to load Edge Config data | Edge Config store not linked to the project, or `shadow-<repo-slug>-canary` key not yet populated | Vercel dashboard > Storage > your store > Connected Projects, then add your project. Run `deploy-shadow.yml` once to populate the key |
| Cross-deploy rewrites return 401 | Deployment Protection is blocking shadow/previous deploy URLs | Disable SSO Protection or enable Protection Bypass for Automation (see Prerequisites) |
| Canary cron does not fire | Default branch is not master | Rename default branch to master (GitHub Settings > Branches) or update the on.push.branches trigger in the workflow files |
| Canary stuck at 4% (or any low %) | canaryPaused: true in Edge Config | Use the admin UI Resume button, or set canaryPaused: false directly in Edge Config |
| Canary stuck at 0% after deploy | First deploy with no previous prod URL | Use [skip-canary] on first production push, or push again after the bootstrap deploy |
| Shadow deploy gets 0% traffic | trafficShadowPercent: 0 in Edge Config | Set trafficShadowPercent: 1 in Edge Config (propagates in 60s) |
| `/debug` page shows wrong branch | You are hitting the shadow or previous prod URL directly | This is expected: those deploys always show their own branch. Visit via the custom domain to see the routing in action |
| Cookie does not stick across requests | Cookie set with wrong domain or SameSite mismatch | Ensure the middleware sets sameSite: 'lax' and path: '/'; check that the custom domain matches what the browser expects |
| SLO check always fails | `/api/slo` returns non-200 | Check the endpoint response: `curl https://your-app.vercel.app/api/slo`. The stub returns 200 by default, so a non-200 response points to a problem in your custom implementation |
| `vercel promote` fails in CI | Token does not have team scope, or the org ID is wrong | Regenerate the token with team scope and verify `VERCEL_ORG_ID` matches the team’s `orgId` in `.vercel/project.json` |
| Admin login returns 401 | ADMIN_USER or ADMIN_PASS env var not set, or wrong value | Check Vercel Project Settings > Environment Variables. Defaults are admin / 12345 if vars are absent |
| Edge Config reads fail at runtime | EDGE_CONFIG connection string not injected | Vercel injects this automatically when the store is linked. Re-link the store to the project and redeploy |
| Middleware runs on shadow/previous deploy and routes again | x-shadow-routed header not set or stripped | Verify rewriteTo sets x-shadow-routed: 1. Check that no other middleware or proxy strips it before reaching the target deploy |
| Rollback button in admin returns 500 | VERCEL_API_TOKEN not set or expired | Add/refresh VERCEL_API_TOKEN in Vercel env vars |
| Workflow fails with `::error::Edge Config read failed (HTTP NNN)` | Vercel API transient failure: 401/403 (token expired or wrong scope), 429 (rate limit), 5xx (Vercel outage), 000 (runner network error) | Fail-safe, not a bug. This step refuses to write Edge Config when the read can’t be trusted, preventing the historical state-clobber bug. Check Vercel status, wait for recovery, then re-trigger the workflow from the Actions UI. Edge Config was not mutated; the step exits before the PATCH. For 401/403, regenerate `VERCEL_TOKEN` with team scope. For 404 on the project lookup specifically, verify `VERCEL_PROJECT_ID` matches the project the token can access |
| Workflow fails with `::error:: … body is not valid JSON` | Vercel returned 200 OK with a non-JSON body (CDN error page during an incident) | Same recovery as the HTTP-NNN error above; re-trigger once the API is healthy. The fail-fast is intentional: a malformed 200 response would otherwise be parsed as `{}` and clobber state |
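
The last two rows describe a guard that refuses to trust a read before any write happens. A minimal TypeScript sketch of that idea follows. It is an illustration only: the function name `parseTrustedRead` is hypothetical, the real workflow step may be a shell script with different messages, and the final "must be a JSON object" check is an added assumption.

```typescript
// Hypothetical guard: return parsed data only when the read can be trusted.
// Throwing here means the caller never reaches the write (PATCH) step.
function parseTrustedRead(status: number, body: string): Record<string, unknown> {
  // Any non-2xx status (401, 403, 429, 5xx, or 000 from a failed request)
  // is treated as "state unknown", never as "state empty".
  if (status < 200 || status >= 300) {
    throw new Error(`Edge Config read failed (HTTP ${status})`);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    // A 200 with an HTML error page must not be silently replaced by {}.
    throw new Error("body is not valid JSON");
  }

  // Assumption: a usable read is a JSON object, not an array or scalar.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("body is not a JSON object");
  }
  return parsed as Record<string, unknown>;
}
```

The point of the design is that every failure path throws instead of returning a default value, so a bad read can never be mistaken for an empty configuration.
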
Canary stuck: detailed checklist
If trafficProdCanaryPercent has not changed in over 15 minutes:
- Check GitHub Actions > Canary ramp: is the workflow running? Look for a failed or skipped run.
- In the failing run, check the “Skip if no canary” step: is `paused=true`? If so, use the admin UI to resume.
- Check the “Run 2 SLO checks” step: what HTTP code is `/api/slo` returning?
- Verify `VERCEL_TOKEN` has not expired and `VERCEL_EDGE_CONFIG_ID` is correct.
- Manually trigger the workflow (Actions > Canary ramp > Run workflow) to test.
Shadow not routing
If you visit /debug from multiple incognito windows and never get the shadow deploy:
- Verify `deploymentDomainShadow` is set in Edge Config (not empty string)
- Verify `trafficShadowPercent` is greater than 0
- Check that the middleware is running on the production deploy (not preview); `VERCEL_ENV` must be `production`
- Bot detection may be filtering your client; check the user-agent
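
The checklist above can be expressed as a small diagnostic helper. This is a sketch under assumptions, not the project's middleware: the function `shadowBlockers` and the `looksLikeBot` flag are hypothetical, and only the names `deploymentDomainShadow`, `trafficShadowPercent`, and `VERCEL_ENV` come from this page. It returns every reason shadow routing would be skipped, so an empty result means none of the four conditions is blocking.

```typescript
// Hypothetical inputs gathered by hand from Edge Config and the deploy environment.
interface ShadowInputs {
  deploymentDomainShadow?: string; // Edge Config field
  trafficShadowPercent?: number;   // Edge Config field
  vercelEnv?: string;              // value of VERCEL_ENV on the deploy
  looksLikeBot: boolean;           // assumption: result of whatever bot check applies
}

// Returns one message per failed checklist item, in checklist order.
function shadowBlockers(inputs: ShadowInputs): string[] {
  const blockers: string[] = [];
  if (!inputs.deploymentDomainShadow || inputs.deploymentDomainShadow.trim() === "") {
    blockers.push("deploymentDomainShadow is empty");
  }
  if (typeof inputs.trafficShadowPercent !== "number" || !(inputs.trafficShadowPercent > 0)) {
    blockers.push("trafficShadowPercent is not greater than 0");
  }
  if (inputs.vercelEnv !== "production") {
    blockers.push("VERCEL_ENV is not production");
  }
  if (inputs.looksLikeBot) {
    blockers.push("client is classified as a bot");
  }
  return blockers;
}
```
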
At 1%, expect roughly one shadow assignment per 100 requests on average. Use `shadowForceIPs` or the `/debug` Force Shadow button for testing.
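
To put a number on how unlikely a shadow assignment is at low percentages, here is a short calculation. It assumes each fresh session (for example, a new incognito window) is assigned independently with probability `trafficShadowPercent / 100`; the function names are hypothetical.

```typescript
// Probability of at least one shadow assignment across `sessions`
// independent fresh sessions at the given shadow percentage.
function chanceOfShadow(percent: number, sessions: number): number {
  const p = percent / 100;
  return 1 - Math.pow(1 - p, sessions);
}

// Smallest number of fresh sessions needed to see at least one shadow
// assignment with the given confidence (a value strictly between 0 and 1).
function sessionsNeeded(percent: number, confidence: number): number {
  const p = percent / 100;
  if (p <= 0) return Infinity; // shadow traffic is disabled
  if (p >= 1) return 1;        // every session is shadow
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}
```

Under that assumption, `chanceOfShadow(1, 100)` is about 0.63, so 100 attempts still miss the shadow deploy more than a third of the time, and `sessionsNeeded(1, 0.95)` is 299. That is why forcing shadow is the practical way to test.
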
SLO false positives
If the canary rolls back but the deploy looks healthy:
- The SLO endpoint may be timing out: the cron uses `--max-time 10`. If your check makes slow external API calls, it may be cut off.
- The SLO endpoint may be returning 500 due to an unrelated dependency issue. Make the endpoint fail open if monitoring is unavailable.
- Two checks 30 seconds apart is a small sample. If your error rate is naturally spiky (e.g. a scheduled job that briefly spikes errors), the check timing may coincide with a spike. Consider adding a moving average or increasing the check interval.
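
Two of the suggestions above, failing open and smoothing over spikes, can be sketched in a few lines of TypeScript. Both helpers are hypothetical illustrations; the names, the window size, and the pass ratio are assumptions rather than project defaults, and `windowSize` is assumed to be a positive integer.

```typescript
// Fail open: if the probe itself throws (monitoring unreachable, timeout),
// report healthy instead of letting an unrelated outage trigger a rollback.
async function checkFailOpen(probe: () => Promise<boolean>): Promise<boolean> {
  try {
    return await probe();
  } catch {
    return true;
  }
}

// Moving window: pass when at least `minPassRatio` of the most recent
// `windowSize` results passed, so a single spiky sample does not decide alone.
function sloPasses(results: boolean[], windowSize: number, minPassRatio: number): boolean {
  const recent = results.slice(-windowSize);
  if (recent.length === 0) return true; // no data yet: fail open
  const passed = recent.filter((r) => r).length;
  return passed / recent.length >= minPassRatio;
}
```

For example, `sloPasses([true, false, true, true], 4, 0.75)` passes despite one failed sample. The trade-off is that a larger window reacts more slowly to a real regression.
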
Related:
- Workflows — workflow internals and customization
- Edge Config — field reference for manual edits
- Dashboard — Pause, Resume, Cancel controls