
Feature flags for AI

AI writes code faster than you can review it, and AI features behave differently for every user. A feature flag puts a switch in front of both - ship dark, release to a slice, and kill it the moment it misbehaves. No redeploy.

The core idea

AI ships faster than you can verify

Coding agents and LLM features generate more code and more behavior than any team can review line by line. The bottleneck is no longer writing it - it is being sure it is safe in front of real users.

A feature flag wraps the new path so it merges dark: the code ships, but exposure stays under your control, decoupled from deploy. You decide who sees the AI and when - and you can take it back in seconds.

[Illustration: AI code firehose funneling through a single flag switch]

Feature flags let you ship AI work safely:

Merge AI code dark

Agent-written code lands behind a flag, off by default, instead of going straight to every user.

Contain the blast radius

A hallucination or a bad model only ever reaches the slice of traffic you opened it to.

Kill it instantly

One YAML change turns the AI path off and falls back to the safe one - no rollback, no redeploy.

Why shipping AI without a flag is risky

Traditional code is deterministic - you can review it once and trust it. AI is not. That breaks the assumptions a big-bang release relies on.

  • Non-determinism. The same prompt can return a different answer per user and per run - you cannot fully test it before release.
  • Silent quality regressions. A new model or a tweaked prompt can get subtly worse in ways no unit test catches.
  • Cost and latency spikes. A bigger model can quietly multiply your bill and your p99 the moment it goes live.
  • Unreviewed agent code at 100%. Without a flag, machine-written code reaches every user in a single deploy.
  • No fast way back. Reverting means a redeploy or a rollback - minutes to hours - instead of a one-line flag flip.

AI-generated code

Wrap agent output in a flag

When Copilot, Cursor, or Claude Code writes a new path, gate it behind a boolean flag that defaults off. The code merges and deploys with everything else, but no user reaches it until you say so.

How to use it: turn it on for internal staff first, watch it in production, then widen to everyone - or flip it back if it misbehaves.

[Illustration: Agent-written code entering a gated pipe behind a flag]

flags.goff.yaml

    ai-summary:
      variations:
        on: true
        off: false
      targeting:
        # Internal staff get the agent-written path first
        - query: email ew "@your-company.com"
          variation: on
      defaultRule:
        variation: off

Default off. Only your own team gets the agent-written summary until you widen it.
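On the application side, the gate is a single branch around the new path. The sketch below is illustrative only: `ai_summary_enabled` is a hypothetical stand-in that hard-codes the rule from the flag above instead of calling a real flag SDK, and `legacy_summary` and `agent_summary` are made-up functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    targeting_key: str
    email: str


def ai_summary_enabled(user: User) -> bool:
    """Hypothetical stand-in for evaluating the ai-summary flag.

    Mimics the rule `email ew "@your-company.com"`; everyone else
    gets the default rule, which is off.
    """
    return user.email.endswith("@your-company.com")


def legacy_summary(text: str) -> str:
    # Existing, reviewed path: keep only the first sentence.
    return text.split(".")[0].strip() + "."


def agent_summary(text: str) -> str:
    # Hypothetical agent-written path; it ships dark behind the flag.
    return " ".join(text.split()[:3]) + "..."


def summarize(user: User, text: str) -> str:
    # The flag decides which path runs; both paths are deployed.
    if ai_summary_enabled(user):
        return agent_summary(text)
    return legacy_summary(text)
```

In a real service the stand-in would be replaced by a boolean flag evaluation whose in-code default is `False`, so a failed evaluation also lands on the reviewed path.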

Targeting docs

AI features

Roll a model out to a percentage

Put the model name in a flag variation and send a small slice of traffic to the new one. The split is deterministic on the targeting key, so a given user keeps a consistent experience while you ramp.

How to use it: run it as a canary. Start at a few percent, watch quality, cost, and latency, then ramp to everyone with a progressive rollout.

[Illustration: User traffic split between two models on a rising ramp]

flags.goff.yaml

    llm-model:
      variations:
        current: "gpt-4o-mini"
        candidate: "gpt-4o"
      defaultRule:
        percentage:
          candidate: 10 # 10% of users hit the new model
          current: 90

Bump candidate to 25, then 100, as the new model proves out. No redeploy.
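To see why a split keyed on the targeting key stays consistent per user, here is a minimal sketch of hash-based bucketing. The hashing scheme is an assumption chosen for illustration and is not necessarily what the flag engine does internally; `bucket` and `pick_model` are hypothetical names.

```python
import hashlib


def bucket(flag_name: str, targeting_key: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{targeting_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(targeting_key: str, candidate_percent: int) -> str:
    # Buckets below the threshold get the candidate model.
    # In this sketch, raising the threshold only ever adds users.
    if bucket("llm-model", targeting_key) < candidate_percent:
        return "gpt-4o"
    return "gpt-4o-mini"
```

Because the bucket depends only on the flag name and the targeting key, the same user gets the same answer on every call, and in this sketch a user who was on the candidate at 10% is still on it at 25%.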

Progressive rollout docs

Compare models & prompts

A/B test which AI is actually better

Different is not the same as better. Define a variation per prompt or model, split traffic 50/50, and wrap it in an experimentation rollout so it runs for a fixed window.

How to use it: pair it with the data export to measure which variation won, then keep the winner.

[Illustration: Two AI variants compared inside a bracketed start-end window]

flags.goff.yaml

    support-prompt:
      variations:
        promptA: "concise-v1"
        promptB: "detailed-v2"
      defaultRule:
        percentage:
          promptA: 50
          promptB: 50
      experimentation:
        start: 2026-07-01T00:00:00.1-05:00
        end: 2026-07-08T00:00:00.1-05:00

A clean one-week measurement of two prompts, then an automatic return to the default.
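Deciding which variation won means joining the exported evaluations with an outcome you record yourself. The sketch below assumes a simplified, hypothetical shape: evaluations as `(user_key, variation)` pairs and outcomes as a `user_key -> bool` mapping (for example, "ticket resolved"). The real export format will differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_variation(
    evaluations: Iterable[Tuple[str, str]],
    outcomes: Dict[str, bool],
) -> Dict[str, float]:
    """Return the share of successful outcomes per variation."""
    # Collapse repeated evaluations to one variation per user;
    # a stable targeting key keeps this assignment consistent.
    assigned = dict(evaluations)

    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for user_key, variation in assigned.items():
        if user_key not in outcomes:
            continue  # no outcome recorded for this user
        totals[variation] += 1
        wins[variation] += int(outcomes[user_key])
    return {v: wins[v] / totals[v] for v in totals}
```

A raw rate like this is only a starting point; with small samples, a significance test is worth running before keeping the winner.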

Experimentation docs

Kill switch

Turn the AI off in seconds

Keep a deterministic fallback variation - a rules engine, a cached answer, or the previous model - alongside the live AI path. When something goes wrong, you do not debug under fire; you flip the switch.

How to use it: change one line in the flag and every user drops to the safe path on the relay proxy’s next poll - no redeploy, no rollback.

[Illustration: A large switch cutting an AI path back to a safe deterministic fallback]

flags.goff.yaml

    ai-chat:
      variations:
        live: "llm"
        fallback: "rules-engine"
      targeting:
        - query: beta eq "true"
          variation: live
      defaultRule:
        # Flip this one line to "fallback" to kill the AI path
        variation: live

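Flip defaultRule to fallback and the AI path is off for every user who falls through to the default rule, in seconds. In this example, users matching the beta rule are still targeted to live, so point that rule at fallback too (or remove it) to turn the feature off for everyone.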
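In application code, the kill switch only works if the fallback path is always reachable. A minimal sketch, assuming the flag's string value has already been evaluated and is passed in as `variation`; `rules_engine_reply` and `answer` are hypothetical names, and catching LLM errors is an extra safety choice beyond what the flag itself provides.

```python
from typing import Callable


def rules_engine_reply(message: str) -> str:
    # Deterministic fallback: the same input always gives the same answer.
    if "refund" in message.lower():
        return "You can request a refund from the Orders page."
    return "A support agent will follow up shortly."


def answer(message: str, variation: str, llm: Callable[[str], str]) -> str:
    """Route on the ai-chat flag value; fall back if the LLM path is off or fails."""
    if variation != "llm":
        # Covers "rules-engine" and any unexpected value.
        return rules_engine_reply(message)
    try:
        return llm(message)
    except Exception:
        # An LLM outage degrades to the safe path instead of an error.
        return rules_engine_reply(message)
```

Treating anything other than `"llm"` as the fallback means a typo in the flag value fails safe and does not send traffic to the AI path.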

What to flag in your AI stack

Anywhere AI touches production is a place to put a switch - here is what to reach for and when.

| What | Reach for it when | How to flag it |
|---|---|---|
| AI-written code path | A coding agent wrote a new path you have not fully reviewed in production. | Gate it behind a boolean flag, default off, and open it to internal users first. |
| New / upgraded model | You are swapping in a new LLM and want to prove it before everyone hits it. | Canary it with a percentage, then ramp it progressively. |
| Prompt change | You changed a prompt and need to know it is actually better, not just different. | Run an experimentation rollout and measure the two against each other. |
| New AI feature | A user-facing AI feature is ready to ship but unproven at scale. | Release it dark, then widen exposure as the data stays green. |
| Deterministic fallback | The AI path can fail, time out, or get expensive without warning. | Keep a non-AI variation and a kill switch that flips to it in one edit. |

Pitfalls to avoid

  • Shipping agent code straight to 100%. Machine-written code deserves the same dark launch as anything else - default the flag off and widen on evidence.
  • No deterministic fallback. If the only path is the AI path, an outage or a bad answer has nowhere to fall back to.
  • No kill switch. Always keep the one-line flip that takes the feature off without a deploy.
  • Bucketing experiments on an unstable key. Consistency rides on the targeting key; a value that changes per request flips users between models mid-session.
  • Leaving a finished AI rollout in the code. Once a model is at 100% for everyone, the flag is debt - clean it up.
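The unstable-key pitfall is easy to reproduce. The sketch below uses an illustrative hash-based 50/50 split (an assumption, not the flag engine's actual algorithm) and compares a stable user ID against a key regenerated on every request; `variation_for` and `variations_seen` are hypothetical names.

```python
import hashlib
import uuid


def variation_for(key: str) -> str:
    """Illustrative 50/50 split keyed on the targeting key."""
    digest = hashlib.sha256(f"support-prompt:{key}".encode()).digest()
    return "promptA" if int.from_bytes(digest[:8], "big") % 100 < 50 else "promptB"


def variations_seen(keys) -> set:
    # Which variations does one "user" see across a series of requests?
    return {variation_for(k) for k in keys}


# Stable key: 50 requests from the same user ID.
stable = variations_seen(["user-42"] * 50)

# Unstable key: a fresh random ID on every request.
unstable = variations_seen(str(uuid.uuid4()) for _ in range(50))
```

With the stable key the user sees a single variation across all requests; with a per-request key the same person bounces between both prompts, which muddies both the experience and the measurement.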

Ship AI with a safety net

Self-hosted, OpenFeature-native, MIT-licensed. Gate every model, prompt, and agent-written path behind a flag - and kill it with a one-line YAML change.

Frequently asked questions

Why do AI features need a feature flag?

AI is non-deterministic: the same input can produce a different output for every user and every run. A flag lets you release that behavior to a small slice first, measure it, and reverse it instantly if quality, latency, or cost regresses - without shipping new code.

How do I roll out a new LLM model safely?

Put the model name in a flag variation and use a percentage or progressive rollout. Start at a few percent, watch your metrics, then ramp to everyone. Because evaluation is deterministic on the targeting key, a given user keeps a consistent experience while you widen the rollout.

Can I A/B test two prompts or two models?

Yes. Define a variation per prompt or model, split traffic with a percentage, and wrap it in an experimentation rollout so it runs for a fixed window. Pair it with the data export to measure which variation actually won.

What happens when an AI feature misbehaves?

Keep a deterministic fallback variation - a rules engine, a cached answer, or the previous model - and a kill switch. Flipping the flag routes everyone to the safe path on the relay proxy’s next poll, with no redeploy and no rollback.

Do I need to redeploy to change which model is live?

No. The model lives in the flag configuration. Edit the YAML and the relay proxy picks it up on its next poll, so swapping models, changing the split, or killing the feature never requires a deploy.

Can I gate code my coding agent wrote?

Yes - that is one of the strongest uses. Wrap the agent-written path in a feature flag that defaults off, merge it dark, and turn it on for internal users before widening. The unreviewed code reaches production behind a switch instead of going live to everyone in a single deploy.