A/B testing with feature flags

Ship two versions, measure the winner. GO Feature Flag splits your users into A and B, then exports the data so your own analytics can tell you which one actually performed better.

The idea

Test two versions, let the data pick the winner

An A/B test is shorthand for a simple controlled experiment: two versions, A and B, are shown to comparable groups of users, and you measure which one moves the metric you care about.

You do not need a separate experimentation platform for this. GO Feature Flag already gives you the two building blocks: evaluation to split users into A and B, and exporters to capture what happened - so you can decide the winner with the analytics you already own.

Users are split into variation A and variation B; every evaluation and outcome is exported to a database, where a comparison shows which variation won.

What you get

A stable, deterministic split

Users are bucketed by hashing the targeting key, so the same person stays in the same group for the whole test - no flicker between A and B.

Your data, your warehouse

Exposures and outcomes are exported to a destination you own - BigQuery, S3, Kafka, a webhook - so you analyse results with the tools you already trust.

Time-boxed and reversible

An experimentation rollout runs the split only inside a window you set, then falls back to the default automatically - no cleanup redeploy.

A/B testing in three steps

The recipe is the same one the docs recommend: an experimentation rollout combined with the export of your data. Split your users, capture who saw what, record what they did - then read the result. Every step lives in configuration or a single SDK call, so there is no redeploy to start or stop a test.

Step 1 · Evaluation

Split your users into A and B

A percentage rollout sends a share of traffic to each variation - say 50/50. The split is deterministic on the targeting key, so a user keeps the same variation for the whole test instead of flipping between requests.
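To make "deterministic on the targeting key" concrete, here is a minimal sketch of hash-based bucketing. It is an illustration only: FNV-1a is a stand-in and not necessarily the hash GO Feature Flag uses, and `fnv1a32` and `assignVariation` are hypothetical names.

```javascript
// Illustrative 32-bit FNV-1a over UTF-16 code units (a stand-in hash, assumed for this sketch).
function fnv1a32(input) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32-bit
  }
  return hash >>> 0;
}

// Map a targeting key to a variation using cumulative percentage ranges.
// `percentages` is an object such as { control: 50, candidate: 50 }.
function assignVariation(flagKey, targetingKey, percentages) {
  const bucket = fnv1a32(`${flagKey}:${targetingKey}`) % 100; // 0..99
  let upperBound = 0;
  for (const [variation, share] of Object.entries(percentages)) {
    upperBound += share;
    if (bucket < upperBound) return variation;
  }
  throw new Error("percentages must sum to 100");
}

const split = { control: 50, candidate: 50 };
const first = assignVariation("checkout-experiment", "user-123", split);
const second = assignVariation("checkout-experiment", "user-123", split);
console.log(first === second); // true: same key and same split give the same variation
```

Because the result depends only on the flag key, the targeting key, and the percentages, any server can compute it without shared state.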

Wrap it in an experimentation rollout to bound the test to a start and end date. Inside the window users get the split; outside it, everyone falls back to the default - a clean measurement period with an automatic end.
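The window logic can be pictured as a simple date check. This is a sketch under assumptions: `isExperimentActive` and `resolveVariation` are hypothetical helpers, and whether the real boundaries are inclusive or exclusive is not stated here, so the half-open interval below is a guess.

```javascript
// Returns true when `now` falls inside the [start, end) window (boundary handling is an assumption).
function isExperimentActive(now, start, end) {
  const t = now.getTime();
  return t >= new Date(start).getTime() && t < new Date(end).getTime();
}

// Hypothetical helper: serve the split result inside the window, the fallback value outside it.
function resolveVariation(now, experiment, splitResult, fallbackValue) {
  return isExperimentActive(now, experiment.start, experiment.end)
    ? splitResult
    : fallbackValue;
}

const experiment = {
  start: "2026-04-01T00:00:00.000-05:00",
  end: "2026-04-15T00:00:00.000-05:00",
};
const duringTest = new Date("2026-04-07T12:00:00.000Z");
const afterTest = new Date("2026-05-01T12:00:00.000Z");
console.log(resolveVariation(duringTest, experiment, "new-checkout", "current-checkout"));
console.log(resolveVariation(afterTest, experiment, "new-checkout", "current-checkout"));
```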

A crowd of users is deterministically routed through a hashing node into two equal groups, A and B, inside a bounded time window.

flags.goff.yaml

checkout-experiment:
  variations:
    control: "current-checkout"
    candidate: "new-checkout"
  defaultRule:
    percentage:
      control: 50
      candidate: 50
  # only run the experiment inside this window
  experimentation:
    start: 2026-04-01T00:00:00.1-05:00
    end: 2026-04-15T00:00:00.1-05:00

A deterministic 50/50 split that runs only for the two-week window, then falls back to the default.
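On the application side, reading this flag is one evaluation call. The sketch below assumes the OpenFeature server SDK call shape `getStringValue(flagKey, defaultValue, context)`; `stubClient`, `selectCheckout`, and the renderer names are hypothetical and exist only so the example runs without the SDK installed.

```javascript
// Hypothetical stand-in for an OpenFeature client.
// A real client would come from OpenFeature.getClient() after a provider is registered.
const stubClient = {
  async getStringValue(flagKey, defaultValue, context) {
    // Pretend the provider placed this one user in the candidate group.
    return context.targetingKey === "user-123" ? "new-checkout" : defaultValue;
  },
};

// Pick the checkout implementation from the flag value.
async function selectCheckout(client, evaluationContext) {
  const variation = await client.getStringValue(
    "checkout-experiment",
    "current-checkout", // value used if the flag cannot be evaluated
    evaluationContext,
  );
  return variation === "new-checkout" ? "renderNewCheckout" : "renderCurrentCheckout";
}

selectCheckout(stubClient, { targetingKey: "user-123" }).then((renderer) => {
  console.log(renderer);
});
```

Passing the current behavior as the default value keeps users on the existing checkout whenever evaluation fails.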

Step 2 · Exporters

Capture who saw which variation

Every flag evaluation emits a feature event - the targeting key, the flag, and the variation that user received. An exporter ships those events to a destination you own, so you have a record of exactly who was exposed to A and who was exposed to B.
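Since a user is usually evaluated many times, analysis typically starts by reducing feature events to one exposure per user. A minimal sketch, assuming each exported event carries `kind`, `key` (the flag), `userKey` (the targeting key), `variation`, and `creationDate`; treat these field names as assumptions and check them against the event format documentation. `firstExposures` is a hypothetical helper.

```javascript
// Keep the earliest exposure per user for one flag.
function firstExposures(events, flagKey) {
  const byUser = new Map();
  for (const event of events) {
    if (event.kind !== "feature" || event.key !== flagKey) continue;
    const seen = byUser.get(event.userKey);
    if (!seen || event.creationDate < seen.creationDate) {
      byUser.set(event.userKey, event);
    }
  }
  return byUser;
}

// Made-up sample data for illustration only.
const sampleEvents = [
  { kind: "feature", key: "checkout-experiment", userKey: "user-1", variation: "control", creationDate: 1775000000 },
  { kind: "feature", key: "checkout-experiment", userKey: "user-1", variation: "control", creationDate: 1775000600 },
  { kind: "feature", key: "checkout-experiment", userKey: "user-2", variation: "candidate", creationDate: 1775000300 },
];
console.log(firstExposures(sampleEvents, "checkout-experiment").size); // 2
```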

You wire one exporter per event type. Point them at the same warehouse - one table for exposures, one for outcomes - and your experiment data lands where your analysts can query it.

Evaluation events stream through an exporter and fan out to destinations: a database, object storage, and a message queue.

goff-proxy.yaml

# goff-proxy.yaml
exporters:
  # exposures: which user saw which variation
  - kind: bigquery
    projectID: "my-project"
    datasetID: "goff_experiments"
    tableName: "feature_flag_evaluations"
    eventType: "feature"
  # outcomes: what each user actually did
  - kind: bigquery
    projectID: "my-project"
    datasetID: "goff_experiments"
    tableName: "tracking_events"
    eventType: "tracking"

Two BigQuery exporters on the relay proxy: exposures (feature) and outcomes (tracking) into the same dataset.

Step 3 · Tracking

Record what your users did

Exposure alone does not tell you who won - you also need the outcome. The OpenFeature Tracking API lets you record a conversion, a purchase amount, or any action against the same targeting key you evaluate with. Those tracking events flow through the tracking exporter you configured in step 2.

Now you can join the two in your analytics: for each variation, how many users converted and how much they were worth. That comparison is your A/B test result.
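In a warehouse this join would normally be a SQL query; the same logic is sketched here in plain JavaScript so the steps are visible. It assumes one exposure row per user with `userKey` and `variation`, and outcome rows with `userKey` and `value`; those shapes and the `summarize` helper are illustrative assumptions, and the data is made up.

```javascript
// exposures: [{ userKey, variation }], outcomes: [{ userKey, value }]
function summarize(exposures, outcomes) {
  const variationByUser = new Map(exposures.map((e) => [e.userKey, e.variation]));
  const stats = {};
  for (const variation of variationByUser.values()) {
    stats[variation] ??= { users: 0, converted: 0, revenue: 0 };
    stats[variation].users += 1;
  }

  const convertedUsers = new Set();
  for (const outcome of outcomes) {
    const variation = variationByUser.get(outcome.userKey);
    if (!variation) continue; // outcome from a user who was never exposed
    if (!convertedUsers.has(outcome.userKey)) {
      convertedUsers.add(outcome.userKey);
      stats[variation].converted += 1; // count each user at most once
    }
    stats[variation].revenue += outcome.value ?? 0;
  }

  for (const s of Object.values(stats)) {
    s.conversionRate = s.converted / s.users;
  }
  return stats;
}

const report = summarize(
  [
    { userKey: "u1", variation: "control" },
    { userKey: "u2", variation: "control" },
    { userKey: "u3", variation: "candidate" },
    { userKey: "u4", variation: "candidate" },
  ],
  [
    { userKey: "u3", value: 100 },
    { userKey: "u1", value: 50 },
    { userKey: "u3", value: 20 },
  ],
);
console.log(report);
```

Counting converted users rather than conversion events keeps one heavy buyer from inflating a variation's conversion rate.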

Each variation's exposures are joined with its outcomes; a comparison chart highlights the winning variation.

checkout.js

// When the user completes the action you care about,
// record it against the SAME context you evaluate flags with.
client.track("checkout-completed", evaluationContext, {
  value: 99.77,
  currencyCode: "USD",
});

Recorded against the same context as the flag evaluation, so exposure and outcome join on the targeting key.

Where your experiment data can go

Data warehouses

Stream feature and tracking events straight into BigQuery for SQL analysis per variation.

Object storage

Batch events to AWS S3, Google Cloud Storage, Azure Blob Storage, or the local file system as JSON, CSV, or Parquet.

Streaming & queues

Push events in near real time to Kafka, AWS Kinesis, Google Cloud Pub/Sub, or AWS SQS.

Webhook & OpenTelemetry

Send events to any HTTP endpoint, or emit them as OpenTelemetry signals for your observability stack.

Why run your A/B tests on GO Feature Flag

You own the data. Exposures and outcomes go to your warehouse, not a vendor’s, so there is no per-seat experimentation bill and no data leaves your stack.
Unlike hosted experimentation platforms such as LaunchDarkly or Optimizely, the split happens inside your self-hosted flag evaluation and the analysis runs in your own warehouse, so A/B testing stays free and MIT-licensed. See how GO Feature Flag compares.
OpenFeature-native. Assignment and tracking use the standard API, so any OpenFeature SDK and the relay proxy work the same way.
It composes with your rollouts. An experiment is just one more thing a feature flag can do - target a segment first, then split within it.
The same recipe compares AI models or prompts. Swap the variations and you are A/B testing AI features instead of UI changes.

A/B testing pitfalls to avoid

Re-tuning the percentage mid-test. Changing the split reshuffles some users between A and B. Fix the percentages up front and use an experimentation rollout when you need a stable group for the whole window.
Bucketing on an unstable key. Consistency rides on the targeting key; a value that changes per request makes users flip variations and pollutes your results.
Exporting exposures but not outcomes. Knowing who saw A or B is only half of it - without tracking events you cannot say which variation actually won.
Stopping the moment it looks good. Peeking at an experiment before it reaches significance produces false winners. Size the window up front and let it run.
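For the last pitfall, one common way to check a conversion-rate difference is a two-proportion z-test, run once at the sample size you planned. The sketch below is a generic statistics example, not something GO Feature Flag provides; `twoProportionZ` is a hypothetical helper and the counts are made up.

```javascript
// z statistic for the difference between two conversion rates (pooled standard error).
function twoProportionZ(conversionsA, usersA, conversionsB, usersB) {
  const rateA = conversionsA / usersA;
  const rateB = conversionsB / usersB;
  const pooled = (conversionsA + conversionsB) / (usersA + usersB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / usersA + 1 / usersB));
  return (rateB - rateA) / standardError;
}

// A: 200 of 2000 converted (10%), B: 260 of 2000 converted (13%).
const z = twoProportionZ(200, 2000, 260, 2000);
const significant = Math.abs(z) > 1.96; // two-sided test at the 5% level
console.log({ z: z.toFixed(2), significant });
```

The 1.96 threshold only holds if you look once. Checking it daily and stopping at the first crossing is exactly the peeking problem described above.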

Run your next experiment on your own data

Self-hosted, OpenFeature-native, MIT-licensed. Split with a YAML file, export to your warehouse, measure the winner - no experimentation SaaS required.

Frequently asked questions

Does GO Feature Flag support A/B testing?

Yes. GO Feature Flag supports A/B testing natively. You assign users to variation A or B with a percentage split inside an experimentation rollout, and every exposure and outcome is exported to a destination you own so you can measure which variation won - there is no separate experimentation platform to buy.

Do I need a separate A/B testing tool with GO Feature Flag?

No. GO Feature Flag already gives you the two halves of an experiment: the evaluation engine assigns each user to a variation, and exporters ship the data out. You run the analysis in whatever analytics tool or warehouse you already use - there is no extra experimentation SaaS to buy.

How does GO Feature Flag decide which users are in A vs B?

It hashes the evaluation context’s targeting key deterministically. The same user always lands in the same variation as long as the percentages do not change, so their experience stays consistent for the whole test - and it is consistent across servers and restarts.

Where does my experiment data go?

Wherever you point your exporters. Every evaluation emits a feature event (who saw which variation) and the Tracking API emits tracking events (what they did). Both flow through exporters to destinations you control: BigQuery, S3, GCS, Azure Blob, Kafka, Kinesis, Pub/Sub, SQS, a webhook, or OpenTelemetry.

How is an A/B test different from a progressive rollout?

A progressive rollout is about shipping a change gradually - you are not measuring, you are ramping. An A/B test holds a fixed split for a fixed window so you can compare outcomes cleanly. Use the experimentation rollout (not a percentage you keep re-tuning) when you need a stable measurement group.

How long should I run an A/B test?

Long enough to reach statistical significance for the metric you care about - stopping early (peeking) leads to false winners. A duration calculator such as VWO’s helps you size the window before you start; you then set that window as the experimentation start and end dates.
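As a rough idea of what such a calculator does, the sketch below uses the standard normal-approximation formula for comparing two proportions at a two-sided 5% significance level and 80% power. The baseline rate, expected rate, and daily traffic are made-up inputs, and `usersPerGroup` and `daysToRun` are hypothetical helpers; a dedicated calculator may use a different method.

```javascript
// z-scores for a two-sided 5% significance level and 80% power.
const Z_ALPHA = 1.96;
const Z_BETA = 0.8416;

// Users needed in each group to detect a change from baselineRate to expectedRate.
function usersPerGroup(baselineRate, expectedRate) {
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  const effect = expectedRate - baselineRate;
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / effect ** 2);
}

// Days needed to expose enough users across all groups.
function daysToRun(perGroup, groups, eligibleUsersPerDay) {
  return Math.ceil((perGroup * groups) / eligibleUsersPerDay);
}

const perGroup = usersPerGroup(0.10, 0.12); // detect a lift from 10% to 12%
const days = daysToRun(perGroup, 2, 1000); // 1000 eligible new users per day
console.log({ perGroup, days });
```

With these inputs the test needs roughly 3,800 users per group, which is about 8 days of traffic. Rounding up to whole weeks is a common way to smooth out weekday effects.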

Does this work with the relay proxy and any SDK?

Yes. A/B testing is built on the standard evaluation and export paths, so it works the same across every OpenFeature SDK talking to the relay proxy. Exposures are captured automatically on evaluation, and outcomes are recorded through the OpenFeature Tracking API in your SDK.
