Running a Live Model A/B Test with Majordomo Experiments
You ran a replay. The match rate came back at 96%. Cost savings of 70%. The cheaper model looks like a drop-in replacement.
You still haven’t switched.
There’s a gap between “this model handled last month’s traffic well” and “this model is handling today’s traffic well.” Production data drifts. Request patterns shift. Prompts get updated. Customers do things you didn’t anticipate. An offline replay tells you about a model’s historical behavior, which is a strong signal, but it is not the same thing as running the model live and watching what happens.
That gap is what Experiments is built to close.
What the Experiments Feature Does
Experiments routes a portion of your live production traffic to a different model and measures the results in real time. You define a set of arms, each with a model and a relative weight, and the gateway handles the splitting transparently. Your application sends the same requests it always has. Majordomo intercepts them, assigns each request to an arm, overrides the model field, and forwards the request to the appropriate provider.
Every request that goes through an experiment is tagged with the experiment ID and arm ID. That means you get per-arm breakdowns of cost, latency, and error rates as traffic flows through, with no instrumentation work on your end.
The key difference from Replay is causality. Replay runs historical requests against a new model and compares outputs. Experiments runs current requests against a new model and measures outcomes. You are not comparing responses side by side. You are watching two models serve real users under identical conditions and seeing which one performs better.
Setting Up an Experiment
In the Majordomo dashboard, create a new experiment from the Experiments section. It starts in draft status, which means it is not active yet and you can edit the configuration freely.
Name and time window. Give the experiment a name that identifies what you are testing. Set a start and end time. If you leave the end time open, the experiment runs until you manually complete or pause it.
Traffic filters. This is where you scope the experiment to a specific slice of your traffic.
The most useful filter for most teams is metadata. If you have been tagging your requests with feature names using X-Majordomo-Feature or a similar header, you can run an experiment on just that feature’s traffic. Set a metadata filter like feature = document-classification and only requests carrying that tag will enter the experiment. Everything else passes through normally on whatever model it was already using.
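To make the scoping rule concrete, here is a minimal sketch of how a metadata filter decides whether a request enters the experiment. It is an illustration, not Majordomo's implementation: the function name, the dictionary shape, and the assumption that multiple filters combine with AND and match on exact equality are all hypothetical.

```python
def matches_filters(request_metadata: dict[str, str], filters: dict[str, str]) -> bool:
    """Return True when every filter key is present with exactly the expected value.

    Assumption: filters are ANDed together and compared by string equality.
    """
    return all(request_metadata.get(key) == value for key, value in filters.items())


# Hypothetical filter mirroring "feature = document-classification".
filters = {"feature": "document-classification"}

in_scope = matches_filters(
    {"feature": "document-classification", "user_id": "u-17"}, filters
)
out_of_scope = matches_filters({"feature": "summarization"}, filters)
untagged = matches_filters({}, filters)
```

Under these assumptions only the first request enters the experiment. The other two pass through on whatever model they were already using.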
You can also scope an experiment to a specific API key. This is useful if you want to limit an experiment to traffic from a particular service or team without relying on metadata tagging.
Assignment strategy. By default, each request gets a fresh random arm assignment based on the configured weights. That works well for features where individual requests are independent.
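As a rough picture of what per-request random assignment means, the sketch below draws an arm in proportion to its weight using Python's standard library. The function name and the weights dictionary are assumptions for illustration. The gateway's actual sampling code is not shown in this post.

```python
import random


def assign_arm(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one arm at random, with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


# Simulate 10,000 independent requests against an 80/20 split.
rng = random.Random(7)  # seeded only so the simulation is repeatable
weights = {"control": 80, "challenger": 20}
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[assign_arm(weights, rng)] += 1
```

Each draw is independent, so the same user can land on different arms from one request to the next. The observed split also only approaches the configured weights as volume grows.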
If your feature involves multi-turn conversations or workflows where consistency matters, use sticky assignment. Set a sticky key to the name of a metadata field that identifies the user or session, such as user_id. Majordomo will hash that field’s value to deterministically assign the same arm every time for the same user. A user who gets the challenger model on their first request will get it on every subsequent request in the experiment.
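The idea behind sticky assignment can be sketched in a few lines. This post only says that Majordomo hashes the sticky field's value, so everything else here is an assumption: the choice of SHA-256, mixing the experiment ID into the hash, and the bucket arithmetic are one common way to do it, not the gateway's documented algorithm.

```python
import hashlib


def sticky_arm(sticky_value: str, experiment_id: str, arms: list[tuple[str, int]]) -> str:
    """Map a sticky key value to an arm deterministically, respecting weights.

    arms is a list of (name, weight) pairs. The same inputs always give the same arm.
    """
    total = sum(weight for _, weight in arms)
    if total <= 0:
        raise ValueError("arms must have a positive total weight")
    # Assumption: salt with the experiment ID so that a user's arm in one
    # experiment does not determine their arm in another.
    digest = hashlib.sha256(f"{experiment_id}:{sticky_value}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for name, weight in arms:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable when all weights are non-negative")


arms = [("control", 80), ("challenger", 20)]
first = sticky_arm("user-42", "exp-1", arms)
second = sticky_arm("user-42", "exp-1", arms)  # same user, same arm
```

Because the arm is a pure function of the key, no assignment table needs to be stored. One consequence of a scheme like this is that changing the weights mid-experiment can move some users to a different arm.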
Arms. Add at least two arms: one control and one or more challengers. Each arm has a name, a provider, a model, and a weight. Weights are relative integers. A control with weight 80 and a challenger with weight 20 means 80% of traffic goes to the control and 20% to the challenger. If you want a 50/50 split, use equal weights.
Mark exactly one arm as the control. This is your baseline: the model you are currently using. The control arm is what you will compare everything else against.
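The arm rules above are easy to check before you activate anything. The sketch below models an arm with the four fields described here plus a control flag, converts relative weights into traffic shares, and flags the two configuration mistakes the text warns about. The class, field names, and the requirement that weights be positive are illustrative assumptions, not the dashboard's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    provider: str
    model: str
    weight: int  # relative integer weight, not a percentage
    is_control: bool = False


def traffic_shares(arms: list[Arm]) -> dict[str, float]:
    """Convert relative weights into the fraction of traffic each arm receives."""
    total = sum(arm.weight for arm in arms)
    return {arm.name: arm.weight / total for arm in arms}


def validate_arms(arms: list[Arm]) -> list[str]:
    """Return a list of configuration problems. An empty list means the arms look valid."""
    problems = []
    if len(arms) < 2:
        problems.append("an experiment needs at least two arms")
    if sum(1 for arm in arms if arm.is_control) != 1:
        problems.append("exactly one arm must be marked as the control")
    if any(arm.weight <= 0 for arm in arms):
        # Assumption: zero or negative weights are treated as invalid.
        problems.append("weights must be positive integers")
    return problems


arms = [
    Arm("control", "openai", "gpt-4o", weight=80, is_control=True),
    Arm("challenger", "openai", "gpt-4o-mini", weight=20),
]
shares = traffic_shares(arms)   # {"control": 0.8, "challenger": 0.2}
problems = validate_arms(arms)  # []
```

Weights of 4 and 1 would produce the same 80/20 split, since only the ratio matters.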
Activating and Monitoring
Once the configuration looks right, activate the experiment. From that point, incoming requests that match your filters will start being routed to arms according to the weights you set.
The experiment detail page updates in real time as traffic flows through. The report section shows a table with one row per arm and columns for request count, average latency, P95 and P99 latency, average cost per request, total cost, and error rate.
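If you export per-request data and want to reproduce the non-percentile columns of that table yourself, the aggregation is straightforward. The record shape below is a hypothetical stand-in for whatever your logs contain. The latency percentiles are left out here to keep the sketch short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    arm: str
    latency_ms: float
    cost_usd: float
    is_error: bool


def summarize(records: list[RequestRecord]) -> dict[str, dict[str, float]]:
    """Group records by arm and compute one report row per arm."""
    by_arm: dict[str, list[RequestRecord]] = defaultdict(list)
    for record in records:
        by_arm[record.arm].append(record)

    report = {}
    for arm, rows in by_arm.items():
        count = len(rows)
        total_cost = sum(row.cost_usd for row in rows)
        report[arm] = {
            "requests": count,
            "avg_latency_ms": sum(row.latency_ms for row in rows) / count,
            "avg_cost_usd": total_cost / count,
            "total_cost_usd": total_cost,
            "error_rate": sum(row.is_error for row in rows) / count,
        }
    return report


# Made-up records, purely to show the shape of the output.
report = summarize([
    RequestRecord("control", 1000.0, 0.003, False),
    RequestRecord("control", 2000.0, 0.005, True),
    RequestRecord("challenger", 800.0, 0.001, False),
])
```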
Watch error rate first. If the challenger arm’s error rate starts climbing above the control’s, pause the experiment and investigate before more traffic is affected. Latency is the second thing to watch. A model that is 2x cheaper but 3x slower may not be an acceptable trade, depending on your use case.
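Cost usually takes care of itself once you see real request volumes accumulating. Compare average cost per request between arms rather than total cost: total cost scales with each arm's share of traffic, so it is only directly comparable when the arms have equal weights. The per-request average already reflects any differences in output length that affect token counts, and it becomes stable once you have a few hundred requests on each arm.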
The latency charts break out P50, P95, and P99 per arm. P50 tells you about the typical case. P95 and P99 tell you about the tail. A model can have a lower P50 than your control but a worse P99, which means most requests are faster but the slow ones are slower. Whether that matters depends on what your application does with those requests.
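A small worked example shows how a model can win on P50 and lose on P99. The percentile function uses the nearest-rank method, which is one common definition and not necessarily the one the dashboard uses. The latency samples are invented for illustration.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented latencies in milliseconds. The challenger is usually faster but has one slow outlier.
control = [900, 950, 1000, 1050, 1100, 1150, 1200, 1300, 1400, 1500]
challenger = [500, 550, 600, 650, 700, 750, 800, 850, 900, 4000]

control_p50 = percentile(control, 50)        # 1100
challenger_p50 = percentile(challenger, 50)  # 700
control_p99 = percentile(control, 99)        # 1500
challenger_p99 = percentile(challenger, 99)  # 4000
```

The challenger's typical request is 400ms faster, yet its slowest request takes more than twice as long as the control's worst case. With only ten samples the tail is a single request, which is why tail percentiles need far more traffic than medians before they mean much.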
A Real Example
You are running a support ticket triage feature on GPT-4o. The feature reads incoming support tickets and assigns them to one of twelve categories. You ran a replay and saw a 94% match rate with GPT-4o mini at 75% lower cost. The match rate was strong enough to be encouraging, but you want live confirmation before flipping the whole feature.
You create an experiment with a metadata filter on feature = triage. Control arm: GPT-4o at weight 90. Challenger arm: GPT-4o mini at weight 10. You use sticky assignment on ticket_id so that if a ticket triggers multiple requests in the same session, it stays on the same model.
You activate the experiment and leave it running for two days.
After 2,000 requests on the control arm and 220 requests on the challenger arm, the results look like this:
Control (GPT-4o): avg latency 1,340ms, P99 latency 4,200ms, avg cost $0.0031, error rate 0.2%.
Challenger (GPT-4o mini): avg latency 890ms, P99 latency 2,100ms, avg cost $0.0007, error rate 0.4%.
The challenger is faster on both average and P99 latency and costs 77% less per request. Its 0.4% error rate works out to roughly one error in 220 requests, against the fraction of an error you would expect at the control's 0.2% rate, so the difference is not statistically meaningful at this sample size. You increase the challenger weight to 50 and run for another three days. The pattern holds.
You complete the experiment and switch the feature to GPT-4o mini.
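The claim that the error-rate gap in this example is noise can be checked with a significance test. The sketch below runs a two-sided Fisher exact test, a standard choice for small counts, implemented with only the standard library. The error counts are inferred from the rounded rates in the example (about 4 errors in 2,000 control requests and about 1 in 220 challenger requests), which is an assumption. Majordomo is not described here as computing this for you.

```python
from math import comb


def fisher_exact_two_sided(a_errors: int, a_total: int, b_errors: int, b_total: int) -> float:
    """Two-sided Fisher exact test p-value for error counts in two arms."""
    total_errors = a_errors + b_errors
    denominator = comb(a_total + b_total, total_errors)

    def prob(k: int) -> float:
        # Probability that exactly k of the errors land in arm A, given the margins.
        return comb(a_total, k) * comb(b_total, total_errors - k) / denominator

    observed = prob(a_errors)
    lowest = max(0, total_errors - b_total)
    highest = min(a_total, total_errors)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        p = prob(k)
        # Sum every outcome at least as unlikely as the observed one
        # (small tolerance guards against floating-point ties).
        if p <= observed * (1 + 1e-9):
            p_value += p
    return min(p_value, 1.0)


# Counts inferred from the example's rounded rates (assumption).
p_value = fisher_exact_two_sided(a_errors=4, a_total=2000, b_errors=1, b_total=220)
not_significant = p_value > 0.05
```

With these counts the p-value comes out far above the conventional 0.05 threshold, which matches the conclusion in the example: one error in 220 requests says very little either way, and more traffic is what settles it.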
When to Replay Versus When to Experiment
These two features are not alternatives. They cover different parts of the validation process.
Replay is the first step. It is fast, cheap, and completely offline. You can run a replay in an afternoon and get a strong signal about whether a model is worth testing further. If the match rate is below 80%, you have your answer without having put any production traffic at risk. If the match rate is high, you have a quantified upside to motivate the experiment.
Experiments are the second step. They confirm what Replay suggested and give you live performance data under current conditions. They also serve as a gradual rollout mechanism: instead of switching a feature cold, you can ramp from 10% to 50% to 100% as confidence builds.
The workflow is: replay to filter candidates, experiment to validate the winner, then switch.
Keeping the Blast Radius Small
A few practical notes on running experiments safely.
Start with a small challenger weight. 10% is enough to get meaningful data within a day or two for most features, and it limits the number of users exposed to an unproven model. Only increase the weight once the early data looks good.
Filter aggressively at first. Running an experiment on 100% of your traffic is rarely the right call for the first test of a new model. Use metadata filters to scope it to a single feature, a specific team’s API key, or a particular request type. You get cleaner signal and smaller risk.
Watch the error rate actively in the first few hours. A sharp spike in errors on the challenger arm is a signal to pause immediately. The experiment can always be restarted. A wave of bad responses to real users is harder to undo.
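If you would rather not rely on someone watching a dashboard, a simple guard rule can be scripted against exported per-arm counts. This is a hypothetical sketch: the thresholds are arbitrary starting points, and acting on the result (pausing the experiment) is left to whatever process you already use.

```python
def should_pause(
    control_error_rate: float,
    challenger_errors: int,
    challenger_requests: int,
    min_requests: int = 30,
    ratio: float = 3.0,
    floor: float = 0.02,
) -> bool:
    """Flag the challenger when its error rate is well above the control's.

    All thresholds are illustrative assumptions:
    - min_requests: avoid reacting to the first handful of requests
    - ratio: how many times the control rate counts as a spike
    - floor: ignore spikes below this absolute error rate
    """
    if challenger_requests < min_requests:
        return False
    challenger_rate = challenger_errors / challenger_requests
    return challenger_rate > max(floor, ratio * control_error_rate)


quiet = should_pause(0.002, challenger_errors=1, challenger_requests=220)
spike = should_pause(0.002, challenger_errors=12, challenger_requests=200)
too_early = should_pause(0.002, challenger_errors=5, challenger_requests=20)
```

The minimum request count trades speed for fewer false alarms. Lower it if a few bad responses are costly for your feature.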
Experiments are available in Majordomo Cloud and in the self-hosted gateway running version 1.4 or later. Replay is covered in How to Test a Model Switch Without Breaking Production. The getting started guide covers initial setup.