How to Test a Model Switch Without Breaking Production

Vivek Vaidya
llm replay model-evaluation cost-optimization

A new model drops. It’s cheaper. Faster. The benchmarks look good.

You don’t switch.

You can’t. You have no idea how your actual prompts will behave on this new model. Your document classification pipeline has been running on Claude Sonnet for months. It works. You don’t know if it would still work on Haiku, or on GPT-4o mini, or on whatever just came out. And you definitely can’t find out by reading the provider’s benchmark page.

So you do nothing. The cost compounds. The opportunity sits there, untouched.

This is one of the most common stuck points for teams running AI in production. The model landscape moves fast and cheaper options appear regularly, but switching feels like a gamble on something you can’t measure. The result: most teams overpay for model capability they don’t need, because the cost of uncertainty is higher than the cost of the model.

There’s a better way.

The Problem With Benchmarks

When OpenAI publishes benchmark scores for GPT-4o mini, or Anthropic publishes Haiku’s performance against MMLU, those numbers are real. They’re just not your numbers.

Your prompts are not MMLU questions. Your outputs have specific structure requirements, edge cases, and failure modes that no public benchmark captures. A model can score 85% on a standard benchmark and fail completely on your particular task. Another model can score lower on benchmarks and handle your use case perfectly.

The only way to know how a model performs on your task is to run your task on the model. Not a sample of synthetic tasks. Your actual production requests.

What Replay Does

Majordomo’s Replay feature does exactly this. It takes a set of requests from your production history, replays each one against a target model, and compares the results.

The comparison has two layers:

Exact match. If the replay response is character-for-character identical to the original, that’s an exact match. For structured outputs like JSON classification labels, extraction results, or formatted data, exact match rate is a meaningful signal. If Haiku agrees with Sonnet 97% of the time on the same extraction task, that’s strong evidence they’re interchangeable for that task.

LLM judge equivalence. For free-form outputs like summaries, explanations, or chat responses, exact match is too strict. A replay response can be semantically equivalent without being identical. When you configure a judge model, it evaluates each pair of responses and determines whether they’re functionally equivalent, then gives you a reason when they’re not.

Together, these give you a match rate: the percentage of requests where the cheaper model produces an equivalent result. Combined with the cost delta and latency comparison, you have everything you need to make the decision.
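
For intuition, here’s a minimal sketch of that two-layer comparison. This is illustrative, not Majordomo’s implementation; `judge` stands in for whatever client you use to call the judge model.

```python
# Illustrative sketch of the two-layer comparison, not Majordomo's code.
# `judge` is any callable that sends the prompt below to an LLM and
# returns its raw text reply.

JUDGE_PROMPT = (
    "You are comparing two model responses to the same request.\n"
    "Decide whether they are functionally equivalent: same meaning, same\n"
    "decision, same extracted data. Formatting differences alone do not\n"
    "count as divergence.\n"
    "Answer EQUIVALENT or DIVERGENT on the first line, then give a\n"
    "one-line reason."
)

def compare(original: str, replay: str, judge=None):
    """Classify one original/replay pair as exact, equivalent, or divergent."""
    if original == replay:               # layer 1: character-for-character
        return "exact_match", None
    if judge is None:                    # no judge configured: strict mode
        return "divergent", "differs and no judge configured"
    raw = judge(JUDGE_PROMPT, original, replay)   # layer 2: LLM judge
    verdict, _, reason = raw.partition("\n")
    if verdict.strip().upper().startswith("EQUIVALENT"):
        return "judge_equivalent", None
    return "divergent", reason.strip() or verdict.strip()
```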

Running a Replay

In the Majordomo dashboard, start a new replay from the Replay section. You configure it in two parts:

Source filters. Which requests to replay. You can filter by API key, by model, by any metadata dimension you’ve been tagging (feature, team, environment), and by a request count limit. If you’ve been tagging your document classification feature with X-Majordomo-Feature: document-classification, you can replay exactly that feature’s traffic. (A tagging sketch follows this list.)

Target model. The provider and model you want to test. This is the model you’re considering switching to.
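
As an aside, the tags the source filters match on are just headers set on the original production requests as they pass through the gateway. A hedged sketch of what that looks like, assuming an OpenAI-compatible chat endpoint; the URL, model id, and payload shape here are illustrative:

```python
import requests

# Illustrative: the gateway URL, model name, and payload shape are
# assumptions. The X-Majordomo-Feature header is the metadata tag the
# replay's source filters can match on.
resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "X-Majordomo-Feature": "document-classification",
    },
    json={
        "model": "claude-sonnet",  # placeholder model id
        "messages": [
            {"role": "user", "content": "Classify this document: ..."}
        ],
    },
    timeout=60,
)
print(resp.json())
```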

Optionally, you enable the LLM judge and pick which model to use as the judge. For most use cases, a cheap, fast model works well as the judge. You’re not asking it to be smarter than your production model; you’re asking it to compare two responses and decide if they mean the same thing.

Hit start. The replay runs asynchronously. For 50 requests, it typically completes in a few minutes.
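
If it helps to see the configuration as data, here’s roughly what a replay definition amounts to. The field names are illustrative assumptions, not Majordomo’s API; the dashboard form captures the same information:

```python
# Illustrative shape only; field names are assumptions, not Majordomo's API.
replay_config = {
    "filters": {                       # source filters: what to replay
        "feature": "document-classification",  # metadata tag
        "model": "claude-sonnet",              # original production model
        "limit": 50,                           # request count cap
    },
    "target": {                        # the model you're considering
        "provider": "anthropic",
        "model": "claude-haiku",
    },
    "judge": {                         # optional LLM judge
        "enabled": True,
        "model": "gpt-4o-mini",        # cheap, fast judge works well
    },
}
```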

Reading the Results

The results page shows four summary numbers (a rough sketch of the rollup follows the list):

Match rate. Exact matches plus judge-equivalent responses, as a percentage of total requests. This is the headline number. If you’re evaluating a model switch for a specific feature, this tells you whether the cheaper model can handle the work.

Cost savings. The difference between what those requests actually cost at the original model’s pricing versus what they would have cost at the target model’s pricing. Real numbers, not estimates.

Latency delta. Average response time comparison. Cheaper models are often faster too, so a switch can improve the user experience as well as the bill.

Divergent count. The requests where the models disagreed. This is where the work is.
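
To make the rollup concrete, here’s a rough sketch of how those four numbers fall out of per-request results. The record shape is hypothetical:

```python
def summarize(results):
    """Roll per-request replay records up into the four summary numbers.

    Each record is assumed to carry: 'verdict' ('exact_match',
    'judge_equivalent', or 'divergent'), 'orig_cost', 'replay_cost',
    'orig_latency_ms', 'replay_latency_ms'. This shape is hypothetical.
    """
    n = len(results)
    matched = sum(
        r["verdict"] in ("exact_match", "judge_equivalent") for r in results
    )

    def avg(key):
        return sum(r[key] for r in results) / n

    orig_cost = sum(r["orig_cost"] for r in results)
    replay_cost = sum(r["replay_cost"] for r in results)
    return {
        "match_rate": matched / n,                    # the headline number
        "cost_savings": 1 - replay_cost / orig_cost,  # fraction saved
        "latency_delta_ms": avg("replay_latency_ms") - avg("orig_latency_ms"),
        "divergent_count": n - matched,               # where the work is
    }
```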

Drill into any request row to see the side-by-side comparison: the original prompt, the original response, and the replay response. When a judge marks two responses as divergent, you see the judge’s reasoning. That reasoning tells you whether the divergence is meaningful (the cheaper model is giving wrong answers) or cosmetic (the cheaper model is phrasing the same answer differently).

The divergent cases are the most valuable output. You’re not just getting a yes/no answer on the model switch. You’re getting a view into exactly where and how the cheaper model behaves differently, which tells you whether those differences matter for your use case.

A Real Example

Say you’ve been running a document classification feature on Claude Sonnet for six months. The feature classifies incoming documents into one of eight categories. It’s been running well, but it costs more than you’d like.

You run a replay of the last 50 production requests against Claude Haiku.

Results: 98% match rate (47 of 50 exact match, 2 judge-equivalent, 1 divergent). Cost savings: 78%. Latency improvement: 40%.

You drill into the one divergent case. The original Sonnet response classified a document as LEGAL_NOTICE. Haiku classified it as CORRESPONDENCE. The judge explains: the document had characteristics of both categories and Sonnet and Haiku made different judgment calls. You look at the actual document. Either classification would be acceptable in your system.

Effective match rate: 100%. You switch.

That analysis took 20 minutes. The cost reduction is permanent and applies to every request from that feature going forward.

What to Do With Low Match Rates

A replay that comes back at a 70% match rate isn’t telling you the cheaper model is bad. It’s telling you precisely what you’d need to fix.

Work through the divergent cases. Are the failures concentrated in specific types of requests? Is the cheaper model consistently mishandling a specific edge case that you could handle with a prompt adjustment? Is it a prompting issue rather than a capability issue?
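
One way to answer those questions is to bucket the divergent cases before reading them one by one. A sketch, assuming each case carries the judge’s reason plus a category label (both keys are hypothetical):

```python
from collections import Counter

def triage(divergent_cases):
    """Group divergences to spot concentrated failure modes.

    Assumes each case is a dict with hypothetical keys 'category'
    (e.g. the label the original model produced) and 'judge_reason'
    (the judge's explanation for the divergence).
    """
    by_category = Counter(c["category"] for c in divergent_cases)
    # Crude keyword pass over judge reasons to surface recurring themes
    themes = ("format", "truncat", "refus", "wrong", "missing", "extra")
    by_theme = Counter(
        kw
        for c in divergent_cases
        for kw in themes
        if kw in c["judge_reason"].lower()
    )
    return by_category.most_common(), by_theme.most_common()
```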

Sometimes a small prompt change brings a cheaper model from 70% to 95% match rate. The prompt that works well for Sonnet isn’t always the optimal prompt for Haiku. The replay results give you the specific failure cases to work with.
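
What that adjustment looks like varies by task, but with smaller models it’s usually about spelling out what the bigger model inferred from context. A purely illustrative before/after for the classification example; the category list beyond LEGAL_NOTICE and CORRESPONDENCE is invented:

```python
# Purely illustrative. Smaller models often need the output contract made
# explicit where a larger model inferred it on its own.

PROMPT_BEFORE = "Classify this document into one of our eight categories."

PROMPT_AFTER = (
    "Classify this document into exactly one of these categories:\n"
    # Hypothetical list; only LEGAL_NOTICE and CORRESPONDENCE appear above.
    "INVOICE, CONTRACT, LEGAL_NOTICE, CORRESPONDENCE, REPORT, RECEIPT,\n"
    "POLICY, OTHER.\n"
    "Respond with the category name only, in uppercase, with no extra text."
)
```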

Sometimes the match rate is genuinely too low and the model isn’t right for the task. That’s a useful answer too. You’ve ruled out the cheaper option without deploying it to production and finding out the hard way.

The Broader Pattern

The point of Replay isn’t just to make one model decision. It’s to give you a repeatable process for keeping model choices current as the landscape changes.

New models drop every few months. Prices change. Provider capabilities shift. Without a way to test these changes against your actual production traffic, the natural default is inertia. You stick with what works because the cost of switching feels uncertain.

With Replay, that cost becomes knowable. You run the test, you read the results, you make the call. The model landscape becomes an optimization opportunity instead of a source of anxiety.

This is what an AI control plane is supposed to do. Not just show you what’s happening, but give you the tools to act on it with confidence.


The Replay feature is available in Majordomo Cloud and in the self-hosted gateway. The getting started guide covers setup. If you’re already running the gateway and logging requests, you can run your first replay today.