Building Cascades and Guardrails into Your LLM Agent: From Demo to Production

In the previous post I argued that every production LLM feature needs two layers around the model call: cascades for reliability and guardrails for governance. The closing promise was that adding them is smaller than the YAML in that post made it look. This post pays that off.

I’ll take the same customer support triage agent (classify, plan retrieval, fetch similar tickets, draft a response) from naive to cascaded, guarded, and logged. Twice. Once with pydantic-ai, once with majordomo-llm. Same agent, same layers, two idioms.

The two versions are not in competition. They’re in different positions on the same axis. pydantic-ai gives you an opinionated agent framework with batteries included; majordomo-llm gives you primitives and gets out of your way. The point of showing both is that the layers exist independent of which framework you pick.

The contract

Before any code, the contract. Three Pydantic models, one function signature. Both versions will satisfy this.

from pydantic import BaseModel
from typing import Literal

class Classification(BaseModel):
    category: Literal["billing", "technical", "account", "other"]
    severity: Literal["low", "medium", "high"]
    summary: str

class RetrievalPlan(BaseModel):
    queries: list[str]   # 1-3 search queries against the past-tickets index

class Draft(BaseModel):
    body: str
    suggested_actions: list[str]
    confidence: Literal["low", "medium", "high"]

async def triage(thread_text: str) -> Draft: ...

The retrieval tool is whatever your team already has. For the post I’ll stub it:

async def fetch_similar_tickets(queries: list[str]) -> list[str]:
    return ["[ticket #4412] ... resolved by ...", "[ticket #5901] ..."]  # placeholder

PII redaction, PII detection, and prompt-injection detection are also stubbed. In production, swap them for Presidio or a GLiNER NER model for PII, and a prompt-injection classifier from Hugging Face. The point of this post is the layer structure, not the choice of classifier.

def redact_pii(text: str) -> str: ...      # returns text with SSNs, cards, etc. masked
def contains_pii(text: str) -> bool: ...   # boolean for output egress check
def is_prompt_injection(text: str) -> bool: ...
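
If you want to run the rest of the post end to end before wiring in a real detector, here is a minimal regex-based sketch of those three stubs. The patterns and the marker phrases are my own placeholder assumptions, not anything Presidio or GLiNER would produce, and they will miss most real-world PII and injection attempts.

```python
import re

# Placeholder patterns, assumed for illustration only.
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
_CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13-16 digits, optional separators
_INJECTION_MARKERS = (
    "ignore previous instructions",
    "disregard the system prompt",
)

def redact_pii(text: str) -> str:
    """Mask SSN-shaped and card-shaped substrings."""
    text = _SSN.sub("[SSN]", text)
    return _CARD.sub("[CARD]", text)

def contains_pii(text: str) -> bool:
    """True if any SSN-shaped or card-shaped substring is present."""
    return bool(_SSN.search(text) or _CARD.search(text))

def is_prompt_injection(text: str) -> bool:
    """Naive phrase match; a real classifier replaces this."""
    lowered = text.lower()
    return any(marker in lowered for marker in _INJECTION_MARKERS)
```

The signatures match the stubs above, so swapping in a real detector later does not touch any calling code.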

Version A: pydantic-ai

Stage 0, naive

One Agent, one model, one fallback-free model call. This is what most teams ship first.

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel

agent = Agent(
    OpenAIChatModel("gpt-5-4"),
    output_type=Draft,
    system_prompt="You are a customer support triage assistant...",
    tools=[fetch_similar_tickets],
)

async def triage(thread_text: str) -> Draft:
    result = await agent.run(thread_text)
    return result.output

The agent decides when to call the tool. Classification and retrieval planning happen implicitly inside the model’s reasoning. Structured output is enforced by pydantic-ai: by default the model returns the Draft through a schema-constrained tool call, and pydantic-ai validates the result against the Pydantic model.

Stage 1, add the cascade

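FallbackModel is built into pydantic-ai. It wraps an ordered list of models and moves to the next one when the current model raises an API error such as a 4xx/5xx response. That is the default trigger; the fallback_on argument changes it.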

from pydantic_ai.models.fallback import FallbackModel
from pydantic_ai.models.anthropic import AnthropicModel
from pydantic_ai.models.openai import OpenAIChatModel

model = FallbackModel(
    AnthropicModel("claude-sonnet-4-6"),    # primary
    OpenAIChatModel("gpt-5-4"),             # peer fallback
    AnthropicModel("claude-haiku-4-5"),     # degraded fallback
)

agent = Agent(model, output_type=Draft, system_prompt=..., tools=[fetch_similar_tickets])

Two lines changed. The cascade is per-agent, not per-step, which is the pydantic-ai idiom. If you want different cascade policies for classification vs. drafting, you’d need to split into multiple agents. We don’t here, and the closing comparison flags that tradeoff.
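
If the cascade feels like magic, the underlying idea fits in a few lines. This is a stdlib-only sketch of the concept, not pydantic-ai’s implementation: run_cascade, ProviderError, and the two fake providers are hypothetical names I made up for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

class ProviderError(Exception):
    """Stand-in for a provider-side failure (rate limit, 5xx, timeout)."""

Provider = Callable[[str], Awaitable[str]]

async def run_cascade(
    providers: Sequence[tuple[str, Provider]], prompt: str
) -> tuple[str, int]:
    """Try providers in order; return (response, index of the one that answered)."""
    errors: list[str] = []
    for index, (name, call) in enumerate(providers):
        try:
            return await call(prompt), index
        except ProviderError as exc:
            # Only provider failures trigger a fallback; other bugs propagate.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

async def flaky_primary(prompt: str) -> str:
    raise ProviderError("503 from primary")

async def healthy_peer(prompt: str) -> str:
    return f"draft for: {prompt}"

if __name__ == "__main__":
    cascade = [("primary", flaky_primary), ("peer", healthy_peer)]
    print(asyncio.run(run_cascade(cascade, "refund request")))
```

The returned index is the useful part: an index above zero means a fallback answered, which is the number you want in your logs.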

Stage 2, add guardrails

Input guardrail: a history_processor runs on every model request and redacts PII before the request leaves your process. Output guardrail: an @agent.output_validator runs after the model returns and can raise ModelRetry to force a regeneration.

from pydantic_ai import ModelRetry, RunContext
from pydantic_ai.messages import ModelMessage, UserPromptPart

async def input_guardrail(messages: list[ModelMessage]) -> list[ModelMessage]:
    for msg in messages:
        for part in msg.parts:
            # content can also be a list of multimodal items; only plain text is scanned here
            if isinstance(part, UserPromptPart) and isinstance(part.content, str):
                if is_prompt_injection(part.content):
                    raise ValueError("Blocked: prompt injection detected")
                part.content = redact_pii(part.content)
    return messages

agent = Agent(
    model,
    output_type=Draft,
    system_prompt=...,
    tools=[fetch_similar_tickets],
    history_processors=[input_guardrail],
)

@agent.output_validator
def output_guardrail(ctx: RunContext, draft: Draft) -> Draft:
    if contains_pii(draft.body):
        raise ModelRetry("Draft contains PII. Rewrite without any PII.")
    return draft

Redaction silently mutates the input. Prompt injection raises a hard error. Output PII triggers a retry. Three actions, three different behaviors, exactly as the previous post prescribed.
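
The retry behavior is the least obvious of the three, so here is a rough sketch of what a bounded validate-and-regenerate loop looks like. It is an illustration under my own assumptions, not pydantic-ai’s internals: RetryRequested stands in for ModelRetry, and generate_with_validation, fake_generate, and no_ssn are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

class RetryRequested(Exception):
    """Stand-in for ModelRetry: carries feedback for the next attempt."""

async def generate_with_validation(
    generate: Callable[[str], Awaitable[str]],
    validate: Callable[[str], str],
    prompt: str,
    max_retries: int = 2,
) -> str:
    feedback = ""
    for _ in range(max_retries + 1):
        candidate = await generate(prompt + feedback)
        try:
            return validate(candidate)
        except RetryRequested as exc:
            # Feed the validator's message back so the next attempt can correct itself.
            feedback = f"\n\nPrevious attempt rejected: {exc}"
    raise RuntimeError("output still failing validation after retries")

attempts: list[str] = []

async def fake_generate(prompt: str) -> str:
    # First attempt leaks an SSN; later attempts are clean.
    attempts.append(prompt)
    if len(attempts) == 1:
        return "Your SSN 123-45-6789 is on file."
    return "Your account is verified."

def no_ssn(text: str) -> str:
    if "123-45-6789" in text:
        raise RetryRequested("Draft contains PII. Rewrite without any PII.")
    return text
```

The cap matters. Without it, an output guardrail that the model cannot satisfy turns into an unbounded spend loop instead of a visible failure.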

Stage 3, add logging

pydantic-ai emits OpenTelemetry spans. Logfire is the first-party sink; any OTel collector works.

import logfire

logfire.configure()
logfire.instrument_pydantic_ai()

agent = Agent(model, output_type=Draft, system_prompt=..., tools=[fetch_similar_tickets],
              history_processors=[input_guardrail], instrument=True)

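Three lines. Because this rebuilds the agent, re-attach the @agent.output_validator from Stage 2 to the new instance; validators are registered per agent object. You now have prompt, response, tokens, latency, tool calls, and fallback events for every request, queryable in Logfire or shippable to whatever OTel backend you already run.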

Final pydantic-ai version: ~40 lines of agent code.

Version B: majordomo-llm

Stage 0, naive

Three explicit steps, one model each, no fallback.

from majordomo_llm import get_llm_instance

classifier = get_llm_instance("openai", "gpt-5-4")
retrieval_planner = get_llm_instance("openai", "gpt-5-4-mini")
drafter = get_llm_instance("openai", "gpt-5-4")

async def triage(thread_text: str) -> Draft:
    classification = (await classifier.get_structured_json_response(
        response_model=Classification, user_prompt=thread_text,
        system_prompt="Classify this support thread.",
    )).content

    plan = (await retrieval_planner.get_structured_json_response(
        response_model=RetrievalPlan,
        user_prompt=f"Thread: {thread_text}\nClassification: {classification.model_dump_json()}",
        system_prompt="Plan 1-3 search queries to find similar past tickets.",
    )).content

    similar = await fetch_similar_tickets(plan.queries)

    return (await drafter.get_structured_json_response(
        response_model=Draft,
        user_prompt=f"Thread: {thread_text}\nClass: {classification}\nSimilar: {similar}",
        system_prompt="Draft a response for a human agent to send.",
    )).content

More lines than pydantic-ai, because orchestration is explicit. You can see exactly where each model call happens.

Stage 1, add the cascade

Swap each get_llm_instance for an LLMCascade. One cascade per step, each tuned for the job.

from majordomo_llm import LLMCascade

classifier = LLMCascade([
    ("anthropic", "claude-sonnet-4-6"),
    ("openai",    "gpt-5-4"),
    ("anthropic", "claude-haiku-4-5"),
])

retrieval_planner = LLMCascade([
    ("anthropic", "claude-haiku-4-5"),
    ("openai",    "gpt-5-4-mini"),
])

drafter = LLMCascade([
    ("anthropic", "claude-sonnet-4-6"),
    ("openai",    "gpt-5-4"),
])

The triage function body doesn’t change. LLMCascade is interface-compatible with LLM. Only the classifier gets a degraded-fallback step (haiku). The drafter stays on premium models, and the retrieval planner has a single peer fallback and nothing below it, because if the planner can’t run we’d rather fail loud than draft from no context.

Stage 2, add guardrails

majordomo-llm ships a hook pipeline. Hooks attach to any LLM or cascade via hook_pipeline=. Each hook returns one of four outcomes: pass, warn, redact, or block. The last three map directly to the three actions from the previous post (alert/warn, redact, block), and pass is the no-op. The pipeline’s on_verdicts callback gives you the near-miss corpus, including pass-throughs.

from majordomo_llm.hooks import (
    HookPipeline, RegexHook, HookOutcome, HookContext,
)

# Built-in regex hook for SSN. Redact on input, block on output.
# \b anchors keep the pattern from matching inside longer digit runs (order IDs, phone numbers).
ssn_in  = RegexHook(name="ssn_in",  pattern=r"\b\d{3}-\d{2}-\d{4}\b",
                    phase="before", action="redact", redaction="[SSN]")
ssn_out = RegexHook(name="ssn_out", pattern=r"\b\d{3}-\d{2}-\d{4}\b",
                    phase="after",  action="block")

# Custom hook: same protocol, swap the placeholder for a real classifier.
class PromptInjectionHook:
    name = "prompt_injection"

    async def before_call(self, prompt: str, ctx: HookContext) -> HookOutcome:
        if is_prompt_injection(prompt):
            return HookOutcome.block(self.name, "suspected prompt injection")
        return HookOutcome.pass_through(self.name)

    async def after_call(self, prompt: str, response: str, ctx: HookContext) -> HookOutcome:
        return HookOutcome.pass_through(self.name)

# Record every verdict, even pass-throughs. This is the near-miss corpus
# the previous post said was the most valuable artifact a guardrail layer produces.
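# audit_store is a stand-in for whatever async store you already run; it is not defined in this post.
async def record_verdicts(request_id, verdicts):
    await audit_store.insert(request_id, verdicts)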

pipeline = HookPipeline(
    [PromptInjectionHook(), ssn_in, ssn_out],
    on_verdicts=record_verdicts,
)

classifier        = LLMCascade([...], hook_pipeline=pipeline)
retrieval_planner = LLMCascade([...], hook_pipeline=pipeline)
drafter           = LLMCascade([...], hook_pipeline=pipeline)

The triage function body doesn’t change. The pipeline fires before and after every cascade call (once per step, not once per provider attempt within the cascade), and a block verdict raises HookBlocked. Exceptions raised inside a buggy hook are caught by the pipeline and treated as pass-throughs, so a guardrail bug can’t take down the agent.
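
To make those semantics concrete, here is a stdlib-only sketch of the "before" half of such a pipeline. It is my own illustration of the behavior described above, not majordomo-llm’s code: Verdict, Blocked, run_before_hooks, and the three sample hooks are hypothetical names.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class Verdict:
    hook: str
    outcome: str      # "pass", "warn", "redact", or "block"
    text: str         # the prompt, rewritten if outcome is "redact"
    reason: str = ""

class Blocked(Exception):
    """Raised on a block verdict; keeps the verdicts collected so far."""
    def __init__(self, verdicts: list[Verdict]):
        super().__init__(verdicts[-1].reason)
        self.verdicts = verdicts

Hook = Callable[[str], Awaitable[Verdict]]

async def run_before_hooks(
    hooks: list[tuple[str, Hook]], prompt: str
) -> tuple[str, list[Verdict]]:
    verdicts: list[Verdict] = []
    for name, hook in hooks:
        try:
            verdict = await hook(prompt)
        except Exception as exc:
            # Fail open: a crashing hook is recorded as a pass-through.
            verdict = Verdict(name, "pass", prompt, f"hook error: {exc!r}")
        verdicts.append(verdict)
        if verdict.outcome == "block":
            raise Blocked(verdicts)
        if verdict.outcome == "redact":
            prompt = verdict.text  # later hooks see the redacted prompt
    return prompt, verdicts

async def redact_ssn(prompt: str) -> Verdict:
    cleaned = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", prompt)
    return Verdict("ssn_in", "redact" if cleaned != prompt else "pass", cleaned)

async def buggy(prompt: str) -> Verdict:
    raise ZeroDivisionError("bug in hook")

async def block_injection(prompt: str) -> Verdict:
    if "ignore previous instructions" in prompt.lower():
        return Verdict("prompt_injection", "block", prompt, "suspected prompt injection")
    return Verdict("prompt_injection", "pass", prompt)
```

Failing open on hook errors is a deliberate availability choice. For a guardrail whose miss is worse than an outage, you would want that branch to block instead.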

For the LLM-as-judge escape hatch from the previous post (“genuinely ambiguous cases where no classifier exists”), there’s LLMJudgeHook in the same package; pass it a callable that returns {"verdict": "pass"|"fail", "reason": ...} JSON. The library defaults to pass-through on judge timeout or parse error, so an unavailable judge can’t block a real call.
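
The fail-open default is easy to get wrong when you write your own judge wrapper, so here is a minimal sketch of the pattern. It assumes only the JSON shape quoted above; judge_verdict and the three fake judges are hypothetical names, and this is not LLMJudgeHook’s implementation.

```python
import asyncio
import json
from typing import Awaitable, Callable

async def judge_verdict(
    judge: Callable[[str], Awaitable[str]], text: str, timeout_s: float = 2.0
) -> str:
    """Return "pass" or "fail"; any timeout or malformed reply counts as "pass"."""
    try:
        raw = await asyncio.wait_for(judge(text), timeout=timeout_s)
        verdict = json.loads(raw)["verdict"]
    except (asyncio.TimeoutError, json.JSONDecodeError, KeyError, TypeError):
        return "pass"
    return verdict if verdict in ("pass", "fail") else "pass"

async def strict_judge(text: str) -> str:
    return json.dumps({"verdict": "fail", "reason": "off-policy promise"})

async def garbled_judge(text: str) -> str:
    return "I think this is fine?"  # not JSON

async def slow_judge(text: str) -> str:
    await asyncio.sleep(1)
    return json.dumps({"verdict": "fail", "reason": "too late"})
```

Only an explicit, well-formed "fail" blocks. Whether that is the right default depends on what the judge guards; treat it as a choice to revisit per hook, not a universal rule.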

Stage 3, add logging

Wrap each cascade in LoggingLLM. Fire-and-forget logging to SQLite + file storage.

from majordomo_llm.logging import LoggingLLM
from majordomo_llm.logging.adapters import SqliteAdapter, FileStorageAdapter

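# Run this inside your async startup code; a bare top-level await only works in a REPL or notebook.
db = await SqliteAdapter.create(path="./triage_logs.db")
storage = await FileStorageAdapter.create(root="./triage_payloads")

# The *_cascade names are the Stage 2 cascades, renamed so the wrappers can take over the step names.
classifier        = LoggingLLM(classifier_cascade,        db, storage)
retrieval_planner = LoggingLLM(retrieval_planner_cascade, db, storage)
drafter           = LoggingLLM(drafter_cascade,           db, storage)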

Now the observability payoff. The two questions the previous post said you must be able to answer:

-- fallback rate per step, last 24h
SELECT step_name, AVG(CASE WHEN cascade_step_index > 0 THEN 1.0 ELSE 0 END) AS fallback_rate
FROM llm_requests
WHERE created_at > datetime('now', '-1 day')
GROUP BY step_name;

-- cost per step, last 24h
SELECT step_name, SUM(total_cost) AS spend
FROM llm_requests
WHERE created_at > datetime('now', '-1 day')
GROUP BY step_name;

Two queries. The classifier falling back at 40% while the drafter holds steady tells you exactly where to look.
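
To try the two queries without a live deployment, you can run them against an in-memory SQLite table. The schema here is an assumption reconstructed from the column names in the queries above; the table the library actually creates may have more columns or different types, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns the two queries touch.
conn.execute(
    """CREATE TABLE llm_requests (
           step_name TEXT,
           cascade_step_index INTEGER,
           total_cost REAL,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)
sample_rows = [
    ("classifier", 0, 0.004), ("classifier", 1, 0.006),
    ("classifier", 0, 0.004), ("classifier", 2, 0.001),
    ("drafter", 0, 0.020), ("drafter", 0, 0.020),
]
conn.executemany(
    "INSERT INTO llm_requests (step_name, cascade_step_index, total_cost) VALUES (?, ?, ?)",
    sample_rows,
)

# Share of requests per step that were answered by anything but the primary.
fallback_rates = dict(conn.execute(
    """SELECT step_name,
              AVG(CASE WHEN cascade_step_index > 0 THEN 1.0 ELSE 0 END)
       FROM llm_requests
       WHERE created_at > datetime('now', '-1 day')
       GROUP BY step_name"""
))

# Total spend per step over the same window.
spend = dict(conn.execute(
    """SELECT step_name, SUM(total_cost)
       FROM llm_requests
       WHERE created_at > datetime('now', '-1 day')
       GROUP BY step_name"""
))
```

With this sample data the classifier falls back on half its calls while the drafter never does, which is the kind of gap the first query is meant to surface.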

Final majordomo-llm version: ~70 lines of agent code.

How to choose

pydantic-ai and majordomo-llm are not the same tool with different syntax. The tradeoffs are real.

Use pydantic-ai when the agent’s logic is best expressed as “give the model some tools and let it figure out the orchestration.” FallbackModel is one policy for the whole agent, which is fine when the agent is one logical thing. instrument=True is a single switch. The lines-of-code count is hard to beat when the framework’s idiom matches your problem.

Use majordomo-llm when you want per-step control: a cheap cascade for classification, a premium cascade for drafting, no cascade at all for the step you’d rather fail loud on. When you want a hook pipeline you can attach the same way to any cascade, with redact/block/warn actions and a built-in audit channel for near-misses. When you want logging in a queryable SQL store rather than an OTel backend.

You can also use both. pydantic-ai for the agent loop, majordomo-llm under the hood for cascades and hooks (via a custom Model adapter). That’s a longer post; the short answer is the layers are framework-agnostic, and the version that’s easier to read in your codebase is usually the right one.

What didn’t go in this post

Three things the previous post mentioned that the code above gestures at but doesn’t fully build out, because each is a post of its own:

A non-engineer-editable guardrails config. Both versions hardcode the guardrail functions. A real ops/legal-editable layer is a configuration UI on top of a versioned policy file. Worth building, longer than 100 lines.
Alerts on fallback. Both versions log fallbacks. Neither pages. Hook a Slack webhook into the OTel pipeline (pydantic-ai) or an on_verdicts callback (majordomo-llm).
Production-grade PII detection. The stubs in this post are placeholders. Swap them for Presidio or a GLiNER model, plus a prompt-injection classifier. The structure of the guardrails layer doesn’t change.
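
For the alerting item, the core logic is a sliding window and a threshold. Here is a small sketch of that piece, independent of either library. FallbackAlert is a hypothetical class, the window and threshold values are arbitrary, and notify is whatever callable you supply (a function that posts to your Slack webhook, for example).

```python
from collections import deque
from typing import Callable

class FallbackAlert:
    """Fire notify once when the fallback rate over the last N calls crosses a threshold."""

    def __init__(self, notify: Callable[[str], None], window: int = 20, threshold: float = 0.3):
        self._recent: deque[bool] = deque(maxlen=window)
        self._notify = notify
        self._threshold = threshold
        self._firing = False

    def record(self, step_name: str, cascade_step_index: int) -> None:
        # Index 0 is the primary; anything above it was a fallback.
        self._recent.append(cascade_step_index > 0)
        if len(self._recent) < self._recent.maxlen:
            return  # wait for a full window before judging
        rate = sum(self._recent) / len(self._recent)
        if rate >= self._threshold and not self._firing:
            self._firing = True
            self._notify(
                f"{step_name}: fallback rate {rate:.0%} over last {len(self._recent)} calls"
            )
        elif rate < self._threshold:
            self._firing = False  # re-arm once the rate recovers
```

Use one instance per step so a noisy classifier does not mask a healthy drafter. The firing flag keeps a sustained outage from paging on every request.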

Getting the structure right is the hard part, and the structure is small. Once it’s in place, everything above is incremental.