Most teams have a deployment process for code. Peer review, staging environment, automated tests, rollback plan. The whole apparatus. Then someone edits a system prompt in a shared doc, pastes it into the production config, and ships it at 2pm on a Thursday. No test set. No comparison against baseline. No way to know if it's better or worse until users start complaining.
This is the gap that's defining AI reliability in 2026. Braintrust's evaluation guide frames it directly: teams deploy changes without measuring their impact on output quality, and failures follow because "a small change to a prompt might improve the application's tone while breaking its accuracy on certain inputs." That's not a model problem. That's a process problem.
The good news: the tooling and patterns to fix it have matured considerably. The bad news: most small teams still haven't wired them up.
The Labelled Test Set Is the Foundation You're Missing
Before you can evaluate a prompt change, you need something to evaluate it against. That means a labelled test set — a collection of inputs with expected outputs or quality criteria — that lives in version control alongside your prompts.
The eval-harness-template published in late May describes the pattern bluntly: "Most AI workflow failures we see in production share one root cause: no labelled test set, no harness, no gating. Without it, every prompt change is a guess. With it, prompt iteration becomes empirical." The scaffold ships with example test sets across three domains — document processing (200 cases), customer service (150 cases), and compliance review (100 cases) — and a CI workflow that gates promotion to production.
The categories matter as much as the count. Cases should cover routine inputs (the happy path), exceptional inputs (edge cases your prompt needs to handle), ambiguous inputs (where the right answer is genuinely unclear), and adversarial inputs (where users or data will try to break your prompt). A test set that's 95% routine cases will pass confidently right up until an edge case or adversarial input reaches production.
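One way to keep the category mix honest is to check it mechanically. The sketch below is illustrative only: the EvalCase fields, the category identifiers, and the 10% floor are assumptions made for the example, not something the eval-harness-template prescribes.

```python
from collections import Counter
from dataclasses import dataclass

# The four categories named above; the identifiers are illustrative.
CATEGORIES = ("routine", "exceptional", "ambiguous", "adversarial")


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str    # one of CATEGORIES
    input_text: str
    expected: str    # expected output or a short quality criterion


def category_shares(cases):
    """Return the fraction of the test set that falls in each category."""
    if not cases:
        raise ValueError("test set is empty")
    counts = Counter(case.category for case in cases)
    return {cat: counts[cat] / len(cases) for cat in CATEGORIES}


def underrepresented(cases, min_share=0.10):
    """List categories below min_share (an assumed, tunable floor)."""
    shares = category_shares(cases)
    return [cat for cat in CATEGORIES if shares[cat] < min_share]


# A test set that is 95% routine: 19 routine cases and 1 adversarial case.
cases = [EvalCase(f"r{i}", "routine", "...", "...") for i in range(19)]
cases.append(EvalCase("a1", "adversarial", "...", "..."))
print(underrepresented(cases))  # ['exceptional', 'ambiguous', 'adversarial']
```

A check like this can run in the same CI job as the evals, so a lopsided test set fails loudly instead of passing quietly.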
Two Eval Modes, Both Required
The Braintrust guide draws a distinction that's worth internalizing: offline evaluation and online evaluation serve different purposes and neither replaces the other.
Offline evaluation runs before deployment, against your labelled test set. It's your unit test suite for prompt changes: it catches regressions before they reach users. Online evaluation monitors live traffic asynchronously, scoring sampled production requests against quality criteria. It catches the failures offline testing can't anticipate: novel query patterns, distribution shifts, gradual model drift after a provider update.
The teams that get this right treat offline evals as a deployment gate (the build fails if quality drops below threshold) and online evals as a continuous monitoring signal (alerts when production quality diverges from the offline baseline). Running only one of these is like having unit tests but no production monitoring, or production monitoring but no tests. Both halves are load-bearing.
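As a rough sketch of the two halves, the functions below show an offline gate and an online divergence check side by side. The function names, the 0-to-1 score scale, and every threshold are assumptions chosen for illustration; they do not come from Braintrust or any specific tool.

```python
from statistics import mean


def offline_gate(candidate_scores, baseline_mean, min_mean=0.80, max_drop=0.02):
    """Return (passed, reasons) for a candidate prompt's offline eval scores.

    Fails if the mean is below an absolute threshold or has dropped too far
    from the baseline prompt's mean. Both limits are assumed values.
    """
    candidate_mean = mean(candidate_scores)
    reasons = []
    if candidate_mean < min_mean:
        reasons.append(f"mean {candidate_mean:.3f} is below threshold {min_mean:.3f}")
    if baseline_mean - candidate_mean > max_drop:
        reasons.append(f"mean dropped {baseline_mean - candidate_mean:.3f} from baseline")
    return (not reasons, reasons)


def online_diverged(sampled_scores, offline_baseline, tolerance=0.05):
    """True when sampled production quality falls below the offline baseline.

    Only downward divergence is flagged in this sketch.
    """
    return offline_baseline - mean(sampled_scores) > tolerance


def exit_code(passed):
    """CI convention: a non-zero exit status fails the build step."""
    return 0 if passed else 1
```

In a CI job, the runner would call offline_gate on the candidate's scores and pass exit_code(passed) to sys.exit, while online_diverged would run on a schedule against sampled production scores and feed an alert.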
The Arize Team's Honest Starting Point
The Arize team's retrospective on building their own agent, published in May, is worth reading because it starts where most teams actually are: "our 'testing framework' was a Google Doc. We'd write down queries, record the responses, make changes, check if things improved, and repeat. It was painful, inefficient, and didn't scale."
The core problem they identified: AI agents aren't deterministic. The same input produces different outputs across runs, which means manual spot-checking is structurally incapable of catching regressions. You need enough test cases, run enough times, with scoring that aggregates across the variance — not a human eyeballing five examples and deciding it "seems fine."
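A minimal sketch of what "run enough times and aggregate" can look like follows. The generate and score callables are placeholders for your own model call and scorer, and the default of five runs is an arbitrary assumption rather than a recommendation from Arize.

```python
from statistics import mean


def run_repeated(generate, score, cases, runs=5):
    """Score every case `runs` times and aggregate across the variance.

    generate(input_text) -> output and score(output, expected) -> float
    are placeholders for your model call and your scorer.
    Each case is a (case_id, input_text, expected) tuple.
    """
    summary = {}
    for case_id, input_text, expected in cases:
        scores = [score(generate(input_text), expected) for _ in range(runs)]
        summary[case_id] = {
            "mean": mean(scores),
            "worst": min(scores),
            "runs": runs,
        }
    return summary


def unstable_cases(summary, min_worst=1.0):
    """Case ids whose worst run scored below min_worst (assumed threshold)."""
    return [cid for cid, stats in summary.items() if stats["worst"] < min_worst]
```

Tracking the worst run per case, not only the mean, surfaces inputs that pass most of the time and fail occasionally, which is exactly what a five-example spot check tends to miss.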
This is also why the H1 2026 retrospective from Digital Applied describes regression detection moving from "optional bonus" to "default expectation" among teams running customer-facing LLM features. The field is converging on a norm: prompt changes require eval gates, the same way code changes require tests.
The Operational Implication
If you're a small team, the minimum viable harness is simpler than it sounds: a labelled test set in version control, a runner that scores your prompt against it, and a CI step that blocks promotion if scores regress. The eval-harness-template gives you the scaffold in TypeScript or Python. Braintrust's CI/CD review covers the tooling options for the scoring and reporting layer.
The harder part is cultural: whoever owns the prompt needs to own the test set. If the test set lives with a different team, or doesn't exist, the harness is theater. Prompt changes are code changes. The test set is what makes that statement mean something.
Eval Patterns
Score distributions, not just averages. A prompt change that moves average quality from 7.2 to 7.4 while increasing the variance — more 9s but also more 4s — is probably a regression on the cases that matter. Track p10 and p25 scores alongside the mean. A floor that drops is a problem even when the ceiling rises.
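To make the floor-versus-ceiling point concrete, here is a small sketch using made-up scores that reproduce the 7.2 to 7.4 shift described above. The nearest-rank percentile definition and the function names are choices made for this example; other percentile conventions will give slightly different numbers.

```python
from statistics import mean


def percentile(scores, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not scores:
        raise ValueError("no scores")
    ordered = sorted(scores)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


def summarize(scores):
    """Mean plus the low-end percentiles worth tracking."""
    return {
        "mean": mean(scores),
        "p10": percentile(scores, 10),
        "p25": percentile(scores, 25),
    }


def floor_dropped(baseline, candidate, tolerance=0.0):
    """True if the candidate's p10 or p25 fell below the baseline's."""
    base, cand = summarize(baseline), summarize(candidate)
    return any(cand[key] < base[key] - tolerance for key in ("p10", "p25"))


# Illustrative scores only: the mean rises from 7.2 to 7.4,
# but the candidate has more 9s and also more 4s.
baseline = [7, 7, 7, 7, 7, 7, 7, 7, 8, 8]
candidate = [4, 4, 7, 7, 7, 9, 9, 9, 9, 9]

print(summarize(baseline))
print(summarize(candidate))
print(floor_dropped(baseline, candidate))  # True: p10 fell from 7 to 4
```

A gate built on the mean alone would promote this candidate; a gate that also checks p10 would block it.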
Reliability Notes
Version your prompts like you version your code. Prompt changes that aren't tagged and tracked make rollback nearly impossible. When a production incident traces back to a prompt edit, "we changed the wording last Tuesday" is not a rollback plan. Git history for prompts, with the same discipline you'd apply to a config file, is the minimum.
Cost Watch
Score your eval suite with a cheaper judge first. If you're using an LLM-as-judge scorer, the judge doesn't need to be your most expensive model. A top-tier judge scoring thousands of eval cases per deployment adds up fast. Validate that your cheaper judge correlates well with human ratings on a calibration set, then use it for the high-volume automated runs. Reserve the expensive model for the ambiguous and adversarial cases where judgment quality actually matters.
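The calibration step can be sketched as a plain correlation check between human ratings and the cheaper judge's scores on the same cases. Pearson correlation, the 0.8 cutoff, and the "cheap" and "expensive" labels are all assumptions for illustration; a rank correlation or an agreement rate may suit your scoring scale better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores have no variance; correlation is undefined")
    return cov / math.sqrt(var_x * var_y)


def judge_is_calibrated(human_ratings, judge_scores, min_corr=0.8):
    """True if the cheap judge tracks human ratings closely enough.

    min_corr is an assumed cutoff; pick one that fits your risk tolerance.
    """
    return pearson(human_ratings, judge_scores) >= min_corr


def pick_judge(category):
    """Route harder categories to the expensive judge (hypothetical labels)."""
    return "expensive" if category in ("ambiguous", "adversarial") else "cheap"
```

It is worth re-running the calibration whenever the judge prompt or the judge model changes, since either can shift how well it agrees with human raters.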
