How to run experiments on AI agents and workflows

Scoring each step of an agent workflow and calling it an experiment is like grading every sentence in a report without checking whether the report answered the question.
The scores can all look fine while the agent completely fails the job it was supposed to do. That's the core problem this article addresses: AI agent experiments require a fundamentally different approach than prompt A/B tests, because the unit of measurement isn't a response — it's the entire workflow, from first tool call to final outcome.
This guide is for engineers, PMs, and data teams who are building or evaluating multi-step AI agent systems and need a rigorous way to test them. If you already run A/B tests on prompts or product features, you have the right instincts — but the methodology needs to change when agents maintain state, chain tool calls, and make decisions that compound across steps. Here's what you'll learn:
- Why agent experiments fail when you apply prompt A/B test logic to multi-step workflows
- How to define metrics that actually reflect workflow-level success — task-completion rate, error recovery rate, step efficiency, and failure point attribution
- How to structure controlled experiments around tool use, model selection, and workflow architecture using feature flags and gradual rollouts
- The hard measurement problems unique to agent systems — non-determinism, long observation windows, and attribution across steps — and how to work around them
- How to build a continuous improvement loop so your experiments compound into institutional knowledge instead of disappearing after each ship decision
The article moves in that order: from why the problem is structurally different, to what to measure, to how to run the experiments, to what can still go wrong, to how to make the whole program stick over time.
Why AI agent experiments are fundamentally different from prompt A/B tests
If you already run prompt A/B tests, your instinct when evaluating an AI agent will be to apply the same methodology — score each step's output, aggregate the results, and declare a winner. That instinct is wrong, and acting on it will produce experiment results that are not just incomplete but actively misleading.
The reason isn't a matter of scale or complexity. It's a structural mismatch between what prompt A/B tests measure and what agents actually do.
What traditional prompt A/B tests actually measure
A prompt A/B test evaluates a single transformation: one input goes in, one output comes out, and you score that output. The system is effectively stateless from the experiment's perspective. Each interaction is independent. You can measure relevance, accuracy, tone, or task completion for that one response — and because nothing that happened before affects what you're measuring, isolated scoring is valid.
This methodology works because single-response systems have a simple, predictable execution model: the model receives a prompt and produces an output. That's the entire surface area of the experiment. Prompt A/B tests are well-designed for exactly this architecture, and only this architecture.
Why agents break the assumptions that make prompt A/B tests valid
Agents don't produce a single output. They maintain state across steps, make tool calls whose results feed subsequent decisions, and operate across a decision chain where each output becomes the next input. The evaluation unit isn't a response — it's a trajectory.
Consider a multi-agent fraud detection workflow where one agent analyzes transaction patterns, a second reviews customer history, an aggregator synthesizes both findings, and a validator checks the final output before a decision is made. Each agent's output is the input context for the next. A prompt A/B test can tell you whether the transaction analysis agent produced a well-formed output.
It cannot tell you whether that output was the right input for the customer history agent, whether the aggregator handled conflicting signals correctly, or whether the validator caught a downstream error that originated two steps earlier.
The coordination layer between steps — the decisions about when to call a tool, what to do with a tool's output, how to handle ambiguity before passing context forward — is entirely invisible to step-level evaluation.
The cascade failure problem
This is where step-level scoring stops being merely incomplete and becomes actively misleading. In a multi-step agent workflow, an error at step 2 doesn't just produce a bad step-2 output that you can catch and flag. It corrupts the context that steps 3, 4, and 5 operate on.
Each subsequent step may execute correctly given the corrupted context it received — and score well on a local evaluation — while the overall task fails completely.
You can run a step-level evaluation that shows every individual output passing quality thresholds while the agent fails the actual job it was supposed to do. The scores are technically accurate and strategically useless.
This isn't hypothetical. A Cornell University study found that single-agent GPT-4 systems achieved a 2.92% success rate on complex planning tasks, while coordinated multi-agent systems reached 42.68% on the same benchmark. That gap isn't a prompt quality problem.
It's a workflow architecture problem — and no amount of per-step output scoring would have identified it, because the failure mode lives in the coordination between steps, not within any individual step.
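A toy illustration of the cascade described above, assuming a hypothetical three-step payment workflow. The step names, the parsing bug, and the flat fee are all invented for this sketch. Every step passes its own local check, yet the end-to-end result is wrong.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    output: float
    locally_valid: bool  # did the step behave sensibly given the input it received?


def run_workflow(raw_amount: str) -> list[StepResult]:
    # Step 1 (hypothetical bug): the thousands separator is treated as a
    # decimal point, so "1,250" is parsed as 1.25 instead of 1250.
    amount = float(raw_amount.replace(",", "."))
    steps = [StepResult("extract_amount", amount, locally_valid=amount > 0)]

    # Step 2: add a flat 10.00 handling fee. Correct arithmetic on bad input.
    with_fee = amount + 10.0
    steps.append(StepResult("add_fee", with_fee, locally_valid=with_fee > amount))

    # Step 3: approve charges under 5000. Correct policy applied to bad input.
    approved = with_fee < 5000
    steps.append(StepResult("approve", with_fee, locally_valid=approved))
    return steps


steps = run_workflow("1,250")
all_steps_pass = all(step.locally_valid for step in steps)  # step-level view
task_succeeded = steps[-1].output == 1260.0                 # workflow-level view
print(all_steps_pass, task_succeeded)
```

The step-level view reports a clean run while the workflow-level check fails, because the only real error happened once, early, and every later step faithfully processed the corrupted value.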
The evaluation unit must be the workflow
The practical implication is straightforward: AI agent experiments must be designed to measure task completion at the workflow level, not output quality at the step level. The success metric for an agent experiment is whether the agent completed the task correctly end-to-end — not whether each individual response looked good in isolation.
This reframes what "winning" means. A variant that produces slightly lower-quality intermediate outputs but recovers from errors more reliably and completes more tasks is a better agent than one that scores higher on per-step evals but fails more often at the finish line.
GrowthBook frames this directly: when it comes to AI, "evals are just the tip of the iceberg — A/B testing is where real value gets created." Output-level evaluation is a starting point, not a measurement strategy. Connecting AI behavior to user outcomes requires measuring the workflow, not the response.
The four metrics that expose what step-level scoring misses in agent workflows
Knowing that workflow-level measurement is necessary is not the same as knowing what to measure. According to Galileo's State of Eval Engineering Report, 72% of AI teams believe comprehensive testing drives reliability — but only 15% achieve elite evaluation coverage.
That 57-point gap isn't a motivation problem. It's a measurement problem: teams are testing, but they're testing at the wrong granularity, disconnected from the business outcomes that actually matter. Four metrics close that gap.
Task-completion rate: the primary signal
Task-completion rate sounds obvious until you try to define it precisely. The metric isn't whether each step produced a plausible output — it's whether the agent actually accomplished the stated goal of the full workflow. That distinction matters because of what Confident AI calls "false task completion": the transcript says "done," the final output looks clean, but nothing in the real system actually changed.
This failure mode is completely invisible to output-only scoring. You only catch it by measuring at the workflow level, against ground truth.
Task-completion rate is your leading indicator for user satisfaction and retention. When it degrades, users notice before your dashboards do — which means instrumenting it properly is a prerequisite for everything else.
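As a minimal sketch of the distinction between claimed and verified completion, assume each workflow run is logged with two hypothetical fields: what the transcript reported, and what a ground-truth check against the real system found. The field names and sample data are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    run_id: str
    agent_reported_done: bool  # what the transcript claims
    verified_in_system: bool   # ground truth checked against the real system


def task_completion_rate(runs: list[WorkflowRun]) -> float:
    """Share of all runs whose goal was verifiably achieved."""
    if not runs:
        return 0.0
    return sum(run.verified_in_system for run in runs) / len(runs)


def false_completion_rate(runs: list[WorkflowRun]) -> float:
    """Share of runs that claimed success but changed nothing in the real system."""
    claimed = [run for run in runs if run.agent_reported_done]
    if not claimed:
        return 0.0
    return sum(not run.verified_in_system for run in claimed) / len(claimed)


runs = [
    WorkflowRun("r1", agent_reported_done=True, verified_in_system=True),
    WorkflowRun("r2", agent_reported_done=True, verified_in_system=False),
    WorkflowRun("r3", agent_reported_done=False, verified_in_system=False),
    WorkflowRun("r4", agent_reported_done=True, verified_in_system=True),
]
completion = task_completion_rate(runs)
false_completion = false_completion_rate(runs)
```

In this sample, three of four transcripts say "done" but only two runs are verified, so a transcript-based score would overstate the completion rate.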
Error recovery rate: measuring resilience under failure
No agent workflow runs cleanly every time. The question isn't whether errors occur — it's whether the agent recovers from them without human intervention or a full abort. Error recovery rate measures exactly that: the proportion of workflow runs where the agent encountered a failure state (wrong tool call, bad argument, unexpected output) and successfully self-corrected.
The failure modes this metric is designed to catch include what practitioners call the "interrogation loop" — retry cycles that never converge, burning latency and compute while producing nothing. Low error recovery rate is a direct driver of elevated cost and degraded response time, which makes it a business metric as much as a quality metric.
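One way to compute this metric, sketched under the assumption that each run log carries three hypothetical fields (`errors_encountered`, `completed`, `human_escalation`). The denominator is only the runs that actually hit a failure state, which is what separates recovery rate from plain completion rate.

```python
from typing import Optional


def error_recovery_rate(runs: list[dict]) -> Optional[float]:
    """Among runs that hit at least one failure state, the share that still
    completed without a human stepping in. Returns None if no run failed."""
    hit_failure = [run for run in runs if run["errors_encountered"] > 0]
    if not hit_failure:
        return None
    recovered = [
        run for run in hit_failure
        if run["completed"] and not run["human_escalation"]
    ]
    return len(recovered) / len(hit_failure)


runs = [
    {"errors_encountered": 0, "completed": True, "human_escalation": False},
    {"errors_encountered": 2, "completed": True, "human_escalation": False},
    {"errors_encountered": 1, "completed": True, "human_escalation": True},
    {"errors_encountered": 5, "completed": False, "human_escalation": False},
    {"errors_encountered": 1, "completed": True, "human_escalation": False},
]
recovery = error_recovery_rate(runs)
```

Here four runs hit a failure state and two of them self-corrected, giving a recovery rate of 0.5. The clean run is excluded so that it cannot flatter the number.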
Step efficiency: the cost hidden in the decision path
Every unnecessary tool call, redundant reasoning step, or circular summary has a price. Step efficiency — the number of steps required to complete a task relative to a minimum viable path — is a direct proxy for inference cost and latency. Confident AI's "budget burner" pattern describes this precisely: traces that look fine on the surface but are filled with reasoning thrash and circular summaries that advance nothing.
The qualitative complement to raw step count is turn relevancy — whether intermediate steps are actually moving toward the goal. Both require trace data to measure, but they answer different questions: step count tells you how much work the agent did, turn relevancy tells you whether that work was purposeful.
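A small sketch of the step-count half of this metric. It assumes you have a reference "minimum viable path" length per task type, which is itself a judgment call that each team has to make. The function name and numbers are illustrative.

```python
def step_overhead(actual_steps: int, min_viable_steps: int) -> float:
    """Ratio of steps taken to the minimum viable path.
    1.0 means no wasted steps; 1.5 means 50% more steps than needed."""
    if min_viable_steps <= 0:
        raise ValueError("min_viable_steps must be positive")
    return actual_steps / min_viable_steps


# Hypothetical completed runs of the same task type (reference path: 6 steps).
step_counts = [6, 9, 7, 14]
overheads = [step_overhead(count, 6) for count in step_counts]
mean_overhead = sum(overheads) / len(overheads)
```

Only completed runs belong in this calculation. A run that aborts after two steps looks efficient by step count while accomplishing nothing, which is why step efficiency is read alongside task-completion rate rather than on its own.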
Failure point attribution: finding where workflows actually break
The step where a workflow degrades is often not the step where the failure becomes visible. A bad tool argument in step 2 might not surface as a wrong answer until step 6. This is why full LLM tracing isn't optional — it's the foundational instrumentation requirement. Without a complete trace of the decision path, failure point attribution is guesswork.
Galileo's finding that traditional metrics "don't capture agentic workflows where success depends on multi-step reasoning" points directly at this gap. Trace-level data is what converts a vague sense that "the agent is underperforming" into a specific, actionable diagnosis: this tool selection, at this step, under these input conditions, is where the workflow breaks.
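To make the gap between "where it broke" and "where it became visible" concrete, here is a sketch that walks a trace and compares the two. The trace format, tool names, and the per-tool validity check are assumptions made for illustration, not a standard tracing schema.

```python
from typing import Callable, Optional

Step = dict
Check = Callable[[Step], bool]


def first_invalid_step(trace: list[Step], checks: dict[str, Check]) -> Optional[int]:
    """Index of the earliest step that fails its tool-specific validity check."""
    for index, step in enumerate(trace):
        check = checks.get(step["tool"])
        if check is not None and not check(step):
            return index
    return None


def first_visible_error(trace: list[Step]) -> Optional[int]:
    """Index of the earliest step that raised an explicit error."""
    for index, step in enumerate(trace):
        if step.get("error"):
            return index
    return None


# Hypothetical rule: a refund amount must be positive.
checks = {"issue_refund": lambda step: step["args"]["amount"] > 0}

trace = [
    {"tool": "get_order", "args": {"order_id": "A1"}, "error": None},
    {"tool": "issue_refund", "args": {"amount": -40}, "error": None},
    {"tool": "send_email", "args": {"template": "refund"}, "error": None},
    {"tool": "close_ticket", "args": {}, "error": "ledger mismatch"},
]
origin = first_invalid_step(trace, checks)
surfaced = first_visible_error(trace)
```

In this trace the bad argument appears at index 1 but the error only surfaces at index 3. Without the full trace and a check on intermediate arguments, the diagnosis would start at the wrong step.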
Mapping workflow metrics to business outcomes
Each of these metrics connects to a concrete business dimension. Task-completion rate maps to user satisfaction and retention. Error recovery rate maps to latency and support cost, because every unrecovered failure ends as either a slow response or a human escalation. Step efficiency maps to inference cost and response time. Failure point attribution maps to engineering velocity: faster diagnosis means faster iteration cycles.
Experimentation platforms that support custom SQL metrics and configurable metric windows let teams define these workflow-level outcomes as first-class metrics computed from their own data, rather than relying on proxy signals that don't reflect what the agent actually did.
GrowthBook's warehouse-native experimentation connects directly to your existing data infrastructure, so workflow-level metrics like task-completion rate and error recovery rate can be defined in SQL against your own event data rather than requiring a separate instrumentation layer. Metric correlation analysis is particularly useful here for detecting trade-offs: when you optimize for step efficiency, does task-completion rate hold? That tension is exactly the kind of question controlled AI agent experiments are designed to answer.
Structuring AI agent experiments around tool use, error recovery, and task completion
The core discipline of controlled experimentation — change one thing, hold everything else constant — sounds simple until you're staring at an agent system with a dozen moving parts. Which model is handling which subtask? Which tools are available at which step? How is the workflow decomposed? How aggressive are the guardrails?
Each of these is a variable, and without a deliberate isolation strategy, you cannot know which one caused the change in task-completion rate you observed. The operational challenge isn't conceptual — it's that most teams haven't formally defined their agent configurations in a way that makes controlled comparison possible.
Isolating the experimental variable in a multi-component agent system
Before you can isolate a variable, you have to name it precisely. Agent systems have four primary variable categories worth treating as distinct experimental dimensions: the prompt or instruction set, the model selection, the tool configuration (which tools are available, in what order, with what permissions), and the workflow decomposition (how subtasks are structured and delegated across steps or sub-agents).
The reason informal experimentation fails here is captured well by recent work on structured agent experiment design: "experimental conditions for agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit". When your baseline and variant are described in natural language rather than formally encoded, you can't reliably determine what actually changed between them — and you certainly can't reproduce the experiment later.
The discipline of naming the variable precisely, encoding it consistently, and holding the other three categories fixed is what separates a valid agent experiment from an anecdote.
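One way to move experimental conditions out of prose is to encode each configuration as data and check mechanically that baseline and variant differ in exactly one category. The sketch below assumes the four variable categories named above; the class, field names, and values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentConfig:
    prompt_version: str
    model: str
    tools: tuple[str, ...]
    decomposition: str  # e.g. "single_agent" or "planner_plus_workers"


def changed_fields(baseline: AgentConfig, variant: AgentConfig) -> list[str]:
    """Names of the fields whose values differ between the two configurations."""
    return [
        field.name for field in fields(baseline)
        if getattr(baseline, field.name) != getattr(variant, field.name)
    ]


def assert_single_variable(baseline: AgentConfig, variant: AgentConfig) -> str:
    """Raise unless exactly one field differs; return that field's name."""
    changed = changed_fields(baseline, variant)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed variable, got {changed}")
    return changed[0]


baseline = AgentConfig("v3", "model-a", ("search", "calculator"), "single_agent")
variant = AgentConfig("v3", "model-a", ("search", "calculator", "web"), "single_agent")
variable_under_test = assert_single_variable(baseline, variant)
```

Running a check like this before launch turns "we only changed the tool set" from a claim into something auditable, and the frozen dataclasses can be stored alongside the experiment record for later reproduction.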
Feature flags as the mechanism for switching agent configurations without a deploy
The practical mechanism for switching between agent configurations in production — without a full code deploy — is feature flags linked directly to experiment assignments. Feature flags control which agent configuration a session receives (model version A vs. B, tool set with web search vs. without, prompt variant for the planning step), and the experiment tracks workflow-level outcomes against that assignment. GrowthBook supports attribute-based targeting rules so flags can be scoped to specific workflow contexts, task types, or user segments without affecting the full user base.
Gradual rollout is the right risk posture for AI agent experiments. Exposing 5% of traffic to a new tool configuration while monitoring error recovery rate as a guardrail metric gives you real signal without committing the full user base to an untested configuration.
If the new configuration causes error cascades — a real risk when a tool call failure at step 2 corrupts the rest of the workflow — an instant kill switch lets you deactivate it without waiting for a deployment cycle. Kill switches work instantly because flag evaluation happens locally: flip the toggle in the dashboard, the new rules propagate via the streaming connection, and every SDK evaluates the updated rule on its next check. No deploy required. That kill-switch capability isn't a convenience; it's a safety requirement for agent experiments where failures compound across steps.
A well-structured experiment schema supports this pattern directly: linked feature flags connect flag state to experiment assignment, guardrail metrics are a distinct category separate from primary goals, and experiment targeting rules enable conditional enrollment so you can scope an experiment to a specific task type — order processing workflows, say, rather than all agent sessions indiscriminately.
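The assignment logic behind a gradual rollout with a kill switch can be sketched generically. This is not GrowthBook's SDK or hashing scheme; it is an illustrative stand-in that assumes a hash-based bucket per workflow run, a rollout fraction, and a boolean kill switch, with all names invented for the example.

```python
import hashlib


def bucket(run_id: str, experiment_key: str) -> float:
    """Deterministically map a workflow run to a value in [0, 1)."""
    digest = hashlib.sha256(f"{experiment_key}:{run_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32


def assign_config(run_id: str, experiment_key: str,
                  rollout_fraction: float, kill_switch_on: bool) -> str:
    """Return which agent configuration this run should receive."""
    if kill_switch_on:
        return "control"  # everyone falls back to the known-good configuration
    if bucket(run_id, experiment_key) < rollout_fraction:
        return "variant"
    return "control"


run_ids = [f"run-{i}" for i in range(10_000)]
assignments = [assign_config(r, "tool-set-web-search", 0.05, False) for r in run_ids]
variant_share = assignments.count("variant") / len(assignments)
```

Because the bucket depends only on the run ID and the experiment key, the same run always lands in the same arm, and raising the rollout fraction later only adds runs to the variant without reshuffling existing ones.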
Treating architectural decisions as testable hypotheses
The most common mistake in agent experimentation is treating workflow decomposition and guardrail placement as fixed infrastructure rather than experimental variables. They aren't. Whether a task is handled by a single agent with a long prompt or decomposed into three specialized sub-agents with handoff logic is an architectural choice with measurable effects on task-completion rate and error recovery behavior. It should be tested, not assumed.
The practical implication: before running a prompt A/B test, verify that your current workflow decomposition isn't confounding the results. A suboptimal architecture will suppress the signal from any downstream prompt experiment — you'll be optimizing the instructions for a structure that itself needs to change.
Running controlled rollouts at the workflow level
The unit of randomization and measurement in an agent experiment is the workflow run, not the individual tool call or response. This matters for how you structure traffic splitting and how long you need to run the experiment. Agent workflows are typically lower-volume than web page views, which means reaching statistical significance takes longer.
Standard A/B testing requires you to decide your sample size upfront and wait until you hit it — which can mean weeks of waiting for low-volume agent workflows. Sequential testing is an alternative approach: you monitor results as they come in and stop the experiment when the evidence is strong enough to make a decision, without waiting for a fixed endpoint.
This avoids two failure modes: stopping too early because the first results looked promising, and running far longer than necessary because you committed to a sample size that assumed higher traffic than you actually have. For AI agent experiments specifically, where completed workflow runs are scarce, sequential testing is worth understanding before you set up your first experiment.
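To show the mechanics of "monitor as results arrive, stop when the evidence is strong enough", here is a sketch of Wald's sequential probability ratio test (SPRT) on pass/fail workflow outcomes. This is one classic sequential method chosen for illustration, not necessarily what any particular experimentation platform implements, and it is a one-sample test against a known baseline rate rather than a full two-arm comparison. The rates `p0` and `p1` and the error levels are assumed inputs.

```python
import math


def sprt_decision(outcomes: list[bool], p0: float, p1: float,
                  alpha: float = 0.05, beta: float = 0.20) -> tuple[str, int]:
    """Test H0: completion rate = p0 against H1: completion rate = p1 (p1 > p0).
    Returns the decision and the number of runs consumed when it was reached."""
    upper = math.log((1 - beta) / alpha)   # cross this: evidence favors H1
    lower = math.log(beta / (1 - alpha))   # cross this: evidence favors H0
    log_likelihood_ratio = 0.0
    for n, success in enumerate(outcomes, start=1):
        if success:
            log_likelihood_ratio += math.log(p1 / p0)
        else:
            log_likelihood_ratio += math.log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return ("accept_h1", n)
        if log_likelihood_ratio <= lower:
            return ("accept_h0", n)
    return ("continue", len(outcomes))


# Baseline completes 60% of tasks; we hope the variant reaches 80%.
streak_of_successes = sprt_decision([True] * 20, p0=0.6, p1=0.8)
streak_of_failures = sprt_decision([False] * 20, p0=0.6, p1=0.8)
```

The stopping boundaries are fixed before any data arrives, which is what makes repeated checking legitimate here. Looking at a fixed-horizon test every day and stopping at the first significant result does not have that property.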
Targeting also matters more in agent experiments than in standard product experiments. The same configuration may perform very differently across task categories. Scoping enrollment to a specific workflow context — using prerequisite conditions on experiment phases — keeps your results interpretable and prevents a strong effect in one task type from being diluted by noise from another.
The hard problems in AI agent experiments: non-determinism, long horizons, and attribution
Well-designed agent experiments can still produce untrustworthy results. Not because the experiment was set up incorrectly, but because the measurement itself is structurally compromised. Engineers who already run A/B tests will apply familiar statistical intuitions to agent experiments — and those intuitions will mislead them.
The noise floor is higher, the observation window is longer, and the causal chain between a decision point and a final outcome is obscured by every compounding step in between. These aren't symptoms of immature tooling that better tools will eventually fix. They're structural properties of how agents work.
The non-determinism problem — why agent experiments have a higher noise floor
A single LLM call can produce different outputs for identical inputs, because generation typically involves sampling. An agent making ten tool calls compounds that randomness at every step: the output of step one affects what step two receives, which affects step three, and so on. By the end of a ten-step workflow, two runs that started with the same input can have followed completely different paths.
As Arize AI puts it directly: "LLMs are non-deterministic — agents can follow strange paths and still arrive at the right answer, making debugging difficult." In a single-response evaluation, you can run enough samples to average out that variance. In an agent workflow, the same input can produce divergent execution paths entirely — meaning two runs of the "same" experiment may not actually be testing the same behavior at all.
This inflates outcome variance in ways that require larger sample sizes or longer run times to compensate for. But there's a subtler problem too. Anthropic has documented cases where agents succeed via unexpected paths that the grading logic doesn't recognize as success — producing false negatives.
When Anthropic tested Opus 4.5 on a flight booking task, the model solved the problem by discovering a policy loophole rather than following the expected resolution path. The agent succeeded; the evaluator marked it as a failure. If your grading logic can be outpaced by model behavior, your experiment results are measuring evaluator limitations as much as agent capability.
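The false-negative risk comes from grading the path instead of the outcome. The sketch below contrasts the two on a made-up rebooking task; it does not reproduce the evaluation described above, and the tool names, expected path, and goal state are all hypothetical.

```python
EXPECTED_PATH = ["search_flights", "cancel_booking", "book_flight"]


def path_grader(tool_calls: list[str]) -> bool:
    """Passes only if the agent followed the anticipated tool sequence."""
    return tool_calls == EXPECTED_PATH


def outcome_grader(final_state: dict) -> bool:
    """Passes if the environment ends in the goal state, whatever the path."""
    return (
        final_state.get("destination") == "SFO"
        and final_state.get("travel_date") == "2025-06-02"
        and final_state.get("extra_charge", 0) == 0
    )


# The agent reached the goal by modifying the booking instead of rebooking.
run = {
    "tool_calls": ["search_flights", "modify_booking"],
    "final_state": {"destination": "SFO", "travel_date": "2025-06-02", "extra_charge": 0},
}
path_verdict = path_grader(run["tool_calls"])
outcome_verdict = outcome_grader(run["final_state"])
```

The path grader rejects a run that the outcome grader accepts. Outcome-based grading is not automatically right either, since an unexpected path may violate a policy the goal state does not encode, so disagreements between the two are worth reviewing by hand.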
Long-horizon observation windows — how task duration distorts experiment timing
A common heuristic is to run an experiment until the result reaches statistical significance. For agent workflows that take hours or days to complete, that heuristic breaks down: you cannot measure a partial workflow, because an incomplete task looks identical to a failed one until it finishes.
Checking results before tasks complete produces misleading signals, and the peeking problem that affects all experiments is especially acute here because task completion times are variable and long.
This creates a direct tension with fast iteration cycles. If your agent handles multi-day workflows, a single experiment round-trip might take weeks before you have enough completed tasks to draw conclusions. Anthropic notes that "agents use tools across many turns, modifying state in the environment and adapting as they go — which means mistakes can propagate and compound." That compounding happens over time, which means your observation window must be long enough to capture the full propagation, not just the first few steps where things look fine.
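One way to keep in-flight workflows from distorting results is to analyze only runs that have had the full observation window to finish. This sketch assumes each run logs a start time and an optional completion time; the field names, the three-day window, and the dates are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional


def windowed_completion_rate(runs: list[dict], now: datetime,
                             window: timedelta) -> Optional[float]:
    """Completion rate over runs old enough to have had the full window.
    A run counts as completed only if it finished within the window."""
    eligible = [run for run in runs if now - run["started_at"] >= window]
    if not eligible:
        return None
    completed = [
        run for run in eligible
        if run["completed_at"] is not None
        and run["completed_at"] - run["started_at"] <= window
    ]
    return len(completed) / len(eligible)


now = datetime(2025, 1, 10)
window = timedelta(days=3)
runs = [
    {"started_at": datetime(2025, 1, 1), "completed_at": datetime(2025, 1, 2)},
    {"started_at": datetime(2025, 1, 2), "completed_at": None},
    {"started_at": datetime(2025, 1, 3), "completed_at": datetime(2025, 1, 9)},
    {"started_at": datetime(2025, 1, 9), "completed_at": None},  # still in flight
]
rate = windowed_completion_rate(runs, now, window)
```

The run started on January 9 is excluded rather than counted as a failure, and the run that took six days counts as not completed within the window. Both arms of an experiment need the same window for the comparison to be fair.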
Attribution across multi-step workflows — which decision caused the outcome?
When an agent fails at step eight, which of the preceding decisions caused it? The final outcome metric tells you that something went wrong. It doesn't tell you where. Databricks frames this precisely: "An answer may be correct, but the tool calls leading to it may be inefficient, risky or inconsistent. Evaluating only the final output can hide underlying reasoning failures."
This means teams need step-level instrumentation — tracing tool calls, intermediate state, and decision points — not just outcome logging. Without it, experiment results tell you that a configuration change moved a metric, but not why, which makes intelligent iteration nearly impossible.
Arize AI's framing is useful here: "Effective agent evaluation requires looking beyond final outputs to assess what the agent knows, what actions it takes, and how it plans." That's not a nice-to-have. It's a prerequisite for attribution.
The right tooling can help here: experimentation platforms that pull metrics directly from your data warehouse — rather than requiring a separate event-tracking layer — let you join trace data from your agent runs to the business outcomes you actually care about, like whether the user completed their task or came back the next day. But the tooling only works if the traces exist. If you're not logging tool calls, intermediate state, and decision points during every workflow run, no amount of downstream analytics infrastructure will give you attribution.
Replay-based testing environments — controlling for non-determinism
The most practical mitigation for both non-determinism and attribution challenges is replay-based testing: capturing real execution traces from production, then replaying them against a modified agent configuration in a controlled environment. Because the environment state, tool responses, and input sequence are held constant, you're doing an apples-to-apples comparison where only the agent configuration changes.
Anthropic's multi-turn eval harness design follows this pattern — a controlled environment, an agent loop, and a grading mechanism that evaluates the full execution, not just the final output.
Replay environments get closer to realistic testing standards than static benchmarks. But they come with a real cost: significant instrumentation investment upfront, and a structural limitation that replay environments may not surface novel failure modes that only emerge in live conditions with real users and real tool states. There's no clean solution here — only a more honest accounting of what your experiment results can and cannot tell you.
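A minimal sketch of the replay idea, assuming tool calls from a production run were recorded as (tool, arguments, response) entries. The `ReplayTools` class, the two agent functions, and the recorded data are hypothetical stand-ins, not a real harness.

```python
import json


class ReplayTools:
    """Serves recorded tool responses so every configuration sees the same world."""

    def __init__(self, recorded_calls: list[dict]):
        self._responses = {
            self._key(call["tool"], call["args"]): call["response"]
            for call in recorded_calls
        }
        self.misses: list[tuple[str, dict]] = []

    @staticmethod
    def _key(tool: str, args: dict) -> tuple[str, str]:
        return (tool, json.dumps(args, sort_keys=True))

    def call(self, tool: str, args: dict) -> dict:
        key = self._key(tool, args)
        if key not in self._responses:
            self.misses.append((tool, args))  # the variant left the recorded path
            raise KeyError(f"no recorded response for {tool}")
        return self._responses[key]


def baseline_agent(tools: ReplayTools) -> str:
    order = tools.call("get_order", {"order_id": "A1"})
    return "refund" if order["status"] == "damaged" else "no_action"


def variant_agent(tools: ReplayTools) -> str:
    order = tools.call("get_order", {"order_id": "A1"})
    if order["status"] == "damaged" and order["total"] <= 50:
        return "refund"
    return "escalate"


recorded = [{"tool": "get_order", "args": {"order_id": "A1"},
             "response": {"status": "damaged", "total": 80}}]
baseline_decision = baseline_agent(ReplayTools(recorded))
variant_decision = variant_agent(ReplayTools(recorded))
```

Both configurations see an identical tool response, so the difference in decisions is attributable to the configuration alone. The `misses` list makes the structural limitation visible: once a variant issues a call the original run never made, the recording has nothing to serve and the replay can no longer follow it.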
Building a continuous improvement loop for AI agent workflows
One-off experiments on agent systems tend to produce one-off insights. A team tests a new prompt, ships the winner, and moves on — only to revisit the same question six months later when a different engineer joins and asks why the workflow is structured the way it is. The institutional knowledge from that first experiment is gone.
This is the compounding knowledge problem that makes agent experimentation programs fail not at the experiment level, but at the program level.
The argument here isn't that individual experiments are useless. It's that their value multiplies only when they're treated as documented artifacts in a cumulative learning system rather than isolated tests that produce a binary ship/don't-ship decision. As GrowthBook puts it: "Each experiment may not have a large effect on your metrics, but many experiments might." That framing captures exactly what a continuous improvement loop is designed to exploit.
Establishing a baseline and iteration cadence
Before you can iterate meaningfully, you need a stable reference point. For agent workflows, that means documenting a control state with known task-completion rates, error recovery rates, step counts, and failure point distributions — the metrics covered earlier in this article. Without a documented baseline, you can't distinguish a genuine improvement from natural variance in agent behavior.
The iteration cadence question is trickier. There's no universal answer for how frequently agent teams should run new experiments, and any specific schedule would be arbitrary. What matters more is calibration: are you running enough bold experiments to generate real learning, or are you defaulting to safe prompt tweaks that win easily but teach you little?
Monitoring win rates and experiment frequency over time gives you a signal on this — a consistently high win rate often means the experiments are too conservative, not that the team is unusually skilled.
The practical loop is straightforward: establish baseline → run experiment → document result → update baseline → repeat. Each cycle should produce a reusable artifact, not just a deployed change.
Documenting experiments as learning artifacts
The documentation problem is worse for agent systems than for standard product experiments because the experimental variable could be a prompt, a tool configuration, a guardrail threshold, a workflow decomposition decision, or a model swap — and these are easy to lose track of across a fast-moving codebase. "Experimentation programs generate a lot of artifacts that can be hard to capture," as GrowthBook notes in describing the institutional knowledge problem directly.
What a useful learning artifact contains: the hypothesis going in, exactly what was changed, which workflow-level metrics moved and by how much, the conclusion, and the reason the experiment was stopped. These aren't just documentation hygiene — they're the inputs that prevent future teams from re-running experiments you've already run.
Some experimentation platforms use vector embeddings across experiment descriptions and hypotheses to flag potentially redundant experiments before they're launched, which addresses this problem at the infrastructure level rather than relying on team members to remember what was tested.
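The artifact fields listed above can be captured as a small structured record so they survive outside anyone's memory. This is a sketch of one possible shape; the class, field names, and sample values are invented for illustration and are not a GrowthBook schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str                  # what the team expected and why
    change: str                      # exactly what differed from baseline
    metric_deltas: dict[str, float]  # workflow-level metrics and how far they moved
    conclusion: str
    stop_reason: str


record = ExperimentRecord(
    name="planner-split-2025-q1",
    hypothesis="Splitting planning into its own sub-agent raises completion rate.",
    change="decomposition: single_agent -> planner_plus_workers",
    metric_deltas={"task_completion_rate": 0.03, "mean_step_overhead": 0.2},
    conclusion="Shipped: completion improved at the cost of extra steps.",
    stop_reason="Reached planned sample of completed workflow runs.",
)
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
```

Serializing to JSON keeps the record searchable and diffable, and storing it next to the encoded baseline and variant configurations makes the experiment reproducible later.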
Character.AI's use of GrowthBook for iterative model comparison illustrates what this looks like in practice. Landon Smith, Head of Post-Training, described it this way: "We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That's the iteration loop operating as intended: each comparison informs the next research direction rather than producing a one-time answer.
Tracking cumulative workflow-level impact
Individual agent experiments frequently show small or ambiguous effects. A 2% improvement in task-completion rate might not clear statistical significance on its own, but ten such experiments compounding over a quarter represent a meaningful capability shift. The only way to see that is with a cumulative view across the full experimentation program — not just the most recent test.
This is where tooling like GrowthBook's Cumulative Impact dashboard becomes relevant: it aggregates the effect of all experiments on any given metric, making the program-level story visible to both the teams running experiments and the stakeholders who need to understand why the investment is worthwhile. Connecting that view to North Star metrics — task-completion rate, cost per workflow run, user satisfaction — closes the loop between individual experiment decisions and long-term business outcomes.
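The compounding claim can be checked with a few lines of arithmetic. The sketch assumes each 2% figure is a relative lift and that the lifts are independent and multiplicative, which real experiments only approximate; the 60% starting rate is a made-up baseline.

```python
def compounded_lift(relative_lifts: list[float]) -> float:
    """Total relative lift from applying each lift on top of the previous ones."""
    total = 1.0
    for lift in relative_lifts:
        total *= 1.0 + lift
    return total - 1.0


ten_small_wins = compounded_lift([0.02] * 10)

baseline_completion = 0.60
projected_completion = baseline_completion * (1.0 + ten_small_wins)
```

Ten relative lifts of 2% compound to roughly 21.9%, which would move a 60% completion rate to about 73% under these assumptions. Measured effects carry uncertainty and rarely stack this cleanly, so a cumulative estimate is better read as an upper-end guide than a forecast.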
The observation that holds true across the AI agent community is that "there is no secret sauce" in the basic loop structure. The competitive advantage comes from how systematically you refine it. If the architecture is commoditized, the teams that win are the ones that accumulate the most rigorous institutional knowledge about what actually moves their specific workflow metrics — and that only happens through a program, not a series of one-off tests.
Three starting points for teams that have never run a controlled agent experiment
The through-line of this article is simple: AI agent experiments fail when you treat a workflow as a collection of responses instead of a single unit of behavior. Every section builds on that premise — from why cascade failures are invisible to step-level scoring, to why task-completion rate is the metric that actually connects to user outcomes, to why your workflow decomposition might be confounding every prompt experiment you've ever run.
Where you start depends on where you are. The right first move isn't the same for every team.
If you're not yet measuring task-completion rate at the workflow level against ground truth: That's your starting point — not a new experiment, not a new tool, just a measurement change. Define what "task complete" means for your specific workflow, instrument it against real outcomes rather than transcript output, and run your existing configuration against that baseline before changing anything else.
If you have workflow-level metrics but no trace instrumentation: Add tracing before your next experiment. Without a complete record of tool calls, intermediate state, and decision points, you'll know that a configuration change moved a metric but not why. That's not enough to inform the next experiment — you'll be iterating blind.
If you have both metrics and tracing: Identify the one architectural variable you've been treating as fixed infrastructure — your workflow decomposition, your guardrail thresholds, your tool selection logic — and design a controlled test around it. That's the experiment most teams should have run six months ago and haven't because it felt like "infrastructure" rather than experimentation.
GrowthBook's feature flag and experiment infrastructure is designed to support this pattern at each stage: gradual rollouts with guardrail metrics, linked feature flags for configuration control, and cumulative impact tracking so individual small wins add up to a visible program-level story.
The honest constraint to keep in mind: observation windows for agent experiments are longer than you're used to, and instrumentation takes real time to build. You will feel pressure to move faster than the data supports. A premature conclusion from an incomplete workflow observation is worse than no conclusion — it sends your team in the wrong direction with false confidence.
This article is meant to give you a clear, honest foundation for that work — not to make it sound easier than it is.