Your agents shouldn't guess at feature flags and experiments

Scott Bailey

Jun 29, 2026

min read

Your agents shouldn't guess at feature flags and experiments

If you use Claude, Cursor, or another coding agent day to day, you've probably tried to use it to create a feature flag or analyze an experiment. The agent can parse your intent and call the GrowthBook API, but it guesses at the procedure. It might ship a flag enabled by default. It might read experiment results without checking for a sample ratio mismatch first. Its results are all plausible, and you only find out later what it skipped or got wrong.

GrowthBook's new open-source agent skills library is built around a different model. You give the agent a written playbook that encodes the procedure and best practices. The agent follows the playbook.

What are agent skills?

Agent skills are Markdown files that an AI agent reads as instructions for how to do something. Each skill covers a specific GrowthBook task: creating a feature flag, designing an experiment spec, analyzing results, cleaning up stale flags. The agent follows the steps in the skill, calls the GrowthBook REST API directly, and returns the output to you.

Install them with:

npx skills add growthbook/skills

Or install via the Claude Code plugin. The skills follow the Agent Skills standard, so they work in Claude, Cursor, Codex, or any agent that supports it. Every skill is a readable Markdown file at github.com/growthbook/skills, so you can inspect exactly what the agent will do, fork the repo, and adapt skills to your workflow. We welcome ideas and contributions to the library. The GrowthBook docs cover setup and the full list of skills.

Why use a skill instead of a plain prompt?

Asking an agent to "just create a flag" works until it doesn't. Your agent might add a rollout rule, or ship the flag at 100% right away. Ask it to "analyze the experiment" and it could read the p-value off the page without checking for a sample ratio mismatch first. A simple prompt doesn't constrain an agent or guide it through best practices.

A skill turns those best practices into steps the agent follows every time:

New flags ship disabled. The skill creates the flag in an off state. You explicitly add rules to enable traffic.
Changes follow draft, review, publish. Edits don't go live until they move through GrowthBook's approval flow.
Experiment analysis includes SRM checks. Before reporting lift, the skill checks for sample ratio mismatches that would invalidate the result, then reports confidence or credible intervals, not a raw p-value.
It won't invent what has to be real. Ask it to run an experiment on a metric that isn't defined in GrowthBook yet, and it stops and tells you to create it first, instead of inventing a metric ID and confidently reporting success.

The guardrails aren't instructions you hope the agent remembers. They're written steps in a file it always references.

What the skills cover

The library spans the full GrowthBook lifecycle:

Feature flags: create, target, ramp, safe rollout, and cleanup
Experiments: brainstorm, design, launch, analyze, stop
Discovery: search and audit your flags, and trace their dependencies

A realistic walkthrough: Creating a new flag-based experiment

Say you're shipping a new checkout flow. Here's how a session might look.

Prompt 1: "Wrap the new checkout in a feature flag and roll it out to 10%."

The skill creates the flag in a disabled state, then adds a percentage rollout rule set to 10%. The flag goes live only after you confirm.

Prompt 2: "Set up an experiment on the new checkout measuring conversion."

The skill collects a hypothesis, the primary metric (checkout conversion), and any guardrail metrics you want to watch, say revenue per user. It calculates a recommended sample size from your baseline and a minimum detectable effect, then drafts the experiment spec. Draft state, not live, so you review before it launches.

Prompt 3 (one week later): "Did the checkout experiment win?"

The skill fetches results, runs an SRM check first, and reports the lift with a confidence interval (frequentist) or credible interval (Bayesian), depending on your stats engine. You get an honest read: whether the results are conclusive, what the interval says, and a recommendation on whether to ship or keep running.

Each prompt is plain language. Each runs a specific, auditable skill under the hood. You stay in your preferred editor or terminal the whole time.

Try it, read it, fork it

The full library is open source at github.com/growthbook/skills. Each skill is a short Markdown file you can read and adapt to your team's own conventions. Skills also compose: analyze an experiment, stop it with a documented conclusion, then clean up the flag, all in one workflow without jumping between tasks.

If something doesn't match how your team works, open a PR or file an issue. The goal is a set of skills the community trusts because the community has read, used, and refined them. These skills are an active project, and we're adding new ones all the time. We'd love your feedback and ideas on what to build next.

Example H2