Khan Academy uses GrowthBook to improve AI tutoring

Khan Academy uses GrowthBook to test Khanmigo prompts, system instructions, and models in production while protecting learner trust, data privacy, and site performance.

About Khan Academy

Khan Academy is a nonprofit with a mission to provide a free, world-class education for anyone, anywhere. Since launching in 2008, Khan Academy has grown from math videos into a comprehensive learning platform with practice exercises, articles, teacher tools, and Khanmigo, its generative AI tutor and teaching assistant. In School Year 2024–25, Khan Academy served 120 million users across 190 countries and 55+ languages.

Industry: Education
Location: Global
  • +6.1% correctness
  • +5.1% cognitive engagement
  • 3 sec latency reduction

“A/B testing GenAI features has been an absolute game changer. Small changes can shift LLM output dramatically, and experiments are how we climb toward better quality. Experimentation went from feeling like a speed bump to becoming a safety net.”
Kelli Hill, Ph.D., Senior Director, Data Insights
Khan Academy

Executive Summary

Khan Academy has always had a high bar for experimentation. The team serves millions of learners, works with sensitive education data, and supports classroom environments where product changes can affect students, teachers, and trust.

That bar got even higher with Khanmigo, its AI-powered tutor.

Unlike traditional product features, Khanmigo’s output is non-deterministic. The same prompt on the same question can produce different answers. A small prompt change can shift behavior. A model update can improve one thing and quietly break another. And in education, a response that looks polished is not enough. It has to support learning.

With GrowthBook, Khan Academy built a self-hosted experimentation platform for feature flags and A/B testing that works with its existing data warehouse, event pipelines, and metrics. The team now uses GrowthBook to test Khanmigo prompts, system instructions, and model changes against AI quality metrics like cognitive engagement, next-item performance, undesirable tutoring behaviors, verbosity, and latency. This helps the team improve Khanmigo faster and with more confidence.

The challenge: AI broke the old rules of product intuition

Khan Academy had deep experience with experimentation before adopting GrowthBook.

In 2011, the team re-engineered open-source A/B testing tooling for Google App Engine so it could run fast, scalable tests across millions of interactions. In 2014, it updated the system to log data into BigQuery, improving analytics and cutting latency in half. By 2019, during a major backend re-architecture, Khan Academy deprecated that legacy system and began evaluating whether to build, buy, or adopt a new open-source experimentation platform.

In 2024, Khan Academy launched GrowthBook as its self-hosted platform for feature flags and experiments. GrowthBook fit because it could use Khan Academy’s existing data warehouse, data pipelines, instrumented events, and metrics. It also supported experiments without slowing a site with roughly one million daily active users.

That foundation became critical as Khan Academy began improving Khanmigo, its generative AI tutor and teaching assistant, which introduced a harder measurement problem.

A prompt change, system instruction change, or model swap could alter the quality, length, speed, and accuracy of a tutoring interaction in ways that were difficult to predict. A faster answer could be less accurate. A longer answer could be less useful. A polished answer could still be poor tutoring.

Khan Academy needed to understand whether AI changes were actually improving learning quality, not just producing outputs that looked better on the surface.

The solution: Production A/B testing for AI evals

Khan Academy’s AI evaluation process evolved in stages.

The team started with intuition-driven prompt testing, then moved to a more structured internal prompt playground. That helped teams test prompts more consistently, but humans still had to review outputs and decide whether they were good.

As Khanmigo matured, Khan Academy built automated post hoc evals. The team developed rubrics for key learning measures, including cognitive engagement. Human experts labeled transcripts to create trusted ground truth datasets. Those datasets were then used to build LLM judge prompts that could evaluate Khanmigo interactions at scale.
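One way to check an LLM judge against human-labeled ground truth is to measure chance-corrected agreement between the two sets of labels. The sketch below uses Cohen's kappa. This is an assumption for illustration: the case study does not say which agreement statistic Khan Academy uses, and the labels and function name here are hypothetical.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and LLM-judge labels.

    Hypothetical helper: the statistic and the label values are illustrative
    assumptions, not Khan Academy's documented method.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of transcripts where the judge matches the human expert.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for four transcripts on a two-level rubric.
human_labels = ["high", "high", "low", "low"]
judge_labels = ["high", "low", "low", "low"]
kappa = cohens_kappa(human_labels, judge_labels)
```

A judge prompt that scores well against the human-labeled set can then be trusted on far more transcripts than experts could review by hand.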

But post hoc evaluation could only go so far. It could show trends, but it could not always prove that a specific prompt, system instruction, or model change caused an improvement.

So Khan Academy built the infrastructure to run production A/B tests at the Khanmigo chat-thread level. Instead of only randomizing by user, the team can randomize each new chat interaction. That lets Khan Academy test different versions of prompts, system instructions, and models directly in GrowthBook.
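The difference between user-level and chat-thread-level randomization comes down to which identifier is hashed into a bucket. The following is a minimal sketch using SHA-256 from the Python standard library. It is not GrowthBook's hashing algorithm or SDK, and the experiment key, variation names, and ID formats are hypothetical.

```python
import hashlib


def assign_variation(unit_id: str, experiment_key: str, variations: list[str]) -> str:
    """Deterministically map a randomization unit to one variation.

    Illustrative only: SHA-256 bucketing is an assumption, not the hashing
    that GrowthBook itself performs.
    """
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).digest()
    # Read the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    index = min(int(bucket * len(variations)), len(variations) - 1)
    return variations[index]


VARIATIONS = ["control_prompt", "candidate_prompt"]  # hypothetical names
EXPERIMENT = "khanmigo-prompt-test"                  # hypothetical key

# User-level: every thread a user opens gets the same variation.
by_user = assign_variation("user-123", EXPERIMENT, VARIATIONS)

# Thread-level: each new chat thread is bucketed on its own ID, so one
# user can see different variations across threads.
by_thread = [assign_variation(f"thread-{i}", EXPERIMENT, VARIATIONS) for i in range(1000)]
```

Hashing the thread ID yields many more randomization units than hashing the user ID, at the cost of a single user possibly seeing more than one variation.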

Today, Khan Academy processes about 20% of Khanmigo chat data nightly into dashboards, giving the team a continuous view into AI tutoring quality.

GrowthBook became the system for connecting AI changes to measurable outcomes.

Khan Academy now evaluates AI experiments against primary, secondary, and guardrail metrics, including:

  • Cognitive engagement
  • Next-item performance
  • Undesirable math tutoring behaviors, such as giving away the answer
  • Thread length and verbosity
  • Response latency

That gave the team a way to move beyond “Does this answer seem better?” and toward “Did this change improve tutoring quality without creating unacceptable tradeoffs?”
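The split between primary and guardrail metrics can be read as a simple decision rule: a guardrail regression blocks a change even when the primary metric improves. The sketch below assumes a hypothetical result structure and a made-up 2% tolerance. It is not GrowthBook's analysis engine or Khan Academy's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    name: str
    role: str                # "primary", "secondary", or "guardrail"
    relative_change: float   # treatment vs. control, e.g. 0.05 means +5%
    higher_is_better: bool
    significant: bool


def decide(results: list[MetricResult], guardrail_tolerance: float = 0.02) -> str:
    """Return "roll back", "roll out", or "refine" (hypothetical rule)."""
    for r in results:
        if r.role != "guardrail":
            continue
        # Express the change as harm, whichever direction is "bad".
        harm = -r.relative_change if r.higher_is_better else r.relative_change
        if harm > guardrail_tolerance:
            return "roll back"
    primaries = [r for r in results if r.role == "primary"]

    def improved(r: MetricResult) -> bool:
        gain = r.relative_change if r.higher_is_better else -r.relative_change
        return r.significant and gain > 0

    if primaries and all(improved(r) for r in primaries):
        return "roll out"
    return "refine"


# Made-up numbers for illustration only.
engagement_up = MetricResult("cognitive_engagement", "primary", 0.05, True, True)
latency_ok = MetricResult("response_latency", "guardrail", 0.01, False, False)
latency_bad = MetricResult("response_latency", "guardrail", 0.10, False, True)
```

Here `decide([engagement_up, latency_ok])` returns "roll out", while swapping in `latency_bad` returns "roll back" despite the engagement gain.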

The journey: From AI vibes to measurable tutoring quality

Early testing helped the team learn what generative AI could do, but it did not provide reliability or scale. A small group of experts could inspect individual outputs, but that was not enough for a product serving millions of learners.

Khan Academy needed a better evaluation loop to measure tutoring quality continuously, compare changes rigorously, and support confident decisions in production.

First, the team defined what “good tutoring” should mean in measurable terms. Cognitive engagement became a critical metric because Khan Academy believes deeper engagement leads to more effective practice, and more effective practice leads to better learning outcomes.

Then the team created human-grounded rubrics and LLM judges to evaluate Khanmigo interactions at scale. Finally, Khan Academy connected those AI quality metrics to GrowthBook experiments so teams could test whether specific changes improved the tutoring experience.

That loop is already producing measurable gains. In a recent series of Khanmigo product tests, giving Khanmigo more structured context from a student’s Khan Academy learning history helped improve next-item correctness by 6.1%. Another test using fuller plain-text conversation context increased cognitive engagement by 5.1%.
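For readers less familiar with how a lift figure is derived, the sketch below computes a relative lift and a pooled two-proportion z statistic from raw counts. The counts are invented for illustration. The case study does not state the underlying sample sizes, whether its percentages are relative or absolute, or which statistics engine produced them, so this frequentist calculation is an assumption.

```python
import math


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of treatment over control, e.g. 0.06 means +6%."""
    return (treatment_rate - control_rate) / control_rate


def two_proportion_z(c_success: int, c_total: int, t_success: int, t_total: int) -> float:
    """Pooled two-proportion z statistic for treatment minus control."""
    p_c = c_success / c_total
    p_t = t_success / t_total
    pooled = (c_success + t_success) / (c_total + t_total)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / c_total + 1 / t_total))
    return (p_t - p_c) / standard_error


# Hypothetical counts: 500 of 1000 next items correct in control,
# 530 of 1000 in treatment.
lift = relative_lift(500 / 1000, 530 / 1000)
z = two_proportion_z(500, 1000, 530, 1000)
```

With these made-up counts the lift is about 6%, yet the z statistic falls below the conventional 1.96 threshold, which shows why sample size matters as much as the size of the effect.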

The team can now test small changes, read the impact across learning and guardrail metrics, and decide whether to refine, roll back, or roll out. GrowthBook gives Khan Academy a way to measure those effects before they reach everyone.

Example: Reducing Math Agent latency without sacrificing accuracy

A clear example came from Khanmigo’s Math Agent.

The Math Agent helps Khanmigo verify calculations behind the scenes. That improves math accuracy, but it also adds an extra step to the response flow. In classrooms, that extra wait can disrupt the rhythm of tutoring.

Khan Academy needed to reduce latency without sacrificing math accuracy.

The team ran a sequence of experiments:

  1. Disable the Math Agent
    Latency improved, but math errors doubled.
  2. Move Khanmigo to a faster GPT-5 model and disable the Math Agent
    Latency improved, but math errors still doubled.
  3. Make the Math Agent more concise
    Latency dropped by about 3 seconds, while math accuracy stayed stable.
  4. Use faster models inside the Math Agent
    Latency improved by about 300 milliseconds, with stable math accuracy.
  5. Limit Math Agent execution time
    Latency improved by roughly another 600 milliseconds, with stable math accuracy.

Without experimentation, the team might have optimized for the easiest metric to measure: speed. But speed alone would have pushed them toward worse math accuracy.
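Step 5 above, limiting execution time, is commonly implemented as a time budget around the verification call. The sketch below uses `asyncio.wait_for` from the Python standard library. The verifier functions are stand-ins, and returning `None` on timeout is an assumption: the case study does not describe how the Math Agent behaves when its budget runs out.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def verify_with_budget(
    verify: Callable[[str], Awaitable[str]],
    expression: str,
    budget_seconds: float,
) -> Optional[str]:
    """Run a verification step, giving up once the time budget is spent.

    Returns the verifier's result, or None if it did not finish in time
    (hypothetical fallback behavior).
    """
    try:
        return await asyncio.wait_for(verify(expression), timeout=budget_seconds)
    except asyncio.TimeoutError:
        return None


async def fast_verifier(expression: str) -> str:
    # Stand-in for a verification call that finishes quickly.
    await asyncio.sleep(0)
    return "verified"


async def slow_verifier(expression: str) -> str:
    # Stand-in for a verification call that overruns the budget.
    await asyncio.sleep(0.2)
    return "verified"


within_budget = asyncio.run(verify_with_budget(fast_verifier, "2 + 2", 0.5))
over_budget = asyncio.run(verify_with_budget(slow_verifier, "2 + 2", 0.01))
```

Whether a given budget keeps accuracy stable is exactly the kind of question the experiments above were run to answer.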

The result: A safety net for AI product development

The biggest outcome was not just better testing infrastructure. It was a culture shift.

Before Khanmigo, A/B testing could feel like an extra hurdle. Teams wanted to ship based on intuition, user research, or internal confidence. Generative AI changed that.

Because LLM behavior is hard to predict, experiments became the way to understand whether a change improved tutoring quality, hurt accuracy, increased verbosity, slowed responses, or created risk for learners.

“Without this type of testing, they would be going with their intuition. This is all so new. It’s so hard to predict how a large language model will respond to something that you put into it.”
Kelli Hill, Ph.D., Senior Director, Data Insights, Khan Academy

Khan Academy can now test Khanmigo AI changes in production while staying aligned with the principles that matter most to its mission:

  • Protect learner trust
  • Keep sensitive data under control
  • Avoid disrupting classrooms
  • Measure real tutoring quality
  • Improve AI features through evidence, not guesswork

For Khan Academy, GrowthBook made experimentation part of the AI development loop.

Data ownership mattered

Khan Academy works with sensitive education data, including data involving children. That made many commercial experimentation tools difficult to consider because they required sending user data to a third-party system.

GrowthBook’s self-hosted architecture lets Khan Academy keep control of its data while still giving teams modern feature flagging and experimentation workflows.

“The fact that we could retain ownership of our data was very, very important. We have data from children stored in our servers, and that’s something that we have to really protect.”
John Resig, Chief Software Architect, Khan Academy

GrowthBook also fits Khan Academy’s performance requirements. Its static configuration model lets the team cache configuration in the CDN and inline it into the page, so users get it immediately on page load without an extra network request.

For a learning platform serving roughly one million daily active users, experimentation could not come at the cost of speed.
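Inlining a static configuration into the page can be as simple as serializing it to JSON inside a script tag when the HTML is rendered. The following is a rough sketch, assuming a hypothetical global variable name and a made-up configuration shape. It does not reproduce GrowthBook's payload format or Khan Academy's rendering code.

```python
import json


def inline_config_script(config: dict, global_name: str = "__EXPERIMENT_CONFIG__") -> str:
    """Render a config dict as an inline script tag (hypothetical helper)."""
    payload = json.dumps(config, separators=(",", ":"))
    # Escape "<" so a string value containing "</script>" cannot close the tag early.
    payload = payload.replace("<", "\\u003c")
    return f"<script>window.{global_name}={payload};</script>"


# Made-up configuration for illustration only.
example_config = {"features": {"new-tutor-prompt": {"defaultValue": False}}}
script_tag = inline_config_script(example_config)
```

Because the configuration arrives with the HTML itself, the client can evaluate flags on first render instead of waiting for a separate request.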

Responsible experimentation for learners and classrooms

Khan Academy could not simply test AI changes as fast as possible. The team needed to move quickly without compromising learner trust or classroom stability.

Before launching production AI experiments, Khan Academy built a responsible experimentation framework. Experiments are grounded in learning science principles, limited to U.S. users ages 13 and older, easy for adults to opt out of, designed not to disrupt classroom instruction, and randomized at the group level when needed to avoid confusion among students and teachers.

The governance model defines when a test can move quickly and when it needs additional review.

If a treatment is invisible to users, such as a prompt or system instruction change, teams can test across eligible segments. If a change is visible, the team evaluates whether users can toggle it off and whether it could cause classroom confusion. Higher-risk experiments require sign-off before launch.
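The governance rules described above can be approximated as two small checks: who is eligible for an experiment, and which review path a proposed test takes. The sketch below is one hypothetical reading of those rules. The field names, return values, and the exact mapping from risk to sign-off are assumptions, not Khan Academy's policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedExperiment:
    visible_to_users: bool
    users_can_toggle_off: bool
    may_confuse_classrooms: bool


def is_eligible(country: str, age: int, opted_out: bool) -> bool:
    """Eligibility as described in the text: U.S. users aged 13 and older
    who have not opted out."""
    return country == "US" and age >= 13 and not opted_out


def review_path(experiment: ProposedExperiment) -> str:
    """Return "standard" or "sign-off required" (assumed mapping)."""
    if not experiment.visible_to_users:
        # Invisible treatments, such as prompt changes, can run across
        # eligible segments.
        return "standard"
    if experiment.users_can_toggle_off and not experiment.may_confuse_classrooms:
        return "standard"
    return "sign-off required"


prompt_change = ProposedExperiment(False, False, False)
visible_ui_change = ProposedExperiment(True, False, True)
```

Encoding the rules this way makes the fast path explicit, so teams know up front which tests can launch without extra review.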

This framework helps Khan Academy balance speed with responsibility. The goal is not to slow experimentation down. It is to make sure teams can learn quickly without putting learners or teachers in a bad spot.

Why GrowthBook

Khan Academy chose GrowthBook because it matched the way the team wanted to build.

GrowthBook gave Khan Academy:

  • Self-hosting and data ownership for sensitive education data
  • Warehouse-native analysis using existing data infrastructure
  • Feature flags and experiments in one platform
  • Performance-friendly SDK behavior for a high-traffic learning site
  • Flexibility to support user-level and chat-thread-level randomization
  • Transparent experiment analysis teams could validate against their own expectations

“The flexibility that GrowthBook provided us gives us a lot more power. It allows us to unlock a whole bunch of additional types of testing that just weren’t possible for us previously.”
John Resig, Chief Software Architect, Khan Academy

GrowthBook gave Khan Academy the foundation to test AI responsibly, protect learner data, and keep improving Khanmigo with confidence.
