From Amazon to Atlassian: one leader's framework for testing assumptions instead of egos

Ashley Stirrup

May 25, 2026

Running an experiment is the easy part. The hard part is knowing when to trust your instincts and when to test them. Andrew Willingham, Head of Legal and People Products at Atlassian, learned that lesson the painful way—by shipping a product that experts loved but users couldn't figure out how to use.

Willingham spent over 11 years at Amazon, starting in consumer marketing, where every pixel on the homepage was A/B tested. He later led product development for Amazon's HR systems, building talent management software for 1.5 million associates. Now at Atlassian, he's applying those lessons to reinvent how companies hire, evaluate, and retain talent in the age of AI.

His career offers a rare perspective: he's built products for millions of consumers with unlimited test traffic, then pivoted to enterprise HR where user volumes are smaller and qualitative research becomes the primary de-risking tool. Along the way, he's developed a clear framework for when to test, when to ship, and when to trust a metric.

The talent review product that flopped on launch

When Willingham first joined Amazon's HR organization, his team was tasked with building a talent review product for operations. At the time, Amazon was running performance calibrations on Excel spreadsheets, PowerPoint slides, and Word docs. There was no consistency, no visibility, and plenty of data leakage risk.

Willingham's team was embedded with Amazon's Talent Management Center of Excellence, the industrial-organizational (IO) psychology specialists who design high-performing organizations. They built a product the experts loved. It reflected best practices. It had all the features a world-class talent review system should have.

Then they handed it to the HRBPs—the people who actually run talent reviews on the front lines.

"They got into it and they were like, what, this is way too complex. Like you have all these features. I don't know how to use this. And so it was a flop initially." — Andrew Willingham

The team had to go back to the drawing board. This time, they worked directly with the HRBPs. They simplified the interface. They made it so intuitive that any HRBP could run a talent review without a two-week prep cycle.

The result? Business leaders could now run their own talent reviews in real time, without waiting for HR support. That wasn't the original goal, but it became the product's biggest unlock.

"That was a really, really big win after that initial failure." — Andrew Willingham

The lesson: Know who your user is. It may not be the same person as your customer. And if you don't sit with the actual operator and watch them use your product, you're building in the dark.

From millions of tests to desk rides

At Amazon, A/B testing was the default. Every change to the buy button, every font size adjustment, every ad placement was tested at scale. Willingham's team would run double A/B tests: one to ensure no harm to page load latency, another to measure conversion lift.
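A rough sketch of that two-step check, in Python, might look like the following. It is not Amazon's actual method: the function names, the 2% latency allowance, and the 0.05 significance level are all assumptions chosen for illustration, and the latency guardrail is simplified to a comparison of means rather than a proper non-inferiority test. The conversion check is a standard pooled two-proportion z-test.

```python
from math import sqrt
from statistics import NormalDist, mean


def conversion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in either arm, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def latency_ok(control_ms, variant_ms, max_regression=0.02):
    """Assumed guardrail: variant mean latency may exceed control by at most 2%."""
    return mean(variant_ms) <= mean(control_ms) * (1 + max_regression)


def evaluate(control, variant, alpha=0.05):
    """Check the latency guardrail first, then look for a conversion lift."""
    if not latency_ok(control["latency_ms"], variant["latency_ms"]):
        return "reject: latency guardrail failed"
    p = conversion_p_value(
        control["conversions"], control["visitors"],
        variant["conversions"], variant["visitors"],
    )
    rate_a = control["conversions"] / control["visitors"]
    rate_b = variant["conversions"] / variant["visitors"]
    if p < alpha and rate_b > rate_a:
        return "ship"
    return "no detectable lift"
```

Ordering the guardrail first reflects the "no harm" framing: a variant that slows the page is rejected regardless of how well it converts.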

Then he moved to HR, where the user base was 450,000 corporate employees and 1.5 million associates. That is still large, but small next to hundreds of millions of homepage hits.

"We did have to adjust our approaches. So instead of leading kind of with a pure data approach of like, cool, we did this and here's our p-values and everything else, we moved into kind of relying on a very close connection with customer research." — Andrew Willingham

He started attending talent reviews. He watched HRBPs prep Excel files and juggle PowerPoints. He asked questions: Why are you doing this step? Why did you bring this data? Didn't you already do that earlier?

This qualitative research became the proxy for high-volume A/B testing. It gave the team enough confidence to de-risk their roadmap before launch.

"You need to sit with your actual user and watch them use your product. It's going to be painful because you're going to be like, what are you doing? You're supposed to do this. But you can figure out pretty quickly, okay, that's actually what they're trying to do." — Andrew Willingham

The metrics that matter: efficiency and quality

At Atlassian, Willingham optimizes for two North Star metrics: efficiency and quality.

Efficiency means reducing the time it takes to run a talent calibration, complete a hiring funnel, or onboard a new employee. Quality means maintaining or increasing the outcome — hire quality, decision accuracy, and employee satisfaction.

These metrics are deliberately antagonistic. You can't optimize one without protecting the other.

"If our goal is to hire faster, I'll just get rid of all interviews and blind hire people. And then I'm going to hit my metric and walk away here. But that's why you have to have that quality metric to say, you're optimizing against these things that are naturally kind of antagonistic." — Andrew Willingham
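One way to picture the pairing is a decision rule that refuses any efficiency win that costs quality. The sketch below is purely illustrative: the `MetricChange` type, the `decide` function, and both thresholds are hypothetical, not something Willingham or Atlassian describes.

```python
from dataclasses import dataclass


@dataclass
class MetricChange:
    efficiency_gain: float  # relative time saved, e.g. 0.20 means 20% faster
    quality_change: float   # relative change in quality, negative means worse


def decide(change, min_gain=0.05, max_quality_drop=0.0):
    """Ship only if efficiency improves and quality does not regress.

    Both thresholds are assumed values for illustration.
    """
    if change.quality_change < -max_quality_drop:
        return "reject"  # the quality metric vetoes the change
    if change.efficiency_gain >= min_gain:
        return "ship"
    return "keep testing"
```

Under this rule, the "blind hire" shortcut from the quote, something like `MetricChange(efficiency_gain=0.9, quality_change=-0.3)`, is rejected even though it hits the speed target.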

This balance is critical when applying AI to HR processes. Willingham's team is reinventing hiring workflows from the ground up. They're testing whether steps in the traditional funnel — sourcing, recruiter screen, hiring manager screen, five interviews, offer — can be skipped, replaced, or reordered.

The goal isn't just efficiency. It's a better candidate experience and higher decision quality. That requires testing both speed and outcome.

When to test and when to trust

Willingham has a clear framework for deciding what to test.

If it's a durable truth—something you know is aligned with outcomes you care about—ship it. At Amazon, faster shipping times always translated to future revenue. No customer wants their package to arrive later. That's a truth you can bet on.

But if it's counterintuitive, test it. Willingham gives the example of dashboard usage time. Everyone assumes longer usage time means higher engagement. But it could also mean the product is slow and frustrating.

"You have to test one in that case, does usage time actually indicate satisfaction and helpful to customers? I can think of ways in which it would and ways in which it wouldn't." — Andrew Willingham
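A first pass at that question could be as simple as checking whether usage time and a satisfaction score move together across users. The sketch below assumes you have both measurements per user. The function names and the 0.3 cutoff are hypothetical, and a correlation alone would not settle causation, so treat it as a screening step rather than a verdict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def interpret_usage_time(minutes, satisfaction, threshold=0.3):
    """Classify usage time as a proxy metric; the threshold is an assumed cutoff."""
    r = pearson(minutes, satisfaction)
    if r >= threshold:
        return "usage time tracks satisfaction"
    if r <= -threshold:
        return "usage time tracks frustration"
    return "usage time is not a reliable signal"
```

If the result lands in the last bucket, usage time should not be treated as an engagement metric until a better-designed test says otherwise.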

The other place to test: marketing copy and value propositions. Willingham's team tests how to get product managers to adopt AI features. Telling them "AI is important" doesn't work. Telling them "We'll generate your status report so you don't have to type it" does.

"A lot of what we're doing experimentation is trying to change behavior, if not everything." — Andrew Willingham

Why product managers resist experimentation

Willingham has seen reluctance to A/B testing across multiple companies. He thinks it comes down to ego.

"It's a bit scary as a product manager to go down this road, because again, you're giving up. It's a bit of an ego hit in some ways, cause you're not saying I have the answers. You're saying, cool. Here's my plan for testing. I don't know if I'm right or not." — Andrew Willingham

But admitting you don't know is the right posture. Executives don't have all the answers either. What they want is a plan to test, learn, and apply those learnings.

Willingham also points out that the most valuable experiments are often the ones that fail—especially the ones you expected to be slam dunks.

"The experiments that were the most helpful were the ones that didn't work. Particularly if we expected it to be a slam dunk and it didn't work, I'm like, whoa." — Andrew Willingham

That talent review flop taught him to always validate who the actual user is. It's a lesson he's applied to every product since.

From consumer marketing to AI-powered HR

Willingham's current focus is on applying AI to reimagine people and legal products at Atlassian. His team owns everything from HRIS and payroll to hiring systems and compliance tools. They're building zero-to-one solutions that reduce effort and increase quality.

But he's not testing every pixel. He's testing assumptions. He's testing value propositions. He's testing whether a workflow that's been standard in the industry for decades is actually optimal.

And he's doing it with the same mindset he developed at Amazon: small, iterative changes that compound over time.

"Amazon didn't jump to today's gateway. It's 20 years of experimentation. And along the way, you learn kind of what those pillars are that become enduring truths." — Andrew Willingham

The lesson for product teams: experimentation isn't about running the most tests. It's about learning the fastest. And the fastest way to learn is to test the things you're least confident about, sit with your users, and be willing to admit when you're wrong.

Ready to build an experimentation program that drives real learning, not just test volume? GrowthBook is an open-source feature flagging and A/B testing platform built for product teams who want to ship winning experiments faster. Start for free at growthbook.io.