You do your discovery, interview your users, align the user needs with the business and propose a solution. But then when you start implementing, you discover there are two—or more—ways to implement the solution.
Other times, there's a feature that's already implemented, but you want to improve it. Maybe a HiPPO had a great idea, someone saw a different implementation in another app, or a bit of ayahuasca helped someone think of a different solution. Whatever the source, you want to see if this new approach gets better results.
That's when an A/B test comes in handy. You use it to compare whether the solution should be implementation A or B.
In theory, you could compare A, B, C & D, but the stats get more complicated, so let's stick with only two conditions.
Changing or improving an existing implementation
You need a control group and a variant group. Yes, according to the stats you should state a null hypothesis, and I know in my previous post I made a big deal about following the stats. But I don't think you need that level of formality when doing A/B tests for product work. You still need a hypothesis: the new implementation (the variant) will perform better than the old one. It's not a proper null hypothesis, but let's be pragmatic sometimes.
Divide your groups randomly—usually Firebase can do that, but there are other tools that can do the same.
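Under the hood, random assignment is often just hash-based bucketing on a stable user ID. Here's a minimal Python sketch of that idea; the experiment name and the 50/50 split are assumptions for illustration, and this is not how Firebase implements it.

```python
# Minimal sketch of deterministic 50/50 assignment, assuming you have a
# stable user_id per user. Hashing user_id together with the experiment
# name means the same user always lands in the same group for this test,
# but can land in different groups across different experiments.
import hashlib

def assign_group(user_id: str, experiment: str = "checkout_v2") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "variant" if bucket < 50 else "control"

print(assign_group("user-12345"))  # always the same answer for this user
```

The nice property is that assignment is stable: a user who comes back next week sees the same variant without you having to store anything.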
You usually need a few thousand users to reach statistical significance. And remember, significance doesn't mean your expected result happened; it means the result is unlikely to be due to chance alone.
The important thing when you evaluate is that your sample size includes all the people who belong to each cohort, not only those who started your funnel.
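For intuition, here's a rough sketch of that kind of significance check: a two-proportion z-test on made-up conversion counts, with everyone assigned to each cohort in the denominator. The numbers and the 0.05 threshold are illustrative assumptions, not real data.

```python
# Two-proportion z-test on made-up numbers: did the variant's conversion
# rate beat the control's by more than chance would explain?
from math import sqrt
from scipy.stats import norm

# Denominators are everyone assigned to each cohort, not just funnel starters.
control_conversions, control_users = 410, 5000
variant_conversions, variant_users = 480, 5000

p_control = control_conversions / control_users
p_variant = variant_conversions / variant_users
pooled = (control_conversions + variant_conversions) / (control_users + variant_users)
standard_error = sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / variant_users))
z = (p_variant - p_control) / standard_error
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f"control {p_control:.1%}, variant {p_variant:.1%}, p-value {p_value:.3f}")
# A p-value below 0.05 says the gap is unlikely to be chance alone;
# it doesn't say the variant is the improvement you hoped for.
```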
You might not test on your whole user base. Maybe you want to focus on a subset—either country-based, or people who have done something specific in the past. Or maybe you have a huge user base and want to limit risk by testing only a subset first. You know best what you're testing and what you're building.
Completely new implementation
When you're implementing a completely new solution with nothing to compare to, you can do two things:
1. Introduce two different solutions and compare them. No control group needed.
2. Introduce one solution and compare its impact on your broader product metrics. For example, you might add a completely new feature and find that it drops your overall user retention.
This second approach requires breaking your test into two phases. First, test for impact on the product. Once that's done, run another test where you compare A with B, with A becoming your control group.
Beyond UI testing
A/B tests are most common with UI-based implementations. If you're a ChatGPT user, I'm sure you've seen it ask you to rate which response you prefer.
But they can also be used for backend features—though that's less common. You might test whether one algorithm is faster than another, or compare different recommendation engines.
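As a toy illustration, a backend A/B test can be as simple as routing a slice of traffic to the new code path and logging how each arm performs. The function names, traffic split, and fake latencies below are all assumptions, not a real recommendation engine.

```python
# Hypothetical backend A/B test: send 10% of requests to a candidate
# recommendation function and record latency per arm.
import random
import time
from collections import defaultdict

def recommend_v1(user_id: str) -> list[str]:
    time.sleep(0.01)   # stand-in for the current algorithm's work
    return ["item-a", "item-b"]

def recommend_v2(user_id: str) -> list[str]:
    time.sleep(0.005)  # stand-in for the candidate algorithm's work
    return ["item-c", "item-a"]

latencies = defaultdict(list)

def handle_request(user_id: str) -> list[str]:
    arm = "v2" if random.random() < 0.1 else "v1"  # 10% of traffic to the variant
    start = time.perf_counter()
    result = recommend_v2(user_id) if arm == "v2" else recommend_v1(user_id)
    latencies[arm].append(time.perf_counter() - start)
    return result

for i in range(200):
    handle_request(f"user-{i}")

for arm, values in latencies.items():
    print(arm, f"avg latency {sum(values) / len(values) * 1000:.1f} ms")
```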
Whatever you do, get your analytics right first. Make sure you're tracking data properly—if you have duplicate events or inconsistent naming, your results will be meaningless.
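For instance, here's a tiny sketch of the kind of cleanup that matters before you trust the numbers: deduplicating retried events and normalizing event names. The event shape and names are made up for illustration.

```python
# A retried event counted twice, or the same action logged under two names,
# silently skews whichever arm it lands in. Dedupe and normalize first.
raw_events = [
    {"event_id": "e1", "user_id": "u1", "name": "purchase_completed"},
    {"event_id": "e1", "user_id": "u1", "name": "purchase_completed"},  # client retry, duplicate
    {"event_id": "e2", "user_id": "u2", "name": "purchaseCompleted"},   # inconsistent naming
]

CANONICAL_NAMES = {"purchaseCompleted": "purchase_completed"}  # map legacy names to one spelling

seen_ids = set()
clean_events = []
for event in raw_events:
    if event["event_id"] in seen_ids:
        continue  # drop duplicates before they inflate conversion counts
    seen_ids.add(event["event_id"])
    event["name"] = CANONICAL_NAMES.get(event["name"], event["name"])
    clean_events.append(event)

print(f"{len(raw_events)} raw events -> {len(clean_events)} clean events")
```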
A/B testing is the first step toward becoming data-oriented
A/B testing is a great tool to evaluate whether you're doing the right thing overall. If all your tests keep showing that the control group is better, then maybe you need to look at your discovery process. Or it's a good way of telling your boss that you can't improve the product on great ideas alone.
A/B testing is one of the best tools you can have in your toolkit to become a data-oriented product manager.
This post was improved for grammar, typos, and flow with the help of Claude Sonnet. No picture this time because ChatGPT decided it cannot generate images today.