I’ll divide this idea into three acts, because I want to address two things: first, one of the earliest culture shocks I experienced after leaving academia; and second, what it really means to have statistically significant results.
On Not Learning Statistics
I come from an engineering background. I studied Electronics Engineering at the glorious School of Sciences at the Autonomous University of San Luis Potosí, Mexico. It had many, many, many positive aspects — but also some downsides. Among them, the curriculum was run by rebellious physicists and mathematicians who strongly believed that everything we learned should be rooted in physics or mathematics.
We didn’t have any non-science courses, apart from languages (English and German).
We used to joke about how easy students from other faculties had it, taking simple classes like marketing or sales. (Btw: the moment I had to sell or market something, I realised just how wrong I was to think that — and how misguided engineering education was in reinforcing those beliefs.)
All of this is just to say: I didn’t study statistics at university. There was a course on probability, and one on numerical methods — but not statistics. Statistics was seen as a recipe-book discipline used in the social sciences, and the idea was that if you understood probability well enough, you could figure out statistics on your own.
On Learning Statistics
I had to learn statistics on the go when I started my doctorate in Human-Computer Interaction, a multidisciplinary field tackling technological questions with social science methods.
My dissertation was statistics-intensive, and I had to figure out a lot by myself. Fortunately, my supervisor was a zealot for statistics. Also, maybe those physicists and mathematicians at my undergrad were right, and I had enough maths to manage by myself.
This combination (learning it solo + having a supervisor who passionately published critiques of bad statistical methods in academic papers) had two lasting effects on me:
I would do everything by the book (literally).
I became very strict about definitions and over-testing.
On Using Statistics Outside Academia
The first time I planned an A/B test (they're fantastic tools; I'll write more on them later), I was preparing an ANOVA and writing R scripts to get going. I spoke to others in the organisation running A/B tests, asking whether they were using ANOVA or simplifying to t-tests.
They looked at me like I was speaking Klingon.
They were just plugging numbers into an online calculator and letting it do the magic. (Mathemagics, as an old prof used to say.)
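To be fair, those calculators are usually doing something perfectly sensible under the hood: for a conversion-style A/B test, it typically boils down to a two-proportion test (or something very close to it). A minimal R sketch, with made-up counts that are purely illustrative:

    # Made-up counts: 480 conversions out of 10,000 users in condition A,
    # 530 out of 10,000 in condition B.
    # prop.test runs a two-proportion test (a chi-squared test with
    # continuity correction), roughly what the online calculators report.
    prop.test(x = c(480, 530), n = c(10000, 10000))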
I tried to push back and apply “scientific rigour” — or at least the level needed to publish in a Human-Computer Interaction journal. But I wasn’t making any friends in the organisation. Everyone saw that rigour as wasteful.
“When in Rome, do as the Romans do,” I thought. So I used the online calculator. And you know what? The experiments helped us make the decisions we needed to. Did condition A yield more value than B? Yes or no — and were the results significant? Move on with the product.
At the end of the day, we’re not building scientific knowledge. We’re just testing what delivers more value to users and to the organisation. This mindset made it easier and faster to plan and analyse experiments.
One of my biggest lessons after leaving academia was that I needed to move faster and be more valuable to the organisation. That didn’t mean abandoning my academic soul — but it meant making some concessions. I still try to keep things tidy and principled, but I remind myself: I’m not writing scientific papers — I’m collecting information to make better product decisions.
A Final Note on Statistical Significance
Since I've been talking about A/B test results being "significant", I want to close with what it actually means for a result to be statistically significant.
Statistical significance (the famous p-value) tells you how surprising your observed difference would be if there were really no difference between the conditions, in other words, how plausibly the result could be explained by chance alone.
It does not tell you:
Which condition performed better,
How large the difference was,
Or whether the difference is meaningful in practice.
For example, a huge observed difference (say, Condition A converting at 80% and B at 20%) can still fail to reach significance if you only have a handful of users in each condition. Conversely, with a very large sample, a difference far too small to matter in practice can easily come out statistically significant.
So when a result is not significant, it means the observed difference could plausibly be explained by random variation, not that the two conditions are equal. Statistical significance is what lets you argue the difference is unlikely to be noise, but on its own it says nothing about how big or how useful that difference is.
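To make that concrete, here is a small R sketch with invented numbers showing how much sample size drives significance: a large observed difference on a handful of users fails to reach significance, while a tiny difference on a huge sample clears the bar comfortably.

    # Invented numbers, purely illustrative.

    # Big observed difference, tiny sample: 4/5 (80%) vs 1/5 (20%) conversions.
    # The p-value comes out well above 0.05, so this is not significant,
    # even though the observed gap is huge. (R will also warn that the
    # chi-squared approximation is shaky with counts this small.)
    prop.test(x = c(4, 1), n = c(5, 5))

    # Small observed difference, huge sample: 10.3% vs 10.0% conversion
    # on 100,000 users per condition. The p-value drops below 0.05, so it is
    # statistically significant even though the practical difference is tiny.
    prop.test(x = c(10300, 10000), n = c(100000, 100000))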
PS. After asking ChatGPT to review the grammar of this post, it offered to create a flowchart / rule of thumb for how to handle significance, which I found interesting. So here is the image it generated.
This post was improved with ChatGPT. It helped with grammar and style checks, and also generated the accompanying image.