P-Hacking in Startups
10 comments · June 18, 2025
simonw
On the one hand, this is a very nicely presented explanation of how to run statistically significant A/B style tests.
It's worth emphasizing, though, that if your startup hasn't achieved product-market fit yet, this kind of thing is a huge waste of time! Build features, see if people use them.
andy99
> Imagine you're a product manager trying to optimize your website’s dashboard. Your goal is to increase user signups.
This would be Series B or later, right? I don't really feel like it's a core startup behavior.
shoo
related book: Trustworthy Online Controlled Experiments
derektank
I don't have any first hand experience with customer facing startups, SaaS or otherwise. How common is rigorous testing in the first place?
dayjah
As you scale it improves. At a small scale you more often just ask users, and they'll give you invaluable information. As you scale you abstract folks into buckets. At about 1 million MAU I've found A/B testing and p-values start to make sense.
bcyn
Great read, thanks! Could you dive a little deeper into example 2 & pre-registration? Conceptually I understand how the probability of false positives increases with the number of variants.
But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
PollardsRho
If you have many metrics that could possibly be construed as "this was what we were trying to improve", that's many different possibilities for random variation to give you a false positive. If you're explicit at the start of an experiment that you're considering only a single metric a success, it turns any other results you get into "hmm, this is an interesting pattern that merits further exploration" and not "this is a significant result that confirms whatever I thought at the beginning."
It's basically a variation on the multiple comparisons problem, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to fall below an uncorrected p = 0.05 by random chance.
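To put a number on that (a minimal sketch, not from the post; the 20-metric count and the uniform-null shortcut are my own assumptions): under the null hypothesis each metric's p-value is uniform on [0, 1], so if any of 20 metrics would count as a win, the chance that at least one clears p < 0.05 by luck alone is 1 − 0.95^20 ≈ 64%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_metrics = 100_000, 20

# Under the null hypothesis every p-value is uniform on [0, 1], so any
# draw below 0.05 is a false positive for that metric.
p_values = rng.uniform(size=(n_experiments, n_metrics))
any_significant = (p_values < 0.05).any(axis=1)

print(any_significant.mean())   # ~0.64 in simulation
print(1 - 0.95 ** n_metrics)    # 0.6415..., the analytic answer
```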
noodletheworld
There are many resources that will explain this rigorously if you search for the term “p-hacking”.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods, eg. Bonferroni correction (divide p by k) exist, but are controversial (1).
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
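For concreteness, a minimal sketch of the Bonferroni idea mentioned above (the helper name and the p-values are made up for illustration):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Which tests survive a family-wise threshold of alpha / k?"""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

# 20 metrics, two of which look "significant" at the naive 0.05 cutoff.
p_values = [0.04, 0.02] + [0.50] * 18
print(bonferroni_significant(p_values))
# Both fail the corrected threshold of 0.05 / 20 = 0.0025, so neither
# would be declared a win without a follow-up experiment.
```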
irq-1
1 − (1 − 0.05)^9 = 64% (small mistake; the exponent should be 20: 1 − 0.95^20 ≈ 0.64, whereas 1 − 0.95^9 ≈ 0.37)
> This isn't academic nit-picking. It's how medical research works when lives are on the line. Your startup's growth deserves the same rigor.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost people their lives when they get something wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited the full 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.
Note: I'm not advocating stopping tests as soon as something shows a trend in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
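That flaw is easy to demonstrate with a quick simulation (my own hypothetical parameters, not the post's proposal): if you run a test after every batch of users and stop the moment p dips below 0.05, you'll "detect" an effect far more often than 5% of the time even when both arms are identical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2_000
batch_size, n_batches = 200, 20     # peek after every 200 users per arm

stopped_early = 0
for _ in range(n_experiments):
    # Both arms come from the same distribution, so any early stop is a
    # false positive.
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_batches):
        a = np.concatenate([a, rng.normal(0.0, 1.0, batch_size)])
        b = np.concatenate([b, rng.normal(0.0, 1.0, batch_size)])
        if stats.ttest_ind(a, b).pvalue < 0.05:   # "it's trending, ship it"
            stopped_early += 1
            break

print(stopped_early / n_experiments)  # roughly 0.2-0.3, not the nominal 0.05
```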
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right direction, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.