Everything Is Correlated
78 comments
· August 22, 2025
simsla
This relates to one of my biggest pet peeves.
People interpret "statistically significant" to mean "notable"/"meaningful". I detected a difference, and statistics say that it matters. That's the wrong way to think about things.
Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".
Whether the measured difference is significant in the sense of "meaningful" is a value judgement that we / stakeholders should impose on top of that, usually based on the magnitude of the measured difference, not the statistical significance.
It sounds obvious, but this is one of the most common fallacies I observe in industry and a lot of science.
For example: "This intervention causes an uplift in [metric] with p<0.001. High statistical significance! The uplift: 0.000001%." Meaningful? Probably not.
mustaphah
You're spot on that significant ≠ meaningful effect. But I'd push back slightly on the example. A very low p-value doesn't always imply a meaningful effect, but it's not independent of effect size either. A p-value comes from a test statistic that's basically:
(effect size) / (noise / sqrt(n))
Note that a bigger test statistic means a smaller p-value.
So very low p-values usually come from bigger effects or from very large sample sizes (n). That's why you can technically get p<0.001 with a microscopic effect, but only if you have astronomical sample sizes. In most empirical studies, though, p<0.001 does suggest the effect is going to be large because there are practical limits on the sample size.
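To make that concrete, here is a minimal sketch (my own made-up numbers, a one-sample z-test) of how a microscopic effect of 0.001 standard deviations only becomes "highly significant" at astronomical n:

    # Illustrative only: same formula as above, effect / (noise / sqrt(n))
    from scipy import stats
    import math

    effect, noise = 0.001, 1.0               # a microscopic standardized effect
    for n in (10_000, 1_000_000, 100_000_000):
        z = effect / (noise / math.sqrt(n))  # test statistic
        p = 2 * stats.norm.sf(z)             # two-sided p-value
        print(n, round(z, 1), p)
    # n=1e4 -> z=0.1,  p~0.92
    # n=1e6 -> z=1.0,  p~0.32
    # n=1e8 -> z=10.0, p~1.5e-23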
amelius
https://pmc.ncbi.nlm.nih.gov/articles/PMC3444174/
> Using Effect Size—or Why the P Value Is Not Enough
> Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude –not just, does a treatment affect people, but how much does it affect them.
– Gene V. Glass
kqr
To add nuance, it is not that bad. Given reasonable levels of statistical power, experiments cannot show meaningless effect sizes with statistical significance. Of course, some people design experiments at power levels way beyond what's useful, and this is perhaps even more true when it comes to things where big data is available (like website analytics), but I would argue the problem is the unreasonable power level, rather than a problem with statistical significance itself.
When wielded correctly, statistical significance is a useful guide to which signals are real and worth further investigation, and it filters out meaningless effect sizes.
A bigger problem even when statistical significance is used right is publication bias. If, out of 100 experiments, we only get to see the 7 that were significant, we already have a false:true ratio of 5:2 in the results we see – even though all are presented as true.
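As a back-of-the-envelope sketch of that ratio (my own assumed numbers: 90 true nulls, 10 real effects, alpha = 0.05, 20% power):

    nulls, reals = 90, 10
    alpha, power = 0.05, 0.20
    false_pos = nulls * alpha   # ~4.5 spurious "significant" results expected
    true_pos  = reals * power   # ~2.0 genuine ones expected
    print(false_pos, true_pos)  # roughly 5:2, and only these ~7 get published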
jpcompartir
^
And if we increase N enough we will be able to find these 'good measurements' and 'statistically significant differences' everywhere.
Worse still if we did not agree in advance what hypotheses we were testing, and go looking back through historical data to find 'statistically significant' correlations.
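A quick sketch of that failure mode (a toy example on pure noise): with 20 unrelated variables there are 190 pairs to test, so at alpha = 0.05 we should expect roughly ten "statistically significant" correlations even though nothing is related to anything.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1_000, 20))     # 1,000 samples of 20 noise variables
    hits = 0
    for i in range(20):
        for j in range(i + 1, 20):
            r, p = stats.pearsonr(data[:, i], data[:, j])
            if p < 0.05:
                hits += 1
    print(hits)  # typically around 10 spurious "findings"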
ants_everywhere
Which means that statistical significance is really a measure of whether N is big enough
kqr
This has been known ever since the beginning of frequentist hypothesis testing. Fisher warned us not to place too much emphasis on the p-value he asked us to calculate, specifically because it is mainly a measure of sample size, not clinical significance.
V__
I really like this video [1] from 3blue1brown, where he proposes to think about significance as a way to update a probability. One positive test (or, in this analogy, one study) updates the probability by X%, so you nearly always need more tests (or studies) for a 'meaningful' judgment.
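One way to make that concrete (illustrative numbers only, not from the video): treat a significant result like a diagnostic test with alpha = 0.05 and 80% power, applied to a field where only 10% of tested hypotheses are true.

    prior, alpha, power = 0.10, 0.05, 0.80
    posterior = (power * prior) / (power * prior + alpha * (1 - prior))
    print(round(posterior, 2))  # ~0.64: one significant study is an update, not a verdict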
tomrod
This is sort of the basis of econometrics, as well as a driving thought behind causal inference.
Econometrics cares not only about statistical significance but also about practical/economic usefulness.
Causal inference builds on base statistics and ML, but its strength lies in how it uses design and assumptions to isolate causality. Tools like sensitivity analysis, robustness checks, and falsification tests help assess whether the causal story holds up. My one beef is that these tools still lean heavily on the assumption that the underlying theoretical model is correctly specified. In other words, causal inference helps stress-test assumptions, but it doesn’t always provide a clear way to judge whether one theoretical framework is more valid than another!
ants_everywhere
> Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".
Significance does not tell you this. The p-value can be arbitrarily close to 0 while the probability of the null hypothesis being true is simultaneously arbitrarily close to one.
wat10000
Right. The meaning of a p-value is: in a world where there is no real effect, what is the probability of getting a result at least as extreme as the one you got, purely by random chance? It doesn't directly tell you anything about whether this is such a world or not.
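That definition is easy to check by simulation (toy numbers of my own): generate many experiments in a no-effect world and count how often chance alone produces something at least as extreme as the observed result.

    import numpy as np

    rng = np.random.default_rng(1)
    n, observed_mean = 100, 0.2                     # hypothetical observation
    sims = rng.normal(0.0, 1.0, size=(100_000, n))  # 100k null-world experiments
    p_sim = np.mean(np.abs(sims.mean(axis=1)) >= observed_mean)
    print(p_sim)  # ~0.046, close to the analytic two-sided p-value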
taneq
I’d say rather that “statistical significance” is a measure of surprise. It’s saying “If this default (the null hypothesis) is true, how surprised would I be to make these observations?”
kqr
Maybe you can think of it as saying "should I be surprised" but certainly not "how surprised should I be". The magnitude of the p-value is a function of sample size. It is not an odds ratio for updating your beliefs.
nathan_compton
Really classic "rationalist" style writing: a soup of correct observations about statistical phenomena with chunks of weird political bullshit thrown in here and there. For example: "On a more contemporary note, these theoretical & empirical considerations also throw doubt on concerns about ‘algorithmic bias’ or inferences drawing on ‘protected classes’: not drawing on them may not be desirable, possible, or even meaningful."
This is such a bizarre sentence. The way it's tossed in, not explained in any way, not supported by references, etc. Like I guess the implication being made is something like "because there is a hidden latent variable that determines criminality and we can never escape from correlations with it, it's ok to use "is_black" in our black box model which decides if someone is going to get parole"? Ridiculous. Does this really "throw doubt" on whether we should care about this?
The concerns about how models work are deeper than the statistical challenges of creating or interpreting them. For one thing, all the degrees of freedom we include in our model selection process allow us to construct models which do anything that we want. If we see a parole model which includes "likes_hiphop" as an explanatory variable we ought to ask ourselves who decided that should be there and whether there was an agenda at play beyond "producing the best model possible."
These concerns about everything being correlated actually warrant much more careful thinking about the political ramifications of how and what we choose to model, and based on which variables, because they tell us that in almost any non-trivial case a model is necessarily, at least in part, a political object, almost certainly decorated, consciously or subconsciously, with some conception of how the world is or ought to be explained.
zahlman
> This is such a bizarre sentence. The way it's tossed in, not explained in any way,
It reads naturally in context and is explained by the foregoing text. For example, the phrase "these theoretical & empirical considerations" refers to theoretical and empirical considerations described above. The basic idea is that, because everything correlates with everything else, you can't just look at correlations and infer that they're more than incidental. The political implications are not at all "weird", and follow naturally. The author observes that social scientists build complex models and observe huge amounts of variables, which allows them to find correlations that support their hypothesis; but these correlations, exactly because they can be found everywhere, are not anywhere near as solid evidence as they are presented as being.
> Like I guess the implication being made is something like "because there is a hidden latent variable that determines criminality and we can never escape from correlations with it, it's ok to use "is_black" in our black box model which decides if someone is going to get parole"?
No, not at all. The implication is that we cannot conclude that the black box model actually has an "is_black" variable, even if it is observed to have disparate impact on black people.
nathan_compton
Sorry, but I don't think that is a reasonable read. The phrase "not drawing on them may not be desirable, possible, or even meaningful" is a political statement, except perhaps for "possible," which is just a flat statement that it's hard to separate causal variables from non-causal ones.
Nothing in the statistical observation that variables tend to be correlated suggests we should reject the moral perspective that it's desirable for a model to be based on causal rather than merely correlated variables, even if finding such variables is difficult, or even impossible to do perfectly. And it's certainly also _meaningful_ to do so, even if there are statistical challenges. A model based on "socioeconomic status" has a totally different social meaning than one based on race, even if we cannot fully disentangle the two statistically. He is mixing up statistical, social, moral, and even philosophical questions in a way which is, in my opinion, misleading.
jeremyjh
Or maybe your own announced bias against “rationalists” is affecting your reading of this. I agree with GP's interpretation.
pcrh
"Rationalists" do seem to have a fetish for ranking people and groups of people. Oddly enough, they frequently use poorly performed studies and under-powered data to reach their conclusions about genetics and IQ especially.
ml-anon
Yes, this is gwern to a "T". Overwhelm with an r/iamverysmart screed whilst insidiously inserting baseless speculation and opinion as fact, as if the references provided cover those too. Weirdly, the scaling/AI community loves him.
senko
The article missed the chance to include the quote from that standard compendium of information and wisdom, The Hitchhiker's Guide to the Galaxy:
> Since every piece of matter in the Universe is in some way affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation — every sun, every planet, their orbits, their composition and their economic and social history from, say, one small piece of fairy cake.
euroderf
Particles do not suffer from predestination, do they?
sayamqazi
Wouldn't you need the T_zero configuration of the universe for this to work?
Given different T_zero configurations of matter and energy, T_current would be different, and there are many pathways that could lead to the same physical configuration (positions + energies etc.) with different (Universe minus cake) configurations.
Also, we are assuming there are no non-deterministic processes happening at all.
senko
I am assuming integrating over all possible configurations would be a component of The Total Perspective Vortex.
After all, Feynman showed this is in principle possible, even with local nondeterminism.
(this being a text medium with a high probability of another commenter misunderstanding my intent, I must end this with a note that I am, of course, BSing :)
eru
> Wouldn't you need the T_zero configuration of the universe for this to work?
Why? We learn about the past by looking at the present all the time. We also learn about the future by looking at the present.
> Also, we are assuming there are no non-deterministic processes happening at all.
Depends on the kind of non-determinism. If there's randomness, you 'just' deal with probability distributions instead. Since you have measurement error, you need to do that anyway.
There are other forms of non-determinism, of course.
psychoslave
> We learn about the past by looking at the present all the time. We also learn about the future by looking at the present.
We infer the past, based partly on material evidence that we can only subjectively and partially become acquainted with, through thick cultural biases. And the material evidence must not stray too far from our already-integrated internal narrative, otherwise we will ignore it or actively fight it.
The future is pure phantasm, bound only by our imagination and by what we take to be the unchallengeable fundamentals of what the world allows, according to our inner model of it.
At least, that's one possible interpretation of these thoughts when attention is focused on the present.
prox
In Buddhism we have dependent origination : https://en.wikipedia.org/wiki/Prat%C4%ABtyasamutp%C4%81da
lioeters
Also the concept of implicate order, proposed by the theoretical physicist David Bohm.
> Bohm employed the hologram as a means of characterising implicate order, noting that each region of a photographic plate in which a hologram is observable contains within it the whole three-dimensional image, which can be viewed from a range of perspectives.
> That is, each region contains a whole and undivided image.
> "There is the germ of a new notion of order here. This order is not to be understood solely in terms of a regular arrangement of objects (e.g., in rows) or as a regular arrangement of events (e.g., in a series). Rather, a total order is contained, in some implicit sense, in each region of space and time."
> "Now, the word 'implicit' is based on the verb 'to implicate'. This means 'to fold inward' ... so we may be led to explore the notion that in some sense each region contains a total structure 'enfolded' within it."
apples_oranges
People didn't always use statistics to discover truths about the world.
Statistics, once developed, just happened to be a useful method. But given the abuse of those methods, and the proliferation of stupidity disguised as intelligence, it's always fitting to question them, this time with this observation about correlation noise.
Logic and fundamental knowledge about domains - you need those first. Just counting things, without understanding them in at least one or two other ways, is a tempting invitation to misleading conclusions.
kqr
> People didn't always use statistics to discover truths about the world.
And they were much, much worse off for it. Logic does not let you learn anything new. All logic allows you to do is restate what you already know. Fundamental knowledge comes from experience or experiments, which need to be interpreted through a statistical lens because observations are never perfect.
Before statistics, our alternatives for understanding the world were (a) rich people sitting down and thinking deeply about how things could be, (b) charismatic people standing up and giving sermons on how they would like things to be, or (c) clever people guessing things right every now and then.
With statistics, we have to a large degree mechanised the process of learning how the world works, and anyone sensible can participate, and they can know with reasonable certainty whether they are right or wrong. It was impossible to prove a philosopher or a clergyman wrong!
That said, I think I agree with your overall point. One of the strengths of statistical reasoning is what's sometimes called intercomparison, the fact that we can draw conclusions from differences between processes without understanding anything about those processes. This is also a weakness because it makes it easy to accidentally or intentionally manipulate results.
mnky9800n
There is a quote from George Lucas where he talks about how, when new things come into a society, people tend to overdo it.
apples_oranges
Nice, yeah. With many movies one has to ask: What's the point? Especially all Disney Star Wars..
Evidlo
This is such a massive article. I wish I had the ability to grind out treatises like that. Looking at other content on the guy's website, he must be like a machine.
kqr
IIRC Gwern lives extremely frugally somewhere remote and is thus able to spend a lot of time on private research.
lazyasciiart
That and early bitcoin adoption. There’s a short bio somewhere on the site.
tux3
IIRC people funded moving gwern to the bay not too long ago.
aswegs8
I wish I were even able to read things like that.
pas
lots of time, many iterations, affinity for the hard questions, some expertise in research (and Haskell). oh, and also it helps if someone is funding your little endeavor :)
tmulc18
gwern is goated
pcrh
This is why experimental science is different from observational studies.
Statistical analyses provide a reason to believe one hypothesis over another, but any scientist will extend that with an experimental approach.
Most of the examples given in this blog post refer to medical, sociological or behavioral studies, where properly controlled experiments are hard to perform, and as such are frequently under-powered to reveal true cause-effect associations.
st-keller
„This renders the meaning of significance-testing unclear; it is calculating precisely the odds of the data under scenarios known a priori to be false.“
I cannot see the problem in that. To get to meaningful results we often calculate with simplified models - which are known to be false in a strict sense. We use Newton's laws - we analyze electric networks based on simplifications - a bank year used to be 360 days! Works well.
What did I miss?
bjornsing
The problem is basically that you can always buy a significant result with money (a large enough N always leads to a "significant" result). That's a serious issue if you see research as a pursuit of truth.
syntacticsalt
Reporting effect size mitigates this problem. If observed effect size is too small, its statistical significance isn't viewed as meaningful.
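A sketch of what that looks like in practice (assumed numbers): with a million observations per group, a difference of 0.005 standard deviations clears p < 0.001, but the effect size estimate and its confidence interval make the triviality obvious.

    import math
    from scipy import stats

    n, d = 1_000_000, 0.005            # per-group n, standardized mean difference
    se = math.sqrt(2 / n)              # approximate standard error of d
    p = 2 * stats.norm.sf(d / se)      # two-sided p-value
    ci = (d - 1.96 * se, d + 1.96 * se)
    print(p, ci)                       # p ~ 4e-4, CI ~ (0.002, 0.008): real but negligible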
bjornsing
Sure (and of course). But did you see the effect size histogram in the OP?
thyristan
There is a known maximum error introduced by those simplifications. Put the other way around, Einstein is a refinement of Newton. Special relativity converges towards Newtonian motion for low speeds.
You didn't really miss anything. The article is incomplete, and wrongly suggests that something like "false" even exists in statistics. But really something is only false "with an x% probability of it actually being true nonetheless". Meaning that you have to "statistic harder" if you want to get x down. Usually the best way to do that is to increase the number of tries/samples N. What the article gets completely wrong is that for sufficiently large N, you don't have to care anymore, and might as well use false/true as absolutes, because you pass the threshold of "will happen once within the lifetime of a bazillion universes" or something.
Problem is, of course, that lots and lots of statistics are done with a low N. The social sciences, medicine, and economics are necessarily always in the very-low-N range, and therefore always have problematic statistics. And they try to "statistic harder" without being able to increase N, thereby just massaging their numbers enough to prove a desired conclusion. Or they increase N a little, claiming to have escaped the low-N problem.
syntacticsalt
A frequentist interpretation of inference assumes parameters have fixed, but unknown values. In this paradigm, it is sensible to speak of the statement "this parameter's value is zero" as either true or false.
I do not think it is accurate to portray the author as someone who does not understand asymptotic statistics.
thyristan
> it is sensible to speak of the statement "this parameter's value is zero" as either true or false.
Nope. The correct way is rather something like "the measurements/polls/statistics x ± ε are consistent with this parameter's true value being zero", where x is your measured value and ε is some measurement error, accuracy, or statistical deviation. x will never really be zero, but zero can be within the interval [x - ε; x + ε].
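For example (made-up numbers): a measurement of x = 0.3 with ε = 0.5 is consistent with zero, even though the measured value itself isn't zero.

    x, eps = 0.3, 0.5
    print((x - eps) <= 0.0 <= (x + eps))  # True: zero lies inside [x - ε, x + ε]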
PeterStuer
Back when I wrote a loan repayment calculator, there were 47 different common ways to 'day count' (used in calculating payments for incomplete repayment periods; e.g., for monthly payments, what is the 1st-13th of Aug 2025 as a fraction of Aug 2025?).
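Two of those conventions, sketched from memory (the exact rules vary by market, so treat this as an approximation rather than a reference implementation):

    from datetime import date

    start, end = date(2025, 8, 1), date(2025, 8, 13)

    # Actual/actual: real elapsed days over real days in the month
    actual = (end - start).days / 31                 # 12/31 ~ 0.387

    # 30/360 (bond basis, ignoring end-of-month adjustments): months count as 30 days
    days_360 = (360 * (end.year - start.year)
                + 30 * (end.month - start.month)
                + (end.day - start.day))
    thirty_360 = days_360 / 30                       # 12/30 = 0.4
    print(actual, thirty_360)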
whyever
It's a quantitative problem. How big is the error introduced by the simplification?
dang
Related. Others?
Everything Is Correlated - https://news.ycombinator.com/item?id=19797844 - May 2019 (53 comments)
psychoslave
Looks like an impressive thorough piece of investigation. Well done.
That said, holistic supposition can certainly be traced back as far as the dawn of writing. Here the focus on the more modern/contemporary era is legitimate, to keep the scope delimited to a more specific concern, but it obscures this fact a bit. Maybe it's already acknowledged in the document; I haven't read it all yet.
justonceokay
Statistical correlations are important to establish, but they are the easiest and least useful part of the research. Creating theories as to “why” and “how” these correlations exist is what advances our knowledge.
I read a lot of papers that painstakingly show a correlation in the data, but then their theory about the correlation is a complete non sequitur.
endymion-light
The rest of the page has amazing design, but there's just something about the graphs switching from dark to light that flashbangs my eyes really badly - I think it's the sudden light!
2rsf
Did they quote https://www.tylervigen.com/spurious-correlations ?
ezomode
Who should quote who? The article is from 2014.