
The Illusion of Causality in Charts

gcanyon

The article seems more about the underlying causality, and less about the charts' specific role in misleading. To pick one example, the scatterplot chart isn't misleading: it's just a humble chart doing exactly what it's supposed to do: present some data in a way that makes clear the relationship (not necessarily causality!) between saturated fat consumption and heart disease.

The underlying issue (which the article discusses to some extent) is how confounding factors can make the data misleading/allow the data to be misinterpreted.
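A minimal simulation of that confounding effect (not from the article; all numbers made up, and the variable names are just stand-in labels): a hidden third factor drives both quantities, so they correlate strongly even though neither causes the other.

    # Sketch with made-up numbers: a confounder z drives both x and y,
    # so x and y correlate strongly even though neither causes the other.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    z = rng.normal(size=n)            # hidden confounder (e.g. some lifestyle factor)
    x = 2.0 * z + rng.normal(size=n)  # "saturated fat consumption": driven by z plus noise
    y = 1.5 * z + rng.normal(size=n)  # "heart disease": driven by z plus noise

    print(np.corrcoef(x, y)[0, 1])    # ~0.74, with no x -> y or y -> x effect at all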

To discuss "The Illusion of Causality in Charts" I'd want to consider how one chart type vs. another is more susceptible to misinterpretation/more misleading than another. I don't know if that's actually true -- I haven't worked up some examples to check -- but that's what I was hoping for here.

hammock

> the scatterplot chart isn't misleading

Even setting aside the data (which you rightly point out), you are forced to choose what to plot on x and y, which by convention will communicate IV and DV respectively, whether you like it or not.

rcxdude

True. Arguably this is a harmful convention: with any scatter plot you should consider that the axes could be flipped.
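One way to see that the x/y choice is not a neutral relabeling (a sketch with simulated data, not from the thread): regressing y on x and regressing x on y give different fitted lines unless the correlation is perfect.

    # Sketch with simulated data: the fitted slope depends on which variable
    # you treat as the "x", so flipping the axes changes the story the plot tells.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1_000)
    y = 0.5 * x + rng.normal(size=1_000)      # moderate, noisy relationship

    slope_y_on_x = np.polyfit(x, y, 1)[0]     # ~0.5
    slope_x_on_y = np.polyfit(y, x, 1)[0]     # ~0.4, not 1 / 0.5 = 2.0

    print(slope_y_on_x, slope_x_on_y)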

the-mitr

You can check out the work of Howard Wainer in this regard.

Graphic Discovery, Visual Revelations, etc.

https://en.m.wikipedia.org/wiki/Howard_Wainer

melagonster

A famous example is the advice that a bar chart is always better than a pie chart (see the pie chart page on the ggplot website).

nwlotz

One of the best things I was forced to do in high school was read "How to Lie with Statistics" by Darrell Huff. The book's a bit dated and oversimplified in parts, but it gave me a healthy skepticism that served me well in college and beyond.

I think the issues described in this piece, and by other comments, are going to get much worse with the (dis)information overload AI can provide. "Hey AI, plot thing I don't like A with bad outcome B, and scale the axes so they look heavily correlated". Then it's picked up on social media, a clout-chasing public official sees it, and now it's used to make policy.

hammock

It helps to internalize the concept that all statistics (visualizations, but also literally any statistic with an element of organization) are narrative, in a "the medium is the message" kind of way.

Sometimes you are choosing the narrative consciously (I created this chart to tell a story), sometimes you are choosing it unconsciously (I just want to make a scatter plot and see what it shows, but you chose the x and y to plot, and you chose a scatter plot over some other framework), and sometimes it is chosen for you (chart defaults, for example, or north is up on a map).

And it’s not just charts. Statistics on the whole exist to organize raw data. The very act of introducing organization means you have a scheme, framework, or lens with which to do so. You have to accept that and become conscious of it.

You cannot do something as simple as report an average without choosing which data to include and which type of average to use, or plot a histogram without choosing the bin sizes and, again, the data to include.
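A toy example of that, with made-up salary figures: the same ten values support several different "averages" and two very different-looking histograms.

    # Sketch with made-up salary data (in thousands): each choice of "average"
    # and of histogram binning tells a different story about the same numbers.
    import numpy as np

    salaries = np.array([30, 32, 35, 38, 40, 42, 45, 50, 60, 400])  # one big outlier

    print("mean:   ", salaries.mean())         # 77.2  -- pulled up by the outlier
    print("median: ", np.median(salaries))     # 41.0  -- the "typical" value
    print("trimmed:", salaries[1:-1].mean())   # 42.75 -- min and max dropped

    # Same data, two binnings: coarse bins bury the outlier, fine bins expose it.
    print(np.histogram(salaries, bins=2)[0])   # [9 1]
    print(np.histogram(salaries, bins=10)[0])  # almost everything in the first bin, one lone count at the end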

This is all to say nothing of the way the data was produced in the first place. (Separate topic)

djoldman

This is not a problem with charts, it is a problem with the interpretation of charts.

1. In general, humans are not trained to be skeptical of data visualizations.

2. Humans are hard-wired to find and act on patterns, illusory or not, at great expense.

Incidentally, I've found that avoiding the words "causes," "causality," and "causation" is almost always the right path or at the least should be the rule as opposed to the exception. In my experience, they rarely clarify and are almost always overreach.

ninetyninenine

It's not a problem of interpretation or visualization or charts. People are talking about it as if it's deception or interpretation but the problem is deeper than this.

It's a fundamental problem of reality.

The nature of reality itself prevents us from determining causality from observation, this includes looking at a chart.

If you observe two variables, whether they correlate or not, there is NO way to determine whether one causes the other through observation alone. Any causal conclusion drawn from observation alone is in actuality only assumed. Note the key phrase here: "through observation alone."

In order to determine if one thing "causes" another thing, you have to insert yourself into the experiment. It needs to go beyond observation.

The experimenter needs to turn off the cause and turn on the cause in a random pattern and see whether that changes the correlation. Only through this can one determine causation. If you don't agree with this, think about it a bit.

Also note that this is how they approve and validate medicine: they have to prove that the medicine/procedure "causes" a better outcome, and the only way to do this is to make giving and withholding the medicine part of the trial.
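A simulation of that difference, with made-up numbers: a "drug" that does nothing, but which sicker people are more likely to take. The observational comparison makes it look harmful; random assignment recovers the truth.

    # Sketch: observation alone vs. randomized intervention, with made-up numbers.
    # The drug has zero real effect; a hidden health factor drives both who takes
    # it and the outcome.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    health = rng.normal(size=n)                              # hidden confounder

    # Observational world: the less healthy you are, the more likely you take the drug.
    takes_drug = rng.random(n) < 1.0 / (1.0 + np.exp(health))
    outcome = health + rng.normal(size=n)                    # outcome depends only on health

    obs_diff = outcome[takes_drug].mean() - outcome[~takes_drug].mean()
    print("observational 'effect':", round(obs_diff, 2))     # clearly negative: drug looks harmful

    # Randomized trial: the experimenter decides who gets the drug, breaking the
    # link between treatment and health.
    assigned = rng.random(n) < 0.5
    rct_diff = outcome[assigned].mean() - outcome[~assigned].mean()
    print("randomized 'effect':   ", round(rct_diff, 2))     # ~0.0: no causal effect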

rcxdude

I'd say this is generally true, but in practice there are a decent number of cases where some reasoning can give you a pretty good confidence one way or another. Mainly by considering what other correlations exist and what causal relationships are plausible (because not all of them are).

(I say this coming from an engineering context, where e.g. you can pretty confidently say that your sensor isn't affecting the weather but vice-versa is plausible)

ninetyninenine

This is not just generally true; it is fundamentally true. It is a fundamental facet of reality.

In practice it’s hard to determine causality, so people make assumptions. Most conclusions are like that. I said this in the original post: conclusions from observation alone must rest on assumptions, which is fine given available resources. If you find that people who smoke weed have lower IQs, you can conclude that weed lowers IQ, assuming all weed smokers had average IQs before smoking, and that is fine.

I’m sure you’ve seen many causal conclusions retracted because of incorrect assumptions, so it is in general a very unreliable method.

And that’s why in medicine they strictly have to do causation-based testing: they can’t afford a conclusion based on an incorrect assumption.

djoldman

I find the definition of causality that places it squarely in the realm of philosophy to be a dead end, or perhaps a circle with no end, objective, or goal.

"What does it mean that something is caused by something else?" At the end of it all, what matters is how it's used in the real world. Personally I find the philosophical discussion to be tiresome.

In law, "to cause" is pretty strict: "but for" A, B would not exist or have happened. Therefore A caused B. That's one version. Other people and regimes have theirs.

This is why it's something I try to avoid.

In any case, descriptions of distributions are more comprehensive and avoid conclusions.

ninetyninenine

I'm not talking about philosophy. Clinical trials for medicine use this technique to determine causality. I'm talking about something very practical and well known.

It is literally the basis for medicine. We literally have to have a "hand in the experiment" in clinical trials, withholding and giving medicine in order to establish that the medicine "causes" a "cure". Clinical trials are, by design, about more than just observation.

Likely, you just don't understand what I was saying.

justonceokay

A pet issue I have that is in line with the “illusions” in the article is what I might call the “bound by statistics” fallacy.

The shape of it is that there is a statistic about a population, and then that statistic is used to describe a member of that population. For example, a news story that starts with “70% of restaurants fail in their first year, so it’s surprising that new restaurant Pete’s Pizza is opening their third location!”

But it’s only surprising if you know absolutely nothing about Pete and his business. Pete’s a smart guy. He’s running without debt and has community and government ties. His aunt ran a pizza business and gave him her recipes.

In a Bayesian way of thinking, the newscaster’s statement only makes sense if the only prior they have is the average success rate of restaurants. But that is an admission that they know nothing about the actual specifics of the current situation, or the person they are talking about. Additionally, there is zero causal relationship from group statistics to individual outcomes; the causal relationship goes the other way. Pete’s success will slightly change that 70% metric, but the 70% metric never bound Pete to be “likely to fail”.
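A back-of-the-envelope version of that update, with entirely made-up likelihood ratios for what we know about Pete (and crudely assuming the pieces of evidence are independent):

    # Sketch with made-up numbers: the 30% base rate is only the prior;
    # Pete-specific evidence moves it, here via likelihood ratios in odds form.
    base_rate_success = 0.30                     # "70% of restaurants fail"
    prior_odds = base_rate_success / (1 - base_rate_success)

    evidence_lrs = {                             # hypothetical likelihood ratios:
        "no debt": 2.0,                          # how much more common each fact is
        "family recipes and experience": 1.5,    # among restaurants that succeed
        "community and government ties": 1.5,
    }

    posterior_odds = prior_odds
    for lr in evidence_lrs.values():             # crude: treats the evidence as independent
        posterior_odds *= lr

    posterior = posterior_odds / (1 + posterior_odds)
    print(round(posterior, 2))                   # ~0.66: same base rate, very different Pete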

Other places I see the “bound by statistics” problem is in healthcare, criminal proceedings, racist rhetoric, and identity politics.

skybrian

People sometimes talk about this as “taking the outside view” or “reference class forecasting.” [1] It doesn’t work when there are important differences between the case being considered and other members of the reference class. Nationwide statistics are especially zoomed-out and there are going to be a lot of people in the same country who are quite different from you. Worldwide statistics are even worse.

It doesn’t mean the statistics are wrong, though. If there is a 70% chance of failure, there’s also a 30% chance of success. But it’s subjective: use a different reference class and you’ll get a different number.

The opposite problem is also common: assuming that “this time it’s different” without considering the reasons why others have failed.

The general issue is overconfidence and failure to consider alternative scenarios. The future isn’t known to us and it’s easy to fool yourself into thinking it is known.

[1] https://en.m.wikipedia.org/wiki/Reference_class_forecasting

yusina

I agree with your description, but the pizza place case is even simpler: statistics don't guarantee the properties of any single future sample. 70% fail in the first year, so 30% don't. Why would it be surprising to see one that didn't? It would be surprising to see none that didn't fail. So it's expected to see lots that don't fail, Pete's being one of them.

zmgsabst

It’s not even surprising without knowing anything about Pete: the newspaper isn’t going to publish the many that failed, so its own selection bias is the dominant effect, even if we take the probability of the group to be the probability of each individual, e.g. rolling dice (“1 in 6 people rolls a 4!”).

Lots of them opened, of them 70% failed, and one who didn’t happened to be named Pete.

No more interesting than “Pete rolled a 4!” even though 83% of people don’t.

skybrian

If a newspaper only publishes surprising results, but it’s unsurprising when they appear in the newspaper, then you’ve set up a paradox: a set that only contains nonmembers.

I don’t think it’s valid to define “surprising” in such a self-referential way. When something unusual appears in the news, that doesn’t make it common. The probabilities are different before and after applying a filter.

yusina

If 99.9% fail and you see one that didn't, then would it be surprising that it didn't? No! It's not 100%, so there must be some example. For that one it's not surprising.

More precisely, it's not surprising that one exists. It may be surprising that this particular one survived, just as it would be surprising if my neighbor won the lottery next week. But it's likely that somebody will, so if somebody has won, it won't be a surprise that somebody did.
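The arithmetic behind that, for anyone who wants it (failure rate made extreme on purpose):

    # Sketch: even at a 99.9% failure rate, seeing at least one survivor among
    # many openings is unsurprising once there are enough openings.
    fail_rate = 0.999
    for n_openings in (100, 1_000, 10_000):
        p_at_least_one_survivor = 1 - fail_rate ** n_openings
        print(n_openings, round(p_at_least_one_survivor, 3))
    # 100 -> 0.095, 1000 -> 0.632, 10000 -> ~1.0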

nemomarx

It doesn't make it a common outcome overall, but it makes it a common outcome for it to appear in the newspaper, right? It's just different meanings of "common".

zmgsabst

I didn’t say that, nor use any self-reference.

> When something unusual appears in the news, that doesn’t make it common.

This in particular isn’t even close to what I said, which was: rare events can be unsurprising in large datasets — as is the case with both dice rolls and restaurants succeeding.

steveBK123

People are even worse with conditional probabilities.
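The classic illustration of that, with made-up numbers (not from this thread): a positive result from a 99%-sensitive test for a 1-in-1000 condition still usually means no condition, because P(condition | positive) is not P(positive | condition).

    # Sketch with made-up numbers: base rates dominate the conditional probability.
    prevalence = 0.001          # 1 in 1000 has the condition
    sensitivity = 0.99          # P(positive | condition)
    false_positive_rate = 0.01  # P(positive | no condition)

    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    p_condition_given_positive = sensitivity * prevalence / p_positive
    print(round(p_condition_given_positive, 3))   # ~0.09, nowhere near 0.99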

NoTranslationL

This is a tough problem. I’m working on an app called Reflect [1] that lets you analyze your life’s data and the temptation to draw conclusions from charts and correlations is strong. We added an experiments feature that will let you form hypotheses and it will even flag confounding variables if you track other metrics during your experiments. Still trying to make it even better to avoid drawing false conclusions.

[1] https://apps.apple.com/us/app/reflect-track-anything/id64638...

singularity2001

What's very fascinating in general is that causality is a difficult mathematical concept which only a tiny fraction of the population learns, yet everyone talks about it and "uses it."

We do have a pretty good intuition for it, but if you look at the details and ask people what the difference between correlation and causality is, and how you distinguish them, things get rabbit-holey pretty quick.

qixv

You know, everyone that confuses correlation with causation ends up dying.