
Stop using the elbow criterion for k-means

whatshisface

The elbow method has a simple theoretical justification. If the data actually has k well-separated clusters, the gains from splitting halves of clusters should be dramatically less than from splitting distinct clusters. If that isn't true, the data doesn't have well-separated clusters anyway.
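
For concreteness, here is a minimal sketch of the elbow heuristic being discussed, using scikit-learn on toy data (all choices illustrative): plot the within-cluster sum of squares (inertia) against k and look for the point where the gains from adding clusters flatten out.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data with 4 well-separated clusters (illustrative only)
    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ = within-cluster sum of squares; it drops sharply
        # until k reaches the true cluster count, then flattens (the "elbow")
        print(k, round(km.inertia_, 1))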

joshdavham

Since when did researchers decide to start titling their papers like clickbait YouTube videos?

The “Stop doing [conventional thing]!” title formula is, for whatever reason, the title that I always find the most annoying.

teraflop

It's nothing new. See, for instance, "Go To Statement Considered Harmful", published by Dijkstra in 1968.

vacuity

Though that name came from Wirth as editor. The original name ("A Case against the GO TO Statement") is somewhat less inflammatory.

cttet

Since a long time ago. And I've noticed a trend: the more theoretical the field (e.g. TCS, machine learning theory), the more 'clickbaity' the titles are.

cozzyd

Stop doing is all you need.

nurettin

It didn't grieve me that much. Just scroll to the conclusion.

djoldman

> ... much better alternatives such as the variance-ratio criterion (VRC) of Calinski and Harabasz [6], the Bayesian Information Criterion (BIC), or the Gap statistics should always be preferred instead.

That's fair.

And: clustering algorithms are unsupervised by definition, therefore there is no correct answer.

In my experience, the use case almost always controls which algorithm (k-means, DBSCAN, etc.) will be best as well as the parameters chosen, e.g. k if that is available.
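
As a rough sketch of two of the alternatives quoted above (illustrative data and ranges, not a recipe): maximise the Calinski-Harabasz variance ratio, or minimise the BIC of a Gaussian mixture, over candidate values of k.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import calinski_harabasz_score
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    for k in range(2, 10):  # calinski_harabasz_score needs at least 2 clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        vrc = calinski_harabasz_score(X, labels)                              # higher is better
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)   # lower is better
        print(k, round(vrc, 1), round(bic, 1))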

tetris11

I always used the silhouette method (mostly because it generated cool art), and I'm happy to see it works well for large K, though it fails at K=1.
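
A minimal sketch of that approach (assumed scikit-learn usage on toy data): the mean silhouette coefficient is only defined for 2 <= k <= n-1, which is exactly why it can't flag k=1.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    for k in range(2, 8):  # the silhouette score is undefined at k = 1
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))  # pick the k with the highest score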

egberts1

But, but BUT credit card fraud divisions make heavy use of these "hockey-sticks" in K-means.

esafak

Clustering is a poorly defined task. What constitutes a cluster: separation by Euclidean distance? That's not scale-invariant.

The representation and clustering algorithm should be optimized for the downstream task.

binarymax

This is one of those interesting mathematical attempts to formalize a very human “I know it when I see it” quality. So maybe k-means is a poorly defined task if you don’t know k. Then you end up with the meta task of defining k, which has its problems, as seen in the paper. K-means alone, on some unknown data, without knowing the distance metric, and with no other heuristics or analysis, is going to go wrong. But if you are performing a dataset specific task, in a known vector space, with a good understanding of the outcome, then it’s really useful. The problem is probably that people learn that k-means is an unsupervised learning algorithm, and apply it incorrectly.

hoseja

Thing is, you only see it when appropriately zoomed in and therefore may not actually see it even though the algorithm should be able to find it.

rgavuliak

The elbow method never worked well in practice. In over a decade of experience, I have yet to see a significant separation in the chart.

azinman2

So what do you do instead?

rgavuliak

Usually I try to tie the clusters to some kind of bottom-line metric. It's not always possible though.

To be honest, I am not a big fan of segmentation as such, since it involves subjective decisions both when choosing the number of clusters and when interpreting them.

An alternative is DBSCAN, which comes up with natural clusters, but it's also not 100% foolproof, as it leaves a group of unclusterable (noise) points whose share at times ends up too high.
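
A small sketch of that DBSCAN behaviour (eps and min_samples chosen purely for illustration): points that fit no density-connected cluster get the noise label -1, and that "unclusterable" share can indeed get large if the parameters are off.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=0)

    labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters, "clusters,", round(np.mean(labels == -1) * 100, 1), "% noise")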

renewiltord

Ketamine but that didn’t help either.

pbronez

The paper points out that k-means is a special case of Gaussian mixture modeling that makes lots of assumptions about the data. As always, understanding the statistical assumptions behind your model is absolutely essential.
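
One way to see that relationship (a sketch, not taken from the paper): a Gaussian mixture constrained to spherical covariances with hard assignments behaves essentially like k-means, so on easy data the two produce nearly identical labelings.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    gmm = GaussianMixture(n_components=4, covariance_type="spherical",
                          random_state=0).fit(X).predict(X)
    print(adjusted_rand_score(km, gmm))  # close to 1.0 when the assumptions hold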

bbstats

Also, stop using k-means

PaulHoule

Depends what you're doing. I have a content-based recommender that uses clustering for diversity: basically, if you want N items and have k clusters, it picks the top N/k out of each cluster. [1] k-means works just great for that.

[1] for better or worse though, if 1/k of your articles are about some topic like soccer or cryptocurrency you will always get about 1/k articles about soccer or cryptocurrency no matter how you vote.
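
A minimal sketch of that diversity scheme (names and scoring here are hypothetical, not the actual recommender): take the top N/k items from each cluster.

    from collections import defaultdict

    def diverse_top_n(items, n_total, n_clusters):
        """items: list of (item_id, cluster_id, score) tuples."""
        per_cluster = max(1, n_total // n_clusters)
        by_cluster = defaultdict(list)
        for item_id, cluster_id, score in items:
            by_cluster[cluster_id].append((score, item_id))
        picked = []
        for scored in by_cluster.values():
            scored.sort(reverse=True)                  # best-scoring first
            picked.extend(item_id for _, item_id in scored[:per_cluster])
        return picked[:n_total]

    # Usage: two clusters, ask for 4 items -> 2 per cluster
    items = [("a", 0, 0.9), ("b", 0, 0.8), ("c", 0, 0.1),
             ("d", 1, 0.7), ("e", 1, 0.6), ("f", 1, 0.2)]
    print(diverse_top_n(items, n_total=4, n_clusters=2))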

bbstats

I still prefer a Gaussian mixture for this, unless you're saying you want roughly even N per cluster.

PaulHoule

It all depends on what you want.

If 3/k of your items are about soccer or 2/k are about cryptocurrency, then you get 3 and 2 clusters for them respectively, whereas on some level you might want 1 cluster for each of them and also one cluster for a topic with prevalence 1/(3k). On the other hand, having 2 clusters for a topic with prevalence 2/k could be seen as fair on another level... one way or the other, I appreciate the system not having too many parameters and not being too sensitive to them, and I'd expect it to do roughly the right thing if k is set a little too high.

Early on I had a fight with the system where it kept showing me articles about soccer, despite my hating soccer, because I had roughly 1/k articles about soccer (an RSS feed from The Guardian!). I wound up thinking a lot about feature engineering for sports and started reading soccer articles in detail, and pretty soon I was amazed by games that were 0-1 with an own goal or that went 8-0, and before long I became one of those people who is watching soccer at 9am on Saturday.

Related to that is the question of “should it show me more articles from clusters that I like better?” which defeats the point of diversity but certainly would improve most quality metrics (though I’d get sick of arXiv papers about recommender systems, which was most of what I got when k=1 in the beginning, and would be screaming for articles about cricket or something instead.)

I haven't done a lot of work to try to improve that system, because I like the feed it makes already and I'm not sure what to optimize to make it better. If it has a weakness, it is latency: there are certain things, like sports articles and Ken Shirriff's blog, which are more timely, and I'd like a way for some articles to "jump the line" without seeing too many of those articles overall. But I'm not sure how to quantify what success is.

egberts1

ISWYDT

jb1991

This seems a bit like a knee-jerk reaction.

Der_Einzige

Stop using k-means. HDBSCAN in its GPU implementation in cuML is probably superior, yes, even on your "very large" datasets (A100s can be rented for less than $1 an hour).
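
For reference, a sketch using the CPU HDBSCAN that ships with scikit-learn 1.3+ (cuML's GPU cuml.cluster.HDBSCAN exposes a broadly similar interface, as far as I know); min_cluster_size is an illustrative choice, and no k has to be specified at all.

    from sklearn.cluster import HDBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

    labels = HDBSCAN(min_cluster_size=15).fit_predict(X)
    # Noise points are labeled -1, so exclude them from the cluster count
    print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")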

hansvm

K-means is optimal for some problems, like in bolt quantization (and hierarchical initializations tend to converge quickly). It's not really applicable to TFA, since you'd choose the cluster count via metrics which have nothing to do with elbow criteria or whatever, but the lesson isn't to not use k-means or to use some other clustering algorithm; it's to know your goals and which model best achieves those.

antman

Two different beasts; both can give good results depending on the dataset.