Softmax forever, or why I like softmax

115 comments

·February 16, 2025

roger_

An aside: please use proper capitalization. With this article I found myself backtracking thinking I’d missed a word, which was very annoying. Not sure what the authors intention was with that decision but please reconsider.

1dom

I agree.

I'm all for Graham's pyramid of disagreement: we should focus on the core argument, rather than superfluous things like tone, or character, or capitalisation.

But this is too much for me personally. I just realised I consider the complete lack of capitalisation on a piece of public intellectual work to be obnoxious. Sorry, it's impractical, distracting and generates unnecessary cognitive load for everyone else.

You're the top comment right now, and it's not about the content of the article at all, which is a real shame. All the wasted thought cycles across so many people :(

treetalker

Graham's Hierarchy, in "How to Disagree": https://paulgraham.com/disagree.html

MrMcCall

Yeah, people should wake up to what people are really saying.

jiggawatts

It's a fad associated with AI, popularised by Sam Altman especially.

It's the new black turtleneck that everyone is wearing, but will swear upon their mother's life isn't because they're copying Steve Jobs.

meowface

Twitter and all forms of instant messaging (SMS, WhatsApp, Discord, and the older ones like AIM/MSN/ICQ) have normalized it for many years. Sam is just one of the few large company CEOs to tweet in the style other Twitter users usually use. He's adopting the native culture rather than setting a trend.

Sam still uses capitalization in all of his essays, as do most people (including young people). In essays, like this one, it's distracting without it. I predict in 10 years the vast majority of people will all-lowercase on places like Twitter but almost no one will do it for essays.

bowsamic

Well at least it makes it easy to know who to avoid

msikora

That is so incredibly dumb. I can get it in a tweet, but please, please, please properly capitalize in anything longer than a few words!

4ggr0

i don't want to press the shift-key everytime i need a capitalized letter on my phone and i disable auto-correct because it constantly messes with native languages etc.

wasn't aware that this makes me a steve jobs copier :(

EDIT: people are seriously so emotionally invested in capitalization that i get downvoted into minus, jeez.

yapyap

[dead]

alchemist1e9

> It's a fad associated with AI, popularised by Sam Altman especially.

I know this is true but does anyone understand why they do it? It is actually cognitively disruptive when reading content because many of us are trained to simultaneously proof read while reading.

So I also consider it a type of cognitive attack vector and it annoys me extremely as well.

Gormo

Capitalization and punctuation are to written language what pronunciation and stress are to spoken language. If someone was mispronouncing every word, using incorrect vowels, stressing the wrong syllables, etc., you'd have a really hard time understanding anything they're saying. Writing with incorrect punctuation and capitalization impedes comprehension in the same way.

omoikane

This looks like a personal blog post, in a blog where the author have avoided capitalization fairly consistently. The blog post was likely not meant to be a research paper, and reading it as a research paper is probably setting the wrong expectations.

If people wanted to read formal-looking formatted text, the author has linked to one in the second paragraph:

https://arxiv.org/abs/1511.07916 - Natural Language Understanding with Distributed Representation

PhilippGille

https://www.theguardian.com/society/2025/feb/18/death-of-cap...

bowsamic

Well I will fight this trend to the death. Thankfully I don't like to surround myself with philistines

meowface

The war is already over.

I 100% agree lowercase in longform essays is ridiculous, but I think for everything aside from essays, articles, papers, long emails, and some percentage of multi-paragraph site comments, lowercase is absolutely going to be the default online in 20 years.

null

[deleted]

jgalt212

I think he knows he did something non-standard, as his previous post from seven weeks ago has correct capitalization.

https://kyunghyuncho.me/bye-felix/

mempko

At least for now, maybe this is the best way to tell if a text is written by an LLM or a person. An LLM will capitalize!

cassepipe

Please ChatGPT, decapitalize that comment above for me

    at least for now, maybe this is the best way to tell if a text is written by an llm or a person. an llm will capitalize!

svachalek

Unless you tell it not to...

handsclean

This is the norm for Gen Z. We don’t see it because children don’t set social norms where adults are present too, but with the oldest of Gen Z about to turn 30, you and I should expect to see this more and more, and get used to it. If every kid can handle it, I think we can, too.

aidenn0

It doesn't change the point of your comment necessarily, but as far as TFA goes, the author was teaching a University course in 2015, so is highly unlikely to be Gen Z.

Gormo

Kids also pee in their pants at a rate vastly exceeding that of adults, but they usually stop doing it before they hit 30.

timdellinger

an opinion, and a falsifiable hypothesis:

call me old-fasahioned, but two spaces after a period will solve this problem if people insist on all-lower-case. this also helps distinguish between abbreviations such as st. martin's and the ends of sentences.

i'll bet that the linguistics experimentalists have metrics that quantify reading speed measurements as determined by eye tracking experiments, and can verify this.

thaumasiotes

> [I]'ll bet that the linguistics experimentalists have metrics that quantify reading speed measurements as determined by eye tracking experiments, and can verify this.

You appear to be trolling for the sake of trolling, but for reference: reading speed is determined by familiarity with the style of the text. Diverging from whatever people are used to will make them slower.

There is no such thing as "two spaces" in HTML, so good luck with that.

recursive

> There is no such thing as "two spaces" in HTML, so good luck with that.

Code point 160 followed by 32. In other words `  ` will do it.

fc417fc802

( do away with both capitalization and periods ( use tabs to separate sentences ( problem solved [( i'm only kind of joking here ( i actually think that would work pretty well ))] )))

( or alternatively use nested sexp to delineate paragraphs, square brackets for parentheticals [( this turned out to be an utterly cursed idea, for the record )] )

maurits

For people interested in the softmax, log sum exp and energy models, have a look at "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" [1]

[1]: https://arxiv.org/abs/1912.03263

stared

There are many useful tricks - like cosine distance.

In contrast, softmax has a very deep grounding in statistical physics - where it is called the Boltzmann distribution. In fact, this connection between statistical physics and machine learning was so fundamental that it was a key part of the 2024 Nobel Prize in Physics awarded to Hopfield and Hinton.

LudwigNagasena

Study of thermodynamics gave rise to many concepts in information theory and statistics, but I wouldn't say that there is any direct connection per se between thermodynamics and any field where statistics or information theory are applicable. And the reasoning behind the 2024 Nobel Prize in Physics was... quite innovative.

mitthrowaway2

> I wouldn't say that there is any direct connection per se between thermodynamics and any field where statistics or information theory are applicable.

Thermodynamics can absolutely be studied through both a statistical mechanics and an information theory lens, and many physicists have found this to be quite productive and enlightening. Especially when it gets to tricky cases involving entropy, like Maxwell's Demon and Landauer's Eraser, one struggles not to do so.

creakingstairs

Because the domain is a Korean name, I half-expected this to be about an old Korean game company[1] with the same name. They made some banger RPGs at the time and had really great art books.

[1] https://en.m.wikipedia.org/wiki/ESA_(company)

jmlim00

That's what I thought too.

nobodywillobsrv

Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.

efavdb

I personally don’t think much of the maximum entropy principle. If you look at the axioms that inform it, they don’t really seem obviously correct. Further, the usual qualitative argument is only right in a certain lens: namely they say choosing anything else would require you to make more assumptions about your distribution than is required. Yet it’s easy to find examples where the max entropy solution suppresses some states more than is necessary etc., which to me contradicts that qualitative argument.

semiinfinitely

right and it should be totally obvious that we would choose an energy function from statistical mechanics to train our hotdog-or-not classifier

C-x_C-f

No need to introduce the concept of energy. It's a "natural" probability measure on any space where the outcomes have some weight. In particular, it's the measure that maximizes entropy while fixing the average weight. Of course it's contentious if this is really "natural," and what that even means. Some hardcore proponents like Jaynes argue along the lines of epistemic humility but for applications it really just boils down to it being a simple and effective choice.

yorwba

In statistical mechanics, fixing the average weight has significance, since the average weight i.e. average energy determines the total energy of a large collection of identical systems, and hence is macroscopically observable.

But in machine learning, it has no significance at all. In particular, to fix the average weight, you need to vary the temperature depending on the individual weights, but machine learning practicioners typically fix the temperature instead, so that the average weight varies wildly.

So softmax weights (logits) are just one particular way to parameterize a categorical distribution, and there's nothing precluding another parameterization from working just as well or better.

xelxebar

The connection isn't immediately obvious, but it's simply because solving for the maximum entry distribution that achieves a given expectation value produces the Botlzmann distribution. In stat mech, our "classifier" over (micro-)states is energy; in A.I. the classifier is labels.

For details, the keyword is Lagrange multiplier [0]. The specific application here is maximizing f as the entropy with the constraint g the expectation value.

If you're like me at all, the above will be a nice short rabbit hole to go down!

[0]:https://tutorial.math.lamar.edu/classes/calciii/lagrangemult...

Y_Y

The way that energy comes in is that you have a fixed (conserved) amount of it and you have to portion it out among your states. There's nothing inherently energy-related about, it just happens that we often want to look energy distributions and lots of physical systems distribute energy this way (because it's the energy distribution with maximal entropy given the constraints).

(After I wrote this I saw the sibling comment from xelxebar which is a better way of saying the same thing.)

semiinfinitely

i think that log-sum-exp should actually be the function that gets the name "softmax" because its actually a soft maximum over a set of values. And what we call "softmax" should be called "grad softmax" (since grad of logsumexp is softmax).

GistNoesis

softmax is badly named and should rather be called softargmax.

incognito124

How to sample from a categorical: https://news.ycombinator.com/item?id=42596716

Note: I am the author

lelag

I'm happy to see you repaired your keyboard.

littlestymaar

I think they mean they're the author of the post they link, not the author of the OP with his broken caps.

lelag

Oh, right. I misunderstood.

littlestymaar

Off topic: Unlike many out there I'm not usually bothered by lack of capitalization in comments or tweets, but for an essay like this, it makes the paragraphs so hard to read!

Validark

If someone can't even put in a minimal amount of effort for basic punctuation and grammar, I'm not going to read their article on something more sophisticated. If you go for the lowercase i's because you want a childish or slob aesthetic, that can be funny in context. But in math or computing, I'm not going to care what someone thinks if they don't know or don't care that 'I' should be capitalized. Grammarly has a free tier. ChatGPT has a free tier. Paste your word slop into one of those and it will fix the basics for you.

LeifCarrotson

We just had a similar discussion at work the other day when one of our junior engineers noticed that a senior engineer was reflexively tapping the space bar twice after each sentence. That, too, was good style back when we were writing on typewriters or using monospace fonts with no typesetting. Only a child or a slob would fail to provide an extra gap between sentences, it would be distracting to readers and difficult to locate full stops without that!

But it's 2025, and HTML and Word and the APA and MLA and basically everyone agree that times and style guides have changed.

I agree that not capitalizing the first letter in a sentence is a step too far.

For a counter-example, I personally don't care whether they use the proper em-dash, en-dash, or hyphen--I don't even know when or how to insert the right one with my keyboard. I'm sure there are enthusiasts who care very deeply about using the right ones, and feel that my lack of concern for using the right dash is lazy and unrefined. Culture is changing as more and more communication happens on phone touchscreens, and I have to ask myself - am I out of touch? No, it's the children who are wrong. /s

But I strongly disagree that the author should pass everything they write through Grammarly or worse, through ChatGPT.

dagss

Same here, I just had to stop reading.

janalsncm

This is a really intuitive explanation, thanks for posting. I think everyone’s first intuition for “how do we turn these logits into probabilities” is to use a weighted sum of the absolute values of the numbers. The unjustified complexity of softmax annoyed me in college.

The author gives a really clean explanation for why that’s hard for a network to learn, starting from first principles.

calebm

Funny timing, I just used softmax yesterday to turn a list of numbers (some of which could be negative) into a probability distribution (summing up to 1). So useful. It was the perfect tool for the job.

yorwba

The author admits they "kinda stopped reading this paper" after noticing that they only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of the paper. (It would however, be an excuse to ignore it entirely.)

In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.

totalizator

That would be "welcome to the world of academia". My post-doc friends won't even read a blog post prior to checking author's resume. They are very dismissive every time they notice anything they consider sloppy etc.

throw_pm23

You seem to merge two different points. Not reading based on sloppiness is defensible. Not reading based on the author's resume less so.

aidenn0

When "sloppiness" is defined as "did anything on my personal list of pet peeves" (and it often is) then the defensibility of the two begin to converge.

lblume

Which is a problem with the reputation-based academic system itself ("publish or perish") and not individuals working in it.

null

[deleted]

AnotherGoodName

For the answer of is "is softmax the only way to turn unnormalized real values into a categorial distribution" you can just use statistics.

Eg. Using Bayesian stats, if i assume an even prior (pretend i have no assumptions about how biased it is), i see a coin flip heads 4 times in a row, what's the probability of it being heads?

Via a long winded proof using the dirichlet distribution Bayesian stats will say "add one to the top and two to the bottom". Here we saw 4/4 heads. So we guess 5/6 chance of being heads (+1 to the top, +2 to the bottom) the next time or a 1/6 chance of being tails. This represents that the statistical model is assuming some bias in the coin.

That's normalized as a probability against 1 which is what we want. It works for multiple probabilities as well, you add to the bottom as many different outcomes as you have. The Dirichlet distribution allows for real numbers and you can support this too. If you feel this gives too much weight to the possibility of the coin being biased you can actually simply add more to the top and bottom which is the same as accounting for this in your prior, eg. add 100 to the top and 200 to the bottom instead.

Now this has a lot of differences with outcomes compared to softmax. It actually gives everything a non-zero chance rather than using the classic sigmoid activation function that softmax has underneath which moves things to almost absolute 0 or 1. But... other distributions like this are very helpful in many circumstances. Do you actually think the chance of tails becomes 0 if you see heads flipped 100 times in a row? Of course not.

So anyway the softmax function fits things to a particular type of distribution but you can actually fit pretty much anything to any distribution with good old statistics. Choose the right one for your use case.

programjames

There's a rather simple proof for this "add one to the top, n to the bottom" posterior. Take a unit-perimeter circle, and divide it into n regions for each of the possible outcomes. Then lay out your k outcomes into their corresponding regions. You have n dividers and k outcomes for a total of n + k points. By rotational symmetry, the distance between any two points is equal in expectation. Thus, the expected size of any region is (1 + # outcomes in the region) / (n + k). So, if you were to take one more sample

E[sample in region i] = (1 + # outcomes in the region) / (n + k)

But, the indicator variable "sample in region i" is always either zero or one, so this must equal the probability!

HN

Softmax forever, or why I like softmax

Softmax forever, or why I like softmax