Language Models Pack Billions of Concepts into 12,000 Dimensions

niemandhier

Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.

Johnson–Lindenstrauss guarantees a distance-preserving embedding of a finite set of points into a space whose dimension depends on the number of points.

It says nothing about preserving the underlying topology of the continuous high-dimensional manifold; that is the domain of Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).

The embedding dimension needed to satisfy Takens is related to the original manifold's dimension, not to the number of points.

It’s quite probable that we observe violations of topological features of the original manifold when we interpolate using our too-low-dimensional embedded version.

I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below. === AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).

The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension

k ≳ (d_B / ε^2) * log(C / rho).

So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale

rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),

below which you can’t keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That’s called “folding” and might be the source of some of the problems we observe.

=== AI end === Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
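
A rough numerical sketch of that trade-off (the values of C, d_B, and ε below are made up for illustration, and all constants are dropped):

    import math

    # Illustrative values only: C, d_B (box-counting dimension), and eps are
    # not measurements of any real model, and constants are ignored throughout.
    C, d_B, eps = 1.0, 50.0, 0.1

    def required_dimension(rho):
        """JL bound k ~ log(n) / eps^2 with n ~ C * rho**(-d_B) sample points."""
        n = C * rho ** (-d_B)
        return math.log(n) / eps ** 2

    def smallest_resolvable_scale(k):
        """rho* ~ C * exp(-(eps^2 / d_B) * k): below this scale, folding sets in."""
        return C * math.exp(-(eps ** 2 / d_B) * k)

    for rho in (1e-1, 1e-2, 1e-3):
        print(f"rho = {rho:g}  ->  k ≳ {required_dimension(rho):,.0f}")
    print(f"k = 12000  ->  rho* ≳ {smallest_resolvable_scale(12000):.3g}")

Asking for finer resolution blows up the required k quickly, while a fixed k = 12,000 bottoms out at a finite rho*.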

yorwba

I think the author is too focused on the case where all vectors are orthogonal and as a consequence overestimates the amount of error that would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but keeping the distance ordering between vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.

So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
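
A small experiment illustrating that failure mode (a sketch with arbitrary dimensions and noise scales): two points whose true distances to a query differ by only a few percent frequently swap order under a JL-style random projection.

    import numpy as np

    rng = np.random.default_rng(0)
    d_high, d_low, n_trials = 2048, 64, 200

    base = rng.normal(size=d_high)
    a = base + 0.30 * rng.normal(size=d_high)   # truly slightly closer to base
    b = base + 0.32 * rng.normal(size=d_high)   # truly slightly farther from base
    print("true distances:", np.linalg.norm(base - a), np.linalg.norm(base - b))

    flips = 0
    for _ in range(n_trials):
        P = rng.normal(size=(d_low, d_high)) / np.sqrt(d_low)  # JL-style random projection
        if np.linalg.norm(P @ (base - a)) > np.linalg.norm(P @ (base - b)):
            flips += 1  # projected ordering disagrees with the true ordering
    print(f"ordering flipped in {flips}/{n_trials} random projections")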

aabhay

My intuition for this problem is much simpler: assuming there is some rough hierarchy of concepts, you can guesstimate how many concepts can exist in a 12,000-d space by counting combinations of the dimensions. In that world, each concept is orthogonal to every other concept in at least some dimension. While that doesn't mean their cosine distance is large, it does mean you're guaranteed a function that can linearly separate the two concepts.

That means you get 12,000! (factorial) concepts in the limiting case, which is more than enough room to fit a taxonomy.

Morizero

That number is far, far, far greater than the number of atoms in the universe (~10^43741 >>>>>>>> ~10^80).
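
For reference, that magnitude checks out (log10 of 12,000! via Stirling/lgamma):

    import math

    # log10(12000!) = lgamma(12001) / ln(10)
    print(f"12000! ≈ 10^{math.lgamma(12001) / math.log(10):,.0f}")  # ≈ 10^43,741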

dwohnitmok

This set of intuitions, and the Johnson–Lindenstrauss lemma in particular, powers a lot of the research effort behind SAEs (sparse autoencoders) in the field of mechanistic interpretability in AI safety.

A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
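
For readers unfamiliar with the setup, here is a minimal sparse-autoencoder sketch in the spirit of that line of work (illustrative only; it is not the architecture or loss from the linked paper):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Toy SAE: overcomplete ReLU feature layer plus a linear decoder."""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            features = torch.relu(self.encoder(x))  # nonnegative feature activations
            return self.decoder(features), features

    def sae_loss(x, x_hat, features, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes features toward sparsity.
        return ((x - x_hat) ** 2).mean() + l1_coeff * features.abs().mean()

    sae = SparseAutoencoder(d_model=128, d_hidden=1024)
    x = torch.randn(32, 128)          # stand-in for residual-stream activations
    x_hat, features = sae(x)
    print(sae_loss(x, x_hat, features).item())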

emil-lp

Where can I read the actual paper? Where is it published?

yorwba

That is the actual paper, it's published on transformer-circuits.pub.

emil-lp

It's not peer-reviewed?

bigdict

What's the point of the relu in the loss function? Its inputs are nonnegative anyway.

GolDDranks

I wondered the same. It seems like it would just make a V-shaped loss around zero, but abs already has that property!
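
For concreteness, a trivial check of the premise being debated (this is not the paper's actual loss code):

    import numpy as np

    x = np.linspace(0.0, 2.0, 5)                 # nonnegative inputs
    print(np.array_equal(np.maximum(x, 0), x))   # True: relu is a no-op on nonnegative inputs
    # Around zero, abs gives the V shape while relu only keeps the positive side:
    print(np.abs(x - 1.0))            # [1.  0.5 0.  0.5 1. ]
    print(np.maximum(x - 1.0, 0.0))   # [0.  0.  0.  0.5 1. ]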

Nevermark

Let's try to keep things positive.

js8

You can also imagine a similar thing with binary vectors. There, two vectors are "orthogonal" if they share no bits that are set to one. So you can encode a huge number of concepts using only a small number of set bits in modestly sized vectors, and most of them will be orthogonal.
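
A quick sanity check of that claim (a sketch; the bit width, sparsity, and vector count are arbitrary):

    import random

    random.seed(0)
    n_bits, k_ones, n_vecs = 10_000, 10, 500

    # Each "concept" sets k_ones random bits out of n_bits.
    vectors = [frozenset(random.sample(range(n_bits), k_ones)) for _ in range(n_vecs)]

    pairs = n_vecs * (n_vecs - 1) // 2
    disjoint = sum(1 for i in range(n_vecs) for j in range(i + 1, n_vecs)
                   if not (vectors[i] & vectors[j]))
    print(f"{disjoint / pairs:.1%} of pairs share no set bit")  # roughly 99%

So nearly all pairs are "orthogonal" in this sense, even though, as the replies below point out, you cannot make a large set of them mutually disjoint.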

phreeza

If they are orthogonal only when they share no bits set to one, then only one vector, the complement, will be orthogonal, no?

yznovyak

I don't think so. For n=3 you can have 000, 001, 010, 100. All 4 (n+1) are pairwise orthogonal. However, I don't think js8 is correct either: it looks like among the 2^n binary vectors you can't have more than n+1 mutually orthogonal ones, since if any vector has a 1 in some position, no other vector can have a 1 in the same position.

prerok

Hmm, I think one correction: is (0,0,0) actually a valid vector here? I think that, by definition, an n-dimensional space can have at most n vectors that are all orthogonal to one another.

js8

For example, 1010 and 0101 are orthogonal, but 1010 and 0011 are not (share the 3rd bit). Though calling them orthogonal is not quite right.

asplake

By the original definition, they can share bits that are set to zero and still be orthogonal. Think of the bits as basis vectors – if they have none in common, they are orthogonal.

henearkr

Your definition of orthogonal is incorrect, in this case.

In the case of binary vectors, don't forget you are working over the finite field with two elements, {0, 1}, where addition is XOR.