Theoretical limitations of multi-layer Transformer
7 comments · January 31, 2025 · cs702
Matthyze
I'd humbly like to ask people who've read the paper whether it's worth trying to understand it without a great math background. The paper looks interesting but daunting, and I'd hate to sink a lot of time into it only to come away defeated.
It sometimes sucks being in ML with 'only' a CS background. Feels like all the math and physics grads are running around having fun with their fancy mathematics, while I stand here, feeling dimwitted.
aketchum
Most of it is linear algebra and convex optimization. You can learn a lot of it with free resources from MIT, Stanford, Georgia Tech, or YouTube. If you want more of a school-style learning environment, you can enroll in the Georgia Tech OMSCS program and just take the classes covering the math you're interested in. There's no reason you have to graduate, and it's maybe $800 a course.
Matthyze
Thanks! Now might actually be a great time for me to pick up the material. If anyone has suggestions for particularly good free/online mathematics courses for ML, I'd really love to hear them. Or books!
svachalek
I'm not a math person either, but I'm familiar with some of the terminology. The two things I got out of this paper were:
1. Depth is more important than breadth for making transformers smarter. That is, for a given model size, it will be more powerful with more, smaller layers than with fewer, bigger ones. Interestingly, Mistral just updated their small model yesterday with a big reduction in layers in order to improve performance. Among the many ways the paper says it more technically, it does say directly that "depth plays a more critical role than width in reasoning and composition tasks". (There's a rough back-of-the-envelope sketch of the depth/width tradeoff after this list.)
2. As I understand it, they are claiming to prove mathematically that Chain of Thought, as seen in the new DeepSeek R1 and OpenAI o1/o3 models, creates results that wouldn't be possible without it. The thought chains effectively act as additional layers, and per the previous point, the more layers the better. "From a theoretical view, CoT provides Transformer with extra computation space, and previous work ... proved that log-precision Transformer with CoT could simulate any polynomial-time algorithm. Therefore, by further assuming certain complexity conjecture ... their results imply that constant depth Transformer with CoT could simulate poly-time algorithm, while constant depth Transform ... itself can not solve P-complete task."
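To make point 1 concrete, here's a rough back-of-the-envelope sketch (mine, not from the paper) of how depth and width trade off at a fixed parameter budget. It assumes the usual d_ff = 4 * d_model feed-forward width and ignores embeddings, norms, and biases:

```python
# Back-of-the-envelope parameter count for one decoder block,
# assuming d_ff = 4 * d_model and ignoring embeddings, norms, and biases.
def block_params(d_model: int) -> int:
    attn = 4 * d_model * d_model        # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)   # feed-forward up- and down-projections
    return attn + mlp                   # ~12 * d_model^2 per layer

def model_params(n_layers: int, d_model: int) -> int:
    return n_layers * block_params(d_model)

# Two configurations with the same ~604M-parameter budget:
shallow_wide = model_params(n_layers=12, d_model=2048)
deep_narrow = model_params(n_layers=48, d_model=1024)
print(f"12 layers x 2048 wide: {shallow_wide:,} params, 12 sequential layers")
print(f"48 layers x 1024 wide: {deep_narrow:,} params, 48 sequential layers")
```

Same parameter count either way, but the deep configuration gets four times as many sequential layers of composition, which is what the paper argues matters for reasoning tasks.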
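And a toy analogy for point 2 (again mine, not the paper's construction): think of a task that needs k sequential steps, like applying a permutation k times. A fixed-depth network only gets a bounded number of sequential steps per forward pass, but writing intermediate results out as chain-of-thought tokens lets each new decode step continue from where the last one stopped:

```python
# Toy analogy (not the paper's construction): composing a permutation k times.
# A depth-L transformer gets roughly L sequential steps in a single forward
# pass; emitting chain-of-thought tokens buys extra forward passes, so the
# usable sequential budget grows with the number of generated tokens.

DEPTH = 4  # hypothetical per-pass sequential budget

def forward_pass(state: int, perm: list[int], steps: int) -> int:
    """Stand-in for one forward pass: applies at most `steps` compositions."""
    for _ in range(steps):
        state = perm[state]
    return state

def solve_without_cot(start: int, perm: list[int], k: int) -> int:
    # One pass only: can cover at most DEPTH of the k required steps.
    return forward_pass(start, perm, min(k, DEPTH))

def solve_with_cot(start: int, perm: list[int], k: int) -> int:
    # Each "thought" writes the intermediate state back into the context,
    # so the next pass continues where the previous one stopped.
    state, remaining = start, k
    while remaining > 0:
        chunk = min(remaining, DEPTH)
        state = forward_pass(state, perm, chunk)
        remaining -= chunk
    return state

perm = [2, 0, 3, 1]                     # the cycle 0 -> 2 -> 3 -> 1 -> 0
print(solve_without_cot(0, perm, 10))   # 0: wrong, ran out of depth
print(solve_with_cot(0, perm, 10))      # 3: correct, CoT supplied the steps
```

Obviously real transformers aren't this clean, but that's the intuition behind "CoT provides Transformer with extra computation space."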
szundi
Would be fun to ask o3
cubefox
Loosely related thought: A year ago, there was a lot of talk about the Mamba SSM architecture replacing transformers. Apparently that hasn't happened so far.
Huh. I just skimmed this and quickly concluded that it's definitely not light reading.
It sure looks and smells like good work, so I've added it to my reading list.
Nowadays I feel like my reading list is growing faster than I can go through it.