Transformer^2: Self-Adaptive LLMs
16 comments
·January 15, 2025
E_Bfx
redox99
In ML, results are often a score (accuracy or whatever), which makes it more gamified.
It's common to have competitions where the one with the highest score on the benchmark "wins". Even if there is no formal competition, being the SOTA model is very important.
Results are more applicable to the real world, and subjectively "cooler" (I don't think there's a Two Minute Papers equivalent for math?), which inflates egos.
And often authors are trying to convince others to use their findings. So it's partly a marketing brochure.
mdp2021
It is discomforting to read, in the first paragraph, that "dynamical adjustment of weights" is justified as "adaptation". Clearly it is a sought-after milestone to have «a future where AI models are no longer static»: but the chief reason remains, "intelligent systems reprocess their body of knowledge and change it to improve it" - it is anterior to "adaptation to environment", it is "maintenance of the body of knowledge (of the world model)": it is the continuous practice of "thinking about things", "pondering", "reflecting", "using judgement"...
There is not just a simple «lifelong learning»: the whole past experience is still productive, requiring analysis, not "solved".
Anyway: the directions seem good.
Edit: equally interesting in another direction is the automated analysis of the internal subagents, «break[ing] down the vast, complex knowledge stored in the LLM into smaller, meaningful, and independent pieces (e.g., the different pathways or components for math, language understanding, etc)». Should there not be a general study of the dissection of systems with seemingly emergent intelligence, doing to LLMs what we do to C. elegans?
wildermuthn
Great research here. Contextual real-time weight modification is definitely one of the breakthroughs required for AGI. Why create a LoRA when you can generate one on the fly suited to the task at hand?
mnky9800n
Why not, as each new task comes up and the weights are re-evaluated, save those weights and keep them as priors for similar future tasks? As the model is exposed to new data, the average of the priors for tasks the model considers similar might move closer to the posterior, making the model quicker and better at arriving at good outcomes. I suppose storage might be an issue.
verdverm
It does not seem like they are doing inference-time weight changes in the sense of running backprop. It sounds more like they apply a pre-trained vector to the model and select that vector based on the input, in a two-step process.
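A toy sketch of that two-step flow, in plain Python. Everything here is an assumption made for illustration: the expert names, the vector values, and the keyword dispatcher (`classify_task`) are hypothetical stand-ins, not the paper's dispatch mechanism.

```python
from typing import Dict, List, Tuple

# Hypothetical pre-trained expert vectors: one scale factor per singular value.
EXPERT_VECTORS: Dict[str, List[float]] = {
    "math": [1.2, 0.8, 1.0],
    "code": [0.9, 1.1, 1.3],
    "other": [1.0, 1.0, 1.0],  # all ones: leaves the base model unchanged
}


def classify_task(prompt: str) -> str:
    """First pass: a toy keyword dispatcher (assumption, not the real one)."""
    text = prompt.lower()
    if any(word in text for word in ("solve", "integral", "equation")):
        return "math"
    if any(word in text for word in ("function", "bug", "compile")):
        return "code"
    return "other"


def two_pass(prompt: str, singular_values: List[float]) -> Tuple[str, List[float]]:
    """Pick an expert vector from the input, then rescale the singular values.

    No gradients are computed: the vector was trained ahead of time and is
    only selected and applied here.
    """
    task = classify_task(prompt)
    z = EXPERT_VECTORS[task]
    adapted = [s * zi for s, zi in zip(singular_values, z)]
    return task, adapted
```

Calling `two_pass("Solve this equation for x", [2.0, 1.0, 0.5])` selects the hypothetical "math" vector and rescales the three values with it; a prompt matching no keyword falls back to the all-ones vector.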
mtts
Sort of. According to the text they can use multiple z-vectors (sets of weights that select for parts of the system to be used to answer a specific question) simultaneously, using a "simple optimization algorithm" to determine the relative weight for each of these vectors.
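A minimal sketch of combining several z-vectors with relative weights, assuming NumPy. The random search over convex weights is a stand-in chosen for brevity, not necessarily the "simple optimization algorithm" the text refers to, and `loss_fn` is a hypothetical callable that scores a mixed vector (for example on a handful of task examples).

```python
import numpy as np


def mix_experts(z_vectors, alphas):
    """Weighted sum of expert vectors: z' = sum_k alpha_k * z_k."""
    z = np.asarray(z_vectors, dtype=float)  # shape (K, r)
    a = np.asarray(alphas, dtype=float)     # shape (K,)
    return a @ z                            # shape (r,)


def search_alphas(z_vectors, loss_fn, n_samples=200, seed=0):
    """Toy random search for mixing weights that minimise loss_fn.

    Assumption: weights are non-negative and sum to 1 (sampled from a
    Dirichlet distribution); the search starts from uniform weights.
    """
    rng = np.random.default_rng(seed)
    k = len(z_vectors)
    best_a = np.full(k, 1.0 / k)
    best_loss = loss_fn(mix_experts(z_vectors, best_a))
    for _ in range(n_samples):
        a = rng.dirichlet(np.ones(k))
        loss = loss_fn(mix_experts(z_vectors, a))
        if loss < best_loss:
            best_a, best_loss = a, loss
    return best_a, best_loss
```

The mixed vector has the same length as each expert vector, so it can be applied to the model in exactly the same way as a single expert.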
wildermuthn
That’s my general understanding as well, but it isn’t a large conceptual leap to go from real-time selection of pretrained “z-vectors” to real-time generation of the same. The larger conceptual breakthrough, with demonstration of its effectiveness, is the big success here.
mtts
The interesting thing here is that the human brain also seems to use pretrained ... things. For vision, use the visual subsystem. For hearing, use the auditory subsystem. For movement ... you get the point. Plus you can combine these pretrained ... things, so for example for complex movement, like balancing on a tightrope, multiple subsystems are used (try standing on one leg with your eyes closed).
Z-vectors are of course nothing like the subsystems in your brain, but the general approach is certainly similar to how the brain works.
logicchains
>Contextual real-time weight modification is definitely one of the breakthroughs required for AGI.
It's already been invented: https://arxiv.org/abs/2202.05780 . That design is just very inefficient to scale up / use as a transformer backbone.
bugglebeetle
See also the work being done by GoodFire AI:
They now have an API that allows for dynamic exploration and manipulation of the latent space for Llama 8B-70B models (think Golden Gate Claude). They also open-sourced the sparse auto-encoders that (in part) allow for this:
https://huggingface.co/Goodfire/Llama-3.3-70B-Instruct-SAE-l...
verdverm
This sounds like MoE and maybe a bit of chain-of-thought. Curious what someone with more domain expertise thinks about this.
If they can test against Llama 70B and Mistral 7B, they ought to compare against Mixtral 8x7B imho.
Vampiero
It's all very interesting, but those pictures look pretty bad. Clearly visible artifacts, awful shapes.
tzury
The ideas in the paper have been implemented and tested. The authors conducted experiments on several tasks (math, coding, reasoning, and visual question answering) and showed that their approach works better than previous methods like LoRA.
Key ideas (in simple terms):
1. What’s the problem?
- Fine-tuning LLMs for every new task is slow, expensive, and often doesn't generalize well.
- Models trained on one task may perform poorly on others, especially unseen ones.
- Current methods (like LoRA) can add new capabilities but aren't efficient enough.
2. The solution:
- Transformer² uses a new fine-tuning method called Singular Value Fine-tuning (SVF). This focuses on adjusting only certain parts of the model’s "weight matrices" rather than changing everything.
- By tweaking specific components (called "singular values"), it trains smaller, efficient "expert" modules that specialize in particular types of tasks.
3. How it works:
- Training phase: Train these smaller expert modules offline using reinforcement learning (RL) to specialize in tasks like coding, math, or reasoning.
- Inference phase: When a new input is given, the system analyzes the task (e.g., “Is this a math or coding problem?”) in the first pass. Based on this, it combines the right expert modules and adapts the model’s behavior in the second pass.
4. Three adaptation strategies:
- Prompt-based: Use a cleverly designed text prompt to figure out the task type and pick the right expert module.
- Classifier-based: Train a separate model to classify tasks and match them to experts.
- Few-shot adaptation: Look at a small number of examples (few-shot learning) to dynamically combine expert modules for the best results.
5. Efficiency:
- The system uses fewer parameters than other parameter-efficient fine-tuning methods like LoRA.
- Adaptation works even on small datasets without overfitting or forgetting older tasks.
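A rough sketch of the SVF idea summarized above, assuming NumPy and assuming one learned scale factor per singular value (the vector `z`). The function names are illustrative and not taken from the paper's code. The two counting helpers compare trainable parameters per weight matrix against a rank-`r` LoRA adapter, which adds two factor matrices with `r * (m + n)` entries in total.

```python
import numpy as np


def svf_adapt(W, z):
    """Rescale the singular values of W by z: W' = U diag(s * z) V^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Multiplying the columns of U by (s * z) is the same as U @ diag(s * z).
    return (U * (s * np.asarray(z, dtype=float))) @ Vt


def svf_param_count(m, n):
    """Assumed SVF cost: one scalar per singular value of an m x n matrix."""
    return min(m, n)


def lora_param_count(m, n, rank):
    """LoRA cost: an (m x rank) factor plus a (rank x n) factor."""
    return rank * (m + n)
```

With `z` set to all ones the matrix is returned unchanged (up to floating-point error), which is a handy sanity check. For a 4096 x 4096 matrix, these helpers give 4096 parameters under the SVF assumption versus 131072 for a rank-16 LoRA adapter.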
> Transformer² represents a significant milestone in the evolution of AI systems.
Coming from a math background, it always amazes me to see how people in AI/ML brag about their papers. If someone wrote:
> My paper represents a significant milestone in the evolution of algebraic geometry/ergodic theory/combinatorics
they would be the laughingstock of the math community.