
Titans: Learning to Memorize at Test Time

cs702

Interesting. I like the idea of a meta-mechanism that learns to update an associative memory based on how surprising the data is. The rest, reading memory via keys and values and selectively erasing it with gating, looks pretty conventional at first glance. Thank you for sharing this on HN. I've added it to my reading list.
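If it helps make the surprise idea concrete, here is a minimal numpy sketch of a surprise-gated write to a linear key-value memory. The linear memory and the hand-picked constants are my own simplifications for illustration, not the paper's deep memory or its learned gates:

    # Toy sketch of a surprise-gated associative memory. Illustrative only:
    # the paper uses a learned deep memory with learned gates, while this
    # uses a plain linear map and hand-picked constants.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 16                       # key/value dimension
    M = np.zeros((d, d))         # linear associative memory: maps keys to values
    S = np.zeros_like(M)         # running "surprise" (momentum of the gradient)

    eta, theta, alpha = 0.9, 0.5, 0.01   # momentum, step size, forgetting rate

    def memory_step(M, S, k, v):
        """One test-time write: large prediction errors drive large updates."""
        err = M @ k - v                    # how badly the memory predicted v from k
        grad = np.outer(err, k)            # gradient of 0.5 * ||M k - v||^2 w.r.t. M
        S = eta * S - theta * grad         # past surprise plus momentary surprise
        M = (1.0 - alpha) * M + S          # decay (forgetting) plus the new write
        return M, S

    for _ in range(100):
        k = rng.normal(size=d)
        v = rng.normal(size=d)
        M, S = memory_step(M, S, k, v)

    retrieved = M @ rng.normal(size=d)     # reading is just a key lookup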

EDIT: I'm reminded of this other type of associative memory: https://github.com/glassroom/heinsen_routing. The idea there is to compute a mixture of memories that best predicts the given input sequence. Quite frankly, I don't remember how the whole thing works, but I do remember that it works. It's been a while since I used it, so YMMV. In any case, it may be of interest to you.

testfoo11111111

There's nothing "pretty conventional" about a neural memory mechanism that comes with such solid evidence of scalability and such appealing performance characteristics.

If neural memory were conventional, GPT-4o's memory wouldn't be stored as plain text and prepended to prompts.

This paper reminds me of the Switch Transformer paper: it solidifies, expands on, and proves out an area of research that may well have a big impact on leading LLMs and the SOTA in AI.

Agreed, the concept of surprise is very cool.

pizza

There definitely is precedent: any parallelizably decodable, CABAC-derived neural compression algorithm has a flavor of this idea at its heart, namely interspersing statistical state throughout your token stream so you can decouple novelty in your state space on the fly.

Taken to its extreme, where the ‘memory’ is descriptive enough to deterministically control the decoding, you get parallelism over the sequence for free as a consequence of the associativity.

Similar techniques are used to make video compression algorithms robust enough for low-latency reconnection in online streaming under poor or changing network conditions, or to make it possible to decompress JPEGs at more than 1 GB/s in parallel by exploiting ‘restart’ markers that indicate independent/novel substreams.
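As a concrete (and heavily simplified) illustration of the JPEG case: restart markers (RST0..RST7, bytes 0xFF 0xD0 through 0xFF 0xD7) mark points where the entropy coder's state is reset, so the scan can be split and each piece decoded independently. The sketch below only shows the splitting; it ignores byte stuffing, and decode_segment is a placeholder of my own, not a real decoder:

    # Split a JPEG entropy-coded scan at restart (RST0..RST7) markers so each
    # chunk could go to a separate worker. A real decoder must also locate the
    # scan inside the file and handle 0xFF 0x00 byte stuffing; decode_segment
    # stands in for the per-segment Huffman/MCU decoding.
    from concurrent.futures import ThreadPoolExecutor

    def split_at_restarts(scan: bytes) -> list[bytes]:
        chunks, start, i = [], 0, 0
        while i < len(scan) - 1:
            if scan[i] == 0xFF and 0xD0 <= scan[i + 1] <= 0xD7:  # restart marker
                chunks.append(scan[start:i])                     # independent segment
                i += 2                                           # skip the 2-byte marker
                start = i
            else:
                i += 1
        chunks.append(scan[start:])
        return chunks

    def decode_segment(segment: bytes):
        # Each segment starts with a reset DC predictor, so there is no
        # cross-segment state to serialize on.
        return len(segment)

    def parallel_decode(scan: bytes):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(decode_segment, split_at_restarts(scan)))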

That said, I do agree this is a great paper and a real contribution to language models!

cma

1991

> Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. This greatly facilitates downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses my neural knowledge distillation procedure of 1991 [UN0-UN2] to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods.

https://people.idsia.ch/~juergen/very-deep-learning-1991.htm...
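A toy way to see the "send only unexpected inputs upward" idea from the quote: the 1991 systems used trained RNN predictors, but even a trivial "predict that the last symbol repeats" rule shows how the higher level ends up seeing a compressed stream of surprises:

    # Toy illustration of passing only surprising inputs to the next level.
    # The lower-level "predictor" here is just "same as last time"; in the
    # 1991 chunker/automatiser systems it is a trained RNN.
    def chunk_surprising(sequence, threshold=0.5):
        higher_level_stream = []
        prev = None
        for x in sequence:
            surprise = float("inf") if prev is None else abs(x - prev)
            if surprise > threshold:          # unexpected -> forward upward
                higher_level_stream.append(x)
            prev = x
        return higher_level_stream

    seq = [0, 0, 0, 1, 1, 1, 1, 5, 5, 0]
    print(chunk_surprising(seq))  # prints [0, 1, 5, 0]: only the change points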