Towards Memory Specialization: A Case for Long-Term and Short-Term RAM
16 comments
September 1, 2025
pfdietz
Wouldn't a generational garbage collector automatically separate objects into appropriate lifetime categories?
Animats
What they seem to want is fast-read, slow-write memory. "Primary applications include model weights in ML inference, code pages, hot instruction paths, and relatively static data pages". Is there device physics for cheaper, smaller fast-read, slow-write memory cells for that?
For "hot instruction paths", caching is already the answer. Not sure about locality of reference for model weights. Do LLMs blow the cache?
bobmcnamara
> Do LLMs blow the cache?
Sometimes very yes?
If you've got 1GB of weights, those are coming through the caches on their way to the execution unit somehow.
Many caches are smart enough to recognize these accesses as a strided, streaming, heavily prefetchable, evictable read, and optimize for that.
Many models are now quantized too, to reduce the overall memory bandwidth needed for execution, which also helps with caching.
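To make that concrete, here's a minimal sketch (GCC/Clang builtins and a toy int8 dot product I made up, not anything from the article) of a weight-streaming loop hinting that its reads have no temporal locality:

    #include <stddef.h>
    #include <stdint.h>

    int32_t dot_q8(const int8_t *w, const int8_t *x, size_t n)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            /* Once per 64-byte line, prefetch ~4 lines ahead; rw=0 (read),
               locality=0 ("no temporal locality" -> evict early). A prefetch
               past the end of the array never faults, so this is safe-ish. */
            if ((i & 63) == 0)
                __builtin_prefetch(&w[i + 256], 0, 0);
            acc += (int32_t)w[i] * (int32_t)x[i];
        }
        return acc;
    }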
toast0
Probably not what they want, but NOR flash is generally directly addressable, it's commonly used to replace mask roms.
bobmcnamara
NOR is usually limited to <30 MHz, but if you always want to fetch an entire cacheline and you design the read port for that, you can fetch the whole cacheline at once, so that's pretty neat.
I don't know if anyone has applied this to neural networks.
gary_0
> device physics for cheaper, smaller
And lower power usage. Datacenters and mobile devices will always want that.
photochemsyn
Yes, this from the paper:
> "The key insight motivating LtRAM is that long data lifetimes and read heavy access patterns allow optimizations that are unsuitable for general purpose memories. Primary applications include model weights in ML inference, code pages, hot instruction paths, and relatively static data pages—workloads that can tolerate higher write costs in exchange for lower read energy and improved cost per bit. This specialization addresses fundamental mismatches in current systems where read intensive data competes for the same resources as frequently modified data."
Essentially I guess they're calling for more specific hardware for LLM tasks, much like was done with all the networking equipment for dedicated packet processing with specialized SRAM/DRAM/TCAM tiers to keep latency to a minimum.
While there's an obvious need for this for traffic flow across the internet, the practical question is whether LLMs are really going to scale like that, or whether there's a massive AI/LLM bubble about to pop. Who knows? The tea leaves are unclear.
dooglius
I'm not seeing the case for adding this to general-purpose CPUs/software. Only a small portion of software is going to be able to be properly annotated to take advantage of this, so it'd be a pointless cost for the rest of users. Short-term accesses can also easily become long-term in the tail, if the process gets preempted by something higher priority or spends a lot of time on an I/O operation. It's also not clear why, if you had an efficient solution for the short-term case, you wouldn't just add a refresh cycle and use it in place of normal SRAM as generic cache. These make a lot more sense in a dedicated hardware context -- like neural nets -- which I think is the authors' main target here.
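For what it's worth, something vaguely like that annotation is already expressible with Linux's NUMA/tiering APIs. A rough sketch, assuming (purely for illustration) that node 1 is a cheap, read-optimized tier:

    #include <stddef.h>
    #include <numaif.h>      /* mbind(), MPOL_BIND -- libnuma, link with -lnuma */
    #include <sys/mman.h>    /* mmap(), madvise() */

    /* Allocate a read-mostly region (e.g. model weights) and bind it to a
       slower/cheaper memory node. "Node 1 = read-optimized tier" is an
       assumption for the sketch; real node numbering depends on the box. */
    void *alloc_read_mostly(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;

        unsigned long nodemask = 1UL << 1;                  /* node 1 only */
        mbind(p, bytes, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
        madvise(p, bytes, MADV_SEQUENTIAL);   /* hint: sequential, read-heavy */
        return p;
    }

Whether that kind of hint ever gets written outside a handful of inference runtimes is exactly the question, though.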
gary_0
> Only a small portion of software is going to be able to be properly annotated to take advantage of this
The same could be said for, say, SIMD/vectorization, which 99% of ordinary application code has no use for, but it quietly provides big performance benefits whenever you resample an image, or use a media codec, or display 3D graphics, or run a small AI model on the CPU, etc. There are lots of performance microfeatures like this that may or may not be worth it to include in a system, but just because they are only useful in certain very specific cases does not mean they should be dismissed out of hand. Sometimes the juice is worth the squeeze (and sometimes not, but you can't know for sure unless you put it out into the world and see if people use it).
dooglius
That's fair, I'm implicitly assuming the area cost for this dedicated memory would be much larger than that of e.g. SIMD vector banks.
gary_0
The existence of SIMD has knock-on effects on the design of the execution unit and the FPUs, though, since it's usually the only way to fully utilize them for float/arithmetic workloads. And newer SIMD features like AVX/AVX2 have a pretty big effect on the whole CPU design; it was widely reported that Intel and AMD went to a lot of trouble to make it viable, even though most software probably isn't even compiled with AVX support enabled.
Also SIMD is just one example. Modern DMA controllers are probably another good example but I know less about them (although I did try some weird things with the one in the Raspberry Pi). Or niche OS features like shared memory--pipes are usually all you need, and don't break the multitasking paradigm, but in the few cases where shared memory is needed it speeds things up tremendously.
staindk
Sounds a bit like Intel's Optane, which seemed great in principle but I never had a use for it.
https://www.intel.com/content/www/us/en/products/details/mem...
esseph
Used a lot with giant SAP HANA systems
Grosvenor
I'll put the Tandem 5 minute rule paper here; it seems very relevant.
https://dsf.berkeley.edu/cs286/papers/fiveminute-tr1986.pdf
and a revisit of the rule 20 years later (It still held).
https://cs-people.bu.edu/mathan/reading-groups/papers-classi...
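For anyone who hasn't read it: the rule reduces to a break-even re-reference interval, (pages per MB of RAM / accesses per second per disk) × (price per disk / price per MB of RAM). A tiny sketch with placeholder prices (not the paper's exact 1986 figures) lands in the minutes range, which is the whole point:

    #include <stdio.h>

    int main(void)
    {
        /* Gray & Putzolu break-even formula; the prices below are
           illustrative placeholders, not the paper's exact numbers. */
        double pages_per_mb_ram = 1024.0;   /* 1 KB pages                   */
        double disk_ios_per_sec = 15.0;     /* random accesses per disk arm */
        double price_per_disk   = 15000.0;  /* $ per drive                  */
        double price_per_mb_ram = 5000.0;   /* $ per MB of RAM              */

        double break_even_s = (pages_per_mb_ram / disk_ios_per_sec)
                            * (price_per_disk / price_per_mb_ram);

        /* Pages re-referenced more often than this should stay in RAM. */
        printf("break-even interval: %.0f seconds\n", break_even_s);
        return 0;
    }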
meling
Are there new physics on the horizon that could pave the way for new memory technologies?
In the microcontroller world, there's already asymmetric RAM like this, although it's all based on the same (SRAM) technology, and the distinction is around the topology.

You have TCM directly coupled to the core, then you generally have a few SRAM blocks attached to an AXI crossbar (so that if software running on different µc cores doesn't simultaneously access the same block, you get non-interference on timing; simultaneous access is still allowed, but at the cost of predictable timing), and then a few more SRAM blocks attached a couple of AXI bridges away (from the point of view of a core; for example, closer to a DMA engine, a low-power core, or another peripheral that masters the bus).

You can choose to ignore this, but for maximum performance and (more importantly) maximum timing determinism, understanding what is in which block is key. And that's without getting into EMIFs and off-chip SRAM and DRAM, or XIP out of various NVM technologies...
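A minimal sketch of how that placement usually gets expressed (GCC section attributes; the section names and the linker-script regions backing them are assumptions that vary by part and BSP):

    #include <stdint.h>

    /* DTCM: single-cycle, deterministic access from this core only. */
    __attribute__((section(".dtcm_data")))
    static int16_t fir_coeffs[64];

    /* AXI SRAM block sitting closer to the DMA engine on the crossbar. */
    __attribute__((section(".axi_sram2_bss")))
    static uint8_t dma_rx_buf[4096];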