Lossless LLM 3x Throughput Increase by LMCache
5 comments · June 24, 2025 · lihanc111
dist-epoch
How is it possible to do non-prefix KV cache? I was under the impression that the V for one token potentially depends on the V of all previous ones.
da-x
Yes, there is KV cache "blending"; see [1].
Future versions of LMCache are aiming to support this.
[1] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444
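Very roughly, the paper's trick is to reuse per-chunk KV caches computed in isolation and then re-prefill only a small fraction of token positions, chosen by how much their cheap-to-check first-layer KV deviates from the exact values, to restore the missing cross-chunk attention. A minimal sketch of just that selection step, with random tensors standing in for real KV values (not LMCache's actual implementation):

```python
# Illustrative sketch of CacheBlend-style token selection; tensors are stand-ins.
import torch

def select_tokens_to_recompute(cached_kv_l1, exact_kv_l1, ratio=0.15):
    # ratio is a hypothetical knob: fraction of tokens re-prefilled in deeper layers.
    deviation = (cached_kv_l1 - exact_kv_l1).norm(dim=-1)  # [num_tokens]
    k = max(1, int(ratio * deviation.numel()))
    return deviation.topk(k).indices

num_tokens, d = 1024, 128
cached_kv_l1 = torch.randn(num_tokens, d)                       # reused per-chunk KV
exact_kv_l1 = cached_kv_l1 + 0.01 * torch.randn(num_tokens, d)  # stand-in for a real layer-1 pass
exact_kv_l1[50:60] += 1.0                                       # tokens hurt by missing cross-chunk attention

positions = select_tokens_to_recompute(cached_kv_l1, exact_kv_l1)
print(f"re-prefill {positions.numel()} of {num_tokens} tokens")
```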
0xjunhao
Hi, I had a quick question. Would it be correct to say the following?
1. For long inputs and short outputs, inference can be arbitrarily many times faster, since it avoids repeated KV computation.
2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of execution.
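For intuition on (1) and (2), here is a rough back-of-envelope time-to-first-token model; all numbers are illustrative assumptions, not LMCache measurements:

```python
# Toy TTFT model: prefill cost scales with input length, a cache hit replaces
# it with a KV transfer. Throughput, KV size, and bandwidth are assumed values.
def ttft_seconds(input_tokens, cache_hit,
                 prefill_tok_per_s=8_000,     # assumed GPU prefill speed
                 kv_bytes_per_token=128e3,    # assumed KV size per token
                 load_bw_bytes_per_s=20e9):   # assumed DRAM-to-GPU bandwidth
    if cache_hit:
        return input_tokens * kv_bytes_per_token / load_bw_bytes_per_s
    return input_tokens / prefill_tok_per_s

# (1) Long input: loading the cache (~0.2 s) beats recomputing it (~4 s),
#     and the gap grows with input length.
print(ttft_seconds(32_000, cache_hit=False), ttft_seconds(32_000, cache_hit=True))

# (2) Short input: both paths take milliseconds, decode dominates end-to-end
#     latency, and any extra store/lookup work sits on the critical path.
print(ttft_seconds(200, cache_hit=False), ttft_seconds(200, cache_hit=True))
```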
iLoveOncall
Is this any different than prompt caching?
Our team built LMCache, an open-source project that reduces repetitive computation in LLM inference so serving systems can handle more users (3x higher throughput in chat applications). It is already used in IBM's open-source LLM inference stack.
In LLM serving, the input is computed into intermediate states called the KV cache, which are then used to generate answers. This data is relatively large (~1-2 GB for a long context) and is often evicted when GPU memory runs short. When that happens and a user asks a follow-up question, the serving engine has to recompute the same KV cache. LMCache avoids this by efficiently offloading KV caches to DRAM and disk and loading them back when needed.
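To make the numbers and the idea concrete, here is a back-of-envelope size estimate and a toy two-tier store (assumed Llama-3-8B-class shapes; this is only an illustration, not LMCache's actual code or API):

```python
# KV-cache size estimate plus a toy DRAM/disk store keyed by a hash of the
# token prefix a block covers. Shapes are assumed; purely illustrative.
import hashlib, tempfile
from pathlib import Path

import torch

def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return num_tokens * 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

print(kv_cache_bytes(16_000) / 1e9, "GB for a 16K-token context")  # ~2.1 GB

class ToyKVStore:
    """Two tiers: hot entries stay in DRAM, overflow is spilled to disk."""
    def __init__(self, dram_limit=4):
        self.dram, self.dram_limit = {}, dram_limit
        self.disk_dir = Path(tempfile.mkdtemp())

    @staticmethod
    def key(token_ids):
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def put(self, token_ids, kv_tensor):
        k = self.key(token_ids)
        if len(self.dram) < self.dram_limit:
            self.dram[k] = kv_tensor
        else:
            torch.save(kv_tensor, self.disk_dir / k)       # offload to disk

    def get(self, token_ids):
        k = self.key(token_ids)
        if k in self.dram:
            return self.dram[k]
        path = self.disk_dir / k
        return torch.load(path) if path.exists() else None  # miss -> recompute

store = ToyKVStore()
prefix = list(range(100))
store.put(prefix, torch.zeros(100, 8, 128))   # offloaded after the first turn
assert store.get(prefix) is not None          # reused on the follow-up question
```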
Ask us anything!