LMCache turns repeated prefills into reusable KV Cache—ideal for long prompts, RAG templates, and shared prefixes across vLLM instances.
What Is LMCache?
LMCache addresses repeated prefill work in LLM inference: when many requests share long prompt prefixes, it moves ephemeral KV Cache into a reusable, observable, scalable cache layer—cutting duplicate compute and lowering TTFT (time to first token).
Do You Need LMCache?
LMCache fits best when:
- Prompts are long and many requests share large prefix segments.
- RAG requests repeatedly inject similar document chunks or fixed templates.
- Multi-turn conversations carry long history that later turns reuse.
- Agent system prompts, tool docs, and policy text are lengthy.
- Multiple vLLM instances should share cache instead of each process recomputing alone.
- You care about TTFT, not just decode-stage tokens/s.
If requests are mostly short with no shared prefixes, or the bottleneck is generation, gains may be modest. LMCache saves repeated prefill—it does not magically speed every token.
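A rough way to check for shared prefixes before installing anything is to measure how much of each prompt repeats the start of an earlier one. The sketch below is a minimal illustration, assuming prompts are available as plain strings; it compares characters rather than tokens, so the result is only a proxy for real cache hits. The function names are illustrative and not part of LMCache.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Length in characters of the longest common prefix of a and b."""
    return len(os.path.commonprefix([a, b]))


def mean_prefix_overlap(prompts: list[str]) -> float:
    """Average share of each prompt covered by its best-matching earlier prompt.

    Character-level proxy only; real cache hits are counted in tokens.
    """
    ratios = []
    for i, prompt in enumerate(prompts):
        if i == 0 or not prompt:
            continue  # the first prompt has nothing earlier to reuse
        best = max(shared_prefix_len(prompt, earlier) for earlier in prompts[:i])
        ratios.append(best / len(prompt))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

A value near 1.0 on a sample of logged prompts suggests heavy prefix reuse; a value near 0 suggests LMCache will have little to work with.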
Start with vLLM MP Mode
Two common integration patterns:
- MP mode: LMCache runs as a standalone service; vLLM connects via
LMCacheMPConnector. - In-process mode: LMCache lives inside the vLLM process—good for quick experiments.
Try MP mode first. An independent cache service survives inference restarts more gracefully and simplifies admin APIs, metrics, and multi-instance sharing.
Installation
The official quickstart recommends Python 3.12.
With uv:
uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm
With a standard virtualenv:
python -m venv .venv
source .venv/bin/activate
pip install lmcache vllm
Pin versions in production—verify vLLM, LMCache, and connector compatibility before upgrading blindly.
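To know what to pin, it helps to record the versions that are actually installed in the environment you tested. A minimal sketch using only the standard library; the helper name is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print pip-style pins for the two packages installed above.
for name in ("vllm", "lmcache"):
    print(f"{name}=={installed_version(name)}")
```

The printed lines can be copied into a requirements file once the combination has been verified to work together.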
Start the LMCache Server
Launch LMCache first:
lmcache server \
--l1-size-gb 20 \
--eviction-policy LRU \
--chunk-size 16
Parameters:
- --l1-size-gb 20: allocate 20 GB for local L1 cache.
- --eviction-policy LRU: evict least-recently-used entries when full.
- --chunk-size 16: small chunks for demos so short prompts still show hit logs.
Note: chunk-size 16 is demo-friendly. Production often uses defaults like 256. Smaller chunks mean finer hit granularity but higher management overhead; larger chunks reduce overhead but may fail to hit on short shared prefixes.
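The granularity trade-off can be made concrete with a little arithmetic. The sketch below assumes that a shared prefix is reusable only in whole chunks, which is a simplified model of chunked caching and not a description of LMCache internals; the function name is illustrative.

```python
def reusable_prefix_tokens(shared_prefix_tokens: int, chunk_size: int) -> int:
    """Tokens of a shared prefix that fall into complete chunks.

    Simplified model: assumes only whole chunks can be reused.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (shared_prefix_tokens // chunk_size) * chunk_size


# Compare the demo chunk size with a larger one for a 40-token shared prefix.
for chunk_size in (16, 256):
    print(chunk_size, reusable_prefix_tokens(40, chunk_size))
```

Under this model, a 40-token shared prefix yields 32 reusable tokens with a chunk size of 16 and none with a chunk size of 256, which is why the demo uses the small value.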
Default ports:
- ZMQ: 5555
- HTTP admin and metrics: 8080
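Before starting vLLM, a quick reachability check on both ports can save some debugging. This is a minimal sketch, assuming the server listens over TCP on localhost; it only confirms that something accepts connections, not that it is LMCache.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Port numbers are the defaults listed above; the host is an assumption.
for label, port in (("ZMQ", 5555), ("HTTP admin/metrics", 8080)):
    status = "open" if port_is_open("localhost", port) else "closed"
    print(f"{label} port {port}: {status}")
```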
Start vLLM with LMCache
In another terminal:
vllm serve Qwen/Qwen3-8B \
--port 8000 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
On vLLM 0.20.0 or newer, you can explicitly point at LMCache’s connector module:
vllm serve Qwen/Qwen3-8B \
--port 8000 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector", "kv_role":"kv_both"}'
This picks up LMCache’s newer server protocol and fixes instead of relying only on vLLM’s bundled connector.
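Hand-written JSON inside shell quotes is easy to get wrong. One option is to generate the --kv-transfer-config value programmatically. The sketch below uses only the keys and values shown in the commands above; the helper name is illustrative.

```python
import json
import shlex
from typing import Optional


def kv_transfer_config(module_path: Optional[str] = None) -> str:
    """Build the JSON string passed to --kv-transfer-config."""
    config = {"kv_connector": "LMCacheMPConnector", "kv_role": "kv_both"}
    if module_path:
        # Optional explicit connector module, as in the second command above.
        config["kv_connector_module_path"] = module_path
    return json.dumps(config)


command = [
    "vllm", "serve", "Qwen/Qwen3-8B",
    "--port", "8000",
    "--kv-transfer-config",
    kv_transfer_config("lmcache.integration.vllm.lmcache_mp_connector"),
]
# shlex.join quotes the JSON argument safely for a POSIX shell.
print(shlex.join(command))
```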
Verify Hits with Two Requests
The simplest cache test: send two requests that share a prefix.
First request:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
"max_tokens": 100,
"temperature": 0.7
}'
Second request—same prefix, extended suffix:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
"max_tokens": 100,
"temperature": 0.7
}'
When LMCache works, the first call logs something like Stored ... tokens; the second logs Retrieved ... tokens for the shared prefix.
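Logs show that a hit happened; timing shows whether it helped. The sketch below approximates TTFT from the client side by sending the same kind of request with streaming enabled and timing the first streamed data line. It assumes the vLLM server from the commands above is running on localhost:8000; the function names are illustrative, and nothing runs until run_demo() is called.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # endpoint from the curl examples
MODEL = "Qwen/Qwen3-8B"


def build_payload(prompt: str, max_tokens: int = 100) -> bytes:
    """JSON body for a streaming completion request."""
    body = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": True,  # streaming lets the client time the first chunk
    }
    return json.dumps(body).encode("utf-8")


def time_to_first_chunk(prompt: str, url: str = BASE_URL) -> float:
    """Seconds from sending the request to the first streamed data line."""
    request = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=120) as response:
        for raw_line in response:
            if raw_line.startswith(b"data:"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any data line arrived")


def run_demo() -> None:
    """Send the two shared-prefix prompts and print both timings."""
    prefix = (
        "Qwen3 is the latest generation of large language models in Qwen series, "
        "offering a comprehensive suite of dense and mixture-of-experts"
    )
    for label, prompt in (("first", prefix), ("second", prefix + " (MoE) models")):
        print(f"{label}: {time_to_first_chunk(prompt):.3f}s")
```

With a prompt this short the difference may be within noise; the same functions are more informative when pointed at prompts of realistic length.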
Reading Hit Logs
Don’t stop at seeing Retrieved. Also check:
- How many tokens hit.
- Where data was fetched from—CPU RAM, local disk, remote storage.
- Whether retrieval latency beats recomputing prefill.
- Whether only a tiny segment hit, limiting real benefit.
- Whether chunk alignment prevented a clean hit on similar prefixes.
Interpret hits relative to prompt length. Hitting 4000 of 5000 tokens may drop TTFT noticeably; hitting 32 of 200 tokens may feel negligible.
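The two cases above work out as follows. A minimal sketch; the function name is illustrative, and the ratio is an upper bound on the prefill work skipped, not a TTFT prediction.

```python
def hit_ratio(hit_tokens: int, prompt_tokens: int) -> float:
    """Fraction of the prompt served from cache."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return hit_tokens / prompt_tokens


for hit, total in ((4000, 5000), (32, 200)):
    remaining = total - hit  # tokens that still need prefill
    print(f"{hit}/{total} hit: ratio {hit_ratio(hit, total):.0%}, {remaining} tokens left to prefill")
```

The first case skips 80% of a large prompt; the second skips 16% of a prompt that was cheap to prefill in the first place.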
Workloads That Benefit Most
Long System Prompts
Enterprise assistants often ship fixed role rules, tool docs, and output formats. Identical blocks across requests are prime cache candidates.
RAG Templates
RAG isn’t just retrieved chunks: fixed templates, citation rules, and answer constraints repeat too. High retrieval overlap, or users querying the same corpus repeatedly, amplifies LMCache’s value.
Agent Tool Documentation
Agents embed tool lists, call rules, and error-handling policies—long, repetitive prefixes by design.
Multi-Instance Inference
A single vLLM process has its own prefix cache, but scaling out fragments that cache across instances. A standalone LMCache layer lets instances share cached KV instead.
Don’t Start with a Complex Backend
LMCache supports CPU RAM, local disk, Redis/Valkey, S3-compatible object storage, Mooncake, InfiniStore, NIXL, and more.
Don’t jump to distributed storage on day one. Safer path:
- Run locally on CPU RAM and validate the pipeline.
- Watch hit rate and TTFT improvements.
- After proven business value, consider cross-machine sharing, persistence, and remote backends.
Before scaling, ask:
- Must cache be shared across machines?
- Is persistence required?
- Will network transfer erase cache savings?
- Does eviction policy match access patterns?
More complex cache tiers mean harder debugging. Prove prompt reuse in your workload first.
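The network question can be estimated before building anything. The sketch below compares the time to move cached KV over a link with the time to recompute it. The per-token size uses the standard formula (keys and values for every layer); the model shape, bandwidth, and prefill throughput are illustrative assumptions, so read the real values from your model's config and your own measurements. Latency, serialization, and compression are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_seconds(tokens: int, bytes_per_token: int, bandwidth_gbps: float) -> float:
    """Time to move the KV for `tokens` over a link of the given Gbit/s."""
    bandwidth_bytes_per_s = bandwidth_gbps * 1e9 / 8
    return tokens * bytes_per_token / bandwidth_bytes_per_s


def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to recompute prefill for `tokens` at a measured throughput."""
    return tokens / prefill_tokens_per_s


# Illustrative numbers only: 36 layers, 8 KV heads, head_dim 128, 2-byte dtype.
per_token = kv_bytes_per_token(36, 8, 128)
for gbps in (10, 100):
    fetch = transfer_seconds(4000, per_token, gbps)
    recompute = recompute_seconds(4000, 10_000)  # assumed 10k tokens/s prefill
    print(f"{gbps} Gbit/s: fetch {fetch:.3f}s vs recompute {recompute:.3f}s")
```

With these numbers, fetching 4000 tokens of KV takes about 0.47 s at 10 Gbit/s against 0.4 s to recompute, so a remote tier would not pay off; at 100 Gbit/s the fetch drops to about 0.05 s.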
In-Process Mode
For local experiments:
LMCACHE_CHUNK_SIZE=8 \
vllm serve Qwen/Qwen3-8B \
--port 8000 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Simple, but cache travels with the vLLM process—less isolation on restart, crash, or scale-out. Fine for POCs; weak as a long-term production default.
Pre-Production Checklist
Run a small load test before rollout:
- Compare TTFT with LMCache on vs off.
- Measure prefill latency separately—not just end-to-end time.
- Track prefix cache hit tokens and hit ratio.
- Watch LMCache memory stability.
- Test cache reuse after vLLM restarts.
- Verify ZMQ, HTTP metrics, and logs under concurrency.
- Use real production prompts—not demo text only.
If real traffic shows low prefix overlap, the answer is usually clear: not yet.
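For the on-versus-off comparison, a single average hides tail behavior, so it helps to report the median and a high percentile. A minimal sketch, assuming TTFT samples in seconds have already been collected for both runs; the function names are illustrative.

```python
import math
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Median and nearest-rank 95th percentile of latency samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank method
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}


def ttft_speedup(baseline: list[float], cached: list[float]) -> float:
    """Ratio of median TTFT without the cache to median TTFT with it."""
    return summarize(baseline)["median"] / summarize(cached)["median"]
```

A speedup close to 1.0 on production-like prompts is a sign that prefix reuse is too low to justify the extra moving parts.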
Common Pitfalls
Treating LMCache as a response cache
LMCache stores KV Cache, not final answers. It cuts prefix reprocessing cost—it does not guarantee identical outputs.
Obsessing over tokens/s only
LMCache mainly affects TTFT and prefill cost. Decode tokens/s may barely move.
Copying demo chunk-size
Small chunks in demos make hits visible. Tune chunk size in production from prompt length, overlap patterns, memory, and throughput—don’t copy demo values blindly.
Version skew
vLLM, LMCache, and connectors have compatibility matrices. Read docs and release notes before upgrading production.
No observability
Cache without metrics is hard to debug. At minimum: hit rate, hit tokens, read/write latency, capacity, evictions, and error logs.
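As a starting point, a hit ratio can be derived from two counters scraped off the metrics endpoint. The sketch below assumes the endpoint returns Prometheus-style text, which the text above does not confirm, and the metric names are hypothetical placeholders rather than real LMCache metrics. The parser is simplified: it skips comments and assumes sample lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus-style text into {series: value} (simplified)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, raw_value = line.rpartition(" ")
        try:
            values[series] = float(raw_value)
        except ValueError:
            continue  # not a sample line this sketch understands
    return values


# Hypothetical metric names, used only to illustrate the calculation.
SAMPLE = """\
# HELP example_hit_tokens_total Hypothetical counter of tokens served from cache
example_hit_tokens_total 4000
example_requested_tokens_total 5000
"""

metrics = parse_metrics(SAMPLE)
ratio = metrics["example_hit_tokens_total"] / metrics["example_requested_tokens_total"]
print(f"hit ratio: {ratio:.0%}")
```

Replace the sample text with the body fetched from your own metrics endpoint, and the placeholder names with whatever counters your LMCache version exposes.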
Summary
LMCache suits LLM serving where long prompts repeat. It is not a universal speed button—it converts repeated prefill into reusable KV Cache.
If you already run vLLM: start with MP mode locally, confirm hits with two shared-prefix requests, then compare TTFT on real traffic. Only strong hit rates justify production adoption.