TIME WAIT BLOG.
#Inference June 25, 2026 11 MIN READ

LMCache Practical Guide: Reusing KV Cache in vLLM Inference Services

LMCache extracts reusable KV Cache from repeated prefills to cut vLLM TTFT—best for long prompts and high prefix overlap.

LMCache turns repeated prefills into reusable KV Cache—ideal for long prompts, RAG templates, and shared prefixes across vLLM instances.

What Is LMCache?

LMCache addresses repeated prefill work in LLM inference: when many requests share long prompt prefixes, it moves ephemeral KV Cache into a reusable, observable, scalable cache layer—cutting duplicate compute and lowering TTFT (time to first token).

Project and docs:

Do You Need LMCache?

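LMCache fits best when prompts are long and many requests share the same prefix: long system prompts, RAG templates, agent tool documentation, and several vLLM instances serving the same prefixes. Each case is covered later in this guide.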

If requests are mostly short with no shared prefixes, or the bottleneck is generation, gains may be modest. LMCache saves repeated prefill—it does not magically speed every token.

Start with vLLM MP Mode

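Two common integration patterns: MP mode, where LMCache runs as a standalone server and vLLM connects to it through LMCacheMPConnector, and in-process mode, where the cache lives inside the vLLM process through LMCacheConnectorV1.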

Try MP mode first. An independent cache service survives inference restarts more gracefully and simplifies admin APIs, metrics, and multi-instance sharing.

Installation

The official quickstart recommends Python 3.12.

With uv:

uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm

With a standard virtualenv:

python -m venv .venv
source .venv/bin/activate
pip install lmcache vllm

Pin versions in production—verify vLLM, LMCache, and connector compatibility before upgrading blindly.
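One way to record the versions you validated is to filter the output of pip freeze down to the two packages installed above. This is a minimal sketch: the function name pin_lmcache_stack is hypothetical, and it assumes both packages appear in the freeze output as plain name==version lines.

```shell
# pin_lmcache_stack: hypothetical helper. Reads `pip freeze` output on stdin
# and keeps only the exact version pins for lmcache and vllm.
pin_lmcache_stack() {
  grep -iE '^(lmcache|vllm)=='
}
```

Piping pip freeze (or uv pip freeze) through pin_lmcache_stack and saving the result next to your deployment config gives you a known-good pair to roll back to.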

Start the LMCache Server

Launch LMCache first:

lmcache server \
  --l1-size-gb 20 \
  --eviction-policy LRU \
  --chunk-size 16

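Parameters: --l1-size-gb sets the size of the L1 cache tier in GB (20 here), --eviction-policy selects how entries are evicted once that tier fills (LRU, least recently used), and --chunk-size sets the chunk size used for cached KV data (16 here).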

Note: a chunk size of 16 is demo-friendly. Production often stays with defaults such as 256. Smaller chunks give finer hit granularity but higher management overhead; larger chunks reduce overhead but can miss short shared prefixes entirely.
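The granularity trade-off can be made concrete with a little arithmetic. The sketch below assumes that chunk size is counted in tokens and that only complete chunks of a shared prefix can be reused; that is a simplification for illustration, not a statement about LMCache internals. The function name reusable_tokens is hypothetical.

```shell
# reusable_tokens PREFIX_TOKENS CHUNK_SIZE
# Assumption: a shared prefix is reusable only in whole chunks, so the
# reusable part is the prefix length rounded down to a chunk boundary.
reusable_tokens() {
  echo $(( $1 / $2 * $2 ))
}

reusable_tokens 200 16     # 12 whole chunks -> 192
reusable_tokens 200 256    # prefix shorter than one chunk -> 0
reusable_tokens 5000 256   # 19 whole chunks -> 4864
```

Under that assumption a 200-token shared prefix is almost fully reusable at chunk size 16 and not reusable at all at 256, while a 5000-token prefix loses under 3% at 256.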

Default ports:

Start vLLM with LMCache

In another terminal:

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

On vLLM 0.20.0 or newer, you can explicitly point at LMCache’s connector module:

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector", "kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector", "kv_role":"kv_both"}'

This picks up LMCache’s newer server protocol and fixes instead of relying only on vLLM’s bundled connector.

Verify Hits with Two Requests

The simplest cache test: send two requests that share a prefix.

First request:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Second request—same prefix, extended suffix:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
    "max_tokens": 100,
    "temperature": 0.7
  }'

When LMCache works, the first call logs something like Stored ... tokens; the second logs Retrieved ... tokens for the shared prefix.
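Besides reading the logs, you can time the two requests. The sketch below approximates TTFT as the total latency of a completion limited to one token, which is dominated by prefill. It assumes the same endpoint and model as the requests above. The helper names build_payload and ttft_seconds are hypothetical, and the prompt must not contain double quotes or backslashes because it is pasted into the JSON unescaped.

```shell
# build_payload PROMPT -> JSON body asking for a single output token.
# With max_tokens set to 1, total latency is a rough stand-in for TTFT.
build_payload() {
  printf '{"model": "Qwen/Qwen3-8B", "prompt": "%s", "max_tokens": 1, "temperature": 0}' "$1"
}

# ttft_seconds PROMPT -> total request time in seconds, as reported by curl.
ttft_seconds() {
  build_payload "$1" | curl -s -o /dev/null \
    -w '%{time_total}\n' \
    -H "Content-Type: application/json" \
    -d @- \
    http://localhost:8000/v1/completions
}
```

Call ttft_seconds once with the shorter prompt and once with the extended one, then compare the two numbers. With prompts this short the difference may be within noise; a prefix of several thousand tokens makes the effect easier to see.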

Reading Hit Logs

Don’t stop at seeing Retrieved. Also check:

Interpret hits relative to prompt length. Hitting 4000 of 5000 tokens may drop TTFT noticeably; hitting 32 of 200 tokens may feel negligible.
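The two cases above work out as follows. This is a small sketch using integer arithmetic; the function name hit_pct is hypothetical, and the token counts are the ones from the paragraph above.

```shell
# hit_pct HIT_TOKENS PROMPT_TOKENS -> share of the prompt served from cache,
# as a whole-number percentage (rounded down).
hit_pct() {
  if [ "$2" -gt 0 ]; then
    echo $(( 100 * $1 / $2 ))
  else
    echo 0
  fi
}

hit_pct 4000 5000   # 80
hit_pct 32 200      # 16
```

An 80% hit removes most of the prefill work for that request, while a 16% hit leaves the bulk of the prompt to be computed as before.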

Workloads That Benefit Most

Long System Prompts

Enterprise assistants often ship fixed role rules, tool docs, and output formats. Identical blocks across requests are prime cache candidates.

RAG Templates

RAG isn’t just retrieved chunks—fixed templates, citation rules, and answer constraints repeat too. High retrieval overlap or users querying the same corpus repeatedly amplifies LMCache value.

Agent Tool Documentation

Agents embed tool lists, call rules, and error-handling policies—long, repetitive prefixes by design.

Multi-Instance Inference

A single vLLM process has its own prefix cache, but scaling out fragments that cache across instances. A standalone LMCache layer lets the instances share cached prefixes.

Don’t Start with a Complex Backend

LMCache supports CPU RAM, local disk, Redis/Valkey, S3-compatible object storage, Mooncake, InfiniStore, NIXL, and more.

Don’t jump to distributed storage on day one. Safer path:

  1. Run locally on CPU RAM and validate the pipeline.
  2. Watch hit rate and TTFT improvements.
  3. After proven business value, consider cross-machine sharing, persistence, and remote backends.

Before scaling, ask:

More complex cache tiers mean harder debugging. Prove prompt reuse in your workload first.

In-Process Mode

For local experiments:

LMCACHE_CHUNK_SIZE=8 \
vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

Simple, but cache travels with the vLLM process—less isolation on restart, crash, or scale-out. Fine for POCs; weak as a long-term production default.

Pre-Production Checklist

Run a small load test before rollout:

If real traffic shows low prefix overlap, the answer to whether to adopt LMCache is usually clear: not yet.
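A rough way to gauge prefix overlap before any load test is to scan a sample of logged prompts. The sketch below assumes a file with one prompt per line and uses the first N characters as a crude proxy for a shared token prefix; real hits depend on tokenization and chunk size. The function name prefix_reuse_pct and the default of 200 characters are hypothetical choices.

```shell
# prefix_reuse_pct FILE [N]
# Prints the percentage of prompts whose first N characters (default 200)
# already appeared at the start of an earlier prompt in the same file.
prefix_reuse_pct() {
  awk -v n="${2:-200}" '
    {
      key = substr($0, 1, n)        # leading characters of this prompt
      if (key in seen) hits++       # an earlier prompt started the same way
      seen[key] = 1
    }
    END {
      if (NR > 0) printf "%d\n", 100 * hits / NR
      else print 0
    }
  ' "$1"
}
```

A result near zero on a representative sample suggests little to reuse; a high value is a reason to proceed to the load test, not proof of a TTFT gain.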

Common Pitfalls

Treating LMCache as a response cache

LMCache stores KV Cache, not final answers. It cuts prefix reprocessing cost—it does not guarantee identical outputs.

Obsessing over tokens/s only

LMCache mainly affects TTFT and prefill cost. Decode tokens/s may barely move.

Copying demo chunk-size

Small chunks in demos make hits visible. Tune chunk size in production from prompt length, overlap patterns, memory, and throughput—don’t copy demo values blindly.

Version skew

vLLM, LMCache, and connectors have compatibility matrices. Read docs and release notes before upgrading production.

No observability

Cache without metrics is hard to debug. At minimum: hit rate, hit tokens, read/write latency, capacity, evictions, and error logs.
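As a starting point on the vLLM side, mean TTFT can be derived from Prometheus-format metrics text. The sketch below assumes the serving process exposes a /metrics endpoint containing a histogram named vllm:time_to_first_token_seconds; verify the metric name against your vLLM version. The function name mean_ttft is hypothetical, and LMCache's own metrics are not covered here.

```shell
# mean_ttft: reads Prometheus text on stdin and prints sum / count of the
# TTFT histogram, i.e. the mean TTFT in seconds since the process started.
# Assumption: the histogram is named vllm:time_to_first_token_seconds.
mean_ttft() {
  awk '
    /^vllm:time_to_first_token_seconds_sum/   { sum += $NF }
    /^vllm:time_to_first_token_seconds_count/ { count += $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "n/a"
    }
  '
}
```

Feed it with the output of curl -s http://localhost:8000/metrics. Because the value is cumulative, compare readings taken before and after a test run rather than reading a single number in isolation.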

Summary

LMCache suits LLM serving where long prompts repeat. It is not a universal speed button—it converts repeated prefill into reusable KV Cache.

If you already run vLLM: start with MP mode locally, confirm hits with two shared-prefix requests, then compare TTFT on real traffic. Only strong hit rates justify production adoption.
