DeepSeek Context Caching: 50x Cost Reduction Patterns
Master DeepSeek's automatic disk-based context caching. Prefix-exact-match mechanics, cache-aware prompt design, pre-warming strategies, and monitoring cache hit rates for maximum cost savings.
DeepSeek's context caching is the most underappreciated feature in the entire API. It is automatic, enabled by default, requires zero code changes, and cuts the price of cache-hit input tokens by 50x on Flash, from $0.14 to $0.0028 per million tokens. But matching is prefix-exact, so your prompt structure directly determines whether you get the discount or pay full price.
How DeepSeek Context Caching Works
DeepSeek's KV cache is disk-based (not in-memory like Claude's prompt caching). Each request automatically persists "cache prefix units" to disk. Subsequent requests hit the cache only when their prefix EXACTLY matches a previously persisted unit.
Cache Persistence Triggers
- Request boundary persistence: Each request creates cache units at the end of user input and model output
- Common prefix detection: When the system detects the same prefix across multiple requests, it persists that prefix
- Fixed-interval persistence: For very long inputs, the system carves cache units at fixed token intervals
Critical: Cache Hit Requires Full Prefix Match
Request 1: [System: "You are a helpful assistant"] + [User: "Document A"]
Request 2: [System: "You are a helpful assistant"] + [User: "Document B"]
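Result: NO CACHE HIT on Request 2. The two requests share only the system prompt, and the cache unit written by Request 1 ends after "Document A", so Request 2's prefix does not match it.
Once the system has seen the same system prompt across two or more requests, it detects the common prefix and persists it, so later requests can hit on the system prompt alone.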
Request 1: [System: "You are a helpful assistant"] + [User: "Document A"] + "Summarize"
Request 2: [System: "You are a helpful assistant"] + [User: "Document A"] + "Analyze"
Result: CACHE HIT on [System + Document A] — only "Summarize" vs "Analyze" differ.
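The two comparisons above can be mimicked with a toy model. This is only a sketch of the prefix-match idea, not DeepSeek's implementation: the real service matches on tokens and persisted cache units, while the hypothetical `shared_prefix_len` helper below compares whole message segments.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], incoming: Sequence[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    count = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        count += 1
    return count


SYSTEM = "You are a helpful assistant"

# First pair: only the system prompt is shared.
req_1 = [SYSTEM, "Document A"]
req_2 = [SYSTEM, "Document B"]
print(shared_prefix_len(req_1, req_2))  # 1

# Second pair: the system prompt and the document are shared.
req_3 = [SYSTEM, "Document A", "Summarize"]
req_4 = [SYSTEM, "Document A", "Analyze"]
print(shared_prefix_len(req_3, req_4))  # 2
```

The longer the shared leading run, the more of the request is eligible for the cache-hit price. Anything after the first difference is billed as a miss.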
Cache-Aware Prompt Design
Pattern 1: System Prompt as Cache Anchor
Make your system prompt the unchanging cache prefix:
SYSTEM_PROMPT = "You are a document analysis assistant. Analyze the attached document." # NEVER change this
# These will share a cache prefix (the system prompt):
messages_1 = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"{document}\n\nSummarize this document."}
]
messages_2 = [
    {"role": "system", "content": SYSTEM_PROMPT},  # Identical system prompt, so the prefix matches from the first token
    {"role": "user", "content": f"{document}\n\nExtract key entities."}
]
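One way to keep the prefix from drifting is to build every request through a single helper, so the static parts are always emitted in the same order with the same separators. The sketch below assumes a hypothetical `build_messages` function and a placeholder document. It uses `os.path.commonprefix` only to show which characters the two requests share.

```python
import os

SYSTEM_PROMPT = "You are a document analysis assistant. Analyze the attached document."


def build_messages(document: str, question: str) -> list[dict]:
    """Static parts first (system prompt, document), variable question last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{document}\n\n{question}"},
    ]


document = "Quarterly report text goes here."  # placeholder document
a = build_messages(document, "Summarize this document.")
b = build_messages(document, "Extract key entities.")

# The system messages are identical, and the user messages share the
# document (plus the separator) as a character-level prefix.
shared = os.path.commonprefix([a[1]["content"], b[1]["content"]])
print(a[0] == b[0], shared.startswith(document))  # True True
```

Because the question is always appended last, changing it never disturbs the text that precedes it.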
Pattern 2: Static Document + Variable Question
The ideal cache pattern — load the document once, ask many questions:
DOCUMENT = load_large_document() # 50K tokens of static content
# Request 1: Creates cache unit for [System + Document]
ask(DOCUMENT, "Summarize this document")
# Request 2: Hits cache on [System + Document]; only the new question is billed at the cache-miss rate
ask(DOCUMENT, "Extract all dates mentioned")
# Request 3: Hits cache again
ask(DOCUMENT, "What are the top 3 risks identified?")
# Cost: Full price for Request 1. Cache-hit price for Requests 2+.
# For a 50K document: Request 1 = $0.007, Requests 2+ = $0.00014 each (Flash pricing)
Pattern 3: Multi-Turn Conversation Caching
# Turn 1: Full price
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is the capital of China?"}
]
response = call_api(messages)
messages.append(response.choices[0].message)
# Turn 2: Cache hit on Turn 1's full prefix
messages.append({"role": "user", "content": "What is the capital of the US?"})
response = call_api(messages)
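Multi-turn caching only works while the history is append-only. A minimal check for that property is sketched below. The `extends_history` helper and the stand-in assistant reply are assumptions for illustration, and no API call is made.

```python
def extends_history(previous: list[dict], current: list[dict]) -> bool:
    """True if `current` is `previous` plus zero or more appended messages."""
    return current[: len(previous)] == previous


turn_1 = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of China?"},
]
assistant_reply = {"role": "assistant", "content": "Beijing."}  # stand-in for the API response

# Appending keeps the earlier turns as an exact prefix.
turn_2 = turn_1 + [
    assistant_reply,
    {"role": "user", "content": "What is the capital of the US?"},
]

# Rewriting an earlier message (here, the system prompt) breaks the prefix.
edited = [{"role": "system", "content": "You are a terse assistant"}] + turn_2[1:]

print(extends_history(turn_1, turn_2), extends_history(turn_1, edited))  # True False
```

Trimming or summarizing old turns changes the prefix in the same way an edit does, so the next request after such a change should be expected to miss.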
Pattern 4: Cache Pre-Warming
For applications where the first request is latency-sensitive, pre-warm the cache:
# Pre-warm: Send a minimal-cost request to build the cache (use short max_tokens)
client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{LARGE_DOCUMENT}\n\nOK"}
    ],
    max_tokens=10  # Minimize output cost
)
# Now real requests hit the cache from the start
client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Cache hit
        {"role": "user", "content": f"{LARGE_DOCUMENT}\n\nAnalyze..."}  # Cache hit on the document prefix
    ]
)
Monitoring Cache Performance
Track cache hit rates in your API responses:
response = client.chat.completions.create(...)
usage = response.usage
cache_hit = usage.prompt_cache_hit_tokens
cache_miss = usage.prompt_cache_miss_tokens
total_input = cache_hit + cache_miss
hit_rate = cache_hit / total_input if total_input > 0 else 0
# Calculate effective cost
cache_hit_cost = (cache_hit / 1_000_000) * 0.0028 # Flash cache hit rate
cache_miss_cost = (cache_miss / 1_000_000) * 0.14 # Flash cache miss rate
effective_cost = cache_hit_cost + cache_miss_cost
print(f"Cache hit rate: {hit_rate:.1%}")
print(f"Effective cost: ${effective_cost:.6f}")
print(f"Savings vs all-miss: ${(total_input / 1_000_000 * 0.14 - effective_cost):.6f}")
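A single response says little about how a pipeline behaves, so it can help to accumulate the counters across requests. The sketch below assumes the same two usage fields as the snippet above. The `CacheStats` class is hypothetical, and the `SimpleNamespace` objects with made-up token counts stand in for real `response.usage` values.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    hit_tokens: int = 0
    miss_tokens: int = 0

    def record(self, usage) -> None:
        # Field names follow the snippet above; missing or None fields count as zero.
        self.hit_tokens += getattr(usage, "prompt_cache_hit_tokens", 0) or 0
        self.miss_tokens += getattr(usage, "prompt_cache_miss_tokens", 0) or 0

    @property
    def hit_rate(self) -> float:
        total = self.hit_tokens + self.miss_tokens
        return self.hit_tokens / total if total else 0.0


stats = CacheStats()
# Illustrative numbers: one cold request, then two that hit on the document.
for hit, miss in [(0, 50_020), (50_000, 18), (50_000, 25)]:
    stats.record(SimpleNamespace(prompt_cache_hit_tokens=hit, prompt_cache_miss_tokens=miss))

print(f"Aggregate hit rate: {stats.hit_rate:.1%}")
```

In a real service, `stats.record(response.usage)` would be called after each completion, and a falling aggregate rate is an early sign that something upstream started altering the prefix.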
Common Cache Optimization Failures
Failure 1: Dynamic System Prompts
from datetime import datetime

# BAD: System prompt changes every request
system = f"You are an analyst. Today is {datetime.now()}."
# GOOD: Static system prompt, static report first, dynamic info last in the user message
system = "You are a financial analyst."
user = f"{report}\n\nToday is {datetime.now()}. Analyze the report above."
Failure 2: Interleaved Document Queries
# BAD: Alternating documents spreads each document's queries out over time
ask(doc_a, "Summarize") # Builds cache for doc_a
ask(doc_b, "Summarize") # Builds cache for doc_b (different prefix)
ask(doc_a, "Analyze") # Hits only if doc_a's cache unit is still on disk; a long interleaved run gives it more time to be cleared
# GOOD: Batch all queries per document
for question in questions:
    ask(doc_a, question) # All hit same cache
for question in questions:
    ask(doc_b, question) # All hit same cache
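When queries arrive in mixed order, they can be regrouped before sending so that each document's questions run back to back. This is a minimal sketch: `group_by_document` and the job tuples are hypothetical, and the `print` call marks the spot where a real pipeline would call `ask`.

```python
from collections import defaultdict


def group_by_document(jobs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collect questions per document, keeping first-seen document order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc_id, question in jobs:
        grouped[doc_id].append(question)
    return dict(grouped)


# Interleaved input, as in the BAD example above.
jobs = [
    ("doc_a", "Summarize"),
    ("doc_b", "Summarize"),
    ("doc_a", "Analyze"),
    ("doc_b", "Analyze"),
]

batches = group_by_document(jobs)
for doc_id, questions in batches.items():
    for question in questions:
        print(doc_id, question)  # replace with ask(documents[doc_id], question)
```

Python dictionaries preserve insertion order, so documents are processed in the order they first appeared in the queue.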
Failure 3: Variable System Prompt Content
# BAD: Counter in system prompt prevents cache hits
system = f"Request #{request_counter}: You are an assistant."
# GOOD: Counter at the end of the user message, after any reusable content
system = "You are an assistant."
user = f"Analyze: {content}\n\n[Request #{request_counter}]"
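Dynamic system prompts often slip in unnoticed through templating, so a cheap guard is to fingerprint the system prompt of each outgoing request and flag any change. The sketch below uses SHA-256 from the standard library. The `fingerprint` and `prefix_is_stable` helpers are hypothetical names, not part of any SDK.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Short, stable digest of the exact prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def prefix_is_stable(system_prompts: list[str]) -> bool:
    """True if every request used identical system prompt text."""
    return len({fingerprint(p) for p in system_prompts}) <= 1


static = ["You are an assistant." for _ in range(3)]
dynamic = [f"Request #{n}: You are an assistant." for n in range(3)]

print(prefix_is_stable(static), prefix_is_stable(dynamic))  # True False
```

Logging the fingerprint next to the cache hit rate makes it easy to line up a sudden drop in hits with the deploy that changed the prompt.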
Note:
Pro Move: For production document Q&A systems, structure your API calls as: [System Prompt] → [Document] → [Question]. The system prompt and document form an immutable cache prefix, and all questions against that document hit the cache. At Flash's cache-hit rate of $0.0028/M, the document portion of 1,000 questions against a 50K-token document costs roughly $0.14 in input, on top of one full-price first request and the uncached question tokens.
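The arithmetic behind that estimate can be worked through in a few lines. The two prices are the Flash figures quoted in this article. The 20-token question length and the `batch_input_cost` helper are assumptions for illustration, and output tokens are ignored.

```python
MISS_PER_M = 0.14    # USD per 1M input tokens on a cache miss (figure from this article)
HIT_PER_M = 0.0028   # USD per 1M input tokens on a cache hit (figure from this article)


def batch_input_cost(doc_tokens: int, question_tokens: int,
                     n_questions: int, cached: bool = True) -> float:
    """Input cost in USD for n_questions (>= 1) asked against one document."""
    if not cached:
        return n_questions * (doc_tokens + question_tokens) * MISS_PER_M / 1_000_000
    # The first request misses on everything.
    first = (doc_tokens + question_tokens) * MISS_PER_M / 1_000_000
    # Later requests hit on the document and miss only on the new question.
    later = (n_questions - 1) * (
        doc_tokens * HIT_PER_M + question_tokens * MISS_PER_M
    ) / 1_000_000
    return first + later


with_cache = batch_input_cost(50_000, 20, 1_000, cached=True)
without_cache = batch_input_cost(50_000, 20, 1_000, cached=False)
print(f"with cache: ${with_cache:.4f}, without: ${without_cache:.4f}")
```

Under these assumptions the batch comes to about $0.15 with caching against about $7.00 without. That is a little short of the headline 50x, because the first request and every question's own tokens are still billed at the miss rate.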
Note:
Cache lifetime: Caches persist for hours to days after last use, then are automatically cleared. There's no API to manually invalidate or extend cache lifetime. For daily batch jobs, assume the first request of each batch pays full price unless you pre-warm.
Related Pages
- 1M Context Strategies — Structure your prompts for 1M context before optimizing for cache hits.
- Cost Optimization Patterns — Apply these caching patterns to drive down costs in production pipelines.