Context Compression: Reduce Tokens Without Losing Quality
Compress prompts for lower cost and latency using LLMLingua, selective context, and summarization. Learn compression ratios, quality tradeoffs, and when not to compress.
The Compression Imperative
Context windows are growing (200K+ tokens), but every token costs money and adds latency. Most context is filler — removing non-essential tokens without losing meaning can cut costs 2-5x.
Before compression (2,500 tokens, $0.00625 at GPT-4o rates):
"You may find it helpful to know that the project under discussion
pertains to a web application framework that was originally developed
by a team of engineers at..."
After compression (600 tokens, $0.0015):
"Project: web framework by Meta, React-based, 2013, 200K+ GitHub stars"
Compression Techniques
LLMLingua
Use a small language model to compress prompts for a larger one. The small model identifies and removes non-essential tokens while preserving semantic content.
from llmlingua import PromptCompressor
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-bert-base-multilingual",
use_llmlingua2=True
)
long_prompt = """
The Model Context Protocol is an open standard developed by Anthropic
that enables AI applications to connect with various tools and data
sources in a standardized way. This protocol is designed to provide
a common interface for AI systems to interact with external resources...
"""
compressed = compressor.compress_prompt(
long_prompt,
rate=0.5, # Compress to 50% of original tokens
force_tokens=["!", "?", "."] # Preserve these
)
print(f"Original: {len(long_prompt.split())} words")
print(f"Compressed: {len(compressed.split())} words")
Selective Context
For RAG pipelines: don't include all retrieved chunks in the prompt. Filter to only the most relevant ones.
def select_relevant_chunks(query, chunks, max_tokens=2000):
"""Select only chunks likely to help answer the query."""
token_count = 0
selected = []
# Sort by relevance score (from retrieval)
for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
chunk_tokens = count_tokens(chunk.text)
if token_count + chunk_tokens > max_tokens:
break
selected.append(chunk)
token_count += chunk_tokens
return selected
Summarization-Based Compression
Ask the model to summarize context first, then answer. Two calls instead of one, but the second call has a dramatically compressed prompt.
# Step 1: Summarize
"Summarize the following in 150 words, capturing only the
key facts relevant to understanding the situation:"
# Step 2: Answer from summary
"Based on this summary: [compressed context], answer: [question]"
Token Pruning Heuristics
Simple rules that don't require a model call:
Remove if:
- Stop words that don't change meaning ("the", "a", "an")
- Redundant phrases ("in order to" → "to")
- Filler text ("you may find it helpful to know that" → "")
- Repeated information across RAG chunks
- Whitespace and formatting tokens
Compression Quality Tradeoffs
| Method | Compression Ratio | Quality Retention | Latency |
|---|---|---|---|
| Token pruning | 1.2-1.5x | 95-98% | None |
| Selective context (RAG) | 2-10x | 90-95% | None |
| LLMLingua | 2-3x | 93-97% | +1-2s |
| Summarization | 3-10x | 85-93% | +API call |
| Aggressive compression | 10-20x | 70-85% | +API call |
When NOT to Compress
Tasks where exact wording matters:
- Legal document analysis (every clause could be critical)
- Contract review (missing a negation word changes meaning)
- Medical records (precision > cost)
Short prompts: Compression overhead > savings if the prompt is under 500 tokens.
Creative tasks: Compression removes stylistic elements. A compressed poem prompt loses rhythm and nuance.
Ambiguous queries: Compression can remove disambiguating context. If the original query is already terse, don't compress further.
Cost Comparison
Typical RAG pipeline with 10 retrieved documents (20,000 tokens total):
| Approach | Input Tokens | Cost (GPT-4o) | Quality |
|---|---|---|---|
| Include all 10 chunks | 20,000 | $0.05 | 100% |
| Selective top-3 chunks | 6,000 | $0.015 | 95% |
| LLMLingua 50% + top-3 | 3,000 | $0.0075 | 93% |
| Summarize then answer | 500 + 500 | $0.0025 | 88% |
The sweet spot for most pipelines: top-3 selective context with light LLMLingua compression.
Combining With Prompt Caching
Compression and caching are complementary:
- Compress your system prompt and static examples once.
- Cache the compressed version.
- Append only the dynamic query for each call.
This gives you both compression savings AND cache discounts — maximum cost optimization.
Related Articles
Algorithm Design Prompts for ChatGPT | Problem Solving Guide
Learn how to write effective prompts for algorithm design tasks, from problem analysis to implementation strategies.
Automatic Prompt Engineering (APE)
Use LLMs to generate, score, and optimize prompts for other LLMs. APE discovered a better CoT prompt than humans did — and the same principles apply to your production prompts.
Claude Long Document Strategies: Structuring 100K+ Token Prompts
Master Claude's 200K context for massive documents. Learn where to place instructions in long prompts, chunking strategies, progressive disclosure, and maintaining coherence across entire codebases and book-length documents.