Chain-of-Thought in Practice
When and how to use chain-of-thought vs. direct prompting — real benchmarks, cost analysis, and a decision framework for choosing the right approach.

The Core Question
Chain-of-thought (CoT) prompting — telling the model to "think step by step" — is the most cited prompt engineering technique in the literature. It consistently improves accuracy on reasoning tasks. But it also doubles or triples token usage, adds latency, and on some model/task combinations it buys you nothing.
The real question isn't whether CoT works. It's when it's worth the cost.
Benchmarks: What CoT Actually Delivers
The academic literature is clear: CoT provides meaningful accuracy gains on math, logic, and multi-step reasoning. On factual recall or simple classification, it's dead weight.
Accuracy Gains by Task Category
| Task Category | Benchmark | Direct Prompt | Zero-Shot CoT | Few-Shot CoT | Gain (percentage points) |
|---|---|---|---|---|---|
| Grade-school math | GSM8K | 56.2% | 78.1% | 86.7% | +21.9 to +30.5 |
| Competition math | MATH | 12.4% | 26.8% | 43.5% | +14.4 to +31.1 |
| Multi-step arithmetic | MultiArith | 48.5% | 83.8% | 91.4% | +35.3 to +42.9 |
| Logical reasoning | LogiQA | 41.2% | 50.1% | 57.3% | +8.9 to +16.1 |
| Commonsense QA | StrategyQA | 68.7% | 71.9% | 81.6% | +3.2 to +12.9 |
| Factual knowledge | TriviaQA | 71.3% | 70.8% | 71.1% | -0.5 to -0.2 |
| Sentiment analysis | SST-2 | 94.1% | 93.8% | 94.0% | -0.3 to -0.1 |
Data from the Wei et al. (2022) Chain-of-Thought Prompting paper, reported on PaLM 540B. Gains are smaller on newer models that already reason well, but the pattern holds: CoT helps most on problems requiring sequential reasoning and does not help on problems requiring knowledge or pattern matching.
Note:
These numbers are from a 540B model. On smaller models (7B-13B), CoT gains are often larger because smaller models benefit more from explicit reasoning scaffolding, although Wei et al. (2022) reported that CoT did not help the smallest models they tested, so measure rather than assume. On frontier models like Claude 3.5 Sonnet or GPT-4o, the gap between direct and CoT is narrower because these models already chain internally.
When CoT Provides No Benefit
Not every task needs reasoning. CoT adds zero value — and sometimes hurts — on:
- Single-step classification: sentiment, topic labeling, spam detection
- Factual lookup: "What year did X happen?", "Who wrote Y?"
- Simple extraction: pulling dates, names, or numbers from text
- Translation: the model translates directly, and step-by-step reasoning lowers quality
- Creative generation: over-structured thinking kills creative flow and voice
In a production system, routing simple queries through CoT wastes tokens and adds latency for no accuracy gain. Use a lightweight classifier or intent router to decide whether a query needs reasoning.
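A minimal sketch of such a router, assuming a keyword-and-digit heuristic stands in for a trained classifier. The names `needs_reasoning` and `build_prompt` and the cue list are hypothetical; a production router would more likely be a small fine-tuned model or a cheap model call.

```python
import re

# Hypothetical cue phrases that suggest multi-step reasoning. This list is an
# assumption for illustration, not a validated feature set.
REASONING_CUES = ("how many", "calculate", "prove", "compare", "total")


def needs_reasoning(query: str) -> bool:
    """Return True if the query looks like it needs sequential reasoning."""
    text = query.lower()
    has_cue = any(cue in text for cue in REASONING_CUES)
    # Two or more numbers in a query is a rough signal of arithmetic.
    number_count = len(re.findall(r"\d+(?:\.\d+)?", text))
    return has_cue or number_count >= 2


def build_prompt(query: str) -> str:
    """Append the zero-shot CoT trigger only when the router asks for it."""
    if needs_reasoning(query):
        return f"Q: {query}\nLet's think step by step."
    return f"Q: {query}"
```

A heuristic like this will misroute some queries, so it is worth logging its decisions and checking them against a labeled sample before trusting it.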
CoT vs. Direct: The Decision Framework
Not every complex question needs CoT. Not every simple question is fine without it. The framework below separates questions worth the token budget from questions that aren't.
For each incoming prompt, answer three questions:
1. DOES THE TASK REQUIRE SEQUENTIAL REASONING?
Yes → proceed to 2
No → direct prompt. CoT adds tokens with no benefit.
2. IS THE EXPECTED ACCURACY GAIN MEANINGFUL?
Base accuracy < 80% and CoT would push it above 90% → CoT
Base accuracy already > 95% → unlikely worth it
Task is safety-critical (medical, legal, financial) → CoT even if marginal
3. DOES THE TOKEN BUDGET JUSTIFY IT?
CoT typically multiplies output tokens by 2-4x.
If the query volume is high and each query is a single classification:
CoT cost adds up fast. Direct prompt or fine-tune instead.
If each query is valuable and wrong answers have high cost:
Pay the CoT tax.
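The three questions can be sketched as a small function. This is one possible reading of the framework, not a definitive rule set: the `Task` fields are hypothetical, and the framework leaves the band between 80% and 95% base accuracy open, so the sketch assumes any measured gain in that band counts as meaningful.

```python
from dataclasses import dataclass


@dataclass
class Task:
    sequential_reasoning: bool        # question 1
    base_accuracy: float              # measured direct-prompt accuracy, 0.0-1.0
    cot_accuracy: float               # measured or estimated CoT accuracy, 0.0-1.0
    safety_critical: bool = False     # medical, legal, financial
    high_volume_simple: bool = False  # many cheap single-classification queries


def choose_prompting(task: Task) -> str:
    """Return "cot" or "direct" following the three questions above."""
    # 1. No sequential reasoning: CoT adds tokens with no benefit.
    if not task.sequential_reasoning:
        return "direct"
    # 2. Safety-critical tasks take CoT even when the gain is marginal.
    if task.safety_critical:
        return "cot"
    if task.base_accuracy > 0.95:
        return "direct"
    # 3. High-volume, low-value queries rarely justify 2-4x output tokens.
    if task.high_volume_simple:
        return "direct"
    # Assumption: in the remaining band, any measured gain is worth taking.
    return "cot" if task.cot_accuracy > task.base_accuracy else "direct"
```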
Quick decision table:
| Scenario | Use Direct | Use CoT |
|---|---|---|
| Math word problem | | ✓ |
| Multi-step logic puzzle | | ✓ |
| "Explain your reasoning" explicitly asked | | ✓ |
| Factual Q&A (when was X born?) | ✓ | |
| Sentiment classification | ✓ | |
| Creative writing (poem, story) | ✓ | |
| Code generation (simple function) | ✓ | |
| Code generation (complex algorithm) | | ✓ |
| Medical diagnosis from symptoms | | ✓ |
| Financial analysis with calculations | | ✓ |
Note:
Don't blindly add "think step by step" to everything. Measure whether CoT actually improves your specific use case. Run an A/B test on your own data before committing. The benchmarks above are on public datasets — your domain may differ.
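A minimal sketch of that comparison, assuming you have a labeled sample of your own data. The `direct` and `cot` arguments are hypothetical callables that wrap your model calls and return the final answer as a string; exact-match scoring is also an assumption and may be too strict for free-form answers.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (question, expected answer)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the predicted answer matches exactly."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for question, expected in examples
                  if predict(question).strip() == expected)
    return correct / len(examples)


def cot_gain(direct: Callable[[str], str],
             cot: Callable[[str], str],
             examples: Sequence[Example]) -> float:
    """CoT accuracy minus direct accuracy, in percentage points."""
    return 100.0 * (accuracy(cot, examples) - accuracy(direct, examples))
```

On a small sample the difference is noisy, so treat a gain of a few points as inconclusive until it holds up on more data.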
Cost Analysis: The CoT Token Tax
CoT works by producing intermediate tokens before the final answer. Those tokens aren't free.
Token Multipliers
| CoT Technique | Output Token Multiplier | When to Accept the Cost |
|---|---|---|
| Zero-Shot CoT ("think step by step") | 1.5-2.5x | Default starting point. Low overhead. |
| Few-Shot CoT (3 examples) | 3-5x | When accuracy gap is large (>15%). |
| Self-Consistency (5 samples) | 5-10x | High-stakes decisions, automated systems. |
| Tree of Thoughts (3 branches) | 8-15x | Only for complex planning/optimization. |
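To turn the multipliers into a budget figure, a rough estimate is base output tokens times multiplier times price. The sketch below assumes the midpoint of each range in the table and takes the per-1,000-token price as an input; the function name and the midpoint choice are illustrative, and real prices vary by provider and model.

```python
# Midpoints of the multiplier ranges in the table above (an assumption;
# use your own measured multipliers where you have them).
MULTIPLIER = {
    "direct": 1.0,
    "zero_shot_cot": 2.0,       # 1.5-2.5x
    "few_shot_cot": 4.0,        # 3-5x
    "self_consistency": 7.5,    # 5-10x
    "tree_of_thoughts": 11.5,   # 8-15x
}


def output_cost(technique: str, base_output_tokens: int,
                price_per_1k: float, requests: int = 1) -> float:
    """Estimated output-token cost for a number of requests."""
    tokens = base_output_tokens * MULTIPLIER[technique]
    return requests * tokens / 1000 * price_per_1k
```

This covers output tokens only. Few-shot examples also add input tokens on every request, which the estimate ignores.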
Latency Impact
More tokens = more latency. CoT increases response time proportionally to the token multiplier. For real-time user-facing applications:
- Direct prompt: ~800ms
- Zero-Shot CoT: ~1.5-2s
- Few-Shot CoT: ~3-5s
- Self-Consistency: ~5-10s
If your users are waiting, self-consistency is too slow. Reserve it for offline/batch processing or high-stakes automated decisions where latency doesn't matter.
Cost Optimization Tactics
Tactic 1: Use the smallest model that still benefits from CoT. A 7B-13B model with CoT often matches or comes close to a 70B model without it, at a fraction of the cost. For example:
Claude 3 Haiku + CoT: ~$0.003 per request, ~85% on GSM8K
GPT-4o without CoT: ~$0.015 per request, ~87% on GSM8K
Tactic 2: Gate CoT behind a complexity classifier. Run a cheap classification pass first: "Is this a complex reasoning question? Yes/No." Only apply CoT to Yes responses. Typical routing splits 70/30 simple/complex, saving CoT tokens on most queries.
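The saving from a 70/30 split can be worked out directly. The sketch below assumes a 2x CoT multiplier and 100 base output tokens per query, both illustrative, and ignores the cost of the classification pass itself.

```python
def expected_tokens(base_tokens: float, cot_multiplier: float,
                    simple_share: float) -> float:
    """Average output tokens per query when only complex queries get CoT."""
    simple = simple_share * base_tokens
    complex_part = (1.0 - simple_share) * base_tokens * cot_multiplier
    return simple + complex_part


always_cot = expected_tokens(100, 2.0, simple_share=0.0)  # every query uses CoT
routed = expected_tokens(100, 2.0, simple_share=0.7)      # 70/30 simple/complex
saving = 1.0 - routed / always_cot                        # fraction of tokens saved
```

Under these assumptions routing brings the average from 200 to about 130 output tokens per query, a saving of roughly 35%. The saving grows with the multiplier, so gating matters more for few-shot CoT or self-consistency.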
Tactic 3: Use structured CoT with explicit stop conditions. Tell the model to stop after N steps or if confidence exceeds a threshold. Prevents unbounded reasoning chains.
Think step by step, but stop after 5 reasoning steps maximum.
If you're confident after 3 steps, go directly to the final answer.
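If the step limits need tuning per task, the instruction can be generated rather than hard-coded. A small sketch, with a hypothetical function name and parameters:

```python
def bounded_cot_prompt(question: str, max_steps: int = 5,
                       early_exit_step: int = 3) -> str:
    """Build a CoT prompt with an explicit step cap and early-exit hint."""
    if not 1 <= early_exit_step <= max_steps:
        raise ValueError("early_exit_step must be between 1 and max_steps")
    return (
        f"Q: {question}\n"
        f"Think step by step, but stop after {max_steps} reasoning steps maximum.\n"
        f"If you're confident after {early_exit_step} steps, "
        f"go directly to the final answer."
    )
```

The instruction is a request, not a guarantee. Where the provider offers a hard output-token limit, setting one as well gives a firm upper bound.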
Model-Specific CoT Behavior
Not all models respond to CoT the same way. The table below is based on observed behavior, not published benchmarks — test on your own data.
| Model | CoT Behavior | Recommendation |
|---|---|---|
| GPT-4o | Strong zero-shot CoT. Few-shot adds 3-5% on math. | Use zero-shot CoT unless accuracy-critical. |
| GPT-4o-mini | CoT is essential for math/logic. Few-shot doubles accuracy on GSM8K. | Always use at least zero-shot CoT. |
| Claude 3.5 Sonnet | Excellent reasoning without explicit CoT. Gains are marginal (1-3%). | Direct prompt often sufficient. Use CoT only for competition-level problems. |
| Claude 3 Haiku | Benefits significantly from zero-shot CoT on math. | Always use CoT for reasoning tasks. |
| Gemini 2.5 Pro | Strong internal reasoning. CoT gains are small. | Use direct. CoT is redundant for most queries. |
| o1 / o3 / reasoning models | Do not add CoT. These models reason internally. Explicit CoT instructions degrade performance or are ignored. | Direct prompt. Let the model's internal chain handle it. |
| Llama 3 70B/405B | Large gap between direct and CoT. Few-shot CoT adds 15-25% on math. | Use few-shot CoT for any reasoning task. |
| Mistral/Mixtral 8x7B | Moderate CoT benefit. Zero-shot CoT typically enough. | Use zero-shot CoT. Few-shot shows diminishing returns. |
The reasoning model caveat: OpenAI o1, o3, and similar models are trained to reason internally. Adding "think step by step" to an o1 prompt doesn't help — the model already reasons, and your instruction can conflict with its training. For these models, focus on clarity of the task description, not reasoning instructions.
Techniques Reference
The basic CoT patterns. Skip this section if you already know these.
Zero-Shot CoT
The simplest form. Add one sentence.
Q: {your question}
Let's think step by step.
Works on most models and most reasoning tasks. Always start here before trying anything more complex.
Few-Shot CoT
When zero-shot isn't enough, provide 2-3 worked examples showing the reasoning pattern you want.
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls.
2 cans of 3 balls each = 2 × 3 = 6 balls
5 + 6 = 11 tennis balls
The answer is 11.
Q: {your question}
A: Let's think step by step.
Few-shot is worth the token cost when:
- The model consistently misses a specific step in its reasoning
- You need a specific reasoning format (legal analysis, financial calculations)
- The target model is smaller and needs explicit examples
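Assembling the few-shot prompt from a list of worked examples keeps the format consistent as examples are added or swapped. A minimal sketch, assuming each example is a (question, worked answer) pair; the function name is hypothetical.

```python
from typing import Sequence, Tuple

Example = Tuple[str, str]  # (question, worked answer ending in "The answer is ...")


def few_shot_cot_prompt(examples: Sequence[Example], question: str) -> str:
    """Join worked examples and the new question in the Q:/A: format above."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block leaves the answer open with the zero-shot trigger.
    blocks.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)
```

Keeping the examples in data rather than in a hand-edited string also makes it easier to test which examples actually move accuracy.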
Self-Consistency
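Generate multiple chains of reasoning and take the majority answer. This is the highest-impact CoT variant for accuracy, but also one of the most expensive. The template below approximates it in a single call by asking for three independent approaches.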
Solve this problem 3 different ways. For each approach, work through
the full reasoning chain independently.
Problem: {your question}
Approach 1:
[reasoning chain]
Answer 1: [answer]
Approach 2:
[reasoning chain]
Answer 2: [answer]
Approach 3:
[reasoning chain]
Answer 3: [answer]
Final Answer: Compare the three approaches. Which reasoning is most
reliable? If answers disagree, explain why and choose the best one.
Self-consistency works because independent reasoning paths tend to converge on the correct answer, while errors tend to scatter. If 3/3 chains agree, confidence is high. If 2/3 agree, the majority is likely right. If all 3 disagree, flag for human review.
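The voting step can also be done in code rather than left to the model. A minimal sketch, assuming the final answers have already been extracted from separately sampled chains (or parsed from the three approaches in the template above); the function name and the lowercase normalization are illustrative choices.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_answer(answers: Sequence[str]) -> Tuple[Optional[str], float]:
    """Return (winning answer, agreement ratio); None means flag for review."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    (top, top_count), *rest = counts.most_common()
    agreement = top_count / len(normalized)
    # A tie for first place (including all-different) has no majority.
    if rest and rest[0][1] == top_count:
        return None, agreement
    return top, agreement
```

Voting in code avoids asking the model to judge its own chains, and the agreement ratio doubles as a rough confidence signal for routing low-agreement cases to a human.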
Verification Step
A cheaper alternative to self-consistency: solve once, then verify.
Step 1: Solve the problem step by step.
Step 2: Verify your answer by:
a. Working backwards from the answer to check it
b. Estimating to see if the answer is in the right ballpark
c. Checking for common mistakes (sign errors, unit mismatches, off-by-one)
Step 3: If verification reveals an error, re-solve from step 1.
Tree of Thoughts
For problems with branching decisions — planning, creative problem-solving, game playing — Tree of Thoughts explores multiple paths and selects the best.
Problem: {complex planning problem}
Phase 1 — Generate options:
Brainstorm 3 distinct approaches to solve this problem.
Phase 2 — Evaluate each option:
For each approach, rate it 1-10 on: feasibility, completeness, efficiency.
Briefly explain each rating.
Phase 3 — Choose and execute:
Select the highest-rated approach and work through it step by step.
If you hit a dead end, backtrack to the next best option.
Tree of Thoughts is expensive (8-15x tokens) and only worth it for problems where a single reasoning path is unlikely to succeed. Most tasks don't need it.
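The choose-and-backtrack logic of Phase 3 can be sketched as follows. This mirrors only the three-phase prompt above, not a full tree search over intermediate thoughts; `ratings` and `execute` are hypothetical inputs, with `execute` standing in for a model call that returns `None` when an approach hits a dead end.

```python
from typing import Callable, Dict, Optional, Sequence


def pick_and_execute(
    ratings: Dict[str, Sequence[int]],
    execute: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try approaches from highest to lowest total rating, backtracking on dead ends."""
    # Sum the feasibility, completeness, and efficiency scores per approach.
    ranked = sorted(ratings, key=lambda name: sum(ratings[name]), reverse=True)
    for name in ranked:
        result = execute(name)  # None signals a dead end
        if result is not None:
            return result
    return None  # every approach failed
```

Each call to `execute` is another full reasoning chain, which is where the 8-15x token cost comes from.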
Prompt Templates by Domain
Math and Quantitative Problems
{problem}
Work through this step by step:
1. Identify what's being asked and what's given
2. List the relevant formulas or operations
3. Compute each intermediate value, showing work
4. State the final answer with units
Do not skip any arithmetic steps.
Code Generation
Task: {description}
Language: {language}
Constraints: {constraints}
Think through the approach before writing code:
1. What's the algorithm or approach?
2. What are the edge cases?
3. What's the time/space complexity?
Then write the implementation with comments explaining each section.
Analysis and Decision-Making
Situation: {context}
Decision to make: {question}
Analyze step by step:
1. Summarize the relevant facts
2. Identify the key tradeoffs (list pros and cons)
3. Consider the second-order effects of each option
4. Weigh the tradeoffs and recommend a decision
5. State your confidence level (high/medium/low) and why
Explanation and Teaching
Topic: {topic}
Audience: {audience level}
Explain this step by step, building from fundamentals:
1. Start with the core concept in one simple sentence
2. Break it into 3-5 logical sub-parts
3. Explain each sub-part with a concrete example
4. Connect the sub-parts back to the whole
5. End with a summary and one common misconception to avoid
Failure Modes: When CoT Goes Wrong
CoT isn't foolproof. These patterns degrade accuracy and waste tokens.
Reasoning cascades. An early mistake propagates through the chain. The model builds detailed reasoning on a wrong premise, producing a confidently wrong answer. This is the most common CoT failure — and the hardest to detect because the reasoning looks plausible.
Fix: Add a verification step after the chain. "Before giving your final answer, check: does the reasoning hold if we change one assumption?"
Overconfidence from verbosity. Longer chains feel more authoritative but aren't necessarily more accurate. Models sometimes generate extra reasoning to justify a wrong intuition.
Fix: Use self-consistency. If multiple independent chains disagree, the verbose but wrong chain gets exposed.
Hallucinated intermediate facts. The model invents numbers, citations, or facts during its reasoning that don't exist in the input.
Fix: "Base your reasoning only on information provided in the problem. If you need to assume a value, state the assumption explicitly."
CoT on simple tasks causes confusion. Adding "think step by step" to a task like "translate 'hello' to Spanish" can cause the model to overthink: "Hmm, hello is a greeting. In Spanish, greetings include hola, buenos días..." producing a worse answer than the direct "hola."
Fix: Don't use CoT on tasks classified as simple. A lightweight router prevents this.
Best Practices
- Start with zero-shot CoT. One sentence. Measure the accuracy delta. Don't build few-shot templates until you know the baseline gain.
- Measure before committing. Run an A/B test on your own data. CoT that works on GSM8K may not work on your domain's math problems.
- Route intelligently. Not every query needs reasoning. A cheap classifier saves 30-40% of CoT tokens with no accuracy loss.
- Match the technique to the reliability requirement. Standard user-facing queries → zero-shot CoT. High-stakes automated decisions → self-consistency. Planning/optimization → Tree of Thoughts. Everything else → direct.
- Know your model. CoT on Claude 3.5 Sonnet gives marginal gains. CoT on Llama 8B gives massive gains. Don't apply the same strategy across models.
- Watch for reasoning models. o1, o3, and similar models don't need CoT. Adding it can hurt performance. If your model has "reasoning" or "thinking" in its name, skip CoT.
- CoT is a diagnostic, not just a performance tool. When debugging a prompt, forcing the model to show its reasoning reveals where it goes wrong. Use CoT during prompt development even if you strip it for production.