The Core Question

Chain-of-thought (CoT) prompting, which asks the model to "think step by step," is one of the most cited prompt engineering techniques in the literature. It often improves accuracy on reasoning tasks. But it also doubles or triples output token usage, adds latency, and on some model/task combinations it buys you nothing.

The real question isn't whether CoT works. It's when it's worth the cost.

Benchmarks: What CoT Actually Delivers

The academic literature is clear: CoT provides meaningful accuracy gains on math, logic, and multi-step reasoning. On factual recall or simple classification, it's dead weight.

Accuracy Gains by Task Category

Task Category	Benchmark	Direct Prompt	Zero-Shot CoT	Few-Shot CoT	Gain (percentage points)
Grade-school math	GSM8K	56.2%	78.1%	86.7%	+21.9 to +30.5
Competition math	MATH	12.4%	26.8%	43.5%	+14.4 to +31.1
Multi-step arithmetic	MultiArith	48.5%	83.8%	91.4%	+35.3 to +42.9
Logical reasoning	LogiQA	41.2%	50.1%	57.3%	+8.9 to +16.1
Commonsense QA	StrategyQA	68.7%	71.9%	81.6%	+3.2 to +12.9
Factual knowledge	TriviaQA	71.3%	70.8%	71.1%	-0.5 to -0.2
Sentiment analysis	SST-2	94.1%	93.8%	94.0%	-0.3 to -0.1

Data attributed to the Wei et al. (2022) chain-of-thought prompting paper, reported on PaLM 540B; the zero-shot "Let's think step by step" variant was introduced separately by Kojima et al. (2022). Gains are smaller on newer models that already reason well, but the pattern holds: CoT helps most on problems that require sequential reasoning and offers little to no help on problems that require knowledge recall or pattern matching.

Note:

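These numbers are from a 540B model. On smaller instruction-tuned models (7B-13B), CoT gains are often larger, because these models are less likely to work through intermediate steps unprompted and benefit more from explicit reasoning scaffolding. (Wei et al. found the opposite for small base models, where CoT only helped at large scale, so measure rather than assume.) On frontier models like Claude 3.5 Sonnet or GPT-4o, the gap between direct and CoT prompting is narrower, since these models often produce intermediate reasoning without being asked.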

When CoT Provides No Benefit

Not every task needs reasoning. CoT adds zero value — and sometimes hurts — on:

Single-step classification: sentiment, topic labeling, spam detection
Factual lookup: "What year did X happen?", "Who wrote Y?"
Simple extraction: pulling dates, names, or numbers from text
Translation: the model translates directly, and forcing step-by-step reasoning tends to lower quality
Creative generation: over-structured thinking kills creative flow and voice

In a production system, routing simple queries through CoT wastes tokens and adds latency for no accuracy gain. Use a lightweight classifier or intent router to decide whether a query needs reasoning.
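As a rough illustration of such a router, the sketch below uses simple keyword and digit heuristics to decide whether a query gets the CoT trigger. The names `needs_reasoning` and `build_prompt`, the cue list, and the thresholds are all assumptions made for this example; a production router would more likely be a small trained classifier.

```python
import re

# Hypothetical cue phrases that hint at multi-step reasoning.
# This list is an assumption for illustration; tune it on your own traffic.
REASONING_CUES = ("how many", "calculate", "prove", "compare", "total", "remaining")


def needs_reasoning(query: str) -> bool:
    """Return True when a query looks like it needs sequential reasoning."""
    text = query.lower()
    # Two or more numbers usually mean some arithmetic is involved.
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    cue_hits = sum(1 for cue in REASONING_CUES if cue in text)
    return len(numbers) >= 2 or cue_hits >= 2


def build_prompt(query: str) -> str:
    """Append the zero-shot CoT trigger only for queries routed as complex."""
    if needs_reasoning(query):
        return f"Q: {query}\n\nLet's think step by step."
    return f"Q: {query}"
```

A heuristic like this is cheap enough to run on every request, but it will misroute some queries, so it is worth checking its decisions against a labeled sample.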

CoT vs. Direct: The Decision Framework

Not every complex question needs CoT. Not every simple question is fine without it. The framework below separates questions worth the token budget from questions that aren't.

For each incoming prompt, answer three questions:

1. DOES THE TASK REQUIRE SEQUENTIAL REASONING?
   Yes → proceed to 2
   No  → direct prompt. CoT adds tokens with no benefit.

2. IS THE EXPECTED ACCURACY GAIN MEANINGFUL?
   Base accuracy < 80% and CoT would push it above 90% → CoT
   Base accuracy already > 95% → unlikely worth it
   Task is safety-critical (medical, legal, financial) → CoT even if marginal

3. DOES THE TOKEN BUDGET JUSTIFY IT?
   CoT typically multiplies output tokens by 2-4x.
   If the query volume is high and each query is a single classification:
     CoT cost adds up fast. Direct prompt or fine-tune instead.
   If each query is valuable and wrong answers have high cost:
     Pay the CoT tax.
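The three questions above can be sketched as a small function. This is one possible encoding, not a definitive one: the `Task` fields and the `"measure"` outcome (returned when the accuracy numbers fall between the stated thresholds) are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    sequential_reasoning: bool        # question 1
    base_accuracy: float              # measured direct-prompt accuracy, 0.0-1.0
    expected_cot_accuracy: float      # measured or estimated CoT accuracy, 0.0-1.0
    safety_critical: bool = False     # medical, legal, financial
    high_volume_simple: bool = False  # many cheap queries, e.g. single classification


def choose_prompting(task: Task) -> str:
    """Return "direct", "cot", or "measure" (assumed label: run an A/B test first)."""
    # 1. No sequential reasoning: CoT adds tokens with no benefit.
    if not task.sequential_reasoning:
        return "direct"
    # 2. Is the gain meaningful? Safety-critical tasks skip this check.
    if not task.safety_critical:
        if task.base_accuracy > 0.95:
            return "direct"
        if not (task.base_accuracy < 0.80 and task.expected_cot_accuracy > 0.90):
            return "measure"
    # 3. Token budget: high-volume simple traffic stays on the direct path.
    if task.high_volume_simple and not task.safety_critical:
        return "direct"
    return "cot"
```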

Quick decision table:

Scenario	Use Direct	Use CoT
Math word problem		✓
Multi-step logic puzzle		✓
"Explain your reasoning" explicitly asked		✓
Factual Q&A (when was X born?)	✓
Sentiment classification	✓
Creative writing (poem, story)	✓
Code generation (simple function)	✓
Code generation (complex algorithm)		✓
Medical diagnosis from symptoms		✓
Financial analysis with calculations		✓

Note:

Don't blindly add "think step by step" to everything. Measure whether CoT actually improves your specific use case. Run an A/B test on your own data before committing. The benchmarks above are on public datasets — your domain may differ.
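A minimal sketch of that comparison, assuming you have already graded both prompt variants on the same examples and hold the results as parallel lists of booleans. The function name and the returned keys are hypothetical.

```python
def compare_prompts(direct_correct: list[bool], cot_correct: list[bool]) -> dict:
    """Compare direct vs. CoT accuracy on the same graded examples."""
    if not direct_correct or len(direct_correct) != len(cot_correct):
        raise ValueError("need two non-empty result lists of equal length")
    n = len(direct_correct)
    direct_acc = sum(direct_correct) / n
    cot_acc = sum(cot_correct) / n
    pairs = list(zip(direct_correct, cot_correct))
    return {
        "direct_acc": direct_acc,
        "cot_acc": cot_acc,
        # Difference in percentage points, rounded for reporting.
        "delta_points": round((cot_acc - direct_acc) * 100, 1),
        # Discordant pairs show where the two prompts actually differ.
        "cot_only_wins": sum(1 for d, c in pairs if c and not d),
        "direct_only_wins": sum(1 for d, c in pairs if d and not c),
    }
```

The discordant-pair counts matter as much as the headline delta: if CoT fixes ten examples but breaks eight others, the net gain is small and probably within noise on a small sample.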

Cost Analysis: The CoT Token Tax

CoT works by producing intermediate tokens before the final answer. Those tokens aren't free.

Token Multipliers

CoT Technique	Output Token Multiplier	When to Accept the Cost
Zero-Shot CoT ("think step by step")	1.5-2.5x	Default starting point. Low overhead.
Few-Shot CoT (3 examples)	3-5x	When accuracy gap is large (>15%).
Self-Consistency (5 samples)	5-10x	High-stakes decisions, automated systems.
Tree of Thoughts (3 branches)	8-15x	Only for complex planning/optimization.
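To turn a multiplier into a per-request cost, a back-of-the-envelope calculation like the one below may help. It assumes the multiplier applies only to output tokens and that prices are quoted per million tokens; the function name and any prices you pass in are placeholders, not real rates for any provider.

```python
def cost_per_request(
    input_tokens: int,
    direct_output_tokens: int,
    output_multiplier: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the dollar cost of one request under a CoT output multiplier."""
    output_tokens = direct_output_tokens * output_multiplier
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost
```

Note that few-shot CoT also lengthens the input (the worked examples), which this sketch does not model; add those tokens to `input_tokens` if you use it for few-shot estimates.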

Latency Impact

More tokens = more latency. Generation time grows roughly in proportion to the number of output tokens, so CoT increases response time by about the token multiplier. Rough figures for real-time user-facing applications:

Direct prompt: ~800ms
Zero-Shot CoT: ~1.5-2s
Few-Shot CoT: ~3-5s
Self-Consistency: ~5-10s

If your users are waiting, self-consistency is too slow. Reserve it for offline/batch processing or high-stakes automated decisions where latency doesn't matter.
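The latency figures above follow from a simple model: a fixed time to first token plus output tokens divided by generation speed. The sketch below encodes that model; the parameter names are assumptions, and the values you plug in should come from your own measurements.

```python
def estimate_latency_s(output_tokens: float, ttft_s: float, tokens_per_s: float) -> float:
    """Estimate response time as time-to-first-token plus generation time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s
```

With a hypothetical 0.3 s time to first token and 80 tokens/s, a 40-token direct answer takes about 0.8 s and a 100-token CoT answer (2.5x) about 1.55 s, which is in line with the ranges listed above.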

Cost Optimization Tactics

Tactic 1: Use the smallest model that still benefits from CoT. A small model with CoT can come close to a much larger model without it, at a fraction of the cost. Illustrative figures:

Claude 3 Haiku + CoT:  ~$0.003 per request, ~85% on GSM8K
GPT-4o without CoT:   ~$0.015 per request, ~87% on GSM8K

Tactic 2: Gate CoT behind a complexity classifier. Run a cheap classification pass first: "Is this a complex reasoning question? Yes/No." Only apply CoT to Yes responses. Typical routing splits 70/30 simple/complex, saving CoT tokens on most queries.
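The saving from gating can be estimated with a little arithmetic. The sketch below assumes simple and complex queries have the same direct-answer length and ignores the cost of the classification pass itself; both are simplifications, and the function name is hypothetical.

```python
def routing_savings(simple_share: float, direct_tokens: float, multiplier: float) -> float:
    """Fraction of output tokens saved by routing vs. applying CoT to every query."""
    always_cot = direct_tokens * multiplier
    # Simple queries get a direct answer; complex ones still pay the CoT multiplier.
    routed = simple_share * direct_tokens + (1 - simple_share) * always_cot
    return 1 - routed / always_cot
```

For a 70/30 split with a 2.5x multiplier, `routing_savings(0.7, 100, 2.5)` comes out at about 0.42, meaning roughly 42% fewer output tokens than applying CoT everywhere under these assumptions.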

Tactic 3: Use structured CoT with explicit stop conditions. Tell the model to stop after N steps or if confidence exceeds a threshold. Prevents unbounded reasoning chains.

Think step by step, but stop after 5 reasoning steps maximum.
If you're confident after 3 steps, go directly to the final answer.

Model-Specific CoT Behavior

Not all models respond to CoT the same way. The table below is based on observed behavior, not published benchmarks — test on your own data.

Model	CoT Behavior	Recommendation
GPT-4o	Strong zero-shot CoT. Few-shot adds 3-5% on math.	Use zero-shot CoT unless accuracy-critical.
GPT-4o-mini	CoT is essential for math/logic. Few-shot doubles accuracy on GSM8K.	Always use at least zero-shot CoT.
Claude 3.5 Sonnet	Excellent reasoning without explicit CoT. Gains are marginal (1-3%).	Direct prompt often sufficient. Use CoT only for competition-level problems.
Claude 3 Haiku	Benefits significantly from zero-shot CoT on math.	Always use CoT for reasoning tasks.
Gemini 2.5 Pro	Strong internal reasoning. CoT gains are small.	Use direct. CoT is redundant for most queries.
o1 / o3 / reasoning models	Do not add CoT. These models reason internally. Explicit CoT instructions degrade performance or are ignored.	Direct prompt. Let the model's internal chain handle it.
Llama 3.1 70B/405B	Large gap between direct and CoT. Few-shot CoT adds 15-25% on math.	Use few-shot CoT for any reasoning task.
Mistral/Mixtral 8x7B	Moderate CoT benefit. Zero-shot CoT typically enough.	Use zero-shot CoT. Few-shot shows diminishing returns.

The reasoning model caveat: OpenAI o1, o3, and similar models are trained to reason internally. Adding "think step by step" to an o1 prompt doesn't help — the model already reasons, and your instruction can conflict with its training. For these models, focus on clarity of the task description, not reasoning instructions.
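One way to keep these recommendations out of application logic is a lookup table. The sketch below encodes the table above; the keys are made-up labels for this example, not official API model identifiers, and the default for unknown models is an assumption based on the "start with zero-shot" advice.

```python
# Keys are illustrative labels, not real API model IDs.
COT_STRATEGY = {
    "gpt-4o": "zero_shot_cot",
    "gpt-4o-mini": "zero_shot_cot",
    "claude-3.5-sonnet": "direct",
    "claude-3-haiku": "zero_shot_cot",
    "gemini-2.5-pro": "direct",
    "o1": "direct",  # reasoning models: no explicit CoT instructions
    "o3": "direct",
    "llama-70b": "few_shot_cot",
    "mixtral-8x7b": "zero_shot_cot",
}


def strategy_for(model: str, default: str = "zero_shot_cot") -> str:
    """Look up the prompting strategy for a model label, case-insensitively."""
    return COT_STRATEGY.get(model.lower(), default)
```

Treat the table as a starting configuration and overwrite entries once you have measurements on your own data.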

Techniques Reference

The basic CoT patterns. Skip this section if you already know these.

Zero-Shot CoT

The simplest form. Add one sentence.

Q: {your question}

Let's think step by step.

Works on most models and most reasoning tasks. Always start here before trying anything more complex.
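In code, zero-shot CoT is a one-line prompt change plus some way to pull the final answer out of the longer completion. The sketch below shows both; taking the last number in the text as the answer is a common heuristic for math problems but an assumption here, and both function names are hypothetical.

```python
import re
from typing import Optional


def zero_shot_cot_prompt(question: str) -> str:
    """Wrap a question in the zero-shot CoT template."""
    return f"Q: {question}\n\nLet's think step by step."


def extract_final_number(completion: str) -> Optional[float]:
    """Return the last number in a completion, or None if there is none."""
    # Strip thousands separators so "1,234" parses as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(matches[-1]) if matches else None
```

Asking the model to end with a fixed phrase such as "The answer is N." makes extraction more reliable than scanning free-form reasoning.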

Few-Shot CoT

When zero-shot isn't enough, provide 2-3 worked examples showing the reasoning pattern you want.

Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?

A: Roger started with 5 balls.
2 cans of 3 balls each = 2 × 3 = 6 balls
5 + 6 = 11 tennis balls
The answer is 11.

Q: {your question}
A: Let's think step by step.

Few-shot is worth the token cost when:

The model consistently misses a specific step in its reasoning
You need a specific reasoning format (legal analysis, financial calculations)
The target model is smaller and needs explicit examples
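Assembling a few-shot prompt from worked examples is mostly string formatting. The sketch below follows the layout of the template above; the function name and the `(question, worked_answer)` tuple shape are assumptions for this example.

```python
def few_shot_cot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Build a few-shot CoT prompt from (question, worked_answer) pairs."""
    # Each worked example shows the reasoning format the model should imitate.
    blocks = [f"Q: {q}\n\nA: {a}" for q, a in examples]
    # The target question ends with the CoT trigger, as in the template above.
    blocks.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)
```

Keeping examples as data rather than hard-coded text makes it easy to swap them per domain and to measure how many examples are actually needed.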

Self-Consistency

Generate multiple chains of reasoning and take the majority answer. This is the highest-impact CoT variant for accuracy, but also one of the most expensive.

Solve this problem 3 different ways. For each approach, work through
the full reasoning chain independently.

Problem: {your question}

Approach 1:
[reasoning chain]
Answer 1: [answer]

Approach 2:
[reasoning chain]
Answer 2: [answer]

Approach 3:
[reasoning chain]
Answer 3: [answer]

Final Answer: Compare the three approaches. Which reasoning is most
reliable? If answers disagree, explain why and choose the best one.

Self-consistency works because correct answers tend to be reachable by several different reasoning paths, while errors rarely repeat in the same way. If 3/3 chains agree, confidence is high. If 1/3 differs, the majority is likely right. If all 3 disagree, flag for human review.
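The template above asks for all three approaches in a single response. An alternative that keeps the chains independent is to sample separate completions and vote in code. The sketch below shows only the voting step; `sample_answer` is a hypothetical callable that you would implement to call your model at a nonzero temperature and return the extracted final answer.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_answer: Callable[[], str], n_samples: int = 5
) -> tuple[Optional[str], float]:
    """Sample n answers and return (majority answer, agreement ratio).

    Returns None as the answer when no strict majority exists,
    signalling that the case should go to human review.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer().strip() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    if agreement <= 0.5:
        return None, agreement
    return top_answer, agreement
```

Returning the agreement ratio alongside the answer lets downstream code apply its own threshold, for example requiring full agreement before an automated action.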

Verification Step

A cheaper alternative to self-consistency: solve once, then verify.

Step 1: Solve the problem step by step.
Step 2: Verify your answer by:
  a. Working backwards from the answer to check it
  b. Estimating to see if the answer is in the right ballpark
  c. Checking for common mistakes (sign errors, unit mismatches, off-by-one)
Step 3: If verification reveals an error, re-solve from step 1.
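The same solve-then-verify loop can also be run from application code, with a cap on retries so a failing check cannot loop forever. In this sketch, `solve` and `verify` are hypothetical callables you would back with model calls (or, for `verify`, a deterministic check such as re-running the arithmetic).

```python
from typing import Callable


def solve_with_verification(
    solve: Callable[[str], str],
    verify: Callable[[str, str], bool],
    problem: str,
    max_attempts: int = 2,
) -> tuple[str, bool]:
    """Solve, verify, and re-solve on failure. Returns (last answer, verified flag)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    answer = ""
    for _ in range(max_attempts):
        answer = solve(problem)
        if verify(problem, answer):
            return answer, True
    # Verification never passed: return the last attempt, marked unverified.
    return answer, False
```

With `max_attempts=2` the worst case costs about two solves plus two verifications, which is still usually cheaper than five self-consistency samples.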

Tree of Thoughts

For problems with branching decisions — planning, creative problem-solving, game playing — Tree of Thoughts explores multiple paths and selects the best.

Problem: {complex planning problem}

Phase 1 — Generate options:
Brainstorm 3 distinct approaches to solve this problem.

Phase 2 — Evaluate each option:
For each approach, rate it 1-10 on: feasibility, completeness, efficiency.
Briefly explain each rating.

Phase 3 — Choose and execute:
Select the highest-rated approach and work through it step by step.
If you hit a dead end, backtrack to the next best option.

Tree of Thoughts is expensive (8-15x tokens) and only worth it for problems where a single reasoning path is unlikely to succeed. Most tasks don't need it.
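If you parse the Phase 2 ratings out of the model's response, the selection and backtracking order in Phase 3 can be made deterministic in code. The sketch below ranks approaches by their mean rating; equal weighting of the three criteria is an assumption, as is the shape of the `ratings` dictionary.

```python
def rank_approaches(ratings: dict[str, dict[str, int]]) -> list[str]:
    """Order approaches by mean rating, best first.

    The first entry is the approach to execute; the rest give the
    backtracking order if it hits a dead end.
    """
    def mean_score(name: str) -> float:
        scores = ratings[name]
        return sum(scores.values()) / len(scores)

    return sorted(ratings, key=mean_score, reverse=True)
```

For example, ratings of 8/9/7 for one approach, 9/5/8 for a second, and 10/9/2 for a third (feasibility/completeness/efficiency) rank the first approach highest, with the second as the fallback.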

Prompt Templates by Domain

Math and Quantitative Problems

{problem}

Work through this step by step:
1. Identify what's being asked and what's given
2. List the relevant formulas or operations
3. Compute each intermediate value, showing work
4. State the final answer with units

Do not skip any arithmetic steps.

Code Generation

Task: {description}
Language: {language}
Constraints: {constraints}

Think through the approach before writing code:
1. What's the algorithm or approach?
2. What are the edge cases?
3. What's the time/space complexity?

Then write the implementation with comments explaining each section.
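Templates like this one can be stored as format strings and filled at request time. The sketch below uses Python's built-in `str.format` with the code generation template; the `render` helper and the constant name are assumptions for this example.

```python
CODE_TEMPLATE = (
    "Task: {description}\n"
    "Language: {language}\n"
    "Constraints: {constraints}\n\n"
    "Think through the approach before writing code:\n"
    "1. What's the algorithm or approach?\n"
    "2. What are the edge cases?\n"
    "3. What's the time/space complexity?\n\n"
    "Then write the implementation with comments explaining each section."
)


def render(template: str, **fields: str) -> str:
    """Fill a prompt template; a missing placeholder raises KeyError."""
    return template.format(**fields)
```

Failing loudly on a missing field is deliberate: a prompt sent with a literal `{constraints}` in it tends to produce confusing output that is harder to debug than an exception.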

Analysis and Decision-Making

Situation: {context}
Decision to make: {question}

Analyze step by step:
1. Summarize the relevant facts
2. Identify the key tradeoffs (list pros and cons)
3. Consider the second-order effects of each option
4. Weigh the tradeoffs and recommend a decision
5. State your confidence level (high/medium/low) and why

Explanation and Teaching

Topic: {topic}
Audience: {audience level}

Explain this step by step, building from fundamentals:
1. Start with the core concept in one simple sentence
2. Break it into 3-5 logical sub-parts
3. Explain each sub-part with a concrete example
4. Connect the sub-parts back to the whole
5. End with a summary and one common misconception to avoid

Failure Modes: When CoT Goes Wrong

CoT isn't foolproof. These patterns degrade accuracy and waste tokens.

Reasoning cascades. An early mistake propagates through the chain. The model builds detailed reasoning on a wrong premise, producing a confidently wrong answer. This is the most common CoT failure — and the hardest to detect because the reasoning looks plausible.

Fix: Add a verification step after the chain. "Before giving your final answer, check: does the reasoning hold if we change one assumption?"

Overconfidence from verbosity. Longer chains feel more authoritative but aren't necessarily more accurate. Models sometimes generate extra reasoning to justify a wrong intuition.

Fix: Use self-consistency. If multiple independent chains disagree, the verbose but wrong chain gets exposed.

Hallucinated intermediate facts. The model invents numbers, citations, or facts during its reasoning that don't exist in the input.

Fix: "Base your reasoning only on information provided in the problem. If you need to assume a value, state the assumption explicitly."

CoT on simple tasks causes confusion. Adding "think step by step" to a task like "translate 'hello' to Spanish" can cause the model to overthink: "Hmm, hello is a greeting. In Spanish, greetings include hola, buenos días..." producing a worse answer than the direct "hola."

Fix: Don't use CoT on tasks classified as simple. A lightweight router prevents this.

Best Practices

Start with zero-shot CoT. One sentence. Measure the accuracy delta. Don't build few-shot templates until you know the baseline gain.
Measure before committing. Run an A/B test on your own data. CoT that works on GSM8K may not work on your domain's math problems.
Route intelligently. Not every query needs reasoning. A cheap classifier can save roughly 30-40% of CoT tokens with little accuracy loss, as long as it rarely sends complex queries down the direct path.
Match the technique to the reliability requirement. User-facing reasoning queries → zero-shot CoT. High-stakes automated decisions → self-consistency. Planning/optimization → Tree of Thoughts. Everything else → direct.
Know your model. CoT on Claude 3.5 Sonnet gives marginal gains. CoT on a Llama 8B model gives massive gains. Don't apply the same strategy across models.
Watch for reasoning models. o1, o3, and similar models don't need CoT. Adding it can hurt performance. If your model has "reasoning" or "thinking" in its name, skip CoT.
CoT is a diagnostic, not just a performance tool. When debugging a prompt, forcing the model to show its reasoning reveals where it goes wrong. Use CoT during prompt development even if you strip it for production.

Chain-of-Thought in Practice