Self-Consistency: Improving Reasoning Through Majority Voting
Sample multiple reasoning paths and vote on the final answer. Self-consistency wraps around Chain-of-Thought prompting and can boost accuracy on arithmetic, commonsense, and symbolic reasoning tasks.
The Core Idea
Self-consistency (Wang et al., 2022) replaces greedy decoding with diverse sampling. Instead of committing to one reasoning path, you sample several and keep the answer that most of them reach. The intuition: wrong paths tend to scatter across different wrong answers, while correct paths converge on the same one.
Standard CoT: Model → One reasoning path → One answer (greedy, risk of error)
Self-Consistency: Model → 5-10 reasoning paths → Majority vote → Most reliable answer
When Single-Path Reasoning Fails
Question: When I was 6 my sister was half my age. Now I'm 70,
how old is my sister?
Single CoT output:
"When I was 6, my sister was half my age = 3.
Now I'm 70, so she's 70 / 2 = 35."
→ WRONG (correct answer: 67)
With self-consistency, you generate multiple paths:
Path 1: "Sister was 3 when I was 6, age difference is 3 years.
At 70, sister is 70 - 3 = 67." → Answer: 67 ✓
Path 2: "Sister was half my age at 6 = 3 years old.
70 - 6 = 64 years passed. She's 3 + 64 = 67." → Answer: 67 ✓
Path 3: "Half of 70 is 35." → Answer: 35 ✗
Result: 67 appears twice, 35 appears once → Final answer: 67 ✓
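The tally above can be reproduced in a few lines. This is a minimal sketch that assumes the final answer has already been extracted from each path as a string; the variable names are illustrative.

```python
from collections import Counter

# Final answers extracted from the three paths in the example above.
path_answers = ["67", "67", "35"]

counts = Counter(path_answers)
winner, votes = counts.most_common(1)[0]
agreement = votes / len(path_answers)

print(winner, votes, round(agreement, 2))  # 67 2 0.67
```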
Implementation
import asyncio
import re
from collections import Counter


async def self_consistency(model, prompt, n_samples=5, temperature=0.7):
    """Generate N reasoning paths and return the majority answer."""
    # Sample all paths concurrently; temperature > 0 keeps them diverse
    responses = await asyncio.gather(*[
        model.generate(prompt, temperature=temperature, max_tokens=500)
        for _ in range(n_samples)
    ])

    # Extract final answers from reasoning paths
    answers = [extract_final_answer(r) for r in responses]

    # Majority vote
    counts = Counter(answers)
    best_answer, votes = counts.most_common(1)[0]
    confidence = votes / n_samples

    return {
        "answer": best_answer,
        "confidence": confidence,
        "all_answers": answers,
        "reasoning_paths": responses,
    }


def extract_final_answer(text: str) -> str:
    """Extract the final answer from a reasoning chain.

    Looks for patterns like 'The answer is X' or 'Therefore, X'.
    """
    patterns = [
        # Optional comma after "therefore" so "Therefore, 67" yields "67"
        r'(?:answer is|therefore,?|conclusion:)\s*(.+)',
        r'(?:^|\n)(\d+)\s*$',  # Last line is just a number
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text.lower())
        if matches:
            # Drop trailing punctuation so "67." and "67" count as the same vote
            return matches[-1].strip().rstrip('.!')
    # Fallback: treat the last line as the answer
    return text.strip().split('\n')[-1]
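Because the vote counts exact strings, surface variants such as "67", "67.0", and "67 years old" would split the vote. One possible fix is to normalize answers before counting. The sketch below is an assumption, not part of the original implementation: `normalize_answer` is a hypothetical helper that keys numeric answers by their value and compares everything else as lowercase text.

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Hypothetical helper: map surface variants of an answer to one vote key."""
    text = raw.strip().lower().rstrip(".!")
    # Pull out the first number if there is one, allowing thousands separators.
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return text  # Non-numeric answers are compared as lowercase text
    number = float(match.group().replace(",", ""))
    # Render whole numbers without a trailing ".0" so "67" and "67.0" match.
    return str(int(number)) if number.is_integer() else str(number)

raw_answers = ["67", "67.0", "67 years old.", "35"]
votes = Counter(normalize_answer(a) for a in raw_answers)
print(votes.most_common())  # [('67', 3), ('35', 1)]
```

Taking the first number is a deliberate simplification. It suits short final answers, but it would misread an answer that mentions several numbers.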
Aggregation Strategies
| Method | How It Works | Best For |
|---|---|---|
| Majority vote | Count exact answer matches | Discrete answers (numbers, categories) |
| Weighted vote | Weight by reasoning chain quality score | When you have a confidence evaluator |
| Span extraction | Find overlapping answer spans across responses | Free-text answers |
| LLM aggregator | Ask another LLM call to synthesize all paths | Complex multi-faceted answers |
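As an illustration of the weighted-vote row, the sketch below sums a quality score per answer instead of counting each path once. The scores are made up; in practice they would come from whatever confidence evaluator you have, which is assumed here rather than specified.

```python
from collections import defaultdict

def weighted_vote(scored_answers):
    """Sum a quality score per answer; return (answer, share of total weight).

    Assumes at least one (answer, score) pair with a positive total weight.
    """
    totals = defaultdict(float)
    for answer, score in scored_answers:
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())

# Hypothetical (answer, score) pairs; scores would come from your own evaluator.
scored = [("35", 0.9), ("67", 0.6), ("67", 0.7)]
answer, share = weighted_vote(scored)
print(answer, round(share, 2))  # 67 0.59
```

Here the single highest-scoring path says 35, but the two paths that agree on 67 outweigh it together.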
Temperature and Sampling
Temperature controls diversity. Higher = more diverse paths, but also more noise.
| Temperature | Diversity | Accuracy Impact | Best For |
|---|---|---|---|
| 0.0 | Deterministic | No gain (same path each time) | Never use for self-consistency |
| 0.3-0.5 | Low diversity | Small gains | Simple arithmetic |
| 0.5-0.7 | Moderate diversity | Best balance | Most reasoning tasks |
| 0.7-1.0 | High diversity | Risk of noise overwhelming signal | Complex open-ended reasoning |
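To see why temperature changes diversity, recall that sampling temperature is commonly applied by dividing the logits by T before the softmax. The sketch below uses made-up logits for three candidate tokens; how a particular API applies temperature internally is an assumption and may differ.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; T=0 corresponds to greedy argmax")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # Made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, so sampled paths look alike. Higher temperatures flatten the distribution, which yields more varied paths and more stray errors.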
When Self-Consistency Helps
Strong gains on:
- Arithmetic reasoning (GSM8K, MATH datasets)
- Commonsense reasoning (StrategyQA, CommonsenseQA)
- Symbolic reasoning (date arithmetic, logical deduction)
Weak or no gains on:
- Factual recall (the model either knows it or doesn't)
- Simple classification (paths all converge to same answer)
- Tasks where the model makes the same mistake on nearly every path (voting only reinforces the shared wrong answer)
- Creative writing (no single "correct" answer)
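A simplified probability model helps explain both lists. Assume each path is independently correct with probability p, and pessimistically assume all wrong paths agree on the same wrong answer. The chance that the majority is correct is then a binomial tail. Real paths are not independent, so treat this as intuition only.

```python
from math import comb

def majority_correct_probability(p: float, n: int) -> float:
    """P(more than half of n independent paths are correct), each correct with prob p."""
    need = n // 2 + 1  # Smallest number of correct paths that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for p in (0.4, 0.6, 0.8):
    print(p, [round(majority_correct_probability(p, n), 3) for n in (1, 5, 11)])
```

Under these assumptions, voting amplifies whatever tendency the single path already has: with p = 0.6, five paths raise the chance of a correct majority to about 0.68, while with p below 0.5 more samples make the result worse.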
Cost Analysis
Self-consistency scales token cost linearly with the number of samples. Each sample is a full generation, so 5 samples cost about 5 times a single call.
| Samples | Relative Cost | Typical Accuracy Gain (varies by model and task) |
|---|---|---|
| 1 (baseline) | 1x | - |
| 3 | 3x | +10-15% |
| 5 | 5x | +15-20% |
| 10 | 10x | +20-25% (diminishing returns beyond 10) |
When the cost is worth it:
- High-stakes decisions where accuracy matters more than cost
- Automated pipelines where you can batch process
- One-time analysis tasks (research, legal review)
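To make the linear scaling concrete, here is a rough cost estimator. The token counts and per-1K-token prices are placeholders chosen for illustration, not real rates for any provider, and the function name is hypothetical.

```python
def estimate_cost(n_samples, prompt_tokens, completion_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate total cost when every sample is a separate full call."""
    per_call = ((prompt_tokens / 1000) * input_price_per_1k
                + (completion_tokens / 1000) * output_price_per_1k)
    return n_samples * per_call

# Placeholder token counts and prices, not real rates for any provider.
for n in (1, 3, 5, 10):
    cost = estimate_cost(n, prompt_tokens=400, completion_tokens=300,
                         input_price_per_1k=0.001, output_price_per_1k=0.002)
    print(n, round(cost, 4))
```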
Combining With Other Techniques
Self-consistency wraps around other prompting strategies; it is not a replacement for them.
- CoT + Self-Consistency: The standard combination. Generate CoT chains, vote on answers.
- ToT + Self-Consistency: Generate multiple trees, vote on the final answers reached at their leaf nodes.
- Few-Shot + Self-Consistency: Use few-shot examples to improve individual path quality, then vote.