The Core Idea

Self-Refine (Madaan et al. 2023) makes a single LLM its own critic. Instead of accepting the first output, the model evaluates its own work, generates actionable feedback, and refines the output — all within a single prompt loop. No RL, no supervised data, no external verifier.

Standard:  Prompt → LLM → Done (take it or leave it)
Self-Refine: Prompt → LLM → Output → LLM critiques → LLM refines → (repeat)

The model serves three roles: generator, critic, and refiner. The same model that made the mistake can often fix it — if you ask it to check its own work.

The FEEDBACK → REFINE Loop

Self-Refine alternates between two steps:

Step 1: FEEDBACK — "Here's your output. What's wrong with it? Be specific."
Step 2: REFINE  — "Here's the feedback. Rewrite the output to address it."
Then repeat both steps until a stopping criterion is met.

What Makes Feedback Actionable

Vague feedback is useless. Self-Refine needs feedback that:

Localizes the problem — "The sentiment is neutral due to phrases like 'good' and 'okay'" not "the review is weak"
Gives a fix direction — "Replace neutral adjectives with strongly positive ones" not "make it better"
Is specific to the task — Different tasks need different feedback rubrics (code: efficiency; writing: tone; math: calculation errors)
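The three properties above can be built into the feedback prompt itself. The following is a minimal sketch, not a prescribed format: the `RUBRICS` mapping and `build_feedback_prompt` are hypothetical names, and the rubric wording is an assumption based on the task examples listed above.

```python
# Hypothetical task-specific rubrics; the focus areas mirror the examples in the
# text (code: efficiency; writing: tone; math: calculation errors).
RUBRICS = {
    "code": "efficiency",
    "writing": "tone",
    "math": "calculation errors",
}


def build_feedback_prompt(task_type: str, task: str, output: str) -> str:
    """Build a critique prompt that asks for localized, fixable feedback."""
    # Fall back to a generic focus for task types without a rubric.
    focus = RUBRICS.get(task_type, "accuracy and completeness")
    return (
        f"Task: {task}\n"
        f"Output: {output}\n\n"
        f"Critique the output, focusing on {focus}.\n"
        "For each problem: quote the exact phrase or line, explain why it is "
        "a problem, and say how to fix it."
    )


if __name__ == "__main__":
    print(build_feedback_prompt("math", "Compute 17 * 23", "17 * 23 = 381"))
```

Asking the critic to quote the offending text is one way to force localization, and asking "how to fix it" gives the refine step a direction to follow.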

Self-Refine in Action: Sentiment Reversal

Here's a real Self-Refine trace — rewriting a positive hotel review into a negative one:

Task: Rewrite this review to have a Negative sentiment.
Input: "The Trop is magical. The retirement community vibe adds
        charm. The food court and rooms are top-notch."

── Initial output (model got it wrong — still positive) ──
"If you're looking for a good experience in Vegas, the Trop
may be the perfect place. The hallways have a charming feel
and the food court offers tasty options."

── Feedback ──
"This review is still positive because of words like 'good',
'perfect', 'charming', and 'tasty'. Replace these with
strongly negative adjectives like 'terrible', 'smelly',
'disgusting'. Add sensory details that make it unpleasant."

── Refined output (fixed) ──
"If you ever wondered where the magic of Vegas crawled into
a hole to rot, welcome to the Trop. The hallways smelled of
formaldehyde. The food court was terrible. The rooms looked
like a retirement community from Florida. Steer clear."

The feedback was actionable — it pinpointed the offending words and gave replacement direction. The refinement addressed every point.
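Because the feedback names specific words, the claim that the refinement "addressed every point" can be checked mechanically. The sketch below is an illustrative lexical check only (the helper name `remaining_flagged_words` is hypothetical); it confirms the flagged words are gone, not that the sentiment actually flipped.

```python
import re


def remaining_flagged_words(text: str, flagged: list[str]) -> list[str]:
    """Return the flagged words that still appear in `text` as whole words."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return [word for word in flagged if word.lower() in tokens]


# Words the feedback step called out in the trace above.
flagged = ["good", "perfect", "charming", "tasty"]

initial = (
    "If you're looking for a good experience in Vegas, the Trop "
    "may be the perfect place. The hallways have a charming feel "
    "and the food court offers tasty options."
)
refined = (
    "If you ever wondered where the magic of Vegas crawled into "
    "a hole to rot, welcome to the Trop. The hallways smelled of "
    "formaldehyde. The food court was terrible. The rooms looked "
    "like a retirement community from Florida. Steer clear."
)

print(remaining_flagged_words(initial, flagged))  # ['good', 'perfect', 'charming', 'tasty']
print(remaining_flagged_words(refined, flagged))  # []
```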

Prompt Template

[SYSTEM]
You are a writer, editor, and critic in one system.

Step 1 — Generate:
{task_prompt}

Step 2 — Critique your output:
Rate your output on these dimensions:
- Accuracy: Did you answer the question correctly?
- Completeness: Did you cover everything asked?
- Quality: Is the reasoning sound and well-expressed?
For each issue found, state EXACTLY what is wrong and WHY.

Step 3 — Refine:
Using the critique above, rewrite your output to fix every issue.

Implementation

def self_refine(llm, task_prompt, max_iterations=3):
    """Generate, critique, and refine iteratively.

    `llm` is any object with a generate(prompt) -> str method.
    Returns (output, metadata); metadata["iterations"] counts refinements applied.
    """

    # Step 1: Initial generation
    output = llm.generate(task_prompt)
    feedback = None

    for i in range(max_iterations):
        # Step 2: Self-critique
        feedback_prompt = f"""
        Here is an output generated for the task below. Evaluate it critically.
        Be specific: point to exact phrases, errors, or gaps. Don't be vague.
        If nothing needs to change, reply with "no changes needed".

        Task: {task_prompt}

        Output: {output}

        Feedback (be specific, actionable, and constructive):
        """
        feedback = llm.generate(feedback_prompt)

        # Stopping condition: check the critique BEFORE refining, so an output
        # already judged good is returned as-is and no refine call is wasted.
        if is_sufficient(feedback):
            return output, {"iterations": i, "feedback": feedback}

        # Step 3: Refine based on feedback
        refine_prompt = f"""
        Task: {task_prompt}

        Previous output: {output}

        Feedback on previous output: {feedback}

        Refined output (address every point in the feedback):
        """
        output = llm.generate(refine_prompt)

    return output, {"iterations": max_iterations, "feedback": feedback}


def is_sufficient(feedback: str) -> bool:
    """Heuristic check: does the feedback say the output is already good?

    Phrase matching is brittle: praise followed by "but ..." still matches.
    Bare "correct" is left out because it also matches "incorrect".
    """
    sufficiency_indicators = [
        "no issues", "looks good", "no errors",
        "well done", "no changes needed",
    ]
    text = feedback.lower()
    return any(indicator in text for indicator in sufficiency_indicators)
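To step through the loop without calling a real model, a scripted stand-in can be handy. The `ScriptedLLM` class below is a hypothetical test double, not part of any library; the only thing it assumes is the `generate(prompt)` interface that `self_refine` already uses.

```python
class ScriptedLLM:
    """Hypothetical test double: returns canned replies in order."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []  # every prompt received, kept for inspection

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._replies:
            raise RuntimeError("ScriptedLLM ran out of scripted replies")
        return self._replies.pop(0)


if __name__ == "__main__":
    llm = ScriptedLLM(["first draft", "The second sentence is vague.", "second draft"])
    print(llm.generate("Write a summary."))  # first draft
    print(len(llm.prompts))                  # 1
```

Passing an instance as the `llm` argument of `self_refine` makes each run deterministic, and `llm.prompts` shows exactly what the critic and refiner were asked at each step.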

Where Self-Refine Excels

Task	Why Self-Refine Works	Typical Gain
Code generation	Model can spot bugs/efficiency issues in its own code	+15-25%
Writing (reviews, essays)	Model can detect tone, repetition, weak phrasing	+20-30%
Math reasoning	Model can catch arithmetic errors on re-read	+10-20%
Dialogue responses	Model can judge relevance, informativeness, engagement	+15-25%
Toxicity removal	Model can identify problematic language and rephrase	+25-40%

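Source: Madaan et al. 2023, evaluated with GPT-3.5, ChatGPT, and GPT-4 across 7 tasks. The paper reports roughly 20% absolute improvement on average; treat the per-task ranges above as approximate.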

Where It Fails

When the model can't recognize its own errors:

If the model confidently produces wrong math, it will confidently critique the wrong math as correct
Circular failures: the model thinks X is right, critiques X as right, "refines" to still-X

When the task has no clear quality signal:

Creative writing where "good" is subjective
Open-ended brainstorming where all ideas are valid

When the cost isn't worth it:

The first iteration takes three calls (generate + feedback + refine), and every further iteration adds two more (feedback + refine)
For simple tasks where the first answer is usually right, skip it

Self-Refine vs. Other Strategies

Technique	Mechanism	External Model?	Training Data?	Best For
Self-Refine	Generate → critique → refine	No (single LLM)	No	Self-correctable errors
Self-Consistency	Multiple samples → vote	No (single LLM)	No	Reasoning with verifiable answers
Reflexion	Act → evaluate → retry with memory	No (single LLM)	No	Agent task recovery
RLHF	Human feedback → reward model → PPO	Yes (reward model)	Yes (human preferences)	Alignment and safety
Constitutional AI	Model critiques based on principles	No (single LLM)	No (principles only)	Value alignment

Cost Analysis

Iterations	API Calls	Relative Cost	Typical Improvement
0 (baseline)	1	1x	—
1	3 (gen + fb + refine)	3x	+10-15%
2	5	5x	+15-20%
3	7	7x	+20% (diminishing returns)

Rule of thumb: Use 1 iteration for most tasks. Use 2-3 only when accuracy is critical and you can measure improvement objectively.
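The call counts in the table follow a simple pattern: one generation call, then one feedback call and one refine call per iteration. A small sketch (the function name `api_calls` is illustrative):

```python
def api_calls(iterations: int) -> int:
    """One generation call, plus a feedback and a refine call per iteration."""
    if iterations < 0:
        raise ValueError("iterations must be non-negative")
    return 1 + 2 * iterations


for n in range(4):
    print(n, api_calls(n))
# 0 1
# 1 3
# 2 5
# 3 7
```

Call count understates token cost somewhat, because each feedback and refine prompt also carries the task and the previous output as input.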

Stopping Criteria

Looping forever burns tokens. Pick a stopping condition:

Max iterations (simplest): Stop after N rounds regardless. N=1 for most cases.
Feedback signal: Stop when feedback says "no issues found" or "looks good"
Delta check: Stop when the refinement changes less than X% of the text
Quality threshold: Stop when a separate scoring prompt rates output above threshold
Human in the loop: Show each refinement, let human decide when to stop
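Two of these criteria, max iterations and the delta check, are easy to combine in code. The sketch below is one possible reading: it measures change with `difflib.SequenceMatcher` from the Python standard library, and the 5% default for `min_change` is an arbitrary assumption, not a recommended value.

```python
import difflib


def changed_fraction(previous: str, refined: str) -> float:
    """Share of the text that changed: 0.0 if identical, 1.0 if nothing is shared."""
    similarity = difflib.SequenceMatcher(None, previous, refined).ratio()
    return 1.0 - similarity


def should_stop(previous: str, refined: str, iteration: int,
                max_iterations: int = 3, min_change: float = 0.05) -> bool:
    """Stop at the iteration cap, or when a refinement barely changes the text."""
    if iteration >= max_iterations:
        return True
    return changed_fraction(previous, refined) < min_change


if __name__ == "__main__":
    print(should_stop("The answer is 42.", "The answer is 42.", iteration=1))  # True
    print(should_stop("abc", "xyz", iteration=1))                              # False
```

A delta near zero also flags the circular failure described earlier, where the model "refines" to the same output it started with.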

Combining Self-Refine With CoT

Self-Refine works best when the initial output has clear, surface-level errors — the kind the model can spot on re-reading. For deep reasoning errors, combine with Chain-of-Thought:

Step 1: CoT generation — "Let's think step by step about {problem}"
Step 2: Self-Refine the CoT chain — "Critique each step of this reasoning"
Step 3: Refine the faulty steps

This catches both reasoning gaps (CoT) and execution errors (Self-Refine).
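The three steps above can be sketched as a single pass. This is an illustration under assumptions: `generate` is any callable that maps a prompt string to a completion string, and the prompt wording beyond the two quoted phrases is made up for the example.

```python
from typing import Callable


def cot_self_refine(generate: Callable[[str], str], problem: str) -> str:
    """One CoT generation followed by one critique and one refinement."""
    # Step 1: Chain-of-Thought generation
    chain = generate(f"Let's think step by step about {problem}")

    # Step 2: critique the reasoning chain step by step
    critique = generate(
        "Critique each step of this reasoning. Number the faulty steps and "
        f"explain each error.\n\nProblem: {problem}\n\nReasoning:\n{chain}"
    )

    # Step 3: rewrite only the steps the critique flagged
    return generate(
        "Rewrite the reasoning, fixing the steps flagged below and keeping "
        f"the rest.\n\nProblem: {problem}\n\nReasoning:\n{chain}\n\n"
        f"Flagged steps:\n{critique}"
    )
```

Keeping the critique at the level of individual steps gives the refiner a precise target, which matches the "localize the problem" guidance from earlier.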

Self-Refine: Iterative Self-Improvement