Self-Refine: Iterative Self-Improvement

Use one LLM to generate, critique, and refine its own output in a feedback loop. Boost quality on code generation, writing, and math without external models or training data.

June 11, 2026
self-refine, feedback, iteration, refinement, prompt-engineering, code-generation

The Core Idea

Self-Refine (Madaan et al. 2023) makes a single LLM its own critic. Instead of accepting the first output, the model evaluates its own work, generates actionable feedback, and refines the output, all through repeated prompts to the same model. No RL, no supervised data, no external verifier.

Standard:  Prompt → LLM → Done (take it or leave it)
Self-Refine: Prompt → LLM → Output → LLM critiques → LLM refines → (repeat)

The model serves three roles: generator, critic, and refiner. The same model that made the mistake can often fix it — if you ask it to check its own work.

The FEEDBACK → REFINE Loop

Self-Refine starts from an initial generation, then alternates between two steps:

Step 1: FEEDBACK. "Here's your output. What's wrong with it? Be specific."
Step 2: REFINE. "Here's the feedback. Rewrite the output to address it."

Repeat both steps until a stopping criterion is met.

What Makes Feedback Actionable

Vague feedback is useless. Self-Refine needs feedback that:

  1. Localizes the problem — "The sentiment is neutral due to phrases like 'good' and 'okay'" not "the review is weak"
  2. Gives a fix direction — "Replace neutral adjectives with strongly positive ones" not "make it better"
  3. Is specific to the task — Different tasks need different feedback rubrics (code: efficiency; writing: tone; math: calculation errors)
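One way to get all three properties is to build the feedback prompt from a per-task rubric. The sketch below is illustrative only: the `RUBRICS` table, the `build_feedback_prompt` helper, and the `NO_ISSUES` sentinel are assumptions made for this example, not part of the Self-Refine paper.

```python
# Hypothetical rubrics; the dimension names are illustrative assumptions.
RUBRICS = {
    "code": ["correctness", "efficiency", "readability"],
    "writing": ["tone", "repetition", "weak phrasing"],
    "math": ["calculation errors", "skipped steps"],
}


def build_feedback_prompt(task_type: str, task: str, output: str) -> str:
    """Build a critique prompt that asks for localized, fixable feedback."""
    # Fall back to generic dimensions when the task type has no rubric.
    dimensions = RUBRICS.get(task_type, ["accuracy", "completeness"])
    bullet_list = "\n".join(f"- {d}" for d in dimensions)
    return (
        f"Task: {task}\n\n"
        f"Output: {output}\n\n"
        "Critique the output on these dimensions:\n"
        f"{bullet_list}\n\n"
        "For each issue, quote the exact phrase or line that is wrong, "
        "then state how to fix it. If there are no issues, reply NO_ISSUES."
    )


print(build_feedback_prompt("math", "What is 17 * 3?", "17 * 3 = 41"))
```

Asking the model to quote the offending phrase pushes it toward localized feedback, and asking for a fix pushes it toward a direction rather than a verdict.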

Self-Refine in Action: Sentiment Reversal

Here's a real Self-Refine trace — rewriting a positive hotel review into a negative one:

Task: Rewrite this review to have a Negative sentiment.
Input: "The Trop is magical. The retirement community vibe adds
        charm. The food court and rooms are top-notch."

── Initial output (model got it wrong — still positive) ──
"If you're looking for a good experience in Vegas, the Trop
may be the perfect place. The hallways have a charming feel
and the food court offers tasty options."

── Feedback ──
"This review is still positive because of words like 'good',
'perfect', 'charming', and 'tasty'. Replace these with
strongly negative adjectives like 'terrible', 'smelly',
'disgusting'. Add sensory details that make it unpleasant."

── Refined output (fixed) ──
"If you ever wondered where the magic of Vegas crawled into
a hole to rot, welcome to the Trop. The hallways smelled of
formaldehyde. The food court was terrible. The rooms looked
like a retirement community from Florida. Steer clear."

The feedback was actionable — it pinpointed the offending words and gave replacement direction. The refinement addressed every point.
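Because the feedback names specific words, you can check mechanically whether the refinement dealt with them. This is a minimal sketch, assuming you have already pulled the flagged words out of the feedback; the `remaining_flagged_words` helper is hypothetical.

```python
import re


def remaining_flagged_words(text: str, flagged: list[str]) -> list[str]:
    """Return the flagged words that still appear in text as whole words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [w for w in flagged if w.lower() in words]


flagged = ["good", "perfect", "charming", "tasty"]
initial = (
    "If you're looking for a good experience in Vegas, the Trop "
    "may be the perfect place. The hallways have a charming feel "
    "and the food court offers tasty options."
)
refined = (
    "If you ever wondered where the magic of Vegas crawled into "
    "a hole to rot, welcome to the Trop. The hallways smelled of "
    "formaldehyde. The food court was terrible. The rooms looked "
    "like a retirement community from Florida. Steer clear."
)

print(remaining_flagged_words(initial, flagged))  # all four flagged words
print(remaining_flagged_words(refined, flagged))  # empty list
```

A word check like this only confirms the flagged phrases are gone. It says nothing about whether the overall sentiment actually flipped.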

Prompt Template

[SYSTEM]
You are a writer, editor, and critic in one system.

Step 1 — Generate:
{task_prompt}

Step 2 — Critique your output:
Rate your output on these dimensions:
- Accuracy: Did you answer the question correctly?
- Completeness: Did you cover everything asked?
- Quality: Is the reasoning sound and well-expressed?
For each issue found, state EXACTLY what is wrong and WHY.

Step 3 — Refine:
Using the critique above, rewrite your output to fix every issue.

Implementation

def self_refine(llm, task_prompt, max_iterations=3):
    """Generate, critique, and refine iteratively.

    Assumes llm.generate(prompt) returns the completion as a string.
    Returns (output, info), where info["iterations"] counts refine steps run.
    """

    # Step 1: Initial generation
    output = llm.generate(task_prompt)
    feedback = None

    for i in range(max_iterations):
        # Step 2: Self-critique
        feedback_prompt = (
            "Here is an output generated for the task below. Evaluate it critically.\n"
            "Be specific: point to exact phrases, errors, or gaps. Don't be vague.\n"
            'If nothing needs to change, say "no changes needed".\n\n'
            f"Task: {task_prompt}\n\n"
            f"Output: {output}\n\n"
            "Feedback (be specific, actionable, and constructive):"
        )
        feedback = llm.generate(feedback_prompt)

        # Stopping condition: check the feedback BEFORE refining, so an output
        # the critic already accepts is returned as-is without an extra call.
        if is_sufficient(feedback):
            return output, {"iterations": i, "feedback": feedback}

        # Step 3: Refine based on feedback
        refine_prompt = (
            f"Task: {task_prompt}\n\n"
            f"Previous output: {output}\n\n"
            f"Feedback on previous output: {feedback}\n\n"
            "Refined output (address every point in the feedback):"
        )
        output = llm.generate(refine_prompt)

    return output, {"iterations": max_iterations, "feedback": feedback}


def is_sufficient(feedback: str) -> bool:
    """Heuristic check that the feedback reports nothing left to fix."""
    # A bare "correct" is deliberately not listed: as a substring it also
    # matches "incorrect", which would stop the loop on negative feedback.
    sufficiency_indicators = [
        "no issues", "looks good", "no errors",
        "well done", "no changes needed",
    ]
    text = feedback.lower()
    return any(indicator in text for indicator in sufficiency_indicators)
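
Substring checks on free-form feedback are brittle in general. For example, a test for "correct" also matches "incorrect". A stricter option is to ask the critic to end its feedback with a fixed sentinel and accept only that. The sketch below assumes the feedback prompt tells the model to finish with the token NO_ISSUES when nothing needs fixing. Both the token and the helper name are made up for illustration.

```python
def is_sufficient_sentinel(feedback: str, sentinel: str = "NO_ISSUES") -> bool:
    """True only when the last non-empty line of the feedback is the sentinel."""
    lines = [line.strip() for line in feedback.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == sentinel


# The substring pitfall that motivates the sentinel:
print("correct" in "the answer is incorrect")                 # True
print(is_sufficient_sentinel("The answer is incorrect."))     # False
print(is_sufficient_sentinel("Checked each step.\nNO_ISSUES"))  # True
```

The trade-off: a sentinel only works if the model follows the output format, so keep a max-iteration cap as a backstop.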

Where Self-Refine Excels

| Task | Why Self-Refine Works | Typical Gain |
| --- | --- | --- |
| Code generation | Model can spot bugs/efficiency issues in its own code | +15-25% |
| Writing (reviews, essays) | Model can detect tone, repetition, weak phrasing | +20-30% |
| Math reasoning | Model can catch arithmetic errors on re-read | +10-20% |
| Dialogue responses | Model can judge relevance, informativeness, engagement | +15-25% |
| Toxicity removal | Model can identify problematic language and rephrase | +25-40% |

Data: Madaan et al. 2023, evaluated on GPT-3.5 and GPT-4 across 7 tasks.

Where It Fails

When the model can't recognize its own errors:

  • If the model confidently produces wrong math, it will confidently critique the wrong math as correct
  • Circular failures: the model thinks X is right, critiques X as right, "refines" to still-X

When the task has no clear quality signal:

  • Creative writing where "good" is subjective
  • Open-ended brainstorming where all ideas are valid

When the cost isn't worth it:

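  • Each iteration adds two LLM calls (feedback + refine) on top of the initial generation, so one iteration costs about 3x the baseline in calls and three iterations about 7x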
  • For simple tasks where the first answer is usually right, skip it

Self-Refine vs. Other Strategies

| Technique | Mechanism | External Model? | Training Data? | Best For |
| --- | --- | --- | --- | --- |
| Self-Refine | Generate → critique → refine | No (single LLM) | No | Self-correctable errors |
| Self-Consistency | Multiple samples → vote | No (single LLM) | No | Reasoning with verifiable answers |
| Reflexion | Act → evaluate → retry with memory | No (single LLM) | No | Agent task recovery |
| RLHF | Human feedback → reward model → PPO | Yes (reward model) | Yes (human preferences) | Alignment and safety |
| Constitutional AI | Model critiques based on principles | No (single LLM) | No (principles only) | Value alignment |

Cost Analysis

| Iterations | API Calls | Relative Cost | Typical Improvement |
| --- | --- | --- | --- |
| 0 (baseline) | 1 | 1x | n/a |
| 1 | 3 (gen + fb + refine) | 3x | +10-15% |
| 2 | 5 | 5x | +15-20% |
| 3 | 7 | 7x | +20% (diminishing returns) |

Rule of thumb: Use 1 iteration for most tasks. Use 2-3 only when accuracy is critical and you can measure improvement objectively.
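
The call counts above follow a simple pattern: one initial generation, plus a feedback call and a refine call per iteration. A quick sketch of that arithmetic (the helper name is made up for illustration):

```python
def self_refine_calls(iterations: int) -> int:
    """API calls for a run: 1 generation + 2 calls (feedback, refine) per iteration."""
    if iterations < 0:
        raise ValueError("iterations must be non-negative")
    return 1 + 2 * iterations


for n in range(4):
    print(n, self_refine_calls(n))  # 0 -> 1, 1 -> 3, 2 -> 5, 3 -> 7
```

This counts calls, not tokens. The feedback and refine prompts repeat the task and the previous output, so token cost likely grows somewhat faster than the call count.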

Stopping Criteria

Looping forever burns tokens. Pick a stopping condition:

  • Max iterations (simplest): Stop after N rounds regardless. N=1 for most cases.
  • Feedback signal: Stop when feedback says "no issues found" or "looks good"
  • Delta check: Stop when the refinement changes less than X% of the text
  • Quality threshold: Stop when a separate scoring prompt rates output above threshold
  • Human in the loop: Show each refinement, let human decide when to stop
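
The max-iterations and delta-check criteria are easy to combine. Here is a minimal sketch using Python's standard `difflib`. The 5% threshold and the function names are assumptions for illustration, not recommended values.

```python
from difflib import SequenceMatcher


def change_ratio(previous: str, refined: str) -> float:
    """Fraction of the text that changed: 0.0 means identical, 1.0 means nothing shared."""
    matcher = SequenceMatcher(None, previous, refined, autojunk=False)
    return 1.0 - matcher.ratio()


def should_stop(previous: str, refined: str, iteration: int,
                max_iterations: int = 3, min_change: float = 0.05) -> bool:
    """Stop at the iteration cap, or when the refinement barely changed the text."""
    if iteration + 1 >= max_iterations:  # iteration is 0-based
        return True
    return change_ratio(previous, refined) < min_change


print(should_stop("The food was okay.", "The food was okay.", iteration=0))      # True
print(should_stop("The food was okay.", "Avoid this place entirely.", iteration=0))  # False
```

The delta check also catches the circular case where the model "refines" an output into essentially the same output.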

Combining Self-Refine With CoT

Self-Refine works best when the initial output has clear, surface-level errors — the kind the model can spot on re-reading. For deep reasoning errors, combine with Chain-of-Thought:

Step 1: CoT generation — "Let's think step by step about {problem}"
Step 2: Self-Refine the CoT chain — "Critique each step of this reasoning"
Step 3: Refine the faulty steps

This catches both reasoning gaps (CoT) and execution errors (Self-Refine).
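
The three steps can be chained as three calls to the same model. This is a sketch under stated assumptions: `llm` is any callable that takes a prompt string and returns a string, and the critique and rewrite prompt wording beyond the two quoted prompts above is illustrative.

```python
from typing import Callable


def cot_self_refine(llm: Callable[[str], str], problem: str) -> str:
    """Run CoT generation, a step-by-step critique, then a targeted rewrite."""
    # Step 1: CoT generation
    chain = llm(f"Let's think step by step about {problem}")

    # Step 2: Critique the reasoning chain step by step
    critique = llm(
        "Critique each step of this reasoning. "
        "Name the faulty steps and say what is wrong with each.\n\n"
        f"Problem: {problem}\n\nReasoning:\n{chain}"
    )

    # Step 3: Refine the faulty steps
    return llm(
        f"Problem: {problem}\n\nReasoning:\n{chain}\n\nCritique:\n{critique}\n\n"
        "Rewrite the reasoning, fixing the faulty steps, then state the final answer."
    )


# Usage with a scripted stand-in for a real model client:
def echo_llm(prompt: str) -> str:
    return f"[model reply to {len(prompt)} chars]"


print(cot_self_refine(echo_llm, "how many weekdays are in 3 weeks"))
```

Passing the full chain into the critique prompt lets the model point at individual steps instead of judging only the final answer.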