Graph-of-Thought: Beyond Trees and Chains

Model reasoning as a directed graph with merging, branching, and backtracking. Take GoT beyond Chain-of-Thought and Tree-of-Thought for complex multi-source synthesis.

June 10, 2026
graph-of-thoughtreasoningtree-of-thoughtcotprompt-engineering

The Reasoning Spectrum

Prompting techniques form a progression in expressive power:

Chain-of-Thought (CoT) → linear, one path
  ↓
Tree-of-Thought (ToT)  → branching, evaluate and prune
  ↓
Graph-of-Thought (GoT) → merging, backtracking, combining partial solutions

Graph-of-Thought (Besta et al. 2023) models reasoning as a directed graph where thoughts are nodes and dependencies are edges. This enables operations that linear chains and trees cannot express.

Core Operations

OperationWhat It DoesWhen You Need It
GenerateCreate a new thought from existing onesStandard step forward
AggregateMerge multiple thoughts into oneCombining partial results
RefineImprove a thought based on feedbackIterative improvement
BacktrackReturn to an earlier thought and try a different pathDead-end recovery
ScoreEvaluate thought quality for pruningQuality control

When GoT Beats ToT and CoT

GoT wins when partial solutions can be combined:

  • Sorting problems: Breaking a list into sublists, sorting each, merging results. Sorting 64 numbers: 32-way split → sort individually → merge.
  • Document synthesis: Multiple documents each cover part of a topic. Generate summaries → aggregate into unified report.
  • Parallel sub-problems: A problem decomposes into independent sub-problems whose solutions combine additively.

ToT wins when branching is the primary need:

  • Creative writing with multiple candidate outlines
  • Game playing with multiple move sequences
  • Planning with a clear evaluation function

CoT wins when the reasoning is linear:

  • Simple arithmetic
  • Step-by-step instructions
  • Sequential deduction

Implementation Sketch

class GraphOfThought:
    def __init__(self, llm, scorer):
        self.llm = llm
        self.scorer = scorer  # function(thought) → score
        self.nodes = {}  # id → {content, score, parents, children}

    def generate(self, prompt, parent_ids=None):
        """Generate a new thought from parent thoughts."""
        context = prompt
        if parent_ids:
            parent_content = [self.nodes[pid]["content"] for pid in parent_ids]
            context = f"Based on: {'; '.join(parent_content)}\n\n{prompt}"
        response = self.llm.generate(context)
        node_id = len(self.nodes)
        self.nodes[node_id] = {
            "content": response,
            "score": self.scorer(response),
            "parents": parent_ids or [],
            "children": []
        }
        for pid in (parent_ids or []):
            self.nodes[pid]["children"].append(node_id)
        return node_id

    def aggregate(self, node_ids, prompt):
        """Merge multiple thoughts into one."""
        contents = [self.nodes[nid]["content"] for nid in node_ids]
        merge_prompt = f"Combine these into a single coherent answer:\n\n"
        merge_prompt += "\n---\n".join(contents)
        merge_prompt += f"\n\n{prompt}"
        return self.generate(merge_prompt, parent_ids=node_ids)

    def backtrack(self, node_id):
        """Return to a node's parents, removing the failed path."""
        self.nodes[node_id]["score"] = -1  # mark as dead
        return self.nodes[node_id]["parents"]

# Example: combining three research perspectives
got = GraphOfThought(llm, scorer)
n1 = got.generate("Research the carbon impact of AI training from an energy perspective")
n2 = got.generate("Research the carbon impact of AI training from a hardware perspective")
n3 = got.generate("Research the carbon impact of AI training from a data center perspective")
result = got.aggregate([n1, n2, n3], "Synthesize a comprehensive carbon impact report")

The Scoring Problem

Every graph operation depends on scoring. Getting scores wrong means pruning good ideas or keeping bad ones.

Scoring approaches:

  • LLM-as-judge: Ask the model to rate thoughts 1-5 on relevance and quality. Consistent but expensive.
  • Heuristic scoring: Length, keyword presence, format compliance. Fast but shallow.
  • Reference-based: Compare to a known-good answer. Gold standard but requires ground truth.
  • Ensemble score: Average multiple LLM ratings for more reliable scoring.

Graph vs Tree: Cost Comparison

For a problem with branching factor 3 and depth 3:

MethodLLM CallsTokensBest For
CoT1100%Linear reasoning
ToT3 × 3 = 9 (worst case)~900%Branch-evaluate-prune
GoT3 + 1 = 4 (merge pattern)~400%Parallel + combine

GoT is cheaper than ToT for parallel decomposable problems because you skip the evaluation and pruning overhead on all-but-one branch.

Practical Limitations

  • Scoring is hard. The quality of aggregation depends entirely on the scorer's accuracy. Bad scores corrupt the whole graph.
  • Not production-ready. GoT is a research concept. Libraries like DSPy support it experimentally, but no production frameworks ship it.
  • Overhead for simple tasks. If CoT works, don't add graph complexity. Only use GoT when merging intermediate results demonstrably improves output.
  • Latency. Each node is an API call. A graph with 10 nodes is 10+ sequential or parallel calls. Budget accordingly.