Incident Runbook Agent

An AI agent that assists on-call engineers during incidents. It reads your team's runbook, analyzes the incident details (pasted logs, alerts, error messages), classifies severity, matches the appropriate remediation steps, and drafts a postmortem after resolution. No infrastructure access required — it operates on text.

Note:

This agent is an assistant, not an auto-remediator. It reads your runbook and suggests actions. A human must execute the actual remediation steps. Think of it as an on-call buddy that has the runbook memorized, never panics, and writes the postmortem for you.

Agent File Structure

incident-runbook-agent/
  agent.py
  tools.py
  severity_matrix.json
  config.json

Setup

Install Dependencies

Install the OpenAI client.

pip install openai

Create config.json

Point the agent at your runbook directory and severity definitions.

{
  "openai_api_key": "sk-...",
  "model": "gpt-4o",
  "max_iterations": 6,
  "runbook_directory": "./runbooks",
  "severity_matrix_path": "severity_matrix.json"
}
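
Storing the API key in config.json is convenient for a demo, but a key committed to a repository is easy to leak. The sketch below is one possible loader, not part of the blueprint: the `load_config` helper and the `DEFAULTS` table are hypothetical, and it assumes the key may also be supplied through the `OPENAI_API_KEY` environment variable.

```python
import json
import os

# Defaults mirror the fallback values used in agent.py.
DEFAULTS = {
    "model": "gpt-4o",
    "max_iterations": 6,
    "runbook_directory": "./runbooks",
    "severity_matrix_path": "severity_matrix.json",
}

def load_config(path="config.json"):
    """Load config.json, fill in defaults, and prefer an API key from the environment."""
    with open(path) as f:
        config = {**DEFAULTS, **json.load(f)}

    # Assumption: an exported OPENAI_API_KEY overrides the key stored in the file.
    env_key = os.environ.get("OPENAI_API_KEY")
    if env_key:
        config["openai_api_key"] = env_key

    if not config.get("openai_api_key"):
        raise ValueError("No API key: set OPENAI_API_KEY or openai_api_key in config.json")
    return config
```

With a loader like this, config.json can omit `openai_api_key` entirely on machines where the environment variable is set.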

Verify

Test with a simulated incident.

python agent.py --incident "database-connection-pool-exhausted.txt" --runbook "database.md"

The agent should classify severity, match runbook steps, and produce a timeline.

System Prompt

You are an SRE incident responder. Your role is to assist on-call engineers
during incidents by applying runbook procedures. Follow this protocol:

1. THOUGHT: What type of incident is this? What systems are affected?
2. ACTION: Read the relevant runbook, analyze the incident details
3. Classify the incident: severity (SEV1-SEV4), impact scope, affected services
4. Match remediation steps from the runbook to the incident symptoms
5. Generate a timeline: detection → diagnosis → mitigation → resolution
6. After resolution, draft a blameless postmortem
7. FINAL_RESPONSE: severity classification + matched runbook steps +
   timeline + postmortem draft

Rules:
- Always classify severity before suggesting actions
- If symptoms don't match any runbook, suggest the closest match and flag it
- Timeline format: timestamp, action taken, outcome
- Postmortem must be blameless — focus on systems, process, and prevention
- If unsure about severity or the correct runbook, escalate to human — don't guess
- Never suggest running destructive commands (drop table, rm -rf, force push)
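
A prompt rule alone does not guarantee the model never repeats a destructive command that appears in a runbook. As a hedged sketch, the last rule could be backed by a check on the suggested steps before they are shown to the engineer. The `flag_destructive` helper and its pattern list are hypothetical and deliberately incomplete; a team would maintain its own list.

```python
import re

# Illustrative patterns only (assumption): drop table/database, rm -rf,
# forced git pushes, truncate table.
DESTRUCTIVE_PATTERNS = [
    r"\bdrop\s+(table|database)\b",
    r"\brm\s+-[a-z]*r[a-z]*f|\brm\s+-[a-z]*f[a-z]*r",
    r"\bgit\s+push\b.*(--force|\s-f\b)",
    r"\btruncate\s+table\b",
]

def flag_destructive(steps):
    """Return the steps that match any destructive pattern."""
    flagged = []
    for step in steps:
        if any(re.search(p, step, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS):
            flagged.append(step)
    return flagged

steps = [
    "Restart api-gateway",
    "Run rm -rf /var/lib/cache",
    "DROP TABLE sessions;",
    "git push --force origin main",
]
for step in flag_destructive(steps):
    print(f"NEEDS HUMAN REVIEW: {step}")
```

Flagged steps are better surfaced with a warning than silently dropped, since the engineer still needs to know the runbook contains them.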


Severity Matrix

{
  "severity_levels": [
    {
      "level": "SEV1",
      "label": "Critical",
      "definition": "Complete service outage affecting all users. Revenue loss, data loss, or security breach in progress.",
      "response_time": "Immediate (5 min)",
      "escalation": "VP Engineering, CTO",
      "examples": ["Production database down", "Auth service returning 500 for all users", "Data breach confirmed"]
    },
    {
      "level": "SEV2",
      "label": "Major",
      "definition": "Significant degradation affecting many users. Core feature unavailable but partial workarounds exist.",
      "response_time": "15 min",
      "escalation": "Engineering Manager, On-call lead",
      "examples": ["Checkout flow failing for 30% of users", "Search returning stale results", "API latency > 5s p99"]
    },
    {
      "level": "SEV3",
      "label": "Minor",
      "definition": "Partial degradation affecting a subset of users or a non-critical feature.",
      "response_time": "1 hour",
      "escalation": "On-call engineer (no manager escalation)",
      "examples": ["Admin dashboard slow to load", "Email notifications delayed 10 min", "Non-critical cron job failing"]
    },
    {
      "level": "SEV4",
      "label": "Cosmetic",
      "definition": "No user impact. Visual glitch, typo, or internal tool issue.",
      "response_time": "Next business day",
      "escalation": "None (track in backlog)",
      "examples": ["Logo misaligned on settings page", "Typo in notification email", "Internal wiki page broken"]
    }
  ]
}
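
Because the severity matrix is meant to be customized, a quick structural check before the agent runs can catch a missing field or an empty `examples` list, which would make keyword scoring for that level impossible. The following is a minimal sketch; `validate_matrix` and `REQUIRED_FIELDS` are hypothetical names, and the field set is taken from the matrix shown above.

```python
import json

REQUIRED_FIELDS = {"level", "label", "definition", "response_time", "escalation", "examples"}

def validate_matrix(matrix):
    """Return a list of problems found in a parsed severity matrix (empty if none)."""
    levels = matrix.get("severity_levels")
    if not isinstance(levels, list) or not levels:
        return ["'severity_levels' must be a non-empty list"]

    problems = []
    seen = set()
    for i, entry in enumerate(levels):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        level = entry.get("level")
        if level in seen:
            problems.append(f"entry {i}: duplicate level {level}")
        seen.add(level)
        if not entry.get("examples"):
            problems.append(f"entry {i}: 'examples' is empty, so keyword scoring cannot match")
    return problems

def validate_matrix_file(path="severity_matrix.json"):
    """Convenience wrapper that loads the JSON file first."""
    with open(path) as f:
        return validate_matrix(json.load(f))
```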

Tool Definitions

Agent Tools

read_runbook

Read a runbook file from the runbook directory. Returns the full markdown content.

Values: path: string (relative to runbook_directory)

list_runbooks

List all available runbooks in the runbook directory. Returns filenames and first heading (title).

Values: none

classify_severity

Classify incident severity against the severity matrix based on user impact, scope, and symptoms.

Values: incident_text: string

match_remediation

Extract remediation steps from the matching sections of a runbook. Returns the steps, a step count, and a warning that human verification is required.

Values: incident_text: string, runbook_text: string

generate_timeline

Generate an incident timeline from detection to resolution with timestamps, actions, and outcomes.

Values: events: object[]

draft_postmortem

Generate a blameless postmortem with: summary, impact, timeline, root cause, remediation, prevention.

Values: incident_text: string, timeline: string, resolution: string
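
Since each tool is identified by name and receives its arguments as a JSON string, one way to route calls is a lookup table rather than a chain of conditionals. This is a sketch of that alternative, not the routing used in agent.py: `TOOL_REGISTRY` and `dispatch` are hypothetical names, and the two tools are stubbed so the example runs on its own.

```python
import json

# Stub tools standing in for the real implementations in tools.py.
def list_runbooks():
    return "database.md"

def read_runbook(path):
    return f"(contents of {path})"

# Hypothetical registry: tool name -> callable taking keyword arguments.
TOOL_REGISTRY = {
    "list_runbooks": list_runbooks,
    "read_runbook": read_runbook,
}

def dispatch(name, raw_arguments):
    """Run a tool by name and always return a string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return f"Unknown tool: {name}"
    try:
        kwargs = json.loads(raw_arguments or "{}")
        return str(func(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or unexpected argument names are reported back as text.
        return f"ERROR calling {name}: {exc}"

print(dispatch("read_runbook", '{"path": "database.md"}'))
```

Returning errors as strings keeps the loop alive, so the model can see what went wrong and retry with corrected arguments.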

Tool Implementation

# tools.py
import os
import json
import re
from datetime import datetime

RUNBOOK_DIR = None

def list_runbooks():
    if not os.path.exists(RUNBOOK_DIR):
        return f"ERROR: Runbook directory not found: {RUNBOOK_DIR}"
    files = []
    for f in sorted(os.listdir(RUNBOOK_DIR)):
        if f.endswith(".md"):
            full = os.path.join(RUNBOOK_DIR, f)
            with open(full, "r") as fh:
                first_line = fh.readline().strip()
                title = first_line.lstrip("# ").strip() if first_line.startswith("#") else "(no title)"
            files.append(f"{f} — {title}")
    return "\n".join(files) if files else f"No .md runbooks found in {RUNBOOK_DIR}"

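def read_runbook(path):
    # Resolve the path and refuse anything outside the runbook directory.
    base = os.path.realpath(RUNBOOK_DIR)
    full = os.path.realpath(os.path.join(base, path))
    if os.path.commonpath([base, full]) != base:
        return f"ERROR: Path escapes runbook directory: {path}"
    if not os.path.isfile(full):
        return f"ERROR: Runbook not found: {path}"
    with open(full, "r") as f:
        content = f.read()
    # Return at most the first 8000 characters.
    return content[:8000]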

def classify_severity(incident_text, severity_matrix_path="severity_matrix.json"):
    if not os.path.exists(severity_matrix_path):
        return "WARNING: Severity matrix not found. Cannot classify."
    with open(severity_matrix_path) as f:
        matrix = json.load(f)["severity_levels"]

    incident_lower = incident_text.lower()
    # Score each severity level by keyword matches from its examples
    scores = {}
    for level in matrix:
        score = 0
        for example in level.get("examples", []):
            example_words = set(example.lower().split())
            for word in example_words:
                if len(word) > 3 and word in incident_lower:
                    score += 1
        scores[level["level"]] = score

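    best = max(scores, key=scores.get)
    if scores[best] > 0:
        reason = f"Matched {scores[best]} keywords from {best} examples"
    else:
        # With all scores at zero, max() returns the first level (SEV1),
        # so fall back explicitly to the lowest severity.
        best = "SEV4"
        reason = "No keyword matches; defaulting to SEV4"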

    return json.dumps({
        "severity": best,
        "reason": reason,
        "scores": scores
    }, indent=2)

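def match_remediation(incident_text, runbook_text):
    # Note: incident_text is part of the tool interface but is not used for
    # filtering here; every list item in a matching section is returned.
    steps = []
    # Collect numbered or bulleted items from any section whose text mentions
    # remediation, steps, resolution, or mitigation.
    sections = re.split(r'(?=^#{1,3}\s)', runbook_text, flags=re.MULTILINE)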
    for section in sections:
        if re.search(r'(remediation|steps|resolution|mitigation)', section, re.IGNORECASE):
            lines = section.strip().split("\n")
            for line in lines:
                match = re.match(r'(?:\d+\.|\-|\*)\s+(.+)', line.strip())
                if match:
                    steps.append(match.group(1))

    if not steps:
        steps = ["No structured remediation steps found in runbook. Review manually."]

    return json.dumps({
        "matched_steps": steps,
        "step_count": len(steps),
        "warning": "Steps matched by heading proximity. Human verification required."
    }, indent=2)

def generate_timeline(events):
    timeline = []
    for event in events:
        timestamp = event.get("time", datetime.now().isoformat())
        action = event.get("action", "Unknown action")
        outcome = event.get("outcome", "Unknown outcome")
        timeline.append(f"[{timestamp}] {action} → {outcome}")
    return json.dumps({"timeline": timeline, "event_count": len(timeline)}, indent=2)

def draft_postmortem(client, model, incident_text, timeline, resolution):
    prompt = f"""Write a blameless postmortem for this incident. Use the following structure:

# Incident Postmortem

## Summary
(2-3 sentences — what happened, impact, duration)

## Impact
- Users affected:
- Services affected:
- Duration:
- Revenue impact (if known):

## Timeline
{timeline}

## Root Cause
(What system or process failure caused this? Focus on systems, not individuals.)

## Remediation
{resolution}

## Prevention
(Specific, actionable changes to prevent recurrence: monitoring, alerts, runbook updates, code changes, process changes.)

Incident details:
{incident_text}

Keep the tone blameless and constructive. Focus on what we can change about our systems and processes."""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content

Agent Initialization

# agent.py
import json
import os
import argparse
from openai import OpenAI
import tools as agent_tools

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "list_runbooks",
            "description": "List all available runbooks in the runbook directory",
            "parameters": {"type": "object", "properties": {}, "required": []}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_runbook",
            "description": "Read a specific runbook's full markdown content",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "classify_severity",
            "description": "Classify incident severity (SEV1-SEV4) based on impact and symptoms",
            "parameters": {
                "type": "object",
                "properties": {"incident_text": {"type": "string"}},
                "required": ["incident_text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "match_remediation",
            "description": "Match incident symptoms to runbook remediation steps",
            "parameters": {
                "type": "object",
                "properties": {
                    "incident_text": {"type": "string"},
                    "runbook_text": {"type": "string"}
                },
                "required": ["incident_text", "runbook_text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_timeline",
            "description": "Generate an incident timeline from events",
            "parameters": {
                "type": "object",
                "properties": {
                    "events": {"type": "array", "items": {
                        "type": "object",
                        "properties": {
                            "time": {"type": "string"},
                            "action": {"type": "string"},
                            "outcome": {"type": "string"}
                        }
                    }}
                },
                "required": ["events"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "draft_postmortem",
            "description": "Generate a blameless postmortem document",
            "parameters": {
                "type": "object",
                "properties": {
                    "incident_text": {"type": "string"},
                    "timeline": {"type": "string"},
                    "resolution": {"type": "string"}
                },
                "required": ["incident_text", "timeline", "resolution"]
            }
        }
    }
]

SYSTEM_PROMPT = """You are an SRE incident responder. Your role is to assist on-call
engineers during incidents by applying runbook procedures. Follow this protocol:

1. THOUGHT: What type of incident is this? What systems are affected?
2. ACTION: Read the relevant runbook, analyze the incident details
3. Classify the incident: severity (SEV1-SEV4), impact scope, affected services
4. Match remediation steps from the runbook to the incident symptoms
5. Generate a timeline: detection → diagnosis → mitigation → resolution
6. After resolution, draft a blameless postmortem
7. FINAL_RESPONSE: severity classification + matched runbook steps +
   timeline + postmortem draft

Rules:
- Always classify severity before suggesting actions
- If symptoms don't match any runbook, suggest the closest match and flag it
- Timeline format: timestamp, action taken, outcome
- Postmortem must be blameless — focus on systems, process, and prevention
- If unsure about severity or runbook match, escalate to human — don't guess
- Never suggest running destructive commands (drop table, rm -rf, force push)"""


def run_agent(incident_text: str, config: dict, runbook_path: str = None):
    client = OpenAI(api_key=config["openai_api_key"])
    model = config.get("model", "gpt-4o")

    agent_tools.RUNBOOK_DIR = config.get("runbook_directory", "./runbooks")
    severity_path = config.get("severity_matrix_path", "severity_matrix.json")

    query = f"Analyze this incident:\n\n{incident_text}"
    if runbook_path:
        query += f"\n\nUse the runbook at: {runbook_path}"

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query}
    ]

    for i in range(config.get("max_iterations", 6)):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOL_SCHEMAS,
            temperature=0.2
        )

        msg = response.choices[0].message
        messages.append(msg)

        if msg.content and "FINAL_RESPONSE:" in msg.content:
            return msg.content.split("FINAL_RESPONSE:", 1)[1].strip()

        if not msg.tool_calls:
            messages.append({
                "role": "user",
                "content": "Continue the incident analysis. Classify severity, match runbook steps, generate timeline, and provide FINAL_RESPONSE."
            })
            continue

        for tool_call in msg.tool_calls:
            func_name = tool_call.function.name
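            # Tool arguments arrive as a JSON string and may be malformed.
            try:
                func_args = json.loads(tool_call.function.arguments or "{}")
            except json.JSONDecodeError:
                func_args = {}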

            if func_name == "list_runbooks":
                result = agent_tools.list_runbooks()
            elif func_name == "read_runbook":
                result = agent_tools.read_runbook(func_args.get("path", ""))
            elif func_name == "classify_severity":
                result = agent_tools.classify_severity(
                    func_args.get("incident_text", ""), severity_path)
            elif func_name == "match_remediation":
                result = agent_tools.match_remediation(
                    func_args.get("incident_text", ""),
                    func_args.get("runbook_text", ""))
            elif func_name == "generate_timeline":
                result = agent_tools.generate_timeline(
                    func_args.get("events", []))
            elif func_name == "draft_postmortem":
                result = agent_tools.draft_postmortem(
                    client, model,
                    func_args.get("incident_text", ""),
                    func_args.get("timeline", ""),
                    func_args.get("resolution", ""))
            else:
                result = f"Unknown tool: {func_name}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return "Agent reached max iterations."


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--incident", required=True, help="Path to incident details file or quoted text")
    parser.add_argument("--runbook", help="Runbook filename, relative to runbook_directory (optional)")
    parser.add_argument("--config", default="config.json")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    # Read incident from file or treat as raw text
    incident_text = args.incident
    if os.path.exists(args.incident):
        with open(args.incident) as f:
            incident_text = f.read()

    result = run_agent(incident_text, config, args.runbook)
    print(result)

Walkthrough

Handling a database connection pool exhaustion incident.

On-call engineer pastes alert details

The agent receives the PagerDuty alert text:

Alert: Database connection pool exhausted
Service: api-gateway
Time: 2026-06-12 14:32 UTC
Error: psycopg2.OperationalError: FATAL: remaining connection slots
       are reserved for non-replication superuser connections
Impact: All API requests returning 503

Agent lists and reads the runbook

list_runbooks() returns:

database.md — Database Connection Issues
api-gateway.md — API Gateway Failures
redis.md — Redis Cache Incidents
deployment.md — Deployment Rollback

read_runbook("database.md") loads the full runbook, which includes sections for connection pool exhaustion, replication lag, and disk-full scenarios.

Classifies severity

classify_severity evaluates the incident text against the severity matrix:

Complete service outage ✓
All API requests failing ✓
Immediate revenue impact ✓

→ SEV1 (Critical). Triggers the VP Engineering/CTO escalation recommendation.
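
The keyword scoring behind this classification can be traced by hand. The sketch below reuses the substring rule from `classify_severity` (words longer than three characters, matched anywhere in the lowercased incident text) but returns the matched words instead of a count. `matched_keywords` is a hypothetical helper; the examples and alert text are copied from this page.

```python
EXAMPLES = {
    "SEV1": ["Production database down", "Auth service returning 500 for all users",
             "Data breach confirmed"],
    "SEV2": ["Checkout flow failing for 30% of users", "Search returning stale results",
             "API latency > 5s p99"],
    "SEV3": ["Admin dashboard slow to load", "Email notifications delayed 10 min",
             "Non-critical cron job failing"],
    "SEV4": ["Logo misaligned on settings page", "Typo in notification email",
             "Internal wiki page broken"],
}

ALERT = """Alert: Database connection pool exhausted
Service: api-gateway
Time: 2026-06-12 14:32 UTC
Error: psycopg2.OperationalError: FATAL: remaining connection slots
       are reserved for non-replication superuser connections
Impact: All API requests returning 503"""

def matched_keywords(incident_text, examples_by_level):
    """List the example words (longer than 3 chars) found as substrings of the incident."""
    text = incident_text.lower()
    hits = {}
    for level, examples in examples_by_level.items():
        hits[level] = [
            word
            for example in examples
            for word in sorted(set(example.lower().split()))
            if len(word) > 3 and word in text
        ]
    return hits

for level, words in matched_keywords(ALERT, EXAMPLES).items():
    print(level, len(words), words)
```

On this alert SEV1 collects four hits and SEV2 one. One of the SEV1 hits is `data` matching inside `database`, which shows how permissive substring matching is; comparing whole words may be worth considering for noisier alerts.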

Matches remediation steps

match_remediation finds the "Connection Pool Exhaustion" section and returns:

1. Check active connections: SELECT count(*) FROM pg_stat_activity
2. Identify idle-in-transaction connections blocking slots
3. Terminate idle connections older than 5 minutes: SELECT pg_terminate_backend(pid)
4. If pool is full from a single service, restart that service
5. Temporarily increase max_connections as a stopgap
6. Root cause: check for connection leak in recent deployment
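
To make the extraction concrete, here is a small sketch that applies the same splitting and list-item patterns as `match_remediation` to a made-up runbook. The `heading_only` flag is an assumption added for illustration: when set, only the heading line is checked for the keywords, which is stricter than the tool shown earlier (it searches the whole section).

```python
import re

# Made-up runbook text for illustration.
SAMPLE_RUNBOOK = """# Database Connection Issues

## Connection Pool Exhaustion

### Remediation
1. Check active connections: SELECT count(*) FROM pg_stat_activity
2. Identify idle-in-transaction connections blocking slots
- Restart the offending service if one client holds most slots

## Background
- The steps above assume a single primary.
"""

KEYWORDS = re.compile(r"(remediation|steps|resolution|mitigation)", re.IGNORECASE)
LIST_ITEM = re.compile(r"(?:\d+\.|\-|\*)\s+(.+)")

def extract_steps(runbook_text, heading_only=False):
    """Collect list items from sections that mention a remediation keyword."""
    steps = []
    # Split before every level 1-3 heading so each section starts with its heading.
    for section in re.split(r"(?=^#{1,3}\s)", runbook_text, flags=re.MULTILINE):
        lines = section.strip().split("\n")
        searched = lines[0] if heading_only else section
        if not KEYWORDS.search(searched):
            continue
        for line in lines:
            match = LIST_ITEM.match(line.strip())
            if match:
                steps.append(match.group(1))
    return steps

print(len(extract_steps(SAMPLE_RUNBOOK)))                     # whole-section search
print(len(extract_steps(SAMPLE_RUNBOOK, heading_only=True)))  # heading-only search
```

The whole-section search also picks up the bullet under Background because that bullet contains the word "steps", so it returns one more item than the heading-only variant. This is one reason the output carries a human-verification warning.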

Generates timeline and postmortem

After the engineer resolves the incident (connection leak from a deployment 2 hours prior), the agent generates:

FINAL_RESPONSE:

SEVERITY: SEV1 — Critical
AFFECTED: api-gateway (all users)
DURATION: 14 min (14:32 – 14:46 UTC)

TIMELINE:
[14:32] PagerDuty alert: connection pool exhausted → On-call acknowledges
[14:33] Identified 142 idle-in-transaction connections from api-gateway
[14:35] Terminated idle connections → Partial recovery (50% of slots freed)
[14:38] Restarted api-gateway → Connection count normalized
[14:46] All endpoints returning 200 → Incident resolved

POSTMORTEM DRAFT:
[See full postmortem document above]

PREVENTION:
- Add connection pool monitoring alert at 70% utilization (before exhaustion)
- Add idle-in-transaction timeout (idle_in_transaction_session_timeout = 30s)
- Add connection leak check to pre-deployment checklist
- Update database runbook with pg_stat_activity query examples
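
The DURATION line in a response like this can be derived from the timeline rather than typed by hand. A minimal sketch, assuming every entry starts with an `[HH:MM]` stamp and the incident does not cross midnight; `duration_minutes` is a hypothetical helper.

```python
import re
from datetime import datetime

TIMELINE = [
    "[14:32] PagerDuty alert: connection pool exhausted → On-call acknowledges",
    "[14:33] Identified 142 idle-in-transaction connections from api-gateway",
    "[14:35] Terminated idle connections → Partial recovery (50% of slots freed)",
    "[14:38] Restarted api-gateway → Connection count normalized",
    "[14:46] All endpoints returning 200 → Incident resolved",
]

def duration_minutes(timeline):
    """Minutes between the earliest and latest [HH:MM] stamps (same day assumed)."""
    stamps = []
    for entry in timeline:
        match = re.match(r"\[(\d{2}:\d{2})\]", entry)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M"))
    if len(stamps) < 2:
        return 0
    return int((max(stamps) - min(stamps)).total_seconds() // 60)

print(duration_minutes(TIMELINE))  # 14
```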

Customization

Runbook Configuration

runbook_directory

Directory containing markdown runbooks. Each file should have clear remediation sections (## Remediation, ## Steps, or ## Resolution).

Values: path to directory

severity_matrix_path

JSON file defining your organization's severity levels (SEV1-SEV4). Customize labels, response times, and escalation paths.

Values: path to .json file

max_iterations

Maximum number of agent loop iterations (one model call each) before the run stops. Increase for complex incidents that span multiple runbooks or require deep analysis.

Values: 1-10 (default 6)

Note:

Runbook format matters. The agent matches remediation steps by heading proximity. Runbooks should have clear ## Remediation or ## Steps sections with numbered or bulleted lists. Free-form prose runbooks will produce lower-quality matches.
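
These format expectations can be checked ahead of an incident. The sketch below is a hypothetical `lint_runbook` helper, not part of the agent; it assumes a runbook is acceptable when its first line is a heading, it has a Remediation, Steps, Resolution, or Mitigation heading, and it contains at least one numbered or bulleted item.

```python
import re

SECTION_HEADING = re.compile(
    r"^#{1,3}\s+.*\b(remediation|steps|resolution|mitigation)\b",
    re.IGNORECASE | re.MULTILINE,
)
LIST_ITEM = re.compile(r"^\s*(?:\d+\.|-|\*)\s+\S", re.MULTILINE)

def lint_runbook(text):
    """Return a list of format problems for one runbook (empty if none)."""
    problems = []
    first_line = text.split("\n", 1)[0].strip()
    if not first_line.startswith("#"):
        problems.append("first line is not a heading, so list_runbooks shows '(no title)'")
    if not SECTION_HEADING.search(text):
        problems.append("no Remediation/Steps/Resolution/Mitigation heading")
    if not LIST_ITEM.search(text):
        problems.append("no numbered or bulleted list items")
    return problems

print(lint_runbook("# Redis Cache Incidents\n\n## Steps\n1. Check memory usage\n"))
print(lint_runbook("Restart the cache and see whether it recovers.\n"))
```

Running a check like this over every file in the runbook directory, for example in CI, would surface free-form runbooks before the agent has to work with them.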

Key Takeaway

An incident runbook agent is at its best during the first 5 minutes of an incident — when the on-call engineer is waking up, context-switching, and trying to remember which runbook applies. The agent reads the runbook so the human doesn't have to. Post-incident, it drafts the postmortem while details are fresh. The human's job is execution and judgment; the agent's job is recall and documentation.
