Incident Runbook Agent Blueprint
AI agent that reads your on-call runbook, analyzes incident details, classifies severity, matches remediation steps, generates timelines, and drafts postmortems. Self-contained — works with markdown runbooks and pasted error logs.
Incident Runbook Agent
An AI agent that assists on-call engineers during incidents. It reads your team's runbook, analyzes the incident details (pasted logs, alerts, error messages), classifies severity, matches the appropriate remediation steps, and drafts a postmortem after resolution. No infrastructure access required — it operates on text.
Note:
This agent is an assistant, not an auto-remediator. It reads your runbook and suggests actions. A human must execute the actual remediation steps. Think of it as an on-call buddy that has the runbook memorized, never panics, and writes the postmortem for you.
Agent File Structure
agent.py (agent loop, tool schemas, CLI entry point)
tools.py (tool implementations)
config.json (API key, model, paths)
severity_matrix.json (SEV1-SEV4 definitions)
runbooks/ (markdown runbooks, for example database.md)
Setup
Install Dependencies
Install the OpenAI client.
pip install openai
Create config.json
Point the agent at your runbook directory and severity definitions.
{
  "openai_api_key": "sk-...",
  "model": "gpt-4o",
  "max_iterations": 6,
  "runbook_directory": "./runbooks",
  "severity_matrix_path": "severity_matrix.json"
}
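A minimal sketch of loading this file with defaults and basic validation. The `load_config` helper and the `OPENAI_API_KEY` environment fallback are assumptions for illustration, not part of the blueprint's agent.py; the fallback is one way to keep the key out of a file that might get committed.

```python
import json
import os

# Defaults mirror the sample config.json above; load_config itself is a
# hypothetical helper, not something agent.py defines.
DEFAULTS = {
    "model": "gpt-4o",
    "max_iterations": 6,
    "runbook_directory": "./runbooks",
    "severity_matrix_path": "severity_matrix.json",
}


def load_config(path="config.json", env=None):
    """Read config.json, fill in defaults, and validate the result."""
    env = os.environ if env is None else env
    with open(path) as f:
        config = {**DEFAULTS, **json.load(f)}
    # Assumption: fall back to the OPENAI_API_KEY environment variable
    if not config.get("openai_api_key"):
        config["openai_api_key"] = env.get("OPENAI_API_KEY", "")
    if not config["openai_api_key"]:
        raise ValueError("openai_api_key missing from config and environment")
    if not 1 <= int(config["max_iterations"]) <= 10:
        raise ValueError("max_iterations must be between 1 and 10")
    return config
```

The 1-10 bound on max_iterations follows the range given in the Customization section.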
Verify
Test with a simulated incident.
python agent.py --incident "database-connection-pool-exhausted.txt" --runbook "database.md"
The agent should classify severity, match runbook steps, and produce a timeline.
System Prompt
You are an SRE incident responder. Your role is to assist on-call engineers
during incidents by applying runbook procedures. Follow this protocol:
1. THOUGHT: What type of incident is this? What systems are affected?
2. ACTION: Read the relevant runbook, analyze the incident details
3. Classify the incident: severity (SEV1-SEV4), impact scope, affected services
4. Match remediation steps from the runbook to the incident symptoms
5. Generate a timeline: detection → diagnosis → mitigation → resolution
6. After resolution, draft a blameless postmortem
7. FINAL_RESPONSE: severity classification + matched runbook steps +
timeline + postmortem draft
Rules:
- Always classify severity before suggesting actions
- If symptoms don't match any runbook, suggest the closest match and flag it
- Timeline format: timestamp, action taken, outcome
- Postmortem must be blameless — focus on systems, process, and prevention
- If unsure about severity or the correct runbook, escalate to human — don't guess
- Never suggest running destructive commands (drop table, rm -rf, force push)
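The last rule relies on the model complying. A code-level check on the agent's output is one way to back it up. The sketch below is an assumption, not part of the blueprint: `find_destructive_commands` and its pattern list are hypothetical and cover only the three examples named in the rule.

```python
import re

# Hypothetical deny-list covering the examples named in the rule above.
DESTRUCTIVE_PATTERNS = {
    "drop table": re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    "rm -rf": re.compile(r"\brm\s+-(?:rf|fr)\b", re.IGNORECASE),
    "force push": re.compile(r"\bgit\s+push\b.*(?:--force\b|\s-f\b)", re.IGNORECASE),
}


def find_destructive_commands(text):
    """Return the sorted labels of deny-listed commands found in text."""
    return sorted(
        label
        for label, pattern in DESTRUCTIVE_PATTERNS.items()
        if pattern.search(text)
    )
```

A caller could run this over the final response and withhold or flag any output where the returned list is non-empty.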
### Severity Matrix
```json
{
  "severity_levels": [
    {
      "level": "SEV1",
      "label": "Critical",
      "definition": "Complete service outage affecting all users. Revenue loss, data loss, or security breach in progress.",
      "response_time": "Immediate (5 min)",
      "escalation": "VP Engineering, CTO",
      "examples": ["Production database down", "Auth service returning 500 for all users", "Data breach confirmed"]
    },
    {
      "level": "SEV2",
      "label": "Major",
      "definition": "Significant degradation affecting many users. Core feature unavailable but partial workarounds exist.",
      "response_time": "15 min",
      "escalation": "Engineering Manager, On-call lead",
      "examples": ["Checkout flow failing for 30% of users", "Search returning stale results", "API latency > 5s p99"]
    },
    {
      "level": "SEV3",
      "label": "Minor",
      "definition": "Partial degradation affecting a subset of users or a non-critical feature.",
      "response_time": "1 hour",
      "escalation": "On-call engineer (no manager escalation)",
      "examples": ["Admin dashboard slow to load", "Email notifications delayed 10 min", "Non-critical cron job failing"]
    },
    {
      "level": "SEV4",
      "label": "Cosmetic",
      "definition": "No user impact. Visual glitch, typo, or internal tool issue.",
      "response_time": "Next business day",
      "escalation": "None (track in backlog)",
      "examples": ["Logo misaligned on settings page", "Typo in notification email", "Internal wiki page broken"]
    }
  ]
}
```
Tool Definitions
Agent Tools
read_runbook
Values: path: string (relative to runbook_directory)
list_runbooks
Values: none
classify_severity
Values: incident_text: string
match_remediation
Values: incident_text: string, runbook_text: string
generate_timeline
Values: events: object[]
draft_postmortem
Values: incident_text: string, timeline: string, resolution: string
Tool Implementation
# tools.py
import os
import json
import re
from datetime import datetime

# Set by agent.py from config["runbook_directory"] before any tool runs
RUNBOOK_DIR = None


def list_runbooks():
    """Return one line per .md runbook, giving its filename and title."""
    if not os.path.exists(RUNBOOK_DIR):
        return f"ERROR: Runbook directory not found: {RUNBOOK_DIR}"
    files = []
    for f in sorted(os.listdir(RUNBOOK_DIR)):
        if f.endswith(".md"):
            full = os.path.join(RUNBOOK_DIR, f)
            with open(full, "r") as fh:
                first_line = fh.readline().strip()
            # Use the first markdown heading as the title when there is one
            title = first_line.lstrip("# ").strip() if first_line.startswith("#") else "(no title)"
            files.append(f"{f} — {title}")
    return "\n".join(files) if files else f"No .md runbooks found in {RUNBOOK_DIR}"


def read_runbook(path):
    """Return a runbook's markdown, truncated to 8000 characters."""
    full = os.path.join(RUNBOOK_DIR, path)
    if not os.path.exists(full):
        return f"ERROR: Runbook not found: {path}"
    with open(full, "r") as f:
        content = f.read()
    # Slicing is a no-op for shorter files, so no length check is needed
    return content[:8000]


def classify_severity(incident_text, severity_matrix_path="severity_matrix.json"):
    """Score each severity level by keyword overlap with its example incidents."""
    if not os.path.exists(severity_matrix_path):
        return "WARNING: Severity matrix not found. Cannot classify."
    with open(severity_matrix_path) as f:
        matrix = json.load(f)["severity_levels"]
    incident_lower = incident_text.lower()
    # Score each severity level by keyword matches from its examples
    scores = {}
    for level in matrix:
        score = 0
        for example in level.get("examples", []):
            example_words = set(example.lower().split())
            for word in example_words:
                # Substring match; words of 3 characters or fewer are ignored
                if len(word) > 3 and word in incident_lower:
                    score += 1
        scores[level["level"]] = score
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        reason = f"Matched {scores[best]} keywords from {best} examples"
    else:
        # max() returns the first level (SEV1) on an all-zero tie, so set the
        # fallback explicitly to agree with the stated reason
        best = "SEV4"
        reason = "No keyword matches — defaulting to SEV4"
    return json.dumps({
        "severity": best,
        "reason": reason,
        "scores": scores
    }, indent=2)


def match_remediation(incident_text, runbook_text):
    """Collect list items from runbook sections that mention remediation.

    incident_text is accepted for the tool schema but is not used to filter steps.
    """
    steps = []
    # Extract all remediation steps (numbered or bulleted lists after "Remediation" or "Steps")
    sections = re.split(r'(?=^#{1,3}\s)', runbook_text, flags=re.MULTILINE)
    for section in sections:
        if re.search(r'(remediation|steps|resolution|mitigation)', section, re.IGNORECASE):
            lines = section.strip().split("\n")
            for line in lines:
                match = re.match(r'(?:\d+\.|\-|\*)\s+(.+)', line.strip())
                if match:
                    steps.append(match.group(1))
    if not steps:
        steps = ["No structured remediation steps found in runbook. Review manually."]
    return json.dumps({
        "matched_steps": steps,
        "step_count": len(steps),
        "warning": "Steps matched by heading proximity. Human verification required."
    }, indent=2)


def generate_timeline(events):
    """Format events as '[timestamp] action → outcome' lines."""
    timeline = []
    for event in events:
        # Events without a time are stamped with the current local time
        timestamp = event.get("time", datetime.now().isoformat())
        action = event.get("action", "Unknown action")
        outcome = event.get("outcome", "Unknown outcome")
        timeline.append(f"[{timestamp}] {action} → {outcome}")
    return json.dumps({"timeline": timeline, "event_count": len(timeline)}, indent=2)


def draft_postmortem(client, model, incident_text, timeline, resolution):
    prompt = f"""Write a blameless postmortem for this incident. Use the following structure:
# Incident Postmortem
## Summary
(2-3 sentences — what happened, impact, duration)
## Impact
- Users affected:
- Services affected:
- Duration:
- Revenue impact (if known):
## Timeline
{timeline}
## Root Cause
(What system or process failure caused this? Focus on systems, not individuals.)
## Remediation
{resolution}
## Prevention
(Specific, actionable changes to prevent recurrence: monitoring, alerts, runbook updates, code changes, process changes.)
Incident details:
{incident_text}
Keep the tone blameless and constructive. Focus on what we can change about our systems and processes."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content
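read_runbook joins a model-supplied path onto RUNBOOK_DIR, so a path such as ../config.json would resolve outside the runbook directory. A minimal sketch of a containment check follows; `resolve_runbook_path` is a hypothetical helper, not part of tools.py, and it assumes a POSIX-style filesystem.

```python
import os


def resolve_runbook_path(runbook_dir, path):
    """Return the absolute path of a runbook, or None if it escapes runbook_dir."""
    base = os.path.realpath(runbook_dir)
    full = os.path.realpath(os.path.join(base, path))
    # commonpath equals base only when full is base itself or lies inside it
    if os.path.commonpath([base, full]) != base:
        return None
    return full
```

read_runbook could call this first and return its existing "Runbook not found" error when the result is None.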
Agent Initialization
# agent.py
import json
import os
import argparse
from openai import OpenAI
import tools as agent_tools

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "list_runbooks",
            "description": "List all available runbooks in the runbook directory",
            "parameters": {"type": "object", "properties": {}, "required": []}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_runbook",
            "description": "Read a specific runbook's full markdown content",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "classify_severity",
            "description": "Classify incident severity (SEV1-SEV4) based on impact and symptoms",
            "parameters": {
                "type": "object",
                "properties": {"incident_text": {"type": "string"}},
                "required": ["incident_text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "match_remediation",
            "description": "Match incident symptoms to runbook remediation steps",
            "parameters": {
                "type": "object",
                "properties": {
                    "incident_text": {"type": "string"},
                    "runbook_text": {"type": "string"}
                },
                "required": ["incident_text", "runbook_text"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_timeline",
            "description": "Generate an incident timeline from events",
            "parameters": {
                "type": "object",
                "properties": {
                    "events": {"type": "array", "items": {
                        "type": "object",
                        "properties": {
                            "time": {"type": "string"},
                            "action": {"type": "string"},
                            "outcome": {"type": "string"}
                        }
                    }}
                },
                "required": ["events"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "draft_postmortem",
            "description": "Generate a blameless postmortem document",
            "parameters": {
                "type": "object",
                "properties": {
                    "incident_text": {"type": "string"},
                    "timeline": {"type": "string"},
                    "resolution": {"type": "string"}
                },
                "required": ["incident_text", "timeline", "resolution"]
            }
        }
    }
]

SYSTEM_PROMPT = """You are an SRE incident responder. Your role is to assist on-call
engineers during incidents by applying runbook procedures. Follow this protocol:
1. THOUGHT: What type of incident is this? What systems are affected?
2. ACTION: Read the relevant runbook, analyze the incident details
3. Classify the incident: severity (SEV1-SEV4), impact scope, affected services
4. Match remediation steps from the runbook to the incident symptoms
5. Generate a timeline: detection → diagnosis → mitigation → resolution
6. After resolution, draft a blameless postmortem
7. FINAL_RESPONSE: severity classification + matched runbook steps +
timeline + postmortem draft
Rules:
- Always classify severity before suggesting actions
- If symptoms don't match any runbook, suggest the closest match and flag it
- Timeline format: timestamp, action taken, outcome
- Postmortem must be blameless — focus on systems, process, and prevention
- If unsure about severity or runbook match, escalate to human — don't guess
- Never suggest running destructive commands (drop table, rm -rf, force push)"""


def run_agent(incident_text: str, config: dict, runbook_path: str = None):
    client = OpenAI(api_key=config["openai_api_key"])
    model = config.get("model", "gpt-4o")
    # tools.py resolves runbook paths relative to this module-level directory
    agent_tools.RUNBOOK_DIR = config.get("runbook_directory", "./runbooks")
    severity_path = config.get("severity_matrix_path", "severity_matrix.json")

    query = f"Analyze this incident:\n\n{incident_text}"
    if runbook_path:
        query += f"\n\nUse the runbook at: {runbook_path}"
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query}
    ]

    for _ in range(config.get("max_iterations", 6)):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOL_SCHEMAS,
            temperature=0.2
        )
        msg = response.choices[0].message
        messages.append(msg)

        # The protocol's final step is marked with FINAL_RESPONSE:
        if msg.content and "FINAL_RESPONSE:" in msg.content:
            return msg.content.split("FINAL_RESPONSE:", 1)[1].strip()

        # No tool call and no final answer: nudge the model to continue
        if not msg.tool_calls:
            messages.append({
                "role": "user",
                "content": "Continue the incident analysis. Classify severity, match runbook steps, generate timeline, and provide FINAL_RESPONSE."
            })
            continue

        # Run each requested tool and return its result to the model
        for tool_call in msg.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            if func_name == "list_runbooks":
                result = agent_tools.list_runbooks()
            elif func_name == "read_runbook":
                result = agent_tools.read_runbook(func_args.get("path", ""))
            elif func_name == "classify_severity":
                result = agent_tools.classify_severity(
                    func_args.get("incident_text", ""), severity_path)
            elif func_name == "match_remediation":
                result = agent_tools.match_remediation(
                    func_args.get("incident_text", ""),
                    func_args.get("runbook_text", ""))
            elif func_name == "generate_timeline":
                result = agent_tools.generate_timeline(
                    func_args.get("events", []))
            elif func_name == "draft_postmortem":
                result = agent_tools.draft_postmortem(
                    client, model,
                    func_args.get("incident_text", ""),
                    func_args.get("timeline", ""),
                    func_args.get("resolution", ""))
            else:
                result = f"Unknown tool: {func_name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return "Agent reached max iterations."


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--incident", required=True, help="Path to incident details file or quoted text")
    parser.add_argument("--runbook", help="Specific runbook file to use (optional)")
    parser.add_argument("--config", default="config.json")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    # Read incident from file or treat as raw text
    incident_text = args.incident
    if os.path.exists(args.incident):
        with open(args.incident) as f:
            incident_text = f.read()

    result = run_agent(incident_text, config, args.runbook)
    print(result)
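The if/elif chain in run_agent can also be written as a dispatch table, which keeps tool registration in one place and gives a single spot to handle malformed arguments from the model. A minimal sketch under assumptions: `dispatch_tool_call` and the stand-in handlers are hypothetical; in the blueprint the handlers would wrap the functions in tools.py.

```python
import json


def dispatch_tool_call(handlers, name, raw_arguments):
    """Look up a tool by name and call it with JSON-decoded keyword arguments."""
    handler = handlers.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    try:
        # The model sends arguments as a JSON string, which may be malformed
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return f"ERROR: invalid arguments for {name}: {exc.msg}"
    if not isinstance(args, dict):
        return f"ERROR: arguments for {name} must be a JSON object"
    try:
        return handler(**args)
    except TypeError as exc:
        # Typically a missing required argument or an unexpected extra one
        return f"ERROR: bad arguments for {name}: {exc}"


# Stand-in handlers for illustration only
HANDLERS = {
    "list_runbooks": lambda: "database.md",
    "read_runbook": lambda path: f"(contents of {path})",
}
```

Returning the error as the tool result, instead of raising, lets the model see what went wrong and retry within the remaining iterations.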
Walkthrough
Handling a database connection pool exhaustion incident.
On-call engineer pastes alert details
The agent receives the PagerDuty alert text:
Alert: Database connection pool exhausted
Service: api-gateway
Time: 2026-06-12 14:32 UTC
Error: psycopg2.OperationalError: FATAL: remaining connection slots
are reserved for non-replication superuser connections
Impact: All API requests returning 503
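Alert text like the one above is loosely structured as "Key: value" lines. A small sketch, assuming that layout, of turning it into a dict before further processing; `parse_alert` is a hypothetical helper, and joining wrapped lines onto the previous value is a guess at how continuation lines should be handled.

```python
def parse_alert(text):
    """Parse 'Key: value' lines into a dict; other lines continue the previous value."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        key, sep, value = stripped.partition(": ")
        # Treat as a new field only when the part before ': ' is a single word
        if sep and key.isalpha():
            last_key = key.lower()
            fields[last_key] = value.strip()
        elif last_key is not None:
            fields[last_key] += " " + stripped
    return fields


ALERT = """Alert: Database connection pool exhausted
Service: api-gateway
Time: 2026-06-12 14:32 UTC
Error: psycopg2.OperationalError: FATAL: remaining connection slots
are reserved for non-replication superuser connections
Impact: All API requests returning 503"""

alert = parse_alert(ALERT)
```

The agent itself works on the raw text; structured fields like these are mainly useful for pre-filling the timeline's first event.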
Agent lists and reads the runbook
list_runbooks() returns:
- database.md — Database Connection Issues
- api-gateway.md — API Gateway Failures
- redis.md — Redis Cache Incidents
- deployment.md — Deployment Rollback
read_runbook("database.md") loads the full runbook, which includes sections for connection pool exhaustion, replication lag, and disk-full scenarios.
Classifies severity
classify_severity evaluates the incident text against the severity matrix:
- Complete service outage ✓
- All API requests failing ✓
- Immediate revenue impact ✓
Result: SEV1 (Critical), which triggers the VP Engineering/CTO escalation recommendation.
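Under the hood, the classify_severity tool scores keyword overlap between the incident text and each level's examples. The sketch below restates that loop in memory on a shortened version of the alert, using the SEV1 and SEV2 examples from the severity matrix; `score_levels` is a hypothetical name for illustration, not a function in tools.py.

```python
def score_levels(incident_text, levels):
    """Count example words (longer than 3 characters) that occur in the incident text."""
    incident_lower = incident_text.lower()
    scores = {}
    for name, examples in levels.items():
        score = 0
        for example in examples:
            for word in set(example.lower().split()):
                # Substring test, as in the tool: "data" also matches "database"
                if len(word) > 3 and word in incident_lower:
                    score += 1
        scores[name] = score
    return scores


LEVELS = {
    "SEV1": ["Production database down",
             "Auth service returning 500 for all users",
             "Data breach confirmed"],
    "SEV2": ["Checkout flow failing for 30% of users",
             "Search returning stale results",
             "API latency > 5s p99"],
}
INCIDENT = ("Alert: Database connection pool exhausted. Service: api-gateway. "
            "Impact: All API requests returning 503")

scores = score_levels(INCIDENT, LEVELS)
```

Here SEV1 scores 4 (database, service, returning, data) and SEV2 scores 1 (returning). One of the SEV1 hits is "data" matching inside "database", a side effect of substring matching that is worth keeping in mind when reading the scores.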
Matches remediation steps
match_remediation finds the "Connection Pool Exhaustion" section and returns:
- Check active connections: SELECT count(*) FROM pg_stat_activity
- Identify idle-in-transaction connections blocking slots
- Terminate idle connections older than 5 minutes: SELECT pg_terminate_backend(pid)
- If pool is full from a single service, restart that service
- Temporarily increase max_connections as a stopgap
- Root cause: check for connection leak in recent deployment
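For a runbook laid out with headed sections, the extraction behind match_remediation can be illustrated as follows. The runbook excerpt is invented for the example, and `extract_steps` is a hypothetical restatement of the tool's regex logic without the JSON wrapper.

```python
import re

# Invented runbook excerpt, for illustration only
SAMPLE_RUNBOOK = """# Database Connection Issues

## Symptoms
- 503 responses from api-gateway

## Remediation
1. Check active connections
2. Terminate idle connections older than 5 minutes
- Restart the affected service
"""


def extract_steps(runbook_text):
    """Collect list items from sections whose text mentions remediation keywords."""
    steps = []
    # Split before each level 1-3 heading so every section starts with its heading
    sections = re.split(r"(?=^#{1,3}\s)", runbook_text, flags=re.MULTILINE)
    for section in sections:
        if re.search(r"remediation|steps|resolution|mitigation", section, re.IGNORECASE):
            for line in section.strip().split("\n"):
                match = re.match(r"(?:\d+\.|-|\*)\s+(.+)", line.strip())
                if match:
                    steps.append(match.group(1))
    return steps


steps = extract_steps(SAMPLE_RUNBOOK)
```

Every list item in a matching section is returned regardless of the incident text, which is why the tool's output carries a human-verification warning.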
Generates timeline and postmortem
After the engineer resolves the incident (connection leak from a deployment 2 hours prior), the agent generates:
FINAL_RESPONSE:
SEVERITY: SEV1 — Critical
AFFECTED: api-gateway (all users)
DURATION: 14 min (14:32 – 14:46 UTC)
TIMELINE:
[14:32] PagerDuty alert: connection pool exhausted → On-call acknowledges
[14:33] Identified 142 idle-in-transaction connections from api-gateway
[14:35] Terminated idle connections → Partial recovery (50% of slots freed)
[14:38] Restarted api-gateway → Connection count normalized
[14:46] All endpoints returning 200 → Incident resolved
POSTMORTEM DRAFT:
[See full postmortem document above]
PREVENTION:
- Add connection pool monitoring alert at 70% utilization (before exhaustion)
- Add idle-in-transaction timeout (idle_in_transaction_session_timeout = 30s)
- Add connection leak check to pre-deployment checklist
- Update database runbook with pg_stat_activity query examples
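The DURATION line is timestamp arithmetic over the first and last timeline entries. A sketch, assuming HH:MM timestamps within a single UTC day as in the timeline above; `incident_duration_minutes` is a hypothetical helper, and the empty outcome on the second event reflects the source timeline, which lists none.

```python
from datetime import datetime


def incident_duration_minutes(timeline):
    """Minutes between the earliest and latest (HH:MM, action, outcome) events."""
    times = [datetime.strptime(t, "%H:%M") for t, _action, _outcome in timeline]
    return int((max(times) - min(times)).total_seconds() // 60)


TIMELINE = [
    ("14:32", "PagerDuty alert: connection pool exhausted", "On-call acknowledges"),
    ("14:33", "Identified 142 idle-in-transaction connections from api-gateway", ""),
    ("14:35", "Terminated idle connections", "Partial recovery (50% of slots freed)"),
    ("14:38", "Restarted api-gateway", "Connection count normalized"),
    ("14:46", "All endpoints returning 200", "Incident resolved"),
]

duration = incident_duration_minutes(TIMELINE)
```

For this timeline the result is 14 minutes, matching the DURATION line. Incidents that cross midnight would need full dates rather than HH:MM values.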
Customization
Runbook Configuration
runbook_directory
Values: path to directory
severity_matrix_path
Values: path to .json file
max_iterations
Values: 1-10 (default 6)
Note:
Runbook format matters. The agent matches remediation steps by heading proximity. Runbooks should have clear ## Remediation or ## Steps sections with numbered or bulleted lists. Free-form prose runbooks will produce lower-quality matches.
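A quick pre-flight check can flag runbooks that lack such a section before an incident does. A minimal sketch, with `lint_runbook` as a hypothetical helper; it inspects heading lines only and reuses the keywords that match_remediation searches for.

```python
import re

HEADING = re.compile(r"^#{1,3}[ \t]+(.*)$", re.MULTILINE)
STEP = re.compile(r"^[ \t]*(?:\d+\.|-|\*)[ \t]+\S", re.MULTILINE)
KEYWORDS = re.compile(r"remediation|steps|resolution|mitigation", re.IGNORECASE)


def lint_runbook(text):
    """Return a list of problems that would weaken remediation matching."""
    problems = []
    headings = HEADING.findall(text)
    if not any(KEYWORDS.search(h) for h in headings):
        problems.append("no Remediation/Steps/Resolution/Mitigation heading")
    if not STEP.search(text):
        problems.append("no numbered or bulleted steps")
    return problems
```

Running this over every file in the runbook directory, for example in CI, gives an early signal about which runbooks need restructuring.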
Key Takeaway
An incident runbook agent is at its best during the first 5 minutes of an incident — when the on-call engineer is waking up, context-switching, and trying to remember which runbook applies. The agent reads the runbook so the human doesn't have to. Post-incident, it drafts the postmortem while details are fresh. The human's job is execution and judgment; the agent's job is recall and documentation.