Data Insights Agent Blueprint
AI agent that analyzes datasets: statistical summaries, correlation analysis, outlier detection, chart generation, and written insight reports. Runs on CSV and JSON files with pandas and matplotlib.
Data Insights Agent
An AI agent that acts as an automated data analyst. It loads datasets, computes descriptive statistics, finds correlations and outliers, generates charts, and writes a plain-English insights report. Designed to run after the File-Based ETL Agent — ETL cleans the data, this agent analyzes it.
Note:
This agent complements the File-Based ETL Agent. Run ETL first to clean and transform your data, then point this agent at the processed output for analysis. Works with any CSV or JSON dataset.
Agent File Structure
Setup
Install Dependencies
Install the OpenAI client, pandas, and matplotlib.
pip install openai pandas matplotlib
Create config.json
Configure the agent with dataset path and output directory.
{
"openai_api_key": "sk-...",
"model": "gpt-4o",
"max_iterations": 8,
"dataset_path": "./data/processed/customer_churn.csv",
"output_directory": "./insights",
"chart_format": "png"
}
Verify
Run the agent on a dataset.
python agent.py --task "Analyze this dataset and identify the top predictors of churn"
The agent should load data, compute statistics, generate charts, and write a report.
System Prompt
You are a data scientist performing exploratory data analysis. Your job is to
load a dataset, analyze it comprehensively, and produce actionable insights.
Follow this protocol:
1. THOUGHT: What type of data is this? What patterns should I look for?
2. ACTION: Load the dataset, compute descriptive statistics
3. Find correlations between numeric variables
4. Detect outliers using IQR or z-score methods
5. Generate charts for the most important relationships
6. Write a plain-English insights report with specific recommendations
7. FINAL_REPORT: Summary + key findings + charts + recommendations
Rules:
- Always profile the dataset first (types, nulls, ranges) before analysis
- Focus on the user's stated question — don't produce generic analysis
- Charts must be labeled, legible, and saved to the output directory
- Use IQR (1.5x rule) for outlier detection unless the data is normally distributed
- Report correlation strength with context — 0.7 may be strong in social science but weak in physics
- Write insights that someone without a statistics background can act on
### Tool Definitions
<ParameterGrid title="Agent Tools" items={[
{ param: "load_dataset", description: "Load a CSV or JSON dataset. Returns shape, column list, dtypes, and first 5 rows.", values: "path?: string (default: config dataset_path)" },
{ param: "describe_statistics", description: "Compute descriptive statistics: mean, median, std, min, max, quartiles, skew, kurtosis for numeric columns. Also value counts for categorical columns.", values: "columns?: string[] (default: all)" },
{ param: "find_correlations", description: "Compute pairwise Pearson correlation for numeric columns. Returns top correlations sorted by absolute value.", values: "threshold?: float (default 0.3), columns?: string[]" },
{ param: "detect_outliers", description: "Detect outliers using the IQR method (1.5x). Returns rows and values that fall outside the fences.", values: "column: string, method?: 'iqr' | 'zscore'" },
{ param: "generate_chart", description: "Generate a chart (scatter, bar, histogram, box plot, heatmap) and save to the output directory. Returns the file path.", values: "chart_type: string, x_column: string, y_column?: string, title?: string" },
{ param: "write_report", description: "Generate a plain-English insights report from analysis results. Uses LLM to synthesize findings.", values: "analysis_json: string" }
]} />
### Tool Implementation
```python
# tools.py
import os
import json
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg") # non-interactive backend
import matplotlib.pyplot as plt
DATASET_PATH = None
OUTPUT_DIR = None
CHART_FORMAT = "png"
_df = None # cached dataframe
def load_dataset(path=None):
global _df
path = path or DATASET_PATH
full = path if os.path.isabs(path) else os.path.join(os.getcwd(), path)
if not os.path.exists(full):
return f"ERROR: Dataset not found: {path}"
try:
if path.endswith(".csv"):
_df = pd.read_csv(full)
elif path.endswith(".json") or path.endswith(".jsonl"):
_df = pd.read_json(full, lines=path.endswith(".jsonl"))
else:
return f"ERROR: Unsupported file format: {path}"
return (
f"Loaded {path}\n"
f"Shape: {_df.shape[0]} rows × {_df.shape[1]} columns\n"
f"Columns: {list(_df.columns)}\n"
f"Dtypes:\n{_df.dtypes.to_string()}\n\n"
f"First 5 rows:\n{_df.head().to_string()}\n\n"
f"Null counts:\n{_df.isnull().sum().to_string()}"
)
except Exception as e:
return f"ERROR loading dataset: {e}"
def describe_statistics(columns=None):
if _df is None:
return "ERROR: No dataset loaded. Call load_dataset first."
cols = columns or list(_df.columns)
numeric = [c for c in cols if c in _df.columns and _df[c].dtype in ("int64", "float64")]
categorical = [c for c in cols if c in _df.columns and c not in numeric]
output = []
if numeric:
desc = _df[numeric].describe().T
# Add skew and kurtosis
desc["skew"] = _df[numeric].skew()
desc["kurtosis"] = _df[numeric].kurtosis()
output.append("=== NUMERIC COLUMNS ===")
output.append(desc.to_string())
if categorical:
output.append("\n=== CATEGORICAL COLUMNS ===")
for col in categorical[:10]:
vc = _df[col].value_counts().head(10)
output.append(f"\n{col} ({_df[col].nunique()} unique):")
output.append(vc.to_string())
return "\n".join(output)
def find_correlations(threshold=0.3, columns=None):
if _df is None:
return "ERROR: No dataset loaded."
numeric = [c for c in (_df.select_dtypes(include=[np.number]).columns)
if not columns or c in columns]
if len(numeric) < 2:
return "Need at least 2 numeric columns for correlation analysis."
corr_matrix = _df[numeric].corr()
pairs = []
for i in range(len(numeric)):
for j in range(i + 1, len(numeric)):
val = corr_matrix.iloc[i, j]
if abs(val) >= threshold:
pairs.append({
"col_a": numeric[i], "col_b": numeric[j],
"correlation": round(val, 3),
"strength": "strong" if abs(val) > 0.7 else "moderate" if abs(val) > 0.4 else "weak"
})
pairs.sort(key=lambda x: abs(x["correlation"]), reverse=True)
if not pairs:
return f"No correlations found above threshold {threshold}."
return json.dumps(pairs, indent=2)
def detect_outliers(column, method="iqr"):
if _df is None:
return "ERROR: No dataset loaded."
if column not in _df.columns:
return f"ERROR: Column '{column}' not found."
if _df[column].dtype not in ("int64", "float64"):
return f"ERROR: Column '{column}' is not numeric."
data = _df[column].dropna()
if method == "iqr":
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = _df[(data < lower) | (data > upper)]
method_name = f"IQR (Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f})"
fences = f"lower={lower:.2f}, upper={upper:.2f}"
elif method == "zscore":
from scipy import stats
z = np.abs(stats.zscore(data))
outliers = _df[z > 3]
method_name = "Z-score (threshold=3)"
fences = f"> 3 standard deviations from mean"
else:
return f"Unknown method: {method}. Use 'iqr' or 'zscore'."
return json.dumps({
"column": column,
"method": method_name,
"fences": fences,
"outlier_count": len(outliers),
"outlier_pct": round(len(outliers) / len(_df) * 100, 2),
"outlier_values": outliers[column].describe().to_dict()
}, indent=2, default=str)
def generate_chart(chart_type, x_column, y_column=None, title=None):
if _df is None:
return "ERROR: No dataset loaded."
if x_column not in _df.columns:
return f"ERROR: Column '{x_column}' not found."
os.makedirs(OUTPUT_DIR, exist_ok=True)
filename = f"{chart_type}_{x_column}"
if y_column:
filename += f"_vs_{y_column}"
filename += f".{CHART_FORMAT}"
filepath = os.path.join(OUTPUT_DIR, filename)
plt.figure(figsize=(10, 6))
data = _df[[x_column] + ([y_column] if y_column and y_column in _df.columns else [])].dropna()
if chart_type == "scatter" and y_column:
plt.scatter(data[x_column], data[y_column], alpha=0.5, s=10)
plt.xlabel(x_column)
plt.ylabel(y_column)
elif chart_type == "bar" and y_column:
data.groupby(x_column)[y_column].mean().sort_values().tail(20).plot(kind="barh")
plt.xlabel(y_column)
plt.ylabel(x_column)
elif chart_type == "histogram":
plt.hist(data[x_column], bins=30, edgecolor="black", alpha=0.7)
plt.xlabel(x_column)
plt.ylabel("Frequency")
elif chart_type == "box":
plt.boxplot(data[x_column].dropna(), vert=False)
plt.xlabel(x_column)
elif chart_type == "heatmap" and y_column:
pivot = pd.crosstab(data[x_column], data[y_column])
plt.imshow(pivot, aspect="auto", cmap="YlOrRd")
plt.colorbar(label="Count")
plt.xticks(range(len(pivot.columns)), pivot.columns, rotation=45, ha="right")
plt.yticks(range(len(pivot.index)), pivot.index)
else:
plt.close()
return f"Unsupported chart type or missing columns: {chart_type}"
if title:
plt.title(title)
plt.tight_layout()
plt.savefig(filepath, dpi=150, bbox_inches="tight")
plt.close()
return f"Chart saved: {filepath}"
def write_report(client, model, analysis_json):
prompt = f"""Write a data insights report based on this analysis. Use plain English
that someone without a statistics background can understand.
Analysis results:
{analysis_json}
Structure your report as:
1. Executive Summary (3-4 sentences — what's the dataset, key findings)
2. Key Findings (numbered list, each with the finding + what it means)
3. Top Correlations (what variables move together and why it matters)
4. Notable Outliers (how many, which variables, what to investigate)
5. Recommendations (3-5 specific, actionable next steps)
Keep it concise. Focus on actionable insights, not statistical jargon."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.2
)
return response.choices[0].message.content
Agent Initialization
# agent.py
import json
import argparse
from openai import OpenAI
import tools as agent_tools
TOOL_SCHEMAS = [
{
"type": "function",
"function": {
"name": "load_dataset",
"description": "Load a CSV or JSON dataset and return shape, columns, dtypes, and preview",
"parameters": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "describe_statistics",
"description": "Compute descriptive statistics for all or specified columns",
"parameters": {
"type": "object",
"properties": {
"columns": {"type": "array", "items": {"type": "string"}}
},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "find_correlations",
"description": "Find pairwise correlations between numeric columns above a threshold",
"parameters": {
"type": "object",
"properties": {
"threshold": {"type": "number", "default": 0.3},
"columns": {"type": "array", "items": {"type": "string"}}
},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "detect_outliers",
"description": "Detect outliers in a numeric column using IQR or z-score method",
"parameters": {
"type": "object",
"properties": {
"column": {"type": "string"},
"method": {"type": "string", "enum": ["iqr", "zscore"]}
},
"required": ["column"]
}
}
},
{
"type": "function",
"function": {
"name": "generate_chart",
"description": "Generate and save a chart (scatter, bar, histogram, box, heatmap)",
"parameters": {
"type": "object",
"properties": {
"chart_type": {"type": "string", "enum": ["scatter", "bar", "histogram", "box", "heatmap"]},
"x_column": {"type": "string"},
"y_column": {"type": "string"},
"title": {"type": "string"}
},
"required": ["chart_type", "x_column"]
}
}
},
{
"type": "function",
"function": {
"name": "write_report",
"description": "Generate a plain-English insights report from analysis results",
"parameters": {
"type": "object",
"properties": {"analysis_json": {"type": "string"}},
"required": ["analysis_json"]
}
}
}
]
SYSTEM_PROMPT = """You are a data scientist performing exploratory data analysis.
Your job is to load a dataset, analyze it comprehensively, and produce actionable
insights. Follow this protocol:
1. THOUGHT: What type of data is this? What patterns should I look for?
2. ACTION: Load the dataset, compute descriptive statistics
3. Find correlations between numeric variables
4. Detect outliers using IQR or z-score methods
5. Generate charts for the most important relationships
6. Write a plain-English insights report with specific recommendations
7. FINAL_REPORT: Summary + key findings + charts + recommendations
Rules:
- Always profile the dataset first (types, nulls, ranges) before analysis
- Focus on the user's stated question — don't produce generic analysis
- Charts must be labeled, legible, and saved to the output directory
- Use IQR (1.5x rule) for outlier detection
- Report correlation strength with context
- Write insights that someone without a statistics background can act on"""
def run_agent(task: str, config: dict):
client = OpenAI(api_key=config["openai_api_key"])
model = config.get("model", "gpt-4o")
agent_tools.DATASET_PATH = config.get("dataset_path", "./data.csv")
agent_tools.OUTPUT_DIR = config.get("output_directory", "./insights")
agent_tools.CHART_FORMAT = config.get("chart_format", "png")
os.makedirs(agent_tools.OUTPUT_DIR, exist_ok=True)
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Analyze this dataset. Task: {task}"}
]
for i in range(config.get("max_iterations", 8)):
response = client.chat.completions.create(
model=model,
messages=messages,
tools=TOOL_SCHEMAS,
temperature=0.1
)
msg = response.choices[0].message
messages.append(msg)
if msg.content and "FINAL_REPORT:" in msg.content:
return msg.content.split("FINAL_REPORT:", 1)[1].strip()
if not msg.tool_calls:
messages.append({
"role": "user",
"content": "Continue the analysis. Load data, compute stats, find correlations, detect outliers, generate charts, and provide FINAL_REPORT."
})
continue
for tool_call in msg.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
if func_name == "load_dataset":
result = agent_tools.load_dataset(func_args.get("path"))
elif func_name == "describe_statistics":
result = agent_tools.describe_statistics(func_args.get("columns"))
elif func_name == "find_correlations":
result = agent_tools.find_correlations(
func_args.get("threshold", 0.3),
func_args.get("columns"))
elif func_name == "detect_outliers":
result = agent_tools.detect_outliers(
func_args.get("column", ""),
func_args.get("method", "iqr"))
elif func_name == "generate_chart":
result = agent_tools.generate_chart(
func_args.get("chart_type", ""),
func_args.get("x_column", ""),
func_args.get("y_column"),
func_args.get("title"))
elif func_name == "write_report":
result = agent_tools.write_report(
client, model,
func_args.get("analysis_json", ""))
else:
result = f"Unknown tool: {func_name}"
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result
})
return "Agent reached max iterations."
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--task", required=True, help="Analysis task description")
parser.add_argument("--config", default="config.json")
args = parser.parse_args()
with open(args.config) as f:
config = json.load(f)
result = run_agent(args.task, config)
print(result)
print(f"\nCharts saved to: {config.get('output_directory', './insights')}")
Walkthrough
Analyzing a customer churn dataset to identify top predictors.
Agent loads and profiles the dataset
load_dataset() returns:
Loaded customer_churn.csv
Shape: 7,043 rows × 21 columns
Columns: customer_id, gender, senior_citizen, tenure, monthly_charges,
total_charges, contract, payment_method, churn, ...
Dtypes: 3 int64, 15 object, 3 float64
Null counts: total_charges: 11 nulls, all others: 0
Computes statistics and correlations
describe_statistics() shows tenure (mean 32 months, skewed right), monthly_charges ($65 avg, range $18-$118). 27% churn rate overall.
find_correlations(threshold=0.2) finds:
- tenure ↔ total_charges: 0.83 (expected — longer customers pay more)
- tenure ↔ churn: -0.35 (longer tenure = less likely to churn)
- monthly_charges ↔ churn: 0.19 (higher monthly = slightly more churn)
Detects outliers and generates charts
detect_outliers("monthly_charges") finds 0 outliers — distribution is bounded ($18-$118). detect_outliers("total_charges") finds 57 outliers at the high end — long-tenure customers.
generate_chart("scatter", "tenure", "monthly_charges", "Tenure vs Monthly Charges by Churn") creates a scatter plot color-coded by churn status.
Delivers FINAL_REPORT
FINAL_REPORT:
EXECUTIVE SUMMARY
Analyzed 7,043 customers with 27.4% churn rate. The strongest churn
predictors are contract type, tenure, and payment method — not price.
KEY FINDINGS
1. Month-to-month contracts churn at 43% vs 11% for 1-year and 3% for
2-year contracts. Contract type is the #1 predictor.
2. Customers who churn average 18 months tenure vs 38 months for retained
customers. Most churn happens in the first 6 months.
3. Electronic check payment method has 45% churn vs 15-18% for credit
card and bank transfer. This may proxy for less-committed customers.
TOP CORRELATIONS
- Tenure vs Total Charges: 0.83 (expected)
- Tenure vs Churn: -0.35 (longer customers stay)
- Monthly Charges vs Churn: 0.19 (weak — price alone doesn't predict churn)
NOTABLE OUTLIERS
57 customers with total_charges > $8,000 (top 0.8%). All are long-tenure
(60+ months) with high monthly charges. None churned.
RECOMMENDATIONS
1. Target month-to-month customers for annual contract conversion offers
2. Implement a 30-day onboarding program — churn rate drops sharply after
month 6
3. Investigate why electronic check users churn at 3x the rate of others
4. Don't use price as a churn signal — the correlation is weak
Customization
Analysis Settings
Values: path to file
Values: path to directory
Values: png, svg, pdf
Values: gpt-4o, gpt-4o-mini
Note:
Dataset size. The agent loads the full dataset into memory via pandas. For files >1GB or >1M rows, use the File ETL Agent first to filter/sample, then point this agent at the processed output.
Key Takeaway
The Data Insights Agent is most valuable when you have clean data and a specific question. Don't throw it at raw, messy CSVs — that's what the File ETL Agent is for. The two-agents-together workflow (ETL → Insights) mirrors how real data teams work: one person cleans, another analyzes. The agent generates both the statistical outputs AND the plain-English explanation, bridging the gap between analysis and communication.
Related Articles
CrewAI Setup Guide
Complete setup and configuration guide for CrewAI — the most popular open-source multi-agent orchestration framework. Role-based agents, Flows + Crews architecture, event-driven state management.
Incident Runbook Agent Blueprint
AI agent that reads your on-call runbook, analyzes incident details, classifies severity, matches remediation steps, generates timelines, and drafts postmortems. Self-contained — works with markdown runbooks and pasted error logs.
AI Agent Blueprints & Configurations
Ready-to-run AI agent blueprints, configurations, and local setup guides. Build research agents, code reviewers, and content writers with copy-paste implementations.