Data Insights Agent

An AI agent that acts as an automated data analyst. It loads datasets, computes descriptive statistics, finds correlations and outliers, generates charts, and writes a plain-English insights report. Designed to run after the File-Based ETL Agent — ETL cleans the data, this agent analyzes it.

Note:

This agent complements the File-Based ETL Agent. Run ETL first to clean and transform your data, then point this agent at the processed output for analysis. Works with any CSV or JSON dataset.

Agent File Structure

Flat vector schema of the Data Insights Agent setup displaying files and data library connections

data-insights-agentadd

agent.pyadd

tools.pyadd

config.jsonadd

Setup

Install Dependencies

Install the OpenAI client, pandas, and matplotlib.

pip install openai pandas matplotlib

Create config.json

Configure the agent with dataset path and output directory.

{
  "openai_api_key": "sk-...",
  "model": "gpt-4o",
  "max_iterations": 8,
  "dataset_path": "./data/processed/customer_churn.csv",
  "output_directory": "./insights",
  "chart_format": "png"
}

Verify

Run the agent on a dataset.

python agent.py --task "Analyze this dataset and identify the top predictors of churn"

The agent should load data, compute statistics, generate charts, and write a report.

System Prompt

You are a data scientist performing exploratory data analysis. Your job is to
load a dataset, analyze it comprehensively, and produce actionable insights.
Follow this protocol:

1. THOUGHT: What type of data is this? What patterns should I look for?
2. ACTION: Load the dataset, compute descriptive statistics
3. Find correlations between numeric variables
4. Detect outliers using IQR or z-score methods
5. Generate charts for the most important relationships
6. Write a plain-English insights report with specific recommendations
7. FINAL_REPORT: Summary + key findings + charts + recommendations

Rules:
- Always profile the dataset first (types, nulls, ranges) before analysis
- Focus on the user's stated question — don't produce generic analysis
- Charts must be labeled, legible, and saved to the output directory
- Use IQR (1.5x rule) for outlier detection unless the data is normally distributed
- Report correlation strength with context — 0.7 may be strong in social science but weak in physics
- Write insights that someone without a statistics background can act on


### Tool Definitions

<ParameterGrid title="Agent Tools" items={[
  { param: "load_dataset", description: "Load a CSV or JSON dataset. Returns shape, column list, dtypes, and first 5 rows.", values: "path?: string (default: config dataset_path)" },
  { param: "describe_statistics", description: "Compute descriptive statistics: mean, median, std, min, max, quartiles, skew, kurtosis for numeric columns. Also value counts for categorical columns.", values: "columns?: string[] (default: all)" },
  { param: "find_correlations", description: "Compute pairwise Pearson correlation for numeric columns. Returns top correlations sorted by absolute value.", values: "threshold?: float (default 0.3), columns?: string[]" },
  { param: "detect_outliers", description: "Detect outliers using the IQR method (1.5x). Returns rows and values that fall outside the fences.", values: "column: string, method?: 'iqr' | 'zscore'" },
  { param: "generate_chart", description: "Generate a chart (scatter, bar, histogram, box plot, heatmap) and save to the output directory. Returns the file path.", values: "chart_type: string, x_column: string, y_column?: string, title?: string" },
  { param: "write_report", description: "Generate a plain-English insights report from analysis results. Uses LLM to synthesize findings.", values: "analysis_json: string" }
]} />

### Tool Implementation

```python
# tools.py
import os
import json
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

DATASET_PATH = None
OUTPUT_DIR = None
CHART_FORMAT = "png"
_df = None  # cached dataframe

def load_dataset(path=None):
    global _df
    path = path or DATASET_PATH
    full = path if os.path.isabs(path) else os.path.join(os.getcwd(), path)
    if not os.path.exists(full):
        return f"ERROR: Dataset not found: {path}"

    try:
        if path.endswith(".csv"):
            _df = pd.read_csv(full)
        elif path.endswith(".json") or path.endswith(".jsonl"):
            _df = pd.read_json(full, lines=path.endswith(".jsonl"))
        else:
            return f"ERROR: Unsupported file format: {path}"

        return (
            f"Loaded {path}\n"
            f"Shape: {_df.shape[0]} rows × {_df.shape[1]} columns\n"
            f"Columns: {list(_df.columns)}\n"
            f"Dtypes:\n{_df.dtypes.to_string()}\n\n"
            f"First 5 rows:\n{_df.head().to_string()}\n\n"
            f"Null counts:\n{_df.isnull().sum().to_string()}"
        )
    except Exception as e:
        return f"ERROR loading dataset: {e}"


def describe_statistics(columns=None):
    if _df is None:
        return "ERROR: No dataset loaded. Call load_dataset first."

    cols = columns or list(_df.columns)
    numeric = [c for c in cols if c in _df.columns and _df[c].dtype in ("int64", "float64")]
    categorical = [c for c in cols if c in _df.columns and c not in numeric]

    output = []

    if numeric:
        desc = _df[numeric].describe().T
        # Add skew and kurtosis
        desc["skew"] = _df[numeric].skew()
        desc["kurtosis"] = _df[numeric].kurtosis()
        output.append("=== NUMERIC COLUMNS ===")
        output.append(desc.to_string())

    if categorical:
        output.append("\n=== CATEGORICAL COLUMNS ===")
        for col in categorical[:10]:
            vc = _df[col].value_counts().head(10)
            output.append(f"\n{col} ({_df[col].nunique()} unique):")
            output.append(vc.to_string())

    return "\n".join(output)


def find_correlations(threshold=0.3, columns=None):
    if _df is None:
        return "ERROR: No dataset loaded."

    numeric = [c for c in (_df.select_dtypes(include=[np.number]).columns)
               if not columns or c in columns]

    if len(numeric) < 2:
        return "Need at least 2 numeric columns for correlation analysis."

    corr_matrix = _df[numeric].corr()
    pairs = []
    for i in range(len(numeric)):
        for j in range(i + 1, len(numeric)):
            val = corr_matrix.iloc[i, j]
            if abs(val) >= threshold:
                pairs.append({
                    "col_a": numeric[i], "col_b": numeric[j],
                    "correlation": round(val, 3),
                    "strength": "strong" if abs(val) > 0.7 else "moderate" if abs(val) > 0.4 else "weak"
                })

    pairs.sort(key=lambda x: abs(x["correlation"]), reverse=True)
    if not pairs:
        return f"No correlations found above threshold {threshold}."
    return json.dumps(pairs, indent=2)


def detect_outliers(column, method="iqr"):
    if _df is None:
        return "ERROR: No dataset loaded."
    if column not in _df.columns:
        return f"ERROR: Column '{column}' not found."
    if _df[column].dtype not in ("int64", "float64"):
        return f"ERROR: Column '{column}' is not numeric."

    data = _df[column].dropna()
    if method == "iqr":
        q1 = data.quantile(0.25)
        q3 = data.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        outliers = _df[(data < lower) | (data > upper)]
        method_name = f"IQR (Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f})"
        fences = f"lower={lower:.2f}, upper={upper:.2f}"
    elif method == "zscore":
        from scipy import stats
        z = np.abs(stats.zscore(data))
        outliers = _df[z > 3]
        method_name = "Z-score (threshold=3)"
        fences = f"> 3 standard deviations from mean"
    else:
        return f"Unknown method: {method}. Use 'iqr' or 'zscore'."

    return json.dumps({
        "column": column,
        "method": method_name,
        "fences": fences,
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(_df) * 100, 2),
        "outlier_values": outliers[column].describe().to_dict()
    }, indent=2, default=str)


def generate_chart(chart_type, x_column, y_column=None, title=None):
    if _df is None:
        return "ERROR: No dataset loaded."
    if x_column not in _df.columns:
        return f"ERROR: Column '{x_column}' not found."

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    filename = f"{chart_type}_{x_column}"
    if y_column:
        filename += f"_vs_{y_column}"
    filename += f".{CHART_FORMAT}"
    filepath = os.path.join(OUTPUT_DIR, filename)

    plt.figure(figsize=(10, 6))
    data = _df[[x_column] + ([y_column] if y_column and y_column in _df.columns else [])].dropna()

    if chart_type == "scatter" and y_column:
        plt.scatter(data[x_column], data[y_column], alpha=0.5, s=10)
        plt.xlabel(x_column)
        plt.ylabel(y_column)
    elif chart_type == "bar" and y_column:
        data.groupby(x_column)[y_column].mean().sort_values().tail(20).plot(kind="barh")
        plt.xlabel(y_column)
        plt.ylabel(x_column)
    elif chart_type == "histogram":
        plt.hist(data[x_column], bins=30, edgecolor="black", alpha=0.7)
        plt.xlabel(x_column)
        plt.ylabel("Frequency")
    elif chart_type == "box":
        plt.boxplot(data[x_column].dropna(), vert=False)
        plt.xlabel(x_column)
    elif chart_type == "heatmap" and y_column:
        pivot = pd.crosstab(data[x_column], data[y_column])
        plt.imshow(pivot, aspect="auto", cmap="YlOrRd")
        plt.colorbar(label="Count")
        plt.xticks(range(len(pivot.columns)), pivot.columns, rotation=45, ha="right")
        plt.yticks(range(len(pivot.index)), pivot.index)
    else:
        plt.close()
        return f"Unsupported chart type or missing columns: {chart_type}"

    if title:
        plt.title(title)
    plt.tight_layout()
    plt.savefig(filepath, dpi=150, bbox_inches="tight")
    plt.close()
    return f"Chart saved: {filepath}"


def write_report(client, model, analysis_json):
    prompt = f"""Write a data insights report based on this analysis. Use plain English
that someone without a statistics background can understand.

Analysis results:
{analysis_json}

Structure your report as:
1. Executive Summary (3-4 sentences — what's the dataset, key findings)
2. Key Findings (numbered list, each with the finding + what it means)
3. Top Correlations (what variables move together and why it matters)
4. Notable Outliers (how many, which variables, what to investigate)
5. Recommendations (3-5 specific, actionable next steps)

Keep it concise. Focus on actionable insights, not statistical jargon."""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content

Agent Initialization

# agent.py
import json
import argparse
from openai import OpenAI
import tools as agent_tools

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "load_dataset",
            "description": "Load a CSV or JSON dataset and return shape, columns, dtypes, and preview",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": []
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "describe_statistics",
            "description": "Compute descriptive statistics for all or specified columns",
            "parameters": {
                "type": "object",
                "properties": {
                    "columns": {"type": "array", "items": {"type": "string"}}
                },
                "required": []
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "find_correlations",
            "description": "Find pairwise correlations between numeric columns above a threshold",
            "parameters": {
                "type": "object",
                "properties": {
                    "threshold": {"type": "number", "default": 0.3},
                    "columns": {"type": "array", "items": {"type": "string"}}
                },
                "required": []
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "detect_outliers",
            "description": "Detect outliers in a numeric column using IQR or z-score method",
            "parameters": {
                "type": "object",
                "properties": {
                    "column": {"type": "string"},
                    "method": {"type": "string", "enum": ["iqr", "zscore"]}
                },
                "required": ["column"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_chart",
            "description": "Generate and save a chart (scatter, bar, histogram, box, heatmap)",
            "parameters": {
                "type": "object",
                "properties": {
                    "chart_type": {"type": "string", "enum": ["scatter", "bar", "histogram", "box", "heatmap"]},
                    "x_column": {"type": "string"},
                    "y_column": {"type": "string"},
                    "title": {"type": "string"}
                },
                "required": ["chart_type", "x_column"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_report",
            "description": "Generate a plain-English insights report from analysis results",
            "parameters": {
                "type": "object",
                "properties": {"analysis_json": {"type": "string"}},
                "required": ["analysis_json"]
            }
        }
    }
]

SYSTEM_PROMPT = """You are a data scientist performing exploratory data analysis.
Your job is to load a dataset, analyze it comprehensively, and produce actionable
insights. Follow this protocol:

1. THOUGHT: What type of data is this? What patterns should I look for?
2. ACTION: Load the dataset, compute descriptive statistics
3. Find correlations between numeric variables
4. Detect outliers using IQR or z-score methods
5. Generate charts for the most important relationships
6. Write a plain-English insights report with specific recommendations
7. FINAL_REPORT: Summary + key findings + charts + recommendations

Rules:
- Always profile the dataset first (types, nulls, ranges) before analysis
- Focus on the user's stated question — don't produce generic analysis
- Charts must be labeled, legible, and saved to the output directory
- Use IQR (1.5x rule) for outlier detection
- Report correlation strength with context
- Write insights that someone without a statistics background can act on"""


def run_agent(task: str, config: dict):
    client = OpenAI(api_key=config["openai_api_key"])
    model = config.get("model", "gpt-4o")

    agent_tools.DATASET_PATH = config.get("dataset_path", "./data.csv")
    agent_tools.OUTPUT_DIR = config.get("output_directory", "./insights")
    agent_tools.CHART_FORMAT = config.get("chart_format", "png")

    os.makedirs(agent_tools.OUTPUT_DIR, exist_ok=True)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Analyze this dataset. Task: {task}"}
    ]

    for i in range(config.get("max_iterations", 8)):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOL_SCHEMAS,
            temperature=0.1
        )

        msg = response.choices[0].message
        messages.append(msg)

        if msg.content and "FINAL_REPORT:" in msg.content:
            return msg.content.split("FINAL_REPORT:", 1)[1].strip()

        if not msg.tool_calls:
            messages.append({
                "role": "user",
                "content": "Continue the analysis. Load data, compute stats, find correlations, detect outliers, generate charts, and provide FINAL_REPORT."
            })
            continue

        for tool_call in msg.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)

            if func_name == "load_dataset":
                result = agent_tools.load_dataset(func_args.get("path"))
            elif func_name == "describe_statistics":
                result = agent_tools.describe_statistics(func_args.get("columns"))
            elif func_name == "find_correlations":
                result = agent_tools.find_correlations(
                    func_args.get("threshold", 0.3),
                    func_args.get("columns"))
            elif func_name == "detect_outliers":
                result = agent_tools.detect_outliers(
                    func_args.get("column", ""),
                    func_args.get("method", "iqr"))
            elif func_name == "generate_chart":
                result = agent_tools.generate_chart(
                    func_args.get("chart_type", ""),
                    func_args.get("x_column", ""),
                    func_args.get("y_column"),
                    func_args.get("title"))
            elif func_name == "write_report":
                result = agent_tools.write_report(
                    client, model,
                    func_args.get("analysis_json", ""))
            else:
                result = f"Unknown tool: {func_name}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return "Agent reached max iterations."


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", required=True, help="Analysis task description")
    parser.add_argument("--config", default="config.json")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    result = run_agent(args.task, config)
    print(result)
    print(f"\nCharts saved to: {config.get('output_directory', './insights')}")

Walkthrough

Cyberpunk style dashboard showing data insights analysis metrics and correlation scatter plot

Analyzing a customer churn dataset to identify top predictors.

Agent loads and profiles the dataset

load_dataset() returns:

Loaded customer_churn.csv
Shape: 7,043 rows × 21 columns
Columns: customer_id, gender, senior_citizen, tenure, monthly_charges,
         total_charges, contract, payment_method, churn, ...
Dtypes: 3 int64, 15 object, 3 float64
Null counts: total_charges: 11 nulls, all others: 0

Computes statistics and correlations

describe_statistics() shows tenure (mean 32 months, skewed right), monthly_charges ($65 avg, range $18-$118). 27% churn rate overall.

find_correlations(threshold=0.2) finds:

tenure ↔ total_charges: 0.83 (expected — longer customers pay more)
tenure ↔ churn: -0.35 (longer tenure = less likely to churn)
monthly_charges ↔ churn: 0.19 (higher monthly = slightly more churn)

Detects outliers and generates charts

detect_outliers("monthly_charges") finds 0 outliers — distribution is bounded ($18-$118). detect_outliers("total_charges") finds 57 outliers at the high end — long-tenure customers.

generate_chart("scatter", "tenure", "monthly_charges", "Tenure vs Monthly Charges by Churn") creates a scatter plot color-coded by churn status.

Delivers FINAL_REPORT

FINAL_REPORT:

EXECUTIVE SUMMARY
Analyzed 7,043 customers with 27.4% churn rate. The strongest churn
predictors are contract type, tenure, and payment method — not price.

KEY FINDINGS
1. Month-to-month contracts churn at 43% vs 11% for 1-year and 3% for
   2-year contracts. Contract type is the #1 predictor.
2. Customers who churn average 18 months tenure vs 38 months for retained
   customers. Most churn happens in the first 6 months.
3. Electronic check payment method has 45% churn vs 15-18% for credit
   card and bank transfer. This may proxy for less-committed customers.

TOP CORRELATIONS
- Tenure vs Total Charges: 0.83 (expected)
- Tenure vs Churn: -0.35 (longer customers stay)
- Monthly Charges vs Churn: 0.19 (weak — price alone doesn't predict churn)

NOTABLE OUTLIERS
57 customers with total_charges > $8,000 (top 0.8%). All are long-tenure
(60+ months) with high monthly charges. None churned.

RECOMMENDATIONS
1. Target month-to-month customers for annual contract conversion offers
2. Implement a 30-day onboarding program — churn rate drops sharply after
   month 6
3. Investigate why electronic check users churn at 3x the rate of others
4. Don't use price as a churn signal — the correlation is weak

Customization

Analysis Settings

dataset_path

Path to the CSV or JSON dataset to analyze. Supports .csv, .json, and .jsonl formats.

Values: path to file

output_directory

Where charts and the insights report are saved. Created automatically.

Values: path to directory

chart_format

Output format for generated charts.

Values: png, svg, pdf

model

gpt-4o recommended for nuanced insight generation. gpt-4o-mini works for basic statistical summaries.

Values: gpt-4o, gpt-4o-mini

Note:

Dataset size. The agent loads the full dataset into memory via pandas. For files >1GB or >1M rows, use the File ETL Agent first to filter/sample, then point this agent at the processed output.

Key Takeaway

The Data Insights Agent is most valuable when you have clean data and a specific question. Don't throw it at raw, messy CSVs — that's what the File ETL Agent is for. The two-agents-together workflow (ETL → Insights) mirrors how real data teams work: one person cleans, another analyzes. The agent generates both the statistical outputs AND the plain-English explanation, bridging the gap between analysis and communication.

Data Insights Agent Blueprint