Claude Data Extraction: Unstructured to Structured
Master Claude for data extraction and structuring. Prompts for turning unstructured documents into JSON, CSV, and tables — leveraging Claude's strong compliance with output formatting instructions.
Claude's compliance with output formatting instructions is dramatically better than other models. If you tell Claude to return JSON with a specific schema, you get exactly that JSON — no markdown fences, no explanatory text, no creative reinterpretations. This makes Claude the ideal model for data extraction pipelines: unstructured documents in, structured data out, reliably.
The Extraction Prompt Pattern
Every extraction prompt needs:
- Source description — What kind of document Claude is reading
- Extraction schema — Exact structure of the desired output
- Field definitions — What each field means and how to populate it
- Edge case handling — What to do when information is missing, ambiguous, or conflicting
Basic Extraction Prompt
Extract structured data from the following [document type: invoice/resume/contract/etc.].
OUTPUT FORMAT: Valid JSON object with this exact structure:
{
"field_name": "value or null",
"field_name": "value or null"
}
FIELD DEFINITIONS:
- field_name: [What it represents. Where to find it in the document.
If multiple values: which one to pick. If missing: use null.]
RULES:
- Return ONLY the JSON object. No markdown fences, no explanatory text.
- If a field is mentioned multiple times with different values, use [rule].
- If a field is ambiguous, use the most [specific/recent/authoritative] value.
- If a field is completely absent, use null (not "N/A" or empty string).
- For date fields, use ISO 8601 format (YYYY-MM-DD).
- For currency fields, use numbers without currency symbols.
Invoice Extraction Example
Extract structured data from this invoice:
{
"invoice_number": "string — the invoice ID/number. Usually labeled 'Invoice #' or 'Inv No.'",
"invoice_date": "string — date the invoice was issued, ISO 8601 format",
"due_date": "string — payment due date, ISO 8601. If not specified, use null",
"vendor": {
"name": "string — company or individual name sending the invoice",
"tax_id": "string or null — VAT/GST/EIN number if present",
"email": "string or null",
"address": "string or null — full address as it appears"
},
"client": {
"name": "string",
"address": "string or null"
},
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"total": "number"
}
],
"subtotal": "number — before tax",
"tax_rate": "number or null — as percentage (e.g., 20 not 0.20)",
"tax_amount": "number or null",
"total": "number — final amount due",
"currency": "string — ISO 4217 code (USD, EUR, GBP). Default: USD if not specified"
}
RULES:
- If the invoice has a 'paid' stamp or 'PAID' watermark, add "status": "paid"
- If line item totals don't sum to the stated total, flag with "discrepancy": true
- If tax amount is calculable from subtotal × tax_rate but differs from stated,
use the STATED amount and flag the discrepancy
Complex Extraction Patterns
Nested Entity Extraction
From this legal document, extract all mentioned entities and their relationships:
{
"entities": [
{
"name": "string — official legal name",
"type": "corporation|individual|government_agency|trust|other",
"role": "buyer|seller|lender|borrower|guarantor|witness|other",
"jurisdiction": "string — state/country of incorporation or residence",
"aliases": ["string — other names used for this entity in the document"]
}
],
"relationships": [
{
"from": "string — entity name (must match an entity.name exactly)",
"to": "string — entity name",
"type": "owns|owes|guarantees|licenses|indemnifies|other",
"details": "string — brief description of the relationship",
"clause_reference": "string — section/paragraph where defined"
}
],
"key_terms": [
{
"term": "string",
"definition": "string — how the document defines it, or null if undefined",
"section": "string — where it appears"
}
]
}
AMBIGUITY RULES:
- If an entity is referred to by multiple names, use the first official name as 'name'
and list others in 'aliases'
- If a relationship is implied but not explicit, include it but set a flag: "explicit": false
Temporal Extraction
Extract all events with their temporal information:
{
"events": [
{
"description": "string — what happened",
"date": "string or null — ISO 8601. If only year: YYYY. If year-month: YYYY-MM",
"date_type": "exact|approximate|range_start|range_end|before|after|unknown",
"date_context": "string — how the date was expressed ('Q3 2024', 'last Tuesday', 'sometime in spring')",
"participants": ["string — who was involved"],
"location": "string or null",
"confidence": "high|medium|low — how confident are you in the extracted date?"
}
]
}
TEMPORAL REASONING:
- 'Last Tuesday' relative to document date of 2024-03-15 → 2024-03-12
- 'The following quarter' after Q3 2024 → Q4 2024 (range: 2024-10-01 to 2024-12-31)
- If the document date is unknown and a relative date is used, set date to null and
preserve the relative expression in date_context
Batch Extraction
For extracting from multiple documents with consistent schema:
Process the following [N] documents. For EACH document, return a JSON object
with the extraction schema. Return ALL results as a JSON array:
[
{ "doc_id": 1, ...extracted fields... },
{ "doc_id": 2, ...extracted fields... }
]
If extraction fails for a document (e.g., wrong format, unreadable):
{ "doc_id": N, "error": "reason for failure" }
Quality Assurance Patterns
The Verification Loop
After extraction, verify your output:
1. COUNT CHECK: Number of extracted items should match what you observed
in the document. State: "Extracted [N] items. Document appears to contain [M]."
2. REQUIRED FIELD CHECK: For each required field, confirm it has a non-null value
or explain why it's null: "Field [X] is null because [reason]."
3. SUSPICIOUS VALUE FLAG: If any value seems unusual (e.g., $1,000,000 for a
coffee invoice), flag it: "SUSPICIOUS: [field] = [value]. Possible error in
extraction or source document."
4. CONFIDENCE SCORE: Overall confidence in the extraction: [0.0 - 1.0]
Below 0.8: list the uncertain fields and why.
Handling Imperfect Documents
This document may contain:
- OCR errors (scanned document)
- Handwritten annotations
- Conflicting information (e.g., two different totals)
- Missing pages or sections
When you encounter issues:
- OCR UNCERTAINTY: "The value appears to be [best guess] but could be [alternative].
Context: [surrounding text]."
- CONFLICT: "Field [X] appears twice: value A (section 3) and value B (section 7).
Using [choice] because [reasoning]."
- MISSING: "Section [X] appears to be missing. Fields [Y, Z] cannot be extracted."
Note:
Production Pipeline Pattern: For high-volume extraction, implement a two-pass system. Pass 1: Claude extracts + self-verifies with confidence scores. Pass 2: Items with confidence < 0.9 go to a second Claude call with more explicit extraction instructions, or to human review.
Note:
Schema drift danger: Claude will follow your schema exactly — even if the schema is wrong for the document. If you ask for "invoice_number" but the document calls it "Reference No.", Claude may return null rather than recognizing the different label. Include field aliases in your definitions.
Related Pages
- Document Analysis & Summarization — Before extracting structured data, analyze documents to understand their structure and identify what's extractable.
- Code Review & Refactoring — Structured output patterns from extraction apply equally well to automated code analysis pipelines.
Related Articles
Exam Preparation Guide: ChatGPT Prompts for Success
Master exam preparation with these ChatGPT prompts designed to help you study effectively, create study schedules, and perform well in academic assessments.
Fine Art Style SREF Codes for Midjourney
Discover curated Midjourney SREF codes for fine art aesthetics including classical, modern, and contemporary art movements. Complete guide with examples and tips.
Master ChatGPT Prompts: Complete Strategy Guide
Learn proven strategies and best practices for crafting effective ChatGPT prompts. Get better AI responses with clear techniques and examples.