Mistral OCR

Extract structured data from PDF invoices using Mistral OCR: one YAML model, one shared source function, and two API calls per file (upload, then OCR with schema).

Tags ai, ocr, pdf, mistral
Model path models/raw/invoices.yaml
Kind append
Strategy Incremental by file hash (md5)

Turn a folder of PDFs into a queryable table — no Python, no OCR libraries, no glue code.

What This Blueprint Gives You

  • Scan a folder of PDFs automatically
  • Incremental — only new or changed files are processed
  • File tracking via md5 hash (unchanged files are skipped, modified files are detected and reprocessed)
  • OCR + structured extraction in a single API call
  • Output schema defined in YAML and sent to Mistral as a JSON Schema, so every file is extracted into the same structure
  • Reusable source function for any document type

Quick Start

1. Set up credentials

Get an API key at console.mistral.ai and add it to .env:

MISTRAL_API_KEY=your-key-here

2. Add the source function

Copy lib/mistral_ocr.star into your project.

3. Create a model

# models/raw/invoices.yaml
kind: append
source: mistral_ocr
config:
  folder: data/invoices
  target: raw.invoices
  schema:
    type: object
    properties:
      invoice_number:
        type: string
      date:
        type: string
      vendor:
        type: string
      bill_to:
        type: string
      total:
        type: number
      line_items:
        type: array
        items:
          type: object
          properties:
            item:
              type: string
            qty:
              type: integer
            price:
              type: number

The schema defines the exact JSON structure you want. Mistral OCR extracts it directly — no separate LLM call needed.
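
To make the link between the YAML schema and the OCR call concrete, here is a minimal Python sketch of the request body the source function posts to https://api.mistral.ai/v1/ocr. The field names are copied from lib/mistral_ocr.star; the helper name build_ocr_request and the file id "file-123" are placeholders invented for this illustration, and the blueprint itself runs this logic in Starlark, not Python.

```python
import json


def build_ocr_request(schema, file_id, model="mistral-ocr-latest"):
    """Wrap a JSON Schema in the body shape used by the source function.

    Hypothetical helper: it only builds the dict, it does not call the API.
    """
    return {
        "model": model,
        "document": {"type": "file", "file_id": file_id},
        "document_annotation_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": schema},
        },
    }


# A trimmed version of the invoice schema from the YAML model above.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
}

# "file-123" stands in for the id returned by the upload step.
body = build_ocr_request(invoice_schema, file_id="file-123")
print(json.dumps(body, indent=2))
```

The schema is passed through unchanged, which is why editing the YAML is enough to change what gets extracted.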

4. Add PDFs

Drop PDF files into data/invoices/:

data/invoices/
├── invoice_001.pdf
├── invoice_002.pdf
└── invoice_003.pdf

5. Run

ondatrasql run raw.invoices
[OK] raw.invoices (append, backfill, 7 rows, 6.1s)

6. Run again — nothing happens

ondatrasql run raw.invoices
[OK] raw.invoices (append, incremental, 0 rows, 85ms)

All files already processed. No API calls made.

7. Drop a new PDF and run

cp new_invoice.pdf data/invoices/
ondatrasql run raw.invoices
[OK] raw.invoices (append, incremental, 1 rows, 2.9s)

Only the new file is processed.

8. Query

ondatrasql sql "SELECT invoice_number, vendor, item, qty, price FROM raw.invoices"
| invoice_number | vendor           | item          | qty | price |
| -------------- | ---------------- | ------------- | --- | ----- |
| INV-2026-001   | Acme Corp        | Widget A      | 10  | 50    |
| INV-2026-001   | Acme Corp        | Widget B      | 5   | 120   |
| INV-2026-001   | Acme Corp        | Service Fee   | 1   | 200   |
| INV-2026-002   | Globex Inc       | Consulting    | 8   | 150   |
| INV-2026-002   | Globex Inc       | License Fee   | 1   | 500   |
| INV-2026-003   | Stark Industries | Repulsor Beam | 2   | 5000  |
| INV-2026-003   | Stark Industries | Arc Reactor   | 1   | 25000 |

That’s the whole model. Most users only change YAML — not code.

How It Works

  1. read_blob() + md5() computes a hash for every PDF in the folder
  2. Hashes are compared against previously processed files in the target table
  3. Only new or modified files are sent to Mistral
  4. http.upload() sends each PDF to Mistral’s file API
  5. Mistral OCR extracts text and structures it into JSON in a single call using document_annotation_format
  6. save.row() appends each extracted row to DuckLake

Two API calls per file: upload + OCR with schema. No LLM step.
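
Steps 1 to 3 can be sketched outside the runtime as follows. This is a Python approximation under stated assumptions, not the blueprint's Starlark code: it hashes files with hashlib instead of DuckDB's read_blob() and md5(), and the `seen` set stands in for the (source_file, file_hash) pairs stored in the target table. The function names md5_of and files_to_process are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the md5 hex digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def files_to_process(folder, seen):
    """List (path, md5) pairs for PDFs whose content has not been seen yet.

    `seen` is a set of (path, md5) pairs that were already processed.
    """
    pending = []
    for path in sorted(Path(folder).glob("*.pdf")):
        key = (str(path), md5_of(path))
        if key not in seen:
            pending.append(key)
    return pending


# Demo with throwaway files; the bytes are not real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "invoice_001.pdf"
    second = Path(tmp) / "invoice_002.pdf"
    first.write_bytes(b"first invoice")
    second.write_bytes(b"second invoice")

    first_run = files_to_process(tmp, set())    # both files are new
    seen = set(first_run)
    second_run = files_to_process(tmp, seen)    # nothing changed

    second.write_bytes(b"second invoice, corrected")
    third_run = files_to_process(tmp, seen)     # only the edited file
```

Because the key is the file's content hash rather than its modification time, copying or touching a file without changing its bytes does not trigger reprocessing.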

Customization

Most users never need to touch the source code. Change the folder and schema in YAML only.

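Receipts, with a different schema. The source function flattens only an array named line_items, so the items array here is not expanded into one row per item; name it line_items if you want that behavior: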

# models/raw/receipts.yaml
kind: append
source: mistral_ocr
config:
  folder: data/receipts
  target: raw.receipts
  schema:
    type: object
    properties:
      store:
        type: string
      date:
        type: string
      total:
        type: number
      items:
        type: array
        items:
          type: object
          properties:
            name:
              type: string
            price:
              type: number

Contracts — different document type:

# models/raw/contracts.yaml
kind: append
source: mistral_ocr
config:
  folder: data/contracts
  target: raw.contracts
  schema:
    type: object
    properties:
      parties:
        type: string
      effective_date:
        type: string
      termination_date:
        type: string
      value:
        type: number

Source Function

# lib/mistral_ocr.star

def mistral_ocr(save, folder="data/documents", target="raw.documents", schema=None):
    """Scan a folder of PDFs, OCR with Mistral, extract structured data.
    Tracks file hashes to skip already-processed files."""

    api_key = env.get("MISTRAL_API_KEY")
    if not api_key:
        fail("MISTRAL_API_KEY not set in .env")

    if not schema:
        fail("schema is required in config")

    auth_header = {"Authorization": "Bearer " + api_key}

    files = query("SELECT filename AS file, md5(content) AS hash FROM read_blob('" + folder + "/*.pdf') ORDER BY filename")

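    # Collect every (file, hash) pair already stored in the target table.
    # Keying on the pair matters: once a file has been modified, the
    # append-only table holds both its old and new hash, and a file -> hash
    # dict could keep the stale one and reprocess the file on every run.
    existing = {}
    parts = target.split(".")
    if len(parts) != 2:
        fail("target must be schema.table, got: " + target)
    table_exists = query("SELECT COUNT(*) AS n FROM information_schema.tables WHERE table_schema='" + parts[0] + "' AND table_name='" + parts[1] + "'")
    if table_exists[0]["n"] != "0":
        rows = query("SELECT DISTINCT source_file, file_hash FROM " + target)
        for r in rows:
            existing[(r["source_file"], r["file_hash"])] = True

    new_files = [f for f in files if (f["file"], f["hash"]) not in existing]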

    if len(new_files) == 0:
        return

    for f in new_files:
        filepath = f["file"]
        file_hash = f["hash"]

        # Upload
        upload = http.upload(
            "https://api.mistral.ai/v1/files",
            file=filepath,
            headers=auth_header,
            fields={"purpose": "ocr"},
        )
        if not upload.ok:
            fail("upload " + filepath + ": " + upload.text)

        # OCR + structured extraction in one call
        ocr = http.post("https://api.mistral.ai/v1/ocr",
            headers=auth_header,
            json={
                "model": "mistral-ocr-latest",
                "document": {"type": "file", "file_id": upload.json["id"]},
                "document_annotation_format": {
                    "type": "json_schema",
                    "json_schema": {"name": "extraction", "schema": schema},
                },
            },
        )
        if not ocr.ok:
            fail("ocr " + filepath + ": " + ocr.text)

        data = json.decode(ocr.json["document_annotation"])
        if "properties" in data:
            data = data["properties"]

        # Emit rows
        items = data.get("line_items", None)
        if items:
            for item in items:
                row = {"source_file": filepath, "file_hash": file_hash}
                for k, v in data.items():
                    if k != "line_items":
                        row[k] = v
                for k, v in item.items():
                    row[k] = v
                save.row(row)
        else:
            row = {"source_file": filepath, "file_hash": file_hash}
            for k, v in data.items():
                row[k] = v
            save.row(row)
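
The row-emission step at the end of the function explains why three invoices became seven rows in the Quick Start: each entry in line_items becomes its own row, with the invoice-level fields repeated. The Python sketch below mirrors that logic so it can be tried in isolation. The function name flatten and the array_key parameter are hypothetical (the Starlark source hardcodes "line_items"), and the sample values come from the query output shown earlier.

```python
def flatten(data, source_file, file_hash, array_key="line_items"):
    """Turn one extracted document into a list of flat rows.

    Mirrors the source function: one row per array entry, or a single
    row holding all fields when the array is missing or empty.
    """
    meta = {"source_file": source_file, "file_hash": file_hash}
    items = data.get(array_key)
    if not items:
        return [{**meta, **data}]
    header = {k: v for k, v in data.items() if k != array_key}
    # Keys in the array entry overwrite document-level keys of the same name.
    return [{**meta, **header, **item} for item in items]


invoice = {
    "invoice_number": "INV-2026-001",
    "vendor": "Acme Corp",
    "line_items": [
        {"item": "Widget A", "qty": 10, "price": 50},
        {"item": "Widget B", "qty": 5, "price": 120},
        {"item": "Service Fee", "qty": 1, "price": 200},
    ],
}
# The file hash is a placeholder, not a real md5 digest.
invoice_rows = flatten(invoice, "data/invoices/invoice_001.pdf", "placeholder-hash")

# A document without the array, such as the contracts schema, yields one row.
contract = {"parties": "A and B", "value": 1000.0}
contract_rows = flatten(contract, "data/contracts/c1.pdf", "placeholder-hash")

for row in invoice_rows:
    print(row["invoice_number"], row["item"], row["qty"], row["price"])
```

One consequence of merging this way is that a line-item field named like an invoice-level field (for example a per-item total) would replace the invoice-level value in that row, so distinct names in the schema are safer.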

Using as a Standalone Script

Use this only if you want full control in Starlark. For most cases, YAML is simpler and preferred.

# models/raw/invoices.star
# @kind: append

load("lib/mistral_ocr.star", "mistral_ocr")
mistral_ocr(
    save,
    folder="data/invoices",
    target="raw.invoices",
    schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string"},
            "vendor": {"type": "string"},
            "bill_to": {"type": "string"},
            "total": {"type": "number"},
            "line_items": {"type": "array", "items": {"type": "object", "properties": {
                "item": {"type": "string"},
                "qty": {"type": "integer"},
                "price": {"type": "number"},
            }}},
        },
    },
)