Turn a folder of PDFs into a queryable table — no Python, no OCR libraries, no glue code.
What This Blueprint Gives You
- Scan a folder of PDFs automatically
- Incremental — only new or changed files are processed
- File tracking via md5 hash (unchanged files are skipped, modified files are re-processed)
- OCR + structured extraction in a single API call
- Output schema defined in YAML — guaranteed format
- Reusable source function for any document type
Quick Start
1. Set up credentials
Get an API key at console.mistral.ai and add it to .env:
MISTRAL_API_KEY=your-key-here
2. Add the source function
Copy lib/mistral_ocr.star into your project.
3. Create a model
# models/raw/invoices.yaml
kind: append
source: mistral_ocr
config:
  folder: data/invoices
  target: raw.invoices
  schema:
    type: object
    properties:
      invoice_number:
        type: string
      date:
        type: string
      vendor:
        type: string
      bill_to:
        type: string
      total:
        type: number
      line_items:
        type: array
        items:
          type: object
          properties:
            item:
              type: string
            qty:
              type: integer
            price:
              type: number
The schema defines the exact JSON structure you want. Mistral OCR extracts it directly — no separate LLM call needed.
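To make the "single call" concrete, here is a minimal Python sketch of how a schema like the one above can be wrapped into the OCR request body. The field names mirror the request built in the source function further down; `build_ocr_request`, `INVOICE_SCHEMA`, and the `"file-123"` id are hypothetical names used only for illustration, and no network call is made.

```python
import json

# The invoice schema from the YAML model, written as a Python dict.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "vendor": {"type": "string"},
        "bill_to": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "item": {"type": "string"},
                    "qty": {"type": "integer"},
                    "price": {"type": "number"},
                },
            },
        },
    },
}


def build_ocr_request(file_id, schema):
    """Return the JSON body for one OCR + extraction request (hypothetical helper)."""
    return {
        "model": "mistral-ocr-latest",
        "document": {"type": "file", "file_id": file_id},
        "document_annotation_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": schema},
        },
    }


# "file-123" is a placeholder; a real id comes from the upload response.
body = build_ocr_request("file-123", INVOICE_SCHEMA)
print(json.dumps(body, indent=2))
```

The schema travels inside the OCR request itself, which is why no second extraction call appears anywhere in the pipeline.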
4. Add PDFs
Drop PDF files into data/invoices/:
data/invoices/
├── invoice_001.pdf
├── invoice_002.pdf
└── invoice_003.pdf
5. Run
ondatrasql run raw.invoices
[OK] raw.invoices (append, backfill, 7 rows, 6.1s)
6. Run again — nothing happens
ondatrasql run raw.invoices
[OK] raw.invoices (append, incremental, 0 rows, 85ms)
All files already processed. No API calls made.
7. Drop a new PDF and run
cp new_invoice.pdf data/invoices/
ondatrasql run raw.invoices
[OK] raw.invoices (append, incremental, 1 rows, 2.9s)
Only the new file is processed.
8. Query
ondatrasql sql "SELECT invoice_number, vendor, item, qty, price FROM raw.invoices"
| invoice_number | vendor | item | qty | price |
| -------------- | ---------------- | ------------- | --- | ----- |
| INV-2026-001 | Acme Corp | Widget A | 10 | 50 |
| INV-2026-001 | Acme Corp | Widget B | 5 | 120 |
| INV-2026-001 | Acme Corp | Service Fee | 1 | 200 |
| INV-2026-002 | Globex Inc | Consulting | 8 | 150 |
| INV-2026-002 | Globex Inc | License Fee | 1 | 500 |
| INV-2026-003 | Stark Industries | Repulsor Beam | 2 | 5000 |
| INV-2026-003 | Stark Industries | Arc Reactor | 1 | 25000 |
That’s the whole model. Most users only change YAML — not code.
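The query result has one row per line item, with the invoice-level fields repeated on every row. The following Python sketch mirrors the row-emitting logic of the source function shown later in this document. The `flatten` helper, the file path, and the `"abc123"` hash are illustrative assumptions; the invoice values are taken from the table above.

```python
def flatten(data, source_file, file_hash):
    """Return one row per line item, or a single row if there are none."""
    items = data.get("line_items")
    if not items:
        return [{"source_file": source_file, "file_hash": file_hash, **data}]
    rows = []
    for item in items:
        row = {"source_file": source_file, "file_hash": file_hash}
        # Copy document-level fields, then overlay the fields of this line item.
        row.update({k: v for k, v in data.items() if k != "line_items"})
        row.update(item)
        rows.append(row)
    return rows


invoice = {
    "invoice_number": "INV-2026-001",
    "vendor": "Acme Corp",
    "line_items": [
        {"item": "Widget A", "qty": 10, "price": 50},
        {"item": "Widget B", "qty": 5, "price": 120},
        {"item": "Service Fee", "qty": 1, "price": 200},
    ],
}

# The path and hash below are placeholders.
rows = flatten(invoice, "data/invoices/invoice_001.pdf", "abc123")
for row in rows:
    print(row["invoice_number"], row["vendor"], row["item"], row["qty"], row["price"])
```

Because line-item keys are written last, a line-item field that shares a name with a document-level field would overwrite it in the row.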
How It Works
- read_blob() + md5() computes a hash for every PDF in the folder
- Hashes are compared against previously processed files in the target table
- Only new or modified files are sent to Mistral
- http.upload() sends each PDF to Mistral’s file API
- Mistral OCR extracts text and structures it into JSON in a single call using document_annotation_format
- save.row() appends each extracted row to DuckLake
Two API calls per file: upload + OCR with schema. No LLM step.
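The incremental check in the first three steps can be sketched in plain Python. This is an illustration of the comparison only, assuming a dictionary stands in for the hashes stored in the target table; `file_hashes` and `files_to_process` are hypothetical names, and the demo files contain placeholder bytes rather than real PDFs.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hashes(folder):
    """Map each PDF path in `folder` to the md5 hex digest of its bytes."""
    return {
        str(path): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(Path(folder).glob("*.pdf"))
    }


def files_to_process(current, processed):
    """Return paths that are new or whose hash differs from the stored one."""
    return [path for path, digest in current.items() if processed.get(path) != digest]


with tempfile.TemporaryDirectory() as folder:
    a = Path(folder) / "invoice_001.pdf"
    b = Path(folder) / "invoice_002.pdf"
    a.write_bytes(b"first")
    b.write_bytes(b"second")

    processed = {}                       # nothing in the target table yet
    first_run = files_to_process(file_hashes(folder), processed)
    processed = file_hashes(folder)      # pretend both files were saved

    second_run = files_to_process(file_hashes(folder), processed)
    b.write_bytes(b"second, edited")     # modify one file
    third_run = files_to_process(file_hashes(folder), processed)

print(len(first_run), len(second_run), len(third_run))  # prints: 2 0 1
```

An unchanged folder yields an empty list, which corresponds to the run that finishes in milliseconds without contacting the API.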
Customization
Most users never need to touch the source code. Change the folder and schema in YAML only.
Receipts — different schema:
# models/raw/receipts.yaml
kind: append
source: mistral_ocr
config:
  folder: data/receipts
  target: raw.receipts
  schema:
    type: object
    properties:
      store:
        type: string
      date:
        type: string
      total:
        type: number
      items:
        type: array
        items:
          type: object
          properties:
            name:
              type: string
            price:
              type: number
Contracts — different document type:
# models/raw/contracts.yaml
kind: append
source: mistral_ocr
config:
  folder: data/contracts
  target: raw.contracts
  schema:
    type: object
    properties:
      parties:
        type: string
      effective_date:
        type: string
      termination_date:
        type: string
      value:
        type: number
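One caveat when swapping schemas: reading the source function as written, only an array field named line_items is expanded into separate rows. The receipts schema above names its array items, so each receipt would appear to be saved as a single row with the array left inside that one value. The Python sketch below illustrates the row-count rule; `rows_emitted` and all sample values are hypothetical.

```python
def rows_emitted(data):
    """Number of rows the flattening step would produce for one document."""
    items = data.get("line_items")
    return len(items) if items else 1


# Hypothetical extraction results, shaped like the schemas above.
receipt = {
    "store": "Corner Shop",
    "total": 7.5,
    "items": [{"name": "Coffee", "price": 3.0}, {"name": "Bagel", "price": 4.5}],
}
receipt_renamed = {
    "store": "Corner Shop",
    "total": 7.5,
    "line_items": [{"name": "Coffee", "price": 3.0}, {"name": "Bagel", "price": 4.5}],
}
contract = {"parties": "Party A and Party B", "value": 1000}

print(rows_emitted(receipt), rows_emitted(receipt_renamed), rows_emitted(contract))
# prints: 1 2 1
```

If one row per receipt item is wanted, naming the array line_items in the schema is the change that keeps the source code untouched.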
Source Function
# lib/mistral_ocr.star
def mistral_ocr(save, folder="data/documents", target="raw.documents", schema=None):
    """Scan a folder of PDFs, OCR with Mistral, extract structured data.
    Tracks file hashes to skip already-processed files."""
    api_key = env.get("MISTRAL_API_KEY")
    if not api_key:
        fail("MISTRAL_API_KEY not set in .env")
    if not schema:
        fail("schema is required in config")
    auth_header = {"Authorization": "Bearer " + api_key}

    # Hash every PDF currently in the folder
    files = query("SELECT filename AS file, md5(content) AS hash FROM read_blob('" + folder + "/*.pdf') ORDER BY filename")

    # Load the hashes of files already saved to the target table, if it exists
    existing = {}
    parts = target.split(".")
    table_exists = query("SELECT COUNT(*) AS n FROM information_schema.tables WHERE table_schema='" + parts[0] + "' AND table_name='" + parts[1] + "'")
    if table_exists[0]["n"] != "0":
        rows = query("SELECT DISTINCT source_file, file_hash FROM " + target)
        for r in rows:
            existing[r["source_file"]] = r["file_hash"]

    # Keep only files that are new or whose hash has changed
    new_files = [f for f in files if existing.get(f["file"], "") != f["hash"]]
    if len(new_files) == 0:
        return

    for f in new_files:
        filepath = f["file"]
        file_hash = f["hash"]

        # Upload
        upload = http.upload(
            "https://api.mistral.ai/v1/files",
            file=filepath,
            headers=auth_header,
            fields={"purpose": "ocr"},
        )
        if not upload.ok:
            fail("upload " + filepath + ": " + upload.text)

        # OCR + structured extraction in one call
        ocr = http.post("https://api.mistral.ai/v1/ocr",
            headers=auth_header,
            json={
                "model": "mistral-ocr-latest",
                "document": {"type": "file", "file_id": upload.json["id"]},
                "document_annotation_format": {
                    "type": "json_schema",
                    "json_schema": {"name": "extraction", "schema": schema},
                },
            },
        )
        if not ocr.ok:
            fail("ocr " + filepath + ": " + ocr.text)

        # The annotation is a JSON string; unwrap it if the values sit under "properties"
        data = json.decode(ocr.json["document_annotation"])
        if "properties" in data:
            data = data["properties"]

        # Emit rows: one per line item, or one per document if there are none
        items = data.get("line_items", None)
        if items:
            for item in items:
                row = {"source_file": filepath, "file_hash": file_hash}
                for k, v in data.items():
                    if k != "line_items":
                        row[k] = v
                for k, v in item.items():
                    row[k] = v
                save.row(row)
        else:
            row = {"source_file": filepath, "file_hash": file_hash}
            for k, v in data.items():
                row[k] = v
            save.row(row)
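The decoding step between the OCR call and the row loop is easy to overlook, so here is a small Python sketch of it in isolation. It assumes, as the source function does, that the response carries the extraction as a JSON string under document_annotation and that the values may or may not be nested under a "properties" key. `decode_annotation` and both sample responses are hypothetical.

```python
import json


def decode_annotation(ocr_response):
    """Parse the annotation string and unwrap a top-level "properties" object."""
    data = json.loads(ocr_response["document_annotation"])
    if "properties" in data:
        data = data["properties"]
    return data


# Two hypothetical response shapes carrying the same extracted values.
flat = {"document_annotation": json.dumps({"vendor": "Acme Corp", "total": 100})}
wrapped = {
    "document_annotation": json.dumps(
        {"properties": {"vendor": "Acme Corp", "total": 100}}
    )
}

print(decode_annotation(flat) == decode_annotation(wrapped))  # prints: True
```

A side effect of this rule is that a schema with its own top-level field named properties would be unwrapped as well, so that field name is probably best avoided.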
Using as a Standalone Script
Use this only if you want full control in Starlark. For most cases, YAML is simpler and preferred.
# models/raw/invoices.star
# @kind: append
load("lib/mistral_ocr.star", "mistral_ocr")

mistral_ocr(
    save,
    folder="data/invoices",
    target="raw.invoices",
    schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string"},
            "vendor": {"type": "string"},
            "bill_to": {"type": "string"},
            "total": {"type": "number"},
            "line_items": {"type": "array", "items": {"type": "object", "properties": {
                "item": {"type": "string"},
                "qty": {"type": "integer"},
                "price": {"type": "number"},
            }}},
        },
    },
)
Ondatra Labs