OndatraSQL's dual-language execution model
How OndatraSQL combines SQL for transformation with Starlark for transport. Lib functions become table sources in FROM. One language for queries, one for the network.
OndatraSQL uses two languages: SQL for data transformation and Starlark for data transport. Both execute in the same pipeline, share the same dependency graph, and produce output to DuckLake.
This post describes the design decisions behind this approach and how the three layers interact at runtime.
Three layers, strict separation
Data pipelines involve three distinct concerns:
| Layer | What it does | Where |
|---|---|---|
| API dict | Declares the contract: auth, endpoints, rate limits, pagination | lib/*.star (top-level dict) |
| Starlark | Handles I/O: HTTP requests, pagination logic, response parsing | lib/*.star (functions) |
| SQL | Handles transformation: joins, aggregations, casts, pivots | models/*.sql |
The API dict is parsed as AST — no code execution. Starlark functions handle the vendor-specific protocol. SQL models handle the data modeling. No layer does another layer’s job.
SQL models
SQL models are SELECT statements with directives:
-- models/mart/revenue.sql
-- @kind: table
SELECT
order_date,
COUNT(*) AS orders,
SUM(amount) AS revenue
FROM staging.orders
GROUP BY order_date
OndatraSQL handles table creation, schema evolution, change detection, incremental processing, dependency ordering, and validation. The user writes the query.
Lib functions as table sources
Starlark is a deterministic scripting language with Python-like syntax, originally designed for build systems (Bazel, Buck2). OndatraSQL embeds Starlark directly in the binary.
Lib functions in lib/ become table sources in SQL, like read_csv() or read_json() but with authentication, pagination, and polling:
-- models/raw/gam_report.sql
-- @kind: append
-- @incremental: DATE
-- @incremental_initial: 20260401
SELECT
AD_UNIT_NAME,
DATE,
AD_SERVER_IMPRESSIONS::BIGINT AS AD_SERVER_IMPRESSIONS,
AD_SERVER_REVENUE::DOUBLE AS AD_SERVER_REVENUE
FROM gam_report()
The SQL controls what gets fetched. Column names match the API. SQL casts communicate intent to the blueprint — untyped columns are dimensions, typed columns are metrics. A downstream staging model handles the transformation:
-- models/staging/gam_report.sql
-- @kind: table
SELECT
AD_UNIT_NAME AS ad_unit,
strptime(DATE, '%Y%m%d')::DATE AS report_date,
AD_SERVER_IMPRESSIONS::BIGINT AS impressions,
AD_SERVER_REVENUE::DOUBLE AS revenue
FROM raw.gam_report
Why Starlark
Starlark was chosen for three properties:
- Deterministic: no mutable global state, no threading. Same input always produces the same output.
- Sandboxed: no filesystem access, no arbitrary imports. Scripts can only use the modules OndatraSQL provides.
- Embeddable: runs inside the Go binary. No external runtime, no dependency management.
The API dict
Lib functions declare their contract via an API dict. All values must be literals — the runtime parses the dict as AST without executing code. This makes it statically analyzable: validation, documentation generation, and IDE support work without running any Starlark.
# lib/gam_report.star
API = {
"base_url": "https://admanager.googleapis.com",
"auth": {
"service_account": {"env": "GAM_KEY_FILE"},
"scope": "https://www.googleapis.com/auth/admanager",
},
"timeout": 120,
"retry": 3,
"backoff": 2,
"fetch": {
"args": ["options"],
"page_size": 10000,
},
}
def fetch(options="", page=None, columns=[], target=""):
# OAuth, create report, poll, fetch pages
# Return rows + next cursor for pagination
return {"rows": [...], "next": cursor_or_none}
The runtime manages pagination, auth injection, retry, and rate limiting. The blueprint focuses on the API-specific logic — how to parse the response, what the next page cursor looks like, how to map fields.
How the three layers interact
SQL models reference lib functions in FROM. The runtime:
- Detects
FROM gam_report(...)in the SQL AST - Matches against
lib/gam_report.star - Extracts column names and types from the SELECT via DuckDB AST
- Executes the Starlark function with runtime-managed pagination
- Replaces the function call with the collected results
- Runs the SQL normally
Dependencies are extracted from the SQL AST. A single ondatrasql run executes everything in DAG order:
Outbound: same pattern, reverse direction
The same lib function can also push data outbound via @sink. SQL controls the transformation — including building nested JSON for APIs:
-- models/sync/contacts.sql
-- @kind: merge
-- @unique_key: id
-- @sink: crm
SELECT id,
json_object('name', name, 'email', email) AS properties
FROM staging.contacts
The runtime materializes the SQL, detects what changed via DuckLake snapshots, and calls push() with only the delta rows. The blueprint sends them to the API.
Tradeoffs
Starlark is not a general-purpose language. It does not have access to the Python ecosystem, machine learning libraries, or specialized parsers. For workloads that require those capabilities, a separate ingestion step using Python or another tool may be more appropriate.
The benefit is simplicity: one binary handles both transformation and transport, with no external runtime to manage. Every API becomes a table source in SQL.
OndatraSQL