← Blog

OndatraSQL's dual-language execution model

How OndatraSQL combines SQL for transformation with Starlark for transport. Lib functions become table sources in FROM. One language for queries, one for the network.

OndatraSQL uses two languages: SQL for data transformation and Starlark for data transport. Both execute in the same pipeline, share the same dependency graph, and produce output to DuckLake.

This post describes the design decisions behind this approach and how the three layers interact at runtime.

Three layers, strict separation

Data pipelines involve three distinct concerns:

LayerWhat it doesWhere
API dictDeclares the contract: auth, endpoints, rate limits, paginationlib/*.star (top-level dict)
StarlarkHandles I/O: HTTP requests, pagination logic, response parsinglib/*.star (functions)
SQLHandles transformation: joins, aggregations, casts, pivotsmodels/*.sql

The API dict is parsed as AST — no code execution. Starlark functions handle the vendor-specific protocol. SQL models handle the data modeling. No layer does another layer’s job.

SQL models

SQL models are SELECT statements with directives:

-- models/mart/revenue.sql
-- @kind: table

SELECT
    order_date,
    COUNT(*) AS orders,
    SUM(amount) AS revenue
FROM staging.orders
GROUP BY order_date

OndatraSQL handles table creation, schema evolution, change detection, incremental processing, dependency ordering, and validation. The user writes the query.

Lib functions as table sources

Starlark is a deterministic scripting language with Python-like syntax, originally designed for build systems (Bazel, Buck2). OndatraSQL embeds Starlark directly in the binary.

Lib functions in lib/ become table sources in SQL, like read_csv() or read_json() but with authentication, pagination, and polling:

-- models/raw/gam_report.sql
-- @kind: append
-- @incremental: DATE
-- @incremental_initial: 20260401

SELECT
    AD_UNIT_NAME,
    DATE,
    AD_SERVER_IMPRESSIONS::BIGINT AS AD_SERVER_IMPRESSIONS,
    AD_SERVER_REVENUE::DOUBLE AS AD_SERVER_REVENUE
FROM gam_report()

The SQL controls what gets fetched. Column names match the API. SQL casts communicate intent to the blueprint — untyped columns are dimensions, typed columns are metrics. A downstream staging model handles the transformation:

-- models/staging/gam_report.sql
-- @kind: table

SELECT
    AD_UNIT_NAME AS ad_unit,
    strptime(DATE, '%Y%m%d')::DATE AS report_date,
    AD_SERVER_IMPRESSIONS::BIGINT AS impressions,
    AD_SERVER_REVENUE::DOUBLE AS revenue
FROM raw.gam_report

Why Starlark

Starlark was chosen for three properties:

  1. Deterministic: no mutable global state, no threading. Same input always produces the same output.
  2. Sandboxed: no filesystem access, no arbitrary imports. Scripts can only use the modules OndatraSQL provides.
  3. Embeddable: runs inside the Go binary. No external runtime, no dependency management.

The API dict

Lib functions declare their contract via an API dict. All values must be literals — the runtime parses the dict as AST without executing code. This makes it statically analyzable: validation, documentation generation, and IDE support work without running any Starlark.

# lib/gam_report.star

API = {
    "base_url": "https://admanager.googleapis.com",
    "auth": {
        "service_account": {"env": "GAM_KEY_FILE"},
        "scope": "https://www.googleapis.com/auth/admanager",
    },
    "timeout": 120,
    "retry": 3,
    "backoff": 2,
    "fetch": {
        "args": ["options"],
        "page_size": 10000,
    },
}

def fetch(options="", page=None, columns=[], target=""):
    # OAuth, create report, poll, fetch pages
    # Return rows + next cursor for pagination
    return {"rows": [...], "next": cursor_or_none}

The runtime manages pagination, auth injection, retry, and rate limiting. The blueprint focuses on the API-specific logic — how to parse the response, what the next page cursor looks like, how to map fields.

How the three layers interact

SQL models reference lib functions in FROM. The runtime:

  1. Detects FROM gam_report(...) in the SQL AST
  2. Matches against lib/gam_report.star
  3. Extracts column names and types from the SELECT via DuckDB AST
  4. Executes the Starlark function with runtime-managed pagination
  5. Replaces the function call with the collected results
  6. Runs the SQL normally

Dependencies are extracted from the SQL AST. A single ondatrasql run executes everything in DAG order:

gam_report() transport
raw/gam_report.sql fetch
staging/gam_report.sql transform
mart/metrics.sql aggregate

Outbound: same pattern, reverse direction

The same lib function can also push data outbound via @sink. SQL controls the transformation — including building nested JSON for APIs:

-- models/sync/contacts.sql
-- @kind: merge
-- @unique_key: id
-- @sink: crm

SELECT id,
    json_object('name', name, 'email', email) AS properties
FROM staging.contacts

The runtime materializes the SQL, detects what changed via DuckLake snapshots, and calls push() with only the delta rows. The blueprint sends them to the API.

Tradeoffs

Starlark is not a general-purpose language. It does not have access to the Python ecosystem, machine learning libraries, or specialized parsers. For workloads that require those capabilities, a separate ingestion step using Python or another tool may be more appropriate.

The benefit is simplicity: one binary handles both transformation and transport, with no external runtime to manage. Every API becomes a table source in SQL.