
How OndatraSQL uses DuckLake for storage and versioning

OndatraSQL stores pipeline output in DuckLake — a catalog layer that manages Parquet files with snapshots, time-travel, and schema evolution. This post describes how the storage layer works and what it provides.

DuckLake’s architecture

DuckLake separates data from metadata:

  • Data is stored as Parquet files on disk (or in cloud object storage)
  • Metadata is stored in a lightweight catalog database (SQLite or Postgres)

DuckDB reads and writes both layers. The catalog tracks which Parquet files belong to which tables, which snapshots exist, and what schema each table has at each point in time.
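Because the catalog is just a database, it can be inspected directly from DuckDB. A sketch, assuming the ducklake extension's snapshot-listing table function (verify the exact function name against your DuckLake version):

-- List every snapshot recorded in the attached catalog
SELECT * FROM ducklake_snapshots('lake');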

ducklake.sqlite          ← catalog (metadata)
data/
  staging/
    orders/
      data_0.parquet     ← data files
      data_1.parquet

How OndatraSQL initializes DuckLake

When you run ondatrasql init, the project creates a DuckLake catalog backed by SQLite:

-- config/catalog.sql
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 'data/');

This single statement creates the catalog. OndatraSQL sets USE lake so all subsequent operations — table creation, data writes, schema changes — target the DuckLake catalog.
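Once the catalog is attached and selected, ordinary DDL and DML land in DuckLake. A minimal sketch (schema, table, and values are illustrative):

USE lake;
CREATE SCHEMA IF NOT EXISTS staging;
CREATE TABLE staging.orders (order_id BIGINT, amount DOUBLE);
INSERT INTO staging.orders VALUES (1, 9.99);  -- committed as a new snapshot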

Snapshots

Every ondatrasql run that writes data creates a DuckLake snapshot. Snapshots are immutable — they record the state of every table at the time of commit.

OndatraSQL stores pipeline metadata in each snapshot’s commit_extra_info:

  • Model name and kind
  • SQL hash (for change detection)
  • Column definitions (for schema evolution)
  • Row count and duration
  • DAG run identifier

This metadata is what enables OndatraSQL’s change detection: the runtime compares the current model hash against the hash stored in the latest snapshot to determine whether a model needs to re-run.
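The comparison can be sketched as a catalog query. This is an assumption-laden sketch: the metadata function and the shape of the stored commit info depend on the DuckLake version and on how OndatraSQL encodes it:

-- Fetch the metadata recorded with the most recent snapshot
SELECT snapshot_id, commit_extra_info
FROM ducklake_snapshots('lake')
ORDER BY snapshot_id DESC
LIMIT 1;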

Time-travel

Any previous state is queryable using DuckLake’s AT clause:

SELECT * FROM staging.orders AT (VERSION => 3)

OndatraSQL uses time-travel internally for Smart CDC — comparing the current table state against the previous snapshot to determine which rows changed.
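The same comparison can be expressed directly in SQL by diffing two versions of a table with EXCEPT (version numbers are illustrative):

-- Rows present in version 4 but not in version 3
SELECT * FROM staging.orders AT (VERSION => 4)
EXCEPT
SELECT * FROM staging.orders AT (VERSION => 3);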

Schema evolution

When a model’s SQL changes (e.g. a new column is added), OndatraSQL applies the schema change via ALTER TABLE inside the same transaction as the data write:

  1. New columns are added
  2. Renamed columns are detected via AST comparison
  3. Type promotions are applied (e.g. INTEGER → BIGINT)
  4. Removed columns are dropped

In DuckLake, ALTER TABLE ADD COLUMN is a metadata-only operation — no Parquet files are rewritten. The snapshot chain is preserved across schema changes.
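A minimal sketch of such a change, assuming a hypothetical discount column added to the model:

ALTER TABLE staging.orders ADD COLUMN discount DOUBLE;  -- metadata-only: no Parquet rewrite
SELECT * FROM staging.orders AT (VERSION => 3);         -- pre-change snapshots remain readable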

ACID transactions

OndatraSQL wraps each model’s execution in a single DuckLake transaction:

BEGIN
  ALTER TABLE (schema evolution)
  INSERT INTO (data write)
  Audit checks (validation)
  set_commit_message (metadata)
COMMIT

If any step fails — including audit validation — the entire transaction is aborted. The table, schema, and snapshot history are left unchanged.
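In concrete SQL, the flow above looks roughly like this (the table, column, and audit check are illustrative, not OndatraSQL's actual generated SQL):

BEGIN TRANSACTION;
ALTER TABLE staging.orders ADD COLUMN discount DOUBLE;
INSERT INTO staging.orders SELECT * FROM source_orders;
-- if an audit check fails at this point, the runtime issues:
ROLLBACK;
-- otherwise: COMMIT;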

Catalog backends

DuckLake supports two catalog backends:

Backend    Use case
SQLite     Local development, single-machine pipelines, CI
Postgres   Multi-reader access, shared catalogs

OndatraSQL validates the catalog backend at startup and supports both.

Portability

The catalog file and Parquet data can be stored locally, on network storage, or in cloud object stores (S3, GCS, Azure Blob). The pipeline produces the same output regardless of where the data lives.

-- Local
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 'data/');

-- S3
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 's3://bucket/data/');
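For the S3 case, credentials are supplied through DuckDB's secrets manager before attaching (key, secret, and region below are placeholders):

CREATE SECRET (
  TYPE S3,
  KEY_ID 'my_key',
  SECRET 'my_secret',
  REGION 'us-east-1'
);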

Example

ondatrasql init
ondatrasql run
ondatrasql sql "SELECT * FROM mart.revenue"
ondatrasql sql "SELECT * FROM mart.revenue AT (VERSION => 3)"