OndatraSQL stores pipeline output in DuckLake — a catalog layer that manages Parquet files with snapshots, time-travel, and schema evolution. This post describes how the storage layer works and what it provides.
DuckLake’s architecture
DuckLake separates data from metadata:
- Data is stored as Parquet files on disk (or in cloud object storage)
- Metadata is stored in a lightweight catalog database (SQLite or Postgres)
DuckDB reads and writes both layers. The catalog tracks which Parquet files belong to which tables, which snapshots exist, and what schema each table has at each point in time.
```
ducklake.sqlite        ← catalog (metadata)
data/
  staging/
    orders/
      data_0.parquet   ← data files
      data_1.parquet
```
How OndatraSQL initializes DuckLake
When you run ondatrasql init, the project creates a DuckLake catalog backed by SQLite:
```sql
-- config/catalog.sql
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 'data/');
```
This single statement creates the catalog. OndatraSQL sets USE lake so all subsequent operations — table creation, data writes, schema changes — target the DuckLake catalog.
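A session initialized this way might look as follows. This is a sketch, not OndatraSQL's exact bootstrap code; the `staging.orders` table and the `orders.csv` source are hypothetical:

```sql
-- Attach the DuckLake catalog and make it the default database:
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 'data/');
USE lake;

-- Subsequent DDL and DML now target the DuckLake catalog
-- (example schema and table, not part of a default project):
CREATE SCHEMA IF NOT EXISTS staging;
CREATE TABLE staging.orders AS SELECT * FROM read_csv('orders.csv');
```

From this point on, every table created or written lands as Parquet files under `data/`, with the bookkeeping in `ducklake.sqlite`.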
Snapshots
Every ondatrasql run that writes data creates a DuckLake snapshot. Snapshots are immutable — they record the state of every table at the time of commit.
OndatraSQL stores pipeline metadata in each snapshot’s commit_extra_info:
- Model name and kind
- SQL hash (for change detection)
- Column definitions (for schema evolution)
- Row count and duration
- DAG run identifier
This metadata is what enables OndatraSQL’s change detection: the runtime compares the current model hash against the hash stored in the latest snapshot to determine whether a model needs to re-run.
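The snapshot chain can be inspected directly from SQL. DuckLake exposes a `snapshots()` function on the attached catalog; the exact column set may vary between DuckLake versions, so treat this as illustrative:

```sql
-- List snapshots in the attached catalog, newest first:
SELECT snapshot_id, snapshot_time, changes
FROM lake.snapshots()
ORDER BY snapshot_id DESC;
```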
Time-travel
Any previous state is queryable using DuckLake’s AT clause:
```sql
SELECT * FROM staging.orders AT (VERSION => 3);
```
OndatraSQL uses time-travel internally for Smart CDC — comparing the current table state against the previous snapshot to determine which rows changed.
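The idea can be sketched with a plain set difference. This is not OndatraSQL's actual CDC query, just the shape of the comparison, using the `staging.orders` table from the earlier examples:

```sql
-- Rows present now but absent at version 3 (inserted or updated since):
SELECT * FROM staging.orders
EXCEPT
SELECT * FROM staging.orders AT (VERSION => 3);
```

The symmetric query (old EXCEPT new) yields rows that were deleted or replaced.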
Schema evolution
When a model’s SQL changes (e.g. a new column is added), OndatraSQL applies the schema change via ALTER TABLE inside the same transaction as the data write:
- New columns are added
- Renamed columns are detected via AST comparison
- Type promotions are applied (e.g. INTEGER→BIGINT)
- Removed columns are dropped
In DuckLake, ALTER TABLE ADD COLUMN is a metadata-only operation — no Parquet files are rewritten. The snapshot chain is preserved across schema changes.
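Concretely, the evolution steps are ordinary DDL. The table and column names below are hypothetical, and the type-promotion syntax is DuckDB's standard form:

```sql
-- Metadata-only column addition; no Parquet files are rewritten:
ALTER TABLE staging.orders ADD COLUMN discount DECIMAL(10, 2);

-- Type promotion (INTEGER → BIGINT):
ALTER TABLE staging.orders ALTER COLUMN quantity SET DATA TYPE BIGINT;
```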
ACID transactions
OndatraSQL wraps each model’s execution in a single DuckLake transaction:
```
BEGIN
  ALTER TABLE (schema evolution)
  INSERT INTO (data write)
  Audit checks (validation)
  set_commit_message (metadata)
COMMIT
```
If any step fails — including audit validation — the entire transaction is aborted. The table, schema, and snapshot history are left unchanged.
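In SQL terms, the transaction looks roughly like this. It is a simplified sketch: the table and column names are hypothetical, and the audit-check and commit-metadata steps are shown only as comments because their exact form is internal to OndatraSQL:

```sql
BEGIN TRANSACTION;

-- Schema evolution (only if the model's SQL changed):
ALTER TABLE staging.orders ADD COLUMN discount DECIMAL(10, 2);

-- Data write:
INSERT INTO staging.orders SELECT * FROM staging_orders_new;

-- Audit checks run here; any failure triggers ROLLBACK,
-- and the commit metadata is attached before:
COMMIT;
```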
Catalog backends
DuckLake supports two catalog backends:
| Backend | Use case |
|---|---|
| SQLite | Local development, single-machine pipelines, CI |
| Postgres | Multi-reader access, shared catalogs |
OndatraSQL supports both backends and validates the configured backend at startup.
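Switching to Postgres only changes the ATTACH string; the connection parameters below are illustrative placeholders:

```sql
-- Postgres-backed catalog for shared, multi-reader access:
ATTACH 'ducklake:postgres:dbname=lake host=db.internal user=pipeline'
  AS lake (DATA_PATH 's3://bucket/data/');
```

The data path is independent of the catalog backend, so a Postgres catalog can point at local or object-store Parquet files alike.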
Portability
The catalog file and Parquet data can be stored locally, on network storage, or in cloud object stores (S3, GCS, Azure Blob). The pipeline produces the same output regardless of where the data lives.
```sql
-- Local
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 'data/');

-- S3
ATTACH 'ducklake:sqlite:ducklake.sqlite' AS lake (DATA_PATH 's3://bucket/data/');
```
Example
```shell
ondatrasql init
ondatrasql run
ondatrasql sql "SELECT * FROM mart.revenue"
ondatrasql sql "SELECT * FROM mart.revenue AT (VERSION => 3)"
```
Ondatra Labs