Validation
Bad data in a pipeline is worse than no data. A wrong number in a dashboard erodes trust. A null email sent to a CRM causes an API error. A sudden row count drop means something broke upstream. OndatraSQL catches these problems at three different stages, each designed for a different class of issue.
Why three stages
You might wonder why one validation step isn’t enough. The reason is timing. Some problems are visible before data is written (a null email is never valid), some are only visible after (the total row count dropped by 80%), and some you want to know about but not block on (the average price dipped below your threshold).
Each stage runs at a different point in the pipeline and has different failure behavior:
- Constraints run before insert. If a row is bad, nothing is written.
- Audits run inside the transaction, after insert. If the result is bad, everything rolls back.
- Warnings run after commit. They log but don’t block.
All three use the same macro syntax. Only the prefix changes:
```sql
-- @constraint: not_null(email)
-- @audit: row_count(>, 0)
-- @warning: mean(price, >=, 1)
```
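Taken together, a single model file can declare all three stages alongside its query. This is a sketch: the model name and columns are hypothetical, but the macros are the ones shown above.

```sql
-- models/orders.sql (hypothetical model)

-- row-level: block the run before anything is written
-- @constraint: not_null(email)
-- @constraint: compare(total, >=, 0)

-- dataset-level: roll back the whole transaction on failure
-- @audit: row_count(>, 0)

-- monitoring only: log after commit, never block
-- @warning: mean(total, >=, 1)

SELECT order_id, email, total
FROM raw_orders;
```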
Constraints: catching bad rows
Constraints validate individual rows before data enters your target table. If any row fails, the model stops and nothing is written; the target table is never touched.
```sql
-- @constraint: not_null(email)
-- @constraint: compare(total, >=, 0)
```
This is your first line of defense. It catches data quality issues at the source — before they propagate to downstream models or get pushed to external APIs via @sink. If you’re doing outbound sync, constraints are especially important: they prevent invalid data from ever reaching HubSpot, Salesforce, or wherever your data goes.
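For an outbound-sync model, that ordering matters: the constraints gate the rows before the sink ever sees them. A sketch of the idea, where the @sink line is illustrative placeholder syntax (check the sink documentation for the real form):

```sql
-- models/crm_contacts.sql (hypothetical model)

-- a row with a null email would fail the CRM API call,
-- so block it here instead
-- @constraint: not_null(email)

-- @sink: ...  (outbound destination; syntax not shown here)

SELECT contact_id, email
FROM raw_contacts;
```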
See the Constraints reference for the full macro list.
Audits: catching bad results
Some problems aren’t visible at the row level. The row count dropped to zero. The data is three days stale. The total revenue doesn’t match the source. These are dataset-level regressions that only become apparent after the data is written.
```sql
-- @audit: row_count(>, 0)
-- @audit: freshness(updated_at, 24h)
```
Here’s what makes audits powerful: they run inside the same transaction as the data write. If an audit fails, the entire transaction aborts — schema changes, data, commit metadata, everything rolls back. No DuckLake snapshot is created. Your table stays exactly as it was before the run.
This means you can’t end up with committed bad data. The audit is your last chance to catch a problem before it becomes permanent.
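Conceptually, the run is equivalent to a transaction like the following. This is pseudocode in SQL form, not the literal statements OndatraSQL generates, and the audit macro's argument list is an assumption based on the ondatra_{type}_{name} convention.

```sql
BEGIN TRANSACTION;

-- write the model's result into the target table
INSERT INTO orders SELECT * FROM staged_result;

-- run each audit macro against the freshly written table;
-- a returned row means the audit failed
-- e.g. SELECT * FROM ondatra_audit_row_count(...);

-- any audit failed  -> ROLLBACK: no snapshot, table unchanged
-- all audits passed -> COMMIT: snapshot created
COMMIT;
```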
See the Audits reference for the full macro list.
Warnings: monitoring without blocking
Sometimes you want to know about something without stopping the pipeline. The average order value dipped. The null rate in a column crept up. The row count changed more than 20%.
```sql
-- @warning: mean(price, >=, 1)
-- @warning: null_percent(email, 5)
```
Warnings run after the data is committed. They show up in your CLI output and --json metadata. They’re for monitoring — the pipeline equivalent of a Grafana alert.
A common workflow is to start with a warning, observe its behavior over a few runs, tune the threshold, and then promote it to a constraint or audit once you're confident in it.
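Promotion is just a prefix change. For example, a price check might live as a warning at first and become an audit later, assuming the same macro is available at both stages (check the Audits reference to confirm):

```sql
-- week 1: observe only; a dip is logged, the run still commits
-- @warning: mean(price, >=, 1)

-- later, once the threshold is trusted: a dip rolls back the run
-- @audit: mean(price, >=, 1)
```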
Custom macros
All validation macros are defined in config/macros/ as DuckDB macros. You can edit the defaults or add your own. The naming convention is ondatra_{type}_{name}:
```sql
-- config/macros/audits.sql
CREATE OR REPLACE MACRO ondatra_audit_balanced_ledger(t) AS TABLE
SELECT 'balances do not reconcile'
WHERE (SELECT SUM(debit) FROM query(printf('SELECT * FROM %s', t)))
   != (SELECT SUM(credit) FROM query(printf('SELECT * FROM %s', t)));
```
Then use it in any model:
```sql
-- @audit: balanced_ledger
```
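Because it's a plain DuckDB table macro, you can also test it directly in a DuckDB session before wiring it into a model (assuming a ledger table exists):

```sql
-- returns one row ('balances do not reconcile') when out of balance,
-- zero rows when the ledger is balanced
SELECT * FROM ondatra_audit_balanced_ledger('ledger');
```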
The full macro lists are in the reference: Constraints, Audits, Warnings.