Incremental dbt Models Done Right: Patterns for Late-Arriving Data
Incremental models are dbt's most powerful feature and its biggest footgun. Here are the three strategies we use, when to pick each, and the late-data handling pattern that took us a year to get right.
Every dbt project eventually hits the point where a full table rebuild takes too long. For one of our clients — a fintech with 18 months of transaction history — the fct_transactions model took 45 minutes to build. Every morning. Blocking everything downstream.
Incremental models fix this by only processing new or changed data. The concept is straightforward. Getting the details right, especially for late-arriving records, took us a year of trial and error across multiple engagements.
The Three Incremental Strategies
dbt supports several incremental strategies on Snowflake. Each behaves differently, and picking the wrong one causes subtle bugs.
There is also append, which just inserts without checking for duplicates. We only use append for raw event logs where duplicates are acceptable or handled downstream.
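For completeness, an append model is the simplest of the lot: strategy plus a cutoff filter, nothing else. A minimal sketch, assuming a hypothetical staging model named stg_app__raw_events:

```sql
-- Hypothetical raw event log where duplicates are tolerated downstream
{{
    config(
        materialized='incremental',
        incremental_strategy='append',
    )
}}

select
    event_id,
    payload,
    _loaded_at
from {{ ref('stg_app__raw_events') }}

{% if is_incremental() %}
-- Strict > and no overlap window: append never deduplicates,
-- so any re-scanned row would become a duplicate row.
where _loaded_at > (select max(_loaded_at) from {{ this }})
{% endif %}
```

Note the deliberate absence of an overlap window here; the overlap pattern described below only works with strategies that deduplicate.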
Pattern 1: delete+insert for Event Data
This is our default for fact tables. Events do not change after they happen — an order was placed, a page was viewed, a payment was processed. New records arrive, old records stay put.
-- models/marts/finance/fct_transactions.sql
{{
    config(
        materialized='incremental',
        unique_key='transaction_id',
        incremental_strategy='delete+insert',
    )
}}

with source as (

    select
        transaction_id,
        account_id,
        transaction_type,
        amount_usd,
        transaction_at,
        _loaded_at
    from {{ ref('stg_banking__transactions') }}

    {% if is_incremental() %}
    where _loaded_at >= (
        select max(_loaded_at) from {{ this }}
    ) - interval '6 hours'
    --  ^^^^^^^^^^^^^^^^^^
    --  The overlap window. This is the important part.
    {% endif %}

)

select * from source
That - interval '6 hours' is doing critical work. Without it, you are asking: "give me everything after the last record I processed." But what about a record that arrived late — generated at 2 PM but not landed in the raw table until 8 PM? The max(_loaded_at) high-water mark has already moved past it. The record is silently lost.
The overlap window re-processes the last 6 hours of data on every run. Because we use delete+insert with a unique_key, the re-processed records replace the existing ones. No duplicates. No data loss.
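Under the hood, delete+insert compiles to roughly the following two statements. This is a simplified sketch — the real compiled SQL stages new rows in a dbt-generated temporary table, and the schema and temp-table names here are illustrative:

```sql
-- Step 1: remove existing rows that the new batch will replace
delete from analytics.fct_transactions
where transaction_id in (
    select transaction_id from fct_transactions__dbt_tmp
);

-- Step 2: insert the new batch, including the re-processed overlap rows
insert into analytics.fct_transactions
select * from fct_transactions__dbt_tmp;
```

The delete runs first, which is why re-processed rows in the overlap window replace rather than duplicate their earlier versions.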
How to Size the Overlap Window
Too small and you miss late records. Too large and you reprocess too much data, negating the performance benefit of incremental models.
We size it based on observed source latency:
-- Measure how late records actually arrive for a given source
select
    percentile_cont(0.95) within group (
        order by datediff('minute', event_timestamp, _loaded_at)
    ) as p95_latency_minutes,
    percentile_cont(0.99) within group (
        order by datediff('minute', event_timestamp, _loaded_at)
    ) as p99_latency_minutes,
    max(datediff('minute', event_timestamp, _loaded_at)) as max_latency_minutes
from raw_banking.transactions
where _loaded_at >= dateadd(day, -30, current_date());
For this client, the P99 latency was 3.2 hours. We set the overlap window to 6 hours (roughly 2x the P99) to be safe. The incremental run processes about 8% more data than strictly necessary, but a full rebuild takes 45 minutes and the incremental run takes 3. The overhead is negligible.
Pattern 2: merge for Slowly Changing Dimensions
Customer records, product catalogs, employee directories — these change over time. A customer updates their email. A product changes its price tier. For these, we use merge:
-- models/marts/core/dim_customers.sql
{{
    config(
        materialized='incremental',
        unique_key='customer_id',
        incremental_strategy='merge',
        merge_update_columns=['email_masked', 'full_name', 'segment', 'updated_at', '_dbt_updated_at'],
    )
}}

with source as (

    select
        customer_id,
        email_masked,
        full_name,
        segment,
        updated_at,
        current_timestamp() as _dbt_updated_at
    from {{ ref('stg_crm__customers') }}

    {% if is_incremental() %}
    where updated_at >= (
        select max(updated_at) from {{ this }}
    ) - interval '24 hours'
    {% endif %}

)

select * from source
The merge_update_columns list is deliberate. We only update the columns that can change. If we omitted this, dbt would update every column, including customer_id (the match key), which is unnecessary work.
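The compiled statement looks roughly like this — again a simplified sketch with illustrative names, since dbt stages the new rows in a temporary table first:

```sql
merge into analytics.dim_customers as target
using dim_customers__dbt_tmp as source
    on target.customer_id = source.customer_id
when matched then update set
    -- only the columns listed in merge_update_columns appear here
    target.email_masked    = source.email_masked,
    target.full_name       = source.full_name,
    target.segment         = source.segment,
    target.updated_at      = source.updated_at,
    target._dbt_updated_at = source._dbt_updated_at
when not matched then insert
    (customer_id, email_masked, full_name, segment, updated_at, _dbt_updated_at)
values
    (source.customer_id, source.email_masked, source.full_name,
     source.segment, source.updated_at, source._dbt_updated_at);
```

New customers fall through to the insert branch; existing ones get only the listed columns touched.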
Pattern 3: Partitioned Incremental for Large Event Tables
For very large event tables (billions of rows), even the overlap window approach scans too much data. We use Snowflake's clustering and partition pruning to make the incremental scan efficient:
-- models/marts/product/fct_usage_events.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='delete+insert',
        cluster_by=['event_date'],
    )
}}

with source as (

    select
        event_id,
        user_id,
        event_name,
        event_properties,
        event_timestamp,
        event_timestamp::date as event_date,
        _loaded_at
    from {{ ref('stg_analytics__events') }}

    {% if is_incremental() %}
    where event_date >= (
        select max(event_date) from {{ this }}
    ) - interval '3 days'
    {% endif %}

)

select * from source
The cluster key on event_date means Snowflake physically groups data by date. The incremental filter prunes to just the last 3 days of micro-partitions. On a table with 2 billion rows, this scan touches ~0.4% of the data.
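Clustering quality degrades as new data lands, so it is worth checking periodically that pruning still works. Snowflake's SYSTEM$CLUSTERING_INFORMATION function reports this; a quick check, assuming the table lives in an analytics schema:

```sql
-- Returns JSON including average_depth and overlap stats for the
-- micro-partitions; a low average_depth means the event_date filter
-- can actually prune most partitions.
select system$clustering_information(
    'analytics.fct_usage_events',
    '(event_date)'
);
```

If average_depth creeps up, Snowflake's automatic clustering (or a periodic manual recluster) brings it back down.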
The Full-Refresh Escape Hatch
Incremental models drift over time. Schema changes, backfills, or bugs in the incremental logic can leave the table in a state that does not match what a full rebuild would produce. We schedule a weekly full refresh for critical models:
# Sunday 3 AM: full refresh of all incremental models
dbt build --select config.materialized:incremental --full-refresh
This is the safety net. If the incremental logic had a bug, the weekly full refresh corrects it. If you only discover the bug on a Monday, you know the data was clean as of Sunday.
When Not to Use Incremental
Not every model should be incremental. We keep models as table (full rebuild) when:
The source is small. If the full build takes under 30 seconds, the complexity of incremental logic is not worth it. We have seen teams make 500-row lookup tables incremental. Just rebuild them.
The transformation depends on the full dataset. If the model uses window functions computed over all rows (like percentile ranks), incremental results will differ from a full rebuild. Either accept that drift or do not go incremental.
The model is early in development. Get the logic right first, then optimize. Debugging an incremental model is harder than debugging a full rebuild.
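The full-dataset pitfall is easy to see with percent_rank: each row's percentile only means something relative to every other row. A toy illustration of why such a model cannot be built incrementally:

```sql
-- Full rebuild: each amount is ranked against ALL transactions,
-- so amount_pctile is a true percentile across the whole table.
select
    transaction_id,
    percent_rank() over (order by amount_usd) as amount_pctile
from fct_transactions;

-- In an incremental run, the same expression would rank each new amount
-- against only that run's batch of rows — a different (and wrong) number.
-- There is no overlap window that fixes this; the model needs a full rebuild.
```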
The move to incremental models is an optimization, not a requirement. Start with full rebuilds. Measure build times. Convert to incremental only when the build time is actually a problem — and when you have the testing infrastructure to catch the edge cases that incremental logic introduces.