Building Idempotent Pipelines: The Pattern That Saves You at 3 AM
An idempotent pipeline produces the same result whether you run it once or five times. Here is how we implement this in practice with Airflow, dbt, and Snowflake — with the exact SQL patterns we use.
At some point, every data engineer gets the 3 AM page: a pipeline failed halfway through. Some data loaded, some did not. You need to rerun it. The question is: can you?
If your pipeline is idempotent, you rerun it and go back to sleep. If it is not, you spend the next two hours figuring out which records already loaded, manually deleting partial data, and praying you did not double-count anything.
We build every pipeline to be idempotent from day one. It adds maybe 10% more effort upfront. It saves us — and our clients — hundreds of hours over the life of a project.
What Idempotent Actually Means
A pipeline is idempotent if running it multiple times with the same input produces the same output. No duplicates. No missing records. No side effects.
The non-idempotent version? Run it twice and a 100-row load becomes 200 rows. Your revenue report doubles overnight. Finance calls you before your morning coffee.
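To make the difference concrete, here is a minimal Python sketch — toy in-memory tables, not part of the actual stack — contrasting a naive append load with an idempotent keyed load:

```python
# Toy illustration: why append-only loads are not idempotent.

def load_append(table: list, rows: list) -> None:
    """Naive load: blindly appends. Running it twice duplicates every row."""
    table.extend(rows)

def load_keyed(table: dict, rows: list) -> None:
    """Idempotent load: upserts by primary key. Reruns converge to one state."""
    for row in rows:
        table[row["order_id"]] = row

batch = [{"order_id": 1, "total": 50.0}, {"order_id": 2, "total": 75.0}]

appended: list = []
load_append(appended, batch)
load_append(appended, batch)   # rerun after a "failure"
print(len(appended))           # 4 -- every row duplicated

keyed: dict = {}
load_keyed(keyed, batch)
load_keyed(keyed, batch)       # rerun is harmless
print(len(keyed))              # 2 -- same result as a single run
```

Running the keyed loader any number of times leaves the table in the same state; the append loader grows it on every retry.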
The Three Patterns We Use
Pattern 1: Delete-and-Replace (Most Common)
This is our default for batch pipelines. Before loading data for a given partition (usually a date), delete any existing data for that partition, then insert the new data. The whole thing runs in a transaction.
-- Idempotent daily load pattern
begin transaction;

delete from staging.stg_orders
where order_date = '{{ ds }}';  -- Airflow template for execution date

insert into staging.stg_orders
select
    order_id,
    customer_id,
    order_date,
    total_amount,
    current_timestamp() as _loaded_at
from raw_shopify.orders
where order_date = '{{ ds }}';

commit;
The key is the where order_date = '{{ ds }}' on both the delete and the insert. This scopes the operation to a single partition. If you rerun the pipeline for March 4th, it deletes March 4th data and reinserts it. March 3rd and March 5th are untouched.
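The partition-scoped behavior can be sketched in plain Python (an in-memory stand-in for the warehouse table, with hypothetical row dicts):

```python
# Toy partition-scoped delete-and-replace: only the target date is touched.

def load_partition(table: list, new_rows: list, ds: str) -> list:
    """Delete every row for partition `ds`, then insert the fresh rows for it."""
    kept = [r for r in table if r["order_date"] != ds]                # delete where order_date = ds
    return kept + [r for r in new_rows if r["order_date"] == ds]      # insert where order_date = ds

table = [
    {"order_id": 1, "order_date": "2026-03-03"},
    {"order_id": 2, "order_date": "2026-03-04"},  # stale row from a failed run
]
fresh = [
    {"order_id": 2, "order_date": "2026-03-04"},
    {"order_id": 3, "order_date": "2026-03-04"},
]

table = load_partition(table, fresh, "2026-03-04")
table = load_partition(table, fresh, "2026-03-04")  # rerun: identical result
print(sorted(r["order_id"] for r in table))         # [1, 2, 3]; March 3rd untouched
```

The rerun is a no-op because the delete clears exactly the rows the insert is about to write, and nothing outside the partition is ever touched.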
Pattern 2: Merge (For Records That Change)
When records update over time — customers change their email, orders get refunded — delete-and-replace by date is not enough, because old partitions can change. You need a merge pattern:
-- Idempotent merge for a customer dimension
merge into marts.dim_customers as target
using (
    select
        customer_id,
        email_masked,
        full_name,
        segment,
        region,
        updated_at
    from staging.stg_crm__customers
    where updated_at >= dateadd(day, -3, current_date())
) as source
on target.customer_id = source.customer_id
when matched and source.updated_at > target.updated_at then
    update set
        email_masked = source.email_masked,
        full_name = source.full_name,
        segment = source.segment,
        region = source.region,
        updated_at = source.updated_at,
        _updated_at = current_timestamp()
when not matched then
    insert (customer_id, email_masked, full_name, segment, region, updated_at, _updated_at)
    values (source.customer_id, source.email_masked, source.full_name, source.segment, source.region, source.updated_at, current_timestamp());
Notice the source.updated_at > target.updated_at guard on the update. This prevents older data from overwriting newer data if the pipeline processes records out of order. Small detail. Prevents a nasty class of bugs.
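The guard is easy to mimic in Python (a toy dimension keyed by customer_id; timestamps as ISO-8601 strings, which compare correctly as strings):

```python
# Toy merge with an out-of-order guard: older source rows never clobber newer target rows.

def merge_customers(dim: dict, source_rows: list) -> None:
    """Upsert by customer_id, but only when the source row is strictly newer."""
    for row in source_rows:
        current = dim.get(row["customer_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            dim[row["customer_id"]] = row

dim: dict = {}
merge_customers(dim, [{"customer_id": 7, "segment": "smb",
                       "updated_at": "2026-03-04T10:00:00"}])

# A late, *older* record arrives out of order -- the guard rejects it.
merge_customers(dim, [{"customer_id": 7, "segment": "trial",
                       "updated_at": "2026-03-01T09:00:00"}])
print(dim[7]["segment"])  # smb -- the newer row survives
```

Without the `>` comparison, the late record would silently roll the customer back to a stale state.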
Pattern 3: dbt Incremental Models
dbt's incremental materialization uses delete-and-replace or merge under the hood, depending on the incremental_strategy you choose. Here is how we configure it:
-- models/staging/stg_events__sessionized.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='delete+insert',
        on_schema_change='append_new_columns'
    )
}}

with source as (
    select * from {{ source('raw_ga4', 'events') }}
    {% if is_incremental() %}
    where event_timestamp >= (
        select max(event_timestamp) from {{ this }}
    ) - interval '3 hours'  -- Overlap window for late-arriving events
    {% endif %}
),

sessionized as (
    select
        event_id,
        user_id,
        event_name,
        event_timestamp,
        session_id
        -- ... session logic
    from source
)

select * from sessionized
That - interval '3 hours' is important. Events arrive late — a user's phone was offline, a batch was delayed, a timezone conversion was off. The overlap window re-processes the last 3 hours of data on every run, catching late arrivals without missing anything.
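The overlap logic can be sketched in Python (toy event dicts with epoch-second timestamps, keyed by event_id so re-processing the window cannot create duplicates):

```python
# Toy incremental load with an overlap window: re-read the last N seconds
# below the high-water mark, dedupe by event_id so reruns are harmless.

OVERLAP_SECONDS = 3 * 3600  # 3-hour overlap for late arrivals

def incremental_load(target: dict, source: list) -> None:
    """Load events at or after (high-water mark - overlap), keyed by event_id."""
    watermark = max((e["ts"] for e in target.values()), default=0)
    cutoff = watermark - OVERLAP_SECONDS
    for event in source:
        if event["ts"] >= cutoff:
            target[event["event_id"]] = event  # upsert: reprocessing converges

events = [{"event_id": "a", "ts": 1000}, {"event_id": "b", "ts": 5000}]
target: dict = {}
incremental_load(target, events)

# A late event with an older timestamp (still inside the window) arrives
# before the next run -- the overlap picks it up instead of skipping it.
events.append({"event_id": "c", "ts": 4000})
incremental_load(target, events)
print(sorted(target))  # ['a', 'b', 'c']
```

A strict `ts > watermark` filter would have dropped the late event; the overlap window trades a little re-computation for completeness.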
The Airflow Side: Making Retries Safe
Idempotent SQL is only half the story. The orchestrator needs to be configured so retries work correctly.
# Key Airflow settings for idempotent pipelines
from datetime import datetime, timedelta

from airflow.decorators import dag

@dag(
    schedule="0 6 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=True,        # Enable backfills
    max_active_runs=1,   # Prevent overlapping runs
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "depends_on_past": False,  # Each run is independent
    },
)
def daily_load():  # hypothetical DAG name; tasks go here
    ...
Two settings matter:
max_active_runs=1 — prevents two runs for different dates from executing simultaneously. Without this, a backfill can overlap with today's run, and both try to write to the same table at the same time.
depends_on_past=False — each execution is independent. If yesterday's run failed, today's run still executes. Combined with the partitioned delete-and-replace pattern, this means you can fix yesterday's data independently without blocking today.
How to Test Idempotency
This is the test we run on every pipeline before it goes to production:
1. Run the pipeline for date X. Record the row count and a checksum of the output.
2. Run it again for date X. Compare row count and checksum. They should be identical.
3. Simulate a partial failure. Kill the pipeline mid-run. Rerun. Verify the output matches step 1.
4. Run a backfill. Execute the pipeline for dates X-7 through X. Verify each date's output is independent and correct.
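The first two steps are easy to automate. Here is a sketch of such a harness — `run_pipeline` and `fetch_rows` are hypothetical hooks into your own stack, with toy in-memory stand-ins below so the harness is demonstrable end to end:

```python
import hashlib

def checksum(rows: list) -> str:
    """Order-independent checksum: hash each row, sort, hash the concatenation."""
    row_hashes = sorted(hashlib.md5(repr(r).encode()).hexdigest() for r in rows)
    return hashlib.md5(",".join(row_hashes).encode()).hexdigest()

def assert_idempotent(run_pipeline, fetch_rows, ds: str) -> None:
    """Run twice for the same date; row count and checksum must not change."""
    run_pipeline(ds)
    first = fetch_rows(ds)
    run_pipeline(ds)  # rerun for the same execution date
    second = fetch_rows(ds)
    assert len(first) == len(second), "row count changed on rerun"
    assert checksum(first) == checksum(second), "checksum changed on rerun"

# Toy stand-ins (hypothetical): an idempotent keyed upsert into a dict "table".
_table: dict = {}

def run_pipeline(ds):
    for row in [(1, ds, 50.0), (2, ds, 75.0)]:
        _table[(row[0], ds)] = row

def fetch_rows(ds):
    return [r for (_, d), r in _table.items() if d == ds]

assert_idempotent(run_pipeline, fetch_rows, "2026-03-04")
print("idempotency check passed")
```

A non-idempotent pipeline (say, an append-only loader) fails the second assertion immediately, which is exactly the 3 AM surprise this test is meant to catch in CI instead.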
-- Quick idempotency check: compare two runs.
-- hash_agg is Snowflake's order-independent aggregate hash; list the business
-- columns explicitly so audit columns like _loaded_at do not skew the result.
select
    count(*) as row_count,
    hash_agg(order_id, customer_id, order_date, total_amount) as checksum
from staging.stg_orders
where order_date = '2026-03-04';
-- Run this before and after a re-execution. Numbers should match.
We have caught bugs with this test on nearly every engagement. The most common: a staging model that stamps a business column with current_timestamp(), making the output different on every run. The fix is simple — derive the value from the Airflow execution date instead of the wall clock. (Audit columns like _loaded_at are the exception; just exclude them from the checksum.)
Idempotency is not a feature you bolt on later. It is a design decision you make at the start of every pipeline. Once you internalize the delete-and-replace pattern, it becomes second nature. And the first time you rerun a failed pipeline at 3 AM without having to think — you will be glad you did.