pg_stream

pg_stream is a PostgreSQL 18 extension that turns ordinary SQL views into self-maintaining stream tables — no external processes, no sidecars, no bespoke refresh pipelines. Just CREATE EXTENSION pg_stream and your views stay fresh.

-- Declare a stream table — a view that maintains itself
SELECT pgstream.create_stream_table(
    'active_orders',
    'SELECT * FROM orders WHERE status = ''active''',
    '30s'
);

-- Insert a row — the stream table updates automatically on the next refresh
INSERT INTO orders (id, status) VALUES (42, 'active');
SELECT count(*) FROM active_orders;  -- 1

The problem with materialized views

PostgreSQL's materialized views are powerful but frustrating. REFRESH MATERIALIZED VIEW re-runs the entire query from scratch, even if only one row changed in a million-row table. Your choices are: burn CPU on full recomputation, or accept stale data. Most teams end up building bespoke refresh pipelines just to keep summary tables current.

What pg_stream does differently

pg_stream captures changes to your source tables and — on each refresh cycle — derives a delta query that processes only the changed rows and merges the result into the materialized table. One insert into a million-row source table? pg_stream touches exactly one row's worth of computation.

The approach is grounded in the DBSP differential dataflow framework (Budiu et al., 2022). Delta queries are derived automatically from your SQL's operator tree: joins produce the classic bilinear expansion, aggregates maintain auxiliary counters, and linear operators like filters pass deltas through unchanged.
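The bilinear join rule can be sketched with Z-sets (rows carrying signed multiplicities), the representation DBSP works in. A toy Python model, not pg_stream's Rust internals:

```python
from collections import Counter

# Z-sets: rows with signed multiplicities (a delete carries count -1)
def z_add(a, b):
    out = Counter(a)
    for row, n in b.items():
        out[row] += n
    return Counter({r: n for r, n in out.items() if n != 0})

def z_join(a, b):
    """Join on the first field of each row; multiplicities multiply."""
    out = Counter()
    for (ka, va), na in a.items():
        for (kb, vb), nb in b.items():
            if ka == kb:
                out[(ka, va, vb)] += na * nb
    return Counter({r: n for r, n in out.items() if n != 0})

def delta_join(a, da, b, db):
    """Bilinear expansion: d(A |><| B) = dA |><| B + A |><| dB + dA |><| dB."""
    return z_add(z_add(z_join(da, b), z_join(a, db)), z_join(da, db))

A,  B  = Counter({(1, 'x'): 1, (2, 'y'): 1}), Counter({(1, 'p'): 1})
dA, dB = Counter({(1, 'z'): 1}), Counter({(2, 'q'): 1, (1, 'p'): -1})

# Applying the delta to the old result matches recomputing from scratch,
# but only changed rows participate in the delta terms
incr = z_add(z_join(A, B), delta_join(A, dA, B, dB))
assert incr == z_join(z_add(A, dA), z_add(B, dB))
```

The same cancellation behavior (an insert and a delete of the same row net to zero) is what lets a delta query stay proportional to the change set.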

Key capabilities

 Feature                  | Description
--------------------------+--------------------------------------------------------------------------------------------------------------------------
 Incremental refresh      | Only changed rows are recomputed — never a full table scan
 Cascading DAG            | Stream tables that depend on stream tables propagate deltas downstream automatically
 Demand-driven scheduling | Set a freshness interval on the views your app queries; upstream layers inherit the tightest schedule automatically
 Hybrid CDC               | Starts with lightweight row-level triggers; seamlessly transitions to WAL-based logical replication once available
 Broad SQL support        | JOINs, GROUP BY, DISTINCT, UNION/INTERSECT/EXCEPT, subqueries, CTEs (including WITH RECURSIVE), window functions, LATERAL, and more
 Built-in observability   | Monitoring views, refresh history, NOTIFY-based alerting
 CloudNativePG-ready      | Ships as an Image Volume extension image for Kubernetes deployments

Demand-driven scheduling

With the default CALCULATED schedule mode, you only set an explicit refresh interval on the stream tables your application actually queries. The system propagates that cadence upward through the dependency graph: each upstream stream table inherits the tightest schedule among its downstream dependents. You declare freshness requirements where they matter — at the consumer — and the entire pipeline adjusts without manual coordination.
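The inheritance rule itself is simple. A minimal Python sketch, with hypothetical table names and a hard-coded dependency map standing in for the catalog:

```python
# Hypothetical dependency map: stream table -> the tables it reads from
deps = {
    'hourly_rollup':  [],
    'daily_rollup':   ['hourly_rollup'],
    'dashboard_view': ['daily_rollup'],
}
explicit = {'dashboard_view': 30}   # seconds; the rest are CALCULATED

def effective_schedule(table):
    """CALCULATED tables inherit the tightest interval among all stream
    tables that read from them, directly or transitively."""
    if table in explicit:
        return explicit[table]
    readers = [t for t, ups in deps.items() if table in ups]
    inherited = [s for t in readers if (s := effective_schedule(t)) is not None]
    return min(inherited) if inherited else None   # no consumer: on demand

assert effective_schedule('hourly_rollup') == 30   # inherited from the leaf
```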

Hybrid change capture

pg_stream bootstraps with lightweight row-level triggers — no configuration needed, works out of the box. Once the first refresh succeeds and wal_level = logical is available, the system automatically transitions to WAL-based logical replication for lower write-side overhead. The transition is seamless: trigger → transitioning → WAL-only. If anything goes wrong, it falls back to triggers.


Source & releases

Written in Rust using pgrx. Targets PostgreSQL 18. Apache 2.0 licensed.

Getting Started with pg_stream

What is pg_stream?

pg_stream adds stream tables to PostgreSQL — tables that are defined by a SQL query and kept automatically up to date as the underlying data changes. Think of them as materialized views that refresh themselves, but smarter: instead of re-running the entire query on every refresh, pg_stream uses Incremental View Maintenance (IVM) to process only the rows that changed.

Traditional materialized views force a choice: either re-run the full query (expensive) or accept stale data. pg_stream eliminates this trade-off. When you insert a single row into a million-row table, pg_stream computes the effect of that one row on the query result — it doesn't touch the other 999,999.

How data flows

The key concept is that data flows downstream automatically — from your base tables through any chain of stream tables, without you writing a single line of orchestration code:

  You write to base tables
         │
         ▼
  ┌─────────────┐   triggers (or WAL)    ┌──────────────────────┐
  │ Base Tables │ ─────────────────────▶ │    Change Buffers    │
  │ (you write) │                        │ (pgstream_changes.*) │
  └─────────────┘                        └──────────┬───────────┘
                                                    │
                                        delta query (ΔQ) on refresh
                                                    │
                                                    ▼
  ┌──────────────────────────────────────────────────────────────┐
  │  Stream Table A  ◀── depends on base tables                  │
  └──────────────────────────┬───────────────────────────────────┘
                             │  change captured, buffer written
                             ▼
  ┌──────────────────────────────────────────────────────────────┐
  │  Stream Table B  ◀── depends on Stream Table A               │
  └──────────────────────────────────────────────────────────────┘

One write to a base table can ripple through an entire DAG of stream tables — each layer refreshed in the correct topological order, each doing only the work proportional to what actually changed.

  1. You write to your base tables normally — INSERT, UPDATE, DELETE
  2. Lightweight AFTER row-level triggers capture each change into a buffer, atomically in the same transaction. No polling, no logical replication slots required by default.
  3. On each refresh cycle, pg_stream derives a delta query (ΔQ) that reads only the buffered changes since the last refresh frontier
  4. The delta is merged into the stream table — only the affected rows are written
  5. If other stream tables depend on this one, they are scheduled next (topological order)
  6. Optionally: once wal_level = logical is available and the first refresh succeeds, pg_stream automatically transitions from triggers to WAL-based CDC (~2–15 μs write overhead vs triggers). The transition is seamless and transparent.
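The capture-buffer-refresh loop in steps 1–4 can be replayed in miniature. A hedged Python sketch of a single-table filter view; the names and data structures are illustrative, not pg_stream internals:

```python
orders = {}     # base table: id -> status
buffer = []     # change buffer rows: (action, id, old_status, new_status)
stream = set()  # "stream table" for: SELECT id FROM orders WHERE status = 'active'

def write(action, oid, status=None):
    """Steps 1-2: apply the write and capture it in the same 'transaction'."""
    old = orders.get(oid)
    if action == 'D':
        del orders[oid]
    else:
        orders[oid] = status
    buffer.append((action, oid, old, status))

def refresh():
    """Steps 3-4: replay only buffered changes against the stream table."""
    while buffer:
        action, oid, old, new = buffer.pop(0)
        if old == 'active':                    # old value retracts the row
            stream.discard(oid)
        if action != 'D' and new == 'active':  # new value may add it back
            stream.add(oid)

write('I', 42, 'active')
write('I', 43, 'pending')
refresh()
assert stream == {42}
write('U', 42, 'shipped')   # the stored old value lets the delta remove it
refresh()
assert stream == set()
```

Note how the refresh never scans `orders`: it reads only the buffer, which is the essence of the delta query.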

This tutorial walks through a concrete org-chart example so you can see this flow end to end, including a chain of stream tables that propagates changes automatically.


What you'll build

An employee org-chart system with three stream tables:

  • department_tree — a recursive CTE that flattens a department hierarchy into paths like Company > Engineering > Backend
  • department_stats — a join + aggregation over department_tree (a stream table!) that computes headcount and salary budget, with the full path included
  • department_report — a further aggregation that rolls up stats to top-level departments

The chain departments → department_tree → department_stats → department_report demonstrates automatic downstream propagation: modify a department name in the base table and all three stream tables update automatically, in the right order, without any manual orchestration.

By the end you will have:

  • Seen how stream tables are created, queried, and refreshed
  • Watched a single UPDATE in a base table cascade through three layers of stream tables automatically
  • Understood the two IVM strategies and when each applies

Prerequisites

  • PostgreSQL 18.x with pg_stream installed (see INSTALL.md)
  • shared_preload_libraries = 'pg_stream' in postgresql.conf
  • psql or any SQL client

Step 1: Create the Base Tables

These are ordinary PostgreSQL tables — pg_stream doesn't require any special column types, annotations, or schema conventions. The only requirement is that tables have a primary key (pg_stream uses it internally to track which rows changed).

-- Department hierarchy (self-referencing tree)
CREATE TABLE departments (
    id         SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    parent_id  INT REFERENCES departments(id)
);

-- Employees belong to a department
CREATE TABLE employees (
    id            SERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    department_id INT NOT NULL REFERENCES departments(id),
    salary        NUMERIC(10,2) NOT NULL
);

Now insert some data — a three-level department tree and a handful of employees:

-- Top-level
INSERT INTO departments (id, name, parent_id) VALUES
    (1, 'Company',     NULL);

-- Second level
INSERT INTO departments (id, name, parent_id) VALUES
    (2, 'Engineering', 1),
    (3, 'Sales',       1),
    (4, 'Operations',  1);

-- Third level (under Engineering)
INSERT INTO departments (id, name, parent_id) VALUES
    (5, 'Backend',     2),
    (6, 'Frontend',    2),
    (7, 'Platform',    2);

-- Employees
INSERT INTO employees (name, department_id, salary) VALUES
    ('Alice',   5, 120000),   -- Backend
    ('Bob',     5, 115000),   -- Backend
    ('Charlie', 6, 110000),   -- Frontend
    ('Diana',   7, 130000),   -- Platform
    ('Eve',     3, 95000),    -- Sales
    ('Frank',   3, 90000),    -- Sales
    ('Grace',   4, 100000);   -- Operations

At this point these are plain tables with no triggers, no change tracking, nothing special. The department tree looks like this:

Company (1)
├── Engineering (2)
│   ├── Backend (5)     — Alice, Bob
│   ├── Frontend (6)    — Charlie
│   └── Platform (7)    — Diana
├── Sales (3)           — Eve, Frank
└── Operations (4)      — Grace

Step 2: Create the First Stream Table — Recursive Hierarchy

Our first stream table flattens the department tree. For every department, it computes the full path from the root and the depth level. This uses WITH RECURSIVE — a SQL construct that can't be differentiated with simple algebraic rules (the recursion depends on itself), but pg_stream handles it using incremental strategies (semi-naive evaluation for inserts, Delete-and-Rederive for mixed changes) that we'll explain later.

SELECT pgstream.create_stream_table(
    'department_tree',
    $$
    WITH RECURSIVE tree AS (
        -- Base case: root departments (no parent)
        SELECT id, name, parent_id, name AS path, 0 AS depth
        FROM departments
        WHERE parent_id IS NULL

        UNION ALL

        -- Recursive step: children join back to the tree
        SELECT d.id, d.name, d.parent_id,
               tree.path || ' > ' || d.name AS path,
               tree.depth + 1
        FROM departments d
        JOIN tree ON d.parent_id = tree.id
    )
    SELECT id, name, parent_id, path, depth FROM tree
    $$,
    '30s',
    'DIFFERENTIAL'
);

What just happened?

That single function call did a lot of work atomically (all in one transaction):

  1. Parsed the defining query into an operator tree — identifying the recursive CTE, the scan on departments, the join, the union
  2. Created a storage table called department_tree in the public schema — a real PostgreSQL heap table with columns matching the SELECT output, plus an internal __pgs_row_id column (a row hash used to track individual rows)
  3. Installed CDC triggers on the departments table — lightweight AFTER INSERT OR UPDATE OR DELETE row-level triggers that will capture every future change
  4. Created a change buffer table in the pgstream_changes schema — this is where the triggers write captured changes
  5. Ran an initial full refresh — executed the recursive query against the current data and populated the storage table
  6. Registered the stream table in pg_stream's catalog with a 30-second refresh schedule

Query it immediately — it's already populated:

SELECT * FROM department_tree ORDER BY path;

Expected output:

 id |    name     | parent_id |               path               | depth
----+-------------+-----------+----------------------------------+-------
  1 | Company     |           | Company                          |     0
  2 | Engineering |         1 | Company > Engineering            |     1
  5 | Backend     |         2 | Company > Engineering > Backend  |     2
  6 | Frontend    |         2 | Company > Engineering > Frontend |     2
  7 | Platform    |         2 | Company > Engineering > Platform |     2
  4 | Operations  |         1 | Company > Operations             |     1
  3 | Sales       |         1 | Company > Sales                  |     1

This is a real PostgreSQL table — you can create indexes on it, join it in other queries, reference it in views, or even use it as a source for other stream tables. pg_stream keeps it in sync automatically.

Key insight: The recursive query that computes paths and depths would normally need to be re-run manually (or via REFRESH MATERIALIZED VIEW). With pg_stream, it stays fresh — any change to the departments table is automatically reflected within the schedule bound (30 seconds here).


Step 3: Chain Stream Tables — Build the Downstream Layers

Now create department_stats. The twist: instead of joining directly against departments, it joins against department_tree — the stream table we just created. This creates a chain: changes to departments update department_tree, whose changes then trigger department_stats to update.

This demonstrates how pg_stream builds a DAG — a directed acyclic graph of stream tables — and automatically schedules refreshes in topological order.

SELECT pgstream.create_stream_table(
    'department_stats',
    $$
    SELECT
        t.id          AS department_id,
        t.name        AS department_name,
        t.path        AS full_path,
        t.depth,
        COUNT(e.id)                    AS headcount,
        COALESCE(SUM(e.salary), 0)     AS total_salary,
        COALESCE(AVG(e.salary), 0)     AS avg_salary
    FROM department_tree t
    LEFT JOIN employees e ON e.department_id = t.id
    GROUP BY t.id, t.name, t.path, t.depth
    $$,
    'CALCULATED',       -- inherit schedule from downstream; see explanation below
    'DIFFERENTIAL'
);

What just happened — and why this one is different?

Like before, pg_stream parsed the query, created a storage table, and set up CDC. But department_stats depends on department_tree, not a base table — so no new triggers were installed. Instead, pg_stream registered department_tree as an upstream dependency in the DAG.

The schedule is 'CALCULATED', which means: "don't give this table its own interval — inherit the tightest schedule of any downstream stream table that queries it". Since no downstream stream table exists yet, pg_stream will refresh department_stats only on demand for now.

The query has no recursive CTE, so pg_stream uses algebraic differentiation:

  1. Decomposed into operators: Scan(department_tree) and Scan(employees) → LEFT JOIN → Aggregate (GROUP BY + COUNT/SUM/AVG) → Project
  2. Derived a differentiation rule for each:
    • Δ(Scan) = read only change buffer rows (not the full table)
    • Δ(LEFT JOIN) = join change rows from one side against the full other side
    • Δ(Aggregate) = for COUNT/SUM/AVG, add or subtract per group — no rescan needed
  3. Composed these into a single delta query (ΔQ) that never touches unchanged rows

When one employee is inserted, the refresh reads one change buffer row, joins to find the department, and adjusts only that group's count and sum.
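That per-group adjustment can be sketched as follows; the counter layout is illustrative, not pg_stream's actual storage:

```python
# Per-group state for COUNT and SUM; AVG = sum / count, so no rescan needed.
groups = {}   # department_id -> [headcount, salary_sum]

def apply_change(action, dept, salary):
    """Fold one change-buffer row into the aggregate. An UPDATE arrives
    as a delete/insert pair carrying the old and new values."""
    cnt, total = groups.get(dept, (0, 0))
    sign = -1 if action == 'D' else 1
    cnt, total = cnt + sign, total + sign * salary
    if cnt == 0:
        groups.pop(dept, None)    # group vanished from the result
    else:
        groups[dept] = [cnt, total]

for row in [('I', 5, 120000), ('I', 5, 115000), ('I', 6, 110000)]:
    apply_change(*row)
apply_change('D', 5, 115000)      # one delete adjusts exactly one group
assert groups[5] == [1, 120000]
assert groups[5][1] / groups[5][0] == 120000.0   # AVG derived, not rescanned
```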

Query it:

SELECT department_name, full_path, headcount, total_salary
FROM department_stats
ORDER BY full_path;

Expected output:

 department_name |                 full_path                  | headcount | total_salary
-----------------+--------------------------------------------+-----------+--------------
 Backend         | Company > Engineering > Backend            |         2 |    235000.00
 Frontend        | Company > Engineering > Frontend           |         1 |    110000.00
 Platform        | Company > Engineering > Platform           |         1 |    130000.00
 Engineering     | Company > Engineering                      |         0 |         0.00
 Operations      | Company > Operations                       |         1 |    100000.00
 Sales           | Company > Sales                            |         2 |    185000.00
 Company         | Company                                    |         0 |         0.00

Notice that the full_path column comes from department_tree — this data already went through one layer of incremental maintenance before landing here.

Add a third layer: department_report

Now add a rollup that aggregates department_stats into top-level divisions (the second path segment), covering every department at depth >= 1:

SELECT pgstream.create_stream_table(
    'department_report',
    $$
    SELECT
        split_part(full_path, ' > ', 2) AS division,
        SUM(headcount)                  AS total_headcount,
        SUM(total_salary)               AS total_payroll
    FROM department_stats
    WHERE depth >= 1
    GROUP BY 1
    $$,
    '30s',            -- explicit schedule; the CALCULATED table above inherits it
    'DIFFERENTIAL'
);

The DAG is now:

departments (base)  employees (base)
      │                   │
      ▼                   │
department_tree ──────────┤
   (DIFF, 30s)            │
      │                   ▼
      └──────▶ department_stats
                 (DIFF, CALCULATED)
                      │
                      ▼
               department_report
                  (DIFF, 30s)  ◀── drives the schedule of department_stats

department_report drives the rest of the pipeline. Because it has a 30-second schedule, pg_stream propagates that cadence upstream: department_stats (CALCULATED) inherits it automatically, and refreshes run in topological order (department_tree, then department_stats, then department_report), so a base table change reaches the report within the 30-second bound with no manual coordination.

Query the report:

SELECT * FROM department_report ORDER BY division;
  division   | total_headcount | total_payroll
-------------+-----------------+---------------
 Engineering |               4 |    475000.00
 Operations  |               1 |    100000.00
 Sales       |               2 |    185000.00

Step 4: Watch a Change Cascade Through All Three Layers

This is the heart of pg_stream. We'll make four changes to the base tables and watch changes propagate automatically through the three-layer DAG — each layer doing only the minimum work.

The data flow pipeline (three layers)

  Your SQL statement
       │
       ▼
  CDC trigger fires (same transaction)
  Change buffer receives one row
       │
       ▼
  Background scheduler fires (or manual refresh_stream_table)
       │
       ├──▶ [Layer 1] Refresh department_tree
       │         delta query reads change buffer
       │         MERGE touches only affected rows in department_tree
       │         department_tree's own change buffer is updated
       │
       ├──▶ [Layer 2] Refresh department_stats
       │         delta query reads department_tree's change buffer
       │         MERGE touches only affected department groups
       │
       └──▶ [Layer 3] Refresh department_report
                 delta query reads department_stats' change buffer
                 MERGE touches only affected division rows
                 All change buffers cleaned up ✓

All three layers run in a single scheduled pass, in topological order.

4a: A single INSERT ripples through all three layers

INSERT INTO employees (name, department_id, salary) VALUES
    ('Heidi', 6, 105000);  -- New Frontend engineer

What happened immediately (in your transaction): The AFTER INSERT trigger on employees fired and wrote one row to pgstream_changes.changes_<employees_oid>. The row contains the new values, action type I, and the LSN at the time of insert. Your transaction committed normally — no blocking.

The stream tables don't know about Heidi yet. The change is in the buffer, waiting for the next refresh.

Refresh the whole pipeline in one call (or wait for the 30-second schedule):

-- refresh_stream_table cascades to dependent tables automatically
SELECT pgstream.refresh_stream_table('department_report');

What happened across the three layers:

 Layer             | What ran                                                                          | Rows touched
-------------------+-----------------------------------------------------------------------------------+--------------------------
 department_tree   | No change — employees is not a source for this stream table                       | 0
 department_stats  | Delta query: read 1 buffer row, join to Frontend, COUNT+1, SUM+105000             | 1 (Frontend group only)
 department_report | Delta query: read 1 change from dept_stats, headcount SUM +1, payroll SUM +105000 | 1 (Engineering row only)

Check the result:

SELECT department_name, headcount, total_salary FROM department_stats
WHERE department_name = 'Frontend';
 department_name | headcount | total_salary
-----------------+-----------+--------------
 Frontend        |         2 |    215000.00

The 6 other groups in department_stats were not touched at all.

Contrast with a standard materialized view: REFRESH MATERIALIZED VIEW would re-scan all 8 employees, re-join with all 7 departments, re-aggregate, and update all 7 rows. With pg_stream, the work was proportional to the 1 changed row — across all three layers.

4b: A department change cascades through the whole DAG

Now change the departments table — the root of the entire chain:

INSERT INTO departments (id, name, parent_id) VALUES
    (8, 'DevOps', 2);  -- New team under Engineering

What happened: The CDC trigger on departments fired. The change buffer for departments has one new row. None of the stream tables know about it yet.

Refresh the whole pipeline:

SELECT pgstream.refresh_stream_table('department_report');

What happened across all three layers:

 Layer             | What ran                                                                                              | Rows touched
-------------------+-------------------------------------------------------------------------------------------------------+--------------------------
 department_tree   | Semi-naive evaluation: base case finds the new dept, recursive term computes its path. Result: 1 new row | 1 inserted
 department_stats  | Delta query reads the new row from dept_tree's change buffer; DevOps has 0 employees so the delta is minimal | 1 inserted (headcount=0)
 department_report | Delta on the Engineering row: headcount stays the same (DevOps has 0 employees)                       | 0 effective changes

How the recursive CTE refresh works — unlike department_stats, recursive CTEs can't be algebraically differentiated (the recursion references itself). pg_stream uses incremental fixpoint strategies:

  • INSERT → semi-naive evaluation: differentiate the base case, propagate the delta through the recursive term, stopping when no new rows are produced. Only new rows inserted.
  • DELETE or UPDATE → Delete-and-Rederive (DRed): remove rows derived from deleted facts, re-derive rows that may have alternative derivation paths, handle cascades cleanly.
SELECT id, name, depth, path FROM department_tree WHERE name = 'DevOps';
 id |  name  | depth |              path
----+--------+-------+--------------------------------
  8 | DevOps |     2 | Company > Engineering > DevOps

The recursive CTE automatically expanded to include the new department at the correct depth and path. One inserted row in departments produced one new row in the stream table.

4c: UPDATE — A single rename that cascades everywhere

Rename "Engineering" to "R&D":

UPDATE departments SET name = 'R&D' WHERE id = 2;

What happened in the change buffer: The CDC trigger captured the old row (name='Engineering') and the new row (name='R&D'). Both old and new values are stored so the delta can compute what to remove and what to add.

Refresh:

SELECT pgstream.refresh_stream_table('department_report');

What happened:

 Layer             | Work done                                                                                                                     | Result
-------------------+-------------------------------------------------------------------------------------------------------------------------------+---------------------------------
 department_tree   | DRed strategy: delete rows derived with the old name, re-derive with the new name. 5 rows updated (Engineering + 4 sub-teams) | Paths now say Company > R&D > …
 department_stats  | Delta reads 5 changed rows from dept_tree's buffer; updates the full_path column for those 5 departments                      | 5 rows updated
 department_report | Division name changed: "Engineering" row replaced by "R&D" row                                                                | 1 DELETE + 1 INSERT

Query to verify the cascade:

SELECT name, path FROM department_tree WHERE path LIKE '%R&D%' ORDER BY depth;
   name   |           path
----------+---------------------------
 R&D      | Company > R&D
 Backend  | Company > R&D > Backend
 DevOps   | Company > R&D > DevOps
 Frontend | Company > R&D > Frontend
 Platform | Company > R&D > Platform

One UPDATE to a department name flowed through all three layers automatically — updating 5 + 5 + 2 rows across the chain.

4d: DELETE — Remove an employee

DELETE FROM employees WHERE name = 'Bob';

What happened: The AFTER DELETE trigger on employees fired, writing a change buffer row with action type D and Bob's old values (department_id=5, salary=115000). The delta query will use these old values to compute the correct aggregate adjustment — it knows to subtract 115000 from Backend's salary sum and decrement the count.

SELECT pgstream.refresh_stream_table('department_stats');
SELECT * FROM department_stats WHERE department_name = 'Backend';
 department_id | department_name | headcount | total_salary | avg_salary
---------------+-----------------+-----------+--------------+------------
             5 | Backend         |         1 |    120000.00 |  120000.00

Headcount dropped from 2 → 1 and the salary aggregates updated. Again, only the Backend group was touched — the other 7 department rows (including the DevOps team added in 4b) were untouched.


Step 5: Automatic Scheduling — Let the DAG Drive Itself

In the examples above, we called refresh_stream_table() manually. In production you never need to do this. pg_stream runs a background scheduler that automatically refreshes stale tables — in topological order.

How schedules propagate

We gave department_report and department_tree explicit '30s' schedules and department_stats a 'CALCULATED' schedule. In the recommended pattern, only the tables your application queries get an explicit schedule:

 department_tree    (30s, explicit; set in Step 2)
       │
 department_stats   (CALCULATED → inherits 30s from downstream)
       │
 department_report  (30s, explicit)

CALCULATED means: compute the tightest schedule across all downstream dependents. You declare freshness requirements at the tables your application queries — the system figures out how often each upstream table needs to refresh.

What the scheduler does every second

  1. Queries the catalog for stream tables past their freshness bound
  2. Sorts them topologically (upstream first) — department_tree refreshes before department_stats, which refreshes before department_report
  3. Runs each refresh (respecting pg_stream.max_concurrent_refreshes)
  4. Updates the last-refresh frontier

Monitoring

-- Current status of all stream tables
SELECT table_name, schedule, last_refresh_at, stale, refresh_mode
FROM pgstream.pgs_status();
    table_name      | schedule  | last_refresh_at         | stale | refresh_mode
--------------------+-----------+-------------------------+-------+--------------
 department_tree    | 30s       | 2026-02-26 10:30:00.123 | f     | DIFFERENTIAL
 department_stats   | 30s       | 2026-02-26 10:30:00.456 | f     | DIFFERENTIAL
 department_report  | 30s       | 2026-02-26 10:30:00.789 | f     | DIFFERENTIAL
-- Detailed performance stats
SELECT table_name, refresh_count, avg_refresh_ms, rows_affected_last
FROM pgstream.pg_stat_stream_tables;

Optional: WAL-based CDC

By default pg_stream uses triggers. If wal_level = logical is configured, set:

ALTER SYSTEM SET pg_stream.cdc_mode = 'auto';
SELECT pg_reload_conf();

pg_stream will automatically transition each stream table from trigger-based to WAL-based capture after the first successful refresh — reducing per-write overhead from ~50–200 μs to ~2–15 μs. The transition is transparent; your queries and the refresh schedule are unaffected.


Step 6: Understanding the Two IVM Strategies

You've now seen both strategies pg_stream uses for incremental view maintenance. Understanding when each applies helps you write efficient stream table queries.

Algebraic Differentiation (used by department_stats)

For queries composed of scans, filters, joins, and algebraic aggregates (COUNT, SUM, AVG), pg_stream can derive the IVM delta mathematically. The rules come from the theory of DBSP (Database Stream Processing):

 Operator                  | Delta Rule                                                 | Cost
---------------------------+------------------------------------------------------------+---------------------
 Scan                      | Read only change buffer rows (not the full table)          | O(changes)
 Filter (WHERE)            | Apply predicate to change rows                             | O(changes)
 Join                      | Join change rows from one side against the full other side | O(changes × lookup)
 Aggregate (COUNT/SUM/AVG) | Add or subtract deltas per group — no rescan               | O(affected groups)
 Project                   | Pass through                                               | O(changes)

The total cost is proportional to the number of changes, not the table size. For a million-row table with 10 changes, the delta query touches ~10 rows.

Incremental Strategies for Recursive CTEs (used by department_tree)

For recursive CTEs, pg_stream can't derive an algebraic delta because the recursion references itself. Instead it uses two complementary strategies, chosen automatically based on what changed:

Semi-naive evaluation (for INSERT-only changes):

  1. Differentiate the base case — find the new seed rows
  2. Propagate the delta through the recursive term, iterating until no new rows are produced
  3. The result is only the new rows created by the change — not the whole tree

Delete-and-Rederive (DRed) (for DELETE or UPDATE):

  1. Remove all rows derived from the old fact
  2. Re-derive rows that had the old fact as one of their derivation paths (they may still be reachable via other paths)
  3. Insert the newly derived rows under the new fact

Both strategies are more efficient than full recomputation — they work on the affected portion of the result set, not the entire recursive query. The MERGE only modifies rows that actually changed.
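Both strategies can be sketched on a toy reachability problem standing in for the recursive path query (illustrative only, not pg_stream's fixpoint engine):

```python
def reachable(edges, roots):
    """Naive fixpoint: every node reachable from the given roots."""
    out, frontier = set(roots), set(roots)
    while frontier:
        frontier = {c for (p, c) in edges if p in frontier} - out
        out |= frontier
    return out

edges = {(1, 2), (2, 5), (2, 6)}
full = reachable(edges, {1})                  # the maintained result

# INSERT -> semi-naive: seed the iteration with only the new edge's target
edges.add((2, 7))
if 2 in full:                                 # parent already derived?
    full |= reachable(edges, {7})
assert full == reachable(edges, {1})          # matches full recomputation

# DELETE+INSERT -> DRed: over-delete rows that used the removed edge, then
# re-derive the ones still supported by an alternative derivation path
edges.discard((2, 5))
edges.add((1, 5))                             # 5 stays reachable another way
suspect = reachable(edges, {5})               # rows possibly losing support
survivors = suspect & reachable(edges, {1})
full = (full - suspect) | survivors
assert full == reachable(edges, {1})
```

The `survivors` step is the "re-derive" half of DRed: node 5 lost one derivation path but kept another, so it stays in the result without a full recomputation.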

When to use which?

You don't choose — pg_stream detects the strategy automatically based on the query structure:

 Query Pattern                                              | Strategy              | Performance
------------------------------------------------------------+-----------------------+-----------------------------------------------------------------------------
 Scan + Filter + Join + algebraic Aggregate (COUNT/SUM/AVG) | Algebraic             | Excellent — O(changes)
 Non-recursive CTEs                                         | Algebraic (inlined)   | CTE body is differentiated inline
 MIN / MAX aggregates                                       | Semi-algebraic        | Uses LEAST/GREATEST merge; per-group rescan only when an extremum is deleted
 STRING_AGG, ARRAY_AGG, ordered-set aggregates              | Group-rescan          | Affected groups fully re-aggregated from source
 GROUPING SETS / CUBE / ROLLUP                              | Algebraic (rewritten) | Auto-expanded to UNION ALL of GROUP BY queries; CUBE capped at 64 branches
 Recursive CTEs (WITH RECURSIVE), INSERT                    | Semi-naive evaluation | O(new rows derived from the change)
 Recursive CTEs (WITH RECURSIVE), DELETE/UPDATE             | Delete-and-Rederive   | Re-derives rows with alternative paths; O(affected subgraph)
 Window functions                                           | Partition recompute   | Only affected partitions recomputed

Step 7: Clean Up

When you're done experimenting, drop the stream tables. Drop dependents before their sources:

SELECT pgstream.drop_stream_table('department_report');
SELECT pgstream.drop_stream_table('department_stats');
SELECT pgstream.drop_stream_table('department_tree');

DROP TABLE employees;
DROP TABLE departments;

drop_stream_table atomically removes in a single transaction:

  • The storage table (e.g., public.department_stats)
  • CDC triggers on source tables (removed only if no other stream table references the same source)
  • Change buffer tables in pgstream_changes
  • Catalog entries in pgstream.pgs_stream_tables

Summary: What You Learned

 Concept                | What you saw
------------------------+---------------------------------------------------------------------------------------------------------------------------
 Stream tables          | Tables defined by a SQL query that stay automatically up to date
 CDC triggers           | Lightweight change capture in the same transaction — no logical replication or polling required
 DAG scheduling         | Stream tables can depend on other stream tables; refreshes run in topological order, schedules propagate upstream via CALCULATED mode
 Algebraic IVM          | Delta queries that process only changed rows — O(changes) regardless of table size
 Semi-naive / DRed      | Incremental strategies for WITH RECURSIVE — INSERT uses semi-naive, DELETE/UPDATE uses Delete-and-Rederive
 Downstream propagation | A single base table write cascades through an entire chain of stream tables, automatically, in the right order
 Hybrid CDC             | Triggers by default; optional automatic transition to WAL-based capture for lower write-side overhead
 Monitoring             | pgs_status() and pg_stat_stream_tables for freshness, timing, and error history

The key takeaway: you write to base tables — pg_stream does the rest. Data flows downstream automatically, each layer doing the minimum work proportional to what changed, in dependency order.


What's Next?

SQL Reference

Complete reference for all SQL functions, views, and catalog tables provided by pgstream.


Functions

pgstream.create_stream_table

Create a new stream table.

pgstream.create_stream_table(
    name          text,
    query         text,
    schedule      text   DEFAULT '1m',
    refresh_mode  text   DEFAULT 'DIFFERENTIAL',
    initialize    bool   DEFAULT true
) → void

Parameters:

Parameter | Type | Default | Description
name | text | — | Name of the stream table. May be schema-qualified (myschema.my_st); defaults to the public schema.
query | text | — | The defining SQL query. Must be a valid SELECT statement using supported operators.
schedule | text | '1m' | Refresh schedule as a Prometheus/GNU-style duration string (e.g., '30s', '5m', '1h', '1h30m', '1d') or a cron expression (e.g., '*/5 * * * *', '@hourly'). Set to NULL for CALCULATED mode (inherits schedule from downstream dependents).
refresh_mode | text | 'DIFFERENTIAL' | 'FULL' (truncate and reload) or 'DIFFERENTIAL' (apply delta only).
initialize | bool | true | If true, populates the table immediately via a full refresh. If false, creates the table empty.

Duration format:

Unit | Suffix | Example
Seconds | s | '30s'
Minutes | m | '5m'
Hours | h | '2h'
Days | d | '1d'
Weeks | w | '1w'
Compound | — | '1h30m', '2m30s'

Cron expression format:

schedule also accepts standard cron expressions for time-based scheduling. The scheduler refreshes the stream table when the cron schedule fires, rather than checking staleness.

Format | Fields | Example | Description
5-field | min hour dom mon dow | '*/5 * * * *' | Every 5 minutes
6-field | sec min hour dom mon dow | '0 */5 * * * *' | Every 5 minutes at :00 seconds
Alias | — | '@hourly' | Every hour
Alias | — | '@daily' | Every day at midnight
Alias | — | '@weekly' | Every Sunday at midnight
Alias | — | '@monthly' | First of every month
Weekday range | min hour dom mon dow | '0 6 * * 1-5' | 6 AM on weekdays

Note: Cron-scheduled stream tables do not participate in CALCULATED schedule resolution. The stale column in monitoring views returns NULL for cron-scheduled tables.

Example:

-- Duration-based: refresh when data is staler than 2 minutes
SELECT pgstream.create_stream_table(
    'order_totals',
    'SELECT region, SUM(amount) AS total FROM orders GROUP BY region',
    '2m',
    'DIFFERENTIAL'
);

-- Cron-based: refresh every hour
SELECT pgstream.create_stream_table(
    'hourly_summary',
    'SELECT date_trunc(''hour'', ts), COUNT(*) FROM events GROUP BY 1',
    '@hourly',
    'FULL'
);

-- Cron-based: refresh at 6 AM on weekdays
SELECT pgstream.create_stream_table(
    'daily_report',
    'SELECT region, SUM(revenue) AS total FROM sales GROUP BY region',
    '0 6 * * 1-5',
    'FULL'
);
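
CALCULATED mode (passing NULL for schedule) is easiest to see with a two-layer chain: the upstream table has no schedule of its own and inherits the tightest schedule among its dependents. A sketch — table and column names here are illustrative:

-- CALCULATED: no schedule of its own — inherited from dependents
SELECT pgstream.create_stream_table(
    'regional_totals',
    'SELECT region, SUM(amount) AS total FROM orders GROUP BY region',
    NULL
);

-- The dependent's '30s' schedule propagates upstream to regional_totals
SELECT pgstream.create_stream_table(
    'big_regions',
    'SELECT region, total FROM pgstream.regional_totals WHERE total > 1000',
    '30s'
);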

Aggregate Examples:

All supported aggregate functions work in both FULL and DIFFERENTIAL modes:

-- Algebraic aggregates (fully differential — no rescan needed)
SELECT pgstream.create_stream_table(
    'sales_summary',
    'SELECT region, COUNT(*) AS cnt, SUM(amount) AS total, AVG(amount) AS avg_amount
     FROM orders GROUP BY region',
    '1m',
    'DIFFERENTIAL'
);

-- Semi-algebraic aggregates (MIN/MAX)
SELECT pgstream.create_stream_table(
    'salary_ranges',
    'SELECT department, MIN(salary) AS min_sal, MAX(salary) AS max_sal
     FROM employees GROUP BY department',
    '2m',
    'DIFFERENTIAL'
);

-- Group-rescan aggregates (BOOL_AND/OR, STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG,
--                          BIT_AND, BIT_OR, BIT_XOR, JSON_OBJECT_AGG, JSONB_OBJECT_AGG,
--                          STDDEV, STDDEV_POP, STDDEV_SAMP, VARIANCE, VAR_POP, VAR_SAMP,
--                          MODE, PERCENTILE_CONT, PERCENTILE_DISC,
--                          CORR, COVAR_POP, COVAR_SAMP, REGR_AVGX, REGR_AVGY,
--                          REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE,
--                          REGR_SXX, REGR_SXY, REGR_SYY, ANY_VALUE)
SELECT pgstream.create_stream_table(
    'team_members',
    'SELECT department,
            STRING_AGG(name, '', '' ORDER BY name) AS members,
            ARRAY_AGG(employee_id) AS member_ids,
            BOOL_AND(active) AS all_active,
            JSON_AGG(name) AS members_json
     FROM employees
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- Bitwise aggregates
SELECT pgstream.create_stream_table(
    'permission_summary',
    'SELECT department,
            BIT_OR(permissions) AS combined_perms,
            BIT_AND(permissions) AS common_perms,
            BIT_XOR(flags) AS xor_flags
     FROM employees
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- JSON object aggregates
SELECT pgstream.create_stream_table(
    'config_map',
    'SELECT department,
            JSON_OBJECT_AGG(setting_name, setting_value) AS settings,
            JSONB_OBJECT_AGG(key, value) AS metadata
     FROM config
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- Statistical aggregates
SELECT pgstream.create_stream_table(
    'salary_stats',
    'SELECT department,
            STDDEV_POP(salary) AS sd_pop,
            STDDEV_SAMP(salary) AS sd_samp,
            VAR_POP(salary) AS var_pop,
            VAR_SAMP(salary) AS var_samp
     FROM employees
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- Ordered-set aggregates (MODE, PERCENTILE_CONT, PERCENTILE_DISC)
SELECT pgstream.create_stream_table(
    'salary_percentiles',
    'SELECT department,
            MODE() WITHIN GROUP (ORDER BY grade) AS most_common_grade,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
            PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY salary) AS p90_salary
     FROM employees
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- Regression / correlation aggregates (CORR, COVAR_*, REGR_*)
SELECT pgstream.create_stream_table(
    'regression_stats',
    'SELECT department,
            CORR(salary, experience) AS sal_exp_corr,
            COVAR_POP(salary, experience) AS covar_pop,
            COVAR_SAMP(salary, experience) AS covar_samp,
            REGR_SLOPE(salary, experience) AS slope,
            REGR_INTERCEPT(salary, experience) AS intercept,
            REGR_R2(salary, experience) AS r_squared,
            REGR_COUNT(salary, experience) AS regr_n
     FROM employees
     GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- ANY_VALUE aggregate (PostgreSQL 16+)
SELECT pgstream.create_stream_table(
    'dept_sample',
    'SELECT department, ANY_VALUE(office_location) AS sample_office
     FROM employees GROUP BY department',
    '1m',
    'DIFFERENTIAL'
);

-- FILTER clause on aggregates
SELECT pgstream.create_stream_table(
    'order_metrics',
    'SELECT region,
            COUNT(*) AS total,
            COUNT(*) FILTER (WHERE status = ''active'') AS active_count,
            SUM(amount) FILTER (WHERE status = ''shipped'') AS shipped_total
     FROM orders
     GROUP BY region',
    '1m',
    'DIFFERENTIAL'
);

CTE Examples:

Non-recursive CTEs are fully supported in both FULL and DIFFERENTIAL modes:

-- Simple CTE
SELECT pgstream.create_stream_table(
    'active_order_totals',
    'WITH active_users AS (
        SELECT id, name FROM users WHERE active = true
    )
    SELECT a.id, a.name, SUM(o.amount) AS total
    FROM active_users a
    JOIN orders o ON o.user_id = a.id
    GROUP BY a.id, a.name',
    '1m',
    'DIFFERENTIAL'
);

-- Chained CTEs (CTE referencing another CTE)
SELECT pgstream.create_stream_table(
    'top_regions',
    'WITH regional AS (
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region
    ),
    ranked AS (
        SELECT region, total FROM regional WHERE total > 1000
    )
    SELECT * FROM ranked',
    '2m',
    'DIFFERENTIAL'
);

-- Multi-reference CTE (referenced twice in FROM — shared delta optimization)
SELECT pgstream.create_stream_table(
    'self_compare',
    'WITH totals AS (
        SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id
    )
    SELECT t1.user_id, t1.total, t2.total AS next_total
    FROM totals t1
    JOIN totals t2 ON t1.user_id = t2.user_id + 1',
    '1m',
    'DIFFERENTIAL'
);

Recursive CTEs work with both FULL and DIFFERENTIAL modes:

-- Recursive CTE (hierarchy traversal)
SELECT pgstream.create_stream_table(
    'category_tree',
    'WITH RECURSIVE cat_tree AS (
        SELECT id, name, parent_id, 0 AS depth
        FROM categories WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, c.parent_id, ct.depth + 1
        FROM categories c
        JOIN cat_tree ct ON c.parent_id = ct.id
    )
    SELECT * FROM cat_tree',
    '5m',
    'FULL'  -- FULL mode: standard re-execution
);

-- Recursive CTE with DIFFERENTIAL mode (incremental semi-naive / DRed)
SELECT pgstream.create_stream_table(
    'org_chart',
    'WITH RECURSIVE reports AS (
        SELECT id, name, manager_id FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, e.manager_id
        FROM employees e JOIN reports r ON e.manager_id = r.id
    )
    SELECT * FROM reports',
    '2m',
    'DIFFERENTIAL'  -- Uses semi-naive, DRed, or recomputation (auto-selected)
);

Set Operation Examples:

INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL, UNION, and UNION ALL are supported:

-- INTERSECT: customers who placed orders in BOTH regions
SELECT pgstream.create_stream_table(
    'bi_region_customers',
    'SELECT customer_id FROM orders_east
     INTERSECT
     SELECT customer_id FROM orders_west',
    '2m',
    'DIFFERENTIAL'
);

-- INTERSECT ALL: preserves duplicates (bag semantics)
SELECT pgstream.create_stream_table(
    'common_items',
    'SELECT item_name FROM warehouse_a
     INTERSECT ALL
     SELECT item_name FROM warehouse_b',
    '1m',
    'DIFFERENTIAL'
);

-- EXCEPT: orders not yet shipped
SELECT pgstream.create_stream_table(
    'unshipped_orders',
    'SELECT order_id FROM orders
     EXCEPT
     SELECT order_id FROM shipments',
    '1m',
    'DIFFERENTIAL'
);

-- EXCEPT ALL: preserves duplicate counts (bag subtraction)
SELECT pgstream.create_stream_table(
    'excess_inventory',
    'SELECT sku FROM stock_received
     EXCEPT ALL
     SELECT sku FROM stock_shipped',
    '5m',
    'DIFFERENTIAL'
);

-- UNION: deduplicated merge of two sources
SELECT pgstream.create_stream_table(
    'all_contacts',
    'SELECT email FROM customers
     UNION
     SELECT email FROM newsletter_subscribers',
    '5m',
    'DIFFERENTIAL'
);

LATERAL Set-Returning Function Examples:

Set-returning functions (SRFs) in the FROM clause are supported in both FULL and DIFFERENTIAL modes. Common SRFs include jsonb_array_elements, jsonb_each, jsonb_each_text, and unnest:

-- Flatten JSONB arrays into rows
SELECT pgstream.create_stream_table(
    'flat_children',
    'SELECT p.id, child.value AS val
     FROM parent_data p,
     jsonb_array_elements(p.data->''children'') AS child',
    '1m',
    'DIFFERENTIAL'
);

-- Expand JSONB key-value pairs (multi-column SRF)
SELECT pgstream.create_stream_table(
    'flat_properties',
    'SELECT d.id, kv.key, kv.value
     FROM documents d,
     jsonb_each(d.metadata) AS kv',
    '2m',
    'DIFFERENTIAL'
);

-- Unnest arrays
SELECT pgstream.create_stream_table(
    'flat_tags',
    'SELECT t.id, tag.tag
     FROM tagged_items t,
     unnest(t.tags) AS tag(tag)',
    '1m',
    'DIFFERENTIAL'
);

-- SRF with WHERE filter
SELECT pgstream.create_stream_table(
    'high_value_items',
    'SELECT p.id, (e.value)::int AS amount
     FROM products p,
     jsonb_array_elements(p.prices) AS e
     WHERE (e.value)::int > 100',
    '5m',
    'DIFFERENTIAL'
);

-- SRF combined with aggregation
SELECT pgstream.create_stream_table(
    'element_counts',
    'SELECT a.id, count(*) AS cnt
     FROM arrays a,
     jsonb_array_elements(a.data) AS e
     GROUP BY a.id',
    '1m',
    'FULL'
);

LATERAL Subquery Examples:

LATERAL subqueries in the FROM clause are supported in both FULL and DIFFERENTIAL modes. Use them for top-N per group, correlated aggregation, and conditional expansion:

-- Top-N per group: latest item per order
SELECT pgstream.create_stream_table(
    'latest_items',
    'SELECT o.id, o.customer, latest.amount
     FROM orders o,
     LATERAL (
         SELECT li.amount
         FROM line_items li
         WHERE li.order_id = o.id
         ORDER BY li.created_at DESC
         LIMIT 1
     ) AS latest',
    '1m',
    'DIFFERENTIAL'
);

-- Correlated aggregate
SELECT pgstream.create_stream_table(
    'dept_summaries',
    'SELECT d.id, d.name, stats.total, stats.cnt
     FROM departments d,
     LATERAL (
         SELECT SUM(e.salary) AS total, COUNT(*) AS cnt
         FROM employees e
         WHERE e.dept_id = d.id
     ) AS stats',
    '1m',
    'DIFFERENTIAL'
);

-- LEFT JOIN LATERAL: preserve outer rows with NULLs when subquery returns no rows
SELECT pgstream.create_stream_table(
    'dept_stats_all',
    'SELECT d.id, d.name, stats.total
     FROM departments d
     LEFT JOIN LATERAL (
         SELECT SUM(e.salary) AS total
         FROM employees e
         WHERE e.dept_id = d.id
     ) AS stats ON true',
    '1m',
    'DIFFERENTIAL'
);

WHERE Subquery Examples:

Subqueries in the WHERE clause are automatically transformed into semi-join, anti-join, or scalar subquery operators in the DVM operator tree:

-- EXISTS subquery: customers who have placed orders
SELECT pgstream.create_stream_table(
    'active_customers',
    'SELECT c.id, c.name
     FROM customers c
     WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)',
    '1m',
    'DIFFERENTIAL'
);

-- NOT EXISTS: customers with no orders
SELECT pgstream.create_stream_table(
    'inactive_customers',
    'SELECT c.id, c.name
     FROM customers c
     WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)',
    '1m',
    'DIFFERENTIAL'
);

-- IN subquery: products that have been ordered
SELECT pgstream.create_stream_table(
    'ordered_products',
    'SELECT p.id, p.name
     FROM products p
     WHERE p.id IN (SELECT product_id FROM order_items)',
    '1m',
    'DIFFERENTIAL'
);

-- NOT IN subquery: products never ordered
SELECT pgstream.create_stream_table(
    'unordered_products',
    'SELECT p.id, p.name
     FROM products p
     WHERE p.id NOT IN (SELECT product_id FROM order_items)',
    '1m',
    'DIFFERENTIAL'
);

-- Scalar subquery in SELECT list
SELECT pgstream.create_stream_table(
    'products_with_max_price',
    'SELECT p.id, p.name, (SELECT max(price) FROM products) AS max_price
     FROM products p',
    '1m',
    'DIFFERENTIAL'
);

Notes:

  • The defining query is parsed into an operator tree and validated for DVM support.
  • Views as sources — views referenced in the defining query are automatically inlined as subqueries (auto-rewrite pass #0). CDC triggers are created on the underlying base tables. Nested views (view → view → table) are fully expanded. The user's original query is preserved in original_query for reinit and introspection. Materialized views are rejected in DIFFERENTIAL mode (use FULL mode or the underlying query directly). Foreign tables are also rejected in DIFFERENTIAL mode.
  • CDC triggers and change buffer tables are created automatically for each source table.
  • The ST is registered in the dependency DAG; cycles are rejected.
  • Non-recursive CTEs are inlined as subqueries during parsing (Tier 1). Multi-reference CTEs share delta computation (Tier 2).
  • Recursive CTEs in DIFFERENTIAL mode use three strategies, auto-selected per refresh: semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback when CTE columns don't match ST storage columns.
  • LATERAL SRFs in DIFFERENTIAL mode use row-scoped recomputation: when a source row changes, only the SRF expansions for that row are re-evaluated.
  • LATERAL subqueries in DIFFERENTIAL mode also use row-scoped recomputation: when an outer row changes, the correlated subquery is re-executed only for that row.
  • WHERE subqueries (EXISTS, IN, scalar) are parsed into dedicated semi-join, anti-join, and scalar subquery operators with specialized delta computation.
  • ALL (subquery) in the WHERE clause is handled by rewriting it to an anti-join via NOT EXISTS with the negated condition.
  • ORDER BY is accepted but silently discarded — row order in the storage table is undefined (consistent with PostgreSQL's CREATE MATERIALIZED VIEW behavior). Apply ORDER BY when querying the stream table.
  • LIMIT / OFFSET are rejected — stream tables materialize the full result set. Apply LIMIT when querying the stream table.
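
Since ORDER BY is discarded and LIMIT is rejected in the defining query, apply both when reading the stream table instead. Reusing the order_totals example from above:

-- The storage table holds the full result; sort and limit at read time
SELECT region, total
FROM order_totals
ORDER BY total DESC
LIMIT 10;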

pgstream.alter_stream_table

Alter properties of an existing stream table.

pgstream.alter_stream_table(
    name          text,
    schedule      text   DEFAULT NULL,
    refresh_mode  text   DEFAULT NULL,
    status        text   DEFAULT NULL
) → void

Parameters:

Parameter | Type | Default | Description
name | text | — | Name of the stream table (schema-qualified or unqualified).
schedule | text | NULL | New schedule as a duration string (e.g., '5m'). Pass NULL to leave unchanged.
refresh_mode | text | NULL | New refresh mode ('FULL' or 'DIFFERENTIAL'). Pass NULL to leave unchanged.
status | text | NULL | New status ('ACTIVE', 'SUSPENDED'). Pass NULL to leave unchanged. Resuming resets consecutive errors to 0.

Examples:

-- Change schedule
SELECT pgstream.alter_stream_table('order_totals', schedule => '5m');

-- Switch to full refresh mode
SELECT pgstream.alter_stream_table('order_totals', refresh_mode => 'FULL');

-- Suspend a stream table
SELECT pgstream.alter_stream_table('order_totals', status => 'SUSPENDED');

-- Resume a suspended stream table
SELECT pgstream.alter_stream_table('order_totals', status => 'ACTIVE');

pgstream.drop_stream_table

Drop a stream table, removing the storage table and all catalog entries.

pgstream.drop_stream_table(name text) → void

Parameters:

Parameter | Type | Description
name | text | Name of the stream table to drop.

Example:

SELECT pgstream.drop_stream_table('order_totals');

Notes:

  • Drops the underlying storage table with CASCADE.
  • Removes all catalog entries (metadata, dependencies, refresh history).
  • Cleans up CDC triggers and change buffer tables for source tables that are no longer tracked by any ST.

pgstream.refresh_stream_table

Manually trigger a synchronous refresh of a stream table.

pgstream.refresh_stream_table(name text) → void

Parameters:

Parameter | Type | Description
name | text | Name of the stream table to refresh.

Example:

SELECT pgstream.refresh_stream_table('order_totals');

Notes:

  • Blocked if the ST is SUSPENDED — use pgstream.alter_stream_table(... status => 'ACTIVE') first.
  • Uses an advisory lock to prevent concurrent refreshes of the same ST.
  • For DIFFERENTIAL mode, generates and applies a delta query. For FULL mode, truncates and reloads.
  • Records the refresh in pgstream.pgs_refresh_history.

pgstream.pgs_status

Get the status of all stream tables.

pgstream.pgs_status() → SETOF record(
    name                text,
    status              text,
    refresh_mode        text,
    is_populated        bool,
    consecutive_errors  int,
    schedule            text,
    data_timestamp      timestamptz,
    staleness           interval
)

Example:

SELECT * FROM pgstream.pgs_status();
name | status | refresh_mode | is_populated | consecutive_errors | schedule | data_timestamp | staleness
public.order_totals | ACTIVE | DIFFERENTIAL | true | 0 | 5m | 2026-02-21 12:00:00+00 | 00:02:30
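
A common monitoring pattern — assuming the column types shown in the signature above — is to flag stream tables that are falling behind or failing:

-- Stream tables more than 5 minutes stale, or accumulating errors
SELECT name, status, staleness, consecutive_errors
FROM pgstream.pgs_status()
WHERE staleness > interval '5 minutes'
   OR consecutive_errors > 0;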

pgstream.st_refresh_stats

Return per-ST refresh statistics aggregated from the refresh history.

pgstream.st_refresh_stats() → SETOF record(
    pgs_name               text,
    pgs_schema             text,
    status                 text,
    refresh_mode           text,
    is_populated           bool,
    total_refreshes        bigint,
    successful_refreshes   bigint,
    failed_refreshes       bigint,
    total_rows_inserted    bigint,
    total_rows_deleted     bigint,
    avg_duration_ms        float8,
    last_refresh_action    text,
    last_refresh_status    text,
    last_refresh_at        timestamptz,
    staleness_secs         float8,
    stale                  bool
)

Example:

SELECT pgs_name, status, total_refreshes, avg_duration_ms, stale
FROM pgstream.st_refresh_stats();

pgstream.get_refresh_history

Return refresh history for a specific stream table.

pgstream.get_refresh_history(
    name      text,
    max_rows  int  DEFAULT 20
) → SETOF record(
    refresh_id       bigint,
    data_timestamp   timestamptz,
    start_time       timestamptz,
    end_time         timestamptz,
    action           text,
    status           text,
    rows_inserted    bigint,
    rows_deleted     bigint,
    duration_ms      float8,
    error_message    text
)

Example:

SELECT action, status, rows_inserted, duration_ms
FROM pgstream.get_refresh_history('order_totals', 5);

pgstream.get_staleness

Get the current staleness in seconds for a specific stream table.

pgstream.get_staleness(name text) → float8

Returns NULL if the ST has never been refreshed.

Example:

SELECT pgstream.get_staleness('order_totals');
-- Returns: 12.345  (seconds since last refresh)

pgstream.slot_health

Check replication slot health for all tracked CDC slots.

pgstream.slot_health() → SETOF record(
    slot_name          text,
    source_relid       bigint,
    active             bool,
    retained_wal_bytes bigint,
    wal_status         text
)

Example:

SELECT * FROM pgstream.slot_health();
slot_name | source_relid | active | retained_wal_bytes | wal_status
pg_stream_slot_16384 | 16384 | false | 1048576 | reserved

pgstream.check_cdc_health

Check CDC health for all tracked source tables. Returns per-source health status including the current CDC mode, replication slot details, estimated lag, and any alerts.

pgstream.check_cdc_health() → SETOF record(
    source_relid   bigint,
    source_table   text,
    cdc_mode       text,
    slot_name      text,
    lag_bytes      bigint,
    confirmed_lsn  text,
    alert          text
)

Columns:

Column | Type | Description
source_relid | bigint | OID of the tracked source table
source_table | text | Resolved name of the source table (e.g., public.orders)
cdc_mode | text | Current CDC mode: TRIGGER, TRANSITIONING, or WAL
slot_name | text | Replication slot name (NULL for TRIGGER mode)
lag_bytes | bigint | Replication slot lag in bytes (NULL for TRIGGER mode)
confirmed_lsn | text | Last confirmed WAL position (NULL for TRIGGER mode)
alert | text | Alert message if unhealthy (e.g., slot_lag_exceeds_threshold, replication_slot_missing)

Example:

SELECT * FROM pgstream.check_cdc_health();
source_relid | source_table | cdc_mode | slot_name | lag_bytes | confirmed_lsn | alert
16384 | public.orders | TRIGGER | | | |
16390 | public.events | WAL | pg_stream_slot_16390 | 524288 | 0/1A8B000 |

pgstream.explain_st

Explain the DVM plan for a stream table's defining query.

pgstream.explain_st(name text) → SETOF record(
    property  text,
    value     text
)

Example:

SELECT * FROM pgstream.explain_st('order_totals');
property | value
Defining Query | SELECT region, SUM(amount) ...
Refresh Mode | DIFFERENTIAL
Operator Tree | Aggregate → Scan(orders)
Source Tables | orders (oid=16384)
DVM Supported | Yes

pgstream.pg_stream_hash

Compute a 64-bit xxHash row ID from a text value.

pgstream.pg_stream_hash(input text) → bigint

Marked IMMUTABLE, PARALLEL SAFE.

Example:

SELECT pgstream.pg_stream_hash('some_key');
-- Returns: 1234567890123456789

pgstream.pg_stream_hash_multi

Compute a row ID by hashing multiple text values (composite keys).

pgstream.pg_stream_hash_multi(inputs text[]) → bigint

Marked IMMUTABLE, PARALLEL SAFE. Uses \x1E (record separator) between values and \x00NULL\x00 for NULL entries.

Example:

SELECT pgstream.pg_stream_hash_multi(ARRAY['key1', 'key2']);
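
Because values are joined with a record separator and NULLs get a dedicated sentinel, composite keys that would collide under naive string concatenation hash to distinct row IDs. An illustrative comparison:

-- 'ab' + 'c' vs 'a' + 'bc': naive concatenation would collide; these do not
SELECT pgstream.pg_stream_hash_multi(ARRAY['ab', 'c'])
     = pgstream.pg_stream_hash_multi(ARRAY['a', 'bc']);   -- false

-- NULL is distinguished from the empty string
SELECT pgstream.pg_stream_hash_multi(ARRAY['a', NULL])
     = pgstream.pg_stream_hash_multi(ARRAY['a', '']);     -- false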

Expression Support

pgstream's DVM parser supports a wide range of SQL expressions in defining queries. All expressions work in both FULL and DIFFERENTIAL modes.

Conditional Expressions

Expression | Example | Notes
CASE WHEN … THEN … ELSE … END | CASE WHEN amount > 100 THEN 'high' ELSE 'low' END | Searched CASE
CASE <expr> WHEN … THEN … END | CASE status WHEN 1 THEN 'active' WHEN 2 THEN 'inactive' END | Simple CASE
COALESCE(a, b, …) | COALESCE(phone, email, 'unknown') | Returns first non-NULL argument
NULLIF(a, b) | NULLIF(divisor, 0) | Returns NULL if a = b
GREATEST(a, b, …) | GREATEST(score1, score2, score3) | Returns the largest value
LEAST(a, b, …) | LEAST(price, max_price) | Returns the smallest value

Comparison Operators

Expression | Example | Notes
IN (list) | category IN ('A', 'B', 'C') | Also supports NOT IN
BETWEEN a AND b | price BETWEEN 10 AND 100 | Also supports NOT BETWEEN
IS DISTINCT FROM | a IS DISTINCT FROM b | NULL-safe inequality
IS NOT DISTINCT FROM | a IS NOT DISTINCT FROM b | NULL-safe equality
SIMILAR TO | name SIMILAR TO '%pattern%' | SQL regex matching
op ANY(array) | id = ANY(ARRAY[1,2,3]) | Array comparison
op ALL(array) | score > ALL(ARRAY[50,60]) | Array comparison

Boolean Tests

Expression | Example
IS TRUE | active IS TRUE
IS NOT TRUE | flag IS NOT TRUE
IS FALSE | completed IS FALSE
IS NOT FALSE | valid IS NOT FALSE
IS UNKNOWN | result IS UNKNOWN
IS NOT UNKNOWN | flag IS NOT UNKNOWN

SQL Value Functions

Function | Description
CURRENT_DATE | Current date
CURRENT_TIME | Current time with time zone
CURRENT_TIMESTAMP | Current date and time with time zone
LOCALTIME | Current time without time zone
LOCALTIMESTAMP | Current date and time without time zone
CURRENT_ROLE | Current role name
CURRENT_USER | Current user name
SESSION_USER | Session user name
CURRENT_CATALOG | Current database name
CURRENT_SCHEMA | Current schema name

Array and Row Expressions

Expression | Example | Notes
ARRAY[…] | ARRAY[1, 2, 3] | Array constructor
ROW(…) | ROW(a, b, c) | Row constructor
Array subscript | arr[1] | Array element access
Field access | (rec).field | Composite type field access
Star indirection | (data).* | Expand all fields

Subquery Expressions

Subqueries are supported in the WHERE clause and SELECT list. They are parsed into dedicated DVM operators with specialized delta computation for incremental maintenance.

Expression | Example | DVM Operator
EXISTS (subquery) | WHERE EXISTS (SELECT 1 FROM orders WHERE orders.cid = c.id) | Semi-Join
NOT EXISTS (subquery) | WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.cid = c.id) | Anti-Join
IN (subquery) | WHERE id IN (SELECT product_id FROM order_items) | Semi-Join (rewritten as equality)
NOT IN (subquery) | WHERE id NOT IN (SELECT product_id FROM order_items) | Anti-Join
Scalar subquery (SELECT) | SELECT (SELECT max(price) FROM products) AS max_p | Scalar Subquery

Notes:

  • EXISTS and IN (subquery) in the WHERE clause are transformed into semi-join operators. NOT EXISTS and NOT IN (subquery) become anti-join operators.
  • Multiple subqueries in the same WHERE clause are supported when combined with AND. Subqueries combined with OR are also supported — they are automatically rewritten into UNION of separate filtered queries.
  • Scalar subqueries in the SELECT list are supported as long as they return exactly one row and one column.
  • ALL (subquery) is supported — it is automatically rewritten to an anti-join via NOT EXISTS with the negated condition.
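
The OR rewrite described above can be seen in a defining query like the following — pg_stream splits the predicate into a UNION of separately filtered queries internally. Table names are illustrative:

-- EXISTS OR IN: automatically rewritten into a UNION of two semi-joined queries
SELECT pgstream.create_stream_table(
    'engaged_customers',
    'SELECT c.id, c.name
     FROM customers c
     WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
        OR c.id IN (SELECT customer_id FROM support_tickets)',
    '1m',
    'DIFFERENTIAL'
);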

Auto-Rewrite Pipeline

pg_stream transparently rewrites certain SQL constructs before parsing. These rewrites are applied automatically and require no user action:

Order | Trigger | Rewrite
#0 | View references in FROM | Inline view body as subquery
#1 | DISTINCT ON (expr) | Convert to ROW_NUMBER() OVER (PARTITION BY expr ORDER BY ...) = 1 subquery
#2 | GROUPING SETS / CUBE / ROLLUP | Decompose into UNION ALL of separate GROUP BY queries
#3 | Scalar subquery in WHERE | Convert to CROSS JOIN with inline view
#4 | EXISTS/IN inside OR | Split into UNION of separate filtered queries
#5 | Multiple PARTITION BY clauses | Split into joined subqueries, one per distinct partitioning
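
For example, rewrite #1 means a DISTINCT ON query is accepted as-is and converted internally to a ROW_NUMBER() subquery. A sketch with illustrative table names:

-- DISTINCT ON: latest order per customer, auto-rewritten to ROW_NUMBER()
SELECT pgstream.create_stream_table(
    'latest_orders',
    'SELECT DISTINCT ON (customer_id) customer_id, id, amount
     FROM orders
     ORDER BY customer_id, created_at DESC',
    '1m',
    'DIFFERENTIAL'
);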

HAVING Clause

HAVING is fully supported. The filter predicate is applied on top of the aggregate delta computation — groups that pass the HAVING condition are included in the stream table.

SELECT pgstream.create_stream_table('big_departments',
  'SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department HAVING COUNT(*) > 10',
  '1m', 'DIFFERENTIAL');

Tables Without Primary Keys (Keyless Tables)

Tables without a primary key can be used as sources. pg_stream generates a content-based row identity by hashing all column values using pg_stream_hash_multi(). This allows DIFFERENTIAL mode to work, though at the cost of being unable to distinguish truly duplicate rows (rows with identical values in all columns).

-- No primary key — pg_stream uses content hashing for row identity
CREATE TABLE events (ts TIMESTAMPTZ, payload JSONB);
SELECT pgstream.create_stream_table('event_summary',
  'SELECT payload->>''type'' AS event_type, COUNT(*) FROM events GROUP BY 1',
  '1m', 'DIFFERENTIAL');

Volatile Function Detection

pg_stream checks all functions and operators in the defining query against pg_proc.provolatile:

  • VOLATILE functions (e.g., random(), clock_timestamp(), gen_random_uuid()) are rejected in DIFFERENTIAL mode because they produce different results on each evaluation, breaking delta correctness.
  • VOLATILE operators — custom operators backed by volatile functions are also detected. The check resolves the operator’s implementation function via pg_operator.oprcode and checks its volatility in pg_proc.
  • STABLE functions (e.g., now(), current_setting()) produce a warning — they are consistent within a single transaction but may differ between refreshes.
  • IMMUTABLE functions are always safe and produce no warnings.

FULL mode accepts all volatility classes since it re-evaluates the entire query each time.
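
For instance, a query using random() is rejected in DIFFERENTIAL mode but accepted in FULL mode. A sketch — table names are illustrative:

-- Rejected: random() is VOLATILE, so the delta cannot be maintained correctly
SELECT pgstream.create_stream_table('sampled_events',
  'SELECT id FROM events WHERE random() < 0.1',
  '1m', 'DIFFERENTIAL');

-- Accepted: FULL mode re-evaluates the entire query on each refresh
SELECT pgstream.create_stream_table('sampled_events',
  'SELECT id FROM events WHERE random() < 0.1',
  '1m', 'FULL');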

COLLATE Expressions

COLLATE clauses on expressions are supported:

SELECT pgstream.create_stream_table('sorted_names',
  'SELECT name COLLATE "C" AS c_name FROM users',
  '1m', 'DIFFERENTIAL');

IS JSON Predicate (PostgreSQL 16+)

The IS JSON predicate validates whether a value is valid JSON. All variants are supported:

-- Filter rows with valid JSON
SELECT pgstream.create_stream_table('valid_json_events',
  'SELECT id, payload FROM events WHERE payload::text IS JSON',
  '1m', 'DIFFERENTIAL');

-- Type-specific checks
SELECT pgstream.create_stream_table('json_objects_only',
  'SELECT id, data IS JSON OBJECT AS is_obj,
          data IS JSON ARRAY AS is_arr,
          data IS JSON SCALAR AS is_scalar
   FROM json_data',
  '1m', 'FULL');

Supported variants: IS JSON, IS JSON OBJECT, IS JSON ARRAY, IS JSON SCALAR, IS NOT JSON (all forms), WITH UNIQUE KEYS.

SQL/JSON Constructors (PostgreSQL 16+)

SQL-standard JSON constructor functions are supported in both FULL and DIFFERENTIAL modes:

-- JSON_OBJECT: construct a JSON object from key-value pairs
SELECT pgstream.create_stream_table('user_json',
  'SELECT id, JSON_OBJECT(''name'' : name, ''age'' : age) AS data FROM users',
  '1m', 'DIFFERENTIAL');

-- JSON_ARRAY: construct a JSON array from values
SELECT pgstream.create_stream_table('value_arrays',
  'SELECT id, JSON_ARRAY(a, b, c) AS arr FROM measurements',
  '1m', 'FULL');

-- JSON(): parse a text value as JSON
-- JSON_SCALAR(): wrap a scalar value as JSON
-- JSON_SERIALIZE(): serialize a JSON value to text

Note: JSON_ARRAYAGG() and JSON_OBJECTAGG() are SQL-standard aggregate functions fully recognized by the DVM engine. In DIFFERENTIAL mode, they use the group-rescan strategy (affected groups are re-aggregated from source data). The full deparsed SQL is preserved to handle the special key: value, ABSENT ON NULL, ORDER BY, and RETURNING clause syntax.
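
A sketch of the SQL-standard aggregate forms mentioned above, using illustrative table names:

-- JSON_ARRAYAGG / JSON_OBJECTAGG (group-rescan strategy in DIFFERENTIAL mode)
SELECT pgstream.create_stream_table('dept_json',
  'SELECT department,
          JSON_ARRAYAGG(name ORDER BY name) AS names,
          JSON_OBJECTAGG(name : salary) AS salaries
   FROM employees
   GROUP BY department',
  '1m', 'DIFFERENTIAL');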

JSON_TABLE (PostgreSQL 17+)

JSON_TABLE() generates a relational table from JSON data. It is supported in the FROM clause in both FULL and DIFFERENTIAL modes. Internally, it is modeled as a LateralFunction.

-- Extract structured data from a JSON column
SELECT pgstream.create_stream_table('user_phones',
  $$SELECT u.id, j.phone_type, j.phone_number
    FROM users u,
         JSON_TABLE(u.contact_info, '$.phones[*]'
           COLUMNS (
             phone_type TEXT PATH '$.type',
             phone_number TEXT PATH '$.number'
           )
         ) AS j$$,
  '1m', 'DIFFERENTIAL');

Supported column types:

  • Regular columnsname TYPE PATH '$.path' (with optional ON ERROR/ON EMPTY behaviors)
  • EXISTS columnsname TYPE EXISTS PATH '$.path'
  • Formatted columnsname TYPE FORMAT JSON PATH '$.path'
  • Nested columnsNESTED PATH '$.path' COLUMNS (...)

The PASSING clause is also supported for passing named variables to path expressions.

Unsupported Expression Types

The following are rejected with clear error messages rather than producing broken SQL:

  • TABLESAMPLE: Rejected — stream tables materialize the complete result set. Suggested rewrite: use WHERE random() < 0.1 if sampling is needed.
  • Window functions in expressions (e.g., CASE WHEN ROW_NUMBER() OVER (...) ...): Rejected. Suggested rewrite: move the window function to a separate column.
  • FOR UPDATE / FOR SHARE: Rejected — stream tables do not support row-level locking. Suggested rewrite: remove the locking clause.
  • Unknown node types: Rejected with type information.

Restrictions & Interoperability

Stream tables are standard PostgreSQL heap tables stored in the pgstream schema with an additional __pgs_row_id BIGINT PRIMARY KEY column managed by the refresh engine. This section describes what you can and cannot do with them.

Referencing Other Stream Tables

Stream tables can reference other stream tables in their defining query. This creates a dependency edge in the internal DAG, and the scheduler refreshes upstream tables before downstream ones. Cycles are detected and rejected at creation time.

-- ST1 reads from a base table
SELECT pgstream.create_stream_table('order_totals',
  'SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id',
  '1m', 'DIFFERENTIAL');

-- ST2 reads from ST1
SELECT pgstream.create_stream_table('big_customers',
  'SELECT customer_id, total FROM pgstream.order_totals WHERE total > 1000',
  '1m', 'DIFFERENTIAL');

Views as Sources in Defining Queries

PostgreSQL views can be used as source tables in a stream table's defining query. Views are automatically inlined — replaced with their underlying SELECT definition as subqueries — so CDC triggers land on the actual base tables.

CREATE VIEW active_orders AS
  SELECT * FROM orders WHERE status = 'active';

-- This works in DIFFERENTIAL mode:
SELECT pgstream.create_stream_table('order_summary',
  'SELECT customer_id, COUNT(*) FROM active_orders GROUP BY customer_id',
  '1m', 'DIFFERENTIAL');
-- Internally, 'active_orders' is replaced with:
--   (SELECT ... FROM orders WHERE status = 'active') AS active_orders

Nested views (view → view → table) are fully expanded via a fixpoint loop. Column-renaming views (CREATE VIEW v(a, b) AS ...) work correctly — pg_get_viewdef() produces the proper column aliases.

When a view is inlined, the user's original SQL is stored in the original_query catalog column for reinit and introspection. The defining_query column contains the expanded (post-inlining) form.

DDL hooks: CREATE OR REPLACE VIEW on a view that was inlined into a stream table marks that ST for reinit. DROP VIEW sets affected STs to ERROR status.

Materialized views are rejected in DIFFERENTIAL mode — their stale-snapshot semantics prevent CDC triggers from tracking changes. Use the underlying query directly, or switch to FULL mode. In FULL mode, materialized views are allowed (no CDC needed).

Foreign tables are rejected in DIFFERENTIAL mode — row-level triggers cannot be created on foreign tables. Use FULL mode instead.

Partitioned Tables as Sources

Partitioned tables are fully supported as source tables in both FULL and DIFFERENTIAL modes. CDC triggers are installed on the partitioned parent table, and PostgreSQL 13+ ensures the trigger fires for all DML routed to child partitions. The change buffer uses the parent table's OID (pgstream_changes.changes_<parent_oid>).

CREATE TABLE orders (
    id INT, region TEXT, amount NUMERIC
) PARTITION BY LIST (region);
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU');

-- Works — inserts into any partition are captured:
SELECT pgstream.create_stream_table('order_summary',
  'SELECT region, SUM(amount) FROM orders GROUP BY region',
  '1m', 'DIFFERENTIAL');

Note: pg_stream targets PostgreSQL 18. On PostgreSQL 12 or earlier (not supported), parent triggers do not fire for partition-routed rows, which would cause silent data loss.

Logical Replication Targets

Tables that receive data via logical replication require special consideration. Changes arriving via replication do not fire normal row-level triggers, which means CDC triggers will miss those changes.

pg_stream emits a WARNING at stream table creation time if any source table is detected as a logical replication target (via pg_subscription_rel).

Workarounds:

  • Use cdc_mode = 'wal' for WAL-based CDC that captures all changes regardless of origin.
  • Use FULL refresh mode, which recomputes entirely from the current table state.
  • Set a frequent refresh schedule with FULL mode to limit staleness.
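To check ahead of time whether a prospective source table receives rows via logical replication (the same pg_subscription_rel lookup the WARNING is based on):

```sql
-- Tables in this database that are subscription apply targets
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;
```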

Views on Stream Tables

PostgreSQL views can reference stream tables. The view reflects the data as of the most recent refresh.

CREATE VIEW top_customers AS
SELECT customer_id, total
FROM pgstream.order_totals
WHERE total > 500
ORDER BY total DESC;

Materialized Views on Stream Tables

Materialized views can reference stream tables, though this is typically redundant (both are physical snapshots of a query). The materialized view requires its own REFRESH MATERIALIZED VIEW — it does not auto-refresh when the stream table refreshes.

Logical Replication of Stream Tables

Stream tables can be published for logical replication like any ordinary table:

-- On publisher
CREATE PUBLICATION my_pub FOR TABLE pgstream.order_totals;

-- On subscriber
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=... dbname=...'
  PUBLICATION my_pub;

Caveats:

  • The __pgs_row_id column is replicated (it is the primary key), which is an internal implementation detail.
  • The subscriber receives materialized data, not the defining query. Refreshes on the publisher propagate as normal DML via logical replication.
  • Do not install pg_stream on the subscriber and attempt to refresh the replicated table — it will have no CDC triggers or catalog entries.
  • The internal change buffer tables (pgstream_changes.changes_<oid>) and catalog tables are not published by default; subscribers only receive the final output.

Known Delta Computation Limitations

The following edge cases produce incorrect delta results in DIFFERENTIAL mode under specific data mutation patterns. They have no effect on FULL mode.

JOIN Key Column Change + Simultaneous Right-Side Delete

When a row's join key column is updated in the same refresh cycle as the joined-side row is deleted, the delta query may fail to emit the required DELETE from the stream table:

-- Stream table joining orders with customers
SELECT pgstream.create_stream_table('order_details',
  'SELECT o.id, c.name FROM orders o JOIN customers c ON o.cust_id = c.id',
  '1m', 'DIFFERENTIAL');

-- Scenario that exposes the limitation:
-- In the same transaction (or same refresh interval):
UPDATE orders SET cust_id = 5 WHERE cust_id = 3;  -- key change
DELETE FROM customers WHERE id = 3;               -- old join partner deleted
-- The delta for the now-stale (orders.cust_id=3, customers.id=3) join result
-- may not be emitted as a DELETE, leaving a stale row in the stream table
-- until the next full refresh cycle.

Root cause: The JOIN delta query reads current_right (customers) after all changes are applied. When customer 3 is deleted before the delta runs, the DELETE half of the join cannot find its join partner and is silently dropped.

Mitigations:

  • Adaptive FULL fallback (default): when the scheduler detects a high change volume, it switches to a full recompute, which corrects any stale rows. The threshold is configurable via pg_stream.differential_max_change_ratio.
  • Avoid co-locating key-changing UPDATEs and DELETEs in the same refresh interval. Stagger changes across multiple refresh cycles.
  • FULL mode for stream tables where join key changes and right-side deletes are expected to co-occur frequently.

CUBE/ROLLUP Expansion Limit

CUBE on N columns generates 2^N grouping-set branches (a UNION ALL of 2^N queries). pg_stream rejects any CUBE/ROLLUP that would produce more than 64 branches, to prevent runaway memory usage during query generation. Use explicit GROUPING SETS(...) instead:

-- Rejected: CUBE(a, b, c, d, e, f, g) would generate 128 branches
-- Use instead:
SELECT pgstream.create_stream_table('multi_dim',
  'SELECT a, b, c, SUM(v) FROM t
   GROUP BY GROUPING SETS ((a, b, c), (a, b), (a), ())',
  '5m', 'DIFFERENTIAL');

What Is NOT Allowed

  • Direct DML (INSERT, UPDATE, DELETE): ❌ Not supported. Stream table contents are managed exclusively by the refresh engine.
  • Direct DDL (ALTER TABLE): ❌ Not supported. Use pgstream.alter_stream_table() to change the defining query or schedule.
  • Foreign keys referencing or from a stream table: ❌ Not supported. The refresh engine performs bulk MERGE operations that do not respect FK ordering.
  • User-defined triggers on stream tables: ✅ Supported (DIFFERENTIAL). In DIFFERENTIAL mode, the refresh engine decomposes changes into explicit DELETE + UPDATE + INSERT statements so triggers fire with correct TG_OP, OLD, and NEW. Row-level triggers are suppressed during FULL refresh. Controlled by the pg_stream.user_triggers GUC (default: auto).
  • TRUNCATE on a stream table: ❌ Not supported. Use pgstream.refresh_stream_table() to reset data.

Tip: The __pgs_row_id column is visible but should be ignored by consuming queries — it is an implementation detail used for delta MERGE operations.


Views

pgstream.stream_tables_info

Status overview with computed staleness information.

SELECT * FROM pgstream.stream_tables_info;

Columns include all pgstream.pgs_stream_tables columns plus:

  • staleness (interval): now() - data_timestamp
  • stale (bool): true if staleness > schedule
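For example, to list stream tables currently exceeding their schedule:

```sql
-- Stale stream tables, most stale first
SELECT pgs_name, staleness
FROM pgstream.stream_tables_info
WHERE stale
ORDER BY staleness DESC;
```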

pgstream.pg_stat_stream_tables

Comprehensive monitoring view combining catalog metadata with aggregate refresh statistics.

SELECT * FROM pgstream.pg_stat_stream_tables;

Key columns:

  • pgs_id (bigint): Stream table ID
  • pgs_schema / pgs_name (text): Schema and name
  • status (text): INITIALIZING, ACTIVE, SUSPENDED, ERROR
  • refresh_mode (text): FULL or DIFFERENTIAL
  • data_timestamp (timestamptz): Timestamp of last refresh
  • staleness (interval): Current staleness
  • stale (bool): Whether schedule is exceeded
  • total_refreshes (bigint): Total refresh count
  • successful_refreshes (bigint): Successful refresh count
  • failed_refreshes (bigint): Failed refresh count
  • avg_duration_ms (float8): Average refresh duration
  • consecutive_errors (int): Current error streak
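A typical health check built on this view, using the columns documented above:

```sql
-- Stream tables that are erroring or have an active failure streak
SELECT pgs_schema, pgs_name, status, consecutive_errors, failed_refreshes
FROM pgstream.pg_stat_stream_tables
WHERE status = 'ERROR' OR consecutive_errors > 0
ORDER BY consecutive_errors DESC;
```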

Catalog Tables

pgstream.pgs_stream_tables

Core metadata for each stream table.

  • pgs_id (bigserial): Primary key
  • pgs_relid (oid): OID of the storage table
  • pgs_name (text): Table name
  • pgs_schema (text): Schema name
  • defining_query (text): The SQL query that defines the ST
  • schedule (text): Refresh schedule (duration or cron expression)
  • refresh_mode (text): FULL or DIFFERENTIAL
  • status (text): INITIALIZING, ACTIVE, SUSPENDED, ERROR
  • is_populated (bool): Whether the table has been populated
  • data_timestamp (timestamptz): Timestamp of the data in the ST
  • frontier (jsonb): Per-source LSN positions (version tracking)
  • last_refresh_at (timestamptz): When last refreshed
  • consecutive_errors (int): Current error streak count
  • needs_reinit (bool): Whether upstream DDL requires reinitialization
  • functions_used (text[]): Function names used in the defining query (for DDL tracking)
  • created_at (timestamptz): Creation timestamp
  • updated_at (timestamptz): Last modification timestamp
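For instance, to find stream tables that upstream DDL has flagged for reinitialization:

```sql
SELECT pgs_schema, pgs_name, status
FROM pgstream.pgs_stream_tables
WHERE needs_reinit;
```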

pgstream.pgs_dependencies

DAG edges — records which source tables each ST depends on, including CDC mode metadata.

  • pgs_id (bigint): FK to pgs_stream_tables
  • source_relid (oid): OID of the source table
  • source_type (text): TABLE, STREAM_TABLE, or VIEW
  • columns_used (text[]): Which columns are referenced
  • cdc_mode (text): Current CDC mode: TRIGGER, TRANSITIONING, or WAL
  • slot_name (text): Replication slot name (WAL/TRANSITIONING modes)
  • decoder_confirmed_lsn (pg_lsn): WAL decoder's last confirmed position
  • transition_started_at (timestamptz): When the trigger→WAL transition started

pgstream.pgs_refresh_history

Audit log of all refresh operations.

  • refresh_id (bigserial): Primary key
  • pgs_id (bigint): FK to pgs_stream_tables
  • data_timestamp (timestamptz): Data timestamp of the refresh
  • start_time (timestamptz): When the refresh started
  • end_time (timestamptz): When it completed
  • action (text): NO_DATA, FULL, DIFFERENTIAL, REINITIALIZE, SKIP
  • rows_inserted (bigint): Rows inserted
  • rows_deleted (bigint): Rows deleted
  • error_message (text): Error message if failed
  • status (text): RUNNING, COMPLETED, FAILED, SKIPPED
  • initiated_by (text): What triggered the refresh: SCHEDULER, MANUAL, or INITIAL
  • freshness_deadline (timestamptz): SLA deadline (duration schedules only; NULL for cron)
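A convenient inspection query over this log; the action column also reveals when the adaptive fallback chose FULL over DIFFERENTIAL:

```sql
-- Twenty most recent refreshes with action, duration, and outcome
SELECT pgs_id, action, status,
       end_time - start_time AS duration,
       rows_inserted, rows_deleted, error_message
FROM pgstream.pgs_refresh_history
ORDER BY start_time DESC
LIMIT 20;
```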

pgstream.pgs_change_tracking

CDC slot tracking per source table.

  • source_relid (oid): OID of the tracked source table
  • slot_name (text): Logical replication slot name
  • last_consumed_lsn (pg_lsn): Last consumed WAL position
  • tracked_by_pgs_ids (bigint[]): Array of ST IDs depending on this source

Configuration

Complete reference for all pg_stream GUC (Grand Unified Configuration) variables.


Overview

pg_stream exposes sixteen configuration variables in the pg_stream namespace. All can be set in postgresql.conf or at runtime via SET / ALTER SYSTEM.

Required postgresql.conf settings:

shared_preload_libraries = 'pg_stream'

The extension must be loaded via shared_preload_libraries because it registers GUC variables and a background worker at startup.

Note: wal_level = logical and max_replication_slots are not required by default. The default CDC mode (trigger) uses lightweight row-level triggers. If you set pg_stream.cdc_mode = 'auto' or 'wal', then wal_level = logical is needed for WAL-based capture (see pg_stream.cdc_mode).


GUC Variables

pg_stream.enabled

Enable or disable the pg_stream extension.

Type: bool
Default: true
Context: SUSET (superuser)
Restart required: No

When set to false, the background scheduler stops processing refreshes. Existing stream tables remain in the catalog but are not refreshed. Manual pgstream.refresh_stream_table() calls still work.

-- Disable automatic refreshes
SET pg_stream.enabled = false;

-- Re-enable
SET pg_stream.enabled = true;

pg_stream.scheduler_interval_ms

How often the background scheduler checks for stream tables that need refreshing.

Type: int
Default: 1000 (1 second)
Range: 100–60000 (100 ms to 60 s)
Context: SUSET
Restart required: No

Tuning Guidance:

  • Low-latency workloads (sub-second schedule): Set to 100–500.
  • Standard workloads (minutes of schedule): Default 1000 is appropriate.
  • Low-overhead workloads (many STs with long schedules): Increase to 5000–10000 to reduce scheduler overhead.

The scheduler interval does not determine refresh frequency — it determines how often the scheduler checks whether any ST's staleness exceeds its schedule (or whether a cron expression has fired). The actual refresh frequency is governed by schedule (duration or cron) and canonical period alignment.

SET pg_stream.scheduler_interval_ms = 500;

pg_stream.min_schedule_seconds

Minimum allowed schedule value (in seconds) when creating or altering a stream table with a duration-based schedule. This limit does not apply to cron expressions.

Type: int
Default: 60 (1 minute)
Range: 1–86400 (1 second to 24 hours)
Context: SUSET
Restart required: No

This acts as a safety guardrail to prevent users from setting impractically small schedules that would cause excessive refresh overhead.

Tuning Guidance:

  • Development/testing: Set to 1 for fast iteration.
  • Production: Keep at 60 or higher to prevent excessive WAL consumption and CPU usage.

-- Allow 10-second schedules (for testing)
SET pg_stream.min_schedule_seconds = 10;

pg_stream.max_consecutive_errors

Maximum consecutive refresh failures before a stream table is moved to ERROR status.

Type: int
Default: 3
Range: 1–100
Context: SUSET
Restart required: No

When a ST's consecutive_errors reaches this threshold:

  1. The ST status changes to ERROR.
  2. Automatic refreshes stop for this ST.
  3. Manual intervention is required: SELECT pgstream.alter_stream_table('...', status => 'ACTIVE').

Tuning Guidance:

  • Strict (production): 3 — fail fast to surface issues.
  • Lenient (development): 10–20 — tolerate transient errors.

SET pg_stream.max_consecutive_errors = 5;

pg_stream.change_buffer_schema

Schema where CDC change buffer tables are created.

Type: text
Default: 'pgstream_changes'
Context: SUSET
Restart required: No (but existing change buffers remain in the old schema)

Change buffer tables are named <schema>.changes_<oid> where <oid> is the source table's OID. Placing them in a dedicated schema keeps them out of the public namespace.

Tuning Guidance:

  • Generally leave at the default. Change only if pgstream_changes conflicts with an existing schema in your database.
SET pg_stream.change_buffer_schema = 'my_change_buffers';

pg_stream.max_concurrent_refreshes

Maximum number of stream tables that can be refreshed simultaneously.

Type: int
Default: 4
Range: 1–32
Context: SUSET
Restart required: No

Controls concurrency in the scheduler. Each refresh acquires an advisory lock, and the scheduler skips STs that exceed this limit.

Tuning Guidance:

  • Small databases (few STs): 1–4 is sufficient.
  • Large deployments (50+ STs): Increase to 8–16 if the server has spare CPU and I/O capacity.
  • Resource-constrained: Set to 1 for fully sequential refresh processing.

The optimal setting depends on:

  • Number of CPU cores available
  • I/O throughput (SSD vs HDD)
  • Complexity of the defining queries
  • Amount of concurrent OLTP workload
SET pg_stream.max_concurrent_refreshes = 8;

pg_stream.differential_max_change_ratio

Maximum change-to-table ratio before DIFFERENTIAL refresh falls back to FULL refresh.

Type: float
Default: 0.15 (15%)
Range: 0.0–1.0
Context: SUSET
Restart required: No

When the number of pending change buffer rows exceeds this fraction of the source table's estimated row count, the refresh engine switches from DIFFERENTIAL (which uses JSONB parsing and window functions) to FULL refresh. At high change rates FULL refresh is cheaper because it avoids the per-row JSONB overhead.

Special Values:

  • 0.0: Disable adaptive fallback — always use DIFFERENTIAL.
  • 1.0: Always fall back to FULL (effectively forces FULL mode).

Tuning Guidance:

  • OLTP with low change rates (< 5%): Default 0.15 is appropriate.
  • Batch-load workloads (bulk inserts): Lower to 0.05–0.10 so large batches trigger FULL refresh sooner.
  • Latency-sensitive (want deterministic refresh time): Set to 0.0 to always use DIFFERENTIAL.

-- Lower threshold for batch-heavy workloads
SET pg_stream.differential_max_change_ratio = 0.10;

-- Disable adaptive fallback
SET pg_stream.differential_max_change_ratio = 0.0;

pg_stream.cleanup_use_truncate

Use TRUNCATE instead of per-row DELETE for change buffer cleanup when the entire buffer is consumed by a refresh.

Type: bool
Default: true
Context: SUSET
Restart required: No

After a differential refresh consumes all rows from the change buffer, the engine must clean up the buffer table. TRUNCATE is O(1) regardless of row count, versus DELETE which must update indexes row-by-row. This saves 3–5 ms per refresh at 10%+ change rates.

Trade-off: TRUNCATE acquires an AccessExclusiveLock on the change buffer table. If concurrent DML on the source table is actively inserting into the same change buffer via triggers, this lock can cause brief contention.

Tuning Guidance:

  • Most workloads: Leave at true — the performance benefit outweighs the brief lock.
  • High-concurrency OLTP with continuous writes during refresh: Set to false if you observe lock-wait timeouts on the change buffer.
-- Use per-row DELETE for change buffer cleanup
SET pg_stream.cleanup_use_truncate = false;

pg_stream.merge_planner_hints

Inject SET LOCAL planner hints before MERGE execution during differential refresh.

Type: bool
Default: true
Context: SUSET
Restart required: No

When enabled, the refresh executor estimates the delta size and applies optimizer hints within the transaction:

  • Delta ≥ 100 rows: SET LOCAL enable_nestloop = off — forces hash joins instead of nested-loop joins.
  • Delta ≥ 10,000 rows: additionally SET LOCAL work_mem = '<N>MB' (see pg_stream.merge_work_mem_mb).

This reduces P95 latency spikes caused by PostgreSQL choosing nested-loop plans for medium/large delta sizes.

Tuning Guidance:

  • Most workloads: Leave at true — the hints improve tail latency without affecting small deltas.
  • Custom plan overrides: Set to false if you manage planner settings yourself or if the hints conflict with your pg_hint_plan configuration.
-- Disable planner hints
SET pg_stream.merge_planner_hints = false;

pg_stream.merge_work_mem_mb

work_mem value (in MB) applied via SET LOCAL when the delta exceeds 10,000 rows and planner hints are enabled.

Type: int
Default: 64 (64 MB)
Range: 8–4096 (8 MB to 4 GB)
Context: SUSET
Restart required: No

A higher value lets PostgreSQL use larger in-memory hash tables for the MERGE join, avoiding disk-spilling sort/merge strategies on large deltas. This setting is only applied when both merge_planner_hints = true and the delta exceeds 10,000 rows.

Tuning Guidance:

  • Servers with ample RAM (32+ GB): Increase to 128–256 for faster large-delta refreshes.
  • Memory-constrained: Lower to 16–32 or disable planner hints entirely.
  • Very large deltas (100K+ rows): Consider 256–512 if refresh latency matters.

SET pg_stream.merge_work_mem_mb = 128;

pg_stream.merge_strategy

Strategy for applying delta changes to the stream table during differential refresh.

Type: text
Default: 'auto'
Values: 'auto', 'merge', 'delete_insert'
Context: SUSET
Restart required: No

Values:

  • auto (default): Use MERGE for small deltas; switch to DELETE + INSERT when the delta exceeds 25% of the stream table row count.
  • merge: Always use a single MERGE statement. Best for correctness and simplicity.
  • delete_insert: Always use DELETE + INSERT. May be faster for very large deltas but has known limitations with aggregate/DISTINCT queries.

Tuning Guidance:

  • Most workloads: Leave at 'auto' or 'merge'.
  • Batch-heavy ETL: Try 'delete_insert' if MERGE performance degrades on large deltas, but test thoroughly with your specific queries.
-- Always use MERGE
SET pg_stream.merge_strategy = 'merge';

-- Always use DELETE + INSERT
SET pg_stream.merge_strategy = 'delete_insert';

pg_stream.use_prepared_statements

Use SQL PREPARE / EXECUTE for MERGE statements during differential refresh.

Type: bool
Default: true
Context: SUSET
Restart required: No

When enabled, the refresh executor issues PREPARE __pgs_merge_{id} on the first cache-hit cycle, then uses EXECUTE on subsequent cycles. After approximately 5 executions, PostgreSQL switches from a custom plan to a generic plan, saving 1–2 ms of parse/plan overhead per refresh.

Tuning Guidance:

  • Most workloads: Leave at true — the cumulative parse/plan savings are significant for frequently-refreshed stream tables.
  • Highly skewed data: Set to false if prepared-statement parameter sniffing produces poor plans (e.g., highly skewed LSN distributions causing bad join estimates).
-- Disable prepared statements
SET pg_stream.use_prepared_statements = false;

pg_stream.user_triggers

Control how user-defined triggers on stream tables are handled during refresh.

Type: text
Default: 'auto'
Values: 'auto', 'on', 'off'
Context: SUSET
Restart required: No

When a stream table has user-defined row-level triggers, the refresh engine can decompose the MERGE into explicit DELETE + UPDATE + INSERT statements so triggers fire with correct TG_OP, OLD, and NEW values.

Values:

  • auto (default): Automatically detect user triggers on the stream table. If present, use the explicit DML path; otherwise use MERGE.
  • on: Always use the explicit DML path, even without user triggers. Useful for testing.
  • off: Always use MERGE. User triggers are suppressed during refresh. This is the escape hatch if explicit DML causes issues.

Notes:

  • Row-level triggers do not fire during FULL refresh regardless of this setting. FULL refresh uses DISABLE TRIGGER USER / ENABLE TRIGGER USER to suppress them.
  • The explicit DML path adds ~25–60% overhead compared to MERGE for affected stream tables.
  • Stream tables without user triggers have zero overhead when using auto (only a fast pg_trigger check).
-- Auto-detect (default)
SET pg_stream.user_triggers = 'auto';

-- Always use explicit DML (for testing)
SET pg_stream.user_triggers = 'on';

-- Suppress triggers, use MERGE
SET pg_stream.user_triggers = 'off';
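A minimal sketch of the explicit DML path in action (the audit table and function names are hypothetical), reusing the order_totals stream table from earlier. With user_triggers = 'auto', the presence of this trigger makes the refresh engine switch to explicit DELETE + UPDATE + INSERT so TG_OP, OLD, and NEW are populated correctly:

```sql
-- Audit every change the refresh engine applies to the stream table
CREATE TABLE order_totals_audit (
    op TEXT, customer_id INT, at TIMESTAMPTZ DEFAULT now()
);

CREATE FUNCTION log_order_totals() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO order_totals_audit (op, customer_id)
    VALUES (TG_OP,
            CASE WHEN TG_OP = 'DELETE' THEN OLD.customer_id
                 ELSE NEW.customer_id END);
    RETURN NULL;  -- AFTER row trigger; return value is ignored
END $$;

CREATE TRIGGER order_totals_log
AFTER INSERT OR UPDATE OR DELETE ON pgstream.order_totals
FOR EACH ROW EXECUTE FUNCTION log_order_totals();
```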

pg_stream.block_source_ddl

When enabled, column-affecting DDL (e.g., ALTER TABLE ... DROP COLUMN, ALTER TABLE ... ALTER COLUMN ... TYPE) on source tables tracked by stream tables is blocked with an ERROR instead of silently marking stream tables for reinitialization.

This is useful in production environments where you want to prevent accidental schema changes that would trigger expensive full recomputation of downstream stream tables.

Default: false
Context: Superuser

-- Block column-affecting DDL on tracked source tables
SET pg_stream.block_source_ddl = true;

-- Allow DDL (stream tables will be marked for reinit instead)
SET pg_stream.block_source_ddl = false;

Note: Only column-affecting changes are blocked. Benign DDL (adding indexes, comments, constraints) is always allowed regardless of this setting.


pg_stream.cdc_mode

CDC (Change Data Capture) mechanism selection.

  • 'trigger' (default): Always use row-level triggers for change capture
  • 'auto': Use triggers at creation; transition to WAL-based CDC if wal_level = logical
  • 'wal': Require WAL-based CDC (fails if wal_level != logical)

Default: 'trigger'

-- Always use triggers (default, zero-config)
SET pg_stream.cdc_mode = 'trigger';

-- Enable automatic trigger → WAL transition
SET pg_stream.cdc_mode = 'auto';

-- Require WAL-based CDC (error if wal_level != logical)
SET pg_stream.cdc_mode = 'wal';

pg_stream.wal_transition_timeout

Maximum time (seconds) to wait for the WAL decoder to catch up during the transition from trigger-based to WAL-based CDC. If the decoder has not caught up within this timeout, the system falls back to triggers.

Default: 300 (5 minutes)
Range: 10–3600

SET pg_stream.wal_transition_timeout = 300;

Complete postgresql.conf Example

# Required
shared_preload_libraries = 'pg_stream'

# Optional tuning
pg_stream.enabled = true
pg_stream.scheduler_interval_ms = 1000
pg_stream.min_schedule_seconds = 60
pg_stream.max_consecutive_errors = 3
pg_stream.change_buffer_schema = 'pgstream_changes'
pg_stream.max_concurrent_refreshes = 4
pg_stream.differential_max_change_ratio = 0.15
pg_stream.cleanup_use_truncate = true
pg_stream.merge_planner_hints = true
pg_stream.merge_work_mem_mb = 64
pg_stream.merge_strategy = 'auto'
pg_stream.use_prepared_statements = true
pg_stream.user_triggers = 'auto'
pg_stream.block_source_ddl = false
pg_stream.cdc_mode = 'trigger'
pg_stream.wal_transition_timeout = 300

Runtime Configuration

All GUC variables can be changed at runtime by a superuser:

-- View current settings
SHOW pg_stream.enabled;
SHOW pg_stream.scheduler_interval_ms;

-- Change for current session
SET pg_stream.max_concurrent_refreshes = 8;

-- Change persistently (requires reload)
ALTER SYSTEM SET pg_stream.scheduler_interval_ms = 500;
SELECT pg_reload_conf();

Further Reading

Installation Guide

Prerequisites

  • PostgreSQL: 18.x

Building from source additionally requires Rust 1.82+ and pgrx 0.17.x. Pre-built release artifacts only need a running PostgreSQL 18.x instance.


Installing from a Pre-built Release

1. Download the release archive

Download the archive for your platform from the GitHub Releases page:

  • Linux x86_64: pg_stream-<ver>-pg18-linux-amd64.tar.gz
  • macOS Apple Silicon: pg_stream-<ver>-pg18-macos-arm64.tar.gz
  • Windows x64: pg_stream-<ver>-pg18-windows-amd64.zip

Optionally verify the checksum against SHA256SUMS.txt from the same release:

sha256sum -c SHA256SUMS.txt

2. Extract and install

Linux / macOS:

tar xzf pg_stream-0.1.1-pg18-linux-amd64.tar.gz
cd pg_stream-0.1.1-pg18-linux-amd64

sudo cp lib/*.so  "$(pg_config --pkglibdir)/"
sudo cp extension/*.control extension/*.sql "$(pg_config --sharedir)/extension/"

Windows (PowerShell):

Expand-Archive pg_stream-0.1.1-pg18-windows-amd64.zip -DestinationPath .
cd pg_stream-0.1.1-pg18-windows-amd64

Copy-Item lib\*.dll  "$(pg_config --pkglibdir)\"
Copy-Item extension\* "$(pg_config --sharedir)\extension\"

3. Using with CloudNativePG (Kubernetes)

pg_stream is distributed as an OCI extension image for use with CloudNativePG Image Volume Extensions.

Requirements: Kubernetes 1.33+, CNPG 1.28+, PostgreSQL 18.

# Pull the extension image
docker pull ghcr.io/grove/pg_stream-ext:0.1.1

See cnpg/cluster-example.yaml and cnpg/database-example.yaml for complete Cluster and Database deployment examples.

4. Local Docker development (without Kubernetes)

For local development without Kubernetes, install the extension files manually into a standard PostgreSQL container from a release archive:

# Extract extension files from the release archive
tar xzf pg_stream-0.1.1-pg18-linux-amd64.tar.gz
cd pg_stream-0.1.1-pg18-linux-amd64

# Run PostgreSQL with the extension mounted
docker run --rm \
  -v $PWD/lib/pg_stream.so:/usr/lib/postgresql/18/lib/pg_stream.so:ro \
  -v $PWD/extension/:/tmp/ext/:ro \
  -e POSTGRES_PASSWORD=postgres \
  postgres:18.1 \
  sh -c 'cp /tmp/ext/* /usr/share/postgresql/18/extension/ && \
         exec postgres -c shared_preload_libraries=pg_stream'

Building from Source

1. Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

2. Install pgrx

cargo install --locked cargo-pgrx --version 0.17.0
cargo pgrx init --pg18 $(pg_config --bindir)/pg_config

3. Build the Extension

# Development build (faster compilation)
cargo pgrx install --pg-config $(pg_config --bindir)/pg_config

# Release build (optimized, for production)
cargo pgrx install --release --pg-config $(pg_config --bindir)/pg_config

# Package for deployment (creates installable artifacts)
cargo pgrx package --pg-config $(pg_config --bindir)/pg_config

PostgreSQL Configuration

Add the following to postgresql.conf before starting PostgreSQL:

# Required — loads the extension shared library at server start
shared_preload_libraries = 'pg_stream'

# Recommended — must accommodate scheduler + refresh workers
max_worker_processes = 8

Note: wal_level = logical and max_replication_slots are not required for the default trigger-based CDC. They become necessary only if you opt into WAL-based capture via pg_stream.cdc_mode = 'auto' or 'wal' (see the Configuration section).

Restart PostgreSQL after modifying these settings:

pg_ctl restart -D /path/to/data
# or
systemctl restart postgresql

Extension Installation

Connect to the target database and run:

CREATE EXTENSION pg_stream;

This creates:

  • The pgstream schema with catalog tables and SQL functions
  • The pgstream_changes schema for change buffer tables
  • Event triggers for DDL tracking
  • The pgstream.pg_stat_stream_tables monitoring view

Verification

After installation, verify everything is working:

-- Check the extension version
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_stream';

-- Or get a full status overview (includes version, scheduler state, stream table count)
SELECT * FROM pgstream.pgs_status();

Inspecting the installation

-- Check the installed version
SELECT extversion FROM pg_extension WHERE extname = 'pg_stream';

-- Check which schemas were created
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name IN ('pgstream', 'pgstream_changes');

-- Check all registered GUC variables
SHOW pg_stream.enabled;
SHOW pg_stream.scheduler_interval_ms;
SHOW pg_stream.max_concurrent_refreshes;

-- Check the scheduler background worker is running
SELECT * FROM pgstream.pgs_status();

-- List all stream tables
SELECT pgs_schema, pgs_name, status, refresh_mode, is_populated
FROM pgstream.pgs_stream_tables;

-- Check that the shared library loaded correctly
SELECT * FROM pg_extension WHERE extname = 'pg_stream';

-- Verify the catalog tables exist
SELECT tablename
FROM pg_tables
WHERE schemaname = 'pgstream'
ORDER BY tablename;

Quick functional test

CREATE TABLE test_source (id INT PRIMARY KEY, val TEXT);
INSERT INTO test_source VALUES (1, 'hello');

SELECT pgstream.create_stream_table(
    'test_st',
    'SELECT id, val FROM test_source',
    '1m',
    'FULL'
);

SELECT * FROM test_st;
-- Should return: 1 | hello

-- Clean up
SELECT pgstream.drop_stream_table('test_st');
DROP TABLE test_source;

Uninstallation

-- Drop all stream tables first
SELECT pgstream.drop_stream_table(pgs_schema || '.' || pgs_name)
FROM pgstream.pgs_stream_tables;

-- Drop the extension
DROP EXTENSION pg_stream CASCADE;

Remove pg_stream from shared_preload_libraries in postgresql.conf and restart PostgreSQL.

Troubleshooting

Unit tests crash on macOS 26+ (symbol not found in flat namespace)

macOS 26 (Tahoe) changed dyld to eagerly resolve all flat-namespace symbols at binary load time. pgrx extensions reference PostgreSQL server-internal symbols (e.g. CacheMemoryContext, SPI_connect) via the -Wl,-undefined,dynamic_lookup linker flag. These symbols are normally provided by the postgres executable when the extension is loaded as a shared library — but for cargo test --lib there is no postgres process, so the test binary aborts immediately:

dyld[66617]: symbol not found in flat namespace '_CacheMemoryContext'

This affects local development only — integration tests, E2E tests, and the extension itself running inside PostgreSQL are unaffected.

The fix is built into the just test-unit recipe. It automatically:

  1. Compiles a tiny C stub library (scripts/pg_stub.c, built to target/libpg_stub.dylib) that provides NULL/no-op definitions for the ~28 PostgreSQL symbols the test binary references.
  2. Compiles the test binary with --no-run.
  3. Runs the binary with DYLD_INSERT_LIBRARIES pointing to the stub.

The stub is only built on macOS 26+. On Linux or older macOS, just test-unit runs cargo test --lib directly with no changes.

Note: The stub symbols are never called — unit tests exercise pure Rust logic only. If a test accidentally calls a PostgreSQL function it will crash with a NULL dereference (the desired fail-fast behavior).

If you run unit tests without just (e.g. directly via cargo test --lib), you can use the wrapper script instead:

./scripts/run_unit_tests.sh pg18

# With test name filter:
./scripts/run_unit_tests.sh pg18 -- test_parse_basic

Extension fails to load

Ensure shared_preload_libraries = 'pg_stream' is set and PostgreSQL has been restarted (not just reloaded). The extension requires shared memory initialization at startup.

Background worker not starting

Check that max_worker_processes is high enough to accommodate the scheduler worker plus any refresh workers. The default of 8 is usually sufficient.

Check logs for details

The extension logs at various levels. Enable debug logging for more detail:

SET client_min_messages TO debug1;

Architecture

This document describes the internal architecture of pg_stream — a PostgreSQL 18 extension that implements stream tables with differential view maintenance. For a high-level description of what pg_stream does and why, read ESSENCE.md. For release milestones and future plans, see ROADMAP.md.


High-Level Overview

┌─────────────────────────────────────────────────────────────────┐
│                     PostgreSQL 18 Backend                       │
│                                                                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────────┐   │
│  │  Source  │   │  Source  │   │  Storage │   │  Storage    │   │
│  │  Table A │   │  Table B │   │  Table X │   │  Table Y    │   │
│  └────┬─────┘   └────┬─────┘   └────▲─────┘   └────▲────────┘   │
│       │              │              │              │            │
│  ═════╪══════════════╪══════════════╪══════════════╪════════    │
│       │              │              │              │            │
│  ┌────▼──────────────▼────┐   ┌────┴──────────────┴────┐        │
│  │  Hybrid CDC Layer      │   │  Delta Application     │        │
│  │  Triggers ──or── WAL   │   │  (INSERT/DELETE diffs) │        │
│  └────────────┬───────────┘   └────────────▲───────────┘        │
│               │                            │                    │
│  ┌────────────▼───────────┐   ┌────────────┴───────────┐        │
│  │   Change Buffer        │   │   DVM Engine           │        │
│  │   (pgstream_changes.*) │   │   (Operator Tree)      │        │
│  └────────────┬───────────┘   └────────────▲───────────┘        │
│               │                            │                    │
│               └────────────┬───────────────┘                    │
│                            │                                    │
│  ┌─────────────────────────▼─────────────────────────────┐      │
│  │              Refresh Engine                           │      │
│  │  ┌──────────┐  ┌──────────┐  ┌─────────────────────┐  │      │
│  │  │ Frontier │  │ DAG      │  │ Scheduler           │  │      │
│  │  │ Tracker  │  │ Resolver │  │ (canonical schedule)│  │      │
│  │  └──────────┘  └──────────┘  └─────────────────────┘  │      │
│  └───────────────────────────────────────────────────────┘      │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                      Catalog (pgstream.*)                  │ │
│  │  pgs_stream_tables │ pgs_dependencies │ pgs_refresh_history│ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                  Monitoring Layer                    │       │
│  │  st_refresh_stats │ slot_health │ check_cdc_health   │       │
│  │  explain_st │ views │ NOTIFY alerting                │       │
│  └──────────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘

Component Details

1. SQL API Layer (src/api.rs)

The public entry point for users. All operations are exposed as #[pg_extern] functions in the pgstream schema:

  • create_stream_table — Applies a chain of auto-rewrite passes (view inlining → DISTINCT ON → GROUPING SETS → scalar subquery → SubLinks in OR → multi-PARTITION BY windows), parses the defining query, builds an operator tree, creates the storage table, registers CDC slots, populates the catalog, and optionally performs an initial full refresh.
  • alter_stream_table — Modifies schedule, refresh mode, or status (ACTIVE/SUSPENDED).
  • drop_stream_table — Removes the storage table, catalog entries, and cleans up CDC slots.
  • refresh_stream_table — Triggers a manual refresh (same path as automatic scheduling).
  • pgs_status — Returns a summary of all registered stream tables.

2. Catalog (src/catalog.rs)

The catalog manages persistent metadata stored in PostgreSQL tables within the pgstream schema:

| Table | Purpose |
| --- | --- |
| pgstream.pgs_stream_tables | Core metadata: name, query, schedule, status, frontier, etc. |
| pgstream.pgs_dependencies | DAG edges from ST to source tables |
| pgstream.pgs_refresh_history | Audit log of every refresh operation |
| pgstream.pgs_change_tracking | Per-source CDC slot metadata |

Schema creation is handled by extension_sql!() macros that run at CREATE EXTENSION time.

Entity-Relationship Diagram

erDiagram
    pgs_stream_tables {
        bigserial pgs_id PK
        oid pgs_relid UK "OID of materialized storage table"
        text pgs_name
        text pgs_schema
        text defining_query
        text original_query "User's original SQL (pre-inlining)"
        text schedule "Duration or cron expression"
        text refresh_mode "FULL | DIFFERENTIAL"
        text status "INITIALIZING | ACTIVE | SUSPENDED | ERROR"
        boolean is_populated
        timestamptz data_timestamp "Freshness watermark"
        jsonb frontier "DBSP-style version frontier"
        timestamptz last_refresh_at
        int consecutive_errors
        boolean needs_reinit
        float8 auto_threshold
        float8 last_full_ms
        timestamptz created_at
        timestamptz updated_at
    }

    pgs_dependencies {
        bigint pgs_id PK,FK "References pgs_stream_tables.pgs_id"
        oid source_relid PK "OID of source table"
        text source_type "TABLE | STREAM_TABLE | VIEW"
        text_arr columns_used "Column-level lineage"
        text cdc_mode "TRIGGER | TRANSITIONING | WAL"
        text slot_name "Replication slot (WAL mode)"
        pg_lsn decoder_confirmed_lsn "WAL decoder progress"
        timestamptz transition_started_at "Trigger→WAL transition start"
    }

    pgs_refresh_history {
        bigserial refresh_id PK
        bigint pgs_id FK "References pgs_stream_tables.pgs_id"
        timestamptz data_timestamp
        timestamptz start_time
        timestamptz end_time
        text action "NO_DATA | FULL | DIFFERENTIAL | REINITIALIZE | SKIP"
        bigint rows_inserted
        bigint rows_deleted
        text error_message
        text status "RUNNING | COMPLETED | FAILED | SKIPPED"
        text initiated_by "SCHEDULER | MANUAL | INITIAL"
        timestamptz freshness_deadline
    }

    pgs_change_tracking {
        oid source_relid PK "OID of tracked source table"
        text slot_name "Trigger function name"
        pg_lsn last_consumed_lsn
        bigint_arr tracked_by_pgs_ids "ST IDs sharing this source"
    }

    pgs_stream_tables ||--o{ pgs_dependencies : "has sources"
    pgs_stream_tables ||--o{ pgs_refresh_history : "has refresh history"
    pgs_stream_tables }o--o{ pgs_change_tracking : "tracks via pgs_ids array"

Note: Change buffer tables (pgstream_changes.changes_<oid>) are created dynamically per source table OID and live in the separate pgstream_changes schema.

3. CDC / Change Data Capture (src/cdc.rs, src/wal_decoder.rs)

pg_stream uses a hybrid CDC architecture that starts with triggers and optionally transitions to WAL-based (logical replication) capture for lower write-side overhead.

Trigger Mode (default)

  1. Trigger Management — Creates AFTER INSERT OR UPDATE OR DELETE row-level triggers (pg_stream_cdc_<oid>) on each tracked source table. Each trigger fires a PL/pgSQL function (pg_stream_cdc_fn_<oid>()) that writes changes to the buffer table.
  2. Change Buffering — Decoded changes are written to per-source change buffer tables in the pgstream_changes schema. Each row captures the LSN (pg_current_wal_lsn()), transaction ID, action type (I/U/D), and the new/old row data as JSONB via to_jsonb().
  3. Cleanup — Consumed changes are deleted after each successful refresh via delete_consumed_changes(), bounded by the upper LSN to prevent unbounded scans.
  4. Lifecycle — Triggers and trigger functions are automatically created when a source table is first tracked and dropped when the last stream table referencing a source is removed.

The trigger approach was chosen as the default for transaction safety (triggers can be created in the same transaction as DDL), simplicity (no slot management, no wal_level = logical requirement), and immediate visibility (changes are visible in buffer tables as soon as the source transaction commits).

WAL Mode (optional, automatic transition)

When pg_stream.cdc_mode is set to 'auto' or 'wal' and wal_level = logical is available, the system transitions from trigger-based to WAL-based CDC after the first successful refresh:

  1. WAL Availability Detection — At stream table creation, checks whether wal_level = logical is configured. If so, the source dependency is marked for WAL transition.
  2. WAL Decoder Background Worker — A dedicated background worker (src/wal_decoder.rs) polls logical replication slots and writes decoded changes into the same change buffer tables used by triggers, ensuring a uniform format for the DVM engine.
  3. Transition Orchestration — The transition is a three-step process: (a) create a replication slot, (b) wait for the decoder to catch up to the trigger's last confirmed LSN, (c) drop the trigger and switch the dependency to WAL mode. If the decoder doesn't catch up within pg_stream.wal_transition_timeout (default 300s), the system falls back to triggers.
  4. CDC Mode Tracking — Each source dependency in pgs_dependencies carries a cdc_mode column (TRIGGER / TRANSITIONING / WAL) and WAL-specific metadata (slot_name, decoder_confirmed_lsn, transition_started_at).

See ADR-001 and ADR-002 in plans/adrs/PLAN_ADRS.md for the original design rationale and plans/sql/PLAN_HYBRID_CDC.md for the full implementation plan.

4. DVM Engine (src/dvm/)

The Differential View Maintenance engine is the core of the system. It transforms the defining SQL query into an executable operator tree that can compute deltas efficiently.

Auto-Rewrite Pipeline (src/dvm/parser.rs)

Before the defining query is parsed into an operator tree, it passes through a chain of auto-rewrite passes that normalize SQL constructs the DVM parser doesn't handle directly:

| Pass | Function | Purpose |
| --- | --- | --- |
| #0 | rewrite_views_inline() | Replace view references with (view_definition) AS alias subqueries |
| #1 | rewrite_distinct_on() | Convert DISTINCT ON to ROW_NUMBER() OVER (…) = 1 window subquery |
| #2 | rewrite_grouping_sets() | Decompose GROUPING SETS / CUBE / ROLLUP into UNION ALL of GROUP BY |
| #3 | rewrite_scalar_subquery_in_where() | Convert WHERE col > (SELECT …) to CROSS JOIN |
| #4 | rewrite_sublinks_in_or() | Split WHERE a OR EXISTS (…) into UNION branches |
| #5 | rewrite_multi_partition_windows() | Split multiple PARTITION BY clauses into joined subqueries |

The view inlining pass (#0) runs first so that view definitions containing DISTINCT ON, GROUPING SETS, etc. are further rewritten by downstream passes. Nested views are expanded via a fixpoint loop (max depth 10).
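
The fixpoint expansion can be sketched as follows. This is an illustrative Python model, not the extension's Rust implementation; the regex-based substitution and the helper names are assumptions, only the "(definition) AS alias" shape and the depth cap of 10 come from the text above.

```python
# Sketch of pass #0: fixpoint view inlining with a depth cap.
import re

MAX_DEPTH = 10  # documented nested-view expansion limit

def inline_views(sql, view_defs, depth=0):
    """Replace each referenced view with "(definition) AS name",
    recursing into definitions so nested views expand too."""
    if depth >= MAX_DEPTH:
        return sql
    for name, definition in view_defs.items():
        pattern = rf'\b{re.escape(name)}\b'
        if re.search(pattern, sql):
            body = inline_views(definition, view_defs, depth + 1)
            sql = re.sub(pattern, lambda m: f'({body}) AS {name}', sql, count=1)
    return sql

views = {'active_orders': "SELECT * FROM orders WHERE status = 'active'"}
inlined = inline_views('SELECT count(*) FROM active_orders', views)
```

The recursion mirrors why pass #0 runs first: the inlined definition is itself rewritten, so a view containing DISTINCT ON still reaches the later passes.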

Query Parser (src/dvm/parser.rs)

Parses the defining query using PostgreSQL's internal parser (via pgrx raw_parser) and extracts:

  • WITH clause — CTE definitions (non-recursive: inline expansion or shared delta; recursive: detected for mode gating)
  • Target list — output columns
  • FROM clause — source tables, joins, subqueries, and CTE references
  • WHERE clause — filters
  • GROUP BY / aggregate functions
  • DISTINCT / UNION ALL / INTERSECT / EXCEPT

The parser produces an OpTree — a tree of operator nodes. CTE handling follows a tiered approach:

  1. Tier 1 (Inline Expansion) — Non-recursive CTEs referenced once are expanded into Subquery nodes, equivalent to subqueries in FROM.
  2. Tier 2 (Shared Delta) — Non-recursive CTEs referenced multiple times produce CteScan nodes that share a single delta computation via a CTE registry and delta cache.
  3. Tier 3a/3b/3c (Recursive) — Recursive CTEs (WITH RECURSIVE) are detected via query_has_recursive_cte(). In FULL mode, the query executes as-is. In DIFFERENTIAL mode, the strategy is auto-selected: semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, or recomputation fallback when CTE columns don't match ST storage.
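
Tier 3a's semi-naive strategy can be illustrated with a transitive-closure CTE. This is a minimal Python sketch of the evaluation idea, not the engine's SQL generation; the incremental-insert rule shown is a standard formulation for this particular query, assumed here for illustration.

```python
# Semi-naive evaluation: each round joins only the previous round's
# new facts against the base relation, instead of the whole closure.
def transitive_closure(edges):
    closure, delta = set(edges), set(edges)
    while delta:
        delta = {(a, d) for (a, b) in delta
                        for (c, d) in edges if b == c} - closure
        closure |= delta
    return closure

def insert_edge(closure, x, y):
    """INSERT-only delta for a new edge (x, y): every node that reaches
    x becomes connected to every node reachable from y."""
    sources = {a for (a, b) in closure if b == x} | {x}
    targets = {d for (c, d) in closure if c == y} | {y}
    return {(a, d) for a in sources for d in targets} - closure
```

Deletions are the hard case, which is why mixed changes fall back to Delete-and-Rederive rather than this direct rule.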

Operators (src/dvm/operators/)

Each operator knows how to generate a delta query — given a set of changes to its inputs, it produces the corresponding changes to its output:

| Operator | Delta Strategy |
| --- | --- |
| Scan | Direct passthrough of CDC changes |
| Filter | Apply WHERE predicate to deltas |
| Project | Apply column projection to deltas |
| Join | Join deltas against the other side's current state |
| OuterJoin | LEFT/RIGHT outer join with NULL padding |
| FullJoin | FULL OUTER JOIN with 8-part delta (both sides may produce NULLs) |
| Aggregate | Recompute group values where affected keys changed |
| Distinct | COUNT-based duplicate tracking |
| UnionAll | Merge deltas from both branches |
| Intersect | Dual-count multiplicity with LEAST boundary crossing |
| Except | Dual-count multiplicity with GREATEST(0, L-R) boundary crossing |
| Subquery | Transparent delegation + optional column renaming (CTEs, subselects) |
| CteScan | Shared delta lookup from CTE cache (multi-reference CTEs) |
| RecursiveCte | Semi-naive / DRed / recomputation for WITH RECURSIVE |
| Window | Partition-based recomputation for window functions |
| LateralFunction | Row-scoped recomputation for SRFs in FROM (jsonb_array_elements, unnest, etc.) |
| LateralSubquery | Row-scoped recomputation for correlated subqueries in LATERAL FROM |
| SemiJoin | EXISTS / IN subquery delta via semi-join |
| AntiJoin | NOT EXISTS / NOT IN subquery delta via anti-join |
| ScalarSubquery | Correlated scalar subquery in SELECT list |

See DVM_OPERATORS.md for detailed descriptions.
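
The Join row is the bilinear expansion from the DBSP framework: the change to A ⋈ B is ΔA ⋈ B + A ⋈ ΔB + ΔA ⋈ ΔB. A minimal sketch using Python Counters as Z-sets (signed multisets), with hypothetical row data:

```python
# Z-sets modeled as Counter{(join_key, payload): signed multiplicity}.
from collections import Counter

def zjoin(r, s):
    """Join two Z-sets on the key; multiplicities multiply."""
    out = Counter()
    for (k1, a), m in r.items():
        for (k2, b), n in s.items():
            if k1 == k2:
                out[(k1, a, b)] += m * n
    return Counter({t: w for t, w in out.items() if w})

def zadd(*zs):
    """Sum Z-sets, dropping tuples whose multiplicity cancels to zero."""
    out = Counter()
    for z in zs:
        out.update(z)
    return Counter({t: w for t, w in out.items() if w})

A  = Counter({(1, 'a'): 1, (2, 'c'): 1})   # current left state
B  = Counter({(1, 'x'): 1, (3, 'z'): 1})   # current right state
dA = Counter({(3, 'e'): 1})                # one insert on the left
dB = Counter({(1, 'x'): -1})               # one delete on the right

# The three bilinear terms give the exact change to the join's output:
delta = zadd(zjoin(dA, B), zjoin(A, dB), zjoin(dA, dB))
```

Because the terms involve at most one full-sized relation each, the work is proportional to the delta sizes, which is what makes a one-row change cheap.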

Diff Engine (src/dvm/diff.rs)

Generates the final diff SQL that:

  1. Computes the delta from the operator tree
  2. Produces ('+', row) for inserts and ('-', row) for deletes
  3. Applies the diff via DELETE matching old rows and INSERT for new rows
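
The application step can be modeled in a few lines. This is a hypothetical in-memory model keyed by the internal __pgs_row_id column; the real step runs the DELETE + INSERT pair as SQL against the storage table.

```python
# Apply a (+/-) diff to a storage table modeled as {row_id: row}.
def apply_diff(storage, diff):
    """diff: iterable of (sign, row_id, row) with sign '+' or '-'."""
    for sign, row_id, row in diff:
        if sign == '-':
            storage.pop(row_id, None)   # DELETE matching old rows
        else:
            storage[row_id] = row       # INSERT new rows
    return storage

table = {101: ('widget', 3)}
diff = [('-', 101, ('widget', 3)), ('+', 102, ('widget', 4))]
apply_diff(table, diff)
```

An UPDATE on a source row thus appears as a paired '-' and '+' with different row IDs or payloads.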

5. DAG / Dependency Graph (src/dag.rs)

Stream tables can depend on other stream tables (cascading), forming a Directed Acyclic Graph:

  • Cycle detection — Prevents circular dependencies at creation time using DFS.
  • Topological ordering — Determines refresh order: upstream STs must be refreshed before downstream STs.
  • Cascade operations — When a source table changes, all transitive dependents are identified for refresh.
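
The ordering and cycle check can be sketched with Python's stdlib graphlib, standing in for the extension's Rust DFS; the stream-table names below are hypothetical.

```python
# Dependency graph as {stream_table: [its upstream sources]}.
from graphlib import TopologicalSorter, CycleError

deps = {
    'daily_totals':  ['orders'],
    'weekly_totals': ['daily_totals'],
}

# static_order() yields sources before dependents -- the refresh order.
order = list(TopologicalSorter(deps).static_order())

def has_cycle(graph):
    """Creation-time guard: reject graphs with circular dependencies."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
```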

6. Version / Frontier Tracking (src/version.rs)

Implements a per-source frontier (JSONB map of source_oid → LSN) to track exactly how far each stream table has consumed changes:

  • Read frontier — Before refresh, read the frontier to know where to start consuming changes.
  • Advance frontier — After a successful refresh, the frontier is updated to the latest consumed LSN.
  • Consistent snapshots — The frontier ensures that each refresh processes a contiguous, non-overlapping window of changes.

Delayed View Semantics (DVS) Guarantee

The contents of every stream table are logically equivalent to evaluating its defining query at some past point in time — the data_timestamp. The scheduler refreshes STs in topological order so that when ST B references upstream ST A, A has already been refreshed to the target data_timestamp before B runs its delta query against A's contents. The frontier lifecycle is:

  1. Created — on first full refresh; records the LSN of each source at that moment.
  2. Advanced — on each differential refresh; the old frontier becomes the lower bound and the new frontier (with fresh LSNs) the upper bound. The DVM engine reads exactly the changes that fall between the two.
  3. Reset — on reinitialize; a fresh frontier is created from scratch.
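
The lifecycle above can be sketched with the frontier modeled as a plain dict of source OID to LSN (the JSONB map). The OIDs, LSN values, and helper names are illustrative, not the extension's API.

```python
# Frontier = {source_oid: last-consumed LSN}.
def change_window(frontier, current_lsns):
    """Per-source bounds for one refresh: lower bound is the old
    frontier, upper bound is the source's current LSN."""
    return {oid: (frontier.get(oid, 0), lsn)
            for oid, lsn in current_lsns.items()}

def advance(frontier, window):
    """After a successful refresh, each upper bound becomes the
    new frontier value for its source."""
    return {**frontier, **{oid: hi for oid, (_lo, hi) in window.items()}}

frontier = {16384: 1000, 16401: 2500}    # hypothetical OIDs and LSNs
window = change_window(frontier, {16384: 1200, 16401: 2500})
frontier = advance(frontier, window)
```

A source whose window has equal bounds (here 16401) contributed no changes since the last refresh.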

7. Refresh Engine (src/refresh.rs)

Orchestrates the complete refresh cycle:

┌──────────────┐
│  Check State │ → Is ST active? Has it been populated?
└──────┬───────┘
       │
 ┌─────▼──────┐
 │ Drain CDC  │ → Read WAL changes into change buffer tables
 └─────┬──────┘
       │
 ┌─────▼──────────────┐
 │ Determine Action   │ → FULL, DIFFERENTIAL, NO_DATA, REINITIALIZE, or SKIP?
 │                    │   (adaptive: if change ratio > pg_stream.differential_max_change_ratio,
 │                    │    downgrade DIFFERENTIAL → FULL automatically)
 └─────┬──────────────┘
       │
 ┌─────▼──────┐
 │ Execute    │ → Full: TRUNCATE + INSERT ... SELECT
 │            │   Differential: Generate & apply delta SQL
 └─────┬──────┘
       │
 ┌─────▼──────────────┐
 │ Record History     │ → Write to pgstream.pgs_refresh_history
 └─────┬──────────────┘
       │
 ┌─────▼──────────────┐
 │ Advance Frontier   │ → Update JSONB frontier in catalog
 └─────┬──────────────┘
       │
 ┌─────▼──────────────┐
 │ Reset Error Count  │ → On success, reset consecutive_errors to 0
 └──────────────────────┘
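
The "Determine Action" box can be sketched as a pure decision function. The predicate order and parameter names are assumptions for illustration (SKIP and backoff handling are omitted); the adaptive downgrade threshold is the documented pg_stream.differential_max_change_ratio default.

```python
# Decide which refresh action to take for one stream table.
def determine_action(is_populated, needs_reinit, pending_changes,
                     table_rows, max_change_ratio=0.15):
    if needs_reinit:
        return 'REINITIALIZE'            # upstream DDL changed
    if not is_populated:
        return 'FULL'                    # first population
    if pending_changes == 0:
        return 'NO_DATA'                 # nothing to consume
    if table_rows and pending_changes / table_rows > max_change_ratio:
        return 'FULL'                    # delta would touch too much
    return 'DIFFERENTIAL'
```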

8. Background Worker & Scheduling (src/scheduler.rs)

Registration & Lifecycle

pg_stream registers one PostgreSQL background worker — the scheduler — during _PG_init() (extension load). Because it is registered at startup, pg_stream must appear in shared_preload_libraries, which requires a server restart.

┌──────────────────────────────────────────────────────────────────┐
│                  PostgreSQL postmaster                           │
│                                                                  │
│  shared_preload_libraries = 'pg_stream'                          │
│       │                                                          │
│       ▼                                                          │
│  _PG_init()                                                      │
│    ├─ Register GUCs (pg_stream.enabled, scheduler_interval_ms …) │
│    ├─ Register shared memory (PgStreamSharedState, atomics)      │
│    └─ BackgroundWorkerBuilder::new("pg_stream scheduler")        │
│         .set_start_time(RecoveryFinished)                        │
│         .set_restart_time(5s)       ← auto-restart on crash      │
│         .load()                                                  │
│                                                                  │
│  After recovery finishes:                                        │
│       │                                                          │
│       ▼                                                          │
│  pg_stream_scheduler_main()         ← background worker starts   │
│    ├─ Attach SIGHUP + SIGTERM handlers                           │
│    ├─ Connect to SPI (database = "postgres")                     │
│    ├─ Crash recovery: mark stale RUNNING records as FAILED       │
│    └─ Enter main loop ─────────────────────────┐                 │
│         │                                      │                 │
│         ▼                                      │                 │
│     wait_latch(scheduler_interval_ms)          │                 │
│         │                                      │                 │
│     ┌───▼───────────────────────────────┐      │                 │
│     │ SIGTERM? → log + break            │      │                 │
│     │ pg_stream.enabled = false? → skip │      │                 │
│     │ Otherwise → scheduler tick        │      │                 │
│     └───┬───────────────────────────────┘      │                 │
│         │                                      │                 │
│         └──────────── loop ────────────────────┘                 │
└──────────────────────────────────────────────────────────────────┘

Key lifecycle properties:

| Property | Behaviour |
| --- | --- |
| Start condition | After PostgreSQL recovery finishes (RecoveryFinished) |
| Auto-restart | 5-second delay after an unexpected crash |
| Graceful shutdown | Handles SIGTERM — breaks the main loop and exits cleanly |
| Config reload | Handles SIGHUP — re-reads GUC values on the next latch wake |
| Crash recovery | On startup, any pgs_refresh_history rows stuck in RUNNING status are marked FAILED (the transaction that wrote them was rolled back by PostgreSQL, but the status row may have been committed in a prior transaction) |
| Database | Connects to the postgres database via SPI |

Scheduler Tick

Each tick of the main loop performs the following steps inside a single transaction:

  1. DAG rebuild — Compare the shared-memory DAG_REBUILD_SIGNAL counter against the local copy. If it advanced (a CREATE, ALTER, or DROP stream table occurred), rebuild the in-memory dependency graph (StDag) from the catalog.
  2. Topological traversal — Walk stream tables in dependency order (upstream before downstream). This ensures that when ST B references ST A, A is refreshed first.
  3. Per-ST evaluation — For each active ST:
    • Skip if in retry backoff (exponential, per-ST).
    • Skip if schedule/cron says not yet due.
    • Skip if an advisory lock indicates a concurrent refresh.
    • Check upstream change buffers for pending rows.
  4. Execute refresh — Acquire an advisory lock → record RUNNING in history → run FULL / DIFFERENTIAL / REINITIALIZE → store new frontier → release lock → record completion.
  5. WAL transitions — Advance any trigger→WAL CDC mode transitions (src/wal_decoder.rs).
  6. Slot health — Check replication slot health and emit NOTIFY alerts.
  7. Prune retry state — Remove backoff entries for STs that no longer exist.

Sequential Processing

The scheduler processes stream tables sequentially within a single background worker. Although pg_stream.max_concurrent_refreshes (default 4) exists as a GUC, it currently only prevents a manual pgstream.refresh_stream_table() call from overlapping with the scheduler on the same ST — it does not spawn additional workers. All STs are refreshed one at a time in topological order.

The PostgreSQL GUC max_worker_processes (default 8) sets the server-wide budget for all background workers (autovacuum, parallel query, logical replication, extensions). pg_stream consumes one slot from that budget.

Retry & Error Handling

Each ST maintains an in-memory RetryState (reset on scheduler restart):

  • Retryable errors (SPI failures, lock contention, slot issues) trigger exponential backoff.
  • Permanent errors (schema mismatch, user errors) skip backoff but increment consecutive_errors.
  • When consecutive_errors reaches pg_stream.max_consecutive_errors (default 3), the ST is auto-suspended and a NOTIFY alert is emitted.
  • Schema errors additionally set needs_reinit, triggering a REINITIALIZE on the next successful cycle.
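
A sketch of the per-ST retry transition. The base delay and the exact return shape are assumptions; the documented pieces are the exponential backoff for retryable errors, permanent errors skipping the wait, and auto-suspension at max_consecutive_errors (default 3).

```python
# One error transition for a stream table's in-memory retry state.
def next_retry(consecutive_errors, retryable, base_delay_s=5,
               max_consecutive_errors=3):
    consecutive_errors += 1
    if consecutive_errors >= max_consecutive_errors:
        return consecutive_errors, 'AUTO_SUSPEND', None
    # Retryable errors back off exponentially; permanent ones don't wait
    # (but still count toward auto-suspension).
    delay = base_delay_s * 2 ** (consecutive_errors - 1) if retryable else 0
    return consecutive_errors, 'RETRY', delay
```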

Scheduling Policy

Automatic refresh scheduling uses canonical periods (48·2ⁿ seconds, n = 0, 1, 2, …) snapped to the user's schedule:

  • Picks the largest canonical period ≤ the schedule, so a stream table is refreshed at least as often as requested.
  • For DOWNSTREAM schedule (NULL schedule), the ST refreshes only when explicitly triggered or when a downstream ST needs it.
  • Advisory locks prevent concurrent refreshes of the same ST.
  • The scheduler is driven by the background worker polling at the pg_stream.scheduler_interval_ms GUC interval.
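
The snapping rule can be written in a few lines. A sketch assuming schedules of at least 48 seconds, which pg_stream.min_schedule_seconds (default 60) already guarantees:

```python
# Snap a requested schedule to the canonical ladder 48 * 2**n seconds.
def canonical_period(schedule_seconds):
    """Largest 48 * 2**n that does not exceed the requested schedule."""
    period = 48
    while period * 2 <= schedule_seconds:
        period *= 2
    return period
```

For example, a 5-minute schedule (300 s) snaps to 192 s, so stream tables with similar schedules share refresh cadences instead of each polling on its own period.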

Shared Memory (src/shmem.rs)

The scheduler background worker and user sessions share a PgStreamSharedState structure protected by a PgLwLock. Key fields:

| Field | Type | Purpose |
| --- | --- | --- |
| dag_version | u64 | Incremented when the ST catalog changes; used by the scheduler to detect when the DAG needs rebuilding |
| scheduler_pid | i32 | PID of the scheduler background worker (0 if not running) |
| scheduler_running | bool | Whether the scheduler is active |
| last_scheduler_wake | i64 | Unix timestamp of the last scheduler wake cycle (for monitoring) |

A separate PgAtomic<AtomicU64> named DAG_REBUILD_SIGNAL is incremented by API functions (create, alter, drop) after catalog mutations. The scheduler compares its local copy against the atomic counter to detect when to rebuild its in-memory DAG without holding a lock.

A second PgAtomic<AtomicU64> named CACHE_GENERATION tracks DDL events that may invalidate cached delta or MERGE templates across backends. When DDL hooks fire (view change, ALTER TABLE, function change) or API functions mutate the catalog, CACHE_GENERATION is bumped. Each backend maintains a thread-local generation counter; on the next refresh, if the shared generation has advanced, the backend flushes its delta template cache, MERGE template cache, and prepared statements.
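
The generation-counter pattern is sketched below. The Python classes stand in for the CACHE_GENERATION atomic and a backend's template cache; the class and key names are modeled for illustration, not the extension's actual Rust types.

```python
# Generation-based cache invalidation across backends.
class SharedGeneration:
    """Stand-in for the shared CACHE_GENERATION atomic counter."""
    def __init__(self):
        self.value = 0
    def bump(self):          # DDL hook or catalog-mutating API call
        self.value += 1

class BackendCache:
    """Per-backend template cache with a local generation snapshot."""
    def __init__(self, shared):
        self.shared = shared
        self.local_gen = shared.value
        self.templates = {}
    def get(self, key, build):
        if self.local_gen != self.shared.value:
            self.templates.clear()            # generation moved: flush
            self.local_gen = self.shared.value
        if key not in self.templates:
            self.templates[key] = build()
        return self.templates[key]

gen = SharedGeneration()
cache = BackendCache(gen)
cache.get('delta:active_orders', lambda: 'SQL v1')
gen.bump()                                    # e.g. an ALTER TABLE fired
rebuilt = cache.get('delta:active_orders', lambda: 'SQL v2')
```

The comparison is cheap (one atomic read per refresh), so backends pay for a rebuild only after DDL actually happened.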

9. DDL Tracking (src/hooks.rs)

Event triggers monitor DDL changes to source tables and functions:

  • _on_ddl_end — Fires on ALTER TABLE to detect column adds/drops/type changes. If a source table used by a ST is altered, the ST's needs_reinit flag is set. Also detects CREATE OR REPLACE FUNCTION / ALTER FUNCTION — if the function appears in a ST's functions_used catalog column, the ST is marked for reinit.
  • _on_sql_drop — Fires on DROP TABLE to set needs_reinit for affected STs. Also detects DROP FUNCTION and marks affected STs for reinit.
  • Function name extraction — object_identity strings (e.g., public.my_func(integer, text)) are parsed to extract the bare function name, which is matched against the functions_used TEXT[] column in pgs_stream_tables.

Reinitialization is deferred until the next refresh cycle, which then performs a REINITIALIZE action (drop and recreate the storage table from the updated query).

10. Error Handling (src/error.rs)

Centralized error types using thiserror:

  • PgStreamError variants cover catalog access, SQL execution, CDC, DVM, DAG, and config errors.
  • Each refresh failure increments consecutive_errors.
  • When consecutive_errors reaches pg_stream.max_consecutive_errors (default 3), the ST is moved to ERROR status and suspended from automatic refresh.
  • Manual intervention (ALTER ... status => 'ACTIVE') resets the counter.

11. Monitoring (src/monitor.rs)

Provides observability functions:

  • st_refresh_stats — Aggregate statistics (total/successful/failed refreshes, avg duration, staleness status).
  • get_refresh_history — Per-ST audit trail.
  • get_staleness — Current staleness in seconds.
  • slot_health — Checks replication slot state and WAL retention.
  • check_cdc_health — Per-source CDC health status including mode, slot lag, confirmed LSN, and alerts.
  • explain_st — Describes the DVM plan for a given ST.
  • Views — pgstream.stream_tables_info (computed staleness) and pgstream.pg_stat_stream_tables (combined stats).

NOTIFY Alerting

Operational events are broadcast via PostgreSQL NOTIFY on the pg_stream_alert channel. Clients can subscribe with LISTEN pg_stream_alert; and receive JSON-formatted events:

| Event | Condition |
| --- | --- |
| stale | Data staleness exceeds 2× schedule |
| auto_suspended | ST suspended after pg_stream.max_consecutive_errors failures |
| reinitialize_needed | Upstream DDL change detected |
| slot_lag_warning | Replication slot WAL retention is growing |
| cdc_transition_complete | Source transitioned from trigger to WAL-based CDC |
| cdc_transition_failed | Trigger→WAL transition failed (fell back to triggers) |
| refresh_completed | Refresh completed successfully |
| refresh_failed | Refresh failed with an error |

12. Row ID Hashing (src/hash.rs)

Provides deterministic 64-bit row identifiers using xxHash (xxh64) with a fixed seed. Two SQL functions are exposed:

  • pgstream.pg_stream_hash(text) — Hash a single text value; used for simple single-column row IDs.
  • pgstream.pg_stream_hash_multi(text[]) — Hash multiple values (separated by a record-separator byte \x1E) for composite keys (join row IDs, GROUP BY keys).

Row IDs are written into every stream table's storage as an internal __pgs_row_id BIGINT column and are used by the delta application phase to match DELETE candidates precisely.
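
The composite-key scheme can be sketched as follows. The extension uses xxh64 with a fixed seed; the stdlib blake2b below is a stand-in so the sketch stays dependency-free, so the actual hash values differ, but the record-separator logic is the documented one.

```python
# Deterministic 64-bit row IDs from one or more key columns.
import hashlib

RS = b'\x1e'   # record-separator byte between composite key parts

def row_id(*parts):
    # Joining with \x1E keeps ('a', 'b') distinct from ('ab',).
    payload = RS.join(p.encode() for p in parts)
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    # Signed 64-bit, matching the __pgs_row_id BIGINT storage column.
    return int.from_bytes(digest, 'big', signed=True)
```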

13. Configuration (src/config.rs)

Sixteen GUC (Grand Unified Configuration) variables control runtime behavior: twelve core settings plus four performance-tuning knobs. See CONFIGURATION.md for details.

| GUC | Default | Purpose |
| --- | --- | --- |
| pg_stream.enabled | true | Master on/off switch for the scheduler |
| pg_stream.scheduler_interval_ms | 1000 | Scheduler background worker wake interval (ms) |
| pg_stream.min_schedule_seconds | 60 | Minimum allowed schedule |
| pg_stream.max_consecutive_errors | 3 | Errors before auto-suspending a ST |
| pg_stream.change_buffer_schema | pgstream_changes | Schema for change buffer tables |
| pg_stream.max_concurrent_refreshes | 4 | Maximum parallel refresh workers |
| pg_stream.differential_max_change_ratio | 0.15 | Change-to-table-size ratio above which DIFFERENTIAL falls back to FULL |
| pg_stream.cleanup_use_truncate | true | Use TRUNCATE instead of DELETE for change buffer cleanup when the entire buffer is consumed |
| pg_stream.user_triggers | 'auto' | User-defined trigger handling: auto / on / off |
| pg_stream.block_source_ddl | false | Block column-affecting DDL on tracked source tables instead of reinit |
| pg_stream.cdc_mode | 'trigger' | CDC mechanism: trigger / auto / wal |
| pg_stream.wal_transition_timeout | 300 | Max seconds to wait for WAL decoder catch-up during transition |
| pg_stream.merge_planner_hints | true | Inject SET LOCAL planner hints (disable nestloop, raise work_mem) before MERGE |
| pg_stream.merge_work_mem_mb | 64 | work_mem (MB) applied when delta exceeds 10 000 rows and planner hints enabled |
| pg_stream.merge_strategy | 'auto' | Delta application strategy: auto / merge / delete_insert |
| pg_stream.use_prepared_statements | true | Use SQL PREPARE/EXECUTE for cached MERGE templates |

Data Flow: End-to-End Refresh

 Source Table INSERT/UPDATE/DELETE
           │
           ▼
 Hybrid CDC Layer:
   ┌──────────────────────────────────────────────┐
   │ TRIGGER mode: Row-Level AFTER Trigger        │
   │   pg_stream_cdc_fn_<oid>() → buffer table    │
   │                                              │
   │ WAL mode: Logical Replication Slot           │
   │   wal_decoder bgworker → same buffer table   │
   └─────────────────────────────────────────────┘
           │
           ▼
 Change Buffer Table (pgstream_changes.changes_<oid>)
   Columns: change_id, lsn, xid, action (I/U/D), row_data (jsonb)
           │
           ▼
 DVM Engine: generate delta SQL from operator tree
   - Scan operator reads from change buffer
   - Filter/Project/Join transform the deltas
   - Aggregate recomputes affected groups
           │
           ▼
 Diff Engine: produce (+/-) diff rows
           │
           ▼
 Delta Application:
   DELETE FROM storage WHERE __pgs_row_id IN (removed)
   INSERT INTO storage SELECT ... FROM (added)
           │
           ▼
 Frontier Update: advance per-source LSN
           │
           ▼
 History Record: log to pgstream.pgs_refresh_history

Module Map

src/
├── lib.rs           # Extension entry, module declarations, _PG_init
├── bin/
│   └── pgrx_embed.rs  # pgrx SQL entity embedding (generated)
├── api.rs           # SQL API functions (create/alter/drop/refresh/status)
├── catalog.rs       # Catalog CRUD operations
├── cdc.rs           # Change data capture (triggers + WAL transition)
├── config.rs        # GUC variable registration
├── dag.rs           # Dependency graph (cycle detection, topo sort)
├── error.rs         # Centralized error types
├── hash.rs          # xxHash row ID generation (pg_stream_hash / pg_stream_hash_multi)
├── hooks.rs         # DDL event trigger handlers (_on_ddl_end, _on_sql_drop)
├── shmem.rs         # Shared memory state (PgStreamSharedState, DAG_REBUILD_SIGNAL, CACHE_GENERATION)
├── dvm/
│   ├── mod.rs       # DVM module root + recursive CTE orchestration
│   ├── parser.rs    # Query → OpTree converter (CTE extraction, subquery, window support)
│   ├── diff.rs      # Delta SQL generation (CTE delta cache)
│   ├── row_id.rs    # Row ID generation
│   └── operators/
│       ├── mod.rs           # Operator trait + registry
│       ├── scan.rs          # Table scan (CDC passthrough)
│       ├── filter.rs        # WHERE clause filtering
│       ├── project.rs       # Column projection
│       ├── join.rs          # Inner join
│       ├── join_common.rs   # Shared join utilities (snapshot subqueries, column disambiguation)
│       ├── outer_join.rs    # LEFT/RIGHT outer join
│       ├── full_join.rs     # FULL OUTER JOIN (8-part delta)
│       ├── aggregate.rs     # GROUP BY + aggregate functions (39 AggFunc variants)
│       ├── distinct.rs      # DISTINCT deduplication
│       ├── union_all.rs     # UNION ALL merging
│       ├── intersect.rs     # INTERSECT / INTERSECT ALL (dual-count LEAST)
│       ├── except.rs        # EXCEPT / EXCEPT ALL (dual-count GREATEST)
│       ├── subquery.rs      # Subquery / inlined CTE delegation
│       ├── cte_scan.rs      # Shared CTE delta (multi-reference)
│       ├── recursive_cte.rs # Recursive CTE (semi-naive + DRed + recomputation)
│       ├── window.rs        # Window function (partition recomputation)
│       ├── lateral_function.rs  # LATERAL SRF (row-scoped recomputation)
│       ├── lateral_subquery.rs  # LATERAL correlated subquery
│       ├── semi_join.rs     # EXISTS / IN subquery (semi-join delta)
│       ├── anti_join.rs     # NOT EXISTS / NOT IN subquery (anti-join delta)
│       └── scalar_subquery.rs   # Correlated scalar subquery in SELECT
├── monitor.rs       # Monitoring & observability functions
├── refresh.rs       # Refresh orchestration
├── scheduler.rs     # Automatic scheduling with canonical periods
├── version.rs       # Frontier / LSN tracking
└── wal_decoder.rs   # WAL-based CDC (logical replication slot polling, transitions)

Extension Control File (pg_stream.control)

The pg_stream.control file in the repository root is required by PostgreSQL's extension infrastructure. It declares the extension's description, default version, shared-library path, and privilege requirements. PostgreSQL reads this file when CREATE EXTENSION pg_stream; is executed.

During packaging (cargo pgrx package), pgrx replaces the @CARGO_VERSION@ placeholder with the version from Cargo.toml and copies the file into the target's share/extension/ directory alongside the SQL migration scripts.

DVM Operators

This document describes the Differential View Maintenance (DVM) operators implemented by pg_stream. Each operator transforms a stream of row-level changes (deltas) propagated from source tables through the operator tree.

Prior Art

  • Budiu, M. et al. (2023). "DBSP: Automatic Incremental View Maintenance for Rich Query Languages." VLDB 2023.
  • Gupta, A. & Mumick, I.S. (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press.
  • Koch, C. et al. (2014). "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views." VLDB Journal.
  • PostgreSQL 9.4+ — Materialized views with REFRESH MATERIALIZED VIEW CONCURRENTLY.

Overview

When a stream table is created, the defining SQL query is parsed into a tree of DVM operators. During a differential refresh, changes flow bottom-up through this tree:

         Aggregate
            │
         Project
            │
          Filter
            │
           Join
    ┌───────┴───────┐
   Join          Scan(C)
  ┌─┴─┐
Scan(A) Scan(B)

Each operator implements a differentiation rule: given the delta (Δ) to its input(s), it produces the corresponding delta to its output. This is conceptually similar to automatic differentiation in calculus.

The general contract:

  • Input: a set of ('+', row) and ('-', row) tuples (inserts and deletes)
  • Output: a set of ('+', row) and ('-', row) tuples

Updates are modeled as a delete of the old row followed by an insert of the new row.
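The contract can be modeled as a multiset of rows with signed deltas. A toy Python sketch (illustrative only — not the extension's Rust internals):

```python
from collections import Counter

def apply_delta(state, delta):
    """Apply ('+', row) / ('-', row) tuples to a multiset of rows."""
    new_state = Counter(state)
    for op, row in delta:
        new_state[row] += 1 if op == '+' else -1
    return +new_state  # unary + drops rows whose multiplicity fell to 0

def update(old_row, new_row):
    # An UPDATE is modeled as delete-of-old followed by insert-of-new
    return [('-', old_row), ('+', new_row)]

state = Counter({(1, 'pending'): 1})
state = apply_delta(state, update((1, 'pending'), (1, 'shipped')))
assert state == Counter({(1, 'shipped'): 1})
```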


Operators

Scan

Module: src/dvm/operators/scan.rs

The leaf operator. Reads CDC changes from a source table's change buffer.

Delta Rule:

$$\Delta(\text{Scan}(R)) = \Delta R$$

The scan operator is a direct passthrough — inserts in the source become inserts in the output, deletes become deletes.

SQL Generation:

SELECT op, row_data FROM pgstream_changes.changes_<oid>
WHERE xid >= <last_consumed_xid>

Notes:

  • Each source table has a dedicated change buffer table created by the CDC module.
  • Row data is stored as JSONB with column names as keys.
  • The __pgs_row_id column (xxHash of primary key) is included for deduplication.

Filter

Module: src/dvm/operators/filter.rs

Applies a WHERE clause predicate to the delta stream.

Delta Rule:

$$\Delta(\sigma_p(R)) = \sigma_p(\Delta R)$$

Filtering is applied to the deltas in the same way as to the base data — only rows satisfying the predicate pass through.

SQL Generation:

SELECT * FROM (<input_delta>) AS d
WHERE <predicate>

Example:

If the defining query is:

SELECT * FROM orders WHERE status = 'shipped'

And a new row (id=5, status='pending') is inserted, it does not appear in the delta output. If (id=3, status='shipped') is inserted, it passes through.

Edge Cases:

  • For updates that change the predicate column (e.g., status changing from 'pending' to 'shipped'), the CDC layer produces a delete of the old row and an insert of the new row. The filter blocks the delete (the old row fails the predicate) and passes the insert (the new row satisfies it), correctly yielding a net insert.
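This net-insert behavior can be checked with a tiny simulation (illustrative Python, mirroring the σ rule):

```python
def filter_delta(delta, predicate):
    """Delta(sigma_p(R)) = sigma_p(Delta R): apply the predicate to delta rows."""
    return [(op, row) for op, row in delta if predicate(row)]

# UPDATE status 'pending' -> 'shipped' arrives as delete-old + insert-new
delta = [('-', {'id': 3, 'status': 'pending'}),
         ('+', {'id': 3, 'status': 'shipped'})]
out = filter_delta(delta, lambda r: r['status'] == 'shipped')
assert out == [('+', {'id': 3, 'status': 'shipped'})]  # net insert
```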

Project

Module: src/dvm/operators/project.rs

Applies column projection from the target list.

Delta Rule:

$$\Delta(\pi_L(R)) = \pi_L(\Delta R)$$

Projects the same columns from the delta that the query projects from the base data.

SQL Generation:

SELECT <target_columns> FROM (<input_delta>) AS d

Notes:

  • Projection is applied after filtering for efficiency.
  • Computed expressions in the target list (e.g., price * quantity AS total) are evaluated on the delta rows.

Join (Inner)

Module: src/dvm/operators/join.rs

Implements inner join between two inputs.

Delta Rule:

For $R \bowtie S$:

$$\Delta(R \bowtie S) = (\Delta R \bowtie S) \cup (R' \bowtie \Delta S)$$

Where $R' = R \cup \Delta R$ (the new state of R after applying deltas).

In practice, when only one side has changes (common case), the delta join simplifies to joining the changed rows against the current state of the other side.
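The bilinear rule can be sanity-checked with a small multiset simulation (toy Python, insert-only deltas for brevity — not the extension's generated SQL):

```python
from collections import Counter

def join(r, s, key_r, key_s):
    """Multiset equi-join: rows are tuples, Counter values are multiplicities."""
    out = Counter()
    for rr, cr in r.items():
        for ss, cs in s.items():
            if rr[key_r] == ss[key_s]:
                out[rr + ss] += cr * cs
    return out

R, S = Counter({(1, 'a'): 1}), Counter({(1, 'x'): 1})
dR, dS = Counter({(2, 'b'): 1}), Counter({(2, 'y'): 1})  # one insert per side
R2, S2 = R + dR, S + dS

# Delta(R JOIN S) = (dR JOIN S) + (R' JOIN dS), with R' = R + dR
delta = join(dR, S, 0, 0) + join(R2, dS, 0, 0)
assert join(R2, S2, 0, 0) == join(R, S, 0, 0) + delta  # matches full recompute
```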

SQL Generation:

-- Changes to left side joined with current right side
SELECT '+' AS op, l.*, r.*
FROM (<left_delta> WHERE op = '+') AS l
JOIN <right_table> AS r ON <join_condition>

UNION ALL

-- Current left side joined with changes to right side
SELECT '+' AS op, l.*, r.*
FROM <left_table> AS l
JOIN (<right_delta> WHERE op = '+') AS r ON <join_condition>

(And corresponding DELETE queries for op = '-'.)

Notes:

  • The join uses the current state of the non-changed side, not the change buffer.
  • For equi-joins, this is efficient — the join key narrows the scan.
  • Non-equi joins (theta joins) may require broader scans.

Outer Join

Module: src/dvm/operators/outer_join.rs (LEFT JOIN), src/dvm/operators/full_join.rs (FULL JOIN)

Implements LEFT, RIGHT, and FULL OUTER JOIN.

RIGHT JOIN Handling:

RIGHT JOIN is automatically converted to a LEFT JOIN with swapped left/right operands during query parsing. This normalization happens transparently — the user can write RIGHT JOIN and the parser rewrites it to an equivalent LEFT JOIN before the operator tree is constructed.

Delta Rule:

Similar to inner join, but additionally handles NULL-padded rows:

$$\Delta(R \text{ LEFT JOIN } S) = (\Delta R \bowtie_L S) \cup (R' \bowtie_L \Delta S)$$

With special handling for:

  • Rows in ΔR that have no match in S → emit ('+', row, NULLs)
  • Rows in ΔS that create a first match for an R row → emit ('-', row, NULLs) and ('+', row, s_data)
  • Rows in ΔS that remove the last match for an R row → emit ('-', row, s_data) and ('+', row, NULLs)

SQL Generation (LEFT JOIN):

Uses anti-join detection (via NOT EXISTS) to correctly handle the NULL padding transitions.

FULL OUTER JOIN Delta Rule:

FULL OUTER JOIN extends the LEFT JOIN delta with symmetric right-side handling. The delta is computed as an 8-part UNION ALL:

  1. Parts 1–5: Same as LEFT JOIN delta (inserted/deleted rows from both sides, with NULL-padding transitions)
  2. Parts 6–7: Symmetric anti-join transitions for the right side (rows in ΔL that remove/create the last/first match for an S row)
  3. Part 8: Right-side insertions that have no match in the left side → emit ('+', NULLs, s_data)

Each part uses pre-computed delta flags (__has_ins_*, __has_del_*) to efficiently detect first-match/last-match transitions without redundant subqueries.

Nested Join Support:

Module: src/dvm/operators/join_common.rs

All join operators (inner, left, full) support nested children — i.e., a join whose left or right operand is itself another join. The join_common module provides shared helpers:

  • build_snapshot_sql() — returns the table reference for simple (Scan) operands, or a parenthesized subquery with disambiguated columns for nested join operands
  • rewrite_join_condition() — rewrites column references in ON conditions to use the correct alias prefixes for nested children (e.g., o.cust_id → dl.o__cust_id)

This enables queries with 3 or more joined tables, e.g.:

SELECT o.id, c.name, p.title
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id

Limitations:

  • FULL OUTER JOIN delta computation can be expensive due to dual-side NULL tracking (8 UNION ALL parts).
  • Performance degrades with high-cardinality join keys.
  • NATURAL JOIN is supported — common columns are resolved automatically and synthesized into an explicit equi-join condition.

Aggregate

Module: src/dvm/operators/aggregate.rs

Handles GROUP BY with the aggregate functions listed in the table below — algebraic (COUNT, SUM, AVG), semi-algebraic (MIN, MAX), and group-rescan aggregates (boolean, bit, string/array/JSON, statistical, ordered-set, and regression functions) — plus the FILTER (WHERE …) and WITHIN GROUP (ORDER BY …) clauses.

Delta Rule:

$$\Delta(\gamma_{G, \text{agg}}(R)) = \gamma_{G, \text{agg}}(R' \text{ WHERE } G \in \text{affected_keys}) - \gamma_{G, \text{agg}}(R \text{ WHERE } G \in \text{affected_keys})$$

Where:

  • $G$ = grouping columns
  • affected_keys = the set of group key values that appear in ΔR
  • $R'$ = $R \cup \Delta R$ (the new state)

Strategy:

  1. Identify affected groups — Collect all group key values that appear in the delta (either inserted or deleted rows).
  2. Recompute old values — Query the storage table for current aggregate values of affected groups.
  3. Recompute new values — Query the updated source for new aggregate values of affected groups.
  4. Diff — For each affected group:
    • If old exists and new differs → emit ('-', old) and ('+', new)
    • If old exists and new is gone → emit ('-', old) (group eliminated)
    • If no old and new exists → emit ('+', new) (new group appeared)
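The diff step, specialized to SUM, can be simulated in a few lines (illustrative Python; the real engine emits the 3-CTE SQL described below):

```python
def group_sums(rows, key=0, val=1):
    """SUM(val) GROUP BY key over a list of tuples."""
    out = {}
    for r in rows:
        out[r[key]] = out.get(r[key], 0) + r[val]
    return out

def aggregate_delta(old_rows, new_rows, affected_keys):
    """Recompute SUM only for groups whose key appears in the delta, then diff."""
    old = group_sums([r for r in old_rows if r[0] in affected_keys])
    new = group_sums([r for r in new_rows if r[0] in affected_keys])
    out = []
    for g in affected_keys:
        if g in old and g in new and old[g] != new[g]:
            out += [('-', (g, old[g])), ('+', (g, new[g]))]  # group changed
        elif g in old and g not in new:
            out.append(('-', (g, old[g])))                   # group eliminated
        elif g not in old and g in new:
            out.append(('+', (g, new[g])))                   # new group appeared
    return out

old = [('east', 10), ('east', 5), ('west', 7)]
new = old + [('east', 3)]            # one insert touching group 'east' only
assert aggregate_delta(old, new, {'east'}) == [('-', ('east', 15)), ('+', ('east', 18))]
```

Note that group 'west' is never touched — only groups named in the delta are recomputed.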

Supported Aggregate Functions:

| Function | DVM Strategy | Notes |
|---|---|---|
| COUNT(*) | Algebraic | Fully differential |
| COUNT(expr) | Algebraic | Fully differential |
| SUM(expr) | Algebraic | Fully differential |
| AVG(expr) | Algebraic | Decomposed to SUM/COUNT internally |
| MIN(expr) | Semi-algebraic | Uses LEAST merge; falls back to per-group rescan when min row is deleted |
| MAX(expr) | Semi-algebraic | Uses GREATEST merge; falls back to per-group rescan when max row is deleted |
| BOOL_AND(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| BOOL_OR(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| STRING_AGG(expr, sep) | Group-rescan | Affected groups are re-aggregated from source data |
| ARRAY_AGG(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| JSON_AGG(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| JSONB_AGG(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| BIT_AND(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| BIT_OR(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| BIT_XOR(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| JSON_OBJECT_AGG(key, value) | Group-rescan | Affected groups are re-aggregated from source data |
| JSONB_OBJECT_AGG(key, value) | Group-rescan | Affected groups are re-aggregated from source data |
| STDDEV_POP(expr) / STDDEV(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| STDDEV_SAMP(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| VAR_POP(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| VAR_SAMP(expr) / VARIANCE(expr) | Group-rescan | Affected groups are re-aggregated from source data |
| MODE() WITHIN GROUP (ORDER BY expr) | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| PERCENTILE_CONT(frac) WITHIN GROUP (ORDER BY expr) | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| PERCENTILE_DISC(frac) WITHIN GROUP (ORDER BY expr) | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| CORR(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| COVAR_POP(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| COVAR_SAMP(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_AVGX(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_AVGY(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_COUNT(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_INTERCEPT(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_R2(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_SLOPE(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_SXX(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_SXY(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| REGR_SYY(Y, X) | Group-rescan | Regression aggregate; affected groups re-aggregated |
| ANY_VALUE(expr) | Group-rescan | PostgreSQL 16+; affected groups re-aggregated |
| JSON_ARRAYAGG(expr ...) | Group-rescan | SQL-standard JSON aggregation (PostgreSQL 16+); full deparsed SQL preserved |
| JSON_OBJECTAGG(key: value ...) | Group-rescan | SQL-standard JSON aggregation (PostgreSQL 16+); full deparsed SQL preserved |

FILTER Clause:

All aggregate functions support the FILTER (WHERE …) clause:

SELECT COUNT(*) FILTER (WHERE status = 'active') AS active_count FROM orders GROUP BY region

The filter predicate is applied within the delta computation — only rows matching the filter contribute to the aggregate delta. Filtered aggregates are excluded from the P5 direct-bypass optimization.

SQL Generation:

The aggregate operator uses a 3-CTE pipeline:

  1. Merge CTE — Joins affected group keys against old (storage) and new (source) aggregate values, producing __pgs_meta_action ('I' for new-only groups, 'D' for disappeared groups, 'U' for changed groups).
  2. LATERAL VALUES expansion — A single-pass LATERAL (VALUES ...) clause expands each merge row into insert and delete actions, avoiding a 4-branch UNION ALL:
FROM merge_cte m,
LATERAL (VALUES
    ('I', m.new_count, m.new_total),
    ('D', m.old_count, m.old_total)
) v(action, count_val, val_total)
WHERE (m.__pgs_meta_action = 'I' AND v.action = 'I')
   OR (m.__pgs_meta_action = 'D' AND v.action = 'D')
   OR (m.__pgs_meta_action = 'U')
  3. Final projection — Emits ('+', row) and ('-', row) tuples for the refresh engine.

MIN/MAX Merge Strategy:

MIN and MAX use a semi-algebraic strategy with two cases:

  1. Non-extremum deletion — When the deleted row is NOT the current minimum (or maximum), the merge uses LEAST(old_value, new_inserts) for MIN or GREATEST(old_value, new_inserts) for MAX. This is fully algebraic and requires no rescan.

  2. Extremum deletion — When the row holding the current minimum (or maximum) IS deleted, the new value cannot be computed from the delta alone. The merge expression returns NULL as a sentinel, which triggers the change-detection guard (IS DISTINCT FROM) to emit the group for re-aggregation. The MERGE layer treats this as a DELETE + INSERT pair, recomputing the group from source data. This is still more efficient than a full table refresh since only affected groups are rescanned.
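The two cases can be modeled in a few lines (toy Python; None stands in for the NULL sentinel):

```python
def min_merge(old_min, deleted_vals, inserted_vals):
    """Semi-algebraic MIN merge. Returns the new group minimum, or None as the
    sentinel meaning 'extremum deleted, re-aggregate this group from source'."""
    if old_min in deleted_vals:
        return None                              # case 2: extremum deletion
    if inserted_vals:
        return min(old_min, min(inserted_vals))  # case 1: LEAST(old, inserts)
    return old_min

assert min_merge(3, deleted_vals=[7], inserted_vals=[5]) == 3     # non-extremum delete
assert min_merge(3, deleted_vals=[],  inserted_vals=[1]) == 1     # new minimum arrives
assert min_merge(3, deleted_vals=[3], inserted_vals=[5]) is None  # rescan needed
```

MAX is symmetric, with GREATEST in place of LEAST.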


Distinct

Module: src/dvm/operators/distinct.rs

Implements SELECT DISTINCT using reference counting.

Delta Rule:

$$\Delta(\delta(R)) = \{ r \in \Delta R : \text{count}(r, R) = 0 \land \text{count}(r, R') > 0 \} - \{ r \in \Delta R : \text{count}(r, R) > 0 \land \text{count}(r, R') = 0 \}$$

In other words:

  • A row enters the output when its count transitions from 0 to ≥1
  • A row leaves the output when its count transitions from ≥1 to 0

Strategy:

Maintains a hidden __pgs_dup_count column in the storage table to track how many times each distinct row appears in the pre-distinct input.

  1. On insert: increment count. If count was 0, emit ('+', row).
  2. On delete: decrement count. If count becomes 0, emit ('-', row).
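The counting logic can be sketched directly (toy Python; the dict plays the role of the hidden __pgs_dup_count column):

```python
def distinct_delta(counts, delta):
    """Reference-counted DISTINCT: emit only on 0 <-> >=1 count transitions."""
    out = []
    for op, row in delta:
        before = counts.get(row, 0)
        counts[row] = before + (1 if op == '+' else -1)
        if op == '+' and before == 0:
            out.append(('+', row))   # count went 0 -> 1: row enters output
        elif op == '-' and counts[row] == 0:
            out.append(('-', row))   # count went 1 -> 0: row leaves output
    return out

counts = {}
assert distinct_delta(counts, [('+', 'a'), ('+', 'a')]) == [('+', 'a')]
assert distinct_delta(counts, [('-', 'a')]) == []            # one duplicate remains
assert distinct_delta(counts, [('-', 'a')]) == [('-', 'a')]  # last copy removed
```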

Notes:

  • The duplicate count is not visible in user queries against the storage table (projected away by the view layer).
  • Duplicate counting uses __pgs_row_id (xxHash) for efficient lookups.

Union All

Module: src/dvm/operators/union_all.rs

Merges deltas from two branches.

Delta Rule:

$$\Delta(R \cup_{\text{all}} S) = \Delta R \cup_{\text{all}} \Delta S$$

Simply concatenates the delta streams from both branches.

SQL Generation:

SELECT * FROM (<left_delta>)
UNION ALL
SELECT * FROM (<right_delta>)

Notes:

  • Column count and types must match between branches.
  • Each branch is independently processed through its own operator sub-tree.
  • This is the simplest operator since UNION ALL preserves all duplicates.

Intersect

Module: src/dvm/operators/intersect.rs

Implements INTERSECT and INTERSECT ALL using dual-count per-branch multiplicity tracking.

Delta Rule:

$$\Delta(R \cap S): \text{emit rows where } \min(\text{count}_L, \text{count}_R) \text{ crosses the 0 boundary}$$

  • INTERSECT (set): a row is present when both branches contain it.
  • INTERSECT ALL (bag): a row appears $\min(\text{count}_L, \text{count}_R)$ times.

SQL Generation (3-CTE chain):

  1. Delta CTE — tags rows from left/right child deltas with branch indicator ('L'/'R') and computes per-row net_count.
  2. Merge CTE — joins with the storage table to compute old and new per-branch counts (__pgs_count_l, __pgs_count_r).
  3. Final CTE — detects boundary crossings using LEAST(old_count_l, old_count_r) vs LEAST(new_count_l, new_count_r).
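The boundary check in the final CTE reduces to a per-row transition test over the old and new branch counts (toy Python, function name hypothetical):

```python
def intersect_transitions(old_l, old_r, new_l, new_r):
    """Set INTERSECT visibility: a row is present iff LEAST(count_l, count_r) > 0.
    Returns '+' / '-' when the row crosses the zero boundary, else None."""
    old_vis = min(old_l, old_r) > 0
    new_vis = min(new_l, new_r) > 0
    if new_vis and not old_vis:
        return '+'
    if old_vis and not new_vis:
        return '-'
    return None

assert intersect_transitions(2, 0, 2, 1) == '+'   # right branch gains first copy
assert intersect_transitions(1, 1, 1, 0) == '-'   # right branch loses last copy
assert intersect_transitions(2, 1, 3, 1) is None  # count change only: no set change
```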

Notes:

  • Storage table requires hidden columns __pgs_count_l and __pgs_count_r for multiplicity tracking.
  • Both set and bag variants use the same 3-CTE structure and the same LEAST-based boundary logic; they differ only in the multiplicity emitted.

Except

Module: src/dvm/operators/except.rs

Implements EXCEPT and EXCEPT ALL using dual-count per-branch multiplicity tracking.

Delta Rule:

$$\Delta(R - S): \text{emit rows where } \max(0, \text{count}_L - \text{count}_R) \text{ crosses the 0 boundary}$$

  • EXCEPT (set): a row is present when it exists in the left but not the right branch.
  • EXCEPT ALL (bag): a row appears $\max(0, \text{count}_L - \text{count}_R)$ times.

SQL Generation (3-CTE chain):

  1. Delta CTE — same as Intersect: tags rows from both child deltas with branch indicator.
  2. Merge CTE — joins with storage table for old/new per-branch counts.
  3. Final CTE — detects boundary crossings using GREATEST(0, old_count_l - old_count_r) vs GREATEST(0, new_count_l - new_count_r).

Notes:

  • EXCEPT is not commutative — left branch is the positive input, right is subtracted.
  • Storage table requires hidden columns __pgs_count_l and __pgs_count_r.
  • Same 3-CTE structure as Intersect with different effective-count function.

Subquery

Module: src/dvm/operators/subquery.rs

Handles both inlined CTEs and explicit subqueries in FROM ((SELECT ...) AS alias).

Delta Rule:

$$\Delta(\rho_{\text{alias}}(Q)) = \rho_{\text{alias}}(\Delta Q)$$

A subquery wrapper is transparent for differentiation — it delegates to its child's delta and optionally renames output columns to match the subquery's column aliases.

SQL Generation:

-- If column aliases differ from child output columns:
SELECT __pgs_row_id, __pgs_action, child_col1 AS alias_col1, child_col2 AS alias_col2
FROM (<child_delta>)

If the child columns already match the aliases, the subquery is a pure passthrough — no additional CTE is emitted.

Notes:

  • This operator enables both CTE support (Tier 1) and standalone subqueries in FROM.
  • Column aliases on subqueries (FROM (...) AS x(a, b)) are handled by emitting a thin renaming CTE.
  • The subquery body is fully differentiated as a normal operator sub-tree.

CTE Scan (Shared Delta)

Module: src/dvm/operators/cte_scan.rs

Handles multi-reference CTEs by computing the CTE body's delta once and reusing it across all references (Tier 2).

Delta Rule:

$$\Delta(\text{CteScan}(\text{id}, Q)) = \text{cache}[\text{id}] \quad \text{(computed once, reused)}$$

When a CTE is referenced multiple times in a query, each reference produces a CteScan node with the same cte_id. The diff engine differentiates the CTE body once and caches the result. Subsequent CteScan nodes for the same CTE reuse the cached delta.

SQL Generation:

-- First reference: differentiates the CTE body and stores result in cache
-- Subsequent references: point to the same system CTE name
SELECT __pgs_row_id, __pgs_action, <columns>
FROM __pgs_cte_<cte_name>_delta  -- shared across all references

If column aliases are present, a thin renaming CTE is added on top of the cached delta.

Notes:

  • Without CteScan (Tier 1), multi-reference CTEs are inlined: each reference duplicates the full operator sub-tree. CteScan (Tier 2) eliminates this duplication.
  • The CTE body is pre-differentiated in dependency order (earlier CTEs before later ones that reference them).
  • Column alias support follows the same pattern as the Subquery operator.

Recursive CTEs

Recursive CTEs (WITH RECURSIVE) are supported via two strategies depending on the refresh mode:

FULL Mode

Recursive CTEs work out-of-the-box with refresh_mode = 'FULL'. The defining query is executed as-is via INSERT INTO ... SELECT ..., and PostgreSQL handles the iterative evaluation internally.

DIFFERENTIAL Mode (Three-Strategy Incremental Maintenance)

Recursive CTEs with refresh_mode = 'DIFFERENTIAL' use an automatic three-strategy approach, selected based on column compatibility and change type:

Strategy 1: Semi-Naive Evaluation (INSERT-only changes)

When only INSERT changes are present in the change buffer, pg_stream uses semi-naive evaluation — the standard technique for incremental fixpoint computation. The base case is differentiated normally through the DVM operator tree, then the resulting delta is propagated through the recursive term using a nested WITH RECURSIVE:

WITH RECURSIVE
  __pgs_base_delta AS (
    -- Normal DVM differentiation of the base case (INSERT rows only)
    <differentiated base case>
  ),
  __pgs_rec_delta AS (
    -- Seed: base case delta rows
    SELECT cols FROM __pgs_base_delta WHERE __pgs_action = 'I'
    UNION ALL
    -- Seed: new base rows joining existing ST storage
    SELECT cols FROM <recursive term with self_ref = ST_storage, base = change_buffer>
    UNION ALL
    -- Propagation: recursive term applied to growing delta
    SELECT cols FROM <recursive term with self_ref = __pgs_rec_delta, base = full>
  )
SELECT pgstream.pg_stream_hash(...) AS __pgs_row_id, 'I' AS __pgs_action, cols
FROM __pgs_rec_delta

The cost is proportional to the number of new rows produced by the change, not the full result set.
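The same idea in miniature, using graph reachability as the recursive query (toy Python; the extension generates the nested WITH RECURSIVE above instead):

```python
def reachable(edges, start):
    """Full fixpoint: every node reachable from start (the FULL-mode baseline)."""
    seen, frontier = set(), {start}
    while frontier:
        seen |= frontier
        frontier = {b for a, b in edges if a in frontier} - seen
    return seen

def seminaive_insert(edges, result, new_edge):
    """Propagate a single inserted edge through the recursive term:
    only rows derived from the new edge are computed, never the full set."""
    edges = edges | {new_edge}
    a, b = new_edge
    delta = {b} - result if a in result else set()
    while delta:                                   # fixpoint over the delta only
        result |= delta
        delta = {y for x, y in edges if x in delta} - result
    return edges, result

edges = {(1, 2), (2, 3)}
result = reachable(edges, 1)                       # {1, 2, 3}
edges, result = seminaive_insert(edges, result, (3, 4))
assert result == reachable(edges, 1) == {1, 2, 3, 4}
```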

Strategy 2: Delete-and-Rederive / DRed (mixed INSERT/DELETE/UPDATE changes)

When the change buffer contains DELETE or UPDATE changes, simple propagation is insufficient — a deleted base row may have transitively derived many recursive rows, some of which may still be derivable from alternative paths. DRed handles this in four phases:

  1. Insert propagation — semi-naive evaluation for the INSERT portion (same as Strategy 1)
  2. Over-deletion cascade — propagate base-case deletions through the recursive term against ST storage to find all transitively-derived rows that might be invalidated
  3. Rederivation — re-execute the recursive CTE from the remaining (non-deleted) base rows to restore any over-deleted rows that have alternative derivations
  4. Combine — final delta = inserts + (over-deletions − rederived rows)

This avoids full recomputation while correctly handling deletions with alternative derivation paths.
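Phases 2–4 can be sketched on a toy reachability problem (illustrative Python, delete-only case; phase 1 insert propagation omitted, and `reachable` re-runs the full fixpoint only to verify the incremental result):

```python
def reachable(edges, start):
    seen, frontier = set(), {start}
    while frontier:
        seen |= frontier
        frontier = {b for a, b in edges if a in frontier} - seen
    return seen

def dred_delete(edges, result, dead_edge):
    """Delete-and-rederive: over-delete, then restore alternative derivations."""
    edges = edges - {dead_edge}
    # Phase 2: over-deletion cascade from the deleted edge's target
    over = {dead_edge[1]} & result
    while True:
        more = ({b for a, b in edges if a in over} & result) - over
        if not more:
            break
        over |= more
    # Phase 3: rederivation from surviving rows via the remaining edges
    survivors = result - over
    rederived, frontier = set(), {b for a, b in edges if a in survivors} & over
    while frontier:
        rederived |= frontier
        frontier = {b for a, b in edges if a in frontier} & (over - rederived)
    # Phase 4: combine — survivors plus rederived rows
    return edges, survivors | rederived

edges = {(1, 2), (2, 3), (1, 3)}
result = reachable(edges, 1)                       # {1, 2, 3}
edges, result = dred_delete(edges, result, (2, 3))
assert result == reachable(edges, 1) == {1, 2, 3}  # 3 rederived via (1, 3)
```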

Strategy 3: Recomputation Fallback

When the CTE defines more columns than the outer SELECT projects (column mismatch), the incremental strategies cannot be used because the ST storage table lacks columns needed for recursive self-joins. In this case, the full defining query is re-executed and anti-joined against current storage:

WITH __pgs_recomp_new AS (
    SELECT pgstream.pg_stream_hash(row_to_json(sub)::text) AS __pgs_row_id, col1, col2, ...
    FROM (<defining_query>) sub
),
__pgs_recomp_ins AS (
    SELECT n.__pgs_row_id, 'I'::text AS __pgs_action, n.col1, n.col2, ...
    FROM __pgs_recomp_new n
    LEFT JOIN <storage_table> s ON s.__pgs_row_id = n.__pgs_row_id
    WHERE s.__pgs_row_id IS NULL
),
__pgs_recomp_del AS (
    SELECT s.__pgs_row_id, 'D'::text AS __pgs_action, s.col1, s.col2, ...
    FROM <storage_table> s
    LEFT JOIN __pgs_recomp_new n ON n.__pgs_row_id = s.__pgs_row_id
    WHERE n.__pgs_row_id IS NULL
)
SELECT * FROM __pgs_recomp_ins
UNION ALL
SELECT * FROM __pgs_recomp_del

The cost is proportional to the full result set size.

Strategy Selection
| CTE columns match ST? | Change type | Strategy |
|---|---|---|
| ✅ Match | INSERT-only | Semi-naive (Strategy 1) |
| ✅ Match | Mixed (INSERT+DELETE/UPDATE) | DRed (Strategy 2) |
| ❌ Mismatch | Any | Recomputation (Strategy 3) |

Notes:

  • Non-linear recursion (multiple self-references in the recursive term) is rejected — PostgreSQL restricts the recursive term to reference the CTE at most once.
  • The __pgs_row_id column (xxHash of the JSON-serialized row) is used for row identity.
  • For write-heavy workloads on very large recursive result sets with frequent mixed changes, refresh_mode = 'FULL' may still be more efficient than DRed.

Window Functions

Module: src/dvm/operators/window.rs

Handles window functions (ROW_NUMBER, RANK, DENSE_RANK, SUM() OVER, etc.) using partition-based recomputation.

Delta Rule:

When any row in a partition changes (insert, update, or delete), the entire partition's window function output is recomputed:

$$\Delta(\omega_{f, P}(R)) = \omega_{f, P}(R'|_{\text{affected partitions}}) - \omega_{f, P}(R|_{\text{affected partitions}})$$

Where $P$ is the PARTITION BY key and $f$ is the window function.

Strategy:

  1. Identify affected partition keys from the child delta.
  2. Delete old window function results for affected partitions from storage.
  3. Fetch surviving (unchanged) rows in affected partitions from the child.
  4. Recompute the window function on the current input for affected partitions.
  5. Emit the recomputed rows as inserts.
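The partition-scoped strategy in miniature, for ROW_NUMBER() (toy Python; the extension emits the SQL CTEs below instead):

```python
def row_number_delta(storage, partition_rows):
    """Recompute ROW_NUMBER() for one affected partition.

    storage        -- current stream-table rows for the partition (old numbers)
    partition_rows -- the partition's input rows after applying the change
    """
    deletes = [('D', r) for r in storage]            # drop all old results
    ordered = sorted(partition_rows)                 # ORDER BY sort key
    inserts = [('I', r + (i + 1,)) for i, r in enumerate(ordered)]
    return deletes + inserts

# Partition 'a' holds rows 10 and 30; a new row 20 arrives,
# so the whole partition is renumbered.
storage = [('a', 10, 1), ('a', 30, 2)]
delta = row_number_delta(storage, [('a', 10), ('a', 20), ('a', 30)])
assert delta == [('D', ('a', 10, 1)), ('D', ('a', 30, 2)),
                 ('I', ('a', 10, 1)), ('I', ('a', 20, 2)), ('I', ('a', 30, 3))]
```

Partitions not named in the delta are never touched.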

SQL Generation:

-- CTE 1: Affected partition keys from delta
WITH affected_partitions AS (
    SELECT DISTINCT <partition_cols> FROM (<child_delta>)
),
-- CTE 2: Surviving rows (not in delta) for affected partitions  
surviving AS (
    SELECT * FROM <storage_table>
    WHERE (<partition_cols>) IN (SELECT * FROM affected_partitions)
    AND __pgs_row_id NOT IN (SELECT __pgs_row_id FROM (<child_delta>) WHERE __pgs_action = 'D')
),
-- CTE 3: Recompute window function
recomputed AS (
    SELECT *, <window_func> OVER (PARTITION BY <partition_cols> ORDER BY <order_cols>) AS <alias>
    FROM surviving
)
-- Delete old results + insert recomputed results
SELECT 'D' AS __pgs_action, ...  -- old rows from affected partitions
UNION ALL
SELECT 'I' AS __pgs_action, ...  -- recomputed rows

Notes:

  • The cost is proportional to the size of affected partitions, not the full table. For workloads where changes spread across few partitions, this is efficient.
  • All window expressions must share the same PARTITION BY / ORDER BY clause.
  • Without PARTITION BY, the entire table is treated as a single partition — any change triggers a full recomputation.

Window Frame Clauses:

Window frame specifications are fully supported:

  • Modes: ROWS, RANGE, GROUPS
  • Bounds: UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, UNBOUNDED FOLLOWING
  • Between syntax: BETWEEN <start> AND <end>
  • Exclusion: EXCLUDE CURRENT ROW, EXCLUDE GROUP, EXCLUDE TIES, EXCLUDE NO OTHERS

Example: SUM(val) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)

Named WINDOW Clauses:

Named window definitions are resolved from the query-level WINDOW clause:

SELECT id, SUM(val) OVER w, AVG(val) OVER w
FROM data
WINDOW w AS (PARTITION BY category ORDER BY ts)

The parser resolves OVER w by looking up the window definition from the WINDOW clause and merging partition, order, and frame specifications.


Lateral Function (Set-Returning Functions in FROM)

Module: src/dvm/operators/lateral_function.rs

Handles set-returning functions (SRFs) used in the FROM clause with implicit LATERAL semantics: jsonb_array_elements, jsonb_each, jsonb_each_text, unnest, etc.

Delta Rule:

When a source row changes (insert, update, or delete), the SRF expansion is re-evaluated only for that source row:

$$\Delta(R \ltimes f(R.\text{col})) = (R' \ltimes f(R'.\text{col}))|_{\text{changed rows}} - (R \ltimes f(R.\text{col}))|_{\text{changed rows}}$$

Where $R$ is the source table, $f$ is the SRF, and changed rows are identified via the child delta.

Strategy (Row-Scoped Recomputation):

  1. Propagate the child delta to identify changed source rows.
  2. Find all existing ST rows derived from changed source rows (via column matching).
  3. Delete old SRF expansions for those source rows.
  4. Re-expand the SRF for inserted/updated source rows.
  5. Emit deletes + inserts as the final delta.
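The row-scoped strategy in miniature (toy Python; `srf` stands in for a function like unnest(), and the dict plays the role of the ST storage lookup):

```python
def lateral_srf_delta(storage, delta, srf):
    """Row-scoped LATERAL SRF maintenance: re-expand only changed source rows.
    storage maps a source row to its expanded output values."""
    out = []
    for action, src in delta:
        for v in storage.pop(src, []):               # drop old expansions of this row
            out.append(('D', src + (v,)))
        if action == 'I':                            # re-expand inserted/updated rows
            expanded = list(srf(src))
            storage[src] = expanded
            out += [('I', src + (v,)) for v in expanded]
    return out

unnest = lambda row: row[1]                          # SRF over the row's array field
storage = {}
ins = lateral_srf_delta(storage, [('I', (1, (10, 20)))], unnest)
assert ins == [('I', (1, (10, 20), 10)), ('I', (1, (10, 20), 20))]
dels = lateral_srf_delta(storage, [('D', (1, (10, 20)))], unnest)
assert dels == [('D', (1, (10, 20), 10)), ('D', (1, (10, 20), 20))]
```

An update of a source row arrives as a delete plus an insert of that row, so its old expansions are dropped and regenerated in one pass.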

SQL Generation (4-CTE chain):

-- CTE 1: Changed source rows from child delta
WITH lat_changed AS (
    SELECT DISTINCT "__pgs_row_id", "__pgs_action", <child_cols>
    FROM <child_delta>
),
-- CTE 2: Old ST rows for changed source rows (to be deleted)
lat_old AS (
    SELECT st."__pgs_row_id", st.<all_output_cols>
    FROM <st_table> st
    WHERE EXISTS (
        SELECT 1 FROM lat_changed cs
        WHERE st.<col1> IS NOT DISTINCT FROM cs.<col1>
          AND st.<col2> IS NOT DISTINCT FROM cs.<col2>
          ...
    )
),
-- CTE 3: Re-expand SRF for inserted/updated source rows
lat_expand AS (
    SELECT pg_stream_hash(<all_cols>::text) AS "__pgs_row_id",
           cs.<child_cols>, <srf_alias>.<srf_cols>
    FROM lat_changed cs,
         LATERAL <srf_function>(cs.<arg>) AS <srf_alias>
    WHERE cs."__pgs_action" = 'I'
),
-- CTE 4: Final delta
lat_final AS (
    SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols> FROM lat_old
    UNION ALL
    SELECT "__pgs_row_id", 'I' AS "__pgs_action", <cols> FROM lat_expand
)

Row Identity:

Content-based: hash(child_columns || srf_result_columns). This is stable as long as the same source row produces the same expanded values.

Supported SRFs:

| Function | Output Columns | Notes |
|---|---|---|
| jsonb_array_elements(jsonb) | value (jsonb) | Expands JSONB array to rows |
| jsonb_array_elements_text(jsonb) | value (text) | Text variant |
| jsonb_each(jsonb) | key (text), value (jsonb) | Expands JSONB object to key-value pairs |
| jsonb_each_text(jsonb) | key (text), value (text) | Text variant |
| unnest(anyarray) | Element type | Unnests PostgreSQL arrays |
| Custom SRFs | User-provided column aliases | AS alias(col1, col2) |

Notes:

  • The cost is proportional to the number of changed source rows × average SRF expansion size, not the full table.
  • WITH ORDINALITY is supported — adds a bigint ordinality column to the output.
  • ROWS FROM() with multiple functions is not supported (rejected at parse time).
  • Column aliases (e.g., AS child(value)) determine the output column names; known SRFs used without column aliases fall back to their standard output column names.
  • JSON_TABLE (PostgreSQL 17+) — JSON_TABLE(expr, path COLUMNS (...)) is modeled as a LateralFunction and uses the same row-scoped recomputation strategy. Supported column types: regular, EXISTS, formatted, and nested columns with ON ERROR/ON EMPTY behaviors and PASSING clauses.

Lateral Subquery (Correlated Subqueries in FROM)

Module: src/dvm/operators/lateral_subquery.rs

Handles correlated subqueries used in the FROM clause with explicit or implicit LATERAL semantics: FROM t, LATERAL (SELECT ... WHERE ref = t.col) AS alias or FROM t LEFT JOIN LATERAL (...) AS alias ON true.

Delta Rule:

When an outer row changes, the correlated subquery is re-executed only for that row:

$$\Delta(R \ltimes Q(R)) = (R' \ltimes Q(R'))|_{\text{changed rows}} - (R \ltimes Q(R))|_{\text{changed rows}}$$

Where $R$ is the outer table, $Q(R)$ is the correlated subquery, and changed rows are identified via the child delta.

Strategy (Row-Scoped Recomputation):

  1. Propagate the child delta to identify changed outer rows.
  2. Find all existing ST rows derived from changed outer rows (via column matching with IS NOT DISTINCT FROM).
  3. Delete old subquery expansions for those outer rows.
  4. Re-execute the subquery for inserted/updated outer rows using the original outer alias.
  5. Emit deletes + inserts as the final delta.

SQL Generation (4-CTE chain):

-- CTE 1: Changed outer rows from child delta
WITH lat_sq_changed AS (
    SELECT DISTINCT "__pgs_row_id", "__pgs_action", <child_cols>
    FROM <child_delta>
),
-- CTE 2: Old ST rows for changed outer rows (to be deleted)
lat_sq_old AS (
    SELECT st."__pgs_row_id", st.<all_output_cols>
    FROM <st_table> st
    WHERE EXISTS (
        SELECT 1 FROM lat_sq_changed cs
        WHERE st.<col1> IS NOT DISTINCT FROM cs.<col1>
          AND st.<col2> IS NOT DISTINCT FROM cs.<col2>
          ...
    )
),
-- CTE 3: Re-execute subquery for inserted/updated outer rows
lat_sq_expand AS (
    SELECT pg_stream_hash(<all_cols>::text) AS "__pgs_row_id",
           <outer_alias>.<child_cols>, <sub_alias>.<sub_cols>
    FROM lat_sq_changed AS <outer_alias>,      -- Original outer alias!
         LATERAL (<subquery_sql>) AS <sub_alias>
    WHERE <outer_alias>."__pgs_action" = 'I'
),
-- CTE 4: Final delta
lat_sq_final AS (
    SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols> FROM lat_sq_old
    UNION ALL
    SELECT "__pgs_row_id", 'I' AS "__pgs_action", <cols> FROM lat_sq_expand
)

LEFT JOIN LATERAL Handling:

For queries using LEFT JOIN LATERAL (...) ON true, the expand CTE uses LEFT JOIN LATERAL instead of comma syntax and wraps subquery columns in COALESCE for hash stability:

lat_sq_expand AS (
    SELECT pg_stream_hash(<outer_cols>::text || '/' || COALESCE(<sub_cols>::text, '')) AS "__pgs_row_id",
           <outer_alias>.<child_cols>, <sub_alias>.<sub_cols>
    FROM lat_sq_changed AS <outer_alias>
    LEFT JOIN LATERAL (<subquery_sql>) AS <sub_alias> ON true
    WHERE <outer_alias>."__pgs_action" = 'I'
)

Row Identity:

Content-based: hash(outer_columns || '/' || subquery_result_columns). For LEFT JOIN with NULL results, COALESCE ensures a stable hash.

Supported Patterns:

| Pattern | Syntax | Notes |
|---|---|---|
| Top-N per group | LATERAL (SELECT ... ORDER BY ... LIMIT N) | Most common use case |
| Correlated aggregate | LATERAL (SELECT SUM(x) FROM t WHERE t.fk = p.pk) | Returns single row per outer row |
| Existence with data | LEFT JOIN LATERAL (...) ON true | Preserves outer rows with NULLs |
| Multi-column lookup | LATERAL (SELECT a, b FROM t WHERE t.fk = p.pk LIMIT 1) | Multiple derived values |
| GROUP BY inside subquery | LATERAL (SELECT type, COUNT(*) FROM t WHERE t.fk = p.pk GROUP BY type) | Multiple rows per outer row |

Key Design Decision: Outer Alias Rewriting

The subquery body contains column references to the outer table (e.g., WHERE li.order_id = o.id). In the expansion CTE, the changed-sources CTE is aliased with the original outer table alias (e.g., lat_sq_changed AS o) so that the subquery's column references resolve naturally without rewriting.

Notes:

  • The cost is proportional to the number of changed outer rows × average subquery result size, not the full table.
  • The subquery is stored as raw SQL (like LateralFunction) because it cannot be independently differentiated — it depends on outer row context.
  • Source table OIDs referenced by the subquery body are extracted at parse time for CDC trigger setup.
  • ORDER BY + LIMIT inside the subquery are valid (they apply per-outer-row, not to the stream table).

Semi-Join (EXISTS / IN Subquery)

Module: src/dvm/operators/semi_join.rs

Handles WHERE EXISTS (SELECT ... FROM ...) and WHERE col IN (SELECT ...) patterns. The parser transforms these into a SemiJoin operator with a left (outer) child, a right (inner) child, and a join condition.

Delta Rule:

$$\Delta(L \ltimes R) = \Delta L|_{R} + L|_{\Delta R \text{ causes existence change}}$$

  • Part 1: Outer rows that changed and still satisfy the semi-join condition.
  • Part 2: Existing outer rows whose semi-join result flipped due to inner changes (a matching inner row was inserted or deleted).

Strategy (Two-Part Delta):

  1. Part 1 (outer delta): Filter delta_left to rows that have at least one match in the current right-hand snapshot.
  2. Part 2 (inner delta): For each row in the left snapshot, check whether the existence of matching right-hand rows changed between the old and current state. Emit 'I' if a match appeared, 'D' if all matches disappeared.

The "old" right-hand state is reconstructed from the current state by reversing the delta: R_old = (R_current EXCEPT ALL delta_right(action='I')) UNION ALL delta_right(action='D').
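The reversal can be sketched with Python multisets (illustrative only; pg_stream performs this reconstruction in SQL):

```python
# Reconstruct R_old from R_current and the inner delta: remove the delta's
# inserts from the current snapshot, add back its deletes.
from collections import Counter

def reconstruct_old(current, delta):
    """current: iterable of rows (multiset); delta: (action, row) with 'I'/'D'."""
    old = Counter(current)
    for action, row in delta:
        if action == "I":
            old[row] -= 1     # EXCEPT ALL delta_right(action='I')
        else:
            old[row] += 1     # UNION ALL delta_right(action='D')
    return Counter({k: v for k, v in old.items() if v > 0})

r_current = ["x", "x", "y"]
delta_right = [("I", "y"), ("D", "z")]
r_old = reconstruct_old(r_current, delta_right)
# r_old: {'x': 2, 'z': 1} -- 'y' was inserted this cycle, 'z' was deleted
```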

Row Identity:

  • Part 1: Uses __pgs_row_id from the left delta.
  • Part 2: Content-based hash via pg_stream_hash_multi on left-side columns.

Supported Patterns:

| Pattern | SQL | Notes |
|---|---|---|
| EXISTS | WHERE EXISTS (SELECT 1 FROM t WHERE t.fk = s.pk) | Direct semi-join |
| IN (subquery) | WHERE id IN (SELECT fk FROM t) | Rewritten to EXISTS with equality |
| Multiple conditions | WHERE EXISTS (... AND ...) | Additional predicates in subquery WHERE |

Anti-Join (NOT EXISTS / NOT IN Subquery)

Module: src/dvm/operators/anti_join.rs

Handles WHERE NOT EXISTS (SELECT ... FROM ...) and WHERE col NOT IN (SELECT ...) patterns. The inverse of the semi-join operator.

Delta Rule:

$$\Delta(L \triangleright R) = \Delta L|_{\neg R} + L|_{\Delta R \text{ causes existence change}}$$

  • Part 1: Outer rows that changed and have no match in the right-hand snapshot.
  • Part 2: Existing outer rows whose anti-join result flipped due to inner changes.

Strategy (Two-Part Delta):

  1. Part 1 (outer delta): Filter delta_left to rows with NOT EXISTS in the current right snapshot.
  2. Part 2 (inner delta): For each row in the left snapshot, detect existence changes. Emit 'D' if a match appeared (row no longer qualifies), 'I' if all matches disappeared (row now qualifies).

Note the inverted semantics compared to semi-join: a new match means deletion, losing all matches means insertion.
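The two emission rules can be set side by side in a few lines of Python (a sketch; the real operators emit SQL):

```python
# Given whether a left row had a match before and after the inner delta,
# which action (if any) does each operator emit for that row?
def existence_flip(had_match, has_match, anti=False):
    if had_match == has_match:
        return None                      # no flip, no delta row
    appeared = has_match and not had_match
    if not anti:
        return "I" if appeared else "D"  # semi-join: new match => insert
    return "D" if appeared else "I"      # anti-join: new match => delete
```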

Row Identity: Same as semi-join.

Supported Patterns:

| Pattern | SQL | Notes |
|---|---|---|
| NOT EXISTS | WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.fk = s.pk) | Direct anti-join |
| NOT IN (subquery) | WHERE id NOT IN (SELECT fk FROM t) | Rewritten to NOT EXISTS with equality |

Scalar Subquery (Correlated SELECT Subquery)

Module: src/dvm/operators/scalar_subquery.rs

Handles scalar subqueries appearing in the SELECT list, e.g., SELECT a, (SELECT max(x) FROM t) AS mx FROM s. The subquery must return exactly one row and one column.

Delta Rule:

$$\Delta(L \times q) = \Delta L \times q' + L \times (q' - q)$$

Where $q$ is the scalar subquery value and $q'$ is the updated value.

Strategy (Two-Part Delta):

  1. Part 1 (outer delta): Propagate the child delta, appending the current scalar subquery value to each row.
  2. Part 2 (scalar value change): When the scalar subquery's result changes, emit deletes for all existing outer rows (with the old scalar value) and re-inserts for all outer rows (with the new value). The old scalar value is reconstructed by reversing the inner delta.
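Part 2 can be sketched in Python (illustrative; the real operator generates SQL CTEs):

```python
# When the scalar subquery's value changes, every outer row is deleted with
# the old value and re-inserted with the new one.
def scalar_change_delta(outer_rows, old_value, new_value):
    if old_value == new_value:           # mirrors the IS DISTINCT FROM guard
        return []
    delta = [("D", row + (old_value,)) for row in outer_rows]
    delta += [("I", row + (new_value,)) for row in outer_rows]
    return delta

rows = [(1,), (2,)]
delta = scalar_change_delta(rows, 10, 12)
# two deletes carrying 10, then two inserts carrying 12
```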

SQL Generation (3 or 4 CTEs):

-- Part 1: child delta + current scalar value
WITH sq_outer AS (
    SELECT *, (<scalar_subquery>) AS "<alias>"
    FROM <child_delta>
),
-- Part 2a: DELETE all outer rows when scalar changed
sq_del AS (
    SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols>
    FROM <st_table>
    WHERE (<scalar_old>) IS DISTINCT FROM (<scalar_current>)
),
-- Part 2b: INSERT all outer rows with new scalar value
sq_ins AS (
    SELECT pg_stream_hash_multi(...) AS "__pgs_row_id",
           'I' AS "__pgs_action", <cols>, (<scalar_current>) AS "<alias>"
    FROM <source_snapshot>
    WHERE (<scalar_old>) IS DISTINCT FROM (<scalar_current>)
)
-- Final: UNION ALL of all parts
SELECT * FROM sq_outer
UNION ALL SELECT * FROM sq_del
UNION ALL SELECT * FROM sq_ins

Row Identity:

  • Part 1: __pgs_row_id from the child delta.
  • Part 2: Content-based hash via pg_stream_hash_multi on all output columns.

Notes:

  • The scalar subquery is stored as raw SQL (deparsed from the parse tree).
  • The old scalar value is approximated using the same EXCEPT ALL / UNION ALL reversal technique as semi/anti-join.
  • If the scalar subquery references a table that changes, all outer rows must be re-evaluated — the delta can be large.
  • Source OIDs used by the scalar subquery are captured at parse time for CDC trigger registration.

Operator Tree Construction

The DVM engine builds the operator tree by analyzing the parsed query:

  1. WITH clause → CTE definitions extracted into a name→body map (non-recursive) or CTE registry (multi-reference)
  2. FROM clause → Scan nodes for physical tables; Subquery nodes for inlined CTEs and subqueries in FROM; CteScan nodes for multi-reference CTEs; LateralFunction nodes for SRFs and JSON_TABLE in FROM; LateralSubquery nodes for correlated subqueries in FROM
  3. JOIN → Join or OuterJoin wrapping two sub-trees
  4. LATERAL SRFs → LateralFunction wrapping the left-hand FROM item as its child
  5. LATERAL subqueries → LateralSubquery wrapping the left-hand FROM item as its child (comma syntax or JOIN LATERAL)
  6. WHERE subqueries → SemiJoin for EXISTS/IN (subquery), AntiJoin for NOT EXISTS/NOT IN (subquery), extracted from the WHERE clause
  7. Scalar subqueries → ScalarSubquery for (SELECT ...) in the SELECT list, wrapping the child tree
  8. WHERE → Filter wrapping the scan/join tree (remaining non-subquery predicates)
  9. SELECT list → Project for column selection and expressions
  10. GROUP BY → Aggregate wrapping the filtered/projected tree
  11. DISTINCT → Distinct on top
  12. UNION ALL → UnionAll combining two complete sub-trees
  13. INTERSECT / EXCEPT → Intersect or Except combining two sub-trees with dual-count tracking
  14. Window functions → Window wrapping the sub-tree with PARTITION BY / ORDER BY metadata
  15. ORDER BY → silently discarded (storage row order is undefined)
  16. LIMIT / OFFSET → rejected with a clear error (stream tables materialize the full result set)

For recursive CTEs (WITH RECURSIVE), the query is parsed into an OpTree with RecursiveCte operator nodes. In DIFFERENTIAL mode, the strategy (semi-naive, DRed, or recomputation) is selected automatically based on column compatibility and change type — see the Recursive CTEs section above for details.

The tree is then traversed bottom-up during delta generation: each operator's generate_delta_sql() method composes its SQL fragment around the output of its child operator(s).
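The bottom-up composition can be illustrated with a toy operator tree (the method name mirrors the text, but the bodies are simplified sketches, not pg_stream's real SQL generation):

```python
# Each operator wraps the SQL fragment produced by its child.
class Scan:
    def __init__(self, table):
        self.table = table
    def generate_delta_sql(self):
        # leaf: read the change buffer for this table (illustrative name)
        return f"SELECT * FROM {self.table}_delta"

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def generate_delta_sql(self):
        # linear operator: deltas pass through with the predicate applied
        return (f"SELECT * FROM ({self.child.generate_delta_sql()}) d "
                f"WHERE {self.predicate}")

tree = Filter(Scan("orders"), "amount > 100")
sql = tree.generate_delta_sql()
```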


Further Reading

pg_stream — Benchmark Guide

This document explains how the database-level refresh benchmarks work and how to interpret their output.


Overview

The benchmark suite in tests/e2e_bench_tests.rs measures wall-clock refresh time for FULL vs INCREMENTAL mode across a matrix of table sizes, change rates, and query complexities. Each benchmark spawns an isolated PostgreSQL 18.x container via Testcontainers, ensuring reproducible and interference-free measurements.

The core question the benchmarks answer:

How much faster is an INCREMENTAL refresh compared to a FULL refresh, given a specific workload?


Prerequisites

Build the E2E test Docker image before running any benchmarks:

./tests/build_e2e_image.sh

Docker must be running on the host.


Running Benchmarks

All benchmark tests are tagged #[ignore] so they are skipped during normal CI. The --nocapture flag is required to see the printed output tables.

Quick Spot Checks (~5–10 seconds each)

# Simple scan, 10K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_scan_10k_1pct

# Aggregate query, 100K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_aggregate_100k_1pct

# Join + aggregate, 100K rows, 10% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_join_agg_100k_10pct

Zero-Change Latency (~5 seconds)

cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_no_data_refresh_latency

Full Matrix (~15–30 minutes)

Runs all 30 combinations and prints a consolidated summary:

cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_full_matrix

Run All Benchmarks in Parallel

cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture

Note: each test starts its own container, so parallel execution requires sufficient Docker resources.


Benchmark Dimensions

Table Sizes

| Size | Rows | Purpose |
|---|---|---|
| Small | 10,000 | Fast iteration; measures per-row overhead |
| Medium | 100,000 | More realistic; reveals scaling characteristics |

Change Rates

| Rate | Description |
|---|---|
| 1% | Low churn — the sweet spot for incremental refresh |
| 10% | Moderate churn — tests delta query scalability |
| 50% | High churn — stress test; approaches full-refresh cost |

Query Complexities

| Scenario | Defining Query | Operators Tested |
|---|---|---|
| scan | SELECT id, region, category, amount, score FROM src | Table scan only |
| filter | SELECT id, region, amount FROM src WHERE amount > 5000 | Scan + filter (WHERE) |
| aggregate | SELECT region, SUM(amount), COUNT(*) FROM src GROUP BY region | Scan + group-by aggregate |
| join | SELECT s.id, s.region, s.amount, d.region_name FROM src s JOIN dim d ON ... | Scan + inner join |
| join_agg | SELECT d.region_name, SUM(s.amount), COUNT(*) FROM src s JOIN dim d ON ... GROUP BY ... | Scan + join + aggregate |

DML Mix per Cycle

Each change cycle applies a realistic mix of operations:

| Operation | Fraction | Example at 10K rows, 10% rate |
|---|---|---|
| UPDATE | 70% | 700 rows have amount incremented |
| DELETE | 15% | 150 rows removed |
| INSERT | 15% | 150 new rows added |
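In code, the mix is simple integer arithmetic (a sketch; the actual harness lives in tests/e2e_bench_tests.rs):

```python
# Split the changed-row budget 70/15/15 across UPDATE/DELETE/INSERT.
def dml_mix(rows, change_rate_pct):
    changed = rows * change_rate_pct // 100
    return {"update": changed * 70 // 100,
            "delete": changed * 15 // 100,
            "insert": changed * 15 // 100}

mix = dml_mix(10_000, 10)
# {'update': 700, 'delete': 150, 'insert': 150}
```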

What Each Benchmark Does

1. Start a fresh PostgreSQL 18.x container
2. Install the pg_stream extension
3. Create and populate the source table (10K or 100K rows)
4. Create dimension table if needed (for join scenarios)
5. ANALYZE for stable query plans

── FULL mode ──
6. Create a Stream Table in FULL refresh mode
7. For each of 3 cycles:
   a. Apply random DML (updates + deletes + inserts)
   b. ANALYZE
   c. Time the FULL refresh (TRUNCATE + re-execute entire query)
   d. Record refresh_ms and ST row count
8. Drop the FULL-mode ST

── INCREMENTAL mode ──
9. Reset source table to same starting state
10. Create a Stream Table in INCREMENTAL refresh mode
11. For each of 3 cycles:
    a. Apply random DML (same parameters)
    b. ANALYZE
    c. Time the INCREMENTAL refresh (delta query + MERGE)
    d. Record refresh_ms and ST row count

12. Print results table and summary

Both modes start from the same data to ensure a fair comparison. The 3-cycle design captures warm-up effects (cycle 1 may be slower due to plan caching).


Reading the Output

Detail Table

╔══════════════════════════════════════════════════════════════════════════════════════╗
║                    pg_stream Refresh Benchmark Results                      ║
╠════════════╤══════════╤════════╤═════════════╤═══════╤════════════╤═════════════════╣
║ Scenario   │ Rows     │ Chg %  │ Mode        │ Cycle │ Refresh ms │ ST Rows         ║
╠════════════╪══════════╪════════╪═════════════╪═══════╪════════════╪═════════════════╣
║ aggregate  │    10000 │     1% │ FULL        │     1 │       22.1 │               5 ║
║ aggregate  │    10000 │     1% │ FULL        │     2 │        4.8 │               5 ║
║ aggregate  │    10000 │     1% │ FULL        │     3 │        5.3 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     1 │        8.4 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     2 │        4.4 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     3 │        4.6 │               5 ║
╚════════════╧══════════╧════════╧═════════════╧═══════╧════════════╧═════════════════╝
| Column | Meaning |
|---|---|
| Scenario | Query complexity level (scan, filter, aggregate, join, join_agg) |
| Rows | Number of rows in the base table |
| Chg % | Percentage of rows changed per cycle |
| Mode | FULL (truncate + recompute) or INCREMENTAL (delta + merge) |
| Cycle | Which of the 3 measurement rounds (cycle 1 often includes warm-up) |
| Refresh ms | Wall-clock time for the refresh operation |
| ST Rows | Row count in the Stream Table after refresh (sanity check) |

Summary Table

┌─────────────────────────────────────────────────────────────────────────┐
│                        Summary (avg ms per cycle)                       │
├────────────┬──────────┬────────┬─────────────────┬──────────────────────┤
│ Scenario   │ Rows     │ Chg %  │ FULL avg ms     │ INCREMENTAL avg ms   │
├────────────┼──────────┼────────┼─────────────────┼──────────────────────┤
│ aggregate  │    10000 │     1% │       10.7       │        5.8 (  1.8x) │
└────────────┴──────────┴────────┴─────────────────┴──────────────────────┘

The Speedup value in parentheses is FULL avg / INCREMENTAL avg — how many times faster the incremental refresh is compared to a full refresh.


Interpreting the Speedup

What to Expect

| Change Rate | Table Size | Expected Speedup | Explanation |
|---|---|---|---|
| 1% | 10K | 1.5–5x | Small table; overhead is similar, delta is tiny |
| 1% | 100K | 5–50x | Larger table amplifies full-refresh cost |
| 10% | 100K | 2–10x | Moderate delta; still significantly faster |
| 50% | any | 1–2x | Delta is nearly as large as full table |

Rules of Thumb

| Speedup | Interpretation |
|---|---|
| > 10x | Strong win for INCREMENTAL — typical at low change rates on larger tables |
| 5–10x | Clear advantage for INCREMENTAL |
| 2–5x | Moderate advantage — INCREMENTAL is the right choice |
| 1–2x | Marginal gain — either mode is acceptable |
| ~1x | Break-even — change rate is too high for incremental to help |
| < 1x | INCREMENTAL is slower — indicates overhead exceeds savings (investigate) |

Key Patterns to Look For

  1. Scaling with table size: For the same change rate, speedup should increase with table size. FULL must re-process all rows; INCREMENTAL processes only the delta.

  2. Degradation with change rate: As the change rate rises from 1% to 50%, speedup should decrease. At 50%, INCREMENTAL processes half the table, which approaches FULL cost.

  3. Query complexity amplifies speedup: Aggregate and join queries benefit more from INCREMENTAL because they avoid expensive re-computation. A join_agg at 1% changes should show higher speedup than a simple scan at the same parameters.

  4. Cycle 1 warm-up: The first cycle in each mode may be slower due to PostgreSQL plan cache population. Use cycles 2–3 for the steadiest numbers.

  5. ST Rows consistency: The ST row count should be similar between FULL and INCREMENTAL for the same scenario (accounting for random DML). Large discrepancies indicate a correctness issue.


Zero-Change Latency

The bench_no_data_refresh_latency test measures the overhead of a refresh when no data has changed — the NO_DATA code path.

┌──────────────────────────────────────────────┐
│ NO_DATA Refresh Latency (10 iterations)      │
├──────────────────────────────────────────────┤
│ Avg:     3.21 ms                             │
│ Max:     5.10 ms                             │
│ Target: < 10 ms                              │
│ Status: ✅ PASS                              │
└──────────────────────────────────────────────┘
| Metric | Meaning |
|---|---|
| Avg | Average wall-clock time across 10 no-op refreshes |
| Max | Worst-case single iteration |
| Target | The PLAN.md goal: < 10 ms per no-op refresh |
| Status | PASS if avg < 10 ms, SLOW otherwise |

A passing result confirms the scheduler's per-cycle overhead is negligible. Values > 10 ms in containerized environments may be acceptable due to Docker overhead; bare-metal PostgreSQL should comfortably meet the target.


Available Tests

Individual Tests (10K rows)

| Test Name | Scenario | Change Rate |
|---|---|---|
| bench_scan_10k_1pct | scan | 1% |
| bench_scan_10k_10pct | scan | 10% |
| bench_scan_10k_50pct | scan | 50% |
| bench_filter_10k_1pct | filter | 1% |
| bench_aggregate_10k_1pct | aggregate | 1% |
| bench_join_10k_1pct | join | 1% |
| bench_join_agg_10k_1pct | join_agg | 1% |

Individual Tests (100K rows)

| Test Name | Scenario | Change Rate |
|---|---|---|
| bench_scan_100k_1pct | scan | 1% |
| bench_scan_100k_10pct | scan | 10% |
| bench_scan_100k_50pct | scan | 50% |
| bench_aggregate_100k_1pct | aggregate | 1% |
| bench_aggregate_100k_10pct | aggregate | 10% |
| bench_join_agg_100k_1pct | join_agg | 1% |
| bench_join_agg_100k_10pct | join_agg | 10% |

Special Tests

| Test Name | Description |
|---|---|
| bench_full_matrix | All 30 combinations (5 queries × 2 sizes × 3 rates) |
| bench_no_data_refresh_latency | Zero-change overhead (10 iterations) |

In-Process Micro-Benchmarks (Criterion.rs)

In addition to the E2E database benchmarks, the project includes two Criterion.rs benchmark suites that measure pure Rust computation time without database overhead. These are useful for tracking performance regressions in the internal query-building and IVM differentiation logic.

Benchmark Suites

refresh_bench — Utility Functions

benches/refresh_bench.rs benchmarks the low-level helper functions used during refresh operations:

| Benchmark Group | What It Measures |
|---|---|
| quote_ident | PostgreSQL identifier quoting speed |
| col_list | Column list SQL generation |
| prefixed_col_list | Prefixed column list generation (e.g., NEW.col) |
| expr_to_sql | AST expression → SQL string conversion |
| output_columns | Output column extraction from parsed queries |
| source_oids | Source table OID resolution |
| lsn_gt | LSN comparison expression generation |
| frontier_json | Frontier state JSON serialization |
| canonical_period | Interval parsing and canonicalization |
| dag_operations | DAG topological sort and cycle detection |
| xxh64 | xxHash-64 hashing throughput |

diff_operators — IVM Operator Differentiation

benches/diff_operators.rs benchmarks the delta SQL generation for every IVM operator. Each benchmark creates a realistic operator tree and measures differentiate() throughput:

| Benchmark Group | What It Measures |
|---|---|
| diff_scan | Table scan differentiation (3, 10, 20 columns) |
| diff_filter | Filter (WHERE) differentiation |
| diff_project | Projection (SELECT subset) differentiation |
| diff_aggregate | GROUP BY aggregate differentiation (simple + complex) |
| diff_inner_join | Inner join differentiation |
| diff_left_join | Left outer join differentiation |
| diff_distinct | DISTINCT differentiation |
| diff_union_all | UNION ALL differentiation (2, 5, 10 children) |
| diff_window | Window function differentiation |
| diff_join_aggregate | Composite join + aggregate pipeline |
| differentiate_full | Full differentiate() call for scan-only and filter+scan trees |

Running Micro-Benchmarks

# Run all Criterion benchmarks
just bench

# Run only refresh utility benchmarks
cargo bench --bench refresh_bench --features pg18

# Run only IVM diff operator benchmarks
just bench-diff
# or equivalently:
cargo bench --bench diff_operators --features pg18

# Output in Bencher-compatible format (for CI integration)
just bench-bencher

Output and Reports

Criterion produces statistical analysis for each benchmark including:

  • Mean and standard deviation of execution time
  • Throughput (iterations/sec)
  • Comparison with previous run — reports improvements/regressions with confidence intervals

HTML reports are generated in target/criterion/ with interactive charts showing distributions and regression history. Open target/criterion/report/index.html to browse all results.

Sample output:

diff_scan/3_columns   time:   [11.834 µs 12.074 µs 12.329 µs]
diff_scan/10_columns  time:   [16.203 µs 16.525 µs 16.869 µs]
diff_aggregate/simple time:   [21.447 µs 21.862 µs 22.301 µs]
diff_inner_join       time:   [25.919 µs 26.421 µs 26.952 µs]

Continuous Benchmarking with Bencher

Bencher provides continuous benchmark tracking in CI, detecting performance regressions on pull requests before they merge.

How It Works

The .github/workflows/benchmarks.yml workflow:

  1. On main pushes — runs both Criterion suites and uploads results to Bencher as the baseline. This establishes the expected performance for each benchmark.

  2. On pull requests — runs the same benchmarks and compares against the main baseline using a Student's t-test with a 99% upper confidence boundary. If any benchmark regresses beyond the threshold, the PR check fails.

Setup

To enable Bencher for your fork or deployment:

  1. Create a Bencher account at bencher.dev and create a project.

  2. Add the API token as a GitHub Actions secret:

    • Go to Settings → Secrets and variables → Actions
    • Add BENCHER_API_TOKEN with your Bencher API token
  3. Update the project slug in .github/workflows/benchmarks.yml if your Bencher project name differs from pg-stream.

The workflow gracefully degrades — if BENCHER_API_TOKEN is not set, benchmarks still run and upload artifacts but skip Bencher tracking.

Local Bencher-Format Output

To see what Bencher would receive from CI:

just bench-bencher

This runs both suites with --output-format bencher, producing JSON output compatible with bencher run.

Dashboard

Once configured, the Bencher dashboard shows:

  • Historical trends for every benchmark across commits
  • Statistical thresholds with configurable alerting
  • PR annotations highlighting which benchmarks regressed and by how much

Troubleshooting

| Issue | Resolution |
|---|---|
| docker: command not found | Install Docker Desktop and ensure it is running |
| Container startup timeout | Increase Docker memory allocation (≥ 4 GB recommended) |
| image not found | Run ./tests/build_e2e_image.sh to build the test image |
| Highly variable timings | Close other workloads; use --test-threads=1 to avoid container contention |
| SLOW status on latency test | Expected in Docker; bare-metal should pass < 10 ms |

What Happens When You INSERT a Row?

This tutorial traces the complete lifecycle of a single INSERT statement on a base table that is referenced by a stream table — from the moment the row is written to the moment the stream table reflects the change.

Setup: A Real-World Example

Suppose you run an e-commerce platform. You have an orders table and a stream table that maintains a running total per customer:

-- Base table
CREATE TABLE orders (
    id    SERIAL PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   NUMERIC(10,2) NOT NULL
);

-- Stream table: always-fresh customer totals
SELECT pgstream.create_stream_table(
    'customer_totals',
    $$
      SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
      FROM orders GROUP BY customer
    $$,
    '1m',           -- refresh when data is staler than 1 minute
    'DIFFERENTIAL'  -- only process changed rows, not the full table
);

After creation, customer_totals is a real PostgreSQL table:

SELECT * FROM customer_totals;
-- (empty — no orders yet)

Phase 1: The INSERT

A new order arrives:

INSERT INTO orders (customer, amount) VALUES ('alice', 49.99);

What happens inside PostgreSQL

When create_stream_table() was called, pg_stream installed an AFTER INSERT OR UPDATE OR DELETE trigger on the orders table. This trigger fires automatically — the user's INSERT statement triggers it transparently.

The trigger function (pgstream_changes.pg_stream_cdc_fn_<oid>()) executes inside the same transaction as the INSERT and writes a single row into the change buffer table:

pgstream_changes.changes_16384    (where 16384 = orders table OID)
┌───────────┬─────────────┬────────┬─────────┬──────────┬──────────┬────────────┐
│ change_id │ lsn         │ action │ pk_hash  │ new_id   │ new_cust │ new_amount │
├───────────┼─────────────┼────────┼─────────┼──────────┼──────────┼────────────┤
│ 1         │ 0/1A3F2B80  │ I      │ -837291 │ 1        │ alice    │ 49.99      │
└───────────┴─────────────┴────────┴─────────┴──────────┴──────────┴────────────┘

Key details:

  • lsn: The current WAL Log Sequence Number (pg_current_wal_lsn()), used to bound which changes belong to which refresh cycle.
  • action: 'I' for INSERT, 'U' for UPDATE, 'D' for DELETE.
  • pk_hash: A pre-computed hash of the primary key (orders.id), used later for efficient row matching.
  • new_* columns: The actual column values from NEW, stored as native PostgreSQL types (not JSONB). There are no old_* values for INSERTs.

Beyond this single INSERT into the buffer table, the trigger adds no overhead to the user's transaction commit: there is no JSONB serialization, no logical replication slot, and no external process involved.
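As a hedged illustration only — the real function is generated per source table, and the column names and hash call below are assumptions based on the buffer-table sketch above — the generated trigger body could look roughly like:

-- Illustrative approximation, not the actual generated code
CREATE FUNCTION pgstream_changes.pg_stream_cdc_fn_16384() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO pgstream_changes.changes_16384
            (lsn, action, pk_hash, new_id, new_customer, new_amount)
        VALUES (pg_current_wal_lsn(), 'I',
                pg_stream_hash(NEW.id::text), NEW.id, NEW.customer, NEW.amount);
        RETURN NEW;
    END IF;
    -- UPDATE and DELETE branches capture old_* (and new_*) values similarly
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;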

Phase 2: The Scheduler Wakes Up

A background worker called the scheduler runs inside PostgreSQL (registered via shared_preload_libraries). It wakes up every pg_stream.scheduler_interval_ms milliseconds (default: 1000ms) and performs a tick:

  1. Rebuild the DAG (if any stream tables were created/dropped since last tick) — a dependency graph of all stream tables and their source tables.
  2. Topological sort — determine the refresh order so that stream tables depending on other stream tables are refreshed after their dependencies.
  3. For each stream table, check: has its staleness exceeded its schedule?

For customer_totals with a '1m' schedule, the scheduler compares:

  • now() minus data_timestamp (the freshness watermark from the last refresh)
  • Against the schedule: 60 seconds

If more than 60 seconds have elapsed and the stream table isn't already being refreshed, the scheduler begins a refresh.

Phase 3: Frontier Advancement

Before executing the refresh, the scheduler creates a new frontier — a snapshot of how far to read changes from each source table:

Previous frontier: { orders(16384): lsn = 0/1A3F2A00 }
New frontier:      { orders(16384): lsn = 0/1A3F2C00 }

The frontier is a DBSP-inspired version vector. Each source table has its own LSN cursor. The refresh will process all changes in the buffer table where lsn > previous_frontier_lsn AND lsn <= new_frontier_lsn.

This means:

  • Changes committed before the previous refresh are already reflected.
  • Changes committed after the new frontier will be picked up in the next cycle.
  • The INSERT we made (lsn = 0/1A3F2B80) falls within this window.
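The window predicate is half-open on the low side and closed on the high side, so consecutive cycles never double-process or skip a change. A minimal sketch, modeling pg_lsn as a plain integer:

```python
def in_window(lsn: int, prev_frontier: int, new_frontier: int) -> bool:
    """A change belongs to this refresh cycle iff
    prev_frontier < lsn <= new_frontier."""
    return prev_frontier < lsn <= new_frontier

prev, new = 0x1A3F2A00, 0x1A3F2C00
assert in_window(0x1A3F2B80, prev, new)      # our INSERT: processed now
assert not in_window(0x1A3F2900, prev, new)  # already reflected last cycle
assert not in_window(0x1A3F2D00, prev, new)  # picked up next cycle
```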

Phase 4: Change Detection — Is There Anything to Do?

Before running the full delta query, the scheduler runs a short-circuit check: does the change buffer actually have any rows in the LSN window?

SELECT count(*)::bigint FROM (
    SELECT 1 FROM pgstream_changes.changes_16384
    WHERE lsn > '0/1A3F2A00'::pg_lsn
    AND lsn <= '0/1A3F2C00'::pg_lsn
    LIMIT <threshold>
) __pgs_capped

This query also checks the adaptive threshold: if the number of changes exceeds a percentage of the source table size (default: 10%), the scheduler falls back to a FULL refresh instead of DIFFERENTIAL, because applying thousands of individual deltas would be slower than a bulk reload.

For our single INSERT, the count is 1 — well below the threshold. The scheduler proceeds with a DIFFERENTIAL refresh.
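The mode decision reduces to a ratio test. A hedged sketch in Python (the function name and default are illustrative; the documented default threshold is 10% of the source table size):

```python
def choose_refresh_mode(changed_rows: int, table_rows: int,
                        full_ratio: float = 0.10) -> str:
    """Adaptive fallback: if the change volume exceeds full_ratio of the
    table, a bulk FULL refresh beats applying per-row deltas."""
    if table_rows > 0 and changed_rows > table_rows * full_ratio:
        return "FULL"
    return "DIFFERENTIAL"

print(choose_refresh_mode(1, 1_000_000))        # DIFFERENTIAL
print(choose_refresh_mode(200_000, 1_000_000))  # FULL
```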

Phase 5: Delta Query Generation (DVM Engine)

This is where the Differential View Maintenance (DVM) engine does its work. The defining query:

SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer

is parsed into an operator tree:

Aggregate(GROUP BY customer, SUM(amount), COUNT(*))
  └── Scan(orders)

The DVM engine differentiates each operator — converting it from "compute the full result" to "compute only what changed":

Step 1: Differentiate the Scan

The Scan(orders) operator becomes a read from the change buffer:

-- Reads only changes in the LSN window, splitting UPDATEs into DELETE+INSERT
WITH __pgs_raw AS (
    SELECT c.pk_hash, c.action,
           c."new_customer", c."old_customer",
           c."new_amount", c."old_amount"
    FROM pgstream_changes.changes_16384 c
    WHERE c.lsn > '0/1A3F2A00'::pg_lsn
    AND   c.lsn <= '0/1A3F2C00'::pg_lsn
)
-- INSERT rows: take new_* values
SELECT pk_hash AS __pgs_row_id, 'I' AS __pgs_action,
       "new_customer" AS customer, "new_amount" AS amount
FROM __pgs_raw WHERE action IN ('I', 'U')
UNION ALL
-- DELETE rows: take old_* values
SELECT pk_hash AS __pgs_row_id, 'D' AS __pgs_action,
       "old_customer" AS customer, "old_amount" AS amount
FROM __pgs_raw WHERE action IN ('D', 'U')

For our single INSERT, this produces:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291      | I            | alice    | 49.99

Step 2: Differentiate the Aggregate

The Aggregate differentiation is the heart of incremental maintenance. Instead of re-computing SUM(amount) over the entire orders table, it computes:

-- Delta for SUM: add new values, subtract deleted values
SELECT customer,
       SUM(CASE WHEN __pgs_action = 'I' THEN amount
                WHEN __pgs_action = 'D' THEN -amount END) AS total,
       SUM(CASE WHEN __pgs_action = 'I' THEN 1
                WHEN __pgs_action = 'D' THEN -1 END) AS order_count,
       pgstream.pg_stream_hash(customer::text) AS __pgs_row_id,
       'I' AS __pgs_action
FROM <scan_delta>
GROUP BY customer

For our INSERT of ('alice', 49.99), this yields:

customer | total  | order_count | __pgs_row_id | __pgs_action
---------|--------|-------------|--------------|-------------
alice    | +49.99 | +1          | 7283194      | I

The stream table uses reference counting: it tracks __pgs_count (how many source rows contribute to each group). When __pgs_count reaches 0, the group row is deleted.
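The reference-counting behavior can be modeled compactly. This is an illustrative Python sketch of the idea, not pg_stream's SQL machinery; each group carries a running sum plus a contributor count, and the group disappears when the count hits zero:

```python
def apply_delta(groups: dict, events: list) -> dict:
    """groups maps key -> (total, refcount). An 'I' event adds a row's
    contribution, a 'D' event subtracts it; a group whose refcount
    reaches 0 is removed entirely."""
    for action, key, amount in events:
        sign = 1 if action == 'I' else -1
        total, count = groups.get(key, (0.0, 0))
        total, count = total + sign * amount, count + sign
        if count == 0:
            groups.pop(key, None)
        else:
            groups[key] = (total, count)
    return groups

state = {}
apply_delta(state, [('I', 'alice', 49.99)])
print(state)   # {'alice': (49.99, 1)}
apply_delta(state, [('D', 'alice', 49.99)])
print(state)   # {} -- refcount hit 0, group deleted
```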

Phase 6: MERGE Into the Stream Table

The delta is applied to the customer_totals storage table using a single SQL MERGE statement:

MERGE INTO public.customer_totals AS st
USING (<delta_query>) AS d
ON st.__pgs_row_id = d.__pgs_row_id
WHEN MATCHED AND d.__pgs_action = 'D' THEN DELETE
WHEN MATCHED AND d.__pgs_action = 'I' THEN
    UPDATE SET customer = d.customer, total = d.total, order_count = d.order_count
WHEN NOT MATCHED AND d.__pgs_action = 'I' THEN
    INSERT (__pgs_row_id, customer, total, order_count)
    VALUES (d.__pgs_row_id, d.customer, d.total, d.order_count)

Since alice didn't exist before, this takes the NOT MATCHED → INSERT branch. The stream table now contains:

SELECT * FROM customer_totals;
 customer | total | order_count
----------|-------|------------
 alice    | 49.99 | 1

Phase 7: Cleanup and Bookkeeping

After the MERGE succeeds:

  1. Consumed changes are deleted from the buffer table:

    DELETE FROM pgstream_changes.changes_16384
    WHERE lsn > '0/1A3F2A00'::pg_lsn
    AND lsn <= '0/1A3F2C00'::pg_lsn
    
  2. The frontier is saved to the catalog as JSONB, so the next refresh knows where to start.

  3. The refresh is recorded in pgstream.pgs_refresh_history:

    refresh_id | pgs_id | action       | rows_inserted | rows_deleted | status    | initiated_by
    1          | 1      | DIFFERENTIAL | 1             | 0            | COMPLETED | SCHEDULER
    
  4. The data timestamp on the stream table is advanced, resetting the staleness clock.

  5. The MERGE template is cached in thread-local storage. The next refresh for this stream table skips SQL parsing, operator tree construction, and differentiation — it only substitutes LSN values into the cached template. This saves ~45ms per refresh cycle.

What About UPDATE and DELETE?

UPDATE

UPDATE orders SET amount = 59.99 WHERE id = 1;

The trigger writes a single row with action = 'U', capturing both OLD and NEW values:

action | new_amount | old_amount | new_customer | old_customer
-------|------------|------------|--------------|-------------
U      | 59.99      | 49.99      | alice        | alice

The scan differentiation splits this into:

  • DELETE old: (alice, 49.99) with action 'D'
  • INSERT new: (alice, 59.99) with action 'I'

The aggregate differentiation computes: +59.99 - 49.99 = +10.00 for alice's total. The MERGE updates the existing row.

DELETE

DELETE FROM orders WHERE id = 1;

The trigger writes action = 'D' with the OLD values. The aggregate differentiation computes -49.99 for the total and -1 for the count. If the __pgs_count reaches 0 (no more orders for alice), the MERGE deletes alice's row from the stream table entirely.

Performance: Why This Is Fast

Step                      | What it avoids
--------------------------|----------------------------------------------------------------
Trigger-based CDC         | No logical replication slot, no WAL parsing, no external process
Typed columns             | No JSONB serialization in the trigger, no jsonb_populate_record in the delta query
Pre-computed pk_hash      | No per-row hash computation during the delta query
LSN-bounded reads         | Index scan on the change buffer, not a full table scan
Algebraic differentiation | Processes only changed rows — O(changes) not O(table size)
MERGE statement           | Single SQL round-trip for all inserts, updates, and deletes
Cached templates          | After the first refresh, delta SQL generation is skipped entirely
Adaptive fallback         | Automatically switches to FULL refresh when changes exceed a threshold

For a table with 10 million rows and 100 changed rows, a DIFFERENTIAL refresh processes only those 100 rows. A FULL refresh would need to scan all 10 million.


Next in This Series

What Happens When You UPDATE a Row?

This tutorial traces what happens when an UPDATE statement hits a base table that is referenced by a stream table. It covers the trigger capture, the scan-level decomposition into DELETE + INSERT, and how each DVM operator propagates the change — including cases where the group key changes, where JOINs are involved, and where multiple UPDATEs happen within a single refresh window.

Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the full 7-phase lifecycle. This tutorial focuses on how UPDATE differs.

Setup

Same e-commerce example:

CREATE TABLE orders (
    id       SERIAL PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   NUMERIC(10,2) NOT NULL
);

SELECT pgstream.create_stream_table(
    'customer_totals',
    $$
      SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
      FROM orders GROUP BY customer
    $$,
    '1m',
    'DIFFERENTIAL'
);

-- Seed some data
INSERT INTO orders (customer, amount) VALUES
    ('alice', 49.99),
    ('alice', 30.00),
    ('bob',   75.00);

After the first refresh, the stream table contains:

customer | total | order_count
---------|-------|------------
alice    | 79.99 | 2
bob      | 75.00 | 1

Case 1: Simple Value UPDATE (Same Group Key)

UPDATE orders SET amount = 59.99 WHERE id = 1;

Alice's first order changes from 49.99 to 59.99. The customer (group key) stays the same.

Phase 1: Trigger Capture

The AFTER UPDATE trigger fires and writes one row to the change buffer with both OLD and NEW values:

pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┬────────────┬──────────┬────────────┐
│ change_id │ lsn         │ action │ new_cust │ new_amt  │ old_cust   │ old_amt  │ pk_hash    │
├───────────┼─────────────┼────────┼──────────┼──────────┼────────────┼──────────┼────────────┤
│ 4         │ 0/1A3F3000  │ U      │ alice    │ 59.99    │ alice      │ 49.99    │ -837291    │
└───────────┴─────────────┴────────┴──────────┴──────────┴────────────┴──────────┴────────────┘

Key difference from INSERT: the trigger writes both new_* and old_* columns. The pk_hash is computed from NEW.id.

Phase 2–4: Scheduler, Frontier, Change Detection

Identical to the INSERT flow. The scheduler detects one change row in the LSN window.

Phase 5: Scan Differentiation — The U → D+I Split

This is where UPDATE handling diverges fundamentally. The scan delta operator decomposes the UPDATE into two events:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291      | D            | alice    | 49.99     ← old values (DELETE)
-837291      | I            | alice    | 59.99     ← new values (INSERT)

Why split into D+I? This is a core IVM principle. Downstream operators (aggregates, joins, filters) don't have special "update" logic — they only understand insertions and deletions. By decomposing the UPDATE:

  • The DELETE event subtracts the old values from running aggregates
  • The INSERT event adds the new values

This algebraic approach handles arbitrary operator trees without operator-specific update logic.
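The split itself is mechanical. A minimal Python sketch of the decomposition rule (column names mirror the change-buffer layout; the function is illustrative, not part of pg_stream):

```python
def split_change(change: dict) -> list:
    """Decompose one captured change into insert/delete events:
    'D' and 'U' emit old values as a DELETE; 'I' and 'U' emit new
    values as an INSERT. An UPDATE therefore yields both."""
    events = []
    if change['action'] in ('D', 'U'):
        events.append(('D', change['old_customer'], change['old_amount']))
    if change['action'] in ('I', 'U'):
        events.append(('I', change['new_customer'], change['new_amount']))
    return events

upd = {'action': 'U', 'old_customer': 'alice', 'old_amount': 49.99,
       'new_customer': 'alice', 'new_amount': 59.99}
print(split_change(upd))
# [('D', 'alice', 49.99), ('I', 'alice', 59.99)]
```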

Phase 5 (continued): Aggregate Differentiation

The aggregate operator processes both events against the alice group:

-- DELETE event: subtract old values
alice: total += CASE WHEN action='D' THEN -49.99 END  →  -49.99
alice: count += CASE WHEN action='D' THEN -1 END       →  -1

-- INSERT event: add new values
alice: total += CASE WHEN action='I' THEN +59.99 END  →  +59.99
alice: count += CASE WHEN action='I' THEN +1 END       →  +1

Net effect on alice's group:

total delta:  -49.99 + 59.99 = +10.00
count delta:  -1 + 1 = 0

The aggregate emits this as an INSERT (because the group still exists and its value changed):

customer | total  | order_count | __pgs_row_id | __pgs_action
---------|--------|-------------|--------------|-------------
alice    | +10.00 | 0           | 7283194      | I

Phase 6: MERGE

The MERGE takes the WHEN MATCHED AND action = 'I' branch and updates alice's existing row. Note that the MERGE itself does not add deltas to the stored values; it replaces the row wholesale. The aggregate delta query computes the new absolute values by combining the stored state with the delta:

COALESCE(existing.total, 0) + delta.total              → 79.99 + 10.00 = 89.99
COALESCE(existing.__pgs_count, 0) + delta.__pgs_count  → 2 + 0 = 2

Result:

SELECT * FROM customer_totals;
 customer | total | order_count
----------|-------|------------
 alice    | 89.99 | 2            ← was 79.99
 bob      | 75.00 | 1

Case 2: Group Key Change (Customer Reassignment)

UPDATE orders SET customer = 'bob' WHERE id = 2;

Alice's second order (amount=30.00) is reassigned to Bob. The group key itself changes.

Trigger Capture

change_id | lsn         | action | new_cust | new_amt | old_cust | old_amt | pk_hash
5         | 0/1A3F3100  | U      | bob      | 30.00   | alice    | 30.00   | 4521038

The old and new customer values differ.

Scan Delta: D+I Split

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
4521038      | D            | alice    | 30.00    ← removes from alice's group
4521038      | I            | bob      | 30.00    ← adds to bob's group

Aggregate Delta

The aggregate groups by customer, so the DELETE and INSERT land in different groups:

Group "alice":
  total delta:  -30.00
  count delta:  -1

Group "bob":
  total delta:  +30.00
  count delta:  +1

After MERGE

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 alice    | 59.99  | 1            ← lost one order (-30.00)
 bob      | 105.00 | 2            ← gained one order (+30.00)

This is why the D+I decomposition is essential. Without it, you'd need special "move between groups" logic. With it, the standard aggregate differentiation handles group key changes naturally.


Case 3: UPDATE That Deletes a Group

-- Alice only has one order left. Reassign it to bob.
UPDATE orders SET customer = 'bob' WHERE id = 1;

Aggregate Delta

Group "alice":
  total delta:    -59.99
  count delta:    -1
  new __pgs_count: 1 - 1 = 0  → group vanishes!

Group "bob":
  total delta:    +59.99
  count delta:    +1

When __pgs_count reaches 0, the aggregate emits a DELETE for alice's group:

customer | total | __pgs_row_id | __pgs_action
---------|-------|--------------|-------------
alice    | —     | 7283194      | D             ← group removed
bob      | ...   | 9182734      | I             ← group updated

The MERGE deletes alice's row entirely:

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 bob      | 165.00 | 3

Case 4: Multiple UPDATEs on the Same Row (Within One Refresh Window)

What if a row is updated multiple times before the next refresh?

UPDATE orders SET amount = 10.00 WHERE id = 3;  -- bob: 75 → 10
UPDATE orders SET amount = 20.00 WHERE id = 3;  -- bob: 10 → 20
UPDATE orders SET amount = 30.00 WHERE id = 3;  -- bob: 20 → 30

The change buffer now has 3 rows for pk_hash of order #3:

change_id | action | old_amt | new_amt
6         | U      | 75.00   | 10.00
7         | U      | 10.00   | 20.00
8         | U      | 20.00   | 30.00

Net-Effect Computation

The scan delta splits its work into two paths: PKs with exactly one change take a fast path, while PKs with multiple changes take a window-function path. Since order #3 has multiple changes (cnt > 1), it takes the multi-change path:

FIRST_VALUE(action) OVER (PARTITION BY pk_hash ORDER BY change_id)  → 'U'
LAST_VALUE(action)  OVER (PARTITION BY pk_hash ORDER BY change_id
                          ROWS BETWEEN UNBOUNDED PRECEDING
                          AND UNBOUNDED FOLLOWING)                  → 'U'

(LAST_VALUE needs the explicit frame clause: the default frame ends at the current row, which would return each row's own action rather than the partition's last.)

Both first and last actions are 'U', so:

  • DELETE: emits using old values from the earliest change (change_id=6): old_amt = 75.00
  • INSERT: emits using new values from the latest change (change_id=8): new_amt = 30.00

Net delta:

__pgs_row_id | __pgs_action | amount
-------------|--------------|-------
pk_hash_3    | D            | 75.00    ← original value before all changes
pk_hash_3    | I            | 30.00    ← final value after all changes

The aggregate sees -75.00 + 30.00 = -45.00. This is correct regardless of the intermediate values. The intermediate rows (10.00, 20.00) are never seen.
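The net-effect rule generalizes to every combination of actions within a window. A hedged Python sketch of the documented filtering rule (emit a DELETE with the earliest old values unless the row was born in this window, and an INSERT with the latest new values unless it died in it); the function is illustrative, not pg_stream's SQL:

```python
def net_effect(changes: list) -> list:
    """Collapse all buffered changes for one primary key into at most
    two events: D with the earliest old value if first_action != 'I',
    I with the latest new value if last_action != 'D'."""
    changes = sorted(changes, key=lambda c: c['change_id'])
    first, last = changes[0], changes[-1]
    events = []
    if first['action'] != 'I':
        events.append(('D', first['old_amount']))
    if last['action'] != 'D':
        events.append(('I', last['new_amount']))
    return events

updates = [
    {'change_id': 6, 'action': 'U', 'old_amount': 75.00, 'new_amount': 10.00},
    {'change_id': 7, 'action': 'U', 'old_amount': 10.00, 'new_amount': 20.00},
    {'change_id': 8, 'action': 'U', 'old_amount': 20.00, 'new_amount': 30.00},
]
print(net_effect(updates))   # [('D', 75.0), ('I', 30.0)]

born_and_died = [
    {'change_id': 9,  'action': 'I', 'old_amount': None,   'new_amount': 100.00},
    {'change_id': 10, 'action': 'D', 'old_amount': 100.00, 'new_amount': None},
]
print(net_effect(born_and_died))   # [] -- INSERT and DELETE cancel
```

The same function reproduces the INSERT+UPDATE and UPDATE+DELETE cases below: first='I', last='U' yields a single INSERT with the final value, and first='U', last='D' yields a single DELETE with the original value.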


Case 5: INSERT + UPDATE in Same Window

INSERT INTO orders (customer, amount) VALUES ('charlie', 100.00);
UPDATE orders SET amount = 200.00 WHERE customer = 'charlie';

Both happen before the next refresh. The buffer has:

change_id | action | old_amt | new_amt
9         | I      | NULL    | 100.00
10        | U      | 100.00  | 200.00

Net-effect analysis:

  • first_action = 'I' (row didn't exist before this window)
  • last_action = 'U' (row exists after)

Result:

  • No DELETE emitted (first_action = 'I' means the row was born in this window)
  • INSERT with final values: (charlie, 200.00)

The aggregate sees a pure insertion of (charlie, 200.00) — the intermediate value of 100.00 never appears.


Case 6: UPDATE + DELETE in Same Window

UPDATE orders SET amount = 999.99 WHERE id = 3;
DELETE FROM orders WHERE id = 3;

Net-effect:

  • first_action = 'U' (row existed before)
  • last_action = 'D' (row no longer exists)

Result:

  • DELETE with original old values from the first change
  • No INSERT (last_action = 'D')

The aggregate correctly sees only a removal.


Case 7: UPDATE with JOINs

Consider a stream table that joins two tables:

CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    tier TEXT NOT NULL DEFAULT 'standard'
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    amount      NUMERIC(10,2)
);

SELECT pgstream.create_stream_table(
    'order_details',
    $$
      SELECT c.name, c.tier, o.amount
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
    $$,
    '1m',
    'DIFFERENTIAL'
);

Now update a customer's tier:

UPDATE customers SET tier = 'premium' WHERE name = 'alice';

How the JOIN Delta Works

The join differentiation follows the formula:

$$\Delta(L \bowtie R) = (\Delta L \bowtie R) \cup (L \bowtie \Delta R) - (\Delta L \bowtie \Delta R)$$

Since only the customers table changed:

  • $\Delta L$ = changes to orders (empty)
  • $\Delta R$ = changes to customers (alice's tier: standard → premium)

So:

  • Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = empty (no order changes)
  • Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = all of alice's orders joined with her tier change
  • Part 3: $\Delta\text{orders} \bowtie \Delta\text{customers}$ = empty (no order changes)

Part 2 produces the delta: for each of alice's orders, DELETE the old row (with tier='standard') and INSERT a new row (with tier='premium').

The stream table is updated to reflect the new tier across all of alice's order rows.
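The bilinear expansion can be checked mechanically with signed multisets (DBSP-style Z-sets), where inserts carry multiplicity +1 and deletes -1. This is an illustrative Python sketch under that assumption, using the formula with post-update snapshots of L and R; it is not pg_stream's generated SQL:

```python
from collections import Counter

def zjoin(a: Counter, b: Counter, key_a: int, key_b: int) -> Counter:
    """Join two signed multisets (row tuple -> signed multiplicity);
    multiplicities of matching rows multiply."""
    out = Counter()
    for ra, ma in a.items():
        for rb, mb in b.items():
            if ra[key_a] == rb[key_b]:
                out[ra + rb] += ma * mb
    return out

def join_delta(L_new, dL, R_new, dR, key_l, key_r) -> dict:
    """Delta(L join R) = (dL join R_new) + (L_new join dR) - (dL join dR),
    with L_new / R_new the post-update table snapshots."""
    out = Counter()
    for part, sign in ((zjoin(dL, R_new, key_l, key_r), 1),
                       (zjoin(L_new, dR, key_l, key_r), 1),
                       (zjoin(dL, dR, key_l, key_r), -1)):
        for row, m in part.items():
            out[row] += sign * m
    return {row: m for row, m in out.items() if m != 0}

# alice's tier change: orders unchanged, customers delta = {-old, +new}
orders    = Counter({(1, 1, 50.00): 1, (2, 1, 30.00): 1})  # (id, customer_id, amount)
customers = Counter({(1, 'alice', 'premium'): 1})          # post-update snapshot
d_orders  = Counter()
d_cust    = Counter({(1, 'alice', 'standard'): -1, (1, 'alice', 'premium'): 1})

delta = join_delta(orders, d_orders, customers, d_cust, key_l=1, key_r=0)
for row, mult in sorted(delta.items(), key=str):
    print(mult, row)   # -1 for each old-tier row, +1 for each new-tier row
```

Each of alice's two orders produces a -1 (DELETE) row with tier='standard' and a +1 (INSERT) row with tier='premium', matching the described delta.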


Performance Summary

Scenario                  | Buffer rows | Delta rows emitted            | Work
--------------------------|-------------|-------------------------------|-------------------------
Simple value change       | 1           | 2 (D+I)                       | O(1) per group
Group key change          | 1           | 2 (D+I, different groups)     | O(1) per affected group
Group deletion            | 1           | 1 (D) + 1 (I), or 1 (D)       | O(1)
N updates, same row       | N           | 2 (D first-old + I last-new)  | O(N) scan, O(1) aggregate
INSERT+UPDATE same window | 2           | 1 (I only)                    | O(1)
UPDATE+DELETE same window | 2           | 1 (D only)                    | O(1)

In all cases, the work is proportional to the number of changed rows, not the total table size. A single UPDATE on a billion-row table produces the same delta cost as on a 10-row table.


Next in This Series

What Happens When You DELETE a Row?

This tutorial traces what happens when a DELETE statement hits a base table that is referenced by a stream table. It covers the trigger capture, how the scan delta emits a single DELETE event, and how each DVM operator propagates the removal — including group deletion, partial group reduction, JOINs, cascading deletes within a single refresh window, and the important edge case where a DELETE cancels a prior INSERT.

Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the full 7-phase lifecycle (trigger → scheduler → frontier → change detection → DVM delta → MERGE → cleanup). This tutorial focuses on how DELETE differs.

Setup

Same e-commerce example used throughout the series:

CREATE TABLE orders (
    id       SERIAL PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   NUMERIC(10,2) NOT NULL
);

SELECT pgstream.create_stream_table(
    'customer_totals',
    $$
      SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
      FROM orders GROUP BY customer
    $$,
    '1m',
    'DIFFERENTIAL'
);

-- Seed some data
INSERT INTO orders (customer, amount) VALUES
    ('alice', 50.00),
    ('alice', 30.00),
    ('bob',   75.00),
    ('bob',   25.00);

After the first refresh, the stream table contains:

customer | total  | order_count
---------|--------|------------
alice    | 80.00  | 2
bob      | 100.00 | 2

Case 1: Delete One Row (Group Survives)

DELETE FROM orders WHERE id = 2;  -- alice's 30.00 order

Alice still has one remaining order (id=1, amount=50.00). The group shrinks but doesn't vanish.

Phase 1: Trigger Capture

The AFTER DELETE trigger fires and writes one row to the change buffer with only OLD values:

pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┬────────────┬──────────┬────────────┐
│ change_id │ lsn         │ action │ new_cust │ new_amt  │ old_cust   │ old_amt  │ pk_hash    │
├───────────┼─────────────┼────────┼──────────┼──────────┼────────────┼──────────┼────────────┤
│ 5         │ 0/1A3F3000  │ D      │ NULL     │ NULL     │ alice      │ 30.00    │ 4521038    │
└───────────┴─────────────┴────────┴──────────┴──────────┴────────────┴──────────┴────────────┘

Key difference from INSERT and UPDATE:

  • new_* columns are all NULL — the row no longer exists, so there are no NEW values
  • old_* columns contain the deleted row's data — this is what gets subtracted
  • pk_hash is computed from OLD.id (the deleted row's primary key)

Phase 2–4: Scheduler, Frontier, Change Detection

Identical to the INSERT flow. The scheduler detects one change row in the LSN window.

Phase 5: Scan Differentiation — Pure DELETE

Unlike UPDATE (which splits into D+I), a DELETE produces a single event:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
4521038      | D            | alice    | 30.00

The scan delta applies the net-effect filtering rule:

  • first_action = 'D' → row existed before the refresh window
  • last_action = 'D' → row does not exist after

Result: emit a DELETE using old values. No INSERT is emitted (because last_action = 'D').

This is the simplest path through the scan delta — one change, one PK, one DELETE event.

Phase 5 (continued): Aggregate Differentiation

The aggregate operator processes the DELETE event against the alice group:

-- DELETE event: subtract old values from alice's group
__ins_count = 0         -- no inserts
__del_count = 1         -- one deletion
__ins_total = 0         -- no amount added
__del_total = 30.00     -- 30.00 removed

The merge CTE joins this delta with the existing stream table state:

new_count = old_count + ins_count - del_count = 2 + 0 - 1 = 1  (still > 0)

Since new_count > 0 and the group already existed (old_count = 2), the action is classified as 'U' (update). The aggregate emits the group with its new values:

customer | total | order_count | __pgs_row_id | __pgs_action
---------|-------|-------------|--------------|-------------
alice    | 50.00 | 1           | 7283194      | I

Note: the 'U' meta-action is emitted as __pgs_action = 'I' because the MERGE treats it as an update-via-INSERT (see aggregate final CTE: CASE WHEN __pgs_meta_action = 'D' THEN 'D' ELSE 'I' END).

Phase 6: MERGE

The MERGE statement matches alice's existing row and updates it:

MERGE INTO customer_totals AS st
USING (...delta...) AS d
ON st.__pgs_row_id = d.__pgs_row_id
WHEN MATCHED AND d.__pgs_action = 'I' THEN
  UPDATE SET customer = d.customer, total = d.total, order_count = d.order_count, ...

Result:

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 alice    | 50.00  | 1            ← was 80.00 / 2
 bob      | 100.00 | 2

Phase 7: Cleanup

The change buffer rows in the consumed LSN window are deleted:

DELETE FROM pgstream_changes.changes_16384
WHERE lsn > '0/1A3F2FFF'::pg_lsn AND lsn <= '0/1A3F3000'::pg_lsn;

Case 2: Delete Last Row in Group (Group Vanishes)

-- Alice has one order left (id=1, amount=50.00). Delete it.
DELETE FROM orders WHERE id = 1;

Trigger Capture

change_id | lsn         | action | old_cust | old_amt | pk_hash
6         | 0/1A3F3100  | D      | alice    | 50.00   | -837291

Scan Delta

Single DELETE event:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291      | D            | alice    | 50.00

Aggregate Delta

Group "alice":
  ins_count = 0
  del_count = 1
  new_count = old_count + 0 - 1 = 1 - 1 = 0  → group vanishes!

When new_count drops to 0 (or below), the aggregate classifies this as action 'D' (delete). The reference count has reached zero — no rows contribute to this group anymore.

The aggregate emits a DELETE for alice's group:

customer | __pgs_row_id | __pgs_action
---------|--------------|-------------
alice    | 7283194      | D

MERGE

The MERGE matches alice's existing row and deletes it:

WHEN MATCHED AND d.__pgs_action = 'D' THEN DELETE

Result:

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 bob      | 100.00 | 2

Alice's row is completely removed from the stream table. This is the correct behavior — with zero contributing rows, the group should not exist.


Case 3: Delete Multiple Rows (Same Group, Same Window)

-- Delete both of bob's orders before the next refresh
DELETE FROM orders WHERE id = 3;  -- bob, 75.00
DELETE FROM orders WHERE id = 4;  -- bob, 25.00

The change buffer has two rows with different pk_hash values (different PKs):

change_id | action | old_cust | old_amt | pk_hash
7         | D      | bob      | 75.00   | pk_hash_3
8         | D      | bob      | 25.00   | pk_hash_4

Scan Delta

Each PK has exactly one change, so both take the single-change fast path:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
pk_hash_3    | D            | bob      | 75.00
pk_hash_4    | D            | bob      | 25.00

Two DELETE events, both targeting bob's group.

Aggregate Delta

The aggregate sums both deletions:

Group "bob":
  ins_count = 0
  del_count = 2
  del_total = 75.00 + 25.00 = 100.00
  new_count = 2 + 0 - 2 = 0  → group vanishes!

The aggregate emits a DELETE for bob's group.

MERGE

Bob's row is deleted from the stream table. With both alice and bob gone (from Cases 1+2+3), the stream table is now empty.


Case 4: INSERT + DELETE in Same Window (Cancellation)

What if a row is inserted and then deleted before the next refresh?

INSERT INTO orders (customer, amount) VALUES ('charlie', 200.00);
DELETE FROM orders WHERE customer = 'charlie';

The change buffer has:

change_id | action | new_cust | new_amt | old_cust | old_amt | pk_hash
9         | I      | charlie  | 200.00  | NULL     | NULL    | pk_hash_new
10        | D      | NULL     | NULL    | charlie  | 200.00  | pk_hash_new

Net-Effect Computation

Both changes share the same pk_hash. The pk_stats CTE finds cnt = 2, so this goes through the multi-change path:

first_action = FIRST_VALUE(action) OVER (...) → 'I'
last_action  = LAST_VALUE(action)  OVER (...) → 'D'

The scan delta applies the net-effect filtering:

  • DELETE branch: requires first_action != 'I' → FAILS (first_action = 'I')
  • INSERT branch: requires last_action != 'D' → FAILS (last_action = 'D')

Result: zero events emitted. The INSERT and DELETE completely cancel each other out.

The aggregate never sees charlie. The stream table is unchanged. This is correct — the row was born and died within the same refresh window, so it should have no visible effect.


Case 5: UPDATE + DELETE in Same Window

UPDATE orders SET amount = 999.99 WHERE id = 3;  -- bob: 75 → 999.99
DELETE FROM orders WHERE id = 3;

The change buffer:

change_id | action | old_amt | new_amt
11        | U      | 75.00   | 999.99
12        | D      | 999.99  | NULL

Net-Effect Computation

Same pk_hash, cnt = 2:

first_action = 'U'  (row existed before this window)
last_action  = 'D'  (row no longer exists)

Filtering:

  • DELETE branch: first_action != 'I' → OK. Emit DELETE with old values from the earliest change: old_amt = 75.00
  • INSERT branch: last_action != 'D' → FAILS. No INSERT emitted.

Net delta:

__pgs_row_id | __pgs_action | amount
-------------|--------------|-------
pk_hash_3    | D            | 75.00

The intermediate value of 999.99 never appears. The aggregate sees only the removal of the original value (75.00), which is correct — that's the value that was previously accounted for in the stream table.


Case 6: DELETE with JOINs

Consider a stream table that joins two tables:

CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    tier TEXT NOT NULL DEFAULT 'standard'
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    amount      NUMERIC(10,2)
);

SELECT pgstream.create_stream_table(
    'order_details',
    $$
      SELECT c.name, c.tier, o.amount
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
    $$,
    '1m',
    'DIFFERENTIAL'
);

Seed data:

INSERT INTO customers VALUES (1, 'alice', 'premium'), (2, 'bob', 'standard');
INSERT INTO orders VALUES (1, 1, 50.00), (2, 1, 30.00), (3, 2, 75.00);

After refresh, the stream table has:

name  | tier     | amount
------|----------|-------
alice | premium  | 50.00
alice | premium  | 30.00
bob   | standard | 75.00

Now delete an order:

DELETE FROM orders WHERE id = 2;  -- alice's 30.00 order

How the JOIN Delta Works

The join differentiation formula:

$$\Delta(L \bowtie R) = (\Delta L \bowtie R) \cup (L \bowtie \Delta R) - (\Delta L \bowtie \Delta R)$$

Since only the orders table changed:

  • $\Delta L$ = changes to orders (one DELETE: order #2)
  • $\Delta R$ = changes to customers (empty)

So:

  • Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = the deleted order joined with its customer
  • Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = empty (no customer changes)
  • Part 3: $\Delta\text{orders} \bowtie \Delta\text{customers}$ = empty (customers unchanged)

Part 1 produces:

name  | tier    | amount | __pgs_action
------|---------|--------|-------------
alice | premium | 30.00  | D

The deleted order is joined with alice's customer record to produce a DELETE delta row with the complete joined values.

MERGE

The MERGE matches the row (alice, premium, 30.00) and deletes it:

SELECT * FROM order_details;
 name  | tier     | amount
-------|----------|-------
 alice | premium  | 50.00      ← alice's remaining order
 bob   | standard | 75.00

What About Deleting From the Dimension Table?

DELETE FROM customers WHERE id = 2;  -- remove bob entirely

Now $\Delta R$ has a DELETE for bob, while $\Delta L$ is empty:

  • Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = empty
  • Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = bob's order(s) joined with deleted customer record

Part 2 produces DELETE events for every order that referenced bob:

name | tier     | amount | __pgs_action
-----|----------|--------|-------------
bob  | standard | 75.00  | D

After MERGE, bob's rows vanish from the stream table.

Note: This assumes referential integrity — if orders still references customer #2, a foreign key constraint would prevent the DELETE in practice. But from the IVM perspective, the join delta correctly handles the removal regardless.


Case 7: Bulk DELETE

DELETE FROM orders WHERE amount < 50.00;

This deletes multiple rows across potentially multiple groups. The trigger fires once per row (it's a FOR EACH ROW trigger), writing one change buffer entry per deleted row:

change_id | action | old_cust | old_amt | pk_hash
13        | D      | alice    | 30.00   | pk_hash_2
14        | D      | bob      | 25.00   | pk_hash_4

Scan Delta

Each deleted PK is independent (different pk_hash values), so each takes the single-change fast path. Two DELETE events:

__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
pk_hash_2    | D            | alice    | 30.00
pk_hash_4    | D            | bob      | 25.00

Aggregate Delta

The aggregate groups these by customer:

Group "alice":
  del_count = 1, del_total = 30.00
  new_count = 2 - 1 = 1  (survives)

Group "bob":
  del_count = 1, del_total = 25.00
  new_count = 2 - 1 = 1  (survives)

Both groups survive (count > 0), so the aggregate emits UPDATE (as 'I') events with new values:

customer | total | order_count
---------|-------|------------
alice    | 50.00 | 1
bob      | 75.00 | 1

The MERGE updates both rows. All work is proportional to the number of deleted rows (2), not the total table size.


Case 8: TRUNCATE (What Doesn't Work)

TRUNCATE orders;

TRUNCATE does not fire row-level triggers. The change buffer receives zero entries. The next refresh cycle will detect no changes and skip the stream table.

This means the stream table becomes stale — it still shows the old aggregate values while the base table is empty.

Workaround: After a TRUNCATE, manually reinitialize:

SELECT pgstream.refresh_stream_table('customer_totals');

With the FULL refresh mode (or by setting the reinitialize flag), the stream table is recomputed from scratch, producing correct results.

Why no TRUNCATE support? PostgreSQL fires statement-level AFTER TRUNCATE triggers, but they don't provide OLD row data. The trigger can detect that a TRUNCATE happened, but cannot enumerate which rows were removed. Supporting TRUNCATE would require reading the stream table to determine what to subtract — effectively a full refresh anyway.


How DELETE Differs From INSERT and UPDATE — A Summary

Aspect              | INSERT                  | UPDATE                     | DELETE
--------------------|-------------------------|----------------------------|-------------------------
Trigger writes      | new_* columns only      | Both new_* and old_*       | old_* columns only
new_* columns       | Row values              | New values                 | NULL
old_* columns       | NULL                    | Old values                 | Row values
pk_hash source      | NEW.pk                  | NEW.pk                     | OLD.pk
Scan delta output   | 1 INSERT event          | 2 events (D+I split)       | 1 DELETE event
Aggregate effect    | Adds to group count/sum | Subtracts old, adds new    | Subtracts from group
Can delete a group? | No (only creates/grows) | Yes (if group key changes) | Yes (if count reaches 0)
MERGE action        | INSERT new row          | UPDATE existing row        | DELETE matched row

The Reference Counting Principle

The core insight behind incremental DELETE handling is reference counting. Every aggregate group in the stream table maintains an internal counter (__pgs_count) that tracks how many source rows contribute to the group:

Stream table internal state:
customer | total | order_count | __pgs_count (hidden)
---------|-------|-------------|---------------------
alice    | 80.00 | 2           | 2
bob      | 100.00| 2           | 2

  • INSERT → __pgs_count += 1
  • DELETE → __pgs_count -= 1
  • UPDATE → __pgs_count += 0 (the D cancels the I for same-group updates)

When __pgs_count reaches 0:

  • The group has zero contributing rows
  • The aggregate emits a DELETE event
  • The MERGE removes the row from the stream table

This is mathematically rigorous — the stream table always reflects the correct result of the defining query over the current base table contents, incrementally maintained through algebraic delta operations.
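The three outcomes (update, delete-on-zero, insert) map naturally onto a single MERGE. The sketch below is hedged: the statement pg_stream actually generates is internal, and the names (`group_deltas`, `count_delta`, `total_delta`) are invented for illustration:

-- Hedged sketch of the group-level MERGE decision. "group_deltas" is assumed
-- to hold one row per affected group with the net count and sum deltas.
MERGE INTO customer_totals AS t
USING group_deltas AS d
      ON t.__pgs_row_id = d.__pgs_row_id
WHEN MATCHED AND t.__pgs_count + d.count_delta <= 0 THEN
    DELETE                                   -- group emptied: remove the row
WHEN MATCHED THEN
    UPDATE SET total       = t.total + d.total_delta,
               order_count = t.order_count + d.count_delta,
               __pgs_count = t.__pgs_count + d.count_delta
WHEN NOT MATCHED THEN
    INSERT (__pgs_row_id, customer, total, order_count, __pgs_count)
    VALUES (d.__pgs_row_id, d.customer, d.total_delta, d.count_delta, d.count_delta);

PostgreSQL's MERGE evaluates WHEN clauses in order, so the conditional DELETE branch is checked before the unconditional UPDATE branch.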


Performance Summary

Scenario                           | Buffer rows | Delta rows emitted           | Work
-----------------------------------|-------------|------------------------------|--------------------------
Single-row DELETE (group survives) | 1           | 1 (D)                        | O(1) per group
Single-row DELETE (group vanishes) | 1           | 1 (D)                        | O(1)
N deletes, same group              | N           | N (D) → 1 group delta        | O(N) scan, O(1) per group
INSERT+DELETE in same window       | 2           | 0 (cancels)                  | O(1)
UPDATE+DELETE in same window       | 2           | 1 (D original)               | O(1)
Bulk DELETE across M groups        | N           | N (D) → M group deltas       | O(N) scan, O(M) aggregate
JOIN table DELETE                  | 1           | K (one per matched join row) | O(K) join

In all cases, the work is proportional to the number of changed rows, not the total table size. Deleting 3 rows from a billion-row table produces the same delta cost as from a 10-row table.


Next in This Series

What Happens When You TRUNCATE a Table?

This tutorial explains what happens when a TRUNCATE statement hits a base table that is referenced by a stream table. Unlike INSERT, UPDATE, and DELETE — which are fully tracked by the CDC trigger — TRUNCATE is a special case that bypasses row-level triggers entirely. Understanding this gap is essential for operating pg_stream correctly.

Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the 7-phase lifecycle. This tutorial explains why TRUNCATE breaks that lifecycle and how to recover.

Setup

Same e-commerce example used throughout the series:

CREATE TABLE orders (
    id       SERIAL PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   NUMERIC(10,2) NOT NULL
);

SELECT pgstream.create_stream_table(
    'customer_totals',
    $$
      SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
      FROM orders GROUP BY customer
    $$,
    '1m',
    'DIFFERENTIAL'
);

-- Seed some data
INSERT INTO orders (customer, amount) VALUES
    ('alice', 50.00),
    ('alice', 30.00),
    ('bob',   75.00),
    ('bob',   25.00);

After the first refresh, the stream table contains:

customer | total  | order_count
---------|--------|------------
alice    | 80.00  | 2
bob      | 100.00 | 2

Case 1: TRUNCATE the Base Table

TRUNCATE orders;

All four rows are removed instantly.

What Happens at the Trigger Level: Nothing

PostgreSQL's TRUNCATE command does not fire row-level AFTER triggers. This is by design in PostgreSQL — TRUNCATE is a DDL-like operation that removes all rows without scanning them individually. The per-row AFTER INSERT OR UPDATE OR DELETE trigger that pg_stream installs is simply never invoked.

The change buffer remains empty:

pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┐
│ change_id │ lsn         │ action │ ...      │ ...      │
├───────────┼─────────────┼────────┼──────────┼──────────┤
│ (empty)   │             │        │          │          │
└───────────┴─────────────┴────────┴──────────┴──────────┘

No rows. Zero change events captured.

What Happens at the Scheduler: Skip

On the next refresh cycle, the scheduler:

  1. Checks the change buffer for rows in the LSN window
  2. Finds zero rows
  3. Classifies the refresh action as NO_DATA
  4. Advances the data timestamp but does not modify the stream table

The stream table still shows the old values:

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 alice    | 80.00  | 2            ← STALE! orders table is empty
 bob      | 100.00 | 2            ← STALE! orders table is empty

The stream table is now stale. It reflects data that no longer exists in the base table.

Why This Happens

The entire incremental view maintenance pipeline depends on the change buffer to know what changed. Without change events, the DVM has no deltas to apply. The stream table's reference-counted aggregates still think there are 4 orders contributing to two groups.

This is not a bug — it's a fundamental limitation of trigger-based CDC. The trigger can only fire when PostgreSQL executes individual row operations, and TRUNCATE deliberately skips per-row processing for performance.


Case 2: How to Recover — Manual Refresh

The fix is straightforward. Force a manual refresh:

SELECT pgstream.refresh_stream_table('customer_totals');

This executes a full refresh regardless of the stream table's configured refresh mode:

  1. TRUNCATE the stream table itself (clearing the stale data)
  2. Re-execute the defining query: SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count FROM orders GROUP BY customer
  3. INSERT the results into the stream table
  4. Update the frontier so future differential refreshes start from the current LSN

Since the orders table is empty, the defining query returns zero rows. The stream table becomes empty too:

SELECT pgstream.refresh_stream_table('customer_totals');

SELECT * FROM customer_totals;
 customer | total | order_count
----------|-------|------------
 (0 rows)                        ← correct: orders is empty

The stream table is now consistent again. Future INSERT/UPDATE/DELETE operations on orders will be captured normally by the trigger and propagated incrementally.


Case 3: TRUNCATE Then INSERT (Common ETL Pattern)

A common data loading pattern is:

BEGIN;
TRUNCATE orders;
INSERT INTO orders (customer, amount) VALUES
    ('charlie', 100.00),
    ('charlie', 200.00),
    ('dave',    150.00);
COMMIT;

What the Change Buffer Sees

  • TRUNCATE: 0 events (not captured)
  • INSERT charlie 100.00: 1 event (captured)
  • INSERT charlie 200.00: 1 event (captured)
  • INSERT dave 150.00: 1 event (captured)

The change buffer has 3 INSERT events — but knows nothing about the 4 rows that were removed by TRUNCATE.

What the Scheduler Does

The scheduler sees 3 pending changes and runs a DIFFERENTIAL refresh:

Aggregate delta (from INSERTs only):
  Group "charlie": ins_count = 2, ins_total = 300.00
  Group "dave":    ins_count = 1, ins_total = 150.00

The MERGE applies these deltas to the existing stream table:

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 alice    | 80.00  | 2            ← STALE! alice has no orders
 bob      | 100.00 | 2            ← STALE! bob has no orders
 charlie  | 300.00 | 2            ← correct
 dave     | 150.00 | 1            ← correct

The new data (charlie, dave) is correct. But the old data (alice, bob) persists because the TRUNCATE that removed their orders was never captured. The reference counts for alice and bob were never decremented.

How to Fix

SELECT pgstream.refresh_stream_table('customer_totals');

SELECT * FROM customer_totals;
 customer | total  | order_count
----------|--------|------------
 charlie  | 300.00 | 2            ← correct
 dave     | 150.00 | 1            ← correct

Case 4: TRUNCATE a Dimension Table in a JOIN

Consider a stream table that joins two tables:

CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    tier TEXT NOT NULL DEFAULT 'standard'
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    amount      NUMERIC(10,2)
);

SELECT pgstream.create_stream_table(
    'order_details',
    $$
      SELECT c.name, c.tier, o.amount
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
    $$,
    '1m',
    'DIFFERENTIAL'
);

Now truncate the dimension table:

TRUNCATE customers CASCADE;

The CASCADE also truncates orders (due to the foreign key). Neither TRUNCATE fires row-level triggers. The change buffer is empty for both tables.

The stream table continues to show all the old joined rows as if nothing happened. The only recovery is a manual refresh:

SELECT pgstream.refresh_stream_table('order_details');

Case 5: FULL Mode Stream Tables Are Immune

If the stream table uses FULL refresh mode instead of DIFFERENTIAL:

SELECT pgstream.create_stream_table(
    'customer_totals_full',
    $$
      SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
      FROM orders GROUP BY customer
    $$,
    '1m',
    'FULL'
);

A FULL-mode stream table doesn't use the change buffer at all. Every refresh cycle:

  1. TRUNCATEs the stream table
  2. Re-executes the defining query
  3. Inserts all results

So after a TRUNCATE of the base table, the next scheduled refresh automatically picks up the correct state — no manual intervention needed. The trade-off is that every refresh recomputes from scratch, which is more expensive for large result sets.


Why PostgreSQL Doesn't Fire Row Triggers on TRUNCATE

Understanding the PostgreSQL internals helps explain why this limitation exists:

Operation                | Mechanism                                                   | Row triggers fired?
-------------------------|-------------------------------------------------------------|---------------------------
DELETE FROM t            | Scans and removes rows one by one                           | Yes — AFTER DELETE per row
TRUNCATE t               | Removes the heap files and reinitializes the table storage  | No — no per-row processing
DELETE FROM t WHERE true | Same as DELETE FROM t (full scan)                           | Yes — AFTER DELETE per row

TRUNCATE is fundamentally different from DELETE. It's an O(1) operation that replaces the table's storage files, while DELETE is O(N) — scanning every row and recording each removal in WAL.

PostgreSQL does support statement-level AFTER TRUNCATE triggers:

CREATE TRIGGER after_truncate_trigger
    AFTER TRUNCATE ON orders
    FOR EACH STATEMENT
    EXECUTE FUNCTION some_function();

However, statement-level TRUNCATE triggers:

  • Do not receive OLD row data (there's no OLD record)
  • Cannot enumerate which rows were removed
  • Only know that a TRUNCATE happened on a specific table

This means a TRUNCATE trigger could detect the event but cannot generate the per-row DELETE events that the DVM pipeline needs.
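To see the limitation concretely, here is a minimal statement-level TRUNCATE trigger (the function and trigger names are illustrative, not part of pg_stream). It learns that the table was truncated but has no rows to inspect:

CREATE OR REPLACE FUNCTION note_truncate() RETURNS trigger AS $$
BEGIN
    -- TG_OP is 'TRUNCATE' here; OLD and NEW are not assigned for
    -- statement-level triggers, so we can only record that it happened.
    RAISE NOTICE 'table % was truncated', TG_TABLE_NAME;
    RETURN NULL;  -- return value is ignored for statement-level triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_truncate_note
    AFTER TRUNCATE ON orders
    FOR EACH STATEMENT
    EXECUTE FUNCTION note_truncate();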


Alternative: DELETE FROM Instead of TRUNCATE

If you need the stream table to stay consistent without manual intervention, use DELETE FROM instead of TRUNCATE:

-- Instead of: TRUNCATE orders;
DELETE FROM orders;

This is slower (O(N) vs O(1)) but fires the row-level DELETE trigger for every row. The change buffer captures all removals, and the next differential refresh correctly decrements all reference counts, removing groups whose count reaches zero.

Approach                  | Speed                 | Stream table consistent?
--------------------------|-----------------------|----------------------------------
TRUNCATE orders           | O(1) — instant        | No — requires manual refresh
DELETE FROM orders        | O(N) — scans all rows | Yes — triggers fire for each row
TRUNCATE + manual refresh | O(1) + O(query)       | Yes — after manual refresh

For tables with millions of rows, DELETE FROM can be slow and generate significant WAL. In those cases, TRUNCATE followed by a manual refresh is often the pragmatic choice.
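When a single DELETE FROM is too heavy, one middle ground (sketched here; this is general PostgreSQL practice, not a pg_stream feature) is to delete in batches so each transaction's trigger and WAL cost stays bounded:

-- Illustrative batching loop; the batch size of 10000 is arbitrary.
-- Each committed batch fires the row-level CDC triggers, so the
-- change buffer stays complete.
DO $$
DECLARE
    n bigint;
BEGIN
    LOOP
        DELETE FROM orders
        WHERE ctid IN (SELECT ctid FROM orders LIMIT 10000);
        GET DIAGNOSTICS n = ROW_COUNT;
        EXIT WHEN n = 0;
        COMMIT;  -- transaction control is allowed in DO blocks (PostgreSQL 11+)
    END LOOP;
END $$;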


Best Practices

1. Always Refresh After TRUNCATE

Make it a habit: if you TRUNCATE a base table, immediately refresh all dependent stream tables:

TRUNCATE orders;
SELECT pgstream.refresh_stream_table('customer_totals');

2. Use DELETE FROM for Small Tables

For tables with fewer than ~100K rows, DELETE FROM is fast enough and keeps everything consistent automatically.

3. Wrap TRUNCATE + Refresh in a Function

For ETL pipelines, create a helper:

CREATE OR REPLACE FUNCTION reload_orders(data jsonb) RETURNS void AS $$
BEGIN
    TRUNCATE orders;
    INSERT INTO orders SELECT * FROM jsonb_populate_recordset(null::orders, data);
    PERFORM pgstream.refresh_stream_table('customer_totals');
END;
$$ LANGUAGE plpgsql;

4. Consider FULL Mode for ETL-Heavy Tables

If a table is routinely truncated and reloaded, FULL refresh mode may be simpler than DIFFERENTIAL — it naturally handles TRUNCATE because it recomputes from scratch every cycle.

5. Monitor for Staleness

pg_stream's staleness alerts key off the data timestamp — but after a TRUNCATE, the data_timestamp still advances (the scheduler sees no pending changes), so no alert fires even though the actual data is stale. The most reliable detection is comparing the stream table results against a fresh query:

-- Quick consistency check
SELECT count(*) FROM customer_totals;  -- still shows rows?
SELECT count(*) FROM orders;            -- should be 0 after TRUNCATE

How TRUNCATE Compares to Other Operations

Aspect                | INSERT             | UPDATE             | DELETE             | TRUNCATE
----------------------|--------------------|--------------------|--------------------|------------------------
Trigger fires?        | Yes (per row)      | Yes (per row)      | Yes (per row)      | No
Change buffer         | 1 row per INSERT   | 1 row per UPDATE   | 1 row per DELETE   | Empty
Stream table updated? | Yes (next refresh) | Yes (next refresh) | Yes (next refresh) | No — stays stale
Recovery              | Automatic          | Automatic          | Automatic          | Manual refresh required
FULL mode affected?   | N/A (recomputes)   | N/A (recomputes)   | N/A (recomputes)   | N/A (recomputes)
Speed                 | O(1) per row       | O(1) per row       | O(1) per row       | O(1) total

Summary

TRUNCATE is the one common DML-like operation that falls outside pg_stream's automatic change tracking. The trigger-based CDC architecture captures INSERT, UPDATE, and DELETE perfectly — but TRUNCATE bypasses row-level triggers by design.

The key takeaways:

  1. TRUNCATE does not fire row-level triggers — the change buffer stays empty
  2. The stream table becomes stale — showing data that no longer exists in the base table
  3. Manual refresh fixes it — SELECT pgstream.refresh_stream_table('name') recomputes from scratch
  4. FULL mode is immune — every refresh recomputes regardless of change tracking
  5. DELETE FROM is the trigger-safe alternative — slower but keeps everything consistent automatically
  6. After TRUNCATE, always refresh — make this a standard part of your ETL workflow

Next in This Series

Frequently Asked Questions


General

What is pg_stream?

pg_stream is a PostgreSQL 18 extension that implements stream tables — declarative, automatically-refreshing materialized views with Differential View Maintenance (DVM). You define a SQL query and a refresh schedule; the extension handles change capture, delta computation, and incremental refresh automatically.

It is inspired by the DBSP differential dataflow framework. See DBSP_COMPARISON.md for a detailed comparison.

How is this different from PostgreSQL materialized views?

Feature             | Materialized Views                 | Stream Tables
--------------------|------------------------------------|---------------------------------------------
Refresh             | Manual (REFRESH MATERIALIZED VIEW) | Automatic (scheduler) or manual
Incremental refresh | Not supported natively             | Built-in differential mode
Change detection    | None — always full recompute       | CDC triggers track row-level changes
Dependency ordering | None                               | DAG-aware topological refresh
Monitoring          | None                               | Built-in views, stats, NOTIFY alerts
Schedule            | None                               | Duration strings (5m) or cron (*/5 * * * *)

What PostgreSQL versions are supported?

PostgreSQL 18.x exclusively. The extension uses features specific to PostgreSQL 18.

Does pg_stream require wal_level = logical?

No. pg_stream uses lightweight row-level triggers for change data capture, not logical replication. You do not need to set wal_level = logical or configure max_replication_slots.

Is pg_stream production-ready?

pg_stream is under active development. It has a comprehensive test suite (700+ unit tests, 290+ end-to-end tests), but users should evaluate it against their specific workloads before deploying to production.


Installation & Setup

How do I install pg_stream?

  1. Add pg_stream to shared_preload_libraries in postgresql.conf:
    shared_preload_libraries = 'pg_stream'
    
  2. Restart PostgreSQL.
  3. Run:
    CREATE EXTENSION pg_stream;
    

See INSTALL.md for platform-specific instructions and pre-built release artifacts.

What are the minimum configuration requirements?

Only shared_preload_libraries = 'pg_stream' is mandatory (requires a restart). All other settings have sensible defaults. max_worker_processes = 8 is recommended.

Can I install pg_stream on a managed PostgreSQL service (RDS, Cloud SQL, etc.)?

It depends on whether the service allows custom extensions and shared_preload_libraries. Since pg_stream does not require wal_level = logical, it avoids one of the most common restrictions on managed services. Check your provider's documentation for custom extension support.

How do I uninstall pg_stream?

  1. Drop all stream tables first (or they will be cascade-dropped):
    SELECT pgstream.drop_stream_table(pgs_name) FROM pgstream.pgs_stream_tables;
    
  2. Drop the extension:
    DROP EXTENSION pg_stream CASCADE;
    
  3. Remove pg_stream from shared_preload_libraries and restart PostgreSQL.

Creating & Managing Stream Tables

How do I create a stream table?

SELECT pgstream.create_stream_table(
    'order_totals',                                           -- name
    'SELECT customer_id, SUM(amount) AS total
     FROM orders GROUP BY customer_id',                       -- defining query
    '5m',                                                     -- refresh schedule
    'DIFFERENTIAL'                                            -- refresh mode
);

What is the difference between FULL and DIFFERENTIAL refresh mode?

  • FULL — Truncates the stream table and re-runs the entire defining query every refresh cycle. Simple but expensive for large result sets.
  • DIFFERENTIAL — Computes only the delta (changes since the last refresh) using the DVM engine and applies it via a MERGE statement. Much faster when only a small fraction of source data changes between refreshes.

When should I use FULL vs. DIFFERENTIAL?

Use DIFFERENTIAL (default) when:

  • Source tables are large and changes between refreshes are small
  • The defining query uses supported operators (most common SQL is supported)

Use FULL when:

  • The defining query uses unsupported aggregates (CORR, COVAR_*, REGR_*)
  • Source tables are small and a full recompute is cheap
  • You see frequent adaptive fallbacks to FULL (check refresh history)

What schedule formats are supported?

Duration strings:

Unit     | Suffix | Example
---------|--------|--------
Seconds  | s      | 30s
Minutes  | m      | 5m
Hours    | h      | 2h
Days     | d      | 1d
Weeks    | w      | 1w
Compound |        | 1h30m

Cron expressions:

Format  | Example          | Description
--------|------------------|--------------------
5-field | */5 * * * *      | Every 5 minutes
Aliases | @hourly, @daily  | Built-in shortcuts

CALCULATED mode: Pass NULL as the schedule to inherit the schedule from downstream dependents.

What is the minimum allowed schedule?

The pg_stream.min_schedule_seconds GUC (default: 60) sets the floor. Schedules shorter than this value are rejected. Set to 1 for development/testing.
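For local testing, the floor can be lowered via ALTER SYSTEM. This assumes pg_stream.min_schedule_seconds is reloadable without a restart; if it is preload-scoped, set it in postgresql.conf and restart instead:

-- Lower the schedule floor to 1 second for development/testing.
ALTER SYSTEM SET pg_stream.min_schedule_seconds = 1;
SELECT pg_reload_conf();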

Can a stream table reference another stream table?

Yes. Stream tables can depend on other stream tables. The scheduler automatically refreshes them in topological order (upstream first). Circular dependencies are detected and rejected at creation time.

-- ST1: aggregates orders
SELECT pgstream.create_stream_table('order_totals',
    'SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id',
    '1m', 'DIFFERENTIAL');

-- ST2: filters ST1
SELECT pgstream.create_stream_table('big_customers',
    'SELECT customer_id, total FROM pgstream.order_totals WHERE total > 1000',
    '1m', 'DIFFERENTIAL');

How do I change a stream table's schedule or mode?

-- Change schedule
SELECT pgstream.alter_stream_table('order_totals', schedule => '10m');

-- Switch refresh mode
SELECT pgstream.alter_stream_table('order_totals', refresh_mode => 'FULL');

-- Suspend
SELECT pgstream.alter_stream_table('order_totals', status => 'SUSPENDED');

-- Resume
SELECT pgstream.alter_stream_table('order_totals', status => 'ACTIVE');

Can I change the defining query of a stream table?

Not directly. You must drop and recreate the stream table:

SELECT pgstream.drop_stream_table('order_totals');
SELECT pgstream.create_stream_table('order_totals', '<new query>', '5m', 'DIFFERENTIAL');

How do I trigger a manual refresh?

SELECT pgstream.refresh_stream_table('order_totals');

This works even when pg_stream.enabled = false (scheduler disabled).


SQL Support

What SQL features are supported in defining queries?

Most common SQL is supported in both FULL and DIFFERENTIAL modes:

  • Table scans, projections, WHERE/HAVING filters
  • INNER, LEFT, RIGHT, FULL OUTER JOIN (including multi-table joins)
  • GROUP BY with 25+ aggregate functions (COUNT, SUM, AVG, MIN, MAX, BOOL_AND/OR, STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG, BIT_AND/OR/XOR, STDDEV, VARIANCE, MODE, PERCENTILE_CONT/DISC, and more)
  • FILTER (WHERE ...) on aggregates
  • DISTINCT
  • Set operations: UNION ALL, UNION, INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL
  • Subqueries: EXISTS, NOT EXISTS, IN (subquery), NOT IN (subquery), scalar subqueries
  • Non-recursive and recursive CTEs
  • Window functions (ROW_NUMBER, RANK, SUM OVER, etc.)
  • LATERAL joins with set-returning functions and correlated subqueries
  • CASE, COALESCE, NULLIF, GREATEST, LEAST, BETWEEN, IS DISTINCT FROM

See DVM Operators for the complete list.

What SQL features are NOT supported?

The following are rejected with clear error messages and suggested rewrites:

Feature                         | Reason                                        | Suggested Rewrite
--------------------------------|-----------------------------------------------|-----------------------------------------------------
TABLESAMPLE                     | Stream tables materialize the full result set | Use WHERE random() < fraction in the consuming query
Window functions in expressions | Cannot be differentially maintained           | Move the window function to a separate column
LIMIT / OFFSET                  | Stream tables materialize the full result set | Apply when querying the stream table
FOR UPDATE / FOR SHARE          | Row-level locking not applicable              | Remove the locking clause

The following were previously rejected but are now supported via automatic parse-time rewrites:

Feature                       | How It Works
------------------------------|--------------------------------------------------------------------------------------
DISTINCT ON (…)               | Auto-rewritten to a ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1 subquery
GROUPING SETS / CUBE / ROLLUP | Auto-rewritten to a UNION ALL of separate GROUP BY queries
NATURAL JOIN                  | Common columns resolved at parse time; explicit equi-join synthesized
ALL (subquery)                | Rewritten to NOT EXISTS with a negated condition (AntiJoin)

Each rejected feature is explained in detail in the Why Are These SQL Features Not Supported? section below.

What happens to ORDER BY in defining queries?

ORDER BY is accepted but silently discarded. Row order in a stream table is undefined (consistent with PostgreSQL's CREATE MATERIALIZED VIEW behavior). Apply ORDER BY when querying the stream table, not in the defining query.

Which aggregates support DIFFERENTIAL mode?

Algebraic (O(changes), fully incremental): COUNT, SUM, AVG

Semi-algebraic (incremental with occasional group rescan): MIN, MAX

Group-rescan (affected groups re-aggregated from source): STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG, BOOL_AND, BOOL_OR, BIT_AND, BIT_OR, BIT_XOR, JSON_OBJECT_AGG, JSONB_OBJECT_AGG, STDDEV, STDDEV_POP, STDDEV_SAMP, VARIANCE, VAR_POP, VAR_SAMP, MODE, PERCENTILE_CONT, PERCENTILE_DISC, CORR, COVAR_POP, COVAR_SAMP, REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY

37 aggregate function variants are supported in total.
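Why AVG is algebraic, in miniature: keep SUM and COUNT as auxiliary state and derive the average after applying the delta, with no source rescan. The table and column names below are invented for illustration — the extension's internal state layout differs:

-- Apply a group delta (+2 rows totalling +120.00) to auxiliary state,
-- then derive the new average. In an UPDATE, right-hand-side column
-- references see the OLD values, so the new average is computed
-- explicitly from the updated expressions.
UPDATE agg_state
SET sum_amount = sum_amount + 120.00,
    row_count  = row_count + 2,
    avg_amount = (sum_amount + 120.00) / (row_count + 2)
WHERE group_key = 'alice';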


Change Data Capture (CDC)

How does pg_stream capture changes to source tables?

pg_stream installs AFTER INSERT/UPDATE/DELETE row-level PL/pgSQL triggers on each source table. These triggers write change records (action, old/new row data as JSONB, LSN, transaction ID) into per-source buffer tables in the pgstream_changes schema.

What is the overhead of CDC triggers?

Approximately 20–55 μs per row (PL/pgSQL dispatch + row_to_json() + buffer INSERT). At typical write rates (<1000 writes/sec per source table), this adds less than 5% DML latency overhead.

What happens when I TRUNCATE a source table?

TRUNCATE is now captured via a statement-level AFTER TRUNCATE trigger that writes a T marker row to the change buffer. When the differential refresh engine detects this marker, it automatically falls back to a full refresh for that cycle, ensuring the stream table stays consistent.

Previously, TRUNCATE bypassed row-level triggers entirely. This is no longer a concern — both FULL and DIFFERENTIAL mode stream tables handle TRUNCATE correctly.

Are CDC triggers automatically cleaned up?

Yes. When the last stream table referencing a source is dropped, the trigger and its associated change buffer table are automatically removed.

What happens if a source table is dropped or altered?

pg_stream has DDL event triggers that detect ALTER TABLE and DROP TABLE on source tables. When detected:

  • Affected stream tables are marked with needs_reinit = true
  • The next refresh cycle performs a full reinitialization (drops and recreates the storage table)
  • A reinitialize_needed NOTIFY alert is sent

How do I check if a source table has switched from trigger-based CDC to WAL-based CDC?

When you enable hybrid CDC (pg_stream.cdc_mode = 'auto'), pg_stream starts capturing changes with triggers and can automatically transition to WAL-based logical replication once conditions are met. There are several ways to check the current CDC mode for each source table:

1. Query the dependency catalog directly:

SELECT d.source_relid, c.relname AS source_table, d.cdc_mode,
       d.slot_name, d.decoder_confirmed_lsn, d.transition_started_at
FROM pgstream.pgs_dependencies d
JOIN pg_class c ON c.oid = d.source_relid;

The cdc_mode column shows one of three values:

  • TRIGGER — changes are captured via row-level triggers (the default)
  • TRANSITIONING — the system is in the process of switching from triggers to WAL
  • WAL — changes are captured via logical replication

2. Use the built-in health check function:

SELECT source_table, cdc_mode, slot_name, lag_bytes, alert
FROM pgstream.check_cdc_health();

This returns a row per source table with the current mode, replication slot lag (for WAL-mode sources), and any alert conditions such as slot_lag_exceeds_threshold or replication_slot_missing.

3. Listen for real-time transition notifications:

LISTEN pg_stream_cdc_transition;

pg_stream sends a NOTIFY with a JSON payload whenever a transition starts, completes, or is rolled back. Example payload:

{
  "event": "transition_complete",
  "source_table": "public.orders",
  "old_mode": "TRANSITIONING",
  "new_mode": "WAL",
  "slot_name": "pg_stream_slot_16384"
}

This lets you integrate CDC mode changes into your monitoring stack without polling.

4. Check the global GUC setting:

SHOW pg_stream.cdc_mode;

This shows the desired global behavior (trigger, auto, or wal), not the per-table actual state. The per-table state lives in pgs_dependencies.cdc_mode as described above.

See CONFIGURATION.md for details on the pg_stream.cdc_mode and pg_stream.wal_transition_timeout GUCs.

Is it safe to add triggers to a stream table while the source table is switching CDC modes?

Yes, this is completely safe. CDC mode transitions and user-defined triggers operate on different tables and do not interfere with each other:

  • CDC transitions affect how changes are captured from source tables (e.g., orders). The transition switches the capture mechanism from row-level triggers on the source table to WAL-based logical replication.
  • User-defined triggers live on stream tables (e.g., order_totals) and control how the refresh engine applies changes to the materialized output.

Because these are independent concerns, you can freely add, modify, or remove triggers on a stream table at any point — including during an active CDC transition on its source tables.

How it works in practice:

  1. The refresh engine checks for user-defined triggers on the stream table at the start of each refresh cycle (via a fast pg_trigger lookup, <0.1 ms).
  2. If user triggers are detected, the engine uses explicit DELETE / UPDATE / INSERT statements instead of MERGE, so your triggers fire with correct TG_OP, OLD, and NEW values.
  3. The change data consumed by the refresh engine has the same format regardless of whether it came from CDC triggers or WAL decoding — so the trigger detection and the CDC mode are fully decoupled.

A trigger added between two refresh cycles will simply be picked up on the next cycle. The only (theoretical) edge case is adding a trigger in the tiny window during a single refresh transaction, between the trigger-detection check and the MERGE execution — but since both happen within the same transaction, this is virtually impossible in practice.


Performance & Tuning

How do I tune the scheduler interval?

The pg_stream.scheduler_interval_ms GUC controls how often the scheduler checks for stale stream tables (default: 1000 ms).

Workload                                | Recommended Value
----------------------------------------|------------------
Low-latency (near real-time)            | 100–500
Standard                                | 1000 (default)
Low-overhead (many STs, long schedules) | 5000–10000

What is the adaptive fallback to FULL?

When the number of pending changes exceeds pg_stream.differential_max_change_ratio (default: 15%) of the source table size, DIFFERENTIAL mode automatically falls back to FULL for that refresh cycle. This prevents pathological delta queries on bulk changes.

  • Set to 0.0 to always use DIFFERENTIAL (even on large change sets)
  • Set to 1.0 to effectively always use FULL
  • Default 0.15 (15%) is a good balance
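The decision reduces to simple arithmetic. For example, 200,000 pending changes against a roughly 1,000,000-row source is a 20% change ratio, which exceeds the 15% default (self-contained illustration, not the internal code):

SELECT CASE
         WHEN 200000::numeric / 1000000 > 0.15 THEN 'FULL'
         ELSE 'DIFFERENTIAL'
       END AS chosen_mode;  -- FULL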

How many concurrent refreshes can run?

Controlled by pg_stream.max_concurrent_refreshes (default: 4, range: 1–32). Each concurrent refresh uses a background worker. Increase this if you have many stream tables and available CPU/IO.

How do I check if my stream tables are keeping up?

-- Quick overview
SELECT pgs_name, status, staleness, stale
FROM pgstream.stream_tables_info;

-- Detailed statistics
SELECT pgs_name, total_refreshes, avg_duration_ms, consecutive_errors, stale
FROM pgstream.pg_stat_stream_tables;

-- Recent refresh history for a specific ST
SELECT * FROM pgstream.get_refresh_history('order_totals', 10);

What is __pgs_row_id?

Every stream table has a __pgs_row_id BIGINT PRIMARY KEY column. It stores a 64-bit xxHash of the row's group-by key (or all columns for non-aggregate queries). The refresh engine uses it for delta MERGE operations (matching DELETEs and INSERTs by row ID).

You should ignore this column in your queries. It is an implementation detail.
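The exact hash is internal (xxHash), but PostgreSQL's built-in hashtextextended() gives a feel for what a 64-bit key hash looks like — an illustration only, not how pg_stream computes __pgs_row_id:

-- hashtextextended(text, seed) is a stock PostgreSQL function
-- returning a bigint hash of its input.
SELECT hashtextextended('alice', 0) AS example_64bit_key_hash;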


Interoperability

Can PostgreSQL views reference stream tables?

Yes. Stream tables are standard heap tables. Views work normally and reflect data as of the most recent refresh.

Can materialized views reference stream tables?

Yes, though it is somewhat redundant (both are physical snapshots). The materialized view requires its own REFRESH MATERIALIZED VIEW — it does not auto-refresh when the stream table refreshes.
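A minimal sketch (order_totals and its columns are hypothetical):

```sql
-- Snapshot a stream table into a materialized view
CREATE MATERIALIZED VIEW order_totals_snapshot AS
SELECT region, total FROM pgstream.order_totals;

-- Must be refreshed explicitly; it does not follow the stream table's schedule
REFRESH MATERIALIZED VIEW order_totals_snapshot;
```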

Can I replicate stream tables with logical replication?

Yes. Stream tables can be published like any ordinary table:

CREATE PUBLICATION my_pub FOR TABLE pgstream.order_totals;

Important caveats:

  • The __pgs_row_id column is replicated (it is the primary key)
  • Subscribers receive materialized data, not the defining query
  • Do not install pg_stream on the subscriber and attempt to refresh the replicated table — it will have no CDC triggers or catalog entries
  • Internal change buffer tables are not published by default
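A sketch of the subscriber side (connection details are illustrative; the subscriber needs a table with a matching schema, including the __pgs_row_id primary key):

```sql
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary dbname=app user=repl'
    PUBLICATION my_pub;
```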

Can I INSERT, UPDATE, or DELETE rows in a stream table directly?

No. Stream table contents are managed exclusively by the refresh engine. Direct DML will corrupt the internal state.

Can I add foreign keys to or from stream tables?

No. The refresh engine uses bulk MERGE operations that do not respect foreign key ordering. Foreign key constraints on stream tables are not supported.

Can I add my own triggers to stream tables?

Yes, for DIFFERENTIAL mode stream tables. When user-defined row-level triggers are detected (or pg_stream.user_triggers = 'on'), the refresh engine automatically switches from MERGE to explicit DELETE + UPDATE + INSERT statements. This ensures triggers fire with the correct TG_OP, OLD, and NEW values.

Limitations:

  • Row-level triggers do not fire during FULL refresh (they are automatically suppressed via DISABLE TRIGGER USER). Use REFRESH MODE DIFFERENTIAL for stream tables with triggers.
  • The IS DISTINCT FROM guard prevents no-op UPDATE triggers when the aggregate result is unchanged.
  • BEFORE triggers that modify NEW will affect the stored value — the next refresh may "correct" it back, causing oscillation.

See the pg_stream.user_triggers GUC in CONFIGURATION.md for control options.
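A minimal sketch of an audit trigger on a DIFFERENTIAL stream table (all table, function, and column names other than __pgs_row_id are hypothetical):

```sql
-- Audit table for changes flowing through the stream table
CREATE TABLE order_totals_audit (op text, row_id bigint, at timestamptz);

CREATE FUNCTION log_order_totals() RETURNS trigger AS $$
BEGIN
    INSERT INTO order_totals_audit
    VALUES (TG_OP,
            CASE WHEN TG_OP = 'DELETE' THEN OLD.__pgs_row_id
                 ELSE NEW.__pgs_row_id END,
            now());
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_totals_log
AFTER INSERT OR UPDATE OR DELETE ON pgstream.order_totals
FOR EACH ROW EXECUTE FUNCTION log_order_totals();
```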

Can I ALTER TABLE a stream table directly?

No. Use pgstream.alter_stream_table() to modify schedule, refresh mode, or status. To change the defining query, drop and recreate the stream table.


Monitoring & Alerting

What monitoring views are available?

| View | Description |
|---|---|
| pgstream.stream_tables_info | Status overview with computed staleness |
| pgstream.pg_stat_stream_tables | Comprehensive stats (refresh counts, avg duration, error streaks) |

How do I get alerted when something goes wrong?

pg_stream sends PostgreSQL NOTIFY messages on the pg_stream_alert channel with JSON payloads:

| Event | When |
|---|---|
| stale_data | Staleness exceeds 2× the schedule |
| auto_suspended | Stream table suspended after max consecutive errors |
| reinitialize_needed | Upstream DDL change detected |
| buffer_growth_warning | Change buffer growing unexpectedly |
| refresh_completed | Refresh completed successfully |
| refresh_failed | Refresh failed |

Listen with:

LISTEN pg_stream_alert;

What happens when a stream table keeps failing?

After pg_stream.max_consecutive_errors (default: 3) consecutive failures, the stream table moves to ERROR status and automatic refreshes stop. An auto_suspended NOTIFY alert is sent.

To recover:

-- Fix the underlying issue (e.g., restore a dropped source table), then:
SELECT pgstream.alter_stream_table('my_table', status => 'ACTIVE');

Retries use exponential backoff (base 1s, max 60s, ±25% jitter, up to 5 retries before counting as a real failure).


Configuration Reference

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_stream.enabled | bool | true | Enable/disable the scheduler. Manual refreshes still work when false. |
| pg_stream.scheduler_interval_ms | int | 1000 | Scheduler wake interval in milliseconds (100–60000) |
| pg_stream.min_schedule_seconds | int | 60 | Minimum allowed schedule duration (1–86400) |
| pg_stream.max_consecutive_errors | int | 3 | Failures before auto-suspending (1–100) |
| pg_stream.change_buffer_schema | text | pgstream_changes | Schema for CDC buffer tables |
| pg_stream.max_concurrent_refreshes | int | 4 | Max parallel refresh workers (1–32) |
| pg_stream.user_triggers | text | auto | User trigger handling: auto (detect), on (always explicit DML), off (suppress) |
| pg_stream.differential_max_change_ratio | float | 0.15 | Change ratio threshold for adaptive FULL fallback (0.0–1.0) |
| pg_stream.cleanup_use_truncate | bool | true | Use TRUNCATE instead of DELETE for buffer cleanup |

All GUCs are SUSET context (superuser SET) and take effect without a restart; only adding pg_stream to shared_preload_libraries requires a PostgreSQL restart.


Troubleshooting

Unit tests crash with symbol not found in flat namespace on macOS 26+

macOS 26 (Tahoe) changed the dynamic linker (dyld) to eagerly resolve all flat-namespace symbols at binary load time. pgrx extensions link PostgreSQL server symbols (e.g. CacheMemoryContext, SPI_connect) with -Wl,-undefined,dynamic_lookup, which previously resolved lazily. Since cargo test --lib runs outside the postgres process, those symbols are missing and the test binary aborts:

dyld[66617]: symbol not found in flat namespace '_CacheMemoryContext'

Use just test-unit — it automatically detects macOS 26+ and injects a stub library (libpg_stub.dylib) via DYLD_INSERT_LIBRARIES. The stub provides NULL/no-op definitions for the ~28 PostgreSQL symbols; they are never called during unit tests (pure Rust logic only).

This does not affect integration tests, E2E tests, just lint, just build, or the extension running inside PostgreSQL.

See the Installation Guide for details and manual usage.

My stream table is stuck in INITIALIZING status

The initial full refresh may have failed. Check:

SELECT * FROM pgstream.get_refresh_history('my_table', 5);

If the error is transient, retry with:

SELECT pgstream.refresh_stream_table('my_table');

My stream table shows stale data but the scheduler is running

Common causes:

  1. TRUNCATE on source table — bypasses CDC triggers. Manual refresh needed.
  2. Too many errors — check consecutive_errors in pgstream.pg_stat_stream_tables. Resume with ALTER ... status => 'ACTIVE'.
  3. Long-running refresh — check for lock contention or slow defining queries.
  4. Scheduler disabled — verify SHOW pg_stream.enabled; returns on.
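A quick triage query using the monitoring views described earlier:

```sql
-- Which stream tables are stale, and are errors accumulating?
SELECT pgs_name, stale, consecutive_errors
FROM pgstream.pg_stat_stream_tables
WHERE stale
ORDER BY consecutive_errors DESC;
```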

I get "cycle detected" when creating a stream table

Stream tables cannot have circular dependencies. If ST-A depends on ST-B and ST-B depends on ST-A (directly or transitively), creation is rejected. Restructure your queries to break the cycle.

A source table was altered and my stream table stopped refreshing

pg_stream detects DDL changes via event triggers and marks affected stream tables with needs_reinit = true. The next scheduler cycle will reinitialize (full drop + recreate of storage) the stream table automatically. If the schema change breaks the defining query, the reinitialization will fail — check refresh history for the error and recreate the stream table with an updated query.

How do I see the delta query generated for a stream table?

SELECT pgstream.explain_st('order_totals');

This shows the DVM operator tree, source tables, and the generated delta SQL.


Why Are These SQL Features Not Supported?

This section gives detailed technical explanations for each SQL limitation. pg_stream follows the principle of "fail loudly rather than produce wrong data" — every unsupported feature is detected at stream-table creation time and rejected with a clear error message and a suggested rewrite.

How does NATURAL JOIN work?

NATURAL JOIN is now fully supported. At parse time, pg_stream resolves the common columns between the two tables (using OpTree::output_columns()) and synthesizes explicit equi-join conditions. This supports INNER, LEFT, RIGHT, and FULL NATURAL JOIN variants.

Internally, NATURAL JOIN is converted to an explicit JOIN ... ON before the DVM engine builds its operator tree, so delta computation works identically to a manually specified equi-join.

Note: The internal __pgs_row_id column is excluded from common column resolution, so NATURAL JOINs between stream tables work correctly.
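Conceptually, assuming customer_id is the only column name the two tables share (the synthesized condition shown here is illustrative, not literal pg_stream output):

```sql
-- This defining query:
SELECT * FROM orders NATURAL JOIN customers

-- behaves like an explicit equi-join:
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
```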

How do GROUPING SETS, CUBE, and ROLLUP work?

GROUPING SETS, CUBE, and ROLLUP are now fully supported via an automatic parse-time rewrite. pg_stream decomposes these constructs into a UNION ALL of separate GROUP BY queries before the DVM engine processes the query.

Explosion guard: CUBE(N) generates $2^N$ branches. pg_stream rejects CUBE/ROLLUP combinations that would produce more than 64 branches to prevent runaway memory usage. Use explicit GROUPING SETS(...) instead.

For example:

-- This defining query:
SELECT dept, region, SUM(amount) FROM sales GROUP BY CUBE(dept, region)

-- Is automatically rewritten to:
SELECT dept, region, SUM(amount) FROM sales GROUP BY dept, region
UNION ALL
SELECT dept, NULL::text, SUM(amount) FROM sales GROUP BY dept
UNION ALL
SELECT NULL::text, region, SUM(amount) FROM sales GROUP BY region
UNION ALL
SELECT NULL::text, NULL::text, SUM(amount) FROM sales

GROUPING() function calls are replaced with integer literal constants corresponding to the grouping level. The rewrite is transparent — the DVM engine sees only standard GROUP BY + UNION ALL operators and can apply incremental delta computation to each branch independently.

How does DISTINCT ON (…) work?

DISTINCT ON is now fully supported via an automatic parse-time rewrite. pg_stream transparently transforms DISTINCT ON into a ROW_NUMBER() window function subquery:

-- This defining query:
SELECT DISTINCT ON (dept) dept, employee, salary
FROM employees ORDER BY dept, salary DESC

-- Is automatically rewritten to:
SELECT dept, employee, salary FROM (
    SELECT dept, employee, salary,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
) sub WHERE rn = 1

The rewrite happens before DVM parsing, so the operator tree sees a standard window function query and can apply partition-based recomputation for incremental delta maintenance.

Why is TABLESAMPLE rejected?

TABLESAMPLE returns a random subset of rows from a table (e.g., FROM orders TABLESAMPLE BERNOULLI(10) gives ~10% of rows).

Stream tables materialize the complete result set of the defining query and keep it up-to-date across refreshes. Baking a random sample into the defining query is not meaningful because:

  1. Non-determinism. Each refresh would sample different rows, making the stream table contents unstable and unpredictable. The delta between refreshes would be dominated by sampling noise, not actual data changes.

  2. CDC incompatibility. The trigger-based change-capture system tracks specific row-level changes (inserts, updates, deletes). A TABLESAMPLE defining query has no stable row identity — the "changed rows" concept doesn't apply when the entire sample shifts each cycle.

Rewrite:

-- Instead of sampling in the defining query:
SELECT * FROM orders TABLESAMPLE BERNOULLI(10)

-- Materialize the full result and sample when querying:
SELECT * FROM order_stream_table WHERE random() < 0.1

Why is LIMIT / OFFSET rejected?

Stream tables materialize the complete result set and keep it synchronized with source data. LIMIT/OFFSET would truncate the result:

  1. Undefined ordering. LIMIT without ORDER BY returns an arbitrary subset. Even with ORDER BY, stream tables discard ordering — the "top N" rows concept doesn't apply to a set-based materialized result.

  2. Delta instability. When source rows change, the boundary between "in the LIMIT" and "out of the LIMIT" shifts. A single INSERT could evict one row and admit another, requiring the refresh to track the full ordered position of every row — essentially a full rescan.

  3. Semantic mismatch. Users who write LIMIT 100 typically want to limit what they read, not what is stored. Since stream tables are queried separately from their definition, the LIMIT belongs in the consuming query.

Rewrite:

-- Instead of:
'SELECT * FROM orders ORDER BY created_at DESC LIMIT 100'

-- Omit LIMIT from the defining query, apply when reading:
SELECT * FROM orders_stream_table ORDER BY created_at DESC LIMIT 100

Why are window functions in expressions rejected?

Window functions like ROW_NUMBER() OVER (…) are supported as standalone columns in stream tables. However, embedding a window function inside an expression — such as CASE WHEN ROW_NUMBER() OVER (...) = 1 THEN ... or SUM(x) OVER (...) + 1 — is rejected.

This restriction exists because:

  1. Partition-based recomputation. pg_stream's differential mode handles window functions by recomputing entire partitions that were affected by changes. When a window function is buried inside an expression, the DVM engine cannot isolate the window computation from the surrounding expression, making it impossible to correctly identify which partitions to recompute.

  2. Expression tree ambiguity. The DVM parser would need to differentiate the outer expression (arithmetic, CASE, etc.) while treating the inner window function specially. This creates a combinatorial explosion of differentiation rules for every possible expression type × window function combination.

Rewrite:

-- Instead of:
SELECT id, CASE WHEN ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
                THEN 'top' ELSE 'other' END AS rank_label
FROM employees

-- Move window function to a separate column, then use a wrapping stream table:
-- ST1:
SELECT id, dept, salary,
       ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees

-- ST2 (references ST1):
SELECT id, CASE WHEN rn = 1 THEN 'top' ELSE 'other' END AS rank_label
FROM pgstream.employees_ranked

Why is FOR UPDATE / FOR SHARE rejected?

FOR UPDATE and related locking clauses (FOR SHARE, FOR NO KEY UPDATE, FOR KEY SHARE) acquire row-level locks on selected rows. This is incompatible with stream tables because:

  1. Refresh semantics. Stream table contents are managed by the refresh engine using bulk MERGE operations. Row-level locks taken during the defining query would conflict with the refresh engine's own locking strategy.

  2. No direct DML. Since users cannot directly modify stream table rows, there is no use case for locking rows inside the defining query. The locks would be held for the duration of the refresh transaction and then released, serving no purpose.

Why is ALL (subquery) not supported?

ALL (subquery) compares a value against every row returned by a subquery (e.g., WHERE x > ALL (SELECT y FROM t)). It is rejected because:

  1. A simple manual rewrite exists. x > ALL (SELECT y FROM t) is logically equivalent to NOT EXISTS (SELECT 1 FROM t WHERE y >= x), which pg_stream handles via its anti-join operator — so rejecting the construct costs users only a mechanical rewrite.

  2. Rare usage. ALL (subquery) is uncommon in analytical queries. Supporting it directly would add operator complexity for minimal benefit.

Rewrite:

-- Instead of:
WHERE amount > ALL (SELECT threshold FROM limits)

-- Use NOT EXISTS:
WHERE NOT EXISTS (SELECT 1 FROM limits WHERE threshold >= amount)

Why is ORDER BY silently discarded?

ORDER BY in the defining query is accepted but ignored. This is consistent with how PostgreSQL treats CREATE MATERIALIZED VIEW AS SELECT ... ORDER BY ... — the ordering is not preserved in the stored data.

Stream tables are heap tables with no guaranteed row order. The ORDER BY in the defining query would only affect the order of the initial INSERT, which has no lasting effect. Apply ordering when querying the stream table:

-- This ORDER BY is meaningless in the defining query:
'SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC'

-- Instead, order when reading:
SELECT * FROM regional_totals ORDER BY total DESC

Why are unsupported aggregates (CORR, COVAR_*, REGR_*) limited to FULL mode?

Regression aggregates like CORR, COVAR_POP, COVAR_SAMP, and the REGR_* family require maintaining running sums of products and squares across the entire group. Unlike COUNT/SUM/AVG (where deltas can be computed from the change alone) or group-rescan aggregates (where only affected groups are re-read), regression aggregates:

  1. Lack algebraic delta rules. There is no closed-form way to update a correlation coefficient from a single row change without access to the full group's data.

  2. Would degrade to group-rescan anyway. Even if supported, the implementation would need to rescan the full group from source — identical to FULL mode for most practical group sizes.

These aggregates work fine in FULL refresh mode, which re-runs the entire query from scratch each cycle.
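A sketch of creating such a stream table (the named argument for choosing the refresh mode is an assumption here, not confirmed by this document):

```sql
-- Regression aggregates are fine under FULL refresh
SELECT pgstream.create_stream_table(
    'price_qty_corr',
    'SELECT region, CORR(price, quantity) AS c FROM sales GROUP BY region',
    '5m',
    refresh_mode => 'FULL'  -- assumed parameter name
);
```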


Why Are These Stream Table Operations Restricted?

Stream tables are regular PostgreSQL heap tables under the hood, but their contents are managed exclusively by the refresh engine. This section explains why certain operations that work on ordinary tables are disallowed or unsupported on stream tables.

Why can't I INSERT, UPDATE, or DELETE rows in a stream table?

Stream table contents are the output of the refresh engine — they represent the materialized result of the defining query at a specific point in time. Direct DML would corrupt this contract in several ways:

  1. Row ID integrity. Every row has a __pgs_row_id (a 64-bit xxHash of the group-by key or all columns). The refresh engine uses this for delta MERGE — matching incoming deltas against existing rows. A manually inserted row with an incorrect or duplicate __pgs_row_id would cause the next differential refresh to produce wrong results (double-counting, missed deletes, or merge conflicts).

  2. Frontier inconsistency. Each refresh records a frontier — a set of per-source LSN positions that represent "data up to this point has been materialized." A manual DML change is not tracked by any frontier. The next differential refresh would either overwrite the change (if the delta touches the same row) or leave the stream table in a state that doesn't match any consistent point-in-time snapshot of the source data.

  3. Change buffer desync. The CDC triggers on source tables write changes to buffer tables. The refresh engine reads these buffers and advances the frontier. Manual DML on the stream table bypasses this pipeline entirely — the buffer and frontier have no record of the change, so future refreshes cannot account for it.

If you need to post-process stream table data, create a view or a second stream table that references the first one.
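For example (names hypothetical):

```sql
-- Post-process by layering a second stream table on the first
SELECT pgstream.create_stream_table(
    'order_totals_eur',
    'SELECT region, total * 0.92 AS total_eur FROM pgstream.order_totals',
    '1m'
);
```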

Why can't I add foreign keys to or from a stream table?

Foreign key constraints require that referenced/referencing rows exist at the time of each DML statement. The refresh engine violates this assumption:

  1. Bulk MERGE ordering. A differential refresh executes a single MERGE INTO statement that applies all deltas (inserts and deletes) atomically. PostgreSQL evaluates FK constraints row-by-row within this MERGE. If a parent row is deleted and a new parent inserted in the same delta batch, the child FK check may fail because it sees the delete before the insert — even though the final state would be consistent.

  2. Full refresh uses TRUNCATE + INSERT. In FULL mode, the refresh engine truncates the stream table and re-inserts all rows. TRUNCATE does not fire individual DELETE triggers and bypasses FK cascade logic, which would leave referencing tables with dangling references.

  3. Cross-table refresh ordering. If stream table A has an FK referencing stream table B, both tables refresh independently (in topological order, but in separate transactions). There is no guarantee that A's refresh sees B's latest data — the FK constraint could transiently fail between refreshes.

Workaround: Enforce referential integrity in the consuming application or use a view that joins the stream tables and validates the relationship.
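A sketch of the view-based approach (table and column names hypothetical):

```sql
-- Surface referential-integrity violations instead of enforcing an FK
CREATE VIEW order_totals_checked AS
SELECT o.*, (c.customer_id IS NOT NULL) AS customer_exists
FROM pgstream.order_totals o
LEFT JOIN pgstream.customer_dims c USING (customer_id);
```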

How do user-defined triggers work on stream tables?

When a DIFFERENTIAL mode stream table has user-defined row-level triggers (or pg_stream.user_triggers = 'on'), the refresh engine uses explicit DML decomposition instead of MERGE:

  1. Delta materialized once. The delta query result is stored in a temporary table (__pgs_delta_<id>) to avoid evaluating it three times.

  2. DELETE removed rows. Rows in the stream table whose __pgs_row_id is absent from the delta are deleted. AFTER DELETE triggers fire with correct OLD values.

  3. UPDATE changed rows. Rows whose __pgs_row_id exists in both the stream table and delta but whose values differ (checked via IS DISTINCT FROM) are updated. AFTER UPDATE triggers fire with correct OLD and NEW. No-op updates (where values are identical) are skipped, preventing spurious triggers.

  4. INSERT new rows. Rows in the delta whose __pgs_row_id is absent from the stream table are inserted. AFTER INSERT triggers fire with correct NEW values.
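The four steps above correspond roughly to the following SQL (a sketch with hypothetical names — not pg_stream's literal generated statements):

```sql
-- 1. Materialize the delta once
CREATE TEMP TABLE __pgs_delta_tmp AS
SELECT __pgs_row_id, total FROM delta_query_result;

-- 2. DELETE rows absent from the delta (AFTER DELETE triggers fire)
DELETE FROM st
WHERE __pgs_row_id NOT IN (SELECT __pgs_row_id FROM __pgs_delta_tmp);

-- 3. UPDATE only rows whose values actually changed
UPDATE st SET total = d.total
FROM __pgs_delta_tmp d
WHERE st.__pgs_row_id = d.__pgs_row_id
  AND st.total IS DISTINCT FROM d.total;

-- 4. INSERT rows new to the stream table
INSERT INTO st (__pgs_row_id, total)
SELECT d.__pgs_row_id, d.total
FROM __pgs_delta_tmp d
WHERE NOT EXISTS (SELECT 1 FROM st WHERE st.__pgs_row_id = d.__pgs_row_id);
```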

FULL refresh behavior: Row-level user triggers are automatically suppressed during FULL refresh via DISABLE TRIGGER USER / ENABLE TRIGGER USER. A NOTIFY pgstream_refresh is emitted so listeners know a FULL refresh occurred. Use REFRESH MODE DIFFERENTIAL for stream tables that need per-row trigger semantics.

Performance: The explicit DML path adds ~25–60% overhead compared to MERGE for triggered stream tables. Stream tables without user triggers have zero overhead (only a fast pg_trigger check, <0.1 ms).

Control: The pg_stream.user_triggers GUC controls this behavior:

  • auto (default): detect user triggers automatically
  • on: always use explicit DML (useful for testing)
  • off: always use MERGE, suppressing triggers

Why can't I ALTER TABLE a stream table directly?

Stream table metadata (defining query, schedule, refresh mode) is stored in the pg_stream catalog (pgstream.pgs_stream_tables). A direct ALTER TABLE would change the physical table without updating the catalog, causing:

  1. Column mismatch. If you add or remove columns, the refresh engine's cached delta query and MERGE statement would reference columns that no longer exist (or miss new ones), causing runtime errors.

  2. __pgs_row_id invalidation. The row ID hash is computed from the defining query's output columns. Altering the table schema without updating the defining query would make existing row IDs inconsistent with the new column set.

Use pgstream.alter_stream_table() to change schedule, refresh mode, or status. To change the defining query or column structure, drop and recreate the stream table.

Why can't I TRUNCATE a stream table?

TRUNCATE removes all rows instantly but does not update the pg_stream frontier or change buffers. After a TRUNCATE:

  1. Differential refresh sees no changes. The frontier still records the last-processed LSN. If no source changes have arrived since, the next differential refresh produces an empty delta — leaving the stream table empty even though the source still has data.

  2. No recovery path for differential mode. The refresh engine has no way to detect that the stream table was externally truncated. It assumes the current contents match the frontier.

Use pgstream.refresh_stream_table('my_table') to force a full re-materialization, or drop and recreate the stream table if you need a clean slate.

Changelog

All notable changes to pg_stream are documented in this file.

The format is based on Keep a Changelog. For future plans and release milestones, see ROADMAP.md.


[Unreleased]


[0.1.1] — 2026-02-26

Changed

CloudNativePG Image Volume Extension Distribution

  • Extension-only OCI image — replaced the full PostgreSQL Docker image (ghcr.io/<owner>/pg_stream) with a minimal scratch-based extension image (ghcr.io/<owner>/pg_stream-ext) following the CNPG Image Volume Extensions specification. The image contains only .so, .control, and .sql files (< 10 MB vs ~400 MB for the old full image).
  • New cnpg/Dockerfile.ext — release Dockerfile for packaging pre-built artifacts into the scratch-based extension image.
  • New cnpg/Dockerfile.ext-build — multi-stage from-source build for local development and CI.
  • New cnpg/database-example.yaml — CNPG Database resource for declarative CREATE EXTENSION pg_stream (replaces postInitSQL).
  • Updated cnpg/cluster-example.yaml — uses official CNPG PostgreSQL 18 operand image with .spec.postgresql.extensions for Image Volume mounting.
  • Removed cnpg/Dockerfile and cnpg/Dockerfile.release — the old full PostgreSQL images are no longer built or published.
  • Updated release workflow — publishes multi-arch (amd64/arm64) extension image to GHCR with layout verification and SQL smoke test.
  • Updated CI CNPG smoke test — uses transitional composite image approach until kind supports Kubernetes 1.33 with ImageVolume feature gate.

[0.1.0] — 2026-02-26

Fixed

WAL Decoder pgoutput Action Parsing (F4 / G2.3)

  • Positional action parsing — parse_pgoutput_action() previously used data.contains("INSERT:") etc., which would misclassify events when a schema name, table name, or column value contained an action keyword (e.g., a table named INSERT_LOG or a text value "DELETE: old row"). Replaced with positional parsing: strip the "table " prefix, skip schema.table: , then match the action keyword before the next :.
  • 3 new unit tests covering the edge cases.

Added

CUBE / ROLLUP Combinatorial Explosion Guard (F14 / G5.2)

  • Branch limit guard — CUBE over N columns generates $2^N$ UNION ALL branches. Large CUBEs would silently produce memory-exhausting query trees. rewrite_grouping_sets() now rejects CUBE/ROLLUP combinations that would expand beyond 64 branches, emitting a clear error that directs users to explicit GROUPING SETS(...).

Documentation: Known Delta Computation Limitations (F7 / F11)

  • JOIN key change + simultaneous right-side delete — documented in docs/SQL_REFERENCE.md § "Known Delta Computation Limitations" with a concrete SQL example, root-cause explanation, and three mitigations (adaptive FULL fallback, staggered changes, FULL mode).
  • Keyless table duplicate-row limitation — the "Tables Without Primary Keys" section now includes a > Limitation callout explaining that rows with identical content produce the same content hash, causing INSERT deduplication and ambiguous DELETE matching. Recommends adding a surrogate PK or UNIQUE constraint.
  • SQL-standard JSON aggregate recognition — JSON_ARRAYAGG(expr ...) and JSON_OBJECTAGG(key: value ...) are now recognized as first-class DVM aggregates with the group-rescan strategy. Previously treated as opaque raw expressions, they now work correctly in DIFFERENTIAL mode.
  • Two new AggFunc variants: JsonObjectAggStd(String), JsonArrayAggStd(String). The carried String preserves the full deparsed SQL since the special key: value, ABSENT ON NULL, ORDER BY, and RETURNING clauses differ from regular function syntax.
  • 4 E2E tests covering both FULL and DIFFERENTIAL modes.

JSON_TABLE Support (F12)

  • JSON_TABLE() in FROM clause — PostgreSQL 17+ JSON_TABLE(expr, path COLUMNS (...)) is now supported. Deparsed with full syntax including PASSING clauses, regular/EXISTS/formatted/nested columns, and ON ERROR/ON EMPTY behaviors. Modeled internally as LateralFunction.
  • 2 E2E tests (FULL and DIFFERENTIAL modes).

Operator Volatility Checking (F16)

  • Operator volatility checking — custom operators backed by volatile functions are now detected and rejected in DIFFERENTIAL mode. The check resolves each operator's underlying function through pg_operator and pg_proc to determine its volatility. This completes the volatility coverage (G7.2) started with function volatility detection.
  • 3 unit tests and 2 E2E tests.

Cross-Session Cache Invalidation (F17)

  • Cross-session cache invalidation — a shared-memory atomic counter (CACHE_GENERATION) ensures that when one backend alters a stream table, drops a stream table, or triggers a DDL hook, all other backends automatically flush their delta template and MERGE template caches on the next refresh cycle. Previously, cached templates could become stale in multi-backend deployments.
  • Thread-local generation tracking in both dvm/mod.rs (delta cache) and refresh.rs (MERGE cache + prepared statements).

Function/Operator DDL Tracking (F18)

  • Function DDL tracking — CREATE OR REPLACE FUNCTION and ALTER FUNCTION on functions referenced by stream table defining queries now trigger reinit of affected STs. DROP FUNCTION also marks affected STs for reinit.
  • functions_used catalog column — new TEXT[] column in pgstream.pgs_stream_tables stores all function names used by the defining query (extracted from the parsed OpTree at creation time). DDL hooks query this column to find affected STs.
  • 2 E2E tests and 5 unit tests for function name extraction.

View Inlining (G2.1)

  • View inlining auto-rewrite — views referenced in defining queries are transparently replaced with their underlying SELECT definition as inline subqueries. CDC triggers land on base tables, so DIFFERENTIAL mode works correctly with views. Nested views (view → view → table) are fully expanded via a fixpoint loop (max depth 10).
  • Materialized view rejection — materialized views (relkind = 'm') are rejected with a clear error in DIFFERENTIAL mode. FULL mode allows them.
  • Foreign table rejection — foreign tables (relkind = 'f') are rejected in DIFFERENTIAL mode (row-level triggers cannot be created on foreign tables).
  • Original query preservation — the user's original SQL (pre-inlining) is stored in pgstream.pgs_stream_tables.original_query for reinit after view changes and user introspection.
  • View DDL hooks — CREATE OR REPLACE VIEW triggers reinit of affected stream tables. DROP VIEW sets affected stream tables to ERROR status.
  • View dependency tracking — views are registered as soft dependencies in pgstream.pgs_dependencies (source_type = 'VIEW') for DDL hook lookups.
  • E2E test suite — 16 E2E tests covering basic view inlining, UPDATE/DELETE through views, filtered views, aggregation, joins, nested views, FULL mode, materialized view rejection/allowance, view replacement/drop hooks, TRUNCATE propagation, column renaming, catalog verification, and dependency registration.

SQL Feature Gaps (S1–S15)

  • Volatile function detection (S1) — defining queries containing volatile functions (e.g., random(), clock_timestamp()) are rejected in DIFFERENTIAL mode with a clear error. Stable functions (e.g., now()) emit a warning.
  • TRUNCATE capture in CDC (S2) — statement-level AFTER TRUNCATE trigger writes a T marker row to the change buffer. Differential refresh detects the marker and automatically falls back to a full refresh.
  • ALL (subquery) support (S3) — x op ALL (subquery) is rewritten to an AntiJoin via NOT EXISTS with a negated condition.
  • DISTINCT ON auto-rewrite (S4) — DISTINCT ON (col1, col2) is transparently rewritten to a ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1 subquery before DVM parsing. Previously rejected.
  • 12 regression aggregates (S5) — CORR, COVAR_POP, COVAR_SAMP, REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY — all use group-rescan strategy. 39 aggregate function variants total (up from 25).
  • Mixed UNION / UNION ALL (S6) — nested set operations with different ALL flags are now parsed correctly.
  • Column snapshot + schema fingerprint (S7) — pgs_dependencies stores a JSONB column snapshot and SHA-256 fingerprint for each source table. DDL change detection uses a 3-tier fast path: fingerprint → snapshot → legacy columns_used fallback.
  • pg_stream.block_source_ddl GUC (S8) — when true, column-affecting DDL on tracked source tables is blocked with an ERROR instead of marking stream tables for reinit.
  • NATURAL JOIN support (S9) — common columns are resolved at parse time and an explicit equi-join condition is synthesized. Supports INNER, LEFT, RIGHT, and FULL NATURAL JOIN variants. Previously rejected.
  • Keyless table support (S10) — source tables without a primary key now work correctly. CDC triggers compute an all-column content hash for row identity. Consistent __pgs_row_id between full and delta refreshes.
  • GROUPING SETS / CUBE / ROLLUP auto-rewrite (S11) — decomposed at parse time into a UNION ALL of separate GROUP BY queries. GROUPING() calls become integer literals. Previously rejected.
  • Scalar subquery in WHERE rewrite (S12) — WHERE col > (SELECT avg(x) FROM t) is rewritten to a CROSS JOIN with column reference replacement.
  • SubLinks in OR rewrite (S13) — WHERE a OR EXISTS (...) is decomposed into UNION branches, one per OR arm.
  • Multi-PARTITION BY window rewrite (S14) — window functions with different PARTITION BY clauses are split into separate subqueries joined by a ROW_NUMBER() OVER () row marker.
  • Recursive CTE semi-naive + DRed (S15) — DIFFERENTIAL mode for recursive CTEs now uses semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback. Strategy is auto-selected per refresh.
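The semi-naive strategy for INSERT-only changes can be illustrated on a toy transitive-closure query. This is a hedged Python sketch of the idea, not the extension's implementation; all names in it are hypothetical:

```python
# Illustrative sketch: semi-naive evaluation for an INSERT-only change to
# a transitive-closure query, versus recomputing the fixpoint from scratch.

def transitive_closure(edges):
    """Full fixpoint: all reachable (a, b) pairs."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in edges if b == c}
        if new <= closure:
            return closure
        closure |= new

def seminaive_insert(closure, edges, new_edges):
    """Propagate only the newly inserted edges instead of recomputing."""
    edges = edges | new_edges
    delta = set(new_edges) - closure
    while delta:
        closure |= delta
        # Join the frontier against both sides of the accumulated state.
        derived = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        derived |= {(a, d) for (a, b) in closure for (c, d) in delta if b == c}
        delta = derived - closure
    return closure

edges = {(1, 2), (2, 3)}
closure = transitive_closure(edges)
updated = seminaive_insert(closure, edges, {(3, 4)})
assert updated == transitive_closure(edges | {(3, 4)})
```

The work per refresh is proportional to the frontier, not the full closure, which is why INSERT-only changes get this path while mixed changes need DRed.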

Native Syntax Planning

  • Native DDL syntax research — comprehensive analysis of 15 PostgreSQL extension syntax mechanisms for supporting CREATE STREAM TABLE-like syntax. See docs/research/CUSTOM_SQL_SYNTAX.md.
  • Native syntax plan — tiered strategy: Tier 1 (function API, existing), Tier 1.5 (CALL procedure wrappers), Tier 2 (CREATE MATERIALIZED VIEW ... WITH (pgstream.stream = true) via ProcessUtility_hook). See plans/sql/PLAN_NATIVE_SYNTAX.md.

Hybrid CDC — Automatic Trigger → WAL Transition

  • Hybrid CDC architecture — stream tables now start with lightweight row-level triggers for zero-config setup and can automatically transition to WAL-based (logical replication) capture for lower write-side overhead. The transition is controlled by the pg_stream.cdc_mode GUC (trigger / auto / wal).
  • WAL decoder background worker — dedicated worker that polls logical replication slots and writes decoded changes into the same change buffer tables used by triggers, ensuring a uniform format for the DVM engine.
  • Transition orchestration — transparent three-step process: create replication slot, wait for decoder catch-up, drop trigger. Falls back to triggers automatically if the decoder does not catch up within the timeout.
  • CDC health monitoring — new pgstream.check_cdc_health() function returns per-source CDC mode, slot lag, confirmed LSN, and alerts.
  • CDC transition notifications — NOTIFY pg_stream_cdc_transition emits JSON payloads when sources transition between CDC modes.
  • New GUCs — pg_stream.cdc_mode and pg_stream.wal_transition_timeout.
  • Catalog extension — pgs_dependencies table gains cdc_mode, slot_name, decoder_confirmed_lsn, and transition_started_at columns.
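The three-step transition with timeout fallback can be sketched as a small state machine. This is a hypothetical illustration of the orchestration described above; the callbacks and function names are not the extension's actual API:

```python
# Hypothetical sketch of the trigger -> WAL transition: create the slot,
# wait for the decoder to catch up, drop the trigger, or fall back.
import time

def transition_to_wal(create_slot, decoder_lag, drop_trigger,
                      timeout_s=60.0, poll_s=0.01, clock=time.monotonic):
    """Returns the resulting CDC mode ('wal' or 'trigger')."""
    create_slot()                      # step 1: triggers keep capturing
    deadline = clock() + timeout_s
    while clock() < deadline:
        if decoder_lag() == 0:         # step 2: decoder caught up
            drop_trigger()             # step 3: decoder is now sole writer
            return "wal"
        time.sleep(poll_s)
    return "trigger"                   # fallback: transition aborted

# Simulated decoder that catches up after three polls.
lags = iter([3, 2, 1, 0, 0])
mode = transition_to_wal(lambda: None, lambda: next(lags), lambda: None)
assert mode == "wal"
```

The key property is that the trigger stays installed until the decoder has confirmed catch-up, so no change is lost in either outcome.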

User-Defined Triggers on Stream Tables

  • User trigger support in DIFFERENTIAL mode — user-created AFTER triggers on stream tables now fire correctly during differential refresh via explicit per-row DML (INSERT/UPDATE/DELETE) instead of bulk MERGE.
  • FULL refresh trigger handling — user triggers are suppressed during FULL refresh with DISABLE TRIGGER USER and a NOTIFY pgstream_refresh is emitted so listeners know when to re-query.
  • Trigger detection — has_user_triggers() automatically detects user-defined triggers on storage tables at refresh time.
  • DDL warning — CREATE TRIGGER on a stream table emits a notice explaining the trigger semantics and the pg_stream.user_triggers GUC.
  • New GUC — pg_stream.user_triggers (auto / on / off) controls whether the explicit DML path is used.

Changed

  • Monitoring layer — slot_health() now covers WAL-mode sources. Architecture diagrams and documentation updated to reflect the hybrid CDC model.
  • Stream table restrictions — user triggers on stream tables upgraded from "⚠️ Unsupported" to "✅ Supported (DIFFERENTIAL mode)".

Core Engine

  • Declarative stream tables — define a SQL query and a schedule; the extension handles automatic refresh.
  • Differential View Maintenance (DVM) — incremental delta computation derived automatically from the defining query's operator tree.
  • Trigger-based CDC — lightweight AFTER row-level triggers capture changes into per-source buffer tables. No wal_level = logical required.
  • DAG-aware scheduling — stream tables that depend on other stream tables are refreshed in topological order with cycle detection.
  • Background scheduler — canonical-period scheduling (48·2ⁿ seconds) with cron expression support.
  • Crash-safe refresh — advisory locks prevent concurrent refreshes; crash recovery marks in-flight refreshes as failed.
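The canonical-period grid (48·2ⁿ seconds) can be sketched with a small snapping function. The exact snapping rule shown (round a requested interval down so the table is at least as fresh as requested) is an assumption for illustration, not the scheduler's documented behavior:

```python
# Illustrative sketch: snap a requested freshness interval onto the
# 48 * 2**n second grid so tables sharing a period refresh together.
# The round-down rule is an assumption, not the extension's spec.

def canonical_period(requested_seconds, base=48):
    """Largest base * 2**n (n >= 0) not exceeding the request;
    very tight requests are clamped up to the base period."""
    period = base
    while period * 2 <= requested_seconds:
        period *= 2
    return period

assert canonical_period(48) == 48
assert canonical_period(300) == 192   # 48*4 = 192 <= 300 < 384
assert canonical_period(30) == 48     # clamped up to the base period
```

Aligning every table to a shared grid keeps the number of distinct wake-up times small regardless of how many stream tables exist.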

SQL Support

  • Full operator coverage — table scans, projections, WHERE/HAVING filters, INNER/LEFT/RIGHT/FULL OUTER joins, nested multi-table joins, GROUP BY with 25 aggregate functions, DISTINCT, UNION ALL, UNION, INTERSECT, EXCEPT.
  • Subquery support — subqueries in FROM, EXISTS/NOT EXISTS, IN/NOT IN (subquery), scalar subqueries in SELECT.
  • CTE support — non-recursive CTEs (inline and shared delta), recursive CTEs (WITH RECURSIVE) in both FULL and DIFFERENTIAL modes.
  • Recursive CTE incremental maintenance — DIFFERENTIAL mode now uses semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback when CTE columns don't match ST storage. Strategy is auto-selected per refresh.
  • DISTINCT ON auto-rewrite — transparently rewritten to ROW_NUMBER() window subquery before DVM parsing.
  • GROUPING SETS / CUBE / ROLLUP auto-rewrite — decomposed into UNION ALL of separate GROUP BY queries at parse time.
  • NATURAL JOIN support — common columns resolved at parse time with explicit equi-join synthesis.
  • ALL (subquery) support — rewritten to AntiJoin via NOT EXISTS.
  • Scalar subquery in WHERE — rewritten to CROSS JOIN.
  • SubLinks in OR — decomposed into UNION branches.
  • Multi-PARTITION BY windows — split into joined subqueries.
  • Regression aggregates — CORR, COVAR_POP, COVAR_SAMP, REGR_* (12 new).
  • JSON_ARRAYAGG / JSON_OBJECTAGG — SQL-standard JSON aggregates recognized as first-class DVM aggregates in DIFFERENTIAL mode.
  • JSON_TABLE — PostgreSQL 17+ JSON_TABLE() in FROM clause.
  • Keyless table support — tables without primary keys use content hashing.
  • Volatile function and operator detection — volatile functions are rejected in DIFFERENTIAL mode; stable functions produce a warning. Custom operators backed by volatile functions are also detected.
  • TRUNCATE capture in CDC — a TRUNCATE marker in the change buffer forces fallback to a full refresh.
  • Window functions — ROW_NUMBER, RANK, SUM OVER, etc. with full frame clause support (ROWS, RANGE, GROUPS, BETWEEN, EXCLUDE) and named WINDOW clauses.
  • LATERAL SRFs — jsonb_array_elements, unnest, jsonb_each, etc. via row-scoped recomputation.
  • LATERAL subqueries — explicit LATERAL (SELECT ...) in FROM with correlated references.
  • Expression support — CASE WHEN, COALESCE, NULLIF, GREATEST, LEAST, IN (list), BETWEEN, IS DISTINCT FROM, IS TRUE/FALSE/UNKNOWN, SIMILAR TO, ANY/ALL (array), ARRAY/ROW constructors, array subscript, field access.
  • Ordered-set aggregates — MODE, PERCENTILE_CONT, PERCENTILE_DISC with WITHIN GROUP (ORDER BY).

Monitoring & Observability

  • Refresh statistics — st_refresh_stats(), get_refresh_history(), get_staleness().
  • Slot health — slot_health() checks replication slot state and WAL retention.
  • DVM plan inspection — explain_st() describes the operator tree.
  • Monitoring views — pgstream.stream_tables_info and pgstream.pg_stat_stream_tables.
  • NOTIFY alerting — pg_stream_alert channel broadcasts stale, suspended, reinitialize, slot lag, refresh completed/failed events.

Infrastructure

  • Row ID hashing — pg_stream_hash() and pg_stream_hash_multi() using xxHash (xxh64) for deterministic row identity.
  • DDL event tracking — ALTER TABLE and DROP TABLE on source tables automatically set needs_reinit on affected stream tables. CREATE OR REPLACE FUNCTION / ALTER FUNCTION / DROP FUNCTION on functions used by defining queries also triggers reinit.
  • Cross-session cache coherence — shared-memory CACHE_GENERATION atomic counter ensures all backends flush delta/MERGE template caches when DDL changes occur.
  • Version / frontier tracking — per-source JSONB frontier for consistent snapshots and Delayed View Semantics (DVS) guarantee.
  • 12 GUC variables — enabled, scheduler_interval_ms, min_schedule_seconds, max_consecutive_errors, change_buffer_schema, max_concurrent_refreshes, differential_max_change_ratio, cleanup_use_truncate, user_triggers, cdc_mode, wal_transition_timeout, block_source_ddl.

Documentation

  • Architecture guide, SQL reference, configuration reference, FAQ, getting-started tutorial, DVM operators reference, benchmark guide.
  • Deep-dive tutorials: What Happens on INSERT / UPDATE / DELETE / TRUNCATE.

Testing

  • ~1,138 unit tests, 22 E2E test suites (Testcontainers + custom Docker image).
  • Property-based tests, integration tests, resilience tests.
  • Column snapshot and schema fingerprint-based DDL change detection.

Known Limitations

  • TABLESAMPLE, LIMIT / OFFSET, FOR UPDATE / FOR SHARE — rejected with clear error messages.
  • Window functions inside expressions (CASE, COALESCE, arithmetic) — rejected.
  • Circular stream table dependencies (cycles) — not yet supported.

pg_stream — Project Roadmap

Last updated: 2026-02-26 · Current version: 0.1.1

For a concise description of what pg_stream is and why it exists, read ESSENCE.md — it explains the core problem (full REFRESH MATERIALIZED VIEW recomputation), how the differential dataflow approach solves it, the hybrid trigger→WAL CDC architecture, and the broad SQL coverage, all in plain language.


Overview

pg_stream is a PostgreSQL 18 extension that implements streaming tables with incremental view maintenance (IVM) via differential dataflow. All 13 design phases are complete. This roadmap tracks the path from pre-release to 1.0 and beyond.

 We are here
     │
     ▼
 ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
 │  0.1.0  │──▶│  0.2.0  │──▶│  0.3.0   │──▶│  0.4.0   │──▶│  1.0.0   │──▶│  1.x+    │
 │ Pre-    │   │ Correct-│   │ Prod-    │   │ Observ-  │   │ Stable   │   │ Scale &  │
 │ release │   │ ness    │   │ ready    │   │ ability  │   │ Release  │   │ Ecosystem│
 └─────────┘   └─────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘

v0.1.0 — Released (2026-02-26)

Status: Released — all 13 design phases implemented.

Core engine, DVM with 21 OpTree operators, trigger-based CDC, DAG-aware scheduling, monitoring, dbt macro package, and 1,300+ tests.

See CHANGELOG.md for the full feature list.

Late additions (pre-March 1st)

Low-risk, high-value items pulled forward from v0.2.0:

Item | Description | Effort | Ref
F4 | WAL decoder: pgoutput message parsing edge cases | 2–3h | SQL_GAPS_7.md G3.3
F7 | Document JOIN key column change limitations | 1–2h | SQL_GAPS_7.md G4.2
F11 | Keyless table duplicate-rows: document known behavior | 1h | SQL_GAPS_7.md G7.1
F14 | CUBE explosion guard (reject oversized CUBE grouping sets) | 1h | SQL_GAPS_7.md G5.2

v0.2.0 — Correctness & Stability

Goal: Close all critical and high-priority gaps to reach a provably correct baseline. No new features — only fixes, verification, and test coverage.

Tier 0 — Critical (must-fix)

Item | Description | Effort | Ref
F1 | DELETE+INSERT merge strategy double-evaluation guard | 3–4h | SQL_GAPS_7.md G1.1 (P0)
F2 | WAL decoder: keyless-table pk_hash computation | 4–6h | SQL_GAPS_7.md G3.1 · PLAN_HYBRID_CDC.md
F3 | WAL decoder: old_* column population for UPDATEs | 4–6h | SQL_GAPS_7.md G3.2 · PLAN_HYBRID_CDC.md
F5 | JOIN key column change detection in delta SQL | 3–4h | SQL_GAPS_7.md G4.1 (P1)
F6 | ALTER TYPE / ALTER POLICY DDL tracking | 3–5h | SQL_GAPS_7.md G9.1 (P1)

Subtotal: 17–25 hours (F4 and F7 moved to v0.1.0)

Tier 1 — Verification

Item | Description | Effort | Ref
F8–F10, F12 | Window partition key E2E, recursive CTE monotonicity audit, ALTER DOMAIN tracking, PgBouncer compatibility docs | 16–23h | SQL_GAPS_7.md G5–G9

F11 (keyless table duplicate-rows documentation) moved to v0.1.0.

Tier 2 — Robustness

Item | Description | Effort | Ref
F13, F15–F16 | LIMIT-in-subquery warning, RANGE_AGG recognition, read replica detection, SPI error classification | 6–8h | SQL_GAPS_7.md G2, G5, G6, G8

F14 (CUBE explosion guard) moved to v0.1.0.

Tier 3 — Test coverage

Item | Description | Effort | Ref
F17–F26 | 21 aggregate differential E2E, FULL JOIN E2E, INTERSECT/EXCEPT pairs, GUC variation tests, CI combined coverage | 29–38h | SQL_GAPS_7.md G7 · STATUS_TESTING.md

v0.2.0 total: ~68–94 hours (F4, F7, F11, F14 moved to v0.1.0)

Exit criteria:

  • Zero P0 gaps
  • All P1 gaps resolved or documented as known limitations
  • E2E test count ≥ 400 with 0 pre-existing failures
  • Combined coverage ≥ 75%

v0.3.0 — Production Readiness

Goal: Operational polish, parallel refresh, and production-grade WAL-based CDC. The extension is suitable for production use after this milestone.

Performance & Parallelism

Item | Description | Effort | Ref
P1 | Verify PostgreSQL parallel query for delta SQL | 0h | REPORT_PARALLELIZATION.md §E
P2 | DAG level extraction (topological_levels()) | 2–4h | REPORT_PARALLELIZATION.md §B
P3 | Dynamic background worker dispatch per level | 12–16h | REPORT_PARALLELIZATION.md §A+B

Operational

Item | Description | Effort | Ref
O1 | Extension upgrade migrations (ALTER EXTENSION UPDATE) | 4–6h | SQL_GAPS_7.md G8.2 · PLAN_UPGRADE_MIGRATIONS.md
O2 | Prepared statement cleanup on cache invalidation | 3–4h | SQL_GAPS_7.md G8.3
O3 | Adaptive fallback threshold exposure via monitoring | 2–3h | SQL_GAPS_7.md G8.4
O4 | SPI SQLSTATE error classification for retry | 3–4h | SQL_GAPS_7.md G8.6
O5 | Slot lag alerting thresholds (configurable) | 2–3h | SQL_GAPS_7.md G10

WAL CDC Hardening

Item | Description | Effort | Ref
W1 | WAL decoder fixes (F2–F4 prerequisite from v0.2.0) | Done in v0.2.0 | PLAN_HYBRID_CDC.md
W2 | WAL mode E2E test suite (parallel to trigger suite) | 8–12h | PLAN_HYBRID_CDC.md
W3 | WAL→trigger automatic fallback hardening | 4–6h | PLAN_HYBRID_CDC.md
W4 | Promote pg_stream.cdc_mode = 'auto' to recommended | Documentation | PLAN_HYBRID_CDC.md

v0.3.0 total: ~40–58 hours (excluding v0.2.0 prerequisites)

Exit criteria:

  • max_concurrent_refreshes drives real parallel refresh
  • WAL CDC mode passes full E2E suite
  • Extension upgrade path tested (0.1.0 → 0.3.0)
  • Zero P0/P1 gaps remaining

v0.4.0 — Observability & Integration

Goal: Prometheus/Grafana observability, dbt-pgstream formal release, complete documentation review, and validated upgrade path. After this milestone the product is externally visible and monitored.

Observability

Item | Description | Effort | Ref
M1 | Prometheus exporter configuration guide | 4–6h | PLAN_ECO_SYSTEM.md §1
M2 | Grafana dashboard (refresh latency, staleness, CDC lag) | 4–6h | PLAN_ECO_SYSTEM.md §1

Integration & Release prep

Item | Description | Effort | Ref
R5 | dbt-pgstream 0.1.0 formal release (PyPI) | 2–3h | dbt-pgstream/ · PLAN_DBT_MACRO.md
R6 | Complete documentation review & polish | 4–6h | docs/
O1 | Extension upgrade migrations (ALTER EXTENSION UPDATE) | 4–6h | SQL_GAPS_7.md G8.2 · PLAN_UPGRADE_MIGRATIONS.md

v0.4.0 total: ~18–27 hours

Exit criteria:

  • Grafana dashboard published
  • dbt-pgstream 0.1.0 on PyPI
  • ALTER EXTENSION pg_stream UPDATE tested (0.3.0 → 0.4.0)
  • All public documentation current and reviewed

v1.0.0 — Stable Release

Goal: First officially supported release. Semantic versioning locks in. API, catalog schema, and GUC names are considered stable. Focus is distribution — getting pg_stream onto package registries.

Release engineering

Item | Description | Effort | Ref
R1 | Semantic versioning policy + compatibility guarantees | 2–3h | PLAN_VERSIONING.md
R2 | PGXN / apt / rpm packaging | 8–12h | PLAN_PACKAGING.md
R3 | Docker Hub official image → CNPG extension image | ✅ Done | PLAN_CLOUDNATIVEPG.md
R4 | CNPG operator hardening (K8s 1.33+ native ImageVolume) | 4–6h | PLAN_CLOUDNATIVEPG.md

v1.0.0 total: ~18–27 hours

Exit criteria:

  • Published on PGXN and Docker Hub
  • CNPG extension image published to GHCR (pg_stream-ext)
  • CNPG cluster-example.yaml validated (Image Volume approach)
  • Upgrade path from v0.4.0 tested
  • Semantic versioning policy in effect

Post-1.0 — Scale & Ecosystem

These are not gated on 1.0 but represent the longer-term horizon.

Ecosystem expansion

Item | Description | Effort | Ref
E1 | dbt full adapter (dbt-pgstream extending dbt-postgres) | 20–30h | PLAN_DBT_ADAPTER.md
E2 | Airflow provider (apache-airflow-providers-pgstream) | 16–20h | PLAN_ECO_SYSTEM.md §4
E3 | CLI tool (pgstream) for management outside SQL | 16–20h | PLAN_ECO_SYSTEM.md §4
E4 | Flyway / Liquibase migration support | 8–12h | PLAN_ECO_SYSTEM.md §5
E5 | ORM integrations guide (SQLAlchemy, Django, etc.) | 8–12h | PLAN_ECO_SYSTEM.md §5

Scale

Item | Description | Effort | Ref
S1 | External orchestrator sidecar for 100+ STs | 20–40h | REPORT_PARALLELIZATION.md §D
S2 | Citus / distributed PostgreSQL compatibility | ~6 months | plans/infra/CITUS.md
S3 | Multi-database support (beyond postgres DB) | TBD | PLAN_MULTI_DATABASE.md

Advanced SQL

Item | Description | Effort | Ref
A1 | Circular dependency support (SCC fixpoint iteration) | ~40h | CIRCULAR_REFERENCES.md
A2 | Streaming aggregation (sub-second latency path) | TBD | PLAN_STREAMING_AGGREGATION.md
A3 | PostgreSQL 19 forward-compatibility | TBD | PLAN_PG19_COMPAT.md

Effort Summary

Milestone | Effort estimate | Cumulative
v0.2.0 — Correctness | 68–94h | 68–94h
v0.3.0 — Production ready | 40–58h | 108–152h
v0.4.0 — Observability & Integration | 18–27h | 126–179h
v1.0.0 — Stable release | 18–27h | 144–206h
Post-1.0 (ecosystem) | 88–134h | 232–340h
Post-1.0 (scale) | 6+ months |

References

Document | Purpose
CHANGELOG.md | What's been built
plans/PLAN.md | Original 13-phase design plan
plans/sql/SQL_GAPS_7.md | 53 known gaps, prioritized
plans/performance/REPORT_PARALLELIZATION.md | Parallelization options analysis
plans/performance/STATUS_PERFORMANCE.md | Benchmark results
plans/ecosystem/PLAN_ECO_SYSTEM.md | Ecosystem project catalog
plans/dbt/PLAN_DBT_ADAPTER.md | Full dbt adapter plan
plans/infra/CITUS.md | Citus compatibility plan
plans/infra/PLAN_VERSIONING.md | Versioning & compatibility policy
plans/infra/PLAN_PACKAGING.md | PGXN / deb / rpm packaging
plans/infra/PLAN_DOCKER_IMAGE.md | Official Docker image (superseded by CNPG extension image)
plans/ecosystem/PLAN_CLOUDNATIVEPG.md | CNPG Image Volume extension image
plans/infra/PLAN_MULTI_DATABASE.md | Multi-database support
plans/infra/PLAN_PG19_COMPAT.md | PostgreSQL 19 forward-compatibility
plans/sql/PLAN_UPGRADE_MIGRATIONS.md | Extension upgrade migrations
plans/sql/PLAN_STREAMING_AGGREGATION.md | Sub-second streaming aggregation
plans/adrs/PLAN_ADRS.md | Architectural decisions
docs/ARCHITECTURE.md | System architecture

pg_stream vs. DBSP: Similarities and Differences

What They Share (Conceptual Foundation)

pg_stream explicitly cites DBSP as its theoretical foundation (see PRIOR_ART.md). The key overlap:

Concept | DBSP (paper) | pg_stream (implementation)
Z-set / delta model | Rows annotated with weights (+1/−1) in an abelian group | __pgs_action = 'I'/'D' column on every delta row — effectively Z-sets restricted to {+1, −1}
Per-operator differentiation | Recursive Algorithm 4.6: Q^Δ = D ∘ Q ∘ I, decomposed per-operator via the chain rule (Q₁ ∘ Q₂)^Δ = Q₁^Δ ∘ Q₂^Δ | DiffContext::diff_node() walks the OpTree and calls per-operator differentiators (scan, filter, project, join, aggregate, distinct, union, etc.) — same recursive structural decomposition
Linear operators are self-incremental | Theorem 3.3: for LTI operator Q, Q^Δ = Q | Filter and Project pass deltas through unchanged (just apply predicate/projection to the delta stream)
Bilinear join rule | Theorem 3.4: Δ(a × b) = Δa × Δb + a × Δb + Δa × b | diff_inner_join generates up to 3 UNION ALL parts: (delta_left ⋈ current_right), (current_left ⋈ delta_right), and optionally (delta_left ⋈ delta_right)
Aggregate auxiliary counters | §4.2: counting algorithm for maintaining aggregates with deletions | __pgs_count auxiliary column, LEFT JOIN back to stream table to read old counts and compute new counts
Recursive queries | §6: fixed-point iteration with z⁻¹ delay operator, semi-naive evaluation | diff_recursive_cte uses recomputation-diff (DRed-style), not DBSP's native fixed-point circuit
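The bilinear join rule from the table above can be checked on plain Python sets standing in for relations. This sketch is insert-only (no weights or deletions), which is enough to show why the three parts together equal the change in the join:

```python
# Sketch (not pg_stream source): verify the bilinear join decomposition
# delta(L join R) = dL join R  union  L join dR  union  dL join dR
# on tuples whose first element is the join key.

def join(left, right):
    return {(l, r) for l in left for r in right if l[0] == r[0]}

def delta_join(cur_left, cur_right, d_left, d_right):
    """cur_* are the pre-change snapshots; d_* are insert-only deltas."""
    return (join(d_left, cur_right)
            | join(cur_left, d_right)
            | join(d_left, d_right))

L = {("k1", "a"), ("k2", "b")}
R = {("k1", "x")}
dL = {("k3", "c")}
dR = {("k2", "y"), ("k3", "z")}

new_full = join(L | dL, R | dR)                       # full recompute
incremental = join(L, R) | delta_join(L, R, dL, dR)   # old result + delta
assert incremental == new_full
```

Dropping the (dL ⋈ dR) part would lose rows whose two sides both arrived in the same batch, here the k3 pair, which is why the third part is needed whenever both deltas are non-empty.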

Key Differences

1. Execution model — standalone engine vs. embedded in PostgreSQL

DBSP is a standalone streaming runtime (Rust library, now Feldera). It compiles query plans into dataflow graphs that maintain in-memory state and process continuous micro-batches. Operators are long-lived stateful actors with their own memory.

pg_stream is an extension inside PostgreSQL. It has no persistent dataflow graph. On each refresh, it generates a single SQL query (CTE chain) that PostgreSQL's own planner/executor evaluates. After execution, no operator state persists — auxiliary state lives in the stream table itself (__pgs_count columns) and change buffer tables.

2. Streams vs. periodic batches

DBSP operates on true infinite streams indexed by logical time t ∈ ℕ. Each "step" processes one micro-batch of changes, and operators carry integration state (I operator = running sum from t=0).

pg_stream operates in discrete refresh cycles triggered by a lag-based scheduler. There is no integration operator — the "current state" is just the stream table's contents, and changes are consumed from CDC buffer tables between LSN boundaries. Each refresh is a self-contained transaction.

3. Z-set weights vs. binary actions

DBSP uses integer weights in ℤ — rows can have weights > 1 (bags) or < −1 (multiple deletions). This enables correct multiset semantics and composable group algebra.

pg_stream uses binary actions ('I' insert, 'D' delete, sometimes 'U' update). It doesn't maintain true Z-set weights. For aggregates, the __pgs_count auxiliary column serves a similar purpose but is specific to the aggregate operator — it's not a general weight propagated through the operator tree.
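The difference is easy to see on a multiset source. This is an illustrative Python contrast, not code from either system:

```python
# Sketch: Z-set state as a row -> integer weight mapping (group addition)
# versus binary 'I'/'D' actions applied to a plain set.

def apply_zset(state, delta):
    out = dict(state)
    for row, w in delta.items():
        out[row] = out.get(row, 0) + w
        if out[row] == 0:
            del out[row]
    return out

# Two inserts of the same row: the Z-set remembers multiplicity 2,
# so one later delete still leaves one copy.
s = apply_zset({}, {"r": +1})
s = apply_zset(s, {"r": +1})
s = apply_zset(s, {"r": -1})
assert s == {"r": 1}

# Binary actions on a set collapse that history: 'I', 'I', 'D' leaves
# nothing, which is why multiset cases need extra machinery such as
# content hashes or the __pgs_count auxiliary column.
table = set()
for row, action in [("r", "I"), ("r", "I"), ("r", "D")]:
    if action == "I":
        table.add(row)
    else:
        table.discard(row)
assert table == set()
```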

4. Integration operator (I)

DBSP: The integration operator I(s)[t] = Σᵢ≤ₜ s[i] is an explicit first-class circuit element. It maintains running sums of changes and is the key mechanism for computing incremental joins (z⁻¹(I(a)) = "accumulated left side up to previous step").

pg_stream: No explicit integration. The equivalent of I is just "read the current contents of the source/stream table." Join differentiation directly reads the current snapshot of the non-delta side (build_snapshot_sql() generates FROM "public"."orders" r), which implicitly includes all historical changes.

5. Recursion

DBSP: Native fixed-point circuits with z⁻¹ delay. Can incrementally maintain recursive queries (e.g., transitive closure) by iterating only on new changes within each step — semi-naive evaluation generalized to arbitrary recursion.

pg_stream: Uses recomputation-diff for recursive CTEs — re-executes the full recursive query and anti-joins against current storage to compute the delta. This is correct but not truly incremental for the recursive part.

6. Correctness guarantees

DBSP: Proven correct in Lean. All theorems are machine-checked. The chain rule, cycle rule, and bilinear decomposition are formally verified.

pg_stream: Verified empirically via property-based tests (an assert_invariant check verifies that Contents(ST) = Q(DB) after each mutation cycle). No formal proof, but the per-operator rules are direct translations of DBSP's rules.

7. Scope

DBSP: A general-purpose theory and streaming engine. Handles nested relations, streaming aggregation over windows, arbitrary compositions. The Feldera implementation supports a full SQL frontend.

pg_stream: Focused on materialized views inside PostgreSQL. Supports a specific subset of SQL (scan, filter, project, inner/left/full join, aggregates, DISTINCT, UNION ALL, INTERSECT, EXCEPT, CTEs, window functions, lateral joins). It is not a general streaming engine — it leverages PostgreSQL's own query planner and executor.


Summary

pg_stream applies DBSP's differentiation rules to generate delta queries, but it is not a DBSP implementation. It borrows the mathematical framework (per-operator differentiation, Z-set-like deltas, bilinear join decomposition) while making fundamentally different architectural choices: embedded in PostgreSQL, no persistent dataflow state, periodic batch execution, and PostgreSQL's planner as the optimizer. Think of it as "DBSP's differentiation algebra, compiled down to SQL CTEs and executed by PostgreSQL."

Prior Art

This document lists the academic papers, PostgreSQL commits, open-source tools, and standard algorithms whose techniques are reused in pg_stream.

Maintaining this record serves two purposes:

  1. Attribution — credit the research and engineering work this project builds upon.
  2. Independent derivation — demonstrate that every core technique predates and is independent of any single vendor's commercial product.

Differential View Maintenance (DVM)

DBSP — Automatic Incremental View Maintenance

Budiu, M., Ryzhyk, L., McSherry, F., & Tannen, V. (2023). "DBSP: Automatic Incremental View Maintenance for Rich Query Languages." Proceedings of the VLDB Endowment (PVLDB), 16(7), 1601–1614. https://arxiv.org/abs/2203.16684

The Z-set abstraction (rows annotated with +1/−1 multiplicity) is the theoretical foundation for the __pgs_action column produced by the delta operators in src/dvm/operators/. The per-operator differentiation rules (scan, filter, project, join, aggregate, union) are direct applications of the DBSP lifting operator (D) described in this paper.

See DBSP_COMPARISON.md for a detailed comparison of pg_stream's architecture with the DBSP model.

Gupta & Mumick — Materialized Views Survey

Gupta, A. & Mumick, I.S. (1995). "Maintenance of Materialized Views: Problems, Techniques, and Applications." IEEE Data Engineering Bulletin, 18(2), 3–18.

Gupta, A. & Mumick, I.S. (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press. ISBN 978-0-262-57122-7.

The per-operator differentiation rules in src/dvm/operators/ follow the derivation given in section 3 of the 1995 survey. The counting algorithm for maintaining aggregates with deletions uses the approach described in the MIT Press book.
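The counting idea can be sketched in a few lines of Python. This is a hedged illustration of the technique, with the stream table's per-group state reduced to a (sum, count) pair; it is not the extension's SQL:

```python
# Sketch of the counting algorithm: each group keeps an auxiliary row
# count (pg_stream's __pgs_count analogue) so a deletion can tell
# "update the aggregate" apart from "the whole group vanished".

def apply_changes(groups, changes):
    """groups: {key: (sum, count)}; changes: (key, value, +1 or -1)."""
    for key, value, sign in changes:
        total, count = groups.get(key, (0, 0))
        total, count = total + sign * value, count + sign
        if count == 0:
            groups.pop(key, None)   # last remaining row: drop the group
        else:
            groups[key] = (total, count)
    return groups

g = apply_changes({}, [("a", 10, +1), ("a", 5, +1), ("b", 7, +1)])
assert g == {"a": (15, 2), "b": (7, 1)}
g = apply_changes(g, [("b", 7, -1)])   # deleting b's only row removes the group
assert g == {"a": (15, 2)}
```

Without the count, a SUM of 0 would be indistinguishable from an empty group, so deletions could not be applied correctly.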

DBToaster — Higher-order Delta Processing

Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Olteanu, D., & Zavodny, J. (2014). "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views." The VLDB Journal, 23(2), 253–278. https://doi.org/10.1007/s00778-013-0348-4

Inspiration for the recursive delta compilation strategy where the delta of a complex query is itself a query that can be differentiated.

DRed — Deletion and Re-derivation

Gupta, A., Mumick, I.S., & Subrahmanian, V.S. (1993). "Maintaining Views Incrementally." Proceedings of the 1993 ACM SIGMOD International Conference, 157–166.

The DRed algorithm for handling deletions in recursive views is the basis for the recursive CTE differential refresh strategy in src/dvm/operators/recursive_cte.rs.


Scheduling

Earliest-Deadline-First (EDF)

Liu, C.L. & Layland, J.W. (1973). "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment." Journal of the ACM, 20(1), 46–61. https://doi.org/10.1145/321738.321743

The deadline-driven scheduling in src/scheduler.rs applies the classic EDF principle: the stream table whose freshness deadline expires soonest is refreshed first. EDF is optimal for uniprocessor preemptive scheduling and is a standard technique in operating systems and real-time databases.
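A minimal EDF pick looks like this; the field names are illustrative, not the scheduler's actual data model:

```python
# Sketch of an EDF scheduling step: each table's deadline is its last
# refresh time plus its freshness interval; refresh the earliest one
# that is already due.

def next_to_refresh(tables, now):
    due = [(t["last_refresh"] + t["interval"], t["name"]) for t in tables]
    deadline, name = min(due)
    return name if deadline <= now else None

tables = [
    {"name": "hourly_sales", "last_refresh": 0, "interval": 3600},
    {"name": "active_orders", "last_refresh": 100, "interval": 30},
]
assert next_to_refresh(tables, now=200) == "active_orders"  # deadline 130 vs 3600
assert next_to_refresh(tables, now=50) is None              # nothing due yet
```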

Topological Sort — Kahn's Algorithm

Kahn, A.B. (1962). "Topological sorting of large networks." Communications of the ACM, 5(11), 558–562. https://doi.org/10.1145/368996.369025

The dependency DAG in src/dag.rs uses Kahn's algorithm for topological ordering and cycle detection. This is standard computer science curriculum and appears in every major algorithms textbook (Cormen et al., Sedgewick, Kleinberg & Tardos).
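Kahn's algorithm is short enough to show in full. This is a generic textbook sketch, not the extension's Rust code:

```python
# Kahn's algorithm for the dependency DAG: emit stream tables in
# topological order (dependencies first); leftover nodes mean a cycle.
from collections import deque

def topo_order(nodes, deps):
    """deps: {node: set of nodes it depends on}. Raises on a cycle."""
    indegree = {n: len(deps.get(n, ())) for n in nodes}
    dependents = {n: [] for n in nodes}
    for n, ds in deps.items():
        for d in ds:
            dependents[d].append(n)
    queue = deque(n for n, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle detected among stream tables")
    return order

# daily_summary depends on active_orders, which depends on raw orders.
order = topo_order(["daily_summary", "active_orders", "orders"],
                   {"daily_summary": {"active_orders"}, "active_orders": {"orders"}})
assert order.index("orders") < order.index("active_orders") < order.index("daily_summary")
```

The cycle check is exactly why circular stream table dependencies are currently rejected rather than silently mis-ordered.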


Change Data Capture (CDC)

PostgreSQL Row-Level Triggers

Row-level AFTER INSERT/UPDATE/DELETE triggers have been available in PostgreSQL since version 6.x (late 1990s). The trigger-based change capture pattern used in src/cdc.rs is a well-established PostgreSQL technique:

  • PostgreSQL documentation: CREATE TRIGGER — trigger-based CDC has been a standard pattern for decades.
  • PostgreSQL wiki: "Trigger-based Change Data Capture in PostgreSQL."

Debezium

Debezium project (Red Hat, open source since 2016). https://debezium.io/

Debezium implements trigger-based and WAL-based CDC for PostgreSQL and other databases. The change buffer table pattern (pg_stream_changes.changes_<oid>) follows a similar approach, modified for single-process consumption within the PostgreSQL backend.

pgaudit

pgaudit extension (2015). https://github.com/pgaudit/pgaudit

Captures DML via AFTER row-level triggers for audit logging, demonstrating the same trigger-based change-capture technique in production since 2015.


Materialized View Refresh

PostgreSQL REFRESH MATERIALIZED VIEW CONCURRENTLY

PostgreSQL 9.4 (December 2014, commit 96ef3b8). src/backend/commands/matview.c

The snapshot-diff strategy used for recomputation-diff refreshes (where the full query is re-executed and anti-joined against current storage to compute inserts and deletes) mirrors the algorithm implemented in PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY. This PostgreSQL feature predates all relevant patents and is publicly documented.
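The two anti-joins reduce to set differences when rows are compared wholesale. A minimal sketch of the strategy, with relations modeled as Python sets of tuples:

```python
# Sketch of snapshot-diff: re-run the defining query, then anti-join
# both ways against current storage to derive inserts and deletes.

def snapshot_diff(current_rows, fresh_rows):
    inserts = fresh_rows - current_rows   # in the new result, not yet stored
    deletes = current_rows - fresh_rows   # stored, no longer in the result
    return inserts, deletes

stored = {("order-1", "active"), ("order-2", "active")}
fresh = {("order-2", "active"), ("order-3", "active")}
ins, dels = snapshot_diff(stored, fresh)
assert ins == {("order-3", "active")}
assert dels == {("order-1", "active")}
```

Applying only `ins` and `dels` keeps the materialized contents consistent without rewriting unchanged rows, which is what lets the CONCURRENTLY variant avoid an exclusive lock.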

SQL MERGE Statement

ISO/IEC 9075:2003 (SQL:2003 standard) — MERGE statement. PostgreSQL 15 (October 2022, commit 7103eba).

The MERGE-based delta application in src/refresh.rs uses the ISO-standard MERGE statement, independently implemented by Oracle, SQL Server, DB2, and PostgreSQL. This is not derived from any vendor-specific implementation.


General Database Theory

Relational Algebra

Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13(6), 377–387.

The operator tree in src/dvm/parser.rs models standard relational algebra operators (select, project, join, aggregate, union). These are foundational database theory from 1970.

Semi-Naive Evaluation

Bancilhon, F. & Ramakrishnan, R. (1986). "An Amateur's Introduction to Recursive Query Processing Strategies." Proceedings ACM SIGMOD, 16–52.

General background for recursive CTE evaluation strategies. PostgreSQL's own WITH RECURSIVE implementation uses iterative fixpoint evaluation based on these principles.


This document is maintained for attribution and independent-derivation documentation purposes. It does not constitute legal advice.

Custom SQL Syntax for PostgreSQL Extensions

Comprehensive Technical Research Report

Date: 2026-02-25 · Context: pg_stream extension — evaluating approaches to support CREATE STREAM TABLE syntax or equivalent native-feeling DDL.


Table of Contents

  1. Executive Summary
  2. PostgreSQL Parser Hooks / Utility Hooks
  3. The ProcessUtility_hook Approach
  4. Raw Parser Extension (gram.y)
  5. The Utility Command Approach
  6. Custom Access Methods (CREATE ACCESS METHOD)
  7. Table Access Method API (PostgreSQL 12+)
  8. Foreign Data Wrapper Approach
  9. Event Triggers
  10. TimescaleDB Continuous Aggregates Pattern
  11. Citus Distributed DDL Pattern
  12. PostgreSQL 18 New Features
  13. COMMENT / OPTIONS Abuse Pattern
  14. pg_ivm (Incremental View Maintenance) Pattern
  15. CREATE TABLE ... USING (Table Access Methods) Deep Dive
  16. Comparison Matrix
  17. Recommendations for pg_stream

1. Executive Summary

PostgreSQL's parser is not extensible — there is no parser hook that allows extensions to add new grammar rules. This is a fundamental design constraint. Every approach to "custom DDL syntax" in extensions falls into one of two categories:

  1. Intercept existing syntax — Use ProcessUtility_hook or event triggers to intercept standard DDL (e.g., CREATE TABLE, CREATE VIEW) and augment its behavior.
  2. Use a SQL function as the DDL interface — Define SELECT my_extension.create_thing(...) as the user-facing API (this is what pg_stream currently does).

No production PostgreSQL extension ships truly new SQL grammar without forking the PostgreSQL parser. TimescaleDB, Citus, pg_ivm, and others all work within existing syntax boundaries.


2. PostgreSQL Parser Hooks / Utility Hooks

Available Hook Points

PostgreSQL provides several hook function pointers that extensions can override in _PG_init():

Hook | Header | Purpose
ProcessUtility_hook | tcop/utility.h | Intercept utility (DDL) statement execution
post_parse_analyze_hook | parser/analyze.h | Inspect/modify the analyzed parse tree after semantic analysis
planner_hook | optimizer/planner.h | Replace or augment the query planner
ExecutorStart_hook | executor/executor.h | Intercept executor startup
ExecutorRun_hook | executor/executor.h | Intercept executor row processing
ExecutorFinish_hook | executor/executor.h | Intercept executor finish
ExecutorEnd_hook | executor/executor.h | Intercept executor cleanup
object_access_hook | catalog/objectaccess.h | Notifications when objects are created/modified/dropped
emit_log_hook | utils/elog.h | Intercept log messages

What's Missing: No Parser Hook

There is no parser_hook or raw_parser_hook. The raw parser (the Bison grammar gram.y plus the Flex lexer scan.l) is compiled into the PostgreSQL server binary. Extensions cannot:

  • Add new keywords (e.g., STREAM)
  • Add new grammar productions (e.g., CREATE STREAM TABLE)
  • Modify the tokenizer/lexer
  • Intercept raw SQL text before parsing

The closest hook is post_parse_analyze_hook, which fires after the SQL has already been parsed and analyzed. By this point:

  • The SQL string has already been tokenized and parsed by gram.y
  • A parse tree (Query node) has been produced
  • If the SQL contains unknown syntax, a syntax error has already been raised

Technical Details of post_parse_analyze_hook

/* In src/backend/parser/analyze.c */
typedef void (*post_parse_analyze_hook_type)(ParseState *pstate,
                                             Query *query,
                                             JumbleState *jstate);
post_parse_analyze_hook_type post_parse_analyze_hook = NULL;

Extensions can set this in _PG_init():

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

void _PG_init(void) {
    prev_post_parse_analyze_hook = post_parse_analyze_hook;
    post_parse_analyze_hook = my_post_parse_analyze;
}

Use cases: Query rewriting after parsing (e.g., adding security predicates, row-level security), statistics collection, plan caching invalidation. Not usable for new syntax because parsing has already completed.

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | Impossible — cannot add new grammar |
| Intercept existing DDL | Yes, via ProcessUtility_hook |
| Modify parsed queries | Yes, via post_parse_analyze_hook |
| Complexity | Low for hooking, but limited in capability |
| PG version | All modern versions (hooks stable since PG 9.x) |
| Maintenance | Very low — hook signatures rarely change |

3. The ProcessUtility_hook Approach

How It Works

ProcessUtility_hook is the most powerful DDL interception point. It fires for every "utility statement" (DDL, COPY, EXPLAIN, etc.) after parsing but before execution.

typedef void (*ProcessUtility_hook_type)(PlannedStmt *pstmt,
                                         const char *queryString,
                                         bool readOnlyTree,
                                         ProcessUtilityContext context,
                                         ParamListInfo params,
                                         QueryEnvironment *queryEnv,
                                         DestReceiver *dest,
                                         QueryCompletion *qc);

An extension can:

  1. Inspect the parse tree node — The PlannedStmt->utilityStmt field contains the parsed DDL node (e.g., CreateStmt, AlterTableStmt, ViewStmt).
  2. Modify the parse tree — Change fields before passing to the standard handler.
  3. Replace execution entirely — Skip calling the standard handler and do something else.
  4. Post-process — Call the standard handler first, then do additional work.
  5. Block execution — Raise an error to prevent the DDL.

What Extensions Use This

| Extension | What they intercept | Purpose |
|---|---|---|
| TimescaleDB | CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, etc. | Convert regular tables to hypertables, distribute DDL |
| Citus | Most DDL statements | Propagate DDL to worker nodes |
| pg_partman | CREATE TABLE, partition DDL | Auto-manage partitioning |
| pg_stat_statements | All utility statements | Track DDL execution statistics |
| pgAudit | All utility statements | Audit logging |
| pg_hint_plan | — | Uses post_parse_analyze_hook instead |
| sepgsql | Object creation/modification | Security label enforcement |

Can It Handle New Syntax?

No. It can only intercept DDL that PostgreSQL's parser already understands. You cannot use ProcessUtility_hook to handle CREATE STREAM TABLE because the parser will reject that syntax before the hook is ever called.

However, it can intercept and augment existing syntax:

  • CREATE TABLE ... (some_option) → Intercept CreateStmt, check for special markers, do extra work
  • CREATE VIEW ... WITH (custom_option = true) → Intercept ViewStmt, check reloptions
  • CREATE MATERIALIZED VIEW ... WITH (custom = true) → Same approach

Pattern: Intercepting CREATE TABLE

static void my_process_utility(PlannedStmt *pstmt, ...) {
    Node *parsetree = pstmt->utilityStmt;

    if (IsA(parsetree, CreateStmt)) {
        CreateStmt *stmt = (CreateStmt *) parsetree;
        // Check for a special reloption or table name pattern
        ListCell *lc;
        foreach(lc, stmt->options) {
            DefElem *opt = (DefElem *) lfirst(lc);
            if (strcmp(opt->defname, "stream") == 0) {
                // This is a stream table! Do custom logic.
                create_stream_table_from_ddl(stmt, queryString);
                return; // Don't call standard handler
            }
        }
    }

    // Pass through to standard handler
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, ...);
    else
        standard_ProcessUtility(pstmt, ...);
}

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native CREATE STREAM TABLE | No — parser rejects unknown syntax |
| CREATE TABLE ... WITH (stream=true) | Yes — feasible via reloptions |
| Complexity | Medium — must carefully chain with other extensions |
| PG version | All modern versions |
| Maintenance | Low — hook signature changes rarely (changed in PG 14 and PG 15) |
| Risk | Must always chain prev_ProcessUtility — misbehaving can break other extensions |

4. Raw Parser Extension (gram.y)

How It Works

PostgreSQL's SQL parser is a Bison-generated LALR(1) parser defined in:

  • src/backend/parser/gram.y — Grammar rules (~18,000 lines)
  • src/backend/parser/scan.l — Flex lexer (tokenizer)
  • src/include/parser/kwlist.h — Reserved/unreserved keyword list

To add CREATE STREAM TABLE, you would:

  1. Add STREAM to the keyword list (unreserved or reserved)
  2. Add grammar rules to gram.y:
    CreateStreamTableStmt:
        CREATE STREAM TABLE qualified_name '(' OptTableElementList ')'
        OptWith AS SelectStmt
        {
            CreateStreamTableStmt *n = makeNode(CreateStreamTableStmt);
            n->relation = $4;
            n->query = $9;
            /* ... */
            $$ = (Node *) n;
        }
    ;
    
  3. Add a new NodeTag for CreateStreamTableStmt
  4. Handle it in ProcessUtility
  5. Rebuild the PostgreSQL server

Implications

This requires forking PostgreSQL. The modified parser is compiled into the postgres binary itself. You cannot ship a grammar modification as a loadable extension (.so/.dylib).

Who Does This?

  • YugabyteDB — Fork of PG with custom grammar for distributed features
  • CockroachDB — Entirely custom parser (Go, not PG's Bison grammar)
  • Amazon Aurora (partially) — Custom grammar additions for Aurora-specific features
  • Greenplum — Fork of PG with added grammar for DISTRIBUTED BY, PARTITION BY etc.
  • ParadeDB — Fork of PG with some custom syntax additions

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native CREATE STREAM TABLE | Yes — full parser-level support |
| Complexity | Very high — must maintain a PG fork |
| PG version | Tied to a single PG version |
| Maintenance | Extremely high — must rebase on every PG release (gram.y changes significantly between major versions) |
| Distribution | Cannot use CREATE EXTENSION; must ship entire modified PostgreSQL |
| User adoption | Very low — users must replace their PostgreSQL installation |
| psql autocomplete | Would work with matching psql modifications |
| pg_dump/pg_restore | Broken unless you also modify those tools |

Verdict: Not viable for an extension. Only viable for a PostgreSQL fork/distribution.


5. The Utility Command Approach

How It Works

Some sources reference a "custom utility command" mechanism. In practice, this does not exist as a formal PostgreSQL extension point. What people sometimes mean is one of:

5a. Using DO Blocks as Custom Commands

DO $$ BEGIN PERFORM pgstream.create_stream_table('my_st', 'SELECT ...'); END $$;

This is just a wrapped function call — not a real custom command.

5b. Abusing COMMENT or SET for Command Dispatch

Some extensions parse custom commands from strings:

-- Using SET to pass commands
SET myext.command = 'CREATE STREAM TABLE my_st AS SELECT ...';
SELECT myext.execute_pending_command();

Or using post_parse_analyze_hook to intercept a specially-formatted query:

-- Extension intercepts this via post_parse_analyze_hook
SELECT * FROM myext.dispatch('CREATE STREAM TABLE ...');

5c. Overloading Existing Syntax

Some extensions overload SELECT or CALL:

CALL pgstream.create_stream_table('my_st', $$SELECT ...$$);

CALL was introduced in PostgreSQL 11 for stored procedures. Using it makes the DDL feel more "command-like" than SELECT function().

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — still a function call in disguise |
| User experience | Moderate — CALL is better than SELECT |
| Complexity | Low |
| PG version | PG 11+ for CALL |
| Maintenance | Very low |

6. Custom Access Methods (CREATE ACCESS METHOD)

How It Works

PostgreSQL supports extension-defined access methods (index AMs and table AMs):

CREATE ACCESS METHOD my_am TYPE TABLE HANDLER my_am_handler;

This was introduced in PostgreSQL 9.6 for index AMs and extended to table AMs in PostgreSQL 12. The CREATE ACCESS METHOD statement shows PostgreSQL's philosophy: extensions can define new implementations of existing concepts (tables, indexes) but not new concepts (stream tables).

Table AM vs. Index AM

| Type | Since | Handler Signature | Example |
|---|---|---|---|
| Index AM | PG 9.6 | IndexAmRoutine with scan/insert/delete callbacks | bloom, brin, GiST |
| Table AM | PG 12 | TableAmRoutine with 60+ callbacks | heap (default), columnar (Citus), zedstore (experimental) |

Can We Use This for Stream Tables?

The table AM API defines how tuples are stored and retrieved, not how tables are created or maintained. A stream table's key features are:

  • Defining query — Not part of the table AM concept
  • Automatic refresh — Not part of the table AM concept
  • Change tracking — Could partially overlap with table AM's tuple modification callbacks
  • Storage — The actual storage could use heap (default) AM

You could theoretically create a custom table AM that:

  1. Uses heap storage underneath
  2. Intercepts INSERT/UPDATE/DELETE to maintain change buffers
  3. Adds custom metadata

But this would be an extreme abuse of the API. Table AMs are meant for storage engines, not for implementing materialized view semantics.

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — CREATE TABLE ... USING my_am is the closest |
| Complexity | Extremely high — 60+ callbacks to implement |
| Fitness | Poor — table AM is about storage, not view maintenance |
| PG version | PG 12+ |
| Maintenance | High — AM API evolves between major versions |

7. Table Access Method API (PostgreSQL 12+)

Deep Technical Details

The Table Access Method (AM) API was introduced in PostgreSQL 12 via commit c2fe139c20 by Andres Freund. It abstracts the storage layer, allowing extensions to replace the default heap storage with custom implementations.

The CREATE TABLE ... USING Syntax

-- Use default AM (heap)
CREATE TABLE normal_table (id int, data text);

-- Use custom AM
CREATE TABLE my_table (id int, data text) USING my_custom_am;

-- Set default for a database
SET default_table_access_method = 'my_custom_am';

TableAmRoutine Structure

The handler function must return a TableAmRoutine struct with callbacks:

typedef struct TableAmRoutine {
    NodeTag type;

    /* Slot callbacks */
    const TupleTableSlotOps *(*slot_callbacks)(Relation rel);

    /* Scan callbacks */
    TableScanDesc (*scan_begin)(Relation rel, Snapshot snap, int nkeys, ...);
    void (*scan_end)(TableScanDesc scan);
    void (*scan_rescan)(TableScanDesc scan, ...);
    bool (*scan_getnextslot)(TableScanDesc scan, ...);

    /* Parallel scan */
    Size (*parallelscan_estimate)(Relation rel);
    Size (*parallelscan_initialize)(Relation rel, ...);
    void (*parallelscan_reinitialize)(Relation rel, ...);

    /* Index fetch */
    IndexFetchTableData *(*index_fetch_begin)(Relation rel);
    void (*index_fetch_reset)(IndexFetchTableData *data);
    void (*index_fetch_end)(IndexFetchTableData *data);
    bool (*index_fetch_tuple)(IndexFetchTableData *data, ...);

    /* Tuple modification */
    void (*tuple_insert)(Relation rel, TupleTableSlot *slot, ...);
    void (*tuple_insert_speculative)(Relation rel, ...);
    void (*tuple_complete_speculative)(Relation rel, ...);
    void (*multi_insert)(Relation rel, TupleTableSlot **slots, int nslots, ...);
    TM_Result (*tuple_delete)(Relation rel, ItemPointer tid, ...);
    TM_Result (*tuple_update)(Relation rel, ItemPointer otid, ...);
    TM_Result (*tuple_lock)(Relation rel, ItemPointer tid, ...);

    /* DDL callbacks */
    void (*relation_set_new_filelocator)(Relation rel, ...);
    void (*relation_nontransactional_truncate)(Relation rel);
    void (*relation_copy_data)(Relation rel, const RelFileLocator *newrlocator);
    void (*relation_copy_for_cluster)(Relation rel, ...);
    void (*relation_vacuum)(Relation rel, VacuumParams *params, ...);
    bool (*scan_analyze_next_block)(TableScanDesc scan, ...);
    bool (*scan_analyze_next_tuple)(TableScanDesc scan, ...);

    /* Planner support */
    void (*relation_estimate_size)(Relation rel, int32 *attr_widths, ...);

    /* ... more callbacks */
} TableAmRoutine;

Hybrid Approach: Table AM + ProcessUtility_hook

A more practical pattern:

  1. Register a custom table AM (e.g., stream_am) that wraps heap
  2. Use ProcessUtility_hook to intercept CREATE TABLE ... USING stream_am
  3. When detected, perform stream table registration (catalog, CDC, etc.)
  4. The actual storage uses standard heap via delegation
-- User writes:
CREATE TABLE order_totals (region text, total numeric)
    USING stream_am
    WITH (query = 'SELECT region, SUM(amount) FROM orders GROUP BY region',
          schedule = '1m',
          refresh_mode = 'DIFFERENTIAL');

Problems with This Approach

  1. Column list is mandatory — CREATE TABLE ... USING requires explicit column definitions. Stream tables should derive columns from the query.
  2. Query in WITH clause — Storing a full SQL query in reloptions is hacky and has length limits.
  3. AS SELECT is one-shot — CREATE TABLE ... USING am AS SELECT is valid grammar (PG 12+), but the defining query only populates the table once at creation; nothing retains it to drive later refreshes.
  4. VACUUM, ANALYZE complexity — Must implement or delegate all maintenance callbacks.
  5. pg_dump compatibility — pg_dump would dump CREATE TABLE ... USING stream_am but not the associated metadata (query, schedule, etc.)

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE TABLE ... USING stream_am |
| Feels like a stream table | No — still looks like a regular table with options |
| Complexity | Very high |
| pg_dump | Broken — metadata in catalog tables won't be dumped |
| PG version | PG 12+ |
| Maintenance | High — table AM API changes between versions |

8. Foreign Data Wrapper Approach

How It Works

Foreign Data Wrappers (FDW) allow PostgreSQL to access external data sources via CREATE FOREIGN TABLE. An extension can register a custom FDW:

CREATE EXTENSION pg_stream;
CREATE SERVER stream_server FOREIGN DATA WRAPPER pgstream_fdw;

CREATE FOREIGN TABLE order_totals (region text, total numeric)
    SERVER stream_server
    OPTIONS (
        query 'SELECT region, SUM(amount) FROM orders GROUP BY region',
        schedule '1m',
        refresh_mode 'DIFFERENTIAL'
    );

FDW API

The FDW API provides callbacks for:

  • GetForeignRelSize — Estimate relation size for planning
  • GetForeignPaths — Generate access paths
  • GetForeignPlan — Create a plan node
  • BeginForeignScan — Start scan
  • IterateForeignScan — Get next tuple
  • EndForeignScan — End scan
  • AddForeignUpdateTargets (with the ForeignModify callbacks) — Support INSERT/UPDATE/DELETE (optional)

How It Could Work for Stream Tables

  1. Define a custom FDW (pgstream_fdw)
  2. The FDW's scan callbacks read from the underlying storage table
  3. ProcessUtility_hook intercepts CREATE FOREIGN TABLE ... SERVER stream_server to set up CDC, catalog entries, etc.
  4. A background worker handles refresh scheduling

Problems

  1. Foreign tables have restrictions — No indexes, only declarative (unenforced) constraints, limited trigger support, and no inheritance with regular tables. This severely limits usability.
  2. Query planner limitations — Foreign tables use a separate planning path with potentially worse plan quality.
  3. No MVCC — Foreign tables typically don't provide snapshot isolation semantics.
  4. User model confusion — "Foreign table" implies external data, not a derived view.
  5. EXPLAIN output — Shows "Foreign Scan" instead of "Seq Scan", confusing users.
  6. pg_dump — Foreign tables are dumped, but server/FDW setup may not transfer correctly.
  7. Two-step creation — Requires CREATE SERVER before CREATE FOREIGN TABLE.

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE FOREIGN TABLE with options |
| Feels like a stream table | No — foreign tables have different semantics |
| Index support | No — major limitation |
| Trigger support | Limited — row-level triggers only (PG 9.4+) |
| Complexity | Medium |
| PG version | PG 9.1+ |
| Maintenance | Low — FDW API is very stable |

Verdict: Not suitable. The restrictions on foreign tables (no indexes, unenforced constraints) make this impractical for stream tables that need to behave like regular tables.


9. Event Triggers

How It Works

Event triggers fire on DDL events at the database level:

CREATE EVENT TRIGGER my_trigger ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE', 'ALTER TABLE', 'DROP TABLE')
    EXECUTE FUNCTION my_handler();

Available events:

  • ddl_command_start — Before DDL execution (PG 9.3+)
  • ddl_command_end — After DDL execution (PG 9.3+)
  • sql_drop — When objects are dropped (PG 9.3+)
  • table_rewrite — When a table is rewritten (PG 9.5+)

Inside the Handler

CREATE FUNCTION my_handler() RETURNS event_trigger AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        -- obj.objid, obj.object_type, obj.command_tag, etc.
        IF obj.command_tag = 'CREATE TABLE' AND obj.object_type = 'table' THEN
            -- Check if this table has a special marker
            -- (e.g., a specific reloption or comment)
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Pattern: CREATE TABLE + Event Trigger

  1. User creates a table with a special comment or option:
    CREATE TABLE order_totals (region text, total numeric);
    COMMENT ON TABLE order_totals IS 'pgstream:query=SELECT region...;schedule=1m';
    
  2. Event trigger on ddl_command_end fires
  3. Handler parses the comment, detects stream table intent
  4. Handler registers the stream table in the catalog

Limitations

  1. Cannot modify the DDL — Event triggers observe DDL, they can't change what happened. On ddl_command_end, the table already exists.
  2. Cannot prevent DDL — On ddl_command_start, you can raise an error to prevent it, but you can't redirect it.
  3. Two-step process — User must CREATE TABLE AND then mark it somehow (comment, option, separate function call).
  4. No custom syntax — Event triggers watch existing DDL commands.
  5. pg_stream already uses this — For DDL tracking on upstream tables (see hooks.rs).

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — watches existing DDL only |
| Complexity | Low |
| Can transform DDL | No — observe only |
| PG version | PG 9.3+ |
| Maintenance | Very low |
| pg_stream usage | Already used for upstream DDL tracking |

10. TimescaleDB Continuous Aggregates Pattern

How It Works

TimescaleDB continuous aggregates (caggs) demonstrate the most sophisticated approach to custom DDL-like syntax in a PostgreSQL extension. Their evolution is instructive.

Phase 1: Pure Function API (early versions)

-- Create a view, then register it
CREATE VIEW daily_temps AS
SELECT time_bucket('1 day', time) AS day, AVG(temp)
FROM conditions GROUP BY 1;

SELECT add_continuous_aggregate_policy('daily_temps', ...);

Phase 2: CREATE MATERIALIZED VIEW WITH (introduced in TimescaleDB 2.0)

CREATE MATERIALIZED VIEW daily_temps
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day, device_id, AVG(temp)
FROM conditions
GROUP BY 1, 2;

How the Hook Chain Works

TimescaleDB's approach uses layered hooks:

  1. ProcessUtility_hook intercepts CREATE MATERIALIZED VIEW
  2. Checks reloptions for timescaledb.continuous in the WithClause
  3. If found:
    • Does NOT call standard ProcessUtility for the matview
    • Instead creates a regular hypertable (the materialization)
    • Creates an internal view (the user-facing query interface)
    • Registers refresh policies in the catalog
    • Sets up continuous aggregate metadata
  4. For REFRESH MATERIALIZED VIEW, intercepts and routes to their refresh engine
  5. For DROP MATERIALIZED VIEW, intercepts and cleans up all artifacts

The Magic: Reloptions as Extension Point

PostgreSQL's CREATE MATERIALIZED VIEW ... WITH (option = value) passes options as DefElem nodes in the parse tree. The parser treats these as generic key-value pairs — it does NOT validate the option names. This is the key insight: PostgreSQL's parser accepts arbitrary options in WITH clauses.

// In ProcessUtility_hook:
if (IsA(parsetree, CreateTableAsStmt)) {
    CreateTableAsStmt *stmt = (CreateTableAsStmt *) parsetree;
    if (stmt->objtype == OBJECT_MATVIEW) {
        // Check for our custom option in stmt->into->options
        bool is_continuous = false;
        ListCell *lc;
        foreach(lc, stmt->into->options) {
            DefElem *opt = (DefElem *) lfirst(lc);
            if (strcmp(opt->defname, "timescaledb.continuous") == 0) {
                is_continuous = true;
                break;
            }
        }
        if (is_continuous) {
            // Handle as continuous aggregate
            return;
        }
    }
}

Refresh Policies

-- Add a refresh policy (function call, not DDL)
SELECT add_continuous_aggregate_policy('daily_temps',
    start_offset => INTERVAL '1 month',
    end_offset => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 hour');

What pg_stream Could Learn

The TimescaleDB pattern for pg_stream would look like:

-- Option A: CREATE MATERIALIZED VIEW with custom option
CREATE MATERIALIZED VIEW order_totals
WITH (pgstream.stream = true, pgstream.schedule = '1m', pgstream.mode = 'DIFFERENTIAL')
AS SELECT region, SUM(amount) FROM orders GROUP BY region;

-- Option B: CREATE TABLE with custom option (less natural)
CREATE TABLE order_totals (region text, total numeric)
WITH (pgstream.stream = true);
-- Then separately: SELECT pgstream.set_query('order_totals', 'SELECT ...');

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | Good — CREATE MATERIALIZED VIEW ... WITH (pgstream.stream) looks natural |
| User experience | Very good — familiar DDL syntax with extension options |
| Complexity | High — must implement full ProcessUtility_hook chain |
| pg_dump | Partial — matview DDL is dumped, but custom metadata needs pg_dump extension or config tables |
| PG version | PG 9.3+ (matviews), PG 12+ (better option handling) |
| Maintenance | Medium — must track changes to matview creation internals |
| Shared preload | Required — ProcessUtility_hook needs shared_preload_libraries |

11. Citus Distributed DDL Pattern

How It Works

Citus (now part of Microsoft) demonstrates another approach to extending DDL behavior:

ProcessUtility_hook Chain

Citus has one of the most comprehensive ProcessUtility_hook implementations:

void multi_ProcessUtility(PlannedStmt *pstmt, ...) {
    // 1. Classify the DDL
    Node *parsetree = pstmt->utilityStmt;

    // 2. Check if it affects distributed tables
    if (IsA(parsetree, AlterTableStmt)) {
        // Propagate ALTER TABLE to all worker nodes
        PropagateAlterTable((AlterTableStmt *)parsetree, queryString);
    }

    // 3. Call standard handler (or skip for intercepted commands)
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, ...);
    else
        standard_ProcessUtility(pstmt, ...);

    // 4. Post-processing
    if (IsA(parsetree, CreateStmt)) {
        // Check if we should auto-distribute this table
    }
}

Table Distribution via Function Calls

Citus does NOT add custom DDL syntax. Distribution is done via function calls:

-- Create a regular table
CREATE TABLE events (id bigint, data jsonb, created_at timestamptz);

-- Distribute it (function call, not DDL)
SELECT create_distributed_table('events', 'id');

-- Or create a reference table
SELECT create_reference_table('lookups');

Columnar Storage via Table AM

Citus also provides columnar storage as a table AM:

CREATE TABLE analytics_data (...)
    USING columnar;

This uses the table AM API (PostgreSQL 12+) — see Section 7.

What Citus Teaches Us

  • Function calls for complex operations — create_distributed_table() is analogous to pgstream.create_stream_table().
  • ProcessUtility_hook for DDL propagation — Intercept standard DDL and add behavior.
  • Table AM for storage — Separate concern from distribution logic.
  • No custom syntax — Even with Microsoft's resources, Citus doesn't fork the parser.

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — uses function calls like pg_stream |
| Approach validated | Yes — Citus is used at massive scale with this pattern |
| Complexity | Medium (function API) to high (ProcessUtility_hook) |
| User adoption | Proven successful |
| Maintenance | Low for function API |

12. PostgreSQL 18 New Features

Relevant Extension Points in PG 18

PostgreSQL 18 (released 2025) includes several features relevant to this analysis:

12a. Virtual Generated Columns

PG 18 adds GENERATED ALWAYS AS (expr) VIRTUAL columns. Not directly relevant to stream tables, but shows PostgreSQL's willingness to expand CREATE TABLE syntax incrementally.

12b. Improved Table AM API

PG 18 refines the table AM API with better TOAST handling and improved parallel scan support. This makes custom table AMs slightly more practical.

12c. Enhanced Event Trigger Information

PG 18 expands pg_event_trigger_ddl_commands() with additional metadata fields, making event-trigger-based approaches more capable.

12d. pg_stat_io Improvements

Enhanced I/O statistics infrastructure that could benefit monitoring of stream table refresh operations.

12e. No New Parser Extension Points

PostgreSQL 18 does not add any parser extension mechanism. The parser remains monolithic and non-extensible. There have been occasional discussions on pgsql-hackers about parser hooks, but no concrete proposals have been accepted.

12f. No Custom DDL Extension Points

No new general-purpose DDL extension points beyond the existing hook system.

Looking Forward: Discussion on pgsql-hackers

There have been recurring threads on pgsql-hackers about:

  • Extension-defined SQL syntax — Rejected due to complexity and parser architecture
  • Loadable parser modules — Theoretical discussions, no implementation
  • Extension catalogs — Some interest in allowing extensions to register custom catalogs

None of these are implemented in PG 18.

Pros/Cons

| Aspect | Assessment |
|---|---|
| New syntax extension points | None in PG 18 |
| Table AM improvements | Minor — slightly easier to implement |
| Event trigger improvements | Minor — more metadata available |
| Parser extensibility | Not planned for any upcoming PG release |

13. COMMENT / OPTIONS Abuse Pattern

How It Works

Several extensions use table comments or reloptions as a "poor man's metadata" to tag tables with custom semantics.

Pattern 1: COMMENT-based

CREATE TABLE order_totals (region text, total numeric);
COMMENT ON TABLE order_totals IS '@pgstream {"query": "SELECT ...", "schedule": "1m"}';

An event trigger or background worker scans pg_description for tables with the @pgstream prefix and processes them.

Pattern 2: Reloptions-based

CREATE TABLE order_totals (region text, total numeric)
    WITH (fillfactor = 70, pgstream.stream = true);

Problem: PostgreSQL validates reloptions against a known list. You cannot add arbitrary options to WITH (...) without registering them. Extensions can register custom reloptions via add_reloption_kind() and the add_*_reloption() family in access/reloptions.h, but this is a relatively obscure API.

Pattern 3: GUC-based Tagging

-- Set a GUC that our ProcessUtility_hook reads
SET pgstream.next_create_is_stream = true;
SET pgstream.stream_query = 'SELECT region, SUM(amount) FROM orders GROUP BY region';

-- Hook intercepts this CREATE TABLE and registers it
CREATE TABLE order_totals (region text, total numeric);

-- Reset
RESET pgstream.next_create_is_stream;

This is extremely hacky but has been used in practice (some partitioning extensions used similar patterns before native partitioning).

Who Uses This?

  • pgmemcache — Uses comments to configure caching behavior
  • Some row-level security extensions — Comments to define policies
  • pg_partman — Uses a configuration table (not comments) but similar concept

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — abuses existing mechanisms |
| User experience | Poor — fragile, easy to break by editing comments |
| Complexity | Low |
| pg_dump | COMMENT is dumped — metadata survives pg_dump/restore |
| Robustness | Low — comments can be accidentally changed |
| PG version | All versions |

14. pg_ivm (Incremental View Maintenance) Pattern

How It Works

pg_ivm is the most directly comparable extension to pg_stream. It implements incremental view maintenance for PostgreSQL.

API Design

pg_ivm uses a pure function-call API:

-- Create an incrementally maintainable materialized view
SELECT create_immv('order_totals', 'SELECT region, SUM(amount) FROM orders GROUP BY region');

-- Refresh
SELECT refresh_immv('order_totals', true);

-- Drop
DROP TABLE order_totals;  -- Just drop the underlying table

Key function: create_immv(name, query) — Creates an "Incrementally Maintainable Materialized View" (IMMV).

Internal Implementation

  1. create_immv() is a SQL function (not a hook)
  2. It parses the query, creates a storage table, sets up triggers on source tables
  3. IMMVs are stored as regular tables with metadata in a custom catalog (pg_ivm_immv)
  4. Triggers on source tables automatically update the IMMV on DML

No ProcessUtility_hook

pg_ivm does not use ProcessUtility_hook. It operates entirely through:

  • SQL functions (create_immv, refresh_immv)
  • Row-level triggers for automatic maintenance
  • A custom catalog table for metadata

Why No Custom Syntax?

pg_ivm was developed as a proof-of-concept for PostgreSQL core IVM support. The authors explicitly chose function-call syntax to:

  1. Avoid shared_preload_libraries requirement (hooks need it)
  2. Keep the extension simple and portable
  3. Focus on the IVM algorithm, not the user interface

Eventually Merged to Core?

There was discussion about upstreaming IVM to PostgreSQL core. If merged, it would get proper syntax (CREATE INCREMENTAL MATERIALIZED VIEW). As an extension, it stays with function calls.

Relevance to pg_stream

pg_stream's current API (pgstream.create_stream_table()) follows the exact same pattern as pg_ivm. This is the established approach for IVM extensions.

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | No — function calls |
| Complexity | Low — simple function API |
| shared_preload_libraries | Not required for basic function API |
| pg_dump | No — function calls are not dumped; must use custom dump/restore |
| User experience | Moderate — familiar to pg_ivm users |
| Community acceptance | Established pattern for IVM extensions |

15. CREATE TABLE ... USING (Table Access Methods) Deep Dive

Full Syntax

CREATE TABLE tablename (
    column1 datatype,
    column2 datatype,
    ...
) USING access_method_name
  WITH (storage_parameter = value, ...);

How the Parser Handles USING

In gram.y:

CreateStmt: CREATE OptTemp TABLE ...
    OptTableAccessMethod OptWith ...

OptTableAccessMethod:
    USING name    { $$ = $2; }
    | /* empty */ { $$ = NULL; }
    ;

The USING clause sets CreateStmt->accessMethod to the access method name string.

How ProcessUtility Handles It

In DefineRelation() (src/backend/commands/tablecmds.c):

  1. If accessMethod is specified, look it up in pg_am
  2. Verify it's a table AM (not an index AM)
  3. Store the AM OID in pg_class.relam
  4. Use the AM's callbacks for all subsequent operations

Custom Reloptions with Table AMs

Table AMs can, in principle, define custom reloptions using the core relopt_parse_elt machinery — though the table AM API has no dedicated reloptions callback, so wiring custom options in requires extra plumbing:

static relopt_parse_elt stream_relopt_tab[] = {
    {"query", RELOPT_TYPE_STRING, offsetof(StreamOptions, query)},
    {"schedule", RELOPT_TYPE_STRING, offsetof(StreamOptions, schedule)},
    {"refresh_mode", RELOPT_TYPE_STRING, offsetof(StreamOptions, refresh_mode)},
};

This would allow:

CREATE TABLE order_totals (region text, total numeric)
    USING stream_heap
    WITH (query = 'SELECT ...', schedule = '1m', refresh_mode = 'DIFFERENTIAL');

Problems Specific to Stream Tables

  1. Column derivation — Stream tables derive their columns from the query. The plain CREATE TABLE ... USING form requires explicit column definitions, creating redundancy and potential inconsistency with the query stored in reloptions.

  2. AS SELECT caveat — Since PostgreSQL 12 the grammar does accept USING together with AS SELECT:

    -- Valid in PG 12+, but only a one-time snapshot:
    CREATE TABLE order_totals
        USING stream_heap
        AS SELECT region, SUM(amount) FROM orders GROUP BY region;
    
    The AM only receives the result tuples, never the defining query, so it cannot re-run the query for refreshes.
  3. Full AM implementation required — Even if you delegate to heap, you must implement all callbacks and handle edge cases.

  4. VACUUM/ANALYZE — Must properly delegate to heap for these to work.

  5. Replication — Logical replication assumes heap tuples; custom AMs may break.

Hybrid Practical Approach

If pursuing this route:

-- Step 1: Set default AM
SET default_table_access_method = 'stream_heap';

-- Step 2: Create with query in options
CREATE TABLE order_totals ()
    WITH (pgstream.query = 'SELECT region, SUM(amount) FROM orders GROUP BY region',
          pgstream.schedule = '1m');

-- ProcessUtility_hook would:
-- 1. Detect USING stream_heap (or detect our custom reloptions)
-- 2. Parse the query from options
-- 3. Derive columns from the query
-- 4. Create the actual table with proper columns using heap AM
-- 5. Register in pgstream catalog
-- 6. Set up CDC

Pros/Cons

| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE TABLE ... USING stream_heap WITH (...) |
| Column derivation | Not supported — must specify columns or use hook magic |
| Complexity | Very high |
| pg_dump | Good — CREATE TABLE ... USING is properly dumped |
| PG version | PG 12+ |
| Maintenance | High — AM API changes between versions |

16. Comparison Matrix

| Approach | Native Syntax | Complexity | pg_dump | PG Version | Maintenance | Recommended |
|---|---|---|---|---|---|---|
| Function API (current) | No | Low | No* | Any | Very Low | Yes |
| ProcessUtility_hook + MATVIEW WITH | Good | High | Partial | 9.3+ | Medium | Maybe |
| Raw parser fork | Perfect | Very High | No | Fork only | Very High | No |
| Table AM USING | Partial | Very High | Yes | 12+ | High | No |
| FDW FOREIGN TABLE | Partial | Medium | Yes | 9.1+ | Low | No |
| Event triggers alone | No | Low | No | 9.3+ | Low | No |
| COMMENT abuse | No | Low | Yes | Any | Low | No |
| GUC + CREATE TABLE hack | No | Medium | Partial | Any | Medium | No |
| TimescaleDB pattern (MATVIEW + WITH) | Good | High | Partial | 9.3+ | Medium | Best option |

* Custom pg_dump support can be added via pg_dump hook or wrapper script.


17. Recommendations for pg_stream

Current Approach: Function API (Keep and Enhance)

pg_stream's current approach (pgstream.create_stream_table('name', 'query', ...)) is:

  • Proven — Same pattern as pg_ivm, Citus, and many other extensions
  • Simple — No shared_preload_libraries required for basic usage
  • Maintainable — No hook chains to debug
  • Portable — Works on any PG version that supports pgrx

Enhancement opportunities:

-- Current
SELECT pgstream.create_stream_table('order_totals',
    'SELECT region, SUM(amount) FROM orders GROUP BY region', '1m');

-- Enhanced: CALL syntax (via a procedure wrapper) for a more DDL-like feel (PG 11+)
CALL pgstream.create_stream_table('order_totals',
    $$SELECT region, SUM(amount) FROM orders GROUP BY region$$, '1m');
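Since CALL only accepts procedures, not functions, the enhanced form needs a thin procedure wrapper; a sketch (the _proc name is hypothetical):

```sql
-- Hypothetical wrapper so CALL works; delegates to the existing function
CREATE PROCEDURE pgstream.create_stream_table_proc(
    stream_name text, query text, schedule text)
LANGUAGE plpgsql
AS $$
BEGIN
    PERFORM pgstream.create_stream_table(stream_name, query, schedule);
END;
$$;

CALL pgstream.create_stream_table_proc('order_totals',
    'SELECT region, SUM(amount) FROM orders GROUP BY region', '1m');
```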

Future Option: TimescaleDB-style Materialized View Integration

If user demand justifies the complexity, pg_stream could add a second creation path via ProcessUtility_hook:

-- New native-feeling syntax (requires shared_preload_libraries)
CREATE MATERIALIZED VIEW order_totals
WITH (pgstream.stream = true, pgstream.schedule = '1m')
AS SELECT region, SUM(amount) FROM orders GROUP BY region
WITH NO DATA;

-- Original function API still works (no hook needed)
SELECT pgstream.create_stream_table('order_totals',
    'SELECT region, SUM(amount) FROM orders GROUP BY region', '1m');

Implementation plan for hook-based approach:

  1. Register ProcessUtility_hook in _PG_init() (this path requires shared_preload_libraries)
  2. Intercept CREATE MATERIALIZED VIEW → Check for pgstream.stream option
  3. If found: parse options, call create_stream_table_impl() internally, create standard storage table instead of matview
  4. Intercept DROP MATERIALIZED VIEW → Check if target is a stream table → Clean up
  5. Intercept REFRESH MATERIALIZED VIEW → Route to stream table refresh engine
  6. Intercept ALTER MATERIALIZED VIEW → Route to stream table alter logic

Estimated complexity: ~800-1200 lines of Rust hook code + tests.

Rejected Approaches

  • Forking PostgreSQL for custom grammar — Maintenance cost is prohibitive
  • Table AM approach — Complexity without proportional benefit
  • FDW approach — Too many restrictions on foreign tables
  • COMMENT abuse — Fragile and poor UX

pg_dump / pg_restore Strategy

Regardless of approach, pg_dump is a challenge. Options:

  1. Custom dump/restore functions — pgstream.dump_config() and pgstream.restore_config()
  2. Migration script generation — pgstream.generate_migration() outputs SQL to recreate all stream tables
  3. Event trigger on restore — Detect when tables are restored and re-register them
  4. Sidecar file — Generate a companion SQL file alongside pg_dump
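Option 2 might produce output along these lines (a sketch; generate_migration and its output format are proposals from the list above, not an implemented API):

```sql
-- Possible output of: SELECT pgstream.generate_migration();
SELECT pgstream.create_stream_table(
    'order_totals',
    'SELECT region, SUM(amount) FROM orders GROUP BY region',
    '1m'
);
```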

Appendix A: Hook Registration in pgrx (Rust)

For reference, here's how ProcessUtility_hook registration works in pgrx:

use pgrx::prelude::*;
use pgrx::pg_sys;

static mut PREV_PROCESS_UTILITY_HOOK: pg_sys::ProcessUtility_hook_type = None;

#[pg_guard]
pub extern "C-unwind" fn my_process_utility(
    pstmt: *mut pg_sys::PlannedStmt,
    query_string: *const std::os::raw::c_char,
    read_only_tree: bool,
    context: pg_sys::ProcessUtilityContext,
    params: pg_sys::ParamListInfo,
    query_env: *mut pg_sys::QueryEnvironment,
    dest: *mut pg_sys::DestReceiver,
    qc: *mut pg_sys::QueryCompletion,
) {
    // SAFETY: pstmt is a valid pointer provided by PostgreSQL
    let stmt = unsafe { (*pstmt).utilityStmt };

    // Check if this is a CreateTableAsStmt (covers materialized views)
    if unsafe { pgrx::is_a(stmt, pg_sys::NodeTag::T_CreateTableAsStmt) } {
        // Check for our custom options...
    }

    // Chain to the previous hook or the standard handler
    unsafe {
        if let Some(prev) = PREV_PROCESS_UTILITY_HOOK {
            prev(pstmt, query_string, read_only_tree, context,
                 params, query_env, dest, qc);
        } else {
            pg_sys::standard_ProcessUtility(
                pstmt, query_string, read_only_tree, context,
                params, query_env, dest, qc);
        }
    }
}

pub fn register_hooks() {
    // Save the previous hook so we can chain, then install ours
    unsafe {
        PREV_PROCESS_UTILITY_HOOK = pg_sys::ProcessUtility_hook;
        pg_sys::ProcessUtility_hook = Some(my_process_utility);
    }
}

Appendix B: Key Source Files in PostgreSQL

| File | Purpose |
|---|---|
| src/backend/parser/gram.y | SQL grammar (~18,000 lines) |
| src/backend/parser/scan.l | Lexer/tokenizer |
| src/include/parser/kwlist.h | Keyword definitions |
| src/backend/tcop/utility.c | ProcessUtility() — DDL dispatcher |
| src/backend/commands/tablecmds.c | CREATE/ALTER/DROP TABLE implementation |
| src/backend/commands/createas.c | CREATE TABLE AS / CREATE MATERIALIZED VIEW AS |
| src/include/access/tableam.h | Table Access Method API |
| src/include/foreign/fdwapi.h | FDW API |
| src/backend/commands/event_trigger.c | Event trigger infrastructure |

Appendix C: References

  1. PostgreSQL Documentation — Table Access Method Interface
  2. PostgreSQL Documentation — Event Triggers
  3. PostgreSQL Documentation — Writing A Foreign Data Wrapper
  4. TimescaleDB Source — process_utility.c
  5. Citus Source — multi_utility.c
  6. pg_ivm Source — createas.c
  7. pgrx Documentation — Hooks
  8. PostgreSQL Wiki — CustomScanProviders