pg_stream
pg_stream is a PostgreSQL 18 extension that turns ordinary SQL views into
self-maintaining stream tables — no external processes, no sidecars, no
bespoke refresh pipelines. Just CREATE EXTENSION pg_stream and your views
stay fresh.
-- Declare a stream table — a view that maintains itself
SELECT pgstream.create_stream_table(
'active_orders',
'SELECT * FROM orders WHERE status = ''active''',
'30s'
);
-- Insert a row — the stream table updates automatically on the next refresh
INSERT INTO orders (id, status) VALUES (42, 'active');
SELECT count(*) FROM active_orders; -- 1
The problem with materialized views
PostgreSQL's materialized views are powerful but frustrating.
REFRESH MATERIALIZED VIEW re-runs the entire query from scratch, even if
only one row changed in a million-row table. Your choices are: burn CPU on
full recomputation, or accept stale data. Most teams end up building bespoke
refresh pipelines just to keep summary tables current.
What pg_stream does differently
pg_stream captures changes to your source tables and — on each refresh cycle — derives a delta query that processes only the changed rows and merges the result into the materialized table. One insert into a million-row source table? pg_stream touches exactly one row's worth of computation.
The approach is grounded in the DBSP differential dataflow framework (Budiu et al., 2022). Delta queries are derived automatically from your SQL's operator tree: joins produce the classic bilinear expansion, aggregates maintain auxiliary counters, and linear operators like filters pass deltas through unchanged.
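The bilinear expansion mentioned above can be sketched in a few lines of Python (illustrative only — pg_stream generates SQL delta queries, not Python; the multiset model and function names here are invented for exposition). Relations are multisets of tuples with signed multiplicities: +1 for an insert, -1 for a delete.

```python
# Illustrative sketch of the DBSP bilinear delta rule for a join.
# Not pg_stream internals: relations are Counters mapping tuple -> multiplicity.
from collections import Counter

def join(a, b, key_a, key_b):
    """Multiset equi-join: multiplicities multiply."""
    out = Counter()
    for ra, ma in a.items():
        for rb, mb in b.items():
            if ra[key_a] == rb[key_b]:
                out[ra + rb] += ma * mb
    return out

def delta_join(a, da, b, db, key_a, key_b):
    """Delta(A JOIN B) = dA JOIN B  +  A JOIN dB  +  dA JOIN dB."""
    out = Counter()
    for part in (join(da, b, key_a, key_b),
                 join(a, db, key_a, key_b),
                 join(da, db, key_a, key_b)):
        out.update(part)
    return out
```

The point of the expansion is that when `da` and `db` are small, each of the three terms touches work proportional to the change, never `|A| × |B|`.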
Key capabilities
| Feature | Description |
|---|---|
| Incremental refresh | Only changed rows are recomputed — never a full table scan |
| Cascading DAG | Stream tables that depend on stream tables propagate deltas downstream automatically |
| Demand-driven scheduling | Set a freshness interval on the views your app queries; upstream layers inherit the tightest schedule automatically |
| Hybrid CDC | Starts with lightweight row-level triggers; seamlessly transitions to WAL-based logical replication once available |
| Broad SQL support | JOINs, GROUP BY, DISTINCT, UNION/INTERSECT/EXCEPT, subqueries, CTEs (including WITH RECURSIVE), window functions, LATERAL, and more |
| Built-in observability | Monitoring views, refresh history, NOTIFY-based alerting |
| CloudNativePG-ready | Ships as an Image Volume extension image for Kubernetes deployments |
Demand-driven scheduling
With the default CALCULATED schedule mode, you only set an explicit refresh
interval on the stream tables your application actually queries. The system
propagates that cadence upward through the dependency graph: each upstream
stream table inherits the tightest schedule among its downstream dependents.
You declare freshness requirements where they matter — at the consumer — and
the entire pipeline adjusts without manual coordination.
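The inheritance rule can be sketched as a small graph traversal (a hedged illustration — the function, data shapes, and memoization are invented here, not pg_stream's catalog logic): a node with an explicit interval uses it; a CALCULATED node takes the smallest effective interval among its downstream dependents.

```python
# Illustrative sketch of CALCULATED schedule resolution.
# explicit: {table: interval_seconds or None}; None means CALCULATED.
# dependents: {table: [downstream tables that read from it]}
def effective_schedule(node, explicit, dependents, memo=None):
    memo = {} if memo is None else memo
    if node in memo:
        return memo[node]
    if explicit.get(node) is not None:
        memo[node] = explicit[node]           # explicit interval wins
        return memo[node]
    downstream = [effective_schedule(d, explicit, dependents, memo)
                  for d in dependents.get(node, [])]
    # Tightest (smallest) downstream interval; None if nothing downstream.
    memo[node] = min((s for s in downstream if s is not None), default=None)
    return memo[node]
```

With the tutorial's chain (`department_tree` → `department_stats` → `department_report` at 30s), both upstream tables resolve to 30 seconds.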
Hybrid change capture
pg_stream bootstraps with lightweight row-level triggers — no configuration
needed, works out of the box. Once the first refresh succeeds and
wal_level = logical is available, the system automatically transitions to
WAL-based logical replication for lower write-side overhead. The transition
is seamless: trigger → transitioning → WAL-only. If anything goes wrong, it
falls back to triggers.
Explore this documentation
- Getting Started — build a three-layer DAG from scratch in minutes
- SQL Reference — every function and option
- Architecture — how the engine works internally
- Configuration — GUC variables and tuning
Source & releases
Written in Rust using pgrx. Targets PostgreSQL 18. Apache 2.0 licensed.
- Repository: github.com/grove/pg-stream
- Install instructions: Installation
- Changelog: Changelog
- Roadmap: Roadmap
Getting Started with pg_stream
What is pg_stream?
pg_stream adds stream tables to PostgreSQL — tables that are defined by a SQL query and kept automatically up to date as the underlying data changes. Think of them as materialized views that refresh themselves, but smarter: instead of re-running the entire query on every refresh, pg_stream uses Incremental View Maintenance (IVM) to process only the rows that changed.
Traditional materialized views force a choice: either re-run the full query (expensive) or accept stale data. pg_stream eliminates this trade-off. When you insert a single row into a million-row table, pg_stream computes the effect of that one row on the query result — it doesn't touch the other 999,999.
How data flows
The key concept is that data flows downstream automatically — from your base tables through any chain of stream tables, without you writing a single line of orchestration code:
You write to base tables
│
▼
┌─────────────┐ triggers (or WAL) ┌─────────────────────┐
│ Base Tables │ ─────────────────────▶ │ Change Buffers │
│ (you write) │ │ (pgstream_changes.*) │
└─────────────┘ └──────────┬──────────┘
│
delta query (ΔQ) on refresh
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Stream Table A ◀── depends on base tables │
└──────────────────────────┬───────────────────────────────────┘
│ change captured, buffer written
▼
┌──────────────────────────────────────────────────────────────┐
│ Stream Table B ◀── depends on Stream Table A │
└──────────────────────────────────────────────────────────────┘
One write to a base table can ripple through an entire DAG of stream tables — each layer refreshed in the correct topological order, each doing only the work proportional to what actually changed.
- You write to your base tables normally — `INSERT`, `UPDATE`, `DELETE`
- Lightweight `AFTER` row-level triggers capture each change into a buffer, atomically in the same transaction. No polling, no logical replication slots required by default.
- On each refresh cycle, pg_stream derives a delta query (ΔQ) that reads only the buffered changes since the last refresh frontier
- The delta is merged into the stream table — only the affected rows are written
- If other stream tables depend on this one, they are scheduled next (topological order)
- Optionally: once `wal_level = logical` is available and the first refresh succeeds, pg_stream automatically transitions from triggers to WAL-based CDC (~2–15 μs write overhead vs triggers). The transition is seamless and transparent.
This tutorial walks through a concrete org-chart example so you can see this flow end to end, including a chain of stream tables that propagates changes automatically.
What you'll build
An employee org-chart system with three stream tables:
- `department_tree` — a recursive CTE that flattens a department hierarchy into paths like `Company > Engineering > Backend`
- `department_stats` — a join + aggregation over `department_tree` (a stream table!) that computes headcount and salary budget, with the full path included
- `department_report` — a further aggregation that rolls up stats to top-level departments
The chain departments → department_tree → department_stats → department_report demonstrates automatic downstream propagation: modify a department name in the base table and all three stream tables update automatically, in the right order, without any manual orchestration.
By the end you will have:
- Seen how stream tables are created, queried, and refreshed
- Watched a single `UPDATE` in a base table cascade through three layers of stream tables automatically
- Understood the two IVM strategies and when each applies
Prerequisites
- PostgreSQL 18.x with pg_stream installed (see INSTALL.md)
- `shared_preload_libraries = 'pg_stream'` in `postgresql.conf`
- `psql` or any SQL client
Step 1: Create the Base Tables
These are ordinary PostgreSQL tables — pg_stream doesn't require any special column types, annotations, or schema conventions. The only requirement is that tables have a primary key (pg_stream uses it internally to track which rows changed).
-- Department hierarchy (self-referencing tree)
CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
parent_id INT REFERENCES departments(id)
);
-- Employees belong to a department
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
department_id INT NOT NULL REFERENCES departments(id),
salary NUMERIC(10,2) NOT NULL
);
Now insert some data — a three-level department tree and a handful of employees:
-- Top-level
INSERT INTO departments (id, name, parent_id) VALUES
(1, 'Company', NULL);
-- Second level
INSERT INTO departments (id, name, parent_id) VALUES
(2, 'Engineering', 1),
(3, 'Sales', 1),
(4, 'Operations', 1);
-- Third level (under Engineering)
INSERT INTO departments (id, name, parent_id) VALUES
(5, 'Backend', 2),
(6, 'Frontend', 2),
(7, 'Platform', 2);
-- Employees
INSERT INTO employees (name, department_id, salary) VALUES
('Alice', 5, 120000), -- Backend
('Bob', 5, 115000), -- Backend
('Charlie', 6, 110000), -- Frontend
('Diana', 7, 130000), -- Platform
('Eve', 3, 95000), -- Sales
('Frank', 3, 90000), -- Sales
('Grace', 4, 100000); -- Operations
At this point these are plain tables with no triggers, no change tracking, nothing special. The department tree looks like this:
Company (1)
├── Engineering (2)
│ ├── Backend (5) — Alice, Bob
│ ├── Frontend (6) — Charlie
│ └── Platform (7) — Diana
├── Sales (3) — Eve, Frank
└── Operations (4) — Grace
Step 2: Create the First Stream Table — Recursive Hierarchy
Our first stream table flattens the department tree. For every department, it computes the full path from the root and the depth level. This uses WITH RECURSIVE — a SQL construct that can't be differentiated with simple algebraic rules (the recursion depends on itself), but pg_stream handles it using incremental strategies (semi-naive evaluation for inserts, Delete-and-Rederive for mixed changes) that we'll explain later.
SELECT pgstream.create_stream_table(
'department_tree',
$$
WITH RECURSIVE tree AS (
-- Base case: root departments (no parent)
SELECT id, name, parent_id, name AS path, 0 AS depth
FROM departments
WHERE parent_id IS NULL
UNION ALL
-- Recursive step: children join back to the tree
SELECT d.id, d.name, d.parent_id,
tree.path || ' > ' || d.name AS path,
tree.depth + 1
FROM departments d
JOIN tree ON d.parent_id = tree.id
)
SELECT id, name, parent_id, path, depth FROM tree
$$,
'30s',
'DIFFERENTIAL'
);
What just happened?
That single function call did a lot of work atomically (all in one transaction):
- Parsed the defining query into an operator tree — identifying the recursive CTE, the scan on `departments`, the join, the union
- Created a storage table called `department_tree` in the `public` schema — a real PostgreSQL heap table with columns matching the SELECT output, plus an internal `__pgs_row_id` column (a hash used to track individual rows)
- Installed CDC triggers on the `departments` table — lightweight `AFTER INSERT OR UPDATE OR DELETE` row-level triggers that will capture every future change
- Created a change buffer table in the `pgstream_changes` schema — this is where the triggers write captured changes
- Ran an initial full refresh — executed the recursive query against the current data and populated the storage table
- Registered the stream table in pg_stream's catalog with a 30-second refresh schedule
Query it immediately — it's already populated:
SELECT * FROM department_tree ORDER BY path;
Expected output:
 id |    name     | parent_id |               path               | depth
----+-------------+-----------+----------------------------------+-------
  1 | Company     |           | Company                          |     0
  2 | Engineering |         1 | Company > Engineering            |     1
  5 | Backend     |         2 | Company > Engineering > Backend  |     2
  6 | Frontend    |         2 | Company > Engineering > Frontend |     2
  7 | Platform    |         2 | Company > Engineering > Platform |     2
  4 | Operations  |         1 | Company > Operations             |     1
  3 | Sales       |         1 | Company > Sales                  |     1
This is a real PostgreSQL table — you can create indexes on it, join it in other queries, reference it in views, or even use it as a source for other stream tables. pg_stream keeps it in sync automatically.
Key insight: The recursive query that computes paths and depths would normally need to be re-run manually (or via `REFRESH MATERIALIZED VIEW`). With pg_stream, it stays fresh — any change to the `departments` table is automatically reflected within the schedule bound (30 seconds here).
Step 3: Chain Stream Tables — Build the Downstream Layers
Now create department_stats. The twist: instead of joining directly against departments, it joins against department_tree — the stream table we just created. This creates a chain: changes to departments update department_tree, whose changes then trigger department_stats to update.
This demonstrates how pg_stream builds a DAG — a directed acyclic graph of stream tables — and automatically schedules refreshes in topological order.
SELECT pgstream.create_stream_table(
'department_stats',
$$
SELECT
t.id AS department_id,
t.name AS department_name,
t.path AS full_path,
t.depth,
COUNT(e.id) AS headcount,
COALESCE(SUM(e.salary), 0) AS total_salary,
COALESCE(AVG(e.salary), 0) AS avg_salary
FROM department_tree t
LEFT JOIN employees e ON e.department_id = t.id
GROUP BY t.id, t.name, t.path, t.depth
$$,
'CALCULATED', -- inherit schedule from downstream; see explanation below
'DIFFERENTIAL'
);
What just happened — and why this one is different?
Like before, pg_stream parsed the query, created a storage table, and set up CDC. But department_stats depends on department_tree, not a base table — so no new triggers were installed. Instead, pg_stream registered department_tree as an upstream dependency in the DAG.
The schedule is 'CALCULATED', which means: "don't give this table its own schedule — inherit the tightest schedule of any downstream table that queries it". Since no downstream stream table exists yet, it will only refresh on demand for now.
The query has no recursive CTE, so pg_stream uses algebraic differentiation:
- Decomposed into operators: `Scan(department_tree)` → `LEFT JOIN` → `Scan(employees)` → `Aggregate(GROUP BY + COUNT/SUM/AVG)` → `Project`
- Derived a differentiation rule for each:
  - `Δ(Scan)` = read only change buffer rows (not the full table)
  - `Δ(LEFT JOIN)` = join change rows from one side against the full other side
  - `Δ(Aggregate)` = for COUNT/SUM/AVG, add or subtract per group — no rescan needed
- Composed these into a single delta query (ΔQ) that never touches unchanged rows
When one employee is inserted, the refresh reads one change buffer row, joins to find the department, and adjusts only that group's count and sum.
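That aggregate rule can be sketched in Python (an illustration only — pg_stream emits SQL delta queries; the state shape and function here are invented): signed change rows adjust each group's count and sum in place, and the average falls out of those two.

```python
# Sketch of incremental COUNT/SUM/AVG maintenance from signed change rows.
# state: {group: [count, total]}; changes: iterable of (group, value, sign).
def apply_agg_delta(state, changes):
    for group, value, sign in changes:          # sign: +1 insert, -1 delete
        cnt, tot = state.get(group, [0, 0.0])
        cnt += sign
        tot += sign * value
        if cnt == 0:
            state.pop(group, None)              # group disappeared entirely
        else:
            state[group] = [cnt, tot]
    # Derive (count, sum, avg) per surviving group.
    return {g: (c, t, t / c) for g, (c, t) in state.items()}
```

No group is ever rescanned: inserting one Frontend employee touches exactly the Frontend entry.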
Query it:
SELECT department_name, full_path, headcount, total_salary
FROM department_stats
ORDER BY full_path;
Expected output:
department_name | full_path | headcount | total_salary
-----------------+--------------------------------------------+-----------+--------------
Backend | Company > Engineering > Backend | 2 | 235000.00
Frontend | Company > Engineering > Frontend | 1 | 110000.00
Platform | Company > Engineering > Platform | 1 | 130000.00
Engineering | Company > Engineering | 0 | 0.00
Operations | Company > Operations | 1 | 100000.00
Sales | Company > Sales | 2 | 185000.00
Company | Company | 0 | 0.00
Notice that the full_path column comes from department_tree — this data already went through one layer of incremental maintenance before landing here.
Add a third layer: department_report
Now add a rollup that aggregates department_stats by top-level group (depth = 1):
SELECT pgstream.create_stream_table(
'department_report',
$$
SELECT
split_part(full_path, ' > ', 2) AS division,
SUM(headcount) AS total_headcount,
SUM(total_salary) AS total_payroll
FROM department_stats
WHERE depth >= 1
GROUP BY 1
$$,
'30s', -- this is the only explicit schedule; CALCULATED tables above inherit it
'DIFFERENTIAL'
);
The DAG is now:
departments (base) employees (base)
│ │
▼ │
department_tree ──────────┤
(DIFF, CALCULATED) │
│ ▼
└──────▶ department_stats
(DIFF, CALCULATED)
│
▼
department_report
(DIFF, 30s) ◀── only explicit schedule
department_report drives the whole pipeline. Because it has a 30-second schedule, pg_stream automatically propagates that cadence upstream: department_stats and department_tree will also be refreshed within 30 seconds of a base table change, in topological order, with no manual configuration.
Query the report:
SELECT * FROM department_report ORDER BY division;
division | total_headcount | total_payroll
-------------+-----------------+---------------
Engineering | 4 | 475000.00
Operations | 1 | 100000.00
Sales | 2 | 185000.00
Step 4: Watch a Change Cascade Through All Three Layers
This is the heart of pg_stream. We'll make four changes to the base tables and watch changes propagate automatically through the three-layer DAG — each layer doing only the minimum work.
The data flow pipeline (three layers)
Your SQL statement
│
▼
CDC trigger fires (same transaction)
Change buffer receives one row
│
▼
Background scheduler fires (or manual refresh_stream_table)
│
├──▶ [Layer 1] Refresh department_tree
│ delta query reads change buffer
│ MERGE touches only affected rows in department_tree
│ department_tree's own change buffer is updated
│
├──▶ [Layer 2] Refresh department_stats
│ delta query reads department_tree's change buffer
│ MERGE touches only affected department groups
│
└──▶ [Layer 3] Refresh department_report
delta query reads department_stats' change buffer
MERGE touches only affected division rows
All change buffers cleaned up ✓
All three layers run in a single scheduled pass, in topological order.
4a: A single INSERT ripples through all three layers
INSERT INTO employees (name, department_id, salary) VALUES
('Heidi', 6, 105000); -- New Frontend engineer
What happened immediately (in your transaction): The AFTER INSERT trigger on employees fired and wrote one row to pgstream_changes.changes_<employees_oid>. The row contains the new values, action type I, and the LSN at the time of insert. Your transaction committed normally — no blocking.
The stream tables don't know about Heidi yet. The change is in the buffer, waiting for the next refresh.
Refresh the whole pipeline in one call (or wait for the 30-second schedule):
-- refresh_stream_table cascades to dependent tables automatically
SELECT pgstream.refresh_stream_table('department_report');
What happened across the three layers:
| Layer | What ran | Rows touched |
|---|---|---|
| department_tree | No change — employees is not a source for this stream table | 0 |
| department_stats | Delta query: read 1 buffer row, join to Frontend, COUNT+1, SUM+105000 | 1 (Frontend group only) |
| department_report | Delta query: read 1 change from dept_stats, SUM += 1 headcount, += 105000 | 1 (Engineering row only) |
Check the result:
SELECT department_name, headcount, total_salary FROM department_stats
WHERE department_name = 'Frontend';
department_name | headcount | total_salary
-----------------+-----------+--------------
Frontend | 2 | 215000.00
The 6 other groups in department_stats were not touched at all.
Contrast with a standard materialized view:
`REFRESH MATERIALIZED VIEW` would re-scan all 8 employees, re-join with all 7 departments, re-aggregate, and update all 7 rows. With pg_stream, the work was proportional to the 1 changed row — across all three layers.
4b: A department change cascades through the whole DAG
Now change the departments table — the root of the entire chain:
INSERT INTO departments (id, name, parent_id) VALUES
(8, 'DevOps', 2); -- New team under Engineering
What happened: The CDC trigger on departments fired. The change buffer for departments has one new row. None of the stream tables know about it yet.
Refresh the whole pipeline:
SELECT pgstream.refresh_stream_table('department_report');
What happened across all three layers:
| Layer | What ran | Rows touched |
|---|---|---|
| department_tree | Semi-naive evaluation: base case finds new dept, recursive term computes its path. Result: 1 new row | 1 inserted |
| department_stats | Delta query reads new row from dept_tree's change buffer; DevOps has 0 employees so delta is minimal | 1 inserted (headcount=0) |
| department_report | Delta on Engineering row: headcount stays the same (DevOps has 0 employees) | 0 effective changes |
How the recursive CTE refresh works — unlike department_stats, recursive CTEs can't be algebraically differentiated (the recursion references itself). pg_stream uses incremental fixpoint strategies:
- INSERT → semi-naive evaluation: differentiate the base case, propagate the delta through the recursive term, stopping when no new rows are produced. Only new rows inserted.
- DELETE or UPDATE → Delete-and-Rederive (DRed): remove rows derived from deleted facts, re-derive rows that may have alternative derivation paths, handle cascades cleanly.
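The INSERT-only case can be sketched in Python (illustrative only — the function, data shapes, and names are invented for this tutorial's tree; pg_stream runs this as SQL): iterate the recursive step over the delta until it derives nothing new, never revisiting the existing tree.

```python
# Semi-naive sketch for the department-path recursion.
# tree: {id: (path, depth)} -- the already-materialized result.
# new_rows: [(id, name, parent_id)] -- freshly inserted departments.
def semi_naive_insert(tree, new_rows):
    pending = list(new_rows)
    derived = {}                                 # only the NEW rows
    changed = True
    while changed:                               # fixpoint over the delta
        changed = False
        for dept_id, name, parent in pending:
            if dept_id in derived:
                continue                         # already derived this round
            if parent is None:                   # new root department
                derived[dept_id] = (name, 0)
                changed = True
            elif parent in tree or parent in derived:
                ppath, pdepth = tree.get(parent) or derived[parent]
                derived[dept_id] = (ppath + ' > ' + name, pdepth + 1)
                changed = True
    return derived
```

Note the output is exactly the delta: inserting DevOps under Engineering derives one row, and the seven existing rows are never touched.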
SELECT id, name, depth, path FROM department_tree WHERE name = 'DevOps';
 id |  name  | depth |              path
----+--------+-------+--------------------------------
  8 | DevOps |     2 | Company > Engineering > DevOps
The recursive CTE automatically expanded to include the new department at the correct depth and path. One inserted row in departments produced one new row in the stream table.
4c: UPDATE — A single rename that cascades everywhere
Rename "Engineering" to "R&D":
UPDATE departments SET name = 'R&D' WHERE id = 2;
What happened in the change buffer: The CDC trigger captured the old row (name='Engineering') and the new row (name='R&D'). Both old and new values are stored so the delta can compute what to remove and what to add.
Refresh:
SELECT pgstream.refresh_stream_table('department_report');
What happened:
| Layer | Work done | Result |
|---|---|---|
| department_tree | DRed strategy: delete rows derived with old name, re-derive with new name. 5 rows updated (Engineering + 4 sub-teams) | Paths now say Company > R&D > … |
| department_stats | Delta reads 5 changed rows from dept_tree's buffer; updates full_path column for those 5 departments | 5 rows updated |
| department_report | Division name changed: "Engineering" row replaced by "R&D" row | 1 DELETE + 1 INSERT |
Query to verify the cascade:
SELECT name, path FROM department_tree WHERE path LIKE '%R&D%' ORDER BY depth;
name | path
---------+----------------------------------
R&D | Company > R&D
Backend | Company > R&D > Backend
DevOps | Company > R&D > DevOps
Frontend | Company > R&D > Frontend
Platform | Company > R&D > Platform
One UPDATE to a department name flowed through all three layers automatically — updating 5 + 5 + 2 rows across the chain.
4d: DELETE — Remove an employee
DELETE FROM employees WHERE name = 'Bob';
What happened: The AFTER DELETE trigger on employees fired, writing a change buffer row with action type D and Bob's old values (department_id=5, salary=115000). The delta query will use these old values to compute the correct aggregate adjustment — it knows to subtract 115000 from Backend's salary sum and decrement the count.
SELECT pgstream.refresh_stream_table('department_stats');
SELECT * FROM department_stats WHERE department_name = 'Backend';
department_id | department_name | headcount | total_salary | avg_salary
---------------+-----------------+-----------+--------------+------------
5 | Backend | 1 | 120000.00 | 120000.00
Headcount dropped from 2 → 1 and the salary aggregates updated. Again, only the Backend group was touched — the other 6 department rows were untouched.
Step 5: Automatic Scheduling — Let the DAG Drive Itself
In the examples above, we called refresh_stream_table() manually. In production you never need to do this. pg_stream runs a background scheduler that automatically refreshes stale tables — in topological order.
How schedules propagate
We gave department_report a '30s' schedule and the two upstream tables a 'CALCULATED' schedule. This is the recommended pattern:
department_tree (CALCULATED → inherits 30s from downstream)
│
department_stats (CALCULATED → inherits 30s from downstream)
│
department_report (30s — the only explicit schedule)
CALCULATED means: compute the tightest schedule across all downstream dependents. You declare freshness requirements at the tables your application queries — the system figures out how often each upstream table needs to refresh.
What the scheduler does every second
- Queries the catalog for stream tables past their freshness bound
- Sorts them topologically (upstream first) — `department_tree` refreshes before `department_stats`, which refreshes before `department_report`
- Runs each refresh (respecting `pg_stream.max_concurrent_refreshes`)
- Updates the last-refresh frontier
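The topological-ordering step is standard Kahn's algorithm; a Python sketch (table names from this tutorial, the scheduler's real implementation is not shown here):

```python
# Kahn's algorithm: order stream tables so every upstream refreshes first.
from collections import deque

def topo_order(deps):
    """deps: {table: set of upstream tables} -> refresh order, upstream first."""
    indeg = {t: len(us) for t, us in deps.items()}
    downstream = {}
    for t, us in deps.items():
        for u in us:
            downstream.setdefault(u, []).append(t)
    ready = deque(sorted(t for t, d in indeg.items() if d == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for d in downstream.get(t, []):
            indeg[d] -= 1                # one upstream dependency satisfied
            if indeg[d] == 0:
                ready.append(d)
    return order
```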
Monitoring
-- Current status of all stream tables
SELECT table_name, schedule, last_refresh_at, stale, refresh_mode
FROM pgstream.pgs_status();
table_name | schedule | last_refresh_at | stale | refresh_mode
--------------------+-----------+-------------------------+-------+--------------
department_tree | 30s | 2026-02-26 10:30:00.123 | f | DIFFERENTIAL
department_stats | 30s | 2026-02-26 10:30:00.456 | f | DIFFERENTIAL
department_report | 30s | 2026-02-26 10:30:00.789 | f | DIFFERENTIAL
-- Detailed performance stats
SELECT table_name, refresh_count, avg_refresh_ms, rows_affected_last
FROM pgstream.pg_stat_stream_tables;
Optional: WAL-based CDC
By default pg_stream uses triggers. If wal_level = logical is configured, set:
ALTER SYSTEM SET pg_stream.cdc_mode = 'auto';
SELECT pg_reload_conf();
pg_stream will automatically transition each stream table from trigger-based to WAL-based capture after the first successful refresh — reducing per-write overhead from ~50–200 μs to ~2–15 μs. The transition is transparent; your queries and the refresh schedule are unaffected.
Step 6: Understanding the Two IVM Strategies
You've now seen both strategies pg_stream uses for incremental view maintenance. Understanding when each applies helps you write efficient stream table queries.
Algebraic Differentiation (used by department_stats)
For queries composed of scans, filters, joins, and algebraic aggregates (COUNT, SUM, AVG), pg_stream can derive the IVM delta mathematically. The rules come from the theory of DBSP (Database Stream Processing):
| Operator | Delta Rule | Cost |
|---|---|---|
| Scan | Read only change buffer rows (not the full table) | O(changes) |
| Filter (WHERE) | Apply predicate to change rows | O(changes) |
| Join | Join change rows from one side against the full other side | O(changes × lookup) |
| Aggregate (COUNT/SUM/AVG) | Add or subtract deltas per group — no rescan | O(affected groups) |
| Project | Pass through | O(changes) |
The total cost is proportional to the number of changes, not the table size. For a million-row table with 10 changes, the delta query touches ~10 rows.
Incremental Strategies for Recursive CTEs (used by department_tree)
For recursive CTEs, pg_stream can't derive an algebraic delta because the recursion references itself. Instead it uses two complementary strategies, chosen automatically based on what changed:
Semi-naive evaluation (for INSERT-only changes):
- Differentiate the base case — find the new seed rows
- Propagate the delta through the recursive term, iterating until no new rows are produced
- The result is only the new rows created by the change — not the whole tree
Delete-and-Rederive (DRed) (for DELETE or UPDATE):
- Remove all rows derived from the old fact
- Re-derive rows that had the old fact as one of their derivation paths (they may still be reachable via other paths)
- Insert the newly derived rows under the new fact
Both strategies are more efficient than full recomputation — they work on the affected portion of the result set, not the entire recursive query. The MERGE only modifies rows that actually changed.
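A compressed DRed sketch over a reachability-style recursion, in Python (illustrative only: real DRed tracks which rows were overdeleted and rederives just those, rather than recomputing a fixpoint as this simplification does, but the resulting removed/kept split is the same):

```python
# Transitive closure as the recursive query; DRed outcome as two sets.
def closure(edges):
    """Naive fixpoint: all (a, d) pairs reachable via the edge set."""
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(reach):
            for c, d in edges:
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach

def dred_delete(edges, deleted):
    """Return (removed rows, rows kept via an alternative derivation).
    Simplified: phase 1 (overdelete) + phase 2 (rederive) are collapsed
    into a set difference against a fresh fixpoint."""
    old = closure(edges)
    new = closure(edges - deleted)
    return old - new, old & new
```

The interesting case is a row with two derivations: deleting edge (2, 3) below removes it, yet (1, 3) survives because it is also a base fact.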
When to use which?
You don't choose — pg_stream detects the strategy automatically based on the query structure:
| Query Pattern | Strategy | Performance |
|---|---|---|
| Scan + Filter + Join + algebraic Aggregate (COUNT/SUM/AVG) | Algebraic | Excellent — O(changes) |
| Non-recursive CTEs | Algebraic (inlined) | CTE body is differentiated inline |
| `MIN` / `MAX` aggregates | Semi-algebraic | Uses LEAST/GREATEST merge; per-group rescan only when an extremum is deleted |
| `STRING_AGG`, `ARRAY_AGG`, ordered-set aggregates | Group-rescan | Affected groups fully re-aggregated from source |
| `GROUPING SETS` / `CUBE` / `ROLLUP` | Algebraic (rewritten) | Auto-expanded to UNION ALL of GROUP BY queries; CUBE capped at 64 branches |
| Recursive CTEs (`WITH RECURSIVE`), INSERT | Semi-naive evaluation | O(new rows derived from the change) |
| Recursive CTEs (`WITH RECURSIVE`), DELETE/UPDATE | Delete-and-Rederive | Re-derives rows with alternative paths; O(affected subgraph) |
| Window functions | Partition recompute | Only affected partitions recomputed |
Step 7: Clean Up
When you're done experimenting, drop the stream tables. Drop dependents before their sources:
SELECT pgstream.drop_stream_table('department_report');
SELECT pgstream.drop_stream_table('department_stats');
SELECT pgstream.drop_stream_table('department_tree');
DROP TABLE employees;
DROP TABLE departments;
drop_stream_table atomically removes in a single transaction:
- The storage table (e.g., `public.department_stats`)
- CDC triggers on source tables (removed only if no other stream table references the same source)
- Change buffer tables in `pgstream_changes`
- Catalog entries in `pgstream.pgs_stream_tables`
Summary: What You Learned
| Concept | What you saw |
|---|---|
| Stream tables | Tables defined by a SQL query that stay automatically up to date |
| CDC triggers | Lightweight change capture in the same transaction — no logical replication or polling required |
| DAG scheduling | Stream tables can depend on other stream tables; refreshes run in topological order, schedules propagate upstream via CALCULATED mode |
| Algebraic IVM | Delta queries that process only changed rows — O(changes) regardless of table size |
| Semi-naive / DRed | Incremental strategies for WITH RECURSIVE — INSERT uses semi-naive, DELETE/UPDATE uses Delete-and-Rederive |
| Downstream propagation | A single base table write cascades through an entire chain of stream tables, automatically, in the right order |
| Hybrid CDC | Triggers by default; optional automatic transition to WAL-based capture for lower write-side overhead |
| Monitoring | pgs_status() and pg_stat_stream_tables for freshness, timing, and error history |
The key takeaway: you write to base tables — pg_stream does the rest. Data flows downstream automatically, each layer doing the minimum work proportional to what changed, in dependency order.
What's Next?
- SQL_REFERENCE.md — Full API reference for all functions, views, and configuration
- ARCHITECTURE.md — Deep dive into the system architecture and data flow
- DVM_OPERATORS.md — How each SQL operator is differentiated for incremental maintenance
- CONFIGURATION.md — GUC variables for tuning schedule, concurrency, and cleanup behavior
- What Happens on INSERT — Detailed trace of a single INSERT through the entire pipeline
- What Happens on UPDATE — How UPDATEs are split into D+I, group key changes, and net-effect computation
- What Happens on DELETE — Reference counting, group deletion, and INSERT+DELETE cancellation
- What Happens on TRUNCATE — Why TRUNCATE bypasses triggers and how to recover
SQL Reference
Complete reference for all SQL functions, views, and catalog tables provided by pgstream.
Functions
pgstream.create_stream_table
Create a new stream table.
pgstream.create_stream_table(
name text,
query text,
schedule text DEFAULT '1m',
refresh_mode text DEFAULT 'DIFFERENTIAL',
initialize bool DEFAULT true
) → void
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | text | — | Name of the stream table. May be schema-qualified (`myschema.my_st`). Defaults to the `public` schema. |
| query | text | — | The defining SQL query. Must be a valid SELECT statement using supported operators. |
| schedule | text | '1m' | Refresh schedule as a Prometheus/GNU-style duration string (e.g., '30s', '5m', '1h', '1h30m', '1d') or a cron expression (e.g., '*/5 * * * *', '@hourly'). Set to NULL for CALCULATED mode (inherits schedule from downstream dependents). |
| refresh_mode | text | 'DIFFERENTIAL' | 'FULL' (truncate and reload) or 'DIFFERENTIAL' (apply delta only). |
| initialize | bool | true | If true, populates the table immediately via a full refresh. If false, creates the table empty. |
Duration format:
| Unit | Suffix | Example |
|---|---|---|
| Seconds | s | '30s' |
| Minutes | m | '5m' |
| Hours | h | '2h' |
| Days | d | '1d' |
| Weeks | w | '1w' |
| Compound | — | '1h30m', '2m30s' |
Cron expression format:
schedule also accepts standard cron expressions for time-based scheduling. The scheduler refreshes the stream table when the cron schedule fires, rather than checking staleness.
| Format | Fields | Example | Description |
|---|---|---|---|
| 5-field | min hour dom mon dow | '*/5 * * * *' | Every 5 minutes |
| 6-field | sec min hour dom mon dow | '0 */5 * * * *' | Every 5 minutes at :00 seconds |
| Alias | — | '@hourly' | Every hour |
| Alias | — | '@daily' | Every day at midnight |
| Alias | — | '@weekly' | Every Sunday at midnight |
| Alias | — | '@monthly' | First of every month |
| Weekday range | — | '0 6 * * 1-5' | 6 AM on weekdays |
Note: Cron-scheduled stream tables do not participate in CALCULATED schedule resolution. The `stale` column in monitoring views returns `NULL` for cron-scheduled tables.
Example:
-- Duration-based: refresh when data is staler than 2 minutes
SELECT pgstream.create_stream_table(
'order_totals',
'SELECT region, SUM(amount) AS total FROM orders GROUP BY region',
'2m',
'DIFFERENTIAL'
);
-- Cron-based: refresh every hour
SELECT pgstream.create_stream_table(
'hourly_summary',
'SELECT date_trunc(''hour'', ts), COUNT(*) FROM events GROUP BY 1',
'@hourly',
'FULL'
);
-- Cron-based: refresh at 6 AM on weekdays
SELECT pgstream.create_stream_table(
'daily_report',
'SELECT region, SUM(revenue) AS total FROM sales GROUP BY region',
'0 6 * * 1-5',
'FULL'
);
Aggregate Examples:
All supported aggregate functions work in both FULL and DIFFERENTIAL modes:
-- Algebraic aggregates (fully differential — no rescan needed)
SELECT pgstream.create_stream_table(
'sales_summary',
'SELECT region, COUNT(*) AS cnt, SUM(amount) AS total, AVG(amount) AS avg_amount
FROM orders GROUP BY region',
'1m',
'DIFFERENTIAL'
);
-- Semi-algebraic aggregates (MIN/MAX)
SELECT pgstream.create_stream_table(
'salary_ranges',
'SELECT department, MIN(salary) AS min_sal, MAX(salary) AS max_sal
FROM employees GROUP BY department',
'2m',
'DIFFERENTIAL'
);
-- Group-rescan aggregates (BOOL_AND/OR, STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG,
-- BIT_AND, BIT_OR, BIT_XOR, JSON_OBJECT_AGG, JSONB_OBJECT_AGG,
-- STDDEV, STDDEV_POP, STDDEV_SAMP, VARIANCE, VAR_POP, VAR_SAMP,
-- MODE, PERCENTILE_CONT, PERCENTILE_DISC,
-- CORR, COVAR_POP, COVAR_SAMP, REGR_AVGX, REGR_AVGY,
-- REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE,
-- REGR_SXX, REGR_SXY, REGR_SYY, ANY_VALUE)
SELECT pgstream.create_stream_table(
'team_members',
'SELECT department,
STRING_AGG(name, '', '' ORDER BY name) AS members,
ARRAY_AGG(employee_id) AS member_ids,
BOOL_AND(active) AS all_active,
JSON_AGG(name) AS members_json
FROM employees
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- Bitwise aggregates
SELECT pgstream.create_stream_table(
'permission_summary',
'SELECT department,
BIT_OR(permissions) AS combined_perms,
BIT_AND(permissions) AS common_perms,
BIT_XOR(flags) AS xor_flags
FROM employees
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- JSON object aggregates
SELECT pgstream.create_stream_table(
'config_map',
'SELECT department,
JSON_OBJECT_AGG(setting_name, setting_value) AS settings,
JSONB_OBJECT_AGG(key, value) AS metadata
FROM config
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- Statistical aggregates
SELECT pgstream.create_stream_table(
'salary_stats',
'SELECT department,
STDDEV_POP(salary) AS sd_pop,
STDDEV_SAMP(salary) AS sd_samp,
VAR_POP(salary) AS var_pop,
VAR_SAMP(salary) AS var_samp
FROM employees
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- Ordered-set aggregates (MODE, PERCENTILE_CONT, PERCENTILE_DISC)
SELECT pgstream.create_stream_table(
'salary_percentiles',
'SELECT department,
MODE() WITHIN GROUP (ORDER BY grade) AS most_common_grade,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY salary) AS p90_salary
FROM employees
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- Regression / correlation aggregates (CORR, COVAR_*, REGR_*)
SELECT pgstream.create_stream_table(
'regression_stats',
'SELECT department,
CORR(salary, experience) AS sal_exp_corr,
COVAR_POP(salary, experience) AS covar_pop,
COVAR_SAMP(salary, experience) AS covar_samp,
REGR_SLOPE(salary, experience) AS slope,
REGR_INTERCEPT(salary, experience) AS intercept,
REGR_R2(salary, experience) AS r_squared,
REGR_COUNT(salary, experience) AS regr_n
FROM employees
GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- ANY_VALUE aggregate (PostgreSQL 16+)
SELECT pgstream.create_stream_table(
'dept_sample',
'SELECT department, ANY_VALUE(office_location) AS sample_office
FROM employees GROUP BY department',
'1m',
'DIFFERENTIAL'
);
-- FILTER clause on aggregates
SELECT pgstream.create_stream_table(
'order_metrics',
'SELECT region,
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = ''active'') AS active_count,
SUM(amount) FILTER (WHERE status = ''shipped'') AS shipped_total
FROM orders
GROUP BY region',
'1m',
'DIFFERENTIAL'
);
CTE Examples:
Non-recursive CTEs are fully supported in both FULL and DIFFERENTIAL modes:
-- Simple CTE
SELECT pgstream.create_stream_table(
'active_order_totals',
'WITH active_users AS (
SELECT id, name FROM users WHERE active = true
)
SELECT a.id, a.name, SUM(o.amount) AS total
FROM active_users a
JOIN orders o ON o.user_id = a.id
GROUP BY a.id, a.name',
'1m',
'DIFFERENTIAL'
);
-- Chained CTEs (CTE referencing another CTE)
SELECT pgstream.create_stream_table(
'top_regions',
'WITH regional AS (
SELECT region, SUM(amount) AS total FROM orders GROUP BY region
),
ranked AS (
SELECT region, total FROM regional WHERE total > 1000
)
SELECT * FROM ranked',
'2m',
'DIFFERENTIAL'
);
-- Multi-reference CTE (referenced twice in FROM — shared delta optimization)
SELECT pgstream.create_stream_table(
'self_compare',
'WITH totals AS (
SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id
)
SELECT t1.user_id, t1.total, t2.total AS next_total
FROM totals t1
JOIN totals t2 ON t1.user_id = t2.user_id + 1',
'1m',
'DIFFERENTIAL'
);
Recursive CTEs work with both FULL and DIFFERENTIAL modes:
-- Recursive CTE (hierarchy traversal)
SELECT pgstream.create_stream_table(
'category_tree',
'WITH RECURSIVE cat_tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c
JOIN cat_tree ct ON c.parent_id = ct.id
)
SELECT * FROM cat_tree',
'5m',
'FULL' -- FULL mode: standard re-execution
);
-- Recursive CTE with DIFFERENTIAL mode (incremental semi-naive / DRed)
SELECT pgstream.create_stream_table(
'org_chart',
'WITH RECURSIVE reports AS (
SELECT id, name, manager_id FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.id, e.name, e.manager_id
FROM employees e JOIN reports r ON e.manager_id = r.id
)
SELECT * FROM reports',
'2m',
'DIFFERENTIAL' -- Uses semi-naive, DRed, or recomputation (auto-selected)
);
Set Operation Examples:
INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL, UNION, and UNION ALL are supported:
-- INTERSECT: customers who placed orders in BOTH regions
SELECT pgstream.create_stream_table(
'bi_region_customers',
'SELECT customer_id FROM orders_east
INTERSECT
SELECT customer_id FROM orders_west',
'2m',
'DIFFERENTIAL'
);
-- INTERSECT ALL: preserves duplicates (bag semantics)
SELECT pgstream.create_stream_table(
'common_items',
'SELECT item_name FROM warehouse_a
INTERSECT ALL
SELECT item_name FROM warehouse_b',
'1m',
'DIFFERENTIAL'
);
-- EXCEPT: orders not yet shipped
SELECT pgstream.create_stream_table(
'unshipped_orders',
'SELECT order_id FROM orders
EXCEPT
SELECT order_id FROM shipments',
'1m',
'DIFFERENTIAL'
);
-- EXCEPT ALL: preserves duplicate counts (bag subtraction)
SELECT pgstream.create_stream_table(
'excess_inventory',
'SELECT sku FROM stock_received
EXCEPT ALL
SELECT sku FROM stock_shipped',
'5m',
'DIFFERENTIAL'
);
-- UNION: deduplicated merge of two sources
SELECT pgstream.create_stream_table(
'all_contacts',
'SELECT email FROM customers
UNION
SELECT email FROM newsletter_subscribers',
'5m',
'DIFFERENTIAL'
);
LATERAL Set-Returning Function Examples:
Set-returning functions (SRFs) in the FROM clause are supported in both FULL and DIFFERENTIAL modes. Common SRFs include jsonb_array_elements, jsonb_each, jsonb_each_text, and unnest:
-- Flatten JSONB arrays into rows
SELECT pgstream.create_stream_table(
'flat_children',
'SELECT p.id, child.value AS val
FROM parent_data p,
jsonb_array_elements(p.data->''children'') AS child',
'1m',
'DIFFERENTIAL'
);
-- Expand JSONB key-value pairs (multi-column SRF)
SELECT pgstream.create_stream_table(
'flat_properties',
'SELECT d.id, kv.key, kv.value
FROM documents d,
jsonb_each(d.metadata) AS kv',
'2m',
'DIFFERENTIAL'
);
-- Unnest arrays
SELECT pgstream.create_stream_table(
'flat_tags',
'SELECT t.id, tag.tag
FROM tagged_items t,
unnest(t.tags) AS tag(tag)',
'1m',
'DIFFERENTIAL'
);
-- SRF with WHERE filter
SELECT pgstream.create_stream_table(
'high_value_items',
'SELECT p.id, (e.value)::int AS amount
FROM products p,
jsonb_array_elements(p.prices) AS e
WHERE (e.value)::int > 100',
'5m',
'DIFFERENTIAL'
);
-- SRF combined with aggregation
SELECT pgstream.create_stream_table(
'element_counts',
'SELECT a.id, count(*) AS cnt
FROM arrays a,
jsonb_array_elements(a.data) AS e
GROUP BY a.id',
'1m',
'FULL'
);
LATERAL Subquery Examples:
LATERAL subqueries in the FROM clause are supported in both FULL and DIFFERENTIAL modes. Use them for top-N per group, correlated aggregation, and conditional expansion:
-- Top-N per group: latest item per order
SELECT pgstream.create_stream_table(
'latest_items',
'SELECT o.id, o.customer, latest.amount
FROM orders o,
LATERAL (
SELECT li.amount
FROM line_items li
WHERE li.order_id = o.id
ORDER BY li.created_at DESC
LIMIT 1
) AS latest',
'1m',
'DIFFERENTIAL'
);
-- Correlated aggregate
SELECT pgstream.create_stream_table(
'dept_summaries',
'SELECT d.id, d.name, stats.total, stats.cnt
FROM departments d,
LATERAL (
SELECT SUM(e.salary) AS total, COUNT(*) AS cnt
FROM employees e
WHERE e.dept_id = d.id
) AS stats',
'1m',
'DIFFERENTIAL'
);
-- LEFT JOIN LATERAL: preserve outer rows with NULLs when subquery returns no rows
SELECT pgstream.create_stream_table(
'dept_stats_all',
'SELECT d.id, d.name, stats.total
FROM departments d
LEFT JOIN LATERAL (
SELECT SUM(e.salary) AS total
FROM employees e
WHERE e.dept_id = d.id
) AS stats ON true',
'1m',
'DIFFERENTIAL'
);
WHERE Subquery Examples:
Subqueries in the WHERE clause are automatically transformed into semi-join, anti-join, or scalar subquery operators in the DVM operator tree:
-- EXISTS subquery: customers who have placed orders
SELECT pgstream.create_stream_table(
'active_customers',
'SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)',
'1m',
'DIFFERENTIAL'
);
-- NOT EXISTS: customers with no orders
SELECT pgstream.create_stream_table(
'inactive_customers',
'SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)',
'1m',
'DIFFERENTIAL'
);
-- IN subquery: products that have been ordered
SELECT pgstream.create_stream_table(
'ordered_products',
'SELECT p.id, p.name
FROM products p
WHERE p.id IN (SELECT product_id FROM order_items)',
'1m',
'DIFFERENTIAL'
);
-- NOT IN subquery: products never ordered
SELECT pgstream.create_stream_table(
'unordered_products',
'SELECT p.id, p.name
FROM products p
WHERE p.id NOT IN (SELECT product_id FROM order_items)',
'1m',
'DIFFERENTIAL'
);
-- Scalar subquery in SELECT list
SELECT pgstream.create_stream_table(
'products_with_max_price',
'SELECT p.id, p.name, (SELECT max(price) FROM products) AS max_price
FROM products p',
'1m',
'DIFFERENTIAL'
);
Notes:
- The defining query is parsed into an operator tree and validated for DVM support.
- Views as sources — views referenced in the defining query are automatically inlined as subqueries (auto-rewrite pass #0). CDC triggers are created on the underlying base tables. Nested views (view → view → table) are fully expanded. The user's original query is preserved in `original_query` for reinit and introspection. Materialized views are rejected in DIFFERENTIAL mode (use FULL mode or the underlying query directly). Foreign tables are also rejected in DIFFERENTIAL mode.
- CDC triggers and change buffer tables are created automatically for each source table.
- The ST is registered in the dependency DAG; cycles are rejected.
- Non-recursive CTEs are inlined as subqueries during parsing (Tier 1). Multi-reference CTEs share delta computation (Tier 2).
- Recursive CTEs in DIFFERENTIAL mode use three strategies, auto-selected per refresh: semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback when CTE columns don't match ST storage columns.
- LATERAL SRFs in DIFFERENTIAL mode use row-scoped recomputation: when a source row changes, only the SRF expansions for that row are re-evaluated.
- LATERAL subqueries in DIFFERENTIAL mode also use row-scoped recomputation: when an outer row changes, the correlated subquery is re-executed only for that row.
- WHERE subqueries (`EXISTS`, `IN`, scalar) are parsed into dedicated semi-join, anti-join, and scalar subquery operators with specialized delta computation. `ALL (subquery)` is the only subquery form that is currently rejected.
- ORDER BY is accepted but silently discarded — row order in the storage table is undefined (consistent with PostgreSQL's `CREATE MATERIALIZED VIEW` behavior). Apply ORDER BY when querying the stream table.
- LIMIT / OFFSET are rejected — stream tables materialize the full result set. Apply LIMIT when querying the stream table.
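Ordering and limiting therefore belong on reads of the stream table, not in its defining query — for example, against the order_totals table created above:

```sql
-- ORDER BY and LIMIT are applied at query time; the stream table itself
-- materializes the full, unordered result set
SELECT region, total
FROM order_totals
ORDER BY total DESC
LIMIT 10;
```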
pgstream.alter_stream_table
Alter properties of an existing stream table.
pgstream.alter_stream_table(
name text,
schedule text DEFAULT NULL,
refresh_mode text DEFAULT NULL,
status text DEFAULT NULL
) → void
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
name | text | — | Name of the stream table (schema-qualified or unqualified). |
schedule | text | NULL | New schedule as a duration string (e.g., '5m'). Pass NULL to leave unchanged. |
refresh_mode | text | NULL | New refresh mode ('FULL' or 'DIFFERENTIAL'). Pass NULL to leave unchanged. |
status | text | NULL | New status ('ACTIVE', 'SUSPENDED'). Pass NULL to leave unchanged. Resuming resets consecutive errors to 0. |
Examples:
-- Change schedule
SELECT pgstream.alter_stream_table('order_totals', schedule => '5m');
-- Switch to full refresh mode
SELECT pgstream.alter_stream_table('order_totals', refresh_mode => 'FULL');
-- Suspend a stream table
SELECT pgstream.alter_stream_table('order_totals', status => 'SUSPENDED');
-- Resume a suspended stream table
SELECT pgstream.alter_stream_table('order_totals', status => 'ACTIVE');
pgstream.drop_stream_table
Drop a stream table, removing the storage table and all catalog entries.
pgstream.drop_stream_table(name text) → void
Parameters:
| Parameter | Type | Description |
|---|---|---|
name | text | Name of the stream table to drop. |
Example:
SELECT pgstream.drop_stream_table('order_totals');
Notes:
- Drops the underlying storage table with `CASCADE`.
- Removes all catalog entries (metadata, dependencies, refresh history).
- Cleans up CDC triggers and change buffer tables for source tables that are no longer tracked by any ST.
pgstream.refresh_stream_table
Manually trigger a synchronous refresh of a stream table.
pgstream.refresh_stream_table(name text) → void
Parameters:
| Parameter | Type | Description |
|---|---|---|
name | text | Name of the stream table to refresh. |
Example:
SELECT pgstream.refresh_stream_table('order_totals');
Notes:
- Blocked if the ST is `SUSPENDED` — use `pgstream.alter_stream_table(... status => 'ACTIVE')` first.
- Uses an advisory lock to prevent concurrent refreshes of the same ST.
- For `DIFFERENTIAL` mode, generates and applies a delta query. For `FULL` mode, truncates and reloads.
- Records the refresh in `pgstream.pgs_refresh_history`.
pgstream.pgs_status
Get the status of all stream tables.
pgstream.pgs_status() → SETOF record(
name text,
status text,
refresh_mode text,
is_populated bool,
consecutive_errors int,
schedule text,
data_timestamp timestamptz,
staleness interval
)
Example:
SELECT * FROM pgstream.pgs_status();
| name | status | refresh_mode | is_populated | consecutive_errors | schedule | data_timestamp | staleness |
|---|---|---|---|---|---|---|---|
| public.order_totals | ACTIVE | DIFFERENTIAL | true | 0 | 5m | 2026-02-21 12:00:00+00 | 00:02:30 |
pgstream.st_refresh_stats
Return per-ST refresh statistics aggregated from the refresh history.
pgstream.st_refresh_stats() → SETOF record(
pgs_name text,
pgs_schema text,
status text,
refresh_mode text,
is_populated bool,
total_refreshes bigint,
successful_refreshes bigint,
failed_refreshes bigint,
total_rows_inserted bigint,
total_rows_deleted bigint,
avg_duration_ms float8,
last_refresh_action text,
last_refresh_status text,
last_refresh_at timestamptz,
staleness_secs float8,
stale bool
)
Example:
SELECT pgs_name, status, total_refreshes, avg_duration_ms, stale
FROM pgstream.st_refresh_stats();
pgstream.get_refresh_history
Return refresh history for a specific stream table.
pgstream.get_refresh_history(
name text,
max_rows int DEFAULT 20
) → SETOF record(
refresh_id bigint,
data_timestamp timestamptz,
start_time timestamptz,
end_time timestamptz,
action text,
status text,
rows_inserted bigint,
rows_deleted bigint,
duration_ms float8,
error_message text
)
Example:
SELECT action, status, rows_inserted, duration_ms
FROM pgstream.get_refresh_history('order_totals', 5);
pgstream.get_staleness
Get the current staleness in seconds for a specific stream table.
pgstream.get_staleness(name text) → float8
Returns NULL if the ST has never been refreshed.
Example:
SELECT pgstream.get_staleness('order_totals');
-- Returns: 12.345 (seconds since last refresh)
pgstream.slot_health
Check replication slot health for all tracked CDC slots.
pgstream.slot_health() → SETOF record(
slot_name text,
source_relid bigint,
active bool,
retained_wal_bytes bigint,
wal_status text
)
Example:
SELECT * FROM pgstream.slot_health();
| slot_name | source_relid | active | retained_wal_bytes | wal_status |
|---|---|---|---|---|
| pg_stream_slot_16384 | 16384 | false | 1048576 | reserved |
pgstream.check_cdc_health
Check CDC health for all tracked source tables. Returns per-source health status including the current CDC mode, replication slot details, estimated lag, and any alerts.
pgstream.check_cdc_health() → SETOF record(
source_relid bigint,
source_table text,
cdc_mode text,
slot_name text,
lag_bytes bigint,
confirmed_lsn text,
alert text
)
Columns:
| Column | Type | Description |
|---|---|---|
source_relid | bigint | OID of the tracked source table |
source_table | text | Resolved name of the source table (e.g., public.orders) |
cdc_mode | text | Current CDC mode: TRIGGER, TRANSITIONING, or WAL |
slot_name | text | Replication slot name (NULL for TRIGGER mode) |
lag_bytes | bigint | Replication slot lag in bytes (NULL for TRIGGER mode) |
confirmed_lsn | text | Last confirmed WAL position (NULL for TRIGGER mode) |
alert | text | Alert message if unhealthy (e.g., slot_lag_exceeds_threshold, replication_slot_missing) |
Example:
SELECT * FROM pgstream.check_cdc_health();
| source_relid | source_table | cdc_mode | slot_name | lag_bytes | confirmed_lsn | alert |
|---|---|---|---|---|---|---|
| 16384 | public.orders | TRIGGER | | | | |
| 16390 | public.events | WAL | pg_stream_slot_16390 | 524288 | 0/1A8B000 | |
pgstream.explain_st
Explain the DVM plan for a stream table's defining query.
pgstream.explain_st(name text) → SETOF record(
property text,
value text
)
Example:
SELECT * FROM pgstream.explain_st('order_totals');
| property | value |
|---|---|
| Defining Query | SELECT region, SUM(amount) ... |
| Refresh Mode | DIFFERENTIAL |
| Operator Tree | Aggregate → Scan(orders) |
| Source Tables | orders (oid=16384) |
| DVM Supported | Yes |
pgstream.pg_stream_hash
Compute a 64-bit xxHash row ID from a text value.
pgstream.pg_stream_hash(input text) → bigint
Marked IMMUTABLE, PARALLEL SAFE.
Example:
SELECT pgstream.pg_stream_hash('some_key');
-- Returns: 1234567890123456789
pgstream.pg_stream_hash_multi
Compute a row ID by hashing multiple text values (composite keys).
pgstream.pg_stream_hash_multi(inputs text[]) → bigint
Marked IMMUTABLE, PARALLEL SAFE. Uses `\x1E` (the ASCII record separator) between values and `\x00NULL\x00` for NULL entries.
Example:
SELECT pgstream.pg_stream_hash_multi(ARRAY['key1', 'key2']);
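The separator is what keeps composite keys unambiguous: two key arrays whose elements merely concatenate to the same string still produce different byte strings, and thus (barring a hash collision) different row IDs:

```sql
-- Both arrays concatenate to 'abc', but the \x1E separator yields the
-- distinct inputs 'ab' || \x1E || 'c' and 'a' || \x1E || 'bc'
SELECT pgstream.pg_stream_hash_multi(ARRAY['ab', 'c'])
    <> pgstream.pg_stream_hash_multi(ARRAY['a', 'bc']) AS distinct_hashes;
```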
Expression Support
pgstream's DVM parser supports a wide range of SQL expressions in defining queries. All expressions work in both FULL and DIFFERENTIAL modes.
Conditional Expressions
| Expression | Example | Notes |
|---|---|---|
CASE WHEN … THEN … ELSE … END | CASE WHEN amount > 100 THEN 'high' ELSE 'low' END | Searched CASE |
CASE <expr> WHEN … THEN … END | CASE status WHEN 1 THEN 'active' WHEN 2 THEN 'inactive' END | Simple CASE |
COALESCE(a, b, …) | COALESCE(phone, email, 'unknown') | Returns first non-NULL argument |
NULLIF(a, b) | NULLIF(divisor, 0) | Returns NULL if a = b |
GREATEST(a, b, …) | GREATEST(score1, score2, score3) | Returns the largest value |
LEAST(a, b, …) | LEAST(price, max_price) | Returns the smallest value |
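Conditional expressions compose freely in a defining query. A sketch using a hypothetical orders table (columns amount and discount are assumptions):

```sql
-- CASE and COALESCE in a DIFFERENTIAL stream table
SELECT pgstream.create_stream_table(
  'order_tiers',
  'SELECT id,
          CASE WHEN amount > 100 THEN ''high'' ELSE ''low'' END AS tier,
          COALESCE(discount, 0) AS discount
   FROM orders',
  '1m',
  'DIFFERENTIAL'
);
```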
Comparison Operators
| Expression | Example | Notes |
|---|---|---|
IN (list) | category IN ('A', 'B', 'C') | Also supports NOT IN |
BETWEEN a AND b | price BETWEEN 10 AND 100 | Also supports NOT BETWEEN |
IS DISTINCT FROM | a IS DISTINCT FROM b | NULL-safe inequality |
IS NOT DISTINCT FROM | a IS NOT DISTINCT FROM b | NULL-safe equality |
SIMILAR TO | name SIMILAR TO '%pattern%' | SQL regex matching |
op ANY(array) | id = ANY(ARRAY[1,2,3]) | Array comparison |
op ALL(array) | score > ALL(ARRAY[50,60]) | Array comparison |
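These operators can be combined in the WHERE clause of a defining query. A sketch against a hypothetical products table:

```sql
-- BETWEEN, IN, and IS DISTINCT FROM in one filter
SELECT pgstream.create_stream_table(
  'mid_range_products',
  'SELECT id, name, price
   FROM products
   WHERE price BETWEEN 10 AND 100
     AND category IN (''A'', ''B'')
     AND sku IS DISTINCT FROM legacy_sku',
  '1m',
  'DIFFERENTIAL'
);
```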
Boolean Tests
| Expression | Example |
|---|---|
IS TRUE | active IS TRUE |
IS NOT TRUE | flag IS NOT TRUE |
IS FALSE | completed IS FALSE |
IS NOT FALSE | valid IS NOT FALSE |
IS UNKNOWN | result IS UNKNOWN |
IS NOT UNKNOWN | flag IS NOT UNKNOWN |
SQL Value Functions
| Function | Description |
|---|---|
CURRENT_DATE | Current date |
CURRENT_TIME | Current time with time zone |
CURRENT_TIMESTAMP | Current date and time with time zone |
LOCALTIME | Current time without time zone |
LOCALTIMESTAMP | Current date and time without time zone |
CURRENT_ROLE | Current role name |
CURRENT_USER | Current user name |
SESSION_USER | Session user name |
CURRENT_CATALOG | Current database name |
CURRENT_SCHEMA | Current schema name |
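These value functions are STABLE, so a DIFFERENTIAL defining query that uses one is accepted with a volatility warning (see Volatile Function Detection below). A sketch against a hypothetical events table — note the result can drift when the date rolls over, which is exactly what the warning flags:

```sql
-- CURRENT_DATE is STABLE: accepted in DIFFERENTIAL mode, with a warning
SELECT pgstream.create_stream_table(
  'todays_events',
  'SELECT id, ts FROM events WHERE ts::date = CURRENT_DATE',
  '1m',
  'DIFFERENTIAL'
);
```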
Array and Row Expressions
| Expression | Example | Notes |
|---|---|---|
ARRAY[…] | ARRAY[1, 2, 3] | Array constructor |
ROW(…) | ROW(a, b, c) | Row constructor |
| Array subscript | arr[1] | Array element access |
| Field access | (rec).field | Composite type field access |
| Star indirection | (data).* | Expand all fields |
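Array constructors and subscripting can appear directly in the SELECT list. A sketch using a hypothetical points table (columns x, y, and a tags array are assumptions):

```sql
-- Array constructor plus element access in a defining query
SELECT pgstream.create_stream_table(
  'coord_pairs',
  'SELECT id, ARRAY[x, y] AS coords, tags[1] AS primary_tag
   FROM points',
  '1m',
  'DIFFERENTIAL'
);
```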
Subquery Expressions
Subqueries are supported in the WHERE clause and SELECT list. They are parsed into dedicated DVM operators with specialized delta computation for incremental maintenance.
| Expression | Example | DVM Operator |
|---|---|---|
EXISTS (subquery) | WHERE EXISTS (SELECT 1 FROM orders WHERE orders.cid = c.id) | Semi-Join |
NOT EXISTS (subquery) | WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.cid = c.id) | Anti-Join |
IN (subquery) | WHERE id IN (SELECT product_id FROM order_items) | Semi-Join (rewritten as equality) |
NOT IN (subquery) | WHERE id NOT IN (SELECT product_id FROM order_items) | Anti-Join |
| Scalar subquery (SELECT) | SELECT (SELECT max(price) FROM products) AS max_p | Scalar Subquery |
Notes:
- `EXISTS` and `IN (subquery)` in the `WHERE` clause are transformed into semi-join operators. `NOT EXISTS` and `NOT IN (subquery)` become anti-join operators.
- Multiple subqueries in the same `WHERE` clause are supported when combined with `AND`. Subqueries combined with `OR` are also supported — they are automatically rewritten into a `UNION` of separate filtered queries.
- Scalar subqueries in the `SELECT` list are supported as long as they return exactly one row and one column.
- `ALL (subquery)` is supported — it is automatically rewritten to an anti-join via `NOT EXISTS` with the negated condition.
Auto-Rewrite Pipeline
pg_stream transparently rewrites certain SQL constructs before parsing. These rewrites are applied automatically and require no user action:
| Order | Trigger | Rewrite |
|---|---|---|
| #0 | View references in FROM | Inline view body as subquery |
| #1 | DISTINCT ON (expr) | Convert to ROW_NUMBER() OVER (PARTITION BY expr ORDER BY ...) = 1 subquery |
| #2 | GROUPING SETS / CUBE / ROLLUP | Decompose into UNION ALL of separate GROUP BY queries |
| #3 | Scalar subquery in WHERE | Convert to CROSS JOIN with inline view |
| #4 | EXISTS/IN inside OR | Split into UNION of separate filtered queries |
| #5 | Multiple PARTITION BY clauses | Split into joined subqueries, one per distinct partitioning |
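Rewrite #1, for example, means you can write DISTINCT ON directly and pg_stream converts it behind the scenes. A sketch against a hypothetical readings table:

```sql
-- Latest reading per sensor: DISTINCT ON is auto-rewritten to a
-- ROW_NUMBER() OVER (PARTITION BY sensor_id ORDER BY ts DESC) = 1 subquery
SELECT pgstream.create_stream_table(
  'latest_readings',
  'SELECT DISTINCT ON (sensor_id) sensor_id, ts, value
   FROM readings
   ORDER BY sensor_id, ts DESC',
  '1m',
  'DIFFERENTIAL'
);
```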
HAVING Clause
HAVING is fully supported. The filter predicate is applied on top of the aggregate delta computation — groups that pass the HAVING condition are included in the stream table.
SELECT pgstream.create_stream_table('big_departments',
'SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department HAVING COUNT(*) > 10',
'1m', 'DIFFERENTIAL');
Tables Without Primary Keys (Keyless Tables)
Tables without a primary key can be used as sources. pg_stream generates a content-based row identity
by hashing all column values using pg_stream_hash_multi(). This allows DIFFERENTIAL mode to work,
though at the cost of being unable to distinguish truly duplicate rows (rows with identical values in all columns).
-- No primary key — pg_stream uses content hashing for row identity
CREATE TABLE events (ts TIMESTAMPTZ, payload JSONB);
SELECT pgstream.create_stream_table('event_summary',
'SELECT payload->>''type'' AS event_type, COUNT(*) FROM events GROUP BY 1',
'1m', 'DIFFERENTIAL');
Volatile Function Detection
pg_stream checks all functions and operators in the defining query against pg_proc.provolatile:
- VOLATILE functions (e.g., `random()`, `clock_timestamp()`, `gen_random_uuid()`) are rejected in DIFFERENTIAL mode because they produce different results on each evaluation, breaking delta correctness.
- VOLATILE operators — custom operators backed by volatile functions are also detected. The check resolves the operator's implementation function via `pg_operator.oprcode` and checks its volatility in `pg_proc`.
- STABLE functions (e.g., `now()`, `current_setting()`) produce a warning — they are consistent within a single transaction but may differ between refreshes.
- IMMUTABLE functions are always safe and produce no warnings.
FULL mode accepts all volatility classes since it re-evaluates the entire query each time.
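Concretely, a DIFFERENTIAL defining query containing a VOLATILE function is rejected at creation time (the exact error wording below is illustrative, not verbatim):

```sql
-- Rejected: random() is VOLATILE, so a delta could never be replayed consistently
SELECT pgstream.create_stream_table(
  'sampled_orders',
  'SELECT id, amount FROM orders WHERE random() < 0.1',
  '1m',
  'DIFFERENTIAL'
);
-- ERROR (wording illustrative): volatile function random() is not allowed
-- in DIFFERENTIAL mode. FULL mode accepts it, since the whole query is re-run.
```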
COLLATE Expressions
COLLATE clauses on expressions are supported:
SELECT pgstream.create_stream_table('sorted_names',
'SELECT name COLLATE "C" AS c_name FROM users',
'1m', 'DIFFERENTIAL');
IS JSON Predicate (PostgreSQL 16+)
The IS JSON predicate validates whether a value is valid JSON. All variants are supported:
-- Filter rows with valid JSON
SELECT pgstream.create_stream_table('valid_json_events',
'SELECT id, payload FROM events WHERE payload::text IS JSON',
'1m', 'DIFFERENTIAL');
-- Type-specific checks
SELECT pgstream.create_stream_table('json_objects_only',
'SELECT id, data IS JSON OBJECT AS is_obj,
data IS JSON ARRAY AS is_arr,
data IS JSON SCALAR AS is_scalar
FROM json_data',
'1m', 'FULL');
Supported variants: IS JSON, IS JSON OBJECT, IS JSON ARRAY, IS JSON SCALAR, IS NOT JSON (all forms), WITH UNIQUE KEYS.
SQL/JSON Constructors (PostgreSQL 16+)
SQL-standard JSON constructor functions are supported in both FULL and DIFFERENTIAL modes:
-- JSON_OBJECT: construct a JSON object from key-value pairs
SELECT pgstream.create_stream_table('user_json',
'SELECT id, JSON_OBJECT(''name'' : name, ''age'' : age) AS data FROM users',
'1m', 'DIFFERENTIAL');
-- JSON_ARRAY: construct a JSON array from values
SELECT pgstream.create_stream_table('value_arrays',
'SELECT id, JSON_ARRAY(a, b, c) AS arr FROM measurements',
'1m', 'FULL');
-- JSON(): parse a text value as JSON
-- JSON_SCALAR(): wrap a scalar value as JSON
-- JSON_SERIALIZE(): serialize a JSON value to text
Note:
`JSON_ARRAYAGG()` and `JSON_OBJECTAGG()` are SQL-standard aggregate functions fully recognized by the DVM engine. In DIFFERENTIAL mode, they use the group-rescan strategy (affected groups are re-aggregated from source data). The full deparsed SQL is preserved to handle the special `key : value`, `ABSENT ON NULL`, `ORDER BY`, and `RETURNING` clause syntax.
JSON_TABLE (PostgreSQL 17+)
JSON_TABLE() generates a relational table from JSON data. It is supported in the FROM clause in both FULL and DIFFERENTIAL modes. Internally, it is modeled as a LateralFunction.
-- Extract structured data from a JSON column
SELECT pgstream.create_stream_table('user_phones',
$$SELECT u.id, j.phone_type, j.phone_number
FROM users u,
JSON_TABLE(u.contact_info, '$.phones[*]'
COLUMNS (
phone_type TEXT PATH '$.type',
phone_number TEXT PATH '$.number'
)
) AS j$$,
'1m', 'DIFFERENTIAL');
Supported column types:
- Regular columns — `name TYPE PATH '$.path'` (with optional `ON ERROR` / `ON EMPTY` behaviors)
- EXISTS columns — `name TYPE EXISTS PATH '$.path'`
- Formatted columns — `name TYPE FORMAT JSON PATH '$.path'`
- Nested columns — `NESTED PATH '$.path' COLUMNS (...)`
The PASSING clause is also supported for passing named variables to path expressions.
Unsupported Expression Types
The following are rejected with clear error messages rather than producing broken SQL:
| Expression | Error Behavior | Suggested Rewrite |
|---|---|---|
TABLESAMPLE | Rejected — stream tables materialize the complete result set | Use WHERE random() < 0.1 in FULL mode if sampling is needed (random() is VOLATILE, so DIFFERENTIAL rejects it) |
| Window functions in expressions | Rejected — e.g., CASE WHEN ROW_NUMBER() OVER (...) ... | Move window function to a separate column |
FOR UPDATE / FOR SHARE | Rejected — stream tables do not support row-level locking | Remove the locking clause |
| Unknown node types | Rejected with type information | — |
Restrictions & Interoperability
Stream tables are standard PostgreSQL heap tables stored in the pgstream schema with an additional __pgs_row_id BIGINT PRIMARY KEY column managed by the refresh engine. This section describes what you can and cannot do with them.
Referencing Other Stream Tables
Stream tables can reference other stream tables in their defining query. This creates a dependency edge in the internal DAG, and the scheduler refreshes upstream tables before downstream ones. Cycles are detected and rejected at creation time.
-- ST1 reads from a base table
SELECT pgstream.create_stream_table('order_totals',
'SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id',
'1m', 'DIFFERENTIAL');
-- ST2 reads from ST1
SELECT pgstream.create_stream_table('big_customers',
'SELECT customer_id, total FROM pgstream.order_totals WHERE total > 1000',
'1m', 'DIFFERENTIAL');
Views as Sources in Defining Queries
PostgreSQL views can be used as source tables in a stream table's defining query. Views are automatically inlined — replaced with their underlying SELECT definition as subqueries — so CDC triggers land on the actual base tables.
CREATE VIEW active_orders AS
SELECT * FROM orders WHERE status = 'active';
-- This works in DIFFERENTIAL mode:
SELECT pgstream.create_stream_table('order_summary',
'SELECT customer_id, COUNT(*) FROM active_orders GROUP BY customer_id',
'1m', 'DIFFERENTIAL');
-- Internally, 'active_orders' is replaced with:
-- (SELECT ... FROM orders WHERE status = 'active') AS active_orders
Nested views (view → view → table) are fully expanded via a fixpoint loop. Column-renaming views (CREATE VIEW v(a, b) AS ...) work correctly — pg_get_viewdef() produces the proper column aliases.
When a view is inlined, the user's original SQL is stored in the original_query catalog column for reinit and introspection. The defining_query column contains the expanded (post-inlining) form.
DDL hooks: CREATE OR REPLACE VIEW on a view that was inlined into a stream table marks that ST for reinit. DROP VIEW sets affected STs to ERROR status.
Materialized views are rejected in DIFFERENTIAL mode — their stale-snapshot semantics prevent CDC triggers from tracking changes. Use the underlying query directly, or switch to FULL mode. In FULL mode, materialized views are allowed (no CDC needed).
Foreign tables are rejected in DIFFERENTIAL mode — row-level triggers cannot be created on foreign tables. Use FULL mode instead.
Partitioned Tables as Sources
Partitioned tables are fully supported as source tables in both FULL and DIFFERENTIAL modes. CDC triggers are installed on the partitioned parent table, and PostgreSQL 13+ ensures the trigger fires for all DML routed to child partitions. The change buffer uses the parent table's OID (pgstream_changes.changes_<parent_oid>).
CREATE TABLE orders (
id INT, region TEXT, amount NUMERIC
) PARTITION BY LIST (region);
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU');
-- Works — inserts into any partition are captured:
SELECT pgstream.create_stream_table('order_summary',
'SELECT region, SUM(amount) FROM orders GROUP BY region',
'1m', 'DIFFERENTIAL');
Note: pg_stream targets PostgreSQL 18. On PostgreSQL 12 or earlier (not supported), parent triggers do not fire for partition-routed rows, which would cause silent data loss.
Logical Replication Targets
Tables that receive data via logical replication require special consideration. Changes arriving via replication do not fire normal row-level triggers, which means CDC triggers will miss those changes.
pg_stream emits a WARNING at stream table creation time if any source table is detected as a logical replication target (via pg_subscription_rel).
Workarounds:
- Use cdc_mode = 'wal' for WAL-based CDC that captures all changes regardless of origin.
- Use FULL refresh mode, which recomputes entirely from the current table state.
- Set a frequent refresh schedule with FULL mode to limit staleness.
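For the first workaround, a minimal sketch (replicated_orders is a hypothetical subscription target table; WAL-based CDC requires wal_level = logical, see pg_stream.cdc_mode):

```sql
-- Sketch: capture replication-applied changes via WAL-based CDC.
SET pg_stream.cdc_mode = 'wal';  -- requires wal_level = logical

SELECT pgstream.create_stream_table('replica_summary',
    'SELECT region, COUNT(*) FROM replicated_orders GROUP BY region',
    '1m', 'DIFFERENTIAL');
```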
Views on Stream Tables
PostgreSQL views can reference stream tables. The view reflects the data as of the most recent refresh.
CREATE VIEW top_customers AS
SELECT customer_id, total
FROM pgstream.order_totals
WHERE total > 500
ORDER BY total DESC;
Materialized Views on Stream Tables
Materialized views can reference stream tables, though this is typically redundant (both are physical snapshots of a query). The materialized view requires its own REFRESH MATERIALIZED VIEW — it does not auto-refresh when the stream table refreshes.
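For example (a sketch; frozen_totals is a hypothetical name, order_totals is the stream table from earlier examples):

```sql
-- A materialized view snapshotting a stream table:
CREATE MATERIALIZED VIEW frozen_totals AS
SELECT customer_id, total
FROM pgstream.order_totals;

-- Must be refreshed explicitly; it does not follow the stream table's schedule:
REFRESH MATERIALIZED VIEW frozen_totals;
```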
Logical Replication of Stream Tables
Stream tables can be published for logical replication like any ordinary table:
-- On publisher
CREATE PUBLICATION my_pub FOR TABLE pgstream.order_totals;
-- On subscriber
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=... dbname=...'
PUBLICATION my_pub;
Caveats:
- The __pgs_row_id column is replicated (it is the primary key), which is an internal implementation detail.
- The subscriber receives materialized data, not the defining query. Refreshes on the publisher propagate as normal DML via logical replication.
- Do not install pg_stream on the subscriber and attempt to refresh the replicated table — it will have no CDC triggers or catalog entries.
- The internal change buffer tables (pgstream_changes.changes_<oid>) and catalog tables are not published by default; subscribers only receive the final output.
Known Delta Computation Limitations
The following edge cases produce incorrect delta results in DIFFERENTIAL mode under specific data mutation patterns. They have no effect on FULL mode.
JOIN Key Column Change + Simultaneous Right-Side Delete
When a row's join key column is updated in the same refresh cycle as the joined-side row is deleted, the delta query may fail to emit the required DELETE from the stream table:
-- Stream table joining orders with customers
SELECT pgstream.create_stream_table('order_details',
'SELECT o.id, c.name FROM orders o JOIN customers c ON o.cust_id = c.id',
'1m', 'DIFFERENTIAL');
-- Scenario that exposes the limitation:
-- In the same transaction (or same refresh interval):
UPDATE orders SET cust_id = 5 WHERE cust_id = 3; -- key change
DELETE FROM customers WHERE id = 3; -- old join partner deleted
-- The delta for the now-stale (orders.cust_id=3, customers.id=3) join result
-- may not be emitted as a DELETE, leaving a stale row in the stream table
-- until the next full refresh cycle.
Root cause: The JOIN delta query reads current_right (customers) after all changes are applied.
When customer 3 is deleted before the delta runs, the DELETE half of the join cannot find its join
partner and is silently dropped.
Mitigations:
- Adaptive FULL fallback (default): when the scheduler detects a high change volume, it switches to a full recompute, which corrects any stale rows. The threshold is configurable via pg_stream.differential_max_change_ratio.
- Avoid co-locating key-changing UPDATEs and DELETEs in the same refresh interval; stagger changes across multiple refresh cycles.
- Use FULL mode for stream tables where join key changes and right-side deletes are expected to co-occur frequently.
CUBE/ROLLUP Expansion Limit
CUBE(a, b, c, ..., n) on N columns generates $2^N$ grouping set branches (a UNION ALL of $2^N$ queries).
pg_stream rejects CUBE/ROLLUP specifications that would produce more than 64 branches to prevent runaway
memory usage during query generation. Use explicit GROUPING SETS(...) instead:
-- Rejected: CUBE(a, b, c, d, e, f, g) would generate 128 branches
-- Use instead:
SELECT pgstream.create_stream_table('multi_dim',
'SELECT a, b, c, SUM(v) FROM t
GROUP BY GROUPING SETS ((a, b, c), (a, b), (a), ())',
'5m', 'DIFFERENTIAL');
What Is NOT Allowed
| Operation | Restriction | Reason |
|---|---|---|
| Direct DML (INSERT, UPDATE, DELETE) | ❌ Not supported | Stream table contents are managed exclusively by the refresh engine. |
| Direct DDL (ALTER TABLE) | ❌ Not supported | Use pgstream.alter_stream_table() to change the defining query or schedule. |
| Foreign keys referencing or from a stream table | ❌ Not supported | The refresh engine performs bulk MERGE operations that do not respect FK ordering. |
| User-defined triggers on stream tables | ✅ Supported (DIFFERENTIAL) | In DIFFERENTIAL mode, the refresh engine decomposes changes into explicit DELETE + UPDATE + INSERT statements so triggers fire with correct TG_OP, OLD, and NEW. Row-level triggers are suppressed during FULL refresh. Controlled by pg_stream.user_triggers GUC (default: auto). |
| TRUNCATE on a stream table | ❌ Not supported | Use pgstream.refresh_stream_table() to reset data. |
Tip: The __pgs_row_id column is visible but should be ignored by consuming queries — it is an implementation detail used for delta MERGE operations.
Views
pgstream.stream_tables_info
Status overview with computed staleness information.
SELECT * FROM pgstream.stream_tables_info;
Columns include all pgstream.pgs_stream_tables columns plus:
| Column | Type | Description |
|---|---|---|
| staleness | interval | now() - data_timestamp |
| stale | bool | true if staleness > schedule |
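A typical use is finding tables that have fallen behind (a sketch using the computed columns plus the schedule column inherited from pgstream.pgs_stream_tables):

```sql
-- Stream tables currently older than their freshness schedule:
SELECT pgs_schema, pgs_name, staleness, schedule
FROM pgstream.stream_tables_info
WHERE stale
ORDER BY staleness DESC;
```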
pgstream.pg_stat_stream_tables
Comprehensive monitoring view combining catalog metadata with aggregate refresh statistics.
SELECT * FROM pgstream.pg_stat_stream_tables;
Key columns:
| Column | Type | Description |
|---|---|---|
| pgs_id | bigint | Stream table ID |
| pgs_schema / pgs_name | text | Schema and name |
| status | text | INITIALIZING, ACTIVE, SUSPENDED, ERROR |
| refresh_mode | text | FULL or DIFFERENTIAL |
| data_timestamp | timestamptz | Timestamp of last refresh |
| staleness | interval | Current staleness |
| stale | bool | Whether schedule is exceeded |
| total_refreshes | bigint | Total refresh count |
| successful_refreshes | bigint | Successful refresh count |
| failed_refreshes | bigint | Failed refresh count |
| avg_duration_ms | float8 | Average refresh duration |
| consecutive_errors | int | Current error streak |
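For example, a quick health check ordered by failure ratio (a sketch over the documented columns):

```sql
-- Refresh health per stream table, worst offenders first:
SELECT pgs_schema, pgs_name, status, consecutive_errors,
       failed_refreshes::float8 / NULLIF(total_refreshes, 0) AS failure_ratio,
       avg_duration_ms
FROM pgstream.pg_stat_stream_tables
ORDER BY failure_ratio DESC NULLS LAST;
```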
Catalog Tables
pgstream.pgs_stream_tables
Core metadata for each stream table.
| Column | Type | Description |
|---|---|---|
| pgs_id | bigserial | Primary key |
| pgs_relid | oid | OID of the storage table |
| pgs_name | text | Table name |
| pgs_schema | text | Schema name |
| defining_query | text | The SQL query that defines the ST |
| schedule | text | Refresh schedule (duration or cron expression) |
| refresh_mode | text | FULL or DIFFERENTIAL |
| status | text | INITIALIZING, ACTIVE, SUSPENDED, ERROR |
| is_populated | bool | Whether the table has been populated |
| data_timestamp | timestamptz | Timestamp of the data in the ST |
| frontier | jsonb | Per-source LSN positions (version tracking) |
| last_refresh_at | timestamptz | When last refreshed |
| consecutive_errors | int | Current error streak count |
| needs_reinit | bool | Whether upstream DDL requires reinitialization |
| functions_used | text[] | Function names used in the defining query (for DDL tracking) |
| created_at | timestamptz | Creation timestamp |
| updated_at | timestamptz | Last modification timestamp |
pgstream.pgs_dependencies
DAG edges — records which source tables each ST depends on, including CDC mode metadata.
| Column | Type | Description |
|---|---|---|
| pgs_id | bigint | FK to pgs_stream_tables |
| source_relid | oid | OID of the source table |
| source_type | text | TABLE, STREAM_TABLE, or VIEW |
| columns_used | text[] | Which columns are referenced |
| cdc_mode | text | Current CDC mode: TRIGGER, TRANSITIONING, or WAL |
| slot_name | text | Replication slot name (WAL/TRANSITIONING modes) |
| decoder_confirmed_lsn | pg_lsn | WAL decoder's last confirmed position |
| transition_started_at | timestamptz | When the trigger→WAL transition started |
pgstream.pgs_refresh_history
Audit log of all refresh operations.
| Column | Type | Description |
|---|---|---|
| refresh_id | bigserial | Primary key |
| pgs_id | bigint | FK to pgs_stream_tables |
| data_timestamp | timestamptz | Data timestamp of the refresh |
| start_time | timestamptz | When the refresh started |
| end_time | timestamptz | When it completed |
| action | text | NO_DATA, FULL, DIFFERENTIAL, REINITIALIZE, SKIP |
| rows_inserted | bigint | Rows inserted |
| rows_deleted | bigint | Rows deleted |
| error_message | text | Error message if failed |
| status | text | RUNNING, COMPLETED, FAILED, SKIPPED |
| initiated_by | text | What triggered it: SCHEDULER, MANUAL, or INITIAL |
| freshness_deadline | timestamptz | SLA deadline (duration schedules only; NULL for cron) |
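For example, to inspect recent failures (a sketch using the documented columns):

```sql
-- Ten most recent failed refreshes with durations and error messages:
SELECT pgs_id, action, start_time,
       end_time - start_time AS duration,
       error_message
FROM pgstream.pgs_refresh_history
WHERE status = 'FAILED'
ORDER BY start_time DESC
LIMIT 10;
```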
pgstream.pgs_change_tracking
CDC slot tracking per source table.
| Column | Type | Description |
|---|---|---|
| source_relid | oid | OID of the tracked source table |
| slot_name | text | Logical replication slot name |
| last_consumed_lsn | pg_lsn | Last consumed WAL position |
| tracked_by_pgs_ids | bigint[] | Array of ST IDs depending on this source |
Configuration
Complete reference for all pg_stream GUC (Grand Unified Configuration) variables.
Overview
pg_stream exposes sixteen configuration variables in the pg_stream namespace. All can be set in postgresql.conf or at runtime via SET / ALTER SYSTEM.
Required postgresql.conf settings:
shared_preload_libraries = 'pg_stream'
The extension must be loaded via shared_preload_libraries because it registers GUC variables and a background worker at startup.
Note: wal_level = logical and max_replication_slots are not required by default. The default CDC mode (trigger) uses lightweight row-level triggers. If you set pg_stream.cdc_mode = 'auto' or 'wal', then wal_level = logical is needed for WAL-based capture (see pg_stream.cdc_mode).
GUC Variables
pg_stream.enabled
Enable or disable the pg_stream extension.
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
| Context | SUSET (superuser) |
| Restart Required | No |
When set to false, the background scheduler stops processing refreshes. Existing stream tables remain in the catalog but are not refreshed. Manual pgstream.refresh_stream_table() calls still work.
-- Disable automatic refreshes
SET pg_stream.enabled = false;
-- Re-enable
SET pg_stream.enabled = true;
pg_stream.scheduler_interval_ms
How often the background scheduler checks for stream tables that need refreshing.
| Property | Value |
|---|---|
| Type | int |
| Default | 1000 (1 second) |
| Range | 100 – 60000 (100ms to 60s) |
| Context | SUSET |
| Restart Required | No |
Tuning Guidance:
- Low-latency workloads (sub-second schedules): Set to 100–500.
- Standard workloads (schedules of minutes): The default 1000 is appropriate.
- Low-overhead workloads (many STs with long schedules): Increase to 5000–10000 to reduce scheduler overhead.
The scheduler interval does not determine refresh frequency — it determines how often the scheduler checks whether any ST's staleness exceeds its schedule (or whether a cron expression has fired). The actual refresh frequency is governed by schedule (duration or cron) and canonical period alignment.
SET pg_stream.scheduler_interval_ms = 500;
pg_stream.min_schedule_seconds
Minimum allowed schedule value (in seconds) when creating or altering a stream table with a duration-based schedule. This limit does not apply to cron expressions.
| Property | Value |
|---|---|
| Type | int |
| Default | 60 (1 minute) |
| Range | 1 – 86400 (1 second to 24 hours) |
| Context | SUSET |
| Restart Required | No |
This acts as a safety guardrail to prevent users from setting impractically small schedules that would cause excessive refresh overhead.
Tuning Guidance:
- Development/testing: Set to 1 for fast iteration.
- Production: Keep at 60 or higher to prevent excessive WAL consumption and CPU usage.
-- Allow 10-second schedules (for testing)
SET pg_stream.min_schedule_seconds = 10;
pg_stream.max_consecutive_errors
Maximum consecutive refresh failures before a stream table is moved to ERROR status.
| Property | Value |
|---|---|
| Type | int |
| Default | 3 |
| Range | 1 – 100 |
| Context | SUSET |
| Restart Required | No |
When a ST's consecutive_errors reaches this threshold:
- The ST status changes to ERROR.
- Automatic refreshes stop for this ST.
- Manual intervention is required: SELECT pgstream.alter_stream_table('...', status => 'ACTIVE').
Tuning Guidance:
- Strict (production): 3 — fail fast to surface issues.
- Lenient (development): 10–20 — tolerate transient errors.
SET pg_stream.max_consecutive_errors = 5;
pg_stream.change_buffer_schema
Schema where CDC change buffer tables are created.
| Property | Value |
|---|---|
| Type | text |
| Default | 'pgstream_changes' |
| Context | SUSET |
| Restart Required | No (but existing change buffers remain in the old schema) |
Change buffer tables are named <schema>.changes_<oid> where <oid> is the source table's OID. Placing them in a dedicated schema keeps them out of the public namespace.
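To see which change buffers exist and roughly how many rows are pending (a sketch; reltuples is the planner's estimate, so treat the counts as approximate):

```sql
-- List change buffer tables in the CDC schema with estimated row counts:
SELECT c.relname, c.reltuples::bigint AS approx_pending_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pgstream_changes'
  AND c.relname LIKE 'changes_%'
ORDER BY c.reltuples DESC;
```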
Tuning Guidance:
- Generally leave at the default. Change only if pgstream_changes conflicts with an existing schema in your database.
SET pg_stream.change_buffer_schema = 'my_change_buffers';
pg_stream.max_concurrent_refreshes
Maximum number of stream tables that can be refreshed simultaneously.
| Property | Value |
|---|---|
| Type | int |
| Default | 4 |
| Range | 1 – 32 |
| Context | SUSET |
| Restart Required | No |
Controls concurrency in the scheduler. Each refresh acquires an advisory lock, and the scheduler skips STs that exceed this limit.
Tuning Guidance:
- Small databases (few STs): 1–4 is sufficient.
- Large deployments (50+ STs): Increase to 8–16 if the server has spare CPU and I/O capacity.
- Resource-constrained: Set to 1 for fully sequential refresh processing.
The optimal setting depends on:
- Number of CPU cores available
- I/O throughput (SSD vs HDD)
- Complexity of the defining queries
- Amount of concurrent OLTP workload
SET pg_stream.max_concurrent_refreshes = 8;
pg_stream.differential_max_change_ratio
Maximum change-to-table ratio before DIFFERENTIAL refresh falls back to FULL refresh.
| Property | Value |
|---|---|
| Type | float |
| Default | 0.15 (15%) |
| Range | 0.0 – 1.0 |
| Context | SUSET |
| Restart Required | No |
When the number of pending change buffer rows exceeds this fraction of the source table's estimated row count, the refresh engine switches from DIFFERENTIAL (which uses JSONB parsing and window functions) to FULL refresh. At high change rates FULL refresh is cheaper because it avoids the per-row JSONB overhead.
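As a worked example with the default 0.15: a 1,000,000-row source table falls back to FULL once more than 150,000 buffered changes are pending. You can approximate the current ratio yourself (a sketch; 12345 stands in for the source table's OID, and reltuples is only an estimate):

```sql
-- Approximate change-to-table ratio for one source table:
SELECT (SELECT count(*) FROM pgstream_changes.changes_12345)::float8
       / NULLIF((SELECT reltuples FROM pg_class WHERE oid = 12345), 0)
       AS change_ratio;
```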
Special Values:
- 0.0: Disable adaptive fallback — always use DIFFERENTIAL.
- 1.0: Always fall back to FULL (effectively forces FULL mode).
Tuning Guidance:
- OLTP with low change rates (< 5%): The default 0.15 is appropriate.
- Batch-load workloads (bulk inserts): Lower to 0.05–0.10 so large batches trigger FULL refresh sooner.
- Latency-sensitive (deterministic refresh time): Set to 0.0 to always use DIFFERENTIAL.
-- Lower threshold for batch-heavy workloads
SET pg_stream.differential_max_change_ratio = 0.10;
-- Disable adaptive fallback
SET pg_stream.differential_max_change_ratio = 0.0;
pg_stream.cleanup_use_truncate
Use TRUNCATE instead of per-row DELETE for change buffer cleanup when the entire buffer is consumed by a refresh.
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
| Context | SUSET |
| Restart Required | No |
After a differential refresh consumes all rows from the change buffer, the engine must clean up the buffer table. TRUNCATE is O(1) regardless of row count, versus DELETE which must update indexes row-by-row. This saves 3–5 ms per refresh at 10%+ change rates.
Trade-off: TRUNCATE acquires an AccessExclusiveLock on the change buffer table. If concurrent DML on the source table is actively inserting into the same change buffer via triggers, this lock can cause brief contention.
Tuning Guidance:
- Most workloads: Leave at true — the performance benefit outweighs the brief lock.
- High-concurrency OLTP with continuous writes during refresh: Set to false if you observe lock-wait timeouts on the change buffer.
-- Use per-row DELETE for change buffer cleanup
SET pg_stream.cleanup_use_truncate = false;
pg_stream.merge_planner_hints
Inject SET LOCAL planner hints before MERGE execution during differential refresh.
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
| Context | SUSET |
| Restart Required | No |
When enabled, the refresh executor estimates the delta size and applies optimizer hints within the transaction:
- Delta ≥ 100 rows: SET LOCAL enable_nestloop = off — steers the planner away from nested-loop joins, typically toward hash joins.
- Delta ≥ 10,000 rows: additionally SET LOCAL work_mem = '<N>MB' (see pg_stream.merge_work_mem_mb).
This reduces P95 latency spikes caused by PostgreSQL choosing nested-loop plans for medium/large delta sizes.
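The effect is roughly equivalent to the following transaction-scoped settings (illustrative only; the real statements are generated internally, and SET LOCAL reverts automatically at transaction end):

```sql
BEGIN;
SET LOCAL enable_nestloop = off;   -- applied when the delta has >= 100 rows
SET LOCAL work_mem = '64MB';       -- additionally applied at >= 10,000 rows
-- ... generated MERGE into the stream table runs here ...
COMMIT;                            -- both settings revert here
```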
Tuning Guidance:
- Most workloads: Leave at true — the hints improve tail latency without affecting small deltas.
- Custom plan overrides: Set to false if you manage planner settings yourself or if the hints conflict with your pg_hint_plan configuration.
-- Disable planner hints
SET pg_stream.merge_planner_hints = false;
pg_stream.merge_work_mem_mb
work_mem value (in MB) applied via SET LOCAL when the delta exceeds 10,000 rows and planner hints are enabled.
| Property | Value |
|---|---|
| Type | int |
| Default | 64 (64 MB) |
| Range | 8 – 4096 (8 MB to 4 GB) |
| Context | SUSET |
| Restart Required | No |
A higher value lets PostgreSQL use larger in-memory hash tables for the MERGE join, avoiding disk-spilling sort/merge strategies on large deltas. This setting is only applied when both merge_planner_hints = true and the delta exceeds 10,000 rows.
Tuning Guidance:
- Servers with ample RAM (32+ GB): Increase to 128–256 for faster large-delta refreshes.
- Memory-constrained: Lower to 16–32 or disable planner hints entirely.
- Very large deltas (100K+ rows): Consider 256–512 if refresh latency matters.
SET pg_stream.merge_work_mem_mb = 128;
pg_stream.merge_strategy
Strategy for applying delta changes to the stream table during differential refresh.
| Property | Value |
|---|---|
| Type | text |
| Default | 'auto' |
| Values | 'auto', 'merge', 'delete_insert' |
| Context | SUSET |
| Restart Required | No |
Values:
- auto (default): Use MERGE for small deltas; switch to DELETE + INSERT when the delta exceeds 25% of the stream table row count.
- merge: Always use a single MERGE statement. Best for correctness and simplicity.
- delete_insert: Always use DELETE + INSERT. May be faster for very large deltas but has known limitations with aggregate/DISTINCT queries.
Tuning Guidance:
- Most workloads: Leave at 'auto' or 'merge'.
- Batch-heavy ETL: Try 'delete_insert' if MERGE performance degrades on large deltas, but test thoroughly with your specific queries.
-- Always use MERGE
SET pg_stream.merge_strategy = 'merge';
-- Always use DELETE + INSERT
SET pg_stream.merge_strategy = 'delete_insert';
pg_stream.use_prepared_statements
Use SQL PREPARE / EXECUTE for MERGE statements during differential refresh.
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
| Context | SUSET |
| Restart Required | No |
When enabled, the refresh executor issues PREPARE __pgs_merge_{id} on the first cache-hit cycle, then uses EXECUTE on subsequent cycles. After approximately 5 executions, PostgreSQL switches from a custom plan to a generic plan, saving 1–2 ms of parse/plan overhead per refresh.
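The pattern is roughly the following (illustrative only: the real statement text is generated internally and the delta source parses JSONB change records; changes_12345 is a hypothetical buffer table, and the plain aggregate stands in for the generated delta query):

```sql
-- First cycle: parse and plan once
PREPARE __pgs_merge_42 AS
    MERGE INTO pgstream.order_totals t
    USING (SELECT customer_id, SUM(amount) AS total
           FROM pgstream_changes.changes_12345
           GROUP BY customer_id) d
    ON t.customer_id = d.customer_id
    WHEN MATCHED THEN UPDATE SET total = t.total + d.total
    WHEN NOT MATCHED THEN INSERT (customer_id, total)
        VALUES (d.customer_id, d.total);

-- Subsequent cycles: reuse (PostgreSQL switches to a generic plan after ~5 executions)
EXECUTE __pgs_merge_42;
```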
Tuning Guidance:
- Most workloads: Leave at true — the cumulative parse/plan savings are significant for frequently refreshed stream tables.
- Highly skewed data: Set to false if prepared-statement parameter sniffing produces poor plans (e.g., highly skewed LSN distributions causing bad join estimates).
-- Disable prepared statements
SET pg_stream.use_prepared_statements = false;
pg_stream.user_triggers
Control how user-defined triggers on stream tables are handled during refresh.
| Property | Value |
|---|---|
| Type | text |
| Default | 'auto' |
| Values | 'auto', 'on', 'off' |
| Context | SUSET |
| Restart Required | No |
When a stream table has user-defined row-level triggers, the refresh engine can decompose the MERGE into explicit DELETE + UPDATE + INSERT statements so triggers fire with correct TG_OP, OLD, and NEW values.
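For example, a simple audit trigger (a sketch; totals_audit and log_totals_change are hypothetical names, order_totals is the stream table from earlier examples):

```sql
CREATE TABLE totals_audit (
    op          TEXT,
    customer_id INT,
    changed_at  TIMESTAMPTZ DEFAULT now()
);

CREATE FUNCTION log_totals_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO totals_audit (op, customer_id)
    VALUES (TG_OP, COALESCE(NEW.customer_id, OLD.customer_id));
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

-- Fires with correct TG_OP/OLD/NEW under the explicit DML path:
CREATE TRIGGER totals_change
AFTER INSERT OR UPDATE OR DELETE ON pgstream.order_totals
FOR EACH ROW EXECUTE FUNCTION log_totals_change();
```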
Values:
- auto (default): Automatically detect user triggers on the stream table. If present, use the explicit DML path; otherwise use MERGE.
- on: Always use the explicit DML path, even without user triggers. Useful for testing.
- off: Always use MERGE. User triggers are suppressed during refresh. This is the escape hatch if explicit DML causes issues.
Notes:
- Row-level triggers do not fire during FULL refresh regardless of this setting. FULL refresh uses DISABLE TRIGGER USER / ENABLE TRIGGER USER to suppress them.
- The explicit DML path adds ~25–60% overhead compared to MERGE for affected stream tables.
- Stream tables without user triggers have zero overhead when using auto (only a fast pg_trigger check).
-- Auto-detect (default)
SET pg_stream.user_triggers = 'auto';
-- Always use explicit DML (for testing)
SET pg_stream.user_triggers = 'on';
-- Suppress triggers, use MERGE
SET pg_stream.user_triggers = 'off';
pg_stream.block_source_ddl
When enabled, column-affecting DDL (e.g., ALTER TABLE ... DROP COLUMN,
ALTER TABLE ... ALTER COLUMN ... TYPE) on source tables tracked by stream
tables is blocked with an ERROR instead of silently marking stream tables
for reinitialization.
This is useful in production environments where you want to prevent accidental schema changes that would trigger expensive full recomputation of downstream stream tables.
Default: false
Context: Superuser
-- Block column-affecting DDL on tracked source tables
SET pg_stream.block_source_ddl = true;
-- Allow DDL (stream tables will be marked for reinit instead)
SET pg_stream.block_source_ddl = false;
Note: Only column-affecting changes are blocked. Benign DDL (adding indexes, comments, constraints) is always allowed regardless of this setting.
pg_stream.cdc_mode
CDC (Change Data Capture) mechanism selection.
| Value | Description |
|---|---|
| 'trigger' | (default) Always use row-level triggers for change capture |
| 'auto' | Use triggers at creation; transition to WAL-based CDC if wal_level = logical |
| 'wal' | Require WAL-based CDC (fails if wal_level != logical) |
Default: 'trigger'
-- Always use triggers (default, zero-config)
SET pg_stream.cdc_mode = 'trigger';
-- Enable automatic trigger → WAL transition
SET pg_stream.cdc_mode = 'auto';
-- Require WAL-based CDC (error if wal_level != logical)
SET pg_stream.cdc_mode = 'wal';
pg_stream.wal_transition_timeout
Maximum time (seconds) to wait for the WAL decoder to catch up during the transition from trigger-based to WAL-based CDC. If the decoder has not caught up within this timeout, the system falls back to triggers.
Default: 300 (5 minutes)
Range: 10 – 3600
SET pg_stream.wal_transition_timeout = 300;
Complete postgresql.conf Example
# Required
shared_preload_libraries = 'pg_stream'
# Optional tuning
pg_stream.enabled = true
pg_stream.scheduler_interval_ms = 1000
pg_stream.min_schedule_seconds = 60
pg_stream.max_consecutive_errors = 3
pg_stream.change_buffer_schema = 'pgstream_changes'
pg_stream.max_concurrent_refreshes = 4
pg_stream.differential_max_change_ratio = 0.15
pg_stream.cleanup_use_truncate = true
pg_stream.merge_planner_hints = true
pg_stream.merge_work_mem_mb = 64
pg_stream.merge_strategy = 'auto'
pg_stream.use_prepared_statements = true
pg_stream.user_triggers = 'auto'
pg_stream.block_source_ddl = false
pg_stream.cdc_mode = 'trigger'
pg_stream.wal_transition_timeout = 300
Runtime Configuration
All GUC variables can be changed at runtime by a superuser:
-- View current settings
SHOW pg_stream.enabled;
SHOW pg_stream.scheduler_interval_ms;
-- Change for current session
SET pg_stream.max_concurrent_refreshes = 8;
-- Change persistently (requires reload)
ALTER SYSTEM SET pg_stream.scheduler_interval_ms = 500;
SELECT pg_reload_conf();
Further Reading
- INSTALL.md — Installation and initial configuration
- ARCHITECTURE.md — System architecture overview
- SQL_REFERENCE.md — Complete function reference
Installation Guide
Prerequisites
| Requirement | Version |
|---|---|
| PostgreSQL | 18.x |
Building from source additionally requires Rust 1.82+ and pgrx 0.17.x. Pre-built release artifacts only need a running PostgreSQL 18.x instance.
Installing from a Pre-built Release
1. Download the release archive
Download the archive for your platform from the GitHub Releases page:
| Platform | Archive |
|---|---|
| Linux x86_64 | pg_stream-<ver>-pg18-linux-amd64.tar.gz |
| macOS Apple Silicon | pg_stream-<ver>-pg18-macos-arm64.tar.gz |
| Windows x64 | pg_stream-<ver>-pg18-windows-amd64.zip |
Optionally verify the checksum against SHA256SUMS.txt from the same release:
sha256sum -c SHA256SUMS.txt
2. Extract and install
Linux / macOS:
tar xzf pg_stream-0.1.1-pg18-linux-amd64.tar.gz
cd pg_stream-0.1.1-pg18-linux-amd64
sudo cp lib/*.so "$(pg_config --pkglibdir)/"
sudo cp extension/*.control extension/*.sql "$(pg_config --sharedir)/extension/"
Windows (PowerShell):
Expand-Archive pg_stream-0.1.1-pg18-windows-amd64.zip -DestinationPath .
cd pg_stream-0.1.1-pg18-windows-amd64
Copy-Item lib\*.dll "$(pg_config --pkglibdir)\"
Copy-Item extension\* "$(pg_config --sharedir)\extension\"
3. Using with CloudNativePG (Kubernetes)
pg_stream is distributed as an OCI extension image for use with CloudNativePG Image Volume Extensions.
Requirements: Kubernetes 1.33+, CNPG 1.28+, PostgreSQL 18.
# Pull the extension image
docker pull ghcr.io/grove/pg_stream-ext:0.1.1
See cnpg/cluster-example.yaml and cnpg/database-example.yaml for complete Cluster and Database deployment examples.
4. Local Docker development (without Kubernetes)
For local development without Kubernetes, install the extension files manually into a standard PostgreSQL container from a release archive:
# Extract extension files from the release archive
tar xzf pg_stream-0.1.1-pg18-linux-amd64.tar.gz
cd pg_stream-0.1.1-pg18-linux-amd64
# Run PostgreSQL with the extension mounted
docker run --rm \
-v $PWD/lib/pg_stream.so:/usr/lib/postgresql/18/lib/pg_stream.so:ro \
-v $PWD/extension/:/tmp/ext/:ro \
-e POSTGRES_PASSWORD=postgres \
postgres:18.1 \
sh -c 'cp /tmp/ext/* /usr/share/postgresql/18/extension/ && \
exec postgres -c shared_preload_libraries=pg_stream'
Building from Source
1. Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
2. Install pgrx
cargo install --locked cargo-pgrx --version 0.17.0
cargo pgrx init --pg18 $(pg_config --bindir)/pg_config
3. Build the Extension
# Development build (faster compilation)
cargo pgrx install --pg-config $(pg_config --bindir)/pg_config
# Release build (optimized, for production)
cargo pgrx install --release --pg-config $(pg_config --bindir)/pg_config
# Package for deployment (creates installable artifacts)
cargo pgrx package --pg-config $(pg_config --bindir)/pg_config
PostgreSQL Configuration
Add the following to postgresql.conf before starting PostgreSQL:
# Required — loads the extension shared library at server start
shared_preload_libraries = 'pg_stream'
# Recommended — must accommodate scheduler + refresh workers
max_worker_processes = 8
Note: wal_level = logical and max_replication_slots are not required by default. The default CDC mode uses lightweight row-level triggers rather than logical replication; WAL-based CDC (pg_stream.cdc_mode = 'auto' or 'wal') requires wal_level = logical.
Restart PostgreSQL after modifying these settings:
pg_ctl restart -D /path/to/data
# or
systemctl restart postgresql
Extension Installation
Connect to the target database and run:
CREATE EXTENSION pg_stream;
This creates:
- The pgstream schema with catalog tables and SQL functions
- The pgstream_changes schema for change buffer tables
- Event triggers for DDL tracking
- The pgstream.pg_stat_stream_tables monitoring view
Verification
After installation, verify everything is working:
-- Check the extension version
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_stream';
-- Or get a full status overview (includes version, scheduler state, stream table count)
SELECT * FROM pgstream.pgs_status();
Inspecting the installation
-- Check the installed version
SELECT extversion FROM pg_extension WHERE extname = 'pg_stream';
-- Check which schemas were created
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name IN ('pgstream', 'pgstream_changes');
-- Check all registered GUC variables
SHOW pg_stream.enabled;
SHOW pg_stream.scheduler_interval_ms;
SHOW pg_stream.max_concurrent_refreshes;
-- Check the scheduler background worker is running
SELECT * FROM pgstream.pgs_status();
-- List all stream tables
SELECT pgs_schema, pgs_name, status, refresh_mode, is_populated
FROM pgstream.pgs_stream_tables;
-- Check that the shared library loaded correctly
SELECT * FROM pg_extension WHERE extname = 'pg_stream';
-- Verify the catalog tables exist
SELECT tablename
FROM pg_tables
WHERE schemaname = 'pgstream'
ORDER BY tablename;
Quick functional test
CREATE TABLE test_source (id INT PRIMARY KEY, val TEXT);
INSERT INTO test_source VALUES (1, 'hello');
SELECT pgstream.create_stream_table(
'test_st',
'SELECT id, val FROM test_source',
'1m',
'FULL'
);
SELECT * FROM test_st;
-- Should return: 1 | hello
-- Clean up
SELECT pgstream.drop_stream_table('test_st');
DROP TABLE test_source;
Uninstallation
-- Drop all stream tables first
SELECT pgstream.drop_stream_table(pgs_schema || '.' || pgs_name)
FROM pgstream.pgs_stream_tables;
-- Drop the extension
DROP EXTENSION pg_stream CASCADE;
Remove pg_stream from shared_preload_libraries in postgresql.conf and restart PostgreSQL.
Troubleshooting
Unit tests crash on macOS 26+ (symbol not found in flat namespace)
macOS 26 (Tahoe) changed dyld to eagerly resolve all flat-namespace symbols
at binary load time. pgrx extensions reference PostgreSQL server-internal
symbols (e.g. CacheMemoryContext, SPI_connect) via the
-Wl,-undefined,dynamic_lookup linker flag. These symbols are normally
provided by the postgres executable when the extension is loaded as a shared
library — but for cargo test --lib there is no postgres process, so the test
binary aborts immediately:
dyld[66617]: symbol not found in flat namespace '_CacheMemoryContext'
This affects local development only — integration tests, E2E tests, and the extension itself running inside PostgreSQL are unaffected.
The fix is built into the just test-unit recipe. It automatically:
- Compiles a tiny C stub library (scripts/pg_stub.c → target/libpg_stub.dylib) that provides NULL/no-op definitions for the ~28 PostgreSQL symbols.
- Compiles the test binary with --no-run.
- Runs the binary with DYLD_INSERT_LIBRARIES pointing to the stub.
The stub is only built on macOS 26+. On Linux or older macOS, just test-unit
runs cargo test --lib directly with no changes.
Note: The stub symbols are never called — unit tests exercise pure Rust logic only. If a test accidentally calls a PostgreSQL function it will crash with a NULL dereference (the desired fail-fast behavior).
If you run unit tests without just (e.g. directly via cargo test --lib),
you can use the wrapper script instead:
./scripts/run_unit_tests.sh pg18
# With test name filter:
./scripts/run_unit_tests.sh pg18 -- test_parse_basic
Extension fails to load
Ensure shared_preload_libraries = 'pg_stream' is set and PostgreSQL has been restarted (not just reloaded). The extension requires shared memory initialization at startup.
Background worker not starting
Check that max_worker_processes is high enough to accommodate the scheduler worker plus any refresh workers. The default of 8 is usually sufficient.
Check logs for details
The extension logs at various levels. Enable debug logging for more detail:
SET client_min_messages TO debug1;
Architecture
This document describes the internal architecture of pg_stream — a PostgreSQL 18 extension that implements stream tables with differential view maintenance. For a high-level description of what pg_stream does and why, read ESSENCE.md. For release milestones and future plans, see ROADMAP.md.
High-Level Overview
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL 18 Backend │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
│ │ Source │ │ Source │ │ Storage │ │ Storage │ │
│ │ Table A │ │ Table B │ │ Table X │ │ Table Y │ │
│ └────┬─────┘ └────┬─────┘ └────▲─────┘ └────▲────────┘ │
│ │ │ │ │ │
│ ═════╪══════════════╪══════════════╪══════════════╪════════ │
│ │ │ │ │ │
│ ┌────▼──────────────▼────┐ ┌────┴──────────────┴────┐ │
│ │ Hybrid CDC Layer │ │ Delta Application │ │
│ │ Triggers ──or── WAL │ │ (INSERT/DELETE diffs) │ │
│ └────────────┬───────────┘ └────────────▲───────────┘ │
│ │ │ │
│ ┌────────────▼───────────┐ ┌────────────┴───────────┐ │
│ │ Change Buffer │ │ DVM Engine │ │
│ │ (pgstream_changes.*) │ │ (Operator Tree) │ │
│ └────────────┬───────────┘ └────────────▲───────────┘ │
│ │ │ │
│ └────────────┬───────────────┘ │
│ │ │
│ ┌─────────────────────────▼─────────────────────────────┐ │
│ │ Refresh Engine │ │
│ │ ┌──────────┐ ┌──────────┐ ┌─────────────────────┐ │ │
│ │ │ Frontier │ │ DAG │ │ Scheduler │ │ │
│ │ │ Tracker │ │ Resolver │ │ (canonical schedule)│ │ │
│ │ └──────────┘ └──────────┘ └─────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Catalog (pgstream.*) │ │
│ │ pgs_stream_tables │ pgs_dependencies │ pgs_refresh_history│ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Monitoring Layer │ │
│ │ st_refresh_stats │ slot_health │ check_cdc_health │ │
│ │ explain_st │ views │ NOTIFY alerting │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Component Details
1. SQL API Layer (src/api.rs)
The public entry point for users. All operations are exposed as #[pg_extern] functions in the pgstream schema:
- create_stream_table — Applies a chain of auto-rewrite passes (view inlining → DISTINCT ON → GROUPING SETS → scalar subquery → SubLinks in OR → multi-PARTITION BY windows), parses the defining query, builds an operator tree, creates the storage table, registers CDC slots, populates the catalog, and optionally performs an initial full refresh.
- alter_stream_table — Modifies schedule, refresh mode, or status (ACTIVE/SUSPENDED).
- drop_stream_table — Removes the storage table, catalog entries, and cleans up CDC slots.
- refresh_stream_table — Triggers a manual refresh (same path as automatic scheduling).
- pgs_status — Returns a summary of all registered stream tables.
2. Catalog (src/catalog.rs)
The catalog manages persistent metadata stored in PostgreSQL tables within the pgstream schema:
| Table | Purpose |
|---|---|
| pgstream.pgs_stream_tables | Core metadata: name, query, schedule, status, frontier, etc. |
| pgstream.pgs_dependencies | DAG edges from ST to source tables |
| pgstream.pgs_refresh_history | Audit log of every refresh operation |
| pgstream.pgs_change_tracking | Per-source CDC slot metadata |
Schema creation is handled by extension_sql!() macros that run at CREATE EXTENSION time.
Entity-Relationship Diagram
erDiagram
pgs_stream_tables {
bigserial pgs_id PK
oid pgs_relid UK "OID of materialized storage table"
text pgs_name
text pgs_schema
text defining_query
text original_query "User's original SQL (pre-inlining)"
text schedule "Duration or cron expression"
text refresh_mode "FULL | DIFFERENTIAL"
text status "INITIALIZING | ACTIVE | SUSPENDED | ERROR"
boolean is_populated
timestamptz data_timestamp "Freshness watermark"
jsonb frontier "DBSP-style version frontier"
timestamptz last_refresh_at
int consecutive_errors
boolean needs_reinit
float8 auto_threshold
float8 last_full_ms
timestamptz created_at
timestamptz updated_at
}
pgs_dependencies {
bigint pgs_id PK,FK "References pgs_stream_tables.pgs_id"
oid source_relid PK "OID of source table"
text source_type "TABLE | STREAM_TABLE | VIEW"
text_arr columns_used "Column-level lineage"
text cdc_mode "TRIGGER | TRANSITIONING | WAL"
text slot_name "Replication slot (WAL mode)"
pg_lsn decoder_confirmed_lsn "WAL decoder progress"
timestamptz transition_started_at "Trigger→WAL transition start"
}
pgs_refresh_history {
bigserial refresh_id PK
bigint pgs_id FK "References pgs_stream_tables.pgs_id"
timestamptz data_timestamp
timestamptz start_time
timestamptz end_time
text action "NO_DATA | FULL | DIFFERENTIAL | REINITIALIZE | SKIP"
bigint rows_inserted
bigint rows_deleted
text error_message
text status "RUNNING | COMPLETED | FAILED | SKIPPED"
text initiated_by "SCHEDULER | MANUAL | INITIAL"
timestamptz freshness_deadline
}
pgs_change_tracking {
oid source_relid PK "OID of tracked source table"
text slot_name "Trigger function name"
pg_lsn last_consumed_lsn
bigint_arr tracked_by_pgs_ids "ST IDs sharing this source"
}
pgs_stream_tables ||--o{ pgs_dependencies : "has sources"
pgs_stream_tables ||--o{ pgs_refresh_history : "has refresh history"
pgs_stream_tables }o--o{ pgs_change_tracking : "tracks via pgs_ids array"
Note: Change buffer tables (pgstream_changes.changes_<oid>) are created dynamically per source table OID and live in the separate pgstream_changes schema.
3. CDC / Change Data Capture (src/cdc.rs, src/wal_decoder.rs)
pg_stream uses a hybrid CDC architecture that starts with triggers and optionally transitions to WAL-based (logical replication) capture for lower write-side overhead.
Trigger Mode (default)
- Trigger Management — Creates AFTER INSERT OR UPDATE OR DELETE row-level triggers (pg_stream_cdc_<oid>) on each tracked source table. Each trigger fires a PL/pgSQL function (pg_stream_cdc_fn_<oid>()) that writes changes to the buffer table.
- Change Buffering — Decoded changes are written to per-source change buffer tables in the pgstream_changes schema. Each row captures the LSN (pg_current_wal_lsn()), transaction ID, action type (I/U/D), and the new/old row data as JSONB via to_jsonb().
- Cleanup — Consumed changes are deleted after each successful refresh via delete_consumed_changes(), bounded by the upper LSN to prevent unbounded scans.
- Lifecycle — Triggers and trigger functions are automatically created when a source table is first tracked and dropped when the last stream table referencing a source is removed.
The trigger approach was chosen as the default for transaction safety (triggers can be created in the same transaction as DDL), simplicity (no slot management, no wal_level = logical requirement), and immediate visibility (changes are visible in buffer tables as soon as the source transaction commits).
WAL Mode (optional, automatic transition)
When pg_stream.cdc_mode is set to 'auto' or 'wal' and wal_level = logical is available, the system transitions from trigger-based to WAL-based CDC after the first successful refresh:
- WAL Availability Detection — At stream table creation, checks whether wal_level = logical is configured. If so, the source dependency is marked for WAL transition.
- WAL Decoder Background Worker — A dedicated background worker (src/wal_decoder.rs) polls logical replication slots and writes decoded changes into the same change buffer tables used by triggers, ensuring a uniform format for the DVM engine.
- Transition Orchestration — The transition is a three-step process: (a) create a replication slot, (b) wait for the decoder to catch up to the trigger's last confirmed LSN, (c) drop the trigger and switch the dependency to WAL mode. If the decoder doesn't catch up within pg_stream.wal_transition_timeout (default 300s), the system falls back to triggers.
- CDC Mode Tracking — Each source dependency in pgs_dependencies carries a cdc_mode column (TRIGGER / TRANSITIONING / WAL) and WAL-specific metadata (slot_name, decoder_confirmed_lsn, transition_started_at).
See ADR-001 and ADR-002 in plans/adrs/PLAN_ADRS.md for the original design rationale and plans/sql/PLAN_HYBRID_CDC.md for the full implementation plan.
4. DVM Engine (src/dvm/)
The Differential View Maintenance engine is the core of the system. It transforms the defining SQL query into an executable operator tree that can compute deltas efficiently.
Auto-Rewrite Pipeline (src/dvm/parser.rs)
Before the defining query is parsed into an operator tree, it passes through a chain of auto-rewrite passes that normalize SQL constructs the DVM parser doesn't handle directly:
| Pass | Function | Purpose |
|---|---|---|
| #0 | rewrite_views_inline() | Replace view references with (view_definition) AS alias subqueries |
| #1 | rewrite_distinct_on() | Convert DISTINCT ON to ROW_NUMBER() OVER (…) = 1 window subquery |
| #2 | rewrite_grouping_sets() | Decompose GROUPING SETS / CUBE / ROLLUP into UNION ALL of GROUP BY |
| #3 | rewrite_scalar_subquery_in_where() | Convert WHERE col > (SELECT …) to CROSS JOIN |
| #4 | rewrite_sublinks_in_or() | Split WHERE a OR EXISTS (…) into UNION branches |
| #5 | rewrite_multi_partition_windows() | Split multiple PARTITION BY clauses into joined subqueries |
The view inlining pass (#0) runs first so that view definitions containing DISTINCT ON, GROUPING SETS, etc. are further rewritten by downstream passes. Nested views are expanded via a fixpoint loop (max depth 10).
Query Parser (src/dvm/parser.rs)
Parses the defining query using PostgreSQL's internal parser (via pgrx raw_parser) and extracts:
- WITH clause — CTE definitions (non-recursive: inline expansion or shared delta; recursive: detected for mode gating)
- Target list — output columns
- FROM clause — source tables, joins, subqueries, and CTE references
- WHERE clause — filters
- GROUP BY / aggregate functions
- DISTINCT / UNION ALL / INTERSECT / EXCEPT
The parser produces an OpTree — a tree of operator nodes. CTE handling follows a tiered approach:
- Tier 1 (Inline Expansion) — Non-recursive CTEs referenced once are expanded into Subquery nodes, equivalent to subqueries in FROM.
- Tier 2 (Shared Delta) — Non-recursive CTEs referenced multiple times produce CteScan nodes that share a single delta computation via a CTE registry and delta cache.
- Tier 3a/3b/3c (Recursive) — Recursive CTEs (WITH RECURSIVE) are detected via query_has_recursive_cte(). In FULL mode, the query executes as-is. In DIFFERENTIAL mode, the strategy is auto-selected: semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, or recomputation fallback when CTE columns don't match ST storage.
Operators (src/dvm/operators/)
Each operator knows how to generate a delta query — given a set of changes to its inputs, it produces the corresponding changes to its output:
| Operator | Delta Strategy |
|---|---|
| Scan | Direct passthrough of CDC changes |
| Filter | Apply WHERE predicate to deltas |
| Project | Apply column projection to deltas |
| Join | Join deltas against the other side's current state |
| OuterJoin | LEFT/RIGHT outer join with NULL padding |
| FullJoin | FULL OUTER JOIN with 8-part delta (both sides may produce NULLs) |
| Aggregate | Recompute group values where affected keys changed |
| Distinct | COUNT-based duplicate tracking |
| UnionAll | Merge deltas from both branches |
| Intersect | Dual-count multiplicity with LEAST boundary crossing |
| Except | Dual-count multiplicity with GREATEST(0, L-R) boundary crossing |
| Subquery | Transparent delegation + optional column renaming (CTEs, subselects) |
| CteScan | Shared delta lookup from CTE cache (multi-reference CTEs) |
| RecursiveCte | Semi-naive / DRed / recomputation for WITH RECURSIVE |
| Window | Partition-based recomputation for window functions |
| LateralFunction | Row-scoped recomputation for SRFs in FROM (jsonb_array_elements, unnest, etc.) |
| LateralSubquery | Row-scoped recomputation for correlated subqueries in LATERAL FROM |
| SemiJoin | EXISTS / IN subquery delta via semi-join |
| AntiJoin | NOT EXISTS / NOT IN subquery delta via anti-join |
| ScalarSubquery | Correlated scalar subquery in SELECT list |
See DVM_OPERATORS.md for detailed descriptions.
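As an illustration of the dual-count strategy behind Intersect and Except, here is a small Python model — a conceptual sketch, not the SQL the engine generates. Each side tracks per-row multiplicities; the output multiplicity is LEAST(L, R) for INTERSECT ALL and GREATEST(0, L − R) for EXCEPT ALL, and a delta is emitted only when that boundary value moves.

```python
from collections import Counter

def set_op_delta(left, right, changes, op):
    """Apply multiplicity changes to one side of a set operation and emit
    (+/-) deltas whenever the output multiplicity crosses.

    left/right: Counter of row -> multiplicity. changes: list of
    (side, row, delta_count). op: 'intersect' or 'except'."""
    def out_count(row):
        l, r = left[row], right[row]
        return min(l, r) if op == 'intersect' else max(0, l - r)

    deltas = []
    for side, row, dc in changes:
        before = out_count(row)
        (left if side == 'L' else right)[row] += dc
        after = out_count(row)
        if after > before:
            deltas += [('+', row)] * (after - before)
        elif before > after:
            deltas += [('-', row)] * (before - after)
    return deltas

# L = {a, a}, R = {a}: INTERSECT ALL currently yields one 'a'.
# Inserting another 'a' on the right raises LEAST(2, 2) and emits an insert.
left, right = Counter({'a': 2}), Counter({'a': 1})
print(set_op_delta(left, right, [('R', 'a', +1)], 'intersect'))  # [('+', 'a')]
```

The same boundary-crossing test explains why most changes to a set operation's inputs produce no output delta at all: only rows whose LEAST/GREATEST value actually moves appear in the diff.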
Diff Engine (src/dvm/diff.rs)
Generates the final diff SQL that:
- Computes the delta from the operator tree
- Produces ('+', row) for inserts and ('-', row) for deletes
- Applies the diff via DELETE matching old rows and INSERT for new rows
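The apply step can be modeled on a toy in-memory table — a sketch of the semantics only; the real engine emits DELETE/INSERT (or MERGE) statements against the storage table:

```python
def apply_diff(storage, diff):
    """Apply ('+', row) / ('-', row) entries to a list-based storage table.
    Each '-' removes one matching row (like the DELETE phase); each '+'
    appends a row (like the INSERT phase)."""
    for op, row in diff:
        if op == '-':
            storage.remove(row)
        else:
            storage.append(row)
    return storage

# An UPDATE arrives as a delete of the old row plus an insert of the new one.
state = [{'id': 1, 'total': 10}]
diff = [('-', {'id': 1, 'total': 10}), ('+', {'id': 1, 'total': 15})]
print(apply_diff(state, diff))  # [{'id': 1, 'total': 15}]
```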
5. DAG / Dependency Graph (src/dag.rs)
Stream tables can depend on other stream tables (cascading), forming a Directed Acyclic Graph:
- Cycle detection — Prevents circular dependencies at creation time using DFS.
- Topological ordering — Determines refresh order: upstream STs must be refreshed before downstream STs.
- Cascade operations — When a source table changes, all transitive dependents are identified for refresh.
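These checks can be sketched together in Python. pg_stream uses DFS for cycle detection; the sketch below uses Kahn's algorithm instead, which yields a topological refresh order and detects cycles in a single pass:

```python
from collections import defaultdict, deque

def topo_order(edges):
    """Return a refresh order for a dependency graph, or raise on a cycle.
    edges: (upstream, downstream) pairs — downstream STs read from upstream
    STs, so upstream must be refreshed first."""
    indeg = defaultdict(int)
    succ = defaultdict(list)
    nodes = set()
    for up, down in edges:
        succ[up].append(down)
        indeg[down] += 1
        nodes |= {up, down}
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle detected")  # creation would be rejected
    return order

print(topo_order([('orders', 'daily_totals'), ('daily_totals', 'weekly_totals')]))
# ['orders', 'daily_totals', 'weekly_totals']
```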
6. Version / Frontier Tracking (src/version.rs)
Implements a per-source frontier (JSONB map of source_oid → LSN) to track exactly how far each stream table has consumed changes:
- Read frontier — Before refresh, read the frontier to know where to start consuming changes.
- Advance frontier — After a successful refresh, the frontier is updated to the latest consumed LSN.
- Consistent snapshots — The frontier ensures that each refresh processes a contiguous, non-overlapping window of changes.
Delayed View Semantics (DVS) Guarantee
The contents of every stream table are logically equivalent to evaluating its defining query at some past point in time — the data_timestamp. The scheduler refreshes STs in topological order so that when ST B references upstream ST A, A has already been refreshed to the target data_timestamp before B runs its delta query against A's contents. The frontier lifecycle is:
- Created — on first full refresh; records the LSN of each source at that moment.
- Advanced — on each differential refresh; the old frontier becomes the lower bound and the new frontier (with fresh LSNs) the upper bound. The DVM engine reads changes in [old, new].
- Reset — on reinitialize; a fresh frontier is created from scratch.
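A minimal Python model of the frontier lifecycle, with LSNs simplified to integers (the real frontier is a JSONB map of source OID → pg_lsn):

```python
def plan_refresh_window(frontier, current_lsns):
    """Compute per-source change windows for one differential refresh.

    frontier: {source_oid: last_consumed_lsn}
    current_lsns: {source_oid: latest_lsn at refresh time}
    Returns (windows, new_frontier): each window is the (lower, upper)
    LSN bound the delta query reads for that source."""
    windows = {oid: (frontier[oid], current_lsns[oid])
               for oid in frontier
               if current_lsns[oid] > frontier[oid]}
    new_frontier = {oid: max(frontier[oid], current_lsns[oid])
                    for oid in frontier}
    return windows, new_frontier

frontier = {16384: 100, 16385: 250}
current = {16384: 180, 16385: 250}   # only source 16384 has new changes
windows, frontier = plan_refresh_window(frontier, current)
print(windows)   # {16384: (100, 180)}
print(frontier)  # {16384: 180, 16385: 250}
```

Because the new frontier becomes the lower bound of the next refresh, consecutive refreshes consume contiguous, non-overlapping windows — the consistency property the DVS guarantee relies on.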
7. Refresh Engine (src/refresh.rs)
Orchestrates the complete refresh cycle:
┌──────────────┐
│ Check State │ → Is ST active? Has it been populated?
└──────┬───────┘
│
┌─────▼──────┐
│ Drain CDC │ → Read WAL changes into change buffer tables
└─────┬──────┘
│
┌─────▼──────────────┐
│ Determine Action │ → FULL, DIFFERENTIAL, NO_DATA, REINITIALIZE, or SKIP?
│ │ (adaptive: if change ratio > pg_stream.differential_max_change_ratio,
│ │ downgrade DIFFERENTIAL → FULL automatically)
└─────┬──────────────┘
│
┌─────▼──────┐
│ Execute │ → Full: TRUNCATE + INSERT ... SELECT
│ │ Differential: Generate & apply delta SQL
└─────┬──────┘
│
┌─────▼──────────────┐
│ Record History │ → Write to pgstream.pgs_refresh_history
└─────┬──────────────┘
│
┌─────▼──────────────┐
│ Advance Frontier │ → Update JSONB frontier in catalog
└─────┬──────────────┘
│
┌─────▼──────────────┐
│ Reset Error Count │ → On success, reset consecutive_errors to 0
└──────────────────────┘
8. Background Worker & Scheduling (src/scheduler.rs)
Registration & Lifecycle
pg_stream registers one PostgreSQL background worker — the scheduler — during _PG_init() (extension load). Because it is registered at startup, pg_stream must appear in shared_preload_libraries, which requires a server restart.
┌──────────────────────────────────────────────────────────────────┐
│ PostgreSQL postmaster │
│ │
│ shared_preload_libraries = 'pg_stream' │
│ │ │
│ ▼ │
│ _PG_init() │
│ ├─ Register GUCs (pg_stream.enabled, scheduler_interval_ms …) │
│ ├─ Register shared memory (PgStreamSharedState, atomics) │
│ └─ BackgroundWorkerBuilder::new("pg_stream scheduler") │
│ .set_start_time(RecoveryFinished) │
│ .set_restart_time(5s) ← auto-restart on crash │
│ .load() │
│ │
│ After recovery finishes: │
│ │ │
│ ▼ │
│ pg_stream_scheduler_main() ← background worker starts │
│ ├─ Attach SIGHUP + SIGTERM handlers │
│ ├─ Connect to SPI (database = "postgres") │
│ ├─ Crash recovery: mark stale RUNNING records as FAILED │
│ └─ Enter main loop ─────────────────────────┐ │
│ │ │ │
│ ▼ │ │
│ wait_latch(scheduler_interval_ms) │ │
│ │ │ │
│ ┌───▼───────────────────────────────┐ │ │
│ │ SIGTERM? → log + break │ │ │
│ │ pg_stream.enabled = false? → skip │ │ │
│ │ Otherwise → scheduler tick │ │ │
│ └───┬───────────────────────────────┘ │ │
│ │ │ │
│ └──────────── loop ────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Key lifecycle properties:
| Property | Behaviour |
|---|---|
| Start condition | After PostgreSQL recovery finishes (RecoveryFinished) |
| Auto-restart | 5-second delay after an unexpected crash |
| Graceful shutdown | Handles SIGTERM — breaks the main loop and exits cleanly |
| Config reload | Handles SIGHUP — re-reads GUC values on the next latch wake |
| Crash recovery | On startup, any pgs_refresh_history rows stuck in RUNNING status are marked FAILED (the transaction that wrote them was rolled back by PostgreSQL, but the status row may have been committed in a prior transaction) |
| Database | Connects to the postgres database via SPI |
Scheduler Tick
Each tick of the main loop performs the following steps inside a single transaction:
- DAG rebuild — Compare the shared-memory DAG_REBUILD_SIGNAL counter against the local copy. If it advanced (a CREATE, ALTER, or DROP stream table occurred), rebuild the in-memory dependency graph (StDag) from the catalog.
- Topological traversal — Walk stream tables in dependency order (upstream before downstream). This ensures that when ST B references ST A, A is refreshed first.
- Per-ST evaluation — For each active ST:
  - Skip if in retry backoff (exponential, per-ST).
  - Skip if schedule/cron says not yet due.
  - Skip if an advisory lock indicates a concurrent refresh.
  - Check upstream change buffers for pending rows.
- Execute refresh — Acquire an advisory lock → record RUNNING in history → run FULL / DIFFERENTIAL / REINITIALIZE → store new frontier → release lock → record completion.
- WAL transitions — Advance any trigger→WAL CDC mode transitions (src/wal_decoder.rs).
- Slot health — Check replication slot health and emit NOTIFY alerts.
- Prune retry state — Remove backoff entries for STs that no longer exist.
Sequential Processing
The scheduler processes stream tables sequentially within a single background worker. Although pg_stream.max_concurrent_refreshes (default 4) exists as a GUC, it currently only prevents a manual pgstream.refresh_stream_table() call from overlapping with the scheduler on the same ST — it does not spawn additional workers. All STs are refreshed one at a time in topological order.
The PostgreSQL GUC max_worker_processes (default 8) sets the server-wide budget for all background workers (autovacuum, parallel query, logical replication, extensions). pg_stream consumes one slot from that budget.
Retry & Error Handling
Each ST maintains an in-memory RetryState (reset on scheduler restart):
- Retryable errors (SPI failures, lock contention, slot issues) trigger exponential backoff.
- Permanent errors (schema mismatch, user errors) skip backoff but increment consecutive_errors.
- When consecutive_errors reaches pg_stream.max_consecutive_errors (default 3), the ST is auto-suspended and a NOTIFY alert is emitted.
- Schema errors additionally set needs_reinit, triggering a REINITIALIZE on the next refresh cycle.
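The retryable-error path can be sketched as plain exponential backoff; the base and cap below are illustrative constants, not pg_stream's actual values:

```python
def next_retry_delay(consecutive_failures, base_s=2.0, cap_s=300.0):
    """Exponential backoff for retryable refresh errors: the delay doubles
    on each consecutive failure and is clamped at cap_s. base_s and cap_s
    are stand-in constants for illustration."""
    return min(cap_s, base_s * (2 ** (consecutive_failures - 1)))

print([next_retry_delay(n) for n in range(1, 6)])  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Because the backoff state is in-memory, a scheduler restart clears it — the first tick after a restart retries every ST whose schedule is due.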
Scheduling Policy
Automatic refresh scheduling uses canonical periods (48·2ⁿ seconds, n = 0, 1, 2, …) snapped to the user's schedule:
- Picks the largest canonical period ≤ the schedule, so the ST refreshes at least as often as requested.
- For DOWNSTREAM schedule (NULL schedule), the ST refreshes only when explicitly triggered or when a downstream ST needs it.
- Advisory locks prevent concurrent refreshes of the same ST.
- The scheduler is driven by the background worker polling at the pg_stream.scheduler_interval_ms GUC interval.
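One plausible reading of the snapping rule, sketched in Python: choose the canonical period 48·2ⁿ closest to the schedule from below, so the ST is never refreshed less often than requested. The 48-second base comes from the text; the selection rule itself is an assumption.

```python
def snap_to_canonical(schedule_s):
    """Snap a user schedule (seconds) to a canonical period 48 * 2**n.
    Assumed rule: the largest canonical period that does not exceed the
    schedule, with 48s as the floor for shorter schedules."""
    if schedule_s < 48:
        return 48                  # floor: shortest canonical period
    period = 48
    while period * 2 <= schedule_s:
        period *= 2
    return period

print([snap_to_canonical(s) for s in (30, 60, 100, 600)])  # [48, 48, 96, 384]
```

Snapping many STs onto the same small family of periods lets the scheduler batch refreshes that fall due at the same tick, instead of servicing an arbitrary spread of intervals.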
Shared Memory (src/shmem.rs)
The scheduler background worker and user sessions share a PgStreamSharedState structure protected by a PgLwLock. Key fields:
| Field | Type | Purpose |
|---|---|---|
| dag_version | u64 | Incremented when the ST catalog changes; used by the scheduler to detect when the DAG needs rebuilding. |
| scheduler_pid | i32 | PID of the scheduler background worker (0 if not running). |
| scheduler_running | bool | Whether the scheduler is active. |
| last_scheduler_wake | i64 | Unix timestamp of the last scheduler wake cycle (for monitoring). |
A separate PgAtomic<AtomicU64> named DAG_REBUILD_SIGNAL is incremented by API functions (create, alter, drop) after catalog mutations. The scheduler compares its local copy against the atomic counter to detect when to rebuild its in-memory DAG without holding a lock.
A second PgAtomic<AtomicU64> named CACHE_GENERATION tracks DDL events that may invalidate cached delta or MERGE templates across backends. When DDL hooks fire (view change, ALTER TABLE, function change) or API functions mutate the catalog, CACHE_GENERATION is bumped. Each backend maintains a thread-local generation counter; on the next refresh, if the shared generation has advanced, the backend flushes its delta template cache, MERGE template cache, and prepared statements.
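The generation-counter protocol can be sketched as follows — a single-process model in which plain objects stand in for shared memory and per-backend state:

```python
class SharedGeneration:
    """Stand-in for the shared CACHE_GENERATION atomic: a counter bumped
    whenever DDL hooks fire or API functions mutate the catalog."""
    def __init__(self):
        self.current = 0

    def bump(self):                 # called from DDL hooks / API functions
        self.current += 1

class Backend:
    """Stand-in for one backend: keeps a local generation and a template
    cache that is flushed lazily when the shared generation has advanced."""
    def __init__(self, shared):
        self.shared = shared
        self.local_gen = shared.current
        self.delta_cache = {}

    def get_template(self, key, build):
        if self.local_gen != self.shared.current:   # DDL happened somewhere
            self.delta_cache.clear()                # flush stale templates
            self.local_gen = self.shared.current
        return self.delta_cache.setdefault(key, build())

shared = SharedGeneration()
backend = Backend(shared)
backend.get_template('st_1', lambda: 'DELTA SQL v1')   # cached
shared.bump()                                          # e.g. ALTER TABLE fired
print(backend.get_template('st_1', lambda: 'DELTA SQL v2'))  # DELTA SQL v2
```

The key property is that invalidation costs nothing at DDL time — one counter increment — while each backend pays the flush only on its next cache access.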
9. DDL Tracking (src/hooks.rs)
Event triggers monitor DDL changes to source tables and functions:
- _on_ddl_end — Fires on ALTER TABLE to detect column adds/drops/type changes. If a source table used by a ST is altered, the ST's needs_reinit flag is set. Also detects CREATE OR REPLACE FUNCTION / ALTER FUNCTION — if the function appears in a ST's functions_used catalog column, the ST is marked for reinit.
- _on_sql_drop — Fires on DROP TABLE to set needs_reinit for affected STs. Also detects DROP FUNCTION and marks affected STs for reinit.
- Function name extraction — object_identity strings (e.g., public.my_func(integer, text)) are parsed to extract the bare function name, which is matched against the functions_used TEXT[] column in pgs_stream_tables.
Reinitialization is deferred until the next refresh cycle, which then performs a REINITIALIZE action (drop and recreate the storage table from the updated query).
10. Error Handling (src/error.rs)
Centralized error types using thiserror:
- PgStreamError variants cover catalog access, SQL execution, CDC, DVM, DAG, and config errors.
- Each refresh failure increments consecutive_errors.
- When consecutive_errors reaches pg_stream.max_consecutive_errors (default 3), the ST is moved to ERROR status and suspended from automatic refresh.
- Manual intervention (ALTER ... status => 'ACTIVE') resets the counter.
11. Monitoring (src/monitor.rs)
Provides observability functions:
- st_refresh_stats — Aggregate statistics (total/successful/failed refreshes, avg duration, staleness status).
- get_refresh_history — Per-ST audit trail.
- get_staleness — Current staleness in seconds.
- slot_health — Checks replication slot state and WAL retention.
- check_cdc_health — Per-source CDC health status including mode, slot lag, confirmed LSN, and alerts.
- explain_st — Describes the DVM plan for a given ST.
- Views — pgstream.stream_tables_info (computed staleness) and pgstream.pg_stat_stream_tables (combined stats).
NOTIFY Alerting
Operational events are broadcast via PostgreSQL NOTIFY on the pg_stream_alert channel. Clients can subscribe with LISTEN pg_stream_alert; and receive JSON-formatted events:
| Event | Condition |
|---|---|
| stale | Data staleness exceeds 2× schedule |
| auto_suspended | ST suspended after pg_stream.max_consecutive_errors failures |
| reinitialize_needed | Upstream DDL change detected |
| slot_lag_warning | Replication slot WAL retention is growing |
| cdc_transition_complete | Source transitioned from trigger to WAL-based CDC |
| cdc_transition_failed | Trigger→WAL transition failed (fell back to triggers) |
| refresh_completed | Refresh completed successfully |
| refresh_failed | Refresh failed with an error |
12. Row ID Hashing (src/hash.rs)
Provides deterministic 64-bit row identifiers using xxHash (xxh64) with a fixed seed. Two SQL functions are exposed:
- pgstream.pg_stream_hash(text) — Hash a single text value; used for simple single-column row IDs.
- pgstream.pg_stream_hash_multi(text[]) — Hash multiple values (separated by a record-separator byte \x1E) for composite keys (join row IDs, GROUP BY keys).
Row IDs are written into every stream table's storage as an internal __pgs_row_id BIGINT column and are used by the delta application phase to match DELETE candidates precisely.
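A sketch of the composite-key scheme. The real functions use seeded xxh64; this stand-in substitutes BLAKE2b truncated to 8 bytes, which has the same shape (deterministic bytes → signed 64-bit integer) but produces different hash values:

```python
import hashlib

RS = b"\x1e"   # ASCII record separator, as used by pg_stream_hash_multi

def row_id(values):
    """Derive a BIGINT-compatible row ID from one or more key values.
    The record separator prevents ('ab', 'c') and ('a', 'bc') from
    producing the same payload. Hash function is a stand-in for xxh64."""
    payload = RS.join(str(v).encode() for v in values)
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    n = int.from_bytes(digest, "big")
    # Wrap unsigned 64-bit into PostgreSQL's signed BIGINT range.
    return n - (1 << 64) if n >= (1 << 63) else n

# Composite GROUP BY key (region, month) -> stable __pgs_row_id
print(row_id(["emea", "2024-06"]) == row_id(["emea", "2024-06"]))  # True
print(row_id(["emea", "2024-06"]) == row_id(["emea2024-06"]))      # False
```

Determinism is the point: the delta query and the stored rows compute the same ID independently, so DELETE candidates can be matched on a single indexed BIGINT column.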
13. Configuration (src/config.rs)
Twelve GUC (Grand Unified Configuration) variables control runtime behavior, plus four performance-tuning GUCs. See CONFIGURATION.md for details.
| GUC | Default | Purpose |
|---|---|---|
| pg_stream.enabled | true | Master on/off switch for the scheduler |
| pg_stream.scheduler_interval_ms | 1000 | Scheduler background worker wake interval (ms) |
| pg_stream.min_schedule_seconds | 60 | Minimum allowed schedule |
| pg_stream.max_consecutive_errors | 3 | Errors before auto-suspending a ST |
| pg_stream.change_buffer_schema | pgstream_changes | Schema for change buffer tables |
| pg_stream.max_concurrent_refreshes | 4 | Maximum parallel refresh workers |
| pg_stream.differential_max_change_ratio | 0.15 | Change-to-table-size ratio above which DIFFERENTIAL falls back to FULL |
| pg_stream.cleanup_use_truncate | true | Use TRUNCATE instead of DELETE for change buffer cleanup when the entire buffer is consumed |
| pg_stream.user_triggers | 'auto' | User-defined trigger handling: auto / on / off |
| pg_stream.block_source_ddl | false | Block column-affecting DDL on tracked source tables instead of reinit |
| pg_stream.cdc_mode | 'trigger' | CDC mechanism: trigger / auto / wal |
| pg_stream.wal_transition_timeout | 300 | Max seconds to wait for WAL decoder catch-up during transition |
| pg_stream.merge_planner_hints | true | Inject SET LOCAL planner hints (disable nestloop, raise work_mem) before MERGE |
| pg_stream.merge_work_mem_mb | 64 | work_mem (MB) applied when delta exceeds 10 000 rows and planner hints enabled |
| pg_stream.merge_strategy | 'auto' | Delta application strategy: auto / merge / delete_insert |
| pg_stream.use_prepared_statements | true | Use SQL PREPARE/EXECUTE for cached MERGE templates |
Data Flow: End-to-End Refresh
Source Table INSERT/UPDATE/DELETE
│
▼
Hybrid CDC Layer:
┌─────────────────────────────────────────────┐
│ TRIGGER mode: Row-Level AFTER Trigger │
│ pg_stream_cdc_fn_<oid>() → buffer table │
│ │
│ WAL mode: Logical Replication Slot │
│ wal_decoder bgworker → same buffer table │
└─────────────────────────────────────────────┘
│
▼
Change Buffer Table (pgstream_changes.changes_<oid>)
Columns: change_id, lsn, xid, action (I/U/D), row_data (jsonb)
│
▼
DVM Engine: generate delta SQL from operator tree
- Scan operator reads from change buffer
- Filter/Project/Join transform the deltas
- Aggregate recomputes affected groups
│
▼
Diff Engine: produce (+/-) diff rows
│
▼
Delta Application:
DELETE FROM storage WHERE __pgs_row_id IN (removed)
INSERT INTO storage SELECT ... FROM (added)
│
▼
Frontier Update: advance per-source LSN
│
▼
History Record: log to pgstream.pgs_refresh_history
Module Map
src/
├── lib.rs # Extension entry, module declarations, _PG_init
├── bin/
│ └── pgrx_embed.rs # pgrx SQL entity embedding (generated)
├── api.rs # SQL API functions (create/alter/drop/refresh/status)
├── catalog.rs # Catalog CRUD operations
├── cdc.rs # Change data capture (triggers + WAL transition)
├── config.rs # GUC variable registration
├── dag.rs # Dependency graph (cycle detection, topo sort)
├── error.rs # Centralized error types
├── hash.rs # xxHash row ID generation (pg_stream_hash / pg_stream_hash_multi)
├── hooks.rs # DDL event trigger handlers (_on_ddl_end, _on_sql_drop)
├── shmem.rs # Shared memory state (PgStreamSharedState, DAG_REBUILD_SIGNAL, CACHE_GENERATION)
├── dvm/
│ ├── mod.rs # DVM module root + recursive CTE orchestration
│ ├── parser.rs # Query → OpTree converter (CTE extraction, subquery, window support)
│ ├── diff.rs # Delta SQL generation (CTE delta cache)
│ ├── row_id.rs # Row ID generation
│ └── operators/
│ ├── mod.rs # Operator trait + registry
│ ├── scan.rs # Table scan (CDC passthrough)
│ ├── filter.rs # WHERE clause filtering
│ ├── project.rs # Column projection
│ ├── join.rs # Inner join
│ ├── join_common.rs # Shared join utilities (snapshot subqueries, column disambiguation)
│ ├── outer_join.rs # LEFT/RIGHT outer join
│ ├── full_join.rs # FULL OUTER JOIN (8-part delta)
│ ├── aggregate.rs # GROUP BY + aggregate functions (39 AggFunc variants)
│ ├── distinct.rs # DISTINCT deduplication
│ ├── union_all.rs # UNION ALL merging
│ ├── intersect.rs # INTERSECT / INTERSECT ALL (dual-count LEAST)
│ ├── except.rs # EXCEPT / EXCEPT ALL (dual-count GREATEST)
│ ├── subquery.rs # Subquery / inlined CTE delegation
│ ├── cte_scan.rs # Shared CTE delta (multi-reference)
│ ├── recursive_cte.rs # Recursive CTE (semi-naive + DRed + recomputation)
│ ├── window.rs # Window function (partition recomputation)
│ ├── lateral_function.rs # LATERAL SRF (row-scoped recomputation)
│ ├── lateral_subquery.rs # LATERAL correlated subquery
│ ├── semi_join.rs # EXISTS / IN subquery (semi-join delta)
│ ├── anti_join.rs # NOT EXISTS / NOT IN subquery (anti-join delta)
│ └── scalar_subquery.rs # Correlated scalar subquery in SELECT
├── monitor.rs # Monitoring & observability functions
├── refresh.rs # Refresh orchestration
├── scheduler.rs # Automatic scheduling with canonical periods
├── version.rs # Frontier / LSN tracking
└── wal_decoder.rs # WAL-based CDC (logical replication slot polling, transitions)
Extension Control File (pg_stream.control)
The pg_stream.control file in the repository root is required by PostgreSQL's
extension infrastructure. It declares the extension's description, default
version, shared-library path, and privilege requirements. PostgreSQL reads this
file when CREATE EXTENSION pg_stream; is executed.
During packaging (cargo pgrx package), pgrx replaces the @CARGO_VERSION@
placeholder with the version from Cargo.toml and copies the file into the
target's share/extension/ directory alongside the SQL migration scripts.
DVM Operators
This document describes the Differential View Maintenance (DVM) operators implemented by pgstream. Each operator transforms a stream of row-level changes (deltas) propagated from source tables through the operator tree.
Prior Art
- Budiu, M. et al. (2023). "DBSP: Automatic Incremental View Maintenance." VLDB 2023.
- Gupta, A. & Mumick, I.S. (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press.
- Koch, C. et al. (2014). "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views." VLDB Journal.
- PostgreSQL 9.4+ — Materialized views with `REFRESH MATERIALIZED VIEW CONCURRENTLY`.
Overview
When a stream table is created, the defining SQL query is parsed into a tree of DVM operators. During a differential refresh, changes flow bottom-up through this tree:
Aggregate
│
Project
│
Filter
│
┌───────┴───────┐
Join │
┌─┴─┐ │
Scan(A) Scan(B) Scan(C)
Each operator implements a differentiation rule: given the delta (Δ) to its input(s), it produces the corresponding delta to its output. This is conceptually similar to automatic differentiation in calculus.
The general contract:
- Input: a set of `('+', row)` and `('-', row)` tuples (inserts and deletes)
- Output: a set of `('+', row)` and `('-', row)` tuples
Updates are modeled as a delete of the old row followed by an insert of the new row.
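The contract above can be sketched in a few lines of Python (illustrative only — pg_stream generates SQL against tables; the `Counter` multiset and the `apply_delta` helper are this example's own):

```python
# Illustrative sketch (not pg_stream code): a relation as a multiset of
# rows, a delta as a list of ('+', row) / ('-', row) tuples, and an
# update modeled as a delete of the old row plus an insert of the new row.
from collections import Counter

def apply_delta(relation: Counter, delta) -> Counter:
    """Return the new multiset state R' = R + ΔR."""
    new = relation.copy()
    for op, row in delta:
        if op == '+':
            new[row] += 1
        else:
            new[row] -= 1
            if new[row] == 0:
                del new[row]
    return new

orders = Counter({(1, 'active'): 1})
# UPDATE orders SET status = 'shipped' WHERE id = 1, as delete + insert:
delta = [('-', (1, 'active')), ('+', (1, 'shipped'))]
assert apply_delta(orders, delta) == Counter({(1, 'shipped'): 1})
```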
Operators
Scan
Module: src/dvm/operators/scan.rs
The leaf operator. Reads CDC changes from a source table's change buffer.
Delta Rule:
$$\Delta(\text{Scan}(R)) = \Delta R$$
The scan operator is a direct passthrough — inserts in the source become inserts in the output, deletes become deletes.
SQL Generation:
SELECT op, row_data FROM pgstream_changes.changes_<oid>
WHERE xid >= <last_consumed_xid>
Notes:
- Each source table has a dedicated change buffer table created by the CDC module.
- Row data is stored as JSONB with column names as keys.
- The `__pgs_row_id` column (xxHash of the primary key) is included for deduplication.
Filter
Module: src/dvm/operators/filter.rs
Applies a WHERE clause predicate to the delta stream.
Delta Rule:
$$\Delta(\sigma_p(R)) = \sigma_p(\Delta R)$$
Filtering is applied to the deltas in the same way as to the base data — only rows satisfying the predicate pass through.
SQL Generation:
SELECT * FROM (<input_delta>) AS d
WHERE <predicate>
Example:
If the defining query is:
SELECT * FROM orders WHERE status = 'shipped'
And a new row (id=5, status='pending') is inserted, it does not appear in the delta output. If (id=3, status='shipped') is inserted, it passes through.
Edge Cases:
- For updates that change the predicate column (e.g., `status` from `'pending'` to `'shipped'`), the CDC produces a delete of the old row and an insert of the new row. The filter passes the insert (which matches) and blocks the delete (the old row does not match the predicate), correctly resulting in a net insert.
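The edge case above is easy to verify with a small simulation (illustrative only — the extension emits SQL, not Python; `filter_delta` is this example's name for the rule $\Delta(\sigma_p(R)) = \sigma_p(\Delta R)$):

```python
# Illustrative sketch (not pg_stream code): the filter delta rule.
# An update that flips the predicate column nets out to a single insert.
def filter_delta(delta, predicate):
    return [(op, row) for op, row in delta if predicate(row)]

shipped = lambda row: row[1] == 'shipped'
# status changes 'pending' -> 'shipped': CDC emits delete(old) + insert(new)
delta = [('-', (3, 'pending')), ('+', (3, 'shipped'))]
assert filter_delta(delta, shipped) == [('+', (3, 'shipped'))]
```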
Project
Module: src/dvm/operators/project.rs
Applies column projection from the target list.
Delta Rule:
$$\Delta(\pi_L(R)) = \pi_L(\Delta R)$$
Projects the same columns from the delta that the query projects from the base data.
SQL Generation:
SELECT <target_columns> FROM (<input_delta>) AS d
Notes:
- Projection is applied after filtering for efficiency.
- Computed expressions in the target list (e.g., `price * quantity AS total`) are evaluated on the delta rows.
Join (Inner)
Module: src/dvm/operators/join.rs
Implements inner join between two inputs.
Delta Rule:
For $R \bowtie S$:
$$\Delta(R \bowtie S) = (\Delta R \bowtie S) \cup (R' \bowtie \Delta S)$$
Where $R' = R \cup \Delta R$ (the new state of R after applying deltas).
In practice, when only one side has changes (common case), the delta join simplifies to joining the changed rows against the current state of the other side.
SQL Generation:
-- Changes to left side joined with current right side
SELECT '+' AS op, l.*, r.*
FROM (<left_delta> WHERE op = '+') AS l
JOIN <right_table> AS r ON <join_condition>
UNION ALL
-- Current left side joined with changes to right side
SELECT '+' AS op, l.*, r.*
FROM <left_table> AS l
JOIN (<right_delta> WHERE op = '+') AS r ON <join_condition>
(And corresponding DELETE queries for op = '-'.)
Notes:
- The join uses the current state of the non-changed side, not the change buffer.
- For equi-joins, this is efficient — the join key narrows the scan.
- Non-equi joins (theta joins) may require broader scans.
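The bilinear expansion can be simulated directly on in-memory rows (illustrative only — pg_stream produces the UNION ALL SQL shown above; `join_delta` and its argument names are inventions of this sketch):

```python
# Illustrative sketch (not pg_stream code) of the inner-join delta rule
# Δ(R ⋈ S) = (ΔR ⋈ S) ∪ (R' ⋈ ΔS), where R' = R + ΔR.
def join_delta(delta_r, s_rows, r_new, delta_s, on):
    out = []
    for op, r in delta_r:                  # ΔR joined with current S
        out += [(op, r + s) for s in s_rows if on(r, s)]
    for op, s in delta_s:                  # new state R' joined with ΔS
        out += [(op, r + s) for r in r_new if on(r, s)]
    return out

on_key = lambda r, s: r[0] == s[0]         # equi-join on the first column
s_rows = [(1, 'alice')]                    # current state of S
delta_r = [('+', (1, 100))]                # one new order for customer 1
r_new = [(1, 100)]                         # R' = R ∪ ΔR (R was empty)
assert join_delta(delta_r, s_rows, r_new, [], on_key) == [('+', (1, 100, 1, 'alice'))]
```

Note that only one row's worth of join work is done: the unchanged side is probed through the join key, never scanned in full.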
Outer Join
Module: src/dvm/operators/outer_join.rs (LEFT JOIN), src/dvm/operators/full_join.rs (FULL JOIN)
Implements LEFT, RIGHT, and FULL OUTER JOIN.
RIGHT JOIN Handling:
RIGHT JOIN is automatically converted to a LEFT JOIN with swapped left/right operands during query parsing. This normalization happens transparently — the user can write RIGHT JOIN and the parser rewrites it to an equivalent LEFT JOIN before the operator tree is constructed.
Delta Rule:
Similar to inner join, but additionally handles NULL-padded rows:
$$\Delta(R \text{ LEFT JOIN } S) = (\Delta R \bowtie_L S) \cup (R' \bowtie_L \Delta S)$$
With special handling for:
- Rows in ΔR that have no match in S → emit `('+', row, NULLs)`
- Rows in ΔS that create a first match for an R row → emit `('-', row, NULLs)` and `('+', row, s_data)`
- Rows in ΔS that remove the last match for an R row → emit `('-', row, s_data)` and `('+', row, NULLs)`
SQL Generation (LEFT JOIN):
Uses anti-join detection (via NOT EXISTS) to correctly handle the NULL padding transitions.
FULL OUTER JOIN Delta Rule:
FULL OUTER JOIN extends the LEFT JOIN delta with symmetric right-side handling. The delta is computed as an 8-part UNION ALL:
- Parts 1–5: Same as LEFT JOIN delta (inserted/deleted rows from both sides, with NULL-padding transitions)
- Parts 6–7: Symmetric anti-join transitions for the right side (rows in ΔL that remove/create the last/first match for an S row)
- Part 8: Right-side insertions that have no match in the left side → emit `('+', NULLs, s_data)`
Each part uses pre-computed delta flags (`__has_ins_*`, `__has_del_*`) to efficiently detect first-match/last-match transitions without redundant subqueries.
Nested Join Support:
Module: src/dvm/operators/join_common.rs
All join operators (inner, left, full) support nested children — i.e., a join whose left or right operand is itself another join. The join_common module provides shared helpers:
- `build_snapshot_sql()` — returns the table reference for simple (Scan) operands, or a parenthesized subquery with disambiguated columns for nested join operands
- `rewrite_join_condition()` — rewrites column references in ON conditions to use the correct alias prefixes for nested children (e.g., `o.cust_id` → `dl.o__cust_id`)
This enables queries with 3 or more joined tables, e.g.:
SELECT o.id, c.name, p.title
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id
Limitations:
- FULL OUTER JOIN delta computation can be expensive due to dual-side NULL tracking (8 UNION ALL parts).
- Performance degrades with high-cardinality join keys.
- `NATURAL JOIN` is supported — common columns are resolved automatically and synthesized into an explicit equi-join condition.
Aggregate
Module: src/dvm/operators/aggregate.rs
Handles GROUP BY with aggregate functions (COUNT, SUM, AVG, MIN, MAX, BOOL_AND, BOOL_OR, STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG, BIT_AND, BIT_OR, BIT_XOR, JSON_OBJECT_AGG, JSONB_OBJECT_AGG, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, MODE, PERCENTILE_CONT, PERCENTILE_DISC, JSON_ARRAYAGG, JSON_OBJECTAGG) and the FILTER (WHERE …) and WITHIN GROUP (ORDER BY …) clauses.
Delta Rule:
$$\Delta(\gamma_{G, \text{agg}}(R)) = \gamma_{G, \text{agg}}(R' \text{ WHERE } G \in \text{affected\_keys}) - \gamma_{G, \text{agg}}(R \text{ WHERE } G \in \text{affected\_keys})$$
Where:
- $G$ = grouping columns
- `affected_keys` = the set of group key values that appear in ΔR
- $R'$ = $R \cup \Delta R$ (the new state)
Strategy:
- Identify affected groups — Collect all group key values that appear in the delta (either inserted or deleted rows).
- Recompute old values — Query the storage table for current aggregate values of affected groups.
- Recompute new values — Query the updated source for new aggregate values of affected groups.
- Diff — For each affected group:
  - If old exists and new differs → emit `('-', old)` and `('+', new)`
  - If old exists and new is gone → emit `('-', old)` (group eliminated)
  - If no old and new exists → emit `('+', new)` (new group appeared)
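The identify/recompute/diff strategy can be sketched for a SUM aggregate (illustrative only — pg_stream does this in SQL against the storage table; `aggregate_delta` and the `(group, value)` row shape are this example's own):

```python
# Illustrative sketch (not pg_stream code): recompute only the affected
# groups in the old and new states, then diff them into +/- tuples.
def sum_by_group(rows, keys):
    out = {}
    for g, v in rows:
        if g in keys:
            out[g] = out.get(g, 0) + v
    return out

def aggregate_delta(old_rows, new_rows, delta):
    affected = {g for _, (g, _) in delta}       # group keys seen in ΔR
    old = sum_by_group(old_rows, affected)      # old aggregate values
    new = sum_by_group(new_rows, affected)      # new aggregate values
    out = []
    for g in affected:
        if g in old and g in new and old[g] != new[g]:
            out += [('-', (g, old[g])), ('+', (g, new[g]))]
        elif g in old and g not in new:
            out.append(('-', (g, old[g])))      # group eliminated
        elif g not in old and g in new:
            out.append(('+', (g, new[g])))      # new group appeared
    return out

old = [('eu', 5), ('us', 7)]
new = [('eu', 5), ('us', 7), ('us', 3)]         # one row inserted into 'us'
assert aggregate_delta(old, new, [('+', ('us', 3))]) == [('-', ('us', 7)), ('+', ('us', 10))]
```

Only group `'us'` is touched; group `'eu'` is never recomputed.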
Supported Aggregate Functions:
| Function | DVM Strategy | Notes |
|---|---|---|
| `COUNT(*)` | Algebraic | Fully differential |
| `COUNT(expr)` | Algebraic | Fully differential |
| `SUM(expr)` | Algebraic | Fully differential |
| `AVG(expr)` | Algebraic | Decomposed to SUM/COUNT internally |
| `MIN(expr)` | Semi-algebraic | Uses LEAST merge; falls back to per-group rescan when the min row is deleted |
| `MAX(expr)` | Semi-algebraic | Uses GREATEST merge; falls back to per-group rescan when the max row is deleted |
| `BOOL_AND(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `BOOL_OR(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `STRING_AGG(expr, sep)` | Group-rescan | Affected groups are re-aggregated from source data |
| `ARRAY_AGG(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `JSON_AGG(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `JSONB_AGG(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `BIT_AND(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `BIT_OR(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `BIT_XOR(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `JSON_OBJECT_AGG(key, value)` | Group-rescan | Affected groups are re-aggregated from source data |
| `JSONB_OBJECT_AGG(key, value)` | Group-rescan | Affected groups are re-aggregated from source data |
| `STDDEV_POP(expr)` / `STDDEV(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `STDDEV_SAMP(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `VAR_POP(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `VAR_SAMP(expr)` / `VARIANCE(expr)` | Group-rescan | Affected groups are re-aggregated from source data |
| `MODE() WITHIN GROUP (ORDER BY expr)` | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| `PERCENTILE_CONT(frac) WITHIN GROUP (ORDER BY expr)` | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| `PERCENTILE_DISC(frac) WITHIN GROUP (ORDER BY expr)` | Group-rescan | Ordered-set aggregate; affected groups re-aggregated |
| `CORR(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `COVAR_POP(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `COVAR_SAMP(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_AVGX(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_AVGY(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_COUNT(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_INTERCEPT(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_R2(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_SLOPE(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_SXX(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_SXY(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `REGR_SYY(Y, X)` | Group-rescan | Regression aggregate; affected groups re-aggregated |
| `ANY_VALUE(expr)` | Group-rescan | PostgreSQL 16+; affected groups re-aggregated |
| `JSON_ARRAYAGG(expr ...)` | Group-rescan | SQL-standard JSON aggregation (PostgreSQL 16+); full deparsed SQL preserved |
| `JSON_OBJECTAGG(key: value ...)` | Group-rescan | SQL-standard JSON aggregation (PostgreSQL 16+); full deparsed SQL preserved |
FILTER Clause:
All aggregate functions support the FILTER (WHERE …) clause:
SELECT COUNT(*) FILTER (WHERE status = 'active') AS active_count FROM orders GROUP BY region
The filter predicate is applied within the delta computation — only rows matching the filter contribute to the aggregate delta. Filtered aggregates are excluded from the P5 direct-bypass optimization.
SQL Generation:
The aggregate operator uses a 3-CTE pipeline:
- Merge CTE — Joins affected group keys against old (storage) and new (source) aggregate values, producing `__pgs_meta_action` (`'I'` for new-only groups, `'D'` for disappeared groups, `'U'` for changed groups).
- LATERAL VALUES expansion — A single-pass `LATERAL (VALUES ...)` clause expands each merge row into insert and delete actions, avoiding a 4-branch UNION ALL:
FROM merge_cte m,
LATERAL (VALUES
('I', m.new_count, m.new_total),
('D', m.old_count, m.old_total)
) v(action, count_val, val_total)
WHERE (m.__pgs_meta_action = 'I' AND v.action = 'I')
OR (m.__pgs_meta_action = 'D' AND v.action = 'D')
OR (m.__pgs_meta_action = 'U')
- Final projection — Emits `('+', row)` and `('-', row)` tuples for the refresh engine.
MIN/MAX Merge Strategy:
MIN and MAX use a semi-algebraic strategy with two cases:
- Non-extremum deletion — When the deleted row is NOT the current minimum (or maximum), the merge uses `LEAST(old_value, new_inserts)` for MIN or `GREATEST(old_value, new_inserts)` for MAX. This is fully algebraic and requires no rescan.
- Extremum deletion — When the row holding the current minimum (or maximum) IS deleted, the new value cannot be computed from the delta alone. The merge expression returns `NULL` as a sentinel, which triggers the change-detection guard (`IS DISTINCT FROM`) to emit the group for re-aggregation. The MERGE layer treats this as a DELETE + INSERT pair, recomputing the group from source data. This is still more efficient than a full table refresh, since only the affected groups are rescanned.
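The two MIN cases can be sketched as a tiny merge function (illustrative only — the real merge is a SQL expression; `merge_min` and the `None` stand-in for the NULL sentinel are this example's own):

```python
# Illustrative sketch (not pg_stream code) of the semi-algebraic MIN merge:
# LEAST(old_min, inserted values) when the deleted rows don't hold the
# minimum; a None sentinel (forcing a per-group rescan) when they do.
def merge_min(old_min, inserted, deleted):
    if old_min in deleted:
        return None                        # extremum deleted: rescan group
    return min([old_min] + inserted)       # algebraic LEAST merge

assert merge_min(5, [3, 9], deleted=[9]) == 3      # non-extremum deletion
assert merge_min(5, [7], deleted=[5]) is None      # minimum deleted: rescan
```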
Distinct
Module: src/dvm/operators/distinct.rs
Implements SELECT DISTINCT using reference counting.
Delta Rule:
$$\Delta(\delta(R)) = \{ r \in \Delta R : \text{count}(r, R) = 0 \land \text{count}(r, R') > 0 \} - \{ r \in \Delta R : \text{count}(r, R) > 0 \land \text{count}(r, R') = 0 \}$$
In other words:
- A row enters the output when its count transitions from 0 to ≥1
- A row leaves the output when its count transitions from ≥1 to 0
Strategy:
Maintains a hidden __pgs_dup_count column in the storage table to track how many times each distinct row appears in the pre-distinct input.
- On insert: increment the count. If the count was 0, emit `('+', row)`.
- On delete: decrement the count. If the count becomes 0, emit `('-', row)`.
Notes:
- The duplicate count is not visible in user queries against the storage table (projected away by the view layer).
- Duplicate counting uses `__pgs_row_id` (xxHash) for efficient lookups.
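The reference-counting rule is compact enough to simulate (illustrative only — the real counter lives in the hidden `__pgs_dup_count` column; `distinct_delta` and the in-memory `Counter` are this example's own):

```python
# Illustrative sketch (not pg_stream code) of DISTINCT via reference
# counting: a row enters the output on a 0 -> 1 transition of its
# duplicate count and leaves on a 1 -> 0 transition.
from collections import Counter

def distinct_delta(counts: Counter, delta):
    out = []
    for op, row in delta:
        if op == '+':
            counts[row] += 1
            if counts[row] == 1:
                out.append(('+', row))     # first copy: enters output
        else:
            counts[row] -= 1
            if counts[row] == 0:
                out.append(('-', row))     # last copy: leaves output
    return out

counts = Counter()
assert distinct_delta(counts, [('+', 'a'), ('+', 'a')]) == [('+', 'a')]
assert distinct_delta(counts, [('-', 'a')]) == []       # one copy remains
assert distinct_delta(counts, [('-', 'a')]) == [('-', 'a')]
```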
Union All
Module: src/dvm/operators/union_all.rs
Merges deltas from two branches.
Delta Rule:
$$\Delta(R \cup_{\text{all}} S) = \Delta R \cup_{\text{all}} \Delta S$$
Simply concatenates the delta streams from both branches.
SQL Generation:
SELECT * FROM (<left_delta>)
UNION ALL
SELECT * FROM (<right_delta>)
Notes:
- Column count and types must match between branches.
- Each branch is independently processed through its own operator sub-tree.
- This is the simplest operator, since `UNION ALL` preserves all duplicates.
Intersect
Module: src/dvm/operators/intersect.rs
Implements INTERSECT and INTERSECT ALL using dual-count per-branch multiplicity tracking.
Delta Rule:
$$\Delta(R \cap S): \text{emit rows where } \min(\text{count}_L, \text{count}_R) \text{ crosses the 0 boundary}$$
- INTERSECT (set): a row is present when both branches contain it.
- INTERSECT ALL (bag): a row appears $\min(\text{count}_L, \text{count}_R)$ times.
SQL Generation (3-CTE chain):
- Delta CTE — tags rows from the left/right child deltas with a branch indicator (`'L'`/`'R'`) and computes a per-row net_count.
- Merge CTE — joins with the storage table to compute old and new per-branch counts (`__pgs_count_l`, `__pgs_count_r`).
- Final CTE — detects boundary crossings using `LEAST(old_count_l, old_count_r)` vs `LEAST(new_count_l, new_count_r)`.
Notes:
- Storage table requires hidden columns `__pgs_count_l` and `__pgs_count_r` for multiplicity tracking.
- Both set and bag variants use the same 3-CTE structure and the same LEAST-based boundary logic.
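The boundary-crossing check reduces to a comparison of effective counts (illustrative only — the real check is the Final CTE's SQL; `intersect_delta` and its scalar-count signature are this example's own):

```python
# Illustrative sketch (not pg_stream code) of the dual-count INTERSECT
# delta: a row is emitted when LEAST(count_l, count_r) crosses 0.
def intersect_delta(old_l, old_r, new_l, new_r, row):
    old_eff = min(old_l, old_r)            # LEAST(old_count_l, old_count_r)
    new_eff = min(new_l, new_r)            # LEAST(new_count_l, new_count_r)
    if old_eff == 0 and new_eff > 0:
        return [('+', row)]                # row appears in the intersection
    if old_eff > 0 and new_eff == 0:
        return [('-', row)]                # row disappears
    return []

# Row exists only on the left; inserting it on the right makes it appear:
assert intersect_delta(1, 0, 1, 1, 'x') == [('+', 'x')]
# Deleting the last right-hand copy makes it disappear:
assert intersect_delta(1, 1, 1, 0, 'x') == [('-', 'x')]
```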
Except
Module: src/dvm/operators/except.rs
Implements EXCEPT and EXCEPT ALL using dual-count per-branch multiplicity tracking.
Delta Rule:
$$\Delta(R - S): \text{emit rows where } \max(0, \text{count}_L - \text{count}_R) \text{ crosses the 0 boundary}$$
- EXCEPT (set): a row is present when it exists in the left but not the right branch.
- EXCEPT ALL (bag): a row appears $\max(0, \text{count}_L - \text{count}_R)$ times.
SQL Generation (3-CTE chain):
- Delta CTE — same as Intersect: tags rows from both child deltas with branch indicator.
- Merge CTE — joins with storage table for old/new per-branch counts.
- Final CTE — detects boundary crossings using `GREATEST(0, old_count_l - old_count_r)` vs `GREATEST(0, new_count_l - new_count_r)`.
Notes:
- EXCEPT is not commutative — left branch is the positive input, right is subtracted.
- Storage table requires hidden columns `__pgs_count_l` and `__pgs_count_r`.
- Same 3-CTE structure as Intersect, with a different effective-count function.
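The EXCEPT variant swaps in the subtractive effective count (illustrative only — `except_delta` mirrors the sketch used for INTERSECT and is not extension code):

```python
# Illustrative sketch (not pg_stream code) of the EXCEPT delta: the
# effective count is GREATEST(0, count_l - count_r), and a row is
# emitted when that value crosses the 0 boundary.
def except_delta(old_l, old_r, new_l, new_r, row):
    old_eff = max(0, old_l - old_r)
    new_eff = max(0, new_l - new_r)
    if old_eff == 0 and new_eff > 0:
        return [('+', row)]
    if old_eff > 0 and new_eff == 0:
        return [('-', row)]
    return []

# Inserting the row on the right cancels it out of L - R:
assert except_delta(1, 0, 1, 1, 'x') == [('-', 'x')]
# Inserting it on the left (with no right-hand copy) makes it appear:
assert except_delta(0, 0, 1, 0, 'x') == [('+', 'x')]
```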
Subquery
Module: src/dvm/operators/subquery.rs
Handles both inlined CTEs and explicit subqueries in the FROM clause (`(SELECT ...) AS alias`).
Delta Rule:
$$\Delta(\rho_{\text{alias}}(Q)) = \rho_{\text{alias}}(\Delta Q)$$
A subquery wrapper is transparent for differentiation — it delegates to its child's delta and optionally renames output columns to match the subquery's column aliases.
SQL Generation:
-- If column aliases differ from child output columns:
SELECT __pgs_row_id, __pgs_action, child_col1 AS alias_col1, child_col2 AS alias_col2
FROM (<child_delta>)
If the child columns already match the aliases, the subquery is a pure passthrough — no additional CTE is emitted.
Notes:
- This operator enables both CTE support (Tier 1) and standalone subqueries in FROM.
- Column aliases on subqueries (`FROM (...) AS x(a, b)`) are handled by emitting a thin renaming CTE.
- The subquery body is fully differentiated as a normal operator sub-tree.
CTE Scan (Shared Delta)
Module: src/dvm/operators/cte_scan.rs
Handles multi-reference CTEs by computing the CTE body's delta once and reusing it across all references (Tier 2).
Delta Rule:
$$\Delta(\text{CteScan}(\text{id}, Q)) = \text{cache}[\text{id}] \quad \text{(computed once, reused)}$$
When a CTE is referenced multiple times in a query, each reference produces a CteScan node with the same cte_id. The diff engine differentiates the CTE body once and caches the result. Subsequent CteScan nodes for the same CTE reuse the cached delta.
SQL Generation:
-- First reference: differentiates the CTE body and stores result in cache
-- Subsequent references: point to the same system CTE name
SELECT __pgs_row_id, __pgs_action, <columns>
FROM __pgs_cte_<cte_name>_delta -- shared across all references
If column aliases are present, a thin renaming CTE is added on top of the cached delta.
Notes:
- Without CteScan (Tier 1), multi-reference CTEs are inlined: each reference duplicates the full operator sub-tree. CteScan (Tier 2) eliminates this duplication.
- The CTE body is pre-differentiated in dependency order (earlier CTEs before later ones that reference them).
- Column alias support follows the same pattern as the Subquery operator.
Recursive CTEs
Recursive CTEs (WITH RECURSIVE) are supported via two strategies depending on the refresh mode:
FULL Mode
Recursive CTEs work out-of-the-box with refresh_mode = 'FULL'. The defining query is executed as-is via INSERT INTO ... SELECT ..., and PostgreSQL handles the iterative evaluation internally.
DIFFERENTIAL Mode (Three-Strategy Incremental Maintenance)
Recursive CTEs with refresh_mode = 'DIFFERENTIAL' use an automatic three-strategy approach, selected based on column compatibility and change type:
Strategy 1: Semi-Naive Evaluation (INSERT-only changes)
When only INSERT changes are present in the change buffer, pg_stream uses semi-naive evaluation — the standard technique for incremental fixpoint computation. The base case is differentiated normally through the DVM operator tree, then the resulting delta is propagated through the recursive term using a nested WITH RECURSIVE:
WITH RECURSIVE
__pgs_base_delta AS (
-- Normal DVM differentiation of the base case (INSERT rows only)
<differentiated base case>
),
__pgs_rec_delta AS (
-- Seed: base case delta rows
SELECT cols FROM __pgs_base_delta WHERE __pgs_action = 'I'
UNION ALL
-- Seed: new base rows joining existing ST storage
SELECT cols FROM <recursive term with self_ref = ST_storage, base = change_buffer>
UNION ALL
-- Propagation: recursive term applied to growing delta
SELECT cols FROM <recursive term with self_ref = __pgs_rec_delta, base = full>
)
SELECT pgstream.pg_stream_hash(...) AS __pgs_row_id, 'I' AS __pgs_action, cols
FROM __pgs_rec_delta
The cost is proportional to the number of new rows produced by the change, not the full result set.
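Semi-naive evaluation can be sketched on a transitive-closure example (illustrative only — pg_stream emits the nested `WITH RECURSIVE` above; `semi_naive_delta` and the edge-set encoding are this example's own):

```python
# Illustrative sketch (not pg_stream code) of semi-naive evaluation for
# an INSERT-only change to a reachability query: join only the frontier
# of new facts against the existing closure, never re-deriving old rows.
def semi_naive_delta(closure, new_edges):
    delta = set(new_edges) - closure
    frontier = set(delta)
    while frontier:
        full = closure | delta
        # join the frontier with the full state on both sides
        step = {(a, d) for (a, b) in full for (c, d) in frontier if b == c}
        step |= {(a, d) for (a, b) in frontier for (c, d) in full if b == c}
        frontier = step - closure - delta  # only genuinely new facts
        delta |= frontier
    return delta                           # the INSERT portion of the delta

closure = {(1, 2), (2, 3), (1, 3)}         # reachability over 1->2->3
assert semi_naive_delta(closure, {(3, 4)}) == {(3, 4), (2, 4), (1, 4)}
```

The work done is proportional to the three new reachability facts, not the size of the existing closure.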
Strategy 2: Delete-and-Rederive / DRed (mixed INSERT/DELETE/UPDATE changes)
When the change buffer contains DELETE or UPDATE changes, simple propagation is insufficient — a deleted base row may have transitively derived many recursive rows, some of which may still be derivable from alternative paths. DRed handles this in four phases:
- Insert propagation — semi-naive evaluation for the INSERT portion (same as Strategy 1)
- Over-deletion cascade — propagate base-case deletions through the recursive term against ST storage to find all transitively-derived rows that might be invalidated
- Rederivation — re-execute the recursive CTE from the remaining (non-deleted) base rows to restore any over-deleted rows that have alternative derivations
- Combine — final delta = inserts + (over-deletions − rederived rows)
This avoids full recomputation while correctly handling deletions with alternative derivation paths.
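The over-delete/rederive phases can be sketched over reachability (illustrative only — a much-simplified stand-in for the SQL the extension generates; `dred` and its set-based phases are this example's own):

```python
# Illustrative sketch (not pg_stream code) of delete-and-rederive (DRed):
# over-delete every closure row derivable through the deleted base rows,
# then rederive rows that survive via an alternative derivation.
def dred(closure, edges, deleted):
    remaining = set(edges) - set(deleted)
    # Phase 2: over-deletion cascade
    over = set(deleted)
    while True:
        step = {(a, d) for (a, b) in over for (c, d) in closure if b == c}
        step |= {(a, d) for (a, b) in closure for (c, d) in over if b == c}
        new = step - over
        if not new:
            break
        over |= new
    kept = (closure - over) | remaining
    # Phase 3: rederivation, restricted to over-deleted rows
    while True:
        step = {(a, d) for (a, b) in kept for (c, d) in kept if b == c}
        new = (step & over) - kept
        if not new:
            break
        kept |= new
    return kept                            # new closure after the deletion

# Closure of edges 1->2 and 2->3, plus a direct edge 1->3:
closure = {(1, 2), (2, 3), (1, 3)}
out = dred(closure, edges={(1, 2), (2, 3), (1, 3)}, deleted={(2, 3)})
assert out == {(1, 2), (1, 3)}             # (1,3) survives via its direct edge
```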
Strategy 3: Recomputation Fallback
When the CTE defines more columns than the outer SELECT projects (column mismatch), the incremental strategies cannot be used because the ST storage table lacks columns needed for recursive self-joins. In this case, the full defining query is re-executed and anti-joined against current storage:
WITH __pgs_recomp_new AS (
SELECT pgstream.pg_stream_hash(row_to_json(sub)::text) AS __pgs_row_id, col1, col2, ...
FROM (<defining_query>) sub
),
__pgs_recomp_ins AS (
SELECT n.__pgs_row_id, 'I'::text AS __pgs_action, n.col1, n.col2, ...
FROM __pgs_recomp_new n
LEFT JOIN <storage_table> s ON s.__pgs_row_id = n.__pgs_row_id
WHERE s.__pgs_row_id IS NULL
),
__pgs_recomp_del AS (
SELECT s.__pgs_row_id, 'D'::text AS __pgs_action, s.col1, s.col2, ...
FROM <storage_table> s
LEFT JOIN __pgs_recomp_new n ON n.__pgs_row_id = s.__pgs_row_id
WHERE n.__pgs_row_id IS NULL
)
SELECT * FROM __pgs_recomp_ins
UNION ALL
SELECT * FROM __pgs_recomp_del
The cost is proportional to the full result set size.
Strategy Selection
| CTE columns match ST? | Change type | Strategy |
|---|---|---|
| ✅ Match | INSERT-only | Semi-naive (Strategy 1) |
| ✅ Match | Mixed (INSERT+DELETE/UPDATE) | DRed (Strategy 2) |
| ❌ Mismatch | Any | Recomputation (Strategy 3) |
Notes:
- Non-linear recursion (multiple self-references in the recursive term) is rejected — PostgreSQL restricts the recursive term to reference the CTE at most once.
- The `__pgs_row_id` column (xxHash of the JSON-serialized row) is used for row identity.
- For write-heavy workloads on very large recursive result sets with frequent mixed changes, `refresh_mode = 'FULL'` may still be more efficient than DRed.
Window Functions
Module: src/dvm/operators/window.rs
Handles window functions (ROW_NUMBER, RANK, DENSE_RANK, SUM() OVER, etc.) using partition-based recomputation.
Delta Rule:
When any row in a partition changes (insert, update, or delete), the entire partition's window function output is recomputed:
$$\Delta(\omega_{f, P}(R)) = \omega_{f, P}(R'|_{\text{affected partitions}}) - \omega_{f, P}(R|_{\text{affected partitions}})$$
Where $P$ is the PARTITION BY key and $f$ is the window function.
Strategy:
- Identify affected partition keys from the child delta.
- Delete old window function results for affected partitions from storage.
- Fetch surviving (unchanged) rows in affected partitions from the child.
- Recompute the window function on the current input for affected partitions.
- Emit the recomputed rows as inserts.
SQL Generation:
-- CTE 1: Affected partition keys from delta
WITH affected_partitions AS (
SELECT DISTINCT <partition_cols> FROM (<child_delta>)
),
-- CTE 2: Surviving rows (not in delta) for affected partitions
surviving AS (
SELECT * FROM <storage_table>
WHERE (<partition_cols>) IN (SELECT * FROM affected_partitions)
AND __pgs_row_id NOT IN (SELECT __pgs_row_id FROM (<child_delta>) WHERE __pgs_action = 'D')
),
-- CTE 3: Recompute window function
recomputed AS (
SELECT *, <window_func> OVER (PARTITION BY <partition_cols> ORDER BY <order_cols>) AS <alias>
FROM surviving
)
-- Delete old results + insert recomputed results
SELECT 'D' AS __pgs_action, ... -- old rows from affected partitions
UNION ALL
SELECT 'I' AS __pgs_action, ... -- recomputed rows
Notes:
- The cost is proportional to the size of affected partitions, not the full table. For workloads where changes spread across few partitions, this is efficient.
- All window expressions must share the same PARTITION BY / ORDER BY clause.
- Without PARTITION BY, the entire table is treated as a single partition — any change triggers a full recomputation.
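Partition-scoped recomputation can be sketched for `ROW_NUMBER()` (illustrative only — the real path is the CTE chain above; `window_delta` and the `(key, ord)` row shape are this example's own):

```python
# Illustrative sketch (not pg_stream code): rebuild only the partitions
# touched by the delta, emitting deletes for their old rows and inserts
# with freshly computed row numbers.
def window_delta(storage, rows, delta):
    affected = {row[0] for _, row in delta}        # partition keys in Δ
    deletes = [('-', r) for r in storage if r[0] in affected]
    recomputed = []
    for key in affected:
        part = sorted(r for r in rows if r[0] == key)   # ORDER BY ord
        recomputed += [('+', r + (i + 1,)) for i, r in enumerate(part)]
    return deletes + recomputed

storage = [('a', 10, 1), ('b', 5, 1)]   # stored (key, ord, row_number)
rows = [('a', 10), ('a', 7)]            # current input: new row in 'a'
delta = [('+', ('a', 7))]
out = window_delta(storage, rows, delta)
assert ('-', ('a', 10, 1)) in out                       # old 'a' rows deleted
assert ('+', ('a', 7, 1)) in out and ('+', ('a', 10, 2)) in out
assert all(r[0] != 'b' for _, r in out)                 # partition 'b' untouched
```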
Window Frame Clauses:
Window frame specifications are fully supported:
- Modes: `ROWS`, `RANGE`, `GROUPS`
- Bounds: `UNBOUNDED PRECEDING`, `N PRECEDING`, `CURRENT ROW`, `N FOLLOWING`, `UNBOUNDED FOLLOWING`
- Between syntax: `BETWEEN <start> AND <end>`
- Exclusion: `EXCLUDE CURRENT ROW`, `EXCLUDE GROUP`, `EXCLUDE TIES`, `EXCLUDE NO OTHERS`
Example: `SUM(val) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)`
Named WINDOW Clauses:
Named window definitions are resolved from the query-level WINDOW clause:
SELECT id, SUM(val) OVER w, AVG(val) OVER w
FROM data
WINDOW w AS (PARTITION BY category ORDER BY ts)
The parser resolves OVER w by looking up the window definition from the WINDOW clause and merging partition, order, and frame specifications.
Lateral Function (Set-Returning Functions in FROM)
Module: src/dvm/operators/lateral_function.rs
Handles set-returning functions (SRFs) used in the FROM clause with implicit LATERAL semantics: jsonb_array_elements, jsonb_each, jsonb_each_text, unnest, etc.
Delta Rule:
When a source row changes (insert, update, or delete), the SRF expansion is re-evaluated only for that source row:
$$\Delta(R \ltimes f(R.\text{col})) = (R' \ltimes f(R'.\text{col}))|_{\text{changed rows}} - (R \ltimes f(R.\text{col}))|_{\text{changed rows}}$$
Where $R$ is the source table, $f$ is the SRF, and changed rows are identified via the child delta.
Strategy (Row-Scoped Recomputation):
- Propagate the child delta to identify changed source rows.
- Find all existing ST rows derived from changed source rows (via column matching).
- Delete old SRF expansions for those source rows.
- Re-expand the SRF for inserted/updated source rows.
- Emit deletes + inserts as the final delta.
SQL Generation (4-CTE chain):
-- CTE 1: Changed source rows from child delta
WITH lat_changed AS (
SELECT DISTINCT "__pgs_row_id", "__pgs_action", <child_cols>
FROM <child_delta>
),
-- CTE 2: Old ST rows for changed source rows (to be deleted)
lat_old AS (
SELECT st."__pgs_row_id", st.<all_output_cols>
FROM <st_table> st
WHERE EXISTS (
SELECT 1 FROM lat_changed cs
WHERE st.<col1> IS NOT DISTINCT FROM cs.<col1>
AND st.<col2> IS NOT DISTINCT FROM cs.<col2>
...
)
),
-- CTE 3: Re-expand SRF for inserted/updated source rows
lat_expand AS (
SELECT pg_stream_hash(<all_cols>::text) AS "__pgs_row_id",
cs.<child_cols>, <srf_alias>.<srf_cols>
FROM lat_changed cs,
LATERAL <srf_function>(cs.<arg>) AS <srf_alias>
WHERE cs."__pgs_action" = 'I'
),
-- CTE 4: Final delta
lat_final AS (
SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols> FROM lat_old
UNION ALL
SELECT "__pgs_row_id", 'I' AS "__pgs_action", <cols> FROM lat_expand
)
Row Identity:
Content-based: hash(child_columns || srf_result_columns). This is stable as long as the same source row produces the same expanded values.
Supported SRFs:
| Function | Output Columns | Notes |
|---|---|---|
| `jsonb_array_elements(jsonb)` | `value` (jsonb) | Expands a JSONB array to rows |
| `jsonb_array_elements_text(jsonb)` | `value` (text) | Text variant |
| `jsonb_each(jsonb)` | `key` (text), `value` (jsonb) | Expands a JSONB object to key-value pairs |
| `jsonb_each_text(jsonb)` | `key` (text), `value` (text) | Text variant |
| `unnest(anyarray)` | Element type | Unnests PostgreSQL arrays |
| Custom SRFs | User-provided column aliases | `AS alias(col1, col2)` |
Notes:
- The cost is proportional to the number of changed source rows × average SRF expansion size, not the full table.
- `WITH ORDINALITY` is supported — adds a `bigint` ordinality column to the output.
- `ROWS FROM()` with multiple functions is not supported (rejected at parse time).
- Column aliases (e.g., `AS child(value)`) are used to determine output column names; for known SRFs without column aliases, the table alias name becomes the column name.
- JSON_TABLE (PostgreSQL 17+) — `JSON_TABLE(expr, path COLUMNS (...))` is modeled as a `LateralFunction` and uses the same row-scoped recomputation strategy. Supported column types: regular, EXISTS, formatted, and nested columns, with `ON ERROR`/`ON EMPTY` behaviors and `PASSING` clauses.
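Row-scoped recomputation for an unnest-style SRF can be sketched directly (illustrative only — the real path is the 4-CTE chain above; `lateral_delta`, the `(id, array)` row shape, and the `expand` callback are this example's own):

```python
# Illustrative sketch (not pg_stream code): delete the old expansions of
# every changed source row, then re-expand the inserted/updated versions.
def lateral_delta(storage, delta, expand):
    changed_ids = {row[0] for _, row in delta}     # changed source rows
    deletes = [('-', r) for r in storage if r[0] in changed_ids]
    inserts = [('+', (row[0], elem))
               for op, row in delta if op == '+'
               for elem in expand(row)]            # LATERAL re-expansion
    return deletes + inserts

unnest = lambda row: row[1]             # expand (id, array) -> elements
storage = [(1, 'x'), (1, 'y')]          # old expansion of source row 1
# Source row 1 updated from ['x', 'y'] to ['x', 'z'] (delete + insert):
delta = [('-', (1, ['x', 'y'])), ('+', (1, ['x', 'z']))]
out = lateral_delta(storage, delta, unnest)
assert out == [('-', (1, 'x')), ('-', (1, 'y')), ('+', (1, 'x')), ('+', (1, 'z'))]
```

Unchanged source rows never trigger re-expansion, which is where the changed-rows cost bound comes from.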
Lateral Subquery (Correlated Subqueries in FROM)
Module: src/dvm/operators/lateral_subquery.rs
Handles correlated subqueries used in the FROM clause with explicit or implicit LATERAL semantics: FROM t, LATERAL (SELECT ... WHERE ref = t.col) AS alias or FROM t LEFT JOIN LATERAL (...) AS alias ON true.
Delta Rule:
When an outer row changes, the correlated subquery is re-executed only for that row:
$$\Delta(R \ltimes Q(R)) = (R' \ltimes Q(R'))|_{\text{changed rows}} - (R \ltimes Q(R))|_{\text{changed rows}}$$
Where $R$ is the outer table, $Q(R)$ is the correlated subquery, and changed rows are identified via the child delta.
Strategy (Row-Scoped Recomputation):
- Propagate the child delta to identify changed outer rows.
- Find all existing ST rows derived from changed outer rows (via column matching with `IS NOT DISTINCT FROM`).
- Delete old subquery expansions for those outer rows.
- Re-execute the subquery for inserted/updated outer rows using the original outer alias.
- Emit deletes + inserts as the final delta.
SQL Generation (4-CTE chain):
-- CTE 1: Changed outer rows from child delta
WITH lat_sq_changed AS (
SELECT DISTINCT "__pgs_row_id", "__pgs_action", <child_cols>
FROM <child_delta>
),
-- CTE 2: Old ST rows for changed outer rows (to be deleted)
lat_sq_old AS (
SELECT st."__pgs_row_id", st.<all_output_cols>
FROM <st_table> st
WHERE EXISTS (
SELECT 1 FROM lat_sq_changed cs
WHERE st.<col1> IS NOT DISTINCT FROM cs.<col1>
AND st.<col2> IS NOT DISTINCT FROM cs.<col2>
...
)
),
-- CTE 3: Re-execute subquery for inserted/updated outer rows
lat_sq_expand AS (
SELECT pg_stream_hash(<all_cols>::text) AS "__pgs_row_id",
<outer_alias>.<child_cols>, <sub_alias>.<sub_cols>
FROM lat_sq_changed AS <outer_alias>, -- Original outer alias!
LATERAL (<subquery_sql>) AS <sub_alias>
WHERE <outer_alias>."__pgs_action" = 'I'
),
-- CTE 4: Final delta
lat_sq_final AS (
SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols> FROM lat_sq_old
UNION ALL
SELECT "__pgs_row_id", 'I' AS "__pgs_action", <cols> FROM lat_sq_expand
)
LEFT JOIN LATERAL Handling:
For queries using LEFT JOIN LATERAL (...) ON true, the expand CTE uses LEFT JOIN LATERAL instead of comma syntax and wraps subquery columns in COALESCE for hash stability:
lat_sq_expand AS (
SELECT pg_stream_hash(<outer_cols>::text || '/' || COALESCE(<sub_cols>::text, '')) AS "__pgs_row_id",
<outer_alias>.<child_cols>, <sub_alias>.<sub_cols>
FROM lat_sq_changed AS <outer_alias>
LEFT JOIN LATERAL (<subquery_sql>) AS <sub_alias> ON true
WHERE <outer_alias>."__pgs_action" = 'I'
)
Row Identity:
Content-based: hash(outer_columns || '/' || subquery_result_columns). For LEFT JOIN with NULL results, COALESCE ensures a stable hash.
Supported Patterns:
| Pattern | Syntax | Notes |
|---|---|---|
| Top-N per group | `LATERAL (SELECT ... ORDER BY ... LIMIT N)` | Most common use case |
| Correlated aggregate | `LATERAL (SELECT SUM(x) FROM t WHERE t.fk = p.pk)` | Returns a single row per outer row |
| Existence with data | `LEFT JOIN LATERAL (...) ON true` | Preserves outer rows with NULLs |
| Multi-column lookup | `LATERAL (SELECT a, b FROM t WHERE t.fk = p.pk LIMIT 1)` | Multiple derived values |
| GROUP BY inside subquery | `LATERAL (SELECT type, COUNT(*) FROM t WHERE t.fk = p.pk GROUP BY type)` | Multiple rows per outer row |
Key Design Decision: Outer Alias Rewriting
The subquery body contains column references to the outer table (e.g., WHERE li.order_id = o.id). In the expansion CTE, the changed-sources CTE is aliased with the original outer table alias (e.g., lat_sq_changed AS o) so that the subquery's column references resolve naturally without rewriting.
Notes:
- The cost is proportional to the number of changed outer rows × average subquery result size, not the full table.
- The subquery is stored as raw SQL (like `LateralFunction`) because it cannot be independently differentiated — it depends on outer row context.
- Source table OIDs referenced by the subquery body are extracted at parse time for CDC trigger setup.
- ORDER BY + LIMIT inside the subquery are valid (they apply per-outer-row, not to the stream table).
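The delete-then-re-expand strategy above can be condensed into a small Python sketch. The `subquery` callable stands in for the LATERAL body, and matching old expansions by outer key approximates the `IS NOT DISTINCT FROM` comparison the real CTEs use; all names here are illustrative:

```python
def lateral_delta(changed_outer, old_expansions, subquery):
    """Lateral subquery delta: delete old expansions for changed outer
    rows, then re-run the subquery for each inserted outer row.

    changed_outer:   list of (action, outer_key) from the child delta
    old_expansions:  list of (outer_key, sub_value) currently in the ST
    subquery:        callable outer_key -> list of subquery result values
    """
    changed_keys = {key for _, key in changed_outer}
    # Deletes: every stored expansion belonging to a changed outer row
    delta = [('D', outer, sub) for outer, sub in old_expansions
             if outer in changed_keys]
    # Inserts: re-execute the subquery per surviving outer row
    for action, outer in changed_outer:
        if action == 'I':
            delta += [('I', outer, sub) for sub in subquery(outer)]
    return delta

# Top-1 per outer row over a toy inner table
inner = {1: ['a', 'b'], 2: ['c']}
top1 = lambda k: inner.get(k, [])[:1]
d = lateral_delta([('I', 1)], old_expansions=[(1, 'z'), (2, 'c')], subquery=top1)
assert d == [('D', 1, 'z'), ('I', 1, 'a')]
```

Note that the expansion for outer row 2 is untouched — cost scales with the number of changed outer rows, as stated above.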
Semi-Join (EXISTS / IN Subquery)
Module: src/dvm/operators/semi_join.rs
Handles WHERE EXISTS (SELECT ... FROM ...) and WHERE col IN (SELECT ...) patterns. The parser transforms these into a SemiJoin operator with a left (outer) child, a right (inner) child, and a join condition.
Delta Rule:
$$\Delta(L \ltimes R) = \Delta L|_{R} + L|_{\Delta R \text{ causes existence change}}$$
- Part 1: Outer rows that changed and still satisfy the semi-join condition.
- Part 2: Existing outer rows whose semi-join result flipped due to inner changes (a matching inner row was inserted or deleted).
Strategy (Two-Part Delta):
- Part 1 (outer delta): Filter `delta_left` to rows that have at least one match in the current right-hand snapshot.
- Part 2 (inner delta): For each row in the left snapshot, check whether the existence of matching right-hand rows changed between the old and current state. Emit `'I'` if a match appeared, `'D'` if all matches disappeared.
The "old" right-hand state is reconstructed from the current state by reversing the delta: R_old = (R_current EXCEPT ALL delta_right(action='I')) UNION ALL delta_right(action='D').
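The reversal can be illustrated with Python multisets (an illustrative sketch of the technique, not the extension's SQL):

```python
from collections import Counter

def reconstruct_old_state(current, delta):
    """Reverse a delta to recover the pre-refresh multiset:
    R_old = (R_current EXCEPT ALL inserts) UNION ALL deletes.

    current: list of rows (the post-delta state)
    delta:   list of (action, row) pairs with action 'I' or 'D'
    """
    old = Counter(current)
    for action, row in delta:
        if action == 'I':
            old[row] -= 1          # EXCEPT ALL: remove applied inserts
        elif action == 'D':
            old[row] += 1          # UNION ALL: restore applied deletes
    return Counter({r: n for r, n in old.items() if n > 0})

current = ['a', 'a', 'c']            # state after the delta was applied
delta = [('I', 'a'), ('D', 'b')]     # one 'a' inserted, one 'b' deleted
old = reconstruct_old_state(current, delta)
assert old == Counter({'a': 1, 'b': 1, 'c': 1})
```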
Row Identity:
- Part 1: Uses `__pgs_row_id` from the left delta.
- Part 2: Content-based hash via `pg_stream_hash_multi` on left-side columns.
Supported Patterns:
| Pattern | SQL | Notes |
|---|---|---|
| EXISTS | WHERE EXISTS (SELECT 1 FROM t WHERE t.fk = s.pk) | Direct semi-join |
| IN (subquery) | WHERE id IN (SELECT fk FROM t) | Rewritten to EXISTS with equality |
| Multiple conditions | WHERE EXISTS (... AND ...) | Additional predicates in subquery WHERE |
Anti-Join (NOT EXISTS / NOT IN Subquery)
Module: src/dvm/operators/anti_join.rs
Handles WHERE NOT EXISTS (SELECT ... FROM ...) and WHERE col NOT IN (SELECT ...) patterns. The inverse of the semi-join operator.
Delta Rule:
$$\Delta(L \triangleright R) = \Delta L|_{\neg R} + L|_{\Delta R \text{ causes existence change}}$$
- Part 1: Outer rows that changed and have no match in the right-hand snapshot.
- Part 2: Existing outer rows whose anti-join result flipped due to inner changes.
Strategy (Two-Part Delta):
- Part 1 (outer delta): Filter `delta_left` to rows with `NOT EXISTS` in the current right snapshot.
- Part 2 (inner delta): For each row in the left snapshot, detect existence changes. Emit `'D'` if a match appeared (row no longer qualifies), `'I'` if all matches disappeared (row now qualifies).
Note the inverted semantics compared to semi-join: a new match means deletion, losing all matches means insertion.
Row Identity: Same as semi-join.
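The inverted Part-2 semantics can be made concrete with a short Python sketch (illustrative only; the extension does this in generated SQL, not row-at-a-time):

```python
def anti_join_flips(left_rows, old_right, new_right, match):
    """Part-2 delta for NOT EXISTS: for each outer row, compare match
    existence in the old vs. new inner state. Inverted semantics:
    a match appearing emits 'D', all matches disappearing emits 'I'.
    """
    out = []
    for row in left_rows:
        had = any(match(row, r) for r in old_right)
        has = any(match(row, r) for r in new_right)
        if not had and has:
            out.append(('D', row))   # row no longer qualifies
        elif had and not has:
            out.append(('I', row))   # row now qualifies
    return out

# Inner state gains a match for 1 and loses the match for 2
flips = anti_join_flips([1, 2, 3], old_right=[2], new_right=[1],
                        match=lambda l, r: l == r)
assert flips == [('D', 1), ('I', 2)]
```

Swapping the `'D'` and `'I'` emissions turns this into the semi-join Part-2 rule.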
Supported Patterns:
| Pattern | SQL | Notes |
|---|---|---|
| NOT EXISTS | WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.fk = s.pk) | Direct anti-join |
| NOT IN (subquery) | WHERE id NOT IN (SELECT fk FROM t) | Rewritten to NOT EXISTS with equality |
Scalar Subquery (Correlated SELECT Subquery)
Module: src/dvm/operators/scalar_subquery.rs
Handles scalar subqueries appearing in the SELECT list, e.g., SELECT a, (SELECT max(x) FROM t) AS mx FROM s. The subquery must return exactly one row and one column.
Delta Rule:
$$\Delta(L \times q) = \Delta L \times q' + L \times (q' - q)$$
Where $q$ is the scalar subquery value and $q'$ is the updated value.
Strategy (Two-Part Delta):
- Part 1 (outer delta): Propagate the child delta, appending the current scalar subquery value to each row.
- Part 2 (scalar value change): When the scalar subquery's result changes, emit deletes for all existing outer rows (with the old scalar value) and re-inserts for all outer rows (with the new value). The old scalar value is reconstructed by reversing the inner delta.
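The Part-2 rule is all-or-nothing: either the scalar is unchanged and no delta is produced, or every outer row is rewritten. A minimal Python sketch of just that rule (illustrative; names are not the extension's):

```python
def scalar_subquery_delta(outer_rows, old_scalar, new_scalar):
    """Part-2 delta for a scalar subquery in the SELECT list: when the
    scalar value changes, delete every outer row with the old value and
    re-insert it with the new one.
    """
    if old_scalar == new_scalar:
        return []                       # value unchanged: no Part-2 delta
    delta = [('D', row, old_scalar) for row in outer_rows]
    delta += [('I', row, new_scalar) for row in outer_rows]
    return delta

rows = ['r1', 'r2']
assert scalar_subquery_delta(rows, old_scalar=5, new_scalar=5) == []
assert scalar_subquery_delta(rows, 5, 7) == [
    ('D', 'r1', 5), ('D', 'r2', 5), ('I', 'r1', 7), ('I', 'r2', 7),
]
```

This is why the notes below warn that the delta can be large when the scalar's source table changes.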
SQL Generation (3 or 4 CTEs):
-- Part 1: child delta + current scalar value
WITH sq_outer AS (
SELECT *, (<scalar_subquery>) AS "<alias>"
FROM <child_delta>
),
-- Part 2a: DELETE all outer rows when scalar changed
sq_del AS (
SELECT "__pgs_row_id", 'D' AS "__pgs_action", <cols>
FROM <st_table>
WHERE (<scalar_old>) IS DISTINCT FROM (<scalar_current>)
),
-- Part 2b: INSERT all outer rows with new scalar value
sq_ins AS (
SELECT pg_stream_hash_multi(...) AS "__pgs_row_id",
'I' AS "__pgs_action", <cols>, (<scalar_current>) AS "<alias>"
FROM <source_snapshot>
WHERE (<scalar_old>) IS DISTINCT FROM (<scalar_current>)
)
-- Final: UNION ALL of all parts
SELECT * FROM sq_outer
UNION ALL SELECT * FROM sq_del
UNION ALL SELECT * FROM sq_ins
Row Identity:
- Part 1: `__pgs_row_id` from the child delta.
- Part 2: Content-based hash via `pg_stream_hash_multi` on all output columns.
Notes:
- The scalar subquery is stored as raw SQL (deparsed from the parse tree).
- The old scalar value is approximated using the same `EXCEPT ALL / UNION ALL` reversal technique as semi/anti-join.
- If the scalar subquery references a table that changes, all outer rows must be re-evaluated — the delta can be large.
- Source OIDs used by the scalar subquery are captured at parse time for CDC trigger registration.
Operator Tree Construction
The DVM engine builds the operator tree by analyzing the parsed query:
- WITH clause → CTE definitions extracted into a name→body map (non-recursive) or CTE registry (multi-reference)
- FROM clause → `Scan` nodes for physical tables; `Subquery` nodes for inlined CTEs and subqueries in FROM; `CteScan` nodes for multi-reference CTEs; `LateralFunction` nodes for SRFs and JSON_TABLE in FROM; `LateralSubquery` nodes for correlated subqueries in FROM
- JOIN → `Join` or `OuterJoin` wrapping two sub-trees
- LATERAL SRFs → `LateralFunction` wrapping the left-hand FROM item as its child
- LATERAL subqueries → `LateralSubquery` wrapping the left-hand FROM item as its child (comma syntax or JOIN LATERAL)
- WHERE subqueries → `SemiJoin` for `EXISTS` / `IN (subquery)`, `AntiJoin` for `NOT EXISTS` / `NOT IN (subquery)`, extracted from the WHERE clause
- Scalar subqueries → `ScalarSubquery` for `(SELECT ...)` in the SELECT list, wrapping the child tree
- WHERE → `Filter` wrapping the scan/join tree (remaining non-subquery predicates)
- SELECT list → `Project` for column selection and expressions
- GROUP BY → `Aggregate` wrapping the filtered/projected tree
- DISTINCT → `Distinct` on top
- UNION ALL → `UnionAll` combining two complete sub-trees
- INTERSECT / EXCEPT → `Intersect` or `Except` combining two sub-trees with dual-count tracking
- Window functions → `Window` wrapping the sub-tree with PARTITION BY / ORDER BY metadata
- ORDER BY → silently discarded (storage row order is undefined)
- LIMIT / OFFSET → rejected with a clear error (stream tables materialize the full result set)
For recursive CTEs (WITH RECURSIVE), the query is parsed into an OpTree with RecursiveCte operator nodes. In DIFFERENTIAL mode, the strategy (semi-naive, DRed, or recomputation) is selected automatically based on column compatibility and change type — see the Recursive CTEs section above for details.
The tree is then traversed bottom-up during delta generation: each operator's generate_delta_sql() method composes its SQL fragment around the output of its child operator(s).
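The bottom-up composition can be sketched in Python. These toy classes mirror the operator names above but are heavily simplified — the real operators emit multi-CTE chains with `__pgs_row_id` / `__pgs_action` plumbing, not bare subselects:

```python
class Scan:
    def __init__(self, table):
        self.table = table
    def generate_delta_sql(self):
        # Leaf: read the change buffer instead of the base table
        return f"SELECT * FROM changes_{self.table}"

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def generate_delta_sql(self):
        # Linear operator: deltas pass through the predicate unchanged
        return (f"SELECT * FROM ({self.child.generate_delta_sql()}) d "
                f"WHERE {self.predicate}")

class Project:
    def __init__(self, child, cols):
        self.child, self.cols = child, cols
    def generate_delta_sql(self):
        # Wrap the child's delta SQL in a column projection
        return (f"SELECT {', '.join(self.cols)} "
                f"FROM ({self.child.generate_delta_sql()}) d")

tree = Project(Filter(Scan("orders"), "status = 'active'"), ["id", "status"])
sql = tree.generate_delta_sql()
assert sql == ("SELECT id, status FROM (SELECT * FROM "
               "(SELECT * FROM changes_orders) d WHERE status = 'active') d")
```

Each operator only sees its child's SQL fragment, which is what lets new operators slot into the tree without touching the others.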
Further Reading
- ARCHITECTURE.md — System-wide component overview
- SQL_REFERENCE.md — Complete function reference
- CONFIGURATION.md — GUC tuning guide
pg_stream — Benchmark Guide
This document explains how the database-level refresh benchmarks work and how to interpret their output.
Overview
The benchmark suite in tests/e2e_bench_tests.rs measures wall-clock refresh time for FULL vs INCREMENTAL mode across a matrix of table sizes, change rates, and query complexities. Each benchmark spawns an isolated PostgreSQL 18.x container via Testcontainers, ensuring reproducible and interference-free measurements.
The core question the benchmarks answer:
How much faster is an INCREMENTAL refresh compared to a FULL refresh, given a specific workload?
Prerequisites
Build the E2E test Docker image before running any benchmarks:
./tests/build_e2e_image.sh
Docker must be running on the host.
Running Benchmarks
All benchmark tests are tagged #[ignore] so they are skipped during normal CI. The --nocapture flag is required to see the printed output tables.
Quick Spot Checks (~5–10 seconds each)
# Simple scan, 10K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_scan_10k_1pct
# Aggregate query, 100K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_aggregate_100k_1pct
# Join + aggregate, 100K rows, 10% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_join_agg_100k_10pct
Zero-Change Latency (~5 seconds)
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_no_data_refresh_latency
Full Matrix (~15–30 minutes)
Runs all 30 combinations and prints a consolidated summary:
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_full_matrix
Run All Benchmarks in Parallel
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture
Note: each test starts its own container, so parallel execution requires sufficient Docker resources.
Benchmark Dimensions
Table Sizes
| Size | Rows | Purpose |
|---|---|---|
| Small | 10,000 | Fast iteration; measures per-row overhead |
| Medium | 100,000 | More realistic; reveals scaling characteristics |
Change Rates
| Rate | Description |
|---|---|
| 1% | Low churn — the sweet spot for incremental refresh |
| 10% | Moderate churn — tests delta query scalability |
| 50% | High churn — stress test; approaches full-refresh cost |
Query Complexities
| Scenario | Defining Query | Operators Tested |
|---|---|---|
| scan | SELECT id, region, category, amount, score FROM src | Table scan only |
| filter | SELECT id, region, amount FROM src WHERE amount > 5000 | Scan + filter (WHERE) |
| aggregate | SELECT region, SUM(amount), COUNT(*) FROM src GROUP BY region | Scan + group-by aggregate |
| join | SELECT s.id, s.region, s.amount, d.region_name FROM src s JOIN dim d ON ... | Scan + inner join |
| join_agg | SELECT d.region_name, SUM(s.amount), COUNT(*) FROM src s JOIN dim d ON ... GROUP BY ... | Scan + join + aggregate |
DML Mix per Cycle
Each change cycle applies a realistic mix of operations:
| Operation | Fraction | Example at 10K rows, 10% rate |
|---|---|---|
| UPDATE | 70% | 700 rows have amount incremented |
| DELETE | 15% | 150 rows removed |
| INSERT | 15% | 150 new rows added |
What Each Benchmark Does
1. Start a fresh PostgreSQL 18.x container
2. Install the pg_stream extension
3. Create and populate the source table (10K or 100K rows)
4. Create dimension table if needed (for join scenarios)
5. ANALYZE for stable query plans
── FULL mode ──
6. Create a Stream Table in FULL refresh mode
7. For each of 3 cycles:
a. Apply random DML (updates + deletes + inserts)
b. ANALYZE
c. Time the FULL refresh (TRUNCATE + re-execute entire query)
d. Record refresh_ms and ST row count
8. Drop the FULL-mode ST
── INCREMENTAL mode ──
9. Reset source table to same starting state
10. Create a Stream Table in INCREMENTAL refresh mode
11. For each of 3 cycles:
a. Apply random DML (same parameters)
b. ANALYZE
c. Time the INCREMENTAL refresh (delta query + MERGE)
d. Record refresh_ms and ST row count
12. Print results table and summary
Both modes start from the same data to ensure a fair comparison. The 3-cycle design captures warm-up effects (cycle 1 may be slower due to plan caching).
Reading the Output
Detail Table
╔══════════════════════════════════════════════════════════════════════════════════════╗
║ pg_stream Refresh Benchmark Results ║
╠════════════╤══════════╤════════╤═════════════╤═══════╤════════════╤═════════════════╣
║ Scenario │ Rows │ Chg % │ Mode │ Cycle │ Refresh ms │ ST Rows ║
╠════════════╪══════════╪════════╪═════════════╪═══════╪════════════╪═════════════════╣
║ aggregate │ 10000 │ 1% │ FULL │ 1 │ 22.1 │ 5 ║
║ aggregate │ 10000 │ 1% │ FULL │ 2 │ 4.8 │ 5 ║
║ aggregate │ 10000 │ 1% │ FULL │ 3 │ 5.3 │ 5 ║
║ aggregate │ 10000 │ 1% │ INCREMENTAL │ 1 │ 8.4 │ 5 ║
║ aggregate │ 10000 │ 1% │ INCREMENTAL │ 2 │ 4.4 │ 5 ║
║ aggregate │ 10000 │ 1% │ INCREMENTAL │ 3 │ 4.6 │ 5 ║
╚════════════╧══════════╧════════╧═════════════╧═══════╧════════════╧═════════════════╝
| Column | Meaning |
|---|---|
| Scenario | Query complexity level (scan, filter, aggregate, join, join_agg) |
| Rows | Number of rows in the base table |
| Chg % | Percentage of rows changed per cycle |
| Mode | FULL (truncate + recompute) or INCREMENTAL (delta + merge) |
| Cycle | Which of the 3 measurement rounds (cycle 1 often includes warm-up) |
| Refresh ms | Wall-clock time for the refresh operation |
| ST Rows | Row count in the Stream Table after refresh (sanity check) |
Summary Table
┌─────────────────────────────────────────────────────────────────────────┐
│ Summary (avg ms per cycle) │
├────────────┬──────────┬────────┬─────────────────┬──────────────────────┤
│ Scenario │ Rows │ Chg % │ FULL avg ms │ INCREMENTAL avg ms │
├────────────┼──────────┼────────┼─────────────────┼──────────────────────┤
│ aggregate │ 10000 │ 1% │ 10.7 │ 5.8 ( 1.8x) │
└────────────┴──────────┴────────┴─────────────────┴──────────────────────┘
The Speedup value in parentheses is FULL avg / INCREMENTAL avg — how many times faster the incremental refresh is compared to a full refresh.
Interpreting the Speedup
What to Expect
| Change Rate | Table Size | Expected Speedup | Explanation |
|---|---|---|---|
| 1% | 10K | 1.5–5x | Small table; overhead is similar, delta is tiny |
| 1% | 100K | 5–50x | Larger table amplifies full-refresh cost |
| 10% | 100K | 2–10x | Moderate delta; still significantly faster |
| 50% | any | 1–2x | Delta is nearly as large as full table |
Rules of Thumb
| Speedup | Interpretation |
|---|---|
| > 10x | Strong win for INCREMENTAL — typical at low change rates on larger tables |
| 5–10x | Clear advantage for INCREMENTAL |
| 2–5x | Moderate advantage — INCREMENTAL is the right choice |
| 1–2x | Marginal gain — either mode is acceptable |
| ~1x | Break-even — change rate is too high for incremental to help |
| < 1x | INCREMENTAL is slower — would indicate overhead exceeds savings (investigate) |
Key Patterns to Look For
- Scaling with table size: For the same change rate, speedup should increase with table size. FULL must re-process all rows; INCREMENTAL processes only the delta.
- Degradation with change rate: As the change rate rises from 1% → 50%, speedup should decrease. At 50%, INCREMENTAL processes half the table, which approaches FULL cost.
- Query complexity amplifies speedup: Aggregate and join queries benefit more from INCREMENTAL because they avoid expensive re-computation. A join_agg at 1% changes should show a higher speedup than a simple scan at the same parameters.
- Cycle 1 warm-up: The first cycle in each mode may be slower due to PostgreSQL plan cache population. Use cycles 2–3 for the steadiest numbers.
- ST Rows consistency: The ST row count should be similar between FULL and INCREMENTAL for the same scenario (accounting for random DML). Large discrepancies indicate a correctness issue.
Zero-Change Latency
The bench_no_data_refresh_latency test measures the overhead of a refresh when no data has changed — the NO_DATA code path.
┌──────────────────────────────────────────────┐
│ NO_DATA Refresh Latency (10 iterations) │
├──────────────────────────────────────────────┤
│ Avg: 3.21 ms │
│ Max: 5.10 ms │
│ Target: < 10 ms │
│ Status: ✅ PASS │
└──────────────────────────────────────────────┘
| Metric | Meaning |
|---|---|
| Avg | Average wall-clock time across 10 no-op refreshes |
| Max | Worst-case single iteration |
| Target | The PLAN.md goal: < 10 ms per no-op refresh |
| Status | PASS if avg < 10 ms, SLOW otherwise |
A passing result confirms the scheduler's per-cycle overhead is negligible. Values > 10 ms in containerized environments may be acceptable due to Docker overhead; bare-metal PostgreSQL should comfortably meet the target.
Available Tests
Individual Tests (10K rows)
| Test Name | Scenario | Change Rate |
|---|---|---|
| bench_scan_10k_1pct | scan | 1% |
| bench_scan_10k_10pct | scan | 10% |
| bench_scan_10k_50pct | scan | 50% |
| bench_filter_10k_1pct | filter | 1% |
| bench_aggregate_10k_1pct | aggregate | 1% |
| bench_join_10k_1pct | join | 1% |
| bench_join_agg_10k_1pct | join_agg | 1% |
Individual Tests (100K rows)
| Test Name | Scenario | Change Rate |
|---|---|---|
| bench_scan_100k_1pct | scan | 1% |
| bench_scan_100k_10pct | scan | 10% |
| bench_scan_100k_50pct | scan | 50% |
| bench_aggregate_100k_1pct | aggregate | 1% |
| bench_aggregate_100k_10pct | aggregate | 10% |
| bench_join_agg_100k_1pct | join_agg | 1% |
| bench_join_agg_100k_10pct | join_agg | 10% |
Special Tests
| Test Name | Description |
|---|---|
| bench_full_matrix | All 30 combinations (5 queries × 2 sizes × 3 rates) |
| bench_no_data_refresh_latency | Zero-change overhead (10 iterations) |
In-Process Micro-Benchmarks (Criterion.rs)
In addition to the E2E database benchmarks, the project includes two Criterion.rs benchmark suites that measure pure Rust computation time without database overhead. These are useful for tracking performance regressions in the internal query-building and IVM differentiation logic.
Benchmark Suites
refresh_bench — Utility Functions
benches/refresh_bench.rs benchmarks the low-level helper functions used during refresh operations:
| Benchmark Group | What It Measures |
|---|---|
| quote_ident | PostgreSQL identifier quoting speed |
| col_list | Column list SQL generation |
| prefixed_col_list | Prefixed column list generation (e.g., NEW.col) |
| expr_to_sql | AST expression → SQL string conversion |
| output_columns | Output column extraction from parsed queries |
| source_oids | Source table OID resolution |
| lsn_gt | LSN comparison expression generation |
| frontier_json | Frontier state JSON serialization |
| canonical_period | Interval parsing and canonicalization |
| dag_operations | DAG topological sort and cycle detection |
| xxh64 | xxHash-64 hashing throughput |
diff_operators — IVM Operator Differentiation
benches/diff_operators.rs benchmarks the delta SQL generation for every IVM operator. Each benchmark creates a realistic operator tree and measures differentiate() throughput:
| Benchmark Group | What It Measures |
|---|---|
| diff_scan | Table scan differentiation (3, 10, 20 columns) |
| diff_filter | Filter (WHERE) differentiation |
| diff_project | Projection (SELECT subset) differentiation |
| diff_aggregate | GROUP BY aggregate differentiation (simple + complex) |
| diff_inner_join | Inner join differentiation |
| diff_left_join | Left outer join differentiation |
| diff_distinct | DISTINCT differentiation |
| diff_union_all | UNION ALL differentiation (2, 5, 10 children) |
| diff_window | Window function differentiation |
| diff_join_aggregate | Composite join + aggregate pipeline |
| differentiate_full | Full differentiate() call for scan-only and filter+scan trees |
Running Micro-Benchmarks
# Run all Criterion benchmarks
just bench
# Run only refresh utility benchmarks
cargo bench --bench refresh_bench --features pg18
# Run only IVM diff operator benchmarks
just bench-diff
# or equivalently:
cargo bench --bench diff_operators --features pg18
# Output in Bencher-compatible format (for CI integration)
just bench-bencher
Output and Reports
Criterion produces statistical analysis for each benchmark including:
- Mean and standard deviation of execution time
- Throughput (iterations/sec)
- Comparison with previous run — reports improvements/regressions with confidence intervals
HTML reports are generated in target/criterion/ with interactive charts showing distributions and regression history. Open target/criterion/report/index.html to browse all results.
Sample output:
diff_scan/3_columns time: [11.834 µs 12.074 µs 12.329 µs]
diff_scan/10_columns time: [16.203 µs 16.525 µs 16.869 µs]
diff_aggregate/simple time: [21.447 µs 21.862 µs 22.301 µs]
diff_inner_join time: [25.919 µs 26.421 µs 26.952 µs]
Continuous Benchmarking with Bencher
Bencher provides continuous benchmark tracking in CI, detecting performance regressions on pull requests before they merge.
How It Works
The .github/workflows/benchmarks.yml workflow:
- On `main` pushes — runs both Criterion suites and uploads results to Bencher as the baseline. This establishes the expected performance for each benchmark.
- On pull requests — runs the same benchmarks and compares against the `main` baseline using a Student's t-test with a 99% upper confidence boundary. If any benchmark regresses beyond the threshold, the PR check fails.
Setup
To enable Bencher for your fork or deployment:
- Create a Bencher account at bencher.dev and create a project.
- Add the API token as a GitHub Actions secret:
  - Go to Settings → Secrets and variables → Actions
  - Add `BENCHER_API_TOKEN` with your Bencher API token
- Update the project slug in `.github/workflows/benchmarks.yml` if your Bencher project name differs from `pg-stream`.
The workflow gracefully degrades — if BENCHER_API_TOKEN is not set, benchmarks still run and upload artifacts but skip Bencher tracking.
Local Bencher-Format Output
To see what Bencher would receive from CI:
just bench-bencher
This runs both suites with --output-format bencher, producing JSON output compatible with bencher run.
Dashboard
Once configured, the Bencher dashboard shows:
- Historical trends for every benchmark across commits
- Statistical thresholds with configurable alerting
- PR annotations highlighting which benchmarks regressed and by how much
Troubleshooting
| Issue | Resolution |
|---|---|
| docker: command not found | Install Docker Desktop and ensure it is running |
| Container startup timeout | Increase Docker memory allocation (≥ 4 GB recommended) |
| image not found | Run ./tests/build_e2e_image.sh to build the test image |
| Highly variable timings | Close other workloads; use --test-threads=1 to avoid container contention |
| SLOW status on latency test | Expected in Docker; bare-metal should pass < 10 ms |
What Happens When You INSERT a Row?
This tutorial traces the complete lifecycle of a single INSERT statement on a base table that is referenced by a stream table — from the moment the row is written to the moment the stream table reflects the change.
Setup: A Real-World Example
Suppose you run an e-commerce platform. You have an orders table and a stream table that maintains a running total per customer:
-- Base table
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer TEXT NOT NULL,
amount NUMERIC(10,2) NOT NULL
);
-- Stream table: always-fresh customer totals
SELECT pgstream.create_stream_table(
'customer_totals',
$$
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
$$,
'1m', -- refresh when data is staler than 1 minute
'DIFFERENTIAL' -- only process changed rows, not the full table
);
After creation, customer_totals is a real PostgreSQL table:
SELECT * FROM customer_totals;
-- (empty — no orders yet)
Phase 1: The INSERT
A new order arrives:
INSERT INTO orders (customer, amount) VALUES ('alice', 49.99);
What happens inside PostgreSQL
When create_stream_table() was called, pg_stream installed an AFTER INSERT OR UPDATE OR DELETE trigger on the orders table. This trigger fires automatically — the user's INSERT statement triggers it transparently.
The trigger function (pgstream_changes.pg_stream_cdc_fn_<oid>()) executes inside the same transaction as the INSERT and writes a single row into the change buffer table:
pgstream_changes.changes_16384 (where 16384 = orders table OID)
┌───────────┬─────────────┬────────┬─────────┬──────────┬──────────┬────────────┐
│ change_id │ lsn │ action │ pk_hash │ new_id │ new_cust │ new_amount │
├───────────┼─────────────┼────────┼─────────┼──────────┼──────────┼────────────┤
│ 1 │ 0/1A3F2B80 │ I │ -837291 │ 1 │ alice │ 49.99 │
└───────────┴─────────────┴────────┴─────────┴──────────┴──────────┴────────────┘
Key details:
- `lsn`: The current WAL Log Sequence Number (`pg_current_wal_lsn()`), used to bound which changes belong to which refresh cycle.
- `action`: `'I'` for INSERT, `'U'` for UPDATE, `'D'` for DELETE.
- `pk_hash`: A pre-computed hash of the primary key (`orders.id`), used later for efficient row matching.
- `new_*` columns: The actual column values from `NEW`, stored as native PostgreSQL types (not JSONB). There are no `old_*` values for INSERTs.
The trigger adds zero overhead to the user's transaction commit beyond this single INSERT into the buffer table. There is no JSONB serialization, no logical replication slot, and no external process involved.
Phase 2: The Scheduler Wakes Up
A background worker called the scheduler runs inside PostgreSQL (registered via shared_preload_libraries). It wakes up every pg_stream.scheduler_interval_ms milliseconds (default: 1000ms) and performs a tick:
- Rebuild the DAG (if any stream tables were created/dropped since last tick) — a dependency graph of all stream tables and their source tables.
- Topological sort — determine the refresh order so that stream tables depending on other stream tables are refreshed after their dependencies.
- For each stream table, check: has its staleness exceeded its schedule?
For customer_totals with a '1m' schedule, the scheduler compares:
now()minusdata_timestamp(the freshness watermark from the last refresh)- Against the schedule: 60 seconds
If more than 60 seconds have elapsed and the stream table isn't already being refreshed, the scheduler begins a refresh.
Phase 3: Frontier Advancement
Before executing the refresh, the scheduler creates a new frontier — a snapshot of how far to read changes from each source table:
Previous frontier: { orders(16384): lsn = 0/1A3F2A00 }
New frontier: { orders(16384): lsn = 0/1A3F2C00 }
The frontier is a DBSP-inspired version vector. Each source table has its own LSN cursor. The refresh will process all changes in the buffer table where lsn > previous_frontier_lsn AND lsn <= new_frontier_lsn.
This means:
- Changes committed before the previous refresh are already reflected.
- Changes committed after the new frontier will be picked up in the next cycle.
- The INSERT we made (`lsn = 0/1A3F2B80`) falls within this window.
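The half-open LSN window can be checked with a few lines of Python (a sketch of the comparison the generated SQL performs; `pg_lsn` values are 64-bit positions written as two hex halves):

```python
def lsn(text):
    """Parse a pg_lsn 'X/Y' string into a comparable 64-bit integer."""
    hi, lo = text.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def in_window(change_lsn, prev_frontier, new_frontier):
    """A change belongs to this refresh cycle iff prev < lsn <= new."""
    return prev_frontier < change_lsn <= new_frontier

prev, new = lsn('0/1A3F2A00'), lsn('0/1A3F2C00')
assert in_window(lsn('0/1A3F2B80'), prev, new)        # our INSERT
assert not in_window(lsn('0/1A3F2A00'), prev, new)    # already consumed
assert not in_window(lsn('0/1A3F2D00'), prev, new)    # next cycle
```

The exclusive lower bound is what guarantees a change is processed exactly once across consecutive refreshes.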
Phase 4: Change Detection — Is There Anything to Do?
Before running the full delta query, the scheduler runs a short-circuit check: does the change buffer actually have any rows in the LSN window?
SELECT count(*)::bigint FROM (
SELECT 1 FROM pgstream_changes.changes_16384
WHERE lsn > '0/1A3F2A00'::pg_lsn
AND lsn <= '0/1A3F2C00'::pg_lsn
LIMIT <threshold>
) __pgs_capped
This query also checks the adaptive threshold: if the number of changes exceeds a percentage of the source table size (default: 10%), the scheduler falls back to a FULL refresh instead of DIFFERENTIAL, because applying thousands of individual deltas would be slower than a bulk reload.
For our single INSERT, the count is 1 — well below the threshold. The scheduler proceeds with a DIFFERENTIAL refresh.
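The mode decision can be summarized as a small Python sketch (a hypothetical helper; the extension's actual GUC names and exact comparison may differ):

```python
def choose_refresh_mode(changed_rows, table_rows, threshold_pct=10):
    """Adaptive fallback: if the change count exceeds a percentage of
    the source table (default 10%), a FULL refresh beats applying
    row-by-row deltas.
    """
    if changed_rows == 0:
        return 'NO_DATA'            # short-circuit: nothing to do
    if changed_rows * 100 > table_rows * threshold_pct:
        return 'FULL'               # delta would be too large
    return 'DIFFERENTIAL'

assert choose_refresh_mode(0, 1_000_000) == 'NO_DATA'
assert choose_refresh_mode(1, 1_000_000) == 'DIFFERENTIAL'
assert choose_refresh_mode(200_000, 1_000_000) == 'FULL'
```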
Phase 5: Delta Query Generation (DVM Engine)
This is where the Differential View Maintenance (DVM) engine does its work. The defining query:
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
is parsed into an operator tree:
Aggregate(GROUP BY customer, SUM(amount), COUNT(*))
└── Scan(orders)
The DVM engine differentiates each operator — converting it from "compute the full result" to "compute only what changed":
Step 1: Differentiate the Scan
The Scan(orders) operator becomes a read from the change buffer:
-- Reads only changes in the LSN window, splitting UPDATEs into DELETE+INSERT
WITH __pgs_raw AS (
SELECT c.pk_hash, c.action,
c."new_customer", c."old_customer",
c."new_amount", c."old_amount"
FROM pgstream_changes.changes_16384 c
WHERE c.lsn > '0/1A3F2A00'::pg_lsn
AND c.lsn <= '0/1A3F2C00'::pg_lsn
)
-- INSERT rows: take new_* values
SELECT pk_hash AS __pgs_row_id, 'I' AS __pgs_action,
"new_customer" AS customer, "new_amount" AS amount
FROM __pgs_raw WHERE action IN ('I', 'U')
UNION ALL
-- DELETE rows: take old_* values
SELECT pk_hash AS __pgs_row_id, 'D' AS __pgs_action,
"old_customer" AS customer, "old_amount" AS amount
FROM __pgs_raw WHERE action IN ('D', 'U')
For our single INSERT, this produces:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291 | I | alice | 49.99
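The D+I split performed by the two UNION ALL branches above can be sketched in Python (illustrative only; the real work happens in the generated SQL):

```python
def split_changes(buffer_rows):
    """Split buffered changes into delta rows: UPDATEs become an INSERT
    of the new values plus a DELETE of the old ones, mirroring the two
    UNION ALL branches of the scan delta.

    Each buffer row is (action, old_vals, new_vals).
    """
    delta = []
    for action, old, new in buffer_rows:
        if action in ('I', 'U'):
            delta.append(('I', new))   # take new_* values
        if action in ('D', 'U'):
            delta.append(('D', old))   # take old_* values
    return delta

buf = [('U', ('alice', 49.99), ('alice', 59.99))]
assert split_changes(buf) == [('I', ('alice', 59.99)),
                              ('D', ('alice', 49.99))]
```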
Step 2: Differentiate the Aggregate
The Aggregate differentiation is the heart of incremental maintenance. Instead of re-computing SUM(amount) over the entire orders table, it computes:
-- Delta for SUM: add new values, subtract deleted values
SELECT customer,
SUM(CASE WHEN __pgs_action = 'I' THEN amount
WHEN __pgs_action = 'D' THEN -amount END) AS total,
SUM(CASE WHEN __pgs_action = 'I' THEN 1
WHEN __pgs_action = 'D' THEN -1 END) AS order_count,
pgstream.pg_stream_hash(customer::text) AS __pgs_row_id,
'I' AS __pgs_action
FROM <scan_delta>
GROUP BY customer
For our INSERT of ('alice', 49.99), this yields:
customer | total | order_count | __pgs_row_id | __pgs_action
---------|--------|-------------|--------------|-------------
alice | +49.99 | +1 | 7283194 | I
The stream table uses reference counting: it tracks __pgs_count (how many source rows contribute to each group). When __pgs_count reaches 0, the group row is deleted.
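Reference counting can be illustrated with a Python sketch. The `count` field plays the role of `__pgs_count`; amounts are in integer cents to keep the arithmetic exact, and the function is an illustration of the bookkeeping, not the extension's MERGE:

```python
def apply_aggregate_delta(groups, deltas):
    """Apply per-group deltas to an aggregate stream table with
    reference counting: each group tracks (total, count). When the
    count reaches 0, the group row is removed.

    deltas maps group key -> (d_total, d_count).
    """
    for key, (d_total, d_count) in deltas.items():
        total, count = groups.get(key, (0, 0))
        total, count = total + d_total, count + d_count
        if count == 0:
            groups.pop(key, None)   # last contributing row disappeared
        else:
            groups[key] = (total, count)
    return groups

st = {}
st = apply_aggregate_delta(st, {'alice': (4999, 1)})    # INSERT of 49.99
assert st == {'alice': (4999, 1)}
st = apply_aggregate_delta(st, {'alice': (1000, 0)})    # UPDATE: +59.99 - 49.99
assert st == {'alice': (5999, 1)}
st = apply_aggregate_delta(st, {'alice': (-5999, -1)})  # DELETE: group removed
assert st == {}
```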
Phase 6: MERGE Into the Stream Table
The delta is applied to the customer_totals storage table using a single SQL MERGE statement:
MERGE INTO public.customer_totals AS st
USING (<delta_query>) AS d
ON st.__pgs_row_id = d.__pgs_row_id
WHEN MATCHED AND d.__pgs_action = 'D' THEN DELETE
WHEN MATCHED AND d.__pgs_action = 'I' THEN
UPDATE SET customer = d.customer, total = d.total, order_count = d.order_count
WHEN NOT MATCHED AND d.__pgs_action = 'I' THEN
INSERT (__pgs_row_id, customer, total, order_count)
VALUES (d.__pgs_row_id, d.customer, d.total, d.order_count)
Since alice didn't exist before, this is a NOT MATCHED → INSERT. The stream table now contains:
SELECT * FROM customer_totals;
customer | total | order_count
----------|-------|------------
alice | 49.99 | 1
Phase 7: Cleanup and Bookkeeping
After the MERGE succeeds:
- Consumed changes are deleted from the buffer table:
  DELETE FROM pgstream_changes.changes_16384
  WHERE lsn > '0/1A3F2A00'::pg_lsn AND lsn <= '0/1A3F2C00'::pg_lsn
- The frontier is saved to the catalog as JSONB, so the next refresh knows where to start.
- The refresh is recorded in `pgstream.pgs_refresh_history`:
  refresh_id | pgs_id | action       | rows_inserted | rows_deleted | status    | initiated_by
  1          | 1      | DIFFERENTIAL | 1             | 0            | COMPLETED | SCHEDULER
- The data timestamp on the stream table is advanced, resetting the staleness clock.
- The MERGE template is cached in thread-local storage. The next refresh for this stream table skips SQL parsing, operator tree construction, and differentiation — it only substitutes LSN values into the cached template. This saves ~45 ms per refresh cycle.
What About UPDATE and DELETE?
UPDATE
UPDATE orders SET amount = 59.99 WHERE id = 1;
The trigger writes a single row with action = 'U', capturing both OLD and NEW values:
action | new_amount | old_amount | new_customer | old_customer
-------|------------|------------|--------------|-------------
U | 59.99 | 49.99 | alice | alice
The scan differentiation splits this into:
- DELETE old: (alice, 49.99) with action 'D'
- INSERT new: (alice, 59.99) with action 'I'
The aggregate differentiation computes: +59.99 - 49.99 = +10.00 for alice's total. The MERGE updates the existing row.
DELETE
DELETE FROM orders WHERE id = 1;
The trigger writes action = 'D' with the OLD values. The aggregate differentiation computes -49.99 for the total and -1 for the count. If the __pgs_count reaches 0 (no more orders for alice), the MERGE deletes alice's row from the stream table entirely.
Performance: Why This Is Fast
| Step | What it avoids |
|---|---|
| Trigger-based CDC | No logical replication slot, no WAL parsing, no external process |
| Typed columns | No JSONB serialization in the trigger, no jsonb_populate_record in the delta query |
| Pre-computed pk_hash | No per-row hash computation during the delta query |
| LSN-bounded reads | Index scan on the change buffer, not a full table scan |
| Algebraic differentiation | Processes only changed rows — O(changes) not O(table size) |
| MERGE statement | Single SQL round-trip for all inserts, updates, and deletes |
| Cached templates | After the first refresh, delta SQL generation is skipped entirely |
| Adaptive fallback | Automatically switches to FULL refresh when changes exceed a threshold |
For a table with 10 million rows and 100 changed rows, a DIFFERENTIAL refresh processes only those 100 rows. A FULL refresh would need to scan all 10 million.
Next in This Series
- What Happens When You UPDATE a Row? — D+I split, group key changes, net-effect for multiple UPDATEs
- What Happens When You DELETE a Row? — Reference counting, group deletion, INSERT+DELETE cancellation
- What Happens When You TRUNCATE a Table? — Why TRUNCATE bypasses triggers and how to recover
What Happens When You UPDATE a Row?
This tutorial traces what happens when an UPDATE statement hits a base table that is referenced by a stream table. It covers the trigger capture, the scan-level decomposition into DELETE + INSERT, and how each DVM operator propagates the change — including cases where the group key changes, where JOINs are involved, and where multiple UPDATEs happen within a single refresh window.
Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the full 7-phase lifecycle. This tutorial focuses on how UPDATE differs.
Setup
Same e-commerce example:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer TEXT NOT NULL,
amount NUMERIC(10,2) NOT NULL
);
SELECT pgstream.create_stream_table(
'customer_totals',
$$
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
$$,
'1m',
'DIFFERENTIAL'
);
-- Seed some data
INSERT INTO orders (customer, amount) VALUES
('alice', 49.99),
('alice', 30.00),
('bob', 75.00);
After the first refresh, the stream table contains:
customer | total | order_count
---------|-------|------------
alice | 79.99 | 2
bob | 75.00 | 1
Case 1: Simple Value UPDATE (Same Group Key)
UPDATE orders SET amount = 59.99 WHERE id = 1;
Alice's first order changes from 49.99 to 59.99. The customer (group key) stays the same.
Phase 1: Trigger Capture
The AFTER UPDATE trigger fires and writes one row to the change buffer with both OLD and NEW values:
pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┬────────────┬──────────┬────────────┐
│ change_id │ lsn │ action │ new_cust │ new_amt │ old_cust │ old_amt │ pk_hash │
├───────────┼─────────────┼────────┼──────────┼──────────┼────────────┼──────────┼────────────┤
│ 4 │ 0/1A3F3000 │ U │ alice │ 59.99 │ alice │ 49.99 │ -837291 │
└───────────┴─────────────┴────────┴──────────┴──────────┴────────────┴──────────┴────────────┘
Key difference from INSERT: the trigger writes both new_* and old_* columns. The pk_hash is computed from NEW.id.
Phase 2–4: Scheduler, Frontier, Change Detection
Identical to the INSERT flow. The scheduler detects one change row in the LSN window.
Phase 5: Scan Differentiation — The U → D+I Split
This is where UPDATE handling diverges fundamentally. The scan delta operator decomposes the UPDATE into two events:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291 | D | alice | 49.99 ← old values (DELETE)
-837291 | I | alice | 59.99 ← new values (INSERT)
Why split into D+I? This is a core IVM principle. Downstream operators (aggregates, joins, filters) don't have special "update" logic — they only understand insertions and deletions. By decomposing the UPDATE:
- The DELETE event subtracts the old values from running aggregates
- The INSERT event adds the new values
This algebraic approach handles arbitrary operator trees without operator-specific update logic.
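The decomposition can be sketched in a few lines of Python. This is an illustrative model only, not the extension's actual code (which generates SQL); `update_to_events` and `aggregate_delta` are hypothetical names:

```python
from collections import defaultdict

def update_to_events(old_row, new_row):
    """Decompose an UPDATE into a DELETE of OLD plus an INSERT of NEW."""
    return [('D', old_row), ('I', new_row)]

def aggregate_delta(events):
    """Fold I/D events into per-group [sum_delta, count_delta] pairs,
    keyed by the group column (customer)."""
    deltas = defaultdict(lambda: [0.0, 0])
    for action, (customer, amount) in events:
        sign = 1 if action == 'I' else -1   # I adds, D subtracts
        deltas[customer][0] += sign * amount
        deltas[customer][1] += sign
    return dict(deltas)

# alice's order changes from 49.99 to 59.99; the group key stays the same:
# the D and I events land in the same group and partially cancel
d = aggregate_delta(update_to_events(('alice', 49.99), ('alice', 59.99)))
# total delta ≈ +10.00, count delta 0
```

Because the fold is keyed by group, the same two functions also cover a group-key change with no extra logic: the D and I events simply land in different groups.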
Phase 5 (continued): Aggregate Differentiation
The aggregate operator processes both events against the alice group:
-- DELETE event: subtract old values
alice: total += CASE WHEN action='D' THEN -49.99 END → -49.99
alice: count += CASE WHEN action='D' THEN -1 END → -1
-- INSERT event: add new values
alice: total += CASE WHEN action='I' THEN +59.99 END → +59.99
alice: count += CASE WHEN action='I' THEN +1 END → +1
Net effect on alice's group:
total delta: -49.99 + 59.99 = +10.00
count delta: -1 + 1 = 0
The aggregate emits this as an INSERT (because the group still exists and its value changed):
customer | total | order_count | __pgs_row_id | __pgs_action
---------|--------|-------------|--------------|-------------
alice | +10.00 | 0 | 7283194 | I
Phase 6: MERGE
The MERGE updates the existing row:
-- MERGE WHEN MATCHED AND action = 'I' THEN UPDATE:
--   alice's total: 79.99 + 10.00 = 89.99
--   alice's count: 2 + 0 = 2
Note that the MERGE itself doesn't add deltas; it replaces the matched row. The aggregate delta query computes the new absolute values before the MERGE runs, by combining the stored state with the delta:
COALESCE(existing.total, 0) + delta.total → 79.99 + 10.00 = 89.99
COALESCE(existing.__pgs_count, 0) + delta.__pgs_count → 2 + 0 = 2
Result:
SELECT * FROM customer_totals;
customer | total | order_count
----------|-------|------------
alice | 89.99 | 2 ← was 79.99
bob | 75.00 | 1
Case 2: Group Key Change (Customer Reassignment)
UPDATE orders SET customer = 'bob' WHERE id = 2;
Alice's second order (amount=30.00) is reassigned to Bob. The group key itself changes.
Trigger Capture
change_id | lsn | action | new_cust | new_amt | old_cust | old_amt | pk_hash
5 | 0/1A3F3100 | U | bob | 30.00 | alice | 30.00 | 4521038
The old and new customer values differ.
Scan Delta: D+I Split
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
4521038 | D | alice | 30.00 ← removes from alice's group
4521038 | I | bob | 30.00 ← adds to bob's group
Aggregate Delta
The aggregate groups by customer, so the DELETE and INSERT land in different groups:
Group "alice":
total delta: -30.00
count delta: -1
Group "bob":
total delta: +30.00
count delta: +1
After MERGE
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
alice | 59.99 | 1 ← lost one order (-30.00)
bob | 105.00 | 2 ← gained one order (+30.00)
This is why the D+I decomposition is essential. Without it, you'd need special "move between groups" logic. With it, the standard aggregate differentiation handles group key changes naturally.
Case 3: UPDATE That Deletes a Group
-- Alice only has one order left. Reassign it to bob.
UPDATE orders SET customer = 'bob' WHERE id = 1;
Aggregate Delta
Group "alice":
total delta: -59.99
count delta: -1
new __pgs_count: 1 - 1 = 0 → group vanishes!
Group "bob":
total delta: +59.99
count delta: +1
When __pgs_count reaches 0, the aggregate emits a DELETE for alice's group:
customer | total | __pgs_row_id | __pgs_action
---------|-------|--------------|-------------
alice | — | 7283194 | D ← group removed
bob | ... | 9182734 | I ← group updated
The MERGE deletes alice's row entirely:
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
bob      | 164.99 | 3
Case 4: Multiple UPDATEs on the Same Row (Within One Refresh Window)
What if a row is updated multiple times before the next refresh?
UPDATE orders SET amount = 10.00 WHERE id = 3; -- bob: 75 → 10
UPDATE orders SET amount = 20.00 WHERE id = 3; -- bob: 10 → 20
UPDATE orders SET amount = 30.00 WHERE id = 3; -- bob: 20 → 30
The change buffer now has 3 rows for pk_hash of order #3:
change_id | action | old_amt | new_amt
6 | U | 75.00 | 10.00
7 | U | 10.00 | 20.00
8 | U | 20.00 | 30.00
Net-Effect Computation
The scan delta uses a split design: a fast path for PKs with a single change, and a window-function path for PKs with multiple changes. Since order #3 has multiple changes (cnt > 1), it takes the multi-change path:
FIRST_VALUE(action) OVER (PARTITION BY pk_hash ORDER BY change_id) → 'U'
LAST_VALUE(action) OVER (...) → 'U'
Both first and last actions are 'U', so:
- DELETE: emits using old values from the earliest change (change_id=6): old_amt = 75.00
- INSERT: emits using new values from the latest change (change_id=8): new_amt = 30.00
Net delta:
__pgs_row_id | __pgs_action | amount
-------------|--------------|-------
pk_hash_3 | D | 75.00 ← original value before all changes
pk_hash_3 | I | 30.00 ← final value after all changes
The aggregate sees -75.00 + 30.00 = -45.00. This is correct regardless of the intermediate values. The intermediate rows (10.00, 20.00) are never seen.
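The first/last-action rule above can be sketched as a small Python function. This is illustrative only (`net_effect` is a hypothetical name; pg_stream implements the rule with the FIRST_VALUE/LAST_VALUE window functions shown above):

```python
def net_effect(changes):
    """Collapse an ordered list of (action, old_value, new_value) changes
    for one primary key into at most one DELETE and one INSERT event.

    Rules, mirroring the scan delta described above:
      - emit a DELETE with the earliest old values, unless the first
        action is 'I' (the row was born inside this window);
      - emit an INSERT with the latest new values, unless the last
        action is 'D' (the row is gone by the end of the window)."""
    first_action, first_old, _ = changes[0]
    last_action, _, last_new = changes[-1]
    events = []
    if first_action != 'I':
        events.append(('D', first_old))
    if last_action != 'D':
        events.append(('I', last_new))
    return events

# Three UPDATEs on the same row: only the original and final values survive
assert net_effect([('U', 75.00, 10.00),
                   ('U', 10.00, 20.00),
                   ('U', 20.00, 30.00)]) == [('D', 75.00), ('I', 30.00)]
# INSERT then DELETE in the same window: the events cancel completely
assert net_effect([('I', None, 100.00), ('D', 100.00, None)]) == []
```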
Case 5: INSERT + UPDATE in Same Window
INSERT INTO orders (customer, amount) VALUES ('charlie', 100.00);
UPDATE orders SET amount = 200.00 WHERE customer = 'charlie';
Both happen before the next refresh. The buffer has:
change_id | action | old_amt | new_amt
9 | I | NULL | 100.00
10 | U | 100.00 | 200.00
Net-effect analysis:
- first_action = 'I' (row didn't exist before this window)
- last_action = 'U' (row exists after)
Result:
- No DELETE emitted (first_action = 'I' means the row was born in this window)
- INSERT with final values:
(charlie, 200.00)
The aggregate sees a pure insertion of (charlie, 200.00) — the intermediate value of 100.00 never appears.
Case 6: UPDATE + DELETE in Same Window
UPDATE orders SET amount = 999.99 WHERE id = 3;
DELETE FROM orders WHERE id = 3;
Net-effect:
- first_action = 'U' (row existed before)
- last_action = 'D' (row no longer exists)
Result:
- DELETE with original old values from the first change
- No INSERT (last_action = 'D')
The aggregate correctly sees only a removal.
Case 7: UPDATE with JOINs
Consider a stream table that joins two tables:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
tier TEXT NOT NULL DEFAULT 'standard'
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
amount NUMERIC(10,2)
);
SELECT pgstream.create_stream_table(
'order_details',
$$
SELECT c.name, c.tier, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
$$,
'1m',
'DIFFERENTIAL'
);
Now update a customer's tier:
UPDATE customers SET tier = 'premium' WHERE name = 'alice';
How the JOIN Delta Works
The join differentiation follows the formula:
$$\Delta(L \bowtie R) = (\Delta L \bowtie R) \cup (L \bowtie \Delta R) - (\Delta L \bowtie \Delta R)$$
Since only the customers table changed:
- $\Delta L$ = changes to orders (empty)
- $\Delta R$ = changes to customers (alice's tier: standard → premium)
So:
- Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = empty (no order changes)
- Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = all of alice's orders joined with her tier change
- Part 3: $\Delta\text{orders} \bowtie \Delta\text{customers}$ = empty (no order changes)
Part 2 produces the delta: for each of alice's orders, DELETE the old row (with tier='standard') and INSERT a new row (with tier='premium').
The stream table is updated to reflect the new tier across all of alice's order rows.
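Under the hood this is signed-multiset algebra. A sketch under stated assumptions: rows are tuples, relations are dicts mapping row → signed multiplicity, and `join`, `join_delta`, `lkey`, `rkey` are hypothetical names, not the extension's API:

```python
from collections import Counter

def join(a, b, lkey, rkey):
    """Equi-join two signed multisets; multiplicities multiply."""
    out = Counter()
    for l, lc in a.items():
        for r, rc in b.items():
            if lkey(l) == rkey(r):
                out[l + r] += lc * rc
    return out

def join_delta(L, dL, R, dR, lkey, rkey):
    """Δ(L ⋈ R) = (ΔL ⋈ R) + (L ⋈ ΔR) - (ΔL ⋈ ΔR),
    with L and R taken as the post-change snapshots."""
    delta = Counter()
    for term, sign in ((join(dL, R, lkey, rkey), +1),
                       (join(L, dR, lkey, rkey), +1),
                       (join(dL, dR, lkey, rkey), -1)):
        for row, c in term.items():
            delta[row] += sign * c
    return {row: c for row, c in delta.items() if c != 0}

# alice's tier change: ΔR retracts the old customer row, asserts the new one
orders = {(101, 1, 50.00): 1, (102, 1, 30.00): 1}   # post-change snapshot
customers = {(1, 'alice', 'premium'): 1}            # post-change snapshot
d_customers = {(1, 'alice', 'standard'): -1, (1, 'alice', 'premium'): +1}
delta = join_delta(orders, {}, customers, d_customers,
                   lkey=lambda o: o[1], rkey=lambda c: c[0])
# each of alice's orders appears twice: -1 with the old tier, +1 with the new
```

With ΔL empty, only Part 2 contributes, reproducing the DELETE/INSERT pairs described above for every one of alice's order rows.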
Performance Summary
| Scenario | Buffer rows | Delta rows emitted | Work |
|---|---|---|---|
| Simple value change | 1 | 2 (D+I) | O(1) per group |
| Group key change | 1 | 2 (D+I, different groups) | O(1) per affected group |
| Group deletion (key change) | 1 | 2 (D vanished group + I target group) | O(1) |
| N updates same row | N | 2 (D first-old + I last-new) | O(N) scan, O(1) aggregate |
| INSERT+UPDATE same window | 2 | 1 (I only) | O(1) |
| UPDATE+DELETE same window | 2 | 1 (D only) | O(1) |
In all cases, the work is proportional to the number of changed rows, not the total table size. A single UPDATE on a billion-row table produces the same delta cost as on a 10-row table.
Next in This Series
- What Happens When You INSERT a Row? — The full 7-phase lifecycle (start here if you haven't already)
- What Happens When You DELETE a Row? — Reference counting, group deletion, INSERT+DELETE cancellation
- What Happens When You TRUNCATE a Table? — Why TRUNCATE bypasses triggers and how to recover
What Happens When You DELETE a Row?
This tutorial traces what happens when a DELETE statement hits a base table that is referenced by a stream table. It covers the trigger capture, how the scan delta emits a single DELETE event, and how each DVM operator propagates the removal — including group deletion, partial group reduction, JOINs, cascading deletes within a single refresh window, and the important edge case where a DELETE cancels a prior INSERT.
Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the full 7-phase lifecycle (trigger → scheduler → frontier → change detection → DVM delta → MERGE → cleanup). This tutorial focuses on how DELETE differs.
Setup
Same e-commerce example used throughout the series:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer TEXT NOT NULL,
amount NUMERIC(10,2) NOT NULL
);
SELECT pgstream.create_stream_table(
'customer_totals',
$$
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
$$,
'1m',
'DIFFERENTIAL'
);
-- Seed some data
INSERT INTO orders (customer, amount) VALUES
('alice', 50.00),
('alice', 30.00),
('bob', 75.00),
('bob', 25.00);
After the first refresh, the stream table contains:
customer | total | order_count
---------|--------|------------
alice | 80.00 | 2
bob | 100.00 | 2
Case 1: Delete One Row (Group Survives)
DELETE FROM orders WHERE id = 2; -- alice's 30.00 order
Alice still has one remaining order (id=1, amount=50.00). The group shrinks but doesn't vanish.
Phase 1: Trigger Capture
The AFTER DELETE trigger fires and writes one row to the change buffer with only OLD values:
pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┬────────────┬──────────┬────────────┐
│ change_id │ lsn │ action │ new_cust │ new_amt │ old_cust │ old_amt │ pk_hash │
├───────────┼─────────────┼────────┼──────────┼──────────┼────────────┼──────────┼────────────┤
│ 5 │ 0/1A3F3000 │ D │ NULL │ NULL │ alice │ 30.00 │ 4521038 │
└───────────┴─────────────┴────────┴──────────┴──────────┴────────────┴──────────┴────────────┘
Key difference from INSERT and UPDATE:
- new_* columns are all NULL — the row no longer exists, so there are no NEW values
- old_* columns contain the deleted row's data — this is what gets subtracted
- pk_hash is computed from OLD.id (the deleted row's primary key)
Phase 2–4: Scheduler, Frontier, Change Detection
Identical to the INSERT flow. The scheduler detects one change row in the LSN window.
Phase 5: Scan Differentiation — Pure DELETE
Unlike UPDATE (which splits into D+I), a DELETE produces a single event:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
4521038 | D | alice | 30.00
The scan delta applies the net-effect filtering rule:
- first_action = 'D' → row existed before the refresh window
- last_action = 'D' → row does not exist after
Result: emit a DELETE using old values. No INSERT is emitted (because last_action = 'D').
This is the simplest path through the scan delta — one change, one PK, one DELETE event.
Phase 5 (continued): Aggregate Differentiation
The aggregate operator processes the DELETE event against the alice group:
-- DELETE event: subtract old values from alice's group
__ins_count = 0 -- no inserts
__del_count = 1 -- one deletion
__ins_total = 0 -- no amount added
__del_total = 30.00 -- 30.00 removed
The merge CTE joins this delta with the existing stream table state:
new_count = old_count + ins_count - del_count = 2 + 0 - 1 = 1 (still > 0)
Since new_count > 0 and the group already existed (old_count = 2), the action is classified as 'U' (update). The aggregate emits the group with its new values:
customer | total | order_count | __pgs_row_id | __pgs_action
---------|-------|-------------|--------------|-------------
alice | 50.00 | 1 | 7283194 | I
Note: the 'U' meta-action is emitted as __pgs_action = 'I' because the MERGE treats it as an update-via-INSERT (see aggregate final CTE: CASE WHEN __pgs_meta_action = 'D' THEN 'D' ELSE 'I' END).
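The classification step can be sketched as a small Python function. This is an illustrative model of the logic described above (`classify_group` is a hypothetical name; the real logic lives in the generated merge CTE):

```python
def classify_group(old_count, ins_count, del_count):
    """Classify a group's refresh action from its reference counts.

    Returns (meta_action, emitted_action): the meta-action is
    'I' (group born), 'U' (group survives with new values),
    'D' (group vanishes), or None (no-op); a 'U' meta-action is
    emitted to the MERGE as __pgs_action = 'I'."""
    new_count = old_count + ins_count - del_count
    if old_count == 0 and new_count > 0:
        meta = 'I'      # group is born
    elif old_count > 0 and new_count <= 0:
        meta = 'D'      # reference count hit zero: group vanishes
    elif old_count > 0 and new_count > 0:
        meta = 'U'      # group survives with updated values
    else:
        meta = None     # e.g. INSERT+DELETE cancelled within the window
    emitted = None if meta is None else ('D' if meta == 'D' else 'I')
    return meta, emitted

assert classify_group(old_count=2, ins_count=0, del_count=1) == ('U', 'I')
assert classify_group(old_count=1, ins_count=0, del_count=1) == ('D', 'D')
assert classify_group(old_count=0, ins_count=1, del_count=0) == ('I', 'I')
```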
Phase 6: MERGE
The MERGE statement matches alice's existing row and updates it:
MERGE INTO customer_totals AS st
USING (...delta...) AS d
ON st.__pgs_row_id = d.__pgs_row_id
WHEN MATCHED AND d.__pgs_action = 'I' THEN
UPDATE SET customer = d.customer, total = d.total, order_count = d.order_count, ...
Result:
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
alice | 50.00 | 1 ← was 80.00 / 2
bob | 100.00 | 2
Phase 7: Cleanup
The change buffer rows in the consumed LSN window are deleted:
DELETE FROM pgstream_changes.changes_16384
WHERE lsn > '0/1A3F2FFF'::pg_lsn AND lsn <= '0/1A3F3000'::pg_lsn;
Case 2: Delete Last Row in Group (Group Vanishes)
-- Alice has one order left (id=1, amount=50.00). Delete it.
DELETE FROM orders WHERE id = 1;
Trigger Capture
change_id | lsn | action | old_cust | old_amt | pk_hash
6 | 0/1A3F3100 | D | alice | 50.00 | -837291
Scan Delta
Single DELETE event:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
-837291 | D | alice | 50.00
Aggregate Delta
Group "alice":
ins_count = 0
del_count = 1
new_count = old_count + 0 - 1 = 1 - 1 = 0 → group vanishes!
When new_count drops to 0 (or below), the aggregate classifies this as action 'D' (delete). The reference count has reached zero — no rows contribute to this group anymore.
The aggregate emits a DELETE for alice's group:
customer | __pgs_row_id | __pgs_action
---------|--------------|-------------
alice | 7283194 | D
MERGE
The MERGE matches alice's existing row and deletes it:
WHEN MATCHED AND d.__pgs_action = 'D' THEN DELETE
Result:
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
bob | 100.00 | 2
Alice's row is completely removed from the stream table. This is the correct behavior — with zero contributing rows, the group should not exist.
Case 3: Delete Multiple Rows (Same Group, Same Window)
-- Delete both of bob's orders before the next refresh
DELETE FROM orders WHERE id = 3; -- bob, 75.00
DELETE FROM orders WHERE id = 4; -- bob, 25.00
The change buffer has two rows with different pk_hash values (different PKs):
change_id | action | old_cust | old_amt | pk_hash
7 | D | bob | 75.00 | pk_hash_3
8 | D | bob | 25.00 | pk_hash_4
Scan Delta
Each PK has exactly one change, so both take the single-change fast path:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
pk_hash_3 | D | bob | 75.00
pk_hash_4 | D | bob | 25.00
Two DELETE events, both targeting bob's group.
Aggregate Delta
The aggregate sums both deletions:
Group "bob":
ins_count = 0
del_count = 2
del_total = 75.00 + 25.00 = 100.00
new_count = 2 + 0 - 2 = 0 → group vanishes!
The aggregate emits a DELETE for bob's group.
MERGE
Bob's row is deleted from the stream table. With both alice and bob gone (from Cases 1+2+3), the stream table is now empty.
Case 4: INSERT + DELETE in Same Window (Cancellation)
What if a row is inserted and then deleted before the next refresh?
INSERT INTO orders (customer, amount) VALUES ('charlie', 200.00);
DELETE FROM orders WHERE customer = 'charlie';
The change buffer has:
change_id | action | new_cust | new_amt | old_cust | old_amt | pk_hash
9 | I | charlie | 200.00 | NULL | NULL | pk_hash_new
10 | D | NULL | NULL | charlie | 200.00 | pk_hash_new
Net-Effect Computation
Both changes share the same pk_hash. The pk_stats CTE finds cnt = 2, so this goes through the multi-change path:
first_action = FIRST_VALUE(action) OVER (...) → 'I'
last_action = LAST_VALUE(action) OVER (...) → 'D'
The scan delta applies the net-effect filtering:
- DELETE branch: requires first_action != 'I' → FAILS (first_action = 'I')
- INSERT branch: requires last_action != 'D' → FAILS (last_action = 'D')
Result: zero events emitted. The INSERT and DELETE completely cancel each other out.
The aggregate never sees charlie. The stream table is unchanged. This is correct — the row was born and died within the same refresh window, so it should have no visible effect.
Case 5: UPDATE + DELETE in Same Window
UPDATE orders SET amount = 999.99 WHERE id = 3; -- bob: 75 → 999.99
DELETE FROM orders WHERE id = 3;
The change buffer:
change_id | action | old_amt | new_amt
11 | U | 75.00 | 999.99
12 | D | 999.99 | NULL
Net-Effect Computation
Same pk_hash, cnt = 2:
first_action = 'U' (row existed before this window)
last_action = 'D' (row no longer exists)
Filtering:
- DELETE branch: first_action != 'I' → OK. Emit DELETE with old values from the earliest change: old_amt = 75.00
- INSERT branch: last_action != 'D' → FAILS. No INSERT emitted.
Net delta:
__pgs_row_id | __pgs_action | amount
-------------|--------------|-------
pk_hash_3 | D | 75.00
The intermediate value of 999.99 never appears. The aggregate sees only the removal of the original value (75.00), which is correct — that's the value that was previously accounted for in the stream table.
Case 6: DELETE with JOINs
Consider a stream table that joins two tables:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
tier TEXT NOT NULL DEFAULT 'standard'
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
amount NUMERIC(10,2)
);
SELECT pgstream.create_stream_table(
'order_details',
$$
SELECT c.name, c.tier, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
$$,
'1m',
'DIFFERENTIAL'
);
Seed data:
INSERT INTO customers VALUES (1, 'alice', 'premium'), (2, 'bob', 'standard');
INSERT INTO orders VALUES (1, 1, 50.00), (2, 1, 30.00), (3, 2, 75.00);
After refresh, the stream table has:
name | tier | amount
------|----------|-------
alice | premium | 50.00
alice | premium | 30.00
bob | standard | 75.00
Now delete an order:
DELETE FROM orders WHERE id = 2; -- alice's 30.00 order
How the JOIN Delta Works
The join differentiation formula:
$$\Delta(L \bowtie R) = (\Delta L \bowtie R) \cup (L \bowtie \Delta R) - (\Delta L \bowtie \Delta R)$$
Since only the orders table changed:
- $\Delta L$ = changes to orders (one DELETE: order #2)
- $\Delta R$ = changes to customers (empty)
So:
- Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = the deleted order joined with its customer
- Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = empty (no customer changes)
- Part 3: $\Delta\text{orders} \bowtie \Delta\text{customers}$ = empty (customers unchanged)
Part 1 produces:
name | tier | amount | __pgs_action
------|---------|--------|-------------
alice | premium | 30.00 | D
The deleted order is joined with alice's customer record to produce a DELETE delta row with the complete joined values.
MERGE
The MERGE matches the row (alice, premium, 30.00) and deletes it:
SELECT * FROM order_details;
name | tier | amount
-------|----------|-------
alice | premium | 50.00 ← alice's remaining order
bob | standard | 75.00
What About Deleting From the Dimension Table?
DELETE FROM customers WHERE id = 2; -- remove bob entirely
Now $\Delta R$ has a DELETE for bob, while $\Delta L$ is empty:
- Part 1: $\Delta\text{orders} \bowtie \text{customers}$ = empty
- Part 2: $\text{orders} \bowtie \Delta\text{customers}$ = bob's order(s) joined with deleted customer record
Part 2 produces DELETE events for every order that referenced bob:
name | tier | amount | __pgs_action
-----|----------|--------|-------------
bob | standard | 75.00 | D
After MERGE, bob's rows vanish from the stream table.
Note: This assumes referential integrity — if orders still references customer #2, a foreign key constraint would prevent the DELETE in practice. But from the IVM perspective, the join delta correctly handles the removal regardless.
Case 7: Bulk DELETE
DELETE FROM orders WHERE amount < 50.00;
Starting again from the seeded data, this deletes multiple rows across potentially multiple groups. The trigger fires once per row (it's a FOR EACH ROW trigger), writing one change buffer entry per deleted row:
change_id | action | old_cust | old_amt | pk_hash
13 | D | alice | 30.00 | pk_hash_2
14 | D | bob | 25.00 | pk_hash_4
Scan Delta
Each deleted PK is independent (different pk_hash values), so each takes the single-change fast path. Two DELETE events:
__pgs_row_id | __pgs_action | customer | amount
-------------|--------------|----------|-------
pk_hash_2 | D | alice | 30.00
pk_hash_4 | D | bob | 25.00
Aggregate Delta
The aggregate groups these by customer:
Group "alice":
del_count = 1, del_total = 30.00
new_count = 2 - 1 = 1 (survives)
Group "bob":
del_count = 1, del_total = 25.00
new_count = 2 - 1 = 1 (survives)
Both groups survive (count > 0), so the aggregate emits UPDATE (as 'I') events with new values:
customer | total | order_count
---------|-------|------------
alice | 50.00 | 1
bob | 75.00 | 1
The MERGE updates both rows. All work is proportional to the number of deleted rows (2), not the total table size.
Case 8: TRUNCATE (What Doesn't Work)
TRUNCATE orders;
TRUNCATE does not fire row-level triggers. The change buffer receives zero entries. The next refresh cycle will detect no changes and skip the stream table.
This means the stream table becomes stale — it still shows the old aggregate values while the base table is empty.
Workaround: After a TRUNCATE, manually reinitialize:
SELECT pgstream.refresh_stream_table('customer_totals');
With the FULL refresh mode (or by setting the reinitialize flag), the stream table is recomputed from scratch, producing correct results.
Why no TRUNCATE support? PostgreSQL fires statement-level AFTER TRUNCATE triggers, but they don't provide OLD row data. The trigger can detect that a TRUNCATE happened, but cannot enumerate which rows were removed. Supporting TRUNCATE would require reading the stream table to determine what to subtract — effectively a full refresh anyway.
How DELETE Differs From INSERT and UPDATE — A Summary
| Aspect | INSERT | UPDATE | DELETE |
|---|---|---|---|
| Trigger writes | new_* columns only | Both new_* and old_* | old_* columns only |
| new_* columns | Row values | New values | NULL |
| old_* columns | NULL | Old values | Row values |
| pk_hash source | NEW.pk | NEW.pk | OLD.pk |
| Scan delta output | 1 INSERT event | 2 events (D+I split) | 1 DELETE event |
| Aggregate effect | Adds to group count/sum | Subtracts old, adds new | Subtracts from group |
| Can delete a group? | No (only creates/grows) | Yes (if group key changes) | Yes (if count reaches 0) |
| MERGE action | INSERT new row | UPDATE existing row | DELETE matched row |
The Reference Counting Principle
The core insight behind incremental DELETE handling is reference counting. Every aggregate group in the stream table maintains an internal counter (__pgs_count) that tracks how many source rows contribute to the group:
Stream table internal state:
customer | total | order_count | __pgs_count (hidden)
---------|-------|-------------|---------------------
alice | 80.00 | 2 | 2
bob | 100.00| 2 | 2
- INSERT → __pgs_count += 1
- DELETE → __pgs_count -= 1
- UPDATE → __pgs_count += 0 (D cancels I for same-group updates)
When __pgs_count reaches 0:
- The group has zero contributing rows
- The aggregate emits a DELETE event
- The MERGE removes the row from the stream table
This is mathematically rigorous — the stream table always reflects the correct result of the defining query over the current base table contents, incrementally maintained through algebraic delta operations.
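The whole principle fits in a short Python sketch. This is a hypothetical model (`apply_events` is an illustrative name; the extension does this in SQL via the aggregate delta and MERGE), with groups stored as (total, order_count, __pgs_count) tuples:

```python
def apply_events(table, events):
    """Maintain reference-counted SUM/COUNT groups under I/D events.

    `table` maps group key -> (total, order_count, pgs_count); an
    UPDATE arrives as a D of the old row followed by an I of the new."""
    for action, (customer, amount) in events:
        sign = 1 if action == 'I' else -1
        total, count, refs = table.get(customer, (0.0, 0, 0))
        total, count, refs = total + sign * amount, count + sign, refs + sign
        if refs <= 0:
            table.pop(customer, None)   # zero contributors: group row deleted
        else:
            table[customer] = (total, count, refs)
    return table

table = {'alice': (80.00, 2, 2), 'bob': (100.00, 2, 2)}
apply_events(table, [('D', ('alice', 30.00)), ('D', ('alice', 50.00))])
# alice's reference count reaches 0 and her row is removed; bob is untouched
```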
Performance Summary
| Scenario | Buffer rows | Delta rows emitted | Work |
|---|---|---|---|
| Single row DELETE (group survives) | 1 | 1 (D) | O(1) per group |
| Single row DELETE (group vanishes) | 1 | 1 (D) | O(1) |
| N deletes same group | N | N (D) → 1 group delta | O(N) scan, O(1) per group |
| INSERT+DELETE same window | 2 | 0 (cancels) | O(1) |
| UPDATE+DELETE same window | 2 | 1 (D original) | O(1) |
| Bulk DELETE across M groups | N | N (D) → M group deltas | O(N) scan, O(M) aggregate |
| JOIN table DELETE | 1 | K (one per matched join row) | O(K) join |
In all cases, the work is proportional to the number of changed rows, not the total table size. Deleting 3 rows from a billion-row table produces the same delta cost as from a 10-row table.
Next in This Series
- What Happens When You INSERT a Row? — The full 7-phase lifecycle (start here if you haven't already)
- What Happens When You UPDATE a Row? — D+I split, group key changes, net-effect for multiple UPDATEs
- What Happens When You TRUNCATE a Table? — Why TRUNCATE bypasses triggers and how to recover
What Happens When You TRUNCATE a Table?
This tutorial explains what happens when a TRUNCATE statement hits a base table that is referenced by a stream table. Unlike INSERT, UPDATE, and DELETE — which are fully tracked by the CDC trigger — TRUNCATE is a special case that bypasses row-level triggers entirely. Understanding this gap is essential for operating pg_stream correctly.
Prerequisite: Read WHAT_HAPPENS_ON_INSERT.md first — it introduces the 7-phase lifecycle. This tutorial explains why TRUNCATE breaks that lifecycle and how to recover.
Setup
Same e-commerce example used throughout the series:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer TEXT NOT NULL,
amount NUMERIC(10,2) NOT NULL
);
SELECT pgstream.create_stream_table(
'customer_totals',
$$
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
$$,
'1m',
'DIFFERENTIAL'
);
-- Seed some data
INSERT INTO orders (customer, amount) VALUES
('alice', 50.00),
('alice', 30.00),
('bob', 75.00),
('bob', 25.00);
After the first refresh, the stream table contains:
customer | total | order_count
---------|--------|------------
alice | 80.00 | 2
bob | 100.00 | 2
Case 1: TRUNCATE the Base Table
TRUNCATE orders;
All four rows are removed instantly.
What Happens at the Trigger Level: Nothing
PostgreSQL's TRUNCATE command does not fire row-level AFTER triggers. This is by design in PostgreSQL — TRUNCATE is a DDL-like operation that removes all rows without scanning them individually. The per-row AFTER INSERT OR UPDATE OR DELETE trigger that pg_stream installs is simply never invoked.
The change buffer remains empty:
pgstream_changes.changes_16384
┌───────────┬─────────────┬────────┬──────────┬──────────┐
│ change_id │ lsn │ action │ ... │ ... │
├───────────┼─────────────┼────────┼──────────┼──────────┤
│ (empty) │ │ │ │ │
└───────────┴─────────────┴────────┴──────────┴──────────┘
No rows. Zero change events captured.
What Happens at the Scheduler: Skip
On the next refresh cycle, the scheduler:
- Checks the change buffer for rows in the LSN window
- Finds zero rows
- Classifies the refresh action as NO_DATA
- Advances the data timestamp but does not modify the stream table
The stream table still shows the old values:
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
alice | 80.00 | 2 ← STALE! orders table is empty
bob | 100.00 | 2 ← STALE! orders table is empty
The stream table is now stale. It reflects data that no longer exists in the base table.
Why This Happens
The entire incremental view maintenance pipeline depends on the change buffer to know what changed. Without change events, the DVM has no deltas to apply. The stream table's reference-counted aggregates still think there are 4 orders contributing to two groups.
This is not a bug — it's a fundamental limitation of trigger-based CDC. The trigger can only fire when PostgreSQL executes individual row operations, and TRUNCATE deliberately skips per-row processing for performance.
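The stale state is easy to reproduce in a toy model. The sketch below (Python, purely illustrative; pg_stream's actual engine is not written this way and these data structures are assumptions) maintains reference-counted aggregates from change events — the point is that with an empty change buffer, the state simply never moves:

```python
# Illustrative model of reference-counted aggregate maintenance:
# the stream table state only changes when change events arrive.

def apply_events(state, events):
    """state: {group: [order_count, total]}; events: (action, group, amount)."""
    for action, group, amount in events:
        cnt, total = state.get(group, [0, 0.0])
        if action == "INSERT":
            cnt, total = cnt + 1, total + amount
        elif action == "DELETE":
            cnt, total = cnt - 1, total - amount
        if cnt == 0:
            state.pop(group, None)  # group disappears when its refcount hits zero
        else:
            state[group] = [cnt, total]
    return state

state = {}
apply_events(state, [("INSERT", "alice", 50.0), ("INSERT", "alice", 30.0),
                     ("INSERT", "bob", 75.0), ("INSERT", "bob", 25.0)])

# TRUNCATE produces no events, so the state is untouched -> stale
apply_events(state, [])
print(state)  # {'alice': [2, 80.0], 'bob': [2, 100.0]}

# A captured DELETE of all alice rows, by contrast, removes her group
apply_events(state, [("DELETE", "alice", 50.0), ("DELETE", "alice", 30.0)])
print(state)  # {'bob': [2, 100.0]}
```

This is also why `DELETE FROM` (discussed below) keeps everything consistent: each removal arrives as an event that decrements the counts.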
Case 2: How to Recover — Manual Refresh
The fix is straightforward. Force a manual refresh:
SELECT pgstream.refresh_stream_table('customer_totals');
This executes a full refresh regardless of the stream table's configured refresh mode:
- TRUNCATE the stream table itself (clearing the stale data)
- Re-execute the defining query: `SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count FROM orders GROUP BY customer`
- INSERT the results into the stream table
- Update the frontier so future differential refreshes start from the current LSN
Since the orders table is empty, the defining query returns zero rows. The stream table becomes empty too:
SELECT pgstream.refresh_stream_table('customer_totals');
SELECT * FROM customer_totals;
customer | total | order_count
----------|-------|------------
(0 rows) ← correct: orders is empty
The stream table is now consistent again. Future INSERT/UPDATE/DELETE operations on orders will be captured normally by the trigger and propagated incrementally.
Case 3: TRUNCATE Then INSERT (Common ETL Pattern)
A common data loading pattern is:
BEGIN;
TRUNCATE orders;
INSERT INTO orders (customer, amount) VALUES
('charlie', 100.00),
('charlie', 200.00),
('dave', 150.00);
COMMIT;
What the Change Buffer Sees
- TRUNCATE: 0 events (not captured)
- INSERT charlie 100.00: 1 event (captured)
- INSERT charlie 200.00: 1 event (captured)
- INSERT dave 150.00: 1 event (captured)
The change buffer has 3 INSERT events — but knows nothing about the 4 rows that were removed by TRUNCATE.
What the Scheduler Does
The scheduler sees 3 pending changes and runs a DIFFERENTIAL refresh:
Aggregate delta (from INSERTs only):
Group "charlie": ins_count = 2, ins_total = 300.00
Group "dave": ins_count = 1, ins_total = 150.00
The MERGE applies these deltas to the existing stream table:
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
alice | 80.00 | 2 ← STALE! alice has no orders
bob | 100.00 | 2 ← STALE! bob has no orders
charlie | 300.00 | 2 ← correct
dave | 150.00 | 1 ← correct
The new data (charlie, dave) is correct. But the old data (alice, bob) persists because the TRUNCATE that removed their orders was never captured. The reference counts for alice and bob were never decremented.
How to Fix
SELECT pgstream.refresh_stream_table('customer_totals');
SELECT * FROM customer_totals;
customer | total | order_count
----------|--------|------------
charlie | 300.00 | 2 ← correct
dave | 150.00 | 1 ← correct
Case 4: TRUNCATE a Dimension Table in a JOIN
Consider a stream table that joins two tables:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
tier TEXT NOT NULL DEFAULT 'standard'
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
amount NUMERIC(10,2)
);
SELECT pgstream.create_stream_table(
'order_details',
$$
SELECT c.name, c.tier, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
$$,
'1m',
'DIFFERENTIAL'
);
Now truncate the dimension table:
TRUNCATE customers CASCADE;
The CASCADE also truncates orders (due to the foreign key). Neither TRUNCATE fires row-level triggers. The change buffer is empty for both tables.
The stream table continues to show all the old joined rows as if nothing happened. The only recovery is a manual refresh:
SELECT pgstream.refresh_stream_table('order_details');
Case 5: FULL Mode Stream Tables Are Immune
If the stream table uses FULL refresh mode instead of DIFFERENTIAL:
SELECT pgstream.create_stream_table(
'customer_totals_full',
$$
SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
FROM orders GROUP BY customer
$$,
'1m',
'FULL'
);
A FULL-mode stream table doesn't use the change buffer at all. Every refresh cycle:
- TRUNCATEs the stream table
- Re-executes the defining query
- Inserts all results
So after a TRUNCATE of the base table, the next scheduled refresh automatically picks up the correct state — no manual intervention needed. The trade-off is that every refresh recomputes from scratch, which is more expensive for large result sets.
Why PostgreSQL Doesn't Fire Row Triggers on TRUNCATE
Understanding the PostgreSQL internals helps explain why this limitation exists:
| Operation | Mechanism | Row triggers fired? |
|---|---|---|
| `DELETE FROM t` | Scans and removes rows one by one | Yes — AFTER DELETE per row |
| `TRUNCATE t` | Removes the heap files and reinitializes the table storage | No — no per-row processing |
| `DELETE FROM t WHERE true` | Same as `DELETE FROM t` (full scan) | Yes — AFTER DELETE per row |
TRUNCATE is fundamentally different from DELETE. It's an O(1) operation that replaces the table's storage files, while DELETE is O(N) — scanning every row and recording each removal in WAL.
PostgreSQL does support statement-level AFTER TRUNCATE triggers:
CREATE TRIGGER after_truncate_trigger
AFTER TRUNCATE ON orders
FOR EACH STATEMENT
EXECUTE FUNCTION some_function();
However, statement-level TRUNCATE triggers:
- Do not receive OLD row data (there is no `OLD` record)
- Cannot enumerate which rows were removed
- Only know that a TRUNCATE happened on a specific table
This means a TRUNCATE trigger could detect the event but cannot generate the per-row DELETE events that the DVM pipeline needs.
Alternative: DELETE FROM Instead of TRUNCATE
If you need the stream table to stay consistent without manual intervention, use DELETE FROM instead of TRUNCATE:
-- Instead of: TRUNCATE orders;
DELETE FROM orders;
This is slower (O(N) vs O(1)) but fires the row-level DELETE trigger for every row. The change buffer captures all removals, and the next differential refresh correctly decrements all reference counts, removing groups whose count reaches zero.
| Approach | Speed | Stream table consistent? |
|---|---|---|
| `TRUNCATE orders` | O(1) — instant | No — requires manual refresh |
| `DELETE FROM orders` | O(N) — scans all rows | Yes — triggers fire for each row |
| `TRUNCATE` + manual refresh | O(1) + O(query) | Yes — after manual refresh |
For tables with millions of rows, DELETE FROM can be slow and generate significant WAL. In those cases, TRUNCATE followed by a manual refresh is often the pragmatic choice.
Best Practices
1. Always Refresh After TRUNCATE
Make it a habit: if you TRUNCATE a base table, immediately refresh all dependent stream tables:
TRUNCATE orders;
SELECT pgstream.refresh_stream_table('customer_totals');
2. Use DELETE FROM for Small Tables
For tables with fewer than ~100K rows, DELETE FROM is fast enough and keeps everything consistent automatically.
3. Wrap TRUNCATE + Refresh in a Function
For ETL pipelines, create a helper:
CREATE OR REPLACE FUNCTION reload_orders(data jsonb) RETURNS void AS $$
BEGIN
TRUNCATE orders;
INSERT INTO orders SELECT * FROM jsonb_populate_recordset(null::orders, data);
PERFORM pgstream.refresh_stream_table('customer_totals');
END;
$$ LANGUAGE plpgsql;
4. Consider FULL Mode for ETL-Heavy Tables
If a table is routinely truncated and reloaded, FULL refresh mode may be simpler than DIFFERENTIAL — it naturally handles TRUNCATE because it recomputes from scratch every cycle.
5. Monitor for Staleness
pg_stream emits monitoring alerts when a stream table's data deviates from expected freshness. After a TRUNCATE, the data_timestamp still advances (since the scheduler sees no pending changes), but the actual data is stale. The most reliable detection is comparing the stream table results against a fresh query:
-- Quick consistency check
SELECT count(*) FROM customer_totals; -- still shows rows?
SELECT count(*) FROM orders; -- should be 0 after TRUNCATE
How TRUNCATE Compares to Other Operations
| Aspect | INSERT | UPDATE | DELETE | TRUNCATE |
|---|---|---|---|---|
| Trigger fires? | Yes (per row) | Yes (per row) | Yes (per row) | No |
| Change buffer | 1 row per INSERT | 1 row per UPDATE | 1 row per DELETE | Empty |
| Stream table updated? | Yes (next refresh) | Yes (next refresh) | Yes (next refresh) | No — stays stale |
| Recovery | Automatic | Automatic | Automatic | Manual refresh required |
| FULL mode affected? | N/A (recomputes) | N/A (recomputes) | N/A (recomputes) | N/A (recomputes) |
| Speed | O(1) per row | O(1) per row | O(1) per row | O(1) total |
Summary
TRUNCATE is the one common DML-like operation that falls outside pg_stream's automatic change tracking. The trigger-based CDC architecture captures INSERT, UPDATE, and DELETE perfectly — but TRUNCATE bypasses row-level triggers by design.
The key takeaways:
- TRUNCATE does not fire row-level triggers — the change buffer stays empty
- The stream table becomes stale — showing data that no longer exists in the base table
- Manual refresh fixes it — `SELECT pgstream.refresh_stream_table('name')` recomputes from scratch
- FULL mode is immune — every refresh recomputes regardless of change tracking
- `DELETE FROM` is the trigger-safe alternative — slower but keeps everything consistent automatically
- After TRUNCATE, always refresh — make this a standard part of your ETL workflow
Next in This Series
- What Happens When You INSERT a Row? — The full 7-phase lifecycle (start here if you haven't already)
- What Happens When You UPDATE a Row? — D+I split, group key changes, net-effect for multiple UPDATEs
- What Happens When You DELETE a Row? — Reference counting, group deletion, INSERT+DELETE cancellation
Frequently Asked Questions
General
What is pg_stream?
pg_stream is a PostgreSQL 18 extension that implements stream tables — declarative, automatically-refreshing materialized views with Differential View Maintenance (DVM). You define a SQL query and a refresh schedule; the extension handles change capture, delta computation, and incremental refresh automatically.
It is inspired by the DBSP differential dataflow framework. See DBSP_COMPARISON.md for a detailed comparison.
How is this different from PostgreSQL materialized views?
| Feature | Materialized Views | Stream Tables |
|---|---|---|
| Refresh | Manual (REFRESH MATERIALIZED VIEW) | Automatic (scheduler) or manual |
| Incremental refresh | Not supported natively | Built-in differential mode |
| Change detection | None — always full recompute | CDC triggers track row-level changes |
| Dependency ordering | None | DAG-aware topological refresh |
| Monitoring | None | Built-in views, stats, NOTIFY alerts |
| Schedule | None | Duration strings (5m) or cron (*/5 * * * *) |
What PostgreSQL versions are supported?
PostgreSQL 18.x exclusively. The extension uses features specific to PostgreSQL 18.
Does pg_stream require wal_level = logical?
No. pg_stream uses lightweight row-level triggers for change data capture, not logical replication. You do not need to set wal_level = logical or configure max_replication_slots.
Is pg_stream production-ready?
pg_stream is under active development. It has a comprehensive test suite (700+ unit tests, 290+ end-to-end tests), but users should evaluate it against their specific workloads before deploying to production.
Installation & Setup
How do I install pg_stream?
1. Add `pg_stream` to `shared_preload_libraries` in `postgresql.conf`: `shared_preload_libraries = 'pg_stream'`
2. Restart PostgreSQL.
3. Run: `CREATE EXTENSION pg_stream;`
See INSTALL.md for platform-specific instructions and pre-built release artifacts.
What are the minimum configuration requirements?
Only shared_preload_libraries = 'pg_stream' is mandatory (requires a restart). All other settings have sensible defaults. max_worker_processes = 8 is recommended.
Can I install pg_stream on a managed PostgreSQL service (RDS, Cloud SQL, etc.)?
It depends on whether the service allows custom extensions and shared_preload_libraries. Since pg_stream does not require wal_level = logical, it avoids one of the most common restrictions on managed services. Check your provider's documentation for custom extension support.
How do I uninstall pg_stream?
1. Drop all stream tables first (or they will be cascade-dropped): `SELECT pgstream.drop_stream_table(pgs_name) FROM pgstream.pgs_stream_tables;`
2. Drop the extension: `DROP EXTENSION pg_stream CASCADE;`
3. Remove `pg_stream` from `shared_preload_libraries` and restart PostgreSQL.
Creating & Managing Stream Tables
How do I create a stream table?
SELECT pgstream.create_stream_table(
'order_totals', -- name
'SELECT customer_id, SUM(amount) AS total
FROM orders GROUP BY customer_id', -- defining query
'5m', -- refresh schedule
'DIFFERENTIAL' -- refresh mode
);
What is the difference between FULL and DIFFERENTIAL refresh mode?
- FULL — Truncates the stream table and re-runs the entire defining query every refresh cycle. Simple but expensive for large result sets.
- DIFFERENTIAL — Computes only the delta (changes since the last refresh) using the DVM engine and applies it via a `MERGE` statement. Much faster when only a small fraction of source data changes between refreshes.
When should I use FULL vs. DIFFERENTIAL?
Use DIFFERENTIAL (default) when:
- Source tables are large and changes between refreshes are small
- The defining query uses supported operators (most common SQL is supported)
Use FULL when:
- The defining query uses unsupported aggregates (`CORR`, `COVAR_*`, `REGR_*`)
- Source tables are small and a full recompute is cheap
- You see frequent adaptive fallbacks to FULL (check refresh history)
What schedule formats are supported?
Duration strings:
| Unit | Suffix | Example |
|---|---|---|
| Seconds | s | 30s |
| Minutes | m | 5m |
| Hours | h | 2h |
| Days | d | 1d |
| Weeks | w | 1w |
| Compound | — | 1h30m |
Cron expressions:
| Format | Example | Description |
|---|---|---|
| 5-field | */5 * * * * | Every 5 minutes |
| Aliases | @hourly, @daily | Built-in shortcuts |
CALCULATED mode: Pass NULL as the schedule to inherit the schedule from downstream dependents.
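A compound duration string such as `1h30m` decomposes into unit/value pairs. The sketch below (an assumed parser, not pg_stream's actual implementation) mirrors the documented units:

```python
import re

# Hedged sketch: parsing a compound duration string ('30s', '5m', '1h30m')
# into seconds, using the units documented above (s, m, h, d, w).

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_duration(text):
    parts = re.findall(r"(\d+)([smhdw])", text)
    # Reject strings with leftover characters (e.g. '90x', '1h 30m')
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in parts)

print(parse_duration("30s"))    # 30
print(parse_duration("1h30m"))  # 5400
print(parse_duration("1w"))     # 604800
```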
What is the minimum allowed schedule?
The pg_stream.min_schedule_seconds GUC (default: 60) sets the floor. Schedules shorter than this value are rejected. Set to 1 for development/testing.
Can a stream table reference another stream table?
Yes. Stream tables can depend on other stream tables. The scheduler automatically refreshes them in topological order (upstream first). Circular dependencies are detected and rejected at creation time.
-- ST1: aggregates orders
SELECT pgstream.create_stream_table('order_totals',
'SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id',
'1m', 'DIFFERENTIAL');
-- ST2: filters ST1
SELECT pgstream.create_stream_table('big_customers',
'SELECT customer_id, total FROM pgstream.order_totals WHERE total > 1000',
'1m', 'DIFFERENTIAL');
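The upstream-first ordering is a plain topological sort over the dependency DAG. This Python sketch (illustrative; pg_stream performs the equivalent internally) also shows where the "cycle detected" error from the Troubleshooting section comes from:

```python
# Sketch of DAG-aware refresh ordering with cycle detection.

def refresh_order(deps):
    """deps: {stream_table: [upstream stream tables it reads from]}.
    Returns an upstream-first refresh order; raises on a circular dependency."""
    order, done, visiting = [], set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"cycle detected involving {node!r}")
        visiting.add(node)
        for upstream in deps.get(node, []):
            visit(upstream)       # refresh dependencies first
        visiting.discard(node)
        done.add(node)
        order.append(node)

    for node in deps:
        visit(node)
    return order

deps = {"big_customers": ["order_totals"], "order_totals": []}
print(refresh_order(deps))  # ['order_totals', 'big_customers']
```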
How do I change a stream table's schedule or mode?
-- Change schedule
SELECT pgstream.alter_stream_table('order_totals', schedule => '10m');
-- Switch refresh mode
SELECT pgstream.alter_stream_table('order_totals', refresh_mode => 'FULL');
-- Suspend
SELECT pgstream.alter_stream_table('order_totals', status => 'SUSPENDED');
-- Resume
SELECT pgstream.alter_stream_table('order_totals', status => 'ACTIVE');
Can I change the defining query of a stream table?
Not directly. You must drop and recreate the stream table:
SELECT pgstream.drop_stream_table('order_totals');
SELECT pgstream.create_stream_table('order_totals', '<new query>', '5m', 'DIFFERENTIAL');
How do I trigger a manual refresh?
SELECT pgstream.refresh_stream_table('order_totals');
This works even when pg_stream.enabled = false (scheduler disabled).
SQL Support
What SQL features are supported in defining queries?
Most common SQL is supported in both FULL and DIFFERENTIAL modes:
- Table scans, projections, `WHERE`/`HAVING` filters
- `INNER`, `LEFT`, `RIGHT`, `FULL OUTER JOIN` (including multi-table joins)
- `GROUP BY` with 25+ aggregate functions (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `BOOL_AND/OR`, `STRING_AGG`, `ARRAY_AGG`, `JSON_AGG`, `JSONB_AGG`, `BIT_AND/OR/XOR`, `STDDEV`, `VARIANCE`, `MODE`, `PERCENTILE_CONT/DISC`, and more)
- `FILTER (WHERE ...)` on aggregates
- `DISTINCT`
- Set operations: `UNION ALL`, `UNION`, `INTERSECT`, `INTERSECT ALL`, `EXCEPT`, `EXCEPT ALL`
- Subqueries: `EXISTS`, `NOT EXISTS`, `IN (subquery)`, `NOT IN (subquery)`, scalar subqueries
- Non-recursive and recursive CTEs
- Window functions (`ROW_NUMBER`, `RANK`, `SUM OVER`, etc.)
- `LATERAL` joins with set-returning functions and correlated subqueries
- `CASE`, `COALESCE`, `NULLIF`, `GREATEST`, `LEAST`, `BETWEEN`, `IS DISTINCT FROM`
See DVM Operators for the complete list.
What SQL features are NOT supported?
The following are rejected with clear error messages and suggested rewrites:
| Feature | Reason | Suggested Rewrite |
|---|---|---|
| `TABLESAMPLE` | Stream tables materialize the full result set | Use `WHERE random() < fraction` in the consuming query |
| Window functions in expressions | Cannot be differentially maintained | Move the window function to a separate column |
| `LIMIT` / `OFFSET` | Stream tables materialize the full result set | Apply when querying the stream table |
| `FOR UPDATE` / `FOR SHARE` | Row-level locking not applicable | Remove the locking clause |
The following were previously rejected but are now supported via automatic parse-time rewrites:
| Feature | How It Works |
|---|---|
| `DISTINCT ON (…)` | Auto-rewritten to a `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1` subquery |
| `GROUPING SETS` / `CUBE` / `ROLLUP` | Auto-rewritten to a `UNION ALL` of separate `GROUP BY` queries |
| `NATURAL JOIN` | Common columns resolved at parse time; explicit equi-join synthesized |
| `ALL (subquery)` | Rewritten to `NOT EXISTS` with negated condition (AntiJoin) |
Each rejected feature is explained in detail in the Why Are These SQL Features Not Supported? section below.
What happens to ORDER BY in defining queries?
ORDER BY is accepted but silently discarded. Row order in a stream table is undefined (consistent with PostgreSQL's CREATE MATERIALIZED VIEW behavior). Apply ORDER BY when querying the stream table, not in the defining query.
Which aggregates support DIFFERENTIAL mode?
Algebraic (O(changes), fully incremental): COUNT, SUM, AVG
Semi-algebraic (incremental with occasional group rescan): MIN, MAX
Group-rescan (affected groups re-aggregated from source): STRING_AGG, ARRAY_AGG, JSON_AGG, JSONB_AGG, BOOL_AND, BOOL_OR, BIT_AND, BIT_OR, BIT_XOR, JSON_OBJECT_AGG, JSONB_OBJECT_AGG, STDDEV, STDDEV_POP, STDDEV_SAMP, VARIANCE, VAR_POP, VAR_SAMP, MODE, PERCENTILE_CONT, PERCENTILE_DISC, CORR, COVAR_POP, COVAR_SAMP, REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY
37 aggregate function variants are supported in total.
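The distinction between the classes is visible in a small model of `MIN` maintenance (a Python sketch under assumed semantics, not pg_stream's implementation): inserts are O(1), but deleting the current minimum forces a rescan of the group's surviving rows — which is exactly why `MIN`/`MAX` are only semi-algebraic:

```python
# Why MIN is "semi-algebraic": inserts update the value in O(1),
# but deleting the current minimum requires re-scanning the group.

def min_insert(current_min, value):
    # An insert can only lower the minimum — no rescan needed.
    return value if current_min is None or value < current_min else current_min

def min_delete(current_min, value, remaining_group_rows):
    if value > current_min:
        return current_min  # deleted row wasn't the min: O(1)
    # Deleted the minimum: re-aggregate from the surviving source rows.
    return min(remaining_group_rows) if remaining_group_rows else None

m = None
for v in [7, 3, 9]:
    m = min_insert(m, v)
print(m)                      # 3
m = min_delete(m, 3, [7, 9])  # deleting the min triggers a group rescan
print(m)                      # 7
```

Algebraic aggregates like `SUM` and `COUNT` never need this rescan, which is why they stay O(changes).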
Change Data Capture (CDC)
How does pg_stream capture changes to source tables?
pg_stream installs AFTER INSERT/UPDATE/DELETE row-level PL/pgSQL triggers on each source table. These triggers write change records (action, old/new row data as JSONB, LSN, transaction ID) into per-source buffer tables in the pgstream_changes schema.
What is the overhead of CDC triggers?
Approximately 20–55 μs per row (PL/pgSQL dispatch + row_to_json() + buffer INSERT). At typical write rates (<1000 writes/sec per source table), this adds less than 5% DML latency overhead.
What happens when I TRUNCATE a source table?
TRUNCATE is now captured via a statement-level AFTER TRUNCATE trigger that writes a T marker row to the change buffer. When the differential refresh engine detects this marker, it automatically falls back to a full refresh for that cycle, ensuring the stream table stays consistent.
Previously, TRUNCATE bypassed row-level triggers entirely. This is no longer a concern — both FULL and DIFFERENTIAL mode stream tables handle TRUNCATE correctly.
Are CDC triggers automatically cleaned up?
Yes. When the last stream table referencing a source is dropped, the trigger and its associated change buffer table are automatically removed.
What happens if a source table is dropped or altered?
pg_stream has DDL event triggers that detect ALTER TABLE and DROP TABLE on source tables. When detected:
- Affected stream tables are marked with `needs_reinit = true`
- The next refresh cycle performs a full reinitialization (drops and recreates the storage table)
- A `reinitialize_needed` NOTIFY alert is sent
How do I check if a source table has switched from trigger-based CDC to WAL-based CDC?
When you enable hybrid CDC (pg_stream.cdc_mode = 'auto'), pg_stream starts capturing changes with triggers and can automatically transition to WAL-based logical replication once conditions are met. There are several ways to check the current CDC mode for each source table:
1. Query the dependency catalog directly:
SELECT d.source_relid, c.relname AS source_table, d.cdc_mode,
d.slot_name, d.decoder_confirmed_lsn, d.transition_started_at
FROM pgstream.pgs_dependencies d
JOIN pg_class c ON c.oid = d.source_relid;
The cdc_mode column shows one of three values:
- `TRIGGER` — changes are captured via row-level triggers (the default)
- `TRANSITIONING` — the system is in the process of switching from triggers to WAL
- `WAL` — changes are captured via logical replication
2. Use the built-in health check function:
SELECT source_table, cdc_mode, slot_name, lag_bytes, alert
FROM pgstream.check_cdc_health();
This returns a row per source table with the current mode, replication slot lag (for WAL-mode sources), and any alert conditions such as slot_lag_exceeds_threshold or replication_slot_missing.
3. Listen for real-time transition notifications:
LISTEN pg_stream_cdc_transition;
pg_stream sends a NOTIFY with a JSON payload whenever a transition starts, completes, or is rolled back. Example payload:
{
"event": "transition_complete",
"source_table": "public.orders",
"old_mode": "TRANSITIONING",
"new_mode": "WAL",
"slot_name": "pg_stream_slot_16384"
}
This lets you integrate CDC mode changes into your monitoring stack without polling.
4. Check the global GUC setting:
SHOW pg_stream.cdc_mode;
This shows the desired global behavior (trigger, auto, or wal), not the per-table actual state. The per-table state lives in pgs_dependencies.cdc_mode as described above.
See CONFIGURATION.md for details on the pg_stream.cdc_mode and pg_stream.wal_transition_timeout GUCs.
Is it safe to add triggers to a stream table while the source table is switching CDC modes?
Yes, this is completely safe. CDC mode transitions and user-defined triggers operate on different tables and do not interfere with each other:
- CDC transitions affect how changes are captured from source tables (e.g., `orders`). The transition switches the capture mechanism from row-level triggers on the source table to WAL-based logical replication.
- User-defined triggers live on stream tables (e.g., `order_totals`) and control how the refresh engine applies changes to the materialized output.
Because these are independent concerns, you can freely add, modify, or remove triggers on a stream table at any point — including during an active CDC transition on its source tables.
How it works in practice:
- The refresh engine checks for user-defined triggers on the stream table at the start of each refresh cycle (via a fast `pg_trigger` lookup, <0.1 ms).
- If user triggers are detected, the engine uses explicit `DELETE`/`UPDATE`/`INSERT` statements instead of `MERGE`, so your triggers fire with correct `TG_OP`, `OLD`, and `NEW` values.
- The change data consumed by the refresh engine has the same format regardless of whether it came from CDC triggers or WAL decoding — so trigger detection and CDC mode are fully decoupled.
A trigger added between two refresh cycles will simply be picked up on the next cycle. The only theoretical edge case is adding a trigger during a single refresh, between the trigger-detection check and the MERGE execution — but since both steps run inside the same transaction and see the same snapshot, a concurrently added trigger cannot appear between them.
Performance & Tuning
How do I tune the scheduler interval?
The pg_stream.scheduler_interval_ms GUC controls how often the scheduler checks for stale stream tables (default: 1000 ms).
| Workload | Recommended Value |
|---|---|
| Low-latency (near real-time) | 100–500 |
| Standard | 1000 (default) |
| Low-overhead (many STs, long schedules) | 5000–10000 |
What is the adaptive fallback to FULL?
When the number of pending changes exceeds pg_stream.differential_max_change_ratio (default: 15%) of the source table size, DIFFERENTIAL mode automatically falls back to FULL for that refresh cycle. This prevents pathological delta queries on bulk changes.
- Set to `1.0` to effectively always use DIFFERENTIAL (fallback only if pending changes exceed the table size)
- Set to `0.0` to always fall back to FULL
- The default `0.15` (15%) is a good balance
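Putting the scheduler's documented rules together — zero pending changes yields `NO_DATA`, a change ratio above the threshold falls back to `FULL`, and everything else runs `DIFFERENTIAL` — the per-cycle decision can be sketched like this (illustrative Python, not the extension's source):

```python
# Sketch of the scheduler's refresh-action classification for one cycle.

def classify_refresh(pending_changes, source_rows, max_change_ratio=0.15):
    if pending_changes == 0:
        return "NO_DATA"       # advance timestamps, touch nothing
    if source_rows > 0 and pending_changes / source_rows > max_change_ratio:
        return "FULL"          # adaptive fallback on bulk changes
    return "DIFFERENTIAL"      # normal incremental path

print(classify_refresh(0, 1_000_000))        # NO_DATA
print(classify_refresh(500, 1_000_000))      # DIFFERENTIAL
print(classify_refresh(200_000, 1_000_000))  # FULL (20% > 15%)
```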
How many concurrent refreshes can run?
Controlled by pg_stream.max_concurrent_refreshes (default: 4, range: 1–32). Each concurrent refresh uses a background worker. Increase this if you have many stream tables and available CPU/IO.
How do I check if my stream tables are keeping up?
-- Quick overview
SELECT pgs_name, status, staleness, stale
FROM pgstream.stream_tables_info;
-- Detailed statistics
SELECT pgs_name, total_refreshes, avg_duration_ms, consecutive_errors, stale
FROM pgstream.pg_stat_stream_tables;
-- Recent refresh history for a specific ST
SELECT * FROM pgstream.get_refresh_history('order_totals', 10);
What is __pgs_row_id?
Every stream table has a __pgs_row_id BIGINT PRIMARY KEY column. It stores a 64-bit xxHash of the row's group-by key (or all columns for non-aggregate queries). The refresh engine uses it for delta MERGE operations (matching DELETEs and INSERTs by row ID).
You should ignore this column in your queries. It is an implementation detail.
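The mechanism is a deterministic key-to-BIGINT mapping. The sketch below illustrates it in Python; pg_stream uses xxHash, but since that is not in Python's standard library, `blake2b` with an 8-byte digest stands in here (an explicit substitution — the separator byte and signed-fold are also assumptions, not the extension's exact scheme):

```python
import hashlib

# Illustrative 64-bit row id derived from the group-by key.
# NOTE: blake2b is a stdlib stand-in for xxHash; the point is the
# deterministic key -> signed BIGINT mapping used for MERGE matching.

def pgs_row_id(*group_key):
    key = "\x1f".join(map(str, group_key)).encode()
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return h - (1 << 64) if h >= (1 << 63) else h  # fold into signed BIGINT range

print(pgs_row_id("alice"))                          # stable id for group 'alice'
print(pgs_row_id("alice") == pgs_row_id("alice"))   # True: deltas match by id
```

Because the id is a pure function of the key, a delta row for group `alice` always lands on the same stored row, which is what makes the `MERGE` matching work.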
Interoperability
Can PostgreSQL views reference stream tables?
Yes. Stream tables are standard heap tables. Views work normally and reflect data as of the most recent refresh.
Can materialized views reference stream tables?
Yes, though it is somewhat redundant (both are physical snapshots). The materialized view requires its own REFRESH MATERIALIZED VIEW — it does not auto-refresh when the stream table refreshes.
Can I replicate stream tables with logical replication?
Yes. Stream tables can be published like any ordinary table:
CREATE PUBLICATION my_pub FOR TABLE pgstream.order_totals;
Important caveats:
- The
__pgs_row_idcolumn is replicated (it is the primary key) - Subscribers receive materialized data, not the defining query
- Do not install pg_stream on the subscriber and attempt to refresh the replicated table — it will have no CDC triggers or catalog entries
- Internal change buffer tables are not published by default
Can I INSERT, UPDATE, or DELETE rows in a stream table directly?
No. Stream table contents are managed exclusively by the refresh engine. Direct DML will corrupt the internal state.
Can I add foreign keys to or from stream tables?
No. The refresh engine uses bulk MERGE operations that do not respect foreign key ordering. Foreign key constraints on stream tables are not supported.
Can I add my own triggers to stream tables?
Yes, for DIFFERENTIAL mode stream tables. When user-defined row-level triggers are detected (or pg_stream.user_triggers = 'on'), the refresh engine automatically switches from MERGE to explicit DELETE + UPDATE + INSERT statements. This ensures triggers fire with the correct TG_OP, OLD, and NEW values.
Limitations:
- Row-level triggers do not fire during FULL refresh (they are automatically suppressed via `DISABLE TRIGGER USER`). Use `REFRESH MODE DIFFERENTIAL` for stream tables with triggers.
- The `IS DISTINCT FROM` guard prevents no-op `UPDATE` triggers when the aggregate result is unchanged.
- `BEFORE` triggers that modify `NEW` will affect the stored value — the next refresh may "correct" it back, causing oscillation.
See the pg_stream.user_triggers GUC in CONFIGURATION.md for control options.
Can I ALTER TABLE a stream table directly?
No. Use pgstream.alter_stream_table() to modify schedule, refresh mode, or status. To change the defining query, drop and recreate the stream table.
Monitoring & Alerting
What monitoring views are available?
| View | Description |
|---|---|
| `pgstream.stream_tables_info` | Status overview with computed staleness |
| `pgstream.pg_stat_stream_tables` | Comprehensive stats (refresh counts, avg duration, error streaks) |
How do I get alerted when something goes wrong?
pg_stream sends PostgreSQL NOTIFY messages on the pg_stream_alert channel with JSON payloads:
| Event | When |
|---|---|
| `stale_data` | Staleness exceeds 2× the schedule |
| `auto_suspended` | Stream table suspended after max consecutive errors |
| `reinitialize_needed` | Upstream DDL change detected |
| `buffer_growth_warning` | Change buffer growing unexpectedly |
| `refresh_completed` | Refresh completed successfully |
| `refresh_failed` | Refresh failed |
Listen with:
LISTEN pg_stream_alert;
What happens when a stream table keeps failing?
After pg_stream.max_consecutive_errors (default: 3) consecutive failures, the stream table moves to ERROR status and automatic refreshes stop. An auto_suspended NOTIFY alert is sent.
To recover:
-- Fix the underlying issue (e.g., restore a dropped source table), then:
SELECT pgstream.alter_stream_table('my_table', status => 'ACTIVE');
Retries use exponential backoff (base 1s, max 60s, ±25% jitter, up to 5 retries before counting as a real failure).
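The documented retry policy (base 1 s, cap 60 s, ±25% jitter) can be sketched as follows — an illustrative Python model whose constants mirror the text, not the extension's source:

```python
import random

# Sketch of exponential backoff with a cap and symmetric jitter.

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.25, rng=random.random):
    nominal = min(cap, base * (2 ** attempt))       # 1, 2, 4, ... capped at 60 s
    return nominal * (1 - jitter + 2 * jitter * rng())  # uniform in ±25%

# Nominal (pre-jitter) delays for the first 5 retries: 1, 2, 4, 8, 16 seconds
for attempt in range(5):
    print(attempt, min(60.0, 2.0 ** attempt))
```

Injecting `rng` makes the jitter testable; in production the default `random.random` spreads retries so that many failing stream tables do not all wake at once.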
Configuration Reference
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_stream.enabled` | bool | true | Enable/disable the scheduler. Manual refreshes still work when false. |
| `pg_stream.scheduler_interval_ms` | int | 1000 | Scheduler wake interval in milliseconds (100–60000) |
| `pg_stream.min_schedule_seconds` | int | 60 | Minimum allowed schedule duration (1–86400) |
| `pg_stream.max_consecutive_errors` | int | 3 | Failures before auto-suspending (1–100) |
| `pg_stream.change_buffer_schema` | text | pgstream_changes | Schema for CDC buffer tables |
| `pg_stream.max_concurrent_refreshes` | int | 4 | Max parallel refresh workers (1–32) |
| `pg_stream.user_triggers` | text | auto | User trigger handling: auto (detect), on (always explicit DML), off (suppress) |
| `pg_stream.differential_max_change_ratio` | float | 0.15 | Change ratio threshold for adaptive FULL fallback (0.0–1.0) |
| `pg_stream.cleanup_use_truncate` | bool | true | Use TRUNCATE instead of DELETE for buffer cleanup |
All GUCs are SUSET context (superuser SET) and take effect without restart, except shared_preload_libraries which requires a PostgreSQL restart.
Troubleshooting
Unit tests crash with symbol not found in flat namespace on macOS 26+
macOS 26 (Tahoe) changed the dynamic linker (dyld) to eagerly resolve all
flat-namespace symbols at binary load time. pgrx extensions link PostgreSQL
server symbols (e.g. CacheMemoryContext, SPI_connect) with
-Wl,-undefined,dynamic_lookup, which previously resolved lazily. Since
cargo test --lib runs outside the postgres process, those symbols are
missing and the test binary aborts:
dyld[66617]: symbol not found in flat namespace '_CacheMemoryContext'
Use just test-unit — it automatically detects macOS 26+ and injects a
stub library (libpg_stub.dylib) via DYLD_INSERT_LIBRARIES. The stub
provides NULL/no-op definitions for the ~28 PostgreSQL symbols; they are never
called during unit tests (pure Rust logic only).
This does not affect integration tests, E2E tests, just lint,
just build, or the extension running inside PostgreSQL.
See the Installation Guide for details and manual usage.
My stream table is stuck in INITIALIZING status
The initial full refresh may have failed. Check:
SELECT * FROM pgstream.get_refresh_history('my_table', 5);
If the error is transient, retry with:
SELECT pgstream.refresh_stream_table('my_table');
My stream table shows stale data but the scheduler is running
Common causes:
- TRUNCATE on source table — bypasses CDC triggers. Manual refresh needed.
- Too many errors — check `consecutive_errors` in `pgstream.pg_stat_stream_tables`. Resume with `alter_stream_table(..., status => 'ACTIVE')`.
- Long-running refresh — check for lock contention or slow defining queries.
- Scheduler disabled — verify `SHOW pg_stream.enabled;` returns `on`.
I get "cycle detected" when creating a stream table
Stream tables cannot have circular dependencies. If ST-A depends on ST-B and ST-B depends on ST-A (directly or transitively), creation is rejected. Restructure your queries to break the cycle.
A source table was altered and my stream table stopped refreshing
pg_stream detects DDL changes via event triggers and marks affected stream tables with needs_reinit = true. The next scheduler cycle will reinitialize (full drop + recreate of storage) the stream table automatically. If the schema change breaks the defining query, the reinitialization will fail — check refresh history for the error and recreate the stream table with an updated query.
How do I see the delta query generated for a stream table?
SELECT pgstream.explain_st('order_totals');
This shows the DVM operator tree, source tables, and the generated delta SQL.
Why Are These SQL Features Not Supported?
This section gives detailed technical explanations for each SQL limitation. pg_stream follows the principle of "fail loudly rather than produce wrong data" — every unsupported feature is detected at stream-table creation time and rejected with a clear error message and a suggested rewrite.
How does NATURAL JOIN work?
NATURAL JOIN is now fully supported. At parse time, pg_stream resolves the common columns between the two tables (using OpTree::output_columns()) and synthesizes explicit equi-join conditions. This supports INNER, LEFT, RIGHT, and FULL NATURAL JOIN variants.
Internally, NATURAL JOIN is converted to an explicit JOIN ... ON before the DVM engine builds its operator tree, so delta computation works identically to a manually specified equi-join.
Note: The internal __pgs_row_id column is excluded from common column resolution, so NATURAL JOINs between stream tables work correctly.
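As an illustration, assuming hypothetical `orders` and `customers` tables whose only common column is `customer_id`, the parse-time conversion looks like this:

```sql
-- User writes:
SELECT * FROM orders NATURAL JOIN customers;

-- pg_stream synthesizes the equivalent explicit equi-join:
SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;
```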
How do GROUPING SETS, CUBE, and ROLLUP work?
GROUPING SETS, CUBE, and ROLLUP are now fully supported via an automatic parse-time rewrite. pg_stream decomposes these constructs into a UNION ALL of separate GROUP BY queries before the DVM engine processes the query.
Explosion guard: `CUBE(N)` generates $2^N$ branches. pg_stream rejects CUBE/ROLLUP combinations that would produce more than 64 branches to prevent runaway memory usage. Use explicit `GROUPING SETS (...)` instead.
For example:
-- This defining query:
SELECT dept, region, SUM(amount) FROM sales GROUP BY CUBE(dept, region)
-- Is automatically rewritten to:
SELECT dept, region, SUM(amount) FROM sales GROUP BY dept, region
UNION ALL
SELECT dept, NULL::text, SUM(amount) FROM sales GROUP BY dept
UNION ALL
SELECT NULL::text, region, SUM(amount) FROM sales GROUP BY region
UNION ALL
SELECT NULL::text, NULL::text, SUM(amount) FROM sales
GROUPING() function calls are replaced with integer literal constants corresponding to the grouping level. The rewrite is transparent — the DVM engine sees only standard GROUP BY + UNION ALL operators and can apply incremental delta computation to each branch independently.
How does DISTINCT ON (…) work?
DISTINCT ON is now fully supported via an automatic parse-time rewrite. pg_stream transparently transforms DISTINCT ON into a ROW_NUMBER() window function subquery:
-- This defining query:
SELECT DISTINCT ON (dept) dept, employee, salary
FROM employees ORDER BY dept, salary DESC
-- Is automatically rewritten to:
SELECT dept, employee, salary FROM (
SELECT dept, employee, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
) sub WHERE rn = 1
The rewrite happens before DVM parsing, so the operator tree sees a standard window function query and can apply partition-based recomputation for incremental delta maintenance.
Why is TABLESAMPLE rejected?
TABLESAMPLE returns a random subset of rows from a table (e.g., FROM orders TABLESAMPLE BERNOULLI(10) gives ~10% of rows).
Stream tables materialize the complete result set of the defining query and keep it up-to-date across refreshes. Baking a random sample into the defining query is not meaningful because:
- Non-determinism. Each refresh would sample different rows, making the stream table contents unstable and unpredictable. The delta between refreshes would be dominated by sampling noise, not actual data changes.
- CDC incompatibility. The trigger-based change-capture system tracks specific row-level changes (inserts, updates, deletes). A `TABLESAMPLE` defining query has no stable row identity — the "changed rows" concept doesn't apply when the entire sample shifts each cycle.
Rewrite:
-- Instead of sampling in the defining query:
SELECT * FROM orders TABLESAMPLE BERNOULLI(10)
-- Materialize the full result and sample when querying:
SELECT * FROM order_stream_table WHERE random() < 0.1
Why is LIMIT / OFFSET rejected?
Stream tables materialize the complete result set and keep it synchronized with source data. LIMIT/OFFSET would truncate the result:
- Undefined ordering. `LIMIT` without `ORDER BY` returns an arbitrary subset. Even with `ORDER BY`, stream tables discard ordering — the "top N" rows concept doesn't apply to a set-based materialized result.
- Delta instability. When source rows change, the boundary between "in the LIMIT" and "out of the LIMIT" shifts. A single INSERT could evict one row and admit another, requiring the refresh to track the full ordered position of every row — essentially a full rescan.
- Semantic mismatch. Users who write `LIMIT 100` typically want to limit what they read, not what is stored. Since stream tables are queried separately from their definition, the `LIMIT` belongs in the consuming query.
Rewrite:
-- Instead of:
'SELECT * FROM orders ORDER BY created_at DESC LIMIT 100'
-- Omit LIMIT from the defining query, apply when reading:
SELECT * FROM orders_stream_table ORDER BY created_at DESC LIMIT 100
Why are window functions in expressions rejected?
Window functions like ROW_NUMBER() OVER (…) are supported as standalone columns in stream tables. However, embedding a window function inside an expression — such as CASE WHEN ROW_NUMBER() OVER (...) = 1 THEN ... or SUM(x) OVER (...) + 1 — is rejected.
This restriction exists because:
- Partition-based recomputation. pg_stream's differential mode handles window functions by recomputing entire partitions that were affected by changes. When a window function is buried inside an expression, the DVM engine cannot isolate the window computation from the surrounding expression, making it impossible to correctly identify which partitions to recompute.
- Expression tree ambiguity. The DVM parser would need to differentiate the outer expression (arithmetic, `CASE`, etc.) while treating the inner window function specially. This creates a combinatorial explosion of differentiation rules for every possible expression type × window function combination.
Rewrite:
-- Instead of:
SELECT id, CASE WHEN ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
THEN 'top' ELSE 'other' END AS rank_label
FROM employees
-- Move window function to a separate column, then use a wrapping stream table:
-- ST1:
SELECT id, dept, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
-- ST2 (references ST1):
SELECT id, CASE WHEN rn = 1 THEN 'top' ELSE 'other' END AS rank_label
FROM pgstream.employees_ranked
Why is FOR UPDATE / FOR SHARE rejected?
FOR UPDATE and related locking clauses (FOR SHARE, FOR NO KEY UPDATE, FOR KEY SHARE) acquire row-level locks on selected rows. This is incompatible with stream tables because:
- Refresh semantics. Stream table contents are managed by the refresh engine using bulk `MERGE` operations. Row-level locks taken during the defining query would conflict with the refresh engine's own locking strategy.
- No direct DML. Since users cannot directly modify stream table rows, there is no use case for locking rows inside the defining query. The locks would be held for the duration of the refresh transaction and then released, serving no purpose.
Why is ALL (subquery) not supported?
ALL (subquery) compares a value against every row returned by a subquery (e.g., WHERE x > ALL (SELECT y FROM t)). It is rejected because:
- Simple equivalent rewrite. `x > ALL (SELECT y FROM t)` is logically equivalent to `NOT EXISTS (SELECT 1 FROM t WHERE y >= x)`, which pg_stream can handle via its anti-join operator. The rewrite is straightforward to apply by hand.
- Rare usage. `ALL (subquery)` is uncommon in analytical queries. Supporting it directly would add operator complexity for minimal benefit.
Rewrite:
-- Instead of:
WHERE amount > ALL (SELECT threshold FROM limits)
-- Use NOT EXISTS:
WHERE NOT EXISTS (SELECT 1 FROM limits WHERE threshold >= amount)
Why is ORDER BY silently discarded?
ORDER BY in the defining query is accepted but ignored. This is consistent with how PostgreSQL treats CREATE MATERIALIZED VIEW AS SELECT ... ORDER BY ... — the ordering is not preserved in the stored data.
Stream tables are heap tables with no guaranteed row order. The ORDER BY in the defining query would only affect the order of the initial INSERT, which has no lasting effect. Apply ordering when querying the stream table:
-- This ORDER BY is meaningless in the defining query:
'SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY total DESC'
-- Instead, order when reading:
SELECT * FROM regional_totals ORDER BY total DESC
Why are unsupported aggregates (CORR, COVAR_*, REGR_*) limited to FULL mode?
Regression aggregates like CORR, COVAR_POP, COVAR_SAMP, and the REGR_* family require maintaining running sums of products and squares across the entire group. Unlike COUNT/SUM/AVG (where deltas can be computed from the change alone) or group-rescan aggregates (where only affected groups are re-read), regression aggregates:
- Lack algebraic delta rules. There is no closed-form way to update a correlation coefficient from a single row change without access to the full group's data.
- Would degrade to group-rescan anyway. Even if supported, the implementation would need to rescan the full group from source — identical to FULL mode for most practical group sizes.
These aggregates work fine in FULL refresh mode, which re-runs the entire query from scratch each cycle.
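A sketch of using a regression aggregate with FULL refresh (the `refresh_mode` argument name is an assumption; table and column names are illustrative):

```sql
SELECT pgstream.create_stream_table(
    'price_qty_corr',
    'SELECT region, CORR(price, quantity) AS price_qty_corr
     FROM sales GROUP BY region',
    '5m',
    refresh_mode => 'FULL'
);
```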
Why Are These Stream Table Operations Restricted?
Stream tables are regular PostgreSQL heap tables under the hood, but their contents are managed exclusively by the refresh engine. This section explains why certain operations that work on ordinary tables are disallowed or unsupported on stream tables.
Why can't I INSERT, UPDATE, or DELETE rows in a stream table?
Stream table contents are the output of the refresh engine — they represent the materialized result of the defining query at a specific point in time. Direct DML would corrupt this contract in several ways:
- Row ID integrity. Every row has a `__pgs_row_id` (a 64-bit xxHash of the group-by key or all columns). The refresh engine uses this for delta `MERGE` — matching incoming deltas against existing rows. A manually inserted row with an incorrect or duplicate `__pgs_row_id` would cause the next differential refresh to produce wrong results (double-counting, missed deletes, or merge conflicts).
- Frontier inconsistency. Each refresh records a frontier — a set of per-source LSN positions that represent "data up to this point has been materialized." A manual DML change is not tracked by any frontier. The next differential refresh would either overwrite the change (if the delta touches the same row) or leave the stream table in a state that doesn't match any consistent point-in-time snapshot of the source data.
- Change buffer desync. The CDC triggers on source tables write changes to buffer tables. The refresh engine reads these buffers and advances the frontier. Manual DML on the stream table bypasses this pipeline entirely — the buffer and frontier have no record of the change, so future refreshes cannot account for it.
If you need to post-process stream table data, create a view or a second stream table that references the first one.
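For example, instead of issuing UPDATEs against a stream table, layer a plain view on top of it (names are illustrative):

```sql
-- order_totals is a stream table; never write to it directly.
CREATE VIEW order_totals_tiered AS
SELECT *,
       CASE WHEN total > 1000 THEN 'vip' ELSE 'standard' END AS tier
FROM order_totals;
```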
Why can't I add foreign keys to or from a stream table?
Foreign key constraints require that referenced/referencing rows exist at the time of each DML statement. The refresh engine violates this assumption:
- Bulk `MERGE` ordering. A differential refresh executes a single `MERGE INTO` statement that applies all deltas (inserts and deletes) atomically. PostgreSQL evaluates FK constraints row-by-row within this `MERGE`. If a parent row is deleted and a new parent inserted in the same delta batch, the child FK check may fail because it sees the delete before the insert — even though the final state would be consistent.
- Full refresh uses `TRUNCATE` + `INSERT`. In FULL mode, the refresh engine truncates the stream table and re-inserts all rows. `TRUNCATE` does not fire individual `DELETE` triggers and bypasses FK cascade logic, which would leave referencing tables with dangling references.
- Cross-table refresh ordering. If stream table A has an FK referencing stream table B, both tables refresh independently (in topological order, but in separate transactions). There is no guarantee that A's refresh sees B's latest data — the FK constraint could transiently fail between refreshes.
Workaround: Enforce referential integrity in the consuming application or use a view that joins the stream tables and validates the relationship.
How do user-defined triggers work on stream tables?
When a DIFFERENTIAL mode stream table has user-defined row-level triggers (or pg_stream.user_triggers = 'on'), the refresh engine uses explicit DML decomposition instead of MERGE:
- Delta materialized once. The delta query result is stored in a temporary table (`__pgs_delta_<id>`) to avoid evaluating it three times.
- DELETE removed rows. Rows in the stream table whose `__pgs_row_id` is absent from the delta are deleted. `AFTER DELETE` triggers fire with correct `OLD` values.
- UPDATE changed rows. Rows whose `__pgs_row_id` exists in both the stream table and delta but whose values differ (checked via `IS DISTINCT FROM`) are updated. `AFTER UPDATE` triggers fire with correct `OLD` and `NEW`. No-op updates (where values are identical) are skipped, preventing spurious triggers.
- INSERT new rows. Rows in the delta whose `__pgs_row_id` is absent from the stream table are inserted. `AFTER INSERT` triggers fire with correct `NEW` values.
FULL refresh behavior: Row-level user triggers are automatically suppressed during FULL refresh via DISABLE TRIGGER USER / ENABLE TRIGGER USER. A NOTIFY pgstream_refresh is emitted so listeners know a FULL refresh occurred. Use REFRESH MODE DIFFERENTIAL for stream tables that need per-row trigger semantics.
Performance: The explicit DML path adds ~25–60% overhead compared to MERGE for triggered stream tables. Stream tables without user triggers have zero overhead (only a fast pg_trigger check, <0.1 ms).
Control: The pg_stream.user_triggers GUC controls this behavior:
- `auto` (default): detect user triggers automatically
- `on`: always use explicit DML (useful for testing)
- `off`: always use MERGE, suppressing triggers
Why can't I ALTER TABLE a stream table directly?
Stream table metadata (defining query, schedule, refresh mode) is stored in the pg_stream catalog (pgstream.pgs_stream_tables). A direct ALTER TABLE would change the physical table without updating the catalog, causing:
- Column mismatch. If you add or remove columns, the refresh engine's cached delta query and `MERGE` statement would reference columns that no longer exist (or miss new ones), causing runtime errors.
- `__pgs_row_id` invalidation. The row ID hash is computed from the defining query's output columns. Altering the table schema without updating the defining query would make existing row IDs inconsistent with the new column set.
Use pgstream.alter_stream_table() to change schedule, refresh mode, or status. To change the defining query or column structure, drop and recreate the stream table.
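Typical usage might look like this (the `status => 'ACTIVE'` form appears elsewhere in this document; the `schedule` parameter name and the `'SUSPENDED'` value are assumptions):

```sql
-- Change the refresh schedule
SELECT pgstream.alter_stream_table('order_totals', schedule => '1m');

-- Pause and resume scheduling
SELECT pgstream.alter_stream_table('order_totals', status => 'SUSPENDED');
SELECT pgstream.alter_stream_table('order_totals', status => 'ACTIVE');
```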
Why can't I TRUNCATE a stream table?
TRUNCATE removes all rows instantly but does not update the pg_stream frontier or change buffers. After a TRUNCATE:
- Differential refresh sees no changes. The frontier still records the last-processed LSN. If no new source changes have occurred, the next differential refresh produces an empty delta — leaving the stream table empty even though the source still has data.
- No recovery path for differential mode. The refresh engine has no way to detect that the stream table was externally truncated. It assumes the current contents match the frontier.
Use pgstream.refresh_stream_table('my_table') to force a full re-materialization, or drop and recreate the stream table if you need a clean slate.
Changelog
All notable changes to pg_stream are documented in this file.
The format is based on Keep a Changelog. For future plans and release milestones, see ROADMAP.md.
[Unreleased]
[0.1.1] — 2026-02-26
Changed
CloudNativePG Image Volume Extension Distribution
- Extension-only OCI image — replaced the full PostgreSQL Docker image (`ghcr.io/<owner>/pg_stream`) with a minimal `scratch`-based extension image (`ghcr.io/<owner>/pg_stream-ext`) following the CNPG Image Volume Extensions specification. The image contains only `.so`, `.control`, and `.sql` files (< 10 MB vs ~400 MB for the old full image).
- New `cnpg/Dockerfile.ext` — release Dockerfile for packaging pre-built artifacts into the scratch-based extension image.
- New `cnpg/Dockerfile.ext-build` — multi-stage from-source build for local development and CI.
- New `cnpg/database-example.yaml` — CNPG `Database` resource for declarative `CREATE EXTENSION pg_stream` (replaces `postInitSQL`).
- Updated `cnpg/cluster-example.yaml` — uses official CNPG PostgreSQL 18 operand image with `.spec.postgresql.extensions` for Image Volume mounting.
- Removed `cnpg/Dockerfile` and `cnpg/Dockerfile.release` — the old full PostgreSQL images are no longer built or published.
- Updated release workflow — publishes multi-arch (amd64/arm64) extension image to GHCR with layout verification and SQL smoke test.
- Updated CI CNPG smoke test — uses transitional composite image approach until `kind` supports Kubernetes 1.33 with the `ImageVolume` feature gate.
[0.1.0] — 2026-02-26
Fixed
WAL Decoder pgoutput Action Parsing (F4 / G2.3)
- Positional action parsing — `parse_pgoutput_action()` previously used `data.contains("INSERT:")` etc., which would misclassify events when a schema name, table name, or column value contained an action keyword (e.g., a table named `INSERT_LOG` or a text value `"DELETE: old row"`). Replaced with positional parsing: strip the `"table "` prefix, skip `schema.table:`, then match the action keyword before the next `:`.
- 3 new unit tests covering the edge cases.
Added
CUBE / ROLLUP Combinatorial Explosion Guard (F14 / G5.2)
- Branch limit guard — `CUBE` on N columns generates $2^N$ `UNION ALL` branches. Large CUBEs would silently produce memory-exhausting query trees. `rewrite_grouping_sets()` now rejects CUBE/ROLLUP combinations that would expand beyond 64 branches, emitting a clear error that directs users to explicit `GROUPING SETS (...)`.
Documentation: Known Delta Computation Limitations (F7 / F11)
- JOIN key change + simultaneous right-side delete — documented in `docs/SQL_REFERENCE.md` § "Known Delta Computation Limitations" with a concrete SQL example, root-cause explanation, and three mitigations (adaptive FULL fallback, staggered changes, FULL mode).
- Keyless table duplicate-row limitation — the "Tables Without Primary Keys" section now includes a `> Limitation` callout explaining that rows with identical content produce the same content hash, causing INSERT deduplication and ambiguous DELETE matching. Recommends adding a surrogate PK or UNIQUE constraint.
- SQL-standard JSON aggregate recognition — `JSON_ARRAYAGG(expr ...)` and `JSON_OBJECTAGG(key: value ...)` are now recognized as first-class DVM aggregates with the group-rescan strategy. Previously treated as opaque raw expressions, they now work correctly in DIFFERENTIAL mode.
- Two new `AggFunc` variants: `JsonObjectAggStd(String)`, `JsonArrayAggStd(String)`. The carried `String` preserves the full deparsed SQL since the special `key: value`, `ABSENT ON NULL`, `ORDER BY`, and `RETURNING` clauses differ from regular function syntax.
- 4 E2E tests covering both FULL and DIFFERENTIAL modes.
JSON_TABLE Support (F12)
- `JSON_TABLE()` in FROM clause — PostgreSQL 17+ `JSON_TABLE(expr, path COLUMNS (...))` is now supported. Deparsed with full syntax including `PASSING` clauses, regular/EXISTS/formatted/nested columns, and `ON ERROR`/`ON EMPTY` behaviors. Modeled internally as `LateralFunction`.
- 2 E2E tests (FULL and DIFFERENTIAL modes).
Operator Volatility Checking (F16)
- Operator volatility checking — custom operators backed by volatile functions are now detected and rejected in DIFFERENTIAL mode. The check queries `pg_operator` → `pg_proc` to resolve operator function volatility. This completes the volatility coverage (G7.2) started with function volatility detection.
- 3 unit tests and 2 E2E tests.
Cross-Session Cache Invalidation (F17)
- Cross-session cache invalidation — a shared-memory atomic counter (`CACHE_GENERATION`) ensures that when one backend alters a stream table, drops a stream table, or triggers a DDL hook, all other backends automatically flush their delta template and MERGE template caches on the next refresh cycle. Previously, cached templates could become stale in multi-backend deployments.
- Thread-local generation tracking in both `dvm/mod.rs` (delta cache) and `refresh.rs` (MERGE cache + prepared statements).
Function/Operator DDL Tracking (F18)
- Function DDL tracking — `CREATE OR REPLACE FUNCTION` and `ALTER FUNCTION` on functions referenced by stream table defining queries now trigger reinit of affected STs. `DROP FUNCTION` also marks affected STs for reinit.
- `functions_used` catalog column — new `TEXT[]` column in `pgstream.pgs_stream_tables` stores all function names used by the defining query (extracted from the parsed OpTree at creation time). DDL hooks query this column to find affected STs.
- 2 E2E tests and 5 unit tests for function name extraction.
View Inlining (G2.1)
- View inlining auto-rewrite — views referenced in defining queries are transparently replaced with their underlying SELECT definition as inline subqueries. CDC triggers land on base tables, so DIFFERENTIAL mode works correctly with views. Nested views (view → view → table) are fully expanded via a fixpoint loop (max depth 10).
- Materialized view rejection — materialized views (`relkind = 'm'`) are rejected with a clear error in DIFFERENTIAL mode. FULL mode allows them.
- Foreign table rejection — foreign tables (`relkind = 'f'`) are rejected in DIFFERENTIAL mode (row-level triggers cannot be created on foreign tables).
- Original query preservation — the user's original SQL (pre-inlining) is stored in `pgstream.pgs_stream_tables.original_query` for reinit after view changes and user introspection.
- View DDL hooks — `CREATE OR REPLACE VIEW` triggers reinit of affected stream tables. `DROP VIEW` sets affected stream tables to ERROR status.
- View dependency tracking — views are registered as soft dependencies in `pgstream.pgs_dependencies` (`source_type = 'VIEW'`) for DDL hook lookups.
- E2E test suite — 16 E2E tests covering basic view inlining, UPDATE/DELETE through views, filtered views, aggregation, joins, nested views, FULL mode, materialized view rejection/allowance, view replacement/drop hooks, TRUNCATE propagation, column renaming, catalog verification, and dependency registration.
SQL Feature Gaps (S1–S15)
- Volatile function detection (S1) — defining queries containing volatile functions (e.g., `random()`, `clock_timestamp()`) are rejected in DIFFERENTIAL mode with a clear error. Stable functions (e.g., `now()`) emit a warning.
- TRUNCATE capture in CDC (S2) — statement-level `AFTER TRUNCATE` trigger writes a `T` marker row to the change buffer. Differential refresh detects the marker and automatically falls back to a full refresh.
- `ALL (subquery)` support (S3) — `x op ALL (subquery)` is rewritten to an AntiJoin via `NOT EXISTS` with a negated condition.
- `DISTINCT ON` auto-rewrite (S4) — `DISTINCT ON (col1, col2)` is transparently rewritten to a `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1` subquery before DVM parsing. Previously rejected.
- 12 regression aggregates (S5) — `CORR`, `COVAR_POP`, `COVAR_SAMP`, `REGR_AVGX`, `REGR_AVGY`, `REGR_COUNT`, `REGR_INTERCEPT`, `REGR_R2`, `REGR_SLOPE`, `REGR_SXX`, `REGR_SXY`, `REGR_SYY` — all use the group-rescan strategy. 39 aggregate function variants total (up from 25).
- Mixed `UNION`/`UNION ALL` (S6) — nested set operations with different `ALL` flags are now parsed correctly.
- Column snapshot + schema fingerprint (S7) — `pgs_dependencies` stores a JSONB column snapshot and SHA-256 fingerprint for each source table. DDL change detection uses a 3-tier fast path: fingerprint → snapshot → legacy `columns_used` fallback.
- `pg_stream.block_source_ddl` GUC (S8) — when `true`, column-affecting DDL on tracked source tables is blocked with an ERROR instead of marking stream tables for reinit.
- `NATURAL JOIN` support (S9) — common columns are resolved at parse time and an explicit equi-join condition is synthesized. Supports INNER, LEFT, RIGHT, and FULL NATURAL JOIN variants. Previously rejected.
- Keyless table support (S10) — source tables without a primary key now work correctly. CDC triggers compute an all-column content hash for row identity. Consistent `__pgs_row_id` between full and delta refreshes.
- `GROUPING SETS`/`CUBE`/`ROLLUP` auto-rewrite (S11) — decomposed at parse time into a `UNION ALL` of separate `GROUP BY` queries. `GROUPING()` calls become integer literals. Previously rejected.
- Scalar subquery in WHERE rewrite (S12) — `WHERE col > (SELECT avg(x) FROM t)` is rewritten to a `CROSS JOIN` with column reference replacement.
- SubLinks in OR rewrite (S13) — `WHERE a OR EXISTS (...)` is decomposed into `UNION` branches, one per OR arm.
- Multi-PARTITION BY window rewrite (S14) — window functions with different `PARTITION BY` clauses are split into separate subqueries joined by a `ROW_NUMBER() OVER ()` row marker.
- Recursive CTE semi-naive + DRed (S15) — DIFFERENTIAL mode for recursive CTEs now uses semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback. Strategy is auto-selected per refresh.
Native Syntax Planning
- Native DDL syntax research — comprehensive analysis of 15 PostgreSQL extension syntax mechanisms for supporting `CREATE STREAM TABLE`-like syntax. See `docs/research/CUSTOM_SQL_SYNTAX.md`.
- Native syntax plan — tiered strategy: Tier 1 (function API, existing), Tier 1.5 (`CALL` procedure wrappers), Tier 2 (`CREATE MATERIALIZED VIEW ... WITH (pgstream.stream = true)` via `ProcessUtility_hook`). See `plans/sql/PLAN_NATIVE_SYNTAX.md`.
Hybrid CDC — Automatic Trigger → WAL Transition
- Hybrid CDC architecture — stream tables now start with lightweight row-level triggers for zero-config setup and can automatically transition to WAL-based (logical replication) capture for lower write-side overhead. The transition is controlled by the `pg_stream.cdc_mode` GUC (`trigger`/`auto`/`wal`).
- WAL decoder background worker — dedicated worker that polls logical replication slots and writes decoded changes into the same change buffer tables used by triggers, ensuring a uniform format for the DVM engine.
- Transition orchestration — transparent three-step process: create replication slot, wait for decoder catch-up, drop trigger. Falls back to triggers automatically if the decoder does not catch up within the timeout.
- CDC health monitoring — new `pgstream.check_cdc_health()` function returns per-source CDC mode, slot lag, confirmed LSN, and alerts.
- CDC transition notifications — `NOTIFY pg_stream_cdc_transition` emits JSON payloads when sources transition between CDC modes.
- New GUCs — `pg_stream.cdc_mode` and `pg_stream.wal_transition_timeout`.
- Catalog extension — `pgs_dependencies` table gains `cdc_mode`, `slot_name`, `decoder_confirmed_lsn`, and `transition_started_at` columns.
User-Defined Triggers on Stream Tables
- User trigger support in DIFFERENTIAL mode — user-created `AFTER` triggers on stream tables now fire correctly during differential refresh via explicit per-row DML (INSERT/UPDATE/DELETE) instead of bulk MERGE.
- FULL refresh trigger handling — user triggers are suppressed during FULL refresh with `DISABLE TRIGGER USER` and a `NOTIFY pgstream_refresh` is emitted so listeners know when to re-query.
- Trigger detection — `has_user_triggers()` automatically detects user-defined triggers on storage tables at refresh time.
- DDL warning — `CREATE TRIGGER` on a stream table emits a notice explaining the trigger semantics and the `pg_stream.user_triggers` GUC.
- New GUC — `pg_stream.user_triggers` (`auto`/`on`/`off`) controls whether the explicit DML path is used.
Changed
- Monitoring layer — `slot_health()` now covers WAL-mode sources. Architecture diagrams and documentation updated to reflect the hybrid CDC model.
- Stream table restrictions — user triggers on stream tables upgraded from "⚠️ Unsupported" to "✅ Supported (DIFFERENTIAL mode)".
Core Engine
- Declarative stream tables — define a SQL query and a schedule; the extension handles automatic refresh.
- Differential View Maintenance (DVM) — incremental delta computation derived automatically from the defining query's operator tree.
- Trigger-based CDC — lightweight `AFTER` row-level triggers capture changes into per-source buffer tables. No `wal_level = logical` required.
- DAG-aware scheduling — stream tables that depend on other stream tables are refreshed in topological order with cycle detection.
- Background scheduler — canonical-period scheduling (48·2ⁿ seconds) with cron expression support.
- Crash-safe refresh — advisory locks prevent concurrent refreshes; crash recovery marks in-flight refreshes as failed.
SQL Support
- Full operator coverage — table scans, projections, WHERE/HAVING filters, INNER/LEFT/RIGHT/FULL OUTER joins, nested multi-table joins, GROUP BY with 25 aggregate functions, DISTINCT, UNION ALL, UNION, INTERSECT, EXCEPT.
- Subquery support — subqueries in FROM, EXISTS/NOT EXISTS, IN/NOT IN (subquery), scalar subqueries in SELECT.
- CTE support — non-recursive CTEs (inline and shared delta), recursive CTEs (`WITH RECURSIVE`) in both FULL and DIFFERENTIAL modes.
- Recursive CTE incremental maintenance — DIFFERENTIAL mode now uses semi-naive evaluation for INSERT-only changes, Delete-and-Rederive (DRed) for mixed changes, and recomputation fallback when CTE columns don't match ST storage. Strategy is auto-selected per refresh.
- DISTINCT ON auto-rewrite — transparently rewritten to ROW_NUMBER() window subquery before DVM parsing.
- GROUPING SETS / CUBE / ROLLUP auto-rewrite — decomposed into UNION ALL of separate GROUP BY queries at parse time.
- NATURAL JOIN support — common columns resolved at parse time with explicit equi-join synthesis.
- ALL (subquery) support — rewritten to AntiJoin via NOT EXISTS.
- Scalar subquery in WHERE — rewritten to CROSS JOIN.
- SubLinks in OR — decomposed into UNION branches.
- Multi-PARTITION BY windows — split into joined subqueries.
- Regression aggregates — CORR, COVAR_POP, COVAR_SAMP, REGR_* (12 new).
- JSON_ARRAYAGG / JSON_OBJECTAGG — SQL-standard JSON aggregates recognized as first-class DVM aggregates in DIFFERENTIAL mode.
- JSON_TABLE — PostgreSQL 17+ JSON_TABLE() in FROM clause.
- Keyless table support — tables without primary keys use content hashing.
- Volatile function and operator detection — rejected in DIFFERENTIAL, warned for stable. Custom operators backed by volatile functions are also detected.
- TRUNCATE capture in CDC — triggers fall back to full refresh.
- Window functions — ROW_NUMBER, RANK, SUM OVER, etc. with full frame clause support (ROWS, RANGE, GROUPS, BETWEEN, EXCLUDE) and named WINDOW clauses.
- LATERAL SRFs — `jsonb_array_elements`, `unnest`, `jsonb_each`, etc. via row-scoped recomputation.
- LATERAL subqueries — explicit `LATERAL (SELECT ...)` in FROM with correlated references.
- Expression support — CASE WHEN, COALESCE, NULLIF, GREATEST, LEAST, IN (list), BETWEEN, IS DISTINCT FROM, IS TRUE/FALSE/UNKNOWN, SIMILAR TO, ANY/ALL (array), ARRAY/ROW constructors, array subscript, field access.
- Ordered-set aggregates — MODE, PERCENTILE_CONT, PERCENTILE_DISC with WITHIN GROUP (ORDER BY).
Monitoring & Observability
- Refresh statistics — `st_refresh_stats()`, `get_refresh_history()`, `get_staleness()`.
- Slot health — `slot_health()` checks replication slot state and WAL retention.
- DVM plan inspection — `explain_st()` describes the operator tree.
- Monitoring views — `pgstream.stream_tables_info` and `pgstream.pg_stat_stream_tables`.
- NOTIFY alerting — `pg_stream_alert` channel broadcasts stale, suspended, reinitialize, slot lag, and refresh completed/failed events.
Infrastructure
- Row ID hashing — pg_stream_hash() and pg_stream_hash_multi() using xxHash (xxh64) for deterministic row identity.
- DDL event tracking — ALTER TABLE and DROP TABLE on source tables automatically set needs_reinit on affected stream tables. CREATE OR REPLACE FUNCTION / ALTER FUNCTION / DROP FUNCTION on functions used by defining queries also triggers reinit.
- Cross-session cache coherence — a shared-memory CACHE_GENERATION atomic counter ensures all backends flush delta/MERGE template caches when DDL changes occur.
- Version / frontier tracking — per-source JSONB frontier for consistent snapshots and the Delayed View Semantics (DVS) guarantee.
- 12 GUC variables — enabled, scheduler_interval_ms, min_schedule_seconds, max_consecutive_errors, change_buffer_schema, max_concurrent_refreshes, differential_max_change_ratio, cleanup_use_truncate, user_triggers, cdc_mode, wal_transition_timeout, block_source_ddl.
Documentation
- Architecture guide, SQL reference, configuration reference, FAQ, getting-started tutorial, DVM operators reference, benchmark guide.
- Deep-dive tutorials: What Happens on INSERT / UPDATE / DELETE / TRUNCATE.
Testing
- ~1,138 unit tests, 22 E2E test suites (Testcontainers + custom Docker image).
- Property-based tests, integration tests, resilience tests.
- Column snapshot and schema fingerprint-based DDL change detection.
Known Limitations
- TABLESAMPLE, LIMIT/OFFSET, FOR UPDATE/FOR SHARE — rejected with clear error messages.
- Window functions inside expressions (CASE, COALESCE, arithmetic) — rejected.
- Circular stream table dependencies (cycles) — not yet supported.
pg_stream — Project Roadmap
Last updated: 2026-02-26 · Current version: 0.1.1
For a concise description of what pg_stream is and why it exists, read
ESSENCE.md — it explains the core problem (full REFRESH MATERIALIZED VIEW recomputation), how the differential dataflow approach
solves it, the hybrid trigger→WAL CDC architecture, and the broad SQL
coverage, all in plain language.
Overview
pg_stream is a PostgreSQL 18 extension that implements streaming tables with incremental view maintenance (IVM) via differential dataflow. All 13 design phases are complete. This roadmap tracks the path from pre-release to 1.0 and beyond.
We are here
│
▼
┌─────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ 0.1.0 │──▶│ 0.2.0 │──▶│ 0.3.0 │──▶│ 0.4.0 │──▶│ 1.0.0 │──▶│ 1.x+ │
│ Pre- │ │ Correct-│ │ Prod- │ │ Observ- │ │ Stable │ │ Scale & │
│ release │ │ ness │ │ ready │ │ ability │ │ Release │ │ Ecosystem│
└─────────┘ └─────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
v0.1.0 — Released (2026-02-26)
Status: Released — all 13 design phases implemented.
Core engine, DVM with 21 OpTree operators, trigger-based CDC, DAG-aware scheduling, monitoring, dbt macro package, and 1,300+ tests.
See CHANGELOG.md for the full feature list.
Late additions (pre-March 1st)
Low-risk, high-value items pulled forward from v0.2.0:
| Item | Description | Effort | Ref |
|---|---|---|---|
| F4 | WAL decoder: pgoutput message parsing edge cases | 2–3h | SQL_GAPS_7.md G3.3 |
| F7 | Document JOIN key column change limitations | 1–2h | SQL_GAPS_7.md G4.2 |
| F11 | Keyless table duplicate-rows: document known behavior | 1h | SQL_GAPS_7.md G7.1 |
| F14 | CUBE explosion guard (reject oversized CUBE grouping sets) | 1h | SQL_GAPS_7.md G5.2 |
v0.2.0 — Correctness & Stability
Goal: Close all critical and high-priority gaps to reach a provably correct baseline. No new features — only fixes, verification, and test coverage.
Tier 0 — Critical (must-fix)
| Item | Description | Effort | Ref |
|---|---|---|---|
| F1 | DELETE+INSERT merge strategy double-evaluation guard | 3–4h | SQL_GAPS_7.md G1.1 (P0) |
| F2 | WAL decoder: keyless-table pk_hash computation | 4–6h | SQL_GAPS_7.md G3.1 · PLAN_HYBRID_CDC.md |
| F3 | WAL decoder: old_* column population for UPDATEs | 4–6h | SQL_GAPS_7.md G3.2 · PLAN_HYBRID_CDC.md |
| F5 | JOIN key column change detection in delta SQL | 3–4h | SQL_GAPS_7.md G4.1 (P1) |
| F6 | ALTER TYPE / ALTER POLICY DDL tracking | 3–5h | SQL_GAPS_7.md G9.1 (P1) |
Subtotal: 17–25 hours (F4 and F7 moved to v0.1.0)
Tier 1 — Verification
| Item | Description | Effort | Ref |
|---|---|---|---|
| F8–F10, F12 | Window partition key E2E, recursive CTE monotonicity audit, ALTER DOMAIN tracking, PgBouncer compatibility docs | 16–23h | SQL_GAPS_7.md G5–G9 |
F11 (keyless table duplicate-rows documentation) moved to v0.1.0.
Tier 2 — Robustness
| Item | Description | Effort | Ref |
|---|---|---|---|
| F13, F15–F16 | LIMIT-in-subquery warning, RANGE_AGG recognition, read replica detection, SPI error classification | 6–8h | SQL_GAPS_7.md G2, G5, G6, G8 |
F14 (CUBE explosion guard) moved to v0.1.0.
Tier 3 — Test coverage
| Item | Description | Effort | Ref |
|---|---|---|---|
| F17–F26 | 21 aggregate differential E2E, FULL JOIN E2E, INTERSECT/EXCEPT pairs, GUC variation tests, CI combined coverage | 29–38h | SQL_GAPS_7.md G7 · STATUS_TESTING.md |
v0.2.0 total: ~68–94 hours (F4, F7, F11, F14 moved to v0.1.0)
Exit criteria:
- Zero P0 gaps
- All P1 gaps resolved or documented as known limitations
- E2E test count ≥ 400 with 0 pre-existing failures
- Combined coverage ≥ 75%
v0.3.0 — Production Readiness
Goal: Operational polish, parallel refresh, and production-grade WAL-based CDC. The extension is suitable for production use after this milestone.
Performance & Parallelism
| Item | Description | Effort | Ref |
|---|---|---|---|
| P1 | Verify PostgreSQL parallel query for delta SQL | 0h | REPORT_PARALLELIZATION.md §E |
| P2 | DAG level extraction (topological_levels()) | 2–4h | REPORT_PARALLELIZATION.md §B |
| P3 | Dynamic background worker dispatch per level | 12–16h | REPORT_PARALLELIZATION.md §A+B |
Operational
| Item | Description | Effort | Ref |
|---|---|---|---|
| O1 | Extension upgrade migrations (ALTER EXTENSION UPDATE) | 4–6h | SQL_GAPS_7.md G8.2 · PLAN_UPGRADE_MIGRATIONS.md |
| O2 | Prepared statement cleanup on cache invalidation | 3–4h | SQL_GAPS_7.md G8.3 |
| O3 | Adaptive fallback threshold exposure via monitoring | 2–3h | SQL_GAPS_7.md G8.4 |
| O4 | SPI SQLSTATE error classification for retry | 3–4h | SQL_GAPS_7.md G8.6 |
| O5 | Slot lag alerting thresholds (configurable) | 2–3h | SQL_GAPS_7.md G10 |
WAL CDC Hardening
| Item | Description | Effort | Ref |
|---|---|---|---|
| W1 | WAL decoder fixes (F2–F4 prerequisite from v0.2.0) | Done in v0.2.0 | PLAN_HYBRID_CDC.md |
| W2 | WAL mode E2E test suite (parallel to trigger suite) | 8–12h | PLAN_HYBRID_CDC.md |
| W3 | WAL→trigger automatic fallback hardening | 4–6h | PLAN_HYBRID_CDC.md |
| W4 | Promote pg_stream.cdc_mode = 'auto' to recommended | Documentation | PLAN_HYBRID_CDC.md |
v0.3.0 total: ~40–58 hours (excluding v0.2.0 prerequisites)
Exit criteria:
- max_concurrent_refreshes drives real parallel refresh
- WAL CDC mode passes full E2E suite
- Extension upgrade path tested (0.1.0 → 0.3.0)
- Zero P0/P1 gaps remaining
v0.4.0 — Observability & Integration
Goal: Prometheus/Grafana observability, dbt-pgstream formal release, complete documentation review, and validated upgrade path. After this milestone the product is externally visible and monitored.
Observability
| Item | Description | Effort | Ref |
|---|---|---|---|
| M1 | Prometheus exporter configuration guide | 4–6h | PLAN_ECO_SYSTEM.md §1 |
| M2 | Grafana dashboard (refresh latency, staleness, CDC lag) | 4–6h | PLAN_ECO_SYSTEM.md §1 |
Integration & Release prep
| Item | Description | Effort | Ref |
|---|---|---|---|
| R5 | dbt-pgstream 0.1.0 formal release (PyPI) | 2–3h | dbt-pgstream/ · PLAN_DBT_MACRO.md |
| R6 | Complete documentation review & polish | 4–6h | docs/ |
| O1 | Extension upgrade migrations (ALTER EXTENSION UPDATE) | 4–6h | SQL_GAPS_7.md G8.2 · PLAN_UPGRADE_MIGRATIONS.md |
v0.4.0 total: ~18–27 hours
Exit criteria:
- Grafana dashboard published
- dbt-pgstream 0.1.0 on PyPI
- ALTER EXTENSION pg_stream UPDATE tested (0.3.0 → 0.4.0)
- All public documentation current and reviewed
v1.0.0 — Stable Release
Goal: First officially supported release. Semantic versioning locks in. API, catalog schema, and GUC names are considered stable. Focus is distribution — getting pg_stream onto package registries.
Release engineering
| Item | Description | Effort | Ref |
|---|---|---|---|
| R1 | Semantic versioning policy + compatibility guarantees | 2–3h | PLAN_VERSIONING.md |
| R2 | PGXN / apt / rpm packaging | 8–12h | PLAN_PACKAGING.md |
| R3 | — | ✅ Done | PLAN_CLOUDNATIVEPG.md |
| R4 | CNPG operator hardening (K8s 1.33+ native ImageVolume) | 4–6h | PLAN_CLOUDNATIVEPG.md |
v1.0.0 total: ~18–27 hours
Exit criteria:
- Published on PGXN and Docker Hub
- CNPG extension image published to GHCR (pg_stream-ext)
- CNPG cluster-example.yaml validated (Image Volume approach)
- Upgrade path from v0.4.0 tested
- Semantic versioning policy in effect
Post-1.0 — Scale & Ecosystem
These are not gated on 1.0 but represent the longer-term horizon.
Ecosystem expansion
| Item | Description | Effort | Ref |
|---|---|---|---|
| E1 | dbt full adapter (dbt-pgstream extending dbt-postgres) | 20–30h | PLAN_DBT_ADAPTER.md |
| E2 | Airflow provider (apache-airflow-providers-pgstream) | 16–20h | PLAN_ECO_SYSTEM.md §4 |
| E3 | CLI tool (pgstream) for management outside SQL | 16–20h | PLAN_ECO_SYSTEM.md §4 |
| E4 | Flyway / Liquibase migration support | 8–12h | PLAN_ECO_SYSTEM.md §5 |
| E5 | ORM integrations guide (SQLAlchemy, Django, etc.) | 8–12h | PLAN_ECO_SYSTEM.md §5 |
Scale
| Item | Description | Effort | Ref |
|---|---|---|---|
| S1 | External orchestrator sidecar for 100+ STs | 20–40h | REPORT_PARALLELIZATION.md §D |
| S2 | Citus / distributed PostgreSQL compatibility | ~6 months | plans/infra/CITUS.md |
| S3 | Multi-database support (beyond postgres DB) | TBD | PLAN_MULTI_DATABASE.md |
Advanced SQL
| Item | Description | Effort | Ref |
|---|---|---|---|
| A1 | Circular dependency support (SCC fixpoint iteration) | ~40h | CIRCULAR_REFERENCES.md |
| A2 | Streaming aggregation (sub-second latency path) | TBD | PLAN_STREAMING_AGGREGATION.md |
| A3 | PostgreSQL 19 forward-compatibility | TBD | PLAN_PG19_COMPAT.md |
Effort Summary
| Milestone | Effort estimate | Cumulative |
|---|---|---|
| v0.2.0 — Correctness | 68–94h | 68–94h |
| v0.3.0 — Production ready | 40–58h | 108–152h |
| v0.4.0 — Observability & Integration | 18–27h | 126–179h |
| v1.0.0 — Stable release | 18–27h | 144–206h |
| Post-1.0 (ecosystem) | 88–134h | 232–340h |
| Post-1.0 (scale) | 6+ months | — |
References
| Document | Purpose |
|---|---|
| CHANGELOG.md | What's been built |
| plans/PLAN.md | Original 13-phase design plan |
| plans/sql/SQL_GAPS_7.md | 53 known gaps, prioritized |
| plans/performance/REPORT_PARALLELIZATION.md | Parallelization options analysis |
| plans/performance/STATUS_PERFORMANCE.md | Benchmark results |
| plans/ecosystem/PLAN_ECO_SYSTEM.md | Ecosystem project catalog |
| plans/dbt/PLAN_DBT_ADAPTER.md | Full dbt adapter plan |
| plans/infra/CITUS.md | Citus compatibility plan |
| plans/infra/PLAN_VERSIONING.md | Versioning & compatibility policy |
| plans/infra/PLAN_PACKAGING.md | PGXN / deb / rpm packaging |
| plans/infra/PLAN_DOCKER_IMAGE.md | Official Docker image (superseded by CNPG extension image) |
| plans/ecosystem/PLAN_CLOUDNATIVEPG.md | CNPG Image Volume extension image |
| plans/infra/PLAN_MULTI_DATABASE.md | Multi-database support |
| plans/infra/PLAN_PG19_COMPAT.md | PostgreSQL 19 forward-compatibility |
| plans/sql/PLAN_UPGRADE_MIGRATIONS.md | Extension upgrade migrations |
| plans/sql/PLAN_STREAMING_AGGREGATION.md | Sub-second streaming aggregation |
| plans/adrs/PLAN_ADRS.md | Architectural decisions |
| docs/ARCHITECTURE.md | System architecture |
pg_stream vs. DBSP: Similarities and Differences
What They Share (Conceptual Foundation)
pg_stream explicitly cites DBSP as its theoretical foundation (see PRIOR_ART.md). The key overlap:
| Concept | DBSP (paper) | pg_stream (implementation) |
|---|---|---|
| Z-set / delta model | Rows annotated with weights (+1/−1) in an abelian group | __pgs_action = 'I'/'D' column on every delta row — effectively Z-sets restricted to {+1, −1} |
| Per-operator differentiation | Recursive Algorithm 4.6: Q^Δ = D ∘ Q ∘ I, decomposed per-operator via the chain rule (Q₁ ∘ Q₂)^Δ = Q₁^Δ ∘ Q₂^Δ | DiffContext::diff_node() walks the OpTree and calls per-operator differentiators (scan, filter, project, join, aggregate, distinct, union, etc.) — same recursive structural decomposition |
| Linear operators are self-incremental | Theorem 3.3: for LTI operator Q, Q^Δ = Q | Filter and Project pass deltas through unchanged (just apply predicate/projection to the delta stream) |
| Bilinear join rule | Theorem 3.4: Δ(a × b) = Δa × Δb + a × Δb + Δa × b | diff_inner_join generates exactly 3 UNION ALL parts: (delta_left ⋈ current_right), (current_left ⋈ delta_right), and optionally (delta_left ⋈ delta_right) |
| Aggregate auxiliary counters | §4.2: counting algorithm for maintaining aggregates with deletions | __pgs_count auxiliary column, LEFT JOIN back to stream table to read old counts and compute new counts |
| Recursive queries | §6: fixed-point iteration with z⁻¹ delay operator, semi-naive evaluation | diff_recursive_cte uses recomputation-diff (DRed-style), not DBSP's native fixed-point circuit |
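The bilinear join rule in the table above can be sanity-checked on toy Z-sets. The Python sketch below is illustrative only — pg_stream generates SQL, not Python, and the helper names are invented here. It represents a Z-set as a dict from row to integer weight and verifies that the three-part delta equals full recomputation minus the old result:

```python
# Toy Z-sets: dict mapping row -> integer weight (+1 insert, -1 delete).
# Hypothetical helpers for illustration; not pg_stream code.

def zadd(a, b):
    """Z-set addition: sum weights per row, drop zero-weight rows."""
    out = dict(a)
    for row, w in b.items():
        out[row] = out.get(row, 0) + w
        if out[row] == 0:
            del out[row]
    return out

def zjoin(a, b):
    """Equi-join on the first field; weights multiply (Z-set product)."""
    out = {}
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                row = (ka, va, vb)
                out[row] = out.get(row, 0) + wa * wb
                if out[row] == 0:
                    del out[row]
    return out

# Current state of both sides and their deltas.
a  = {(1, 'x'): 1}
da = {(2, 'y'): 1}                      # one insert on the left
b  = {(1, 'p'): 1, (2, 'q'): 1}
db = {(1, 'p'): -1}                     # one delete on the right

# Bilinear rule: delta(a >< b) = (da >< b) + (a >< db) + (da >< db)
delta = zadd(zadd(zjoin(da, b), zjoin(a, db)), zjoin(da, db))

# Check against full recomputation: Q(a+da, b+db) - Q(a, b)
full_new = zjoin(zadd(a, da), zadd(b, db))
full_old = zjoin(a, b)
expected = zadd(full_new, {r: -w for r, w in full_old.items()})
assert delta == expected
```

This is exactly the identity that diff_inner_join compiles into three UNION ALL branches of SQL.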
Key Differences
1. Execution model — standalone engine vs. embedded in PostgreSQL
DBSP is a standalone streaming runtime (Rust library, now Feldera). It compiles query plans into dataflow graphs that maintain in-memory state and process continuous micro-batches. Operators are long-lived stateful actors with their own memory.
pg_stream is an extension inside PostgreSQL. It has no persistent dataflow graph. On each refresh, it generates a single SQL query (CTE chain) that PostgreSQL's own planner/executor evaluates. After execution, no operator state persists — auxiliary state lives in the stream table itself (__pgs_count columns) and change buffer tables.
2. Streams vs. periodic batches
DBSP operates on true infinite streams indexed by logical time t ∈ ℕ. Each "step" processes one micro-batch of changes, and operators carry integration state (I operator = running sum from t=0).
pg_stream operates in discrete refresh cycles triggered by a lag-based scheduler. There is no integration operator — the "current state" is just the stream table's contents, and changes are consumed from CDC buffer tables between LSN boundaries. Each refresh is a self-contained transaction.
3. Z-set weights vs. binary actions
DBSP uses integer weights in ℤ — rows can have weights > 1 (bags) or < −1 (multiple deletions). This enables correct multiset semantics and composable group algebra.
pg_stream uses binary actions ('I' insert, 'D' delete, sometimes 'U' update). It doesn't maintain true Z-set weights. For aggregates, the __pgs_count auxiliary column serves a similar purpose but is specific to the aggregate operator — it's not a general weight propagated through the operator tree.
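As a toy illustration of the difference (again Python standing in for the actual mechanisms): an integer Z-set weight carries multiplicity in a single entry, while a binary action stream must repeat rows; collapsing the action stream recovers the weights:

```python
# Z-set (DBSP): one entry can carry multiplicity.
zset_delta = {('order', 42): 2}          # row inserted twice -> weight 2

# Binary actions (pg_stream-style, analogous to __pgs_action rows):
# multiplicity is spelled out as repeated 'I'/'D' delta rows.
action_delta = [('I', ('order', 42)), ('I', ('order', 42))]

# Collapsing the action stream recovers the Z-set weights.
collapsed = {}
for action, row in action_delta:
    collapsed[row] = collapsed.get(row, 0) + (1 if action == 'I' else -1)
assert collapsed == zset_delta
```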
4. Integration operator (I)
DBSP: The integration operator I(s)[t] = Σᵢ≤ₜ s[i] is an explicit first-class circuit element. It maintains running sums of changes and is the key mechanism for computing incremental joins (z⁻¹(I(a)) = "accumulated left side up to previous step").
pg_stream: No explicit integration. The equivalent of I is just "read the current contents of the source/stream table." Join differentiation directly reads the current snapshot of the non-delta side (build_snapshot_sql() generates FROM "public"."orders" r), which implicitly includes all historical changes.
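A minimal sketch of what DBSP's I operator computes — and what pg_stream replaces with "read the current table contents" — as a running sum over delta Z-sets:

```python
# DBSP's integration operator I: a running sum of all deltas up to time t.
# Illustrative only; pg_stream keeps no such operator state.

def integrate(deltas):
    """I(s)[t] = sum of s[0..t]; yields the accumulated Z-set at each step."""
    state = {}
    for delta in deltas:
        for row, w in delta.items():
            state[row] = state.get(row, 0) + w
            if state[row] == 0:
                del state[row]
        yield dict(state)

stream = [{('a',): 1}, {('b',): 1}, {('a',): -1}]
snapshots = list(integrate(stream))
assert snapshots[-1] == {('b',): 1}   # 'a' inserted then deleted; 'b' remains
```

In pg_stream, the final snapshot is simply the stream table's rows, so the join differentiator reads the table directly instead of integrating a change history.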
5. Recursion
DBSP: Native fixed-point circuits with z⁻¹ delay. Can incrementally maintain recursive queries (e.g., transitive closure) by iterating only on new changes within each step — semi-naive evaluation generalized to arbitrary recursion.
pg_stream: Uses recomputation-diff for recursive CTEs — re-executes the full recursive query and anti-joins against current storage to compute the delta. This is correct but not truly incremental for the recursive part.
6. Correctness guarantees
DBSP: Proven correct in Lean. All theorems are machine-checked. The chain rule, cycle rule, and bilinear decomposition are formally verified.
pg_stream: Verified empirically via property-based tests (the assert_invariant checks that Contents(ST) = Q(DB) after each mutation cycle). No formal proof, but the per-operator rules are direct translations of DBSP's rules.
7. Scope
DBSP: A general-purpose theory and streaming engine. Handles nested relations, streaming aggregation over windows, arbitrary compositions. The Feldera implementation supports a full SQL frontend.
pg_stream: Focused on materialized views inside PostgreSQL. Supports a specific subset of SQL (scan, filter, project, inner/left/full join, aggregates, DISTINCT, UNION ALL, INTERSECT, EXCEPT, CTEs, window functions, lateral joins). It is not a general streaming engine — it leverages PostgreSQL's own query planner and executor.
Summary
pg_stream applies DBSP's differentiation rules to generate delta queries, but it is not a DBSP implementation. It borrows the mathematical framework (per-operator differentiation, Z-set-like deltas, bilinear join decomposition) while making fundamentally different architectural choices: embedded in PostgreSQL, no persistent dataflow state, periodic batch execution, and PostgreSQL's planner as the optimizer. Think of it as "DBSP's differentiation algebra, compiled down to SQL CTEs and executed by PostgreSQL."
Prior Art
This document lists the academic papers, PostgreSQL commits, open-source tools,
and standard algorithms whose techniques are reused in pg_stream.
Maintaining this record serves two purposes:
- Attribution — credit the research and engineering work this project builds upon.
- Independent derivation — demonstrate that every core technique predates and is independent of any single vendor's commercial product.
Differential View Maintenance (DVM)
DBSP — Automatic Incremental View Maintenance
Budiu, M., Ryzhyk, L., McSherry, F., & Tannen, V. (2023). "DBSP: Automatic Incremental View Maintenance for Rich Query Languages." Proceedings of the VLDB Endowment (PVLDB), 16(7), 1601–1614. https://arxiv.org/abs/2203.16684
The Z-set abstraction (rows annotated with +1/−1 multiplicity) is the
theoretical foundation for the __pgs_action column produced by the delta
operators in src/dvm/operators/. The per-operator differentiation rules
(scan, filter, project, join, aggregate, union) are direct applications of
the DBSP lifting operator (D) described in this paper.
See DBSP_COMPARISON.md for a detailed comparison of pg_stream's architecture with the DBSP model.
Gupta & Mumick — Materialized Views Survey
Gupta, A. & Mumick, I.S. (1995). "Maintenance of Materialized Views: Problems, Techniques, and Applications." IEEE Data Engineering Bulletin, 18(2), 3–18.
Gupta, A. & Mumick, I.S. (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press. ISBN 978-0-262-57122-7.
The per-operator differentiation rules in src/dvm/operators/ follow the
derivation given in section 3 of the 1995 survey. The counting algorithm
for maintaining aggregates with deletions uses the approach described in
the MIT Press book.
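A minimal sketch of the counting idea, with hypothetical names (pg_stream stores the counter in its __pgs_count auxiliary column): track a (sum, count) pair per group so that a deletion can distinguish "update the group" from "drop the group entirely":

```python
# Counting algorithm (Gupta & Mumick): an auxiliary per-group counter
# lets deletions decide whether a group should survive. Illustrative sketch.

def apply_deltas(groups, deltas):
    """groups: key -> (sum, count); deltas: (key, value, +1 or -1)."""
    for key, value, w in deltas:
        s, c = groups.get(key, (0, 0))
        s, c = s + w * value, c + w
        if c == 0:
            groups.pop(key, None)   # last contributing row removed
        else:
            groups[key] = (s, c)
    return groups

g = apply_deltas({}, [('us', 10, +1), ('us', 5, +1), ('eu', 7, +1)])
g = apply_deltas(g, [('eu', 7, -1)])   # delete the only 'eu' row
assert g == {'us': (15, 2)}            # the 'eu' group correctly disappears
```

Without the counter, a SUM of 0 would be indistinguishable from an empty group.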
DBToaster — Higher-order Delta Processing
Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Olteanu, D., & Zavodny, J. (2014). "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views." The VLDB Journal, 23(2), 253–278. https://doi.org/10.1007/s00778-013-0348-4
Inspiration for the recursive delta compilation strategy where the delta of a complex query is itself a query that can be differentiated.
DRed — Deletion and Re-derivation
Gupta, A., Mumick, I.S., & Subrahmanian, V.S. (1993). "Maintaining Views Incrementally." Proceedings of the 1993 ACM SIGMOD International Conference, 157–166.
The DRed algorithm for handling deletions in recursive views is the basis for
the recursive CTE differential refresh strategy in src/dvm/operators/recursive_cte.rs.
Scheduling
Earliest-Deadline-First (EDF)
Liu, C.L. & Layland, J.W. (1973). "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment." Journal of the ACM, 20(1), 46–61. https://doi.org/10.1145/321738.321743
The schedule-based scheduling in src/scheduler.rs applies the classic
EDF principle: the stream table whose freshness deadline expires soonest is
refreshed first. EDF is optimal for uniprocessor preemptive scheduling and is
a standard technique in operating systems and real-time databases.
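A toy illustration of the EDF selection rule (hypothetical field names; not the actual src/scheduler.rs logic, which is Rust):

```python
# EDF applied to stream-table refresh: the table whose freshness deadline
# (last refresh + configured interval) is earliest gets refreshed first.

def next_to_refresh(tables, now):
    due = [t for t in tables if now >= t['last_refresh'] + t['interval']]
    if not due:
        return None   # nothing has missed its freshness deadline yet
    return min(due, key=lambda t: t['last_refresh'] + t['interval'])

tables = [
    {'name': 'active_orders', 'last_refresh': 100, 'interval': 30},
    {'name': 'daily_revenue', 'last_refresh': 90,  'interval': 60},
]
# At t=155 both are overdue; active_orders' deadline (130) precedes 150.
assert next_to_refresh(tables, 155)['name'] == 'active_orders'
```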
Topological Sort — Kahn's Algorithm
Kahn, A.B. (1962). "Topological sorting of large networks." Communications of the ACM, 5(11), 558–562. https://doi.org/10.1145/368996.369025
The dependency DAG in src/dag.rs uses Kahn's algorithm for topological
ordering and cycle detection. This is standard computer science curriculum
and appears in every major algorithms textbook (Cormen et al., Sedgewick,
Kleinberg & Tardos).
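For reference, Kahn's algorithm in a few lines of Python (illustrative; src/dag.rs is Rust):

```python
from collections import deque

# Kahn's algorithm over a stream-table dependency DAG: a valid refresh
# order is a topological order; leftover nodes indicate a cycle.

def topo_sort(nodes, edges):
    """edges: (upstream, downstream) pairs. Returns an order or raises on a cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for up, down in edges:
        indegree[down] += 1
        children[up].append(down)
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for child in children[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("cycle detected among stream tables")
    return order

order = topo_sort(['orders', 'active_orders', 'daily_stats'],
                  [('orders', 'active_orders'), ('active_orders', 'daily_stats')])
assert order == ['orders', 'active_orders', 'daily_stats']
```

The cycle-detection branch corresponds to the "circular stream table dependencies not yet supported" limitation.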
Change Data Capture (CDC)
PostgreSQL Row-Level Triggers
Row-level AFTER INSERT/UPDATE/DELETE triggers have been available in
PostgreSQL since version 6.x (late 1990s). The trigger-based change capture
pattern used in src/cdc.rs is a well-established PostgreSQL technique:
- PostgreSQL documentation: CREATE TRIGGER — trigger-based CDC has been a standard pattern for decades.
- PostgreSQL wiki: "Trigger-based Change Data Capture in PostgreSQL."
Debezium
Debezium project (Red Hat, open source since 2016). https://debezium.io/
Debezium implements trigger-based and WAL-based CDC for PostgreSQL and other
databases. The change buffer table pattern (pg_stream_changes.changes_<oid>)
follows a similar approach, modified for single-process consumption within
the PostgreSQL backend.
pgaudit
pgaudit extension (2015). https://github.com/pgaudit/pgaudit
Captures DML via AFTER row-level triggers for audit logging, demonstrating
the same trigger-based change-capture technique in production since 2015.
Materialized View Refresh
PostgreSQL REFRESH MATERIALIZED VIEW CONCURRENTLY
PostgreSQL 9.4 (December 2014, commit 96ef3b8). Implemented in src/backend/commands/matview.c.
The snapshot-diff strategy used for recomputation-diff refreshes (where the
full query is re-executed and anti-joined against current storage to compute
inserts and deletes) mirrors the algorithm implemented in PostgreSQL's
REFRESH MATERIALIZED VIEW CONCURRENTLY. This PostgreSQL feature predates
all relevant patents and is publicly documented.
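The snapshot-diff idea reduces to two anti-joins, sketched here over Python sets standing in for table contents (illustrative; the real work happens in SQL):

```python
# Snapshot-diff (REFRESH MATERIALIZED VIEW CONCURRENTLY style):
# re-run the full query, then diff against current storage so only
# the minimal insert/delete set is applied to the stored table.

def snapshot_diff(current, recomputed):
    inserts = recomputed - current   # rows the query now produces but storage lacks
    deletes = current - recomputed   # stored rows the query no longer produces
    return inserts, deletes

current    = {(1, 'active'), (2, 'active')}
recomputed = {(2, 'active'), (3, 'active')}
ins, dels = snapshot_diff(current, recomputed)
assert ins == {(3, 'active')} and dels == {(1, 'active')}
```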
SQL MERGE Statement
ISO/IEC 9075:2003 (SQL:2003 standard) — the MERGE statement. Added in PostgreSQL 15 (October 2022, commit 7103eba).
The MERGE-based delta application in src/refresh.rs uses the
ISO-standard MERGE statement, independently implemented by Oracle, SQL
Server, DB2, and PostgreSQL. This is not derived from any vendor-specific
implementation.
General Database Theory
Relational Algebra
Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13(6), 377–387.
The operator tree in src/dvm/parser.rs models standard relational algebra
operators (select, project, join, aggregate, union). These are foundational
database theory from 1970.
Semi-Naive Evaluation
Bancilhon, F. & Ramakrishnan, R. (1986). "An Amateur's Introduction to Recursive Query Processing Strategies." Proceedings ACM SIGMOD, 16–52.
General background for recursive CTE evaluation strategies. PostgreSQL's own
WITH RECURSIVE implementation uses iterative fixpoint evaluation based on
these principles.
This document is maintained for attribution and independent-derivation documentation purposes. It does not constitute legal advice.
Custom SQL Syntax for PostgreSQL Extensions
Comprehensive Technical Research Report
Date: 2026-02-25
Context: pg_stream extension — evaluating approaches to support CREATE STREAM TABLE syntax or equivalent native-feeling DDL.
Table of Contents
- Executive Summary
- PostgreSQL Parser Hooks / Utility Hooks
- The ProcessUtility_hook Approach
- Raw Parser Extension (gram.y)
- The Utility Command Approach
- Custom Access Methods (CREATE ACCESS METHOD)
- Table Access Method API (PostgreSQL 12+)
- Foreign Data Wrapper Approach
- Event Triggers
- TimescaleDB Continuous Aggregates Pattern
- Citus Distributed DDL Pattern
- PostgreSQL 18 New Features
- COMMENT / OPTIONS Abuse Pattern
- pg_ivm (Incremental View Maintenance) Pattern
- CREATE TABLE ... USING (Table Access Methods) Deep Dive
- Comparison Matrix
- Recommendations for pg_stream
1. Executive Summary
PostgreSQL's parser is not extensible — there is no parser hook that allows extensions to add new grammar rules. This is a fundamental design constraint. Every approach to "custom DDL syntax" in extensions falls into one of two categories:
- Intercept existing syntax — use ProcessUtility_hook or event triggers to intercept standard DDL (e.g., CREATE TABLE, CREATE VIEW) and augment its behavior.
- Use a SQL function as the DDL interface — define SELECT my_extension.create_thing(...) as the user-facing API (this is what pg_stream currently does).
No production PostgreSQL extension ships truly new SQL grammar without forking the PostgreSQL parser. TimescaleDB, Citus, pg_ivm, and others all work within existing syntax boundaries.
2. PostgreSQL Parser Hooks / Utility Hooks
Available Hook Points
PostgreSQL provides several hook function pointers that extensions can override in _PG_init():
| Hook | Header | Purpose |
|---|---|---|
| ProcessUtility_hook | tcop/utility.h | Intercept utility (DDL) statement execution |
| post_parse_analyze_hook | parser/analyze.h | Inspect/modify the analyzed parse tree after semantic analysis |
| planner_hook | optimizer/planner.h | Replace or augment the query planner |
| ExecutorStart_hook | executor/executor.h | Intercept executor startup |
| ExecutorRun_hook | executor/executor.h | Intercept executor row processing |
| ExecutorFinish_hook | executor/executor.h | Intercept executor finish |
| ExecutorEnd_hook | executor/executor.h | Intercept executor cleanup |
| object_access_hook | catalog/objectaccess.h | Notifications when objects are created/modified/dropped |
| emit_log_hook | utils/elog.h | Intercept log messages |
What's Missing: No Parser Hook
There is no parser_hook or raw_parser_hook. The raw parser (gram.y → scan.l → bison grammar) is compiled into the PostgreSQL server binary. Extensions cannot:
- Add new keywords (e.g., STREAM)
- Add new grammar productions (e.g., CREATE STREAM TABLE)
- Modify the tokenizer/lexer
- Intercept raw SQL text before parsing
The closest hook is post_parse_analyze_hook, which fires after the SQL has already been parsed and analyzed. By this point:
- The SQL string has already been tokenized and parsed by gram.y
- A parse tree (Query node) has been produced
- If the SQL contains unknown syntax, a syntax error has already been raised
Technical Details of post_parse_analyze_hook
/* In src/backend/parser/analyze.c */
typedef void (*post_parse_analyze_hook_type)(ParseState *pstate,
Query *query,
JumbleState *jstate);
post_parse_analyze_hook_type post_parse_analyze_hook = NULL;
Extensions can set this in _PG_init():
static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;
void _PG_init(void) {
prev_post_parse_analyze_hook = post_parse_analyze_hook;
post_parse_analyze_hook = my_post_parse_analyze;
}
Use cases: Query rewriting after parsing (e.g., adding security predicates, row-level security), statistics collection, plan caching invalidation. Not usable for new syntax because parsing has already completed.
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | Impossible — cannot add new grammar |
| Intercept existing DDL | Yes via ProcessUtility_hook |
| Modify parsed queries | Yes via post_parse_analyze_hook |
| Complexity | Low for hooking, but limited in capability |
| PG version | All modern versions (hooks stable since PG 9.x) |
| Maintenance | Very low — hook signatures rarely change |
3. The ProcessUtility_hook Approach
How It Works
ProcessUtility_hook is the most powerful DDL interception point. It fires for every "utility statement" (DDL, COPY, EXPLAIN, etc.) after parsing but before execution.
typedef void (*ProcessUtility_hook_type)(PlannedStmt *pstmt,
const char *queryString,
bool readOnlyTree,
ProcessUtilityContext context,
ParamListInfo params,
QueryEnvironment *queryEnv,
DestReceiver *dest,
QueryCompletion *qc);
An extension can:
- Inspect the parse tree node — The
PlannedStmt->utilityStmtfield contains the parsed DDL node (e.g.,CreateStmt,AlterTableStmt,ViewStmt). - Modify the parse tree — Change fields before passing to the standard handler.
- Replace execution entirely — Skip calling the standard handler and do something else.
- Post-process — Call the standard handler first, then do additional work.
- Block execution — Raise an error to prevent the DDL.
What Extensions Use This
| Extension | What they intercept | Purpose |
|---|---|---|
| TimescaleDB | CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, etc. | Convert regular tables to hypertables, distribute DDL |
| Citus | Most DDL statements | Propagate DDL to worker nodes |
| pg_partman | CREATE TABLE, partition DDL | Auto-manage partitioning |
| pg_stat_statements | All utility statements | Track DDL execution statistics |
| pgAudit | All utility statements | Audit logging |
| pg_hint_plan | — | Uses post_parse_analyze_hook instead |
| sepgsql | Object creation/modification | Security label enforcement |
Can It Handle New Syntax?
No. It can only intercept DDL that PostgreSQL's parser already understands. You cannot use ProcessUtility_hook to handle CREATE STREAM TABLE because the parser will reject that syntax before the hook is ever called.
However, it can intercept and augment existing syntax:
- CREATE TABLE ... (some_option) → intercept CreateStmt, check for special markers, do extra work
- CREATE VIEW ... WITH (custom_option = true) → intercept ViewStmt, check reloptions
- CREATE MATERIALIZED VIEW ... WITH (custom = true) → same approach
Pattern: Intercepting CREATE TABLE
static void my_process_utility(PlannedStmt *pstmt,
                               const char *queryString,
                               bool readOnlyTree,
                               ProcessUtilityContext context,
                               ParamListInfo params,
                               QueryEnvironment *queryEnv,
                               DestReceiver *dest,
                               QueryCompletion *qc)
{
    Node *parsetree = pstmt->utilityStmt;
    if (IsA(parsetree, CreateStmt)) {
        CreateStmt *stmt = (CreateStmt *) parsetree;
        /* Check for a special reloption or table name pattern */
        ListCell *lc;
        foreach(lc, stmt->options) {
            DefElem *opt = (DefElem *) lfirst(lc);
            if (strcmp(opt->defname, "stream") == 0) {
                /* This is a stream table! Do custom logic. */
                create_stream_table_from_ddl(stmt, queryString);
                return; /* Don't call the standard handler */
            }
        }
    }
    /* Pass through to the standard handler */
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, readOnlyTree, context,
                            params, queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, readOnlyTree, context,
                                params, queryEnv, dest, qc);
}
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native CREATE STREAM TABLE | No — parser rejects unknown syntax |
| CREATE TABLE ... WITH (stream=true) | Yes — feasible via reloptions |
| Complexity | Medium — must carefully chain with other extensions |
| PG version | All modern versions |
| Maintenance | Low — hook signature changes rarely (changed in PG14, PG15) |
| Risk | Must always chain prev_ProcessUtility — misbehaving can break other extensions |
4. Raw Parser Extension (gram.y)
How It Works
PostgreSQL's SQL parser is a Bison-generated LALR(1) parser defined in:
- src/backend/parser/gram.y — grammar rules (~18,000 lines)
- src/backend/parser/scan.l — Flex lexer (tokenizer)
- src/include/parser/kwlist.h — reserved/unreserved keyword list
To add CREATE STREAM TABLE, you would:
- Add STREAM to the keyword list (unreserved or reserved)
- Add grammar rules to gram.y:

  CreateStreamTableStmt:
      CREATE STREAM TABLE qualified_name '(' OptTableElementList ')' OptWith AS SelectStmt
          {
              CreateStreamTableStmt *n = makeNode(CreateStreamTableStmt);
              n->relation = $4;
              n->query = $10;   /* SelectStmt */
              /* ... */
              $$ = (Node *) n;
          }
      ;

- Add a new NodeTag for CreateStreamTableStmt
- Handle it in ProcessUtility
- Rebuild the PostgreSQL server
Implications
This requires forking PostgreSQL. The modified parser is compiled into postgres binary. You cannot ship a grammar modification as a loadable extension (.so/.dylib).
Who Does This?
- YugabyteDB — Fork of PG with custom grammar for distributed features
- CockroachDB — Entirely custom parser (Go, not PG's Bison grammar)
- Amazon Aurora (partially) — Custom grammar additions for Aurora-specific features
- Greenplum — Fork of PG with added grammar for `DISTRIBUTED BY`, `PARTITION BY`, etc.
- ParadeDB — Fork of PG with some custom syntax additions
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native CREATE STREAM TABLE | Yes — full parser-level support |
| Complexity | Very high — must maintain a PG fork |
| PG version | Tied to a single PG version |
| Maintenance | Extremely high — must rebase on every PG release (gram.y changes significantly between major versions) |
| Distribution | Cannot use CREATE EXTENSION; must ship entire modified PostgreSQL |
| User adoption | Very low — users must replace their PostgreSQL installation |
| psql autocomplete | Would work with matching psql modifications |
| pg_dump/pg_restore | Broken unless you also modify those tools |
Verdict: Not viable for an extension. Only viable for a PostgreSQL fork/distribution.
5. The Utility Command Approach
How It Works
Some sources reference a "custom utility command" mechanism. In practice, this does not exist as a formal PostgreSQL extension point. What people sometimes mean is one of:
5a. Using DO Blocks as Custom Commands
DO $$ BEGIN PERFORM pgstream.create_stream_table('my_st', 'SELECT ...'); END $$;
This is just a wrapped function call — not a real custom command.
5b. Abusing COMMENT or SET for Command Dispatch
Some extensions parse custom commands from strings:
-- Using SET to pass commands
SET myext.command = 'CREATE STREAM TABLE my_st AS SELECT ...';
SELECT myext.execute_pending_command();
Or using post_parse_analyze_hook to intercept a specially-formatted query:
-- Extension intercepts this via post_parse_analyze_hook
SELECT * FROM myext.dispatch('CREATE STREAM TABLE ...');
5c. Overloading Existing Syntax
Some extensions overload SELECT or CALL:
CALL pgstream.create_stream_table('my_st', $$SELECT ...$$);
CALL was introduced in PostgreSQL 11 for stored procedures. Using it makes the DDL feel more "command-like" than SELECT function().
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — still a function call in disguise |
| User experience | Moderate — CALL is better than SELECT |
| Complexity | Low |
| PG version | PG11+ for CALL |
| Maintenance | Very low |
6. Custom Access Methods (CREATE ACCESS METHOD)
How It Works
PostgreSQL supports extension-defined access methods (index AMs and table AMs):
CREATE ACCESS METHOD my_am TYPE TABLE HANDLER my_am_handler;
This was introduced in PostgreSQL 9.6 for index AMs and extended to table AMs in PostgreSQL 12. The CREATE ACCESS METHOD statement shows PostgreSQL's philosophy: extensions can define new implementations of existing concepts (tables, indexes) but not new concepts (stream tables).
Table AM vs. Index AM
| Type | Since | Handler Signature | Example |
|---|---|---|---|
| Index AM | PG 9.6 | IndexAmRoutine with scan/insert/delete callbacks | bloom, brin, GiST |
| Table AM | PG 12 | TableAmRoutine with 60+ callbacks | heap (default), columnar (Citus), zedstore (experimental) |
Can We Use This for Stream Tables?
The table AM API defines how tuples are stored and retrieved, not how tables are created or maintained. A stream table's key features are:
- Defining query — Not part of the table AM concept
- Automatic refresh — Not part of the table AM concept
- Change tracking — Could partially overlap with table AM's tuple modification callbacks
- Storage — The actual storage could use heap (default) AM
You could theoretically create a custom table AM that:
- Uses heap storage underneath
- Intercepts INSERT/UPDATE/DELETE to maintain change buffers
- Adds custom metadata
But this would be an extreme abuse of the API. Table AMs are meant for storage engines, not for implementing materialized view semantics.
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — CREATE TABLE ... USING my_am is the closest |
| Complexity | Extremely high — 60+ callbacks to implement |
| Fitness | Poor — table AM is about storage, not view maintenance |
| PG version | PG 12+ |
| Maintenance | High — AM API evolves between major versions |
7. Table Access Method API (PostgreSQL 12+)
Deep Technical Details
The Table Access Method (AM) API was introduced in PostgreSQL 12 via commit c2fe139c20 by Andres Freund. It abstracts the storage layer, allowing extensions to replace the default heap storage with custom implementations.
The CREATE TABLE ... USING Syntax
-- Use default AM (heap)
CREATE TABLE normal_table (id int, data text);
-- Use custom AM
CREATE TABLE my_table (id int, data text) USING my_custom_am;
-- Set default for a database
SET default_table_access_method = 'my_custom_am';
TableAmRoutine Structure
The handler function must return a TableAmRoutine struct with callbacks:
typedef struct TableAmRoutine {
NodeTag type;
/* Slot callbacks */
const TupleTableSlotOps *(*slot_callbacks)(Relation rel);
/* Scan callbacks */
TableScanDesc (*scan_begin)(Relation rel, Snapshot snap, int nkeys, ...);
void (*scan_end)(TableScanDesc scan);
void (*scan_rescan)(TableScanDesc scan, ...);
bool (*scan_getnextslot)(TableScanDesc scan, ...);
/* Parallel scan */
Size (*parallelscan_estimate)(Relation rel);
Size (*parallelscan_initialize)(Relation rel, ...);
void (*parallelscan_reinitialize)(Relation rel, ...);
/* Index fetch */
IndexFetchTableData *(*index_fetch_begin)(Relation rel);
void (*index_fetch_reset)(IndexFetchTableData *data);
void (*index_fetch_end)(IndexFetchTableData *data);
bool (*index_fetch_tuple)(IndexFetchTableData *data, ...);
/* Tuple modification */
void (*tuple_insert)(Relation rel, TupleTableSlot *slot, ...);
void (*tuple_insert_speculative)(Relation rel, ...);
void (*tuple_complete_speculative)(Relation rel, ...);
void (*multi_insert)(Relation rel, TupleTableSlot **slots, int nslots, ...);
TM_Result (*tuple_delete)(Relation rel, ItemPointer tid, ...);
TM_Result (*tuple_update)(Relation rel, ItemPointer otid, ...);
TM_Result (*tuple_lock)(Relation rel, ItemPointer tid, ...);
/* DDL callbacks */
void (*relation_set_new_filelocator)(Relation rel, ...);
void (*relation_nontransactional_truncate)(Relation rel);
void (*relation_copy_data)(Relation rel, const RelFileLocator *newrlocator);
void (*relation_copy_for_cluster)(Relation rel, ...);
void (*relation_vacuum)(Relation rel, VacuumParams *params, ...);
bool (*scan_analyze_next_block)(TableScanDesc scan, ...);
bool (*scan_analyze_next_tuple)(TableScanDesc scan, ...);
/* Planner support */
void (*relation_estimate_size)(Relation rel, int32 *attr_widths, ...);
/* ... more callbacks */
} TableAmRoutine;
Hybrid Approach: Table AM + ProcessUtility_hook
A more practical pattern:

- Register a custom table AM (e.g., `stream_am`) that wraps heap
- Use `ProcessUtility_hook` to intercept `CREATE TABLE ... USING stream_am`
- When detected, perform stream table registration (catalog, CDC, etc.)
- The actual storage uses standard heap via delegation
-- User writes:
CREATE TABLE order_totals (region text, total numeric)
USING stream_am
WITH (query = 'SELECT region, SUM(amount) FROM orders GROUP BY region',
schedule = '1m',
refresh_mode = 'DIFFERENTIAL');
Problems with This Approach

- Column list is mandatory — `CREATE TABLE ... USING` requires explicit column definitions. Stream tables should derive columns from the query.
- Query in WITH clause — Storing a full SQL query in reloptions is hacky and has length limits.
- AS SELECT caveat — The grammar does accept `CREATE TABLE ... USING stream_am AS SELECT ...` (PG 12+), but the query runs only once at creation; there is no mechanism for the AM to retain the defining query for ongoing maintenance.
- VACUUM, ANALYZE complexity — Must implement or delegate all maintenance callbacks.
- pg_dump compatibility — pg_dump would dump `CREATE TABLE ... USING stream_am` but not the associated metadata (query, schedule, etc.)
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE TABLE ... USING stream_am |
| Feels like a stream table | No — still looks like a regular table with options |
| Complexity | Very high |
| pg_dump | Broken — metadata in catalog tables won't be dumped |
| PG version | PG 12+ |
| Maintenance | High — table AM API changes between versions |
8. Foreign Data Wrapper Approach
How It Works
Foreign Data Wrappers (FDW) allow PostgreSQL to access external data sources via CREATE FOREIGN TABLE. An extension can register a custom FDW:
CREATE EXTENSION pg_stream;
CREATE SERVER stream_server FOREIGN DATA WRAPPER pgstream_fdw;
CREATE FOREIGN TABLE order_totals (region text, total numeric)
SERVER stream_server
OPTIONS (
query 'SELECT region, SUM(amount) FROM orders GROUP BY region',
schedule '1m',
refresh_mode 'DIFFERENTIAL'
);
FDW API
The FDW API provides callbacks for:
- `GetForeignRelSize` — Estimate relation size for planning
- `GetForeignPaths` — Generate access paths
- `GetForeignPlan` — Create a plan node
- `BeginForeignScan` — Start scan
- `IterateForeignScan` — Get next tuple
- `EndForeignScan` — End scan
- `AddForeignUpdateTargets` and related callbacks — Support INSERT/UPDATE/DELETE (optional)
How It Could Work for Stream Tables
- Define a custom FDW (`pgstream_fdw`)
- The FDW's scan callbacks read from the underlying storage table
- `ProcessUtility_hook` intercepts `CREATE FOREIGN TABLE ... SERVER stream_server` to set up CDC, catalog entries, etc.
- A background worker handles refresh scheduling
Problems
- Foreign tables have restrictions — No indexes, constraints are declared but not enforced, and trigger support is limited (added only in PG 9.4). This severely limits usability.
- Query planner limitations — Foreign tables use a separate planning path with potentially worse plan quality.
- No MVCC — Foreign tables typically don't provide snapshot isolation semantics.
- User model confusion — "Foreign table" implies external data, not a derived view.
- EXPLAIN output — Shows "Foreign Scan" instead of "Seq Scan", confusing users.
- pg_dump — Foreign tables are dumped, but server/FDW setup may not transfer correctly.
- Two-step creation — Requires `CREATE SERVER` before `CREATE FOREIGN TABLE`.
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE FOREIGN TABLE with options |
| Feels like a stream table | No — foreign tables have different semantics |
| Index support | No — major limitation |
| Trigger support | Limited — basic trigger support only since PG 9.4 |
| Complexity | Medium |
| PG version | PG 9.1+ |
| Maintenance | Low — FDW API is very stable |
Verdict: Not suitable. The restrictions on foreign tables (no indexes, no triggers) make this impractical for stream tables that need to behave like regular tables.
9. Event Triggers
How It Works
Event triggers fire on DDL events at the database level:
CREATE EVENT TRIGGER my_trigger ON ddl_command_end
WHEN TAG IN ('CREATE TABLE', 'ALTER TABLE', 'DROP TABLE')
EXECUTE FUNCTION my_handler();
Available events:
- `ddl_command_start` — Before DDL execution (PG 9.3+)
- `ddl_command_end` — After DDL execution (PG 9.3+)
- `sql_drop` — When objects are dropped (PG 9.3+)
- `table_rewrite` — When a table is rewritten (PG 9.5+)
Inside the Handler
CREATE FUNCTION my_handler() RETURNS event_trigger AS $$
DECLARE
obj record;
BEGIN
FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
LOOP
-- obj.objid, obj.object_type, obj.command_tag, etc.
IF obj.command_tag = 'CREATE TABLE' AND obj.object_type = 'table' THEN
-- Check if this table has a special marker
-- (e.g., a specific reloption or comment)
END IF;
END LOOP;
END;
$$ LANGUAGE plpgsql;
Pattern: CREATE TABLE + Event Trigger
1. User creates a table with a special comment or option:

   CREATE TABLE order_totals (region text, total numeric);
   COMMENT ON TABLE order_totals IS 'pgstream:query=SELECT region...;schedule=1m';

2. Event trigger on `ddl_command_end` fires
3. Handler parses the comment, detects stream table intent
4. Handler registers the stream table in the catalog
Limitations
- Cannot modify the DDL — Event triggers observe DDL; they can't change what happened. On `ddl_command_end`, the table already exists.
- Cannot prevent DDL — On `ddl_command_start`, you can raise an error to prevent it, but you can't redirect it.
- Two-step process — User must `CREATE TABLE` and then mark it somehow (comment, option, separate function call).
- No custom syntax — Event triggers watch existing DDL commands.
- pg_stream already uses this — For DDL tracking on upstream tables (see hooks.rs).
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — watches existing DDL only |
| Complexity | Low |
| Can transform DDL | No — observe only |
| PG version | PG 9.3+ |
| Maintenance | Very low |
| pg_stream usage | Already used for upstream DDL tracking |
10. TimescaleDB Continuous Aggregates Pattern
How It Works
TimescaleDB continuous aggregates (caggs) demonstrate the most sophisticated approach to custom DDL-like syntax in a PostgreSQL extension. Their evolution is instructive.
Phase 1: Pure Function API (early versions)
-- Create a view, then register it
CREATE VIEW daily_temps AS
SELECT time_bucket('1 day', time) AS day, AVG(temp)
FROM conditions GROUP BY 1;
SELECT add_continuous_aggregate_policy('daily_temps', ...);
Phase 2: CREATE MATERIALIZED VIEW WITH (introduced in TimescaleDB 2.0)
CREATE MATERIALIZED VIEW daily_temps
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day, device_id, AVG(temp)
FROM conditions
GROUP BY 1, 2;
How the Hook Chain Works
TimescaleDB's approach uses layered hooks:
1. `ProcessUtility_hook` intercepts `CREATE MATERIALIZED VIEW`
2. Checks reloptions for `timescaledb.continuous` in the `WithClause`
3. If found:
   - Does NOT call standard ProcessUtility for the matview
   - Instead creates a regular hypertable (the materialization)
   - Creates an internal view (the user-facing query interface)
   - Registers refresh policies in the catalog
   - Sets up continuous aggregate metadata
4. For `REFRESH MATERIALIZED VIEW`, intercepts and routes to their refresh engine
5. For `DROP MATERIALIZED VIEW`, intercepts and cleans up all artifacts
The Magic: Reloptions as Extension Point
PostgreSQL's CREATE MATERIALIZED VIEW ... WITH (option = value) passes options as DefElem nodes in the parse tree. The parser treats these as generic key-value pairs — it does NOT validate the option names. This is the key insight: PostgreSQL's parser accepts arbitrary options in WITH clauses.
// In ProcessUtility_hook:
if (IsA(parsetree, CreateTableAsStmt)) {
CreateTableAsStmt *stmt = (CreateTableAsStmt *) parsetree;
if (stmt->objtype == OBJECT_MATVIEW) {
// Check for our custom option in stmt->into->options
bool is_continuous = false;
ListCell *lc;
foreach(lc, stmt->into->options) {
DefElem *opt = (DefElem *) lfirst(lc);
if (strcmp(opt->defname, "timescaledb.continuous") == 0) {
is_continuous = true;
break;
}
}
if (is_continuous) {
// Handle as continuous aggregate
return;
}
}
}
Refresh Policies
-- Add a refresh policy (function call, not DDL)
SELECT add_continuous_aggregate_policy('daily_temps',
start_offset => INTERVAL '1 month',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 hour');
What pg_stream Could Learn
The TimescaleDB pattern for pg_stream would look like:
-- Option A: CREATE MATERIALIZED VIEW with custom option
CREATE MATERIALIZED VIEW order_totals
WITH (pgstream.stream = true, pgstream.schedule = '1m', pgstream.mode = 'DIFFERENTIAL')
AS SELECT region, SUM(amount) FROM orders GROUP BY region;
-- Option B: CREATE TABLE with custom option (less natural)
CREATE TABLE order_totals (region text, total numeric)
WITH (pgstream.stream = true);
-- Then separately: SELECT pgstream.set_query('order_totals', 'SELECT ...');
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | Good — CREATE MATERIALIZED VIEW ... WITH (pgstream.stream) looks natural |
| User experience | Very good — familiar DDL syntax with extension options |
| Complexity | High — must implement full ProcessUtility_hook chain |
| pg_dump | Partial — matview DDL is dumped, but custom metadata needs pg_dump extension or config tables |
| PG version | PG 9.3+ (matviews), PG 12+ (better option handling) |
| Maintenance | Medium — must track changes to matview creation internals |
| Shared preload | Required — ProcessUtility_hook needs shared_preload_libraries |
11. Citus Distributed DDL Pattern
How It Works
Citus (now part of Microsoft) demonstrates another approach to extending DDL behavior:
ProcessUtility_hook Chain
Citus has one of the most comprehensive ProcessUtility_hook implementations:
/* Simplified sketch — Citus's real hook handles many more cases */
static void
multi_ProcessUtility(PlannedStmt *pstmt,
                     const char *queryString,
                     bool readOnlyTree,
                     ProcessUtilityContext context,
                     ParamListInfo params,
                     QueryEnvironment *queryEnv,
                     DestReceiver *dest,
                     QueryCompletion *qc)
{
    // 1. Classify the DDL
    Node *parsetree = pstmt->utilityStmt;

    // 2. Check if it affects distributed tables
    if (IsA(parsetree, AlterTableStmt))
    {
        // Propagate ALTER TABLE to all worker nodes
        PropagateAlterTable((AlterTableStmt *) parsetree, queryString);
    }

    // 3. Call standard handler (or skip for intercepted commands)
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, readOnlyTree, context,
                            params, queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, readOnlyTree, context,
                                params, queryEnv, dest, qc);

    // 4. Post-processing
    if (IsA(parsetree, CreateStmt))
    {
        // Check if we should auto-distribute this table
    }
}
Table Distribution via Function Calls
Citus does NOT add custom DDL syntax. Distribution is done via function calls:
-- Create a regular table
CREATE TABLE events (id bigint, data jsonb, created_at timestamptz);
-- Distribute it (function call, not DDL)
SELECT create_distributed_table('events', 'id');
-- Or create a reference table
SELECT create_reference_table('lookups');
Columnar Storage via Table AM
Citus also provides columnar storage as a table AM:
CREATE TABLE analytics_data (...)
USING columnar;
This uses the table AM API (PostgreSQL 12+) — see Section 7.
What Citus Teaches Us
- Function calls for complex operations — `create_distributed_table()` is analogous to `pgstream.create_stream_table()`.
- ProcessUtility_hook for DDL propagation — Intercept standard DDL and add behavior.
- Table AM for storage — Separate concern from distribution logic.
- No custom syntax — Even with Microsoft's resources, Citus doesn't fork the parser.
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — uses function calls like pg_stream |
| Approach validated | Yes — Citus is used at massive scale with this pattern |
| Complexity | Medium (function API) to High (ProcessUtility_hook) |
| User adoption | Proven successful |
| Maintenance | Low for function API |
12. PostgreSQL 18 New Features
Relevant Extension Points in PG 18
PostgreSQL 18 (released 2025) includes several features relevant to this analysis:
12a. Virtual Generated Columns
PG 18 adds GENERATED ALWAYS AS (expr) VIRTUAL columns. Not directly relevant to stream tables, but shows PostgreSQL's willingness to expand CREATE TABLE syntax incrementally.
12b. Improved Table AM API
PG 18 refines the table AM API with better TOAST handling and improved parallel scan support. This makes custom table AMs slightly more practical.
12c. Enhanced Event Trigger Information
PG 18 expands pg_event_trigger_ddl_commands() with additional metadata fields, making event-trigger-based approaches more capable.
12d. pg_stat_io Improvements
Enhanced I/O statistics infrastructure that could benefit monitoring of stream table refresh operations.
12e. No New Parser Extension Points
PostgreSQL 18 does not add any parser extension mechanism. The parser remains monolithic and non-extensible. There have been occasional discussions on pgsql-hackers about parser hooks, but no concrete proposals have been accepted.
12f. No Custom DDL Extension Points
No new general-purpose DDL extension points beyond the existing hook system.
Looking Forward: Discussion on pgsql-hackers
There have been recurring threads on pgsql-hackers about:
- Extension-defined SQL syntax — Rejected due to complexity and parser architecture
- Loadable parser modules — Theoretical discussions, no implementation
- Extension catalogs — Some interest in allowing extensions to register custom catalogs
None of these are implemented in PG 18.
Pros/Cons
| Aspect | Assessment |
|---|---|
| New syntax extension points | None in PG 18 |
| Table AM improvements | Minor — slightly easier to implement |
| Event trigger improvements | Minor — more metadata available |
| Parser extensibility | Not planned for any upcoming PG release |
13. COMMENT / OPTIONS Abuse Pattern
How It Works
Several extensions use table comments or reloptions as a "poor man's metadata" to tag tables with custom semantics.
Pattern 1: COMMENT-based
CREATE TABLE order_totals (region text, total numeric);
COMMENT ON TABLE order_totals IS '@pgstream {"query": "SELECT ...", "schedule": "1m"}';
An event trigger or background worker scans pg_description for tables with the @pgstream prefix and processes them.
Pattern 2: Reloptions-based
CREATE TABLE order_totals (region text, total numeric)
WITH (fillfactor = 70, pgstream.stream = true);
Problem: PostgreSQL validates reloptions against a known list; you cannot add arbitrary options to WITH (...) without registering them. Extensions can register custom reloptions via the `add_bool_reloption()` / `add_string_reloption()` family, but this is a relatively obscure API.
Pattern 3: GUC-based Tagging
-- Set a GUC that our ProcessUtility_hook reads
SET pgstream.next_create_is_stream = true;
SET pgstream.stream_query = 'SELECT region, SUM(amount) FROM orders GROUP BY region';
-- Hook intercepts this CREATE TABLE and registers it
CREATE TABLE order_totals (region text, total numeric);
-- Reset
RESET pgstream.next_create_is_stream;
This is extremely hacky but has been used in practice (some partitioning extensions used similar patterns before native partitioning).
Who Uses This?
- pgmemcache — Uses comments to configure caching behavior
- Some row-level security extensions — Comments to define policies
- pg_partman — Uses a configuration table (not comments) but similar concept
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — abuses existing mechanisms |
| User experience | Poor — fragile, easy to break by editing comments |
| Complexity | Low |
| pg_dump | COMMENT is dumped — metadata survives pg_dump/restore |
| Robustness | Low — comments can be accidentally changed |
| PG version | All versions |
14. pg_ivm (Incremental View Maintenance) Pattern
How It Works
pg_ivm is the most directly comparable extension to pg_stream. It implements incremental view maintenance for PostgreSQL.
API Design
pg_ivm uses a pure function-call API:
-- Create an incrementally maintainable materialized view
SELECT create_immv('order_totals', 'SELECT region, SUM(amount) FROM orders GROUP BY region');
-- Refresh (the second argument repopulates the IMMV)
SELECT refresh_immv('order_totals', true);
-- Drop
DROP TABLE order_totals; -- Just drop the underlying table
Key function: create_immv(name, query) — Creates an "Incrementally Maintainable Materialized View" (IMMV).
Internal Implementation
- `create_immv()` is a SQL function (not a hook)
- It parses the query, creates a storage table, and sets up triggers on source tables
- IMMVs are stored as regular tables, with metadata in a custom catalog (`pg_ivm_immv`)
- Triggers on source tables automatically update the IMMV on DML
No ProcessUtility_hook
pg_ivm does not use ProcessUtility_hook. It operates entirely through:
- SQL functions (`create_immv`, `refresh_immv`)
- Row-level triggers for automatic maintenance
- A custom catalog table for metadata
Why No Custom Syntax?
pg_ivm was developed as a proof-of-concept for PostgreSQL core IVM support. The authors explicitly chose function-call syntax to:
- Avoid the `shared_preload_libraries` requirement (hooks need it)
- Keep the extension simple and portable
- Focus on the IVM algorithm, not the user interface
Eventually Merged to Core?
There was discussion about upstreaming IVM to PostgreSQL core. If merged, it would get proper syntax (CREATE INCREMENTAL MATERIALIZED VIEW). As an extension, it stays with function calls.
Relevance to pg_stream
pg_stream's current API (pgstream.create_stream_table()) follows the exact same pattern as pg_ivm. This is the established approach for IVM extensions.
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | No — function calls |
| Complexity | Low — simple function API |
| shared_preload_libraries | Not required for basic function API |
| pg_dump | No — function calls are not dumped; must use custom dump/restore |
| User experience | Moderate — familiar to pg_ivm users |
| Community acceptance | Established pattern for IVM extensions |
15. CREATE TABLE ... USING (Table Access Methods) Deep Dive
Full Syntax
CREATE TABLE tablename (
column1 datatype,
column2 datatype,
...
) USING access_method_name
WITH (storage_parameter = value, ...);
How the Parser Handles USING
In gram.y:
CreateStmt: CREATE OptTemp TABLE ...
OptTableAccessMethod OptWith ...
OptTableAccessMethod:
USING name { $$ = $2; }
| /* empty */ { $$ = NULL; }
;
The USING clause sets CreateStmt->accessMethod to the access method name string.
How ProcessUtility Handles It
In `DefineRelation()` (src/backend/commands/tablecmds.c):

- If `accessMethod` is specified, look it up in `pg_am`
- Verify it's a table AM (not an index AM)
- Store the AM OID in `pg_class.relam`
- Use the AM's callbacks for all subsequent operations
Custom Reloptions with Table AMs
Table AMs can define custom reloptions via:
static relopt_parse_elt stream_relopt_tab[] = {
{"query", RELOPT_TYPE_STRING, offsetof(StreamOptions, query)},
{"schedule", RELOPT_TYPE_STRING, offsetof(StreamOptions, schedule)},
{"refresh_mode", RELOPT_TYPE_STRING, offsetof(StreamOptions, refresh_mode)},
};
This would allow:
CREATE TABLE order_totals (region text, total numeric)
USING stream_heap
WITH (query = 'SELECT ...', schedule = '1m', refresh_mode = 'DIFFERENTIAL');
Problems Specific to Stream Tables

- Column derivation — Stream tables derive columns from the query. Plain `CREATE TABLE ... USING` requires explicit column definitions, creating redundancy and potential inconsistency.
- AS SELECT caveat — The grammar does accept `CREATE TABLE ... USING stream_heap AS SELECT region, SUM(amount) FROM orders GROUP BY region;` (PG 12+), but the query runs only once to populate the table; the AM has no mechanism to retain the defining query for ongoing maintenance.
- Full AM implementation required — Even if you delegate to heap, you must implement all callbacks and handle edge cases.
- VACUUM/ANALYZE — Must properly delegate to heap for these to work.
- Replication — Logical replication assumes heap tuples; custom AMs may break.
Hybrid Practical Approach
If pursuing this route:
-- Step 1: Set default AM
SET default_table_access_method = 'stream_heap';
-- Step 2: Create with query in options
CREATE TABLE order_totals ()
WITH (pgstream.query = 'SELECT region, SUM(amount) FROM orders GROUP BY region',
pgstream.schedule = '1m');
-- ProcessUtility_hook would:
-- 1. Detect USING stream_heap (or detect our custom reloptions)
-- 2. Parse the query from options
-- 3. Derive columns from the query
-- 4. Create the actual table with proper columns using heap AM
-- 5. Register in pgstream catalog
-- 6. Set up CDC
Pros/Cons
| Aspect | Assessment |
|---|---|
| Native syntax | Partial — CREATE TABLE ... USING stream_heap WITH (...) |
| Column derivation | Not supported — must specify columns or use hook magic |
| Complexity | Very high |
| pg_dump | Good — CREATE TABLE ... USING is properly dumped |
| PG version | PG 12+ |
| Maintenance | High — AM API changes between versions |
16. Comparison Matrix
| Approach | Native Syntax | Complexity | pg_dump | PG Version | Maintenance | Recommended |
|---|---|---|---|---|---|---|
| Function API (current) | No | Low | No* | Any | Very Low | Yes |
| ProcessUtility_hook + MATVIEW WITH | Good | High | Partial | 9.3+ | Medium | Maybe |
| Raw parser fork | Perfect | Very High | No | Fork only | Very High | No |
| Table AM USING | Partial | Very High | Yes | 12+ | High | No |
| FDW FOREIGN TABLE | Partial | Medium | Yes | 9.1+ | Low | No |
| Event triggers alone | No | Low | No | 9.3+ | Low | No |
| COMMENT abuse | No | Low | Yes | Any | Low | No |
| GUC + CREATE TABLE hack | No | Medium | Partial | Any | Medium | No |
| TimescaleDB pattern (MATVIEW + WITH) | Good | High | Partial | 9.3+ | Medium | Best option |
* Custom pg_dump support can be added via pg_dump hook or wrapper script.
17. Recommendations for pg_stream
Current Approach: Function API (Keep and Enhance)
pg_stream's current approach (pgstream.create_stream_table('name', 'query', ...)) is:
- Proven — Same pattern as pg_ivm, Citus, and many other extensions
- Simple — No `shared_preload_libraries` required for basic usage
- Maintainable — No hook chains to debug
- Portable — Works on any PG version that supports pgrx
Enhancement opportunities:
-- Current
SELECT pgstream.create_stream_table('order_totals',
'SELECT region, SUM(amount) FROM orders GROUP BY region', '1m');
-- Enhanced: CALL syntax for more DDL-like feel (PG 11+)
CALL pgstream.create_stream_table('order_totals',
$$SELECT region, SUM(amount) FROM orders GROUP BY region$$, '1m');
Future Option: TimescaleDB-style Materialized View Integration
If user demand justifies the complexity, pg_stream could add a second creation path via ProcessUtility_hook:
-- New native-feeling syntax (requires shared_preload_libraries)
CREATE MATERIALIZED VIEW order_totals
WITH (pgstream.stream = true, pgstream.schedule = '1m')
AS SELECT region, SUM(amount) FROM orders GROUP BY region
WITH NO DATA;
-- Original function API still works (no hook needed)
SELECT pgstream.create_stream_table('order_totals',
'SELECT region, SUM(amount) FROM orders GROUP BY region', '1m');
Implementation plan for hook-based approach:
1. Register `ProcessUtility_hook` in `_PG_init()` (already needed for `shared_preload_libraries`)
2. Intercept `CREATE MATERIALIZED VIEW` → Check for the `pgstream.stream` option
3. If found: parse options, call `create_stream_table_impl()` internally, create a standard storage table instead of a matview
4. Intercept `DROP MATERIALIZED VIEW` → Check if target is a stream table → Clean up
5. Intercept `REFRESH MATERIALIZED VIEW` → Route to stream table refresh engine
6. Intercept `ALTER MATERIALIZED VIEW` → Route to stream table alter logic
Estimated complexity: ~800-1200 lines of Rust hook code + tests.
Not Recommended
- Forking PostgreSQL for custom grammar — Maintenance cost is prohibitive
- Table AM approach — Complexity without proportional benefit
- FDW approach — Too many restrictions on foreign tables
- COMMENT abuse — Fragile and poor UX
pg_dump / pg_restore Strategy
Regardless of approach, pg_dump is a challenge. Options:
- Custom dump/restore functions — `pgstream.dump_config()` and `pgstream.restore_config()`
- Migration script generation — `pgstream.generate_migration()` outputs SQL to recreate all stream tables
- Event trigger on restore — Detect when tables are restored and re-register them
- Sidecar file — Generate a companion SQL file alongside pg_dump
Appendix A: Hook Registration in pgrx (Rust)
For reference, here's how ProcessUtility_hook registration works in pgrx:
use pgrx::prelude::*;
use pgrx::pg_sys;

static mut PREV_PROCESS_UTILITY_HOOK: pg_sys::ProcessUtility_hook_type = None;

#[pg_guard]
pub extern "C-unwind" fn my_process_utility(
    pstmt: *mut pg_sys::PlannedStmt,
    query_string: *const std::os::raw::c_char,
    read_only_tree: bool,
    context: pg_sys::ProcessUtilityContext,
    params: pg_sys::ParamListInfo,
    query_env: *mut pg_sys::QueryEnvironment,
    dest: *mut pg_sys::DestReceiver,
    qc: *mut pg_sys::QueryCompletion,
) {
    // SAFETY: pstmt is a valid pointer provided by PostgreSQL
    let stmt = unsafe { (*pstmt).utilityStmt };

    // Check if this is a CreateTableAsStmt (materialized view)
    if unsafe { pgrx::is_a(stmt, pg_sys::NodeTag::T_CreateTableAsStmt) } {
        // Check for our custom options...
    }

    // Chain to previous hook or standard handler
    unsafe {
        if let Some(prev) = PREV_PROCESS_UTILITY_HOOK {
            prev(pstmt, query_string, read_only_tree, context,
                 params, query_env, dest, qc);
        } else {
            pg_sys::standard_ProcessUtility(pstmt, query_string, read_only_tree,
                                            context, params, query_env, dest, qc);
        }
    }
}

pub fn register_hooks() {
    unsafe {
        PREV_PROCESS_UTILITY_HOOK = pg_sys::ProcessUtility_hook;
        pg_sys::ProcessUtility_hook = Some(my_process_utility);
    }
}
Appendix B: Key Source Files in PostgreSQL
| File | Purpose |
|---|---|
| `src/backend/parser/gram.y` | SQL grammar (~18,000 lines) |
| `src/backend/parser/scan.l` | Lexer/tokenizer |
| `src/include/parser/kwlist.h` | Keyword definitions |
| `src/backend/tcop/utility.c` | `ProcessUtility()` — DDL dispatcher |
| `src/backend/commands/tablecmds.c` | CREATE/ALTER/DROP TABLE implementation |
| `src/backend/commands/createas.c` | CREATE TABLE AS / CREATE MATVIEW AS |
| `src/include/access/tableam.h` | Table Access Method API |
| `src/include/foreign/fdwapi.h` | FDW API |
| `src/backend/commands/event_trigger.c` | Event trigger infrastructure |
Appendix C: References
- PostgreSQL Documentation — Table Access Method Interface
- PostgreSQL Documentation — Event Triggers
- PostgreSQL Documentation — Writing A Foreign Data Wrapper
- TimescaleDB Source — process_utility.c
- Citus Source — multi_utility.c
- pg_ivm Source — createas.c
- pgrx Documentation — Hooks
- PostgreSQL Wiki — CustomScanProviders