Troubleshooting & Failure Mode Runbook

This document covers common failure scenarios, their symptoms, diagnosis steps, and resolution procedures. It is intended for operators and DBAs running pg_trickle in production.

Quick start: Run SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK'; for a single-call triage of your installation.

See also:

Error Reference — all PgTrickleError variants with causes and fixes

FAQ — Troubleshooting section — common user questions

Pre-Deployment Checklist — configuration verification

Configuration — GUC reference

Diagnostic Toolkit
Failure Scenarios

Diagnostic Toolkit

These functions are your primary tools for diagnosing issues:

Function	Purpose
`pgtrickle.health_check()`	Single-call overall health triage (OK/WARN/ERROR)
`pgtrickle.pgt_status()`	Status, staleness, error count for all stream tables
`pgtrickle.refresh_timeline(N)`	Last N refresh events across all stream tables
`pgtrickle.diagnose_errors('name')`	Last 5 failed events with classification and remediation
`pgtrickle.change_buffer_sizes()`	CDC pipeline: pending rows and buffer bytes per source
`pgtrickle.trigger_inventory()`	CDC trigger presence and enabled state
`pgtrickle.check_cdc_health()`	WAL replication slot health (WAL mode only)
`pgtrickle.dependency_tree()`	Dependency DAG visualization
`pgtrickle.worker_pool_status()`	Parallel refresh worker pool state
`pgtrickle.explain_st('name')`	DVM operator tree and generated delta SQL

Quick health check script:

-- 1. Overall health
SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK';

-- 2. Problem stream tables
SELECT name, status, refresh_mode, consecutive_errors, staleness
FROM pgtrickle.pgt_status()
WHERE status != 'ACTIVE' OR consecutive_errors > 0
ORDER BY consecutive_errors DESC;

-- 3. Recent failures
SELECT start_time, stream_table, action, status, duration_ms, error_message
FROM pgtrickle.refresh_timeline(20)
WHERE status = 'FAILED';

Failure Scenarios

1. Scheduler Not Running

Symptoms:

No stream tables are refreshing
health_check() reports scheduler_running = false
No pg_trickle scheduler process in pg_stat_activity

Diagnosis:

-- Check for the scheduler process
SELECT pid, datname, backend_type, state
FROM pg_stat_activity
WHERE backend_type = 'pg_trickle scheduler';

-- Check GUC
SHOW pg_trickle.enabled;

-- Check shared_preload_libraries
SHOW shared_preload_libraries;

Resolution:

Cause	Fix
`pg_trickle.enabled = off`	`ALTER SYSTEM SET pg_trickle.enabled = on; SELECT pg_reload_conf();`
Not in `shared_preload_libraries`	Add `pg_trickle` to `shared_preload_libraries` in `postgresql.conf` and restart PostgreSQL
`max_worker_processes` exhausted	Increase `max_worker_processes` and restart. The launcher retries every 5 minutes — check PostgreSQL logs for `WARNING: pg_trickle launcher: could not spawn scheduler`
Scheduler crashed	Check PostgreSQL logs for crash details. The launcher will auto-restart it. If recurring, check for OOM or resource limits

2. Stream Table Stuck in SUSPENDED Status

Symptoms:

Stream table status shows SUSPENDED
consecutive_errors is at or above pg_trickle.max_consecutive_errors
No refreshes happening for this stream table

Diagnosis:

-- Check the stream table status
SELECT pgt_name, status, consecutive_errors, last_error_message
FROM pgtrickle.pg_stat_stream_tables
WHERE pgt_name = 'my_stream_table';

-- Get detailed error history
SELECT * FROM pgtrickle.diagnose_errors('my_stream_table');

Resolution:

Fix the underlying error (check last_error_message and diagnose_errors)
Resume the stream table:

SELECT pgtrickle.alter_stream_table('my_stream_table', enabled => true);

Trigger a manual refresh to verify:

SELECT pgtrickle.refresh_stream_table('my_stream_table');

Prevention: Increase pg_trickle.max_consecutive_errors (default 3) if transient errors are common in your environment:

ALTER SYSTEM SET pg_trickle.max_consecutive_errors = 5;
SELECT pg_reload_conf();

3. CDC Triggers Missing or Disabled

Symptoms:

Stream table refreshes succeed but shows no changes
change_buffer_sizes() shows pending_rows = 0 despite active DML
Source tables have no pg_trickle triggers

Diagnosis:

-- Check trigger inventory
SELECT source_table, trigger_type, trigger_name, present, enabled
FROM pgtrickle.trigger_inventory()
WHERE NOT present OR NOT enabled;

-- Manual check on a specific source table
SELECT tgname, tgenabled
FROM pg_trigger
WHERE tgrelid = 'public.orders'::regclass
  AND tgname LIKE 'pgt_%';

Resolution:

Cause	Fix
Triggers dropped by DDL (e.g., `pg_dump` + restore without triggers)	Drop and recreate the stream table, or reinitialize: `SELECT pgtrickle.refresh_stream_table('my_st');`
Triggers disabled (`ALTER TABLE ... DISABLE TRIGGER`)	`ALTER TABLE source_table ENABLE TRIGGER ALL;`
Source gating active	Check `SELECT * FROM pgtrickle.source_gates();` and ungate: `SELECT pgtrickle.ungate_source('source_table');`
WAL mode active but slot missing	See WAL Replication Slot Lag or Missing

4. WAL Replication Slot Lag or Missing

Symptoms:

check_cdc_health() shows slot_lag_exceeds_threshold or replication_slot_missing
WAL disk usage growing unexpectedly
Stream tables not receiving changes in WAL mode

Diagnosis:

-- Check CDC health
SELECT * FROM pgtrickle.check_cdc_health();

-- Check replication slots directly
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name LIKE 'pgt_%';

Resolution:

Cause	Fix
Slot dropped externally	pg_trickle will auto-fallback to trigger-based CDC. To recreate: drop and recreate the stream table
Slot lagging (WAL accumulation)	Check for long-running transactions: `SELECT pid, age(backend_xmin) FROM pg_stat_replication;`. Kill idle-in-transaction sessions
`wal_level != logical`	WAL CDC requires `wal_level = logical`. Set it and restart PostgreSQL
`max_replication_slots` exhausted	Increase `max_replication_slots` and restart

Fallback: Force trigger-based CDC mode if WAL mode is problematic:

ALTER SYSTEM SET pg_trickle.cdc_mode = 'trigger';
SELECT pg_reload_conf();

5. Stream Table Stuck in INITIALIZING

Symptoms:

Stream table status is INITIALIZING for an extended period
The initial full refresh may have failed or is still running

Diagnosis:

-- Check refresh history
SELECT * FROM pgtrickle.get_refresh_history('my_st', 5);

-- Check for active refresh
SELECT pid, state, query, now() - query_start AS running_for
FROM pg_stat_activity
WHERE query LIKE '%pgtrickle%' AND state = 'active';

Resolution:

Cause	Fix
Initial refresh failed (check error in history)	Fix the error, then: `SELECT pgtrickle.refresh_stream_table('my_st');`
Defining query is very slow	Optimize the query, add indexes on source tables, or increase `work_mem`
Lock contention during initial refresh	See Lock Contention

6. Change Buffers Growing Without Refresh

Symptoms:

change_buffer_sizes() shows large pending_rows and growing buffer_bytes
Stream tables are stale
Refreshes are not running or are failing

Diagnosis:

-- Check buffer sizes
SELECT stream_table, source_table, pending_rows, buffer_bytes
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

-- Check if refreshes are happening
SELECT * FROM pgtrickle.refresh_timeline(10);

-- Check for blocked refresh processes
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE query LIKE '%pgtrickle%';

Resolution:

Cause	Fix
Scheduler not running	See Scheduler Not Running
All refreshes failing	Check `diagnose_errors()` for each affected stream table
Lock contention	See Lock Contention
Very large buffer causing slow MERGE	Consider lowering `pg_trickle.differential_change_ratio_threshold` to trigger FULL refresh for large batches

Emergency: If buffers are dangerously large and you need immediate relief:

-- Force a full refresh (bypasses change buffers)
SELECT pgtrickle.refresh_stream_table('my_st', force_full => true);

7. Lock Contention Blocking Refresh

Symptoms:

Refresh duration is much longer than usual
pg_stat_activity shows refresh processes in Lock wait state
Long-running transactions on source or stream tables

Diagnosis:

-- Find blocking locks
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks gl ON gl.locktype = bl.locktype
    AND gl.database IS NOT DISTINCT FROM bl.database
    AND gl.relation IS NOT DISTINCT FROM bl.relation
    AND gl.page IS NOT DISTINCT FROM bl.page
    AND gl.tuple IS NOT DISTINCT FROM bl.tuple
    AND gl.pid != bl.pid
    AND gl.granted
JOIN pg_stat_activity blocking ON blocking.pid = gl.pid
WHERE blocked.query LIKE '%pgtrickle%';

Resolution:

Identify and terminate the blocking session if appropriate:
```
SELECT pg_terminate_backend(<blocking_pid>);
```
Investigate why the blocking transaction is long-running (idle in transaction, slow query, etc.)
Consider adding statement_timeout or idle_in_transaction_session_timeout to prevent future occurrences

8. Out-of-Memory During Refresh

Symptoms:

Refresh processes killed by OS OOM killer
PostgreSQL logs show out of memory errors
Stream tables fail with system-category errors

Diagnosis:

# Check OS OOM killer logs
dmesg | grep -i "oom\|killed process" | tail -20

# Check PostgreSQL logs for memory errors
grep -i "out of memory\|oom" /var/log/postgresql/postgresql-*.log | tail -10

-- Check which stream tables have large source data
SELECT stream_table, source_table, pending_rows
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

Resolution:

Cause	Fix
Large FULL refresh on big table	Reduce `work_mem` or `maintenance_work_mem` to limit per-query memory
Large change buffer accumulation	Refresh more frequently (shorter schedule) to keep buffers small
Complex query with many joins	Simplify the defining query or break into cascading stream tables
Parallel refresh amplifies memory	Reduce `pg_trickle.max_concurrent_refreshes`

Tuning:

-- Limit per-refresh memory
SET work_mem = '64MB';

-- Limit concurrent refreshes to reduce peak memory
ALTER SYSTEM SET pg_trickle.max_concurrent_refreshes = 2;
SELECT pg_reload_conf();

9. Disk Full / WAL Retention Exceeded

Symptoms:

PostgreSQL logs No space left on device errors
WAL directory consuming excessive disk
Replication slots preventing WAL cleanup

Diagnosis:

# Check disk usage
df -h /var/lib/postgresql/data
du -sh /var/lib/postgresql/data/pg_wal/

-- Check replication slot WAL retention
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

-- Check change buffer table sizes
SELECT stream_table, source_table,
       pg_size_pretty(buffer_bytes::bigint) AS buffer_size
FROM pgtrickle.change_buffer_sizes()
ORDER BY buffer_bytes DESC;

Resolution:

Cause	Fix
Inactive replication slot holding WAL	Drop the slot: `SELECT pg_drop_replication_slot('pgt_...');`
Change buffer tables too large	Force full refresh to clear buffers, or refresh more frequently
WAL accumulation from long transactions	Terminate idle-in-transaction sessions
`max_wal_size` too low	Increase `max_wal_size` in `postgresql.conf`

Emergency cleanup:

-- Drop inactive pg_trickle replication slots
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name LIKE 'pgt_%' AND NOT active;

10. Circular Pipeline Convergence Failure

Symptoms:

Stream tables in a circular dependency hit the maximum iteration limit
Refresh history shows repeated cycles without convergence
Error messages mention fixed_point_max_iterations

Diagnosis:

-- Check for circular dependencies
SELECT * FROM pgtrickle.dependency_tree();

-- Check refresh history for iteration patterns
SELECT start_time, stream_table, action, status, error_message
FROM pgtrickle.refresh_timeline(50)
WHERE stream_table IN ('st_a', 'st_b')  -- suspected cycle members
ORDER BY start_time DESC;

Resolution:

Verify the cycle is intentional (see Circular Dependencies tutorial)

Increase the iteration limit if convergence is slow:

ALTER SYSTEM SET pg_trickle.fixed_point_max_iterations = 20;
SELECT pg_reload_conf();

If the cycle never converges, the defining queries may not be monotone. Restructure to eliminate the cycle or ensure monotonicity

11. Schema Change Broke Stream Table

Symptoms:

Stream table has needs_reinit = true
Reinitialization keeps failing
Error messages reference dropped or renamed columns

Diagnosis:

-- Check for pending reinit
SELECT pgt_name, needs_reinit, status, last_error_message
FROM pgtrickle.pg_stat_stream_tables
WHERE needs_reinit;

-- Get error details
SELECT * FROM pgtrickle.diagnose_errors('my_st');

Resolution:

If the defining query is still valid after the DDL change, force a reinit:

SELECT pgtrickle.refresh_stream_table('my_st');

If the defining query needs to be updated:

-- Option 1: Alter the defining query
SELECT pgtrickle.alter_stream_table('my_st',
    query => 'SELECT new_column, SUM(amount) FROM orders GROUP BY new_column'
);

-- Option 2: Drop and recreate
SELECT pgtrickle.drop_stream_table('my_st');
SELECT pgtrickle.create_stream_table(
    'my_st',
    'SELECT new_column, SUM(amount) FROM orders GROUP BY new_column',
    '1m'
);

12. Worker Pool Exhaustion

Symptoms:

Refresh latency increases across the board
Some stream tables refresh while others queue indefinitely
worker_pool_status() shows all workers busy

Diagnosis:

-- Check worker pool
SELECT * FROM pgtrickle.worker_pool_status();

-- Check for long-running parallel jobs
SELECT job_id, unit_key, status, duration_ms
FROM pgtrickle.parallel_job_status(300)
WHERE status = 'RUNNING'
ORDER BY duration_ms DESC;

Resolution:

Cause	Fix
Too few workers for workload	Increase `pg_trickle.max_concurrent_refreshes` and/or `max_worker_processes`
One stream table monopolizing workers	Check if a single slow refresh is blocking the pool. Consider splitting into smaller stream tables
Global worker cap reached	Increase `pg_trickle.max_dynamic_refresh_workers`

13. Fuse Tripped (Circuit Breaker)

Symptoms:

Stream table shows fuse_state = 'BLOWN' or refresh is paused
fuse_status() reports a tripped fuse
No refreshes happening despite active scheduler

Diagnosis:

-- Check fuse status
SELECT * FROM pgtrickle.fuse_status();

Resolution:

Reset the fuse after investigating the root cause:

SELECT pgtrickle.reset_fuse('my_stream_table');

See the Fuse Circuit Breaker tutorial for details on fuse thresholds and configuration.

General Diagnostic Workflow

When investigating any issue, follow this sequence:

1. health_check()          → identify which subsystem is unhealthy
2. pgt_status()            → find specific affected stream tables
3. diagnose_errors('name') → get root cause for failures
4. refresh_timeline(20)    → correlate with recent refresh events
5. change_buffer_sizes()   → check CDC pipeline health
6. trigger_inventory()     → verify change capture is working
7. dependency_tree()       → confirm DAG wiring
8. PostgreSQL logs         → low-level crash/resource details

GUC Quick Reference for Troubleshooting

GUC	Default	What to check
`pg_trickle.enabled`	`on`	Must be `on` for scheduler to run
`pg_trickle.max_consecutive_errors`	`3`	Stream tables suspend after this many failures
`pg_trickle.scheduler_interval_ms`	`100`	Very high values cause refresh lag
`pg_trickle.event_driven_wake`	`on`	`off` = poll-only, higher latency
`pg_trickle.cdc_mode`	`auto`	`trigger` for reliable fallback
`pg_trickle.max_concurrent_refreshes`	`4`	Per-database parallel refresh cap
`pg_trickle.fixed_point_max_iterations`	`10`	Circular pipeline iteration limit
`pg_trickle.differential_change_ratio_threshold`	`0.5`	Falls back to FULL above this ratio
`pg_trickle.auto_backoff`	`on`	Stretches intervals up to 8x under load

pg_trickle Documentation

Troubleshooting & Failure Mode Runbook

Table of Contents

Diagnostic Toolkit

Failure Scenarios

1. Scheduler Not Running

2. Stream Table Stuck in SUSPENDED Status

3. CDC Triggers Missing or Disabled

4. WAL Replication Slot Lag or Missing

5. Stream Table Stuck in INITIALIZING

6. Change Buffers Growing Without Refresh

7. Lock Contention Blocking Refresh

8. Out-of-Memory During Refresh

9. Disk Full / WAL Retention Exceeded

10. Circular Pipeline Convergence Failure

11. Schema Change Broke Stream Table

12. Worker Pool Exhaustion

13. Fuse Tripped (Circuit Breaker)

General Diagnostic Workflow

GUC Quick Reference for Troubleshooting