Tutorial: Monitoring & Alerting

This guide consolidates all pg_trickle monitoring capabilities into a single reference: built-in SQL views, NOTIFY-based alerts, and the Prometheus/Grafana observability stack.

Quick Health Check

The fastest way to verify pg_trickle is healthy:

SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK';

If this returns no rows, everything is working. Any WARN or ERROR rows tell you where to investigate.

Built-in Monitoring Views

Stream table status

-- Overview: name, status, mode, staleness
SELECT name, status, refresh_mode, staleness, stale
FROM pgtrickle.stream_tables_info;

-- Detailed stats: refresh counts, duration, error streaks
SELECT pgt_name, total_refreshes, avg_duration_ms, consecutive_errors, stale
FROM pgtrickle.pg_stat_stream_tables;

-- Live status with error counts
SELECT * FROM pgtrickle.pgt_status();

Refresh history

-- Last 10 refreshes for a specific stream table
SELECT start_time, action, status, duration_ms, rows_inserted, rows_deleted, error_message
FROM pgtrickle.get_refresh_history('order_totals', 10);

-- Global refresh timeline (last 20 events across all stream tables)
SELECT start_time, stream_table, action, status, duration_ms, error_message
FROM pgtrickle.refresh_timeline(20);

-- Aggregate refresh statistics
SELECT * FROM pgtrickle.st_refresh_stats();

CDC pipeline health

-- Per-source CDC mode, WAL lag, and alerts
SELECT * FROM pgtrickle.check_cdc_health();

-- Change buffer sizes (pending changes not yet consumed)
SELECT stream_table, source_table, cdc_mode, pending_rows, buffer_bytes
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

-- Verify all CDC triggers are installed and enabled
SELECT source_table, trigger_type, trigger_name
FROM pgtrickle.trigger_inventory()
WHERE NOT present OR NOT enabled;

Dependencies

-- ASCII tree view of the entire dependency graph
SELECT tree_line, status, refresh_mode
FROM pgtrickle.dependency_tree();

-- Diamond consistency groups
SELECT * FROM pgtrickle.diamond_groups();

Fuse circuit breaker

-- Check fuse state for all stream tables
SELECT name, fuse_mode, fuse_state, fuse_ceiling, blown_at
FROM pgtrickle.fuse_status();

Parallel workers

-- Worker pool status (when parallel_refresh_mode = 'on')
SELECT * FROM pgtrickle.worker_pool_status();

-- Recent parallel job history
SELECT job_id, unit_key, status, duration_ms
FROM pgtrickle.parallel_job_status(60);

NOTIFY-Based Alerting

pg_trickle emits real-time events via PostgreSQL's NOTIFY system:

LISTEN pg_trickle_alert;

Event Types

| Event | Trigger | Severity |
|---|---|---|
| stale_data | Scheduler is behind; view is genuinely out of date | Warning |
| no_upstream_changes | Scheduler is healthy but source tables have had no writes; view is correct | Info |
| auto_suspended | Stream table suspended after max consecutive errors | Critical |
| resumed | Stream table resumed after suspension | Info |
| reinitialize_needed | Upstream DDL change detected | Warning |
| buffer_growth_warning | Change buffer growing unexpectedly | Warning |
| slot_lag_warning | WAL replication slot retaining excessive data | Warning |
| fuse_blown | Circuit breaker tripped | Warning |
| refresh_completed | Refresh completed successfully | Info |
| refresh_failed | Refresh failed | Error |
| diamond_partial_failure | One member of an atomic diamond group failed | Warning |
| scheduler_falling_behind | Refresh duration approaching the schedule interval | Warning |
| spill_threshold_exceeded | Delta MERGE spilled to temp files for consecutive refreshes, forcing FULL | Warning |
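For programmatic routing, the severity column above can be encoded as a plain lookup. A minimal sketch; the dictionary mirrors the table, and defaulting unknown events to "info" is an assumption, not documented behavior:

```python
# Severity lookup derived from the event table above.
EVENT_SEVERITY = {
    "stale_data": "warning",
    "no_upstream_changes": "info",
    "auto_suspended": "critical",
    "resumed": "info",
    "reinitialize_needed": "warning",
    "buffer_growth_warning": "warning",
    "slot_lag_warning": "warning",
    "fuse_blown": "warning",
    "refresh_completed": "info",
    "refresh_failed": "error",
    "diamond_partial_failure": "warning",
    "scheduler_falling_behind": "warning",
    "spill_threshold_exceeded": "warning",
}

def severity_of(event: str) -> str:
    """Return the severity for a pg_trickle event name (default: 'info')."""
    return EVENT_SEVERITY.get(event, "info")
```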

Notification Payload

Each notification carries a JSON payload:

{
  "event": "auto_suspended",
  "stream_table": "order_totals",
  "consecutive_errors": 3,
  "last_error": "column \"deleted_column\" does not exist",
  "timestamp": "2026-03-31T14:22:01.123Z"
}

Bridging to External Systems

To forward NOTIFY events to external alerting systems (PagerDuty, Slack, OpsGenie), use a listener process:

# Example: Python listener using psycopg (psycopg 3)
import json

import psycopg

conn = psycopg.connect("postgresql://user:pass@host/db", autocommit=True)
conn.execute("LISTEN pg_trickle_alert")

# notifies() blocks and yields notifications as they arrive.
for notify in conn.notifies():
    payload = json.loads(notify.payload)
    event = payload["event"]
    # no_upstream_changes is informational: source tables are quiet but healthy.
    # stale_data means the scheduler itself is falling behind, so it pages too.
    # Only page on actionable events.
    if event in ("auto_suspended", "fuse_blown", "refresh_failed", "stale_data"):
        send_to_pagerduty(payload)  # your alerting integration here
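The paging decision in the loop above can be factored into a pure helper, which keeps it testable without a database connection. A sketch; the event set mirrors the example, and the helper name is illustrative:

```python
import json

# Events worth paging on: failures plus genuine staleness.
ACTIONABLE_EVENTS = {"auto_suspended", "fuse_blown", "refresh_failed", "stale_data"}

def should_page(raw_payload: str) -> bool:
    """Return True when a pg_trickle_alert payload warrants a page."""
    event = json.loads(raw_payload).get("event")
    return event in ACTIONABLE_EVENTS
```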

Prometheus & Grafana Stack

For production deployments, use the pre-built observability stack in the monitoring/ directory:

cd monitoring/
docker compose up -d

This gives you:

  • Prometheus scraping pg_trickle metrics via postgres_exporter
  • Grafana with a pre-provisioned dashboard
  • Alerting rules for staleness, errors, CDC lag, and scheduler health

See Prometheus & Grafana Integration for full setup details.

Diagnostic Workflow

When something is wrong, follow this systematic workflow:

Step 1 — Global health

SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK';

Step 2 — Status and staleness

SELECT name, status, consecutive_errors, staleness
FROM pgtrickle.pgt_status()
ORDER BY staleness DESC NULLS FIRST;

Step 3 — Recent refresh activity

SELECT start_time, stream_table, action, status, error_message
FROM pgtrickle.refresh_timeline(20);

Step 4 — Error details for a specific stream table

SELECT * FROM pgtrickle.diagnose_errors('my_stream_table');

Step 5 — CDC pipeline

SELECT stream_table, source_table, pending_rows, buffer_bytes
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

Step 6 — Trigger verification

SELECT source_table, trigger_type, trigger_name
FROM pgtrickle.trigger_inventory()
WHERE NOT present OR NOT enabled;

Common Alert Responses

| Alert | Likely Cause | Action |
|---|---|---|
| stale_data | Scheduler behind, long refresh, or lock contention | Check pgt_status() and refresh_timeline() |
| auto_suspended | Repeated refresh failures | Fix root cause, then resume_stream_table() |
| fuse_blown | Bulk load exceeded fuse ceiling | Investigate, then reset_fuse() |
| buffer_growth_warning | Scheduler not consuming buffers fast enough | Check scheduler status and refresh errors |
| reinitialize_needed | Source table DDL changed | Verify schema compatibility; scheduler handles reinitialization automatically |

Further Reading