Backup and Disaster Recovery

pg_ripple stores all data in standard PostgreSQL tables within the _pg_ripple schema. This means every PostgreSQL backup tool works out of the box — VP tables, the dictionary, the predicates catalog, SHACL constraints, Datalog rules, and inferred triples are all captured by pg_dump, WAL archiving, and streaming replication.

No special export needed

Unlike triple stores that require a separate RDF dump/reload cycle, pg_ripple data is just PostgreSQL data. Your existing backup infrastructure already covers it.


What Gets Backed Up

| Object | Schema | Captured by pg_dump? | Notes |
| --- | --- | --- | --- |
| Dictionary table | _pg_ripple.dictionary | Yes | All IRI, blank node, and literal mappings |
| Predicates catalog | _pg_ripple.predicates | Yes | Predicate → VP table OID mapping |
| VP tables (main + delta + tombstones) | _pg_ripple.vp_{id}_* | Yes | One table set per predicate |
| Rare predicates table | _pg_ripple.vp_rare | Yes | Consolidated low-cardinality predicates |
| SHACL constraints | _pg_ripple.shacl_* | Yes | Shape definitions and validation state |
| Datalog rules | _pg_ripple.rules | Yes | Rule text and compiled plans |
| Inferred triples | VP tables, source = 1 | Yes | Materialized inference results |
| Extension metadata | pg_catalog | Yes | Extension name and version; the control file and shared library live on disk and must already be installed on the target server |
| Shared memory state | In-memory only | No | Dictionary LRU cache, merge worker counters |

Shared memory state

The dictionary LRU cache and merge worker counters live in shared memory and are not persisted to disk. They are rebuilt automatically on PostgreSQL restart. This is by design — the cache warms up quickly from normal query traffic.


Logical Backup with pg_dump

Full Database Dump

# Custom format (recommended — compressed, parallel-restore capable)
pg_dump -Fc -f pg_ripple_backup.dump mydb

# Plain SQL (human-readable, useful for auditing)
pg_dump -Fp -f pg_ripple_backup.sql mydb

Extension-Only Dump

To back up only pg_ripple data without the rest of the database:

pg_dump -Fc \
  --schema=_pg_ripple \
  --schema=pg_ripple \
  -f pg_ripple_only.dump mydb

Include both schemas

Always include both _pg_ripple (internal storage) and pg_ripple (public API functions). Restoring one without the other leaves the extension in an inconsistent state.

Parallel Dump for Large Datasets

For databases with millions of triples, use parallel workers:

# Directory format required for parallel dump
pg_dump -Fd -j 4 -f pg_ripple_backup_dir/ mydb

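With -j 4, up to four tables are dumped concurrently, so the dictionary table and the large VP tables are written in parallel rather than one after another. Each individual table is still handled by a single worker, so the speedup is bounded by the largest table.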


Restoring from Backup

Full Restore to a New Database

# Create the target database
createdb mydb_restored

# Restore (custom format)
pg_restore -d mydb_restored -Fc pg_ripple_backup.dump

# Restore (directory format, parallel)
pg_restore -d mydb_restored -Fd -j 4 pg_ripple_backup_dir/

Restore from Plain SQL

psql -d mydb_restored -f pg_ripple_backup.sql

Post-Restore Verification

After restoring, verify the extension is intact:

-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

-- Verify triple count
SELECT pg_ripple.stats();

-- Run the health check
SELECT pg_ripple.canary();

-- Spot-check a SPARQL query
SELECT pg_ripple.sparql($$
  SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }
$$);
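The checks above can be wrapped in a small script so that restore drills fail loudly. The following is a minimal sketch, not part of pg_ripple: the function name verify_restore and the PSQL override variable are illustrative assumptions, and it only confirms that each query runs without error and that the extension is registered.

```shell
# Command used to reach the database; overridable for testing (assumption).
PSQL="${PSQL:-psql}"

# verify_restore DBNAME
# Runs the post-restore checks in order and stops at the first failure.
verify_restore() {
  db="$1"

  # An empty result means the extension is not registered in this database.
  version="$("$PSQL" -X -At -d "$db" -c \
    "SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple'")" || return 1
  if [ -z "$version" ]; then
    echo "pg_ripple is not installed in $db" >&2
    return 1
  fi
  echo "pg_ripple version: $version"

  for query in \
    "SELECT pg_ripple.stats()" \
    "SELECT pg_ripple.canary()" \
    "SELECT pg_ripple.sparql('SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }')"
  do
    # -X skips ~/.psqlrc, -At prints bare values; psql exits non-zero on error.
    "$PSQL" -X -At -d "$db" -c "$query" || {
      echo "verification failed: $query" >&2
      return 1
    }
  done
  echo "verification passed for $db"
}

# Usage: verify_restore mydb_restored
```

The script prints the raw output of each check; comparing the triple count against an expected value is left to the caller, since the output format of pg_ripple.stats() is not assumed here.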

Do VP tables survive dump/restore?

Yes. VP tables are standard PostgreSQL heap tables with B-tree or BRIN indexes. pg_dump captures them exactly like any other table. The HTAP delta/main/tombstone split, indexes, and the merge worker view definitions are all preserved. After restore, the merge worker resumes normal operation once shared_preload_libraries includes pg_ripple.


WAL-Based Continuous Archiving

For point-in-time recovery (PITR), configure WAL archiving:

Enable WAL Archiving

In postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
max_wal_senders = 3
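A plain cp will silently overwrite a segment that already exists in the archive, and the PostgreSQL documentation recommends that archive commands refuse to do so. The sketch below shows one way to wrap the copy. The function name archive_wal and the script path in the usage comment are hypothetical.

```shell
# archive_wal SRC DEST_DIR NAME
# Copies one WAL segment into DEST_DIR without overwriting an existing file.
# Returns non-zero on any failure so that PostgreSQL retries the segment later.
archive_wal() {
  src="$1"
  dest_dir="$2"
  name="$3"

  if [ -e "$dest_dir/$name" ]; then
    echo "refusing to overwrite $dest_dir/$name" >&2
    return 1
  fi

  # Copy under a temporary name, then rename, so a partial copy is never
  # visible under the final segment name.
  cp "$src" "$dest_dir/$name.tmp" || return 1
  mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}

# Usage, if saved as an executable script that ends with: archive_wal "$@"
#   archive_command = '/usr/local/bin/archive_wal.sh %p /backup/wal_archive %f'
```

For production use, dedicated tools such as pgBackRest or Barman handle these cases (and compression, retention, and verification) more thoroughly than a hand-written script.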

Take a Base Backup

pg_basebackup -D /backup/base -Ft -z -P

Point-in-Time Recovery

Restore the base backup into the data directory, create an empty recovery.signal file there, and configure the restore target:

# postgresql.conf (or postgresql.auto.conf)
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-19 14:30:00'

Start PostgreSQL — it will replay WAL up to the specified time.
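The recovery setup can be scripted so that it is repeatable during drills. This is a minimal sketch under stated assumptions: the server is stopped, the base backup has already been extracted into the data directory, and prepare_pitr is a hypothetical helper name. It only writes the two settings shown above.

```shell
# prepare_pitr DATA_DIR ARCHIVE_DIR TARGET_TIME
# Marks a restored data directory for point-in-time recovery.
prepare_pitr() {
  data_dir="$1"
  archive_dir="$2"
  target_time="$3"

  if [ ! -d "$data_dir" ]; then
    echo "no such data directory: $data_dir" >&2
    return 1
  fi

  # An empty recovery.signal file makes PostgreSQL 12+ start in recovery mode.
  : > "$data_dir/recovery.signal" || return 1

  # Append the restore settings; %f and %p are expanded by PostgreSQL, not here.
  cat >> "$data_dir/postgresql.auto.conf" <<EOF
restore_command = 'cp $archive_dir/%f %p'
recovery_target_time = '$target_time'
EOF
}

# Usage:
#   prepare_pitr /var/lib/postgresql/18/main /backup/wal_archive '2026-04-19 14:30:00'
```

Run it as the operating system user that owns the data directory so that file ownership stays consistent.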

HTAP merge and PITR

If you recover to a point mid-merge, the merge worker will detect the incomplete state and re-run the merge on startup. No manual intervention is needed, but the first merge cycle after recovery may take longer than usual.


Streaming Replication

pg_ripple works transparently with PostgreSQL streaming replication:

# On the replica
pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -P

The -R flag creates standby.signal in the target directory and writes the connection settings (primary_conninfo) to postgresql.auto.conf. All VP tables, dictionary data, and HTAP state replicate via WAL.

Merge worker on replicas

The background merge worker does not run on read replicas. Replicas receive merged state via WAL replay from the primary. This is correct behavior — replicas should never write.
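To confirm which role a server is playing and how far replicas are behind, the standard PostgreSQL functions and views are sufficient; nothing pg_ripple-specific is needed. The helper names server_role and replay_lag_bytes below are illustrative assumptions, as is the PSQL override variable.

```shell
# Command used to reach the database; overridable for testing (assumption).
PSQL="${PSQL:-psql}"

# server_role CONNINFO
# Prints "replica" if the server is in recovery, otherwise "primary".
server_role() {
  in_recovery="$("$PSQL" -X -At -d "$1" -c "SELECT pg_is_in_recovery()")" || return 1
  if [ "$in_recovery" = "t" ]; then
    echo replica
  else
    echo primary
  fi
}

# replay_lag_bytes CONNINFO
# Run against the primary: prints one line per standby with its name and the
# number of WAL bytes it has not yet replayed.
replay_lag_bytes() {
  "$PSQL" -X -At -F ' ' -d "$1" -c \
    "SELECT application_name,
            pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       FROM pg_stat_replication"
}

# Usage:
#   server_role "host=primary-host dbname=mydb"
#   replay_lag_bytes "host=primary-host dbname=mydb"
```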


Backup Strategy Recommendations

Small Datasets (< 1M triples)

| Component | Recommendation |
| --- | --- |
| Method | pg_dump -Fc nightly |
| Retention | 7 daily + 4 weekly |
| RPO | 24 hours |
| RTO | Minutes |
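A nightly job for this tier might look like the sketch below. The directory, database name, file naming scheme, and function names are all illustrative assumptions, and only the daily retention rule is shown; weekly copies would need their own rule.

```shell
# Hypothetical defaults; adjust for your environment.
BACKUP_DIR="${BACKUP_DIR:-/backup/pg_ripple}"
DB_NAME="${DB_NAME:-mydb}"
PG_DUMP="${PG_DUMP:-pg_dump}"

# backup_path STAMP
# Builds a dated file name such as /backup/pg_ripple/mydb_2026-04-19.dump.
backup_path() {
  printf '%s/%s_%s.dump\n' "$BACKUP_DIR" "$DB_NAME" "$1"
}

# Takes one custom-format dump. The dump is written under a temporary name
# first so that a failed run never leaves a truncated file under the final name.
nightly_dump() {
  stamp="$(date +%Y-%m-%d)"
  target="$(backup_path "$stamp")"
  mkdir -p "$BACKUP_DIR" || return 1
  "$PG_DUMP" -Fc -f "$target.partial" "$DB_NAME" || {
    rm -f "$target.partial"
    return 1
  }
  mv "$target.partial" "$target"
}

# Deletes daily dumps last modified more than 7 days ago.
prune_daily() {
  find "$BACKUP_DIR" -type f -name "${DB_NAME}_*.dump" -mtime +7 -exec rm -f {} +
}

# Usage (for example from cron): nightly_dump && prune_daily
```

Pruning only after a successful dump means a run of failed backups never deletes the last good copy.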

Medium Datasets (1M – 100M triples)

| Component | Recommendation |
| --- | --- |
| Method | WAL archiving + daily base backup |
| Retention | 7 daily base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Minutes to hours |

Large Datasets (> 100M triples)

| Component | Recommendation |
| --- | --- |
| Method | WAL archiving + pgBackRest or Barman |
| Retention | Incremental base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Proportional to dataset size |

Test your restores

Schedule monthly restore drills. A backup that has never been tested is not a backup. Automate the verification queries shown above as part of the drill.


Disaster Recovery Checklist

  1. Before disaster: WAL archiving enabled, base backups on schedule, replication lag monitored
  2. During incident: identify the failure scope (single table, full database, or host loss)
  3. Recovery steps:
    • Host loss → promote replica or restore from base backup + WAL
    • Corruption → PITR to last known good time
    • Accidental deletion → PITR to just before the DROP/DELETE
  4. Post-recovery:
    • Run SELECT pg_ripple.canary() to verify health
    • Check pg_ripple.stats() for expected triple counts
    • Verify the merge worker is running (merge_worker_pid > 0)
    • Run representative SPARQL queries to confirm data integrity
    • Resume WAL archiving and replication

Common Pitfalls

Don't forget shared_preload_libraries

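After restoring to a fresh PostgreSQL instance, ensure shared_preload_libraries includes pg_ripple (for example, shared_preload_libraries = 'pg_ripple') in postgresql.conf before starting the server. If other libraries are already listed, append pg_ripple to the comma-separated list rather than replacing them. Without it, the merge worker will not start, the dictionary cache will be unavailable, and queries will fall back to uncached dictionary lookups.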
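This setting is easy to check automatically after a restore. The sketch below tests whether a shared_preload_libraries value lists pg_ripple as a whole entry; the function name has_pg_ripple_preloaded is a hypothetical helper, not something shipped with the extension.

```shell
# has_pg_ripple_preloaded VALUE
# Succeeds if the comma-separated shared_preload_libraries VALUE lists pg_ripple.
has_pg_ripple_preloaded() {
  # Strip spaces and quotes, then match whole list entries only, so that a
  # name such as pg_ripple_extra does not count.
  list="$(printf '%s' "$1" | tr -d " '\"")"
  case ",$list," in
    *,pg_ripple,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a running server:
#   has_pg_ripple_preloaded "$(psql -X -Atc 'SHOW shared_preload_libraries')" \
#     || echo "pg_ripple is not preloaded" >&2
```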

  • Schema ownership: the restoring user must be a superuser or own both _pg_ripple and pg_ripple schemas
  • Sequence values: pg_dump captures sequence state — statement IDs (i column) will continue from the correct value after restore
  • Tablespace placement: if you used custom tablespaces for VP tables, ensure they exist on the target server before restoring