Backup and Disaster Recovery
pg_ripple stores all data in standard PostgreSQL tables within the _pg_ripple schema. This means every PostgreSQL backup tool works out of the box — VP tables, the dictionary, the predicates catalog, SHACL constraints, Datalog rules, and inferred triples are all captured by pg_dump, WAL archiving, and streaming replication.
Unlike triple stores that require a separate RDF dump/reload cycle, pg_ripple data is just PostgreSQL data. Your existing backup infrastructure already covers it.
What Gets Backed Up
| Object | Location | Captured by pg_dump? | Notes |
|---|---|---|---|
| Dictionary table | _pg_ripple.dictionary | Yes | All IRI, blank node, and literal mappings |
| Predicates catalog | _pg_ripple.predicates | Yes | Predicate → VP table OID mapping |
| VP tables (main + delta + tombstones) | _pg_ripple.vp_{id}_* | Yes | One table set per predicate |
| Rare predicates table | _pg_ripple.vp_rare | Yes | Consolidated low-cardinality predicates |
| SHACL constraints | _pg_ripple.shacl_* | Yes | Shape definitions and validation state |
| Datalog rules | _pg_ripple.rules | Yes | Rule text and compiled plans |
| Inferred triples | VP tables, source = 1 | Yes | Materialized inference results |
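| Extension metadata | pg_catalog | Yes | Recorded as a CREATE EXTENSION statement; the control file and shared library are files on disk and must already be installed on the target server |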
| Shared memory state | In-memory only | No | Dictionary LRU cache, merge worker counters |
Logical Backup with pg_dump
Full Database Dump
# Custom format (recommended — compressed, parallel-restore capable)
pg_dump -Fc -f pg_ripple_backup.dump mydb
# Plain SQL (human-readable, useful for auditing)
pg_dump -Fp -f pg_ripple_backup.sql mydb
Extension-Only Dump
To back up only pg_ripple data without the rest of the database:
pg_dump -Fc \
--schema=_pg_ripple \
--schema=pg_ripple \
-f pg_ripple_only.dump mydb
Always include both _pg_ripple (internal storage) and pg_ripple (public API functions). Restoring one without the other leaves the extension in an inconsistent state.
Parallel Dump for Large Datasets
For databases with millions of triples, use parallel workers:
# Directory format required for parallel dump
pg_dump -Fd -j 4 -f pg_ripple_backup_dir/ mydb
The dictionary table and large VP tables will be dumped in parallel, significantly reducing backup time.
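The dump commands above can be wrapped in a small scheduled job. The following is a minimal sketch, not a definitive implementation: the function names, the file-name pattern, and the 7-day default retention are all assumptions. It writes to a temporary name first so that a failed run does not leave a truncated file that looks like a valid backup.

```shell
# Sketch: nightly custom-format dump with age-based pruning.
# dump_path, nightly_dump, the naming pattern, and the 7-day default
# are illustrative assumptions, not part of pg_ripple.

# Build a dated dump path such as /backup/mydb_2026-04-19.dump
dump_path() {
  local backup_dir="$1" db="$2" day="$3"
  printf '%s/%s_%s.dump\n' "$backup_dir" "$db" "$day"
}

# nightly_dump BACKUP_DIR DBNAME [KEEP_DAYS]
nightly_dump() {
  local backup_dir="$1" db="$2" keep_days="${3:-7}" target
  target=$(dump_path "$backup_dir" "$db" "$(date +%F)")
  # Dump under a temporary name; remove it if pg_dump fails.
  if ! pg_dump -Fc -f "$target.partial" "$db"; then
    rm -f "$target.partial"
    return 1
  fi
  mv "$target.partial" "$target"
  # Delete dumps of this database last modified more than keep_days days ago.
  find "$backup_dir" -type f -name "${db}_*.dump" \
    -mtime +"$keep_days" -exec rm -f {} +
}
```

A cron entry or systemd timer could then call `nightly_dump /backup mydb 7`. Pruning runs only after a successful dump, so a broken job never deletes the last good backups.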
Restoring from Backup
Full Restore to a New Database
# Create the target database
createdb mydb_restored
# Restore (custom format)
pg_restore -d mydb_restored -Fc pg_ripple_backup.dump
# Restore (directory format, parallel)
pg_restore -d mydb_restored -Fd -j 4 pg_ripple_backup_dir/
Restore from Plain SQL
psql -d mydb_restored -f pg_ripple_backup.sql
Post-Restore Verification
After restoring, verify the extension is intact:
-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
-- Verify triple count
SELECT pg_ripple.stats();
-- Run the health check
SELECT pg_ripple.canary();
-- Spot-check a SPARQL query
SELECT pg_ripple.sparql($$
SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }
$$);
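To run these checks unattended, they can be fed to psql with ON_ERROR_STOP enabled. This is a sketch under assumptions: the function name verify_restore is hypothetical, and it detects only SQL errors (for example, a missing pg_ripple schema). It does not interpret the values returned by stats() or canary(), whose output format is not specified here.

```shell
# Sketch: run the post-restore checks and stop at the first SQL error.
# verify_restore is a hypothetical helper name.

# verify_restore DBNAME
verify_restore() {
  local db="$1"
  # -X skips ~/.psqlrc; ON_ERROR_STOP makes psql exit non-zero on any error.
  # The quoted heredoc delimiter keeps the shell from expanding $$.
  psql -X -v ON_ERROR_STOP=1 -d "$db" <<'SQL'
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
SELECT pg_ripple.stats();
SELECT pg_ripple.canary();
SELECT pg_ripple.sparql($$
  SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }
$$);
SQL
}
```

Calling `verify_restore mydb_restored` returns psql's exit status, so it can gate later steps in a script.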
VP tables are standard PostgreSQL heap tables with B-tree or BRIN indexes, so pg_dump captures them exactly like any other table. The HTAP delta/main/tombstone split, indexes, and the merge worker view definitions are all preserved. After restore, the merge worker resumes normal operation once shared_preload_libraries includes pg_ripple.
WAL-Based Continuous Archiving
For point-in-time recovery (PITR), configure WAL archiving:
Enable WAL Archiving
In postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
max_wal_senders = 3
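A plain cp will silently overwrite a segment that already exists in the archive. The PostgreSQL documentation recommends that an archive command refuse to overwrite existing files and return zero only on success. The following sketch shows one way to do that; the function name archive_wal and the temporary-file suffix are assumptions.

```shell
# Sketch: WAL archive helper that never overwrites an archived segment.
# archive_wal and the ".tmp" suffix are illustrative assumptions.

# archive_wal SRC_PATH ARCHIVE_DIR FILE_NAME
archive_wal() {
  local src="$1" archive_dir="$2" name="$3"
  # Fail instead of overwriting: a non-zero status makes PostgreSQL retry later.
  if [ -e "$archive_dir/$name" ]; then
    echo "archive_wal: $name is already archived" >&2
    return 1
  fi
  # Copy under a temporary name, then rename, so a partially copied
  # segment is never visible under its final name.
  cp "$src" "$archive_dir/$name.tmp" \
    && mv "$archive_dir/$name.tmp" "$archive_dir/$name"
}
```

Saved as a script that ends with `archive_wal "$@"`, it could be wired in as `archive_command = '/usr/local/bin/archive_wal.sh %p /backup/wal_archive %f'`, where the script path is hypothetical.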
Take a Base Backup
pg_basebackup -D /backup/base -Ft -z -P
Point-in-Time Recovery
Create a recovery.signal file and configure the restore target:
# postgresql.conf (or postgresql.auto.conf)
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-19 14:30:00'
Start PostgreSQL — it will replay WAL up to the specified time.
If you recover to a point mid-merge, the merge worker will detect the incomplete state and re-run the merge on startup. No manual intervention is needed, but the first merge cycle after recovery may take longer than usual.
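The recovery setup described above can be scripted. This sketch assumes the server is stopped and the data directory has already been restored from the base backup; the function name prepare_pitr is hypothetical. It appends to postgresql.auto.conf, where later entries override earlier ones.

```shell
# Sketch: prepare a restored data directory for point-in-time recovery.
# prepare_pitr is a hypothetical helper; run it only while PostgreSQL is stopped.

# prepare_pitr DATA_DIR ARCHIVE_DIR TARGET_TIME
prepare_pitr() {
  local data_dir="$1" archive_dir="$2" target_time="$3"
  # An empty recovery.signal file makes PostgreSQL 12+ enter recovery on startup.
  : > "$data_dir/recovery.signal"
  # %%f and %%p produce the literal %f and %p placeholders PostgreSQL expects.
  {
    printf "restore_command = 'cp %s/%%f %%p'\n" "$archive_dir"
    printf "recovery_target_time = '%s'\n" "$target_time"
  } >> "$data_dir/postgresql.auto.conf"
}
```

For the example above, the call would be `prepare_pitr "$PGDATA" /backup/wal_archive '2026-04-19 14:30:00'`. With default settings PostgreSQL pauses when it reaches the target, so you can inspect the data before promoting the server.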
Streaming Replication
pg_ripple works transparently with PostgreSQL streaming replication:
# On the replica
pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -P
The -R flag writes the standby.signal and connection parameters. All VP tables, dictionary data, and HTAP state replicate via WAL.
The background merge worker does not run on read replicas. Replicas receive merged state via WAL replay from the primary. This is correct behavior — replicas should never write.
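Because replicas depend entirely on WAL replay, it is worth watching how far behind they are. The sketch below uses the built-in pg_last_xact_replay_timestamp() function; the helper names, the postgres database, and the threshold are assumptions. Note that this measure also grows while the primary is idle, since no new transactions arrive to be replayed.

```shell
# Sketch: replay-delay check for a standby.
# Helper names, the "postgres" database, and the threshold are assumptions.

# Print whole seconds since the last transaction replayed on the standby.
# Prints 0 when nothing has been replayed (the timestamp is NULL).
replica_lag_seconds() {
  local host="$1"
  psql -X -At -h "$host" -d postgres -c \
    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int"
}

# check_replica_lag HOST MAX_SECONDS
# Succeeds only when the measured lag is within MAX_SECONDS;
# returns 2 when the standby cannot be queried.
check_replica_lag() {
  local host="$1" max_seconds="$2" lag
  lag=$(replica_lag_seconds "$host") || return 2
  [ "$lag" -le "$max_seconds" ]
}
```

A monitoring job could run `check_replica_lag replica-host 30` and alert on a non-zero exit status.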
Backup Strategy Recommendations
Small Datasets (< 1M triples)
| Component | Recommendation |
|---|---|
| Method | pg_dump -Fc nightly |
| Retention | 7 daily + 4 weekly |
| RPO | 24 hours |
| RTO | Minutes |
Medium Datasets (1M – 100M triples)
| Component | Recommendation |
|---|---|
| Method | WAL archiving + daily base backup |
| Retention | 7 daily base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Minutes to hours |
Large Datasets (> 100M triples)
| Component | Recommendation |
|---|---|
| Method | WAL archiving + pgBackRest or Barman |
| Retention | Incremental base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Proportional to dataset size |
Schedule monthly restore drills. A backup that has never been tested is not a backup. Automate the verification queries shown above as part of the drill.
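A restore drill can be automated as a restore into a scratch database followed by a health check and cleanup. This is a minimal sketch: the function name restore_drill and the scratch database name are assumptions, and it only checks that pg_ripple.canary() runs without error, not what it returns.

```shell
# Sketch: restore a dump into a scratch database, check it, then drop it.
# restore_drill and the default scratch name are illustrative assumptions.

# restore_drill DUMP_FILE [SCRATCH_DB]
restore_drill() {
  local dump="$1" scratch_db="${2:-pg_ripple_drill}" rc=0
  createdb "$scratch_db" || return 1
  # Mark the drill as failed if either the restore or the health check fails.
  pg_restore -d "$scratch_db" "$dump" \
    && psql -X -v ON_ERROR_STOP=1 -d "$scratch_db" \
         -c 'SELECT pg_ripple.canary()' \
    || rc=1
  # Always drop the scratch database, even after a failure.
  dropdb "$scratch_db"
  return "$rc"
}
```

Running `restore_drill pg_ripple_backup.dump` from a scheduler gives a pass/fail signal for each drill without leaving scratch databases behind.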
Disaster Recovery Checklist
- Before disaster: WAL archiving enabled, base backups on schedule, replication lag monitored
- During incident: identify the failure scope (single table, full database, or host loss)
- Recovery steps:
  - Host loss → promote replica or restore from base backup + WAL
  - Corruption → PITR to last known good time
  - Accidental deletion → PITR to just before the DROP/DELETE
- Post-recovery:
  - Run SELECT pg_ripple.canary() to verify health
  - Check pg_ripple.stats() for expected triple counts
  - Verify the merge worker is running (merge_worker_pid > 0)
  - Run representative SPARQL queries to confirm data integrity
  - Resume WAL archiving and replication
Common Pitfalls
- Schema ownership: the restoring user must be a superuser or own both the _pg_ripple and pg_ripple schemas
- Sequence values: pg_dump captures sequence state, so statement IDs (the i column) will continue from the correct value after restore
- Tablespace placement: if you used custom tablespaces for VP tables, ensure they exist on the target server before restoring