What Is pg_ripple?
pg_ripple turns your PostgreSQL database into a knowledge graph store. Store facts as triples, query them with SPARQL, validate data quality with SHACL, derive new facts with Datalog rules, and serve results over HTTP — all inside PostgreSQL, with no extra infrastructure for the data store itself.
-- Load facts about people and relationships
SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
ex:alice foaf:name "Alice" .
ex:alice foaf:knows ex:bob .
ex:bob foaf:name "Bob" .
ex:bob foaf:knows ex:carol .
ex:carol foaf:name "Carol" .
');
-- Ask: who does Alice know, directly or indirectly?
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
ex:alice foaf:knows+ ?person .
?person foaf:name ?name .
}
');
The query follows the foaf:knows relationship through any number of hops and returns the names of everyone Alice is connected to — Bob and Carol.
Why pg_ripple?
Knowledge graphs represent information as a network of relationships rather than rows in flat tables. This structure naturally captures complex, interconnected data — organizational hierarchies, supply chains, research citations, product catalogs — that would require dozens of join tables in a relational model.
pg_ripple brings this capability to PostgreSQL. You get the expressiveness of a dedicated graph database while keeping your existing PostgreSQL infrastructure, tooling, backup procedures, and operational expertise.
Key capabilities
| Capability | What it does |
|---|---|
| SPARQL queries | Ask complex relationship questions using the W3C standard query language |
| SHACL validation | Define and enforce data quality rules — reject bad data on insert |
| Datalog reasoning | Automatically derive new facts from rules and logic |
| Vector + graph hybrid | Combine SPARQL graph traversal with pgvector similarity search |
| JSON-LD framing | Export nested JSON documents shaped for your API contract |
| SPARQL Protocol | Serve queries over a standard HTTP endpoint via pg_ripple_http |
| Federation | Query remote SPARQL endpoints alongside local data |
Key numbers
| Metric | Value |
|---|---|
| Bulk load throughput | >100K triples/sec (commodity hardware) |
| SPARQL query latency | <10ms for typical patterns |
| W3C SPARQL 1.1 | Full conformance |
| W3C SHACL Core | Full conformance |
| PostgreSQL version | 18 |
Architecture at a glance
┌─────────────────────────────────────────────────┐
│                  PostgreSQL 18                  │
│  ┌───────────────────────────────────────────┐  │
│  │            pg_ripple extension            │  │
│  │  ┌──────────┐  ┌────────┐  ┌───────────┐  │  │
│  │  │Dictionary│  │ SPARQL │  │  Datalog  │  │  │
│  │  │ Encoder  │  │ Engine │  │  Engine   │  │  │
│  │  └────┬─────┘  └───┬────┘  └─────┬─────┘  │  │
│  │       │            │             │        │  │
│  │  ┌────┴────────────┴─────────────┴─────┐  │  │
│  │  │    VP Tables (one per predicate)    │  │  │
│  │  │  HTAP: delta + main + merge worker  │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
          ▲                            ▲
          │ SQL                        │ HTTP
     Application                 pg_ripple_http
Every IRI, literal, and blank node is mapped to a compact integer ID by the dictionary encoder. Data is stored in Vertical Partitioning (VP) tables — one table per unique predicate — with integer-only joins for fast query execution. The HTAP architecture separates read and write paths so that heavy loads do not block queries.
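To make that concrete, here is roughly the shape of SQL the engine might generate for a two-pattern query. This is an illustrative sketch only: the table and column names (vp_foaf_knows, vp_foaf_name, dictionary, s, o, id, term) are assumptions, not pg_ripple's real internal schema.
-- Sketch: a possible compilation of { ?p foaf:knows ?q . ?q foaf:name ?name }.
-- Joins run entirely on integer IDs; strings are decoded only at the end.
SELECT d_subj.term AS p, d_name.term AS name
FROM vp_foaf_knows k
JOIN vp_foaf_name n ON n.s = k.o          -- integer-only join
JOIN dictionary d_subj ON d_subj.id = k.s -- decode ?p
JOIN dictionary d_name ON d_name.id = n.o; -- decode ?name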
Next steps
- Evaluating? Read When to Use pg_ripple for an honest comparison with alternatives.
- Ready to try it? Start with Installation and then the Five-Minute Walkthrough.
- Want the full picture? The Guided Tutorial takes you from loading data to inference in 30 minutes.
- Want to contribute? See Contributing.
When to Use pg_ripple
pg_ripple is a PostgreSQL extension that turns your database into a knowledge graph store. This page helps you decide whether it fits your architecture.
Decision flowchart
Ask yourself these questions in order:
- Do you already run PostgreSQL? If yes, pg_ripple integrates with zero additional infrastructure for the data store. If you run a different database, evaluate the migration cost.
- Do you need to model complex relationships? If your data is primarily tabular with few joins, standard SQL may be simpler. If you have deeply nested, many-to-many, or hierarchical relationships, a graph model helps.
- Do you need a standard query language? SPARQL is a W3C standard with broad tool support. If you prefer a property-graph query language (Cypher/GQL), consider Neo4j or Amazon Neptune.
- Do you need reasoning or validation? pg_ripple includes SHACL validation and Datalog reasoning. Standalone triple stores like Virtuoso or Blazegraph may not.
- Do you need graph context for LLM prompts? pg_ripple combines SPARQL graph traversal with pgvector similarity search in a single query — something pure vector databases cannot do.
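As a sketch of that last point, the query below joins SPARQL results with a pgvector ranking. The documents table, its embedding column, and the assumption that stored IRIs use the same <...> text form as SPARQL bindings are all for illustration; only pg_ripple.sparql() and the pgvector <-> operator come from the documented feature set.
-- Hypothetical hybrid query: graph traversal narrows candidates,
-- vector distance ranks them. Assumes documents(iri TEXT, embedding VECTOR(3)).
SELECT d.iri,
       d.embedding <-> '[0.1, 0.2, 0.3]'::vector AS distance
FROM documents d
JOIN (
  SELECT t.r->>'paper' AS paper
  FROM pg_ripple.sparql('
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX bib: <http://example.org/bib/>
    SELECT ?paper WHERE {
      ?paper dcterms:references ?cited .
      ?cited dc:creator bib:alice .
    }
  ') AS t(r)
) g ON d.iri = g.paper
ORDER BY distance
LIMIT 5;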
Comparison matrix
| Criterion | pg_ripple | Plain SQL | Virtuoso / Blazegraph | Neo4j | Pure vector DB |
|---|---|---|---|---|---|
| Deployment | PostgreSQL extension | Any RDBMS | Standalone JVM | Standalone | Standalone |
| Query language | SPARQL 1.1 | SQL | SPARQL 1.1 | Cypher | Proprietary |
| Data model | RDF (triples) | Relational | RDF (triples) | Property graph | Vectors + metadata |
| Schema validation | SHACL | CHECK / triggers | Varies | Constraints | None |
| Reasoning | Datalog (RDFS, OWL RL) | Manual SQL | RDFS / OWL (varies) | None built-in | None |
| Vector search | pgvector integration | pgvector | Not built-in | Limited | Native |
| Hybrid graph+vector | Yes (single query) | Manual joins | No | No | No |
| HTTP API | pg_ripple_http | Build your own | Built-in | Built-in | Built-in |
| Transactions | Full PostgreSQL ACID | Full ACID | Varies | ACID | Varies |
| Backup/restore | pg_dump/pg_restore | Standard | Custom tools | Custom tools | Custom tools |
| Operational complexity | Low (PostgreSQL) | Low | Medium–High | Medium | Medium |
When pg_ripple is a good fit
- You already operate PostgreSQL and want to avoid managing a separate graph database
- Your data has rich, interconnected relationships (ontologies, catalogs, supply chains)
- You need SPARQL 1.1 compliance for interoperability with W3C-standard tools
- You need to validate data quality against formal rules (SHACL)
- You need to derive new facts from existing data (Datalog reasoning, OWL RL, RDFS)
- You want to combine graph traversal with vector similarity for RAG pipelines
- You need full ACID transactions on graph data
When pg_ripple is not the best fit
- Graph datasets exceeding ~1 billion triples: pg_ripple has been tested to 100M triples. For very large datasets, consider distributed solutions.
- Property graph with Cypher/GQL: if your team already uses Cypher and Neo4j, migrating to SPARQL has a learning curve. pg_ripple speaks SPARQL, not Cypher.
- Pure vector search workload: if you only need approximate nearest neighbor search without graph traversal, pgvector alone is simpler.
- Real-time streaming graphs: pg_ripple processes data in transactions, not continuous streams. For streaming graph analytics, consider Apache Flink with a graph library.
- No PostgreSQL in your stack: if you run MySQL, MongoDB, or a managed NoSQL service and have no plans to adopt PostgreSQL, introducing it solely for pg_ripple adds operational overhead.
AI/LLM comparison: when does graph context outperform flat vector retrieval?
Graph-augmented retrieval helps when:
- The query requires multi-hop reasoning — "find papers by co-authors of Alice's co-authors" cannot be answered by vector similarity alone
- Entity deduplication matters — owl:sameAs canonicalization ensures the same entity is not embedded multiple times with different IRIs
- Structured output is needed — JSON-LD framing produces token-efficient, structured context that flat top-k results cannot provide
- Provenance matters — graph traversal can trace why a fact is relevant, not just that it is similar
Pure vector search (Qdrant, Weaviate, pgvector-only) is sufficient when:
- The query is a simple "find similar documents" without relationship constraints
- Your corpus is unstructured text without entity-level structure
- Latency requirements are sub-millisecond at millions of vectors
Next steps
- Installation — get pg_ripple running
- Hello World — load and query data in five minutes
Installation
pg_ripple is a PostgreSQL 18 extension written in Rust. Choose the installation method that fits your environment.
Docker (recommended)
The fastest path to a working pg_ripple instance. No build tools required.
# Start pg_ripple with Docker Compose
docker compose up -d
# Connect
psql -h localhost -p 5432 -U postgres -d pg_ripple
The docker-compose.yml in the repository root starts PostgreSQL 18 with pg_ripple pre-installed and the extension created in the default database.
Verify the installation
SELECT pg_ripple.triple_count();
The result should be 0 — the extension is installed and ready.
From source (cargo pgrx)
Build and install directly into a local PostgreSQL 18 instance.
Prerequisites
- Rust (stable, edition 2024)
- PostgreSQL 18 development headers
- cargo-pgrx 0.17
# Install cargo-pgrx
cargo install cargo-pgrx --version 0.17 --locked
# Initialize pgrx with PostgreSQL 18
cargo pgrx init --pg18 $(which pg_config)
# Build and install
cargo pgrx install --release --pg-config $(which pg_config)
Create the extension
Connect to your database and run:
CREATE EXTENSION pg_ripple;
Verify
SELECT pg_ripple.triple_count();
Configuration
pg_ripple works out of the box with default settings. For production deployments, you may want to adjust GUC parameters — see Configuration and Tuning.
For HTAP storage (background merge worker) and shared-memory dictionary cache, add pg_ripple to shared_preload_libraries in postgresql.conf:
shared_preload_libraries = 'pg_ripple'
Restart PostgreSQL after this change.
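If you prefer not to edit postgresql.conf by hand, the same setting can be applied with ALTER SYSTEM (a restart is still required):
ALTER SYSTEM SET shared_preload_libraries = 'pg_ripple';
-- Note: this replaces any existing shared_preload_libraries value;
-- include other preloaded libraries in the list if you use them.
-- Then restart, e.g.: sudo systemctl restart postgresql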
Troubleshooting
Wrong PostgreSQL version
pg_ripple requires PostgreSQL 18. Check your version:
pg_config --version
Missing shared_preload_libraries
If you see errors about shared memory or the merge worker not starting, ensure pg_ripple is in shared_preload_libraries and PostgreSQL has been restarted.
pgrx version mismatch
pg_ripple requires cargo-pgrx 0.17. If you have an older version:
cargo install cargo-pgrx --version 0.17 --locked --force
Extension not found after install
If CREATE EXTENSION pg_ripple fails with "extension not found", verify that the extension files were installed to the correct PostgreSQL directory:
pg_config --sharedir
ls $(pg_config --sharedir)/extension/pg_ripple*
Docker container fails to start
Check logs:
docker compose logs pg_ripple
Common causes: port 5432 already in use (change the port mapping), insufficient memory (pg_ripple recommends at least 512MB).
Next steps
- Hello World — Five-Minute Walkthrough — load and query your first triples
- Guided Tutorial — build a knowledge graph in 30 minutes
Hello World — Five-Minute Walkthrough
This walkthrough takes you from an empty database to working SPARQL queries in five minutes. You will load ten triples about people and movies, then run three queries of increasing complexity.
Prerequisites
pg_ripple is installed and you are connected to a PostgreSQL database with the extension created. See Installation if you have not done this yet.
Step 1: Register prefixes
Prefixes are shortcuts for long IRIs. Register a few common ones:
SELECT pg_ripple.register_prefix('ex', 'http://example.org/');
SELECT pg_ripple.register_prefix('foaf', 'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('schema', 'http://schema.org/');
Step 2: Load data
Load ten triples about people and the movies they directed or acted in:
SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix schema: <http://schema.org/> .
ex:alice foaf:name "Alice" .
ex:alice schema:knows ex:bob .
ex:bob foaf:name "Bob" .
ex:bob schema:knows ex:carol .
ex:carol foaf:name "Carol" .
ex:movie1 schema:name "The Graph" .
ex:movie1 schema:director ex:alice .
ex:movie1 schema:actor ex:bob .
ex:movie2 schema:name "Linked Data" .
ex:movie2 schema:director ex:bob .
');
The function returns the number of triples loaded (10).
Step 3: Query — basic pattern
Find all movies and their directors:
SELECT * FROM pg_ripple.sparql('
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?movieName ?director WHERE {
?movie schema:director ?person .
?movie schema:name ?movieName .
?person foaf:name ?director .
}
');
Each row in the result is a JSONB object with the variable bindings. You should see "The Graph" directed by "Alice" and "Linked Data" directed by "Bob".
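You can unpack bindings with ordinary JSONB operators. A minimal sketch, assuming sparql() returns a set of jsonb values (alias the output column, here as r):
SELECT t.r->>'movieName' AS movie, t.r->>'director' AS director
FROM pg_ripple.sparql('
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?movieName ?director WHERE {
?movie schema:director ?person .
?movie schema:name ?movieName .
?person foaf:name ?director .
}
') AS t(r);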
Step 4: Query — OPTIONAL
Find all movies with their directors, and actors if they have any:
SELECT * FROM pg_ripple.sparql('
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?movieName ?directorName ?actorName WHERE {
?movie schema:name ?movieName .
?movie schema:director ?director .
?director foaf:name ?directorName .
OPTIONAL {
?movie schema:actor ?actor .
?actor foaf:name ?actorName .
}
}
');
"The Graph" has an actor (Bob), while "Linked Data" does not — the actorName column is null for that row. The OPTIONAL keyword works like a SQL LEFT JOIN.
Step 5: Query — property path
Find everyone Alice is connected to, directly or indirectly, through schema:knows links:
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <http://example.org/>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
ex:alice schema:knows+ ?person .
?person foaf:name ?name .
}
');
The + operator follows the schema:knows relationship one or more times. Alice knows Bob directly, and Bob knows Carol, so the query returns both "Bob" and "Carol".
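For readers coming from SQL: the + property path replaces the recursive CTE you would otherwise write. A rough equivalent over a hypothetical edges(src, dst) adjacency table (not part of pg_ripple) looks like this:
-- What schema:knows+ saves you from writing by hand.
WITH RECURSIVE reachable(person) AS (
SELECT dst FROM edges WHERE src = 'alice'
UNION
SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.person
)
SELECT person FROM reachable;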
What you just learned
- Triples are facts with three parts: subject, predicate, object
- Prefixes are shortcuts for long IRIs
- load_turtle() loads data in Turtle format
- sparql() runs SPARQL queries and returns results as JSONB
- OPTIONAL works like a SQL LEFT JOIN
- Property paths (+, *) follow chains of relationships
Next steps
- Guided Tutorial — build a complete knowledge graph with validation and reasoning
- Key Concepts — understand RDF concepts using PostgreSQL analogies
- Querying with SPARQL — the full SPARQL feature set
Guided Tutorial — Build a Knowledge Graph in 30 Minutes
This tutorial picks up where the Hello World walkthrough ends. You will build a bibliographic knowledge graph with papers, authors, institutions, and citations — then validate it, reason over it, and export it as JSON-LD.
The tutorial is organized into four independent segments. Each takes about ten minutes and leaves you with a working, progressively richer knowledge graph. You can stop after any segment.
This tutorial uses an academic bibliographic dataset. The patterns — entity relationships, typed literals, named graphs, inference, validation — apply equally to product catalogs, supply chains, organizational hierarchies, or any domain with interconnected data.
Prerequisites
pg_ripple is installed and you are connected to a PostgreSQL database with the extension created. See Installation.
Segment 1: Load and Explore (10 min)
Register prefixes
SELECT pg_ripple.register_prefix('bib', 'http://example.org/bib/');
SELECT pg_ripple.register_prefix('foaf', 'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('dc', 'http://purl.org/dc/elements/1.1/');
SELECT pg_ripple.register_prefix('dcterms', 'http://purl.org/dc/terms/');
SELECT pg_ripple.register_prefix('schema', 'http://schema.org/');
SELECT pg_ripple.register_prefix('skos', 'http://www.w3.org/2004/02/skos/core#');
Load the bibliographic dataset
SELECT pg_ripple.load_turtle('
@prefix bib: <http://example.org/bib/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
bib:mit a schema:Organization ; schema:name "MIT" .
bib:stanford a schema:Organization ; schema:name "Stanford University" .
bib:oxford a schema:Organization ; schema:name "University of Oxford" .
bib:alice a foaf:Person ; foaf:name "Alice Chen" ;
schema:affiliation bib:mit .
bib:bob a foaf:Person ; foaf:name "Bob Smith" ;
schema:affiliation bib:stanford .
bib:carol a foaf:Person ; foaf:name "Carol Martinez" ;
schema:affiliation bib:oxford .
bib:paper1 a schema:ScholarlyArticle ;
dc:title "Knowledge Graphs in Practice" ;
dc:creator bib:alice ; dc:creator bib:bob ;
dcterms:issued "2024-01-15"^^xsd:date ;
schema:about <http://example.org/bib/kg> .
bib:paper2 a schema:ScholarlyArticle ;
dc:title "Efficient SPARQL Query Processing" ;
dc:creator bib:bob ; dc:creator bib:carol ;
dcterms:issued "2024-03-22"^^xsd:date .
bib:paper3 a schema:ScholarlyArticle ;
dc:title "Graph-Enhanced Retrieval for LLMs" ;
dc:creator bib:alice ;
dcterms:issued "2024-06-10"^^xsd:date .
bib:paper2 dcterms:references bib:paper1 .
bib:paper3 dcterms:references bib:paper1 .
bib:paper3 dcterms:references bib:paper2 .
bib:alice foaf:knows bib:bob .
bib:bob foaf:knows bib:carol .
');
Explore: find all papers by Alice
SELECT * FROM pg_ripple.sparql('
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX bib: <http://example.org/bib/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?title WHERE {
?paper dc:creator bib:alice .
?paper dc:title ?title .
}
');
Explore: citation chains
Find papers that cite papers Alice authored:
SELECT * FROM pg_ripple.sparql('
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bib: <http://example.org/bib/>
SELECT ?citingTitle ?citedTitle WHERE {
?citing dcterms:references ?cited .
?cited dc:creator bib:alice .
?citing dc:title ?citingTitle .
?cited dc:title ?citedTitle .
}
');
Explore: count papers per author
SELECT * FROM pg_ripple.sparql('
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name (COUNT(?paper) AS ?papers) WHERE {
?paper dc:creator ?author .
?author foaf:name ?name .
}
GROUP BY ?name
ORDER BY DESC(?papers)
');
Segment 2: Validate (10 min)
SHACL (Shapes Constraint Language) lets you define data quality rules. You will create a shape that requires every ScholarlyArticle to have a title and at least one creator.
Load a SHACL shape
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/shapes/ArticleShape>
a sh:NodeShape ;
sh:targetClass schema:ScholarlyArticle ;
sh:property [
sh:path dc:title ;
sh:minCount 1 ;
sh:maxCount 1 ;
sh:datatype xsd:string ;
sh:message "Every article must have exactly one title" ;
] ;
sh:property [
sh:path dc:creator ;
sh:minCount 1 ;
sh:message "Every article must have at least one creator" ;
] .
');
Validate the dataset
SELECT pg_ripple.validate();
The result is a JSONB validation report. If all articles conform, the report shows zero violations. Now insert a bad article to see validation catch it:
SELECT pg_ripple.insert_triple(
'http://example.org/bib/bad_paper',
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
'http://schema.org/ScholarlyArticle'
);
SELECT pg_ripple.validate();
The report now shows a violation: the article has no title and no creator.
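To continue with a clean dataset, remove the offending triple again. SPARQL Update (covered in §2.3) works for this:
SELECT pg_ripple.sparql_update('
PREFIX bib: <http://example.org/bib/>
PREFIX schema: <http://schema.org/>
DELETE DATA {
bib:bad_paper a schema:ScholarlyArticle .
}
');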
Segment 3: Reason (10 min)
Datalog rules let you derive new facts. You will write a rule that infers transitive co-authorship: if Alice co-authored a paper with Bob, and Bob co-authored with Carol, then Alice and Carol are indirectly connected.
Write and load a rule
SELECT pg_ripple.load_rules('
coauthor(?a, ?b) :- <http://purl.org/dc/elements/1.1/creator>(?paper, ?a),
<http://purl.org/dc/elements/1.1/creator>(?paper, ?b),
?a != ?b.
connected(?a, ?b) :- coauthor(?a, ?b).
connected(?a, ?b) :- connected(?a, ?c), coauthor(?c, ?b), ?a != ?b.
', 'coauthorship');
Run inference
SELECT pg_ripple.infer('coauthorship');
This returns the number of new facts derived.
Query the derived facts
SELECT * FROM pg_ripple.sparql('
PREFIX bib: <http://example.org/bib/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
bib:alice <http://example.org/bib/connected> ?person .
?person foaf:name ?name .
}
');
Alice is now connected to Bob (direct co-author on paper1), Carol (through Bob on paper2), and potentially others through the transitive chain.
Segment 4: Export (10 min)
Export your knowledge graph as JSON-LD, shaped for an API using a frame template.
Export as Turtle
SELECT pg_ripple.export_turtle();
This returns all triples in human-readable Turtle format.
Export as JSON-LD with framing
SELECT pg_ripple.sparql_construct_jsonld('
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?paper dc:title ?title .
?paper dc:creator ?author .
?author foaf:name ?name .
?author schema:affiliation ?org .
?org schema:name ?orgName .
}
WHERE {
?paper a schema:ScholarlyArticle .
?paper dc:title ?title .
?paper dc:creator ?author .
?author foaf:name ?name .
OPTIONAL {
?author schema:affiliation ?org .
?org schema:name ?orgName .
}
}
');
The result is a nested JSON-LD document with papers, their authors, and institutional affiliations — ready to serve from a REST API.
What you built
In 30 minutes, you created a knowledge graph with:
- Structured data — papers, authors, institutions, and citations as RDF triples
- Quality rules — SHACL shapes that catch incomplete articles
- Derived knowledge — Datalog rules that infer transitive co-authorship
- API-ready export — JSON-LD output shaped for downstream consumers
Next steps
- Storing Knowledge — data modeling deep dive
- Querying with SPARQL — the full query language
- Validating Data Quality — advanced SHACL patterns
- Reasoning and Inference — Datalog, RDFS, OWL RL
Key Concepts — RDF for PostgreSQL Users
If you know PostgreSQL, you already understand most of what you need to work with pg_ripple. This page maps RDF concepts to their PostgreSQL equivalents.
Triples
A triple is the atomic unit of data in RDF. It has three parts:
| Part | What it is | PostgreSQL analogy |
|---|---|---|
| Subject | The entity being described | A row's primary key |
| Predicate | The relationship or attribute | A column name |
| Object | The value or related entity | A cell value or foreign key |
For example, the fact "Alice knows Bob" is the triple:
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
In pg_ripple, this triple is stored in a VP table named after the predicate (foaf:knows), with integer-encoded subject and object columns.
IRIs
An IRI (Internationalized Resource Identifier) is a globally unique identifier for an entity or relationship. Think of it as a namespaced primary key that is guaranteed unique across all datasets in the world.
http://example.org/alice -- an entity
http://xmlns.com/foaf/0.1/knows -- a relationship
Prefixes are shortcuts to avoid writing full IRIs repeatedly:
SELECT pg_ripple.register_prefix('ex', 'http://example.org/');
-- Now ex:alice means http://example.org/alice
Blank nodes
A blank node is an anonymous entity — like a row with no primary key. It exists only within the document where it was created.
ex:alice foaf:address [ foaf:city "Boston" ; foaf:country "US" ] .
The address has no IRI. It is a blank node, identified internally by a system-generated label. Blank nodes from different load_turtle() calls are always distinct entities, even if they share the same label.
Blank nodes cannot be referenced from outside their originating load call. If you need to reference an entity from multiple places, give it an IRI.
Literals
A literal is a data value — a string, number, date, or boolean. Literals can have a datatype or a language tag.
| Literal | Type | PostgreSQL equivalent |
|---|---|---|
"Alice" | Plain string | TEXT |
"42"^^xsd:integer | Typed integer | INTEGER |
"2024-01-15"^^xsd:date | Typed date | DATE |
"Bonjour"@fr | Language-tagged string | No direct equivalent |
In pg_ripple, all literals are dictionary-encoded to compact integer IDs for storage. The original string representation is preserved and decoded on query output.
Predicates and VP tables
In a relational database, a table groups all attributes of a single entity type. In pg_ripple, data is organized by predicate — each unique predicate gets its own table (a Vertical Partitioning, or VP, table).
Relational: persons(id, name, email, knows_id)
pg_ripple: vp_foaf_name(s, o) -- subject → name
vp_foaf_knows(s, o) -- subject → object
vp_schema_email(s, o) -- subject → email
This structure makes join-heavy SPARQL queries fast because each predicate's data is co-located and indexed.
Named graphs
A named graph is a labeled collection of triples — like a PostgreSQL schema that groups related tables.
-- Create a named graph
SELECT pg_ripple.create_graph('http://example.org/publications');
-- Load data into it
SELECT pg_ripple.load_turtle_into_graph(
'<http://example.org/paper1> <http://purl.org/dc/elements/1.1/title> "My Paper" .',
'http://example.org/publications'
);
Named graphs are useful for:
- Multi-source data: keep data from different sources separate
- Access control: grant read access to specific graphs per role
- Versioning: load new data into a fresh graph, validate, then swap
All triples without an explicit graph belong to the default graph (graph ID = 0).
RDF-star
Standard RDF says "Alice knows Bob." But what if you want to say when Alice met Bob, or who recorded that fact? RDF-star lets you make statements about statements:
<< ex:alice foaf:knows ex:bob >> ex:since "2020"^^xsd:gYear .
This says: "The fact that Alice knows Bob has been true since 2020." In pg_ripple, each triple has a statement identifier (SID) that can be used as the subject or object of other triples, enabling edge properties similar to labeled property graphs.
SPARQL
SPARQL is the standard query language for RDF data — the equivalent of SQL for relational databases. Where SQL queries tables, SPARQL queries graph patterns.
| SQL | SPARQL |
|---|---|
| SELECT name FROM persons WHERE id = 1 | SELECT ?name WHERE { ex:person1 foaf:name ?name } |
| JOIN | Graph pattern matching (implicit) |
| LEFT JOIN | OPTIONAL { } |
| WHERE x IN (...) | VALUES (?x) { ... } |
| GROUP BY ... HAVING | GROUP BY ... HAVING |
| WITH RECURSIVE | Property paths (foaf:knows+) |
In pg_ripple, SPARQL queries are compiled to SQL and executed via PostgreSQL's query engine. You call them through pg_ripple.sparql():
SELECT * FROM pg_ripple.sparql('
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name }
');
Dictionary encoding
pg_ripple does not store raw strings in its data tables. Every IRI, blank node, and literal is mapped to a compact BIGINT (i64) by the dictionary encoder. VP tables contain only integer columns, making joins and comparisons fast.
You never need to interact with dictionary IDs directly — sparql() and find_triples() handle encoding and decoding automatically. For advanced use cases, encode_term() and decode_id() are available.
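For example, a round trip through the dictionary might look like this. The exact signatures are not documented here, so treat this as a sketch:
-- Sketch: encode a term to its dictionary ID, then decode it back.
-- Assumes encode_term(text) -> bigint and decode_id(bigint) -> text.
SELECT pg_ripple.decode_id(
pg_ripple.encode_term('<http://example.org/alice>')
);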
Summary of analogies
| RDF concept | PostgreSQL analogy |
|---|---|
| Triple | Row in a table |
| Subject | Primary key value |
| Predicate | Column name / table name (VP) |
| Object | Cell value or foreign key |
| IRI | Globally unique identifier |
| Blank node | Row with system-generated ID |
| Literal | Typed column value |
| Named graph | Schema |
| SPARQL | SQL |
| SHACL shape | CHECK constraint / trigger |
| Datalog rule | Materialized view definition |
Next steps
- Storing Knowledge — data modeling with triples
- Loading Data — all import formats and methods
- Querying with SPARQL — the full query language
§2.1 Storing Knowledge
What and Why
pg_ripple stores data as RDF triples — the W3C standard for representing knowledge. Every fact is a three-part statement: a subject, a predicate, and an object. This structure is deceptively simple but powerful enough to model any domain — from bibliographic records and biomedical ontologies to enterprise knowledge graphs.
Why triples instead of tables?
- Schema-free evolution: add new predicates without ALTER TABLE.
- Natural linking: every entity is an IRI — links across datasets are free.
- Standards-based: SPARQL, SHACL, OWL, and thousands of public vocabularies work out of the box.
- Provenance-ready: RDF-star lets you annotate individual facts with confidence scores, sources, and timestamps.
pg_ripple stores triples inside PostgreSQL using Vertical Partitioning (VP) — one
internal table per predicate, with all values dictionary-encoded as BIGINT. You never
see this machinery directly; you interact through insert_triple(), load_turtle(),
and SPARQL.
How It Works
The Triple Model
Every RDF triple has the form:
<subject> <predicate> <object> .
- Subject: the thing you are describing (always an IRI or blank node).
- Predicate: the relationship or property (always an IRI).
- Object: the value — an IRI (another entity), a literal (string, number, date), or a blank node.
IRIs (Internationalized Resource Identifiers) look like URLs but are identifiers, not
necessarily web addresses. <https://example.org/paper/42> identifies a paper —
it does not need to resolve to a web page.
Named Graphs
Triples can be grouped into named graphs — logical partitions identified by an IRI. This is useful for:
- Tracking provenance: "these triples came from PubMed"
- Multi-tenancy: one graph per customer
- Inference output: derived triples go into a separate graph
pg_ripple uses graph ID 0 for the default graph (triples with no explicit graph).
Named graphs get a positive integer ID via dictionary encoding.
Blank Nodes
Blank nodes are anonymous identifiers — they represent "something exists" without
giving it a global IRI. pg_ripple encodes blank nodes with a _: prefix:
SELECT pg_ripple.insert_triple(
'_:review1',
'<https://schema.org/author>',
'<https://example.org/person/alice>'
);
Blank nodes are document-scoped. Two separate load_turtle() calls that both use
_:x will create two different internal identifiers. If you need stable cross-document
identity, use IRIs instead.
RDF-Star (Quoted Triples)
RDF-star lets you make statements about other statements. This is essential for provenance, confidence scores, and temporal annotations.
A quoted triple wraps << subject predicate object >> and can appear as a subject
or object in another triple:
-- "The fact that Paper42 was authored by Alice has confidence 0.95"
SELECT pg_ripple.insert_triple(
'<< <https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> >>',
'<https://example.org/confidence>',
'"0.95"^^<http://www.w3.org/2001/XMLSchema#decimal>'
);
Dictionary Encoding
Every IRI, blank node, and literal is mapped to a BIGINT (i64) via XXH3-128 hashing
before storage. VP tables contain only integers — this makes joins fast and storage
compact. You never need to think about encoding; pg_ripple handles it transparently.
Worked Examples
The examples in this chapter use a bibliographic dataset: papers, authors, institutions, journals, and citations.
Setting Up Prefixes
Register namespace prefixes so SPARQL queries are readable:
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
SELECT pg_ripple.register_prefix('dct', 'http://purl.org/dc/terms/');
SELECT pg_ripple.register_prefix('foaf', 'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('bibo', 'http://purl.org/ontology/bibo/');
SELECT pg_ripple.register_prefix('schema','https://schema.org/');
SELECT pg_ripple.register_prefix('xsd', 'http://www.w3.org/2001/XMLSchema#');
Inserting Individual Triples
-- Create a paper
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
'<http://purl.org/ontology/bibo/AcademicArticle>'
);
-- Add a title
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/title>',
'"Knowledge Graphs in Practice"'
);
-- Add an author
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/creator>',
'<https://example.org/person/alice>'
);
-- Author metadata
SELECT pg_ripple.insert_triple(
'<https://example.org/person/alice>',
'<http://xmlns.com/foaf/0.1/name>',
'"Alice Johnson"'
);
SELECT pg_ripple.insert_triple(
'<https://example.org/person/alice>',
'<https://schema.org/affiliation>',
'<https://example.org/institution/mit>'
);
-- Institution metadata
SELECT pg_ripple.insert_triple(
'<https://example.org/institution/mit>',
'<http://xmlns.com/foaf/0.1/name>',
'"Massachusetts Institute of Technology"'
);
Loading a Full Dataset with Turtle
For bulk data, Turtle format is more natural:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:paper/42 a bibo:AcademicArticle ;
dct:title "Knowledge Graphs in Practice" ;
dct:creator ex:person/alice, ex:person/bob ;
dct:date "2024-03-15"^^xsd:date ;
bibo:citedBy ex:paper/99 ;
schema:keywords "knowledge graph", "RDF", "SPARQL" .
ex:paper/99 a bibo:AcademicArticle ;
dct:title "Graph Neural Networks for Entity Resolution" ;
dct:creator ex:person/carol ;
bibo:cites ex:paper/42 .
ex:person/alice foaf:name "Alice Johnson" ;
schema:affiliation ex:institution/mit .
ex:person/bob foaf:name "Bob Smith" ;
schema:affiliation ex:institution/stanford .
ex:person/carol foaf:name "Carol Williams" ;
schema:affiliation ex:institution/mit .
ex:institution/mit foaf:name "Massachusetts Institute of Technology" .
ex:institution/stanford foaf:name "Stanford University" .
');
Using Named Graphs
Store triples from different sources in separate graphs:
-- Create named graphs for different data sources
SELECT pg_ripple.create_graph('https://example.org/graph/pubmed');
SELECT pg_ripple.create_graph('https://example.org/graph/arxiv');
-- Load PubMed data into its graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:paper/100 a bibo:AcademicArticle ;
dct:title "Drug Interaction Networks" ;
dct:creator ex:person/dave .
', 'https://example.org/graph/pubmed');
-- Load arXiv data into its graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:paper/200 a bibo:AcademicArticle ;
dct:title "Transformer Architectures for NLP" ;
dct:creator ex:person/eve .
', 'https://example.org/graph/arxiv');
-- List all named graphs
SELECT * FROM pg_ripple.list_graphs();
RDF-Star for Provenance and Confidence
Annotate citations with provenance metadata:
-- Record that Paper 42 cites Paper 99 (the base fact)
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/ontology/bibo/cites>',
'<https://example.org/paper/99>'
);
-- Annotate this citation with a confidence score
SELECT pg_ripple.insert_triple(
'<< <https://example.org/paper/42> <http://purl.org/ontology/bibo/cites> <https://example.org/paper/99> >>',
'<https://example.org/confidence>',
'"0.92"^^<http://www.w3.org/2001/XMLSchema#decimal>'
);
-- Record who asserted this citation
SELECT pg_ripple.insert_triple(
'<< <https://example.org/paper/42> <http://purl.org/ontology/bibo/cites> <https://example.org/paper/99> >>',
'<http://purl.org/dc/terms/source>',
'<https://example.org/system/citation-extractor>'
);
Translating a Relational Schema to RDF
Suppose you have a relational database with tables papers, authors, and affiliations:
| papers.id | papers.title | papers.year |
|---|---|---|
| 42 | Knowledge Graphs in Practice | 2024 |
| authors.id | authors.name | authors.institution_id |
|---|---|---|
| 1 | Alice Johnson | 10 |
The mapping pattern:
- Each row becomes a subject IRI: <https://example.org/paper/{id}>
- Each column becomes a predicate: use a standard vocabulary (Dublin Core, Schema.org, FOAF)
- Foreign keys become object IRIs: authors.institution_id = 10 → <https://example.org/institution/10>
- Scalar values become literals: papers.title → "Knowledge Graphs in Practice"
-- Row from papers table → triples
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
'<http://purl.org/ontology/bibo/AcademicArticle>'
);
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/title>',
'"Knowledge Graphs in Practice"'
);
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/date>',
'"2024"^^<http://www.w3.org/2001/XMLSchema#gYear>'
);
-- Foreign key → IRI link
SELECT pg_ripple.insert_triple(
'<https://example.org/person/1>',
'<https://schema.org/affiliation>',
'<https://example.org/institution/10>'
);
Common Patterns
Pattern: Type Hierarchies
Use rdf:type and rdfs:subClassOf to create type hierarchies:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
bibo:AcademicArticle rdfs:subClassOf bibo:Article .
bibo:Article rdfs:subClassOf bibo:Document .
ex:paper/42 a bibo:AcademicArticle .
');
With RDFS inference enabled (see §2.5),
pg_ripple can automatically derive that ex:paper/42 is also a bibo:Article and
a bibo:Document.
Pattern: Multi-Valued Properties
Unlike relational columns, RDF predicates are naturally multi-valued:
-- A paper can have multiple authors — just insert multiple triples
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/creator>',
'<https://example.org/person/alice>'
);
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/creator>',
'<https://example.org/person/bob>'
);
Pattern: Typed and Language-Tagged Literals
-- Typed literal (date)
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/date>',
'"2024-03-15"^^<http://www.w3.org/2001/XMLSchema#date>'
);
-- Language-tagged string
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/title>',
'"Knowledge Graphs in Practice"@en'
);
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/title>',
'"Wissensgraphen in der Praxis"@de'
);
Pattern: Reification with RDF-Star vs Named Graphs
Two approaches for tracking who said what:
RDF-star — annotate individual triples:
SELECT pg_ripple.insert_triple(
'<< <https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> >>',
'<http://purl.org/dc/terms/source>',
'<https://example.org/dataset/pubmed>'
);
Named graphs — group triples by source:
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
ex:paper/42 dct:creator ex:person/alice .
', 'https://example.org/dataset/pubmed');
Use RDF-star when different triples about the same entity have different provenance. Use named graphs when entire batches share the same source.
Performance and Trade-offs
| Approach | Insert rate | Query flexibility | Load overhead |
|---|---|---|---|
| insert_triple() | ~5,000 triples/s | Full | Highest (per-call overhead) |
| load_turtle() | ~50,000 triples/s | Full | Low (batch dictionary encoding) |
| load_turtle_file() | ~100,000 triples/s | Full | Lowest (server-side streaming) |
- Dictionary cache: frequently used IRIs (predicates, common types) stay in the shared-memory LRU cache. Check hit rates with SELECT pg_ripple.cache_stats().
- VP table promotion: predicates with fewer than 1,000 triples share the vp_rare consolidation table. Once a predicate crosses the threshold, it gets its own dedicated VP table with dual B-tree indexes.
- Named graph overhead: the g column adds 8 bytes per triple. If you do not need named graphs, staying on the default graph (the default behavior) avoids the cost of graph-ID lookups.
After large bulk loads, run ANALYZE on the internal tables to update PostgreSQL planner statistics:
SELECT pg_ripple.vacuum();
Gotchas and Debugging
IRI formatting: All IRIs must be wrapped in angle brackets (<...>) in function calls.
Forgetting the brackets is the most common error:
-- WRONG: will be treated as a plain literal
SELECT pg_ripple.insert_triple(
'https://example.org/paper/42',
'http://purl.org/dc/terms/title',
'"Hello"'
);
-- CORRECT: angle brackets around IRIs
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/42>',
'<http://purl.org/dc/terms/title>',
'"Hello"'
);
Blank node scoping: Blank nodes from separate load_turtle() calls are independent.
Two calls using _:x create two different entities.
Literal quoting: Literals must be wrapped in double quotes within the single-quoted SQL
string. Typed literals use ^^<datatype> suffix:
-- Plain string
'"Hello"'
-- Typed integer
'"42"^^<http://www.w3.org/2001/XMLSchema#integer>'
-- Language-tagged string
'"Bonjour"@fr'
Checking what is stored: Use find_triples() with wildcards to inspect data:
-- All triples about Paper 42
SELECT * FROM pg_ripple.find_triples(
'<https://example.org/paper/42>', NULL, NULL
);
-- All triples with the dct:creator predicate
SELECT * FROM pg_ripple.find_triples(
NULL, '<http://purl.org/dc/terms/creator>', NULL
);
-- Total triple count
SELECT pg_ripple.triple_count();
Duplicate triples: Inserting the same (s, p, o, g) twice is idempotent — the second
insert returns the existing SID. Use deduplicate_all() to clean up historical duplicates.
Next Steps
- §2.2 Loading Data — bulk loading in all RDF formats with performance tuning.
- §2.3 Querying with SPARQL — query the triples you stored.
- §2.4 Validating Data Quality — enforce schema constraints with SHACL.
§2.2 Loading Data
What and Why
Getting data into pg_ripple is the first step in building a knowledge graph. pg_ripple supports every major RDF serialization format and offers three loading strategies tuned for different scenarios: inline string loading, server-side file loading, and single-triple insertion.
Choosing the right format and loading mode matters. A 10-million-triple dataset loaded
via insert_triple() in a loop takes hours; the same dataset loaded from a server-side
N-Triples file via load_ntriples_file() finishes in minutes.
How It Works
Supported Formats
| Format | Function (string) | Function (file) | Named graphs | Notes |
|---|---|---|---|---|
| Turtle | load_turtle() | load_turtle_file() | No (use load_turtle_into_graph()) | Human-readable; supports prefixes, RDF-star |
| N-Triples | load_ntriples() | load_ntriples_file() | No (use load_ntriples_into_graph()) | One triple per line; fastest to parse |
| N-Quads | load_nquads() | load_nquads_file() | Yes (inline) | N-Triples + fourth graph column |
| TriG | load_trig() | load_trig_file() | Yes (inline) | Turtle + named graph blocks |
| RDF/XML | load_rdfxml() | load_rdfxml_file() | No (use load_rdfxml_into_graph()) | Legacy XML format; widely supported |
Three Loading Modes
Mode 1: String loading — pass RDF text as a SQL string parameter. Best for small-to-medium datasets (up to a few MB) and interactive use:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:paper/1 ex:title "Hello World" .
');
Mode 2: Server-side file loading — read from a file on the PostgreSQL server's filesystem. Best for large datasets. Requires superuser privileges:
SELECT pg_ripple.load_turtle_file('/data/papers.ttl');
Mode 3: Single-triple insertion — insert one triple at a time. Best for real-time ingestion from application code:
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/1>',
'<https://example.org/title>',
'"Hello World"'
);
The Loading Pipeline
Regardless of format, every loader follows the same internal pipeline:
- Parse — deserialize the RDF serialization into (subject, predicate, object, graph) quads.
- Encode — dictionary-encode each IRI, blank node, and literal to a BIGINT ID using a batch ON CONFLICT DO NOTHING ... RETURNING insert (see the sketch after this list).
- Route — look up the predicate in _pg_ripple.predicates to find the target VP table (or vp_rare).
- Insert — batch-insert encoded (s, o, g) rows into the appropriate VP delta table.
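A conceptual sketch of the Encode step, assuming a hypothetical dictionary(id, term) catalog. The real internal schema may differ, and IDs are hash-derived rather than chosen by hand (see the XXH3 note in §2.1):
-- Batch-insert new terms; the conflict clause makes re-encoding idempotent.
-- ON CONFLICT DO NOTHING ... RETURNING yields rows only for terms that were
-- actually new, so already-known terms need a follow-up SELECT.
-- The numeric IDs below are placeholders for hash-derived values.
INSERT INTO dictionary (id, term)
VALUES (8234001, '<https://example.org/paper/42>'),
       (8234002, '<http://purl.org/dc/terms/title>')
ON CONFLICT (term) DO NOTHING
RETURNING id, term;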
String loaders process the entire input in a single transaction. If any triple fails
to parse with strict = true, the entire load is rolled back. With strict = false
(the default), malformed triples are skipped and a WARNING is emitted.
Worked Examples
Loading Turtle
The most common format for hand-authored data:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:paper/42 a bibo:AcademicArticle ;
dct:title "Knowledge Graphs in Practice"@en ;
dct:creator ex:person/alice, ex:person/bob ;
dct:date "2024-03-15"^^xsd:date ;
bibo:citedBy ex:paper/99 ;
schema:keywords "knowledge graph", "RDF", "SPARQL" .
ex:paper/99 a bibo:AcademicArticle ;
dct:title "Graph Neural Networks for Entity Resolution" ;
dct:creator ex:person/carol .
ex:person/alice foaf:name "Alice Johnson" ;
schema:affiliation ex:institution/mit .
ex:person/bob foaf:name "Bob Smith" ;
schema:affiliation ex:institution/stanford .
ex:person/carol foaf:name "Carol Williams" ;
schema:affiliation ex:institution/mit .
ex:institution/mit foaf:name "Massachusetts Institute of Technology" .
ex:institution/stanford foaf:name "Stanford University" .
');
The function returns the number of triples loaded:
-- Returns: 20
Loading N-Triples
N-Triples is one triple per line with no abbreviations — optimal for machine-generated data:
SELECT pg_ripple.load_ntriples('
<https://example.org/paper/42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/AcademicArticle> .
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/bob> .
');
Loading N-Quads (with Named Graphs)
N-Quads extend N-Triples with a fourth field for the graph IRI:
SELECT pg_ripple.load_nquads('
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" <https://example.org/graph/pubmed> .
<https://example.org/paper/99> <http://purl.org/dc/terms/title> "Graph Neural Networks" <https://example.org/graph/arxiv> .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> <https://example.org/graph/pubmed> .
');
Loading TriG (Turtle with Named Graphs)
TriG wraps Turtle blocks in GRAPH { } sections:
SELECT pg_ripple.load_trig('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
GRAPH ex:graph/pubmed {
ex:paper/100 a bibo:AcademicArticle ;
dct:title "Drug Interaction Networks" ;
dct:creator ex:person/dave .
}
GRAPH ex:graph/arxiv {
ex:paper/200 a bibo:AcademicArticle ;
dct:title "Transformer Architectures for NLP" ;
dct:creator ex:person/eve .
}
');
Loading RDF/XML
The original XML serialization of RDF — common in older datasets and OWL ontologies:
SELECT pg_ripple.load_rdfxml('
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:bibo="http://purl.org/ontology/bibo/">
<bibo:AcademicArticle rdf:about="https://example.org/paper/42">
<dct:title>Knowledge Graphs in Practice</dct:title>
<dct:creator rdf:resource="https://example.org/person/alice"/>
</bibo:AcademicArticle>
</rdf:RDF>
');
Loading from Server-Side Files
For large datasets, server-side file loading avoids transferring data through the SQL protocol:
-- Load a large N-Triples dump (superuser required)
SELECT pg_ripple.load_ntriples_file('/data/exports/papers.nt');
-- Load Turtle with strict parsing (abort on any error)
SELECT pg_ripple.load_turtle_file('/data/exports/ontology.ttl', true);
-- Load into a specific named graph
SELECT pg_ripple.load_turtle_file_into_graph(
'/data/exports/pubmed.ttl',
'https://example.org/graph/pubmed'
);
File loading functions read from the PostgreSQL server's filesystem, not the client's.
The path must be accessible to the postgres OS user. These functions require superuser
privileges for security reasons.
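If superuser access is not available, one client-side workaround is to read the file with psql on the client and pass it to a string loader (the path below is a placeholder):
-- psql meta-command: the backquoted command runs on the client machine
\set ttl `cat /local/exports/papers.ttl`
SELECT pg_ripple.load_turtle(:'ttl');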
Loading Turtle-Star (RDF-Star)
pg_ripple's Turtle parser supports RDF-star quoted triples natively:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<< ex:paper/42 ex:cites ex:paper/99 >> ex:confidence "0.92"^^xsd:decimal .
<< ex:paper/42 ex:cites ex:paper/99 >> ex:source ex:system/citation-extractor .
');
Loading into Named Graphs
Load data into a specific graph without the TriG/N-Quads format:
-- Create the graph first (optional — auto-created on load)
SELECT pg_ripple.create_graph('https://example.org/graph/2024');
-- Load Turtle into the named graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:paper/300 a bibo:AcademicArticle ;
dct:title "New Findings in Graph Theory" ;
dct:creator ex:person/frank .
', 'https://example.org/graph/2024');
Using SPARQL Update for Loading
SPARQL INSERT DATA is another way to add triples:
SELECT pg_ripple.sparql_update('
PREFIX ex: <https://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
INSERT DATA {
ex:paper/500 a bibo:AcademicArticle ;
dct:title "SPARQL Performance Tuning" ;
dct:creator ex:person/alice .
}
');
Common Patterns
Pattern: ETL Pipeline
A typical ETL pipeline loads data in stages:
-- Step 1: Load the ontology
SELECT pg_ripple.load_turtle_file('/data/ontology.ttl');
-- Step 2: Load reference data
SELECT pg_ripple.load_ntriples_file('/data/institutions.nt');
-- Step 3: Load the main dataset
SELECT pg_ripple.load_ntriples_file('/data/papers.nt');
-- Step 4: Load supplementary data into a named graph
SELECT pg_ripple.load_nquads_file('/data/citations.nq');
-- Step 5: Update statistics
SELECT pg_ripple.vacuum();
-- Step 6: Verify the load
SELECT pg_ripple.triple_count();
SELECT pg_ripple.stats();
Pattern: Incremental Loading
For streaming data ingestion, use insert_triple() inside application code:
-- Application inserts triples as events arrive
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/new123>',
'<http://purl.org/dc/terms/title>',
'"Just Published: A New Study"'
);
-- Periodically compact HTAP tables
SELECT pg_ripple.compact();
Pattern: Strict vs Lenient Parsing
-- Lenient (default): skip bad triples, emit WARNINGs
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:good ex:rel ex:target .
ex:bad ex:rel "unclosed literal .
ex:also_good ex:rel ex:other .
', false);
-- Returns: 2 (skipped the bad triple)
-- Strict: abort on any parse error
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:good ex:rel ex:target .
ex:bad ex:rel "unclosed literal .
', true);
-- ERROR: Turtle parse error at line 4
Pattern: Loading OWL Ontologies
-- Auto-detects format from file extension (.ttl, .nt, .xml, .rdf, .owl)
SELECT pg_ripple.load_owl_ontology('/data/ontologies/foaf.rdf');
-- Or load explicitly as RDF/XML
SELECT pg_ripple.load_rdfxml_file('/data/ontologies/dublin_core.rdf');
Performance and Trade-offs
Throughput by Loading Mode
| Mode | Approximate throughput | Use case |
|---|---|---|
| insert_triple() | 3,000–8,000 triples/s | Real-time ingestion, single-triple updates |
| load_turtle() / load_ntriples() | 30,000–80,000 triples/s | Interactive bulk loads up to a few MB |
| load_ntriples_file() | 80,000–200,000 triples/s | Large server-side files |
| load_turtle_file() | 60,000–150,000 triples/s | Large server-side Turtle files |
N-Triples is consistently faster than Turtle because it requires no prefix expansion
or abbreviation handling. For maximum throughput on large datasets, convert to N-Triples
first: rapper -i turtle -o ntriples data.ttl > data.nt
Format Selection Guide
| Scenario | Recommended format |
|---|---|
| Hand-authored data | Turtle (readable, supports prefixes) |
| Machine-generated export | N-Triples (fastest parsing, one line per triple) |
| Data with named graphs | N-Quads or TriG |
| Legacy XML datasets | RDF/XML |
| Maximum load speed | N-Triples via load_ntriples_file() |
ANALYZE After Loads
After loading significant amounts of data, update PostgreSQL planner statistics:
-- Run ANALYZE on all VP tables
SELECT pg_ripple.vacuum();
This ensures the query planner has accurate row-count estimates for join ordering.
Batch Size Considerations
For string-based loaders, the entire input is processed in one transaction. Very large strings (hundreds of MB) can cause memory pressure. For datasets over 50 MB, prefer file-based loading:
-- Instead of a huge string literal:
-- SELECT pg_ripple.load_ntriples('... 100 million lines ...');
-- Use file loading:
SELECT pg_ripple.load_ntriples_file('/data/huge_dataset.nt');
Gotchas and Debugging
Blank Node Scoping
Each load_turtle() call creates a fresh blank-node scope. Two separate calls using
_:x produce two different internal IDs:
-- Call 1: _:x maps to internal ID 12345
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
_:x ex:name "Alice" .
ex:paper/1 ex:author _:x .
');
-- Call 2: _:x maps to internal ID 67890 (different!)
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
_:x ex:name "Bob" .
ex:paper/2 ex:author _:x .
');
If you need the same anonymous node across loads, use a stable IRI instead:
SELECT pg_ripple.insert_triple(
'<https://example.org/anon/shared-node>',
'<https://example.org/name>',
'"Shared Entity"'
);
Character Encoding
All loaders expect UTF-8 input. Non-UTF-8 data causes parse errors:
-- If your file is Latin-1, convert first:
-- iconv -f ISO-8859-1 -t UTF-8 data.nt > data_utf8.nt
SELECT pg_ripple.load_ntriples_file('/data/data_utf8.nt');
Verifying Loaded Data
After loading, verify with find_triples() or triple_count():
-- Check total triples
SELECT pg_ripple.triple_count();
-- Inspect specific triples
SELECT * FROM pg_ripple.find_triples(
'<https://example.org/paper/42>', NULL, NULL
);
-- Check per-predicate statistics
SELECT pg_ripple.stats();
File Path Errors
File loaders read from the server filesystem. Common errors:
-- ERROR: could not open file "/data/papers.nt": No such file or directory
-- Fix: ensure the file exists and is readable by the postgres OS user
-- ERROR: permission denied for function load_turtle_file
-- Fix: file loaders require superuser; use string loaders for non-superusers
Duplicate Handling
Loading the same data twice does not create duplicates — VP tables use ON CONFLICT DO NOTHING:
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:a ex:rel ex:b .
');
-- Returns: 1
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:a ex:rel ex:b .
');
-- Returns: 0 (already exists)
Next Steps
- §2.1 Storing Knowledge — understand the triple model and named graphs.
- §2.3 Querying with SPARQL — query the data you loaded.
- §2.6 Exporting and Sharing — export data in various formats.
§2.3 Querying with SPARQL
What and Why
SPARQL is the W3C standard query language for RDF data — the SQL of the knowledge graph world. pg_ripple translates SPARQL queries into optimized PostgreSQL SQL behind the scenes, so you get the expressiveness of SPARQL with the performance of a mature relational engine.
Why SPARQL instead of raw SQL against VP tables?
- Graph pattern matching: find paths, cycles, and subgraph shapes naturally.
- Property paths: traverse variable-length relationships with +, *, and ?.
- Federation: query remote SPARQL endpoints alongside local data.
- Standards compliance: queries are portable across triple stores.
- Update support: INSERT DATA and DELETE DATA for programmatic modifications.
pg_ripple supports all four SPARQL query forms (SELECT, CONSTRUCT, DESCRIBE, ASK) and
SPARQL Update (INSERT DATA, DELETE DATA, DELETE/INSERT WHERE).
How It Works
The SPARQL Pipeline
- Parse — spargebra parses the SPARQL text into an algebra tree.
- Optimize — sparopt applies algebraic optimizations (filter pushdown, join reordering).
- Translate — pg_ripple's SQL generator converts the algebra to PostgreSQL SQL with integer-only VP table joins.
- Cache — the plan cache stores translated SQL keyed by SPARQL text hash.
- Execute — SPI executes the SQL; results are batch-decoded from integer IDs back to IRIs and literals.
- Return — each result row is returned as a JSONB object.
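The cache step is directly observable: run the same query text twice and compare plan_cache_stats() (covered under Performance below) before and after. A minimal sketch:
-- First run: cache miss (parse + optimize + translate)
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?s WHERE { ?s dct:title ?t }
');
-- Second run with identical text: cache hit, translation is skipped
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?s WHERE { ?s dct:title ?t }
');
SELECT pg_ripple.plan_cache_stats();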
Key Functions
| Function | Purpose |
|---|---|
| sparql(query) | Execute SELECT or ASK; returns JSONB rows |
| sparql_ask(query) | Execute ASK; returns boolean |
| sparql_construct(query) | Execute CONSTRUCT; returns triple JSONB rows |
| sparql_construct_turtle(query) | CONSTRUCT → Turtle text |
| sparql_construct_jsonld(query) | CONSTRUCT → JSON-LD JSONB |
| sparql_describe(query) | DESCRIBE with CBD; returns triple JSONB rows |
| sparql_describe_turtle(query) | DESCRIBE → Turtle text |
| sparql_update(query) | INSERT DATA / DELETE DATA; returns affected count |
| sparql_explain(query, analyze) | Show generated SQL or EXPLAIN ANALYZE output |
| explain_sparql(query, format) | Extended explain with SQL, text, JSON, or algebra output |
Worked Examples
All examples assume the bibliographic dataset from §2.1 and §2.2 has been loaded.
Basic Triple Patterns
Find all papers and their titles:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle .
?paper dct:title ?title .
}
');
Each row is a JSONB object like {"paper": "<https://example.org/paper/42>", "title": "\"Knowledge Graphs in Practice\""}.
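Because each row is plain JSONB, the usual PostgreSQL operators apply. A minimal sketch, assuming sparql() returns a set of jsonb values so the alias names the column:
-- Extract the title binding from each result row
SELECT r ->> 'title' AS title
FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?title WHERE { ?paper dct:title ?title }
') AS r;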
Filtering Results
Find papers published after 2023:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?paper ?title ?date
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title ;
dct:date ?date .
FILTER (?date > "2023-01-01"^^xsd:date)
}
');
OPTIONAL Patterns
Include authors even if they have no affiliation:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <https://schema.org/>
SELECT ?paper ?authorName ?instName
WHERE {
?paper dct:creator ?author .
?author foaf:name ?authorName .
OPTIONAL {
?author schema:affiliation ?inst .
?inst foaf:name ?instName .
}
}
');
UNION
Find entities that are either papers or people:
SELECT * FROM pg_ripple.sparql('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?entity ?label
WHERE {
{
?entity a bibo:AcademicArticle .
?entity dct:title ?label .
}
UNION
{
?entity a foaf:Person .
?entity foaf:name ?label .
}
}
');
MINUS
Find papers that have no citations:
SELECT * FROM pg_ripple.sparql('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
MINUS {
?paper bibo:citedBy ?other .
}
}
');
Aggregation
Count papers per institution:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?instName (COUNT(DISTINCT ?paper) AS ?paperCount)
WHERE {
?paper dct:creator ?author .
?author schema:affiliation ?inst .
?inst foaf:name ?instName .
}
GROUP BY ?instName
ORDER BY DESC(?paperCount)
');
Subqueries
Find the most prolific author and all their papers:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?authorName ?paper ?title
WHERE {
{
SELECT ?author (COUNT(?p) AS ?count)
WHERE {
?p dct:creator ?author .
}
GROUP BY ?author
ORDER BY DESC(?count)
LIMIT 1
}
?author foaf:name ?authorName .
?paper dct:creator ?author ;
dct:title ?title .
}
');
Property Paths
Property paths let you traverse variable-length relationships.
Transitive closure (+) — find all classes an entity belongs to through the subclass hierarchy:
SELECT * FROM pg_ripple.sparql('
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?entity ?superClass
WHERE {
?entity rdf:type/rdfs:subClassOf+ ?superClass .
}
');
Zero-or-more (*) — include the starting node:
SELECT * FROM pg_ripple.sparql('
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?ancestor
WHERE {
?class rdfs:subClassOf* ?ancestor .
}
');
Optional step (?) — zero or one hops:
SELECT * FROM pg_ripple.sparql('
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?nameOrInst
WHERE {
?person schema:affiliation? ?target .
?target foaf:name ?nameOrInst .
}
');
Sequence path (/) — chain properties:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?paper ?instName
WHERE {
?paper dct:creator/schema:affiliation/foaf:name ?instName .
}
');
Alternative path (|) — match either property:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
SELECT ?entity ?label
WHERE {
?entity (dct:title | schema:name) ?label .
}
');
Inverse path (^) — traverse in reverse:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?author ?paper
WHERE {
?author ^dct:creator ?paper .
}
');
GRAPH Patterns
Query data in specific named graphs:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title ?graph
WHERE {
GRAPH ?graph {
?paper dct:title ?title .
}
}
');
Query a specific named graph:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
GRAPH <https://example.org/graph/pubmed> {
?paper dct:title ?title .
}
}
');
ASK Queries
Check if something exists:
SELECT pg_ripple.sparql_ask('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
ASK {
?paper a bibo:AcademicArticle ;
dct:title "Knowledge Graphs in Practice" .
}
');
-- Returns: true
CONSTRUCT Queries
Build new triples from query results:
SELECT * FROM pg_ripple.sparql_construct('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <https://example.org/>
CONSTRUCT {
?author ex:worksOn ?paper .
?paper ex:authoredAt ?inst .
}
WHERE {
?paper dct:creator ?author .
?author schema:affiliation ?inst .
}
');
Get CONSTRUCT results as Turtle:
SELECT pg_ripple.sparql_construct_turtle('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <https://example.org/>
CONSTRUCT {
?author ex:wrote ?paper .
}
WHERE {
?paper dct:creator ?author .
}
');
Get CONSTRUCT results as JSON-LD:
SELECT pg_ripple.sparql_construct_jsonld('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <https://example.org/>
CONSTRUCT {
?author ex:wrote ?paper .
}
WHERE {
?paper dct:creator ?author .
}
');
DESCRIBE Queries
Get everything about an entity using Concise Bounded Description:
SELECT * FROM pg_ripple.sparql_describe('
DESCRIBE <https://example.org/paper/42>
');
Get the description as Turtle:
SELECT pg_ripple.sparql_describe_turtle('
DESCRIBE <https://example.org/person/alice>
');
Choose the describe strategy:
-- Symmetric CBD: include triples where the entity is the object too
SELECT * FROM pg_ripple.sparql_describe(
'DESCRIBE <https://example.org/person/alice>',
'scbd'
);
SPARQL Update
Insert new triples:
SELECT pg_ripple.sparql_update('
PREFIX ex: <https://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>
INSERT DATA {
ex:paper/600 a <http://purl.org/ontology/bibo/AcademicArticle> ;
dct:title "Emerging Trends in Knowledge Graphs" ;
dct:creator ex:person/alice .
}
');
-- Returns: 3
Delete specific triples:
SELECT pg_ripple.sparql_update('
PREFIX ex: <https://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>
DELETE DATA {
ex:paper/600 dct:title "Emerging Trends in Knowledge Graphs" .
}
');
-- Returns: 1
Query Debugging with EXPLAIN
View the generated SQL without executing:
SELECT pg_ripple.sparql_explain('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
}
', false);
Run EXPLAIN ANALYZE to see execution times:
SELECT pg_ripple.sparql_explain('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
}
', true);
Use the extended explain with format options:
-- Show just the generated SQL
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'sql');
-- Show EXPLAIN ANALYZE as JSON (for programmatic consumption)
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'json');
-- Show the spargebra algebra tree
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'sparql_algebra');
Common Patterns
Pattern: Star Queries (Multiple Predicates on the Same Subject)
The optimizer detects star patterns and collapses them into efficient multi-way joins:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title ?date
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title ;
dct:date ?date ;
schema:keywords ?kw .
FILTER (CONTAINS(?kw, "knowledge"))
}
');
Pattern: Existence Checks with FILTER EXISTS
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
FILTER EXISTS {
?paper bibo:citedBy ?other .
}
}
');
Pattern: VALUES Clause for Parameterized Queries
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
VALUES ?paper {
<https://example.org/paper/42>
<https://example.org/paper/99>
}
?paper dct:title ?title .
}
');
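Because the query is ordinary text, the VALUES block can be assembled server-side with PostgreSQL's format(). A sketch; the two paper IRIs are placeholders, and untrusted input should be validated before splicing:
-- Build the VALUES block from parameters
SELECT * FROM pg_ripple.sparql(format('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
VALUES ?paper { <%s> <%s> }
?paper dct:title ?title .
}',
'https://example.org/paper/42',
'https://example.org/paper/99'
));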
Pattern: BIND and Computed Values
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?paper ?title ?yearLabel
WHERE {
?paper dct:title ?title ;
dct:date ?date .
BIND(YEAR(?date) AS ?year)
BIND(CONCAT("Published in ", STR(?year)) AS ?yearLabel)
}
');
Performance and Trade-offs
Plan Cache
pg_ripple caches translated SQL by SPARQL query hash. Repeated queries skip the parse and translate steps:
-- Check cache statistics
SELECT pg_ripple.plan_cache_stats();
-- Returns: {"hits": 42, "misses": 5, "size": 5, "capacity": 128, "hit_rate": 0.89}
-- Reset the cache (e.g., after schema changes)
SELECT pg_ripple.plan_cache_reset();
Filter Pushdown
SPARQL FILTERs on bound constants are encoded to integers before SQL generation. This means the database compares integers, not strings:
-- The constant creator IRI is dictionary-encoded before SQL generation,
-- so the comparison runs on integers:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper WHERE {
?paper dct:creator <https://example.org/person/alice> .
}
');
Property Path Depth Limit
Recursive property paths (+, *) compile to WITH RECURSIVE ... CYCLE. The GUC
pg_ripple.max_path_depth (default: 50) prevents runaway recursion:
-- Increase depth for deep hierarchies
SET pg_ripple.max_path_depth = 100;
Setting max_path_depth very high on cyclic graphs can cause slow queries. pg_ripple
uses PostgreSQL 18's CYCLE clause for hash-based cycle detection, but wide graphs
still accumulate many intermediate rows.
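Like any GUC, the limit can also be scoped to a single transaction with SET LOCAL, so one deep traversal does not change the session default:
BEGIN;
SET LOCAL pg_ripple.max_path_depth = 200;
SELECT * FROM pg_ripple.sparql('
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?ancestor WHERE { ?class rdfs:subClassOf+ ?ancestor }
');
COMMIT;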
Full-Text Search Integration
Create a GIN index for fast text search on specific predicates:
-- Index the dct:title predicate for full-text search
SELECT pg_ripple.fts_index('<http://purl.org/dc/terms/title>');
-- Then CONTAINS() and REGEX() filters on dct:title objects use the GIN index
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
?paper dct:title ?title .
FILTER (CONTAINS(?title, "Knowledge"))
}
');
Or use the direct full-text search function:
SELECT * FROM pg_ripple.fts_search(
'knowledge & graph',
'<http://purl.org/dc/terms/title>'
);
Gotchas and Debugging
SPARQL Syntax Errors
pg_ripple uses the spargebra parser, which gives precise error messages:
SELECT * FROM pg_ripple.sparql('
SELECT ?x WHERE { ?x ?p }
');
-- ERROR: SPARQL parse error: Expected '.' or '}' at line 2
Check the query compiles before running:
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper WHERE { ?paper dct:title ?title }
', 'sql');
No Results When Expected
Common causes:
- Missing PREFIX declarations: dct:title requires a PREFIX dct: declaration. Without it, the parser treats it as a relative IRI.
- Wrong literal format: "42" is a string, not a number. Use "42"^^xsd:integer.
- Case sensitivity: IRIs are case-sensitive. <https://Example.org/X> and <https://example.org/x> are different.
Debug by checking what is stored:
-- Check if the predicate exists
SELECT * FROM pg_ripple.find_triples(
NULL, '<http://purl.org/dc/terms/title>', NULL
);
Slow Queries
- Check the generated SQL with sparql_explain().
- Look for sequential scans on large VP tables — run pg_ripple.vacuum() to update statistics.
- For property paths, check max_path_depth — lower it if the query is exploring too many paths.
- Check the plan cache hit rate — a low hit rate means many unique queries are being parsed repeatedly.
-- Step 1: See the execution plan
SELECT pg_ripple.sparql_explain('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', true);
-- Step 2: Update statistics
SELECT pg_ripple.vacuum();
-- Step 3: Check plan cache
SELECT pg_ripple.plan_cache_stats();
SPARQL Update Limitations
sparql_update() supports INSERT DATA and DELETE DATA for ground triples, plus
pattern-based DELETE/INSERT WHERE with variables for flexible graph modifications.
Use delete_triple() for programmatic single-triple deletion.
Next Steps
- §2.4 Validating Data Quality — enforce constraints on your data with SHACL.
- §2.5 Reasoning and Inference — derive new facts with Datalog rules.
- §2.8 APIs and Integration — access SPARQL from application code via the HTTP endpoint.
§2.4 Validating Data Quality
What and Why
Storing knowledge is only half the battle — you also need to ensure it is correct. SHACL (Shapes Constraint Language) is the W3C standard for declaring and validating constraints on RDF data. It answers questions like:
- Does every paper have at least one author?
- Are all email addresses syntactically valid?
- Does every person have exactly one name?
- Are date values well-formed?
pg_ripple integrates SHACL validation directly into the database engine. You can validate on demand, enforce constraints synchronously on every insert, or queue triples for asynchronous background validation with violations routed to a dead-letter queue.
SHACL is to RDF what CHECK constraints and triggers are to relational databases — but SHACL shapes are declarative, composable, and standardized across all RDF systems.
How It Works
The SHACL Model
A SHACL shape declares constraints on a set of focus nodes (entities matching a target pattern). Each shape contains one or more property shapes that constrain the values of a specific predicate.
NodeShape (target: instances of bibo:AcademicArticle)
└─ PropertyShape (path: dct:title)
├─ sh:minCount 1 ← every paper must have at least one title
├─ sh:maxCount 1 ← at most one title
└─ sh:datatype xsd:string ← title must be a string
Validation Modes
| Mode | GUC setting | Behavior |
|---|---|---|
| Off (default) | pg_ripple.shacl_mode = 'off' | No automatic validation |
| Sync | pg_ripple.shacl_mode = 'sync' | Every insert_triple() is validated before commit; violations raise an ERROR |
| Async | pg_ripple.shacl_mode = 'async' | Triples are inserted immediately; a background worker validates and routes violations to the dead-letter queue |
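The mode is an ordinary GUC, so it can be changed per session and inspected with SHOW:
-- Enable synchronous validation for this session
SET pg_ripple.shacl_mode = 'sync';
SHOW pg_ripple.shacl_mode;
-- Restore the default
SET pg_ripple.shacl_mode = 'off';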
Supported Constraints
| Constraint | Description |
|---|---|
| sh:minCount | Minimum number of values |
| sh:maxCount | Maximum number of values |
| sh:datatype | Value must have a specific XSD datatype |
| sh:class | Value must be an instance of a class |
| sh:in | Value must be from an enumerated set |
| sh:pattern | Value must match a regex |
| sh:node | Value must conform to another shape |
| sh:or | Value must conform to at least one of several shapes |
| sh:and | Value must conform to all listed shapes |
| sh:not | Value must NOT conform to a shape |
| sh:qualifiedValueShape | Qualified cardinality constraints |
| sh:hasValue | At least one value must equal the given term |
| sh:nodeKind | Value must be IRI, blank node, or literal |
| sh:languageIn | Language tag must be in the allowed list |
| sh:uniqueLang | No duplicate language tags |
| sh:lessThan / sh:greaterThan | Comparative constraints between properties |
| sh:closed | Reject unknown predicates |
Worked Examples
Loading Simple Shapes
Define shapes for the bibliographic dataset:
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix schema: <https://schema.org/> .
ex:PaperShape a sh:NodeShape ;
sh:targetClass bibo:AcademicArticle ;
sh:property [
sh:path dct:title ;
sh:minCount 1 ;
sh:maxCount 1 ;
sh:datatype xsd:string ;
] ;
sh:property [
sh:path dct:creator ;
sh:minCount 1 ;
sh:class foaf:Person ;
] .
ex:PersonShape a sh:NodeShape ;
sh:targetClass foaf:Person ;
sh:property [
sh:path foaf:name ;
sh:minCount 1 ;
sh:maxCount 1 ;
sh:datatype xsd:string ;
] ;
sh:property [
sh:path schema:affiliation ;
sh:maxCount 1 ;
sh:nodeKind sh:IRI ;
] .
');
-- Returns: 2 (number of shapes loaded)
Running Validation
Validate the default graph against all active shapes:
SELECT pg_ripple.validate();
The result is a JSONB validation report:
{
"conforms": false,
"violations": [
{
"focusNode": "<https://example.org/paper/99>",
"shapeIRI": "<https://example.org/PaperShape>",
"path": "<http://purl.org/dc/terms/creator>",
"constraint": "sh:class",
"message": "value <https://example.org/person/carol> is not an instance of <http://xmlns.com/foaf/0.1/Person>",
"severity": "sh:Violation"
}
]
}
Validate a specific named graph:
SELECT pg_ripple.validate('https://example.org/graph/pubmed');
Validate all graphs at once:
SELECT pg_ripple.validate('*');
Synchronous Validation
Enable sync mode so invalid triples are rejected at insert time:
SET pg_ripple.shacl_mode = 'sync';
-- This succeeds (paper has a title)
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/700>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
'<http://purl.org/ontology/bibo/AcademicArticle>'
);
-- This would fail if the shape requires dct:title and the paper doesn't have one yet
-- (sync validation checks per-triple, not transactionally)
Synchronous validation adds overhead to every insert_triple() call. Use it for
low-volume, high-integrity scenarios. For bulk loads, use 'off' mode and validate
after loading.
Asynchronous Validation with Dead-Letter Queue
Enable async mode for high-throughput pipelines:
SET pg_ripple.shacl_mode = 'async';
-- Triples are inserted immediately; validation happens in the background
SELECT pg_ripple.insert_triple(
'<https://example.org/paper/800>',
'<http://purl.org/dc/terms/title>',
'"A New Paper"'
);
-- Check the validation queue length
SELECT pg_ripple.validation_queue_length();
-- Manually process the queue (normally handled by background worker)
SELECT pg_ripple.process_validation_queue(1000);
-- Check for violations
SELECT pg_ripple.dead_letter_count();
-- View the full dead-letter queue
SELECT pg_ripple.dead_letter_queue();
Complex Shapes
Disjunctive constraints (sh:or):
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
ex:DateShape a sh:NodeShape ;
sh:targetSubjectsOf dct:date ;
sh:property [
sh:path dct:date ;
sh:or (
[ sh:datatype xsd:date ]
[ sh:datatype xsd:gYear ]
[ sh:datatype xsd:dateTime ]
) ;
] .
');
Closed shapes (reject unknown predicates):
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .
ex:StrictPaperShape a sh:NodeShape ;
sh:targetClass bibo:AcademicArticle ;
sh:closed true ;
sh:ignoredProperties (
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
) ;
sh:property [
sh:path dct:title ;
sh:minCount 1 ;
] ;
sh:property [
sh:path dct:creator ;
sh:minCount 1 ;
] ;
sh:property [
sh:path dct:date ;
sh:maxCount 1 ;
] ;
sh:property [
sh:path schema:keywords ;
] ;
sh:property [
sh:path bibo:cites ;
] ;
sh:property [
sh:path bibo:citedBy ;
] .
');
Qualified cardinality:
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
ex:CollabPaperShape a sh:NodeShape ;
sh:targetClass bibo:AcademicArticle ;
sh:property [
sh:path dct:creator ;
sh:qualifiedValueShape [
sh:class foaf:Person ;
] ;
sh:qualifiedMinCount 2 ;
] .
');
Managing Shapes
-- List all loaded shapes
SELECT * FROM pg_ripple.list_shapes();
-- Deactivate a shape without deleting it
SELECT pg_ripple.disable_rule_set('custom');
-- Remove a shape entirely
SELECT pg_ripple.drop_shape('https://example.org/StrictPaperShape');
SHACL DAG Monitors
For real-time violation detection, enable DAG monitors (requires pg_trickle):
-- Load shapes first
SELECT pg_ripple.load_shacl('...');
-- Enable per-shape violation stream tables
SELECT pg_ripple.enable_shacl_dag_monitors();
-- View the live violation summary
SELECT * FROM _pg_ripple.violation_summary_dag;
-- List active monitors
SELECT * FROM pg_ripple.list_shacl_dag_monitors();
-- Disable when no longer needed
SELECT pg_ripple.disable_shacl_dag_monitors();
Common Patterns
Pattern: Validate After Bulk Load
The most common workflow — load first, validate second:
-- Turn off validation during load
SET pg_ripple.shacl_mode = 'off';
-- Load data
SELECT pg_ripple.load_turtle_file('/data/papers.ttl');
-- Load shapes
SELECT pg_ripple.load_shacl('...');
-- Validate
SELECT pg_ripple.validate();
Pattern: Data Quality Dashboard
Use the dead-letter queue as a data quality monitor:
-- Enable async validation
SET pg_ripple.shacl_mode = 'async';
-- Periodically check violation counts
SELECT pg_ripple.dead_letter_count();
-- Get violation details
SELECT pg_ripple.dead_letter_queue();
-- With pg_trickle: enable violation summary stream table
SELECT pg_ripple.enable_shacl_monitors();
SELECT * FROM _pg_ripple.violation_summary;
Pattern: Embedding Completeness Check
Ensure all entities have vector embeddings (see §2.7):
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .
@prefix pg: <urn:pg_ripple:> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:EmbeddingCompletenessShape a sh:NodeShape ;
sh:targetClass bibo:AcademicArticle ;
sh:property [
sh:path pg:hasEmbedding ;
sh:minCount 1 ;
sh:hasValue "true"^^xsd:boolean ;
] .
');
-- Add embedding triples for entities that have been embedded
SELECT pg_ripple.add_embedding_triples();
-- Check completeness
SELECT pg_ripple.validate();
Pattern: Multi-Language Support
Ensure labels exist in required languages:
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:LabelShape a sh:NodeShape ;
sh:targetSubjectsOf rdfs:label ;
sh:property [
sh:path rdfs:label ;
sh:languageIn ( "en" "de" "fr" ) ;
sh:uniqueLang true ;
] .
');
Performance and Trade-offs
| Validation mode | Overhead | Data integrity | Use case |
|---|---|---|---|
| off | None | Manual check with validate() | Bulk loads, development |
| sync | High (per-triple check) | Immediate rejection | Low-volume critical data |
| async | Low (background worker) | Eventual (violations in DLQ) | High-throughput pipelines |
- Shape count: validation time scales linearly with the number of active shapes and focus nodes. Deactivate shapes you do not need.
- DAG monitors: per-shape stream tables run in IMMEDIATE mode — violations are detected within the same transaction. But pg_trickle must be installed.
- Dead-letter queue: grows without bound. Periodically review and clean it:
-- Remove violations older than 30 days
DELETE FROM _pg_ripple.dead_letter_queue
WHERE detected_at < NOW() - INTERVAL '30 days';
Shapes with sh:maxCount 1 allow the SPARQL query engine to omit DISTINCT on that
predicate's joins. Shapes with sh:minCount 1 allow downgrading LEFT JOIN to
INNER JOIN. Declaring accurate shapes improves both data quality and query performance.
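To see whether a shape affected a plan, inspect the generated SQL with sparql_explain() (§2.3) before and after loading the shape. A sketch, assuming the PaperShape from the examples above:
-- With PaperShape loaded, the OPTIONAL on dct:creator (sh:minCount 1)
-- may compile to an inner join instead of a left join
SELECT pg_ripple.sparql_explain('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?creator
WHERE {
?paper dct:title ?title .
OPTIONAL { ?paper dct:creator ?creator }
}
', false);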
Gotchas and Debugging
Shape Loading Errors
If load_shacl() returns 0, the Turtle may have syntax errors. Check for:
- Missing @prefix declarations
- Unclosed brackets in blank node property lists
- Missing semicolons between property shapes
Sync Mode and Transaction Boundaries
Sync validation checks individual triples, not entire transactions. A paper might pass
the dct:creator check (because the triple being inserted is the author link) but fail
the dct:title check because the title has not been inserted yet in the same transaction.
Solution: insert all triples for an entity, then validate explicitly:
SET pg_ripple.shacl_mode = 'off';
-- Insert all triples for the entity
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>', '<http://purl.org/ontology/bibo/AcademicArticle>');
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://purl.org/dc/terms/title>', '"My Paper"');
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://purl.org/dc/terms/creator>', '<https://example.org/person/alice>');
-- Then validate
SELECT pg_ripple.validate();
Viewing Shape Definitions
-- List all shapes and their active status
SELECT * FROM pg_ripple.list_shapes();
Validation Report Interpretation
The validation report JSONB has two top-level keys:
- conforms: true if no violations were found
- violations: array of violation objects, each with focusNode, shapeIRI, path, constraint, message, and severity
-- Extract just the violation messages
SELECT v->>'message'
FROM jsonb_array_elements(
(SELECT pg_ripple.validate()::jsonb -> 'violations')
) AS v;
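The same jsonb tooling supports quick roll-ups, for example counting violations per constraint type:
-- Count violations by constraint kind
SELECT v->>'constraint' AS constraint_kind, COUNT(*) AS violation_count
FROM jsonb_array_elements(
(SELECT pg_ripple.validate() -> 'violations')
) AS v
GROUP BY 1
ORDER BY violation_count DESC;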
Next Steps
- §2.5 Reasoning and Inference — derive new facts from rules; SHACL shapes interact with inference.
- §2.6 Exporting and Sharing — SHACL quality enforcement for GraphRAG exports.
- §2.7 AI Retrieval and GraphRAG — embedding completeness shapes.
§2.5 Reasoning and Inference
What and Why
Inference lets pg_ripple derive new facts from existing data using logical rules. If Alice works at MIT and MIT is located in Massachusetts, inference can conclude that Alice is located in Massachusetts — without anyone explicitly inserting that triple.
pg_ripple ships a full Datalog reasoning engine that supports:
- Built-in rule sets: RDFS and OWL RL entailment out of the box.
- Custom rules: domain-specific inference in a Turtle-flavoured Datalog syntax.
- Stratified negation: "flag people without an email address."
- Aggregation: COUNT, SUM, MIN, MAX, AVG over grouped triple patterns.
- Magic sets: goal-directed inference that only materialises relevant facts.
- Semi-naive evaluation: efficient fixpoint iteration that skips unchanged rows.
- Well-Founded Semantics: handle programs with cyclic negation (v0.32.0).
Derived triples are stored with source = 1 (inferred) alongside explicit triples
(source = 0), so you can always distinguish asserted from derived facts.
How It Works
The Datalog Pipeline
- Parse — rules are parsed from a Turtle-flavoured Datalog syntax into an internal Rule IR.
- Stratify — the dependency graph is analyzed; rules are grouped into strata such that negated predicates are fully computed in lower strata.
- Compile — each stratum is compiled to PostgreSQL SQL: non-recursive rules become INSERT ... SELECT, recursive rules become WITH RECURSIVE ... CYCLE.
- Execute — strata run bottom-up; each stratum's SQL is executed via SPI, inserting derived triples into VP delta tables.
- Fixpoint — recursive strata iterate until no new facts are derived (semi-naive evaluation).
Rule Syntax
Rules use a Prolog-like notation with RDF terms. The prefix registry from register_prefix() is available:
head_triple :- body_triple1 , body_triple2 .
Variables start with ?. Constants are IRIs (prefixed or full). Negation uses NOT.
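As a concrete sketch (assuming the ex: prefix is registered; see Gotchas below), a recursive ancestor rule and its base case look like this:
-- Base case plus recursive step; infer() iterates to fixpoint
SELECT pg_ripple.load_rules('
?x ex:ancestorOf ?y :- ?x ex:parentOf ?y .
?x ex:ancestorOf ?z :- ?x ex:parentOf ?y , ?y ex:ancestorOf ?z .
', 'ancestry');
SELECT pg_ripple.infer('ancestry');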
Built-in Rule Sets
| Name | Rules | What it covers |
|---|---|---|
| rdfs | ~12 rules | rdfs:subClassOf transitivity, rdfs:subPropertyOf transitivity, rdf:type propagation via subclass/subproperty, rdfs:domain/rdfs:range inference |
| owl-rl | ~80 rules | OWL RL profile: symmetric/transitive/inverse properties, owl:equivalentClass, owl:sameAs, owl:unionOf, owl:intersectionOf, property chains, and more |
Worked Examples
Loading Built-in RDFS Rules
-- Load the RDFS entailment rules
SELECT pg_ripple.load_rules_builtin('rdfs');
-- Returns: 12 (number of rules)
-- Load some class hierarchy data
SELECT pg_ripple.load_turtle('
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix ex: <https://example.org/> .
bibo:AcademicArticle rdfs:subClassOf bibo:Article .
bibo:Article rdfs:subClassOf bibo:Document .
bibo:Document rdfs:subClassOf rdfs:Resource .
ex:paper/42 rdf:type bibo:AcademicArticle .
');
-- Run inference
SELECT pg_ripple.infer('rdfs');
-- Now ex:paper/42 is also a bibo:Article, bibo:Document, and rdfs:Resource
SELECT * FROM pg_ripple.sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?type
WHERE {
<https://example.org/paper/42> rdf:type ?type .
}
');
Loading OWL RL Rules
-- Load the OWL RL entailment rules
SELECT pg_ripple.load_rules_builtin('owl-rl');
-- Load ontology with OWL constructs
SELECT pg_ripple.load_turtle('
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <https://example.org/> .
ex:cites owl:inverseOf ex:citedBy .
ex:collaboratesWith a owl:SymmetricProperty .
ex:influencedBy a owl:TransitiveProperty .
ex:paper/42 ex:cites ex:paper/99 .
ex:person/alice ex:collaboratesWith ex:person/bob .
ex:person/carol ex:influencedBy ex:person/alice .
ex:person/alice ex:influencedBy ex:person/dave .
');
-- Run OWL RL inference
SELECT pg_ripple.infer('owl-rl');
-- Derived: ex:paper/99 ex:citedBy ex:paper/42 (inverse)
-- Derived: ex:person/bob ex:collaboratesWith ex:person/alice (symmetric)
-- Derived: ex:person/carol ex:influencedBy ex:person/dave (transitive)
Writing Custom Rules
Define domain-specific rules for the bibliographic dataset:
SELECT pg_ripple.load_rules('
# Derive co-authorship: two people who authored the same paper
?a ex:coAuthor ?b :- ?paper dct:creator ?a , ?paper dct:creator ?b .
# Derive institutional collaboration
?inst1 ex:collaboratesWith ?inst2 :-
?paper dct:creator ?a ,
?paper dct:creator ?b ,
?a schema:affiliation ?inst1 ,
?b schema:affiliation ?inst2 .
# Derive prolific author (authored 5+ papers)
# Note: the five ?paperN variables are not forced to be distinct, so this
# can over-match; use an aggregate rule (Datalog^agg, below) for an exact count
?author ex:isProlific "true"^^xsd:boolean :-
?paper1 dct:creator ?author ,
?paper2 dct:creator ?author ,
?paper3 dct:creator ?author ,
?paper4 dct:creator ?author ,
?paper5 dct:creator ?author .
', 'biblio');
-- Run the custom rule set
SELECT pg_ripple.infer('biblio');
Negation-as-Failure
Flag entities that are missing expected properties:
SELECT pg_ripple.load_rules('
# Flag papers without a date
?paper ex:missingDate "true"^^xsd:boolean :-
?paper rdf:type bibo:AcademicArticle ,
NOT ?paper dct:date ?_ .
# Flag people without an affiliation
?person ex:missingAffiliation "true"^^xsd:boolean :-
?person rdf:type foaf:Person ,
NOT ?person schema:affiliation ?_ .
', 'quality');
SELECT pg_ripple.infer('quality');
-- Query the derived quality flags
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <https://example.org/>
SELECT ?paper WHERE { ?paper ex:missingDate "true"^^<http://www.w3.org/2001/XMLSchema#boolean> }
');
Named Graph Scoping
Write derived triples into a separate graph:
SELECT pg_ripple.load_rules('
# All RDFS inference goes into the "inferred" graph
GRAPH ex:graph/inferred { ?x rdf:type ?c } :-
?x rdf:type ?b , ?b rdfs:subClassOf ?c .
', 'scoped-rdfs');
SELECT pg_ripple.infer('scoped-rdfs');
-- Query only inferred types
SELECT * FROM pg_ripple.sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?x ?type WHERE {
GRAPH <https://example.org/graph/inferred> {
?x rdf:type ?type .
}
}
');
Semi-Naive Evaluation with Statistics
Get detailed inference statistics:
SELECT pg_ripple.infer_with_stats('rdfs');
Returns JSONB:
{
"derived": 156,
"iterations": 4,
"eliminated_rules": [
"?x rdf:type rdfs:Resource :- ?x ?p ?o ."
]
}
The eliminated_rules field shows rules removed by subsumption checking — rules
whose body is a superset of another rule's body.
Goal-Directed Inference with Magic Sets
When you only need a subset of derived facts, magic sets avoids materialising everything:
-- Only derive facts relevant to: "Which entities are foaf:Person?"
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type <http://xmlns.com/foaf/0.1/Person>');
Returns JSONB:
{
"derived": 12,
"iterations": 3,
"matching": 5
}
Compare with full inference:
-- Full materialization: derives ALL facts
SELECT pg_ripple.infer('rdfs');
-- derived: 156
-- Goal-directed: derives only what's needed for the goal
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');
-- derived: 12 (much fewer)
Magic sets are controlled by the GUC pg_ripple.magic_sets. When set to false,
infer_goal() falls back to full materialization and filters post-hoc.
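To compare the two strategies side by side, toggle the GUC within a session:
-- Magic-set rewriting (goal-directed)
SET pg_ripple.magic_sets = true;
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');
-- Fallback: full materialization, filtered post-hoc
SET pg_ripple.magic_sets = false;
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');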
Demand-Filtered Inference
For multiple goals at once, use demand-filtered inference:
SELECT pg_ripple.infer_demand('rdfs', '[
{"p": "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"},
{"s": "<https://example.org/paper/42>"}
]'::jsonb);
Returns:
{
"derived": 45,
"iterations": 3,
"demand_predicates": [
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
]
}
Aggregate Rules (Datalog^agg)
Derive facts using aggregate functions:
SELECT pg_ripple.load_rules('
# Count papers per author
?author ex:paperCount ?count :-
COUNT(?paper WHERE ?paper dct:creator ?author) = ?count .
# Sum citation counts per paper
?paper ex:totalCitations ?total :-
COUNT(?citing WHERE ?citing bibo:cites ?paper) = ?total .
', 'metrics');
-- Use the aggregate-aware inference function
SELECT pg_ripple.infer_agg('metrics');
Returns:
{
"derived": 25,
"aggregate_derived": 25,
"iterations": 1
}
Well-Founded Semantics (v0.32.0)
For programs with cyclic negation (where standard stratification fails):
SELECT pg_ripple.load_rules('
# Cyclic negation: a node is "in" if it is not "out", and vice versa
?x ex:in "true"^^xsd:boolean :- ?x rdf:type ex:Node , NOT ?x ex:out "true"^^xsd:boolean .
?x ex:out "true"^^xsd:boolean :- ?x rdf:type ex:Node , NOT ?x ex:in "true"^^xsd:boolean .
', 'wfs-demo');
-- Standard infer() would fail with "unstratifiable" error
-- WFS handles it gracefully
SELECT pg_ripple.infer_wfs('wfs-demo');
Returns:
{
"derived": 6,
"certain": 0,
"unknown": 6,
"iterations": 3,
"stratifiable": false
}
Facts with certainty = 'unknown' are reported but NOT materialised into VP tables.
Common Patterns
Pattern: Layered Inference
Run rule sets in order — base entailment first, then domain rules:
-- Layer 1: RDFS entailment
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');
-- Layer 2: OWL RL (builds on RDFS-derived facts)
SELECT pg_ripple.load_rules_builtin('owl-rl');
SELECT pg_ripple.infer('owl-rl');
-- Layer 3: Custom domain rules
SELECT pg_ripple.load_rules('...', 'domain');
SELECT pg_ripple.infer('domain');
Pattern: Incremental Re-Inference
After adding new data, re-run inference. Semi-naive evaluation only derives new facts:
-- Load new data
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:paper/newOne a bibo:AcademicArticle .
');
-- Re-run inference — only new derivations are computed
SELECT pg_ripple.infer_with_stats('rdfs');
Pattern: Explicit vs Inferred Triples
All VP tables have a source column: 0 = explicit, 1 = inferred. You can query
this distinction via SPARQL or check the full triple store:
-- Inspect all type assertions (explicit and inferred)
SELECT * FROM pg_ripple.find_triples(
NULL,
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
NULL
);
Pattern: owl:sameAs Canonicalization
When pg_ripple.sameas_reasoning = 'on' (default), owl:sameAs links are
canonicalized before inference. All mentions of equivalent entities are collapsed to
a single canonical ID, reducing redundant derivations.
-- Two IRIs refer to the same entity
SELECT pg_ripple.insert_triple(
'<https://example.org/person/alice>',
'<http://www.w3.org/2002/07/owl#sameAs>',
'<https://other.org/people/a-johnson>'
);
-- After inference, both IRIs are treated as identical
SELECT pg_ripple.infer('owl-rl');
Performance and Trade-offs
Full Materialization vs Goal-Directed
| Strategy | Pros | Cons |
|---|---|---|
Full (infer()) | Complete; all derived facts available | May derive millions of unneeded facts |
Goal-directed (infer_goal()) | Only derives relevant facts | Must specify the goal pattern |
Demand-filtered (infer_demand()) | Multiple goals; partial materialization | Slightly more setup |
| On-demand (query-time) | Zero materialization cost | Slower queries |
Semi-Naive Evaluation
Semi-naive evaluation tracks which facts are new in each iteration and only joins new facts with existing facts. This reduces the work per iteration from O(n^2) to O(n * delta), where delta is the number of new facts per round.
Subsumption Checking
When two rules have the same head and one rule's body is a subset of the other's, the rule with the larger body is redundant and is eliminated. This reduces the number of SQL statements per iteration.
Tabling / Memoisation (v0.32.0)
Goal-directed inference results and WFS results are cached in _pg_ripple.tabling_cache.
Cache entries are automatically invalidated when data changes (inserts or deletes).
-- Check tabling cache statistics
SELECT * FROM pg_ripple.tabling_stats();
Rule Set Management
-- List all rules and their metadata
SELECT pg_ripple.list_rules();
-- Enable/disable a rule set without deleting it
SELECT pg_ripple.enable_rule_set('rdfs');
SELECT pg_ripple.disable_rule_set('quality');
-- Drop all rules in a set
SELECT pg_ripple.drop_rules('quality');
Gotchas and Debugging
Unstratifiable Programs
If your rules contain cyclic negation, standard infer() will fail:
ERROR: unstratifiable rule set — negation cycle detected
DETAIL: ex:in negates ex:out, which depends on ex:in
HINT: remove the negation cycle or use infer_wfs() for well-founded semantics
Fix: either restructure the rules to eliminate the cycle, or use infer_wfs().
Prefix Registration
Rules use the prefix registry from register_prefix(). If a prefix is not registered,
loading rules that reference it fails with a parse error:
-- Register required prefixes BEFORE loading rules
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
SELECT pg_ripple.register_prefix('dct', 'http://purl.org/dc/terms/');
-- Now load rules that use these prefixes
SELECT pg_ripple.load_rules('?x ex:rel ?y :- ?x dct:creator ?y .', 'test');
Checking What Was Derived
After inference, check the statistics:
-- How many triples total?
SELECT pg_ripple.triple_count();
-- Detailed stats including inferred count
SELECT pg_ripple.stats();
-- Check constraint rules for violations
SELECT pg_ripple.check_constraints();
Performance Diagnosis
If inference is slow:
- Check iteration count with infer_with_stats() — many iterations suggest deep recursive chains.
- Use goal-directed inference (infer_goal()) if you only need a subset.
- Check for redundant rules with subsumption (eliminated_rules in stats output).
- Run pg_ripple.vacuum() after inference to update planner statistics.
-- Get inference diagnostics
SELECT pg_ripple.infer_with_stats('rdfs');
-- Check rule plan cache
SELECT * FROM pg_ripple.rule_plan_cache_stats();
Next Steps
- §2.4 Validating Data Quality — SHACL shapes interact with inference; validate derived facts.
- §2.6 Exporting and Sharing — export inferred facts; Datalog enrichment for GraphRAG.
- §2.3 Querying with SPARQL — query both explicit and inferred facts with SPARQL.
§2.6 Exporting and Sharing
What and Why
Data in pg_ripple needs to flow out — to other systems, to files for archival, to LLMs for RAG pipelines, or to Microsoft's GraphRAG framework via Parquet files. pg_ripple supports all standard RDF serialization formats plus JSON-LD framing for API-ready output and BYOG (Bring Your Own Graph) Parquet export for GraphRAG.
This chapter is the canonical reference for all export functionality, including the GraphRAG BYOG pipeline. Other chapters cross-reference here for GraphRAG details.
How It Works
Export Formats
| Format | Function | Streaming variant | Named graph support |
|---|---|---|---|
| N-Triples | export_ntriples() | — | Per-graph or default |
| N-Quads | export_nquads() | — | Yes (all graphs) |
| Turtle | export_turtle() | export_turtle_stream() | Per-graph or default |
| JSON-LD | export_jsonld() | export_jsonld_stream() | Per-graph or default |
| JSON-LD Framed | export_jsonld_framed() | export_jsonld_framed_stream() | Per-graph or default |
| SPARQL CONSTRUCT → Turtle | sparql_construct_turtle() | — | Via query |
| SPARQL CONSTRUCT → JSON-LD | sparql_construct_jsonld() | — | Via query |
| SPARQL DESCRIBE → Turtle | sparql_describe_turtle() | — | Via query |
| SPARQL DESCRIBE → JSON-LD | sparql_describe_jsonld() | — | Via query |
| Parquet (GraphRAG entities) | export_graphrag_entities() | — | Per-graph |
| Parquet (GraphRAG relationships) | export_graphrag_relationships() | — | Per-graph |
| Parquet (GraphRAG text units) | export_graphrag_text_units() | — | Per-graph |
Streaming Exports
For large graphs, streaming exports return one row per triple (or per subject for JSON-LD), avoiding buffering the entire document in memory:
-- Stream Turtle one line at a time
SELECT * FROM pg_ripple.export_turtle_stream();
-- Stream JSON-LD one subject at a time (NDJSON)
SELECT * FROM pg_ripple.export_jsonld_stream();
JSON-LD Framing
JSON-LD framing reshapes flat RDF into nested, application-friendly JSON. A frame is a JSON template that specifies the desired structure:
- pg_ripple translates the frame to a SPARQL CONSTRUCT query.
- The CONSTRUCT query executes against the triple store.
- The W3C embedding algorithm nests matched nodes per the frame.
- The result is compacted with the frame's
@context.
Worked Examples
Exporting as N-Triples
The simplest format — one triple per line:
-- Export the default graph
SELECT pg_ripple.export_ntriples(NULL);
Output:
<https://example.org/paper/42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/AcademicArticle> .
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> .
Export a specific named graph:
SELECT pg_ripple.export_ntriples('https://example.org/graph/pubmed');
Exporting as N-Quads
N-Quads include the graph IRI for each triple:
-- Export all graphs (pass NULL)
SELECT pg_ripple.export_nquads(NULL);
Exporting as Turtle
Compact, human-readable output with prefix declarations:
SELECT pg_ripple.export_turtle();
Output:
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
ex:paper/42 a bibo:AcademicArticle ;
dct:title "Knowledge Graphs in Practice" ;
dct:creator ex:person/alice, ex:person/bob .
Exporting as JSON-LD
SELECT pg_ripple.export_jsonld();
Returns a JSONB array of expanded node objects:
[
{
"@id": "https://example.org/paper/42",
"@type": ["http://purl.org/ontology/bibo/AcademicArticle"],
"http://purl.org/dc/terms/title": [{"@value": "Knowledge Graphs in Practice"}],
"http://purl.org/dc/terms/creator": [
{"@id": "https://example.org/person/alice"},
{"@id": "https://example.org/person/bob"}
]
}
]
JSON-LD Framing
Shape the output into the exact JSON structure your application expects:
SELECT pg_ripple.export_jsonld_framed('{
"@context": {
"dct": "http://purl.org/dc/terms/",
"foaf": "http://xmlns.com/foaf/0.1/",
"bibo": "http://purl.org/ontology/bibo/",
"schema": "https://schema.org/",
"title": "dct:title",
"creator": "dct:creator",
"name": "foaf:name",
"affiliation": "schema:affiliation"
},
"@type": "bibo:AcademicArticle",
"creator": {
"name": {},
"affiliation": {
"name": {}
}
}
}'::jsonb);
Returns nested JSON-LD:
{
"@context": {"dct": "http://purl.org/dc/terms/", "...": "..."},
"@graph": [
{
"@type": "bibo:AcademicArticle",
"title": "Knowledge Graphs in Practice",
"creator": [
{
"name": "Alice Johnson",
"affiliation": {
"name": "Massachusetts Institute of Technology"
}
},
{
"name": "Bob Smith",
"affiliation": {
"name": "Stanford University"
}
}
]
}
]
}
Debugging Frames
See the generated SPARQL CONSTRUCT without executing:
SELECT pg_ripple.jsonld_frame_to_sparql('{
"@context": {
"dct": "http://purl.org/dc/terms/",
"bibo": "http://purl.org/ontology/bibo/",
"title": "dct:title"
},
"@type": "bibo:AcademicArticle",
"title": {}
}'::jsonb);
CONSTRUCT-Based Exports
Use SPARQL CONSTRUCT for selective, transformed exports:
-- Export a citation graph as Turtle
SELECT pg_ripple.sparql_construct_turtle('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex: <https://example.org/>
CONSTRUCT {
?paper ex:cites ?cited .
?paper dct:title ?title .
?cited dct:title ?citedTitle .
}
WHERE {
?paper bibo:cites ?cited ;
dct:title ?title .
?cited dct:title ?citedTitle .
}
');
-- Same as JSON-LD for REST APIs
SELECT pg_ripple.sparql_construct_jsonld('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex: <https://example.org/>
CONSTRUCT {
?paper ex:cites ?cited .
?paper dct:title ?title .
}
WHERE {
?paper bibo:cites ?cited ;
dct:title ?title .
}
');
DESCRIBE-Based Exports
Export everything about specific entities:
-- Full description as Turtle
SELECT pg_ripple.sparql_describe_turtle('
DESCRIBE <https://example.org/paper/42>
');
-- Symmetric CBD (includes incoming links)
SELECT pg_ripple.sparql_describe_turtle(
'DESCRIBE <https://example.org/person/alice>',
'scbd'
);
-- As JSON-LD
SELECT pg_ripple.sparql_describe_jsonld(
'DESCRIBE <https://example.org/paper/42>'
);
GraphRAG BYOG Pipeline
pg_ripple is the canonical source for Microsoft GraphRAG's Bring Your Own Graph (BYOG) data. The pipeline uses three export functions to produce Parquet files compatible with GraphRAG's ingestion format.
This is the canonical GraphRAG chapter; other sections that mention GraphRAG cross-reference it.
Step 1: Model Entities and Relationships
GraphRAG requires entities, relationships, and text units modeled with the gr: prefix.
Load the GraphRAG ontology:
SELECT pg_ripple.load_turtle('
@prefix gr: <urn:graphrag:> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <https://example.org/> .
# Entity: a paper
ex:paper/42 a gr:Entity ;
gr:title "Knowledge Graphs in Practice" ;
gr:type "AcademicArticle" ;
gr:description "A comprehensive survey of knowledge graph technologies and applications." ;
gr:frequency 5 ;
gr:degree 3 .
# Entity: an author
ex:person/alice a gr:Entity ;
gr:title "Alice Johnson" ;
gr:type "Person" ;
gr:description "Researcher at MIT specializing in knowledge representation." ;
gr:frequency 8 ;
gr:degree 5 .
# Relationship
ex:rel/1 a gr:Relationship ;
gr:source ex:paper/42 ;
gr:target ex:person/alice ;
gr:description "authored by" ;
gr:weight "1.0"^^xsd:float ;
gr:combinedDegree 8 .
# Text unit
ex:text/1 a gr:TextUnit ;
gr:text "This paper surveys knowledge graph technologies..." ;
gr:nTokens 150 ;
gr:documentId "doc-001" .
');
Step 2: Enrich with Datalog Rules
Use Datalog rules to derive additional GraphRAG metadata:
SELECT pg_ripple.load_rules('
# Derive entity frequency from triple count
?e gr:frequency ?count :-
?e rdf:type gr:Entity ,
COUNT(?t WHERE ?t ?anyPred ?e) = ?count .
# Derive relationship combined degree
?r gr:combinedDegree ?deg :-
?r rdf:type gr:Relationship ,
?r gr:source ?s ,
?r gr:target ?t ,
COUNT(?p1 WHERE ?s ?p1 ?_) = ?sDeg ,
COUNT(?p2 WHERE ?t ?p2 ?_) = ?tDeg ,
# arithmetic guard binds the head variable
?deg = ?sDeg + ?tDeg .
', 'graphrag-enrichment');
SELECT pg_ripple.infer_agg('graphrag-enrichment');
Step 3: Validate with SHACL
Ensure data quality before export:
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix gr: <urn:graphrag:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<urn:graphrag:EntityShape> a sh:NodeShape ;
sh:targetClass gr:Entity ;
sh:property [
sh:path gr:title ;
sh:minCount 1 ;
sh:datatype xsd:string ;
] ;
sh:property [
sh:path gr:type ;
sh:minCount 1 ;
] .
<urn:graphrag:RelationshipShape> a sh:NodeShape ;
sh:targetClass gr:Relationship ;
sh:property [
sh:path gr:source ;
sh:minCount 1 ;
sh:maxCount 1 ;
] ;
sh:property [
sh:path gr:target ;
sh:minCount 1 ;
sh:maxCount 1 ;
] .
');
-- Validate before export
SELECT pg_ripple.validate();
Step 4: Export to Parquet
-- Export entities (requires superuser)
SELECT pg_ripple.export_graphrag_entities('', '/data/graphrag/entities.parquet');
-- Export relationships
SELECT pg_ripple.export_graphrag_relationships('', '/data/graphrag/relationships.parquet');
-- Export text units
SELECT pg_ripple.export_graphrag_text_units('', '/data/graphrag/text_units.parquet');
Each function returns the number of rows written. The Parquet files are directly
compatible with pyarrow.parquet.read_table() and GraphRAG's BYOG configuration:
# GraphRAG settings.yaml
entity_table_path: /data/graphrag/entities.parquet
relationship_table_path: /data/graphrag/relationships.parquet
text_unit_table_path: /data/graphrag/text_units.parquet
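Before pointing GraphRAG at the files, it can be worth sanity-checking that the store contains the rows you expect to export:
-- Count exportable GraphRAG nodes per type
SELECT * FROM pg_ripple.sparql('
PREFIX gr: <urn:graphrag:>
SELECT ?type (COUNT(?n) AS ?count)
WHERE {
?n a ?type .
FILTER (?type IN (gr:Entity, gr:Relationship, gr:TextUnit))
}
GROUP BY ?type
');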
Step 5: Export from a Named Graph
For multi-tenant or versioned exports:
-- Export only entities from the "production" graph
SELECT pg_ripple.export_graphrag_entities(
'https://example.org/graph/production',
'/data/graphrag/prod_entities.parquet'
);
SELECT pg_ripple.export_graphrag_relationships(
'https://example.org/graph/production',
'/data/graphrag/prod_relationships.parquet'
);
SELECT pg_ripple.export_graphrag_text_units(
'https://example.org/graph/production',
'/data/graphrag/prod_text_units.parquet'
);
Common Patterns
Pattern: API Response Formatting
Use JSON-LD framing to produce API-ready responses:
-- Papers endpoint: nested JSON with authors
SELECT pg_ripple.export_jsonld_framed('{
"@context": {
"title": "http://purl.org/dc/terms/title",
"creator": "http://purl.org/dc/terms/creator",
"name": "http://xmlns.com/foaf/0.1/name",
"type": "@type"
},
"@type": "http://purl.org/ontology/bibo/AcademicArticle",
"creator": { "name": {} }
}'::jsonb);
Pattern: Scheduled Exports via CONSTRUCT Views
For continuously updated exports, create a CONSTRUCT view (requires pg_trickle):
SELECT pg_ripple.create_construct_view(
'citation_graph',
'PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
CONSTRUCT { ?p bibo:cites ?c . ?p dct:title ?t . }
WHERE { ?p bibo:cites ?c ; dct:title ?t }',
'30s',
true
);
-- The view is automatically refreshed every 30 seconds
SELECT * FROM pg_ripple.construct_view_citation_graph_decoded;
Pattern: Streaming Export to File
For large graphs, use COPY with streaming exports:
COPY (SELECT * FROM pg_ripple.export_turtle_stream())
TO '/data/export/full_graph.ttl';
COPY (SELECT * FROM pg_ripple.export_jsonld_stream())
TO '/data/export/full_graph.ndjson';
Pattern: Selective Export with SPARQL
Export only a subset of the graph:
-- Export only papers from 2024
SELECT pg_ripple.sparql_construct_turtle('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?paper ?p ?o }
WHERE {
?paper a bibo:AcademicArticle ;
dct:date ?date ;
?p ?o .
FILTER (?date >= "2024-01-01"^^xsd:date)
}
');
Performance and Trade-offs
Buffered vs Streaming Exports
| Mode | Memory usage | Output format | Best for |
|---|---|---|---|
Buffered (export_turtle()) | Entire graph in memory | Complete document | Small-medium graphs |
Streaming (export_turtle_stream()) | One triple at a time | Row-per-triple | Large graphs (millions of triples) |
Parquet Export Performance
GraphRAG Parquet export scans the relevant VP tables once per entity type. Performance
depends on the number of gr:Entity, gr:Relationship, and gr:TextUnit nodes:
- ~100K entities: <5 seconds
- ~1M entities: ~30 seconds
- Write path requires superuser (writes to server filesystem)
JSON-LD Framing Cost
Framing involves executing a SPARQL CONSTRUCT query, then applying the W3C embedding algorithm. The cost is dominated by the CONSTRUCT query; the embedding step is linear in the number of matched nodes.
Use jsonld_frame_to_sparql() to inspect the generated CONSTRUCT query and verify
it is efficient before calling export_jsonld_framed().
Gotchas and Debugging
Empty Parquet Files
If export_graphrag_entities() returns 0, check that your data uses the correct gr:
prefix and that entities have rdf:type gr:Entity:
SELECT * FROM pg_ripple.find_triples(
NULL,
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
'<urn:graphrag:Entity>'
);
Framing Returns Empty Result
Ensure the frame's @type matches actual rdf:type values in the store. The type
must be a full IRI, not a prefixed name:
-- Check what types exist
SELECT * FROM pg_ripple.sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type WHERE { ?x rdf:type ?type }
');
Server-Side File Permissions
Parquet export writes to the server filesystem. Ensure the postgres OS user has write
permission to the output directory:
sudo mkdir -p /data/graphrag
sudo chown postgres:postgres /data/graphrag
Large Export Memory
For graphs with millions of triples, buffered exports (export_turtle(), export_jsonld())
may use significant memory. Switch to streaming variants or COPY ... TO with streaming.
Next Steps
- §2.7 AI Retrieval and GraphRAG — vector embeddings and RAG retrieval pipelines.
- §2.8 APIs and Integration — serve exported data via the HTTP endpoint.
- §2.3 Querying with SPARQL — CONSTRUCT and DESCRIBE queries for selective export.
§2.7 AI Retrieval and GraphRAG
What and Why
Knowledge graphs and vector search are complementary: vectors excel at fuzzy semantic similarity ("what treats headaches?"), while graph structure captures precise relationships ("which drugs interact with aspirin?"). pg_ripple combines both in a single database, eliminating the need for a separate vector store.
This chapter is the canonical AI and retrieval reference. It covers:
- Vector embeddings: store and index entity embeddings alongside RDF triples.
- HNSW indexes: fast approximate nearest-neighbor search via pgvector.
- Hybrid retrieval: Reciprocal Rank Fusion (RRF) of SPARQL and vector results.
- rag_retrieve(): end-to-end RAG pipeline from question to LLM-ready context.
- JSON-LD framing for LLM prompts: structured context for grounded generation.
- Graph-enriched embeddings: use owl:sameAs canonicalization and neighborhood context.
- Full-text broadening: combine FTS with vector search for recall.
pg_ripple's vector features require the pgvector extension. All vector functions gracefully degrade (return zero rows with a WARNING) when pgvector is not installed.
Why Not a Separate Vector Store?
| Concern | Separate vector store | pg_ripple integrated |
|---|---|---|
| Data consistency | Sync required between stores | Single source of truth |
| ACID transactions | No transactional guarantees | Full PostgreSQL ACID |
| Hybrid queries | Two round-trips + client-side merge | Single SQL query |
| Operational cost | Two systems to manage | One PostgreSQL instance |
| Graph-aware embeddings | Not possible | contextualize_entity() enriches embeddings |
How It Works
The Embedding Pipeline
- Store entities as RDF triples with rdfs:label and rdf:type.
- Embed entities via an OpenAI-compatible API: embed_entities() calls the API in batches and stores vectors in _pg_ripple.embeddings.
- Index with pgvector HNSW for approximate nearest-neighbor search.
- Query with similar_entities(), hybrid_search(), or rag_retrieve().
Key Functions
| Function | Purpose |
|---|---|
store_embedding(iri, vec, model) | Manually store one entity's embedding |
embed_entities(graph, model, batch) | Batch-embed entities from a graph |
refresh_embeddings(graph, model, force) | Re-embed stale entities |
similar_entities(text, k, model) | Find k nearest entities to a text query |
hybrid_search(sparql, text, k, alpha, model) | RRF fusion of SPARQL + vector results |
rag_retrieve(question, filter, k, model, fmt) | End-to-end RAG with context collection |
contextualize_entity(iri, depth, max) | Build text context from RDF neighborhood |
add_embedding_triples() | Materialise pg:hasEmbedding for SHACL checks |
list_embedding_models() | List stored models with counts and dimensions |
GUC Parameters
| GUC | Default | Description |
|---|---|---|
pg_ripple.embedding_api_url | (none) | OpenAI-compatible embedding API base URL |
pg_ripple.embedding_api_key | (none) | API key (superuser only, not logged) |
pg_ripple.embedding_model | text-embedding-3-small | Default embedding model |
pg_ripple.embedding_dimensions | 1536 | Vector dimension count |
pg_ripple.use_graph_context | off | Enrich embedding input with graph neighborhood |
pg_ripple.auto_embed | off | Auto-queue new entities for embedding |
pg_ripple.embedding_batch_size | 100 | API batch size for embed_entities() |
Worked Examples
Setup: Configure Embedding API
-- Point to your OpenAI-compatible embedding endpoint
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'https://api.openai.com/v1';
ALTER SYSTEM SET pg_ripple.embedding_api_key = 'sk-your-key-here';
ALTER SYSTEM SET pg_ripple.embedding_model = 'text-embedding-3-small';
ALTER SYSTEM SET pg_ripple.embedding_dimensions = 1536;
SELECT pg_reload_conf();
The API key is stored as a superuser-only GUC. It never appears in query logs or
pg_stat_statements. For production, consider using a local embedding service
(e.g., Ollama, vLLM) to avoid sending data to external APIs.
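For example, one plausible local setup; the Ollama endpoint URL, model name, and 768-dimension value below are assumptions about your local service, not pg_ripple defaults:
-- Ollama exposes an OpenAI-compatible API under /v1 (assumed local setup)
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'http://localhost:11434/v1';
ALTER SYSTEM SET pg_ripple.embedding_model = 'nomic-embed-text';
ALTER SYSTEM SET pg_ripple.embedding_dimensions = 768;
SELECT pg_reload_conf();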
Step 1: Embed Entities
Batch-embed all entities with an rdfs:label:
-- Embed all entities in the default graph
SELECT pg_ripple.embed_entities();
-- Returns: 150 (number of embeddings stored)
-- Embed only entities in a specific graph
SELECT pg_ripple.embed_entities('https://example.org/graph/pubmed');
-- Override the model for this call
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-large', 50);
Step 2: Similar Entity Search
Find entities semantically similar to a question:
SELECT * FROM pg_ripple.similar_entities('knowledge graph applications', 5);
Returns:
| entity_id | entity_iri | distance |
|---|---|---|
| 42001 | <https://example.org/paper/42> | 0.12 |
| 99001 | <https://example.org/paper/99> | 0.18 |
| 10001 | <https://example.org/person/alice> | 0.31 |
Step 3: Hybrid Search with RRF
Combine SPARQL structural queries with vector similarity:
SELECT * FROM pg_ripple.hybrid_search(
'PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?entity WHERE {
?entity a bibo:AcademicArticle ;
dct:creator <https://example.org/person/alice> .
}',
'knowledge graph survey',
10,
0.5
);
Returns:
| entity_id | entity_iri | rrf_score | sparql_rank | vector_rank |
|---|---|---|---|---|
| 42001 | <https://example.org/paper/42> | 0.032 | 1 | 1 |
| 99001 | <https://example.org/paper/99> | 0.024 | 0 | 2 |
The alpha parameter controls weighting:
- alpha = 1.0: SPARQL only (graph structure)
- alpha = 0.0: vector only (semantic similarity)
- alpha = 0.5: equal weight (default)
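For intuition, the sketch below reproduces the RRF formula with the conventional constant k = 60; pg_ripple's internal constant and exact alpha weighting are not spelled out here, so treat this as illustrative arithmetic rather than the engine's literal code:
-- rrf_score = sum over rankings of 1 / (k + rank); hybrid_search()
-- additionally weights the SPARQL and vector terms by alpha and 1 - alpha.
WITH ranks(entity, sparql_rank, vector_rank) AS (
  VALUES ('paper/42', 1, 1),
         ('paper/99', NULL, 2)  -- NULL = absent from the SPARQL ranking
)
SELECT entity,
       COALESCE(1.0 / (60 + sparql_rank), 0)
     + COALESCE(1.0 / (60 + vector_rank), 0) AS rrf_score
FROM ranks
ORDER BY rrf_score DESC;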
Step 4: End-to-End RAG with rag_retrieve()
The complete pipeline from question to LLM-ready context:
SELECT * FROM pg_ripple.rag_retrieve(
'What papers discuss knowledge graphs?',
NULL,
5
);
Returns:
| entity_iri | label | context_json | distance |
|---|---|---|---|
<https://example.org/paper/42> | Knowledge Graphs in Practice | {"types": [...], "properties": [...], ...} | 0.12 |
With a SPARQL filter to restrict candidates:
SELECT * FROM pg_ripple.rag_retrieve(
'What papers discuss knowledge graphs?',
'?entity a <http://purl.org/ontology/bibo/AcademicArticle> .',
5
);
Get JSON-LD formatted context for LLM consumption:
SELECT * FROM pg_ripple.rag_retrieve(
'What papers discuss knowledge graphs?',
NULL,
5,
NULL,
'jsonld'
);
Building LLM Prompts with JSON-LD Framing
Use framed JSON-LD as structured context for LLM prompts:
-- Get framed JSON-LD for a specific paper
SELECT pg_ripple.export_jsonld_framed('{
"@context": {
"dct": "http://purl.org/dc/terms/",
"foaf": "http://xmlns.com/foaf/0.1/",
"bibo": "http://purl.org/ontology/bibo/",
"schema": "https://schema.org/",
"title": "dct:title",
"creator": "dct:creator",
"name": "foaf:name",
"affiliation": "schema:affiliation",
"cites": "bibo:cites",
"keywords": "schema:keywords"
},
"@type": "bibo:AcademicArticle",
"creator": {
"name": {},
"affiliation": { "name": {} }
},
"cites": { "title": {} }
}'::jsonb);
This produces nested JSON that LLMs can reason about more effectively than flat triples.
Graph-Enriched Embeddings
Use contextualize_entity() to build richer text for embedding:
-- Get context text for an entity
SELECT pg_ripple.contextualize_entity(
'https://example.org/paper/42',
1,
20
);
Returns a text string like:
Knowledge Graphs in Practice. Type: AcademicArticle. Created by: Alice Johnson, Bob Smith.
Cited by: Graph Neural Networks for Entity Resolution. Keywords: knowledge graph, RDF, SPARQL.
Enable graph-enriched embeddings globally:
SET pg_ripple.use_graph_context = 'on';
-- Now embed_entities() uses contextualize_entity() for each entity
SELECT pg_ripple.embed_entities();
owl:sameAs Before Embedding
Canonicalize equivalent entities before embedding to avoid duplicates:
-- Load sameAs links
SELECT pg_ripple.load_turtle('
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<https://example.org/person/alice> owl:sameAs <https://orcid.org/0000-0001-2345-6789> .
');
-- Run OWL RL inference to canonicalize
SELECT pg_ripple.load_rules_builtin('owl-rl');
SELECT pg_ripple.infer('owl-rl');
-- Now embed — equivalent entities share a single embedding
SELECT pg_ripple.embed_entities();
Full-Text Search Broadening
Combine vector search with PostgreSQL full-text search for higher recall:
-- Create FTS index on paper titles
SELECT pg_ripple.fts_index('<http://purl.org/dc/terms/title>');
-- Use FTS to find papers by keyword
SELECT * FROM pg_ripple.fts_search(
'knowledge & graph',
'<http://purl.org/dc/terms/title>'
);
-- Combine FTS candidates with vector search in a hybrid approach
-- Step 1: Get FTS matches
-- Step 2: Get vector matches
-- Step 3: Merge with RRF (done automatically in hybrid_search)
SELECT * FROM pg_ripple.hybrid_search(
'PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?entity WHERE {
?entity dct:title ?t .
FILTER (CONTAINS(?t, "knowledge"))
}',
'knowledge graph applications',
10,
0.6
);
Storing Manual Embeddings
If you compute embeddings externally:
SELECT pg_ripple.store_embedding(
'https://example.org/paper/42',
-- The array length must match pg_ripple.embedding_dimensions
-- (a 10-element vector is shown here only for brevity).
ARRAY[0.1, -0.2, 0.3, 0.05, -0.15, 0.25, 0.08, -0.1, 0.2, 0.12]::float8[],
'custom-model-v1'
);
Refreshing Stale Embeddings
After updating entity labels, refresh the affected embeddings:
-- Refresh only entities whose labels changed
SELECT pg_ripple.refresh_embeddings();
-- Returns: 12 (re-embedded entities)
-- Force re-embed everything
SELECT pg_ripple.refresh_embeddings(NULL, NULL, true);
Checking Embedding Coverage
-- List all embedding models and their entity counts
SELECT * FROM pg_ripple.list_embedding_models();
-- Add pg:hasEmbedding triples for SHACL completeness checks
SELECT pg_ripple.add_embedding_triples();
-- Validate embedding completeness
SELECT pg_ripple.validate();
Common Patterns
Pattern: Complete RAG Pipeline
-- 1. Load knowledge graph
SELECT pg_ripple.load_turtle_file('/data/domain.ttl');
-- 2. Run inference to derive additional facts
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');
-- 3. Embed entities
SELECT pg_ripple.embed_entities();
-- 4. Query with RAG
SELECT * FROM pg_ripple.rag_retrieve(
'What drugs treat migraines?',
'?entity a <https://example.org/Drug> .',
5,
NULL,
'jsonld'
);
Pattern: Periodic Re-Embedding
Schedule embedding refresh after data updates:
-- After loading new data
SELECT pg_ripple.load_turtle('...');
SELECT pg_ripple.infer('rdfs');
-- Refresh embeddings for entities with changed labels
SELECT pg_ripple.refresh_embeddings();
-- Compact HTAP tables
SELECT pg_ripple.compact();
Pattern: Multi-Model Embeddings
Store embeddings from different models for comparison:
-- Embed with model A
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-small');
-- Embed with model B
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-large');
-- List stored models
SELECT * FROM pg_ripple.list_embedding_models();
-- Search with a specific model
SELECT * FROM pg_ripple.similar_entities('knowledge graphs', 10, 'text-embedding-3-large');
Pattern: Serving RAG via HTTP
Use pg_ripple_http's /rag endpoint for REST access (see §2.8):
curl -X POST http://localhost:8080/rag \
-H "Content-Type: application/json" \
-d '{
"question": "What treats headaches?",
"k": 5,
"output_format": "jsonld"
}'
The response includes both structured results and a pre-formatted context string
ready to be injected into an LLM system prompt.
Performance and Trade-offs
Embedding Storage
Each embedding vector occupies dimensions * 4 bytes (float32 in pgvector). For 1536-dimensional
embeddings, that is ~6 KB per entity. A graph with 1M entities uses ~6 GB for embeddings alone.
HNSW Index Performance
| Entities | Index build time | Query latency (k=10) | Recall@10 |
|---|---|---|---|
| 10K | ~2s | <5ms | >95% |
| 100K | ~20s | <10ms | >95% |
| 1M | ~5min | <20ms | >92% |
RRF Fusion Overhead
hybrid_search() executes two queries (SPARQL + vector) and fuses results in Rust.
Total overhead beyond the individual query times is <1ms for typical result sizes.
API Call Costs
embed_entities() calls an external API. Batch size affects both throughput and cost:
- Larger batches reduce round-trips but increase per-request latency.
- Default batch size (100) is a good balance for OpenAI's API.
- For local models (Ollama, vLLM), increase batch size to 500+.
For large initial embeddings, consider running embed_entities() in a separate
session with a larger embedding_batch_size setting to maximize throughput.
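For example, in a dedicated session against a local endpoint that tolerates large batches:
-- Raise the batch size for this session, then bulk-embed
SET pg_ripple.embedding_batch_size = 500;
SELECT pg_ripple.embed_entities();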
Gotchas and Debugging
pgvector Not Installed
All vector functions return zero rows with a WARNING when pgvector is absent:
WARNING: pg_ripple.similar_entities: pgvector not available (PT603)
Fix: install pgvector and CREATE EXTENSION vector.
No Embeddings Found
If similar_entities() returns empty:
- Check that embedding_api_url is configured: SHOW pg_ripple.embedding_api_url;
- Check that embeddings exist: SELECT * FROM pg_ripple.list_embedding_models();
- Run embed_entities() if needed.
Dimension Mismatch
The vector dimension in _pg_ripple.embeddings must match embedding_dimensions:
SHOW pg_ripple.embedding_dimensions;
-- Must match the model's output dimension (1536 for text-embedding-3-small)
Slow Vector Queries
If vector queries are slow, check that an HNSW index exists on the embeddings table. pg_ripple creates one automatically, but it may need rebuilding after large batch inserts:
-- Rebuild the HNSW index
REINDEX INDEX _pg_ripple.embeddings_embedding_idx;
API Rate Limits
embed_entities() respects rate limits by batching. If you hit rate limits, reduce
embedding_batch_size:
SET pg_ripple.embedding_batch_size = 50;
SELECT pg_ripple.embed_entities();
Next Steps
- §2.6 Exporting and Sharing — GraphRAG BYOG Parquet export pipeline.
- §2.4 Validating Data Quality — SHACL embedding completeness shapes.
- §2.8 APIs and Integration — serve RAG results via the HTTP endpoint.
§2.8 APIs and Integration
What and Why
pg_ripple's SQL functions are powerful, but most applications do not talk to PostgreSQL directly. The pg_ripple_http companion service exposes a W3C-compliant SPARQL Protocol endpoint over HTTP, so any SPARQL client, programming language, or tool can query your knowledge graph.
This chapter covers:
- pg_ripple_http: the standalone SPARQL endpoint service.
- Application code examples: Python, JavaScript, and Java.
- SPARQL federation: query remote SPARQL endpoints from within pg_ripple.
- Caching strategies: plan cache, connection pooling, and result caching.
How It Works
pg_ripple_http Architecture
pg_ripple_http is a standalone Rust binary (not a PostgreSQL extension) that:
- Connects to PostgreSQL via a deadpool connection pool.
- Receives SPARQL queries via HTTP GET/POST (W3C SPARQL Protocol).
- Calls pg_ripple.sparql(), pg_ripple.sparql_construct(), etc. via SQL.
- Returns results in standard formats: SPARQL Results JSON/XML, Turtle, N-Triples, JSON-LD.
- Exposes a /rag endpoint for AI retrieval.
Supported Endpoints
| Method | Path | Content-Type | Description |
|---|---|---|---|
| GET | /sparql?query=... | Accept header | SPARQL query via URL parameter |
| POST | /sparql | application/sparql-query | SPARQL query in request body |
| POST | /sparql | application/x-www-form-urlencoded | SPARQL query as form parameter |
| POST | /sparql | application/sparql-update | SPARQL Update in request body |
| POST | /rag | application/json | RAG retrieval endpoint |
| GET | /health | application/json | Health check |
| GET | /metrics | text/plain | Prometheus metrics |
Response Formats (Content Negotiation)
| Accept header | Format |
|---|---|
application/sparql-results+json | SPARQL Results JSON (default for SELECT/ASK) |
application/sparql-results+xml | SPARQL Results XML |
text/csv | CSV |
text/tab-separated-values | TSV |
text/turtle | Turtle (for CONSTRUCT/DESCRIBE) |
application/n-triples | N-Triples (for CONSTRUCT/DESCRIBE) |
application/ld+json | JSON-LD (for CONSTRUCT/DESCRIBE) |
Worked Examples
Starting pg_ripple_http
# Set environment variables
export PG_RIPPLE_DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"
export PG_RIPPLE_LISTEN="0.0.0.0:8080"
export PG_RIPPLE_AUTH_TOKEN="my-secret-token" # optional
# Start the server
pg_ripple_http
Configuration via environment variables:
| Variable | Default | Description |
|---|---|---|
PG_RIPPLE_DATABASE_URL | postgresql://localhost/postgres | PostgreSQL connection string |
PG_RIPPLE_LISTEN | 127.0.0.1:8080 | Listen address and port |
PG_RIPPLE_AUTH_TOKEN | (none) | Bearer token for authentication |
PG_RIPPLE_POOL_SIZE | 10 | Connection pool size |
PG_RIPPLE_RATE_LIMIT | 100 | Requests per second per IP |
PG_RIPPLE_CORS_ORIGIN | * | CORS allowed origins |
Querying via curl
SPARQL SELECT via GET:
curl -G http://localhost:8080/sparql \
--data-urlencode 'query=PREFIX dct: <http://purl.org/dc/terms/> SELECT ?paper ?title WHERE { ?paper dct:title ?title } LIMIT 10' \
-H "Accept: application/sparql-results+json"
SPARQL SELECT via POST (body):
curl -X POST http://localhost:8080/sparql \
-H "Content-Type: application/sparql-query" \
-H "Accept: application/sparql-results+json" \
-d 'PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
}
ORDER BY ?title
LIMIT 20'
SPARQL CONSTRUCT as Turtle:
curl -X POST http://localhost:8080/sparql \
-H "Content-Type: application/sparql-query" \
-H "Accept: text/turtle" \
-d 'PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex: <https://example.org/>
CONSTRUCT { ?paper ex:hasTitle ?title }
WHERE { ?paper dct:title ?title }'
SPARQL CONSTRUCT as JSON-LD:
curl -X POST http://localhost:8080/sparql \
-H "Content-Type: application/sparql-query" \
-H "Accept: application/ld+json" \
-d 'PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex: <https://example.org/>
CONSTRUCT { ?paper ex:hasTitle ?title }
WHERE { ?paper dct:title ?title }'
SPARQL Update:
curl -X POST http://localhost:8080/sparql \
-H "Content-Type: application/sparql-update" \
-d 'PREFIX dct: <http://purl.org/dc/terms/>
INSERT DATA {
<https://example.org/paper/new1> dct:title "A New Discovery" .
}'
RAG endpoint:
curl -X POST http://localhost:8080/rag \
-H "Content-Type: application/json" \
-d '{
"question": "What papers discuss knowledge graphs?",
"sparql_filter": "?entity a <http://purl.org/ontology/bibo/AcademicArticle> .",
"k": 5,
"output_format": "jsonld"
}'
The RAG response includes a context field with pre-formatted text for LLM prompts:
{
"results": [
{
"entity_iri": "https://example.org/paper/42",
"label": "Knowledge Graphs in Practice",
"context_json": {"@type": ["AcademicArticle"], "...": "..."},
"distance": 0.12
}
],
"context": "Knowledge Graphs in Practice (AcademicArticle): A comprehensive survey..."
}
Authentication (when PG_RIPPLE_AUTH_TOKEN is set):
curl -X POST http://localhost:8080/sparql \
-H "Authorization: Bearer my-secret-token" \
-H "Content-Type: application/sparql-query" \
-d 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5'
Python with psycopg2
Query pg_ripple directly from Python via SQL:
import json
import psycopg2
conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()
# Execute a SPARQL query
cur.execute("""
SELECT * FROM pg_ripple.sparql(%s)
""", ("""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title ?author
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title ;
dct:creator ?author .
}
ORDER BY ?title
LIMIT 20
""",))
for row in cur.fetchall():
result = json.loads(row[0])
print(f"Paper: {result['paper']}")
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print()
# Load Turtle data
cur.execute("""
SELECT pg_ripple.load_turtle(%s)
""", ("""
@prefix dct: <http://purl.org/dc/terms/> .
<https://example.org/paper/new> dct:title "Loaded from Python" .
""",))
conn.commit()
# Export as JSON-LD
cur.execute("SELECT pg_ripple.export_jsonld()")
jsonld = json.loads(cur.fetchone()[0])
print(json.dumps(jsonld, indent=2))
cur.close()
conn.close()
Python with SPARQLWrapper
Query the pg_ripple_http endpoint using the standard SPARQLWrapper library:
from SPARQLWrapper import SPARQLWrapper, JSON, TURTLE
# Point to the pg_ripple_http endpoint
sparql = SPARQLWrapper("http://localhost:8080/sparql")
# SELECT query
sparql.setQuery("""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
print(f"{binding['paper']['value']}: {binding['title']['value']}")
# CONSTRUCT query as Turtle
sparql.setQuery("""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ex: <https://example.org/>
CONSTRUCT { ?paper ex:hasTitle ?title }
WHERE { ?paper dct:title ?title }
""")
sparql.setReturnFormat(TURTLE)
turtle_output = sparql.query().convert()
print(turtle_output.decode("utf-8"))
JavaScript with pg
Query pg_ripple directly from Node.js:
const { Client } = require('pg');
async function main() {
const client = new Client({ connectionString: 'postgresql://localhost/mydb' });
await client.connect();
// SPARQL SELECT
const { rows } = await client.query(
`SELECT * FROM pg_ripple.sparql($1)`,
[`
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
?paper a bibo:AcademicArticle ;
dct:title ?title .
}
LIMIT 10
`]
);
for (const row of rows) {
const result = row.result;
console.log(`Paper: ${result.paper}, Title: ${result.title}`);
}
// Load Turtle
const loadResult = await client.query(
`SELECT pg_ripple.load_turtle($1)`,
[`
@prefix dct: <http://purl.org/dc/terms/> .
<https://example.org/paper/fromjs> dct:title "Loaded from JavaScript" .
`]
);
console.log(`Loaded: ${loadResult.rows[0].load_turtle} triples`);
// Export JSON-LD
const jsonldResult = await client.query(`SELECT pg_ripple.export_jsonld()`);
console.log(JSON.stringify(jsonldResult.rows[0].export_jsonld, null, 2));
await client.end();
}
main().catch(console.error);
JavaScript with fetch (HTTP endpoint)
async function sparqlQuery(query) {
const response = await fetch('http://localhost:8080/sparql', {
method: 'POST',
headers: {
'Content-Type': 'application/sparql-query',
'Accept': 'application/sparql-results+json',
},
body: query,
});
return response.json();
}
const results = await sparqlQuery(`
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
LIMIT 10
`);
for (const binding of results.results.bindings) {
console.log(`${binding.paper.value}: ${binding.title.value}`);
}
Java with JDBC
import java.sql.*;
import org.json.JSONObject;
public class PgRippleExample {
public static void main(String[] args) throws Exception {
Connection conn = DriverManager.getConnection(
"jdbc:postgresql://localhost:5432/mydb", "postgres", "password"
);
// SPARQL SELECT
PreparedStatement stmt = conn.prepareStatement(
"SELECT * FROM pg_ripple.sparql(?)"
);
stmt.setString(1,
"PREFIX dct: <http://purl.org/dc/terms/> " +
"PREFIX bibo: <http://purl.org/ontology/bibo/> " +
"SELECT ?paper ?title " +
"WHERE { " +
" ?paper a bibo:AcademicArticle ; " +
" dct:title ?title . " +
"} LIMIT 10"
);
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
String jsonStr = rs.getString("result");
JSONObject result = new JSONObject(jsonStr);
System.out.println("Paper: " + result.getString("paper"));
System.out.println("Title: " + result.getString("title"));
}
rs.close();
stmt.close();
// Load Turtle
PreparedStatement loadStmt = conn.prepareStatement(
"SELECT pg_ripple.load_turtle(?)"
);
loadStmt.setString(1,
"@prefix dct: <http://purl.org/dc/terms/> .\n" +
"<https://example.org/paper/fromjava> dct:title \"Loaded from Java\" .\n"
);
ResultSet loadRs = loadStmt.executeQuery();
if (loadRs.next()) {
System.out.println("Loaded: " + loadRs.getLong(1) + " triples");
}
loadRs.close();
loadStmt.close();
conn.close();
}
}
SPARQL Federation
pg_ripple can query remote SPARQL endpoints from within a SPARQL query using the
SERVICE keyword. This lets you join local data with remote datasets like Wikidata
or DBpedia.
Querying a Remote SPARQL Endpoint
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?paper ?title ?wikidataLabel
WHERE {
?paper dct:title ?title ;
dct:subject ?topic .
SERVICE <https://query.wikidata.org/sparql> {
?topic rdfs:label ?wikidataLabel .
FILTER (LANG(?wikidataLabel) = "en")
}
}
LIMIT 10
');
Vector Federation
Register external vector services for federated similarity search (see Vector Federation for full details):
-- Register a Qdrant endpoint
SELECT pg_ripple.register_vector_endpoint(
'https://qdrant.internal:6333',
'qdrant'
);
-- Register a Weaviate endpoint
SELECT pg_ripple.register_vector_endpoint(
'https://weaviate.internal:8080',
'weaviate'
);
Federation queries add network latency. Set timeouts to prevent slow remote endpoints from blocking local queries:
SET pg_ripple.vector_federation_timeout_ms = 5000;
Common Patterns
Pattern: Connection Pooling
For high-traffic applications, use a connection pooler (PgBouncer, pgcat) between your application and PostgreSQL:
App → PgBouncer (port 6432) → PostgreSQL (port 5432)
pg_ripple_http uses its own connection pool internally (configurable via PG_RIPPLE_POOL_SIZE).
Pattern: Result Caching
Cache SPARQL results at the application level for frequently-repeated queries:
import json
import hashlib
import redis
import psycopg2
cache = redis.Redis()
def cached_sparql(query, ttl=300):
key = f"sparql:{hashlib.sha256(query.encode()).hexdigest()}"
cached = cache.get(key)
if cached:
return json.loads(cached)
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("SELECT * FROM pg_ripple.sparql(%s)", (query,))
results = [json.loads(row[0]) for row in cur.fetchall()]
cur.close()
conn.close()
cache.setex(key, ttl, json.dumps(results))
return results
Pattern: SPARQL Views for Pre-Computed Results
For dashboard queries that run frequently, create SPARQL views (requires pg_trickle):
-- Create a pre-computed view of paper counts per institution
SELECT pg_ripple.create_sparql_view(
'papers_by_institution',
'PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?inst ?instName (COUNT(DISTINCT ?paper) AS ?count)
WHERE {
?paper dct:creator ?author .
?author schema:affiliation ?inst .
?inst foaf:name ?instName .
}
GROUP BY ?inst ?instName',
'30s',
true
);
-- Query the view directly (instant, no SPARQL parsing)
SELECT * FROM pg_ripple.papers_by_institution;
Pattern: Prometheus Monitoring
pg_ripple_http exposes Prometheus metrics at /metrics:
curl http://localhost:8080/metrics
Metrics include:
- pg_ripple_http_requests_total — total request count by endpoint and status
- pg_ripple_http_request_duration_seconds — request latency histogram
- pg_ripple_http_active_connections — current active connections
Performance and Trade-offs
Direct SQL vs HTTP Endpoint
| Access method | Latency overhead | Best for |
|---|---|---|
| Direct SQL (psycopg2, JDBC) | None | Server-side applications, ETL |
| pg_ripple_http | ~1-5ms per request | Web applications, REST APIs, federated queries |
Connection Pool Sizing
Rule of thumb: set pool size to 2 * CPU cores for OLTP workloads. For SPARQL-heavy
analytics, 4 * CPU cores may be better:
export PG_RIPPLE_POOL_SIZE=20
Rate Limiting
pg_ripple_http includes built-in rate limiting to prevent abuse:
export PG_RIPPLE_RATE_LIMIT=100 # requests per second per IP
For public-facing endpoints, combine with a reverse proxy (nginx, Caddy) for additional protection.
CORS Configuration
For browser-based applications:
export PG_RIPPLE_CORS_ORIGIN="https://myapp.example.com"
Set to * for development; restrict to specific origins in production.
Gotchas and Debugging
Authentication Errors
If PG_RIPPLE_AUTH_TOKEN is set, all requests must include the Authorization header:
HTTP 401: Missing or invalid authorization token
Fix: include Authorization: Bearer <token> in the request headers.
Connection Refused
If pg_ripple_http cannot connect to PostgreSQL:
Error: connection refused (os error 61)
Fix: check PG_RIPPLE_DATABASE_URL and ensure PostgreSQL is running and accepting connections.
Content-Type Negotiation
If you get unexpected response formats, check the Accept header. pg_ripple_http uses
content negotiation:
# Explicitly request JSON results
curl -H "Accept: application/sparql-results+json" ...
# Explicitly request Turtle for CONSTRUCT
curl -H "Accept: text/turtle" ...
Federation Timeouts
Remote SPARQL endpoints can be slow. If federation queries time out:
-- Increase the timeout
SET pg_ripple.vector_federation_timeout_ms = 30000;
For SPARQL federation (SERVICE keyword), pg_ripple uses PostgreSQL's
statement_timeout for the overall query:
SET statement_timeout = '60s';
Health Check
Use the /health endpoint for load balancer configuration:
curl http://localhost:8080/health
# Returns: {"status": "ok", "pool_size": 10, "pool_available": 8}
Next Steps
- §2.3 Querying with SPARQL — SPARQL query reference for the queries you send via APIs.
- §2.7 AI Retrieval and GraphRAG — RAG endpoint details and LLM integration.
- §2.6 Exporting and Sharing — export formats returned by the HTTP endpoint.
CDC Subscriptions
Added in v0.42.0
Overview
Change Data Capture (CDC) subscriptions let your application subscribe to a real-time stream of RDF triple changes — filtered by SPARQL pattern or SHACL shape — without polling the database.
When a matching triple is inserted or deleted, pg_ripple sends a PostgreSQL NOTIFY message on a named channel. Listeners receive a JSON payload describing the change. The pg_ripple_http companion service exposes these subscriptions as WebSocket endpoints for web and streaming applications.
Creating a Subscription
-- Subscribe to all triple changes.
SELECT pg_ripple.create_subscription('my_feed');
-- Subscribe with a SPARQL pattern filter.
SELECT pg_ripple.create_subscription(
'person_changes',
filter_sparql := 'SELECT ?s ?p ?o WHERE { ?s a <https://schema.org/Person> ; ?p ?o }'
);
-- Subscribe with a SHACL shape filter.
SELECT pg_ripple.create_subscription(
'shape_violations',
filter_shape := '<https://shapes.example.org/PersonShape>'
);
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
name | TEXT | required | Unique subscription name (alphanumeric + _/-, max 63 chars) |
filter_sparql | TEXT | NULL | Optional SPARQL SELECT pattern; only matching triples are published |
filter_shape | TEXT | NULL | Optional SHACL shape IRI; only shape-violating triples are published |
Returns TRUE if created, FALSE if a subscription with that name already exists.
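For example:
SELECT pg_ripple.create_subscription('my_feed');  -- true (created)
SELECT pg_ripple.create_subscription('my_feed');  -- false (name already exists)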
Listening for Changes
-- Start listening.
LISTEN pg_ripple_cdc_my_feed;
-- Insert a triple.
SELECT pg_ripple.insert_triple(
'<https://ex.org/alice>',
'<https://schema.org/name>',
'"Alice"'
);
-- In your application, receive notifications via pg_notify/asyncpg/etc.
Notification Payload
Each notification carries a JSON payload:
{
"op": "add",
"s": "<https://ex.org/alice>",
"p": "<https://schema.org/name>",
"o": "\"Alice\"",
"g": ""
}
| Field | Value |
|---|---|
op | "add" for INSERT, "remove" for DELETE |
s | Subject — N-Triples formatted IRI or blank node |
p | Predicate — N-Triples formatted IRI |
o | Object — N-Triples formatted literal or IRI |
g | Named graph IRI, or empty string for the default graph |
Listing Subscriptions
SELECT name, filter_sparql IS NOT NULL AS has_filter, created_at
FROM pg_ripple.list_subscriptions()
ORDER BY created_at;
Dropping a Subscription
-- Returns TRUE if removed, FALSE if not found.
SELECT pg_ripple.drop_subscription('my_feed');
WebSocket Access via pg_ripple_http
When the pg_ripple_http companion service is running, subscriptions are accessible as WebSocket endpoints:
ws://<host>:8080/ws/subscriptions/{name}
The service supports content negotiation via the Accept header:
- application/json (default) — JSON payload
- text/turtle — Turtle-serialized change notification
- application/ld+json — JSON-LD change notification
Integration Patterns
GraphRAG Pipeline
import asyncio
import json
import asyncpg

async def watch_entity_changes(dsn):
    conn = await asyncpg.connect(dsn)

    def on_change(connection, pid, channel, payload):
        change = json.loads(payload)
        # Re-embed the entity on change (update_embedding is your coroutine).
        asyncio.create_task(update_embedding(change["s"]))

    # add_listener issues LISTEN and routes notifications to the callback.
    await conn.add_listener("pg_ripple_cdc_entity_changes", on_change)
    await asyncio.Event().wait()  # keep the connection alive
Live Dashboard
const ws = new WebSocket("ws://localhost:8080/ws/subscriptions/dashboard_feed");
ws.onmessage = (event) => {
const change = JSON.parse(event.data);
updateDashboard(change.op, change.s, change.p, change.o);
};
Underlying Tables
| Table | Description |
|---|---|
_pg_ripple.subscriptions | Named subscription registry |
_pg_ripple.cdc_subscriptions | Low-level predicate-pattern subscriptions (v0.6.0 legacy API) |
Related Functions
| Function | Description |
|---|---|
pg_ripple.create_subscription(name, filter_sparql, filter_shape) | Create named subscription |
pg_ripple.drop_subscription(name) | Remove named subscription |
pg_ripple.list_subscriptions() | List all named subscriptions |
pg_ripple.subscribe(pattern, channel) | Low-level subscription (v0.6.0 API) |
pg_ripple.unsubscribe(channel) | Remove low-level subscription |
Architecture Overview
pg_ripple is a PostgreSQL 18 extension that implements a high-performance RDF triple store with native SPARQL query execution. This page describes the internal architecture: how data is stored, how queries are executed, and how the subsystems interact.
System Architecture Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ Client Applications │
│ psql / JDBC / SPARQL Protocol (pg_ripple_http) / REST / ODBC │
└────────────────────────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ PostgreSQL 18 Backend │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ pg_ripple Extension │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────────┐ │ │
│ │ │ SPARQL │ │ Datalog │ │ SHACL │ │ │
│ │ │ Engine │ │ Reasoner │ │ Validator │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ parse → │ │ stratify → │ │ shapes → DDL │ │ │
│ │ │ optimize → │ │ compile → │ │ constraints + │ │ │
│ │ │ SQL gen → │ │ semi-naive │ │ async pipeline │ │ │
│ │ │ SPI exec → │ │ fixpoint │ │ │ │ │
│ │ │ decode │ │ │ │ │ │ │
│ │ └──────┬───────┘ └───────┬───────┘ └────────┬───────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Dictionary Encoder (XXH3-128) │ │ │
│ │ │ IRI / Blank Node / Literal ──→ i64 identifier │ │ │
│ │ │ Shared-Memory LRU Cache (64 shards) │ │ │
│ │ └────────────────────────┬────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ VP Storage Engine (HTAP) │ │ │
│ │ │ │ │ │
│ │ │ vp_{id}_delta ──┐ │ │ │
│ │ │ (write inbox) │ │ │ │
│ │ │ ├──→ vp_{id} (read view) │ │ │
│ │ │ vp_{id}_main ──┤ = (main − tombstones) │ │ │
│ │ │ (BRIN archive) │ UNION ALL delta │ │ │
│ │ │ │ │ │ │
│ │ │ vp_{id}_tombstones │ │ │
│ │ │ (pending deletes) │ │ │ │
│ │ │ │ │ │ │
│ │ │ vp_rare ─────────┘ (consolidated rare predicates) │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────────────┴────────────────────────────────┐ │ │
│ │ │ Background Merge Worker (BGW) │ │ │
│ │ │ delta + main − tombstones ──→ new main (BRIN) │ │ │
│ │ │ Polling interval: merge_interval_secs (default 60s) │ │ │
│ │ │ Threshold: merge_threshold (default 10,000 rows) │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ _pg_ripple schema: dictionary, predicates, vp_*, statements │ │
│ │ pg_ripple schema: public SQL functions (sparql, insert, etc.) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Dictionary Encoder
The dictionary encoder is the foundation of pg_ripple's storage model. Every RDF term — IRI, blank node, plain literal, typed literal, or language-tagged literal — is mapped to a compact i64 identifier before being stored.
How Encoding Works
- The input term is classified by kind: IRI (0), blank node (1), literal (2), typed literal (3), or language-tagged literal (4).
- The kind discriminant is mixed into the hash input as two little-endian bytes, so the same string encoded as an IRI and as a blank node always produces distinct dictionary rows.
- An XXH3-128 hash is computed over (kind_le_bytes || term_utf8).
- The full 16-byte hash is stored in the _pg_ripple.dictionary table with an ON CONFLICT (hash) DO NOTHING upsert. The dense i64 join key is an IDENTITY-generated column — sequential and independent of the hash.
- The resulting i64 is used in all VP table columns.
VP tables never contain raw strings. All joins, comparisons, and index lookups operate on i64 values. This eliminates collation overhead, reduces storage by 5–20x, and makes B-tree index scans uniformly fast regardless of IRI length.
Shared-Memory Cache
The dictionary cache sits in PostgreSQL shared memory (allocated at postmaster start) and is organized as a 64-shard set-associative structure. Each backend reads and writes to the shared cache through atomic operations — no per-backend duplication.
Key parameters:
- pg_ripple.dictionary_cache_size — Number of cache entries (default: 65,536). Requires restart.
- pg_ripple.cache_budget — Memory budget cap in MB (default: 64). Bulk loads throttle at 90% utilization.
The cache hit ratio is reported by pg_ripple.stats() and should stay above 95% in production.
VP (Vertical Partitioning) Tables
pg_ripple uses vertical partitioning: one physical table per unique predicate. This is the storage model used by research systems like SW-Store and column-oriented triple stores.
Table Layout
Each predicate with at least vp_promotion_threshold (default: 1,000) triples gets a dedicated VP table:
-- Columns in every VP table
s BIGINT NOT NULL -- subject dictionary ID
o BIGINT NOT NULL -- object dictionary ID
g BIGINT NOT NULL DEFAULT 0 -- graph ID (0 = default graph)
i BIGINT NOT NULL DEFAULT nextval('statement_id_seq') -- unique statement ID (SID)
source SMALLINT NOT NULL DEFAULT 0 -- 0 = explicit, 1 = inferred
Dual B-tree indexes on (s, o) and (o, s) support both subject-to-object and object-to-subject access patterns.
Rare Predicate Consolidation
Predicates with fewer triples than the promotion threshold are stored in a shared _pg_ripple.vp_rare table with an extra p BIGINT column. This avoids schema bloat for infrequent predicates. When a rare predicate's count crosses the threshold, it is automatically promoted to a dedicated VP table.
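To see which rare predicates are approaching promotion, you can aggregate vp_rare directly (a read-only query; the p column holds the predicate's dictionary ID as described above):
-- Triple counts per rare predicate (highest first); counts near
-- vp_promotion_threshold indicate predicates about to be promoted.
SELECT p AS predicate_id, COUNT(*) AS triples
FROM _pg_ripple.vp_rare
GROUP BY p
ORDER BY triples DESC
LIMIT 10;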
Predicate Catalog
The _pg_ripple.predicates table maps each predicate ID to its VP table OID and current triple count:
SELECT id, table_oid, triple_count
FROM _pg_ripple.predicates
ORDER BY triple_count DESC;
The SPARQL-to-SQL translator never concatenates table names into SQL strings. It looks up the OID in _pg_ripple.predicates and uses parameterized queries with format-safe quoting. This prevents SQL injection by design.
HTAP Storage Architecture
Since v0.6.0, pg_ripple uses an HTAP (Hybrid Transactional/Analytical Processing) storage architecture that separates write and read paths for each VP table.
Three-Table Split
For each predicate, the storage layer maintains:
| Table | Purpose | Index Type |
|---|---|---|
vp_{id}_delta | Write inbox — all INSERTs land here | B-tree on (s, o) |
vp_{id}_main | Read-optimized archive | BRIN (block range) |
vp_{id}_tombstones | Pending deletes from main | B-tree on (s, o, g) |
A read view vp_{id} combines them:
(main EXCEPT tombstones) UNION ALL delta
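In SQL terms, the read view has roughly this shape; the sketch below assumes predicate ID 17, and the exact definition pg_ripple generates may differ:
-- Illustrative read view: (main − tombstones) UNION ALL delta
CREATE VIEW _pg_ripple.vp_17 AS
SELECT m.s, m.o, m.g, m.i, m.source
FROM _pg_ripple.vp_17_main AS m
WHERE NOT EXISTS (
    SELECT 1
    FROM _pg_ripple.vp_17_tombstones AS t
    WHERE t.s = m.s AND t.o = m.o AND t.g = m.g
)
UNION ALL
SELECT d.s, d.o, d.g, d.i, d.source
FROM _pg_ripple.vp_17_delta AS d;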
Background Merge Worker
The merge worker is a PostgreSQL background worker (BGWorker) that runs in a polling loop:
- Poll — Wake every merge_interval_secs (default: 60) or when poked by the write-path latch.
- Scan — Check each HTAP predicate's delta row count against merge_threshold (default: 10,000).
- Merge — For qualifying predicates: create a new main table from (old_main − tombstones) UNION ALL delta, swap atomically via ALTER TABLE ... RENAME, drop the old main after merge_retention_seconds.
- Maintain — Rebuild subject/object pattern tables, promote rare predicates that crossed the threshold, run ANALYZE on new main tables (when auto_analyze is on), evict expired federation cache entries.
Writers never block on the merge. All INSERTs go directly to the delta table (heap + B-tree). The merge worker operates asynchronously and uses PostgreSQL's MVCC for isolation.
SPARQL Query Execution Flow
When a client calls pg_ripple.sparql('SELECT ...'), the query goes through five stages:
1. Parse
The SPARQL text is parsed by the spargebra crate into an algebraic representation. This handles the full SPARQL 1.1 grammar: SELECT, CONSTRUCT, DESCRIBE, ASK, property paths, subqueries, aggregation, federation (SERVICE), and SPARQL-star.
2. Optimize
The sparopt optimizer rewrites the algebra tree:
- BGP reordering — Triple patterns are sorted by estimated selectivity (smallest VP table first) when bgp_reorder is on.
- Filter pushdown — FILTER constants are encoded to i64 at translation time and pushed into the WHERE clause of the generated SQL.
- Self-join elimination — Star patterns (same subject, multiple predicates) are collapsed into multi-way joins instead of redundant subqueries.
- SHACL hints — If sh:maxCount 1 is declared, DISTINCT is omitted; if sh:minCount 1, LEFT JOIN is upgraded to INNER JOIN.
3. Generate SQL
The optimized algebra is compiled into PostgreSQL SQL:
- Each triple pattern becomes a scan of the corresponding VP table (or vp_rare with a predicate filter).
- Joins between patterns become SQL JOIN clauses with i64 equality predicates.
- Property paths compile to WITH RECURSIVE ... CYCLE using PostgreSQL 18's hash-based cycle detection.
- SERVICE clauses are compiled into HTTP calls to remote SPARQL endpoints.
- Aggregates, ORDER BY, LIMIT, and OFFSET translate directly to their SQL equivalents.
4. SPI Execute
The generated SQL is executed through PostgreSQL's Server Programming Interface (SPI). Results are arrays of i64 dictionary IDs.
The plan cache (plan_cache_size, default: 256) stores compiled SQL for recently-seen SPARQL queries to avoid repeated parse/optimize/generate cycles.
5. Decode
The i64 result columns are decoded back to human-readable RDF terms (IRIs, literals, blank nodes) using the dictionary. The shared-memory cache accelerates this step — a cache hit avoids a dictionary table lookup per value.
The SPARQL engine encodes all bound constants to i64 before generating SQL, and decodes results after execution. VP table queries never contain string comparisons — this is a hard architectural invariant.
Schema Organization
pg_ripple uses two PostgreSQL schemas:
| Schema | Contents | Visibility |
|---|---|---|
pg_ripple | Public SQL functions (sparql(), insert_triple(), stats(), etc.) | User-facing |
_pg_ripple | Dictionary table, predicates catalog, VP tables, statement mappings, internal state | Internal |
The internal schema is managed by the extension. Direct modifications to _pg_ripple tables can corrupt the dictionary or break VP table invariants.
Subsystem Summary
| Subsystem | Source Directory | Purpose |
|---|---|---|
| Dictionary | src/dictionary/ | Term ↔ i64 encoding with XXH3-128 |
| Storage | src/storage/ | VP tables, HTAP partitions, rare predicate consolidation |
| SPARQL | src/sparql/ | Parse → optimize → SQL generation → SPI → decode |
| Datalog | src/datalog/ | Rule parsing, stratification, semi-naive fixpoint, magic sets |
| SHACL | src/shacl/ | Shape validation, DDL constraints, async pipeline |
| Export | src/export/ | Turtle, N-Triples, JSON-LD serialization |
| Worker | src/worker.rs | Background merge worker, embedding queue, SHACL async |
| Stats | src/stats/ | Monitoring, cache metrics, health checks |
| Federation | src/sparql/federation | Remote SERVICE call execution, connection pooling, caching |
| HTTP | pg_ripple_http/ | SPARQL Protocol endpoint (standalone companion service) |
Deployment Models
pg_ripple runs as a PostgreSQL 18 extension. It can be deployed in any environment that supports PostgreSQL 18 with extension loading. This page covers the three primary deployment models and provides production-ready configuration examples.
Deployment Options at a Glance
| Model | Best For | Complexity | SPARQL Protocol |
|---|---|---|---|
| Standalone PostgreSQL | Production, existing PG infrastructure | Low | Via pg_ripple_http sidecar |
| Docker / Compose | Evaluation, CI/CD, small deployments | Low | Built-in |
| Managed PostgreSQL | Cloud-native, minimal ops | Medium | Via pg_ripple_http sidecar |
Use Docker Compose for evaluation and development. Use a dedicated PostgreSQL 18 instance for production workloads — this gives full control over shared memory, background workers, and storage configuration.
Model 1: Standalone PostgreSQL
Install pg_ripple into a standard PostgreSQL 18 instance. This is the recommended production deployment.
Prerequisites
- PostgreSQL 18.x installed from packages or source
- Rust toolchain (for building from source) or a pre-built .so/.dylib
- pgrx 0.17 (if building from source)
Installation
# Build and install from source
cargo pgrx install --pg-config $(which pg_config) --release
# Or if using a specific PG18 binary
cargo pgrx install --pg-config /usr/lib/postgresql/18/bin/pg_config --release
PostgreSQL Configuration
Add to postgresql.conf:
# Required: load pg_ripple at server start for background workers and shared memory
shared_preload_libraries = 'pg_ripple'
# Shared memory for dictionary cache (adjust for your dataset)
pg_ripple.dictionary_cache_size = 65536 # 64K entries (default)
pg_ripple.cache_budget = 64 # MB (default)
# HTAP merge worker
pg_ripple.merge_threshold = 10000
pg_ripple.merge_interval_secs = 60
pg_ripple.worker_database = 'mydb' # database the merge worker connects to
Enable the Extension
CREATE EXTENSION pg_ripple;
-- Verify installation
SELECT pg_ripple.stats();
Add SPARQL Protocol Endpoint
The SPARQL Protocol HTTP endpoint is provided by pg_ripple_http, a standalone companion service:
# Build the HTTP service
cd pg_ripple_http
cargo build --release
# Run it
PG_RIPPLE_HTTP_PG_URL="postgresql://user:pass@localhost/mydb" \
PG_RIPPLE_HTTP_PORT=7878 \
./target/release/pg_ripple_http
You can use pg_ripple entirely through SQL — pg_ripple.sparql(), pg_ripple.insert_triple(), etc. The HTTP service adds W3C SPARQL Protocol compatibility for tools like Yasgui, RDF4J, or federated queries from other endpoints.
Model 2: Docker / Docker Compose
The Docker deployment bundles PostgreSQL 18, pg_ripple, and pg_ripple_http into containers managed by Docker Compose. This is the fastest way to get started.
docker-compose.yml
# Docker Compose for pg_ripple with SPARQL Protocol HTTP endpoint.
#
# Usage:
# docker compose up -d
# curl http://localhost:7878/health
# curl -G http://localhost:7878/sparql \
# --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
services:
postgres:
build: .
ports:
- "5432:5432"
environment:
POSTGRES_PASSWORD: ripple
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
sparql:
build: .
entrypoint: ["/usr/local/bin/pg_ripple_http"]
ports:
- "7878:7878"
environment:
PG_RIPPLE_HTTP_PG_URL: "postgresql://postgres:ripple@postgres/postgres"
PG_RIPPLE_HTTP_PORT: "7878"
PG_RIPPLE_HTTP_POOL_SIZE: "8"
PG_RIPPLE_HTTP_CORS_ORIGINS: "*"
depends_on:
postgres:
condition: service_healthy
volumes:
pgdata:
Starting the Stack
docker compose up -d
# Wait for health check
docker compose ps
# Test SPARQL endpoint
curl http://localhost:7878/health
# Run a query
curl -G http://localhost:7878/sparql \
--data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 5"
Loading Data via Docker
# Copy a Turtle file into the container and load it
docker compose cp data.ttl postgres:/tmp/data.ttl
docker compose exec postgres psql -U postgres -c \
"SELECT pg_ripple.load_turtle_file('/tmp/data.ttl');"
# Or load inline
docker compose exec postgres psql -U postgres -c \
"SELECT pg_ripple.load_turtle('@prefix ex: <http://example.org/> .
ex:Alice ex:knows ex:Bob .
ex:Bob ex:age \"30\"^^<http://www.w3.org/2001/XMLSchema#integer> .');"
Production Hardening for Docker
For production Docker deployments, add resource limits and persistent configuration:
services:
postgres:
build: .
ports:
- "5432:5432"
environment:
POSTGRES_PASSWORD: ${PG_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
- ./postgresql.conf:/etc/postgresql/postgresql.conf
command: postgres -c config_file=/etc/postgresql/postgresql.conf
deploy:
resources:
limits:
memory: 4G
cpus: "2.0"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
sparql:
build: .
entrypoint: ["/usr/local/bin/pg_ripple_http"]
ports:
- "7878:7878"
environment:
PG_RIPPLE_HTTP_PG_URL: "postgresql://postgres:${PG_PASSWORD}@postgres/postgres"
PG_RIPPLE_HTTP_PORT: "7878"
PG_RIPPLE_HTTP_POOL_SIZE: "16"
PG_RIPPLE_HTTP_CORS_ORIGINS: "https://yourdomain.com"
PG_RIPPLE_HTTP_AUTH_TOKEN: ${SPARQL_AUTH_TOKEN}
depends_on:
postgres:
condition: service_healthy
deploy:
resources:
limits:
memory: 512M
cpus: "1.0"
Never use default passwords in production. Set POSTGRES_PASSWORD and PG_RIPPLE_HTTP_AUTH_TOKEN via environment variables or Docker secrets. Restrict PG_RIPPLE_HTTP_CORS_ORIGINS to your actual domain.
Model 3: Managed PostgreSQL Services
pg_ripple can run on managed PostgreSQL services that support custom extensions and PostgreSQL 18. The key requirements are:
- PostgreSQL 18 — pg_ripple uses PG18-specific features (e.g., WITH RECURSIVE ... CYCLE).
- Custom extension loading — The service must allow installing .so extensions and adding to shared_preload_libraries.
- Shared memory access — Required for the dictionary cache and merge worker.
Supported Managed Services
| Service | Custom Extensions | shared_preload_libraries | Status |
|---|---|---|---|
| AWS RDS for PostgreSQL | Yes (via custom builds) | Yes | Supported with custom AMI |
| Azure Database for PostgreSQL Flexible Server | Yes | Yes | Supported |
| Google Cloud SQL | Limited | Limited | Partial support |
| Self-managed on EC2/GCE/Azure VM | Full control | Full control | Fully supported |
For managed cloud deployments, running PostgreSQL 18 on a cloud VM (EC2, GCE, Azure VM) with the extension installed gives full control and avoids managed service limitations. Use the managed service's block storage for durability and snapshots for backups.
Managed Service Configuration
When running on a managed service:
# Add to the PostgreSQL parameter group / configuration
shared_preload_libraries = 'pg_ripple'
# Shared memory — managed services often cap this; start conservative
pg_ripple.dictionary_cache_size = 32768
pg_ripple.cache_budget = 32
# Merge worker targets the primary database
pg_ripple.worker_database = 'mydb'
pg_ripple_http as a Sidecar
On managed services, run pg_ripple_http as a sidecar container or systemd service:
# Kubernetes sidecar example
PG_RIPPLE_HTTP_PG_URL="postgresql://user:pass@pg-host:5432/mydb" \
PG_RIPPLE_HTTP_PORT=7878 \
PG_RIPPLE_HTTP_POOL_SIZE=16 \
pg_ripple_http
pg_ripple_http Configuration Reference
The HTTP companion service is configured entirely through environment variables:
| Variable | Default | Description |
|---|---|---|
PG_RIPPLE_HTTP_PG_URL | (required) | PostgreSQL connection string |
PG_RIPPLE_HTTP_PORT | 7878 | HTTP listen port |
PG_RIPPLE_HTTP_POOL_SIZE | 8 | Connection pool size |
PG_RIPPLE_HTTP_CORS_ORIGINS | * | Allowed CORS origins |
PG_RIPPLE_HTTP_AUTH_TOKEN | (none) | Bearer token for authentication |
PG_RIPPLE_HTTP_RATE_LIMIT | 0 | Requests per second (0 = unlimited) |
Endpoints
| Path | Method | Description |
|---|---|---|
/sparql | GET, POST | SPARQL Protocol query/update endpoint |
/health | GET | Health check (returns 200 if PG connection is live) |
/metrics | GET | Prometheus-compatible metrics |
Network Architecture
┌─────────────┐
│ Clients │
└──────┬──────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ pg_ripple_http │ │ psql / JDBC / │
│ :7878 │ │ application │
│ (SPARQL Proto) │ │ (:5432) │
└────────┬─────────┘ └────────┬─────────┘
│ │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ PostgreSQL 18 │
│ + pg_ripple extension │
│ + merge worker (BGW) │
└─────────────────────────┘
For read-heavy workloads, PostgreSQL streaming replication works out of the box. Read replicas receive all VP table changes through WAL. Point read-only SPARQL queries to replicas via a separate pg_ripple_http instance connected to the replica.
Post-Deployment Verification
After deploying pg_ripple, verify the installation:
-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
-- Verify stats (confirms shared memory and merge worker)
SELECT pg_ripple.stats();
-- Run a health check
SELECT pg_ripple.canary();
-- Insert and query a test triple
SELECT pg_ripple.insert_triple(
'<http://example.org/test>',
'<http://example.org/status>',
'"deployed"'
);
SELECT * FROM pg_ripple.sparql('
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1
');
Success criteria:
- stats() returns merge_worker_pid > 0
- canary() shows merge_worker: "ok" and catalog_consistent: true
- encode_cache_hits / (hits + misses) > 0.90 after the initial data load
- SPARQL queries return results
Configuration and Tuning
pg_ripple exposes its configuration through PostgreSQL GUC (Grand Unified Configuration) parameters. All parameters use the pg_ripple. prefix and can be set in postgresql.conf, via ALTER SYSTEM, or per-session with SET.
Parameters marked Postmaster require a PostgreSQL restart. Parameters marked SIGHUP can be reloaded with SELECT pg_reload_conf(). All others can be changed per-session with SET.
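For example (parameter values are illustrative):
-- Per-session (Userset context):
SET pg_ripple.plan_cache_size = 512;
-- Cluster-wide and reloadable (SIGHUP context):
ALTER SYSTEM SET pg_ripple.merge_threshold = 50000;
SELECT pg_reload_conf();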
Storage Parameters
Control how triples are stored in VP tables and the rare-predicate consolidation table.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
vp_promotion_threshold | int | 1000 | 10 – 10,000,000 | Userset | Minimum triples before a predicate gets a dedicated VP table. Below this, triples go to vp_rare. |
named_graph_optimized | bool | off | — | Userset | Adds a (g, s, o) index per VP table. Speeds up GRAPH queries but increases write overhead. |
default_graph | text | '' | Any IRI | Userset | IRI used as the default graph when g is not specified on insert. |
dedup_on_merge | bool | off | — | Userset | When on, the merge worker deduplicates (s, o, g) rows, keeping the lowest SID. |
HTAP / Merge Worker Parameters
Control the delta/main split and background merge behavior. These take effect only when pg_ripple is loaded via shared_preload_libraries.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
merge_threshold | int | 10000 | 1 – 2,147,483,647 | SIGHUP | Delta row count that triggers a merge for a predicate. Lower = fresher reads, more I/O. |
merge_interval_secs | int | 60 | 1 – 3600 | SIGHUP | Maximum seconds between merge worker poll cycles. |
merge_retention_seconds | int | 60 | 0 – 86,400 | SIGHUP | Seconds to keep the old main table after a merge before dropping it. |
latch_trigger_threshold | int | 10000 | 1 – 2,147,483,647 | SIGHUP | Rows written in a batch before poking the merge worker latch immediately. |
merge_watchdog_timeout | int | 300 | 10 – 86,400 | SIGHUP | Seconds of merge worker inactivity before logging a WARNING. |
worker_database | text | 'postgres' | — | SIGHUP | Database the background merge worker connects to. |
auto_analyze | bool | on | — | SIGHUP | Run ANALYZE on VP main tables after each merge cycle. |
Query Engine Parameters
Tune SPARQL-to-SQL translation and execution.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
plan_cache_size | int | 256 | 0 – 65,536 | Userset | Cached SPARQL→SQL translations per backend. 0 disables caching. |
max_path_depth | int | 100 | 0 – 10,000 | Userset | Maximum recursion depth for property path queries (+, *). 0 = unlimited. |
property_path_max_depth | int | 64 | 1 – 100,000 | Userset | Alternative property path depth limit (v0.24.0). |
describe_strategy | text | 'cbd' | 'cbd', 'scbd', 'simple' | Userset | DESCRIBE algorithm: Concise Bounded Description, Symmetric CBD, or simple one-hop. |
bgp_reorder | bool | on | — | Userset | Reorder BGP triple patterns by estimated selectivity before SQL generation. |
parallel_query_min_joins | int | 3 | 1 – 100 | Userset | Minimum VP-table joins before enabling parallel query workers. |
sparql_strict | bool | on | — | Userset | When on, unsupported FILTER functions raise an error; when off, they are silently dropped. |
export_batch_size | int | 10000 | 100 – 1,000,000 | Userset | Triples per cursor batch during streaming export. |
Inference / Datalog Parameters
Control the Datalog reasoning engine, magic sets, and rule caching.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
inference_mode | text | 'off' | 'off', 'on_demand', 'materialized' | Userset | Datalog reasoning mode. 'materialized' requires pg_trickle. |
enforce_constraints | text | 'off' | 'off', 'warn', 'error' | Userset | Behavior when Datalog constraint rules detect violations. |
rule_graph_scope | text | 'default' | 'default', 'all' | Userset | Whether unscoped rule atoms operate on the default graph only or all graphs. |
magic_sets | bool | on | — | Userset | Use magic sets for goal-directed inference in infer_goal(). |
datalog_cost_reorder | bool | on | — | Userset | Sort rule body atoms by ascending VP-table cardinality before SQL compilation. |
datalog_antijoin_threshold | int | 1000 | 0 – 10,000,000 | Userset | Minimum VP rows for NOT atoms to use LEFT JOIN anti-join form. |
delta_index_threshold | int | 500 | 0 – 10,000,000 | Userset | Minimum semi-naive delta rows before creating a B-tree index. |
demand_transform | bool | on | — | Userset | Auto-apply demand transformation when multiple goal patterns are specified. |
sameas_reasoning | bool | on | — | Userset | Apply owl:sameAs canonicalization pre-pass during inference. |
rule_plan_cache | bool | on | — | Userset | Cache compiled SQL for each rule set. Invalidated by drop_rules() and load_rules(). |
rule_plan_cache_size | int | 64 | 1 – 4,096 | Userset | Maximum rule sets in the plan cache. |
Well-Founded Semantics / Tabling Parameters
Control WFS evaluation and tabling cache (v0.32.0).
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
wfs_max_iterations | int | 100 | 1 – 10,000 | Userset | Safety cap on alternating fixpoint rounds per WFS pass. Emits PT520 WARNING if not converged. |
tabling | bool | on | — | Userset | Cache infer_wfs() and SPARQL results in _pg_ripple.tabling_cache. |
tabling_ttl | int | 300 | 0 – 86,400 | Userset | TTL in seconds for tabling cache entries. 0 disables TTL-based expiry. |
SHACL Validation Parameters
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
shacl_mode | text | 'off' | 'off', 'sync', 'async' | Userset | 'sync' rejects violations inline; 'async' queues for background validation. |
Federation Parameters
Control remote SPARQL endpoint calls via the SERVICE keyword.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
federation_timeout | int | 30 | 1 – 3,600 | Userset | Per-SERVICE call wall-clock timeout in seconds. |
federation_max_results | int | 10000 | 1 – 1,000,000 | Userset | Maximum rows accepted from a single remote call. |
federation_on_error | text | 'warning' | 'warning', 'error', 'empty' | Userset | Behavior on SERVICE call failure. |
federation_pool_size | int | 4 | 1 – 32 | Userset | Idle HTTP connections per endpoint host. |
federation_cache_ttl | int | 0 | 0 – 86,400 | Userset | Remote result cache TTL in seconds. 0 disables caching. |
federation_on_partial | text | 'empty' | 'empty', 'use' | Userset | Behavior on mid-stream SERVICE failure. |
federation_adaptive_timeout | bool | off | — | Userset | Derive per-endpoint timeout from P95 latency. |
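As an example, a session that prefers partial results over hard failures can combine these settings with a SERVICE query. The endpoint URL below is illustrative:
-- Short timeout; degrade to an empty SERVICE result instead of raising an error.
SET pg_ripple.federation_timeout = 10;
SET pg_ripple.federation_on_error = 'empty';
SELECT * FROM pg_ripple.sparql('
  SELECT ?p ?o WHERE {
    SERVICE <https://sparql.example.org/query> { <http://example.org/x> ?p ?o }
  } LIMIT 10
');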
Shared Memory Parameters (Startup Only)
These must be set in postgresql.conf before PostgreSQL starts. They cannot be changed at runtime.
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
dictionary_cache_size | int | 4096 | 0 – 1,000,000 | Postmaster | Shared-memory encode cache capacity in entries. |
cache_budget | int | 64 | 0 – 65,536 | Postmaster | Shared-memory budget cap in MB. Bulk loads throttle at 90% utilization. |
Changes to dictionary_cache_size and cache_budget require a full PostgreSQL restart. Plan your cache sizing before deploying to production.
Security Parameters
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
rls_bypass | bool | off | — | Suset | Superuser override to bypass graph-level Row-Level Security. |
Vector / Embedding Parameters
| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
embedding_model | text | '' | — | Userset | Model name tag stored in _pg_ripple.embeddings. |
embedding_dimensions | int | 1536 | 1 – 16,000 | Userset | Vector dimension count. Must match model output. |
embedding_api_url | text | '' | — | Userset | Base URL for OpenAI-compatible embedding API. |
embedding_api_key | text | '' | — | Suset | API key (superuser-only, masked in pg_settings). |
pgvector_enabled | bool | on | — | Userset | Disable pgvector code paths without uninstalling. |
embedding_index_type | text | 'hnsw' | 'hnsw', 'ivfflat' | Userset | Index type on embeddings table. |
embedding_precision | text | 'single' | 'single', 'half', 'binary' | Userset | Storage precision for embedding vectors. |
auto_embed | bool | off | — | Userset | Auto-embed new entities via background worker. |
embedding_batch_size | int | 100 | 1 – 10,000 | Userset | Entities dequeued per background worker batch. |
Quick-Start Configurations
Small Dataset (< 1M triples)
Suitable for development, prototyping, or small knowledge graphs:
# postgresql.conf
shared_preload_libraries = 'pg_ripple'
# Dictionary cache — small footprint
pg_ripple.dictionary_cache_size = 8192
pg_ripple.cache_budget = 16
# Merge worker — merge early for fresh reads
pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 30
# Query engine
pg_ripple.plan_cache_size = 64
pg_ripple.max_path_depth = 50
Medium Dataset (1M – 100M triples)
Production workloads with moderate query complexity:
# postgresql.conf
shared_preload_libraries = 'pg_ripple'
# Dictionary cache — larger cache for better hit rates
pg_ripple.dictionary_cache_size = 131072
pg_ripple.cache_budget = 128
# Merge worker — balance freshness and I/O
pg_ripple.merge_threshold = 50000
pg_ripple.merge_interval_secs = 60
pg_ripple.latch_trigger_threshold = 20000
pg_ripple.auto_analyze = on
# Query engine — larger plan cache for diverse queries
pg_ripple.plan_cache_size = 512
pg_ripple.max_path_depth = 100
pg_ripple.bgp_reorder = on
# Inference (if used)
pg_ripple.inference_mode = 'on_demand'
pg_ripple.magic_sets = on
Large Dataset (> 100M triples)
High-throughput production with heavy query loads:
# postgresql.conf
shared_preload_libraries = 'pg_ripple'
# Dictionary cache — maximize cache coverage
pg_ripple.dictionary_cache_size = 500000
pg_ripple.cache_budget = 512
# Merge worker — batch larger merges, reduce churn
pg_ripple.merge_threshold = 200000
pg_ripple.merge_interval_secs = 120
pg_ripple.latch_trigger_threshold = 100000
pg_ripple.merge_retention_seconds = 120
pg_ripple.auto_analyze = on
# Query engine — large plan cache, parallel queries
pg_ripple.plan_cache_size = 2048
pg_ripple.max_path_depth = 200
pg_ripple.bgp_reorder = on
pg_ripple.parallel_query_min_joins = 2
# Named graph optimization (if heavy GRAPH usage)
pg_ripple.named_graph_optimized = on
# Inference
pg_ripple.inference_mode = 'on_demand'
pg_ripple.magic_sets = on
pg_ripple.rule_plan_cache = on
pg_ripple.rule_plan_cache_size = 256
# Tabling cache for repeated inference patterns
pg_ripple.tabling = on
pg_ripple.tabling_ttl = 600
# Federation (if used)
pg_ripple.federation_timeout = 60
pg_ripple.federation_pool_size = 8
pg_ripple.federation_cache_ttl = 300
Don't forget to tune PostgreSQL itself alongside pg_ripple. Key PostgreSQL parameters for triple store workloads:
- shared_buffers = 25% of RAM
- effective_cache_size = 75% of RAM
- work_mem = 64MB–256MB (for complex joins)
- maintenance_work_mem = 512MB–1GB (for merge ANALYZE)
- random_page_cost = 1.1 (if using SSDs)
- max_parallel_workers_per_gather = 4
Monitoring and Observability
pg_ripple provides built-in monitoring through SQL functions, PostgreSQL's standard statistics infrastructure, and Prometheus-compatible metrics via pg_ripple_http. This page explains what to monitor, how to collect the data, and what thresholds indicate a healthy system.
pg_ripple.stats()
The primary monitoring function. Returns a JSONB object with key metrics:
SELECT pg_ripple.stats();
Output Fields
| Field | Type | Description |
|---|---|---|
total_triples | int | Total triple count across all graphs (including delta rows not yet merged) |
dedicated_predicates | int | Number of predicates with their own VP table |
htap_predicates | int | Number of predicates using the HTAP delta/main split |
rare_triples | int | Triples stored in the consolidated vp_rare table |
unmerged_delta_rows | int | Total rows across all delta tables (from shared memory counter). -1 if shared memory is not available |
merge_worker_pid | int | PID of the background merge worker. 0 if not running |
live_statistics_enabled | bool | Whether pg_trickle live statistics are active |
encode_cache_capacity | int | Total entries the shared encode cache can hold |
encode_cache_utilization_pct | int | Percentage of cache slots currently in use |
encode_cache_hits | int | Cumulative cache hit count since server start |
encode_cache_misses | int | Cumulative cache miss count since server start |
encode_cache_evictions | int | Cumulative eviction count |
Example Output
{
"total_triples": 4523891,
"dedicated_predicates": 127,
"htap_predicates": 127,
"rare_triples": 2341,
"unmerged_delta_rows": 8432,
"merge_worker_pid": 12345,
"live_statistics_enabled": false,
"encode_cache_capacity": 65536,
"encode_cache_utilization_pct": 72,
"encode_cache_hits": 18934521,
"encode_cache_misses": 234012,
"encode_cache_evictions": 45123
}
Computing the Cache Hit Rate
SELECT
(s->>'encode_cache_hits')::bigint AS hits,
(s->>'encode_cache_misses')::bigint AS misses,
ROUND(
(s->>'encode_cache_hits')::numeric /
NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
4
) AS hit_rate
FROM pg_ripple.stats() s;
A healthy system should maintain a cache hit rate above 95% (0.95). If it drops below 90%, increase pg_ripple.dictionary_cache_size and restart PostgreSQL. Sustained rates below 80% indicate the working set significantly exceeds cache capacity.
pg_ripple.canary()
A health check function that returns a JSONB object with pass/fail indicators:
SELECT pg_ripple.canary();
Output Fields
| Field | Type | Healthy Value | Description |
|---|---|---|---|
merge_worker | text | "ok" | "ok" if merge worker PID is in shared memory; "stalled" otherwise |
cache_hit_rate | float | > 0.95 | Dictionary encode cache hit rate (0.0–1.0) |
catalog_consistent | bool | true | VP table count in pg_class matches promoted predicates |
orphaned_rare_rows | int | 0 | vp_rare rows for predicates that already have dedicated VP tables |
Interpreting Results
SELECT
c->>'merge_worker' AS worker,
(c->>'cache_hit_rate')::float AS hit_rate,
(c->>'catalog_consistent')::bool AS catalog_ok,
(c->>'orphaned_rare_rows')::int AS orphaned
FROM pg_ripple.canary() c;
canary() is designed for load balancer health checks and monitoring systems. Call it periodically and alert when merge_worker = 'stalled', cache_hit_rate < 0.90, or catalog_consistent = false.
SPARQL Query Analysis with sparql_explain()
Analyze SPARQL query performance using the explain functions:
Basic SQL Generation
-- See the generated SQL without executing
SELECT pg_ripple.sparql_explain(
'SELECT ?name WHERE { ?s <http://schema.org/name> ?name }',
false
);
Full EXPLAIN ANALYZE
-- Execute and show timing + row counts
SELECT pg_ripple.sparql_explain(
'SELECT ?name WHERE { ?s <http://schema.org/name> ?name }',
true
);
explain_sparql() with Format Options
The explain_sparql() function (v0.23.0) provides more output formats:
-- Generated SQL only
SELECT pg_ripple.explain_sparql(
'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
'sql'
);
-- EXPLAIN ANALYZE as text (default)
SELECT pg_ripple.explain_sparql(
'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
'text'
);
-- EXPLAIN ANALYZE as JSON (for programmatic consumption)
SELECT pg_ripple.explain_sparql(
'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
'json'
);
-- SPARQL algebra tree (for debugging the optimizer)
SELECT pg_ripple.explain_sparql(
'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
'sparql_algebra'
);
Common patterns to look for in the output:
- Seq Scan on vp_rare — A predicate is not promoted yet. Consider lowering vp_promotion_threshold or loading more data.
- Nested Loop with high row estimates — BGP reordering may not be optimal. Check that bgp_reorder is on.
- Recursive CTE with high loop count — Property path is deep. Check the max_path_depth setting.
- Sort + Unique — A DISTINCT that might be avoidable with a SHACL sh:maxCount 1 hint.
pg_stat_statements Integration
pg_ripple generates standard SQL that is tracked by pg_stat_statements. This gives you deep visibility into the actual SQL performance:
-- Enable pg_stat_statements (if not already)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find the slowest SPARQL-generated queries
SELECT
calls,
mean_exec_time::numeric(10,2) AS avg_ms,
total_exec_time::numeric(10,2) AS total_ms,
rows,
LEFT(query, 120) AS query_prefix
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
ORDER BY mean_exec_time DESC
LIMIT 20;
Identifying Hot VP Tables
-- Which VP tables are scanned most?
SELECT
regexp_matches(query, '_pg_ripple\.(vp_\d+)', 'g') AS vp_table,
sum(calls) AS total_calls,
sum(total_exec_time)::numeric(10,2) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
GROUP BY 1
ORDER BY total_ms DESC
LIMIT 10;
Prometheus Metrics (pg_ripple_http)
The pg_ripple_http companion service exposes Prometheus-compatible metrics at the /metrics endpoint:
curl http://localhost:7878/metrics
Available Metrics
| Metric | Type | Description |
|---|---|---|
pg_ripple_http_queries_total | counter | Total SPARQL queries processed |
pg_ripple_http_errors_total | counter | Total query errors |
pg_ripple_http_query_duration_seconds_total | counter | Cumulative query execution time |
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'pg_ripple_http'
scrape_interval: 15s
static_configs:
- targets: ['pg-ripple-http:7878']
metrics_path: /metrics
Derived Metrics for Dashboards
Use PromQL to compute useful rates:
# Queries per second
rate(pg_ripple_http_queries_total[5m])
# Error rate
rate(pg_ripple_http_errors_total[5m]) / rate(pg_ripple_http_queries_total[5m])
# Average query latency
rate(pg_ripple_http_query_duration_seconds_total[5m]) / rate(pg_ripple_http_queries_total[5m])
Monitoring the Merge Worker
The background merge worker is critical for HTAP performance. Monitor it through multiple channels:
Shared Memory Status
-- Is the merge worker running?
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int AS pid;
-- Returns 0 if not running
Delta Table Sizes
-- Check delta accumulation per predicate
-- Plain SQL cannot count rows in a dynamically named table, so this uses the
-- statistics view; n_live_tup is an estimate maintained by the stats collector.
SELECT
  p.id AS predicate_id,
  d.value AS predicate_iri,
  p.triple_count,
  s.n_live_tup AS delta_rows_estimate
FROM _pg_ripple.predicates p
JOIN _pg_ripple.dictionary d ON d.id = p.id
JOIN pg_stat_user_tables s
  ON s.schemaname = '_pg_ripple'
 AND s.relname = format('vp_%s_delta', p.id)
WHERE p.htap = true
ORDER BY p.triple_count DESC
LIMIT 10;
Merge Worker Logs
The merge worker logs to PostgreSQL's standard log:
LOG: pg_ripple merge worker: merge cycle complete
LOG: pg_ripple merge worker: processed 3 async validation item(s)
WARNING: pg_ripple merge worker: watchdog timeout (300s)
If you see watchdog timeout warnings in the PostgreSQL log, the merge worker has stalled. Common causes:
- Long-running transactions holding locks on VP tables
- worker_database pointing to the wrong database
- Insufficient max_worker_processes in postgresql.conf
Health Check Thresholds
Use these thresholds for alerting:
| Metric | Green | Yellow | Red |
|---|---|---|---|
| Cache hit rate | > 95% | 90–95% | < 90% |
| Merge worker PID | > 0 | — | = 0 |
| Delta rows (total) | < 2× merge_threshold | 2–5× | > 5× |
| Catalog consistent | true | — | false |
| Orphaned rare rows | 0 | 1–100 | > 100 |
| Query error rate | < 1% | 1–5% | > 5% |
| Avg query latency | < 100ms | 100–500ms | > 500ms |
Automated Monitoring Query
Run this periodically from your monitoring system:
SELECT
CASE
WHEN (c->>'merge_worker') = 'ok'
AND (c->>'cache_hit_rate')::float > 0.90
AND (c->>'catalog_consistent')::bool
AND (c->>'orphaned_rare_rows')::int = 0
THEN 'healthy'
WHEN (c->>'merge_worker') = 'stalled'
OR (c->>'cache_hit_rate')::float < 0.80
THEN 'critical'
ELSE 'warning'
END AS status,
c->>'merge_worker' AS worker,
c->>'cache_hit_rate' AS hit_rate,
c->>'catalog_consistent' AS catalog,
c->>'orphaned_rare_rows' AS orphaned
FROM pg_ripple.canary() c;
Predicate Inventory
Monitor predicate distribution to catch imbalances:
SELECT
p.id,
d.value AS predicate_iri,
p.triple_count,
p.table_oid IS NOT NULL AS has_vp_table,
CASE WHEN p.htap THEN 'htap' ELSE 'flat' END AS storage_mode
FROM _pg_ripple.predicates p
JOIN _pg_ripple.dictionary d ON d.id = p.id
ORDER BY p.triple_count DESC
LIMIT 20;
If one predicate has 10x more triples than the next, its VP table dominates storage and merge time. Consider partitioning the data by named graph or filtering queries to avoid full scans of that predicate.
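To quantify the imbalance, compare the top predicate against the runner-up. A sketch over the predicates catalog used above:
-- Skew ratio: largest predicate's triple count vs. the second largest.
WITH ranked AS (
  SELECT triple_count,
         row_number() OVER (ORDER BY triple_count DESC) AS rn
  FROM _pg_ripple.predicates
)
SELECT (max(triple_count) FILTER (WHERE rn = 1))::numeric /
       NULLIF(max(triple_count) FILTER (WHERE rn = 2), 0) AS skew_ratio
FROM ranked;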
Log-Based Monitoring
Configure PostgreSQL logging for SPARQL workload visibility:
# postgresql.conf
log_min_duration_statement = 500 # Log queries slower than 500ms
log_statement = 'none' # Don't log every statement
log_line_prefix = '%t [%p] %d ' # Timestamp, PID, database
SPARQL-generated SQL appears in the PostgreSQL log with VP table references, making it easy to correlate slow log entries with specific SPARQL patterns.
Performance Tuning
pg_ripple performance depends on three interacting subsystems: the query engine, the write path, and the dictionary cache. This page provides diagnostic steps and tuning recipes for each bottleneck area, with realistic numbers from BSBM benchmarks and internal testing.
The Three Bottleneck Areas
┌──────────────────────────────────────────────────────┐
│ Performance │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Query │ │ Write │ │ Cache │ │
│ │ Engine │ │ Path │ │ Pressure │ │
│ │ │ │ │ │ │ │
│ │ Slow │ │ Merge │ │ Dictionary │ │
│ │ SPARQL │ │ worker │ │ misses → │ │
│ │ queries │ │ lag, │ │ table │ │
│ │ │ │ delta │ │ lookups │ │
│ │ │ │ bloat │ │ │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────┘
Diagnostic Workflow
Before tuning, identify which subsystem is the bottleneck:
-- Step 1: Overall health
SELECT pg_ripple.canary();
-- Step 2: Cache hit rate
SELECT
(s->>'encode_cache_hits')::bigint AS hits,
(s->>'encode_cache_misses')::bigint AS misses,
ROUND(
(s->>'encode_cache_hits')::numeric /
NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
4
) AS hit_rate
FROM pg_ripple.stats() s;
-- Step 3: Delta accumulation
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta_rows;
-- Step 4: Slowest queries
SELECT calls, mean_exec_time::numeric(10,2) AS avg_ms, LEFT(query, 100)
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
ORDER BY mean_exec_time DESC
LIMIT 10;
| Symptom | Likely Bottleneck | Section |
|---|---|---|
High mean_exec_time on VP queries | Query engine | Query Performance |
delta_rows growing unbounded | Write path / merge | Write Throughput |
| Cache hit rate < 95% | Dictionary cache | Cache Pressure |
merge_worker_pid = 0 | Merge worker not running | Write Throughput |
Query Performance
Typical Performance Numbers
Based on BSBM benchmarks and internal testing with 10M triples on a 4-core/16GB instance:
| Query Pattern | Typical Latency | Notes |
|---|---|---|
| Simple triple pattern (1 BGP) | 0.5–2ms | Single VP table scan with B-tree |
| Star pattern (3–5 joins, same subject) | 2–10ms | Self-join elimination reduces to 1 scan + joins |
| Path query (3 hops) | 5–20ms | WITH RECURSIVE, bounded depth |
| Complex BGP (5–8 patterns) | 10–50ms | Benefits from bgp_reorder |
| Aggregation (COUNT/SUM over 100K rows) | 20–80ms | PostgreSQL native aggregation |
| DESCRIBE (CBD, 50 outgoing arcs) | 5–15ms | Depends on describe_strategy |
| Federation (1 SERVICE call) | 50–500ms | Network-dominated |
Tuning: Slow Single Queries
Step 1: Get the EXPLAIN output
SELECT pg_ripple.explain_sparql(
'SELECT ?name WHERE {
?person <http://schema.org/knows> ?friend .
?friend <http://schema.org/name> ?name
}',
'text'
);
Step 2: Check for common issues
| EXPLAIN Pattern | Problem | Fix |
|---|---|---|
Seq Scan on vp_rare | Predicate below promotion threshold | Lower vp_promotion_threshold or load more data |
Nested Loop with millions of rows | Poor join order | Verify bgp_reorder = on; run ANALYZE on VP tables |
Sort + Unique on large result | Unnecessary DISTINCT | Add SHACL sh:maxCount 1 for functional predicates |
CTE Scan with high loops | Unbounded property path | Lower max_path_depth; add FILTER bounds |
Hash Join with large build side | Join on a high-cardinality predicate | Rewrite query to filter the large predicate first |
Step 3: Enable the plan cache
-- Cache compiled SQL for repeated queries
SET pg_ripple.plan_cache_size = 512;
The plan cache eliminates parse/optimize/generate overhead for repeated SPARQL patterns. With BSBM's mix of 12 query templates, a cache size of 256 achieves ~98% hit rate.
Tuning: Overall Query Throughput
For workloads with many concurrent queries:
# Enable parallel query for complex joins
pg_ripple.parallel_query_min_joins = 2
# PostgreSQL parallel execution
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# Larger work_mem for complex joins
work_mem = '128MB'
On a 10M triple dataset with 5-pattern BGPs, enabling bgp_reorder reduces median query time from 45ms to 12ms — a 3.7x improvement. Always keep this on unless you have a specific reason to disable it.
Write Throughput
Typical Write Performance
| Operation | Throughput | Notes |
|---|---|---|
insert_triple() (single) | 5,000–15,000 triples/sec | Per-backend, includes dictionary encoding |
load_turtle() (bulk, inline) | 30,000–80,000 triples/sec | Batch dictionary encoding |
load_turtle_file() (bulk, file) | 50,000–120,000 triples/sec | Streaming from disk, larger batches |
sparql_update() INSERT DATA | 10,000–30,000 triples/sec | SPARQL parse overhead |
Tuning: Merge Worker Lag
If unmerged_delta_rows grows continuously, the merge worker cannot keep up with the write rate.
Diagnosis:
-- Check delta accumulation
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta;
-- Run again 60 seconds later — if delta is growing, merges are lagging
Solutions (in order of impact):
1. Lower merge_threshold — Merge smaller batches more frequently:
   ALTER SYSTEM SET pg_ripple.merge_threshold = 5000;
   SELECT pg_reload_conf();
2. Increase merge frequency — Reduce the polling interval:
   ALTER SYSTEM SET pg_ripple.merge_interval_secs = 15;
   SELECT pg_reload_conf();
3. Manual compaction — Force an immediate merge:
   SELECT pg_ripple.compact();
4. Separate write windows — Batch writes during off-peak hours, then compact.
Tuning: Bulk Load Performance
For large initial data loads:
-- Temporarily disable SHACL validation
SET pg_ripple.shacl_mode = 'off';
-- Use file-based loading for best throughput
SELECT pg_ripple.load_turtle_file('/data/large_dataset.ttl');
-- Re-enable validation
SET pg_ripple.shacl_mode = 'async';
-- Force merge to move data to main tables
SELECT pg_ripple.compact();
During bulk loads, pg_ripple monitors cache utilization against cache_budget. When utilization exceeds 90%, batch sizes are automatically reduced to prevent out-of-memory conditions. If you see slower-than-expected bulk loads, check encode_cache_utilization_pct in stats().
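From a second session, you can watch the pressure while a load runs, using the stats() fields documented earlier:
-- Poll cache pressure during a bulk load.
SELECT (s->>'encode_cache_utilization_pct')::int AS util_pct,
       (s->>'encode_cache_evictions')::bigint  AS evictions
FROM pg_ripple.stats() s;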
Cache Pressure
Diagnosis
SELECT
(s->>'encode_cache_capacity')::int AS capacity,
(s->>'encode_cache_utilization_pct')::int AS util_pct,
(s->>'encode_cache_hits')::bigint AS hits,
(s->>'encode_cache_misses')::bigint AS misses,
(s->>'encode_cache_evictions')::bigint AS evictions
FROM pg_ripple.stats() s;
| Metric | Healthy | Action Needed |
|---|---|---|
| Hit rate > 95% | Normal operation | None |
| Hit rate 90–95% | Marginal | Consider increasing cache |
| Hit rate < 90% | Cache thrashing | Increase dictionary_cache_size |
| Utilization > 90% | Near-full | Increase cache_budget |
| Evictions > 10% of hits | High churn | Working set exceeds cache |
Sizing the Dictionary Cache
Rule of thumb: size the cache to cover your hot working set of terms — close to all unique terms for small datasets, and at least 10% of total unique terms for large ones (see the table below).
-- Count unique terms
SELECT count(*) AS unique_terms FROM _pg_ripple.dictionary;
| Unique Terms | Recommended dictionary_cache_size | Memory (approx.) |
|---|---|---|
| < 50K | 8,192 | ~2 MB |
| 50K – 500K | 65,536 | ~13 MB |
| 500K – 5M | 262,144 | ~50 MB |
| 5M – 50M | 500,000 | ~100 MB |
| > 50M | 1,000,000 (max) | ~200 MB |
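A sketch that turns the live dictionary size into a suggested setting, combining the 10%-of-unique-terms rule above with the roughly 200 bytes per cache entry noted in the Scaling chapter:
-- Suggest a cache size: ~10% of unique terms, floored at the smallest tier.
SELECT count(*) AS unique_terms,
       GREATEST(8192, (count(*) / 10)::int) AS suggested_cache_size,
       pg_size_pretty(GREATEST(8192, (count(*) / 10)::int) * 200::bigint) AS approx_memory
FROM _pg_ripple.dictionary;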
Changing dictionary_cache_size requires a PostgreSQL restart because shared memory is allocated at postmaster start. Plan your cache sizing during initial deployment.
Workload-Specific Recipes
Read-Heavy Analytics
Optimized for complex SPARQL queries with rare writes:
# Large plan cache for diverse query shapes
pg_ripple.plan_cache_size = 2048
# BGP optimization
pg_ripple.bgp_reorder = on
pg_ripple.parallel_query_min_joins = 2
# Large dictionary cache
pg_ripple.dictionary_cache_size = 262144
pg_ripple.cache_budget = 256
# Infrequent merges (writes are rare)
pg_ripple.merge_threshold = 100000
pg_ripple.merge_interval_secs = 300
# PostgreSQL
shared_buffers = '4GB'
effective_cache_size = '12GB'
work_mem = '256MB'
random_page_cost = 1.1
Expected: P95 query latency < 50ms for 5-pattern BGPs on 10M triples.
Write-Heavy Ingestion
Optimized for continuous data ingestion with periodic queries:
# Smaller plan cache (fewer distinct queries)
pg_ripple.plan_cache_size = 64
# Aggressive merging to keep delta small
pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 10
pg_ripple.latch_trigger_threshold = 2000
pg_ripple.auto_analyze = on
# Large cache to handle encoding pressure
pg_ripple.dictionary_cache_size = 500000
pg_ripple.cache_budget = 512
# Disable SHACL during ingestion
pg_ripple.shacl_mode = 'off'
# PostgreSQL — optimize for writes
shared_buffers = '2GB'
wal_buffers = '64MB'
checkpoint_completion_target = 0.9
max_wal_size = '4GB'
Expected: Sustained ingestion at 50K+ triples/sec with merge lag < 30 seconds.
Mixed HTAP (Read + Write)
Balanced for concurrent queries and writes:
# Moderate plan cache
pg_ripple.plan_cache_size = 512
# Balanced merge — not too frequent, not too lazy
pg_ripple.merge_threshold = 25000
pg_ripple.merge_interval_secs = 30
pg_ripple.latch_trigger_threshold = 10000
pg_ripple.auto_analyze = on
# Good cache coverage
pg_ripple.dictionary_cache_size = 131072
pg_ripple.cache_budget = 128
# Async SHACL so writes are not blocked
pg_ripple.shacl_mode = 'async'
# BGP optimization for read queries
pg_ripple.bgp_reorder = on
# PostgreSQL
shared_buffers = '4GB'
effective_cache_size = '12GB'
work_mem = '128MB'
max_parallel_workers_per_gather = 2
Expected: Read P95 < 30ms, write throughput > 20K triples/sec, merge lag < 60 seconds.
Benchmarking Your Deployment
Use the built-in compact() function and pg_stat_statements to establish baselines:
-- Reset statistics
SELECT pg_stat_statements_reset();
-- Run your workload (queries, inserts, etc.)
-- Collect results
SELECT
calls,
mean_exec_time::numeric(10,2) AS avg_ms,
stddev_exec_time::numeric(10,2) AS stddev_ms,
min_exec_time::numeric(10,2) AS min_ms,
max_exec_time::numeric(10,2) AS max_ms,
rows,
LEFT(query, 80) AS query_prefix
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple%'
ORDER BY total_exec_time DESC
LIMIT 20;
Change one parameter at a time, re-run your benchmark, and compare. The most impactful parameters in order are:
1. dictionary_cache_size (cache hit rate)
2. bgp_reorder (query planning)
3. merge_threshold (read freshness vs. write throughput)
4. plan_cache_size (repeated query overhead)
5. PostgreSQL work_mem (complex join performance)
Parallel Merge Worker Pool
Added in v0.42.0
Overview
pg_ripple uses a Vertical Partitioning (VP) architecture where each unique predicate gets its own storage table. The merge worker pool keeps the read-optimised _main partitions in sync with the write-optimised _delta tables.
By default, a single background worker handles all predicates sequentially. For workloads with many distinct predicates — such as rich ontologies with 50+ property types — a pool of parallel workers can significantly improve write throughput.
Configuration
pg_ripple.merge_workers (startup only)
Controls the number of parallel merge worker processes. Must be set in postgresql.conf or before the server starts; it cannot be changed with SET at session level.
# postgresql.conf
shared_preload_libraries = 'pg_ripple'
pg_ripple.merge_workers = 4
- Default: 1 (single worker, original behaviour)
- Range: 1 to 16
- Type: integer, PGC_POSTMASTER (startup-only)
pg_ripple.merge_threshold
Minimum rows in a VP delta table before a merge is triggered. Increasing this reduces merge frequency but increases per-merge cost.
SET pg_ripple.merge_threshold = 50000; -- default: 10000
pg_ripple.merge_interval_secs
Maximum seconds between merge worker polling cycles.
SET pg_ripple.merge_interval_secs = 30; -- default: 60
How It Works
With merge_workers = N, pg_ripple spawns N background worker processes. Each worker owns a disjoint round-robin subset of VP predicates:
- Worker 0 handles predicates where pred_id % N == 0
- Worker 1 handles predicates where pred_id % N == 1
- … and so on
Advisory locking prevents races: before merging a predicate, a worker calls pg_try_advisory_lock(pred_id). If another worker already holds the lock, it skips that predicate.
Work-stealing: after processing its assigned predicates, an idle worker checks whether any "foreign" predicate (not in its round-robin slice) has a delta table above the merge threshold and no lock held. If so, it steals that work. This prevents a single overloaded predicate from delaying the merge cycle.
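You can preview the static assignment for your own predicates. A sketch assuming merge_workers = 4:
-- Round-robin owner (0–3) for each hot predicate under merge_workers = 4.
SELECT id AS pred_id,
       id % 4 AS owning_worker,
       triple_count
FROM _pg_ripple.predicates
ORDER BY triple_count DESC
LIMIT 10;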
Monitoring
Use pg_ripple.diagnostic_report() to check merge worker activity:
SELECT value FROM pg_ripple.diagnostic_report()
WHERE key LIKE 'merge_%';
Or query the background worker state:
SELECT pid, application_name, state
FROM pg_stat_activity
WHERE application_name LIKE 'pg_ripple merge%';
Choosing the Right Worker Count
| Predicate count | Recommended workers |
|---|---|
| < 20 | 1 (default) |
| 20–100 | 2–4 |
| 100–500 | 4–8 |
| > 500 | 8–16 |
For most workloads, the bottleneck is not the worker count but the merge threshold and interval. Tune those first before scaling workers.
Restart Requirement
Because merge_workers is a PGC_POSTMASTER GUC, changes take effect only after a PostgreSQL restart:
# After updating postgresql.conf:
pg_ctl restart -D $PGDATA
Backup and Disaster Recovery
pg_ripple stores all data in standard PostgreSQL tables within the _pg_ripple schema. This means every PostgreSQL backup tool works out of the box — VP tables, the dictionary, the predicates catalog, SHACL constraints, Datalog rules, and inferred triples are all captured by pg_dump, WAL archiving, and streaming replication.
Unlike triple stores that require a separate RDF dump/reload cycle, pg_ripple data is just PostgreSQL data. Your existing backup infrastructure already covers it.
What Gets Backed Up
| Object | Schema | Captured by pg_dump? | Notes |
|---|---|---|---|
| Dictionary table | _pg_ripple.dictionary | Yes | All IRI, blank node, and literal mappings |
| Predicates catalog | _pg_ripple.predicates | Yes | Predicate → VP table OID mapping |
| VP tables (main + delta + tombstones) | _pg_ripple.vp_{id}_* | Yes | One table set per predicate |
| Rare predicates table | _pg_ripple.vp_rare | Yes | Consolidated low-cardinality predicates |
| SHACL constraints | _pg_ripple.shacl_* | Yes | Shape definitions and validation state |
| Datalog rules | _pg_ripple.rules | Yes | Rule text and compiled plans |
| Inferred triples | VP tables, source = 1 | Yes | Materialized inference results |
| Extension metadata | pg_catalog | Yes | Extension version and control file |
| Shared memory state | In-memory only | No | Dictionary LRU cache, merge worker counters |
Logical Backup with pg_dump
Full Database Dump
# Custom format (recommended — compressed, parallel-restore capable)
pg_dump -Fc -f pg_ripple_backup.dump mydb
# Plain SQL (human-readable, useful for auditing)
pg_dump -Fp -f pg_ripple_backup.sql mydb
Extension-Only Dump
To back up only pg_ripple data without the rest of the database:
pg_dump -Fc \
--schema=_pg_ripple \
--schema=pg_ripple \
-f pg_ripple_only.dump mydb
Always include both _pg_ripple (internal storage) and pg_ripple (public API functions). Restoring one without the other leaves the extension in an inconsistent state.
Parallel Dump for Large Datasets
For databases with millions of triples, use parallel workers:
# Directory format required for parallel dump
pg_dump -Fd -j 4 -f pg_ripple_backup_dir/ mydb
The dictionary table and large VP tables will be dumped in parallel, significantly reducing backup time.
Restoring from Backup
Full Restore to a New Database
# Create the target database
createdb mydb_restored
# Restore (custom format)
pg_restore -d mydb_restored -Fc pg_ripple_backup.dump
# Restore (directory format, parallel)
pg_restore -d mydb_restored -Fd -j 4 pg_ripple_backup_dir/
Restore from Plain SQL
psql -d mydb_restored -f pg_ripple_backup.sql
Post-Restore Verification
After restoring, verify the extension is intact:
-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
-- Verify triple count
SELECT pg_ripple.stats();
-- Run the health check
SELECT pg_ripple.canary();
-- Spot-check a SPARQL query
SELECT pg_ripple.sparql($$
SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }
$$);
Do VP Tables Survive Backup and Restore Intact?
Yes. VP tables are standard PostgreSQL heap tables with B-tree or BRIN indexes. pg_dump captures them exactly like any other table. The HTAP delta/main/tombstone split, indexes, and the merge worker view definitions are all preserved. After restore, the merge worker resumes normal operation once shared_preload_libraries includes pg_ripple.
WAL-Based Continuous Archiving
For point-in-time recovery (PITR), configure WAL archiving:
Enable WAL Archiving
In postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
max_wal_senders = 3
Take a Base Backup
pg_basebackup -D /backup/base -Ft -z -P
Point-in-Time Recovery
Create a recovery.signal file and configure the restore target:
# postgresql.conf (or postgresql.auto.conf)
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-19 14:30:00'
Start PostgreSQL — it will replay WAL up to the specified time.
If you recover to a point mid-merge, the merge worker will detect the incomplete state and re-run the merge on startup. No manual intervention is needed, but the first merge cycle after recovery may take longer than usual.
Streaming Replication
pg_ripple works transparently with PostgreSQL streaming replication:
# On the replica
pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -P
The -R flag writes the standby.signal and connection parameters. All VP tables, dictionary data, and HTAP state replicate via WAL.
The background merge worker does not run on read replicas. Replicas receive merged state via WAL replay from the primary. This is correct behavior — replicas should never write.
Backup Strategy Recommendations
Small Datasets (< 1M triples)
| Component | Recommendation |
|---|---|
| Method | pg_dump -Fc nightly |
| Retention | 7 daily + 4 weekly |
| RPO | 24 hours |
| RTO | Minutes |
Medium Datasets (1M – 100M triples)
| Component | Recommendation |
|---|---|
| Method | WAL archiving + daily base backup |
| Retention | 7 daily base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Minutes to hours |
Large Datasets (> 100M triples)
| Component | Recommendation |
|---|---|
| Method | WAL archiving + pgBackRest or Barman |
| Retention | Incremental base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Proportional to dataset size |
Schedule monthly restore drills. A backup that has never been tested is not a backup. Automate the verification queries shown above as part of the drill.
Disaster Recovery Checklist
- Before disaster: WAL archiving enabled, base backups on schedule, replication lag monitored
- During incident: identify the failure scope (single table, full database, or host loss)
- Recovery steps:
- Host loss → promote replica or restore from base backup + WAL
- Corruption → PITR to last known good time
- Accidental deletion → PITR to just before the DROP/DELETE
- Post-recovery:
- Run SELECT pg_ripple.canary() to verify health
- Check pg_ripple.stats() for expected triple counts
- Verify the merge worker is running (merge_worker_pid > 0)
- Run representative SPARQL queries to confirm data integrity
- Resume WAL archiving and replication
Common Pitfalls
- Schema ownership: the restoring user must be a superuser or own both _pg_ripple and pg_ripple schemas
- Sequence values: pg_dump captures sequence state — statement IDs (the i column) will continue from the correct value after restore
- Tablespace placement: if you used custom tablespaces for VP tables, ensure they exist on the target server before restoring
Upgrading Safely
pg_ripple follows PostgreSQL's standard extension upgrade mechanism. Each release ships a migration script that ALTER EXTENSION pg_ripple UPDATE executes automatically, walking the version chain from your current version to the target.
How Extension Upgrades Work
PostgreSQL extensions use a chain of migration scripts to move between versions. pg_ripple provides a script for every consecutive version pair:
pg_ripple--0.1.0--0.2.0.sql
pg_ripple--0.2.0--0.3.0.sql
pg_ripple--0.3.0--0.4.0.sql
...
pg_ripple--0.45.0--0.46.0.sql
When you run ALTER EXTENSION pg_ripple UPDATE, PostgreSQL finds the shortest path from your current version to the latest and executes each script in sequence.
Each migration script uses IF NOT EXISTS, CREATE OR REPLACE, and similar guards. If a migration is partially applied (e.g., due to a crash), re-running it is safe.
Pre-Upgrade Checklist
1. Check Your Current Version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
2. Back Up the Database
pg_dump -Fc -f pre_upgrade_backup.dump mydb
While migration scripts are tested, a backup lets you restore to the pre-upgrade state if anything goes wrong. This is especially important for major feature releases that add schema changes.
3. Review the Changelog
Read the Release Notes for every version between your current version and the target. Pay attention to:
- Breaking changes: renamed functions, changed return types, removed GUC parameters
- Schema changes: new columns on internal tables, new indexes
- New dependencies: additional shared_preload_libraries entries
4. Check for Active Connections
SELECT count(*) FROM pg_stat_activity
WHERE datname = current_database()
AND pid != pg_backend_pid();
Disconnect all application connections before upgrading. The upgrade modifies extension catalog entries and may need exclusive locks on internal tables.
5. Verify the New Package Is Installed
The new .so (shared library) and SQL migration files must be present in the PostgreSQL extension directory before running ALTER EXTENSION:
# Check that the target version's migration script exists
ls $(pg_config --sharedir)/extension/pg_ripple--*
# Check that the shared library is updated
ls -la $(pg_config --pkglibdir)/pg_ripple.so
Performing the Upgrade
Step 1: Install the New Package
# From source
cargo pgrx install --pg-config $(which pg_config) --release
# Or from a pre-built package
# dpkg -i pg_ripple-0.32.0-pg18.deb
Step 2: Schedule a Maintenance Window
pg_ripple does not yet support zero-downtime upgrades. Schedule the upgrade during a maintenance window. If you have read replicas, route read traffic to a replica during the upgrade window, but note that the replica will also need the new shared library installed before promotion.
Step 3: Restart PostgreSQL (If Required)
Some releases update the shared library or change shared memory layout. Check the release notes — if they mention shared memory changes or new background workers, restart PostgreSQL:
pg_ctl restart -D $PGDATA
Step 4: Run the Migration
-- Upgrade to the latest installed version
ALTER EXTENSION pg_ripple UPDATE;
-- Or upgrade to a specific version
ALTER EXTENSION pg_ripple UPDATE TO '0.46.0';
PostgreSQL will execute each intermediate migration script in order:
NOTICE: updating extension "pg_ripple" from version "0.44.0" to "0.45.0"
NOTICE: updating extension "pg_ripple" from version "0.45.0" to "0.46.0"
Step 5: Verify
-- Confirm the new version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
-- Run the health check
SELECT pg_ripple.canary();
-- Verify stats
SELECT pg_ripple.stats();
Post-Upgrade Verification
Run these checks after every upgrade:
-- 1. Extension version matches expected
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';
-- 2. Health check passes
SELECT pg_ripple.canary();
-- 3. Triple count is unchanged
SELECT pg_ripple.stats();
-- 4. SPARQL queries work
SELECT pg_ripple.sparql($$
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
$$);
-- 5. Merge worker is running (if shared_preload_libraries is set)
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int > 0 AS merge_worker_ok;
-- 6. Dictionary cache is operational
SELECT
(s->>'encode_cache_hits')::bigint + (s->>'encode_cache_misses')::bigint > 0
AS cache_active
FROM pg_ripple.stats() s;
Add these verification queries to a script that runs immediately after ALTER EXTENSION. If any check fails, the script can alert operators before traffic is routed back to the upgraded instance.
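A sketch of such a fail-fast gate: it raises an error (giving psql a non-zero exit code) when the canary reports trouble. The 0.90 threshold mirrors the monitoring chapter:
-- Fail fast after an upgrade: raise if the canary is unhealthy.
DO $$
DECLARE
  c jsonb := pg_ripple.canary();
BEGIN
  -- cache_hit_rate may legitimately be low right after a restart (cold cache);
  -- drop that clause if this runs immediately after startup
  IF (c->>'merge_worker') <> 'ok'
     OR NOT (c->>'catalog_consistent')::bool
     OR (c->>'cache_hit_rate')::float < 0.90 THEN
    RAISE EXCEPTION 'pg_ripple post-upgrade check failed: %', c;
  END IF;
END $$;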
Multi-Version Hop Upgrades
PostgreSQL walks the entire migration chain automatically. Upgrading from v0.5.0 directly to v0.46.0 executes all intermediate scripts:
-- This works — PG finds the path 0.5.0 → 0.5.1 → 0.6.0 → ... → 0.46.0
ALTER EXTENSION pg_ripple UPDATE TO '0.46.0';
Each migration script is typically fast (milliseconds to seconds). However, scripts that add columns or create indexes on large tables may take longer. For very long hops (10+ versions), expect a few minutes on large datasets. Monitor pg_stat_activity for lock waits during the upgrade.
Rollback Strategy
There is no built-in downgrade path. If an upgrade causes problems:
Option A: Restore from Backup
# Drop the upgraded database
dropdb mydb
# Restore the pre-upgrade backup
createdb mydb
pg_restore -d mydb pre_upgrade_backup.dump
This is the safest rollback method — it returns everything to the exact pre-upgrade state.
Option B: Reinstall the Old Version
# Install the old shared library
cargo pgrx install --pg-config $(which pg_config) --release # (from old source checkout)
# Restart PostgreSQL
pg_ctl restart -D $PGDATA
Reinstalling the old .so file works only if the migration scripts did not make irreversible schema changes (e.g., dropping a column). Always check the migration script content before relying on this approach.
Upgrading PostgreSQL Itself
When upgrading the PostgreSQL major version (e.g., 17 → 18):
- pg_ripple requires PostgreSQL 18. Earlier versions are not supported.
- Use pg_upgrade as normal — pg_ripple's tables and extension metadata transfer correctly.
- After pg_upgrade, verify the extension:
psql -d mydb -c "SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';"
psql -d mydb -c "SELECT pg_ripple.canary();"
Version Compatibility Matrix
| pg_ripple Version | PostgreSQL Version | Notes |
|---|---|---|
| 0.1.0 – 0.46.0 | 18.x | Only supported version |
| Any | < 18 | Not supported |
| Any | 19+ | Not yet tested |
Troubleshooting Upgrades
"no update path from version X to version Y"
The intermediate migration scripts are missing from the extension directory. Reinstall pg_ripple to ensure all SQL files are present:
ls $(pg_config --sharedir)/extension/pg_ripple--*.sql | wc -l
"could not open extension control file"
The pg_ripple.control file is missing. Reinstall the extension.
Migration script fails with a lock timeout
Another session holds a lock on an internal table. Ensure all connections are closed before upgrading, or increase lock_timeout:
SET lock_timeout = '60s';
ALTER EXTENSION pg_ripple UPDATE;
Shared library version mismatch
The .so file version does not match the SQL migration target. Ensure you installed the matching binary before running ALTER EXTENSION:
cargo pgrx install --pg-config $(which pg_config) --release
pg_ctl restart -D $PGDATA
ALTER EXTENSION pg_ripple UPDATE;
Schema Version Stamping (v0.37.0+)
Starting with v0.37.0, every ALTER EXTENSION pg_ripple UPDATE stamps a row in _pg_ripple.schema_version. You can verify upgrade completeness:
SELECT version, installed_at, upgraded_from
FROM _pg_ripple.schema_version
ORDER BY installed_at DESC;
Example output after upgrading from 0.36.0 to 0.37.0:
version | installed_at | upgraded_from
-----------+--------------------------------+---------------
0.37.0 | 2026-04-19 10:00:00+00 | 0.36.0
0.36.0 | 2026-02-01 08:00:00+00 | 0.35.0
The diagnostic_report() function also reports the current schema version:
SELECT value FROM pg_ripple.diagnostic_report() WHERE key = 'schema_version';
This is useful in monitoring scripts to confirm a rolling upgrade has completed on all replicas.
Scaling
pg_ripple scales vertically within a single PostgreSQL instance and horizontally for read traffic via streaming replication. This page covers how to allocate resources, tune the merge worker, set up read replicas, and understand current limitations.
pg_ripple runs entirely within PostgreSQL. It inherits PostgreSQL's single-writer architecture: one primary handles all writes, and read replicas serve read-only SPARQL queries. Cross-node sharding is not yet supported.
Vertical Scaling
The most impactful scaling lever is giving your single PostgreSQL instance more resources.
Memory
Memory affects three key areas:
| Resource | Controlled By | Impact |
|---|---|---|
| Dictionary LRU cache | pg_ripple.dictionary_cache_size | Reduces disk I/O for IRI/literal lookups. Every SPARQL query touches the dictionary on decode. |
| PostgreSQL shared buffers | shared_buffers | Caches VP table pages. Larger = fewer disk reads for joins. |
| Work memory | work_mem | Memory for sorts, hash joins, and hash aggregates in SPARQL-generated SQL. |
Dictionary Cache Sizing
The dictionary cache is allocated in shared memory at server startup. Each entry consumes approximately 200 bytes.
-- Check current utilization
SELECT
s->>'encode_cache_capacity' AS capacity,
s->>'encode_cache_utilization_pct' AS utilization_pct,
ROUND(
(s->>'encode_cache_hits')::numeric /
NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
4
) AS hit_rate
FROM pg_ripple.stats() s;
| Hit Rate | Action |
|---|---|
| > 95% | Healthy — no change needed |
| 90–95% | Consider increasing dictionary_cache_size |
| < 90% | Double dictionary_cache_size and restart |
Set dictionary_cache_size to at least 10% of your total unique IRIs + literals. For a dataset with 5M unique terms, start with 500K entries (~100 MB of shared memory).
PostgreSQL Memory Settings
# postgresql.conf — for a 64 GB server with pg_ripple as the primary workload
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 256MB
maintenance_work_mem = 2GB
Complex SPARQL queries with multiple joins, UNIONs, or aggregates can spawn many hash operations. PostgreSQL allocates work_mem per operation per query. Start conservatively (64MB–256MB) and increase if you see "temporary file" entries in the logs.
CPU
| Workload | CPU Benefit |
|---|---|
| SPARQL query execution | More cores → more parallel workers for large joins |
| Merge worker | Single-threaded per predicate, but merges run concurrently across predicates |
| Bulk loading | load_turtle / load_ntriples are I/O-bound; CPU helps with dictionary encoding |
| Datalog inference | Semi-naive fixpoint is CPU-intensive; benefits from faster cores |
Set max_parallel_workers_per_gather to allow PostgreSQL to parallelize large VP table scans:
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
parallel_setup_cost = 100
parallel_tuple_cost = 0.001
pg_ripple's parallel_query_min_joins GUC controls when the SPARQL engine enables parallel hints in generated SQL (default: 3 joins).
Storage
| Tier | Recommendation |
|---|---|
| NVMe SSD | Best for all workloads. Random I/O for dictionary lookups and VP table joins. |
| SATA SSD | Acceptable for medium datasets. |
| HDD | Not recommended. Dictionary lookups and VP joins are random-I/O heavy. |
Place pg_wal on a separate NVMe device from the main data directory. pg_ripple's bulk load and merge operations generate significant WAL traffic.
Merge Worker Tuning
The HTAP merge worker is the most important pg_ripple-specific scaling knob. It controls how quickly delta rows (recent writes) are consolidated into the main BRIN-indexed partition.
How the Merge Worker Operates
1. The worker polls every merge_interval_secs (default: 60s)
2. For each predicate, it checks whether delta row count >= merge_threshold
3. If yes, it creates a new main table: (old main − tombstones) UNION ALL delta
4. It swaps the view to point at the new main, then drops the old main after merge_retention_seconds
5. If auto_analyze is on, it runs ANALYZE on the new main
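To see the knobs driving this cycle in one place, query pg_settings; extension GUCs appear there once the library is loaded:
-- Current merge-related settings and their descriptions.
SELECT name, setting, short_desc
FROM pg_settings
WHERE name LIKE 'pg_ripple.merge%';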
Tuning for Write-Heavy Workloads
Lower the merge threshold and interval to keep the delta tables small:
pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 30
pg_ripple.latch_trigger_threshold = 5000
This gives fresher reads but increases I/O from more frequent merges.
Tuning for Read-Heavy Workloads
Raise the threshold to batch more writes before merging:
pg_ripple.merge_threshold = 50000
pg_ripple.merge_interval_secs = 120
This reduces merge I/O overhead but means queries scan larger delta tables.
Monitoring Merge Activity
-- Is the merge worker running?
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int AS pid;
-- How many unmerged delta rows?
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta_rows;
If unmerged_delta_rows grows continuously while merge_worker_pid is non-zero, the worker may be stuck. Check pg_stat_activity for long-running merge transactions and look for lock contention. The merge_watchdog_timeout GUC (default: 300s) logs a WARNING if the worker is idle too long.
Read Replicas
PostgreSQL streaming replication provides horizontal read scaling for SPARQL queries.
Architecture
┌────────────┐ WAL stream ┌────────────┐
│ Primary │ ──────────────────→ │ Replica 1 │ ← SPARQL reads
│ (writes) │ └────────────┘
│ │ WAL stream ┌────────────┐
│ │ ──────────────────→ │ Replica 2 │ ← SPARQL reads
└────────────┘ └────────────┘
Setting Up a Read Replica
On the primary:
# postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
Create a replication slot:
SELECT pg_create_physical_replication_slot('replica1');
On the replica:
pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -S replica1 -P
Start the replica — it will begin streaming WAL and replaying changes, including all VP table mutations.
Replica Considerations
The background merge worker only runs on the primary. Replicas receive already-merged state through WAL replay. This means replicas always have a consistent view of the data without any additional overhead.
- SPARQL queries work identically on replicas — the query engine reads VP tables the same way
- Dictionary cache is independent per instance — each replica maintains its own LRU cache
- Replication lag: monitor with pg_stat_replication on the primary (see the query after this list). Under normal load, lag should be sub-second
- Hot standby conflicts: long-running SPARQL queries on replicas may conflict with WAL replay. Set max_standby_streaming_delay appropriately:
max_standby_streaming_delay = 30s
hot_standby_feedback = on
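The replication-lag check referenced above uses the standard statistics view; pg_wal_lsn_diff converts LSN positions into bytes:
-- Run on the primary: per-replica replay lag in bytes and as an interval.
SELECT application_name, client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;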
Connection Pooling
For workloads with many concurrent SPARQL clients, use a connection pooler:
pg_ripple uses session-level GUC parameters (e.g., pg_ripple.inference_mode). If you use PgBouncer, configure it in session pooling mode, not transaction mode. Transaction-mode pooling resets GUCs between transactions, which can cause unexpected behavior.
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = session
max_client_conn = 200
default_pool_size = 20
Scaling Limits and Honest Boundaries
| Dimension | Current Capability | Limitation |
|---|---|---|
| Triples per instance | Tested to 1B+ | Bound by disk and memory |
| Concurrent SPARQL queries | Hundreds (with pooler) | Bound by max_connections and CPU |
| Write throughput | ~50K–200K triples/sec (bulk load) | Single-writer architecture |
| Read replicas | Unlimited | Standard PG replication |
| Cross-node sharding | Not supported | No distributed query planner |
| Multi-primary writes | Not supported | PostgreSQL limitation |
| Federation | Supported (SERVICE clause) | Remote endpoints add latency |
pg_ripple does not support sharding VP tables across multiple PostgreSQL instances. If your dataset exceeds what a single instance can handle, consider: (1) vertical scaling with larger hardware, (2) federation via SERVICE clauses to distribute queries across multiple pg_ripple instances, each holding a subset of graphs, or (3) archiving cold graphs to separate instances.
Capacity Planning
Storage Estimates
| Component | Per Triple (approx.) |
|---|---|
| VP table row (s, o, g, i, source) | ~40 bytes |
| VP indexes (dual B-tree) | ~80 bytes |
| Dictionary entry (per unique term) | ~120 bytes |
| HTAP overhead (delta + tombstone tables) | ~20% of VP size during active writes |
Example: 100M triples with 20M unique terms ≈ 12 GB (VP rows + indexes) + 2.4 GB (dictionary) + HTAP and WAL overhead ≈ ~20 GB total.
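To compare these estimates with reality, measure the actual footprint of the internal schema using the standard catalogs:
-- On-disk size of all pg_ripple storage (tables, indexes, and TOAST).
SELECT pg_size_pretty(sum(pg_total_relation_size(c.oid))) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = '_pg_ripple'
  AND c.relkind = 'r';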
Memory Estimates
| Component | Sizing |
|---|---|
shared_buffers | 25% of RAM |
dictionary_cache_size | 10% of unique terms |
work_mem | 64MB–512MB depending on query complexity |
| OS page cache | Remaining RAM |
Deploy with conservative settings, load your data, and run representative queries. Use pg_ripple.stats() and PostgreSQL's pg_stat_user_tables to identify bottlenecks before adding hardware.
Troubleshooting
A runbook of common issues, their causes, and step-by-step resolutions. Each entry follows the pattern: Symptom → Cause → Diagnostic → Fix.
1. SPARQL Query Returns Zero Rows
Symptom: A SPARQL query that should return results returns an empty set.
Cause: The most common cause is querying with unencoded IRIs that don't match the dictionary, or querying the wrong graph.
Diagnostic:
-- Check that triples exist
SELECT pg_ripple.stats();
-- Verify the IRI is in the dictionary
SELECT id FROM _pg_ripple.dictionary WHERE value = 'http://example.org/MyResource';
-- Check the default graph vs named graphs
SELECT pg_ripple.sparql($$
SELECT ?g (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g
$$);
Fix: Ensure the query uses the exact IRI as stored (case-sensitive, no trailing slash differences). If data was loaded into a named graph, use GRAPH or FROM clauses.
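If the counts look right but a specific resource never matches, hunt for near-miss IRIs in the dictionary with a case-insensitive pattern (the IRI fragment below is illustrative):
-- Find near-miss IRIs that differ only in case or a trailing slash.
SELECT id, value
FROM _pg_ripple.dictionary
WHERE value ILIKE '%example.org/myresource%';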
2. Merge Worker Not Running
Symptom: pg_ripple.stats() shows merge_worker_pid: 0. Delta rows accumulate.
Cause: pg_ripple is not in shared_preload_libraries, or worker_database points to the wrong database.
Diagnostic:
SHOW shared_preload_libraries;
SHOW pg_ripple.worker_database;
Fix:
# postgresql.conf
shared_preload_libraries = 'pg_ripple'
pg_ripple.worker_database = 'mydb'
Restart PostgreSQL. Verify with:
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int;
3. Slow Queries — Unbounded Property Paths
Symptom: Queries with * or + property paths take minutes or never complete.
Cause: Property path queries compile to WITH RECURSIVE CTEs. On large, highly-connected graphs, recursion explores an enormous search space.
Diagnostic:
SHOW pg_ripple.max_path_depth;
-- Check the generated SQL
SET pg_ripple.plan_cache_size = 0; -- disable cache to see fresh plans
EXPLAIN (ANALYZE, BUFFERS) <generated SQL from logs>;
Fix: Limit recursion depth:
SET pg_ripple.max_path_depth = 10;
Or rewrite the query to use a bounded path ({1,5}) instead of */+.
4. SHACL Validation Not Triggering
Symptom: Data that violates SHACL shapes is inserted without errors.
Cause: SHACL enforcement is asynchronous by default, or the shapes are not loaded.
Diagnostic:
-- Check loaded shapes
SELECT pg_ripple.sparql($$
SELECT ?shape WHERE { ?shape a <http://www.w3.org/ns/shacl#NodeShape> }
$$);
-- Check enforce mode
SHOW pg_ripple.enforce_constraints;
Fix: Set enforcement mode to 'error' for synchronous validation:
SET pg_ripple.enforce_constraints = 'error';
Reload shapes if needed:
SELECT pg_ripple.load_shapes('<shapes-graph-iri>');
5. Datalog Inference Produces No Results
Symptom: pg_ripple.infer() or pg_ripple.infer_goal() returns zero new triples.
Cause: Rules are not loaded, inference mode is 'off', or the rule atoms don't match any data.
Diagnostic:
SHOW pg_ripple.inference_mode;
-- List loaded rule sets
SELECT pg_ripple.list_rule_sets();
-- Test with a simple rule
SELECT pg_ripple.load_rules('test', $$
:Ancestor(?x, ?y) :- :Parent(?x, ?y).
:Ancestor(?x, ?z) :- :Parent(?x, ?y), :Ancestor(?y, ?z).
$$);
SELECT pg_ripple.infer('test');
Fix: Ensure inference_mode is 'on_demand' or 'materialized', rules are loaded, and the predicates in rule atoms match your data's actual IRIs exactly.
6. Shared Memory Errors on Startup
Symptom: PostgreSQL fails to start with could not create shared memory segment or pg_ripple logs insufficient shared memory.
Cause: pg_ripple.dictionary_cache_size is too large for the system's shared memory limits.
Diagnostic:
# Check system shared memory limits
sysctl kern.sysv.shmmax # macOS
sysctl kernel.shmmax # Linux
Fix: Either reduce dictionary_cache_size or increase the OS shared memory limit:
# Linux
sudo sysctl -w kernel.shmmax=17179869184 # 16GB
sudo sysctl -w kernel.shmall=4194304
# macOS
sudo sysctl -w kern.sysv.shmmax=17179869184
7. High Dictionary Cache Eviction Pressure
Symptom: encode_cache_evictions in pg_ripple.stats() is high; cache hit rate drops below 90%.
Cause: The working set of IRIs/literals exceeds the cache capacity.
Diagnostic:
SELECT
s->>'encode_cache_capacity' AS capacity,
s->>'encode_cache_utilization_pct' AS util_pct,
s->>'encode_cache_evictions' AS evictions,
ROUND(
(s->>'encode_cache_hits')::numeric /
NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
4
) AS hit_rate
FROM pg_ripple.stats() s;
Fix: Increase dictionary_cache_size in postgresql.conf and restart:
pg_ripple.dictionary_cache_size = 131072   # double the default
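After the restart, optionally prewarm the cache with the hottest dictionary entries (prewarm_dictionary_hot is documented in the function reference below):
SELECT pg_ripple.prewarm_dictionary_hot(50000);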
8. Federation Query Timeout
Symptom: Queries with SERVICE clauses hang or return a timeout error.
Cause: The remote SPARQL endpoint is unreachable, slow, or returning an unexpected format.
Diagnostic:
# Test the remote endpoint directly
curl -s -H "Accept: application/sparql-results+json" \
"https://remote.example.org/sparql?query=SELECT+*+WHERE+{?s+?p+?o}+LIMIT+1"
Fix:
- Verify network connectivity to the remote endpoint
- Increase the federation timeout:
SET pg_ripple.federation_timeout = 60000; -- milliseconds (default: 5000)
- Check that the remote endpoint supports the required result format (SPARQL JSON Results)
9. pg_ripple_http Not Responding
Symptom: The HTTP SPARQL endpoint returns connection refused or 502 errors.
Cause: The pg_ripple_http companion service is not running, or it cannot connect to PostgreSQL.
Diagnostic:
# Check if the process is running
ps aux | grep pg_ripple_http
# Check the service logs
journalctl -u pg_ripple_http --since "10 minutes ago"
# Test the PostgreSQL connection directly
psql -h localhost -p 5432 -U pg_ripple_http -d mydb -c "SELECT 1"
Fix:
- Start or restart the service
- Verify the connection string in the pg_ripple_http configuration
- Check that pg_hba.conf allows connections from the HTTP service
10. VP Table Bloat
Symptom: Disk usage grows faster than expected; pg_size_pretty(pg_total_relation_size('_pg_ripple.vp_12345')) is much larger than the triple count suggests.
Cause: Frequent deletes and re-inserts without merge cycles, or autovacuum not keeping up.
Diagnostic:
-- Check dead tuples
SELECT relname, n_dead_tup, n_live_tup,
last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = '_pg_ripple'
ORDER BY n_dead_tup DESC
LIMIT 10;
Fix:
-- Force a vacuum on the bloated table
VACUUM (VERBOSE) _pg_ripple.vp_12345_main;
-- Reclaim space aggressively (takes an ACCESS EXCLUSIVE lock, blocking queries)
VACUUM (FULL) _pg_ripple.vp_12345_main;
Tune autovacuum for VP tables:
ALTER TABLE _pg_ripple.vp_12345_delta
SET (autovacuum_vacuum_scale_factor = 0.01);
11. Bulk Load Slower Than Expected
Symptom: pg_ripple.load_turtle() or pg_ripple.load_ntriples() runs much slower than the documented 50K–200K triples/sec.
Cause: Small batch sizes, synchronous commit overhead, or insufficient work_mem.
Diagnostic:
SHOW synchronous_commit;
SHOW work_mem;
SHOW maintenance_work_mem;
Fix:
-- Disable synchronous commit for bulk loads
SET synchronous_commit = off;
-- Increase work memory
SET work_mem = '256MB';
SET maintenance_work_mem = '2GB';
-- Use the batch loading functions
SELECT pg_ripple.load_turtle_file('/path/to/data.ttl');
Disabling synchronous commit risks losing the last few transactions on a crash. Only use this for bulk loads that can be re-run.
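A sketch of a self-contained load session that confines the relaxed settings to one transaction (the path is illustrative):
BEGIN;
SET LOCAL synchronous_commit = off;
SET LOCAL work_mem = '256MB';
SELECT pg_ripple.load_turtle_file('/data/import/dataset.ttl');
COMMIT;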
12. RDF-Star Parse Error
Symptom: Loading RDF-star data fails with unexpected token or invalid quoted triple.
Cause: The input file uses RDF-star syntax (<<>>) but the parser is not in RDF-star mode, or the syntax is malformed.
Diagnostic: Check the file around the reported line number for syntax issues. Common problems:
- Nested <<>> quoted triples without proper whitespace
- Missing datatype on literal objects inside quoted triples
- Using Turtle-star syntax in N-Triples files (or vice versa)
Fix: Verify the file uses the correct format. For Turtle-star:
<<:Alice :knows :Bob>> :since "2024"^^xsd:gYear .
For N-Triples-star, every term must be fully qualified — no prefixes.
13. SHACL Validation Queue Backlog
Symptom: pg_ripple.validation_queue_length() returns a large number; validation results are delayed.
Cause: High write throughput is generating validations faster than the async validator can process them.
Diagnostic:
SELECT pg_ripple.validation_queue_length();
SELECT pg_ripple.stats();
Fix:
- Drain the backlog manually with pg_ripple.process_validation_queue(), or increase the validation worker's processing capacity if your deployment has one
- Temporarily switch to synchronous validation during low-traffic periods:
SET pg_ripple.enforce_constraints = 'error';
- Reduce write batch sizes to give the validator time to catch up
14. Plan Cache Thrashing
Symptom: SPARQL query latency is inconsistent. The first execution of a query pattern is slow, but subsequent runs are fast — then it becomes slow again.
Cause: The plan cache (pg_ripple.plan_cache_size) is too small for the number of distinct query patterns. Plans are evicted and recompiled repeatedly.
Diagnostic:
SHOW pg_ripple.plan_cache_size;
-- Estimate distinct query patterns in your workload
-- (application-level logging required)
Fix:
-- Increase the plan cache
SET pg_ripple.plan_cache_size = 1024;
If the number of distinct patterns exceeds any reasonable cache size, consider parameterizing queries to reduce pattern diversity.
15. "relation _pg_ripple.vp_XXXXX does not exist"
Symptom: SPARQL queries fail with a "relation does not exist" error for a specific VP table.
Cause: The predicates catalog references a VP table that was dropped or never created. This can happen after an incomplete migration or manual DDL.
Diagnostic:
-- Check the predicates catalog
SELECT id, table_oid, triple_count
FROM _pg_ripple.predicates
WHERE id = XXXXX;
-- Verify the table exists
SELECT oid FROM pg_class WHERE oid = (
SELECT table_oid FROM _pg_ripple.predicates WHERE id = XXXXX
);
Fix:
-- Rebuild the VP table for the predicate
SELECT pg_ripple.reindex_predicate(XXXXX);
If the data is lost, the predicate entry should be removed:
DELETE FROM _pg_ripple.predicates WHERE id = XXXXX;
Directly modifying _pg_ripple.predicates bypasses integrity checks. Only do this as a last resort after confirming the VP table is genuinely missing.
16. "permission denied for schema _pg_ripple"
Symptom: Non-superuser connections get permission errors when running SPARQL queries.
Cause: The user does not have USAGE on _pg_ripple and pg_ripple schemas.
Fix:
GRANT USAGE ON SCHEMA pg_ripple TO myuser;
GRANT USAGE ON SCHEMA _pg_ripple TO myuser;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO myuser;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO myuser;
General Diagnostic Commands
A quick-reference set of commands for any troubleshooting session:
-- Extension health
SELECT pg_ripple.canary();
SELECT pg_ripple.stats();
-- PostgreSQL activity
SELECT pid, state, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE datname = current_database();
-- Lock contention
SELECT * FROM pg_locks WHERE NOT granted;
-- Table sizes in _pg_ripple
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relnamespace = '_pg_ripple'::regnamespace
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;
-- GUC settings
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'pg_ripple.%'
ORDER BY name;
Lost Deletes After Merge (v0.37.0+)
Symptom: Triples that were deleted still appear in query results after a background merge cycle completes.
Cause: Before v0.37.0, the merge worker did not hold a per-predicate advisory lock during the delta→main swap. A DELETE that arrived after main_new was built but before the truncate of the tombstones table would have its tombstone deleted in the same truncate, leaving the triple alive in the new main.
Detection:
-- Check system health with diagnostic_report
SELECT key, value FROM pg_ripple.diagnostic_report()
WHERE key IN ('schema_version', 'merge_backlog_rows');
If schema_version is older than 0.37.0, upgrade to get the fix.
Fix:
- Upgrade to v0.37.0 or later:
ALTER EXTENSION pg_ripple UPDATE TO '0.37.0';
- Verify the fix is active — diagnostic_report() reports the correct version:
SELECT value FROM pg_ripple.diagnostic_report() WHERE key = 'schema_version';
-- Should return: 0.37.0
- After upgrade, the merge worker acquires pg_advisory_xact_lock(pred_id) (exclusive) before the delta→main swap, and the delete path acquires pg_advisory_xact_lock_shared(pred_id) before inserting tombstones. These two lock modes conflict, so the swap and concurrent deletes are serialized.
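The lock conflict can be observed directly in two psql sessions using PostgreSQL's built-in advisory lock functions (the predicate ID 12345 is illustrative):
-- Session 1 (stand-in for the merge worker): exclusive per-predicate lock
BEGIN;
SELECT pg_advisory_xact_lock(12345);
-- Session 2 (stand-in for the delete path): blocks until session 1 commits
BEGIN;
SELECT pg_advisory_xact_lock_shared(12345);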
Impact: Low — requires an unlucky timing window during a merge cycle. Most deployments will not observe lost deletes in practice, but correctness-critical workloads should upgrade.
Security
pg_ripple provides multiple layers of security: PostgreSQL's native authentication and authorization, named-graph row-level security (RLS), SQL injection prevention through dictionary encoding, and secure configuration of the pg_ripple_http companion service.
Authentication and Authorization
pg_ripple relies entirely on PostgreSQL's built-in authentication (pg_hba.conf) and role-based access control. There is no separate user database.
Minimum Privileges for SPARQL Queries
-- Create a read-only role
CREATE ROLE sparql_reader LOGIN PASSWORD 'strong_password';
GRANT USAGE ON SCHEMA pg_ripple TO sparql_reader;
GRANT USAGE ON SCHEMA _pg_ripple TO sparql_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO sparql_reader;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO sparql_reader;
Minimum Privileges for Data Loading
-- Create a writer role
CREATE ROLE sparql_writer LOGIN PASSWORD 'strong_password';
GRANT USAGE ON SCHEMA pg_ripple TO sparql_writer;
GRANT USAGE ON SCHEMA _pg_ripple TO sparql_writer;
GRANT SELECT, INSERT, DELETE ON ALL TABLES IN SCHEMA _pg_ripple TO sparql_writer;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA _pg_ripple TO sparql_writer;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO sparql_writer;
Use ALTER DEFAULT PRIVILEGES to ensure newly created VP tables (created when new predicates are encountered) inherit the correct grants:
ALTER DEFAULT PRIVILEGES IN SCHEMA _pg_ripple
GRANT SELECT ON TABLES TO sparql_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA _pg_ripple
GRANT SELECT, INSERT, DELETE ON TABLES TO sparql_writer;
Named-Graph Row-Level Security
pg_ripple supports fine-grained access control at the named-graph level using PostgreSQL's row-level security (RLS) infrastructure. This allows different users to see different subsets of the knowledge graph.
Enabling Graph RLS
-- Enable RLS on all VP tables
SELECT pg_ripple.enable_graph_rls();
This creates RLS policies on every VP table (including vp_rare) that filter rows based on the g (graph) column.
Granting Graph Access
-- Grant a role access to a specific named graph
SELECT pg_ripple.grant_graph('sparql_reader', 'http://example.org/confidential');
-- Grant access to the default graph (g = 0)
SELECT pg_ripple.grant_graph('sparql_reader', '');
-- Grant access to all graphs
SELECT pg_ripple.grant_graph('sparql_reader', '*');
Revoking Graph Access
-- Revoke access to a specific graph
SELECT pg_ripple.revoke_graph('sparql_reader', 'http://example.org/confidential');
How It Works
When graph RLS is enabled:
- Each VP table gets an RLS policy that checks the g column against the user's allowed graph IDs
- The dictionary encodes graph IRIs to i64 identifiers
- An internal mapping table (_pg_ripple.graph_grants) stores (role, graph_id) pairs
- PostgreSQL enforces the policy transparently — SPARQL queries automatically filter results
PostgreSQL superusers and roles with BYPASSRLS always bypass RLS, and table owners bypass it unless the table is altered with FORCE ROW LEVEL SECURITY. For production, use non-superuser roles for application connections.
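As an illustrative sketch only — the exact DDL is generated internally by enable_graph_rls() and may differ — the policy on each VP table is roughly equivalent to:
-- Hypothetical shape of the generated policy (vp_12345 and column names illustrative)
ALTER TABLE _pg_ripple.vp_12345 ENABLE ROW LEVEL SECURITY;
CREATE POLICY graph_access ON _pg_ripple.vp_12345
  USING (g IN (SELECT graph_id FROM _pg_ripple.graph_grants
               WHERE role = current_user::text));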
Example: Multi-Tenant Knowledge Graph
-- Create tenant roles
CREATE ROLE tenant_a LOGIN PASSWORD 'pw_a';
CREATE ROLE tenant_b LOGIN PASSWORD 'pw_b';
-- Grant base access
GRANT USAGE ON SCHEMA pg_ripple TO tenant_a, tenant_b;
GRANT USAGE ON SCHEMA _pg_ripple TO tenant_a, tenant_b;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO tenant_a, tenant_b;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO tenant_a, tenant_b;
-- Enable graph RLS
SELECT pg_ripple.enable_graph_rls();
-- Tenant A sees only their graph
SELECT pg_ripple.grant_graph('tenant_a', 'http://example.org/tenant-a');
-- Tenant B sees only their graph
SELECT pg_ripple.grant_graph('tenant_b', 'http://example.org/tenant-b');
-- Both see shared reference data
SELECT pg_ripple.grant_graph('tenant_a', 'http://example.org/shared');
SELECT pg_ripple.grant_graph('tenant_b', 'http://example.org/shared');
Now SPARQL queries run by tenant_a will only see triples in tenant-a and shared graphs, with no application-level filtering required.
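To verify the isolation, impersonate a tenant role and count the triples visible per graph:
SET ROLE tenant_a;
SELECT * FROM pg_ripple.sparql($$
SELECT ?g (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g
$$);
-- Only the tenant-a and shared graphs should appear
RESET ROLE;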
SQL Injection Prevention
pg_ripple's architecture provides strong defense against SQL injection by design.
Dictionary Encoding as a Security Layer
All SPARQL queries go through a multi-step translation pipeline:
- Parse: SPARQL text is parsed by spargebra into an abstract algebra tree
- Encode: All bound constants (IRIs, literals) are dictionary-encoded to i64 integers before SQL generation
- Generate: SQL is constructed using parameterized queries with integer placeholders
- Execute: SQL runs via pgrx::SpiClient with bound parameters
Because VP tables store only BIGINT columns (s, o, g, i, source), there is no surface for string-based SQL injection. Even if a malicious IRI is passed in a SPARQL query, it is hashed to an integer before any SQL is generated.
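Conceptually — the exact SQL is internal and may differ — a pattern like { ?x ex:knows ex:bob } compiles to an integer-only query of roughly this shape:
-- $1 is the dictionary-encoded i64 id of ex:bob; vp_12345 is illustrative
SELECT s FROM _pg_ripple.vp_12345 WHERE o = $1;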
Table Name Safety
VP table references use OID lookups from _pg_ripple.predicates, not string concatenation:
// Internal: table names are never interpolated from user input
let table_oid = predicates::get_table_oid(predicate_id)?;
// SQL uses the OID directly: FROM pg_class WHERE oid = $1
User-Facing Function Safety
Functions that accept text input (like pg_ripple.sparql()) parse the SPARQL text through spargebra, which rejects anything that is not valid SPARQL. No raw SQL is passed through.
File-Path Loaders and Superuser Requirement
Functions that read from the server's filesystem require superuser privileges:
| Function | Requires Superuser | Reason |
|---|---|---|
| pg_ripple.load_turtle_file(path) | Yes | Reads arbitrary filesystem paths |
| pg_ripple.load_ntriples_file(path) | Yes | Reads arbitrary filesystem paths |
| pg_ripple.load_rdfxml_file(path) | Yes | Reads arbitrary filesystem paths |
| pg_ripple.load_turtle(text) | No | Parses in-memory text only |
| pg_ripple.load_ntriples(text) | No | Parses in-memory text only |
File-path loaders can read any file the PostgreSQL process has access to. Never grant superuser to application roles. Instead, load data as a superuser and grant read access to application roles via schema permissions.
Safe Bulk Load Pattern
-- As superuser: load the data
SELECT pg_ripple.load_turtle_file('/data/import/dataset.ttl');
-- As superuser: grant access to the app role
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO app_role;
pg_ripple_http Security
The pg_ripple_http companion service exposes a SPARQL Protocol endpoint over HTTP. Secure it appropriately.
TLS Configuration
Always run pg_ripple_http behind TLS in production:
# pg_ripple_http.toml
[server]
bind = "0.0.0.0:8443"
tls_cert = "/etc/ssl/certs/pg_ripple_http.crt"
tls_key = "/etc/ssl/private/pg_ripple_http.key"
SPARQL queries may contain sensitive data patterns. Without TLS, queries and results are transmitted in plaintext. Always terminate TLS either at the service or at a reverse proxy.
Authentication
Configure pg_ripple_http to authenticate incoming requests:
[auth]
# HTTP Basic authentication backed by PostgreSQL roles
method = "pg_role"
# Or use a static API key
# method = "api_key"
# api_key = "your-secret-key-here"
With pg_role authentication, HTTP Basic credentials are forwarded to PostgreSQL. Graph RLS policies apply to the authenticated role.
Reverse Proxy Setup
For production, place pg_ripple_http behind a reverse proxy:
# nginx configuration
server {
listen 443 ssl;
server_name sparql.example.org;
ssl_certificate /etc/ssl/certs/sparql.crt;
ssl_certificate_key /etc/ssl/private/sparql.key;
location /sparql {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Rate limiting (requires a matching limit_req_zone directive in the http block)
limit_req zone=sparql burst=20 nodelay;
}
}
CORS Configuration
If the SPARQL endpoint is accessed from browser applications:
[cors]
allowed_origins = ["https://app.example.org"]
allowed_methods = ["GET", "POST"]
allowed_headers = ["Content-Type", "Authorization"]
max_age = 3600
Do not set allowed_origins = ["*"] in production. This allows any website to send SPARQL queries to your endpoint using the visitor's credentials.
Network Isolation
Production Topology
┌─────────────┐ TLS ┌──────────────────┐ Unix socket ┌─────────────┐
│ Clients │ ────────────→ │ pg_ripple_http │ ──────────────────→ │ PostgreSQL │
│ │ │ (reverse proxy) │ │ (pg_ripple) │
└─────────────┘ └──────────────────┘ └─────────────┘
Recommendations
- PostgreSQL: bind to localhost or a private network interface only. Never expose port 5432 to the public internet.
# postgresql.conf
listen_addresses = '127.0.0.1'
- pg_ripple_http: connect to PostgreSQL via a Unix socket for the lowest latency and no network exposure.
- Firewall rules: only allow traffic on the HTTPS port (443) from expected client networks.
# iptables example
iptables -A INPUT -p tcp --dport 443 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j DROP
iptables -A INPUT -p tcp --dport 5432 -j DROP
- pg_hba.conf: restrict connections by source IP and authentication method:
# TYPE DATABASE USER ADDRESS METHOD
local all postgres peer
host mydb pg_ripple_http 127.0.0.1/32 scram-sha-256
host mydb sparql_reader 10.0.0.0/8 scram-sha-256
host all all 0.0.0.0/0 reject
Always use scram-sha-256 authentication (the default in PostgreSQL 18). Avoid md5 and never use trust in production.
Security Checklist
| Item | Status |
|---|---|
| shared_preload_libraries includes only trusted extensions | ☐ |
| Non-superuser roles used for all application connections | ☐ |
| Graph RLS enabled for multi-tenant deployments | ☐ |
| pg_hba.conf restricts connections to known networks | ☐ |
| TLS enabled on pg_ripple_http or reverse proxy | ☐ |
| File-path loaders restricted to superuser only (default) | ☐ |
| synchronous_commit enabled for production (not off) | ☐ |
| Connection pooler uses scram-sha-256 | ☐ |
| CORS origins are not wildcarded | ☐ |
| PostgreSQL logs enabled for audit trail | ☐ |
| Regular security updates for PostgreSQL and pg_ripple | ☐ |
Audit Logging
Enable PostgreSQL's logging to maintain an audit trail:
# postgresql.conf
log_statement = 'all' # or 'ddl' for schema changes only
log_connections = on
log_disconnections = on
log_line_prefix = '%t [%p] %u@%d '
For fine-grained audit logging, consider the pgaudit extension alongside pg_ripple.
pg_ripple logs the generated SQL via PostgreSQL's standard statement logging. To see the original SPARQL text, enable log_statement = 'all' — the SPARQL text appears as the argument to pg_ripple.sparql().
SQL Function Reference
All 157 SQL functions exposed by pg_ripple, grouped by use case. Every function lives in the pg_ripple schema.
All examples below use the fully qualified pg_ripple. prefix. If you prefer shorter calls, run SET search_path TO pg_ripple, public; and drop the prefix.
Loading
Functions for inserting and bulk-loading RDF data.
insert_triple
Insert a single triple into the default graph.
pg_ripple.insert_triple(
subject TEXT,
predicate TEXT,
object TEXT
) RETURNS BIGINT
SELECT pg_ripple.insert_triple(
'<https://example.org/alice>',
'<https://example.org/knows>',
'<https://example.org/bob>'
);
load_turtle
Parse a Turtle string and load all triples into the default graph.
pg_ripple.load_turtle(
data TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:alice ex:name "Alice" ;
ex:knows ex:bob .
');
load_turtle_file
Load Turtle from a server-side file path.
pg_ripple.load_turtle_file(
path TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_file('/data/ontology.ttl');
load_ntriples
Parse an N-Triples string and load all triples into the default graph.
pg_ripple.load_ntriples(
data TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples('
<https://example.org/alice> <https://example.org/name> "Alice" .
<https://example.org/alice> <https://example.org/knows> <https://example.org/bob> .
');
load_ntriples_file
Load N-Triples from a server-side file path.
pg_ripple.load_ntriples_file(
path TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_file('/data/dump.nt');
load_nquads
Parse an N-Quads string and load triples into their respective named graphs.
pg_ripple.load_nquads(
data TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_nquads('
<https://example.org/alice> <https://example.org/name> "Alice" <https://example.org/g1> .
');
load_nquads_file
Load N-Quads from a server-side file path.
pg_ripple.load_nquads_file(
path TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_nquads_file('/data/dump.nq');
load_trig
Parse a TriG string and load triples into their named graphs.
pg_ripple.load_trig(
data TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_trig('
@prefix ex: <https://example.org/> .
ex:g1 { ex:alice ex:name "Alice" . }
');
load_trig_file
Load TriG from a server-side file path.
pg_ripple.load_trig_file(
path TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_trig_file('/data/dataset.trig');
load_rdfxml
Parse an RDF/XML string and load all triples into the default graph.
pg_ripple.load_rdfxml(
data TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml('
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="https://example.org/">
<rdf:Description rdf:about="https://example.org/alice">
<ex:name>Alice</ex:name>
</rdf:Description>
</rdf:RDF>
');
load_rdfxml_file
Load RDF/XML from a server-side file path.
pg_ripple.load_rdfxml_file(
path TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_file('/data/ontology.rdf');
load_ntriples_into_graph
Parse N-Triples and load into a specific named graph.
pg_ripple.load_ntriples_into_graph(
data TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_into_graph(
'<https://example.org/alice> <https://example.org/name> "Alice" .',
'<https://example.org/people>'
);
load_turtle_into_graph
Parse Turtle and load into a specific named graph.
pg_ripple.load_turtle_into_graph(
data TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_into_graph(
'@prefix ex: <https://example.org/> . ex:alice ex:name "Alice" .',
'<https://example.org/people>'
);
load_rdfxml_into_graph
Parse RDF/XML and load into a specific named graph.
pg_ripple.load_rdfxml_into_graph(
data TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_into_graph(
'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="https://example.org/">
<rdf:Description rdf:about="https://example.org/alice">
<ex:name>Alice</ex:name>
</rdf:Description>
</rdf:RDF>',
'<https://example.org/people>'
);
load_ntriples_file_into_graph
Load N-Triples from a server-side file into a named graph.
pg_ripple.load_ntriples_file_into_graph(
path TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_file_into_graph(
'/data/people.nt',
'<https://example.org/people>'
);
load_turtle_file_into_graph
Load Turtle from a server-side file into a named graph.
pg_ripple.load_turtle_file_into_graph(
path TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_file_into_graph(
'/data/people.ttl',
'<https://example.org/people>'
);
load_rdfxml_file_into_graph
Load RDF/XML from a server-side file into a named graph.
pg_ripple.load_rdfxml_file_into_graph(
path TEXT,
graph TEXT,
strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_file_into_graph(
'/data/people.rdf',
'<https://example.org/people>'
);
load_owl_ontology
Load an OWL ontology from Turtle, extracting class and property declarations for use by the Datalog reasoner.
pg_ripple.load_owl_ontology(
data TEXT
) RETURNS BIGINT
SELECT pg_ripple.load_owl_ontology('
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <https://example.org/> .
ex:Person a owl:Class .
ex:knows a owl:ObjectProperty ;
owl:inverseOf ex:knownBy .
');
apply_patch
Apply an RDF patch (additions and deletions) atomically.
pg_ripple.apply_patch(
additions TEXT,
deletions TEXT
) RETURNS BIGINT
SELECT pg_ripple.apply_patch(
'<https://example.org/alice> <https://example.org/age> "31"^^<http://www.w3.org/2001/XMLSchema#integer> .',
'<https://example.org/alice> <https://example.org/age> "30"^^<http://www.w3.org/2001/XMLSchema#integer> .'
);
Querying
Functions for querying triples with SPARQL and text search.
sparql
Execute a SPARQL SELECT query and return results as a set of JSON objects.
pg_ripple.sparql(
query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <https://example.org/>
SELECT ?name WHERE { ex:alice ex:name ?name }
');
sparql_ask
Execute a SPARQL ASK query and return a boolean result.
pg_ripple.sparql_ask(
query TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.sparql_ask('
PREFIX ex: <https://example.org/>
ASK { ex:alice ex:knows ex:bob }
');
sparql_explain
Return the SQL execution plan for a SPARQL query without executing it.
pg_ripple.sparql_explain(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_explain('
PREFIX ex: <https://example.org/>
SELECT ?x WHERE { ?x ex:knows ex:bob }
');
explain_sparql
Return a detailed query plan showing SPARQL algebra and generated SQL.
pg_ripple.explain_sparql(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.explain_sparql('
PREFIX ex: <https://example.org/>
SELECT ?x ?y WHERE { ?x ex:knows ?y }
');
sparql_construct
Execute a SPARQL CONSTRUCT query and return triples as JSON.
pg_ripple.sparql_construct(
query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql_construct('
PREFIX ex: <https://example.org/>
CONSTRUCT { ?x ex:friendOf ?y }
WHERE { ?x ex:knows ?y }
');
sparql_describe
Execute a SPARQL DESCRIBE query and return all triples about a resource.
pg_ripple.sparql_describe(
query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql_describe('
PREFIX ex: <https://example.org/>
DESCRIBE ex:alice
');
sparql_construct_turtle
Execute a SPARQL CONSTRUCT query and return the result as a Turtle string.
pg_ripple.sparql_construct_turtle(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_construct_turtle('
PREFIX ex: <https://example.org/>
CONSTRUCT { ?x ex:friendOf ?y }
WHERE { ?x ex:knows ?y }
');
sparql_construct_jsonld
Execute a SPARQL CONSTRUCT query and return the result as a JSON-LD string.
pg_ripple.sparql_construct_jsonld(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_construct_jsonld('
PREFIX ex: <https://example.org/>
CONSTRUCT { ?x ex:friendOf ?y }
WHERE { ?x ex:knows ?y }
');
sparql_describe_turtle
Execute a SPARQL DESCRIBE query and return the result as Turtle.
pg_ripple.sparql_describe_turtle(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_describe_turtle('
PREFIX ex: <https://example.org/>
DESCRIBE ex:alice
');
sparql_describe_jsonld
Execute a SPARQL DESCRIBE query and return the result as JSON-LD.
pg_ripple.sparql_describe_jsonld(
query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_describe_jsonld('
PREFIX ex: <https://example.org/>
DESCRIBE ex:alice
');
sparql_update
Execute a SPARQL Update operation (INSERT DATA, DELETE DATA, etc.).
pg_ripple.sparql_update(
query TEXT
) RETURNS BIGINT
SELECT pg_ripple.sparql_update('
PREFIX ex: <https://example.org/>
INSERT DATA { ex:alice ex:age 30 }
');
find_triples
Find triples matching a pattern in the default graph. Pass NULL for wildcards.
pg_ripple.find_triples(
subject TEXT DEFAULT NULL,
predicate TEXT DEFAULT NULL,
object TEXT DEFAULT NULL
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT)
SELECT * FROM pg_ripple.find_triples(
'<https://example.org/alice>', NULL, NULL
);
find_triples_in_graph
Find triples matching a pattern in a specific named graph.
pg_ripple.find_triples_in_graph(
subject TEXT DEFAULT NULL,
predicate TEXT DEFAULT NULL,
object TEXT DEFAULT NULL,
graph TEXT DEFAULT NULL
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, graph TEXT)
SELECT * FROM pg_ripple.find_triples_in_graph(
NULL, NULL, NULL, '<https://example.org/people>'
);
triple_count
Return the total number of triples in the default graph.
pg_ripple.triple_count() RETURNS BIGINT
SELECT pg_ripple.triple_count();
triple_count_in_graph
Return the number of triples in a specific named graph.
pg_ripple.triple_count_in_graph(
graph TEXT
) RETURNS BIGINT
SELECT pg_ripple.triple_count_in_graph('<https://example.org/people>');
fts_index
Build or rebuild the full-text search index over literal values.
pg_ripple.fts_index() RETURNS VOID
SELECT pg_ripple.fts_index();
fts_search
Search for triples containing a term in literal values via full-text search.
pg_ripple.fts_search(
query TEXT,
limit_rows INTEGER DEFAULT 100
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, rank REAL)
SELECT * FROM pg_ripple.fts_search('knowledge graph', 10);
Graphs
Functions for managing named graphs.
create_graph
Create a named graph.
pg_ripple.create_graph(
graph TEXT
) RETURNS VOID
SELECT pg_ripple.create_graph('<https://example.org/people>');
drop_graph
Drop a named graph and all its triples.
pg_ripple.drop_graph(
graph TEXT
) RETURNS VOID
SELECT pg_ripple.drop_graph('<https://example.org/people>');
list_graphs
List all named graphs.
pg_ripple.list_graphs() RETURNS TABLE(graph TEXT, triple_count BIGINT)
SELECT * FROM pg_ripple.list_graphs();
clear_graph
Remove all triples from a graph without dropping it.
pg_ripple.clear_graph(
graph TEXT
) RETURNS BIGINT
SELECT pg_ripple.clear_graph('<https://example.org/people>');
Dictionary
Functions for interacting with the dictionary encoder that maps IRIs, blank nodes, and literals to integer IDs.
Most users never need to call dictionary functions directly. They are useful for debugging, performance tuning, and understanding storage internals.
encode_term
Encode an IRI, literal, or blank node to its integer ID.
pg_ripple.encode_term(
term TEXT
) RETURNS BIGINT
SELECT pg_ripple.encode_term('<https://example.org/alice>');
decode_id
Decode an integer ID back to its string representation.
pg_ripple.decode_id(
id BIGINT
) RETURNS TEXT
SELECT pg_ripple.decode_id(42);
encode_triple
Encode a full triple (subject, predicate, object) to integer IDs.
pg_ripple.encode_triple(
subject TEXT,
predicate TEXT,
object TEXT
) RETURNS TABLE(s BIGINT, p BIGINT, o BIGINT)
SELECT * FROM pg_ripple.encode_triple(
'<https://example.org/alice>',
'<https://example.org/knows>',
'<https://example.org/bob>'
);
decode_triple
Decode a triple from integer IDs back to string form.
pg_ripple.decode_triple(
s BIGINT,
p BIGINT,
o BIGINT
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT)
SELECT * FROM pg_ripple.decode_triple(1, 2, 3);
decode_id_full
Decode an integer ID returning the full term with type information.
pg_ripple.decode_id_full(
id BIGINT
) RETURNS JSON
SELECT pg_ripple.decode_id_full(42);
lookup_iri
Look up the integer ID for a specific IRI without inserting.
pg_ripple.lookup_iri(
iri TEXT
) RETURNS BIGINT
SELECT pg_ripple.lookup_iri('<https://example.org/alice>');
dictionary_stats
Return statistics about the dictionary table.
pg_ripple.dictionary_stats() RETURNS JSON
SELECT pg_ripple.dictionary_stats();
prewarm_dictionary_hot
Load the most frequently accessed dictionary entries into the shared cache.
pg_ripple.prewarm_dictionary_hot(
limit_rows INTEGER DEFAULT 10000
) RETURNS INTEGER
SELECT pg_ripple.prewarm_dictionary_hot(50000);
cache_stats
Return cache hit/miss statistics for the dictionary LRU cache.
pg_ripple.cache_stats() RETURNS JSON
SELECT pg_ripple.cache_stats();
Prefixes
Functions for managing namespace prefix abbreviations.
register_prefix
Register a namespace prefix for use in SPARQL queries and output.
pg_ripple.register_prefix(
prefix TEXT,
iri TEXT
) RETURNS VOID
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
prefixes
List all registered prefixes.
pg_ripple.prefixes() RETURNS TABLE(prefix TEXT, iri TEXT)
SELECT * FROM pg_ripple.prefixes();
Validating
Functions for loading SHACL shapes, validating data, and managing async validation.
load_shacl
Load SHACL shapes from a Turtle string.
pg_ripple.load_shacl(
shapes TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .
ex:PersonShape a sh:NodeShape ;
sh:targetClass ex:Person ;
sh:property [ sh:path ex:name ; sh:minCount 1 ; sh:datatype xsd:string ] .
');
validate
Run SHACL validation and return a validation report.
pg_ripple.validate() RETURNS TABLE(
focus_node TEXT,
shape TEXT,
path TEXT,
severity TEXT,
message TEXT
)
SELECT * FROM pg_ripple.validate();
list_shapes
List all loaded SHACL shapes.
pg_ripple.list_shapes() RETURNS TABLE(shape TEXT, target TEXT, property_count INTEGER)
SELECT * FROM pg_ripple.list_shapes();
drop_shape
Drop a SHACL shape by IRI.
pg_ripple.drop_shape(
shape TEXT
) RETURNS VOID
SELECT pg_ripple.drop_shape('<https://example.org/PersonShape>');
enable_shacl_monitors
Enable trigger-based SHACL validation on all VP tables.
pg_ripple.enable_shacl_monitors() RETURNS VOID
SELECT pg_ripple.enable_shacl_monitors();
enable_shacl_dag_monitors
Enable DAG-aware SHACL monitors using pg_trickle for async validation.
pg_ripple.enable_shacl_dag_monitors() RETURNS VOID
SELECT pg_ripple.enable_shacl_dag_monitors();
disable_shacl_dag_monitors
Disable DAG-aware SHACL monitors.
pg_ripple.disable_shacl_dag_monitors() RETURNS VOID
SELECT pg_ripple.disable_shacl_dag_monitors();
list_shacl_dag_monitors
List all active DAG SHACL monitors.
pg_ripple.list_shacl_dag_monitors() RETURNS TABLE(shape TEXT, predicate TEXT, enabled BOOLEAN)
SELECT * FROM pg_ripple.list_shacl_dag_monitors();
process_validation_queue
Process pending items in the async SHACL validation queue.
pg_ripple.process_validation_queue(
batch_size INTEGER DEFAULT 100
) RETURNS INTEGER
SELECT pg_ripple.process_validation_queue(500);
validation_queue_length
Return the number of items pending in the validation queue.
pg_ripple.validation_queue_length() RETURNS BIGINT
SELECT pg_ripple.validation_queue_length();
dead_letter_count
Return the number of items in the validation dead-letter queue.
pg_ripple.dead_letter_count() RETURNS BIGINT
SELECT pg_ripple.dead_letter_count();
dead_letter_queue
Return the contents of the validation dead-letter queue.
pg_ripple.dead_letter_queue() RETURNS TABLE(
id BIGINT,
triple_id BIGINT,
shape TEXT,
error TEXT,
created_at TIMESTAMPTZ
)
SELECT * FROM pg_ripple.dead_letter_queue();
drain_dead_letter_queue
Remove and return all items from the dead-letter queue.
pg_ripple.drain_dead_letter_queue() RETURNS INTEGER
SELECT pg_ripple.drain_dead_letter_queue();
Reasoning
Functions for Datalog rule management and inference.
pg_ripple ships with RDFS and OWL RL rule sets. Load them with load_rules_builtin('rdfs') or load_rules_builtin('owl-rl').
load_rules
Load a named Datalog rule set from a program string.
pg_ripple.load_rules(
name TEXT,
program TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_rules('transitive-knows', '
knows(X, Z) :- knows(X, Y), knows(Y, Z).
');
load_rules_builtin
Load a built-in rule set (rdfs, owl-rl).
pg_ripple.load_rules_builtin(
name TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_rules_builtin('owl-rl');
list_rules
List all loaded rule sets.
pg_ripple.list_rules() RETURNS TABLE(name TEXT, rule_count INTEGER, enabled BOOLEAN)
SELECT * FROM pg_ripple.list_rules();
drop_rules
Drop a rule set by name.
pg_ripple.drop_rules(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_rules('transitive-knows');
enable_rule_set
Enable a rule set for inference.
pg_ripple.enable_rule_set(
name TEXT
) RETURNS VOID
SELECT pg_ripple.enable_rule_set('owl-rl');
disable_rule_set
Disable a rule set (triples already inferred are not removed).
pg_ripple.disable_rule_set(
name TEXT
) RETURNS VOID
SELECT pg_ripple.disable_rule_set('owl-rl');
infer
Run materialization using all enabled rule sets (semi-naive evaluation).
pg_ripple.infer() RETURNS BIGINT
SELECT pg_ripple.infer();
infer_with_stats
Run materialization and return iteration statistics.
pg_ripple.infer_with_stats() RETURNS JSON
SELECT pg_ripple.infer_with_stats();
infer_goal
Run goal-directed inference for a specific query pattern.
pg_ripple.infer_goal(
subject TEXT DEFAULT NULL,
predicate TEXT DEFAULT NULL,
object TEXT DEFAULT NULL
) RETURNS BIGINT
SELECT pg_ripple.infer_goal(
'<https://example.org/alice>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
NULL
);
infer_agg
Run Datalog aggregation rules (min, max, sum, count).
pg_ripple.infer_agg() RETURNS BIGINT
SELECT pg_ripple.infer_agg();
infer_demand
Run demand-driven inference with magic sets optimization.
pg_ripple.infer_demand(
subject TEXT DEFAULT NULL,
predicate TEXT DEFAULT NULL,
object TEXT DEFAULT NULL
) RETURNS BIGINT
SELECT pg_ripple.infer_demand(
'<https://example.org/alice>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
NULL
);
infer_wfs
Run well-founded semantics evaluation for programs with negation.
pg_ripple.infer_wfs() RETURNS BIGINT
SELECT pg_ripple.infer_wfs();
tabling_stats
Return statistics about the tabling memo store.
pg_ripple.tabling_stats() RETURNS JSON
SELECT pg_ripple.tabling_stats();
rule_plan_cache_stats
Return statistics about the Datalog rule plan cache.
pg_ripple.rule_plan_cache_stats() RETURNS JSON
SELECT pg_ripple.rule_plan_cache_stats();
check_constraints
Run Datalog constraint rules and report violations.
pg_ripple.check_constraints() RETURNS TABLE(rule TEXT, subject TEXT, message TEXT)
SELECT * FROM pg_ripple.check_constraints();
Exporting
Functions for serializing triples to various formats.
export_ntriples
Export all triples as an N-Triples string.
pg_ripple.export_ntriples() RETURNS TEXT
SELECT pg_ripple.export_ntriples();
export_nquads
Export all triples (with named graphs) as an N-Quads string.
pg_ripple.export_nquads() RETURNS TEXT
SELECT pg_ripple.export_nquads();
export_turtle
Export all triples as a Turtle string.
pg_ripple.export_turtle() RETURNS TEXT
SELECT pg_ripple.export_turtle();
export_jsonld
Export all triples as a JSON-LD string.
pg_ripple.export_jsonld() RETURNS TEXT
SELECT pg_ripple.export_jsonld();
export_turtle_stream
Export triples as a streaming set of Turtle chunks for large datasets.
pg_ripple.export_turtle_stream(
batch_size INTEGER DEFAULT 1000
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_turtle_stream(5000);
export_jsonld_stream
Export triples as a streaming set of JSON-LD chunks for large datasets.
pg_ripple.export_jsonld_stream(
batch_size INTEGER DEFAULT 1000
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_jsonld_stream(5000);
export_graphrag_entities
Export entities in GraphRAG entity format for Microsoft GraphRAG or compatible tools.
pg_ripple.export_graphrag_entities() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_entities();
export_graphrag_relationships
Export relationships in GraphRAG relationship format.
pg_ripple.export_graphrag_relationships() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_relationships();
export_graphrag_text_units
Export text units in GraphRAG text-unit format.
pg_ripple.export_graphrag_text_units() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_text_units();
JSON-LD Framing
Functions for JSON-LD framing and tree-shaped output.
jsonld_frame_to_sparql
Convert a JSON-LD frame to a SPARQL CONSTRUCT query.
pg_ripple.jsonld_frame_to_sparql(
frame JSON
) RETURNS TEXT
SELECT pg_ripple.jsonld_frame_to_sparql('{
"@type": "https://example.org/Person",
"https://example.org/name": {}
}'::json);
export_jsonld_framed
Export triples shaped by a JSON-LD frame as a JSON-LD string.
pg_ripple.export_jsonld_framed(
frame JSON
) RETURNS TEXT
SELECT pg_ripple.export_jsonld_framed('{
"@type": "https://example.org/Person",
"https://example.org/name": {},
"https://example.org/knows": { "@type": "https://example.org/Person" }
}'::json);
export_jsonld_framed_stream
Export framed JSON-LD as a streaming set of chunks.
pg_ripple.export_jsonld_framed_stream(
frame JSON,
batch_size INTEGER DEFAULT 100
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_jsonld_framed_stream('{
"@type": "https://example.org/Person"
}'::json, 50);
jsonld_frame
Apply a JSON-LD frame to an existing JSON-LD document.
pg_ripple.jsonld_frame(
document JSON,
frame JSON
) RETURNS JSON
SELECT pg_ripple.jsonld_frame(
pg_ripple.export_jsonld()::json,
'{"@type": "https://example.org/Person"}'::json
);
Views
Functions for creating and managing materialized SPARQL, Datalog, CONSTRUCT, DESCRIBE, ASK, and framing views.
Views are backed by PostgreSQL tables or views. Use the corresponding drop_*_view function to remove them. Dropping the extension also removes all views.
create_sparql_view
Create a PostgreSQL view backed by a SPARQL SELECT query.
pg_ripple.create_sparql_view(
name TEXT,
query TEXT
) RETURNS VOID
SELECT pg_ripple.create_sparql_view('people', '
PREFIX ex: <https://example.org/>
SELECT ?name WHERE { ?person a ex:Person ; ex:name ?name }
');
drop_sparql_view
Drop a SPARQL view.
pg_ripple.drop_sparql_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_sparql_view('people');
list_sparql_views
List all SPARQL views.
pg_ripple.list_sparql_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_sparql_views();
create_datalog_view
Create a PostgreSQL view backed by a Datalog rule.
pg_ripple.create_datalog_view(
name TEXT,
rule TEXT
) RETURNS VOID
SELECT pg_ripple.create_datalog_view('ancestor',
'ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).'
);
create_datalog_view_from_rule_set
Create a view from a named rule set's head predicate.
pg_ripple.create_datalog_view_from_rule_set(
view_name TEXT,
rule_set_name TEXT,
head_predicate TEXT
) RETURNS VOID
SELECT pg_ripple.create_datalog_view_from_rule_set(
'inferred_types', 'owl-rl', 'rdf:type'
);
drop_datalog_view
Drop a Datalog view.
pg_ripple.drop_datalog_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_datalog_view('ancestor');
list_datalog_views
List all Datalog views.
pg_ripple.list_datalog_views() RETURNS TABLE(name TEXT, rule_set TEXT, head TEXT)
SELECT * FROM pg_ripple.list_datalog_views();
create_framing_view
Create a PostgreSQL view backed by a JSON-LD frame.
pg_ripple.create_framing_view(
name TEXT,
frame JSON
) RETURNS VOID
SELECT pg_ripple.create_framing_view('person_frame', '{
"@type": "https://example.org/Person",
"https://example.org/name": {}
}'::json);
drop_framing_view
Drop a framing view.
pg_ripple.drop_framing_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_framing_view('person_frame');
list_framing_views
List all framing views.
pg_ripple.list_framing_views() RETURNS TABLE(name TEXT, frame JSON)
SELECT * FROM pg_ripple.list_framing_views();
create_construct_view
Create a view backed by a SPARQL CONSTRUCT query.
pg_ripple.create_construct_view(
name TEXT,
query TEXT
) RETURNS VOID
SELECT pg_ripple.create_construct_view('friends', '
PREFIX ex: <https://example.org/>
CONSTRUCT { ?a ex:friendOf ?b }
WHERE { ?a ex:knows ?b }
');
drop_construct_view
Drop a CONSTRUCT view.
pg_ripple.drop_construct_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_construct_view('friends');
list_construct_views
List all CONSTRUCT views.
pg_ripple.list_construct_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_construct_views();
create_describe_view
Create a view backed by a SPARQL DESCRIBE query.
pg_ripple.create_describe_view(
name TEXT,
query TEXT
) RETURNS VOID
SELECT pg_ripple.create_describe_view('alice_detail', '
PREFIX ex: <https://example.org/>
DESCRIBE ex:alice
');
drop_describe_view
Drop a DESCRIBE view.
pg_ripple.drop_describe_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_describe_view('alice_detail');
list_describe_views
List all DESCRIBE views.
pg_ripple.list_describe_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_describe_views();
create_ask_view
Create a view backed by a SPARQL ASK query.
pg_ripple.create_ask_view(
name TEXT,
query TEXT
) RETURNS VOID
SELECT pg_ripple.create_ask_view('has_alice', '
PREFIX ex: <https://example.org/>
ASK { ex:alice ex:name ?n }
');
drop_ask_view
Drop an ASK view.
pg_ripple.drop_ask_view(
name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_ask_view('has_alice');
list_ask_views
List all ASK views.
pg_ripple.list_ask_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_ask_views();
create_extvp
Create an Extended VP (ExtVP) index for a predicate pair to accelerate star-pattern joins.
pg_ripple.create_extvp(
predicate1 TEXT,
predicate2 TEXT
) RETURNS VOID
SELECT pg_ripple.create_extvp(
'<https://example.org/name>',
'<https://example.org/age>'
);
drop_extvp
Drop an ExtVP index.
pg_ripple.drop_extvp(
predicate1 TEXT,
predicate2 TEXT
) RETURNS VOID
SELECT pg_ripple.drop_extvp(
'<https://example.org/name>',
'<https://example.org/age>'
);
list_extvp
List all ExtVP indices.
pg_ripple.list_extvp() RETURNS TABLE(predicate1 TEXT, predicate2 TEXT, row_count BIGINT)
SELECT * FROM pg_ripple.list_extvp();
Federation
Functions for managing SPARQL federation endpoints.
register_endpoint
Register a remote SPARQL endpoint for federated queries.
pg_ripple.register_endpoint(
name TEXT,
url TEXT
) RETURNS VOID
SELECT pg_ripple.register_endpoint('wikidata', 'https://query.wikidata.org/sparql');
set_endpoint_complexity
Set the complexity weight for a federated endpoint (used by the query planner).
pg_ripple.set_endpoint_complexity(
name TEXT,
complexity REAL
) RETURNS VOID
SELECT pg_ripple.set_endpoint_complexity('wikidata', 2.5);
remove_endpoint
Remove a registered endpoint.
pg_ripple.remove_endpoint(
name TEXT
) RETURNS VOID
SELECT pg_ripple.remove_endpoint('wikidata');
disable_endpoint
Temporarily disable an endpoint without removing it.
pg_ripple.disable_endpoint(
name TEXT
) RETURNS VOID
SELECT pg_ripple.disable_endpoint('wikidata');
list_endpoints
List all registered federation endpoints.
pg_ripple.list_endpoints() RETURNS TABLE(name TEXT, url TEXT, enabled BOOLEAN, complexity REAL)
SELECT * FROM pg_ripple.list_endpoints();
register_vector_endpoint
Register a vector similarity search endpoint for hybrid SPARQL+vector queries.
pg_ripple.register_vector_endpoint(
name TEXT,
url TEXT,
model TEXT
) RETURNS VOID
SELECT pg_ripple.register_vector_endpoint(
'openai', 'https://api.openai.com/v1/embeddings', 'text-embedding-3-small'
);
Vector / Hybrid Search
Functions for vector embeddings, similarity search, and RAG retrieval.
All vector functions require pgvector to be installed. Set pg_ripple.pgvector_enabled = off to disable without uninstalling.
store_embedding
Store a precomputed embedding vector for an entity.
pg_ripple.store_embedding(
entity TEXT,
model TEXT,
vector VECTOR
) RETURNS VOID
SELECT pg_ripple.store_embedding(
'<https://example.org/alice>',
'text-embedding-3-small',
'[0.1, 0.2, 0.3]'::vector
);
similar_entities
Find entities similar to a given entity by vector distance.
pg_ripple.similar_entities(
entity TEXT,
model TEXT DEFAULT 'text-embedding-3-small',
k INTEGER DEFAULT 10
) RETURNS TABLE(entity TEXT, distance REAL)
SELECT * FROM pg_ripple.similar_entities('<https://example.org/alice>');
embed_entities
Generate and store embeddings for entities matching a SPARQL pattern.
pg_ripple.embed_entities(
query TEXT,
model TEXT DEFAULT 'text-embedding-3-small'
) RETURNS INTEGER
SELECT pg_ripple.embed_entities('
PREFIX ex: <https://example.org/>
SELECT ?entity WHERE { ?entity a ex:Person }
');
refresh_embeddings
Recompute embeddings for entities whose underlying data has changed.
pg_ripple.refresh_embeddings(
model TEXT DEFAULT 'text-embedding-3-small'
) RETURNS INTEGER
SELECT pg_ripple.refresh_embeddings();
list_embedding_models
List all embedding models with stored vectors.
pg_ripple.list_embedding_models() RETURNS TABLE(model TEXT, entity_count BIGINT, dimensions INTEGER)
SELECT * FROM pg_ripple.list_embedding_models();
add_embedding_triples
Materialize similarity relationships as RDF triples.
pg_ripple.add_embedding_triples(
model TEXT DEFAULT 'text-embedding-3-small',
threshold REAL DEFAULT 0.8,
predicate TEXT DEFAULT '<https://example.org/similarTo>'
) RETURNS BIGINT
SELECT pg_ripple.add_embedding_triples('text-embedding-3-small', 0.9);
contextualize_entity
Return a text summary of an entity's neighborhood for use as LLM context.
pg_ripple.contextualize_entity(
entity TEXT,
hops INTEGER DEFAULT 2
) RETURNS TEXT
SELECT pg_ripple.contextualize_entity('<https://example.org/alice>', 3);
hybrid_search
Combine SPARQL graph pattern matching with vector similarity (Reciprocal Rank Fusion).
pg_ripple.hybrid_search(
sparql_query TEXT,
vector_query TEXT,
k INTEGER DEFAULT 10,
alpha REAL DEFAULT 0.5
) RETURNS TABLE(entity TEXT, score REAL, sparql_rank INTEGER, vector_rank INTEGER)
SELECT * FROM pg_ripple.hybrid_search(
'PREFIX ex: <https://example.org/>
SELECT ?person WHERE { ?person a ex:Person ; ex:knows ex:bob }',
'researchers in knowledge graphs',
10,
0.7
);
rag_retrieve
Retrieve context for RAG (Retrieval-Augmented Generation) using graph + vector search.
pg_ripple.rag_retrieve(
query TEXT,
k INTEGER DEFAULT 5,
hops INTEGER DEFAULT 2
) RETURNS TABLE(entity TEXT, context TEXT, score REAL)
SELECT * FROM pg_ripple.rag_retrieve('Who knows about knowledge graphs?', 5, 2);
Admin
Functions for maintenance, statistics, and administrative operations.
compact
Compact the triple store by removing unreferenced VP tables and dictionary entries.
pg_ripple.compact() RETURNS JSON
SELECT pg_ripple.compact();
vacuum
Vacuum all VP tables to reclaim space and update statistics.
pg_ripple.vacuum() RETURNS VOID
SELECT pg_ripple.vacuum();
reindex
Rebuild all B-tree and BRIN indices on VP tables.
pg_ripple.reindex() RETURNS VOID
SELECT pg_ripple.reindex();
vacuum_dictionary
Vacuum the dictionary table, removing entries not referenced by any VP table.
pg_ripple.vacuum_dictionary() RETURNS BIGINT
SELECT pg_ripple.vacuum_dictionary();
htap_migrate_predicate
Migrate a predicate from the flat VP layout to the HTAP delta/main layout.
pg_ripple.htap_migrate_predicate(
predicate TEXT
) RETURNS VOID
SELECT pg_ripple.htap_migrate_predicate('<https://example.org/knows>');
stats
Return overall triple store statistics.
pg_ripple.stats() RETURNS JSON
SELECT pg_ripple.stats();
canary
Health-check function that returns true if the extension is loaded and functional.
pg_ripple.canary() RETURNS BOOLEAN
SELECT pg_ripple.canary();
enable_live_statistics
Enable real-time statistics collection for VP tables.
pg_ripple.enable_live_statistics() RETURNS VOID
SELECT pg_ripple.enable_live_statistics();
promote_rare_predicates
Promote predicates from vp_rare to dedicated VP tables if they exceed the threshold.
pg_ripple.promote_rare_predicates() RETURNS INTEGER
SELECT pg_ripple.promote_rare_predicates();
deduplicate_predicate
Remove duplicate triples from a specific predicate's VP table.
pg_ripple.deduplicate_predicate(
predicate TEXT
) RETURNS BIGINT
SELECT pg_ripple.deduplicate_predicate('<https://example.org/knows>');
deduplicate_all
Remove duplicate triples from all VP tables.
pg_ripple.deduplicate_all() RETURNS BIGINT
SELECT pg_ripple.deduplicate_all();
delete_triple
Delete a specific triple from the default graph.
pg_ripple.delete_triple(
subject TEXT,
predicate TEXT,
object TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.delete_triple(
'<https://example.org/alice>',
'<https://example.org/knows>',
'<https://example.org/bob>'
);
delete_triple_from_graph
Delete a specific triple from a named graph.
pg_ripple.delete_triple_from_graph(
subject TEXT,
predicate TEXT,
object TEXT,
graph TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.delete_triple_from_graph(
'<https://example.org/alice>',
'<https://example.org/knows>',
'<https://example.org/bob>',
'<https://example.org/people>'
);
get_statement
Retrieve a statement by its globally-unique statement ID (SID).
pg_ripple.get_statement(
sid BIGINT
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, graph TEXT)
SELECT * FROM pg_ripple.get_statement(42);
Security
Functions for row-level security, access control, and schema inspection.
enable_graph_rls
Enable row-level security on VP tables, restricting access by named graph.
pg_ripple.enable_graph_rls() RETURNS VOID
SELECT pg_ripple.enable_graph_rls();
grant_graph
Grant a user access to a named graph.
pg_ripple.grant_graph(
username TEXT,
graph TEXT,
permission TEXT DEFAULT 'read'
) RETURNS VOID
SELECT pg_ripple.grant_graph('analyst', '<https://example.org/public>', 'read');
revoke_graph
Revoke a user's access to a named graph.
pg_ripple.revoke_graph(
username TEXT,
graph TEXT
) RETURNS VOID
SELECT pg_ripple.revoke_graph('analyst', '<https://example.org/public>');
list_graph_access
List all graph access grants.
pg_ripple.list_graph_access() RETURNS TABLE(username TEXT, graph TEXT, permission TEXT)
SELECT * FROM pg_ripple.list_graph_access();
enable_schema_summary
Enable background schema summary generation (requires pg_trickle).
pg_ripple.enable_schema_summary() RETURNS VOID
SELECT pg_ripple.enable_schema_summary();
schema_summary
Return a one-shot schema summary of all predicates, types, and counts.
pg_ripple.schema_summary() RETURNS JSON
SELECT pg_ripple.schema_summary();
CDC
Functions for Change Data Capture subscriptions.
subscribe
Subscribe to change events on the triple store. Returns a subscription ID.
pg_ripple.subscribe(
channel TEXT DEFAULT 'pg_ripple_changes',
filter TEXT DEFAULT NULL
) RETURNS TEXT
SELECT pg_ripple.subscribe('my_changes', 'predicate=<https://example.org/knows>');
unsubscribe
Unsubscribe from a change event subscription.
pg_ripple.unsubscribe(
subscription_id TEXT
) RETURNS VOID
SELECT pg_ripple.unsubscribe('sub_abc123');
Index
Functions for querying predicate indices.
subject_predicates
Return all predicates used by a given subject.
pg_ripple.subject_predicates(
subject TEXT
) RETURNS TABLE(predicate TEXT)
SELECT * FROM pg_ripple.subject_predicates('<https://example.org/alice>');
object_predicates
Return all predicates where a given resource appears as object.
pg_ripple.object_predicates(
object TEXT
) RETURNS TABLE(predicate TEXT)
SELECT * FROM pg_ripple.object_predicates('<https://example.org/alice>');
Cache
Functions for query plan cache management.
plan_cache_stats
Return statistics about the SPARQL-to-SQL plan cache.
pg_ripple.plan_cache_stats() RETURNS JSON
SELECT pg_ripple.plan_cache_stats();
plan_cache_reset
Clear the SPARQL-to-SQL plan cache.
pg_ripple.plan_cache_reset() RETURNS VOID
SELECT pg_ripple.plan_cache_reset();
pg_trickle_available
Check whether the pg_trickle companion extension is installed and available.
pg_ripple.pg_trickle_available() RETURNS BOOLEAN
SELECT pg_ripple.pg_trickle_available();
Architecture
This page describes the internal architecture of pg_ripple as of v0.38.0.
Overview
pg_ripple is a PostgreSQL 18 extension written in Rust (pgrx 0.17) that
implements a high-performance RDF triple store with native SPARQL query
execution. All user-visible functions live in the pg_ripple schema; internal
tables and VP (Vertical Partitioning) tables live in the _pg_ripple schema.
Component map
graph TD
Client["SQL client / SPARQL tool"]
subgraph Extension["pg_ripple extension (Rust + pgrx)"]
API["SQL API layer\n(lib.rs + *_api.rs)"]
SPARQL["SPARQL engine\nsparql/mod.rs"]
TRANS["Algebra → SQL\nsparql/translate/"]
SQLGEN["SQL generator\nsparql/sqlgen.rs"]
DICT["Dictionary\ndictionary/mod.rs"]
SHACL["SHACL validator\nshacl/mod.rs"]
DL["Datalog engine\ndatalog/mod.rs"]
EXP["Serialisers\nexport/mod.rs"]
STOR["Storage layer\nstorage/mod.rs"]
CAT["Predicate catalog\nstorage/catalog.rs"]
HINTS["SHACL hints\nshacl/hints.rs"]
end
subgraph PG["PostgreSQL 18"]
VP["VP tables\n_pg_ripple.vp_{id}"]
DICT_TBL["dictionary table\n_pg_ripple.dictionary"]
PRED["predicates table\n_pg_ripple.predicates"]
RARE["rare predicates\n_pg_ripple.vp_rare"]
SHAPE_HINTS["shape_hints table\n_pg_ripple.shape_hints"]
end
Client --> API
API --> SPARQL
API --> SHACL
API --> DL
API --> EXP
SPARQL --> TRANS
TRANS --> SQLGEN
SQLGEN --> CAT
CAT --> HINTS
HINTS --> SHAPE_HINTS
SQLGEN --> DICT
DICT --> DICT_TBL
SQLGEN --> VP
CAT --> PRED
STOR --> VP
STOR --> RARE
SHACL --> DICT
DL --> STOR
Source tree structure
| Path | Responsibility |
|---|---|
| src/lib.rs | pgrx entry points, GUC registration, _PG_init, hooks |
| src/gucs.rs | All GUC static declarations |
| src/schema.rs | extension_sql!() DDL blocks |
| src/dictionary/ | IRI / blank-node / literal → i64 encoder (XXH3-128 + LRU) |
| src/storage/ | VP table I/O, HTAP delta/main partitions, merge worker |
| src/storage/catalog.rs | Predicate → VP table OID cache (SPI call reduction) |
| src/sparql/ | SPARQL text → algebra → SQL → SPI → decode |
| src/sparql/translate/ | Per-algebra-node translation stubs (BGP, Join, Filter, …) |
| src/sparql/plan_cache.rs | Per-backend plan cache keyed on algebra digest (XXH3-128) |
| src/datalog/ | Datalog rule parser, stratifier, SQL compiler |
| src/shacl/ | SHACL shapes → validation pipeline |
| src/shacl/constraints/ | Per-constraint-type validation (count, string, logical, …) |
| src/shacl/hints.rs | SHACL → SQL generation hints (join type, DISTINCT) |
| src/export/ | Turtle / N-Triples / JSON-LD serialisation |
| src/federation_registry.rs | SPARQL federation endpoint registry |
| src/stats_admin.rs | Monitoring, pg_stat_statements integration |
| src/graphrag_admin.rs | Vector embedding, hybrid search, GraphRAG pipeline |
| src/*_api.rs | SQL-exposed pg_extern wrappers |
Storage model
Every IRI, blank node, and literal is mapped to a BIGINT (i64) through a
dictionary encoding step (XXH3-128 hash). VP tables never contain raw
strings — all joins are integer joins.
raw IRI/literal
│
dictionary.encode()
│
i64 hash
│
stored in VP table (s, o, g, i, source)
For each unique predicate there is one VP table: _pg_ripple.vp_{predicate_id}.
Predicates with fewer than vp_promotion_threshold (default: 1 000) triples are
stored in the consolidated _pg_ripple.vp_rare table instead.
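For orientation, the layout can be inspected with plain SQL. Treat this as an illustrative peek, not a stable API — the predicates-table column list and the vp_42 table name are placeholders:
-- Which VP table backs each predicate? (internal schema; columns may differ)
SELECT * FROM _pg_ripple.predicates LIMIT 3;
-- VP rows are integer-only: s, o, g, i, source (42 is a placeholder predicate ID)
SELECT s, o, g FROM _pg_ripple.vp_42 LIMIT 3;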
Query execution pipeline
SPARQL text
│
▼ spargebra::Query::parse()
SPARQL algebra
│
▼ sparopt optimizer (BGP reorder, join order)
Optimised algebra
│
▼ SPARQL→SQL translator (sqlgen.rs + translate/)
SQL text
│
▼ PostgreSQL SPI executor
Raw rows
│
▼ dictionary.decode()
JSONB result set
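The intermediate stages can be inspected without executing a query: sparql_explain() (referenced elsewhere in this documentation) returns the generated SQL and plan metadata, though the exact output shape varies by version:
-- Parse, optimise, and translate — but do not execute
SELECT pg_ripple.sparql_explain('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 5
');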
Plan cache
The per-backend plan cache (v0.13.0) maps an algebra digest to the generated SQL. The digest is computed as:
digest = XXH3-128( spargebra::Query::display(query) )
key = "{digest}\x00max_depth={n}\x00bgp_reorder={b}"
Using the algebra display form (rather than the raw query text) means whitespace and prefix-alias variants of the same query share one cache slot.
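A quick way to observe this sharing, using the cache functions from the reference above (the exact keys in the stats JSON may differ by version):
SELECT pg_ripple.plan_cache_reset();
-- Two textual variants of the same algebra...
SELECT * FROM pg_ripple.sparql('SELECT ?s WHERE { ?s ?p ?o } LIMIT 1');
SELECT * FROM pg_ripple.sparql('SELECT   ?s   WHERE { ?s ?p ?o }  LIMIT 1');
-- ...should occupy a single cache slot
SELECT pg_ripple.plan_cache_stats();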
SHACL hints integration
After loading shapes with pg_ripple.load_shacl(), predicate-level hints are
written to _pg_ripple.shape_hints. The SQL generator reads these hints
via the predicate catalog to:
- Omit DISTINCT when sh:maxCount 1 is set for a predicate.
- Use INNER JOIN instead of LEFT JOIN when sh:minCount 1 is set.
Hints are invalidated automatically when shapes are dropped
(pg_ripple.invalidate_catalog_cache()).
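For example, a minimal shape like the one below (the shape IRI is a placeholder) lets the generator drop DISTINCT on foaf:name lookups:
SELECT pg_ripple.load_shacl('
  @prefix sh:   <http://www.w3.org/ns/shacl#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix ex:   <http://example.org/> .
  ex:PersonShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [ sh:path foaf:name ; sh:maxCount 1 ] .
');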
GUC Reference
All pg_ripple configuration parameters are set with ALTER SYSTEM SET, SET (session-level), or in postgresql.conf. Reload with SELECT pg_reload_conf() after ALTER SYSTEM.
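For example, raising the property-path depth limit cluster-wide:
ALTER SYSTEM SET pg_ripple.max_path_depth = 20;
SELECT pg_reload_conf();
SHOW pg_ripple.max_path_depth;  -- confirm the new value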
General Parameters
pg_ripple.max_path_depth
| Type | Integer |
| Default | 10 |
| Range | 1–100 |
Maximum recursion depth for SPARQL property paths (*, +). Increase for deeply nested graphs; lower for tighter resource bounds.
pg_ripple.property_path_max_depth (deprecated)
| Type | Integer |
| Default | 64 |
| Range | 1–100 000 |
| Status | Deprecated since v0.38.0 — use max_path_depth instead |
Legacy alias for max_path_depth. Setting this GUC still works but emits a
deprecation notice. It will be removed in a future major release.
pg_ripple.federation_timeout
| Type | Integer (milliseconds) |
| Default | 5000 |
Timeout for outbound SPARQL federation requests.
pg_ripple.export_batch_size
| Type | Integer |
| Default | 1000 |
Number of rows written per batch in Parquet export operations.
Embedding / Vector Parameters (v0.27.0+)
These GUCs control the pgvector integration introduced in v0.27.0. All embedding functions degrade gracefully when pgvector is absent.
pg_ripple.pgvector_enabled
| Type | Boolean |
| Default | on |
Master switch for all vector embedding paths. Set to off to disable embedding storage, similarity search, and SPARQL pg:similar() without uninstalling pgvector. Useful for temporarily disabling the feature.
-- Disable at session level for a bulk load
SET pg_ripple.pgvector_enabled = off;
pg_ripple.embedding_api_url
| Type | String |
| Default | (none) |
Base URL for the OpenAI-compatible embeddings API. The extension appends /embeddings to this URL when making requests.
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'https://api.openai.com/v1';
-- For Ollama (local):
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'http://localhost:11434/v1';
pg_ripple.embedding_api_key
| Type | String |
| Default | (none) |
Bearer token sent as Authorization: Bearer <key> in embedding API requests. For local models that don't require authentication, set to any non-empty string (e.g., 'local').
Security: Avoid storing API keys in postgresql.conf. Use ALTER SYSTEM and restrict pg_hba.conf access, or inject the key via a session-level SET in application code.
pg_ripple.embedding_model
| Type | String |
| Default | (none) |
Model name passed in the "model" field of embedding API requests.
ALTER SYSTEM SET pg_ripple.embedding_model = 'text-embedding-3-small';
-- or for Ollama:
ALTER SYSTEM SET pg_ripple.embedding_model = 'nomic-embed-text';
pg_ripple.embedding_dimensions
| Type | Integer |
| Default | 1536 |
| Range | 1–65535 |
Expected output dimensions from the embedding model. Must match the model's output length. Common values:
| Model | Dimensions |
|---|---|
text-embedding-3-small | 1536 |
text-embedding-3-large | 3072 |
text-embedding-ada-002 | 1536 |
nomic-embed-text (Ollama) | 768 |
pg_ripple.embedding_index_type
| Type | String |
| Default | (none — HNSW when pgvector present) |
| Values | hnsw, ivfflat |
Index type for the _pg_ripple.embeddings table. HNSW is the default and recommended for most workloads. IVFFlat uses less memory but requires lists parameter tuning.
pg_ripple.embedding_precision
| Type | String |
| Default | (none — full float4 precision) |
| Values | (unset), half, binary |
Storage precision for embedding vectors. Reduces disk/memory usage at the cost of accuracy:
| Value | pgvector type | Notes |
|---|---|---|
| (unset) | vector(N) | Full 32-bit float; highest accuracy |
half | halfvec(N) | 16-bit float; ~50% storage reduction |
binary | bit(N) | 1-bit quantised; ~97% storage reduction, lower accuracy |
Note: Changing precision after data is stored requires re-running the migration or manually altering the column type and re-embedding.
v0.37.0: Tombstone GC & Error Safety
pg_ripple.tombstone_gc_enabled
| Type | Boolean |
| Default | on |
| Context | sighup (cluster-wide; takes effect on pg_reload_conf(), not settable per-session) |
When on, pg_ripple automatically issues VACUUM ANALYZE on a predicate's tombstone table after each merge cycle if the residual tombstone count exceeds tombstone_gc_threshold × main_row_count. Set to off to disable automatic tombstone cleanup (useful when managing VACUUM manually).
pg_ripple.tombstone_gc_threshold
| Type | String (decimal) |
| Default | 0.05 (5%) |
| Range | 0.0 – 1.0 |
| Context | sighup |
Tombstone-to-main-row ratio that triggers automatic VACUUM after a merge cycle. When the remaining tombstone count divided by the new main table row count exceeds this value, a VACUUM ANALYZE is scheduled on the tombstone table.
Lower values (e.g. 0.01) trigger VACUUM more aggressively; higher values (e.g. 0.20) allow more tombstone bloat before cleanup.
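Both tombstone GUCs are sighup-context, so a configuration reload is sufficient:
-- Trigger VACUUM at a 1% tombstone ratio instead of the default 5%
ALTER SYSTEM SET pg_ripple.tombstone_gc_threshold = '0.01';
SELECT pg_reload_conf();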
v0.37.0: GUC Validator Rules
The following string-enum GUCs now reject invalid values at SET time with an error. Previously, invalid values were silently ignored until the execution path checked them.
| GUC | Valid values |
|---|---|
pg_ripple.inference_mode | off, on_demand, materialized |
pg_ripple.enforce_constraints | off, warn, error |
pg_ripple.rule_graph_scope | default, all |
pg_ripple.shacl_mode | off, sync, async |
pg_ripple.describe_strategy | cbd, scbd, simple |
pg_ripple.rls_bypass scope change (v0.37.0): This GUC is now registered at PGC_POSTMASTER scope when pg_ripple is loaded via shared_preload_libraries. This prevents a session from bypassing graph-level RLS with SET LOCAL pg_ripple.rls_bypass = on.
v0.42.0: Parallel Merge Workers
pg_ripple.merge_workers
| Type | Integer |
| Default | 1 |
| Range | 1 – 16 |
| Context | postmaster (startup-only; set in postgresql.conf) |
Number of background merge worker processes. Each worker owns a disjoint round-robin slice of VP predicates. Workers use pg_advisory_lock to prevent conflicts; idle workers steal work from overloaded peers. Increasing this value helps workloads with many distinct predicates (> 50).
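Because the context is postmaster, a reload is not enough — the server must be restarted:
ALTER SYSTEM SET pg_ripple.merge_workers = 4;
-- pg_reload_conf() is NOT sufficient here; restart PostgreSQL, e.g.:
--   pg_ctl restart -D $PGDATA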
v0.42.0: Cost-Based Federation Planner
pg_ripple.federation_planner_enabled
| Type | Boolean |
| Default | on |
| Context | userset |
When on, pg_ripple uses VoID statistics collected from remote SPARQL endpoints to sort the SERVICE execution order by ascending estimated cost. When off, SERVICE clauses are executed in document order.
pg_ripple.federation_stats_ttl_secs
| Type | Integer |
| Default | 3600 (1 hour) |
| Range | 0 – 86400 |
| Context | userset |
Seconds until cached VoID statistics for a remote endpoint are considered stale. Setting 0 disables caching (re-fetches on every query).
pg_ripple.federation_parallel_max
| Type | Integer |
| Default | 4 |
| Range | 1 – 64 |
| Context | userset |
Maximum number of remote SERVICE clauses that pg_ripple will execute concurrently within a single query. Set to 1 to disable parallel SERVICE execution.
pg_ripple.federation_parallel_timeout
| Type | Integer |
| Default | 60 (seconds) |
| Range | 1 – 3600 |
| Context | userset |
Per-endpoint timeout when executing parallel SERVICE clauses. Endpoints that do not respond within this limit return an empty result set (with a WARNING). Does not affect sequential SERVICE execution.
pg_ripple.federation_inline_max_rows
| Type | Integer |
| Default | 10000 |
| Range | 1 – 1000000 |
| Context | userset |
Maximum number of rows in the VALUES binding table passed to a remote SERVICE clause. When the result set from the local graph exceeds this limit, pg_ripple automatically spools the bindings into a temporary table (PT620 INFO logged) and issues multiple smaller requests to the remote endpoint in batches. Set to a lower value if remote endpoints enforce query complexity limits.
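All of the planner and parallelism GUCs above are userset, so they can be tuned per session around a federation-heavy query; the endpoint below is purely illustrative:
SET pg_ripple.federation_parallel_max = 8;
SET pg_ripple.federation_stats_ttl_secs = 600;
SELECT * FROM pg_ripple.sparql('
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?label WHERE {
    SERVICE <https://example.org/sparql> { ?s rdfs:label ?label }
  } LIMIT 10
');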
pg_ripple.federation_allow_private
| Type | Boolean |
| Default | off |
| Context | superuser |
Security-critical GUC — only superusers can set this.
When off (the default), register_endpoint() rejects endpoints whose hostname resolves to a loopback address (127.0.0.0/8), a link-local address (169.254.0.0/16), any RFC-1918 private range (10/8, 172.16/12, 192.168/16), or an IPv6 equivalent. This prevents server-side request forgery (SSRF) via malicious SPARQL SERVICE calls.
Set to on only in controlled environments where the remote endpoint is a trusted internal service (e.g., a local Fuseki instance in a Docker network).
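A sketch of the intended workflow — the single-argument register_endpoint() call is an assumption here; consult the function reference for the authoritative signature:
-- Superuser only:
ALTER SYSTEM SET pg_ripple.federation_allow_private = on;
SELECT pg_reload_conf();
-- A private-network endpoint now passes the SSRF check:
SELECT pg_ripple.register_endpoint('http://fuseki.internal:3030/ds/sparql');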
v0.42.0: owl:sameAs Safety
pg_ripple.sameas_max_cluster_size
| Type | Integer |
| Default | 100000 |
| Range | 0 – 2147483647 |
| Context | userset |
Maximum number of entities in a single owl:sameAs equivalence cluster before canonicalization is skipped with a PT550 WARNING. A single cluster larger than this limit is usually a data quality problem (e.g., a mistakenly asserted owl:sameAs owl:Thing). Set to 0 to disable the check (no limit).
v0.46.0: TopN Push-down & Datalog Sequence Batch
pg_ripple.topn_pushdown
| Type | Boolean |
| Default | on |
| Context | userset |
When on (default), SPARQL SELECT queries that contain both ORDER BY and LIMIT N (with no OFFSET > 0 and no DISTINCT) emit the SQL as … ORDER BY … LIMIT N rather than fetching all rows and discarding after decoding.
Set to off to disable the optimisation globally — for example, during debugging when you suspect that TopN push-down is producing incorrect results.
The sparql_explain() output includes a "topn_applied": true/false key that indicates whether push-down was applied to a specific query.
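To confirm push-down for a specific query (query text is illustrative):
SELECT pg_ripple.sparql_explain('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?person foaf:name ?name }
  ORDER BY ?name LIMIT 10
');
-- inspect the "topn_applied" key in the returned plan metadata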
pg_ripple.datalog_sequence_batch
| Type | Integer |
| Default | 10000 |
| Range | 100 – 1000000 |
| Context | userset |
SID (statement-ID) range reserved per parallel Datalog worker per batch. Before launching N parallel strata workers, the coordinator atomically advances the global _pg_ripple.statement_id_seq sequence by N * datalog_sequence_batch, then assigns each worker an exclusive sub-range. Workers insert triples with pre-computed SIDs without touching the shared sequence, eliminating contention.
Increase this value if parallel inference workers frequently conflict on the sequence. Decrease it to reduce unused SID gaps when inference produces fewer triples than expected per batch.
v0.47.0: Validated String GUCs
All six string-valued GUCs below now reject invalid values at SET time (previously invalid values were accepted and silently ignored at runtime).
pg_ripple.federation_on_error
| Type | String |
| Default | warning |
| Valid values | warning, error, empty |
| Context | userset |
Controls behaviour when a SERVICE call fails completely. warning emits a
PT610 WARNING and returns an empty binding set for that endpoint. error
raises an ERROR and aborts the query. empty silently returns zero rows for
that endpoint.
pg_ripple.federation_on_partial
| Type | String |
| Default | empty |
| Valid values | empty, use |
| Context | userset |
Controls behaviour when a SERVICE response stream is interrupted mid-transfer
(e.g., the remote endpoint drops the connection). empty discards partial
results and returns zero rows. use keeps the rows received before the error.
pg_ripple.sparql_overflow_action
| Type | String |
| Default | warn |
| Valid values | warn, error |
| Context | userset |
Action taken when a SPARQL SELECT result set exceeds pg_ripple.sparql_max_rows (when that limit is > 0). warn truncates the result set and emits a PT601 WARNING. error raises an ERROR.
pg_ripple.tracing_exporter
| Type | String |
| Default | off |
| Valid values | off, stdout, otlp |
| Context | userset |
off disables span export (no overhead). stdout writes spans to the server log. otlp sends spans via the OTLP gRPC protocol to the endpoint specified by the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
pg_ripple.embedding_index_type
| Type | String |
| Default | hnsw |
| Valid values | hnsw, ivfflat |
| Context | userset |
Changing this setting after embeddings have been indexed requires a REINDEX TABLE _pg_ripple.embeddings.
pg_ripple.embedding_precision
| Type | String |
| Default | single |
| Valid values | single, half, binary |
| Context | userset |
Storage precision for embedding vectors. single uses pgvector's full-precision vector(N) type, half uses halfvec(N), and binary uses bit(N).
SPARQL Compliance Matrix
pg_ripple implements the full SPARQL 1.1 specification suite. This page details conformance status for every feature in the W3C SPARQL 1.1 Query, Update, and Protocol recommendations.
As of v0.46.0, pg_ripple passes 100% of the W3C SPARQL 1.1 test suite (~3 000 tests), ≥ 99.9% of the Apache Jena edge-case suite (~1 000 tests), all 100 WatDiv query templates at 10 M-triple scale with correctness validated to ±0.1% row-count baselines, all 14 LUBM queries with OWL RL inference correctness, and ≥ 80% of the W3C OWL 2 RL conformance suite.
SPARQL 1.1 Query — Query Forms
| Feature | Status | Since | Notes |
|---|---|---|---|
SELECT | ✅ Supported | v0.1.0 | Full projection with expressions |
CONSTRUCT | ✅ Supported | v0.8.0 | Returns triples as JSON, Turtle, or JSON-LD |
ASK | ✅ Supported | v0.8.0 | Returns boolean |
DESCRIBE | ✅ Supported | v0.8.0 | Symmetric concise bounded description |
SPARQL 1.1 Query — Algebra Operations
| Feature | Status | Since | Notes |
|---|---|---|---|
| Basic Graph Pattern (BGP) | ✅ Supported | v0.1.0 | Translated to VP table joins |
| Join (inner) | ✅ Supported | v0.1.0 | |
LeftJoin (OPTIONAL) | ✅ Supported | v0.1.0 | Downgraded to INNER JOIN when SHACL sh:minCount 1 is set |
| Filter | ✅ Supported | v0.1.0 | All comparison, logical, and arithmetic operators |
| Union | ✅ Supported | v0.5.0 | UNION ALL in generated SQL |
| Minus | ✅ Supported | v0.5.0 | EXCEPT in generated SQL |
Extend (BIND) | ✅ Supported | v0.1.0 | |
Group (GROUP BY) | ✅ Supported | v0.5.0 | |
| Having | ✅ Supported | v0.5.0 | |
| OrderBy | ✅ Supported | v0.1.0 | |
| Project | ✅ Supported | v0.1.0 | |
| Distinct | ✅ Supported | v0.1.0 | Omitted when SHACL sh:maxCount 1 is set |
| Reduced | ✅ Supported | v0.5.0 | Treated as hint; may or may not deduplicate |
Slice (LIMIT/OFFSET) | ✅ Supported | v0.1.0 | |
Service (SERVICE) | ✅ Supported | v0.16.0 | Federated query via HTTP |
Service Silent (SERVICE SILENT) | ✅ Supported | v0.16.0 | Returns empty on endpoint failure |
Values (VALUES) | ✅ Supported | v0.5.0 | Inline data bindings |
Lateral (LATERAL) | ✅ Supported | v0.22.0 | PostgreSQL LATERAL JOIN |
| Subqueries | ✅ Supported | v0.5.0 | Nested SELECT |
Negation (NOT EXISTS) | ✅ Supported | v0.5.0 | |
Negation (EXISTS) | ✅ Supported | v0.5.0 |
SPARQL 1.1 Query — Property Paths
| Feature | Status | Since | Notes |
|---|---|---|---|
Sequence path (/) | ✅ Supported | v0.5.0 | |
Alternative path (|) | ✅ Supported | v0.5.0 | |
Inverse path (^) | ✅ Supported | v0.5.0 | |
Zero-or-more (*) | ✅ Supported | v0.5.0 | WITH RECURSIVE … CYCLE |
One-or-more (+) | ✅ Supported | v0.5.0 | WITH RECURSIVE … CYCLE |
Zero-or-one (?) | ✅ Supported | v0.5.0 | |
Negated property set (!(p1|p2)) | ✅ Supported | v0.5.0 | |
Fixed-length path ({n}) | ✅ Supported | v0.5.0 | Unrolled to n joins |
Variable-length path ({n,m}) | ✅ Supported | v0.5.0 | Bounded recursion |
All recursive property paths use PostgreSQL 18's native CYCLE clause for hash-based cycle detection, bounded by pg_ripple.max_path_depth (default: 10).
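As a small illustration of combining path operators (data and IRIs are placeholders), the query below finds everyone who reports to the same manager as ex:alice by following ex:reportsTo forward and then inverse:
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?peer WHERE { ex:alice ex:reportsTo/^ex:reportsTo ?peer }
');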
SPARQL 1.1 Query — Aggregates
| Feature | Status | Since | Notes |
|---|---|---|---|
COUNT | ✅ Supported | v0.5.0 | Including COUNT(DISTINCT *) |
SUM | ✅ Supported | v0.5.0 | |
AVG | ✅ Supported | v0.5.0 | |
MIN | ✅ Supported | v0.5.0 | |
MAX | ✅ Supported | v0.5.0 | |
GROUP_CONCAT | ✅ Supported | v0.5.0 | With custom separator |
SAMPLE | ✅ Supported | v0.5.0 |
SPARQL 1.1 Query — Built-in Functions
| Function | Status | Since |
|---|---|---|
STR() | ✅ Supported | v0.1.0 |
LANG() | ✅ Supported | v0.3.0 |
DATATYPE() | ✅ Supported | v0.3.0 |
IRI() / URI() | ✅ Supported | v0.5.0 |
BNODE() | ✅ Supported | v0.5.0 |
RAND() | ✅ Supported | v0.5.0 |
ABS() | ✅ Supported | v0.1.0 |
CEIL() | ✅ Supported | v0.1.0 |
FLOOR() | ✅ Supported | v0.1.0 |
ROUND() | ✅ Supported | v0.1.0 |
CONCAT() | ✅ Supported | v0.5.0 |
STRLEN() | ✅ Supported | v0.1.0 |
UCASE() | ✅ Supported | v0.1.0 |
LCASE() | ✅ Supported | v0.1.0 |
ENCODE_FOR_URI() | ✅ Supported | v0.5.0 |
CONTAINS() | ✅ Supported | v0.1.0 |
STRSTARTS() | ✅ Supported | v0.1.0 |
STRENDS() | ✅ Supported | v0.1.0 |
STRBEFORE() | ✅ Supported | v0.5.0 |
STRAFTER() | ✅ Supported | v0.5.0 |
YEAR() | ✅ Supported | v0.5.0 |
MONTH() | ✅ Supported | v0.5.0 |
DAY() | ✅ Supported | v0.5.0 |
HOURS() | ✅ Supported | v0.5.0 |
MINUTES() | ✅ Supported | v0.5.0 |
SECONDS() | ✅ Supported | v0.5.0 |
TIMEZONE() | ✅ Supported | v0.5.0 |
TZ() | ✅ Supported | v0.5.0 |
NOW() | ✅ Supported | v0.5.0 |
UUID() | ✅ Supported | v0.5.0 |
STRUUID() | ✅ Supported | v0.5.0 |
MD5() | ✅ Supported | v0.5.0 |
SHA1() | ✅ Supported | v0.5.0 |
SHA256() | ✅ Supported | v0.5.0 |
SHA384() | ✅ Supported | v0.5.0 |
SHA512() | ✅ Supported | v0.5.0 |
COALESCE() | ✅ Supported | v0.1.0 |
IF() | ✅ Supported | v0.1.0 |
STRLANG() | ✅ Supported | v0.5.0 |
STRDT() | ✅ Supported | v0.5.0 |
isIRI() / isURI() | ✅ Supported | v0.1.0 |
isBlank() | ✅ Supported | v0.1.0 |
isLiteral() | ✅ Supported | v0.1.0 |
isNumeric() | ✅ Supported | v0.5.0 |
REGEX() | ✅ Supported | v0.1.0 |
REPLACE() | ✅ Supported | v0.5.0 |
SUBSTR() | ✅ Supported | v0.5.0 |
BOUND() | ✅ Supported | v0.1.0 |
IN / NOT IN | ✅ Supported | v0.5.0 |
TRIPLE() (RDF-star) | ✅ Supported | v0.4.0 |
SUBJECT() (RDF-star) | ✅ Supported | v0.4.0 |
PREDICATE() (RDF-star) | ✅ Supported | v0.4.0 |
OBJECT() (RDF-star) | ✅ Supported | v0.4.0 |
isTRIPLE() (RDF-star) | ✅ Supported | v0.4.0 |
SPARQL 1.1 Query — Typed Literals
| Datatype | Status | Notes |
|---|---|---|
xsd:integer | ✅ Supported | Maps to PostgreSQL BIGINT |
xsd:decimal | ✅ Supported | Maps to NUMERIC |
xsd:float | ✅ Supported | Maps to REAL |
xsd:double | ✅ Supported | Maps to DOUBLE PRECISION |
xsd:boolean | ✅ Supported | Maps to BOOLEAN |
xsd:string | ✅ Supported | Default literal type |
xsd:dateTime | ✅ Supported | Maps to TIMESTAMPTZ |
xsd:date | ✅ Supported | Maps to DATE |
xsd:time | ✅ Supported | Maps to TIME |
xsd:gYear | ✅ Supported | Stored as string, compared lexically |
| Language-tagged strings | ✅ Supported | "text"@en syntax |
SPARQL 1.1 Update
| Operation | Status | Since | Notes |
|---|---|---|---|
INSERT DATA | ✅ Supported | v0.7.0 | |
DELETE DATA | ✅ Supported | v0.7.0 | |
DELETE WHERE | ✅ Supported | v0.7.0 | |
DELETE/INSERT WHERE | ✅ Supported | v0.7.0 | |
INSERT WHERE | ✅ Supported | v0.7.0 | |
LOAD | ✅ Supported | v0.7.0 | Via pg_ripple_http or direct file |
CLEAR GRAPH | ✅ Supported | v0.7.0 | |
CLEAR DEFAULT | ✅ Supported | v0.7.0 | |
CLEAR NAMED | ✅ Supported | v0.7.0 | |
CLEAR ALL | ✅ Supported | v0.7.0 | |
DROP GRAPH | ✅ Supported | v0.7.0 | |
DROP DEFAULT | ✅ Supported | v0.7.0 | |
DROP NAMED | ✅ Supported | v0.7.0 | |
DROP ALL | ✅ Supported | v0.7.0 | |
CREATE GRAPH | ✅ Supported | v0.7.0 | |
CREATE SILENT GRAPH | ✅ Supported | v0.7.0 | |
COPY | ✅ Supported | v0.21.0 | |
MOVE | ✅ Supported | v0.21.0 | |
ADD | ✅ Supported | v0.21.0 | |
Multi-statement (; separator) | ✅ Supported | v0.7.0 | |
USING / USING NAMED | ✅ Supported | v0.7.0 | Dataset clause for updates |
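A combined multi-statement update might look like the sketch below; the entry-point name pg_ripple.sparql_update() is an assumption for illustration — check the function reference for the actual name:
-- Hypothetical entry point; statements are separated with ';'
SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  DELETE WHERE { ex:dave ex:role ?r } ;
  INSERT DATA { ex:dave ex:role ex:Admin }
');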
SPARQL 1.1 Protocol
| Feature | Status | Notes |
|---|---|---|
| Query via HTTP GET | ✅ Supported | Via pg_ripple_http |
| Query via HTTP POST (form-encoded) | ✅ Supported | Via pg_ripple_http |
| Query via HTTP POST (direct body) | ✅ Supported | Via pg_ripple_http |
| Update via HTTP POST | ✅ Supported | Via pg_ripple_http |
Content negotiation (Accept header) | ✅ Supported | JSON, Turtle, N-Triples, XML |
default-graph-uri parameter | ✅ Supported | |
named-graph-uri parameter | ✅ Supported | |
Multiple default-graph-uri | ✅ Supported | |
Multiple named-graph-uri | ✅ Supported |
SPARQL Protocol support requires the pg_ripple_http companion service. See APIs and Integration for setup instructions.
SPARQL 1.1 Service Description
| Feature | Status | Notes |
|---|---|---|
| Service description at endpoint root | ✅ Supported | Via pg_ripple_http |
sd:supportedLanguage | ✅ Supported | Reports SPARQL 1.1 Query and Update |
sd:resultFormat | ✅ Supported | JSON, XML, CSV, TSV |
sd:defaultDataset | ✅ Supported | |
sd:feature | ✅ Supported | Reports sd:UnionDefaultGraph, sd:RequiresDataset |
SPARQL 1.1 Graph Store HTTP Protocol
| Operation | Status | Notes |
|---|---|---|
GET (retrieve graph) | ✅ Supported | Via pg_ripple_http |
PUT (replace graph) | ✅ Supported | Via pg_ripple_http |
POST (merge into graph) | ✅ Supported | Via pg_ripple_http |
DELETE (drop graph) | ✅ Supported | Via pg_ripple_http |
?default parameter | ✅ Supported | |
?graph=<uri> parameter | ✅ Supported |
RDF-star / SPARQL-star
| Feature | Status | Since | Notes |
|---|---|---|---|
| Quoted triple storage | ✅ Supported | v0.4.0 | qt_s, qt_p, qt_o dictionary columns |
| Quoted triple in BGP | ✅ Supported | v0.4.0 | Ground patterns only |
TRIPLE() constructor | ✅ Supported | v0.4.0 | |
SUBJECT(), PREDICATE(), OBJECT() | ✅ Supported | v0.4.0 | |
isTRIPLE() | ✅ Supported | v0.4.0 | |
| Annotation syntax ({\| … \|}) | ✅ Supported | v0.4.0 | |
Extensions Beyond W3C
pg_ripple extends the SPARQL standard with additional capabilities:
| Feature | Notes |
|---|---|
pg:similar() custom function | Vector similarity within SPARQL FILTER |
pg:fts() custom function | Full-text search within SPARQL FILTER |
pg:embed() custom function | Inline embedding generation |
| Datalog-materialized predicates | Inferred triples queryable via standard SPARQL |
| SHACL-optimized query plans | Cardinality hints from SHACL shapes |
| Plan cache | Compiled SQL plans cached across queries |
Known Limitations
| Feature | Status | Notes |
|---|---|---|
langMatches() | ⚠️ Partial | Returns 0 rows; full BCP 47 matching planned |
| Custom aggregate extensions | ❌ Not supported | Standard aggregates fully supported |
Variable-in-quoted-triple << ?s ?p ?o >> | ⚠️ Partial | Returns 0 rows with WARNING; ground patterns work |
LOAD <url> from arbitrary HTTP | ⚠️ Depends | Requires pg_ripple_http or server-side file |
DESCRIBE strategy customization | ⚠️ Partial | Strategy selectable via pg_ripple.describe_strategy (cbd, scbd, simple) |
Multiple result formats for SELECT | ⚠️ Partial | JSON primary; XML/CSV/TSV via pg_ripple_http only |
W3C Conformance
This page summarises pg_ripple's conformance status against the W3C SPARQL 1.1, Apache Jena, SHACL Core, WatDiv, and LUBM test suites.
As of v0.41.0, conformance is measured by integrated test harnesses that run in CI on every push to main. Pass rates are published as the conformance_report artifact on the Actions page.
Test suites
pg_ripple runs four complementary conformance suites:
| Suite | Tests | What it validates |
|---|---|---|
| W3C SPARQL 1.1 | ~3 000 | Standard conformance on small, well-defined fixtures |
| Apache Jena | ~1 000 | Implementation edge cases (type coercion, date-time, blank-node scoping) |
| WatDiv | 100 templates | Correctness and performance at 10M-triple scale |
| LUBM | 14 queries | OWL RL inference correctness under ontological reasoning (v0.44.0+) |
| OWL 2 RL | ~200 tests | W3C OWL 2 RL entailment, consistency, and inconsistency (v0.46.0+; informational until ≥95%) |
All suites write per-suite results into a unified tests/conformance/report.json artifact.
See Running Conformance Tests for local setup instructions, the WatDiv Results page for performance metrics, and the LUBM Results page for OWL RL conformance details.
W3C SPARQL 1.1 test harness (v0.41.0+)
The test harness (tests/w3c/) runs the official W3C SPARQL 1.1 test suite (~3 000 tests across 13 sub-suites) against a live pg_ripple installation.
Per-category coverage
| Sub-suite | Tests | CI status |
|---|---|---|
| aggregates | ~120 | Required (smoke) |
| bind | ~20 | Informational (full suite) |
| exists | ~20 | Informational (full suite) |
| functions | ~200 | Informational (full suite) |
| grouping | ~40 | Required (smoke) |
| negation | ~20 | Informational (full suite) |
| optional | ~80 | Required (smoke) |
| project-expression | ~10 | Informational (full suite) |
| property-path | ~60 | Informational (full suite) |
| service | ~10 | SKIP (live external endpoints) |
| subquery | ~20 | Informational (full suite) |
| syntax-query | ~300 | Informational (full suite) |
| update | ~200 | Informational (full suite) |
Running locally
# Download test data first (one-time setup):
bash scripts/fetch_conformance_tests.sh --w3c
# Run smoke subset (180 tests, ~30s):
cargo test --test w3c_smoke
# Run full W3C suite (3000+ tests, ~2min with 8 threads):
cargo test --test w3c_suite -- --test-threads 8
Apache Jena test suite (v0.43.0+)
The Jena adapter (tests/jena/) runs ~1 000 tests from Apache Jena's sparql-query, sparql-update, sparql-syntax, and algebra sub-suites. Jena tests cover implementation edge cases that the W3C suite leaves underspecified.
Jena-specific coverage areas
| Area | Sub-suite |
|---|---|
XSD numeric promotions (xsd:integer → xsd:decimal → xsd:double) | sparql-query |
| Mixed-type arithmetic and comparisons | sparql-query |
Timezone-aware xsd:dateTime comparisons | sparql-query |
Date/time built-ins: NOW(), YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ() | sparql-query |
xsd:decimal arithmetic: ROUND(), CEIL(), FLOOR(), ABS() | sparql-query |
| Blank nodes in CONSTRUCT templates | sparql-query |
| Blank-node identity across OPTIONAL and GRAPH boundaries | sparql-query |
String functions: STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT() | sparql-query |
| SPARQL UPDATE edge cases | sparql-update |
| Syntax acceptance / rejection (positive/negative syntax tests) | sparql-syntax |
| Algebra normalisation equivalences | algebra |
CI status
The jena-suite CI job is non-blocking until pass rate ≥ 95%, then promoted to required. Known failures for type-coercion and date-time edge cases are tracked in tests/conformance/known_failures.txt with the jena: prefix.
Running locally
# Download Jena test data:
bash scripts/fetch_conformance_tests.sh --jena
# Run the full Jena suite:
cargo test --test jena_suite
SPARQL 1.1 Query
Test suite: W3C SPARQL 1.1 Query test suite (2013-03-27)
Target: ≥ 95% of applicable tests pass.
Supported features
| Feature | Status |
|---|---|
| Basic Graph Patterns (BGP) | Supported |
| FILTER with all comparison and logical operators | Supported |
| OPTIONAL | Supported |
| UNION | Supported |
Subqueries (SELECT … { SELECT … }) | Supported |
| BIND | Supported |
| VALUES | Supported |
| Property paths (/, \|, *, +, ?, ^) | Supported |
| Negated property sets (!(p1 \| p2)) | Supported |
| Aggregates: COUNT, SUM, AVG, MIN, MAX | Supported |
| GROUP BY, HAVING | Supported |
| ORDER BY, LIMIT, OFFSET | Supported |
| DISTINCT | Supported |
| ASK | Supported |
| CONSTRUCT | Supported |
| DESCRIBE | Supported |
Named graphs (GRAPH ?g { … }) | Supported |
Federated query (SERVICE) | Supported (v0.16.0) |
| All XPath/SPARQL built-in functions (STR, STRLEN, UCASE, LCASE, STRSTARTS, STRENDS, CONTAINS, REGEX, ABS, CEIL, FLOOR, ROUND, IF, COALESCE, isIRI, isLiteral, isBlank, DATATYPE, LANG, BIND) | Supported |
| Language-tagged literals (storage and LANG() function) | Supported |
| Typed literals with xsd:integer, xsd:decimal, xsd:double, xsd:dateTime, xsd:boolean | Supported |
| NOT EXISTS | Supported |
| MINUS | Supported |
| RDF-star (quoted triples, SPARQL-star BGP) | Supported (v0.4.0) |
Known limitations
| Feature | Status |
|---|---|
langMatches() function | Not supported. Returns 0 rows without error. Full BCP 47 language tag matching is planned for a future release. |
| Custom aggregate extensions (property functions) | Not supported. Standard aggregates (COUNT, SUM, AVG, MIN, MAX) are fully supported. |
Variable-inside-quoted-triple patterns (<< ?s ?p ?o >>) | Returns 0 rows with a WARNING. Ground quoted-triple patterns work. |
LOAD <url> from arbitrary HTTP URIs | Network-access dependent; supported via pg_ripple_http companion service. |
SPARQL 1.1 Update
Test suite: W3C SPARQL 1.1 Update test suite (2013)
Target: ≥ 95% of applicable tests pass.
Supported features
| Feature | Status |
|---|---|
| INSERT DATA | Supported |
| DELETE DATA | Supported |
| INSERT WHERE | Supported |
| DELETE WHERE | Supported |
| DELETE/INSERT WHERE | Supported |
| CLEAR GRAPH | Supported |
| CREATE GRAPH / DROP GRAPH | Supported |
Multi-statement updates (; separator) | Supported |
| Named graph update operations | Supported |
| Idempotent re-insert (ON CONFLICT DO NOTHING) | Supported |
Known limitations
| Feature | Status |
|---|---|
COPY, MOVE, ADD graph operations | Implemented as no-ops returning 0 before v0.21.0; fully supported since v0.21.0 (see the SPARQL Compliance Matrix). |
LOAD <url> | Same as for queries above. |
SHACL Core
Test suite: W3C SHACL Core test suite
Target: ≥ 95% of SHACL Core tests pass.
Supported constraints
| Constraint | Status |
|---|---|
sh:targetClass | Supported |
sh:targetNode | Supported |
sh:targetSubjectsOf | Supported |
sh:targetObjectsOf | Supported |
sh:property with sh:path | Supported |
sh:minCount / sh:maxCount | Supported |
sh:datatype | Supported |
sh:pattern (regex) | Supported |
sh:minLength / sh:maxLength | Supported |
sh:minInclusive / sh:maxInclusive | Supported |
sh:minExclusive / sh:maxExclusive | Supported |
sh:in (enumeration) | Supported |
sh:hasValue | Supported |
sh:class | Supported |
sh:nodeKind (IRI, BlankNode, Literal) | Supported |
sh:or | Supported |
sh:and | Supported |
sh:not | Supported |
sh:node (nested shape reference) | Supported |
sh:qualifiedValueShape + sh:qualifiedMinCount / sh:qualifiedMaxCount | Supported |
Async validation pipeline (process_validation_queue) | Supported |
| Sync mode (insert rejection) | Supported |
Known limitations
| Feature | Status |
|---|---|
SHACL Advanced Features (SPARQL-based constraints, sh:SPARQLConstraint) | Deferred to v0.21.0. |
SHACL-AF (rules, sh:TripleRule) | Partial implementation via Datalog; full SHACL-AF integration planned. |
Running the conformance gate
The conformance tests run as part of the standard pg_regress suite:
cargo pgrx regress pg18 --postgresql-conf "allow_system_table_mods=on"
The relevant test files are:
- tests/pg_regress/sql/w3c_sparql_query_conformance.sql
- tests/pg_regress/sql/w3c_sparql_update_conformance.sql
- tests/pg_regress/sql/w3c_shacl_conformance.sql
- tests/pg_regress/sql/crash_recovery_merge.sql
OWL 2 RL Conformance Baseline (v0.47.0)
This page documents the OWL 2 RL conformance baseline for pg_ripple v0.47.0, as measured against the OWL 2 RL rule suite added in v0.46.0.
Summary
| Category | Rules Tested | Passing | XFAIL | Notes |
|---|---|---|---|---|
| cls (class axioms) | 12 | 12 | 0 | Full pass |
| prp (property axioms) | 18 | 17 | 1 | prp-spo2 (complex chain) XFAIL |
| cax (class axiom entailments) | 8 | 8 | 0 | Full pass |
| scm (schema entailments) | 14 | 13 | 1 | scm-sco (cyclical subclass) XFAIL |
| eq (equality reasoning) | 10 | 9 | 1 | eq-diff1 with owl:differentFrom XFAIL |
| dt (datatype reasoning) | 4 | 3 | 1 | dt-type2 (xs:double precision) XFAIL |
| Total | 66 | 62 | 4 | 93.9% pass rate |
Known Failures (XFAIL)
These failures are documented in tests/owl2rl/known_failures.txt and tracked
release-to-release for regression detection.
prp-spo2 — Complex sub-property chain
OWL 2 RL rule prp-spo2 requires owl:propertyChainAxiom with two hops.
pg_ripple supports two-hop chains but the standard test case uses a three-hop
chain that requires recursive sub-property expansion not yet implemented.
Impact: Low — three-hop chains are rare in practice. Target: v0.49.0
scm-sco — Cyclical subclass entailment
The test graph contains a subclass cycle (A rdfs:subClassOf B, B rdfs:subClassOf A).
pg_ripple's WFS-based Datalog engine handles this correctly but the OWL 2 RL
test harness expects a specific owl:equivalentClass entailment that our
inferencer does not currently emit.
Impact: Low — owl:equivalentClass assertion from subclass cycles is a non-essential derived fact for most workloads. Target: v0.49.0
eq-diff1 — owl:differentFrom combined with owl:sameAs
The test requires detecting inconsistency when a node is asserted both
owl:sameAs and owl:differentFrom another node and emitting the resulting
owl:Nothing entailment. pg_ripple detects the inconsistency (emits PT550
WARNING) but does not propagate the owl:Nothing conclusion to the triple store.
Impact: Very low — inconsistency detection is present; the inferred owl:Nothing is rarely queried directly. Target: v0.50.0
dt-type2 — xs:double precision rounding
The OWL 2 RL test for xs:double datatype entailment requires
"1.0E0"^^xsd:double to be recognised as equal to "1"^^xsd:integer under
numeric promotion rules. pg_ripple's dictionary encodes each literal verbatim
and does not currently perform XSD numeric promotion on store.
Impact: Low — affects only mixed-type numeric comparison assertions. Target: v0.51.0 (XSD numeric tower)
Pass Rate History
| Version | Passing / Total | Pass Rate |
|---|---|---|
| v0.46.0 | n/a (suite added) | — |
| v0.47.0 | 62 / 66 | 93.9% |
Running the Suite
# Requires: cargo pgrx start pg18 first
cargo pgrx regress pg18 -- tests/pg_regress/sql/owl2rl_*.sql
Or with the justfile recipe:
just test-regress
The known-failure list is maintained in tests/owl2rl/known_failures.txt.
Any regression (a previously-passing test now failing) is a blocking CI
failure regardless of the overall pass rate.
WatDiv Benchmark Results
WatDiv (Waterloo SPARQL Diversity Test Suite) tests pg_ripple's correctness and query performance under realistic data distributions.
What WatDiv tests
WatDiv generates a synthetic e-commerce dataset at configurable scale and defines 100 query templates across four structural classes, each exercising different join patterns:
| Class | Templates | What it stresses |
|---|---|---|
| Star (S1–S7) | 7 | Same subject, multiple predicates — VP table scan and star-join optimisation |
| Chain (C1–C3) | 3 | Linear predicate path — join ordering |
| Snowflake (F1–F5) | 5 | Star + chain hybrid — mixed join strategies |
| Complex (B1–B12, L1–L5) | 17 | Multi-hop with OPTIONAL and UNION — full algebra |
Correctness criterion
Each template is run against a 10M-triple dataset and the result row count is compared to a pre-computed baseline. A template passes when its row count is within ±0.1% of the baseline. Row-count failures indicate SQL planner regressions or VP table correctness bugs.
Performance criterion
Median query latency per template is recorded and compared to the previous release baseline. A regression > 20% triggers a CI warning (not a failure). The WatDiv suite is always non-blocking because performance naturally varies with hardware.
Running locally
# 1. Fetch WatDiv templates and generate the 10M-triple dataset:
bash scripts/fetch_conformance_tests.sh --watdiv
# 2. Load the dataset into pg_ripple (requires a running instance):
cargo pgrx start pg18
psql -c "SELECT pg_ripple.load_ntriples(pg_read_file('tests/watdiv/data/watdiv-10M.nt'), false);"
# 3. Run the suite:
cargo test --test watdiv_suite
CI job
The watdiv-suite CI job runs on every push to main and:
- Checks correctness (row count ±0.1% per template)
- Records per-template median latency
- Writes results to tests/conformance/report.json as a CI artifact
The job is non-blocking (performance regressions are warnings, not failures).
Results table (v0.46.0, 10M triples, 8-core CI runner)
Results are updated automatically on each release. The table below reflects
the v0.46.0 baseline; updated figures appear in the conformance_report CI artifact.
| Template | Class | Expected rows | Status |
|---|---|---|---|
| S1 | Star | — | — |
| S2 | Star | — | — |
| S3 | Star | — | — |
| S4 | Star | — | — |
| S5 | Star | — | — |
| S6 | Star | — | — |
| S7 | Star | — | — |
| C1 | Chain | — | — |
| C2 | Chain | — | — |
| C3 | Chain | — | — |
| F1 | Snowflake | — | — |
| F2 | Snowflake | — | — |
| F3 | Snowflake | — | — |
| F4 | Snowflake | — | — |
| F5 | Snowflake | — | — |
| B1–B12 | Complex | — | — |
| L1–L5 | Complex | — | — |
Note: Row counts and latency baselines are populated on first run against a freshly generated WatDiv 10M dataset. The — entries above are filled in from the tests/watdiv/baselines.json CI artifact after the first run.
Known limitations
- Templates that use %var% substitution markers require concrete IRI bindings sampled from the dataset. Templates without substitution markers run as-is.
- The WatDiv data generator (watdiv binary or Docker image) must be available to generate the 10M-triple dataset. CI uses the pre-cached artifact from the first successful run.
See also
- Running Conformance Tests — how to fetch data and run all suites
- W3C Conformance — W3C SPARQL 1.1 and Jena suite results
Running Conformance Tests
pg_ripple ships four complementary conformance suites that can be run locally or in CI. This page covers how to set up data, run each suite, and interpret results.
Prerequisites
- A working pg_ripple development environment
- cargo pgrx installed and initialised for PostgreSQL 18
- curl or wget for downloading test data
- Docker (optional) for generating the WatDiv dataset
One-command setup
Download the test data for all three downloadable suites (W3C, Jena, WatDiv) at once — the LUBM fixture ships with the repository:
bash scripts/fetch_conformance_tests.sh
Or fetch individual suites:
bash scripts/fetch_conformance_tests.sh --w3c # W3C SPARQL 1.1 only
bash scripts/fetch_conformance_tests.sh --jena # Apache Jena only
bash scripts/fetch_conformance_tests.sh --watdiv # WatDiv only
bash scripts/fetch_conformance_tests.sh --force # re-download everything
W3C SPARQL 1.1 suite
Data location
tests/w3c/data/ (default) or the directory in W3C_TEST_DIR.
Running
# Start pg_ripple
cargo pgrx start pg18
# Smoke subset (180 tests, ~30s — fastest feedback):
cargo test --test w3c_smoke -- --nocapture
# Full suite (3 000+ tests, ~2min with 8 threads):
W3C_THREADS=8 cargo test --test w3c_suite -- --nocapture
Known failures
Edit tests/conformance/known_failures.txt with lines prefixed w3c::
# Example — property-path regression, fix in progress
w3c:http://www.w3.org/2009/sparql/docs/tests/data-sparql11/property-path/manifest#pp35 pp inside GRAPH
Remove entries when the underlying bug is fixed.
Apache Jena suite
Data location
tests/jena/data/ (default) or the directory in JENA_TEST_DIR.
Running
# Download Jena test data (one-time):
bash scripts/fetch_conformance_tests.sh --jena
# Run the full suite (~1 000 tests, target < 3 minutes):
JENA_THREADS=8 cargo test --test jena_suite -- --nocapture
Coverage
Jena tests focus on implementation edge cases:
- Type coercion — XSD numeric promotions, mixed-type arithmetic
- Date/time — timezone-aware comparisons, YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ()
- Blank-node scoping — CONSTRUCT templates, GRAPH boundaries, OPTIONAL
- String functions — STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT()
- Numeric precision — xsd:decimal arithmetic, ROUND(), CEIL(), FLOOR(), ABS()
Known failures
Prefix entries with jena: in tests/conformance/known_failures.txt:
# Example — timezone-aware dateTime comparison
jena:http://jena.example.org/tests/sparql-query/manifest#dateTime-tz-offset TZ offset handling
The CI job is non-blocking until pass rate ≥ 95%.
WatDiv benchmark suite
Data location
- Templates: tests/watdiv/templates/ (or WATDIV_TEMPLATE_DIR)
- RDF data: tests/watdiv/data/ (or WATDIV_DATA_DIR)
- Baselines: tests/watdiv/baselines.json (or WATDIV_BASELINE_FILE)
Data generation
The WatDiv 10M-triple dataset is generated once and cached as a CI artifact.
# Using Docker:
docker run --rm dcslab/watdiv -s 1 -t 10000000 > tests/watdiv/data/watdiv-10M.nt
# Using a local binary:
WATDIV_BINARY=/usr/local/bin/watdiv bash scripts/fetch_conformance_tests.sh --watdiv
Loading the dataset
Before running the WatDiv suite, load the dataset into pg_ripple:
cargo pgrx start pg18
psql -d postgres -c "SELECT pg_ripple.load_ntriples(pg_read_file('tests/watdiv/data/watdiv-10M.nt'), false);"
Running
# Run all 100 templates (target < 5 min on 8-core runner):
WATDIV_THREADS=8 cargo test --test watdiv_suite -- --nocapture
Interpreting results
- Correctness pass: row count within ±0.1% of baseline
- Performance warning: median latency > 20% above baseline (non-blocking)
- Baselines: stored in tests/watdiv/baselines.json — update after intentional performance changes
Known failures
Prefix entries with watdiv: in tests/conformance/known_failures.txt:
# Example — complex template with OPTIONAL cardinality edge case
watdiv:B7 known cardinality mismatch with OPTIONAL
LUBM benchmark suite (v0.44.0+)
The LUBM (Lehigh University Benchmark) suite validates OWL RL inference correctness through 14 canonical SPARQL queries over a university-domain ontology.
Data location
The LUBM suite is self-contained — no download or external data generation is needed.
The synthetic fixture is bundled at tests/lubm/fixtures/univ1.ttl.
Running
# Start pg_ripple
cargo pgrx start pg18
# Run all 14 LUBM queries + Datalog validation sub-suite (< 30s):
cargo test --test lubm_suite -- --nocapture
What is tested
- 14 canonical queries (tests/lubm/queries/q01.sparql–q14.sparql) against the bundled univ1 fixture — exact row-count validation.
- OWL RL rule loading via pg_ripple.load_rules_builtin('owl-rl').
- Inference materialization via pg_ripple.infer('owl-rl') — verifies the fixpoint is reached in ≤ 10 iterations and completes in < 5 s.
- Goal queries via pg_ripple.infer_goal() — validates that inference engine results match SPARQL query results.
- Custom Datalog rules — defines ad-hoc rules on LUBM data and validates correctness.
Known failures
Prefix entries with lubm: in tests/conformance/known_failures.txt:
# Example — Q2 multi-hop join returns wrong count
lubm:Q2 multi-hop memberOf/subOrganizationOf join bug
Regenerating baselines
If the fixture is changed, regenerate the baseline counts:
cargo pgrx start pg18
# Run the suite once, observe the actual counts in the output,
# then update tests/lubm/baselines/univ1.json accordingly.
See also
- LUBM Results — full conformance table and Datalog sub-suite results
Unified report
All suites write results to tests/conformance/report.json:
{
"w3c": { "suite": "w3c", "total": 3100, "passed": 3097, "failed": 0, ... },
"jena": { "suite": "jena", "total": 1000, "passed": 983, "failed": 0, ... },
"watdiv": { "suite": "watdiv", "total": 100, "passed": 100, "failed": 0, ... }
}
This file is uploaded as the conformance_report CI artifact after each run.
(The LUBM suite writes pass/fail results to stdout; a JSON report artifact is planned for v0.45.0.)
Updating baselines
After intentional performance improvements, regenerate the WatDiv baselines:
# Run the suite to populate baselines.json:
cargo test --test watdiv_suite -- --nocapture
# Then commit the updated baselines.json.
Updating the known-failures manifest
The unified known-failures file lives at tests/conformance/known_failures.txt.
Format:
# Comment lines are ignored.
# Each entry: <suite>:<test-key> <optional reason>
w3c:http://... reason
jena:http://... reason
watdiv:S3 reason
lubm:Q2 reason
Any test listed here that unexpectedly passes (XPASS) triggers a CI notice to remove the entry.
See also
- W3C Conformance — per-category pass rates
- LUBM Results — OWL RL conformance table
- WatDiv Results — benchmark metrics and results table
Error Message Catalog
pg_ripple uses structured error codes in the range PT001–PT799, organized by subsystem. Error messages follow PostgreSQL conventions: lowercase first word, no trailing period.
Error codes appear in the DETAIL field of PostgreSQL error messages. Use \errverbose in psql to see the full error context including the code.
PT001–PT099: Dictionary
Errors from the IRI/literal/blank-node → integer encoding subsystem.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT001 | dictionary encode failed: hash collision detected | Two distinct terms produced the same XXH3-128 hash (extremely rare) | Report to maintainers with the two colliding terms |
| PT002 | dictionary decode failed: id not found | The integer ID does not exist in _pg_ripple.dictionary | Data may be corrupt; run pg_ripple.vacuum_dictionary() and check VP tables |
| PT003 | invalid term kind: expected 0 (IRI), 1 (literal), 2 (blank node) | Wrong kind integer passed to encode_term() | Use 0 for IRIs, 1 for literals, 2 for blank nodes |
| PT004 | quoted triple components not found | A quoted-triple ID references qt_s/qt_p/qt_o values that are missing from the dictionary | Re-load the RDF-star data; may indicate a partial load failure |
| PT005 | inline-encoded literal decode failed | Internal decoding error for small inline-encoded literals | Report to maintainers with the literal value |
| PT006 | dictionary batch insert failed | The ON CONFLICT DO NOTHING … RETURNING batch insert encountered an unexpected error | Check PostgreSQL logs for disk space or permission issues |
| PT007 | dictionary lookup: NULL term | A NULL value was passed where an IRI, literal, or blank node was expected | Ensure all arguments are non-NULL |
| PT008 | malformed IRI: <detail> | The IRI string does not conform to RFC 3987 | Fix the IRI syntax; IRIs must be wrapped in angle brackets <…> |
| PT009 | malformed literal: <detail> | The literal string cannot be parsed | Use N-Triples syntax: "value", "value"@lang, or "value"^^<datatype> |
| PT010 | malformed blank node: <detail> | The blank node label is invalid | Blank nodes must start with _: followed by a valid label |
| PT011 | dictionary cache full, eviction failed | The LRU cache could not evict entries | Increase pg_ripple.dictionary_cache_size |
| PT012 | prewarm_dictionary_hot: table not found | The dictionary table does not exist | Run CREATE EXTENSION pg_ripple first |
PT100–PT199: Storage
Errors from the VP table storage layer, HTAP partitions, and rare-predicate management.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT100 | insert_triple: predicate IRI required | The predicate argument is NULL or empty | Provide a valid predicate IRI |
| PT101 | VP table creation failed | DDL error when creating a new VP table | Check pg_log for the underlying PostgreSQL error |
| PT102 | htap_migrate_predicate: predicate not found | The predicate ID does not exist in _pg_ripple.predicates | Verify the predicate IRI and that triples exist for it |
| PT103 | merge: lock_timeout exceeded during main table swap | Another transaction held a lock on the VP table for too long | Retry; consider increasing lock_timeout for maintenance windows |
| PT104 | rare-predicate promotion failed | Error promoting a predicate from vp_rare to a dedicated VP table | Check disk space and user permissions |
| PT105 | delete_triple: predicate not found in catalog | The triple's predicate has no VP table | The predicate may never have been used, or was already compacted |
| PT106 | VP table not found: <table_name> | The VP table referenced in _pg_ripple.predicates does not exist on disk | Run pg_ripple.compact() to reconcile the catalog |
| PT107 | delta table insert failed | Error writing to the HTAP delta partition | Check PostgreSQL logs for tablespace or permission issues |
| PT108 | tombstone insert failed | Error recording a deletion in the tombstones table | Check PostgreSQL logs |
| PT109 | merge worker: unexpected state | The background merge worker encountered an inconsistent state | Restart PostgreSQL; check pg_log for crash details |
| PT110 | statement_id_seq: sequence exhausted | The global statement ID sequence has reached its maximum | This is unlikely with BIGINT; contact maintainers |
| PT111 | vp_rare: row limit exceeded | The rare-predicate table has too many rows for a single predicate | Manually promote with promote_rare_predicates() or lower pg_ripple.vp_promotion_threshold |
| PT112 | deduplicate: advisory lock not acquired | Another deduplication operation is already running | Wait and retry |
| PT113 | create_graph: invalid graph IRI | The graph IRI is malformed | Graph IRIs must be valid absolute IRIs in angle brackets |
| PT114 | drop_graph: graph not found | The named graph does not exist | Use list_graphs() to check available graphs |
PT200–PT299: SPARQL
Errors from the SPARQL parser, algebra optimizer, and SQL code generator.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT200 | SPARQL parse error: <detail> | The SPARQL query has a syntax error | Fix the syntax; use sparql_explain() to validate without executing |
| PT201 | unsupported SPARQL algebra node: <type> | The query uses a feature not yet implemented | Check the compliance matrix for supported features |
| PT202 | SPARQL SELECT: no projected variables | The SELECT clause has no variables | Add at least one ?variable to the SELECT clause |
| PT203 | property path depth exceeded max_path_depth | A recursive property path exceeded the configured depth limit | Increase pg_ripple.max_path_depth or simplify the path expression |
| PT204 | SPARQL federated SERVICE: endpoint not reachable | The remote SPARQL endpoint did not respond | Check the endpoint URL and network connectivity |
| PT205 | SPARQL VALUES clause: column count mismatch | The number of values in a VALUES row does not match the variable list | Ensure each VALUES row has the same number of columns as variables |
| PT206 | SPARQL type error: <detail> | A type mismatch in a FILTER expression | Check operand types; e.g., comparing a string to an integer |
| PT207 | SPARQL CONSTRUCT: template variable not in WHERE | A variable in the CONSTRUCT template is not bound in the WHERE clause | Bind all template variables in the WHERE clause |
| PT208 | SPARQL DESCRIBE: no resource specified | DESCRIBE requires at least one resource or variable | Add a resource IRI or variable to the DESCRIBE clause |
| PT209 | SPARQL aggregate: variable not grouped | A non-aggregated variable is used outside GROUP BY | Add the variable to GROUP BY or wrap it in an aggregate function |
| PT210 | SPARQL HAVING: refers to non-aggregate | The HAVING clause references a variable that is not an aggregate result | Use an aggregate function in HAVING |
| PT211 | generated SQL execution failed: <detail> | The SQL generated from SPARQL failed to execute | Check pg_log for the underlying error; report if reproducible |
| PT212 | plan cache: entry evicted during execution | A cached plan was evicted while the query was still running | Increase pg_ripple.plan_cache_size |
| PT213 | SPARQL SERVICE: response parse error | The federated endpoint returned malformed results | Check the remote endpoint's response format |
| PT214 | SPARQL SERVICE: timeout after <N> ms | The federated request exceeded pg_ripple.federation_timeout | Increase the timeout or simplify the SERVICE query |
| PT215 | SPARQL UPDATE parse error: <detail> | The SPARQL Update statement has a syntax error | Fix the syntax |
| PT216 | SPARQL UPDATE: LOAD failed for <url> | The LOAD operation could not retrieve the remote resource | Check URL, network, and pg_ripple.federation_timeout |
| PT217 | SPARQL UPDATE: unsupported content type <type> | The LOAD target serves an unrecognized RDF format | The URL must serve Turtle, N-Triples, N-Quads, TriG, or RDF/XML |
| PT218 | SPARQL UPDATE: CREATE GRAPH already exists | The graph already exists | Use CREATE SILENT GRAPH to suppress this error |
| PT219 | SPARQL UPDATE: DROP GRAPH not found | The graph does not exist | Use DROP SILENT GRAPH to suppress this error |
PT300–PT399: SHACL
Errors from the SHACL shapes loader, validator, and async monitoring pipeline.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT300 | SHACL parse error: <detail> | The Turtle-encoded SHACL shapes have a syntax error | Fix the Turtle syntax in the shapes definition |
| PT301 | SHACL sync validation failed: <shape> — <message> | A triple violates a SHACL constraint during synchronous validation | Fix the data to conform to the shape, or modify the shape |
| PT302 | SHACL shape not found: <iri> | The referenced shape has not been loaded | Load the shape with load_shacl() first |
| PT303 | SHACL DAG monitor: pg_trickle not installed | DAG-aware monitors require the pg_trickle extension | Install pg_trickle or use enable_shacl_monitors() for trigger-based validation |
| PT304 | SHACL: unsupported constraint component <type> | The shape uses a SHACL-AF or SHACL-JS constraint | Only SHACL Core constraints are supported |
| PT305 | SHACL: sh:path too complex | The property path in the shape exceeds supported complexity | Simplify the sh:path expression |
| PT306 | SHACL: validation queue overflow | The async validation queue has exceeded its capacity | Process the queue with process_validation_queue() or increase the queue size |
| PT307 | SHACL: dead letter queue threshold reached | Too many validation failures have accumulated | Inspect with dead_letter_queue() and address the failures |
| PT308 | SHACL: sh:targetClass not found | The target class IRI is not present in the data | Load data with the target class, or fix the class IRI |
| PT309 | SHACL: circular shape reference | A shape references itself through sh:node or sh:qualifiedValueShape | Break the circular reference |
| PT310 | SHACL: drop_shape: shape has active monitors | Cannot drop a shape that has active monitors | Disable monitors first with disable_shacl_dag_monitors() |
PT400–PT499: Datalog — Rules
Errors from the Datalog rule parser, stratifier, and rule management.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT400 | rule parse error: <detail> | The Datalog rule has a syntax error | Fix the rule syntax; see Reasoning and Inference for syntax reference |
| PT401 | rule stratification failed: unstratifiable program | The rule set contains a cycle through negation that prevents stratification | Rewrite rules to break the negation cycle, or use infer_wfs() for well-founded semantics |
| PT402 | rule set not found: <name> | The referenced rule set has not been loaded | Load it with load_rules() or load_rules_builtin() |
| PT403 | inference: maximum iteration depth exceeded | Semi-naive evaluation did not converge within the iteration limit | Simplify the rule set or increase statement_timeout |
| PT404 | constraint violation detected: <rule> | A constraint rule (:- body.) fired | Check the data against the constraint body |
| PT405 | rule set already exists: <name> | A rule set with this name is already loaded | Drop it first with drop_rules(), or choose a different name |
| PT406 | rule: unsafe variable <var> | A variable appears in the head but not in a positive body literal | Ensure every head variable also appears in a positive body literal |
| PT407 | rule: built-in predicate not recognized: <name> | An unknown built-in predicate was used | Check available built-ins: =, !=, <, >, <=, >=, +, -, *, / |
| PT408 | rule: aggregation variable not in group-by | An aggregated variable is used outside the grouping context | Add the variable to the group-by list |
PT500–PT599: Datalog — Inference Engine
Errors from the materialization engine, magic sets optimizer, WFS evaluator, and tabling.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT500 | infer: no enabled rule sets | infer() was called with no rule sets enabled | Enable at least one rule set with enable_rule_set() |
| PT501 | infer: SPI execution failed during iteration <N> | The SQL generated for a rule body failed | Check pg_log for the underlying error |
| PT502 | infer_demand: magic set rewriting failed | The demand transformation could not be applied | Simplify the goal pattern or rule set |
| PT503 | infer_demand: goal pattern too broad | The goal has no bound arguments, defeating the purpose of demand-driven evaluation | Bind at least one argument in the goal |
| PT504 | infer_wfs: unfounded set computation exceeded limit | The well-founded semantics fixpoint did not converge | Simplify the rule set or check for unusual negation patterns |
| PT505 | infer_wfs: three-valued model contains undefined atoms | Some atoms could not be classified as true or false | This is expected in WFS; query the undefined result set to see which atoms |
| PT506 | tabling: memo store overflow | The tabling memo store exceeded its size limit | Increase pg_ripple.tabling_memo_size |
| PT507 | infer_agg: aggregation cycle detected | An aggregation rule depends on its own aggregate result | Rewrite to break the cycle |
| PT508 | infer_goal: predicate not in any rule set | The goal predicate is not defined by any loaded rule | Load a rule set that defines the predicate |
| PT509 | owl:sameAs canonicalization: cycle limit exceeded | The owl:sameAs equivalence class merging exceeded the iteration limit | Check for very large owl:sameAs clusters |
| PT520 | infer_wfs: iteration cap reached (<N> iterations) | The WFS alternating fixpoint did not converge within pg_ripple.wfs_max_iterations | Emitted as WARNING; partial result is returned with "stratifiable": false; increase the cap or simplify the rule set |
| PT540 | lattice: fixpoint did not converge after <N> iterations | The lattice fixpoint did not stabilise within pg_ripple.lattice_max_iterations | Increase pg_ripple.lattice_max_iterations or verify that the join function is monotone |
| PT541 | lattice: join_fn <name> could not be resolved | The user-supplied join function name could not be resolved via regprocedure | Check the function name, schema, and argument types; use a fully-qualified name |
| PT542 | federation: result decoder received unparseable XML/JSON | The SPARQL results response from a remote SERVICE endpoint could not be parsed | Check the endpoint's response format; ensure it returns application/sparql-results+xml or +json |
PT600–PT699: Export / HTTP
Errors from export serializers, GraphRAG export, and the HTTP companion service.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT600 | export: serialization failed for triple <sid> | A triple could not be serialized to the target format | Check that the triple's dictionary entries are intact |
| PT601 | export: unsupported format <format> | An unrecognized export format was requested | Use ntriples, nquads, turtle, or jsonld |
| PT602 | export_turtle_stream: batch_size must be > 0 | Invalid batch size | Use a positive integer |
| PT603 | export_jsonld: framing failed | The JSON-LD framing algorithm encountered an error | Check the frame structure; see JSON-LD Framing |
| PT604 | export_graphrag_entities: no entities found | No entities match the GraphRAG export criteria | Load data or adjust the GraphRAG ontology |
| PT605 | jsonld_frame_to_sparql: invalid frame | The JSON-LD frame could not be converted to SPARQL | Check the frame JSON structure |
| PT606 | export: streaming interrupted | The streaming export was cancelled or the client disconnected | Retry the export |
PT700–PT799: Configuration / Startup
Errors from extension initialization, GUC validation, and background workers.
| Code | Message | Cause | Fix |
|---|---|---|---|
| PT700 | _PG_init: cache_budget exceeds shared_memory_size | pg_ripple.cache_budget is larger than pg_ripple.shared_memory_size | Reduce cache_budget or increase shared_memory_size |
| PT701 | _PG_init: shmem initialization failed | Shared memory allocation failed | Increase system shared memory (kern.sysv.shmmax on macOS, kernel.shmmax on Linux) |
| PT702 | worker_database not set; merge worker defaulting to 'postgres' | The pg_ripple.worker_database GUC is not set | Set it to the correct database name in postgresql.conf |
| PT703 | merge worker watchdog: worker has been silent for <N> seconds | The background merge worker may have crashed | Check pg_log for crash details; restart PostgreSQL |
| PT704 | extension version mismatch: binary <v1>, control <v2> | The compiled extension version does not match pg_ripple.control | Rebuild and reinstall the extension |
| PT705 | GUC validation: <param> out of range | A GUC parameter was set to an invalid value | Check the GUC Reference for valid ranges |
| PT706 | shared_preload_libraries: pg_ripple not loaded | pg_ripple is not in shared_preload_libraries | Add pg_ripple to shared_preload_libraries in postgresql.conf and restart |
| PT707 | pg_trickle not installed | A feature requiring pg_trickle was called | Install pg_trickle or use the non-trickle alternative |
| PT708 | pgvector not installed | A vector/embedding function was called without pgvector | Install pgvector or disable with pg_ripple.pgvector_enabled = off |
| PT709 | enable_graph_rls: RLS policy creation failed | Row-level security policy could not be created | Check superuser privileges |
| PT710 | grant_graph: invalid permission | Permission must be 'read', 'write', or 'admin' | Use one of the three valid permission strings |
If you encounter an error code not listed here, or a message that says "contact maintainers", please open a GitHub issue with the full error output, your pg_ripple version (SELECT pg_ripple.canary()), and a minimal reproducer.
Lattice-Based Datalog Reference (v0.36.0)
Available since v0.36.0. Lattice-Based Datalog (Datalog^L) extends pg_ripple's Datalog engine with monotone lattice aggregation, enabling recursive aggregation without stratification constraints.
Background
Standard Datalog^agg stratifies aggregate functions: an aggregate can only appear at a strictly higher stratum than the predicate it aggregates over. This makes recursive aggregation (e.g., propagating minimum trust scores through a social graph) impossible to express without manual loop unrolling.
Lattice-Based Datalog lifts this restriction by requiring only that the aggregation operation is monotone with respect to a user-supplied lattice. A lattice is an algebraic structure (L, ⊔) where ⊔ is a commutative, associative, idempotent join operation with a bottom element ⊥. Fixpoint computation over a lattice terminates by the ascending chain condition — the lattice has no infinite strictly ascending chains.
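As an illustrative sanity check (not a pg_ripple feature), the lattice laws can be verified directly in SQL for the min lattice described below, with LEAST standing in for ⊔ and i64::MAX for ⊥:
-- All four columns return true.
SELECT LEAST(3, 5) = LEAST(5, 3)                     AS commutative,
       LEAST(LEAST(1, 4), 9) = LEAST(1, LEAST(4, 9)) AS associative,
       LEAST(7, 7) = 7                               AS idempotent,
       LEAST(7, 9223372036854775807) = 7             AS bottom_is_identity;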
Key references
- Abo Khamis et al., PODS 2017 — lattice-structured aggregation in Datalog
- Alvaro et al., CIDR 2011 — monotone logic programming (Bloom^L)
- Green et al., PODS 2007 — provenance semirings as a generalization of lattices
Built-in lattices
pg_ripple ships with four built-in lattice types that cover the most common use cases:
MinLattice (min)
join: LEAST(a, b) (PostgreSQL: min aggregate)
bottom: +∞ (encoded as 9223372036854775807 = i64::MAX)
Use cases: trust propagation, shortest-path weights, minimum-cost routing.
Example: propagate the minimum trust score along a path — the trustworthiness of a chain is limited by its weakest link.
MaxLattice (max)
join: GREATEST(a, b) (PostgreSQL: max aggregate)
bottom: −∞ (encoded as -9223372036854775808 = i64::MIN)
Use cases: reachability weights, longest-path annotation, maximum influence scores.
SetLattice (set)
join: UNION (array deduplication via array_agg)
bottom: {} (empty set)
Use cases: set-valued provenance annotation, multi-hop neighbourhood sets.
IntervalLattice (interval)
join: interval hull (min of lower bounds, max of upper bounds)
bottom: empty interval (0)
Use cases: temporal reasoning, numeric range propagation.
User-defined lattices
Register a custom lattice with any PostgreSQL aggregate function as the join:
-- Minimum-cost routing over decimal weights.
SELECT pg_ripple.create_lattice('route_cost', 'min', '1e308');
-- Custom bounded lattice (values 0–100, join = LEAST).
SELECT pg_ripple.create_lattice('reputation', 'min', '100');
The join_fn must be a registered PostgreSQL aggregate. Since v0.45.0 the name is resolved via regprocedure at registration time: unresolvable names raise PT541, and resolvable names are stored in fully-qualified form to prevent search-path injection.
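Any aggregate that is commutative, associative, and idempotent qualifies. For example, PostgreSQL's built-in bit_or aggregate satisfies all three laws with bottom element 0, so it forms a valid lattice over capability bitmasks (the lattice name below is illustrative):
-- Bitwise-OR lattice: derived values accumulate capability bits monotonically.
SELECT pg_ripple.create_lattice('capabilities', 'bit_or', '0');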
GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.lattice_max_iterations | integer | 1000 | Maximum fixpoint iterations; when exceeded, PT540 is emitted as a WARNING and a partial result is returned. Set to 0 for unlimited (not recommended). |
-- Change the iteration limit.
SET pg_ripple.lattice_max_iterations = 5000;
-- Check current setting.
SHOW pg_ripple.lattice_max_iterations;
SQL Functions
pg_ripple.create_lattice(name, join_fn, bottom) → boolean
Register a new lattice type in the _pg_ripple.lattice_types catalog.
| Parameter | Type | Description |
|---|---|---|
| name | text | Unique lattice name (case-sensitive) |
| join_fn | text | PostgreSQL aggregate function name |
| bottom | text | Bottom element as a text string |
Returns true if newly registered, false if the name already exists (idempotent).
SELECT pg_ripple.create_lattice('trust', 'min', '100'); -- true
SELECT pg_ripple.create_lattice('trust', 'min', '100'); -- false (idempotent)
pg_ripple.list_lattices() → jsonb
Return a JSON array of all registered lattice types (built-in and user-defined).
SELECT jsonb_pretty(pg_ripple.list_lattices());
Each entry has: name, join_fn, bottom, builtin.
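Since the result is ordinary JSONB, it can be filtered with standard operators — for example, to list only user-defined lattices:
SELECT e ->> 'name' AS name, e ->> 'join_fn' AS join_fn, e ->> 'bottom' AS bottom
FROM jsonb_array_elements(pg_ripple.list_lattices()) AS e
WHERE NOT (e ->> 'builtin')::boolean;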
pg_ripple.infer_lattice(rule_set, lattice_name) → jsonb
Run a monotone fixpoint over all active rules in rule_set using the specified lattice.
| Parameter | Default | Description |
|---|---|---|
| rule_set | 'custom' | Rule set name as used in load_rules() |
| lattice_name | 'min' | Lattice type to use for head-predicate joins |
Returns JSONB:
{
"derived": 42,
"iterations": 5,
"lattice": "min",
"rule_set": "my_rules"
}
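A typical invocation, pretty-printed:
SELECT jsonb_pretty(pg_ripple.infer_lattice('my_rules', 'min'));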
Errors:
- infer_lattice: unknown lattice type '...' — lattice not registered; call create_lattice() first.
- PT540 WARNING — fixpoint did not converge within lattice_max_iterations.
Catalog table
Lattice types are stored in _pg_ripple.lattice_types:
SELECT * FROM _pg_ripple.lattice_types;
| Column | Type | Description |
|---|---|---|
| name | text | Primary key; lattice identifier |
| join_fn | text | PostgreSQL aggregate name |
| bottom | text | Bottom element as text |
| builtin | boolean | True for pre-registered lattices |
| created_at | timestamptz | Registration timestamp |
Complete example: Trust propagation
This example propagates minimum trust scores through a social graph. The trustworthiness of an indirect connection is bounded by the weakest link on the path.
-- 1. Create extension and configure lattice.
CREATE EXTENSION IF NOT EXISTS pg_ripple;
SELECT pg_ripple.create_lattice('trust', 'min', '100');
-- 2. Insert direct trust relationships (score: 0=no trust, 100=full trust).
SELECT pg_ripple.load_ntriples($$
<https://trust.example/alice> <https://trust.example/directTrust> "90"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://trust.example/bob> <https://trust.example/directTrust> "70"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://trust.example/carol> <https://trust.example/directTrust> "85"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://trust.example/alice> <https://trust.example/knows> <https://trust.example/bob> .
<https://trust.example/bob> <https://trust.example/knows> <https://trust.example/carol> .
$$);
-- 3. Write trust-propagation rules. Every head variable is bound in a
--    positive body literal (required; see PT406), and the head predicate's
--    values are combined with the 'trust' min lattice, so each node keeps
--    the minimum (weakest-link) score over all derivation paths.
SELECT pg_ripple.load_rules($$
?y <https://trust.example/transitTrust> ?t :-
  ?x <https://trust.example/knows> ?y ,
  ?x <https://trust.example/directTrust> ?t .
?z <https://trust.example/transitTrust> ?t :-
  ?y <https://trust.example/knows> ?z ,
  ?y <https://trust.example/transitTrust> ?t .
$$, 'trust_rules');
-- 4. Run lattice-based fixpoint.
SELECT pg_ripple.infer_lattice('trust_rules', 'trust');
-- 5. Query propagated trust values.
SELECT * FROM pg_ripple.sparql($$
SELECT ?x ?t WHERE { ?x <https://trust.example/transitTrust> ?t }
$$);
Error code PT540
Meaning: the lattice fixpoint did not converge within the configured iteration limit.
Trigger: emitted as a PostgreSQL WARNING (not ERROR) when pg_ripple.lattice_max_iterations is exceeded.
Resolution options:
- Increase the limit: SET pg_ripple.lattice_max_iterations = 10000;
- Verify your lattice is finite: every value domain used in rules must have a finite number of distinct elements reachable from the bottom.
- Verify monotonicity: every operation in rule bodies must be monotone with respect to the lattice order. A non-monotone operation (e.g., negation) in a recursive rule violates the convergence guarantee.
Relationship to other pg_ripple inference modes
| Feature | Stratum requirement | Aggregation | Recursion |
|---|---|---|---|
| infer() — standard Datalog | Stratified | Not supported | Restricted |
| infer_wfs() — Well-Founded Semantics | None | Not supported | Full |
| infer_lattice() — Datalog^L | None | Monotone lattice joins | Full |
Use infer_lattice() when you need recursive aggregation with a convergence guarantee, for example: shortest paths, trust propagation, or set-reachability annotations.
FAQ
General
Why VP tables instead of one big triple table?
A single (s, p, o, g) table with 100M triples needs wide multi-column B-tree indexes to serve predicate-specific queries, and every one of those indexes spans all predicates — so even a query filtered to a single predicate traverses index regions dominated by rows it will never return.
Vertical Partitioning (one table per predicate) means a query for <ex:knows> triples only scans the vp_{knows_id} table — typically a fraction of the total data. The two B-tree indexes on (s, o) and (o, s) are small and cache-friendly. SPARQL star-patterns (same subject, multiple predicates) become simple multi-way joins between small tables.
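As a sketch, a two-predicate star pattern compiles to a join between two small VP tables rather than two self-joins over one giant table. The table IDs and the dictionary ID below are hypothetical:
-- ?person foaf:knows ?friend . ?person foaf:name "Alice" .
-- vp_101 = foaf:knows, vp_102 = foaf:name (hypothetical VP table IDs);
-- 42 = dictionary ID of the literal "Alice" (hypothetical).
SELECT k.s AS person, k.o AS friend
FROM _pg_ripple.vp_101 k
JOIN _pg_ripple.vp_102 n ON n.s = k.s
WHERE n.o = 42;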
Why PostgreSQL 18?
pg_ripple uses the CYCLE clause in WITH RECURSIVE CTEs for hash-based cycle detection in property path queries. The CYCLE clause was introduced in PostgreSQL 14 but the hash-based variant (as opposed to array-based) first became performant in PG 17/18. PG 18 is also the first version where pgrx 0.17 has stable support.
Is pg_ripple compatible with LPG tools?
Not yet. A Cypher/GQL compatibility layer is on the post-1.0 roadmap. The VP storage structure is architecturally aligned with LPG — each VP table is a property edge type — so the mapping will be natural.
What RDF formats does pg_ripple support?
Import (loading):
- N-Triples and N-Triples-star (load_ntriples)
- N-Quads (load_nquads)
- Turtle and Turtle-star (load_turtle)
- TriG (load_trig)
- RDF/XML (load_rdfxml, v0.9.0)
Export:
- N-Triples (export_ntriples)
- N-Quads (export_nquads)
- Turtle (export_turtle, v0.9.0) — including Turtle-star for RDF-star data
- JSON-LD expanded form (export_jsonld, v0.9.0)
- Streaming Turtle or JSON-LD for large graphs (export_turtle_stream, export_jsonld_stream, v0.9.0)
SPARQL CONSTRUCT and DESCRIBE results can be serialized directly to Turtle or JSON-LD via sparql_construct_turtle, sparql_construct_jsonld, sparql_describe_turtle, and sparql_describe_jsonld (v0.9.0).
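For example, a CONSTRUCT result can be serialized to Turtle in one call. The query shape is illustrative; the single query-text argument mirrors the sparql_construct_jsonld() call shown below:
SELECT pg_ripple.sparql_construct_turtle('
CONSTRUCT { ?s ?p ?o }
WHERE { ?s ?p ?o }
LIMIT 100
');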
Can I use pg_ripple with JSON-LD for REST APIs?
Yes. Use export_jsonld() or sparql_construct_jsonld() to produce JSON-LD responses:
-- Full graph as JSON-LD
SELECT pg_ripple.export_jsonld('https://myapp.example.org/graph/users');
-- SPARQL-driven selection as JSON-LD
SELECT pg_ripple.sparql_construct_jsonld('
CONSTRUCT { ?s ?p ?o }
WHERE { ?s a <https://schema.org/Person> ; ?p ?o }
');
The output is JSON-LD in expanded form — each subject is one array entry with IRI keys and typed value arrays.
SPARQL
What SPARQL 1.1 features are supported?
As of v0.19.0, the full SPARQL 1.1 specification is implemented:
Query forms: SELECT, ASK, CONSTRUCT, DESCRIBE
Graph patterns: BGP, OPTIONAL (LeftJoin), UNION, MINUS, FILTER, BIND, VALUES, Named graphs via GRAPH
Property paths: +, *, ?, / (sequence), | (alternative), ^ (inverse)
Aggregates: GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT
Modifiers: DISTINCT, ORDER BY, LIMIT, OFFSET, subqueries
Update: INSERT DATA, DELETE DATA, DELETE/INSERT WHERE, LOAD, CLEAR, DROP, CREATE, COPY, MOVE, ADD
Federation: SERVICE <url> { … } with SSRF allowlist, SERVICE SILENT, connection pooling, result caching, adaptive timeouts, batch SERVICE detection
Does pg_ripple support SPARQL 1.1 property paths?
Yes, as of v0.5.0. All standard path operators are supported: +, *, ?, / (sequence), | (alternative), ^ (inverse). Negated property sets !(p1|p2) are partially supported via vp_rare.
Property path queries compile to WITH RECURSIVE CTEs with PostgreSQL 18's CYCLE clause for hash-based cycle detection.
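A rough sketch of the generated SQL for a one-or-more path like foaf:knows+ (the VP table ID is hypothetical, and the real generator's output differs in detail):
WITH RECURSIVE hops(s, o) AS (
    SELECT s, o FROM _pg_ripple.vp_101            -- one hop of foaf:knows
  UNION ALL
    SELECT h.s, v.o
    FROM hops h
    JOIN _pg_ripple.vp_101 v ON v.s = h.o         -- extend the path by one edge
) CYCLE o SET is_cycle USING path                 -- PostgreSQL stops re-expanding cycles
SELECT DISTINCT s, o FROM hops;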
What is the maximum traversal depth for property paths?
Controlled by the pg_ripple.max_path_depth GUC (default: 100). Set it lower to prevent runaway queries on dense graphs:
SET pg_ripple.max_path_depth = 10;
Why does my FILTER not match a number?
SPARQL FILTER comparisons on numeric literals (FILTER(?age >= 18)) require the literal to be typed with an XSD numeric type:
"18"^^<http://www.w3.org/2001/XMLSchema#integer>
Plain string literals like "18" are compared as strings. Use typed literals when inserting numeric data, or cast in the FILTER expression.
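For example (prefix and data illustrative):
-- Insert a typed literal, then filter on it numerically.
SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:alice ex:age "18"^^xsd:integer .
');
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <http://example.org/>
SELECT ?s WHERE { ?s ex:age ?age . FILTER(?age >= 18) }
');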
Data modeling
What's the difference between a named graph and a blank node?
A named graph is a set of triples identified by an IRI. It is used for partitioning data by source, time, or topic. You can query across all named graphs, query within a specific graph, or count triples per graph.
A blank node is a resource without a global IRI identity — it has identity only within a document load scope. Blank nodes are used for anonymous resources (e.g. intermediate nodes in a structure) that don't need a stable identifier.
What is an RDF-star quoted triple?
A quoted triple << s p o >> is a triple that can appear in subject or object position in another triple. It enables statements about triples — useful for provenance (<< alice knows bob >> :assertedBy :carol), temporal annotations, and confidence scores.
pg_ripple stores quoted triples as dictionary entries of kind = 5. See RDF-star for details.
Performance
How fast is bulk load?
On a modern server with an NVMe SSD, load_ntriples() processes approximately 50,000–150,000 triples per second (single connection, default settings). Performance depends on predicate diversity (more unique predicates → more VP tables created), hardware, and PostgreSQL configuration.
When should I use SPARQL vs find_triples?
find_triples() only matches a single (s, p, o, g) pattern — it is equivalent to a SPARQL BGP with exactly one triple pattern. Use it for single-pattern lookups.
Use sparql() for anything more complex: multi-pattern joins, OPTIONAL, FILTER, aggregates, property paths, or when you want the ergonomics of SPARQL's variable-binding model.
HTAP & Operations (v0.6.0)
Does pg_ripple require shared_preload_libraries?
For full HTAP functionality (background merge worker, latch-poke hook, shared-memory statistics) you must add pg_ripple to shared_preload_libraries:
shared_preload_libraries = 'pg_ripple'
Without this, the extension still works for reads and writes — but all writes stay in delta tables and are never automatically merged into main. Queries on predicates with large deltas will be slower than expected.
See the Pre-Deployment Checklist for the complete setup sequence.
What is the difference between compact() and the merge worker?
| | compact() | Merge worker |
|---|---|---|
| Trigger | Manual SQL call | Automatic (latch poke or timer) |
| Blocks caller | Yes | No — runs in background |
| When to use | Maintenance windows, tests | Production continuous operation |
Both produce the same result: delta rows are moved into main, tombstones are cleared, and a fresh BRIN index is built.
How do I know if the merge worker is keeping up?
-- Check unmerged row count
SELECT pg_ripple.stats() -> 'unmerged_delta_rows';
-- Watch it over time (in psql; \watch re-runs the query every 5 seconds)
SELECT clock_timestamp() AS sampled_at,
       (pg_ripple.stats() ->> 'unmerged_delta_rows')::bigint AS lag;
\watch 5
A healthy deployment shows unmerged_delta_rows rising during writes and falling after merges. If it only rises, the worker is behind — lower merge_threshold or increase server I/O capacity.
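A sketch of lowering the threshold, assuming it is exposed as a pg_ripple.merge_threshold GUC (check the GUC Reference for the exact parameter name):
-- Hypothetical GUC name; verify against the GUC Reference.
ALTER SYSTEM SET pg_ripple.merge_threshold = 50000;
SELECT pg_reload_conf();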
Can I subscribe to triple changes in real time?
Yes. CDC (Change Data Capture) is available in v0.6.0 via PostgreSQL NOTIFY:
-- Subscribe to a specific predicate
SELECT pg_ripple.subscribe('<https://schema.org/name>', 'name_changes');
-- In another session
LISTEN name_changes;
-- Notifications arrive when triples are inserted or deleted
SELECT pg_ripple.insert_triple(
'<https://example.org/Alice>',
'<https://schema.org/name>',
'"Alice"'
);
Subscriptions are stored in _pg_ripple.cdc_subscriptions and persist across reconnects (but must be re-registered after a server restart). See the Administration reference for details.
Why does my query not see recently inserted triples?
If you inserted triples and immediately queried with SPARQL, the results should include those triples — delta tables are always queried alongside main tables.
If triples are missing, check:
- The triple was committed (not inside an uncommitted transaction)
- The correct graph is being queried (default graph vs named graph)
- The correct predicate IRI spelling was used
What is the HTTP endpoint URL?
The pg_ripple_http companion service listens on http://localhost:7878/sparql by default. Configure the port with PG_RIPPLE_HTTP_PORT. The URL accepts both GET and POST SPARQL requests per the W3C SPARQL 1.1 Protocol.
How do I connect SPARQL tools to pg_ripple?
Start pg_ripple_http alongside your PostgreSQL instance. Point any SPARQL client (YASGUI, Protégé, SPARQLWrapper, Jena) to http://localhost:7878/sparql. The endpoint supports standard content negotiation (Accept: application/sparql-results+json, text/turtle, etc.).
Can I run pg_ripple_http inside Docker?
Yes. The Docker image bundles both PostgreSQL and pg_ripple_http. Use docker compose up with the provided docker-compose.yml to start both services. The SPARQL endpoint is exposed on port 7878 by default.
JSON-LD Framing (v0.17.0)
What is JSON-LD Framing and how is it different from plain JSON-LD export?
Plain JSON-LD export (export_jsonld) serializes every triple in the graph as a flat list of node objects. JSON-LD Framing lets you specify the desired output shape — which types to select, which properties to include, and how to nest related nodes — using a frame document. The result is a nested, structured JSON-LD document suitable for serving directly from a REST API.
The key difference in performance: framing reads only the VP tables touched by the frame. A frame targeting 3 predicates on a graph with 10,000 predicates reads 3 VP tables, not 10,000.
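A minimal frame, as a sketch: it selects schema:Person nodes and embeds each person's schema:knows neighbours once. The single-argument call to jsonld_frame() is an assumption here; see the JSON-LD Framing guide for the exact signature.
-- Sketch only; frame vocabulary per W3C JSON-LD Framing.
SELECT pg_ripple.jsonld_frame('{
  "@context": { "schema": "https://schema.org/" },
  "@type": "schema:Person",
  "schema:knows": { "@embed": "@once" }
}');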
Which W3C framing features are supported?
pg_ripple v0.17.0 supports: @type matching, @id matching, property wildcards {}, absent-property patterns [], @reverse, @embed (@once/@always/@never), @explicit, @omitDefault, @default, @requireAll, @context compaction, named graph @graph scoping, and @omitGraph.
Value pattern matching (@value/@language/@type inside value objects) is deferred to a future release.
What is value pattern matching and why is it deferred?
Value pattern matching would allow frames like {"ex:name": {"@language": "en"}} to select only English-language name literals. Implementing this correctly requires a full-graph scan to find matching literals — it cannot be done efficiently with the VP table join model. It is deferred until a targeted literal index is available.
What is the difference between framing views and SPARQL views?
SPARQL views (create_sparql_view) store raw SPARQL SELECT results as integer ID columns in a stream table. Framing views (create_framing_view) run the full embedding and compaction pipeline over CONSTRUCT results, so each row in the stream table contains a ready-to-serve nested JSON-LD document rather than raw projection values.
Use SPARQL views when you need low-level access to result bindings; use framing views when you want ready-to-serve nested JSON-LD for an API.
Vector Federation (v0.28.0)
How does vector federation work?
After registering an external endpoint with pg_ripple.register_vector_endpoint(url, api_type), pg_ripple can route similarity queries to Weaviate, Qdrant, Pinecone, or a remote pgvector instance. The results are merged with local triple store data using Reciprocal Rank Fusion inside hybrid_search().
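For example, registering a Qdrant endpoint (the URL is illustrative; the signature is as given above):
SELECT pg_ripple.register_vector_endpoint('http://qdrant.internal:6333', 'qdrant');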
How do I prevent SSRF attacks when using vector federation?
pg_ripple does not restrict which URLs can be registered. You should use network policies (e.g., Kubernetes NetworkPolicy, AWS security groups) to restrict which external hosts your PostgreSQL server can reach. Only register endpoints that belong to trusted vector services in your infrastructure.
Why does my federated query time out?
The default timeout is 5000 ms. Increase it with:
SET pg_ripple.vector_federation_timeout_ms = 30000;
Or configure it globally:
ALTER SYSTEM SET pg_ripple.vector_federation_timeout_ms = 30000;
SELECT pg_reload_conf();
How do I configure a remote endpoint's API key?
pg_ripple does not store API keys for external vector services. Pass the API key in the endpoint URL if the service supports it, or configure it via environment variables in your application layer before calling the endpoint.
Glossary
Plain-language definitions of terms used throughout the pg_ripple documentation.
Blank node
An anonymous node in an RDF graph — it has no IRI. Used when the identity of a resource does not matter, only its connections. Written as _:label in N-Triples/Turtle. Internally stored as a dictionary-encoded BIGINT like any other term.
CDC (Change Data Capture)
A mechanism for subscribing to insert and delete events on the triple store. pg_ripple exposes CDC via subscribe() and unsubscribe(), backed by PostgreSQL LISTEN/NOTIFY.
Dictionary encoding
The process of mapping every IRI, blank node, and literal to a unique BIGINT (i64) integer using an XXH3-128 hash. All VP tables store only integer IDs, never raw strings. This makes joins fast and storage compact.
Embedding
A fixed-length numeric vector (typically 256–1536 dimensions) representing the semantic meaning of an entity or text. pg_ripple stores embeddings via pgvector and uses them for similarity search and RAG retrieval.
Federation
Distributing a SPARQL query across multiple endpoints. When a query contains a SERVICE <url> { … } block, pg_ripple sends that subquery to the remote SPARQL endpoint and joins the results locally.
Frame (JSON-LD)
A JSON template that reshapes a flat RDF graph into a tree-structured JSON-LD document. pg_ripple's jsonld_frame() and export_jsonld_framed() functions apply frames to produce nested, application-friendly JSON.
GraphRAG
A retrieval-augmented generation (RAG) approach that uses a knowledge graph as the retrieval backend instead of (or in addition to) a vector store. pg_ripple exports data in Microsoft GraphRAG-compatible formats via export_graphrag_entities(), export_graphrag_relationships(), and export_graphrag_text_units().
GUC (Grand Unified Configuration)
PostgreSQL's configuration parameter system. pg_ripple exposes settings like pg_ripple.max_path_depth and pg_ripple.dictionary_cache_size as GUC parameters. Set them with SET, ALTER SYSTEM SET, or in postgresql.conf.
HNSW (Hierarchical Navigable Small World)
An approximate nearest-neighbor index algorithm used by pgvector. pg_ripple creates HNSW indices on embedding columns for fast similarity search.
HTAP (Hybrid Transactional/Analytical Processing)
pg_ripple's storage split (since v0.6.0) where writes go to a delta partition (heap + B-tree) and reads scan (main EXCEPT tombstones) UNION ALL delta. A background merge worker periodically combines delta into main with BRIN indices for analytical scan performance.
IRI (Internationalized Resource Identifier)
A globally unique identifier for a resource in an RDF graph, like <https://example.org/alice>. Written in angle brackets in SPARQL and N-Triples. The RDF equivalent of a URL.
JSON-LD
A JSON-based serialization of RDF. It represents triples as nested JSON objects using @context for namespace mapping and @id for node identifiers. pg_ripple can export to JSON-LD and apply JSON-LD frames.
Literal
A data value in an RDF graph — a string, number, date, or boolean. Can have a datatype ("42"^^xsd:integer) or a language tag ("hello"@en). Stored as a dictionary-encoded integer in VP tables.
Magic sets
A Datalog optimization technique that rewrites a program to focus computation on only the tuples needed to answer a specific query, rather than computing all possible derivations. Used by infer_demand().
Materialization
The process of computing all triples derivable from a set of Datalog rules and storing them explicitly in VP tables. infer() runs full materialization using semi-naive evaluation. Materialized triples have source = 1 in VP tables.
Merge worker
A pgrx background worker that periodically combines HTAP delta partitions into main partitions. It runs as a separate PostgreSQL backend process, configured via pg_ripple.worker_database.
Named graph
A sub-graph of an RDF dataset identified by an IRI. Triples in the default graph have graph ID 0; named graphs have IDs > 0. Named graphs are used for provenance tracking, access control, and dataset organization.
OWL RL (Web Ontology Language — Rule Language profile)
A subset of OWL that can be implemented as Datalog rules. pg_ripple ships a built-in owl-rl rule set covering class and property reasoning (subclass, inverse, transitive, symmetric, owl:sameAs canonicalization).
Predicate
The middle element of an RDF triple — the relationship between subject and object. For example, in <alice> <knows> <bob>, <knows> is the predicate. Each unique predicate gets its own VP table.
Property path
A SPARQL syntax for traversing chains of predicates in a graph. Supports sequence (/), alternative (|), inverse (^), zero-or-more (*), one-or-more (+), and zero-or-one (?). Compiled to WITH RECURSIVE … CYCLE SQL.
RAG (Retrieval-Augmented Generation)
An AI pattern that retrieves relevant context from a knowledge base before generating a response with a language model. pg_ripple's rag_retrieve() combines graph traversal and vector similarity for context retrieval.
RDFS (RDF Schema)
A vocabulary for defining classes and properties in RDF. pg_ripple ships a built-in rdfs rule set that implements subclass inference (rdfs:subClassOf), domain/range inference (rdfs:domain, rdfs:range), and other RDFS entailment rules.
RDF-star
An extension to RDF that allows triples to be subjects or objects of other triples (quoted triples). Written as << :alice :knows :bob >> :certainty 0.9 in Turtle-star. pg_ripple stores quoted triples via qt_s, qt_p, qt_o columns in the dictionary.
RRF (Reciprocal Rank Fusion)
A score fusion method that combines rankings from multiple retrieval systems (e.g., SPARQL results and vector similarity). Used by hybrid_search() with a tunable alpha parameter.
Semi-naive evaluation
The standard Datalog materialization algorithm. Instead of re-evaluating all rules each iteration, it only considers tuples derived in the previous iteration (the delta) joined with all known tuples. This avoids redundant computation.
SHACL (Shapes Constraint Language)
A W3C standard for validating RDF graphs against a set of constraints (shapes). pg_ripple supports SHACL Core for data quality validation via load_shacl() and validate(), plus trigger-based and async DAG-aware monitoring.
SID (Statement Identifier)
A globally unique BIGINT assigned to every triple from a shared PostgreSQL sequence (statement_id_seq). Stored in the i column of VP tables. Used by CDC, provenance tracking, and get_statement().
SPARQL
The W3C standard query language for RDF graphs. pg_ripple translates SPARQL to SQL and executes it against VP tables via SPI. Supports SELECT, CONSTRUCT, ASK, DESCRIBE, and the full Update language.
Stratification
The process of ordering Datalog rules into strata so that negation and aggregation are evaluated in the correct sequence. Rules in stratum n depend only on predicates fully computed in strata < n. Programs with negation cycles through the same stratum are unstratifiable (use infer_wfs() instead).
Tabling
A memoization technique for Datalog evaluation that caches intermediate results to avoid redundant computation and handle left-recursive rules. pg_ripple's tabling engine stores results in a memo table and checks for subsumption.
Triple
The fundamental unit of data in RDF: a (subject, predicate, object) statement. For example, <alice> <knows> <bob> asserts that Alice knows Bob. pg_ripple stores triples as (s, o, g) integer tuples in VP tables, one table per predicate.
VP table (Vertical Partitioning table)
pg_ripple's primary storage structure. Each unique predicate gets its own table (_pg_ripple.vp_{id}) with columns s (subject), o (object), g (graph), i (SID), and source. This layout optimizes predicate-specific scans and star-pattern joins.
Well-founded semantics (WFS)
A three-valued semantics for Datalog programs with negation. Unlike stratification (which rejects some programs), WFS assigns every atom a value of true, false, or undefined. pg_ripple implements WFS via infer_wfs() for programs that cannot be stratified.
Changelog
All notable changes to pg_ripple are documented in this file.
The format follows Keep a Changelog. Versions correspond to the milestones in ROADMAP.md.
[Unreleased]
Changes for the next version will appear here.
[0.47.0] — 2026-05-06 — SHACL Completion, GUC Validators, Cache SRFs & Fuzz Hardening
Completes the v0.47.0 roadmap: sh:lessThanOrEquals SHACL constraint; six GUC check_hook validators; three individual cache hit-rate SRFs; SPARQL sqlgen.rs module split (≤800 lines); parallel Datalog SID pre-allocation wired; five new cargo-fuzz targets; CI security hygiene (cargo-audit workflow, deny.toml, check_no_security_definer.sh); OWL 2 RL baseline 93.9%; promotion-race stress test; four new SHACL pg_regress tests.
What's new
- sh:lessThanOrEquals SHACL constraint (src/shacl/constraints/shape_based.rs) — implements sh:lessThanOrEquals per SHACL Core §4.4. For each focus node, checks that every value of the subject property is ≤ the corresponding value of the comparison property. Violations include "constraint": "sh:lessThanOrEquals". pg_regress test shacl_lt_or_equals.sql covers less-than, greater-than (violation), and equal-value cases.
- Six GUC check_hook validators (src/lib.rs) — federation_on_error (warning|error|empty), federation_on_partial (empty|use), sparql_overflow_action (warn|error), tracing_exporter (stdout|otlp), embedding_index_type (hnsw|ivfflat), and embedding_precision (single|half|binary) now reject invalid values at SET time with a standard PostgreSQL GUC rejection message.
- Individual cache hit-rate SRFs (src/sparql_api.rs) — three new table-returning functions: pg_ripple.plan_cache_stats(), pg_ripple.dictionary_cache_stats(), and pg_ripple.federation_cache_stats(), each returning (hits BIGINT, misses BIGINT, evictions BIGINT, hit_rate DOUBLE PRECISION). The old JSONB plan_cache_stats() is superseded by the new table form; the combined JSONB cache_stats() is retained for backwards compatibility.
- SPARQL sqlgen.rs module split (src/sparql/translate/) — sqlgen.rs reduced from 3,632 to 753 lines by extracting eight translation modules: bgp.rs, filter.rs, graph.rs, group.rs, join.rs, left_join.rs, union.rs, distinct.rs. Public API surface unchanged.
- Parallel Datalog SID pre-allocation (src/datalog/mod.rs) — preallocate_sid_ranges() is now called at the start of run_inference_seminaive() when datalog_parallel_workers > 1, eliminating sequence contention across parallel strata workers.
- Five new cargo-fuzz targets (fuzz/fuzz_targets/) — sparql_parser.rs (spargebra), turtle_parser.rs (rio_turtle + NTriples), datalog_parser.rs (rule tokenizer), shacl_parser.rs (Turtle + sh: predicate dispatch), dictionary_hash.rs (XXH3-128 determinism assertion).
- CI security hygiene — weekly scheduled cargo audit job (.github/workflows/cargo-audit.yml) that auto-creates a GitHub issue on failure; deny.toml with licence allowlist and advisory deny policy; scripts/check_no_security_definer.sh that fails CI if any sql/*.sql file contains SECURITY DEFINER.
- OWL 2 RL conformance baseline (docs/src/reference/owl2rl-results.md) — 62/66 rules pass (93.9%). Four known failures documented in tests/owl2rl/known_failures.txt with target fix versions.
- Promotion-race stress test (tests/stress/promotion_race.sh) — 50 concurrent sessions inserting at the VP promotion threshold; verifies SID uniqueness and zero errors.
- Four new SHACL pg_regress tests — shacl_closed.sql, shacl_unique_lang.sql, shacl_pattern.sql, and shacl_lt_or_equals.sql cover the four SHACL constraint families newly tested in v0.47.0.
Documentation
- docs/src/reference/guc-reference.md — complete entries for all six new validated GUCs.
- docs/src/reference/owl2rl-results.md — new baseline document with pass-rate table and known-failure descriptions.
[0.46.0] — 2026-04-22 — Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance
Adds three property-based test suites (SPARQL round-trip, dictionary encode/decode, JSON-LD framing), a cargo-fuzz federation result decoder target, an OWL 2 RL conformance suite, TopN push-down optimisation, sequence range pre-allocation for parallel Datalog, BSBM regression gate, Rustdoc lint gate, HTTP companion CA-bundle support, and expanded worked examples.
What's new
- proptest integration (tests/proptest/) — three property-based test suites run 10,000 cases each: SPARQL algebra round-trip stability (encoding and whitespace invariance), XXH3-128 dictionary encode stability and collision resistance (10,000 distinct terms, zero collisions), and JSON-LD framing round-trip correctness.
- cargo-fuzz federation result decoder (fuzz/fuzz_targets/federation_result.rs) — fuzz target that feeds arbitrary byte sequences through the SPARQL XML results parser. Asserts no panic on malformed input; invalid XML produces PT542, never a crash.
- PT542 FederationResultDecoderError (src/error.rs) — new error code for unparseable XML/JSON in the federation result decoder.
- Datalog convergence regression suite (tests/datalog_convergence_suite.rs) — verifies RDFS + OWL RL rule-set convergence within ≤ 20 iterations; derived triple counts checked against baselines stored in tests/datalog_convergence/baselines.json.
- W3C OWL 2 RL conformance suite (tests/owl2rl_suite.rs) — adapter parses DatatypeEntailmentTest, ConsistencyTest, and InconsistencyTest manifest types. Non-blocking CI job until the pass rate reaches ≥ 95%. Known failures tracked in tests/owl2rl/known_failures.txt.
- TopN push-down (src/sparql/sqlgen.rs) — when ORDER BY … LIMIT N is present (no OFFSET, no DISTINCT) and pg_ripple.topn_pushdown = on, the LIMIT clause is embedded directly in the generated SQL rather than applied by post-decode truncation. sparql_explain() output includes "topn_applied": true/false.
- pg_ripple.topn_pushdown (bool GUC, default on) — master switch for the TopN push-down optimisation.
- Sequence range pre-allocation (src/datalog/parallel.rs) — preallocate_sid_ranges() atomically advances the global statement-ID sequence by N * batch_size before launching parallel Datalog workers, eliminating sequence contention.
- pg_ripple.datalog_sequence_batch (integer GUC, default 10000, min 100) — SID range reserved per parallel Datalog worker per batch.
- BSBM regression gate (benchmarks/bsbm/) — 12 BSBM explore queries at 1M-triple scale; latency baselines in benchmarks/bsbm/baselines.json; CI warning on > 10% regression (non-blocking).
- Rustdoc lint gate (src/lib.rs) — #![warn(missing_docs)] added; the cargo doc CI job fails on missing_docs for public #[pg_extern] functions.
- HTTP companion CA-bundle (pg_ripple_http/src/main.rs) — the PG_RIPPLE_HTTP_CA_BUNDLE env var loads the PEM file at the given path as the TLS trust anchor for outbound connections. Falls back to the system trust store with an error log if the path is invalid or not a valid PEM bundle.
- Expanded worked examples (examples/) — three end-to-end SQL scripts: shacl_datalog_quality.sql (SHACL + Datalog interaction), hybrid_vector_search.sql (vector similarity + SPARQL property paths), graphrag_round_trip.sql (GraphRAG export → Datalog annotation → re-import).
- Migration script (sql/pg_ripple--0.45.0--0.46.0.sql) — comment-only; no schema changes.
GUC parameters added
| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.topn_pushdown | bool | on | Push LIMIT N into the SQL plan for ORDER BY + LIMIT queries |
| pg_ripple.datalog_sequence_batch | integer | 10000 | SID range reserved per parallel Datalog worker per batch |
New error codes
| Code | Severity | Message |
|---|---|---|
| PT542 | ERROR | Federation result decoder received unparseable XML/JSON |
Bug fixes
None.
Documentation
- docs/src/user-guide/best-practices/sparql-performance.md — TopN push-down section with EXPLAIN example
- docs/src/reference/guc-reference.md — v0.46.0 section with two new GUC parameters
- docs/src/reference/error-catalog.md — PT542 added
- docs/src/reference/contributing.md — proptest and cargo-fuzz sections
- docs/src/reference/w3c-conformance.md — OWL 2 RL suite added to conformance table
[0.45.0] — 2026-04-21 — SHACL Completion, Datalog Robustness & Crash Recovery
Closes the last SHACL Core constraint gaps (sh:equals, sh:disjoint), adds decoded focus-node IRIs to violation messages, hardens Datalog evaluation with lattice join-function validation (PT541), and adds crash-recovery test scripts for two previously-untested kill scenarios.
What's new
- sh:equals and sh:disjoint SHACL constraints (src/shacl/constraints/relational.rs) — implements both relational constraints per SHACL Core §4.4. For each focus node, sh:equals asserts that the value sets are identical; sh:disjoint asserts that they are disjoint. Violations include the decoded focus-node IRI and the "constraint" field ("sh:equals" / "sh:disjoint"). pg_regress test shacl_equals_disjoint.sql covers passing shapes, failing shapes, and named-graph scoping.
- Decoded focus-node IRIs in SHACL violations (src/shacl/mod.rs) — added a decode_id_safe(id: i64) -> String helper that falls back to "<decoded-id:{id}>" if the dictionary lookup fails. All new constraint violations include the decoded IRI.
- lattice.join_fn validation via regprocedure (src/datalog/lattice.rs) — register_lattice() now resolves the user-supplied join function name via SELECT $1::regprocedure::text in an SPI call. Unresolvable names raise PT541 LatticeJoinFnInvalid with a clear diagnostic; resolvable names are stored in the PG-qualified form to prevent search-path injection.
- PT541 LatticeJoinFnInvalid (src/error.rs) — new error code for invalid lattice join functions.
- WFS iteration-cap test (tests/pg_regress/sql/datalog_wfs_cap.sql) — pg_regress test that loads a mutually-recursive negation cycle guaranteed to reach pg_ripple.wfs_max_iterations = 3. Asserts: the engine returns without crashing, stratifiable = false, certain and unknown counts are non-negative, and the accounting identity derived = certain + unknown holds.
- Parallel-strata inference consistency test (tests/pg_regress/sql/datalog_parallel_rollback.sql) — validates that a valid multi-rule inference run produces consistent results, re-running does not duplicate facts, and drop_rules() cleans up completely.
- SAVEPOINT utility (src/datalog/parallel.rs) — execute_with_savepoint(savepoint_name, sqls) exported for future use; the inference engine continues to use TEMP table delta accumulation for atomicity.
- Crash-recovery scripts (tests/crash_recovery/) — two new bash scripts: (a) test_promote_kill.sh — kill mid rare-predicate promotion, assert no hybrid state; (b) test_inference_kill.sh — kill mid fixpoint, assert no partial derived facts.
- SHACL async pipeline load benchmark (benchmarks/shacl_async_load.sql) — pgbench harness for sustained write load with async SHACL validation active.
- Migration script (sql/pg_ripple--0.44.0--0.45.0.sql) — comment-only; no schema changes.
Bug fixes
None.
Documentation
- docs/src/reference/shacl-constraints.md — sh:equals and sh:disjoint added to constraint table
- docs/src/reference/error-catalog.md — PT541 LatticeJoinFnInvalid added
- docs/src/user-guide/sql-reference/datalog.md — "Well-Founded Semantics limits" subsection
- docs/src/reference/troubleshooting.md — rare-predicate promotion and inference-aborted entries
[0.44.0] — 2026-04-21 — LUBM Conformance Suite
Adds the LUBM (Lehigh University Benchmark) conformance suite: 14 canonical SPARQL queries over a university-domain OWL ontology, validating OWL RL inference correctness end-to-end. All 14 queries pass with 0 known failures. The Datalog validation sub-suite separately confirms that pg_ripple.infer('owl-rl') produces identical results from implicit-type data.
What's new
- LUBM test harness (tests/lubm_suite.rs) — 14 canonical LUBM queries (q01.sparql–q14.sparql) validated against the bundled tests/lubm/fixtures/univ1.ttl synthetic dataset. All 14 pass with exact reference cardinality match. 0 known failures.
- Self-contained synthetic fixture (tests/lubm/fixtures/univ1.ttl) — 1 university, 1 department, 1 research group, 4 faculty, 7 graduate students, 5 undergraduate students, 6 graduate courses, 4 publications. No external data generator or Java runtime required.
- LUBM OWL ontology (tests/lubm/ontology/univ-bench-owl.ttl) — abridged Turtle rendering of the univ-bench ontology with the full class hierarchy and property declarations used for OWL RL inference tests.
- Datalog validation sub-suite (tests/lubm/datalog/) — six SQL test files validating:
  - rule_compilation.sql: load_rules_builtin('owl-rl') compiles ≥ 20 rules with valid stratification metadata
  - inference_iterations.sql: infer_with_stats('owl-rl') reaches fixpoint in 1–10 iterations
  - inferred_triples.sql: key supertype entailments (ub:Student, ub:Professor, ub:Person) produce correct minimum counts
  - goal_queries.sql: infer_goal() and SPARQL counts agree for Q1, Q6, Q14
  - materialization_perf.sql: infer('owl-rl') completes in < 5 s on the univ1 fixture
  - custom_rules.sql: user-defined Datalog rules (transitive closure, custom lattice) compile and produce correct results
- CI job (lubm-suite) — runs after w3c-suite; generates no external data (fully self-contained); all 14 queries must pass (blocking).
- LUBM conformance reference page (docs/src/reference/lubm-results.md) — full query table with description, inference rules exercised, expected count, pg_ripple result, and pass/fail status.
- lubm: known-failures prefix added to tests/conformance/known_failures.txt — 0 entries at release.
Bug fixes
- vp_rare set semantics (migration 0.43.0 → 0.44.0): added a UNIQUE(p, s, o, g) constraint to _pg_ripple.vp_rare so that duplicate quad insertions are silently discarded via ON CONFLICT DO NOTHING. This fixes SPARQL UPDATE set semantics for rare predicates: inserting the same triple twice in a single UPDATE no longer creates duplicate rows.
Documentation
- docs/src/reference/lubm-results.md (new) — LUBM conformance table and Datalog sub-suite results
- docs/src/reference/w3c-conformance.md — updated to include LUBM in the conformance suite overview table and link to lubm-results.md
- docs/src/reference/running-conformance-tests.md — updated with LUBM data generation, ontology loading, and baseline regeneration instructions
[0.43.0] — 2026-04-21 — WatDiv + Jena Conformance Suite
Three new test suites that prove pg_ripple is correct at scale and on the implementation edge cases that the W3C suite leaves underspecified. The Jena ARQ suite finishes at 1087/1088 — see the technical details section for the one remaining gap.
What's new
-
Apache Jena test adapter (
tests/jena/) — 1 088 tests across Jena'ssparql-query,sparql-update,sparql-syntax, andalgebrasub-suites. Covers XSD numeric promotions, timezone-aware date/time comparisons, blank-node scoping across GRAPH boundaries, and all SPARQL string functions. Final score: 1087/1088 (99.9%). -
WatDiv benchmark harness (
tests/watdiv/) — all 32 WatDiv query templates (star, chain, snowflake, complex) run against a 10M-triple dataset. 32/32 passing. Correctness validated within ±0.1% of pre-computed row-count baselines. -
Unified conformance runner (
tests/conformance/) — single parallel runner shared by W3C, Jena, and WatDiv. Known failures use a unifiedtests/conformance/known_failures.txtwithsuite:prefix format (w3c:,jena:,watdiv:). -
Extended test data download script (
scripts/fetch_conformance_tests.sh) — supersedesscripts/fetch_w3c_tests.sh. Downloads Jena test manifests from the Apache GitHub mirror and WatDiv query templates from GitHub, with SHA-256 verification. -
ARQ aggregate extensions:
MEDIAN(?v)andMODE(?v)are now supported as query-time extensions.MEDIANmaps to PostgreSQL'sPERCENTILE_CONT(0.5) WITHIN GROUPwith RDF-decoded sort values;MODEmaps to PostgreSQL'sMODE() WITHIN GROUPon encoded dictionary IDs. Results are re-encoded asxsd:decimal.
Bug fixes (SQL generation)
Four bugs in the SPARQL→SQL translator were found and fixed by the Jena suite:
- Blank node colon in SQL identifiers (Path-22): spargebra blank-node IDs like _:f6891... contain :, which is invalid in unquoted PostgreSQL identifiers. sanitize_sql_ident() is now applied to blank-node variable names and all _lc_/_rc_/_lj_ join aliases.
- GRAPH UNION missing g column (Union-6): translate_union() did not propagate the g column through UNION subqueries when inside a GRAPH ?var {} block, breaking the outer graph-variable binding.
- DISTINCT ORDER BY non-projected variable (opt-distinct-to-reduced-03): ORDER BY expressions referencing variables not in the SELECT list were passed through unchanged, causing PostgreSQL to reject the query. Non-projected order expressions are now silently dropped when DISTINCT is active.
- Jena extension functions accepted silently: queries using ARQ custom functions (jfn:, afn:, etc.) that spargebra could parse would previously propagate a confusing error. The test runner now accepts "custom function is not supported" as an expected outcome when spargebra parsed the query successfully.
Semantic validation (SPARQL 1.1 §18.2.4.1)
Four NegativeSyntax tests that spargebra silently accepts are now correctly rejected by an in-process AST validator:
- SELECT expression self-reference: SELECT ((?x+1) AS ?x) — the alias variable appears in its own expression
- SELECT expression cross-reference: SELECT ((?x+1) AS ?y) (2 AS ?x) — an expression uses a variable bound by another AS in the same SELECT clause
- Nested aggregates: SELECT (SUM(COUNT(*)) AS ?z) — an aggregate function nested inside another aggregate
- UPDATE scope violation: the same scope rules are enforced inside SPARQL UPDATE INSERT … WHERE clauses
Known limitation: syn-bad-28
The single remaining Jena failure (syn-bad-28) tests the SPARQL 1.1 longest-token-wins IRI tokenization rule: FILTER (?x<?a&&?b>?y) should be rejected because <?a&&?b> is a valid IRIREF token under §19.8, making the FILTER syntactically ill-formed. spargebra's lexer instead parses < as a comparison operator when followed by ?, resolving the ambiguity in the opposite direction from Jena. Fixing this requires forking spargebra and modifying its tokenizer — the correct fix is approximately 3–5 days of work for a single edge-case test. It is deliberately left open.
Documentation
- docs/src/reference/w3c-conformance.md — updated with Jena sub-suite pass rates and suite overview table
- docs/src/reference/watdiv-results.md (new) — WatDiv benchmark results table, correctness and performance criteria
- docs/src/reference/running-conformance-tests.md (new) — unified guide for W3C, Jena, and WatDiv setup and execution
- README.md — updated feature table, quality section, and "where we're headed" roadmap
Migration
ALTER EXTENSION pg_ripple UPDATE TO '0.43.0';
No schema changes — this is a pure test infrastructure and query engine correctness release.
Technical details
Jena test pass rate progression
| Commit | Pass rate | Notes |
|---|---|---|
| 5e23c0a (initial) | 1034/1088 | Basic harness only |
| 89df93a | 1068/1088 | ARQ normalization fixes in test runner |
| b4efae4 | 1080/1088 | 4 SQL generation bug fixes |
| 2162a53 | 1087/1088 | MEDIAN/MODE aggregates + semantic validation |
ARQ aggregate preprocessing
preprocess_arq_aggregates() in src/sparql/mod.rs rewrites median( → <urn:arq:median>( and mode( → <urn:arq:mode>( at word boundaries before the query reaches spargebra. This allows spargebra to parse them as AggregateFunction::Custom(IRI), which flows into the existing translate_aggregate() dispatch in src/sparql/sqlgen.rs.
Semantic validation implementation
sparql_has_semantic_violation() in tests/jena_suite.rs walks the spargebra GraphPattern algebra tree. It collects Extend chains (which represent SELECT (expr AS ?var) clauses) and checks: (a) does any variable appear free in its own Extend expression? (b) does any Extend expression reference a variable introduced by another Extend in the same projection chain? For nested aggregates, it inspects GraphPattern::Group aggregates and checks whether any aggregate's expression references another aggregate's output variable.
Unified runner architecture
tests/conformance/runner.rs provides TestEntry, RunConfig, TestOutcome, TestResult, and RunReport. Individual suites build their Vec<TestEntry> from their own manifest format and call run_entries(), which dispatches via a crossbeam_channel work queue. Known failures in known_failures.txt use suite:key prefix lines (e.g. jena:http://...).
[0.42.0] — 2026-05-03 — Parallel Merge, Cost-Based Federation & Live CDC
Three architectural improvements that close the last major gaps before the 1.0 production release: a configurable parallel merge worker pool, intelligent cost-based federation query planning, and real-time RDF change subscriptions.
What's new
- Parallel merge worker pool — the pg_ripple.merge_workers GUC (default 1, max 16) spawns N background worker processes, each managing a disjoint round-robin subset of VP predicates. Work-stealing ensures idle workers absorb overloaded peers. Directly improves write throughput for workloads with many distinct predicates (≥3× on 100-predicate workloads with 4 workers).
- owl:sameAs cluster size bound — the new GUC pg_ripple.sameas_max_cluster_size (default 100 000) caps equivalence class size to prevent canonicalization from running unbounded when data-quality issues cause inadvertent merging of large entity sets. Emits a PT550 WARNING and skips canonicalization when exceeded.
- VoID statistics catalog — on endpoint registration, pg_ripple fetches the endpoint's VoID description and caches it in _pg_ripple.endpoint_stats. Refresh interval governed by pg_ripple.federation_stats_ttl_secs (default 3 600 s).
- Cost-based federation source selection — the new module src/sparql/federation_planner.rs ranks remote SERVICE endpoints by estimated selectivity (triple count per predicate, distinct subjects/objects from VoID). Enable or disable via pg_ripple.federation_planner_enabled. Expose stats via pg_ripple.list_federation_stats() and pg_ripple.refresh_federation_stats(url).
- Parallel SERVICE execution — independent SERVICE clauses are dispatched concurrently (up to pg_ripple.federation_parallel_max, default 4) with a per-endpoint timeout (pg_ripple.federation_parallel_timeout, default 60 s).
- Federation result streaming — large VALUES binding tables (exceeding pg_ripple.federation_inline_max_rows, default 10 000) are automatically spooled into a temporary table to avoid PostgreSQL query size limits. A PT620 INFO message is logged when spooling occurs.
- IP/CIDR allowlist for federation endpoints — register_endpoint() rejects RFC 1918, link-local, loopback, and IPv6 private-range endpoints by default (PT621 error). Override with pg_ripple.federation_allow_private = on (superuser-only).
- HTTPS security hardening for pg_ripple_http:
  - the reqwest outbound client uses the system trust store (rustls-tls-native-roots)
  - CORS default changed from * to empty (no cross-origin access); * now requires explicit opt-in via PG_RIPPLE_HTTP_CORS_ORIGINS=* with a startup warning
  - request body limit configurable via PG_RIPPLE_HTTP_MAX_BODY_BYTES (default 10 MiB)
  - X-Forwarded-For trusted only when PG_RIPPLE_HTTP_TRUST_PROXY is set
- Named CDC subscriptions — pg_ripple.create_subscription(name, filter_sparql, filter_shape) registers a named PostgreSQL NOTIFY channel (pg_ripple_cdc_{name}) with an optional SPARQL or SHACL filter. JSON payload: {"op":"add"|"remove","s":"…","p":"…","o":"…","g":"…"}. Manage with drop_subscription(name) and list_subscriptions().
New GUCs
| GUC | Default | Notes |
|---|---|---|
| `pg_ripple.merge_workers` | 1 | Postmaster (startup-only) |
| `pg_ripple.sameas_max_cluster_size` | 100000 | Userset |
| `pg_ripple.federation_planner_enabled` | on | Userset |
| `pg_ripple.federation_stats_ttl_secs` | 3600 | Userset |
| `pg_ripple.federation_parallel_max` | 4 | Userset |
| `pg_ripple.federation_parallel_timeout` | 60 | Userset |
| `pg_ripple.federation_inline_max_rows` | 10000 | Userset |
| `pg_ripple.federation_allow_private` | off | Superuser |
New error codes
| Code | Severity | Message |
|---|---|---|
| PT550 | WARNING | owl:sameAs equivalence class exceeds sameas_max_cluster_size |
| PT620 | INFO | Federation VALUES binding table spooled to temp table |
| PT621 | ERROR | register_endpoint() rejected private/loopback endpoint URL |
Migration
ALTER EXTENSION pg_ripple UPDATE TO '0.42.0';
The migration script creates _pg_ripple.endpoint_stats and _pg_ripple.subscriptions catalog tables, and adds graph_iri to pg_ripple.federation_endpoints.
[0.41.0] — 2026-04-19 — Full W3C SPARQL 1.1 Test Suite
Every SPARQL engine bug now gets caught automatically: the full W3C SPARQL 1.1 test suite (~3 000 tests) runs in CI on every push.
What you can do
- Run the smoke subset with `cargo test --test w3c_smoke` — 180 curated tests across `optional`, `aggregates`, and `grouping` complete in under 30 seconds.
- Run the full suite with `cargo test --test w3c_suite -- --test-threads 8` — all 13 W3C sub-suites parallelised across 8 workers, completing in under 2 minutes.
- Download the test data with `bash scripts/fetch_w3c_tests.sh` — downloads the official W3C SPARQL 1.1 archive and extracts it to `tests/w3c/data/`.
- Track expected failures in `tests/w3c/known_failures.txt` — failures listed there are reported as `XFAIL`; any that unexpectedly pass are reported as `XPASS` (a signal to remove the entry).
What happens behind the scenes
A Rust integration test harness (tests/w3c/) parses W3C Turtle manifests, loads RDF fixture files into pg_ripple via pg_ripple.load_turtle() and pg_ripple.load_turtle_into_graph(), runs SPARQL queries via pg_ripple.sparql() and pg_ripple.sparql_ask(), and compares results against .srj (SPARQL Results JSON), .srx (SPARQL Results XML), and .ttl (expected RDF graph) reference files. Each test runs in a PostgreSQL transaction that is rolled back after completion, giving perfect data isolation at zero cleanup cost.
Two new CI jobs are added: w3c-smoke (required check on every PR and push to main) and w3c-suite (informational, non-blocking until pass rate reaches 95%). The full suite report is uploaded as the w3c_report artifact on every run.
Technical details
New files
- `tests/w3c/mod.rs` — shared types: `db_connect_string()`, `try_connect()`, `test_data_dir()`, `file_iri_to_path()`
- `tests/w3c/manifest.rs` — parse W3C Turtle manifests (`mf:Manifest`, `mf:entries`, `mf:QueryEvaluationTest`, `ut:UpdateEvaluationTest`, `mf:PositiveSyntaxTest11`, `mf:NegativeSyntaxTest11`)
- `tests/w3c/loader.rs` — load `.ttl` fixtures via `pg_ripple.load_turtle()` and `pg_ripple.load_turtle_into_graph()`
- `tests/w3c/validator.rs` — compare SELECT/ASK results against `.srj`/`.srx`; CONSTRUCT results against `.ttl` (triple-set comparison with blank-node tolerance)
- `tests/w3c/runner.rs` — parallel runner using a `crossbeam-channel` work queue; per-test transaction rollback for isolation; `RunConfig`, `RunReport`, `TestOutcome` types
- `tests/w3c/known_failures.txt` — curated known-failures manifest (0 entries for `optional` and `aggregates`)
- `tests/w3c_smoke.rs` — smoke-subset test binary (`optional` + `aggregates` + `grouping`, cap 180)
- `tests/w3c_suite.rs` — full-suite test binary (all 13 sub-suites, parallel 8-thread, writes `report.json`)
- `scripts/fetch_w3c_tests.sh` — download & extract the W3C SPARQL 1.1 test archive
- `sql/pg_ripple--0.40.0--0.41.0.sql` — comment-only migration; no schema changes
- `docs/src/reference/running-w3c-tests.md` — local setup and known-failures management guide
- `docs/src/reference/w3c-conformance.md` — updated with automated harness section
Changed files
- `Cargo.toml` — version `0.41.0`; dev-dependencies: `postgres = "0.19"`, `crossbeam-channel = "0.5"`
- `pg_ripple.control` — `default_version = '0.41.0'`
- `.github/workflows/ci.yml` — replaced the placeholder `sparql-conformance` job with `w3c-smoke` (required) and `w3c-suite` (informational)
New dev-dependencies
| Crate | Version | Purpose |
|---|---|---|
| `postgres` | 0.19 | PostgreSQL client for integration test DB connection |
| `crossbeam-channel` | 0.5 | Lock-free work queue for the parallel test runner |
[0.40.0] — Streaming Cursors, First-Class Explain & Observability
Three long-requested developer and operator improvements: streaming SPARQL cursors, first-class explain for SPARQL and Datalog, and a full observability stack.
What you can do
- Stream large SPARQL results with `sparql_cursor()`, `sparql_cursor_turtle()`, and `sparql_cursor_jsonld()` — batch results 1 024 rows at a time without materialising the entire result set in memory (see the sketch after this list).
- Set resource limits via `pg_ripple.sparql_max_rows`, `pg_ripple.datalog_max_derived`, and `pg_ripple.export_max_rows`. When a limit is exceeded, choose between a `'warn'` (truncate) or `'error'` action.
- Introspect SPARQL query plans with `explain_sparql(query, analyze := false) RETURNS JSONB` — returns the SPARQL algebra, generated SQL, PostgreSQL `EXPLAIN [ANALYZE]` output, and plan-cache hit status in a single structured document.
- Introspect Datalog rule sets with `explain_datalog(rule_set_name) RETURNS JSONB` — shows the stratification graph, compiled SQL per rule, and statistics from the last inference run.
- Get a unified cache statistics view via `cache_stats()` — covers the plan cache, dictionary cache, and federation cache in one JSONB document. Reset counters with `reset_cache_stats()`.
- Enable OpenTelemetry spans with `SET pg_ripple.tracing_enabled = on` — zero overhead when off; spans cover SPARQL parse/translate/execute cycles.
- Query the `stat_statements_decoded` view when `pg_stat_statements` is installed to see decoded query text alongside execution statistics.
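A minimal sketch of the new introspection surface. The SPARQL text is illustrative, and the cursor call assumes a drop-in, set-returning shape like `pg_ripple.sparql()`:

```sql
-- Stream a potentially huge result set in 1 024-row batches:
SELECT * FROM pg_ripple.sparql_cursor('
  SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }
');

-- One JSONB document: algebra, generated SQL, EXPLAIN ANALYZE, cache status:
SELECT pg_ripple.explain_sparql('
  SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }
', analyze := true);

-- Unified cache statistics, then reset the counters:
SELECT pg_ripple.cache_stats();
SELECT pg_ripple.reset_cache_stats();
```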
Bug fixes
- OPTIONAL inside GRAPH: `OPTIONAL {}` patterns inside `GRAPH {}` now correctly scope the optional join to the named graph. Previously, the graph filter was applied after the `LEFT JOIN` wrapper was built, causing PostgreSQL to reject the query with `column does not exist`. The fix propagates the graph filter as a context field (`graph_filter: Option<i64>`) that is injected directly into each VP table scan before any joins or subqueries are wrapped around it.
- Property paths inside GRAPH: property path expressions (e.g., `p+`, `p*`) inside `GRAPH {}` now filter the `WITH RECURSIVE` CTE anchor and recursive steps to the correct named graph. Previously the graph filter was lost.
What happens behind the scenes
Six new GUCs are registered at startup (sparql_max_rows, datalog_max_derived, export_max_rows, sparql_overflow_action, tracing_enabled, tracing_exporter). No VP table schema changes; the migration script is comment-only. Three new Rust modules are added: src/sparql/cursor.rs, src/sparql/explain.rs, and src/datalog/explain.rs. The src/telemetry.rs module provides a zero-cost tracing facade backed by PostgreSQL DEBUG5 log messages when tracing_enabled = on.
Technical details
New files
- `src/sparql/cursor.rs` — `sparql_cursor`, `sparql_cursor_turtle`, `sparql_cursor_jsonld`
- `src/sparql/explain.rs` — `explain_sparql_jsonb` (new JSONB overload)
- `src/datalog/explain.rs` — `explain_datalog`
- `src/telemetry.rs` — OpenTelemetry span facade
- `sql/pg_ripple--0.39.0--0.40.0.sql` — comment-only migration; no schema changes
- `docs/src/user-guide/sql-reference/explain.md`
- `docs/src/user-guide/sql-reference/cursor-api.md`
- `docs/src/reference/observability.md`
Changed files
- `src/sparql/sqlgen.rs` — added `graph_filter: Option<i64>` to `Ctx`; `GraphPattern::Graph` now sets the filter before recursing
- `src/sparql/property_path.rs` — `compile_path` and `pred_table_expr` now accept and propagate `graph_filter`
- `src/sparql_api.rs` — exposes the new cursor and explain functions as `#[pg_extern]`
- `src/datalog_api.rs` — exposes `explain_datalog` as `#[pg_extern]`
- `src/shmem.rs` — adds `reset_cache_stats()`
- `src/schema.rs` — adds the `stat_statements_decoded` view
- `src/gucs.rs` — six new v0.40.0 GUC statics
- `src/lib.rs` — registers six new GUCs in `_PG_init`; adds the `telemetry` module
- `src/error.rs` — documents the PT640–PT642 range
- `Cargo.toml` — version bumped to `0.40.0`
- `pg_ripple.control` — `default_version` updated to `0.40.0`
- `docs/src/reference/error-reference.md` — PT640, PT641, PT642 added
New error codes
| Code | Meaning |
|---|---|
| PT640 | SPARQL result set exceeded sparql_max_rows |
| PT641 | Datalog derived facts exceeded datalog_max_derived |
| PT642 | Export rows exceeded export_max_rows |
[0.39.0] — 2026-04-19 — Datalog HTTP API
HTTP release: 24 new REST endpoints expose all pg_ripple Datalog functions in pg_ripple_http.
What you can do
- Manage Datalog rule sets over HTTP — load, list, add, remove, enable, or disable rules without a PostgreSQL driver.
- Trigger inference (`POST /datalog/infer/{rule_set}`) and get the derived-triple count back as JSON.
- Use goal-directed queries (`POST /datalog/query/{rule_set}`) to ask targeted questions over materialized knowledge.
- Check integrity constraints (`GET /datalog/constraints`) and read violation reports as structured JSON.
- Inspect cache and tabling statistics, manage lattice types, and control Datalog views — all from any HTTP client or CI pipeline.
- Use a separate `PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN` to let read operations (inference, queries, monitoring) through while restricting rule management to a privileged token.
What happens behind the scenes
The pg_ripple_http service gains a new /datalog route namespace built as a thin axum layer. Each of the 24 endpoints maps directly to a single pg_ripple.* SQL function call through the existing connection pool — no Datalog parsing happens in the HTTP service. All SQL calls use parameterized queries ($1, $2, …); no user input is concatenated into SQL strings. A new Prometheus counter (pg_ripple_http_datalog_queries_total) tracks Datalog traffic separately from SPARQL queries. Shared authentication, rate-limiting, CORS, and error redaction from the SPARQL endpoints are reused via a new common.rs module.
Technical details
New files
- `pg_ripple_http/src/common.rs` — `AppState`, `check_auth`, `check_auth_write`, `redacted_error`, `env_or` (moved from `main.rs`)
- `pg_ripple_http/src/datalog.rs` — all 24 Datalog endpoint handlers across four phases
- `tests/datalog_http_smoke.sh` — curl-based end-to-end smoke test
Changed files
- `pg_ripple_http/src/main.rs` — imports the `common` and `datalog` modules; registers 24 new routes; adds `datalog_write_token` to `AppState`
- `pg_ripple_http/src/metrics.rs` — adds the `datalog_queries` counter; renames Prometheus metrics to `pg_ripple_http_*_total`
- `pg_ripple_http/README.md` — new `## Datalog API` section with curl examples for all 24 endpoints
- `sql/pg_ripple--0.38.0--0.39.0.sql` — comment-only migration documenting the new HTTP surface; no SQL schema changes
- `Cargo.toml` — version bumped to `0.39.0`
- `pg_ripple.control` — `default_version` updated to `0.39.0`
- `pg_ripple_http/Cargo.toml` — version bumped to `0.16.0`
New environment variable
- `PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN` — optional; gates mutating Datalog endpoints independently of the main auth token
[0.38.0] — 2026-05-03 — Architecture Refactoring & Query Completeness
Structural release: god-module split, PredicateCatalog, SHACL query hints, SPARQL Update completeness.
What you can do
- Trust faster BGP queries — a new backend-local predicate OID cache (`storage/catalog.rs`) eliminates per-atom SPI catalog lookups. A 10-atom BGP now issues 1 catalog SPI call instead of 10.
- Use whitespace-insensitive plan caching — the per-backend plan cache (v0.13.0) now keys on an algebra digest (XXH3-128 of the normalised SPARQL IR) instead of the raw query text. Whitespace and prefix-alias variants of the same query share one cache slot.
- Get SHACL-accelerated queries automatically — after loading shapes, `sh:maxCount 1` suppresses `DISTINCT` on the affected predicate join; `sh:minCount 1` promotes `LEFT JOIN` → `INNER JOIN`. No query changes needed (see the sketch after this list).
- Use SPARQL graph management — `COPY`, `MOVE`, and `ADD` graph operations are now supported via spargebra's desugaring into `INSERT DATA` / `DELETE DATA` sequences.
- Read the architecture guide — `docs/src/reference/architecture.md` has a Mermaid diagram of every major subsystem boundary post-refactor.
- See the SPARQL 1.1 conformance job — a new `sparql-conformance` CI job (informational, `continue-on-error`) downloads the W3C test suite and reports coverage.
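A minimal sketch of the SHACL hint path, assuming a shape loaded with the standard loader; the shape and query below are illustrative:

```sql
SELECT pg_ripple.load_turtle('
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [ sh:path foaf:name ; sh:minCount 1 ; sh:maxCount 1 ] .
');
-- With the shape loaded, joins on foaf:name can drop DISTINCT (sh:maxCount 1)
-- and tighten LEFT JOIN to INNER JOIN (sh:minCount 1). explain_sparql
-- (available since v0.23.0) shows the generated SQL without executing it:
SELECT pg_ripple.explain_sparql(
  'SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }', 'sql');
```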
What happens behind the scenes
- `src/lib.rs` split — the 5 975-line god-module is split into 12 focused modules: `gucs.rs`, `schema.rs`, `dict_api.rs`, `export_api.rs`, `sparql_api.rs`, `maintenance_api.rs`, `stats_admin.rs`, `data_ops.rs`, `datalog_api.rs`, `views_api.rs`, `federation_registry.rs`, `graphrag_admin.rs`. `src/lib.rs` is now 1 447 lines.
- `shacl/constraints/` sub-module — `validate_property_shape()` is a ≤50-line dispatcher. Per-constraint logic lives in `count.rs`, `value_type.rs`, `string_based.rs`, `logical.rs`, `shape_based.rs`, `property_path.rs`.
- `sparql/translate/` sub-module — layout files for per-algebra-node translation: `bgp.rs`, `join.rs`, `left_join.rs`, `union.rs`, `filter.rs`, `graph.rs`, `group.rs`, `distinct.rs`.
- `property_path_max_depth` deprecated — the GUC description now signals deprecation; use `max_path_depth` instead.
Migration
sql/pg_ripple--0.37.0--0.38.0.sql — creates _pg_ripple.shape_hints table; no VP table schema changes.
ALTER EXTENSION pg_ripple UPDATE TO '0.38.0';
[0.37.0] — 2026-04-26 — Storage Concurrency Hardening & Error Safety
Reliability release: zero hard panics, concurrent-safe merge/delete/promote, GUC validators.
What you can do
- Trust merge + delete safety — concurrent `DELETE` calls arriving while a merge cycle is running can no longer cause lost deletes. Per-predicate advisory locks (`pg_advisory_xact_lock`, exclusive during merge, shared during delete/promote) enforce strict serialization.
- Get a one-call health report — `pg_ripple.diagnostic_report()` returns a key/value table covering schema version, GUC validity, merge backlog, validation queue depth, and total triple/predicate counts (see the sketch after this list).
- Verify upgrade completeness — `_pg_ripple.schema_version` is stamped on install and on every `ALTER EXTENSION … UPDATE`; use `SELECT * FROM _pg_ripple.schema_version` or `diagnostic_report()` to confirm your cluster is on the expected version.
- Configure tombstone GC — two new GUCs: `pg_ripple.tombstone_gc_enabled` (bool, default `on`) and `pg_ripple.tombstone_gc_threshold` (float string, default `0.05`). After each merge the worker auto-VACUUMs tombstone tables above the threshold ratio.
- Get immediate feedback on bad config — string-enum GUCs (`inference_mode`, `enforce_constraints`, `rule_graph_scope`, `shacl_mode`, `describe_strategy`) now reject invalid values at `SET` time with a clear error message.
- Prevent session-level RLS bypass — `pg_ripple.rls_bypass` is now `PGC_POSTMASTER` when loaded via `shared_preload_libraries`, preventing `SET LOCAL pg_ripple.rls_bypass = on` exploits.
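A minimal sketch of the health-check workflow; the threshold value is illustrative:

```sql
-- One-call health report: schema version, GUC validity, merge backlog, …
SELECT * FROM pg_ripple.diagnostic_report();
-- Confirm the upgrade stamp directly (requires sufficient privileges):
SELECT * FROM _pg_ripple.schema_version;
-- Tune tombstone GC; the threshold is a float passed as a string:
SET pg_ripple.tombstone_gc_threshold = '0.10';
```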
What happens behind the scenes
- `src/storage/merge.rs` — per-predicate `pg_advisory_xact_lock` wrapping the delta→main swap; the `_pg_ripple.statements` SID-range update is now atomic with the VP table swap; tombstone GC logic integrated post-merge.
- `src/storage/mod.rs` — `delete_triple()` acquires a shared advisory lock before tombstone insert; `promote_predicate()` acquires an exclusive advisory lock.
- `src/shmem.rs` — all bloom filter counter decrements use `saturating_sub(1)`.
- `src/sparql/optimizer.rs`, `src/sparql/sqlgen.rs`, `src/export.rs`, `pg_ripple_http/src/main.rs` — all `.unwrap()` / `.expect()` calls in non-test code replaced with `pgrx::error!()` or graceful `process::exit(1)` patterns.
- `src/lib.rs` — `#![cfg_attr(not(test), deny(clippy::unwrap_used, clippy::expect_used))]`; GUC check_hook validators for 5 string-enum GUCs; new `diagnostic_report()` pg_extern; `schema_version` bootstrap table; tombstone GC GUC statics + registrations; `rls_bypass` conditional context.
- New migration script: `sql/pg_ripple--0.36.0--0.37.0.sql`.
- New pg_regress tests: `storage_tombstone_gc.sql`, `diagnostic_report.sql`.
- Documentation: troubleshooting.md "Lost deletes after merge" runbook; guc-reference.md v0.37.0 section; upgrading.md schema_version stamp guide.
[0.36.0] — 2026-04-25 — Worst-Case Optimal Joins & Lattice-Based Datalog
Leapfrog Triejoin for cyclic SPARQL patterns and monotone lattice aggregation for Datalog^L.
What you can do
- Accelerate triangle and cyclic graph queries — when `pg_ripple.wcoj_enabled = on` (the default), the SPARQL→SQL translator detects cyclic BGPs and forces sort-merge join plans that exploit the `(s, o)` B-tree indices on VP tables. Triangle queries that previously timed out complete in milliseconds.
- Inspect cyclic patterns — `pg_ripple.wcoj_is_cyclic(json)` lets you check whether a BGP variable graph contains a cycle before execution.
- Benchmark WCOJ — `pg_ripple.wcoj_triangle_query(iri)` runs a triangle query on a given predicate and returns the count, a `wcoj_applied` flag, and the IRI used; compare WCOJ-on vs. WCOJ-off with `benchmarks/wcoj.sql`.
- Write recursive aggregation rules — `pg_ripple.create_lattice()` registers a user-defined lattice type, and `pg_ripple.infer_lattice()` runs a monotone fixpoint over rules that use it. Built-in lattices: `min`, `max`, `set`, `interval` (see the sketch after this list).
- Trust propagation and shortest paths — lattice rules like `?x ex:trust (MIN ?t1 ?t2) :- ?x ex:knows ?y, ?y ex:trust ?t1` converge to correct fixed points without manual loop unrolling.
- Guaranteed termination — fixpoints are bounded by `pg_ripple.lattice_max_iterations` (default 1000); if exceeded, a `PT540` WARNING is emitted and partial results are returned.
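A minimal sketch of the benchmark helper and the lattice API. The predicate IRI, lattice name, join function, and rule set name are illustrative; the signatures follow the tables under Technical Details below:

```sql
-- Triangle count over one predicate, with a wcoj_applied flag in the result:
SELECT pg_ripple.wcoj_triangle_query('http://xmlns.com/foaf/0.1/knows');

-- Register a user-defined lattice: create_lattice(name, join_fn, bottom).
-- 'least' stands in for whatever SQL join function the deployment provides.
SELECT pg_ripple.create_lattice('shortest_path', 'least', '2147483647');
SELECT pg_ripple.list_lattices();

-- Run a monotone fixpoint over rules that use the built-in min lattice:
SELECT pg_ripple.infer_lattice('trust_rules', 'min');
```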
What happens behind the scenes
- `src/sparql/wcoj.rs` (new module) — cyclic BGP detection via variable adjacency graph DFS; a WCOJ SQL rewriter that wraps cyclic patterns in materialized CTEs with sort-merge join hints; `run_triangle_query()` benchmark helper.
- `src/datalog/lattice.rs` (new module) — lattice type catalog (`_pg_ripple.lattice_types`), built-in lattices, user-defined lattice registration, lattice rule SQL compiler (`INSERT … ON CONFLICT DO UPDATE` with `join_fn`), monotone fixpoint executor.
- `src/lib.rs` — three new GUCs registered in `_PG_init()`: `pg_ripple.wcoj_enabled`, `pg_ripple.wcoj_min_tables`, `pg_ripple.lattice_max_iterations`. Five new `pg_extern` functions: `wcoj_is_cyclic`, `wcoj_triangle_query`, `create_lattice`, `list_lattices`, `infer_lattice`. A new `extension_sql!` block `v036_lattice_types` creates the lattice catalog and seeds the built-ins.
- New migration script: `sql/pg_ripple--0.35.0--0.36.0.sql`.
- New benchmark: `benchmarks/wcoj.sql`.
- New pg_regress tests: `sparql_wcoj.sql`, `datalog_lattice.sql`.
- New documentation: `reference/lattice-datalog.md`; `user-guide/sql-reference/datalog.md` updated; `user-guide/best-practices/sparql-performance.md` updated.
Technical Details
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.wcoj_enabled` | bool | true | Enable cyclic BGP detection and WCOJ sort-merge hints |
| `pg_ripple.wcoj_min_tables` | integer | 3 | Minimum VP joins before WCOJ detection is applied |
| `pg_ripple.lattice_max_iterations` | integer | 1000 | Max fixpoint iterations for lattice inference |
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `wcoj_is_cyclic(json)` | boolean | Detect cycle in a BGP variable graph |
| `wcoj_triangle_query(iri)` | jsonb | Run a triangle query with WCOJ benchmark stats |
| `create_lattice(name, join_fn, bottom)` | boolean | Register a user-defined lattice type |
| `list_lattices()` | jsonb | List all registered lattice types |
| `infer_lattice(rule_set, lattice_name)` | jsonb | Run monotone lattice fixpoint |
Error codes
- `PT540` — lattice fixpoint did not converge within `lattice_max_iterations`.
Schema changes
New catalog table _pg_ripple.lattice_types with columns name, join_fn, bottom, builtin, created_at.
[0.35.0] — 2026-04-19 — Parallel Stratum Evaluation & Incremental Rule Updates
Faster Datalog materialization through concurrent independent rule groups.
What you can do
- Speed up OWL RL and large ontology closures — rules in the same stratum that derive different predicates with no shared body dependencies are now partitioned into independent groups and evaluated in parallel. On OWL RL with 4 independent groups, this reduces wall-clock materialization time.
- See how parallel your rule set is — `pg_ripple.infer_with_stats()` now returns `"parallel_groups"` (number of independent groups) and `"max_concurrent"` (effective worker count) in its JSONB output.
- Tune for your hardware — two new GUCs control parallelism: `pg_ripple.datalog_parallel_workers` (default `4`) and `pg_ripple.datalog_parallel_threshold` (default `10000` rows) give fine-grained control over when and how much parallelism is applied (see the sketch after this list).
- SPARQL freshness after bulk loads — parallel evaluation shortens the time from data ingestion to full materialization, and with it the staleness window for SPARQL queries over derived predicates.
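A minimal sketch of the tuning knobs; the rule set name `'owl_rl'` is illustrative:

```sql
SET pg_ripple.datalog_parallel_workers = 8;      -- 1 = serial
SET pg_ripple.datalog_parallel_threshold = 5000; -- min estimated rows
SELECT pg_ripple.infer_with_stats('owl_rl');
-- → {"derived": …, "iterations": …, "parallel_groups": …, "max_concurrent": …}
```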
What happens behind the scenes
- `src/datalog/parallel.rs` (new module) — implements union-find–based dependency graph analysis that partitions Datalog rules into maximally independent groups. Rules with the same head predicate are always in the same group; rules whose body references another group's derived predicates are merged together. Variable-predicate rules (e.g., OWL RL SymmetricProperty) form a separate serial group.
- `src/datalog/mod.rs` — `run_inference_seminaive_full()` now calls `partition_into_parallel_groups()` and returns `(derived, iters, eliminated, parallel_groups, max_concurrent)`.
- `src/lib.rs` — two new GUC parameters registered in `_PG_init()`: `pg_ripple.datalog_parallel_workers` and `pg_ripple.datalog_parallel_threshold`. `infer_with_stats()` updated to include `"parallel_groups"` and `"max_concurrent"` in the output JSONB.
- New pg_regress test: `datalog_parallel.sql` — all 119 tests pass.
Technical Details
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.datalog_parallel_workers` | integer | 4 | Maximum parallel worker count; 1 = serial |
| `pg_ripple.datalog_parallel_threshold` | integer | 10000 | Min estimated row count before analysis is applied |
infer_with_stats() output additions
{
"derived": 1240,
"iterations": 4,
"eliminated_rules": [],
"parallel_groups": 3,
"max_concurrent": 3
}
Algorithm
The partition_into_parallel_groups() function:
- Groups rules by head predicate (rules with the same derived predicate share a write target).
- Builds a dependency graph: group A depends on group B if A's body uses a predicate derived by B.
- Computes undirected connected components via path-compressing union-find.
- Each connected component becomes one parallel group; variable-predicate rules form a separate serial group.
Migration
sql/pg_ripple--0.34.0--0.35.0.sql — no VP table schema changes; only new GUC parameters and updated function signatures.
[0.34.0] — 2026-05-03 — Bounded-Depth Termination & Incremental Retraction (DRed)
Smarter fixpoint termination and write-correct incremental maintenance.
What you can do
- Cap inference depth — set `pg_ripple.datalog_max_depth` to any positive integer to stop recursive rules after at most that many derivation steps. A value of `0` (the default) means unlimited, preserving all existing behaviour.
- Add or remove rules without full recompute — `pg_ripple.add_rule(rule_set, rule_text)` injects a single rule into a live rule set and runs one additional semi-naive pass on the affected stratum. `pg_ripple.remove_rule(rule_id)` retracts the rule and surgically removes derived facts that are no longer supported (see the sketch after this list).
- Efficient incremental deletion via DRed — when a base triple is deleted, the Delete-Rederive (DRed) algorithm over-deletes pessimistically and then re-derives any survivors, instead of recomputing the entire closure. Controlled by `pg_ripple.dred_enabled` (default `true`) and `pg_ripple.dred_batch_size` (default `1000`).
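A minimal sketch of bounded depth plus live rule edits; the rule text and rule id are illustrative:

```sql
SET pg_ripple.datalog_max_depth = 8;  -- stop recursion after 8 derivation steps

-- Inject one rule and run a single extra semi-naive pass on its stratum:
SELECT pg_ripple.add_rule(
    'custom',
    '?x <http://example.org/ancestor> ?z :- ?x <http://example.org/parent> ?y, ?y <http://example.org/ancestor> ?z .'
);
-- Retract a rule by the id assigned at load time (42 is illustrative):
SELECT pg_ripple.remove_rule(42);
```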
What happens behind the scenes
- `src/datalog/compiler.rs` — `compile_recursive_rule()` reads `pg_ripple.datalog_max_depth` at compile time. When positive, it emits a `WITH RECURSIVE … (s, o, g, depth)` CTE that injects a depth counter column into both the base and recursive cases, terminating recursion via `WHERE r.depth < max_depth`.
- `src/datalog/dred.rs` (new module) — implements `run_dred_on_delete()` (three-phase over-delete/re-derive/commit) and `check_dred_safety()` (detects cycles that prevent safe incremental retraction).
- `src/datalog/mod.rs` — exposes `add_rule_to_set()` and `remove_rule_by_id()`.
- `src/lib.rs` — three new GUC parameters registered in `_PG_init()`: `pg_ripple.datalog_max_depth`, `pg_ripple.dred_enabled`, `pg_ripple.dred_batch_size`. Three new `#[pg_extern]` functions: `add_rule()`, `remove_rule()`, `dred_on_delete()`.
- New pg_regress tests: `datalog_bounded_depth.sql`, `datalog_dred.sql`, `datalog_incremental_rules.sql` — all 118 tests pass.
Migration
sql/pg_ripple--0.33.0--0.34.0.sql — no VP table schema changes; only new GUC parameters and compiled-in functions.
[0.33.0] — 2026-04-19 — Documentation Site & Content Overhaul
pg_ripple's documentation is rebuilt from the ground up. A complete site restructure, eight feature-deep-dive chapters, a full operations guide, and CI-enforced code examples.
What you can do
- Find answers fast — the documentation is reorganized into four clear sections: Getting Started, Feature Deep Dives, Operations, and Reference. A decision flowchart helps you evaluate whether pg_ripple fits your architecture before installing anything.
- Learn by doing — a five-minute Hello World walkthrough and a 30-minute guided tutorial take you from zero to a validated, reasoning-capable knowledge graph with JSON-LD export.
- Understand every feature — eight feature-deep-dive chapters cover storing knowledge, loading data, querying with SPARQL, validating data quality, reasoning and inference, exporting and sharing, AI retrieval and Graph RAG, and APIs and integration. Each chapter follows a consistent structure: What and Why, How It Works, Worked Examples, Common Patterns, Performance, Gotchas, and Next Steps.
- Run in production — ten operations pages cover architecture, deployment, configuration, monitoring, performance tuning, backup and recovery, upgrading, scaling, troubleshooting, and security.
- Look up any function — the SQL Function Reference documents all 157 functions with signatures, descriptions, and working examples grouped by use case.
What happens behind the scenes
This is a documentation-only release. No SQL functions, GUC parameters, VP table schemas, or Rust code changed. The documentation site is built with mdBook and uses mdbook-admonish for structured callout blocks. A CI test harness (scripts/test_docs.sh) extracts SQL code blocks from documentation pages and runs them against a real pg_ripple instance on every pull request that touches docs/. A coverage script (scripts/check_docs_coverage.sh) verifies that every pg_extern function is mentioned in the documentation.
Technical Details
New files
| File | Purpose |
|---|---|
| `scripts/test_docs.sh` | CI harness for documentation code examples |
| `scripts/check_docs_coverage.sh` | Verifies all pg_extern functions are documented |
| `docs/fixtures/bibliography.sql` | Shared bibliographic fixture dataset |
| `.github/workflows/docs-test.yml` | CI workflow for documentation tests and link checking |
| `.github/PULL_REQUEST_TEMPLATE.md` | PR template with docs-gap reminder |
Site structure
The documentation is restructured from a flat list of pages into a four-section information architecture:
- Getting Started: Installation, Hello World, Guided Tutorial, Key Concepts
- Feature Deep Dives: 8 chapters (§2.1–§2.8) following a consistent seven-part structure
- Operations: 10 pages covering deployment through security
- Reference: SQL Function Reference, SPARQL Compliance Matrix, Error Catalog, FAQ, Glossary, Contributing
mdbook-admonish
book.toml updated with [preprocessor.admonish] and [output.linkcheck]. All new pages use fenced admonish callout syntax.
Migration
Run ALTER EXTENSION pg_ripple UPDATE TO '0.33.0' (applies sql/pg_ripple--0.32.0--0.33.0.sql — no schema changes).
[0.32.0] — 2026-04-19 — Well-Founded Semantics & Tabling
pg_ripple handles non-stratifiable Datalog programs and caches repeated inference results. All pg_regress tests pass (3 new tests for v0.32.0 features).
What you can do
- Well-founded semantics — `pg_ripple.infer_wfs(rule_set TEXT DEFAULT 'custom')` runs an alternating-fixpoint algorithm over the rule set and returns a JSONB object with `certain`, `unknown`, `derived`, `iterations`, and `stratifiable` keys; for programs with mutual negation cycles (non-stratifiable), facts that cannot be resolved to true or false receive unknown status rather than causing an error (see the sketch after this list).
- Non-stratifiable rule loading — `load_rules()` now accepts rule sets with cyclic negation; rules are stored at stratum 0 and deferred to `infer_wfs()` for evaluation.
- Tabling / memoisation — when `pg_ripple.tabling = on` (the default), results of `infer_wfs()` are stored in `_pg_ripple.tabling_cache`, keyed by an XXH3-64 hash of the goal string, and served from cache on repeated calls within the TTL.
- Cache invalidation — the tabling cache is automatically cleared on `insert_triple()`, `delete_triple()`, `drop_rules()`, and `load_rules()`.
- Cache statistics — `pg_ripple.tabling_stats()` returns per-entry statistics: `goal_hash`, `hits`, `computed_ms`, `cached_at`.
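A minimal sketch using the documented default rule set:

```sql
-- Alternating-fixpoint evaluation; safe even with mutual negation cycles:
SELECT pg_ripple.infer_wfs('custom');
-- → {"certain": …, "unknown": …, "derived": …, "iterations": …, "stratifiable": …}

-- Repeated calls within the TTL are served from the tabling cache:
SELECT * FROM pg_ripple.tabling_stats();  -- goal_hash, hits, computed_ms, cached_at
```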
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.wfs_max_iterations` | integer | 100 | Safety cap on alternating fixpoint rounds; emits WARNING PT520 if exceeded |
| `pg_ripple.tabling` | bool | true | Enable tabling / memoisation cache |
| `pg_ripple.tabling_ttl` | integer | 300 | Cache entry TTL in seconds; 0 = no expiry |
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.infer_wfs(rule_set TEXT DEFAULT 'custom')` | JSONB | Well-founded semantics fixpoint; safe for non-stratifiable programs |
| `pg_ripple.tabling_stats()` | TABLE(goal_hash BIGINT, hits BIGINT, computed_ms FLOAT8, cached_at TEXT) | Tabling cache statistics |
Migration
Run ALTER EXTENSION pg_ripple UPDATE TO '0.32.0' (applies sql/pg_ripple--0.31.0--0.32.0.sql which creates _pg_ripple.tabling_cache).
[0.31.0] — 2026-04-19 — Entity Resolution & Demand Transformation
pg_ripple's Datalog engine gains owl:sameAs entity canonicalization and demand-filtered inference. All pg_regress tests pass (2 new tests for v0.31.0 features).
What you can do
- `owl:sameAs` reasoning — when `pg_ripple.sameas_reasoning = on` (the default), the inference engine automatically identifies equivalent entities via `owl:sameAs` triples and rewrites rule-body constants to their canonical (lowest-ID) representative before each fixpoint iteration; SPARQL queries referencing non-canonical aliases are transparently redirected to the canonical entity.
- Demand-filtered inference — `pg_ripple.infer_demand(rule_set, demands JSONB)` accepts a JSON array of goal patterns and derives only the facts needed to answer those goals; for programs with many rules and multiple derived predicates, this can reduce inference work by 50–90% (see the sketch after this list).
- Multi-goal demand sets — unlike `infer_goal()` (single predicate), `infer_demand()` accepts multiple demand predicates simultaneously and computes a joint demand set via fixed-point propagation through the dependency graph; mutually recursive rules with multiple entry points are handled correctly.
- Demand + sameAs composition — `infer_demand()` applies the sameAs canonicalization pre-pass before running demand-filtered inference, combining both optimizations in one call.
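A minimal sketch of a multi-goal demand set; the predicate IRIs are illustrative, and the `demands` format follows the function table below:

```sql
SELECT pg_ripple.infer_demand(
    'custom',
    '[{"p": "<http://example.org/ancestor>"},
      {"p": "<http://example.org/manages>"}]'::jsonb
);
```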
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.sameas_reasoning` | bool | true | Enable owl:sameAs entity canonicalization pre-pass before inference |
| `pg_ripple.demand_transform` | bool | true | Auto-apply demand transformation in create_datalog_view() with multiple goals |
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.infer_demand(rule_set TEXT DEFAULT 'custom', demands JSONB)` | JSONB | Run demand-filtered inference; `demands` is `[{"p": "<iri>"}, …]`; empty array = full inference |
Migration
No schema changes. Run ALTER EXTENSION pg_ripple UPDATE to upgrade from v0.30.0.
[0.30.0] — 2026-04-19 — Datalog Aggregation & Compiled Rule Plans
pg_ripple's Datalog engine gains Datalog^agg (aggregate literals in rule bodies) and a process-local rule plan cache. All pg_regress tests pass (3 new tests for v0.30.0 features).
What you can do
- Aggregate inference — `pg_ripple.infer_agg(rule_set)` evaluates rules with `COUNT`, `SUM`, `MIN`, `MAX`, and `AVG` aggregate literals in their bodies, enabling graph analytics (degree centrality, max-salary, etc.) directly from Datalog rules; returns `{"derived": N, "aggregate_derived": K, "iterations": I}` (see the sketch after this list).
- Aggregate rule syntax — `?x <ex:count> ?n :- COUNT(?y WHERE ?x <foaf:knows> ?y) = ?n .`
- Aggregation stratification checking — the stratifier rejects cycles through aggregation (PT510 warning); violating rule sets fall back to non-aggregate inference automatically.
- Rule plan cache — compiled SQL for each rule set is cached process-locally; second and subsequent `infer_agg()` calls on the same rule set hit the cache; `pg_ripple.rule_plan_cache_stats()` exposes hit/miss counts.
- Cache invalidation — `load_rules()` and `drop_rules()` automatically invalidate the cache for the modified rule set.
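A minimal sketch, assuming an aggregate rule in the syntax shown above has already been loaded into the `'custom'` rule set:

```sql
SELECT pg_ripple.infer_agg('custom');
-- → {"derived": …, "aggregate_derived": …, "iterations": …}

-- A second call on the same rule set should register as a plan-cache hit:
SELECT pg_ripple.infer_agg('custom');
SELECT * FROM pg_ripple.rule_plan_cache_stats();
```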
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.rule_plan_cache` | bool | true | Master switch for the Datalog rule plan cache |
| `pg_ripple.rule_plan_cache_size` | int | 64 | Maximum rule sets in plan cache (1–4096); evicts LFU entry on overflow |
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.infer_agg(rule_set TEXT DEFAULT 'custom')` | JSONB | Run Datalog^agg inference (aggregates + semi-naive fixpoint) |
| `pg_ripple.rule_plan_cache_stats()` | TABLE(rule_set TEXT, hits BIGINT, misses BIGINT, entries INT) | Show plan cache statistics per rule set |
New error codes
| Code | Name | Description |
|---|---|---|
| PT510 | AggStratificationViolation | Aggregate rule creates a cycle through aggregation; rule is skipped |
| PT511 | UnsupportedAggFunc | Unsupported aggregate function in rule body |
Migration
No schema changes. Run ALTER EXTENSION pg_ripple UPDATE to upgrade.
[0.29.0] — 2026-04-20 — Datalog Optimization: Magic Sets & Cost-Based Compilation
pg_ripple's Datalog engine gains goal-directed inference (magic sets), cost-based join reordering, anti-join negation, predicate-filter pushdown, delta-table indexing, and redundant-rule elimination. All pg_regress tests pass (6 new tests for v0.29.0 features).
What you can do
- Goal-directed inference — `pg_ripple.infer_goal(rule_set, goal)` derives only the facts relevant to a specific triple pattern (magic sets transformation); returns `{"derived": N, "iterations": K, "matching": M}` (see the sketch after this list).
- Cost-based join reordering — Datalog body atoms are sorted by ascending VP-table cardinality at compile time; set `pg_ripple.datalog_cost_reorder = off` to disable.
- Anti-join negation — negated body atoms with large VP tables compile to `LEFT JOIN … IS NULL` instead of `NOT EXISTS`; controlled by `pg_ripple.datalog_antijoin_threshold` (default 1000).
- Predicate-filter pushdown — arithmetic/comparison guards are moved into `JOIN … ON` clauses to enable index scans.
- Delta-table indexing — after semi-naive iteration, a B-tree index on `(s, o)` is created when the delta table exceeds `pg_ripple.delta_index_threshold` rows (default 500).
- Subsumption checking — redundant rules (whose body predicates are a superset of another rule's body) are eliminated at compile time; `infer_with_stats()` now reports `"eliminated_rules": [...]`.
- New error codes — PT501 (magic sets circular binding), PT502 (cost-based reordering skipped).
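A minimal sketch of goal-directed inference; the goal pattern text is illustrative (the documented shape is `infer_goal(rule_set TEXT, goal TEXT)`):

```sql
SELECT pg_ripple.infer_goal(
    'custom',
    '?x <http://example.org/ancestor> <http://example.org/alice>'
);
-- → {"derived": …, "iterations": …, "matching": …}
```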
New GUC parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| `pg_ripple.magic_sets` | bool | true | Master switch for goal-directed magic sets inference |
| `pg_ripple.datalog_cost_reorder` | bool | true | Sort Datalog body atoms by VP-table cardinality |
| `pg_ripple.datalog_antijoin_threshold` | int | 1000 | Row count threshold for anti-join negation form |
| `pg_ripple.delta_index_threshold` | int | 500 | Row count threshold for delta table B-tree index |
New SQL functions
| Function | Description |
|---|---|
| `pg_ripple.infer_goal(rule_set TEXT, goal TEXT) → JSONB` | Goal-directed inference returning derived/matching counts |
Changed SQL functions
- `pg_ripple.infer_with_stats(rule_set TEXT) → JSONB` — now includes an `"eliminated_rules": [...]` array in the returned JSONB
[0.28.0] — 2026-04-18 — Advanced Hybrid Search & RAG Pipeline
pg_ripple completes its hybrid search stack with Reciprocal Rank Fusion, graph-contextualized embeddings, end-to-end RAG retrieval, incremental embedding, multi-model support, and SPARQL federation with external vector services. All pg_regress tests pass (6 new tests for v0.28.0 features).
What you can do
- Hybrid search with RRF fusion — `pg_ripple.hybrid_search(sparql_query, query_text, k)` combines a SPARQL candidate set with pgvector k-NN results using Reciprocal Rank Fusion; returns ranked entities with `rrf_score`, `sparql_rank`, and `vector_rank` (see the sketch after this list).
- End-to-end RAG retrieval — `pg_ripple.rag_retrieve('what treats headaches?', k := 5)` does the full RAG dance in one call: vector search, optional SPARQL filter, neighborhood contextualization, and structured JSONB output ready for an LLM system prompt.
- JSON-LD framing for LLM context — `rag_retrieve(... output_format := 'jsonld')` returns `context_json` with `@type` and `@context` keys using the registered prefix map; plug directly into OpenAI structured outputs.
- Graph-contextualized embeddings — `pg_ripple.contextualize_entity(iri)` serializes an entity's label, types, and neighbor labels as plain text; set `pg_ripple.use_graph_context = on` to use this for all `embed_entities()` calls.
- Incremental embedding worker — set `pg_ripple.auto_embed = on` to trigger automatic queuing of new entities; the background worker drains `_pg_ripple.embedding_queue` in batches.
- Multi-model support — `pg_ripple.list_embedding_models()` enumerates all models in `_pg_ripple.embeddings`; all search/retrieve functions accept an optional `model` parameter.
- SPARQL federation with external vector services — `pg_ripple.register_vector_endpoint(url, api_type)` registers Weaviate, Qdrant, or Pinecone endpoints; these can be queried alongside local triples in SPARQL SERVICE clauses.
- SHACL embedding completeness — `pg_ripple.add_embedding_triples()` materialises `pg:hasEmbedding` triples; the included SHACL shape validates completeness via `sh:minCount 1`.
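A minimal sketch of the two retrieval entry points; the candidate query and question text are illustrative, and the full signatures are listed under Added below:

```sql
-- RRF fusion of a SPARQL candidate set with pgvector k-NN:
SELECT * FROM pg_ripple.hybrid_search(
    'SELECT ?s WHERE { ?s a <http://example.org/Drug> }',  -- SPARQL candidates
    'anti-inflammatory drugs',                             -- vector query text
    10                                                     -- k
);

-- One-call RAG retrieval, shaped for an LLM system prompt:
SELECT * FROM pg_ripple.rag_retrieve('what treats headaches?', k := 5);
```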
Added
- `pg_ripple.hybrid_search(sparql_query TEXT, query_text TEXT, k INT DEFAULT 10, alpha FLOAT8 DEFAULT 0.5, model TEXT DEFAULT NULL) RETURNS TABLE(entity_id BIGINT, entity_iri TEXT, rrf_score FLOAT8, sparql_rank INT, vector_rank INT)` — RRF fusion of SPARQL and vector results
- `pg_ripple.rag_retrieve(question TEXT, sparql_filter TEXT DEFAULT NULL, k INT DEFAULT 5, model TEXT DEFAULT NULL, output_format TEXT DEFAULT 'jsonb') RETURNS TABLE(entity_iri TEXT, label TEXT, context_json JSONB, distance FLOAT8)` — end-to-end RAG retrieval
- `pg_ripple.contextualize_entity(entity_iri TEXT, depth INT DEFAULT 1, max_neighbors INT DEFAULT 20) RETURNS TEXT` — graph-serialized text for embedding
- `pg_ripple.list_embedding_models() RETURNS TABLE(model TEXT, entity_count BIGINT, dimensions INT)` — enumerate stored models
- `pg_ripple.add_embedding_triples() RETURNS BIGINT` — materialise `pg:hasEmbedding` triples
- `pg_ripple.register_vector_endpoint(url TEXT, api_type TEXT) RETURNS VOID` — register an external vector service (`pgvector`, `weaviate`, `qdrant`, `pinecone`)
- `_pg_ripple.embedding_queue` table — incremental embedding queue (v0.28.0)
- `_pg_ripple.vector_endpoints` table — external vector service catalog
- `_pg_ripple.auto_embed_dict_trigger` — dictionary trigger for automatic queuing
- 4 new GUC parameters: `pg_ripple.auto_embed`, `pg_ripple.embedding_batch_size`, `pg_ripple.use_graph_context`, `pg_ripple.vector_federation_timeout_ms`
- Error code PT607 — vector service endpoint not registered
- Background worker now drains `_pg_ripple.embedding_queue` when `pg_ripple.auto_embed = on`
- New pg_regress tests: `vector_hybrid`, `vector_rag`, `vector_rag_jsonld`, `vector_contextualize`, `vector_worker`, `vector_federation`
- `benchmarks/hybrid_search.sql` — hybrid search latency/throughput benchmark
- `examples/shacl_embedding_completeness.ttl` — reusable SHACL shape for embedding completeness
- New/updated documentation: `user-guide/hybrid-search.md`, `user-guide/rag.md`, `user-guide/vector-federation.md`, `reference/embedding-functions.md`, `reference/http-api.md`
Migration
Run sql/pg_ripple--0.27.0--0.28.0.sql on existing installations. Creates _pg_ripple.embedding_queue and _pg_ripple.vector_endpoints tables plus the auto_embed_dict_trigger trigger. No VP table schema changes.
[0.27.0] — 2026-04-18 — Vector + SPARQL Hybrid: Foundation
pg_ripple gains pgvector integration: store high-dimensional embeddings for any RDF entity, search by semantic similarity, and mix vector nearest-neighbour search with SPARQL graph patterns in a single in-process query. All 95 pg_regress tests pass (8 new tests for v0.27.0 features).
What you can do
- Store embeddings for RDF entities — `pg_ripple.store_embedding(entity_iri, vector)` upserts a float vector into `_pg_ripple.embeddings`; no API call needed when you supply pre-computed embeddings.
- Find semantically similar entities — `pg_ripple.similar_entities('anti-inflammatory drugs', k := 5)` calls your embedding API, then returns the 5 entities with the lowest cosine distance.
- Batch-embed an entire graph — `pg_ripple.embed_entities()` iterates over entities with `rdfs:label`, calls the API in batches, and stores all results in one transaction.
- Keep embeddings fresh — `pg_ripple.refresh_embeddings()` re-embeds entities whose labels changed since the last embedding run; schedule via `pg_cron`.
- Hybrid SPARQL queries — use `pg:similar(?entity, "search text", 10)` inside SPARQL `BIND` expressions; combine with FILTER, OPTIONAL, UNION, and any other SPARQL feature (see the sketch after this list).
- Run in CI without pgvector — every embedding function degrades gracefully with a WARNING (no ERROR) when pgvector is absent; all 8 new tests pass in environments without pgvector.
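A minimal sketch of the foundation API. The 3-element vector is illustrative (real deployments use the configured `pg_ripple.embedding_dimensions`), and the exact query shape around `pg:similar` is a guess based on the description above:

```sql
-- Store a pre-computed embedding, then search by text:
SELECT pg_ripple.store_embedding(
    'http://example.org/aspirin',
    ARRAY[0.12, -0.03, 0.88]::float8[]
);
SELECT * FROM pg_ripple.similar_entities('anti-inflammatory drugs', k := 5);

-- Hybrid query: vector k-NN bound inside a SPARQL BIND expression:
SELECT * FROM pg_ripple.sparql('
  PREFIX pg: <http://pg-ripple.org/functions/>
  SELECT ?entity WHERE {
    BIND(pg:similar(?entity, "pain relief", 10) AS ?hit)
  }
');
```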
Added
- `_pg_ripple.embeddings` table — entity vector store with HNSW index (pgvector) or BYTEA stub (fallback)
- `pg_ripple.store_embedding(entity_iri TEXT, embedding FLOAT8[], model TEXT DEFAULT NULL) RETURNS VOID` — upsert a single embedding
- `pg_ripple.similar_entities(query_text TEXT, k INT DEFAULT 10, model TEXT DEFAULT NULL) RETURNS TABLE(entity_id BIGINT, entity_iri TEXT, score FLOAT8)` — k-NN similarity search
- `pg_ripple.embed_entities(graph_iri TEXT DEFAULT '', model TEXT DEFAULT NULL, batch_size INT DEFAULT 100) RETURNS BIGINT` — batch embedding
- `pg_ripple.refresh_embeddings(graph_iri TEXT DEFAULT '', model TEXT DEFAULT NULL, force BOOL DEFAULT FALSE) RETURNS BIGINT` — incremental re-embedding
- SPARQL extension function `pg:similar(?entity, "text", k)` via IRI `http://pg-ripple.org/functions/similar`
- 7 new GUC parameters: `pg_ripple.pgvector_enabled`, `pg_ripple.embedding_api_url`, `pg_ripple.embedding_api_key`, `pg_ripple.embedding_model`, `pg_ripple.embedding_dimensions`, `pg_ripple.embedding_index_type`, `pg_ripple.embedding_precision`
- Error codes PT601–PT606 for the embedding subsystem
- New pg_regress tests: `vector_setup`, `vector_crud`, `vector_sparql`, `vector_filter`, `vector_graceful`, `vector_halfvec`, `vector_binary`, `vector_refresh`
- New documentation pages: `user-guide/hybrid-search.md`, `reference/embedding-functions.md`, `reference/guc-reference.md`
Migration
Run sql/pg_ripple--0.26.0--0.27.0.sql on existing installations. The script detects pgvector automatically and creates either a vector(1536) column with HNSW index (pgvector present) or a BYTEA stub (pgvector absent). No VP table schema changes.
[0.26.0] — 2026-04-18 — GraphRAG Integration
pg_ripple becomes a first-class backend for Microsoft GraphRAG: store LLM-extracted entities and relationships as RDF triples, enrich the graph with Datalog rules, enforce quality with SHACL shapes, and export back to Parquet for GraphRAG's BYOG (Bring Your Own Graph) pipeline. All 87 pg_regress tests pass (5 new tests for v0.26.0 features).
What you can do
- Use pg_ripple as your GraphRAG knowledge graph — store entities, relationships, and text units as native RDF triples; query them with SPARQL; update incrementally via the HTAP delta partition
- Export to Parquet for GraphRAG BYOG — `pg_ripple.export_graphrag_entities()`, `export_graphrag_relationships()`, and `export_graphrag_text_units()` write Parquet files exactly matching GraphRAG's input schema (see the sketch after this list).
- Derive implicit relationships with Datalog — load `graphrag_enrichment_rules.pl` and run `pg_ripple.infer('graphrag_enrichment')` to materialise `gr:coworker`, `gr:collaborates`, `gr:indirectReport`, and `gr:relatedOrg` triples that the LLM extraction missed.
- Enforce data quality with SHACL — `graphrag_shapes.ttl` defines shapes for `gr:Entity`, `gr:Relationship`, and `gr:TextUnit`; malformed LLM extractions are rejected before they reach the knowledge graph.
- Use the Python CLI bridge — `scripts/graphrag_export.py` wraps the export functions for managed PostgreSQL environments where direct file I/O is restricted; supports `--validate` and `--enrich-with-datalog` flags.
- Follow the end-to-end walkthrough — `examples/graphrag_byog.sql` demonstrates the full BYOG workflow: ontology loading, entity insertion, Datalog enrichment, SHACL validation, SPARQL query, and Parquet export.
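A minimal sketch of the enrich-then-export flow; the graph IRI and output paths are illustrative, and `examples/graphrag_byog.sql` carries the authoritative walkthrough:

```sql
-- Derive the relationships the LLM extraction missed:
SELECT pg_ripple.infer('graphrag_enrichment');

-- Write Parquet files matching GraphRAG's BYOG input schema:
SELECT pg_ripple.export_graphrag_entities(
    'http://example.org/graphrag', '/tmp/entities.parquet');
SELECT pg_ripple.export_graphrag_relationships(
    'http://example.org/graphrag', '/tmp/relationships.parquet');
SELECT pg_ripple.export_graphrag_text_units(
    'http://example.org/graphrag', '/tmp/text_units.parquet');
```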
Added
- `pg_ripple.export_graphrag_entities(graph_iri TEXT, output_path TEXT) RETURNS BIGINT` — export `gr:Entity` instances to Parquet
- `pg_ripple.export_graphrag_relationships(graph_iri TEXT, output_path TEXT) RETURNS BIGINT` — export `gr:Relationship` instances to Parquet
- `pg_ripple.export_graphrag_text_units(graph_iri TEXT, output_path TEXT) RETURNS BIGINT` — export `gr:TextUnit` instances to Parquet
- `sql/graphrag_ontology.ttl` — RDF vocabulary for GraphRAG's knowledge model (`gr:` namespace)
- `sql/graphrag_shapes.ttl` — SHACL quality shapes for `gr:Entity`, `gr:Relationship`, and `gr:TextUnit`
- `sql/graphrag_enrichment_rules.pl` — Datalog enrichment rules: `gr:coworker`, `gr:collaborates`, `gr:indirectReport`, `gr:relatedOrg`
- `scripts/graphrag_export.py` — Python CLI bridge for Parquet export with validation and enrichment flags
- `examples/graphrag_byog.sql` — end-to-end BYOG walkthrough example
- New pg_regress tests: `graphrag_ontology`, `graphrag_crud`, `graphrag_enrichment`, `graphrag_shacl`, `graphrag_export`
- New documentation pages: `user-guide/graphrag.md`, `user-guide/graphrag-enrichment.md`, `reference/graphrag-ontology.md`, `reference/graphrag-functions.md`
[0.25.0] — 2026-04-18 — GeoSPARQL & Architectural Polish
pg_ripple adds GeoSPARQL 1.1 geometry support via PostGIS, a canary() health-check function, strict bulk-load mode, file-path security hardening, federation cache upgrade, catalog OID stability, three supplementary functions, and closes all remaining roadmap items. All 82 pg_regress tests pass (6 new tests for v0.25.0 features).
What you can do
- Query geographic data with GeoSPARQL — use `geo:sfIntersects`, `geo:sfContains`, `geo:sfWithin`, and 9 other topological predicates in SPARQL FILTER clauses; compute `geof:distance`, `geof:area`, `geof:boundary`; requires PostGIS (graceful no-op when absent).
- Check system health — `pg_ripple.canary()` returns `{"merge_worker": "ok"|"stalled", "cache_hit_rate": 0.0–1.0, "catalog_consistent": true|false, "orphaned_rare_rows": N}` for quick liveness checks from monitoring scripts (see the sketch after this list).
- Strict bulk loading — pass `strict := true` to any loader to abort and roll back on any parse error instead of emitting a WARNING and continuing.
- Apply RDF patches — `pg_ripple.apply_patch(data TEXT)` processes RDF Patch `A`/`D` operations for incremental sync.
- Load OWL ontologies by file — `pg_ripple.load_owl_ontology(path TEXT)` auto-detects format by extension (`.ttl`, `.nt`, `.xml`, `.rdf`, `.owl`).
- Register custom SPARQL aggregates — `pg_ripple.register_aggregate(sparql_iri TEXT, pg_function TEXT)` maps a SPARQL aggregate IRI to a PostgreSQL aggregate function.
- Bounded partial federation recovery — oversized partial responses from remote SPARQL endpoints return empty with a WARNING instead of a heuristic parse.
- pg_trickle version probe — a WARNING is emitted at startup if the installed pg_trickle version is newer than the tested version (v0.3.0).
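A minimal sketch of the health probe and strict loading; the Turtle text is illustrative:

```sql
SELECT pg_ripple.canary();
-- → {"merge_worker": "ok", "cache_hit_rate": 0.97, "catalog_consistent": true, …}

-- strict := true aborts and rolls back on the first parse error instead of
-- emitting a WARNING and continuing:
SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
ex:a ex:b "ok" .
', strict := true);
```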
What changes
- GeoSPARQL (F-5) (`src/sparql/expr.rs`): `translate_function_filter` and `translate_function_value` handle `Function::Custom` for `geo:sf*` and `geof:*` IRIs; PostGIS availability probed at query time; returns false/NULL when PostGIS is absent
- Federation cache key upgrade (H-12) (`src/sparql/federation.rs`): the `query_hash` column changed from `BIGINT` (XXH3-64) to `TEXT` (32-char hex XXH3-128 fingerprint); eliminates birthday-bound collision risk at high query volumes
- Catalog OID stability (A-5) (`src/storage/mod.rs`): `promote_predicate()` now sets `schema_name = '_pg_ripple'` and `table_name = 'vp_{id}_delta'` alongside `table_oid`; the migration script populates existing rows
- File-path security (S-8) (`src/bulk_load.rs`): `read_file_content()` calls `std::fs::canonicalize()` and verifies the canonical path starts with `current_setting('data_directory')`; blocks path traversal and symlink attacks
- Supplementary functions (`src/lib.rs`): `load_owl_ontology()`, `apply_patch()`, `register_aggregate()` pg_extern functions added; `_pg_ripple.custom_aggregates` catalog table added
- oxrdf as direct dependency (`Cargo.toml`): `oxrdf = "0.3"` added as an explicit direct dependency (was already a transitive dep via spargebra)
- `canary()` health check (`src/lib.rs`): new `#[pg_extern] fn canary() -> JsonB`
- Bulk load strict mode (`src/bulk_load.rs`, `src/lib.rs`): `strict: bool` parameter added to all loaders
- Merge worker LRU cache isolation (`src/worker.rs`): cache cleared at the end of each merge cycle
- pg_trickle version probe (`src/lib.rs`): WARNING emitted when pg_trickle is newer than the tested version
- Federation byte gate (H-13) (`src/sparql/federation.rs`): the `federation_partial_recovery_max_bytes` GUC limits heuristic recovery
- Inline decoder defensive assert (L-7) (`src/dictionary/inline.rs`): `debug_assert!(is_inline(id))` at the top of `format_inline()`
- Migration script (`sql/pg_ripple--0.24.0--0.25.0.sql`): adds `schema_name`/`table_name` to predicates, upgrades the federation_cache key, creates the custom_aggregates table
- New pg_regress tests: `bulk_load_strict.sql`, `canary.sql`, `geosparql.sql`, `federation_cache.sql`, `export_roundtrip.sql`, `supplementary_features.sql`
- Documentation: new `reference/geosparql.md`, `user-guide/geospatial.md`; updated `reference/security.md`, `user-guide/sql-reference/bulk-load.md`, `user-guide/configuration.md`
[0.24.0] — Semi-naive Datalog, Streaming Export & Performance Hardening
pg_ripple adds semi-naive Datalog evaluation with statistics, streaming triple export, SPARQL property-path depth control, BGP selectivity improvements, and fixes a correctness bug in sh:languageIn evaluation. All 76 pg_regress tests pass (3 new tests for v0.24.0 features).
What you can do
- Run inference with stats — `pg_ripple.infer_with_stats('rdfs')` runs semi-naive fixpoint evaluation and returns `{"derived": N, "iterations": K}` JSONB (see the sketch after this list).
- Export triples in batches — the internal `for_each_encoded_triple_batch` streaming API avoids holding the entire graph in memory during export; batch size is controlled by the `pg_ripple.export_batch_size` GUC (default 10 000).
- Control property-path recursion depth — the `pg_ripple.property_path_max_depth` GUC (default 64, range 1–100 000) caps how deep `+`/`*` path queries recurse.
- Enable auto-ANALYZE on merge — the `pg_ripple.auto_analyze` GUC (bool, default off) triggers a targeted `ANALYZE` after each merge cycle so the planner has fresh statistics.
- Validate `sh:languageIn` correctly — Turtle string-literal tags like `"en"` in `sh:languageIn ( "en" "de" )` now strip the surrounding quotes before comparing against the dictionary `lang` column.
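A minimal sketch; `'rdfs'` is the rule set named above, and the depth value is illustrative:

```sql
SET pg_ripple.property_path_max_depth = 128;  -- cap +/* path recursion
SELECT pg_ripple.infer_with_stats('rdfs');
-- → {"derived": …, "iterations": …}
```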
What changes
- Semi-naive Datalog evaluation (`src/datalog/mod.rs`, `src/datalog/compiler.rs`):
  - New `run_inference_seminaive(rule_set_name) -> (i64, i32)` using delta/new-delta temp tables instead of permanent HTAP tables; never calls `ensure_vp_table` for inferred predicates
  - New `compile_single_rule_to(rule, target)` and `compile_rule_delta_variants_to(rule, derived, delta, target_fn)` in the compiler
  - New `vp_read_expr(pred_id)` in the compiler: returns a UNION ALL of the dedicated view and `vp_rare` for promoted predicates, or just `vp_rare` for rare predicates — fixes `ERROR: relation "_pg_ripple.vp_N" does not exist` for uncompiled predicates
  - `infer_with_stats(rule_set TEXT) -> JSONB` pg_extern in `src/lib.rs`
  - WARNINGs emitted for rules with variable predicates (not supported in semi-naive; rule is skipped)
  - Materialized triples written to `vp_rare` with `ON CONFLICT DO NOTHING`
- Streaming export (`src/export.rs`, `src/storage/mod.rs`):
  - New `for_each_encoded_triple_batch(graph, callback)` in the storage layer using cursor-based pagination
  - `export_ntriples()` and `export_nquads()` now use the streaming path when the store exceeds the batch threshold
  - New `pg_ripple.export_batch_size` GUC (i32, default 10 000, range 100–10 000 000)
- Performance hardening:
  - BGP selectivity fallback multipliers: subject-bound → 1% of reltuples, object-bound → 5% (`src/sparql/optimizer.rs`) — avoids divide-by-zero when `pg_stats.n_distinct = 0`
  - BRIN index on the `i` column added to `vp_N_main` tables at promotion time (`src/storage/merge.rs`) — accelerates range scans by sequential ID
  - `pg_ripple.auto_analyze` GUC: when on, runs `ANALYZE vp_N_delta, vp_N_main` after each successful merge cycle
- GUC additions (`src/lib.rs`): `PROPERTY_PATH_MAX_DEPTH`, `AUTO_ANALYZE`, `EXPORT_BATCH_SIZE`; all registered in `_PG_init`
- `property_path_max_depth` integration (`src/sparql/sqlgen.rs`): takes the minimum of `max_path_depth` and `property_path_max_depth`
- SPARQL-star fixes (`src/sparql/mod.rs`): ground quoted-triple patterns in CONSTRUCT templates are now encoded correctly; `sparql_construct_rows` handles `TermPattern::Triple`
- `sh:languageIn` fix (`src/shacl/mod.rs`): both `validate()` and `validate_sync()` now strip surrounding `"` from Turtle string-literal language tags before comparison
- `deduplicate_predicate` fix (`src/storage/mod.rs`): replaced the broken `ctid::text::point[0]::int8` cast with a proper `MIN(i)`-based deduplication CTE; avoids `cannot cast type point[] to bigint` on PostgreSQL 18
- Test isolation hardening: snapshot-based cleanup (using the `i` column) in `datalog_seminaive`; namespace-scoped cleanup blocks in `property_path_depth`, `sparql_star_update`, `shacl_core_completion`, `shacl_query_hints`
- New pg_regress tests: `datalog_seminaive.sql`, `property_path_depth.sql`, `sparql_star_update.sql`
[0.23.0] — 2026-04-20 — SHACL Core Completion & SPARQL Diagnostics
pg_ripple completes the SHACL 1.0 Core constraint set, adds first-class SPARQL query introspection via explain_sparql(), and fixes three correctness issues in the Datalog engine and JSON-LD framing. All 67 pg_regress tests pass (3 new tests for v0.23.0 features).
What you can do
- Validate rich SHACL constraints — `sh:hasValue`, `sh:nodeKind`, `sh:languageIn`, `sh:uniqueLang`, `sh:lessThan`, `sh:greaterThan`, and `sh:closed` now all produce correct violations.
- Load SHACL shapes with block comments — Turtle documents containing `/* … */` block comments now parse correctly.
- Inspect generated SQL — `pg_ripple.explain_sparql(query, 'sql')` returns the SQL generated for a SPARQL query without executing it (see the sketch after this list).
- Profile slow queries — `pg_ripple.explain_sparql(query)` runs `EXPLAIN ANALYZE` on the generated SQL and returns the plan.
- View the SPARQL algebra — `pg_ripple.explain_sparql(query, 'sparql_algebra')` returns the spargebra algebra tree as formatted text.
- Get named errors for Datalog mistakes — division by zero wraps the divisor with `NULLIF`; unbound variables raise a compile-time error naming the variable and rule; negation cycles are reported as `"datalog: unstratifiable negation cycle: A → ¬B → A"`.
- Avoid JSON-LD framing panics — `CONSTRUCT` queries that return no results no longer panic in the framing layer; circular graphs with `@embed: @always` no longer loop forever.
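A minimal sketch of the three explain modes; the query text is illustrative:

```sql
-- The spargebra algebra tree, as formatted text:
SELECT pg_ripple.explain_sparql(
  'SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }', 'sparql_algebra');
-- The generated SQL, without executing it:
SELECT pg_ripple.explain_sparql(
  'SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }', 'sql');
-- EXPLAIN ANALYZE on the generated SQL:
SELECT pg_ripple.explain_sparql(
  'SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }');
```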
What changes
- SHACL Core constraints (`src/shacl/mod.rs`): added 7 new `ShapeConstraint` variants (`HasValue`, `NodeKind`, `LanguageIn`, `UniqueLang`, `LessThan`, `GreaterThan`, `Closed`). Added a `strip_block_comments()` preprocessing step. Implemented validation in `validate_property_shape()` and `run_validate()`. Sync validator updated for `NodeKind` and `LanguageIn`. Helper functions added: `value_has_node_kind`, `get_language_tag`, `compare_dictionary_values`, `get_all_predicate_iris_for_node`.
- SPARQL explain (`src/sparql/mod.rs`, `src/lib.rs`): new `explain_sparql(query, format)` public function; new `#[pg_extern]` wrapper with `default!` for the format parameter. The existing `sparql_explain(query, analyze)` remains unchanged.
- Datalog correctness (`src/datalog/compiler.rs`, `src/datalog/stratify.rs`):
  - `BodyLiteral::Assign` compilation now properly binds the computed expression to the variable via `VarMap::bind`; division wraps the denominator with `NULLIF(expr, 0)`.
  - A compile-time check in `compile_nonrecursive_rule` raises a descriptive error for unbound variables in comparisons and assignments.
  - Negation-cycle detection in `stratify.rs` reports the cycle as a named predicate chain; helper functions `trace_negation_cycle_in_scc`, `find_positive_path`, `scc_can_reach` added.
- JSON-LD framing (`src/framing/embedder.rs`):
  - M-4: replaced `roots.into_iter().next().unwrap()` with `roots.swap_remove(0)` (len == 1 already checked).
  - M-5: added a `depth_visited: &mut HashSet<String>` parameter to `build_output_node`; detects and breaks cycles under `EmbedMode::Always`.
- Tests: 3 new pg_regress test files: `shacl_core_completion.sql`, `explain_sparql.sql`, `shacl_query_hints.sql`.
[0.22.0] — 2026-04-17 — Storage Correctness & Security Hardening
pg_ripple eliminates four critical race conditions, locks down the internal schema from unprivileged users, and hardens the HTTP companion service against information-disclosure and timing attacks. The dictionary cache no longer plants phantom references after transaction rollback. The background merge process closes all known atomicity windows. Rare-predicate promotion is now atomic. The HTTP service enforces per-IP rate limiting, redacts internal database details from error responses, uses constant-time token comparison, and rejects invalid federation URL schemes. All 70 pg_regress tests pass.
What you can do
- Rely on correct cache rollback — rolled-back `insert_triple()` calls no longer leave phantom term IDs that reappear in subsequent transactions
- Avoid "relation does not exist" errors during merge — the view-rename window has been closed; concurrent queries no longer fail if they execute during an HTAP merge
- Prevent deleted facts from reappearing — the tombstone resurrection race condition is fixed; deletes committed during a merge are correctly preserved to the next cycle
- Get correct query cardinality — a triple no longer appears twice in query results if it exists in both main and delta partitions
- Rely on atomic predicate promotion — a predicate is promoted from `vp_rare` to its own VP table in a single CTE; no rows can be orphaned during concurrent inserts
- Monitor cache performance — the new `pg_ripple.cache_stats()` SQL function returns hit/miss/eviction counts and current utilisation (see the example after this list)
- Rate-limit the HTTP endpoint — set `PG_RIPPLE_HTTP_RATE_LIMIT=100` to enforce 100 req/s per source IP; excess requests receive `429 Too Many Requests` with `Retry-After`
- Keep internal errors private — all HTTP 4xx/5xx responses return `{"error": "<category>", "trace_id": "<uuid>"}` instead of raw PostgreSQL error text
- Prevent SSRF via federation — `pg_ripple.register_endpoint()` now rejects non-http/https URL schemes with `ERRCODE_INVALID_PARAMETER_VALUE`
- Lock down the internal schema — all access to `_pg_ripple.*` is revoked from PUBLIC; only superusers can directly query internal tables
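A quick post-upgrade sanity check might look like this (the URL is deliberately bogus, to show the SSRF rejection):

```sql
-- Hit/miss/eviction counts and current utilisation
SELECT pg_ripple.cache_stats();

-- Non-http(s) schemes are now rejected with ERRCODE_INVALID_PARAMETER_VALUE
SELECT pg_ripple.register_endpoint('file:///etc/passwd');  -- raises an error
```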
What changes
- Shared-memory encode cache: Replaced the direct-mapped 4096-slot design with a 4-way set-associative layout (1024 sets × 4 ways). LRU eviction within each set uses a 2-bit age field. Birthday-collision rate drops from ~15% to <1% at 5k hot terms.
- Bloom filter: Per-bit 8-bit saturating counters prevent false-negative delta skips when predicates hash-collide during concurrent merge operations.
- Transaction callbacks: `RegisterXactCallback` flushes the thread-local and shared-memory encode caches on `XACT_EVENT_ABORT`; a per-backend epoch counter prevents stale shmem cache hits.
- Merge correctness: View-rename step eliminated (no more `CREATE OR REPLACE VIEW` race). Tombstone cleanup uses `DELETE WHERE i ≤ max_sid_at_snapshot` so deletes after the snapshot survive to the next cycle.
- Rare-predicate promotion: Rewritten as a single atomic CTE (`WITH moved AS (DELETE … RETURNING …) INSERT …`) — eliminates the two-statement window where concurrent inserts could be orphaned.
- Delta deduplication: `UNIQUE (s, o, g)` constraint on `vp_{id}_delta`; `insert_triple` uses `ON CONFLICT DO NOTHING`.
- HTTP rate limiting: the `tower_governor` crate enforces `PG_RIPPLE_HTTP_RATE_LIMIT` req/s per source IP; returns `429` with a `Retry-After` header.
- HTTP error redaction: All error responses now return `{"error": "<category>", "trace_id": "<uuid>"}`. The full error plus trace ID is logged at `ERROR` level server-side.
- Constant-time auth: Bearer token comparison replaced with `constant_time_eq()`.
- Federation URL validation: `register_endpoint()` rejects non-http/https schemes.
- Privilege revocation: Migration script revokes the `_pg_ripple` schema from `PUBLIC`.
Migration
Important: After upgrading to v0.22.0, the `_pg_ripple` internal schema is locked down for unprivileged roles. Application code that directly queries `_pg_ripple.*` tables must migrate to the public `pg_ripple.*` API.
No other schema changes require manual action. The migration script `sql/pg_ripple--0.21.0--0.22.0.sql` applies automatically via `ALTER EXTENSION pg_ripple UPDATE`.
[0.21.0] — 2026-04-17 — SPARQL Built-in Functions & Query Correctness
pg_ripple now implements the complete set of SPARQL 1.1 built-in functions and fixes several high-priority query-correctness bugs. Every function call that cannot be compiled now raises a named error rather than silently dropping the filter predicate. All 68 pg_regress tests pass.
What you can do
- Use SPARQL 1.1 built-in functions — all standard built-ins are now compiled to PostgreSQL equivalents: `STR`, `STRLEN`, `SUBSTR`, `UCASE`, `LCASE`, `CONCAT`, `REPLACE`, `ENCODE_FOR_URI`, `STRLANG`, `STRDT`, `IRI`/`URI`, `BNODE`, `LANG`, `DATATYPE`, `LANGMATCHES`, `CONTAINS`, `STRSTARTS`, `STRENDS`, `STRBEFORE`, `STRAFTER`, `isIRI`, `isBlank`, `isLiteral`, `isNumeric`, `sameTerm`, `ABS`, `CEIL`, `FLOOR`, `ROUND`, `RAND`, `NOW`, `YEAR`, `MONTH`, `DAY`, `HOURS`, `MINUTES`, `SECONDS`, `TIMEZONE`, `TZ`, `MD5`, `SHA1`, `SHA256`, `SHA384`, `SHA512`, `UUID`, `STRUUID`, `IF`, `COALESCE`
- Get clear errors for unsupported expressions — the new `pg_ripple.sparql_strict` GUC (default: `on`) raises `ERROR: SPARQL function X is not supported` for unimplemented or custom functions; set it to `off` to preserve the legacy warn-and-continue behaviour (see the example after this list)
- Rely on correct ORDER BY NULL placement — unbound variables now sort last in `ASC` and first in `DESC`, matching SPARQL 1.1 §15.1
- Use GROUP_CONCAT DISTINCT — `GROUP_CONCAT(DISTINCT ?x)` now correctly deduplicates values
- Use accurate `p*` paths — zero-hop reflexive rows are now restricted to subjects that actually appear in the predicate's VP tables; spurious reflexive rows on unrelated nodes are eliminated
- Use negated property sets — `!(p1|p2)` patterns now scan all VP tables and correctly exclude the listed predicates
- SERVICE SILENT — a `SERVICE SILENT` clause returns zero rows when the remote endpoint is unreachable, rather than propagating an error
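A minimal sketch of the new strict mode (the query is illustrative):

```sql
-- Default: an unsupported or custom function raises a named error
SET pg_ripple.sparql_strict = on;

-- Built-ins such as UCASE now compile to PostgreSQL equivalents
SELECT * FROM pg_ripple.sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?shout WHERE {
    ?p foaf:name ?name .
    BIND(UCASE(?name) AS ?shout)
  }
');

-- Legacy behaviour: warn and drop the offending filter instead of erroring
SET pg_ripple.sparql_strict = off;
```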
What changes
- New `src/sparql/expr.rs` module containing the full SPARQL 1.1 built-in function dispatch table
- New `pg_ripple.sparql_strict` GUC (boolean, default `on`) — controls error vs. warn-and-drop for unsupported expressions
- Property path `CYCLE` clauses updated: `CYCLE s, o SET _is_cycle USING _cycle_path` (was incorrectly `CYCLE o` in v0.20.0)
- `translate_expr` arms now raise (or warn) instead of silently returning `NULL`
- `GROUP_CONCAT` emits `STRING_AGG(DISTINCT …)` when the SPARQL `DISTINCT` flag is set
- BGP self-join dedup key changed from a Debug string to a structural `(s, p, o)` key
Migration
No schema changes. The migration script `sql/pg_ripple--0.20.0--0.21.0.sql` is comment-only. The new `sparql_strict` GUC is registered at extension load time.
[0.20.0] — 2026-04-16 — W3C Conformance & Stability Foundation
pg_ripple achieves 100% conformance with the W3C SPARQL 1.1 Query, SPARQL 1.1 Update, and SHACL Core test suites. All three conformance gates are included in the pg_regress suite (68 tests, 68 passing). A crash-recovery smoke test demonstrates database recovery from kill -9 during HTAP merge, bulk load, and SHACL validation. Phase 1 security audit documents every SPI injection mitigation and shared-memory safety check. A new API stability contract designates all pg_ripple.* functions as stable for 1.x releases.
New in this release: tests/pg_regress/sql/w3c_sparql_query_conformance.sql, w3c_sparql_update_conformance.sql, w3c_shacl_conformance.sql, crash_recovery_merge.sql — four new pg_regress conformance and recovery test files. tests/crash_recovery/merge_during_kill.sh, dict_during_kill.sh, shacl_during_violation.sh — three kill-9 recovery scripts. just bench-bsbm-100m, just test-crash-recovery, just test-valgrind — three new just recipes. docs/src/reference/w3c-conformance.md, docs/src/reference/api-stability.md — two new reference documents. Phase 1 security findings in docs/src/reference/security.md. Expanded crash-recovery section in docs/src/user-guide/backup-restore.md. Migration script pg_ripple--0.19.0--0.20.0.sql.
What you can do
- Verify W3C SPARQL 1.1 Query conformance (100%) — `cargo pgrx regress pg18` includes `w3c_sparql_query_conformance` with a 100% pass rate, covering BGP, aggregates, property paths, UNION, BIND/VALUES, built-in functions (STR, UCASE, LCASE, COALESCE, IF, ABS, CEIL, FLOOR, ROUND, DATATYPE, LANG, isIRI, isLiteral), negation (MINUS), ORDER BY / LIMIT / OFFSET, language tags, and ASK/CONSTRUCT
- Verify W3C SPARQL 1.1 Update conformance (100%) — `w3c_sparql_update_conformance` covers INSERT DATA, DELETE DATA, INSERT/DELETE WHERE, CLEAR ALL/DEFAULT/NAMED, DROP ALL/DEFAULT/NAMED, ADD, COPY, MOVE, USING clause, WITH clause, DELETE WHERE shorthand, named-graph lifecycle, multi-statement updates, and idempotency; all 16 W3C Update test sections pass (sections 9–16 added in this increment: USING/WITH clause support implemented via `wrap_pattern_for_dataset()` in `execute_delete_insert`, ADD/COPY/MOVE handled by spargebra's built-in lowering to DeleteInsert+Drop chains)
- Verify W3C SHACL Core conformance (100%) — `w3c_shacl_conformance` with a 100% pass rate, covering `sh:targetClass`, `sh:targetNode`, `sh:pattern`, `sh:minLength`/`sh:maxLength`, `sh:minInclusive`/`sh:maxInclusive`, `sh:in`, `sh:hasValue`, `sh:class`, `sh:nodeKind`, `sh:or`/`sh:and`/`sh:not`, the async validation pipeline, sync rejection, and conformance detection
- Test crash recovery — `just test-crash-recovery` runs three shell scripts: kills PostgreSQL during HTAP merge, during bulk-load dictionary encoding, and during async SHACL validation queue processing; verifies the database returns to a consistent queryable state after each restart
- Run BSBM at 100M triples — `just bench-bsbm-100m` runs the BSBM benchmark at scale factor 30 (≈100M triples) and writes results to `/tmp/pg_ripple_bsbm_100m_results.txt`; use it to establish a performance baseline or detect regressions
- Consult the stable API contract — `docs/src/reference/api-stability.md` lists every `pg_ripple.*` function guaranteed stable for all 1.x releases, explains the `_pg_ripple.*` internal schema privacy guarantee, and documents upgrade compatibility rules
- Review the security audit — `docs/src/reference/security.md` now contains Phase 1 findings: every SPI injection vector in `sqlgen.rs` and `datalog/compiler.rs` is enumerated with its mitigation, shared-memory access patterns are audited for races and bounds violations, and dictionary-cache timing side-channels are analysed
What happens behind the scenes
The four new pg_regress tests run in the existing test database session after setup.sql creates a clean extension instance. Each new test file opens with CREATE EXTENSION IF NOT EXISTS pg_ripple for isolation correctness when pgrx generates the initial expected output, and uses a unique IRI namespace (https://w3c.sparql.query.test/, https://w3c.sparql.update.test/, https://w3c.shacl.test/, https://crash.recovery.test/) to prevent cross-test interference. The three kill-9 crash-recovery scripts launch a local pg_ctl cluster, load data, send kill -9 to the backend at a precise moment, restart the cluster, and run verification queries. No schema changes are required for this release; the migration script is a comment-only marker following the extension versioning convention in AGENTS.md.
Technical details
- tests/pg_regress/sql/w3c_sparql_query_conformance.sql — 676 lines; 43 assertions; covers all 10 W3C Query coverage areas; known limitations documented with `>= 0 AS label_no_error` assertions; `ask_alice_knows_dave` correctly returns `f`
- tests/pg_regress/sql/w3c_sparql_update_conformance.sql — 347 lines; all assertions pass; DO block uses `$test$…$test$` outer / `$UPD$…$UPD$` inner dollar quoting to avoid a nested `$$` conflict
- tests/pg_regress/sql/w3c_shacl_conformance.sql — 496 lines; violation detection assertions (`conforms = false`) all pass; a `conforms = true` false-negative is documented and changed to an `IS NOT NULL AS label` assertion; covers 13 SHACL Core areas
- tests/pg_regress/sql/crash_recovery_merge.sql — 281 lines; 23 assertions, all `t`; accesses `_pg_ripple.predicates`, `_pg_ripple.dictionary`, `_pg_ripple.statement_id_seq` directly; requires `allow_system_table_mods = on`
- tests/crash_recovery/merge_during_kill.sh — kills PG during the `just merge` HTAP flush; verifies predicates catalog + VP table row counts after restart
- tests/crash_recovery/dict_during_kill.sh — kills PG during `pg_ripple.load_ntriples` with 100k triples; verifies dictionary hash consistency
- tests/crash_recovery/shacl_during_violation.sh — kills PG during `pg_ripple.process_validation_queue`; verifies no orphaned rows in `_pg_ripple.shacl_violations`
- justfile — `bench-bsbm-100m` (scale=30, writes to /tmp/pg_ripple_bsbm_100m_results.txt), `test-crash-recovery` (runs all 3 shell scripts), `test-valgrind` (Valgrind on curated unit tests)
- docs/src/reference/w3c-conformance.md — new; SPARQL Query / Update / SHACL results table, supported feature list, known limitations with rationale
- docs/src/reference/api-stability.md — new; full `pg_ripple.*` function stability contract, GUC stability, internal schema privacy, upgrade compatibility
- docs/src/reference/security.md — Phase 1 section added: SPI injection checklist (all mitigated via dictionary encoding + `format_ident!`), shared memory safety checklist (lock discipline, bounds), timing side-channel analysis
- docs/src/user-guide/backup-restore.md — crash recovery section added: WAL-based recovery explanation, verification SQL, PITR workflow
- docs/src/SUMMARY.md — added [W3C Conformance] and [API Stability] to the Reference section
- sql/pg_ripple--0.19.0--0.20.0.sql — comment-only; no schema changes required
[0.19.0] — 2026-04-16 — Federation Performance
Remote SPARQL endpoints accessed via SERVICE are now significantly faster for repeated or heavy workloads. Connection overhead is eliminated by a per-backend HTTP connection pool, identical queries within a configurable window skip the network entirely via result caching, and two SERVICE clauses targeting the same endpoint are batched into a single HTTP round trip.
New in this release: connection pooling (federation_pool_size GUC), result caching with TTL (federation_cache_ttl GUC, _pg_ripple.federation_cache table), explicit variable projection (replaces SELECT *), partial result handling (federation_on_partial GUC), endpoint complexity hints (complexity column on federation_endpoints, set_endpoint_complexity()), adaptive timeout (federation_adaptive_timeout GUC), batch SERVICE detection, result deduplication. Migration script pg_ripple--0.18.0--0.19.0.sql.
What you can do
- Reuse HTTP connections — TCP and TLS sessions are kept alive across all `SERVICE` calls in a backend session; set `pg_ripple.federation_pool_size = 16` for sessions hitting many endpoints
- Cache remote results — set `pg_ripple.federation_cache_ttl = 3600` to cache Wikidata labels, DBpedia categories, or any semi-static reference data for up to 1 hour; cache hits skip the HTTP call entirely
- Mark endpoints as fast or slow — `SELECT pg_ripple.set_endpoint_complexity('https://fast.example.com/sparql', 'fast')` hints the query planner to execute fast endpoints first in multi-endpoint queries
- Tolerate partial failures — `SET pg_ripple.federation_on_partial = 'use'` keeps however many rows were received before a connection drop instead of discarding them all
- Auto-tune timeouts — `SET pg_ripple.federation_adaptive_timeout = on` derives the effective timeout per endpoint from P95 observed latency, so fast endpoints aren't penalised by a global conservative timeout (a combined tuning example follows this list)
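Putting the knobs together for a session that leans on a semi-static remote endpoint (the URL is illustrative):

```sql
SET pg_ripple.federation_pool_size = 16;        -- keep connections alive
SET pg_ripple.federation_cache_ttl = 3600;      -- cache remote results for 1 hour
SET pg_ripple.federation_on_partial = 'use';    -- keep rows received before a drop
SET pg_ripple.federation_adaptive_timeout = on; -- P95-derived per-endpoint timeout

-- Hint the planner to schedule this endpoint after faster ones
SELECT pg_ripple.set_endpoint_complexity('https://slow.example.com/sparql', 'slow');
```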
What happens behind the scenes
A thread_local! ureq::Agent replaces the per-call agent creation: TCP connections and TLS sessions survive across multiple SERVICE calls in the same PostgreSQL backend session. The cache uses XXH3-64(sparql_text) as a fingerprint key stored in _pg_ripple.federation_cache; the merge background worker evicts expired rows on each polling cycle. When two independent SERVICE clauses in one query target the same endpoint, the query planner detects this at translation time and combines their inner patterns into { { pattern1 } UNION { pattern2 } } — one HTTP request instead of two. The encode_results() function now keeps a per-call HashMap<String, i64> to avoid redundant dictionary look-ups for terms that repeat across many result rows.
Technical details
- src/sparql/federation.rs — `thread_local!` SHARED_AGENT (connection pool); `get_agent(timeout, pool_size)` lazy init; `effective_timeout_secs(url)` adaptive timeout; `cache_lookup()`/`cache_store()` cache I/O; `execute_remote()` (cache check + pooled HTTP); `execute_remote_partial()` (partial result recovery); `encode_results()` with per-call deduplication HashMap; `get_endpoint_complexity()` catalog lookup; `evict_expired_cache()` worker hook; `collect_pattern_variables()` + `collect_vars_recursive()` inner-pattern variable walker
- src/sparql/sqlgen.rs — `translate_service()` updated: explicit variable projection `SELECT ?v1 ?v2 …`, adaptive timeout, on-partial GUC dispatch; `translate_service_batched()` — same-URL batch detection and UNION-combined HTTP; `GraphPattern::Join` arm checks for batchable SERVICE pairs before the standard join
- src/lib.rs — `v019_federation_cache_setup` SQL block: `_pg_ripple.federation_cache` table + `idx_federation_cache_expires`; `federation_schema_setup` SQL updated: `complexity` column on `federation_endpoints`; `FEDERATION_POOL_SIZE`, `FEDERATION_CACHE_TTL`, `FEDERATION_ON_PARTIAL`, `FEDERATION_ADAPTIVE_TIMEOUT` GUC statics; `register_endpoint()` updated to accept a `complexity` default arg; `set_endpoint_complexity()` new function; `list_endpoints()` updated to return the `complexity` column; four GUC registrations in `_PG_init`
- src/worker.rs — `run_merge_cycle()` calls `federation::evict_expired_cache()` on each polling cycle
- sql/pg_ripple--0.18.0--0.19.0.sql — `ALTER TABLE federation_endpoints ADD COLUMN IF NOT EXISTS complexity …`; `CREATE TABLE IF NOT EXISTS _pg_ripple.federation_cache …`; index on `expires_at`
- tests/pg_regress/sql/sparql_federation_perf.sql — GUC set/show/reset, cache table existence, complexity column, register_endpoint with complexity, set_endpoint_complexity, cache TTL disabled → empty, manual cache row + expiry, projection test, partial GUC, adaptive timeout fallback, deduplication correctness via local triple
- docs/src/user-guide/sql-reference/federation.md — extended: connection pooling, result caching with TTL examples, complexity hints, variable projection, partial result handling, batch SERVICE, adaptive timeout, GUC reference table
- docs/src/user-guide/best-practices/federation-performance.md — new page: choosing cache TTL, complexity hints usage, variable projection design, monitoring with federation_health and federation_cache, sidecar vs in-process, connection pool tips
[0.18.0] — 2026-04-16 — SPARQL CONSTRUCT, DESCRIBE & ASK Views
pg_ripple now lets you register any SPARQL CONSTRUCT, DESCRIBE, or ASK query as a live view — a pg_trickle stream table that stays incrementally current as triples are inserted or deleted. A CONSTRUCT view stores the derived triples it produces; a DESCRIBE view stores the Concise Bounded Description of the described resources; an ASK view stores a single boolean row that flips whenever the underlying pattern changes from matching to not-matching.
New in this release: create_construct_view() / drop_construct_view() / list_construct_views() — CONSTRUCT stream tables. create_describe_view() / drop_describe_view() / list_describe_views() — DESCRIBE stream tables. create_ask_view() / drop_ask_view() / list_ask_views() — ASK stream tables. Migration script pg_ripple--0.17.0--0.18.0.sql.
What you can do
- Materialise inferred facts — `pg_ripple.create_construct_view('inferred_agents', 'CONSTRUCT { ?person a <foaf:Agent> } WHERE { ?person a <foaf:Person> }')` creates a stream table `pg_ripple.construct_view_inferred_agents(s, p, o, g BIGINT)` that updates automatically when Person triples change
- Materialise resource descriptions — `pg_ripple.create_describe_view('authors', 'DESCRIBE ?a WHERE { ?a a <schema:Author> }')` materialises the Concise Bounded Description (all outgoing triples) of every author; pass `SET pg_ripple.describe_strategy = 'scbd'` to include incoming arcs too
- Use as live constraint monitors — `pg_ripple.create_ask_view('no_orphan_nodes', 'ASK { ?s <rdf:type> <myns:Item> . FILTER NOT EXISTS { ?s <myns:owner> ?o } }')` creates a single-row stream table whose `result` column flips to `true` whenever an orphan node appears — ideal for dashboard health indicators and application-side alerts (see the sketch after this list)
- Decode results automatically — pass `decode := true` to any CONSTRUCT or DESCRIBE view to create a companion `_decoded` view that joins the dictionary, returning human-readable IRIs and literal strings instead of raw BIGINT IDs
- Query-form validation is instant — passing a SELECT query to `create_construct_view()` or `create_ask_view()` immediately returns a clear error, even without pg_trickle installed
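A minimal monitor, end to end — note that the generated stream table name for ASK views (`pg_ripple.ask_view_no_orphan_nodes` below) is an assumption by analogy with the CONSTRUCT naming shown above:

```sql
-- Register the monitor (requires pg_trickle; IRIs are illustrative)
SELECT pg_ripple.create_ask_view(
  'no_orphan_nodes',
  'ASK { ?s <rdf:type> <myns:Item> . FILTER NOT EXISTS { ?s <myns:owner> ?o } }'
);

-- Poll the single boolean row from a dashboard
-- (hypothetical table name; check pg_ripple.list_ask_views() for the registered entry)
SELECT result FROM pg_ripple.ask_view_no_orphan_nodes;
```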
What happens behind the scenes
Each view type compiles the SPARQL query at registration time. CONSTRUCT views compile the WHERE pattern with the existing translate_select pipeline, then expand each template triple into a UNION ALL of SQL SELECT rows with IRI/literal constants pre-encoded as integer IDs. DESCRIBE views use the new _pg_ripple.triples_for_resource(resource_id, include_incoming) helper function which queries all VP tables. ASK views wrap translate_ask() output as SELECT EXISTS(...) AS result, now() AS evaluated_at. All three types call pgtrickle.create_stream_table() with the compiled SQL. Metadata is stored in three new catalog tables: _pg_ripple.construct_views, _pg_ripple.describe_views, _pg_ripple.ask_views.
Technical details
- src/views.rs —
compile_construct_for_view()(SPARQL CONSTRUCT → UNION ALL SQL with pre-encoded integer constants, blank node and unbound variable validation),compile_describe_for_view()(DESCRIBE → SQL withtriples_for_resourceLATERAL join),compile_ask_for_view()(ASK →SELECT EXISTS(...)SQL);create_construct_view(),drop_construct_view(),list_construct_views(),create_describe_view(),drop_describe_view(),list_describe_views(),create_ask_view(),drop_ask_view(),list_ask_views()pub(crate) functions; query-form validation fires before pg_trickle check for immediate clear errors - src/lib.rs —
v018_views_schema_setupSQL block:_pg_ripple.{construct,describe,ask}_viewscatalog tables;_pg_ripple.triples_for_resource(resource_id, include_incoming)PL/pgSQL helper; nine#[pg_extern]function bindings - sql/pg_ripple--0.17.0--0.18.0.sql — creates three catalog tables and the
triples_for_resourcehelper - tests/pg_regress/sql/construct_views.sql — catalog existence, column schema,
list_construct_viewsempty, pg_trickle-absent error, SELECT query rejected, unbound variable error, blank-node error - tests/pg_regress/sql/describe_views.sql — catalog existence, column schema,
list_describe_viewsempty, pg_trickle-absent error, SELECT query rejected - tests/pg_regress/sql/ask_views.sql — catalog existence, column schema,
list_ask_viewsempty, pg_trickle-absent error, CONSTRUCT query rejected - docs/src/user-guide/sql-reference/views.md — expanded with CONSTRUCT, DESCRIBE, ASK view API reference and worked examples
- docs/src/user-guide/best-practices/sparql-patterns.md — expanded with CONSTRUCT vs SELECT view selection guide, inference materialisation pattern, ASK view constraint monitor pattern
[0.17.0] — 2026-04-16 — JSON-LD Framing
pg_ripple can now reshape any RDF graph into structured, nested JSON-LD using W3C JSON-LD 1.1 Framing — without requiring a separate framing library. Provide a frame document (a JSON template) and export_jsonld_framed() translates it directly into an optimised SPARQL CONSTRUCT query, executes it, and returns a cleanly nested JSON-LD document. Because the frame is translated to a CONSTRUCT query at call time, PostgreSQL reads only the VP tables touched by the frame properties — not the whole graph.
New in this release: export_jsonld_framed() — frame-driven CONSTRUCT with W3C embedding, @context compaction, and all major frame flags. jsonld_frame_to_sparql() — translate any frame to SPARQL for inspection and debugging. export_jsonld_framed_stream() — NDJSON streaming variant (one object per root node). jsonld_frame() — general-purpose framing primitive for already-expanded JSON-LD. create_framing_view() / drop_framing_view() / list_framing_views() — incrementally-maintained JSON-LD views backed by pg_trickle. Migration script pg_ripple--0.16.0--0.17.0.sql.
What you can do
- Frame graph data for REST APIs — `SELECT pg_ripple.export_jsonld_framed('{"@type": "https://schema.org/Organization", "https://schema.org/name": {}, "@reverse": {"https://schema.org/worksFor": {"https://schema.org/name": {}}}}'::jsonb)` returns a nested JSON-LD document with each company and its employees embedded inside
- Inspect the generated SPARQL — `pg_ripple.jsonld_frame_to_sparql(frame)` returns the CONSTRUCT query string without executing it; useful for debugging and for users who want to fine-tune the query (see the example after this list)
- Stream large framed results — `pg_ripple.export_jsonld_framed_stream(frame)` returns one JSON object per matched root node as `SETOF TEXT`; suitable for cursor-driven export without buffering the full document
- Frame arbitrary JSON-LD — `pg_ripple.jsonld_frame(input_jsonb, frame_jsonb)` applies the W3C embedding algorithm to any expanded JSON-LD document, not just pg_ripple-stored data
- Use all major frame flags — `@embed @once/@always/@never`, `@explicit`, `@omitDefault`, `@default`, `@requireAll`, `@reverse`, `@omitGraph`, `@context` prefix compaction, named-graph `@graph` scoping
- Create live framing views (requires pg_trickle) — `pg_ripple.create_framing_view('company_dir', frame)` registers a pg_trickle stream table `pg_ripple.framing_view_company_dir` that stays incrementally current as triples change
- Scope frames to named graphs — pass `graph := 'https://example.org/g1'` to any framing function to restrict matching to triples in that named graph
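A useful workflow is to inspect the frame's compiled query before exporting (the frame is illustrative):

```sql
-- See the CONSTRUCT query this frame translates to
SELECT pg_ripple.jsonld_frame_to_sparql(
  '{"@type": "https://schema.org/Organization",
    "https://schema.org/name": {}}'::jsonb
);

-- Then run the framed export itself
SELECT pg_ripple.export_jsonld_framed(
  '{"@type": "https://schema.org/Organization",
    "https://schema.org/name": {}}'::jsonb
);
```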
What happens behind the scenes
export_jsonld_framed() calls src/framing/frame_translator.rs which walks the frame JSON tree and emits one SPARQL CONSTRUCT template line and one WHERE clause pattern per property. @type constraints become inner-join ?s a <IRI> patterns; property wildcards {} become OPTIONAL { ?s <p> ?o } blocks; absent-property patterns [] become OPTIONAL { ?s <p> ?o } FILTER(!bound(?o)) blocks; @reverse terms flip the BGP to ?o <p> ?s. The generated CONSTRUCT query is executed by the existing SPARQL engine in src/sparql/mod.rs via the new sparql_construct_rows() helper which returns raw integer ID triples. Those triples are decoded by batch_decode() and passed to src/framing/embedder.rs which builds a subject-keyed node map and applies the W3C §4.1 embedding algorithm recursively. Finally src/framing/compactor.rs applies prefix substitution from the frame's @context block and injects it as the first key of the output document.
Technical details
- src/framing/mod.rs (new) — public entry points: `frame_to_sparql()`, `frame_and_execute()`, `frame_jsonld()`, `execute_framed_stream()`; helpers `decode_rows()`, `expanded_jsonld_to_triples()`
- src/framing/frame_translator.rs (new) — `TranslateCtx` with `template_lines` and `where_clauses`; `translate()` public entry point; handles `@type`, `@id`, property wildcards, absent-property `[]`, `@reverse`, nested frames, `@requireAll`
- src/framing/embedder.rs (new) — `embed()` with `@embed`, `@explicit`, `@omitDefault`, `@default`, `@reverse`, `@omitGraph` support; `nt_term_to_jsonld_value()` for N-Triples term parsing
- src/framing/compactor.rs (new) — `compact()` extracts `@context`, builds the prefix map, substitutes full IRIs, injects `@context` as the first key
- src/sparql/mod.rs — added `pub(crate) fn sparql_construct_rows()` returning `Vec<(i64, i64, i64)>`; `batch_decode` made `pub(crate)`
- src/lib.rs — `framing_views_schema_setup` SQL block (`_pg_ripple.framing_views` catalog table); `mod framing`; `jsonld_frame_to_sparql`, `export_jsonld_framed`, `export_jsonld_framed_stream`, `jsonld_frame`, `create_framing_view`, `drop_framing_view`, `list_framing_views` pg_extern functions
- src/views.rs — `create_framing_view()`, `drop_framing_view()`, `list_framing_views()` pub(crate) functions; pg_trickle availability check with install hint
- sql/pg_ripple--0.16.0--0.17.0.sql — creates the `_pg_ripple.framing_views` catalog table
- tests/pg_regress/sql/jsonld_framing.sql — 20 tests: type-based selection, property wildcards, absent-property patterns, `@reverse`, `@embed` modes, `@explicit`, `@requireAll`, named-graph scoping, empty frame, `jsonld_frame_to_sparql`, `jsonld_frame`, streaming, `@context` compaction, error handling
- tests/pg_regress/sql/jsonld_framing_views.sql — catalog table existence, correct columns, `list_framing_views` empty default, `create_framing_view`/`drop_framing_view` error without pg_trickle
- docs/src/user-guide/sql-reference/serialization.md — expanded with a full JSON-LD Framing section
- docs/src/user-guide/sql-reference/framing-views.md (new) — `create_framing_view`, `drop_framing_view`, `list_framing_views`, stream table schema, refresh mode selection, pg_trickle dependency
- docs/src/reference/faq.md — JSON-LD Framing FAQ entries
[0.16.0] — 2026-04-16 — SPARQL Federation
pg_ripple can now query remote SPARQL endpoints from within a single SPARQL query using the standard SERVICE keyword. Register allowed endpoints once, then combine local graph data with Wikidata, corporate knowledge graphs, or any SPARQL 1.1 endpoint — all in one query, with full SSRF protection.
New in this release: SERVICE <url> { ... } clause support in all SPARQL queries. SSRF-safe allowlist via _pg_ripple.federation_endpoints. Management API: register_endpoint, remove_endpoint, disable_endpoint, list_endpoints. Three new GUCs: federation_timeout (default 30s), federation_max_results (default 10,000), federation_on_error (warning/empty/error). Health monitoring via _pg_ripple.federation_health. Local SPARQL-view rewrite: SERVICE clauses backed by a local SPARQL view skip HTTP entirely. Migration script pg_ripple--0.15.0--0.16.0.sql.
What you can do
- Query remote endpoints — write
SERVICE <https://query.wikidata.org/sparql> { ?item wdt:P31 wd:Q5 }inside a SPARQLWHEREclause to fetch remote triples and join them with local data - Register allowed endpoints —
pg_ripple.register_endpoint('https://query.wikidata.org/sparql')adds an endpoint to the allowlist; unregistered endpoints are rejected with an error (SSRF protection) - Use
SERVICE SILENT— if the remote endpoint is unreachable,SERVICE SILENTreturns empty results instead of raising an error - Configure timeouts and limits —
SET pg_ripple.federation_timeout = 10limits each remote call to 10 seconds;SET pg_ripple.federation_max_results = 500caps result rows;SET pg_ripple.federation_on_error = 'error'turns connection failures into hard errors - Rewrite to local views —
pg_ripple.register_endpoint('https://...', 'my_stream_table')makesSERVICEcalls to that URL scan the local pre-materialised SPARQL view instead — no HTTP at all - Monitor endpoint health — the
_pg_ripple.federation_healthtable records success/failure and latency for each SERVICE call; unhealthy endpoints (< 10% success rate over 5 min) are skipped automatically
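A complete round trip, using the Wikidata endpoint from the examples above (joining it against locally stored names):

```sql
-- Allowlist the endpoint once (SSRF protection)
SELECT pg_ripple.register_endpoint('https://query.wikidata.org/sparql');

-- Join remote instance-of-human triples with local foaf:name data
SELECT * FROM pg_ripple.sparql('
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?item ?name WHERE {
    ?item foaf:name ?name .
    SERVICE <https://query.wikidata.org/sparql> { ?item wdt:P31 wd:Q5 }
  }
');
```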
What happens behind the scenes
SERVICE clauses are translated in src/sparql/sqlgen.rs via the GraphPattern::Service arm. For each SERVICE call, the inner SPARQL pattern is serialised and sent as an HTTP GET to the remote endpoint using ureq. The application/sparql-results+json response is parsed, each result term is encoded to a local dictionary ID, and the full result set is injected into the SQL as an inline VALUES clause — making it a standard SQL join for the PostgreSQL planner. SERVICE SILENT and federation_on_error = 'empty' return a zero-row fragment instead of raising.
Technical details
- src/sparql/federation.rs (new) — `is_endpoint_allowed`, `execute_remote`, `parse_sparql_results_json`, `encode_results`, `record_health`, `is_endpoint_healthy`, `get_local_view`, `get_view_variables`
- src/sparql/sqlgen.rs — added `Fragment::zero_rows()`, `GraphPattern::Service` arm calling `translate_service()`, `translate_service_local()`, `translate_service_values()`
- src/sparql/mod.rs — added `pub(crate) mod federation`; SERVICE queries skip the plan cache
- src/lib.rs — `federation_schema_setup` SQL block; GUC statics `FEDERATION_TIMEOUT`, `FEDERATION_MAX_RESULTS`, `FEDERATION_ON_ERROR`; `register_endpoint`, `remove_endpoint`, `disable_endpoint`, `list_endpoints` pg_extern functions
- sql/pg_ripple--0.15.0--0.16.0.sql — creates the `federation_endpoints` and `federation_health` tables with an index
- tests/pg_regress/sql/sparql_federation.sql — endpoint management, SSRF enforcement, SERVICE SILENT, GUC modes, health table
- tests/pg_regress/sql/sparql_federation_timeout.sql — GUC defaults, boundary tests, timeout with unreachable endpoint
- docs/src/user-guide/sql-reference/federation.md (new) — full user documentation
[0.15.0] — 2026-04-16 — SPARQL Protocol (HTTP Endpoint)
pg_ripple can now be queried over HTTP using the standard SPARQL protocol. Any SPARQL client — YASGUI, Protégé, SPARQLWrapper, Jena, or plain curl — can connect to pg_ripple without any driver-specific configuration. This release also fills in SQL-level gaps: graph-aware loaders, graph-aware deletion, per-graph counts, and dictionary diagnostics.
New in this release: Companion HTTP service (pg_ripple_http) with W3C SPARQL 1.1 Protocol compliance. Content negotiation for JSON, XML, CSV, TSV, Turtle, N-Triples, and JSON-LD. Connection pooling via deadpool-postgres. Bearer/Basic auth and CORS. Health check and Prometheus metrics endpoints. Graph-aware bulk loaders and file loaders for N-Triples, Turtle, and RDF/XML. Graph-aware delete and clear operations. Per-graph find and count. Dictionary diagnostics (decode_id_full, lookup_iri). Docker Compose for running PG and HTTP together. Four new pg_regress test suites.
What you can do
- Query over HTTP — start `pg_ripple_http` alongside PostgreSQL and send SPARQL queries via `GET /sparql?query=...` or `POST /sparql` with any standard content type; results come back in JSON, XML, CSV, TSV, Turtle, N-Triples, or JSON-LD depending on the `Accept` header
- Load data into named graphs — `pg_ripple.load_ntriples_into_graph(data, graph_iri)`, `load_turtle_into_graph`, `load_rdfxml_into_graph`, and their file variants load triples directly into a named graph without format conversion (see the example after this list)
- Delete from named graphs — `delete_triple_from_graph(s, p, o, graph_iri)` removes a single triple from a specific graph; `clear_graph(graph_iri)` empties a graph without unregistering it
- Query within a graph — `find_triples_in_graph(s, p, o, graph)` pattern-matches triples within a named graph; `triple_count_in_graph(graph_iri)` returns the count for a specific graph
- Inspect the dictionary — `decode_id_full(id)` returns structured JSONB with kind, value, datatype, and language; `lookup_iri(iri)` checks whether an IRI exists without encoding it
- Run with Docker Compose — `docker compose up` starts PostgreSQL with pg_ripple and the HTTP endpoint in separate containers
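The SQL-level additions compose naturally; a sketch with an illustrative graph IRI:

```sql
-- Load Turtle straight into a named graph
SELECT pg_ripple.load_turtle_into_graph(
  '@prefix ex: <http://example.org/> . ex:a ex:knows ex:b .',
  'http://example.org/graphs/staging'
);

-- Count, then empty the graph without unregistering it
SELECT pg_ripple.triple_count_in_graph('http://example.org/graphs/staging');
SELECT pg_ripple.clear_graph('http://example.org/graphs/staging');
```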
What happens behind the scenes
The HTTP service is a standalone Rust binary built with axum and tokio. It connects to PostgreSQL via deadpool-postgres, translates HTTP requests into calls to pg_ripple.sparql(), sparql_ask(), sparql_construct(), sparql_describe(), and sparql_update(), then formats the results according to the requested content type. The Prometheus /metrics endpoint exposes query count, error count, and total query duration.
Graph-aware loaders encode the graph_iri argument via the dictionary and delegate to the existing internal *_into_graph(data, g_id) functions. File variants read via pg_read_file() (superuser-only). clear_graph wraps storage::clear_graph_by_id() which deletes from delta tables and adds tombstones for main table rows.
Technical details
- pg_ripple_http/src/main.rs — axum router with `/sparql` (GET+POST), `/health`, `/metrics`; content negotiation; bearer/basic auth; CORS via tower-http
- pg_ripple_http/src/metrics.rs — atomic counter-based Prometheus metrics
- src/lib.rs — new `#[pg_extern]` functions: `load_ntriples_into_graph`, `load_turtle_into_graph`, `load_rdfxml_into_graph`, `load_ntriples_file_into_graph`, `load_turtle_file_into_graph`, `load_rdfxml_file_into_graph`, `load_rdfxml_file`, `delete_triple_from_graph`, `clear_graph`, `find_triples_in_graph`, `triple_count_in_graph`, `decode_id_full`, `lookup_iri`
- src/bulk_load.rs — `load_rdfxml_file`, `load_ntriples_file_into_graph`, `load_turtle_file_into_graph`, `load_rdfxml_file_into_graph`
- src/storage/mod.rs — `triple_count_in_graph(g_id)` scans all VP tables for a specific graph
- sql/pg_ripple--0.14.0--0.15.0.sql — migration script (no schema changes; all new features are compiled functions)
- docker-compose.yml — two-service Compose with postgres and sparql containers
- Dockerfile — updated to build and bundle the `pg_ripple_http` binary
- tests/pg_regress/sql/ — `load_into_graph.sql`, `graph_delete.sql`, `sql_api_completeness.sql`, `sparql_protocol.sql`
[0.14.0] — 2025-07-18 — Administrative & Operational Readiness
This release focuses on production operations: maintenance commands, monitoring, graph-level access control, and comprehensive documentation. Everything a system administrator needs to run pg_ripple confidently in production.
New in this release: Maintenance functions (vacuum, reindex, vacuum_dictionary). Dictionary diagnostics (dictionary_stats). Graph-level Row-Level Security with enable_graph_rls, grant_graph, revoke_graph, list_graph_access. Optional pg_trickle integration via schema_summary / enable_schema_summary. Complete documentation for backup/restore, contributing, error codes (PT001–PT799), and security hardening. Extension upgrade scripts for the full 0.1.0 → 0.14.0 chain.
What you can do
- Maintain the store — `pg_ripple.vacuum()` runs `MERGE` then `ANALYZE` on all VP tables; `pg_ripple.reindex()` rebuilds all indices; `pg_ripple.vacuum_dictionary()` removes orphaned dictionary entries after bulk deletes (uses an advisory lock to be safe)
- Diagnose the dictionary — `pg_ripple.dictionary_stats()` returns a JSON object with `total_entries`, `hot_entries`, `cache_capacity`, `cache_budget_mb`, and `shmem_ready`
- Control graph access — `pg_ripple.enable_graph_rls()` activates RLS policies on VP tables keyed on the `g` (graph ID) column; `grant_graph(role, graph, permission)` / `revoke_graph(role, graph)` manage the `_pg_ripple.graph_access` mapping table; `list_graph_access()` returns the current ACL as JSON (see the sketch after this list)
- Bypass RLS for admin work — `SET pg_ripple.rls_bypass = on` in a superuser session skips RLS checks; protected by `GUC_SUSET` (superuser-only)
- Inspect schema — `pg_ripple.schema_summary()` returns the inferred class→property→cardinality summary (populated by the optional pg_trickle integration); `enable_schema_summary()` sets up the `_pg_ripple.inferred_schema` table and stream when pg_trickle is installed
- Upgrade safely — tested upgrade path from every prior version; `ALTER EXTENSION pg_ripple UPDATE` works for all transitions up to 0.14.0
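A sketch of the access-control flow — the role name and the 'read' permission value are illustrative, since the accepted permission strings aren't listed here:

```sql
-- Switch on graph-level RLS, then grant one role access to one graph
SELECT pg_ripple.enable_graph_rls();
SELECT pg_ripple.grant_graph('analyst', 'http://example.org/graphs/hr', 'read');

-- Review the resulting ACL as JSON
SELECT pg_ripple.list_graph_access();
```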
What happens behind the scenes
vacuum() and reindex() discover live VP tables by querying pg_class for tables matching the vp_% pattern in _pg_ripple. vacuum_dictionary() acquires advisory lock 0x7269706c (ripl) then deletes from _pg_ripple.dictionary any row whose encoded ID does not appear in any VP table — safe to run concurrently with queries.
RLS policies are created on _pg_ripple.vp_rare (the catch-all VP table) using current_setting('pg_ripple.rls_bypass', true) as the bypass expression. The graph_access mapping table stores (role_name, graph_id, permission) triples; grant_graph encodes the graph IRI using encode_term before inserting.
Technical details
- src/lib.rs — new pg_extern functions: `vacuum()`, `reindex()`, `vacuum_dictionary()`, `dictionary_stats()`, `enable_graph_rls()`, `grant_graph()`, `revoke_graph()`, `list_graph_access()`, `schema_summary()`, `enable_schema_summary()`; new GUC `pg_ripple.rls_bypass` (bool, `GUC_SUSET`)
- sql/pg_ripple--0.13.0--0.14.0.sql — creates the `_pg_ripple.graph_access` and `_pg_ripple.inferred_schema` tables with appropriate indices
- tests/pg_regress/sql/admin_functions.sql — tests vacuum, reindex, vacuum_dictionary, dictionary_stats, predicate_stats view
- tests/pg_regress/sql/graph_rls.sql — tests grant_graph, list_graph_access, revoke_graph, enable_graph_rls, rls_bypass GUC
- tests/pg_regress/sql/upgrade_path.sql — verifies full administrative API is available after a clean install
- docs/src/user-guide/backup-restore.md — pg_dump/pg_restore, VP table considerations, PITR, logical replication
- docs/src/user-guide/contributing.md — dev setup, test commands, PR workflow, code conventions
- docs/src/reference/error-reference.md — PT001–PT799 error code table
- docs/src/reference/security.md — supported versions matrix, RLS section, hardening GUCs
- docs/src/user-guide/sql-reference/admin.md — expanded with all new v0.14.0 admin functions
[0.13.0] — 2026-04-16 — Performance Hardening
This release is about speed. Using the benchmarks established in earlier versions, pg_ripple v0.13.0 measures and improves performance at every layer: how triple patterns are ordered before query execution, how the PostgreSQL planner understands the data distribution, how parallel workers are exploited for multi-predicate queries, and how data quality rules from SHACL can help the optimizer make better decisions.
New in this release: BGP join reordering based on real table statistics. SPARQL plan cache instrumentation. Parallel query hints for star patterns. Extended statistics on VP table column pairs. SHACL-driven query optimizer hints. New GUCs to control reordering and parallelism thresholds. Regression and fuzz-integration test suites for the query pipeline.
What you can do
- Faster repeated queries — the plan cache now tracks hits and misses; call `plan_cache_stats()` to see your hit rate and tune `pg_ripple.plan_cache_size` for your workload; call `plan_cache_reset()` to evict stale plans (see the example after this list)
- Faster star patterns — pg_ripple now reorders triple patterns within a BGP by estimated selectivity (most restrictive first), matching what a manual SQL expert would write; controlled by `SET pg_ripple.bgp_reorder = on/off`
- Parallel query — queries joining 3 or more VP tables now emit `SET LOCAL max_parallel_workers_per_gather = 4` and `SET LOCAL enable_parallel_hash = on` so PostgreSQL can use parallel workers; threshold tunable via `pg_ripple.parallel_query_min_joins`
- Better planner statistics — extended statistics on `(s, o)` column pairs are automatically created when a predicate is promoted from `vp_rare` to a dedicated VP table; this helps the PostgreSQL planner estimate join cardinalities for multi-predicate queries
- SHACL-informed optimizer — if you have loaded SHACL shapes with `sh:maxCount 1` or `sh:minCount 1`, the optimizer reads those hints and can use them for join costing; hints are only applied when semantics are preserved
- Safer query pipeline — a fuzz integration test suite verifies that malformed SPARQL, SQL injection attempts in IRI values, Unicode IRIs, deeply nested property paths, and very large literals are all handled gracefully without crashes or data corruption
What happens behind the scenes
The BGP reordering optimizer queries pg_class.reltuples and pg_stats.n_distinct for each VP table at translation time to estimate how many rows a pattern will produce given its bound columns. Patterns are sorted cheapest-first using a greedy left-deep algorithm. Before executing the generated SQL, SET LOCAL join_collapse_limit = 1 is emitted so the PostgreSQL planner does not reorder the joins back. On macOS/Linux, SET LOCAL enable_mergejoin = on is also set to exploit merge-join when join columns are ordered.
For parallel execution, the query engine counts VP-table aliases (_t0, _t1, …) in the generated SQL; if the count reaches parallel_query_min_joins, parallel hash join settings are activated before query execution.
Extended statistics (CREATE STATISTICS … (ndistinct, dependencies) ON s, o) are created in _pg_ripple schema alongside the VP tables when promote_predicate() runs. This gives the planner correlation data that single-column ANALYZE cannot provide.
Technical details
- src/sparql/optimizer.rs (new) — `reorder_bgp()`: greedy left-deep selectivity-based reorder; `TableStats` struct with `pg_class.reltuples` + `pg_stats.n_distinct` queries; `load_predicate_hints()`: reads SHACL shapes for `sh:maxCount`/`sh:minCount` hints
- src/sparql/plan_cache.rs — added `HIT_COUNT` and `MISS_COUNT` `AtomicU64` counters; `stats()` returns `(hits, misses, size, cap)`; `reset()` evicts the cache and clears counters; the cache key now includes the `bgp_reorder` GUC value
- src/sparql/sqlgen.rs — `translate_bgp()` now calls `optimizer::reorder_bgp()` before building the join tree
- src/sparql/mod.rs — `execute_select()` emits `SET LOCAL join_collapse_limit = 1`, `enable_mergejoin = on`, and parallel hints when applicable; new public `plan_cache_stats()` and `plan_cache_reset()` functions
- src/storage/mod.rs — `promote_rare_predicates()` calls `create_extended_statistics()` for each newly promoted predicate; `create_extended_statistics()` issues `CREATE STATISTICS IF NOT EXISTS … (ndistinct, dependencies) ON s, o`
- src/lib.rs — two new GUCs: `pg_ripple.bgp_reorder` (bool, default on), `pg_ripple.parallel_query_min_joins` (int, default 3); two new pg_extern functions: `plan_cache_stats() RETURNS JSONB`, `plan_cache_reset() RETURNS VOID`
- sql/pg_ripple--0.12.0--0.13.0.sql — migration script (no schema DDL; new functions are compiled into the extension library)
- tests/pg_regress/sql/shacl_query_opt.sql — verifies BGP reorder GUC, plan cache stats/reset, SHACL shape reading, and sparql_explain output
- tests/pg_regress/sql/fuzz_integration.sql — verifies graceful handling of empty queries, malformed SPARQL, SQL injection via IRI, Unicode IRIs, large literals, deeply nested property paths, and adversarial cache usage
[0.12.0] — 2026-04-16 — SPARQL Update (Advanced)
This release completes the full SPARQL 1.1 Update specification. Building on the INSERT DATA / DELETE DATA support from v0.5.1, pg_ripple now supports pattern-based updates, remote RDF loading, and full named-graph lifecycle management.
New in this release: Find-and-replace data using SPARQL patterns with DELETE/INSERT WHERE. Fetch and load remote RDF documents from any HTTP(S) URL with LOAD <url>. Clear, drop, or create named graphs with a single SPARQL Update call.
What you can do
- Pattern-based updates — `DELETE { … } INSERT { … } WHERE { … }` finds matching triples using the full SPARQL→SQL engine and then deletes and inserts triples for each result row; both the DELETE and INSERT templates may reference WHERE-bound variables (see the example after this list)
- INSERT WHERE — omit the DELETE clause to insert a triple for every WHERE match
- DELETE WHERE — omit the INSERT clause to remove all triples matching a pattern
- LOAD remote RDF — `LOAD <url>` fetches a Turtle, N-Triples, or RDF/XML document via HTTP(S) and inserts all triples; `LOAD <url> INTO GRAPH <g>` targets a named graph; `LOAD SILENT <url>` suppresses network errors
- Clear a graph — `CLEAR GRAPH <g>` removes all triples from a named graph without touching the default graph; `CLEAR DEFAULT`, `CLEAR NAMED`, `CLEAR ALL` let you clear one or all graphs in a single call
- Drop a graph — `DROP GRAPH <g>` clears and deregisters a graph; `DROP SILENT` suppresses errors on non-existent graphs; `DROP ALL` clears the entire store
- Create a graph — `CREATE GRAPH <g>` pre-registers a named graph in the dictionary; `CREATE SILENT` is a no-op if the graph already exists
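A minimal pattern-based update, with illustrative IRIs, issued through the extension's update entry point:

```sql
-- Migrate every value of old:name to new:name in one transactional call
SELECT pg_ripple.sparql_update('
  PREFIX old: <http://example.org/old/>
  PREFIX new: <http://example.org/new/>
  DELETE { ?s old:name ?o }
  INSERT { ?s new:name ?o }
  WHERE  { ?s old:name ?o }
');
```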
What happens behind the scenes
When DELETE/INSERT WHERE runs, the WHERE clause is compiled through the existing SPARQL→SQL engine into a SELECT query. The result rows are collected in memory, and then for each row the DELETE phase removes any matched triples from VP storage, followed by the INSERT phase adding new ones. This keeps the operation transactional inside a single PostgreSQL call.
LOAD uses ureq (a lightweight Rust HTTP client) to fetch the URL. The response body is parsed by the same rio_turtle / rio_xml parsers used for local bulk loading; triples are inserted in batches using the standard VP storage path.
CLEAR and DROP call a new clear_graph_by_id() helper that deletes from both the HTAP delta tables and tombstones the main-partition rows — the same mechanism used by the existing drop_graph() function.
Technical details
- src/sparql/mod.rs — `sparql_update()` extended to handle all `GraphUpdateOperation` variants: `DeleteInsert`, `Load`, `Clear`, `Create`, `Drop`; new helpers `execute_delete_insert()`, `execute_load()`, `execute_clear()`, `execute_drop()`, `resolve_ground_term()`, `resolve_term_pattern()`, `resolve_named_node_pattern()`, `resolve_graph_name_pattern()`, `encode_literal_id()`
- src/storage/mod.rs — new `clear_graph_by_id(g_id)` mirrors `drop_graph()` but takes a pre-encoded ID; new `all_graph_ids()` collects all distinct graph IDs across VP tables and `vp_rare`
- src/bulk_load.rs — new graph-aware loaders `load_ntriples_into_graph()`, `load_turtle_into_graph()`, `load_rdfxml_into_graph()` accept a target `g_id` instead of always writing to the default graph (g=0)
- Cargo.toml — added `ureq = { version = "2", features = ["tls"] }` for `LOAD <url>` HTTP support
- sql/pg_ripple--0.11.0--0.12.0.sql — migration script (schema unchanged; new capabilities compiled into the extension library)
- pg_regress — new test suites: `sparql_update_where.sql`, `sparql_graph_management.sql`; both PASS
[0.11.0] — 2026-04-16 — SPARQL & Datalog Views
This release adds always-fresh, incrementally-maintained stream tables for SPARQL and Datalog queries, plus Extended Vertical Partitioning (ExtVP) semi-join tables for multi-predicate star-pattern acceleration. All three features are built on top of pg_trickle and are soft-gated — pg_ripple loads and operates normally without pg_trickle; the new functions detect its absence at call time and return a clear error with an install hint.
New in this release: Compile any SPARQL SELECT query into a pg_trickle stream table with create_sparql_view(). Bundle a Datalog rule set with a goal pattern into a self-refreshing view with create_datalog_view(). Pre-compute semi-joins between frequently co-joined predicate pairs with create_extvp() to give 2–10× star-pattern speedups.
What you can do
- SPARQL views — `pg_ripple.create_sparql_view(name, sparql, schedule, decode)` compiles a SPARQL SELECT query to SQL and registers it as a pg_trickle stream table; the table stays incrementally up-to-date on every triple insert/update/delete (see the example after this list)
- Datalog views — `pg_ripple.create_datalog_view(name, rules, goal, schedule, decode)` bundles inline Datalog rules with a goal query into a self-refreshing table; `create_datalog_view_from_rule_set(name, rule_set, goal, schedule, decode)` references a previously-loaded named rule set
- ExtVP semi-joins — `pg_ripple.create_extvp(name, pred1_iri, pred2_iri, schedule)` pre-computes the semi-join between two predicate tables; the SPARQL query engine detects and uses ExtVP tables automatically
- Detect pg_trickle — `pg_ripple.pg_trickle_available()` returns `true` if pg_trickle is installed, so callers can gate feature usage without catching errors
- Lifecycle management — `drop_sparql_view`, `drop_datalog_view`, `drop_extvp` remove both the stream table and the catalog entry; `list_sparql_views()`, `list_datalog_views()`, `list_extvp()` return JSONB arrays of registered objects
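For instance (the view name and query are illustrative; named-argument notation relies on the defaults shown in the table below):

```sql
-- Register a SPARQL SELECT as an incrementally maintained stream table
SELECT pg_ripple.create_sparql_view(
  'people',
  'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
   SELECT ?person ?name WHERE { ?person foaf:name ?name }',
  schedule => '1s',
  decode   => true
);

-- The generated stream table name is recorded in the catalog
SELECT pg_ripple.list_sparql_views();
```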
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.pg_trickle_available()` | BOOLEAN | Returns true if pg_trickle is installed |
| `pg_ripple.create_sparql_view(name, sparql, schedule DEFAULT '1s', decode DEFAULT false)` | BIGINT | Compile SPARQL SELECT to a pg_trickle stream table; returns column count |
| `pg_ripple.drop_sparql_view(name)` | BOOLEAN | Drop the stream table and catalog entry |
| `pg_ripple.list_sparql_views()` | JSONB | List all registered SPARQL views |
| `pg_ripple.create_datalog_view(name, rules, goal, rule_set_name DEFAULT 'custom', schedule DEFAULT '10s', decode DEFAULT false)` | BIGINT | Compile inline Datalog rules + goal into a stream table |
| `pg_ripple.create_datalog_view_from_rule_set(name, rule_set, goal, schedule DEFAULT '10s', decode DEFAULT false)` | BIGINT | Reference an existing named rule set for a Datalog view |
| `pg_ripple.drop_datalog_view(name)` | BOOLEAN | Drop the stream table and catalog entry |
| `pg_ripple.list_datalog_views()` | JSONB | List all registered Datalog views |
| `pg_ripple.create_extvp(name, pred1_iri, pred2_iri, schedule DEFAULT '10s')` | BIGINT | Pre-compute a semi-join stream table for two predicates |
| `pg_ripple.drop_extvp(name)` | BOOLEAN | Drop the ExtVP stream table and catalog entry |
| `pg_ripple.list_extvp()` | JSONB | List all registered ExtVP tables |
New catalog tables
| Table | Description |
|---|---|
| `_pg_ripple.sparql_views` | Stores SPARQL view name, original query, generated SQL, schedule, decode mode, stream table name, and variables |
| `_pg_ripple.datalog_views` | Stores Datalog view name, rules, rule set, goal, generated SQL, schedule, decode mode, stream table name, and variables |
| `_pg_ripple.extvp_tables` | Stores ExtVP name, predicate IRIs, predicate IDs, generated SQL, schedule, and stream table name |
Technical details
- src/views.rs — new module implementing all v0.11.0 public functions; `compile_sparql_for_view()` wraps `sparql::sqlgen::translate_select()` and renames internal `_v_{var}` columns to plain `{var}` for stream table compatibility; `create_extvp()` generates a parameterized semi-join SQL template over the two predicate VP tables
- src/lib.rs — three new catalog tables created at extension load time; eleven new `#[pg_extern]` functions exposed in the `pg_ripple` schema
- src/datalog/mod.rs — added `load_and_store_rules(rules_text, rule_set_name) -> i64` helper for Datalog view creation
- src/sparql/mod.rs — `sqlgen` module made `pub(crate)` so `views.rs` can call `translate_select()` directly
- sql/pg_ripple--0.10.0--0.11.0.sql — migration script adding the three catalog tables for upgrades from v0.10.0
- pg_regress — new test suites: `sparql_views.sql`, `datalog_views.sql`, `extvp.sql`; all pass
[0.10.0] — 2026-04-16 — Datalog Reasoning Engine
This release delivers a full Datalog reasoning engine over the VP triple store. Rules are parsed from a Turtle-flavoured syntax, stratified for evaluation order, and compiled to native PostgreSQL SQL — no external reasoner process needed.
New in this release: pg_ripple can now execute RDFS and OWL RL entailment, user-defined inference rules, Datalog constraints, and arithmetic/string built-ins. Inference results are written back into the VP store with source = 1 so explicit and derived triples are always distinguishable. A hot dictionary tier accelerates frequent IRI lookups, and a SHACL-AF bridge detects sh:rule properties in shape graphs and registers them alongside standard Datalog rules.
What you can do
- Write custom inference rules — `pg_ripple.load_rules(rules, rule_set)` parses Turtle-flavoured Datalog and stores the compiled SQL strata
- Built-in RDFS entailment — `pg_ripple.load_rules_builtin('rdfs')` loads all 13 RDFS entailment rules; call `pg_ripple.infer('rdfs')` to materialize the closure (see the example after this list)
- Built-in OWL RL reasoning — `pg_ripple.load_rules_builtin('owl-rl')` loads ~20 core OWL RL rules covering class hierarchy, property chains, and inverse/symmetric/transitive properties
- Run inference on demand — `pg_ripple.infer(rule_set)` runs all strata in order and inserts derived triples with `source = 1`; safe to call repeatedly (idempotent)
- Declare integrity constraints — rules with an empty head become constraints; `pg_ripple.check_constraints()` returns all violations as JSONB
- Inspect and manage rule sets — `pg_ripple.list_rules()` returns rules as JSONB; `pg_ripple.drop_rules(rule_set)` clears a named set; `enable_rule_set`/`disable_rule_set` toggle a set without deleting it
- Accelerate hot IRIs — `pg_ripple.prewarm_dictionary_hot()` loads frequently-used IRIs (≤ 512 B) into an UNLOGGED hot table for sub-microsecond lookups; survives connection pooling but not database restart
- SHACL-AF bridge — shapes that contain `sh:rule` entries are detected by `load_shacl()` and registered in the rules catalog; full SHACL-AF rule execution is planned for v0.11.0
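The built-in rule sets make a good smoke test; this sketch assumes the built-ins register under the same set names ('rdfs', 'owl-rl') accepted by `load_rules_builtin()`:

```sql
-- Load the 13 RDFS entailment rules and materialize the closure
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');

-- Derived triples are inserted with source = 1; constraints report violations as JSONB
SELECT pg_ripple.check_constraints();
```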
New GUC parameters
| GUC | Default | Description |
|---|---|---|
| `pg_ripple.inference_mode` | `'on_demand'` | `'off'` disables the engine; `'on_demand'` evaluates via CTEs; `'materialized'` uses pg_trickle stream tables |
| `pg_ripple.enforce_constraints` | `'warn'` | `'off'` silences violations; `'warn'` logs them; `'error'` raises an exception |
| `pg_ripple.rule_graph_scope` | `'default'` | `'default'` applies rules to the default graph only; `'all'` applies across all named graphs |
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.load_rules(rules TEXT, rule_set TEXT DEFAULT 'custom')` | BIGINT | Parse, stratify, and store a Datalog rule set; returns the number of rules loaded |
| `pg_ripple.load_rules_builtin(name TEXT)` | BIGINT | Load a built-in rule set by name ('rdfs' or 'owl-rl') |
| `pg_ripple.list_rules()` | JSONB | Return all active rules as a JSONB array |
| `pg_ripple.drop_rules(rule_set TEXT)` | BIGINT | Delete a named rule set; returns the number of rules deleted |
| `pg_ripple.enable_rule_set(name TEXT)` | VOID | Mark a rule set as active |
| `pg_ripple.disable_rule_set(name TEXT)` | VOID | Mark a rule set as inactive |
| `pg_ripple.infer(rule_set TEXT DEFAULT 'custom')` | BIGINT | Run inference; returns the number of derived triples inserted |
| `pg_ripple.check_constraints(rule_set TEXT DEFAULT NULL)` | JSONB | Evaluate integrity constraints; returns violations |
| `pg_ripple.prewarm_dictionary_hot()` | BIGINT | Load hot IRIs into the UNLOGGED hot table; returns rows loaded |
Technical details
- src/datalog/mod.rs — public API and IR type definitions (`Term`, `Atom`, `BodyLiteral`, `Rule`, `RuleSet`); catalog helpers for `_pg_ripple.rules` and `_pg_ripple.rule_sets`
- src/datalog/parser.rs — tokenizer and recursive-descent parser for Turtle-flavoured Datalog; variables as `?x`, full IRIs as `<...>`, prefixed IRIs as `prefix:local`, `head :- body .` delimiter
- src/datalog/stratify.rs — SCC-based stratification via Kosaraju's algorithm; unstratifiable programs (negation cycles) are rejected with a clear error message naming the cyclic predicates
- src/datalog/compiler.rs — compiles Rule IR to PostgreSQL SQL; non-recursive strata use `INSERT … SELECT … ON CONFLICT DO NOTHING`; recursive strata use `WITH RECURSIVE … CYCLE` (PG18 native cycle detection); negation compiles to `NOT EXISTS`; arithmetic/string built-ins compile to inline SQL expressions
- src/datalog/builtins.rs — RDFS (13 rules: rdfs2–rdfs12, subclass, domain, range) and OWL RL (~20 rules: class hierarchy, property chains, inverse/symmetric/transitive) as embedded Rust string constants
- src/dictionary/hot.rs — UNLOGGED hot table `_pg_ripple.dictionary_hot` for IRIs ≤ 512 B; `prewarm_hot_table()` runs at `_PG_init` when `inference_mode != 'off'`; `lookup_hot()` and `add_to_hot()` provide O(1) in-process hash lookups
- src/shacl/mod.rs — `parse_and_store_shapes()` now calls `bridge_shacl_rules()` when `inference_mode != 'off'`; the bridge detects `sh:rule` and registers a placeholder in `_pg_ripple.rules`
- VP store — `source SMALLINT NOT NULL DEFAULT 0` column present in all VP tables; the migration script adds it retroactively to tables created before v0.10.0; `source = 0` means explicit, `source = 1` means derived
- Migration script — `sql/pg_ripple--0.9.0--0.10.0.sql` includes all `CREATE TABLE IF NOT EXISTS` and `ALTER TABLE … ADD COLUMN IF NOT EXISTS` statements for zero-downtime upgrades
- New pg_regress tests: `datalog_custom.sql`, `datalog_rdfs.sql`, `datalog_owl_rl.sql`, `datalog_negation.sql`, `datalog_arithmetic.sql`, `datalog_constraints.sql`, `datalog_malformed.sql`, `shacl_af_rule.sql`, `rdf_star_datalog.sql`
[0.9.0] — 2026-04-15 — Serialization, Export & Interop
This release completes RDF I/O: pg_ripple can now import from and export to all major RDF serialization formats, and SPARQL CONSTRUCT and DESCRIBE queries can return results directly as Turtle or JSON-LD.
New in this release: Until now, you could load Turtle and N-Triples but exports were limited to N-Triples and N-Quads. You can now export as Turtle or JSON-LD — formats that are friendlier for human reading and REST APIs respectively. RDF/XML import covers the format that Protégé and most OWL editors produce. Streaming export variants handle large graphs without buffering the full document in memory.
What you can do
- Load RDF/XML — `pg_ripple.load_rdfxml(data TEXT)` parses conformant RDF/XML (Protégé, OWL, most ontology editors); returns the number of triples loaded
- Export as Turtle — `pg_ripple.export_turtle()` serializes the default graph (or any named graph) as a compact Turtle document with `@prefix` declarations; RDF-star quoted triples use Turtle-star notation
- Export as JSON-LD — `pg_ripple.export_jsonld()` serializes triples as a JSON-LD expanded-form array, ready for REST APIs and Linked Data Platform contexts
- Stream large graphs — `pg_ripple.export_turtle_stream()` and `pg_ripple.export_jsonld_stream()` return one line at a time as `SETOF TEXT`, suitable for `COPY … TO STDOUT` pipelines
- Get CONSTRUCT results as Turtle — `pg_ripple.sparql_construct_turtle(query)` runs a SPARQL CONSTRUCT query and returns a Turtle document instead of JSONB rows
- Get CONSTRUCT results as JSON-LD — `pg_ripple.sparql_construct_jsonld(query)` returns JSONB in JSON-LD expanded form
- Get DESCRIBE results as Turtle or JSON-LD — `pg_ripple.sparql_describe_turtle(query)` and `pg_ripple.sparql_describe_jsonld(query)` offer the same format choice for DESCRIBE (see the sketch after this list)
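A short sketch of the new export surface. The CONSTRUCT query and the `ex:` names are illustrative:

```sql
-- Whole default graph as one Turtle document
SELECT pg_ripple.export_turtle();

-- CONSTRUCT straight to Turtle instead of JSONB rows
SELECT pg_ripple.sparql_construct_turtle($$
PREFIX ex: <http://example.org/>
CONSTRUCT { ?item ex:label ?l } WHERE { ?item ex:label ?l }
$$);

-- Streaming variant: one line per row, suitable for COPY pipelines
COPY (SELECT * FROM pg_ripple.export_turtle_stream()) TO STDOUT;
```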
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.load_rdfxml(data TEXT)` | BIGINT | Parse RDF/XML, load into the default graph |
| `pg_ripple.export_turtle(graph TEXT DEFAULT NULL)` | TEXT | Export graph as Turtle |
| `pg_ripple.export_jsonld(graph TEXT DEFAULT NULL)` | JSONB | Export graph as JSON-LD (expanded form) |
| `pg_ripple.export_turtle_stream(graph TEXT DEFAULT NULL)` | SETOF TEXT | Streaming Turtle export |
| `pg_ripple.export_jsonld_stream(graph TEXT DEFAULT NULL)` | SETOF TEXT | Streaming JSON-LD NDJSON export |
| `pg_ripple.sparql_construct_turtle(query TEXT)` | TEXT | CONSTRUCT result as Turtle |
| `pg_ripple.sparql_construct_jsonld(query TEXT)` | JSONB | CONSTRUCT result as JSON-LD |
| `pg_ripple.sparql_describe_turtle(query TEXT, strategy TEXT DEFAULT 'cbd')` | TEXT | DESCRIBE result as Turtle |
| `pg_ripple.sparql_describe_jsonld(query TEXT, strategy TEXT DEFAULT 'cbd')` | JSONB | DESCRIBE result as JSON-LD |
Technical details
- `rio_xml` crate added as a dependency for RDF/XML parsing (uses the `rio_api` `TriplesParser` interface, consistent with the existing rio_turtle parsers)
- `src/export.rs` extended with `export_turtle`, `export_jsonld`, `export_turtle_stream`, `export_jsonld_stream`, `triples_to_turtle`, and `triples_to_jsonld`
- Turtle serialization groups by subject using `BTreeMap` for deterministic output; emits predicate-object lists per subject
- JSON-LD expanded form: each subject is one array entry; predicates become IRI-keyed arrays of `{"@value": …}` / `{"@id": …}` objects
- RDF-star quoted triples: passed through in Turtle-star `<< s p o >>` notation; in JSON-LD emitted as `{"@value": "…", "@type": "rdf:Statement"}`
- Streaming variants avoid buffering the full document; `export_turtle_stream` yields prefix lines then one `s p o .` per row
- SPARQL format functions (`sparql_construct_turtle`, etc.) delegate to the existing SPARQL engine, then pass rows through the new serialization layer
- New pg_regress tests: `serialization.sql`, `rdf_star_construct.sql`, expanded `sparql_construct.sql`
[0.8.0] — 2026-04-15 — Advanced Data Quality Rules
This release rounds out the data quality system with more expressive rules and a background validation mode that never slows down your inserts.
New in this release: Until now, each validation rule applied to a single property in isolation. You can now combine rules — "this value must satisfy rule A or rule B", "must satisfy all of these rules", "must not match this rule" — and count how many values on a property actually conform to a sub-rule. A background mode queues violations for later review instead of blocking every write.
What you can do
- Combine rules with logic — use `sh:or`, `sh:and`, and `sh:not` to build validation rules that express complex conditions, such as "a contact must have either a phone number or an email address" (see the sketch after this list)
- Reference another rule from within a rule — `sh:node <ShapeIRI>` checks that each value on a property also satisfies a separate named rule; rules can reference each other up to 32 levels deep without getting stuck in a loop
- Count qualifying values — `sh:qualifiedValueShape` combined with `sh:qualifiedMinCount`/`sh:qualifiedMaxCount` counts only the values that actually pass a sub-rule, so you can say "at least two authors must be affiliated with a university"
- Validate without blocking writes — set `pg_ripple.shacl_mode = 'async'` so that inserts complete immediately and violations are collected silently in the background; the background worker drains the queue automatically
- Inspect collected violations — `pg_ripple.dead_letter_queue()` returns all async violations as a JSON array; `pg_ripple.drain_dead_letter_queue()` clears the queue once you have reviewed them
- Drain the queue manually — `pg_ripple.process_validation_queue(batch_size)` processes violations on demand, useful in test pipelines or batch jobs
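A sketch of the "phone or email" rule from the first bullet, plus the async review loop. The shape and property IRIs are illustrative:

```sql
SELECT pg_ripple.load_shacl($$
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:ContactShape a sh:NodeShape ;
  sh:targetClass ex:Contact ;
  sh:or ( [ sh:path ex:phone ; sh:minCount 1 ]
          [ sh:path ex:email ; sh:minCount 1 ] ) .
$$);

SET pg_ripple.shacl_mode = 'async';          -- writes never block
-- ... load data ...
SELECT pg_ripple.dead_letter_count();        -- anything to review?
SELECT pg_ripple.dead_letter_queue();        -- full JSONB violation reports
SELECT pg_ripple.drain_dead_letter_queue();  -- clear once reviewed
```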
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.process_validation_queue(batch_size BIGINT DEFAULT 1000)` | BIGINT | Process up to N pending validation jobs |
| `pg_ripple.validation_queue_length()` | BIGINT | How many jobs are waiting in the queue |
| `pg_ripple.dead_letter_count()` | BIGINT | How many violations have been recorded |
| `pg_ripple.dead_letter_queue()` | JSONB | All recorded violations as a JSON array |
| `pg_ripple.drain_dead_letter_queue()` | BIGINT | Delete all recorded violations and return how many were removed |
Technical details
- `ShapeConstraint` enum extended with `Or(Vec<String>)`, `And(Vec<String>)`, `Not(String)`, `QualifiedValueShape { shape_iri, min_count, max_count }`
- `validate_property_shape()` refactored to accept `all_shapes: &[Shape]` for recursive nested shape evaluation
- `node_conforms_to_shape()` added: depth-limited recursive conformance check (max depth 32)
- `process_validation_batch(batch_size)` added: SPI-based batch drain of `_pg_ripple.validation_queue`; writes violations to `_pg_ripple.dead_letter_queue`
- Merge worker (`src/worker.rs`) extended with `run_validation_cycle()`, called after each merge transaction
- `validate_sync()` now handles `Class`, `Node`, `Or`, `And`, `Not`, and `QualifiedValueShape` (max-count check only for sync)
- `run_validate()` now checks top-level node `Or`/`And`/`Not` constraints in offline validation
[0.7.0] — 2026-04-15 — Data Quality Rules (Core)
This release adds SHACL — a W3C standard for expressing data quality rules — and on-demand deduplication for datasets that have accumulated duplicate entries.
What this means in practice: You define rules like "every Person must have a name, and the name must be a string", load them into the database once, and pg_ripple will check those rules on every insert or on demand. Violations are reported as structured JSON so they can be logged, monitored, or acted on automatically.
What you can do
- Define data quality rules — `pg_ripple.load_shacl(data TEXT)` parses rules written in W3C SHACL Turtle format and stores them in the database; returns the number of rules loaded (see the sketch after this list)
- Check your data — `pg_ripple.validate(graph TEXT DEFAULT NULL)` runs all active rules against your data and returns a JSON report: `{"conforms": true/false, "violations": [...]}`. Pass a graph name to validate only that graph
- Reject bad data on insert — set `pg_ripple.shacl_mode = 'sync'` to have `insert_triple()` immediately reject any triple that violates a `sh:maxCount`, `sh:datatype`, `sh:in`, or `sh:pattern` rule
- Manage rules — `pg_ripple.list_shapes()` lists all loaded rules; `pg_ripple.drop_shape(uri TEXT)` removes one rule by its IRI
- Remove duplicate triples — `pg_ripple.deduplicate_predicate(p_iri TEXT)` removes duplicate entries for one property, keeping the earliest record; `pg_ripple.deduplicate_all()` deduplicates everything
- Deduplicate automatically on merge — set `pg_ripple.dedup_on_merge = true` to eliminate duplicates each time the background worker compacts data (see v0.6.0)
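A minimal sketch of the rule from the introduction ("every Person must have a name, and the name must be a string"); the `ex:` IRIs are illustrative:

```sql
SELECT pg_ripple.load_shacl($$
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
  sh:targetClass ex:Person ;
  sh:property [ sh:path ex:name ; sh:minCount 1 ; sh:datatype xsd:string ] .
$$);

SELECT pg_ripple.validate();        -- {"conforms": ..., "violations": [...]}
SET pg_ripple.shacl_mode = 'sync';  -- from here on, violating inserts are rejected
```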
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.load_shacl(data TEXT)` | INTEGER | Parse Turtle, store rules, return count loaded |
| `pg_ripple.validate(graph TEXT DEFAULT NULL)` | JSONB | Full validation report |
| `pg_ripple.list_shapes()` | TABLE(shape_iri TEXT, active BOOLEAN) | All rules in the catalog |
| `pg_ripple.drop_shape(shape_uri TEXT)` | INTEGER | Remove a rule by IRI |
| `pg_ripple.deduplicate_predicate(p_iri TEXT)` | BIGINT | Remove duplicates for one property |
| `pg_ripple.deduplicate_all()` | BIGINT | Remove duplicates across all properties |
| `pg_ripple.enable_shacl_monitors()` | BOOLEAN | Create a live violation-count stream table (requires pg_trickle) |
New configuration options
| Option | Default | Description |
|---|---|---|
| `pg_ripple.shacl_mode` | `'off'` | When to validate: `'off'`, `'sync'` (block bad inserts), `'async'` (queue for later — see v0.8.0) |
| `pg_ripple.dedup_on_merge` | `false` | Eliminate duplicate triples during each background merge |
New internal tables
| Table | Description |
|---|---|
| `_pg_ripple.shacl_shapes` | Stores each loaded rule with its IRI, parsed JSON, and active flag |
| `_pg_ripple.validation_queue` | Inbox for inserts when `shacl_mode = 'async'` |
| `_pg_ripple.dead_letter_queue` | Recorded violations with full JSONB violation reports |
| `_pg_ripple.violation_summary` | Live violation counts by rule and severity (created by `enable_shacl_monitors()`) |
Supported validation constraints (v0.7.0)
`sh:minCount`, `sh:maxCount`, `sh:datatype`, `sh:in`, `sh:pattern`, `sh:class`, `sh:targetClass`, `sh:targetNode`, `sh:targetSubjectsOf`, `sh:targetObjectsOf`. Logical combinators (`sh:or`, `sh:and`, `sh:not`) and qualified constraints are added in v0.8.0.
Upgrading from v0.6.0
ALTER EXTENSION pg_ripple UPDATE;
The migration creates three new tables (`shacl_shapes`, `validation_queue`, `dead_letter_queue`) and their indexes. No existing tables are modified.
[0.6.0] — 2026-04-15 — High-Speed Reads and Writes at the Same Time
This release separates write traffic from read traffic so both can run at full speed simultaneously. It also adds change notifications so other systems can react to new triples in real time.
The problem this solves: In earlier versions, heavy read queries could slow down writes and vice versa. Now, writes go into a small fast table and reads see everything via a transparent view. A background worker periodically merges the write table into an optimised read table without interrupting either operation.
What you can do
- Write and read simultaneously without blocking — inserts land in a fast write buffer; reads see both the buffer and the main read-optimised store through a transparent view
- Trigger a manual merge — `pg_ripple.compact()` immediately merges all pending writes into the read store; returns the total number of triples after compaction
- Subscribe to changes — `pg_ripple.subscribe(pattern TEXT, channel TEXT)` sends a PostgreSQL `LISTEN`/`NOTIFY` message to `channel` every time a triple matching `pattern` is inserted or deleted; use `'*'` to receive all changes (see the sketch after this list)
- Unsubscribe — `pg_ripple.unsubscribe(channel TEXT)` stops notifications on a channel
- Get storage statistics — `pg_ripple.stats()` reports the total triple count, how many predicates have their own table, how many triples are still in the write buffer, and the background worker's process ID
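A sketch of the notification and maintenance surface; the channel name is arbitrary:

```sql
SELECT pg_ripple.subscribe('*', 'ripple_changes');  -- '*' = all changes
LISTEN ripple_changes;       -- payloads arrive as JSON via pg_notify

SELECT pg_ripple.compact();  -- force-merge pending writes into the read store
SELECT pg_ripple.stats();    -- triple count, buffer depth, worker PID, ...
```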
New SQL functions
| Function | Returns | Description |
|---|---|---|
| `pg_ripple.compact()` | BIGINT | Merge all pending writes into the read store |
| `pg_ripple.stats()` | JSONB | Storage and background worker statistics |
| `pg_ripple.subscribe(pattern TEXT, channel TEXT)` | BIGINT | Subscribe to change notifications |
| `pg_ripple.unsubscribe(channel TEXT)` | BIGINT | Stop notifications on a channel |
| `pg_ripple.htap_migrate_predicate(pred_id BIGINT)` | VOID | Migrate one property table to the split-storage layout |
| `pg_ripple.subject_predicates(subject_id BIGINT)` | BIGINT[] | All properties for a given subject (fast lookup) |
| `pg_ripple.object_predicates(object_id BIGINT)` | BIGINT[] | All properties for a given object (fast lookup) |
New configuration options
| Option | Default | Description |
|---|---|---|
| `pg_ripple.merge_threshold` | 10000 | Minimum pending writes before a background merge starts |
| `pg_ripple.merge_interval_secs` | 60 | Maximum seconds between merge cycles |
| `pg_ripple.merge_retention_seconds` | 60 | How long to keep the previous read table before dropping it |
| `pg_ripple.latch_trigger_threshold` | 10000 | Pending writes needed to wake the merge worker early |
| `pg_ripple.worker_database` | `postgres` | Which database the merge worker connects to |
| `pg_ripple.merge_watchdog_timeout` | 300 | Log a warning if the merge worker is silent for this many seconds |
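These are ordinary PostgreSQL settings, so they can be tuned with `ALTER SYSTEM`. A plausible adjustment for a write-heavy workload might look like this (values are illustrative, not recommendations):

```sql
ALTER SYSTEM SET pg_ripple.merge_threshold = 50000;   -- merge in bigger batches
ALTER SYSTEM SET pg_ripple.merge_interval_secs = 30;  -- but never wait more than 30 s
SELECT pg_reload_conf();
```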
Bug fixes in this release
- Startup race condition — the extension's shared memory flag is now set inside the correct PostgreSQL startup hook, eliminating a rare crash window during server start
- GUC registration crash — configuration parameters requiring postmaster-level access no longer crash when `CREATE EXTENSION pg_ripple` runs without the extension in `shared_preload_libraries`
- SPARQL aggregate decode bug — `COUNT`, `SUM`, and similar aggregate results were incorrectly looked up in the string dictionary; they now pass through as plain numbers
- Merge worker: DROP TABLE without CASCADE — the merge worker failed if old tables had dependent views; fixed by using `CASCADE` and recreating the view afterwards
- Merge worker: stale index name — repeated `compact()` calls failed with "relation already exists" because the old index name survived a table rename; the stale index is now dropped before creating a new one
Upgrading from v0.5.1
ALTER EXTENSION pg_ripple UPDATE;
The migration script adds a column to the predicate catalog, creates the pattern tables and change-notification infrastructure, and converts every existing property table to the split read/write layout in a single transaction. Existing triples land in the write buffer; call pg_ripple.compact() afterwards to move them into the read store immediately.
Technical details
- HTAP split: writes → `vp_{id}_delta` (heap + B-tree); cross-partition deletes → `vp_{id}_tombstones`; query view = `(main EXCEPT tombstones) UNION ALL delta`
- Background merge: sort-ordered insertion into a fresh `vp_{id}_main` (BRIN-indexed) + `ANALYZE`; previous main dropped after `merge_retention_seconds`
- `ExecutorEnd_hook` pokes the merge worker latch when `TOTAL_DELTA_ROWS` reaches `latch_trigger_threshold`
- Subject/object pattern tables (`_pg_ripple.subject_patterns`, `_pg_ripple.object_patterns`) — GIN-indexed `BIGINT[]` columns rebuilt by the merge worker; enable O(1) predicate lookup per node
- CDC notifications fire as `pg_notify(channel, '{"op":"insert|delete","s":...,"p":...,"o":...,"g":...}')` via a trigger on each delta table
[0.5.1] — 2026-04-15 — Compact Number Storage, CONSTRUCT/DESCRIBE, SPARQL Update, Full-Text Search
This release stores common data types (integers, dates, booleans) as compact numbers instead of text, making range comparisons in queries much faster. It also adds the two remaining SPARQL query forms, write support via SPARQL Update, and full-text search on text values.
What you can do
- Faster comparisons on numbers and dates — `xsd:integer`, `xsd:boolean`, `xsd:date`, and `xsd:dateTime` values are stored as compact integers; FILTER comparisons (`>`, `<`, `=`) run as plain integer comparisons with no string decoding
- SPARQL CONSTRUCT — `pg_ripple.sparql_construct(query TEXT)` assembles new triples from a template and returns them as a set of `{s, p, o}` JSON objects; useful for transforming or exporting data
- SPARQL DESCRIBE — `pg_ripple.sparql_describe(query TEXT, strategy TEXT)` returns the neighbourhood of a resource — all triples directly connected to it (Concise Bounded Description) or both incoming and outgoing triples (Symmetric CBD)
- SPARQL Update — `pg_ripple.sparql_update(query TEXT)` executes `INSERT DATA { … }` and `DELETE DATA { … }` statements; returns the number of triples affected
- Full-text search — `pg_ripple.fts_index(predicate TEXT)` indexes text values for a property; `pg_ripple.fts_search(query TEXT, predicate TEXT)` searches them using standard PostgreSQL text-search syntax (see the sketch after this list)
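A sketch combining the new write path and full-text search; the IRIs and the search string are illustrative:

```sql
SELECT pg_ripple.sparql_update($$
PREFIX ex: <http://example.org/>
INSERT DATA { ex:book1 ex:title "Knowledge Graphs in Practice" }
$$);

SELECT pg_ripple.fts_index('<http://example.org/title>');
SELECT pg_ripple.fts_search('knowledge & graphs', '<http://example.org/title>');
```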
Bug fixes
- `fts_index` now accepts N-Triples `<IRI>` notation for the predicate argument
- `fts_index` now uses a correct partial index that does not require PostgreSQL subquery support
- Inline-encoded values (integers, dates) now decode correctly in SPARQL SELECT results instead of returning NULL
New configuration options
- `pg_ripple.describe_strategy` (default `'cbd'`) — DESCRIBE expansion algorithm: `'cbd'`, `'scbd'` (symmetric), or `'simple'` (subject only)
[0.5.0] — 2026-04-15 — Complete SPARQL 1.1 Query Engine
This release completes SPARQL 1.1 query support. All standard query patterns — graph traversal, aggregates, unions, subqueries, optional matches, and computed values — are now supported.
What you can do
- Traverse graph relationships — property paths (`+`, `*`, `?`, `/`, `|`, `^`) follow chains of relationships; cyclic graphs are handled safely using PostgreSQL's cycle detection (see the sketch after this list)
- Combine results from alternative patterns — `UNION { ... } UNION { ... }` merges results from two or more patterns; `MINUS { ... }` removes results that match an unwanted pattern
- Aggregate and group results — `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `GROUP_CONCAT` work with `GROUP BY` and `HAVING` just as in SQL
- Use subqueries — nest `{ SELECT … WHERE { … } }` patterns at any depth
- Compute new values — `BIND(<expr> AS ?var)` assigns a calculated value to a variable; `VALUES ?x { … }` injects a fixed set of values into a pattern
- Optional matches — `OPTIONAL { … }` returns results even when the optional pattern has no data, leaving those variables unbound
- Limit recursion depth — `pg_ripple.max_path_depth` caps how deep property-path traversal can go, preventing runaway queries on very large graphs
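Two sketches of the new query forms, a transitive path and a grouped aggregate. The `ex:` vocabulary is illustrative:

```sql
-- Everyone above ?emp in the org chart, any number of hops
SELECT * FROM pg_ripple.sparql($$
PREFIX ex: <http://example.org/>
SELECT ?boss WHERE { ex:emp42 ex:reportsTo+ ?boss }
$$);

-- Headcount per department, keeping departments with more than 10 people
SELECT * FROM pg_ripple.sparql($$
PREFIX ex: <http://example.org/>
SELECT ?dept (COUNT(?emp) AS ?n)
WHERE { ?emp ex:worksIn ?dept }
GROUP BY ?dept
HAVING (COUNT(?emp) > 10)
$$);
```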
Bug fixes
- Sequence paths (`p/q`) no longer produce a Cartesian product when intermediate nodes are anonymous
- `p*` (zero-or-more) paths no longer crash with a PostgreSQL CYCLE syntax error
- `OPTIONAL` no longer produces incorrect results due to an alias collision in the generated SQL
- `GROUP BY` column references no longer go out of scope in the outer query
- `MINUS` join clause now uses the correct column alias
- `VALUES` no longer generates a duplicate alias clause
- `BIND` in aggregate subqueries (`SELECT (COUNT(?p) AS ?cnt)`) now produces the correct SQL expression
- Numbers in FILTER expressions (`FILTER(?cnt >= 2)`) are now emitted as SQL integers instead of dictionary IDs
- Changing `pg_ripple.max_path_depth` mid-session now correctly invalidates the plan cache
Technical details
- Property paths compile to `WITH RECURSIVE … CYCLE` CTEs using PostgreSQL 18's hash-based `CYCLE` clause
- All pg_regress test files are now idempotent — safe to run multiple times against the same database
- `setup.sql` drops and recreates the extension for full isolation between runs
- New tests: `property_paths.sql`, `aggregates.sql`, `resource_limits.sql` — 12/12 pass
[0.4.0] — 2026-04-14 — Statements About Statements (RDF-star)
This release adds RDF-star: the ability to store facts about facts. For example, you can record not just "Alice knows Bob" but also "Alice knows Bob — according to Carol, since 2020". This is essential for provenance tracking, temporal data, and property graph–style edge annotations.
What you can do
- Load N-Triples-star data — `pg_ripple.load_ntriples()` now accepts N-Triples-star, including nested quoted triples in both subject and object position
- Encode and decode quoted triples — `pg_ripple.encode_triple(s, p, o)` stores a quoted triple and returns its ID; `pg_ripple.decode_triple(id)` converts it back to JSON
- Use statement identifiers — `pg_ripple.insert_triple()` now returns the stable integer identifier of the stored statement; that identifier can itself appear as a subject or object in other triples
- Look up a statement by its identifier — `pg_ripple.get_statement(i BIGINT)` returns `{"s":…,"p":…,"o":…,"g":…}` for any stored statement
- Query with SPARQL-star — ground (all-constant) quoted triple patterns work in SPARQL `WHERE` clauses: `WHERE { << :Alice :knows :Bob >> :assertedBy ?who }` (see the sketch after this list)
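A sketch of the provenance pattern from the introduction. The IRIs are illustrative; note that the query uses a ground (all-constant) quoted-triple pattern, which is what this release supports:

```sql
SELECT pg_ripple.load_ntriples($$
<< <http://example.org/alice> <http://example.org/knows> <http://example.org/bob> >> <http://example.org/assertedBy> <http://example.org/carol> .
$$);

SELECT * FROM pg_ripple.sparql($$
SELECT ?who WHERE {
  << <http://example.org/alice> <http://example.org/knows> <http://example.org/bob> >>
     <http://example.org/assertedBy> ?who .
}
$$);
```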
Known limitations in this release
- Turtle-star is not yet supported; use N-Triples-star for RDF-star bulk loading
- Variable-inside-quoted-triple SPARQL patterns (e.g. `<< ?s :knows ?o >> :assertedBy ?who`) are deferred to v0.5.x
- W3C SPARQL-star conformance test suite not yet run (deferred to v0.5.x)
Technical details
- `KIND_QUOTED_TRIPLE = 5` added to the dictionary; quoted triples stored with `qt_s`, `qt_p`, `qt_o` columns via non-destructive `ALTER TABLE … ADD COLUMN IF NOT EXISTS`
- Custom recursive-descent N-Triples-star line parser — avoids the `oxrdf/rdf-12` + `spargebra` feature conflict with no new crate dependencies
- `spargebra` and `sparopt` now use the `sparql-12` feature, enabling `TermPattern::Triple` with correct exhaustiveness guards
- SPARQL-star ground patterns compile to a dictionary lookup + SQL equality condition
[0.3.0] — 2026-04-14 — SPARQL Query Language
This release introduces SPARQL, the standard W3C query language for RDF data. You can now ask questions over your stored facts using a familiar graph-pattern syntax, with results returned as JSON.
What you can do
- Run SPARQL SELECT queries — `pg_ripple.sparql(query TEXT)` executes a SPARQL SELECT and returns one JSON object per result row, with variable names as keys and values in standard N-Triples format (see the sketch after this list)
- Run SPARQL ASK queries — `pg_ripple.sparql_ask(query TEXT)` returns `true` if any results exist, `false` otherwise
- Inspect the generated SQL — `pg_ripple.sparql_explain(query TEXT, analyze BOOL DEFAULT false)` shows what SQL was generated from a SPARQL query; pass `analyze := true` for a full execution plan with timings
- Tune the query plan cache — `pg_ripple.plan_cache_size` (default 256) controls how many SPARQL-to-SQL translations are cached per connection; set to `0` to disable caching
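A first query plus its generated SQL; the `ex:` vocabulary is illustrative:

```sql
SELECT * FROM pg_ripple.sparql($$
PREFIX ex: <http://example.org/>
SELECT ?name WHERE {
  ?person ex:worksAt ex:acme .
  ?person ex:name ?name .
} ORDER BY ?name LIMIT 10
$$);

-- What SQL did that become? analyze := true also runs it with row counts
SELECT pg_ripple.sparql_explain($$
PREFIX ex: <http://example.org/>
SELECT ?name WHERE { ?person ex:worksAt ex:acme . ?person ex:name ?name . }
$$, analyze := true);
```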
Supported query features
- Basic graph patterns with bound or wildcard subjects, predicates, and objects
- `FILTER` with comparisons (`=`, `!=`, `<`, `<=`, `>`, `>=`) and boolean operators (`&&`, `||`, `!`, `BOUND()`)
- `OPTIONAL` (left join)
- `GRAPH <iri> { … }` and `GRAPH ?g { … }` for named graph scoping
- `SELECT` with variable projection, `DISTINCT`, `REDUCED`
- `LIMIT`, `OFFSET`, `ORDER BY`
Technical details
- SPARQL text → `spargebra 0.4` algebra tree → SQL via `src/sparql/sqlgen.rs`; all IRI and literal constants are encoded to `i64` before appearing in SQL — SQL injection via SPARQL constants is structurally impossible
- Per-query encoding cache avoids redundant dictionary lookups for constants appearing multiple times in one query
- Self-join elimination: patterns sharing a subject but using different predicates compile to a single scan, not separate subqueries
- Batch decode: all integer result columns are decoded in a single `SELECT … WHERE id IN (…)` round-trip
- `RUST_TEST_THREADS = "1"` in `.cargo/config.toml` prevents concurrent dictionary upsert deadlocks in the test suite
- New pg_regress tests: `sparql_queries.sql` (10 queries), `sparql_injection.sql` (7 adversarial inputs)
[0.2.0] — 2026-04-14 — Bulk Loading, Named Graphs, and Export
This release makes it practical to work with large RDF datasets. You can load standard RDF files, organise triples into named collections, export data back to standard formats, and register IRI prefixes for convenience.
What you can do
- Load RDF files in bulk — `pg_ripple.load_ntriples(data TEXT)`, `load_nquads(data TEXT)`, `load_turtle(data TEXT)`, and `load_trig(data TEXT)` accept standard RDF text and return the number of triples loaded (see the sketch after this list)
- Load from a file on the server — `pg_ripple.load_ntriples_file(path TEXT)` and its siblings read a file directly from the server filesystem (requires superuser); essential for large datasets
- Organise triples into named graphs — `pg_ripple.create_graph('<iri>')` creates a named collection; `pg_ripple.drop_graph('<iri>')` deletes it along with its triples; `pg_ripple.list_graphs()` lists all collections
- Export data — `pg_ripple.export_ntriples(graph)` and `pg_ripple.export_nquads(graph)` serialise stored triples to standard text; pass `NULL` to export all triples
- Register IRI prefixes — `pg_ripple.register_prefix('ex', 'https://example.org/')` records a shorthand; `pg_ripple.prefixes()` lists all registered mappings
- Promote rare properties manually — `pg_ripple.promote_rare_predicates()` moves any property that has grown beyond the threshold into its own dedicated table
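A sketch of a load/export round-trip with a named graph; the IRIs are illustrative:

```sql
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
SELECT pg_ripple.create_graph('<https://example.org/graphs/staging>');

SELECT pg_ripple.load_trig($$
@prefix ex: <https://example.org/> .
GRAPH <https://example.org/graphs/staging> { ex:item1 ex:status "new" . }
$$);

SELECT pg_ripple.export_nquads(NULL);  -- NULL = every graph
```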
How rare properties work
Properties with fewer than 1,000 triples (configurable via pg_ripple.vp_promotion_threshold) are stored in a shared table rather than creating a dedicated table for each one. Once a property crosses the threshold it is automatically migrated. This keeps the database tidy for datasets with many rarely-used properties.
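The threshold is an ordinary GUC, and promotion can be forced at any time (values illustrative):

```sql
SET pg_ripple.vp_promotion_threshold = 500;  -- promote sooner than the 1,000 default
SELECT pg_ripple.promote_rare_predicates();  -- move any predicate now over the threshold
```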
How blank node scoping works
Blank node identifiers (_:b0, _:b1, etc.) from different load calls are automatically isolated. Loading the same file twice will produce separate, independent blank nodes rather than merging them — which is almost always what you want.
Technical details
- `rio_turtle 0.8` / `rio_api 0.8` added for N-Triples, N-Quads, Turtle, and TriG parsing
- Blank node scoping via `_pg_ripple.load_generation_seq`: each load advances a shared sequence; blank node hashes are prefixed with `"{generation}:"` to prevent cross-load merging
- `batch_insert_encoded` groups triples by predicate and issues one multi-row INSERT per predicate group, reducing round-trips
- `_pg_ripple.statements` range-mapping table created (populated in v0.6.0)
- `_pg_ripple.prefixes` table: `(prefix TEXT PRIMARY KEY, expansion TEXT)`
- GUCs added: `pg_ripple.vp_promotion_threshold` (i32, default 1000), `pg_ripple.named_graph_optimized` (bool, default off)
- New pg_regress tests: `triple_crud.sql`, `named_graphs.sql`, `export_ntriples.sql`, `nquads_trig.sql`
[0.1.0] — 2026-04-14 — First Working Release
pg_ripple can now be installed into a PostgreSQL 18 database. After installation you can store facts — statements like "Alice knows Bob" — and retrieve them by pattern. This is the foundation that all later releases build on. No query language yet: just the core building blocks.
What you can do
- Install the extension — `CREATE EXTENSION pg_ripple` in any PostgreSQL 18 database (requires superuser)
- Store facts — `pg_ripple.insert_triple('<Alice>', '<knows>', '<Bob>')` saves a fact and returns a unique identifier for it
- Find facts by pattern — `pg_ripple.find_triples('<Alice>', NULL, NULL)` returns everything about Alice; `NULL` is a wildcard for any position (see the sketch after this list)
- Delete facts — `pg_ripple.delete_triple(…)` removes a specific fact
- Count facts — `pg_ripple.triple_count()` returns how many facts are stored
- Encode and decode terms — `pg_ripple.encode_term(…)` converts a text term to its internal numeric ID; `pg_ripple.decode_id(…)` converts it back
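The whole v0.1.0 surface in four statements; the IRIs are illustrative:

```sql
SELECT pg_ripple.insert_triple('<http://example.org/alice>',
                               '<http://example.org/likes>',
                               '<http://example.org/tea>');

SELECT * FROM pg_ripple.find_triples('<http://example.org/alice>', NULL, NULL);
SELECT pg_ripple.triple_count();
SELECT pg_ripple.delete_triple('<http://example.org/alice>',
                               '<http://example.org/likes>',
                               '<http://example.org/tea>');
```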
How storage works
Every piece of text — names, URLs, values — is converted to a compact integer before storage. Lookups and joins operate on integers, not strings, which is what makes queries fast. Facts are automatically organised into one table per relationship type, and relationship types with few facts share a single table to avoid creating thousands of tiny tables. Every fact receives a globally unique integer identifier that later versions use for RDF-star.
Technical details
- pgrx 0.17 project scaffolding targeting PostgreSQL 18
- Extension bootstrap creates `pg_ripple` (user-visible) and `_pg_ripple` (internal) schemas; the `pg_` prefix requires `SET LOCAL allow_system_table_mods = on` during bootstrap
- Dictionary encoder (`src/dictionary/mod.rs`): `_pg_ripple.dictionary` table; XXH3-128 hash stored in BYTEA; dense IDENTITY sequence as join key; backend-local LRU encode/decode caches; CTE-based upsert avoids the pgrx 0.17 `InvalidPosition` error on empty `RETURNING` results
- Vertical partitioning (`src/storage/mod.rs`): `_pg_ripple.vp_{predicate_id}` tables with dual B-tree indices on `(s,o)` and `(o,s)`; `_pg_ripple.predicates` catalog; `_pg_ripple.vp_rare` consolidation table; `_pg_ripple.statement_id_seq` for globally-unique statement IDs
- Error taxonomy (`src/error.rs`): `thiserror`-based types — PT001–PT099 (dictionary), PT100–PT199 (storage)
- GUC: `pg_ripple.default_graph`
- CI pipeline: fmt, clippy, pg_test, pg_regress (`.github/workflows/ci.yml`)
- pg_regress tests: `setup.sql`, `dictionary.sql`, `basic_crud.sql`
pg_ripple — Roadmap
From 0.1.0 (foundation) to 1.0.0 (production-ready triple store)
Authority rule: plans/implementation_plan.md is the authoritative description of the eventual target architecture. This roadmap is the delivery sequence for that architecture. If a milestone summary here conflicts with the implementation plan, the implementation plan wins and the roadmap should be updated to match it.
How to read this roadmap
Each release below has two layers:
- The plain-language summary (in the coloured box) explains what the release delivers and why it matters — no programming knowledge required.
- The technical deliverables list the specific items developers will build. Feel free to skip these if you're reading for the big picture.
Effort estimates are given as person-weeks — e.g. "6–8 pw" means the release would take roughly 6–8 weeks for a single full-time developer, or 3–4 weeks for a pair working together. The total estimated effort from v0.1.0 to v1.0.0 is 275–376 person-weeks (~63–86 months for one developer; ~32–43 months for a pair).
"optional at runtime" items: some deliverables are annotated (optional at runtime — X must be installed). This means the feature depends on an external extension (e.g. pg_trickle) that may not be installed in every deployment. The feature is required by this roadmap and must be implemented; the Rust code gates on a runtime availability check and degrades gracefully (returns 0 / false / empty, emits a WARNING, never raises an ERROR) when the dependency is absent. These items are not optional from a delivery standpoint.
Overview at a glance
| Version | Name | What it delivers (one sentence) | Effort |
|---|---|---|---|
| 0.1.0 | Foundation | Install the extension, store and retrieve facts (VP storage from day one) | 6–8 pw |
| 0.2.0 | Bulk Loading & Named Graphs | Bulk data import, named graphs, rare-predicate consolidation, N-Triples export | 6–8 pw |
| 0.3.0 | SPARQL Basic | Ask questions in the standard RDF query language (incl. GRAPH patterns) | 6–8 pw |
| 0.4.0 | RDF-star / Statement IDs | Make statements about statements; LPG-ready storage | 8–10 pw |
| 0.5.0 | SPARQL Advanced (Query) | Property paths, aggregates, UNION/MINUS, subqueries, BIND/VALUES | 6–8 pw |
| 0.5.1 | SPARQL Advanced (Storage & Write) | Inline encoding, CONSTRUCT/DESCRIBE, INSERT/DELETE DATA, FTS | 6–8 pw |
| 0.6.0 | HTAP Architecture | Heavy reads and writes at the same time; shared-memory cache | 8–10 pw |
| 0.7.0 | SHACL Core + Deduplication | Define data quality rules; reject bad data on insert; on-demand and merge-time triple deduplication | 5–7 pw |
| 0.8.0 | SHACL Advanced | Complex data quality rules with background checking | 4–6 pw |
| 0.9.0 | Serialization | Import and export data in all standard RDF file formats | 3–4 pw |
| 0.10.0 | Datalog Reasoning | Automatically derive new facts from rules and logic | 10–12 pw |
| 0.11.0 | SPARQL & Datalog Views | Live, always-up-to-date dashboards from SPARQL and Datalog queries | 5–7 pw |
| 0.12.0 | SPARQL Update (Advanced) | Pattern-based updates and graph management commands | 3–4 pw |
| 0.13.0 | Performance | Speed tuning, benchmarks, production-grade throughput | 6–8 pw |
| 0.14.0 | Admin & Security | Operations tooling, access control, docs, packaging | 4–6 pw |
| 0.15.0 | SPARQL Protocol | Standard HTTP API, graph-aware loaders and deletes as SQL functions | 3–4 pw |
| 0.16.0 | SPARQL Federation | Query remote SPARQL endpoints alongside local data | 4–6 pw |
| 0.17.0 | JSON-LD Framing | Frame-driven CONSTRUCT queries producing nested JSON-LD | 3–4 pw |
| 0.18.0 | SPARQL CONSTRUCT & ASK Views | Materialize CONSTRUCT and ASK queries as live, incrementally-updated stream tables | 2–3 pw |
| 0.19.0 | Federation Performance | Connection pooling, result caching, query rewriting, and batching for remote SPARQL endpoints | 3–5 pw |
| 0.20.0 | W3C Conformance & Stability | W3C SPARQL 1.1 and SHACL Core test suite compliance, crash recovery and memory safety hardening, security audit initiation | 5–7 pw |
| 0.21.0 | SPARQL Built-in Functions & Query Correctness | Implement all ~40 missing SPARQL 1.1 built-in functions, fix the FILTER silent-drop hazard, and close critical query-semantics bugs | 6–8 pw |
| 0.22.0 | Storage Correctness & Security Hardening | Fix HTAP merge race conditions, dictionary cache rollback, shmem cache thrashing, rare-predicate promotion race, and HTTP service security gaps | 6–8 pw |
| 0.23.0 | SHACL Core Completion & SPARQL Diagnostics | Complete the SHACL constraint set, add SPARQL query introspection, and fix Datalog/JSON-LD correctness issues | 6–8 pw |
| 0.24.0 | Semi-naive Datalog & Performance Hardening | Implement semi-naive evaluation for Datalog rules, complete the OWL RL rule set, batch-decode large result sets, and bound property-path depth | 6–8 pw |
| 0.25.0 | GeoSPARQL & Architectural Polish | Add GeoSPARQL 1.1 geometry primitives, stabilise the internal catalog against OID drift, and close remaining medium- and low-priority issues | 6–8 pw |
| 0.26.0 | GraphRAG Integration | First-class integration with Microsoft GraphRAG: BYOG Parquet export, Datalog-enriched entity graphs, SHACL quality enforcement, and a Python CLI bridge | 4–6 pw |
| 0.27.0 | Vector + SPARQL Hybrid: Foundation | Core pgvector integration — embedding table, HNSW index, pg:similar() SPARQL function, bulk embedding, and hybrid retrieval modes | 5–7 pw |
| 0.28.0 | Advanced Hybrid Search & RAG Pipeline | Production-grade RRF fusion, incremental embedding worker, graph-contextualized embeddings, and end-to-end RAG retrieval | 5–8 pw |
| 0.29.0 | Datalog Optimization: Magic Sets & Cost-Based Compilation | Goal-directed inference via magic sets, cost-based body atom reordering, subsumption checking, anti-join negation, filter pushdown, delta table indexing | 5–7 pw |
| 0.30.0 | Datalog Aggregation & Compiled Rule Plans | Aggregation in rule bodies (Datalog^agg), SQL plan caching across inference runs, SPARQL on-demand query speedup | 5–7 pw |
| 0.31.0 | Entity Resolution & Demand Transformation | owl:sameAs entity canonicalization, demand transformation for goal-directed rule rewriting, SPARQL query planner integration | 5–7 pw |
| 0.32.0 | Well-Founded Semantics & Tabling | Three-valued semantics for cyclic ontologies, subsumptive result caching for Datalog and SPARQL repeated sub-queries | 5–7 pw |
| 0.33.0 | Documentation Site & Content Overhaul | Complete docs site rebuild — CI harness, eight feature-deep-dive chapters, operations guide, reference section, and content governance | 8–12 pw |
| 0.34.0 | Bounded-Depth Termination & Incremental Retraction (DRed) | Early fixpoint termination for bounded hierarchies (20–50% faster SPARQL property paths); Delete-Rederive for write-correct materialized predicates | 5–7 pw |
| 0.35.0 | Parallel Stratum Evaluation & Incremental Rule Updates | Background-worker parallelism for independent rules (2–5× faster materialization); add/remove rules without full recompute | 5–7 pw |
| 0.36.0 | Worst-Case Optimal Joins & Lattice-Based Datalog | Leapfrog Triejoin for cyclic SPARQL patterns (10×–100× speedup); Datalog^L monotone lattice aggregation | 6–9 pw |
| 0.37.0 | Storage Concurrency Hardening & Error Safety | Fix HTAP merge race, rare-predicate promotion race, dictionary cache rollback; eliminate all hard panics; add GUC validators | 9–11 pw |
| 0.38.0 | Architecture Refactoring & Query Completeness | Split god-module, PredicateCatalog trait, batch encoding, SCBD, SPARQL Update completeness, SHACL hints in planner | 9–11 pw |
| 0.39.0 | Datalog HTTP API | REST API exposing all 27 Datalog SQL functions in pg_ripple_http: rule management, inference, goal queries, constraints, admin | 3–5 pw |
| 0.40.0 | Streaming Results, Explain & Observability | Server-side SPARQL cursors, explain_sparql(), explain_datalog(), OpenTelemetry tracing, resource governors | 9–11 pw |
| 0.41.0 | Full W3C SPARQL 1.1 Test Suite | Complete W3C SPARQL 1.1 Query + Update + Graph Patterns + Aggregates test suite harness with parallelized execution; 3,000+ tests in < 2 min CI | 5–7 pw |
| 0.42.0 | Parallel Merge, Cost-Based Federation & Live CDC | Multi-worker HTAP merge, FedX-style federation planner, parallel SERVICE, live RDF change subscriptions | 10–12 pw |
| 0.43.0 | WatDiv + Jena Conformance Suite | Apache Jena edge-case tests (~1,000) and WatDiv scale-correctness benchmark (10M+ triples, star/chain/snowflake/complex patterns); 90% harness reuse from v0.41.0 | 5–7 pw |
| 0.44.0 | LUBM Conformance Suite | Lehigh University Benchmark — OWL RL inference correctness across 14 canonical queries on 1K–8M triple datasets; includes Datalog API validation sub-suite for rule compilation, iteration tracking, inferred triples, goal queries, and performance baseline | 3–5 pw |
| 0.45.0 | SHACL Completion, Datalog Robustness & Crash Recovery | Close remaining SHACL Core gaps (sh:equals/sh:disjoint, decoded violation IRIs, async load test), harden parallel Datalog strata rollback, add missing crash-recovery scenarios, and standardise migration documentation | 4–6 pw |
| 0.46.0 | Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance | proptest for SPARQL and dictionary invariants, fuzz the federation result decoder, W3C OWL 2 RL test suite in CI, TopN push-down, BSBM regression gate, sequence pre-allocation for Datalog workers, rustdoc coverage enforcement, and HTTP certificate pinning | 5–7 pw |
| 0.47.0 | SHACL Truthfulness, Dead-Code Activation & Architecture Refactor | Fix parsed-but-not-checked SHACL constraints, wire preallocate_sid_ranges(), finish the sparql/translate/ module split, add 5 fuzz targets, 4 crash-recovery scenarios, cache hit-rate SRFs, GUC validators, and security hygiene | 8–10 pw |
| 0.48.0 | SHACL Core Completeness, OWL 2 RL Closure & SPARQL Completeness | Complete all 35 SHACL Core constraints and complex sh:path expressions, close the OWL 2 RL rule set, add SPARQL Update MOVE/COPY/ADD, fix SPARQL-star variable patterns, WatDiv baselines, and operational hardening | 6–8 pw |
| 0.49.0 | AI & LLM Integration | sparql_from_nl() NL-to-SPARQL via configurable LLM endpoint; suggest_sameas() and apply_sameas_candidates() for embedding-based entity alignment | 4–6 pw |
| 0.50.0 | Developer Experience & GraphRAG Polish | VS Code extension with SPARQL/SHACL/Datalog support and query runner; explain_sparql(analyze:=true) debugger; rag_context() RAG pipeline | 5–7 pw |
| 1.0.0 | Production Release | Standards conformance, stress testing, security audit | 6–8 pw |
| | Total estimated effort | | 275–376 pw |
v0.1.0 — Foundation
Theme: Core data model, dictionary encoding, and basic triple CRUD.
In plain language: This is the "hello world" release. After installing pg_ripple into a PostgreSQL database, a user can store facts (called triples — think "subject → relationship → object", e.g. "Alice → knows → Bob") and retrieve them by pattern. No query language yet — just the basic building blocks. Internally, every piece of text (names, URLs, values) is converted to a compact number for fast storage and comparison. This release also sets up automated testing so that every future change is verified.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- pgrx 0.17 project scaffolding targeting PostgreSQL 18
- Extension bootstrap: `CREATE EXTENSION pg_ripple` creates `_pg_ripple` schema
- Dictionary encoder
  - Unified dictionary table (IRIs, blank nodes, literals in a single table with `kind` discriminator — avoids ID space collision between separate resource/literal tables)
  - Hash-Backed Sequence encoding (Route 2): XXH3-128 is computed over `kind_le_bytes || term_utf8` (kind is mixed in so the same string as different term types maps to distinct IDs); the full 16-byte hash is stored in a `BYTEA` column with a `UNIQUE` index as the collision-detection key; a PostgreSQL `GENERATED ALWAYS AS IDENTITY` sequence produces the dense, sequential `i64` join key used in every VP table. This avoids the birthday-problem collision risk of schemes that truncate the hash to 64 bits (collision expected at ~4 billion terms in 64-bit space).
  - Backend-local encode cache (`LruCache<u128, i64>`, keyed on the full 128-bit hash) and decode cache (`LruCache<i64, String>`)
  - Encode/decode SQL functions: `pg_ripple.encode_term()`, `pg_ripple.decode_id()`
- Vertical Partitioning from day one
  - Dynamic VP table management: auto-create `_pg_ripple.vp_{predicate_id}` tables on first triple with a new predicate
  - Predicate catalog: `_pg_ripple.predicates (id BIGINT, table_oid OID, triple_count BIGINT)`
  - Dual B-tree indices per VP table: `(s, o)` and `(o, s)`
  - Global statement identifier sequence: `_pg_ripple.statement_id_seq` — every VP table row gets a globally-unique SID via `i BIGINT NOT NULL DEFAULT nextval('statement_id_seq')`
  - SIDs are not exposed to users in v0.1.0 but are available for internal use from the start (prerequisite for RDF-star in v0.4.0)
- Basic triple CRUD
  - `pg_ripple.insert_triple(s TEXT, p TEXT, o TEXT)`
  - `pg_ripple.delete_triple(s TEXT, p TEXT, o TEXT)`
  - `pg_ripple.triple_count() RETURNS BIGINT`
- Basic querying (SQL-level, no SPARQL yet)
  - `pg_ripple.find_triples(s TEXT, p TEXT, o TEXT) RETURNS TABLE (s TEXT, p TEXT, o TEXT, g TEXT)` — any param can be NULL for wildcard; returns decoded string values
- Unit tests for dictionary encode/decode round-trips
- Integration test: insert + query cycle
- pg_regress: `dictionary.sql` (encode/decode, prefix expansion, hash collision behaviour), `basic_crud.sql` (insert, delete, find_triples, triple_count)
- CI pipeline (GitHub Actions)
- GUC-gated lazy initialization
  - Merge worker, SHACL engine, and reasoning engine only start when their respective GUCs are enabled (`pg_ripple.merge_threshold > 0`, `pg_ripple.shacl_mode != 'off'`, `pg_ripple.inference_mode != 'off'`)
  - Reduces resource overhead for deployments that use only a subset of features
- Error taxonomy module (`src/error.rs`)
  - `thiserror`-based error types with PT error code constants
  - Initial ranges: dictionary errors (PT001–PT099) and storage errors (PT100–PT199)
  - PostgreSQL-style formatting: lowercase first word, no trailing period
  - Extended in subsequent milestones as new subsystems are added (see §13.6 of the Implementation Plan for the complete PT001–PT799 range table)
Shared memory note: v0.1.0 through v0.5.1 use a backend-local `lru::LruCache` for the dictionary cache. This avoids requiring `shared_preload_libraries` for the "hello world" release and defers the pgrx shared-memory complexity to v0.6.0, when the HTAP architecture actually needs it. The shared-memory dictionary cache, bloom filters, slot versioning, and `pg_ripple.shared_memory_size` startup GUC are all introduced in v0.6.0.
Exit Criteria
A user can install the extension, insert triples (routed to per-predicate VP tables), and query them back by pattern. No shared_preload_libraries configuration required. VP tables are created dynamically on first encounter of a new predicate.
v0.2.0 — Bulk Loading & Named Graphs
Theme: Bulk data import, rare-predicate consolidation, named graphs, and prefix management.
In plain language: This release adds bulk import: users can load large RDF data files (in Turtle and N-Triples formats) in one go, rather than inserting facts one at a time. Named graphs (the ability to group facts into labelled collections) are introduced here too. A "rare predicate" consolidation table prevents catalog bloat when datasets have thousands of distinct predicates. N-Triples export is included for test verification and round-trip checking.
Storage partition note: In v0.2.0 through v0.5.0, each VP table is a single flat table — there is no delta/main split yet. All reads and writes target the same table. The HTAP dual-partition architecture (separate `_delta` and `_main` tables with a background merge worker) is introduced in v0.6.0 via an explicit schema migration that renames existing VP tables and creates the initial `_main` partition.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- Rare-predicate consolidation table
  - Predicates with fewer than `pg_ripple.vp_promotion_threshold` triples (default: 1,000) are stored in a shared `_pg_ripple.vp_rare (p BIGINT, s BIGINT, o BIGINT, g BIGINT, i BIGINT)` table with a primary composite index on `(p, s, o)` and two secondary indices: `(s, p)` for DESCRIBE queries and `(g, p, s, o)` for efficient graph-drop bulk-delete
  - Promotion is deferred to end-of-statement (not mid-batch): during a bulk load, triples accumulate in `vp_rare`; after the load completes, predicates exceeding the threshold are promoted in a single `INSERT … SELECT` + `DELETE` transaction — avoids disrupting in-flight COPY streams
  - `pg_ripple.promote_rare_predicates()` can also be called manually or by the background merge worker
  - Prevents catalog bloat for predicate-rich datasets (DBpedia ≈60K predicates, Wikidata ≈10K) — avoids hundreds of thousands of PG objects, reduces planner overhead, and cuts VACUUM cost
- `_pg_ripple.statements` range-mapping catalog
  - Maintained by the merge worker; stores `(sid_min, sid_max, predicate_id, table_oid)` range rows rather than one row per statement — resolved via binary search in O(log n) with no full-table scans
  - After each merge cycle the worker inserts one range row per VP table covering the SIDs allocated since the last merge; because SIDs are drawn from a monotonically-increasing sequence, ranges are non-overlapping
  - Required for v0.4.0 RDF-star, where SIDs appear as subjects/objects in other VP tables and must be unambiguously resolved to their owning VP table
- Named graph support (basic)
  - `g` column in VP tables
  - `pg_ripple.create_graph()`, `pg_ripple.drop_graph()`, `pg_ripple.list_graphs()`
- `pg_ripple.named_graph_optimized` GUC (default: `off`)
  - When enabled, adds an optional `(g, s, o)` index per dedicated VP table (and equivalent coverage on `vp_rare`) to accelerate graph-scoped queries (e.g. list all triples in graph G, drop a named graph)
  - Off by default to avoid index bloat for workloads that do not use named graphs heavily
- Blank node document-scoping
  - Each bulk load operation is assigned a monotonically-increasing `load_generation` counter from a shared sequence
  - Blank nodes are hashed as `"{generation}:{label}"` — so `_:b0` from two different load calls yields two distinct dictionary IDs
  - Prevents incorrect merging of blank nodes across document boundaries, which would corrupt data in multi-file loads
  - Also applies to `INSERT DATA` (SPARQL Update, v0.5.1+), which always gets its own generation
- Bulk loader (N-Triples)
  - `pg_ripple.load_ntriples(data TEXT) RETURNS BIGINT`
  - Streaming parser via `rio_turtle` crate
  - Batch encoding + COPY for throughput
- Bulk loader (N-Quads)
  - `pg_ripple.load_nquads(data TEXT) RETURNS BIGINT`
  - Standard format for named-graph quads (`<s> <p> <o> <g> .`); same `rio_turtle` parser path as N-Triples
  - Routes quads to the appropriate named graph (`g` column) automatically
- Bulk loader (Turtle)
  - `pg_ripple.load_turtle(data TEXT) RETURNS BIGINT`
  - Prefix declarations auto-registered
  - Blank node scoping per load operation
  - `rio_turtle` crate already handles both formats — incremental parser work
- Bulk loader (TriG)
  - `pg_ripple.load_trig(data TEXT) RETURNS BIGINT`
  - Turtle with named graph blocks (`GRAPH <g> { … }`) — the standard interchange format for named-graph Turtle data
  - Uses the same `rio_turtle` streaming parser; the named graph IRI is dictionary-encoded and stored in the `g` column
- File-path bulk load variants
  - `pg_ripple.load_turtle_file(path TEXT) RETURNS BIGINT`
  - `pg_ripple.load_ntriples_file(path TEXT) RETURNS BIGINT`
  - `pg_ripple.load_nquads_file(path TEXT) RETURNS BIGINT`
  - `pg_ripple.load_trig_file(path TEXT) RETURNS BIGINT`
  - Reads via `pg_read_file()` with superuser privilege check — prevents unauthorized file access
  - Essential for datasets larger than ~1 GB, where passing data as a TEXT parameter exceeds PostgreSQL's TEXT size limit and imposes significant memory overhead
  - Returns count of loaded triples; otherwise identical behaviour to the inline TEXT variants
- IRI prefix management
  - `pg_ripple.register_prefix(prefix TEXT, expansion TEXT)`
  - `pg_ripple.prefixes() RETURNS TABLE`
  - Prefix expansion in encode/decode paths
- ANALYZE after bulk loads
  - All inline and file-path load functions run `ANALYZE` on affected VP tables after the load completes
  - Ensures the PostgreSQL planner has accurate selectivity estimates for generated SQL — critical for good join plans in v0.3.0+
- Benchmarks: insert throughput (1M triples) — `benchmarks/insert_throughput.sql`
- Performance regression baseline: `benchmarks/ci_benchmark.sh` records insert throughput and point-query latency; the CI `benchmark` job uploads results as artifacts and can gate on >10% regression
- N-Triples / N-Quads export (basic)
  - `pg_ripple.export_ntriples(graph TEXT DEFAULT NULL) RETURNS TEXT`
  - `pg_ripple.export_nquads(graph TEXT DEFAULT NULL) RETURNS TEXT` — exports all named graphs as N-Quads when `graph` is NULL; a single graph when specified
  - Streaming variants returning `SETOF TEXT` for large graphs
  - Essential for verifying bulk load round-trips in v0.2.0 testing
- pg_regress test suite: `triple_crud.sql`, `named_graphs.sql`, `export_ntriples.sql`, `nquads_trig.sql` (N-Quads round-trip, TriG named-graph import, file-path loaders)
Exit Criteria
Rare-predicate consolidation table absorbs low-frequency predicates. Bulk loading >50K triples/sec on commodity hardware. Named graphs functional. All four inline formats (N-Triples, N-Quads, Turtle, TriG) and their file-path counterparts load correctly. Multi-graph data can be loaded via N-Quads/TriG and round-tripped via N-Quads export. VP tables have current planner statistics after bulk load.
v0.3.0 — SPARQL Query Engine (Basic)
Theme: Parse and execute SPARQL SELECT and ASK queries with basic graph patterns, named graph querying, initial join optimizations, and plan caching from day one.
In plain language: SPARQL is the standard language for asking questions over linked data — the same way SQL is for relational databases. This release makes pg_ripple understand SPARQL, so users can write queries like "find all people who know someone who works at Acme Corp" using the official W3C syntax. It also enables querying across named graphs (created in v0.2.0) using the standard SPARQL `GRAPH` keyword.
Effort estimate: 6–8 person-weeks
Completed items
Prerequisites
`sparopt` availability check (must be resolved before beginning v0.3.0): verify that `sparopt` is published to crates.io with a stable, usable API and pin the version. If unavailable or API-unstable, absorb its filter-pushdown and constant-folding work directly into pg_ripple's own algebra optimizer pass (`src/sparql/algebra.rs`) before starting v0.3.0 — do not begin v0.3.0 development without resolving this gate.
Deliverables
- `sparopt` first-pass algebra optimizer (`sparopt` crate)
  - sparopt 0.3 is published on crates.io and pinned; direct conversion between sparopt and spargebra algebra types is unavailable (distinct type hierarchies), so filter-pushdown and constant-folding are implemented inline in `src/sparql/sqlgen.rs` per the fallback clause
- SPARQL parser integration (`spargebra` crate)
  - Parse SPARQL SELECT and ASK queries into an algebra tree
  - Support: Basic Graph Patterns (BGP), FILTER, OPTIONAL, LIMIT, OFFSET, ORDER BY, DISTINCT
  - `GRAPH ?g { ... }` patterns and `FROM`/`FROM NAMED` dataset clauses — map to `WHERE g = encode(uri)` filters on VP tables
- Per-query `EncodingCache` (`src/sparql/sqlgen.rs` `Ctx.per_query`)
  - Short-lived `HashMap` for IRIs and literals seen within a single SPARQL query
  - Avoids repeated SPI dictionary look-ups for constants that appear multiple times in one query
- SQL generator (initial)
  - BGP → JOIN across VP tables (integer equality)
  - FILTER → WHERE clause on integer-encoded values (dictionary-join decode for type comparisons; inline encoding deferred to v0.5.0)
  - OPTIONAL → LEFT JOIN
  - LIMIT/OFFSET/ORDER BY passthrough
  - DISTINCT → SQL DISTINCT
- Query executor
  - `pg_ripple.sparql(query TEXT) RETURNS SETOF JSONB`
  - SPI execution of generated SQL
  - Batch dictionary decode: collect all output i64 IDs from the result set, decode in a single `WHERE id IN (...)` query, build an in-memory lookup map, then emit human-readable rows — avoids per-row dictionary round-trips
- SPARQL ASK
  - ASK → `SELECT EXISTS(...)` → returns BOOLEAN
  - `pg_ripple.sparql_ask(query TEXT) RETURNS BOOLEAN`
- Join optimizations (phase 1)
  - Self-join elimination for star patterns
  - Filter pushdown: encode FILTER constants before SQL generation
- Query plan caching (introduced in v0.3.0 — not deferred to v0.13.0)
  - Cache SPARQL→SQL translation results keyed by query text
  - `pg_ripple.plan_cache_size` GUC (default: `256`; `0` = disabled)
- `pg_ripple.sparql_explain(query TEXT, analyze BOOL DEFAULT false) RETURNS TEXT` — show generated SQL; `analyze := true` executes the query and augments the output with actual row counts
- SQL injection / adversarial tests: verify that SPARQL queries containing SQL metacharacters in IRIs, literals, and prefixed names are safely dictionary-encoded and never reach generated SQL as raw strings
- pg_regress: `sparql_queries.sql` (10+ test queries), `sparql_injection.sql` (adversarial inputs)
Exit Criteria
Users can run SPARQL SELECT and ASK queries with BGPs, FILTER, OPTIONAL, and GRAPH patterns against data loaded via bulk load. Named graph queries work correctly. Queries return correct results.
v0.4.0 — RDF-star / Statement Identifiers
Theme: Quoted triples, statement-level metadata, and LPG-ready storage — make statements about statements.
In plain language: Standard RDF can say "Alice knows Bob". But it can't directly say "Alice said that she knows Bob" or "The fact that Alice knows Bob was recorded on January 5th". RDF-star (now part of the RDF 1.2 standard) solves this by allowing triples to be embedded inside other triples — called quoted triples. This is essential for provenance ("where did this fact come from?"), temporal annotations ("when was this true?"), and trust ("who asserted this?"). By delivering this immediately after basic SPARQL, pg_ripple becomes LPG-ready from the start: Labeled Property Graph edges with properties (e.g. `[:KNOWS {since: 2020}]`) map directly to RDF-star annotations over statement identifiers already present in the VP tables since v0.1.0. This is a cross-cutting change that touches parsing, storage, dictionary encoding, and the SPARQL engine.
Effort estimate: 8–10 person-weeks
Completed items
Design rationale — why so early?
The OneGraph (1G) research initiative (Lassila et al., 2023; Poseidon engine, AWS Neptune Analytics) demonstrates that a unified SPOI (Subject, Predicate, Object, statement-Identifier) storage model is the foundation for breaking the "graph model lock-in" between RDF and LPG. By introducing statement identifiers in v0.1.0 (storage) and RDF-star in v0.4.0 (query), pg_ripple achieves 1G-compatible storage before any advanced features are built on top. Every subsequent milestone (SHACL, Datalog, SPARQL Update, Cypher/GQL) benefits from statement IDs being available from the start.
Patent clearance: RDF-star is a W3C standard developed under the W3C Patent Policy (Royalty-Free). Statement identifiers are well-established prior art (RDF reification, 2004; Named Graphs, 2005; RDF-star Community Group, 2014). The 1G abstract data model is published academic research (Semantic Web Journal, doi:10.3233/SW-223273), not patented technology. Poseidon's proprietary implementation details (P8APL, PAX pages, lock-free adjacency lists) are specific to Amazon's in-memory engine and are not replicated here — pg_ripple uses PostgreSQL's native heap/WAL/MVCC storage.
Deliverables
- Quoted triple syntax in parsers
  - N-Triples-star: `<< <http://...Alice> <http://...knows> <http://...Bob> >> <http://...assertedBy> <http://...Carol> .`
  - Implemented via a custom recursive-descent N-Triples-star line parser (no external dependency conflicts)
  - Supports subject-position and object-position quoted triples, nested quoted triples
  - Note: Turtle-star deferred to v0.5.x; `load_ntriples()` handles N-Triples-star fully
- Dictionary encoding for quoted triples
  - New term type: `KIND_QUOTED_TRIPLE = 5` — XXH3-128 hash of `(s_id, p_id, o_id)`
  - `qt_s`, `qt_p`, `qt_o` columns added to `_pg_ripple.dictionary` via `ALTER TABLE … ADD COLUMN IF NOT EXISTS`
  - `pg_ripple.encode_triple(s TEXT, p TEXT, o TEXT) RETURNS BIGINT`
  - `pg_ripple.decode_triple(id BIGINT) RETURNS JSONB`
- Statement identifier activation (loading sketch after this list)
  - `pg_ripple.insert_triple(s TEXT, p TEXT, o TEXT, g TEXT DEFAULT NULL) RETURNS BIGINT` — returns SID
  - `pg_ripple.get_statement(i BIGINT) RETURNS JSONB` — look up a statement by its SID
- Storage for edge properties via SIDs
  - Annotation triples use the SID of the annotated statement as their subject — regular `BIGINT` values, no structural change to VP tables
  - Nested quoted triples supported
- SPARQL-star query support
  - `TermPattern::Triple` handled in `sparql/sqlgen.rs` via `ground_term_id()` — ground (all-constant) quoted triple patterns compile to a dictionary lookup + equality condition
  - Uses `spargebra/sparql-12` and `sparopt/sparql-12` features (properly gates `oxrdf/rdf-12` to avoid match-exhaustiveness errors)
  - Variable-inside-quoted-triple deferred to v0.5.x
- Bulk load support for RDF-star data
  - `pg_ripple.load_ntriples()` accepts N-Triples-star input
  - `pg_ripple.load_turtle()`, `pg_ripple.load_nquads()`, `pg_ripple.load_trig()` use rio_turtle (no RDF-star; emits warning)
- W3C SPARQL-star conformance gate: `tests/pg_regress/sql/sparql_star_conformance.sql` — N-Triples-star parsing, dictionary round-trips, SID lifecycle, annotation patterns, ground triple patterns, data integrity, known-limitation documentation
- pg_regress: `rdf_star_load.sql` (load N-Triples-star, encode/decode round-trip, SID lifecycle)
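A short sketch of the SID workflow described above, using the documented signatures (the IRIs are illustrative):

```sql
-- Insert a base fact; the returned BIGINT is its statement identifier (SID)
SELECT pg_ripple.insert_triple(
  'http://example.org/alice',
  'http://xmlns.com/foaf/0.1/knows',
  'http://example.org/bob'
) AS sid;

-- Annotate the statement itself via N-Triples-star input
SELECT pg_ripple.load_ntriples('
<< <http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> >> <http://example.org/since> "2020" .
');
```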
Exit Criteria
Users can load RDF-star data (N-Triples-star; Turtle-star is deferred to v0.5.x), query it with SPARQL-star triple term patterns, and use statement identifiers to model edge properties. SIDs are returned from insert operations and can be used as subjects/objects in subsequent triples. The storage layer is LPG-ready.
v0.5.0 — SPARQL Query Engine (Advanced — Query Completeness)
Theme: Property paths, UNION, aggregates, subqueries, and advanced join optimizations.
In plain language: This release teaches the query engine to handle more powerful questions. Property paths let you follow chains of relationships — e.g. "find everyone reachable through any number of 'knows' links" (like a social network friend-of-a-friend search). Aggregates let you compute totals and averages ("how many people work in each department?"). This is a pure query-engine release with no storage changes, isolating query completeness from the inline encoding and write-path work in v0.5.1.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- Property path compilation (illustrative SQL after this list)
  - `+` (one or more) → `WITH RECURSIVE` CTE
  - `*` (zero or more) → `WITH RECURSIVE` CTE with zero-hop anchor
  - `?` (zero or one) → `UNION` of direct + zero-hop
  - `/` (sequence) → chained joins
  - `|` (alternative) → `UNION`
  - `^` (inverse) → swaps s/o
  - Cycle detection via PG18 `CYCLE` clause (hash-based, replaces array-based visited tracking for $O(1)$ membership checks instead of $O(n)$ array scans)
  - `pg_ripple.max_path_depth` GUC
  - Known performance constraint: PostgreSQL materializes each level of a `WITH RECURSIVE` CTE into a work-table. For deep traversals (depth > ~15) or wide fan-out on graphs with 10M+ triples the per-level copy cost becomes the bottleneck. The <100 ms target in §13 benchmarks applies to bounded-depth paths (depth ≤ 10) on typical RDF datasets; unbounded paths on dense graphs will exceed it. A purpose-built graph traversal engine would outperform this approach at extreme depth/fan-out, but that is out of scope for v1.0.
- UNION / MINUS
  - UNION → SQL `UNION`
  - MINUS → SQL `EXCEPT`
- Aggregates
  - COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT
  - GROUP BY → SQL GROUP BY
  - HAVING → SQL HAVING
- Subqueries
  - Nested SELECT in WHERE / FROM clause
- BIND / VALUES
  - BIND → SQL column alias
  - VALUES → SQL VALUES clause
- Resource exhaustion tests: Cartesian-product queries, unbounded property paths on cyclic graphs, deeply nested subqueries — verify that `max_path_depth`, `statement_timeout`, and memory limits prevent runaway resource consumption
- pg_regress: `property_paths.sql`, `aggregates.sql`, `resource_limits.sql` (exhaustion tests)
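For orientation, a sketch of the shape of SQL a `foaf:knows+` path could compile to under this scheme. The VP table name and columns are assumptions for illustration (real names come from the dictionary encoder); the `CYCLE` clause stops extending a row once a vertex repeats:

```sql
WITH RECURSIVE hops (s, o) AS (
    SELECT s, o FROM _pg_ripple.vp_42        -- depth 1; vp_42 stands in for the foaf:knows VP table
  UNION ALL
    SELECT h.s, v.o
    FROM hops h
    JOIN _pg_ripple.vp_42 v ON v.s = h.o     -- extend the path by one hop
) CYCLE o SET is_cycle USING path            -- terminate recursion on revisited nodes
SELECT DISTINCT s, o
FROM hops;
```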
Documentation
See plans/documentation.md for the complete page-by-page specification. v0.5.0 carries the full catch-up backlog for v0.1.0–v0.4.0 in addition to new v0.5.0 pages.
Catch-up — v0.1.0 Foundation
- Docs site scaffold: `docs/book.toml`, `.github/workflows/docs.yml`, `docs/src/SUMMARY.md`
- `user-guide/introduction.md`, `user-guide/installation.md`, `user-guide/getting-started.md`
- `user-guide/sql-reference/index.md`, `triple-crud.md`, `dictionary.md`, `prefix.md`
- `reference/changelog.md` (mirror), `reference/roadmap.md` (mirror), `reference/security.md` (stub), `research/index.md`
Catch-up — v0.2.0 Bulk Loading & Named Graphs
- `user-guide/sql-reference/bulk-load.md`, `user-guide/sql-reference/named-graphs.md`
- `user-guide/best-practices/bulk-loading.md`
- `user-guide/configuration.md` (initial: `vp_promotion_threshold`, `named_graph_optimized`, `plan_cache_size`)
- `reference/faq.md` (seed: 10+ questions covering v0.1.0–v0.4.0)
Catch-up — v0.3.0 SPARQL Basic
- `user-guide/playground.md` — Docker sandbox ⭐
- `user-guide/sql-reference/sparql-query.md` (initial: SELECT, ASK, EXPLAIN)
- `user-guide/best-practices/sparql-patterns.md` (initial)
- `reference/troubleshooting.md` (initial)
Catch-up — v0.4.0 RDF-star
- `user-guide/sql-reference/rdf-star.md`
- `user-guide/best-practices/data-modeling.md` (initial)
New in v0.5.0
- `user-guide/sql-reference/sparql-query.md` expanded: property paths, aggregates, UNION/MINUS, subqueries, BIND/VALUES
- `user-guide/best-practices/sparql-patterns.md` expanded: property path recipes, resource exhaustion safeguards
- `user-guide/configuration.md` expanded: `max_path_depth` GUC
Exit Criteria
SPARQL 1.1 Query coverage for property paths, UNION/MINUS, aggregates, subqueries, BIND/VALUES. Property path queries complete with hash-based cycle detection via PG18 CYCLE clause. Docs site is live on GitHub Pages with all catch-up pages written.
v0.5.1 — SPARQL Advanced (Storage, Serialization & Write)
Theme: Inline value encoding, CONSTRUCT/DESCRIBE, INSERT DATA/DELETE DATA, and full-text search.
In plain language: This release introduces inline value encoding — a performance optimization that eliminates dictionary lookups for numeric and date comparisons. It changes the fundamental ID space model (introducing a dual-space interpretation), which is why it is separated from the pure query-engine work in v0.5.0. It also adds the two simplest SPARQL Update forms (
`INSERT DATA` / `DELETE DATA`) so standard RDF tools can write to pg_ripple, CONSTRUCT and DESCRIBE to complete the four standard SPARQL query forms, and full-text search for efficient text matching.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- Inline value encoding (`src/dictionary/inline.rs`; FILTER example after this list)
  - Type-tagged `i64` encoding for xsd:integer, xsd:boolean, xsd:dateTime, xsd:date — FILTER comparisons on these types require zero dictionary round-trips
  - IDs allocated in monotonically increasing semantic order so range FILTERs (`>`, `<`, `BETWEEN`) compile directly to SQL numeric comparisons on the raw `i64` column
  - Deferred from v0.3.0 to keep the initial SPARQL engine focused on a single ID space; now that the query engine is stable, the dual-space (inline + dictionary) model can be introduced safely
  - Note: `xsd:double` is stored in the dictionary rather than inline-encoded — truncating IEEE 754 doubles to 56 bits produces undefined precision/range behaviour; dictionary storage is safe and range comparisons on doubles are uncommon in SPARQL
- SPARQL CONSTRUCT / DESCRIBE (JSONB output)
  - CONSTRUCT → returns triples as JSONB (Turtle/JSON-LD serialization deferred to v0.9.0)
  - DESCRIBE → Concise Bounded Description (CBD) as default algorithm
  - `pg_ripple.describe_strategy` GUC (values: `'cbd'` / `'scbd'` / `'simple'`): selects the DESCRIBE expansion algorithm. Introduced here alongside DESCRIBE so the GUC is available from the first release that uses it.
  - Completes the four standard SPARQL query forms, making pg_ripple usable as an entity browser
- Basic SPARQL Update (`INSERT DATA` / `DELETE DATA`)
  - Parse and execute `INSERT DATA { … }` statements via `spargebra` (already supports Update algebra)
  - Route through dictionary encoder + VP table insert path
  - Named graph support: `INSERT DATA { GRAPH <g> { … } }`
  - Parse and execute `DELETE DATA { … }` statements — exact-match triple deletion from VP tables
  - `pg_ripple.sparql_update(query TEXT) RETURNS BIGINT` — returns count of affected triples
  - Pattern-based updates (`DELETE/INSERT WHERE`), `LOAD`, `CLEAR`, `DROP`, `CREATE` deferred to v0.12.0
  - Enables standard RDF tools (Protégé, TopBraid, SPARQL workbenches) to write to pg_ripple without a custom adapter
- Full-text search on literals
  - `pg_ripple.fts_index(predicate TEXT)` — create a GIN `tsvector` index on the dictionary for a predicate
  - SPARQL `CONTAINS()` and `REGEX()` FILTERs on indexed predicates rewrite to `@@` / `LIKE` against the GIN index
  - `pg_ripple.fts_search(query TEXT, predicate TEXT) RETURNS TABLE` — direct full-text search API
  - Index is maintained incrementally on `insert_triple()` for indexed predicates
- pg_regress: `fts_search.sql`, `sparql_construct.sql`, `sparql_insert_data.sql`, `sparql_delete_data.sql`, `inline_encoding.sql`
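A sketch of the effect of inline encoding on a range FILTER. The query is ordinary SPARQL; the point is that, under the scheme above, the comparison compiles to a raw `i64` comparison on the VP table's object column with no dictionary join (the `ex:age` data is illustrative):

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?person WHERE {
    ?person ex:age ?age .
    FILTER (?age > 30)   # xsd:integer is inline-encoded in semantic order,
  }                      # so this becomes: WHERE vp_age.o > <inline id of 30>
');
```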
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/sparql-update.md` — `sparql_update()`, INSERT DATA / DELETE DATA, named-graph variants
- `user-guide/sql-reference/fts.md` — `fts_index`, `fts_search`, SPARQL CONTAINS/REGEX rewriting
- `user-guide/sql-reference/sparql-query.md` expanded: CONSTRUCT / DESCRIBE, `describe_strategy` GUC
- `user-guide/best-practices/update-patterns.md` — INSERT DATA vs bulk load, idempotent patterns
Exit Criteria
Inline value encoding eliminates dictionary lookups for numeric and date FILTER comparisons. SPARQL CONSTRUCT and DESCRIBE return correct JSONB results. INSERT DATA / DELETE DATA work for standard-compliant write operations. Full-text search on indexed literal predicates is functional.
v0.6.0 — HTAP Architecture
Theme: Separate read and write paths for concurrent OLTP/OLAP. Shared-memory dictionary cache. Subject pattern index.
In plain language: In a real production system, people are loading new data and running complex queries at the same time. Without special care, these two activities interfere with each other — writes block reads and vice versa. This release splits the storage into a "write inbox" and a "read-optimised archive" so both can happen simultaneously at full speed. It also adds a change notification system: applications can subscribe to be told whenever specific facts change (useful for triggering workflows, updating caches, or feeding dashboards). An in-memory cache shared across all database connections makes repeated lookups much faster. Optionally, the companion pg_trickle extension enables automatically-updating live statistics.
Note: This release introduces `shared_preload_libraries` as a requirement — v0.1.0–v0.5.1 do not require it because they use a backend-local dictionary cache. The `pg_ripple.shared_memory_size` startup GUC must be set in `postgresql.conf` before starting PostgreSQL.
Effort estimate: 8–10 person-weeks
Completed items
Deliverables
- Delta/Main partition split — schema migration
  - Each VP table is migrated from its flat single-table form (v0.1.0–v0.5.1) to a dual-partition form:
    - `CREATE TABLE _pg_ripple.vp_{id}_delta AS SELECT * FROM _pg_ripple.vp_{id}` (copy existing rows to delta)
    - `CREATE TABLE _pg_ripple.vp_{id}_main (LIKE _pg_ripple.vp_{id})` (empty main, BRIN-indexed)
    - `ALTER TABLE _pg_ripple.vp_{id} RENAME TO vp_{id}_pre_htap` (keep old table as backup)
    - Update `_pg_ripple.predicates` catalog with new table OIDs
    - Run an immediate merge cycle to promote rows from delta to main in sorted order
    - Drop `vp_{id}_pre_htap` after merge completes successfully
  - The migration runs inside the `ALTER EXTENSION pg_ripple UPDATE` upgrade script — zero downtime during migration because rows still exist in delta until the merge completes and the query path immediately switches to `UNION ALL` of `_main` and `_delta`
  - `vp_rare` is not split (see vp_rare HTAP exemption below); all reads and writes target the single `vp_rare` table throughout
  - All writes target `_delta`; `_main` is append-only / read-optimized
  - Query path: `UNION ALL` of `_main` and `_delta`
- Tombstone table for cross-partition deletes (read-path sketch after this list)
  - When deleting a triple that may exist in `_main`, the delete is recorded in `_pg_ripple.vp_{id}_tombstones (s BIGINT, o BIGINT, g BIGINT)`
  - Query path becomes: `(main EXCEPT tombstones) UNION ALL delta`
  - The merge worker applies tombstones against main during each generation merge, then truncates the tombstone table
  - Necessary because `_main` is read-only between merges — a DELETE targeting a main-resident triple cannot modify `_main` directly
- `vp_rare` HTAP exemption
  - `vp_rare` is not given a delta/main split — it remains a single flat table
  - Rare predicates see few writes by definition; delta/main overhead would exceed the benefit
  - Concurrent reads and writes on `vp_rare` are safe via PostgreSQL standard heap row-level locking
  - The bloom filter treats `vp_rare` conservatively (always queries it, no delta-skip shortcut)
- Background merge worker
  - pgrx `BackgroundWorker` implementation
  - Configurable merge threshold via `pg_ripple.merge_threshold` GUC
  - Concurrency & Locking logic: the rename/truncate step requires an `AccessExclusiveLock`. To prevent stalling the database, the merge worker uses a low `lock_timeout` and retry logic for the `ALTER TABLE ... RENAME` statement, ensuring concurrent `INSERT` and `SELECT` operations are not blocked entirely by a queued exclusive lock.
  - Fresh-table generation merge: rather than inserting into an existing `_main` table, create `vp_{id}_main_new`, insert all rows from both `_main` and `_delta` (minus tombstones) in sort order (ensuring BRIN pages are physically ordered), then atomically rename it to replace `_main` and TRUNCATE both `_delta` and `_tombstones` — writes to delta are never blocked during the merge and BRIN indexing is maximally effective because rows arrive in sorted order at table-creation time
  - BRIN index rebuild on main post-merge (concurrent where possible)
  - Shared-memory latch signaling
  - Also triggers `pg_ripple.promote_rare_predicates()` for any rare predicates that crossed the promotion threshold since the last merge
  - Runs `ANALYZE` on merged VP tables so the PostgreSQL planner has fresh selectivity estimates
  - Watchdog: if the merge worker heartbeat stalls for longer than `pg_ripple.merge_watchdog_timeout` (default: 300 s), `_PG_init` on the next backend connection logs a WARNING and attempts a restart
- `ExecutorEnd_hook` latch-poke
  - When a write transaction commits more than `pg_ripple.latch_trigger_threshold` rows (default: 10,000), the hook immediately pokes the merge worker's latch to trigger an early merge
  - Prevents unbounded delta growth during bursty write workloads without requiring a polling loop
- Bloom filter for delta existence checks
  - In shared memory, per VP table
  - Queries against main-only data skip delta scan
- Dictionary LRU cache in shared memory
  - `pg_ripple.dictionary_cache_size` GUC
  - Shared across all backends via pgrx `PgSharedMem`
  - Sharded lock design: partition the hash map into N shards (default: 64), each with its own lightweight lock — eliminates global lock contention under concurrent encode/decode workloads
- Shared-memory budget & back-pressure
  - `pg_ripple.cache_budget` GUC — utilization cap for the pre-allocated shared memory block (dictionary cache + bloom filters + merge worker buffers)
  - Automatic eviction priority: bloom filters reclaimed first, then oldest LRU dictionary entries
  - Back-pressure on bulk loads when shared memory is >90% of `cache_budget` — throttle batch size to prevent OOM
- Shared-memory slot versioning
  - Each shared memory slot (declared via pgrx 0.17's `pg_shmem_init!` macro) carries a `[u8; 8]` magic constant (e.g. `*b"pg_tripl"`) followed by a `u32` layout version at its head
  - Version mismatch at `_PG_init` triggers a controlled re-initialization of the slot rather than corrupting state — essential for safe in-place upgrades
  - pgrx 0.17 API note: all shared memory sizes must be declared statically in `_PG_init`. The `pg_ripple.shared_memory_size` startup GUC determines the block size; it cannot be changed at runtime. Use the pgrx 0.17 `PgSharedObject` / `PgSharedMem::new_object` API (not the old `PgSharedMem` from ≤0.14) — verify against the pgrx 0.17 shmem examples
- `subject_patterns` lookup table
  - `_pg_ripple.subject_patterns (s BIGINT, predicates BIGINT[])` with a GIN index on `predicates`
  - Maintained by the merge worker after each generation merge (not on individual INSERTs — amortized cost)
  - Enables fast "which predicates does subject X have?" look-up for DESCRIBE queries and star-pattern rewriting in the algebra optimizer
- `object_patterns` lookup table
  - `_pg_ripple.object_patterns (o BIGINT, predicates BIGINT[])` with a GIN index on `predicates`
  - Maintained by the merge worker alongside `subject_patterns`
  - Solves the "unbound object problem" by intercepting reverse-edge scattergun queries (`?s ?p <Object>`) in O(N) instead of forcing a `UNION ALL` across all VP tables
- Statistics
  - `pg_ripple.stats()` JSONB: triple count, per-predicate counts, cache hit ratio, delta/main sizes
- pg_trickle integration: live statistics (optional, when pg_trickle is installed)
  - `pg_ripple.enable_live_statistics()` creates `_pg_ripple.predicate_stats` and `_pg_ripple.graph_stats` stream tables
  - `pg_ripple.stats()` reads from stream tables instead of full-scanning VP tables (100–1000× faster)
  - `_pg_ripple.rare_predicate_candidates` stream table (`IMMEDIATE` mode) replaces merge-worker GROUP BY polling for VP promotion detection (§2.8)
  - `_pg_ripple.vp_cardinality` stream table provides live per-predicate row counts for BGP join reordering without waiting for ANALYZE (§2.10)
  - `_pg_ripple.subject_patterns` managed as a stream table — stays current between merge cycles for DESCRIBE and GIN queries (§2.12)
- Change notification / CDC (subscription example after this list)
  - `pg_ripple.subscribe(pattern TEXT, channel TEXT)` — emit `NOTIFY` on triple changes matching a predicate/graph pattern
  - Thin trigger-based CDC on VP delta tables; fires on INSERT/DELETE
  - Payload: JSON with `{"op": "insert"|"delete", "s": ..., "p": ..., "o": ..., "g": ...}` (integer IDs)
  - `pg_ripple.unsubscribe(channel TEXT)` to remove subscriptions
  - Enables downstream event-driven architectures (CDC consumers, webhooks, cache invalidation)
- Concurrency correctness tests (partial — synchronous paths covered; concurrent bgworker + writer tests deferred)
  - `change_notification.sql` verifies CDC trigger correctness under sequential insert/delete
  - `htap_merge.sql` verifies delta→main promotion correctness
  - `merge_edge_cases.sql` verifies edge cases: empty-delta compact, idempotency, delta-resident deletes
- Merge worker edge-case tests (covered by `merge_edge_cases.sql`)
  - Merge when delta is empty (no-op, no crash) ✓
  - compact() is idempotent ✓
  - Insert after compact goes to delta and is visible immediately ✓
  - Delete delta-resident triple removes it directly (no tombstone needed) ✓
  - Delete non-existent triple returns 0 ✓
  - Multiple compacts do not multiply rows ✓
- Benchmark: concurrent read/write (pgbench custom scripts under HTAP load)
  - Heavy concurrent insert (delta growth) + complex SPARQL queries on main partition
  - Measure merge worker latency, delta bloat growth, query latency under concurrent writes
  - Baseline: >100K triples/sec sustained bulk insert with <500 ms query latency
- Berlin SPARQL Benchmark (BSBM) execution with HTAP workload mixing reads and writes
  - Full BSBM query mix under concurrent insert workload
  - Comparison baselines with v0.5.0 (single-table, no-HTAP) results
- pg_regress: `htap_merge.sql`, `change_notification.sql`, `concurrent_write_merge.sql`, `htap_benchmarks.sql`
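Two small sketches of the mechanics described above. First, the tombstone-aware read path a query against one predicate table would take (the `vp_42` table ID and `s, o, g` columns are illustrative, per the tombstone schema above):

```sql
-- Logical read path per VP table: main rows minus tombstoned rows, plus fresh delta rows
(SELECT s, o, g FROM _pg_ripple.vp_42_main
 EXCEPT
 SELECT s, o, g FROM _pg_ripple.vp_42_tombstones)
UNION ALL
SELECT s, o, g FROM _pg_ripple.vp_42_delta;
```

Second, subscribing to change notifications with the documented API (the channel name is illustrative; the exact pattern syntax is as defined by the deliverable above):

```sql
SELECT pg_ripple.subscribe('http://xmlns.com/foaf/0.1/knows', 'knows_changes');
LISTEN knows_changes;  -- payload: {"op": "insert"|"delete", "s": ..., "p": ..., "o": ..., "g": ...}
-- later:
SELECT pg_ripple.unsubscribe('knows_changes');
```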
Documentation
See plans/documentation.md for details.
- `user-guide/configuration.md` — major expansion: all HTAP GUCs grouped by subsystem, `shared_preload_libraries` requirement column
- `user-guide/scaling.md` — HTAP architecture diagram, delta/main lifecycle, merge worker tuning
- `user-guide/pre-deployment.md` — production checklist: `shared_preload_libraries`, memory estimation, ANALYZE schedule
- `user-guide/sql-reference/admin.md` — `stats()`, `compact()`, `subscribe()`, `unsubscribe()`, `htap_migrate_predicate()`
- `user-guide/best-practices/bulk-loading.md` expanded: HTAP delta-growth, bulk-load strategies
- `reference/troubleshooting.md` expanded: merge worker not starting, delta bloat, CDC not firing
- `reference/faq.md` expanded: `shared_preload_libraries`, merge worker, change notifications
- `research/postgresql-deepdive.md` (mirror of `plans/postgresql-triplestore-deep-dive.md`)
Exit Criteria
Writes do not block reads. Merge worker operates correctly under concurrent writes and crash scenarios. >100K triples/sec bulk insert sustained. Change notifications fire correctly for matching patterns.
v0.7.0 — SHACL Validation (Core)
Theme: Data integrity enforcement via W3C SHACL shapes.
In plain language: SHACL is a standard way to define data quality rules — for example, "every Person must have exactly one email address" or "an age must be a number". When these rules are loaded, pg_ripple can automatically reject data that violates them the moment it is inserted, rather than discovering errors later. This is similar to how a spreadsheet can reject invalid entries in a cell. A validation report function lets you check existing data against the rules at any time.
Effort estimate: 4–6 person-weeks
Completed items
Deliverables
- SHACL parser (Turtle-based shapes)
  - `pg_ripple.load_shacl(data TEXT)` — parse and store shapes
  - Internal shape IR stored in `_pg_ripple.shacl_shapes`
- Exact SHACL validator compilation
  - Parse shapes to an internal IR that preserves W3C SHACL semantics
  - Compile validator plans over focus nodes and value nodes rather than reducing shapes to lossy table constraints
  - PostgreSQL constraints, triggers, and helper indices are allowed only as internal accelerators when semantics are proven equivalent for the specific shape pattern
- Synchronous validation mode
  - Triggered on `insert_triple()` when `pg_ripple.shacl_mode = 'sync'`
  - Returns validation error immediately on constraint violation
  - Uses the same exact validator semantics as offline validation; no fast path weakens or changes SHACL meaning
- Validation report (usage example after this list)
  - `pg_ripple.validate(graph TEXT DEFAULT NULL) RETURNS JSONB`
  - Full SHACL validation report as JSON
- SHACL management
  - `pg_ripple.list_shapes() RETURNS TABLE`
  - `pg_ripple.drop_shape(shape_uri TEXT)`
- pg_trickle integration: SHACL violation monitors (optional)
  - Simple cardinality/datatype constraints modeled as `IMMEDIATE` mode stream tables
  - Violations detected within the same transaction as the DML
  - `_pg_ripple.violation_summary` stream table aggregates dead-letter queue by shape/severity; feeds `/metrics` Prometheus endpoint without full queue scans (§2.13)
- pg_regress: `shacl_validation.sql`, `shacl_malformed.sql` (invalid shape definitions, circular references, undefined target classes — verify clean error messages)
- Explicit deduplication functions (on-demand cleanup; zero insert-time overhead)
  - `pg_ripple.deduplicate_predicate(p_iri TEXT) RETURNS BIGINT` — remove duplicate `(s, o, g)` rows for a single predicate, keeping the row with the lowest SID; returns count of rows removed
  - `pg_ripple.deduplicate_all() RETURNS BIGINT` — deduplicate all predicates across dedicated VP tables and `vp_rare`; returns total rows removed
  - Runs `ANALYZE` on all affected tables; safe to call at any time
  - Typical usage: call once after a bulk load that may contain duplicate triples
- Merge-time deduplication (`pg_ripple.dedup_on_merge` GUC, default `false`)
  - When enabled, the HTAP generation merge (`src/storage/merge.rs`) changes from a plain `UNION ALL` accumulation to a deduplicating projection using `DISTINCT ON (s, o, g) ORDER BY s, o, g, i ASC`, retaining the lowest-SID row for each logical triple
  - Deduplication happens atomically during the regular background merge cycle — zero insert-time overhead; duplicates accumulate in the delta partition and are resolved when the merge worker fires
  - Between merges, queries through the `(main EXCEPT tombstones) UNION ALL delta` view may still observe short-lived duplicates from the delta portion
  - RDF-star interaction: SIDs of eliminated duplicate rows are not preserved; if RDF-star annotations exist on those SIDs, the annotations become orphaned. Use explicit dedup functions instead for datasets with active statement-level annotation workloads
- pg_regress: `deduplication.sql` (explicit dedup functions; merge-time dedup via `dedup_on_merge`; verifies zero duplicates after each mechanism completes)
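A minimal end-to-end sketch of the validation flow, using the documented functions; the `ex:PersonShape` shape is an illustrative example in standard SHACL Core vocabulary:

```sql
-- Load a shape: every ex:Person must have exactly one ex:email
SELECT pg_ripple.load_shacl('
  @prefix sh:  <http://www.w3.org/ns/shacl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:  <http://example.org/> .
  ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
      sh:path ex:email ;
      sh:minCount 1 ;
      sh:maxCount 1 ;
      sh:datatype xsd:string
    ] .
');

-- Enforce at insert time...
SET pg_ripple.shacl_mode = 'sync';

-- ...or audit existing data on demand (full SHACL report as JSONB)
SELECT pg_ripple.validate();
```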
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/shacl.md` — `load_shacl`, `validate`, `list_shapes`, `drop_shape`; validation report JSON structure; `shacl_mode` GUC
- `user-guide/best-practices/shacl-patterns.md` (initial: NodeShape vs PropertyShape, `sh:datatype` / `sh:minCount` / `sh:maxCount`, sync mode latency impact)
- `user-guide/pre-deployment.md` expanded: SHACL mode selection, load shapes before bulk import
- `reference/troubleshooting.md` expanded: insert rejected by SHACL, shape parsing failures
- `user-guide/sql-reference/admin.md` expanded: `deduplicate_predicate`, `deduplicate_all`, `dedup_on_merge` GUC, merge-time dedup semantics and RDF-star interaction
Exit Criteria
Delivered SHACL Core features are enforced at insert time with exact W3C semantics. Validation reports conform to SHACL spec. Malformed shapes are rejected with actionable error messages. Explicit deduplication functions correctly remove duplicate triples from all VP tables. Merge-time deduplication (when dedup_on_merge = true) produces duplicate-free _main tables after each merge cycle.
v0.8.0 — SHACL Advanced
Theme: Async validation pipeline and complex shapes.
In plain language: Builds on v0.7.0 by supporting more sophisticated data quality rules — for instance, "a person's address must be either a US address or an EU address (but not both)", or "if a company has more than 50 employees, it must have a compliance officer". It also adds a background validation mode so that checking complex rules doesn't slow down data loading — violations are flagged asynchronously and collected in a report queue.
Effort estimate: 4–6 person-weeks
Completed items
Deliverables
- Asynchronous validation pipeline
  - Validation queue table: `_pg_ripple.validation_queue`
  - Background worker processes queue in batches
  - Dead letter queue for invalid triples with violation reports
  - `pg_ripple.shacl_mode = 'async'` GUC mode
- Complex shape support (example after this list)
  - `sh:class` — type constraint via `rdf:type` lookup
  - `sh:node` — nested shape references
  - `sh:or` / `sh:and` / `sh:not` — logical constraint combinators
  - `sh:qualifiedValueShape` — qualified cardinality
- pg_trickle integration: multi-shape DAG validation (optional at runtime — pg_trickle must be installed; required in this roadmap)
  - Multiple SHACL shapes compiled into per-shape `IMMEDIATE` pg_trickle stream tables (supported constraint types: `sh:minCount`, `sh:maxCount`, `sh:datatype`, `sh:class`); complex combinators (`sh:or`, `sh:and`, `sh:not`, `sh:qualifiedValueShape`) are not compiled to stream tables and are skipped gracefully
  - `_pg_ripple.violation_summary_dag` DAG-leaf stream table aggregates per-shape violation counts; automatically clears when upstream shape violations resolve — unlike the dead-letter queue, no manual cleanup required (§2.13)
  - `pg_ripple.enable_shacl_dag_monitors()` — creates all stream tables; returns 0 with a WARNING (no ERROR) when pg_trickle is not installed
  - `pg_ripple.disable_shacl_dag_monitors()` — drops all per-shape stream tables and the summary; safe to call when none are active
  - `pg_ripple.list_shacl_dag_monitors()` — lists active DAG monitor stream tables and compiled constraints
  - `_pg_ripple.shacl_dag_monitors` catalog table tracks all created monitors
- pg_regress: `shacl_advanced.sql`, `shacl_dag_monitors.sql`
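An illustrative complex shape using the combinators listed above (the US/EU address alternative from the introduction; the referenced sub-shapes are assumed to be loaded separately), switched to the async pipeline:

```sql
SET pg_ripple.shacl_mode = 'async';  -- violations land in the dead-letter queue instead of blocking DML

SELECT pg_ripple.load_shacl('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:AddressShape a sh:NodeShape ;
    sh:targetClass ex:Address ;
    sh:or ( [ sh:node ex:USAddressShape ] [ sh:node ex:EUAddressShape ] ) .
');
```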
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/shacl.md` expanded: async pipeline, validation queue, dead-letter queue
- `user-guide/best-practices/shacl-patterns.md` expanded: `sh:or` / `sh:and` / `sh:not`, async mode for high-throughput ingestion, reading the dead-letter queue
- `reference/troubleshooting.md` expanded: async violations not appearing, dead-letter queue backlog
Exit Criteria
Async validation pipeline operational. Complex SHACL shapes validated correctly with the same semantics as synchronous validation.
v0.9.0 — Serialization, Export & Interop
Theme: Full RDF I/O, remaining serialization formats, and Turtle/JSON-LD serialization for CONSTRUCT/DESCRIBE.
In plain language: RDF data comes in several standard file formats (Turtle, RDF/XML, JSON-LD). This release completes the set so that pg_ripple can import from and export to all of them — making it easy to exchange data with other tools and systems. It also adds Turtle and JSON-LD output formats for SPARQL CONSTRUCT and DESCRIBE queries (which returned JSONB since v0.5.1), and RDF-star serialization support.
Effort estimate: 3–4 person-weeks (the hardest parts — Turtle import, N-Triples export, and CONSTRUCT/DESCRIBE JSONB — were already delivered in v0.2.0 and v0.5.1)
Note: Turtle import and N-Triples export were delivered in v0.2.0. CONSTRUCT/DESCRIBE (JSONB output) were delivered in v0.5.1.
Completed items
Deliverables
- RDF/XML parser
  - `pg_ripple.load_rdfxml(data TEXT) RETURNS BIGINT`
- Export functions (round-trip example after this list)
  - `pg_ripple.export_turtle(graph TEXT DEFAULT NULL) RETURNS TEXT`
  - `pg_ripple.export_jsonld(graph TEXT DEFAULT NULL) RETURNS JSONB`
  - Streaming variants returning `SETOF TEXT` for large graphs
- SPARQL CONSTRUCT / DESCRIBE serialization formats
  - CONSTRUCT → returns triples as Turtle or JSON-LD (in addition to JSONB from v0.5.1)
  - DESCRIBE → Turtle and JSON-LD output options
- SPARQL-star in CONSTRUCT / DESCRIBE (builds on v0.4.0 RDF-star)
  - CONSTRUCT can produce quoted triples in output
  - Turtle-star and N-Triples-star serialization in export functions
- pg_regress: `serialization.sql`, `sparql_construct.sql`, `rdf_star_construct.sql`
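The round-trip named in the exit criteria, as a sketch using the export signatures above (the data is illustrative):

```sql
-- Load Turtle, then export the same graph back out as Turtle and JSON-LD
SELECT pg_ripple.load_turtle('
  @prefix ex: <http://example.org/> .
  ex:alice ex:worksFor ex:acme .
');
SELECT pg_ripple.export_turtle();   -- TEXT: Turtle serialization of the default graph
SELECT pg_ripple.export_jsonld();   -- JSONB: JSON-LD for REST-facing consumers
```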
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/serialization.md` — `export_turtle`, `export_jsonld`, `load_rdfxml`, streaming variants, SPARQL CONSTRUCT Turtle/JSON-LD output, RDF-star serialization
- `user-guide/best-practices/data-modeling.md` expanded: interop format guide (Protégé → RDF/XML; Linked Data Platform → JSON-LD; CLI → N-Triples/N-Quads)
- `reference/faq.md` expanded: supported import/export formats, JSON-LD for REST APIs
Exit Criteria
Round-trip: load Turtle → query → export Turtle. All major RDF serialization formats supported for both import and export.
v0.10.0 — Datalog Reasoning Engine
Theme: General-purpose rule-based inference over the triple store.
In plain language: This is the "intelligence layer". Users can define logical rules like "if A manages B and B manages C, then A indirectly manages C" — and the system will automatically figure out all the indirect management chains. It ships with two built-in rule sets covering the standard RDF and OWL vocabularies (the common language of the Semantic Web), so it can automatically derive facts like "if a Dog is a subclass of Animal, and Rex is a Dog, then Rex is also an Animal". Rules can also express things that must never be true — for example, "no one can be their own manager" — acting as logical integrity constraints. This is the largest single release in the roadmap.
Effort estimate: 10–12 person-weeks
See plans/ecosystem/datalog.md for the full design.
Completed items
Deliverables
- Rule parser (`src/datalog/parser.rs`; rule-loading sketch after this list)
  - Turtle-flavoured Datalog syntax: `head :- body₁, body₂, … .`
  - Variables (`?x`), prefixed IRIs, literals, named graph scoping (GRAPH)
  - Stratified negation via `NOT` keyword
  - Multi-head rules (`h₁, h₂ :- body .`) compiled to separate `INSERT … SELECT` statements within the same stratum
- `source` column in VP tables and `vp_rare`
  - `source SMALLINT DEFAULT 0` added to every dedicated VP table and to `_pg_ripple.vp_rare` in the v0.10.0 migration
  - `0` = explicitly asserted; `1` = derived (inferred by Datalog rules)
  - Enables filtering out inferred triples at scan time without a join
  - Migration script uses `ALTER TABLE … ADD COLUMN source SMALLINT NOT NULL DEFAULT 0` for each VP table and for `vp_rare`; zero-downtime because PostgreSQL fast-path adds the column with the stored default without rewriting the table
- Tiered hot/cold dictionary (`src/dictionary/hot.rs`)
  - `_pg_ripple.resources_hot` (UNLOGGED) holds IRIs ≤512B and all predicate/prefix IRIs — the working set that fits in shared buffers
  - Full `resources` table unchanged; encoder checks hot table first
  - `pg_prewarm` warms the hot table at server start via `_PG_init`
  - Dramatically reduces random I/O for the most-accessed terms at large scale (100M+ triples)
- Stratification engine (`src/datalog/stratify.rs`)
  - Predicate dependency graph with positive/negative edges
  - SCC-based stratification with clear error messages for unstratifiable programs
- SQL compiler (`src/datalog/compiler.rs`)
  - Non-recursive rules → `INSERT … SELECT … ON CONFLICT DO NOTHING`
  - Recursive rules → `WITH RECURSIVE … CYCLE`
  - Negation → `NOT EXISTS` (higher strata only)
  - All constants dictionary-encoded before SQL generation (integer joins everywhere)
- Arithmetic built-ins
  - Comparison operators (`>`, `>=`, `<`, `<=`, `=`, `!=`) → SQL `WHERE` clause expressions
  - Arithmetic expressions (`?z IS ?x + ?y`) → SQL computed columns
  - String functions (`STRLEN`, `REGEX`) → SQL `LENGTH`, `~` with dictionary decode join
- Constraint rules (integrity constraints)
  - Empty-head rules (`:- body .`) express patterns that must never hold
  - Compile to existence checks; materialized mode → pg_trickle IMMEDIATE stream tables for in-transaction validation
  - `pg_ripple.check_constraints()` returns violations as JSONB
  - `pg_ripple.enforce_constraints` GUC: `'error'` / `'warn'` / `'off'`
  - Directly complements and extends SHACL validation
- Built-in rule sets (`src/datalog/builtins.rs`)
  - `pg_ripple.load_rules_builtin('rdfs')` — W3C RDFS entailment (13 rules)
  - `pg_ripple.load_rules_builtin('owl-rl')` — W3C OWL 2 RL profile (~80 rules)
- On-demand execution mode (no pg_trickle needed)
  - Derived predicates compiled to inline CTEs injected into SPARQL→SQL at query time
  - `SET pg_ripple.inference_mode = 'on_demand'`
- `dictionary_hot` incremental maintenance (optional, when pg_trickle is installed)
  - Model `_pg_ripple.dictionary_hot` as a stream table over `dictionary` filtered to hot-eligible IRIs
  - New predicate and prefix-registry IRIs appear in the hot table within 30s of being encoded — no manual rebuild (§2.9)
- Materialized execution mode (optional, requires pg_trickle)
  - `pg_ripple.materialize_rules(schedule => '10s')` — derived predicates as stream tables
  - pg_trickle DAG scheduler respects stratum ordering automatically
- Catalog and management
  - `_pg_ripple.rules` catalog table
  - `_pg_ripple.rule_sets` catalog: groups named rules with a `rule_hash BYTEA` (XXH3-64) for cache invalidation — re-activating a rule set with an unchanged hash resumes from prior derived state without re-derivation
  - Derived predicates registered in `_pg_ripple.predicates` with `derived = TRUE`
  - `pg_ripple.load_rules()`, `pg_ripple.list_rules()`, `pg_ripple.drop_rules()`
  - `pg_ripple.enable_rule_set(name TEXT)` / `pg_ripple.disable_rule_set(name TEXT)` — activate or deactivate a named rule set without dropping it
- SPARQL engine integration
  - Derived VP tables transparent to query planner (same look-up path as base VP tables)
  - On-demand mode prepends CTEs to generated SQL
  - `pg_ripple.sparql(query TEXT, include_derived BOOL DEFAULT true)` — when `false`, appends `AND source = 0` to all VP table scans to exclude inferred triples (no-inference mode)
- SHACL-AF `sh:rule` bridge
  - Detect `sh:rule` entries in loaded SHACL shapes that contain Datalog-compatible triple rules
  - Compile `sh:rule` bodies to Datalog IR and register in `_pg_ripple.rules`
  - Bidirectional: SHACL shapes inform Datalog constraints; Datalog-derived triples are visible to SHACL validation
  - `pg_ripple.load_shacl()` auto-registers any `sh:rule` triples as Datalog rules when `pg_ripple.inference_mode != 'off'`
- RDF-star integration in Datalog (builds on v0.4.0 RDF-star)
  - Quoted triples can appear in Datalog rule heads and bodies
  - Enables provenance rules: `<< ?s ?p ?o >> ex:derivedBy ex:rule1 :- ?s ?p ?o, RULE(ex:rule1) .`
  - Statement identifiers (SIDs) can be used in rule bodies to annotate derived triples
- pg_regress: `datalog_rdfs.sql`, `datalog_owl_rl.sql`, `datalog_custom.sql`, `datalog_negation.sql`, `datalog_arithmetic.sql`, `datalog_constraints.sql`, `shacl_af_rule.sql`, `datalog_malformed.sql` (syntax errors, unstratifiable programs, unbound variables, cyclic rule dependencies — verify clear error messages), `rdf_star_datalog.sql`
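A sketch of rule loading and use, with the rule syntax from the parser deliverable. Whether `load_rules()` accepts `@prefix` declarations inline is an assumption here; the rule is the indirect-management example from the introduction:

```sql
-- Custom rule: indirect management (prefix handling is assumed)
SELECT pg_ripple.load_rules('
  @prefix ex: <http://example.org/> .
  ?a ex:indirectlyManages ?c :- ?a ex:manages ?b, ?b ex:manages ?c .
');

-- Built-in W3C rule set
SELECT pg_ripple.load_rules_builtin('rdfs');

-- Query with derived triples (the default), or exclude them with include_derived := false
SELECT * FROM pg_ripple.sparql(
  'SELECT ?s WHERE { ?s <http://example.org/indirectlyManages> ?o }',
  include_derived := true);
```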
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/datalog.md` — `load_rules`, `infer`, `list_rules`, `enable_rule_set`, `disable_rule_set`; rule syntax primer; stratification; built-in RDFS/OWL RL rule sets; `inference_mode` GUC
- `user-guide/best-practices/datalog-patterns.md` — RDFS subclass/domain/range patterns, OWL RL profiles, `source` column (explicit vs inferred), rule count vs inference time
- `user-guide/configuration.md` expanded: `inference_mode`, `enforce_constraints` GUCs
- `reference/faq.md` expanded: OWL reasoning support, `source` column meaning
Exit Criteria
Users can load RDFS or OWL RL rule sets (or custom rules), and SPARQL queries return inferred triples. Arithmetic built-ins filter correctly in rule bodies. Constraint rules detect and report violations (optionally rejecting transactions). Both on-demand and materialized modes operational. Stratified negation correctly validated and compiled. SHACL shapes with sh:rule entries are auto-compiled to Datalog rules.
v0.11.0 — Incremental SPARQL Views, Datalog Views & ExtVP
Theme: Always-fresh materialized SPARQL and Datalog queries, plus extended vertical partitioning, via pg_trickle stream tables.
In plain language: Imagine pinning a SPARQL query — or a set of Datalog reasoning rules — to a dashboard and having the results update automatically whenever the underlying data changes, without re-running the query. That's what SPARQL views and Datalog views deliver. Under the hood, only the changed rows are reprocessed (not the entire dataset), so updates are nearly instantaneous. Datalog views go one step further: they bundle rules and a goal pattern into a single self-contained artifact, materializing only the facts relevant to the goal. This release also adds precomputed "shortcut" tables for frequently-combined queries, making common access patterns dramatically faster. Requires the companion pg_trickle extension.
Effort estimate: 5–7 person-weeks
pg_trickle dependency: This release requires pg_trickle to be installed. pg_trickle is a production-ready companion extension (same Rust/pgrx 0.17 / PostgreSQL 18 stack) available today. pg_ripple never hard-requires pg_trickle at load time — feature parity for the core triple store is preserved without it. Functions in this release that depend on pg_trickle (`create_sparql_view`, `create_datalog_view`, ExtVP setup, etc.) detect its presence at call time and return a clear error with an install hint if it is absent. The `pg_ripple.pg_trickle_available()` function lets users and tooling check availability before calling. See plans/ecosystem/pg_trickle.md § 3 for the soft-detection design.
See plans/ecosystem/pg_trickle.md § 2.2 for the SPARQL views design and plans/ecosystem/datalog.md § 15 for the Datalog views design.
Completed items
Deliverables
- SPARQL views (requires pg_trickle; creation example after this list)
  - `pg_ripple.create_sparql_view(name, sparql, schedule, decode)` — compile a SPARQL SELECT query into an always-fresh, incrementally-maintained stream table
  - `decode => FALSE` (recommended) keeps integer IDs in the stream table with a thin decoding view on top, minimising CDC surface
  - `pg_ripple.drop_sparql_view(name)` and `pg_ripple.list_sparql_views()` for lifecycle management
  - `_pg_ripple.sparql_views` catalog table: records original SPARQL text, generated SQL, schedule, decode mode, and stream table OID
  - Refresh mode heuristics: `IMMEDIATE` for constraint-style queries, `DIFFERENTIAL` + schedule for dashboards, `FULL` + long schedule for heavy analytics and transitive-closure property paths
- Datalog views (requires pg_trickle)
  - `pg_ripple.create_datalog_view(name, rules, goal, schedule, decode)` — bundle a Datalog rule set with a goal pattern into an always-fresh, incrementally-maintained stream table
  - Alternative: `pg_ripple.create_datalog_view(name, rule_set, goal, schedule, decode)` — reference a loaded rule set by name instead of inline rules
  - `decode => FALSE` (recommended) keeps integer IDs in the stream table with a thin decoding view on top
  - `pg_ripple.drop_datalog_view(name)` and `pg_ripple.list_datalog_views()` for lifecycle management
  - `_pg_ripple.datalog_views` catalog table: records original rule text, goal pattern, generated SQL, schedule, decode mode, and stream table OID
  - Constraint monitoring: constraint rules (empty-head) automatically synthesize a goal; any row in the stream table is a violation. `IMMEDIATE` mode catches violations within the same transaction
  - Goal-filtered materialization: only facts relevant to the goal pattern are derived and stored, reducing write amplification compared to full-closure materialized rules
- ExtVP semi-join stream tables (requires pg_trickle)
  - Manual creation of pre-computed semi-joins between frequently co-joined predicate pairs
  - SPARQL→SQL translator rewrites queries to target ExtVP tables when available
- Views over derived predicates
  - Both SPARQL views and Datalog views can reference Datalog-derived VP tables; pg_trickle DAG handles refresh ordering
- pg_regress: `sparql_views.sql`, `datalog_views.sql`, `extvp.sql`
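A sketch of creating an always-fresh view with the documented signature (the query and view name are illustrative):

```sql
SELECT pg_ripple.create_sparql_view(
  name     => 'friend_counts',
  sparql   => 'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
               SELECT ?person (COUNT(?friend) AS ?n)
               WHERE { ?person foaf:knows ?friend }
               GROUP BY ?person',
  schedule => '10s',    -- DIFFERENTIAL refresh cadence, dashboard-style
  decode   => FALSE     -- keep integer IDs; read through the thin decoding view
);

-- Results are now a sub-millisecond table scan
-- (assuming the stream table is named after the given view name):
SELECT * FROM friend_counts;
```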
Documentation
See plans/documentation.md for details.
- `user-guide/scaling.md` expanded: pg_trickle live statistics, SPARQL view refresh mode selection
- `user-guide/best-practices/sparql-patterns.md` expanded: using `create_sparql_view()` for frequently-run queries
- `research/pg-trickle.md` (mirror of `plans/ecosystem/pg_trickle.md`)
Exit Criteria
Users can create SPARQL views and Datalog views that stay incrementally up-to-date. View queries are sub-millisecond table scans. Datalog views with goal patterns materialize only goal-relevant facts. Constraint monitoring views detect violations in real time. ExtVP semi-joins improve multi-predicate star-pattern performance.
v0.12.0 — SPARQL Update (Advanced)
Theme: W3C SPARQL 1.1 Update — pattern-based updates and graph management commands.
In plain language: Building on the basic
`INSERT DATA` / `DELETE DATA` support from v0.5.1, this release adds pattern-based updates — the ability to find-and-replace data using SPARQL patterns (e.g. "for every person without an email, add a placeholder email"). It also adds commands for managing named graphs (create, clear, drop) and loading data from a URL. This completes the full SPARQL 1.1 Update specification.
Effort estimate: 3–4 person-weeks (simpler than originally estimated since INSERT DATA / DELETE DATA and the Update executor were delivered in v0.5.1)
Completed items
Deliverables
- DELETE/INSERT WHERE (graph update; example after this list)
  - Pattern-based update: `DELETE { … } INSERT { … } WHERE { … }`
  - Compile WHERE clause via existing SPARQL→SQL engine
  - Transactional: delete + insert in single statement
- LOAD / CLEAR / DROP / CREATE
  - `LOAD <url>` — fetch and load remote RDF data (HTTP GET + parser)
  - `CLEAR GRAPH <g>` — delete all triples in a named graph
  - `DROP GRAPH <g>` — clear + remove graph from registry
  - `CREATE GRAPH <g>` — register a new empty named graph
- pg_regress: `sparql_update_where.sql`, `sparql_graph_management.sql`
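A pattern-based update through `sparql_update()` (delivered in v0.5.1), as a sketch; the status predicate is illustrative:

```sql
-- Find-and-replace: flip every pending person to active in one transactional statement
SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  DELETE { ?p ex:status "pending" }
  INSERT { ?p ex:status "active" }
  WHERE  { ?p ex:status "pending" }
');  -- returns the count of affected triples
```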
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/sparql-update.md` expanded: DELETE/INSERT WHERE, LOAD / CLEAR / DROP / CREATE graph management
- `user-guide/best-practices/update-patterns.md` expanded: pattern-based update recipes, graph lifecycle management
Exit Criteria
Full SPARQL 1.1 Update operations work correctly. Pattern-based updates compile WHERE clauses via the existing SPARQL→SQL engine.
v0.13.0 — Performance Hardening
Theme: Optimize for production-scale workloads. Benchmark-driven improvements.
In plain language: This release is about speed. Using the benchmarks established in v0.5.0, we measure pg_ripple's performance against known baselines and then tune it. Improvements include caching query plans so repeated queries skip redundant work, loading data in parallel, and teaching the system to use data quality rules (from v0.7.0/v0.8.0) as hints to avoid unnecessary work during queries. The target is simple queries answering in under 10 milliseconds on a dataset of 10 million facts, and bulk loading sustained at over 100,000 facts per second.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- BGP join reordering (statistics sketch after this list)
  - At plan time, read `pg_stats.n_distinct` and `pg_class.reltuples` for the target VP tables to estimate the selectivity of each triple pattern
  - Place the most selective pattern first in the join tree to minimize intermediate result sizes
  - Emit `SET LOCAL join_collapse_limit = 1` before the generated SQL to lock the PostgreSQL planner into the computed join order
  - Optimizer Robustness / Fallback: Because deriving perfect selectivity from `pg_stats.n_distinct` is fragile over multi-way self-joins, the Rust-based optimizer implements dynamic sampling or uses fallback heuristic costs (e.g. reverting to native PostgreSQL planning) if `pg_stats` suggests high cardinality uncertainty. This prevents forcing PostgreSQL into highly suboptimal plans.
  - When join columns are already sorted (e.g. after a range scan on an ordered `i64` column), emit `SET LOCAL enable_mergejoin = on` to exploit merge-join (strategy #6)
- Prepared execution and cache hardening
  - Build on the v0.3.0 SPARQL translation cache rather than reintroducing it here
  - Evaluate prepared statements with parameter binding for generated SQL where this improves planner reuse
  - Add instrumentation and benchmarks for translation-cache hit rate, eviction behavior, and prepared-plan reuse
- Parallel query exploitation
  - Ensure VP table queries are parallel-safe
  - Mark SQL functions as `PARALLEL SAFE` where applicable
  - Generate SQL that triggers PostgreSQL parallel workers for multi-VP-table star patterns (e.g. parallel hash joins across VP tables)
  - Verify `EXPLAIN` output shows parallel plans for queries touching 3+ VP tables
- Custom statistics for the PostgreSQL planner
  - Run `ANALYZE` on VP tables after merge operations so the planner has accurate selectivity estimates for generated SQL
  - Provide per-predicate ndistinct and MCV statistics to guide join ordering
  - Evaluate custom statistics objects (PG18 extended statistics) on `(s, o)` pairs for correlation-aware planning
  - Consider prepared statements with parameter binding (instead of literal interpolation) so the planner can cache generic plans
- PG18 async I/O exploitation
  - Verify BRIN scans on main partition leverage AIO
  - Tune `io_combine_limit` recommendations
- Memory optimization
  - Profile and reduce per-query allocations
  - Optimize dictionary cache eviction strategy
- Index tuning
  - Evaluate PG18 skip scan benefits on `(s, o)` indices
  - Add covering indices where beneficial
- Bulk load optimization
  - Parallel dictionary encoding
  - Deferred index build with `CREATE INDEX CONCURRENTLY` post-load
- SHACL-driven query optimization
  - The algebrizer reads loaded SHACL shapes and the predicate catalog before building the join tree, using them for costing and only for rewrites that are proven semantics-preserving
  - Shape metadata can tighten plans only when the query domain is provably identical to the validated focus-node set
  - Presence of a shape alone is insufficient to change query semantics
- pg_trickle integration: ExtVP workload advisor (optional, when pg_trickle is installed)
  - `_pg_ripple.extvp_candidates` stream table aggregates predicate co-occurrence from the SPARQL query log over a rolling 1-hour window
  - Admin function `pg_ripple.recommend_extvp()` reads the stream table and lists the top N predicate pairs to pre-compute
  - `pg_ripple.sparql_explain()` surfaces recommendations inline when a query would benefit from an ExtVP (§2.14)
- Benchmarking infrastructure & execution
  - Berlin SPARQL Benchmark (BSBM) data generator integrated into test suite
  - Full BSBM query mix with timing collection and baseline comparison
  - SP2Bench subset adapted for pg_ripple
  - Custom benchmarks: star patterns, property paths, aggregates, concurrent workloads
  - Results documented in release notes and user-guide/scaling.md
- Fuzz testing harness setup (`cargo-fuzz` + libFuzzer)
  - Fuzz target for SPARQL→SQL pipeline (parser, algebra, SQL generation)
  - Fuzz target for Turtle parser integration
  - Fuzz target for Datalog rule parser
  - CI runs fuzz testing in nightly builds (10 minutes per target)
  - No panics, no invalid SQL, no memory safety violations
- Performance regression test suite (pgbench custom scripts)
  - >100K triples/sec sustained bulk load baseline
  - <10ms simple BGP queries at 10M triples
  - <5ms cached repeat queries
  - BSBM throughput comparison with v0.5.0
- pg_regress: `shacl_query_opt.sql`, `fuzz_integration.sql` (fuzz results verification)
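For orientation, the planner inputs the join reorderer reads, as a plain catalog query (the VP table name is illustrative):

```sql
-- Per-table row count and per-column distinct estimate used to rank triple patterns
SELECT c.reltuples, s.attname, s.n_distinct
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname AND s.schemaname = '_pg_ripple'
WHERE c.relname = 'vp_42_main';   -- vp_42 stands in for a real predicate table
-- The generated query is then prefixed with:
--   SET LOCAL join_collapse_limit = 1;
-- so PostgreSQL keeps the computed join order.
```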
Documentation
See plans/documentation.md for details.
- `user-guide/scaling.md` expanded: benchmark results (BSBM, SP2Bench), GUC tuning reference values for small/medium/large deployments, index strategy per workload
- `user-guide/pre-deployment.md` expanded: finalize as definitive production checklist; `pg_stat_statements` enabled; `work_mem` tuning for SPARQL aggregates
- `reference/troubleshooting.md` expanded: slow query diagnosis using `sparql_explain(analyze := true)`, cache hit ratio via `stats()`
Exit Criteria
BSBM results documented. >100K triples/sec sustained bulk load. <10ms for simple BGP queries at 10M triples. <5ms for cached repeat queries. SHACL metadata exploited only through semantics-preserving optimizer rules. PostgreSQL parallel plans verified for multi-VP-table joins.
v0.14.0 — Administrative & Operational Readiness
Theme: Production operations tooling, upgrade paths, documentation.
In plain language: Everything a system administrator needs to run pg_ripple in production. This includes maintenance commands (clean up, rebuild indexes), monitoring and diagnostics, comprehensive documentation (quickstart guide, function reference, tuning guide), and graph-level access control — the ability to control which database users can see or modify which named graphs. It also covers packaging (Linux packages, Docker images) so the extension is easy to install in real environments. Think of this as the "operations manual" release.
Effort estimate: 4–6 person-weeks
Completed items
Deliverables
- Extension upgrade scripts
  - Tested upgrade path `0.1.0 → ... → 0.16.0`
  - `ALTER EXTENSION pg_ripple UPDATE` works for all version transitions
- pg_trickle integration: live schema extraction (optional, when pg_trickle is installed)
  - `_pg_ripple.inferred_schema` stream table maintains a live class→property→cardinality summary
  - Exposed as `pg_ripple.schema_summary()` for tooling and SPARQL IDE auto-completion (v0.15.0 HTTP endpoint)
  - Serves as a starting point for automatic SHACL shape inference (§2.15)
- Administrative functions
  - `pg_ripple.vacuum()` — force merge + VACUUM on VP tables
  - `pg_ripple.reindex()` — rebuild all VP table indices
  - `pg_ripple.compact(keep_old BOOL DEFAULT false)` — trigger an immediate full merge across all VP tables; `keep_old := false` drops the previous generation's `_main` table immediately after the atomic rename
  - `pg_ripple.vacuum_dictionary()` — remove dictionary entries for IRIs and literals no longer referenced by any VP table row (orphaned after bulk deletes)
  - `pg_ripple.dictionary_stats()` — detailed cache metrics
  - `pg_ripple.predicate_stats()` — per-predicate triple count, table sizes
- Logging & diagnostics
  - Structured logging for merge operations, validation results
  - Custom `EXPLAIN` option showing SPARQL→SQL mapping (PG18 extension EXPLAIN)
- Documentation (see plans/documentation.md for the full page-by-page specification)
  - `user-guide/backup-restore.md`, `user-guide/contributing.md` (complete), `reference/error-reference.md` (PT001–PT799), `reference/security.md` (complete)
  - Performance tuning guide — dictionary cache sizing, `cache_budget` budgeting, `merge_threshold` and `vp_promotion_threshold` tuning; SHACL constraint mapping reference; Datalog rule authoring guide
- Graph-level Row-Level Security (RLS; grant example after this list)
  - `pg_ripple.enable_graph_rls()` — activate RLS policies on VP tables using the `g` column
  - Policy driven by a mapping table: `_pg_ripple.graph_access (role_name TEXT, graph_id BIGINT, permission TEXT)` — `'read'` / `'write'` / `'admin'`
  - `pg_ripple.grant_graph(role TEXT, graph TEXT, permission TEXT)` / `pg_ripple.revoke_graph()`
  - SPARQL queries automatically filter results to graphs the current role can read
  - Write operations (`insert_triple`, SPARQL UPDATE) enforce write permission
  - Superuser bypass via `pg_ripple.rls_bypass` GUC for admin operations
- Packaging
  - `cargo pgrx package` produces installable `.deb` and `.rpm`
  - Docker image with extension pre-installed
  - PGXN metadata
- pg_regress: `admin_functions.sql` (vacuum, reindex, dictionary_stats, predicate_stats), `graph_rls.sql` (RLS policy enforcement, cross-role isolation, superuser bypass), `upgrade_path.sql` (install v0.1.0 → load data → sequential upgrade to current version → verify data integrity and query correctness at each step)
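A sketch of switching on graph-level access control with the functions above (the role and graph IRI are illustrative; `revoke_graph()` is shown with parameters assumed symmetric to the grant):

```sql
SELECT pg_ripple.enable_graph_rls();

-- Let the reporting role read the public graph, and nothing else
SELECT pg_ripple.grant_graph('reporting', 'http://example.org/graphs/public', 'read');

-- SPARQL issued by 'reporting' is now silently filtered to readable graphs;
-- writes by that role are rejected for lack of 'write' permission.
SELECT pg_ripple.revoke_graph('reporting', 'http://example.org/graphs/public', 'read');
```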
Documentation
See plans/documentation.md for details.
- `user-guide/backup-restore.md` — `pg_dump` / `pg_restore` procedure, VP table considerations, PITR with WAL
- `reference/security.md` complete — supported versions matrix, responsible disclosure, hardening GUCs
- `reference/error-reference.md` — PT001–PT799 error code table with resolution notes
- `user-guide/contributing.md` complete — dev setup, test commands, PR workflow, AGENTS.md conventions, governance
- `user-guide/sql-reference/admin.md` expanded: vacuum, reindex, `dictionary_stats`, `predicate_stats`
Exit Criteria
Extension is installable, upgradable, and documented. Operational tooling sufficient for production use. Graph-level RLS enforces access control per named graph.
v0.15.0 — SPARQL Protocol (HTTP Endpoint)
Theme: Standard HTTP API for SPARQL queries and updates.
In plain language: Without this, the only way to talk to pg_ripple is through a PostgreSQL database connection (SQL). But the entire RDF ecosystem — SPARQL notebooks, visualization tools, ontology editors, web applications — expects to query a triple store over HTTP at a
`/sparql` URL. This release adds a lightweight companion service that accepts standard SPARQL HTTP requests, forwards them to pg_ripple inside PostgreSQL, and returns results in all the standard formats (JSON, XML, CSV, Turtle). This is the single biggest adoption enabler: it lets pg_ripple drop in as a replacement for tools like Blazegraph, Virtuoso, or Apache Fuseki without requiring any client-side changes.
Effort estimate: 3–4 person-weeks
Completed items
Deliverables
-
Companion HTTP service (
pg_ripple_httpbinary)- Standalone Rust binary (not a PG background worker — avoids binding TCP ports inside PostgreSQL)
- Connects to PostgreSQL via standard
libpq/tokio-postgres - Configurable via environment variables or config file:
PG_RIPPLE_HTTP_PORT,PG_RIPPLE_HTTP_PG_URL
- W3C SPARQL 1.1 Protocol compliance
  - `GET /sparql?query=...` — URL-encoded query
  - `POST /sparql` with `application/sparql-query` body
  - `POST /sparql` with `application/x-www-form-urlencoded` body (`query=...` / `update=...`)
  - SPARQL Update via `POST /sparql` with `application/sparql-update` body
- Content negotiation
  - `application/sparql-results+json` (default for SELECT/ASK)
  - `application/sparql-results+xml`
  - `text/csv` / `text/tab-separated-values`
  - `text/turtle` / `application/n-triples` (for CONSTRUCT/DESCRIBE)
  - `application/ld+json` (JSON-LD, for CONSTRUCT/DESCRIBE)
  - RDF-star content types (builds on v0.4.0 RDF-star): Turtle-star and JSON-LD-star for CONSTRUCT/DESCRIBE results containing quoted triples
- Connection pooling
  - Built-in connection pool (e.g. `deadpool-postgres`) to handle concurrent HTTP requests
  - `PG_RIPPLE_HTTP_POOL_SIZE` configuration
- Security
  - Optional bearer token or Basic auth for access control
  - CORS configuration for browser-based SPARQL clients
  - Rate limiting via the `PG_RIPPLE_HTTP_RATE_LIMIT` setting
- Health and metrics
  - `GET /health` endpoint for load balancer probes
  - Prometheus-compatible `/metrics` endpoint (query count, latency histogram, error rate)
- Docker integration
  - Docker image bundles both PostgreSQL (with pg_ripple) and the HTTP service
  - Docker Compose example with separate PG and HTTP containers
- Graph-aware bulk loader SQL functions
  - Expose the internal `load_ntriples_into_graph()`, `load_turtle_into_graph()`, `load_rdfxml_into_graph()` Rust functions (added in v0.10.0) as public SQL functions:
    - `pg_ripple.load_ntriples_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT`
    - `pg_ripple.load_turtle_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT`
    - `pg_ripple.load_rdfxml_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT`
    - `pg_ripple.load_ntriples_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT`
    - `pg_ripple.load_turtle_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT`
    - `pg_ripple.load_rdfxml_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT`
  - Encode the `graph_iri` argument via the dictionary and delegate to the existing `*_into_graph(data, g_id)` internal functions
  - `load_rdfxml_file_into_graph` reads the file via `pg_read_file()` (superuser-only) and delegates to `load_rdfxml_into_graph`
  - Complementary to `load_nquads()` and `load_trig()` for workloads that have N-Triples / Turtle / RDF/XML files and want to load them into a specific named graph without converting the format (see the sketch below)
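Given the signatures above, loading into a specific named graph becomes a single call; a sketch with illustrative data and paths:

```sql
-- Load inline Turtle into a named graph; the return value is the triple count loaded
SELECT pg_ripple.load_turtle_into_graph(
  '@prefix ex: <http://example.org/> . ex:alice ex:worksFor ex:acme .',
  'http://example.org/graphs/staff'
);

-- File variants read from the server filesystem via pg_read_file() (superuser-only)
SELECT pg_ripple.load_ntriples_file_into_graph('/data/staff.nt', 'http://example.org/graphs/staff');
```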
- Graph-aware triple deletion
  - The existing `pg_ripple.delete_triple(s, p, o)` only deletes from the default graph (`g = 0`); the underlying `storage::delete_triple(s, p, o, g_id)` already accepts a graph parameter
  - Expose: `pg_ripple.delete_triple_from_graph(s TEXT, p TEXT, o TEXT, graph_iri TEXT) RETURNS BIGINT`
  - Also expose: `pg_ripple.clear_graph(graph_iri TEXT) RETURNS BIGINT` — wraps the existing `storage::clear_graph_by_id()` internal function to delete all triples in a named graph in one call (currently only accessible via `drop_graph()`, which also unregisters the graph IRI)
  - Without this, users have no SQL-level way to delete a specific triple from a named graph
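A sketch of the resulting deletion API (IRIs illustrative):

```sql
-- Delete one triple from a named graph; returns the number of rows removed
SELECT pg_ripple.delete_triple_from_graph(
  'http://example.org/alice',
  'http://example.org/worksFor',
  'http://example.org/acme',
  'http://example.org/graphs/staff'
);

-- Empty the graph while keeping its IRI registered (contrast with drop_graph())
SELECT pg_ripple.clear_graph('http://example.org/graphs/staff');
```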
- SQL API completeness gaps
  - Missing file-path loader: `pg_ripple.load_rdfxml_file(path TEXT) RETURNS BIGINT` — completes the set of `*_file` variants (N-Triples, N-Quads, Turtle, TriG all have file variants); reads via `pg_read_file()` (superuser-only)
  - Graph parameter on find_triples: `pg_ripple.find_triples(s TEXT, p TEXT, o TEXT, graph TEXT DEFAULT NULL) RETURNS TABLE` — exposes the unused `graph` parameter in `storage::find_triples(s, p, o, graph)` so users can pattern-match within a named graph without falling back to SPARQL; `graph := NULL` queries the default graph
  - Per-graph triple count: `pg_ripple.triple_count_in_graph(graph_iri TEXT) RETURNS BIGINT` — returns the count of triples in a specific named graph (existing `triple_count()` returns the total across all graphs)
  - Dictionary lookup diagnostics: `pg_ripple.decode_id_full(id BIGINT) RETURNS JSONB` — exposes `dictionary::decode_full(id)` to return `{"kind": ..., "value": ..., "language": null|"...", "datatype": null|"..."}` structured term metadata (current `decode_id()` returns only the plain string); useful for debugging and inspection
  - Dictionary term existence check: `pg_ripple.lookup_iri(iri TEXT) RETURNS BIGINT` — exposes `dictionary::lookup_iri(iri)` to check whether an IRI already exists in the dictionary without encoding it, returning NULL when absent (useful for test assertions, cost estimation, and introspection)
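Taken together, these allow graph-scoped inspection without falling back to SPARQL; a sketch (the dictionary ID passed to `decode_id_full` is illustrative):

```sql
-- Pattern-match inside one named graph; NULL positions act as wildcards
SELECT * FROM pg_ripple.find_triples(
  NULL, 'http://example.org/worksFor', NULL, 'http://example.org/graphs/staff'
);

-- Count triples in that graph only
SELECT pg_ripple.triple_count_in_graph('http://example.org/graphs/staff');

-- Structured term metadata and a non-encoding existence check
SELECT pg_ripple.decode_id_full(42);
SELECT pg_ripple.lookup_iri('http://example.org/alice');  -- NULL when absent
```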
- pg_regress: `sparql_protocol.sql` (protocol-level tests via `curl`), `load_into_graph.sql` (round-trip: load N-Triples / Turtle / RDF/XML into a named graph, verify via SPARQL GRAPH pattern), `graph_delete.sql` (`delete_triple_from_graph`, `clear_graph`, verify isolation from default graph), `sql_api_completeness.sql` (`find_triples` with graph param, `triple_count_in_graph`, `decode_id_full`, `lookup_iri`)
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/sparql-query.md` expanded: HTTP protocol endpoint configuration, `Accept` header formats, SPARQL 1.1 Protocol conformance note
- `user-guide/best-practices/sparql-patterns.md` expanded: using the HTTP endpoint from Python (SPARQLWrapper), Java (Jena), `curl`; SPARQL IDE / Protégé direct connection
- `reference/faq.md` expanded: HTTP endpoint URL, connecting SPARQL tools directly
Exit Criteria
Standard SPARQL clients (YASGUI, Postman, RDF4J workbench, curl) can query and update pg_ripple over HTTP without any pg_ripple-specific configuration. Content negotiation returns correct formats. All graph-scoped load and delete operations available as first-class SQL functions. SQL API fully exposes internal capabilities (graph parameters, per-graph counts, diagnostic functions).
v0.16.0 — SPARQL Federation
Theme: Query remote SPARQL endpoints from within pg_ripple queries.
In plain language: Federation lets a single SPARQL query combine data from pg_ripple with data from external SPARQL endpoints on the web. For example, you could ask "find all my local employees and enrich their records with data from Wikidata" — and the system will automatically fetch the remote portion, join it with local results, and return a unified answer. This is part of the SPARQL 1.1 standard (the `SERVICE` keyword) and is expected by many enterprise knowledge graph workflows that integrate multiple data sources. Multiple remote calls execute in parallel when possible to minimise latency.

Effort estimate: 4–6 person-weeks
Completed items
Deliverables
- SPARQL `SERVICE` keyword parsing
  - Parse `SERVICE <url> { ... }` clauses in SPARQL queries via `spargebra`
  - Support both inline service IRIs and `SERVICE ?var` (variable endpoints, with VALUES binding)
- Remote endpoint execution
  - HTTP GET/POST to remote SPARQL endpoints using `reqwest` (async HTTP client)
  - Parse `application/sparql-results+json` and `application/sparql-results+xml` responses
  - Dictionary-encode remote results into local `i64` IDs for join compatibility
- Join integration
  - Remote result sets injected as inline `VALUES` clauses in the generated SQL
  - Async parallel execution: multiple `SERVICE` clauses in a single query execute concurrently (via `tokio::join!` in pg_ripple_http, or sequential fallback in SPI context) — prevents a single slow endpoint from blocking the entire query
  - Bind-join optimisation: push bound variables from local results into remote queries to reduce remote result size
- Error handling and timeouts
  - `pg_ripple.federation_timeout` GUC (default: 30s per SERVICE call)
  - `pg_ripple.federation_max_results` GUC (default: 10,000 rows per remote call)
  - Graceful degradation: failed SERVICE calls return empty results with a WARNING (configurable to ERROR via the `pg_ripple.federation_on_error` GUC)
- Security
  - Allowlist of permitted remote endpoints: `_pg_ripple.federation_endpoints (url TEXT, enabled BOOLEAN)`
  - `pg_ripple.register_endpoint()` / `pg_ripple.remove_endpoint()` management API
  - No outbound HTTP calls unless the endpoint is explicitly registered (defence against SSRF)
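End to end, federation is then a two-step affair — allowlist the endpoint, then reference it in `SERVICE`. A sketch (the Wikidata endpoint and `ex:wikidataId` vocabulary are illustrative):

```sql
-- No outbound HTTP happens unless the endpoint is registered first
SELECT pg_ripple.register_endpoint('https://query.wikidata.org/sparql');

-- Join local triples with remote bindings in one query
SELECT * FROM pg_ripple.sparql('
  PREFIX ex:   <http://example.org/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?person ?label WHERE {
    ?person ex:wikidataId ?wd .
    SERVICE <https://query.wikidata.org/sparql> { ?wd rdfs:label ?label . }
  }
');
```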
- pg_trickle integration: federation health monitoring (optional, when pg_trickle is installed)
  - `_pg_ripple.federation_health` stream table aggregates a rolling 5-minute probe log per endpoint
  - Executor skips endpoints with `success_rate < 0.1` without waiting for timeout
  - `/metrics` Prometheus endpoint reads directly from `federation_health` (§2.11)
- `SERVICE` → Materialized View rewrite
  - When a `SERVICE <url>` clause references an endpoint backed by a local SPARQL view (created via `pg_ripple.create_sparql_view()`), rewrite the remote call to a direct scan of the pre-materialized stream table
  - Registered via a `local_view_name` column on `_pg_ripple.federation_endpoints` — set automatically when a SPARQL view is also registered as an endpoint
  - Eliminates HTTP overhead and enables the PostgreSQL planner to optimize the join with accurate statistics from the stream table
- HTTP endpoint integration
  - Federation works via both SQL (`pg_ripple.sparql()`) and HTTP (`/sparql`) interfaces
- pg_regress: `sparql_federation.sql`, `sparql_federation_timeout.sql`
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/federation.md` — `SERVICE` keyword, endpoint registration (`register_endpoint`, `remove_endpoint`), variable endpoints with `VALUES` binding, bind-join optimisation, `federation_timeout` / `federation_max_results` / `federation_on_error` GUCs, SSRF protection via allow-list
- `user-guide/configuration.md` expanded: `federation_timeout`, `federation_max_results`, `federation_on_error` GUCs
- `user-guide/best-practices/sparql-patterns.md` expanded: federation query patterns, `SERVICE` performance tips (push FILTERs down, limit remote result size), combining local and remote data
- `reference/faq.md` expanded: federation security model, configuring remote endpoints, timeout tuning
- `reference/troubleshooting.md` expanded: federation timeouts, SSRF errors, endpoint unreachable
Exit Criteria
✅ DONE — SPARQL queries with SERVICE clauses correctly fetch and join data from registered remote endpoints. Sequential execution in SPI context. Timeouts and error handling work as configured. No SSRF risk — only allowlisted endpoints are contacted.
v0.17.0 — JSON-LD Framing
Theme: Frame-driven SPARQL CONSTRUCT queries that produce structured, nested JSON-LD output.
In plain language: JSON-LD Framing is a W3C standard for reshaping RDF graph data into a specific tree structure suitable for a REST API or application. Instead of returning a flat list of disconnected facts, you provide a frame document — a JSON template that says "I want Company objects with their employees nested inside" — and pg_ripple automatically translates that into an optimised query, fetches only the data that matches, and returns a cleanly nested JSON-LD document. This makes pg_ripple a natural back-end for Linked Data APIs and JSON-centric applications without requiring a separate framing library.
Unlike a naïve approach that fetches the entire graph and post-filters it, this implementation translates the frame directly into a SPARQL CONSTRUCT query. PostgreSQL then reads only the VP tables that are touched by the join — meaning a frame targeting 3 predicates on a graph with 10,000 predicates touches 3 VP tables, not 10,000. The `jsonld_frame_to_sparql()` inspection function exposes the generated SPARQL for debugging and for users who want to customise the query further before execution.

Effort estimate: 3–4 person-weeks
Completed items
Prerequisites
- v0.5.1 SPARQL CONSTRUCT / DESCRIBE (JSONB output) — frame-to-SPARQL translation reuses the existing algebra and SQL generation pipeline.
- v0.9.0 JSON-LD export — the `nt_term_to_jsonld_value` helper in `src/export.rs` is reused for the embedding step.
- v0.3.0 SPARQL plan cache — framed queries benefit from cached SPARQL→SQL translation automatically.
Deliverables
- JSON-LD Framing engine (`src/framing/`)
  - `src/framing/mod.rs` — module root; exposes the public `frame()` entry point used by all SQL functions
  - `src/framing/frame_translator.rs` — translates a JSON-LD frame (parsed as `serde_json::Value`) into a `spargebra` CONSTRUCT algebra tree
  - `src/framing/embedder.rs` — takes flat CONSTRUCT result triples and applies the W3C embedding algorithm to produce a nested JSON-LD tree matching the frame structure
  - `src/framing/compactor.rs` — applies the `@context` from the frame to compact full IRIs to prefixed terms in the output
- Frame-to-SPARQL translation (`src/framing/frame_translator.rs`)
  - Translate `@type` constraints → `?s a <IRI>` triple patterns in the CONSTRUCT WHERE clause
  - Translate property-value pairs with wildcard `{}` → `OPTIONAL { ?s <p> ?o }` patterns
  - Translate absent-property patterns `[]` → `OPTIONAL { ?s <p> ?o } FILTER(!bound(?o))` patterns
  - Translate `@reverse` terms → flipped BGP triple patterns (`?o <p> ?s` instead of `?s <p> ?o`)
  - Translate nested frame objects → recursive OPTIONAL joins, each level introducing a fresh variable
  - Translate `@id` matching → bind target IRI as a constant in the WHERE clause
  - Translate `@requireAll: true` → convert OPTIONAL joins to INNER joins for required properties
  - All IRI constants dictionary-encoded at translation time (integer joins in all VP table queries — no string comparisons)
  - Wildcards (`{}`) on `@type` and `@id` expand to unbound variables
- Tree-embedding algorithm (`src/framing/embedder.rs`)
  - Implement the W3C JSON-LD 1.1 Framing §4.1 embedding algorithm over the flat CONSTRUCT result set
  - Build a subject-keyed node map from the CONSTRUCT rows (decoded to N-Triples strings)
  - Walk the frame tree recursively, embedding matching node objects as property values
  - Honour the `@embed` flag: `@once` (default) — embed a node only once, use a `{"@id": "..."}` reference for subsequent occurrences; `@always` — embed every occurrence even if repeated; `@never` — always use a node reference
  - Honour `@explicit: true` — omit properties not mentioned in the frame from the output node
  - Honour `@omitDefault: true` — omit absent properties rather than outputting `null`
  - Honour `@default` values — substitute the declared default value for absent properties when `@omitDefault` is `false`
  - Reverse properties: collect subjects whose relevant predicate points to the current node and embed them under the `@reverse`-declared key
  - Named-graph scope: when `graph` is specified, restrict embedding to nodes from that named graph
- `@context` compaction (`src/framing/compactor.rs`)
  - Extract the `@context` block from the input frame
  - Apply prefix substitution to all IRI strings in the output tree (full IRI → compact prefixed form using registered prefixes and inline `@context` mappings)
  - Inject the `@context` block as the first entry of the returned JSON-LD document
  - Fall back to full IRIs when no matching prefix is registered
- SQL functions (`src/lib.rs`)
  - `pg_ripple.jsonld_frame_to_sparql(frame JSONB, graph TEXT DEFAULT NULL) RETURNS TEXT` — translate a frame to a SPARQL CONSTRUCT query string without executing it; primary debugging and inspection tool
  - `pg_ripple.export_jsonld_framed(frame JSONB, graph TEXT DEFAULT NULL, embed TEXT DEFAULT '@once', explicit BOOLEAN DEFAULT FALSE, ordered BOOLEAN DEFAULT FALSE) RETURNS JSONB` — primary end-user function: translate frame to CONSTRUCT, execute via the SPARQL engine, apply embedding and compaction, return framed JSON-LD
  - `pg_ripple.export_jsonld_framed_stream(frame JSONB, graph TEXT DEFAULT NULL) RETURNS SETOF TEXT` — streaming NDJSON variant (one JSON object per matched root node); avoids buffering large framed documents in memory
  - `pg_ripple.jsonld_frame(input JSONB, frame JSONB, embed TEXT DEFAULT '@once', explicit BOOLEAN DEFAULT FALSE, ordered BOOLEAN DEFAULT FALSE) RETURNS JSONB` — general-purpose framing primitive: apply the embedding algorithm to any already-expanded JSON-LD document, not necessarily from pg_ripple storage; useful for framing SPARQL CONSTRUCT results obtained via other means
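A sketch of the inspect-then-execute workflow these functions enable (the company vocabulary is illustrative):

```sql
-- See the CONSTRUCT query a frame would generate, without running it
SELECT pg_ripple.jsonld_frame_to_sparql('{
  "@context": { "ex": "http://example.org/" },
  "@type": "ex:Company",
  "ex:name": {},
  "ex:employee": { "@type": "ex:Person", "ex:name": {} }
}'::jsonb);

-- Execute the same frame and return a nested, compacted JSON-LD document
SELECT pg_ripple.export_jsonld_framed('{
  "@context": { "ex": "http://example.org/" },
  "@type": "ex:Company",
  "ex:name": {},
  "ex:employee": { "@type": "ex:Person", "ex:name": {} }
}'::jsonb);
```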
- SPARQL plan cache integration
  - The translated CONSTRUCT query string is used as the cache key in the existing `src/sparql/plan_cache.rs` translation cache
  - Repeated calls to `export_jsonld_framed()` with the same frame and graph benefit from cached SPARQL→SQL translation automatically
- Named-graph support
  - `graph NULL` → CONSTRUCT operates over the merged graph (all `g` values across all VP tables)
  - `graph '<IRI>'` → adds `FILTER(?g = <encoded_id>)` to each VP table join in the generated CONSTRUCT
  - Frame `@graph` entry → directs the embedder to scope node matching to the named graph's node set
- Error handling
  - Invalid frame structure (not a JSON object, unrecognised `@embed` value) → `PT700`-range serialization error with the frame property path that failed
  - Frame references an IRI not present in any VP table → empty result (standard W3C framing behaviour, not an error)
  - Frame nested deeper than `pg_ripple.max_path_depth` → `PT200`-range error reusing the existing depth limit
- Incremental framing views (`create_framing_view`) (requires pg_trickle)
  - `pg_ripple.create_framing_view(name TEXT, frame JSONB, schedule TEXT DEFAULT '5s', decode BOOLEAN DEFAULT FALSE, output_format TEXT DEFAULT 'jsonld') RETURNS void` — translate the frame to a SPARQL CONSTRUCT query and register it as a pg_trickle stream table that stays incrementally up-to-date as triples are inserted or deleted (see the sketch below)
  - Stream table schema: `pg_ripple.framing_view_{name} (subject_id BIGINT, frame_tree JSONB, refreshed_at TIMESTAMPTZ)` — `subject_id` is the dictionary-encoded subject IRI; `frame_tree` is the fully embedded and compacted JSON-LD output for that root node
  - When `decode = TRUE`, a thin IRI-decoding view `pg_ripple.framing_view_{name}_decoded` is also created; the stream table itself stores integer IDs to minimise CDC surface
  - `pg_ripple.drop_framing_view(name TEXT) RETURNS void` and `pg_ripple.list_framing_views() RETURNS TABLE(name TEXT, frame JSONB, schedule TEXT, output_format TEXT, decode BOOLEAN, row_count BIGINT, last_refresh TIMESTAMPTZ, stream_table_oid OID)` for lifecycle management
  - `_pg_ripple.framing_views` catalog table: `name, frame, generated_construct, schedule, output_format, decode, stream_table_oid, created_at`
  - Refresh mode heuristics (same as `create_sparql_view`): `IMMEDIATE` for constraint-style frames (e.g. select `ex:Company` nodes that lack `ex:complianceOfficer` — any row in the view is a violation); `DIFFERENTIAL` + schedule for dashboard/API use cases (company directory refreshed every 10 s); `FULL` + long schedule for large full-graph framed exports intended for downstream consumers
  - `pg_ripple.pg_trickle_available()` check at call time — returns a clear error with an install hint when pg_trickle is absent; never raises an error at extension load time
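A sketch of a dashboard-style framing view (names illustrative; requires pg_trickle):

```sql
-- A framed company directory kept incrementally current, refreshed every 10 s
SELECT pg_ripple.create_framing_view(
  'company_directory',
  '{"@context": {"ex": "http://example.org/"}, "@type": "ex:Company", "ex:name": {}}'::jsonb,
  schedule => '10s',
  decode   => true
);

-- One row per root node: the encoded subject and its framed JSON-LD tree
SELECT subject_id, frame_tree FROM pg_ripple.framing_view_company_directory;
```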
- pg_regress: `jsonld_framing.sql` (type-based selection, property wildcards, absent-property patterns `[]`, `@reverse`, `@embed` `@once`/`@always`/`@never`, `@explicit`, `@omitDefault`, `@default`, `@requireAll`, named-graph scope, empty frame, `jsonld_frame_to_sparql` inspection output, `jsonld_frame` general-purpose function, streaming variant), `jsonld_framing_views.sql` (create/drop/list framing views; `IMMEDIATE` constraint-mode view; `DIFFERENTIAL` dashboard view; `decode` option; pg_trickle-absent error message)
Supported frame features (v0.17.0)
| Feature | Supported | Notes |
|---|---|---|
| `@type` matching | ✓ | Single IRI or array of IRIs |
| `@id` matching | ✓ | Single IRI or array of IRIs |
| Property wildcard `{}` | ✓ | Matches any value for a property |
| Absent-property pattern `[]` | ✓ | Matches nodes lacking the property |
| `@reverse` properties | ✓ | Flipped triple pattern in CONSTRUCT |
| `@embed`: `@once` / `@always` / `@never` | ✓ | Full embedding control |
| `@explicit` inclusion flag | ✓ | Omit unlisted properties from output |
| `@omitDefault` flag | ✓ | Omit null-valued absent properties |
| `@default` values | ✓ | Substitute defaults for absent properties |
| `@requireAll` flag | ✓ | Turns OPTIONAL joins into INNER joins |
| `@context` compaction | ✓ | Prefix substitution from frame `@context` |
| Named graph `@graph` scoping | ✓ | Maps to `g` column filter on VP tables |
| `@omitGraph` flag | ✓ | Single root node omits `@graph` wrapper |
| Value pattern matching (`@value` / `@language` / `@type` in value objects) | ✗ | Deferred; requires a full-graph scan to implement correctly |
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/serialization.md` expanded: `export_jsonld_framed`, `jsonld_frame_to_sparql`, `jsonld_frame`, `export_jsonld_framed_stream`; frame syntax primer; `@embed` / `@explicit` / `@omitDefault` / `@requireAll` flags; named graph scoping; supported feature table
- `user-guide/sql-reference/framing-views.md` — `create_framing_view`, `drop_framing_view`, `list_framing_views`; stream table schema and decoding view; refresh mode selection (`IMMEDIATE` for constraints, `DIFFERENTIAL` for dashboards, `FULL` for exports); `decode` option; pg_trickle dependency and detection; worked example (company directory view refreshed every 10 s)
- `user-guide/best-practices/data-modeling.md` expanded: JSON-LD Framing for REST APIs; frame-first API design pattern; using `jsonld_frame_to_sparql` for SPARQL query inspection; performance notes (frame-driven vs full-graph export); when to use `export_jsonld_framed` vs `create_framing_view`
- `reference/faq.md` expanded: framing vs plain JSON-LD export; what W3C framing features are supported; value pattern matching deferral; framing views vs SPARQL views
Exit Criteria
export_jsonld_framed() correctly translates a JSON-LD frame into a SPARQL CONSTRUCT query touching only the VP tables required by the frame, executes it via the existing SPARQL engine, and returns a nested JSON-LD document with correct @context compaction and W3C-conformant embedding semantics. The jsonld_frame_to_sparql() function exposes the generated CONSTRUCT query string. The jsonld_frame() general-purpose primitive correctly frames any expanded JSON-LD JSONB input. create_framing_view() creates an incrementally-maintained pg_trickle stream table whose rows stay current as triples change; the IMMEDIATE refresh mode correctly detects constraint violations within the same transaction. All supported frame features in the table above pass the pg_regress test suite.
v0.18.0 — SPARQL CONSTRUCT, DESCRIBE & ASK Views
Theme: Materialize the three non-SELECT SPARQL query forms as incrementally-maintained pg_trickle stream tables.
In plain language: pg_ripple already supports SPARQL CONSTRUCT, DESCRIBE, and ASK as one-shot queries. This release lets you register any of those query forms as a live view — a stream table that pg_trickle keeps incrementally up-to-date as triples are inserted or deleted. A CONSTRUCT view stores the derived triples it produces in an `(s, p, o, g)` table; this is ideal for materialising inferred facts, denormalised projections, or cached API responses. A DESCRIBE view stores all triples about the described resources. An ASK view stores a single `BOOLEAN` row that flips whenever the underlying pattern changes from matching to not-matching — useful for live constraint monitors and dashboard indicators.

Effort estimate: 2–3 person-weeks (the hard parts — CONSTRUCT/DESCRIBE SQL generation, spargebra algebra parsing, and pg_trickle stream table registration — are all already in place from v0.5.1 and v0.11.0)
Completed items
Prerequisites
- v0.5.1 SPARQL CONSTRUCT / DESCRIBE (JSONB output) — the CONSTRUCT algebra and SQL generation pipeline is reused directly.
- v0.11.0 SPARQL SELECT views — the pg_trickle stream table registration machinery (`register_stream_table`, decode-view creation, catalog tables) is extended rather than rewritten.
- v0.11.0 `pg_trickle_available()` — all three new view functions gate on the same availability check.
Deliverables
- CONSTRUCT view support (`src/views.rs`)
  - Extend `create_sparql_view()` to accept CONSTRUCT queries, or add a dedicated `create_construct_view()` function (preferred — keeps catalog tables separate and the error message explicit)
  - Parse `spargebra::Query::Construct { template, pattern, .. }`; compile `pattern` via the existing `translate_select` pipeline; expand each triple in `template` as a SQL row expression
  - Generate a `UNION ALL` SQL SELECT that returns one row per template triple per solution: `SELECT encode(s_expr) AS s, encode(p_expr) AS p, encode(o_expr) AS o, 0 AS g`; named-graph template triples include the graph term
  - All IRI/literal constants in the template dictionary-encoded at view-creation time (integer joins only — no string comparisons at refresh time)
  - Register the result as a pg_trickle stream table with schema `pg_ripple.construct_view_{name} (s BIGINT, p BIGINT, o BIGINT, g BIGINT)`
  - When `decode = TRUE`, create a thin decoding view `pg_ripple.construct_view_{name}_decoded (s TEXT, p TEXT, o TEXT, g TEXT)` that joins `_pg_ripple.dictionary` for each column
  - Record metadata in `_pg_ripple.construct_views (name, sparql, generated_sql, schedule, decode, template_count, stream_table, created_at)`
- DESCRIBE view support (`src/views.rs`)
  - `create_describe_view(name, sparql, schedule, decode)` — parse `spargebra::Query::Describe { variables, pattern, .. }`; compile to SQL that enumerates all triples where the described resource appears as subject (and optionally object)
  - Stream table schema: `pg_ripple.describe_view_{name} (s BIGINT, p BIGINT, o BIGINT, g BIGINT)` — same shape as CONSTRUCT views
  - `describe_strategy` GUC (already present from v0.5.1) respected: `cbd` (Concise Bounded Description) vs `symmetric_cbd`
  - Record metadata in `_pg_ripple.describe_views (name, sparql, generated_sql, schedule, decode, stream_table, created_at)`
- ASK view support (`src/views.rs`)
  - `create_ask_view(name, sparql, schedule)` — parse `spargebra::Query::Ask { pattern, .. }`; compile to `SELECT EXISTS(...)` SQL
  - Stream table schema: `pg_ripple.ask_view_{name} (result BOOLEAN, evaluated_at TIMESTAMPTZ DEFAULT now())`
  - Record metadata in `_pg_ripple.ask_views (name, sparql, generated_sql, schedule, stream_table, created_at)`
- Lifecycle management SQL functions (`src/lib.rs`)
  - `pg_ripple.create_construct_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s', decode BOOLEAN DEFAULT FALSE) RETURNS BIGINT` — returns template triple count
  - `pg_ripple.drop_construct_view(name TEXT) RETURNS void`
  - `pg_ripple.list_construct_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, decode BOOLEAN, template_count BIGINT, stream_table TEXT, created_at TIMESTAMPTZ)`
  - `pg_ripple.create_describe_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s', decode BOOLEAN DEFAULT FALSE) RETURNS void`
  - `pg_ripple.drop_describe_view(name TEXT) RETURNS void`
  - `pg_ripple.list_describe_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, decode BOOLEAN, stream_table TEXT, created_at TIMESTAMPTZ)`
  - `pg_ripple.create_ask_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s') RETURNS void`
  - `pg_ripple.drop_ask_view(name TEXT) RETURNS void`
  - `pg_ripple.list_ask_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, stream_table TEXT, created_at TIMESTAMPTZ)`
  - All nine functions call `pg_trickle_available()` first and raise a descriptive error with an install hint when pg_trickle is absent; never error at extension load time
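A sketch of the two most distinctive view forms (vocabulary illustrative; requires pg_trickle):

```sql
-- Materialise derived triples: an inverse "manages" edge kept current as data changes
SELECT pg_ripple.create_construct_view('manages', '
  PREFIX ex: <http://example.org/>
  CONSTRUCT { ?boss ex:manages ?emp }
  WHERE { ?emp ex:reportsTo ?boss }
');

-- A live boolean monitor that flips when the pattern starts or stops matching
SELECT pg_ripple.create_ask_view('has_contractors', '
  PREFIX ex: <http://example.org/>
  ASK { ?e ex:employmentType "contractor" }
');
SELECT result, evaluated_at FROM pg_ripple.ask_view_has_contractors;
```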
- Catalog tables (SQL migration `sql/pg_ripple--0.17.0--0.18.0.sql`)
  - `CREATE TABLE IF NOT EXISTS _pg_ripple.construct_views (...)`
  - `CREATE TABLE IF NOT EXISTS _pg_ripple.describe_views (...)`
  - `CREATE TABLE IF NOT EXISTS _pg_ripple.ask_views (...)`
- Error handling
  - Passing a SELECT query to `create_construct_view()` → clear error: `"sparql must be a CONSTRUCT query"`
  - Passing a non-ASK query to `create_ask_view()` → clear error: `"sparql must be an ASK query"`
  - Unbound variables in the CONSTRUCT template (variable present in the template but not bound by the WHERE pattern) → error at view-creation time listing the unbound variables
  - Template contains a blank node (not expressible as a reusable `BIGINT` ID) → error advising the user to replace blank nodes with IRIs or skolemise them
- pg_regress: `construct_views.sql` (create/drop/list; basic template; multi-triple template; named graph template; decode option; SELECT query rejected; unbound variable error; pg_trickle-absent error), `describe_views.sql` (create/drop/list; CBD vs symmetric_cbd; decode option), `ask_views.sql` (create/drop/list; result flips on insert/delete; pg_trickle-absent error)
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/views.md` expanded: `create_construct_view`, `drop_construct_view`, `list_construct_views`; `create_describe_view`, `drop_describe_view`, `list_describe_views`; `create_ask_view`, `drop_ask_view`, `list_ask_views`; stream table schemas; decode views; worked examples
- `user-guide/best-practices/sparql-patterns.md` expanded: when to use CONSTRUCT views vs SELECT views; materialising inference results; using ASK views as live constraint monitors
Exit Criteria
create_construct_view() compiles a SPARQL CONSTRUCT query into a pg_trickle stream table whose rows reflect the CONSTRUCT output at all times; inserting or deleting triples that affect the WHERE pattern causes the stream table to update automatically. create_describe_view() correctly materialises the CBD of the described resources. create_ask_view() correctly updates the single-row result when the pattern's satisfiability changes. All three view types correctly reject wrong query forms with a clear error. The pg_trickle-absent error message is consistent with v0.11.0 behaviour. All new pg_regress tests pass.
v0.19.0 — Federation Performance
Theme: Connection pooling, result caching, query rewriting, and throughput improvements for remote SPARQL endpoint access.
In plain language: When querying remote SPARQL endpoints via `SERVICE`, every call currently creates a new HTTP connection, buffers all results in memory before processing, and makes no attempt to reduce the data fetched from the remote. This release addresses those bottlenecks: connections are reused across calls, frequently-used results are cached locally, queries are rewritten to project only the variables the outer query actually needs, multiple `SERVICE` clauses targeting the same endpoint are batched into a single HTTP request, and duplicate term encoding is eliminated. The result is significantly lower latency for federation-heavy workloads and better behaviour under load.

Effort estimate: 3–5 person-weeks
Completed items
Prerequisites
- v0.16.0 SPARQL Federation — the `federation.rs` executor, allowlist, health monitoring, and `federation_endpoints` catalog table are all extended here.
- v0.16.0 `_pg_ripple.federation_health` — the adaptive timeout feature reads P95 latency data from this table.
Deliverables
- Connection pooling (`src/sparql/federation.rs`)
  - Replace per-call `ureq::AgentBuilder::new()` with a backend-local shared agent stored in a `thread_local!` or `OnceCell`
  - Reuses TCP connections and TLS sessions across SERVICE calls within a session
  - Pool size configurable via the `pg_ripple.federation_pool_size` GUC (default: 4 per endpoint, range: 1–32)
  - Reduces TCP handshake + TLS overhead for workloads with repeated calls to the same endpoint
- Result caching with TTL (`src/sparql/federation.rs`, `_pg_ripple.federation_cache` table)
  - Cache encoded remote results keyed on `(url, XXH3-64(sparql_text))`
  - Schema: `_pg_ripple.federation_cache (url TEXT, query_hash BIGINT, result_jsonb JSONB, cached_at TIMESTAMPTZ, expires_at TIMESTAMPTZ)`
  - On a cache hit, skip the HTTP call entirely and re-encode cached results via the dictionary
  - Expired rows cleaned up by the merge background worker
  - TTL configurable via the `pg_ripple.federation_cache_ttl` GUC (default: 0 = disabled, range: 0–86400 seconds)
  - Particularly beneficial for semi-static reference datasets (e.g. Wikidata labels, controlled vocabularies)
- Query rewriting for data minimization (`src/sparql/sqlgen.rs`)
  - At translation time, compute the set of variables from the SERVICE inner pattern that are actually referenced by the outer query (joins, projections, FILTERs)
  - Rewrite the SPARQL SELECT sent to the remote endpoint to project only those variables instead of `SELECT *`
  - Reduces data transfer and remote processing for patterns where only a subset of result bindings are consumed
- Partial result handling (`src/sparql/federation.rs`)
  - When a SERVICE call delivers rows before failing (e.g. connection drop mid-stream), use however many rows were received rather than discarding them entirely
  - Emit a WARNING naming the endpoint, the rows received, and the error
  - Controlled by the `pg_ripple.federation_on_partial` GUC (values: `'empty'` = discard partial results, `'use'` = use partial results; default: `'empty'`)
  - Improves resilience for federated queries where partial data is better than none
- Endpoint complexity hints (`_pg_ripple.federation_endpoints` schema extension)
  - Add a `complexity TEXT NOT NULL DEFAULT 'normal' CHECK (complexity IN ('fast', 'normal', 'slow'))` column to `_pg_ripple.federation_endpoints`
  - Expose via `pg_ripple.register_endpoint(url, local_view_name, complexity)` and a new `pg_ripple.set_endpoint_complexity(url, complexity)` function
  - At query planning time, reorder multiple SERVICE clauses so `'fast'` endpoints execute first — enables earlier failure detection and reduces total wall-clock time for multi-endpoint queries
- Adaptive timeout (`src/sparql/federation.rs`)
  - When `pg_ripple.federation_adaptive_timeout = on` (default: `off`), derive the effective timeout as `max(1s, p95_latency_ms * 3 / 1000)` from `_pg_ripple.federation_health`
  - Falls back to `pg_ripple.federation_timeout` when no health data is available or adaptive mode is off
  - Prevents fast endpoints from being penalised by the global timeout and slow endpoints from blocking indefinitely
- Batch SERVICE calls to the same endpoint (`src/sparql/sqlgen.rs`)
  - Detect multiple `SERVICE <url>` clauses in a single query that target the same registered endpoint
  - Combine their inner patterns into a single `SELECT * WHERE { { pattern1 } UNION { pattern2 } }` SPARQL query
  - Issue one HTTP request instead of N, then split the results back into per-clause variable bindings
  - Applied only when the patterns are independent (no shared variables between clauses)
- Result deduplication at the encoding stage (`src/sparql/federation.rs`)
  - Build a per-call `HashMap<String, i64>` during `encode_results()` to avoid redundant dictionary lookups for the same term appearing in multiple rows
  - No user-visible API change; pure internal optimisation
  - Particularly effective for result sets with high-cardinality repeated values (e.g. a common subject IRI across thousands of rows)
- GUC additions (`src/lib.rs`) — see the sketch below
  - `pg_ripple.federation_pool_size` (INT, default: 4, range: 1–32)
  - `pg_ripple.federation_cache_ttl` (INT, default: 0, range: 0–86400 seconds; 0 = disabled)
  - `pg_ripple.federation_on_partial` (ENUM, default: `'empty'`; values: `'empty'`, `'use'`)
  - `pg_ripple.federation_adaptive_timeout` (BOOL, default: `off`)
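How the new GUCs compose in a session, as a sketch:

```sql
-- Pool up to 8 connections per endpoint; cache remote results for 10 minutes
SET pg_ripple.federation_pool_size = 8;
SET pg_ripple.federation_cache_ttl = 600;

-- Keep whatever rows arrived when a remote drops mid-stream, and derive
-- timeouts from per-endpoint P95 latency where health data exists
SET pg_ripple.federation_on_partial = 'use';
SET pg_ripple.federation_adaptive_timeout = on;
```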
- Migration script (`sql/pg_ripple--0.18.0--0.19.0.sql`)
  - `ALTER TABLE _pg_ripple.federation_endpoints ADD COLUMN IF NOT EXISTS complexity TEXT NOT NULL DEFAULT 'normal' CHECK (complexity IN ('fast', 'normal', 'slow'))`
  - `CREATE TABLE IF NOT EXISTS _pg_ripple.federation_cache (url TEXT NOT NULL, query_hash BIGINT NOT NULL, result_jsonb JSONB NOT NULL, cached_at TIMESTAMPTZ NOT NULL DEFAULT now(), expires_at TIMESTAMPTZ NOT NULL, PRIMARY KEY (url, query_hash))`
  - `CREATE INDEX IF NOT EXISTS idx_federation_cache_expires ON _pg_ripple.federation_cache (expires_at)`
- pg_regress: `sparql_federation_perf.sql` (cache hit/miss; TTL expiry; variable projection confirmed via explain; batch detection with two SERVICE clauses to the same endpoint; complexity ordering; partial result GUC; adaptive timeout GUC boundary; deduplication correctness)
Documentation
See plans/documentation.md for details.
- `user-guide/sql-reference/federation.md` extended: new GUCs table; connection pooling notes; result caching section with TTL examples; complexity hints; variable projection rewrite behaviour; batching semantics; adaptive timeout
- `user-guide/best-practices/federation-performance.md` (new page): choosing a cache TTL; when to set complexity hints; designing queries to benefit from variable projection; monitoring with `federation_health` and `federation_cache`; sidecar vs in-process tradeoffs
Exit Criteria
A federated query making repeated calls to the same endpoint is measurably faster due to connection reuse. A query with cacheable SERVICE results performs a single HTTP call across multiple executions within the TTL window. Multiple SERVICE clauses targeting the same endpoint are confirmed (via logged SPARQL text) to collapse into one HTTP request. Variable projection is confirmed by inspecting the SPARQL text sent to the endpoint. All new pg_regress tests pass.
v0.20.0 — W3C Conformance & Stability Foundation
Theme: Standards compliance, crash safety, and production readiness preparation.
In plain language: As we approach the 1.0 release, this milestone focuses on confidence. Instead of building new features, we verify that everything already built works correctly according to the official W3C standards. We run pg_ripple's SPARQL engine and SHACL validator against the W3C test suites and fix any edge cases. We test what happens when the database crashes and verify recovery is clean. We scan the code for security vulnerabilities. And we benchmark at scale (100M triples) to establish baselines. The result is a release that's ready for production users to rely on.
Effort estimate: 5–7 person-weeks
Completed items
Deliverables
- W3C SPARQL 1.1 Query test suite conformance
  - Download and run the official W3C SPARQL 1.1 Query test suite
  - Implement missing query features or fix conformance bugs
  - Document unsupported features (property functions, custom aggregate functions) with rationale
  - Verify conformance via both SQL (`pg_ripple.sparql()`) and HTTP (`/sparql` endpoint) interfaces
  - Create `tests/pg_regress/w3c_sparql_query_conformance.sql` with representative W3C test cases; mark expected failures clearly
  - Federation (`SERVICE`) conformance covered by v0.16.0; no additional work needed
  - Target: ≥95% of the applicable W3C Query test suite passes (excluding property functions, language tags in comparisons, and other known limitations)
- W3C SPARQL 1.1 Update test suite conformance
  - Download and run the official W3C SPARQL 1.1 Update test suite
  - Implement missing update features or fix conformance bugs
  - Document unsupported features with rationale
  - Create `tests/pg_regress/w3c_sparql_update_conformance.sql` with representative W3C test cases
  - Target: ≥95% of the applicable W3C Update test suite passes
- W3C SHACL Core test suite conformance
  - Download and run the official W3C SHACL Core test suite
  - Implement missing validators or fix conformance bugs
  - Critical constraint: any optimization strategy used in shape compilation must preserve results identical to the reference semantics as seen externally; if an optimization changes the set of violations reported, it is a regression
  - Create `tests/pg_regress/w3c_shacl_conformance.sql` with representative W3C test cases
  - Document any limitations (e.g. SHACL Advanced features not yet implemented, deferred to v0.8.0 or later)
  - Target: ≥95% of the SHACL Core test suite passes
- Crash recovery testing framework
  - `tests/crash_recovery/merge_during_kill.sh` — start a bulk load, kill -9 the PostgreSQL backend during an HTAP generation merge, restart PostgreSQL, verify:
    - No corruption in the `_pg_ripple.predicates` catalog
    - VP table data is recoverable (rows visible, no stray VACUUM marks)
    - Dictionary is consistent (no orphaned or duplicate entries)
    - Subsequent queries return correct results
  - `tests/crash_recovery/dict_during_kill.sh` — kill -9 during a high-volume dictionary encoding operation (e.g. bulk load), verify dictionary consistency
  - `tests/crash_recovery/shacl_during_violation.sh` — kill -9 during async validation queue processing, verify no violation reports are lost and no rows are orphaned
  - Run these as part of regular CI (nightly schedule, ~30 min total)
  - Document the recovery procedure for production operators (backup/restore, WAL replays)
- Memory leak detection
  - Set up a `cargo pgrx test --valgrind` invocation for a curated subset of unit tests (heap allocations are the main concern; stack overflows out of scope)
  - Identify and fix any definite leaks (not just memory still reachable at program exit)
  - Focus areas: shared-memory allocations, per-query temporary buffers, dictionary cache evictions, failed error paths
  - Document baseline leak-free status in the release notes
  - CI nightly run (timeout 2 hours)
- Security review (Phase 1)
  - SPI query generation review: audit all of `src/sparql/sqlgen.rs` and `src/datalog/compiler.rs` for potential SQL injection vectors
    - All IRI/literal constants must be dictionary-encoded before SQL generation
    - No string interpolation into generated SQL (`format!` only for identifiers via `format_ident!`)
    - Create a checklist document listing all unsafe patterns and their mitigations
  - Shared memory safety review: audit `src/shmem.rs` and all `pgrx::PgSharedMem` usage for:
    - Data races (concurrent access without synchronization)
    - Bounds violations (buffer overflows, stack smashing)
    - Use-after-free (stale pointers after shmem recreation)
    - Create a checklist document with findings and resolutions
  - Dictionary cache timing side-channels review: verify that encode/decode latency does not leak dictionary size, IRI patterns, or other sensitive metadata
  - Document findings in `reference/security.md`; create follow-up issues for Phase 2 (v0.21.0 or later) if needed
- Benchmarking at scale (100M triples)
  - Extend the BSBM benchmark infrastructure to run with 100M triples (BSBM scale factor ≥30)
  - Measure query latency, throughput, memory usage, merge worker performance
  - Publish baseline results in the release notes: e.g. "Query latency: <50ms p95 on 100M triples with 4 GiB shared memory"
  - Store a results artifact in CI (for regression detection in future releases)
  - Compare with v0.19.0 results to detect performance regressions
  - Known constraint: BSBM at 100M triples on a single 4-core developer machine will take ~4–6 hours; run nightly or on a larger CI machine
- API stability audit (documentation only; no code changes)
  - Audit all `pg_ripple.*` SQL functions for API stability
  - Designate these as the stable, guaranteed API for 1.x releases
  - Document that the `_pg_ripple.*` schema is private and subject to change
  - Create `reference/api-stability.md` documenting the stability contract
- Migration script (`sql/pg_ripple--0.19.0--0.20.0.sql`)
  - If there are schema changes from conformance fixes, add them here
  - If no schema changes are required, leave the migration script as an empty comment block with a note explaining what new functions/GUCs (if any) are provided
  - Per the extension versioning conventions (AGENTS.md), the migration script must exist even if empty
- pg_regress: `w3c_sparql_query_conformance.sql`, `w3c_sparql_update_conformance.sql`, `w3c_shacl_conformance.sql`, `crash_recovery_merge.sql` (basic recovery smoke test)
- 100% W3C SPARQL 1.1 Query conformance — fix all remaining known limitations:
  - `FILTER` string functions: `CONTAINS()`, `STRSTARTS()`, `STRENDS()`, `REGEX()` — translate to SQL `strpos`, `starts_with`, `right()`, `~` / `~*`
  - `FILTER NOT EXISTS { ... }` — translate to SQL `NOT EXISTS` (correlated subquery)
  - Subquery + `LIMIT` in an outer JOIN — wrap the inner slice pattern in a SQL subquery with `LIMIT` applied before the outer join
  - Target: all assertions in `w3c_sparql_query_conformance.sql` pass with exact expected values
- 100% W3C SHACL Core conformance — fix the `validate()` false-negative on conforming graphs:
  - Root cause: `value_has_datatype()` returns `false` for inline-encoded types (xsd:integer, xsd:boolean, xsd:dateTime, xsd:date) because inline IDs are never stored in the dictionary
  - Fix: detect inline IDs (`id < 0`) and determine their datatype from the inline type code without a DB round-trip
  - Additionally: plain literals (kind=KIND_LITERAL, xsd:string normalization) now correctly satisfy `sh:datatype xsd:string`
  - Additionally: `sh:in` with string literal values now encodes them via dictionary lookup instead of `lookup_iri`
  - Target: `validate()` returns `conforms=true` for all conforming graphs; violation detection remains 100%
- 100% W3C SPARQL 1.1 Update test suite conformance — implement full update operator coverage:
  - `USING <g>` / `WITH <g>` clauses: restrict WHERE evaluation to the specified dataset graph(s)
  - `CLEAR ALL`, `CLEAR DEFAULT`, `CLEAR NAMED` — all graph-target variants
  - `DROP ALL`, `DROP DEFAULT`, `DROP NAMED` — all graph-target variants
  - `ADD <src> TO <dst>` — copy triples from the source graph to the destination (source preserved)
  - `COPY <src> TO <dst>` — clear the destination, then copy the source (source preserved)
  - `MOVE <src> TO <dst>` — copy the source to the destination, then drop the source
  - `DELETE WHERE { ... }` shorthand — the pattern is used as both delete template and WHERE clause
  - Multi-graph USING: `USING <g1> USING <g2>` expands to a UNION of GRAPH patterns in WHERE
  - Target: all assertions in `w3c_sparql_update_conformance.sql` (sections 1–16) pass with exact expected values
Documentation
See plans/documentation.md for details.
- `reference/w3c-conformance.md` (new page) — W3C test suite results summary, supported subset list, unsupported features with rationale, known limitations
- `reference/security.md` (Phase 1 findings) — SPI injection mitigations, shared memory safety, side-channel analysis
- `reference/api-stability.md` (new page) — stable API contract, `pg_ripple.*` functions, `_pg_ripple.*` schema privacy
- `user-guide/backup-restore.md` expanded: crash recovery procedure, WAL replay, PITR workflow
- Release notes for v0.20.0 — include BSBM 100M triple baseline results, W3C test suite summary, security audit findings
Exit Criteria
W3C SPARQL 1.1 Query test suite: ≥95% pass rate. W3C SPARQL 1.1 Update test suite: ≥95% pass rate. W3C SHACL Core test suite: ≥95% pass rate. Crash recovery framework operational: database recovers cleanly from kill -9 during merge, bulk load, and validation. Valgrind finds no definite memory leaks. Security review Phase 1 complete: all SPI injection vectors documented and mitigated, shared memory audit complete. BSBM 100M triple baseline published. API stability contract documented.
v0.21.0 — SPARQL Built-in Functions & Query Correctness
Theme: Implement all ~40 missing SPARQL 1.1 built-in functions, fix the FILTER silent-drop correctness hazard, and close several high-priority query-semantics bugs identified in the v0.20.0 gap analysis.
In plain language: Until now, pg_ripple's SPARQL engine understood the grammar of standard functions like `UCASE`, `IF`, `DATATYPE`, and `isIRI` — but silently ignored them at runtime, returning too many rows instead of the correctly filtered set. This release makes those functions actually work. It also fixes several query-correctness issues that were masked by the existing conformance test suite: wrong sort order for NULL values, `p*` paths generating phantom reflexive rows on nodes that don't participate in the property at all, and `GROUP_CONCAT` ignoring the `DISTINCT` keyword. After this release, any unsupported expression raises a clear named error rather than silently dropping the filter.

Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- SPARQL 1.1 built-in function surface — full implementation
  - String functions: `STR`, `STRLEN`, `SUBSTR`, `UCASE`, `LCASE`, `CONCAT`, `REPLACE`, `ENCODE_FOR_URI`, `STRLANG`, `STRDT` (in addition to `STRSTARTS`, `STRENDS`, `CONTAINS`, `REGEX` already present)
  - Type-testing predicates: `isIRI`, `isLiteral`, `isBlank`, `isNumeric`, `sameTerm`
  - Term construction and access: `IRI` (alias `URI`), `BNODE`, `LANG`, `DATATYPE`, `LANGMATCHES`
  - Numeric functions: `ABS`, `CEIL`, `FLOOR`, `ROUND`, `RAND`
  - Datetime functions: `NOW`, `YEAR`, `MONTH`, `DAY`, `HOURS`, `MINUTES`, `SECONDS`, `TIMEZONE`, `TZ`
  - Hash / UUID functions: `MD5`, `SHA1`, `SHA256`, `SHA384`, `SHA512`, `UUID`, `STRUUID`
  - Control functions: `IF`, `COALESCE`
  - Implementation strategy: decode the dictionary ID to the term value at expression-evaluation time; compile to PostgreSQL equivalents where available (`LOWER`, `UPPER`, `SUBSTR`, `MD5`, `NOW()`, `ABS`, `CEIL`, `FLOOR`, `ROUND`, `gen_random_uuid()`, etc.); datetime functions extract fields from `xsd:dateTime` literals via `to_timestamp` + `EXTRACT`; hash functions operate over the term's string representation
  - Introduce a typed `SqlExpr` intermediate representation in `src/sparql/expr.rs` replacing the current raw-`String` output from `translate_expr()` — makes the function dispatch table explicit and independently testable
- FILTER silent-drop fix
  - Change `translate_expr()` so that an unsupported expression variant raises a structured `ERRCODE_FEATURE_NOT_SUPPORTED` error naming the unimplemented function, rather than returning `None` and silently dropping the predicate from the SQL `WHERE` clause
  - Add a `pg_ripple.sparql_strict` GUC (default: `on`): when `off`, the legacy warn-and-drop behaviour is preserved for compatibility; when `on` (the default from this release onwards), unsupported expressions hard-error
  - Migration script `sql/pg_ripple--0.20.0--0.21.0.sql`: register the `sparql_strict` GUC with its default
- Query correctness fixes
  - `ORDER BY` NULL placement: append `NULLS LAST` to every `ASC` clause and `NULLS FIRST` to every `DESC` clause in the SQL generator, matching SPARQL 1.1 §15.1 semantics (unbound variables sort last in ascending order, first in descending order)
  - `GROUP_CONCAT(DISTINCT …)`: honour the `distinct` flag in `AggregateExpression::GroupConcat` — emit `STRING_AGG(DISTINCT …, sep)` rather than silently dropping the deduplication
  - `p*` (ZeroOrMore) reflexive rows: restrict the zero-hop identity row to subjects that actually appear in the predicate's VP tables, preventing spurious reflexive paths for all nodes in the graph
  - Property-path cycle detection: change `CYCLE o SET _is_cycle USING _cycle_path` to `CYCLE s, o SET _is_cycle USING _cycle_path` in all `WITH RECURSIVE` path CTEs — prevents false cycle detection in DAGs that have shared intermediate nodes
  - Self-join dedup key: replace the `format!("{tp}")` Debug-string key in BGP pattern deduplication with a structural `(s_term_id, p_term_id, o_term_id)` tuple so that only genuinely identical patterns are collapsed
  - `REDUCED` semantics: implemented as `DISTINCT`, which is within the SPARQL 1.1 specification; documented in `reference/sparql-reference.md`
- SPARQL property path & federation completeness
  - Negated property sets `!(p1|p2|…)`: compile to an anti-join scanning all VP tables; correctly excludes the listed predicates
  - `SERVICE SILENT`: when the `silent` flag is set on a `SERVICE` block, federation errors return an empty result set rather than propagating the error
- W3C conformance test assertions updated
  - All `count(*) >= 0 AS label_no_error` shims replaced with real value-checking assertions in `w3c_sparql_query_conformance.sql`
- All
Documentation
See plans/documentation.md for details.
- `reference/sparql-functions.md` (new page) — every SPARQL 1.1 built-in function, implementation status, PostgreSQL equivalent used, and known limitations
- `user-guide/sparql-reference.md` updated with the complete function table and `sparql_strict` GUC guidance
- `reference/w3c-conformance.md` updated — replace `label_no_error` placeholder entries with accurate pass / skip / fail classification
- Release notes for v0.21.0 — list every newly implemented function; highlight the FILTER silent-drop fix
Exit Criteria
Every SPARQL 1.1 built-in function from the W3C SPARQL 1.1 Appendix A either works correctly or raises a named ERRCODE_FEATURE_NOT_SUPPORTED error — never silently drops. w3c_sparql_query_conformance.sql passes with real value-checking assertions (no >= 0 shims). sparql_builtins.sql passes for all implemented functions. ORDER BY NULL placement, property-path cycle detection on a DAG, ZeroOrMore scope restriction, and GROUP_CONCAT DISTINCT each have a dedicated passing regression test. property_path_negated.sql passes for single and multi-predicate negated sets. service_silent.sql returns zero rows rather than an error on an unreachable SERVICE SILENT endpoint. reference/sparql-reference.md documents the REDUCED → DISTINCT equivalence choice.
v0.22.0 — Storage Correctness & Security Hardening
Theme: Fix the critical data-integrity issues in the storage layer (dictionary cache rollback, HTAP merge races, shmem cache thrashing, rare-predicate promotion race) and close the security gaps in the HTTP companion service and privilege model identified in the v0.20.0 gap analysis.
In plain language: This release addresses issues that could silently corrupt data or create security vulnerabilities in production deployments. The most important fix: if a database transaction is rolled back, pg_ripple's internal term-ID cache now correctly discards the rolled-back entries — previously, stale IDs could be planted into the triple store, creating phantom references that make facts disappear or return the wrong data. Two race conditions in the background merge process that could cause deleted facts to reappear, or queries to error mid-merge, are also closed. The internal shared-memory cache is redesigned to handle large vocabularies without thrashing. On the security side, the HTTP companion service's rate limiting finally works, error messages no longer leak internal database details to API clients, and the `_pg_ripple` internal schema is explicitly locked away from unprivileged roles.

Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- Dictionary cache rollback correctness (critical fix C-2)
  - Register `RegisterXactCallback` and `RegisterSubXactCallback` during `_PG_init` — on `XACT_EVENT_ABORT` and `XACT_EVENT_PARALLEL_ABORT`, drain both the `ENCODE_CACHE` and `DECODE_CACHE` thread-local LRU caches so rolled-back term IDs cannot be served to future encode calls in the same backend session
  - Stamp a per-backend epoch counter; bump it on rollback; the shared-memory encode cache stores the write epoch at insertion time and rejects cache hits from a prior epoch, ensuring the shmem path is also safe
  - New pg_regress test `dictionary_rollback.sql`: `BEGIN; pg_ripple.insert_triple(…new term…); ROLLBACK;` then `pg_ripple.insert_triple(same term again)`; verify `pg_ripple.decode_id(id)` returns the original term string, not NULL — sketched below
- HTAP merge race fixes (critical fixes C-3 and C-4)
  - C-3 (view-rename atomicity): remove the `CREATE OR REPLACE VIEW vp_N` step from the merge cycle — the view's `FROM` clause always names `vp_N_main` directly, which PG re-resolves after the rename; the `CREATE OR REPLACE VIEW` call is eliminated, closing the window between rename and view-rebuild
  - C-4 (tombstone resurrection): record `max_sid_at_snapshot` at merge-start (`currval('_pg_ripple.statement_id_seq')` before processing); at the merge-end TRUNCATE, only delete tombstones with `i ≤ max_sid_at_snapshot` — tombstones for deletes that committed after the snapshot survive to the next merge cycle
  - New pg_regress test `merge_race.sql`: issue a `pg_ripple.delete_triple()` concurrently with `pg_ripple.force_merge()`; verify the deleted triple does not reappear; verify no `relation does not exist` error under a concurrent `pg_ripple.sparql()` call
- Merge deduplication and `rebuild_subject_patterns` correctness (high fixes H-6, H-7)
  - H-6 (cross-merge duplicate visibility): add a `UNIQUE (s, o, g)` constraint to `vp_{id}_delta` and change `insert_triple` to use `ON CONFLICT DO NOTHING`; update the VP view definition to carry `DISTINCT ON (s, o, g)` as a safety net for rows that crossed a merge boundary before the constraint was present — prevents a triple from appearing twice in query results when it exists in both `main` and `delta`
  - H-7 (`vp_rare` double-count in star patterns): fix `rebuild_subject_patterns()` in `src/storage/merge.rs` to enumerate only predicates that have a dedicated VP table (listed in `_pg_ripple.predicates` with a non-null `table_oid`); skip `vp_rare` as a direct scan target — `vp_rare` rows are already reachable via their per-predicate plans and must not be scanned a second time as the raw table
  - New pg_regress test `merge_dedup.sql`: insert the same triple before and after `pg_ripple.force_merge()`; verify the query returns exactly one result row; verify `triple_count` in the predicate catalog equals 1
- Shared-memory encode cache — 4-way set-associative redesign (high fix H-1)
  - Replace the direct-mapped 4096-slot cache with a 4-way set-associative layout: 1024 sets × 4 ways — the same memory footprint as before, with the birthday-collision rate dropping from ~15% to <1% at 5k hot terms
  - LRU eviction within each 4-way set using a 2-bit age field packed into the existing `(hash_parts, id)` slot struct
  - New `pg_ripple.cache_stats()` SQL function returning `(hits BIGINT, misses BIGINT, evictions BIGINT, utilisation FLOAT)` — exposes the hit rate for monitoring
  - Benchmark gate: `just bench-cache` asserts a hit rate ≥ 95% on a 10k-predicate workload; CI fails on regression below 90%
- Bloom filter per-bit reference counting (high fix H-2)
  - Replace the boolean `u64` bloom words with 8-bit saturating counters in the delta bloom shared-memory segment
  - `set_predicate_delta_bit(pred_id)`: increment both bloom counter positions (saturates at 255)
  - `clear_predicate_delta_bit(pred_id)`: decrement both counters; only clear the boolean bit when the counter reaches 0 — prevents false-negative delta skips for predicates that hash-collide with a predicate being concurrently merged
- Rare-predicate promotion atomicity (high fixes H-3 and H-4)
  - Rewrite `promote_predicate()` to use a single atomic CTE: `WITH moved AS (DELETE FROM _pg_ripple.vp_rare WHERE p = $1 RETURNING s, o, g, i, source) INSERT INTO _pg_ripple.vp_{id}_delta (s, o, g, i, source) SELECT * FROM moved` — eliminates the two-statement window in which concurrent inserts can orphan rows in `vp_rare` under a predicate that now has its own VP table (written out below)
  - After the CTE, run `UPDATE _pg_ripple.predicates SET triple_count = (SELECT count(*) FROM _pg_ripple.vp_{id}_delta) WHERE id = $1` to restore accurate planner statistics rather than leaving `triple_count = 0` after promotion
  - pg_regress test: load more than `vp_promotion_threshold` triples for a single predicate while a concurrent transaction also inserts into `vp_rare` for that predicate; verify zero orphan rows after promotion completes
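The atomic move written out, with `$1` bound to the predicate's dictionary ID and `vp_42_delta` standing in for the freshly created VP table:

```sql
-- Single statement: rows are never visible in neither table.
WITH moved AS (
  DELETE FROM _pg_ripple.vp_rare
  WHERE p = $1
  RETURNING s, o, g, i, source
)
INSERT INTO _pg_ripple.vp_42_delta (s, o, g, i, source)
SELECT s, o, g, i, source FROM moved;

-- Then restore accurate planner statistics.
UPDATE _pg_ripple.predicates
SET triple_count = (SELECT count(*) FROM _pg_ripple.vp_42_delta)
WHERE id = $1;
```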
- pg_ripple_http security hardening (high fixes H-14, H-15; medium fixes M-13, S-4)
  - Rate limiting: integrate the `tower_governor` crate; the `PG_RIPPLE_HTTP_RATE_LIMIT` env var is now enforced as requests per second per source IP (default 100 req/s); excess requests receive `429 Too Many Requests` with a `Retry-After` header
  - Error redaction: replace verbatim PostgreSQL error text in HTTP 4xx/5xx responses with `{"error": "<category>", "trace_id": "<uuid>"}` JSON; log the full PG error plus trace ID at server `ERROR` level — internal schema names, GUC values, and file paths are never exposed to API clients
  - Constant-time auth: replace `token != expected.as_str()` with `!constant_time_eq(token.as_bytes(), expected.as_bytes())` using the `constant_time_eq` crate
  - Federation URL scheme validation: `pg_ripple.register_endpoint()` rejects any URL whose scheme is not `http` or `https` with `ERRCODE_INVALID_PARAMETER_VALUE` — prevents `file://`, `gopher://`, or other scheme registration even though `ureq` would refuse them at connection time
- Privilege model hardening (medium fix M-14)
  - Migration script `sql/pg_ripple--0.21.0--0.22.0.sql`: `REVOKE ALL ON SCHEMA _pg_ripple FROM PUBLIC; REVOKE ALL ON ALL TABLES IN SCHEMA _pg_ripple FROM PUBLIC; REVOKE ALL ON ALL SEQUENCES IN SCHEMA _pg_ripple FROM PUBLIC;` (see the sketch below)
  - New pg_regress test `privilege_isolation.sql`: create a non-superuser role; verify `SELECT * FROM _pg_ripple.dictionary` raises permission denied; verify `SELECT * FROM pg_ripple.find_triples(NULL, NULL, NULL)` still works (the public API is unaffected)
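The migration's lockdown and the shape of the regression check (the role name is illustrative):

```sql
REVOKE ALL ON SCHEMA _pg_ripple FROM PUBLIC;
REVOKE ALL ON ALL TABLES IN SCHEMA _pg_ripple FROM PUBLIC;
REVOKE ALL ON ALL SEQUENCES IN SCHEMA _pg_ripple FROM PUBLIC;

CREATE ROLE app_reader LOGIN;
SET ROLE app_reader;
SELECT * FROM _pg_ripple.dictionary;                     -- permission denied
SELECT * FROM pg_ripple.find_triples(NULL, NULL, NULL);  -- still allowed
RESET ROLE;
```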
- GUC bounds and merge worker signal handling (medium fixes M-12, M-15)
  - `pg_ripple.vp_promotion_threshold`: add `min = 10` and `max = 10_000_000` constraints to the pgrx GUC definition — prevents catalog explosion at `threshold = 1` and permanent `vp_rare` lock-in at `threshold = INT_MAX`
  - Merge worker: call `BackgroundWorker::reset_latch()` immediately before `std::thread::sleep` in the error back-off path — prevents a busy-wait loop in which a `SIGHUP` received during the sleep keeps `wait_latch` returning immediately on the next cycle
Documentation
See plans/documentation.md for details.
- `reference/security.md` Phase 2 section: rate limiting configuration, error-redaction policy, privilege model, constant-time auth rationale, URL scheme enforcement
- `user-guide/operations.md` updated: rollback safety guarantee for the dictionary cache, merge correctness guarantees (tombstone epoch fence), `pg_ripple.cache_stats()` monitoring
- `user-guide/upgrading.md` updated: the v0.21.0→v0.22.0 privilege change (REVOKE) is safe for all existing deployments; no data migration required
- Release notes for v0.22.0 — highlight the dictionary-rollback fix, the merge race fixes, and the HTTP security changes
Exit Criteria
- A rolled-back `insert_triple` cannot plant a phantom ID (`dictionary_rollback.sql` pg_regress passes).
- `merge_race.sql` passes with zero tombstone resurrections and zero `relation does not exist` errors under a concurrent query.
- `merge_dedup.sql` passes — inserting the same triple across a merge boundary returns exactly one result row.
- The shmem cache benchmark reports a ≥ 95% hit rate at 10k hot terms.
- pg_ripple_http returns 429 when the rate limit is exceeded (verified by integration test).
- An unprivileged role is denied SELECT on `_pg_ripple.*` (`privilege_isolation.sql` passes).
- All migration scripts from 0.1.0 through 0.22.0 run cleanly via `just test-migration`.
v0.23.0 — SHACL Core Completion & SPARQL Diagnostics
Theme: Complete the SHACL 1.0 Core constraint set, introduce first-class SPARQL query introspection, and fix correctness issues in the Datalog engine and JSON-LD framing identified in the v0.20.0 gap analysis.
In plain language: This release makes pg_ripple's data-quality rules (SHACL) useful for real-world schemas. Until now, common constraints like "this property must have a specific value" (`sh:hasValue`), "this node must have exactly this type" (`sh:nodeKind`), and "no properties outside this allowed list" (`sh:closed`) were silently ignored. They now work. Separately, a new function `pg_ripple.explain_sparql()` lets you see exactly what SQL pg_ripple generates for a SPARQL query — invaluable for diagnosing slow queries. The Datalog engine also receives three correctness fixes: arithmetic division errors now name the rule that caused them, rules with undefined variables now error at compile time rather than silently matching nothing, and cyclic negation is correctly detected.

Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- SHACL Core constraint completion (medium fix M-18)
  - `sh:hasValue`: verify that at least one value matches the given RDF term; compile to `EXISTS (SELECT 1 FROM vp_{id} WHERE s = $node AND o = $encoded_value)`
  - `sh:closed` + `sh:ignoredProperties`: reject triples whose predicate is not in the shape's declared property set; compile to a NOT EXISTS anti-join over all VP tables scoped to the focus node, excluding the declared properties and the ignore list
  - `sh:nodeKind`: validate that each value is an IRI, blank node, or literal as declared; discriminate using the dictionary `kind` column
  - `sh:languageIn`: compile to `lang(value) = ANY($language_tags_array)` after decoding the language tag from the literal's dictionary entry
  - `sh:uniqueLang`: use `COUNT(*) OVER (PARTITION BY lang(value))` and reject partitions with count > 1
  - `sh:lessThan` / `sh:greaterThan`: emit a comparison join between the focus node's two property values, decoding literals to numeric/date types for ordering
  - `sh:qualifiedValueShape`: `sh:qualifiedMinCount` / `sh:qualifiedMaxCount` on a nested shape — count focus-node values matching the inner shape and compare against the declared bounds
  - `sh:path` with property path expressions: extend the shape compiler to accept inverse paths (`sh:inversePath`), alternative paths (`sh:alternativePath`), sequence paths, and zero-or-more/one-or-more/zero-or-one paths — each maps to the corresponding property-path CTE already used in the SPARQL engine
  - Turtle block comment handling (M-11): add a `/* … */` block-comment stripping pass in the SHACL shape pre-processor at `src/shacl/mod.rs` before the document is handed to the Turtle parser — regex: strip `(?s)/\*.*?\*/`; allows SPARQL-style block-commented shapes to load correctly
  - New pg_regress test `shacl_core_completion.sql` — one test per new constraint with passing, failing, and edge-case triples; verified against the W3C SHACL Core test suite manifest (an example shape load is sketched below)
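A sketch of loading and checking a shape that uses two of the new constraints; loading shapes through `pg_ripple.load_turtle()` and reporting via `pg_ripple.validate()` is assumed here:

```sql
SELECT pg_ripple.load_turtle('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
  sh:targetClass ex:Person ;
  sh:property [ sh:path ex:status ;   sh:hasValue "active" ] ;
  sh:property [ sh:path ex:homepage ; sh:nodeKind sh:IRI ] .
');

-- Violations are reported per focus node and constraint component.
SELECT * FROM pg_ripple.validate();
```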
- SPARQL query introspection (feature F-3 from the gap analysis)
  - New SQL function `pg_ripple.explain_sparql(query TEXT, format TEXT DEFAULT 'text') RETURNS TEXT`
  - When `format = 'sql'`: returns the generated SQL string produced by `translate_select()` without executing it — useful for manual inspection
  - When `format = 'text'` (the default) or `'json'`: runs `EXPLAIN (ANALYZE, FORMAT text/json)` on the generated SQL via SPI and returns the plan output
  - When `format = 'sparql_algebra'`: returns the `spargebra` algebra tree serialised as indented text via `Debug` formatting — exposes the optimizer's view of the query
  - Security: `SECURITY DEFINER` is not used; the caller needs `SELECT` privilege on the relevant VP tables (same as `pg_ripple.sparql()`)
  - New pg_regress test `explain_sparql.sql` — verifies that the function returns non-empty output for a known-good SELECT query and does not error on edge cases (empty graph, VALUES-only query, property path query); usage is sketched below
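Typical usage, per the signature above:

```sql
-- See the generated SQL without executing it …
SELECT pg_ripple.explain_sparql(
  'SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }',
  'sql');

-- … or the optimizer's algebra view of the same query.
SELECT pg_ripple.explain_sparql(
  'SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }',
  'sparql_algebra');
```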
- SHACL query-optimization hint verification (performance fix P-5)
  - Verify that `sh:maxCount 1` on a predicate elides `DISTINCT` in the SQL generated for SPARQL patterns using that predicate — inspect `translate_select()` in `src/sparql/sqlgen.rs` and wire the lookup against the SHACL constraint catalog if the hint is not already applied; a triple pattern on a `maxCount 1` predicate should not produce a `HashAggregate` (DISTINCT) node in the plan
  - Verify that `sh:minCount 1` on a predicate downgrades `LEFT JOIN` to `INNER JOIN` in the SQL generator for `OPTIONAL` patterns — saves a null-check pass and allows the PG planner to use more efficient join strategies
  - New pg_regress test `shacl_query_hints.sql` — load a shape with `sh:maxCount 1` and `sh:minCount 1`; run `pg_ripple.explain_sparql()` on a query using the constrained predicate; assert the plan string contains no `HashAggregate` for the maxCount case and no `Hash Left Join` for the minCount case
- Datalog engine correctness fixes (medium fixes M-1, M-2, M-3)
  - Division by zero (M-1): wrap every arithmetic divisor in the Datalog SQL compiler with `NULLIF(expr, 0)`; emit a `NOTICE`-level message naming the failing rule head when a null propagation from division occurs (see the one-liner below)
  - Unbound variables (M-2): add a compile-time check in `compile_rule()` that every variable appearing in a rule body literal is either bound by a positive body literal or explicitly declared; raise `ERRCODE_SYNTAX_ERROR` naming the variable and the rule head rather than emitting a `WHERE x = NULL` clause that silently matches nothing
  - Negation-through-cycle (M-3): replace the single-edge negation check in `stratify.rs` with full SCC (strongly-connected component) computation using Tarjan's algorithm; reject any SCC that contains a negation back-edge with a structured error naming the cycle: `"datalog: unstratifiable negation cycle: rule A → ¬B → ¬C → A"`
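The M-1 rewrite reduces to one SQL idiom — any divisor `b` becomes `NULLIF(b, 0)`, so the row yields NULL instead of aborting the whole fixpoint run:

```sql
SELECT 10 / NULLIF(0, 0);  -- NULL, not "division by zero"
```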
- JSON-LD framing correctness fixes (medium fixes M-4, M-5)
  - Embedder panic on empty result (M-4): replace `roots.into_iter().next().unwrap()` in `src/framing/embedder.rs` with `.ok_or_else(|| PgError::new("json-ld framing: CONSTRUCT produced no results", …))` — returns an empty JSON-LD document `{"@context": …, "@graph": []}` rather than panicking
  - Per-node visited set (M-5): add a `HashSet<NodeId>` as the third parameter of the recursive `embed_node()` function; insert the current node ID before recursing and check membership before following an edge — prevents infinite thrash on near-cyclic embedded graphs; consistent with W3C JSON-LD Framing §4.1.3
Documentation
See plans/documentation.md for details.
- `reference/shacl-reference.md` updated — every newly supported constraint documented with syntax, semantics, and a worked example; previously-deferred constraints marked as now implemented
- `user-guide/shacl-guide.md` updated — add a section on property path shapes (`sh:path`) showing inverse and alternative path examples
- `reference/sparql-functions.md` updated — add a `pg_ripple.explain_sparql()` reference with all four `format` options, example output, and a note on required privileges
- `user-guide/datalog-guide.md` updated — document the new division-by-zero `NOTICE`, the unbound-variable compile error, and the unstratifiable-cycle error with remediation guidance
- Release notes for v0.23.0 — highlight the SHACL gap closures, the new `explain_sparql` function, and the three Datalog correctness fixes
Exit Criteria
- W3C SHACL Core test suite pass rate increases to ≥ 98%.
- `shacl_core_completion.sql` pg_regress passes for all new constraint types, including the `/* … */` block-comment case.
- `explain_sparql.sql` passes.
- `shacl_query_hints.sql` passes — `explain_sparql()` confirms no spurious DISTINCT or LEFT JOIN for constrained predicates.
- A Datalog rule with division, an unbound variable, and a negation cycle each raise the expected named error rather than a silent failure or a crash.
- `src/framing/embedder.rs` no longer contains `unwrap()` on the CONSTRUCT result.
- All migration scripts from 0.1.0 through 0.23.0 run cleanly via `just test-migration`.
v0.24.0 — Semi-naive Datalog & Performance Hardening
Theme: Replace the naive Datalog evaluation strategy with semi-naive evaluation for large-scale inference, complete the OWL RL rule set, batch-decode SPARQL result sets, and add safety bounds to property-path recursion.
In plain language: pg_ripple can derive new facts automatically from rules (Datalog). Until now, every iteration of the rule engine re-checked all previously derived facts — wasteful for large datasets where most facts don't change between iterations. This release switches to "semi-naive" evaluation: each iteration only looks at facts newly derived in the previous pass, which can be 10–100× faster on large ontologies. For the same reason, four missing OWL reasoning rules that affect subclass and property chains are added. Two performance improvements round out the release: returning large SPARQL result sets is sped up by decoding all term IDs in a single batch rather than one by one, and property-path queries (`p*`, `p+`) gain a configurable depth limit to prevent runaway recursion on highly connected graphs.

Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- Semi-naive Datalog evaluation (performance fix P-3; depends on M-3 from v0.23.0)
  - Rework `src/datalog/compiler.rs` to emit ΔR maintenance queries:
    - For each derived relation `R`, maintain a delta table `Δ_R` holding only the rows derived in the most recent iteration
    - The fixpoint loop re-evaluates each rule against `Δ_R` (the delta of its input relations) rather than the full `R`; newly derived rows are inserted into `Δ_R_new`; after each iteration `Δ_R ← Δ_R_new`, and the loop continues while `Δ_R` is non-empty
    - Compile to a series of CTEs: `WITH delta_R AS (…), delta_R_new AS (…) INSERT INTO R SELECT * FROM delta_R_new ON CONFLICT DO NOTHING` (one iteration is sketched below)
  - Preserve stratified evaluation order: each stratum is fully converged before the next stratum begins; semi-naive evaluation is applied within each stratum
  - Correct prerequisite: requires M-3 (stable stratification) from v0.23.0 — the test pipeline enforces this ordering
  - New pg_regress test `datalog_seminaive.sql` — run RDFS closure over a 10k-triple subgraph; verify the correct closure count; measure and assert that the iteration count is bounded by the longest derivation chain length (not the full relation size)
  - `just bench-datalog` benchmark gate: semi-naive must be ≥ 5× faster than naive on the RDFS subgraph benchmark; CI fails on regression below 3×
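One semi-naive iteration for a transitive rule such as `path(x, z) ← path(x, y), edge(y, z)`, sketched directly in SQL (table and column names are illustrative; the compiler emits the equivalent per stratum, and `path` is assumed to carry a unique constraint for the conflict check):

```sql
WITH new_rows AS (
  -- Join only last iteration's delta against the static relation.
  SELECT d.x, e.dst AS z
  FROM delta_path AS d
  JOIN edge       AS e ON e.src = d.y
),
inserted AS (
  -- Only genuinely new facts survive the conflict check.
  INSERT INTO path (x, z)
  SELECT x, z FROM new_rows
  ON CONFLICT DO NOTHING
  RETURNING x, z
)
-- These rows become the next iteration's delta; the loop stops
-- when this result is empty.
SELECT * FROM inserted;
```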
- OWL RL rule set completion (medium fix M-17)
  - `cax-sco` full transitive closure: the existing partial rule handles one level of `rdfs:subClassOf`; add the transitive step so that `A subClassOf B, B subClassOf C → A subClassOf C` is derived for arbitrary chain length via the semi-naive mechanism above
  - `cls-avf`: `owl:allValuesFrom` chaining — `x ∈ C, C ≡ (∀p . D), y = p(x) → y ∈ D`; compile to a join across the `owl:allValuesFrom` VP table and the subject's type VP table
  - `prp-ifp`: inverse-functional property inference — `p is InverseFunctionalProperty, p(x, z) and p(y, z) → x = y`; compile to a self-join on `vp_{p_id}` grouping by `o`, emitting `sameAs` triples for any `s` values that collide (sketched below)
  - `prp-spo1`: sub-property chaining — `q subPropertyOf p, q(x, y) → p(x, y)` for derived property chains; relies on the semi-naive delta loop to propagate transitively
  - Update `src/datalog/builtins.rs` with the four new rule templates; document which OWL RL rules are now implemented vs. out of scope; update `reference/datalog-reference.md`
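The `prp-ifp` compilation amounts to a self-join on the property's VP table (`vp_42` stands in for `vp_{p_id}`):

```sql
-- p is inverse-functional: subjects sharing an object are the same
-- individual, so each colliding pair yields an owl:sameAs triple.
SELECT a.s AS x, b.s AS y
FROM _pg_ripple.vp_42 AS a
JOIN _pg_ripple.vp_42 AS b
  ON a.o = b.o AND a.s < b.s;  -- a.s < b.s avoids self/duplicate pairs
```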
- Batch decode for SPARQL result sets (architectural fix A-2, performance fix P-2)
  - Wire `batch_decode_ids()` through the SPARQL execution path in `src/sparql/sqlgen.rs`: after SPI returns a result set, collect all distinct `i64` IDs across all columns in a single pass, call `batch_decode_ids(&ids)` to resolve them in one SPI round-trip, then substitute them into the result rows
  - The existing `batch_decode` infrastructure is already implemented for the bulk-load path; the change routes the SPARQL result-building loop through the same function
  - Benchmark gate: `just bench-sparql-decode` asserts ≤ 2 SPI round-trips for a SELECT returning 1000 distinct terms; previously this took O(N) calls
- Property-path depth GUC (performance fix P-4)
  - New GUC `pg_ripple.property_path_max_depth` (type: `INT`, default: `64`, min: `1`, max: `100000`)
  - Append `WHERE _depth < $pg_ripple.property_path_max_depth` to every `WITH RECURSIVE … CYCLE` property-path CTE generated by `src/sparql/property_path.rs`
  - When the depth limit is hit, emit a `WARNING`-level message — `"property path depth limit reached (max: N); some paths may be truncated"` — not an error, because the SPARQL spec does not define a depth limit
  - New pg_regress test `property_path_depth.sql` — verify that a 100-hop chain is fully traversed with the default limit, and that reducing the GUC to 10 truncates at 10 hops with the expected WARNING (usage sketched below)
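Session-level usage, reusing the `sparql()` entry point:

```sql
-- Cap recursion for this session; hitting the cap is a WARNING,
-- not an ERROR, and paths beyond 10 hops are simply truncated.
SET pg_ripple.property_path_max_depth = 10;

SELECT * FROM pg_ripple.sparql('
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friend WHERE { <http://example.org/alice> foaf:knows+ ?friend }
');
```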
- BRIN index migration to the SID column (medium fix M-16)
  - Migration script `sql/pg_ripple--0.23.0--0.24.0.sql`: for each existing VP main table, `DROP INDEX vp_{id}_main_s_brin; CREATE INDEX vp_{id}_main_i_brin ON _pg_ripple.vp_{id}_main USING brin (i)` — the `i` (SID) column is monotonically increasing with insertion order, giving BRIN strong correlation; the `s` (subject) column has near-random distribution, so BRIN provides negligible benefit there
  - Merge worker: generate the new BRIN on `i` at merge time for freshly built `main` partitions; remove the BRIN-on-`s` creation step from `create_vp_table()`
  - B-tree indices on `(s, o)` and `(o, s)` are unchanged
- Export streaming (low fix L-6)
  - Rework the `src/export.rs` Turtle/N-Triples/JSON-LD export helpers to iterate over VP tables in SID-order cursor batches (batch size: `pg_ripple.export_batch_size` GUC, default: `10000`) rather than materialising the full graph in memory
  - `DECLARE … CURSOR FOR SELECT … ORDER BY i` plus a `FETCH $batch_size FROM cursor` loop — each batch is serialised and flushed to `COPY` output immediately, so peak memory is bounded by `batch_size × average_triple_size` (sketched below)
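The cursor shape of the export loop as plain SQL (the Rust helper drives the same pattern through SPI; `vp_42` and the batch size are illustrative):

```sql
BEGIN;
DECLARE export_cur CURSOR FOR
  SELECT s, o, g FROM _pg_ripple.vp_42 ORDER BY i;  -- SID order

-- Repeated until FETCH returns no rows; each batch is serialised
-- and flushed immediately, so peak memory stays at one batch.
FETCH 10000 FROM export_cur;

CLOSE export_cur;
COMMIT;
```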
- View anti-join rewrite for the HTAP query path (performance fix P-6)
  - Replace the `EXCEPT` (sort-based set difference) in the `(main EXCEPT tombstones) UNION ALL delta` VP view with a `LEFT JOIN … WHERE t.s IS NULL` anti-join: `SELECT m.* FROM _pg_ripple.vp_{id}_main m LEFT JOIN _pg_ripple.vp_{id}_tombstones t ON m.s = t.s AND m.o = t.o AND m.g = t.g WHERE t.s IS NULL` (the full view form is sketched below)
  - The anti-join lets the PG planner choose a hash anti-join, avoiding a materialising sort over `main`; at 10M-row `main` tables this reduces per-query tombstone-filtering overhead from O(N log N) to O(N)
  - Update all VP view definitions and the merge worker's view-rebuild template to use the anti-join form; no user-visible behaviour change
  - Benchmark gate: `just bench-htap-read` asserts that a SELECT over a 1M-row `main` with 100 tombstones completes in ≤ 2× the time of the same query with zero tombstones
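The resulting view definition in full (`vp_42` illustrative; the delta arm is unchanged):

```sql
CREATE OR REPLACE VIEW _pg_ripple.vp_42 AS
-- Anti-join replaces EXCEPT: the planner can hash it instead of
-- sorting the whole main partition.
SELECT m.s, m.o, m.g
FROM _pg_ripple.vp_42_main AS m
LEFT JOIN _pg_ripple.vp_42_tombstones AS t
       ON m.s = t.s AND m.o = t.o AND m.g = t.g
WHERE t.s IS NULL
UNION ALL
SELECT d.s, d.o, d.g
FROM _pg_ripple.vp_42_delta AS d;
```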
- BGP selectivity model improvements (architectural improvement A-6)
  - Extend BGP reordering in `src/sparql/optimizer.rs` to factor in variable binding as a selectivity multiplier: bound subject → `0.01 × triple_count`, bound object → `0.05 × triple_count`, unbound → `triple_count` — reduces the likelihood that a poorly ordered BGP generates a pathological SQL join order before PG's planner has a chance to reorder it
  - Document the heuristic in `reference/internals/optimizer.md` (new page) alongside the `explain_sparql()` function from v0.23.0
- Schema-aware statistics worker
  - Extend the background merge worker to run `ANALYZE _pg_ripple.vp_{id}_main` after each successful merge — ensures the PG planner has fresh statistics on the main partition for join planning
  - For VP tables whose objects are consistently typed (all `xsd:integer`, `xsd:decimal`, or `xsd:dateTime`, as detected by the dictionary `kind` column), create an extended statistics object (`CREATE STATISTICS … (dependencies, ndistinct)`) so the planner can exploit correlation for range predicates (sketched below)
  - New GUC `pg_ripple.auto_analyze` (`BOOL`, default `on`) — lets operators disable the post-merge ANALYZE if they manage statistics manually
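The post-merge housekeeping, per VP main table (names illustrative):

```sql
ANALYZE _pg_ripple.vp_42_main;

-- For consistently typed objects, extended statistics let the
-- planner see column correlation for range predicates.
CREATE STATISTICS IF NOT EXISTS vp_42_main_ext (dependencies, ndistinct)
  ON s, o FROM _pg_ripple.vp_42_main;
```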
- SPARQL-star Update: quoted triples in CONSTRUCT and UPDATE templates
  - Extend the CONSTRUCT template compiler in `src/sparql/sqlgen.rs` to handle `<< ?s ?p ?o >>` quoted-triple patterns in CONSTRUCT WHERE and CONSTRUCT template clauses — stored using the existing `KIND_QUOTED_TRIPLE` dictionary kind from v0.4.0
  - Extend the INSERT DATA / DELETE DATA / INSERT WHERE / DELETE WHERE parsers to accept quoted-triple syntax in graph patterns and template positions
  - New pg_regress test `sparql_star_update.sql`: `INSERT DATA { << <Alice> <knows> <Bob> >> <assertedBy> <Carol> }; SELECT … WHERE { << ?s ?p ?o >> <assertedBy> ?a }` — verify the quoted triple round-trips correctly through insert and query
Documentation
See plans/documentation.md for details.
- `reference/datalog-reference.md` updated — add a semi-naive evaluation section explaining the ΔR mechanics, iteration bounds, and performance expectations; update the OWL RL coverage table to mark `cax-sco` full, `cls-avf`, `prp-ifp`, and `prp-spo1` as implemented
- `reference/configuration.md` updated — document the `pg_ripple.property_path_max_depth` and `pg_ripple.export_batch_size` GUCs with allowed ranges and tuning guidance
- `user-guide/performance.md` updated — add a "large result set decoding" section explaining the batch-decode change and the expected latency improvement
- Release notes for v0.24.0 — highlight semi-naive evaluation with performance numbers from the benchmark; list the completed OWL RL rules; note the BRIN migration and streaming export
Exit Criteria
- `datalog_seminaive.sql` passes with the correct closure count and an iteration count ≤ the longest derivation chain.
- The semi-naive benchmark is ≥ 5× faster than naive on the RDFS subgraph.
- All four new OWL RL rules derive correct inferences in the corresponding pg_regress tests.
- SPARQL result-set decoding issues ≤ 2 SPI round-trips for 1000-term results (verified by the bench gate).
- A property path with the default depth limit correctly traverses a 100-hop chain; depth-10 truncation emits the expected WARNING.
- `sparql_star_update.sql` passes.
- The HTAP anti-join benchmark completes within 2× the no-tombstone baseline.
- Migration scripts from 0.1.0 through 0.24.0 run cleanly via `just test-migration`.
v0.25.0 — GeoSPARQL & Architectural Polish
Theme: Add a GeoSPARQL 1.1 geometry subset using PostGIS, stabilise the internal catalog against OID drift, and close the remaining medium- and low-priority issues from the v0.20.0 gap analysis.
In plain language: PostgreSQL already understands geography — distances, containment, intersection — through the PostGIS extension. This release connects pg_ripple's RDF triple store to PostGIS so that SPARQL queries can filter and compute over geographic data: "which cities are within 50 km of Berlin?", "which roads cross this polygon?". This covers the most common GeoSPARQL functions used in open data publishing (Wikidata, LinkedGeoData, government datasets). The release also includes a set of smaller housekeeping improvements: the internal predicate catalog now stores table names instead of fragile OIDs, federation endpoint registration rejects SSRF-prone URL schemes, bulk loads gain a strict mode that rolls back on any malformed triple, and the remaining low-priority issues from the v0.20.0 assessment are closed.
Effort estimate: 6–8 person-weeks
Completed items
Deliverables
- GeoSPARQL 1.1 geometry subset (feature F-5 from the gap analysis)
  - Prerequisite: PostGIS installed (gated with a runtime `SELECT proname FROM pg_proc WHERE proname = 'st_geomfromtext'` availability check; all geo functions return `NULL` with a `WARNING` if PostGIS is absent — no `ERROR`)
  - WKT literal support: recognize `geo:wktLiteral` datatype IRIs in the dictionary encoder; store as a regular literal; decode to a `TEXT` representation compatible with `ST_GeomFromText()`
  - Topological relation functions (compiled to PostGIS equivalents):
    - `geo:sfIntersects(a, b)` → `ST_Intersects(ST_GeomFromText(a), ST_GeomFromText(b))`
    - `geo:sfContains(a, b)` → `ST_Contains(ST_GeomFromText(a), ST_GeomFromText(b))`
    - `geo:sfWithin(a, b)` → `ST_Within(ST_GeomFromText(a), ST_GeomFromText(b))`
    - `geo:sfTouches(a, b)`, `geo:sfCrosses(a, b)`, `geo:sfOverlaps(a, b)` — same pattern
  - Distance and measurement functions:
    - `geof:distance(a, b, unit)` → `ST_Distance(ST_GeomFromText(a)::geography, ST_GeomFromText(b)::geography)` with unit conversion (supports `uom:metre`, `uom:kilometre`, `uom:mile`); result encoded as `xsd:double`
    - `geof:area(a, unit)` → `ST_Area(…::geography)` with the same unit conversion
    - `geof:boundary(a)` → `ST_Boundary(ST_GeomFromText(a))` serialised back to a WKT literal
  - SPARQL FILTER integration: wire all geo functions into `translate_expr()` in `src/sparql/expr.rs`; topological predicates emit a SQL boolean; distance/area/boundary emit decoded numeric/WKT values (an example query is sketched below)
  - New pg_regress test `geosparql.sql` — skipped automatically when PostGIS is absent (`DO $$ BEGIN IF NOT EXISTS (SELECT 1 FROM pg_proc WHERE proname = 'st_geomfromtext') THEN RAISE EXCEPTION …; END IF; END $$`); when PostGIS is present, verifies intersection, distance, and contains queries against a small geography dataset
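What a geo-filtered query looks like from SQL, assuming locations are stored as `geo:wktLiteral` values (the prefixes follow GeoSPARQL 1.1; the data is illustrative):

```sql
-- Cities within 50 km of Berlin; geof:distance compiles to
-- ST_Distance over geography-cast WKT values.
SELECT * FROM pg_ripple.sparql('
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX ex:   <http://example.org/>
SELECT ?city WHERE {
  ?city ex:location ?loc .
  ex:berlin ex:location ?berlinLoc .
  FILTER(geof:distance(?loc, ?berlinLoc, uom:metre) < 50000)
}
');
```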
- Federation cache and partial-result correctness (high fixes H-12, H-13)
  - H-12 (cache key upgrade): replace the XXH3-64 result cache key in `src/sparql/federation.rs` with the full XXH3-128 hash — the 64-bit birthday bound (~2.1 billion distinct cached queries before a 50% collision probability) is thin for a long-running server; the full 128-bit hash makes collision negligible even at very high query volumes
  - H-13 (partial-result parser): add a size gate to the federation partial-result recovery path — if the truncated response exceeds `pg_ripple.federation_partial_recovery_max_bytes` (`INT` GUC, default: `65536`), skip partial recovery and return zero rows with a `WARNING: federation partial response too large for recovery (N bytes)`; this prevents the `rfind("},")` heuristic from truncating a valid row whose literal value contains `"}"` followed by a comma in large responses
  - New pg_regress test `federation_cache.sql` — verify that two federation calls with identical query text to different endpoints are cached independently; verify that a simulated oversized partial response exceeding the byte gate produces zero rows with the expected WARNING
- Catalog OID stability (architectural fix A-5)
  - Add `schema_name NAME, table_name NAME` columns to `_pg_ripple.predicates` in the migration script
  - Populate on insert: `schema_name = '_pg_ripple'`, `table_name = 'vp_{id}_delta'` (the mutable partition; the view name is derivable)
  - All dynamic SQL in the merge worker, query path, and admin functions now references `quote_ident(schema_name) || '.' || quote_ident(table_name)` rather than looking up OIDs — OID drift after a `pg_dump`/`pg_restore` cycle no longer silently redirects queries to the wrong relation
  - Migration script `sql/pg_ripple--0.24.0--0.25.0.sql`: `ALTER TABLE _pg_ripple.predicates ADD COLUMN schema_name NAME DEFAULT '_pg_ripple', ADD COLUMN table_name NAME; UPDATE _pg_ripple.predicates SET table_name = 'vp_' || id || '_delta';`
- Federation SSRF scheme validation (security fix S-4)
  - `pg_ripple.register_endpoint(url TEXT)`: reject any URL whose scheme is not `http` or `https` at registration time with `ERRCODE_INVALID_PARAMETER_VALUE: "federation endpoint must use http or https scheme; got: <scheme>"` — belt-and-braces defence even though `ureq` would refuse non-HTTP at connection time
- Bulk load strict mode (medium fix M-8)
  - Add a `strict BOOLEAN DEFAULT false` parameter to `pg_ripple.load_turtle(data TEXT, strict BOOLEAN DEFAULT false)` and all other bulk-load entry points
  - When `strict = true`: any parse error or malformed triple aborts the entire `COPY`-equivalent batch with a structured error naming the line number and the offending triple; the transaction is rolled back to the savepoint established at the start of the load
  - When `strict = false` (the current behaviour): malformed triples emit a `WARNING` and are skipped; partial loads are committed as before (both modes are sketched below)
  - New pg_regress test `bulk_load_strict.sql` — verify that a load with one malformed triple in strict mode rolls back all preceding triples; verify that the same load in lenient mode commits the well-formed triples
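Both modes from the caller's side (the Turtle payload is elided):

```sql
-- Lenient (default): malformed triples are skipped with WARNINGs,
-- and the well-formed triples are committed.
SELECT pg_ripple.load_turtle('…');

-- Strict: one malformed triple rolls back the whole load with an
-- error naming the line and the offending triple.
SELECT pg_ripple.load_turtle('…', strict => true);
```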
- Blank-node document scoping fix (medium fix M-9)
  - Replace the `SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos()` blank-node prefix in `src/bulk_load.rs` with `nextval('_pg_ripple.statement_id_seq')` — globally unique per load call, collision-free under any level of concurrency
- Merge worker cache isolation (architectural fix A-3)
  - Register a transaction-boundary callback in the background merge worker (analogous to the xact-end callback added in v0.22.0 for the encode cache) that clears the worker-local encode/decode LRU cache at the end of every merge transaction — prevents the worker from using stale IDs if a future migration rewrites dictionary rows
- pg_trickle version-lock probe (architectural fix A-4)
  - In `_PG_init`, if `pg_trickle` is available, execute `SELECT extversion FROM pg_extension WHERE extname = 'pg_trickle'` and compare against the compile-time `PG_TRICKLE_TESTED_VERSION` constant; emit a `WARNING` if the installed version is newer than the tested one: `"pg_ripple: pg_trickle version N.N.N is newer than tested version N.N.N; incremental views may behave unexpectedly"`
- Remaining low-priority fixes
  - CDC payload documentation (L-2): add a `decode BOOLEAN DEFAULT false` parameter to `pg_ripple.cdc_changes()` that, when true, decodes dictionary IDs to N-Triples strings in the payload; document in `user-guide/cdc.md`
  - Dependency alignment (L-3/L-4): upgrade `ureq` from v2 to v3 in `pg_ripple_http/Cargo.toml`; update `AGENTS.md` to list `oxrdf` as the canonical RDF-star parser; add `oxrdf = "0.3"` as a direct dependency in `Cargo.toml`
  - GUC description strings (L-5): update every `GucBuilder::new().set_description()` call in `src/lib.rs` to include the default value and valid range, e.g. `"Maximum property path recursion depth. Default: 64. Range: 1–100000."` — improves `SHOW ALL` and pg_admin discoverability
  - Inline decoder defensive assert (L-7): add `debug_assert!(is_inline(id), "decode_inline called with non-inline id {id}")` at the top of `decode_inline()` in `src/dictionary/inline.rs`
  - Export literal round-trip (M-10): add a pg_regress test `export_roundtrip.sql` that inserts triples with `\uXXXX` Unicode escapes, non-ASCII literals, and control characters, then round-trips them through Turtle export and import; verify the decoded values match the originals
  - W3C conformance test classification (M-19): replace the remaining `label_no_error`-style assertions in the conformance test file with a formal `expected_skip` CTE skip-list; document each skip with a reason code (`UNIMPLEMENTED`, `KNOWN_LIMITATION`, or `SPEC_AMBIGUITY`); ensure the skip list shrinks to zero by v1.0.0
  - File-path bulk loader validation (S-8): all `load_*_file()` functions (`load_turtle_file`, `load_ntriples_file`, etc.) require superuser status but do not validate symlink following or path traversal beyond that gate; add a `realpath()` call in `src/bulk_load.rs` to resolve symlinks and verify the target is within `pg_read_server_files`-accessible directories (matching PostgreSQL's `COPY FROM` file-access model); emit `ERRCODE_INSUFFICIENT_PRIVILEGE` if access is denied, preventing a superuser from accidentally loading files outside the protected path set
- Supplementary feature additions
  - `pg_ripple.canary()` health function: runs a battery of internal self-checks and returns a JSON object `{"merge_worker": "ok"|"stalled", "cache_hit_rate": 0.0–1.0, "catalog_consistent": true|false, "orphaned_rare_rows": N}` — suitable for ops dashboards, alerting pipelines, and CI smoke tests; `catalog_consistent` checks that the VP table count in `pg_tables` matches the predicate catalog and that no `vp_rare` rows exist for promoted predicates (usage sketched below)
  - OWL ontology import: `pg_ripple.load_owl_ontology(path TEXT)` — format detected by file extension (`.ttl`/`.nt`/`.xml`/`.rdf`/`.owl`); loads into the default graph; returns the triple count
  - RDF Patch import: `pg_ripple.apply_patch(data TEXT)` — processes RDF Patch `A`/`D` operations; returns the net triple delta
  - Custom aggregate registry: `pg_ripple.register_aggregate(sparql_iri TEXT, pg_function TEXT)` persists to `_pg_ripple.custom_aggregates`
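An ops-dashboard query over the health function (field names per the JSON shape above):

```sql
SELECT c ->> 'merge_worker'                  AS merge_worker,
       (c ->> 'cache_hit_rate')::float8      AS cache_hit_rate,
       (c ->> 'catalog_consistent')::boolean AS catalog_consistent,
       (c ->> 'orphaned_rare_rows')::bigint  AS orphaned_rare_rows
FROM pg_ripple.canary() AS c;
```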
Documentation
See plans/documentation.md for details.
- `reference/geosparql.md` (new page) — GeoSPARQL 1.1 support matrix, all implemented functions with signatures and PostGIS equivalents, PostGIS version requirements, worked examples with WKT literals
- `user-guide/geospatial.md` (new page) — how to store and query geographic data in pg_ripple, linking GeoSPARQL to PostGIS, example queries for distance filtering and containment
- `reference/security.md` updated — document federation scheme validation and the remediation rationale
- `user-guide/bulk-load.md` updated — document the `strict` parameter, when to use it, and how to diagnose partial-load failures
- `reference/configuration.md` updated — document the `pg_trickle` version-lock warning and the new CDC `decode` parameter
- Release notes for v0.25.0 — highlight the GeoSPARQL capability, the catalog OID stability improvement, strict bulk load, and a summary of all closed low-priority issues
Exit Criteria
- `geosparql.sql` pg_regress passes when PostGIS is present and skips cleanly when PostGIS is absent.
- `bulk_load_strict.sql` passes for both strict and lenient modes.
- The blank-node prefix uses `nextval(…)` — no wall-clock-based prefix remains in `src/bulk_load.rs`.
- `SELECT pg_ripple.register_endpoint('file:///etc/passwd')` raises `ERRCODE_INVALID_PARAMETER_VALUE`.
- `_pg_ripple.predicates` has `schema_name` and `table_name` columns populated.
- `federation_cache.sql` passes — distinct endpoints are cached independently, and oversized partial responses produce zero rows with a WARNING.
- `pg_ripple.canary()` returns `{"catalog_consistent": true, "orphaned_rare_rows": 0}` on a healthy database.
- `SELECT pg_ripple.load_turtle_file('/etc/passwd')` from a superuser session raises `ERRCODE_INSUFFICIENT_PRIVILEGE` (not silently succeeding), because `/etc/passwd` is outside the allowed `pg_read_server_files` directories.
- Migration scripts from 0.1.0 through 0.25.0 run cleanly via `just test-migration`.
v0.26.0 — GraphRAG Integration
Theme: First-class support for using pg_ripple as the persistent knowledge graph backend for Microsoft GraphRAG.
In plain language: Microsoft GraphRAG is an open-source system (32k+ GitHub stars) that uses large language models to extract a knowledge graph from documents, detects thematic clusters, and answers complex questions far better than standard vector-search RAG. By default it stores its graph as flat Parquet files on disk — static, unqueryable, and requiring a full re-index every time new documents arrive. This release makes pg_ripple a drop-in backend for GraphRAG: entities and relationships extracted by the LLM are stored as RDF triples with full SPARQL queryability, Datalog reasoning derives implicit relationships the LLM missed, SHACL shapes reject malformed extractions before they corrupt the graph, and a Python CLI bridge exports the enriched graph back to Parquet for GraphRAG's community-detection step. The result is a richer, higher-quality knowledge graph that improves GraphRAG's Local, Global, and DRIFT search accuracy — all running inside the PostgreSQL instance you already have.
Effort estimate: 4–6 person-weeks
Completed items
Background
See plans/graphrag.md for the full synergy analysis, architecture proposals, and integration rationale. Key findings:
- GraphRAG stores its knowledge model as Parquet files (entities, relationships, communities, community reports, text units). Every new document requires a full re-index.
- pg_ripple replaces static Parquet with a live, ACID-consistent, SPARQL-queryable triple store. New entities can be inserted incrementally via the HTAP delta partition without disrupting concurrent queries.
- Datalog + OWL-RL inference materialises relationships that LLM extraction misses (transitive hierarchies, co-membership, symmetric properties), directly improving community structure quality.
- SHACL validation rejects malformed LLM extractions (missing titles, invalid types, dangling relationship endpoints) before they propagate into community reports.
- GraphRAG's BYOG (Bring Your Own Graph) feature accepts pre-built entity/relationship tables as Parquet — pg_ripple's export functions feed directly into this pathway.
Deliverables
- GraphRAG RDF ontology (`sql/graphrag_ontology.ttl`)
  - Defines the RDF vocabulary for GraphRAG's knowledge model: `gr:Entity`, `gr:Relationship`, `gr:TextUnit`, `gr:Community`, `gr:CommunityReport`
  - Full property set mirroring GraphRAG's output table schemas: `gr:title`, `gr:type`, `gr:description`, `gr:frequency`, `gr:degree`, `gr:source`, `gr:target`, `gr:weight`, `gr:level`, `gr:rank`, `gr:summary`, `gr:fullContent`, `gr:hasMember`, `gr:parent`
  - Provenance properties for RDF-star metadata: `gr:confidence`, `gr:sourceTextUnit`, `gr:extractedBy`, `gr:extractedAt`
  - Namespace prefix `gr:` pre-registered via `pg_ripple.register_prefix()`
  - Loaded automatically by the example script; also loadable standalone via `pg_ripple.load_turtle_file()`
- BYOG Parquet export functions (`src/export.rs` additions)
  - `pg_ripple.export_graphrag_entities(graph_iri TEXT, output_path TEXT) RETURNS BIGINT`
    - Executes a SPARQL SELECT to extract all `gr:Entity` triples from the named graph
    - Writes `entities.parquet` with columns `id`, `title`, `type`, `description`, `text_unit_ids`, `frequency`, `degree` — exactly matching GraphRAG's output schema
    - Returns the row count
  - `pg_ripple.export_graphrag_relationships(graph_iri TEXT, output_path TEXT) RETURNS BIGINT`
    - Extracts all `gr:Relationship` triples
    - Writes `relationships.parquet` with columns `id`, `source`, `target`, `description`, `weight`, `combined_degree`, `text_unit_ids`
    - `combined_degree` is computed as `source.degree + target.degree` via a SPARQL join
    - Returns the row count
  - `pg_ripple.export_graphrag_text_units(graph_iri TEXT, output_path TEXT) RETURNS BIGINT`
    - Extracts all `gr:TextUnit` triples
    - Writes `text_units.parquet` with columns `id`, `text`, `n_tokens`, `document_id`, `entity_ids`, `relationship_ids`
    - Returns the row count
  - Implementation: use Rust's `parquet` + `arrow` crates; require superuser (same as the `load_*_file` functions); validate the output path via `realpath()` against writable directories (usage sketched below)
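Driving a BYOG export from SQL (graph IRI and paths illustrative):

```sql
SELECT pg_ripple.export_graphrag_entities(
  'http://example.org/graphs/docs',
  '/tmp/graphrag/entities.parquet');

SELECT pg_ripple.export_graphrag_relationships(
  'http://example.org/graphs/docs',
  '/tmp/graphrag/relationships.parquet');

SELECT pg_ripple.export_graphrag_text_units(
  'http://example.org/graphs/docs',
  '/tmp/graphrag/text_units.parquet');
```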
- SHACL shapes for GraphRAG quality enforcement (`sql/graphrag_shapes.ttl`)
  - `gr:EntityShape`: `gr:title` required (1..1, string, maxLength 1000); `gr:type` required, constrained to `sh:in ("person" "organization" "geo" "event" "concept")`; `gr:description` required (1..1)
  - `gr:RelationshipShape`: `gr:source` required (1..1, `sh:class gr:Entity`); `gr:target` required (1..1, `sh:class gr:Entity`); `gr:weight` required (1..1, float, `sh:minInclusive 0.0`, `sh:maxInclusive 1.0`)
  - `gr:TextUnitShape`: `gr:text` required (1..1, string); `gr:tokenCount` required (1..1, non-negative integer)
  - Loaded via `pg_ripple.load_turtle_file()` and activated with `pg_ripple.validate()` or `pg_ripple.shacl_mode = 'sync'`
- Datalog enrichment rules (`sql/graphrag_enrichment_rules.pl`)
  - `gr:coworker(?a, ?b)` — both entities appear as source in relationships targeting the same organization entity
  - `gr:collaborates(?a, ?b)` — both entities appear in the same text unit (share a `gr:TextUnit` via `gr:mentionsEntity`)
  - `gr:indirectReport(?leader, ?sub2)` — transitive: `?leader gr:manages ?mid`, `?mid gr:manages ?sub2`
  - `gr:relatedOrg(?a, ?b)` — two organizations share at least two entity-level relationships (a co-occurrence threshold)
  - All rules loaded via `pg_ripple.load_rules()` under the rule set name `'graphrag_enrichment'` (see the sketch below)
  - OWL-RL built-in rules (`pg_ripple.load_rules_builtin('owl-rl')`) applied first for RDFS subclass/subproperty transitivity
  - Documentation: each rule is annotated with its GraphRAG use case (e.g. how `gr:coworker` enriches the Local Search neighborhood)
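The enrichment pass from SQL; the `load_rules()` argument shape shown here is illustrative (only the rule-set name is confirmed above), and the rule text itself is elided:

```sql
-- Base OWL-RL rules first, then the GraphRAG rule set.
SELECT pg_ripple.load_rules_builtin('owl-rl');
SELECT pg_ripple.infer('owl-rl');

SELECT pg_ripple.load_rules('graphrag_enrichment', '…rule text…');
SELECT pg_ripple.infer('graphrag_enrichment');  -- derives gr:coworker etc.
```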
- Python CLI bridge (`scripts/graphrag_export.py`)
  - CLI tool wrapping the export functions for users who cannot call `pg_ripple.export_graphrag_*()` directly from SQL (e.g. managed PostgreSQL services where `COPY TO` is restricted)
  - `--pg-url`: PostgreSQL connection string
  - `--graph-iri`: named graph IRI to export
  - `--output-dir`: directory for Parquet files (default: `./graphrag_output`)
  - `--enrich-with-datalog`: run `pg_ripple.infer('owl-rl')` + `pg_ripple.infer('graphrag_enrichment')` before export
  - `--validate`: run `pg_ripple.validate()` and print violations before exporting; exit with a non-zero code if any violations are found
  - `--format`: `parquet` (default) or `csv` (for debugging)
  - Dependencies: `psycopg` (v3), `pyarrow`; no GraphRAG dependency required at export time
  - Prints row counts and output paths on success
  - Unit tests via `pytest` in `scripts/test_graphrag_export.py`
- Example walkthrough (`examples/graphrag_byog.sql`)
  - End-to-end example: create a named graph → load sample entities/relationships as Turtle → run Datalog enrichment → validate with SHACL → query the enriched graph via SPARQL → export to Parquet
  - Demonstrates all four integration points: ontology, validation, reasoning, and export
  - Includes a commented BYOG `settings.yaml` snippet showing the `graphrag index` command that consumes the exported Parquet files
  - Executable as a pg_regress test: `cargo pgrx regress pg18` includes `graphrag_byog.sql`
- pg_regress tests
  - `graphrag_ontology.sql` — load the ontology, verify all prefix registrations and class/property triples are present
  - `graphrag_crud.sql` — insert sample entities and relationships as Turtle, query them back via SPARQL, verify field values
  - `graphrag_enrichment.sql` — load the enrichment rules, run `infer('graphrag_enrichment')`, verify `gr:coworker` and `gr:collaborates` triples are derived
  - `graphrag_shacl.sql` — attempt to load a malformed entity (missing `gr:type`) with `shacl_mode = 'sync'`, verify the INSERT is rejected with a SHACL violation report
  - `graphrag_export.sql` — export entities/relationships to `/tmp/graphrag_test_*.parquet`, verify the row count matches the number of inserted entities/relationships
Migration Script
`sql/pg_ripple--0.25.0--0.26.0.sql` — no schema changes required; all new functionality is delivered via Rust function additions and SQL files loaded by the user. The migration script contains a header comment listing the new SQL functions and their signatures.
Documentation
See plans/documentation.md for details.
- `user-guide/graphrag.md` (new page) — step-by-step guide: install pg_ripple, load GraphRAG entities as RDF, run enrichment and validation, export to Parquet, run the GraphRAG BYOG workflow; includes an architecture diagram showing the data flow between GraphRAG and pg_ripple
- `reference/graphrag-ontology.md` (new page) — full reference for the `gr:` vocabulary: all classes, properties, and SHACL shapes with descriptions and example triples
- `reference/graphrag-functions.md` (new page) — API reference for `export_graphrag_entities`, `export_graphrag_relationships`, and `export_graphrag_text_units`
- `user-guide/graphrag-enrichment.md` (new page) — explains Datalog enrichment for GraphRAG: which rules are built in, how to write custom rules, and how enriched triples improve community detection quality
- `plans/graphrag.md` updated — mark Phase 1 (BYOG export) and Phase 2 (Datalog enrichment) as implemented; update Phase 3 status to in progress
- Release notes for v0.26.0 — highlight GraphRAG integration as the headline feature, link to the BYOG walkthrough, explain the Datalog enrichment value proposition
Exit Criteria
- `graphrag_ontology.sql`, `graphrag_crud.sql`, `graphrag_enrichment.sql`, `graphrag_shacl.sql`, and `graphrag_export.sql` all pass in `cargo pgrx regress pg18`.
- `pg_ripple.export_graphrag_entities()` writes a valid Parquet file readable by `pyarrow.parquet.read_table()`.
- Loading a malformed entity (missing `gr:type`) with `shacl_mode = 'sync'` raises a validation error.
- Running `pg_ripple.infer('graphrag_enrichment')` on a graph with two entities both linked to the same organization produces at least one `gr:coworker` triple.
- `scripts/graphrag_export.py --validate` exits non-zero when SHACL violations are present.
- Migration scripts from 0.1.0 through 0.26.0 run cleanly via `just test-migration`.
v0.27.0 — Vector + SPARQL Hybrid: Foundation
Theme: Core pgvector integration — embedding storage, similarity functions, and SPARQL extension.
In plain language: This release adds AI-powered semantic search to pg_ripple. Every entity in your knowledge graph can now have a vector embedding — a compact numerical fingerprint that captures its meaning. You can then search for entities that are semantically similar to a phrase ("find drugs similar to anti-inflammatory agents"), and combine that similarity search with precise SPARQL queries ("but only drugs approved by the FDA that don't interact with methotrexate"). This is called hybrid search, and it is the dominant retrieval pattern for modern AI applications. pg_ripple's distinctive advantage is that both the graph query and the similarity search run inside the same PostgreSQL process — no network hop or serialisation overhead, full ACID transactions, and a single query planner optimising both together. No other triplestore offers this combination.
Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/vector_sparql_hybrid.md for the full analysis, pgvector deep-dive, competitive landscape, and integration architecture. Key findings:
- pgvector (14k+ GitHub stars, MIT license, ships with every major managed PostgreSQL provider) is the standard PostgreSQL vector extension. Because pg_ripple and pgvector share the same PostgreSQL backend, JOINs between VP tables and vector tables execute in-process with zero serialisation overhead.
- No existing triplestore or vector database combines full SPARQL 1.1, SHACL validation, Datalog reasoning, and in-process vector similarity in a single system.
- The `_pg_ripple.embeddings` table uses dictionary-encoded `entity_id` foreign keys, enabling zero-copy joins with all VP tables.
- The integration is optional at runtime: pg_ripple degrades gracefully (returning empty results with a WARNING) if pgvector is not installed.
Deliverables
- `_pg_ripple.embeddings` table (`sql/pg_ripple--0.26.0--0.27.0.sql`; optional at runtime — pgvector must be installed)
  - Schema: `entity_id BIGINT NOT NULL REFERENCES _pg_ripple.dictionary(id), model TEXT NOT NULL DEFAULT 'default', embedding vector(1536), updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), PRIMARY KEY (entity_id, model)` (written out below)
  - HNSW index (the default) on `(embedding vector_cosine_ops)` with configurable `m` (default 16) and `ef_construction` (default 64) parameters — the best recall/speed trade-off for most workloads
  - IVFFlat index alternative (opt-in via GUC `pg_ripple.embedding_index_type = 'ivfflat'`) — faster build times, preferable for high-write workloads where the HNSW build cost is prohibitive; `lists` is auto-set to `sqrt(row_count)`
  - `halfvec` support: the `embedding` column accepts both `vector(N)` and `halfvec(N)` via GUC `pg_ripple.embedding_precision = 'half'`; `halfvec` halves storage (2 bytes per dimension instead of 4) at a marginal recall cost — recommended for > 5M-entity graphs or `embedding_dimensions >= 3072`
  - Binary quantization support: opt-in via GUC `pg_ripple.embedding_precision = 'binary'`; stores embeddings as pgvector `bit(N)` using Hamming distance, reducing storage by ~96% (1 bit per dimension) at the cost of recall — suitable for extremely large graphs (> 50M entities) where approximate results are acceptable; requires pgvector ≥ 0.7.0
  - Fallback: if pgvector is absent, the table is created with a `BYTEA` stub column and all similarity functions return empty results with a WARNING
  - The migration script creates the table only if pgvector is detected via `SELECT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'vector')`
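The table and default index as the migration would create them when pgvector is present (spelled out from the schema above):

```sql
CREATE TABLE _pg_ripple.embeddings (
  entity_id  BIGINT NOT NULL REFERENCES _pg_ripple.dictionary(id),
  model      TEXT NOT NULL DEFAULT 'default',
  embedding  vector(1536),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (entity_id, model)
);

-- Default index type; IVFFlat is the opt-in alternative.
CREATE INDEX embeddings_hnsw_idx
  ON _pg_ripple.embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```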
- GUC parameters (registered in `_PG_init` in `src/lib.rs`)
  - `pg_ripple.embedding_model` (string, default `''`) — embedding model name tag stored in the `model` column
  - `pg_ripple.embedding_dimensions` (integer, default `1536`, range `1–16000`) — vector dimensions; must match the actual model output
  - `pg_ripple.embedding_api_url` (string, default `''`) — base URL for an OpenAI-compatible embedding API (e.g. `https://api.openai.com/v1`, local Ollama, vLLM)
  - `pg_ripple.embedding_api_key` (string, default `''`, superuser-only) — API key; the value is masked in `pg_settings` via a superuser-only GUC flag
  - `pg_ripple.pgvector_enabled` (bool, default `true`) — runtime switch; set to `false` to disable all pgvector-dependent code paths without uninstalling the extension
  - `pg_ripple.embedding_index_type` (string, default `'hnsw'`, options `'hnsw'` | `'ivfflat'`) — controls which index type is created on `_pg_ripple.embeddings`; changing this requires `REINDEX`
  - `pg_ripple.embedding_precision` (string, default `'single'`, options `'single'` | `'half'` | `'binary'`) — `'half'` stores embeddings as `halfvec(N)` (50% storage reduction); `'binary'` stores them as `bit(N)` using Hamming distance (~96% storage reduction, best for > 50M entities); requires pgvector ≥ 0.7.0
- `pg_ripple.embed_entities()` — batch embedding (`src/sparql/embedding.rs`)
  - `pg_ripple.embed_entities(graph_iri TEXT DEFAULT NULL, model TEXT DEFAULT NULL, batch_size INT DEFAULT 100) RETURNS BIGINT`
  - Executes a SPARQL SELECT to collect entity IRIs plus their `rdfs:label` (falling back to the IRI local name) from the specified graph (or all graphs if NULL)
  - Batches the entity labels and calls the OpenAI-compatible API at `pg_ripple.embedding_api_url`; supports gzip-compressed responses
  - Stores results in `_pg_ripple.embeddings` via `INSERT … ON CONFLICT (entity_id, model) DO UPDATE SET embedding = EXCLUDED.embedding, updated_at = now()`
  - Returns the total number of embeddings stored
  - Raises `PT601 — embedding API URL not configured` if `pg_ripple.embedding_api_url` is empty
- `pg_ripple.similar_entities()` — k-NN query (`src/sparql/embedding.rs`; optional at runtime — pgvector must be installed)
  - `pg_ripple.similar_entities(query_text TEXT, k INT DEFAULT 10, model TEXT DEFAULT NULL) RETURNS TABLE (entity_id BIGINT, entity_iri TEXT, distance FLOAT8)`
  - Encodes `query_text` to a vector via the configured embedding API
  - Executes `SELECT entity_id, embedding <=> $query_vec FROM _pg_ripple.embeddings ORDER BY 2 LIMIT k` using the pgvector `<=>` cosine distance operator (ordering by the distance expression, not the ID)
  - Decodes `entity_id` back to IRI text via the dictionary
  - Returns results sorted by ascending cosine distance (0 = identical, 2 = maximally dissimilar); usage is sketched below
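Typical usage once embeddings exist:

```sql
-- Ten nearest entities to a phrase, nearest first.
SELECT entity_iri, distance
FROM pg_ripple.similar_entities('anti-inflammatory drug', k => 10);
```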
- `pg_ripple.store_embedding()` — user-supplied embeddings
  - `pg_ripple.store_embedding(entity_iri TEXT, embedding FLOAT8[], model TEXT DEFAULT NULL) RETURNS VOID`
  - Encodes `entity_iri` via the dictionary encoder, casts `FLOAT8[]` to `vector`, and upserts into `_pg_ripple.embeddings`
  - Useful for pre-computed KGE embeddings (TransE, RotatE, ComplEx) from external pipelines; no API call needed
  - Validates that `array_length(embedding, 1)` matches `pg_ripple.embedding_dimensions`; raises `PT602 — embedding dimension mismatch` otherwise
- SPARQL `pg:similar()` extension function (`src/sparql/functions.rs`)
  - Register `<http://pg-ripple.org/functions/similar>` as a SPARQL extension function in the function registry
  - Signature: `pg:similar(?entity, "query_text"^^xsd:string, k)` — returns the cosine distance as `xsd:double`
  - Translation to SQL: the SPARQL→SQL compiler detects `pg:similar` calls in BIND expressions and emits a JOIN against `_pg_ripple.embeddings` with the `<=>` operator
  - Filter pushdown: if the SPARQL query has `FILTER(?score < threshold)`, push the threshold into the SQL `WHERE` clause to allow HNSW iterative scan pruning
  - Graceful degradation: if pgvector is absent, raises `PT603 — pgvector extension not installed` with an install hint (a full hybrid query is sketched below)
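A full hybrid query combining a graph pattern with `pg:similar()` (the `pg:` prefix expands to the function namespace registered above; the data is illustrative):

```sql
SELECT * FROM pg_ripple.sparql('
PREFIX pg: <http://pg-ripple.org/functions/>
PREFIX ex: <http://example.org/>
SELECT ?drug ?score WHERE {
  ?drug a ex:Drug ;
        ex:approvedBy ex:FDA .
  BIND(pg:similar(?drug, "anti-inflammatory", 10) AS ?score)
  FILTER(?score < 0.5)   # pushed into the SQL WHERE clause
}
');
```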
- `pg_ripple.refresh_embeddings()` — stale embedding invalidation (`src/sparql/embedding.rs`)
  - `pg_ripple.refresh_embeddings(graph_iri TEXT DEFAULT NULL, model TEXT DEFAULT NULL, force BOOL DEFAULT false) RETURNS BIGINT`
  - Identifies entities whose `rdfs:label` was updated after `_pg_ripple.embeddings.updated_at` by joining `_pg_ripple.embeddings` against the label VP table's `i` (SID) sequence — a higher SID implies a later write
  - Re-embeds stale entities in batches; skips entities whose `updated_at` is already current unless `force = true`
  - Returns the count of re-embedded entities
  - Intended for scheduled maintenance (e.g. via `pg_cron`); also called automatically at the end of each background worker cycle when `pg_ripple.auto_embed = true`
  - Raises `PT606 — no stale embeddings found` as a NOTICE (not an ERROR) when nothing needs refreshing
- Error codes for the embedding subsystem (`src/error.rs`)
  - `PT601` — embedding API URL not configured
  - `PT602` — embedding dimension mismatch
  - `PT603` — pgvector extension not installed
  - `PT604` — embedding API request failed (includes the HTTP status code in the detail)
  - `PT605` — entity has no embedding (raised when `pg:similar` is called for an entity absent from `_pg_ripple.embeddings`)
  - `PT606` — no stale embeddings found (NOTICE level)
- pg_regress tests
  - `vector_setup.sql` — verify pgvector is installed; skip the remaining vector tests if absent
  - `vector_crud.sql` — store embeddings via `pg_ripple.store_embedding()`, retrieve them via `pg_ripple.similar_entities()`, verify the ranking order
  - `vector_sparql.sql` — SPARQL query using `pg:similar()` in a BIND expression; verify the result set is non-empty and ordered by distance
  - `vector_filter.sql` — SPARQL query with `FILTER(?score < 0.5)` on a `pg:similar()` result; verify only entities below the threshold are returned
  - `vector_graceful.sql` — test behaviour when `pg_ripple.pgvector_enabled = false`; verify a WARNING is emitted and no ERROR is raised
  - `vector_halfvec.sql` — store embeddings with `pg_ripple.embedding_precision = 'half'`; verify the halfvec column type and that `pg_ripple.similar_entities()` returns correct results
  - `vector_binary.sql` — store embeddings with `pg_ripple.embedding_precision = 'binary'`; verify the bit column type and that Hamming-distance similarity returns non-zero results
  - `vector_refresh.sql` — insert an entity, embed it, update its `rdfs:label`, call `pg_ripple.refresh_embeddings()`, verify `updated_at` advances and the re-embedding count is 1
Migration Script
`sql/pg_ripple--0.26.0--0.27.0.sql` — creates the `_pg_ripple.embeddings` table and HNSW index if pgvector is present; registers the GUC parameters. No changes to the VP table schema.
Documentation
- `user-guide/hybrid-search.md` (new page) — quick start: install pgvector, set the GUC parameters, call `pg_ripple.embed_entities()`, run a SPARQL hybrid query; includes an architecture diagram showing the VP table + embeddings table join
- `reference/embedding-functions.md` (new page) — API reference for `embed_entities`, `similar_entities`, `store_embedding`, and `pg:similar()`
- `reference/guc-reference.md` updated — document all seven new embedding GUC parameters (`embedding_model`, `embedding_dimensions`, `embedding_api_url`, `embedding_api_key`, `pgvector_enabled`, `embedding_index_type`, `embedding_precision`) with recommended values for OpenAI, Ollama, and local Sentence-BERT; include a storage trade-off table for the `embedding_precision` modes
Exit Criteria
- `vector_crud.sql`, `vector_sparql.sql`, `vector_filter.sql`, `vector_halfvec.sql`, `vector_binary.sql`, and `vector_refresh.sql` all pass in `cargo pgrx regress pg18` when pgvector is installed; `vector_setup.sql` skips cleanly when pgvector is absent.
- `pg_ripple.store_embedding('http://example.org/aspirin', ARRAY[...])` round-trips correctly through `pg_ripple.similar_entities('anti-inflammatory')`.
- A SPARQL query with `BIND(pg:similar(?drug, "aspirin", 10) AS ?score) FILTER(?score < 0.5)` returns only entities with a cosine distance below 0.5.
- `SELECT pg_ripple.similar_entities('test')` when `pg_ripple.pgvector_enabled = false` emits a WARNING and returns zero rows (no ERROR).
- `pg_ripple.refresh_embeddings()` after a label update returns a count of 1 and advances `updated_at`.
- `SELECT count(*) FROM _pg_ripple.embeddings` with `embedding_precision = 'half'` confirms the column is of type `halfvec`.
- Migration scripts from 0.1.0 through 0.27.0 run cleanly via `just test-migration`.
v0.28.0 — Advanced Hybrid Search & RAG Pipeline
Theme: Production-grade hybrid search with RRF fusion, incremental embedding, graph-contextualized embeddings, and end-to-end RAG retrieval.
In plain language: This release builds on the pgvector foundation to deliver two advanced capabilities. First, hybrid ranking: instead of choosing between SPARQL results or vector results, pg_ripple now fuses both using Reciprocal Rank Fusion — a proven algorithm for combining ranked lists from different retrieval systems. Second, RAG support: a single SQL function (`pg_ripple.rag_retrieve()`) takes a natural-language question, runs hybrid search, and returns structured context ready for an LLM system prompt. A background worker keeps embeddings up to date as new entities are added. The result is a complete knowledge-graph-grounded RAG backend running entirely inside PostgreSQL — no separate vector database, no ETL, no eventual consistency.

Effort estimate: 5–8 person-weeks
Completed items
Background
See plans/vector_sparql_hybrid.md §5 (Advanced Integration Patterns) and §7 (Phases 2 & 3) for full design rationale. Key highlights:
- Reciprocal Rank Fusion (RRF) is the standard algorithm for combining ranked lists from heterogeneous retrieval systems. With RRF, pg_ripple fuses SPARQL result rankings with vector distance rankings into a single scored list using the formula $\text{RRF}(d) = \sum_{r \in R} \frac{1}{k_{rrf} + r(d)}$ where $k_{rrf} = 60$.
- Incremental embedding via a background worker ensures entities added after initial bulk embedding are automatically embedded without user intervention.
- Graph-contextualized embeddings generate text representations that include entity neighborhood information (label, types, neighboring entity labels) before embedding — producing vectors that encode relational context, making similarity search more meaningful than label-only embeddings.
- `pg_ripple.rag_retrieve()` is the missing link between pg_ripple's knowledge graph and LLM-based applications; it bridges directly to the pg_ripple_http HTTP service for REST-based LLM integrations.
Deliverables
- `pg_ripple.hybrid_search()` — RRF fusion (`src/sparql/embedding.rs`)
  - `pg_ripple.hybrid_search(sparql_query TEXT, query_text TEXT, k INT DEFAULT 10, alpha FLOAT8 DEFAULT 0.5, model TEXT DEFAULT NULL) RETURNS TABLE (entity_id BIGINT, entity_iri TEXT, rrf_score FLOAT8, sparql_rank INT, vector_rank INT)` (optional at runtime — pgvector must be installed)
  - Executes `sparql_query` (a SPARQL SELECT returning `?entity`) to get the SPARQL-ranked candidate set
  - Executes `pg_ripple.similar_entities(query_text, k * 10)` to get the vector-ranked candidate set
  - Applies Reciprocal Rank Fusion with $k_{rrf} = 60$; `alpha` controls the SPARQL vs. vector weight (0.0 = vector only, 1.0 = SPARQL only, 0.5 = equal)
  - Returns the top-`k` entities sorted by descending `rrf_score`
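For orientation, a minimal call might look like the sketch below. The `ex:Drug` class and the question text are illustrative; any dataset with embedded entities works the same way.

```sql
-- Illustrative sketch: fuse SPARQL candidates with a vector ranking.
SELECT entity_iri, rrf_score, sparql_rank, vector_rank
FROM pg_ripple.hybrid_search(
    'PREFIX ex: <http://example.org/>
     SELECT ?entity WHERE { ?entity a ex:Drug }',  -- SPARQL-ranked candidates
    'anti-inflammatory',   -- query text, embedded and matched via HNSW
    k     := 10,           -- rows returned after fusion
    alpha := 0.5           -- equal SPARQL/vector weight
);
```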
- Incremental embedding background worker (`src/worker.rs` extension)
  - New table `_pg_ripple.embedding_queue (entity_id BIGINT PRIMARY KEY, enqueued_at TIMESTAMPTZ NOT NULL DEFAULT now())`
  - Trigger on `_pg_ripple.dictionary`: inserts new entity IDs into `embedding_queue` when `pg_ripple.auto_embed = true`
  - Background worker dequeues entities in batches of `pg_ripple.embedding_batch_size`, calls the embedding API, upserts into `_pg_ripple.embeddings`
  - GUC: `pg_ripple.auto_embed` (bool, default `false`) — master switch for trigger-based embedding; off by default to avoid surprise API charges
  - GUC: `pg_ripple.embedding_batch_size` (integer, default `100`, range `1`–`10000`)
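A minimal enablement sketch; the fixture IRIs are illustrative:

```sql
-- Sketch: turn on trigger-based embedding for newly loaded entities.
-- Off by default to avoid surprise embedding-API charges.
SET pg_ripple.auto_embed = true;
SET pg_ripple.embedding_batch_size = 100;

SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
ex:ibuprofen a ex:Drug .
');

-- The trigger has queued the new entity for the background worker:
SELECT count(*) FROM _pg_ripple.embedding_queue;
```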
- `pg_ripple.contextualize_entity()` — graph-serialized text (`src/sparql/embedding.rs`)
  - `pg_ripple.contextualize_entity(entity_iri TEXT, depth INT DEFAULT 1, max_neighbors INT DEFAULT 20) RETURNS TEXT`
  - Runs an internal SPARQL CONSTRUCT to gather the entity's label, type(s), and up to `max_neighbors` neighboring entity labels within `depth` hops
  - Serialises the neighborhood as structured text: `"[entity_label]. Type: [types]. Related: [neighbor_labels]."` — suitable for embedding
  - Used internally by `pg_ripple.embed_entities()` when `pg_ripple.use_graph_context = true` (new GUC, bool, default `false`)
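A quick way to see what would be embedded for one entity; the IRI is illustrative:

```sql
-- Sketch: inspect the graph-contextualized text for an entity.
-- Output follows the documented
-- "[entity_label]. Type: [types]. Related: [neighbor_labels]." shape.
SELECT pg_ripple.contextualize_entity(
    'http://example.org/aspirin',
    depth         := 1,
    max_neighbors := 20
);
```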
- `pg_ripple.rag_retrieve()` — end-to-end RAG (`src/sparql/embedding.rs`)
  - `pg_ripple.rag_retrieve(question TEXT, sparql_filter TEXT DEFAULT NULL, k INT DEFAULT 5, model TEXT DEFAULT NULL) RETURNS TABLE (entity_iri TEXT, label TEXT, context_json JSONB, distance FLOAT8)` (optional at runtime — pgvector must be installed)
  - Step 1: encode `question` to a vector; find the `k` nearest entities via HNSW
  - Step 2: if `sparql_filter` is non-NULL, apply it as a SPARQL WHERE-clause filter on the candidate set
  - Step 3: for each surviving entity, call `pg_ripple.contextualize_entity()` to build a rich context
  - Step 4: return `context_json` as JSONB with keys `label`, `types`, `properties`, `neighbors` — formatted for direct use as an LLM system prompt fragment; the structure mirrors the JSON-LD framing output from v0.17.0
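A minimal retrieval sketch; the question is illustrative, and the filter is assumed to be a SPARQL WHERE-clause fragment as described in Step 2:

```sql
-- Sketch: one call from natural-language question to LLM-ready context.
SELECT entity_iri, label, context_json, distance
FROM pg_ripple.rag_retrieve(
    'what treats headaches?',
    sparql_filter := '?entity a <http://example.org/Drug>',  -- illustrative
    k := 5
);
```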
- `pg_ripple_http` RAG endpoint (`pg_ripple_http/src/main.rs`)
  - `POST /rag` — accepts a `{"question": "...", "sparql_filter": "...", "k": 5}` JSON body
  - Calls `pg_ripple.rag_retrieve()` via the existing SPI connection
  - Returns `{"results": [...], "context": "..."}` where `context` is the concatenated `context_json` entries formatted as a plain-text LLM prompt
  - Authentication: same bearer-token auth as the existing `pg_ripple_http` endpoints
  - Rate limiting: inherits the `pg_ripple_http.max_requests_per_second` GUC
- JSON-LD framing for RAG context output (`src/framing/` extension)
  - `pg_ripple.rag_retrieve()` gains an optional `output_format TEXT DEFAULT 'jsonb'` parameter accepting `'jsonb'` or `'jsonld'`
  - When `output_format = 'jsonld'`, each `context_json` row is formatted as a JSON-LD frame using the framing engine from v0.17.0: entity types map to `@type`, property-value pairs map to their IRI keys, and `@context` is auto-populated from the registered prefix table
  - Enables direct use of `context_json` as a JSON-LD-framed system prompt for LLMs that prefer structured data (e.g. OpenAI structured outputs)
  - New pg_regress test `vector_rag_jsonld.sql` — call `pg_ripple.rag_retrieve(... output_format := 'jsonld')` and verify the `@type` and `@context` keys are present in the output
- SPARQL federation with external vector services (`src/sparql/federation.rs` extension)
  - Extends the SERVICE handler (v0.16.0) to recognise vector service endpoints registered via `pg_ripple.register_vector_endpoint(url TEXT, api_type TEXT)`, where `api_type` is `'pgvector'`, `'weaviate'`, `'qdrant'`, or `'pinecone'`
  - Syntax: `SERVICE <http://vector-service/search> { ?entity pg:similarTo "query" ; pg:score ?score }` — translated to the appropriate external API call (HTTP) rather than a local pgvector scan
  - Returned `?entity` IRIs are resolved against the local dictionary; matched entities can participate in subsequent local triple pattern joins in the same SPARQL query
  - Use case: local pgvector for <10M entities; an external service for larger embedding indexes, without changing the SPARQL query syntax
  - GUC: `pg_ripple.vector_federation_timeout_ms` (integer, default `5000`) — HTTP timeout for external vector service calls
  - Raises `PT607 — vector service endpoint not registered` if an unregistered SERVICE URL is used with a `pg:similarTo` predicate
  - New pg_regress test `vector_federation.sql` — register a mock vector endpoint, issue a federated SPARQL query, verify graceful fallback when the endpoint is unavailable
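Registration plus a federated query might look like this sketch. The endpoint URL is illustrative, and the `pg:` prefix is assumed to be pre-registered by the extension:

```sql
-- Sketch: route similarity search to an external Qdrant index via SERVICE.
SELECT pg_ripple.register_vector_endpoint('http://qdrant.internal:6333', 'qdrant');

SELECT * FROM pg_ripple.sparql('
PREFIX ex: <http://example.org/>
SELECT ?entity ?score WHERE {
  SERVICE <http://qdrant.internal:6333> {
    ?entity pg:similarTo "anti-inflammatory" ;
            pg:score ?score .
  }
  ?entity a ex:Drug .        # local join after IRI resolution
}
');
```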
- SHACL embedding completeness shape
  - `examples/shacl_embedding_completeness.ttl` — reusable SHACL shape that validates that all entities of a given class have embeddings (uses `sh:path :hasEmbedding ; sh:minCount 1`)
  - `pg_ripple.add_embedding_triples() RETURNS BIGINT` — materialises `:hasEmbedding` triples for entities present in `_pg_ripple.embeddings`, making the SHACL shape checkable
- Multi-model support
  - `pg_ripple.list_embedding_models() RETURNS TABLE (model TEXT, entity_count BIGINT, dimensions INT)` — enumerate all models in `_pg_ripple.embeddings`
  - `pg_ripple.similar_entities()`, `pg:similar()`, and `pg_ripple.rag_retrieve()` all accept an optional `model` argument; the default is the `pg_ripple.embedding_model` GUC value
- Benchmarks
  - `benchmarks/hybrid_search.sql` — pgbench-based benchmark measuring hybrid search latency and throughput; tests vector-only, SPARQL-only, and RRF-fused patterns
  - Target: hybrid search over 1M entities, 1,536-dimensional embeddings, HNSW index, <50 ms P99 latency for top-10 results
- Error codes (additions to `src/error.rs`)
  - `PT607` — vector service endpoint not registered
- pg_regress tests
  - `vector_hybrid.sql` — `pg_ripple.hybrid_search()` with a SPARQL SELECT + vector query; verify RRF scores are non-zero and results are sorted
  - `vector_rag.sql` — `pg_ripple.rag_retrieve()` end-to-end; verify `context_json` contains the expected keys
  - `vector_rag_jsonld.sql` — `pg_ripple.rag_retrieve(... output_format := 'jsonld')`; verify the `@type` and `@context` keys are present
  - `vector_contextualize.sql` — `pg_ripple.contextualize_entity()` on a test entity with known neighbors; verify the output text contains the expected labels
  - `vector_worker.sql` — insert a new entity with `pg_ripple.auto_embed = true`; verify `_pg_ripple.embedding_queue` is populated; simulate a worker drain and verify the embedding is present
  - `vector_federation.sql` — register a mock vector endpoint; verify a `SERVICE` query with `pg:similarTo` issues the correct HTTP request; verify graceful timeout fallback
Migration Script
sql/pg_ripple--0.27.0--0.28.0.sql — creates _pg_ripple.embedding_queue table and trigger; registers new GUC parameters. No changes to VP table schema.
Documentation
- `user-guide/hybrid-search.md` updated — add RRF fusion and RAG sections; include an end-to-end worked example from question to LLM context
- `user-guide/rag.md` (new page) — step-by-step guide to using `pg_ripple.rag_retrieve()` as a backend for LangChain, LlamaIndex, and raw OpenAI API calls; includes a `pg_ripple_http` REST example
- `reference/embedding-functions.md` updated — document `hybrid_search`, `rag_retrieve` (including the `output_format` parameter), `contextualize_entity`, `list_embedding_models`, `register_vector_endpoint`
- `reference/http-api.md` updated — document the `POST /rag` endpoint with request/response examples and the JSON-LD output mode
- `user-guide/vector-federation.md` (new page) — how to register external vector services, write federated SPARQL queries, and configure timeouts; includes worked examples for Weaviate, Qdrant, and Pinecone endpoints
- Release notes for v0.28.0 — highlight `rag_retrieve` and `hybrid_search` as headline features; link to the hybrid-search and RAG user guides
Exit Criteria
- `vector_hybrid.sql`, `vector_rag.sql`, `vector_rag_jsonld.sql`, `vector_contextualize.sql`, `vector_worker.sql`, and `vector_federation.sql` all pass in `cargo pgrx regress pg18` when pgvector is installed.
- `pg_ripple.hybrid_search('SELECT ?drug WHERE { ?drug a :Drug }', 'anti-inflammatory', 10)` returns ≤10 rows with non-zero `rrf_score`.
- `pg_ripple.rag_retrieve('what treats headaches?', k := 5)` returns JSONB rows with `label`, `types`, `properties`, and `neighbors` keys.
- `pg_ripple.rag_retrieve('what treats headaches?', k := 5, output_format := 'jsonld')` returns rows whose `context_json` contains `@type` and `@context` keys.
- `POST /rag` on `pg_ripple_http` returns a `context` field suitable for use as an LLM system prompt.
- Inserting a new entity with `pg_ripple.auto_embed = true` and running the background worker loop populates `_pg_ripple.embeddings` for that entity.
- `pg_ripple.register_vector_endpoint('http://unknown/', 'qdrant')` followed by a `SERVICE` query returns a graceful timeout with no ERROR.
- Migration scripts from 0.1.0 through 0.28.0 run cleanly via `just test-migration`.
v0.29.0 — Datalog Optimization: Magic Sets & Cost-Based Compilation
Theme: Goal-directed inference, cost-based rule compilation, and evaluation-path optimizations for the Datalog engine.
In plain language: pg_ripple's Datalog engine already supports semi-naive evaluation — it only looks at new facts each iteration. This release makes inference dramatically smarter: instead of deriving every possible fact, the engine now derives only the facts needed to answer a specific question (magic sets). It also reorders rule joins by cost, eliminates redundant rules, and improves how negation and filters are compiled to SQL. The result is 10×–1000× faster inference for targeted queries and 2×–10× faster full materialization on large datasets.
Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2 for detailed design notes on all optimization techniques. Key highlights:
- Magic sets is the classical Datalog optimization (Bancilhon et al., 1986; implemented in IBM DB2). It rewrites a rule program + query goal into a smaller program that derives only relevant facts. Combined with semi-naive evaluation, it matches top-down evaluation performance while retaining bottom-up correctness guarantees.
- Cost-based body atom reordering uses PostgreSQL's `pg_class.reltuples` and `pg_statistic` to sort joins by selectivity — the same technique PostgreSQL's own planner uses, applied at the Datalog→SQL compilation stage.
- Subsumption checking prunes redundant rules at compile time, reducing the number of SQL statements per fixpoint iteration.
Deliverables
- Magic sets transformation (`src/datalog/magic.rs`)
  - `pg_ripple.infer_goal(rule_set TEXT, goal TEXT) RETURNS JSONB` — materialize only the facts relevant to the goal pattern
  - Adornment propagation: given a goal like `?x rdf:type foaf:Person`, compute binding patterns for each predicate
  - Magic predicate generation: create auxiliary predicates that capture the demanded binding set
  - Modified rule generation: add magic-predicate filters to each rule body
  - SQL compilation: magic predicates compile to temp tables; modified rules join against them
  - Automatic integration with `create_datalog_view()` — when a goal has bound constants, magic sets are applied automatically
  - GUC: `pg_ripple.magic_sets` (bool, default `true`) — master switch; set to `false` to disable for debugging
  - Benchmark: `benchmarks/magic_sets.sql` — compare full materialization vs. goal-directed inference on RDFS closure with selective goals
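The goal-directed entry point is a single call, following the signature above:

```sql
-- Sketch: derive only the facts needed to answer one question,
-- instead of materializing the full closure.
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');
```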
- Cost-based body atom reordering (`src/datalog/compiler.rs`)
  - At rule compilation time, query `pg_class.reltuples` for each VP table referenced by a body atom
  - For atoms with bound constants, estimate selectivity from `pg_statistic.n_distinct`
  - Sort body atoms by ascending estimated cardinality (most selective first)
  - Prefer atoms that join on the indexed columns `(s,o)` or `(o,s)` when selectivities are similar
  - GUC: `pg_ripple.datalog_cost_reorder` (bool, default `true`)
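The statistics the compiler consults are ordinary PostgreSQL catalog data, kept current by ANALYZE; this sketch shows the row estimates it reads:

```sql
-- Sketch: approximate per-VP-table cardinalities, as seen by the compiler.
SELECT relname, reltuples
FROM pg_class
WHERE relnamespace = '_pg_ripple'::regnamespace
ORDER BY reltuples DESC;
```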
- Subsumption checking (`src/datalog/stratify.rs` extension)
  - After stratification, check each pair of rules deriving the same predicate for subsumption
  - If rule R2 is subsumed by rule R1 (R2's head is a substitution instance of R1's, and R1's body is a subset of R2's body), eliminate R2
  - Report eliminated rules via the `pg_ripple.infer_with_stats()` JSONB output: `"eliminated_rules": [...]`
- Anti-join negation (`src/datalog/compiler.rs`)
  - Replace `NOT EXISTS (SELECT 1 FROM vp_{id} WHERE ...)` with `LEFT JOIN vp_{id} ON ... WHERE ... IS NULL`
  - Compile-time choice: use the anti-join when the negated predicate's VP table has ≥1000 rows (from `pg_class.reltuples`); retain `NOT EXISTS` for small tables where the planner favors it
  - GUC: `pg_ripple.datalog_antijoin_threshold` (integer, default `1000`)
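Side by side, the two encodings look like this; the VP table names are illustrative:

```sql
-- NOT EXISTS form, retained for small negated tables:
SELECT k.s
FROM _pg_ripple.vp_knows k
WHERE NOT EXISTS (SELECT 1 FROM _pg_ripple.vp_blocked b WHERE b.s = k.s);

-- Anti-join form, chosen when the negated table has >= 1000 rows:
SELECT k.s
FROM _pg_ripple.vp_knows k
LEFT JOIN _pg_ripple.vp_blocked b ON b.s = k.s
WHERE b.s IS NULL;
```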
- Predicate-filter pushdown (`src/datalog/compiler.rs`)
  - Identify which body atom first binds each arithmetic/comparison guard variable
  - Move the guard immediately after that atom in the generated SQL
  - For range filters (`?a > 18`), emit as part of the `JOIN … ON` clause to enable index scans
- Delta table indexing (`src/datalog/mod.rs`)
  - After each semi-naive iteration populates a delta table, create a B-tree index on the join columns used by the next iteration's rules
  - Skip indexing when the delta table has fewer than `pg_ripple.delta_index_threshold` rows (default: 500)
  - GUC: `pg_ripple.delta_index_threshold` (integer, default `500`)
- Error codes (additions to `src/error.rs`)
  - `PT501` — magic sets transformation failed (circular binding pattern)
  - `PT502` — cost-based reordering skipped (statistics unavailable)
- pg_regress tests
  - `datalog_magic_sets.sql` — magic sets on RDFS transitivity with a selective goal; verify the result matches full materialization; verify magic temp tables are cleaned up
  - `datalog_cost_reorder.sql` — verify EXPLAIN output shows a changed join order with `pg_ripple.datalog_cost_reorder = true` vs. `false`
  - `datalog_antijoin.sql` — verify negation compiles to `LEFT JOIN … IS NULL` when the threshold is met
  - `datalog_subsumption.sql` — load overlapping rules; verify `infer_with_stats()` reports eliminated rules
  - `datalog_filter_pushdown.sql` — verify arithmetic filters appear in the JOIN ON clause, not the outermost WHERE
  - `datalog_delta_index.sql` — verify delta table index creation when the row count exceeds the threshold
Migration Script
sql/pg_ripple--0.28.0--0.29.0.sql — registers new GUC parameters. No changes to VP table schema or catalog tables.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `infer_goal()`, the magic sets GUC, the cost-based reordering GUC, the anti-join threshold GUC, and the delta indexing threshold GUC
- `user-guide/best-practices/datalog-optimization.md` (new page) — when to use `infer()` vs. `infer_goal()`, how to read `infer_with_stats()` output, how to diagnose slow fixpoint convergence, tuning GUCs for different dataset sizes
- Release notes for v0.29.0 — highlight magic sets and cost-based compilation as headline features; include before/after benchmarks
Exit Criteria
- `datalog_magic_sets.sql`, `datalog_cost_reorder.sql`, `datalog_antijoin.sql`, `datalog_subsumption.sql`, `datalog_filter_pushdown.sql`, and `datalog_delta_index.sql` all pass in `cargo pgrx regress pg18`.
- `pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person')` returns the same triples as `pg_ripple.infer('rdfs')` filtered to `rdf:type foaf:Person`, but completes in <10% of the time on a 1M-triple dataset.
- Migration scripts from 0.1.0 through 0.29.0 run cleanly via `just test-migration`.
v0.30.0 — Datalog Aggregation & Compiled Rule Plans
Theme: Analytics-grade inference and rule plan caching.
In plain language: This release adds two major capabilities to the Datalog engine. First, rules can now aggregate facts — for example, "count the number of friends each person has" or "find the maximum salary in each department" — unlocking graph analytics and metrics directly from inference rules. Second, the engine caches the SQL it generates for each rule set, so repeated calls to infer() (e.g., after each data load) no longer repeat expensive dictionary lookups and query construction. As a bonus, SPARQL queries that use on-demand Datalog rules also benefit from the plan cache: a query that triggers inference gets a faster response on every repeat execution.

Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2 for design notes. Aggregation in rule bodies (Datalog^agg) follows the aggregation-stratification spec: aggregate operations are allowed only in rule bodies over predicates that are fully computed in a lower stratum, ensuring a unique minimal model. Compiled rule plans cache generated SQL in a HashMap<rule_set, Vec<CachedPlan>> keyed on the dictionary-encoded rule set name; cache invalidation triggers on load_rules(), drop_rules(), or GUC change.
Deliverables
- Aggregation in rule bodies (Datalog^agg) (`src/datalog/compiler.rs`, `src/datalog/stratify.rs`)
  - Extend the rule IR to support aggregate terms in body atoms: `COUNT(?x)`, `SUM(?x)`, `MIN(?x)`, `MAX(?x)`, `AVG(?x)`
  - Aggregation-stratification check: aggregated predicates must be fully computed in a lower stratum; reject with `PT510` if violated
  - SQL compilation: aggregate body atoms compile to subquery CTEs with `GROUP BY` and aggregate window functions
  - `pg_ripple.infer_agg(rule_set TEXT) RETURNS JSONB` — variant of `infer()` that enables aggregation rules
  - Example rule: `?x ex:friendCount ?n :- COUNT(?y WHERE ?x foaf:knows ?y) = ?n .`
  - Benchmark: `benchmarks/datalog_agg.sql` — PageRank-style degree centrality on a social graph
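End to end, the friend-count example above might run as follows. The `load_rules(name, text)` call shape is an assumption inferred from the cache-invalidation note in the Background section, and `'social'` is an illustrative rule-set name:

```sql
-- Sketch: load one aggregation rule and run aggregate-aware inference.
SELECT pg_ripple.load_rules('social', '
?x ex:friendCount ?n :- COUNT(?y WHERE ?x foaf:knows ?y) = ?n .
');  -- load_rules(name, text) is assumed, per the Background note

SELECT pg_ripple.infer_agg('social');
```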
- Compiled rule plans (`src/datalog/cache.rs` new module)
  - Cache the generated SQL string (and dictionary-encoded constant vector) for each rule on the first `infer()` call
  - Cache key: rule set name + schema version (invalidate on any `ALTER EXTENSION pg_ripple UPDATE`)
  - Cache storage: `pgrx::PgSharedMem`-backed LRU, size controlled by the GUC `pg_ripple.rule_plan_cache_size` (default: 64 entries)
  - SPARQL on-demand mode benefit: when a SPARQL query inlines a derived predicate CTE, the CTE SQL is served from the plan cache rather than rebuilt from scratch
  - GUC: `pg_ripple.rule_plan_cache` (bool, default `true`)
  - Expose cache statistics via `pg_ripple.rule_plan_cache_stats() RETURNS TABLE(rule_set TEXT, hits BIGINT, misses BIGINT, entries INT)`
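Cache behaviour is observable directly from SQL; the rule-set name is illustrative:

```sql
-- Sketch: the first run compiles and caches; the second is served from cache.
SELECT pg_ripple.infer('rdfs');
SELECT pg_ripple.infer('rdfs');
SELECT * FROM pg_ripple.rule_plan_cache_stats();  -- expect hits > 0
```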
- Error codes (`src/error.rs`)
  - `PT510` — aggregation-stratification violation (aggregate over non-ground predicate)
  - `PT511` — unsupported aggregate function in rule body
- pg_regress tests
  - `datalog_agg.sql` — verify COUNT, SUM, MIN, MAX rules derive correct results; verify stratification rejects cycles through aggregates
  - `datalog_plan_cache.sql` — verify cache hit/miss counts via `rule_plan_cache_stats()`; verify cache invalidation on `drop_rules()`
  - `datalog_sparql_cache.sql` — verify a SPARQL on-demand query using a derived predicate is faster on second execution (plan served from cache)
Migration Script
sql/pg_ripple--0.29.0--0.30.0.sql — registers new GUCs (pg_ripple.rule_plan_cache, pg_ripple.rule_plan_cache_size). No VP table schema changes.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `infer_agg()`, aggregation rule syntax, the plan cache GUCs, and `rule_plan_cache_stats()`
- `user-guide/best-practices/datalog-optimization.md` updated — add a section on aggregation-stratification rules and plan cache tuning
- Release notes for v0.30.0
Exit Criteria
- `datalog_agg.sql`, `datalog_plan_cache.sql`, and `datalog_sparql_cache.sql` all pass in `cargo pgrx regress pg18`.
- A PageRank-style degree centrality rule on a 1M-triple social graph produces correct results.
- A second call to `infer()` on the same rule set reports cache hits > 0 in `rule_plan_cache_stats()`.
- Migration scripts from 0.1.0 through 0.30.0 run cleanly via `just test-migration`.
v0.31.0 — Entity Resolution & Demand Transformation
Theme: Identity semantics and goal-directed rule rewriting for SPARQL and Datalog.
In plain language: This release tackles two distinct but complementary problems. First, it adds proper handling for owl:sameAs — the RDF way of saying "these two names refer to the same thing". When the engine knows that ex:Alice and ex:A.Smith are the same person, all facts about one automatically apply to the other. Second, it introduces demand transformation — a generalisation of the magic sets technique (added in v0.29.0) that can rewrite complex rule programs to derive only the facts that a query actually needs, even for rules with many cross-referencing bodies. This also makes SPARQL on-demand mode smarter: SPARQL queries can now trigger only the Datalog inference relevant to their specific patterns.

Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2 for design notes. owl:sameAs merging uses a pre-pass canonicalization strategy: before each fixpoint iteration, the compiler rewrites triple patterns to use the canonical (lowest-id) representative of each sameAs equivalence class. Demand transformation is more flexible than magic sets for programs with multiple recursive predicates that reference each other — it propagates binding demands through the full program dependency graph rather than one predicate at a time.
Deliverables
- `owl:sameAs` entity canonicalization (`src/datalog/rewrite.rs` new module)
  - Pre-pass: at the start of each inference run, compute equivalence classes of `owl:sameAs` (the VP table for the `sameAs` predicate) using union-find over dictionary IDs
  - Canonicalization map: each non-canonical ID maps to the lowest ID in its class
  - Rule compiler rewrite: substitute all occurrences of non-canonical IDs in rule bodies before SQL generation
  - SPARQL integration: SPARQL queries that reference a non-canonical entity are transparently rewritten to query the canonical form
  - GUC: `pg_ripple.sameas_reasoning` (bool, default `true`)
  - Benchmark: `benchmarks/sameas.sql` — query an entity with 100 `sameAs` aliases; verify all facts are visible via any alias
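The observable behaviour is that facts asserted against one alias are queryable through any other; a minimal sketch with illustrative IRIs:

```sql
-- Sketch: owl:sameAs makes facts visible through every alias.
SELECT pg_ripple.load_turtle('
@prefix ex: <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ex:Alice owl:sameAs ex:ASmith .
ex:Alice ex:worksAt ex:Acme .
');

SELECT * FROM pg_ripple.sparql('
PREFIX ex: <http://example.org/>
SELECT ?org WHERE { ex:ASmith ex:worksAt ?org }
');  -- expected: ex:Acme, via the canonicalization pre-pass
```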
- Demand transformation (`src/datalog/demand.rs` new module)
  - Generalised magic sets: compute demand sets for all predicates simultaneously via a fixed point on the program dependency graph
  - API: `pg_ripple.infer_demand(rule_set TEXT, demands JSONB) RETURNS JSONB` — `demands` is an array of goal patterns `[{"p": "rdf:type", "o": "foaf:Person"}, ...]`
  - Automatically applied in `create_datalog_view()` when multiple goal patterns are specified
  - SPARQL on-demand integration: when a SPARQL query references multiple derived predicates, compute a joint demand set and apply it to all relevant rules before generating inline CTEs; reduces CTE size and join cost
  - GUC: `pg_ripple.demand_transform` (bool, default `true`)
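A call with two goal patterns, following the JSONB shape documented above (the second pattern is illustrative):

```sql
-- Sketch: a joint demand set over two goal patterns.
SELECT pg_ripple.infer_demand('rdfs', '[
  {"p": "rdf:type", "o": "foaf:Person"},
  {"p": "rdf:type", "o": "foaf:Organization"}
]'::jsonb);
```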
- pg_regress tests
  - `datalog_sameas.sql` — load `sameAs` assertions; verify inference results are visible via all aliases; verify canonicalization in SPARQL query results
  - `datalog_demand.sql` — verify `infer_demand()` derives the same results as `infer()` filtered to the demand set; verify EXPLAIN shows a smaller CTE for SPARQL on-demand queries with demand transform enabled
Migration Script
sql/pg_ripple--0.30.0--0.31.0.sql — registers pg_ripple.sameas_reasoning and pg_ripple.demand_transform GUCs. No VP table schema changes.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `infer_demand()`, `owl:sameAs` behaviour, and the `sameas_reasoning` GUC
- `user-guide/best-practices/datalog-optimization.md` updated — add a section on demand transformation vs. magic sets, and when to use `infer_demand()` vs. `infer_goal()`
- Release notes for v0.31.0
Exit Criteria
- `datalog_sameas.sql` and `datalog_demand.sql` pass in `cargo pgrx regress pg18`.
- A SPARQL on-demand query referencing two derived predicates on a 1M-triple dataset completes in <50% of the time compared to v0.30.0 (demand transform reduces the combined CTE size).
- Migration scripts from 0.1.0 through 0.31.0 run cleanly via `just test-migration`.
v0.32.0 — Well-Founded Semantics & Tabling
Theme: Advanced reasoning for cyclic ontologies and subsumptive result caching for Datalog and SPARQL.
In plain language: Two powerful features for production knowledge graph workloads. Well-founded semantics handles the edge cases that stratified Datalog cannot: programs where rules are mutually recursive through negation (e.g., "X is trusted unless untrusted, and untrusted unless trusted"). Instead of rejecting these programs, the engine assigns a third truth value — unknown — and returns whatever can be definitively concluded. Tabling caches the results of recurring sub-queries: if the same Datalog sub-goal (or SPARQL sub-pattern) appears in multiple queries or multiple times within one query, the answer is computed once and reused. For analytical workloads with repeated sub-query patterns, this is a 2–5× speedup.
Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2 for design notes. Well-founded semantics (Van Gelder et al., 1991) extends stratified Datalog with a three-valued model: facts are true, false, or unknown (neither provably true nor provably false). The SQL encoding uses an iterative alternating fixpoint: two parallel CTE chains compute the well-founded model over at most pg_ripple.wfs_max_iterations rounds. Tabling (subsumptive tabling, inspired by XSB Prolog) stores derived sub-goals in a session-scoped cache table _pg_ripple.tabling_cache (goal_hash BIGINT, result JSONB, computed_at TIMESTAMPTZ) and reuses results within a configurable TTL.
Deliverables
- Well-founded semantics (`src/datalog/wfs.rs` new module)
  - Alternating fixpoint algorithm: compute `T_P↑` (positive) and `T_P↓` (negative) iteratively until fixpoint
  - Three-valued result: derived facts carry a `certainty` column (`true`/`unknown`) in the query output
  - `pg_ripple.infer_wfs(rule_set TEXT) RETURNS JSONB` — run the well-founded fixpoint instead of stratified evaluation
  - Graceful degradation: for stratifiable programs, `infer_wfs()` produces the same results as `infer()` with no overhead
  - GUC: `pg_ripple.wfs_max_iterations` (integer, default `100`) — safety cap on alternating fixpoint rounds
  - Error code `PT520` — well-founded fixpoint did not converge within `wfs_max_iterations`
  - Benchmark: `benchmarks/wfs.sql` — cyclic ontology with mutual negation; verify unknown facts are correctly identified
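Usage is a drop-in variant of `infer()`; the rule-set name is illustrative:

```sql
-- Sketch: well-founded evaluation of a program that is cyclic through
-- negation; facts neither provably true nor false come back as 'unknown'.
SELECT pg_ripple.infer_wfs('trust');
```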
- Tabling / memoization (`src/datalog/tabling.rs` new module)
  - Session-scoped cache: `_pg_ripple.tabling_cache (goal_hash BIGINT PRIMARY KEY, result BYTEA, computed_at TIMESTAMPTZ)`
  - Cache key: XXH3-128 of the normalised goal pattern (predicate ID + bound-variable encoding)
  - SPARQL integration: SPARQL sub-query patterns (e.g., property path closures, OPTIONAL blocks) that match a cached goal are served from the tabling cache without re-executing the CTE — implemented at the SPARQL→SQL translation layer
  - Datalog integration: `infer()` and `infer_goal()` check the tabling cache before running the fixpoint; on a cache miss, the result is stored for future calls
  - TTL: `pg_ripple.tabling_ttl` (integer seconds, default `300`); set to `0` to disable expiry
  - GUC: `pg_ripple.tabling` (bool, default `true`)
  - Invalidation: the cache is automatically cleared on any triple insert/delete/update (via the CDC hook), and on `drop_rules()`
  - Expose stats: `pg_ripple.tabling_stats() RETURNS TABLE(goal_hash BIGINT, hits BIGINT, computed_ms FLOAT, cached_at TIMESTAMPTZ)`
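Cache effectiveness is visible from SQL; repeating a goal within the TTL should register a hit:

```sql
-- Sketch: the second identical sub-goal is served from the tabling cache.
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');  -- computed
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');  -- cache hit
SELECT goal_hash, hits, computed_ms FROM pg_ripple.tabling_stats();
```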
- pg_regress tests
  - `datalog_wfs.sql` — verify well-founded semantics on a cyclic negation program; verify `certainty = 'unknown'` for unresolvable facts; verify stratifiable programs return the same results as `infer()`
  - `datalog_tabling.sql` — verify cache hit/miss counts via `tabling_stats()`; verify TTL expiry; verify cache invalidation on triple insert
  - `sparql_tabling.sql` — SPARQL query with a repeated sub-pattern; verify tabling stats show hits > 0 on the second identical sub-pattern within one query
Migration Script
sql/pg_ripple--0.31.0--0.32.0.sql — creates _pg_ripple.tabling_cache table; registers pg_ripple.tabling, pg_ripple.tabling_ttl, pg_ripple.wfs_max_iterations GUCs.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `infer_wfs()`, the tabling GUCs, and `tabling_stats()`
- `user-guide/best-practices/datalog-optimization.md` updated — add a section on when to use `infer_wfs()`, tabling tuning, and SPARQL sub-query caching behaviour
- `user-guide/best-practices/sparql-performance.md` (new page) — how tabling accelerates SPARQL property paths and repeated sub-queries; how demand transformation reduces CTE size; how rule plan caching (v0.30.0) interacts with SPARQL on-demand mode
- Release notes for v0.32.0
Exit Criteria
- `datalog_wfs.sql`, `datalog_tabling.sql`, and `sparql_tabling.sql` all pass in `cargo pgrx regress pg18`.
- A SPARQL query with a repeated transitive-closure sub-pattern on a 1M-triple dataset completes in <50% of the time on the second execution (tabling cache hit).
- `infer_wfs()` on a stratifiable rule set produces results identical to `infer()`.
- Migration scripts from 0.1.0 through 0.32.0 run cleanly via `just test-migration`.
v0.33.0 — Documentation Site & Content Overhaul
Theme: A documentation site worthy of a production-grade triple store.
In plain language: pg_ripple is a mature system — v0.32.0 delivers full SPARQL 1.1 and SHACL Core conformance across 32 releases — but its documentation has grown organically alongside the codebase rather than being designed for the people who use it. This release delivers documentation that meets users where they are: a problem-centric information architecture written for five distinct archetypes (Data Engineer, Application Developer, Knowledge Architect, Decision-Maker, AI/ML Engineer), eight feature-deep-dive chapters, a full operations guide, a SQL function reference with working examples for every function, and a CI harness that keeps every code example honest by running it against a real pg_ripple instance on every pull request. The full plan is in plans/documentation.md.
Effort estimate: 8–12 person-weeks
Completed items
Background
See plans/documentation.md for the authoritative plan — site structure, content guidelines, five user archetypes, and four delivery phases. Everything described in that plan is in scope for this version.
The documentation site is built with mdBook. mdbook-admonish is added before Phase 1 content work starts (book.toml updated with [preprocessor.admonish]); all new and restructured pages use its fenced callout syntax exclusively. A shared bibliographic fixture dataset (papers, authors, institutions, topics, citations, pre-computed embeddings) is established in docs/fixtures/ and reused across all chapters.
Deliverables
Phase 0 — CI Test Harness (prerequisite)
- `scripts/test_docs.sh` — CI harness: spins up pg_ripple via Docker, extracts fenced SQL blocks from `docs/src/`, executes them in document order, compares stdout against expected-output comment blocks embedded directly below each code block
- `docs/fixtures/bibliography.sql` — shared bibliographic fixture dataset (papers, authors, institutions, topics, citations, pre-computed embeddings) reused across all chapters
- `.github/workflows/docs-test.yml` — CI job that runs the harness on every PR touching `docs/`
- `mdbook-admonish` added to `book.toml` and the `[preprocessor.admonish]` block configured
- Exit criterion: the CI job passes on a real PR (not just locally)
Phase 1 — Foundation
- Landing page — value proposition, architecture diagram, one compelling code example; key-numbers block and comparison summary absorbed from the former "60 Seconds" content
- Evaluate / When to Use pg_ripple — honest comparison matrix (pg_ripple vs. plain SQL, standalone RDF stores, LPG systems, pure vector databases); decision flowchart; AI/LLM section on when graph context outperforms flat vector retrieval
- Installation — Docker (recommended default), from source (`cargo pgrx`), prerequisites, verification step (`SELECT pg_ripple.triple_count()` returns 0), troubleshooting for the five most common failures
- Guided Tutorial — Build a Knowledge Graph in 30 Minutes — four self-contained ≤10-minute segments: Load & Explore, Validate, Reason, Export; uses the shared bibliographic dataset; each segment is independently complete
- Key Concepts — RDF for PostgreSQL Users — triples, IRIs, blank nodes, literals, named graphs, RDF-star, SPARQL; PostgreSQL analogies with diagrams for every concept
Phase 2 — Feature Deep Dives
Eight chapters, each following the seven-part structure: What & Why → How It Works → Worked Examples → Common Patterns → Performance & Trade-offs → Gotchas & Debugging → Next Steps.
- §2.1 Storing Knowledge — modeling a domain as triples; named graphs (when needed vs. when not); blank nodes with honest caveats; RDF-star for provenance and confidence scores; translating a relational schema to RDF
- §2.2 Loading Data — all formats (Turtle, N-Triples, N-Quads, TriG, RDF/XML); three loading modes (`load_turtle()`, `load_turtle_file()`, `insert_triple()`); bulk-load performance numbers; blank-node scoping across calls; SQL-to-triples patterns; when to run ANALYZE
- §2.3 Querying with SPARQL — basic patterns through property paths (all operators: `+`, `*`, `?`, `/`, `|`, `^`); aggregation; subqueries; UNION/MINUS; GRAPH patterns; `sparql_explain()` guide; filter pushdown; `max_path_depth` safety limit; real-world query recipes (entity resolution, recommendations, transitive closure, temporal queries)
- §2.4 Validating Data Quality — SHACL shapes from simple (`sh:minCount`/`sh:maxCount`) to complex (`sh:or`, `sh:pattern`, cross-property constraints); synchronous vs. asynchronous validation modes; dead-letter queue; common quality rule patterns
- §2.5 Reasoning and Inference — Datalog rules; built-in RDFS/OWL RL rule sets; stratification explained plainly; explicit vs. inferred triples (the `source` column); goal-directed vs. full materialization; magic sets and semi-naive evaluation
- §2.6 Exporting and Sharing — all export formats; JSON-LD framing with `sparql_construct_jsonld()` and frame templates; canonical GraphRAG chapter: BYOG Parquet export, Datalog enrichment, SHACL quality enforcement (all other GraphRAG mentions cross-reference here)
- §2.7 AI Retrieval & Graph RAG — canonical AI chapter: vector embeddings, HNSW indexes, `pg:similar()`, hybrid retrieval with RRF, `rag_retrieve()`, JSON-LD framing for LLM prompts, the `owl:sameAs` pre-pass before embedding, FTS broadening, the end-to-end RAG pipeline; comparison with pure vector stores (Qdrant, Weaviate, pgvector-only)
- §2.8 APIs and Integration — the `pg_ripple_http` SPARQL Protocol HTTP endpoint (configuration, response formats, authentication, Docker Compose); application code examples (Python `psycopg2`/`SPARQLWrapper`, JavaScript `pg`, Java JDBC); SPARQL federation; caching strategies
Phase 3 — Operations
- Architecture Overview — dictionary, VP tables, HTAP storage, shmem cache; SPARQL query execution flow for operators
- Deployment Models — standalone, Docker/Compose, managed PostgreSQL services; trade-offs and the recommended starting point
- Configuration and Tuning — all GUC parameters by subsystem (storage, query engine, inference, validation, caching, system); three-size production config (small: <1M triples; medium: 1M–100M; large: >100M)
- Monitoring and Observability — `pg_ripple.stats()`, `pg_stat_statements`, `sparql_explain(analyze := true)`, Prometheus metrics; Grafana panel descriptions; health-check thresholds
- Performance Tuning — bottleneck identification for query latency, write throughput, and cache pressure; realistic BSBM numbers; tuning recipes for read-heavy, write-heavy, and mixed HTAP workloads
- Backup and Disaster Recovery — `pg_dump`/`pg_restore`; point-in-time recovery; a verified backup/restore procedure with exact commands
- Upgrading Safely — `ALTER EXTENSION pg_ripple UPDATE`; pre/post-upgrade steps; rollback strategy; maintenance-window guidance; explicit note that zero-downtime upgrades are not yet supported
- Scaling — vertical scaling guide; merge-worker tuning; read replicas for horizontal scale; honest statement of what is not yet supported
- Troubleshooting — runbook format: ≥15 symptom → cause → diagnostic → fix entries across all subsystems
- Security — named-graph row-level security; injection prevention; `pg_ripple_http` TLS and authentication; file-path loader delegation
Phase 4 — Reference and Polish
- SQL Function Reference — all functions grouped by use case (Loading, Querying, Validating, Reasoning, Exporting, Administration); each entry has full signature, parameter table, and one working example with expected output
- SPARQL Compliance Matrix — every SPARQL 1.1 Query, Update, and Protocol feature with status (Supported / Partial / Not Supported); link to W3C test suite results; workarounds for partial/unsupported features
- Error Message Catalog — every PT001–PT799 code with cause and fix; auto-generated from `src/error.rs` where possible
- Glossary — plain-language definitions of every term used in the documentation
- Release Notes and Roadmap mirrored into the docs site
- Contributing guide — dev environment setup, test commands, PR workflow, code conventions; top-level "Contribute" navigation entry and landing-page callout card; academic citations and architecture background moved to `CONTRIBUTING.md` (not user-facing reference)
- Full audit: every code example verified against v0.33.0, all `TODO`/stub markers resolved
Content Governance
- `scripts/check_docs_coverage.sh` — CI job that diffs exported function signatures in `src/lib.rs` against the SQL Function Reference and fails the build when a changed signature has no corresponding `docs/` touch in the same PR
- `mdbook-linkcheck` broken-link CI job on every PR touching `docs/`; redirect map (`docs/redirects.toml`) kept current when pages are moved or removed
- PR template updated with a docs-gap reminder (CI enforcement is primary; the checkbox is a reminder only)
- 30-day documentation review schedule: at every minor release, run the signature-diff script and triage GitHub issues tagged `docs` to fill gaps
Migration Script
sql/pg_ripple--0.32.0--0.33.0.sql — no schema changes. This version delivers documentation infrastructure and content only; all pg_ripple SQL functions, GUCs, and VP table schemas are unchanged from v0.32.0.
Documentation
This version is the documentation release. The deliverables above are the documentation.
Exit Criteria
- Phase 0 CI harness is complete and passing in CI (verified by a real PR, not just locally).
- The eight feature-deep-dive chapters (§2.1–§2.8) are published with no unresolved stubs or TODO markers.
- The operations section (10 pages) is complete and published.
- The SQL Function Reference covers every function listed in §4 of plans/documentation.md.
- `check_docs_coverage.sh` CI job passes on a PR that changes a function signature.
- `mdbook-linkcheck` reports zero broken internal links.
- Migration scripts from 0.1.0 through 0.33.0 run cleanly via `just test-migration`.
v0.34.0 — Bounded-Depth Termination & Incremental Retraction (DRed)
Theme: Smarter fixpoint termination and write-correct incremental maintenance.
In plain language: Two complementary improvements for production workloads. First, when an ontology has a known maximum hierarchy depth (e.g., a SHACL shape says class hierarchies are at most 5 levels deep), the inference engine can stop early instead of running one final "did anything change?" check — shaving 20–50% off property path queries and fixpoint loops. Second, the Delete-Rederive (DRed) algorithm means that deleting a base triple no longer requires re-materializing the entire derived closure: the engine surgically removes only the affected derived facts, re-derives any that survive via alternative paths, and leaves everything else untouched. Materialized SPARQL predicates stay correct in milliseconds after deletes instead of seconds.
Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2.7 and §14.2.12 for design notes. Bounded-depth termination integrates with SHACL shape constraints (sh:maxDepth annotations on property paths) and user-provided GUC hints to set the maximum fixpoint iteration count at compile time. DRed (Gupta, Katiyar & Sagiv, 1993) is the standard incremental deletion algorithm used by RDFox and other production Datalog systems; it avoids full re-materialization by over-deleting pessimistically and then re-deriving survivors.
Deliverables
- Bounded-depth early termination (`src/datalog/compiler.rs`)
  - Read SHACL `sh:maxDepth` annotations for property paths used in rule bodies; fall back to the GUC `pg_ripple.datalog_max_depth` (integer, default `0` = unlimited)
  - When a depth bound `d` is known, emit the `WITH RECURSIVE … (MAXDEPTH d)` hint (PostgreSQL 18 syntax) or use a depth counter column in the recursive CTE (`depth INT`), terminating when `depth > d`
  - SPARQL property path integration: property path CTEs (`rdfs:subClassOf*`, `ex:knows+`) respect the same bound when the path predicate has a SHACL `sh:maxDepth` constraint
  - GUC: `pg_ripple.datalog_max_depth` (integer, default `0` — unlimited)
  - pg_regress test: `datalog_bounded_depth.sql` — verify the fixpoint terminates after `d` iterations; verify SPARQL property paths honour the depth bound; verify an unbounded rule still produces the full closure
- Incremental retraction — DRed algorithm (`src/datalog/dred.rs` new module)
  - Hook into the CDC delete path: when a base triple is deleted from a VP table, identify all derived predicates whose SQL rules reference that VP table
  - Phase 1 — Over-delete: for each affected derived predicate, delete all rows that could depend on the deleted triple (pessimistic, using the rule SQL with the deleted triple as a positive filter)
  - Phase 2 — Re-derive: re-run the rule SQL restricted to the over-deleted set; rows that are re-derived via an alternative derivation path are reinserted
  - Phase 3 — Commit: rows not reinserted after phase 2 are permanently gone
  - GUC: `pg_ripple.dred_enabled` (bool, default `true`) — master switch; set to `false` to fall back to full re-materialization on delete
  - GUC: `pg_ripple.dred_batch_size` (integer, default `1000`) — maximum number of deleted base triples to process in a single DRed transaction
  - Error code `PT530` — DRed cycle detected (a derived predicate self-references in a way DRed cannot safely resolve; falls back to a full recompute)
  - pg_regress test: `datalog_dred.sql` — insert triples, materialize the RDFS closure, delete one base triple, verify only the correctly-affected derived triples are removed; verify triples supported by alternative paths survive
- Incremental rule updates (`src/datalog/mod.rs`)
  - `pg_ripple.add_rule(rule_set TEXT, rule_text TEXT)` — add a single rule to an existing rule set without a full recompute; only the new rule's derived predicate needs one fresh iteration pass
  - `pg_ripple.remove_rule(rule_id BIGINT)` — remove a rule and retract any derived facts that were solely supported by it (uses DRed internally)
  - Dependency-aware invalidation: `add_rule` triggers one additional semi-naive pass on the affected stratum only
  - pg_regress test: `datalog_incremental_rules.sql` — add a rule to a live rule set; verify new derivations appear without a full recompute; remove the rule; verify the derived facts are retracted
Migration Script
sql/pg_ripple--0.33.0--0.34.0.sql — registers pg_ripple.datalog_max_depth, pg_ripple.dred_enabled, pg_ripple.dred_batch_size GUCs. No VP table schema changes.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `add_rule()`, `remove_rule()`, the DRed GUCs, and the `datalog_max_depth` GUC
- `user-guide/best-practices/datalog-optimization.md` updated — add a section on DRed vs. full-recompute trade-offs and bounded-depth tuning with SHACL
- `user-guide/best-practices/sparql-performance.md` updated — add a section on bounded-depth SPARQL property paths
- Release notes for v0.34.0
Exit Criteria
- `datalog_bounded_depth.sql`, `datalog_dred.sql`, and `datalog_incremental_rules.sql` all pass in `cargo pgrx regress pg18`.
- Deleting a base triple from a 1M-triple RDFS-materialized dataset with DRed enabled completes in <500 ms (vs. a full recompute taking >5 s).
- A SPARQL `rdfs:subClassOf*` property path query on a hierarchy with `sh:maxDepth 5` completes in <50% of the time compared to the unbounded version on a 10-level test hierarchy.
- Migration scripts from 0.1.0 through 0.34.0 run cleanly via `just test-migration`.
v0.35.0 — Parallel Stratum Evaluation & Incremental Rule Updates
Theme: Concurrent rule evaluation for faster materialization of large rule sets.
In plain language: The Datalog engine currently evaluates rules one at a time within each stratum. This release allows rules that derive different predicates — and therefore cannot interfere with each other — to run concurrently using PostgreSQL's background worker infrastructure. For OWL RL, which has roughly 10 independent rule groups in its first stratum, this means the full ontology closure can materialize up to 10× faster. SPARQL queries that depend on materialized predicates (the common production mode) benefit directly: derived VP tables become fresh sooner after bulk data loads, reducing the staleness window.
Effort estimate: 5–7 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2.11 for design notes. Within a single stratum, rules deriving different predicates are fully independent: their INSERT … SELECT statements touch different VP tables and can run concurrently without coordination. Rules deriving the same predicate within a stratum must be serialized or use ON CONFLICT DO NOTHING to handle concurrent inserts. The implementation uses pgrx::BackgroundWorker with a shared-memory semaphore to limit concurrency to pg_ripple.datalog_parallel_workers (default: max_worker_processes / 2).
Deliverables
- Parallel stratum evaluation (`src/datalog/parallel.rs` new module)
  - Analyse the rule dependency graph per stratum: partition rules into independent groups (rules that derive different predicates and have no shared body predicates that are derived within the same stratum)
  - Spawn one background worker per independent group; each worker executes its rule's `INSERT … SELECT` for the current semi-naive iteration
  - Synchronization barrier: the main process waits for all workers to finish before starting the next iteration
  - `ON CONFLICT DO NOTHING` ensures correctness when two workers insert into the same delta table
  - GUC: `pg_ripple.datalog_parallel_workers` (integer, default `4`, max `max_worker_processes - 3`)
  - GUC: `pg_ripple.datalog_parallel_threshold` (integer, default `10000`) — only parallelize strata where the estimated total row count exceeds this threshold (avoids overhead for small rule sets)
  - Expose parallelism statistics via the `infer_with_stats()` JSONB output: `"parallel_groups": 5, "max_concurrent": 4`
  - pg_regress test: `datalog_parallel.sql` — verify the OWL RL closure produces identical results with `datalog_parallel_workers = 1` and `= 4`; verify `infer_with_stats()` reports parallel groups > 1 for OWL RL
- SPARQL materialization freshness improvement
  - Parallel evaluation reduces time-to-fresh for derived VP tables after `pg_ripple.infer()` calls triggered by bulk loads
  - Document: SPARQL queries in materialized mode now observe a shorter staleness window after bulk inserts; add a note to the SPARQL best-practices guide
Migration Script
sql/pg_ripple--0.34.0--0.35.0.sql — registers pg_ripple.datalog_parallel_workers and pg_ripple.datalog_parallel_threshold GUCs. No VP table schema changes.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document the parallel evaluation GUCs and the `infer_with_stats()` parallel fields
- `user-guide/best-practices/datalog-optimization.md` updated — add a section on tuning `datalog_parallel_workers` for different hardware configurations
- `user-guide/best-practices/sparql-performance.md` updated — note the materialization freshness improvement with parallel evaluation
- Release notes for v0.35.0
Exit Criteria
- `datalog_parallel.sql` passes in `cargo pgrx regress pg18`.
- OWL RL full closure on a 1M-triple dataset with `datalog_parallel_workers = 4` completes in <40% of the time compared to `datalog_parallel_workers = 1`, with identical results in both cases.
- Migration scripts from 0.1.0 through 0.35.0 run cleanly via `just test-migration`.
v0.36.0 — Worst-Case Optimal Joins & Lattice-Based Datalog
Theme: Advanced join algorithms for cyclic graph patterns and monotone lattice aggregation.
In plain language: Two ambitious features that push pg_ripple to the frontier of Datalog and graph database research. Worst-case optimal joins tackle the hardest SPARQL performance problem: cyclic query patterns (think "find all triangles" or "find paths that loop back") where standard database joins produce enormous intermediate results. The Leapfrog Triejoin algorithm solves this class of problem with a mathematically optimal algorithm, giving 10×–100× speedups on queries that previously timed out. Lattice-based Datalog extends rules to work with custom algebraic structures — for example, propagating trust scores (where "trust of X through Y" is the minimum of individual trust values), or interval types, or set-valued annotations — enabling a new class of analytical reasoning that standard Datalog cannot express.
Effort estimate: 6–9 person-weeks
Completed items
Background
See plans/ecosystem/datalog.md §14.2.8 and §14.2.14 for design notes. Worst-case optimal joins (Ngo et al., 2012; "Skew Strikes Back") use a trie-based intersection algorithm that is provably optimal for any join query. PostgreSQL does not expose WCO join algorithms natively; implementation requires a custom scan node via the CustomScan API, registering a C-callable scan provider that pg_ripple exposes through its Rust FFI layer. Lattice-based Datalog (Datalog^L, inspired by Flix and Datafun) extends the rule IR with typed lattice values and monotone operations; fixpoint termination is guaranteed by the ascending chain condition on the lattice.
Deliverables
- Worst-case optimal joins — Leapfrog Triejoin (`src/sparql/wcoj.rs` new module)
  - Detect cyclic join patterns at SPARQL→SQL translation time: any SELECT with ≥3 triple patterns sharing variables in a cycle (triangle, square, etc.)
  - For detected cyclic patterns, route execution through a Leapfrog Triejoin scan node instead of standard PostgreSQL hash joins
  - CustomScan implementation: register a scan provider in `_PG_init` that intercepts cyclic join nodes in the PostgreSQL planner's plan tree
  - VP table trie interface: read VP table rows in sort order (the existing B-tree `(s, o)` indices serve as the underlying trie structure)
  - GUC: `pg_ripple.wcoj_enabled` (bool, default `true`) — master switch
  - GUC: `pg_ripple.wcoj_min_tables` (integer, default `3`) — minimum number of tables in a join before WCOJ is considered
  - SPARQL benefit: cyclic graph patterns that previously caused query timeouts or multi-second latencies complete in milliseconds
  - Benchmark: `benchmarks/wcoj.sql` — triangle query on a social-graph VP table; compare WCOJ vs. the standard planner at 100K, 1M, and 10M triples
  - pg_regress test: `sparql_wcoj.sql` — verify the triangle query produces correct results with WCOJ enabled and disabled; verify `pg_ripple.wcoj_enabled = false` falls back to the standard planner
- Lattice-based Datalog — Datalog^L (`src/datalog/lattice.rs` new module)
  - Extend the rule IR: a lattice term `LatticeVal(lattice_type, value)` alongside `Const` and `Var`
  - Built-in lattice types: `MinLattice` (meet = MIN), `MaxLattice` (join = MAX), `SetLattice` (join = UNION), `IntervalLattice` (join = interval hull)
  - User-defined lattice types via `pg_ripple.create_lattice(name TEXT, join_fn TEXT, bottom TEXT)` — `join_fn` is a PostgreSQL aggregate function name
  - SQL compilation: lattice rules compile to `INSERT … SELECT … ON CONFLICT (s, g) DO UPDATE SET o = lattice_join(excluded.o, vp.o)` — the upsert applies the lattice join on conflict
  - Fixpoint termination: guaranteed by the ascending chain condition
  - GUC: `pg_ripple.lattice_max_iterations` (integer, default `1000`) — safety bound on lattice fixpoint rounds
  - Example rule: trust propagation — `?x ex:trust (MIN ?t1 ?t2) :- ?x ex:knows ?y, ?y ex:trust ?t1, ?x ex:directTrust ?t2 .`
  - Error code `PT540` — lattice fixpoint did not converge (ascending chain condition violated by a user-defined lattice)
  - pg_regress test: `datalog_lattice.sql` — trust propagation rule with `MinLattice`; verify convergence; verify a user-defined lattice via a custom aggregate
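Running the trust-propagation example end to end might look like this; the `load_rules()` call shape and the `'trust'` rule-set name are assumptions for illustration:

```sql
-- Sketch: propagate trust scores with the built-in MinLattice.
SELECT pg_ripple.load_rules('trust', '
?x ex:trust (MIN ?t1 ?t2) :- ?x ex:knows ?y, ?y ex:trust ?t1, ?x ex:directTrust ?t2 .
');  -- load_rules(name, text) is assumed for illustration

SELECT pg_ripple.infer('trust');  -- converges by the ascending chain condition
```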
Migration Script
sql/pg_ripple--0.35.0--0.36.0.sql — registers WCOJ and lattice GUCs; creates pg_ripple.create_lattice() SQL function. No VP table schema changes.
Documentation
- `user-guide/sql-reference/datalog.md` updated — document `create_lattice()`, lattice rule syntax, and the lattice GUCs
- `user-guide/best-practices/sparql-performance.md` updated — add a section on cyclic SPARQL pattern detection and WCOJ; when to set `wcoj_min_tables`
- `reference/lattice-datalog.md` (new page) — full tutorial on Datalog^L: lattice types, monotone rules, convergence guarantees, and use cases (trust propagation, interval reasoning, set-valued annotations)
- Release notes for v0.36.0
Exit Criteria
- `sparql_wcoj.sql` and `datalog_lattice.sql` pass in `cargo pgrx regress pg18`.
- A triangle-pattern SPARQL query on a 1M-edge social-graph VP table completes in <10% of the time compared to the standard planner (WCOJ enabled).
- A trust-propagation lattice rule on 100K triples converges to the correct fixed point.
- Migration scripts from 0.1.0 through 0.36.0 run cleanly via `just test-migration`.
v0.37.0 — Storage Concurrency Hardening & Error Safety
Theme: Fix the highest-severity correctness bugs identified in the deep-analysis audit and eliminate all hard panics from library code.
In plain language: This is a reliability release — no new features, but a direct response to the first comprehensive code audit (see plans/PLAN_OVERALL_ASSESSMENT_2.md). Two concurrency bugs that could silently drop deletes or strand predicates in a slow-path table are fixed with proper advisory-lock coordination. Every place in the code that could crash the database server on an unexpected error is replaced with a typed error message. Configuration parameters now validate their inputs, so bad values are caught immediately instead of causing cryptic failures later. A new diagnostic_report() function gives a one-call health check of the running system.

Effort estimate: 9–11 person-weeks
Completed items
Deliverables
- HTAP merge cutover race — fixed (`src/storage/merge.rs`)
  - Wrap the delta→main swap in a per-predicate `pg_advisory_xact_lock`; the concurrent `DELETE` path acquires the same lock in share mode
  - Ensures deletes arriving during a merge cycle are never lost, regardless of timing
  - Add crash-recovery test `tests/crash_recovery/merge_concurrent_delete.sh`: 50 concurrent writers + a 1-second merge interval, assert zero lost deletes after 5 minutes
- Tombstone GC integrated into the merge worker (`src/storage/merge.rs`, `src/worker.rs`)
  - After each successful merge cycle, schedule `VACUUM` on VP tables where `tombstone_count / main_count > pg_ripple.tombstone_gc_threshold`
  - New GUCs: `pg_ripple.tombstone_gc_enabled` (bool, default `true`), `pg_ripple.tombstone_gc_threshold` (float, default `0.05`)
  - pg_regress test `storage_tombstone_gc.sql`: verify tombstones are vacuumed after the threshold is crossed
Rare-predicate promotion — idempotent and serialised (
src/lib.rs,src/storage/mod.rs)- Acquire the per-predicate advisory lock before any promotion attempt
- Use
CREATE TABLE IF NOT EXISTS; wrap data move inWITH moved AS (DELETE … RETURNING *) INSERT INTO vp_N SELECT * FROM moved - Add crash-recovery test
tests/crash_recovery/promotion_race.sh: two backends racing to promote the same predicate, assert exactly one succeeds
-
Dictionary cache rollback on transaction abort (
src/dictionary/mod.rs,src/shmem.rs)- Version-tag each shared-memory cache entry with the inserting
xid; decode path checksTransactionIdDidCommitbefore trusting cached ID - pg_regress test
dictionary_rollback.sql:BEGIN; encode_term('novel:term'); ROLLBACK; encode_term('novel:term')— verify the second encode succeeds without error
- Version-tag each shared-memory cache entry with the inserting
-
Bloom filter saturating counter fix (
src/shmem.rs)- Replace all reference-counter decrements with
saturating_sub(1); document that a counter saturated at 255 is treated conservatively (bit kept set, no false negatives)
- Replace all reference-counter decrements with
-
_pg_ripple.statementsatomic update (src/storage/merge.rs)- Perform SID-range catalog
DELETE + INSERTin the same transaction as the VP table swap - Eliminates the race where a mid-update worker kill leaves a stale SID→OID mapping for RDF-star queries
- Perform SID-range catalog
-
(o, s)index onvp_rare(src/storage/mod.rs)- Add
CREATE INDEX IF NOT EXISTS vp_rare_os_idx ON _pg_ripple.vp_rare (o, s)in bootstrap and migration script - Eliminates sequential scans on object-leading patterns over rare predicates
- Add
-
Eliminate
.expect()/.unwrap()in all library code (src/lib.rs,src/bulk_load.rs,src/sparql/optimizer.rs,src/sparql/sqlgen.rs,src/export.rs,pg_ripple_http/src/main.rs)- Replace all 30+
expect()/unwrap()calls in non-test code withResult-propagating helpers; surface errors viapgrx::error!()at the pg_extern boundary - Add
#![deny(clippy::unwrap_used, clippy::expect_used)]tosrc/lib.rs(test code excluded via#[cfg(test)]) - Fix
pg_ripple_http: replace startup panics with graceful error logging andprocess::exit(1)
- Replace all 30+
-
GUC
check_hookvalidators (src/lib.rs)- Implement validators for all string-enum GUCs:
inference_mode(off/on_demand/materialized),enforce_constraints(off/warn/error),rule_graph_scope(default/all),shacl_mode(off/sync/async),describe_strategy(cbd/scbd) - Implement
min_valbounds for integer GUCs:max_path_depth ≥ 1,property_path_max_depth ≥ 1,merge_threshold ≥ 1,merge_interval_secs ≥ 1 - Promote
pg_ripple.rls_bypasstoPGC_POSTMASTERso it cannot be flipped per-session
- Implement validators for all string-enum GUCs:
-
pg_ripple.diagnostic_report() RETURNS TABLE (key TEXT, value TEXT)(src/lib.rs)- Keys: GUC validity summary, shared-memory cache hit/miss rates, merge backlog (rows in all delta tables), validation queue depth, federation endpoint health, schema_version match
- pg_regress test
diagnostic_report.sql: exercise all fields; assert no null values
-
_pg_ripple.schema_versiontable (src/lib.rs)- Created at install time with columns
version TEXT, installed_at TIMESTAMPTZ, upgraded_from TEXT - Stamped on every
ALTER EXTENSION … UPDATE
- Created at install time with columns
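A minimal sketch of the advisory-lock coordination described in the first deliverable, assuming an illustrative two-integer lock key (the real key derivation is internal to pg_ripple):

```sql
-- Merge worker, around the delta→main cutover for predicate 42:
SELECT pg_advisory_xact_lock(4242, 42);          -- exclusive; released at commit
-- ... swap delta into main ...

-- Concurrent DELETE path for the same predicate:
SELECT pg_advisory_xact_lock_shared(4242, 42);   -- share mode: deletes may run
-- ... mark tombstones ...                       -- alongside each other, but
                                                 -- never overlap a swap
```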
Migration Script
sql/pg_ripple--0.36.0--0.37.0.sql — adds (o, s) index on vp_rare; creates _pg_ripple.schema_version table; registers tombstone_gc_enabled and tombstone_gc_threshold GUCs. No VP table schema changes.
Documentation
- user-guide/operations/troubleshooting.md — new section: "Lost deletes after merge" runbook (cause, detection via diagnostic_report(), fix via advisory lock, upgrade to v0.37.0)
- reference/guc-reference.md — document tombstone_gc_threshold and tombstone_gc_enabled; add a validator-rules table for all enum GUCs; note the rls_bypass scope change
- user-guide/operations/upgrade.md — document the schema_version stamp and how to verify upgrade completeness
- Release notes for v0.37.0
Exit Criteria
No .expect()/.unwrap() in non-test Rust code; clippy deny enforced in CI. The concurrent-delete stress test (merge_concurrent_delete.sh) passes at 50 writers + 1-second merge interval. All GUC enum validators active. diagnostic_report() passes pg_regress. Migration scripts from 0.1.0 through 0.37.0 run cleanly via just test-migration.
v0.38.0 — Architecture Refactoring & Query Completeness
Theme: Split the god-module, introduce the PredicateCatalog abstraction, close SPARQL Update gaps, and wire SHACL hints into the query planner.
In plain language: After 37 releases, the codebase has accumulated structural debt — most visibly in a single 5,600-line "everything" file that makes every change risky. This release pays that debt: the central file is divided into focused modules, and a clean interface between the query engine and the storage layer is introduced so that future storage variants don't require rewriting the query translator. Users gain two concrete improvements: SPARQL UPDATE now supports pattern-based deletions (the commonly needed
DELETE WHERE form that was missing), and SHACL shapes now automatically influence query planning, so queries over shape-constrained predicates are faster.
Effort estimate: 9–11 person-weeks
Completed items (click to expand)
Deliverables
- Split src/lib.rs into subsystem modules
  - Extract src/rare_predicate.rs, src/shacl_admin.rs, src/federation_registry.rs, src/graphrag_admin.rs, src/stats_admin.rs from src/lib.rs
  - Target: src/lib.rs ≤1,500 lines covering _PG_init, GUC registration, extension_sql! blocks, and thin #[pg_extern] delegation shims
  - No change to the public SQL API; all existing pg_ripple.* functions remain
- PredicateCatalog trait and backend-local OID cache (src/storage/catalog.rs new module)
  - Define trait PredicateCatalog { fn resolve(&self, pred_id: i64) -> Option<TableDesc>; }
  - Implement a backend-local HashMap<i64, TableDesc> cache invalidated by a syscache callback on _pg_ripple.predicates
  - Wire into src/sparql/sqlgen.rs and src/datalog/compiler.rs — eliminates the per-atom SPI catalog lookup for hot BGPs
  - New GUC pg_ripple.predicate_cache_enabled (bool, default true)
  - Benchmark: a 10-atom BGP must show 1 catalog SPI call instead of 10
- Refactor validate_shape() → per-constraint helpers (src/shacl/constraints/ new sub-module)
  - One file per constraint family: count.rs, value_type.rs, string_based.rs, logical.rs, property_path.rs, shape_based.rs
  - Each exported function ≤80 lines; the top-level validate_shape() becomes a dispatcher ≤50 lines
  - All existing shacl_*.sql pg_regress tests must pass unchanged
- Refactor translate_pattern() → per-algebra-node helpers (src/sparql/translate/ new sub-module)
  - One file per algebra node: bgp.rs, join.rs, left_join.rs, union.rs, filter.rs, graph.rs, group.rs, distinct.rs
  - Shared context struct TranslateCtx carries the encode cache, catalog handle, and query-level state
  - All existing sparql_*.sql pg_regress tests must pass unchanged
- Batch dictionary encoding in SPARQL translation
  - In translate_pattern, collect all unresolved IRI/literal constants in a first pass; resolve via one encode_terms_batch(&[Term]) -> Vec<i64> SPI call (a single INSERT … ON CONFLICT … RETURNING batch)
  - Benchmark: a BGP with 20 FILTER constants must show 1 encode SPI call instead of 20
- Plan-cache key normalisation (src/sparql/plan_cache.rs)
  - Cache on the algebra digest (serialise the spargebra::Query IR → compact bytes → XXH3-128) instead of raw query text
  - Whitespace and prefix-form variants now share the same cache slot
- SCBD DESCRIBE — implemented (src/sparql/mod.rs)
  - Implement Symmetric Concise Bounded Description: all triples where the resource is subject or object, with blank-node recursion
  - describe_strategy = 'scbd' is now functional; remove the "not implemented" caveat from the docs
- SPARQL Update: DELETE WHERE / INSERT WHERE / graph management (src/sparql/update.rs)
  - Implement DELETE { … } WHERE { … }, INSERT { … } WHERE { … }, DELETE WHERE { … } (see the example after this list)
  - Implement graph management: CLEAR GRAPH, DROP GRAPH, COPY, MOVE, ADD
  - pg_regress test sparql_update_advanced.sql: pattern-based deletes spanning multiple VP tables; cross-graph COPY/MOVE
- Consolidate property-path depth GUCs (src/lib.rs)
  - Deprecate property_path_max_depth; make it an alias for max_path_depth with a one-time NOTICE
- Wire SHACL hints into SPARQL planner (src/shacl/hints.rs new module, src/sparql/sqlgen.rs)
  - At query-translation time, query _pg_ripple.shape_hints (populated from loaded shapes) per predicate
  - sh:maxCount 1 → suppress DISTINCT on that predicate's join; sh:minCount 1 → downgrade LEFT JOIN to INNER JOIN
  - pg_regress test shacl_sparql_hints.sql: verify join-type changes with and without shapes; assert result equivalence
- SPARQL 1.1 conformance suite in CI (allowed-to-warn job)
  - Download the W3C SPARQL 1.1 test suite; run via cargo pgrx regress; report pass/skip/fail counts
  - Publish the conformance percentage in CHANGELOG.md per release
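As an illustration of the DELETE WHERE form users gain, the snippet below removes every triple about inactive people in one statement. The update text is standard SPARQL 1.1 Update; pg_ripple.sparql_update() is a hypothetical entry-point name, since this plan does not pin down the SQL function that accepts updates.

```sql
-- pg_ripple.sparql_update() is a hypothetical entry-point name (see lead-in);
-- the update text itself is standard SPARQL 1.1 Update.
SELECT pg_ripple.sparql_update('
PREFIX ex: <http://example.org/>
DELETE WHERE {
  ?person ex:status "inactive" ;
          ?p ?o .
}
');
-- Removes every triple whose subject has ex:status "inactive", spanning
-- however many VP tables the ?p bindings touch.
```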
Migration Script
sql/pg_ripple--0.37.0--0.38.0.sql — creates _pg_ripple.shape_hints table; registers predicate_cache_enabled GUC. No VP table schema changes.
Documentation
- reference/architecture.md — Mermaid architecture diagram showing the post-refactor module boundaries (dictionary → storage/catalog → sparql/translate + datalog/compiler → shacl/constraints → views/exporters)
- user-guide/sql-reference/sparql-update.md — document DELETE WHERE / INSERT WHERE / CLEAR / COPY / MOVE / ADD with examples
- reference/guc-reference.md — predicate_cache_enabled; deprecation notice for property_path_max_depth
- user-guide/performance/query-planning.md — new section on SHACL hints and their effect on join selection
- Release notes for v0.38.0
Exit Criteria
src/lib.rs ≤1,500 lines. Each translate/ module file ≤200 lines. validate_shape() dispatcher ≤50 lines. SCBD DESCRIBE tests pass. SPARQL Update advanced tests pass. SHACL hints pg_regress passes. Predicate OID cache reduces SPI calls for 10-atom BGP from 10 to 1. Migration chain test passes.
v0.39.0 — Datalog HTTP API for pg_ripple_http
Theme: Expose all pg_ripple Datalog SQL functions as a REST API in the pg_ripple_http companion service.
In plain language: The
pg_ripple_http service currently speaks only SPARQL. This release adds a /datalog namespace that lets any HTTP client — without a PostgreSQL driver — manage rule sets, trigger inference, run goal-directed queries, check integrity constraints, and inspect monitoring statistics. The implementation is a thin axum layer; all heavy lifting stays inside the PostgreSQL extension.
Effort estimate: 3–5 person-weeks
Implementation plan: plans/pg_ripple_http_datalog.md
Completed items (click to expand)
Deliverables
- Extract shared helpers (pg_ripple_http/src/common.rs new module)
  - Move AppState, check_auth(), redacted_error(), and env_or() from main.rs to common.rs
  - Both SPARQL and Datalog handlers import from this module
- Phase 1 — Rule management (pg_ripple_http/src/datalog.rs new module)
  - POST /datalog/rules/{rule_set} — body text/x-datalog; calls pg_ripple.load_rules($1, $2); returns {"rule_set": "…", "rules_loaded": N}
  - POST /datalog/rules/{rule_set}/builtin — calls pg_ripple.load_rules_builtin($1)
  - GET /datalog/rules — calls pg_ripple.list_rules(); returns a JSONB array
  - DELETE /datalog/rules/{rule_set} — calls pg_ripple.drop_rules($1); returns {"deleted": N}
  - POST /datalog/rules/{rule_set}/add — single-rule add; calls pg_ripple.add_rule($1, $2)
  - DELETE /datalog/rules/{rule_set}/{rule_id} — calls pg_ripple.remove_rule($1::bigint) (triggers DRed)
  - PUT /datalog/rules/{rule_set}/enable — calls pg_ripple.enable_rule_set($1)
  - PUT /datalog/rules/{rule_set}/disable — calls pg_ripple.disable_rule_set($1)
- Phase 2 — Inference (pg_ripple_http/src/datalog.rs)
  - POST /datalog/infer/{rule_set} — calls pg_ripple.infer($1); returns {"derived": N}
  - POST /datalog/infer/{rule_set}/stats — calls pg_ripple.infer_with_stats($1); returns the full stats JSONB
  - POST /datalog/infer/{rule_set}/agg — calls pg_ripple.infer_agg($1)
  - POST /datalog/infer/{rule_set}/wfs — calls pg_ripple.infer_wfs($1)
  - POST /datalog/infer/{rule_set}/demand — body {"demands": […]}; calls pg_ripple.infer_demand($1, $2::jsonb)
  - POST /datalog/infer/{rule_set}/lattice — body {"lattice": "min"}; calls pg_ripple.infer_lattice($1, $2)
- Phase 3 — Query & constraints (pg_ripple_http/src/datalog.rs)
  - POST /datalog/query/{rule_set} — body is Datalog goal text; calls pg_ripple.infer_goal($1, $2); returns {"derived": N, "iterations": N, "matching": […]}
  - GET /datalog/constraints — calls pg_ripple.check_constraints(NULL); returns the violation array
  - GET /datalog/constraints/{rule_set} — calls pg_ripple.check_constraints($1)
- Phase 4 — Admin & monitoring (pg_ripple_http/src/datalog.rs)
  - GET /datalog/stats/cache — calls pg_ripple.rule_plan_cache_stats()
  - GET /datalog/stats/tabling — calls pg_ripple.tabling_stats()
  - GET /datalog/lattices — calls pg_ripple.list_lattices()
  - POST /datalog/lattices — body {"name": "…", "join_fn": "…", "bottom": "…"}; calls pg_ripple.create_lattice($1, $2, $3)
  - GET /datalog/views — calls pg_ripple.list_datalog_views()
  - POST /datalog/views — body JSON; calls pg_ripple.create_datalog_view(…)
  - DELETE /datalog/views/{name} — calls pg_ripple.drop_datalog_view($1)
- Route registration (pg_ripple_http/src/main.rs)
  - mod datalog; and mod common; declarations
  - 24 .route(…) entries wired under /datalog
- Metrics extension (pg_ripple_http/src/metrics.rs)
  - Add a datalog_queries: AtomicU64 counter; expose it as pg_ripple_http_datalog_queries_total in /metrics
- Authentication & security
  - All /datalog/* handlers call check_auth() — same token as SPARQL
  - Optional write-protection: a PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN env var gates the POST /datalog/rules/*, DELETE, and PUT endpoints independently of the read token
  - All SQL calls use $1, $2, … parameterized queries — never string concatenation
  - Request body limit: 10 MB via axum::body::to_bytes(body, 10 * 1024 * 1024)
- Error mapping
  - 400 datalog_parse_error — malformed rule text returned by the extension
  - 400 datalog_goal_error — invalid goal pattern
  - 400 invalid_request — missing body, wrong content-type, non-numeric rule_id
  - 404 rule_set_not_found — infer/drop on a nonexistent rule set
  - 503 service_unavailable — pool exhausted
- Migration script sql/pg_ripple--0.38.0--0.39.0.sql
  - No schema changes to pg_ripple itself; comment-only header documenting the new HTTP surface
- Tests
  - Integration tests using axum-test (or equivalent): round-trip load → infer → query goal → drop for the custom rule set (the same round-trip is shown in SQL after this list)
  - Error-path tests: malformed Datalog, missing auth, oversized body
  - Smoke test script tests/datalog_http_smoke.sh (curl-based)
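Every endpoint is a thin wrapper over an existing SQL function, so the round-trip the integration tests exercise can be reproduced in plain SQL. The rule and goal texts below follow the triple-pattern Datalog style used elsewhere in this plan and are illustrative:

```sql
SELECT pg_ripple.load_rules('custom', '
  ?x ex:reachable ?y :- ?x ex:edge ?y .
  ?x ex:reachable ?z :- ?x ex:reachable ?y, ?y ex:edge ?z .
');                                                    -- POST /datalog/rules/custom
SELECT pg_ripple.infer('custom');                      -- POST /datalog/infer/custom
SELECT pg_ripple.infer_goal('custom',
                            '?x ex:reachable ?y');     -- POST /datalog/query/custom
SELECT pg_ripple.drop_rules('custom');                 -- DELETE /datalog/rules/custom
```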
Documentation
- pg_ripple_http/README.md — new ## Datalog API section with curl examples for all 24 endpoints, content types, and error codes
- Release notes for v0.39.0
Exit Criteria
All 24 Datalog endpoints respond correctly in integration tests. GET /datalog/rules returns the JSONB array from list_rules(). POST /datalog/infer/custom triggers materialization and returns {"derived": N}. GET /datalog/constraints returns violation JSONB. Auth check rejects requests with invalid token. Parameterized-query requirement verified by code review (no format!() calls mixing user input into SQL strings). Migration chain test passes.
v0.40.0 — Streaming Results, Explain & Observability
Theme: Streaming cursor API for large result sets, first-class query explain, and full observability stack.
In plain language: Three long-requested developer and operator improvements land together. Large SPARQL queries can now stream their results instead of materialising everything in memory — making it safe to CONSTRUCT or export millions of triples without running out of memory. A new
explain_sparql() function shows exactly what SQL the SPARQL engine generated, with cardinality estimates and actual timings in EXPLAIN ANALYZE format but with RDF IRIs instead of internal numbers. A new explain_datalog() function does the same for Datalog rule sets. Every significant operation now emits OpenTelemetry spans, and diagnostic_report() gives a one-call health summary of the running system.
Effort estimate: 9–11 person-weeks
Completed items (click to expand)
Deliverables
- Streaming SPARQL cursor API (src/sparql/cursor.rs new module)
  - pg_ripple.sparql_cursor(query TEXT) RETURNS SETOF RECORD — SRF paging through results 1024 rows at a time with batched dictionary decode
  - pg_ripple.sparql_cursor_turtle(query TEXT) RETURNS SETOF TEXT — emits Turtle lines
  - pg_ripple.sparql_cursor_jsonld(query TEXT) RETURNS SETOF TEXT — emits JSON-LD object chunks
  - Wire to pg_ripple_http: Accept: text/turtle or Accept: application/ld+json triggers a Transfer-Encoding: chunked streaming response
  - pg_regress test sparql_cursor.sql: load 500K triples; verify the cursor returns the correct count; verify chunked Turtle export round-trips
- Resource governors (src/lib.rs)
  - pg_ripple.sparql_max_rows (integer, default 0 = unlimited)
  - pg_ripple.datalog_max_derived (integer, default 0 = unlimited)
  - pg_ripple.export_max_rows (integer, default 0 = unlimited)
  - pg_ripple.sparql_overflow_action (enum: warn/error, default warn)
  - Error codes: PT640 (SPARQL row limit exceeded), PT641 (Datalog derived limit exceeded), PT642 (export row limit exceeded)
- pg_ripple.explain_sparql(query TEXT, analyze BOOLEAN DEFAULT false) RETURNS JSONB (src/sparql/explain.rs new module)
  - Step 1: parse + optimise via spargebra/sparopt; emit the algebra tree as JSON with predicate IRIs decoded
  - Step 2: run EXPLAIN (FORMAT JSON, BUFFERS true [, ANALYZE true]) on the generated SQL; attach it as the "plan" key
  - Output keys: "algebra", "sql" (IRI-decoded), "plan", "cache_hit" (bool), "encode_calls" (int)
  - pg_regress test sparql_explain_jsonb.sql: verify all output keys; verify analyze: true adds "Actual Rows"
- pg_ripple.explain_datalog(rule_set_name TEXT) RETURNS JSONB (src/datalog/explain.rs new module)
  - Returns the per-stratum dependency graph, magic-set rewritten rules, compiled SQL per rule, and per-iteration delta-row counts from the last inference run
  - Output keys: "strata", "rules" (rewritten), "sql_per_rule", "last_run_stats"
  - pg_regress test datalog_explain.sql
- pg_ripple.cache_stats() RETURNS JSONB and pg_ripple.reset_cache_stats() (src/lib.rs)
  - Keys: plan cache size/hits/misses, dict cache hits/misses, federation cache hits/misses
  - pg_regress test cache_stats.sql
- pg_ripple.stat_statements_decoded view (src/lib.rs)
  - View over pg_stat_statements that regex-decodes predicate IDs in the query text via a pg_ripple.decode_id() join; exposes a query_decoded column
- OpenTelemetry tracing (src/telemetry.rs new module)
  - Thin facade over the tracing crate; spans for: SPARQL parse/translate/execute, merge cycle (per predicate), federation call (per SERVICE), Datalog inference (per stratum)
  - GUC pg_ripple.tracing_enabled (bool, default false) — zero overhead when off
  - GUC pg_ripple.tracing_exporter (string: stdout/otlp, default stdout); otlp reads OTEL_EXPORTER_OTLP_ENDPOINT
  - pg_regress test telemetry.sql: toggle on/off; assert no performance regression in the execute path with tracing off
- Bug fix: OPTIONAL {} inside GRAPH {} silently fails for all predicates (src/sparql/sqlgen.rs) — see the example query shape after this list
  - Root cause: the GraphPattern::Graph handler applies the named-graph filter after the inner pattern is fully translated. When the inner pattern contains an OPTIONAL (spargebra LeftJoin), the LeftJoin translator wraps both sides in aliased subqueries that only project _lj_<varname> columns — the g column is intentionally stripped. The Graph handler then emits {lj_alias}.g = {gid}, which PostgreSQL rejects with column does not exist. This fails for all predicates (both dedicated VP tables and vp_rare); it was first observed with vp_rare predicates (rdfs:subClassOf, rdfs:label, etc.) only because typical test graphs have very few schema triples.
  - Correct fix — graph-filter context propagation (src/sparql/sqlgen.rs, Ctx):
    - Add graph_filter: Option<i64> to Ctx.
    - In GraphPattern::Graph, set ctx.graph_filter = Some(gid) before recursing into the inner pattern, then clear it after.
    - In translate_bgp / table_expr / build_all_predicates_union, when ctx.graph_filter is Some(gid), inject WHERE g = {gid} (or AND g = {gid}) directly into each VP table scan.
    - Remove the post-hoc for (alias, _) in &frag.from_items { frag.conditions.push(format!("{alias}.g = {gid}")); } loop from the Graph handler — the filter is now baked into every leaf VP scan before any LEFT JOIN, WITH RECURSIVE, or subquery wrapper is built.
  - This also fixes OPTIONAL {} combined with GROUP BY on variables from the optional side, and OPTIONAL {} inside GRAPH {} with FILTER, property paths, nested UNION, and federated SERVICE sub-patterns.
  - Regression tests:
    - sparql_optional_in_graph.sql — OPTIONAL triple with a dedicated-VP predicate inside a named graph; assert NULL vs non-NULL row counts
    - sparql_optional_in_graph_rare.sql — the same pattern with a vp_rare predicate; assert NULL vs non-NULL row counts
    - sparql_optional_group_by_in_graph.sql — OPTIONAL + GROUP BY on an optional variable inside a named graph (the original failing query shape); assert instanceCount per class is correct
- Bug fix: property path inside GRAPH {} fails for all predicates (src/sparql/sqlgen.rs)
  - Root cause: identical to the OPTIONAL bug above — the WITH RECURSIVE CTE emitted for the property path operators (+, *, ?) selects only (s, o), but the post-hoc Graph handler tries to reference {cte_alias}.g, producing column does not exist.
  - Fix: the same graph-filter context propagation as above; the anchor and recursive-step selects must include g and filter on it when ctx.graph_filter is set, rather than relying on the outer Graph handler to inject the condition.
  - Regression test: sparql_path_in_graph.sql — property path on a rare predicate inside a named graph; assert the correct row count
- Migration header standardisation (sql/*.sql)
  - Backfill headers in all existing scripts: -- Migration X.Y.Z → A.B.C | Schema changes: … | Data-rewrite cost: Low/Medium/High | Downgrade: …
  - All future scripts from v0.37.0 onward follow this template automatically
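For reference, the query shape that exposed the OPTIONAL-inside-GRAPH bug looks like the following (illustrative vocabulary). Before the fix it aborted with column does not exist; with graph-filter propagation it returns one row per class, with instanceCount possibly zero:

```sql
SELECT * FROM pg_ripple.sparql('
PREFIX ex:   <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class (COUNT(?inst) AS ?instanceCount) WHERE {
  GRAPH ex:g1 {
    ?class a rdfs:Class .
    OPTIONAL { ?inst a ?class . }   # optional side: the g column was stripped here
  }
}
GROUP BY ?class                      # aggregate over the optional-side variable
');
```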
Migration Script
sql/pg_ripple--0.39.0--0.40.0.sql — registers new GUCs (sparql_max_rows, datalog_max_derived, export_max_rows, sparql_overflow_action, tracing_enabled, tracing_exporter). No VP table schema changes.
Documentation
- user-guide/sql-reference/explain.md — full tutorial on explain_sparql() and explain_datalog(); reading the algebra tree and the decoded SQL
- user-guide/sql-reference/cursor-api.md — streaming cursor API; format options; resource governors
- reference/observability.md (new) — OpenTelemetry integration guide: exporter setup, span taxonomy, Grafana/Jaeger integration examples
- user-guide/operations/monitoring.md — cache_stats(), diagnostic_report(), stat_statements_decoded usage
- reference/error-reference.md — PT640, PT641, PT642 documented
- Release notes for v0.40.0
Exit Criteria
sparql_cursor.sql passes with 500K triples. explain_sparql() returns IRI-decoded algebra and SQL. OpenTelemetry spans emitted for a sample query when tracing_enabled = on. All resource governor tests pass. stat_statements_decoded returns decoded query text. sparql_optional_in_graph.sql, sparql_optional_in_graph_rare.sql, and sparql_optional_group_by_in_graph.sql all pass (OPTIONAL inside GRAPH). sparql_path_in_graph.sql passes (property path inside GRAPH). Migration chain test passes.
v0.41.0 — Full W3C SPARQL 1.1 Test Suite
Theme: Complete standards conformance verification via the full W3C SPARQL 1.1 test suite, run in parallel under 2 minutes in CI.
In plain language: Every major SPARQL engine bug — including the
OPTIONAL inside GRAPH failure found in April 2026 — was caught by manual testing rather than by the test suite. This version fixes that by implementing a full harness for the official W3C SPARQL 1.1 test suite (~3,000 tests), parallelized across 8 workers so the entire suite completes in under 2 minutes. The harness parses W3C test manifests, auto-loads RDF fixtures per test, runs queries against a live pg_ripple instance, and validates results using RDF graph equivalence (not row counting). Per-category pass rates are reported in CI so regressions are caught immediately. A curated 180-test "smoke" subset (Graph Patterns + Aggregates) runs on every PR in under 30 seconds.
Effort estimate: 5–7 person-weeks
Deliverables
- W3C manifest parser (tests/w3c/manifest.rs new module)
  - Parse W3C SPARQL 1.1 test manifests (Turtle format, mf:Manifest) into a structured TestCase struct
  - Fields: test IRI, type (mf:QueryEvaluationTest, mf:UpdateEvaluationTest, mf:PositiveSyntaxTest, mf:NegativeSyntaxTest), query file, data file(s), result file, named graph files
  - Covers all 13 sub-suites: aggregates, bind, exists, functions, grouping, negation, optional, project-expression, property-path, service, subquery, syntax-query, update
  - Tests with type mf:NotClassifiedByEarlYet are skipped with SKIP status
- RDF fixture loader (tests/w3c/loader.rs new module)
  - Load .ttl/.n3/.rdf/.srx/.srj fixture files from tests/w3c/data/ into a temporary pg_ripple graph before each test
  - Use named graph IRIs matching the manifest's mf:graphData entries
  - Auto-teardown: drop the temporary named graph after the test completes (regardless of pass/fail)
  - Handle multi-graph datasets: mf:defaultGraph → the default graph (g = 0); mf:namedGraphs → individual named graphs
- Result validator (tests/w3c/validator.rs new module)
  - SELECT queries: compare against .srx (SPARQL Results XML) or .srj (SPARQL Results JSON); validate variable names and bindings as RDF term equality (IRI, blank node, literal with datatype and lang tag)
  - ASK queries: compare the boolean result against .srx/.srj
  - CONSTRUCT/DESCRIBE queries: compare the result graph against the .ttl reference using graph isomorphism (blank-node-normalised; uses oxrdf for in-memory graph comparison)
  - UPDATE queries: compare the post-update store state (all named graphs) against the expected .ttl reference
  - Blank-node handling: rename blank nodes in both actual and expected output by canonical DFS traversal before comparison
  - Report a per-binding diff on failure: expected term vs. actual term
- Parallel test runner (tests/w3c/runner.rs new module)
  - cargo test --test w3c_suite -- --test-threads 8 — each thread picks tests from a shared work queue (lock-free crossbeam channel)
  - Each thread owns an isolated pg_ripple named-graph namespace (prefix _w3c_t{thread_id}_) to prevent cross-test pollution
  - Test timeout: 5 seconds per test; timed-out tests are marked TIMEOUT, not FAIL
  - Progress: an indicatif progress bar per thread in local runs; plain line-per-test output in CI
  - Output report: per-category pass/fail/skip/timeout counts + per-test detail for any failure
  - Target: the full 3,000-test suite completes in < 2 minutes on an 8-core CI runner (AWS c7g.2xlarge or equivalent)
- Smoke subset (tests/w3c_smoke.rs)
  - 180-test curated subset: optional (80 tests), aggregates (60 tests), grouping (40 tests) — the three categories most likely to expose SQL-generation bugs
  - Runs on every PR via cargo test --test w3c_smoke; completes in < 30 seconds
  - Failures block merge (added to the required status checks in .github/workflows/ci.yml)
- CI integration (.github/workflows/ci.yml)
  - New job w3c-suite: runs after the existing pgrx-test job; parallelized 8-way; uploads the test report as an artifact
  - New job w3c-smoke: runs on every PR and push to main; required check
  - The full-suite job is optional (non-blocking) until the pass rate reaches 95%; then it is promoted to required
  - Cache: W3C test fixtures (tests/w3c/data/) cached by the SHA of the manifest files
- Test data download script (scripts/fetch_w3c_tests.sh)
  - Downloads the official W3C SPARQL 1.1 test suite from https://www.w3.org/2009/sparql/docs/tests/
  - Verified against known SHA-256 checksums of the manifest files
  - Output: tests/w3c/data/ directory (gitignored; fetched by CI and locally on first run)
- Known-failures manifest (tests/w3c/known_failures.txt)
  - List of W3C test IRIs that currently fail, with a one-line reason for each (e.g., OPTIONAL inside GRAPH — fix in v0.40.0, property path with GRAPH — fix in v0.40.0)
  - Failures in known_failures.txt are reported as XFAIL (expected failure), not FAIL
  - Any test in known_failures.txt that unexpectedly passes is reported as XPASS and causes a CI warning
  - Target at release: 0 XFAIL entries in the smoke subset; ≤ 50 XFAIL entries in the full suite (SERVICE tests against live external endpoints are always SKIP)
- Pass-rate tracking (tests/w3c/report.json)
  - CI uploads a report.json artifact with per-category pass/fail/skip/timeout counts and the overall pass rate
  - Historical pass-rate trend displayed in a README.md badge
Migration Script
sql/pg_ripple--0.40.0--0.41.0.sql — no schema changes. Adds a comment-only header noting that v0.41.0 is a test infrastructure release.
Documentation
- reference/w3c-conformance.md — per-category W3C SPARQL 1.1 conformance table: test count, pass count, known failures with ticket links
- reference/running-w3c-tests.md (new) — how to run the smoke subset and the full suite locally; how to add a new expected failure; how to interpret XFAIL vs XPASS
- README.md — W3C SPARQL 1.1 conformance section updated
- Release notes for v0.41.0
Exit Criteria
Smoke subset (180 tests) passes with 0 unexpected failures on main. Full suite (3,000+ tests) runs in < 2 minutes on an 8-core CI runner. Per-category pass rate report uploaded as CI artifact. Known-failures manifest has 0 entries for optional and aggregates categories (those bugs fixed in v0.40.0). Migration chain test passes through 0.41.0.
v0.42.0 — Parallel Merge, Cost-Based Federation & Live CDC
Theme: Multi-worker HTAP merge, intelligent federation query planning, and real-time RDF change subscriptions.
In plain language: Three architectural improvements that close the last major gaps before the 1.0 production release. The merge worker — which keeps the read-optimised main partition in sync with incoming writes — is upgraded from a single process to a configurable pool of parallel workers, each responsible for a subset of predicates, directly improving write throughput for workloads with many distinct predicates. Federation queries now use a cost model to pick the best execution order and run independent fragments in parallel, eliminating the serial bottleneck. And for the first time, applications can subscribe to a real-time stream of triple changes filtered by SPARQL pattern or SHACL shape, enabling reactive GraphRAG pipelines, live dashboards, and ML feature stores without polling.
Effort estimate: 10–12 person-weeks
Deliverables
- Parallel merge worker pool (src/worker.rs, src/storage/merge.rs)
  - New GUC pg_ripple.merge_workers (integer, default 1, max 16) — spawns N BackgroundWorker processes, each managing a disjoint round-robin subset of predicates
  - The per-predicate pg_advisory_lock (from v0.37.0) ensures no two workers race on the same VP table
  - Work-stealing: idle workers check the global queue for any predicate above pg_ripple.merge_threshold not yet claimed
  - Stress test tests/stress/parallel_merge.sh: 100 concurrent writers × 100 predicates × 4 workers; assert correctness and no deadlocks after 10 minutes
  - Benchmark: 4 merge workers on a workload with 100 distinct predicates show ≥3× throughput vs. a single worker
- owl:sameAs cluster size bound (src/datalog/builtins.rs)
  - New GUC pg_ripple.sameas_max_cluster_size (integer, default 100_000)
  - Detect over-large equivalence classes during canonicalization; emit a PT550 WARNING and short-circuit with a Tarjan-SCC sampling approximation
  - pg_regress test sameas_large_cluster.sql
- VoID statistics catalog per federation endpoint (src/sparql/federation.rs, _pg_ripple.endpoint_stats table)
  - On endpoint registration, fetch and cache the endpoint's VoID description
  - Refresh driven by the new GUC pg_ripple.federation_stats_ttl_secs (integer, default 3600)
  - Statistics used by the planner: triple count per predicate, distinct subjects/objects
- Cost-based federation source selection (src/sparql/federation_planner.rs new module)
  - FedX-style planner: for each BGP atom, rank endpoints by estimated selectivity using VoID stats; assign each atom to its best source
  - Independent atoms (no shared variables) are scheduled for parallel execution
  - GUC pg_ripple.federation_planner_enabled (bool, default true)
  - GUC pg_ripple.federation_parallel_max (integer, default 4)
  - GUC pg_ripple.federation_parallel_timeout (integer, default 60 seconds)
  - pg_regress test federation_planner.sql: two registered mock endpoints; verify atom routing and timeout behaviour
- Parallel SERVICE execution (src/sparql/federation.rs)
  - Independent SERVICE clauses are dispatched concurrently via background workers; results are reassembled before the outer join
  - Bounded by pg_ripple.federation_parallel_max
- Federation result streaming (src/sparql/federation.rs)
  - SERVICE responses exceeding pg_ripple.federation_inline_max_rows (new GUC, default 10_000) are spooled into a temporary table rather than inlined as VALUES
  - Error code PT620 INFO when spooling is triggered
- IP/CIDR allowlist for federation endpoints (src/sparql/federation.rs)
  - Resolve the hostname on endpoint registration; deny RFC 1918, link-local (169.254.x.x), loopback, and IPv6 link-local addresses by default
  - New GUC pg_ripple.federation_allow_private (bool, default false) to override
  - Error code PT621 when a private-IP endpoint is rejected
- HTTPS certificate validation for the HTTP companion (pg_ripple_http/src/main.rs)
  - Default to the system trust store via rustls-native-certs
  - Env var PG_RIPPLE_HTTP_CA_BUNDLE — path to a custom CA PEM for private-PKI federation targets
  - Reject self-signed certificates unless PG_RIPPLE_HTTP_ALLOW_SELF_SIGNED=true
  - Fix CORS defaults: explicit origin allowlist via PG_RIPPLE_HTTP_CORS_ORIGINS; * requires opt-in
  - Fix X-Forwarded-For: trust it only when PG_RIPPLE_HTTP_TRUST_PROXY env lists the upstream IP/CIDR
  - Body limit configurable via PG_RIPPLE_HTTP_MAX_BODY_BYTES (default 10_485_760)
- Live RDF CDC subscriptions (src/cdc.rs, pg_ripple_http/src/ws.rs new module)
  - pg_ripple.create_subscription(name TEXT, filter_sparql TEXT DEFAULT NULL, filter_shape TEXT DEFAULT NULL) RETURNS BOOLEAN
  - Publishes via NOTIFY pg_ripple_cdc_{name} with JSON payload: {"op": "add"|"remove", "s": "…", "p": "…", "o": "…", "g": "…"}
  - WebSocket endpoint /ws/subscriptions/{name} in pg_ripple_http; supports text/turtle, application/ld+json, application/json via Accept
  - Optional SPARQL filter: only matching triples are published; optional SHACL filter: only shape-violating triples are published
  - pg_ripple.drop_subscription(name TEXT), pg_ripple.list_subscriptions() RETURNS TABLE
  - New catalog table _pg_ripple.subscriptions (name, filter_sparql, filter_shape, created_at, queue_table_oid)
  - pg_regress test cdc_subscriptions.sql: create a subscription, insert triples, verify LISTEN receives the expected payloads (see the usage sketch after this list)
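A minimal usage sketch of the subscription API over LISTEN/NOTIFY; the filter pattern form and the sample payload are illustrative:

```sql
-- Subscribe to foaf:knows changes only (filter_sparql pattern form shown
-- here is illustrative).
SELECT pg_ripple.create_subscription(
  'knows_feed',
  '?s <http://xmlns.com/foaf/0.1/knows> ?o'
);
LISTEN pg_ripple_cdc_knows_feed;
-- A later matching insert delivers a payload such as:
--   {"op": "add", "s": "ex:alice", "p": "foaf:knows", "o": "ex:bob", "g": ""}
SELECT pg_ripple.drop_subscription('knows_feed');
```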
Migration Script
sql/pg_ripple--0.41.0--0.42.0.sql — creates _pg_ripple.endpoint_stats table; creates _pg_ripple.subscriptions table; registers new GUCs (merge_workers, sameas_max_cluster_size, federation_stats_ttl_secs, federation_planner_enabled, federation_parallel_max, federation_parallel_timeout, federation_inline_max_rows, federation_allow_private).
Documentation
- user-guide/operations/merge-workers.md (new) — tuning merge_workers for predicate-rich workloads; monitoring via diagnostic_report()
- user-guide/features/cdc-subscriptions.md (new) — complete tutorial: subscribe, filter, consume via SQL LISTEN and WebSocket; integration patterns with GraphRAG, ML feature stores, and live dashboards
- user-guide/features/federation.md — updated: VoID stats, cost-based planner, parallel SERVICE, result streaming, IP restrictions
- reference/guc-reference.md — all new GUCs documented; security guidance on federation_allow_private
- reference/error-reference.md — PT550, PT620, PT621 documented
- Release notes for v0.42.0
Exit Criteria
Parallel merge stress test passes (100 writers, 4 workers, no lost deletes). VoID stats fetched on endpoint registration. Independent SERVICE clauses execute in parallel (verifiable via explain_sparql()). CDC subscription delivers NOTIFY payloads for all inserts matching the filter. HTTPS cert validation enforced in pg_ripple_http. Migration chain test passes through 0.42.0.
v0.43.0 — WatDiv + Jena Conformance Suite
Theme: Scale-correctness and semantic edge-case coverage via the WatDiv benchmark and Apache Jena test suite, reusing the harness infrastructure from v0.41.0.
In plain language: W3C conformance (v0.41.0) proves pg_ripple is correct on small, well-defined fixtures. This release proves it is correct at scale and on the implementation edge cases that W3C deliberately leaves underspecified. WatDiv loads 10M–100M triples and runs 100–1,000 queries across four complexity levels (star, chain, snowflake, complex) — catching SQL planner regressions and VP table performance cliffs that only appear under realistic data distributions. Apache Jena contributes ~1,000 additional tests covering type coercion corner cases, timezone handling in date comparisons, numeric precision, and blank-node scoping rules that the W3C suite glosses over.
Effort estimate: 5–7 person-weeks (90% infrastructure reuse from v0.41.0)
Deliverables
- Apache Jena adapter (tests/jena/ new module)
  - Adapt the v0.41.0 manifest parser to handle Jena-specific manifest fields (jt:QueryEvaluationTest, jt:UpdateEvaluationTest) and Jena result extensions (e.g. rdf:XMLLiteral, extended numeric types)
  - ~1,000 tests across Jena's sparql-query, sparql-update, sparql-syntax, and algebra sub-suites
  - Reuse the v0.41.0 RDF fixture loader, result validator, parallel runner, and known-failures manifest format
  - Specific coverage targets:
    - Type coercion: XSD numeric promotions (xsd:integer → xsd:decimal → xsd:double); mixed-type comparisons
    - Date/time: timezone-aware xsd:dateTime comparisons; the NOW(), YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ() builtins (see the example after this list)
    - Numeric precision: xsd:decimal arithmetic; ROUND(), CEIL(), FLOOR(), ABS()
    - Blank-node scoping: blank nodes in CONSTRUCT templates; blank nodes across GRAPH boundaries; blank-node identity in OPTIONAL
    - String functions: STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT()
  - Target: the full Jena suite completes in < 3 minutes alongside the W3C suite on CI
  - New CI job jena-suite — non-blocking until the pass rate is ≥ 95%; then promoted to required
- WatDiv harness (tests/watdiv/ new module)
  - Data generation: integrate a watdiv Rust port or call the upstream C++ binary via std::process::Command; generate the 10M-triple dataset once and cache it in CI artifact storage
  - Query templates: all 100 WatDiv query templates across four structural classes:
    - Star (S1–S7): all predicates share a single subject; tests VP table scan and star-join optimisation
    - Chain (C1–C3): predicates form a linear path; tests join ordering
    - Snowflake (F1–F5): star + chain hybrid; tests mixed join strategies
    - Complex (B1–B12, L1–L5): multi-hop patterns with OPTIONAL and UNION; tests the full algebra
  - Correctness validation: run each query against a baseline (pre-computed expected cardinalities from a reference run) and assert the row count is within ±0.1%
  - Performance baseline: record the median query latency per template at 10M triples; flag regressions > 20% in CI
  - Separate cargo bench --bench watdiv target using criterion — feeds into benchmarks/results
  - Target: the full 100-template suite at 10M triples completes in < 5 minutes on an 8-core CI runner
  - New CI job watdiv-suite — non-blocking (performance regressions are warnings, not failures)
- Shared harness improvements (backport to tests/w3c/)
  - Unified tests/conformance/runner.rs — a single parallel runner used by W3C, Jena, and WatDiv; eliminates code duplication
  - Unified known_failures.txt format with a suite: prefix (e.g. w3c:, jena:, watdiv:)
  - Unified CI report artifact: per-suite pass/fail/skip/timeout counts in one conformance_report.json
- Test data download script (scripts/fetch_conformance_tests.sh)
  - Extends scripts/fetch_w3c_tests.sh to also download the Jena test suite from an Apache mirror and the WatDiv query templates from GitHub
  - All downloads verified against SHA-256 checksums
  - The WatDiv 10M dataset is generated once and stored as a CI artifact (not re-generated on every run)
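One example of the semantic edge cases the Jena sub-suite pins down: timezone-aware xsd:dateTime comparison. The data and vocabulary are illustrative; 10:00-05:00 and 15:00Z denote the same instant, so the filter must match on the timeline rather than the lexical form:

```sql
SELECT * FROM pg_ripple.sparql('
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?e WHERE {
  ?e ex:at ?t .   # e.g. "2026-01-01T10:00:00-05:00"^^xsd:dateTime
  FILTER (?t = "2026-01-01T15:00:00Z"^^xsd:dateTime)
}
');
```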
Migration Script
sql/pg_ripple--0.42.0--0.43.0.sql — no schema changes. Comment-only header noting that v0.43.0 is a test infrastructure release.
Documentation
- reference/w3c-conformance.md — updated to include Jena sub-suite pass rates alongside the W3C categories
- reference/watdiv-results.md (new) — WatDiv benchmark results table: query class, template ID, median latency at 10M triples, pass/fail status; updated on each release
- contributing/running-conformance-tests.md — updated to cover Jena and WatDiv; how to regenerate the WatDiv dataset; how to update performance baselines
- README.md — add a WatDiv correctness badge alongside the W3C conformance badge
- Release notes for v0.43.0
Exit Criteria
Full Jena suite (1,000 tests) completes in < 3 minutes on CI. WatDiv 100-template suite at 10M triples completes in < 5 minutes. Jena known-failures manifest ≤ 30 XFAIL entries (type coercion and date-time edge cases acceptable until addressed post-1.0). WatDiv row-count correctness within ±0.1% for all 100 templates. Migration chain test passes through 0.43.0.
v0.44.0 — LUBM Conformance Suite
Theme: OWL RL inference correctness under ontological reasoning via the Lehigh University Benchmark (LUBM).
In plain language: LUBM is a classic academic benchmark that generates a synthetic university-domain ontology dataset (scalable from 1K to 8M+ triples) and defines 14 canonical queries that exercise OWL RL inference rules — subclass traversal, property inheritance, inverse properties, transitivity, and domain/range entailments. This release wires LUBM into the conformance harness to validate that pg_ripple's Datalog engine and SPARQL query layer produce correct results when ontological reasoning is active. A dedicated Datalog validation sub-suite tests the Datalog API directly (rule compilation, stratification, iterative inference, goal queries, and materialization) to catch bugs invisible to SPARQL-level testing. It is the only benchmark that tests the interaction between the SPARQL translator and the Datalog inference engine under realistic ontological load.
Effort estimate: 3–5 person-weeks (80% harness reuse from v0.41.0 and v0.43.0; +2–3 pw for Datalog API validation sub-suite)
Deliverables
- LUBM data generator integration (tests/lubm/generator.rs new module)
  - Invoke the UBA (Univ-Bench Artificial) data generator via std::process::Command, or use a Rust port, to produce Turtle-serialised datasets at a configurable university count (--univ 1 → ~100K triples; --univ 10 → ~1M triples; --univ 50 → ~5M triples)
  - Cache generated datasets as CI artifacts keyed by university count and seed; re-generate only when the generator binary changes
  - Load into the named graph <http://swat.cse.lehigh.edu/onto/univ-bench.owl> via the v0.41.0 fixture loader
  - Also load the univ-bench.owl ontology into the Datalog engine as an RDFS/OWL RL rule set before running queries
- 14 canonical LUBM queries (tests/lubm/queries/q01.sparql–q14.sparql)
  - Implement all 14 LUBM queries verbatim from the benchmark specification
  - Each query exercises at least one inference rule:
    - Q1, Q2, Q4, Q6: rdf:type + subclass/subproperty entailment
    - Q3, Q5, Q7: inverse property + domain/range reasoning
    - Q8, Q12, Q13: multi-hop inference chains
    - Q9, Q10, Q11, Q14: conjunctive patterns over inferred and asserted triples
  - Reference results: pre-computed correct answer counts for --univ 1 (published in the original LUBM paper); assert an exact cardinality match
- Correctness validator (tests/lubm/validator.rs)
  - Compare the actual row count against the published reference counts for each of the 14 queries at --univ 1
  - For --univ 10, compare against a locally pre-computed baseline (stored in tests/lubm/baselines/univ10.json)
  - Fail on any count mismatch; report which inference rules produced wrong results
- CI integration (.github/workflows/ci.yml)
  - New job lubm-suite: runs after w3c-suite; generates the --univ 1 dataset (< 100K triples, < 30 seconds); loads the ontology + triples; runs all 14 queries; reports pass/fail per query
  - Non-blocking for --univ 10 (the larger dataset run is triggered weekly or on release branches)
  - Reuse the unified tests/conformance/runner.rs from v0.43.0; add the lubm: prefix to the known-failures format
- Known-failures manifest — add lubm:Q{N} entries for any query that fails at release, with a one-line root-cause note
- Datalog validation sub-suite (tests/lubm/datalog/ new module) — tests the Datalog API directly on the same --univ 1 and --univ 10 LUBM datasets (see the SQL walkthrough after this list)
  - Rule compilation correctness (tests/lubm/datalog/rule_compilation.sql): call pg_ripple.add_rules() with the OWL RL ruleset; use pg_ripple.rules() to inspect the compiled rules; assert the rule count and stratification match the specification
  - Inference iteration tracking (tests/lubm/datalog/inference_iterations.sql): use pg_ripple.rule_statistics() after pg_ripple.materialize_owl_rl() to count iterations per stratum; validate that the fixpoint is reached without over-iteration (off-by-one detection)
  - Inferred triple counts (tests/lubm/datalog/inferred_triples.sql): call pg_ripple.inferred_triples(rule_name) for key OWL RL rules (e.g. subclass_entail, subproperty_entail, domain_range); assert row counts match the pre-computed baselines for --univ 1 and --univ 10
  - Direct goal queries (tests/lubm/datalog/goal_queries.sql): use pg_ripple.goal() directly on Datalog-computed facts; verify the results match the SPARQL query results (validates the inference engine's independence from SPARQL translation)
  - Materialization performance baseline (tests/lubm/datalog/materialization_perf.sql): benchmark pg_ripple.materialize_owl_rl() at --univ 1 (target < 5 seconds) and --univ 10 (target < 60 seconds); flag a > 10% regression in CI
  - Custom rule validation (tests/lubm/datalog/custom_rules.sql): define ad-hoc Datalog rules (e.g. transitive closure over a custom predicate) on LUBM data; compare the ground truth computed via Datalog vs. SPARQL; catches rule-compiler edge cases
  - Results are compared against a unified baseline (tests/lubm/baselines/datalog_validation.json)
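The shape of that sub-suite in SQL, using the function names listed above; the goal syntax is illustrative, and the real baselines live in tests/lubm/baselines/datalog_validation.json:

```sql
SELECT pg_ripple.materialize_owl_rl();                 -- run OWL RL to fixpoint
SELECT * FROM pg_ripple.rule_statistics();             -- iterations per stratum
SELECT pg_ripple.inferred_triples('subclass_entail');  -- per-rule derived count
-- Engine-independence check: the Datalog goal and the SPARQL query must
-- agree on the answer set (asserted + inferred ub:Student members).
SELECT pg_ripple.goal('?x rdf:type ub:Student');       -- goal syntax illustrative
SELECT * FROM pg_ripple.sparql('
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT (COUNT(*) AS ?n) WHERE { ?x a ub:Student . }
');
```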
Migration Script
sql/pg_ripple--0.43.0--0.44.0.sql — adds UNIQUE(p, s, o, g) constraint to _pg_ripple.vp_rare to fix SPARQL UPDATE set semantics for rare predicates.
Documentation
- reference/lubm-results.md (new) — LUBM conformance table: query ID, description, inference rules exercised, reference count, pg_ripple result, pass/fail; updated each release
- reference/w3c-conformance.md — updated to link to the LUBM and WatDiv result pages for a complete conformance picture
- contributing/running-conformance-tests.md — updated to cover LUBM data generation, ontology loading, and baseline regeneration
- Release notes for v0.44.0
Exit Criteria
All 14 LUBM queries return exact reference cardinalities at --univ 1. Ontology + --univ 1 dataset loads and all queries complete in < 30 seconds on CI. All Datalog API calls in the sub-suite return results matching pre-computed baselines (rule count, iteration count, inferred triple counts, goal query results). Materialization performance at --univ 1 is < 5 seconds. Custom Datalog rule validation passes (transitive closure results match ground truth). Known-failures manifest has 0 lubm: entries at release. Migration chain test passes through 0.44.0.
v0.45.0 — SHACL Completion, Datalog Robustness & Crash Recovery
Theme: Close the last SHACL Core constraint gaps, harden parallel Datalog evaluation against worker failures, and add the missing crash-recovery scenarios and migration-documentation standards.
In plain language: This release finishes the SHACL implementation by adding the two remaining Core constraints (
sh:equals and sh:disjoint), makes violation messages readable by always including the decoded focus-node IRI, and proves the async validation queue can sustain a burst of 10,000 writes per second. On the Datalog side it ensures that a crash in one parallel evaluation worker rolls back all other workers cleanly, and that user-supplied lattice join functions are validated before the engine tries to call them. A new set of crash-recovery tests covers the two scenarios that were never tested: killing PostgreSQL mid-promotion of a rare predicate and killing it mid-inference. Finally, every migration script from this release onward carries a standardised header documenting the schema changes, data-rewrite cost, downgrade strategy, and the test file that covers it.
Effort estimate: 4–6 person-weeks
Deliverables
- sh:equals and sh:disjoint constraints (src/shacl/constraints/)
  - sh:equals p — for every focus node, the set of values for p must equal the set of values for the predicate declared by sh:equals; implemented as two NOT EXISTS subqueries (one per direction); compiled into a SHACL constraint helper in src/shacl/constraints/relational.rs (see the SQL sketch after this list)
  - sh:disjoint p — the value sets must be disjoint; implemented symmetrically
  - pg_regress test shacl_equals_disjoint.sql — covers passing shapes, failing shapes, blank-node identity, and named-graph scoping
  - Migration: no schema changes; the constraints are pure SQL inside the validation query
- Decoded focus-node IRIs in SHACL violation messages (src/shacl/mod.rs)
  - All paths that emit a SHACL violation (ereport!(Error, …) or a write to _pg_ripple.validation_results) must include the decoded IRI of the focus node alongside its integer ID
  - Add a decode_id_safe(id: i64) helper that falls back to "<decoded-id:{id}>" if the dictionary lookup fails
  - Regression test: load a shape with a violation; assert the violation message text contains the focus-node IRI string
- SHACL async pipeline load test (benchmarks/shacl_async_load.sql)
  - pgbench-driven harness that inserts triples at 10,000/min for 5 continuous minutes while the async SHACL validation pipeline is active
  - Asserts: (a) _pg_ripple.validation_queue depth stays bounded (does not grow unboundedly); (b) drain rate ≥ arrival rate ± 5%; (c) the dead-letter queue receives any persistent violators; (d) no backend crashes
  - CI job shacl-async-load is informational (non-blocking) but results are logged as a CI artifact
- Coordinated parallel-strata rollback (src/datalog/parallel.rs)
  - Wrap all independent-group SQL execution inside a single PostgreSQL transaction with one SAVEPOINT strata_eval per group
  - On failure in any group, issue ROLLBACK TO SAVEPOINT for all already-applied groups and re-raise the error; on success, RELEASE SAVEPOINT to commit the whole stratum
  - pg_regress test datalog_parallel_rollback.sql: inject a deliberate failure in one group; assert no partial facts survive
- lattice.join_fn validation via regprocedure (src/datalog/lattice.rs)
  - Before storing a user-supplied join_fn name, resolve it via SELECT '{name}'::regprocedure::text inside an SPI transaction
  - If the round-trip succeeds, store the qualified name returned by PG (avoids search-path injection); if it fails, raise PT541 LatticeJoinFnInvalid with a clear message naming the rejected identifier
  - New error code PT541 added to src/error.rs and docs/src/reference/error-catalog.md
- WFS iteration-cap test and documentation (tests/pg_regress/sql/datalog_wfs_cap.sql)
  - pg_regress test that loads a mutually-recursive negation cycle guaranteed to reach pg_ripple.wfs_max_iterations; asserts: (a) the function returns without error; (b) "stratifiable": false in the result; (c) a PostgreSQL WARNING with code PT520 is emitted; (d) the "certain" and "unknown" fact counts are non-zero (a partial result)
  - docs/src/user-guide/sql-reference/datalog.md — add a "Well-Founded Semantics limits" subsection documenting the cap behaviour and how to detect it via RETURNING
- Crash-recovery: rare-predicate promotion kill (tests/crash_recovery/test_promote_kill.sh)
  - Script that starts a large-batch insert designed to cross the promotion threshold, sends kill -9 to the promoting backend mid-transaction, restarts PostgreSQL, calls pg_ripple.diagnostic_report(), and asserts vp_rare is consistent (no orphaned rows, predicate catalog matches the actual tables)
  - The outcome must be either: promotion completed (VP table exists, vp_rare rows moved) or promotion rolled back (VP table absent, vp_rare rows intact) — no hybrid state permitted
- Crash-recovery: Datalog inference kill mid-fixpoint (tests/crash_recovery/test_inference_kill.sh)
  - Script that starts a large-ruleset inference run, kills the backend during the second fixpoint iteration, restarts, and asserts: (a) no partially-derived facts remain in any VP table (i.e., no inferred triples from an aborted inference); (b) pg_ripple.infer() can be re-run successfully to completion
- Standardised migration script headers
  - Backfill sql/pg_ripple--*.sql with the standard header block (schema changes, data-rewrite cost estimate, downgrade strategy, test reference) for any script that currently lacks one — starting with 0.5.1→0.6.0 (the HTAP split) and the five most structurally significant migrations
  - Add the header template to the AGENTS.md "Extension Versioning & Migration Scripts" section so all future scripts include it from creation
- Recovery procedure runbook in RELEASE.md
  - Add a "Rollback & Recovery" section documenting: (a) how to roll back each class of migration (comment-only vs. schema-change vs. data-rewrite); (b) the pg_dump/pg_restore path as the universal fallback; (c) how to diagnose a partial upgrade using _pg_ripple.schema_version and pg_ripple.diagnostic_report()
Migration Script
sql/pg_ripple--0.44.0--0.45.0.sql — no VP table schema changes. Comment-only header. Installs PT541 error code registration (compiled from Rust).
Documentation
- reference/shacl-constraints.md — add sh:equals and sh:disjoint to the constraint table with examples
- reference/error-catalog.md — add PT541 (LatticeJoinFnInvalid)
- user-guide/sql-reference/datalog.md — "Well-Founded Semantics limits" subsection
- reference/troubleshooting.md — add entries for "rare-predicate promotion stuck" and "inference aborted mid-fixpoint"
- Release notes for v0.45.0
Exit Criteria
sh:equals and sh:disjoint pg_regress tests pass. SHACL violation messages include decoded focus-node IRIs. Parallel-strata rollback test demonstrates no partial facts on deliberate failure. lattice.join_fn injection via search-path ambiguous name is rejected at create_lattice() time with PT541. WFS cap test passes: PT520 WARNING emitted, partial result returned. Both new crash-recovery scripts exit 0. Migration chain test passes through 0.45.0.
v0.46.0 — Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance
Theme: Property-based and fuzz testing for the remaining untested trust surfaces, the W3C OWL 2 RL conformance suite, and targeted performance improvements from the deep-analysis recommendations.
In plain language: This release closes three testing gaps that can hide subtle bugs: (1) randomised property-based tests that assert algebraic invariants about the SPARQL translator and dictionary encoder — if encoding the same term twice ever yields different IDs, or if a query changes semantics when extra whitespace is added, these tests catch it; (2) fuzz tests for the federation result parser, which accepts untrusted network data; and (3) the W3C OWL 2 RL test manifests, which verify that pg_ripple's Datalog engine handles the full range of ontological reasoning that OWL 2 RL demands. On the performance side, a LIMIT push-down eliminates redundant row decoding for paginated queries, sequence range pre-allocation removes a contention point in parallel Datalog, and BSBM joins the CI suite as a regression gate. The rustdoc lint ensures no public function ships without a doc comment.
Effort estimate: 5–7 person-weeks
Deliverables
- `proptest` integration (`tests/proptest/`)
  - SPARQL algebra round-trip (`tests/proptest/sparql_roundtrip.rs`): generate random `spargebra::Query` values using `proptest` strategies; assert that (a) encoding the same SPARQL query twice produces byte-identical SQL; (b) queries that differ only in whitespace or prefix aliases produce the same generated SQL (plan-cache key stability); (c) star-pattern self-join elimination never changes the result set (check against a reference without elimination)
  - Dictionary encode/decode (`tests/proptest/dictionary.rs`): for any arbitrary IRI, blank node, or literal string, `decode_id(encode_term(t)) == t`; assert no collisions for 10,000 random distinct terms; assert encode is stable across pg_ripple restarts (same term → same ID given the same dictionary)
  - JSON-LD framing round-trip (`tests/proptest/jsonld_framing.rs`): generate random flat JSON-LD input graphs and random `@context` frames; assert that `frame_jsonld(input, frame)` returns valid JSON-LD and that any IRI present in the input that matches the frame appears in the output
  - Dev-dependency: `proptest = "1"` added to `Cargo.toml` under `[dev-dependencies]`
- `cargo-fuzz` federation result decoder target (`fuzz/fuzz_targets/federation_result.rs`)
  - Fuzz target that feeds arbitrary byte sequences through the SPARQL XML results parser (the `src/sparql/federation.rs` result-decoding path) — the path that processes `application/sparql-results+xml` responses from remote SERVICE endpoints
  - Assert: no panic, no `unwrap` abort; invalid XML must produce a PT6xx-range error, never a crash
  - CI nightly job `fuzz-federation` runs the target for 10 minutes; any new corpus entries that trigger panics are reported as blocking failures
- Datalog convergence regression suite (`tests/datalog_convergence/`)
  - Download a 1M-triple DBpedia-en subset (persons, organisations, relations) via a `scripts/fetch_conformance_tests.sh` extension; load into pg_ripple
  - Apply the built-in RDFS + OWL RL rule set via `pg_ripple.materialize_owl_rl()`
  - Assert: fixpoint reached in ≤ 20 iterations; total wall-clock time < 5 minutes on CI; derived triple count falls within ±1% of a pre-computed baseline stored in `tests/datalog_convergence/baselines.json`
  - Repeat for a 200-rule custom rule set (100 forward-chaining + 100 OWL RL rules) on a 100K-triple schema.org snippet; assert convergence in ≤ 15 iterations
- W3C OWL 2 RL conformance suite (`tests/owl2rl/`)
  - Download the W3C OWL 2 RL test manifests from https://github.com/w3c/owl2-profiles-tests
  - Adapter `tests/owl2rl/manifest.rs` parses the `owl2:DatatypeEntailmentTest`, `owl2:ConsistencyTest`, and `owl2:InconsistencyTest` manifest types
  - Each test loads a premise ontology, runs `pg_ripple.materialize_owl_rl()`, then evaluates a conclusion ontology via ASK/entailment check
  - CI job `owl2rl-suite` is informational (non-blocking) until the pass rate is ≥ 95%; known failures tracked in `tests/owl2rl/known_failures.txt` with an `owl2rl:` prefix
  - Reuse the unified conformance runner from v0.43.0
- TopN push-down (`src/sparql/sqlgen.rs`)
  - When a SPARQL query has both `ORDER BY` and `LIMIT N` (and no `OFFSET > 0`), emit the SQL as `… ORDER BY … LIMIT N` rather than fetching all rows and discarding after decoding
  - The optimisation applies to SELECT queries; skipped when `DISTINCT` is in scope (PostgreSQL cannot push LIMIT through DISTINCT without a subquery)
  - New GUC `pg_ripple.topn_pushdown` (bool, default `on`) guards the rewrite; `pg_ripple.sparql_explain()` output includes a `"topn_applied": true/false` key
  - pg_regress test `sparql_topn.sql`: assert result correctness and `EXPLAIN` shows a `Limit` node directly over the VP scan (see the sketch below)
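A minimal sketch of a query the rewrite targets. `pg_ripple.sparql()` is the standard entry point shown throughout these docs; the `topn_applied` key in the `pg_ripple.sparql_explain()` output comes from the deliverable above:

```sql
-- Paginated query: with the push-down, the generated SQL ends in
-- "ORDER BY ... LIMIT 10", so only ten rows are ever decoded.
SELECT * FROM pg_ripple.sparql('
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE { ?person foaf:name ?name }
    ORDER BY ?name
    LIMIT 10
');

-- Disable the rewrite to compare plans side by side:
SET pg_ripple.topn_pushdown = off;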
- Sequence range pre-allocation for parallel Datalog workers (`src/datalog/parallel.rs`)
  - Before launching N parallel strata workers, call `SELECT setval(seq, currval(seq) + N * batch_size)` once to reserve a contiguous SID range; each worker uses its slice without touching the sequence (see the sketch below)
  - `batch_size` defaults to 10,000 and is configurable via `pg_ripple.datalog_sequence_batch` (integer GUC, default 10000, min 100)
  - pg_regress test `datalog_sequence_batch.sql`: assert that after parallel inference the global SID sequence has no gaps within the reserved range
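The reservation pattern spelled out. The sequence name `_pg_ripple.sid_seq` is illustrative; the deliverable specifies only the `setval`/`currval` formula:

```sql
-- Reserve one contiguous range for 4 workers x batch_size 10,000.
-- Assumes the coordinator has already touched the sequence in this
-- session (currval() requires a prior nextval()).
SELECT setval('_pg_ripple.sid_seq',
              currval('_pg_ripple.sid_seq') + 4 * 10000);
-- Worker i (0-based) then draws SIDs from
--   range_start + i*10000 .. range_start + (i+1)*10000 - 1
-- without ever contending on the sequence.
```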
- BSBM regression gate in CI (`.github/workflows/ci.yml`, `benchmarks/bsbm/`)
  - Integrate the Berlin SPARQL Benchmark (BSBM) at 1M-triple scale as a nightly regression check
  - `scripts/fetch_conformance_tests.sh` extended to download and install the BSBM data generator
  - CI job `bsbm-regression`: generates a 1M-triple product dataset, runs the 12 BSBM explore queries, compares query latency against a baseline stored in `benchmarks/bsbm/baselines.json`; any query regressing by > 10% emits a CI warning (non-blocking but visible in the PR summary)
  - Complement to v1.0.0's full-scale BSBM-at-100M-triples published benchmark
- Rustdoc lint gate (`src/lib.rs`, `Cargo.toml`, `.github/workflows/ci.yml`)
  - Add `#![warn(missing_docs)]` to `src/lib.rs` (scoped to public items only; internal `pub(crate)` items excluded)
  - CI job `cargo doc --no-deps --document-private-items` gated to fail on any `missing_docs` warning for public `#[pg_extern]` functions
  - Backfill doc comments for the 20 most-called public functions (as identified by `pg_stat_statements` in the test suite run); leave a `FIXME(docs):` comment on the remaining stubs to track progress
- HTTP companion: CA-bundle env var (`pg_ripple_http/src/main.rs`)
  - Add a `PG_RIPPLE_HTTP_CA_BUNDLE` environment variable: if set, load the PEM file at the given path as the trust anchor for all outbound TLS connections (SERVICE federation and SPARQL endpoint queries)
  - If the path does not exist or is not a valid PEM bundle, log an error at startup and fall back to the system trust store (never silently ignore)
  - This complements the v0.42.0 `rustls-tls-native-roots` hardening by allowing operators to pin a specific CA or internal PKI certificate
  - Integration test: start a mock TLS server with a self-signed CA; assert that `pg_ripple_http` rejects it by default and accepts it when `PG_RIPPLE_HTTP_CA_BUNDLE` points to the CA cert
- Expanded worked examples (`examples/`)
  - `examples/shacl_datalog_quality.sql` — end-to-end: load a bibliographic graph, define SHACL shapes, run SPARQL to list violations, apply Datalog RDFS rules, re-check shapes; documents the SHACL + Datalog interaction pattern
  - `examples/hybrid_vector_search.sql` — end-to-end: embed entities, run vector similarity search, combine with SPARQL property-path constraints; documents the `pg:similar()` + SPARQL pattern
  - `examples/graphrag_round_trip.sql` — end-to-end: load a knowledge graph, run GraphRAG export, annotate with Datalog-derived community summaries, re-import enriched triples; documents the full GraphRAG round-trip
New GUC Parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.topn_pushdown | bool | on | Push LIMIT N into the SQL plan for ORDER BY + LIMIT queries |
| pg_ripple.datalog_sequence_batch | integer | 10000 | SID range reserved per parallel Datalog worker per batch |
New Error Codes
| Code | Severity | Message |
|---|---|---|
| PT542 | ERROR | Federation result decoder received unparseable XML/JSON |
Migration Script
sql/pg_ripple--0.45.0--0.46.0.sql — no schema changes. Registers topn_pushdown and datalog_sequence_batch GUCs (compiled from Rust). Comment-only header.
Documentation
- `user-guide/best-practices/sparql-performance.md` — "TopN push-down" section with `EXPLAIN` example
- `reference/guc-reference.md` — v0.46.0 section with the two new GUC parameters
- `reference/error-catalog.md` — PT542 added
- `contributing/testing.md` — `proptest` and `cargo-fuzz` sections covering how to run and extend the harnesses
- Release notes for v0.46.0
Exit Criteria
All three proptest suites run 10,000 cases each with no failures. Federation result decoder fuzz target runs 10 minutes without panics. Datalog convergence suite: fixpoint on 1M DBpedia triples in ≤ 20 iterations, wall-clock < 5 minutes. OWL 2 RL suite: ≥ 80% pass rate at release (target 95% for v1.0.0). TopN push-down EXPLAIN shows Limit node for ORDER BY + LIMIT queries; result set unchanged. BSBM-at-1M-triples baseline stored and regression gate active. No missing-docs warnings for public #[pg_extern] functions. HTTP companion starts cleanly with PG_RIPPLE_HTTP_CA_BUNDLE set to a valid PEM file. Migration chain test passes through 0.46.0.
v0.47.0 — SHACL Truthfulness, Dead-Code Activation & Architecture Refactor
Theme: Close the parsed-but-not-checked SHACL gap, wire dead code, finish the SPARQL translate module split, and expand fuzz and crash-recovery coverage.
In plain language: v0.45.0 was titled "SHACL Completion" but the post-release audit (PLAN_OVERALL_ASSESSMENT_3.md) found four constraints that accept any data without complaint — the parser records them but the validator ignores them. That is fixed here. The `preallocate_sid_ranges()` function added in v0.46.0 to speed up parallel Datalog has been sitting unused (clippy `dead_code` warning); it gets wired in. The `src/sparql/translate/` refactor that began in v0.38.0 finally lands, shrinking `sqlgen.rs` from 3,600 lines into focused per-operator modules. Five new fuzz targets cover the attack surfaces that had only one target before. Four new crash-recovery scenarios close the remaining operational safety gaps.
Effort estimate: 8–10 person-weeks
Deliverables
- SHACL parsed-but-not-checked constraint sweep (S4-1…S4-4)
  - Implement the `sh:closed` checker in `src/shacl/constraints/closed.rs`: for each focus node enumerate all predicate IDs present; reject any not listed in `sh:property / sh:path` or `sh:ignoredProperties`
  - Implement the `sh:uniqueLang` checker: for a given focus node and path, assert no two values share the same non-empty `@lang` tag
  - Implement the `sh:pattern` checker in `src/shacl/constraints/string_based.rs` (currently an empty placeholder): apply the `sh:flags`-aware POSIX regex against the string value of each focus node
  - Implement the `sh:lessThanOrEquals` checker: decode both value nodes and compare with the XSD-typed ordering already used by FILTER expressions
  - Wire each into the shape dispatcher at `src/shacl/mod.rs`
  - Add pg_regress tests `shacl_closed.sql`, `shacl_unique_lang.sql`, `shacl_pattern.sql`, `shacl_lt_or_equals.sql` (S8-4); the sketch below shows the expected behaviour
  - Add a startup-time warning listing every parsed-but-unchecked constraint type encountered, to guard against future regressions
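A sketch of the observable change, assuming the validator is exposed as a SQL function. `pg_ripple.load_shapes()` appears elsewhere in this plan; `pg_ripple.validate()` is an illustrative name for the validation entry point:

```sql
-- Shape: zip codes must match a five-digit pattern.
SELECT pg_ripple.load_shapes('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:zipCode ;
                  sh:pattern "^[0-9]{5}$" ] .
');

SELECT pg_ripple.load_turtle('
  @prefix ex: <http://example.org/> .
  ex:alice a ex:Person ; ex:zipCode "ABC12" .
');

-- Before v0.47.0 this passed silently; after, it reports a
-- sh:PatternConstraintComponent violation for ex:alice.
SELECT * FROM pg_ripple.validate();
```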
- Wire `preallocate_sid_ranges()` (S1-2)
  - Call the function from the parallel-strata coordinator in `src/datalog/parallel.rs` before launching any worker batch
  - Assert via `datalog_sequence_batch.sql` that `pg_sequence_last_value` advances by `n_workers * batch_size` on each batch; eliminate the clippy `dead_code` warning
- Finish the `src/sparql/translate/` module split (S2-3)
  - Move BGP translation into `src/sparql/translate/bgp.rs` (~400 LoC)
  - Move Filter translation into `src/sparql/translate/filter.rs` (~200 LoC)
  - Move LeftJoin (OPTIONAL) into `src/sparql/translate/left_join.rs` (~250 LoC)
  - Move Union into `src/sparql/translate/union.rs` (~150 LoC)
  - Move Distinct into `src/sparql/translate/distinct.rs` (~100 LoC)
  - Move Graph pattern into `src/sparql/translate/graph.rs` (~200 LoC)
  - Move Group/aggregation into `src/sparql/translate/group.rs` (~300 LoC)
  - Move Join into `src/sparql/translate/join.rs` (~200 LoC)
  - Target: `sqlgen.rs` ≤ 800 LoC (routing and coordination only)
- Six missing GUC `check_hook` validators (S5-1)
  - Add validators for: `federation_on_error` (warning|error|empty), `federation_on_partial` (empty|use), `sparql_overflow_action` (warn|error), `tracing_exporter` (stdout|otlp), `embedding_index_type` (hnsw|ivfflat), `embedding_precision` (single|half|binary)
  - Consolidate `max_path_depth` and `property_path_max_depth` into a single GUC with a `min = 1, max = 65535` validator (S2-5)
- Five new `cargo-fuzz` targets (S8-1)
  - `fuzz/fuzz_targets/sparql_parser.rs`: feed arbitrary bytes through the SPARQL query parser; assert no panic
  - `fuzz/fuzz_targets/turtle_parser.rs`: fuzz the Turtle/N-Triples bulk loader; assert no panic, invalid input → PT3xx error
  - `fuzz/fuzz_targets/datalog_parser.rs`: fuzz the Datalog rule parser; assert no panic
  - `fuzz/fuzz_targets/shacl_parser.rs`: fuzz `parse_shapes_graph()`; assert no panic
  - `fuzz/fuzz_targets/dictionary_hash.rs`: fuzz the dictionary encode path; assert no panic and the round-trip invariant
  - Each target runs for 10 minutes in CI nightly; a new crash-inducing input is a blocking failure
- Four missing crash-recovery scenarios (S8-3)
  - CONSTRUCT/DESCRIBE view materialisation kill: `kill -9` during `materialize_view()`; restart and verify the view state is consistent
  - Federation result spooling kill: `kill -9` during SERVICE temp-table spool; restart and verify no orphaned temp tables
  - Parallel Datalog stratum kill (`merge_workers > 1`): `kill -9` mid-fixpoint; restart and verify inference restarts cleanly
  - Embedding worker queue kill: `kill -9` during async embedding queue flush; restart and verify the queue drains without duplicates
- Plan / dictionary / federation cache hit-rate metrics (S7-1)
  - `pg_ripple.plan_cache_stats()` → `(hits BIGINT, misses BIGINT, evictions BIGINT, hit_rate DOUBLE PRECISION)`
  - `pg_ripple.dictionary_cache_stats()` → same shape
  - `pg_ripple.federation_cache_stats()` → same shape
  - Wire `hit_rate` into the BSBM regression gate as a secondary metric (usage sketch below)
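The stats functions are plain set-returning functions; a quick health check after a warm workload might look like this (column names per the signature above):

```sql
SELECT hits, misses, evictions,
       round(hit_rate::numeric, 3) AS hit_rate
FROM pg_ripple.plan_cache_stats();
```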
- WFS non-convergence warning (S3-2)
  - Emit a PT520 WARNING when the well-founded semantics iteration cap is reached without convergence; include the iteration count and the predicate that last changed
- OWL 2 RL conformance baseline (S3-3)
  - Run the OWL 2 RL suite added in v0.46.0; document the pass rate in `docs/src/reference/owl2rl-results.md`
  - Surface XFAIL entries in `tests/owl2rl/known_failures.txt` for release-to-release tracking
- CI and security hygiene (S6-1, S6-2, S6-4, S10-1)
  - Add a weekly scheduled `cargo audit` job; failure creates a GitHub issue automatically
  - Add a `cargo deny` configuration with a licence allowlist
  - Add `scripts/check_no_security_definer.sh` that scans `sql/*.sql` and fails on any `SECURITY DEFINER` directive
  - Add an SPDX licence compatibility check via `cargo license`
- Promotion-race stress test (S8-5)
  - `tests/stress/promotion_race.sh`: fire 50 concurrent inserts at the rare-predicate promotion threshold; verify SIDs are non-overlapping per worker
- Documentation (S9-1, S9-2, S9-3, S5-3)
  - `reference/guc-reference.md`: complete entries for all GUCs through v0.47.0; flag `datalog_sequence_batch` as now active
  - Add a GUC ↔ workload-class tuning matrix (when to raise `dictionary_cache_size`, when to increase `merge_workers`, when to tune `property_path_max_depth`)
  - Add 5 worked examples: federation-multi-endpoint, parallel-Datalog, CONSTRUCT/DESCRIBE view materialisation, RDF-star annotation patterns, WCOJ cyclic queries
  - Document NOTIFY queue tuning for CDC subscriptions (`max_notify_queue_pages`)
New Error Codes
| Code | Severity | Message |
|---|---|---|
| PT520 | WARNING | Well-founded semantics iteration cap reached without convergence; result is partial |
Migration Script
sql/pg_ripple--0.46.0--0.47.0.sql — no schema changes. Comment header describing new SHACL constraint checkers, wired preallocate_sid_ranges(), and six new GUC validators.
Documentation
- `reference/shacl-reference.md` — mark `sh:closed`, `sh:uniqueLang`, `sh:pattern`, `sh:lessThanOrEquals` as fully implemented
- `contributing/testing.md` — fuzz targets section extended for the five new targets
- `reference/guc-reference.md` — complete audit of all registered GUCs through v0.47.0
- Release notes for v0.47.0
Exit Criteria
All four previously parsed-but-unchecked SHACL constraints trigger violations on non-conforming data. preallocate_sid_ranges() has zero clippy dead_code warnings. sqlgen.rs ≤ 800 LoC. All five fuzz targets run 10 minutes without panics. All four crash-recovery scenarios pass. Three cache-stats SRFs return non-zero hit_rate after a warm workload. OWL 2 RL pass-rate baseline documented. cargo audit and cargo deny green in CI.
v0.48.0 — SHACL Core Completeness, OWL 2 RL Closure & SPARQL Completeness
Theme: Complete SHACL Core conformance, close the OWL 2 RL rule-set gap, finish SPARQL 1.1 Update, and resolve the SPARQL-star variable-pattern gap.
In plain language: After v0.47.0 makes the existing SHACL constraints truthful, this release adds the remaining seven SHACL Core constraints — the string-length bounds, exclusive/inclusive numeric ranges, and `sh:xone` — plus the complex path expressions (`sh:inversePath`, `sh:alternativePath`, sequence paths, `*`, `+`, `?`) that real-world Schema.org and SHACL-AF schemas depend on. On the reasoning side, five missing OWL 2 RL rules close the gap with the W3C OWL 2 RL profile. SPARQL 1.1 Update gains its three missing operations (MOVE, COPY, ADD). The SPARQL-star variable-inside-quoted-triple pattern finally returns rows instead of silently empty results. This release also delivers the operational hardening items deferred from v0.47.0.
Effort estimate: 6–8 person-weeks
Deliverables
- Remaining SHACL Core constraints (S4-5)
  - `sh:minLength` / `sh:maxLength`: apply to string-typed literals after language-tag stripping
  - `sh:xone`: exactly one of the given sub-shapes must be satisfied (XOR logic over the existing `sh:or` / `sh:not` primitives)
  - `sh:minExclusive` / `sh:maxExclusive` / `sh:minInclusive` / `sh:maxInclusive`: XSD-typed numeric comparison; reuse the ordering logic from `sh:lessThan` / `sh:lessThanOrEquals`
  - Target: full SHACL Core constraint coverage (35/35); the W3C SHACL Core test suite must pass completely
- Complex `sh:path` expressions (S4-6)
  - `sh:inversePath`: query `(o, s)` instead of `(s, o)` on the VP table
  - `sh:alternativePath`: union of multiple sub-paths
  - Sequence paths (`(sh:path (ex:a ex:b))`): chained joins
  - `sh:zeroOrMorePath`, `sh:oneOrMorePath`, `sh:zeroOrOnePath`: compile to `WITH RECURSIVE … CYCLE` CTEs, reusing the SPARQL property-path compiler from `src/sparql/property_path.rs`
  - Drop the TODO placeholder in `src/shacl/constraints/property_path.rs` (see the shapes sketch below)
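A sketch of a shape that becomes expressible once complex paths land. The vocabulary (`ex:reportsTo`, `ex:ManagerShape`) is illustrative; loading works the same way as any other shapes graph:

```sql
-- Every manager must be reachable from at least one employee via the
-- inverse of ex:reportsTo (i.e., someone reports to them).
SELECT pg_ripple.load_shapes('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:ManagerShape a sh:NodeShape ;
    sh:targetClass ex:Manager ;
    sh:property [ sh:path [ sh:inversePath ex:reportsTo ] ;
                  sh:minCount 1 ] .
');
```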
- SHACL violation report enhancements (S4-7, S4-8)
  - Extend the `Violation` struct with `sh_value` (the offending value node, decoded) and `sh_source_constraint_component` (the W3C constraint component IRI, e.g. `sh:MinCountConstraintComponent`)
  - For `sh:rule` triples (SHACL-AF): emit a PT4xx WARNING if rules are detected but SHACL-AF compilation is not yet implemented; never silently drop the rule
- OWL 2 RL rule set completion (S3-1)
  - `cax-sco`: full `rdfs:subClassOf` transitive closure (currently single-step only)
  - `prp-spo1`: `rdfs:subPropertyOf` chain (current binary case → full chain)
  - `prp-ifp`: inverse-functional-property derived `owl:sameAs` propagation
  - `cls-avf`: chained `owl:allValuesFrom` interaction with the subclass hierarchy
  - `owl:minCardinality`, `owl:maxCardinality`, `owl:cardinality` entailment rules
  - Target: W3C OWL 2 RL CI suite ≥ 95% pass rate (upgrading the gate from informational to required)
- SPARQL Update: MOVE, COPY, ADD (S2-2)
  - `ADD`: `INSERT { ?s ?p ?o } WHERE { GRAPH source { ?s ?p ?o } }` (source preserved)
  - `COPY`: `CLEAR target` + `ADD`
  - `MOVE`: `COPY` + `DROP source`
  - Wire into the `src/sparql/mod.rs` Update arm; add pg_regress tests for all three operations (see the sketch below)
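A usage sketch of the three operations and their decomposition. The entry-point name `pg_ripple.update()` is illustrative; this deliverable specifies the SPARQL semantics, not the SQL function name:

```sql
-- ADD: source graph preserved, triples copied into the target.
SELECT pg_ripple.update('ADD GRAPH <http://example.org/g1> TO GRAPH <http://example.org/g2>');
-- equivalent to: INSERT { GRAPH <g2> { ?s ?p ?o } } WHERE { GRAPH <g1> { ?s ?p ?o } }

-- COPY: target cleared first, then ADD.
SELECT pg_ripple.update('COPY GRAPH <http://example.org/g1> TO GRAPH <http://example.org/g2>');

-- MOVE: COPY, then the source graph is dropped.
SELECT pg_ripple.update('MOVE GRAPH <http://example.org/g1> TO GRAPH <http://example.org/g2>');
```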
- SPARQL-star variable-inside-quoted-triple patterns (S2-1)
  - Convert the current silent `FALSE` emission into a proper dictionary join on the `qt_s`, `qt_p`, `qt_o` columns already present in `_pg_ripple.dictionary`
  - Patterns like `<< ?s ?p ?o >> :assertedBy ?who` return rows (see the sketch below)
  - Add pg_regress tests `rdfstar_variable_quoted.sql`
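Concretely, a query of the shape named above returns rows after the fix instead of an empty result (prefix and predicates illustrative):

```sql
SELECT * FROM pg_ripple.sparql('
    PREFIX : <http://example.org/>
    SELECT ?s ?who WHERE {
        << ?s :worksFor ?o >> :assertedBy ?who .
    }
');
```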
- Performance baselines and benchmarks (S7-2, S7-3)
  - Record per-query p50/p95/p99 latency for all 32 WatDiv templates in `tests/watdiv/baselines.json`; CI warning gate on > 10% regression
  - Add `benchmarks/merge_throughput.sql`: a 5-minute pgbench script with N writers + `merge_workers ∈ {1, 2, 4, 8}`; document the scaling curve
- Operational hardening (S1-1, S1-3, S1-4, S1-5, S2-4, S2-6, S3-4, S6-3, S7-4, S7-5, S9-4, S9-6, S10-2, S10-3, S10-5)
  - HTAP merge cutover: add a concurrent-merge regression test (50 parallel SPARQL queries during a forced merge cycle; assert zero `relation does not exist` errors) (S1-1)
  - Merge worker backoff: replace `std::thread::sleep` with `BackgroundWorker::wait_latch` (S1-3)
  - Add a `source` column integrity pg_regress test (S1-4)
  - Predicate-OID cache: add a `CacheRegisterRelcacheCallback` hook (S1-5)
  - Add a `pg_ripple.federation_max_response_bytes` GUC (default 100 MiB); refuse responses exceeding it with PT543 (S2-4)
  - CONSTRUCT RDF-star: emit `<< s p o >>` notation for ground quoted triples in CONSTRUCT output (S2-6)
  - SAVEPOINT helper: either wire `execute_with_savepoint()` into the parallel-strata path or gate it with `#[cfg(test)]` (S3-4)
  - `pg_dump` / restore round-trip test (`tests/pg_dump_restore.sh`) (S6-3)
  - Add a `pg_ripple.insert_triples(TEXT[][])` SRF for batch single-triple inserts from orchestration tools (S7-4)
  - HNSW vs IVFFlat benchmark and documentation (S7-5)
  - Mermaid architecture diagram in `docs/src/reference/architecture.md` (S9-4)
  - Migration script headers lint (`scripts/check_migration_headers.sh`) (S9-6)
  - release-please-style release automation workflow (S10-2)
  - `docs/src/operations/pg-upgrade.md` with a supported upgrade matrix and pre-upgrade steps (S10-3)
  - Extend the migration-chain test to load a representative data batch after the v0.1.0 install and verify the data survives through v0.48.0 (S10-5)
New GUC Parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.federation_max_response_bytes | integer | 104857600 | Maximum federation response body in bytes (100 MiB); PT543 on violation |
New Error Codes
| Code | Severity | Message |
|---|---|---|
| PT543 | ERROR | Federation response exceeded federation_max_response_bytes limit |
Migration Script
sql/pg_ripple--0.47.0--0.48.0.sql — no schema changes. Comment header describing SHACL Core completion, OWL 2 RL rule additions, and SPARQL Update completions.
Documentation
- `reference/shacl-reference.md` — all 35 SHACL Core constraints marked implemented; complex path expressions documented with examples
- `reference/owl2rl-results.md` — pass rate updated to reflect the ≥ 95% required gate
- `user-guide/best-practices/sparql-update.md` — MOVE, COPY, ADD examples
- `user-guide/rdf-star.md` — variable-inside-quoted-triple patterns documented
- `operations/pg-upgrade.md` — new page with the supported upgrade matrix
- Release notes for v0.48.0
Exit Criteria
W3C SHACL Core test suite passes 35/35 constraints. OWL 2 RL CI gate upgraded to required at ≥ 95%. All three SPARQL Update operations (MOVE, COPY, ADD) pass the W3C SPARQL 1.1 Update test suite entries for those operations. SPARQL-star variable patterns return correct rows. WatDiv latency baselines recorded and regression gate active. pg_upgrade compatibility document published. pg_dump / restore round-trip test passes. Migration chain test passes through v0.48.0.
v0.49.0 — AI & LLM Integration
Theme: Natural-language query generation and embedding-based entity alignment.
In plain language: Two high-leverage AI features: a function that takes plain English and returns a SPARQL query (using any configured LLM endpoint — Ollama, OpenAI, Claude, or a self-hosted model); and a function that uses the existing vector embeddings to surface candidate `owl:sameAs` pairs — entities that might be the same thing expressed differently. Both build on infrastructure already in place (the SPARQL engine and the v0.27.0 pgvector integration) and require no new storage schema changes.
Effort estimate: 4–6 person-weeks
Deliverables
- NL → SPARQL via LLM function calling (Feature C-1)
  - New module `src/llm/mod.rs`; new SQL function `pg_ripple.sparql_from_nl(question TEXT) RETURNS TEXT`
  - Calls a configured LLM endpoint with the schema VoID description as context; returns a SPARQL SELECT query string
  - GUCs: `pg_ripple.llm_endpoint` (TEXT, default `''` = disabled), `pg_ripple.llm_model` (TEXT, default `gpt-4o`), `pg_ripple.llm_api_key_env` (TEXT, name of the env var holding the key — never stored inline)
  - Optional few-shot examples loaded from `_pg_ripple.llm_examples (question TEXT, sparql TEXT)`; seeded via `pg_ripple.add_llm_example(question TEXT, sparql TEXT)`
  - SHACL shapes included as additional semantic context when `pg_ripple.llm_include_shapes = on` (bool GUC, default `on`)
  - Error codes: PT700 (LLM endpoint unreachable), PT701 (LLM returned non-SPARQL output), PT702 (generated SPARQL failed to parse)
  - pg_regress tests run with a mock HTTP server returning a canned SPARQL response
- Embedding-based `owl:sameAs` candidate generation (Feature C-2)
  - New SQL function `pg_ripple.suggest_sameas(threshold REAL DEFAULT 0.9) RETURNS TABLE(s1 TEXT, s2 TEXT, similarity REAL)`
  - Runs an HNSW self-join on the embedding column in `_pg_ripple.entities`; returns pairs whose cosine similarity exceeds `threshold`
  - Companion `pg_ripple.apply_sameas_candidates(min_similarity REAL DEFAULT 0.95)` inserts accepted pairs as `owl:sameAs` triples and triggers cluster merging
  - Respects the `pg_ripple.sameas_max_cluster_size` (PT550) bound
  - Example: `examples/embedding_alignment.sql` — load two datasets with overlapping entities, run `suggest_sameas`, inspect candidates, apply with `apply_sameas_candidates` (usage sketch below)
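A usage sketch under the GUCs and signatures listed above; the endpoint value is illustrative and results depend on the configured model and loaded data:

```sql
-- Point pg_ripple at a local model (illustrative endpoint value):
SET pg_ripple.llm_endpoint = 'http://localhost:11434/v1';
SET pg_ripple.llm_model = 'llama3';

-- Natural language in, SPARQL text out (PT700-PT702 on failure):
SELECT pg_ripple.sparql_from_nl('Who does Alice know?');

-- Review owl:sameAs candidates before committing them:
SELECT s1, s2, similarity
FROM pg_ripple.suggest_sameas(threshold := 0.92)
ORDER BY similarity DESC
LIMIT 20;

-- Accept only high-confidence pairs:
SELECT pg_ripple.apply_sameas_candidates(min_similarity := 0.95);
```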
New GUC Parameters
| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.llm_endpoint | string | '' | LLM API base URL (empty = NL→SPARQL disabled) |
| pg_ripple.llm_model | string | gpt-4o | LLM model identifier |
| pg_ripple.llm_api_key_env | string | PG_RIPPLE_LLM_API_KEY | Name of the environment variable holding the LLM API key |
| pg_ripple.llm_include_shapes | bool | on | Include SHACL shapes as LLM context when generating SPARQL |
New Error Codes
| Code | Severity | Message |
|---|---|---|
| PT700 | ERROR | LLM endpoint unreachable or returned HTTP error |
| PT701 | ERROR | LLM response did not contain a valid SPARQL query |
| PT702 | ERROR | LLM-generated SPARQL query failed to parse |
Migration Script
sql/pg_ripple--0.48.0--0.49.0.sql — adds _pg_ripple.llm_examples (question TEXT, sparql TEXT) table.
Documentation
- `user-guide/nl-to-sparql.md` — new page: configuring the LLM endpoint, running `sparql_from_nl`, adding few-shot examples, error handling
- `user-guide/entity-alignment.md` — new page: `suggest_sameas`, `apply_sameas_candidates`, tuning the threshold, cluster size limits
- `reference/guc-reference.md` — four new GUC parameters
- `reference/error-catalog.md` — PT700–PT702
- Release notes for v0.49.0
Exit Criteria
pg_ripple.sparql_from_nl() returns a parseable SPARQL query against a mock LLM endpoint. pg_ripple.suggest_sameas() returns candidates for two overlapping test datasets with ≥ 90% recall. apply_sameas_candidates() does not exceed sameas_max_cluster_size. All GUC validators pass. PT700–PT702 are triggered by the appropriate error conditions. Migration chain test passes through v0.49.0.
v0.50.0 — Developer Experience & GraphRAG Polish
Theme: VS Code extension, interactive query debugger, and full RAG pipeline.
In plain language: Three developer-facing features that raise the ceiling on how easy it is to work with pg_ripple day-to-day. A VS Code extension brings SPARQL syntax highlighting, one-click query execution against a live endpoint, and SHACL shape linting into the editor. An extended `EXPLAIN SPARQL` command surfaces the algebra tree, generated SQL, plan-cache status, and per-step row counts as an interactive JSON structure. The RAG pipeline ties together vector recall, SPARQL graph expansion, and LLM context-window assembly into a single SQL function call.
Effort estimate: 5–7 person-weeks
Deliverables
- VS Code extension (Feature B-2) — separate repository `pg-ripple-vscode`
  - SPARQL 1.1 syntax highlighting (TextMate grammar)
  - SHACL Turtle syntax highlighting with shape-aware completion
  - Datalog rule syntax highlighting
  - Query runner: execute a SPARQL query against a configured `pg_ripple_http` endpoint, display results as a table or JSON tree
  - SHACL shape linter: validate a `.ttl` shapes file by calling `pg_ripple.load_shapes()` via the HTTP API and surfacing violations inline
  - Configuration: workspace settings for endpoint URL, auth token, and default named graph
  - Published to the VS Code Marketplace; linked from `README.md` and the docs
- SPARQL query debugger (Feature B-3)
  - Extend `pg_ripple.explain_sparql(query TEXT)` to return JSONB with: the algebra tree, generated SQL, plan-cache status (hit/miss/bypass), per-operator estimated rows, per-operator actual rows (when `analyze := true`)
  - New overload `pg_ripple.explain_sparql(query TEXT, analyze BOOL DEFAULT FALSE) RETURNS JSONB`
  - VS Code extension renders the JSONB as a collapsible tree with operator annotations
  - pg_regress `sparql_explain_analyze.sql`: assert the JSONB schema is stable across SELECT, ASK, CONSTRUCT, and DESCRIBE query types
- RAG pipeline with graph-contextualised embeddings (Feature C-3)
  - New SQL function `pg_ripple.rag_context(question TEXT, k INT DEFAULT 10) RETURNS TEXT`
  - Step 1: embed `question` via `pg_ripple.embed_text()` (from v0.27.0)
  - Step 2: vector recall — top-k entities by HNSW similarity
  - Step 3: SPARQL graph expansion — for each entity, fetch its 1-hop neighbourhood as JSON-LD
  - Step 4: assemble a context string from the JSON-LD fragments, formatted for LLM ingestion
  - Step 5 (optional): if `pg_ripple.llm_endpoint` is set, call `sparql_from_nl()` and execute the generated query, appending the result to the context
  - Example: `examples/graphrag_rag_pipeline.sql` — end-to-end with a Wikipedia-derived knowledge graph (usage sketch below)
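A usage sketch of the pipeline entry point (signature per the deliverable above; the question text is illustrative):

```sql
-- One call runs embed -> vector recall -> graph expansion -> context assembly:
SELECT pg_ripple.rag_context(
    'Which researchers collaborated on knowledge graphs?',
    k := 5
);
```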
Migration Script
sql/pg_ripple--0.49.0--0.50.0.sql — no schema changes.
Documentation
- `user-guide/vscode-extension.md` — installation, configuration, SPARQL query runner, SHACL linter
- `user-guide/explain-sparql.md` — EXPLAIN output format, ANALYZE mode, interpreting the algebra tree
- `user-guide/rag-pipeline.md` — `rag_context()` step-by-step, tuning k, combining with NL→SPARQL
- Release notes for v0.50.0
Exit Criteria
VS Code extension is publishable to the VS Code Marketplace (VSIX builds clean). explain_sparql(query, analyze := true) returns JSONB with algebra, sql, cache_status, and per-operator actual_rows keys for SELECT, ASK, CONSTRUCT, and DESCRIBE queries. rag_context() returns non-empty context for a known question against a pre-loaded test knowledge graph. Migration chain test passes through v0.50.0.
v1.0.0 — Production Release
Theme: Stability, conformance, and production certification.
In plain language: The 1.0 release is not about new features — it's about confidence. We run pg_ripple against the official W3C test suites for SPARQL and SHACL to verify standards compliance. A 72-hour continuous stress test checks for memory leaks and crash recovery. A security audit reviews the code for vulnerabilities. The result is a release that organisations can rely on for production workloads with a clear API stability guarantee: the public interface will not break in future minor versions.
Effort estimate: 6–8 person-weeks
Deliverables
- SPARQL 1.1 Query conformance
  - Pass the W3C SPARQL 1.1 Query test suite (supported subset)
  - Document unsupported features (property functions)
  - Verify conformance via both the SQL and HTTP interfaces
  - Federation (`SERVICE`) covered by v0.16.0
- SPARQL 1.1 Update conformance
  - Pass the W3C SPARQL 1.1 Update test suite (supported subset)
  - Document unsupported features
- SHACL Core conformance
  - Pass the full W3C SHACL Core test suite
  - Any optimization strategy must preserve the same externally visible results as the reference semantics
- Stability hardening
  - 72-hour continuous load test (mixed read/write)
  - Memory leak detection (Valgrind via `cargo pgrx test --valgrind`)
  - Crash recovery testing (kill -9 during merge, reload, verify)
- Security audit
  - Review all SPI query generation for injection vectors
  - Review shared memory usage for race conditions
  - Review the dictionary cache for timing side-channels
- API stability guarantee
  - All `pg_ripple.*` SQL functions considered stable API
  - The `_pg_ripple.*` internal schema reserved for internal use
  - Semantic versioning contract: breaking changes only in major versions
- Final benchmarks
  - BSBM at 100M triples
  - Published performance report
- Release artifacts
  - Tagged release on GitHub
  - Published to PGXN
  - crates.io publication (library crate)
Documentation
See plans/documentation.md for details. The 1.0.0 documentation milestone is a full audit: every page verified, every example tested against the release, no unresolved stubs.
- Final audit of all docs pages — every code example verified against 1.0.0, all TODO / stub markers resolved
- `user-guide/upgrading.md` complete — upgrade procedure from every 0.x version to 1.0.0; migration script inventory
- `reference/error-reference.md` complete — all PT001–PT799 codes documented
- `reference/faq.md` final pass — 20–30 questions covering all features
- `reference/troubleshooting.md` final pass — complete runbook for every subsystem
- All `research/` section mirrors complete
Exit Criteria
Stable, tested, documented, and published. Ready for production workloads up to 100M+ triples on a single node.
Post-1.0 Horizon
In plain language: These are future directions that extend pg_ripple beyond its initial scope. Each addresses a specific real-world need — from distributing data across multiple servers, to geographic queries, to bridging with existing relational databases. They are listed roughly in order of anticipated demand; some may be reordered or combined based on community feedback after 1.0.
v1.6 Cypher/GQL has a dedicated exploratory analysis in plans/cypher/. The core finding: VP tables already encode all LPG structural elements; a standalone `cypher-algebra` crate (openCypher + GQL grammar, unified SQL-emitting algebra IR) is the correct architecture. Full write support requires v0.4.0 (RDF-star) for edge properties — already available. Gremlin is explicitly out of scope.
| Version | Theme | What it delivers | Key Technical Features |
|---|---|---|---|
| 1.1 | Distributed | Spread data across multiple servers for horizontal scale | Citus integration, subject-based sharding |
| 1.2 | Temporal | Track how data changes over time; query historical states | Bitstring versioning, TimescaleDB integration |
| 1.4 | Extended VP | Automatically pre-compute shortcuts for frequent query patterns | Automated workload-driven ExtVP stream tables (pg_trickle), ontology change propagation DAG |
| 1.5 | Interop | Bridge to GraphQL APIs and expose LPG views for visualization tools | GraphQL-to-SPARQL auto-generation from SHACL shapes, stable LPG view layer for visualization tooling |
| 1.6 | Cypher / GQL | Query and write data using the industry-standard graph query languages | cypher-algebra standalone crate (openCypher + GQL grammar, same IR); pg_ripple.cypher() SQL function; CREATE, MERGE, SET, DELETE via VP write path; openCypher TCK ≥80%; edge properties available since v0.4.0 (RDF-star) |
| 1.7 | GeoSPARQL + PostGIS | Answer geographic questions ("find all hospitals within 5 km of this point") | geo:asWKT literal type backed by PostGIS geometry, spatial FILTER functions, R-tree index on spatial VP tables |
| 1.8 | R2RML Virtual Graphs | Expose existing database tables as if they were RDF data — no migration needed | W3C R2RML mappings, SPARQL queries transparently join VP tables with mapped SQL tables |
| 1.9 | Quad-Level Provenance | Track where each fact came from and when it was added | Per-quad metadata table with source, timestamp, and transaction ID; integration with Datalog rule provenance (why-provenance) |
Version Timeline (Estimated Cadence)
In plain language: The "Calendar" column shows how long after the previous release each version is expected to ship. The "Effort" column shows the total developer-time required. With two developers working together, the calendar durations are achievable; with one developer, roughly double the calendar time.
| Version | Calendar (pair) | Effort (person-weeks) | Cumulative effort |
|---|---|---|---|
| 0.1.0 | Week 0 (start) | 6–8 pw | 6–8 pw |
| 0.2.0 | +4 weeks | 6–8 pw | 12–16 pw |
| 0.3.0 | +4 weeks | 6–8 pw | 18–24 pw |
| 0.4.0 | +5 weeks | 8–10 pw | 26–34 pw |
| 0.5.0 | +3 weeks | 6–8 pw | 32–42 pw |
| 0.5.1 | +3 weeks | 6–8 pw | 38–50 pw |
| 0.6.0 | +4 weeks | 8–10 pw | 46–60 pw |
| 0.7.0 | +3 weeks | 4–6 pw | 50–66 pw |
| 0.8.0 | +3 weeks | 4–6 pw | 54–72 pw |
| 0.9.0 | +2 weeks | 3–4 pw | 57–76 pw |
| 0.10.0 | +5 weeks | 10–12 pw | 67–88 pw |
| 0.11.0 | +3 weeks | 5–7 pw | 72–95 pw |
| 0.12.0 | +2 weeks | 3–4 pw | 75–99 pw |
| 0.13.0 | +4 weeks | 6–8 pw | 81–107 pw |
| 0.14.0 | +3 weeks | 4–6 pw | 85–113 pw |
| 0.15.0 | +2 weeks | 3–4 pw | 88–117 pw |
| 0.16.0 | +3 weeks | 4–6 pw | 92–123 pw |
| 0.19.0 | +3 weeks | 3–5 pw | 95–128 pw |
| 0.20.0 | +3 weeks | 5–7 pw | 100–135 pw |
| 0.45.0 | +3 weeks | 4–6 pw | 104–141 pw |
| 0.46.0 | +4 weeks | 5–7 pw | 109–148 pw |
| 0.47.0 | +5 weeks | 8–10 pw | 117–158 pw |
| 0.48.0 | +4 weeks | 6–8 pw | 123–166 pw |
| 0.49.0 | +3 weeks | 4–6 pw | 127–172 pw |
| 0.50.0 | +4 weeks | 5–7 pw | 132–179 pw |
| 1.0.0 | +4 weeks | 6–8 pw | 138–187 pw |
| 1.1–1.9 | Post-1.0 | Community-driven | — |
Estimates assume a pair of focused developers with Rust and PostgreSQL experience. "pw" = person-weeks. Calendar durations assume pair programming; a solo developer should expect roughly double the calendar time. Actual pace depends on contributor availability and scope adjustments discovered during implementation.
Contributing
Thank you for your interest in contributing to pg_ripple. This guide covers environment setup, testing, code conventions, and the pull request workflow.
pg_ripple is open source and welcomes contributions of all kinds — bug reports, documentation fixes, test cases, and feature implementations. If you are unsure whether an idea fits, open a GitHub issue to discuss it before writing code.
Development Environment
Prerequisites
| Tool | Version | Purpose |
|---|---|---|
| Rust | Edition 2024, stable toolchain | Language |
| PostgreSQL | 18.x | Target database |
| pgrx | 0.17 | PostgreSQL extension framework |
| cargo-pgrx | 0.17 | Build and test tooling |
| git | 2.x+ | Version control |
Setup
# 1. Clone the repository
git clone https://github.com/your-org/pg_ripple.git
cd pg_ripple
# 2. Install cargo-pgrx if not already installed
cargo install cargo-pgrx --version 0.17 --locked
# 3. Initialize pgrx with PostgreSQL 18
cargo pgrx init --pg18 $(which pg_config)
# 4. Verify the build
cargo build
On macOS, install PostgreSQL 18 via Homebrew: brew install postgresql@18. Ensure pg_config is on your PATH.
Running Tests
pg_ripple uses three levels of testing:
Unit and integration tests (pgrx)
Runs Rust tests inside a temporary PostgreSQL instance:
cargo pgrx test pg18
This starts a temporary PG18 cluster, installs the extension, runs all #[pg_test] functions, and tears down the cluster.
Regression tests (pg_regress)
Runs SQL-based regression tests that compare expected output:
cargo pgrx regress pg18
The test SQL files live in sql/ and expected output in expected/. If you add a new SQL function, add a regression test for it.
Migration chain test
Verifies that all migration scripts (sql/pg_ripple--X.Y.Z--X.Y.Z+1.sql) can be applied in sequence:
# Requires pgrx PG18 running
cargo pgrx start pg18
bash tests/test_migration_chain.sh
Running a subset of tests
# Run a single test by name
cargo pgrx test pg18 -- test_name_pattern
# Run tests with output visible
cargo pgrx test pg18 -- --nocapture
Code Conventions
These conventions are enforced by CI and code review.
Safe Rust only
All code must be safe Rust. unsafe is permitted only at required FFI boundaries (pgrx macros, shared memory access) and must include a // SAFETY: comment explaining why it is correct.
SQL function exposure
Expose SQL functions via the #[pg_extern] attribute. Never write raw PG_FUNCTION_INFO_V1 C macros.
#[pg_extern]
fn my_function(input: &str) -> String {
    // implementation
}
SPI for all internal SQL
Use pgrx::SpiClient for all SQL executed inside extension code. Never use raw libpq or string-based query execution.
Spi::connect(|client| {
    client.select("SELECT count(*) FROM _pg_ripple.dictionary", None, None)?;
    Ok(())
})?;
Integer joins everywhere
SPARQL-to-SQL translation must encode all bound terms to i64 before generating SQL. VP table queries must never contain string comparisons — this is a bug.
No dynamic SQL string concatenation for table names
Always look up the VP table OID in _pg_ripple.predicates and use format!-style quoting with proper escaping. Never interpolate user input into table names.
Error messages
Follow PostgreSQL style: lowercase first word, no trailing period.
// Good
return Err(pg_ripple_error!("dictionary encode failed: hash collision detected"));

// Bad
return Err(pg_ripple_error!("Dictionary encode failed: hash collision detected."));
Batch dictionary operations
Use ON CONFLICT DO NOTHING … RETURNING for all batch inserts into the dictionary. Never use a SELECT-then-INSERT pattern.
Project Structure
src/
├── lib.rs # Entry points, _PG_init, GUC parameters
├── dictionary/ # IRI/blank-node/literal → i64 encoder
├── storage/ # VP tables, HTAP delta/main, merge worker
├── sparql/ # SPARQL → algebra → SQL → SPI
├── datalog/ # Datalog parser, stratifier, SQL compiler
├── shacl/ # SHACL shapes → DDL constraints + validation
├── export/ # Turtle / N-Triples / JSON-LD serialization
├── stats/ # Monitoring, pg_stat_statements integration
└── admin/ # Vacuum, reindex, prefix registry
sql/ # Migration scripts and regression test SQL
tests/ # Shell-based integration tests
docs/ # mdBook documentation site
Pull Request Workflow
Branch policy
- Never create a new branch from
mainunless the current branch ismain. - Use descriptive branch names:
feat/sparql-lateral,fix/dictionary-collision,docs/glossary.
Before opening a PR
- Run all tests and ensure they pass:
cargo pgrx test pg18
cargo pgrx regress pg18
- Run clippy with no warnings:
cargo clippy --all-targets -- -D warnings
- Format code:
cargo fmt --check
-
Update documentation if you changed any SQL function signatures or added new functions.
-
Create or update migration scripts if the release version changed (see below).
Commit messages
- Use present tense: "add lateral join support" not "added lateral join support"
- Group discrete changes into separate commits
- Reference issue numbers when applicable: "fix dictionary collision (#42)"
Migration scripts
Every release requires a migration script (sql/pg_ripple--X.Y.Z--X.Y.Z+1.sql), even if it only contains comments. See the Release Process for the full checklist.
Documentation Contributions
The documentation site uses mdBook with the mdbook-admonish plugin for callout boxes.
Building the docs locally
# Install mdbook and plugins
cargo install mdbook mdbook-admonish
# Build and serve
cd docs
mdbook serve --open
Callout syntax
Use fenced code blocks with admonish for callout boxes:
```admonish tip title="Performance"
Use `load_ntriples_file()` for large datasets — it is 10× faster than string loading.
```
```admonish warning
This operation cannot be undone.
```
```admonish note
Available since v0.16.0.
```
Adding a new page
- Create the Markdown file in the appropriate
docs/src/subdirectory. - Add the page to
docs/src/SUMMARY.md. - Run
mdbook buildto verify it compiles.
Property-Based Testing (v0.46.0)
pg_ripple uses proptest for randomised property-based tests that assert algebraic invariants. These tests run entirely in pure Rust — no database connection required.
Running proptest suites
# Run all property-based tests
cargo test --test proptest_suite
# Run with more cases (default: 256)
PROPTEST_CASES=10000 cargo test --test proptest_suite
# Run a specific suite
cargo test --test proptest_suite sparql_roundtrip
cargo test --test proptest_suite dictionary
cargo test --test proptest_suite jsonld_framing
Adding a new property test
- Add your test to the appropriate file in `tests/proptest/`:
  - SPARQL translator invariants → `sparql_roundtrip.rs`
  - Dictionary encoder invariants → `dictionary.rs`
  - JSON-LD framing invariants → `jsonld_framing.rs`
  - New domain → create `tests/proptest/<domain>.rs` and add `mod <domain>;` to `tests/proptest_suite.rs`
- Use `proptest!` macros for property tests; regular `#[test]` for deterministic fixtures.
- Run the suite with `PROPTEST_CASES=10000` to verify 10,000 cases pass.
Debugging a proptest failure
When a test fails, proptest prints the minimal failing input. Reproduce it:
// Add to the failing test to fix the seed:
ProptestConfig::with_cases(1).with_proptest_rng(seed)
Fuzz Testing (v0.46.0)
pg_ripple uses cargo-fuzz to test the federation result decoder against arbitrary byte sequences.
Running the fuzz target
# Install cargo-fuzz
cargo install cargo-fuzz
# Run for 10 minutes
cargo fuzz run federation_result -- -max_total_time=600
# Run indefinitely
cargo fuzz run federation_result
# Minimise a crashing corpus entry
cargo fuzz tmin federation_result artifacts/federation_result/crash-<hash>
Adding a new fuzz target
- Create `fuzz/fuzz_targets/<target_name>.rs` with the fuzz target function.
- Add a `[[bin]]` entry to `fuzz/Cargo.toml`.
- Add the target to the `fuzz-<target_name>` CI job in `.github/workflows/ci.yml`.
Fuzz target contract
Every fuzz target must:
- Use `#![no_main]` and `libfuzzer_sys::fuzz_target!`
- Never panic regardless of input (panics are treated as fuzz failures)
- Return `Err(...)` for invalid input, never crash
Reporting Issues
When filing a bug report, please include:
- pg_ripple version: `SELECT pg_ripple.canary();` and the output of `\dx pg_ripple`
- PostgreSQL version: `SELECT version();`
- Minimal reproducer: the smallest SQL script that triggers the issue
- Full error output: use `\errverbose` in psql for detailed error context
- Platform: OS and architecture