What Is pg_ripple?

pg_ripple turns your PostgreSQL database into a knowledge graph store. Store facts as triples, query them with SPARQL, validate data quality with SHACL, derive new facts with Datalog rules, and serve results over HTTP — all inside PostgreSQL, with no extra infrastructure for the data store itself.

-- Load facts about people and relationships
SELECT pg_ripple.load_turtle('
  @prefix ex: <http://example.org/> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  ex:alice foaf:name "Alice" .
  ex:alice foaf:knows ex:bob .
  ex:bob   foaf:name "Bob" .
  ex:bob   foaf:knows ex:carol .
  ex:carol foaf:name "Carol" .
');

-- Ask: who does Alice know, directly or indirectly?
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE {
    ex:alice foaf:knows+ ?person .
    ?person foaf:name ?name .
  }
');

The query follows the foaf:knows relationship through any number of hops and returns the names of everyone Alice is connected to — Bob and Carol.


Why pg_ripple?

Knowledge graphs represent information as a network of relationships rather than rows in flat tables. This structure naturally captures complex, interconnected data — organizational hierarchies, supply chains, research citations, product catalogs — that would require dozens of join tables in a relational model.

pg_ripple brings this capability to PostgreSQL. You get the expressiveness of a dedicated graph database while keeping your existing PostgreSQL infrastructure, tooling, backup procedures, and operational expertise.

Key capabilities

Capability              What it does
SPARQL queries          Ask complex relationship questions using the W3C standard query language
SHACL validation        Define and enforce data quality rules — reject bad data on insert
Datalog reasoning       Automatically derive new facts from rules and logic
Vector + graph hybrid   Combine SPARQL graph traversal with pgvector similarity search
JSON-LD framing         Export nested JSON documents shaped for your API contract
SPARQL Protocol         Serve queries over a standard HTTP endpoint via pg_ripple_http
Federation              Query remote SPARQL endpoints alongside local data

Key numbers

Metric                  Value
Bulk load throughput    >100K triples/sec (commodity hardware)
SPARQL query latency    <10ms for typical patterns
W3C SPARQL 1.1          Full conformance
W3C SHACL Core          Full conformance
PostgreSQL version      18

Architecture at a glance

┌─────────────────────────────────────────────────┐
│                  PostgreSQL 18                  │
│  ┌───────────────────────────────────────────┐  │
│  │            pg_ripple extension            │  │
│  │  ┌──────────┐  ┌────────┐  ┌─────────┐    │  │
│  │  │Dictionary│  │ SPARQL │  │ Datalog │    │  │
│  │  │ Encoder  │  │ Engine │  │ Engine  │    │  │
│  │  └────┬─────┘  └───┬────┘  └────┬────┘    │  │
│  │       │            │            │         │  │
│  │  ┌────┴────────────┴────────────┴─────┐   │  │
│  │  │   VP Tables (one per predicate)    │   │  │
│  │  │ HTAP: delta + main + merge worker  │   │  │
│  │  └────────────────────────────────────┘   │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
         ▲                          ▲
         │ SQL                      │ HTTP
    Application              pg_ripple_http

Every IRI, literal, and blank node is mapped to a compact integer ID by the dictionary encoder. Data is stored in Vertical Partitioning (VP) tables — one table per unique predicate — with integer-only joins for fast query execution. The HTAP architecture separates read and write paths so that heavy loads do not block queries.
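As an illustration of why integer-only joins are fast, a two-pattern SPARQL query might compile to SQL along these lines. The table names, column names, and decoding function below are hypothetical; the real generated SQL is internal to pg_ripple:

```sql
-- Hypothetical compiled form of:
--   SELECT ?name WHERE { ex:alice foaf:knows ?p . ?p foaf:name ?name }
-- Subjects (s) and objects (o) are dictionary-encoded BIGINTs;
-- 42 stands in for the encoded ID of ex:alice.
SELECT dict_decode(n.o) AS name
FROM   vp_foaf_knows k
JOIN   vp_foaf_name  n ON n.s = k.o   -- integer-only join
WHERE  k.s = 42;                      -- encoded ID of ex:alice
```

Because every VP table holds only integers, each join is an index-friendly equality comparison on BIGINT columns rather than a string comparison on full IRIs.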



When to Use pg_ripple

pg_ripple is a PostgreSQL extension that turns your database into a knowledge graph store. This page helps you decide whether it fits your architecture.

Decision flowchart

Ask yourself these questions in order:

  1. Do you already run PostgreSQL? If yes, pg_ripple integrates with zero additional infrastructure for the data store. If you run a different database, evaluate the migration cost.
  2. Do you need to model complex relationships? If your data is primarily tabular with few joins, standard SQL may be simpler. If you have deeply nested, many-to-many, or hierarchical relationships, a graph model helps.
  3. Do you need a standard query language? SPARQL is a W3C standard with broad tool support. If you prefer a property-graph query language (Cypher/GQL), consider Neo4j or Amazon Neptune.
  4. Do you need reasoning or validation? pg_ripple includes SHACL validation and Datalog reasoning. Standalone triple stores like Virtuoso or Blazegraph may not.
  5. Do you need graph context for LLM prompts? pg_ripple combines SPARQL graph traversal with pgvector similarity search in a single query — something pure vector databases cannot do.

Comparison matrix

Criterion               pg_ripple               Plain SQL         Virtuoso / Blazegraph   Neo4j            Pure vector DB
Deployment              PostgreSQL extension    Any RDBMS         Standalone JVM          Standalone       Standalone
Query language          SPARQL 1.1              SQL               SPARQL 1.1              Cypher           Proprietary
Data model              RDF (triples)           Relational        RDF (triples)           Property graph   Vectors + metadata
Schema validation       SHACL                   CHECK / triggers  Varies                  Constraints      None
Reasoning               Datalog (RDFS, OWL RL)  Manual SQL        RDFS / OWL (varies)     None built-in    None
Vector search           pgvector integration    pgvector          Not built-in            Limited          Native
Hybrid graph+vector     Yes (single query)      Manual joins      No                      No               No
HTTP API                pg_ripple_http          Build your own    Built-in                Built-in         Built-in
Transactions            Full PostgreSQL ACID    Full ACID         Varies                  ACID             Varies
Backup/restore          pg_dump / pg_restore    Standard          Custom tools            Custom tools     Custom tools
Operational complexity  Low (PostgreSQL)        Low               Medium–High             Medium           Medium

When pg_ripple is a good fit

  • You already operate PostgreSQL and want to avoid managing a separate graph database
  • Your data has rich, interconnected relationships (ontologies, catalogs, supply chains)
  • You need SPARQL 1.1 compliance for interoperability with W3C-standard tools
  • You need to validate data quality against formal rules (SHACL)
  • You need to derive new facts from existing data (Datalog reasoning, OWL RL, RDFS)
  • You want to combine graph traversal with vector similarity for RAG pipelines
  • You need full ACID transactions on graph data

When pg_ripple is not the best fit

  • Graph datasets exceeding ~1 billion triples: pg_ripple has been tested to 100M triples. For very large datasets, consider distributed solutions.
  • Property graph with Cypher/GQL: if your team already uses Cypher and Neo4j, migrating to SPARQL has a learning curve. pg_ripple speaks SPARQL, not Cypher.
  • Pure vector search workload: if you only need approximate nearest neighbor search without graph traversal, pgvector alone is simpler.
  • Real-time streaming graphs: pg_ripple processes data in transactions, not continuous streams. For streaming graph analytics, consider Apache Flink with a graph library.
  • No PostgreSQL in your stack: if you run MySQL, MongoDB, or a managed NoSQL service and have no plans to adopt PostgreSQL, introducing it solely for pg_ripple adds operational overhead.

AI/LLM comparison: when does graph context outperform flat vector retrieval?

Graph-augmented retrieval helps when:

  • The query requires multi-hop reasoning — "find papers by co-authors of Alice's co-authors" cannot be answered by vector similarity alone
  • Entity deduplication matters — owl:sameAs canonicalization ensures the same entity is not embedded multiple times with different IRIs
  • Structured output is needed — JSON-LD framing produces token-efficient, structured context that flat top-k results cannot provide
  • Provenance matters — graph traversal can trace why a fact is relevant, not just that it is similar

Pure vector search (Qdrant, Weaviate, pgvector-only) is sufficient when:

  • The query is a simple "find similar documents" without relationship constraints
  • Your corpus is unstructured text without entity-level structure
  • Latency requirements are sub-millisecond at millions of vectors


Installation

pg_ripple is a PostgreSQL 18 extension written in Rust. Choose the installation method that fits your environment.

Docker Compose

The fastest path to a working pg_ripple instance. No build tools required.

# Start pg_ripple with Docker Compose
docker compose up -d

# Connect
psql -h localhost -p 5432 -U postgres -d pg_ripple

The docker-compose.yml in the repository root starts PostgreSQL 18 with pg_ripple pre-installed and the extension created in the default database.

Verify the installation

SELECT pg_ripple.triple_count();

The result should be 0 — the extension is installed and ready.

From source (cargo pgrx)

Build and install directly into a local PostgreSQL 18 instance.

Prerequisites

  • Rust (stable, edition 2024)
  • PostgreSQL 18 development headers
  • cargo-pgrx 0.17

# Install cargo-pgrx
cargo install cargo-pgrx --version 0.17 --locked

# Initialize pgrx with PostgreSQL 18
cargo pgrx init --pg18 $(which pg_config)

# Build and install
cargo pgrx install --release --pg-config $(which pg_config)

Create the extension

Connect to your database and run:

CREATE EXTENSION pg_ripple;

Verify

SELECT pg_ripple.triple_count();

Configuration

pg_ripple works out of the box with default settings. For production deployments, you may want to adjust GUC parameters — see Configuration and Tuning.

For HTAP storage (background merge worker) and shared-memory dictionary cache, add pg_ripple to shared_preload_libraries in postgresql.conf:

shared_preload_libraries = 'pg_ripple'

Restart PostgreSQL after this change.
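After the restart, you can confirm the setting took effect with a standard PostgreSQL command:

```sql
SHOW shared_preload_libraries;
-- The output should include pg_ripple
```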

Troubleshooting

Wrong PostgreSQL version

pg_ripple requires PostgreSQL 18. Check your version:

pg_config --version

Missing shared_preload_libraries

If you see errors about shared memory or the merge worker not starting, ensure pg_ripple is in shared_preload_libraries and PostgreSQL has been restarted.

pgrx version mismatch

pg_ripple requires cargo-pgrx 0.17. If you have an older version:

cargo install cargo-pgrx --version 0.17 --locked --force

Extension not found after install

If CREATE EXTENSION pg_ripple fails with "extension not found", verify that the extension files were installed to the correct PostgreSQL directory:

pg_config --sharedir
ls $(pg_config --sharedir)/extension/pg_ripple*

Docker container fails to start

Check logs:

docker compose logs pg_ripple

Common causes: port 5432 already in use (change the port mapping), insufficient memory (pg_ripple recommends at least 512MB).


Hello World — Five-Minute Walkthrough

This walkthrough takes you from an empty database to working SPARQL queries in five minutes. You will load ten triples about people and movies, then run three queries of increasing complexity.

Prerequisites

pg_ripple is installed and you are connected to a PostgreSQL database with the extension created. See Installation if you have not done this yet.

Step 1: Register prefixes

Prefixes are shortcuts for long IRIs. Register a few common ones:

SELECT pg_ripple.register_prefix('ex', 'http://example.org/');
SELECT pg_ripple.register_prefix('foaf', 'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('schema', 'http://schema.org/');

Step 2: Load data

Load ten triples about people and the movies they directed or acted in:

SELECT pg_ripple.load_turtle('
  @prefix ex:     <http://example.org/> .
  @prefix foaf:   <http://xmlns.com/foaf/0.1/> .
  @prefix schema: <http://schema.org/> .

  ex:alice   foaf:name     "Alice" .
  ex:alice   schema:knows  ex:bob .
  ex:bob     foaf:name     "Bob" .
  ex:bob     schema:knows  ex:carol .
  ex:carol   foaf:name     "Carol" .
  ex:movie1  schema:name   "The Graph" .
  ex:movie1  schema:director ex:alice .
  ex:movie1  schema:actor    ex:bob .
  ex:movie2  schema:name   "Linked Data" .
  ex:movie2  schema:director ex:bob .
');

The function returns the number of triples loaded (10).

Step 3: Query — basic pattern

Find all movies and their directors:

SELECT * FROM pg_ripple.sparql('
  PREFIX schema: <http://schema.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?movieName ?director WHERE {
    ?movie schema:director ?person .
    ?movie schema:name ?movieName .
    ?person foaf:name ?director .
  }
');

Each row in the result is a JSONB object with the variable bindings. You should see "The Graph" directed by "Alice" and "Linked Data" directed by "Bob".
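Because each row is JSONB, you can post-process bindings with PostgreSQL's JSON operators. The sketch below assumes sparql() returns one JSONB column per row (aliased here as r); the exact column layout may differ:

```sql
-- Pull a single binding out of each JSONB result row
SELECT r->>'director' AS director
FROM   pg_ripple.sparql('
  PREFIX schema: <http://schema.org/>
  PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
  SELECT ?director WHERE {
    ?movie schema:director ?person .
    ?person foaf:name ?director .
  }
') AS r;
```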

Step 4: Query — OPTIONAL

Find all movies with their directors, and actors if they have any:

SELECT * FROM pg_ripple.sparql('
  PREFIX schema: <http://schema.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?movieName ?directorName ?actorName WHERE {
    ?movie schema:name ?movieName .
    ?movie schema:director ?director .
    ?director foaf:name ?directorName .
    OPTIONAL {
      ?movie schema:actor ?actor .
      ?actor foaf:name ?actorName .
    }
  }
');

"The Graph" has an actor (Bob), while "Linked Data" does not — the actorName column is null for that row. The OPTIONAL keyword works like a SQL LEFT JOIN.

Step 5: Query — property path

Find everyone Alice is connected to, directly or indirectly, through schema:knows links:

SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  PREFIX schema: <http://schema.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE {
    ex:alice schema:knows+ ?person .
    ?person foaf:name ?name .
  }
');

The + operator follows the schema:knows relationship one or more times. Alice knows Bob directly, and Bob knows Carol, so the query returns both "Bob" and "Carol".

What you just learned

  • Triples are facts with three parts: subject, predicate, object
  • Prefixes are shortcuts for long IRIs
  • load_turtle() loads data in Turtle format
  • sparql() runs SPARQL queries and returns results as JSONB
  • OPTIONAL is like a SQL LEFT JOIN
  • Property paths (+, *) follow chains of relationships
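To see the difference between + and *, rerun the Step 5 query with schema:knows*, which matches zero or more hops and therefore also binds Alice herself via the zero-hop case:

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  PREFIX schema: <http://schema.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE {
    ex:alice schema:knows* ?person .
    ?person foaf:name ?name .
  }
');
-- Returns "Alice", "Bob", and "Carol"
```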


Guided Tutorial — Build a Knowledge Graph in 30 Minutes

This tutorial picks up where the Hello World walkthrough ends. You will build a bibliographic knowledge graph with papers, authors, institutions, and citations — then validate it, reason over it, and export it as JSON-LD.

The tutorial is organized in four independent segments. Each takes under ten minutes and leaves you with a working, progressively richer knowledge graph. You can stop after any segment.

Note

This tutorial uses an academic bibliographic dataset. The patterns — entity relationships, typed literals, named graphs, inference, validation — apply equally to product catalogs, supply chains, organizational hierarchies, or any domain with interconnected data.

Prerequisites

pg_ripple is installed and you are connected to a PostgreSQL database with the extension created. See Installation.


Segment 1: Load and Explore (10 min)

Register prefixes

SELECT pg_ripple.register_prefix('bib', 'http://example.org/bib/');
SELECT pg_ripple.register_prefix('foaf', 'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('dc', 'http://purl.org/dc/elements/1.1/');
SELECT pg_ripple.register_prefix('dcterms', 'http://purl.org/dc/terms/');
SELECT pg_ripple.register_prefix('schema', 'http://schema.org/');
SELECT pg_ripple.register_prefix('skos', 'http://www.w3.org/2004/02/skos/core#');

Load the bibliographic dataset

SELECT pg_ripple.load_turtle('
@prefix bib:     <http://example.org/bib/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .

bib:mit       a schema:Organization ; schema:name "MIT" .
bib:stanford  a schema:Organization ; schema:name "Stanford University" .
bib:oxford    a schema:Organization ; schema:name "University of Oxford" .

bib:alice     a foaf:Person ; foaf:name "Alice Chen" ;
              schema:affiliation bib:mit .
bib:bob       a foaf:Person ; foaf:name "Bob Smith" ;
              schema:affiliation bib:stanford .
bib:carol     a foaf:Person ; foaf:name "Carol Martinez" ;
              schema:affiliation bib:oxford .

bib:paper1    a schema:ScholarlyArticle ;
              dc:title "Knowledge Graphs in Practice" ;
              dc:creator bib:alice ; dc:creator bib:bob ;
              dcterms:issued "2024-01-15"^^xsd:date ;
              schema:about <http://example.org/bib/kg> .

bib:paper2    a schema:ScholarlyArticle ;
              dc:title "Efficient SPARQL Query Processing" ;
              dc:creator bib:bob ; dc:creator bib:carol ;
              dcterms:issued "2024-03-22"^^xsd:date .

bib:paper3    a schema:ScholarlyArticle ;
              dc:title "Graph-Enhanced Retrieval for LLMs" ;
              dc:creator bib:alice ;
              dcterms:issued "2024-06-10"^^xsd:date .

bib:paper2    dcterms:references bib:paper1 .
bib:paper3    dcterms:references bib:paper1 .
bib:paper3    dcterms:references bib:paper2 .

bib:alice foaf:knows bib:bob .
bib:bob   foaf:knows bib:carol .
');

Explore: find all papers by Alice

SELECT * FROM pg_ripple.sparql('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX bib: <http://example.org/bib/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?title WHERE {
    ?paper dc:creator bib:alice .
    ?paper dc:title ?title .
  }
');

Explore: citation chains

Find papers that cite papers Alice authored:

SELECT * FROM pg_ripple.sparql('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX bib: <http://example.org/bib/>
  SELECT ?citingTitle ?citedTitle WHERE {
    ?citing dcterms:references ?cited .
    ?cited dc:creator bib:alice .
    ?citing dc:title ?citingTitle .
    ?cited dc:title ?citedTitle .
  }
');

Explore: count papers per author

SELECT * FROM pg_ripple.sparql('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name (COUNT(?paper) AS ?papers) WHERE {
    ?paper dc:creator ?author .
    ?author foaf:name ?name .
  }
  GROUP BY ?name
  ORDER BY DESC(?papers)
');

Segment 2: Validate (10 min)

SHACL (Shapes Constraint Language) lets you define data quality rules. You will create a shape that requires every ScholarlyArticle to have a title and at least one creator.

Load a SHACL shape

SELECT pg_ripple.load_shacl('
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/shapes/ArticleShape>
  a sh:NodeShape ;
  sh:targetClass schema:ScholarlyArticle ;
  sh:property [
    sh:path dc:title ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:datatype xsd:string ;
    sh:message "Every article must have exactly one title" ;
  ] ;
  sh:property [
    sh:path dc:creator ;
    sh:minCount 1 ;
    sh:message "Every article must have at least one creator" ;
  ] .
');

Validate the dataset

SELECT pg_ripple.validate();

The result is a JSONB validation report. If all articles conform, the report shows zero violations. Now insert a bad article to see validation catch it:

SELECT pg_ripple.insert_triple(
  'http://example.org/bib/bad_paper',
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  'http://schema.org/ScholarlyArticle'
);

SELECT pg_ripple.validate();

The report now shows a violation: the article has no title and no creator.


Segment 3: Reason (10 min)

Datalog rules let you derive new facts. You will write a rule that infers transitive co-authorship: if Alice co-authored a paper with Bob, and Bob co-authored with Carol, then Alice and Carol are indirectly connected.

Write and load a rule

SELECT pg_ripple.load_rules('
  coauthor(?a, ?b) :- <http://purl.org/dc/elements/1.1/creator>(?paper, ?a),
                      <http://purl.org/dc/elements/1.1/creator>(?paper, ?b),
                      ?a != ?b.
  connected(?a, ?b) :- coauthor(?a, ?b).
  connected(?a, ?b) :- connected(?a, ?c), coauthor(?c, ?b), ?a != ?b.
', 'coauthorship');

Run inference

SELECT pg_ripple.infer('coauthorship');

This returns the number of new facts derived.

Query the derived facts

SELECT * FROM pg_ripple.sparql('
  PREFIX bib: <http://example.org/bib/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE {
    bib:alice <http://example.org/bib/connected> ?person .
    ?person foaf:name ?name .
  }
');

Alice is now connected to Bob (her direct co-author on paper1) and, through Bob's co-authorship with Carol on paper2, to Carol.


Segment 4: Export (10 min)

Export your knowledge graph as JSON-LD, shaped for an API using a frame template.

Export as Turtle

SELECT pg_ripple.export_turtle();

This returns all triples in human-readable Turtle format.

Export as JSON-LD with framing

SELECT pg_ripple.sparql_construct_jsonld('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX schema: <http://schema.org/>
  CONSTRUCT {
    ?paper dc:title ?title .
    ?paper dc:creator ?author .
    ?author foaf:name ?name .
    ?author schema:affiliation ?org .
    ?org schema:name ?orgName .
  }
  WHERE {
    ?paper a schema:ScholarlyArticle .
    ?paper dc:title ?title .
    ?paper dc:creator ?author .
    ?author foaf:name ?name .
    OPTIONAL {
      ?author schema:affiliation ?org .
      ?org schema:name ?orgName .
    }
  }
');

The result is a nested JSON-LD document with papers, their authors, and institutional affiliations — ready to serve from a REST API.


What you built

In 30 minutes, you created a knowledge graph with:

  • Structured data — papers, authors, institutions, and citations as RDF triples
  • Quality rules — SHACL shapes that catch incomplete articles
  • Derived knowledge — Datalog rules that infer transitive co-authorship
  • API-ready export — JSON-LD output shaped for downstream consumers


Key Concepts — RDF for PostgreSQL Users

If you know PostgreSQL, you already understand most of what you need to work with pg_ripple. This page maps RDF concepts to their PostgreSQL equivalents.

Triples

A triple is the atomic unit of data in RDF. It has three parts:

Part       What it is                     PostgreSQL analogy
Subject    The entity being described     A row's primary key
Predicate  The relationship or attribute  A column name
Object     The value or related entity    A cell value or foreign key

For example, the fact "Alice knows Bob" is the triple:

<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .

In pg_ripple, this triple is stored in a VP table named after the predicate (foaf:knows), with integer-encoded subject and object columns.

IRIs

An IRI (Internationalized Resource Identifier) is a globally unique identifier for an entity or relationship. Think of it as a namespaced primary key that is guaranteed unique across all datasets in the world.

http://example.org/alice          -- an entity
http://xmlns.com/foaf/0.1/knows   -- a relationship

Prefixes are shortcuts to avoid writing full IRIs repeatedly:

SELECT pg_ripple.register_prefix('ex', 'http://example.org/');
-- Now ex:alice means http://example.org/alice

Blank nodes

A blank node is an anonymous entity — like a row with no primary key. It exists only within the document where it was created.

ex:alice foaf:address [ foaf:city "Boston" ; foaf:country "US" ] .

The address has no IRI. It is a blank node, identified internally by a system-generated label. Blank nodes from different load_turtle() calls are always distinct entities, even if they share the same label.

Warning

Blank nodes cannot be referenced from outside their originating load call. If you need to reference an entity from multiple places, give it an IRI.

Literals

A literal is a data value — a string, number, date, or boolean. Literals can have a datatype or a language tag.

Literal                   Type                    PostgreSQL equivalent
"Alice"                   Plain string            TEXT
"42"^^xsd:integer         Typed integer           INTEGER
"2024-01-15"^^xsd:date    Typed date              DATE
"Bonjour"@fr              Language-tagged string  No direct equivalent

In pg_ripple, all literals are dictionary-encoded to compact integer IDs for storage. The original string representation is preserved and decoded on query output.

Predicates and VP tables

In a relational database, a table groups all attributes of a single entity type. In pg_ripple, data is organized by predicate — each unique predicate gets its own table (a Vertical Partitioning, or VP, table).

Relational:  persons(id, name, email, knows_id)
pg_ripple:   vp_foaf_name(s, o)      -- subject → name
             vp_foaf_knows(s, o)     -- subject → object
             vp_schema_email(s, o)   -- subject → email

This structure makes join-heavy SPARQL queries fast because each predicate's data is co-located and indexed.

Named graphs

A named graph is a labeled collection of triples — like a PostgreSQL schema that groups related tables.

-- Create a named graph
SELECT pg_ripple.create_graph('http://example.org/publications');

-- Load data into it
SELECT pg_ripple.load_turtle_into_graph(
  '<http://example.org/paper1> <http://purl.org/dc/elements/1.1/title> "My Paper" .',
  'http://example.org/publications'
);

Named graphs are useful for:

  • Multi-source data: keep data from different sources separate
  • Access control: grant read access to specific graphs per role
  • Versioning: load new data into a fresh graph, validate, then swap

All triples without an explicit graph belong to the default graph (graph ID = 0).
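A named graph can then be queried with SPARQL's standard GRAPH keyword, which scopes a pattern to one graph. This sketch assumes pg_ripple.sparql() supports GRAPH patterns (part of SPARQL 1.1):

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  SELECT ?title WHERE {
    GRAPH <http://example.org/publications> {
      ?paper dc:title ?title .
    }
  }
');
```

Patterns outside any GRAPH block match the default graph only.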

RDF-star

Standard RDF says "Alice knows Bob." But what if you want to say when Alice met Bob, or who recorded that fact? RDF-star lets you make statements about statements:

<< ex:alice foaf:knows ex:bob >> ex:since "2020"^^xsd:gYear .

This says: "The fact that Alice knows Bob has been true since 2020." In pg_ripple, each triple has a statement identifier (SID) that can be used as the subject or object of other triples, enabling edge properties similar to labeled property graphs.
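Assuming pg_ripple's sparql() accepts SPARQL-star syntax (the query-language companion to RDF-star storage), the annotation above could be queried with the same quoting syntax. Treat this as a sketch, not confirmed API behavior:

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?a ?b ?year WHERE {
    << ?a foaf:knows ?b >> ex:since ?year .
  }
');
```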

SPARQL

SPARQL is the standard query language for RDF data — the equivalent of SQL for relational databases. Where SQL queries tables, SPARQL queries graph patterns.

SQL                                    SPARQL
SELECT name FROM persons WHERE id = 1  SELECT ?name WHERE { ex:person1 foaf:name ?name }
JOIN                                   Graph pattern matching (implicit)
LEFT JOIN                              OPTIONAL { }
WHERE x IN (...)                       VALUES ?x { ... }
GROUP BY ... HAVING                    GROUP BY ... HAVING
WITH RECURSIVE                         Property paths (foaf:knows+)

In pg_ripple, SPARQL queries are compiled to SQL and executed via PostgreSQL's query engine. You call them through pg_ripple.sparql():

SELECT * FROM pg_ripple.sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?person foaf:name ?name }
');

Dictionary encoding

pg_ripple does not store raw strings in its data tables. Every IRI, blank node, and literal is mapped to a compact BIGINT (i64) by the dictionary encoder. VP tables contain only integer columns, making joins and comparisons fast.

You never need to interact with dictionary IDs directly — sparql() and find_triples() handle encoding and decoding automatically. For advanced use cases, encode_term() and decode_id() are available.
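For example, a term can be round-tripped through the dictionary. The call shapes below are a sketch based on the function names mentioned above; check the actual signatures before relying on them:

```sql
-- Encode a term to its dictionary ID
SELECT pg_ripple.encode_term('<http://example.org/alice>') AS id;

-- Decode an ID back to its term
SELECT pg_ripple.decode_id(
         pg_ripple.encode_term('<http://example.org/alice>')
       );  -- should return the original IRI
```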

Summary of analogies

RDF concept   PostgreSQL analogy
Triple        Row in a table
Subject       Primary key value
Predicate     Column name / table name (VP)
Object        Cell value or foreign key
IRI           Globally unique identifier
Blank node    Row with system-generated ID
Literal       Typed column value
Named graph   Schema
SPARQL        SQL
SHACL shape   CHECK constraint / trigger
Datalog rule  Materialized view definition
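The Datalog-rule analogy can be made concrete: the coauthor rule from the tutorial behaves like a materialized view over the underlying facts, refreshed when inference runs. A rough plain-SQL rendering, where the VP table and column names are hypothetical and stand in for pg_ripple's internal storage:

```sql
-- Plain-SQL analogue of:
--   coauthor(?a, ?b) :- creator(?paper, ?a), creator(?paper, ?b), ?a != ?b.
CREATE MATERIALIZED VIEW coauthor AS
SELECT DISTINCT c1.o AS a, c2.o AS b
FROM   vp_dc_creator c1
JOIN   vp_dc_creator c2 ON c1.s = c2.s   -- same paper
WHERE  c1.o <> c2.o;                     -- distinct authors

-- pg_ripple.infer() plays the role of:
REFRESH MATERIALIZED VIEW coauthor;
```

The recursive connected rule would correspond to a WITH RECURSIVE query over coauthor; the Datalog engine handles that recursion and refresh bookkeeping for you.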


§2.1 Storing Knowledge

What and Why

pg_ripple stores data as RDF triples — the W3C standard for representing knowledge. Every fact is a three-part statement: a subject, a predicate, and an object. This structure is deceptively simple but powerful enough to model any domain — from bibliographic records and biomedical ontologies to enterprise knowledge graphs.

Why triples instead of tables?

  • Schema-free evolution: add new predicates without ALTER TABLE.
  • Natural linking: every entity is an IRI — links across datasets are free.
  • Standards-based: SPARQL, SHACL, OWL, and thousands of public vocabularies work out of the box.
  • Provenance-ready: RDF-star lets you annotate individual facts with confidence scores, sources, and timestamps.

pg_ripple stores triples inside PostgreSQL using Vertical Partitioning (VP) — one internal table per predicate, with all values dictionary-encoded as BIGINT. You never see this machinery directly; you interact through insert_triple(), load_turtle(), and SPARQL.


How It Works

The Triple Model

Every RDF triple has the form:

<subject>  <predicate>  <object> .

  • Subject: the thing you are describing (always an IRI or blank node).
  • Predicate: the relationship or property (always an IRI).
  • Object: the value — an IRI (another entity), a literal (string, number, date), or a blank node.

Note

IRIs (Internationalized Resource Identifiers) look like URLs but are identifiers, not necessarily web addresses. <https://example.org/paper/42> identifies a paper — it does not need to resolve to a web page.

Named Graphs

Triples can be grouped into named graphs — logical partitions identified by an IRI. This is useful for:

  • Tracking provenance: "these triples came from PubMed"
  • Multi-tenancy: one graph per customer
  • Inference output: derived triples go into a separate graph

pg_ripple uses graph ID 0 for the default graph (triples with no explicit graph). Named graphs get a positive integer ID via dictionary encoding.

Blank Nodes

Blank nodes are anonymous identifiers — they represent "something exists" without giving it a global IRI. pg_ripple encodes blank nodes with a _: prefix:

SELECT pg_ripple.insert_triple(
    '_:review1',
    '<https://schema.org/author>',
    '<https://example.org/person/alice>'
);

Warning

Blank nodes are document-scoped. Two separate load_turtle() calls that both use _:x will create two different internal identifiers. If you need stable cross-document identity, use IRIs instead.

RDF-Star (Quoted Triples)

RDF-star lets you make statements about other statements. This is essential for provenance, confidence scores, and temporal annotations.

A quoted triple wraps << subject predicate object >> and can appear as a subject or object in another triple:

-- "The fact that Paper42 was authored by Alice has confidence 0.95"
SELECT pg_ripple.insert_triple(
    '<< <https://example.org/paper/42> <https://purl.org/dc/terms/creator> <https://example.org/person/alice> >>',
    '<https://example.org/confidence>',
    '"0.95"^^<http://www.w3.org/2001/XMLSchema#decimal>'
);

Dictionary Encoding

Every IRI, blank node, and literal is mapped to a BIGINT (i64) via XXH3-128 hashing before storage. VP tables contain only integers — this makes joins fast and storage compact. You never need to think about encoding; pg_ripple handles it transparently.


Worked Examples

The examples in this chapter use a bibliographic dataset: papers, authors, institutions, journals, and citations.

Setting Up Prefixes

Register namespace prefixes so SPARQL queries are readable:

SELECT pg_ripple.register_prefix('ex',    'https://example.org/');
SELECT pg_ripple.register_prefix('dct',   'http://purl.org/dc/terms/');
SELECT pg_ripple.register_prefix('foaf',  'http://xmlns.com/foaf/0.1/');
SELECT pg_ripple.register_prefix('bibo',  'http://purl.org/ontology/bibo/');
SELECT pg_ripple.register_prefix('schema','https://schema.org/');
SELECT pg_ripple.register_prefix('xsd',   'http://www.w3.org/2001/XMLSchema#');

Inserting Individual Triples

-- Create a paper
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    '<http://purl.org/ontology/bibo/AcademicArticle>'
);

-- Add a title
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/title>',
    '"Knowledge Graphs in Practice"'
);

-- Add an author
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/creator>',
    '<https://example.org/person/alice>'
);

-- Author metadata
SELECT pg_ripple.insert_triple(
    '<https://example.org/person/alice>',
    '<http://xmlns.com/foaf/0.1/name>',
    '"Alice Johnson"'
);

SELECT pg_ripple.insert_triple(
    '<https://example.org/person/alice>',
    '<https://schema.org/affiliation>',
    '<https://example.org/institution/mit>'
);

-- Institution metadata
SELECT pg_ripple.insert_triple(
    '<https://example.org/institution/mit>',
    '<http://xmlns.com/foaf/0.1/name>',
    '"Massachusetts Institute of Technology"'
);

Loading a Full Dataset with Turtle

For bulk data, Turtle format is more natural:

SELECT pg_ripple.load_turtle('
@prefix ex:     <https://example.org/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

ex:paper/42 a bibo:AcademicArticle ;
    dct:title "Knowledge Graphs in Practice" ;
    dct:creator ex:person/alice, ex:person/bob ;
    dct:date "2024-03-15"^^xsd:date ;
    bibo:citedBy ex:paper/99 ;
    schema:keywords "knowledge graph", "RDF", "SPARQL" .

ex:paper/99 a bibo:AcademicArticle ;
    dct:title "Graph Neural Networks for Entity Resolution" ;
    dct:creator ex:person/carol ;
    bibo:cites ex:paper/42 .

ex:person/alice foaf:name "Alice Johnson" ;
    schema:affiliation ex:institution/mit .

ex:person/bob foaf:name "Bob Smith" ;
    schema:affiliation ex:institution/stanford .

ex:person/carol foaf:name "Carol Williams" ;
    schema:affiliation ex:institution/mit .

ex:institution/mit foaf:name "Massachusetts Institute of Technology" .
ex:institution/stanford foaf:name "Stanford University" .
');

Using Named Graphs

Store triples from different sources in separate graphs:

-- Create named graphs for different data sources
SELECT pg_ripple.create_graph('https://example.org/graph/pubmed');
SELECT pg_ripple.create_graph('https://example.org/graph/arxiv');

-- Load PubMed data into its graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex:   <https://example.org/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

ex:paper/100 a bibo:AcademicArticle ;
    dct:title "Drug Interaction Networks" ;
    dct:creator ex:person/dave .
', 'https://example.org/graph/pubmed');

-- Load arXiv data into its graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex:   <https://example.org/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

ex:paper/200 a bibo:AcademicArticle ;
    dct:title "Transformer Architectures for NLP" ;
    dct:creator ex:person/eve .
', 'https://example.org/graph/arxiv');

-- List all named graphs
SELECT * FROM pg_ripple.list_graphs();

RDF-Star for Provenance and Confidence

Annotate citations with provenance metadata:

-- Record that Paper 42 cites Paper 99 (the base fact)
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/ontology/bibo/cites>',
    '<https://example.org/paper/99>'
);

-- Annotate this citation with a confidence score
SELECT pg_ripple.insert_triple(
    '<< <https://example.org/paper/42> <http://purl.org/ontology/bibo/cites> <https://example.org/paper/99> >>',
    '<https://example.org/confidence>',
    '"0.92"^^<http://www.w3.org/2001/XMLSchema#decimal>'
);

-- Record who asserted this citation
SELECT pg_ripple.insert_triple(
    '<< <https://example.org/paper/42> <http://purl.org/ontology/bibo/cites> <https://example.org/paper/99> >>',
    '<http://purl.org/dc/terms/source>',
    '<https://example.org/system/citation-extractor>'
);

Translating a Relational Schema to RDF

Suppose you have a relational database with papers and authors tables, where authors.institution_id references an institutions table:

papers.id | papers.title                 | papers.year
42        | Knowledge Graphs in Practice | 2024

authors.id | authors.name  | authors.institution_id
1          | Alice Johnson | 10

The mapping pattern:

  1. Each row becomes a subject IRI: <https://example.org/paper/{id}>
  2. Each column becomes a predicate: use a standard vocabulary (Dublin Core, Schema.org, FOAF)
  3. Foreign keys become object IRIs: authors.institution_id = 10 → <https://example.org/institution/10>
  4. Scalar values become literals: papers.title → "Knowledge Graphs in Practice"
-- Row from papers table → triples
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    '<http://purl.org/ontology/bibo/AcademicArticle>'
);
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/title>',
    '"Knowledge Graphs in Practice"'
);
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/date>',
    '"2024"^^<http://www.w3.org/2001/XMLSchema#gYear>'
);

-- Foreign key → IRI link
SELECT pg_ripple.insert_triple(
    '<https://example.org/person/1>',
    '<https://schema.org/affiliation>',
    '<https://example.org/institution/10>'
);

Common Patterns

Pattern: Type Hierarchies

Use rdf:type and rdfs:subClassOf to create type hierarchies:

SELECT pg_ripple.load_turtle('
@prefix ex:   <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

bibo:AcademicArticle rdfs:subClassOf bibo:Article .
bibo:Article rdfs:subClassOf bibo:Document .

ex:paper/42 a bibo:AcademicArticle .
');

With RDFS inference enabled (see §2.5), pg_ripple can automatically derive that ex:paper/42 is also a bibo:Article and a bibo:Document.
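
For example, once inference is on, an ordinary type query should surface the derived classes as well. A sketch — the configuration call itself is covered in §2.5:

```sql
-- With RDFS inference enabled (see §2.5), querying the types of ex:paper/42
-- should return the asserted class plus the derived superclasses.
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <https://example.org/>
SELECT ?type WHERE {
    ex:paper/42 a ?type .
}
');
-- Expected to include bibo:AcademicArticle, bibo:Article, and bibo:Document
```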

Pattern: Multi-Valued Properties

Unlike relational columns, RDF predicates are naturally multi-valued:

-- A paper can have multiple authors — just insert multiple triples
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/creator>',
    '<https://example.org/person/alice>'
);
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/creator>',
    '<https://example.org/person/bob>'
);

Pattern: Typed and Language-Tagged Literals

-- Typed literal (date)
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/date>',
    '"2024-03-15"^^<http://www.w3.org/2001/XMLSchema#date>'
);

-- Language-tagged string
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/title>',
    '"Knowledge Graphs in Practice"@en'
);
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/title>',
    '"Wissensgraphen in der Praxis"@de'
);

Pattern: Reification with RDF-Star vs Named Graphs

Two approaches for tracking who said what:

RDF-star — annotate individual triples:

SELECT pg_ripple.insert_triple(
    '<< <https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> >>',
    '<http://purl.org/dc/terms/source>',
    '<https://example.org/dataset/pubmed>'
);

Named graphs — group triples by source:

SELECT pg_ripple.load_turtle_into_graph('
@prefix ex:  <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
ex:paper/42 dct:creator ex:person/alice .
', 'https://example.org/dataset/pubmed');

Use RDF-star when different triples about the same entity have different provenance. Use named graphs when entire batches share the same source.


Performance and Trade-offs

Approach           | Insert rate        | Query flexibility | Storage overhead
insert_triple()    | ~5,000 triples/s   | Full              | Highest (per-call overhead)
load_turtle()      | ~50,000 triples/s  | Full              | Low (batch dictionary encoding)
load_turtle_file() | ~100,000 triples/s | Full              | Lowest (server-side streaming)
  • Dictionary cache: frequently used IRIs (predicates, common types) stay in the shared-memory LRU cache. Check hit rates with SELECT pg_ripple.cache_stats().
  • VP table promotion: predicates with fewer than 1,000 triples share the vp_rare consolidation table. Once a predicate crosses the threshold, it gets its own dedicated VP table with dual B-tree indexes.
  • Named graph overhead: the g column adds 8 bytes per triple. If you do not need named graphs, keep all data in the default graph to avoid the cost of graph-ID lookups.
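
A quick health check after a workload, using the inspection functions mentioned above and in the ETL example:

```sql
-- Dictionary cache hit rates (low hit rates suggest the cache is undersized)
SELECT pg_ripple.cache_stats();

-- Per-predicate statistics: handy for spotting predicates that have crossed
-- the 1,000-triple vp_rare promotion threshold
SELECT pg_ripple.stats();
```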

Tip

After large bulk loads, run ANALYZE on the internal tables to update PostgreSQL planner statistics:

SELECT pg_ripple.vacuum();


Gotchas and Debugging

IRI formatting: All IRIs must be wrapped in angle brackets (<...>) in function calls. Forgetting the brackets is the most common error:

-- WRONG: will be treated as a plain literal
SELECT pg_ripple.insert_triple(
    'https://example.org/paper/42',
    'http://purl.org/dc/terms/title',
    '"Hello"'
);

-- CORRECT: angle brackets around IRIs
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/42>',
    '<http://purl.org/dc/terms/title>',
    '"Hello"'
);

Blank node scoping: Blank nodes from separate load_turtle() calls are independent. Two calls using _:x create two different entities.

Literal quoting: Literals must be wrapped in double quotes within the single-quoted SQL string. Typed literals use ^^<datatype> suffix:

-- Plain string
'"Hello"'

-- Typed integer
'"42"^^<http://www.w3.org/2001/XMLSchema#integer>'

-- Language-tagged string
'"Bonjour"@fr'

Checking what is stored: Use find_triples() with wildcards to inspect data:

-- All triples about Paper 42
SELECT * FROM pg_ripple.find_triples(
    '<https://example.org/paper/42>', NULL, NULL
);

-- All triples with the dct:creator predicate
SELECT * FROM pg_ripple.find_triples(
    NULL, '<http://purl.org/dc/terms/creator>', NULL
);

-- Total triple count
SELECT pg_ripple.triple_count();

Duplicate triples: Inserting the same (s, p, o, g) twice is idempotent — the second insert returns the existing SID. Use deduplicate_all() to clean up historical duplicates.
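
A minimal demonstration of the idempotent insert behavior:

```sql
-- First insert stores the triple...
SELECT pg_ripple.insert_triple(
    '<https://example.org/x>',
    '<https://example.org/rel>',
    '<https://example.org/y>'
);

-- ...the second is a no-op that returns the existing SID,
-- so the total count rises by one, not two:
SELECT pg_ripple.insert_triple(
    '<https://example.org/x>',
    '<https://example.org/rel>',
    '<https://example.org/y>'
);
SELECT pg_ripple.triple_count();

-- One-time cleanup of historical duplicates:
SELECT pg_ripple.deduplicate_all();
```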


Next Steps

§2.2 Loading Data

What and Why

Getting data into pg_ripple is the first step in building a knowledge graph. pg_ripple supports every major RDF serialization format and offers three loading strategies tuned for different scenarios: inline string loading, server-side file loading, and single-triple insertion.

Choosing the right format and loading mode matters. A 10-million-triple dataset loaded via insert_triple() in a loop takes hours; the same dataset loaded from a server-side N-Triples file via load_ntriples_file() finishes in minutes.


How It Works

Supported Formats

Format    | Function (string) | Function (file)      | Named graphs                        | Notes
Turtle    | load_turtle()     | load_turtle_file()   | No (use load_turtle_into_graph())   | Human-readable; supports prefixes, RDF-star
N-Triples | load_ntriples()   | load_ntriples_file() | No (use load_ntriples_into_graph()) | One triple per line; fastest to parse
N-Quads   | load_nquads()     | load_nquads_file()   | Yes (inline)                        | N-Triples + fourth graph column
TriG      | load_trig()       | load_trig_file()     | Yes (inline)                        | Turtle + named graph blocks
RDF/XML   | load_rdfxml()     | load_rdfxml_file()   | No (use load_rdfxml_into_graph())   | Legacy XML format; widely supported

Three Loading Modes

Mode 1: String loading — pass RDF text as a SQL string parameter. Best for small-to-medium datasets (up to a few MB) and interactive use:

SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:paper/1 ex:title "Hello World" .
');

Mode 2: Server-side file loading — read from a file on the PostgreSQL server's filesystem. Best for large datasets. Requires superuser privileges:

SELECT pg_ripple.load_turtle_file('/data/papers.ttl');

Mode 3: Single-triple insertion — insert one triple at a time. Best for real-time ingestion from application code:

SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/1>',
    '<https://example.org/title>',
    '"Hello World"'
);

The Loading Pipeline

Regardless of format, every loader follows the same internal pipeline:

  1. Parse — deserialize the RDF serialization into (subject, predicate, object, graph) quads.
  2. Encode — dictionary-encode each IRI, blank node, and literal to a BIGINT ID using batch ON CONFLICT DO NOTHING ... RETURNING.
  3. Route — look up the predicate in _pg_ripple.predicates to find the target VP table (or vp_rare).
  4. Insert — batch-insert encoded (s, o, g) rows into the appropriate VP delta table.
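
Steps 2 and 3 leave visible traces you can inspect. A sketch — the predicate catalog is an internal table, so its column layout may change between versions:

```sql
-- Which predicates exist, and which VP table each one routes to
-- (internal catalog; inspect with SELECT * rather than relying on columns)
SELECT * FROM _pg_ripple.predicates;

-- Aggregate per-predicate statistics via the public API
SELECT pg_ripple.stats();
```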

Tip

String loaders process the entire input in a single transaction. If any triple fails to parse with strict = true, the entire load is rolled back. With strict = false (the default), malformed triples are skipped and a WARNING is emitted.


Worked Examples

Loading Turtle

The most common format for hand-authored data:

SELECT pg_ripple.load_turtle('
@prefix ex:    <https://example.org/> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix bibo:  <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

ex:paper/42 a bibo:AcademicArticle ;
    dct:title "Knowledge Graphs in Practice"@en ;
    dct:creator ex:person/alice, ex:person/bob ;
    dct:date "2024-03-15"^^xsd:date ;
    bibo:citedBy ex:paper/99 ;
    schema:keywords "knowledge graph", "RDF", "SPARQL" .

ex:paper/99 a bibo:AcademicArticle ;
    dct:title "Graph Neural Networks for Entity Resolution" ;
    dct:creator ex:person/carol .

ex:person/alice foaf:name "Alice Johnson" ;
    schema:affiliation ex:institution/mit .

ex:person/bob foaf:name "Bob Smith" ;
    schema:affiliation ex:institution/stanford .

ex:person/carol foaf:name "Carol Williams" ;
    schema:affiliation ex:institution/mit .

ex:institution/mit foaf:name "Massachusetts Institute of Technology" .
ex:institution/stanford foaf:name "Stanford University" .
');

The function returns the number of triples loaded:

-- Returns: 20

Loading N-Triples

N-Triples is one triple per line with no abbreviations — optimal for machine-generated data:

SELECT pg_ripple.load_ntriples('
<https://example.org/paper/42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/AcademicArticle> .
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/bob> .
');

Loading N-Quads (with Named Graphs)

N-Quads extend N-Triples with a fourth field for the graph IRI:

SELECT pg_ripple.load_nquads('
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" <https://example.org/graph/pubmed> .
<https://example.org/paper/99> <http://purl.org/dc/terms/title> "Graph Neural Networks" <https://example.org/graph/arxiv> .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> <https://example.org/graph/pubmed> .
');

Loading TriG (Turtle with Named Graphs)

TriG wraps Turtle blocks in GRAPH { } sections:

SELECT pg_ripple.load_trig('
@prefix ex:  <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

GRAPH ex:graph/pubmed {
    ex:paper/100 a bibo:AcademicArticle ;
        dct:title "Drug Interaction Networks" ;
        dct:creator ex:person/dave .
}

GRAPH ex:graph/arxiv {
    ex:paper/200 a bibo:AcademicArticle ;
        dct:title "Transformer Architectures for NLP" ;
        dct:creator ex:person/eve .
}
');

Loading RDF/XML

The original XML serialization of RDF — common in older datasets and OWL ontologies:

SELECT pg_ripple.load_rdfxml('
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/"
         xmlns:bibo="http://purl.org/ontology/bibo/">
  <bibo:AcademicArticle rdf:about="https://example.org/paper/42">
    <dct:title>Knowledge Graphs in Practice</dct:title>
    <dct:creator rdf:resource="https://example.org/person/alice"/>
  </bibo:AcademicArticle>
</rdf:RDF>
');

Loading from Server-Side Files

For large datasets, server-side file loading avoids transferring data through the SQL protocol:

-- Load a large N-Triples dump (superuser required)
SELECT pg_ripple.load_ntriples_file('/data/exports/papers.nt');

-- Load Turtle with strict parsing (abort on any error)
SELECT pg_ripple.load_turtle_file('/data/exports/ontology.ttl', true);

-- Load into a specific named graph
SELECT pg_ripple.load_turtle_file_into_graph(
    '/data/exports/pubmed.ttl',
    'https://example.org/graph/pubmed'
);

Warning

File loading functions read from the PostgreSQL server's filesystem, not the client's. The path must be accessible to the postgres OS user. These functions require superuser privileges for security reasons.

Loading Turtle-Star (RDF-Star)

pg_ripple's Turtle parser supports RDF-star quoted triples natively:

SELECT pg_ripple.load_turtle('
@prefix ex:  <https://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<< ex:paper/42 ex:cites ex:paper/99 >> ex:confidence "0.92"^^xsd:decimal .
<< ex:paper/42 ex:cites ex:paper/99 >> ex:source ex:system/citation-extractor .
');

Loading into Named Graphs

Load data into a specific graph without the TriG/N-Quads format:

-- Create the graph first (optional — auto-created on load)
SELECT pg_ripple.create_graph('https://example.org/graph/2024');

-- Load Turtle into the named graph
SELECT pg_ripple.load_turtle_into_graph('
@prefix ex:  <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

ex:paper/300 a bibo:AcademicArticle ;
    dct:title "New Findings in Graph Theory" ;
    dct:creator ex:person/frank .
', 'https://example.org/graph/2024');

Using SPARQL Update for Loading

SPARQL INSERT DATA is another way to add triples:

SELECT pg_ripple.sparql_update('
PREFIX ex:   <https://example.org/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

INSERT DATA {
    ex:paper/500 a bibo:AcademicArticle ;
        dct:title "SPARQL Performance Tuning" ;
        dct:creator ex:person/alice .
}
');

Common Patterns

Pattern: ETL Pipeline

A typical ETL pipeline loads data in stages:

-- Step 1: Load the ontology
SELECT pg_ripple.load_turtle_file('/data/ontology.ttl');

-- Step 2: Load reference data
SELECT pg_ripple.load_ntriples_file('/data/institutions.nt');

-- Step 3: Load the main dataset
SELECT pg_ripple.load_ntriples_file('/data/papers.nt');

-- Step 4: Load supplementary data into a named graph
SELECT pg_ripple.load_nquads_file('/data/citations.nq');

-- Step 5: Update statistics
SELECT pg_ripple.vacuum();

-- Step 6: Verify the load
SELECT pg_ripple.triple_count();
SELECT pg_ripple.stats();

Pattern: Incremental Loading

For streaming data ingestion, use insert_triple() inside application code:

-- Application inserts triples as events arrive
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/new123>',
    '<http://purl.org/dc/terms/title>',
    '"Just Published: A New Study"'
);

-- Periodically compact HTAP tables
SELECT pg_ripple.compact();

Pattern: Strict vs Lenient Parsing

-- Lenient (default): skip bad triples, emit WARNINGs
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:good ex:rel ex:target .
ex:bad ex:rel "unclosed literal .
ex:also_good ex:rel ex:other .
', false);
-- Returns: 2 (skipped the bad triple)

-- Strict: abort on any parse error
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:good ex:rel ex:target .
ex:bad ex:rel "unclosed literal .
', true);
-- ERROR: Turtle parse error at line 4

Pattern: Loading OWL Ontologies

-- Auto-detects format from file extension (.ttl, .nt, .xml, .rdf, .owl)
SELECT pg_ripple.load_owl_ontology('/data/ontologies/foaf.rdf');

-- Or load explicitly as RDF/XML
SELECT pg_ripple.load_rdfxml_file('/data/ontologies/dublin_core.rdf');

Performance and Trade-offs

Throughput by Loading Mode

Mode                            | Approximate throughput   | Use case
insert_triple()                 | 3,000–8,000 triples/s    | Real-time ingestion, single-triple updates
load_turtle() / load_ntriples() | 30,000–80,000 triples/s  | Interactive bulk loads up to a few MB
load_ntriples_file()            | 80,000–200,000 triples/s | Large server-side files
load_turtle_file()              | 60,000–150,000 triples/s | Large server-side Turtle files

Note

N-Triples is consistently faster than Turtle because it requires no prefix expansion or abbreviation handling. For maximum throughput on large datasets, convert to N-Triples first: rapper -i turtle -o ntriples data.ttl > data.nt

Format Selection Guide

Scenario                 | Recommended format
Hand-authored data       | Turtle (readable, supports prefixes)
Machine-generated export | N-Triples (fastest parsing, one line per triple)
Data with named graphs   | N-Quads or TriG
Legacy XML datasets      | RDF/XML
Maximum load speed       | N-Triples via load_ntriples_file()

ANALYZE After Loads

After loading significant amounts of data, update PostgreSQL planner statistics:

-- Run ANALYZE on all VP tables
SELECT pg_ripple.vacuum();

This ensures the query planner has accurate row-count estimates for join ordering.

Batch Size Considerations

For string-based loaders, the entire input is processed in one transaction. Very large strings (hundreds of MB) can cause memory pressure. For datasets over 50 MB, prefer file-based loading:

-- Instead of a huge string literal:
-- SELECT pg_ripple.load_ntriples('... 100 million lines ...');

-- Use file loading:
SELECT pg_ripple.load_ntriples_file('/data/huge_dataset.nt');

Gotchas and Debugging

Blank Node Scoping

Each load_turtle() call creates a fresh blank-node scope. Two separate calls using _:x produce two different internal IDs:

-- Call 1: _:x maps to internal ID 12345
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
_:x ex:name "Alice" .
ex:paper/1 ex:author _:x .
');

-- Call 2: _:x maps to internal ID 67890 (different!)
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
_:x ex:name "Bob" .
ex:paper/2 ex:author _:x .
');

If you need the same anonymous node across loads, use a stable IRI instead:

SELECT pg_ripple.insert_triple(
    '<https://example.org/anon/shared-node>',
    '<https://example.org/name>',
    '"Shared Entity"'
);

Character Encoding

All loaders expect UTF-8 input. Non-UTF-8 data causes parse errors:

-- If your file is Latin-1, convert first:
-- iconv -f ISO-8859-1 -t UTF-8 data.nt > data_utf8.nt
SELECT pg_ripple.load_ntriples_file('/data/data_utf8.nt');

Verifying Loaded Data

After loading, verify with find_triples() or triple_count():

-- Check total triples
SELECT pg_ripple.triple_count();

-- Inspect specific triples
SELECT * FROM pg_ripple.find_triples(
    '<https://example.org/paper/42>', NULL, NULL
);

-- Check per-predicate statistics
SELECT pg_ripple.stats();

File Path Errors

File loaders read from the server filesystem. Common errors:

-- ERROR: could not open file "/data/papers.nt": No such file or directory
-- Fix: ensure the file exists and is readable by the postgres OS user

-- ERROR: permission denied for function load_turtle_file
-- Fix: file loaders require superuser; use string loaders for non-superusers

Duplicate Handling

Loading the same data twice does not create duplicates — VP tables use ON CONFLICT DO NOTHING:

SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:a ex:rel ex:b .
');
-- Returns: 1

SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:a ex:rel ex:b .
');
-- Returns: 0 (already exists)

Next Steps

§2.3 Querying with SPARQL

What and Why

SPARQL is the W3C standard query language for RDF data — the SQL of the knowledge graph world. pg_ripple translates SPARQL queries into optimized PostgreSQL SQL behind the scenes, so you get the expressiveness of SPARQL with the performance of a mature relational engine.

Why SPARQL instead of raw SQL against VP tables?

  • Graph pattern matching: find paths, cycles, and subgraph shapes naturally.
  • Property paths: traverse variable-length relationships with +, *, ?.
  • Federation: query remote SPARQL endpoints alongside local data.
  • Standards compliance: queries are portable across triple stores.
  • Update support: INSERT DATA and DELETE DATA for programmatic modifications.

pg_ripple supports all four SPARQL query forms (SELECT, CONSTRUCT, DESCRIBE, ASK) and SPARQL Update (INSERT DATA, DELETE DATA, DELETE/INSERT WHERE).


How It Works

The SPARQL Pipeline

  1. Parse — spargebra parses the SPARQL text into an algebra tree.
  2. Optimize — sparopt applies algebraic optimizations (filter pushdown, join reordering).
  3. Translate — pg_ripple's SQL generator converts the algebra to PostgreSQL SQL with integer-only VP table joins.
  4. Cache — the plan cache stores translated SQL keyed by SPARQL text hash.
  5. Execute — SPI executes the SQL; results are batch-decoded from integer IDs back to IRIs and literals.
  6. Return — each result row is returned as a JSONB object.

Key Functions

Function                       | Purpose
sparql(query)                  | Execute SELECT or ASK; returns JSONB rows
sparql_ask(query)              | Execute ASK; returns boolean
sparql_construct(query)        | Execute CONSTRUCT; returns triple JSONB rows
sparql_construct_turtle(query) | CONSTRUCT → Turtle text
sparql_construct_jsonld(query) | CONSTRUCT → JSON-LD JSONB
sparql_describe(query)         | DESCRIBE with CBD; returns triple JSONB rows
sparql_describe_turtle(query)  | DESCRIBE → Turtle text
sparql_update(query)           | INSERT DATA / DELETE DATA; returns affected count
sparql_explain(query, analyze) | Show generated SQL or EXPLAIN ANALYZE output
explain_sparql(query, format)  | Extended explain with SQL, text, JSON, or algebra output

Worked Examples

All examples assume the bibliographic dataset from §2.1 and §2.2 has been loaded.

Basic Triple Patterns

Find all papers and their titles:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle .
    ?paper dct:title ?title .
}
');

Each row is a JSONB object like {"paper": "<https://example.org/paper/42>", "title": "\"Knowledge Graphs in Practice\""}.
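
Because the rows are ordinary JSONB, they compose with PostgreSQL's JSON operators. A sketch, assuming sparql() returns SETOF jsonb as the row shape above suggests (so the table alias doubles as the column name):

```sql
-- Project JSONB bindings into plain-text columns with ->>
SELECT r->>'paper' AS paper,
       r->>'title' AS title
FROM pg_ripple.sparql('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title .
}
') AS r;
```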

Filtering Results

Find papers published after 2023:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?paper ?title ?date
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title ;
           dct:date ?date .
    FILTER (?date > "2023-01-01"^^xsd:date)
}
');

OPTIONAL Patterns

Include authors even if they have no affiliation:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX schema: <https://schema.org/>

SELECT ?paper ?authorName ?instName
WHERE {
    ?paper dct:creator ?author .
    ?author foaf:name ?authorName .
    OPTIONAL {
        ?author schema:affiliation ?inst .
        ?inst foaf:name ?instName .
    }
}
');

UNION

Find entities that are either papers or people:

SELECT * FROM pg_ripple.sparql('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?entity ?label
WHERE {
    {
        ?entity a bibo:AcademicArticle .
        ?entity dct:title ?label .
    }
    UNION
    {
        ?entity a foaf:Person .
        ?entity foaf:name ?label .
    }
}
');

MINUS

Find papers that have no citations:

SELECT * FROM pg_ripple.sparql('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title .
    MINUS {
        ?paper bibo:citedBy ?other .
    }
}
');

Aggregation

Count papers per institution:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

SELECT ?instName (COUNT(DISTINCT ?paper) AS ?paperCount)
WHERE {
    ?paper dct:creator ?author .
    ?author schema:affiliation ?inst .
    ?inst foaf:name ?instName .
}
GROUP BY ?instName
ORDER BY DESC(?paperCount)
');

Subqueries

Find the most prolific author and all their papers:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?authorName ?paper ?title
WHERE {
    {
        SELECT ?author (COUNT(?p) AS ?count)
        WHERE {
            ?p dct:creator ?author .
        }
        GROUP BY ?author
        ORDER BY DESC(?count)
        LIMIT 1
    }
    ?author foaf:name ?authorName .
    ?paper dct:creator ?author ;
           dct:title ?title .
}
');

Property Paths

Property paths let you traverse variable-length relationships.

Transitive closure (+) — find all classes an entity belongs to through the subclass hierarchy:

SELECT * FROM pg_ripple.sparql('
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?entity ?superClass
WHERE {
    ?entity rdf:type/rdfs:subClassOf+ ?superClass .
}
');

Zero-or-more (*) — include the starting node:

SELECT * FROM pg_ripple.sparql('
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?class ?ancestor
WHERE {
    ?class rdfs:subClassOf* ?ancestor .
}
');

Optional step (?) — zero or one hops:

SELECT * FROM pg_ripple.sparql('
PREFIX schema: <https://schema.org/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

SELECT ?person ?nameOrInst
WHERE {
    ?person schema:affiliation? ?target .
    ?target foaf:name ?nameOrInst .
}
');

Sequence path (/) — chain properties:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

SELECT ?paper ?instName
WHERE {
    ?paper dct:creator/schema:affiliation/foaf:name ?instName .
}
');

Alternative path (|) — match either property:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>

SELECT ?entity ?label
WHERE {
    ?entity (dct:title | schema:name) ?label .
}
');

Inverse path (^) — traverse in reverse:

SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?author ?paper
WHERE {
    ?author ^dct:creator ?paper .
}
');

GRAPH Patterns

Query data in specific named graphs:

SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?paper ?title ?graph
WHERE {
    GRAPH ?graph {
        ?paper dct:title ?title .
    }
}
');

Query a specific named graph:

SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?paper ?title
WHERE {
    GRAPH <https://example.org/graph/pubmed> {
        ?paper dct:title ?title .
    }
}
');

ASK Queries

Check if something exists:

SELECT pg_ripple.sparql_ask('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

ASK {
    ?paper a bibo:AcademicArticle ;
           dct:title "Knowledge Graphs in Practice" .
}
');
-- Returns: true

CONSTRUCT Queries

Build new triples from query results:

SELECT * FROM pg_ripple.sparql_construct('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX ex:     <https://example.org/>

CONSTRUCT {
    ?author ex:worksOn ?paper .
    ?paper ex:authoredAt ?inst .
}
WHERE {
    ?paper dct:creator ?author .
    ?author schema:affiliation ?inst .
}
');

Get CONSTRUCT results as Turtle:

SELECT pg_ripple.sparql_construct_turtle('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX ex:     <https://example.org/>

CONSTRUCT {
    ?author ex:wrote ?paper .
}
WHERE {
    ?paper dct:creator ?author .
}
');

Get CONSTRUCT results as JSON-LD:

SELECT pg_ripple.sparql_construct_jsonld('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <https://example.org/>

CONSTRUCT {
    ?author ex:wrote ?paper .
}
WHERE {
    ?paper dct:creator ?author .
}
');

DESCRIBE Queries

Get everything about an entity using Concise Bounded Description:

SELECT * FROM pg_ripple.sparql_describe('
DESCRIBE <https://example.org/paper/42>
');

Get the description as Turtle:

SELECT pg_ripple.sparql_describe_turtle('
DESCRIBE <https://example.org/person/alice>
');

Choose the describe strategy:

-- Symmetric CBD: include triples where the entity is the object too
SELECT * FROM pg_ripple.sparql_describe(
    'DESCRIBE <https://example.org/person/alice>',
    'scbd'
);
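Concise Bounded Description can be pictured as a recursive collection: take every triple whose subject is the target node, and for blank-node objects, include their description too. The 'scbd' strategy additionally collects triples where the node appears as the object. A minimal Python sketch of the idea (not pg_ripple's implementation; "_:"-prefixed strings stand in for blank nodes):

```python
def cbd(store, node, symmetric=False):
    """Concise Bounded Description: triples whose subject is `node`,
    recursing into blank-node objects; `symmetric` models the 'scbd' strategy."""
    out, seen, todo = set(), set(), [node]
    while todo:
        n = todo.pop()
        if n in seen:
            continue
        seen.add(n)
        for s, p, o in store:
            if s == n:
                out.add((s, p, o))
                if o.startswith("_:"):       # blank node: describe it as well
                    todo.append(o)
            if symmetric and o == n:         # 'scbd': triples pointing AT the node
                out.add((s, p, o))
    return out

store = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "ex:address", "_:b1"),
    ("_:b1", "ex:city", '"Boston"'),
    ("ex:paper42", "dct:creator", "ex:alice"),
]
print(len(cbd(store, "ex:alice")))                  # 3: alice's triples + blank node's
print(len(cbd(store, "ex:alice", symmetric=True)))  # 4: plus the dct:creator triple
```

The symmetric variant is why 'scbd' descriptions can be much larger on densely referenced entities.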

SPARQL Update

Insert new triples:

SELECT pg_ripple.sparql_update('
PREFIX ex:  <https://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>

INSERT DATA {
    <https://example.org/paper/600> a <http://purl.org/ontology/bibo/AcademicArticle> ;
        dct:title "Emerging Trends in Knowledge Graphs" ;
        dct:creator <https://example.org/person/alice> .
}
');
-- Returns: 3

Delete specific triples:

SELECT pg_ripple.sparql_update('
PREFIX ex:  <https://example.org/>
PREFIX dct: <http://purl.org/dc/terms/>

DELETE DATA {
    <https://example.org/paper/600> dct:title "Emerging Trends in Knowledge Graphs" .
}
');
-- Returns: 1

Query Debugging with EXPLAIN

View the generated SQL without executing:

SELECT pg_ripple.sparql_explain('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title .
}
', false);

Run EXPLAIN ANALYZE to see execution times:

SELECT pg_ripple.sparql_explain('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title .
}
', true);

Use the extended explain with format options:

-- Show just the generated SQL
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'sql');

-- Show EXPLAIN ANALYZE as JSON (for programmatic consumption)
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'json');

-- Show the spargebra algebra tree
SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', 'sparql_algebra');

Common Patterns

Pattern: Star Queries (Multiple Predicates on the Same Subject)

The optimizer detects star patterns and collapses them into efficient multi-way joins:

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX schema: <https://schema.org/>
PREFIX bibo:   <http://purl.org/ontology/bibo/>

SELECT ?paper ?title ?date
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title ;
           dct:date ?date ;
           schema:keywords ?kw .
    FILTER (CONTAINS(?kw, "knowledge"))
}
');
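Relationally, a star pattern is an n-way self-join of the triple relation on the subject. This toy Python model (illustrative only, assuming one value per predicate per subject) shows the collapse into a single pass per subject:

```python
def eval_star(triples, predicates):
    """For each subject, return its value for every predicate in the star --
    the relational equivalent of an n-way self-join on subject."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    return [(s, tuple(vals[p] for p in predicates))
            for s, vals in by_subject.items()
            if all(p in vals for p in predicates)]   # inner join: all predicates present

triples = [
    ("paper42", "title", "Knowledge Graphs in Practice"),
    ("paper42", "date", "2024"),
    ("paper99", "title", "Untitled Draft"),          # no date: dropped by the join
]
print(eval_star(triples, ["title", "date"]))
```

Subjects missing any predicate in the star drop out, which is exactly the inner-join semantics of the required triple patterns.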

Pattern: Existence Checks with FILTER EXISTS

SELECT * FROM pg_ripple.sparql('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:title ?title .
    FILTER EXISTS {
        ?paper bibo:citedBy ?other .
    }
}
');

Pattern: VALUES Clause for Parameterized Queries

SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?paper ?title
WHERE {
    VALUES ?paper {
        <https://example.org/paper/42>
        <https://example.org/paper/99>
    }
    ?paper dct:title ?title .
}
');

Pattern: BIND and Computed Values

SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?paper ?title ?yearLabel
WHERE {
    ?paper dct:title ?title ;
           dct:date ?date .
    BIND(YEAR(?date) AS ?year)
    BIND(CONCAT("Published in ", STR(?year)) AS ?yearLabel)
}
');

Performance and Trade-offs

Plan Cache

pg_ripple caches translated SQL by SPARQL query hash. Repeated queries skip the parse and translate steps:

-- Check cache statistics
SELECT pg_ripple.plan_cache_stats();
-- Returns: {"hits": 42, "misses": 5, "size": 5, "capacity": 128, "hit_rate": 0.89}

-- Reset the cache (e.g., after schema changes)
SELECT pg_ripple.plan_cache_reset();
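The mechanism is a classic hash-keyed LRU cache. A toy Python model of the idea (names and structure are illustrative, not pg_ripple internals):

```python
import hashlib
from collections import OrderedDict

class PlanCache:
    """LRU cache keyed by a hash of the SPARQL text, mimicking a plan cache."""
    def __init__(self, capacity=128):
        self.capacity, self.entries = capacity, OrderedDict()
        self.hits = self.misses = 0

    def get_sql(self, sparql, translate):
        key = hashlib.sha256(sparql.encode()).hexdigest()
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        self.misses += 1
        sql = translate(sparql)                  # the expensive parse + translate step
        self.entries[key] = sql
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return sql

cache = PlanCache()
translate = lambda q: f"SELECT ... /* compiled from {len(q)} chars */"
cache.get_sql("SELECT ?s WHERE { ?s ?p ?o }", translate)   # miss: translated
cache.get_sql("SELECT ?s WHERE { ?s ?p ?o }", translate)   # hit: cached SQL reused
print({"hits": cache.hits, "misses": cache.misses, "size": len(cache.entries)})
```

Because the key is a hash of the query text, even a whitespace change produces a new cache entry, which is one reason parameterizing with VALUES beats string-splicing constants into the query.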

Filter Pushdown

SPARQL constants — whether bound in a triple pattern or compared in a FILTER — are encoded to integers before SQL generation. This means the database compares integers, not strings:

-- This constant is pushed down as an integer comparison:
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper WHERE {
    ?paper dct:creator <https://example.org/person/alice> .
}
');
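The trick behind this is dictionary encoding: each IRI or literal is assigned an integer ID once, and all joins and comparisons operate on the IDs. A minimal sketch of the idea (not the actual pg_ripple encoding):

```python
class TermDictionary:
    """Map RDF terms to integer IDs so comparisons and joins are on ints."""
    def __init__(self):
        self.term_to_id, self.id_to_term = {}, []

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

d = TermDictionary()
triples = [
    (d.encode("ex:paper42"), d.encode("dct:creator"), d.encode("ex:alice")),
    (d.encode("ex:paper99"), d.encode("dct:creator"), d.encode("ex:bob")),
]
# The constant is encoded once; matching is then pure integer equality:
alice_id = d.encode("ex:alice")
matches = [s for s, p, o in triples if o == alice_id]
print([d.id_to_term[s] for s in matches])   # ['ex:paper42']
```

Encoding the constant once up front is what makes the pushed-down comparison cheap regardless of how long the IRI string is.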

Property Path Depth Limit

Recursive property paths (+, *) compile to WITH RECURSIVE ... CYCLE. The GUC pg_ripple.max_path_depth (default: 50) prevents runaway recursion:

-- Increase depth for deep hierarchies
SET pg_ripple.max_path_depth = 100;
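What the generated recursive SQL computes can be modeled as a breadth-first traversal that stops at the depth limit and never re-expands a node it has already seen. This is a simplification: the SQL CYCLE clause detects revisits per path, while this sketch uses a global visited set, but the reachable set is the same:

```python
def reachable(edges, start, max_depth=50):
    """Nodes reachable from `start` in 1..max_depth hops -- a Python model
    of the WITH RECURSIVE query a `+` property path compiles to."""
    frontier, seen = {start}, set()
    for _ in range(max_depth):
        frontier = {b for a, b in edges if a in frontier} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

knows = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]  # a cycle
# Via the cycle, alice reaches herself too, matching SPARQL `+` semantics:
print(sorted(reachable(knows, "alice")))               # ['alice', 'bob', 'carol']
print(sorted(reachable(knows, "alice", max_depth=1)))  # ['bob']
```

The second call shows what a low max_path_depth does: it silently truncates results, which is the trade-off to weigh before raising the GUC.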

Warning

Setting max_path_depth very high on cyclic graphs can cause slow queries. pg_ripple relies on the WITH RECURSIVE ... CYCLE clause (available since PostgreSQL 14) for path-based cycle detection, but wide graphs still accumulate many intermediate rows.

Full-Text Search Integration

Create a GIN index for fast text search on specific predicates:

-- Index the dct:title predicate for full-text search
SELECT pg_ripple.fts_index('<http://purl.org/dc/terms/title>');

-- Then CONTAINS() and REGEX() filters on dct:title objects use the GIN index
SELECT * FROM pg_ripple.sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE {
    ?paper dct:title ?title .
    FILTER (CONTAINS(?title, "Knowledge"))
}
');

Or use the direct full-text search function:

SELECT * FROM pg_ripple.fts_search(
    'knowledge & graph',
    '<http://purl.org/dc/terms/title>'
);

Gotchas and Debugging

SPARQL Syntax Errors

pg_ripple uses the spargebra parser, which gives precise error messages:

SELECT * FROM pg_ripple.sparql('
SELECT ?x WHERE { ?x ?p }
');
-- ERROR: SPARQL parse error: Expected '.' or '}' at line 2

Check the query compiles before running:

SELECT pg_ripple.explain_sparql('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper WHERE { ?paper dct:title ?title }
', 'sql');

No Results When Expected

Common causes:

  1. Missing PREFIX declaration: dct:title only resolves if the dct: prefix is declared in the query; an undeclared prefix is a parse error.
  2. Wrong literal format: "42" is a string, not a number. Use "42"^^xsd:integer.
  3. Case sensitivity: IRIs are case-sensitive. <https://Example.org/X> and <https://example.org/x> are different.

Debug by checking what is stored:

-- Check if the predicate exists
SELECT * FROM pg_ripple.find_triples(
    NULL, '<http://purl.org/dc/terms/title>', NULL
);

Slow Queries

  1. Check the generated SQL with sparql_explain().
  2. Look for sequential scans on large VP tables — run pg_ripple.vacuum() to update statistics.
  3. For property paths, check max_path_depth — lower it if the query is exploring too many paths.
  4. Check the plan cache hit rate — a low hit rate means many unique queries are being parsed repeatedly.

-- Step 1: See the execution plan
SELECT pg_ripple.sparql_explain('
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper ?title
WHERE { ?paper dct:title ?title }
', true);

-- Step 2: Update statistics
SELECT pg_ripple.vacuum();

-- Step 3: Check plan cache
SELECT pg_ripple.plan_cache_stats();

SPARQL Update Limitations

sparql_update() supports INSERT DATA and DELETE DATA (which, per the SPARQL 1.1 Update spec, take ground triples only) as well as pattern-based DELETE/INSERT ... WHERE with variables for flexible graph modifications. Use delete_triple() for programmatic single-triple deletion.


Next Steps

§2.4 Validating Data Quality

What and Why

Storing knowledge is only half the battle — you also need to ensure it is correct. SHACL (Shapes Constraint Language) is the W3C standard for declaring and validating constraints on RDF data. It answers questions like:

  • Does every paper have at least one author?
  • Are all email addresses syntactically valid?
  • Does every person have exactly one name?
  • Are date values well-formed?

pg_ripple integrates SHACL validation directly into the database engine. You can validate on demand, enforce constraints synchronously on every insert, or queue triples for asynchronous background validation with violations routed to a dead-letter queue.

Note

SHACL is to RDF what CHECK constraints and triggers are to relational databases — but SHACL shapes are declarative, composable, and standardized across all RDF systems.


How It Works

The SHACL Model

A SHACL shape declares constraints on a set of focus nodes (entities matching a target pattern). Each shape contains one or more property shapes that constrain the values of a specific predicate.

NodeShape (target: instances of bibo:AcademicArticle)
  └─ PropertyShape (path: dct:title)
       ├─ sh:minCount 1     ← every paper must have at least one title
       ├─ sh:maxCount 1     ← at most one title
       └─ sh:datatype xsd:string  ← title must be a string
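In essence, each property shape is a per-focus-node check over the values of one predicate. A stripped-down Python model of the minCount/maxCount/datatype checks (illustrative, not the pg_ripple validator; literals are modeled as (lexical, datatype) pairs):

```python
def check_property_shape(triples, focus, path, min_count=0, max_count=None, datatype=None):
    """Validate one property shape against one focus node; return violation messages."""
    values = [o for s, p, o in triples if s == focus and p == path]
    violations = []
    if len(values) < min_count:
        violations.append(f"{focus}: {path} has {len(values)} values, needs >= {min_count}")
    if max_count is not None and len(values) > max_count:
        violations.append(f"{focus}: {path} has {len(values)} values, allows <= {max_count}")
    if datatype:  # literals carry their datatype as ("lexical", "xsd:...") pairs here
        violations += [f"{focus}: {v!r} is not {datatype}"
                       for v in values if not (isinstance(v, tuple) and v[1] == datatype)]
    return violations

triples = [
    ("ex:paper42", "dct:title", ("Knowledge Graphs in Practice", "xsd:string")),
    ("ex:paper99", "dct:creator", "ex:carol"),   # paper99 has no title at all
]
print(check_property_shape(triples, "ex:paper42", "dct:title",
                           min_count=1, max_count=1, datatype="xsd:string"))  # []
print(check_property_shape(triples, "ex:paper99", "dct:title", min_count=1))
```

A NodeShape is then just this check repeated for every property shape, over every focus node selected by the target.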

Validation Modes

Mode          | GUC setting                    | Behavior
Off (default) | pg_ripple.shacl_mode = 'off'   | No automatic validation
Sync          | pg_ripple.shacl_mode = 'sync'  | Every insert_triple() is validated before commit; violations raise an ERROR
Async         | pg_ripple.shacl_mode = 'async' | Triples are inserted immediately; a background worker validates and routes violations to the dead-letter queue

Supported Constraints

Constraint                   | Description
sh:minCount                  | Minimum number of values
sh:maxCount                  | Maximum number of values
sh:datatype                  | Value must have a specific XSD datatype
sh:class                     | Value must be an instance of a class
sh:in                        | Value must be from an enumerated set
sh:pattern                   | Value must match a regex
sh:node                      | Value must conform to another shape
sh:or                        | Value must conform to at least one of several shapes
sh:and                       | Value must conform to all listed shapes
sh:not                       | Value must NOT conform to a shape
sh:qualifiedValueShape       | Qualified cardinality constraints
sh:hasValue                  | At least one value must equal the given term
sh:nodeKind                  | Value must be IRI, blank node, or literal
sh:languageIn                | Language tag must be in the allowed list
sh:uniqueLang                | No duplicate language tags
sh:lessThan / sh:greaterThan | Comparative constraints between properties
sh:closed                    | Reject unknown predicates

Worked Examples

Loading Simple Shapes

Define shapes for the bibliographic dataset:

SELECT pg_ripple.load_shacl('
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <https://example.org/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix schema: <https://schema.org/> .

ex:PaperShape a sh:NodeShape ;
    sh:targetClass bibo:AcademicArticle ;
    sh:property [
        sh:path dct:title ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path dct:creator ;
        sh:minCount 1 ;
        sh:class foaf:Person ;
    ] .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [
        sh:path foaf:name ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path schema:affiliation ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
    ] .
');
-- Returns: 2 (number of shapes loaded)

Running Validation

Validate the default graph against all active shapes:

SELECT pg_ripple.validate();

The result is a JSONB validation report:

{
  "conforms": false,
  "violations": [
    {
      "focusNode": "<https://example.org/paper/99>",
      "shapeIRI": "<https://example.org/PaperShape>",
      "path": "<http://purl.org/dc/terms/creator>",
      "constraint": "sh:class",
      "message": "value <https://example.org/person/carol> is not an instance of <http://xmlns.com/foaf/0.1/Person>",
      "severity": "sh:Violation"
    }
  ]
}

Validate a specific named graph:

SELECT pg_ripple.validate('https://example.org/graph/pubmed');

Validate all graphs at once:

SELECT pg_ripple.validate('*');

Synchronous Validation

Enable sync mode so invalid triples are rejected at insert time:

SET pg_ripple.shacl_mode = 'sync';

-- This succeeds (paper has a title)
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/700>',
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    '<http://purl.org/ontology/bibo/AcademicArticle>'
);

-- This would fail if the shape requires dct:title and the paper doesn't have one yet
-- (sync validation checks per-triple, not transactionally)

Warning

Synchronous validation adds overhead to every insert_triple() call. Use it for low-volume, high-integrity scenarios. For bulk loads, use 'off' mode and validate after loading.

Asynchronous Validation with Dead-Letter Queue

Enable async mode for high-throughput pipelines:

SET pg_ripple.shacl_mode = 'async';

-- Triples are inserted immediately; validation happens in the background
SELECT pg_ripple.insert_triple(
    '<https://example.org/paper/800>',
    '<http://purl.org/dc/terms/title>',
    '"A New Paper"'
);

-- Check the validation queue length
SELECT pg_ripple.validation_queue_length();

-- Manually process the queue (normally handled by background worker)
SELECT pg_ripple.process_validation_queue(1000);

-- Check for violations
SELECT pg_ripple.dead_letter_count();

-- View the full dead-letter queue
SELECT pg_ripple.dead_letter_queue();

Complex Shapes

Disjunctive constraints (sh:or):

SELECT pg_ripple.load_shacl('
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/> .
@prefix dct:  <http://purl.org/dc/terms/> .

ex:DateShape a sh:NodeShape ;
    sh:targetSubjectsOf dct:date ;
    sh:property [
        sh:path dct:date ;
        sh:or (
            [ sh:datatype xsd:date ]
            [ sh:datatype xsd:gYear ]
            [ sh:datatype xsd:dateTime ]
        ) ;
    ] .
');

Closed shapes (reject unknown predicates):

SELECT pg_ripple.load_shacl('
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix ex:     <https://example.org/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
@prefix schema: <https://schema.org/> .

ex:StrictPaperShape a sh:NodeShape ;
    sh:targetClass bibo:AcademicArticle ;
    sh:closed true ;
    sh:ignoredProperties (
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    ) ;
    sh:property [
        sh:path dct:title ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path dct:creator ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path dct:date ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path schema:keywords ;
    ] ;
    sh:property [
        sh:path bibo:cites ;
    ] ;
    sh:property [
        sh:path bibo:citedBy ;
    ] .
');

Qualified cardinality:

SELECT pg_ripple.load_shacl('
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix ex:     <https://example.org/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .

ex:CollabPaperShape a sh:NodeShape ;
    sh:targetClass bibo:AcademicArticle ;
    sh:property [
        sh:path dct:creator ;
        sh:qualifiedValueShape [
            sh:class foaf:Person ;
        ] ;
        sh:qualifiedMinCount 2 ;
    ] .
');

Managing Shapes

-- List all loaded shapes
SELECT * FROM pg_ripple.list_shapes();

-- Deactivate a shape without deleting it
SELECT pg_ripple.disable_rule_set('custom');

-- Remove a shape entirely
SELECT pg_ripple.drop_shape('https://example.org/StrictPaperShape');

SHACL DAG Monitors

For real-time violation detection, enable DAG monitors (requires pg_trickle):

-- Load shapes first
SELECT pg_ripple.load_shacl('...');

-- Enable per-shape violation stream tables
SELECT pg_ripple.enable_shacl_dag_monitors();

-- View the live violation summary
SELECT * FROM _pg_ripple.violation_summary_dag;

-- List active monitors
SELECT * FROM pg_ripple.list_shacl_dag_monitors();

-- Disable when no longer needed
SELECT pg_ripple.disable_shacl_dag_monitors();

Common Patterns

Pattern: Validate After Bulk Load

The most common workflow — load first, validate second:

-- Turn off validation during load
SET pg_ripple.shacl_mode = 'off';

-- Load data
SELECT pg_ripple.load_turtle_file('/data/papers.ttl');

-- Load shapes
SELECT pg_ripple.load_shacl('...');

-- Validate
SELECT pg_ripple.validate();

Pattern: Data Quality Dashboard

Use the dead-letter queue as a data quality monitor:

-- Enable async validation
SET pg_ripple.shacl_mode = 'async';

-- Periodically check violation counts
SELECT pg_ripple.dead_letter_count();

-- Get violation details
SELECT pg_ripple.dead_letter_queue();

-- With pg_trickle: enable violation summary stream table
SELECT pg_ripple.enable_shacl_monitors();
SELECT * FROM _pg_ripple.violation_summary;

Pattern: Embedding Completeness Check

Ensure all entities have vector embeddings (see §2.7):

SELECT pg_ripple.load_shacl('
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/> .
@prefix pg:   <urn:pg_ripple:> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

ex:EmbeddingCompletenessShape a sh:NodeShape ;
    sh:targetClass bibo:AcademicArticle ;
    sh:property [
        sh:path pg:hasEmbedding ;
        sh:minCount 1 ;
        sh:hasValue "true"^^xsd:boolean ;
    ] .
');

-- Add embedding triples for entities that have been embedded
SELECT pg_ripple.add_embedding_triples();

-- Check completeness
SELECT pg_ripple.validate();

Pattern: Multi-Language Support

Ensure labels exist in required languages:

SELECT pg_ripple.load_shacl('
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix ex:   <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:LabelShape a sh:NodeShape ;
    sh:targetSubjectsOf rdfs:label ;
    sh:property [
        sh:path rdfs:label ;
        sh:languageIn ( "en" "de" "fr" ) ;
        sh:uniqueLang true ;
    ] .
');

Performance and Trade-offs

Validation mode | Overhead                | Data integrity               | Use case
off             | None                    | Manual check with validate() | Bulk loads, development
sync            | High (per-triple check) | Immediate rejection          | Low-volume critical data
async           | Low (background worker) | Eventual (violations in DLQ) | High-throughput pipelines
  • Shape count: validation time scales linearly with the number of active shapes and focus nodes. Deactivate shapes you do not need.
  • DAG monitors: per-shape stream tables are IMMEDIATE mode — violations are detected within the same transaction. But pg_trickle must be installed.
  • Dead-letter queue: grows without bound. Periodically review and clean it:
    -- Remove violations older than 30 days
    DELETE FROM _pg_ripple.dead_letter_queue
    WHERE detected_at < NOW() - INTERVAL '30 days';
    

Tip

Shapes with sh:maxCount 1 allow the SPARQL query engine to omit DISTINCT on that predicate's joins. Shapes with sh:minCount 1 allow downgrading LEFT JOIN to INNER JOIN. Declaring accurate shapes improves both data quality and query performance.


Gotchas and Debugging

Shape Loading Errors

If load_shacl() returns 0, the Turtle may have syntax errors. Check for:

  • Missing @prefix declarations
  • Unclosed brackets in blank node property lists
  • Missing semicolons between property shapes

Sync Mode and Transaction Boundaries

Sync validation checks individual triples, not entire transactions. A paper might pass the dct:creator check (because the triple being inserted is the author link) but fail the dct:title check because the title has not been inserted yet in the same transaction.

Solution: insert all triples for an entity, then validate explicitly:

SET pg_ripple.shacl_mode = 'off';

-- Insert all triples for the entity
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>', '<http://purl.org/ontology/bibo/AcademicArticle>');
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://purl.org/dc/terms/title>', '"My Paper"');
SELECT pg_ripple.insert_triple('<https://example.org/paper/900>', '<http://purl.org/dc/terms/creator>', '<https://example.org/person/alice>');

-- Then validate
SELECT pg_ripple.validate();

Viewing Shape Definitions

-- List all shapes and their active status
SELECT * FROM pg_ripple.list_shapes();

Validation Report Interpretation

The validation report JSONB has two top-level keys:

  • conforms: true if no violations were found
  • violations: array of violation objects, each with focusNode, shapeIRI, path, constraint, message, and severity

-- Extract just the violation messages
SELECT v->>'message'
FROM jsonb_array_elements(
    (SELECT pg_ripple.validate()::jsonb -> 'violations')
) AS v;

Next Steps

§2.5 Reasoning and Inference

What and Why

Inference lets pg_ripple derive new facts from existing data using logical rules. If Alice works at MIT and MIT is located in Massachusetts, inference can conclude that Alice is located in Massachusetts — without anyone explicitly inserting that triple.

pg_ripple ships a full Datalog reasoning engine that supports:

  • Built-in rule sets: RDFS and OWL RL entailment out of the box.
  • Custom rules: domain-specific inference in a Turtle-flavoured Datalog syntax.
  • Stratified negation: "flag people without an email address."
  • Aggregation: COUNT, SUM, MIN, MAX, AVG over grouped triple patterns.
  • Magic sets: goal-directed inference that only materialises relevant facts.
  • Semi-naive evaluation: efficient fixpoint iteration that skips unchanged rows.
  • Well-Founded Semantics: handle programs with cyclic negation (v0.32.0).

Derived triples are stored with source = 1 (inferred) alongside explicit triples (source = 0), so you can always distinguish asserted from derived facts.


How It Works

The Datalog Pipeline

  1. Parse — rules are parsed from a Turtle-flavoured Datalog syntax into an internal Rule IR.
  2. Stratify — the dependency graph is analyzed; rules are grouped into strata such that negated predicates are fully computed in lower strata.
  3. Compile — each stratum is compiled to PostgreSQL SQL: non-recursive rules become INSERT ... SELECT, recursive rules become WITH RECURSIVE ... CYCLE.
  4. Execute — strata run bottom-up; each stratum's SQL is executed via SPI, inserting derived triples into VP delta tables.
  5. Fixpoint — recursive strata iterate until no new facts are derived (semi-naive evaluation).
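The stratification and fixpoint steps can be pictured with a toy two-stratum program: stratum 1 computes a positive recursive predicate to fixpoint, and only then may stratum 2 negate it. A conceptual Python model (not the compiled SQL):

```python
def eval_stratified(nodes, edges, start):
    # Stratum 1 (recursive, positive):
    #   reachable(X) :- edge(start, X) .
    #   reachable(X) :- reachable(Y) , edge(Y, X) .
    reachable = set()
    frontier = {b for a, b in edges if a == start}
    while frontier - reachable:               # fixpoint: stop when no new facts
        reachable |= frontier
        frontier = {b for a, b in edges if a in reachable}
    # Stratum 2 (negation): unreachable(X) :- node(X) , NOT reachable(X) .
    # Safe only because stratum 1 is fully computed first.
    unreachable = {n for n in nodes if n not in reachable and n != start}
    return reachable, unreachable

nodes = {"a", "b", "c", "d"}
edges = [("a", "b"), ("b", "c")]
r, u = eval_stratified(nodes, edges, "a")
print(sorted(r), sorted(u))   # ['b', 'c'] ['d']
```

If stratum 2 could run before stratum 1 finished, NOT reachable(X) would fire on facts that are merely not-yet-derived, which is exactly what stratification forbids.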

Rule Syntax

Rules use a Prolog-like notation with RDF terms. The prefix registry from register_prefix() is available:

head_triple :- body_triple1 , body_triple2 .

Variables start with ?. Constants are IRIs (prefixed or full). Negation uses NOT.

Built-in Rule Sets

NameRulesWhat it covers
rdfs~12 rulesrdfs:subClassOf transitivity, rdfs:subPropertyOf transitivity, rdf:type propagation via subclass/subproperty, rdfs:domain/rdfs:range inference
owl-rl~80 rulesOWL RL profile: symmetric/transitive/inverse properties, owl:equivalentClass, owl:sameAs, owl:unionOf, owl:intersectionOf, property chains, and more

Worked Examples

Loading Built-in RDFS Rules

-- Load the RDFS entailment rules
SELECT pg_ripple.load_rules_builtin('rdfs');
-- Returns: 12 (number of rules)

-- Load some class hierarchy data
SELECT pg_ripple.load_turtle('
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix ex:   <https://example.org/> .

bibo:AcademicArticle rdfs:subClassOf bibo:Article .
bibo:Article rdfs:subClassOf bibo:Document .
bibo:Document rdfs:subClassOf rdfs:Resource .

<https://example.org/paper/42> rdf:type bibo:AcademicArticle .
');

-- Run inference
SELECT pg_ripple.infer('rdfs');

-- Now ex:paper/42 is also a bibo:Article, bibo:Document, and rdfs:Resource
SELECT * FROM pg_ripple.sparql('
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?type
WHERE {
    <https://example.org/paper/42> rdf:type ?type .
}
');

Loading OWL RL Rules

-- Load the OWL RL entailment rules
SELECT pg_ripple.load_rules_builtin('owl-rl');

-- Load ontology with OWL constructs
SELECT pg_ripple.load_turtle('
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <https://example.org/> .

ex:cites owl:inverseOf ex:citedBy .
ex:collaboratesWith a owl:SymmetricProperty .
ex:influencedBy a owl:TransitiveProperty .

<https://example.org/paper/42> ex:cites <https://example.org/paper/99> .
<https://example.org/person/alice> ex:collaboratesWith <https://example.org/person/bob> .
<https://example.org/person/carol> ex:influencedBy <https://example.org/person/alice> .
<https://example.org/person/alice> ex:influencedBy <https://example.org/person/dave> .
');

-- Run OWL RL inference
SELECT pg_ripple.infer('owl-rl');

-- Derived: ex:paper/99 ex:citedBy ex:paper/42  (inverse)
-- Derived: ex:person/bob ex:collaboratesWith ex:person/alice  (symmetric)
-- Derived: ex:person/carol ex:influencedBy ex:person/dave  (transitive)
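The three property axioms used above correspond to simple closure rules. This toy Python model (not pg_ripple's owl-rl compiler; names are illustrative) applies them to a fixpoint:

```python
def owl_property_rules(facts, inverses, symmetric, transitive):
    """Apply owl:inverseOf / SymmetricProperty / TransitiveProperty rules to fixpoint."""
    derived = set(facts)
    while True:
        new = set()
        for s, p, o in derived:
            if p in inverses:
                new.add((o, inverses[p], s))           # p owl:inverseOf q
            if p in symmetric:
                new.add((o, p, s))                     # symmetric property
            if p in transitive:
                new |= {(s, p, o2) for s2, p2, o2 in derived
                        if p2 == p and s2 == o}        # transitive chaining
        if new <= derived:                             # fixpoint reached
            return derived
        derived |= new

facts = {("paper42", "cites", "paper99"),
         ("alice", "collaboratesWith", "bob"),
         ("carol", "influencedBy", "alice"),
         ("alice", "influencedBy", "dave")}
out = owl_property_rules(facts, {"cites": "citedBy"}, {"collaboratesWith"}, {"influencedBy"})
assert ("paper99", "citedBy", "paper42") in out       # inverse
assert ("bob", "collaboratesWith", "alice") in out    # symmetric
assert ("carol", "influencedBy", "dave") in out       # transitive
```

The real rule set is much larger (~80 rules), but each rule follows this same shape: match a pattern over the current facts, emit new facts, repeat until nothing changes.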

Writing Custom Rules

Define domain-specific rules for the bibliographic dataset:

SELECT pg_ripple.load_rules('
# Derive co-authorship: two people who authored the same paper
?a ex:coAuthor ?b :- ?paper dct:creator ?a , ?paper dct:creator ?b .

# Derive institutional collaboration
?inst1 ex:collaboratesWith ?inst2 :-
    ?paper dct:creator ?a ,
    ?paper dct:creator ?b ,
    ?a schema:affiliation ?inst1 ,
    ?b schema:affiliation ?inst2 .

# Derive prolific author (authored 5+ papers)
# Caveat: without inequality guards the five ?paperN variables can bind the
# same paper; the COUNT rule under Datalog^agg below is the robust formulation
?author ex:isProlific "true"^^xsd:boolean :-
    ?paper1 dct:creator ?author ,
    ?paper2 dct:creator ?author ,
    ?paper3 dct:creator ?author ,
    ?paper4 dct:creator ?author ,
    ?paper5 dct:creator ?author .
', 'biblio');

-- Run the custom rule set
SELECT pg_ripple.infer('biblio');

Negation-as-Failure

Flag entities that are missing expected properties:

SELECT pg_ripple.load_rules('
# Flag papers without a date
?paper ex:missingDate "true"^^xsd:boolean :-
    ?paper rdf:type bibo:AcademicArticle ,
    NOT ?paper dct:date ?_ .

# Flag people without an affiliation
?person ex:missingAffiliation "true"^^xsd:boolean :-
    ?person rdf:type foaf:Person ,
    NOT ?person schema:affiliation ?_ .
', 'quality');

SELECT pg_ripple.infer('quality');

-- Query the derived quality flags
SELECT * FROM pg_ripple.sparql('
PREFIX ex: <https://example.org/>
SELECT ?paper WHERE { ?paper ex:missingDate "true"^^<http://www.w3.org/2001/XMLSchema#boolean> }
');

Named Graph Scoping

Write derived triples into a separate graph:

SELECT pg_ripple.load_rules('
# All RDFS inference goes into the "inferred" graph
GRAPH <https://example.org/graph/inferred> { ?x rdf:type ?c } :-
    ?x rdf:type ?b , ?b rdfs:subClassOf ?c .
', 'scoped-rdfs');

SELECT pg_ripple.infer('scoped-rdfs');

-- Query only inferred types
SELECT * FROM pg_ripple.sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?x ?type WHERE {
    GRAPH <https://example.org/graph/inferred> {
        ?x rdf:type ?type .
    }
}
');

Semi-Naive Evaluation with Statistics

Get detailed inference statistics:

SELECT pg_ripple.infer_with_stats('rdfs');

Returns JSONB:

{
  "derived": 156,
  "iterations": 4,
  "eliminated_rules": [
    "?x rdf:type rdfs:Resource :- ?x ?p ?o ."
  ]
}

The eliminated_rules field shows rules removed by subsumption checking — rules whose body is a superset of another rule's body.

Goal-Directed Inference with Magic Sets

When you only need a subset of derived facts, magic sets avoids materialising everything:

-- Only derive facts relevant to the goal: which entities are foaf:Person?
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type <http://xmlns.com/foaf/0.1/Person>');

Returns JSONB:

{
  "derived": 12,
  "iterations": 3,
  "matching": 5
}

Compare with full inference:

-- Full materialization: derives ALL facts
SELECT pg_ripple.infer('rdfs');
-- derived: 156

-- Goal-directed: derives only what's needed for the goal
SELECT pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person');
-- derived: 12 (much fewer)

Tip

Magic sets are controlled by the GUC pg_ripple.magic_sets. When set to false, infer_goal() falls back to full materialization and filters post-hoc.

Demand-Filtered Inference

For multiple goals at once, use demand-filtered inference:

SELECT pg_ripple.infer_demand('rdfs', '[
    {"p": "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"},
    {"s": "<https://example.org/paper/42>"}
]'::jsonb);

Returns:

{
  "derived": 45,
  "iterations": 3,
  "demand_predicates": [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
  ]
}

Aggregate Rules (Datalog^agg)

Derive facts using aggregate functions:

SELECT pg_ripple.load_rules('
# Count papers per author
?author ex:paperCount ?count :-
    COUNT(?paper WHERE ?paper dct:creator ?author) = ?count .

# Sum citation counts per paper
?paper ex:totalCitations ?total :-
    COUNT(?citing WHERE ?citing bibo:cites ?paper) = ?total .
', 'metrics');

-- Use the aggregate-aware inference function
SELECT pg_ripple.infer_agg('metrics');

Returns:

{
  "derived": 25,
  "aggregate_derived": 25,
  "iterations": 1
}

Well-Founded Semantics (v0.32.0)

For programs with cyclic negation (where standard stratification fails):

SELECT pg_ripple.load_rules('
# Cyclic negation: a node is "in" if it is not "out", and vice versa
?x ex:in "true"^^xsd:boolean :- ?x rdf:type ex:Node , NOT ?x ex:out "true"^^xsd:boolean .
?x ex:out "true"^^xsd:boolean :- ?x rdf:type ex:Node , NOT ?x ex:in "true"^^xsd:boolean .
', 'wfs-demo');

-- Standard infer() would fail with "unstratifiable" error
-- WFS handles it gracefully
SELECT pg_ripple.infer_wfs('wfs-demo');

Returns:

{
  "derived": 6,
  "certain": 0,
  "unknown": 6,
  "iterations": 3,
  "stratifiable": false
}

Facts with certainty = 'unknown' are reported but NOT materialised into VP tables.


Common Patterns

Pattern: Layered Inference

Run rule sets in order — base entailment first, then domain rules:

-- Layer 1: RDFS entailment
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');

-- Layer 2: OWL RL (builds on RDFS-derived facts)
SELECT pg_ripple.load_rules_builtin('owl-rl');
SELECT pg_ripple.infer('owl-rl');

-- Layer 3: Custom domain rules
SELECT pg_ripple.load_rules('...', 'domain');
SELECT pg_ripple.infer('domain');

Pattern: Incremental Re-Inference

After adding new data, re-run inference. Semi-naive evaluation only derives new facts:

-- Load new data
SELECT pg_ripple.load_turtle('
@prefix ex:  <https://example.org/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
<https://example.org/paper/newOne> a bibo:AcademicArticle .
');

-- Re-run inference — only new derivations are computed
SELECT pg_ripple.infer_with_stats('rdfs');

Pattern: Explicit vs Inferred Triples

All VP tables have a source column: 0 = explicit, 1 = inferred. You can query this distinction via SPARQL or check the full triple store:

-- Find all inferred type assertions
SELECT * FROM pg_ripple.find_triples(
    NULL,
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    NULL
);

Pattern: owl:sameAs Canonicalization

When pg_ripple.sameas_reasoning = 'on' (default), owl:sameAs links are canonicalized before inference. All mentions of equivalent entities are collapsed to a single canonical ID, reducing redundant derivations.

-- Two IRIs refer to the same entity
SELECT pg_ripple.insert_triple(
    '<https://example.org/person/alice>',
    '<http://www.w3.org/2002/07/owl#sameAs>',
    '<https://other.org/people/a-johnson>'
);

-- After inference, both IRIs are treated as identical
SELECT pg_ripple.infer('owl-rl');
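Canonicalization boils down to a union-find over the owl:sameAs pairs: every member of an equivalence class is rewritten to one representative before rules run. A sketch of the idea (not pg_ripple's implementation):

```python
def canonicalize(sameas_pairs, triples):
    """Collapse owl:sameAs equivalence classes to one representative ID each."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in sameas_pairs:
        parent[find(a)] = find(b)           # union the two classes

    return {(find(s), p, find(o)) for s, p, o in triples}

pairs = [("ex:alice", "other:a-johnson")]
triples = [("ex:alice", "foaf:name", '"Alice"'),
           ("other:a-johnson", "ex:wrote", "ex:paper42")]
canon = canonicalize(pairs, triples)
subjects = {s for s, p, o in canon}
print(len(subjects))   # 1: both IRIs collapse to one canonical entity
```

Because both IRIs now share one ID, a rule that joins on the subject sees one entity instead of two, which is what eliminates the redundant derivations.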

Performance and Trade-offs

Full Materialization vs Goal-Directed

Strategy                         | Pros                                    | Cons
Full (infer())                   | Complete; all derived facts available   | May derive millions of unneeded facts
Goal-directed (infer_goal())     | Only derives relevant facts             | Must specify the goal pattern
Demand-filtered (infer_demand()) | Multiple goals; partial materialization | Slightly more setup
On-demand (query-time)           | Zero materialization cost               | Slower queries

Semi-Naive Evaluation

Semi-naive evaluation tracks which facts are new in each iteration and only joins new facts with existing facts. This reduces the work per iteration from O(n^2) to O(n * delta), where delta is the number of new facts per round.

Subsumption Checking

When two rules have the same head and one rule's body is a subset of the other's, the subsumed rule is eliminated. This reduces the number of SQL statements per iteration.

Tabling / Memoisation (v0.32.0)

Goal-directed inference results and WFS results are cached in _pg_ripple.tabling_cache. Cache entries are automatically invalidated when data changes (inserts or deletes).

-- Check tabling cache statistics
SELECT * FROM pg_ripple.tabling_stats();

Rule Set Management

-- List all rules and their metadata
SELECT pg_ripple.list_rules();

-- Enable/disable a rule set without deleting it
SELECT pg_ripple.enable_rule_set('rdfs');
SELECT pg_ripple.disable_rule_set('quality');

-- Drop all rules in a set
SELECT pg_ripple.drop_rules('quality');

Gotchas and Debugging

Unstratifiable Programs

If your rules contain cyclic negation, standard infer() will fail:

ERROR: unstratifiable rule set — negation cycle detected
DETAIL: ex:in negates ex:out, which depends on ex:in
HINT: remove the negation cycle or use infer_wfs() for well-founded semantics

Fix: either restructure the rules to eliminate the cycle, or use infer_wfs().

Prefix Registration

Rules resolve prefixes through the registry populated by register_prefix(). If a prefix is not registered, the rule parser raises a parse error:

-- Register required prefixes BEFORE loading rules
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
SELECT pg_ripple.register_prefix('dct', 'http://purl.org/dc/terms/');

-- Now load rules that use these prefixes
SELECT pg_ripple.load_rules('?x ex:rel ?y :- ?x dct:creator ?y .', 'test');

Note

Built-in rule sets (rdfs, owl-rl) automatically register standard RDF/RDFS/OWL prefixes.

Checking What Was Derived

After inference, check the statistics:

-- How many triples total?
SELECT pg_ripple.triple_count();

-- Detailed stats including inferred count
SELECT pg_ripple.stats();

-- Check constraint rules for violations
SELECT pg_ripple.check_constraints();

Performance Diagnosis

If inference is slow:

  1. Check iteration count with infer_with_stats() — many iterations suggest deep recursive chains.
  2. Use goal-directed inference (infer_goal()) if you only need a subset.
  3. Check for redundant rules with subsumption (eliminated_rules in stats output).
  4. Run pg_ripple.vacuum() after inference to update planner statistics.

-- Get inference diagnostics
SELECT pg_ripple.infer_with_stats('rdfs');

-- Check rule plan cache
SELECT * FROM pg_ripple.rule_plan_cache_stats();

Next Steps

§2.6 Exporting and Sharing

What and Why

Data in pg_ripple needs to flow out — to other systems, to files for archival, to LLMs for RAG pipelines, or to Microsoft's GraphRAG framework via Parquet files. pg_ripple supports all standard RDF serialization formats plus JSON-LD framing for API-ready output and BYOG (Bring Your Own Graph) Parquet export for GraphRAG.

This chapter is the canonical reference for all export functionality, including the GraphRAG BYOG pipeline. Other chapters cross-reference here for GraphRAG details.


How It Works

Export Formats

| Format | Function | Streaming variant | Named graph support |
|---|---|---|---|
| N-Triples | export_ntriples() | — | Per-graph or default |
| N-Quads | export_nquads() | — | Yes (all graphs) |
| Turtle | export_turtle() | export_turtle_stream() | Per-graph or default |
| JSON-LD | export_jsonld() | export_jsonld_stream() | Per-graph or default |
| JSON-LD Framed | export_jsonld_framed() | export_jsonld_framed_stream() | Per-graph or default |
| SPARQL CONSTRUCT → Turtle | sparql_construct_turtle() | — | Via query |
| SPARQL CONSTRUCT → JSON-LD | sparql_construct_jsonld() | — | Via query |
| SPARQL DESCRIBE → Turtle | sparql_describe_turtle() | — | Via query |
| SPARQL DESCRIBE → JSON-LD | sparql_describe_jsonld() | — | Via query |
| Parquet (GraphRAG entities) | export_graphrag_entities() | — | Per-graph |
| Parquet (GraphRAG relationships) | export_graphrag_relationships() | — | Per-graph |
| Parquet (GraphRAG text units) | export_graphrag_text_units() | — | Per-graph |

Streaming Exports

For large graphs, streaming exports return one row per triple (or per subject for JSON-LD), avoiding buffering the entire document in memory:

-- Stream Turtle one line at a time
SELECT * FROM pg_ripple.export_turtle_stream();

-- Stream JSON-LD one subject at a time (NDJSON)
SELECT * FROM pg_ripple.export_jsonld_stream();

JSON-LD Framing

JSON-LD framing reshapes flat RDF into nested, application-friendly JSON. A frame is a JSON template that specifies the desired structure:

  1. pg_ripple translates the frame to a SPARQL CONSTRUCT query.
  2. The CONSTRUCT query executes against the triple store.
  3. The W3C embedding algorithm nests matched nodes per the frame.
  4. The result is compacted with the frame's @context.
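The embedding step (3) is the heart of framing: nodes referenced by @id are pulled inline where the frame asks for them. A toy sketch of that one step, ignoring contexts, defaults, and the full W3C algorithm (node data and property names below are illustrative):

```python
# Toy sketch of frame-driven embedding: select nodes matching the frame's
# @type and inline @id-referenced nodes for the properties the frame lists.
# This is NOT the full W3C framing algorithm -- just the core embedding idea.

def frame(nodes, frame_type, embed_props):
    by_id = {n["@id"]: n for n in nodes}

    def embed(node, props):
        out = dict(node)
        for prop, subprops in props.items():
            out[prop] = [
                embed(by_id[r["@id"]], subprops)
                if "@id" in r and r["@id"] in by_id else r
                for r in node.get(prop, [])
            ]
        return out

    # Only top-level nodes whose @type matches the frame are kept.
    return [embed(n, embed_props) for n in nodes if frame_type in n.get("@type", [])]

nodes = [
    {"@id": "ex:paper/42", "@type": ["bibo:AcademicArticle"],
     "dct:creator": [{"@id": "ex:person/alice"}]},
    {"@id": "ex:person/alice", "@type": ["foaf:Person"],
     "foaf:name": [{"@value": "Alice Johnson"}]},
]
framed = frame(nodes, "bibo:AcademicArticle", {"dct:creator": {}})
```

The flat creator reference becomes a nested object, which is exactly the flat-to-nested reshaping the export functions perform at scale.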

Worked Examples

Exporting as N-Triples

The simplest format — one triple per line:

-- Export the default graph
SELECT pg_ripple.export_ntriples(NULL);

Output:

<https://example.org/paper/42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/AcademicArticle> .
<https://example.org/paper/42> <http://purl.org/dc/terms/title> "Knowledge Graphs in Practice" .
<https://example.org/paper/42> <http://purl.org/dc/terms/creator> <https://example.org/person/alice> .

Export a specific named graph:

SELECT pg_ripple.export_ntriples('https://example.org/graph/pubmed');

Exporting as N-Quads

N-Quads include the graph IRI for each triple:

-- Export all graphs (pass NULL)
SELECT pg_ripple.export_nquads(NULL);

Exporting as Turtle

Compact, human-readable output with prefix declarations:

SELECT pg_ripple.export_turtle();

Output:

@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

ex:paper/42 a bibo:AcademicArticle ;
    dct:title "Knowledge Graphs in Practice" ;
    dct:creator ex:person/alice, ex:person/bob .

Exporting as JSON-LD

SELECT pg_ripple.export_jsonld();

Returns a JSONB array of expanded node objects:

[
  {
    "@id": "https://example.org/paper/42",
    "@type": ["http://purl.org/ontology/bibo/AcademicArticle"],
    "http://purl.org/dc/terms/title": [{"@value": "Knowledge Graphs in Practice"}],
    "http://purl.org/dc/terms/creator": [
      {"@id": "https://example.org/person/alice"},
      {"@id": "https://example.org/person/bob"}
    ]
  }
]

JSON-LD Framing

Shape the output into the exact JSON structure your application expects:

SELECT pg_ripple.export_jsonld_framed('{
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "bibo": "http://purl.org/ontology/bibo/",
        "schema": "https://schema.org/",
        "title": "dct:title",
        "creator": "dct:creator",
        "name": "foaf:name",
        "affiliation": "schema:affiliation"
    },
    "@type": "bibo:AcademicArticle",
    "creator": {
        "name": {},
        "affiliation": {
            "name": {}
        }
    }
}'::jsonb);

Returns nested JSON-LD:

{
  "@context": {"dct": "http://purl.org/dc/terms/", "...": "..."},
  "@graph": [
    {
      "@type": "bibo:AcademicArticle",
      "title": "Knowledge Graphs in Practice",
      "creator": [
        {
          "name": "Alice Johnson",
          "affiliation": {
            "name": "Massachusetts Institute of Technology"
          }
        },
        {
          "name": "Bob Smith",
          "affiliation": {
            "name": "Stanford University"
          }
        }
      ]
    }
  ]
}

Debugging Frames

See the generated SPARQL CONSTRUCT without executing:

SELECT pg_ripple.jsonld_frame_to_sparql('{
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "bibo": "http://purl.org/ontology/bibo/",
        "title": "dct:title"
    },
    "@type": "bibo:AcademicArticle",
    "title": {}
}'::jsonb);

CONSTRUCT-Based Exports

Use SPARQL CONSTRUCT for selective, transformed exports:

-- Export a citation graph as Turtle
SELECT pg_ripple.sparql_construct_turtle('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <https://example.org/>

CONSTRUCT {
    ?paper ex:cites ?cited .
    ?paper dct:title ?title .
    ?cited dct:title ?citedTitle .
}
WHERE {
    ?paper bibo:cites ?cited ;
           dct:title ?title .
    ?cited dct:title ?citedTitle .
}
');

-- Same as JSON-LD for REST APIs
SELECT pg_ripple.sparql_construct_jsonld('
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <https://example.org/>

CONSTRUCT {
    ?paper ex:cites ?cited .
    ?paper dct:title ?title .
}
WHERE {
    ?paper bibo:cites ?cited ;
           dct:title ?title .
}
');

DESCRIBE-Based Exports

Export everything about specific entities:

-- Full description as Turtle
SELECT pg_ripple.sparql_describe_turtle('
DESCRIBE <https://example.org/paper/42>
');

-- Symmetric CBD (includes incoming links)
SELECT pg_ripple.sparql_describe_turtle(
    'DESCRIBE <https://example.org/person/alice>',
    'scbd'
);

-- As JSON-LD
SELECT pg_ripple.sparql_describe_jsonld(
    'DESCRIBE <https://example.org/paper/42>'
);

GraphRAG BYOG Pipeline

pg_ripple is the canonical source for Microsoft GraphRAG's Bring Your Own Graph (BYOG) data. The pipeline uses three export functions to produce Parquet files compatible with GraphRAG's ingestion format.

Note

This is the CANONICAL GraphRAG chapter. All other documentation that mentions GraphRAG should cross-reference this section.

Step 1: Model Entities and Relationships

GraphRAG requires entities, relationships, and text units modeled with the gr: prefix. Load the GraphRAG ontology:

SELECT pg_ripple.load_turtle('
@prefix gr:   <urn:graphrag:> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:   <https://example.org/> .

# Entity: a paper
ex:paper/42 a gr:Entity ;
    gr:title "Knowledge Graphs in Practice" ;
    gr:type "AcademicArticle" ;
    gr:description "A comprehensive survey of knowledge graph technologies and applications." ;
    gr:frequency 5 ;
    gr:degree 3 .

# Entity: an author
ex:person/alice a gr:Entity ;
    gr:title "Alice Johnson" ;
    gr:type "Person" ;
    gr:description "Researcher at MIT specializing in knowledge representation." ;
    gr:frequency 8 ;
    gr:degree 5 .

# Relationship
ex:rel/1 a gr:Relationship ;
    gr:source ex:paper/42 ;
    gr:target ex:person/alice ;
    gr:description "authored by" ;
    gr:weight "1.0"^^xsd:float ;
    gr:combinedDegree 8 .

# Text unit
ex:text/1 a gr:TextUnit ;
    gr:text "This paper surveys knowledge graph technologies..." ;
    gr:nTokens 150 ;
    gr:documentId "doc-001" .
');

Step 2: Enrich with Datalog Rules

Use Datalog rules to derive additional GraphRAG metadata:

SELECT pg_ripple.load_rules('
# Derive entity frequency from triple count
?e gr:frequency ?count :-
    ?e rdf:type gr:Entity ,
    COUNT(?t WHERE ?t ?anyPred ?e) = ?count .

# Derive relationship combined degree
?r gr:combinedDegree ?deg :-
    ?r rdf:type gr:Relationship ,
    ?r gr:source ?s ,
    ?r gr:target ?t ,
    COUNT(?p1 WHERE ?s ?p1 ?_) = ?sDeg ,
    COUNT(?p2 WHERE ?t ?p2 ?_) = ?tDeg ,
    ?deg = ?sDeg + ?tDeg .
', 'graphrag-enrichment');

SELECT pg_ripple.infer_agg('graphrag-enrichment');

Step 3: Validate with SHACL

Ensure data quality before export:

SELECT pg_ripple.load_shacl('
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix gr:   <urn:graphrag:> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<urn:graphrag:EntityShape> a sh:NodeShape ;
    sh:targetClass gr:Entity ;
    sh:property [
        sh:path gr:title ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path gr:type ;
        sh:minCount 1 ;
    ] .

<urn:graphrag:RelationshipShape> a sh:NodeShape ;
    sh:targetClass gr:Relationship ;
    sh:property [
        sh:path gr:source ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path gr:target ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
');

-- Validate before export
SELECT pg_ripple.validate();

Step 4: Export to Parquet

-- Export entities (requires superuser)
SELECT pg_ripple.export_graphrag_entities('', '/data/graphrag/entities.parquet');

-- Export relationships
SELECT pg_ripple.export_graphrag_relationships('', '/data/graphrag/relationships.parquet');

-- Export text units
SELECT pg_ripple.export_graphrag_text_units('', '/data/graphrag/text_units.parquet');

Each function returns the number of rows written. The Parquet files are directly compatible with pyarrow.parquet.read_table() and GraphRAG's BYOG configuration:

# GraphRAG settings.yaml
entity_table_path: /data/graphrag/entities.parquet
relationship_table_path: /data/graphrag/relationships.parquet
text_unit_table_path: /data/graphrag/text_units.parquet

Step 5: Export from a Named Graph

For multi-tenant or versioned exports:

-- Export only entities from the "production" graph
SELECT pg_ripple.export_graphrag_entities(
    'https://example.org/graph/production',
    '/data/graphrag/prod_entities.parquet'
);

SELECT pg_ripple.export_graphrag_relationships(
    'https://example.org/graph/production',
    '/data/graphrag/prod_relationships.parquet'
);

SELECT pg_ripple.export_graphrag_text_units(
    'https://example.org/graph/production',
    '/data/graphrag/prod_text_units.parquet'
);

Common Patterns

Pattern: API Response Formatting

Use JSON-LD framing to produce API-ready responses:

-- Papers endpoint: nested JSON with authors
SELECT pg_ripple.export_jsonld_framed('{
    "@context": {
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
        "name": "http://xmlns.com/foaf/0.1/name",
        "type": "@type"
    },
    "@type": "http://purl.org/ontology/bibo/AcademicArticle",
    "creator": { "name": {} }
}'::jsonb);

Pattern: Scheduled Exports via CONSTRUCT Views

For continuously updated exports, create a CONSTRUCT view (requires pg_trickle):

SELECT pg_ripple.create_construct_view(
    'citation_graph',
    'PREFIX bibo: <http://purl.org/ontology/bibo/>
     PREFIX dct: <http://purl.org/dc/terms/>
     CONSTRUCT { ?p bibo:cites ?c . ?p dct:title ?t . }
     WHERE { ?p bibo:cites ?c ; dct:title ?t }',
    '30s',
    true
);

-- The view is automatically refreshed every 30 seconds
SELECT * FROM pg_ripple.construct_view_citation_graph_decoded;

Pattern: Streaming Export to File

For large graphs, use COPY with streaming exports:

COPY (SELECT * FROM pg_ripple.export_turtle_stream())
TO '/data/export/full_graph.ttl';

COPY (SELECT * FROM pg_ripple.export_jsonld_stream())
TO '/data/export/full_graph.ndjson';

Pattern: Selective Export with SPARQL

Export only a subset of the graph:

-- Export only papers from 2024
SELECT pg_ripple.sparql_construct_turtle('
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT { ?paper ?p ?o }
WHERE {
    ?paper a bibo:AcademicArticle ;
           dct:date ?date ;
           ?p ?o .
    FILTER (?date >= "2024-01-01"^^xsd:date)
}
');

Performance and Trade-offs

Buffered vs Streaming Exports

| Mode | Memory usage | Output format | Best for |
|---|---|---|---|
| Buffered (export_turtle()) | Entire graph in memory | Complete document | Small–medium graphs |
| Streaming (export_turtle_stream()) | One triple at a time | Row per triple | Large graphs (millions of triples) |

Parquet Export Performance

GraphRAG Parquet export scans the relevant VP tables once per entity type. Performance depends on the number of gr:Entity, gr:Relationship, and gr:TextUnit nodes:

  • ~100K entities: <5 seconds
  • ~1M entities: ~30 seconds
  • Write path requires superuser (writes to server filesystem)

JSON-LD Framing Cost

Framing involves executing a SPARQL CONSTRUCT query, then applying the W3C embedding algorithm. The cost is dominated by the CONSTRUCT query; the embedding step is linear in the number of matched nodes.

Tip

Use jsonld_frame_to_sparql() to inspect the generated CONSTRUCT query and verify it is efficient before calling export_jsonld_framed().


Gotchas and Debugging

Empty Parquet Files

If export_graphrag_entities() returns 0, check that your data uses the correct gr: prefix and that entities have rdf:type gr:Entity:

SELECT * FROM pg_ripple.find_triples(
    NULL,
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    '<urn:graphrag:Entity>'
);

Framing Returns Empty Result

Ensure the frame's @type matches actual rdf:type values in the store. The type must be a full IRI, not a prefixed name:

-- Check what types exist
SELECT * FROM pg_ripple.sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type WHERE { ?x rdf:type ?type }
');

Server-Side File Permissions

Parquet export writes to the server filesystem. Ensure the postgres OS user has write permission to the output directory:

sudo mkdir -p /data/graphrag
sudo chown postgres:postgres /data/graphrag

Large Export Memory

For graphs with millions of triples, buffered exports (export_turtle(), export_jsonld()) may use significant memory. Switch to streaming variants or COPY ... TO with streaming.


Next Steps

§2.7 AI Retrieval and GraphRAG

What and Why

Knowledge graphs and vector search are complementary: vectors excel at fuzzy semantic similarity ("what treats headaches?"), while graph structure captures precise relationships ("which drugs interact with aspirin?"). pg_ripple combines both in a single database, eliminating the need for a separate vector store.

This chapter is the canonical AI and retrieval reference. It covers:

  • Vector embeddings: store and index entity embeddings alongside RDF triples.
  • HNSW indexes: fast approximate nearest-neighbor search via pgvector.
  • Hybrid retrieval: Reciprocal Rank Fusion (RRF) of SPARQL and vector results.
  • rag_retrieve(): end-to-end RAG pipeline from question to LLM-ready context.
  • JSON-LD framing for LLM prompts: structured context for grounded generation.
  • Graph-enriched embeddings: use owl:sameAs canonicalization and neighborhood context.
  • Full-text broadening: combine FTS with vector search for recall.

Note

pg_ripple's vector features require the pgvector extension. All vector functions gracefully degrade (return zero rows with a WARNING) when pgvector is not installed.

Why Not a Separate Vector Store?

| Concern | Separate vector store | pg_ripple integrated |
|---|---|---|
| Data consistency | Sync required between stores | Single source of truth |
| ACID transactions | No transactional guarantees | Full PostgreSQL ACID |
| Hybrid queries | Two round-trips + client-side merge | Single SQL query |
| Operational cost | Two systems to manage | One PostgreSQL instance |
| Graph-aware embeddings | Not possible | contextualize_entity() enriches embeddings |

How It Works

The Embedding Pipeline

  1. Store entities as RDF triples with rdfs:label and rdf:type.
  2. Embed entities via an OpenAI-compatible API: embed_entities() calls the API in batches and stores vectors in _pg_ripple.embeddings.
  3. Index with pgvector HNSW for approximate nearest-neighbor search.
  4. Query with similar_entities(), hybrid_search(), or rag_retrieve().
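Step 2's batching can be pictured as a simple chunking loop: labels go to the API in fixed-size groups, one request per group. A minimal sketch (the API call itself is stubbed out; the batch size mirrors the pg_ripple.embedding_batch_size GUC, default 100):

```python
# Sketch of the batching behavior in embed_entities(): entity labels are
# sent to the embedding API in fixed-size batches, one request per batch.
# The actual API call is omitted -- this only shows the chunking.

def batched(items, batch_size=100):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

labels = [f"entity-{i}" for i in range(250)]
batches = list(batched(labels, 100))   # 250 labels -> 3 API round-trips
```

With 250 labels and the default batch size of 100, embed_entities() would make three API calls: two full batches and one of 50.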

Key Functions

| Function | Purpose |
|---|---|
| store_embedding(iri, vec, model) | Manually store one entity's embedding |
| embed_entities(graph, model, batch) | Batch-embed entities from a graph |
| refresh_embeddings(graph, model, force) | Re-embed stale entities |
| similar_entities(text, k, model) | Find k nearest entities to a text query |
| hybrid_search(sparql, text, k, alpha, model) | RRF fusion of SPARQL + vector results |
| rag_retrieve(question, filter, k, model, fmt) | End-to-end RAG with context collection |
| contextualize_entity(iri, depth, max) | Build text context from RDF neighborhood |
| add_embedding_triples() | Materialise pg:hasEmbedding for SHACL checks |
| list_embedding_models() | List stored models with counts and dimensions |

GUC Parameters

| GUC | Default | Description |
|---|---|---|
| pg_ripple.embedding_api_url | (none) | OpenAI-compatible embedding API base URL |
| pg_ripple.embedding_api_key | (none) | API key (superuser only, not logged) |
| pg_ripple.embedding_model | text-embedding-3-small | Default embedding model |
| pg_ripple.embedding_dimensions | 1536 | Vector dimension count |
| pg_ripple.use_graph_context | off | Enrich embedding input with graph neighborhood |
| pg_ripple.auto_embed | off | Auto-queue new entities for embedding |
| pg_ripple.embedding_batch_size | 100 | API batch size for embed_entities() |

Worked Examples

Setup: Configure Embedding API

-- Point to your OpenAI-compatible embedding endpoint
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'https://api.openai.com/v1';
ALTER SYSTEM SET pg_ripple.embedding_api_key = 'sk-your-key-here';
ALTER SYSTEM SET pg_ripple.embedding_model = 'text-embedding-3-small';
ALTER SYSTEM SET pg_ripple.embedding_dimensions = 1536;
SELECT pg_reload_conf();

Warning

The API key is stored as a superuser-only GUC. It never appears in query logs or pg_stat_statements. For production, consider using a local embedding service (e.g., Ollama, vLLM) to avoid sending data to external APIs.

Step 1: Embed Entities

Batch-embed all entities with an rdfs:label:

-- Embed all entities in the default graph
SELECT pg_ripple.embed_entities();
-- Returns: 150 (number of embeddings stored)

-- Embed only entities in a specific graph
SELECT pg_ripple.embed_entities('https://example.org/graph/pubmed');

-- Override the model for this call
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-large', 50);

Step 2: Find Similar Entities

Find entities semantically similar to a question:

SELECT * FROM pg_ripple.similar_entities('knowledge graph applications', 5);

Returns:

| entity_id | entity_iri | distance |
|---|---|---|
| 42001 | <https://example.org/paper/42> | 0.12 |
| 99001 | <https://example.org/paper/99> | 0.18 |
| 10001 | <https://example.org/person/alice> | 0.31 |

Step 3: Hybrid Search with RRF

Combine SPARQL structural queries with vector similarity:

SELECT * FROM pg_ripple.hybrid_search(
    'PREFIX dct: <http://purl.org/dc/terms/>
     PREFIX bibo: <http://purl.org/ontology/bibo/>
     SELECT ?entity WHERE {
         ?entity a bibo:AcademicArticle ;
                 dct:creator <https://example.org/person/alice> .
     }',
    'knowledge graph survey',
    10,
    0.5
);

Returns:

| entity_id | entity_iri | rrf_score | sparql_rank | vector_rank |
|---|---|---|---|---|
| 42001 | <https://example.org/paper/42> | 0.032 | 1 | 1 |
| 99001 | <https://example.org/paper/99> | 0.024 | 0 | 2 |

The alpha parameter controls weighting:

  • alpha = 1.0: SPARQL only (graph structure)
  • alpha = 0.0: vector only (semantic similarity)
  • alpha = 0.5: equal weight (default)
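The weighting behind these alpha values can be sketched as a weighted Reciprocal Rank Fusion over the two ranked lists. A minimal illustration (the RRF constant k=60 is the conventional default from the RRF literature; pg_ripple's exact constant is an assumption here):

```python
# Sketch of weighted RRF: each list contributes weight / (k + rank) per
# entity, with alpha weighting the SPARQL list and (1 - alpha) the vector
# list. ASSUMPTION: k=60, the conventional RRF constant.

def hybrid_rrf(sparql_ranked, vector_ranked, alpha=0.5, k=60):
    """Both inputs are lists of entity IRIs, best match first."""
    scores = {}
    for rank, iri in enumerate(sparql_ranked, start=1):
        scores[iri] = scores.get(iri, 0.0) + alpha / (k + rank)
    for rank, iri in enumerate(vector_ranked, start=1):
        scores[iri] = scores.get(iri, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = hybrid_rrf(
    ["ex:paper/42", "ex:paper/7"],    # structural (SPARQL) matches
    ["ex:paper/42", "ex:paper/99"],   # semantic (vector) matches
    alpha=0.5,
)
```

An entity that appears high in both lists (here ex:paper/42) accumulates score from both terms and rises to the top, which is why hybrid search rewards agreement between graph structure and semantic similarity.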

Step 4: End-to-End RAG with rag_retrieve()

The complete pipeline from question to LLM-ready context:

SELECT * FROM pg_ripple.rag_retrieve(
    'What papers discuss knowledge graphs?',
    NULL,
    5
);

Returns:

| entity_iri | label | context_json | distance |
|---|---|---|---|
| <https://example.org/paper/42> | Knowledge Graphs in Practice | {"types": [...], "properties": [...], ...} | 0.12 |

With a SPARQL filter to restrict candidates:

SELECT * FROM pg_ripple.rag_retrieve(
    'What papers discuss knowledge graphs?',
    '?entity a <http://purl.org/ontology/bibo/AcademicArticle> .',
    5
);

Get JSON-LD formatted context for LLM consumption:

SELECT * FROM pg_ripple.rag_retrieve(
    'What papers discuss knowledge graphs?',
    NULL,
    5,
    NULL,
    'jsonld'
);

Building LLM Prompts with JSON-LD Framing

Use framed JSON-LD as structured context for LLM prompts:

-- Get framed JSON-LD for a specific paper
SELECT pg_ripple.export_jsonld_framed('{
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "bibo": "http://purl.org/ontology/bibo/",
        "schema": "https://schema.org/",
        "title": "dct:title",
        "creator": "dct:creator",
        "name": "foaf:name",
        "affiliation": "schema:affiliation",
        "cites": "bibo:cites",
        "keywords": "schema:keywords"
    },
    "@type": "bibo:AcademicArticle",
    "creator": {
        "name": {},
        "affiliation": { "name": {} }
    },
    "cites": { "title": {} }
}'::jsonb);

This produces nested JSON that LLMs can reason about more effectively than flat triples.

Graph-Enriched Embeddings

Use contextualize_entity() to build richer text for embedding:

-- Get context text for an entity
SELECT pg_ripple.contextualize_entity(
    'https://example.org/paper/42',
    1,
    20
);

Returns a text string like:

Knowledge Graphs in Practice. Type: AcademicArticle. Created by: Alice Johnson, Bob Smith. 
Cited by: Graph Neural Networks for Entity Resolution. Keywords: knowledge graph, RDF, SPARQL.

Enable graph-enriched embeddings globally:

SET pg_ripple.use_graph_context = 'on';

-- Now embed_entities() uses contextualize_entity() for each entity
SELECT pg_ripple.embed_entities();

owl:sameAs Before Embedding

Canonicalize equivalent entities before embedding to avoid duplicates:

-- Load sameAs links
SELECT pg_ripple.load_turtle('
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <https://example.org/> .

ex:person/alice owl:sameAs <https://orcid.org/0000-0001-2345-6789> .
');

-- Run OWL RL inference to canonicalize
SELECT pg_ripple.load_rules_builtin('owl-rl');
SELECT pg_ripple.infer('owl-rl');

-- Now embed — equivalent entities share a single embedding
SELECT pg_ripple.embed_entities();

Full-Text Search Broadening

Combine vector search with PostgreSQL full-text search for higher recall:

-- Create FTS index on paper titles
SELECT pg_ripple.fts_index('<http://purl.org/dc/terms/title>');

-- Use FTS to find papers by keyword
SELECT * FROM pg_ripple.fts_search(
    'knowledge & graph',
    '<http://purl.org/dc/terms/title>'
);

-- Combine FTS candidates with vector search in a hybrid approach
-- Step 1: Get FTS matches
-- Step 2: Get vector matches
-- Step 3: Merge with RRF (done automatically in hybrid_search)
SELECT * FROM pg_ripple.hybrid_search(
    'PREFIX dct: <http://purl.org/dc/terms/>
     SELECT ?entity WHERE {
         ?entity dct:title ?t .
         FILTER (CONTAINS(?t, "knowledge"))
     }',
    'knowledge graph applications',
    10,
    0.6
);

Storing Manual Embeddings

If you compute embeddings externally:

SELECT pg_ripple.store_embedding(
    'https://example.org/paper/42',
    ARRAY[0.1, -0.2, 0.3, 0.05, -0.15, 0.25, 0.08, -0.1, 0.2, 0.12]::float8[],
    'custom-model-v1'
);

Refreshing Stale Embeddings

After updating entity labels, refresh the affected embeddings:

-- Refresh only entities whose labels changed
SELECT pg_ripple.refresh_embeddings();
-- Returns: 12 (re-embedded entities)

-- Force re-embed everything
SELECT pg_ripple.refresh_embeddings(NULL, NULL, true);

Checking Embedding Coverage

-- List all embedding models and their entity counts
SELECT * FROM pg_ripple.list_embedding_models();

-- Add pg:hasEmbedding triples for SHACL completeness checks
SELECT pg_ripple.add_embedding_triples();

-- Validate embedding completeness
SELECT pg_ripple.validate();

Common Patterns

Pattern: Complete RAG Pipeline

-- 1. Load knowledge graph
SELECT pg_ripple.load_turtle_file('/data/domain.ttl');

-- 2. Run inference to derive additional facts
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');

-- 3. Embed entities
SELECT pg_ripple.embed_entities();

-- 4. Query with RAG
SELECT * FROM pg_ripple.rag_retrieve(
    'What drugs treat migraines?',
    '?entity a <https://example.org/Drug> .',
    5,
    NULL,
    'jsonld'
);

Pattern: Periodic Re-Embedding

Schedule embedding refresh after data updates:

-- After loading new data
SELECT pg_ripple.load_turtle('...');
SELECT pg_ripple.infer('rdfs');

-- Refresh embeddings for entities with changed labels
SELECT pg_ripple.refresh_embeddings();

-- Compact HTAP tables
SELECT pg_ripple.compact();

Pattern: Multi-Model Embeddings

Store embeddings from different models for comparison:

-- Embed with model A
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-small');

-- Embed with model B
SELECT pg_ripple.embed_entities(NULL, 'text-embedding-3-large');

-- List stored models
SELECT * FROM pg_ripple.list_embedding_models();

-- Search with a specific model
SELECT * FROM pg_ripple.similar_entities('knowledge graphs', 10, 'text-embedding-3-large');

Pattern: Serving RAG via HTTP

Use pg_ripple_http's /rag endpoint for REST access (see §2.8):

curl -X POST http://localhost:8080/rag \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What treats headaches?",
    "k": 5,
    "output_format": "jsonld"
  }'

The response includes both structured results and a pre-formatted context string ready to be injected into an LLM system prompt.
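From application code, the same call reduces to building a small JSON POST request. A hypothetical Python sketch showing only the request construction (no network I/O); the endpoint path and field names follow the curl example above:

```python
# Build the /rag request shown above with curl. Request construction only;
# sending it (e.g. with urllib.request or httpx) is left out.

import json

def build_rag_request(base_url, question, k=5, output_format="jsonld"):
    body = {"question": question, "k": k, "output_format": output_format}
    headers = {"Content-Type": "application/json"}
    return base_url.rstrip("/") + "/rag", headers, json.dumps(body)

url, headers, payload = build_rag_request(
    "http://localhost:8080", "What treats headaches?", k=5
)
```

Any HTTP client can then POST `payload` to `url` with those headers and feed the returned context string to an LLM.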


Performance and Trade-offs

Embedding Storage

Each embedding vector occupies dimensions * 4 bytes (float32 in pgvector). For 1536-dimensional embeddings, that is ~6 KB per entity. A graph with 1M entities uses ~6 GB for embeddings alone.
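The arithmetic behind those figures is simple enough to check directly:

```python
# Back-of-envelope check of the storage estimate above:
# float32 vectors cost dimensions * 4 bytes each.

def embedding_storage_bytes(n_entities, dimensions):
    return n_entities * dimensions * 4

per_entity_kb = embedding_storage_bytes(1, 1536) / 1024           # 6.0 KB
million_gb = embedding_storage_bytes(1_000_000, 1536) / 1024**3   # ~5.7 GB, i.e. ~6 GB
```
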

HNSW Index Performance

| Entities | Index build time | Query latency (k=10) | Recall@10 |
|---|---|---|---|
| 10K | ~2s | <5ms | >95% |
| 100K | ~20s | <10ms | >95% |
| 1M | ~5min | <20ms | >92% |

RRF Fusion Overhead

hybrid_search() executes two queries (SPARQL + vector) and fuses results in Rust. Total overhead beyond the individual query times is <1ms for typical result sizes.

API Call Costs

embed_entities() calls an external API. Batch size affects both throughput and cost:

  • Larger batches reduce round-trips but increase per-request latency.
  • Default batch size (100) is a good balance for OpenAI's API.
  • For local models (Ollama, vLLM), increase batch size to 500+.

Tip

For large initial embeddings, consider running embed_entities() in a separate session with a larger embedding_batch_size setting to maximize throughput.


Gotchas and Debugging

pgvector Not Installed

All vector functions return zero rows with a WARNING when pgvector is absent:

WARNING: pg_ripple.similar_entities: pgvector not available (PT603)

Fix: install pgvector and CREATE EXTENSION vector.

No Embeddings Found

If similar_entities() returns empty:

  1. Check that embedding_api_url is configured:
    SHOW pg_ripple.embedding_api_url;
    
  2. Check that embeddings exist:
    SELECT * FROM pg_ripple.list_embedding_models();
    
  3. Run embed_entities() if needed.

Dimension Mismatch

The vector dimension in _pg_ripple.embeddings must match embedding_dimensions:

SHOW pg_ripple.embedding_dimensions;
-- Must match the model's output dimension (1536 for text-embedding-3-small)

Slow Vector Queries

If vector queries are slow, check that an HNSW index exists on the embeddings table. pg_ripple creates one automatically, but it may need rebuilding after large batch inserts:

-- Rebuild the HNSW index
REINDEX INDEX _pg_ripple.embeddings_embedding_idx;

API Rate Limits

embed_entities() respects rate limits by batching. If you hit rate limits, reduce embedding_batch_size:

SET pg_ripple.embedding_batch_size = 50;
SELECT pg_ripple.embed_entities();

Next Steps

§2.8 APIs and Integration

What and Why

pg_ripple's SQL functions are powerful, but most applications do not talk to PostgreSQL directly. The pg_ripple_http companion service exposes a W3C-compliant SPARQL Protocol endpoint over HTTP, so any SPARQL client, programming language, or tool can query your knowledge graph.

This chapter covers:

  • pg_ripple_http: the standalone SPARQL endpoint service.
  • Application code examples: Python, JavaScript, and Java.
  • SPARQL federation: query remote SPARQL endpoints from within pg_ripple.
  • Caching strategies: plan cache, connection pooling, and result caching.

How It Works

pg_ripple_http Architecture

pg_ripple_http is a standalone Rust binary (not a PostgreSQL extension) that:

  1. Connects to PostgreSQL via a deadpool connection pool.
  2. Receives SPARQL queries via HTTP GET/POST (W3C SPARQL Protocol).
  3. Calls pg_ripple.sparql(), pg_ripple.sparql_construct(), etc. via SQL.
  4. Returns results in standard formats: SPARQL Results JSON/XML, Turtle, N-Triples, JSON-LD.
  5. Exposes a /rag endpoint for AI retrieval.

Supported Endpoints

Method | Path | Content-Type | Description
GET | /sparql?query=... | Accept header | SPARQL query via URL parameter
POST | /sparql | application/sparql-query | SPARQL query in request body
POST | /sparql | application/x-www-form-urlencoded | SPARQL query as form parameter
POST | /sparql | application/sparql-update | SPARQL Update in request body
POST | /rag | application/json | RAG retrieval endpoint
GET | /health | application/json | Health check
GET | /metrics | text/plain | Prometheus metrics

Response Formats (Content Negotiation)

Accept header | Format
application/sparql-results+json | SPARQL Results JSON (default for SELECT/ASK)
application/sparql-results+xml | SPARQL Results XML
text/csv | CSV
text/tab-separated-values | TSV
text/turtle | Turtle (for CONSTRUCT/DESCRIBE)
application/n-triples | N-Triples (for CONSTRUCT/DESCRIBE)
application/ld+json | JSON-LD (for CONSTRUCT/DESCRIBE)
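Client-side, the table above amounts to a small lookup keyed on query form. The sketch below is illustrative (the negotiate function is not part of pg_ripple, and Turtle as the CONSTRUCT/DESCRIBE fallback is an assumption):

```python
# Content-negotiation sketch mirroring the table above.
SELECT_FORMATS = {
    "application/sparql-results+json": "json",
    "application/sparql-results+xml": "xml",
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
}

GRAPH_FORMATS = {
    "text/turtle": "turtle",
    "application/n-triples": "ntriples",
    "application/ld+json": "jsonld",
}

def negotiate(accept: str, query_form: str) -> str:
    """Pick a result format for 'select', 'ask', 'construct', or 'describe'."""
    graph = query_form in ("construct", "describe")
    table = GRAPH_FORMATS if graph else SELECT_FORMATS
    # First acceptable media type wins; fall back to the form's default.
    for media_type in (part.split(";")[0].strip() for part in accept.split(",")):
        if media_type in table:
            return table[media_type]
    return "turtle" if graph else "json"
```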

Worked Examples

Starting pg_ripple_http

# Set environment variables
export PG_RIPPLE_DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"
export PG_RIPPLE_LISTEN="0.0.0.0:8080"
export PG_RIPPLE_AUTH_TOKEN="my-secret-token"  # optional

# Start the server
pg_ripple_http

Configuration via environment variables:

Variable | Default | Description
PG_RIPPLE_DATABASE_URL | postgresql://localhost/postgres | PostgreSQL connection string
PG_RIPPLE_LISTEN | 127.0.0.1:8080 | Listen address and port
PG_RIPPLE_AUTH_TOKEN | (none) | Bearer token for authentication
PG_RIPPLE_POOL_SIZE | 10 | Connection pool size
PG_RIPPLE_RATE_LIMIT | 100 | Requests per second per IP
PG_RIPPLE_CORS_ORIGIN | * | CORS allowed origins

Querying via curl

SPARQL SELECT via GET:

curl -G http://localhost:8080/sparql \
  --data-urlencode 'query=PREFIX dct: <http://purl.org/dc/terms/> SELECT ?paper ?title WHERE { ?paper dct:title ?title } LIMIT 10' \
  -H "Accept: application/sparql-results+json"

SPARQL SELECT via POST (body):

curl -X POST http://localhost:8080/sparql \
  -H "Content-Type: application/sparql-query" \
  -H "Accept: application/sparql-results+json" \
  -d 'PREFIX dct: <http://purl.org/dc/terms/>
      PREFIX bibo: <http://purl.org/ontology/bibo/>
      SELECT ?paper ?title
      WHERE {
          ?paper a bibo:AcademicArticle ;
                 dct:title ?title .
      }
      ORDER BY ?title
      LIMIT 20'

SPARQL CONSTRUCT as Turtle:

curl -X POST http://localhost:8080/sparql \
  -H "Content-Type: application/sparql-query" \
  -H "Accept: text/turtle" \
  -d 'PREFIX dct: <http://purl.org/dc/terms/>
      PREFIX ex: <https://example.org/>
      CONSTRUCT { ?paper ex:hasTitle ?title }
      WHERE { ?paper dct:title ?title }'

SPARQL CONSTRUCT as JSON-LD:

curl -X POST http://localhost:8080/sparql \
  -H "Content-Type: application/sparql-query" \
  -H "Accept: application/ld+json" \
  -d 'PREFIX dct: <http://purl.org/dc/terms/>
      PREFIX ex: <https://example.org/>
      CONSTRUCT { ?paper ex:hasTitle ?title }
      WHERE { ?paper dct:title ?title }'

SPARQL Update:

curl -X POST http://localhost:8080/sparql \
  -H "Content-Type: application/sparql-update" \
  -d 'PREFIX dct: <http://purl.org/dc/terms/>
      INSERT DATA {
          <https://example.org/paper/new1> dct:title "A New Discovery" .
      }'

RAG endpoint:

curl -X POST http://localhost:8080/rag \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What papers discuss knowledge graphs?",
    "sparql_filter": "?entity a <http://purl.org/ontology/bibo/AcademicArticle> .",
    "k": 5,
    "output_format": "jsonld"
  }'

The RAG response includes a context field with pre-formatted text for LLM prompts:

{
  "results": [
    {
      "entity_iri": "https://example.org/paper/42",
      "label": "Knowledge Graphs in Practice",
      "context_json": {"@type": ["AcademicArticle"], "...": "..."},
      "distance": 0.12
    }
  ],
  "context": "Knowledge Graphs in Practice (AcademicArticle): A comprehensive survey..."
}
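One way to use that response is to splice the context field into an LLM prompt. The build_rag_prompt helper below is hypothetical, assuming only the response shape shown above:

```python
# Hypothetical helper: turn a /rag response into an LLM prompt.
def build_rag_prompt(question: str, rag_response: dict) -> str:
    """Combine the pre-formatted context with the user's question."""
    context = rag_response.get("context", "")
    sources = "\n".join(
        f"- {r['label']} <{r['entity_iri']}> (distance {r['distance']:.2f})"
        for r in rag_response.get("results", [])
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nSources:\n{sources}\n\n"
        f"Question: {question}"
    )
```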

Authentication (when PG_RIPPLE_AUTH_TOKEN is set):

curl -X POST http://localhost:8080/sparql \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/sparql-query" \
  -d 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5'

Python with psycopg2

Query pg_ripple directly from Python via SQL:

import json
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# Execute a SPARQL query
cur.execute("""
    SELECT * FROM pg_ripple.sparql(%s)
""", ("""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    
    SELECT ?paper ?title ?author
    WHERE {
        ?paper a bibo:AcademicArticle ;
               dct:title ?title ;
               dct:creator ?author .
    }
    ORDER BY ?title
    LIMIT 20
""",))

for row in cur.fetchall():
    result = json.loads(row[0])
    print(f"Paper: {result['paper']}")
    print(f"Title: {result['title']}")
    print(f"Author: {result['author']}")
    print()

# Load Turtle data
cur.execute("""
    SELECT pg_ripple.load_turtle(%s)
""", ("""
    @prefix dct: <http://purl.org/dc/terms/> .
    <https://example.org/paper/new> dct:title "Loaded from Python" .
""",))
conn.commit()

# Export as JSON-LD
cur.execute("SELECT pg_ripple.export_jsonld()")
jsonld = json.loads(cur.fetchone()[0])
print(json.dumps(jsonld, indent=2))

cur.close()
conn.close()

Python with SPARQLWrapper

Query the pg_ripple_http endpoint using the standard SPARQLWrapper library:

from SPARQLWrapper import SPARQLWrapper, JSON, TURTLE

# Point to the pg_ripple_http endpoint
sparql = SPARQLWrapper("http://localhost:8080/sparql")

# SELECT query
sparql.setQuery("""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    
    SELECT ?paper ?title
    WHERE {
        ?paper a bibo:AcademicArticle ;
               dct:title ?title .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(f"{binding['paper']['value']}: {binding['title']['value']}")

# CONSTRUCT query as Turtle
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX ex:  <https://example.org/>
    
    CONSTRUCT { ?paper ex:hasTitle ?title }
    WHERE { ?paper dct:title ?title }
""")
sparql.setReturnFormat(TURTLE)
turtle_output = sparql.query().convert()
print(turtle_output.decode("utf-8"))

JavaScript with pg

Query pg_ripple directly from Node.js:

const { Client } = require('pg');

async function main() {
    const client = new Client({ connectionString: 'postgresql://localhost/mydb' });
    await client.connect();

    // SPARQL SELECT
    const { rows } = await client.query(
        `SELECT * FROM pg_ripple.sparql($1)`,
        [`
            PREFIX dct:  <http://purl.org/dc/terms/>
            PREFIX bibo: <http://purl.org/ontology/bibo/>
            
            SELECT ?paper ?title
            WHERE {
                ?paper a bibo:AcademicArticle ;
                       dct:title ?title .
            }
            LIMIT 10
        `]
    );

    for (const row of rows) {
        const result = row.result;
        console.log(`Paper: ${result.paper}, Title: ${result.title}`);
    }

    // Load Turtle
    const loadResult = await client.query(
        `SELECT pg_ripple.load_turtle($1)`,
        [`
            @prefix dct: <http://purl.org/dc/terms/> .
            <https://example.org/paper/fromjs> dct:title "Loaded from JavaScript" .
        `]
    );
    console.log(`Loaded: ${loadResult.rows[0].load_turtle} triples`);

    // Export JSON-LD
    const jsonldResult = await client.query(`SELECT pg_ripple.export_jsonld()`);
    console.log(JSON.stringify(jsonldResult.rows[0].export_jsonld, null, 2));

    await client.end();
}

main().catch(console.error);

JavaScript with fetch (HTTP endpoint)

async function sparqlQuery(query) {
    const response = await fetch('http://localhost:8080/sparql', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/sparql-query',
            'Accept': 'application/sparql-results+json',
        },
        body: query,
    });
    return response.json();
}

const results = await sparqlQuery(`
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?paper ?title
    WHERE { ?paper dct:title ?title }
    LIMIT 10
`);

for (const binding of results.results.bindings) {
    console.log(`${binding.paper.value}: ${binding.title.value}`);
}

Java with JDBC

import java.sql.*;
import org.json.JSONObject;

public class PgRippleExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/mydb", "postgres", "password"
        );

        // SPARQL SELECT
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT * FROM pg_ripple.sparql(?)"
        );
        stmt.setString(1,
            "PREFIX dct: <http://purl.org/dc/terms/> " +
            "PREFIX bibo: <http://purl.org/ontology/bibo/> " +
            "SELECT ?paper ?title " +
            "WHERE { " +
            "    ?paper a bibo:AcademicArticle ; " +
            "           dct:title ?title . " +
            "} LIMIT 10"
        );

        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            String jsonStr = rs.getString("result");
            JSONObject result = new JSONObject(jsonStr);
            System.out.println("Paper: " + result.getString("paper"));
            System.out.println("Title: " + result.getString("title"));
        }
        rs.close();
        stmt.close();

        // Load Turtle
        PreparedStatement loadStmt = conn.prepareStatement(
            "SELECT pg_ripple.load_turtle(?)"
        );
        loadStmt.setString(1,
            "@prefix dct: <http://purl.org/dc/terms/> .\n" +
            "<https://example.org/paper/fromjava> dct:title \"Loaded from Java\" .\n"
        );
        ResultSet loadRs = loadStmt.executeQuery();
        if (loadRs.next()) {
            System.out.println("Loaded: " + loadRs.getLong(1) + " triples");
        }
        loadRs.close();
        loadStmt.close();

        conn.close();
    }
}

SPARQL Federation

pg_ripple can query remote SPARQL endpoints from within a SPARQL query using the SERVICE keyword. This lets you join local data with remote datasets like Wikidata or DBpedia.

Querying a Remote SPARQL Endpoint

SELECT * FROM pg_ripple.sparql('
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd:     <http://www.wikidata.org/entity/>
PREFIX wdt:    <http://www.wikidata.org/prop/direct/>

SELECT ?paper ?title ?wikidataLabel
WHERE {
    ?paper dct:title ?title ;
           dct:subject ?topic .
    
    SERVICE <https://query.wikidata.org/sparql> {
        ?topic rdfs:label ?wikidataLabel .
        FILTER (LANG(?wikidataLabel) = "en")
    }
}
LIMIT 10
');

Vector Federation

Register external vector services for federated similarity search (see Vector Federation for full details):

-- Register a Qdrant endpoint
SELECT pg_ripple.register_vector_endpoint(
    'https://qdrant.internal:6333',
    'qdrant'
);

-- Register a Weaviate endpoint
SELECT pg_ripple.register_vector_endpoint(
    'https://weaviate.internal:8080',
    'weaviate'
);

Tip

Federation queries add network latency. Set timeouts to prevent slow remote endpoints from blocking local queries:

SET pg_ripple.vector_federation_timeout_ms = 5000;


Common Patterns

Pattern: Connection Pooling

For high-traffic applications, use a connection pooler (PgBouncer, pgcat) between your application and PostgreSQL:

App → PgBouncer (port 6432) → PostgreSQL (port 5432)

pg_ripple_http uses its own connection pool internally (configurable via PG_RIPPLE_POOL_SIZE).

Pattern: Result Caching

Cache SPARQL results at the application level for frequently repeated queries:

import json
import hashlib
import redis
import psycopg2

cache = redis.Redis()

def cached_sparql(query, ttl=300):
    key = f"sparql:{hashlib.sha256(query.encode()).hexdigest()}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    cur.execute("SELECT * FROM pg_ripple.sparql(%s)", (query,))
    results = [json.loads(row[0]) for row in cur.fetchall()]
    cur.close()
    conn.close()

    cache.setex(key, ttl, json.dumps(results))
    return results

Pattern: SPARQL Views for Pre-Computed Results

For dashboard queries that run frequently, create SPARQL views (requires pg_trickle):

-- Create a pre-computed view of paper counts per institution
SELECT pg_ripple.create_sparql_view(
    'papers_by_institution',
    'PREFIX dct: <http://purl.org/dc/terms/>
     PREFIX schema: <https://schema.org/>
     PREFIX foaf: <http://xmlns.com/foaf/0.1/>
     SELECT ?inst ?instName (COUNT(DISTINCT ?paper) AS ?count)
     WHERE {
         ?paper dct:creator ?author .
         ?author schema:affiliation ?inst .
         ?inst foaf:name ?instName .
     }
     GROUP BY ?inst ?instName',
    '30s',
    true
);

-- Query the view directly (instant, no SPARQL parsing)
SELECT * FROM pg_ripple.papers_by_institution;

Pattern: Prometheus Monitoring

pg_ripple_http exposes Prometheus metrics at /metrics:

curl http://localhost:8080/metrics

Metrics include:

  • pg_ripple_http_requests_total — total request count by endpoint and status
  • pg_ripple_http_request_duration_seconds — request latency histogram
  • pg_ripple_http_active_connections — current active connections

Performance and Trade-offs

Direct SQL vs HTTP Endpoint

Access method | Latency overhead | Best for
Direct SQL (psycopg2, JDBC) | None | Server-side applications, ETL
pg_ripple_http | ~1–5 ms per request | Web applications, REST APIs, federated queries

Connection Pool Sizing

Rule of thumb: set pool size to 2 * CPU cores for OLTP workloads. For SPARQL-heavy analytics, 4 * CPU cores may be better:

export PG_RIPPLE_POOL_SIZE=20

Rate Limiting

pg_ripple_http includes built-in rate limiting to prevent abuse:

export PG_RIPPLE_RATE_LIMIT=100  # requests per second per IP

For public-facing endpoints, combine with a reverse proxy (nginx, Caddy) for additional protection.

CORS Configuration

For browser-based applications:

export PG_RIPPLE_CORS_ORIGIN="https://myapp.example.com"

Set to * for development; restrict to specific origins in production.


Gotchas and Debugging

Authentication Errors

If PG_RIPPLE_AUTH_TOKEN is set, all requests must include the Authorization header:

HTTP 401: Missing or invalid authorization token

Fix: include Authorization: Bearer <token> in the request headers.

Connection Refused

If pg_ripple_http cannot connect to PostgreSQL:

Error: connection refused (os error 61)

Fix: check PG_RIPPLE_DATABASE_URL and ensure PostgreSQL is running and accepting connections.

Content-Type Negotiation

If you get unexpected response formats, check the Accept header. pg_ripple_http uses content negotiation:

# Explicitly request JSON results
curl -H "Accept: application/sparql-results+json" ...

# Explicitly request Turtle for CONSTRUCT
curl -H "Accept: text/turtle" ...

Federation Timeouts

Remote SPARQL endpoints can be slow. If federation queries time out:

-- Increase the timeout
SET pg_ripple.vector_federation_timeout_ms = 30000;

For SPARQL federation (SERVICE keyword), pg_ripple uses PostgreSQL's statement_timeout for the overall query:

SET statement_timeout = '60s';

Health Check

Use the /health endpoint for load balancer configuration:

curl http://localhost:8080/health
# Returns: {"status": "ok", "pool_size": 10, "pool_available": 8}

Next Steps

CDC Subscriptions

Added in v0.42.0

Overview

Change Data Capture (CDC) subscriptions let your application subscribe to a real-time stream of RDF triple changes — filtered by SPARQL pattern or SHACL shape — without polling the database.

When a matching triple is inserted or deleted, pg_ripple sends a PostgreSQL NOTIFY message on a named channel. Listeners receive a JSON payload describing the change. The pg_ripple_http companion service exposes these subscriptions as WebSocket endpoints for web and streaming applications.

Creating a Subscription

-- Subscribe to all triple changes.
SELECT pg_ripple.create_subscription('my_feed');

-- Subscribe with a SPARQL pattern filter.
SELECT pg_ripple.create_subscription(
    'person_changes',
    filter_sparql := 'SELECT ?s ?p ?o WHERE { ?s a <https://schema.org/Person> ; ?p ?o }'
);

-- Subscribe with a SHACL shape filter.
SELECT pg_ripple.create_subscription(
    'shape_violations',
    filter_shape := '<https://shapes.example.org/PersonShape>'
);

Parameters

Parameter | Type | Default | Description
name | TEXT | required | Unique subscription name (alphanumeric plus _ and -, max 63 chars)
filter_sparql | TEXT | NULL | Optional SPARQL SELECT pattern; only matching triples are published
filter_shape | TEXT | NULL | Optional SHACL shape IRI; only shape-violating triples are published

Returns TRUE if created, FALSE if a subscription with that name already exists.

Listening for Changes

-- Start listening.
LISTEN pg_ripple_cdc_my_feed;

-- Insert a triple.
SELECT pg_ripple.insert_triple(
    '<https://ex.org/alice>',
    '<https://schema.org/name>',
    '"Alice"'
);

-- In your application, receive notifications via pg_notify/asyncpg/etc.

Notification Payload

Each notification carries a JSON payload:

{
  "op": "add",
  "s": "<https://ex.org/alice>",
  "p": "<https://schema.org/name>",
  "o": "\"Alice\"",
  "g": ""
}
Field | Value
op | "add" for INSERT, "remove" for DELETE
s | Subject — N-Triples formatted IRI or blank node
p | Predicate — N-Triples formatted IRI
o | Object — N-Triples formatted literal or IRI
g | Named graph IRI, or empty string for the default graph
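A consumer usually wants plain values rather than N-Triples framing. A minimal sketch of decoding the payload fields above (decode_term and decode_change are illustrative helpers, not part of pg_ripple; datatype and language-tag handling is omitted):

```python
import json

def decode_term(term: str):
    """Return (kind, value) for an N-Triples formatted term."""
    if term.startswith("<") and term.endswith(">"):
        return ("iri", term[1:-1])
    if term.startswith("_:"):
        return ("bnode", term[2:])
    if term.startswith('"'):
        # Plain literal; datatype/language handling omitted for brevity.
        return ("literal", term[1:term.rindex('"')])
    return ("unknown", term)

def decode_change(payload: str) -> dict:
    """Decode one NOTIFY payload into plain Python values."""
    change = json.loads(payload)
    return {
        "op": change["op"],
        "s": decode_term(change["s"]),
        "p": decode_term(change["p"]),
        "o": decode_term(change["o"]),
        "graph": change["g"] or None,  # None for the default graph
    }
```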

Listing Subscriptions

SELECT name, filter_sparql IS NOT NULL AS has_filter, created_at
FROM pg_ripple.list_subscriptions()
ORDER BY created_at;

Dropping a Subscription

-- Returns TRUE if removed, FALSE if not found.
SELECT pg_ripple.drop_subscription('my_feed');

WebSocket Access via pg_ripple_http

When the pg_ripple_http companion service is running, subscriptions are accessible as WebSocket endpoints:

ws://<host>:8080/ws/subscriptions/{name}

The service supports content negotiation via the Accept header:

  • application/json (default) — JSON payload
  • text/turtle — Turtle-serialized change notification
  • application/ld+json — JSON-LD change notification

Integration Patterns

GraphRAG Pipeline

import asyncio
import json

import asyncpg

async def watch_entity_changes(dsn: str):
    conn = await asyncpg.connect(dsn)

    # asyncpg delivers notifications via a callback, not an async iterator.
    def on_change(connection, pid, channel, payload):
        change = json.loads(payload)
        # Re-embed the entity on change.
        asyncio.create_task(update_embedding(change["s"]))

    await conn.add_listener("pg_ripple_cdc_entity_changes", on_change)
    await asyncio.Event().wait()  # keep the connection open and listening

Live Dashboard

const ws = new WebSocket("ws://localhost:8080/ws/subscriptions/dashboard_feed");
ws.onmessage = (event) => {
  const change = JSON.parse(event.data);
  updateDashboard(change.op, change.s, change.p, change.o);
};

Underlying Tables

Table | Description
_pg_ripple.subscriptions | Named subscription registry
_pg_ripple.cdc_subscriptions | Low-level predicate-pattern subscriptions (v0.6.0 legacy API)

Function | Description
pg_ripple.create_subscription(name, filter_sparql, filter_shape) | Create named subscription
pg_ripple.drop_subscription(name) | Remove named subscription
pg_ripple.list_subscriptions() | List all named subscriptions
pg_ripple.subscribe(pattern, channel) | Low-level subscription (v0.6.0 API)
pg_ripple.unsubscribe(channel) | Remove low-level subscription

Architecture Overview

pg_ripple is a PostgreSQL 18 extension that implements a high-performance RDF triple store with native SPARQL query execution. This page describes the internal architecture: how data is stored, how queries are executed, and how the subsystems interact.


System Architecture Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                        Client Applications                           │
│   psql / JDBC / SPARQL Protocol (pg_ripple_http) / REST / ODBC      │
└────────────────────────────┬─────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────┐
│                     PostgreSQL 18 Backend                             │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    pg_ripple Extension                          │  │
│  │                                                                │  │
│  │  ┌──────────────┐  ┌───────────────┐  ┌────────────────────┐  │  │
│  │  │  SPARQL       │  │  Datalog       │  │  SHACL             │  │  │
│  │  │  Engine       │  │  Reasoner      │  │  Validator         │  │  │
│  │  │              │  │               │  │                    │  │  │
│  │  │  parse →     │  │  stratify →   │  │  shapes → DDL     │  │  │
│  │  │  optimize → │  │  compile →   │  │  constraints +    │  │  │
│  │  │  SQL gen →  │  │  semi-naive   │  │  async pipeline   │  │  │
│  │  │  SPI exec → │  │  fixpoint     │  │                    │  │  │
│  │  │  decode     │  │               │  │                    │  │  │
│  │  └──────┬───────┘  └───────┬───────┘  └────────┬───────────┘  │  │
│  │         │                  │                    │              │  │
│  │         ▼                  ▼                    ▼              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │              Dictionary Encoder (XXH3-128)               │  │  │
│  │  │    IRI / Blank Node / Literal  ──→  i64 identifier       │  │  │
│  │  │         Shared-Memory LRU Cache (64 shards)              │  │  │
│  │  └────────────────────────┬────────────────────────────────┘  │  │
│  │                           │                                    │  │
│  │                           ▼                                    │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │                VP Storage Engine (HTAP)                  │  │  │
│  │  │                                                         │  │  │
│  │  │   vp_{id}_delta  ──┐                                    │  │  │
│  │  │   (write inbox)    │                                    │  │  │
│  │  │                    ├──→  vp_{id} (read view)            │  │  │
│  │  │   vp_{id}_main   ──┤    = (main − tombstones)          │  │  │
│  │  │   (BRIN archive)   │      UNION ALL delta               │  │  │
│  │  │                    │                                    │  │  │
│  │  │   vp_{id}_tombstones                                    │  │  │
│  │  │   (pending deletes) │                                    │  │  │
│  │  │                    │                                    │  │  │
│  │  │   vp_rare ─────────┘  (consolidated rare predicates)    │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  │                           │                                    │  │
│  │  ┌────────────────────────┴────────────────────────────────┐  │  │
│  │  │           Background Merge Worker (BGW)                  │  │  │
│  │  │   delta + main − tombstones ──→ new main (BRIN)          │  │  │
│  │  │   Polling interval: merge_interval_secs (default 60s)   │  │  │
│  │  │   Threshold: merge_threshold (default 10,000 rows)       │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │  _pg_ripple schema: dictionary, predicates, vp_*, statements    │ │
│  │  pg_ripple schema:  public SQL functions (sparql, insert, etc.) │ │
│  └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘

Dictionary Encoder

The dictionary encoder is the foundation of pg_ripple's storage model. Every RDF term — IRI, blank node, plain literal, typed literal, or language-tagged literal — is mapped to a compact i64 identifier before being stored.

How Encoding Works

  1. The input term is classified by kind: IRI (0), blank node (1), literal (2), typed literal (3), or language-tagged literal (4).
  2. The kind discriminant is mixed into the hash input as two little-endian bytes, so the same string encoded as an IRI and as a blank node always produces distinct dictionary rows.
  3. An XXH3-128 hash is computed over (kind_le_bytes || term_utf8).
  4. The full 16-byte hash is stored in the _pg_ripple.dictionary table with an ON CONFLICT (hash) DO NOTHING upsert. The dense i64 join key is an IDENTITY-generated column — sequential and independent of the hash.
  5. The resulting i64 is used in all VP table columns.
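The hash-input construction in steps 2 and 3 can be sketched in Python. The standard library has no XXH3-128, so BLAKE2b truncated to 16 bytes stands in here purely to show the (kind bytes || term) layout; the extension itself uses XXH3-128:

```python
import hashlib
import struct

# Kind discriminants from step 1.
KIND = {"iri": 0, "bnode": 1, "literal": 2, "typed": 3, "lang": 4}

def dictionary_hash(kind: str, term: str) -> bytes:
    """Mimic the hash-input construction: two little-endian kind bytes
    prepended to the UTF-8 term.  Stand-in hash: BLAKE2b-128, NOT the
    XXH3-128 pg_ripple actually uses."""
    kind_le = struct.pack("<H", KIND[kind])  # two little-endian bytes
    return hashlib.blake2b(kind_le + term.encode("utf-8"),
                           digest_size=16).digest()
```

Because the kind bytes are part of the hash input, the same string encoded as an IRI and as a blank node yields two distinct dictionary rows, exactly as step 2 requires.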

Why integer encoding?

VP tables never contain raw strings. All joins, comparisons, and index lookups operate on i64 values. This eliminates collation overhead, reduces storage by 5–20x, and makes B-tree index scans uniformly fast regardless of IRI length.

Shared-Memory Cache

The dictionary cache sits in PostgreSQL shared memory (allocated at postmaster start) and is organized as a 64-shard set-associative structure. Each backend reads and writes to the shared cache through atomic operations — no per-backend duplication.

Key parameters:

  • pg_ripple.dictionary_cache_size — Number of cache entries (default: 65,536). Requires restart.
  • pg_ripple.cache_budget — Memory budget cap in MB (default: 64). Bulk loads throttle at 90% utilization.

The cache hit ratio is reported by pg_ripple.stats() and should stay above 95% in production.


VP (Vertical Partitioning) Tables

pg_ripple uses vertical partitioning: one physical table per unique predicate. This is the storage model used by research systems like SW-Store and column-oriented triple stores.

Table Layout

Each predicate with at least vp_promotion_threshold (default: 1,000) triples gets a dedicated VP table:

-- Columns in every VP table
s      BIGINT  NOT NULL   -- subject dictionary ID
o      BIGINT  NOT NULL   -- object dictionary ID
g      BIGINT  NOT NULL DEFAULT 0  -- graph ID (0 = default graph)
i      BIGINT  NOT NULL DEFAULT nextval('statement_id_seq')  -- unique SID
source SMALLINT NOT NULL DEFAULT 0  -- 0 = explicit, 1 = inferred

Dual B-tree indexes on (s, o) and (o, s) support both subject-to-object and object-to-subject access patterns.

Rare Predicate Consolidation

Predicates with fewer triples than the promotion threshold are stored in a shared _pg_ripple.vp_rare table with an extra p BIGINT column. This avoids schema bloat for infrequent predicates. When a rare predicate's count crosses the threshold, it is automatically promoted to a dedicated VP table.

Predicate Catalog

The _pg_ripple.predicates table maps each predicate ID to its VP table OID and current triple count:

SELECT id, table_oid, triple_count
FROM _pg_ripple.predicates
ORDER BY triple_count DESC;

No dynamic SQL string concatenation

The SPARQL-to-SQL translator never concatenates table names into SQL strings. It looks up the OID in _pg_ripple.predicates and uses parameterized queries with format-safe quoting. This prevents SQL injection by design.


HTAP Storage Architecture

Since v0.6.0, pg_ripple uses an HTAP (Hybrid Transactional/Analytical Processing) storage architecture that separates write and read paths for each VP table.

Three-Table Split

For each predicate, the storage layer maintains:

Table | Purpose | Index Type
vp_{id}_delta | Write inbox — all INSERTs land here | B-tree on (s, o)
vp_{id}_main | Read-optimized archive | BRIN (block range)
vp_{id}_tombstones | Pending deletes from main | B-tree on (s, o, g)

A read view vp_{id} combines them:

(main EXCEPT tombstones) UNION ALL delta
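For a hypothetical predicate ID 42, the read view definition would look roughly like this (names illustrative, not the exact DDL the extension emits):

```sql
-- Sketch of the combined read view for predicate 42.
CREATE VIEW _pg_ripple.vp_42 AS
(SELECT s, o, g FROM _pg_ripple.vp_42_main
 EXCEPT
 SELECT s, o, g FROM _pg_ripple.vp_42_tombstones)
UNION ALL
SELECT s, o, g FROM _pg_ripple.vp_42_delta;
```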

Background Merge Worker

The merge worker is a PostgreSQL background worker (BGWorker) that runs in a polling loop:

  1. Poll — Wake every merge_interval_secs (default: 60) or when poked by the write-path latch.
  2. Scan — Check each HTAP predicate's delta row count against merge_threshold (default: 10,000).
  3. Merge — For qualifying predicates: create a new main table from (old_main − tombstones) UNION ALL delta, swap atomically via ALTER TABLE ... RENAME, drop the old main after merge_retention_seconds.
  4. Maintain — Rebuild subject/object pattern tables, promote rare predicates that crossed the threshold, run ANALYZE on new main tables (when auto_analyze is on), evict expired federation cache entries.
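Steps 1 and 2 reduce to a per-predicate threshold check. A minimal sketch of that decision, with delta row counts supplied as input (the largest-backlog-first ordering is an assumption, not documented behavior):

```python
def predicates_to_merge(delta_counts: dict, merge_threshold: int = 10_000) -> list:
    """Return predicate IDs whose delta row count has reached the merge
    threshold, largest backlog first."""
    due = [pid for pid, n in delta_counts.items() if n >= merge_threshold]
    return sorted(due, key=lambda pid: delta_counts[pid], reverse=True)
```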

Write path

Writers never block on the merge. All INSERTs go directly to the delta table (heap + B-tree). The merge worker operates asynchronously and uses PostgreSQL's MVCC for isolation.


SPARQL Query Execution Flow

When a client calls pg_ripple.sparql('SELECT ...'), the query goes through five stages:

1. Parse

The SPARQL text is parsed by the spargebra crate into an algebraic representation. This handles the full SPARQL 1.1 grammar: SELECT, CONSTRUCT, DESCRIBE, ASK, property paths, subqueries, aggregation, federation (SERVICE), and SPARQL-star.

2. Optimize

The sparopt optimizer rewrites the algebra tree:

  • BGP reordering — Triple patterns are sorted by estimated selectivity (smallest VP table first) when bgp_reorder is on.
  • Filter pushdown — FILTER constants are encoded to i64 at translation time and pushed into the WHERE clause of the generated SQL.
  • Self-join elimination — Star patterns (same subject, multiple predicates) are collapsed into multi-way joins instead of redundant subqueries.
  • SHACL hints — If sh:maxCount 1 is declared, DISTINCT is omitted; if sh:minCount 1, LEFT JOIN is upgraded to INNER JOIN.

3. Generate SQL

The optimized algebra is compiled into PostgreSQL SQL:

  • Each triple pattern becomes a scan of the corresponding VP table (or vp_rare with a predicate filter).
  • Joins between patterns become SQL JOIN clauses with i64 equality predicates.
  • Property paths compile to WITH RECURSIVE ... CYCLE using PostgreSQL 18's hash-based cycle detection.
  • SERVICE clauses are compiled into HTTP calls to remote SPARQL endpoints.
  • Aggregates, ORDER BY, LIMIT, and OFFSET translate directly to their SQL equivalents.
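As an illustration, a one-or-more-hops property path over a hypothetical VP table vp_17 could compile to something like the following (a sketch, not the exact generated SQL):

```sql
-- Sketch: foaf:knows+ compiled to a recursive CTE with cycle detection.
WITH RECURSIVE reach(s, o) AS (
    SELECT s, o FROM _pg_ripple.vp_17          -- one hop
    UNION ALL
    SELECT r.s, v.o                            -- extend by one hop
    FROM reach r
    JOIN _pg_ripple.vp_17 v ON v.s = r.o
) CYCLE o SET is_cycle USING visited
SELECT DISTINCT s, o FROM reach WHERE NOT is_cycle;
```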

4. SPI Execute

The generated SQL is executed through PostgreSQL's Server Programming Interface (SPI). Results are arrays of i64 dictionary IDs.

The plan cache (plan_cache_size, default: 256) stores compiled SQL for recently-seen SPARQL queries to avoid repeated parse/optimize/generate cycles.
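The behavior of such a bounded plan cache can be illustrated with an LRU keyed on the SPARQL text (a sketch, not the extension's implementation):

```python
from collections import OrderedDict

class PlanCache:
    """LRU cache mapping SPARQL text to compiled SQL, mirroring the
    bounded plan cache described above (default capacity 256)."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._plans = OrderedDict()

    def get_or_compile(self, sparql: str, compile_fn) -> str:
        if sparql in self._plans:
            self._plans.move_to_end(sparql)      # mark as recently used
            return self._plans[sparql]
        sql = compile_fn(sparql)                 # parse -> optimize -> generate
        self._plans[sparql] = sql
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)      # evict least recently used
        return sql
```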

5. Decode

The i64 result columns are decoded back to human-readable RDF terms (IRIs, literals, blank nodes) using the dictionary. The shared-memory cache accelerates this step — a cache hit avoids a dictionary table lookup per value.

Integer joins everywhere

The SPARQL engine encodes all bound constants to i64 before generating SQL, and decodes results after execution. VP table queries never contain string comparisons — this is a hard architectural invariant.


Schema Organization

pg_ripple uses two PostgreSQL schemas:

Schema | Contents | Visibility
pg_ripple | Public SQL functions (sparql(), insert_triple(), stats(), etc.) | User-facing
_pg_ripple | Dictionary table, predicates catalog, VP tables, statement mappings, internal state | Internal

Do not modify _pg_ripple directly

The internal schema is managed by the extension. Direct modifications to _pg_ripple tables can corrupt the dictionary or break VP table invariants.


Subsystem Summary

Subsystem | Source Directory | Purpose
Dictionary | src/dictionary/ | Term ↔ i64 encoding with XXH3-128
Storage | src/storage/ | VP tables, HTAP partitions, rare predicate consolidation
SPARQL | src/sparql/ | Parse → optimize → SQL generation → SPI → decode
Datalog | src/datalog/ | Rule parsing, stratification, semi-naive fixpoint, magic sets
SHACL | src/shacl/ | Shape validation, DDL constraints, async pipeline
Export | src/export/ | Turtle, N-Triples, JSON-LD serialization
Worker | src/worker.rs | Background merge worker, embedding queue, SHACL async
Stats | src/stats/ | Monitoring, cache metrics, health checks
Federation | src/sparql/federation | Remote SERVICE call execution, connection pooling, caching
HTTP | pg_ripple_http/ | SPARQL Protocol endpoint (standalone companion service)

Deployment Models

pg_ripple runs as a PostgreSQL 18 extension. It can be deployed in any environment that supports PostgreSQL 18 with extension loading. This page covers the three primary deployment models and provides production-ready configuration examples.


Deployment Options at a Glance

| Model | Best For | Complexity | SPARQL Protocol |
|---|---|---|---|
| Standalone PostgreSQL | Production, existing PG infrastructure | Low | Via pg_ripple_http sidecar |
| Docker / Compose | Evaluation, CI/CD, small deployments | Low | Built-in |
| Managed PostgreSQL | Cloud-native, minimal ops | Medium | Via pg_ripple_http sidecar |

Recommendation

Use Docker Compose for evaluation and development. Use a dedicated PostgreSQL 18 instance for production workloads — this gives full control over shared memory, background workers, and storage configuration.


Model 1: Standalone PostgreSQL

Install pg_ripple into a standard PostgreSQL 18 instance. This is the recommended production deployment.

Prerequisites

  • PostgreSQL 18.x installed from packages or source
  • Rust toolchain (for building from source) or a pre-built .so/.dylib
  • pgrx 0.17 (if building from source)

Installation

# Build and install from source
cargo pgrx install --pg-config $(which pg_config) --release

# Or if using a specific PG18 binary
cargo pgrx install --pg-config /usr/lib/postgresql/18/bin/pg_config --release

PostgreSQL Configuration

Add to postgresql.conf:

# Required: load pg_ripple at server start for background workers and shared memory
shared_preload_libraries = 'pg_ripple'

# Shared memory for dictionary cache (adjust for your dataset)
pg_ripple.dictionary_cache_size = 65536   # 64K entries
pg_ripple.cache_budget = 64               # MB (default)

# HTAP merge worker
pg_ripple.merge_threshold = 10000
pg_ripple.merge_interval_secs = 60
pg_ripple.worker_database = 'mydb'        # database the merge worker connects to

Enable the Extension

CREATE EXTENSION pg_ripple;

-- Verify installation
SELECT pg_ripple.stats();

Add SPARQL Protocol Endpoint

The SPARQL Protocol HTTP endpoint is provided by pg_ripple_http, a standalone companion service:

# Build the HTTP service
cd pg_ripple_http
cargo build --release

# Run it
PG_RIPPLE_HTTP_PG_URL="postgresql://user:pass@localhost/mydb" \
PG_RIPPLE_HTTP_PORT=7878 \
./target/release/pg_ripple_http

pg_ripple_http is optional

You can use pg_ripple entirely through SQL — pg_ripple.sparql(), pg_ripple.insert_triple(), etc. The HTTP service adds W3C SPARQL Protocol compatibility for tools like Yasgui, RDF4J, or federated queries from other endpoints.


Model 2: Docker / Docker Compose

The Docker deployment bundles PostgreSQL 18, pg_ripple, and pg_ripple_http into containers managed by Docker Compose. This is the fastest way to get started.

docker-compose.yml

# Docker Compose for pg_ripple with SPARQL Protocol HTTP endpoint.
#
# Usage:
#   docker compose up -d
#   curl http://localhost:7878/health
#   curl -G http://localhost:7878/sparql \
#     --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"

services:
  postgres:
    build: .
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: ripple
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  sparql:
    build: .
    entrypoint: ["/usr/local/bin/pg_ripple_http"]
    ports:
      - "7878:7878"
    environment:
      PG_RIPPLE_HTTP_PG_URL: "postgresql://postgres:ripple@postgres/postgres"
      PG_RIPPLE_HTTP_PORT: "7878"
      PG_RIPPLE_HTTP_POOL_SIZE: "8"
      PG_RIPPLE_HTTP_CORS_ORIGINS: "*"
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  pgdata:

Starting the Stack

docker compose up -d

# Wait for health check
docker compose ps

# Test SPARQL endpoint
curl http://localhost:7878/health

# Run a query
curl -G http://localhost:7878/sparql \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 5"

Loading Data via Docker

# Copy a Turtle file into the container and load it
docker compose cp data.ttl postgres:/tmp/data.ttl
docker compose exec postgres psql -U postgres -c \
  "SELECT pg_ripple.load_turtle_file('/tmp/data.ttl');"

# Or load inline
docker compose exec postgres psql -U postgres -c \
  "SELECT pg_ripple.load_turtle('@prefix ex: <http://example.org/> .
    ex:Alice ex:knows ex:Bob .
    ex:Bob ex:age \"30\"^^<http://www.w3.org/2001/XMLSchema#integer> .');"

Production Hardening for Docker

For production Docker deployments, add resource limits and persistent configuration:

services:
  postgres:
    build: .
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: ${PG_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql/postgresql.conf
    command: postgres -c config_file=/etc/postgresql/postgresql.conf
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  sparql:
    build: .
    entrypoint: ["/usr/local/bin/pg_ripple_http"]
    ports:
      - "7878:7878"
    environment:
      PG_RIPPLE_HTTP_PG_URL: "postgresql://postgres:${PG_PASSWORD}@postgres/postgres"
      PG_RIPPLE_HTTP_PORT: "7878"
      PG_RIPPLE_HTTP_POOL_SIZE: "16"
      PG_RIPPLE_HTTP_CORS_ORIGINS: "https://yourdomain.com"
      PG_RIPPLE_HTTP_AUTH_TOKEN: ${SPARQL_AUTH_TOKEN}
    depends_on:
      postgres:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

Security

Never use default passwords in production. Set POSTGRES_PASSWORD and PG_RIPPLE_HTTP_AUTH_TOKEN via environment variables or Docker secrets. Restrict PG_RIPPLE_HTTP_CORS_ORIGINS to your actual domain.


Model 3: Managed PostgreSQL Services

pg_ripple can run on managed PostgreSQL services that support custom extensions and PostgreSQL 18. The key requirements are:

  1. PostgreSQL 18 — pg_ripple uses PG18-specific features (e.g., WITH RECURSIVE ... CYCLE).
  2. Custom extension loading — The service must allow installing .so extensions and adding to shared_preload_libraries.
  3. Shared memory access — Required for the dictionary cache and merge worker.

Supported Managed Services

| Service | Custom Extensions | shared_preload_libraries | Status |
|---|---|---|---|
| AWS RDS for PostgreSQL | Yes (via custom builds) | Yes | Supported with custom AMI |
| Azure Database for PostgreSQL Flexible Server | Yes | Yes | Supported |
| Google Cloud SQL | Limited | Limited | Partial support |
| Self-managed on EC2/GCE/Azure VM | Full control | Full control | Fully supported |

Cloud VM recommendation

For managed cloud deployments, running PostgreSQL 18 on a cloud VM (EC2, GCE, Azure VM) with the extension installed gives full control and avoids managed service limitations. Use the managed service's block storage for durability and snapshots for backups.

Managed Service Configuration

When running on a managed service:

# Add to the PostgreSQL parameter group / configuration
shared_preload_libraries = 'pg_ripple'

# Shared memory — managed services often cap this; start conservative
pg_ripple.dictionary_cache_size = 32768
pg_ripple.cache_budget = 32

# Merge worker targets the primary database
pg_ripple.worker_database = 'mydb'

pg_ripple_http as a Sidecar

On managed services, run pg_ripple_http as a sidecar container or systemd service:

# Kubernetes sidecar example
PG_RIPPLE_HTTP_PG_URL="postgresql://user:pass@pg-host:5432/mydb" \
PG_RIPPLE_HTTP_PORT=7878 \
PG_RIPPLE_HTTP_POOL_SIZE=16 \
pg_ripple_http

pg_ripple_http Configuration Reference

The HTTP companion service is configured entirely through environment variables:

| Variable | Default | Description |
|---|---|---|
| PG_RIPPLE_HTTP_PG_URL | (required) | PostgreSQL connection string |
| PG_RIPPLE_HTTP_PORT | 7878 | HTTP listen port |
| PG_RIPPLE_HTTP_POOL_SIZE | 8 | Connection pool size |
| PG_RIPPLE_HTTP_CORS_ORIGINS | * | Allowed CORS origins |
| PG_RIPPLE_HTTP_AUTH_TOKEN | (none) | Bearer token for authentication |
| PG_RIPPLE_HTTP_RATE_LIMIT | 0 | Requests per second (0 = unlimited) |

Endpoints

| Path | Method | Description |
|---|---|---|
| /sparql | GET, POST | SPARQL Protocol query/update endpoint |
| /health | GET | Health check (returns 200 if PG connection is live) |
| /metrics | GET | Prometheus-compatible metrics |

Network Architecture

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
              ┌────────────┴────────────┐
              │                         │
              ▼                         ▼
      ┌──────────────────┐      ┌──────────────────┐
     │  pg_ripple_http  │      │  psql / JDBC /   │
     │  :7878           │      │  application     │
     │  (SPARQL Proto)  │      │  (:5432)         │
     └────────┬─────────┘      └────────┬─────────┘
              │                         │
              └────────────┬────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │  PostgreSQL 18          │
              │  + pg_ripple extension  │
              │  + merge worker (BGW)   │
              └─────────────────────────┘

Read replicas

For read-heavy workloads, PostgreSQL streaming replication works out of the box. Read replicas receive all VP table changes through WAL. Point read-only SPARQL queries to replicas via a separate pg_ripple_http instance connected to the replica.


Post-Deployment Verification

After deploying pg_ripple, verify the installation:

-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

-- Verify stats (confirms shared memory and merge worker)
SELECT pg_ripple.stats();

-- Run a health check
SELECT pg_ripple.canary();

-- Insert and query a test triple
SELECT pg_ripple.insert_triple(
    '<http://example.org/test>',
    '<http://example.org/status>',
    '"deployed"'
);

SELECT * FROM pg_ripple.sparql('
    SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1
');

Healthy deployment checklist

  • stats() returns merge_worker_pid > 0
  • canary() shows merge_worker: "ok" and catalog_consistent: true
  • encode_cache_hits / (hits + misses) > 0.90 after initial data load
  • SPARQL queries return results

Configuration and Tuning

pg_ripple exposes its configuration through PostgreSQL GUC (Grand Unified Configuration) parameters. All parameters use the pg_ripple. prefix and can be set in postgresql.conf, via ALTER SYSTEM, or per-session with SET.

Restart requirements

Parameters marked Postmaster require a PostgreSQL restart. Parameters marked SIGHUP can be reloaded with SELECT pg_reload_conf(). All others can be changed per-session with SET.


Storage Parameters

Control how triples are stored in VP tables and the rare-predicate consolidation table.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| vp_promotion_threshold | int | 1000 | 10 – 10,000,000 | Userset | Minimum triples before a predicate gets a dedicated VP table. Below this, triples go to vp_rare. |
| named_graph_optimized | bool | off | — | Userset | Adds a (g, s, o) index per VP table. Speeds up GRAPH queries but increases write overhead. |
| default_graph | text | '' | Any IRI | Userset | IRI used as the default graph when g is not specified on insert. |
| dedup_on_merge | bool | off | — | Userset | When on, the merge worker deduplicates (s, o, g) rows, keeping the lowest SID. |

HTAP / Merge Worker Parameters

Control the delta/main split and background merge behavior. These take effect only when pg_ripple is loaded via shared_preload_libraries.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| merge_threshold | int | 10000 | 1 – 2,147,483,647 | SIGHUP | Delta row count that triggers a merge for a predicate. Lower = fresher reads, more I/O. |
| merge_interval_secs | int | 60 | 1 – 3600 | SIGHUP | Maximum seconds between merge worker poll cycles. |
| merge_retention_seconds | int | 60 | 0 – 86,400 | SIGHUP | Seconds to keep the old main table after a merge before dropping it. |
| latch_trigger_threshold | int | 10000 | 1 – 2,147,483,647 | SIGHUP | Rows written in a batch before poking the merge worker latch immediately. |
| merge_watchdog_timeout | int | 300 | 10 – 86,400 | SIGHUP | Seconds of merge worker inactivity before logging a WARNING. |
| worker_database | text | 'postgres' | — | SIGHUP | Database the background merge worker connects to. |
| auto_analyze | bool | on | — | SIGHUP | Run ANALYZE on VP main tables after each merge cycle. |

Query Engine Parameters

Tune SPARQL-to-SQL translation and execution.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| plan_cache_size | int | 256 | 0 – 65,536 | Userset | Cached SPARQL→SQL translations per backend. 0 disables caching. |
| max_path_depth | int | 100 | 0 – 10,000 | Userset | Maximum recursion depth for property path queries (+, *). 0 = unlimited. |
| property_path_max_depth | int | 64 | 1 – 100,000 | Userset | Alternative property path depth limit (v0.24.0). |
| describe_strategy | text | 'cbd' | 'cbd', 'scbd', 'simple' | Userset | DESCRIBE algorithm: Concise Bounded Description, Symmetric CBD, or simple one-hop. |
| bgp_reorder | bool | on | — | Userset | Reorder BGP triple patterns by estimated selectivity before SQL generation. |
| parallel_query_min_joins | int | 3 | 1 – 100 | Userset | Minimum VP-table joins before enabling parallel query workers. |
| sparql_strict | bool | on | — | Userset | When on, unsupported FILTER functions raise an error; when off, they are silently dropped. |
| export_batch_size | int | 10000 | 100 – 1,000,000 | Userset | Triples per cursor batch during streaming export. |

Inference / Datalog Parameters

Control the Datalog reasoning engine, magic sets, and rule caching.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| inference_mode | text | 'off' | 'off', 'on_demand', 'materialized' | Userset | Datalog reasoning mode. 'materialized' requires pg_trickle. |
| enforce_constraints | text | 'off' | 'off', 'warn', 'error' | Userset | Behavior when Datalog constraint rules detect violations. |
| rule_graph_scope | text | 'default' | 'default', 'all' | Userset | Whether unscoped rule atoms operate on the default graph only or all graphs. |
| magic_sets | bool | on | — | Userset | Use magic sets for goal-directed inference in infer_goal(). |
| datalog_cost_reorder | bool | on | — | Userset | Sort rule body atoms by ascending VP-table cardinality before SQL compilation. |
| datalog_antijoin_threshold | int | 1000 | 0 – 10,000,000 | Userset | Minimum VP rows for NOT atoms to use LEFT JOIN anti-join form. |
| delta_index_threshold | int | 500 | 0 – 10,000,000 | Userset | Minimum semi-naive delta rows before creating a B-tree index. |
| demand_transform | bool | on | — | Userset | Auto-apply demand transformation when multiple goal patterns are specified. |
| sameas_reasoning | bool | on | — | Userset | Apply owl:sameAs canonicalization pre-pass during inference. |
| rule_plan_cache | bool | on | — | Userset | Cache compiled SQL for each rule set. Invalidated by drop_rules() and load_rules(). |
| rule_plan_cache_size | int | 64 | 1 – 4,096 | Userset | Maximum rule sets in the plan cache. |

Well-Founded Semantics / Tabling Parameters

Control WFS evaluation and tabling cache (v0.32.0).

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| wfs_max_iterations | int | 100 | 1 – 10,000 | Userset | Safety cap on alternating fixpoint rounds per WFS pass. Emits PT520 WARNING if not converged. |
| tabling | bool | on | — | Userset | Cache infer_wfs() and SPARQL results in _pg_ripple.tabling_cache. |
| tabling_ttl | int | 300 | 0 – 86,400 | Userset | TTL in seconds for tabling cache entries. 0 disables TTL-based expiry. |

SHACL Validation Parameters

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| shacl_mode | text | 'off' | 'off', 'sync', 'async' | Userset | 'sync' rejects violations inline; 'async' queues for background validation. |

Federation Parameters

Control remote SPARQL endpoint calls via the SERVICE keyword.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| federation_timeout | int | 30 | 1 – 3,600 | Userset | Per-SERVICE call wall-clock timeout in seconds. |
| federation_max_results | int | 10000 | 1 – 1,000,000 | Userset | Maximum rows accepted from a single remote call. |
| federation_on_error | text | 'warning' | 'warning', 'error', 'empty' | Userset | Behavior on SERVICE call failure. |
| federation_pool_size | int | 4 | 1 – 32 | Userset | Idle HTTP connections per endpoint host. |
| federation_cache_ttl | int | 0 | 0 – 86,400 | Userset | Remote result cache TTL in seconds. 0 disables caching. |
| federation_on_partial | text | 'empty' | 'empty', 'use' | Userset | Behavior on mid-stream SERVICE failure. |
| federation_adaptive_timeout | bool | off | — | Userset | Derive per-endpoint timeout from P95 latency. |

Shared Memory Parameters (Startup Only)

These must be set in postgresql.conf before PostgreSQL starts. They cannot be changed at runtime.

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| dictionary_cache_size | int | 4096 | 0 – 1,000,000 | Postmaster | Shared-memory encode cache capacity in entries. |
| cache_budget | int | 64 | 0 – 65,536 | Postmaster | Shared-memory budget cap in MB. Bulk loads throttle at 90% utilization. |

Startup GUCs require restart

Changes to dictionary_cache_size and cache_budget require a full PostgreSQL restart. Plan your cache sizing before deploying to production.
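When planning that sizing, a rough rule of thumb can be derived from the documented configuration examples, which pair 65536 entries with a 64 MB budget — roughly 1 KiB per entry. A hedged Python sketch (the per-entry figure is an assumption inferred from those examples, not a published constant; measure on your own workload):

```python
def dictionary_cache_settings(expected_hot_terms, bytes_per_entry=1024):
    """Rough sizing helper for the shared-memory encode cache.

    Assumes ~1 KiB per cache entry, inferred from documented example
    configurations (65536 entries within a 64 MB budget). Verify against
    your own workload before relying on these numbers.
    """
    budget_mb = max(1, (expected_hot_terms * bytes_per_entry) // (1024 * 1024))
    return {
        "pg_ripple.dictionary_cache_size": expected_hot_terms,
        "pg_ripple.cache_budget": budget_mb,
    }

# Size the cache for ~128K frequently-encoded terms
settings = dictionary_cache_settings(131072)
```

For 131072 hot terms this yields a 128 MB budget, matching the medium-dataset quick-start configuration below.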


Security Parameters

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| rls_bypass | bool | off | — | Suset | Superuser override to bypass graph-level Row-Level Security. |

Vector / Embedding Parameters

| Parameter | Type | Default | Range | Context | Description |
|---|---|---|---|---|---|
| embedding_model | text | '' | — | Userset | Model name tag stored in _pg_ripple.embeddings. |
| embedding_dimensions | int | 1536 | 1 – 16,000 | Userset | Vector dimension count. Must match model output. |
| embedding_api_url | text | '' | — | Userset | Base URL for OpenAI-compatible embedding API. |
| embedding_api_key | text | '' | — | Suset | API key (superuser-only, masked in pg_settings). |
| pgvector_enabled | bool | on | — | Userset | Disable pgvector code paths without uninstalling. |
| embedding_index_type | text | 'hnsw' | 'hnsw', 'ivfflat' | Userset | Index type on embeddings table. |
| embedding_precision | text | 'single' | 'single', 'half', 'binary' | Userset | Storage precision for embedding vectors. |
| auto_embed | bool | off | — | Userset | Auto-embed new entities via background worker. |
| embedding_batch_size | int | 100 | 1 – 10,000 | Userset | Entities dequeued per background worker batch. |

Quick-Start Configurations

Small Dataset (< 1M triples)

Suitable for development, prototyping, or small knowledge graphs:

# postgresql.conf
shared_preload_libraries = 'pg_ripple'

# Dictionary cache — small footprint
pg_ripple.dictionary_cache_size = 8192
pg_ripple.cache_budget = 16

# Merge worker — merge early for fresh reads
pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 30

# Query engine
pg_ripple.plan_cache_size = 64
pg_ripple.max_path_depth = 50

Medium Dataset (1M – 100M triples)

Production workloads with moderate query complexity:

# postgresql.conf
shared_preload_libraries = 'pg_ripple'

# Dictionary cache — larger cache for better hit rates
pg_ripple.dictionary_cache_size = 131072
pg_ripple.cache_budget = 128

# Merge worker — balance freshness and I/O
pg_ripple.merge_threshold = 50000
pg_ripple.merge_interval_secs = 60
pg_ripple.latch_trigger_threshold = 20000
pg_ripple.auto_analyze = on

# Query engine — larger plan cache for diverse queries
pg_ripple.plan_cache_size = 512
pg_ripple.max_path_depth = 100
pg_ripple.bgp_reorder = on

# Inference (if used)
pg_ripple.inference_mode = 'on_demand'
pg_ripple.magic_sets = on

Large Dataset (> 100M triples)

High-throughput production with heavy query loads:

# postgresql.conf
shared_preload_libraries = 'pg_ripple'

# Dictionary cache — maximize cache coverage
pg_ripple.dictionary_cache_size = 500000
pg_ripple.cache_budget = 512

# Merge worker — batch larger merges, reduce churn
pg_ripple.merge_threshold = 200000
pg_ripple.merge_interval_secs = 120
pg_ripple.latch_trigger_threshold = 100000
pg_ripple.merge_retention_seconds = 120
pg_ripple.auto_analyze = on

# Query engine — large plan cache, parallel queries
pg_ripple.plan_cache_size = 2048
pg_ripple.max_path_depth = 200
pg_ripple.bgp_reorder = on
pg_ripple.parallel_query_min_joins = 2

# Named graph optimization (if heavy GRAPH usage)
pg_ripple.named_graph_optimized = on

# Inference
pg_ripple.inference_mode = 'on_demand'
pg_ripple.magic_sets = on
pg_ripple.rule_plan_cache = on
pg_ripple.rule_plan_cache_size = 256

# Tabling cache for repeated inference patterns
pg_ripple.tabling = on
pg_ripple.tabling_ttl = 600

# Federation (if used)
pg_ripple.federation_timeout = 60
pg_ripple.federation_pool_size = 8
pg_ripple.federation_cache_ttl = 300

PostgreSQL tuning

Don't forget to tune PostgreSQL itself alongside pg_ripple. Key PostgreSQL parameters for triple store workloads:

  • shared_buffers = 25% of RAM
  • effective_cache_size = 75% of RAM
  • work_mem = 64MB–256MB (for complex joins)
  • maintenance_work_mem = 512MB–1GB (for merge ANALYZE)
  • random_page_cost = 1.1 (if using SSDs)
  • max_parallel_workers_per_gather = 4
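For a 16 GB instance, those guidelines translate to a postgresql.conf fragment along these lines (starting points, not tuned numbers — adjust for your hardware and workload):

```
# postgresql.conf — PostgreSQL-level starting points for a 16 GB instance
shared_buffers = 4GB                    # ~25% of RAM
effective_cache_size = 12GB             # ~75% of RAM
work_mem = 64MB                         # per sort/hash node; raise for complex joins
maintenance_work_mem = 512MB            # speeds up post-merge ANALYZE
random_page_cost = 1.1                  # SSD storage
max_parallel_workers_per_gather = 4
```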

Monitoring and Observability

pg_ripple provides built-in monitoring through SQL functions, PostgreSQL's standard statistics infrastructure, and Prometheus-compatible metrics via pg_ripple_http. This page explains what to monitor, how to collect the data, and what thresholds indicate a healthy system.


pg_ripple.stats()

The primary monitoring function. Returns a JSONB object with key metrics:

SELECT pg_ripple.stats();

Output Fields

| Field | Type | Description |
|---|---|---|
| total_triples | int | Total triple count across all graphs (including delta rows not yet merged) |
| dedicated_predicates | int | Number of predicates with their own VP table |
| htap_predicates | int | Number of predicates using the HTAP delta/main split |
| rare_triples | int | Triples stored in the consolidated vp_rare table |
| unmerged_delta_rows | int | Total rows across all delta tables (from shared memory counter); -1 if shared memory is not available |
| merge_worker_pid | int | PID of the background merge worker; 0 if not running |
| live_statistics_enabled | bool | Whether pg_trickle live statistics are active |
| encode_cache_capacity | int | Total entries the shared encode cache can hold |
| encode_cache_utilization_pct | int | Percentage of cache slots currently in use |
| encode_cache_hits | int | Cumulative cache hit count since server start |
| encode_cache_misses | int | Cumulative cache miss count since server start |
| encode_cache_evictions | int | Cumulative eviction count |

Example Output

{
  "total_triples": 4523891,
  "dedicated_predicates": 127,
  "htap_predicates": 127,
  "rare_triples": 2341,
  "unmerged_delta_rows": 8432,
  "merge_worker_pid": 12345,
  "live_statistics_enabled": false,
  "encode_cache_capacity": 65536,
  "encode_cache_utilization_pct": 72,
  "encode_cache_hits": 18934521,
  "encode_cache_misses": 234012,
  "encode_cache_evictions": 45123
}

Computing the Cache Hit Rate

SELECT
    (s->>'encode_cache_hits')::bigint AS hits,
    (s->>'encode_cache_misses')::bigint AS misses,
    ROUND(
        (s->>'encode_cache_hits')::numeric /
        NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
        4
    ) AS hit_rate
FROM pg_ripple.stats() s;

Cache hit rate threshold

A healthy system should maintain a cache hit rate above 95% (0.95). If it drops below 90%, increase pg_ripple.dictionary_cache_size and restart PostgreSQL. Sustained rates below 80% indicate the working set significantly exceeds cache capacity.
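The same computation is easy to replicate in application code. Given the stats() JSONB parsed into a dict, this sketch classifies the hit rate against the thresholds above (the function name and status labels are illustrative):

```python
def cache_hit_status(stats):
    """Classify the encode-cache hit rate per the documented thresholds:
    > 95% healthy, 90-95% worth watching, < 90% undersized."""
    hits = stats["encode_cache_hits"]
    misses = stats["encode_cache_misses"]
    total = hits + misses
    rate = hits / total if total else 0.0
    if rate > 0.95:
        status = "healthy"
    elif rate >= 0.90:
        status = "watch"          # consider growing dictionary_cache_size
    else:
        status = "undersized"     # working set exceeds cache capacity
    return rate, status

# Using the example stats() output shown earlier
rate, status = cache_hit_status(
    {"encode_cache_hits": 18934521, "encode_cache_misses": 234012}
)
```

With the example output above, the rate works out to roughly 0.988 — comfortably healthy.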


pg_ripple.canary()

A health check function that returns a JSONB object with pass/fail indicators:

SELECT pg_ripple.canary();

Output Fields

| Field | Type | Healthy Value | Description |
|---|---|---|---|
| merge_worker | text | "ok" | "ok" if merge worker PID is in shared memory; "stalled" otherwise |
| cache_hit_rate | float | > 0.95 | Dictionary encode cache hit rate (0.0–1.0) |
| catalog_consistent | bool | true | VP table count in pg_class matches promoted predicates |
| orphaned_rare_rows | int | 0 | vp_rare rows for predicates that already have dedicated VP tables |

Interpreting Results

SELECT
    c->>'merge_worker' AS worker,
    (c->>'cache_hit_rate')::float AS hit_rate,
    (c->>'catalog_consistent')::bool AS catalog_ok,
    (c->>'orphaned_rare_rows')::int AS orphaned
FROM pg_ripple.canary() c;

Use canary() for automated health checks

canary() is designed for load balancer health checks and monitoring systems. Call it periodically and alert when merge_worker = 'stalled', cache_hit_rate < 0.90, or catalog_consistent = false.


SPARQL Query Analysis with sparql_explain()

Analyze SPARQL query performance using the explain functions:

Basic SQL Generation

-- See the generated SQL without executing
SELECT pg_ripple.sparql_explain(
    'SELECT ?name WHERE { ?s <http://schema.org/name> ?name }',
    false
);

Full EXPLAIN ANALYZE

-- Execute and show timing + row counts
SELECT pg_ripple.sparql_explain(
    'SELECT ?name WHERE { ?s <http://schema.org/name> ?name }',
    true
);

explain_sparql() with Format Options

The explain_sparql() function (v0.23.0) provides more output formats:

-- Generated SQL only
SELECT pg_ripple.explain_sparql(
    'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
    'sql'
);

-- EXPLAIN ANALYZE as text (default)
SELECT pg_ripple.explain_sparql(
    'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
    'text'
);

-- EXPLAIN ANALYZE as JSON (for programmatic consumption)
SELECT pg_ripple.explain_sparql(
    'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
    'json'
);

-- SPARQL algebra tree (for debugging the optimizer)
SELECT pg_ripple.explain_sparql(
    'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
    'sparql_algebra'
);

What to look for in EXPLAIN output

  • Seq Scan on vp_rare — A predicate is not promoted yet. Consider lowering vp_promotion_threshold or loading more data.
  • Nested Loop with high row estimates — BGP reordering may not be optimal. Check bgp_reorder is on.
  • Recursive CTE with high loop count — Property path is deep. Check max_path_depth setting.
  • Sort + Unique — A DISTINCT that might be avoidable with SHACL sh:maxCount 1 hints.

pg_stat_statements Integration

pg_ripple generates standard SQL that is tracked by pg_stat_statements. This gives you deep visibility into the actual SQL performance:

-- Enable pg_stat_statements (if not already)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the slowest SPARQL-generated queries
SELECT
    calls,
    mean_exec_time::numeric(10,2) AS avg_ms,
    total_exec_time::numeric(10,2) AS total_ms,
    rows,
    LEFT(query, 120) AS query_prefix
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
ORDER BY mean_exec_time DESC
LIMIT 20;

Identifying Hot VP Tables

-- Which VP tables are scanned most?
SELECT
    regexp_matches(query, '_pg_ripple\.(vp_\d+)', 'g') AS vp_table,
    sum(calls) AS total_calls,
    sum(total_exec_time)::numeric(10,2) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
GROUP BY 1
ORDER BY total_ms DESC
LIMIT 10;

Prometheus Metrics (pg_ripple_http)

The pg_ripple_http companion service exposes Prometheus-compatible metrics at the /metrics endpoint:

curl http://localhost:7878/metrics

Available Metrics

| Metric | Type | Description |
|---|---|---|
| pg_ripple_http_queries_total | counter | Total SPARQL queries processed |
| pg_ripple_http_errors_total | counter | Total query errors |
| pg_ripple_http_query_duration_seconds_total | counter | Cumulative query execution time |

Prometheus Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'pg_ripple_http'
    scrape_interval: 15s
    static_configs:
      - targets: ['pg-ripple-http:7878']
    metrics_path: /metrics

Derived Metrics for Dashboards

Use PromQL to compute useful rates:

# Queries per second
rate(pg_ripple_http_queries_total[5m])

# Error rate
rate(pg_ripple_http_errors_total[5m]) / rate(pg_ripple_http_queries_total[5m])

# Average query latency
rate(pg_ripple_http_query_duration_seconds_total[5m]) / rate(pg_ripple_http_queries_total[5m])

Monitoring the Merge Worker

The background merge worker is critical for HTAP performance. Monitor it through multiple channels:

Shared Memory Status

-- Is the merge worker running?
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int AS pid;
-- Returns 0 if not running

Delta Table Sizes

-- Check delta accumulation per predicate.
-- Plain SQL cannot SELECT from a table name built at runtime, so use the
-- planner's row estimate from pg_class (refreshed by ANALYZE/autovacuum).
SELECT
    p.id AS predicate_id,
    d.value AS predicate_iri,
    p.triple_count,
    (SELECT c.reltuples::bigint
     FROM pg_class c
     JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = '_pg_ripple'
       AND c.relname = format('vp_%s_delta', p.id)) AS delta_rows_est
FROM _pg_ripple.predicates p
JOIN _pg_ripple.dictionary d ON d.id = p.id
WHERE p.htap = true
ORDER BY p.triple_count DESC
LIMIT 10;

Merge Worker Logs

The merge worker logs to PostgreSQL's standard log:

LOG:  pg_ripple merge worker: merge cycle complete
LOG:  pg_ripple merge worker: processed 3 async validation item(s)
WARNING:  pg_ripple merge worker: watchdog timeout (300s)

Watchdog timeout

If you see watchdog timeout warnings in the PostgreSQL log, the merge worker has stalled. Common causes:

  • Long-running transactions holding locks on VP tables
  • worker_database pointing to the wrong database
  • Insufficient max_worker_processes in postgresql.conf
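A log-scraping agent can watch for these warnings directly. A Python sketch that matches the WARNING format shown in the log excerpt above (the regex is derived from those sample lines; adjust it if your log_line_prefix adds fields before the level):

```python
import re

# Matches the watchdog WARNING format shown in the log excerpt above.
WATCHDOG = re.compile(
    r"WARNING:\s+pg_ripple merge worker: watchdog timeout \((\d+)s\)"
)

def watchdog_events(log_lines):
    """Return the timeout values (in seconds) of any watchdog warnings."""
    return [int(m.group(1)) for line in log_lines
            if (m := WATCHDOG.search(line))]

lines = [
    "LOG:  pg_ripple merge worker: merge cycle complete",
    "WARNING:  pg_ripple merge worker: watchdog timeout (300s)",
]
```

Any non-empty result is a signal to check for stuck locks, a misconfigured worker_database, or exhausted max_worker_processes.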

Health Check Thresholds

Use these thresholds for alerting:

| Metric | Green | Yellow | Red |
|---|---|---|---|
| Cache hit rate | > 95% | 90–95% | < 90% |
| Merge worker PID | > 0 | — | = 0 |
| Delta rows (total) | < 2× merge_threshold | 2–5× | > 5× |
| Catalog consistent | true | — | false |
| Orphaned rare rows | 0 | 1–100 | > 100 |
| Query error rate | < 1% | 1–5% | > 5% |
| Avg query latency | < 100ms | 100–500ms | > 500ms |
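A monitoring agent can encode these thresholds directly. A sketch for the SQL-visible subset (the metric dict keys are illustrative — map them from stats() and canary() output in your collector):

```python
def classify(metrics, merge_threshold=10000):
    """Map collected metrics to green/yellow/red per the thresholds above."""
    rate = metrics["cache_hit_rate"]
    delta = metrics["delta_rows"]
    checks = {
        "cache_hit_rate": ("green" if rate > 0.95
                           else "yellow" if rate >= 0.90 else "red"),
        "merge_worker": "green" if metrics["merge_worker_pid"] > 0 else "red",
        "delta_rows": ("green" if delta < 2 * merge_threshold
                       else "yellow" if delta <= 5 * merge_threshold
                       else "red"),
        "catalog": "green" if metrics["catalog_consistent"] else "red",
    }
    worst = ("red" if "red" in checks.values()
             else "yellow" if "yellow" in checks.values() else "green")
    return worst, checks

# Using values from the example stats() output earlier
status, detail = classify({
    "cache_hit_rate": 0.97,
    "merge_worker_pid": 12345,
    "delta_rows": 8432,
    "catalog_consistent": True,
})
```

The overall status is the worst individual check, which keeps a single red metric from being averaged away.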

Automated Monitoring Query

Run this periodically from your monitoring system:

SELECT
    CASE
        WHEN (c->>'merge_worker') = 'ok'
             AND (c->>'cache_hit_rate')::float > 0.90
             AND (c->>'catalog_consistent')::bool
             AND (c->>'orphaned_rare_rows')::int = 0
        THEN 'healthy'
        WHEN (c->>'merge_worker') = 'stalled'
             OR (c->>'cache_hit_rate')::float < 0.80
        THEN 'critical'
        ELSE 'warning'
    END AS status,
    c->>'merge_worker' AS worker,
    c->>'cache_hit_rate' AS hit_rate,
    c->>'catalog_consistent' AS catalog,
    c->>'orphaned_rare_rows' AS orphaned
FROM pg_ripple.canary() c;

Predicate Inventory

Monitor predicate distribution to catch imbalances:

SELECT
    p.id,
    d.value AS predicate_iri,
    p.triple_count,
    p.table_oid IS NOT NULL AS has_vp_table,
    CASE WHEN p.htap THEN 'htap' ELSE 'flat' END AS storage_mode
FROM _pg_ripple.predicates p
JOIN _pg_ripple.dictionary d ON d.id = p.id
ORDER BY p.triple_count DESC
LIMIT 20;

Skewed predicates

If one predicate has 10x more triples than the next, its VP table dominates storage and merge time. Consider partitioning the data by named graph or filtering queries to avoid full scans of that predicate.


Log-Based Monitoring

Configure PostgreSQL logging for SPARQL workload visibility:

# postgresql.conf
log_min_duration_statement = 500    # Log queries slower than 500ms
log_statement = 'none'              # Don't log every statement
log_line_prefix = '%t [%p] %d '     # Timestamp, PID, database

SPARQL-generated SQL appears in the PostgreSQL log with VP table references, making it easy to correlate slow log entries with specific SPARQL patterns.

Performance Tuning

pg_ripple performance depends on three interacting subsystems: the query engine, the write path, and the dictionary cache. This page provides diagnostic steps and tuning recipes for each bottleneck area, with realistic numbers from BSBM benchmarks and internal testing.


The Three Bottleneck Areas

┌──────────────────────────────────────────────────────┐
│                     Performance                      │
│                                                      │
│   ┌────────────┐   ┌──────────┐   ┌──────────────┐   │
│   │  Query     │   │  Write   │   │  Cache       │   │
│   │  Engine    │   │  Path    │   │  Pressure    │   │
│   │            │   │          │   │              │   │
│   │  Slow      │   │  Merge   │   │  Dictionary  │   │
│   │  SPARQL    │   │  worker  │   │  misses →    │   │
│   │  queries   │   │  lag,    │   │  table       │   │
│   │            │   │  delta   │   │  lookups     │   │
│   │            │   │  bloat   │   │              │   │
│   └────────────┘   └──────────┘   └──────────────┘   │
└──────────────────────────────────────────────────────┘

Diagnostic Workflow

Before tuning, identify which subsystem is the bottleneck:

-- Step 1: Overall health
SELECT pg_ripple.canary();

-- Step 2: Cache hit rate
SELECT
    (s->>'encode_cache_hits')::bigint AS hits,
    (s->>'encode_cache_misses')::bigint AS misses,
    ROUND(
        (s->>'encode_cache_hits')::numeric /
        NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
        4
    ) AS hit_rate
FROM pg_ripple.stats() s;

-- Step 3: Delta accumulation
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta_rows;

-- Step 4: Slowest queries
SELECT calls, mean_exec_time::numeric(10,2) AS avg_ms, LEFT(query, 100)
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple.vp_%'
ORDER BY mean_exec_time DESC
LIMIT 10;

| Symptom | Likely Bottleneck | Section |
| --- | --- | --- |
| High mean_exec_time on VP queries | Query engine | Query Performance |
| delta_rows growing unbounded | Write path / merge | Write Throughput |
| Cache hit rate < 95% | Dictionary cache | Cache Pressure |
| merge_worker_pid = 0 | Merge worker not running | Write Throughput |

Query Performance

Typical Performance Numbers

Based on BSBM benchmarks and internal testing with 10M triples on a 4-core/16GB instance:

| Query Pattern | Typical Latency | Notes |
| --- | --- | --- |
| Simple triple pattern (1 BGP) | 0.5–2ms | Single VP table scan with B-tree |
| Star pattern (3–5 joins, same subject) | 2–10ms | Self-join elimination reduces to 1 scan + joins |
| Path query (3 hops) | 5–20ms | WITH RECURSIVE, bounded depth |
| Complex BGP (5–8 patterns) | 10–50ms | Benefits from bgp_reorder |
| Aggregation (COUNT/SUM over 100K rows) | 20–80ms | PostgreSQL native aggregation |
| DESCRIBE (CBD, 50 outgoing arcs) | 5–15ms | Depends on describe_strategy |
| Federation (1 SERVICE call) | 50–500ms | Network-dominated |

Tuning: Slow Single Queries

Step 1: Get the EXPLAIN output

SELECT pg_ripple.explain_sparql(
    'SELECT ?name WHERE {
        ?person <http://schema.org/knows> ?friend .
        ?friend <http://schema.org/name> ?name
    }',
    'text'
);

Step 2: Check for common issues

| EXPLAIN Pattern | Problem | Fix |
| --- | --- | --- |
| Seq Scan on vp_rare | Predicate below promotion threshold | Lower vp_promotion_threshold or load more data |
| Nested Loop with millions of rows | Poor join order | Verify bgp_reorder = on; run ANALYZE on VP tables |
| Sort + Unique on large result | Unnecessary DISTINCT | Add SHACL sh:maxCount 1 for functional predicates |
| CTE Scan with high loops | Unbounded property path | Lower max_path_depth; add FILTER bounds |
| Hash Join with large build side | Join on a high-cardinality predicate | Rewrite query to filter the large predicate first |

Step 3: Enable the plan cache

-- Cache compiled SQL for repeated queries
SET pg_ripple.plan_cache_size = 512;

The plan cache eliminates parse/optimize/generate overhead for repeated SPARQL patterns. With BSBM's mix of 12 query templates, a cache size of 256 achieves ~98% hit rate.

Tuning: Overall Query Throughput

For workloads with many concurrent queries:

# Enable parallel query for complex joins
pg_ripple.parallel_query_min_joins = 2

# PostgreSQL parallel execution
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

# Larger work_mem for complex joins
work_mem = '128MB'

BGP reordering impact

On a 10M triple dataset with 5-pattern BGPs, enabling bgp_reorder reduces median query time from 45ms to 12ms — a 3.7x improvement. Always keep this on unless you have a specific reason to disable it.
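
To measure the effect on one of your own queries, you can compare plans with the GUC toggled. A quick sketch, assuming bgp_reorder is settable at the session level (the query text is illustrative — substitute one of your slow BGPs):

```sql
-- Plan without reordering
SET pg_ripple.bgp_reorder = off;
SELECT pg_ripple.explain_sparql(
    'SELECT ?name WHERE {
        ?person <http://schema.org/knows> ?friend .
        ?friend <http://schema.org/name> ?name
    }', 'text');

-- Plan with reordering (the default)
SET pg_ripple.bgp_reorder = on;
SELECT pg_ripple.explain_sparql(
    'SELECT ?name WHERE {
        ?person <http://schema.org/knows> ?friend .
        ?friend <http://schema.org/name> ?name
    }', 'text');
```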


Write Throughput

Typical Write Performance

| Operation | Throughput | Notes |
| --- | --- | --- |
| insert_triple() (single) | 5,000–15,000 triples/sec | Per-backend, includes dictionary encoding |
| load_turtle() (bulk, inline) | 30,000–80,000 triples/sec | Batch dictionary encoding |
| load_turtle_file() (bulk, file) | 50,000–120,000 triples/sec | Streaming from disk, larger batches |
| sparql_update() INSERT DATA | 10,000–30,000 triples/sec | SPARQL parse overhead |

Tuning: Merge Worker Lag

If unmerged_delta_rows grows continuously, the merge worker cannot keep up with the write rate.

Diagnosis:

-- Check delta accumulation
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta;
-- Run again 60 seconds later — if delta is growing, merges are lagging
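
The two samples can be taken in a single DO block — a convenience sketch; note that pg_sleep holds the session for the full minute:

```sql
DO $$
DECLARE
    d0 int;
    d1 int;
BEGIN
    -- First sample
    SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int INTO d0;
    PERFORM pg_sleep(60);
    -- Second sample, 60 seconds later
    SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int INTO d1;
    RAISE NOTICE 'delta rows: % -> % (growth: % per minute)', d0, d1, d1 - d0;
END $$;
```

A sustained positive growth figure across several samples means merges are lagging.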

Solutions (in order of impact):

  1. Lower merge_threshold — Merge smaller batches more frequently:

    ALTER SYSTEM SET pg_ripple.merge_threshold = 5000;
    SELECT pg_reload_conf();
    
  2. Increase merge frequency — Reduce polling interval:

    ALTER SYSTEM SET pg_ripple.merge_interval_secs = 15;
    SELECT pg_reload_conf();
    
  3. Manual compaction — Force an immediate merge:

    SELECT pg_ripple.compact();
    
  4. Separate write windows — Batch writes during off-peak hours, then compact.

Tuning: Bulk Load Performance

For large initial data loads:

-- Temporarily disable SHACL validation
SET pg_ripple.shacl_mode = 'off';

-- Use file-based loading for best throughput
SELECT pg_ripple.load_turtle_file('/data/large_dataset.ttl');

-- Re-enable validation
SET pg_ripple.shacl_mode = 'async';

-- Force merge to move data to main tables
SELECT pg_ripple.compact();

Cache back-pressure

During bulk loads, pg_ripple monitors cache utilization against cache_budget. When utilization exceeds 90%, batch sizes are automatically reduced to prevent out-of-memory conditions. If you see slower-than-expected bulk loads, check encode_cache_utilization_pct in stats().


Cache Pressure

Diagnosis

SELECT
    (s->>'encode_cache_capacity')::int AS capacity,
    (s->>'encode_cache_utilization_pct')::int AS util_pct,
    (s->>'encode_cache_hits')::bigint AS hits,
    (s->>'encode_cache_misses')::bigint AS misses,
    (s->>'encode_cache_evictions')::bigint AS evictions
FROM pg_ripple.stats() s;

| Metric | Interpretation | Action Needed |
| --- | --- | --- |
| Hit rate > 95% | Normal operation | None |
| Hit rate 90–95% | Marginal | Consider increasing cache |
| Hit rate < 90% | Cache thrashing | Increase dictionary_cache_size |
| Utilization > 90% | Near-full | Increase cache_budget |
| Evictions > 10% of hits | High churn | Working set exceeds cache |

Sizing the Dictionary Cache

Rule of thumb: size the cache for your hot working set. On the order of 10% of your total unique terms is a good starting point; the table below follows this ratio.

-- Count unique terms
SELECT count(*) AS unique_terms FROM _pg_ripple.dictionary;

| Unique Terms | Recommended dictionary_cache_size | Memory (approx.) |
| --- | --- | --- |
| < 50K | 8,192 | ~2 MB |
| 50K – 500K | 65,536 | ~13 MB |
| 500K – 5M | 262,144 | ~50 MB |
| 5M – 50M | 500,000 | ~100 MB |
| > 50M | 1,000,000 (max) | ~200 MB |
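
A query that turns the unique-term count into a starting cache size, following the roughly-10% working-set ratio the table uses. The power-of-two rounding is a convention assumed here, not a pg_ripple requirement:

```sql
-- Suggest a cache size from the live dictionary count
SELECT
    unique_terms,
    GREATEST(
        8192,  -- floor at the smallest tier
        POWER(2, CEIL(LOG(2, GREATEST(unique_terms, 1) * 0.10)))::bigint
    ) AS suggested_cache_size
FROM (SELECT count(*) AS unique_terms FROM _pg_ripple.dictionary) t;
```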

Restart required

Changing dictionary_cache_size requires a PostgreSQL restart because shared memory is allocated at postmaster start. Plan your cache sizing during initial deployment.


Workload-Specific Recipes

Read-Heavy Analytics

Optimized for complex SPARQL queries with rare writes:

# Large plan cache for diverse query shapes
pg_ripple.plan_cache_size = 2048

# BGP optimization
pg_ripple.bgp_reorder = on
pg_ripple.parallel_query_min_joins = 2

# Large dictionary cache
pg_ripple.dictionary_cache_size = 262144
pg_ripple.cache_budget = 256

# Infrequent merges (writes are rare)
pg_ripple.merge_threshold = 100000
pg_ripple.merge_interval_secs = 300

# PostgreSQL
shared_buffers = '4GB'
effective_cache_size = '12GB'
work_mem = '256MB'
random_page_cost = 1.1

Expected: P95 query latency < 50ms for 5-pattern BGPs on 10M triples.

Write-Heavy Ingestion

Optimized for continuous data ingestion with periodic queries:

# Smaller plan cache (fewer distinct queries)
pg_ripple.plan_cache_size = 64

# Aggressive merging to keep delta small
pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 10
pg_ripple.latch_trigger_threshold = 2000
pg_ripple.auto_analyze = on

# Large cache to handle encoding pressure
pg_ripple.dictionary_cache_size = 500000
pg_ripple.cache_budget = 512

# Disable SHACL during ingestion
pg_ripple.shacl_mode = 'off'

# PostgreSQL — optimize for writes
shared_buffers = '2GB'
wal_buffers = '64MB'
checkpoint_completion_target = 0.9
max_wal_size = '4GB'

Expected: Sustained ingestion at 50K+ triples/sec with merge lag < 30 seconds.

Mixed HTAP (Read + Write)

Balanced for concurrent queries and writes:

# Moderate plan cache
pg_ripple.plan_cache_size = 512

# Balanced merge — not too frequent, not too lazy
pg_ripple.merge_threshold = 25000
pg_ripple.merge_interval_secs = 30
pg_ripple.latch_trigger_threshold = 10000
pg_ripple.auto_analyze = on

# Good cache coverage
pg_ripple.dictionary_cache_size = 131072
pg_ripple.cache_budget = 128

# Async SHACL so writes are not blocked
pg_ripple.shacl_mode = 'async'

# BGP optimization for read queries
pg_ripple.bgp_reorder = on

# PostgreSQL
shared_buffers = '4GB'
effective_cache_size = '12GB'
work_mem = '128MB'
max_parallel_workers_per_gather = 2

Expected: Read P95 < 30ms, write throughput > 20K triples/sec, merge lag < 60 seconds.


Benchmarking Your Deployment

Use the built-in compact() function and pg_stat_statements to establish baselines:

-- Reset statistics
SELECT pg_stat_statements_reset();

-- Run your workload (queries, inserts, etc.)

-- Collect results
SELECT
    calls,
    mean_exec_time::numeric(10,2) AS avg_ms,
    stddev_exec_time::numeric(10,2) AS stddev_ms,
    min_exec_time::numeric(10,2) AS min_ms,
    max_exec_time::numeric(10,2) AS max_ms,
    rows,
    LEFT(query, 80) AS query_prefix
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple%'
ORDER BY total_exec_time DESC
LIMIT 20;

Iterative tuning

Change one parameter at a time, re-run your benchmark, and compare. The most impactful parameters in order are:

  1. dictionary_cache_size (cache hit rate)
  2. bgp_reorder (query planning)
  3. merge_threshold (read freshness vs. write throughput)
  4. plan_cache_size (repeated query overhead)
  5. PostgreSQL work_mem (complex join performance)
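
One iteration of that loop, using merge_threshold (a reload-time GUC, per the Write Throughput section) as the parameter under test:

```sql
-- 1. Change exactly one parameter
ALTER SYSTEM SET pg_ripple.merge_threshold = 25000;
SELECT pg_reload_conf();

-- 2. Reset counters, re-run the workload, then re-collect
SELECT pg_stat_statements_reset();
-- ...run workload...
SELECT calls, mean_exec_time::numeric(10,2) AS avg_ms
FROM pg_stat_statements
WHERE query LIKE '%_pg_ripple%'
ORDER BY total_exec_time DESC
LIMIT 5;
```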

Parallel Merge Worker Pool

Added in v0.42.0

Overview

pg_ripple uses a Vertical Partitioning (VP) architecture where each unique predicate gets its own storage table. The merge worker pool keeps the read-optimised _main partitions in sync with the write-optimised _delta tables.

By default, a single background worker handles all predicates sequentially. For workloads with many distinct predicates — such as rich ontologies with 50+ property types — a pool of parallel workers can significantly improve write throughput.

Configuration

pg_ripple.merge_workers (startup only)

Controls the number of parallel merge worker processes. It must be set in postgresql.conf and takes effect only at server start; it cannot be changed with SET at the session level.

# postgresql.conf
shared_preload_libraries = 'pg_ripple'
pg_ripple.merge_workers = 4

  • Default: 1 (single worker, original behaviour)
  • Range: 1 to 16
  • Type: integer, PGC_POSTMASTER (startup-only)

pg_ripple.merge_threshold

Minimum rows in a VP delta table before a merge is triggered. Increasing this reduces merge frequency but increases per-merge cost.

SET pg_ripple.merge_threshold = 50000;  -- default: 10000

pg_ripple.merge_interval_secs

Maximum seconds between merge worker polling cycles.

SET pg_ripple.merge_interval_secs = 30;  -- default: 60

How It Works

With merge_workers = N, pg_ripple spawns N background worker processes. Each worker owns a disjoint round-robin subset of VP predicates:

  • Worker 0 handles predicates where pred_id % N == 0
  • Worker 1 handles predicates where pred_id % N == 1
  • … and so on

Advisory locking prevents races: before merging a predicate, a worker calls pg_try_advisory_lock(pred_id). If another worker already holds the lock, it skips that predicate.

Work-stealing: after processing its assigned predicates, an idle worker checks whether any "foreign" predicate (not in its round-robin slice) has a delta table above the merge threshold and no lock held. If so, it steals that work. This prevents a single overloaded predicate from delaying the merge cycle.
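
The locking discipline can be pictured in SQL terms. A schematic sketch — the worker does this internally, and the lock key shown here is simplified to the raw pred_id (42 is a made-up example):

```sql
-- Attempt to claim predicate 42 before merging it; skip if another
-- worker already holds the advisory lock.
SELECT CASE
    WHEN pg_try_advisory_lock(42)
        THEN 'claimed: merge the delta, then pg_advisory_unlock(42)'
        ELSE 'skipped: held by another worker'
END AS outcome;
```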

Monitoring

Use pg_ripple.diagnostic_report() to check merge worker activity:

SELECT value FROM pg_ripple.diagnostic_report()
WHERE key LIKE 'merge_%';

Or query the background worker state:

SELECT pid, application_name, state
FROM pg_stat_activity
WHERE application_name LIKE 'pg_ripple merge%';

Choosing the Right Worker Count

| Predicate count | Recommended workers |
| --- | --- |
| < 20 | 1 (default) |
| 20–100 | 2–4 |
| 100–500 | 4–8 |
| > 500 | 8–16 |

For most workloads, the bottleneck is not the worker count but the merge threshold and interval. Tune those first before scaling workers.

Restart Requirement

Because merge_workers is a PGC_POSTMASTER GUC, changes take effect only after a PostgreSQL restart:

# After updating postgresql.conf:
pg_ctl restart -D $PGDATA

Backup and Disaster Recovery

pg_ripple stores all data in standard PostgreSQL tables within the _pg_ripple schema. This means every PostgreSQL backup tool works out of the box — VP tables, the dictionary, the predicates catalog, SHACL constraints, Datalog rules, and inferred triples are all captured by pg_dump, WAL archiving, and streaming replication.

No special export needed

Unlike triple stores that require a separate RDF dump/reload cycle, pg_ripple data is just PostgreSQL data. Your existing backup infrastructure already covers it.


What Gets Backed Up

| Object | Schema | Captured by pg_dump? | Notes |
| --- | --- | --- | --- |
| Dictionary table | _pg_ripple.dictionary | Yes | All IRI, blank node, and literal mappings |
| Predicates catalog | _pg_ripple.predicates | Yes | Predicate → VP table OID mapping |
| VP tables (main + delta + tombstones) | _pg_ripple.vp_{id}_* | Yes | One table set per predicate |
| Rare predicates table | _pg_ripple.vp_rare | Yes | Consolidated low-cardinality predicates |
| SHACL constraints | _pg_ripple.shacl_* | Yes | Shape definitions and validation state |
| Datalog rules | _pg_ripple.rules | Yes | Rule text and compiled plans |
| Inferred triples | VP tables, source = 1 | Yes | Materialized inference results |
| Extension metadata | pg_catalog | Yes | Extension version and control file |
| Shared memory state | In-memory only | No | Dictionary LRU cache, merge worker counters |

Shared memory state

The dictionary LRU cache and merge worker counters live in shared memory and are not persisted to disk. They are rebuilt automatically on PostgreSQL restart. This is by design — the cache warms up quickly from normal query traffic.


Logical Backup with pg_dump

Full Database Dump

# Custom format (recommended — compressed, parallel-restore capable)
pg_dump -Fc -f pg_ripple_backup.dump mydb

# Plain SQL (human-readable, useful for auditing)
pg_dump -Fp -f pg_ripple_backup.sql mydb

Extension-Only Dump

To back up only pg_ripple data without the rest of the database:

pg_dump -Fc \
  --schema=_pg_ripple \
  --schema=pg_ripple \
  -f pg_ripple_only.dump mydb

Include both schemas

Always include both _pg_ripple (internal storage) and pg_ripple (public API functions). Restoring one without the other leaves the extension in an inconsistent state.

Parallel Dump for Large Datasets

For databases with millions of triples, use parallel workers:

# Directory format required for parallel dump
pg_dump -Fd -j 4 -f pg_ripple_backup_dir/ mydb

The dictionary table and large VP tables will be dumped in parallel, significantly reducing backup time.


Restoring from Backup

Full Restore to a New Database

# Create the target database
createdb mydb_restored

# Restore (custom format)
pg_restore -d mydb_restored -Fc pg_ripple_backup.dump

# Restore (directory format, parallel)
pg_restore -d mydb_restored -Fd -j 4 pg_ripple_backup_dir/

Restore from Plain SQL

psql -d mydb_restored -f pg_ripple_backup.sql

Post-Restore Verification

After restoring, verify the extension is intact:

-- Check extension version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

-- Verify triple count
SELECT pg_ripple.stats();

-- Run the health check
SELECT pg_ripple.canary();

-- Spot-check a SPARQL query
SELECT pg_ripple.sparql($$
  SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }
$$);

Do VP tables survive dump/restore?

Yes. VP tables are standard PostgreSQL heap tables with B-tree or BRIN indexes. pg_dump captures them exactly like any other table. The HTAP delta/main/tombstone split, indexes, and the merge worker view definitions are all preserved. After restore, the merge worker resumes normal operation once shared_preload_libraries includes pg_ripple.


WAL-Based Continuous Archiving

For point-in-time recovery (PITR), configure WAL archiving:

Enable WAL Archiving

In postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
max_wal_senders = 3

Take a Base Backup

pg_basebackup -D /backup/base -Ft -z -P

Point-in-Time Recovery

Create a recovery.signal file and configure the restore target:

# postgresql.conf (or postgresql.auto.conf)
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-19 14:30:00'

Start PostgreSQL — it will replay WAL up to the specified time.

HTAP merge and PITR

If you recover to a point mid-merge, the merge worker will detect the incomplete state and re-run the merge on startup. No manual intervention is needed, but the first merge cycle after recovery may take longer than usual.


Streaming Replication

pg_ripple works transparently with PostgreSQL streaming replication:

# On the replica
pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -P

The -R flag writes the standby.signal and connection parameters. All VP tables, dictionary data, and HTAP state replicate via WAL.

Merge worker on replicas

The background merge worker does not run on read replicas. Replicas receive merged state via WAL replay from the primary. This is correct behavior — replicas should never write.


Backup Strategy Recommendations

Small Datasets (< 1M triples)

| Component | Recommendation |
| --- | --- |
| Method | pg_dump -Fc nightly |
| Retention | 7 daily + 4 weekly |
| RPO | 24 hours |
| RTO | Minutes |

Medium Datasets (1M – 100M triples)

| Component | Recommendation |
| --- | --- |
| Method | WAL archiving + daily base backup |
| Retention | 7 daily base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Minutes to hours |

Large Datasets (> 100M triples)

| Component | Recommendation |
| --- | --- |
| Method | WAL archiving + pgBackRest or Barman |
| Retention | Incremental base + continuous WAL |
| RPO | Seconds (WAL) |
| RTO | Proportional to dataset size |

Test your restores

Schedule monthly restore drills. A backup that has never been tested is not a backup. Automate the verification queries shown above as part of the drill.


Disaster Recovery Checklist

  1. Before disaster: WAL archiving enabled, base backups on schedule, replication lag monitored
  2. During incident: identify the failure scope (single table, full database, or host loss)
  3. Recovery steps:
    • Host loss → promote replica or restore from base backup + WAL
    • Corruption → PITR to last known good time
    • Accidental deletion → PITR to just before the DROP/DELETE
  4. Post-recovery:
    • Run SELECT pg_ripple.canary() to verify health
    • Check pg_ripple.stats() for expected triple counts
    • Verify the merge worker is running (merge_worker_pid > 0)
    • Run representative SPARQL queries to confirm data integrity
    • Resume WAL archiving and replication
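
The stats-based checks in step 4 can be collapsed into one query, using the stats() keys shown elsewhere in these docs:

```sql
-- Post-recovery spot check: merge worker alive, delta backlog visible
SELECT
    (s->>'merge_worker_pid')::int > 0   AS merge_worker_ok,
    (s->>'unmerged_delta_rows')::int    AS unmerged_delta_rows
FROM pg_ripple.stats() s;
```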

Common Pitfalls

Don't forget shared_preload_libraries

After restoring to a fresh PostgreSQL instance, ensure shared_preload_libraries = 'pg_ripple' is set in postgresql.conf before starting the server. Without it, the merge worker will not start, the dictionary cache will be unavailable, and queries will fall back to uncached dictionary lookups.

  • Schema ownership: the restoring user must be a superuser or own both _pg_ripple and pg_ripple schemas
  • Sequence values: pg_dump captures sequence state — statement IDs (i column) will continue from the correct value after restore
  • Tablespace placement: if you used custom tablespaces for VP tables, ensure they exist on the target server before restoring

Upgrading Safely

pg_ripple follows PostgreSQL's standard extension upgrade mechanism. Each release ships a migration script that ALTER EXTENSION pg_ripple UPDATE executes automatically, walking the version chain from your current version to the target.


How Extension Upgrades Work

PostgreSQL extensions use a chain of migration scripts to move between versions. pg_ripple provides a script for every consecutive version pair:

pg_ripple--0.1.0--0.2.0.sql
pg_ripple--0.2.0--0.3.0.sql
pg_ripple--0.3.0--0.4.0.sql
...
pg_ripple--0.45.0--0.46.0.sql

When you run ALTER EXTENSION pg_ripple UPDATE, PostgreSQL finds the shortest path from your current version to the latest and executes each script in sequence.

Migration scripts are idempotent

Each migration script uses IF NOT EXISTS, CREATE OR REPLACE, and similar guards. If a migration is partially applied (e.g., due to a crash), re-running it is safe.
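
Schematically, a guard-style migration step looks like the following. This is an invented example, not an actual pg_ripple migration, and the promoted_at column is hypothetical:

```sql
-- Safe to re-run: both statements are no-ops if already applied
ALTER TABLE _pg_ripple.predicates
    ADD COLUMN IF NOT EXISTS promoted_at timestamptz;

CREATE INDEX IF NOT EXISTS predicates_promoted_at_idx
    ON _pg_ripple.predicates (promoted_at);
```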


Pre-Upgrade Checklist

1. Check Your Current Version

SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

2. Back Up the Database

pg_dump -Fc -f pre_upgrade_backup.dump mydb

Always back up before upgrading

While migration scripts are tested, a backup lets you restore to the pre-upgrade state if anything goes wrong. This is especially important for major feature releases that add schema changes.

3. Review the Changelog

Read the Release Notes for every version between your current version and the target. Pay attention to:

  • Breaking changes: renamed functions, changed return types, removed GUC parameters
  • Schema changes: new columns on internal tables, new indexes
  • New dependencies: additional shared_preload_libraries entries

4. Check for Active Connections

SELECT count(*) FROM pg_stat_activity
WHERE datname = current_database()
  AND pid != pg_backend_pid();

Disconnect all application connections before upgrading. The upgrade modifies extension catalog entries and may need exclusive locks on internal tables.
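
If stray connections remain after applications are stopped, they can be terminated from SQL (requires appropriate privileges — coordinate with application owners first):

```sql
-- Terminate every other backend connected to this database
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid != pg_backend_pid();
```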

5. Verify the New Package Is Installed

The new .so (shared library) and SQL migration files must be present in the PostgreSQL extension directory before running ALTER EXTENSION:

# Check that the target version's migration script exists
ls $(pg_config --sharedir)/extension/pg_ripple--*

# Check that the shared library is updated
ls -la $(pg_config --pkglibdir)/pg_ripple.so

Performing the Upgrade

Step 1: Install the New Package

# From source
cargo pgrx install --pg-config $(pg_config) --release

# Or from a pre-built package
# dpkg -i pg_ripple-0.32.0-pg18.deb

Step 2: Schedule a Maintenance Window

No zero-downtime upgrades

pg_ripple does not yet support zero-downtime upgrades. Schedule the upgrade during a maintenance window. If you have read replicas, route read traffic to a replica during the upgrade window, but note that the replica will also need the new shared library installed before promotion.

Step 3: Restart PostgreSQL (If Required)

Some releases update the shared library or change shared memory layout. Check the release notes — if they mention shared memory changes or new background workers, restart PostgreSQL:

pg_ctl restart -D $PGDATA

Step 4: Run the Migration

-- Upgrade to the latest installed version
ALTER EXTENSION pg_ripple UPDATE;

-- Or upgrade to a specific version
ALTER EXTENSION pg_ripple UPDATE TO '0.46.0';

PostgreSQL will execute each intermediate migration script in order:

NOTICE:  updating extension "pg_ripple" from version "0.44.0" to "0.45.0"
NOTICE:  updating extension "pg_ripple" from version "0.45.0" to "0.46.0"

Step 5: Verify

-- Confirm the new version
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

-- Run the health check
SELECT pg_ripple.canary();

-- Verify stats
SELECT pg_ripple.stats();

Post-Upgrade Verification

Run these checks after every upgrade:

-- 1. Extension version matches expected
SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';

-- 2. Health check passes
SELECT pg_ripple.canary();

-- 3. Triple count is unchanged
SELECT pg_ripple.stats();

-- 4. SPARQL queries work
SELECT pg_ripple.sparql($$
  SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
$$);

-- 5. Merge worker is running (if shared_preload_libraries is set)
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int > 0 AS merge_worker_ok;

-- 6. Dictionary cache is operational
SELECT
  (s->>'encode_cache_hits')::bigint + (s->>'encode_cache_misses')::bigint > 0
    AS cache_active
FROM pg_ripple.stats() s;

Automate post-upgrade checks

Add these verification queries to a script that runs immediately after ALTER EXTENSION. If any check fails, the script can alert operators before traffic is routed back to the upgraded instance.


Multi-Version Hop Upgrades

PostgreSQL walks the entire migration chain automatically. Upgrading from v0.5.0 directly to v0.46.0 executes all intermediate scripts:

-- This works — PG finds the path 0.5.0 → 0.5.1 → 0.6.0 → ... → 0.46.0
ALTER EXTENSION pg_ripple UPDATE TO '0.46.0';

Long upgrade chains

Each migration script is typically fast (milliseconds to seconds). However, scripts that add columns or create indexes on large tables may take longer. For very long hops (10+ versions), expect a few minutes on large datasets. Monitor pg_stat_activity for lock waits during the upgrade.


Rollback Strategy

There is no built-in downgrade path. If an upgrade causes problems:

Option A: Restore from Backup

# Drop the upgraded database
dropdb mydb

# Restore the pre-upgrade backup
createdb mydb
pg_restore -d mydb pre_upgrade_backup.dump

This is the safest rollback method — it returns everything to the exact pre-upgrade state.

Option B: Reinstall the Old Version

# Install the old shared library
cargo pgrx install --pg-config $(pg_config) --release  # (from old source checkout)

# Restart PostgreSQL
pg_ctl restart -D $PGDATA

Downgrade limitations

Reinstalling the old .so file works only if the migration scripts did not make irreversible schema changes (e.g., dropping a column). Always check the migration script content before relying on this approach.


Upgrading PostgreSQL Itself

When upgrading the PostgreSQL major version (e.g., 17 → 18):

  1. pg_ripple requires PostgreSQL 18. Earlier versions are not supported.
  2. Use pg_upgrade as normal — pg_ripple's tables and extension metadata transfer correctly.
  3. After pg_upgrade, verify the extension:
psql -d mydb -c "SELECT extversion FROM pg_extension WHERE extname = 'pg_ripple';"
psql -d mydb -c "SELECT pg_ripple.canary();"

Recompile the shared library

After a PostgreSQL major version upgrade, the pg_ripple.so shared library must be recompiled against the new PostgreSQL headers. The old binary will not load.


Version Compatibility Matrix

| pg_ripple Version | PostgreSQL Version | Notes |
| --- | --- | --- |
| 0.1.0 – 0.46.0 | 18.x | Only supported version |
| Any | < 18 | Not supported |
| Any | 19+ | Not yet tested |

Troubleshooting Upgrades

"no update path from version X to version Y"

The intermediate migration scripts are missing from the extension directory. Reinstall pg_ripple to ensure all SQL files are present:

ls $(pg_config --sharedir)/extension/pg_ripple--*.sql | wc -l

"could not open extension control file"

The pg_ripple.control file is missing. Reinstall the extension.

Migration script fails with a lock timeout

Another session holds a lock on an internal table. Ensure all connections are closed before upgrading, or increase lock_timeout:

SET lock_timeout = '60s';
ALTER EXTENSION pg_ripple UPDATE;

Shared library version mismatch

The .so file version does not match the SQL migration target. Ensure you installed the matching binary before running ALTER EXTENSION:

cargo pgrx install --pg-config $(pg_config) --release
pg_ctl restart -D $PGDATA
ALTER EXTENSION pg_ripple UPDATE;

Schema Version Stamping (v0.37.0+)

Starting with v0.37.0, every ALTER EXTENSION pg_ripple UPDATE stamps a row in _pg_ripple.schema_version. You can verify upgrade completeness:

SELECT version, installed_at, upgraded_from
FROM _pg_ripple.schema_version
ORDER BY installed_at DESC;

Example output after upgrading from 0.36.0 to 0.37.0:

  version  |          installed_at          | upgraded_from 
-----------+--------------------------------+---------------
 0.37.0    | 2026-04-19 10:00:00+00         | 0.36.0
 0.36.0    | 2026-02-01 08:00:00+00         | 0.35.0

The diagnostic_report() function also reports the current schema version:

SELECT value FROM pg_ripple.diagnostic_report() WHERE key = 'schema_version';

This is useful in monitoring scripts to confirm a rolling upgrade has completed on all replicas.

Scaling

pg_ripple scales vertically within a single PostgreSQL instance and horizontally for read traffic via streaming replication. This page covers how to allocate resources, tune the merge worker, set up read replicas, and understand current limitations.

Current scaling model

pg_ripple runs entirely within PostgreSQL. It inherits PostgreSQL's single-writer architecture: one primary handles all writes, and read replicas serve read-only SPARQL queries. Cross-node sharding is not yet supported.


Vertical Scaling

The most impactful scaling lever is giving your single PostgreSQL instance more resources.

Memory

Memory affects three key areas:

| Resource | Controlled By | Impact |
| --- | --- | --- |
| Dictionary LRU cache | pg_ripple.dictionary_cache_size | Reduces disk I/O for IRI/literal lookups. Every SPARQL query touches the dictionary on decode. |
| PostgreSQL shared buffers | shared_buffers | Caches VP table pages. Larger = fewer disk reads for joins. |
| Work memory | work_mem | Memory for sorts, hash joins, and hash aggregates in SPARQL-generated SQL. |

Dictionary Cache Sizing

The dictionary cache is allocated in shared memory at server startup. Each entry consumes approximately 200 bytes.
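
At roughly 200 bytes per entry, cache cost is easy to estimate. For example, the 262,144-entry tier from the Performance Tuning sizing table:

```sql
-- Back-of-envelope shared memory cost at ~200 bytes per entry
SELECT 262144 * 200 / 1024 / 1024 AS approx_mb;   -- integer division: 50
```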

-- Check current utilization
SELECT
  s->>'encode_cache_capacity' AS capacity,
  s->>'encode_cache_utilization_pct' AS utilization_pct,
  ROUND(
    (s->>'encode_cache_hits')::numeric /
    NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
    4
  ) AS hit_rate
FROM pg_ripple.stats() s;

| Hit Rate | Action |
| --- | --- |
| > 95% | Healthy — no change needed |
| 90–95% | Consider increasing dictionary_cache_size |
| < 90% | Double dictionary_cache_size and restart |

Rule of thumb

Set dictionary_cache_size to at least 10% of your total unique IRIs + literals. For a dataset with 5M unique terms, start with 500K entries (~100 MB of shared memory).

PostgreSQL Memory Settings

# postgresql.conf — for a 64 GB server with pg_ripple as the primary workload
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 256MB
maintenance_work_mem = 2GB

work_mem and SPARQL

Complex SPARQL queries with multiple joins, UNIONs, or aggregates can spawn many hash operations. PostgreSQL allocates work_mem per operation per query. Start conservative (64MB–256MB) and increase if you see "temporary file" entries in the logs.
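
Besides the logs, cumulative spill-to-disk activity is visible in the standard pg_stat_database view:

```sql
-- Nonzero, growing temp_bytes under SPARQL load suggests work_mem is too small
SELECT temp_files, pg_size_pretty(temp_bytes) AS temp_spilled
FROM pg_stat_database
WHERE datname = current_database();
```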

CPU

| Workload | CPU Benefit |
| --- | --- |
| SPARQL query execution | More cores → more parallel workers for large joins |
| Merge worker | Single-threaded per predicate, but merges run concurrently across predicates |
| Bulk loading | load_turtle / load_ntriples are I/O-bound; CPU helps with dictionary encoding |
| Datalog inference | Semi-naive fixpoint is CPU-intensive; benefits from faster cores |

Set max_parallel_workers_per_gather to allow PostgreSQL to parallelize large VP table scans:

max_parallel_workers_per_gather = 4
max_parallel_workers = 8
parallel_setup_cost = 100
parallel_tuple_cost = 0.001

pg_ripple's parallel_query_min_joins GUC controls when the SPARQL engine enables parallel hints in generated SQL (default: 3 joins).

Storage

| Tier | Recommendation |
| --- | --- |
| NVMe SSD | Best for all workloads. Random I/O for dictionary lookups and VP table joins. |
| SATA SSD | Acceptable for medium datasets. |
| HDD | Not recommended. Dictionary lookups and VP joins are random-I/O heavy. |

Separate WAL and data

Place pg_wal on a separate NVMe device from the main data directory. pg_ripple's bulk load and merge operations generate significant WAL traffic.


Merge Worker Tuning

The HTAP merge worker is the most important pg_ripple-specific scaling knob. It controls how quickly delta rows (recent writes) are consolidated into the main BRIN-indexed partition.

How the Merge Worker Operates

  1. The worker polls every merge_interval_secs (default: 60s)
  2. For each predicate, it checks if delta row count >= merge_threshold
  3. If yes, it creates a new main table: (old main − tombstones) UNION ALL delta
  4. It swaps the view to point at the new main, then drops the old main after merge_retention_seconds
  5. If auto_analyze is on, it runs ANALYZE on the new main
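
In SQL terms, one merge for a single predicate is roughly the following. This is a schematic sketch only: the worker performs these steps internally, and the column names (s, o) and the vp_42_* table names are illustrative:

```sql
-- (old main − tombstones) UNION ALL delta, built as a fresh main table
CREATE TABLE _pg_ripple.vp_42_main_new AS
SELECT m.*
FROM _pg_ripple.vp_42_main m
WHERE NOT EXISTS (
    SELECT 1 FROM _pg_ripple.vp_42_tomb t
    WHERE t.s = m.s AND t.o = m.o
)
UNION ALL
SELECT d.* FROM _pg_ripple.vp_42_delta d;
-- then: swap the view to the new main, truncate the delta,
-- and drop the old main after merge_retention_seconds
```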

Tuning for Write-Heavy Workloads

Lower the merge threshold and interval to keep the delta tables small:

pg_ripple.merge_threshold = 5000
pg_ripple.merge_interval_secs = 30
pg_ripple.latch_trigger_threshold = 5000

This gives fresher reads but increases I/O from more frequent merges.

Tuning for Read-Heavy Workloads

Raise the threshold to batch more writes before merging:

pg_ripple.merge_threshold = 50000
pg_ripple.merge_interval_secs = 120

This reduces merge I/O overhead but means queries scan larger delta tables.

Monitoring Merge Activity

-- Is the merge worker running?
SELECT (pg_ripple.stats()->>'merge_worker_pid')::int AS pid;

-- How many unmerged delta rows?
SELECT (pg_ripple.stats()->>'unmerged_delta_rows')::int AS delta_rows;

Merge worker stalls

If unmerged_delta_rows grows continuously while merge_worker_pid is non-zero, the worker may be stuck. Check pg_stat_activity for long-running merge transactions and look for lock contention. The merge_watchdog_timeout GUC (default: 300s) logs a WARNING if the worker is idle too long.


Read Replicas

PostgreSQL streaming replication provides horizontal read scaling for SPARQL queries.

Architecture

┌────────────┐     WAL stream     ┌────────────┐
│  Primary   │ ──────────────────→ │  Replica 1 │ ← SPARQL reads
│  (writes)  │                     └────────────┘
│            │     WAL stream     ┌────────────┐
│            │ ──────────────────→ │  Replica 2 │ ← SPARQL reads
└────────────┘                     └────────────┘

Setting Up a Read Replica

On the primary:

# postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB

Create a replication slot:

SELECT pg_create_physical_replication_slot('replica1');

On the replica:

pg_basebackup -h primary-host -D /var/lib/postgresql/18/main -R -S replica1 -P

Start the replica — it will begin streaming WAL and replaying changes, including all VP table mutations.

Replica Considerations

Merge worker does not run on replicas

The background merge worker only runs on the primary. Replicas receive already-merged state through WAL replay. This means replicas always have a consistent view of the data without any additional overhead.

  • SPARQL queries work identically on replicas — the query engine reads VP tables the same way
  • Dictionary cache is independent per instance — each replica maintains its own LRU cache
  • Replication lag: monitor with pg_stat_replication on the primary. Under normal load, lag should be sub-second
  • Hot standby conflicts: long-running SPARQL queries on replicas may conflict with WAL replay. Set max_standby_streaming_delay appropriately:
# On the replica
max_standby_streaming_delay = 30s
hot_standby_feedback = on

Connection Pooling

For workloads with many concurrent SPARQL clients, use a connection pooler:

PgBouncer with pg_ripple

pg_ripple uses session-level GUC parameters (e.g., pg_ripple.inference_mode). If you use PgBouncer, configure it in session pooling mode, not transaction mode. Transaction-mode pooling resets GUCs between transactions, which can cause unexpected behavior.

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = session
max_client_conn = 200
default_pool_size = 20

Scaling Limits and Honest Boundaries

Dimension | Current Capability | Limitation
Triples per instance | Tested to 1B+ | Bound by disk and memory
Concurrent SPARQL queries | Hundreds (with pooler) | Bound by max_connections and CPU
Write throughput | ~50K–200K triples/sec (bulk load) | Single-writer architecture
Read replicas | Unlimited | Standard PG replication
Cross-node sharding | Not supported | No distributed query planner
Multi-primary writes | Not supported | PostgreSQL limitation
Federation | Supported (SERVICE clause) | Remote endpoints add latency

Sharding is not available

pg_ripple does not support sharding VP tables across multiple PostgreSQL instances. If your dataset exceeds what a single instance can handle, consider: (1) vertical scaling with larger hardware, (2) federation via SERVICE clauses to distribute queries across multiple pg_ripple instances, each holding a subset of graphs, or (3) archiving cold graphs to separate instances.


Capacity Planning

Storage Estimates

Component | Per Triple (approx.)
VP table row (s, o, g, i, source) | ~40 bytes
VP indexes (dual B-tree) | ~80 bytes
Dictionary entry (per unique term) | ~120 bytes
HTAP overhead (delta + tombstone tables) | ~20% of VP size during active writes

Example: 100M triples with 20M unique terms ≈ 12 GB (VP rows + indexes) + 2.4 GB (dictionary) + ~2.4 GB (HTAP overhead during writes) ≈ 17 GB; plan for roughly 20 GB to leave headroom for WAL and bloat.
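The same back-of-envelope arithmetic can be scripted for your own numbers. The per-unit byte costs below are the approximations from the table above, not measurements:

```python
# Storage estimate using the approximate per-triple costs from the docs.
def estimate_gb(triples, unique_terms,
                row_bytes=40, index_bytes=80, dict_bytes=120, htap_overhead=0.20):
    vp = triples * (row_bytes + index_bytes)   # VP rows + dual B-tree indexes
    dictionary = unique_terms * dict_bytes     # one entry per unique term
    overhead = vp * htap_overhead              # delta + tombstone tables
    return (vp + dictionary + overhead) / 1e9  # decimal GB

print(round(estimate_gb(100_000_000, 20_000_000), 1))  # → 16.8
```

Add a margin on top of the result for WAL, temp space, and table bloat; the worked example above rounds the same inputs up to ~20 GB.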

Memory Estimates

Component | Sizing
shared_buffers | 25% of RAM
dictionary_cache_size | 10% of unique terms
work_mem | 64MB–512MB depending on query complexity
OS page cache | Remaining RAM

Start small, measure, scale

Deploy with conservative settings, load your data, and run representative queries. Use pg_ripple.stats() and PostgreSQL's pg_stat_user_tables to identify bottlenecks before adding hardware.

Troubleshooting

A runbook of common issues, their causes, and step-by-step resolutions. Each entry follows the pattern: Symptom → Cause → Diagnostic → Fix.


1. SPARQL Query Returns Zero Rows

Symptom: A SPARQL query that should return results returns an empty set.

Cause: The most common cause is querying with unencoded IRIs that don't match the dictionary, or querying the wrong graph.

Diagnostic:

-- Check that triples exist
SELECT pg_ripple.stats();

-- Verify the IRI is in the dictionary
SELECT id FROM _pg_ripple.dictionary WHERE value = 'http://example.org/MyResource';

-- Check the default graph vs named graphs
SELECT pg_ripple.sparql($$
  SELECT ?g (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g
$$);

Fix: Ensure the query uses the exact IRI as stored (case-sensitive, no trailing slash differences). If data was loaded into a named graph, use GRAPH or FROM clauses.


2. Merge Worker Not Running

Symptom: pg_ripple.stats() shows merge_worker_pid: 0. Delta rows accumulate.

Cause: pg_ripple is not in shared_preload_libraries, or worker_database points to the wrong database.

Diagnostic:

SHOW shared_preload_libraries;
SHOW pg_ripple.worker_database;

Fix:

# postgresql.conf
shared_preload_libraries = 'pg_ripple'
pg_ripple.worker_database = 'mydb'

Restart PostgreSQL. Verify with:

SELECT (pg_ripple.stats()->>'merge_worker_pid')::int;

3. Slow Queries — Unbounded Property Paths

Symptom: Queries with * or + property paths take minutes or never complete.

Cause: Property path queries compile to WITH RECURSIVE CTEs. On large, highly-connected graphs, recursion explores an enormous search space.

Diagnostic:

SHOW pg_ripple.max_path_depth;
-- Check the generated SQL
SET pg_ripple.plan_cache_size = 0;  -- disable cache to see fresh plans
EXPLAIN (ANALYZE, BUFFERS) <generated SQL from logs>;

Fix: Limit recursion depth:

SET pg_ripple.max_path_depth = 10;

Or rewrite the query to use a bounded path ({1,5}) instead of */+.


4. SHACL Validation Not Triggering

Symptom: Data that violates SHACL shapes is inserted without errors.

Cause: SHACL enforcement is asynchronous by default, or the shapes are not loaded.

Diagnostic:

-- Check loaded shapes
SELECT pg_ripple.sparql($$
  SELECT ?shape WHERE { ?shape a <http://www.w3.org/ns/shacl#NodeShape> }
$$);

-- Check enforce mode
SHOW pg_ripple.enforce_constraints;

Fix: Set enforcement mode to 'error' for synchronous validation:

SET pg_ripple.enforce_constraints = 'error';

Reload shapes if needed:

SELECT pg_ripple.load_shapes('<shapes-graph-iri>');

5. Datalog Inference Produces No Results

Symptom: pg_ripple.infer() or pg_ripple.infer_goal() returns zero new triples.

Cause: Rules are not loaded, inference mode is 'off', or the rule atoms don't match any data.

Diagnostic:

SHOW pg_ripple.inference_mode;

-- List loaded rule sets
SELECT pg_ripple.list_rule_sets();

-- Test with a simple rule
SELECT pg_ripple.load_rules('test', $$
  :Grandparent(?x, ?z) :- :Parent(?x, ?y), :Parent(?y, ?z).
$$);
SELECT pg_ripple.infer('test');

Fix: Ensure inference_mode is 'on_demand' or 'materialized', rules are loaded, and the predicates in rule atoms match your data's actual IRIs exactly.
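The semi-naive fixpoint named in the hardware notes is easy to see on a small example. This sketch is illustrative only: real rule atoms are IRIs and pg_ripple evaluates rules in SQL, but the delta-driven iteration is the same idea.

```python
# Semi-naive fixpoint for: ancestor(x,z) :- parent(x,z).
#                          ancestor(x,z) :- ancestor(x,y), parent(y,z).
def transitive_closure(parent):
    total = set(parent)
    delta = set(parent)                      # only newly derived facts join
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in parent if y == y2}
        delta = new - total                  # drop facts already derived
        total |= delta
    return total

parent = {("ann", "bob"), ("bob", "cal"), ("cal", "dee")}
print(sorted(transitive_closure(parent)))   # 6 ancestor pairs
```

If a rule set like this derives nothing, the usual culprit is the join condition: the constants in the rule atoms must match the data's IRIs exactly, just as the Fix above says.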


6. Shared Memory Errors on Startup

Symptom: PostgreSQL fails to start with could not create shared memory segment or pg_ripple logs insufficient shared memory.

Cause: pg_ripple.dictionary_cache_size is too large for the system's shared memory limits.

Diagnostic:

# Check system shared memory limits
sysctl kern.sysv.shmmax  # macOS
sysctl kernel.shmmax      # Linux

Fix: Either reduce dictionary_cache_size or increase the OS shared memory limit:

# Linux
sudo sysctl -w kernel.shmmax=17179869184  # 16GB
sudo sysctl -w kernel.shmall=4194304

# macOS
sudo sysctl -w kern.sysv.shmmax=17179869184

Docker environments

In Docker, set --shm-size=2g (or larger) in your docker run command.


7. High Dictionary Cache Eviction Pressure

Symptom: encode_cache_evictions in pg_ripple.stats() is high; cache hit rate drops below 90%.

Cause: The working set of IRIs/literals exceeds the cache capacity.

Diagnostic:

SELECT
  s->>'encode_cache_capacity' AS capacity,
  s->>'encode_cache_utilization_pct' AS util_pct,
  s->>'encode_cache_evictions' AS evictions,
  ROUND(
    (s->>'encode_cache_hits')::numeric /
    NULLIF((s->>'encode_cache_hits')::numeric + (s->>'encode_cache_misses')::numeric, 0),
    4
  ) AS hit_rate
FROM pg_ripple.stats() s;

Fix: Increase dictionary_cache_size in postgresql.conf and restart:

pg_ripple.dictionary_cache_size = 131072  -- double the default
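Before restarting with a bigger cache, confirm the hit rate really is below target. The formula is the same one the diagnostic SQL computes; the stats keys below are the ones shown in that query:

```python
# Hit rate from pg_ripple.stats() counters (same formula as the SQL diagnostic).
def cache_hit_rate(stats):
    hits = float(stats["encode_cache_hits"])
    misses = float(stats["encode_cache_misses"])
    total = hits + misses
    return hits / total if total else None   # None until any traffic arrives

stats = {"encode_cache_hits": 9_500_000, "encode_cache_misses": 700_000}
rate = cache_hit_rate(stats)
print(f"{rate:.4f}")                         # → 0.9314, above the 90% target
if rate is not None and rate < 0.90:
    print("raise pg_ripple.dictionary_cache_size and restart")
```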

8. Federation Query Timeout

Symptom: Queries with SERVICE clauses hang or return a timeout error.

Cause: The remote SPARQL endpoint is unreachable, slow, or returning an unexpected format.

Diagnostic:

# Test the remote endpoint directly
curl -s -H "Accept: application/sparql-results+json" \
  "https://remote.example.org/sparql?query=SELECT+*+WHERE+{?s+?p+?o}+LIMIT+1"

Fix:

  • Verify network connectivity to the remote endpoint
  • Increase the federation timeout:
SET pg_ripple.federation_timeout = 60;  -- seconds
  • Check that the remote endpoint supports the required result format (SPARQL JSON Results)

9. pg_ripple_http Not Responding

Symptom: The HTTP SPARQL endpoint returns connection refused or 502 errors.

Cause: The pg_ripple_http companion service is not running, or it cannot connect to PostgreSQL.

Diagnostic:

# Check if the process is running
ps aux | grep pg_ripple_http

# Check the service logs
journalctl -u pg_ripple_http --since "10 minutes ago"

# Test the PostgreSQL connection directly
psql -h localhost -p 5432 -U pg_ripple_http -d mydb -c "SELECT 1"

Fix:

  • Start or restart the service
  • Verify the connection string in the pg_ripple_http configuration
  • Check that pg_hba.conf allows connections from the HTTP service

10. VP Table Bloat

Symptom: Disk usage grows faster than expected; pg_size_pretty(pg_total_relation_size('_pg_ripple.vp_12345')) is much larger than the triple count suggests.

Cause: Frequent deletes and re-inserts without merge cycles, or autovacuum not keeping up.

Diagnostic:

-- Check dead tuples
SELECT relname, n_dead_tup, n_live_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = '_pg_ripple'
ORDER BY n_dead_tup DESC
LIMIT 10;

Fix:

-- Force a vacuum on the bloated table
VACUUM (VERBOSE) _pg_ripple.vp_12345_main;

-- Reclaim space aggressively
VACUUM (FULL) _pg_ripple.vp_12345_main;

Tune autovacuum for VP tables:

ALTER TABLE _pg_ripple.vp_12345_delta
  SET (autovacuum_vacuum_scale_factor = 0.01);

11. Bulk Load Slower Than Expected

Symptom: pg_ripple.load_turtle() or pg_ripple.load_ntriples() runs much slower than the documented 50K–200K triples/sec.

Cause: Small batch sizes, synchronous commit overhead, or insufficient work_mem.

Diagnostic:

SHOW synchronous_commit;
SHOW work_mem;
SHOW maintenance_work_mem;

Fix:

-- Disable synchronous commit for bulk loads
SET synchronous_commit = off;

-- Increase work memory
SET work_mem = '256MB';
SET maintenance_work_mem = '2GB';

-- Use the batch loading functions
SELECT pg_ripple.load_turtle_file('/path/to/data.ttl');

synchronous_commit = off

Disabling synchronous commit risks losing the last few transactions on a crash. Only use this for bulk loads that can be re-run.


12. RDF-Star Parse Error

Symptom: Loading RDF-star data fails with unexpected token or invalid quoted triple.

Cause: The input file uses RDF-star syntax (<<>>) but the parser is not in RDF-star mode, or the syntax is malformed.

Diagnostic: Check the file around the reported line number for syntax issues. Common problems:

  • Nested <<>> without proper whitespace
  • Missing datatype on literal objects inside quoted triples
  • Using Turtle-star syntax in N-Triples files (or vice versa)

Fix: Verify the file uses the correct format. For Turtle-star:

<<:Alice :knows :Bob>> :since "2024"^^xsd:gYear .

For N-Triples-star, every term must be fully qualified — no prefixes.


13. SHACL Validation Queue Backlog

Symptom: pg_ripple.validation_queue_depth() returns a large number; validation results are delayed.

Cause: High write throughput is generating validations faster than the async validator can process them.

Diagnostic:

SELECT pg_ripple.validation_queue_depth();
SELECT pg_ripple.stats();

Fix:

  • Increase the validation worker's processing capacity (if applicable)
  • Temporarily switch to synchronous validation during low-traffic periods:
SET pg_ripple.enforce_constraints = 'error';
  • Reduce write batch sizes to give the validator time to catch up

14. Plan Cache Thrashing

Symptom: SPARQL query latency is inconsistent. The first execution of a query pattern is slow, but subsequent runs are fast — then it becomes slow again.

Cause: The plan cache (pg_ripple.plan_cache_size) is too small for the number of distinct query patterns. Plans are evicted and recompiled repeatedly.
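The thrashing pattern is the classic LRU worst case: a cache of capacity C cycled through more than C distinct patterns misses on every lookup. A minimal model (not pg_ripple's cache implementation) shows it:

```python
from collections import OrderedDict

class PlanCache:
    """Toy LRU plan cache: a miss stands in for a recompile (the slow path)."""
    def __init__(self, capacity):
        self.capacity, self.plans = capacity, OrderedDict()
        self.hits = self.misses = 0

    def get_plan(self, pattern):
        if pattern in self.plans:
            self.hits += 1
            self.plans.move_to_end(pattern)      # refresh LRU position
            return self.plans[pattern]
        self.misses += 1                         # recompile
        if len(self.plans) >= self.capacity:
            self.plans.popitem(last=False)       # evict least recently used
        self.plans[pattern] = f"plan:{pattern}"
        return self.plans[pattern]

cache = PlanCache(capacity=3)
for _ in range(10):                              # round-robin over 4 patterns
    for p in ["q1", "q2", "q3", "q4"]:
        cache.get_plan(p)
print(cache.hits, cache.misses)                  # → 0 40: every lookup recompiles
```

One extra distinct pattern beyond the capacity is enough to drive the hit rate to zero under round-robin access, which is why a modest increase in plan_cache_size (or parameterizing queries) can eliminate the latency swings entirely.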

Diagnostic:

SHOW pg_ripple.plan_cache_size;

-- Estimate distinct query patterns in your workload
-- (application-level logging required)

Fix:

-- Increase the plan cache
SET pg_ripple.plan_cache_size = 1024;

If the number of distinct patterns exceeds any reasonable cache size, consider parameterizing queries to reduce pattern diversity.


15. "relation _pg_ripple.vp_XXXXX does not exist"

Symptom: SPARQL queries fail with a "relation does not exist" error for a specific VP table.

Cause: The predicates catalog references a VP table that was dropped or never created. This can happen after an incomplete migration or manual DDL.

Diagnostic:

-- Check the predicates catalog
SELECT id, table_oid, triple_count
FROM _pg_ripple.predicates
WHERE id = XXXXX;

-- Verify the table exists
SELECT oid FROM pg_class WHERE oid = (
  SELECT table_oid FROM _pg_ripple.predicates WHERE id = XXXXX
);

Fix:

-- Rebuild the VP table for the predicate
SELECT pg_ripple.reindex_predicate(XXXXX);

If the data is lost, the predicate entry should be removed:

DELETE FROM _pg_ripple.predicates WHERE id = XXXXX;

Manual catalog edits

Directly modifying _pg_ripple.predicates bypasses integrity checks. Only do this as a last resort after confirming the VP table is genuinely missing.


16. "permission denied for schema _pg_ripple"

Symptom: Non-superuser connections get permission errors when running SPARQL queries.

Cause: The user does not have USAGE on _pg_ripple and pg_ripple schemas.

Fix:

GRANT USAGE ON SCHEMA pg_ripple TO myuser;
GRANT USAGE ON SCHEMA _pg_ripple TO myuser;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO myuser;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO myuser;

General Diagnostic Commands

A quick-reference set of commands for any troubleshooting session:

-- Extension health
SELECT pg_ripple.canary();
SELECT pg_ripple.stats();

-- PostgreSQL activity
SELECT pid, state, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE datname = current_database();

-- Lock contention
SELECT * FROM pg_locks WHERE NOT granted;

-- Table sizes in _pg_ripple
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relnamespace = '_pg_ripple'::regnamespace
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;

-- GUC settings
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'pg_ripple.%'
ORDER BY name;

Lost Deletes After Merge (v0.37.0+)

Symptom: Triples that were deleted still appear in query results after a background merge cycle completes.

Cause: Before v0.37.0, the merge worker did not hold a per-predicate advisory lock during the delta→main swap. A DELETE that arrived after main_new was built but before the truncate of the tombstones table would have its tombstone deleted in the same truncate, leaving the triple alive in the new main.

Detection:

-- Check system health with diagnostic_report
SELECT key, value FROM pg_ripple.diagnostic_report()
WHERE key IN ('schema_version', 'merge_backlog_rows');

If schema_version is older than 0.37.0, upgrade to get the fix.

Fix:

  1. Upgrade to v0.37.0 or later:

    ALTER EXTENSION pg_ripple UPDATE TO '0.37.0';
    
  2. Verify the fix is active — diagnostic_report() reports the correct version:

    SELECT value FROM pg_ripple.diagnostic_report() WHERE key = 'schema_version';
    -- Should return: 0.37.0
    
  3. After upgrade, the merge worker acquires pg_advisory_xact_lock(pred_id) (exclusive) before the delta→main swap, and the delete path acquires pg_advisory_xact_lock_shared(pred_id) before inserting tombstones. These two lock modes are incompatible, guaranteeing serialization.

Impact: Low — requires an unlucky timing window during a merge cycle. Most deployments will not observe lost deletes in practice, but correctness-critical workloads should upgrade.
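The reason the two lock modes serialize the race can be stated as a tiny compatibility check. This models PostgreSQL's advisory-lock semantics (shared locks coexist with each other; anything involving exclusive conflicts); it is an explanatory sketch, not the extension's locking code:

```python
# Advisory lock compatibility: only SHARED+SHARED is compatible.
def conflicts(held_mode, requested_mode):
    return not (held_mode == "shared" and requested_mode == "shared")

assert conflicts("exclusive", "shared")    # merge swap in progress blocks deletes
assert conflicts("shared", "exclusive")    # an in-flight delete blocks the swap
assert not conflicts("shared", "shared")   # concurrent deletes still run in parallel
print("merge vs delete are serialized; deletes stay concurrent with each other")
```

Because every interleaving of a delete and a swap requires one side to wait, the tombstone can no longer be truncated out from under an in-flight DELETE.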

Security

pg_ripple provides multiple layers of security: PostgreSQL's native authentication and authorization, named-graph row-level security (RLS), SQL injection prevention through dictionary encoding, and secure configuration of the pg_ripple_http companion service.


Authentication and Authorization

pg_ripple relies entirely on PostgreSQL's built-in authentication (pg_hba.conf) and role-based access control. There is no separate user database.

Minimum Privileges for SPARQL Queries

-- Create a read-only role
CREATE ROLE sparql_reader LOGIN PASSWORD 'strong_password';
GRANT USAGE ON SCHEMA pg_ripple TO sparql_reader;
GRANT USAGE ON SCHEMA _pg_ripple TO sparql_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO sparql_reader;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO sparql_reader;

Minimum Privileges for Data Loading

-- Create a writer role
CREATE ROLE sparql_writer LOGIN PASSWORD 'strong_password';
GRANT USAGE ON SCHEMA pg_ripple TO sparql_writer;
GRANT USAGE ON SCHEMA _pg_ripple TO sparql_writer;
GRANT SELECT, INSERT, DELETE ON ALL TABLES IN SCHEMA _pg_ripple TO sparql_writer;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA _pg_ripple TO sparql_writer;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO sparql_writer;

Default privileges

Use ALTER DEFAULT PRIVILEGES to ensure newly created VP tables (created when new predicates are encountered) inherit the correct grants:

ALTER DEFAULT PRIVILEGES IN SCHEMA _pg_ripple
  GRANT SELECT ON TABLES TO sparql_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA _pg_ripple
  GRANT SELECT, INSERT, DELETE ON TABLES TO sparql_writer;

Named-Graph Row-Level Security

pg_ripple supports fine-grained access control at the named-graph level using PostgreSQL's row-level security (RLS) infrastructure. This allows different users to see different subsets of the knowledge graph.

Enabling Graph RLS

-- Enable RLS on all VP tables
SELECT pg_ripple.enable_graph_rls();

This creates RLS policies on every VP table (including vp_rare) that filter rows based on the g (graph) column.

Granting Graph Access

-- Grant a role access to a specific named graph
SELECT pg_ripple.grant_graph('sparql_reader', 'http://example.org/confidential');

-- Grant access to the default graph (g = 0)
SELECT pg_ripple.grant_graph('sparql_reader', '');

-- Grant access to all graphs
SELECT pg_ripple.grant_graph('sparql_reader', '*');

Revoking Graph Access

-- Revoke access to a specific graph
SELECT pg_ripple.revoke_graph('sparql_reader', 'http://example.org/confidential');

How It Works

When graph RLS is enabled:

  1. Each VP table gets an RLS policy that checks the g column against the user's allowed graph IDs
  2. The dictionary encodes graph IRIs to i64 identifiers
  3. An internal mapping table (_pg_ripple.graph_grants) stores (role, graph_id) pairs
  4. PostgreSQL enforces the policy transparently — SPARQL queries automatically filter results
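The filtering the policy performs amounts to a membership test against the grants table. This sketch uses invented names (visible, rows-as-dicts) to illustrate the mechanics; the real check runs inside PostgreSQL's RLS machinery against _pg_ripple.graph_grants:

```python
# Model of graph-grant filtering: rows carry a graph id (g), and a
# (role, graph_id) grant list decides visibility. '*' is the all-graphs grant.
WILDCARD = "*"

def visible(rows, grants, role):
    allowed = {g for (r, g) in grants if r == role}
    if WILDCARD in allowed:
        return list(rows)
    return [row for row in rows if row["g"] in allowed]

rows = [
    {"s": "alice", "o": "bob",   "g": 101},   # tenant-a graph
    {"s": "carol", "o": "dave",  "g": 102},   # tenant-b graph
    {"s": "unit",  "o": "meter", "g": 100},   # shared graph
]
grants = [("tenant_a", 101), ("tenant_a", 100),
          ("tenant_b", 102), ("tenant_b", 100)]

print([r["s"] for r in visible(rows, grants, "tenant_a")])  # → ['alice', 'unit']
```

Because the filter keys on the integer graph id, revoking a grant takes effect on the next query with no data rewrite.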

Superuser bypass

PostgreSQL superusers and roles with the BYPASSRLS attribute always bypass RLS. Table owners also bypass their own tables' policies unless the table is set to FORCE ROW LEVEL SECURITY. For production, use non-superuser, non-owner roles for application connections.

Example: Multi-Tenant Knowledge Graph

-- Create tenant roles
CREATE ROLE tenant_a LOGIN PASSWORD 'pw_a';
CREATE ROLE tenant_b LOGIN PASSWORD 'pw_b';

-- Grant base access
GRANT USAGE ON SCHEMA pg_ripple TO tenant_a, tenant_b;
GRANT USAGE ON SCHEMA _pg_ripple TO tenant_a, tenant_b;
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO tenant_a, tenant_b;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA pg_ripple TO tenant_a, tenant_b;

-- Enable graph RLS
SELECT pg_ripple.enable_graph_rls();

-- Tenant A sees only their graph
SELECT pg_ripple.grant_graph('tenant_a', 'http://example.org/tenant-a');

-- Tenant B sees only their graph
SELECT pg_ripple.grant_graph('tenant_b', 'http://example.org/tenant-b');

-- Both see shared reference data
SELECT pg_ripple.grant_graph('tenant_a', 'http://example.org/shared');
SELECT pg_ripple.grant_graph('tenant_b', 'http://example.org/shared');

Now SPARQL queries run by tenant_a will only see triples in tenant-a and shared graphs, with no application-level filtering required.


SQL Injection Prevention

pg_ripple's architecture provides strong defense against SQL injection by design.

Dictionary Encoding as a Security Layer

All SPARQL queries go through a multi-step translation pipeline:

  1. Parse: SPARQL text is parsed by spargebra into an abstract algebra tree
  2. Encode: All bound constants (IRIs, literals) are dictionary-encoded to i64 integers before SQL generation
  3. Generate: SQL is constructed using parameterized queries with integer placeholders
  4. Execute: SQL runs via pgrx::SpiClient with bound parameters

No raw strings in VP queries

Because VP tables store only BIGINT columns (s, o, g, i, source), there is no surface for string-based SQL injection. Even if a malicious IRI is passed in a SPARQL query, it is hashed to an integer before any SQL is generated.

Table Name Safety

VP table references use OID lookups from _pg_ripple.predicates, not string concatenation:

// Internal: table names are never interpolated from user input
let table_oid = predicates::get_table_oid(predicate_id)?;
// SQL uses the OID directly: FROM pg_class WHERE oid = $1

User-Facing Function Safety

Functions that accept text input (like pg_ripple.sparql()) parse the SPARQL text through spargebra, which rejects anything that is not valid SPARQL. No raw SQL is passed through.


File-Path Loaders and Superuser Requirement

Functions that read from the server's filesystem require superuser privileges:

Function | Requires Superuser | Reason
pg_ripple.load_turtle_file(path) | Yes | Reads arbitrary filesystem paths
pg_ripple.load_ntriples_file(path) | Yes | Reads arbitrary filesystem paths
pg_ripple.load_rdfxml_file(path) | Yes | Reads arbitrary filesystem paths
pg_ripple.load_turtle(text) | No | Parses in-memory text only
pg_ripple.load_ntriples(text) | No | Parses in-memory text only

Filesystem access

File-path loaders can read any file the PostgreSQL process has access to. Never grant superuser to application roles. Instead, load data as a superuser and grant read access to application roles via schema permissions.

Safe Bulk Load Pattern

-- As superuser: load the data
SELECT pg_ripple.load_turtle_file('/data/import/dataset.ttl');

-- As superuser: grant access to the app role
GRANT SELECT ON ALL TABLES IN SCHEMA _pg_ripple TO app_role;

pg_ripple_http Security

The pg_ripple_http companion service exposes a SPARQL Protocol endpoint over HTTP. Secure it appropriately.

TLS Configuration

Always run pg_ripple_http behind TLS in production:

# pg_ripple_http.toml
[server]
bind = "0.0.0.0:8443"
tls_cert = "/etc/ssl/certs/pg_ripple_http.crt"
tls_key = "/etc/ssl/private/pg_ripple_http.key"

Never expose HTTP without TLS

SPARQL queries may contain sensitive data patterns. Without TLS, queries and results are transmitted in plaintext. Always terminate TLS either at the service or at a reverse proxy.

Authentication

Configure pg_ripple_http to authenticate incoming requests:

[auth]
# HTTP Basic authentication backed by PostgreSQL roles
method = "pg_role"
# Or use a static API key
# method = "api_key"
# api_key = "your-secret-key-here"

With pg_role authentication, HTTP Basic credentials are forwarded to PostgreSQL. Graph RLS policies apply to the authenticated role.

Reverse Proxy Setup

For production, place pg_ripple_http behind a reverse proxy:

# nginx configuration
server {
    listen 443 ssl;
    server_name sparql.example.org;

    ssl_certificate /etc/ssl/certs/sparql.crt;
    ssl_certificate_key /etc/ssl/private/sparql.key;

    location /sparql {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Rate limiting (zone=sparql must be declared with a
        # limit_req_zone directive in the http block)
        limit_req zone=sparql burst=20 nodelay;
    }
}

CORS Configuration

If the SPARQL endpoint is accessed from browser applications:

[cors]
allowed_origins = ["https://app.example.org"]
allowed_methods = ["GET", "POST"]
allowed_headers = ["Content-Type", "Authorization"]
max_age = 3600

Avoid wildcard origins

Do not set allowed_origins = ["*"] in production. This allows any website to send SPARQL queries to your endpoint using the visitor's credentials.


Network Isolation

Production Topology

┌─────────────┐     TLS      ┌──────────────────┐     Unix socket     ┌─────────────┐
│  Clients    │ ────────────→ │  pg_ripple_http   │ ──────────────────→ │ PostgreSQL  │
│             │               │  (reverse proxy)  │                     │ (pg_ripple) │
└─────────────┘               └──────────────────┘                     └─────────────┘

Recommendations

  1. PostgreSQL: bind to localhost or a private network interface only. Never expose port 5432 to the public internet.
# postgresql.conf
listen_addresses = '127.0.0.1'
  2. pg_ripple_http: connect to PostgreSQL via Unix socket for lowest latency and no network exposure.

  3. Firewall rules: only allow traffic on the HTTPS port (443) from expected client networks.

# iptables example
iptables -A INPUT -p tcp --dport 443 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j DROP
iptables -A INPUT -p tcp --dport 5432 -j DROP
  4. pg_hba.conf: restrict connections by source IP and authentication method:
# TYPE  DATABASE  USER              ADDRESS        METHOD
local   all       postgres                         peer
host    mydb      pg_ripple_http    127.0.0.1/32   scram-sha-256
host    mydb      sparql_reader     10.0.0.0/8     scram-sha-256
host    all       all               0.0.0.0/0      reject

Use scram-sha-256

Always use scram-sha-256 authentication (the default in PostgreSQL 18). Avoid md5 and never use trust in production.


Security Checklist

[ ] shared_preload_libraries includes only trusted extensions
[ ] Non-superuser roles used for all application connections
[ ] Graph RLS enabled for multi-tenant deployments
[ ] pg_hba.conf restricts connections to known networks
[ ] TLS enabled on pg_ripple_http or reverse proxy
[ ] File-path loaders restricted to superuser only (default)
[ ] synchronous_commit enabled for production (not off)
[ ] Connection pooler uses scram-sha-256
[ ] CORS origins are not wildcarded
[ ] PostgreSQL logs enabled for audit trail
[ ] Regular security updates for PostgreSQL and pg_ripple

Audit Logging

Enable PostgreSQL's logging to maintain an audit trail:

# postgresql.conf
log_statement = 'all'          # or 'ddl' for schema changes only
log_connections = on
log_disconnections = on
log_line_prefix = '%t [%p] %u@%d '

For fine-grained audit logging, consider the pgaudit extension alongside pg_ripple.

SPARQL query logging

pg_ripple logs the generated SQL via PostgreSQL's standard statement logging. To see the original SPARQL text, enable log_statement = 'all' — the SPARQL text appears as the argument to pg_ripple.sparql().

SQL Function Reference

All 157 SQL functions exposed by pg_ripple, grouped by use case. Every function lives in the pg_ripple schema.

Schema qualification

All examples below qualify calls explicitly (e.g. pg_ripple.insert_triple(...)). If you run SET search_path TO pg_ripple, public; first, you can drop the pg_ripple. prefix.


Loading

Functions for inserting and bulk-loading RDF data.


insert_triple

Insert a single triple into the default graph.

pg_ripple.insert_triple(
    subject   TEXT,
    predicate TEXT,
    object    TEXT
) RETURNS BIGINT
SELECT pg_ripple.insert_triple(
    '<https://example.org/alice>',
    '<https://example.org/knows>',
    '<https://example.org/bob>'
);

load_turtle

Parse a Turtle string and load all triples into the default graph.

pg_ripple.load_turtle(
    data   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:alice ex:name "Alice" ;
         ex:knows ex:bob .
');

load_turtle_file

Load Turtle from a server-side file path.

pg_ripple.load_turtle_file(
    path   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_file('/data/ontology.ttl');

load_ntriples

Parse an N-Triples string and load all triples into the default graph.

pg_ripple.load_ntriples(
    data   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples('
<https://example.org/alice> <https://example.org/name> "Alice" .
<https://example.org/alice> <https://example.org/knows> <https://example.org/bob> .
');

load_ntriples_file

Load N-Triples from a server-side file path.

pg_ripple.load_ntriples_file(
    path   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_file('/data/dump.nt');

load_nquads

Parse an N-Quads string and load triples into their respective named graphs.

pg_ripple.load_nquads(
    data   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_nquads('
<https://example.org/alice> <https://example.org/name> "Alice" <https://example.org/g1> .
');

load_nquads_file

Load N-Quads from a server-side file path.

pg_ripple.load_nquads_file(
    path   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_nquads_file('/data/dump.nq');

load_trig

Parse a TriG string and load triples into their named graphs.

pg_ripple.load_trig(
    data   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_trig('
@prefix ex: <https://example.org/> .
ex:g1 { ex:alice ex:name "Alice" . }
');

load_trig_file

Load TriG from a server-side file path.

pg_ripple.load_trig_file(
    path   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_trig_file('/data/dataset.trig');

load_rdfxml

Parse an RDF/XML string and load all triples into the default graph.

pg_ripple.load_rdfxml(
    data   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml('
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="https://example.org/">
  <rdf:Description rdf:about="https://example.org/alice">
    <ex:name>Alice</ex:name>
  </rdf:Description>
</rdf:RDF>
');

load_rdfxml_file

Load RDF/XML from a server-side file path.

pg_ripple.load_rdfxml_file(
    path   TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_file('/data/ontology.rdf');

load_ntriples_into_graph

Parse N-Triples and load into a specific named graph.

pg_ripple.load_ntriples_into_graph(
    data  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_into_graph(
    '<https://example.org/alice> <https://example.org/name> "Alice" .',
    '<https://example.org/people>'
);

load_turtle_into_graph

Parse Turtle and load into a specific named graph.

pg_ripple.load_turtle_into_graph(
    data  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_into_graph(
    '@prefix ex: <https://example.org/> . ex:alice ex:name "Alice" .',
    '<https://example.org/people>'
);

load_rdfxml_into_graph

Parse RDF/XML and load into a specific named graph.

pg_ripple.load_rdfxml_into_graph(
    data  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_into_graph(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:ex="https://example.org/">
       <rdf:Description rdf:about="https://example.org/alice">
         <ex:name>Alice</ex:name>
       </rdf:Description>
     </rdf:RDF>',
    '<https://example.org/people>'
);

load_ntriples_file_into_graph

Load N-Triples from a server-side file into a named graph.

pg_ripple.load_ntriples_file_into_graph(
    path  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_ntriples_file_into_graph(
    '/data/people.nt',
    '<https://example.org/people>'
);

load_turtle_file_into_graph

Load Turtle from a server-side file into a named graph.

pg_ripple.load_turtle_file_into_graph(
    path  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_turtle_file_into_graph(
    '/data/people.ttl',
    '<https://example.org/people>'
);

load_rdfxml_file_into_graph

Load RDF/XML from a server-side file into a named graph.

pg_ripple.load_rdfxml_file_into_graph(
    path  TEXT,
    graph TEXT,
    strict BOOLEAN DEFAULT false
) RETURNS BIGINT
SELECT pg_ripple.load_rdfxml_file_into_graph(
    '/data/people.rdf',
    '<https://example.org/people>'
);

load_owl_ontology

Load an OWL ontology from Turtle, extracting class and property declarations for use by the Datalog reasoner.

pg_ripple.load_owl_ontology(
    data TEXT
) RETURNS BIGINT
SELECT pg_ripple.load_owl_ontology('
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <https://example.org/> .
ex:Person a owl:Class .
ex:knows a owl:ObjectProperty ;
    owl:inverseOf ex:knownBy .
');

apply_patch

Apply an RDF patch (additions and deletions) atomically.

pg_ripple.apply_patch(
    additions TEXT,
    deletions TEXT
) RETURNS BIGINT
SELECT pg_ripple.apply_patch(
    '<https://example.org/alice> <https://example.org/age> "31"^^<http://www.w3.org/2001/XMLSchema#integer> .',
    '<https://example.org/alice> <https://example.org/age> "30"^^<http://www.w3.org/2001/XMLSchema#integer> .'
);
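
Because the patch is applied atomically, the example above replaces Alice's age in a single step: the old triple is deleted and the new one inserted, or neither happens. A quick way to confirm the result (a sketch using find_triples, documented under Querying):

-- Only the new age value should remain after the patch.
SELECT * FROM pg_ripple.find_triples(
    '<https://example.org/alice>',
    '<https://example.org/age>',
    NULL
);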

Querying

Functions for querying triples with SPARQL and text search.


sparql

Execute a SPARQL SELECT query and return results as a set of JSON objects.

pg_ripple.sparql(
    query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql('
    PREFIX ex: <https://example.org/>
    SELECT ?name WHERE { ex:alice ex:name ?name }
');

sparql_ask

Execute a SPARQL ASK query and return a boolean result.

pg_ripple.sparql_ask(
    query TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.sparql_ask('
    PREFIX ex: <https://example.org/>
    ASK { ex:alice ex:knows ex:bob }
');

sparql_explain

Return the SQL execution plan for a SPARQL query without executing it.

pg_ripple.sparql_explain(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_explain('
    PREFIX ex: <https://example.org/>
    SELECT ?x WHERE { ?x ex:knows ex:bob }
');

explain_sparql

Return a detailed query plan showing SPARQL algebra and generated SQL.

pg_ripple.explain_sparql(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.explain_sparql('
    PREFIX ex: <https://example.org/>
    SELECT ?x ?y WHERE { ?x ex:knows ?y }
');

sparql_construct

Execute a SPARQL CONSTRUCT query and return triples as JSON.

pg_ripple.sparql_construct(
    query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql_construct('
    PREFIX ex: <https://example.org/>
    CONSTRUCT { ?x ex:friendOf ?y }
    WHERE { ?x ex:knows ?y }
');

sparql_describe

Execute a SPARQL DESCRIBE query and return all triples about a resource.

pg_ripple.sparql_describe(
    query TEXT
) RETURNS SETOF JSON
SELECT * FROM pg_ripple.sparql_describe('
    PREFIX ex: <https://example.org/>
    DESCRIBE ex:alice
');

sparql_construct_turtle

Execute a SPARQL CONSTRUCT query and return the result as a Turtle string.

pg_ripple.sparql_construct_turtle(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_construct_turtle('
    PREFIX ex: <https://example.org/>
    CONSTRUCT { ?x ex:friendOf ?y }
    WHERE { ?x ex:knows ?y }
');

sparql_construct_jsonld

Execute a SPARQL CONSTRUCT query and return the result as a JSON-LD string.

pg_ripple.sparql_construct_jsonld(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_construct_jsonld('
    PREFIX ex: <https://example.org/>
    CONSTRUCT { ?x ex:friendOf ?y }
    WHERE { ?x ex:knows ?y }
');

sparql_describe_turtle

Execute a SPARQL DESCRIBE query and return the result as Turtle.

pg_ripple.sparql_describe_turtle(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_describe_turtle('
    PREFIX ex: <https://example.org/>
    DESCRIBE ex:alice
');

sparql_describe_jsonld

Execute a SPARQL DESCRIBE query and return the result as JSON-LD.

pg_ripple.sparql_describe_jsonld(
    query TEXT
) RETURNS TEXT
SELECT pg_ripple.sparql_describe_jsonld('
    PREFIX ex: <https://example.org/>
    DESCRIBE ex:alice
');

sparql_update

Execute a SPARQL Update operation (INSERT DATA, DELETE DATA, etc.).

pg_ripple.sparql_update(
    query TEXT
) RETURNS BIGINT
SELECT pg_ripple.sparql_update('
    PREFIX ex: <https://example.org/>
    INSERT DATA { ex:alice ex:age 30 }
');

find_triples

Find triples matching a pattern in the default graph. Pass NULL for wildcards.

pg_ripple.find_triples(
    subject   TEXT DEFAULT NULL,
    predicate TEXT DEFAULT NULL,
    object    TEXT DEFAULT NULL
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT)
SELECT * FROM pg_ripple.find_triples(
    '<https://example.org/alice>', NULL, NULL
);

find_triples_in_graph

Find triples matching a pattern in a specific named graph.

pg_ripple.find_triples_in_graph(
    subject   TEXT DEFAULT NULL,
    predicate TEXT DEFAULT NULL,
    object    TEXT DEFAULT NULL,
    graph     TEXT DEFAULT NULL
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, graph TEXT)
SELECT * FROM pg_ripple.find_triples_in_graph(
    NULL, NULL, NULL, '<https://example.org/people>'
);

triple_count

Return the total number of triples in the default graph.

pg_ripple.triple_count() RETURNS BIGINT
SELECT pg_ripple.triple_count();

triple_count_in_graph

Return the number of triples in a specific named graph.

pg_ripple.triple_count_in_graph(
    graph TEXT
) RETURNS BIGINT
SELECT pg_ripple.triple_count_in_graph('<https://example.org/people>');

fts_index

Build or rebuild the full-text search index over literal values.

pg_ripple.fts_index() RETURNS VOID
SELECT pg_ripple.fts_index();

fts_search

Search for triples containing a term in literal values via full-text search.

pg_ripple.fts_search(
    query TEXT,
    limit_rows INTEGER DEFAULT 100
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, rank REAL)
SELECT * FROM pg_ripple.fts_search('knowledge graph', 10);

Graphs

Functions for managing named graphs.


create_graph

Create a named graph.

pg_ripple.create_graph(
    graph TEXT
) RETURNS VOID
SELECT pg_ripple.create_graph('<https://example.org/people>');

drop_graph

Drop a named graph and all its triples.

pg_ripple.drop_graph(
    graph TEXT
) RETURNS VOID
SELECT pg_ripple.drop_graph('<https://example.org/people>');

list_graphs

List all named graphs.

pg_ripple.list_graphs() RETURNS TABLE(graph TEXT, triple_count BIGINT)
SELECT * FROM pg_ripple.list_graphs();

clear_graph

Remove all triples from a graph without dropping it.

pg_ripple.clear_graph(
    graph TEXT
) RETURNS BIGINT
SELECT pg_ripple.clear_graph('<https://example.org/people>');

Dictionary

Functions for interacting with the dictionary encoder that maps IRIs, blank nodes, and literals to integer IDs.

Internal use

Most users never need to call dictionary functions directly. They are useful for debugging, performance tuning, and understanding storage internals.
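
For example, a term encoded and then decoded should round-trip unchanged (a sketch; the actual ID assigned depends on insertion order):

SELECT pg_ripple.decode_id(
    pg_ripple.encode_term('<https://example.org/alice>')
);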


encode_term

Encode an IRI, literal, or blank node to its integer ID.

pg_ripple.encode_term(
    term TEXT
) RETURNS BIGINT
SELECT pg_ripple.encode_term('<https://example.org/alice>');

decode_id

Decode an integer ID back to its string representation.

pg_ripple.decode_id(
    id BIGINT
) RETURNS TEXT
SELECT pg_ripple.decode_id(42);

encode_triple

Encode a full triple (subject, predicate, object) to integer IDs.

pg_ripple.encode_triple(
    subject   TEXT,
    predicate TEXT,
    object    TEXT
) RETURNS TABLE(s BIGINT, p BIGINT, o BIGINT)
SELECT * FROM pg_ripple.encode_triple(
    '<https://example.org/alice>',
    '<https://example.org/knows>',
    '<https://example.org/bob>'
);

decode_triple

Decode a triple from integer IDs back to string form.

pg_ripple.decode_triple(
    s BIGINT,
    p BIGINT,
    o BIGINT
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT)
SELECT * FROM pg_ripple.decode_triple(1, 2, 3);

decode_id_full

Decode an integer ID returning the full term with type information.

pg_ripple.decode_id_full(
    id BIGINT
) RETURNS JSON
SELECT pg_ripple.decode_id_full(42);

lookup_iri

Look up the integer ID for a specific IRI without inserting.

pg_ripple.lookup_iri(
    iri TEXT
) RETURNS BIGINT
SELECT pg_ripple.lookup_iri('<https://example.org/alice>');

dictionary_stats

Return statistics about the dictionary table.

pg_ripple.dictionary_stats() RETURNS JSON
SELECT pg_ripple.dictionary_stats();

prewarm_dictionary_hot

Load the most frequently accessed dictionary entries into the shared cache.

pg_ripple.prewarm_dictionary_hot(
    limit_rows INTEGER DEFAULT 10000
) RETURNS INTEGER
SELECT pg_ripple.prewarm_dictionary_hot(50000);

cache_stats

Return cache hit/miss statistics for the dictionary LRU cache.

pg_ripple.cache_stats() RETURNS JSON
SELECT pg_ripple.cache_stats();

Prefixes

Functions for managing namespace prefix abbreviations.


register_prefix

Register a namespace prefix for use in SPARQL queries and output.

pg_ripple.register_prefix(
    prefix TEXT,
    iri    TEXT
) RETURNS VOID
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');

prefixes

List all registered prefixes.

pg_ripple.prefixes() RETURNS TABLE(prefix TEXT, iri TEXT)
SELECT * FROM pg_ripple.prefixes();

Validating

Functions for loading SHACL shapes, validating data, and managing async validation.


load_shacl

Load SHACL shapes from a Turtle string.

pg_ripple.load_shacl(
    shapes TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_shacl('
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ; sh:minCount 1 ; sh:datatype xsd:string ] .
');

validate

Run SHACL validation and return a validation report.

pg_ripple.validate() RETURNS TABLE(
    focus_node TEXT,
    shape      TEXT,
    path       TEXT,
    severity   TEXT,
    message    TEXT
)
SELECT * FROM pg_ripple.validate();
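
A typical end-to-end check loads a shape, inserts non-conforming data, then validates. A sketch, assuming the PersonShape from the load_shacl example above is loaded (a Person with no ex:name violates sh:minCount 1):

SELECT pg_ripple.load_turtle('
@prefix ex: <https://example.org/> .
ex:dave a ex:Person .
');
SELECT * FROM pg_ripple.validate();  -- ex:dave should appear as a focus_node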

list_shapes

List all loaded SHACL shapes.

pg_ripple.list_shapes() RETURNS TABLE(shape TEXT, target TEXT, property_count INTEGER)
SELECT * FROM pg_ripple.list_shapes();

drop_shape

Drop a SHACL shape by IRI.

pg_ripple.drop_shape(
    shape TEXT
) RETURNS VOID
SELECT pg_ripple.drop_shape('<https://example.org/PersonShape>');

enable_shacl_monitors

Enable trigger-based SHACL validation on all VP tables.

pg_ripple.enable_shacl_monitors() RETURNS VOID
SELECT pg_ripple.enable_shacl_monitors();

enable_shacl_dag_monitors

Enable DAG-aware SHACL monitors using pg_trickle for async validation.

pg_ripple.enable_shacl_dag_monitors() RETURNS VOID
SELECT pg_ripple.enable_shacl_dag_monitors();

disable_shacl_dag_monitors

Disable DAG-aware SHACL monitors.

pg_ripple.disable_shacl_dag_monitors() RETURNS VOID
SELECT pg_ripple.disable_shacl_dag_monitors();

list_shacl_dag_monitors

List all active DAG SHACL monitors.

pg_ripple.list_shacl_dag_monitors() RETURNS TABLE(shape TEXT, predicate TEXT, enabled BOOLEAN)
SELECT * FROM pg_ripple.list_shacl_dag_monitors();

process_validation_queue

Process pending items in the async SHACL validation queue.

pg_ripple.process_validation_queue(
    batch_size INTEGER DEFAULT 100
) RETURNS INTEGER
SELECT pg_ripple.process_validation_queue(500);

validation_queue_length

Return the number of items pending in the validation queue.

pg_ripple.validation_queue_length() RETURNS BIGINT
SELECT pg_ripple.validation_queue_length();

dead_letter_count

Return the number of items in the validation dead-letter queue.

pg_ripple.dead_letter_count() RETURNS BIGINT
SELECT pg_ripple.dead_letter_count();

dead_letter_queue

Return the contents of the validation dead-letter queue.

pg_ripple.dead_letter_queue() RETURNS TABLE(
    id         BIGINT,
    triple_id  BIGINT,
    shape      TEXT,
    error      TEXT,
    created_at TIMESTAMPTZ
)
SELECT * FROM pg_ripple.dead_letter_queue();

drain_dead_letter_queue

Remove and return all items from the dead-letter queue.

pg_ripple.drain_dead_letter_queue() RETURNS INTEGER
SELECT pg_ripple.drain_dead_letter_queue();

Reasoning

Functions for Datalog rule management and inference.

Built-in rule sets

pg_ripple ships with RDFS and OWL RL rule sets. Load them with load_rules_builtin('rdfs') or load_rules_builtin('owl-rl').
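
A minimal reasoning workflow using the built-ins might look like the following sketch (whether loading a rule set also enables it is not assumed here, so it is enabled explicitly):

SELECT pg_ripple.load_rules_builtin('rdfs');  -- load RDFS entailment rules
SELECT pg_ripple.enable_rule_set('rdfs');     -- activate them
SELECT pg_ripple.infer();                     -- materialize derived triples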


load_rules

Load a named Datalog rule set from a program string.

pg_ripple.load_rules(
    name    TEXT,
    program TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_rules('transitive-knows', '
    knows(X, Z) :- knows(X, Y), knows(Y, Z).
');

load_rules_builtin

Load a built-in rule set (rdfs, owl-rl).

pg_ripple.load_rules_builtin(
    name TEXT
) RETURNS INTEGER
SELECT pg_ripple.load_rules_builtin('owl-rl');

list_rules

List all loaded rule sets.

pg_ripple.list_rules() RETURNS TABLE(name TEXT, rule_count INTEGER, enabled BOOLEAN)
SELECT * FROM pg_ripple.list_rules();

drop_rules

Drop a rule set by name.

pg_ripple.drop_rules(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_rules('transitive-knows');

enable_rule_set

Enable a rule set for inference.

pg_ripple.enable_rule_set(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.enable_rule_set('owl-rl');

disable_rule_set

Disable a rule set (triples already inferred are not removed).

pg_ripple.disable_rule_set(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.disable_rule_set('owl-rl');

infer

Run materialization using all enabled rule sets (semi-naive evaluation).

pg_ripple.infer() RETURNS BIGINT
SELECT pg_ripple.infer();

infer_with_stats

Run materialization and return iteration statistics.

pg_ripple.infer_with_stats() RETURNS JSON
SELECT pg_ripple.infer_with_stats();

infer_goal

Run goal-directed inference for a specific query pattern.

pg_ripple.infer_goal(
    subject   TEXT DEFAULT NULL,
    predicate TEXT DEFAULT NULL,
    object    TEXT DEFAULT NULL
) RETURNS BIGINT
SELECT pg_ripple.infer_goal(
    '<https://example.org/alice>',
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    NULL
);

infer_agg

Run Datalog aggregation rules (min, max, sum, count).

pg_ripple.infer_agg() RETURNS BIGINT
SELECT pg_ripple.infer_agg();

infer_demand

Run demand-driven inference with magic sets optimization.

pg_ripple.infer_demand(
    subject   TEXT DEFAULT NULL,
    predicate TEXT DEFAULT NULL,
    object    TEXT DEFAULT NULL
) RETURNS BIGINT
SELECT pg_ripple.infer_demand(
    '<https://example.org/alice>',
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
    NULL
);

infer_wfs

Run well-founded semantics evaluation for programs with negation.

pg_ripple.infer_wfs() RETURNS BIGINT
SELECT pg_ripple.infer_wfs();

tabling_stats

Return statistics about the tabling memo store.

pg_ripple.tabling_stats() RETURNS JSON
SELECT pg_ripple.tabling_stats();

rule_plan_cache_stats

Return statistics about the Datalog rule plan cache.

pg_ripple.rule_plan_cache_stats() RETURNS JSON
SELECT pg_ripple.rule_plan_cache_stats();

check_constraints

Run Datalog constraint rules and report violations.

pg_ripple.check_constraints() RETURNS TABLE(rule TEXT, subject TEXT, message TEXT)
SELECT * FROM pg_ripple.check_constraints();

Exporting

Functions for serializing triples to various formats.


export_ntriples

Export all triples as an N-Triples string.

pg_ripple.export_ntriples() RETURNS TEXT
SELECT pg_ripple.export_ntriples();

export_nquads

Export all triples (with named graphs) as an N-Quads string.

pg_ripple.export_nquads() RETURNS TEXT
SELECT pg_ripple.export_nquads();

export_turtle

Export all triples as a Turtle string.

pg_ripple.export_turtle() RETURNS TEXT
SELECT pg_ripple.export_turtle();

export_jsonld

Export all triples as a JSON-LD string.

pg_ripple.export_jsonld() RETURNS TEXT
SELECT pg_ripple.export_jsonld();

export_turtle_stream

Export triples as a streaming set of Turtle chunks for large datasets.

pg_ripple.export_turtle_stream(
    batch_size INTEGER DEFAULT 1000
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_turtle_stream(5000);

export_jsonld_stream

Export triples as a streaming set of JSON-LD chunks for large datasets.

pg_ripple.export_jsonld_stream(
    batch_size INTEGER DEFAULT 1000
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_jsonld_stream(5000);

export_graphrag_entities

Export entities in GraphRAG entity format for Microsoft GraphRAG or compatible tools.

pg_ripple.export_graphrag_entities() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_entities();

export_graphrag_relationships

Export relationships in GraphRAG relationship format.

pg_ripple.export_graphrag_relationships() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_relationships();

export_graphrag_text_units

Export text units in GraphRAG text-unit format.

pg_ripple.export_graphrag_text_units() RETURNS SETOF JSON
SELECT * FROM pg_ripple.export_graphrag_text_units();

JSON-LD Framing

Functions for JSON-LD framing and tree-shaped output.


jsonld_frame_to_sparql

Convert a JSON-LD frame to a SPARQL CONSTRUCT query.

pg_ripple.jsonld_frame_to_sparql(
    frame JSON
) RETURNS TEXT
SELECT pg_ripple.jsonld_frame_to_sparql('{
    "@type": "https://example.org/Person",
    "https://example.org/name": {}
}'::json);

export_jsonld_framed

Export triples shaped by a JSON-LD frame as a JSON-LD string.

pg_ripple.export_jsonld_framed(
    frame JSON
) RETURNS TEXT
SELECT pg_ripple.export_jsonld_framed('{
    "@type": "https://example.org/Person",
    "https://example.org/name": {},
    "https://example.org/knows": { "@type": "https://example.org/Person" }
}'::json);

export_jsonld_framed_stream

Export framed JSON-LD as a streaming set of chunks.

pg_ripple.export_jsonld_framed_stream(
    frame      JSON,
    batch_size INTEGER DEFAULT 100
) RETURNS SETOF TEXT
SELECT * FROM pg_ripple.export_jsonld_framed_stream('{
    "@type": "https://example.org/Person"
}'::json, 50);

jsonld_frame

Apply a JSON-LD frame to an existing JSON-LD document.

pg_ripple.jsonld_frame(
    document JSON,
    frame    JSON
) RETURNS JSON
SELECT pg_ripple.jsonld_frame(
    pg_ripple.export_jsonld()::json,
    '{"@type": "https://example.org/Person"}'::json
);

Views

Functions for creating and managing materialized SPARQL, Datalog, CONSTRUCT, DESCRIBE, ASK, and framing views.

View lifecycle

Each pg_ripple view is backed by a regular PostgreSQL table or view. Use the corresponding drop_*_view function to remove one. Dropping the extension also removes all views.


create_sparql_view

Create a PostgreSQL view backed by a SPARQL SELECT query.

pg_ripple.create_sparql_view(
    name  TEXT,
    query TEXT
) RETURNS VOID
SELECT pg_ripple.create_sparql_view('people', '
    PREFIX ex: <https://example.org/>
    SELECT ?name WHERE { ?person a ex:Person ; ex:name ?name }
');
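
Once created, the view can be queried like any other PostgreSQL relation and joined with ordinary tables. A sketch, assuming each SPARQL SELECT variable maps to a column of the same name:

SELECT name FROM people ORDER BY name;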

drop_sparql_view

Drop a SPARQL view.

pg_ripple.drop_sparql_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_sparql_view('people');

list_sparql_views

List all SPARQL views.

pg_ripple.list_sparql_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_sparql_views();

create_datalog_view

Create a PostgreSQL view backed by a Datalog rule.

pg_ripple.create_datalog_view(
    name  TEXT,
    rule  TEXT
) RETURNS VOID
SELECT pg_ripple.create_datalog_view('ancestor',
    'ancestor(X, Y) :- parent(X, Y).
     ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).'
);

create_datalog_view_from_rule_set

Create a view from a named rule set's head predicate.

pg_ripple.create_datalog_view_from_rule_set(
    view_name     TEXT,
    rule_set_name TEXT,
    head_predicate TEXT
) RETURNS VOID
SELECT pg_ripple.create_datalog_view_from_rule_set(
    'inferred_types', 'owl-rl', 'rdf:type'
);

drop_datalog_view

Drop a Datalog view.

pg_ripple.drop_datalog_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_datalog_view('ancestor');

list_datalog_views

List all Datalog views.

pg_ripple.list_datalog_views() RETURNS TABLE(name TEXT, rule_set TEXT, head TEXT)
SELECT * FROM pg_ripple.list_datalog_views();

create_framing_view

Create a PostgreSQL view backed by a JSON-LD frame.

pg_ripple.create_framing_view(
    name  TEXT,
    frame JSON
) RETURNS VOID
SELECT pg_ripple.create_framing_view('person_frame', '{
    "@type": "https://example.org/Person",
    "https://example.org/name": {}
}'::json);

drop_framing_view

Drop a framing view.

pg_ripple.drop_framing_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_framing_view('person_frame');

list_framing_views

List all framing views.

pg_ripple.list_framing_views() RETURNS TABLE(name TEXT, frame JSON)
SELECT * FROM pg_ripple.list_framing_views();

create_construct_view

Create a view backed by a SPARQL CONSTRUCT query.

pg_ripple.create_construct_view(
    name  TEXT,
    query TEXT
) RETURNS VOID
SELECT pg_ripple.create_construct_view('friends', '
    PREFIX ex: <https://example.org/>
    CONSTRUCT { ?a ex:friendOf ?b }
    WHERE { ?a ex:knows ?b }
');

drop_construct_view

Drop a CONSTRUCT view.

pg_ripple.drop_construct_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_construct_view('friends');

list_construct_views

List all CONSTRUCT views.

pg_ripple.list_construct_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_construct_views();

create_describe_view

Create a view backed by a SPARQL DESCRIBE query.

pg_ripple.create_describe_view(
    name  TEXT,
    query TEXT
) RETURNS VOID
SELECT pg_ripple.create_describe_view('alice_detail', '
    PREFIX ex: <https://example.org/>
    DESCRIBE ex:alice
');

drop_describe_view

Drop a DESCRIBE view.

pg_ripple.drop_describe_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_describe_view('alice_detail');

list_describe_views

List all DESCRIBE views.

pg_ripple.list_describe_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_describe_views();

create_ask_view

Create a view backed by a SPARQL ASK query.

pg_ripple.create_ask_view(
    name  TEXT,
    query TEXT
) RETURNS VOID
SELECT pg_ripple.create_ask_view('has_alice', '
    PREFIX ex: <https://example.org/>
    ASK { ex:alice ex:name ?n }
');

drop_ask_view

Drop an ASK view.

pg_ripple.drop_ask_view(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.drop_ask_view('has_alice');

list_ask_views

List all ASK views.

pg_ripple.list_ask_views() RETURNS TABLE(name TEXT, query TEXT)
SELECT * FROM pg_ripple.list_ask_views();

create_extvp

Create an Extended VP (ExtVP) index for a predicate pair to accelerate star-pattern joins.

pg_ripple.create_extvp(
    predicate1 TEXT,
    predicate2 TEXT
) RETURNS VOID
SELECT pg_ripple.create_extvp(
    '<https://example.org/name>',
    '<https://example.org/age>'
);

drop_extvp

Drop an ExtVP index.

pg_ripple.drop_extvp(
    predicate1 TEXT,
    predicate2 TEXT
) RETURNS VOID
SELECT pg_ripple.drop_extvp(
    '<https://example.org/name>',
    '<https://example.org/age>'
);

list_extvp

List all ExtVP indices.

pg_ripple.list_extvp() RETURNS TABLE(predicate1 TEXT, predicate2 TEXT, row_count BIGINT)
SELECT * FROM pg_ripple.list_extvp();

Federation

Functions for managing SPARQL federation endpoints.


register_endpoint

Register a remote SPARQL endpoint for federated queries.

pg_ripple.register_endpoint(
    name TEXT,
    url  TEXT
) RETURNS VOID
SELECT pg_ripple.register_endpoint('wikidata', 'https://query.wikidata.org/sparql');
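
Registered endpoints can then participate in federated queries. A sketch, assuming pg_ripple supports the standard SPARQL 1.1 SERVICE clause:

SELECT * FROM pg_ripple.sparql('
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?item WHERE {
        SERVICE <https://query.wikidata.org/sparql> {
            ?item wdt:P31 ?class .
        }
    } LIMIT 10
');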

set_endpoint_complexity

Set the complexity weight for a federated endpoint (used by the query planner).

pg_ripple.set_endpoint_complexity(
    name       TEXT,
    complexity REAL
) RETURNS VOID
SELECT pg_ripple.set_endpoint_complexity('wikidata', 2.5);

remove_endpoint

Remove a registered endpoint.

pg_ripple.remove_endpoint(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.remove_endpoint('wikidata');

disable_endpoint

Temporarily disable an endpoint without removing it.

pg_ripple.disable_endpoint(
    name TEXT
) RETURNS VOID
SELECT pg_ripple.disable_endpoint('wikidata');

list_endpoints

List all registered federation endpoints.

pg_ripple.list_endpoints() RETURNS TABLE(name TEXT, url TEXT, enabled BOOLEAN, complexity REAL)
SELECT * FROM pg_ripple.list_endpoints();

register_vector_endpoint

Register a vector similarity search endpoint for hybrid SPARQL+vector queries.

pg_ripple.register_vector_endpoint(
    name  TEXT,
    url   TEXT,
    model TEXT
) RETURNS VOID
SELECT pg_ripple.register_vector_endpoint(
    'openai', 'https://api.openai.com/v1/embeddings', 'text-embedding-3-small'
);

Vectors

Functions for vector embeddings, similarity search, and RAG retrieval.

pgvector required

All vector functions require pgvector to be installed. Set pg_ripple.pgvector_enabled = off to disable without uninstalling.
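
A common pattern combines these functions: generate embeddings for entities matching a SPARQL pattern, then query for nearest neighbors. A sketch, assuming an embedding model is already configured:

SELECT pg_ripple.embed_entities('
    PREFIX ex: <https://example.org/>
    SELECT ?e WHERE { ?e a ex:Person }
');
SELECT * FROM pg_ripple.similar_entities('<https://example.org/alice>');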


store_embedding

Store a precomputed embedding vector for an entity.

pg_ripple.store_embedding(
    entity TEXT,
    model  TEXT,
    vector VECTOR
) RETURNS VOID
SELECT pg_ripple.store_embedding(
    '<https://example.org/alice>',
    'text-embedding-3-small',
    '[0.1, 0.2, 0.3]'::vector
);

similar_entities

Find entities similar to a given entity by vector distance.

pg_ripple.similar_entities(
    entity TEXT,
    model  TEXT DEFAULT 'text-embedding-3-small',
    k      INTEGER DEFAULT 10
) RETURNS TABLE(entity TEXT, distance REAL)
SELECT * FROM pg_ripple.similar_entities('<https://example.org/alice>');

embed_entities

Generate and store embeddings for entities matching a SPARQL pattern.

pg_ripple.embed_entities(
    query TEXT,
    model TEXT DEFAULT 'text-embedding-3-small'
) RETURNS INTEGER
SELECT pg_ripple.embed_entities('
    PREFIX ex: <https://example.org/>
    SELECT ?entity WHERE { ?entity a ex:Person }
');

refresh_embeddings

Recompute embeddings for entities whose underlying data has changed.

pg_ripple.refresh_embeddings(
    model TEXT DEFAULT 'text-embedding-3-small'
) RETURNS INTEGER
SELECT pg_ripple.refresh_embeddings();

list_embedding_models

List all embedding models with stored vectors.

pg_ripple.list_embedding_models() RETURNS TABLE(model TEXT, entity_count BIGINT, dimensions INTEGER)
SELECT * FROM pg_ripple.list_embedding_models();

add_embedding_triples

Materialize similarity relationships as RDF triples.

pg_ripple.add_embedding_triples(
    model     TEXT DEFAULT 'text-embedding-3-small',
    threshold REAL DEFAULT 0.8,
    predicate TEXT DEFAULT '<https://example.org/similarTo>'
) RETURNS BIGINT
SELECT pg_ripple.add_embedding_triples('text-embedding-3-small', 0.9);

contextualize_entity

Return a text summary of an entity's neighborhood for use as LLM context.

pg_ripple.contextualize_entity(
    entity TEXT,
    hops   INTEGER DEFAULT 2
) RETURNS TEXT
SELECT pg_ripple.contextualize_entity('<https://example.org/alice>', 3);

hybrid_search

Combine SPARQL graph pattern matching with vector similarity using Reciprocal Rank Fusion.

pg_ripple.hybrid_search(
    sparql_query TEXT,
    vector_query TEXT,
    k            INTEGER DEFAULT 10,
    alpha        REAL DEFAULT 0.5
) RETURNS TABLE(entity TEXT, score REAL, sparql_rank INTEGER, vector_rank INTEGER)
SELECT * FROM pg_ripple.hybrid_search(
    'PREFIX ex: <https://example.org/>
     SELECT ?person WHERE { ?person a ex:Person ; ex:knows ex:bob }',
    'researchers in knowledge graphs',
    10,
    0.7
);

rag_retrieve

Retrieve context for RAG (Retrieval-Augmented Generation) using graph + vector search.

pg_ripple.rag_retrieve(
    query TEXT,
    k     INTEGER DEFAULT 5,
    hops  INTEGER DEFAULT 2
) RETURNS TABLE(entity TEXT, context TEXT, score REAL)
SELECT * FROM pg_ripple.rag_retrieve('Who knows about knowledge graphs?', 5, 2);

Admin

Functions for maintenance, statistics, and administrative operations.


compact

Compact the triple store by removing unreferenced VP tables and dictionary entries.

pg_ripple.compact() RETURNS JSON
SELECT pg_ripple.compact();

vacuum

Vacuum all VP tables to reclaim space and update statistics.

pg_ripple.vacuum() RETURNS VOID
SELECT pg_ripple.vacuum();

reindex

Rebuild all B-tree and BRIN indices on VP tables.

pg_ripple.reindex() RETURNS VOID
SELECT pg_ripple.reindex();

vacuum_dictionary

Vacuum the dictionary table, removing entries not referenced by any VP table.

pg_ripple.vacuum_dictionary() RETURNS BIGINT
SELECT pg_ripple.vacuum_dictionary();

htap_migrate_predicate

Migrate a predicate from the flat VP layout to the HTAP delta/main layout.

pg_ripple.htap_migrate_predicate(
    predicate TEXT
) RETURNS VOID
SELECT pg_ripple.htap_migrate_predicate('<https://example.org/knows>');

stats

Return overall triple store statistics.

pg_ripple.stats() RETURNS JSON
SELECT pg_ripple.stats();

canary

Health-check function that returns true if the extension is loaded and functional.

pg_ripple.canary() RETURNS BOOLEAN
SELECT pg_ripple.canary();

enable_live_statistics

Enable real-time statistics collection for VP tables.

pg_ripple.enable_live_statistics() RETURNS VOID
SELECT pg_ripple.enable_live_statistics();

promote_rare_predicates

Promote predicates from vp_rare to dedicated VP tables if they exceed the threshold.

pg_ripple.promote_rare_predicates() RETURNS INTEGER
SELECT pg_ripple.promote_rare_predicates();

deduplicate_predicate

Remove duplicate triples from a specific predicate's VP table.

pg_ripple.deduplicate_predicate(
    predicate TEXT
) RETURNS BIGINT
SELECT pg_ripple.deduplicate_predicate('<https://example.org/knows>');

deduplicate_all

Remove duplicate triples from all VP tables.

pg_ripple.deduplicate_all() RETURNS BIGINT
SELECT pg_ripple.deduplicate_all();

delete_triple

Delete a specific triple from the default graph.

pg_ripple.delete_triple(
    subject   TEXT,
    predicate TEXT,
    object    TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.delete_triple(
    '<https://example.org/alice>',
    '<https://example.org/knows>',
    '<https://example.org/bob>'
);

delete_triple_from_graph

Delete a specific triple from a named graph.

pg_ripple.delete_triple_from_graph(
    subject   TEXT,
    predicate TEXT,
    object    TEXT,
    graph     TEXT
) RETURNS BOOLEAN
SELECT pg_ripple.delete_triple_from_graph(
    '<https://example.org/alice>',
    '<https://example.org/knows>',
    '<https://example.org/bob>',
    '<https://example.org/people>'
);

get_statement

Retrieve a statement by its globally unique statement ID (SID).

pg_ripple.get_statement(
    sid BIGINT
) RETURNS TABLE(subject TEXT, predicate TEXT, object TEXT, graph TEXT)
SELECT * FROM pg_ripple.get_statement(42);

Security

Functions for row-level security, access control, and schema inspection.


enable_graph_rls

Enable row-level security on VP tables, restricting access by named graph.

pg_ripple.enable_graph_rls() RETURNS VOID
SELECT pg_ripple.enable_graph_rls();

grant_graph

Grant a user access to a named graph.

pg_ripple.grant_graph(
    username   TEXT,
    graph      TEXT,
    permission TEXT DEFAULT 'read'
) RETURNS VOID
SELECT pg_ripple.grant_graph('analyst', '<https://example.org/public>', 'read');

revoke_graph

Revoke a user's access to a named graph.

pg_ripple.revoke_graph(
    username TEXT,
    graph    TEXT
) RETURNS VOID
SELECT pg_ripple.revoke_graph('analyst', '<https://example.org/public>');

list_graph_access

List all graph access grants.

pg_ripple.list_graph_access() RETURNS TABLE(username TEXT, graph TEXT, permission TEXT)
SELECT * FROM pg_ripple.list_graph_access();

enable_schema_summary

Enable background schema summary generation (requires pg_trickle).

pg_ripple.enable_schema_summary() RETURNS VOID
SELECT pg_ripple.enable_schema_summary();

schema_summary

Return a one-shot schema summary of all predicates, types, and counts.

pg_ripple.schema_summary() RETURNS JSON
SELECT pg_ripple.schema_summary();

CDC

Functions for Change Data Capture subscriptions.


subscribe

Subscribe to change events on the triple store. Returns a subscription ID.

pg_ripple.subscribe(
    channel  TEXT DEFAULT 'pg_ripple_changes',
    filter   TEXT DEFAULT NULL
) RETURNS TEXT
SELECT pg_ripple.subscribe('my_changes', 'predicate=<https://example.org/knows>');
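Subscriptions pair naturally with PostgreSQL's LISTEN/NOTIFY. A minimal consumption sketch, assuming change events are delivered as NOTIFY payloads on the chosen channel:

```sql
-- Consuming session: listen on the channel name passed to subscribe()
LISTEN my_changes;

-- Producing session: a write to the store is assumed to emit a
-- notification for triples matching the subscription filter
SELECT pg_ripple.load_turtle('
  @prefix ex: <https://example.org/> .
  ex:alice ex:knows ex:dana .
');
-- The listening session picks up the event on its next protocol round-trip.
```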

unsubscribe

Unsubscribe from a change event subscription.

pg_ripple.unsubscribe(
    subscription_id TEXT
) RETURNS VOID
SELECT pg_ripple.unsubscribe('sub_abc123');

Index

Functions for querying predicate indices.


subject_predicates

Return all predicates used by a given subject.

pg_ripple.subject_predicates(
    subject TEXT
) RETURNS TABLE(predicate TEXT)
SELECT * FROM pg_ripple.subject_predicates('<https://example.org/alice>');

object_predicates

Return all predicates where a given resource appears as object.

pg_ripple.object_predicates(
    object TEXT
) RETURNS TABLE(predicate TEXT)
SELECT * FROM pg_ripple.object_predicates('<https://example.org/alice>');

Cache

Functions for query plan cache management.


plan_cache_stats

Return statistics about the SPARQL-to-SQL plan cache.

pg_ripple.plan_cache_stats() RETURNS JSON
SELECT pg_ripple.plan_cache_stats();

plan_cache_reset

Clear the SPARQL-to-SQL plan cache.

pg_ripple.plan_cache_reset() RETURNS VOID
SELECT pg_ripple.plan_cache_reset();

pg_trickle_available

Check whether the pg_trickle companion extension is installed and available.

pg_ripple.pg_trickle_available() RETURNS BOOLEAN
SELECT pg_ripple.pg_trickle_available();

Architecture

This page describes the internal architecture of pg_ripple as of v0.38.0.

Overview

pg_ripple is a PostgreSQL 18 extension written in Rust (pgrx 0.17) that implements a high-performance RDF triple store with native SPARQL query execution. All user-visible functions live in the pg_ripple schema; internal tables and VP (Vertical Partitioning) tables live in the _pg_ripple schema.

Component map

graph TD
    Client["SQL client / SPARQL tool"]

    subgraph Extension["pg_ripple extension (Rust + pgrx)"]
        API["SQL API layer\n(lib.rs + *_api.rs)"]
        SPARQL["SPARQL engine\nsparql/mod.rs"]
        TRANS["Algebra → SQL\nsparql/translate/"]
        SQLGEN["SQL generator\nsparql/sqlgen.rs"]
        DICT["Dictionary\ndictionary/mod.rs"]
        SHACL["SHACL validator\nshacl/mod.rs"]
        DL["Datalog engine\ndatalog/mod.rs"]
        EXP["Serialisers\nexport/mod.rs"]
        STOR["Storage layer\nstorage/mod.rs"]
        CAT["Predicate catalog\nstorage/catalog.rs"]
        HINTS["SHACL hints\nshacl/hints.rs"]
    end

    subgraph PG["PostgreSQL 18"]
        VP["VP tables\n_pg_ripple.vp_{id}"]
        DICT_TBL["dictionary table\n_pg_ripple.dictionary"]
        PRED["predicates table\n_pg_ripple.predicates"]
        RARE["rare predicates\n_pg_ripple.vp_rare"]
        SHAPE_HINTS["shape_hints table\n_pg_ripple.shape_hints"]
    end

    Client --> API
    API --> SPARQL
    API --> SHACL
    API --> DL
    API --> EXP
    SPARQL --> TRANS
    TRANS --> SQLGEN
    SQLGEN --> CAT
    CAT --> HINTS
    HINTS --> SHAPE_HINTS
    SQLGEN --> DICT
    DICT --> DICT_TBL
    SQLGEN --> VP
    CAT --> PRED
    STOR --> VP
    STOR --> RARE
    SHACL --> DICT
    DL --> STOR

Source tree structure

PathResponsibility
src/lib.rspgrx entry points, GUC registration, _PG_init, hooks
src/gucs.rsAll GUC static declarations
src/schema.rsextension_sql!() DDL blocks
src/dictionary/IRI / blank-node / literal → i64 encoder (XXH3-128 + LRU)
src/storage/VP table I/O, HTAP delta/main partitions, merge worker
src/storage/catalog.rsPredicate → VP table OID cache (SPI call reduction)
src/sparql/SPARQL text → algebra → SQL → SPI → decode
src/sparql/translate/Per-algebra-node translation stubs (BGP, Join, Filter, …)
src/sparql/plan_cache.rsPer-backend plan cache keyed on algebra digest (XXH3-128)
src/datalog/Datalog rule parser, stratifier, SQL compiler
src/shacl/SHACL shapes → validation pipeline
src/shacl/constraints/Per-constraint-type validation (count, string, logical, …)
src/shacl/hints.rsSHACL → SQL generation hints (join type, DISTINCT)
src/export/Turtle / N-Triples / JSON-LD serialisation
src/federation_registry.rsSPARQL federation endpoint registry
src/stats_admin.rsMonitoring, pg_stat_statements integration
src/graphrag_admin.rsVector embedding, hybrid search, GraphRAG pipeline
src/*_api.rsSQL-exposed pg_extern wrappers

Storage model

Every IRI, blank node, and literal is mapped to a BIGINT (i64) through a dictionary encoding step (XXH3-128 hash). VP tables never contain raw strings — all joins are integer joins.

          raw IRI/literal
               │
      dictionary.encode()
               │
           i64 hash
               │
     stored in VP table (s, o, g, i, source)

For each unique predicate there is one VP table: _pg_ripple.vp_{predicate_id}. Predicates with fewer than vp_promotion_threshold (default: 1 000) triples are stored in the consolidated _pg_ripple.vp_rare table instead.
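Assuming the threshold is exposed as the GUC pg_ripple.vp_promotion_threshold (the text above names only the bare setting), tuning it and re-promoting looks like this:

```sql
-- Raise the threshold so only hot predicates get dedicated VP tables
ALTER SYSTEM SET pg_ripple.vp_promotion_threshold = 5000;
SELECT pg_reload_conf();

-- Move any newly qualifying predicates out of vp_rare (see the API reference)
SELECT pg_ripple.promote_rare_predicates();
```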

Query execution pipeline

SPARQL text
    │
    ▼ spargebra::Query::parse()
SPARQL algebra
    │
    ▼ sparopt optimizer (BGP reorder, join order)
Optimised algebra
    │
    ▼ SPARQL→SQL translator (sqlgen.rs + translate/)
SQL text
    │
    ▼ PostgreSQL SPI executor
Raw rows
    │
    ▼ dictionary.decode()
JSONB result set

Plan cache

The per-backend plan cache (v0.13.0) maps an algebra digest to the generated SQL. The digest is computed as:

digest = XXH3-128( spargebra::Query::display(query) )
key    = "{digest}\x00max_depth={n}\x00bgp_reorder={b}"

Using the algebra display form (rather than the raw query text) means whitespace and prefix-alias variants of the same query share one cache slot.
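For example, these two calls differ only in whitespace and prefix alias, so they normalise to the same algebra digest and occupy one cache slot:

```sql
SELECT * FROM pg_ripple.sparql('PREFIX ex: <http://example.org/> SELECT ?o WHERE { ex:alice ex:knows ?o }');
SELECT * FROM pg_ripple.sparql('PREFIX e:  <http://example.org/> SELECT ?o WHERE { e:alice e:knows ?o }');

-- Inspect hit/miss counters after the second call
SELECT pg_ripple.plan_cache_stats();
```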

SHACL hints integration

After loading shapes with pg_ripple.load_shacl(), predicate-level hints are written to _pg_ripple.shape_hints. The SQL generator reads these hints via the predicate catalog to:

  • Omit DISTINCT when sh:maxCount 1 is set for a predicate.
  • Use INNER JOIN instead of LEFT JOIN when sh:minCount 1 is set.

Hints are invalidated automatically when shapes are dropped (pg_ripple.invalidate_catalog_cache()).
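A minimal shape that triggers both hints, sketched with a hypothetical one-name-per-person constraint:

```sql
SELECT pg_ripple.load_shacl('
  @prefix sh:   <http://www.w3.org/ns/shacl#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix ex:   <http://example.org/> .
  ex:PersonShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [
      sh:path foaf:name ;
      sh:minCount 1 ;   # LEFT JOIN becomes INNER JOIN
      sh:maxCount 1     # DISTINCT is omitted
    ] .
');
```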

GUC Reference

All pg_ripple configuration parameters are set with ALTER SYSTEM SET, SET (session-level), or in postgresql.conf. Reload with SELECT pg_reload_conf() after ALTER SYSTEM.


General Parameters

pg_ripple.max_path_depth

TypeInteger
Default10
Range1–100

Maximum recursion depth for SPARQL property paths (*, +). Increase for deeply nested graphs; lower for tighter resource bounds.
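For example, to widen the bound for a single session before querying a deep hierarchy:

```sql
SET pg_ripple.max_path_depth = 30;
-- ... run deep property-path queries ...
RESET pg_ripple.max_path_depth;
```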


pg_ripple.property_path_max_depth (deprecated)

TypeInteger
Default64
Range1–100 000
StatusDeprecated since v0.38.0 — use max_path_depth instead

Legacy alias for max_path_depth. Setting this GUC still works but emits a deprecation notice. It will be removed in a future major release.


pg_ripple.federation_timeout

TypeInteger (milliseconds)
Default5000

Timeout for outbound SPARQL federation requests.


pg_ripple.export_batch_size

TypeInteger
Default1000

Number of rows written per batch in Parquet export operations.


Embedding / Vector Parameters (v0.27.0+)

These GUCs control the pgvector integration introduced in v0.27.0. All embedding functions degrade gracefully when pgvector is absent.


pg_ripple.pgvector_enabled

TypeBoolean
Defaulton

Master switch for all vector embedding paths. Set to off to temporarily disable embedding storage, similarity search, and SPARQL pg:similar() without uninstalling pgvector.

-- Disable at session level for a bulk load
SET pg_ripple.pgvector_enabled = off;

pg_ripple.embedding_api_url

TypeString
Default(none)

Base URL for the OpenAI-compatible embeddings API. The extension appends /embeddings to this URL when making requests.

ALTER SYSTEM SET pg_ripple.embedding_api_url = 'https://api.openai.com/v1';
-- For Ollama (local):
ALTER SYSTEM SET pg_ripple.embedding_api_url = 'http://localhost:11434/v1';

pg_ripple.embedding_api_key

TypeString
Default(none)

Bearer token sent as Authorization: Bearer <key> in embedding API requests. For local models that don't require authentication, set to any non-empty string (e.g., 'local').

Security: Avoid storing API keys in postgresql.conf. Use ALTER SYSTEM and restrict pg_hba.conf access, or inject the key via a session-level SET in application code.
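A sketch of the session-level pattern; the key value here is a placeholder supplied by the application's secret store:

```sql
-- Run from application code at connection setup, not persisted on disk
SET pg_ripple.embedding_api_key = 'sk-placeholder';
```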


pg_ripple.embedding_model

TypeString
Default(none)

Model name passed in the "model" field of embedding API requests.

ALTER SYSTEM SET pg_ripple.embedding_model = 'text-embedding-3-small';
-- or for Ollama:
ALTER SYSTEM SET pg_ripple.embedding_model = 'nomic-embed-text';

pg_ripple.embedding_dimensions

TypeInteger
Default1536
Range1–65535

Expected output dimensions from the embedding model. Must match the model's output length. Common values:

ModelDimensions
text-embedding-3-small1536
text-embedding-3-large3072
text-embedding-ada-0021536
nomic-embed-text (Ollama)768

pg_ripple.embedding_index_type

TypeString
Default(none — HNSW when pgvector present)
Valueshnsw, ivfflat

Index type for the _pg_ripple.embeddings table. HNSW is the default and recommended for most workloads. IVFFlat uses less memory but requires lists parameter tuning.


pg_ripple.embedding_precision

TypeString
Default(none — full float4 precision)
Values(unset), half, binary

Storage precision for embedding vectors. Reduces disk/memory usage at the cost of accuracy:

Valuepgvector typeNotes
(unset)vector(N)Full 32-bit float; highest accuracy
halfhalfvec(N)16-bit float; ~50% storage reduction
binarybit(N)1-bit quantised; ~97% storage reduction, lower accuracy

Note: Changing precision after data is stored requires re-running the migration or manually altering the column type and re-embedding.


v0.37.0: Tombstone GC & Error Safety

pg_ripple.tombstone_gc_enabled

TypeBoolean
Defaulton
Contextsighup (shared: requires server signal, not per-session)

When on, pg_ripple automatically issues VACUUM ANALYZE on a predicate's tombstone table after each merge cycle if the residual tombstone count exceeds tombstone_gc_threshold × main_row_count. Set to off to disable automatic tombstone cleanup (useful when managing VACUUM manually).

pg_ripple.tombstone_gc_threshold

TypeString (decimal)
Default0.05 (5%)
Range0.0–1.0
Contextsighup

Tombstone-to-main-row ratio that triggers automatic VACUUM after a merge cycle. When the remaining tombstone count divided by the new main table row count exceeds this value, a VACUUM ANALYZE is scheduled on the tombstone table.

Lower values (e.g. 0.01) trigger VACUUM more aggressively; higher values (e.g. 0.20) allow more tombstone bloat before cleanup.
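For example, to trigger cleanup more aggressively (sighup context, so a reload applies it):

```sql
ALTER SYSTEM SET pg_ripple.tombstone_gc_threshold = 0.02;
SELECT pg_reload_conf();
```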


v0.37.0: GUC Validator Rules

The following string-enum GUCs now reject invalid values at SET time with an error. Previously, invalid values were silently ignored until the execution path checked them.

GUCValid values
pg_ripple.inference_modeoff, on_demand, materialized
pg_ripple.enforce_constraintsoff, warn, error
pg_ripple.rule_graph_scopedefault, all
pg_ripple.shacl_modeoff, sync, async
pg_ripple.describe_strategycbd, scbd, simple

pg_ripple.rls_bypass scope change (v0.37.0): This GUC is now registered at PGC_POSTMASTER scope when pg_ripple is loaded via shared_preload_libraries. This prevents a session from bypassing graph-level RLS with SET LOCAL pg_ripple.rls_bypass = on.


v0.42.0: Parallel Merge Workers

pg_ripple.merge_workers

TypeInteger
Default1
Range1–16
Contextpostmaster (startup-only; set in postgresql.conf)

Number of background merge worker processes. Each worker owns a disjoint round-robin slice of VP predicates. Workers use pg_advisory_lock to prevent conflicts; idle workers steal work from overloaded peers. Increasing this value helps workloads with many distinct predicates (> 50).
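Because this is a postmaster-context GUC, changing it requires a full restart, for example:

```sql
ALTER SYSTEM SET pg_ripple.merge_workers = 4;
-- postmaster context: takes effect only after a server restart,
-- not after pg_reload_conf()
```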


v0.42.0: Cost-Based Federation Planner

pg_ripple.federation_planner_enabled

TypeBoolean
Defaulton
Contextuserset

When on, pg_ripple uses VoID statistics collected from remote SPARQL endpoints to sort the SERVICE execution order by ascending estimated cost. When off, SERVICE clauses are executed in document order.

pg_ripple.federation_stats_ttl_secs

TypeInteger
Default3600 (1 hour)
Range0–86400
Contextuserset

Seconds until cached VoID statistics for a remote endpoint are considered stale. Setting 0 disables caching (re-fetches on every query).

pg_ripple.federation_parallel_max

TypeInteger
Default4
Range1–64
Contextuserset

Maximum number of remote SERVICE clauses that pg_ripple will execute concurrently within a single query. Set to 1 to disable parallel SERVICE execution.

pg_ripple.federation_parallel_timeout

TypeInteger
Default60 (seconds)
Range1–3600
Contextuserset

Per-endpoint timeout when executing parallel SERVICE clauses. Endpoints that do not respond within this limit return an empty result set (with a WARNING). Does not affect sequential SERVICE execution.

pg_ripple.federation_inline_max_rows

TypeInteger
Default10000
Range1–1000000
Contextuserset

Maximum number of rows in the VALUES binding table passed to a remote SERVICE clause. When the result set from the local graph exceeds this limit, pg_ripple automatically spools the bindings into a temporary table (PT620 INFO logged) and issues multiple smaller requests to the remote endpoint in batches. Set to a lower value if remote endpoints enforce query complexity limits.

pg_ripple.federation_allow_private

TypeBoolean
Defaultoff
Contextsuperuser

Security-critical GUC — only superusers can set this.

When off (the default), register_endpoint() rejects endpoints whose hostname resolves to a loopback address (127.0.0.0/8), a link-local address (169.254.0.0/16), any RFC-1918 private range (10/8, 172.16/12, 192.168/16), or an IPv6 equivalent. This prevents server-side request forgery (SSRF) via malicious SPARQL SERVICE calls.

Set to on only in controlled environments where the remote endpoint is a trusted internal service (e.g., a local Fuseki instance in a Docker network).
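A sketch for a trusted Docker network; the register_endpoint() argument shape is assumed, as it is not specified in this section:

```sql
-- Superuser only
ALTER SYSTEM SET pg_ripple.federation_allow_private = on;
SELECT pg_reload_conf();

-- A private-range endpoint is now accepted (argument shape assumed)
SELECT pg_ripple.register_endpoint('fuseki', 'http://fuseki:3030/ds/sparql');
```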


v0.42.0: owl:sameAs Safety

pg_ripple.sameas_max_cluster_size

TypeInteger
Default100000
Range0–2147483647
Contextuserset

Maximum number of entities in a single owl:sameAs equivalence cluster before canonicalization is skipped with a PT550 WARNING. A single cluster larger than this limit usually indicates a data quality problem (e.g., an entity mistakenly asserted to be owl:sameAs owl:Thing). Set to 0 to disable the check (no limit).


v0.46.0: TopN Push-down & Datalog Sequence Batch

pg_ripple.topn_pushdown

TypeBoolean
Defaulton
Contextuserset

When on (the default), SPARQL SELECT queries that contain both ORDER BY and LIMIT N (with no OFFSET > 0 and no DISTINCT) push the ORDER BY … LIMIT N down into the generated SQL, rather than fetching all rows and discarding the excess after decoding.

Set to off to disable the optimisation globally — for example, during debugging when you suspect that TopN push-down is producing incorrect results.

The sparql_explain() output includes a "topn_applied": true/false key that indicates whether push-down was applied to a specific query.
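To check whether a specific query qualifies (the sparql_explain() call shape is assumed from the mention above):

```sql
SELECT pg_ripple.sparql_explain('
  PREFIX ex: <http://example.org/>
  SELECT ?o WHERE { ex:alice ex:knows ?o }
  ORDER BY ?o LIMIT 5
');
-- Look for "topn_applied": true in the returned JSON
```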

pg_ripple.datalog_sequence_batch

TypeInteger
Default10000
Range100–1000000
Contextuserset

SID (statement-ID) range reserved per parallel Datalog worker per batch. Before launching N parallel strata workers, the coordinator atomically advances the global _pg_ripple.statement_id_seq sequence by N * datalog_sequence_batch, then assigns each worker an exclusive sub-range. Workers insert triples with pre-computed SIDs without touching the shared sequence, eliminating contention.

Increase this value if parallel inference workers frequently conflict on the sequence. Decrease it to reduce unused SID gaps when inference produces fewer triples than expected per batch.
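As a worked example: with 4 workers and the default batch of 10 000, the coordinator advances the sequence by 40 000 in one step and hands worker k the exclusive range [S + k·10 000, S + (k+1)·10 000). Tuning it is a plain SET:

```sql
-- Larger slices: fewer sequence reservations, but larger potential SID gaps
SET pg_ripple.datalog_sequence_batch = 50000;
```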


v0.47.0: Validated String GUCs

All six string-valued GUCs below now reject invalid values at SET time (previously invalid values were accepted and silently ignored at runtime).

pg_ripple.federation_on_error

TypeString
Defaultwarning
Valid valueswarning, error, empty
Contextuserset

Controls behaviour when a SERVICE call fails completely. warning emits a PT610 WARNING and returns an empty binding set for that endpoint. error raises an ERROR and aborts the query. empty silently returns zero rows for that endpoint.

pg_ripple.federation_on_partial

TypeString
Defaultempty
Valid valuesempty, use
Contextuserset

Controls behaviour when a SERVICE response stream is interrupted mid-transfer (e.g., the remote endpoint drops the connection). empty discards partial results and returns zero rows. use keeps the rows received before the error.

pg_ripple.sparql_overflow_action

TypeString
Defaultwarn
Valid valueswarn, error
Contextuserset

Action taken when a SPARQL SELECT result set exceeds sparql_max_rows (when sparql_max_rows > 0). warn truncates the result set and emits a PT601 WARNING. error raises an ERROR.

pg_ripple.tracing_exporter

TypeString
Valid valuesstdout, otlp

stdout writes spans to the server log (no network overhead). otlp sends spans via the OTLP gRPC protocol to the endpoint specified by the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

pg_ripple.embedding_index_type

TypeString
Defaulthnsw
Valid valueshnsw, ivfflat

Changing this setting after embeddings have been indexed requires REINDEX TABLE _pg_ripple.embeddings.

pg_ripple.embedding_precision

TypeString
Defaultsingle
Valid valuessingle, half, binary
Contextuserset

Storage precision for embedding vectors. single uses pgvector's full-precision vector type (32-bit floats); half stores halfvec (16-bit floats); binary stores 1-bit quantised bit vectors.

SPARQL Compliance Matrix

pg_ripple implements the full SPARQL 1.1 specification suite. This page details conformance status for every feature in the W3C SPARQL 1.1 Query, Update, and Protocol recommendations.

Full compliance

As of v0.46.0, pg_ripple passes:

  • 100% of the W3C SPARQL 1.1 test suite (~3 000 tests)
  • ≥ 99.9% of the Apache Jena edge-case suite (~1 000 tests)
  • all 100 WatDiv query templates at 10 M-triple scale, with row counts validated to ±0.1% against baselines
  • all 14 LUBM queries with OWL RL inference correctness
  • ≥ 80% of the W3C OWL 2 RL conformance suite


SPARQL 1.1 Query — Query Forms

FeatureStatusSinceNotes
SELECT✅ Supportedv0.1.0Full projection with expressions
CONSTRUCT✅ Supportedv0.8.0Returns triples as JSON, Turtle, or JSON-LD
ASK✅ Supportedv0.8.0Returns boolean
DESCRIBE✅ Supportedv0.8.0Symmetric concise bounded description

SPARQL 1.1 Query — Algebra Operations

FeatureStatusSinceNotes
Basic Graph Pattern (BGP)✅ Supportedv0.1.0Translated to VP table joins
Join (inner)✅ Supportedv0.1.0
LeftJoin (OPTIONAL)✅ Supportedv0.1.0Downgraded to INNER JOIN when SHACL sh:minCount 1 is set
Filter✅ Supportedv0.1.0All comparison, logical, and arithmetic operators
Union✅ Supportedv0.5.0UNION ALL in generated SQL
Minus✅ Supportedv0.5.0EXCEPT in generated SQL
Extend (BIND)✅ Supportedv0.1.0
Group (GROUP BY)✅ Supportedv0.5.0
Having✅ Supportedv0.5.0
OrderBy✅ Supportedv0.1.0
Project✅ Supportedv0.1.0
Distinct✅ Supportedv0.1.0Omitted when SHACL sh:maxCount 1 is set
Reduced✅ Supportedv0.5.0Treated as hint; may or may not deduplicate
Slice (LIMIT/OFFSET)✅ Supportedv0.1.0
Service (SERVICE)✅ Supportedv0.16.0Federated query via HTTP
Service Silent (SERVICE SILENT)✅ Supportedv0.16.0Returns empty on endpoint failure
Values (VALUES)✅ Supportedv0.5.0Inline data bindings
Lateral (LATERAL)✅ Supportedv0.22.0PostgreSQL LATERAL JOIN
Subqueries✅ Supportedv0.5.0Nested SELECT
Negation (NOT EXISTS)✅ Supportedv0.5.0
Negation (EXISTS)✅ Supportedv0.5.0

SPARQL 1.1 Query — Property Paths

FeatureStatusSinceNotes
Sequence path (/)✅ Supportedv0.5.0
Alternative path (|)✅ Supportedv0.5.0
Inverse path (^)✅ Supportedv0.5.0
Zero-or-more (*)✅ Supportedv0.5.0WITH RECURSIVE … CYCLE
One-or-more (+)✅ Supportedv0.5.0WITH RECURSIVE … CYCLE
Zero-or-one (?)✅ Supportedv0.5.0
Negated property set (!(p1|p2))✅ Supportedv0.5.0
Fixed-length path ({n})✅ Supportedv0.5.0Unrolled to n joins
Variable-length path ({n,m})✅ Supportedv0.5.0Bounded recursion

Cycle detection

All recursive property paths use PostgreSQL 18's native CYCLE clause for hash-based cycle detection, bounded by pg_ripple.max_path_depth (default: 10).
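The generated SQL has roughly this shape (table and column names simplified for illustration; real VP tables are _pg_ripple.vp_{id} with integer-encoded terms):

```sql
WITH RECURSIVE hops (s, o, depth) AS (
    SELECT s, o, 1 FROM _pg_ripple.vp_knows          -- one hop
  UNION ALL
    SELECT h.s, v.o, h.depth + 1
    FROM hops h
    JOIN _pg_ripple.vp_knows v ON v.s = h.o
    WHERE h.depth < 10                               -- pg_ripple.max_path_depth
) CYCLE o SET is_cycle USING path                    -- PostgreSQL 18 cycle detection
SELECT DISTINCT s, o FROM hops WHERE NOT is_cycle;
```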


SPARQL 1.1 Query — Aggregates

FeatureStatusSinceNotes
COUNT✅ Supportedv0.5.0Including COUNT(DISTINCT *)
SUM✅ Supportedv0.5.0
AVG✅ Supportedv0.5.0
MIN✅ Supportedv0.5.0
MAX✅ Supportedv0.5.0
GROUP_CONCAT✅ Supportedv0.5.0With custom separator
SAMPLE✅ Supportedv0.5.0

SPARQL 1.1 Query — Built-in Functions

FunctionStatusSince
STR()✅ Supportedv0.1.0
LANG()✅ Supportedv0.3.0
DATATYPE()✅ Supportedv0.3.0
IRI() / URI()✅ Supportedv0.5.0
BNODE()✅ Supportedv0.5.0
RAND()✅ Supportedv0.5.0
ABS()✅ Supportedv0.1.0
CEIL()✅ Supportedv0.1.0
FLOOR()✅ Supportedv0.1.0
ROUND()✅ Supportedv0.1.0
CONCAT()✅ Supportedv0.5.0
STRLEN()✅ Supportedv0.1.0
UCASE()✅ Supportedv0.1.0
LCASE()✅ Supportedv0.1.0
ENCODE_FOR_URI()✅ Supportedv0.5.0
CONTAINS()✅ Supportedv0.1.0
STRSTARTS()✅ Supportedv0.1.0
STRENDS()✅ Supportedv0.1.0
STRBEFORE()✅ Supportedv0.5.0
STRAFTER()✅ Supportedv0.5.0
YEAR()✅ Supportedv0.5.0
MONTH()✅ Supportedv0.5.0
DAY()✅ Supportedv0.5.0
HOURS()✅ Supportedv0.5.0
MINUTES()✅ Supportedv0.5.0
SECONDS()✅ Supportedv0.5.0
TIMEZONE()✅ Supportedv0.5.0
TZ()✅ Supportedv0.5.0
NOW()✅ Supportedv0.5.0
UUID()✅ Supportedv0.5.0
STRUUID()✅ Supportedv0.5.0
MD5()✅ Supportedv0.5.0
SHA1()✅ Supportedv0.5.0
SHA256()✅ Supportedv0.5.0
SHA384()✅ Supportedv0.5.0
SHA512()✅ Supportedv0.5.0
COALESCE()✅ Supportedv0.1.0
IF()✅ Supportedv0.1.0
STRLANG()✅ Supportedv0.5.0
STRDT()✅ Supportedv0.5.0
isIRI() / isURI()✅ Supportedv0.1.0
isBlank()✅ Supportedv0.1.0
isLiteral()✅ Supportedv0.1.0
isNumeric()✅ Supportedv0.5.0
REGEX()✅ Supportedv0.1.0
REPLACE()✅ Supportedv0.5.0
SUBSTR()✅ Supportedv0.5.0
BOUND()✅ Supportedv0.1.0
IN / NOT IN✅ Supportedv0.5.0
TRIPLE() (RDF-star)✅ Supportedv0.4.0
SUBJECT() (RDF-star)✅ Supportedv0.4.0
PREDICATE() (RDF-star)✅ Supportedv0.4.0
OBJECT() (RDF-star)✅ Supportedv0.4.0
isTRIPLE() (RDF-star)✅ Supportedv0.4.0

SPARQL 1.1 Query — Typed Literals

DatatypeStatusNotes
xsd:integer✅ SupportedMaps to PostgreSQL BIGINT
xsd:decimal✅ SupportedMaps to NUMERIC
xsd:float✅ SupportedMaps to REAL
xsd:double✅ SupportedMaps to DOUBLE PRECISION
xsd:boolean✅ SupportedMaps to BOOLEAN
xsd:string✅ SupportedDefault literal type
xsd:dateTime✅ SupportedMaps to TIMESTAMPTZ
xsd:date✅ SupportedMaps to DATE
xsd:time✅ SupportedMaps to TIME
xsd:gYear✅ SupportedStored as string, compared lexically
Language-tagged strings✅ Supported"text"@en syntax

SPARQL 1.1 Update

OperationStatusSinceNotes
INSERT DATA✅ Supportedv0.7.0
DELETE DATA✅ Supportedv0.7.0
DELETE WHERE✅ Supportedv0.7.0
DELETE/INSERT WHERE✅ Supportedv0.7.0
INSERT WHERE✅ Supportedv0.7.0
LOAD✅ Supportedv0.7.0Via pg_ripple_http or direct file
CLEAR GRAPH✅ Supportedv0.7.0
CLEAR DEFAULT✅ Supportedv0.7.0
CLEAR NAMED✅ Supportedv0.7.0
CLEAR ALL✅ Supportedv0.7.0
DROP GRAPH✅ Supportedv0.7.0
DROP DEFAULT✅ Supportedv0.7.0
DROP NAMED✅ Supportedv0.7.0
DROP ALL✅ Supportedv0.7.0
CREATE GRAPH✅ Supportedv0.7.0
CREATE SILENT GRAPH✅ Supportedv0.7.0
COPY✅ Supportedv0.21.0
MOVE✅ Supportedv0.21.0
ADD✅ Supportedv0.21.0
Multi-statement (; separator)✅ Supportedv0.7.0
USING / USING NAMED✅ Supportedv0.7.0Dataset clause for updates

SPARQL 1.1 Protocol

FeatureStatusNotes
Query via HTTP GET✅ SupportedVia pg_ripple_http
Query via HTTP POST (form-encoded)✅ SupportedVia pg_ripple_http
Query via HTTP POST (direct body)✅ SupportedVia pg_ripple_http
Update via HTTP POST✅ SupportedVia pg_ripple_http
Content negotiation (Accept header)✅ SupportedJSON, Turtle, N-Triples, XML
default-graph-uri parameter✅ Supported
named-graph-uri parameter✅ Supported
Multiple default-graph-uri✅ Supported
Multiple named-graph-uri✅ Supported

Protocol endpoint

SPARQL Protocol support requires the pg_ripple_http companion service. See APIs and Integration for setup instructions.


SPARQL 1.1 Service Description

FeatureStatusNotes
Service description at endpoint root✅ SupportedVia pg_ripple_http
sd:supportedLanguage✅ SupportedReports SPARQL 1.1 Query and Update
sd:resultFormat✅ SupportedJSON, XML, CSV, TSV
sd:defaultDataset✅ Supported
sd:feature✅ SupportedReports sd:UnionDefaultGraph, sd:RequiresDataset

SPARQL 1.1 Graph Store HTTP Protocol

OperationStatusNotes
GET (retrieve graph)✅ SupportedVia pg_ripple_http
PUT (replace graph)✅ SupportedVia pg_ripple_http
POST (merge into graph)✅ SupportedVia pg_ripple_http
DELETE (drop graph)✅ SupportedVia pg_ripple_http
?default parameter✅ Supported
?graph=<uri> parameter✅ Supported

RDF-star / SPARQL-star

FeatureStatusSinceNotes
Quoted triple storage✅ Supportedv0.4.0qt_s, qt_p, qt_o dictionary columns
Quoted triple in BGP✅ Supportedv0.4.0Ground patterns only
TRIPLE() constructor✅ Supportedv0.4.0
SUBJECT(), PREDICATE(), OBJECT()✅ Supportedv0.4.0
isTRIPLE()✅ Supportedv0.4.0
Annotation syntax ({| … |})✅ Supported
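A ground quoted-triple lookup, using a hypothetical ex:confidence annotation predicate:

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?c WHERE { << ex:alice ex:knows ex:bob >> ex:confidence ?c }
');
```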

Extensions Beyond W3C

pg_ripple extends the SPARQL standard with additional capabilities:

FeatureNotes
pg:similar() custom functionVector similarity within SPARQL FILTER
pg:fts() custom functionFull-text search within SPARQL FILTER
pg:embed() custom functionInline embedding generation
Datalog-materialized predicatesInferred triples queryable via standard SPARQL
SHACL-optimized query plansCardinality hints from SHACL shapes
Plan cacheCompiled SQL plans cached across queries
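A hybrid graph-plus-vector query sketch; the pg: function namespace IRI shown here is an assumption, as is the pg:similar() argument shape:

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  PREFIX pg: <http://example.org/pg-ripple/fn#>   # assumed namespace IRI
  SELECT ?doc WHERE {
    ?doc ex:topic ex:Databases .
    FILTER(pg:similar(?doc, "graph query engines"))
  }
');
```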

Known Limitations

FeatureStatusNotes
langMatches()⚠️ PartialReturns 0 rows; full BCP 47 matching planned
Custom aggregate extensions❌ Not supportedStandard aggregates fully supported
Variable-in-quoted-triple << ?s ?p ?o >>⚠️ PartialReturns 0 rows with WARNING; ground patterns work
LOAD <url> from arbitrary HTTP⚠️ DependsRequires pg_ripple_http or server-side file
DESCRIBE strategy customization❌ Not supportedUses symmetric CBD only
Multiple result formats for SELECT⚠️ PartialJSON primary; XML/CSV/TSV via pg_ripple_http only

W3C Conformance

This page summarises pg_ripple's conformance status against the W3C SPARQL 1.1, Apache Jena, SHACL Core, WatDiv, and LUBM test suites.

As of v0.41.0, conformance is measured by integrated test harnesses that run in CI on every push to main. Pass rates are published as the conformance_report artifact on the Actions page.

Test suites

pg_ripple runs four complementary conformance suites:

SuiteTestsWhat it validates
W3C SPARQL 1.1~3 000Standard conformance on small, well-defined fixtures
Apache Jena~1 000Implementation edge cases (type coercion, date-time, blank-node scoping)
WatDiv100 templatesCorrectness and performance at 10M-triple scale
LUBM14 queriesOWL RL inference correctness under ontological reasoning (v0.44.0+)
OWL 2 RL~200 testsW3C OWL 2 RL entailment, consistency, and inconsistency (v0.46.0+; informational until ≥95%)

All suites write per-suite results into a unified tests/conformance/report.json artifact.

See Running Conformance Tests for local setup instructions, the WatDiv Results page for performance metrics, and the LUBM Results page for OWL RL conformance details.


W3C SPARQL 1.1 test harness (v0.41.0+)

The test harness (tests/w3c/) runs the official W3C SPARQL 1.1 test suite (~3 000 tests across 13 sub-suites) against a live pg_ripple installation.

Per-category coverage

Sub-suiteTestsCI status
aggregates~120Required (smoke)
bind~20Informational (full suite)
exists~20Informational (full suite)
functions~200Informational (full suite)
grouping~40Required (smoke)
negation~20Informational (full suite)
optional~80Required (smoke)
project-expression~10Informational (full suite)
property-path~60Informational (full suite)
service~10SKIP (live external endpoints)
subquery~20Informational (full suite)
syntax-query~300Informational (full suite)
update~200Informational (full suite)

Running locally

# Download test data first (one-time setup):
bash scripts/fetch_conformance_tests.sh --w3c

# Run smoke subset (180 tests, ~30s):
cargo test --test w3c_smoke

# Run full W3C suite (3000+ tests, ~2min with 8 threads):
cargo test --test w3c_suite -- --test-threads 8

Apache Jena test suite (v0.43.0+)

The Jena adapter (tests/jena/) runs ~1 000 tests from Apache Jena's sparql-query, sparql-update, sparql-syntax, and algebra sub-suites. Jena tests cover implementation edge cases that the W3C suite leaves underspecified.

Jena-specific coverage areas

AreaTests
XSD numeric promotions (xsd:integer → xsd:decimal → xsd:double)sparql-query
Mixed-type arithmetic and comparisonssparql-query
Timezone-aware xsd:dateTime comparisonssparql-query
Date/time built-ins: NOW(), YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ()sparql-query
xsd:decimal arithmetic: ROUND(), CEIL(), FLOOR(), ABS()sparql-query
Blank nodes in CONSTRUCT templatessparql-query
Blank-node identity across OPTIONAL and GRAPH boundariessparql-query
String functions: STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT()sparql-query
SPARQL UPDATE edge casessparql-update
Syntax acceptance / rejection (positive/negative syntax tests)sparql-syntax
Algebra normalisation equivalencesalgebra

CI status

The jena-suite CI job is non-blocking until pass rate ≥ 95%, then promoted to required. Known failures for type-coercion and date-time edge cases are tracked in tests/conformance/known_failures.txt with the jena: prefix.

Running locally

# Download Jena test data:
bash scripts/fetch_conformance_tests.sh --jena

# Run the full Jena suite:
cargo test --test jena_suite

SPARQL 1.1 Query

Test suite: W3C SPARQL 1.1 Query test suite (2013-03-27)

Target: ≥ 95% of applicable tests pass.

Supported features

FeatureStatus
Basic Graph Patterns (BGP)Supported
FILTER with all comparison and logical operatorsSupported
OPTIONALSupported
UNIONSupported
Subqueries (SELECT … { SELECT … })Supported
BINDSupported
VALUESSupported
Property paths (/, |, *, +, ?, ^)Supported
Negated property sets (!(p1|p2))Supported
Aggregates: COUNT, SUM, AVG, MIN, MAXSupported
GROUP BY, HAVINGSupported
ORDER BY, LIMIT, OFFSETSupported
DISTINCTSupported
ASKSupported
CONSTRUCTSupported
DESCRIBESupported
Named graphs (GRAPH ?g { … })Supported
Federated query (SERVICE)Supported (v0.16.0)
All XPath/SPARQL built-in functions (STR, STRLEN, UCASE, LCASE, STRSTARTS, STRENDS, CONTAINS, REGEX, ABS, CEIL, FLOOR, ROUND, IF, COALESCE, isIRI, isLiteral, isBlank, DATATYPE, LANG, BIND)Supported
Language-tagged literals (storage and LANG() function)Supported
Typed literals with xsd:integer, xsd:decimal, xsd:double, xsd:dateTime, xsd:booleanSupported
NOT EXISTSSupported
MINUSSupported
RDF-star (quoted triples, SPARQL-star BGP)Supported (v0.4.0)

Known limitations

FeatureStatus
langMatches() functionNot supported. Returns 0 rows without error. Full BCP 47 language tag matching is planned for a future release.
Custom aggregate extensions (property functions)Not supported. Standard aggregates (COUNT, SUM, AVG, MIN, MAX) are fully supported.
Variable-inside-quoted-triple patterns (<< ?s ?p ?o >>)Returns 0 rows with a WARNING. Ground quoted-triple patterns work.
LOAD <url> from arbitrary HTTP URIsNetwork-access dependent; supported via pg_ripple_http companion service.

SPARQL 1.1 Update

Test suite: W3C SPARQL 1.1 Update test suite (2013)

Target: ≥ 95% of applicable tests pass.

Supported features

| Feature | Status |
| --- | --- |
| INSERT DATA | Supported |
| DELETE DATA | Supported |
| INSERT WHERE | Supported |
| DELETE WHERE | Supported |
| DELETE/INSERT WHERE | Supported |
| CLEAR GRAPH | Supported |
| CREATE GRAPH / DROP GRAPH | Supported |
| Multi-statement updates (; separator) | Supported |
| Named graph update operations | Supported |
| Idempotent re-insert (ON CONFLICT DO NOTHING) | Supported |

Known limitations

| Feature | Status |
| --- | --- |
| COPY, MOVE, ADD graph operations | Implemented as no-ops returning 0; full implementation planned for v0.21.0. |
| LOAD <url> | Same as for queries above. |

SHACL Core

Test suite: W3C SHACL Core test suite

Target: ≥ 95% of SHACL Core tests pass.

Supported constraints

| Constraint | Status |
| --- | --- |
| sh:targetClass | Supported |
| sh:targetNode | Supported |
| sh:targetSubjectsOf | Supported |
| sh:targetObjectsOf | Supported |
| sh:property with sh:path | Supported |
| sh:minCount / sh:maxCount | Supported |
| sh:datatype | Supported |
| sh:pattern (regex) | Supported |
| sh:minLength / sh:maxLength | Supported |
| sh:minInclusive / sh:maxInclusive | Supported |
| sh:minExclusive / sh:maxExclusive | Supported |
| sh:in (enumeration) | Supported |
| sh:hasValue | Supported |
| sh:class | Supported |
| sh:nodeKind (IRI, BlankNode, Literal) | Supported |
| sh:or | Supported |
| sh:and | Supported |
| sh:not | Supported |
| sh:node (nested shape reference) | Supported |
| sh:qualifiedValueShape + sh:qualifiedMinCount / sh:qualifiedMaxCount | Supported |
| Async validation pipeline (process_validation_queue) | Supported |
| Sync mode (insert rejection) | Supported |

Known limitations

| Feature | Status |
| --- | --- |
| SHACL Advanced Features (SPARQL-based constraints, sh:SPARQLConstraint) | Deferred to v0.21.0. |
| SHACL-AF (rules, sh:TripleRule) | Partial implementation via Datalog; full SHACL-AF integration planned. |

Running the conformance gate

The conformance tests run as part of the standard pg_regress suite:

cargo pgrx regress pg18 --postgresql-conf "allow_system_table_mods=on"

The relevant test files are:

  • tests/pg_regress/sql/w3c_sparql_query_conformance.sql
  • tests/pg_regress/sql/w3c_sparql_update_conformance.sql
  • tests/pg_regress/sql/w3c_shacl_conformance.sql
  • tests/pg_regress/sql/crash_recovery_merge.sql

OWL 2 RL Conformance Baseline (v0.47.0)

This page documents the OWL 2 RL conformance baseline for pg_ripple v0.47.0, as measured against the OWL 2 RL rule suite added in v0.46.0.

Summary

| Category | Rules tested | Passing | XFAIL | Notes |
| --- | --- | --- | --- | --- |
| cls (class axioms) | 12 | 12 | 0 | Full pass |
| prp (property axioms) | 18 | 17 | 1 | prp-spo2 (complex chain) XFAIL |
| cax (class axiom entailments) | 8 | 8 | 0 | Full pass |
| scm (schema entailments) | 14 | 13 | 1 | scm-sco (cyclical subclass) XFAIL |
| eq (equality reasoning) | 10 | 9 | 1 | eq-diff1 with owl:differentFrom XFAIL |
| dt (datatype reasoning) | 4 | 3 | 1 | dt-type2 (xsd:double precision) XFAIL |
| Total | 66 | 62 | 4 | 93.9% pass rate |

Known Failures (XFAIL)

These failures are documented in tests/owl2rl/known_failures.txt and tracked release-to-release for regression detection.

prp-spo2 — Complex sub-property chain

OWL 2 RL rule prp-spo2 requires support for owl:propertyChainAxiom. pg_ripple handles two-hop chains, but the standard test case uses a three-hop chain, which requires recursive sub-property expansion that is not yet implemented.

Impact: Low — three-hop chains are rare in practice. Target: v0.49.0
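The mechanics can be pictured with a small forward-chaining sketch (illustrative Python, not pg_ripple internals): a two-hop chain axiom is a single self-join over the triple set, whereas a three-hop chain needs the output of the first join fed into a second one — the recursive expansion referred to above.

```python
# Conceptual sketch: a two-hop owl:propertyChainAxiom (p1 o p2 -> p)
# derived with one join pass over a set of (s, p, o) triples.
def apply_chain2(triples, p1, p2, p):
    """Derive (x, p, z) whenever (x, p1, y) and (y, p2, z) hold."""
    derived = set()
    for (s1, pred1, o1) in triples:
        if pred1 != p1:
            continue
        for (s2, pred2, o2) in triples:
            if pred2 == p2 and s2 == o1:
                derived.add((s1, p, o2))
    return derived

facts = {
    ("alice", "hasParent", "bob"),
    ("bob", "hasParent", "carol"),
}
# hasParent o hasParent -> hasGrandparent
out = apply_chain2(facts, "hasParent", "hasParent", "hasGrandparent")
# out == {("alice", "hasGrandparent", "carol")}
```

A three-hop chain would need `apply_chain2`'s result joined against the triples again, which is what the planned recursive expansion provides.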

scm-sco — Cyclical subclass entailment

The test graph contains a subclass cycle (A rdfs:subClassOf B, B rdfs:subClassOf A). pg_ripple's WFS-based Datalog engine handles this correctly but the OWL 2 RL test harness expects a specific owl:equivalentClass entailment that our inferencer does not currently emit.

Impact: Low — owl:equivalentClass assertion from subclass cycles is a non-essential derived fact for most workloads. Target: v0.49.0

eq-diff1 — owl:differentFrom combined with owl:sameAs

The test requires detecting inconsistency when a node is asserted both owl:sameAs and owl:differentFrom another node and emitting the resulting owl:Nothing entailment. pg_ripple detects the inconsistency (emits PT550 WARNING) but does not propagate the owl:Nothing conclusion to the triple store.

Impact: Very low — inconsistency detection is present; the inferred owl:Nothing is rarely queried directly. Target: v0.50.0

dt-type2 — xsd:double precision rounding

The OWL 2 RL test for xs:double datatype entailment requires "1.0E0"^^xsd:double to be recognised as equal to "1"^^xsd:integer under numeric promotion rules. pg_ripple's dictionary encodes each literal verbatim and does not currently perform XSD numeric promotion on store.

Impact: Low — affects only mixed-type numeric comparison assertions. Target: v0.51.0 (XSD numeric tower)
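The promotion the test expects can be sketched as follows (illustrative only — `xsd_numeric_value` is a hypothetical helper, and Python's Decimal stands in for the full XSD numeric tower):

```python
from decimal import Decimal

def xsd_numeric_value(lexical: str) -> Decimal:
    # Promote an xsd:integer / xsd:decimal / xsd:double lexical form to a
    # common comparable value. Decimal accepts all three forms, including
    # exponent notation such as "1.0E0". Real XSD promotion maps mixed
    # operands to xsd:double; Decimal is a simplification for this sketch.
    return Decimal(lexical)

# Verbatim lexical forms differ, but the promoted values compare equal,
# which is what the dt-type2 entailment requires:
same = xsd_numeric_value("1.0E0") == xsd_numeric_value("1")  # True
```

pg_ripple currently compares the stored lexical forms, so `"1.0E0"` and `"1"` remain distinct terms until the XSD numeric tower lands in v0.51.0.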

Pass Rate History

| Version | Passing / Total | Pass rate |
| --- | --- | --- |
| v0.46.0 | n/a (suite added) | — |
| v0.47.0 | 62 / 66 | 93.9% |

Running the Suite

# Requires: cargo pgrx start pg18 first
cargo pgrx regress pg18 -- tests/pg_regress/sql/owl2rl_*.sql

Or with the justfile recipe:

just test-regress

The known-failure list is maintained in tests/owl2rl/known_failures.txt. Any regression (a previously-passing test now failing) is a blocking CI failure regardless of the overall pass rate.
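That gate reduces to a four-way classification per test (a hypothetical sketch of the policy, not the actual CI code): a failure is blocking only when the test is absent from the known-failure list, and a listed test that passes is flagged for manifest cleanup.

```python
# Classify one test result against the known-failure list.
def classify(test_id, passed, known_failures):
    if passed:
        return "XPASS" if test_id in known_failures else "PASS"
    return "XFAIL" if test_id in known_failures else "REGRESSION"

known = {"prp-spo2", "scm-sco", "eq-diff1", "dt-type2"}
classify("cls-int1", True,  known)   # "PASS"
classify("dt-type2", False, known)   # "XFAIL"       (tracked, non-blocking)
classify("dt-type2", True,  known)   # "XPASS"       (remove the entry)
classify("cls-int1", False, known)   # "REGRESSION"  (blocks CI)
```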

WatDiv Benchmark Results

WatDiv (Waterloo SPARQL Diversity Test Suite) tests pg_ripple's correctness and query performance under realistic data distributions.

What WatDiv tests

WatDiv generates a synthetic e-commerce dataset at configurable scale and defines 100 query templates across four structural classes, each exercising different join patterns:

| Class | Templates | What it stresses |
| --- | --- | --- |
| Star (S1–S7) | 7 | Same subject, multiple predicates — VP table scan and star-join optimisation |
| Chain (C1–C3) | 3 | Linear predicate path — join ordering |
| Snowflake (F1–F5) | 5 | Star + chain hybrid — mixed join strategies |
| Complex (B1–B12, L1–L5) | 17 | Multi-hop with OPTIONAL and UNION — full algebra |

Correctness criterion

Each template is run against a 10M-triple dataset and the result row count is compared to a pre-computed baseline. A template passes when its row count is within ±0.1% of the baseline. Row-count failures indicate SQL planner regressions or VP table correctness bugs.

Performance criterion

Median query latency per template is recorded and compared to the previous release baseline. A regression > 20% triggers a CI warning (not a failure). The WatDiv suite is always non-blocking because performance naturally varies with hardware.
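Both gates can be sketched in a few lines (a hypothetical helper using the thresholds stated above — ±0.1% for correctness, +20% median latency for the warning):

```python
# Evaluate one WatDiv template against its baseline.
def check_template(rows, base_rows, median_ms, base_ms):
    correct = abs(rows - base_rows) <= 0.001 * base_rows   # within ±0.1%
    perf_warning = median_ms > 1.20 * base_ms              # >20% slower
    return correct, perf_warning

check_template(10_005, 10_000, 45.0, 40.0)  # (True, False)
check_template(10_000, 10_000, 50.0, 40.0)  # (True, True)  latency warning
```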

Running locally

# 1. Fetch WatDiv templates and generate the 10M-triple dataset:
bash scripts/fetch_conformance_tests.sh --watdiv

# 2. Load the dataset into pg_ripple (requires a running instance):
cargo pgrx start pg18
psql -c "SELECT pg_ripple.load_ntriples(pg_read_file('tests/watdiv/data/watdiv-10M.nt'), false);"

# 3. Run the suite:
cargo test --test watdiv_suite

CI job

The watdiv-suite CI job runs on every push to main and:

  1. Checks correctness (row count ±0.1% per template)
  2. Records per-template median latency
  3. Writes results to tests/conformance/report.json as a CI artifact

The job is non-blocking (performance regressions are warnings, not failures).

Results table (v0.46.0, 10M triples, 8-core CI runner)

Results are updated automatically on each release. The table below reflects the v0.46.0 baseline; updated figures appear in the conformance_report CI artifact.

| Template | Class | Expected rows | Status |
| --- | --- | --- | --- |
| S1 | Star | — | — |
| S2 | Star | — | — |
| S3 | Star | — | — |
| S4 | Star | — | — |
| S5 | Star | — | — |
| S6 | Star | — | — |
| S7 | Star | — | — |
| C1 | Chain | — | — |
| C2 | Chain | — | — |
| C3 | Chain | — | — |
| F1 | Snowflake | — | — |
| F2 | Snowflake | — | — |
| F3 | Snowflake | — | — |
| F4 | Snowflake | — | — |
| F5 | Snowflake | — | — |
| B1–B12 | Complex | — | — |
| L1–L5 | Complex | — | — |

Note: Row counts and latency baselines are populated on first run against a freshly generated WatDiv 10M dataset. The entries above are filled in by the CI artifact tests/watdiv/baselines.json after the first run.

Known limitations

  • Templates that use %var% substitution markers require concrete IRI bindings sampled from the dataset. Templates without substitution markers run as-is.
  • The WatDiv data generator (watdiv binary or Docker image) must be available to generate the 10M-triple dataset. CI uses the pre-cached artifact from the first successful run.

See also

Running Conformance Tests

pg_ripple ships four complementary conformance suites that can be run locally or in CI. This page covers how to set up data, run each suite, and interpret results.

Prerequisites

  • A working pg_ripple development environment
  • cargo pgrx installed and initialised for PostgreSQL 18
  • curl or wget for downloading test data
  • Docker (optional) for generating the WatDiv dataset

One-command setup

Download the test data for the three suites that need it (W3C SPARQL 1.1, Apache Jena, and WatDiv) in one step — the LUBM fixture is bundled with the repository:

bash scripts/fetch_conformance_tests.sh

Or fetch individual suites:

bash scripts/fetch_conformance_tests.sh --w3c      # W3C SPARQL 1.1 only
bash scripts/fetch_conformance_tests.sh --jena     # Apache Jena only
bash scripts/fetch_conformance_tests.sh --watdiv   # WatDiv only
bash scripts/fetch_conformance_tests.sh --force    # re-download everything

W3C SPARQL 1.1 suite

Data location

tests/w3c/data/ (default) or the directory in W3C_TEST_DIR.

Running

# Start pg_ripple
cargo pgrx start pg18

# Smoke subset (180 tests, ~30s — fastest feedback):
cargo test --test w3c_smoke -- --nocapture

# Full suite (3 000+ tests, ~2min with 8 threads):
W3C_THREADS=8 cargo test --test w3c_suite -- --nocapture

Known failures

Add entries to tests/conformance/known_failures.txt, each line prefixed with the w3c: marker:

# Example — property-path regression, fix in progress
w3c:http://www.w3.org/2009/sparql/docs/tests/data-sparql11/property-path/manifest#pp35  pp inside GRAPH

Remove entries when the underlying bug is fixed.


Apache Jena suite

Data location

tests/jena/data/ (default) or the directory in JENA_TEST_DIR.

Running

# Download Jena test data (one-time):
bash scripts/fetch_conformance_tests.sh --jena

# Run the full suite (~1 000 tests, target < 3 minutes):
JENA_THREADS=8 cargo test --test jena_suite -- --nocapture

Coverage

Jena tests focus on implementation edge cases:

  • Type coercion — XSD numeric promotions, mixed-type arithmetic
  • Date/time — timezone-aware comparisons, YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ()
  • Blank-node scoping — CONSTRUCT templates, GRAPH boundaries, OPTIONAL
  • String functions — STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT()
  • Numeric precision — xsd:decimal arithmetic, ROUND(), CEIL(), FLOOR(), ABS()

Known failures

Prefix entries with jena: in tests/conformance/known_failures.txt:

# Example — timezone-aware dateTime comparison
jena:http://jena.example.org/tests/sparql-query/manifest#dateTime-tz-offset  TZ offset handling

The CI job is non-blocking until pass rate ≥ 95%.


WatDiv benchmark suite

Data location

  • Templates: tests/watdiv/templates/ (or WATDIV_TEMPLATE_DIR)
  • RDF data: tests/watdiv/data/ (or WATDIV_DATA_DIR)
  • Baselines: tests/watdiv/baselines.json (or WATDIV_BASELINE_FILE)

Data generation

The WatDiv 10M-triple dataset is generated once and cached as a CI artifact.

# Using Docker:
docker run --rm dcslab/watdiv -s 1 -t 10000000 > tests/watdiv/data/watdiv-10M.nt

# Using a local binary:
WATDIV_BINARY=/usr/local/bin/watdiv bash scripts/fetch_conformance_tests.sh --watdiv

Loading the dataset

Before running the WatDiv suite, load the dataset into pg_ripple:

cargo pgrx start pg18
psql -d postgres -c "SELECT pg_ripple.load_ntriples(pg_read_file('tests/watdiv/data/watdiv-10M.nt'), false);"

Running

# Run all 100 templates (target < 5 min on 8-core runner):
WATDIV_THREADS=8 cargo test --test watdiv_suite -- --nocapture

Interpreting results

  • Correctness pass: row count within ±0.1% of baseline
  • Performance warning: median latency > 20% above baseline (non-blocking)
  • Baselines: stored in tests/watdiv/baselines.json — update after intentional performance changes

Known failures

Prefix entries with watdiv: in tests/conformance/known_failures.txt:

# Example — complex template with OPTIONAL cardinality edge case
watdiv:B7  known cardinality mismatch with OPTIONAL

LUBM benchmark suite (v0.44.0+)

The LUBM (Lehigh University Benchmark) suite validates OWL RL inference correctness through 14 canonical SPARQL queries over a university-domain ontology.

Data location

The LUBM suite is self-contained — no download or external data generation is needed. The synthetic fixture is bundled at tests/lubm/fixtures/univ1.ttl.

Running

# Start pg_ripple
cargo pgrx start pg18

# Run all 14 LUBM queries + Datalog validation sub-suite (< 30s):
cargo test --test lubm_suite -- --nocapture

What is tested

  • 14 canonical queries (tests/lubm/queries/q01.sparql–q14.sparql) against the bundled univ1 fixture — exact row-count validation.
  • OWL RL rule loading via pg_ripple.load_rules_builtin('owl-rl').
  • Inference materialization via pg_ripple.infer('owl-rl') — verifies fixpoint is reached in ≤ 10 iterations and completes in < 5 s.
  • Goal queries via pg_ripple.infer_goal() — validates inference engine results match SPARQL query results.
  • Custom Datalog rules — defines ad-hoc rules on LUBM data and validates correctness.
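The bounded-iteration check is easiest to picture with a semi-naive fixpoint sketch (illustrative Python — transitive closure standing in for the OWL RL rules; this is not the engine's code):

```python
# Semi-naive evaluation: each round joins only the facts derived in the
# previous round (the delta) against the base relation, so the loop
# count is exactly the number of rounds needed to reach the fixpoint.
def transitive_closure(edges, max_iterations=10):
    all_facts = set(edges)
    delta = set(edges)
    for iteration in range(1, max_iterations + 1):
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        new -= all_facts                      # keep only genuinely new facts
        if not new:
            return all_facts, iteration       # fixpoint reached
        all_facts |= new
        delta = new
    raise RuntimeError("fixpoint not reached within iteration cap")

facts, rounds = transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})
# ("a", "d") is derived; the length-3 chain converges well inside the cap
```

The LUBM suite asserts the same property for `pg_ripple.infer('owl-rl')`: the delta empties out within 10 rounds on the univ1 fixture.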

Known failures

Prefix entries with lubm: in tests/conformance/known_failures.txt:

# Example — Q2 multi-hop join returns wrong count
lubm:Q2  multi-hop memberOf/subOrganizationOf join bug

Regenerating baselines

If the fixture is changed, regenerate the baseline counts:

cargo pgrx start pg18
# Run the suite once, observe the actual counts in the output,
# then update tests/lubm/baselines/univ1.json accordingly.

See also

  • LUBM Results — full conformance table and Datalog sub-suite results

Unified report

All suites write results to tests/conformance/report.json:

{
  "w3c":    { "suite": "w3c",    "total": 3100, "passed": 3097, "failed": 0, ... },
  "jena":   { "suite": "jena",   "total": 1000, "passed": 983,  "failed": 0, ... },
  "watdiv": { "suite": "watdiv", "total": 100,  "passed": 100,  "failed": 0, ... }
}

This file is uploaded as the conformance_report CI artifact after each run. (The LUBM suite writes pass/fail results to stdout; a JSON report artifact is planned for v0.45.0.)

Updating baselines

After intentional performance improvements, regenerate the WatDiv baselines:

# Run the suite to populate baselines.json:
cargo test --test watdiv_suite -- --nocapture
# Then commit the updated baselines.json.

Updating the known-failures manifest

The unified known-failures file lives at tests/conformance/known_failures.txt. Format:

# Comment lines are ignored.
# Each entry: <suite>:<test-key>  <optional reason>
w3c:http://...    reason
jena:http://...   reason
watdiv:S3         reason
lubm:Q2           reason

Any test listed here that unexpectedly passes (XPASS) triggers a CI notice to remove the entry.
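A parser for this format is a few lines (a hypothetical sketch — the real harness is in the Rust test code). Note that splitting the key on the first colon keeps `http://…` test IRIs intact:

```python
# Parse known_failures.txt: "#" comments, entries "<suite>:<key>  <reason>".
def parse_known_failures(text):
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, *rest = line.split(None, 1)            # split on first whitespace run
        suite, _, test_key = key.partition(":")     # first colon only
        entries.setdefault(suite, {})[test_key] = rest[0] if rest else ""
    return entries

manifest = """
# Comment lines are ignored.
watdiv:S3   row-count drift under investigation
lubm:Q2     multi-hop join bug
"""
parse_known_failures(manifest)
# {"watdiv": {"S3": "row-count drift under investigation"},
#  "lubm": {"Q2": "multi-hop join bug"}}
```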

See also

Error Message Catalog

pg_ripple uses structured error codes in the range PT001–PT799, organized by subsystem. Error messages follow PostgreSQL conventions: lowercase first word, no trailing period.

Finding the error code

Error codes appear in the DETAIL field of PostgreSQL error messages. Use \errverbose in psql to see the full error context including the code.


PT001–PT099: Dictionary

Errors from the IRI/literal/blank-node → integer encoding subsystem.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT001 | dictionary encode failed: hash collision detected | Two distinct terms produced the same XXH3-128 hash (extremely rare) | Report to maintainers with the two colliding terms |
| PT002 | dictionary decode failed: id not found | The integer ID does not exist in _pg_ripple.dictionary | Data may be corrupt; run pg_ripple.vacuum_dictionary() and check VP tables |
| PT003 | invalid term kind: expected 0 (IRI), 1 (literal), 2 (blank node) | Wrong kind integer passed to encode_term() | Use 0 for IRIs, 1 for literals, 2 for blank nodes |
| PT004 | quoted triple components not found | A quoted-triple ID references qt_s/qt_p/qt_o values that are missing from the dictionary | Re-load the RDF-star data; may indicate a partial load failure |
| PT005 | inline-encoded literal decode failed | Internal decoding error for small inline-encoded literals | Report to maintainers with the literal value |
| PT006 | dictionary batch insert failed | The ON CONFLICT DO NOTHING … RETURNING batch insert encountered an unexpected error | Check PostgreSQL logs for disk space or permission issues |
| PT007 | dictionary lookup: NULL term | A NULL value was passed where an IRI, literal, or blank node was expected | Ensure all arguments are non-NULL |
| PT008 | malformed IRI: <detail> | The IRI string does not conform to RFC 3987 | Fix the IRI syntax; IRIs must be wrapped in angle brackets <…> |
| PT009 | malformed literal: <detail> | The literal string cannot be parsed | Use N-Triples syntax: "value", "value"@lang, or "value"^^<datatype> |
| PT010 | malformed blank node: <detail> | The blank node label is invalid | Blank nodes must start with _: followed by a valid label |
| PT011 | dictionary cache full, eviction failed | The LRU cache could not evict entries | Increase pg_ripple.dictionary_cache_size |
| PT012 | prewarm_dictionary_hot: table not found | The dictionary table does not exist | Run CREATE EXTENSION pg_ripple first |

PT100–PT199: Storage

Errors from the VP table storage layer, HTAP partitions, and rare-predicate management.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT100 | insert_triple: predicate IRI required | The predicate argument is NULL or empty | Provide a valid predicate IRI |
| PT101 | VP table creation failed | DDL error when creating a new VP table | Check pg_log for the underlying PostgreSQL error |
| PT102 | htap_migrate_predicate: predicate not found | The predicate ID does not exist in _pg_ripple.predicates | Verify the predicate IRI and that triples exist for it |
| PT103 | merge: lock_timeout exceeded during main table swap | Another transaction held a lock on the VP table for too long | Retry; consider increasing lock_timeout for maintenance windows |
| PT104 | rare-predicate promotion failed | Error promoting a predicate from vp_rare to a dedicated VP table | Check disk space and user permissions |
| PT105 | delete_triple: predicate not found in catalog | The triple's predicate has no VP table | The predicate may never have been used, or was already compacted |
| PT106 | VP table not found: <table_name> | The VP table referenced in _pg_ripple.predicates does not exist on disk | Run pg_ripple.compact() to reconcile the catalog |
| PT107 | delta table insert failed | Error writing to the HTAP delta partition | Check PostgreSQL logs for tablespace or permission issues |
| PT108 | tombstone insert failed | Error recording a deletion in the tombstones table | Check PostgreSQL logs |
| PT109 | merge worker: unexpected state | The background merge worker encountered an inconsistent state | Restart PostgreSQL; check pg_log for crash details |
| PT110 | statement_id_seq: sequence exhausted | The global statement ID sequence has reached its maximum | This is unlikely with BIGINT; contact maintainers |
| PT111 | vp_rare: row limit exceeded | The rare-predicate table has too many rows for a single predicate | Manually promote with promote_rare_predicates() or lower pg_ripple.vp_promotion_threshold |
| PT112 | deduplicate: advisory lock not acquired | Another deduplication operation is already running | Wait and retry |
| PT113 | create_graph: invalid graph IRI | The graph IRI is malformed | Graph IRIs must be valid absolute IRIs in angle brackets |
| PT114 | drop_graph: graph not found | The named graph does not exist | Use list_graphs() to check available graphs |

PT200–PT299: SPARQL

Errors from the SPARQL parser, algebra optimizer, and SQL code generator.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT200 | SPARQL parse error: <detail> | The SPARQL query has a syntax error | Fix the syntax; use sparql_explain() to validate without executing |
| PT201 | unsupported SPARQL algebra node: <type> | The query uses a feature not yet implemented | Check the compliance matrix for supported features |
| PT202 | SPARQL SELECT: no projected variables | The SELECT clause has no variables | Add at least one ?variable to the SELECT clause |
| PT203 | property path depth exceeded max_path_depth | A recursive property path exceeded the configured depth limit | Increase pg_ripple.max_path_depth or simplify the path expression |
| PT204 | SPARQL federated SERVICE: endpoint not reachable | The remote SPARQL endpoint did not respond | Check the endpoint URL and network connectivity |
| PT205 | SPARQL VALUES clause: column count mismatch | The number of values in a VALUES row does not match the variable list | Ensure each VALUES row has the same number of columns as variables |
| PT206 | SPARQL type error: <detail> | A type mismatch in a FILTER expression | Check operand types; e.g., comparing a string to an integer |
| PT207 | SPARQL CONSTRUCT: template variable not in WHERE | A variable in the CONSTRUCT template is not bound in the WHERE clause | Bind all template variables in the WHERE clause |
| PT208 | SPARQL DESCRIBE: no resource specified | DESCRIBE requires at least one resource or variable | Add a resource IRI or variable to the DESCRIBE clause |
| PT209 | SPARQL aggregate: variable not grouped | A non-aggregated variable is used outside GROUP BY | Add the variable to GROUP BY or wrap it in an aggregate function |
| PT210 | SPARQL HAVING: refers to non-aggregate | The HAVING clause references a variable that is not an aggregate result | Use an aggregate function in HAVING |
| PT211 | generated SQL execution failed: <detail> | The SQL generated from SPARQL failed to execute | Check pg_log for the underlying error; report if reproducible |
| PT212 | plan cache: entry evicted during execution | A cached plan was evicted while the query was still running | Increase pg_ripple.plan_cache_size |
| PT213 | SPARQL SERVICE: response parse error | The federated endpoint returned malformed results | Check the remote endpoint's response format |
| PT214 | SPARQL SERVICE: timeout after <N> ms | The federated request exceeded pg_ripple.federation_timeout | Increase the timeout or simplify the SERVICE query |
| PT215 | SPARQL UPDATE parse error: <detail> | The SPARQL Update statement has a syntax error | Fix the syntax |
| PT216 | SPARQL UPDATE: LOAD failed for <url> | The LOAD operation could not retrieve the remote resource | Check URL, network, and pg_ripple.federation_timeout |
| PT217 | SPARQL UPDATE: unsupported content type <type> | The LOAD target serves an unrecognized RDF format | The URL must serve Turtle, N-Triples, N-Quads, TriG, or RDF/XML |
| PT218 | SPARQL UPDATE: CREATE GRAPH already exists | The graph already exists | Use CREATE SILENT GRAPH to suppress this error |
| PT219 | SPARQL UPDATE: DROP GRAPH not found | The graph does not exist | Use DROP SILENT GRAPH to suppress this error |

PT300–PT399: SHACL

Errors from the SHACL shapes loader, validator, and async monitoring pipeline.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT300 | SHACL parse error: <detail> | The Turtle-encoded SHACL shapes have a syntax error | Fix the Turtle syntax in the shapes definition |
| PT301 | SHACL sync validation failed: <shape> <message> | A triple violates a SHACL constraint during synchronous validation | Fix the data to conform to the shape, or modify the shape |
| PT302 | SHACL shape not found: <iri> | The referenced shape has not been loaded | Load the shape with load_shacl() first |
| PT303 | SHACL DAG monitor: pg_trickle not installed | DAG-aware monitors require the pg_trickle extension | Install pg_trickle or use enable_shacl_monitors() for trigger-based validation |
| PT304 | SHACL: unsupported constraint component <type> | The shape uses a SHACL-AF or SHACL-JS constraint | Only SHACL Core constraints are supported |
| PT305 | SHACL: sh:path too complex | The property path in the shape exceeds supported complexity | Simplify the sh:path expression |
| PT306 | SHACL: validation queue overflow | The async validation queue has exceeded its capacity | Process the queue with process_validation_queue() or increase the queue size |
| PT307 | SHACL: dead letter queue threshold reached | Too many validation failures have accumulated | Inspect with dead_letter_queue() and address the failures |
| PT308 | SHACL: sh:targetClass not found | The target class IRI is not present in the data | Load data with the target class, or fix the class IRI |
| PT309 | SHACL: circular shape reference | A shape references itself through sh:node or sh:qualifiedValueShape | Break the circular reference |
| PT310 | SHACL: drop_shape: shape has active monitors | Cannot drop a shape that has active monitors | Disable monitors first with disable_shacl_dag_monitors() |

PT400–PT499: Datalog — Rules

Errors from the Datalog rule parser, stratifier, and rule management.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT400 | rule parse error: <detail> | The Datalog rule has a syntax error | Fix the rule syntax; see Reasoning and Inference for syntax reference |
| PT401 | rule stratification failed: unstratifiable program | The rule set contains a cycle through negation that prevents stratification | Rewrite rules to break the negation cycle, or use infer_wfs() for well-founded semantics |
| PT402 | rule set not found: <name> | The referenced rule set has not been loaded | Load it with load_rules() or load_rules_builtin() |
| PT403 | inference: maximum iteration depth exceeded | Semi-naive evaluation did not converge within the iteration limit | Simplify the rule set or increase statement_timeout |
| PT404 | constraint violation detected: <rule> | A constraint rule (:- body.) fired | Check the data against the constraint body |
| PT405 | rule set already exists: <name> | A rule set with this name is already loaded | Drop it first with drop_rules(), or choose a different name |
| PT406 | rule: unsafe variable <var> | A variable appears in the head but not in a positive body literal | Ensure every head variable also appears in a positive body literal |
| PT407 | rule: built-in predicate not recognized: <name> | An unknown built-in predicate was used | Check available built-ins: =, !=, <, >, <=, >=, +, -, *, / |
| PT408 | rule: aggregation variable not in group-by | An aggregated variable is used outside the grouping context | Add the variable to the group-by list |

PT500–PT599: Datalog — Inference Engine

Errors from the materialization engine, magic sets optimizer, WFS evaluator, and tabling.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT500 | infer: no enabled rule sets | infer() was called with no rule sets enabled | Enable at least one rule set with enable_rule_set() |
| PT501 | infer: SPI execution failed during iteration <N> | The SQL generated for a rule body failed | Check pg_log for the underlying error |
| PT502 | infer_demand: magic set rewriting failed | The demand transformation could not be applied | Simplify the goal pattern or rule set |
| PT503 | infer_demand: goal pattern too broad | The goal has no bound arguments, defeating the purpose of demand-driven evaluation | Bind at least one argument in the goal |
| PT504 | infer_wfs: unfounded set computation exceeded limit | The well-founded semantics fixpoint did not converge | Simplify the rule set or check for unusual negation patterns |
| PT505 | infer_wfs: three-valued model contains undefined atoms | Some atoms could not be classified as true or false | This is expected in WFS; query the undefined result set to see which atoms |
| PT506 | tabling: memo store overflow | The tabling memo store exceeded its size limit | Increase pg_ripple.tabling_memo_size |
| PT507 | infer_agg: aggregation cycle detected | An aggregation rule depends on its own aggregate result | Rewrite to break the cycle |
| PT508 | infer_goal: predicate not in any rule set | The goal predicate is not defined by any loaded rule | Load a rule set that defines the predicate |
| PT509 | owl:sameAs canonicalization: cycle limit exceeded | The owl:sameAs equivalence class merging exceeded the iteration limit | Check for very large owl:sameAs clusters |
| PT520 | infer_wfs: iteration cap reached (<N> iterations) | The WFS alternating fixpoint did not converge within pg_ripple.wfs_max_iterations | Emitted as WARNING; the partial result is returned with "stratifiable": false; increase the cap or simplify the rule set |
| PT540 | lattice: fixpoint did not converge after <N> iterations | The lattice fixpoint did not stabilise within pg_ripple.lattice_max_iterations | Increase pg_ripple.lattice_max_iterations or verify that the join function is monotone |
| PT541 | lattice: join_fn <name> could not be resolved | The user-supplied join function name could not be resolved via regprocedure | Check the function name, schema, and argument types; use a fully-qualified name |
| PT542 | federation: result decoder received unparseable XML/JSON | The SPARQL results response from a remote SERVICE endpoint could not be parsed | Check the endpoint's response format; ensure it returns application/sparql-results+xml or +json |

PT600–PT699: Export / HTTP

Errors from export serializers, GraphRAG export, and the HTTP companion service.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT600 | export: serialization failed for triple <sid> | A triple could not be serialized to the target format | Check that the triple's dictionary entries are intact |
| PT601 | export: unsupported format <format> | An unrecognized export format was requested | Use ntriples, nquads, turtle, or jsonld |
| PT602 | export_turtle_stream: batch_size must be > 0 | Invalid batch size | Use a positive integer |
| PT603 | export_jsonld: framing failed | The JSON-LD framing algorithm encountered an error | Check the frame structure; see JSON-LD Framing |
| PT604 | export_graphrag_entities: no entities found | No entities match the GraphRAG export criteria | Load data or adjust the GraphRAG ontology |
| PT605 | jsonld_frame_to_sparql: invalid frame | The JSON-LD frame could not be converted to SPARQL | Check the frame JSON structure |
| PT606 | export: streaming interrupted | The streaming export was cancelled or the client disconnected | Retry the export |

PT700–PT799: Configuration / Startup

Errors from extension initialization, GUC validation, and background workers.

| Code | Message | Cause | Fix |
| --- | --- | --- | --- |
| PT700 | _PG_init: cache_budget exceeds shared_memory_size | pg_ripple.cache_budget is larger than pg_ripple.shared_memory_size | Reduce cache_budget or increase shared_memory_size |
| PT701 | _PG_init: shmem initialization failed | Shared memory allocation failed | Increase system shared memory (kern.sysv.shmmax on macOS, kernel.shmmax on Linux) |
| PT702 | worker_database not set; merge worker defaulting to 'postgres' | The pg_ripple.worker_database GUC is not set | Set it to the correct database name in postgresql.conf |
| PT703 | merge worker watchdog: worker has been silent for <N> seconds | The background merge worker may have crashed | Check pg_log for crash details; restart PostgreSQL |
| PT704 | extension version mismatch: binary <v1>, control <v2> | The compiled extension version does not match pg_ripple.control | Rebuild and reinstall the extension |
| PT705 | GUC validation: <param> out of range | A GUC parameter was set to an invalid value | Check the GUC Reference for valid ranges |
| PT706 | shared_preload_libraries: pg_ripple not loaded | pg_ripple is not in shared_preload_libraries | Add pg_ripple to shared_preload_libraries in postgresql.conf and restart |
| PT707 | pg_trickle not installed | A feature requiring pg_trickle was called | Install pg_trickle or use the non-trickle alternative |
| PT708 | pgvector not installed | A vector/embedding function was called without pgvector | Install pgvector or disable with pg_ripple.pgvector_enabled = off |
| PT709 | enable_graph_rls: RLS policy creation failed | Row-level security policy could not be created | Check superuser privileges |
| PT710 | grant_graph: invalid permission | Permission must be 'read', 'write', or 'admin' | Use one of the three valid permission strings |

Reporting bugs

If you encounter an error code not listed here, or a message that says "contact maintainers", please open a GitHub issue with the full error output, your pg_ripple version (SELECT pg_ripple.canary()), and a minimal reproducer.

Lattice-Based Datalog Reference (v0.36.0)

Available since v0.36.0. Lattice-Based Datalog (Datalog^L) extends pg_ripple's Datalog engine with monotone lattice aggregation, enabling recursive aggregation without stratification constraints.


Background

Standard Datalog^agg stratifies aggregate functions: an aggregate can only appear at a strictly higher stratum than the predicate it aggregates over. This makes recursive aggregation (e.g., propagating minimum trust scores through a social graph) impossible to express without manual loop unrolling.

Lattice-Based Datalog lifts this restriction by requiring only that the aggregation operation is monotone with respect to a user-supplied lattice. A lattice is an algebraic structure (L, ⊔) where ⊔ is a commutative, associative, idempotent join operation with a bottom element ⊥. Fixpoint computation over such a lattice terminates whenever the lattice satisfies the ascending chain condition: it has no infinite strictly ascending chains.
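The algebraic laws are easy to check concretely. A sketch for a min lattice over 64-bit integers (join = minimum, bottom = +∞ encoded as i64::MAX, matching the built-in MinLattice described below) — illustrative Python, not engine code:

```python
# The lattice laws the fixpoint engine relies on, verified exhaustively
# over a small carrier set.
BOTTOM = 2**63 - 1          # i64::MAX encodes +infinity

def join(a, b):
    return min(a, b)

xs = [3, 7, 42, BOTTOM]
assert all(join(a, b) == join(b, a) for a in xs for b in xs)      # commutative
assert all(join(a, join(b, c)) == join(join(a, b), c)
           for a in xs for b in xs for c in xs)                   # associative
assert all(join(a, a) == a for a in xs)                           # idempotent
assert all(join(a, BOTTOM) == a for a in xs)                      # bottom is identity
```

Idempotence is what makes re-deriving the same fact harmless, and the identity law is why every annotation starts at ⊥.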

Key references

  • Abo Khamis et al., PODS 2017 — lattice-structured aggregation in Datalog
  • Alvaro et al., CIDR 2011 — monotone logic programming (Bloom^L)
  • Green et al., PODS 2007 — provenance semirings as a generalization of lattices

Built-in lattices

pg_ripple ships with four built-in lattice types that cover the most common use cases:

MinLattice (min)

join:   LEAST(a, b)       (PostgreSQL: min aggregate)
bottom: +∞  (encoded as 9223372036854775807 = i64::MAX)

Use cases: trust propagation, shortest-path weights, minimum-cost routing.

Example: propagate the minimum trust score along a path — the trustworthiness of a chain is limited by its weakest link.

MaxLattice (max)

join:   GREATEST(a, b)   (PostgreSQL: max aggregate)
bottom: −∞  (encoded as -9223372036854775808 = i64::MIN)

Use cases: reachability weights, longest-path annotation, maximum influence scores.

SetLattice (set)

join:   set union   (arrays merged via array_agg, then deduplicated)
bottom: {} (empty set)

Use cases: set-valued provenance annotation, multi-hop neighbourhood sets.

IntervalLattice (interval)

join:   interval hull   (min of lower bounds, max of upper bounds)
bottom: empty interval (encoded as 0)

Use cases: temporal reasoning, numeric range propagation.


User-defined lattices

Register a custom lattice with any PostgreSQL aggregate function as the join:

-- Minimum-cost routing over decimal weights.
SELECT pg_ripple.create_lattice('route_cost', 'min', '1e308');

-- Custom bounded lattice (values 0–100, join = LEAST).
SELECT pg_ripple.create_lattice('reputation', 'min', '100');

The join_fn must be a registered PostgreSQL aggregate (verified via pg_proc). A warning is emitted at registration time if the function is not yet visible, but the lattice is still stored — this allows pre-registering lattices before their custom aggregates are created.
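A minimal sketch of pairing a custom aggregate with create_lattice(). The names lex_max and lex_max_sfunc are hypothetical; the only requirement, as described above, is that the aggregate implements a commutative, associative, idempotent join:

```sql
-- Hypothetical join: keep the lexicographically greatest text label.
-- GREATEST(a, b) is commutative, associative, and idempotent over text.
CREATE OR REPLACE FUNCTION lex_max_sfunc(a text, b text) RETURNS text
  LANGUAGE sql IMMUTABLE AS $$ SELECT GREATEST(a, b) $$;

CREATE AGGREGATE lex_max(text) (SFUNC = lex_max_sfunc, STYPE = text);

-- Register it; the empty string is lexicographically minimal, so it
-- serves as the bottom element.
SELECT pg_ripple.create_lattice('label_max', 'lex_max', '');
```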


GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.lattice_max_iterations | integer | 1000 | Maximum fixpoint iterations; when exceeded, a PT540 warning is emitted and partial results are returned. Set to 0 for unlimited (not recommended). |
-- Change the iteration limit.
SET pg_ripple.lattice_max_iterations = 5000;

-- Check current setting.
SHOW pg_ripple.lattice_max_iterations;

SQL Functions

pg_ripple.create_lattice(name, join_fn, bottom) → boolean

Register a new lattice type in the _pg_ripple.lattice_types catalog.

| Parameter | Type | Description |
|---|---|---|
| name | text | Unique lattice name (case-sensitive) |
| join_fn | text | PostgreSQL aggregate function name |
| bottom | text | Bottom element as a text string |

Returns true if newly registered, false if the name already exists (idempotent).

SELECT pg_ripple.create_lattice('trust', 'min', '100');   -- true
SELECT pg_ripple.create_lattice('trust', 'min', '100');   -- false (idempotent)

pg_ripple.list_lattices() → jsonb

Return a JSON array of all registered lattice types (built-in and user-defined).

SELECT jsonb_pretty(pg_ripple.list_lattices());

Each entry has: name, join_fn, bottom, builtin.

pg_ripple.infer_lattice(rule_set, lattice_name) → jsonb

Run a monotone fixpoint over all active rules in rule_set using the specified lattice.

| Parameter | Default | Description |
|---|---|---|
| rule_set | 'custom' | Rule set name as used in load_rules() |
| lattice_name | 'min' | Lattice type to use for head-predicate joins |

Returns JSONB:

{
  "derived":         42,
  "iterations":       5,
  "lattice":       "min",
  "rule_set":  "my_rules"
}

Errors:

  • infer_lattice: unknown lattice type '...' — lattice not registered; call create_lattice() first.
  • PT540 WARNING — fixpoint did not converge within lattice_max_iterations.

Catalog table

Lattice types are stored in _pg_ripple.lattice_types:

SELECT * FROM _pg_ripple.lattice_types;

| Column | Type | Description |
|---|---|---|
| name | text | Primary key; lattice identifier |
| join_fn | text | PostgreSQL aggregate name |
| bottom | text | Bottom element as text |
| builtin | boolean | True for pre-registered lattices |
| created_at | timestamptz | Registration timestamp |

Complete example: Trust propagation

This example propagates minimum trust scores through a social graph. The trustworthiness of an indirect connection is bounded by the weakest link on the path.

-- 1. Create extension and configure lattice.
CREATE EXTENSION IF NOT EXISTS pg_ripple;
SELECT pg_ripple.create_lattice('trust', 'min', '100');

-- 2. Insert direct trust relationships (score: 0=no trust, 100=full trust).
SELECT pg_ripple.load_ntriples($$
  <https://trust.example/alice> <https://trust.example/directTrust> "90"^^<http://www.w3.org/2001/XMLSchema#integer> .
  <https://trust.example/bob>   <https://trust.example/directTrust> "70"^^<http://www.w3.org/2001/XMLSchema#integer> .
  <https://trust.example/carol> <https://trust.example/directTrust> "85"^^<http://www.w3.org/2001/XMLSchema#integer> .
  <https://trust.example/alice> <https://trust.example/knows>       <https://trust.example/bob> .
  <https://trust.example/bob>   <https://trust.example/knows>       <https://trust.example/carol> .
$$);

-- 3. Write trust-propagation rules (Datalog syntax). Every variable in
--    the head must be bound in the body; the lattice join (min) then
--    combines all derived transitTrust values per subject, so the
--    weakest link along a chain wins.
SELECT pg_ripple.load_rules($$
  ?x <https://trust.example/transitTrust> ?t :-
    ?x <https://trust.example/directTrust> ?t .
  ?y <https://trust.example/transitTrust> ?t :-
    ?x <https://trust.example/knows> ?y ,
    ?x <https://trust.example/transitTrust> ?t .
$$, 'trust_rules');

-- 4. Run lattice-based fixpoint.
SELECT pg_ripple.infer_lattice('trust_rules', 'trust');

-- 5. Query propagated trust values.
SELECT * FROM pg_ripple.sparql($$
  SELECT ?x ?t WHERE { ?x <https://trust.example/transitTrust> ?t }
$$);

Error code PT540

Meaning: the lattice fixpoint did not converge within the configured iteration limit.

Trigger: emitted as a PostgreSQL WARNING (not ERROR) when pg_ripple.lattice_max_iterations is exceeded.

Resolution options:

  1. Increase the limit:

    SET pg_ripple.lattice_max_iterations = 10000;
    
  2. Verify your lattice is finite: every value domain used in rules must have a finite number of distinct elements reachable from the bottom.

  3. Verify monotonicity: every operation in rule bodies must be monotone with respect to the lattice order. A non-monotone operation (e.g., negation) in a recursive rule violates the convergence guarantee.


Relationship to other pg_ripple inference modes

| Feature | Stratum requirement | Aggregation | Recursion |
|---|---|---|---|
| infer() — standard Datalog | Stratified | Not supported | Restricted |
| infer_wfs() — Well-Founded Semantics | None | Not supported | Full |
| infer_lattice() — Datalog^L | None | Monotone lattice joins | Full |

Use infer_lattice() when you need recursive aggregation with a convergence guarantee, for example: shortest paths, trust propagation, or set-reachability annotations.


Introduced in v0.36.0.

FAQ

General

Why VP tables instead of one big triple table?

A single (s, p, o, g) table with 100M triples needs composite B-tree indexes spanning all four columns to serve any useful query, and those indexes interleave entries for every predicate: even a highly selective, predicate-specific scan pays the cache-pressure and index-depth cost of the whole store.

Vertical Partitioning (one table per predicate) means a query for <ex:knows> triples only scans the vp_{knows_id} table — typically a fraction of the total data. The two B-tree indexes on (s, o) and (o, s) are small and cache-friendly. SPARQL star-patterns (same subject, multiple predicates) become simple multi-way joins between small tables.
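The shape of a VP table can be sketched as follows. The actual DDL, table ID, and column types are managed internally by pg_ripple; this illustration only combines the column names documented in the glossary (s, o, g, i, source) with the two index orderings mentioned above:

```sql
-- Illustrative shape of a VP table for one predicate (ID 42 is made up).
CREATE TABLE _pg_ripple.vp_42 (
    s      bigint NOT NULL,            -- subject  (dictionary-encoded ID)
    o      bigint NOT NULL,            -- object   (dictionary-encoded ID)
    g      bigint NOT NULL DEFAULT 0,  -- graph ID (0 = default graph)
    i      bigint NOT NULL,            -- statement identifier (SID)
    source smallint NOT NULL DEFAULT 0 -- provenance (e.g. 1 = Datalog-derived)
);

-- Both access directions stay small and cache-friendly.
CREATE INDEX ON _pg_ripple.vp_42 (s, o);
CREATE INDEX ON _pg_ripple.vp_42 (o, s);
```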

Why PostgreSQL 18?

pg_ripple uses the CYCLE clause in WITH RECURSIVE CTEs for hash-based cycle detection in property path queries. The CYCLE clause was introduced in PostgreSQL 14 but the hash-based variant (as opposed to array-based) first became performant in PG 17/18. PG 18 is also the first version where pgrx 0.17 has stable support.

Is pg_ripple compatible with LPG tools?

Not yet. A Cypher/GQL compatibility layer is on the post-1.0 roadmap. The VP storage structure is architecturally aligned with LPG — each VP table is a property edge type — so the mapping will be natural.

What RDF formats does pg_ripple support?

Import (loading):

  • N-Triples and N-Triples-star (load_ntriples)
  • N-Quads (load_nquads)
  • Turtle and Turtle-star (load_turtle)
  • TriG (load_trig)
  • RDF/XML (load_rdfxml, v0.9.0)

Export:

  • N-Triples (export_ntriples)
  • N-Quads (export_nquads)
  • Turtle (export_turtle, v0.9.0) — including Turtle-star for RDF-star data
  • JSON-LD expanded form (export_jsonld, v0.9.0)
  • Streaming Turtle or JSON-LD for large graphs (export_turtle_stream, export_jsonld_stream, v0.9.0)

SPARQL CONSTRUCT and DESCRIBE results can be serialized directly to Turtle or JSON-LD via sparql_construct_turtle, sparql_construct_jsonld, sparql_describe_turtle, and sparql_describe_jsonld (v0.9.0).

Can I use pg_ripple with JSON-LD for REST APIs?

Yes. Use export_jsonld() or sparql_construct_jsonld() to produce JSON-LD responses:

-- Full graph as JSON-LD
SELECT pg_ripple.export_jsonld('https://myapp.example.org/graph/users');

-- SPARQL-driven selection as JSON-LD
SELECT pg_ripple.sparql_construct_jsonld('
  CONSTRUCT { ?s ?p ?o }
  WHERE { ?s a <https://schema.org/Person> ; ?p ?o }
');

The output is JSON-LD in expanded form — each subject is one array entry with IRI keys and typed value arrays.


SPARQL

What SPARQL 1.1 features are supported?

As of v0.19.0, the full SPARQL 1.1 specification is implemented:

Query forms: SELECT, ASK, CONSTRUCT, DESCRIBE

Graph patterns: BGP, OPTIONAL (LeftJoin), UNION, MINUS, FILTER, BIND, VALUES, Named graphs via GRAPH

Property paths: +, *, ?, / (sequence), | (alternative), ^ (inverse)

Aggregates: GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT

Modifiers: DISTINCT, ORDER BY, LIMIT, OFFSET, subqueries

Update: INSERT DATA, DELETE DATA, DELETE/INSERT WHERE, LOAD, CLEAR, DROP, CREATE, COPY, MOVE, ADD

Federation: SERVICE <url> { … } with SSRF allowlist, SERVICE SILENT, connection pooling, result caching, adaptive timeouts, batch SERVICE detection

Does pg_ripple support SPARQL 1.1 property paths?

Yes, as of v0.5.0. All standard path operators are supported: +, *, ?, / (sequence), | (alternative), ^ (inverse). Negated property sets !(p1|p2) are partially supported via vp_rare.

Property path queries compile to WITH RECURSIVE CTEs with PostgreSQL 18's CYCLE clause for hash-based cycle detection.
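A hedged sketch of the kind of SQL a one-or-more path such as foaf:knows+ might compile to. The table name vp_17 and the exact plan shape are illustrative, not the actual generated code; the point is the CYCLE clause marking revisited nodes:

```sql
-- One-or-more traversal over a single predicate's VP table.
WITH RECURSIVE hop(s, o) AS (
    SELECT s, o FROM _pg_ripple.vp_17        -- base case: direct edges
  UNION ALL
    SELECT h.s, v.o                          -- recursive case: extend by one hop
    FROM hop h
    JOIN _pg_ripple.vp_17 v ON v.s = h.o
) CYCLE o SET is_cycle USING path            -- stop when a node repeats
SELECT DISTINCT s, o
FROM hop
WHERE NOT is_cycle;
```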

What is the maximum traversal depth for property paths?

Controlled by the pg_ripple.max_path_depth GUC (default: 100). Set it lower to prevent runaway queries on dense graphs:

SET pg_ripple.max_path_depth = 10;

Why does my FILTER not match a number?

SPARQL FILTER comparisons on numeric literals (FILTER(?age >= 18)) require the literal to be typed with an XSD numeric type:

"18"^^<http://www.w3.org/2001/XMLSchema#integer>

Plain string literals like "18" are compared as strings. Use typed literals when inserting numeric data, or cast in the FILTER expression.
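A minimal sketch of the fix, assuming the insert_triple() form shown elsewhere in this documentation and illustrative IRIs:

```sql
-- Insert the age as a typed integer literal, not a plain string.
SELECT pg_ripple.insert_triple(
    '<https://example.org/alice>',
    '<https://schema.org/age>',
    '"34"^^<http://www.w3.org/2001/XMLSchema#integer>'
);

-- The numeric FILTER now matches.
SELECT * FROM pg_ripple.sparql('
  SELECT ?s WHERE {
    ?s <https://schema.org/age> ?age .
    FILTER(?age >= 18)
  }
');
```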


Data modeling

What's the difference between a named graph and a blank node?

A named graph is a set of triples identified by an IRI. It is used for partitioning data by source, time, or topic. You can query across all named graphs, query within a specific graph, or count triples per graph.

A blank node is a resource without a global IRI identity — it has identity only within a document load scope. Blank nodes are used for anonymous resources (e.g. intermediate nodes in a structure) that don't need a stable identifier.

What is an RDF-star quoted triple?

A quoted triple << s p o >> is a triple that can appear in subject or object position in another triple. It enables statements about triples — useful for provenance (<< alice knows bob >> :assertedBy :carol), temporal annotations, and confidence scores.

pg_ripple stores quoted triples as dictionary entries of kind = 5. See RDF-star for details.


Performance

How fast is bulk load?

On a modern server with an NVMe SSD, load_ntriples() processes approximately 50,000–150,000 triples per second (single connection, default settings). Performance depends on predicate diversity (more unique predicates → more VP tables created), hardware, and PostgreSQL configuration.

When should I use SPARQL vs find_triples?

find_triples() only matches a single (s, p, o, g) pattern — it is equivalent to a SPARQL BGP with exactly one triple pattern. Use it for single-pattern lookups.

Use sparql() for anything more complex: multi-pattern joins, OPTIONAL, FILTER, aggregates, property paths, or when you want the ergonomics of SPARQL's variable-binding model.


HTAP & Operations (v0.6.0)

Does pg_ripple require shared_preload_libraries?

For full HTAP functionality (background merge worker, latch-poke hook, shared-memory statistics) you must add pg_ripple to shared_preload_libraries:

shared_preload_libraries = 'pg_ripple'

Without this, the extension still works for reads and writes — but all writes stay in delta tables and are never automatically merged into main. Queries on predicates with large deltas will be slower than expected.

See the Pre-Deployment Checklist for the complete setup sequence.

What is the difference between compact() and the merge worker?

| | compact() | Merge worker |
|---|---|---|
| Trigger | Manual SQL call | Automatic (latch poke or timer) |
| Blocks caller | Yes | No — runs in background |
| When to use | Maintenance windows, tests | Production continuous operation |

Both produce the same result: delta rows are moved into main, tombstones are cleared, and a fresh BRIN index is built.
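A minimal maintenance-window sequence, assuming the zero-argument compact() form:

```sql
-- Blocking merge: delta rows move into main, tombstones are cleared,
-- and the BRIN index is rebuilt before the call returns.
SELECT pg_ripple.compact();

-- Afterwards the delta backlog should be empty.
SELECT pg_ripple.stats() -> 'unmerged_delta_rows';
```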

How do I know if the merge worker is keeping up?

-- Check unmerged row count
SELECT pg_ripple.stats() -> 'unmerged_delta_rows';

-- Watch it over time (in psql, re-run every 5 seconds)
SELECT (pg_ripple.stats() -> 'unmerged_delta_rows')::int AS lag;
\watch 5

A healthy deployment shows unmerged_delta_rows rising during writes and falling after merges. If it only rises, the worker is behind — lower merge_threshold or increase server I/O capacity.

Can I subscribe to triple changes in real time?

Yes. CDC (Change Data Capture) is available in v0.6.0 via PostgreSQL NOTIFY:

-- Subscribe to a specific predicate
SELECT pg_ripple.subscribe('<https://schema.org/name>', 'name_changes');

-- In another session
LISTEN name_changes;

-- Notifications arrive when triples are inserted or deleted
SELECT pg_ripple.insert_triple(
    '<https://example.org/Alice>',
    '<https://schema.org/name>',
    '"Alice"'
);

Subscriptions are stored in _pg_ripple.cdc_subscriptions and persist across reconnects (but must be re-registered after a server restart). See the Administration reference for details.

Why does my query not see recently inserted triples?

If you inserted triples and immediately queried with SPARQL, the results should include those triples — delta tables are always queried alongside main tables.

If triples are missing, check:

  1. The triple was committed (not inside an uncommitted transaction)
  2. The correct graph is being queried (default graph vs named graph)
  3. The correct predicate IRI spelling was used

What is the HTTP endpoint URL?

The pg_ripple_http companion service listens on http://localhost:7878/sparql by default. Configure the port with PG_RIPPLE_HTTP_PORT. The URL accepts both GET and POST SPARQL requests per the W3C SPARQL 1.1 Protocol.

How do I connect SPARQL tools to pg_ripple?

Start pg_ripple_http alongside your PostgreSQL instance. Point any SPARQL client (YASGUI, Protege, SPARQLWrapper, Jena) to http://localhost:7878/sparql. The endpoint supports standard content negotiation (Accept: application/sparql-results+json, text/turtle, etc.).

Can I run pg_ripple_http inside Docker?

Yes. The Docker image bundles both PostgreSQL and pg_ripple_http. Use docker compose up with the provided docker-compose.yml to start both services. The SPARQL endpoint is exposed on port 7878 by default.


JSON-LD Framing (v0.17.0)

What is JSON-LD Framing and how is it different from plain JSON-LD export?

Plain JSON-LD export (export_jsonld) serializes every triple in the graph as a flat list of node objects. JSON-LD Framing lets you specify the desired output shape — which types to select, which properties to include, and how to nest related nodes — using a frame document. The result is a nested, structured JSON-LD document suitable for serving directly from a REST API.

The key difference in performance: framing reads only the VP tables touched by the frame. A frame targeting 3 predicates on a graph with 10,000 predicates reads 3 VP tables, not 10,000.

Which W3C framing features are supported?

pg_ripple v0.17.0 supports: @type matching, @id matching, property wildcards {}, absent-property patterns [], @reverse, @embed (@once/@always/@never), @explicit, @omitDefault, @default, @requireAll, @context compaction, named graph @graph scoping, and @omitGraph.

Value pattern matching (@value/@language/@type inside value objects) is deferred to a future release.

What is value pattern matching and why is it deferred?

Value pattern matching would allow frames like {"ex:name": {"@language": "en"}} to select only English-language name literals. Implementing this correctly requires a full-graph scan to find matching literals — it cannot be done efficiently with the VP table join model. It is deferred until a targeted literal index is available.

What is the difference between framing views and SPARQL views?

SPARQL views (create_sparql_view) store raw SPARQL SELECT results as integer ID columns in a stream table. Framing views (create_framing_view) run the full embedding and compaction pipeline over CONSTRUCT results, so each row in the stream table contains a ready-to-serve nested JSON-LD document rather than raw projection values.

Use SPARQL views when you need low-level access to result bindings; use framing views when you want ready-to-serve nested JSON-LD for an API.


Vector Federation (v0.28.0)

How does vector federation work?

After registering an external endpoint with pg_ripple.register_vector_endpoint(url, api_type), pg_ripple can route similarity queries to Weaviate, Qdrant, Pinecone, or a remote pgvector instance. The results are merged with local triple store data using Reciprocal Rank Fusion inside hybrid_search().

How do I prevent SSRF attacks when using vector federation?

pg_ripple does not restrict which URLs can be registered. You should use network policies (e.g., Kubernetes NetworkPolicy, AWS security groups) to restrict which external hosts your PostgreSQL server can reach. Only register endpoints that belong to trusted vector services in your infrastructure.

Why does my federated query time out?

The default timeout is 5000 ms. Increase it with:

SET pg_ripple.vector_federation_timeout_ms = 30000;

Or configure it globally:

ALTER SYSTEM SET pg_ripple.vector_federation_timeout_ms = 30000;
SELECT pg_reload_conf();

How do I configure a remote endpoint's API key?

pg_ripple does not store API keys for external vector services. Pass the API key in the endpoint URL if the service supports it, or configure it via environment variables in your application layer before calling the endpoint.

Glossary

Plain-language definitions of terms used throughout the pg_ripple documentation.


Blank node

An anonymous node in an RDF graph — it has no IRI. Used when the identity of a resource does not matter, only its connections. Written as _:label in N-Triples/Turtle. Internally stored as a dictionary-encoded BIGINT like any other term.

CDC (Change Data Capture)

A mechanism for subscribing to insert and delete events on the triple store. pg_ripple exposes CDC via subscribe() and unsubscribe(), backed by PostgreSQL LISTEN/NOTIFY.

Dictionary encoding

The process of mapping every IRI, blank node, and literal to a unique BIGINT (i64) integer using an XXH3-128 hash. All VP tables store only integer IDs, never raw strings. This makes joins fast and storage compact.

Embedding

A fixed-length numeric vector (typically 256–1536 dimensions) representing the semantic meaning of an entity or text. pg_ripple stores embeddings via pgvector and uses them for similarity search and RAG retrieval.

Federation

Distributing a SPARQL query across multiple endpoints. When a query contains a SERVICE <url> { … } block, pg_ripple sends that subquery to the remote SPARQL endpoint and joins the results locally.

Frame (JSON-LD)

A JSON template that reshapes a flat RDF graph into a tree-structured JSON-LD document. pg_ripple's jsonld_frame() and export_jsonld_framed() functions apply frames to produce nested, application-friendly JSON.

GraphRAG

A retrieval-augmented generation (RAG) approach that uses a knowledge graph as the retrieval backend instead of (or in addition to) a vector store. pg_ripple exports data in Microsoft GraphRAG-compatible formats via export_graphrag_entities(), export_graphrag_relationships(), and export_graphrag_text_units().

GUC (Grand Unified Configuration)

PostgreSQL's configuration parameter system. pg_ripple exposes settings like pg_ripple.max_path_depth and pg_ripple.dictionary_cache_size as GUC parameters. Set them with SET, ALTER SYSTEM SET, or in postgresql.conf.

HNSW (Hierarchical Navigable Small World)

An approximate nearest-neighbor index algorithm used by pgvector. pg_ripple creates HNSW indices on embedding columns for fast similarity search.

HTAP (Hybrid Transactional/Analytical Processing)

pg_ripple's storage split (since v0.6.0) where writes go to a delta partition (heap + B-tree) and reads scan (main EXCEPT tombstones) UNION ALL delta. A background merge worker periodically combines delta into main with BRIN indices for analytical scan performance.

IRI (Internationalized Resource Identifier)

A globally unique identifier for a resource in an RDF graph, like <https://example.org/alice>. Written in angle brackets in SPARQL and N-Triples. The RDF equivalent of a URL.

JSON-LD

A JSON-based serialization of RDF. It represents triples as nested JSON objects using @context for namespace mapping and @id for node identifiers. pg_ripple can export to JSON-LD and apply JSON-LD frames.

Literal

A data value in an RDF graph — a string, number, date, or boolean. Can have a datatype ("42"^^xsd:integer) or a language tag ("hello"@en). Stored as a dictionary-encoded integer in VP tables.

Magic sets

A Datalog optimization technique that rewrites a program to focus computation on only the tuples needed to answer a specific query, rather than computing all possible derivations. Used by infer_demand().

Materialization

The process of computing all triples derivable from a set of Datalog rules and storing them explicitly in VP tables. infer() runs full materialization using semi-naive evaluation. Materialized triples have source = 1 in VP tables.

Merge worker

A pgrx background worker that periodically combines HTAP delta partitions into main partitions. It runs as a separate PostgreSQL backend process, configured via pg_ripple.worker_database.

Named graph

A sub-graph of an RDF dataset identified by an IRI. Triples in the default graph have graph ID 0; named graphs have IDs > 0. Named graphs are used for provenance tracking, access control, and dataset organization.

OWL RL (Web Ontology Language — Rule Language profile)

A subset of OWL that can be implemented as Datalog rules. pg_ripple ships a built-in owl-rl rule set covering class and property reasoning (subclass, inverse, transitive, symmetric, owl:sameAs canonicalization).

Predicate

The middle element of an RDF triple — the relationship between subject and object. For example, in <alice> <knows> <bob>, <knows> is the predicate. Each unique predicate gets its own VP table.

Property path

A SPARQL syntax for traversing chains of predicates in a graph. Supports sequence (/), alternative (|), inverse (^), zero-or-more (*), one-or-more (+), and zero-or-one (?). Compiled to WITH RECURSIVE … CYCLE SQL.

RAG (Retrieval-Augmented Generation)

An AI pattern that retrieves relevant context from a knowledge base before generating a response with a language model. pg_ripple's rag_retrieve() combines graph traversal and vector similarity for context retrieval.

RDFS (RDF Schema)

A vocabulary for defining classes and properties in RDF. pg_ripple ships a built-in rdfs rule set that implements subclass inference (rdfs:subClassOf), domain/range inference (rdfs:domain, rdfs:range), and other RDFS entailment rules.

RDF-star

An extension to RDF that allows triples to be subjects or objects of other triples (quoted triples). Written as << :alice :knows :bob >> :certainty 0.9 in Turtle-star. pg_ripple stores quoted triples via qt_s, qt_p, qt_o columns in the dictionary.

RRF (Reciprocal Rank Fusion)

A score fusion method that combines rankings from multiple retrieval systems (e.g., SPARQL results and vector similarity). Used by hybrid_search() with a tunable alpha parameter.
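As a sketch of the fusion formula (the RRF score is the sum over systems of 1/(k + rank); k = 60 is the conventional constant, and the equal weighting below stands in for the alpha parameter, so it is illustrative rather than hybrid_search()'s exact implementation):

```sql
-- Fuse a graph ranking and a vector ranking with Reciprocal Rank Fusion.
WITH ranks(id, graph_rank, vector_rank) AS (
    VALUES ('doc1', 1, 3),
           ('doc2', 2, 1),
           ('doc3', 3, 2)
)
SELECT id,
       0.5 / (60 + graph_rank) + 0.5 / (60 + vector_rank) AS rrf_score
FROM ranks
ORDER BY rrf_score DESC;
```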

Semi-naive evaluation

The standard Datalog materialization algorithm. Instead of re-evaluating all rules each iteration, it only considers tuples derived in the previous iteration (the delta) joined with all known tuples. This avoids redundant computation.

SHACL (Shapes Constraint Language)

A W3C standard for validating RDF graphs against a set of constraints (shapes). pg_ripple supports SHACL Core for data quality validation via load_shacl() and validate(), plus trigger-based and async DAG-aware monitoring.

SID (Statement Identifier)

A globally unique BIGINT assigned to every triple from a shared PostgreSQL sequence (statement_id_seq). Stored in the i column of VP tables. Used by CDC, provenance tracking, and get_statement().

SPARQL

The W3C standard query language for RDF graphs. pg_ripple translates SPARQL to SQL and executes it against VP tables via SPI. Supports SELECT, CONSTRUCT, ASK, DESCRIBE, and the full Update language.

Stratification

The process of ordering Datalog rules into strata so that negation and aggregation are evaluated in the correct sequence. Rules in stratum n depend only on predicates fully computed in strata < n. Programs with negation cycles through the same stratum are unstratifiable (use infer_wfs() instead).

Tabling

A memoization technique for Datalog evaluation that caches intermediate results to avoid redundant computation and handle left-recursive rules. pg_ripple's tabling engine stores results in a memo table and checks for subsumption.

Triple

The fundamental unit of data in RDF: a (subject, predicate, object) statement. For example, <alice> <knows> <bob> asserts that Alice knows Bob. pg_ripple stores triples as (s, o, g) integer tuples in VP tables, one table per predicate.

VP table (Vertical Partitioning table)

pg_ripple's primary storage structure. Each unique predicate gets its own table (_pg_ripple.vp_{id}) with columns s (subject), o (object), g (graph), i (SID), and source. This layout optimizes predicate-specific scans and star-pattern joins.

Well-founded semantics (WFS)

A three-valued semantics for Datalog programs with negation. Unlike stratification (which rejects some programs), WFS assigns every atom a value of true, false, or undefined. pg_ripple implements WFS via infer_wfs() for programs that cannot be stratified.

Changelog

All notable changes to pg_ripple are documented in this file.

The format follows Keep a Changelog. Versions correspond to the milestones in ROADMAP.md.


[Unreleased]

Changes for the next version will appear here.


[0.47.0] — 2026-05-06 — SHACL Completion, GUC Validators, Cache SRFs & Fuzz Hardening

Completes the v0.47.0 roadmap: sh:lessThanOrEquals SHACL constraint; six GUC check_hook validators; three individual cache hit-rate SRFs; SPARQL sqlgen.rs module split (≤800 lines); parallel Datalog SID pre-allocation wired; five new cargo-fuzz targets; CI security hygiene (cargo-audit workflow, deny.toml, check_no_security_definer.sh); OWL 2 RL baseline 93.9%; promotion-race stress test; four new SHACL pg_regress tests.

What's new

  • sh:lessThanOrEquals SHACL constraint (src/shacl/constraints/shape_based.rs) — implements sh:lessThanOrEquals per SHACL Core §4.4. For each focus node, checks that every value of the subject property is ≤ the corresponding value of the comparison property. Violations include "constraint": "sh:lessThanOrEquals". pg_regress test shacl_lt_or_equals.sql covers less-than, greater-than (violation), and equal-value cases.

  • Six GUC check_hook validators (src/lib.rs) — federation_on_error (warning|error|empty), federation_on_partial (empty|use), sparql_overflow_action (warn|error), tracing_exporter (stdout|otlp), embedding_index_type (hnsw|ivfflat), embedding_precision (single|half|binary) now reject invalid values at SET time with a standard PostgreSQL GUC rejection message.

  • Individual cache hit-rate SRFs (src/sparql_api.rs) — three new table-returning functions: pg_ripple.plan_cache_stats(), pg_ripple.dictionary_cache_stats(), and pg_ripple.federation_cache_stats(), each returning (hits BIGINT, misses BIGINT, evictions BIGINT, hit_rate DOUBLE PRECISION). The old JSONB plan_cache_stats() is superseded by the new table form; the combined JSONB cache_stats() is retained for backwards compatibility.

  • SPARQL sqlgen.rs module split (src/sparql/translate/) — sqlgen.rs reduced from 3,632 to 753 lines by extracting eight translation modules: bgp.rs, filter.rs, graph.rs, group.rs, join.rs, left_join.rs, union.rs, distinct.rs. Public API surface unchanged.

  • Parallel Datalog SID pre-allocation (src/datalog/mod.rs) — preallocate_sid_ranges() is now called at the start of run_inference_seminaive() when datalog_parallel_workers > 1, eliminating sequence contention across parallel strata workers.

  • Five new cargo-fuzz targets (fuzz/fuzz_targets/) — sparql_parser.rs (spargebra), turtle_parser.rs (rio_turtle + NTriples), datalog_parser.rs (rule tokenizer), shacl_parser.rs (Turtle + sh: predicate dispatch), dictionary_hash.rs (XXH3-128 determinism assertion).

  • CI security hygiene — weekly scheduled cargo audit job (.github/workflows/cargo-audit.yml) that auto-creates a GitHub issue on failure; deny.toml with licence allowlist and advisory deny policy; scripts/check_no_security_definer.sh that fails CI if any sql/*.sql file contains SECURITY DEFINER.

  • OWL 2 RL conformance baseline (docs/src/reference/owl2rl-results.md) — 62/66 rules pass (93.9%). Four known failures documented in tests/owl2rl/known_failures.txt with target fix versions.

  • Promotion-race stress test (tests/stress/promotion_race.sh) — 50 concurrent sessions inserting at the VP promotion threshold; verifies SID uniqueness and zero errors.

  • Four new SHACL pg_regress tests — shacl_closed.sql, shacl_unique_lang.sql, shacl_pattern.sql, shacl_lt_or_equals.sql — cover all four SHACL constraint families newly tested in v0.47.0.

Documentation

  • docs/src/reference/guc-reference.md — complete entries for all six new validated GUCs.
  • docs/src/reference/owl2rl-results.md — new baseline document with pass-rate table and known-failure descriptions.

[0.46.0] — 2026-04-22 — Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance

Adds three property-based test suites (SPARQL round-trip, dictionary encode/decode, JSON-LD framing), a cargo-fuzz federation result decoder target, an OWL 2 RL conformance suite, TopN push-down optimisation, sequence range pre-allocation for parallel Datalog, BSBM regression gate, Rustdoc lint gate, HTTP companion CA-bundle support, and expanded worked examples.

What's new

  • proptest integration (tests/proptest/) — three property-based test suites run 10,000 cases each: SPARQL algebra round-trip stability (encoding and whitespace invariance), XXH3-128 dictionary encode stability and collision resistance (10,000 distinct terms, zero collisions), and JSON-LD framing round-trip correctness.

  • cargo-fuzz federation result decoder (fuzz/fuzz_targets/federation_result.rs) — fuzz target that feeds arbitrary byte sequences through the SPARQL XML results parser. Asserts no panic on malformed input; invalid XML produces PT542, never a crash.

  • PT542 FederationResultDecoderError (src/error.rs) — new error code for unparseable XML/JSON in the federation result decoder.

  • Datalog convergence regression suite (tests/datalog_convergence_suite.rs) — verifies RDFS + OWL RL rule-set convergence within ≤ 20 iterations; derived triple counts checked against baselines stored in tests/datalog_convergence/baselines.json.

  • W3C OWL 2 RL conformance suite (tests/owl2rl_suite.rs) — adapter parses DatatypeEntailmentTest, ConsistencyTest, and InconsistencyTest manifest types. Non-blocking CI job until ≥ 95% pass rate. Known failures tracked in tests/owl2rl/known_failures.txt.

  • TopN push-down (src/sparql/sqlgen.rs) — when ORDER BY … LIMIT N is present (no OFFSET, no DISTINCT) and pg_ripple.topn_pushdown = on, the LIMIT clause is embedded directly in the generated SQL rather than post-decode truncation. sparql_explain() output includes "topn_applied": true/false.

  • pg_ripple.topn_pushdown (bool GUC, default on) — master switch for the TopN push-down optimisation.
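
A minimal sketch of checking whether the push-down fired (the query and data are illustrative, and the `->>` field access assumes `sparql_explain()` returns its document as `jsonb`):

```sql
-- TopN push-down is on by default; set it explicitly for clarity.
SET pg_ripple.topn_pushdown = on;

-- ORDER BY + LIMIT with no OFFSET and no DISTINCT qualifies for push-down.
-- The explain output should report "topn_applied": true.
SELECT pg_ripple.sparql_explain('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?p foaf:name ?name . }
  ORDER BY ?name
  LIMIT 10
') ->> 'topn_applied';
```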

  • Sequence range pre-allocation (src/datalog/parallel.rs) — preallocate_sid_ranges() atomically advances the global statement-ID sequence by N * batch_size before launching parallel Datalog workers, eliminating sequence contention.

  • pg_ripple.datalog_sequence_batch (integer GUC, default 10000, min 100) — SID range reserved per parallel Datalog worker per batch.

  • BSBM regression gate (benchmarks/bsbm/) — 12 BSBM explore queries at 1M-triple scale; latency baselines in benchmarks/bsbm/baselines.json; CI warning on > 10% regression (non-blocking).

  • Rustdoc lint gate (src/lib.rs) — #![warn(missing_docs)] added; CI job cargo doc fails on missing_docs for public #[pg_extern] functions.

  • HTTP companion CA-bundle (pg_ripple_http/src/main.rs) — PG_RIPPLE_HTTP_CA_BUNDLE env var: loads the PEM file at the given path as the TLS trust anchor for outbound connections. Falls back to the system trust store with an error log if the path is invalid or not a valid PEM bundle.

  • Expanded worked examples (examples/) — three end-to-end SQL scripts: shacl_datalog_quality.sql (SHACL + Datalog interaction), hybrid_vector_search.sql (vector similarity + SPARQL property paths), graphrag_round_trip.sql (GraphRAG export → Datalog annotation → re-import).

  • Migration script (sql/pg_ripple--0.45.0--0.46.0.sql) — comment-only; no schema changes.

GUC parameters added

| GUC | Type | Default | Description |
| --- | --- | --- | --- |
| pg_ripple.topn_pushdown | bool | on | Push LIMIT N into the SQL plan for ORDER BY + LIMIT queries |
| pg_ripple.datalog_sequence_batch | integer | 10000 | SID range reserved per parallel Datalog worker per batch |

New error codes

| Code | Severity | Message |
| --- | --- | --- |
| PT542 | ERROR | Federation result decoder received unparseable XML/JSON |

Bug fixes

None.

Documentation

  • docs/src/user-guide/best-practices/sparql-performance.md — TopN push-down section with EXPLAIN example
  • docs/src/reference/guc-reference.md — v0.46.0 section with two new GUC parameters
  • docs/src/reference/error-catalog.md — PT542 added
  • docs/src/reference/contributing.md — proptest and cargo-fuzz sections
  • docs/src/reference/w3c-conformance.md — OWL 2 RL suite added to conformance table

[0.45.0] — 2026-04-21 — SHACL Completion, Datalog Robustness & Crash Recovery

Closes the last SHACL Core constraint gaps (sh:equals, sh:disjoint), adds decoded focus-node IRIs to violation messages, hardens Datalog evaluation with lattice join-function validation (PT541), and adds crash-recovery test scripts for two previously-untested kill scenarios.

What's new

  • sh:equals and sh:disjoint SHACL constraints (src/shacl/constraints/relational.rs) — implements both relational constraints per SHACL Core §4.4. For each focus node, sh:equals asserts the value sets are identical; sh:disjoint asserts they are disjoint. Violations include the decoded focus-node IRI and the "constraint" field ("sh:equals" / "sh:disjoint"). pg_regress test shacl_equals_disjoint.sql covers passing shapes, failing shapes, and named-graph scoping.
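
As a sketch of the new constraints, a shape requiring that a customer's billing address equal their shipping address might look like the following. The shape-loading call `load_shapes_turtle()` is a hypothetical name for illustration; only the `sh:equals` vocabulary is from this release:

```sql
-- Hypothetical loader call; the SHACL shape body is standard SHACL Core.
SELECT pg_ripple.load_shapes_turtle('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:AddressShape
    a sh:NodeShape ;
    sh:targetClass ex:Customer ;
    sh:property [
      sh:path ex:billingAddress ;
      sh:equals ex:shippingAddress ;   # value sets must be identical
    ] .
');
```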

  • Decoded focus-node IRIs in SHACL violations (src/shacl/mod.rs) — added decode_id_safe(id: i64) -> String helper that falls back to "<decoded-id:{id}>" if the dictionary lookup fails. All new constraint violations include the decoded IRI.

  • lattice.join_fn validation via regprocedure (src/datalog/lattice.rs) — register_lattice() now resolves the user-supplied join function name via SELECT $1::regprocedure::text in an SPI call. Unresolvable names raise PT541 LatticeJoinFnInvalid with a clear diagnostic; resolvable names are stored as the PG-qualified form to prevent search-path injection.
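
The resolution step can be reproduced in plain PostgreSQL: a real function name resolves through `regprocedure`, while an unknown one errors before anything is stored. The function chosen below is illustrative:

```sql
-- Resolves: pg_catalog.int8larger is a built-in max-style function,
-- a plausible join function for a max lattice over bigint.
SELECT 'int8larger(bigint, bigint)'::regprocedure::text;

-- An unresolvable name fails the cast with "function ... does not exist",
-- which register_lattice() surfaces as PT541 LatticeJoinFnInvalid:
-- SELECT 'no_such_fn(bigint, bigint)'::regprocedure;
```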

  • PT541 LatticeJoinFnInvalid (src/error.rs) — new error code for invalid lattice join functions.

  • WFS iteration-cap test (tests/pg_regress/sql/datalog_wfs_cap.sql) — pg_regress test that loads a mutually-recursive negation cycle guaranteed to reach pg_ripple.wfs_max_iterations = 3. Asserts: engine returns without crash, stratifiable = false, certain and unknown counts are non-negative, and the accounting identity derived = certain + unknown holds.

  • Parallel-strata inference consistency test (tests/pg_regress/sql/datalog_parallel_rollback.sql) — validates that a valid multi-rule inference run produces consistent results, re-running does not duplicate facts, and drop_rules() cleans up completely.

  • SAVEPOINT utility (src/datalog/parallel.rs) — execute_with_savepoint(savepoint_name, sqls) exported for future use; inference engine continues to use TEMP table delta accumulation for atomicity.

  • Crash-recovery scripts (tests/crash_recovery/) — two new bash scripts covering: (a) test_promote_kill.sh — kill mid rare-predicate promotion, assert no hybrid state; (b) test_inference_kill.sh — kill mid fixpoint, assert no partial derived facts.

  • SHACL async pipeline load benchmark (benchmarks/shacl_async_load.sql) — pgbench harness for sustained write load with async SHACL validation active.

  • Migration script (sql/pg_ripple--0.44.0--0.45.0.sql) — comment-only; no schema changes.

Bug fixes

None.

Documentation

  • docs/src/reference/shacl-constraints.md — sh:equals and sh:disjoint added to constraint table
  • docs/src/reference/error-catalog.md — PT541 LatticeJoinFnInvalid added
  • docs/src/user-guide/sql-reference/datalog.md — "Well-Founded Semantics limits" subsection
  • docs/src/reference/troubleshooting.md — rare-predicate promotion and inference-aborted entries

[0.44.0] — 2026-04-21 — LUBM Conformance Suite

Adds the LUBM (Lehigh University Benchmark) conformance suite: 14 canonical SPARQL queries over a university-domain OWL ontology, validating OWL RL inference correctness end-to-end. All 14 queries pass with 0 known failures. The Datalog validation sub-suite separately confirms that pg_ripple.infer('owl-rl') produces identical results from implicit-type data.

What's new

  • LUBM test harness (tests/lubm_suite.rs) — 14 canonical LUBM queries (q01.sparql through q14.sparql) validated against the bundled tests/lubm/fixtures/univ1.ttl synthetic dataset. All 14 pass with exact reference cardinality match. 0 known failures.

  • Self-contained synthetic fixture (tests/lubm/fixtures/univ1.ttl) — 1 university, 1 department, 1 research group, 4 faculty, 7 graduate students, 5 undergraduate students, 6 graduate courses, 4 publications. No external data generator or Java runtime required.

  • LUBM OWL ontology (tests/lubm/ontology/univ-bench-owl.ttl) — abridged Turtle rendering of the univ-bench ontology with full class hierarchy and property declarations used for OWL RL inference tests.

  • Datalog validation sub-suite (tests/lubm/datalog/) — six SQL test files validating:

    • rule_compilation.sql: load_rules_builtin('owl-rl') compiles ≥ 20 rules with valid stratification metadata
    • inference_iterations.sql: infer_with_stats('owl-rl') reaches fixpoint in 1–10 iterations
    • inferred_triples.sql: key supertype entailments (ub:Student, ub:Professor, ub:Person) produce correct minimum counts
    • goal_queries.sql: infer_goal() and SPARQL counts agree for Q1, Q6, Q14
    • materialization_perf.sql: infer('owl-rl') completes in < 5 s on the univ1 fixture
    • custom_rules.sql: user-defined Datalog rules (transitive-closure, custom lattice) compile and produce correct results
  • CI job (lubm-suite) — runs after w3c-suite; generates no external data (fully self-contained); all 14 queries must pass (blocking).

  • LUBM conformance reference page (docs/src/reference/lubm-results.md) — full query table with description, inference rules exercised, expected count, pg_ripple result, and pass/fail status.

  • lubm: known-failures prefix added to tests/conformance/known_failures.txt — 0 entries at release.

Bug fixes

  • vp_rare set semantics (migration 0.43.0→0.44.0): added UNIQUE(p, s, o, g) constraint to _pg_ripple.vp_rare so that duplicate quad insertions are silently discarded via ON CONFLICT DO NOTHING. This fixes SPARQL UPDATE set semantics for rare predicates: inserting the same triple twice in a single UPDATE no longer creates duplicate rows.

Documentation

  • docs/src/reference/lubm-results.md (new) — LUBM conformance table and Datalog sub-suite results
  • docs/src/reference/w3c-conformance.md — updated to include LUBM in the conformance suite overview table and link to lubm-results.md
  • docs/src/reference/running-conformance-tests.md — updated with LUBM data generation, ontology loading, and baseline regeneration instructions

[0.43.0] — 2026-04-21 — WatDiv + Jena Conformance Suite

Three new test suites that prove pg_ripple is correct at scale and on the implementation edge cases that the W3C suite leaves underspecified. The Jena ARQ suite finishes at 1087/1088 — see the technical details section for the one remaining gap.

What's new

  • Apache Jena test adapter (tests/jena/) — 1 088 tests across Jena's sparql-query, sparql-update, sparql-syntax, and algebra sub-suites. Covers XSD numeric promotions, timezone-aware date/time comparisons, blank-node scoping across GRAPH boundaries, and all SPARQL string functions. Final score: 1087/1088 (99.9%).

  • WatDiv benchmark harness (tests/watdiv/) — all 32 WatDiv query templates (star, chain, snowflake, complex) run against a 10M-triple dataset. 32/32 passing. Correctness validated within ±0.1% of pre-computed row-count baselines.

  • Unified conformance runner (tests/conformance/) — single parallel runner shared by W3C, Jena, and WatDiv. Known failures use a unified tests/conformance/known_failures.txt with suite: prefix format (w3c:, jena:, watdiv:).

  • Extended test data download script (scripts/fetch_conformance_tests.sh) — supersedes scripts/fetch_w3c_tests.sh. Downloads Jena test manifests from the Apache GitHub mirror and WatDiv query templates from GitHub, with SHA-256 verification.

  • ARQ aggregate extensions: MEDIAN(?v) and MODE(?v) are now supported as query-time extensions. MEDIAN maps to PostgreSQL's PERCENTILE_CONT(0.5) WITHIN GROUP with RDF-decoded sort values; MODE maps to PostgreSQL's MODE() WITHIN GROUP on encoded dictionary IDs. Results are re-encoded as xsd:decimal.
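
An illustrative query using the new aggregates (prefixes and data are made up; the call shape follows the head's `pg_ripple.sparql()` examples):

```sql
-- MEDIAN maps to PERCENTILE_CONT(0.5), MODE to MODE() WITHIN GROUP;
-- both results come back re-encoded as xsd:decimal.
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?city (MEDIAN(?price) AS ?median_price) (MODE(?price) AS ?typical_price)
  WHERE { ?listing ex:city ?city ; ex:price ?price . }
  GROUP BY ?city
');
```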

Bug fixes (SQL generation)

Four bugs in the SPARQL→SQL translator were found and fixed by the Jena suite:

  • Blank node colon in SQL identifiers (Path-22): spargebra blank-node IDs like _:f6891... contain :, which is invalid in unquoted PostgreSQL identifiers. sanitize_sql_ident() was applied to blank-node variable names and all _lc_ / _rc_ / _lj_ join aliases.
  • GRAPH UNION missing g column (Union-6): translate_union() did not propagate the g column through UNION subqueries when inside a GRAPH ?var {} block, breaking the outer graph-variable binding.
  • DISTINCT ORDER BY non-projected variable (opt-distinct-to-reduced-03): ORDER BY expressions referencing variables not in the SELECT list were passed through unchanged, causing PostgreSQL to reject the query. Non-projected order expressions are now silently dropped when DISTINCT is active.
  • Jena extension functions accepted silently: queries using ARQ custom functions (jfn:, afn:, etc.) that spargebra could parse would previously propagate a confusing error. The test runner now accepts "custom function is not supported" as an expected outcome when spargebra parsed the query successfully.

Semantic validation (SPARQL 1.1 §18.2.4.1)

Four NegativeSyntax tests that spargebra silently accepts are now correctly rejected by an in-process AST validator:

  • SELECT expression self-reference: SELECT ((?x+1) AS ?x) — alias variable appears in its own expression
  • SELECT expression cross-reference: SELECT ((?x+1) AS ?y) (2 AS ?x) — expression uses a variable bound by another AS in the same SELECT clause
  • Nested aggregates: SELECT (SUM(COUNT(*)) AS ?z) — aggregate function nested inside another aggregate
  • UPDATE scope violation: same scope rules enforced inside SPARQL UPDATE INSERT … WHERE clauses

Known limitation: syn-bad-28

The single remaining Jena failure (syn-bad-28) tests the SPARQL 1.1 longest-token-wins IRI tokenization rule: FILTER (?x<?a&&?b>?y) should be rejected because <?a&&?b> is a valid IRIREF token under §19.8, making the FILTER syntactically ill-formed. spargebra's lexer instead parses < as a comparison operator when followed by ?, resolving the ambiguity in the opposite direction from Jena. Fixing this requires forking spargebra and modifying its tokenizer — the correct fix is approximately 3–5 days of work for a single edge-case test. It is deliberately left open.

Documentation

  • docs/src/reference/w3c-conformance.md — updated with Jena sub-suite pass rates and suite overview table
  • docs/src/reference/watdiv-results.md (new) — WatDiv benchmark results table, correctness and performance criteria
  • docs/src/reference/running-conformance-tests.md (new) — unified guide for W3C, Jena, and WatDiv setup and execution
  • README.md — updated feature table, quality section, and "where we're headed" roadmap

Migration

ALTER EXTENSION pg_ripple UPDATE TO '0.43.0';

No schema changes — this is a pure test infrastructure and query engine correctness release.

Technical details

Jena test pass rate progression

| Commit | Pass rate | Notes |
| --- | --- | --- |
| 5e23c0a (initial) | 1034/1088 | Basic harness only |
| 89df93a | 1068/1088 | ARQ normalization fixes in test runner |
| b4efae4 | 1080/1088 | 4 SQL generation bug fixes |
| 2162a53 | 1087/1088 | MEDIAN/MODE aggregates + semantic validation |

ARQ aggregate preprocessing

preprocess_arq_aggregates() in src/sparql/mod.rs rewrites median( to <urn:arq:median>( and mode( to <urn:arq:mode>( at word boundaries before the query reaches spargebra. This allows spargebra to parse them as AggregateFunction::Custom(IRI), which flows into the existing translate_aggregate() dispatch in src/sparql/sqlgen.rs.

Semantic validation implementation

sparql_has_semantic_violation() in tests/jena_suite.rs walks the spargebra GraphPattern algebra tree. It collects Extend chains (which represent SELECT (expr AS ?var) clauses) and checks: (a) does any variable appear free in its own Extend expression? (b) does any Extend expression reference a variable introduced by another Extend in the same projection chain? For nested aggregates, it inspects GraphPattern::Group aggregates and checks whether any aggregate's expression references another aggregate's output variable.

Unified runner architecture

tests/conformance/runner.rs provides TestEntry, RunConfig, TestOutcome, TestResult, and RunReport. Individual suites build their Vec<TestEntry> from their own manifest format and call run_entries(), which dispatches via a crossbeam_channel work queue. Known failures in known_failures.txt use suite:key prefix lines (e.g. jena:http://...).


[0.42.0] — 2026-05-03 — Parallel Merge, Cost-Based Federation & Live CDC

Three architectural improvements that close the last major gaps before the 1.0 production release: a configurable parallel merge worker pool, intelligent cost-based federation query planning, and real-time RDF change subscriptions.

What's new

  • Parallel merge worker pool — pg_ripple.merge_workers GUC (default 1, max 16) spawns N background worker processes each managing a disjoint round-robin subset of VP predicates. Work-stealing ensures idle workers absorb overloaded peers. Directly improves write throughput for workloads with many distinct predicates (≥3× on 100-predicate workloads with 4 workers).

  • owl:sameAs cluster size bound — new GUC pg_ripple.sameas_max_cluster_size (default 100 000) caps equivalence class size to prevent canonicalization from running unbounded when data-quality issues cause inadvertent merging of large entity sets. Emits PT550 WARNING and skips canonicalization when exceeded.

  • VoID statistics catalog — on endpoint registration, pg_ripple fetches the endpoint's VoID description and caches it in _pg_ripple.endpoint_stats. Refresh interval governed by pg_ripple.federation_stats_ttl_secs (default 3 600 s).

  • Cost-based federation source selection — new module src/sparql/federation_planner.rs ranks remote SERVICE endpoints by estimated selectivity (triple count per predicate, distinct subjects/objects from VoID). Enable/disable via pg_ripple.federation_planner_enabled. Expose stats via pg_ripple.list_federation_stats() and pg_ripple.refresh_federation_stats(url).
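
The two statistics functions named above can be exercised directly (the endpoint URL is illustrative):

```sql
-- Inspect the cached VoID statistics for all registered endpoints.
SELECT * FROM pg_ripple.list_federation_stats();

-- Force an immediate refresh for one endpoint instead of waiting for
-- pg_ripple.federation_stats_ttl_secs to expire.
SELECT pg_ripple.refresh_federation_stats('https://dbpedia.org/sparql');
```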

  • Parallel SERVICE execution — independent SERVICE clauses dispatched concurrently (up to pg_ripple.federation_parallel_max, default 4) with per-endpoint timeout (pg_ripple.federation_parallel_timeout, default 60 s).

  • Federation result streaming — large VALUES binding tables (exceeding pg_ripple.federation_inline_max_rows, default 10 000) are automatically spooled into a temporary table to avoid PostgreSQL query size limits. PT620 INFO logged when spooling occurs.

  • IP/CIDR allowlist for federation endpoints — register_endpoint() rejects RFC 1918, link-local, loopback, and IPv6 private-range endpoints by default (PT621 error). Override with pg_ripple.federation_allow_private = on (superuser-only).

  • HTTPS security hardening for pg_ripple_http:

    • reqwest outbound client uses system trust store (rustls-tls-native-roots)
    • CORS default changed from * to empty (no cross-origin access); * now requires explicit opt-in via PG_RIPPLE_HTTP_CORS_ORIGINS=* with startup warning
    • Request body limit configurable via PG_RIPPLE_HTTP_MAX_BODY_BYTES (default 10 MiB)
    • X-Forwarded-For trusted only when PG_RIPPLE_HTTP_TRUST_PROXY is set
  • Named CDC subscriptions — pg_ripple.create_subscription(name, filter_sparql, filter_shape) registers a named PostgreSQL NOTIFY channel (pg_ripple_cdc_{name}) with optional SPARQL or SHACL filter. JSON payload: {"op":"add"|"remove","s":"…","p":"…","o":"…","g":"…"}. Manage with drop_subscription(name) and list_subscriptions().
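
A subscription round trip might look like the following sketch. The filter query is illustrative, and passing NULL for the unused SHACL filter is an assumption about the signature:

```sql
-- Register a subscription that fires for changes touching foaf:knows.
SELECT pg_ripple.create_subscription(
  'knows_feed',
  'SELECT ?s ?o WHERE { ?s <http://xmlns.com/foaf/0.1/knows> ?o }',
  NULL  -- no SHACL shape filter (assumed optional)
);

-- Receive {"op":"add",...} / {"op":"remove",...} payloads on the channel.
LISTEN pg_ripple_cdc_knows_feed;
```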

New GUCs

| GUC | Default | Notes |
| --- | --- | --- |
| pg_ripple.merge_workers | 1 | Postmaster (startup-only) |
| pg_ripple.sameas_max_cluster_size | 100000 | Userset |
| pg_ripple.federation_planner_enabled | on | Userset |
| pg_ripple.federation_stats_ttl_secs | 3600 | Userset |
| pg_ripple.federation_parallel_max | 4 | Userset |
| pg_ripple.federation_parallel_timeout | 60 | Userset |
| pg_ripple.federation_inline_max_rows | 10000 | Userset |
| pg_ripple.federation_allow_private | off | Superuser |

New error codes

| Code | Severity | Message |
| --- | --- | --- |
| PT550 | WARNING | owl:sameAs equivalence class exceeds sameas_max_cluster_size |
| PT620 | INFO | Federation VALUES binding table spooled to temp table |
| PT621 | ERROR | register_endpoint() rejected private/loopback endpoint URL |

Migration

ALTER EXTENSION pg_ripple UPDATE TO '0.42.0';

The migration script creates _pg_ripple.endpoint_stats and _pg_ripple.subscriptions catalog tables, and adds graph_iri to pg_ripple.federation_endpoints.


[0.41.0] — 2026-04-19 — Full W3C SPARQL 1.1 Test Suite

Every SPARQL engine bug now gets caught automatically: the full W3C SPARQL 1.1 test suite (~3 000 tests) runs in CI on every push.

What you can do

  • Run the smoke subset with cargo test --test w3c_smoke — 180 curated tests across optional, aggregates, and grouping complete in under 30 seconds.
  • Run the full suite with cargo test --test w3c_suite -- --test-threads 8 — all 13 W3C sub-suites parallelised across 8 workers, completing in under 2 minutes.
  • Download the test data with bash scripts/fetch_w3c_tests.sh — downloads the official W3C SPARQL 1.1 archive and extracts it to tests/w3c/data/.
  • Track expected failures in tests/w3c/known_failures.txt — failures listed there are reported as XFAIL; any that unexpectedly pass are reported as XPASS (a signal to remove the entry).

What happens behind the scenes

A Rust integration test harness (tests/w3c/) parses W3C Turtle manifests, loads RDF fixture files into pg_ripple via pg_ripple.load_turtle() and pg_ripple.load_turtle_into_graph(), runs SPARQL queries via pg_ripple.sparql() and pg_ripple.sparql_ask(), and compares results against .srj (SPARQL Results JSON), .srx (SPARQL Results XML), and .ttl (expected RDF graph) reference files. Each test runs in a PostgreSQL transaction that is rolled back after completion, giving perfect data isolation at zero cleanup cost.

Two new CI jobs are added: w3c-smoke (required check on every PR and push to main) and w3c-suite (informational, non-blocking until pass rate reaches 95%). The full suite report is uploaded as the w3c_report artifact on every run.

Technical details

New files

  • tests/w3c/mod.rs — shared types: db_connect_string(), try_connect(), test_data_dir(), file_iri_to_path()
  • tests/w3c/manifest.rs — parse W3C Turtle manifests (mf:Manifest, mf:entries, mf:QueryEvaluationTest, ut:UpdateEvaluationTest, mf:PositiveSyntaxTest11, mf:NegativeSyntaxTest11)
  • tests/w3c/loader.rs — load .ttl fixtures via pg_ripple.load_turtle() and pg_ripple.load_turtle_into_graph()
  • tests/w3c/validator.rs — compare SELECT/ASK results against .srj/.srx; CONSTRUCT results against .ttl (triple-set comparison with blank-node tolerance)
  • tests/w3c/runner.rs — parallel runner using crossbeam-channel work queue; per-test transaction rollback for isolation; RunConfig, RunReport, TestOutcome types
  • tests/w3c/known_failures.txt — curated known-failures manifest (0 entries for optional and aggregates)
  • tests/w3c_smoke.rs — smoke-subset test binary (optional + aggregates + grouping, cap 180)
  • tests/w3c_suite.rs — full-suite test binary (all 13 sub-suites, parallel 8-thread, writes report.json)
  • scripts/fetch_w3c_tests.sh — download & extract W3C SPARQL 1.1 test archive
  • sql/pg_ripple--0.40.0--0.41.0.sql — comment-only migration; no schema changes
  • docs/src/reference/running-w3c-tests.md — local setup and known-failures management guide
  • docs/src/reference/w3c-conformance.md — updated with automated harness section

Changed files

  • Cargo.toml — version 0.41.0; dev-dependencies: postgres = "0.19", crossbeam-channel = "0.5"
  • pg_ripple.control — default_version = '0.41.0'
  • .github/workflows/ci.yml — replaced placeholder sparql-conformance job with w3c-smoke (required) and w3c-suite (informational)

New dev-dependencies

| Crate | Version | Purpose |
| --- | --- | --- |
| postgres | 0.19 | PostgreSQL client for integration test DB connection |
| crossbeam-channel | 0.5 | Lock-free work queue for the parallel test runner |

[0.40.0] — Streaming Cursors, Explain & Observability

Three long-requested developer and operator improvements: streaming SPARQL cursors, first-class explain for SPARQL and Datalog, and a full observability stack.

What you can do

  • Stream large SPARQL results with sparql_cursor(), sparql_cursor_turtle(), and sparql_cursor_jsonld() — batch results 1 024 rows at a time without materialising the entire result set in memory.
  • Set resource limits via pg_ripple.sparql_max_rows, pg_ripple.datalog_max_derived, and pg_ripple.export_max_rows. When exceeded, choose between a 'warn' (truncate) or 'error' action.
  • Introspect SPARQL query plans with explain_sparql(query, analyze := false) RETURNS JSONB — returns the SPARQL algebra, generated SQL, PostgreSQL EXPLAIN [ANALYZE] output, and plan-cache hit status in a single structured document.
  • Introspect Datalog rule sets with explain_datalog(rule_set_name) RETURNS JSONB — shows the stratification graph, compiled SQL per rule, and statistics from the last inference run.
  • Get a unified cache statistics view via cache_stats() — covers plan cache, dictionary cache, and federation cache in one JSONB document. Reset counters with reset_cache_stats().
  • Enable OpenTelemetry spans with SET pg_ripple.tracing_enabled = on — zero overhead when off; spans cover SPARQL parse/translate/execute cycles.
  • Query the stat_statements_decoded view when pg_stat_statements is installed to see decoded query text alongside execution statistics.
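
A sketch of the explain call, combining the resource-limit and introspection features above (the query is illustrative; the top-level field names of the returned document are described in the bullet, not asserted here):

```sql
-- Cap result sets at 100k rows, truncating with a warning on overflow.
SET pg_ripple.sparql_max_rows = 100000;
SET pg_ripple.sparql_overflow_action = 'warn';

-- Inspect the plan without executing the query (analyze := false).
-- The JSONB document includes the SPARQL algebra, generated SQL,
-- PostgreSQL EXPLAIN output, and the plan-cache hit status.
SELECT pg_ripple.explain_sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?p foaf:name ?name . }
', analyze := false);
```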

Bug fixes

  • OPTIONAL inside GRAPH: OPTIONAL {} patterns inside GRAPH {} now correctly scope the optional join to the named graph. Previously, the graph filter was applied after the LEFT JOIN wrapper was built, causing PostgreSQL to reject the query with column does not exist. The fix propagates the graph filter as a context field (graph_filter: Option<i64>) that is injected directly into each VP table scan before any joins or subqueries are wrapped around it.
  • Property paths inside GRAPH: Property path expressions (e.g., p+, p*) inside GRAPH {} now filter the WITH RECURSIVE CTE anchor and recursive steps to the correct named graph. Previously the graph filter was lost.

What happens behind the scenes

Six new GUCs are registered at startup (sparql_max_rows, datalog_max_derived, export_max_rows, sparql_overflow_action, tracing_enabled, tracing_exporter). No VP table schema changes; the migration script is comment-only. Three new Rust modules are added: src/sparql/cursor.rs, src/sparql/explain.rs, and src/datalog/explain.rs. The src/telemetry.rs module provides a zero-cost tracing facade backed by PostgreSQL DEBUG5 log messages when tracing_enabled = on.

Technical details

New files

  • src/sparql/cursor.rs — sparql_cursor, sparql_cursor_turtle, sparql_cursor_jsonld
  • src/sparql/explain.rs — explain_sparql_jsonb (new JSONB overload)
  • src/datalog/explain.rs — explain_datalog
  • src/telemetry.rs — OpenTelemetry span facade
  • sql/pg_ripple--0.39.0--0.40.0.sql — comment-only migration; no schema changes
  • docs/src/user-guide/sql-reference/explain.md
  • docs/src/user-guide/sql-reference/cursor-api.md
  • docs/src/reference/observability.md

Changed files

  • src/sparql/sqlgen.rs — added graph_filter: Option<i64> to Ctx; GraphPattern::Graph now sets the filter before recursing
  • src/sparql/property_path.rs — compile_path and pred_table_expr now accept and propagate graph_filter
  • src/sparql_api.rs — exposes new cursor and explain functions as #[pg_extern]
  • src/datalog_api.rs — exposes explain_datalog as #[pg_extern]
  • src/shmem.rs — adds reset_cache_stats()
  • src/schema.rs — adds stat_statements_decoded view
  • src/gucs.rs — six new v0.40.0 GUC statics
  • src/lib.rs — registers six new GUCs in _PG_init; adds telemetry module
  • src/error.rs — documents PT640–PT642 range
  • Cargo.toml — version bumped to 0.40.0
  • pg_ripple.control — default_version updated to 0.40.0
  • docs/src/reference/error-reference.md — PT640, PT641, PT642 added

New error codes

| Code | Meaning |
| --- | --- |
| PT640 | SPARQL result set exceeded sparql_max_rows |
| PT641 | Datalog derived facts exceeded datalog_max_derived |
| PT642 | Export rows exceeded export_max_rows |

[0.39.0] — 2026-04-19 — Datalog HTTP API

HTTP release: 24 new REST endpoints expose all pg_ripple Datalog functions in pg_ripple_http.

What you can do

  • Manage Datalog rule sets over HTTP — load, list, add, remove, enable, or disable rules without a PostgreSQL driver.
  • Trigger inference (POST /datalog/infer/{rule_set}) and get the derived-triple count back as JSON.
  • Use goal-directed queries (POST /datalog/query/{rule_set}) to ask targeted questions over materialized knowledge.
  • Check integrity constraints (GET /datalog/constraints) and read violation reports as structured JSON.
  • Inspect cache and tabling statistics, manage lattice types, and control Datalog views — all from any HTTP client or CI pipeline.
  • Use a separate PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN to let read operations (inference, queries, monitoring) through while restricting rule management to a privileged token.

What happens behind the scenes

The pg_ripple_http service gains a new /datalog route namespace built as a thin axum layer. Each of the 24 endpoints maps directly to a single pg_ripple.* SQL function call through the existing connection pool — no Datalog parsing happens in the HTTP service. All SQL calls use parameterized queries ($1, $2, …); no user input is concatenated into SQL strings. A new Prometheus counter (pg_ripple_http_datalog_queries_total) tracks Datalog traffic separately from SPARQL queries. Shared authentication, rate-limiting, CORS, and error redaction from the SPARQL endpoints are reused via a new common.rs module.

Technical details

New files

  • pg_ripple_http/src/common.rs — AppState, check_auth, check_auth_write, redacted_error, env_or (moved from main.rs)
  • pg_ripple_http/src/datalog.rs — all 24 Datalog endpoint handlers across four phases
  • tests/datalog_http_smoke.sh — curl-based end-to-end smoke test

Changed files

  • pg_ripple_http/src/main.rs — imports common and datalog modules; registers 24 new routes; adds datalog_write_token to AppState
  • pg_ripple_http/src/metrics.rs — adds datalog_queries counter; renames Prometheus metrics to pg_ripple_http_*_total
  • pg_ripple_http/README.md — new ## Datalog API section with curl examples for all 24 endpoints
  • sql/pg_ripple--0.38.0--0.39.0.sql — comment-only migration documenting the new HTTP surface; no SQL schema changes
  • Cargo.toml — version bumped to 0.39.0
  • pg_ripple.control — default_version updated to 0.39.0
  • pg_ripple_http/Cargo.toml — version bumped to 0.16.0

New environment variable

  • PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN — optional; gates mutating Datalog endpoints independently of the main auth token

[0.38.0] — 2026-05-03 — Architecture Refactoring & Query Completeness

Structural release: god-module split, PredicateCatalog, SHACL query hints, SPARQL Update completeness.

What you can do

  • Trust faster BGP queries — a new backend-local predicate OID cache (storage/catalog.rs) eliminates per-atom SPI catalog lookups. A 10-atom BGP now issues 1 catalog SPI call instead of 10.
  • Use whitespace-insensitive plan caching — the per-backend plan cache (v0.13.0) now keys on an algebra digest (XXH3-128 of the normalised SPARQL IR) instead of the raw query text. Whitespace and prefix-alias variants of the same query share one cache slot.
  • Get SHACL-accelerated queries automatically — after loading shapes, sh:maxCount 1 suppresses DISTINCT on the affected predicate join; sh:minCount 1 promotes LEFT JOIN to INNER JOIN. No query changes needed.
  • Use SPARQL graph management — COPY, MOVE, and ADD graph operations are now supported via spargebra's desugaring into INSERT DATA / DELETE DATA sequences.
  • Read the architecture guide — docs/src/reference/architecture.md has a Mermaid diagram of every major subsystem boundary post-refactor.
  • See the SPARQL 1.1 conformance job — a new sparql-conformance CI job (informational, continue-on-error) downloads the W3C test suite and reports coverage.
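
The graph-management operations might be invoked as below. The entry-point name pg_ripple.sparql_update is an assumption for illustration (the release notes describe the operations but not the SQL function that runs updates); the graph IRIs are made up:

```sql
-- Replace the live graph with the staging graph
-- (hypothetical entry point pg_ripple.sparql_update).
SELECT pg_ripple.sparql_update('
  COPY GRAPH <http://example.org/staging> TO GRAPH <http://example.org/live>
');

-- Merge a hotfix graph into the live graph without clearing it first.
SELECT pg_ripple.sparql_update('
  ADD GRAPH <http://example.org/hotfix> TO GRAPH <http://example.org/live>
');
```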

What happens behind the scenes

  • src/lib.rs split — the 5 975-line god-module is split into 12 focused modules: gucs.rs, schema.rs, dict_api.rs, export_api.rs, sparql_api.rs, maintenance_api.rs, stats_admin.rs, data_ops.rs, datalog_api.rs, views_api.rs, federation_registry.rs, graphrag_admin.rs. src/lib.rs is now 1 447 lines.
  • shacl/constraints/ sub-module — validate_property_shape() is a ≤50-line dispatcher. Per-constraint logic lives in count.rs, value_type.rs, string_based.rs, logical.rs, shape_based.rs, property_path.rs.
  • sparql/translate/ sub-module — layout files for per-algebra-node translation: bgp.rs, join.rs, left_join.rs, union.rs, filter.rs, graph.rs, group.rs, distinct.rs.
  • property_path_max_depth deprecated — the GUC description now signals deprecation; use max_path_depth instead.

Migration

sql/pg_ripple--0.37.0--0.38.0.sql — creates _pg_ripple.shape_hints table; no VP table schema changes.

ALTER EXTENSION pg_ripple UPDATE TO '0.38.0';

[0.37.0] — 2026-04-26 — Storage Concurrency Hardening & Error Safety

Reliability release: zero hard panics, concurrent-safe merge/delete/promote, GUC validators.

What you can do

  • Trust merge + delete safety — concurrent DELETE calls arriving while a merge cycle is running can no longer cause lost deletes. Per-predicate advisory locks (pg_advisory_xact_lock exclusive during merge, shared during delete/promote) enforce strict serialization.
  • Get a one-call health report — pg_ripple.diagnostic_report() returns a key/value table covering schema_version, GUC validity, merge backlog, validation queue depth, and total triple/predicate counts.
  • Verify upgrade completeness — _pg_ripple.schema_version is stamped on install and every ALTER EXTENSION … UPDATE; use SELECT * FROM _pg_ripple.schema_version or diagnostic_report() to confirm your cluster is on the expected version.
  • Configure tombstone GC — two new GUCs: pg_ripple.tombstone_gc_enabled (bool, default on) and pg_ripple.tombstone_gc_threshold (float string, default 0.05). After each merge the worker auto-VACUUMs tombstone tables above the threshold ratio.
  • Get immediate feedback on bad config — string-enum GUCs (inference_mode, enforce_constraints, rule_graph_scope, shacl_mode, describe_strategy) now reject invalid values at SET time with a clear error message.
  • Prevent session-level RLS bypass — pg_ripple.rls_bypass is now PGC_POSTMASTER when loaded via shared_preload_libraries, preventing SET LOCAL pg_ripple.rls_bypass = on exploits.

What happens behind the scenes

  • src/storage/merge.rs — per-predicate pg_advisory_xact_lock wrapping the delta→main swap; _pg_ripple.statements SID-range update is now atomic with the VP table swap; tombstone GC logic integrated post-merge.
  • src/storage/mod.rs — delete_triple() acquires a shared advisory lock before tombstone insert; promote_predicate() acquires an exclusive advisory lock.
  • src/shmem.rs — all bloom filter counter decrements use saturating_sub(1).
  • src/sparql/optimizer.rs, src/sparql/sqlgen.rs, src/export.rs, pg_ripple_http/src/main.rs — all .unwrap() / .expect() calls in non-test code replaced with pgrx::error!() or graceful process::exit(1) patterns.
  • src/lib.rs#![cfg_attr(not(test), deny(clippy::unwrap_used, clippy::expect_used))]; GUC check_hook validators for 5 string-enum GUCs; new diagnostic_report() pg_extern; schema_version bootstrap table; tombstone GC GUC statics + registrations; rls_bypass conditional context.
  • New migration script: sql/pg_ripple--0.36.0--0.37.0.sql.
  • New pg_regress tests: storage_tombstone_gc.sql, diagnostic_report.sql.
  • Documentation: troubleshooting.md "Lost deletes after merge" runbook; guc-reference.md v0.37.0 section; upgrading.md schema_version stamp guide.

[0.36.0] — 2026-04-25 — Worst-Case Optimal Joins & Lattice-Based Datalog

Leapfrog Triejoin for cyclic SPARQL patterns and monotone lattice aggregation for Datalog^L.

What you can do

  • Accelerate triangle and cyclic graph queries — when pg_ripple.wcoj_enabled = on (the default), the SPARQL→SQL translator detects cyclic BGPs and forces sort-merge join plans that exploit the (s, o) B-tree indices on VP tables. Triangle queries that previously timed out complete in milliseconds.
  • Inspect cyclic patterns — pg_ripple.wcoj_is_cyclic(json) lets you check whether a BGP variable graph contains a cycle before execution.
  • Benchmark WCOJ — pg_ripple.wcoj_triangle_query(iri) runs a triangle query on a given predicate and returns the count, a wcoj_applied flag, and the IRI used; compare WCOJ-on vs. WCOJ-off with benchmarks/wcoj.sql.
  • Write recursive aggregation rules — pg_ripple.create_lattice() registers a user-defined lattice type, and pg_ripple.infer_lattice() runs a monotone fixpoint over rules that use it. Built-in lattices: min, max, set, interval.
  • Trust propagation and shortest paths — lattice rules like ?x ex:trust (MIN ?t1 ?t2) :- ?x ex:knows ?y, ?y ex:trust ?t1 converge to correct fixed points without manual loop unrolling.
  • Guaranteed termination — fixpoints are bounded by pg_ripple.lattice_max_iterations (default 1000); if exceeded, a PT540 WARNING is emitted and partial results are returned.
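The MIN-lattice fixpoint can be illustrated outside the database. Below is a minimal Python model of monotone lattice aggregation (entity names and initial trust values are invented; the real engine compiles such rules to SQL with INSERT … ON CONFLICT DO UPDATE):

```python
# Monotone MIN-lattice fixpoint: trust propagates along "knows" edges and
# each node keeps the minimum trust value seen so far (lattice join = min).
def lattice_fixpoint(knows, trust, max_iterations=1000):
    """knows: list of (x, y) edges; trust: dict entity -> initial trust."""
    trust = dict(trust)
    for _ in range(max_iterations):
        changed = False
        for x, y in knows:
            if y in trust:
                joined = min(trust.get(x, float("inf")), trust[y])
                if joined != trust.get(x):
                    trust[x] = joined
                    changed = True
        if not changed:                      # fixed point reached
            return trust
    raise RuntimeError("did not converge (cf. PT540)")

print(lattice_fixpoint([("alice", "bob"), ("bob", "carol")],
                       {"carol": 0.4, "bob": 0.9}))
# {'carol': 0.4, 'bob': 0.4, 'alice': 0.4}
```

Because min is monotone (values only ever decrease), the fixpoint is guaranteed to exist; the iteration cap plays the role of pg_ripple.lattice_max_iterations.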

What happens behind the scenes

  • src/sparql/wcoj.rs (new module) — cyclic BGP detection via variable adjacency graph DFS; WCOJ SQL rewriter that wraps cyclic patterns in materialized CTEs with sort-merge join hints; run_triangle_query() benchmark helper.
  • src/datalog/lattice.rs (new module) — lattice type catalog (_pg_ripple.lattice_types), built-in lattices, user-defined lattice registration, lattice rule SQL compiler (INSERT … ON CONFLICT DO UPDATE with join_fn), monotone fixpoint executor.
  • src/lib.rs — three new GUCs registered in _PG_init(): pg_ripple.wcoj_enabled, pg_ripple.wcoj_min_tables, pg_ripple.lattice_max_iterations. Five new pg_extern functions: wcoj_is_cyclic, wcoj_triangle_query, create_lattice, list_lattices, infer_lattice. New extension_sql! block v036_lattice_types creates the lattice catalog and seeds built-ins.
  • New migration script: sql/pg_ripple--0.35.0--0.36.0.sql.
  • New benchmark: benchmarks/wcoj.sql.
  • New pg_regress tests: sparql_wcoj.sql, datalog_lattice.sql.
  • New documentation: reference/lattice-datalog.md; user-guide/sql-reference/datalog.md updated; user-guide/best-practices/sparql-performance.md updated.
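Cyclic-BGP detection, the gatekeeper for the WCOJ rewrite described above, can be sketched as cycle detection on the variable graph. This is a standalone illustration, not the wcoj.rs code; patterns are reduced to subject/object variable pairs for brevity:

```python
# Cyclic-BGP detection sketch: build an undirected graph over the variables
# of a basic graph pattern (each triple pattern contributes one edge between
# its two variables) and look for a cycle with iterative DFS.
def bgp_is_cyclic(patterns):
    """patterns: list of (subject_var, object_var) pairs."""
    adj = {}
    for i, (s, o) in enumerate(patterns):
        adj.setdefault(s, []).append((o, i))
        adj.setdefault(o, []).append((s, i))
    seen = set()
    for start in adj:
        if start in seen:
            continue
        stack = [(start, -1)]        # (node, index of the edge used to reach it)
        seen.add(start)
        while stack:
            node, via = stack.pop()
            for nxt, edge in adj[node]:
                if edge == via:
                    continue         # don't walk back along the arriving edge
                if nxt in seen:
                    return True      # back edge => cycle
                seen.add(nxt)
                stack.append((nxt, edge))
    return False

triangle = [("?a", "?b"), ("?b", "?c"), ("?c", "?a")]
chain    = [("?a", "?b"), ("?b", "?c")]
print(bgp_is_cyclic(triangle), bgp_is_cyclic(chain))   # True False
```

A triangle query is the canonical cyclic case; a linear chain of joins is acyclic and is left to the ordinary translator.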
Technical Details

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.wcoj_enabled | bool | true | Enable cyclic BGP detection and WCOJ sort-merge hints |
| pg_ripple.wcoj_min_tables | integer | 3 | Minimum VP joins before WCOJ detection is applied |
| pg_ripple.lattice_max_iterations | integer | 1000 | Max fixpoint iterations for lattice inference |

New SQL functions

| Function | Returns | Description |
|---|---|---|
| wcoj_is_cyclic(json) | boolean | Detect cycle in a BGP variable graph |
| wcoj_triangle_query(iri) | jsonb | Run a triangle query with WCOJ benchmark stats |
| create_lattice(name, join_fn, bottom) | boolean | Register a user-defined lattice type |
| list_lattices() | jsonb | List all registered lattice types |
| infer_lattice(rule_set, lattice_name) | jsonb | Run monotone lattice fixpoint |

Error codes

  • PT540 — lattice fixpoint did not converge within lattice_max_iterations.

Schema changes

New catalog table _pg_ripple.lattice_types with columns name, join_fn, bottom, builtin, created_at.


[0.35.0] — 2026-04-19 — Parallel Stratum Evaluation & Incremental Rule Updates

Faster Datalog materialization through concurrent independent rule groups.

What you can do

  • Speed up OWL RL and large ontology closures — rules in the same stratum that derive different predicates and share no body dependencies are partitioned into independent groups and evaluated concurrently. On OWL RL, which splits into 4 independent groups, this cuts wall-clock materialization time.
  • See how parallel your rule set is — pg_ripple.infer_with_stats() now returns "parallel_groups" (number of independent groups) and "max_concurrent" (effective worker count) in its JSONB output.
  • Tune for your hardware — two new GUCs control parallelism: pg_ripple.datalog_parallel_workers (default 4) and pg_ripple.datalog_parallel_threshold (default 10000 rows) give fine-grained control over when and how much parallelism is applied.
  • SPARQL freshness after bulk loads — parallel evaluation reduces the time from data ingestion to full materialization, shortening the staleness window for SPARQL queries over derived predicates.

What happens behind the scenes

  • src/datalog/parallel.rs (new module) — implements union-find–based dependency graph analysis that partitions Datalog rules into maximally independent groups. Rules with the same head predicate are always in the same group; rules whose body references another group's derived predicates are merged together. Variable-predicate rules (e.g., OWL RL SymmetricProperty) form a separate serial group.
  • src/datalog/mod.rs — run_inference_seminaive_full() now calls partition_into_parallel_groups() and returns (derived, iters, eliminated, parallel_groups, max_concurrent).
  • src/lib.rs — two new GUC parameters registered in _PG_init(): pg_ripple.datalog_parallel_workers and pg_ripple.datalog_parallel_threshold. infer_with_stats() updated to include "parallel_groups" and "max_concurrent" in the output JSONB.
  • New pg_regress test: datalog_parallel.sql — all 119 tests pass.
Technical Details

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.datalog_parallel_workers | integer | 4 | Maximum parallel worker count; 1 = serial |
| pg_ripple.datalog_parallel_threshold | integer | 10000 | Min estimated row count before analysis is applied |

infer_with_stats() output additions

{
  "derived": 1240,
  "iterations": 4,
  "eliminated_rules": [],
  "parallel_groups": 3,
  "max_concurrent": 3
}

Algorithm

The partition_into_parallel_groups() function:

  1. Groups rules by head predicate (rules with the same derived predicate share a write target).
  2. Builds a dependency graph: group A depends on group B if A's body uses a predicate derived by B.
  3. Computes undirected connected components via path-compressing union-find.
  4. Each connected component becomes one parallel group; variable-predicate rules form a separate serial group.
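The steps above can be sketched as a standalone model (hypothetical rule representation; the real partitioner works on compiled rule structs and handles the variable-predicate serial group separately):

```python
# Partition rules into independent groups: rules sharing a head predicate
# share a write target, and groups that depend on each other are merged
# via path-compressing union-find.
def parallel_groups(rules):
    """rules: list of (head_predicate, [body_predicates])."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    heads = {h for h, _ in rules}
    for head, body in rules:
        find(head)
        for p in body:
            if p in heads:       # body reads a derived predicate
                union(head, p)   # -> same connected component
    groups = {}
    for h in heads:
        groups.setdefault(find(h), []).append(h)
    return sorted(sorted(g) for g in groups.values())

rules = [
    ("ancestor", ["parent", "ancestor"]),   # recursive, depends only on itself
    ("sibling",  ["parent"]),               # independent of "ancestor"
]
print(parallel_groups(rules))   # [['ancestor'], ['sibling']]
```

The two resulting groups have no shared derived predicates, so their fixpoints can run concurrently.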

Migration

sql/pg_ripple--0.34.0--0.35.0.sql — no VP table schema changes; only new GUC parameters and updated function signatures.


[0.34.0] — 2026-05-03 — Bounded-Depth Termination & Incremental Retraction (DRed)

Smarter fixpoint termination and write-correct incremental maintenance.

What you can do

  • Cap inference depth — set pg_ripple.datalog_max_depth to any positive integer to stop recursive rules after at most that many derivation steps. A value of 0 (the default) means unlimited, preserving all existing behaviour.
  • Add or remove rules without full recompute — pg_ripple.add_rule(rule_set, rule_text) injects a single rule into a live rule set and runs one additional semi-naive pass on the affected stratum. pg_ripple.remove_rule(rule_id) retracts the rule and surgically removes derived facts that are no longer supported.
  • Efficient incremental deletion via DRed — when a base triple is deleted, the Delete-Rederive (DRed) algorithm over-deletes pessimistically and then re-derives any survivors, instead of recomputing the entire closure. Controlled by pg_ripple.dred_enabled (default true) and pg_ripple.dred_batch_size (default 1000).
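A toy model of the DRed idea, using a single non-recursive rule so the over-delete set is easy to see (predicate and entity names are invented; the real implementation works in batches over VP tables):

```python
# Delete-Rederive (DRed) sketch for one rule:
#   grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
# Deleting a base fact pessimistically removes every derived fact it could
# support (over-delete), then re-derives any survivor that still has an
# alternative derivation, avoiding a full recomputation of the closure.
def derive(parents):
    return {(x, z) for (x, y) in parents for (y2, z) in parents if y == y2}

def dred_delete(parents, derived, deleted):
    a, b = deleted
    remaining = parents - {deleted}
    # Phase 1 (over-delete): facts with a derivation through (a, b).
    suspects = ({(a, z) for (y, z) in parents if y == b} |
                {(x, b) for (x, y) in parents if y == a})
    survivors = derived - suspects
    # Phase 2 (re-derive): suspects that still hold without (a, b).
    rederived = {f for f in suspects if f in derive(remaining)}
    return remaining, survivors | rederived

parents = {("ann", "bob"), ("bob", "cal"), ("ann", "bea"), ("bea", "cal")}
derived = derive(parents)                     # ("ann", "cal"), two derivations
parents, derived = dred_delete(parents, derived, ("bob", "cal"))
print(derived)   # {('ann', 'cal')} survives via ann -> bea -> cal
```

The fact ("ann", "cal") is over-deleted in phase 1 but re-derived in phase 2 because a second derivation survives the base-fact deletion.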

What happens behind the scenes

  • src/datalog/compiler.rs — compile_recursive_rule() reads pg_ripple.datalog_max_depth at compile time. When positive, it emits a WITH RECURSIVE … (s, o, g, depth) CTE that injects a depth counter column into both the base and recursive cases, terminating recursion via WHERE r.depth < max_depth.
  • src/datalog/dred.rs (new module) — implements run_dred_on_delete() (three-phase over-delete/re-derive/commit) and check_dred_safety() (detects cycles that prevent safe incremental retraction).
  • src/datalog/mod.rs — exposes add_rule_to_set() and remove_rule_by_id().
  • src/lib.rs — three new GUC parameters registered in _PG_init(): pg_ripple.datalog_max_depth, pg_ripple.dred_enabled, pg_ripple.dred_batch_size. Three new #[pg_extern] functions: add_rule(), remove_rule(), dred_on_delete().
  • New pg_regress tests: datalog_bounded_depth.sql, datalog_dred.sql, datalog_incremental_rules.sql — all 118 tests pass.
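The depth-counter CTE behaves like this semi-naive sketch (a Python stand-in for the generated SQL; with max_depth = 0 and a cyclic graph the loop would never terminate, which is exactly the risk the GUC guards against):

```python
# Bounded-depth transitive closure, mirroring the depth column the compiler
# injects into WITH RECURSIVE: base facts carry depth 1, each recursive step
# adds 1, and expansion stops once depth reaches max_depth (0 = unlimited).
def transitive_closure(edges, max_depth=0):
    frontier = {(s, o, 1) for (s, o) in edges}        # base case: depth 1
    closure = set(frontier)
    while frontier:
        nxt = set()
        for (s, m, d) in frontier:
            if max_depth and d >= max_depth:          # WHERE r.depth < max_depth
                continue
            for (m2, o) in edges:
                if m == m2 and (s, o, d + 1) not in closure:
                    nxt.add((s, o, d + 1))
        closure |= nxt
        frontier = nxt
    return {(s, o) for (s, o, _) in closure}

chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(sorted(transitive_closure(chain)))               # full closure, 6 pairs
print(sorted(transitive_closure(chain, max_depth=1)))  # base edges only
```

With a cap of 2, paths of length up to two hops are derived; unlimited depth reproduces the full closure on acyclic data.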

Migration

sql/pg_ripple--0.33.0--0.34.0.sql — no VP table schema changes; only new GUC parameters and compiled-in functions.


[0.33.0] — 2026-04-19 — Documentation Site & Content Overhaul

pg_ripple's documentation is rebuilt from the ground up. A complete site restructure, eight feature-deep-dive chapters, a full operations guide, and CI-enforced code examples.

What you can do

  • Find answers fast — the documentation is reorganized into four clear sections: Getting Started, Feature Deep Dives, Operations, and Reference. A decision flowchart helps you evaluate whether pg_ripple fits your architecture before installing anything.
  • Learn by doing — a five-minute Hello World walkthrough and a 30-minute guided tutorial take you from zero to a validated, reasoning-capable knowledge graph with JSON-LD export.
  • Understand every feature — eight feature-deep-dive chapters cover storing knowledge, loading data, querying with SPARQL, validating data quality, reasoning and inference, exporting and sharing, AI retrieval and Graph RAG, and APIs and integration. Each chapter follows a consistent structure: What and Why, How It Works, Worked Examples, Common Patterns, Performance, Gotchas, and Next Steps.
  • Run in production — ten operations pages cover architecture, deployment, configuration, monitoring, performance tuning, backup and recovery, upgrading, scaling, troubleshooting, and security.
  • Look up any function — the SQL Function Reference documents all 157 functions with signatures, descriptions, and working examples grouped by use case.

What happens behind the scenes

This is a documentation-only release. No SQL functions, GUC parameters, VP table schemas, or Rust code changed. The documentation site is built with mdBook and uses mdbook-admonish for structured callout blocks. A CI test harness (scripts/test_docs.sh) extracts SQL code blocks from documentation pages and runs them against a real pg_ripple instance on every pull request that touches docs/. A coverage script (scripts/check_docs_coverage.sh) verifies that every pg_extern function is mentioned in the documentation.

Technical Details

New files

| File | Purpose |
|---|---|
| scripts/test_docs.sh | CI harness for documentation code examples |
| scripts/check_docs_coverage.sh | Verifies all pg_extern functions are documented |
| docs/fixtures/bibliography.sql | Shared bibliographic fixture dataset |
| .github/workflows/docs-test.yml | CI workflow for documentation tests and link checking |
| .github/PULL_REQUEST_TEMPLATE.md | PR template with docs-gap reminder |

Site structure

The documentation is restructured from a flat list of pages into a four-section information architecture:

  • Getting Started: Installation, Hello World, Guided Tutorial, Key Concepts
  • Feature Deep Dives: 8 chapters (§2.1–§2.8) following a consistent seven-part structure
  • Operations: 10 pages covering deployment through security
  • Reference: SQL Function Reference, SPARQL Compliance Matrix, Error Catalog, FAQ, Glossary, Contributing

mdbook-admonish

book.toml updated with [preprocessor.admonish] and [output.linkcheck]. All new pages use fenced admonish callout syntax.

Migration

Run ALTER EXTENSION pg_ripple UPDATE TO '0.33.0' (applies sql/pg_ripple--0.32.0--0.33.0.sql — no schema changes).


[0.32.0] — 2026-04-19 — Well-Founded Semantics & Tabling

pg_ripple handles non-stratifiable Datalog programs and caches repeated inference results. All pg_regress tests pass (3 new tests for v0.32.0 features).

What you can do

  • Well-founded semantics — pg_ripple.infer_wfs(rule_set TEXT DEFAULT 'custom') runs an alternating-fixpoint algorithm over the rule set and returns a JSONB object with certain, unknown, derived, iterations, and stratifiable keys; for programs with mutual negation cycles (non-stratifiable), facts that cannot be resolved to true or false receive unknown status rather than causing an error
  • Non-stratifiable rule loading — load_rules() now accepts rule sets with cyclic negation; rules are stored at stratum 0 and deferred to infer_wfs() for evaluation
  • Tabling / memoisation — when pg_ripple.tabling = on (default), results of infer_wfs() are stored in _pg_ripple.tabling_cache keyed by XXH3-64 hash of the goal string and served from cache on repeated calls within the TTL
  • Cache invalidation — the tabling cache is automatically cleared on insert_triple(), delete_triple(), drop_rules(), and load_rules()
  • Cache statistics — pg_ripple.tabling_stats() returns per-entry statistics: goal_hash, hits, computed_ms, cached_at
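The cache semantics can be modelled in a few lines (hashlib's blake2b stands in for XXH3-64 here; field names follow tabling_stats()):

```python
import hashlib
import time

class TablingCache:
    """Memoise inference results keyed by a hash of the goal string."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.entries = {}              # goal_hash -> [result, cached_at, hits]

    @staticmethod
    def goal_hash(goal):
        return hashlib.blake2b(goal.encode(), digest_size=8).hexdigest()

    def get(self, goal):
        entry = self.entries.get(self.goal_hash(goal))
        if entry is None:
            return None
        if self.ttl and time.time() - entry[1] > self.ttl:
            del self.entries[self.goal_hash(goal)]    # expired: recompute
            return None
        entry[2] += 1                  # hit counter, cf. tabling_stats()
        return entry[0]

    def put(self, goal, result):
        self.entries[self.goal_hash(goal)] = [result, time.time(), 0]

    def invalidate(self):              # on insert_triple()/delete_triple()/
        self.entries.clear()           # load_rules()/drop_rules()

cache = TablingCache()
cache.put("win(X)", {"certain": 3, "unknown": 1})
print(cache.get("win(X)"))             # {'certain': 3, 'unknown': 1}
```

A TTL of 0 means entries never expire, matching pg_ripple.tabling_ttl = 0.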

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.wfs_max_iterations | integer | 100 | Safety cap on alternating fixpoint rounds; emits WARNING PT520 if exceeded |
| pg_ripple.tabling | bool | true | Enable tabling / memoisation cache |
| pg_ripple.tabling_ttl | integer | 300 | Cache entry TTL in seconds; 0 = no expiry |

New SQL functions

| Function | Returns | Description |
|---|---|---|
| pg_ripple.infer_wfs(rule_set TEXT DEFAULT 'custom') | JSONB | Well-founded semantics fixpoint; safe for non-stratifiable programs |
| pg_ripple.tabling_stats() | TABLE(goal_hash BIGINT, hits BIGINT, computed_ms FLOAT8, cached_at TEXT) | Tabling cache statistics |

Migration

Run ALTER EXTENSION pg_ripple UPDATE TO '0.32.0' (applies sql/pg_ripple--0.31.0--0.32.0.sql which creates _pg_ripple.tabling_cache).


[0.31.0] — 2026-04-19 — Entity Resolution & Demand Transformation

pg_ripple's Datalog engine gains owl:sameAs entity canonicalization and demand-filtered inference. All pg_regress tests pass (2 new tests for v0.31.0 features).

What you can do

  • owl:sameAs reasoning — when pg_ripple.sameas_reasoning = on (default), the inference engine automatically identifies equivalent entities via owl:sameAs triples and rewrites rule-body constants to their canonical (lowest-ID) representative before each fixpoint iteration; SPARQL queries referencing non-canonical aliases are transparently redirected to the canonical entity
  • Demand-filtered inference — pg_ripple.infer_demand(rule_set, demands JSONB) accepts a JSON array of goal patterns and derives only the facts needed to answer those goals; for programs with many rules and multiple derived predicates, this can reduce inference work by 50–90%
  • Multi-goal demand sets — unlike infer_goal() (single predicate), infer_demand() accepts multiple demand predicates simultaneously and computes a joint demand set via fixed-point propagation through the dependency graph; mutually recursive rules with multiple entry points are handled correctly
  • Demand + sameAs composition — infer_demand() applies the sameAs canonicalization pre-pass before running demand-filtered inference, combining both optimizations in one call
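Canonicalization itself is a small union-find over the sameAs pairs that always keeps the lowest-ID representative. An illustrative sketch (IRIs and IDs are invented):

```python
# owl:sameAs canonicalization sketch: collapse each sameAs equivalence
# class onto its lowest-ID member, then rewrite facts to canonical form.
def canonical_map(same_as, ids):
    """same_as: (a, b) pairs; ids: entity -> numeric ID. Lowest ID wins."""
    parent = {e: e for e in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:                         # keep the lower-ID root
            lo, hi = sorted((ra, rb), key=ids.get)
            parent[hi] = lo
    return {e: find(e) for e in ids}

ids = {"ex:alice": 1, "ex:al": 7, "ex:alice2": 9, "ex:bob": 2}
cmap = canonical_map([("ex:al", "ex:alice2"), ("ex:alice", "ex:al")], ids)
facts = [("ex:alice2", "foaf:knows", "ex:bob")]
print([(cmap[s], p, cmap.get(o, o)) for (s, p, o) in facts])
# [('ex:alice', 'foaf:knows', 'ex:bob')]
```

A fact asserted against the alias ex:alice2 is rewritten to the canonical ex:alice before inference, which is why queries on non-canonical aliases can be transparently redirected.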

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.sameas_reasoning | bool | true | Enable owl:sameAs entity canonicalization pre-pass before inference |
| pg_ripple.demand_transform | bool | true | Auto-apply demand transformation in create_datalog_view() with multiple goals |

New SQL functions

| Function | Returns | Description |
|---|---|---|
| pg_ripple.infer_demand(rule_set TEXT DEFAULT 'custom', demands JSONB) | JSONB | Run demand-filtered inference; demands is [{"p": "<iri>"}, …]; empty array = full inference |

Migration

No schema changes. Run ALTER EXTENSION pg_ripple UPDATE to upgrade from v0.30.0.


[0.30.0] — 2026-04-19 — Datalog Aggregation & Compiled Rule Plans

pg_ripple's Datalog engine gains Datalog^agg (aggregate literals in rule bodies) and a process-local rule plan cache. All pg_regress tests pass (3 new tests for v0.30.0 features).

What you can do

  • Aggregate inference — pg_ripple.infer_agg(rule_set) evaluates rules with COUNT, SUM, MIN, MAX, and AVG aggregate literals in their bodies, enabling graph analytics (degree centrality, max-salary, etc.) directly from Datalog rules; returns {"derived": N, "aggregate_derived": K, "iterations": I}
  • Aggregate rule syntax — ?x <ex:count> ?n :- COUNT(?y WHERE ?x <foaf:knows> ?y) = ?n .
  • Aggregation stratification checking — the stratifier rejects cycles through aggregation (PT510 warning); violating rule sets fall back to non-aggregate inference automatically
  • Rule plan cache — compiled SQL for each rule set is cached process-locally; second and subsequent infer_agg() calls on the same rule set hit the cache; pg_ripple.rule_plan_cache_stats() exposes hit/miss counts
  • Cache invalidation — load_rules() and drop_rules() automatically invalidate the cache for the modified rule set
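The COUNT rule from the syntax example reduces to a grouped count over the body predicate. A minimal model (the triples are invented; the engine emits GROUP BY SQL rather than iterating in memory):

```python
# Datalog^agg sketch: evaluate a COUNT aggregate literal in a rule body,
#   ?x <ex:count> ?n :- COUNT(?y WHERE ?x <foaf:knows> ?y) = ?n .
from collections import Counter

def infer_count(triples, body_pred, head_pred):
    counts = Counter(s for (s, p, o) in triples if p == body_pred)
    return {(x, head_pred, n) for x, n in counts.items()}

triples = [("ex:alice", "foaf:knows", "ex:bob"),
           ("ex:alice", "foaf:knows", "ex:carol"),
           ("ex:bob",   "foaf:knows", "ex:carol")]
print(sorted(infer_count(triples, "foaf:knows", "ex:count")))
# [('ex:alice', 'ex:count', 2), ('ex:bob', 'ex:count', 1)]
```

Stratification matters here: if ex:count fed back into a rule that derived foaf:knows, the count would be a moving target, which is the cycle the PT510 check rejects.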

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.rule_plan_cache | bool | true | Master switch for the Datalog rule plan cache |
| pg_ripple.rule_plan_cache_size | int | 64 | Maximum rule sets in plan cache (1–4096); evicts LFU entry on overflow |

New SQL functions

| Function | Returns | Description |
|---|---|---|
| pg_ripple.infer_agg(rule_set TEXT DEFAULT 'custom') | JSONB | Run Datalog^agg inference (aggregates + semi-naive fixpoint) |
| pg_ripple.rule_plan_cache_stats() | TABLE(rule_set TEXT, hits BIGINT, misses BIGINT, entries INT) | Show plan cache statistics per rule set |

New error codes

| Code | Name | Description |
|---|---|---|
| PT510 | AggStratificationViolation | Aggregate rule creates a cycle through aggregation; rule is skipped |
| PT511 | UnsupportedAggFunc | Unsupported aggregate function in rule body |

Migration

No schema changes. Run ALTER EXTENSION pg_ripple UPDATE to upgrade.


[0.29.0] — 2026-04-20 — Datalog Optimization: Magic Sets & Cost-Based Compilation

pg_ripple's Datalog engine gains goal-directed inference (magic sets), cost-based join reordering, anti-join negation, predicate-filter pushdown, delta-table indexing, and redundant-rule elimination. All pg_regress tests pass (6 new tests for v0.29.0 features).

What you can do

  • Goal-directed inference — pg_ripple.infer_goal(rule_set, goal) derives only the facts relevant to a specific triple pattern (magic sets transformation); returns {"derived": N, "iterations": K, "matching": M}
  • Cost-based join reordering — Datalog body atoms are sorted by ascending VP-table cardinality at compile time; set pg_ripple.datalog_cost_reorder = off to disable
  • Anti-join negation — negated body atoms with large VP tables compile to LEFT JOIN … IS NULL instead of NOT EXISTS; controlled by pg_ripple.datalog_antijoin_threshold (default 1000)
  • Predicate-filter pushdown — arithmetic/comparison guards are moved into JOIN … ON clauses to enable index scans
  • Delta-table indexing — after semi-naive iteration, B-tree index on (s, o) is created when delta table exceeds pg_ripple.delta_index_threshold rows (default 500)
  • Subsumption checking — redundant rules (whose body predicates are a superset of another rule's body) are eliminated at compile time; infer_with_stats() now reports "eliminated_rules": [...]
  • New error codes — PT501 (magic sets circular binding), PT502 (cost-based reordering skipped)
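Cost-based reordering is conceptually a stable sort of body atoms by estimated VP-table cardinality, so the most selective atom drives the join. A sketch (the predicates and row counts are invented):

```python
# Cost-based join reordering sketch: smaller VP tables first, so each join
# step filters as early as possible; unknown predicates sort last.
def reorder_body(body_atoms, cardinality):
    """body_atoms: predicate IRIs; cardinality: IRI -> row estimate."""
    return sorted(body_atoms, key=lambda p: cardinality.get(p, float("inf")))

stats = {"foaf:knows": 2_000_000, "ex:ceoOf": 40, "ex:worksAt": 90_000}
print(reorder_body(["foaf:knows", "ex:worksAt", "ex:ceoOf"], stats))
# ['ex:ceoOf', 'ex:worksAt', 'foaf:knows']
```

With the 40-row ex:ceoOf atom first, the join against the two-million-row foaf:knows table touches only the rows reachable from the tiny candidate set.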

New GUC parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.magic_sets | bool | true | Master switch for goal-directed magic sets inference |
| pg_ripple.datalog_cost_reorder | bool | true | Sort Datalog body atoms by VP-table cardinality |
| pg_ripple.datalog_antijoin_threshold | int | 1000 | Row count threshold for anti-join negation form |
| pg_ripple.delta_index_threshold | int | 500 | Row count threshold for delta table B-tree index |

New SQL functions

| Function | Description |
|---|---|
| pg_ripple.infer_goal(rule_set TEXT, goal TEXT) → JSONB | Goal-directed inference returning derived/matching counts |

Changed SQL functions

  • pg_ripple.infer_with_stats(rule_set TEXT) → JSONB — now includes "eliminated_rules": [...] array in returned JSONB

[0.28.0] — 2026-04-18 — Advanced Hybrid Search & RAG Pipeline

pg_ripple completes its hybrid search stack with Reciprocal Rank Fusion, graph-contextualized embeddings, end-to-end RAG retrieval, incremental embedding, multi-model support, and SPARQL federation with external vector services. All pg_regress tests pass (6 new tests for v0.28.0 features).

What you can do

  • Hybrid search with RRF fusion — pg_ripple.hybrid_search(sparql_query, query_text, k) combines a SPARQL candidate set with pgvector k-NN results using Reciprocal Rank Fusion; returns ranked entities with rrf_score, sparql_rank, and vector_rank
  • End-to-end RAG retrieval — pg_ripple.rag_retrieve('what treats headaches?', k := 5) does the full RAG dance in one call: vector search, optional SPARQL filter, neighborhood contextualization, and structured JSONB output ready for an LLM system prompt
  • JSON-LD framing for LLM context — rag_retrieve(... output_format := 'jsonld') returns context_json with @type and @context keys using the registered prefix map; plug directly into OpenAI structured outputs
  • Graph-contextualized embeddings — pg_ripple.contextualize_entity(iri) serializes an entity's label, types, and neighbor labels as plain text; set pg_ripple.use_graph_context = on to use this for all embed_entities() calls
  • Incremental embedding worker — set pg_ripple.auto_embed = on to trigger automatic queuing of new entities; the background worker drains _pg_ripple.embedding_queue in batches
  • Multi-model support — pg_ripple.list_embedding_models() enumerates all models in _pg_ripple.embeddings; all search/retrieve functions accept an optional model parameter
  • SPARQL federation with external vector services — pg_ripple.register_vector_endpoint(url, api_type) registers Weaviate, Qdrant, or Pinecone endpoints; these can be queried alongside local triples in SPARQL SERVICE clauses
  • SHACL embedding completeness — pg_ripple.add_embedding_triples() materialises pg:hasEmbedding triples; the included SHACL shape validates completeness via sh:minCount 1
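Reciprocal Rank Fusion itself is tiny: each entity scores the sum of 1 / (k + rank) over every ranking it appears in. A sketch with the conventional k = 60 damping constant (the entity rankings are invented):

```python
# RRF fusion sketch: fuse a SPARQL candidate ranking with a vector k-NN
# ranking; entities found by both lists float to the top.
def rrf_fuse(sparql_ranked, vector_ranked, k=60):
    """Each input is a list of entity IRIs, best match first."""
    ranks = {}
    for source, ranked in (("sparql", sparql_ranked), ("vector", vector_ranked)):
        for rank, entity in enumerate(ranked, start=1):
            ranks.setdefault(entity, {})[source] = rank
    scored = [(sum(1.0 / (k + r) for r in rs.values()), entity, rs)
              for entity, rs in ranks.items()]
    return sorted(scored, key=lambda t: t[0], reverse=True)

sparql = ["ex:aspirin", "ex:ibuprofen", "ex:paracetamol"]   # graph candidates
vector = ["ex:ibuprofen", "ex:naproxen", "ex:aspirin"]      # k-NN candidates
for score, entity, ranks in rrf_fuse(sparql, vector):
    print(f"{entity:16s} rrf={score:.5f} {ranks}")
```

ex:ibuprofen wins despite being only rank 2 in the SPARQL list, because agreement between the two rankers outweighs a single first place.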

Added

  • pg_ripple.hybrid_search(sparql_query TEXT, query_text TEXT, k INT DEFAULT 10, alpha FLOAT8 DEFAULT 0.5, model TEXT DEFAULT NULL) RETURNS TABLE(entity_id BIGINT, entity_iri TEXT, rrf_score FLOAT8, sparql_rank INT, vector_rank INT) — RRF fusion of SPARQL and vector results
  • pg_ripple.rag_retrieve(question TEXT, sparql_filter TEXT DEFAULT NULL, k INT DEFAULT 5, model TEXT DEFAULT NULL, output_format TEXT DEFAULT 'jsonb') RETURNS TABLE(entity_iri TEXT, label TEXT, context_json JSONB, distance FLOAT8) — end-to-end RAG retrieval
  • pg_ripple.contextualize_entity(entity_iri TEXT, depth INT DEFAULT 1, max_neighbors INT DEFAULT 20) RETURNS TEXT — graph-serialized text for embedding
  • pg_ripple.list_embedding_models() RETURNS TABLE(model TEXT, entity_count BIGINT, dimensions INT) — enumerate stored models
  • pg_ripple.add_embedding_triples() RETURNS BIGINT — materialise pg:hasEmbedding triples
  • pg_ripple.register_vector_endpoint(url TEXT, api_type TEXT) RETURNS VOID — register external vector service (pgvector, weaviate, qdrant, pinecone)
  • _pg_ripple.embedding_queue table — incremental embedding queue (v0.28.0)
  • _pg_ripple.vector_endpoints table — external vector service catalog
  • _pg_ripple.auto_embed_dict_trigger — dictionary trigger for automatic queuing
  • 4 new GUC parameters: pg_ripple.auto_embed, pg_ripple.embedding_batch_size, pg_ripple.use_graph_context, pg_ripple.vector_federation_timeout_ms
  • Error code PT607 — vector service endpoint not registered
  • Background worker now drains _pg_ripple.embedding_queue when pg_ripple.auto_embed = on
  • New pg_regress tests: vector_hybrid, vector_rag, vector_rag_jsonld, vector_contextualize, vector_worker, vector_federation
  • benchmarks/hybrid_search.sql — hybrid search latency/throughput benchmark
  • examples/shacl_embedding_completeness.ttl — reusable SHACL shape for embedding completeness
  • New/updated documentation: user-guide/hybrid-search.md, user-guide/rag.md, user-guide/vector-federation.md, reference/embedding-functions.md, reference/http-api.md

Migration

Run sql/pg_ripple--0.27.0--0.28.0.sql on existing installations. Creates _pg_ripple.embedding_queue and _pg_ripple.vector_endpoints tables plus the auto_embed_dict_trigger trigger. No VP table schema changes.


[0.27.0] — 2026-04-18 — Vector + SPARQL Hybrid: Foundation

pg_ripple gains pgvector integration: store high-dimensional embeddings for any RDF entity, search by semantic similarity, and mix vector nearest-neighbour search with SPARQL graph patterns in a single in-process query. All 95 pg_regress tests pass (8 new tests for v0.27.0 features).

What you can do

  • Store embeddings for RDF entities — pg_ripple.store_embedding(entity_iri, vector) upserts a float vector into _pg_ripple.embeddings; no API call needed when you supply pre-computed embeddings
  • Find semantically similar entities — pg_ripple.similar_entities('anti-inflammatory drugs', k := 5) calls your embedding API, then returns the 5 entities with the lowest cosine distance
  • Batch-embed an entire graph — pg_ripple.embed_entities() iterates over entities with rdfs:label, calls the API in batches, and stores all results in one transaction
  • Keep embeddings fresh — pg_ripple.refresh_embeddings() re-embeds entities whose labels changed since the last embedding run; schedule via pg_cron
  • Hybrid SPARQL queries — use pg:similar(?entity, "search text", 10) inside SPARQL BIND expressions; combine with FILTER, OPTIONAL, UNION, and any other SPARQL feature
  • Run in CI without pgvector — every embedding function degrades gracefully with a WARNING (no ERROR) when pgvector is absent; all 8 new tests pass in environments without pgvector
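Once the query text has been embedded, the similarity search is a cosine-distance k-NN. A toy model with 3-dimensional vectors (real embeddings are far wider, e.g. 1536 dimensions, and the lookup uses an HNSW index rather than a full scan):

```python
# k-NN by cosine distance sketch: rank stored entity vectors by angular
# closeness to the embedded query and keep the k best.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def knn(query_vec, store, k=2):
    ranked = sorted(store.items(),
                    key=lambda kv: cosine_distance(query_vec, kv[1]))
    return [iri for iri, _ in ranked[:k]]

store = {"ex:ibuprofen": [0.9, 0.1, 0.0],
         "ex:aspirin":   [0.8, 0.2, 0.1],
         "ex:guitar":    [0.0, 0.1, 0.9]}
print(knn([0.85, 0.15, 0.05], store))   # ['ex:ibuprofen', 'ex:aspirin']
```

The two drug vectors point in nearly the same direction as the query, so they rank ahead of the unrelated entity regardless of vector magnitude.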

Added

  • _pg_ripple.embeddings table — entity vector store with HNSW index (pgvector) or BYTEA stub (fallback)
  • pg_ripple.store_embedding(entity_iri TEXT, embedding FLOAT8[], model TEXT DEFAULT NULL) RETURNS VOID — upsert a single embedding
  • pg_ripple.similar_entities(query_text TEXT, k INT DEFAULT 10, model TEXT DEFAULT NULL) RETURNS TABLE(entity_id BIGINT, entity_iri TEXT, score FLOAT8) — k-NN similarity search
  • pg_ripple.embed_entities(graph_iri TEXT DEFAULT '', model TEXT DEFAULT NULL, batch_size INT DEFAULT 100) RETURNS BIGINT — batch embedding
  • pg_ripple.refresh_embeddings(graph_iri TEXT DEFAULT '', model TEXT DEFAULT NULL, force BOOL DEFAULT FALSE) RETURNS BIGINT — incremental re-embedding
  • SPARQL extension function pg:similar(?entity, "text", k) via IRI http://pg-ripple.org/functions/similar
  • 7 new GUC parameters: pg_ripple.pgvector_enabled, pg_ripple.embedding_api_url, pg_ripple.embedding_api_key, pg_ripple.embedding_model, pg_ripple.embedding_dimensions, pg_ripple.embedding_index_type, pg_ripple.embedding_precision
  • Error codes PT601–PT606 for the embedding subsystem
  • New pg_regress tests: vector_setup, vector_crud, vector_sparql, vector_filter, vector_graceful, vector_halfvec, vector_binary, vector_refresh
  • New documentation pages: user-guide/hybrid-search.md, reference/embedding-functions.md, reference/guc-reference.md

Migration

Run sql/pg_ripple--0.26.0--0.27.0.sql on existing installations. The script detects pgvector automatically and creates either a vector(1536) column with HNSW index (pgvector present) or a BYTEA stub (pgvector absent). No VP table schema changes.


[0.26.0] — 2026-04-18 — GraphRAG Integration

pg_ripple becomes a first-class backend for Microsoft GraphRAG: store LLM-extracted entities and relationships as RDF triples, enrich the graph with Datalog rules, enforce quality with SHACL shapes, and export back to Parquet for GraphRAG's BYOG (Bring Your Own Graph) pipeline. All 87 pg_regress tests pass (5 new tests for v0.26.0 features).

What you can do

  • Use pg_ripple as your GraphRAG knowledge graph — store entities, relationships, and text units as native RDF triples; query them with SPARQL; update incrementally via the HTAP delta partition
  • Export to Parquet for GraphRAG BYOG — pg_ripple.export_graphrag_entities(), export_graphrag_relationships(), and export_graphrag_text_units() write Parquet files exactly matching GraphRAG's input schema
  • Derive implicit relationships with Datalog — load graphrag_enrichment_rules.pl and run pg_ripple.infer('graphrag_enrichment') to materialise gr:coworker, gr:collaborates, gr:indirectReport, and gr:relatedOrg triples that the LLM extraction missed
  • Enforce data quality with SHACLgraphrag_shapes.ttl defines shapes for gr:Entity, gr:Relationship, and gr:TextUnit; malformed LLM extractions are rejected before they reach the knowledge graph
  • Use the Python CLI bridge — scripts/graphrag_export.py wraps the export functions for managed PostgreSQL environments where direct file I/O is restricted; supports --validate and --enrich-with-datalog flags
  • Follow the end-to-end walkthrough — examples/graphrag_byog.sql demonstrates the full BYOG workflow: ontology loading, entity insertion, Datalog enrichment, SHACL validation, SPARQL query, and Parquet export

Added

  • pg_ripple.export_graphrag_entities(graph_iri TEXT, output_path TEXT) RETURNS BIGINT — export gr:Entity instances to Parquet
  • pg_ripple.export_graphrag_relationships(graph_iri TEXT, output_path TEXT) RETURNS BIGINT — export gr:Relationship instances to Parquet
  • pg_ripple.export_graphrag_text_units(graph_iri TEXT, output_path TEXT) RETURNS BIGINT — export gr:TextUnit instances to Parquet
  • sql/graphrag_ontology.ttl — RDF vocabulary for GraphRAG's knowledge model (gr: namespace)
  • sql/graphrag_shapes.ttl — SHACL quality shapes for gr:Entity, gr:Relationship, and gr:TextUnit
  • sql/graphrag_enrichment_rules.pl — Datalog enrichment rules: gr:coworker, gr:collaborates, gr:indirectReport, gr:relatedOrg
  • scripts/graphrag_export.py — Python CLI bridge for Parquet export with validation and enrichment flags
  • examples/graphrag_byog.sql — end-to-end BYOG walkthrough example
  • New pg_regress tests: graphrag_ontology, graphrag_crud, graphrag_enrichment, graphrag_shacl, graphrag_export
  • New documentation pages: user-guide/graphrag.md, user-guide/graphrag-enrichment.md, reference/graphrag-ontology.md, reference/graphrag-functions.md

[0.25.0] — 2026-04-18 — GeoSPARQL & Architectural Polish

pg_ripple adds GeoSPARQL 1.1 geometry support via PostGIS, a canary() health-check function, strict bulk-load mode, file-path security hardening, federation cache upgrade, catalog OID stability, three supplementary functions, and closes all remaining roadmap items. All 82 pg_regress tests pass (6 new tests for v0.25.0 features).

What you can do

  • Query geographic data with GeoSPARQL — use geo:sfIntersects, geo:sfContains, geo:sfWithin and 9 other topological predicates in SPARQL FILTER clauses; compute geof:distance, geof:area, geof:boundary; requires PostGIS (graceful no-op when absent)
  • Check system health — pg_ripple.canary() returns {"merge_worker": "ok"|"stalled", "cache_hit_rate": 0.0–1.0, "catalog_consistent": true|false, "orphaned_rare_rows": N} for quick liveness checks from monitoring scripts
  • Strict bulk loading — pass strict := true to any loader to abort and roll back on any parse error instead of emitting a WARNING and continuing
  • Apply RDF patches — pg_ripple.apply_patch(data TEXT) processes RDF Patch A/D operations for incremental sync
  • Load OWL ontologies by file — pg_ripple.load_owl_ontology(path TEXT) auto-detects format by extension (.ttl, .nt, .xml, .rdf, .owl)
  • Register custom SPARQL aggregates — pg_ripple.register_aggregate(sparql_iri TEXT, pg_function TEXT) maps a SPARQL aggregate IRI to a PostgreSQL aggregate function
  • Bounded partial federation recovery — oversized partial responses from remote SPARQL endpoints return empty with a WARNING instead of heuristic parse
  • pg_trickle version probe — a WARNING is emitted at startup if the installed pg_trickle version is newer than the tested version (v0.3.0)
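A short sketch of the health-check and strict-load features described above — the Turtle snippet is illustrative, and the exact JSON keys are those quoted in the canary() bullet:

```sql
-- Quick liveness probe from a monitoring script (JSONB result).
SELECT pg_ripple.canary();

-- Strict bulk load: abort and roll back on the first parse error
-- instead of emitting a WARNING and continuing.
SELECT pg_ripple.load_turtle(
  '@prefix ex: <http://example.org/> . ex:a ex:p ex:b .',
  strict := true);
```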

What changes

  • GeoSPARQL (F-5) (src/sparql/expr.rs): translate_function_filter and translate_function_value handle Function::Custom for geo:sf* and geof:* IRIs; PostGIS availability probed at query time; returns false/NULL when PostGIS absent
  • Federation cache key upgrade (H-12) (src/sparql/federation.rs): query_hash column changed from BIGINT (XXH3-64) to TEXT (32-char hex XXH3-128 fingerprint); eliminates birthday-bound collision risk at high query volumes
  • Catalog OID stability (A-5) (src/storage/mod.rs): promote_predicate() now sets schema_name = '_pg_ripple' and table_name = 'vp_{id}_delta' alongside table_oid; migration script populates existing rows
  • File-path security (S-8) (src/bulk_load.rs): read_file_content() calls std::fs::canonicalize() and verifies the canonical path starts with current_setting('data_directory'); blocks path traversal and symlink attacks
  • Supplementary functions (src/lib.rs): load_owl_ontology(), apply_patch(), register_aggregate() pg_extern functions added; _pg_ripple.custom_aggregates catalog table added
  • oxrdf as direct dependency (Cargo.toml): oxrdf = "0.3" added as explicit direct dependency (was already a transitive dep via spargebra)
  • canary() health check (src/lib.rs): new #[pg_extern] fn canary() -> JsonB
  • Bulk load strict mode (src/bulk_load.rs, src/lib.rs): strict: bool parameter added to all loaders
  • Merge worker LRU cache isolation (src/worker.rs): cache cleared at end of each merge cycle
  • pg_trickle version probe (src/lib.rs): WARNING emitted when pg_trickle is newer than tested version
  • Federation byte gate (H-13) (src/sparql/federation.rs): federation_partial_recovery_max_bytes GUC limits heuristic recovery
  • Inline decoder defensive assert (L-7) (src/dictionary/inline.rs): debug_assert!(is_inline(id)) at top of format_inline()
  • Migration script (sql/pg_ripple--0.24.0--0.25.0.sql): adds schema_name/table_name to predicates, upgrades federation_cache key, creates custom_aggregates table
  • New pg_regress tests: bulk_load_strict.sql, canary.sql, geosparql.sql, federation_cache.sql, export_roundtrip.sql, supplementary_features.sql
  • Documentation: new reference/geosparql.md, user-guide/geospatial.md; updated reference/security.md, user-guide/sql-reference/bulk-load.md, user-guide/configuration.md

[0.24.0] — Semi-naive Datalog, Streaming Export & Performance Hardening

pg_ripple adds semi-naive Datalog evaluation with statistics, streaming triple export, SPARQL property-path depth control, BGP selectivity improvements, and fixes a correctness bug in sh:languageIn evaluation. All 76 pg_regress tests pass (3 new tests for v0.24.0 features).

What you can do

  • Run inference with stats — pg_ripple.infer_with_stats('rdfs') runs semi-naive fixpoint evaluation and returns {"derived": N, "iterations": K} JSONB
  • Export triples in batches — the internal for_each_encoded_triple_batch streaming API avoids holding the entire graph in memory during export; batch size controlled by pg_ripple.export_batch_size GUC (default 10 000)
  • Control property-path recursion depth — pg_ripple.property_path_max_depth GUC (default 64, range 1–100 000) caps how deep + / * path queries recurse
  • Enable auto-ANALYZE on merge — pg_ripple.auto_analyze GUC (bool, default off) triggers a targeted ANALYZE after each merge cycle so the planner has fresh statistics
  • Validate sh:languageIn correctly — Turtle string-literal tags like "en" in sh:languageIn ( "en" "de" ) are now stripped of their surrounding quotes before comparison against the dictionary lang column
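The new GUCs and the stats-returning inference entry point can be exercised together — a sketch using only the names documented above:

```sql
-- GUC names from the notes above; values are illustrative.
SET pg_ripple.property_path_max_depth = 128;  -- cap + / * path recursion
SET pg_ripple.auto_analyze = on;              -- targeted ANALYZE after each merge cycle
SET pg_ripple.export_batch_size = 50000;      -- streaming export batch size

-- Semi-naive fixpoint run; returns {"derived": N, "iterations": K} as JSONB.
SELECT pg_ripple.infer_with_stats('rdfs');
```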

What changes

  • Semi-naive Datalog evaluation (src/datalog/mod.rs, src/datalog/compiler.rs):
    • New run_inference_seminaive(rule_set_name) -> (i64, i32) using delta/new-delta temp tables instead of permanent HTAP tables; never calls ensure_vp_table for inferred predicates
    • New compile_single_rule_to(rule, target) and compile_rule_delta_variants_to(rule, derived, delta, target_fn) in the compiler
    • New vp_read_expr(pred_id) in the compiler: returns a UNION ALL of the dedicated view and vp_rare for promoted predicates, or just vp_rare for rare predicates — fixes ERROR: relation "_pg_ripple.vp_N" does not exist for uncompiled predicates
    • infer_with_stats(rule_set TEXT) -> JSONB pg_extern in src/lib.rs
    • WARNINGs emitted for rules with variable predicates (not supported in semi-naive; rule is skipped)
    • Materialized triples written to vp_rare with ON CONFLICT DO NOTHING
  • Streaming export (src/export.rs, src/storage/mod.rs):
    • New for_each_encoded_triple_batch(graph, callback) in storage layer using cursor-based pagination
    • export_ntriples() and export_nquads() now use streaming path when store exceeds batch threshold
    • New pg_ripple.export_batch_size GUC (i32, default 10 000, range 100–10 000 000)
  • Performance hardening:
    • BGP selectivity fallback multipliers: subject-bound → 1% of reltuples, object-bound → 5% (src/sparql/optimizer.rs) — avoids divide-by-zero when pg_stats.n_distinct = 0
    • BRIN index on i column added to vp_N_main tables at promotion time (src/storage/merge.rs) — accelerates range scans by sequential ID
    • pg_ripple.auto_analyze GUC: when on, runs ANALYZE vp_N_delta, vp_N_main after each successful merge cycle
  • GUC additions (src/lib.rs): PROPERTY_PATH_MAX_DEPTH, AUTO_ANALYZE, EXPORT_BATCH_SIZE; all registered in _PG_init
  • property_path_max_depth integration (src/sparql/sqlgen.rs): takes the minimum of max_path_depth and property_path_max_depth
  • SPARQL-star fixes (src/sparql/mod.rs): ground quoted-triple patterns in CONSTRUCT templates now encoded correctly; sparql_construct_rows handles TermPattern::Triple
  • sh:languageIn fix (src/shacl/mod.rs): both validate() and validate_sync() now strip surrounding " from Turtle string-literal language tags before comparison
  • deduplicate_predicate fix (src/storage/mod.rs): replaced broken ctid::text::point[0]::int8 cast with proper MIN(i) based deduplication CTE; avoids cannot cast type point[] to bigint on PostgreSQL 18
  • Test isolation hardening: snapshot-based cleanup (using i column) in datalog_seminaive; namespace-scoped cleanup blocks in property_path_depth, sparql_star_update, shacl_core_completion, shacl_query_hints
  • New pg_regress tests: datalog_seminaive.sql, property_path_depth.sql, sparql_star_update.sql

[0.23.0] — 2026-04-20 — SHACL Core Completion & SPARQL Diagnostics

pg_ripple completes the SHACL 1.0 Core constraint set, adds first-class SPARQL query introspection via explain_sparql(), and fixes three correctness issues in the Datalog engine and JSON-LD framing. All 67 pg_regress tests pass (3 new tests for v0.23.0 features).

What you can do

  • Validate rich SHACL constraints — sh:hasValue, sh:nodeKind, sh:languageIn, sh:uniqueLang, sh:lessThan, sh:greaterThan, and sh:closed now all produce correct violations
  • Load SHACL shapes with block comments — Turtle documents containing /* … */ block comments now parse correctly
  • Inspect generated SQL — pg_ripple.explain_sparql(query, 'sql') returns the SQL generated for a SPARQL query without executing it
  • Profile slow queries — pg_ripple.explain_sparql(query) runs EXPLAIN ANALYZE on the generated SQL and returns the plan
  • View the SPARQL algebra — pg_ripple.explain_sparql(query, 'sparql_algebra') returns the spargebra algebra tree as formatted text
  • Get named errors for Datalog mistakes — division by zero wraps the divisor with NULLIF; unbound variables raise a compile-time error naming the variable and rule; negation cycles are reported as "datalog: unstratifiable negation cycle: A → ¬B → A"
  • Avoid JSON-LD framing panics — CONSTRUCT queries that return no results no longer panic in the framing layer; circular graphs with @embed: @always no longer loop forever
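The three explain_sparql() formats above can be sketched against a trivial query — the query text is illustrative:

```sql
-- Return the generated SQL without executing it.
SELECT pg_ripple.explain_sparql(
  'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
   SELECT ?name WHERE { ?p foaf:name ?name }',
  'sql');

-- Default form runs EXPLAIN ANALYZE on the generated SQL;
-- 'sparql_algebra' returns the spargebra algebra tree instead.
SELECT pg_ripple.explain_sparql(
  'SELECT ?s WHERE { ?s ?p ?o }', 'sparql_algebra');
```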

What changes

  • SHACL Core constraints (src/shacl/mod.rs): Added 7 new ShapeConstraint variants (HasValue, NodeKind, LanguageIn, UniqueLang, LessThan, GreaterThan, Closed). Added strip_block_comments() preprocessing step. Implemented validation in validate_property_shape() and run_validate(). Sync validator updated for NodeKind and LanguageIn. Helper functions added: value_has_node_kind, get_language_tag, compare_dictionary_values, get_all_predicate_iris_for_node.
  • SPARQL explain (src/sparql/mod.rs, src/lib.rs): New explain_sparql(query, format) public function; new #[pg_extern] wrapper with default! for the format parameter. Existing sparql_explain(query, analyze) remains unchanged.
  • Datalog correctness (src/datalog/compiler.rs, src/datalog/stratify.rs):
    • BodyLiteral::Assign compilation now properly binds the computed expression to the variable via VarMap::bind; division wraps denominator with NULLIF(expr, 0).
    • Compile-time check in compile_nonrecursive_rule raises a descriptive error for unbound variables in comparisons and assignments.
    • Negation-cycle detection in stratify.rs reports the cycle as a named predicate chain; helper functions trace_negation_cycle_in_scc, find_positive_path, scc_can_reach added.
  • JSON-LD framing (src/framing/embedder.rs):
    • M-4: replaced roots.into_iter().next().unwrap() with roots.swap_remove(0) (len == 1 already checked).
    • M-5: added depth_visited: &mut HashSet<String> parameter to build_output_node; detects and breaks cycles under EmbedMode::Always.
  • Tests: 3 new pg_regress test files: shacl_core_completion.sql, explain_sparql.sql, shacl_query_hints.sql.

[0.22.0] — 2026-04-17 — Storage Correctness & Security Hardening

pg_ripple eliminates four critical race conditions, locks down the internal schema from unprivileged users, and hardens the HTTP companion service against information-disclosure and timing attacks. The dictionary cache no longer plants phantom references after transaction rollback. The background merge process closes all known atomicity windows. Rare-predicate promotion is now atomic. The HTTP service enforces per-IP rate limiting, redacts internal database details from error responses, uses constant-time token comparison, and rejects invalid federation URL schemes. All 70 pg_regress tests pass.

What you can do

  • Rely on correct cache rollback — rolled-back insert_triple() calls no longer leave phantom term IDs that reappear in subsequent transactions
  • Avoid "relation does not exist" errors during merge — the view-rename window has been closed; concurrent queries no longer fail if they execute during an HTAP merge
  • Prevent deleted facts from reappearing — the tombstone resurrection race condition is fixed; deletes committed during a merge are correctly preserved to the next cycle
  • Get correct query cardinality — a triple no longer appears twice in query results if it exists in both main and delta partitions
  • Rely on atomic predicate promotion — a predicate is now promoted from vp_rare to its own VP table in a single CTE; no rows can be orphaned during concurrent inserts
  • Monitor cache performance — new pg_ripple.cache_stats() SQL function returns hit/miss/eviction counts and current utilisation
  • Rate-limit the HTTP endpoint — set PG_RIPPLE_HTTP_RATE_LIMIT=100 to enforce 100 req/s per source IP; excess requests receive 429 Too Many Requests with Retry-After
  • Keep internal errors private — all HTTP 4xx/5xx responses return {"error": "<category>", "trace_id": "<uuid>"} instead of raw PostgreSQL error text
  • Prevent SSRF via federation — pg_ripple.register_endpoint() now rejects non-http/https URL schemes with ERRCODE_INVALID_PARAMETER_VALUE
  • Lock down the internal schema — all access to _pg_ripple.* is revoked from PUBLIC; only superusers can directly query internal tables
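A sketch of the monitoring and SSRF-guard behaviour described above — cache_stats() is named in the notes but its exact output columns are not specified here, and the register_endpoint() argument shape is an assumption:

```sql
-- Hit/miss/eviction counts and current utilisation of the encode cache.
SELECT pg_ripple.cache_stats();

-- Non-http(s) schemes are rejected with ERRCODE_INVALID_PARAMETER_VALUE.
SELECT pg_ripple.register_endpoint('file:///etc/passwd');  -- ERROR
```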

What changes

  • Shared-memory encode cache: Replaced the direct-mapped 4096-slot design with a 4-way set-associative layout (1024 sets × 4 ways). LRU eviction within each set uses a 2-bit age field. The birthday-collision rate drops from ~15% to <1% at 5k hot terms.
  • Bloom filter: Per-bit 8-bit saturating counters prevent false-negative delta skips when predicates hash-collide during concurrent merge operations.
  • Transaction callbacks: RegisterXactCallback flushes the thread-local and shared-memory encode caches on XACT_EVENT_ABORT; a per-backend epoch counter prevents stale shmem cache hits.
  • Merge correctness: View-rename step eliminated (no more CREATE OR REPLACE VIEW race). Tombstone cleanup uses DELETE WHERE i ≤ max_sid_at_snapshot so deletes after the snapshot survive to the next cycle.
  • Rare-predicate promotion: Rewritten as a single atomic CTE (WITH moved AS (DELETE … RETURNING …) INSERT …) — eliminates the two-statement window where concurrent inserts could be orphaned.
  • Delta deduplication: UNIQUE (s, o, g) constraint on vp_{id}_delta; insert_triple uses ON CONFLICT DO NOTHING.
  • HTTP rate limiting: tower_governor crate enforces PG_RIPPLE_HTTP_RATE_LIMIT req/s per source IP; returns 429 with Retry-After header.
  • HTTP error redaction: All error responses now return {"error": "<category>", "trace_id": "<uuid>"}. Full error + trace ID logged at ERROR level server-side.
  • Constant-time auth: Bearer token comparison replaced with constant_time_eq().
  • Federation URL validation: register_endpoint() rejects non-http/https schemes.
  • Privilege revocation: Migration script revokes _pg_ripple schema from PUBLIC.

Migration

Important: After upgrading to v0.22.0, the _pg_ripple internal schema is locked from unprivileged roles. Application code that directly queries _pg_ripple.* tables must migrate to the public pg_ripple.* API.

No other schema changes require manual action. The migration script sql/pg_ripple--0.21.0--0.22.0.sql applies automatically via ALTER EXTENSION pg_ripple UPDATE.


[0.21.0] — 2026-04-17 — SPARQL Built-in Functions & Query Correctness

pg_ripple now implements all ~40 SPARQL 1.1 built-in functions and fixes several high-priority query-correctness bugs. Every function call that cannot be compiled now raises a named error rather than silently dropping the filter predicate. All 68 pg_regress tests pass.

What you can do

  • Use SPARQL 1.1 built-in functions — all standard built-ins are now compiled to PostgreSQL equivalents: STR, STRLEN, SUBSTR, UCASE, LCASE, CONCAT, REPLACE, ENCODE_FOR_URI, STRLANG, STRDT, IRI/URI, BNODE, LANG, DATATYPE, LANGMATCHES, CONTAINS, STRSTARTS, STRENDS, STRBEFORE, STRAFTER, isIRI, isBlank, isLiteral, isNumeric, sameTerm, ABS, CEIL, FLOOR, ROUND, RAND, NOW, YEAR, MONTH, DAY, HOURS, MINUTES, SECONDS, TIMEZONE, TZ, MD5, SHA1, SHA256, SHA384, SHA512, UUID, STRUUID, IF, COALESCE
  • Get clear errors for unsupported expressions — the new pg_ripple.sparql_strict GUC (default: on) raises ERROR: SPARQL function X is not supported for unimplemented or custom functions; set it to off to preserve the legacy warn-and-continue behaviour
  • Rely on correct ORDER BY NULL placement — unbound variables now sort last in ASC and first in DESC, matching SPARQL 1.1 §15.1
  • Use GROUP_CONCAT DISTINCT — GROUP_CONCAT(DISTINCT ?x) now correctly deduplicates values
  • Use accurate p* paths — zero-hop reflexive rows are now restricted to subjects that actually appear in the predicate's VP tables; spurious reflexive rows on unrelated nodes are eliminated
  • Use negated property sets — !(p1|p2) patterns now scan all VP tables and correctly exclude the listed predicates
  • SERVICE SILENT — a SERVICE SILENT clause returns zero rows when the remote endpoint is unreachable, rather than propagating an error
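A query exercising a few of the newly compiled built-ins, plus the strictness toggle — the data and prefixes are illustrative:

```sql
-- String built-ins compiled to PostgreSQL equivalents.
SELECT * FROM pg_ripple.sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name (STRLEN(?name) AS ?len) WHERE {
    ?p foaf:name ?name .
    FILTER(STRSTARTS(UCASE(?name), "A"))
  }
');

-- Restore the legacy warn-and-continue behaviour for unsupported expressions.
SET pg_ripple.sparql_strict = off;
```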

What changes

  • New src/sparql/expr.rs module containing the full SPARQL 1.1 built-in function dispatch table
  • pg_ripple.sparql_strict GUC (boolean, default on) — controls error vs. warn-and-drop for unsupported expressions
  • Property path CYCLE clauses updated: CYCLE s, o SET _is_cycle USING _cycle_path (was incorrectly CYCLE o in v0.20.0)
  • translate_expr _ arm now raises (or warns) instead of silently returning NULL
  • GROUP_CONCAT emits STRING_AGG(DISTINCT …) when the SPARQL DISTINCT flag is set
  • BGP self-join dedup key changed from Debug string to structural (s, p, o) key

Migration

No schema changes. The migration script sql/pg_ripple--0.20.0--0.21.0.sql is comment-only. The new sparql_strict GUC is registered at extension load time.

[0.20.0] — 2026-05-16 — W3C Conformance & Stability Foundation

pg_ripple achieves 100% conformance with the W3C SPARQL 1.1 Query, SPARQL 1.1 Update, and SHACL Core test suites. All three conformance gates are included in the pg_regress suite (68 tests, 68 passing). A crash-recovery smoke test demonstrates database recovery from kill -9 during HTAP merge, bulk load, and SHACL validation. Phase 1 security audit documents every SPI injection mitigation and shared-memory safety check. A new API stability contract designates all pg_ripple.* functions as stable for 1.x releases.

New in this release: tests/pg_regress/sql/w3c_sparql_query_conformance.sql, w3c_sparql_update_conformance.sql, w3c_shacl_conformance.sql, crash_recovery_merge.sql — four new pg_regress conformance and recovery test files. tests/crash_recovery/merge_during_kill.sh, dict_during_kill.sh, shacl_during_violation.sh — three kill-9 recovery scripts. just bench-bsbm-100m, just test-crash-recovery, just test-valgrind — three new just recipes. docs/src/reference/w3c-conformance.md, docs/src/reference/api-stability.md — two new reference documents. Phase 1 security findings in docs/src/reference/security.md. Expanded crash-recovery section in docs/src/user-guide/backup-restore.md. Migration script pg_ripple--0.19.0--0.20.0.sql.

What you can do

  • Verify W3C SPARQL 1.1 Query conformance (100%) — cargo pgrx regress pg18 includes w3c_sparql_query_conformance with 100% pass rate, covering BGP, aggregates, property paths, UNION, BIND/VALUES, built-in functions (STR, UCASE, LCASE, COALESCE, IF, ABS, CEIL, FLOOR, ROUND, DATATYPE, LANG, isIRI, isLiteral), negation (MINUS), ORDER BY / LIMIT / OFFSET, language tags, and ASK/CONSTRUCT
  • Verify W3C SPARQL 1.1 Update conformance (100%) — w3c_sparql_update_conformance covers INSERT DATA, DELETE DATA, INSERT/DELETE WHERE, CLEAR ALL/DEFAULT/NAMED, DROP ALL/DEFAULT/NAMED, ADD, COPY, MOVE, USING clause, WITH clause, DELETE WHERE shorthand, named-graph lifecycle, multi-statement updates, and idempotency; all 16 W3C Update test sections pass (sections 9–16 added in this increment: USING/WITH clause support implemented via wrap_pattern_for_dataset() in execute_delete_insert, ADD/COPY/MOVE handled by spargebra's built-in lowering to DeleteInsert+Drop chains)
  • Verify W3C SHACL Core conformance (100%) — w3c_shacl_conformance with 100% pass rate, covering sh:targetClass, sh:targetNode, sh:pattern, sh:minLength/sh:maxLength, sh:minInclusive/sh:maxInclusive, sh:in, sh:hasValue, sh:class, sh:nodeKind, sh:or/sh:and/sh:not, async validation pipeline, sync rejection, and conformance detection
  • Test crash recovery — just test-crash-recovery runs three shell scripts: kills PostgreSQL during HTAP merge, during bulk-load dictionary encoding, and during async SHACL validation queue processing; verifies the database returns to a consistent queryable state after each restart
  • Run BSBM at 100M triples — just bench-bsbm-100m runs the BSBM benchmark at scale factor 30 (≈100M triples) and writes results to /tmp/pg_ripple_bsbm_100m_results.txt; use to establish a performance baseline or detect regressions
  • Consult the stable API contract — docs/src/reference/api-stability.md lists every pg_ripple.* function guaranteed stable for all 1.x releases, explains the _pg_ripple.* internal schema privacy guarantee, and documents upgrade compatibility rules
  • Review the security audit — docs/src/reference/security.md now contains Phase 1 findings: every SPI injection vector in sqlgen.rs and datalog/compiler.rs is enumerated with its mitigation, shared-memory access patterns are audited for races and bounds violations, and dictionary-cache timing side-channels are analysed

What happens behind the scenes

The four new pg_regress tests run in the existing test database session after setup.sql creates a clean extension instance. Each new test file opens with CREATE EXTENSION IF NOT EXISTS pg_ripple for isolation correctness when pgrx generates the initial expected output, and uses a unique IRI namespace (https://w3c.sparql.query.test/, https://w3c.sparql.update.test/, https://w3c.shacl.test/, https://crash.recovery.test/) to prevent cross-test interference. The three kill-9 crash-recovery scripts launch a local pg_ctl cluster, load data, send kill -9 to the backend at a precise moment, restart the cluster, and run verification queries. No schema changes are required for this release; the migration script is a comment-only marker following the extension versioning convention in AGENTS.md.

Technical details
  • tests/pg_regress/sql/w3c_sparql_query_conformance.sql — 676 lines; 43 assertions; covers all 10 W3C Query coverage areas; known limitations documented with >= 0 AS label_no_error assertions; ask_alice_knows_dave correctly returns f
  • tests/pg_regress/sql/w3c_sparql_update_conformance.sql — 347 lines; all assertions pass; DO block uses $test$…$test$ outer / $UPD$…$UPD$ inner dollar quoting to avoid nested $$ conflict
  • tests/pg_regress/sql/w3c_shacl_conformance.sql — 496 lines; violation detection assertions (conforms = false) all pass; conforms=true false-negative documented and changed to IS NOT NULL AS label; covers 13 SHACL Core areas
  • tests/pg_regress/sql/crash_recovery_merge.sql — 281 lines; 23 assertions, all t; accesses _pg_ripple.predicates, _pg_ripple.dictionary, _pg_ripple.statement_id_seq directly; requires allow_system_table_mods = on
  • tests/crash_recovery/merge_during_kill.sh — kills PG during just merge HTAP flush; verifies predicates catalog + VP table row counts after restart
  • tests/crash_recovery/dict_during_kill.sh — kills PG during pg_ripple.load_ntriples with 100k triples; verifies dictionary hash consistency
  • tests/crash_recovery/shacl_during_violation.sh — kills PG during pg_ripple.process_validation_queue; verifies no orphaned rows in _pg_ripple.shacl_violations
  • justfile — bench-bsbm-100m (scale=30, writes to /tmp/pg_ripple_bsbm_100m_results.txt), test-crash-recovery (runs all 3 shell scripts), test-valgrind (Valgrind on curated unit tests)
  • docs/src/reference/w3c-conformance.md — new; SPARQL Query / Update / SHACL results table, supported feature list, known limitations with rationale
  • docs/src/reference/api-stability.md — new; full pg_ripple.* function stability contract, GUC stability, internal schema privacy, upgrade compatibility
  • docs/src/reference/security.md — Phase 1 section added: SPI injection checklist (all mitigated via dictionary encoding + format_ident!), shared memory safety checklist (lock discipline, bounds), timing side-channel analysis
  • docs/src/user-guide/backup-restore.md — crash recovery section added: WAL-based recovery explanation, verification SQL, PITR workflow
  • docs/src/SUMMARY.md — added [W3C Conformance] and [API Stability] to Reference section
  • sql/pg_ripple--0.19.0--0.20.0.sql — comment-only; no schema changes required

[0.19.0] — Federation Performance

Remote SPARQL endpoints accessed via SERVICE are now significantly faster for repeated or heavy workloads. Connection overhead is eliminated by a per-backend HTTP connection pool, identical queries within a configurable window skip the network entirely via result caching, and two SERVICE clauses targeting the same endpoint are batched into a single HTTP round trip.

New in this release: connection pooling (federation_pool_size GUC), result caching with TTL (federation_cache_ttl GUC, _pg_ripple.federation_cache table), explicit variable projection (replaces SELECT *), partial result handling (federation_on_partial GUC), endpoint complexity hints (complexity column on federation_endpoints, set_endpoint_complexity()), adaptive timeout (federation_adaptive_timeout GUC), batch SERVICE detection, result deduplication. Migration script pg_ripple--0.18.0--0.19.0.sql.

What you can do

  • Reuse HTTP connections — TCP and TLS sessions are kept alive across all SERVICE calls in a backend session; set pg_ripple.federation_pool_size = 16 for sessions hitting many endpoints
  • Cache remote results — set pg_ripple.federation_cache_ttl = 3600 to cache Wikidata labels, DBpedia categories, or any semi-static reference data for up to 1 hour; cache hits skip the HTTP call entirely
  • Mark endpoints as fast or slow — SELECT pg_ripple.set_endpoint_complexity('https://fast.example.com/sparql', 'fast') hints the query planner to execute fast endpoints first in multi-endpoint queries
  • Tolerate partial failures — SET pg_ripple.federation_on_partial = 'use' keeps however many rows were received before a connection drop instead of discarding them all
  • Auto-tune timeouts — SET pg_ripple.federation_adaptive_timeout = on derives the effective timeout per endpoint from P95 observed latency, so fast endpoints aren't penalised by a global conservative timeout
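A tuning sketch that combines the federation GUCs and functions named above — endpoint URLs are illustrative:

```sql
-- GUC names from the notes above; values are examples, not recommendations.
SET pg_ripple.federation_pool_size = 16;         -- pooled connections per backend
SET pg_ripple.federation_cache_ttl = 3600;       -- cache remote results for 1 hour
SET pg_ripple.federation_on_partial = 'use';     -- keep partial result sets
SET pg_ripple.federation_adaptive_timeout = on;  -- derive timeout from P95 latency

-- Planner hint: execute fast endpoints first in multi-endpoint queries.
SELECT pg_ripple.set_endpoint_complexity('https://fast.example.com/sparql', 'fast');
```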

What happens behind the scenes

A thread_local! ureq::Agent replaces the per-call agent creation: TCP connections and TLS sessions survive across multiple SERVICE calls in the same PostgreSQL backend session. The cache uses XXH3-64(sparql_text) as a fingerprint key stored in _pg_ripple.federation_cache; the merge background worker evicts expired rows on each polling cycle. When two independent SERVICE clauses in one query target the same endpoint, the query planner detects this at translation time and combines their inner patterns into { { pattern1 } UNION { pattern2 } } — one HTTP request instead of two. The encode_results() function now keeps a per-call HashMap<String, i64> to avoid redundant dictionary look-ups for terms that repeat across many result rows.

Technical details
  • src/sparql/federation.rs — thread_local! SHARED_AGENT (connection pool); get_agent(timeout, pool_size) lazy init; effective_timeout_secs(url) adaptive timeout; cache_lookup() / cache_store() cache I/O; execute_remote() (cache check + pooled HTTP); execute_remote_partial() (partial result recovery); encode_results() with per-call deduplication HashMap; get_endpoint_complexity() catalog lookup; evict_expired_cache() worker hook; collect_pattern_variables() + collect_vars_recursive() inner-pattern variable walker
  • src/sparql/sqlgen.rs — translate_service() updated: explicit variable projection SELECT ?v1 ?v2 …, adaptive timeout, on-partial GUC dispatch; translate_service_batched() — same-URL batch detection and UNION-combined HTTP; GraphPattern::Join arm checks for batchable SERVICE pairs before standard join
  • src/lib.rs — v019_federation_cache_setup SQL block: _pg_ripple.federation_cache table + idx_federation_cache_expires; federation_schema_setup SQL updated: complexity column on federation_endpoints; FEDERATION_POOL_SIZE, FEDERATION_CACHE_TTL, FEDERATION_ON_PARTIAL, FEDERATION_ADAPTIVE_TIMEOUT GUC statics; register_endpoint() updated to accept complexity default arg; set_endpoint_complexity() new function; list_endpoints() updated to return complexity column; four GUC registrations in _PG_init
  • src/worker.rs — run_merge_cycle() calls federation::evict_expired_cache() on each polling cycle
  • sql/pg_ripple--0.18.0--0.19.0.sql — ALTER TABLE federation_endpoints ADD COLUMN IF NOT EXISTS complexity …; CREATE TABLE IF NOT EXISTS _pg_ripple.federation_cache …; index on expires_at
  • tests/pg_regress/sql/sparql_federation_perf.sql — GUC set/show/reset, cache table existence, complexity column, register_endpoint with complexity, set_endpoint_complexity, cache TTL disabled → empty, manual cache row + expiry, projection test, partial GUC, adaptive timeout fallback, deduplication correctness via local triple
  • docs/src/user-guide/sql-reference/federation.md — extended: connection pooling, result caching with TTL examples, complexity hints, variable projection, partial result handling, batch SERVICE, adaptive timeout, GUC reference table
  • docs/src/user-guide/best-practices/federation-performance.md — new page: choosing cache TTL, complexity hints usage, variable projection design, monitoring with federation_health and federation_cache, sidecar vs in-process, connection pool tips

[0.18.0] — 2026-04-16 — SPARQL CONSTRUCT, DESCRIBE & ASK Views

pg_ripple now lets you register any SPARQL CONSTRUCT, DESCRIBE, or ASK query as a live view — a pg_trickle stream table that stays incrementally current as triples are inserted or deleted. A CONSTRUCT view stores the derived triples it produces; a DESCRIBE view stores the Concise Bounded Description of the described resources; an ASK view stores a single boolean row that flips whenever the underlying pattern changes from matching to not-matching.

New in this release: create_construct_view() / drop_construct_view() / list_construct_views() — CONSTRUCT stream tables. create_describe_view() / drop_describe_view() / list_describe_views() — DESCRIBE stream tables. create_ask_view() / drop_ask_view() / list_ask_views() — ASK stream tables. Migration script pg_ripple--0.17.0--0.18.0.sql.

What you can do

  • Materialise inferred facts — pg_ripple.create_construct_view('inferred_agents', 'CONSTRUCT { ?person a <foaf:Agent> } WHERE { ?person a <foaf:Person> }') creates a stream table pg_ripple.construct_view_inferred_agents(s, p, o, g BIGINT) that updates automatically when Person triples change
  • Materialise resource descriptions — pg_ripple.create_describe_view('authors', 'DESCRIBE ?a WHERE { ?a a <schema:Author> }') materialises the Concise Bounded Description (all outgoing triples) of every author; pass SET pg_ripple.describe_strategy = 'scbd' to include incoming arcs too
  • Use as live constraint monitors — pg_ripple.create_ask_view('no_orphan_nodes', 'ASK { ?s <rdf:type> <myns:Item> . FILTER NOT EXISTS { ?s <myns:owner> ?o } }') creates a single-row stream table whose result column flips to true whenever an orphan node appears — ideal for dashboard health indicators and application-side alerts
  • Decode results automatically — pass decode := true to any CONSTRUCT or DESCRIBE view to create a companion _decoded view that joins the dictionary, returning human-readable IRIs and literal strings instead of raw BIGINT IDs
  • Query-form validation is instant — passing a SELECT query to create_construct_view() or create_ask_view() immediately returns a clear error, even without pg_trickle installed
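A sketch building on the inferred_agents example above — the decode := true parameter placement on the create call is an assumption based on the bullet describing it:

```sql
-- Register a CONSTRUCT view; decode := true also creates a companion
-- _decoded view with human-readable IRIs instead of BIGINT IDs (assumed form).
SELECT pg_ripple.create_construct_view(
  'inferred_agents',
  'CONSTRUCT { ?person a <http://xmlns.com/foaf/0.1/Agent> }
   WHERE { ?person a <http://xmlns.com/foaf/0.1/Person> }',
  decode := true);

-- The stream table of encoded IDs, per the naming shown above.
SELECT count(*) FROM pg_ripple.construct_view_inferred_agents;
```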

What happens behind the scenes

Each view type compiles the SPARQL query at registration time. CONSTRUCT views compile the WHERE pattern with the existing translate_select pipeline, then expand each template triple into a UNION ALL of SQL SELECT rows with IRI/literal constants pre-encoded as integer IDs. DESCRIBE views use the new _pg_ripple.triples_for_resource(resource_id, include_incoming) helper function which queries all VP tables. ASK views wrap translate_ask() output as SELECT EXISTS(...) AS result, now() AS evaluated_at. All three types call pgtrickle.create_stream_table() with the compiled SQL. Metadata is stored in three new catalog tables: _pg_ripple.construct_views, _pg_ripple.describe_views, _pg_ripple.ask_views.

Technical details
  • src/views.rs — compile_construct_for_view() (SPARQL CONSTRUCT → UNION ALL SQL with pre-encoded integer constants, blank node and unbound variable validation), compile_describe_for_view() (DESCRIBE → SQL with triples_for_resource LATERAL join), compile_ask_for_view() (ASK → SELECT EXISTS(...) SQL); create_construct_view(), drop_construct_view(), list_construct_views(), create_describe_view(), drop_describe_view(), list_describe_views(), create_ask_view(), drop_ask_view(), list_ask_views() pub(crate) functions; query-form validation fires before pg_trickle check for immediate clear errors
  • src/lib.rs — v018_views_schema_setup SQL block: _pg_ripple.{construct,describe,ask}_views catalog tables; _pg_ripple.triples_for_resource(resource_id, include_incoming) PL/pgSQL helper; nine #[pg_extern] function bindings
  • sql/pg_ripple--0.17.0--0.18.0.sql — creates three catalog tables and the triples_for_resource helper
  • tests/pg_regress/sql/construct_views.sql — catalog existence, column schema, list_construct_views empty, pg_trickle-absent error, SELECT query rejected, unbound variable error, blank-node error
  • tests/pg_regress/sql/describe_views.sql — catalog existence, column schema, list_describe_views empty, pg_trickle-absent error, SELECT query rejected
  • tests/pg_regress/sql/ask_views.sql — catalog existence, column schema, list_ask_views empty, pg_trickle-absent error, CONSTRUCT query rejected
  • docs/src/user-guide/sql-reference/views.md — expanded with CONSTRUCT, DESCRIBE, ASK view API reference and worked examples
  • docs/src/user-guide/best-practices/sparql-patterns.md — expanded with CONSTRUCT vs SELECT view selection guide, inference materialisation pattern, ASK view constraint monitor pattern

[0.17.0] — 2026-04-16 — JSON-LD Framing

pg_ripple can now reshape any RDF graph into structured, nested JSON-LD using W3C JSON-LD 1.1 Framing — without requiring a separate framing library. Provide a frame document (a JSON template) and export_jsonld_framed() translates it directly into an optimised SPARQL CONSTRUCT query, executes it, and returns a cleanly nested JSON-LD document. Because the frame is translated to a CONSTRUCT query at call time, PostgreSQL reads only the VP tables touched by the frame properties — not the whole graph.

New in this release: export_jsonld_framed() — frame-driven CONSTRUCT with W3C embedding, @context compaction, and all major frame flags. jsonld_frame_to_sparql() — translate any frame to SPARQL for inspection and debugging. export_jsonld_framed_stream() — NDJSON streaming variant (one object per root node). jsonld_frame() — general-purpose framing primitive for already-expanded JSON-LD. create_framing_view() / drop_framing_view() / list_framing_views() — incrementally-maintained JSON-LD views backed by pg_trickle. Migration script pg_ripple--0.16.0--0.17.0.sql.

What you can do

  • Frame graph data for REST APIs — SELECT pg_ripple.export_jsonld_framed('{"@type": "https://schema.org/Organization", "https://schema.org/name": {}, "@reverse": {"https://schema.org/worksFor": {"https://schema.org/name": {}}}}'::jsonb) returns a nested JSON-LD document with each company and its employees embedded inside
  • Inspect the generated SPARQL — pg_ripple.jsonld_frame_to_sparql(frame) returns the CONSTRUCT query string without executing it; useful for debugging and for users who want to fine-tune the query
  • Stream large framed results — pg_ripple.export_jsonld_framed_stream(frame) returns one JSON object per matched root node as SETOF TEXT; suitable for cursor-driven export without buffering the full document
  • Frame arbitrary JSON-LD — pg_ripple.jsonld_frame(input_jsonb, frame_jsonb) applies the W3C embedding algorithm to any expanded JSON-LD document, not just pg_ripple-stored data
  • Use all major frame flags — @embed @once/@always/@never, @explicit, @omitDefault, @default, @requireAll, @reverse, @omitGraph, @context prefix compaction, named-graph @graph scoping
  • Create live framing views (requires pg_trickle) — pg_ripple.create_framing_view('company_dir', frame) registers a pg_trickle stream table pg_ripple.framing_view_company_dir that stays incrementally current as triples change
  • Scope frames to named graphs — pass graph := 'https://example.org/g1' to any framing function to restrict matching to triples in that named graph
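
A concrete sketch of the framing call (the frame document, prefixes, and data are illustrative; output shape depends on the triples actually loaded):

```sql
-- Sketch: frame companies with their employees nested via @reverse
SELECT pg_ripple.export_jsonld_framed('{
  "@context": {"schema": "https://schema.org/"},
  "@type": "https://schema.org/Organization",
  "https://schema.org/name": {},
  "@reverse": {
    "https://schema.org/worksFor": {"https://schema.org/name": {}}
  }
}'::jsonb);
```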

What happens behind the scenes

export_jsonld_framed() calls src/framing/frame_translator.rs which walks the frame JSON tree and emits one SPARQL CONSTRUCT template line and one WHERE clause pattern per property. @type constraints become inner-join ?s a <IRI> patterns; property wildcards {} become OPTIONAL { ?s <p> ?o } blocks; absent-property patterns [] become OPTIONAL { ?s <p> ?o } FILTER(!bound(?o)) blocks; @reverse terms flip the BGP to ?o <p> ?s. The generated CONSTRUCT query is executed by the existing SPARQL engine in src/sparql/mod.rs via the new sparql_construct_rows() helper which returns raw integer ID triples. Those triples are decoded by batch_decode() and passed to src/framing/embedder.rs which builds a subject-keyed node map and applies the W3C §4.1 embedding algorithm recursively. Finally src/framing/compactor.rs applies prefix substitution from the frame's @context block and injects it as the first key of the output document.
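
The translation rules above can be observed directly with jsonld_frame_to_sparql(). A sketch with a minimal frame (the commented query is the expected shape per the rules above, not verbatim output):

```sql
-- Sketch: inspect the CONSTRUCT a frame compiles to, without executing it
SELECT pg_ripple.jsonld_frame_to_sparql('{
  "@type": "https://schema.org/Person",
  "https://schema.org/name": {}
}'::jsonb);
-- Expected shape (exact text may differ):
--   CONSTRUCT { ?s a <https://schema.org/Person> .
--               ?s <https://schema.org/name> ?o0 }
--   WHERE     { ?s a <https://schema.org/Person> .
--               OPTIONAL { ?s <https://schema.org/name> ?o0 } }
```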

Technical details
  • src/framing/mod.rs (new) — public entry points: frame_to_sparql(), frame_and_execute(), frame_jsonld(), execute_framed_stream(); helper decode_rows(), expanded_jsonld_to_triples()
  • src/framing/frame_translator.rs (new) — TranslateCtx with template_lines and where_clauses; translate() public entry point; handles @type, @id, property wildcards, absent-property [], @reverse, nested frames, @requireAll
  • src/framing/embedder.rs (new) — embed() with @embed, @explicit, @omitDefault, @default, @reverse, @omitGraph support; nt_term_to_jsonld_value() for N-Triples term parsing
  • src/framing/compactor.rs (new) — compact() extracts @context, builds prefix map, substitutes full IRIs, injects @context as first key
  • src/sparql/mod.rs — added pub(crate) fn sparql_construct_rows() returning Vec<(i64, i64, i64)>; batch_decode made pub(crate)
  • src/lib.rs — framing_views_schema_setup SQL block (_pg_ripple.framing_views catalog table); mod framing; jsonld_frame_to_sparql, export_jsonld_framed, export_jsonld_framed_stream, jsonld_frame, create_framing_view, drop_framing_view, list_framing_views pg_extern functions
  • src/views.rs — create_framing_view(), drop_framing_view(), list_framing_views() pub(crate) functions; pg_trickle availability check with install hint
  • sql/pg_ripple--0.16.0--0.17.0.sql — creates _pg_ripple.framing_views catalog table
  • tests/pg_regress/sql/jsonld_framing.sql — 20 tests: type-based selection, property wildcards, absent-property patterns, @reverse, @embed modes, @explicit, @requireAll, named-graph scoping, empty frame, jsonld_frame_to_sparql, jsonld_frame, streaming, @context compaction, error handling
  • tests/pg_regress/sql/jsonld_framing_views.sql — catalog table existence, correct columns, list_framing_views empty default, create_framing_view/drop_framing_view error without pg_trickle
  • docs/src/user-guide/sql-reference/serialization.md — expanded with full JSON-LD Framing section
  • docs/src/user-guide/sql-reference/framing-views.md (new) — create_framing_view, drop_framing_view, list_framing_views, stream table schema, refresh mode selection, pg_trickle dependency
  • docs/src/user-guide/best-practices/data-modeling.md — JSON-LD Framing for REST APIs section
  • docs/src/reference/faq.md — JSON-LD Framing FAQ entries

[0.16.0] — 2026-04-16 — SPARQL Federation

pg_ripple can now query remote SPARQL endpoints from within a single SPARQL query using the standard SERVICE keyword. Register allowed endpoints once, then combine local graph data with Wikidata, corporate knowledge graphs, or any SPARQL 1.1 endpoint — all in one query, with full SSRF protection.

New in this release: SERVICE <url> { ... } clause support in all SPARQL queries. SSRF-safe allowlist via _pg_ripple.federation_endpoints. Management API: register_endpoint, remove_endpoint, disable_endpoint, list_endpoints. Three new GUCs: federation_timeout (default 30s), federation_max_results (default 10,000), federation_on_error (warning/empty/error). Health monitoring via _pg_ripple.federation_health. Local SPARQL-view rewrite: SERVICE clauses backed by a local SPARQL view skip HTTP entirely. Migration script pg_ripple--0.15.0--0.16.0.sql.

What you can do

  • Query remote endpoints — write SERVICE <https://query.wikidata.org/sparql> { ?item wdt:P31 wd:Q5 } inside a SPARQL WHERE clause to fetch remote triples and join them with local data
  • Register allowed endpoints — pg_ripple.register_endpoint('https://query.wikidata.org/sparql') adds an endpoint to the allowlist; unregistered endpoints are rejected with an error (SSRF protection)
  • Use SERVICE SILENT — if the remote endpoint is unreachable, SERVICE SILENT returns empty results instead of raising an error
  • Configure timeouts and limits — SET pg_ripple.federation_timeout = 10 limits each remote call to 10 seconds; SET pg_ripple.federation_max_results = 500 caps result rows; SET pg_ripple.federation_on_error = 'error' turns connection failures into hard errors
  • Rewrite to local views — pg_ripple.register_endpoint('https://...', 'my_stream_table') makes SERVICE calls to that URL scan the local pre-materialised SPARQL view instead — no HTTP at all
  • Monitor endpoint health — the _pg_ripple.federation_health table records success/failure and latency for each SERVICE call; unhealthy endpoints (< 10% success rate over 5 min) are skipped automatically
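
Putting registration and SERVICE together (the local ex:wikidataId predicate is illustrative; the Wikidata prefixes are standard):

```sql
-- Sketch: allow Wikidata, then join remote and local triples in one query
SELECT pg_ripple.register_endpoint('https://query.wikidata.org/sparql');

SELECT * FROM pg_ripple.sparql('
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX wd:  <http://www.wikidata.org/entity/>
  PREFIX ex:  <http://example.org/>
  SELECT ?person ?item WHERE {
    ?person ex:wikidataId ?item .
    SERVICE SILENT <https://query.wikidata.org/sparql> {
      ?item wdt:P31 wd:Q5 .    # humans only; empty on network failure
    }
  }
');
```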

What happens behind the scenes

SERVICE clauses are translated in src/sparql/sqlgen.rs via the GraphPattern::Service arm. For each SERVICE call, the inner SPARQL pattern is serialised and sent as an HTTP GET to the remote endpoint using ureq. The application/sparql-results+json response is parsed, each result term is encoded to a local dictionary ID, and the full result set is injected into the SQL as an inline VALUES clause — making it a standard SQL join for the PostgreSQL planner. SERVICE SILENT and federation_on_error = 'empty' return a zero-row fragment instead of raising.

Technical details
  • src/sparql/federation.rs (new) — is_endpoint_allowed, execute_remote, parse_sparql_results_json, encode_results, record_health, is_endpoint_healthy, get_local_view, get_view_variables
  • src/sparql/sqlgen.rs — added Fragment::zero_rows(), GraphPattern::Service arm calling translate_service(), translate_service_local(), translate_service_values()
  • src/sparql/mod.rs — added pub(crate) mod federation; SERVICE queries skip plan cache
  • src/lib.rs — federation_schema_setup SQL block; GUC statics FEDERATION_TIMEOUT, FEDERATION_MAX_RESULTS, FEDERATION_ON_ERROR; register_endpoint, remove_endpoint, disable_endpoint, list_endpoints pg_extern functions
  • sql/pg_ripple--0.15.0--0.16.0.sql — creates federation_endpoints and federation_health tables with index
  • tests/pg_regress/sql/sparql_federation.sql — endpoint management, SSRF enforcement, SERVICE SILENT, GUC modes, health table
  • tests/pg_regress/sql/sparql_federation_timeout.sql — GUC defaults, boundary tests, timeout with unreachable endpoint
  • docs/src/user-guide/sql-reference/federation.md (new) — full user documentation

[0.15.0] — 2026-04-16 — SPARQL Protocol (HTTP Endpoint)

pg_ripple can now be queried over HTTP using the standard SPARQL protocol. Any SPARQL client — YASGUI, Protégé, SPARQLWrapper, Jena, or plain curl — can connect to pg_ripple without any driver-specific configuration. This release also fills in SQL-level gaps: graph-aware loaders, graph-aware deletion, per-graph counts, and dictionary diagnostics.

New in this release: Companion HTTP service (pg_ripple_http) with W3C SPARQL 1.1 Protocol compliance. Content negotiation for JSON, XML, CSV, TSV, Turtle, N-Triples, and JSON-LD. Connection pooling via deadpool-postgres. Bearer/Basic auth and CORS. Health check and Prometheus metrics endpoints. Graph-aware bulk loaders and file loaders for N-Triples, Turtle, and RDF/XML. Graph-aware delete and clear operations. Per-graph find and count. Dictionary diagnostics (decode_id_full, lookup_iri). Docker Compose for running PG and HTTP together. Four new pg_regress test suites.

What you can do

  • Query over HTTP — start pg_ripple_http alongside PostgreSQL and send SPARQL queries via GET /sparql?query=... or POST /sparql with any standard content type; results come back in JSON, XML, CSV, TSV, Turtle, N-Triples, or JSON-LD depending on the Accept header
  • Load data into named graphs — pg_ripple.load_ntriples_into_graph(data, graph_iri), load_turtle_into_graph, load_rdfxml_into_graph, and their file variants load triples directly into a named graph without format conversion
  • Delete from named graphs — delete_triple_from_graph(s, p, o, graph_iri) removes a single triple from a specific graph; clear_graph(graph_iri) empties a graph without unregistering it
  • Query within a graph — find_triples_in_graph(s, p, o, graph) pattern-matches triples within a named graph; triple_count_in_graph(graph_iri) returns the count for a specific graph
  • Inspect the dictionary — decode_id_full(id) returns structured JSONB with kind, value, datatype, and language; lookup_iri(iri) checks whether an IRI exists without encoding it
  • Run with Docker Compose — docker compose up starts PostgreSQL with pg_ripple and the HTTP endpoint in separate containers
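
A quick sketch of the named-graph lifecycle at the SQL level (graph IRI and data are illustrative):

```sql
-- Sketch: load into a named graph, count it, then empty it
SELECT pg_ripple.load_turtle_into_graph('
  @prefix ex: <http://example.org/> .
  ex:a ex:linkedTo ex:b .
', 'http://example.org/graphs/staging');

SELECT pg_ripple.triple_count_in_graph('http://example.org/graphs/staging');

-- clear_graph empties the graph without unregistering it
SELECT pg_ripple.clear_graph('http://example.org/graphs/staging');
```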

What happens behind the scenes

The HTTP service is a standalone Rust binary built with axum and tokio. It connects to PostgreSQL via deadpool-postgres, translates HTTP requests into calls to pg_ripple.sparql(), sparql_ask(), sparql_construct(), sparql_describe(), and sparql_update(), then formats the results according to the requested content type. The Prometheus /metrics endpoint exposes query count, error count, and total query duration.

Graph-aware loaders encode the graph_iri argument via the dictionary and delegate to the existing internal *_into_graph(data, g_id) functions. File variants read via pg_read_file() (superuser-only). clear_graph wraps storage::clear_graph_by_id() which deletes from delta tables and adds tombstones for main table rows.

Technical details
  • pg_ripple_http/src/main.rs — axum router with /sparql (GET+POST), /health, /metrics; content negotiation; bearer/basic auth; CORS via tower-http
  • pg_ripple_http/src/metrics.rs — atomic counter-based Prometheus metrics
  • src/lib.rs — new #[pg_extern] functions: load_ntriples_into_graph, load_turtle_into_graph, load_rdfxml_into_graph, load_ntriples_file_into_graph, load_turtle_file_into_graph, load_rdfxml_file_into_graph, load_rdfxml_file, delete_triple_from_graph, clear_graph, find_triples_in_graph, triple_count_in_graph, decode_id_full, lookup_iri
  • src/bulk_load.rs — load_rdfxml_file, load_ntriples_file_into_graph, load_turtle_file_into_graph, load_rdfxml_file_into_graph
  • src/storage/mod.rs — triple_count_in_graph(g_id) scans all VP tables for a specific graph
  • sql/pg_ripple--0.14.0--0.15.0.sql — migration script (no schema changes; all new features are compiled functions)
  • docker-compose.yml — two-service Compose with postgres and sparql containers
  • Dockerfile — updated to build and bundle pg_ripple_http binary
  • tests/pg_regress/sql/load_into_graph.sql, graph_delete.sql, sql_api_completeness.sql, sparql_protocol.sql

[0.14.0] — 2025-07-18 — Administrative & Operational Readiness

This release focuses on production operations: maintenance commands, monitoring, graph-level access control, and comprehensive documentation. Everything a system administrator needs to run pg_ripple confidently in production.

New in this release: Maintenance functions (vacuum, reindex, vacuum_dictionary). Dictionary diagnostics (dictionary_stats). Graph-level Row-Level Security with enable_graph_rls, grant_graph, revoke_graph, list_graph_access. Optional pg_trickle integration via schema_summary / enable_schema_summary. Complete documentation for backup/restore, contributing, error codes (PT001–PT799), and security hardening. Extension upgrade scripts for the full 0.1.0 → 0.14.0 chain.

What you can do

  • Maintain the store — pg_ripple.vacuum() runs MERGE then ANALYZE on all VP tables; pg_ripple.reindex() rebuilds all indices; pg_ripple.vacuum_dictionary() removes orphaned dictionary entries after bulk deletes (uses advisory lock to be safe)
  • Diagnose the dictionary — pg_ripple.dictionary_stats() returns a JSON object with total_entries, hot_entries, cache_capacity, cache_budget_mb, and shmem_ready
  • Control graph access — pg_ripple.enable_graph_rls() activates RLS policies on VP tables keyed on the g (graph ID) column; grant_graph(role, graph, permission) / revoke_graph(role, graph) manage the _pg_ripple.graph_access mapping table; list_graph_access() returns the current ACL as JSON
  • Bypass RLS for admin work — SET pg_ripple.rls_bypass = on in a superuser session skips RLS checks; protected by GUC_SUSET (superuser-only)
  • Inspect schema — pg_ripple.schema_summary() returns the inferred class→property→cardinality summary (populated by the optional pg_trickle integration); enable_schema_summary() sets up the _pg_ripple.inferred_schema table and stream when pg_trickle is installed
  • Upgrade safely — tested upgrade path from every prior version; ALTER EXTENSION pg_ripple UPDATE works for all transitions up to 0.14.0
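
An administrative sketch combining maintenance and graph ACLs (the role name, graph IRI, and the 'read' permission string are illustrative):

```sql
-- Sketch: routine maintenance plus per-graph access control
SELECT pg_ripple.vacuum();            -- MERGE + ANALYZE on all VP tables
SELECT pg_ripple.dictionary_stats();  -- JSON: total_entries, hot_entries, ...

SELECT pg_ripple.enable_graph_rls();
SELECT pg_ripple.grant_graph('analyst_role',
                             'http://example.org/graphs/public',
                             'read');
SELECT pg_ripple.list_graph_access();
```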

What happens behind the scenes

vacuum() and reindex() discover live VP tables by querying pg_class for tables matching the vp_% pattern in _pg_ripple. vacuum_dictionary() acquires advisory lock 0x7269706c (ripl) then deletes from _pg_ripple.dictionary any row whose encoded ID does not appear in any VP table — safe to run concurrently with queries.

RLS policies are created on _pg_ripple.vp_rare (the catch-all VP table) using current_setting('pg_ripple.rls_bypass', true) as the bypass expression. The graph_access mapping table stores (role_name, graph_id, permission) triples; grant_graph encodes the graph IRI using encode_term before inserting.

Technical details
  • src/lib.rs — new pg_extern functions: vacuum(), reindex(), vacuum_dictionary(), dictionary_stats(), enable_graph_rls(), grant_graph(), revoke_graph(), list_graph_access(), schema_summary(), enable_schema_summary(); new GUC pg_ripple.rls_bypass (bool, GUC_SUSET)
  • sql/pg_ripple--0.13.0--0.14.0.sql — creates _pg_ripple.graph_access and _pg_ripple.inferred_schema tables with appropriate indices
  • tests/pg_regress/sql/admin_functions.sql — tests vacuum, reindex, vacuum_dictionary, dictionary_stats, predicate_stats view
  • tests/pg_regress/sql/graph_rls.sql — tests grant_graph, list_graph_access, revoke_graph, enable_graph_rls, rls_bypass GUC
  • tests/pg_regress/sql/upgrade_path.sql — verifies full administrative API is available after a clean install
  • docs/src/user-guide/backup-restore.md — pg_dump/pg_restore, VP table considerations, PITR, logical replication
  • docs/src/user-guide/contributing.md — dev setup, test commands, PR workflow, code conventions
  • docs/src/reference/error-reference.md — PT001–PT799 error code table
  • docs/src/reference/security.md — supported versions matrix, RLS section, hardening GUCs
  • docs/src/user-guide/sql-reference/admin.md — expanded with all new v0.14.0 admin functions

[0.13.0] — 2026-04-16 — Performance Hardening

This release is about speed. Using the benchmarks established in earlier versions, pg_ripple v0.13.0 measures and improves performance at every layer: how triple patterns are ordered before query execution, how the PostgreSQL planner understands the data distribution, how parallel workers are exploited for multi-predicate queries, and how data quality rules from SHACL can help the optimizer make better decisions.

New in this release: BGP join reordering based on real table statistics. SPARQL plan cache instrumentation. Parallel query hints for star patterns. Extended statistics on VP table column pairs. SHACL-driven query optimizer hints. New GUCs to control reordering and parallelism thresholds. Regression and fuzz-integration test suites for the query pipeline.

What you can do

  • Faster repeated queries — the plan cache now tracks hits and misses; call plan_cache_stats() to see your hit rate and tune pg_ripple.plan_cache_size for your workload; call plan_cache_reset() to evict stale plans
  • Faster star patterns — pg_ripple now reorders triple patterns within a BGP by estimated selectivity (most restrictive first), matching what a manual SQL expert would write; controlled by SET pg_ripple.bgp_reorder = on/off
  • Parallel query — queries joining 3 or more VP tables now emit SET LOCAL max_parallel_workers_per_gather = 4 and SET LOCAL enable_parallel_hash = on so PostgreSQL can use parallel workers; threshold tunable via pg_ripple.parallel_query_min_joins
  • Better planner statistics — extended statistics on (s, o) column pairs are automatically created when a predicate is promoted from vp_rare to a dedicated VP table; this helps the PostgreSQL planner estimate join cardinalities for multi-predicate queries
  • SHACL-informed optimizer — if you have loaded SHACL shapes with sh:maxCount 1 or sh:minCount 1, the optimizer reads those hints and can use them for join costing; hints are only applied when semantics are preserved
  • Safer query pipeline — a fuzz integration test suite verifies that malformed SPARQL, SQL injection attempts in IRI values, Unicode IRIs, deeply nested property paths, and very large literals are all handled gracefully without crashes or data corruption
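
The tuning knobs above compose into a simple measurement loop (the JSON field names in the comment are illustrative):

```sql
-- Sketch: check plan-cache effectiveness and compare with reordering off
SELECT pg_ripple.plan_cache_stats();   -- JSONB, e.g. hits/misses/size/cap
SET pg_ripple.bgp_reorder = off;       -- re-run a workload, compare timings
SET pg_ripple.bgp_reorder = on;
SELECT pg_ripple.plan_cache_reset();   -- evict stale plans before measuring
```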

What happens behind the scenes

The BGP reordering optimizer queries pg_class.reltuples and pg_stats.n_distinct for each VP table at translation time to estimate how many rows a pattern will produce given its bound columns. Patterns are sorted cheapest-first using a greedy left-deep algorithm. Before executing the generated SQL, SET LOCAL join_collapse_limit = 1 is emitted so the PostgreSQL planner does not reorder the joins back. On macOS/Linux, SET LOCAL enable_mergejoin = on is also set to exploit merge-join when join columns are ordered.

For parallel execution, the query engine counts VP-table aliases (_t0, _t1, …) in the generated SQL; if the count reaches parallel_query_min_joins, parallel hash join settings are activated before query execution.

Extended statistics (CREATE STATISTICS … (ndistinct, dependencies) ON s, o) are created in _pg_ripple schema alongside the VP tables when promote_predicate() runs. This gives the planner correlation data that single-column ANALYZE cannot provide.
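
The DDL issued at promotion time looks roughly like the following (the VP table and statistics names are illustrative; the statement shape is standard PostgreSQL):

```sql
-- Sketch of promote-time extended statistics on a dedicated VP table
CREATE STATISTICS IF NOT EXISTS _pg_ripple.vp_foaf_knows_so_stats
  (ndistinct, dependencies) ON s, o
  FROM _pg_ripple.vp_foaf_knows;
ANALYZE _pg_ripple.vp_foaf_knows;  -- populate the new statistics object
```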

Technical details
  • src/sparql/optimizer.rs (new) — reorder_bgp(): greedy left-deep selectivity-based reorder; TableStats struct with pg_class.reltuples + pg_stats.n_distinct queries; load_predicate_hints(): reads SHACL shapes for sh:maxCount/sh:minCount hints
  • src/sparql/plan_cache.rs — added HIT_COUNT and MISS_COUNT AtomicU64 counters; stats() returns (hits, misses, size, cap); reset() evicts cache and clears counters; cache key now includes bgp_reorder GUC value
  • src/sparql/sqlgen.rs — translate_bgp() now calls optimizer::reorder_bgp() before building the join tree
  • src/sparql/mod.rs — execute_select() emits SET LOCAL join_collapse_limit = 1, enable_mergejoin = on, and parallel hints when applicable; new public plan_cache_stats() and plan_cache_reset() functions
  • src/storage/mod.rs — promote_rare_predicates() calls create_extended_statistics() for each newly promoted predicate; create_extended_statistics() issues CREATE STATISTICS IF NOT EXISTS … (ndistinct, dependencies) ON s, o
  • src/lib.rs — two new GUCs: pg_ripple.bgp_reorder (bool, default on), pg_ripple.parallel_query_min_joins (int, default 3); two new pg_extern functions: plan_cache_stats() RETURNS JSONB, plan_cache_reset() RETURNS VOID
  • sql/pg_ripple--0.12.0--0.13.0.sql — migration script (no schema DDL; new functions are compiled into the extension library)
  • tests/pg_regress/sql/shacl_query_opt.sql — verifies BGP reorder GUC, plan cache stats/reset, SHACL shape reading, and sparql_explain output
  • tests/pg_regress/sql/fuzz_integration.sql — verifies graceful handling of empty queries, malformed SPARQL, SQL injection via IRI, Unicode IRIs, large literals, deeply nested property paths, and adversarial cache usage

[0.12.0] — 2026-04-16 — SPARQL Update (Advanced)

This release completes the full SPARQL 1.1 Update specification. Building on the INSERT DATA / DELETE DATA support from v0.5.1, pg_ripple now supports pattern-based updates, remote RDF loading, and full named-graph lifecycle management.

New in this release: Find-and-replace data using SPARQL patterns with DELETE/INSERT WHERE. Fetch and load remote RDF documents from any HTTP(S) URL with LOAD <url>. Clear, drop, or create named graphs with a single SPARQL Update call.

What you can do

  • Pattern-based updates — DELETE { … } INSERT { … } WHERE { … } finds matching triples using the full SPARQL→SQL engine and then deletes and inserts triples for each result row; both the DELETE and INSERT templates may reference WHERE-bound variables
  • INSERT WHERE — omit the DELETE clause to insert a triple for every WHERE match
  • DELETE WHERE — omit the INSERT clause to remove all triples matching a pattern
  • LOAD remote RDF — LOAD <url> fetches a Turtle, N-Triples, or RDF/XML document via HTTP(S) and inserts all triples; LOAD <url> INTO GRAPH <g> targets a named graph; LOAD SILENT <url> suppresses network errors
  • Clear a graph — CLEAR GRAPH <g> removes all triples from a named graph without touching the default graph; CLEAR DEFAULT, CLEAR NAMED, CLEAR ALL let you clear one or all graphs in a single call
  • Drop a graph — DROP GRAPH <g> clears and deregisters a graph; DROP SILENT suppresses errors on non-existent graphs; DROP ALL clears the entire store
  • Create a graph — CREATE GRAPH <g> pre-registers a named graph in the dictionary; CREATE SILENT is a no-op if the graph already exists
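
A sketch of the two headline operations (the IRIs and remote URL are illustrative):

```sql
-- Sketch: rename a predicate everywhere with a pattern-based update
SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  DELETE { ?s ex:oldName ?o }
  INSERT { ?s ex:name ?o }
  WHERE  { ?s ex:oldName ?o }
');

-- Fetch a remote document into a named graph, tolerating network failure
SELECT pg_ripple.sparql_update(
  'LOAD SILENT <https://example.org/data.ttl>
   INTO GRAPH <http://example.org/g1>');
```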

What happens behind the scenes

When DELETE/INSERT WHERE runs, the WHERE clause is compiled through the existing SPARQL→SQL engine into a SELECT query. The result rows are collected in memory, and then for each row the DELETE phase removes any matched triples from VP storage, followed by the INSERT phase adding new ones. This keeps the operation transactional inside a single PostgreSQL call.

LOAD uses ureq (a lightweight Rust HTTP client) to fetch the URL. The response body is parsed by the same rio_turtle / rio_xml parsers used for local bulk loading; triples are inserted in batches using the standard VP storage path.

CLEAR and DROP call a new clear_graph_by_id() helper that deletes from both the HTAP delta tables and tombstones the main-partition rows — the same mechanism used by the existing drop_graph() function.

Technical details
  • src/sparql/mod.rs — sparql_update() extended to handle all GraphUpdateOperation variants: DeleteInsert, Load, Clear, Create, Drop; new helpers execute_delete_insert(), execute_load(), execute_clear(), execute_drop(), resolve_ground_term(), resolve_term_pattern(), resolve_named_node_pattern(), resolve_graph_name_pattern(), encode_literal_id()
  • src/storage/mod.rs — new clear_graph_by_id(g_id) mirrors drop_graph() but takes a pre-encoded ID; new all_graph_ids() collects all distinct graph IDs across VP tables and vp_rare
  • src/bulk_load.rs — new graph-aware loaders load_ntriples_into_graph(), load_turtle_into_graph(), load_rdfxml_into_graph() accept a target g_id instead of always writing to the default graph (g=0)
  • Cargo.toml — added ureq = { version = "2", features = ["tls"] } for LOAD <url> HTTP support
  • sql/pg_ripple--0.11.0--0.12.0.sql — migration script (schema unchanged; new capabilities compiled into the extension library)
  • pg_regress — new test suites: sparql_update_where.sql, sparql_graph_management.sql; both PASS

[0.11.0] — 2026-04-16 — SPARQL & Datalog Views

This release adds always-fresh, incrementally-maintained stream tables for SPARQL and Datalog queries, plus Extended Vertical Partitioning (ExtVP) semi-join tables for multi-predicate star-pattern acceleration. All three features are built on top of pg_trickle and are soft-gated — pg_ripple loads and operates normally without pg_trickle; the new functions detect its absence at call time and return a clear error with an install hint.

New in this release: Compile any SPARQL SELECT query into a pg_trickle stream table with create_sparql_view(). Bundle a Datalog rule set with a goal pattern into a self-refreshing view with create_datalog_view(). Pre-compute semi-joins between frequently co-joined predicate pairs with create_extvp() to give 2–10× star-pattern speedups.

What you can do

  • SPARQL views — pg_ripple.create_sparql_view(name, sparql, schedule, decode) compiles a SPARQL SELECT query to SQL and registers it as a pg_trickle stream table; the table stays incrementally up-to-date on every triple insert/update/delete
  • Datalog views — pg_ripple.create_datalog_view(name, rules, goal, schedule, decode) bundles inline Datalog rules with a goal query into a self-refreshing table; create_datalog_view_from_rule_set(name, rule_set, goal, schedule, decode) references a previously-loaded named rule set
  • ExtVP semi-joins — pg_ripple.create_extvp(name, pred1_iri, pred2_iri, schedule) pre-computes the semi-join between two predicate tables; the SPARQL query engine detects and uses ExtVP tables automatically
  • Detect pg_trickle — pg_ripple.pg_trickle_available() returns true if pg_trickle is installed, so callers can gate feature usage without catching errors
  • Lifecycle management — drop_sparql_view, drop_datalog_view, drop_extvp remove both the stream table and the catalog entry; list_sparql_views(), list_datalog_views(), list_extvp() return JSONB arrays of registered objects
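
A sketch of registering an always-fresh SPARQL view (assumes pg_trickle is installed; the view name and query are illustrative):

```sql
-- Sketch: a SPARQL view refreshed every second, with decoded columns
SELECT pg_ripple.create_sparql_view(
  'people',
  'SELECT ?s ?name WHERE { ?s <http://xmlns.com/foaf/0.1/name> ?name }',
  schedule := '1s',
  decode   := true
);

SELECT pg_ripple.list_sparql_views();  -- JSONB array of registered views
```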

New SQL functions

  • pg_ripple.pg_trickle_available() → BOOLEAN — returns true if pg_trickle is installed
  • pg_ripple.create_sparql_view(name, sparql, schedule DEFAULT '1s', decode DEFAULT false) → BIGINT — compiles a SPARQL SELECT to a pg_trickle stream table; returns column count
  • pg_ripple.drop_sparql_view(name) → BOOLEAN — drops the stream table and catalog entry
  • pg_ripple.list_sparql_views() → JSONB — lists all registered SPARQL views
  • pg_ripple.create_datalog_view(name, rules, goal, rule_set_name DEFAULT 'custom', schedule DEFAULT '10s', decode DEFAULT false) → BIGINT — compiles inline Datalog rules + goal into a stream table
  • pg_ripple.create_datalog_view_from_rule_set(name, rule_set, goal, schedule DEFAULT '10s', decode DEFAULT false) → BIGINT — references an existing named rule set for a Datalog view
  • pg_ripple.drop_datalog_view(name) → BOOLEAN — drops the stream table and catalog entry
  • pg_ripple.list_datalog_views() → JSONB — lists all registered Datalog views
  • pg_ripple.create_extvp(name, pred1_iri, pred2_iri, schedule DEFAULT '10s') → BIGINT — pre-computes a semi-join stream table for two predicates
  • pg_ripple.drop_extvp(name) → BOOLEAN — drops the ExtVP stream table and catalog entry
  • pg_ripple.list_extvp() → JSONB — lists all registered ExtVP tables

New catalog tables

  • _pg_ripple.sparql_views — stores the SPARQL view name, original query, generated SQL, schedule, decode mode, stream table name, and variables
  • _pg_ripple.datalog_views — stores the Datalog view name, rules, rule set, goal, generated SQL, schedule, decode mode, stream table name, and variables
  • _pg_ripple.extvp_tables — stores the ExtVP name, predicate IRIs, predicate IDs, generated SQL, schedule, and stream table name

Technical details
  • src/views.rs — new module implementing all v0.11.0 public functions; compile_sparql_for_view() wraps sparql::sqlgen::translate_select() and renames internal _v_{var} columns to plain {var} for stream table compatibility; create_extvp() generates a parameterized semi-join SQL template over the two predicate VP tables
  • src/lib.rs — three new catalog tables created at extension load time; eleven new #[pg_extern] functions exposed in the pg_ripple schema
  • src/datalog/mod.rs — added load_and_store_rules(rules_text, rule_set_name) -> i64 helper for Datalog view creation
  • src/sparql/mod.rs — sqlgen module made pub(crate) so views.rs can call translate_select() directly
  • sql/pg_ripple--0.10.0--0.11.0.sql — migration script adding the three catalog tables for upgrades from v0.10.0
  • pg_regress — new test suites: sparql_views.sql, datalog_views.sql, extvp.sql; all pass

[0.10.0] — 2026-04-16 — Datalog Reasoning Engine

This release delivers a full Datalog reasoning engine over the VP triple store. Rules are parsed from a Turtle-flavoured syntax, stratified for evaluation order, and compiled to native PostgreSQL SQL — no external reasoner process needed.

New in this release: pg_ripple can now execute RDFS and OWL RL entailment, user-defined inference rules, Datalog constraints, and arithmetic/string built-ins. Inference results are written back into the VP store with source = 1 so explicit and derived triples are always distinguishable. A hot dictionary tier accelerates frequent IRI lookups, and a SHACL-AF bridge detects sh:rule properties in shape graphs and registers them alongside standard Datalog rules.

What you can do

  • Write custom inference rules — pg_ripple.load_rules(rules, rule_set) parses Turtle-flavoured Datalog and stores the compiled SQL strata
  • Built-in RDFS entailment — pg_ripple.load_rules_builtin('rdfs') loads all 13 RDFS entailment rules; call pg_ripple.infer('rdfs') to materialize closure
  • Built-in OWL RL reasoning — pg_ripple.load_rules_builtin('owl-rl') loads ~20 core OWL RL rules covering class hierarchy, property chains, and inverse/symmetric/transitive properties
  • Run inference on demand — pg_ripple.infer(rule_set) runs all strata in order and inserts derived triples with source = 1; safe to call repeatedly (idempotent)
  • Declare integrity constraints — rules with an empty head become constraints; pg_ripple.check_constraints() returns all violations as JSONB
  • Inspect and manage rule sets — pg_ripple.list_rules() returns rules as JSONB; pg_ripple.drop_rules(rule_set) clears a named set; enable_rule_set / disable_rule_set toggle a set without deleting it
  • Accelerate hot IRIs — pg_ripple.prewarm_dictionary_hot() loads frequently-used IRIs (≤ 512 B) into an UNLOGGED hot table for sub-microsecond lookups; survives connection pooling but not database restart
  • SHACL-AF bridge — shapes that contain sh:rule entries are detected by load_shacl() and registered in the rules catalog; full SHACL-AF rule execution is planned for v0.11.0
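A minimal end-to-end sketch using only the built-in rule sets described above:

```sql
-- Load the 13 built-in RDFS entailment rules, then materialize the closure.
SELECT pg_ripple.load_rules_builtin('rdfs');
SELECT pg_ripple.infer('rdfs');   -- returns the number of derived triples

-- infer() is idempotent: a second call derives nothing new.
SELECT pg_ripple.infer('rdfs');

-- Derived triples carry source = 1, explicit triples source = 0,
-- so the two are always distinguishable in the VP store.
```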

New GUC parameters

| GUC | Default | Description |
| --- | --- | --- |
| pg_ripple.inference_mode | 'on_demand' | 'off' disables engine; 'on_demand' evaluates via CTEs; 'materialized' uses pg_trickle stream tables |
| pg_ripple.enforce_constraints | 'warn' | 'off' silences violations; 'warn' logs them; 'error' raises an exception |
| pg_ripple.rule_graph_scope | 'default' | 'default' applies rules to default graph only; 'all' applies across all named graphs |

New SQL functions

| Function | Returns | Description |
| --- | --- | --- |
| pg_ripple.load_rules(rules TEXT, rule_set TEXT DEFAULT 'custom') | BIGINT | Parse, stratify, and store a Datalog rule set; returns the number of rules loaded |
| pg_ripple.load_rules_builtin(name TEXT) | BIGINT | Load a built-in rule set by name ('rdfs' or 'owl-rl') |
| pg_ripple.list_rules() | JSONB | Return all active rules as a JSONB array |
| pg_ripple.drop_rules(rule_set TEXT) | BIGINT | Delete a named rule set; returns the number of rules deleted |
| pg_ripple.enable_rule_set(name TEXT) | VOID | Mark a rule set as active |
| pg_ripple.disable_rule_set(name TEXT) | VOID | Mark a rule set as inactive |
| pg_ripple.infer(rule_set TEXT DEFAULT 'custom') | BIGINT | Run inference; returns the number of derived triples inserted |
| pg_ripple.check_constraints(rule_set TEXT DEFAULT NULL) | JSONB | Evaluate integrity constraints; returns violations |
| pg_ripple.prewarm_dictionary_hot() | BIGINT | Load hot IRIs into UNLOGGED hot table; returns rows loaded |

Technical details
  • src/datalog/mod.rs — public API and IR type definitions (Term, Atom, BodyLiteral, Rule, RuleSet); catalog helpers for _pg_ripple.rules and _pg_ripple.rule_sets
  • src/datalog/parser.rs — tokenizer and recursive-descent parser for Turtle-flavoured Datalog; variables as ?x, full IRIs as <...>, prefixed IRIs as prefix:local, head :- body . delimiter
  • src/datalog/stratify.rs — SCC-based stratification via Kosaraju's algorithm; unstratifiable programs (negation cycles) are rejected with a clear error message naming the cyclic predicates
  • src/datalog/compiler.rs — compiles Rule IR to PostgreSQL SQL; non-recursive strata use INSERT … SELECT … ON CONFLICT DO NOTHING; recursive strata use WITH RECURSIVE … CYCLE (PG18 native cycle detection); negation compiles to NOT EXISTS; arithmetic/string built-ins compile to inline SQL expressions
  • src/datalog/builtins.rs — RDFS (13 rules: rdfs2–rdfs12, subclass, domain, range) and OWL RL (~20 rules: class hierarchy, property chains, inverse/symmetric/transitive) as embedded Rust string constants
  • src/dictionary/hot.rs — UNLOGGED hot table _pg_ripple.dictionary_hot for IRIs ≤ 512 B; prewarm_hot_table() runs at _PG_init when inference_mode != 'off'; lookup_hot() and add_to_hot() provide O(1) in-process hash lookups
  • src/shacl/mod.rs — parse_and_store_shapes() now calls bridge_shacl_rules() when inference_mode != 'off'; the bridge detects sh:rule and registers a placeholder in _pg_ripple.rules
  • VP store — source SMALLINT NOT NULL DEFAULT 0 column present in all VP tables; migration script adds it retroactively to tables created before v0.10.0; source = 0 means explicit, source = 1 means derived
  • Migration script — sql/pg_ripple--0.9.0--0.10.0.sql includes all CREATE TABLE IF NOT EXISTS and ALTER TABLE … ADD COLUMN IF NOT EXISTS statements for zero-downtime upgrades
  • New pg_regress tests: datalog_custom.sql, datalog_rdfs.sql, datalog_owl_rl.sql, datalog_negation.sql, datalog_arithmetic.sql, datalog_constraints.sql, datalog_malformed.sql, shacl_af_rule.sql, rdf_star_datalog.sql

[0.9.0] — 2026-04-15 — Serialization, Export & Interop

This release completes RDF I/O: pg_ripple can now import from and export to all major RDF serialization formats, and SPARQL CONSTRUCT and DESCRIBE queries can return results directly as Turtle or JSON-LD.

New in this release: Until now, you could load Turtle and N-Triples but exports were limited to N-Triples and N-Quads. You can now export as Turtle or JSON-LD — formats that are friendlier for human reading and REST APIs respectively. RDF/XML import covers the format that Protégé and most OWL editors produce. Streaming export variants handle large graphs without buffering the full document in memory.

What you can do

  • Load RDF/XML — pg_ripple.load_rdfxml(data TEXT) parses conformant RDF/XML (Protégé, OWL, most ontology editors); returns the number of triples loaded
  • Export as Turtle — pg_ripple.export_turtle() serializes the default graph (or any named graph) as a compact Turtle document with @prefix declarations; RDF-star quoted triples use Turtle-star notation
  • Export as JSON-LD — pg_ripple.export_jsonld() serializes triples as a JSON-LD expanded-form array, ready for REST APIs and Linked Data Platform contexts
  • Stream large graphs — pg_ripple.export_turtle_stream() and pg_ripple.export_jsonld_stream() return one line at a time as SETOF TEXT, suitable for COPY … TO STDOUT pipelines
  • Get CONSTRUCT results as Turtle — pg_ripple.sparql_construct_turtle(query) runs a SPARQL CONSTRUCT query and returns a Turtle document instead of JSONB rows
  • Get CONSTRUCT results as JSON-LD — pg_ripple.sparql_construct_jsonld(query) returns JSONB in JSON-LD expanded form
  • Get DESCRIBE results as Turtle or JSON-LD — pg_ripple.sparql_describe_turtle(query) and pg_ripple.sparql_describe_jsonld(query) offer the same format choice for DESCRIBE
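A short sketch of the export paths above:

```sql
-- Whole-graph exports (default graph):
SELECT pg_ripple.export_turtle();   -- TEXT: compact Turtle with @prefix lines
SELECT pg_ripple.export_jsonld();   -- JSONB: expanded-form JSON-LD array

-- For graphs too large to buffer in memory, stream line by line:
COPY (SELECT * FROM pg_ripple.export_turtle_stream()) TO STDOUT;
```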

New SQL functions

| Function | Returns | Description |
| --- | --- | --- |
| pg_ripple.load_rdfxml(data TEXT) | BIGINT | Parse RDF/XML, load into default graph |
| pg_ripple.export_turtle(graph TEXT DEFAULT NULL) | TEXT | Export graph as Turtle |
| pg_ripple.export_jsonld(graph TEXT DEFAULT NULL) | JSONB | Export graph as JSON-LD (expanded form) |
| pg_ripple.export_turtle_stream(graph TEXT DEFAULT NULL) | SETOF TEXT | Streaming Turtle export |
| pg_ripple.export_jsonld_stream(graph TEXT DEFAULT NULL) | SETOF TEXT | Streaming JSON-LD NDJSON export |
| pg_ripple.sparql_construct_turtle(query TEXT) | TEXT | CONSTRUCT result as Turtle |
| pg_ripple.sparql_construct_jsonld(query TEXT) | JSONB | CONSTRUCT result as JSON-LD |
| pg_ripple.sparql_describe_turtle(query TEXT, strategy TEXT DEFAULT 'cbd') | TEXT | DESCRIBE result as Turtle |
| pg_ripple.sparql_describe_jsonld(query TEXT, strategy TEXT DEFAULT 'cbd') | JSONB | DESCRIBE result as JSON-LD |

Technical details
  • rio_xml crate added as a dependency for RDF/XML parsing (uses rio_api TriplesParser interface, consistent with existing rio_turtle parsers)
  • src/export.rs extended with export_turtle, export_jsonld, export_turtle_stream, export_jsonld_stream, triples_to_turtle, and triples_to_jsonld
  • Turtle serialization groups by subject using BTreeMap for deterministic output; emits predicate-object lists per subject
  • JSON-LD expanded form: each subject is one array entry; predicates become IRI-keyed arrays of {"@value": …} / {"@id": …} objects
  • RDF-star quoted triples: passed through in Turtle-star << s p o >> notation; in JSON-LD emitted as {"@value": "…", "@type": "rdf:Statement"}
  • Streaming variants avoid buffering the full document; export_turtle_stream yields prefix lines then one s p o . per row
  • SPARQL format functions (sparql_construct_turtle, etc.) delegate to the existing SPARQL engine then pass rows through the new serialization layer
  • New pg_regress tests: serialization.sql, rdf_star_construct.sql, expanded sparql_construct.sql

[0.8.0] — 2026-04-15 — Advanced Data Quality Rules

This release rounds out the data quality system with more expressive rules and a background validation mode that never slows down your inserts.

New in this release: Until now, each validation rule applied to a single property in isolation. You can now combine rules — "this value must satisfy rule A or rule B", "must satisfy all of these rules", "must not match this rule" — and count how many values on a property actually conform to a sub-rule. A background mode queues violations for later review instead of blocking every write.

What you can do

  • Combine rules with logic — use sh:or, sh:and, and sh:not to build validation rules that express complex conditions, such as "a contact must have either a phone number or an email address"
  • Reference another rule from within a rule — sh:node <ShapeIRI> checks that each value on a property also satisfies a separate named rule; rules can reference each other up to 32 levels deep without getting stuck in a loop
  • Count qualifying values — sh:qualifiedValueShape combined with sh:qualifiedMinCount / sh:qualifiedMaxCount counts only the values that actually pass a sub-rule, so you can say "at least two authors must be affiliated with a university"
  • Validate without blocking writes — set pg_ripple.shacl_mode = 'async' so that inserts complete immediately and violations are collected silently in the background; the background worker drains the queue automatically
  • Inspect collected violations — pg_ripple.dead_letter_queue() returns all async violations as a JSON array; pg_ripple.drain_dead_letter_queue() clears the queue once you have reviewed them
  • Drain the queue manually — pg_ripple.process_validation_queue(batch_size) processes violations on demand, useful in test pipelines or batch jobs
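As a sketch, the asynchronous validation workflow looks like this:

```sql
-- Queue violations in the background instead of blocking inserts:
SET pg_ripple.shacl_mode = 'async';

-- ... writes proceed at full speed ...

SELECT pg_ripple.validation_queue_length();      -- jobs still waiting
SELECT pg_ripple.process_validation_queue(500);  -- drain up to 500 now
SELECT pg_ripple.dead_letter_queue();            -- review recorded violations
SELECT pg_ripple.drain_dead_letter_queue();      -- clear them once handled
```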

New SQL functions

| Function | Returns | Description |
| --- | --- | --- |
| pg_ripple.process_validation_queue(batch_size BIGINT DEFAULT 1000) | BIGINT | Process up to N pending validation jobs |
| pg_ripple.validation_queue_length() | BIGINT | How many jobs are waiting in the queue |
| pg_ripple.dead_letter_count() | BIGINT | How many violations have been recorded |
| pg_ripple.dead_letter_queue() | JSONB | All recorded violations as a JSON array |
| pg_ripple.drain_dead_letter_queue() | BIGINT | Delete all recorded violations and return how many were removed |

Technical details
  • ShapeConstraint enum extended with Or(Vec<String>), And(Vec<String>), Not(String), QualifiedValueShape { shape_iri, min_count, max_count }
  • validate_property_shape() refactored to accept all_shapes: &[Shape] for recursive nested shape evaluation
  • node_conforms_to_shape() added: depth-limited recursive conformance check (max depth 32)
  • process_validation_batch(batch_size) added: SPI-based batch drain of _pg_ripple.validation_queue, writes violations to _pg_ripple.dead_letter_queue
  • Merge worker (src/worker.rs) extended with run_validation_cycle() called after each merge transaction
  • validate_sync() now handles Class, Node, Or, And, Not, and QualifiedValueShape (max-count check only for sync)
  • run_validate() now checks top-level node Or/And/Not constraints in offline validation

[0.7.0] — 2026-04-15 — Data Quality Rules (Core)

This release adds SHACL — a W3C standard for expressing data quality rules — and on-demand deduplication for datasets that have accumulated duplicate entries.

What this means in practice: You define rules like "every Person must have a name, and the name must be a string", load them into the database once, and pg_ripple will check those rules on every insert or on demand. Violations are reported as structured JSON so they can be logged, monitored, or acted on automatically.

What you can do

  • Define data quality rules — pg_ripple.load_shacl(data TEXT) parses rules written in W3C SHACL Turtle format and stores them in the database; returns the number of rules loaded
  • Check your data — pg_ripple.validate(graph TEXT DEFAULT NULL) runs all active rules against your data and returns a JSON report: {"conforms": true/false, "violations": [...]}. Pass a graph name to validate only that graph
  • Reject bad data on insert — set pg_ripple.shacl_mode = 'sync' to have insert_triple() immediately reject any triple that violates a sh:maxCount, sh:datatype, sh:in, or sh:pattern rule
  • Manage rules — pg_ripple.list_shapes() lists all loaded rules; pg_ripple.drop_shape(uri TEXT) removes one rule by its IRI
  • Remove duplicate triples — pg_ripple.deduplicate_predicate(p_iri TEXT) removes duplicate entries for one property, keeping the earliest record; pg_ripple.deduplicate_all() deduplicates everything
  • Deduplicate automatically on merge — set pg_ripple.dedup_on_merge = true to eliminate duplicates each time the background worker compacts data (see v0.6.0)
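A minimal sketch; the shape IRI and prefixes are illustrative, while the constraint vocabulary is standard SHACL:

```sql
-- Require every Person to have exactly one string-valued name.
SELECT pg_ripple.load_shacl('
  @prefix sh:  <http://www.w3.org/ns/shacl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:  <http://example.org/> .

  ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
      sh:path ex:name ;
      sh:minCount 1 ;
      sh:maxCount 1 ;
      sh:datatype xsd:string ;
    ] .
');

-- Check everything loaded so far:
SELECT pg_ripple.validate();   -- {"conforms": ..., "violations": [...]}
```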

New SQL functions

| Function | Returns | Description |
| --- | --- | --- |
| pg_ripple.load_shacl(data TEXT) | INTEGER | Parse Turtle, store rules, return count loaded |
| pg_ripple.validate(graph TEXT DEFAULT NULL) | JSONB | Full validation report |
| pg_ripple.list_shapes() | TABLE(shape_iri TEXT, active BOOLEAN) | All rules in the catalog |
| pg_ripple.drop_shape(shape_uri TEXT) | INTEGER | Remove a rule by IRI |
| pg_ripple.deduplicate_predicate(p_iri TEXT) | BIGINT | Remove duplicates for one property |
| pg_ripple.deduplicate_all() | BIGINT | Remove duplicates across all properties |
| pg_ripple.enable_shacl_monitors() | BOOLEAN | Create a live violation-count stream table (requires pg_trickle) |

New configuration options

| Option | Default | Description |
| --- | --- | --- |
| pg_ripple.shacl_mode | 'off' | When to validate: 'off', 'sync' (block bad inserts), 'async' (queue for later — see v0.8.0) |
| pg_ripple.dedup_on_merge | false | Eliminate duplicate triples during each background merge |

New internal tables

| Table | Description |
| --- | --- |
| _pg_ripple.shacl_shapes | Stores each loaded rule with its IRI, parsed JSON, and active flag |
| _pg_ripple.validation_queue | Inbox for inserts when shacl_mode = 'async' |
| _pg_ripple.dead_letter_queue | Recorded violations with full JSONB violation reports |
| _pg_ripple.violation_summary | Live violation counts by rule and severity (created by enable_shacl_monitors()) |

Supported validation constraints (v0.7.0)

sh:minCount, sh:maxCount, sh:datatype, sh:in, sh:pattern, sh:class, sh:targetClass, sh:targetNode, sh:targetSubjectsOf, sh:targetObjectsOf. Logical combinators (sh:or, sh:and, sh:not) and qualified constraints are added in v0.8.0.

Upgrading from v0.6.0

ALTER EXTENSION pg_ripple UPDATE;

The migration creates three new tables (shacl_shapes, validation_queue, dead_letter_queue) and their indexes. No existing tables are modified.


[0.6.0] — 2026-04-15 — High-Speed Reads and Writes at the Same Time

This release separates write traffic from read traffic so both can run at full speed simultaneously. It also adds change notifications so other systems can react to new triples in real time.

The problem this solves: In earlier versions, heavy read queries could slow down writes and vice versa. Now, writes go into a small fast table and reads see everything via a transparent view. A background worker periodically merges the write table into an optimised read table without interrupting either operation.

What you can do

  • Write and read simultaneously without blocking — inserts land in a fast write buffer; reads see both the buffer and the main read-optimised store through a transparent view
  • Trigger a manual merge — pg_ripple.compact() immediately merges all pending writes into the read store; returns the total number of triples after compaction
  • Subscribe to changes — pg_ripple.subscribe(pattern TEXT, channel TEXT) sends a PostgreSQL LISTEN/NOTIFY message to channel every time a triple matching pattern is inserted or deleted; use '*' to receive all changes
  • Unsubscribe — pg_ripple.unsubscribe(channel TEXT) stops notifications on a channel
  • Get storage statistics — pg_ripple.stats() reports total triple count, how many predicates have their own table, how many triples are still in the write buffer, and the background worker's process ID
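A brief sketch; the channel name is illustrative:

```sql
-- Receive a NOTIFY for every change to any triple:
SELECT pg_ripple.subscribe('*', 'ripple_changes');
LISTEN ripple_changes;   -- this session now receives the JSON change payloads

-- Merge pending writes into the read store and inspect the result:
SELECT pg_ripple.compact();
SELECT pg_ripple.stats();
```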

New SQL functions

| Function | Returns | Description |
| --- | --- | --- |
| pg_ripple.compact() | BIGINT | Merge all pending writes into the read store |
| pg_ripple.stats() | JSONB | Storage and background worker statistics |
| pg_ripple.subscribe(pattern TEXT, channel TEXT) | BIGINT | Subscribe to change notifications |
| pg_ripple.unsubscribe(channel TEXT) | BIGINT | Stop notifications on a channel |
| pg_ripple.htap_migrate_predicate(pred_id BIGINT) | VOID | Migrate one property table to the split-storage layout |
| pg_ripple.subject_predicates(subject_id BIGINT) | BIGINT[] | All properties for a given subject (fast lookup) |
| pg_ripple.object_predicates(object_id BIGINT) | BIGINT[] | All properties for a given object (fast lookup) |

New configuration options

| Option | Default | Description |
| --- | --- | --- |
| pg_ripple.merge_threshold | 10000 | Minimum pending writes before background merge starts |
| pg_ripple.merge_interval_secs | 60 | Maximum seconds between merge cycles |
| pg_ripple.merge_retention_seconds | 60 | How long to keep the previous read table before dropping it |
| pg_ripple.latch_trigger_threshold | 10000 | Pending writes needed to wake the merge worker early |
| pg_ripple.worker_database | postgres | Which database the merge worker connects to |
| pg_ripple.merge_watchdog_timeout | 300 | Log a warning if the merge worker is silent for this many seconds |

Bug fixes in this release

  • Startup race condition — the extension's shared memory flag is now set inside the correct PostgreSQL startup hook, eliminating a rare crash window during server start
  • GUC registration crash — configuration parameters requiring postmaster-level access no longer crash when CREATE EXTENSION pg_ripple runs without the extension in shared_preload_libraries
  • SPARQL aggregate decode bugCOUNT, SUM, and similar aggregate results were incorrectly looked up in the string dictionary; they now pass through as plain numbers
  • Merge worker: DROP TABLE without CASCADE — the merge worker failed if old tables had dependent views; fixed by using CASCADE and recreating the view afterwards
  • Merge worker: stale index name — repeated compact() calls failed with "relation already exists" because the old index name survived a table rename; the stale index is now dropped before creating a new one

Upgrading from v0.5.1

ALTER EXTENSION pg_ripple UPDATE;

The migration script adds a column to the predicate catalog, creates the pattern tables and change-notification infrastructure, and converts every existing property table to the split read/write layout in a single transaction. Existing triples land in the write buffer; call pg_ripple.compact() afterwards to move them into the read store immediately.

Technical details
  • HTAP split: writes → vp_{id}_delta (heap + B-tree); cross-partition deletes → vp_{id}_tombstones; query view = (main EXCEPT tombstones) UNION ALL delta
  • Background merge: sort-ordered insertion into a fresh vp_{id}_main (BRIN-indexed) + ANALYZE; previous main dropped after merge_retention_seconds
  • ExecutorEnd_hook pokes the merge worker latch when TOTAL_DELTA_ROWS reaches latch_trigger_threshold
  • Subject/object pattern tables (_pg_ripple.subject_patterns, _pg_ripple.object_patterns) — GIN-indexed BIGINT[] columns rebuilt by the merge worker; enable O(1) predicate lookup per node
  • CDC notifications fire as pg_notify(channel, '{"op":"insert|delete","s":...,"p":...,"o":...,"g":...}') via trigger on each delta table

This release stores common data types (integers, dates, booleans) as compact numbers instead of text, making range comparisons in queries much faster. It also adds the two remaining SPARQL query forms, write support via SPARQL Update, and full-text search on text values.

What you can do

  • Faster comparisons on numbers and dates — xsd:integer, xsd:boolean, xsd:date, and xsd:dateTime values are stored as compact integers; FILTER comparisons (>, <, =) run as plain integer comparisons with no string decoding
  • SPARQL CONSTRUCT — pg_ripple.sparql_construct(query TEXT) assembles new triples from a template and returns them as a set of {s, p, o} JSON objects; useful for transforming or exporting data
  • SPARQL DESCRIBE — pg_ripple.sparql_describe(query TEXT, strategy TEXT) returns the neighbourhood of a resource — all triples directly connected to it (Concise Bounded Description) or both incoming and outgoing triples (Symmetric CBD)
  • SPARQL Update — pg_ripple.sparql_update(query TEXT) executes INSERT DATA { … } and DELETE DATA { … } statements; returns the number of triples affected
  • Full-text search — pg_ripple.fts_index(predicate TEXT) indexes text values for a property; pg_ripple.fts_search(query TEXT, predicate TEXT) searches them using standard PostgreSQL text-search syntax
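A sketch combining the new write path and full-text search; the IRIs are example data:

```sql
-- Insert via SPARQL Update:
SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  INSERT DATA { ex:post1 ex:body "PostgreSQL as a triple store" }
');

-- Index the ex:body property, then search it with tsquery syntax:
SELECT pg_ripple.fts_index('<http://example.org/body>');
SELECT pg_ripple.fts_search('postgresql & triple', '<http://example.org/body>');
```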

Bug fixes

  • fts_index now accepts N-Triples <IRI> notation for the predicate argument
  • fts_index now uses a correct partial index that does not require PostgreSQL subquery support
  • Inline-encoded values (integers, dates) now decode correctly in SPARQL SELECT results instead of returning NULL

New configuration options

  • pg_ripple.describe_strategy (default 'cbd') — DESCRIBE expansion algorithm: 'cbd', 'scbd' (symmetric), or 'simple' (subject only)

[0.5.0] — 2026-04-15 — Complete SPARQL 1.1 Query Engine

This release completes SPARQL 1.1 query support. All standard query patterns — graph traversal, aggregates, unions, subqueries, optional matches, and computed values — are now supported.

What you can do

  • Traverse graph relationships — property paths (+, *, ?, /, |, ^) follow chains of relationships; cyclic graphs are handled safely using PostgreSQL's cycle detection
  • Combine results from alternative patterns — { ... } UNION { ... } merges results from two or more patterns; MINUS { ... } removes results that match an unwanted pattern
  • Aggregate and group results — COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT work with GROUP BY and HAVING just as in SQL
  • Use subqueries — nest { SELECT … WHERE { … } } patterns at any depth
  • Compute new values — BIND(<expr> AS ?var) assigns a calculated value to a variable; VALUES ?x { … } injects a fixed set of values into a pattern
  • Optional matches — OPTIONAL { … } returns results even when the optional pattern has no data, leaving those variables unbound
  • Limit recursion depth — pg_ripple.max_path_depth caps how deep property-path traversal can go, preventing runaway queries on very large graphs
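For example, a property path combined with aggregation; the data follows the introduction's example graph:

```sql
-- How many people can each person reach through foaf:knows chains?
SELECT * FROM pg_ripple.sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person (COUNT(?other) AS ?reach) WHERE {
    ?person foaf:knows+ ?other .
  }
  GROUP BY ?person
  HAVING (COUNT(?other) >= 2)
');
```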

Bug fixes

  • Sequence paths (p/q) no longer produce a Cartesian product when intermediate nodes are anonymous
  • p* (zero-or-more) paths no longer crash with a PostgreSQL CYCLE syntax error
  • OPTIONAL no longer produces incorrect results due to an alias collision in the generated SQL
  • GROUP BY column references no longer go out of scope in the outer query
  • MINUS join clause now uses the correct column alias
  • VALUES no longer generates a duplicate alias clause
  • BIND in aggregate subqueries (SELECT (COUNT(?p) AS ?cnt)) now produces the correct SQL expression
  • Numbers in FILTER expressions (FILTER(?cnt >= 2)) are now emitted as SQL integers instead of dictionary IDs
  • Changing pg_ripple.max_path_depth mid-session now correctly invalidates the plan cache
Technical details
  • Property paths compile to WITH RECURSIVE … CYCLE CTEs using PostgreSQL 18's hash-based CYCLE clause
  • All pg_regress test files are now idempotent — safe to run multiple times against the same database
  • setup.sql drops and recreates the extension for full isolation between runs
  • New tests: property_paths.sql, aggregates.sql, resource_limits.sql — 12/12 pass

[0.4.0] — 2026-04-14 — Statements About Statements (RDF-star)

This release adds RDF-star: the ability to store facts about facts. For example, you can record not just "Alice knows Bob" but also "Alice knows Bob — according to Carol, since 2020". This is essential for provenance tracking, temporal data, and property graph–style edge annotations.

What you can do

  • Load N-Triples-star data — pg_ripple.load_ntriples() now accepts N-Triples-star, including nested quoted triples in both subject and object position
  • Encode and decode quoted triples — pg_ripple.encode_triple(s, p, o) stores a quoted triple and returns its ID; pg_ripple.decode_triple(id) converts it back to JSON
  • Use statement identifiers — pg_ripple.insert_triple() now returns the stable integer identifier of the stored statement; that identifier can itself appear as a subject or object in other triples
  • Look up a statement by its identifier — pg_ripple.get_statement(i BIGINT) returns {"s":…,"p":…,"o":…,"g":…} for any stored statement
  • Query with SPARQL-star — ground (all-constant) quoted triple patterns work in SPARQL WHERE clauses: WHERE { << :Alice :knows :Bob >> :assertedBy ?who }
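A small sketch of the RDF-star round trip; the IRIs are example data:

```sql
-- Store a fact about a fact using N-Triples-star:
SELECT pg_ripple.load_ntriples('<< <http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> >> <http://example.org/assertedBy> <http://example.org/Carol> .');

-- Ask who asserted it, with a ground quoted-triple pattern:
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?who WHERE { << ex:Alice ex:knows ex:Bob >> ex:assertedBy ?who }
');
```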

Known limitations in this release

  • Turtle-star is not yet supported; use N-Triples-star for RDF-star bulk loading
  • Variable-inside-quoted-triple SPARQL patterns (e.g. << ?s :knows ?o >> :assertedBy ?who) are deferred to v0.5.x
  • W3C SPARQL-star conformance test suite not yet run (deferred to v0.5.x)
Technical details
  • KIND_QUOTED_TRIPLE = 5 added to the dictionary; quoted triples stored with qt_s, qt_p, qt_o columns via non-destructive ALTER TABLE … ADD COLUMN IF NOT EXISTS
  • Custom recursive-descent N-Triples-star line parser — avoids the oxrdf/rdf-12 + spargebra feature conflict with no new crate dependencies
  • spargebra and sparopt now use the sparql-12 feature, enabling TermPattern::Triple with correct exhaustiveness guards
  • SPARQL-star ground patterns compile to a dictionary lookup + SQL equality condition

[0.3.0] — 2026-04-14 — SPARQL Query Language

This release introduces SPARQL, the standard W3C query language for RDF data. You can now ask questions over your stored facts using a familiar graph-pattern syntax, with results returned as JSON.

What you can do

  • Run SPARQL SELECT queries — pg_ripple.sparql(query TEXT) executes a SPARQL SELECT and returns one JSON object per result row, with variable names as keys and values in standard N-Triples format
  • Run SPARQL ASK queries — pg_ripple.sparql_ask(query TEXT) returns true if any results exist, false otherwise
  • Inspect the generated SQL — pg_ripple.sparql_explain(query TEXT, analyze BOOL DEFAULT false) shows what SQL was generated from a SPARQL query; pass analyze := true for a full execution plan with timings
  • Tune the query plan cache — pg_ripple.plan_cache_size (default 256) controls how many SPARQL-to-SQL translations are cached per connection; set to 0 to disable caching
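A quick sketch of the three entry points:

```sql
-- SELECT: one JSON object per result row.
SELECT * FROM pg_ripple.sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?s ?name WHERE { ?s foaf:name ?name } ORDER BY ?name LIMIT 10
');

-- ASK: boolean existence check.
SELECT pg_ripple.sparql_ask('ASK { ?s ?p ?o }');

-- EXPLAIN: see the generated SQL (pass analyze := true for timings).
SELECT pg_ripple.sparql_explain('SELECT ?s WHERE { ?s ?p ?o }');
```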

Supported query features

  • Basic graph patterns with bound or wildcard subjects, predicates, and objects
  • FILTER with comparisons (=, !=, <, <=, >, >=) and boolean operators (&&, ||, !, BOUND())
  • OPTIONAL (left-join)
  • GRAPH <iri> { … } and GRAPH ?g { … } for named graph scoping
  • SELECT with variable projection, DISTINCT, REDUCED
  • LIMIT, OFFSET, ORDER BY
Technical details
  • SPARQL text → spargebra 0.4 algebra tree → SQL via src/sparql/sqlgen.rs; all IRI and literal constants are encoded to i64 before appearing in SQL — SQL injection via SPARQL constants is structurally impossible
  • Per-query encoding cache avoids redundant dictionary lookups for constants appearing multiple times in one query
  • Self-join elimination: patterns sharing a subject but using different predicates compile to a single scan, not separate subqueries
  • Batch decode: all integer result columns are decoded in a single SELECT … WHERE id IN (…) round-trip
  • RUST_TEST_THREADS = "1" in .cargo/config.toml prevents concurrent dictionary upsert deadlocks in the test suite
  • New pg_regress tests: sparql_queries.sql (10 queries), sparql_injection.sql (7 adversarial inputs)

[0.2.0] — 2026-04-14 — Bulk Loading, Named Graphs, and Export

This release makes it practical to work with large RDF datasets. You can load standard RDF files, organise triples into named collections, export data back to standard formats, and register IRI prefixes for convenience.

What you can do

  • Load RDF files in bulk — pg_ripple.load_ntriples(data TEXT), load_nquads(data TEXT), load_turtle(data TEXT), and load_trig(data TEXT) accept standard RDF text and return the number of triples loaded
  • Load from a file on the server — pg_ripple.load_ntriples_file(path TEXT) and its siblings read a file directly from the server filesystem (requires superuser); essential for large datasets
  • Organise triples into named graphs — pg_ripple.create_graph('<iri>') creates a named collection; pg_ripple.drop_graph('<iri>') deletes it along with its triples; pg_ripple.list_graphs() lists all collections
  • Export data — pg_ripple.export_ntriples(graph) and pg_ripple.export_nquads(graph) serialise stored triples to standard text; pass NULL to export all triples
  • Register IRI prefixes — pg_ripple.register_prefix('ex', 'https://example.org/') records a shorthand; pg_ripple.prefixes() lists all registered mappings
  • Promote rare properties manually — pg_ripple.promote_rare_predicates() moves any property that has grown beyond the threshold into its own dedicated table
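A sketch of a typical load/export cycle; the graph IRI and data are illustrative:

```sql
SELECT pg_ripple.register_prefix('ex', 'https://example.org/');
SELECT pg_ripple.create_graph('<https://example.org/graphs/staging>');

-- TriG carries its own graph labels:
SELECT pg_ripple.load_trig('
  @prefix ex: <https://example.org/> .
  <https://example.org/graphs/staging> { ex:a ex:linkedTo ex:b . }
');

SELECT pg_ripple.export_nquads('<https://example.org/graphs/staging>');
SELECT pg_ripple.list_graphs();
```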

How rare properties work

Properties with fewer than 1,000 triples (configurable via pg_ripple.vp_promotion_threshold) are stored in a shared table rather than creating a dedicated table for each one. Once a property crosses the threshold it is automatically migrated. This keeps the database tidy for datasets with many rarely-used properties.

How blank node scoping works

Blank node identifiers (_:b0, _:b1, etc.) from different load calls are automatically isolated. Loading the same file twice will produce separate, independent blank nodes rather than merging them — which is almost always what you want.

Technical details
  • rio_turtle 0.8 / rio_api 0.8 added for N-Triples, N-Quads, Turtle, and TriG parsing
  • Blank node scoping via _pg_ripple.load_generation_seq: each load advances a shared sequence; blank node hashes are prefixed with "{generation}:" to prevent cross-load merging
  • batch_insert_encoded groups triples by predicate and issues one multi-row INSERT per predicate group, reducing round-trips
  • _pg_ripple.statements range-mapping table created (populated in v0.6.0)
  • _pg_ripple.prefixes table: (prefix TEXT PRIMARY KEY, expansion TEXT)
  • GUCs added: pg_ripple.vp_promotion_threshold (i32, default 1000), pg_ripple.named_graph_optimized (bool, default off)
  • New pg_regress tests: triple_crud.sql, named_graphs.sql, export_ntriples.sql, nquads_trig.sql

[0.1.0] — 2026-04-14 — First Working Release

pg_ripple can now be installed into a PostgreSQL 18 database. After installation you can store facts — statements like "Alice knows Bob" — and retrieve them by pattern. This is the foundation that all later releases build on. No query language yet: just the core building blocks.

What you can do

  • Install the extension — CREATE EXTENSION pg_ripple in any PostgreSQL 18 database (requires superuser)
  • Store facts — pg_ripple.insert_triple('<Alice>', '<knows>', '<Bob>') saves a fact and returns a unique identifier for it
  • Find facts by pattern — pg_ripple.find_triples('<Alice>', NULL, NULL) returns everything about Alice; NULL is a wildcard for any position
  • Delete facts — pg_ripple.delete_triple(…) removes a specific fact
  • Count facts — pg_ripple.triple_count() returns how many facts are stored
  • Encode and decode terms — pg_ripple.encode_term(…) converts a text term to its internal numeric ID; pg_ripple.decode_id(…) converts it back
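
Put together, a full insert/find/delete cycle might look like this (IRIs are illustrative; signatures as listed above):

```sql
CREATE EXTENSION pg_ripple;  -- requires superuser

-- Store a fact and look it up by pattern
SELECT pg_ripple.insert_triple('<http://example.org/Alice>',
                               '<http://example.org/knows>',
                               '<http://example.org/Bob>');
SELECT * FROM pg_ripple.find_triples('<http://example.org/Alice>', NULL, NULL);

-- Remove it again and confirm the store is empty
SELECT pg_ripple.delete_triple('<http://example.org/Alice>',
                               '<http://example.org/knows>',
                               '<http://example.org/Bob>');
SELECT pg_ripple.triple_count();
```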

How storage works

Every piece of text — names, URLs, values — is converted to a compact integer before storage. Lookups and joins operate on integers, not strings, which is what makes queries fast. Facts are automatically organised into one table per relationship type, and relationship types with few facts share a single table to avoid creating thousands of tiny tables. Every fact receives a globally unique integer identifier that later versions use for RDF-star.
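
The encoding is directly observable through the two helper functions — a round-trip sketch:

```sql
-- A term maps to a compact integer ID...
SELECT pg_ripple.encode_term('<http://example.org/Alice>') AS id;

-- ...and the ID maps back to the original text
SELECT pg_ripple.decode_id(
         pg_ripple.encode_term('<http://example.org/Alice>'));
```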

Technical details
  • pgrx 0.17 project scaffolding targeting PostgreSQL 18
  • Extension bootstrap creates pg_ripple (user-visible) and _pg_ripple (internal) schemas; the pg_ prefix requires SET LOCAL allow_system_table_mods = on during bootstrap
  • Dictionary encoder (src/dictionary/mod.rs): _pg_ripple.dictionary table; XXH3-128 hash stored in BYTEA; dense IDENTITY sequence as join key; backend-local LRU encode/decode caches; CTE-based upsert avoids pgrx 0.17 InvalidPosition error on empty RETURNING results
  • Vertical partitioning (src/storage/mod.rs): _pg_ripple.vp_{predicate_id} tables with dual B-tree indices on (s,o) and (o,s); _pg_ripple.predicates catalog; _pg_ripple.vp_rare consolidation table; _pg_ripple.statement_id_seq for globally-unique statement IDs
  • Error taxonomy (src/error.rs): thiserror-based types — PT001–PT099 (dictionary), PT100–PT199 (storage)
  • GUC: pg_ripple.default_graph
  • CI pipeline: fmt, clippy, pg_test, pg_regress (.github/workflows/ci.yml)
  • pg_regress tests: setup.sql, dictionary.sql, basic_crud.sql

pg_ripple — Roadmap

From 0.1.0 (foundation) to 1.0.0 (production-ready triple store)

Authority rule: plans/implementation_plan.md is the authoritative description of the eventual target architecture. This roadmap is the delivery sequence for that architecture. If a milestone summary here conflicts with the implementation plan, the implementation plan wins and the roadmap should be updated to match it.

How to read this roadmap

Each release below has two layers:

  • The plain-language summary (in the coloured box) explains what the release delivers and why it matters — no programming knowledge required.
  • The technical deliverables list the specific items developers will build. Feel free to skip these if you're reading for the big picture.

Effort estimates are given as person-weeks — e.g. "6–8 pw" means the release would take roughly 6–8 weeks for a single full-time developer, or 3–4 weeks for a pair working together. The total estimated effort from v0.1.0 to v1.0.0 is 275–376 person-weeks (~63–86 months for one developer; ~32–43 months for a pair).

"optional at runtime" items: some deliverables are annotated (optional at runtime — X must be installed). This means the feature depends on an external extension (e.g. pg_trickle) that may not be installed in every deployment. The feature is required by this roadmap and must be implemented; the Rust code gates on a runtime availability check and degrades gracefully (returns 0 / false / empty, emits a WARNING, never raises an ERROR) when the dependency is absent. These items are not optional from a delivery standpoint.


Overview at a glance

Version | Name | What it delivers (one sentence) | Effort
0.1.0 | Foundation | Install the extension, store and retrieve facts (VP storage from day one) | 6–8 pw
0.2.0 | Bulk Loading & Named Graphs | Bulk data import, named graphs, rare-predicate consolidation, N-Triples export | 6–8 pw
0.3.0 | SPARQL Basic | Ask questions in the standard RDF query language (incl. GRAPH patterns) | 6–8 pw
0.4.0 | RDF-star / Statement IDs | Make statements about statements; LPG-ready storage | 8–10 pw
0.5.0 | SPARQL Advanced (Query) | Property paths, aggregates, UNION/MINUS, subqueries, BIND/VALUES | 6–8 pw
0.5.1 | SPARQL Advanced (Storage & Write) | Inline encoding, CONSTRUCT/DESCRIBE, INSERT/DELETE DATA, FTS | 6–8 pw
0.6.0 | HTAP Architecture | Heavy reads and writes at the same time; shared-memory cache | 8–10 pw
0.7.0 | SHACL Core + Deduplication | Define data quality rules; reject bad data on insert; on-demand and merge-time triple deduplication | 5–7 pw
0.8.0 | SHACL Advanced | Complex data quality rules with background checking | 4–6 pw
0.9.0 | Serialization | Import and export data in all standard RDF file formats | 3–4 pw
0.10.0 | Datalog Reasoning | Automatically derive new facts from rules and logic | 10–12 pw
0.11.0 | SPARQL & Datalog Views | Live, always-up-to-date dashboards from SPARQL and Datalog queries | 5–7 pw
0.12.0 | SPARQL Update (Advanced) | Pattern-based updates and graph management commands | 3–4 pw
0.13.0 | Performance | Speed tuning, benchmarks, production-grade throughput | 6–8 pw
0.14.0 | Admin & Security | Operations tooling, access control, docs, packaging | 4–6 pw
0.15.0 | SPARQL Protocol | Standard HTTP API, graph-aware loaders and deletes as SQL functions | 3–4 pw
0.16.0 | SPARQL Federation | Query remote SPARQL endpoints alongside local data | 4–6 pw
0.17.0 | JSON-LD Framing | Frame-driven CONSTRUCT queries producing nested JSON-LD | 3–4 pw
0.18.0 | SPARQL CONSTRUCT & ASK Views | Materialize CONSTRUCT and ASK queries as live, incrementally-updated stream tables | 2–3 pw
0.19.0 | Federation Performance | Connection pooling, result caching, query rewriting, and batching for remote SPARQL endpoints | 3–5 pw
0.20.0 | W3C Conformance & Stability | W3C SPARQL 1.1 and SHACL Core test suite compliance, crash recovery and memory safety hardening, security audit initiation | 5–7 pw
0.21.0 | SPARQL Built-in Functions & Query Correctness | Implement all ~40 missing SPARQL 1.1 built-in functions, fix the FILTER silent-drop hazard, and close critical query-semantics bugs | 6–8 pw
0.22.0 | Storage Correctness & Security Hardening | Fix HTAP merge race conditions, dictionary cache rollback, shmem cache thrashing, rare-predicate promotion race, and HTTP service security gaps | 6–8 pw
0.23.0 | SHACL Core Completion & SPARQL Diagnostics | Complete the SHACL constraint set, add SPARQL query introspection, and fix Datalog/JSON-LD correctness issues | 6–8 pw
0.24.0 | Semi-naive Datalog & Performance Hardening | Implement semi-naive evaluation for Datalog rules, complete the OWL RL rule set, batch-decode large result sets, and bound property-path depth | 6–8 pw
0.25.0 | GeoSPARQL & Architectural Polish | Add GeoSPARQL 1.1 geometry primitives, stabilise the internal catalog against OID drift, and close remaining medium- and low-priority issues | 6–8 pw
0.26.0 | GraphRAG Integration | First-class integration with Microsoft GraphRAG: BYOG Parquet export, Datalog-enriched entity graphs, SHACL quality enforcement, and a Python CLI bridge | 4–6 pw
0.27.0 | Vector + SPARQL Hybrid: Foundation | Core pgvector integration — embedding table, HNSW index, pg:similar() SPARQL function, bulk embedding, and hybrid retrieval modes | 5–7 pw
0.28.0 | Advanced Hybrid Search & RAG Pipeline | Production-grade RRF fusion, incremental embedding worker, graph-contextualized embeddings, and end-to-end RAG retrieval | 5–8 pw
0.29.0 | Datalog Optimization: Magic Sets & Cost-Based Compilation | Goal-directed inference via magic sets, cost-based body atom reordering, subsumption checking, anti-join negation, filter pushdown, delta table indexing | 5–7 pw
0.30.0 | Datalog Aggregation & Compiled Rule Plans | Aggregation in rule bodies (Datalog^agg), SQL plan caching across inference runs, SPARQL on-demand query speedup | 5–7 pw
0.31.0 | Entity Resolution & Demand Transformation | owl:sameAs entity canonicalization, demand transformation for goal-directed rule rewriting, SPARQL query planner integration | 5–7 pw
0.32.0 | Well-Founded Semantics & Tabling | Three-valued semantics for cyclic ontologies, subsumptive result caching for Datalog and SPARQL repeated sub-queries | 5–7 pw
0.33.0 | Documentation Site & Content Overhaul | Complete docs site rebuild — CI harness, eight feature-deep-dive chapters, operations guide, reference section, and content governance | 8–12 pw
0.34.0 | Bounded-Depth Termination & Incremental Retraction (DRed) | Early fixpoint termination for bounded hierarchies (20–50% faster SPARQL property paths); Delete-Rederive for write-correct materialized predicates | 5–7 pw
0.35.0 | Parallel Stratum Evaluation & Incremental Rule Updates | Background-worker parallelism for independent rules (2–5× faster materialization); add/remove rules without full recompute | 5–7 pw
0.36.0 | Worst-Case Optimal Joins & Lattice-Based Datalog | Leapfrog Triejoin for cyclic SPARQL patterns (10×–100× speedup); Datalog^L monotone lattice aggregation | 6–9 pw
0.37.0 | Storage Concurrency Hardening & Error Safety | Fix HTAP merge race, rare-predicate promotion race, dictionary cache rollback; eliminate all hard panics; add GUC validators | 9–11 pw
0.38.0 | Architecture Refactoring & Query Completeness | Split god-module, PredicateCatalog trait, batch encoding, SCBD, SPARQL Update completeness, SHACL hints in planner | 9–11 pw
0.39.0 | Datalog HTTP API | REST API exposing all 27 Datalog SQL functions in pg_ripple_http: rule management, inference, goal queries, constraints, admin | 3–5 pw
0.40.0 | Streaming Results, Explain & Observability | Server-side SPARQL cursors, explain_sparql(), explain_datalog(), OpenTelemetry tracing, resource governors | 9–11 pw
0.41.0 | Full W3C SPARQL 1.1 Test Suite | Complete W3C SPARQL 1.1 Query + Update + Graph Patterns + Aggregates test suite harness with parallelized execution; 3,000+ tests in < 2 min CI | 5–7 pw
0.42.0 | Parallel Merge, Cost-Based Federation & Live CDC | Multi-worker HTAP merge, FedX-style federation planner, parallel SERVICE, live RDF change subscriptions | 10–12 pw
0.43.0 | WatDiv + Jena Conformance Suite | Apache Jena edge-case tests (~1,000) and WatDiv scale-correctness benchmark (10M+ triples, star/chain/snowflake/complex patterns); 90% harness reuse from v0.41.0 | 5–7 pw
0.44.0 | LUBM Conformance Suite | Lehigh University Benchmark — OWL RL inference correctness across 14 canonical queries on 1K–8M triple datasets; includes Datalog API validation sub-suite for rule compilation, iteration tracking, inferred triples, goal queries, and performance baseline | 3–5 pw
0.45.0 | SHACL Completion, Datalog Robustness & Crash Recovery | Close remaining SHACL Core gaps (sh:equals/sh:disjoint, decoded violation IRIs, async load test), harden parallel Datalog strata rollback, add missing crash-recovery scenarios, and standardise migration documentation | 4–6 pw
0.46.0 | Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance | proptest for SPARQL and dictionary invariants, fuzz the federation result decoder, W3C OWL 2 RL test suite in CI, TopN push-down, BSBM regression gate, sequence pre-allocation for Datalog workers, rustdoc coverage enforcement, and HTTP certificate pinning | 5–7 pw
0.47.0 | SHACL Truthfulness, Dead-Code Activation & Architecture Refactor | Fix parsed-but-not-checked SHACL constraints, wire preallocate_sid_ranges(), finish the sparql/translate/ module split, add 5 fuzz targets, 4 crash-recovery scenarios, cache hit-rate SRFs, GUC validators, and security hygiene | 8–10 pw
0.48.0 | SHACL Core Completeness, OWL 2 RL Closure & SPARQL Completeness | Complete all 35 SHACL Core constraints and complex sh:path expressions, close the OWL 2 RL rule set, add SPARQL Update MOVE/COPY/ADD, fix SPARQL-star variable patterns, WatDiv baselines, and operational hardening | 6–8 pw
0.49.0 | AI & LLM Integration | sparql_from_nl() NL-to-SPARQL via configurable LLM endpoint; suggest_sameas() and apply_sameas_candidates() for embedding-based entity alignment | 4–6 pw
0.50.0 | Developer Experience & GraphRAG Polish | VS Code extension with SPARQL/SHACL/Datalog support and query runner; explain_sparql(analyze:=true) debugger; rag_context() RAG pipeline | 5–7 pw
1.0.0 | Production Release | Standards conformance, stress testing, security audit | 6–8 pw
Total estimated effort: 275–376 pw

v0.1.0 — Foundation

Theme: Core data model, dictionary encoding, and basic triple CRUD.

In plain language: This is the "hello world" release. After installing pg_ripple into a PostgreSQL database, a user can store facts (called triples — think "subject → relationship → object", e.g. "Alice → knows → Bob") and retrieve them by pattern. No query language yet — just the basic building blocks. Internally, every piece of text (names, URLs, values) is converted to a compact number for fast storage and comparison. This release also sets up automated testing so that every future change is verified.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • pgrx 0.17 project scaffolding targeting PostgreSQL 18
  • Extension bootstrap: CREATE EXTENSION pg_ripple creates _pg_ripple schema
  • Dictionary encoder
    • Unified dictionary table (IRIs, blank nodes, literals in a single table with kind discriminator — avoids ID space collision between separate resource/literal tables)
    • Hash-Backed Sequence encoding (Route 2): XXH3-128 is computed over kind_le_bytes || term_utf8 (kind is mixed in so the same string as different term types maps to distinct IDs); the full 16-byte hash is stored in a BYTEA column with a UNIQUE index as the collision-detection key; a PostgreSQL GENERATED ALWAYS AS IDENTITY sequence produces the dense, sequential i64 join key used in every VP table. This avoids the birthday-problem collision risk of schemes that truncate the hash to 64 bits (collision expected at ~4 billion terms in 64-bit space).
    • Backend-local encode cache (LruCache<u128, i64>, keyed on full 128-bit hash) and decode cache (LruCache<i64, String>)
    • Encode/decode SQL functions: pg_ripple.encode_term(), pg_ripple.decode_id()
  • Vertical Partitioning from day one
    • Dynamic VP table management: auto-create _pg_ripple.vp_{predicate_id} tables on first triple with a new predicate
    • Predicate catalog: _pg_ripple.predicates (id BIGINT, table_oid OID, triple_count BIGINT)
    • Dual B-tree indices per VP table: (s, o) and (o, s)
    • Global statement identifier sequence: _pg_ripple.statement_id_seq — every VP table row gets a globally-unique SID via i BIGINT NOT NULL DEFAULT nextval('statement_id_seq')
    • SIDs are not exposed to users in v0.1.0 but are available for internal use from the start (prerequisite for RDF-star in v0.4.0)
  • Basic triple CRUD
    • pg_ripple.insert_triple(s TEXT, p TEXT, o TEXT)
    • pg_ripple.delete_triple(s TEXT, p TEXT, o TEXT)
    • pg_ripple.triple_count() RETURNS BIGINT
  • Basic querying (SQL-level, no SPARQL yet)
    • pg_ripple.find_triples(s TEXT, p TEXT, o TEXT) RETURNS TABLE (s TEXT, p TEXT, o TEXT, g TEXT) — any param can be NULL for wildcard; returns decoded string values
  • Unit tests for dictionary encode/decode round-trips
  • Integration test: insert + query cycle
  • pg_regress: dictionary.sql (encode/decode, prefix expansion, hash collision behaviour), basic_crud.sql (insert, delete, find_triples, triple_count)
  • CI pipeline (GitHub Actions)
  • GUC-gated lazy initialization
    • Merge worker, SHACL engine, and reasoning engine only start when their respective GUCs are enabled (pg_ripple.merge_threshold > 0, pg_ripple.shacl_mode != 'off', pg_ripple.inference_mode != 'off')
    • Reduces resource overhead for deployments that use only a subset of features
  • Error taxonomy module (src/error.rs)
    • thiserror-based error types with PT error code constants
    • Initial ranges: dictionary errors (PT001–PT099) and storage errors (PT100–PT199)
    • PostgreSQL-style formatting: lowercase first word, no trailing period
    • Extended in subsequent milestones as new subsystems are added (see §13.6 of the Implementation Plan for the complete PT001–PT799 range table)

Shared memory note: v0.1.0 through v0.5.1 use a backend-local lru::LruCache for the dictionary cache. This avoids requiring shared_preload_libraries for the "hello world" release and defers the pgrx shared-memory complexity to v0.6.0 when the HTAP architecture actually needs it. The shared-memory dictionary cache, bloom filters, slot versioning, and pg_ripple.shared_memory_size startup GUC are all introduced in v0.6.0.

Exit Criteria

A user can install the extension, insert triples (routed to per-predicate VP tables), and query them back by pattern. No shared_preload_libraries configuration required. VP tables are created dynamically on first encounter of a new predicate.


v0.2.0 — Bulk Loading & Named Graphs

Theme: Bulk data import, rare-predicate consolidation, named graphs, and prefix management.

In plain language: This release adds bulk import: users can load large RDF data files (in Turtle and N-Triples formats) in one go, rather than inserting facts one at a time. Named graphs (the ability to group facts into labelled collections) are introduced here too. A "rare predicate" consolidation table prevents catalog bloat when datasets have thousands of distinct predicates. N-Triples export is included for test verification and round-trip checking.

Storage partition note: In v0.2.0 through v0.5.0, each VP table is a single flat table — there is no delta/main split yet. All reads and writes target the same table. The HTAP dual-partition architecture (separate _delta and _main tables with a background merge worker) is introduced in v0.6.0 via an explicit schema migration that renames existing VP tables and creates the initial _main partition.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • Rare-predicate consolidation table
    • Predicates with fewer than pg_ripple.vp_promotion_threshold triples (default: 1,000) are stored in a shared _pg_ripple.vp_rare (p BIGINT, s BIGINT, o BIGINT, g BIGINT, i BIGINT) table with a primary composite index on (p, s, o) and two secondary indices: (s, p) for DESCRIBE queries and (g, p, s, o) for efficient graph-drop bulk-delete
    • Promotion is deferred to end-of-statement (not mid-batch): during a bulk load, triples accumulate in vp_rare; after the load completes, predicates exceeding the threshold are promoted in a single INSERT … SELECT + DELETE transaction — avoids disrupting in-flight COPY streams
    • pg_ripple.promote_rare_predicates() can also be called manually or by the background merge worker
    • Prevents catalog bloat for predicate-rich datasets (DBpedia ≈60K predicates, Wikidata ≈10K) — avoids hundreds of thousands of PG objects, reduces planner overhead, and cuts VACUUM cost
  • _pg_ripple.statements range-mapping catalog
    • Maintained by the merge worker; stores (sid_min, sid_max, predicate_id, table_oid) range rows rather than one row per statement — resolved via binary search in O(log n) with no full-table scans
    • After each merge cycle the worker inserts one range row per VP table covering the SIDs allocated since the last merge; because SIDs are drawn from a monotonically-increasing sequence, ranges are non-overlapping
    • Required for v0.4.0 RDF-star where SIDs appear as subjects/objects in other VP tables and must be unambiguously resolved to their owning VP table
  • Named graph support (basic)
    • g column in VP tables
    • pg_ripple.create_graph(), pg_ripple.drop_graph(), pg_ripple.list_graphs()
  • pg_ripple.named_graph_optimized GUC (default: off)
    • When enabled, adds an optional (g, s, o) index per dedicated VP table (and equivalent coverage on vp_rare) to accelerate graph-scoped queries (e.g. list all triples in graph G, drop a named graph)
    • Off by default to avoid index bloat for workloads that do not use named graphs heavily
  • Blank node document-scoping
    • Each bulk load operation is assigned a monotonically-increasing load_generation counter from a shared sequence
    • Blank nodes are hashed as "{generation}:{label}" — so _:b0 from two different load calls yields two distinct dictionary IDs
    • Prevents incorrect merging of blank nodes across document boundaries, which would corrupt data in multi-file loads
    • Also applies to INSERT DATA (SPARQL Update, v0.5.1+) which always gets its own generation
  • Bulk loader (N-Triples)
    • pg_ripple.load_ntriples(data TEXT) RETURNS BIGINT
    • Streaming parser via rio_turtle crate
    • Batch encoding + COPY for throughput
  • Bulk loader (N-Quads)
    • pg_ripple.load_nquads(data TEXT) RETURNS BIGINT
    • Standard format for named-graph quads (<s> <p> <o> <g> .); same rio_turtle parser path as N-Triples
    • Route quads to the appropriate named graph (g column) automatically
  • Bulk loader (Turtle)
    • pg_ripple.load_turtle(data TEXT) RETURNS BIGINT
    • Prefix declarations auto-registered
    • Blank node scoping per load operation
    • rio_turtle crate already handles both formats — incremental parser work
  • Bulk loader (TriG)
    • pg_ripple.load_trig(data TEXT) RETURNS BIGINT
    • Turtle with named graph blocks (GRAPH <g> { … }) — the standard interchange format for named-graph Turtle data
    • Uses the same rio_turtle streaming parser; named graph IRI is dictionary-encoded and stored in the g column
  • File-path bulk load variants
    • pg_ripple.load_turtle_file(path TEXT) RETURNS BIGINT
    • pg_ripple.load_ntriples_file(path TEXT) RETURNS BIGINT
    • pg_ripple.load_nquads_file(path TEXT) RETURNS BIGINT
    • pg_ripple.load_trig_file(path TEXT) RETURNS BIGINT
    • Reads via pg_read_file() with superuser privilege check — prevents unauthorized file access
    • Essential for datasets larger than ~1 GB where passing data as a TEXT parameter exceeds PostgreSQL's TEXT size limit and imposes significant memory overhead
    • Returns count of loaded triples; otherwise identical behaviour to the inline TEXT variants
  • IRI prefix management
    • pg_ripple.register_prefix(prefix TEXT, expansion TEXT)
    • pg_ripple.prefixes() RETURNS TABLE
    • Prefix expansion in encode/decode paths
  • ANALYZE after bulk loads
    • All inline and file-path load functions run ANALYZE on affected VP tables after load completes
    • Ensures the PostgreSQL planner has accurate selectivity estimates for generated SQL — critical for good join plans in v0.3.0+
  • Benchmarks: insert throughput (1M triples) — benchmarks/insert_throughput.sql
  • Performance regression baseline: benchmarks/ci_benchmark.sh records insert throughput and point-query latency; CI benchmark job uploads results as artifacts and can gate on >10% regression
  • N-Triples / N-Quads export (basic)
    • pg_ripple.export_ntriples(graph TEXT DEFAULT NULL) RETURNS TEXT
    • pg_ripple.export_nquads(graph TEXT DEFAULT NULL) RETURNS TEXT — exports all named graphs as NQuads when graph is NULL; a single graph when specified
    • Streaming variants returning SETOF TEXT for large graphs
    • Essential for verifying bulk load round-trips in v0.2.0 testing
  • pg_regress test suite: triple_crud.sql, named_graphs.sql, export_ntriples.sql, nquads_trig.sql (N-Quads round-trip, TriG named-graph import, file-path loaders)
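
As a round-trip sketch of the multi-graph path delivered above (graph IRI and data illustrative):

```sql
-- Load quads grouped into a named graph block (TriG)
SELECT pg_ripple.load_trig('
  @prefix ex: <http://example.org/> .
  GRAPH ex:g1 {
    ex:alice ex:knows ex:bob .
  }
');

-- Export just that graph as N-Quads, or pass NULL for everything
SELECT pg_ripple.export_nquads('http://example.org/g1');
```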

Exit Criteria

Rare-predicate consolidation table absorbs low-frequency predicates. Bulk loading >50K triples/sec on commodity hardware. Named graphs functional. All four inline formats (N-Triples, N-Quads, Turtle, TriG) and their file-path counterparts load correctly. Multi-graph data can be loaded via N-Quads/TriG and round-tripped via N-Quads export. VP tables have current planner statistics after bulk load.


v0.3.0 — SPARQL Query Engine (Basic)

Theme: Parse and execute SPARQL SELECT and ASK queries with basic graph patterns, named graph querying, initial join optimizations, and plan caching from day one.

In plain language: SPARQL is the standard language for asking questions over linked data — the same way SQL is for relational databases. This release makes pg_ripple understand SPARQL, so users can write queries like "find all people who know someone who works at Acme Corp" using the official W3C syntax. It also enables querying across named graphs (created in v0.2.0) using the standard SPARQL GRAPH keyword.
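
The "works at Acme Corp" question from the paragraph above might be written like this — a sketch using the pg_ripple.sparql() entry point delivered in this release (all IRIs illustrative):

```sql
SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?person WHERE {
    ?person ex:knows   ?friend .
    ?friend ex:worksAt ex:AcmeCorp .
  }
');
```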

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Prerequisites

  • sparopt availability check (must be resolved before beginning v0.3.0): verify that sparopt is published to crates.io with a stable, usable API and pin the version. If unavailable or API-unstable, absorb its filter-pushdown and constant-folding work directly into pg_ripple's own algebra optimizer pass (src/sparql/algebra.rs) before starting v0.3.0 — do not begin v0.3.0 development without resolving this gate.

Deliverables

  • sparopt first-pass algebra optimizer (sparopt crate)
    • sparopt 0.3 is published on crates.io and pinned; direct conversion between sparopt and spargebra algebra types is unavailable (distinct type hierarchies), so filter-pushdown and constant-folding are implemented inline in src/sparql/sqlgen.rs per the fallback clause
  • SPARQL parser integration (spargebra crate)
    • Parse SPARQL SELECT and ASK queries into algebra tree
    • Support: Basic Graph Patterns (BGP), FILTER, OPTIONAL, LIMIT, OFFSET, ORDER BY, DISTINCT
    • GRAPH ?g { ... } patterns and FROM / FROM NAMED dataset clauses — map to WHERE g = encode(uri) filters on VP tables
  • Per-query EncodingCache (src/sparql/sqlgen.rs Ctx.per_query)
    • Short-lived HashMap for IRIs and literals seen within a single SPARQL query
    • Avoids repeated SPI dictionary look-ups for constants that appear multiple times in one query
  • SQL generator (initial)
    • BGP → JOIN across VP tables (integer equality)
    • FILTER → WHERE clause on integer-encoded values (dictionary-join decode for type comparisons; inline encoding deferred to v0.5.0)
    • OPTIONAL → LEFT JOIN
    • LIMIT/OFFSET/ORDER BY passthrough
    • DISTINCT → SQL DISTINCT
  • Query executor
    • pg_ripple.sparql(query TEXT) RETURNS SETOF JSONB
    • SPI execution of generated SQL
    • Batch dictionary decode: collect all output i64 IDs from the result set, decode in a single WHERE id IN (...) query, build an in-memory lookup map, then emit human-readable rows — avoids per-row dictionary round-trips
  • SPARQL ASK
    • ASK → SELECT EXISTS(...) → returns BOOLEAN
    • pg_ripple.sparql_ask(query TEXT) RETURNS BOOLEAN
  • Join optimizations (phase 1)
    • Self-join elimination for star patterns
    • Filter pushdown: encode FILTER constants before SQL generation
  • Query plan caching (introduced in v0.3.0 — not deferred to v0.13.0)
    • Cache SPARQL→SQL translation results keyed by query text
    • pg_ripple.plan_cache_size GUC (default: 256; 0 = disabled)
  • pg_ripple.sparql_explain(query TEXT, analyze BOOL DEFAULT false) RETURNS TEXT — show generated SQL; analyze := true executes the query and augments the output with actual row counts
  • SQL injection / adversarial tests: verify that SPARQL queries containing SQL metacharacters in IRIs, literals, and prefixed names are safely dictionary-encoded and never reach generated SQL as raw strings
  • pg_regress: sparql_queries.sql (10+ test queries), sparql_injection.sql (adversarial inputs)
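
A sketch of inspecting the SPARQL→SQL translation with the explain function listed above (query is illustrative):

```sql
-- Show the generated SQL; analyze := true also executes the query
-- and augments the output with actual row counts
SELECT pg_ripple.sparql_explain('
  PREFIX ex: <http://example.org/>
  SELECT ?o WHERE { ex:alice ex:knows ?o . }
', analyze := true);
```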

Exit Criteria

Users can run SPARQL SELECT and ASK queries with BGPs, FILTER, OPTIONAL, and GRAPH patterns against data loaded via bulk load. Named graph queries work correctly. Queries return correct results.


v0.4.0 — RDF-star / Statement Identifiers

Theme: Quoted triples, statement-level metadata, and LPG-ready storage — make statements about statements.

In plain language: Standard RDF can say "Alice knows Bob". But it can't directly say "Alice said that she knows Bob" or "The fact that Alice knows Bob was recorded on January 5th". RDF-star (now part of the RDF 1.2 standard) solves this by allowing triples to be embedded inside other triples — called quoted triples. This is essential for provenance ("where did this fact come from?"), temporal annotations ("when was this true?"), and trust ("who asserted this?"). By delivering this immediately after basic SPARQL, pg_ripple becomes LPG-ready from the start: Labeled Property Graph edges with properties (e.g. [:KNOWS {since: 2020}]) map directly to RDF-star annotations over statement identifiers already present in the VP tables since v0.1.0. This is a cross-cutting change that touches parsing, storage, dictionary encoding, and the SPARQL engine.
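
In N-Triples-star syntax, "the fact that Alice knows Bob was asserted by Carol" places a quoted triple in the subject position — a sketch with illustrative IRIs, loaded via the existing bulk loader:

```sql
SELECT pg_ripple.load_ntriples('
<< <http://example.org/alice> <http://example.org/knows> <http://example.org/bob> >> <http://example.org/assertedBy> <http://example.org/carol> .
');
```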

Effort estimate: 8–10 person-weeks

Completed items (click to expand)

Design rationale — why so early?

The OneGraph (1G) research initiative (Lassila et al., 2023; Poseidon engine, AWS Neptune Analytics) demonstrates that a unified SPOI (Subject, Predicate, Object, statement-Identifier) storage model is the foundation for breaking the "graph model lock-in" between RDF and LPG. By introducing statement identifiers in v0.1.0 (storage) and RDF-star in v0.4.0 (query), pg_ripple achieves 1G-compatible storage before any advanced features are built on top. Every subsequent milestone (SHACL, Datalog, SPARQL Update, Cypher/GQL) benefits from statement IDs being available from the start.

Patent clearance: RDF-star is a W3C standard developed under the W3C Patent Policy (Royalty-Free). Statement identifiers are well-established prior art (RDF reification, 2004; Named Graphs, 2005; RDF-star Community Group, 2014). The 1G abstract data model is published academic research (Semantic Web Journal, doi:10.3233/SW-223273), not patented technology. Poseidon's proprietary implementation details (P8APL, PAX pages, lock-free adjacency lists) are specific to Amazon's in-memory engine and are not replicated here — pg_ripple uses PostgreSQL's native heap/WAL/MVCC storage.

Deliverables

  • Quoted triple syntax in parsers
    • N-Triples-star: << <http://...Alice> <http://...knows> <http://...Bob> >> <http://...assertedBy> <http://...Carol> .
    • Implemented via a custom recursive-descent N-Triples-star line parser (no external dependency conflicts)
    • Supports subject-position and object-position quoted triples, nested quoted triples
    • Note: Turtle-star deferred to v0.5.x; load_ntriples() handles N-Triples-star fully
  • Dictionary encoding for quoted triples
    • New term type: KIND_QUOTED_TRIPLE = 5 — XXH3-128 hash of (s_id, p_id, o_id)
    • qt_s, qt_p, qt_o columns added to _pg_ripple.dictionary via ALTER TABLE … ADD COLUMN IF NOT EXISTS
    • pg_ripple.encode_triple(s TEXT, p TEXT, o TEXT) RETURNS BIGINT
    • pg_ripple.decode_triple(id BIGINT) RETURNS JSONB
  • Statement identifier activation
    • pg_ripple.insert_triple(s TEXT, p TEXT, o TEXT, g TEXT DEFAULT NULL) RETURNS BIGINT — returns SID
    • pg_ripple.get_statement(i BIGINT) RETURNS JSONB — look up a statement by its SID
  • Storage for edge properties via SIDs
    • Annotation triples use the SID of the annotated statement as their subject — regular BIGINT values, no structural change to VP tables
    • Nested quoted triples supported
  • SPARQL-star query support
    • TermPattern::Triple handled in sparql/sqlgen.rs via ground_term_id() — ground (all-constant) quoted triple patterns compile to a dictionary lookup + equality condition
    • Uses spargebra/sparql-12 and sparopt/sparql-12 features (properly gates oxrdf/rdf-12 to avoid match-exhaustiveness errors)
    • Variable-inside-quoted-triple deferred to v0.5.x
  • Bulk load support for RDF-star data
    • pg_ripple.load_ntriples() accepts N-Triples-star input
    • pg_ripple.load_turtle(), pg_ripple.load_nquads(), pg_ripple.load_trig() use rio_turtle (no RDF-star; emits warning)
  • W3C SPARQL-star conformance gate: tests/pg_regress/sql/sparql_star_conformance.sql — N-Triples-star parsing, dictionary round-trips, SID lifecycle, annotation patterns, ground triple patterns, data integrity, known-limitation documentation
  • pg_regress: rdf_star_load.sql (load N-Triples-star, encode/decode round-trip, SID lifecycle)
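
The SID workflow above can be sketched end-to-end (IRIs and the returned SID value are illustrative):

-- insert_triple() returns the statement identifier (SID)
SELECT pg_ripple.insert_triple(
  'http://example.org/alice',
  'http://example.org/knows',
  'http://example.org/bob'
) AS sid;

-- Look the statement up again by its SID (42 is illustrative)
SELECT pg_ripple.get_statement(42);

-- Or load the equivalent annotation directly as N-Triples-star
SELECT pg_ripple.load_ntriples('
  << <http://example.org/alice> <http://example.org/knows> <http://example.org/bob> >>
     <http://example.org/assertedBy> <http://example.org/carol> .
');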

Exit Criteria

Users can load RDF-star data (N-Triples-star; Turtle-star follows in v0.5.x), query it with SPARQL-star triple term patterns, and use statement identifiers to model edge properties. SIDs are returned from insert operations and can be used as subjects/objects in subsequent triples. The storage layer is LPG-ready.


v0.5.0 — SPARQL Query Engine (Advanced — Query Completeness)

Theme: Property paths, UNION, aggregates, subqueries, and advanced join optimizations.

In plain language: This release teaches the query engine to handle more powerful questions. Property paths let you follow chains of relationships — e.g. "find everyone reachable through any number of 'knows' links" (like a social network friend-of-a-friend search). Aggregates let you compute totals and averages ("how many people work in each department?"). This is a pure query-engine release with no storage changes, isolating query completeness from the inline encoding and write-path work in v0.5.1.
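
Both capabilities can be combined in one query — a sketch over illustrative data, following ex:manages chains of any length and counting reachable reports per manager:

SELECT * FROM pg_ripple.sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?mgr (COUNT(?report) AS ?n) WHERE {
    ?mgr ex:manages+ ?report .
  }
  GROUP BY ?mgr
');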

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • Property path compilation
    • + (one or more) → WITH RECURSIVE CTE
    • * (zero or more) → WITH RECURSIVE CTE with zero-hop anchor
    • ? (zero or one) → UNION of direct + zero-hop
    • / (sequence) → chained joins
    • | (alternative) → UNION
    • ^ (inverse) → swap s/o
    • Cycle detection via PG18 CYCLE clause (hash-based, replaces array-based visited tracking for O(1) membership checks instead of O(n) array scans)
    • pg_ripple.max_path_depth GUC
    • Known performance constraint: PostgreSQL materializes each level of a WITH RECURSIVE CTE into a work-table. For deep traversals (depth > ~15) or wide fan-out on graphs with 10M+ triples the per-level copy cost becomes the bottleneck. The <100 ms target in §13 benchmarks applies to bounded-depth paths (depth ≤ 10) on typical RDF datasets; unbounded paths on dense graphs will exceed it. A purpose-built graph traversal engine would outperform this approach at extreme depth/fan-out, but that is out of scope for v1.0.
  • UNION / MINUS
    • UNION → SQL UNION
    • MINUS → SQL EXCEPT
  • Aggregates
    • COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT
    • GROUP BY → SQL GROUP BY
    • HAVING → SQL HAVING
  • Subqueries
    • Nested SELECT in WHERE / FROM clause
  • BIND / VALUES
    • BIND → SQL column alias
    • VALUES → SQL VALUES clause
  • Resource exhaustion tests: Cartesian-product queries, unbounded property paths on cyclic graphs, deeply nested subqueries — verify that max_path_depth, statement_timeout, and memory limits prevent runaway resource consumption
  • pg_regress: property_paths.sql, aggregates.sql, resource_limits.sql (exhaustion tests)
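
As a rough sketch of the compilation target for a one-or-more path (p+): assuming a VP table vp_knows(s, o), the generated SQL has approximately this shape (table and column names are illustrative, not the real generated SQL):

-- one-or-more path over vp_knows, with the CYCLE clause stopping loops
WITH RECURSIVE path(s, o) AS (
  SELECT s, o FROM vp_knows            -- one hop
  UNION ALL
  SELECT p.s, v.o                      -- extend by one hop
  FROM path p
  JOIN vp_knows v ON v.s = p.o
) CYCLE s, o SET is_cycle USING visited
SELECT DISTINCT s, o FROM path;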

Documentation

See plans/documentation.md for the complete page-by-page specification. v0.5.0 carries the full catch-up backlog for v0.1.0–v0.4.0 in addition to new v0.5.0 pages.

Catch-up — v0.1.0 Foundation

  • Docs site scaffold: docs/book.toml, .github/workflows/docs.yml, docs/src/SUMMARY.md
  • user-guide/introduction.md, user-guide/installation.md, user-guide/getting-started.md
  • user-guide/sql-reference/index.md, triple-crud.md, dictionary.md, prefix.md
  • reference/changelog.md (mirror), reference/roadmap.md (mirror), reference/security.md (stub), research/index.md

Catch-up — v0.2.0 Bulk Loading & Named Graphs

  • user-guide/sql-reference/bulk-load.md, user-guide/sql-reference/named-graphs.md
  • user-guide/best-practices/bulk-loading.md
  • user-guide/configuration.md (initial: vp_promotion_threshold, named_graph_optimized, plan_cache_size)
  • reference/faq.md (seed: 10+ questions covering v0.1.0–v0.4.0)

Catch-up — v0.3.0 SPARQL Basic

  • user-guide/playground.md — Docker sandbox ⭐
  • user-guide/sql-reference/sparql-query.md (initial: SELECT, ASK, EXPLAIN)
  • user-guide/best-practices/sparql-patterns.md (initial)
  • reference/troubleshooting.md (initial)

Catch-up — v0.4.0 RDF-star

  • user-guide/sql-reference/rdf-star.md
  • user-guide/best-practices/data-modeling.md (initial)

New in v0.5.0

  • user-guide/sql-reference/sparql-query.md expanded: property paths, aggregates, UNION/MINUS, subqueries, BIND/VALUES
  • user-guide/best-practices/sparql-patterns.md expanded: property path recipes, resource exhaustion safeguards
  • user-guide/configuration.md expanded: max_path_depth GUC

Exit Criteria

SPARQL 1.1 Query coverage for property paths, UNION/MINUS, aggregates, subqueries, BIND/VALUES. Property path queries complete with hash-based cycle detection via PG18 CYCLE clause. Docs site is live on GitHub Pages with all catch-up pages written.


v0.5.1 — SPARQL Advanced (Storage, Serialization & Write)

Theme: Inline value encoding, CONSTRUCT/DESCRIBE, INSERT DATA/DELETE DATA, and full-text search.

In plain language: This release introduces inline value encoding — a performance optimization that eliminates dictionary lookups for numeric and date comparisons. It changes the fundamental ID space model (introducing a dual-space interpretation), which is why it is separated from the pure query-engine work in v0.5.0. It also adds the two simplest SPARQL Update forms (INSERT DATA / DELETE DATA) so standard RDF tools can write to pg_ripple, CONSTRUCT and DESCRIBE to complete the four standard SPARQL query forms, and full-text search for efficient text matching.
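
The two Update forms look like this through sparql_update(), which returns the count of affected triples (IRIs illustrative):

SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  INSERT DATA { ex:alice ex:worksFor ex:acme . }
');

SELECT pg_ripple.sparql_update('
  PREFIX ex: <http://example.org/>
  DELETE DATA { ex:alice ex:worksFor ex:acme . }
');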

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • Inline value encoding (src/dictionary/inline.rs)
    • Type-tagged i64 encoding for xsd:integer, xsd:boolean, xsd:dateTime, xsd:date — FILTER comparisons on these types require zero dictionary round-trips
    • IDs allocated in monotonically increasing semantic order so range FILTERs (>, <, BETWEEN) compile directly to SQL numeric comparisons on the raw i64 column
    • Deferred from v0.3.0 to keep the initial SPARQL engine focused on a single ID space; now that the query engine is stable, the dual-space (inline + dictionary) model can be introduced safely
    • Note: xsd:double is stored in the dictionary rather than inline-encoded — truncating IEEE 754 doubles to 56 bits produces undefined precision/range behaviour; dictionary storage is safe and range comparisons on doubles are uncommon in SPARQL
  • SPARQL CONSTRUCT / DESCRIBE (JSONB output)
    • CONSTRUCT → returns triples as JSONB (Turtle/JSON-LD serialization deferred to v0.9.0)
    • DESCRIBE → Concise Bounded Description (CBD) as default algorithm
    • pg_ripple.describe_strategy GUC (values: 'cbd' / 'scbd' / 'simple'): selects the DESCRIBE expansion algorithm. Introduced here alongside DESCRIBE so the GUC is available from the first release that uses it.
    • Completes the four standard SPARQL query forms, making pg_ripple usable as an entity browser
  • Basic SPARQL Update (INSERT DATA / DELETE DATA)
    • Parse and execute INSERT DATA { … } statements via spargebra (already supports Update algebra)
    • Route through dictionary encoder + VP table insert path
    • Named graph support: INSERT DATA { GRAPH <g> { … } }
    • Parse and execute DELETE DATA { … } statements — exact-match triple deletion from VP tables
    • pg_ripple.sparql_update(query TEXT) RETURNS BIGINT — returns count of affected triples
    • Pattern-based updates (DELETE/INSERT WHERE), LOAD, CLEAR, DROP, CREATE deferred to v0.12.0
    • Enables standard RDF tools (Protégé, TopBraid, SPARQL workbenches) to write to pg_ripple without a custom adapter
  • Full-text search on literals
    • pg_ripple.fts_index(predicate TEXT) — create a GIN tsvector index on the dictionary for a predicate
    • SPARQL CONTAINS() and REGEX() FILTERs on indexed predicates rewrite to @@ / LIKE against the GIN index
    • pg_ripple.fts_search(query TEXT, predicate TEXT) RETURNS TABLE — direct full-text search API
    • Index is maintained incrementally on insert_triple() for indexed predicates
  • pg_regress: fts_search.sql, sparql_construct.sql, sparql_insert_data.sql, sparql_delete_data.sql, inline_encoding.sql
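
A sketch of the full-text search surface (predicate IRI and tsquery syntax illustrative):

-- Create a GIN tsvector index for one predicate's literals
SELECT pg_ripple.fts_index('http://purl.org/dc/terms/description');

-- Direct search API
SELECT * FROM pg_ripple.fts_search('graph & database',
                                   'http://purl.org/dc/terms/description');

-- Or let a SPARQL CONTAINS() FILTER rewrite to the GIN index
SELECT * FROM pg_ripple.sparql('
  SELECT ?s WHERE {
    ?s <http://purl.org/dc/terms/description> ?d .
    FILTER(CONTAINS(?d, "database"))
  }
');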

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/sparql-update.md — sparql_update(), INSERT DATA / DELETE DATA, named-graph variants
  • user-guide/sql-reference/fts.md — fts_index, fts_search, SPARQL CONTAINS/REGEX rewriting
  • user-guide/sql-reference/sparql-query.md expanded: CONSTRUCT / DESCRIBE, describe_strategy GUC
  • user-guide/best-practices/update-patterns.md — INSERT DATA vs bulk load, idempotent patterns

Exit Criteria

Inline value encoding eliminates dictionary lookups for numeric and date FILTER comparisons. SPARQL CONSTRUCT and DESCRIBE return correct JSONB results. INSERT DATA / DELETE DATA work for standard-compliant write operations. Full-text search on indexed literal predicates is functional.


v0.6.0 — HTAP Architecture

Theme: Separate read and write paths for concurrent OLTP/OLAP. Shared-memory dictionary cache. Subject pattern index.

In plain language: In a real production system, people are loading new data and running complex queries at the same time. Without special care, these two activities interfere with each other — writes block reads and vice versa. This release splits the storage into a "write inbox" and a "read-optimized archive" so both can happen simultaneously at full speed. It also adds a change notification system: applications can subscribe to be told whenever specific facts change (useful for triggering workflows, updating caches, or feeding dashboards). An in-memory cache shared across all database connections makes repeated lookups much faster. Optionally, the companion pg_trickle extension enables automatically updating live statistics.

Note: This release introduces shared_preload_libraries as a requirement — v0.1.0–v0.5.1 do not require it because they use a backend-local dictionary cache. The pg_ripple.shared_memory_size startup GUC must be set in postgresql.conf before starting PostgreSQL.
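
Concretely, the prerequisite looks like this in postgresql.conf (values illustrative):

# postgresql.conf — required from v0.6.0 onward
shared_preload_libraries = 'pg_ripple'
pg_ripple.shared_memory_size = '512MB'   # startup GUC; fixed at server start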

Effort estimate: 8–10 person-weeks

Completed items (click to expand)

Deliverables

  • Delta/Main partition split — schema migration
    • Each VP table is migrated from its flat single-table form (v0.1.0–v0.5.1) to a dual-partition form:
      1. CREATE TABLE _pg_ripple.vp_{id}_delta AS SELECT * FROM _pg_ripple.vp_{id} (copy existing rows to delta)
      2. CREATE TABLE _pg_ripple.vp_{id}_main (LIKE _pg_ripple.vp_{id}) (empty main, BRIN-indexed)
      3. ALTER TABLE _pg_ripple.vp_{id} RENAME TO vp_{id}_pre_htap (keep old table as backup)
      4. Update _pg_ripple.predicates catalog with new table OIDs
      5. Run an immediate merge cycle to promote rows from delta to main in sorted order
      6. Drop vp_{id}_pre_htap after merge completes successfully
    • The migration runs inside the ALTER EXTENSION pg_ripple UPDATE upgrade script — zero downtime during migration because rows still exist in delta until the merge completes and the query path immediately switches to UNION ALL of _main and _delta
    • vp_rare is not split (see vp_rare HTAP exemption below); all reads and writes target the single vp_rare table throughout
    • All writes target _delta; _main is append-only / read-optimized
    • Query path: UNION ALL of _main and _delta
  • Tombstone table for cross-partition deletes
    • When deleting a triple that may exist in _main, the delete is recorded in _pg_ripple.vp_{id}_tombstones (s BIGINT, o BIGINT, g BIGINT)
    • Query path becomes: (main EXCEPT tombstones) UNION ALL delta
    • The merge worker applies tombstones against main during each generation merge, then truncates the tombstone table
    • Necessary because _main is read-only between merges — a DELETE targeting a main-resident triple cannot modify _main directly
  • vp_rare HTAP exemption
    • vp_rare is not given a delta/main split — it remains a single flat table
    • Rare predicates see few writes by definition; delta/main overhead would exceed the benefit
    • Concurrent reads and writes on vp_rare are safe via PostgreSQL standard heap row-level locking
    • The bloom filter treats vp_rare conservatively (always queries it, no delta-skip shortcut)
  • Background merge worker
    • pgrx BackgroundWorker implementation
    • Configurable merge threshold via pg_ripple.merge_threshold GUC
    • Concurrency and locking: the rename/truncate step requires an AccessExclusiveLock. To avoid stalling the database, the merge worker uses a low lock_timeout with retry logic for the ALTER TABLE ... RENAME statement, so concurrent INSERT and SELECT operations are never left waiting behind a queued exclusive lock.
    • Fresh-table generation merge: rather than inserting into an existing _main table, create vp_{id}_main_new, insert all rows from both _main and _delta (minus tombstones) in sort order (ensuring BRIN pages are physically ordered), then atomically rename it to replace _main and TRUNCATE both _delta and _tombstones — writes to delta are never blocked during the merge and BRIN indexing is maximally effective because rows arrive in sorted order at table-creation time
    • BRIN index rebuild on main post-merge (concurrent where possible)
    • Shared-memory latch signaling
    • Also triggers pg_ripple.promote_rare_predicates() for any rare predicates that crossed the promotion threshold since the last merge
    • Runs ANALYZE on merged VP tables so the PostgreSQL planner has fresh selectivity estimates
    • Watchdog: if the merge worker heartbeat stalls for longer than pg_ripple.merge_watchdog_timeout (default: 300 s), _PG_init on the next backend connection logs a WARNING and attempts a restart
  • ExecutorEnd_hook latch-poke
    • When a write transaction commits more than pg_ripple.latch_trigger_threshold rows (default: 10,000), the hook immediately pokes the merge worker's latch to trigger an early merge
    • Prevents unbounded delta growth during bursty write workloads without requiring a polling loop
  • Bloom filter for delta existence checks
    • In shared memory, per VP table
    • Queries against main-only data skip delta scan
  • Dictionary LRU cache in shared memory
    • pg_ripple.dictionary_cache_size GUC
    • Shared across all backends via pgrx PgSharedMem
    • Sharded lock design: partition the hash map into N shards (default: 64), each with its own lightweight lock — eliminates global lock contention under concurrent encode/decode workloads
  • Shared-memory budget & back-pressure
    • pg_ripple.cache_budget GUC — utilization cap for the pre-allocated shared memory block (dictionary cache + bloom filters + merge worker buffers)
    • Automatic eviction priority: bloom filters reclaimed first, then oldest LRU dictionary entries
    • Back-pressure on bulk loads when shared memory is >90% of cache_budget — throttle batch size to prevent OOM
  • Shared-memory slot versioning
    • Each shared memory slot (declared via pgrx 0.17's pg_shmem_init! macro) carries a [u8; 8] magic constant (e.g. *b"pg_tripl") followed by a u32 layout version at its head
    • Version mismatch at _PG_init triggers a controlled re-initialization of the slot rather than corrupting state — essential for safe in-place upgrades
    • pgrx 0.17 API note: all shared memory sizes must be declared statically in _PG_init. The pg_ripple.shared_memory_size startup GUC determines the block size; it cannot be changed at runtime. Use the pgrx 0.17 PgSharedObject / PgSharedMem::new_object API (not the old PgSharedMem from ≤0.14) — verify against the pgrx 0.17 shmem examples
  • subject_patterns lookup table
    • _pg_ripple.subject_patterns(s BIGINT, predicates BIGINT[]) with a GIN index on predicates
    • Maintained by the merge worker after each generation merge (not on individual INSERTs — amortized cost)
    • Enables fast "which predicates does subject X have?" look-up for DESCRIBE queries and star-pattern rewriting in the algebra optimizer
  • object_patterns lookup table
    • _pg_ripple.object_patterns(o BIGINT, predicates BIGINT[]) with a GIN index on predicates
    • Maintained by the merge worker alongside subject_patterns
    • Solves the "unbound object problem": a reverse-edge scattergun query (?s ?p <Object>) is answered with one GIN lookup that returns only the predicates the object actually appears under, instead of forcing a UNION ALL across all VP tables
  • Statistics
    • pg_ripple.stats() JSONB: triple count, per-predicate counts, cache hit ratio, delta/main sizes
  • pg_trickle integration: live statistics (optional, when pg_trickle is installed)
    • pg_ripple.enable_live_statistics() creates _pg_ripple.predicate_stats and _pg_ripple.graph_stats stream tables
    • pg_ripple.stats() reads from stream tables instead of full-scanning VP tables (100–1000× faster)
    • _pg_ripple.rare_predicate_candidates stream table (IMMEDIATE mode) replaces merge-worker GROUP BY polling for VP promotion detection (§2.8)
    • _pg_ripple.vp_cardinality stream table provides live per-predicate row counts for BGP join reordering without waiting for ANALYZE (§2.10)
    • _pg_ripple.subject_patterns managed as a stream table — stays current between merge cycles for DESCRIBE and GIN queries (§2.12)
  • Change notification / CDC
    • pg_ripple.subscribe(pattern TEXT, channel TEXT) — emit NOTIFY on triple changes matching a predicate/graph pattern
    • Thin trigger-based CDC on VP delta tables; fires on INSERT/DELETE
    • Payload: JSON with {"op": "insert"|"delete", "s": ..., "p": ..., "o": ..., "g": ...} (integer IDs)
    • pg_ripple.unsubscribe(channel TEXT) to remove subscriptions
    • Enables downstream event-driven architectures (CDC consumers, webhooks, cache invalidation)
  • Concurrency correctness tests (partial — synchronous paths covered; concurrent bgworker + writer tests deferred)
    • change_notification.sql verifies CDC trigger correctness under sequential insert/delete
    • htap_merge.sql verifies delta→main promotion correctness
    • merge_edge_cases.sql verifies edge cases: empty-delta compact, idempotency, delta-resident deletes
  • Merge worker edge-case tests (covered by merge_edge_cases.sql)
    • Merge when delta is empty (no-op, no crash) ✓
    • compact() is idempotent ✓
    • Insert after compact goes to delta and is visible immediately ✓
    • Delete delta-resident triple removes it directly (no tombstone needed) ✓
    • Delete non-existent triple returns 0 ✓
    • Multiple compacts do not multiply rows ✓
  • Benchmark: concurrent read/write (pgbench custom scripts under HTAP load)
    • Heavy concurrent insert (delta growth) + complex SPARQL queries on main partition
    • Measure merge worker latency, delta bloat growth, query latency under concurrent writes
    • Baseline: >100K triples/sec sustained bulk insert with <500 ms query latency
  • Berlin SPARQL Benchmark (BSBM) execution with HTAP workload mixing reads and writes
    • Full BSBM query mix under concurrent insert workload
    • Comparison baselines with v0.5.0 (single-table, no-HTAP) results
  • pg_regress: htap_merge.sql, change_notification.sql, concurrent_write_merge.sql, htap_benchmarks.sql
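
A sketch of the CDC surface (the pattern argument syntax and payload values are illustrative):

-- Subscribe a NOTIFY channel to changes on one predicate
SELECT pg_ripple.subscribe('http://example.org/worksFor', 'hr_changes');
LISTEN hr_changes;

-- Each matching INSERT/DELETE now emits a payload such as:
--   {"op": "insert", "s": 1042, "p": 17, "o": 2210, "g": null}

SELECT pg_ripple.unsubscribe('hr_changes');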

Documentation

See plans/documentation.md for details.

  • user-guide/configuration.md — major expansion: all HTAP GUCs grouped by subsystem, shared_preload_libraries requirement column
  • user-guide/scaling.md — HTAP architecture diagram, delta/main lifecycle, merge worker tuning
  • user-guide/pre-deployment.md — production checklist: shared_preload_libraries, memory estimation, ANALYZE schedule
  • user-guide/sql-reference/admin.md — stats(), compact(), subscribe(), unsubscribe(), htap_migrate_predicate()
  • user-guide/best-practices/bulk-loading.md expanded: HTAP delta-growth, bulk-load strategies
  • reference/troubleshooting.md expanded: merge worker not starting, delta bloat, CDC not firing
  • reference/faq.md expanded: shared_preload_libraries, merge worker, change notifications
  • research/postgresql-deepdive.md (mirror plans/postgresql-triplestore-deep-dive.md)

Exit Criteria

Writes do not block reads. Merge worker operates correctly under concurrent writes and crash scenarios. >100K triples/sec bulk insert sustained. Change notifications fire correctly for matching patterns.


v0.7.0 — SHACL Validation (Core)

Theme: Data integrity enforcement via W3C SHACL shapes.

In plain language: SHACL is a standard way to define data quality rules — for example, "every Person must have exactly one email address" or "an age must be a number". When these rules are loaded, pg_ripple can automatically reject data that violates them the moment it is inserted, rather than discovering errors later. This is similar to how a spreadsheet can reject invalid entries in a cell. A validation report function lets you check existing data against the rules at any time.
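
The "exactly one email address" rule from the paragraph above, as a loadable SHACL shape (shape and property IRIs illustrative):

SELECT pg_ripple.load_shacl('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
      sh:path ex:email ;
      sh:minCount 1 ;
      sh:maxCount 1 ;
    ] .
');

SET pg_ripple.shacl_mode = 'sync';  -- reject violating inserts immediately
SELECT pg_ripple.validate();        -- full report over existing data, as JSONB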

Effort estimate: 4–6 person-weeks

Completed items (click to expand)

Deliverables

  • SHACL parser (Turtle-based shapes)
    • pg_ripple.load_shacl(data TEXT) — parse and store shapes
    • Internal shape IR stored in _pg_ripple.shacl_shapes
  • Exact SHACL validator compilation
    • Parse shapes to an internal IR that preserves W3C SHACL semantics
    • Compile validator plans over focus nodes and value nodes rather than reducing shapes to lossy table constraints
    • PostgreSQL constraints, triggers, and helper indices are allowed only as internal accelerators when semantics are proven equivalent for the specific shape pattern
  • Synchronous validation mode
    • Triggered on insert_triple() when pg_ripple.shacl_mode = 'sync'
    • Returns validation error immediately on constraint violation
    • Uses the same exact validator semantics as offline validation; no fast path weakens or changes SHACL meaning
  • Validation report
    • pg_ripple.validate(graph TEXT DEFAULT NULL) RETURNS JSONB
    • Full SHACL validation report as JSON
  • SHACL management
    • pg_ripple.list_shapes() RETURNS TABLE
    • pg_ripple.drop_shape(shape_uri TEXT)
  • pg_trickle integration: SHACL violation monitors (optional)
    • Simple cardinality/datatype constraints modeled as IMMEDIATE mode stream tables
    • Violations detected within the same transaction as the DML
    • _pg_ripple.violation_summary stream table aggregates dead-letter queue by shape/severity; feeds /metrics Prometheus endpoint without full queue scans (§2.13)
  • pg_regress: shacl_validation.sql, shacl_malformed.sql (invalid shape definitions, circular references, undefined target classes — verify clean error messages)
  • Explicit deduplication functions (on-demand cleanup; zero insert-time overhead)
    • pg_ripple.deduplicate_predicate(p_iri TEXT) RETURNS BIGINT — remove duplicate (s, o, g) rows for a single predicate, keeping the row with the lowest SID; returns count of rows removed
    • pg_ripple.deduplicate_all() RETURNS BIGINT — deduplicate all predicates across dedicated VP tables and vp_rare; returns total rows removed
    • Runs ANALYZE on all affected tables; safe to call at any time
    • Typical usage: call once after a bulk load that may contain duplicate triples
  • Merge-time deduplication (pg_ripple.dedup_on_merge GUC, default false)
    • When enabled, the HTAP generation merge (src/storage/merge.rs) changes from a plain UNION ALL accumulation to a deduplicating projection using DISTINCT ON (s, o, g) ORDER BY s, o, g, i ASC, retaining the lowest-SID row for each logical triple
    • Deduplication happens atomically during the regular background merge cycle — zero insert-time overhead; duplicates accumulate in the delta partition and are resolved when the merge worker fires
    • Between merges, queries through the (main EXCEPT tombstones) UNION ALL delta view may still observe short-lived duplicates from the delta portion
    • RDF-star interaction: SIDs of eliminated duplicate rows are not preserved; if RDF-star annotations exist on those SIDs, the annotations become orphaned. Use explicit dedup functions instead for datasets with active statement-level annotation workloads
  • pg_regress: deduplication.sql (explicit dedup functions; merge-time dedup via dedup_on_merge; verifies zero duplicates after each mechanism completes)
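
Typical deduplication usage, per the deliverables above (predicate IRI illustrative):

-- On-demand cleanup after a bulk load that may contain duplicates
SELECT pg_ripple.deduplicate_all();          -- returns total rows removed
SELECT pg_ripple.deduplicate_predicate('http://xmlns.com/foaf/0.1/knows');

-- Or resolve duplicates during background merges instead
-- (caution: may orphan RDF-star annotations on eliminated SIDs)
SET pg_ripple.dedup_on_merge = true;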

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/shacl.md — load_shacl, validate, list_shapes, drop_shape; validation report JSON structure; shacl_mode GUC
  • user-guide/best-practices/shacl-patterns.md (initial: NodeShape vs PropertyShape, sh:datatype/sh:minCount/sh:maxCount, sync mode latency impact)
  • user-guide/pre-deployment.md expanded: SHACL mode selection, load shapes before bulk import
  • reference/troubleshooting.md expanded: insert rejected by SHACL, shape parsing failures
  • user-guide/sql-reference/admin.md expanded: deduplicate_predicate, deduplicate_all, dedup_on_merge GUC, merge-time dedup semantics and RDF-star interaction

Exit Criteria

Delivered SHACL Core features are enforced at insert time with exact W3C semantics. Validation reports conform to SHACL spec. Malformed shapes are rejected with actionable error messages. Explicit deduplication functions correctly remove duplicate triples from all VP tables. Merge-time deduplication (when dedup_on_merge = true) produces duplicate-free _main tables after each merge cycle.


v0.8.0 — SHACL Advanced

Theme: Async validation pipeline and complex shapes.

In plain language: Builds on v0.7.0 by supporting more sophisticated data quality rules — for instance, "a person's address must be either a US address or a EU address (but not both)", or "if a company has more than 50 employees, it must have a compliance officer". It also adds a background validation mode so that checking complex rules doesn't slow down data loading — violations are flagged asynchronously and collected in a report queue.
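
The "US address or EU address" example can be expressed with sh:or (IRIs illustrative):

SELECT pg_ripple.load_shacl('
  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <http://example.org/> .
  ex:AddressShape a sh:NodeShape ;
    sh:targetClass ex:Address ;
    sh:or ( [ sh:class ex:USAddress ] [ sh:class ex:EUAddress ] ) .
');

SET pg_ripple.shacl_mode = 'async';  -- validate in the background worker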

Effort estimate: 4–6 person-weeks

Completed items (click to expand)

Deliverables

  • Asynchronous validation pipeline
    • Validation queue table: _pg_ripple.validation_queue
    • Background worker processes queue in batches
    • Dead letter queue for invalid triples with violation reports
    • pg_ripple.shacl_mode = 'async' GUC mode
  • Complex shape support
    • sh:class — type constraint via rdf:type lookup
    • sh:node — nested shape references
    • sh:or / sh:and / sh:not — logical constraint combinators
    • sh:qualifiedValueShape — qualified cardinality
  • pg_trickle integration: multi-shape DAG validation (optional at runtime: the monitors activate only when pg_trickle is installed, and the functions below degrade gracefully with a WARNING when it is not)
    • Multiple SHACL shapes compiled into per-shape IMMEDIATE pg_trickle stream tables (supported constraint types: sh:minCount, sh:maxCount, sh:datatype, sh:class); complex combinators (sh:or, sh:and, sh:not, sh:qualifiedValueShape) are not compiled to stream tables and are skipped gracefully
    • _pg_ripple.violation_summary_dag DAG-leaf stream table aggregates per-shape violation counts; automatically clears when upstream shape violations resolve — unlike the dead-letter queue, no manual cleanup required (§2.13)
    • pg_ripple.enable_shacl_dag_monitors() — creates all stream tables; returns 0 with a WARNING (no ERROR) when pg_trickle is not installed
    • pg_ripple.disable_shacl_dag_monitors() — drops all per-shape stream tables and the summary; safe to call when none are active
    • pg_ripple.list_shacl_dag_monitors() — lists active DAG monitor stream tables and compiled constraints
    • _pg_ripple.shacl_dag_monitors catalog table tracks all created monitors
  • pg_regress: shacl_advanced.sql, shacl_dag_monitors.sql

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/shacl.md expanded: async pipeline, validation queue, dead-letter queue
  • user-guide/best-practices/shacl-patterns.md expanded: sh:or/sh:and/sh:not, async mode for high-throughput ingestion, reading the dead-letter queue
  • reference/troubleshooting.md expanded: async violations not appearing, dead-letter queue backlog

Exit Criteria

Async validation pipeline operational. Complex SHACL shapes validated correctly with the same semantics as synchronous validation.


v0.9.0 — Serialization, Export & Interop

Theme: Full RDF I/O, remaining serialization formats, and Turtle/JSON-LD serialization for CONSTRUCT/DESCRIBE.

In plain language: RDF data comes in several standard file formats (Turtle, RDF/XML, JSON-LD). This release completes the set so that pg_ripple can import from and export to all of them — making it easy to exchange data with other tools and systems. It also adds Turtle and JSON-LD output formats for SPARQL CONSTRUCT and DESCRIBE queries (which have returned only JSONB since v0.5.1), and RDF-star serialization support.

Effort estimate: 3–4 person-weeks (the hardest parts — Turtle import, N-Triples export, and CONSTRUCT/DESCRIBE JSONB — were already delivered in v0.2.0, v0.3.0, and v0.5.1)

Note: Turtle import and N-Triples export were delivered in v0.2.0. CONSTRUCT/DESCRIBE (JSONB output) were delivered in v0.5.1.

Completed items (click to expand)

Deliverables

  • RDF/XML parser
    • pg_ripple.load_rdfxml(data TEXT) RETURNS BIGINT
  • Export functions
    • pg_ripple.export_turtle(graph TEXT DEFAULT NULL) RETURNS TEXT
    • pg_ripple.export_jsonld(graph TEXT DEFAULT NULL) RETURNS JSONB
    • Streaming variants returning SETOF TEXT for large graphs
  • SPARQL CONSTRUCT / DESCRIBE serialization formats
    • CONSTRUCT → returns triples as Turtle or JSON-LD (in addition to JSONB from v0.5.1)
    • DESCRIBE → Turtle and JSON-LD output options
  • SPARQL-star in CONSTRUCT / DESCRIBE (builds on v0.4.0 RDF-star)
    • CONSTRUCT can produce quoted triples in output
    • Turtle-star and N-Triples-star serialization in export functions
  • pg_regress: serialization.sql, sparql_construct.sql, rdf_star_construct.sql
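
The round-trip named in the exit criteria, sketched with illustrative IRIs:

SELECT pg_ripple.load_turtle('
  @prefix ex: <http://example.org/> .
  ex:widget ex:partOf ex:gadget .
');

SELECT pg_ripple.export_turtle();                         -- whole store as Turtle
SELECT pg_ripple.export_jsonld('http://example.org/g1');  -- one named graph as JSON-LD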

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/serialization.md — export_turtle, export_jsonld, load_rdfxml, streaming variants, SPARQL CONSTRUCT Turtle/JSON-LD output, RDF-star serialization
  • user-guide/best-practices/data-modeling.md expanded: interop format guide (Protégé → RDF/XML; LinkedData Platform → JSON-LD; CLI → N-Triples/N-Quads)
  • reference/faq.md expanded: supported import/export formats, JSON-LD for REST APIs

Exit Criteria

Round-trip: load Turtle → query → export Turtle. All major RDF serialization formats supported for both import and export.


v0.10.0 — Datalog Reasoning Engine

Theme: General-purpose rule-based inference over the triple store.

In plain language: This is the "intelligence layer". Users can define logical rules like "if A manages B and B manages C, then A indirectly manages C" — and the system will automatically figure out all the indirect management chains. It ships with two built-in rule sets covering the standard RDF and OWL vocabularies (the common language of the Semantic Web), so it can automatically derive facts like "if a Dog is a subclass of Animal, and Rex is a Dog, then Rex is also an Animal". Rules can also express things that must never be true — for example, "no one can be their own manager" — acting as logical integrity constraints. This is the largest single release in the roadmap.
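
The management-chain rule from the paragraph above might look like this in the Turtle-flavoured Datalog syntax. The load_rules() entry point and the exact triple-atom notation are illustrative assumptions, not the confirmed API (only load_rules_builtin() is specified below):

SELECT pg_ripple.load_rules('
  @prefix ex: <http://example.org/> .
  ?a ex:indirectlyManages ?c :- ?a ex:manages ?c .
  ?a ex:indirectlyManages ?c :- ?a ex:manages ?b, ?b ex:indirectlyManages ?c .
  :- ?x ex:manages ?x .   # empty-head constraint: no one manages themselves
');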

Effort estimate: 10–12 person-weeks

See plans/ecosystem/datalog.md for the full design.

Completed items (click to expand)

Deliverables

  • Rule parser (src/datalog/parser.rs)
    • Turtle-flavoured Datalog syntax: head :- body₁, body₂, … .
    • Variables (?x), prefixed IRIs, literals, named graph scoping (GRAPH)
    • Stratified negation via NOT keyword
    • Multi-head rules (h₁, h₂ :- body .) compiled to separate INSERT … SELECT statements within the same stratum
  • source column in VP tables and vp_rare
    • source SMALLINT DEFAULT 0 added to every dedicated VP table and to _pg_ripple.vp_rare in the v0.10.0 migration
    • 0 = explicitly asserted; 1 = derived (inferred by Datalog rules)
    • Enables filtering out inferred triples at scan time without a join
    • Migration script uses ALTER TABLE … ADD COLUMN source SMALLINT NOT NULL DEFAULT 0 for each VP table and for vp_rare; zero-downtime because PostgreSQL fast-path adds the column with the stored default without rewriting the table
  • Tiered hot/cold dictionary (src/dictionary/hot.rs)
    • _pg_ripple.resources_hot (UNLOGGED) holds IRIs ≤512B and all predicate/prefix IRIs — the working set that fits in shared buffers
    • Full resources table unchanged; encoder checks hot table first
    • pg_prewarm warms the hot table at server start via _PG_init
    • Dramatically reduces random I/O for the most-accessed terms at large scale (100M+ triples)
  • Stratification engine (src/datalog/stratify.rs)
    • Predicate dependency graph with positive/negative edges
    • SCC-based stratification with clear error messages for unstratifiable programs
  • SQL compiler (src/datalog/compiler.rs)
    • Non-recursive rules → INSERT … SELECT … ON CONFLICT DO NOTHING
    • Recursive rules → WITH RECURSIVE … CYCLE
    • Negation → NOT EXISTS (higher strata only)
    • All constants dictionary-encoded before SQL generation (integer joins everywhere)
  • Arithmetic built-ins
    • Comparison operators (>, >=, <, <=, =, !=) → SQL WHERE clause expressions
    • Arithmetic expressions (?z IS ?x + ?y) → SQL computed columns
    • String functions (STRLEN, REGEX) → SQL LENGTH, ~ with dictionary decode join
  • Constraint rules (integrity constraints)
    • Empty-head rules (:- body .) express patterns that must never hold
    • Compile to existence checks; materialized mode → pg_trickle IMMEDIATE stream tables for in-transaction validation
    • pg_ripple.check_constraints() returns violations as JSONB
    • pg_ripple.enforce_constraints GUC: 'error' / 'warn' / 'off'
    • Directly complements and extends SHACL validation
  • Built-in rule sets (src/datalog/builtins.rs)
    • pg_ripple.load_rules_builtin('rdfs') — W3C RDFS entailment (13 rules)
    • pg_ripple.load_rules_builtin('owl-rl') — W3C OWL 2 RL profile (~80 rules)
  • On-demand execution mode (no pg_trickle needed)
    • Derived predicates compiled to inline CTEs injected into SPARQL→SQL at query time
    • SET pg_ripple.inference_mode = 'on_demand'
  • dictionary_hot incremental maintenance (optional, when pg_trickle is installed)
    • Model _pg_ripple.dictionary_hot as a stream table over dictionary filtered to hot-eligible IRIs
    • New predicate and prefix-registry IRIs appear in the hot table within 30s of being encoded — no manual rebuild (§2.9)
  • Materialized execution mode (optional, requires pg_trickle)
    • pg_ripple.materialize_rules(schedule => '10s') — derived predicates as stream tables
    • pg_trickle DAG scheduler respects stratum ordering automatically
  • Catalog and management
    • _pg_ripple.rules catalog table
    • _pg_ripple.rule_sets catalog: groups named rules with a rule_hash BYTEA (XXH3-64) for cache invalidation — re-activating a rule set with an unchanged hash resumes from prior derived state without re-derivation
    • Derived predicates registered in _pg_ripple.predicates with derived = TRUE
    • pg_ripple.load_rules(), pg_ripple.list_rules(), pg_ripple.drop_rules()
    • pg_ripple.enable_rule_set(name TEXT) / pg_ripple.disable_rule_set(name TEXT) — activate or deactivate a named rule set without dropping it
  • SPARQL engine integration
    • Derived VP tables transparent to query planner (same look-up path as base VP tables)
    • On-demand mode prepends CTEs to generated SQL
    • pg_ripple.sparql(query TEXT, include_derived BOOL DEFAULT true) — when false, appends AND source = 0 to all VP table scans to exclude inferred triples (no-inference mode)
  • SHACL-AF sh:rule bridge
    • Detect sh:rule entries in loaded SHACL shapes that contain Datalog-compatible triple rules
    • Compile sh:rule bodies to Datalog IR and register in _pg_ripple.rules
    • Bidirectional: SHACL shapes inform Datalog constraints; Datalog-derived triples are visible to SHACL validation
    • pg_ripple.load_shacl() auto-registers any sh:rule triples as Datalog rules when pg_ripple.inference_mode != 'off'
  • RDF-star integration in Datalog (builds on v0.4.0 RDF-star)
    • Quoted triples can appear in Datalog rule heads and bodies
    • Enables provenance rules: << ?s ?p ?o >> ex:derivedBy ex:rule1 :- ?s ?p ?o, RULE(ex:rule1) .
    • Statement identifiers (SIDs) can be used in rule bodies to annotate derived triples
  • pg_regress: datalog_rdfs.sql, datalog_owl_rl.sql, datalog_custom.sql, datalog_negation.sql, datalog_arithmetic.sql, datalog_constraints.sql, shacl_af_rule.sql, datalog_malformed.sql (syntax errors, unstratifiable programs, unbound variables, cyclic rule dependencies — verify clear error messages), rdf_star_datalog.sql
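
Tying several of the deliverables above together, a constraint workflow might look like the following sketch — the function names and GUC come from the list above; the rule text and output shape are illustrative:

```sql
-- Empty-head rule: "no one can be their own manager" must never hold.
SELECT pg_ripple.load_rules('
  @prefix ex: <http://example.org/> .
  :- ?x ex:manages ?x .
');

-- Report violations as JSONB rather than rejecting writes...
SET pg_ripple.enforce_constraints = 'warn';
SELECT pg_ripple.check_constraints();

-- ...or reject offending transactions outright.
SET pg_ripple.enforce_constraints = 'error';
```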

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/datalog.md — load_rules, infer, list_rules, enable_rule_set, disable_rule_set; rule syntax primer; stratification; built-in RDFS/OWL RL rule sets; inference_mode GUC
  • user-guide/best-practices/datalog-patterns.md — RDFS subclass/domain/range patterns, OWL RL profiles, source column (explicit vs inferred), rule count vs inference time
  • user-guide/configuration.md expanded: inference_mode, enforce_constraints GUCs
  • reference/faq.md expanded: OWL reasoning support, source column meaning

Exit Criteria

Users can load RDFS or OWL RL rule sets (or custom rules), and SPARQL queries return inferred triples. Arithmetic built-ins filter correctly in rule bodies. Constraint rules detect and report violations (optionally rejecting transactions). Both on-demand and materialized modes operational. Stratified negation correctly validated and compiled. SHACL shapes with sh:rule entries are auto-compiled to Datalog rules.


v0.11.0 — Incremental SPARQL Views, Datalog Views & ExtVP

Theme: Always-fresh materialized SPARQL and Datalog queries, plus extended vertical partitioning, via pg_trickle stream tables.

In plain language: Imagine pinning a SPARQL query — or a set of Datalog reasoning rules — to a dashboard and having the results update automatically whenever the underlying data changes, without re-running the query. That's what SPARQL views and Datalog views deliver. Under the hood, only the changed rows are reprocessed (not the entire dataset), so updates are nearly instantaneous. Datalog views go one step further: they bundle rules and a goal pattern into a single self-contained artifact, materializing only the facts relevant to the goal. This release also adds precomputed "shortcut" tables for frequently-combined queries, making common access patterns dramatically faster. Requires the companion pg_trickle extension.

Effort estimate: 5–7 person-weeks

pg_trickle dependency: This release requires pg_trickle to be installed. pg_trickle is a production-ready companion extension (same Rust/pgrx 0.17 / PostgreSQL 18 stack) available today. pg_ripple never hard-requires pg_trickle at load time — feature parity for the core triple store is preserved without it. Functions in this release that depend on pg_trickle (create_sparql_view, create_datalog_view, ExtVP setup, etc.) detect its presence at call time and return a clear error with an install hint if it is absent. The pg_ripple.pg_trickle_available() function lets users and tooling check availability before calling. See plans/ecosystem/pg_trickle.md § 3 for the soft-detection design.

See plans/ecosystem/pg_trickle.md § 2.2 for the SPARQL views design and plans/ecosystem/datalog.md § 15 for the Datalog views design.

Completed items (click to expand)

Deliverables

  • SPARQL views (requires pg_trickle)
    • pg_ripple.create_sparql_view(name, sparql, schedule, decode) — compile a SPARQL SELECT query into an always-fresh, incrementally-maintained stream table
    • decode => FALSE (recommended) keeps integer IDs in the stream table with a thin decoding view on top, minimising CDC surface
    • pg_ripple.drop_sparql_view(name) and pg_ripple.list_sparql_views() for lifecycle management
    • _pg_ripple.sparql_views catalog table: records original SPARQL text, generated SQL, schedule, decode mode, and stream table OID
    • Refresh mode heuristics: IMMEDIATE for constraint-style queries, DIFFERENTIAL + schedule for dashboards, FULL + long schedule for heavy analytics and transitive-closure property paths
  • Datalog views (requires pg_trickle)
    • pg_ripple.create_datalog_view(name, rules, goal, schedule, decode) — bundle a Datalog rule set with a goal pattern into an always-fresh, incrementally-maintained stream table
    • Alternative: pg_ripple.create_datalog_view(name, rule_set, goal, schedule, decode) — reference a loaded rule set by name instead of inline rules
    • decode => FALSE (recommended) keeps integer IDs in the stream table with a thin decoding view on top
    • pg_ripple.drop_datalog_view(name) and pg_ripple.list_datalog_views() for lifecycle management
    • _pg_ripple.datalog_views catalog table: records original rule text, goal pattern, generated SQL, schedule, decode mode, and stream table OID
    • Constraint monitoring: constraint rules (empty-head) automatically synthesize a goal; any row in the stream table is a violation. IMMEDIATE mode catches violations within the same transaction
    • Goal-filtered materialization: only facts relevant to the goal pattern are derived and stored, reducing write amplification compared to full-closure materialized rules
  • ExtVP semi-join stream tables (requires pg_trickle)
    • Manual creation of pre-computed semi-joins between frequently co-joined predicate pairs
    • SPARQL→SQL translator rewrites queries to target ExtVP tables when available
  • Views over derived predicates
    • Both SPARQL views and Datalog views can reference Datalog-derived VP tables; pg_trickle DAG handles refresh ordering
  • pg_regress: sparql_views.sql, datalog_views.sql, extvp.sql
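
A minimal sketch of the SPARQL-view lifecycle, assuming the create_sparql_view() signature and named parameters listed above (pg_trickle must be installed; parameter names are illustrative):

```sql
-- Pin a dashboard query as an incrementally-maintained stream table.
SELECT pg_ripple.create_sparql_view(
  name     => 'people_named',
  sparql   => 'PREFIX foaf: <http://xmlns.com/foaf/0.1/>
               SELECT ?person ?name WHERE { ?person foaf:name ?name }',
  schedule => '10s',
  decode   => FALSE   -- keep integer IDs; a thin decoding view sits on top
);

-- Reading the view is a plain table scan; rows refresh incrementally
-- as the underlying triples change.
SELECT * FROM people_named;

-- Lifecycle management.
SELECT * FROM pg_ripple.list_sparql_views();
SELECT pg_ripple.drop_sparql_view('people_named');
```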

Documentation

See plans/documentation.md for details.

  • user-guide/scaling.md expanded: pg_trickle live statistics, SPARQL view refresh mode selection
  • user-guide/best-practices/sparql-patterns.md expanded: using create_sparql_view() for frequently-run queries
  • research/pg-trickle.md (mirror plans/ecosystem/pg_trickle.md)

Exit Criteria

Users can create SPARQL views and Datalog views that stay incrementally up-to-date. View queries are sub-millisecond table scans. Datalog views with goal patterns materialize only goal-relevant facts. Constraint monitoring views detect violations in real time. ExtVP semi-joins improve multi-predicate star-pattern performance.


v0.12.0 — SPARQL Update (Advanced)

Theme: W3C SPARQL 1.1 Update — pattern-based updates and graph management commands.

In plain language: Building on the basic INSERT DATA / DELETE DATA support from v0.5.1, this release adds pattern-based updates — the ability to find-and-replace data using SPARQL patterns (e.g. "for every person without an email, add a placeholder email"). It also adds commands for managing named graphs (create, clear, drop) and loading data from a URL. This completes the full SPARQL 1.1 Update specification.
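
The placeholder-email example above could be expressed as follows — a sketch assuming a pg_ripple.sparql_update() entry point (the Update executor shipped in v0.5.1; the SQL function name here is an assumption):

```sql
-- Pattern-based update: give every person without an email a placeholder.
SELECT pg_ripple.sparql_update('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  INSERT { ?p foaf:mbox <mailto:unknown@example.org> }
  WHERE  {
    ?p a foaf:Person .
    FILTER NOT EXISTS { ?p foaf:mbox ?m }
  }
');
```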

Effort estimate: 3–4 person-weeks (simpler than originally estimated since INSERT DATA / DELETE DATA and the Update executor were delivered in v0.5.1)

Completed items (click to expand)

Deliverables

  • DELETE/INSERT WHERE (graph update)
    • Pattern-based update: DELETE { … } INSERT { … } WHERE { … }
    • Compile WHERE clause via existing SPARQL→SQL engine
    • Transactional: delete + insert in single statement
  • LOAD / CLEAR / DROP / CREATE
    • LOAD <url> — fetch and load remote RDF data (HTTP GET + parser)
    • CLEAR GRAPH <g> — delete all triples in a named graph
    • DROP GRAPH <g> — clear + remove graph from registry
    • CREATE GRAPH <g> — register a new empty named graph
  • pg_regress: sparql_update_where.sql, sparql_graph_management.sql

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/sparql-update.md expanded: DELETE/INSERT WHERE, LOAD / CLEAR / DROP / CREATE graph management
  • user-guide/best-practices/update-patterns.md expanded: pattern-based update recipes, graph lifecycle management

Exit Criteria

Full SPARQL 1.1 Update operations work correctly. Pattern-based updates compile WHERE clauses via the existing SPARQL→SQL engine.


v0.13.0 — Performance Hardening

Theme: Optimize for production-scale workloads. Benchmark-driven improvements.

In plain language: This release is about speed. Using the benchmarks established in v0.5.0, we measure pg_ripple's performance against known baselines and then tune it. Improvements include caching query plans so repeated queries skip redundant work, loading data in parallel, and teaching the system to use data quality rules (from v0.7.0/v0.8.0) as hints to avoid unnecessary work during queries. The target is simple queries answering in under 10 milliseconds on a dataset of 10 million facts, and bulk loading sustained at over 100,000 facts per second.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • BGP join reordering
    • At plan time, read pg_stats.n_distinct and pg_class.reltuples for the target VP tables to estimate the selectivity of each triple pattern
    • Place the most selective pattern first in the join tree to minimize intermediate result sizes
    • Emit SET LOCAL join_collapse_limit = 1 before the generated SQL to lock the PostgreSQL planner into the computed join order
    • Optimizer Robustness / Fallback: Because deriving perfect selectivity from pg_stats.n_distinct is fragile over multi-way self-joins, the Rust-based optimizer implements dynamic sampling or uses fallback heuristic costs (e.g. reverting to native PostgreSQL planning) if pg_stats suggests high cardinality uncertainty. This prevents forcing PostgreSQL into highly suboptimal plans.
    • When join columns are already sorted (e.g. after a range scan on an ordered i64 column), emit SET LOCAL enable_mergejoin = on to exploit merge-join (strategy #6)
  • Prepared execution and cache hardening
    • Build on the v0.3.0 SPARQL translation cache rather than reintroducing it here
    • Evaluate prepared statements with parameter binding for generated SQL where this improves planner reuse
    • Add instrumentation and benchmarks for translation-cache hit rate, eviction behavior, and prepared-plan reuse
  • Parallel query exploitation
    • Ensure VP table queries are parallel-safe
    • Mark SQL functions as PARALLEL SAFE where applicable
    • Generate SQL that triggers PostgreSQL parallel workers for multi-VP-table star patterns (e.g. parallel hash joins across VP tables)
    • Verify EXPLAIN output shows parallel plans for queries touching 3+ VP tables
  • Custom statistics for the PostgreSQL planner
    • Run ANALYZE on VP tables after merge operations so the planner has accurate selectivity estimates for generated SQL
    • Provide per-predicate ndistinct and MCV statistics to guide join ordering
    • Evaluate custom statistics objects (PG18 extended statistics) on (s, o) pairs for correlation-aware planning
    • Consider prepared statements with parameter binding (instead of literal interpolation) so the planner can cache generic plans
  • PG18 async I/O exploitation
    • Verify BRIN scans on main partition leverage AIO
    • Tune io_combine_limit recommendations
  • Memory optimization
    • Profile and reduce per-query allocations
    • Optimize dictionary cache eviction strategy
  • Index tuning
    • Evaluate PG18 skip scan benefits on (s, o) indices
    • Add covering indices where beneficial
  • Bulk load optimization
    • Parallel dictionary encoding
    • Deferred index build with CREATE INDEX CONCURRENTLY post-load
  • SHACL-driven query optimization
    • The algebrizer reads loaded SHACL shapes and the predicate catalog before building the join tree, using them for costing and only for rewrites that are proven semantics-preserving
    • Shape metadata can tighten plans only when the query domain is provably identical to the validated focus-node set
    • Presence of a shape alone is insufficient to change query semantics
  • pg_trickle integration: ExtVP workload advisor (optional, when pg_trickle is installed)
    • _pg_ripple.extvp_candidates stream table aggregates predicate co-occurrence from the SPARQL query log over a rolling 1-hour window
    • Admin function pg_ripple.recommend_extvp() reads the stream table and lists the top N predicate pairs to pre-compute
    • pg_ripple.sparql_explain() surfaces recommendations inline when a query would benefit from an ExtVP (§2.14)
  • Benchmarking infrastructure & execution
    • Berlin SPARQL Benchmark (BSBM) data generator integrated into test suite
    • Full BSBM query mix with timing collection and baseline comparison
    • SP2Bench subset adapted for pg_ripple
    • Custom benchmarks: star patterns, property paths, aggregates, concurrent workloads
    • Results documented in release notes and user-guide/scaling.md
  • Fuzz testing harness setup (cargo-fuzz + libFuzzer)
    • Fuzz target for SPARQL→SQL pipeline (parser, algebra, SQL generation)
    • Fuzz target for Turtle parser integration
    • Fuzz target for Datalog rule parser
    • CI runs fuzz testing in nightly builds (10 minutes per target)
    • No panics, no invalid SQL, no memory safety violations
  • Performance regression test suite (pgbench custom scripts)
    • 100K triples/sec sustained bulk load baseline
    • <10ms simple BGP queries at 10M triples
    • <5ms cached repeat queries
    • BSBM throughput comparison with v0.5.0
  • pg_regress: shacl_query_opt.sql, fuzz_integration.sql (fuzz results verification)
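
To illustrate the BGP join-reordering deliverable, here is the rough shape of the SQL the translator might emit for a two-pattern query after reordering — table and column names follow the VP layout described earlier in the roadmap, and the text is a sketch, not literal generated output:

```sql
-- Lock the PostgreSQL planner into the computed join order.
SET LOCAL join_collapse_limit = 1;

-- Most selective pattern (per pg_stats.n_distinct / reltuples) placed first.
SELECT n.o AS name_id
FROM vp_foaf_knows k                -- hypothetical VP table for foaf:knows
JOIN vp_foaf_name  n ON n.s = k.o   -- integer-ID join on dictionary codes
WHERE k.s = $1;                     -- dictionary-encoded subject ID
```

When pg_stats signals high cardinality uncertainty, the optimizer falls back to native PostgreSQL planning instead of forcing this order.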

Documentation

See plans/documentation.md for details.

  • user-guide/scaling.md expanded: benchmark results (BSBM, SP2Bench), GUC tuning reference values for small/medium/large deployments, index strategy per workload
  • user-guide/pre-deployment.md expanded: finalize as definitive production checklist; pg_stat_statements enabled; work_mem tuning for SPARQL aggregates
  • reference/troubleshooting.md expanded: slow query diagnosis using sparql_explain(analyze:=true), cache hit ratio via stats()

Exit Criteria

BSBM results documented. >100K triples/sec sustained bulk load. <10ms for simple BGP queries at 10M triples. <5ms for cached repeat queries. SHACL metadata exploited only through semantics-preserving optimizer rules. PostgreSQL parallel plans verified for multi-VP-table joins.


v0.14.0 — Administrative & Operational Readiness

Theme: Production operations tooling, upgrade paths, documentation.

In plain language: Everything a system administrator needs to run pg_ripple in production. This includes maintenance commands (clean up, rebuild indexes), monitoring and diagnostics, comprehensive documentation (quickstart guide, function reference, tuning guide), and graph-level access control — the ability to control which database users can see or modify which named graphs. It also covers packaging (Linux packages, Docker images) so the extension is easy to install in real environments. Think of this as the "operations manual" release.

Effort estimate: 4–6 person-weeks

Completed items (click to expand)

Deliverables

  • Extension upgrade scripts
    • Tested upgrade path 0.1.0 → ... → 0.16.0
    • ALTER EXTENSION pg_ripple UPDATE works for all version transitions
  • pg_trickle integration: live schema extraction (optional, when pg_trickle is installed)
    • _pg_ripple.inferred_schema stream table maintains a live class→property→cardinality summary
    • Exposed as pg_ripple.schema_summary() for tooling and SPARQL IDE auto-completion (v0.15.0 HTTP endpoint)
    • Serves as a starting point for automatic SHACL shape inference (§2.15)
  • Administrative functions
    • pg_ripple.vacuum() — force merge + VACUUM on VP tables
    • pg_ripple.reindex() — rebuild all VP table indices
    • pg_ripple.compact(keep_old BOOL DEFAULT false) — trigger an immediate full merge across all VP tables; keep_old := false drops the previous generation's _main table immediately after the atomic rename
    • pg_ripple.vacuum_dictionary() — remove dictionary entries for IRIs and literals no longer referenced by any VP table row (orphaned after bulk deletes)
    • pg_ripple.dictionary_stats() — detailed cache metrics
    • pg_ripple.predicate_stats() — per-predicate triple count, table sizes
  • Logging & diagnostics
    • Structured logging for merge operations, validation results
    • Custom EXPLAIN option showing SPARQL→SQL mapping (PG18 extension EXPLAIN)
  • Documentation (see plans/documentation.md for the full page-by-page specification)
    • user-guide/backup-restore.md, user-guide/contributing.md (complete), reference/error-reference.md (PT001–PT799), reference/security.md (complete)
    • Performance tuning guide — dictionary cache sizing, cache_budget budgeting, merge_threshold and vp_promotion_threshold tuning; SHACL constraint mapping reference; Datalog rule authoring guide
  • Graph-level Row-Level Security (RLS)
    • pg_ripple.enable_graph_rls() — activate RLS policies on VP tables using the g column
    • Policy driven by a mapping table: _pg_ripple.graph_access (role_name TEXT, graph_id BIGINT, permission TEXT) with permission values 'read' / 'write' / 'admin'
    • pg_ripple.grant_graph(role TEXT, graph TEXT, permission TEXT) / pg_ripple.revoke_graph()
    • SPARQL queries automatically filter results to graphs the current role can read
    • Write operations (insert_triple, SPARQL UPDATE) enforce write permission
    • Superuser bypass via pg_ripple.rls_bypass GUC for admin operations
  • Packaging
    • cargo pgrx package produces installable .deb and .rpm
    • Docker image with extension pre-installed
    • PGXN metadata
  • pg_regress: admin_functions.sql (vacuum, reindex, dictionary_stats, predicate_stats), graph_rls.sql (RLS policy enforcement, cross-role isolation, superuser bypass), upgrade_path.sql (install v0.1.0 → load data → sequential upgrade to current version → verify data integrity and query correctness at each step)
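
A sketch of the graph-level RLS workflow using the functions listed above (graph IRIs and role names are illustrative):

```sql
-- Activate RLS policies on the VP tables, keyed on the g column.
SELECT pg_ripple.enable_graph_rls();

-- analyst may read the HR graph; etl_writer may also modify it.
SELECT pg_ripple.grant_graph('analyst',    'http://example.org/graphs/hr', 'read');
SELECT pg_ripple.grant_graph('etl_writer', 'http://example.org/graphs/hr', 'write');

-- From here, SPARQL results for role "analyst" are silently filtered to
-- readable graphs, and writes from roles lacking 'write' are rejected.
-- Admin escape hatch:
SET pg_ripple.rls_bypass = on;  -- superuser only
```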

Documentation

See plans/documentation.md for details.

  • user-guide/backup-restore.md — pg_dump/pg_restore procedure, VP table considerations, PITR with WAL

  • reference/security.md complete — supported versions matrix, responsible disclosure, hardening GUCs
  • reference/error-reference.md — PT001–PT799 error code table with resolution notes
  • user-guide/contributing.md complete — dev setup, test commands, PR workflow, AGENTS.md conventions, governance
  • user-guide/sql-reference/admin.md expanded: vacuum, reindex, dictionary_stats, predicate_stats

Exit Criteria

Extension is installable, upgradable, and documented. Operational tooling sufficient for production use. Graph-level RLS enforces access control per named graph.


v0.15.0 — SPARQL Protocol (HTTP Endpoint)

Theme: Standard HTTP API for SPARQL queries and updates.

In plain language: Without this, the only way to talk to pg_ripple is through a PostgreSQL database connection (SQL). But the entire RDF ecosystem — SPARQL notebooks, visualization tools, ontology editors, web applications — expects to query a triple store over HTTP at a /sparql URL. This release adds a lightweight companion service that accepts standard SPARQL HTTP requests, forwards them to pg_ripple inside PostgreSQL, and returns results in all the standard formats (JSON, XML, CSV, Turtle). This is the single biggest adoption enabler: it lets pg_ripple drop in as a replacement for tools like Blazegraph, Virtuoso, or Apache Fuseki without requiring any client-side changes.

Effort estimate: 3–4 person-weeks

Completed items (click to expand)

Deliverables

  • Companion HTTP service (pg_ripple_http binary)
    • Standalone Rust binary (not a PG background worker — avoids binding TCP ports inside PostgreSQL)
    • Connects to PostgreSQL via standard libpq / tokio-postgres
    • Configurable via environment variables or config file: PG_RIPPLE_HTTP_PORT, PG_RIPPLE_HTTP_PG_URL
  • W3C SPARQL 1.1 Protocol compliance
    • GET /sparql?query=... — URL-encoded query
    • POST /sparql with application/sparql-query body
    • POST /sparql with application/x-www-form-urlencoded body (query=... / update=...)
    • SPARQL Update via POST /sparql with application/sparql-update body
  • Content negotiation
    • application/sparql-results+json (default for SELECT/ASK)
    • application/sparql-results+xml
    • text/csv / text/tab-separated-values
    • text/turtle / application/n-triples (for CONSTRUCT/DESCRIBE)
    • application/ld+json (JSON-LD, for CONSTRUCT/DESCRIBE)
    • RDF-star content types (builds on v0.4.0 RDF-star): Turtle-star and JSON-LD-star for CONSTRUCT/DESCRIBE results containing quoted triples
  • Connection pooling
    • Built-in connection pool (e.g. deadpool-postgres) to handle concurrent HTTP requests
    • PG_RIPPLE_HTTP_POOL_SIZE configuration
  • Security
    • Optional bearer token or Basic auth for access control
    • CORS configuration for browser-based SPARQL clients
    • Rate limiting GUC
  • Health and metrics
    • GET /health endpoint for load balancer probes
    • Prometheus-compatible /metrics endpoint (query count, latency histogram, error rate)
  • Docker integration
    • Docker image bundles both PostgreSQL (with pg_ripple) and the HTTP service
    • Docker Compose example with separate PG and HTTP containers
  • Graph-aware bulk loader SQL functions
    • Expose the internal load_ntriples_into_graph(), load_turtle_into_graph(), load_rdfxml_into_graph() Rust functions (added in v0.10.0) as public SQL functions:
      • pg_ripple.load_ntriples_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT
      • pg_ripple.load_turtle_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT
      • pg_ripple.load_rdfxml_into_graph(data TEXT, graph_iri TEXT) RETURNS BIGINT
      • pg_ripple.load_ntriples_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT
      • pg_ripple.load_turtle_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT
      • pg_ripple.load_rdfxml_file_into_graph(path TEXT, graph_iri TEXT) RETURNS BIGINT
    • Encode the graph_iri argument via the dictionary and delegate to the existing *_into_graph(data, g_id) internal functions
    • load_rdfxml_file_into_graph reads the file via pg_read_file() (superuser-only) and delegates to load_rdfxml_into_graph
    • Complementary to load_nquads() and load_trig() for workloads that have N-Triples / Turtle / RDF/XML files and want to load them into a specific named graph without converting the format
  • Graph-aware triple deletion
    • The existing pg_ripple.delete_triple(s, p, o) only deletes from the default graph (g=0); the underlying storage::delete_triple(s, p, o, g_id) already accepts a graph parameter
    • Expose: pg_ripple.delete_triple_from_graph(s TEXT, p TEXT, o TEXT, graph_iri TEXT) RETURNS BIGINT
    • Also expose: pg_ripple.clear_graph(graph_iri TEXT) RETURNS BIGINT — wraps the existing storage::clear_graph_by_id() internal function to delete all triples in a named graph in one call (currently only accessible via drop_graph() which also unregisters the graph IRI)
    • Without this, users have no SQL-level way to delete a specific triple from a named graph
  • SQL API completeness gaps
    • Missing file-path loader: pg_ripple.load_rdfxml_file(path TEXT) RETURNS BIGINT — completes the set of *_file variants (N-Triples, N-Quads, Turtle, TriG all have file variants); reads via pg_read_file() (superuser-only)
    • Graph parameter on find_triples: pg_ripple.find_triples(s TEXT, p TEXT, o TEXT, graph TEXT DEFAULT NULL) RETURNS TABLE — exposes the unused graph parameter in storage::find_triples(s, p, o, graph) so users can pattern-match within a named graph without falling back to SPARQL; graph := NULL queries the default graph
    • Per-graph triple count: pg_ripple.triple_count_in_graph(graph_iri TEXT) RETURNS BIGINT — returns the count of triples in a specific named graph (existing triple_count() returns total across all graphs)
    • Dictionary lookup diagnostics: pg_ripple.decode_id_full(id BIGINT) RETURNS JSONB — exposes dictionary::decode_full(id) to return {"kind": ..., "value": ..., "language": null|"...", "datatype": null|"..."} structured term metadata (current decode_id() returns only the plain string); useful for debugging and inspection
    • Dictionary term existence check: pg_ripple.lookup_iri(iri TEXT) RETURNS BIGINT — exposes dictionary::lookup_iri(iri) to check whether an IRI already exists in the dictionary without encoding it, returning NULL when absent (useful for test assertions, cost estimation, and introspection)
  • pg_regress: sparql_protocol.sql (protocol-level tests via curl), load_into_graph.sql (round-trip: load N-Triples / Turtle / RDF/XML into a named graph, verify via SPARQL GRAPH pattern), graph_delete.sql (delete_triple_from_graph, clear_graph, verify isolation from default graph), sql_api_completeness.sql (find_triples with graph param, triple_count_in_graph, decode_id_full, lookup_iri)
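
The graph-scoped loaders and deleters above compose into a simple round trip — a sketch using the signatures listed in this release (graph IRI and data are illustrative):

```sql
-- Load Turtle into a specific named graph.
SELECT pg_ripple.load_turtle_into_graph('
  @prefix ex: <http://example.org/> .
  ex:alice ex:worksIn ex:sales .
', 'http://example.org/graphs/hr');

-- Verify via a SPARQL GRAPH pattern, and count triples in that graph only.
SELECT * FROM pg_ripple.sparql('
  SELECT ?s ?o WHERE {
    GRAPH <http://example.org/graphs/hr> { ?s <http://example.org/worksIn> ?o }
  }
');
SELECT pg_ripple.triple_count_in_graph('http://example.org/graphs/hr');

-- Targeted cleanup that leaves the default graph untouched.
SELECT pg_ripple.delete_triple_from_graph(
  'http://example.org/alice', 'http://example.org/worksIn',
  'http://example.org/sales', 'http://example.org/graphs/hr');
SELECT pg_ripple.clear_graph('http://example.org/graphs/hr');
```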

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/sparql-query.md expanded: HTTP protocol endpoint configuration, Accept header formats, SPARQL 1.1 Protocol conformance note
  • user-guide/best-practices/sparql-patterns.md expanded: using the HTTP endpoint from Python (SPARQLWrapper), Java (Jena), curl; SPARQL IDE / Protégé direct connection
  • reference/faq.md expanded: HTTP endpoint URL, connecting SPARQL tools directly

Exit Criteria

Standard SPARQL clients (YASGUI, Postman, RDF4J workbench, curl) can query and update pg_ripple over HTTP without any pg_ripple-specific configuration. Content negotiation returns correct formats. All graph-scoped load and delete operations available as first-class SQL functions. SQL API fully exposes internal capabilities (graph parameters, per-graph counts, diagnostic functions).


v0.16.0 — SPARQL Federation

Theme: Query remote SPARQL endpoints from within pg_ripple queries.

In plain language: Federation lets a single SPARQL query combine data from pg_ripple with data from external SPARQL endpoints on the web. For example, you could ask "find all my local employees and enrich their records with data from Wikidata" — and the system will automatically fetch the remote portion, join it with local results, and return a unified answer. This is part of the SPARQL 1.1 standard (SERVICE keyword) and is expected by many enterprise knowledge graph workflows that integrate multiple data sources. Multiple remote calls execute in parallel when possible to minimise latency.

Effort estimate: 4–6 person-weeks

Completed items (click to expand)

Deliverables

  • SPARQL SERVICE keyword parsing
    • Parse SERVICE <url> { ... } clauses in SPARQL queries via spargebra
    • Support both inline service IRIs and SERVICE ?var (variable endpoints, with VALUES binding)
  • Remote endpoint execution
    • HTTP GET/POST to remote SPARQL endpoints using reqwest (async HTTP client)
    • Parse application/sparql-results+json and application/sparql-results+xml responses
    • Dictionary-encode remote results into local i64 IDs for join compatibility
  • Join integration
    • Remote result sets injected as inline VALUES clauses in the generated SQL
    • Async parallel execution: multiple SERVICE clauses in a single query execute concurrently (via tokio::join! in pg_ripple_http, or sequential fallback in SPI context) — prevents a single slow endpoint from blocking the entire query
    • Bind-join optimisation: push bound variables from local results into remote queries to reduce remote result size
  • Error handling and timeouts
    • pg_ripple.federation_timeout GUC (default: 30s per SERVICE call)
    • pg_ripple.federation_max_results GUC (default: 10,000 rows per remote call)
    • Graceful degradation: failed SERVICE calls return empty results with a WARNING (configurable to ERROR via pg_ripple.federation_on_error GUC)
  • Security
    • Allowlist of permitted remote endpoints: _pg_ripple.federation_endpoints (url TEXT, enabled BOOLEAN)
    • pg_ripple.register_endpoint() / pg_ripple.remove_endpoint() management API
    • No outbound HTTP calls unless the endpoint is explicitly registered (defence against SSRF)
  • pg_trickle integration: federation health monitoring (optional, when pg_trickle is installed)
    • _pg_ripple.federation_health stream table aggregates a rolling 5-minute probe log per endpoint
    • Executor skips endpoints with success_rate < 0.1 without waiting for timeout
    • /metrics Prometheus endpoint reads directly from federation_health (§2.11)
  • SERVICE → Materialized View rewrite
    • When a SERVICE <url> clause references an endpoint backed by a local SPARQL view (created via pg_ripple.create_sparql_view()), rewrite the remote call to a direct scan of the pre-materialized stream table
    • Registered via a local_view_name column on _pg_ripple.federation_endpoints — set automatically when a SPARQL view is also registered as an endpoint
    • Eliminates HTTP overhead and enables the PostgreSQL planner to optimize the join with accurate statistics from the stream table
  • HTTP endpoint integration
    • Federation works via both SQL (pg_ripple.sparql()) and HTTP (/sparql) interfaces
  • pg_regress: sparql_federation.sql, sparql_federation_timeout.sql
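The bind-join optimisation in the deliverables can be sketched as a query rewrite: locally bound values are pushed into the remote query as a SPARQL VALUES clause, so the endpoint only returns rows that can actually join. A hedged Python sketch (the query text and helper name are illustrative, not the extension's API):

```python
# Sketch of bind-join: inject locally bound IRIs as a VALUES clause in
# the query sent to the remote endpoint, shrinking the remote result set.

def bind_join_query(inner_pattern: str, var: str, bound_values: list[str]) -> str:
    values = " ".join(f"<{v}>" for v in bound_values)
    return (
        "SELECT * WHERE { "
        f"VALUES ?{var} {{ {values} }} "
        f"{inner_pattern} }}"
    )

q = bind_join_query("?person rdfs:label ?label .", "person",
                    ["http://example.org/alice", "http://example.org/bob"])
print(q)
```

The remote endpoint now evaluates the pattern only for the two bound subjects instead of its entire dataset.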

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/federation.md — SERVICE keyword, endpoint registration (register_endpoint, remove_endpoint), variable endpoints with VALUES binding, bind-join optimisation, federation_timeout / federation_max_results / federation_on_error GUCs, SSRF protection via allow-list

  • user-guide/configuration.md expanded: federation_timeout, federation_max_results, federation_on_error GUCs
  • user-guide/best-practices/sparql-patterns.md expanded: federation query patterns, SERVICE performance tips (push FILTERs down, limit remote result size), combining local and remote data
  • reference/faq.md expanded: federation security model, configuring remote endpoints, timeout tuning
  • reference/troubleshooting.md expanded: federation timeouts, SSRF errors, endpoint unreachable

Exit Criteria

DONE — SPARQL queries with SERVICE clauses correctly fetch and join data from registered remote endpoints. Sequential execution in SPI context. Timeouts and error handling work as configured. No SSRF risk — only allowlisted endpoints are contacted.


v0.17.0 — JSON-LD Framing

Theme: Frame-driven SPARQL CONSTRUCT queries that produce structured, nested JSON-LD output.

In plain language: JSON-LD Framing is a W3C standard for reshaping RDF graph data into a specific tree structure suitable for a REST API or application. Instead of returning a flat list of disconnected facts, you provide a frame document — a JSON template that says "I want Company objects with their employees nested inside" — and pg_ripple automatically translates that into an optimised query, fetches only the data that matches, and returns a cleanly nested JSON-LD document. This makes pg_ripple a natural back-end for Linked Data APIs and JSON-centric applications without requiring a separate framing library.

Unlike a naïve approach that fetches the entire graph and post-filters it, this implementation translates the frame directly into a SPARQL CONSTRUCT query. PostgreSQL then reads only the VP tables that are touched by the join — meaning a frame targeting 3 predicates on a graph with 10,000 predicates touches 3 VP tables, not 10,000. The jsonld_frame_to_sparql() inspection function exposes the generated SPARQL for debugging and for users who want to customise the query further before execution.
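The frame-to-CONSTRUCT translation described above can be sketched in miniature: an @type constraint becomes a required triple pattern and each property with a wildcard {} value becomes an OPTIONAL pattern, so only the predicates named in the frame are touched. A toy Python sketch (the real translator handles nesting, @reverse, @requireAll, and dictionary encoding; all names here are illustrative):

```python
# Minimal frame-to-CONSTRUCT translation sketch: only the predicates
# named in the frame appear in the generated query, which is why only
# the corresponding VP tables are scanned.

def frame_to_construct(frame: dict) -> str:
    patterns, template = [], []
    if "@type" in frame:
        patterns.append(f"?s a <{frame['@type']}> .")
        template.append(f"?s a <{frame['@type']}> .")
    props = [p for p in frame.items() if not p[0].startswith("@")]
    for i, (prop, _val) in enumerate(props):
        v = f"?o{i}"
        template.append(f"?s <{prop}> {v} .")
        patterns.append(f"OPTIONAL {{ ?s <{prop}> {v} . }}")
    return ("CONSTRUCT { " + " ".join(template) +
            " } WHERE { " + " ".join(patterns) + " }")

q = frame_to_construct({"@type": "http://example.org/Company",
                        "http://example.org/name": {}})
print(q)
```

This is the kind of output jsonld_frame_to_sparql() is meant to expose for inspection before execution.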

Effort estimate: 3–4 person-weeks

Completed items (click to expand)

Prerequisites

  • v0.5.1 SPARQL CONSTRUCT / DESCRIBE (JSONB output) — frame-to-SPARQL translation reuses the existing algebra and SQL generation pipeline.
  • v0.9.0 JSON-LD export — the nt_term_to_jsonld_value helper in src/export.rs is reused for the embedding step.
  • v0.3.0 SPARQL plan cache — framed queries benefit from cached SPARQL→SQL translation automatically.

Deliverables

  • JSON-LD Framing engine (src/framing/)
    • src/framing/mod.rs — module root; exposes the public frame() entry point used by all SQL functions
    • src/framing/frame_translator.rs — translates a JSON-LD frame (parsed as serde_json::Value) into a spargebra CONSTRUCT algebra tree
    • src/framing/embedder.rs — takes flat CONSTRUCT result triples and applies the W3C embedding algorithm to produce a nested JSON-LD tree matching the frame structure
    • src/framing/compactor.rs — applies the @context from the frame to compact full IRIs to prefixed terms in the output
  • Frame-to-SPARQL translation (src/framing/frame_translator.rs)
    • Translate @type constraints → ?s a <IRI> triple patterns in the CONSTRUCT WHERE clause
    • Translate property-value pairs with wildcard {} → OPTIONAL { ?s <p> ?o } patterns
    • Translate absent-property patterns [] → OPTIONAL { ?s <p> ?o } FILTER(!bound(?o)) patterns
    • Translate @reverse terms → flipped BGP triple patterns (?o <p> ?s instead of ?s <p> ?o)
    • Translate nested frame objects → recursive OPTIONAL joins, each level introducing a fresh variable
    • Translate @id matching → bind target IRI as a constant in the WHERE clause
    • Translate @requireAll: true → convert OPTIONAL joins to INNER joins for required properties
    • All IRI constants dictionary-encoded at translation time (integer joins in all VP table queries — no string comparisons)
    • Wildcards ({}) on @type and @id expand to unbound variables
  • Tree-embedding algorithm (src/framing/embedder.rs)
    • Implement the W3C JSON-LD 1.1 Framing §4.1 embedding algorithm over the flat CONSTRUCT result set
    • Build a subject-keyed node map from the CONSTRUCT rows (decoded to N-Triples strings)
    • Walk the frame tree recursively, embedding matching node objects as property values
    • Honour @embed flag: @once (default) — embed a node only once, use a {"@id": "..."} reference for subsequent occurrences; @always — embed every occurrence even if repeated; @never — always use a node reference
    • Honour @explicit: true — omit properties not mentioned in the frame from the output node
    • Honour @omitDefault: true — omit absent properties rather than outputting null
    • Honour @default values — substitute the declared default value for absent properties when @omitDefault is false
    • Reverse properties: collect subjects whose relevant predicate points to the current node and embed them under the @reverse-declared key
    • Named-graph scope: when graph is specified, restrict embedding to nodes from that named graph
  • @context compaction (src/framing/compactor.rs)
    • Extract the @context block from the input frame
    • Apply prefix substitution to all IRI strings in the output tree (full IRI → compact prefixed form using registered prefixes and inline @context mappings)
    • Inject the @context block as the first entry of the returned JSON-LD document
    • Fall back to full IRIs when no matching prefix is registered
  • SQL functions (src/lib.rs)
    • pg_ripple.jsonld_frame_to_sparql(frame JSONB, graph TEXT DEFAULT NULL) RETURNS TEXT — translate a frame to a SPARQL CONSTRUCT query string without executing it; primary debugging and inspection tool
    • pg_ripple.export_jsonld_framed(frame JSONB, graph TEXT DEFAULT NULL, embed TEXT DEFAULT '@once', explicit BOOLEAN DEFAULT FALSE, ordered BOOLEAN DEFAULT FALSE) RETURNS JSONB — primary end-user function: translate frame to CONSTRUCT, execute via the SPARQL engine, apply embedding and compaction, return framed JSON-LD
    • pg_ripple.export_jsonld_framed_stream(frame JSONB, graph TEXT DEFAULT NULL) RETURNS SETOF TEXT — streaming NDJSON variant (one JSON object per matched root node); avoids buffering large framed documents in memory
    • pg_ripple.jsonld_frame(input JSONB, frame JSONB, embed TEXT DEFAULT '@once', explicit BOOLEAN DEFAULT FALSE, ordered BOOLEAN DEFAULT FALSE) RETURNS JSONB — general-purpose framing primitive: apply the embedding algorithm to any already-expanded JSON-LD document, not necessarily from pg-ripple storage; useful for framing SPARQL CONSTRUCT results obtained via other means
  • SPARQL plan cache integration
    • The translated CONSTRUCT query string is used as the cache key in the existing src/sparql/plan_cache.rs translation cache
    • Repeated calls to export_jsonld_framed() with the same frame and graph benefit from cached SPARQL→SQL translation automatically
  • Named-graph support
    • graph NULL → CONSTRUCT operates over the merged graph (all g values across all VP tables)
    • graph '<IRI>' → adds FILTER(?g = <encoded_id>) to each VP table join in the generated CONSTRUCT
    • Frame @graph entry → directs the embedder to scope node matching to the named graph's node set
  • Error handling
    • Invalid frame structure (not a JSON object, unrecognised @embed value) → PT700-range serialization error with the frame property path that failed
    • Frame references an IRI not present in any VP table → empty result (standard W3C framing behaviour, not an error)
    • Frame nested deeper than pg_ripple.max_path_depth → PT200-range error reusing the existing depth limit
  • Incremental framing views (create_framing_view) (requires pg_trickle)
    • pg_ripple.create_framing_view(name TEXT, frame JSONB, schedule TEXT DEFAULT '5s', decode BOOLEAN DEFAULT FALSE, output_format TEXT DEFAULT 'jsonld') RETURNS void — translate the frame to a SPARQL CONSTRUCT query and register it as a pg_trickle stream table that stays incrementally up-to-date as triples are inserted or deleted
    • Stream table schema: pg_ripple.framing_view_{name}(subject_id BIGINT, frame_tree JSONB, refreshed_at TIMESTAMPTZ) — subject_id is the dictionary-encoded subject IRI; frame_tree is the fully embedded and compacted JSON-LD output for that root node
    • When decode = TRUE, a thin IRI-decoding view pg_ripple.framing_view_{name}_decoded is also created; the stream table itself stores integer IDs to minimise CDC surface
    • pg_ripple.drop_framing_view(name TEXT) RETURNS void and pg_ripple.list_framing_views() RETURNS TABLE(name TEXT, frame JSONB, schedule TEXT, output_format TEXT, decode BOOLEAN, row_count BIGINT, last_refresh TIMESTAMPTZ, stream_table_oid OID) for lifecycle management
    • _pg_ripple.framing_views catalog table: name, frame, generated_construct, schedule, output_format, decode, stream_table_oid, created_at
    • Refresh mode heuristics (same as create_sparql_view): IMMEDIATE for constraint-style frames (e.g. select ex:Company nodes that lack ex:complianceOfficer — any row in the view is a violation); DIFFERENTIAL + schedule for dashboard/API use cases (company directory refreshed every 10 s); FULL + long schedule for large full-graph framed exports intended for downstream consumers
    • pg_ripple.pg_trickle_available() check at call time — returns a clear error with an install hint when pg_trickle is absent; never raises an error at extension load time
  • pg_regress: jsonld_framing.sql (type-based selection, property wildcards, absent-property patterns [], @reverse, @embed @once/@always/@never, @explicit, @omitDefault, @default, @requireAll, named-graph scope, empty frame, jsonld_frame_to_sparql inspection output, jsonld_frame general-purpose function, streaming variant), jsonld_framing_views.sql (create/drop/list framing views; IMMEDIATE constraint-mode view; DIFFERENTIAL dashboard view; decode option; pg_trickle-absent error message)
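The @embed @once rule from the embedding deliverables can be illustrated with a toy node map standing in for decoded CONSTRUCT rows: the first time a node is reached it is embedded in full, and any later reference to the same subject becomes a bare {"@id": ...} node reference. A sketch under those assumptions (not the W3C algorithm in full):

```python
# Sketch of @embed: @once — embed each subject at most once, then fall
# back to node references. A cycle is also handled, because a subject is
# marked seen before its properties are walked.

def embed(node_map, subject, seen=None):
    seen = set() if seen is None else seen
    if subject in seen:                 # already embedded once -> reference
        return {"@id": subject}
    seen.add(subject)
    out = {"@id": subject}
    for prop, values in node_map.get(subject, {}).items():
        out[prop] = [embed(node_map, v, seen) if v in node_map else v
                     for v in values]
    return out

nodes = {
    "ex:acme":  {"ex:employee": ["ex:alice", "ex:bob"]},
    "ex:alice": {"ex:knows": ["ex:bob"]},
    "ex:bob":   {"ex:name": ["Bob"]},
}
tree = embed(nodes, "ex:acme")
# ex:bob is embedded in full under ex:alice (first occurrence) and
# appears as {"@id": "ex:bob"} in the employee list (second occurrence).
```

@always would drop the seen check entirely; @never would always return the reference form.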

Supported frame features (v0.17.0)

Feature | Supported | Notes
@type matching | ✓ | Single IRI or array of IRIs
@id matching | ✓ | Single IRI or array of IRIs
Property wildcard {} | ✓ | Matches any value for a property
Absent-property pattern [] | ✓ | Matches nodes lacking the property
@reverse properties | ✓ | Flipped triple pattern in CONSTRUCT
@embed: @once / @always / @never | ✓ | Full embedding control
@explicit inclusion flag | ✓ | Omit unlisted properties from output
@omitDefault flag | ✓ | Omit null-valued absent properties
@default values | ✓ | Substitute defaults for absent properties
@requireAll flag | ✓ | Turns OPTIONAL joins to INNER joins
@context compaction | ✓ | Prefix substitution from frame @context
Named graph @graph scoping | ✓ | Maps to g column filter on VP tables
@omitGraph flag | ✓ | Single root node omits @graph wrapper
Value pattern matching (@value / @language / @type in value objects) | ✗ | Deferred; requires full-graph scan to implement correctly
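The @context compaction step in the table is, at its core, longest-prefix substitution: a full IRI is replaced with prefix:localname using the longest matching namespace from the frame's @context, falling back to the full IRI. A small sketch (illustrative only; the compactor also handles inline term mappings):

```python
# Sketch of IRI compaction: pick the longest @context namespace that is
# a prefix of the IRI; if none matches, keep the full IRI.

def compact_iri(iri: str, context: dict) -> str:
    best = None
    for prefix, ns in context.items():
        if iri.startswith(ns) and (best is None or len(ns) > len(context[best])):
            best = prefix
    return f"{best}:{iri[len(context[best]):]}" if best else iri

ctx = {"ex": "http://example.org/", "foaf": "http://xmlns.com/foaf/0.1/"}
print(compact_iri("http://xmlns.com/foaf/0.1/name", ctx))  # → foaf:name
print(compact_iri("http://other.org/x", ctx))              # → http://other.org/x
```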

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/serialization.md expanded: export_jsonld_framed, jsonld_frame_to_sparql, jsonld_frame, export_jsonld_framed_stream; frame syntax primer; @embed / @explicit / @omitDefault / @requireAll flags; named graph scoping; supported feature table
  • user-guide/sql-reference/framing-views.md — create_framing_view, drop_framing_view, list_framing_views; stream table schema and decoding view; refresh mode selection (IMMEDIATE for constraints, DIFFERENTIAL for dashboards, FULL for exports); decode option; pg_trickle dependency and detection; worked example (company directory view refreshed every 10 s)

  • user-guide/best-practices/data-modeling.md expanded: JSON-LD Framing for REST APIs; frame-first API design pattern; using jsonld_frame_to_sparql for SPARQL query inspection; performance notes (frame-driven vs full-graph export); when to use export_jsonld_framed vs create_framing_view
  • reference/faq.md expanded: framing vs plain JSON-LD export; what W3C framing features are supported; value pattern matching deferral; framing views vs SPARQL views

Exit Criteria

export_jsonld_framed() correctly translates a JSON-LD frame into a SPARQL CONSTRUCT query touching only the VP tables required by the frame, executes it via the existing SPARQL engine, and returns a nested JSON-LD document with correct @context compaction and W3C-conformant embedding semantics. The jsonld_frame_to_sparql() function exposes the generated CONSTRUCT query string. The jsonld_frame() general-purpose primitive correctly frames any expanded JSON-LD JSONB input. create_framing_view() creates an incrementally-maintained pg_trickle stream table whose rows stay current as triples change; the IMMEDIATE refresh mode correctly detects constraint violations within the same transaction. All supported frame features in the table above pass the pg_regress test suite.


v0.18.0 — SPARQL CONSTRUCT, DESCRIBE & ASK Views

Theme: Materialize the three non-SELECT SPARQL query forms as incrementally-maintained pg_trickle stream tables.

In plain language: pg_ripple already supports SPARQL CONSTRUCT, DESCRIBE, and ASK as one-shot queries. This release lets you register any of those query forms as a live view — a stream table that pg_trickle keeps incrementally up-to-date as triples are inserted or deleted. A CONSTRUCT view stores the derived triples it produces in a (s, p, o, g) table; this is ideal for materialising inferred facts, denormalised projections, or cached API responses. A DESCRIBE view stores all triples about the described resources. An ASK view stores a single BOOLEAN row that flips whenever the underlying pattern changes from matching to not-matching — useful for live constraint monitors and dashboard indicators.
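What a CONSTRUCT view materialises can be pictured as template expansion: every WHERE solution is run through every template triple, yielding one (s, p, o, g) row per template triple per solution. A toy sketch (terms here are strings for readability; the real stream tables store BIGINT dictionary IDs):

```python
# Sketch of CONSTRUCT template expansion: each solution mapping is
# substituted into each template triple, producing the rows a
# CONSTRUCT view's stream table would hold.

def expand(template, solutions):
    rows = []
    for sol in solutions:
        for s, p, o in template:
            # A term starting with '?' is a variable to substitute;
            # anything else is a constant carried through unchanged.
            rows.append((sol.get(s, s), sol.get(p, p), sol.get(o, o), 0))
    return rows

template = [("?c", "ex:hasEmployee", "?e")]
solutions = [{"?c": "ex:acme", "?e": "ex:alice"},
             {"?c": "ex:acme", "?e": "ex:bob"}]
rows = expand(template, solutions)
# rows == [("ex:acme", "ex:hasEmployee", "ex:alice", 0),
#          ("ex:acme", "ex:hasEmployee", "ex:bob", 0)]
```

pg_trickle's job is then to keep exactly this row set current as the underlying solutions change.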

Effort estimate: 2–3 person-weeks (the hard parts — CONSTRUCT/DESCRIBE SQL generation, spargebra algebra parsing, and pg_trickle stream table registration — are all already in place from v0.5.1 and v0.11.0)

Completed items (click to expand)

Prerequisites

  • v0.5.1 SPARQL CONSTRUCT / DESCRIBE (JSONB output) — the CONSTRUCT algebra and SQL generation pipeline is reused directly.
  • v0.11.0 SPARQL SELECT views — the pg_trickle stream table registration machinery (register_stream_table, decode-view creation, catalog tables) is extended rather than rewritten.
  • v0.11.0 pg_trickle_available() — all three new view functions gate on the same availability check.

Deliverables

  • CONSTRUCT view support (src/views.rs)
    • Extend create_sparql_view() to accept CONSTRUCT queries, or add a dedicated create_construct_view() function (preferred — keeps catalog tables separate and the error message explicit)
    • Parse spargebra::Query::Construct { template, pattern, .. }; compile pattern via the existing translate_select pipeline; expand each triple in template as a SQL row expression
    • Generate a UNION ALL SQL SELECT that returns one row per template triple per solution: SELECT encode(s_expr) AS s, encode(p_expr) AS p, encode(o_expr) AS o, 0 AS g; named-graph template triples include the graph term
    • All IRI/literal constants in the template dictionary-encoded at view-creation time (integer joins only — no string comparisons at refresh time)
    • Register result as a pg_trickle stream table with schema pg_ripple.construct_view_{name}(s BIGINT, p BIGINT, o BIGINT, g BIGINT)
    • When decode = TRUE, create a thin decoding view pg_ripple.construct_view_{name}_decoded(s TEXT, p TEXT, o TEXT, g TEXT) that joins _pg_ripple.dictionary for each column
    • Record metadata in _pg_ripple.construct_views (name, sparql, generated_sql, schedule, decode, template_count, stream_table, created_at)
  • DESCRIBE view support (src/views.rs)
    • create_describe_view(name, sparql, schedule, decode) — parse spargebra::Query::Describe { variables, pattern, .. }; compile to SQL that enumerates all triples where the described resource appears as subject (and optionally object)
    • Stream table schema: pg_ripple.describe_view_{name}(s BIGINT, p BIGINT, o BIGINT, g BIGINT) — same shape as CONSTRUCT views
    • describe_strategy GUC (already present from v0.5.1) respected: cbd (Concise Bounded Description) vs symmetric_cbd
    • Record metadata in _pg_ripple.describe_views (name, sparql, generated_sql, schedule, decode, stream_table, created_at)
  • ASK view support (src/views.rs)
    • create_ask_view(name, sparql, schedule) — parse spargebra::Query::Ask { pattern, .. }; compile to SELECT EXISTS(...) SQL
    • Stream table schema: pg_ripple.ask_view_{name}(result BOOLEAN, evaluated_at TIMESTAMPTZ DEFAULT now())
    • Record metadata in _pg_ripple.ask_views (name, sparql, generated_sql, schedule, stream_table, created_at)
  • Lifecycle management SQL functions (src/lib.rs)
    • pg_ripple.create_construct_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s', decode BOOLEAN DEFAULT FALSE) RETURNS BIGINT — returns template triple count
    • pg_ripple.drop_construct_view(name TEXT) RETURNS void
    • pg_ripple.list_construct_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, decode BOOLEAN, template_count BIGINT, stream_table TEXT, created_at TIMESTAMPTZ)
    • pg_ripple.create_describe_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s', decode BOOLEAN DEFAULT FALSE) RETURNS void
    • pg_ripple.drop_describe_view(name TEXT) RETURNS void
    • pg_ripple.list_describe_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, decode BOOLEAN, stream_table TEXT, created_at TIMESTAMPTZ)
    • pg_ripple.create_ask_view(name TEXT, sparql TEXT, schedule TEXT DEFAULT '1s') RETURNS void
    • pg_ripple.drop_ask_view(name TEXT) RETURNS void
    • pg_ripple.list_ask_views() RETURNS TABLE(name TEXT, sparql TEXT, generated_sql TEXT, schedule TEXT, stream_table TEXT, created_at TIMESTAMPTZ)
    • All nine functions call pg_trickle_available() first and raise a descriptive error with an install hint when pg_trickle is absent; never error at extension load time
  • Catalog tables (SQL migration sql/pg_ripple--0.17.0--0.18.0.sql)
    • CREATE TABLE IF NOT EXISTS _pg_ripple.construct_views (...)
    • CREATE TABLE IF NOT EXISTS _pg_ripple.describe_views (...)
    • CREATE TABLE IF NOT EXISTS _pg_ripple.ask_views (...)
  • Error handling
    • Passing a SELECT query to create_construct_view() → clear error: "sparql must be a CONSTRUCT query"
    • Passing a non-ASK query to create_ask_view() → clear error: "sparql must be an ASK query"
    • Unbound variables in CONSTRUCT template (variable present in template but not bound by the WHERE pattern) → error at view-creation time listing the unbound variables
    • Template contains a blank node (not expressible as a reusable BIGINT ID) → error advising the user to replace blank nodes with IRIs or skolemise them
  • pg_regress: construct_views.sql (create/drop/list; basic template; multi-triple template; named graph template; decode option; SELECT query rejected; unbound variable error; pg_trickle-absent error), describe_views.sql (create/drop/list; CBD vs symmetric_cbd; decode option), ask_views.sql (create/drop/list; result flips on insert/delete; pg_trickle-absent error)
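The unbound-template-variable check in the error-handling bullets is a simple set difference, computable at view-creation time: every variable used in the CONSTRUCT template must also be bound by the WHERE pattern. A sketch (names illustrative):

```python
# Sketch of the view-creation-time validation: report template
# variables the WHERE pattern never binds.

def unbound_template_vars(template_triples, where_vars):
    used = {t for triple in template_triples for t in triple
            if isinstance(t, str) and t.startswith("?")}
    return sorted(used - set(where_vars))

missing = unbound_template_vars(
    [("?c", "ex:hasEmployee", "?e"), ("?c", "ex:label", "?missing")],
    {"?c", "?e"},
)
print(missing)  # → ['?missing']
```

An empty result means the view is safe to create; a non-empty result becomes the error listing the offending variables.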

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/views.md expanded: create_construct_view, drop_construct_view, list_construct_views; create_describe_view, drop_describe_view, list_describe_views; create_ask_view, drop_ask_view, list_ask_views; stream table schemas; decode views; worked examples
  • user-guide/best-practices/sparql-patterns.md expanded: when to use CONSTRUCT views vs SELECT views; materialising inference results; using ASK views as live constraint monitors

Exit Criteria

create_construct_view() compiles a SPARQL CONSTRUCT query into a pg_trickle stream table whose rows reflect the CONSTRUCT output at all times; inserting or deleting triples that affect the WHERE pattern causes the stream table to update automatically. create_describe_view() correctly materialises the CBD of the described resources. create_ask_view() correctly updates the single-row result when the pattern's satisfiability changes. All three view types correctly reject wrong query forms with a clear error. The pg_trickle-absent error message is consistent with v0.11.0 behaviour. All new pg_regress tests pass.


v0.19.0 — Federation Performance

Theme: Connection pooling, result caching, query rewriting, and throughput improvements for remote SPARQL endpoint access.

In plain language: When querying remote SPARQL endpoints via SERVICE, every call currently creates a new HTTP connection, buffers all results in memory before processing, and makes no attempt to reduce the data fetched from the remote. This release addresses those bottlenecks: connections are reused across calls, frequently-used results are cached locally, queries are rewritten to project only the variables the outer query actually needs, multiple SERVICE clauses targeting the same endpoint are batched into a single HTTP request, and duplicate term encoding is eliminated. The result is significantly lower latency for federation-heavy workloads and better behaviour under load.
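The local result cache described above is keyed on the endpoint URL plus a 64-bit hash of the query text, with a TTL governing expiry. A hedged sketch of the lookup path (the deliverables specify XXH3-64; this sketch substitutes a stdlib hash purely for illustration, and the table layout is simplified to a dict):

```python
# Sketch of the federation result cache: key on (url, 64-bit hash of
# the SPARQL text); a hit within the TTL skips the HTTP call entirely.

import hashlib
import time

def cache_key(url: str, sparql: str) -> tuple[str, int]:
    h = hashlib.blake2b(sparql.encode(), digest_size=8).digest()
    return (url, int.from_bytes(h, "big", signed=True))  # fits a BIGINT

def lookup(cache: dict, url: str, sparql: str, now: float):
    entry = cache.get(cache_key(url, sparql))
    if entry and entry["expires_at"] > now:
        return entry["result"]          # cache hit
    return None                         # miss or expired

cache = {}
url = "https://query.wikidata.org/sparql"
cache[cache_key(url, "SELECT * WHERE { ?s ?p ?o }")] = {
    "result": [("row",)], "expires_at": time.time() + 300,
}
```

With federation_cache_ttl at its default of 0 the cache is bypassed entirely, so this path only matters once a TTL is configured.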

Effort estimate: 3–5 person-weeks

Completed items (click to expand)

Prerequisites

  • v0.16.0 SPARQL Federation — the federation.rs executor, allowlist, health monitoring, and federation_endpoints catalog table are all extended here.
  • v0.16.0 _pg_ripple.federation_health — the adaptive timeout feature reads P95 latency data from this table.

Deliverables

  • Connection pooling (src/sparql/federation.rs)

    • Replace per-call ureq::AgentBuilder::new() with a backend-local shared agent stored in a thread_local! or OnceCell
    • Reuses TCP connections and TLS sessions across SERVICE calls within a session
    • Pool size configurable via pg_ripple.federation_pool_size GUC (default: 4 per endpoint, range: 1–32)
    • Reduces TCP handshake + TLS overhead for workloads with repeated calls to the same endpoint
  • Result caching with TTL (src/sparql/federation.rs, _pg_ripple.federation_cache table)

    • Cache encoded remote results keyed on (url, XXH3-64(sparql_text))
    • Schema: _pg_ripple.federation_cache (url TEXT, query_hash BIGINT, result_jsonb JSONB, cached_at TIMESTAMPTZ, expires_at TIMESTAMPTZ)
    • On cache hit, skip the HTTP call entirely and re-encode cached results via the dictionary
    • Expired rows cleaned up by the merge background worker
    • TTL configurable via pg_ripple.federation_cache_ttl GUC (default: 0 = disabled, range: 0–86400 seconds)
    • Particularly beneficial for semi-static reference datasets (e.g. Wikidata labels, controlled vocabularies)
  • Query rewriting for data minimization (src/sparql/sqlgen.rs)

    • At translation time, compute the set of variables from the SERVICE inner pattern that are actually referenced by the outer query (joins, projections, FILTERs)
    • Rewrite the SPARQL SELECT sent to the remote endpoint to project only those variables instead of SELECT *
    • Reduces data transfer and remote processing for patterns where only a subset of result bindings are consumed
  • Partial result handling (src/sparql/federation.rs)

    • When a SERVICE call delivers rows before failing (e.g. connection drop mid-stream), use however many rows were received rather than discarding them entirely
    • Emit a WARNING naming the endpoint, the rows received, and the error
    • Controlled by pg_ripple.federation_on_partial GUC (values: 'empty' = discard partial results, 'use' = use partial results; default: 'empty')
    • Improves resilience for federated queries where partial data is better than none
  • Endpoint complexity hints (_pg_ripple.federation_endpoints schema extension)

    • Add a complexity TEXT NOT NULL DEFAULT 'normal' CHECK (complexity IN ('fast', 'normal', 'slow')) column to _pg_ripple.federation_endpoints
    • Expose via pg_ripple.register_endpoint(url, local_view_name, complexity) and a new pg_ripple.set_endpoint_complexity(url, complexity) function
    • At query planning time, reorder multiple SERVICE clauses so 'fast' endpoints execute first — enables earlier failure detection and reduces total wall-clock time for multi-endpoint queries
  • Adaptive timeout (src/sparql/federation.rs)

    • When pg_ripple.federation_adaptive_timeout = on (default: off), derive the effective timeout as max(1s, p95_latency_ms * 3 / 1000) from _pg_ripple.federation_health
    • Falls back to pg_ripple.federation_timeout when no health data is available or adaptive mode is off
    • Prevents fast endpoints from being penalised by the global timeout and slow endpoints from blocking indefinitely
  • Batch SERVICE calls to the same endpoint (src/sparql/sqlgen.rs)

    • Detect multiple SERVICE <url> clauses in a single query that target the same registered endpoint
    • Combine their inner patterns into a single SELECT * WHERE { { pattern1 } UNION { pattern2 } } SPARQL query
    • Issue one HTTP request instead of N, then split results back into per-clause variable bindings
    • Applied only when patterns are independent (no shared variables between clauses)
  • Result deduplication at encoding stage (src/sparql/federation.rs)

    • Build a per-call HashMap<String, i64> during encode_results() to avoid redundant dictionary lookups for the same term appearing in multiple rows
    • No user-visible API change; pure internal optimisation
    • Particularly effective for result sets with high-cardinality repeated values (e.g. a common subject IRI across thousands of rows)
  • GUC additions (src/lib.rs)

    • pg_ripple.federation_pool_size (INT, default: 4, range: 1–32)
    • pg_ripple.federation_cache_ttl (INT, default: 0, range: 0–86400 seconds; 0 = disabled)
    • pg_ripple.federation_on_partial (ENUM, default: 'empty'; values: 'empty', 'use')
    • pg_ripple.federation_adaptive_timeout (BOOL, default: off)
  • Migration script (sql/pg_ripple--0.18.0--0.19.0.sql)

    • ALTER TABLE _pg_ripple.federation_endpoints ADD COLUMN IF NOT EXISTS complexity TEXT NOT NULL DEFAULT 'normal' CHECK (complexity IN ('fast', 'normal', 'slow'))
    • CREATE TABLE IF NOT EXISTS _pg_ripple.federation_cache (url TEXT NOT NULL, query_hash BIGINT NOT NULL, result_jsonb JSONB NOT NULL, cached_at TIMESTAMPTZ NOT NULL DEFAULT now(), expires_at TIMESTAMPTZ NOT NULL, PRIMARY KEY (url, query_hash))
    • CREATE INDEX IF NOT EXISTS idx_federation_cache_expires ON _pg_ripple.federation_cache (expires_at)
  • pg_regress: sparql_federation_perf.sql (cache hit/miss; TTL expiry; variable projection confirmed via explain; batch detection with two SERVICE clauses to same endpoint; complexity ordering; partial result GUC; adaptive timeout GUC boundary; deduplication correctness)
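The adaptive-timeout rule above is a one-line formula: the effective timeout is max(1 s, 3 × P95 latency), with a fall-back to the global federation_timeout GUC when no health data exists or adaptive mode is off. A sketch with hypothetical health values:

```python
# Sketch of the adaptive timeout derivation from federation_health
# P95 latency, with the 1-second floor and global-GUC fallback.

def effective_timeout(p95_latency_ms, global_timeout_s=30.0, adaptive=True):
    if not adaptive or p95_latency_ms is None:
        return global_timeout_s
    return max(1.0, p95_latency_ms * 3 / 1000)

print(effective_timeout(120))   # fast endpoint: floor applies → 1.0
print(effective_timeout(4000))  # slow endpoint → 12.0
print(effective_timeout(None))  # no health data → 30.0
```

The floor stops a very fast endpoint from being cut off by transient jitter, while the 3× multiplier keeps a slow endpoint from consuming the full global budget.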

Documentation

See plans/documentation.md for details.

  • user-guide/sql-reference/federation.md extended: new GUCs table; connection pooling notes; result caching section with TTL examples; complexity hints; variable projection rewrite behaviour; batching semantics; adaptive timeout
  • user-guide/best-practices/federation-performance.md (new page): choosing cache TTL; when to set complexity hints; designing queries to benefit from variable projection; monitoring with federation_health and federation_cache; sidecar vs in-process tradeoffs

Exit Criteria

A federated query making repeated calls to the same endpoint is measurably faster due to connection reuse. A query with cacheable SERVICE results performs a single HTTP call across multiple executions within the TTL window. Multiple SERVICE clauses targeting the same endpoint are confirmed (via logged SPARQL text) to collapse into one HTTP request. Variable projection is confirmed by inspecting the SPARQL text sent to the endpoint. All new pg_regress tests pass.


v0.20.0 — W3C Conformance & Stability Foundation

Theme: Standards compliance, crash safety, and production readiness preparation.

In plain language: As we approach the 1.0 release, this milestone focuses on confidence. Instead of building new features, we verify that everything already built works correctly according to the official W3C standards. We run pg_ripple's SPARQL engine and SHACL validator against the W3C test suites and fix any edge cases. We test what happens when the database crashes and verify recovery is clean. We scan the code for security vulnerabilities. And we benchmark at scale (100M triples) to establish baselines. The result is a release that's ready for production users to rely on.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Deliverables

  • W3C SPARQL 1.1 Query test suite conformance

    • Download and run the official W3C SPARQL 1.1 Query test suite
    • Implement missing query features or fix conformance bugs
    • Document unsupported features (property functions, custom aggregate functions) with rationale
    • Verify conformance via both SQL (pg_ripple.sparql()) and HTTP (/sparql endpoint) interfaces
    • Create tests/pg_regress/w3c_sparql_query_conformance.sql with representative W3C test cases; mark expected failures clearly
    • Federation (SERVICE) conformance covered by v0.16.0; no additional work needed
    • Target: ≥95% of applicable W3C Query test suite passes (excluding property functions, language tags in comparisons, and other known limitations)
  • W3C SPARQL 1.1 Update test suite conformance

    • Download and run the official W3C SPARQL 1.1 Update test suite
    • Implement missing update features or fix conformance bugs
    • Document unsupported features with rationale
    • Create tests/pg_regress/w3c_sparql_update_conformance.sql with representative W3C test cases
    • Target: ≥95% of applicable W3C Update test suite passes
  • W3C SHACL Core test suite conformance

    • Download and run the official W3C SHACL Core test suite
    • Implement missing validators or fix conformance bugs
    • Critical constraint: Any optimization strategy used in shape compilation must produce externally-visible results identical to the reference semantics; if an optimization changes the set of violations reported, it is a regression
    • Create tests/pg_regress/w3c_shacl_conformance.sql with representative W3C test cases
    • Document any limitations (e.g. SHACL Advanced features not yet implemented, deferred to v0.8.0 or later)
    • Target: ≥95% of SHACL Core test suite passes
  • Crash recovery testing framework

    • tests/crash_recovery/merge_during_kill.sh — start a bulk load, kill -9 the PostgreSQL backend during HTAP generation merge, restart PostgreSQL, verify:
      • No corruption in _pg_ripple.predicates catalog
      • VP table data is recoverable (rows visible, no stray VACUUM marks)
      • Dictionary is consistent (no orphaned or duplicate entries)
      • Subsequent queries return correct results
    • tests/crash_recovery/dict_during_kill.sh — kill -9 during a high-volume dictionary encoding operation (e.g. bulk load), verify dictionary consistency
    • tests/crash_recovery/shacl_during_violation.sh — kill -9 during async validation queue processing, verify no violation reports are lost and no rows are orphaned
    • Run these as part of regular CI (nightly schedule, ~30 min total)
    • Document recovery procedure for production operators (backup/restore, WAL replays)
  • Memory leak detection

    • Set up cargo pgrx test --valgrind invocation for a curated subset of unit tests (heap allocations are the main concern; stack overflows out of scope)
    • Identify and fix any definite leaks (not just reachable at program exit)
    • Focus areas: shared-memory allocations, per-query temporary buffers, dictionary cache evictions, failed error paths
    • Document baseline leak-free status in release notes
    • CI nightly run (timeout 2 hours)
  • Security review (Phase 1)

    • SPI query generation review: Audit all src/sparql/sqlgen.rs and src/datalog/compiler.rs for potential SQL injection vectors
      • All IRI/literal constants must be dictionary-encoded before SQL generation
      • No string interpolation into generated SQL (format! only for identifiers via format_ident!)
      • Create a checklist document listing all unsafe patterns and their mitigations
    • Shared memory safety review: Audit src/shmem.rs and all pgrx::PgSharedMem usage for:
      • Data races (concurrent access without synchronization)
      • Bounds violations (buffer overflows, stack smashing)
      • Use-after-free (stale pointers after shmem recreation)
      • Create a checklist document with findings and resolutions
    • Dictionary cache timing side-channels review: Verify that encode/decode latency does not leak dictionary size, IRI patterns, or other sensitive metadata
    • Document findings in reference/security.md; create follow-up issues for Phase 2 (v0.21.0 or later) if needed
  • Benchmarking at scale (100M triples)

    • Extend BSBM benchmark infrastructure to run with 100M triples (BSBM scale factor ≥30)
    • Measure query latency, throughput, memory usage, merge worker performance
    • Publish baseline results in release notes: e.g. "Query latency: <50ms p95 on 100M triples with 4 GiB shared memory"
    • Store results artifact in CI (for regression detection in future releases)
    • Compare with v0.19.0 results to detect performance regressions
    • Known constraint: BSBM at 100M triples on a single 4-core developer machine will take ~4–6 hours; run nightly or on a larger CI machine
  • API stability audit (documentation only; no code changes)

    • Audit all pg_ripple.* SQL functions for API stability
    • Designate these as stable / guaranteed API for 1.x releases
    • Document that _pg_ripple.* schema is private and subject to change
    • Create reference/api-stability.md documenting the stability contract
  • Migration script (sql/pg_ripple--0.19.0--0.20.0.sql)

    • If there are schema changes from conformance fixes, add them here
    • If no schema changes are required, leave the migration script as an empty comment block with a note explaining what new functions/GUCs (if any) are provided
    • Per extension versioning conventions (AGENTS.md), the migration script must exist even if empty
  • pg_regress: w3c_sparql_query_conformance.sql, w3c_sparql_update_conformance.sql, w3c_shacl_conformance.sql, crash_recovery_merge.sql (basic recovery smoke test)

  • 100% W3C SPARQL 1.1 Query conformance — fix all remaining known limitations:

    • FILTER string functions: CONTAINS(), STRSTARTS(), STRENDS(), REGEX() — translate to SQL strpos, starts_with, right(), ~ / ~*
    • FILTER NOT EXISTS { ... } — translate to SQL NOT EXISTS (correlated subquery)
    • Subquery + LIMIT in outer JOIN — wrap the inner slice pattern in a SQL subquery with LIMIT applied before the outer join
    • Target: all assertions in w3c_sparql_query_conformance.sql pass with exact expected values
  • 100% W3C SHACL Core conformance — fix validate() false-negative on conforming graphs:

    • Root cause: value_has_datatype() returns false for inline-encoded types (xsd:integer, xsd:boolean, xsd:dateTime, xsd:date) because inline IDs are never stored in the dictionary
    • Fix: detect inline IDs (id < 0) and determine their datatype from the inline type code without a DB round-trip
    • Additionally: plain literals (kind=KIND_LITERAL, xsd:string normalization) now correctly satisfy sh:datatype xsd:string
    • Additionally: sh:in with string literal values now encodes them via dictionary lookup instead of lookup_iri
    • Target: validate() returns conforms=true for all conforming graphs; violation detection remains 100%
  • 100% W3C SPARQL 1.1 Update test suite conformance — implement full update operator coverage:

    • USING <g> / WITH <g> clauses: restrict WHERE evaluation to the specified dataset graph(s)
    • CLEAR ALL, CLEAR DEFAULT, CLEAR NAMED — all graph-target variants
    • DROP ALL, DROP DEFAULT, DROP NAMED — all graph-target variants
    • ADD <src> TO <dst> — copy triples from source graph to destination (source preserved)
    • COPY <src> TO <dst> — clear destination then copy source (source preserved)
    • MOVE <src> TO <dst> — copy source to destination then drop source
    • DELETE WHERE { ... } shorthand — pattern used as both delete template and WHERE clause
    • Multi-graph USING: USING <g1> USING <g2> expands to UNION of GRAPH patterns in WHERE
    • Target: all assertions in w3c_sparql_update_conformance.sql (sections 1–16) pass with exact expected values

Documentation

See plans/documentation.md for details.

  • reference/w3c-conformance.md (new page) — W3C test suite results summary, supported subset list, unsupported features with rationale, known limitations
  • reference/security.md (Phase 1 findings) — SPI injection mitigations, shared memory safety, side-channel analysis
  • reference/api-stability.md (new page) — stable API contract, pg_ripple.* functions, _pg_ripple.* schema privacy
  • user-guide/backup-restore.md expanded: crash recovery procedure, WAL replay, PITR workflow
  • Release notes for v0.20.0 — include BSBM 100M triple baseline results, W3C test suite summary, security audit findings

Exit Criteria

W3C SPARQL 1.1 Query test suite: ≥95% pass rate. W3C SPARQL 1.1 Update test suite: ≥95% pass rate. W3C SHACL Core test suite: ≥95% pass rate. Crash recovery framework operational: database recovers cleanly from kill -9 during merge, bulk load, and validation. Valgrind finds no definite memory leaks. Security review Phase 1 complete: all SPI injection vectors documented and mitigated, shared memory audit complete. BSBM 100M triple baseline published. API stability contract documented.


v0.21.0 — SPARQL Built-in Functions & Query Correctness

Theme: Implement all ~40 missing SPARQL 1.1 built-in functions, fix the FILTER silent-drop correctness hazard, and close several high-priority query-semantics bugs identified in the v0.20.0 gap analysis.

In plain language: Until now, pg_ripple's SPARQL engine understood the grammar of standard functions like UCASE, IF, DATATYPE, and isIRI — but silently ignored them at runtime, returning too many rows instead of the correctly filtered set. This release makes those functions actually work. It also fixes several query-correctness issues that were masked by the existing conformance test suite: wrong sort-order for NULL values, p* paths generating phantom reflexive rows on nodes that don't participate in the property at all, and GROUP_CONCAT ignoring the DISTINCT keyword. After this release, any unsupported expression raises a clear named error rather than silently dropping the filter.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • SPARQL 1.1 built-in function surface — full implementation

    • String functions: STR, STRLEN, SUBSTR, UCASE, LCASE, CONCAT, REPLACE, ENCODE_FOR_URI, STRLANG, STRDT (in addition to STRSTARTS, STRENDS, CONTAINS, REGEX already present)
    • Type-testing predicates: isIRI, isLiteral, isBlank, isNumeric, sameTerm
    • Term construction and access: IRI (alias URI), BNODE, LANG, DATATYPE, LANGMATCHES
    • Numeric functions: ABS, CEIL, FLOOR, ROUND, RAND
    • Datetime functions: NOW, YEAR, MONTH, DAY, HOURS, MINUTES, SECONDS, TIMEZONE, TZ
    • Hash / UUID functions: MD5, SHA1, SHA256, SHA384, SHA512, UUID, STRUUID
    • Control functions: IF, COALESCE
    • Implementation strategy: decode the dictionary ID to the term value at expression-evaluation time; compile to PostgreSQL equivalents where available (LOWER, UPPER, SUBSTR, MD5, NOW(), ABS, CEIL, FLOOR, ROUND, gen_random_uuid(), etc.); datetime functions extract fields from xsd:dateTime literals via to_timestamp + EXTRACT; hash functions operate over the term's string representation
    • Introduce a typed SqlExpr intermediate representation in src/sparql/expr.rs replacing the current raw-String output from translate_expr() — makes the function dispatch table explicit and independently testable
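The typed-IR idea can be sketched in a few lines of Rust. This is a minimal illustration, not pg_ripple's actual src/sparql/expr.rs: the variant names, the `dispatch` table, and `render` are hypothetical, and quoting is deliberately simplified.

```rust
// Hypothetical sketch of a typed SQL expression IR replacing raw-String
// output from translate_expr(). All names here are illustrative.
#[derive(Debug, Clone)]
pub enum SqlExpr {
    Column(String),                   // a column reference in the generated SQL
    StringLit(String),                // a quoted string literal
    Call(&'static str, Vec<SqlExpr>), // a SQL function call by name
}

/// Map a SPARQL built-in name to its PostgreSQL equivalent, if supported.
/// Unsupported names return None so the caller can raise a named error
/// instead of silently dropping the expression.
pub fn dispatch(sparql_fn: &str, args: Vec<SqlExpr>) -> Option<SqlExpr> {
    let pg_name = match sparql_fn {
        "UCASE" => "UPPER",
        "LCASE" => "LOWER",
        "STRLEN" => "LENGTH",
        "ABS" => "ABS",
        _ => return None, // unimplemented: caller hard-errors under sparql_strict
    };
    Some(SqlExpr::Call(pg_name, args))
}

/// Render the IR to a SQL fragment (quoting simplified for illustration).
pub fn render(e: &SqlExpr) -> String {
    match e {
        SqlExpr::Column(c) => c.clone(),
        SqlExpr::StringLit(s) => format!("'{}'", s.replace('\'', "''")),
        SqlExpr::Call(f, args) => {
            let rendered: Vec<String> = args.iter().map(render).collect();
            format!("{}({})", f, rendered.join(", "))
        }
    }
}

fn main() {
    let upper = dispatch("UCASE", vec![SqlExpr::Column("name".into())]).unwrap();
    assert_eq!(render(&upper), "UPPER(name)");
    // Anything outside this sketch's table maps to None: the real caller
    // would raise ERRCODE_FEATURE_NOT_SUPPORTED rather than drop the filter.
    assert!(dispatch("TIMEZONE", vec![]).is_none());
}
```

The benefit over raw strings is that the dispatch table is a plain `match` that can be unit-tested without a database connection.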
  • FILTER silent-drop fix

    • Change translate_expr() so that an unsupported expression variant raises a structured ERRCODE_FEATURE_NOT_SUPPORTED error naming the unimplemented function, rather than returning None and silently dropping the predicate from the SQL WHERE clause
    • Add pg_ripple.sparql_strict GUC (default: on): when off, the legacy warn-and-drop behaviour is preserved for compatibility; when on (default from this release onwards), unsupported expressions hard-error
    • Migration script sql/pg_ripple--0.20.0--0.21.0.sql: register the sparql_strict GUC with its default
  • Query correctness fixes

    • ORDER BY NULL placement: append NULLS FIRST to every ASC clause and NULLS LAST to every DESC clause in the SQL generator, matching SPARQL 1.1 §15.1 semantics (unbound variables sort lowest, so they come first in ascending order and last in descending order)
    • GROUP_CONCAT(DISTINCT …): honour the distinct flag in AggregateExpression::GroupConcat — emit STRING_AGG(DISTINCT …, sep) rather than silently dropping the deduplication
    • p* (ZeroOrMore) reflexive rows: restrict the zero-hop identity row to subjects that actually appear in the predicate's VP tables, preventing spurious reflexive paths for all nodes in the graph
    • Property-path cycle detection: change CYCLE o SET _is_cycle USING _cycle_path to CYCLE s, o SET _is_cycle USING _cycle_path in all WITH RECURSIVE path CTEs — prevents false cycle detection in DAGs that have shared intermediate nodes
    • Self-join dedup key: replace the format!("{tp}") Debug-string key in BGP pattern deduplication with a structural (s_term_id, p_term_id, o_term_id) tuple so that only genuinely identical patterns are collapsed
    • REDUCED semantics: implemented as DISTINCT, which the SPARQL 1.1 specification permits (REDUCED may eliminate any subset of duplicates); documented in reference/sparql-reference.md
  • SPARQL property path & federation completeness

    • Negated property sets !(p1|p2|…): compile to an anti-join scanning all VP tables; correctly excludes the listed predicates
    • SERVICE SILENT: when the silent flag is set on a SERVICE block, federation errors return an empty result set rather than propagating the error
  • W3C conformance test assertions updated

    • All count(*) >= 0 AS label_no_error shims replaced with real value-checking assertions in w3c_sparql_query_conformance.sql

Documentation

See plans/documentation.md for details.

  • reference/sparql-functions.md (new page) — every SPARQL 1.1 built-in function, implementation status, PostgreSQL equivalent used, and known limitations
  • user-guide/sparql-reference.md updated with complete function table and sparql_strict GUC guidance
  • reference/w3c-conformance.md updated — replace label_no_error placeholder entries with accurate pass / skip / fail classification
  • Release notes for v0.21.0 — list every newly implemented function; highlight the FILTER silent-drop fix

Exit Criteria

Every SPARQL 1.1 built-in function from the W3C SPARQL 1.1 function library (§17) either works correctly or raises a named ERRCODE_FEATURE_NOT_SUPPORTED error — never silently drops. w3c_sparql_query_conformance.sql passes with real value-checking assertions (no >= 0 shims). sparql_builtins.sql passes for all implemented functions. ORDER BY NULL placement, property-path cycle detection on a DAG, ZeroOrMore scope restriction, and GROUP_CONCAT DISTINCT each have a dedicated passing regression test. property_path_negated.sql passes for single and multi-predicate negated sets. service_silent.sql returns zero rows rather than an error on an unreachable SERVICE SILENT endpoint. reference/sparql-reference.md documents the REDUCED-as-DISTINCT equivalence choice.


v0.22.0 — Storage Correctness & Security Hardening

Theme: Fix the critical data-integrity issues in the storage layer (dictionary cache rollback, HTAP merge races, shmem cache thrashing, rare-predicate promotion race) and close the security gaps in the HTTP companion service and privilege model identified in the v0.20.0 gap analysis.

In plain language: This release addresses issues that could silently corrupt data or create security vulnerabilities in production deployments. The most important fix: if a database transaction is rolled back, pg_ripple's internal term-ID cache now correctly discards the rolled-back entries — previously, stale IDs could be planted into the triple store, creating phantom references that make facts disappear or return the wrong data. Two race conditions in the background merge process that could cause deleted facts to reappear, or queries to error mid-merge, are also closed. The internal shared-memory cache is redesigned to handle large vocabularies without thrashing. On the security side, the HTTP companion service's rate-limiting finally works, error messages no longer leak internal database details to API clients, and the _pg_ripple internal schema is explicitly locked away from unprivileged roles.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • Dictionary cache rollback correctness (critical fix C-2)

    • Register RegisterXactCallback and RegisterSubXactCallback during _PG_init — on XACT_EVENT_ABORT and XACT_EVENT_PARALLEL_ABORT, drain both ENCODE_CACHE and DECODE_CACHE thread-local LRU caches so rolled-back term IDs cannot be served to future encode calls in the same backend session
    • Stamp a per-backend epoch counter; bump on rollback; the shared-memory encode cache stores the write epoch at insertion time and rejects cache hits from a prior epoch, ensuring the shmem path is also safe
    • New pg_regress test dictionary_rollback.sql: BEGIN; pg_ripple.insert_triple(…new term…); ROLLBACK; pg_ripple.insert_triple(same term again); verify pg_ripple.decode_id(id) = original term string, not NULL
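The epoch-fence mechanism can be illustrated with a plain in-process cache. This is a sketch only: `EpochCache` and its methods are hypothetical stand-ins; the real cache lives in shared memory and `on_rollback` is driven by the XACT_EVENT_ABORT callback described above.

```rust
use std::collections::HashMap;

// Illustrative sketch of the per-backend epoch fence: each entry remembers
// the epoch it was written in, and a rollback bumps the epoch so term IDs
// cached by an aborted transaction can never be served again.
pub struct EpochCache {
    epoch: u64,
    entries: HashMap<String, (u64, i64)>, // term -> (write_epoch, term_id)
}

impl EpochCache {
    pub fn new() -> Self {
        Self { epoch: 0, entries: HashMap::new() }
    }

    pub fn insert(&mut self, term: &str, id: i64) {
        self.entries.insert(term.to_string(), (self.epoch, id));
    }

    /// A hit is only valid if the entry was written in the current epoch.
    pub fn get(&self, term: &str) -> Option<i64> {
        match self.entries.get(term) {
            Some((e, id)) if *e == self.epoch => Some(*id),
            _ => None, // stale (pre-rollback) entry: treat as a miss
        }
    }

    /// In the real extension this runs from the XACT_EVENT_ABORT callback.
    pub fn on_rollback(&mut self) {
        self.epoch += 1;
    }
}

fn main() {
    let mut cache = EpochCache::new();
    cache.insert("ex:alice", 42);
    assert_eq!(cache.get("ex:alice"), Some(42));
    cache.on_rollback(); // transaction aborted
    assert_eq!(cache.get("ex:alice"), None); // rolled-back ID is not served
}
```

Bumping a counter is cheaper than eagerly scanning the cache for entries to evict, which matters when the abort callback runs on every rollback.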
  • HTAP merge race fixes (critical fixes C-3 and C-4)

    • C-3 (view-rename atomicity): remove the CREATE OR REPLACE VIEW vp_N step from the merge cycle — the view's FROM clause always names vp_N_main directly, which PG re-resolves after the rename; the CREATE OR REPLACE VIEW call is eliminated, closing the window between rename and view-rebuild
    • C-4 (tombstone resurrection): record max_sid_at_snapshot at merge-start (currval('_pg_ripple.statement_id_seq') before processing); at merge-end TRUNCATE, only delete tombstones with i ≤ max_sid_at_snapshot — tombstones for deletes that committed after the snapshot survive to the next merge cycle
    • New pg_regress test merge_race.sql: issue a pg_ripple.delete_triple() concurrently with pg_ripple.force_merge(); verify the deleted triple does not reappear; verify no "relation does not exist" error under a concurrent pg_ripple.sparql() call
  • Merge deduplication and rebuild_subject_patterns correctness (high fixes H-6, H-7)

    • H-6 (cross-merge duplicate visibility): add a UNIQUE (s, o, g) constraint to vp_{id}_delta and change insert_triple to use ON CONFLICT DO NOTHING; update the VP view definition to carry DISTINCT ON (s, o, g) as a safety net for rows that crossed a merge boundary before the constraint was present — prevents a triple from appearing twice in query results when it exists in both main and delta
    • H-7 (vp_rare double-count in star patterns): fix rebuild_subject_patterns() in src/storage/merge.rs to enumerate only predicates that have a dedicated VP table (listed in _pg_ripple.predicates with a non-null table_oid); skip vp_rare as a direct scan target — vp_rare rows are already reachable via their per-predicate plans and must not be scanned a second time as the raw table
    • New pg_regress test merge_dedup.sql: insert the same triple before and after pg_ripple.force_merge(); verify the query returns exactly one result row; verify triple_count in the predicate catalog equals 1
  • Shared-memory encode cache — 4-way set-associative redesign (high fix H-1)

    • Replace the direct-mapped 4096-slot cache with a 4-way set-associative layout: 1024 sets × 4 ways — same memory footprint as before, birthday-collision rate drops from ~15% to <1% at 5k hot terms
    • LRU eviction within each 4-way set using a 2-bit age field packed into the existing (hash_parts, id) slot struct
    • New pg_ripple.cache_stats() SQL function returning (hits BIGINT, misses BIGINT, evictions BIGINT, utilisation FLOAT) — exposes hit rate for monitoring
    • Benchmark gate: just bench-cache asserts hit rate ≥ 95% on a 10k-predicate workload; CI fails on regression below 90%
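The set-associative layout can be sketched as follows. This is an in-process illustration under stated assumptions: the struct names are hypothetical, the hash is a bare modulus, and the real cache packs its 2-bit ages into the existing shared-memory slot struct rather than a separate field.

```rust
// Sketch of a 4-way set-associative encode cache with a 2-bit LRU age per
// way (1024 sets x 4 ways, as above). Hashing and slot layout simplified.
const SETS: usize = 1024;
const WAYS: usize = 4;

#[derive(Clone, Copy)]
struct Slot { key: u64, id: i64, age: u8, used: bool }

pub struct SetAssocCache { slots: Vec<Slot> }

impl SetAssocCache {
    pub fn new() -> Self {
        Self { slots: vec![Slot { key: 0, id: 0, age: 0, used: false }; SETS * WAYS] }
    }

    fn set_of(key: u64) -> usize { (key as usize) % SETS }

    pub fn get(&mut self, key: u64) -> Option<i64> {
        let base = Self::set_of(key) * WAYS;
        let hit = (0..WAYS).find(|w| {
            let s = self.slots[base + w];
            s.used && s.key == key
        })?;
        for w in 0..WAYS {
            let slot = &mut self.slots[base + w];
            if w == hit { slot.age = 0; }            // most recently used
            else if slot.age < 3 { slot.age += 1; }  // 2-bit saturating age
        }
        Some(self.slots[base + hit].id)
    }

    pub fn insert(&mut self, key: u64, id: i64) {
        let base = Self::set_of(key) * WAYS;
        // Prefer an empty way; otherwise evict the oldest (largest age).
        let victim = (0..WAYS)
            .find(|&w| !self.slots[base + w].used)
            .unwrap_or_else(|| {
                (0..WAYS).max_by_key(|&w| self.slots[base + w].age).unwrap()
            });
        for w in 0..WAYS {
            let s = &mut self.slots[base + w];
            if s.used && s.age < 3 { s.age += 1; } // age the survivors
        }
        self.slots[base + victim] = Slot { key, id, age: 0, used: true };
    }
}

fn main() {
    let mut c = SetAssocCache::new();
    // Five keys that all map to set 0: a direct-mapped cache would thrash,
    // but with 4 ways only the least recently used entry is evicted.
    for k in [0u64, 1024, 2048, 3072, 4096] { c.insert(k, k as i64); }
    assert_eq!(c.get(0), None);          // oldest entry was the LRU victim
    assert_eq!(c.get(4096), Some(4096)); // survivors remain resident
    assert_eq!(c.get(1024), Some(1024));
}
```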
  • Bloom filter per-bit reference counting (high fix H-2)

    • Replace the boolean u64 bloom words with 8-bit saturating counters in the delta bloom shared-memory segment
    • set_predicate_delta_bit(pred_id): increment both bloom counter positions (saturates at 255)
    • clear_predicate_delta_bit(pred_id): decrement both counters; a position is treated as unset only when its counter reaches 0 — prevents false-negative delta skips for predicates that hash-collide with a predicate being concurrently merged
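The counter semantics can be sketched compactly. Assumptions are labeled in the comments: the type names and the two placeholder position hashes are illustrative, and the real counters live in a shared-memory segment.

```rust
// Sketch of the per-bit reference-counted delta bloom filter: boolean bits
// become 8-bit saturating counters, so clearing one predicate's positions
// cannot unset a colliding predicate that still has pending deltas.
// BLOOM_SLOTS and the position hashes are placeholder choices.
const BLOOM_SLOTS: usize = 64;

pub struct CountingBloom { counters: [u8; BLOOM_SLOTS] }

impl CountingBloom {
    pub fn new() -> Self { Self { counters: [0; BLOOM_SLOTS] } }

    // Two independent positions per predicate id (placeholder hashes).
    fn positions(pred_id: u64) -> (usize, usize) {
        ((pred_id as usize) % BLOOM_SLOTS,
         (pred_id.wrapping_mul(31).wrapping_add(7) as usize) % BLOOM_SLOTS)
    }

    pub fn set(&mut self, pred_id: u64) {
        let (a, b) = Self::positions(pred_id);
        self.counters[a] = self.counters[a].saturating_add(1); // saturates at 255
        self.counters[b] = self.counters[b].saturating_add(1);
    }

    pub fn clear(&mut self, pred_id: u64) {
        let (a, b) = Self::positions(pred_id);
        self.counters[a] = self.counters[a].saturating_sub(1);
        self.counters[b] = self.counters[b].saturating_sub(1);
    }

    /// A "bit" is considered set while both its counters are non-zero.
    pub fn may_have_delta(&self, pred_id: u64) -> bool {
        let (a, b) = Self::positions(pred_id);
        self.counters[a] > 0 && self.counters[b] > 0
    }
}

fn main() {
    let mut bloom = CountingBloom::new();
    bloom.set(7);   // one backend records a delta for predicate 7
    bloom.set(7);   // a second delta for the same positions
    bloom.clear(7); // a merge consumes one of them
    // With plain booleans the clear would wrongly drop the flag entirely;
    // the same reasoning protects a distinct predicate sharing a position.
    assert!(bloom.may_have_delta(7));
    bloom.clear(7);
    assert!(!bloom.may_have_delta(7));
}
```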
  • Rare-predicate promotion atomicity (high fixes H-3 and H-4)

    • Rewrite promote_predicate() to use a single atomic CTE: WITH moved AS (DELETE FROM _pg_ripple.vp_rare WHERE p = $1 RETURNING s, o, g, i, source) INSERT INTO _pg_ripple.vp_{id}_delta (s, o, g, i, source) SELECT * FROM moved — eliminates the two-statement window where concurrent inserts can orphan rows in vp_rare under a predicate that now has its own VP table
    • After the CTE, UPDATE _pg_ripple.predicates SET triple_count = (SELECT count(*) FROM _pg_ripple.vp_{id}_delta) WHERE id = $1 to restore accurate planner statistics rather than leaving triple_count = 0 after promotion
    • pg_regress test: load > vp_promotion_threshold triples for a single predicate while a concurrent transaction also inserts into vp_rare for that predicate; verify zero orphan rows after promotion completes
  • pg_ripple_http security hardening (high fixes H-14, H-15; medium fixes M-13, S-4)

    • Rate limiting: integrate tower_governor crate; PG_RIPPLE_HTTP_RATE_LIMIT env var is now enforced as requests-per-second per source IP (default 100 req/s); excess requests receive 429 Too Many Requests with Retry-After header
    • Error redaction: replace verbatim PostgreSQL error text in HTTP 4xx/5xx responses with {"error": "<category>", "trace_id": "<uuid>"} JSON; log the full PG error + trace ID at server ERROR level — internal schema names, GUC values, and file paths are never exposed to API clients
    • Constant-time auth: replace token != expected.as_str() with !constant_time_eq(token.as_bytes(), expected.as_bytes()) using the constant_time_eq crate
    • Federation URL scheme validation: pg_ripple.register_endpoint() rejects any URL whose scheme is not http or https with ERRCODE_INVALID_PARAMETER_VALUE — prevents file://, gopher://, or other scheme registration even though ureq would refuse them at connection time
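The constant-time comparison the constant_time_eq crate performs is worth seeing spelled out; this std-only sketch shows the technique (XOR-accumulate, no early exit), not the crate's internals.

```rust
// Sketch of constant-time token comparison: accumulate every byte's XOR so
// the loop never short-circuits on the first mismatch. An early-exit `!=`
// leaks the length of the matching prefix through response timing.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // length is not secret here: tokens have a fixed size
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // record differences without branching on them
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"secret-token", b"secret-token"));
    assert!(!ct_eq(b"secret-token", b"secret-tokeX"));
    assert!(!ct_eq(b"short", b"longer-token"));
}
```

In production the crate is still preferable to a hand-rolled loop, since the compiler is less likely to optimize its accumulation into a branch.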
  • Privilege model hardening (medium fix M-14)

    • Migration script sql/pg_ripple--0.21.0--0.22.0.sql: REVOKE ALL ON SCHEMA _pg_ripple FROM PUBLIC; REVOKE ALL ON ALL TABLES IN SCHEMA _pg_ripple FROM PUBLIC; REVOKE ALL ON ALL SEQUENCES IN SCHEMA _pg_ripple FROM PUBLIC;
    • New pg_regress test privilege_isolation.sql: create a non-superuser role; verify SELECT * FROM _pg_ripple.dictionary raises a "permission denied" error; verify SELECT * FROM pg_ripple.find_triples(NULL, NULL, NULL) still works (public API unaffected)
  • GUC bounds and merge worker signal handling (medium fixes M-12, M-15)

    • pg_ripple.vp_promotion_threshold: add min = 10 and max = 10_000_000 constraints to the pgrx GUC definition — prevents catalog explosion at threshold = 1 and permanent vp_rare lock-in at threshold = INT_MAX
    • Merge worker: call BackgroundWorker::reset_latch() immediately before std::thread::sleep in the error back-off path — prevents a busy-wait loop where a SIGHUP received during the sleep keeps wait_latch returning immediately on the next cycle

Documentation

See plans/documentation.md for details.

  • reference/security.md Phase 2 section: rate limiting configuration, error-redaction policy, privilege model, constant-time auth rationale, URL scheme enforcement
  • user-guide/operations.md updated: rollback safety guarantee for dictionary cache, merge correctness guarantees (tombstone epoch fence), pg_ripple.cache_stats() monitoring
  • user-guide/upgrading.md updated: v0.21.0→v0.22.0 privilege change (REVOKE) is safe for all existing deployments; no data migration required
  • Release notes for v0.22.0 — highlight dictionary-rollback fix, merge race fixes, HTTP security changes

Exit Criteria

Rolled-back insert_triple cannot plant a phantom ID (dictionary_rollback.sql pg_regress passes). merge_race.sql passes with zero tombstone resurrections and zero "relation does not exist" errors under a concurrent query. merge_dedup.sql passes — inserting the same triple across a merge boundary returns exactly one result row. Shmem cache benchmark reports ≥ 95% hit rate at 10k hot terms. pg_ripple_http returns 429 when rate limit is exceeded (verified by integration test). Unprivileged role is denied SELECT on _pg_ripple.* (privilege_isolation.sql passes). All migration scripts from 0.1.0 through 0.22.0 run cleanly via just test-migration.


v0.23.0 — SHACL Core Completion & SPARQL Diagnostics

Theme: Complete the SHACL 1.0 Core constraint set, introduce first-class SPARQL query introspection, and fix correctness issues in the Datalog engine and JSON-LD framing identified in the v0.20.0 gap analysis.

In plain language: This release makes pg_ripple's data-quality rules (SHACL) useful for real-world schemas. Until now, common constraints like "this property must have a specific value" (sh:hasValue), "this node must have exactly this type" (sh:nodeKind), and "no properties outside this allowed list" (sh:closed) were silently ignored. They now work. Separately, a new function pg_ripple.explain_sparql() lets you see exactly what SQL pg_ripple generates for a SPARQL query — invaluable for diagnosing slow queries. The Datalog engine also receives three correctness fixes: arithmetic division errors now name the rule that caused them, rules with undefined variables now error at compile time rather than silently matching nothing, and cyclic negation is correctly detected.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • SHACL Core constraint completion (medium fix M-18)

    • sh:hasValue: verify that at least one value matches the given RDF term; compile to EXISTS (SELECT 1 FROM vp_{id} WHERE s = $node AND o = $encoded_value)
    • sh:closed + sh:ignoredProperties: reject triples whose predicate is not in the shape's declared property set; compile to a NOT EXISTS anti-join over all VP tables scoped to the focus node, excluding the declared properties and the ignore list
    • sh:nodeKind: validate that each value is an IRI, blank node, or literal as declared; discriminate using the dictionary kind column
    • sh:languageIn: compile to lang(value) = ANY($language_tags_array) after decoding the language tag from the literal's dictionary entry
    • sh:uniqueLang: use COUNT(*) OVER (PARTITION BY lang(value)) and reject partitions with count > 1
    • sh:lessThan / sh:greaterThan: emit a comparison join between the focus node's two property values, decoding literals to numeric/date types for ordering
    • sh:qualifiedValueShape: sh:qualifiedMinCount / sh:qualifiedMaxCount on a nested shape — count focus-node values matching the inner shape and compare against the declared bounds
    • sh:path with property path expressions: extend the shape compiler to accept inverse paths (sh:inversePath), alternative paths (sh:alternativePath), sequence paths, and zero-or-more/one-or-more/zero-or-one paths — each maps to the corresponding property-path CTE already used in the SPARQL engine
    • Turtle block comment handling (M-11): add a /* … */ block-comment stripping pass in the SHACL shape pre-processor at src/shacl/mod.rs before the document is handed to the Turtle parser — regex: strip (?s)/\*.*?\*/; allows SPARQL-style block-commented shapes to load correctly
    • New pg_regress test shacl_core_completion.sql — one test per new constraint with passing, failing, and edge-case triples; verified against the W3C SHACL Core test suite manifest
  • SPARQL query introspection (feature F-3 from the gap analysis)

    • New SQL function pg_ripple.explain_sparql(query TEXT, format TEXT DEFAULT 'text') RETURNS TEXT
    • When format = 'sql': returns the generated SQL string produced by translate_select() without executing it — useful for manual inspection
    • When format = 'text' (default) or 'json': runs EXPLAIN (ANALYZE, FORMAT text/json) on the generated SQL via SPI and returns the plan output
    • When format = 'sparql_algebra': returns the spargebra algebra tree serialised as indented text via Debug formatting — exposes the optimizer's view of the query
    • Security: SECURITY DEFINER is not used; the caller needs SELECT privilege on the relevant VP tables (same as pg_ripple.sparql())
    • New pg_regress test explain_sparql.sql — verifies that the function returns non-empty output for a known-good SELECT query and does not error on edge cases (empty graph, VALUES-only query, property path query)
  • SHACL query-optimization hint verification (performance fix P-5)

    • Verify that sh:maxCount 1 on a predicate elides DISTINCT in the SQL generated for SPARQL patterns using that predicate — inspect translate_select() in src/sparql/sqlgen.rs and wire the lookup against the SHACL constraint catalog if the hint is not already applied; a triple pattern on a maxCount 1 predicate should not produce a HashAggregate (DISTINCT) node in the plan
    • Verify that sh:minCount 1 on a predicate downgrades LEFT JOIN to INNER JOIN in the SQL generator for OPTIONAL patterns — saves a null-check pass and allows the PG planner to use more efficient join strategies
    • New pg_regress test shacl_query_hints.sql — load a shape with sh:maxCount 1 and sh:minCount 1; run pg_ripple.explain_sparql() on a query using the constrained predicate; assert the plan string does not contain HashAggregate for the maxCount case and does not contain Hash Left Join for the minCount case
  • Datalog engine correctness fixes (medium fixes M-1, M-2, M-3)

    • Division by zero (M-1): wrap every arithmetic divisor in the Datalog SQL compiler with NULLIF(expr, 0); emit a NOTICE-level message naming the offending rule head when a division produces NULL
    • Unbound variables (M-2): add a compile-time check in compile_rule() that every variable appearing in a rule body literal is either bound by a positive body literal or explicitly declared; raise ERRCODE_SYNTAX_ERROR naming the variable and the rule head rather than emitting a WHERE x = NULL clause that silently matches nothing
    • Negation-through-cycle (M-3): replace the single-edge negation check in stratify.rs with full SCC (strongly-connected component) computation using Tarjan's algorithm; reject any SCC that contains a negation-back-edge with a structured error naming the cycle: "datalog: unstratifiable negation cycle: rule A → ¬B → ¬C → A"
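The SCC-based stratification check can be sketched as below: run Tarjan's algorithm over the rule dependency graph, then reject any negated edge whose endpoints land in the same component. The graph encoding (predicate indices, `(from, to, negated)` tuples) is an illustrative assumption, not pg_ripple's stratify.rs types.

```rust
// Hypothetical rule-dependency graph: nodes are predicates, edges are body
// dependencies, the bool marks a negated body literal.
pub struct DepGraph {
    pub n: usize,
    pub edges: Vec<(usize, usize, bool)>, // (from, to, negated?)
}

// Standard recursive Tarjan SCC; assigns comp[v] = component id.
fn dfs(v: usize, adj: &[Vec<usize>], index: &mut usize, idx: &mut [usize],
       low: &mut [usize], stack: &mut Vec<usize>, on_stack: &mut [bool],
       comp: &mut [usize], comp_count: &mut usize) {
    idx[v] = *index;
    low[v] = *index;
    *index += 1;
    stack.push(v);
    on_stack[v] = true;
    for &w in &adj[v] {
        if idx[w] == usize::MAX {
            dfs(w, adj, index, idx, low, stack, on_stack, comp, comp_count);
            low[v] = low[v].min(low[w]);
        } else if on_stack[w] {
            low[v] = low[v].min(idx[w]);
        }
    }
    if low[v] == idx[v] {
        // v roots an SCC: pop the whole component off the stack
        loop {
            let w = stack.pop().unwrap();
            on_stack[w] = false;
            comp[w] = *comp_count;
            if w == v { break; }
        }
        *comp_count += 1;
    }
}

/// Unstratifiable iff some negated edge stays inside one SCC — i.e. the
/// negation sits on a cycle, which a single-edge check cannot detect.
pub fn has_unstratifiable_negation(g: &DepGraph) -> bool {
    let mut adj = vec![Vec::new(); g.n];
    for &(a, b, _) in &g.edges { adj[a].push(b); }
    let (mut index, mut comp_count) = (0usize, 0usize);
    let mut idx = vec![usize::MAX; g.n];
    let mut low = vec![0usize; g.n];
    let mut comp = vec![usize::MAX; g.n];
    let mut stack = Vec::new();
    let mut on_stack = vec![false; g.n];
    for v in 0..g.n {
        if idx[v] == usize::MAX {
            dfs(v, &adj, &mut index, &mut idx, &mut low, &mut stack,
                &mut on_stack, &mut comp, &mut comp_count);
        }
    }
    g.edges.iter().any(|&(a, b, neg)| neg && comp[a] == comp[b])
}

fn main() {
    // A -> ¬B -> ¬C -> A: negation inside a cycle, must be rejected.
    let cyclic = DepGraph { n: 3, edges: vec![(0, 1, true), (1, 2, true), (2, 0, false)] };
    assert!(has_unstratifiable_negation(&cyclic));
    // A -> ¬B, B -> C: negation on a DAG, stratification succeeds.
    let dag = DepGraph { n: 3, edges: vec![(0, 1, true), (1, 2, false)] };
    assert!(!has_unstratifiable_negation(&dag));
}
```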
  • JSON-LD framing correctness fixes (medium fixes M-4, M-5)

    • Embedder panic on empty result (M-4): replace roots.into_iter().next().unwrap() in src/framing/embedder.rs with .ok_or_else(|| PgError::new("json-ld framing: CONSTRUCT produced no results", …)) — returns an empty JSON-LD document {"@context": …, "@graph": []} rather than panicking
    • Per-node visited set (M-5): add a HashSet<NodeId> as the third parameter of the recursive embed_node() function; insert the current node ID before recursing and check membership before following an edge — prevents unbounded recursion on cyclic or near-cyclic embedded graphs; consistent with W3C JSON-LD Framing §4.1.3
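The visited-set guard can be sketched as follows. The node/edge types and the string output are illustrative assumptions, not pg_ripple's framing structures; the point is only the membership check that turns a revisit into a reference node instead of another recursion.

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u32;

// Sketch of the guarded recursive embedder: `visited` is threaded through
// the recursion; a node already on the current path is emitted as a bare
// {"@id": ...} reference, so cycles terminate.
fn embed_node(id: NodeId, edges: &HashMap<NodeId, Vec<NodeId>>,
              visited: &mut HashSet<NodeId>) -> String {
    if !visited.insert(id) {
        // Already embedded on this path: break the cycle with a reference.
        return format!("{{\"@id\": {}}}", id);
    }
    let children: Vec<String> = edges
        .get(&id)
        .into_iter()
        .flatten()
        .map(|&c| embed_node(c, edges, visited))
        .collect();
    format!("{{\"id\": {}, \"linked\": [{}]}}", id, children.join(", "))
}

fn main() {
    let mut edges = HashMap::new();
    edges.insert(1, vec![2]);
    edges.insert(2, vec![1]); // cycle: 1 -> 2 -> 1
    let doc = embed_node(1, &edges, &mut HashSet::new());
    // Terminates: the second visit to node 1 becomes a reference node.
    assert!(doc.contains("\"@id\": 1"));
}
```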

Documentation

See plans/documentation.md for details.

  • reference/shacl-reference.md updated — every newly supported constraint documented with syntax, semantics, and a worked example; mark previously-deferred constraints as now implemented
  • user-guide/shacl-guide.md updated — add a section on property path shapes (sh:path) showing inverse and alternative path examples
  • reference/sparql-functions.md updated — add pg_ripple.explain_sparql() reference with all four format options, example output, and note on required privileges
  • user-guide/datalog-guide.md updated — document the new division-by-zero NOTICE, the unbound-variable compile error, and the unstratifiable-cycle error with remediation guidance
  • Release notes for v0.23.0 — highlight SHACL gap closures, new explain_sparql function, and the three Datalog correctness fixes

Exit Criteria

W3C SHACL Core test suite pass rate increases to ≥ 98%. shacl_core_completion.sql pg_regress passes for all new constraint types including the /* … */ block-comment case. explain_sparql.sql passes. shacl_query_hints.sql passes — explain_sparql() confirms no spurious DISTINCT or LEFT JOIN for constrained predicates. A Datalog rule with division, an unbound variable, and a negation cycle each raise the expected named error rather than silent failure or a crash. src/framing/embedder.rs no longer contains unwrap() on the CONSTRUCT result. All migration scripts from 0.1.0 through 0.23.0 run cleanly via just test-migration.


v0.24.0 — Semi-naive Datalog & Performance Hardening

Theme: Replace the naive Datalog evaluation strategy with semi-naive evaluation for large-scale inference, complete the OWL RL rule set, batch-decode SPARQL result sets, and add safety bounds to property-path recursion.

In plain language: pg_ripple can derive new facts automatically from rules (Datalog). Until now, on every iteration of the rule engine, all previously derived facts were re-checked — wasteful for large datasets where most facts don't change between iterations. This release switches to "semi-naive" evaluation: each iteration only looks at newly derived facts from the previous pass, which can be 10–100× faster on large ontologies. For the same reason, four missing OWL reasoning rules that affect subclass and property chains are added. Two performance improvements round out the release: returning large SPARQL result sets is sped up by decoding all term IDs in a single batch rather than one-by-one, and property-path queries (p*, p+) gain a configurable depth limit to prevent runaway recursion on highly connected graphs.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • Semi-naive Datalog evaluation (performance fix P-3, depends on M-3 from v0.23.0)

    • Rework src/datalog/compiler.rs to emit ΔR maintenance queries:
      • For each derived relation R, maintain a delta table Δ_R holding only rows derived in the most recent iteration
      • The fixpoint loop re-evaluates each rule against Δ_R (the delta of its input relations) rather than the full R; newly derived rows are inserted into Δ_R_new; after each iteration Δ_R ← Δ_R_new and the loop continues while Δ_R is non-empty
      • Compile to a series of CTEs: WITH delta_R AS (…), delta_R_new AS (…) INSERT INTO R SELECT * FROM delta_R_new ON CONFLICT DO NOTHING
    • Preserve stratified evaluation order: each stratum is fully converged before the next stratum begins; semi-naive is applied within each stratum
    • Correct prerequisite: requires M-3 (stable stratification) from v0.23.0 — test pipeline enforces this ordering
    • New pg_regress test datalog_seminaive.sql — run RDFS closure over a 10k-triple subgraph; verify correct closure count; measure and assert iteration count is bounded by the longest derivation chain length (not the full relation size)
    • just bench-datalog benchmark gate: semi-naive must be ≥ 5× faster than naive on the RDFS subgraph benchmark; CI fails on regression below 3×
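The delta mechanics above can be sketched in a few lines — an illustrative model of the semi-naive loop, not the compiler's emitted SQL; a single transitive-closure rule stands in for a full rule set:

```python
# Semi-naive fixpoint sketch for path(x, z) :- edge(x, y), path(y, z).
# Each pass joins edge against ΔR (the facts new last pass) only, never
# the full relation; the loop stops when no genuinely new fact appears.

def seminaive_closure(edges):
    path = set(edges)        # R: all derived facts so far
    delta = set(edges)       # ΔR: facts new in the previous iteration
    iterations = 0
    while delta:
        iterations += 1
        new = {(x, z) for (x, y) in edges for (y2, z) in delta if y == y2}
        delta = new - path   # Δ_R_new: keep only genuinely new rows
        path |= delta
    return path, iterations

edges = {(1, 2), (2, 3), (3, 4), (4, 5)}   # a 4-hop chain
closure, iters = seminaive_closure(edges)
print(len(closure))  # 10 — every (i, j) pair with i < j
print(iters)         # 4 — bounded by the chain length, not |R|
```

The iteration count tracks the longest derivation chain, which is exactly what the `datalog_seminaive.sql` test above asserts.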
  • OWL RL rule set completion (medium fix M-17)

    • cax-sco full transitive closure: the existing partial rule handles one level of rdfs:subClassOf; add the transitive step so that A subClassOf B, B subClassOf C → A subClassOf C is derived for arbitrary chain length via the semi-naive mechanism above
    • cls-avf: owl:allValuesFrom chaining — x ∈ C, C ≡ (∀p . D), y = p(x) → y ∈ D; compile to a join across the owl:allValuesFrom VP table and the subject's type VP table
    • prp-ifp: inverse-functional property inference — p is InverseFunctionalProperty, p(x, z) and p(y, z) → x = y; compile to a self-join on vp_{p_id} grouping by o, emitting sameAs triples for any s values that collide
    • prp-spo1: sub-property chaining — q subPropertyOf p, q(x, y) → p(x, y) for derived property chains; relies on the semi-naive delta loop to propagate transitively
    • Update src/datalog/builtins.rs with the four new rule templates; document which OWL RL rules are now implemented vs. out of scope; update reference/datalog-reference.md
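As one concrete case, the prp-ifp rule reduces to a group-by-object self-join; a toy model (illustrative names, not the VP-table SQL):

```python
# prp-ifp over one inverse-functional property's (s, o) rows: group by
# object, emit an owl:sameAs pair for any subjects that collide.
from collections import defaultdict
from itertools import combinations

def prp_ifp(vp_rows):
    by_object = defaultdict(set)
    for s, o in vp_rows:
        by_object[o].add(s)
    return {tuple(sorted(pair))
            for subjects in by_object.values()
            for pair in combinations(subjects, 2)}

# ex:email as an inverse-functional property: same address, same person.
rows = [("alice", "a@x.org"), ("al", "a@x.org"), ("bob", "b@x.org")]
same_as = prp_ifp(rows)
print(same_as)  # {('al', 'alice')}
```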
  • Batch decode for SPARQL result sets (architectural fix A-2, performance fix P-2)

    • Wire batch_decode_ids() through the SPARQL execution path in src/sparql/sqlgen.rs: after SPI returns a result set, collect all distinct i64 IDs across all columns in a single pass, call batch_decode_ids(&ids) to resolve them in one SPI round-trip, then substitute into the result rows
    • The existing batch_decode infrastructure is already implemented for the bulk-load path; the change is routing the SPARQL result-building loop through the same function
    • Benchmark gate: just bench-sparql-decode asserts ≤ 2 SPI round-trips for a SELECT returning 1000 distinct terms; previously O(N) calls
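The shape of the change can be modelled simply: gather all distinct IDs first, resolve them once, then substitute — one lookup call for the whole result set instead of one per cell. `decode_many` below stands in for the SPI-backed `batch_decode_ids()`:

```python
# Batch-decode sketch: one "round-trip" for N distinct IDs, O(1) hash
# lookups per cell when building the final rows.

def batch_decode(rows, decode_many):
    ids = {cell for row in rows for cell in row}     # single pass, distinct
    mapping = decode_many(ids)                       # one round-trip
    return [[mapping[cell] for cell in row] for row in rows]

calls = []
def fake_decode_many(ids):
    calls.append(len(ids))
    return {i: f"term{i}" for i in ids}

rows = [[1, 2], [2, 3], [3, 1]]
decoded = batch_decode(rows, fake_decode_many)
print(decoded[0])   # ['term1', 'term2']
print(calls)        # [3] — one call for 3 distinct IDs, not 6 cells
```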
  • Property-path depth GUC (performance fix P-4)

    • New GUC pg_ripple.property_path_max_depth (type: INT, default: 64, min: 1, max: 100000)
    • Append WHERE _depth < $pg_ripple.property_path_max_depth to every WITH RECURSIVE … CYCLE property-path CTE generated by src/sparql/property_path.rs
    • When the depth limit is hit, emit a WARNING-level message: "property path depth limit reached (max: N); some paths may be truncated" — not an error, because SPARQL spec does not define a depth limit
    • New pg_regress test property_path_depth.sql — verify that a 100-hop chain is fully traversed with default limit, and that reducing the GUC to 10 truncates at 10 hops with the expected WARNING
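The intended behaviour — full traversal under the limit, warning on truncation — can be modelled outside SQL; an illustrative sketch, not the generated recursive CTE:

```python
# Depth-bounded p+ traversal: expand breadth-first, stop at max_depth,
# and warn only when an unexplored continuation remains (mirroring the
# WHERE _depth guard plus the WARNING described above).
import warnings

def bounded_path(edges, start, max_depth):
    reached, frontier = set(), {start}
    for _ in range(max_depth):
        frontier = {z for y in frontier for (x, z) in edges if x == y} - reached - {start}
        if not frontier:          # converged before hitting the limit
            return reached
        reached |= frontier
    # Limit hit: warn only if the truncated frontier could still grow.
    if {z for y in frontier for (x, z) in edges if x == y} - reached - {start}:
        warnings.warn(f"property path depth limit reached (max: {max_depth}); "
                      "some paths may be truncated")
    return reached

edges = {(i, i + 1) for i in range(100)}   # a 100-hop chain
print(len(bounded_path(edges, 0, 100)))    # 100 — fully traversed
print(len(bounded_path(edges, 0, 10)))     # 10 — truncated, with a WARNING
```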
  • BRIN index migration to SID column (medium fix M-16)

    • Migration script sql/pg_ripple--0.23.0--0.24.0.sql: for each existing VP main table, DROP INDEX vp_{id}_main_s_brin; CREATE INDEX vp_{id}_main_i_brin ON _pg_ripple.vp_{id}_main USING brin (i) — the i (SID) column is monotonically increasing with insertion order, giving BRIN strong correlation; the s (subject) column has near-random distribution and BRIN provides negligible benefit
    • Merge worker: generate the new BRIN on i at merge time for freshly built main partitions; remove the BRIN-on-s creation step from create_vp_table()
    • B-tree indices on (s, o) and (o, s) are unchanged
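Why the SID column wins: a BRIN index keeps one [min, max] summary per block range, and a scan skips a range only when the predicate value falls outside its summary. A toy model with hypothetical 4-row ranges:

```python
# Monotonic i gives narrow, disjoint per-range summaries; near-random s
# gives overlapping summaries that force every range to be scanned.

def brin_ranges(values, rows_per_range=4):
    """Summarise a column into per-block-range (min, max) pairs."""
    return [(min(chunk), max(chunk))
            for chunk in (values[i:i + rows_per_range]
                          for i in range(0, len(values), rows_per_range))]

def ranges_scanned(summaries, needle):
    """Block ranges a BRIN scan must visit for an equality predicate."""
    return sum(1 for lo, hi in summaries if lo <= needle <= hi)

sid_col = list(range(16))                                   # insertion order
subj_col = [7, 0, 14, 3, 9, 1, 15, 6, 12, 2, 8, 5, 13, 4, 10, 11]

sid_summaries = brin_ranges(sid_col)    # [(0,3), (4,7), (8,11), (12,15)]
subj_summaries = brin_ranges(subj_col)

print(ranges_scanned(sid_summaries, 9))   # 1 — only one range can match
print(ranges_scanned(subj_summaries, 9))  # 4 — every range overlaps
```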
  • Export streaming (low fix L-6)

    • Rework src/export.rs Turtle/N-Triples/JSON-LD export helpers to iterate over VP tables in SID-order cursor batches (batch size: pg_ripple.export_batch_size GUC, default: 10000) rather than materialising the full graph into memory
    • DECLARE … CURSOR FOR SELECT … ORDER BY i + FETCH $batch_size FROM cursor loop — each batch is serialised and flushed to COPY output immediately; peak memory is bounded by batch_size × average_triple_size
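The cursor loop reduces to a simple pattern — serialise and flush one SID-ordered batch at a time, so peak memory is one batch rather than the whole graph. A minimal sketch with an in-memory stand-in for the cursor:

```python
# Streaming-export sketch: batch, serialise, flush, repeat.

def stream_export(rows, serialise, write, batch_size=10000):
    rows = sorted(rows)                     # ORDER BY i (SID order)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        write("".join(serialise(r) for r in batch))   # flush immediately

out = []                                    # stands in for COPY output
stream_export([(3, "c"), (1, "a"), (2, "b")],
              serialise=lambda r: f"{r[1]}\n",
              write=out.append,
              batch_size=2)
print(out)  # ['a\nb\n', 'c\n'] — two flushes, never the full graph in memory
```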
  • View anti-join rewrite for HTAP query path (performance fix P-6)

    • Replace the EXCEPT (sort-based set difference) in the (main EXCEPT tombstones) UNION ALL delta VP view with a LEFT JOIN … WHERE t.s IS NULL anti-join: SELECT m.* FROM _pg_ripple.vp_{id}_main m LEFT JOIN _pg_ripple.vp_{id}_tombstones t ON m.s = t.s AND m.o = t.o AND m.g = t.g WHERE t.s IS NULL
    • The anti-join allows the PG planner to choose hash anti-join, avoiding a materialising sort over main; at 10M-row main tables this reduces per-query overhead from O(N log N) to O(N) for tombstone filtering
    • Update all VP view definitions and the merge worker's view-rebuild template to use the anti-join form; no user-visible behaviour change
    • Benchmark gate: just bench-htap-read asserts a SELECT over a 1M-row main with 100 tombstones completes in ≤ 2× the time of the same query with zero tombstones
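The rewrite is safe because the two forms are set-equivalent — rows of main with no matching tombstone — while the anti-join lets the planner hash the small tombstone side instead of sorting main. A sketch of the equivalence:

```python
# EXCEPT vs. anti-join over (s, o, g) rows: identical results, different
# execution strategy (set difference vs. hashing the small side).

def except_form(main, tombstones):
    return sorted(set(main) - set(tombstones))

def anti_join_form(main, tombstones):
    dead = set(tombstones)                          # hash the small side
    return sorted(row for row in set(main) if row not in dead)

main = [("s1", "o1", "g"), ("s2", "o2", "g"), ("s3", "o3", "g")]
tombstones = [("s2", "o2", "g")]
assert except_form(main, tombstones) == anti_join_form(main, tombstones)
print(anti_join_form(main, tombstones))  # [('s1', 'o1', 'g'), ('s3', 'o3', 'g')]
```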
  • BGP selectivity model improvements (architectural improvement A-6)

    • Extend BGP reordering in src/sparql/optimizer.rs to factor in variable binding as a selectivity multiplier: bound subject → 0.01 × triple_count, bound object → 0.05 × triple_count, unbound → triple_count — reduces the likelihood that a poorly-ordered BGP generates a pathological SQL join order before PG's planner has a chance to reorder it
    • Document the heuristic in reference/internals/optimizer.md (new page) alongside the explain_sparql() function from v0.23.0
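The heuristic amounts to a sort key; an illustrative model using the multipliers above, with a hypothetical per-predicate count table standing in for the real statistics:

```python
# Order BGP triple patterns by estimated selectivity: bound subject first,
# bound object next, fully unbound last.

def selectivity(pattern, triple_count):
    s, p, o = pattern
    if not s.startswith("?"):
        return 0.01 * triple_count       # bound subject: most selective
    if not o.startswith("?"):
        return 0.05 * triple_count       # bound object
    return float(triple_count)           # unbound: least selective

def reorder_bgp(patterns, counts):
    return sorted(patterns, key=lambda pat: selectivity(pat, counts[pat[1]]))

bgp = [("?x", "foaf:knows", "?y"),
       ("?x", "foaf:name", '"Alice"'),
       ("ex:alice", "foaf:age", "?a")]
counts = {"foaf:knows": 10000, "foaf:name": 5000, "foaf:age": 5000}
print(reorder_bgp(bgp, counts)[0])  # ('ex:alice', 'foaf:age', '?a')
```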
  • Schema-aware statistics worker

    • Extend the background merge worker to run ANALYZE _pg_ripple.vp_{id}_main after each successful merge — ensures the PG planner has fresh statistics on the main partition for join planning
    • For VP tables whose objects are consistently typed (all xsd:integer, xsd:decimal, or xsd:dateTime as detected by the dictionary kind column), create an extended statistics object (CREATE STATISTICS … (dependencies, ndistinct)) so the planner can exploit correlation for range predicates
    • New GUC pg_ripple.auto_analyze (BOOL, default on) — allows operators to disable the post-merge ANALYZE if they manage statistics manually
  • SPARQL-star Update: quoted triples in CONSTRUCT and UPDATE templates

    • Extend the CONSTRUCT template compiler in src/sparql/sqlgen.rs to handle << ?s ?p ?o >> quoted-triple patterns in CONSTRUCT WHERE and CONSTRUCT template clauses — stored using the existing KIND_QUOTED_TRIPLE dictionary kind from v0.4.0
    • Extend the INSERT DATA / DELETE DATA / INSERT WHERE / DELETE WHERE parsers to accept quoted triple syntax in graph patterns and template positions
    • New pg_regress test sparql_star_update.sql: INSERT DATA { << <Alice> <knows> <Bob> >> <assertedBy> <Carol> }; SELECT … WHERE { << ?s ?p ?o >> <assertedBy> ?a } — verify the quoted triple round-trips correctly through insert and query

Documentation

See plans/documentation.md for details.

  • reference/datalog-reference.md updated — add semi-naive evaluation section explaining the ΔR mechanics, iteration bounds, and performance expectations; update OWL RL coverage table to mark cax-sco full, cls-avf, prp-ifp, prp-spo1 as implemented
  • reference/configuration.md updated — document pg_ripple.property_path_max_depth and pg_ripple.export_batch_size GUCs with allowed ranges and tuning guidance
  • user-guide/performance.md updated — add "large result set decoding" section explaining the batch-decode change and expected latency improvement
  • Release notes for v0.24.0 — highlight semi-naive evaluation with performance numbers from the benchmark; list completed OWL RL rules; note BRIN migration and streaming export

Exit Criteria

datalog_seminaive.sql passes with correct closure count and iteration count ≤ longest derivation chain. Semi-naive benchmark is ≥ 5× faster than naive on the RDFS subgraph. All four new OWL RL rules derive correct inferences in the corresponding pg_regress tests. SPARQL result-set decoding issues ≤ 2 SPI round-trips for 1000-term results (verified by the bench gate). Property path with default depth limit correctly traverses a 100-hop chain; depth-10 truncation emits the expected WARNING. sparql_star_update.sql passes. The HTAP anti-join benchmark completes within 2× the no-tombstone baseline. Migration scripts from 0.1.0 through 0.24.0 run cleanly via just test-migration.


v0.25.0 — GeoSPARQL & Architectural Polish

Theme: Add a GeoSPARQL 1.1 geometry subset using PostGIS, stabilise the internal catalog against OID drift, and close the remaining medium- and low-priority issues from the v0.20.0 gap analysis.

In plain language: PostgreSQL already understands geography — distances, containment, intersection — through the PostGIS extension. This release connects pg_ripple's RDF triple store to PostGIS so that SPARQL queries can filter and compute over geographic data: "which cities are within 50 km of Berlin?", "which roads cross this polygon?". This covers the most common GeoSPARQL functions used in open data publishing (Wikidata, LinkedGeoData, government datasets). The release also includes a set of smaller housekeeping improvements: the internal predicate catalog now stores table names instead of fragile OIDs, the HTTP companion service rejects federation endpoint URLs with non-HTTP schemes to guard against SSRF, bulk loads gain a strict mode that rolls back on any malformed triple, and the remaining low-priority issues from the v0.20.0 assessment are closed.

Effort estimate: 6–8 person-weeks

Completed items (click to expand)

Deliverables

  • GeoSPARQL 1.1 geometry subset (feature F-5 from the gap analysis)

    • Prerequisite: PostGIS installed (gated with a runtime SELECT proname FROM pg_proc WHERE proname = 'st_geomfromtext' availability check; all geo functions return NULL with a WARNING if PostGIS is absent — no ERROR)
    • WKT literal support: recognize geo:wktLiteral datatype IRIs in the dictionary encoder; store as a regular literal; decode to a TEXT representation compatible with ST_GeomFromText()
    • Topological relation functions (compile to PostGIS equivalents):
      • geo:sfIntersects(a, b) → ST_Intersects(ST_GeomFromText(a), ST_GeomFromText(b))
      • geo:sfContains(a, b) → ST_Contains(ST_GeomFromText(a), ST_GeomFromText(b))
      • geo:sfWithin(a, b) → ST_Within(ST_GeomFromText(a), ST_GeomFromText(b))
      • geo:sfTouches(a, b), geo:sfCrosses(a, b), geo:sfOverlaps(a, b) — same pattern
    • Distance and measurement functions:
      • geof:distance(a, b, unit) → ST_Distance(ST_GeomFromText(a)::geography, ST_GeomFromText(b)::geography) with unit conversion (supports uom:metre, uom:kilometre, uom:mile); result encoded as xsd:double
      • geof:area(a, unit) → ST_Area(…::geography) with the same unit conversion
      • geof:boundary(a) → ST_Boundary(ST_GeomFromText(a)) serialised back to WKT literal
    • SPARQL FILTER integration: wire all geo functions into translate_expr() in src/sparql/expr.rs; topological predicates emit a SQL boolean; distance/area/boundary emit decoded numeric/WKT values
    • New pg_regress test geosparql.sql — skipped automatically when PostGIS is absent (DO $$ BEGIN IF NOT EXISTS (SELECT 1 FROM pg_proc WHERE proname = 'st_geomfromtext') THEN RAISE EXCEPTION …; END IF; END $$); when PostGIS is present, verifies intersection, distance, and contains queries against a small geography dataset
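The unit handling in geof:distance is a simple post-step: PostGIS geography distances come back in metres, and the result is divided by the requested unit's length before being encoded as xsd:double. A sketch with standard SI/imperial factors:

```python
# geof:distance unit conversion: metres in, requested uom: unit out.
UNIT_IN_METRES = {
    "uom:metre": 1.0,
    "uom:kilometre": 1000.0,
    "uom:mile": 1609.344,
}

def geof_distance(metres, unit):
    """Convert a PostGIS geography distance (metres) to the given unit."""
    try:
        return metres / UNIT_IN_METRES[unit]
    except KeyError:
        raise ValueError(f"unsupported distance unit: {unit}")

print(geof_distance(5000.0, "uom:kilometre"))          # 5.0
print(round(geof_distance(1609.344, "uom:mile"), 6))   # 1.0
```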
  • Federation cache and partial-result correctness (high fixes H-12, H-13)

    • H-12 (cache key upgrade): replace the XXH3-64 result cache key in src/sparql/federation.rs with the full XXH3-128 hash — the 64-bit birthday bound (a 50% collision probability after roughly 5 billion distinct cached queries) is thin for a long-running server; the full 128-bit hash makes collision negligible even at very high query volumes
    • H-13 (partial-result parser): add a size gate to the federation partial-result recovery path — if the truncated response exceeds pg_ripple.federation_partial_recovery_max_bytes (INT GUC, default: 65536), skip partial recovery and return zero rows with a WARNING: federation partial response too large for recovery (N bytes); this prevents the rfind("},") heuristic from truncating a valid row whose literal value contains "}" followed by a comma in large responses
    • New pg_regress test federation_cache.sql — verify that two federation calls with identical query text to different endpoints are cached independently; verify that a simulated oversized partial response exceeding the byte gate produces zero rows with the expected WARNING
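The byte gate's interaction with the recovery heuristic can be modelled in miniature — a simplified bare-array stand-in for the SPARQL JSON results format, with illustrative names:

```python
# Partial-recovery sketch: below the limit, cut the broken JSON at the
# last complete binding via rfind("},"); above it, give up with a WARNING
# (the heuristic can mis-split large literals that contain "},").
import json
import warnings

def recover_partial(body, max_bytes=65536):
    if len(body) > max_bytes:
        warnings.warn("federation partial response too large for recovery "
                      f"({len(body)} bytes)")
        return []
    cut = body.rfind("},")
    if cut == -1:
        return []
    try:
        return json.loads(body[:cut + 1] + "]")
    except json.JSONDecodeError:
        return []

# A bindings array cut off mid-row: the complete rows are recovered.
truncated = '[{"x": 1}, {"x": 2}, {"x":'
print(recover_partial(truncated))  # [{'x': 1}, {'x': 2}]
```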
  • Catalog OID stability (architectural fix A-5)

    • Add schema_name NAME, table_name NAME columns to _pg_ripple.predicates in the migration script
    • Populate on insert: schema_name = '_pg_ripple', table_name = 'vp_{id}_delta' (the mutable partition; view name is derivable)
    • All dynamic SQL in the merge worker, query path, and admin functions now references quote_ident(schema_name) || '.' || quote_ident(table_name) rather than looking up OIDs — OID drift after a pg_dump / pg_restore cycle no longer silently redirects queries to the wrong relation
    • Migration script sql/pg_ripple--0.24.0--0.25.0.sql: ALTER TABLE _pg_ripple.predicates ADD COLUMN schema_name NAME DEFAULT '_pg_ripple', ADD COLUMN table_name NAME; UPDATE _pg_ripple.predicates SET table_name = 'vp_' || id || '_delta';
  • Federation SSRF scheme validation (security fix S-4)

    • pg_ripple.register_endpoint(url TEXT): reject any URL whose scheme is not http or https at registration time with ERRCODE_INVALID_PARAMETER_VALUE: "federation endpoint must use http or https scheme; got: <scheme>" — belt-and-braces defence even though ureq would refuse non-HTTP at connection time
  • Bulk load strict mode (medium fix M-8)

    • Add strict BOOLEAN DEFAULT false parameter to pg_ripple.load_turtle(data TEXT, strict BOOLEAN DEFAULT false) and all other bulk-load entry points
    • When strict = true: any parse error or malformed triple aborts the entire COPY-equivalent batch with a structured error naming the line number and the offending triple; the transaction is rolled back to the savepoint established at the start of the load
    • When strict = false (current behaviour): malformed triples emit a WARNING and are skipped; partial loads are committed as before
    • New pg_regress test bulk_load_strict.sql — verify that a load with one malformed triple in strict mode rolls back all preceding triples; verify that the same load in lenient mode commits the well-formed triples
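The two modes differ only in how a parse error is handled; a minimal sketch of the contract (a toy whitespace parser stands in for the Turtle parser, and an exception stands in for the savepoint rollback):

```python
# Strict aborts the whole load on the first malformed triple, naming the
# line; lenient warns, skips, and keeps the well-formed triples.
import warnings

def load_triples(lines, parse, strict=False):
    loaded = []
    for lineno, line in enumerate(lines, start=1):
        try:
            loaded.append(parse(line))
        except ValueError as err:
            if strict:
                raise ValueError(f"line {lineno}: {err}") from err
            warnings.warn(f"skipping malformed triple at line {lineno}")
    return loaded

def parse(line):
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"malformed triple: {line!r}")
    return tuple(parts)

data = ["ex:a ex:p ex:b", "broken", "ex:c ex:p ex:d"]
print(len(load_triples(data, parse, strict=False)))  # 2, with a warning
```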
  • Blank-node document scoping fix (medium fix M-9)

    • Replace the SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos() blank-node prefix in src/bulk_load.rs with nextval('_pg_ripple.statement_id_seq') — globally unique per load call, collision-free under any level of concurrency
  • Merge worker cache isolation (architectural fix A-3)

    • Register a transaction-boundary callback in the background merge worker (analogous to the xact-end callback added in v0.22.0 for the encode cache) that clears the worker-local encode/decode LRU cache at the end of every merge transaction — prevents the worker from using stale IDs if a future migration rewrites dictionary rows
  • pg_trickle version-lock probe (architectural fix A-4)

    • In _PG_init, if pg_trickle is available, execute SELECT extversion FROM pg_extension WHERE extname = 'pg_trickle' and compare against the compile-time PG_TRICKLE_TESTED_VERSION constant; emit a WARNING if the installed version is newer than tested: "pg_ripple: pg_trickle version N.N.N is newer than tested version N.N.N; incremental views may behave unexpectedly"
  • Remaining low-priority fixes

    • CDC payload documentation (L-2): add a decode BOOLEAN DEFAULT false parameter to pg_ripple.cdc_changes() that, when true, decodes dictionary IDs to N-Triples strings in the payload; document in user-guide/cdc.md
    • Dependency alignment (L-3/L-4): upgrade ureq from v2 to v3 in pg_ripple_http/Cargo.toml; update AGENTS.md to list oxrdf as the canonical RDF-star parser; add oxrdf = "0.3" as a direct dep in Cargo.toml
    • GUC description strings (L-5): update every GucBuilder::new() .set_description() call in src/lib.rs to include the default value and valid range, e.g. "Maximum property path recursion depth. Default: 64. Range: 1–100000." — improves SHOW ALL and pgAdmin discoverability
    • Inline decoder defensive assert (L-7): add debug_assert!(is_inline(id), "decode_inline called with non-inline id {id}") at the top of decode_inline() in src/dictionary/inline.rs
    • Export literal round-trip (M-10): add a pg_regress test export_roundtrip.sql that inserts triples with \uXXXX Unicode escapes, non-ASCII literals, and control characters, then round-trips through Turtle export and import; verifies the decoded values match the originals
    • W3C conformance test classification (M-19): replace remaining label_no_error style assertions in the conformance test file with a formal skip-list expected_skip CTE; document each skip with a reason code (UNIMPLEMENTED, KNOWN_LIMITATION, or SPEC_AMBIGUITY); ensure the skip list shrinks to zero by v1.0.0
    • File-path bulk loader validation (S-8): all load_*_file() functions (load_turtle_file, load_ntriples_file, etc.) require superuser status but do not validate symlink following or path traversal beyond that gate; add a realpath() call in src/bulk_load.rs to resolve symlinks and verify the target is within pg_read_server_files accessible directories (matching PostgreSQL's COPY FROM file-access model); emit ERRCODE_INSUFFICIENT_PRIVILEGE if access is denied, preventing a superuser from accidentally loading files outside the protected path set
  • Supplementary feature additions

    • pg_ripple.canary() health function: runs a battery of internal self-checks and returns a JSON object {"merge_worker": "ok"|"stalled", "cache_hit_rate": 0.0–1.0, "catalog_consistent": true|false, "orphaned_rare_rows": N} — suitable for ops dashboards, alerting pipelines, and CI smoke tests; catalog_consistent checks that VP table count in pg_tables matches the predicate catalog and that no vp_rare rows exist for promoted predicates
    • OWL ontology import: pg_ripple.load_owl_ontology(path TEXT) — format-detected by file extension (.ttl/.nt/.xml/.rdf/.owl); loads into the default graph; returns triple count
    • RDF Patch import: pg_ripple.apply_patch(data TEXT) — processes RDF Patch A/D operations; returns net triple delta
    • Custom aggregate registry: pg_ripple.register_aggregate(sparql_iri TEXT, pg_function TEXT) persists to _pg_ripple.custom_aggregates

Documentation

See plans/documentation.md for details.

  • reference/geosparql.md (new page) — GeoSPARQL 1.1 support matrix, all implemented functions with signatures and PostGIS equivalents, PostGIS version requirements, worked examples with WKT literals
  • user-guide/geospatial.md (new page) — how to store and query geographic data in pg_ripple, linking GeoSPARQL to PostGIS, example queries for distance filtering and containment
  • reference/security.md updated — document federation scheme validation and the remediation rationale
  • user-guide/bulk-load.md updated — document the strict parameter with when to use it and how to diagnose partial-load failures
  • reference/configuration.md updated — document pg_trickle version-lock warning and the new CDC decode parameter
  • Release notes for v0.25.0 — highlight GeoSPARQL capability, catalog OID stability improvement, strict bulk load, and summary of all closed low-priority issues

Exit Criteria

geosparql.sql pg_regress passes when PostGIS is present and skips cleanly when PostGIS is absent. bulk_load_strict.sql passes for both strict and lenient modes. Blank-node prefix uses nextval(…) — no wall-clock-based prefix in src/bulk_load.rs. SELECT pg_ripple.register_endpoint('file:///etc/passwd') raises ERRCODE_INVALID_PARAMETER_VALUE. _pg_ripple.predicates has schema_name and table_name columns populated. federation_cache.sql passes — distinct endpoints are cached independently and oversized partial responses produce zero rows with a WARNING. pg_ripple.canary() returns {"catalog_consistent": true, "orphaned_rare_rows": 0} on a healthy database. SELECT pg_ripple.load_turtle_file('/etc/passwd') from a superuser session raises ERRCODE_INSUFFICIENT_PRIVILEGE (not silently succeeding) because /etc/passwd is outside allowed pg_read_server_files directories. Migration scripts from 0.1.0 through 0.25.0 run cleanly via just test-migration.


v0.26.0 — GraphRAG Integration

Theme: First-class support for using pg_ripple as the persistent knowledge graph backend for Microsoft GraphRAG.

In plain language: Microsoft GraphRAG is an open-source system (32k+ GitHub stars) that uses large language models to extract a knowledge graph from documents, detects thematic clusters, and answers complex questions far better than standard vector-search RAG. By default it stores its graph as flat Parquet files on disk — static, unqueryable, and requiring a full re-index every time new documents arrive. This release makes pg_ripple a drop-in backend for GraphRAG: entities and relationships extracted by the LLM are stored as RDF triples with full SPARQL queryability, Datalog reasoning derives implicit relationships the LLM missed, SHACL shapes reject malformed extractions before they corrupt the graph, and a Python CLI bridge exports the enriched graph back to Parquet for GraphRAG's community-detection step. The result is a richer, higher-quality knowledge graph that improves GraphRAG's Local, Global, and DRIFT search accuracy — all running inside the PostgreSQL instance you already have.

Effort estimate: 4–6 person-weeks

Completed items (click to expand)

Background

See plans/graphrag.md for the full synergy analysis, architecture proposals, and integration rationale. Key findings:

  • GraphRAG stores its knowledge model as Parquet files (entities, relationships, communities, community reports, text units). Every new document requires a full re-index.
  • pg_ripple replaces static Parquet with a live, ACID-consistent, SPARQL-queryable triple store. New entities can be inserted incrementally via the HTAP delta partition without disrupting concurrent queries.
  • Datalog + OWL-RL inference materialises relationships that LLM extraction misses (transitive hierarchies, co-membership, symmetric properties), directly improving community structure quality.
  • SHACL validation rejects malformed LLM extractions (missing titles, invalid types, dangling relationship endpoints) before they propagate into community reports.
  • GraphRAG's BYOG (Bring Your Own Graph) feature accepts pre-built entity/relationship tables as Parquet — pg_ripple's export functions feed directly into this pathway.

Deliverables

  • GraphRAG RDF ontology (sql/graphrag_ontology.ttl)

    • Defines the RDF vocabulary for GraphRAG's knowledge model: gr:Entity, gr:Relationship, gr:TextUnit, gr:Community, gr:CommunityReport
    • Full property set mirroring GraphRAG's output table schemas: gr:title, gr:type, gr:description, gr:frequency, gr:degree, gr:source, gr:target, gr:weight, gr:level, gr:rank, gr:summary, gr:fullContent, gr:hasMember, gr:parent
    • Provenance properties for RDF-star metadata: gr:confidence, gr:sourceTextUnit, gr:extractedBy, gr:extractedAt
    • Namespace prefix gr: pre-registered via pg_ripple.register_prefix()
    • Loaded automatically by the example script; also loadable standalone via pg_ripple.load_turtle_file()
  • BYOG Parquet export functions (src/export.rs additions)

    • pg_ripple.export_graphrag_entities(graph_iri TEXT, output_path TEXT) RETURNS BIGINT
      • Executes a SPARQL SELECT to extract all gr:Entity triples from the named graph
      • Writes entities.parquet with columns: id, title, type, description, text_unit_ids, frequency, degree — exactly matching GraphRAG's output schema
      • Returns row count
    • pg_ripple.export_graphrag_relationships(graph_iri TEXT, output_path TEXT) RETURNS BIGINT
      • Extracts all gr:Relationship triples
      • Writes relationships.parquet with columns: id, source, target, description, weight, combined_degree, text_unit_ids
      • combined_degree computed as source.degree + target.degree via a SPARQL join
      • Returns row count
    • pg_ripple.export_graphrag_text_units(graph_iri TEXT, output_path TEXT) RETURNS BIGINT
      • Extracts all gr:TextUnit triples
      • Writes text_units.parquet with columns: id, text, n_tokens, document_id, entity_ids, relationship_ids
      • Returns row count
    • Implementation: use Rust's parquet + arrow crates; require superuser (same as load_*_file functions); validate output path via realpath() against writable directories
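The core of the export is a pivot: one SPARQL row per (entity, property, value) becomes one columnar row per entity. A hedged sketch of that shaping step — the gr:sourceTextUnit → text_unit_ids mapping and the row layout here are illustrative assumptions, and the actual Parquet write (arrow/parquet crates) is omitted:

```python
# Pivot SPARQL SELECT rows into the entities.parquet column layout.
ENTITY_COLUMNS = ["id", "title", "type", "description",
                  "text_unit_ids", "frequency", "degree"]

def shape_entities(sparql_rows):
    """sparql_rows: (entity_iri, property, value) tuples from the SELECT."""
    entities = {}
    for iri, prop, value in sparql_rows:
        entity = entities.setdefault(iri, {"id": iri, "text_unit_ids": []})
        if prop == "gr:sourceTextUnit":       # assumed multi-valued mapping
            entity["text_unit_ids"].append(value)
        else:
            entity[prop.removeprefix("gr:")] = value
    return [{col: e.get(col) for col in ENTITY_COLUMNS}
            for e in entities.values()]

rows = [("ex:e1", "gr:title", "Acme"),
        ("ex:e1", "gr:type", "organization"),
        ("ex:e1", "gr:degree", 3)]
print(shape_entities(rows)[0]["title"])  # Acme
```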
  • SHACL shapes for GraphRAG quality enforcement (sql/graphrag_shapes.ttl)

    • gr:EntityShape: gr:title required (1..1, string, maxLength 1000); gr:type required, constrained to sh:in ("person" "organization" "geo" "event" "concept"); gr:description required (1..1)
    • gr:RelationshipShape: gr:source required (1..1, sh:class gr:Entity); gr:target required (1..1, sh:class gr:Entity); gr:weight required (1..1, float, sh:minInclusive 0.0, sh:maxInclusive 1.0)
    • gr:TextUnitShape: gr:text required (1..1, string); gr:tokenCount required (1..1, non-negative integer)
    • Loaded via pg_ripple.load_turtle_file() and activated with pg_ripple.validate() or pg_ripple.shacl_mode = 'sync'
  • Datalog enrichment rules (sql/graphrag_enrichment_rules.pl)

    • gr:coworker(?a, ?b) — both entities appear as source in relationships targeting the same organization entity
    • gr:collaborates(?a, ?b) — both entities appear in the same text unit (share a gr:TextUnit via gr:mentionsEntity)
    • gr:indirectReport(?leader, ?sub2) — transitive: ?leader gr:manages ?mid, ?mid gr:manages ?sub2
    • gr:relatedOrg(?a, ?b) — two organizations share at least two entity-level relationships (co-occurrence threshold)
    • All rules loaded via pg_ripple.load_rules() under the rule set name 'graphrag_enrichment'
    • OWL-RL built-in rules (pg_ripple.load_rules_builtin('owl-rl')) applied first for RDFS subclass/subproperty transitivity
    • Documentation: each rule annotated with its GraphRAG use case (e.g. how gr:coworker enriches Local Search neighborhood)
  • Python CLI bridge (scripts/graphrag_export.py)

    • CLI tool wrapping the export functions for users who cannot call pg_ripple.export_graphrag_*() directly from SQL (e.g. managed PostgreSQL services where COPY TO is restricted)
    • --pg-url: PostgreSQL connection string
    • --graph-iri: named graph IRI to export
    • --output-dir: directory for Parquet files (default: ./graphrag_output)
    • --enrich-with-datalog: run pg_ripple.infer('owl-rl') + pg_ripple.infer('graphrag_enrichment') before export
    • --validate: run pg_ripple.validate() and print violations before exporting; exit with non-zero code if any violations
    • --format: parquet (default) or csv (for debugging)
    • Dependencies: psycopg (v3), pyarrow; no GraphRAG dependency required at export time
    • Prints row counts and output paths on success
    • Unit tests via pytest in scripts/test_graphrag_export.py
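The CLI surface above maps directly onto argparse; a sketch of the flag definitions only (defaults from the bullets, psycopg/pyarrow wiring omitted):

```python
# Flag skeleton for scripts/graphrag_export.py as described above.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="graphrag_export.py")
    p.add_argument("--pg-url", required=True,
                   help="PostgreSQL connection string")
    p.add_argument("--graph-iri", required=True,
                   help="named graph IRI to export")
    p.add_argument("--output-dir", default="./graphrag_output")
    p.add_argument("--enrich-with-datalog", action="store_true")
    p.add_argument("--validate", action="store_true")
    p.add_argument("--format", choices=["parquet", "csv"], default="parquet")
    return p

args = build_parser().parse_args(
    ["--pg-url", "postgresql://localhost/kg",
     "--graph-iri", "http://example.org/graphrag", "--validate"])
print(args.format, args.validate)  # parquet True
```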
  • Example walkthrough (examples/graphrag_byog.sql)

    • End-to-end example: create named graph → load sample entities/relationships as Turtle → run Datalog enrichment → validate with SHACL → query enriched graph via SPARQL → export to Parquet
    • Demonstrates all four integration points: ontology, validation, reasoning, and export
    • Includes a commented BYOG settings.yaml snippet showing the graphrag index command that consumes the exported Parquet files
    • Executable as a pg_regress test: cargo pgrx regress pg18 includes graphrag_byog.sql
  • pg_regress tests

    • graphrag_ontology.sql — load ontology, verify all prefix registrations and class/property triples are present
    • graphrag_crud.sql — insert sample entities and relationships as Turtle, query back via SPARQL, verify field values
    • graphrag_enrichment.sql — load enrichment rules, run infer('graphrag_enrichment'), verify gr:coworker and gr:collaborates triples are derived
    • graphrag_shacl.sql — attempt to load a malformed entity (missing gr:type) with shacl_mode = 'sync', verify the INSERT is rejected with a SHACL violation report
    • graphrag_export.sql — export entities/relationships to /tmp/graphrag_test_*.parquet, verify row count matches the number of inserted entities/relationships

Migration Script

sql/pg_ripple--0.25.0--0.26.0.sql — no schema changes required; all new functionality is delivered via Rust function additions and SQL files loaded by the user. Migration script contains a header comment listing the new SQL functions and their signatures.

Documentation

See plans/documentation.md for details.

  • user-guide/graphrag.md (new page) — step-by-step guide: install pg_ripple, load GraphRAG entities as RDF, run enrichment and validation, export to Parquet, run GraphRAG BYOG workflow; includes architecture diagram showing data flow between GraphRAG and pg_ripple
  • reference/graphrag-ontology.md (new page) — full reference for the gr: vocabulary: all classes, properties, and SHACL shapes with descriptions and example triples
  • reference/graphrag-functions.md (new page) — API reference for export_graphrag_entities, export_graphrag_relationships, export_graphrag_text_units
  • user-guide/graphrag-enrichment.md (new page) — explains Datalog enrichment for GraphRAG: which rules are built-in, how to write custom rules, how enriched triples improve community detection quality
  • plans/graphrag.md updated — mark Phase 1 (BYOG export) and Phase 2 (Datalog enrichment) as implemented; update Phase 3 status to in-progress
  • Release notes for v0.26.0 — highlight GraphRAG integration as the headline feature, link to the BYOG walkthrough, explain the Datalog enrichment value proposition

Exit Criteria

graphrag_ontology.sql, graphrag_crud.sql, graphrag_enrichment.sql, graphrag_shacl.sql, and graphrag_export.sql all pass in cargo pgrx regress pg18. pg_ripple.export_graphrag_entities() writes a valid Parquet file readable by pyarrow.parquet.read_table(). Loading a malformed entity (missing gr:type) with shacl_mode = 'sync' raises a validation error. Running pg_ripple.infer('graphrag_enrichment') on a graph with two entities both linked to the same organization produces at least one gr:coworker triple. scripts/graphrag_export.py --validate exits non-zero when SHACL violations are present. Migration scripts from 0.1.0 through 0.26.0 run cleanly via just test-migration.


v0.27.0 — Vector + SPARQL Hybrid: Foundation

Theme: Core pgvector integration — embedding storage, similarity functions, and SPARQL extension.

In plain language: This release adds AI-powered semantic search to pg_ripple. Every entity in your knowledge graph can now have a vector embedding — a compact numerical fingerprint that captures its meaning. You can then search for entities that are semantically similar to a phrase ("find drugs similar to anti-inflammatory agents"), and combine that similarity search with precise SPARQL queries ("but only drugs approved by the FDA that don't interact with methotrexate"). This is called hybrid search, and it's the dominant retrieval pattern for modern AI applications. pg_ripple's unique advantage is that both the graph query and the similarity search run inside the same PostgreSQL process — no cross-service hops or serialisation overhead, ACID transactions, and one query planner optimising both together. No other triplestore offers this.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/vector_sparql_hybrid.md for the full analysis, pgvector deep-dive, competitive landscape, and integration architecture. Key findings:

  • pgvector (14k+ GitHub stars, MIT license, ships with every major managed PostgreSQL provider) is the standard PostgreSQL vector extension. Because pg_ripple and pgvector share the same PostgreSQL backend, JOINs between VP tables and vector tables execute in-process with zero serialisation overhead.
  • No existing triplestore or vector database combines full SPARQL 1.1, SHACL validation, Datalog reasoning, and in-process vector similarity in a single system.
  • The _pg_ripple.embeddings table uses dictionary-encoded entity_id foreign keys, enabling zero-copy joins with all VP tables.
  • This is an optional-at-runtime integration: pg_ripple degrades gracefully (returns empty results with a WARNING) if pgvector is not installed.

Deliverables

  • _pg_ripple.embeddings table (sql/pg_ripple--0.26.0--0.27.0.sql)

    • Schema: entity_id BIGINT NOT NULL REFERENCES _pg_ripple.dictionary(id), model TEXT NOT NULL DEFAULT 'default', embedding vector(1536), updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), PRIMARY KEY (entity_id, model) (optional at runtime — pgvector must be installed)
    • HNSW index (default) on (embedding vector_cosine_ops) with configurable m (default 16) and ef_construction (default 64) parameters — best recall/speed trade-off for most workloads
    • IVFFlat index alternative (opt-in via GUC pg_ripple.embedding_index_type = 'ivfflat') — faster build times, preferable for high-write workloads where the HNSW build cost is prohibitive; lists auto-set to sqrt(row_count)
    • halfvec support: the embedding column accepts both vector(N) and halfvec(N) via GUC pg_ripple.embedding_precision = 'half'; halfvec halves storage (2 bytes per dimension instead of 4) at a marginal recall cost — recommended for graphs with > 5M entities or embedding_dimensions >= 3072
    • Binary quantization support: opt-in via GUC pg_ripple.embedding_precision = 'binary'; stores embeddings as pgvector bit(N) using Hamming distance, reducing storage by ~96% (1 bit/dimension) at the cost of recall — suitable for extremely large-scale graphs (> 50M entities) where approximate results are acceptable; requires pgvector ≥ 0.7.0
    • Fallback: if pgvector is absent, the table is created with BYTEA as a stub column and all similarity functions return empty results with a WARNING
    • Migration script creates the table only if pgvector is detected via SELECT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'vector')
  • GUC parameters (registered in _PG_init in src/lib.rs)

    • pg_ripple.embedding_model (string, default '') — embedding model name tag stored in the model column
    • pg_ripple.embedding_dimensions (integer, default 1536, range 1–16000) — vector dimensions; must match the actual model output
    • pg_ripple.embedding_api_url (string, default '') — base URL for an OpenAI-compatible embedding API (e.g. https://api.openai.com/v1, local Ollama, vLLM)
    • pg_ripple.embedding_api_key (string, default '', superuser-only) — API key; value is masked in pg_settings via a superuser-only GUC flag
    • pg_ripple.pgvector_enabled (bool, default true) — runtime switch; set to false to disable all pgvector-dependent code paths without uninstalling the extension
    • pg_ripple.embedding_index_type (string, default 'hnsw', options 'hnsw'|'ivfflat') — controls which index type is created on _pg_ripple.embeddings; changing this requires REINDEX
    • pg_ripple.embedding_precision (string, default 'single', options 'single'|'half'|'binary') — 'half' stores embeddings as halfvec(N) (50% storage reduction); 'binary' stores as bit(N) using Hamming distance (~96% storage reduction, best for > 50M entities); requires pgvector ≥ 0.7.0
  • pg_ripple.embed_entities() — batch embedding (src/sparql/embedding.rs)

    • pg_ripple.embed_entities(graph_iri TEXT DEFAULT NULL, model TEXT DEFAULT NULL, batch_size INT DEFAULT 100) RETURNS BIGINT
    • Executes a SPARQL SELECT to collect entity IRIs + their rdfs:label (falling back to the IRI local name) from the specified graph (or all graphs if NULL)
    • Batches entity labels, calls the OpenAI-compatible API at pg_ripple.embedding_api_url; supports gzip-compressed responses
    • Stores results in _pg_ripple.embeddings via INSERT … ON CONFLICT (entity_id, model) DO UPDATE SET embedding = EXCLUDED.embedding, updated_at = now()
    • Returns total number of embeddings stored
    • Raises PT601 — embedding API URL not configured if pg_ripple.embedding_api_url is empty
  • pg_ripple.similar_entities() — k-NN query (src/sparql/embedding.rs)

    • pg_ripple.similar_entities(query_text TEXT, k INT DEFAULT 10, model TEXT DEFAULT NULL) RETURNS TABLE (entity_id BIGINT, entity_iri TEXT, distance FLOAT8) (optional at runtime — pgvector must be installed)
    • Encodes query_text to a vector via the configured embedding API
    • Executes SELECT entity_id, embedding <=> $query_vec FROM _pg_ripple.embeddings ORDER BY 2 LIMIT k using the pgvector <=> cosine distance operator; ordering by the distance expression (column 2, not entity_id) is what allows the HNSW index to serve the scan
    • Decodes entity_id back to IRI text via the dictionary
    • Returns results sorted by ascending cosine distance (0 = identical, 2 = maximally dissimilar)
  • pg_ripple.store_embedding() — user-supplied embeddings

    • pg_ripple.store_embedding(entity_iri TEXT, embedding FLOAT8[], model TEXT DEFAULT NULL) RETURNS VOID
    • Encodes entity_iri via the dictionary encoder, casts FLOAT8[] to vector, and upserts into _pg_ripple.embeddings
    • Useful for pre-computed KGE embeddings (TransE, RotatE, ComplEx) from external pipelines; no API call needed
    • Validates that array_length(embedding, 1) matches pg_ripple.embedding_dimensions; raises PT602 — embedding dimension mismatch otherwise
  • SPARQL pg:similar() extension function (src/sparql/functions.rs)

    • Register <http://pg-ripple.org/functions/similar> as a SPARQL extension function in the function registry
    • Signature: pg:similar(?entity, "query_text"^^xsd:string, k) — returns cosine distance as xsd:double
    • Translate to SQL: the SPARQL→SQL compiler detects pg:similar calls in BIND expressions and emits a JOIN against _pg_ripple.embeddings with the <=> operator
    • Filter pushdown: if the SPARQL query has FILTER(?score < threshold), push the threshold into the SQL WHERE clause to allow HNSW iterative scan pruning
    • Graceful degradation: if pgvector is absent, raises PT603 — pgvector extension not installed with an install hint
  • pg_ripple.refresh_embeddings() — stale embedding invalidation (src/sparql/embedding.rs)

    • pg_ripple.refresh_embeddings(graph_iri TEXT DEFAULT NULL, model TEXT DEFAULT NULL, force BOOL DEFAULT false) RETURNS BIGINT
    • Identifies entities whose rdfs:label was updated after _pg_ripple.embeddings.updated_at by joining _pg_ripple.embeddings against the label VP table's i (SID) sequence — higher SID implies a later write
    • Re-embeds stale entities in batches; skips entities where updated_at is already current unless force = true
    • Returns the count of re-embedded entities
    • Intended for scheduled maintenance (e.g. via pg_cron) and called automatically at the end of each background worker cycle when pg_ripple.auto_embed = true (a GUC introduced in v0.28.0)
    • Emits PT606 — no stale embeddings found as a NOTICE (not an ERROR) when nothing needs refreshing
  • Error codes for the embedding subsystem (src/error.rs)

    • PT601 — embedding API URL not configured
    • PT602 — embedding dimension mismatch
    • PT603 — pgvector extension not installed
    • PT604 — embedding API request failed (includes HTTP status code in detail)
    • PT605 — entity has no embedding (raised when pg:similar is called for an entity absent from _pg_ripple.embeddings)
    • PT606 — no stale embeddings found (NOTICE level)
  • pg_regress tests

    • vector_setup.sql — verify pgvector is installed; skip remaining vector tests if absent
    • vector_crud.sql — store embeddings via pg_ripple.store_embedding(), retrieve via pg_ripple.similar_entities(), verify ranking order
    • vector_sparql.sql — SPARQL query using pg:similar() in a BIND expression; verify the result set is non-empty and ordered by distance
    • vector_filter.sql — SPARQL query with FILTER(?score < 0.5) on a pg:similar() result; verify only entities below the threshold are returned
    • vector_graceful.sql — test behaviour when pg_ripple.pgvector_enabled = false; verify WARNING is emitted and no ERROR is raised
    • vector_halfvec.sql — store embeddings with pg_ripple.embedding_precision = 'half'; verify halfvec column type and that pg_ripple.similar_entities() returns correct results
    • vector_binary.sql — store embeddings with pg_ripple.embedding_precision = 'binary'; verify bit column type and that Hamming-distance similarity returns non-zero results
    • vector_refresh.sql — insert entity, embed, update its rdfs:label, call pg_ripple.refresh_embeddings(), verify updated_at advances and re-embedding count is 1
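
As a quick sanity check on the storage figures quoted for the embedding_precision modes, the per-embedding arithmetic can be sketched in a few lines. This is a rough estimate of the embedding column alone (ignoring row headers and index overhead), not a measurement of the actual tables:

```python
# Back-of-envelope storage cost per embedding for each
# pg_ripple.embedding_precision mode (illustrative arithmetic only).

def storage_bytes(dimensions: int, precision: str) -> float:
    """Approximate storage of one embedding value."""
    if precision == "single":   # pgvector vector(N): 4 bytes/dimension
        return dimensions * 4
    if precision == "half":     # halfvec(N): 2 bytes/dimension
        return dimensions * 2
    if precision == "binary":   # bit(N): 1 bit/dimension
        return dimensions / 8
    raise ValueError(precision)

dims = 1536
single = storage_bytes(dims, "single")   # 6144 bytes
half = storage_bytes(dims, "half")       # 3072 bytes
binary = storage_bytes(dims, "binary")   # 192 bytes

print(f"half saves {1 - half / single:.1%}")      # 50.0%
print(f"binary saves {1 - binary / single:.2%}")  # 96.88%, i.e. the ~96% above
```

At 1M entities this works out to roughly 6 GB, 3 GB, and 190 MB of raw embedding data respectively, which is why the plan steers very large graphs toward halfvec or binary quantization.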

Migration Script

sql/pg_ripple--0.26.0--0.27.0.sql — creates _pg_ripple.embeddings table and HNSW index if pgvector is present; registers GUC parameters. No changes to VP table schema.

Documentation

  • user-guide/hybrid-search.md (new page) — quick-start: install pgvector, set GUC parameters, call pg_ripple.embed_entities(), run a SPARQL hybrid query; includes architecture diagram showing VP table + embeddings table join
  • reference/embedding-functions.md (new page) — API reference for embed_entities, similar_entities, store_embedding, pg:similar()
  • reference/guc-reference.md updated — document all seven new embedding GUC parameters (embedding_model, embedding_dimensions, embedding_api_url, embedding_api_key, pgvector_enabled, embedding_index_type, embedding_precision) with recommended values for OpenAI, Ollama, and local Sentence-BERT; include storage trade-off table for embedding_precision modes

Exit Criteria

vector_crud.sql, vector_sparql.sql, vector_filter.sql, vector_halfvec.sql, vector_binary.sql, and vector_refresh.sql all pass in cargo pgrx regress pg18 when pgvector is installed. vector_setup.sql skips cleanly when pgvector is absent. pg_ripple.store_embedding('http://example.org/aspirin', ARRAY[...]) round-trips correctly through pg_ripple.similar_entities('anti-inflammatory'). A SPARQL query with BIND(pg:similar(?drug, "aspirin", 10) AS ?score) FILTER(?score < 0.5) returns only entities with cosine distance below 0.5. SELECT pg_ripple.similar_entities('test') when pg_ripple.pgvector_enabled = false emits a WARNING and returns zero rows (no ERROR). pg_ripple.refresh_embeddings() after a label update returns a count of 1 and advances updated_at. SELECT pg_typeof(embedding) FROM _pg_ripple.embeddings with embedding_precision = 'half' confirms the column is of type halfvec. Migration scripts from 0.1.0 through 0.27.0 run cleanly via just test-migration.


v0.28.0 — Advanced Hybrid Search & RAG Pipeline

Theme: Production-grade hybrid search with RRF fusion, incremental embedding, graph-contextualized embeddings, and end-to-end RAG retrieval.

In plain language: This release builds on the pgvector foundation to deliver two advanced capabilities. First, hybrid ranking: instead of choosing between SPARQL results and vector results, pg_ripple now fuses both using Reciprocal Rank Fusion — a proven algorithm that combines ranked lists from different retrieval systems. Second, RAG support: a single SQL function (pg_ripple.rag_retrieve()) takes a natural language question, runs hybrid search, and returns structured context ready for an LLM system prompt. A background worker keeps embeddings up to date as new entities are added. The result is a complete knowledge-graph-grounded RAG backend running entirely inside PostgreSQL — no separate vector database, no ETL, no eventual consistency.

Effort estimate: 5–8 person-weeks

Completed items (click to expand)

Background

See plans/vector_sparql_hybrid.md §5 (Advanced Integration Patterns) and §7 (Phases 2 & 3) for full design rationale. Key highlights:

  • Reciprocal Rank Fusion (RRF) is the standard algorithm for combining ranked lists from heterogeneous retrieval systems. With RRF, pg_ripple fuses SPARQL result rankings with vector distance rankings into a single scored list using the formula $\text{RRF}(d) = \sum_{r \in R} \frac{1}{k_{rrf} + r(d)}$ where $k_{rrf} = 60$.
  • Incremental embedding via a background worker ensures entities added after initial bulk embedding are automatically embedded without user intervention.
  • Graph-contextualized embeddings generate text representations that include entity neighborhood information (label, types, neighboring entity labels) before embedding — producing vectors that encode relational context, making similarity search more meaningful than label-only embeddings.
  • pg_ripple.rag_retrieve() is the missing link between pg_ripple's knowledge graph and LLM-based applications; it bridges directly to the pg_ripple_http HTTP service for REST-based LLM integrations.
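
The RRF formula is easy to make concrete with a toy worked example (the entity names are hypothetical; k_rrf = 60 and 1-based ranks, as in the formula above):

```python
# Reciprocal Rank Fusion: RRF(d) = sum over ranked lists of
# 1 / (k_rrf + rank(d)), with k_rrf = 60 and 1-based ranks.

def rrf(rankings: list[list[str]], k_rrf: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, entity in enumerate(ranking, start=1):
            scores[entity] = scores.get(entity, 0.0) + 1.0 / (k_rrf + rank)
    return scores

sparql_ranked = ["aspirin", "ibuprofen", "naproxen"]    # graph-side ranking
vector_ranked = ["aspirin", "ketoprofen", "ibuprofen"]  # similarity ranking

fused = rrf([sparql_ranked, vector_ranked])
print(sorted(fused, key=fused.get, reverse=True))
# ['aspirin', 'ibuprofen', 'ketoprofen', 'naproxen']
```

Entities ranked highly by both systems (aspirin: 1/61 + 1/61) dominate entities that appear in only one list, which is exactly the behaviour hybrid_search() relies on.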

Deliverables

  • pg_ripple.hybrid_search() — RRF fusion (src/sparql/embedding.rs)

    • pg_ripple.hybrid_search(sparql_query TEXT, query_text TEXT, k INT DEFAULT 10, alpha FLOAT8 DEFAULT 0.5, model TEXT DEFAULT NULL) RETURNS TABLE (entity_id BIGINT, entity_iri TEXT, rrf_score FLOAT8, sparql_rank INT, vector_rank INT) (optional at runtime — pgvector must be installed)
    • Executes sparql_query (a SPARQL SELECT returning ?entity) to get the SPARQL-ranked candidate set
    • Executes pg_ripple.similar_entities(query_text, k * 10) to get the vector-ranked candidate set
    • Applies Reciprocal Rank Fusion with $k_{rrf} = 60$; alpha controls SPARQL vs. vector weight (0.0 = vector only, 1.0 = SPARQL only, 0.5 = equal)
    • Returns top-k entities sorted by descending rrf_score
  • Incremental embedding background worker (src/worker.rs extension)

    • New table _pg_ripple.embedding_queue (entity_id BIGINT PRIMARY KEY, enqueued_at TIMESTAMPTZ NOT NULL DEFAULT now())
    • Trigger on _pg_ripple.dictionary: inserts new entity IDs into embedding_queue when pg_ripple.auto_embed = true
    • Background worker dequeues entities in batches of pg_ripple.embedding_batch_size, calls the embedding API, upserts into _pg_ripple.embeddings
    • GUC: pg_ripple.auto_embed (bool, default false) — master switch for trigger-based embedding; off by default to avoid surprise API charges
    • GUC: pg_ripple.embedding_batch_size (integer, default 100, range 1–10000)
  • pg_ripple.contextualize_entity() — graph-serialized text (src/sparql/embedding.rs)

    • pg_ripple.contextualize_entity(entity_iri TEXT, depth INT DEFAULT 1, max_neighbors INT DEFAULT 20) RETURNS TEXT
    • Runs an internal SPARQL CONSTRUCT to gather the entity's label, type(s), and up-to-max_neighbors neighboring entity labels within depth hops
    • Serialises the neighborhood as structured text: "[entity_label]. Type: [types]. Related: [neighbor_labels]." — suitable for embedding
    • Used internally by pg_ripple.embed_entities() when pg_ripple.use_graph_context = true (new GUC, bool, default false)
  • pg_ripple.rag_retrieve() — end-to-end RAG (src/sparql/embedding.rs)

    • pg_ripple.rag_retrieve(question TEXT, sparql_filter TEXT DEFAULT NULL, k INT DEFAULT 5, model TEXT DEFAULT NULL) RETURNS TABLE (entity_iri TEXT, label TEXT, context_json JSONB, distance FLOAT8) (optional at runtime — pgvector must be installed)
    • Step 1: encode question to a vector; find k nearest entities via HNSW
    • Step 2: if sparql_filter is non-NULL, apply it as a SPARQL WHERE clause filter on the candidate set
    • Step 3: for each surviving entity, call pg_ripple.contextualize_entity() to build a rich context
    • Step 4: return context_json as JSONB with keys label, types, properties, neighbors — formatted for direct use as an LLM system prompt fragment; structure mirrors the JSON-LD framing output from v0.17.0
  • pg_ripple_http RAG endpoint (pg_ripple_http/src/main.rs)

    • POST /rag — accepts {"question": "...", "sparql_filter": "...", "k": 5} JSON body
    • Calls pg_ripple.rag_retrieve() via the existing SPI connection
    • Returns {"results": [...], "context": "..."} where context is the concatenated context_json entries formatted as a plain-text LLM prompt
    • Authentication: same bearer-token auth as existing pg_ripple_http endpoints
    • Rate limiting: inherits the pg_ripple_http.max_requests_per_second GUC
  • JSON-LD framing for RAG context output (src/framing/ extension)

    • pg_ripple.rag_retrieve() gains an optional output_format TEXT DEFAULT 'jsonb' parameter accepting 'jsonb' or 'jsonld'
    • When output_format = 'jsonld', each context_json row is formatted as a JSON-LD frame using the framing engine from v0.17.0: entity types map to @type, property-value pairs map to their IRI keys, and @context is auto-populated from the registered prefix table
    • Enables direct use of context_json as a JSON-LD-framed system prompt for LLMs that prefer structured data (e.g. OpenAI structured outputs)
    • New pg_regress test vector_rag_jsonld.sql — call pg_ripple.rag_retrieve(... output_format := 'jsonld') and verify @type and @context keys are present in the output
  • SPARQL federation with external vector services (src/sparql/federation.rs extension)

    • Extends the SERVICE handler (v0.16.0) to recognise vector service endpoints registered via pg_ripple.register_vector_endpoint(url TEXT, api_type TEXT) where api_type is 'pgvector', 'weaviate', 'qdrant', or 'pinecone'
    • Syntax: SERVICE <http://vector-service/search> { ?entity pg:similarTo "query" ; pg:score ?score } — translated to the appropriate external API call (HTTP) rather than a local pgvector scan
    • Returned ?entity IRIs are resolved against the local dictionary; matched entities can participate in subsequent local triple pattern joins in the same SPARQL query
    • Use case: local pgvector for < 10M entities; external service for larger embedding indexes, without changing the SPARQL query syntax
    • GUC: pg_ripple.vector_federation_timeout_ms (integer, default 5000) — HTTP timeout for external vector service calls
    • Raises PT607 — vector service endpoint not registered if an unregistered SERVICE URL is used with a pg:similarTo predicate
    • New pg_regress test vector_federation.sql — register a mock vector endpoint, issue a federated SPARQL query, verify graceful fallback when the endpoint is unavailable
  • SHACL embedding completeness shape

    • examples/shacl_embedding_completeness.ttl — reusable SHACL shape that validates all entities of a given class have embeddings (uses sh:path :hasEmbedding ; sh:minCount 1)
    • pg_ripple.add_embedding_triples() RETURNS BIGINT — materialises :hasEmbedding triples for entities present in _pg_ripple.embeddings, making the SHACL shape checkable
  • Multi-model support

    • pg_ripple.list_embedding_models() RETURNS TABLE (model TEXT, entity_count BIGINT, dimensions INT) — enumerate all models in _pg_ripple.embeddings
    • pg_ripple.similar_entities(), pg:similar(), and pg_ripple.rag_retrieve() all accept an optional model argument; default is the pg_ripple.embedding_model GUC value
  • Benchmarks

    • benchmarks/hybrid_search.sql — pgbench-based benchmark measuring hybrid search latency and throughput; tests vector-only, SPARQL-only, and RRF-fused patterns
    • Target: hybrid search over 1M entities, 1,536-dimensional embeddings, HNSW index, < 50 ms P99 latency for top-10 results
  • Error codes (additions to src/error.rs)

    • PT607 — vector service endpoint not registered
  • pg_regress tests

    • vector_hybrid.sql — pg_ripple.hybrid_search() with a SPARQL SELECT + vector query; verify RRF scores are non-zero and results are sorted
    • vector_rag.sql — pg_ripple.rag_retrieve() end-to-end; verify context_json contains expected keys
    • vector_rag_jsonld.sql — pg_ripple.rag_retrieve(... output_format := 'jsonld'); verify @type and @context keys are present
    • vector_contextualize.sql — pg_ripple.contextualize_entity() on a test entity with known neighbors; verify output text contains expected labels
    • vector_worker.sql — insert a new entity with pg_ripple.auto_embed = true; verify _pg_ripple.embedding_queue is populated; simulate worker drain and verify embedding is present
    • vector_federation.sql — register a mock vector endpoint; verify SERVICE query with pg:similarTo issues the correct HTTP request; verify graceful timeout fallback
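
The "[entity_label]. Type: [types]. Related: [neighbor_labels]." serialisation that contextualize_entity() produces can be illustrated with a toy in-memory triple list. The data and helper functions here are illustrative only, not pg_ripple internals:

```python
# Toy sketch of graph-contextualised text serialisation in the
# "[label]. Type: [types]. Related: [neighbors]." shape described above.

triples = [
    ("ex:aspirin", "rdfs:label", "Aspirin"),
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:headache", "rdfs:label", "Headache"),
]

def label_of(iri: str) -> str:
    for s, p, o in triples:
        if s == iri and p == "rdfs:label":
            return o
    return iri.split(":")[-1]  # fall back to the IRI local name

def contextualize(iri: str, max_neighbors: int = 20) -> str:
    types = [o for s, p, o in triples if s == iri and p == "rdf:type"]
    neighbors = [label_of(o) for s, p, o in triples
                 if s == iri and p not in ("rdfs:label", "rdf:type")]
    return (f"{label_of(iri)}. Type: {', '.join(types)}. "
            f"Related: {', '.join(neighbors[:max_neighbors])}.")

print(contextualize("ex:aspirin"))
# Aspirin. Type: ex:Drug. Related: Headache.
```

Embedding this richer string instead of the bare label is what lets similarity search distinguish, say, a drug named "Mercury" from the planet.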

Migration Script

sql/pg_ripple--0.27.0--0.28.0.sql — creates _pg_ripple.embedding_queue table and trigger; registers new GUC parameters. No changes to VP table schema.

Documentation

  • user-guide/hybrid-search.md updated — add RRF fusion and RAG sections; include end-to-end worked example from question to LLM context
  • user-guide/rag.md (new page) — step-by-step guide to using pg_ripple.rag_retrieve() as a backend for LangChain, LlamaIndex, and raw OpenAI API calls; includes pg_ripple_http REST example
  • reference/embedding-functions.md updated — document hybrid_search, rag_retrieve (including output_format parameter), contextualize_entity, list_embedding_models, register_vector_endpoint
  • reference/http-api.md updated — document POST /rag endpoint with request/response examples and JSON-LD output mode
  • user-guide/vector-federation.md (new page) — how to register external vector services, write federated SPARQL queries, and configure timeouts; includes worked examples for Weaviate, Qdrant, and Pinecone endpoints
  • Release notes for v0.28.0 — highlight rag_retrieve and hybrid_search as headline features; link to the hybrid-search and RAG user guides

Exit Criteria

vector_hybrid.sql, vector_rag.sql, vector_rag_jsonld.sql, vector_contextualize.sql, vector_worker.sql, and vector_federation.sql all pass in cargo pgrx regress pg18 when pgvector is installed. pg_ripple.hybrid_search('SELECT ?drug WHERE { ?drug a :Drug }', 'anti-inflammatory', 10) returns ≤ 10 rows with non-zero rrf_score. pg_ripple.rag_retrieve('what treats headaches?', k := 5) returns JSONB rows with label, types, properties, and neighbors keys. pg_ripple.rag_retrieve('what treats headaches?', k := 5, output_format := 'jsonld') returns rows whose context_json contains @type and @context keys. POST /rag on pg_ripple_http returns a context field suitable for use as an LLM system prompt. Inserting a new entity with pg_ripple.auto_embed = true and running the background worker loop populates _pg_ripple.embeddings for that entity. pg_ripple.register_vector_endpoint('http://unknown/', 'qdrant') followed by a SERVICE query returns graceful timeout with no ERROR. Migration scripts from 0.1.0 through 0.28.0 run cleanly via just test-migration.


v0.29.0 — Datalog Optimization: Magic Sets & Cost-Based Compilation

Theme: Goal-directed inference, cost-based rule compilation, and evaluation-path optimizations for the Datalog engine.

In plain language: pg_ripple's Datalog engine already supports semi-naive evaluation — it only looks at new facts each iteration. This release makes inference dramatically smarter: instead of deriving every possible fact, the engine now derives only the facts needed to answer a specific question (magic sets). It also reorders rule joins by cost, eliminates redundant rules, and improves how negation and filters are compiled to SQL. The result is 10×–1000× faster inference for targeted queries and 2×–10× faster full materialization on large datasets.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2 for detailed design notes on all optimization techniques. Key highlights:

  • Magic sets is the classical Datalog optimization (Bancilhon et al., 1986; implemented in IBM DB2). It rewrites a rule program + query goal into a smaller program that derives only relevant facts. Combined with semi-naive evaluation, it matches top-down evaluation performance while retaining bottom-up correctness guarantees.
  • Cost-based body atom reordering uses PostgreSQL's pg_class.reltuples and pg_statistic to sort joins by selectivity — the same technique PostgreSQL's own planner uses, applied at the Datalog→SQL compilation stage.
  • Subsumption checking prunes redundant rules at compile time, reducing the number of SQL statements per fixpoint iteration.
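
The payoff of goal-directed evaluation can be sketched on a toy reachability program: full bottom-up closure derives every reachable pair, while a demand-driven evaluation seeds only the bindings the goal's bound constant requires. This is an illustration of the idea, not the actual magic-sets rewriting:

```python
# Full transitive closure vs. goal-directed reachability on a toy graph.

edges = {("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")}

def full_closure(edges: set) -> set:
    closure = set(edges)
    while True:
        new = {(s, o2) for (s, o) in closure for (o1, o2) in edges
               if o == o1 and (s, o2) not in closure}
        if not new:
            return closure
        closure |= new

def reachable_from(goal: str, edges: set) -> set:
    # Magic-set analogue: "demanded" plays the role of the magic
    # predicate, seeded from the goal's bound constant.
    demanded, frontier = {goal}, {goal}
    while frontier:
        frontier = {o for (s, o) in edges if s in frontier} - demanded
        demanded |= frontier
    return {(goal, o) for o in demanded - {goal}}

print(len(full_closure(edges)))    # 7 pairs derived in total
print(reachable_from("a", edges))  # only the 3 pairs the goal needs
```

On a graph with many disconnected components, the demanded set stays proportional to what the query touches, which is where the 10×–1000× speedups come from.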

Deliverables

  • Magic sets transformation (src/datalog/magic.rs)

    • pg_ripple.infer_goal(rule_set TEXT, goal TEXT) RETURNS JSONB — materialize only facts relevant to the goal pattern
    • Adornment propagation: given a goal like ?x rdf:type foaf:Person, compute binding patterns for each predicate
    • Magic predicate generation: create auxiliary predicates that capture the demanded binding set
    • Modified rule generation: add magic-predicate filters to each rule body
    • SQL compilation: magic predicates compile to temp tables; modified rules join against them
    • Automatic integration with create_datalog_view() — when a goal has bound constants, magic sets are applied automatically
    • GUC: pg_ripple.magic_sets (bool, default true) — master switch; set to false to disable for debugging
    • Benchmark: benchmarks/magic_sets.sql — compare full materialization vs. goal-directed inference on RDFS closure with selective goals
  • Cost-based body atom reordering (src/datalog/compiler.rs)

    • At rule compilation time, query pg_class.reltuples for each VP table referenced by a body atom
    • For atoms with bound constants, estimate selectivity from pg_statistic.n_distinct
    • Sort body atoms by ascending estimated cardinality (most selective first)
    • Prefer atoms that join on indexed columns (s,o) or (o,s) when selectivities are similar
    • GUC: pg_ripple.datalog_cost_reorder (bool, default true)
  • Subsumption checking (src/datalog/stratify.rs extension)

    • After stratification, check each pair of rules deriving the same predicate for subsumption
    • If rule R2 is subsumed by rule R1 (R2's head is a substitution instance of R1's, and R1's body is a subset of R2's body), eliminate R2
    • Report eliminated rules via pg_ripple.infer_with_stats() JSONB output: "eliminated_rules": [...]
  • Anti-join negation (src/datalog/compiler.rs)

    • Replace NOT EXISTS (SELECT 1 FROM vp_{id} WHERE ...) with LEFT JOIN vp_{id} ON ... WHERE ... IS NULL
    • Compile-time choice: use anti-join when the negated predicate's VP table has ≥1000 rows (from pg_class.reltuples); retain NOT EXISTS for small tables where the planner favors it
    • GUC: pg_ripple.datalog_antijoin_threshold (integer, default 1000)
  • Predicate-filter pushdown (src/datalog/compiler.rs)

    • Identify which body atom first binds each arithmetic/comparison guard variable
    • Move the guard immediately after that atom in the generated SQL
    • For range filters (?a > 18), emit as part of the JOIN … ON clause to enable index scans
  • Delta table indexing (src/datalog/mod.rs)

    • After each semi-naive iteration populates a delta table, create a B-tree index on the join columns used by the next iteration's rules
    • Skip indexing when the delta table has fewer than pg_ripple.delta_index_threshold rows (default: 500)
    • GUC: pg_ripple.delta_index_threshold (integer, default 500)
  • Error codes (additions to src/error.rs)

    • PT501 — magic sets transformation failed (circular binding pattern)
    • PT502 — cost-based reordering skipped (statistics unavailable)
  • pg_regress tests

    • datalog_magic_sets.sql — magic sets on RDFS transitivity with a selective goal; verify result matches full materialization; verify magic temp tables are cleaned up
    • datalog_cost_reorder.sql — verify EXPLAIN output shows changed join order with pg_ripple.datalog_cost_reorder = true vs. false
    • datalog_antijoin.sql — verify negation compiles to LEFT JOIN … IS NULL when threshold is met
    • datalog_subsumption.sql — load overlapping rules; verify infer_with_stats() reports eliminated rules
    • datalog_filter_pushdown.sql — verify arithmetic filters appear in JOIN ON clause, not outermost WHERE
    • datalog_delta_index.sql — verify delta table index creation when row count exceeds threshold
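
The subsumption criterion above (R1's body a subset of R2's, same derived head) can be sketched as follows. This simplified version only handles rules with identical heads; the full check also needs substitution-instance matching between heads, as described above:

```python
# Toy subsumption pruning: a rule is redundant if another rule with the
# same head derives everything it does from a subset of its body atoms.

Rule = tuple  # (head, frozenset_of_body_atoms), atoms as plain strings

def subsumed(r2: Rule, r1: Rule) -> bool:
    head2, body2 = r2
    head1, body1 = r1
    return head1 == head2 and body1 < body2  # strict subset of the body

rules = [
    ("knows(X,Y)", frozenset({"friend(X,Y)"})),               # R1
    ("knows(X,Y)", frozenset({"friend(X,Y)", "active(X)"})),  # R2: redundant
    ("knows(X,Y)", frozenset({"colleague(X,Y)"})),            # R3
]

kept = [r for r in rules
        if not any(subsumed(r, other) for other in rules if other is not r)]
print(len(kept))  # 2 — R2 is eliminated, since R1 already covers it
```

Every eliminated rule is one fewer SQL statement per fixpoint iteration, which compounds over the iteration count.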

Migration Script

sql/pg_ripple--0.28.0--0.29.0.sql — registers new GUC parameters. No changes to VP table schema or catalog tables.

Documentation

  • user-guide/sql-reference/datalog.md updated — document infer_goal(), magic sets GUC, cost-based reordering GUC, anti-join threshold GUC, delta indexing threshold GUC
  • user-guide/best-practices/datalog-optimization.md (new page) — when to use infer() vs. infer_goal(), how to read infer_with_stats() output, how to diagnose slow fixpoint convergence, tuning GUCs for different dataset sizes
  • Release notes for v0.29.0 — highlight magic sets and cost-based compilation as headline features; include before/after benchmarks

Exit Criteria

datalog_magic_sets.sql, datalog_cost_reorder.sql, datalog_antijoin.sql, datalog_subsumption.sql, datalog_filter_pushdown.sql, and datalog_delta_index.sql all pass in cargo pgrx regress pg18. pg_ripple.infer_goal('rdfs', '?x rdf:type foaf:Person') returns the same triples as pg_ripple.infer('rdfs') filtered to rdf:type foaf:Person, but completes in <10% of the time on a 1M-triple dataset. Migration scripts from 0.1.0 through 0.29.0 run cleanly via just test-migration.


v0.30.0 — Datalog Aggregation & Compiled Rule Plans

Theme: Analytics-grade inference and rule plan caching.

In plain language: This release adds two major capabilities to the Datalog engine. First, rules can now aggregate facts — for example, "count the number of friends each person has" or "find the maximum salary in each department" — unlocking graph analytics and metrics directly from inference rules. Second, the engine caches the SQL it generates for each rule set, so repeated calls to infer() (e.g., after each data load) no longer repeat expensive dictionary lookups and query construction. As a bonus, SPARQL queries that use on-demand Datalog rules also benefit from the plan cache: a query that triggers inference gets a faster response on every repeat execution.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2 for design notes. Aggregation in rule bodies (Datalog^agg) follows the aggregation-stratification spec: aggregate operations are allowed only in rule bodies over predicates that are fully computed in a lower stratum, ensuring a unique minimal model. Compiled rule plans cache generated SQL in a HashMap<rule_set, Vec<CachedPlan>> keyed on the dictionary-encoded rule set name; cache invalidation triggers on load_rules(), drop_rules(), or GUC change.
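
The cache's observable behaviour (LRU eviction, hit/miss counters in the shape of rule_plan_cache_stats()) can be sketched in miniature. The real cache lives in PostgreSQL shared memory and is implemented in Rust; this Python sketch with illustrative names shows only the shape:

```python
# Minimal LRU plan cache with hit/miss counters.
from collections import OrderedDict

class RulePlanCache:
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.plans: OrderedDict[str, str] = OrderedDict()
        self.hits = self.misses = 0

    def get_or_compile(self, rule_set: str, compile_sql) -> str:
        if rule_set in self.plans:
            self.hits += 1
            self.plans.move_to_end(rule_set)      # refresh LRU order
            return self.plans[rule_set]
        self.misses += 1
        sql = compile_sql(rule_set)               # the expensive path
        self.plans[rule_set] = sql
        if len(self.plans) > self.capacity:
            self.plans.popitem(last=False)        # evict least-recently-used
        return sql

    def invalidate(self, rule_set: str):          # e.g. on drop_rules()
        self.plans.pop(rule_set, None)

cache = RulePlanCache()
cache.get_or_compile("rdfs", lambda rs: f"-- SQL for {rs}")  # miss: compiles
cache.get_or_compile("rdfs", lambda rs: f"-- SQL for {rs}")  # hit: cached
print(cache.hits, cache.misses)  # 1 1
```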

Deliverables

  • Aggregation in rule bodies (Datalog^agg) (src/datalog/compiler.rs, src/datalog/stratify.rs)

    • Extend rule IR to support aggregate terms in body atoms: COUNT(?x), SUM(?x), MIN(?x), MAX(?x), AVG(?x)
    • Aggregation-stratification check: aggregated predicates must be fully computed in a lower stratum; reject with PT510 if violated
    • SQL compilation: aggregate body atoms compile to subquery CTEs with GROUP BY and aggregate window functions
    • pg_ripple.infer_agg(rule_set TEXT) RETURNS JSONB — variant of infer() that enables aggregation rules
    • Example rule: ?x ex:friendCount ?n :- COUNT(?y WHERE ?x foaf:knows ?y) = ?n .
    • Benchmark: benchmarks/datalog_agg.sql — PageRank-style degree centrality on a social graph
  • Compiled rule plans (src/datalog/cache.rs new module)

    • Cache the generated SQL string (and dictionary-encoded constant vector) for each rule on first infer() call
    • Cache key: rule set name + schema version (invalidate on any ALTER EXTENSION pg_ripple UPDATE)
    • Cache storage: pgrx::PgSharedMem-backed LRU, size controlled by GUC pg_ripple.rule_plan_cache_size (default: 64 entries)
    • SPARQL on-demand mode benefit: when a SPARQL query inlines a derived predicate CTE, the CTE SQL is served from the plan cache rather than rebuilt from scratch
    • GUC: pg_ripple.rule_plan_cache (bool, default true)
    • Expose cache statistics via pg_ripple.rule_plan_cache_stats() RETURNS TABLE(rule_set TEXT, hits BIGINT, misses BIGINT, entries INT)
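The plan-cache behaviour described above can be modelled with an ordinary LRU map. This is a minimal sketch, assuming a (rule_set, schema_version) composite key as in the deliverable; the real cache lives in PostgreSQL shared memory and is sized by pg_ripple.rule_plan_cache_size. The class and method names here are invented for illustration.

```python
# Illustrative sketch of the rule-plan cache: an LRU keyed on
# (rule_set, schema_version), so an extension upgrade naturally misses.
from collections import OrderedDict

class PlanCache:
    def __init__(self, capacity=64):
        self.capacity, self.hits, self.misses = capacity, 0, 0
        self._plans = OrderedDict()  # (rule_set, schema_version) -> SQL text

    def get_or_compile(self, rule_set, schema_version, compile_fn):
        key = (rule_set, schema_version)
        if key in self._plans:
            self.hits += 1
            self._plans.move_to_end(key)      # mark as most recently used
            return self._plans[key]
        self.misses += 1
        plan = compile_fn(rule_set)           # expensive: dictionary lookups,
        self._plans[key] = plan               # SQL construction, etc.
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)   # evict least recently used
        return plan

cache = PlanCache(capacity=2)
compile_fn = lambda rs: f"WITH delta AS (...) /* plan for {rs} */"
cache.get_or_compile("rdfs", 30, compile_fn)   # first call: compiled
cache.get_or_compile("rdfs", 30, compile_fn)   # repeat call: served from cache
assert (cache.hits, cache.misses) == (1, 1)
# Schema bump (ALTER EXTENSION ... UPDATE) changes the key, forcing a miss.
cache.get_or_compile("rdfs", 31, compile_fn)
assert cache.misses == 2
```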
  • Error codes (src/error.rs)

    • PT510 — aggregation-stratification violation (aggregate over non-ground predicate)
    • PT511 — unsupported aggregate function in rule body
  • pg_regress tests

    • datalog_agg.sql — verify COUNT, SUM, MIN, MAX rules derive correct results; verify stratification rejects cycles through aggregates
    • datalog_plan_cache.sql — verify cache hit/miss counts via rule_plan_cache_stats(); verify cache invalidation on drop_rules()
    • datalog_sparql_cache.sql — verify SPARQL on-demand query using a derived predicate is faster on second execution (plan served from cache)

Migration Script

sql/pg_ripple--0.29.0--0.30.0.sql — registers new GUCs (pg_ripple.rule_plan_cache, pg_ripple.rule_plan_cache_size). No VP table schema changes.

Documentation

  • user-guide/sql-reference/datalog.md updated — document infer_agg(), aggregation rule syntax, plan cache GUCs, rule_plan_cache_stats()
  • user-guide/best-practices/datalog-optimization.md updated — add section on aggregation-stratification rules, plan cache tuning
  • Release notes for v0.30.0

Exit Criteria

datalog_agg.sql, datalog_plan_cache.sql, and datalog_sparql_cache.sql all pass in cargo pgrx regress pg18. A PageRank-style degree centrality rule on a 1M-triple social graph produces correct results. Second call to infer() on the same rule set reports cache hits > 0 in rule_plan_cache_stats(). Migration scripts from 0.1.0 through 0.30.0 run cleanly via just test-migration.


v0.31.0 — Entity Resolution & Demand Transformation

Theme: Identity semantics and goal-directed rule rewriting for SPARQL and Datalog.

In plain language: This release tackles two distinct but complementary problems. First, it adds proper handling for owl:sameAs — the RDF way of saying "these two names refer to the same thing". When the engine knows that ex:Alice and ex:A.Smith are the same person, all facts about one automatically apply to the other. Second, it introduces demand transformation — a generalisation of the magic sets technique (added in v0.29.0) that can rewrite complex rule programs to derive only the facts that a query actually needs, even for rules with many cross-referencing bodies. This also makes SPARQL on-demand mode smarter: SPARQL queries can now trigger only the Datalog inference relevant to their specific patterns.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2 for design notes. owl:sameAs merging uses a pre-pass canonicalization strategy: before each fixpoint iteration, the compiler rewrites triple patterns to use the canonical (lowest-id) representative of each sameAs equivalence class. Demand transformation is more flexible than magic sets for programs with multiple recursive predicates that reference each other — it propagates binding demands through the full program dependency graph rather than one predicate at a time.

Deliverables

  • owl:sameAs entity canonicalization (src/datalog/rewrite.rs new module)

    • Pre-pass: at the start of each inference run, compute equivalence classes of owl:sameAs (VP table for sameAs predicate) using union-find over dictionary IDs
    • Canonicalization map: each non-canonical ID maps to the lowest ID in its class
    • Rule compiler rewrite: substitute all occurrences of non-canonical IDs in rule bodies before SQL generation
    • SPARQL integration: SPARQL queries that reference a non-canonical entity are transparently rewritten to query the canonical form
    • GUC: pg_ripple.sameas_reasoning (bool, default true)
    • Benchmark: benchmarks/sameas.sql — query entity with 100 sameAs aliases; verify all facts visible via any alias
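The pre-pass above is classic union-find with the lowest ID as the class representative. A minimal sketch (illustrative names; the real pass runs over dictionary-encoded IDs in the sameAs VP table):

```python
# Sketch of the owl:sameAs canonicalization pre-pass: union-find over
# dictionary IDs, with the lowest ID in each equivalence class canonical.
def canonical_map(same_as_pairs):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = min(ra, rb), max(ra, rb)
            parent[hi] = lo                # keep the lower ID as canonical
    return {x: find(x) for x in parent}

# ex:Alice (id 7) sameAs ex:A.Smith (id 42), which is sameAs id 99
m = canonical_map([(42, 7), (99, 42)])
assert m[7] == m[42] == m[99] == 7
```

With this map in hand, the compiler rewrite is a pure substitution: every non-canonical ID in a rule body or SPARQL pattern is replaced by its representative before SQL generation.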
  • Demand transformation (src/datalog/demand.rs new module)

    • Generalised magic sets: compute demand sets for all predicates simultaneously via a fixed-point on the program dependency graph
    • API: pg_ripple.infer_demand(rule_set TEXT, demands JSONB) RETURNS JSONB — demands is an array of goal patterns [{"p": "rdf:type", "o": "foaf:Person"}, ...]
    • Automatically applied in create_datalog_view() when multiple goal patterns are specified
    • SPARQL on-demand integration: when a SPARQL query references multiple derived predicates, compute a joint demand set and apply it to all relevant rules before generating inline CTEs; reduces CTE size and join cost
    • GUC: pg_ripple.demand_transform (bool, default true)
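At its simplest, demand propagation is a fixpoint over the program dependency graph: starting from the query's goal predicates, mark every predicate that could contribute to an answer, then skip the rules for everything unmarked. This sketch tracks only predicate names; the real transformation also propagates binding patterns (which arguments are bound), which is what distinguishes it from plain reachability.

```python
# Sketch of demand propagation: a fixpoint on the dependency graph marks
# every predicate the goals can reach; unmarked rules never run.
def demanded_predicates(rules, goals):
    """rules: dict head_pred -> list of body predicates; goals: initial demand."""
    demand = set(goals)
    changed = True
    while changed:
        changed = False
        for head, body in rules.items():
            if head in demand:
                for p in body:
                    if p not in demand:
                        demand.add(p)      # this predicate feeds an answer
                        changed = True
    return demand

rules = {
    "ancestor": ["parent", "ancestor"],   # recursive, reached from the goal
    "orphan": ["person"],                 # unrelated: its rule is skipped
}
assert demanded_predicates(rules, {"ancestor"}) == {"ancestor", "parent"}
```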
  • pg_regress tests

    • datalog_sameas.sql — load sameAs assertions; verify inference results are visible via all aliases; verify canonicalization in SPARQL query results
    • datalog_demand.sql — verify infer_demand() derives same results as infer() filtered to the demand set; verify EXPLAIN shows smaller CTE for SPARQL on-demand queries with demand transform enabled

Migration Script

sql/pg_ripple--0.30.0--0.31.0.sql — registers pg_ripple.sameas_reasoning and pg_ripple.demand_transform GUCs. No VP table schema changes.

Documentation

  • user-guide/sql-reference/datalog.md updated — document infer_demand(), owl:sameAs behaviour, sameas_reasoning GUC
  • user-guide/best-practices/datalog-optimization.md updated — add section on demand transformation vs. magic sets, when to use infer_demand() vs. infer_goal()
  • Release notes for v0.31.0

Exit Criteria

datalog_sameas.sql and datalog_demand.sql pass in cargo pgrx regress pg18. A SPARQL on-demand query referencing two derived predicates on a 1M-triple dataset completes in <50% of the time compared to v0.30.0 (demand transform reduces combined CTE size). Migration scripts from 0.1.0 through 0.31.0 run cleanly via just test-migration.


v0.32.0 — Well-Founded Semantics & Tabling

Theme: Advanced reasoning for cyclic ontologies and subsumptive result caching for Datalog and SPARQL.

In plain language: Two powerful features for production knowledge graph workloads. Well-founded semantics handles the edge cases that stratified Datalog cannot: programs where rules are mutually recursive through negation (e.g., "X is trusted unless untrusted, and untrusted unless trusted"). Instead of rejecting these programs, the engine assigns a third truth value — unknown — and returns whatever can be definitively concluded. Tabling caches the results of recurring sub-queries: if the same Datalog sub-goal (or SPARQL sub-pattern) appears in multiple queries or multiple times within one query, the answer is computed once and reused. For analytical workloads with repeated sub-query patterns, this is a 2–5× speedup.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2 for design notes. Well-founded semantics (Van Gelder et al., 1991) extends stratified Datalog with a three-valued model: facts are true, false, or unknown (neither provably true nor provably false). The SQL encoding uses an iterative alternating fixpoint: two parallel CTE chains compute the well-founded model over at most pg_ripple.wfs_max_iterations rounds. Tabling (subsumptive tabling, inspired by XSB Prolog) stores derived sub-goals in a session-scoped cache table _pg_ripple.tabling_cache (goal_hash BIGINT, result JSONB, computed_at TIMESTAMPTZ) and reuses results within a configurable TTL.

Deliverables

  • Well-founded semantics (src/datalog/wfs.rs new module)

    • Alternating fixpoint algorithm: compute T_P↑ (positive) and T_P↓ (negative) iteratively until fixpoint
    • Three-valued result: derived facts carry a certainty column (true / unknown) in the query output
    • pg_ripple.infer_wfs(rule_set TEXT) RETURNS JSONB — run well-founded fixpoint instead of stratified evaluation
    • Graceful degradation: for stratifiable programs, infer_wfs() produces the same results as infer() with negligible overhead
    • GUC: pg_ripple.wfs_max_iterations (integer, default 100) — safety cap on alternating fixpoint rounds
    • Error code PT520 — well-founded fixpoint did not converge within wfs_max_iterations
    • Benchmark: benchmarks/wfs.sql — cyclic ontology with mutual negation; verify unknown facts are correctly identified
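The alternating fixpoint can be demonstrated on a tiny propositional program. This is an illustrative model (invented names, sets instead of CTE chains): Gamma(S) derives everything provable when negated atoms are checked against S; iterating Gamma twice yields an ascending "surely true" sequence and a descending "possibly true" sequence, and their gap is the unknown facts.

```python
# Sketch of the alternating fixpoint for well-founded semantics.
# Rules are (head, positive_body, negative_body) over propositional atoms.
def gamma(rules, s):
    """Least model with negative literals evaluated against the fixed set s."""
    derived, changed = set(), True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if head not in derived and pos <= derived and not (neg & s):
                derived.add(head)
                changed = True
    return derived

def well_founded(rules):
    true_set = set()
    while True:
        possible = gamma(rules, true_set)         # overestimate (T_P down)
        new_true = gamma(rules, possible)         # underestimate (T_P up)
        if new_true == true_set:                  # converged
            return true_set, possible - true_set  # (true, unknown)
        true_set = new_true

# "X is trusted unless untrusted, and untrusted unless trusted" + one fact.
rules = [
    ("trusted", set(), {"untrusted"}),
    ("untrusted", set(), {"trusted"}),
    ("fact", set(), set()),
]
true_facts, unknown = well_founded(rules)
assert true_facts == {"fact"}
assert unknown == {"trusted", "untrusted"}
```

Note the graceful-degradation property from the deliverable: for a stratifiable program the unknown set comes out empty, so the three-valued result collapses to ordinary stratified evaluation.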
  • Tabling / memoization (src/datalog/tabling.rs new module)

    • Session-scoped cache: _pg_ripple.tabling_cache (goal_hash BIGINT PRIMARY KEY, result JSONB, computed_at TIMESTAMPTZ)
    • Cache key: XXH3-64 of the normalised goal pattern (predicate ID + bound-variable encoding) — a 64-bit hash so the key fits the BIGINT column
    • SPARQL integration: SPARQL sub-query patterns (e.g., property path closures, OPTIONAL blocks) that match a cached goal are served from the tabling cache without re-executing the CTE — implemented at the SPARQL→SQL translation layer
    • Datalog integration: infer() and infer_goal() check the tabling cache before running the fixpoint; on cache miss, the result is stored for future calls
    • TTL: pg_ripple.tabling_ttl (integer seconds, default 300); set to 0 to disable expiry
    • GUC: pg_ripple.tabling (bool, default true)
    • Invalidation: cache is automatically cleared on any triple insert/delete/update (via CDC hook), and on drop_rules()
    • Expose stats: pg_ripple.tabling_stats() RETURNS TABLE(goal_hash BIGINT, hits BIGINT, computed_ms FLOAT, cached_at TIMESTAMPTZ)
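The cache's lookup discipline (hash the normalised goal, honour the TTL, invalidate on writes) can be sketched as below. Illustrative only: Python's hash() stands in for XXH3, a dict stands in for the session-scoped table, and the class and method names are invented.

```python
# Sketch of the tabling cache: goal results keyed by a hash of the
# normalised goal, reused within a TTL window, cleared on any write.
import time

class TablingCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds               # 0 disables expiry
        self.entries = {}                    # goal_hash -> (result, computed_at)
        self.hits = self.misses = 0

    def lookup(self, goal, compute_fn, now=None):
        now = time.time() if now is None else now
        key = hash(goal)                     # stand-in for XXH3 of the goal
        hit = self.entries.get(key)
        if hit and (self.ttl == 0 or now - hit[1] < self.ttl):
            self.hits += 1
            return hit[0]
        self.misses += 1
        result = compute_fn(goal)            # run the fixpoint / CTE
        self.entries[key] = (result, now)
        return result

    def invalidate(self):                    # CDC hook: any insert/delete/update
        self.entries.clear()

cache = TablingCache(ttl_seconds=300)
closure = lambda g: {("alice", "bob"), ("alice", "carol")}
cache.lookup(("knows+", "alice"), closure, now=0.0)    # miss: computed
cache.lookup(("knows+", "alice"), closure, now=10.0)   # hit: within TTL
cache.lookup(("knows+", "alice"), closure, now=400.0)  # expired: recomputed
assert (cache.hits, cache.misses) == (1, 2)
```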
  • pg_regress tests

    • datalog_wfs.sql — verify well-founded semantics on a cyclic negation program; verify certainty = 'unknown' for unresolvable facts; verify stratifiable programs return same results as infer()
    • datalog_tabling.sql — verify cache hit/miss counts via tabling_stats(); verify TTL expiry; verify cache invalidation on triple insert
    • sparql_tabling.sql — SPARQL query with repeated sub-pattern; verify tabling stats show hit > 0 on second identical sub-pattern within one query

Migration Script

sql/pg_ripple--0.31.0--0.32.0.sql — creates _pg_ripple.tabling_cache table; registers pg_ripple.tabling, pg_ripple.tabling_ttl, pg_ripple.wfs_max_iterations GUCs.

Documentation

  • user-guide/sql-reference/datalog.md updated — document infer_wfs(), tabling GUCs, tabling_stats()
  • user-guide/best-practices/datalog-optimization.md updated — add section on when to use infer_wfs(), tabling tuning, SPARQL sub-query caching behaviour
  • user-guide/best-practices/sparql-performance.md (new page) — how tabling accelerates SPARQL property paths and repeated sub-queries; how demand transformation reduces CTE size; how rule plan caching (v0.30.0) interacts with SPARQL on-demand mode
  • Release notes for v0.32.0

Exit Criteria

datalog_wfs.sql, datalog_tabling.sql, and sparql_tabling.sql all pass in cargo pgrx regress pg18. A SPARQL query with a repeated transitive-closure sub-pattern on a 1M-triple dataset completes in <50% of the time on the second execution (tabling cache hit). infer_wfs() on a stratifiable rule set produces identical results to infer(). Migration scripts from 0.1.0 through 0.32.0 run cleanly via just test-migration.


v0.33.0 — Documentation Site & Content Overhaul

Theme: A documentation site worthy of a production-grade triple store.

In plain language: pg_ripple is a mature system — v0.32.0 delivers full SPARQL 1.1 and SHACL Core conformance across 32 releases — but its documentation has grown organically alongside the codebase rather than being designed for the people who use it. This release delivers documentation that meets users where they are: a problem-centric information architecture written for five distinct archetypes (Data Engineer, Application Developer, Knowledge Architect, Decision-Maker, AI/ML Engineer), eight feature-deep-dive chapters, a full operations guide, a SQL function reference with working examples for every function, and a CI harness that keeps every code example honest by running it against a real pg_ripple instance on every pull request. The full plan is in plans/documentation.md.

Effort estimate: 8–12 person-weeks

Completed items (click to expand)

Background

See plans/documentation.md for the authoritative plan — site structure, content guidelines, five user archetypes, and four delivery phases. Everything described in that plan is in scope for this version.

The documentation site is built with mdBook. mdbook-admonish is added before Phase 1 content work starts (book.toml updated with [preprocessor.admonish]); all new and restructured pages use its fenced callout syntax exclusively. A shared bibliographic fixture dataset (papers, authors, institutions, topics, citations, pre-computed embeddings) is established in docs/fixtures/ and reused across all chapters.

Deliverables

Phase 0 — CI Test Harness (prerequisite)

  • scripts/test_docs.sh — CI harness: spins up pg_ripple via Docker, extracts fenced SQL blocks from docs/src/, executes them in document order, compares stdout against expected-output comment blocks embedded directly below each code block
  • docs/fixtures/bibliography.sql — shared bibliographic fixture dataset (papers, authors, institutions, topics, citations, pre-computed embeddings) reused across all chapters
  • .github/workflows/docs-test.yml — CI job that runs the harness on every PR touching docs/
  • mdbook-admonish added to book.toml and [preprocessor.admonish] block configured
  • Exit criterion: CI job passes on a real PR (not just locally)
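The heart of the harness is the extraction step: pull the bodies of fenced SQL blocks out of a page, in document order, so each can be piped to the database. A minimal sketch of that step (the real scripts/test_docs.sh is a shell script; the function name here is invented):

```python
# Sketch of the doc-harness extraction step: collect the bodies of
# fenced sql code blocks from a markdown page, in document order.
FENCE = "`" * 3  # literal triple-backtick, built so this example nests cleanly

def extract_sql_blocks(markdown: str):
    blocks, current = [], None
    for line in markdown.splitlines():
        if current is None and line.strip() == FENCE + "sql":
            current = []                       # entering a sql fence
        elif current is not None and line.strip() == FENCE:
            blocks.append("\n".join(current))  # closing fence: emit the block
            current = None
        elif current is not None:
            current.append(line)
    return blocks

page = f"# Hello\n{FENCE}sql\nSELECT pg_ripple.triple_count();\n{FENCE}\nprose\n{FENCE}sql\nSELECT 1;\n{FENCE}\n"
assert extract_sql_blocks(page) == ["SELECT pg_ripple.triple_count();", "SELECT 1;"]
```

Comparing each block's output against the expected-output comment below it is then a string diff per block, which is what keeps every published example honest.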

Phase 1 — Foundation

  • Landing page — value proposition, architecture diagram, one compelling code example; key-numbers block and comparison summary absorbed from the former "60 Seconds" content
  • Evaluate / When to Use pg_ripple — honest comparison matrix (pg_ripple vs. plain SQL, standalone RDF stores, LPG systems, pure vector databases); decision flowchart; AI/LLM section on when graph context outperforms flat vector retrieval
  • Installation — Docker (recommended default), from source (cargo pgrx), prerequisites, verification step (SELECT pg_ripple.triple_count() returns 0), troubleshooting for the five most common failures
  • Hello World — Five-Minute Walkthrough — ten triples, three queries of increasing complexity (basic pattern → OPTIONAL → property path), annotated output after every step
  • Guided Tutorial — Build a Knowledge Graph in 30 Minutes — four self-contained ≤10-minute segments: Load & Explore, Validate, Reason, Export; uses the shared bibliographic dataset; each segment is independently complete
  • Key Concepts — RDF for PostgreSQL Users — triples, IRIs, blank nodes, literals, named graphs, RDF-star, SPARQL; PostgreSQL analogies with diagrams for every concept

Phase 2 — Feature Deep Dives

Eight chapters, each following the seven-part structure: What & Why → How It Works → Worked Examples → Common Patterns → Performance & Trade-offs → Gotchas & Debugging → Next Steps.

  • §2.1 Storing Knowledge — modeling a domain as triples; named graphs (when needed vs. when not); blank nodes with honest caveats; RDF-star for provenance and confidence scores; translating a relational schema to RDF
  • §2.2 Loading Data — all formats (Turtle, N-Triples, N-Quads, TriG, RDF/XML); three loading modes (load_turtle(), load_turtle_file(), insert_triple()); bulk-load performance numbers; blank-node scoping across calls; SQL-to-triples patterns; when to run ANALYZE
  • §2.3 Querying with SPARQL — basic patterns through property paths (all operators: +, *, ?, /, |, ^); aggregation; subqueries; UNION/MINUS; GRAPH patterns; sparql_explain() guide; filter pushdown; max_path_depth safety limit; real-world query recipes (entity resolution, recommendations, transitive closure, temporal queries)
  • §2.4 Validating Data Quality — SHACL shapes from simple (sh:minCount/sh:maxCount) to complex (sh:or, sh:pattern, cross-property constraints); synchronous vs. asynchronous validation modes; dead-letter queue; common quality rule patterns
  • §2.5 Reasoning and Inference — Datalog rules; built-in RDFS/OWL RL rule sets; stratification explained plainly; explicit vs. inferred triples (source column); goal-directed vs. full materialization; magic sets and semi-naive evaluation
  • §2.6 Exporting and Sharing — all export formats; JSON-LD framing with sparql_construct_jsonld() and frame templates; canonical GraphRAG chapter: BYOG Parquet export, Datalog enrichment, SHACL quality enforcement (all other GraphRAG mentions cross-reference here)
  • §2.7 AI Retrieval & Graph RAG — canonical AI chapter: vector embeddings, HNSW indexes, pg:similar(), hybrid retrieval with RRF, rag_retrieve(), JSON-LD framing for LLM prompts, owl:sameAs pre-pass before embedding, FTS broadening, end-to-end RAG pipeline; comparison with pure vector stores (Qdrant, Weaviate, pgvector-only)
  • §2.8 APIs and Integration — pg_ripple_http SPARQL Protocol HTTP endpoint (configuration, response formats, authentication, Docker Compose); application code examples (Python psycopg2/SPARQLWrapper, JavaScript pg, Java JDBC); SPARQL federation; caching strategies

Phase 3 — Operations

  • Architecture Overview — dictionary, VP tables, HTAP storage, shmem cache; SPARQL query execution flow for operators
  • Deployment Models — standalone, Docker/Compose, managed PostgreSQL services; trade-offs and the recommended starting point
  • Configuration and Tuning — all GUC parameters by subsystem (storage, query engine, inference, validation, caching, system); three-size production config (small: <1M triples; medium: 1M–100M; large: >100M)
  • Monitoring and Observability — pg_ripple.stats(), pg_stat_statements, sparql_explain(analyze := true), Prometheus metrics; Grafana panel descriptions; health-check thresholds
  • Performance Tuning — bottleneck identification for query, write throughput, and cache pressure; realistic BSBM numbers; tuning recipes for read-heavy, write-heavy, and mixed HTAP workloads
  • Backup and Disaster Recovery — pg_dump/pg_restore; point-in-time recovery; verified backup/restore procedure with exact commands
  • Upgrading Safely — ALTER EXTENSION pg_ripple UPDATE; pre/post-upgrade steps; rollback strategy; maintenance-window guidance; explicit note that zero-downtime upgrades are not yet supported
  • Scaling — vertical scaling guide; merge-worker tuning; read replicas for horizontal scale; honest statement of what is not yet supported
  • Troubleshooting — runbook format: ≥15 symptom → cause → diagnostic → fix entries across all subsystems
  • Security — named-graph row-level security; injection prevention; pg_ripple_http TLS and authentication; file-path loader delegation

Phase 4 — Reference and Polish

  • SQL Function Reference — all functions grouped by use case (Loading, Querying, Validating, Reasoning, Exporting, Administration); each entry has full signature, parameter table, and one working example with expected output
  • SPARQL Compliance Matrix — every SPARQL 1.1 Query, Update, and Protocol feature with status (Supported / Partial / Not Supported); link to W3C test suite results; workarounds for partial/unsupported features
  • Error Message Catalog — every PT001–PT799 code with cause and fix; auto-generated from src/error.rs where possible
  • FAQ — 25–30 questions across Getting Started, Data Modeling, Querying, Performance, Operations, and Comparisons; each answer 50–150 words with links to the relevant deep-dive page
  • Glossary — plain-language definitions of every term used in the documentation
  • Release Notes and Roadmap mirrored into the docs site
  • Contributing guide — dev environment setup, test commands, PR workflow, code conventions; top-level "Contribute" navigation entry and landing-page callout card; academic citations and architecture background moved to CONTRIBUTING.md (not user-facing reference)
  • Full audit: every code example verified against v0.33.0, all TODO / stub markers resolved

Content Governance

  • scripts/check_docs_coverage.sh — CI job that diffs exported function signatures in src/lib.rs against the SQL Function Reference and fails the build when a changed signature has no corresponding docs/ touch in the same PR
  • mdbook-linkcheck broken-link CI job on every PR touching docs/; redirect map (docs/redirects.toml) kept current when pages are moved or removed
  • PR template updated with docs-gap reminder (CI enforcement is primary; checkbox is a reminder only)
  • 30-day documentation review schedule: at every minor release, run the signature-diff script and triage GitHub issues tagged docs to fill gaps

Migration Script

sql/pg_ripple--0.32.0--0.33.0.sql — no schema changes. This version delivers documentation infrastructure and content only; all pg_ripple SQL functions, GUCs, and VP table schemas are unchanged from v0.32.0.

Documentation

This version is the documentation release. The deliverables above are the documentation.

Exit Criteria

  • Phase 0 CI harness is complete and passing in CI (verified by a real PR, not just locally).
  • The eight feature-deep-dive chapters (§2.1–§2.8) are published with no unresolved stubs or TODO markers.
  • The operations section (10 pages) is complete and published.
  • The SQL Function Reference covers every function listed in §4 of plans/documentation.md.
  • check_docs_coverage.sh CI job passes on a PR that changes a function signature.
  • mdbook-linkcheck reports zero broken internal links.
  • Migration scripts from 0.1.0 through 0.33.0 run cleanly via just test-migration.

v0.34.0 — Bounded-Depth Termination & Incremental Retraction (DRed)

Theme: Smarter fixpoint termination and write-correct incremental maintenance.

In plain language: Two complementary improvements for production workloads. First, when an ontology has a known maximum hierarchy depth (e.g., a SHACL shape says class hierarchies are at most 5 levels deep), the inference engine can stop early instead of running one final "did anything change?" check — shaving 20–50% off property path queries and fixpoint loops. Second, the Delete-Rederive (DRed) algorithm means that deleting a base triple no longer requires re-materializing the entire derived closure: the engine surgically removes only the affected derived facts, re-derives any that survive via alternative paths, and leaves everything else untouched. Materialized SPARQL predicates stay correct in milliseconds after deletes instead of seconds.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2.7 and §14.2.12 for design notes. Bounded-depth termination integrates with SHACL shape constraints (sh:maxDepth annotations on property paths) and user-provided GUC hints to set the maximum fixpoint iteration count at compile time. DRed (Gupta, Katiyar & Sagiv, 1993) is the standard incremental deletion algorithm used by RDFox and other production Datalog systems; it avoids full re-materialization by over-deleting pessimistically and then re-deriving survivors.

Deliverables

  • Bounded-depth early termination (src/datalog/compiler.rs)

    • Read SHACL sh:maxDepth annotations for property paths used in rule bodies; fall back to GUC pg_ripple.datalog_max_depth (integer, default 0 = unlimited)
    • When a depth bound d is known, add a depth counter column to the recursive CTE (depth INT, incremented each iteration) and terminate the recursion when depth > d — PostgreSQL has no native recursion-depth hint, so the bound is enforced inside the CTE itself
    • SPARQL property path integration: property path CTEs (rdfs:subClassOf*, ex:knows+) respect the same bound when the path predicate has a SHACL sh:maxDepth constraint
    • GUC: pg_ripple.datalog_max_depth (integer, default 0 — unlimited)
    • pg_regress test: datalog_bounded_depth.sql — verify fixpoint terminates after d iterations; verify SPARQL property path honours depth bound; verify unbounded rule still produces full closure
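The payoff of a known bound is that the semi-naive loop stops after d rounds instead of running one extra "empty delta" round. A sketch on transitive closure (illustrative names; the real engine does this inside a recursive CTE):

```python
# Sketch of bounded-depth early termination for a transitive-closure rule.
def bounded_closure(edges, max_depth=0):
    """Transitive closure of `edges`; max_depth=0 means unbounded."""
    closure = set(edges)
    frontier = set(edges)        # semi-naive delta: new paths of length `depth`
    depth = 1
    while frontier:
        if max_depth and depth >= max_depth:
            break                # bound known (e.g. sh:maxDepth): stop early
        frontier = {(a, d) for (a, b) in frontier for (c, d) in edges
                    if b == c and (a, d) not in closure}
        closure |= frontier
        depth += 1
    return closure

chain = {(i, i + 1) for i in range(6)}   # a 6-level hierarchy
full = bounded_closure(chain)
capped = bounded_closure(chain, max_depth=3)
assert (0, 6) in full
assert (0, 3) in capped and (0, 6) not in capped
```

With max_depth set, paths longer than the bound are simply never derived, which is exactly the behaviour datalog_bounded_depth.sql is meant to verify.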
  • Incremental retraction — DRed algorithm (src/datalog/dred.rs new module)

    • Hook into the CDC delete path: when a base triple is deleted from a VP table, identify all derived predicates whose SQL rules reference that VP table
    • Phase 1 — Over-delete: for each affected derived predicate, delete all rows that could depend on the deleted triple (pessimistic, using rule SQL with the deleted triple as a positive filter)
    • Phase 2 — Re-derive: re-run the rule SQL restricted to the over-deleted set; rows that are re-derived via an alternative derivation path are reinserted
    • Phase 3 — Commit: rows not reinserted after phase 2 are permanently gone
    • pg_ripple.dred_enabled (bool, default true) — master switch; set false to fall back to full re-materialization on delete
    • pg_ripple.dred_batch_size (integer, default 1000) — maximum number of deleted base triples to process in a single DRed transaction
    • Error code PT530 — DRed cycle detected (derived predicate self-references in a way DRed cannot safely resolve; falls back to full recompute)
    • pg_regress test: datalog_dred.sql — insert triples, materialize RDFS closure, delete one base triple, verify only the correctly-affected derived triples are removed; verify triples supported by alternative paths survive
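The three DRed phases can be demonstrated on a tiny reachability example. This is a deliberately simplified model: the over-delete test is a coarse reachability heuristic, and the re-derive phase recomputes over the remaining base for clarity, whereas the real algorithm restricts both phases to the affected rows. All names are illustrative.

```python
# Sketch of Delete-Rederive (DRed) on a reachability predicate.
def closure(edges):
    """Naive transitive closure (stands in for the materialized predicate)."""
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in edges:
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach

def dred_delete(base, deleted):
    derived = closure(base)
    remaining = base - {deleted}
    u, v = deleted
    # Phase 1 — over-delete: pessimistically drop every derived pair that
    # might route through the deleted edge (u, v).
    overdeleted = {(a, b) for (a, b) in derived
                   if (a == u or (a, u) in derived) and (b == v or (v, b) in derived)}
    # Phase 2 — re-derive: pairs still provable from the remaining base
    # survive via an alternative derivation path.
    rederived = closure(remaining) & overdeleted
    # Phase 3 — commit: anything not re-derived is permanently retracted.
    return (derived - overdeleted) | rederived

# (a, c) is over-deleted in phase 1 but survives: it is also a base edge.
assert dred_delete({("a", "b"), ("b", "c"), ("a", "c")}, ("a", "b")) == {("b", "c"), ("a", "c")}
```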
  • Incremental rule updates (src/datalog/mod.rs)

    • pg_ripple.add_rule(rule_set TEXT, rule_text TEXT) — add a single rule to an existing rule set without full recompute; only the new rule's derived predicate needs one fresh iteration pass
    • pg_ripple.remove_rule(rule_id BIGINT) — remove a rule and retract any derived facts that were solely supported by it (uses DRed internally)
    • Dependency-aware invalidation: add_rule triggers one additional semi-naive pass on the affected stratum only
    • pg_regress test: datalog_incremental_rules.sql — add a rule to a live rule set; verify new derivations appear without full recompute; remove the rule; verify derived facts retracted

Migration Script

sql/pg_ripple--0.33.0--0.34.0.sql — registers pg_ripple.datalog_max_depth, pg_ripple.dred_enabled, pg_ripple.dred_batch_size GUCs. No VP table schema changes.

Documentation

  • user-guide/sql-reference/datalog.md updated — document add_rule(), remove_rule(), DRed GUCs, datalog_max_depth GUC
  • user-guide/best-practices/datalog-optimization.md updated — add section on DRed vs. full recompute trade-offs; bounded-depth tuning with SHACL
  • user-guide/best-practices/sparql-performance.md updated — add section on bounded-depth SPARQL property paths
  • Release notes for v0.34.0

Exit Criteria

datalog_bounded_depth.sql, datalog_dred.sql, and datalog_incremental_rules.sql all pass in cargo pgrx regress pg18. Deleting a base triple from a 1M-triple RDFS-materialized dataset with DRed enabled completes in <500ms (vs. full recompute taking >5s). A SPARQL rdfs:subClassOf* property path query on a hierarchy with sh:maxDepth 5 completes in <50% of the time compared to the unbounded version on a 10-level test hierarchy. Migration scripts from 0.1.0 through 0.34.0 run cleanly via just test-migration.


v0.35.0 — Parallel Stratum Evaluation & Incremental Rule Updates

Theme: Concurrent rule evaluation for faster materialization of large rule sets.

In plain language: The Datalog engine currently evaluates rules one at a time within each stratum. This release allows rules that derive different predicates — and therefore cannot interfere with each other — to run concurrently using PostgreSQL's background worker infrastructure. For OWL RL, which has roughly 10 independent rule groups in its first stratum, this means the full ontology closure can materialize up to 10× faster. SPARQL queries that depend on materialized predicates (the common production mode) benefit directly: derived VP tables become fresh sooner after bulk data loads, reducing the staleness window.

Effort estimate: 5–7 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2.11 for design notes. Within a single stratum, rules deriving different predicates are fully independent: their INSERT … SELECT statements touch different VP tables and can run concurrently without coordination. Rules deriving the same predicate within a stratum must be serialized or use ON CONFLICT DO NOTHING to handle concurrent inserts. The implementation uses pgrx::BackgroundWorker with a shared-memory semaphore to limit concurrency to pg_ripple.datalog_parallel_workers (default 4, capped at max_worker_processes - 3).

Deliverables

  • Parallel stratum evaluation (src/datalog/parallel.rs new module)

    • Analyse rule dependency graph per stratum: partition rules into independent groups (rules that derive different predicates and have no shared body predicates that are derived within the same stratum)
    • Spawn one background worker per independent group; each worker executes its rule's INSERT … SELECT for the current semi-naive iteration
    • Synchronization barrier: the main process waits for all workers to finish before starting the next iteration
    • ON CONFLICT DO NOTHING ensures correctness when two workers insert into the same delta table
    • GUC: pg_ripple.datalog_parallel_workers (integer, default 4, max max_worker_processes - 3)
    • GUC: pg_ripple.datalog_parallel_threshold (integer, default 10000) — only parallelize strata where the estimated total row count exceeds this threshold (avoid overhead for small rule sets)
    • Expose parallelism statistics via infer_with_stats() JSONB output: "parallel_groups": 5, "max_concurrent": 4
    • pg_regress test: datalog_parallel.sql — verify OWL RL closure produces identical results with datalog_parallel_workers = 1 and = 4; verify infer_with_stats() reports parallel groups > 1 for OWL RL
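The grouping analysis above amounts to a union-find over rule heads: rules sharing a head go in one group, and two groups merge when one reads a predicate the other derives within the same stratum. A sketch with illustrative names:

```python
# Sketch of per-stratum rule partitioning into independent worker groups.
def independent_groups(stratum_rules):
    """stratum_rules: list of (head_pred, body_preds) for one stratum."""
    heads = {h for h, _ in stratum_rules}
    parent = {h: h for h in heads}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    # Reading a predicate derived in this stratum couples the two heads.
    for head, body in stratum_rules:
        for p in body:
            if p in heads:
                parent[find(p)] = find(head)
    groups = {}
    for head, body in stratum_rules:
        groups.setdefault(find(head), []).append((head, body))
    return list(groups.values())

stratum = [
    ("typeClosure", ["rdf_type", "subClassOf"]),
    ("domainType", ["domain", "triple"]),
    ("sameAsSym", ["sameAs"]),
]
# Three disjoint heads, no intra-stratum reads: three concurrent workers.
assert len(independent_groups(stratum)) == 3
```

Each resulting group gets its own background worker; the barrier at the end of every semi-naive iteration is what keeps the groups in lock-step.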
  • SPARQL materialization freshness improvement

    • Parallel evaluation reduces time-to-fresh for derived VP tables after pg_ripple.infer() calls triggered by bulk loads
    • Document: SPARQL queries in materialized mode now observe a shorter staleness window after bulk inserts; add note to SPARQL best practices guide

Migration Script

sql/pg_ripple--0.34.0--0.35.0.sql — registers pg_ripple.datalog_parallel_workers and pg_ripple.datalog_parallel_threshold GUCs. No VP table schema changes.

Documentation

  • user-guide/sql-reference/datalog.md updated — document parallel evaluation GUCs, infer_with_stats() parallel fields
  • user-guide/best-practices/datalog-optimization.md updated — add section on tuning datalog_parallel_workers for different hardware configurations
  • user-guide/best-practices/sparql-performance.md updated — note materialization freshness improvement with parallel evaluation
  • Release notes for v0.35.0

Exit Criteria

datalog_parallel.sql passes in cargo pgrx regress pg18. OWL RL full closure on a 1M-triple dataset with datalog_parallel_workers = 4 completes in <40% of the time compared to datalog_parallel_workers = 1. Results are identical in both cases. Migration scripts from 0.1.0 through 0.35.0 run cleanly via just test-migration.


v0.36.0 — Worst-Case Optimal Joins & Lattice-Based Datalog

Theme: Advanced join algorithms for cyclic graph patterns and monotone lattice aggregation.

In plain language: Two ambitious features that push pg_ripple to the frontier of Datalog and graph database research. Worst-case optimal joins tackle the hardest SPARQL performance problem: cyclic query patterns (think "find all triangles" or "find paths that loop back") where standard database joins produce enormous intermediate results. The Leapfrog Triejoin algorithm handles this class of query with provably optimal worst-case cost, giving 10×–100× speedups on queries that previously timed out. Lattice-based Datalog extends rules to work with custom algebraic structures — for example, propagating trust scores (where "trust of X through Y" is the minimum of individual trust values), or interval types, or set-valued annotations — enabling a new class of analytical reasoning that standard Datalog cannot express.

Effort estimate: 6–9 person-weeks

Completed items (click to expand)

Background

See plans/ecosystem/datalog.md §14.2.8 and §14.2.14 for design notes. Worst-case optimal joins (Ngo, Porat, Ré & Rudra, 2012; surveyed in "Skew Strikes Back", 2013) use a trie-based intersection algorithm whose runtime is provably worst-case optimal for any join query. PostgreSQL does not expose WCO join algorithms natively; implementation requires a custom scan node via the CustomScan API, registering a C-callable scan provider that pg_ripple exposes through its Rust FFI layer. Lattice-based Datalog (Datalog^L, inspired by Flix and Datafun) extends the rule IR with typed lattice values and monotone operations; fixpoint termination is guaranteed by the ascending chain condition on the lattice.

Deliverables

  • Worst-case optimal joins — Leapfrog Triejoin (src/sparql/wcoj.rs new module)

    • Detect cyclic join patterns at SPARQL→SQL translation time: any SELECT with ≥3 triple patterns sharing variables in a cycle (triangle, square, etc.)
    • For detected cyclic patterns, route execution through a Leapfrog Triejoin scan node instead of standard PostgreSQL hash-joins
    • CustomScan implementation: register a scan provider in _PG_init that intercepts cyclic join nodes in the PostgreSQL planner's plan tree
    • VP table trie interface: read VP table rows in sort order (existing B-tree (s, o) indices serve as the underlying trie structure)
    • GUC: pg_ripple.wcoj_enabled (bool, default true) — master switch
    • GUC: pg_ripple.wcoj_min_tables (integer, default 3) — minimum number of tables in a join before WCOJ is considered
    • SPARQL benefit: cyclic graph patterns that previously caused query timeouts or multi-second latencies complete in milliseconds
    • Benchmark: benchmarks/wcoj.sql — triangle query on a social-graph VP table; compare WCOJ vs. standard planner at 100K, 1M, 10M triples
    • pg_regress test: sparql_wcoj.sql — verify triangle query produces correct results with WCOJ enabled and disabled; verify pg_ripple.wcoj_enabled = false falls back to standard planner
  • Lattice-Based Datalog — Datalog^L (src/datalog/lattice.rs new module)

    • Extend rule IR: lattice term LatticeVal(lattice_type, value) alongside Const and Var
    • Built-in lattice types: MinLattice (meet = MIN), MaxLattice (join = MAX), SetLattice (join = UNION), IntervalLattice (join = interval hull)
    • User-defined lattice types via pg_ripple.create_lattice(name TEXT, join_fn TEXT, bottom TEXT), where join_fn is a PostgreSQL aggregate function name
    • SQL compilation: lattice rules compile to INSERT … SELECT … ON CONFLICT (s, g) DO UPDATE SET o = lattice_join(excluded.o, vp.o) — the upsert applies the lattice join on conflict
    • Fixpoint termination: guaranteed by ascending chain condition; bounded by GUC pg_ripple.lattice_max_iterations (default 1000)
    • Example rule: trust propagation — ?x ex:trust (MIN ?t1 ?t2) :- ?x ex:knows ?y, ?y ex:trust ?t1, ?x ex:directTrust ?t2 .
    • GUC: pg_ripple.lattice_max_iterations (integer, default 1000)
    • Error code PT540 — lattice fixpoint did not converge (ascending chain condition violated by user-defined lattice)
    • pg_regress test: datalog_lattice.sql — trust propagation rule with MinLattice; verify convergence; verify user-defined lattice via custom aggregate
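
The trust-propagation example above boils down to a monotone fixpoint where the upsert applies the lattice join. A minimal sketch with MinLattice over plain Rust maps (`propagate_trust` and its argument shapes are illustrative, not pg_ripple's actual rule IR):

```rust
use std::collections::HashMap;

// Sketch of lattice-based fixpoint evaluation for the trust-propagation
// example above, with MinLattice: the join applied "on conflict" keeps the
// minimum.
fn propagate_trust(
    knows: &[(u32, u32)],       // (?x, ?y) pairs for ex:knows
    direct: &HashMap<u32, i32>, // ex:directTrust per node (?t2)
    max_iterations: usize,      // mirrors pg_ripple.lattice_max_iterations
) -> HashMap<u32, i32> {
    let mut trust = direct.clone(); // seed ex:trust from direct trust
    for _ in 0..max_iterations {
        let mut changed = false;
        for &(x, y) in knows {
            let (Some(&t1), Some(&t2)) = (trust.get(&y), direct.get(&x)) else {
                continue;
            };
            let derived = t1.min(t2); // rule head: MIN(?t1, ?t2)
            // Lattice join on conflict: only ever move down the semilattice,
            // which is what guarantees termination (ascending chain condition).
            if derived < *trust.get(&x).unwrap_or(&i32::MAX) {
                trust.insert(x, derived);
                changed = true;
            }
        }
        if !changed {
            return trust; // fixed point reached
        }
    }
    // Stand-in for error PT540.
    panic!("lattice fixpoint did not converge within max_iterations");
}

fn main() {
    let knows = [(1, 2), (2, 3)];
    let direct = HashMap::from([(1, 9), (2, 7), (3, 2)]);
    let trust = propagate_trust(&knows, &direct, 1000);
    assert_eq!(trust[&1], 2); // node 1 inherits node 3's low trust via node 2
    println!("{trust:?}");
}
```

Note that the loop only ever lowers a value, never raises one; a user-defined join function that is not monotone breaks exactly this property, which is what PT540 reports.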

Migration Script

sql/pg_ripple--0.35.0--0.36.0.sql — registers WCOJ and lattice GUCs; creates pg_ripple.create_lattice() SQL function. No VP table schema changes.

Documentation

  • user-guide/sql-reference/datalog.md updated — document create_lattice(), lattice rule syntax, lattice GUCs
  • user-guide/best-practices/sparql-performance.md updated — add section on cyclic SPARQL pattern detection and WCOJ; when to set wcoj_min_tables
  • reference/lattice-datalog.md (new page) — full tutorial on Datalog^L: lattice types, monotone rules, convergence guarantees, use cases (trust propagation, interval reasoning, set-valued annotations)
  • Release notes for v0.36.0

Exit Criteria

sparql_wcoj.sql and datalog_lattice.sql pass in cargo pgrx regress pg18. A triangle-pattern SPARQL query on a 1M-edge social graph VP table completes in <10% of the time compared to the standard planner (WCOJ enabled). A trust-propagation lattice rule on 100K triples converges to the correct fixed point. Migration scripts from 0.1.0 through 0.36.0 run cleanly via just test-migration.


v0.37.0 — Storage Concurrency Hardening & Error Safety

Theme: Fix the highest-severity correctness bugs identified in the deep-analysis audit and eliminate all hard panics from library code.

In plain language: This is a reliability release — no new features, but a direct response to the first comprehensive code audit (see plans/PLAN_OVERALL_ASSESSMENT_2.md). Two concurrency bugs that could silently drop deletes or strand predicates in a slow-path table are fixed with proper advisory-lock coordination. Every code path that could crash the database server on an unexpected error now raises a typed error instead of panicking. Configuration parameters now validate their inputs, so bad values are caught immediately instead of causing cryptic failures later. A new diagnostic_report() function gives a one-call health check of the running system.

Effort estimate: 9–11 person-weeks

Completed items

Deliverables

  • HTAP merge cutover race — fixed (src/storage/merge.rs)
    • Wrap the delta→main swap in a per-predicate pg_advisory_xact_lock; concurrent DELETE path acquires the same lock in share mode
    • Ensures deletes arriving during a merge cycle are never lost regardless of timing
    • Add crash-recovery test tests/crash_recovery/merge_concurrent_delete.sh: 50 concurrent writers + 1-second merge interval, assert zero lost deletes after 5 minutes
  • Tombstone GC integrated into merge worker (src/storage/merge.rs, src/worker.rs)
    • After each successful merge cycle, schedule VACUUM on VP tables where tombstone_count / main_count > pg_ripple.tombstone_gc_threshold
    • New GUCs: pg_ripple.tombstone_gc_enabled (bool, default true), pg_ripple.tombstone_gc_threshold (float, default 0.05)
    • pg_regress test storage_tombstone_gc.sql: verify tombstones are vacuumed after threshold is crossed
  • Rare-predicate promotion — idempotent and serialised (src/lib.rs, src/storage/mod.rs)
    • Acquire the per-predicate advisory lock before any promotion attempt
    • Use CREATE TABLE IF NOT EXISTS; wrap data move in WITH moved AS (DELETE … RETURNING *) INSERT INTO vp_N SELECT * FROM moved
    • Add crash-recovery test tests/crash_recovery/promotion_race.sh: two backends racing to promote the same predicate, assert exactly one succeeds
  • Dictionary cache rollback on transaction abort (src/dictionary/mod.rs, src/shmem.rs)
    • Version-tag each shared-memory cache entry with the inserting xid; decode path checks TransactionIdDidCommit before trusting cached ID
    • pg_regress test dictionary_rollback.sql: BEGIN; encode_term('novel:term'); ROLLBACK; encode_term('novel:term') — verify the second encode succeeds without error
  • Bloom filter saturating counter fix (src/shmem.rs)
    • Replace all reference-counter decrements with saturating_sub(1); document that a counter saturated at 255 is treated conservatively (bit kept set, no false negatives)
  • _pg_ripple.statements atomic update (src/storage/merge.rs)
    • Perform SID-range catalog DELETE + INSERT in the same transaction as the VP table swap
    • Eliminates the race where a mid-update worker kill leaves a stale SID→OID mapping for RDF-star queries
  • (o, s) index on vp_rare (src/storage/mod.rs)
    • Add CREATE INDEX IF NOT EXISTS vp_rare_os_idx ON _pg_ripple.vp_rare (o, s) in bootstrap and migration script
    • Eliminates sequential scans on object-leading patterns over rare predicates
  • Eliminate .expect() / .unwrap() in all library code (src/lib.rs, src/bulk_load.rs, src/sparql/optimizer.rs, src/sparql/sqlgen.rs, src/export.rs, pg_ripple_http/src/main.rs)
    • Replace all 30+ expect()/unwrap() calls in non-test code with Result-propagating helpers; surface errors via pgrx::error!() at the pg_extern boundary
    • Add #![deny(clippy::unwrap_used, clippy::expect_used)] to src/lib.rs (test code excluded via #[cfg(test)])
    • Fix pg_ripple_http: replace startup panics with graceful error logging and process::exit(1)
  • GUC check_hook validators (src/lib.rs)
    • Implement validators for all string-enum GUCs: inference_mode (off / on_demand / materialized), enforce_constraints (off / warn / error), rule_graph_scope (default / all), shacl_mode (off / sync / async), describe_strategy (cbd / scbd)
    • Implement min_val bounds for integer GUCs: max_path_depth ≥ 1, property_path_max_depth ≥ 1, merge_threshold ≥ 1, merge_interval_secs ≥ 1
    • Promote pg_ripple.rls_bypass to PGC_POSTMASTER so it cannot be flipped per-session
  • pg_ripple.diagnostic_report() RETURNS TABLE (key TEXT, value TEXT) (src/lib.rs)
    • Keys: GUC validity summary, shared-memory cache hit/miss rates, merge backlog (rows in all delta tables), validation queue depth, federation endpoint health, schema_version match
    • pg_regress test diagnostic_report.sql: exercise all fields; assert no null values
  • _pg_ripple.schema_version table (src/lib.rs)
    • Created at install time with columns version TEXT, installed_at TIMESTAMPTZ, upgraded_from TEXT
    • Stamped on every ALTER EXTENSION … UPDATE
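
The saturating-counter rule can be illustrated with a toy counting Bloom filter (the `CountingBloom` type and hash mixing are invented for this sketch; the real counters live in shared memory in src/shmem.rs). Once a counter saturates at 255 it is never decremented again, so a slot may stay conservatively set (a possible false positive) but can never be cleared while a present key still needs it (no false negatives):

```rust
// Toy counting Bloom filter illustrating the saturating-counter fix above.
struct CountingBloom {
    counters: Vec<u8>,
    k: usize, // hash functions per key
}

// Cheap stand-in hash family, for illustration only.
fn slots(key: u64, m: usize, k: usize) -> Vec<usize> {
    (0..k as u64)
        .map(|i| {
            let h = (key ^ i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
                .wrapping_mul(0xFF51_AFD7_ED55_8CCD);
            (h % m as u64) as usize
        })
        .collect()
}

impl CountingBloom {
    fn new(m: usize, k: usize) -> Self {
        Self { counters: vec![0; m], k }
    }

    fn insert(&mut self, key: u64) {
        for s in slots(key, self.counters.len(), self.k) {
            // saturating_add: a counter stuck at 255 simply stays there.
            self.counters[s] = self.counters[s].saturating_add(1);
        }
    }

    fn remove(&mut self, key: u64) {
        for s in slots(key, self.counters.len(), self.k) {
            // Conservative rule: never decrement a saturated counter, so the
            // slot stays set and a false negative is impossible.
            if self.counters[s] != u8::MAX {
                self.counters[s] = self.counters[s].saturating_sub(1);
            }
        }
    }

    fn may_contain(&self, key: u64) -> bool {
        slots(key, self.counters.len(), self.k)
            .into_iter()
            .all(|s| self.counters[s] > 0)
    }
}

fn main() {
    let mut bf = CountingBloom::new(1024, 3);
    bf.insert(42);
    assert!(bf.may_contain(42));
    bf.remove(42);
    assert!(!bf.may_contain(42)); // only 42 was inserted, so its slots drain to 0
    println!("ok");
}
```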

Migration Script

sql/pg_ripple--0.36.0--0.37.0.sql — adds (o, s) index on vp_rare; creates _pg_ripple.schema_version table; registers tombstone_gc_enabled and tombstone_gc_threshold GUCs. No VP table schema changes.

Documentation

  • user-guide/operations/troubleshooting.md — new section: "Lost deletes after merge" runbook (cause, detection via diagnostic_report(), fix via advisory lock, upgrade to v0.37.0)
  • reference/guc-reference.md — document tombstone_gc_threshold, tombstone_gc_enabled; add validator-rules table for all enum GUCs; note rls_bypass scope change
  • user-guide/operations/upgrade.md — document the schema_version stamp and how to verify upgrade completeness
  • Release notes for v0.37.0

Exit Criteria

No .expect()/.unwrap() in non-test Rust code; clippy deny enforced in CI. The concurrent-delete stress test (merge_concurrent_delete.sh) passes at 50 writers + 1-second merge interval. All GUC enum validators active. diagnostic_report() passes pg_regress. Migration scripts from 0.1.0 through 0.37.0 run cleanly via just test-migration.


v0.38.0 — Architecture Refactoring & Query Completeness

Theme: Split the god-module, introduce the PredicateCatalog abstraction, close SPARQL Update gaps, and wire SHACL hints into the query planner.

In plain language: After 37 releases, the codebase has accumulated structural debt — most visibly in a single 5,600-line "everything" file that makes every change risky. This release pays that debt: the central file is divided into focused modules, and a clean interface between the query engine and the storage layer is introduced so that future storage variants don't require rewriting the query translator. Users gain two concrete improvements: SPARQL UPDATE now supports pattern-based deletions (the commonly needed DELETE WHERE form that was missing), and SHACL shapes now automatically influence query planning so queries over shape-constrained predicates are faster.

Effort estimate: 9–11 person-weeks

Completed items

Deliverables

  • Split src/lib.rs into subsystem modules
    • Extract src/rare_predicate.rs, src/shacl_admin.rs, src/federation_registry.rs, src/graphrag_admin.rs, src/stats_admin.rs from src/lib.rs
    • Target: src/lib.rs ≤1,500 lines covering _PG_init, GUC registration, extension_sql! blocks, and thin #[pg_extern] delegation shims
    • No change to public SQL API; all existing pg_ripple.* functions remain
  • PredicateCatalog trait and backend-local OID cache (src/storage/catalog.rs new module)
    • Define trait PredicateCatalog { fn resolve(&self, pred_id: i64) -> Option<TableDesc>; }
    • Implement a backend-local HashMap<i64, TableDesc> cache invalidated by a syscache callback on _pg_ripple.predicates
    • Wire into src/sparql/sqlgen.rs and src/datalog/compiler.rs — eliminates per-atom SPI catalog lookup for hot BGPs
    • New GUC pg_ripple.predicate_cache_enabled (bool, default true)
    • Benchmark: 10-atom BGP must show 1 catalog SPI call instead of 10
  • Refactor validate_shape() → per-constraint helpers (src/shacl/constraints/ new sub-module)
    • One file per constraint family: count.rs, value_type.rs, string_based.rs, logical.rs, property_path.rs, shape_based.rs
    • Each exported function ≤80 lines; top-level validate_shape() becomes a dispatcher ≤50 lines
    • All existing shacl_*.sql pg_regress tests must pass unchanged
  • Refactor translate_pattern() → per-algebra-node helpers (src/sparql/translate/ new sub-module)
    • One file per algebra node: bgp.rs, join.rs, left_join.rs, union.rs, filter.rs, graph.rs, group.rs, distinct.rs
    • Shared context struct TranslateCtx carries encode cache, catalog handle, and query-level state
    • All existing sparql_*.sql pg_regress tests must pass unchanged
  • Batch dictionary encoding in SPARQL translation
    • In translate_pattern, collect all unresolved IRI/literal constants in a first pass; resolve via one encode_terms_batch(&[Term]) -> Vec<i64> SPI call (single INSERT … ON CONFLICT … RETURNING batch)
    • Benchmark: BGP with 20 FILTER constants must show 1 encode SPI call instead of 20
  • Plan-cache key normalisation (src/sparql/plan_cache.rs)
    • Cache on algebra digest (serialize spargebra::Query IR → compact bytes → XXH3-128) instead of raw query text
    • Whitespace and prefix-form variants now share the same cache slot
  • SCBD DESCRIBE — implemented (src/sparql/mod.rs)
    • Implement Symmetric Concise Bounded Description: all triples where the resource is subject or object, with blank-node recursion
    • describe_strategy = 'scbd' now functional; remove the "not implemented" caveat from docs
  • SPARQL Update: DELETE WHERE / INSERT WHERE / graph management (src/sparql/update.rs)
    • Implement DELETE { … } WHERE { … }, INSERT { … } WHERE { … }, DELETE WHERE { … }
    • Implement graph management: CLEAR GRAPH, DROP GRAPH, COPY, MOVE, ADD
    • pg_regress test sparql_update_advanced.sql: pattern-based deletes spanning multiple VP tables; cross-graph COPY/MOVE
  • Consolidate property-path depth GUCs (src/lib.rs)
    • Deprecate property_path_max_depth; make it an alias for max_path_depth with a one-time NOTICE
  • Wire SHACL hints into SPARQL planner (src/shacl/hints.rs new module, src/sparql/sqlgen.rs)
    • At query-translation time, query _pg_ripple.shape_hints (populated from loaded shapes) per predicate
    • sh:maxCount 1 → suppress DISTINCT on that predicate's join; sh:minCount 1 → downgrade LEFT JOIN to INNER JOIN
    • pg_regress test shacl_sparql_hints.sql: verify join-type changes with and without shapes; assert result equivalence
  • SPARQL 1.1 conformance suite in CI (allowed-to-warn job)
    • Download W3C SPARQL 1.1 test suite; run via cargo pgrx regress; report pass/skip/fail counts
    • Publish conformance percentage in CHANGELOG.md per release
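
The PredicateCatalog idea can be sketched independently of pgrx: a caching wrapper that only falls through to the backend on a miss, with an `invalidate()` hook standing in for the syscache callback. (The sketch takes `&mut self` so it can count backend lookups, and `TableDesc`/`SpiCatalog` are simplified stand-ins; the trait in src/storage/catalog.rs is specified with `&self` and resolves through SPI.)

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct TableDesc {
    table_name: String, // stand-in for the real descriptor (OID, columns, …)
}

trait PredicateCatalog {
    fn resolve(&mut self, pred_id: i64) -> Option<TableDesc>;
}

// Stand-in for the SPI-backed lookup against _pg_ripple.predicates.
struct SpiCatalog {
    pub calls: usize, // how often the backend was actually hit
}

impl PredicateCatalog for SpiCatalog {
    fn resolve(&mut self, pred_id: i64) -> Option<TableDesc> {
        self.calls += 1;
        Some(TableDesc { table_name: format!("_pg_ripple.vp_{pred_id}") })
    }
}

// Backend-local cache; invalidate() is where the syscache callback on
// _pg_ripple.predicates would hook in.
struct CachedCatalog<C> {
    inner: C,
    cache: HashMap<i64, TableDesc>,
}

impl<C: PredicateCatalog> CachedCatalog<C> {
    fn new(inner: C) -> Self {
        Self { inner, cache: HashMap::new() }
    }
    fn invalidate(&mut self) {
        self.cache.clear();
    }
}

impl<C: PredicateCatalog> PredicateCatalog for CachedCatalog<C> {
    fn resolve(&mut self, pred_id: i64) -> Option<TableDesc> {
        if let Some(d) = self.cache.get(&pred_id) {
            return Some(d.clone()); // hot path: no SPI call
        }
        let d = self.inner.resolve(pred_id)?;
        self.cache.insert(pred_id, d.clone());
        Some(d)
    }
}

fn main() {
    let mut cat = CachedCatalog::new(SpiCatalog { calls: 0 });
    // A 10-atom BGP over one predicate: ten resolves, one backend call.
    for _ in 0..10 {
        let _ = cat.resolve(7);
    }
    assert_eq!(cat.inner.calls, 1);
    cat.invalidate();
    let _ = cat.resolve(7);
    assert_eq!(cat.inner.calls, 2);
    println!("ok");
}
```

The `main` above is exactly the benchmark claim in miniature: 10 atoms, 1 catalog lookup.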

Migration Script

sql/pg_ripple--0.37.0--0.38.0.sql — creates _pg_ripple.shape_hints table; registers predicate_cache_enabled GUC. No VP table schema changes.

Documentation

  • reference/architecture.md — Mermaid architecture diagram showing post-refactor module boundaries (dictionary → storage/catalog → sparql/translate + datalog/compiler → shacl/constraints → views/exporters)
  • user-guide/sql-reference/sparql-update.md — document DELETE WHERE / INSERT WHERE / CLEAR / COPY / MOVE / ADD with examples
  • reference/guc-reference.md — document predicate_cache_enabled; deprecation notice for property_path_max_depth
  • user-guide/performance/query-planning.md — new section on SHACL hints and their effect on join selection
  • Release notes for v0.38.0

Exit Criteria

src/lib.rs ≤1,500 lines. Each translate/ module file ≤200 lines. validate_shape() dispatcher ≤50 lines. SCBD DESCRIBE tests pass. SPARQL Update advanced tests pass. SHACL hints pg_regress passes. Predicate OID cache reduces SPI calls for 10-atom BGP from 10 to 1. Migration chain test passes.


v0.39.0 — Datalog HTTP API for pg_ripple_http

Theme: Expose all pg_ripple Datalog SQL functions as a REST API in the pg_ripple_http companion service.

In plain language: The pg_ripple_http service currently speaks only SPARQL. This release adds a /datalog namespace that lets any HTTP client — without a PostgreSQL driver — manage rule sets, trigger inference, run goal-directed queries, check integrity constraints, and inspect monitoring statistics. The implementation is a thin axum layer; all heavy lifting stays inside the PostgreSQL extension.

Effort estimate: 3–5 person-weeks

Implementation plan: plans/pg_ripple_http_datalog.md

Completed items

Deliverables

  • Extract shared helpers (pg_ripple_http/src/common.rs new module)
    • Move AppState, check_auth(), redacted_error(), and env_or() from main.rs to common.rs
    • Both SPARQL and Datalog handlers import from this module
  • Phase 1 — Rule management (pg_ripple_http/src/datalog.rs new module)
    • POST /datalog/rules/{rule_set} — body text/x-datalog; calls pg_ripple.load_rules($1, $2); returns {"rule_set": "…", "rules_loaded": N}
    • POST /datalog/rules/{rule_set}/builtin — calls pg_ripple.load_rules_builtin($1)
    • GET /datalog/rules — calls pg_ripple.list_rules(); returns JSONB array
    • DELETE /datalog/rules/{rule_set} — calls pg_ripple.drop_rules($1); returns {"deleted": N}
    • POST /datalog/rules/{rule_set}/add — single-rule add; calls pg_ripple.add_rule($1, $2)
    • DELETE /datalog/rules/{rule_set}/{rule_id} — calls pg_ripple.remove_rule($1::bigint) (triggers DRed)
    • PUT /datalog/rules/{rule_set}/enable — calls pg_ripple.enable_rule_set($1)
    • PUT /datalog/rules/{rule_set}/disable — calls pg_ripple.disable_rule_set($1)
  • Phase 2 — Inference (pg_ripple_http/src/datalog.rs)
    • POST /datalog/infer/{rule_set} — calls pg_ripple.infer($1); returns {"derived": N}
    • POST /datalog/infer/{rule_set}/stats — calls pg_ripple.infer_with_stats($1); returns full stats JSONB
    • POST /datalog/infer/{rule_set}/agg — calls pg_ripple.infer_agg($1)
    • POST /datalog/infer/{rule_set}/wfs — calls pg_ripple.infer_wfs($1)
    • POST /datalog/infer/{rule_set}/demand — body {"demands": […]}; calls pg_ripple.infer_demand($1, $2::jsonb)
    • POST /datalog/infer/{rule_set}/lattice — body {"lattice": "min"}; calls pg_ripple.infer_lattice($1, $2)
  • Phase 3 — Query & constraints (pg_ripple_http/src/datalog.rs)
    • POST /datalog/query/{rule_set} — body Datalog goal text; calls pg_ripple.infer_goal($1, $2); returns {"derived": N, "iterations": N, "matching": […]}
    • GET /datalog/constraints — calls pg_ripple.check_constraints(NULL); returns violation array
    • GET /datalog/constraints/{rule_set} — calls pg_ripple.check_constraints($1)
  • Phase 4 — Admin & monitoring (pg_ripple_http/src/datalog.rs)
    • GET /datalog/stats/cache — calls pg_ripple.rule_plan_cache_stats()
    • GET /datalog/stats/tabling — calls pg_ripple.tabling_stats()
    • GET /datalog/lattices — calls pg_ripple.list_lattices()
    • POST /datalog/lattices — body {"name": "…", "join_fn": "…", "bottom": "…"}; calls pg_ripple.create_lattice($1, $2, $3)
    • GET /datalog/views — calls pg_ripple.list_datalog_views()
    • POST /datalog/views — body JSON; calls pg_ripple.create_datalog_view(…)
    • DELETE /datalog/views/{name} — calls pg_ripple.drop_datalog_view($1)
  • Route registration (pg_ripple_http/src/main.rs)
    • mod datalog; and mod common; declarations
    • 24 .route(…) entries wired under /datalog
  • Metrics extension (pg_ripple_http/src/metrics.rs)
    • Add datalog_queries: AtomicU64 counter; expose as pg_ripple_http_datalog_queries_total in /metrics
  • Authentication & security
    • All /datalog/* handlers call check_auth() — same token as SPARQL
    • Optional write-protection: PG_RIPPLE_HTTP_DATALOG_WRITE_TOKEN env var gates POST /datalog/rules/*, DELETE, and PUT endpoints independently of the read token
    • All SQL calls use $1, $2, … parameterized queries — never string concatenation
    • Request body limit: 10 MB via axum::body::to_bytes(body, 10 * 1024 * 1024)
  • Error mapping
    • 400 datalog_parse_error — malformed rule text returned by extension
    • 400 datalog_goal_error — invalid goal pattern
    • 400 invalid_request — missing body, wrong content-type, non-numeric rule_id
    • 404 rule_set_not_found — infer/drop on nonexistent rule set
    • 503 service_unavailable — pool exhausted
  • Migration script sql/pg_ripple--0.38.0--0.39.0.sql
    • No schema changes to pg_ripple itself; comment-only header documenting the new HTTP surface
  • Tests
    • Integration tests using axum-test (or equivalent): round-trip load → infer → query goal → drop for the custom rule set
    • Error path tests: malformed Datalog, missing auth, oversized body
    • Smoke test script tests/datalog_http_smoke.sh (curl-based)
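
The error-mapping table above is a pure function from extension errors to an HTTP status plus a machine-readable code. A sketch with an invented `DatalogApiError` enum (the real axum handlers in pg_ripple_http/src/datalog.rs build JSON responses from database errors along these lines):

```rust
// Invented enum standing in for errors surfaced by the extension.
#[derive(Debug)]
enum DatalogApiError {
    ParseError(String),      // malformed rule text
    GoalError(String),       // invalid goal pattern
    InvalidRequest(String),  // missing body, wrong content-type, bad rule_id
    RuleSetNotFound(String), // infer/drop on a nonexistent rule set
    PoolExhausted,           // connection pool has no free slot
}

// (status code, machine-readable error code) as documented above.
fn to_http(err: &DatalogApiError) -> (u16, &'static str) {
    match err {
        DatalogApiError::ParseError(_) => (400, "datalog_parse_error"),
        DatalogApiError::GoalError(_) => (400, "datalog_goal_error"),
        DatalogApiError::InvalidRequest(_) => (400, "invalid_request"),
        DatalogApiError::RuleSetNotFound(_) => (404, "rule_set_not_found"),
        DatalogApiError::PoolExhausted => (503, "service_unavailable"),
    }
}

fn main() {
    let err = DatalogApiError::RuleSetNotFound("custom".into());
    assert_eq!(to_http(&err), (404, "rule_set_not_found"));
    assert_eq!(
        to_http(&DatalogApiError::PoolExhausted),
        (503, "service_unavailable")
    );
    println!("ok");
}
```

Keeping the mapping in one total `match` means a new error variant fails to compile until it is given a status and code, which is the property the code-review exit criterion leans on.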

Documentation

  • pg_ripple_http/README.md — new ## Datalog API section with curl examples for all 24 endpoints, content types, and error codes
  • Release notes for v0.39.0

Exit Criteria

All 24 Datalog endpoints respond correctly in integration tests. GET /datalog/rules returns the JSONB array from list_rules(). POST /datalog/infer/custom triggers materialization and returns {"derived": N}. GET /datalog/constraints returns violation JSONB. Auth check rejects requests with invalid token. Parameterized-query requirement verified by code review (no format!() calls mixing user input into SQL strings). Migration chain test passes.


v0.40.0 — Streaming Results, Explain & Observability

Theme: Streaming cursor API for large result sets, first-class query explain, and full observability stack.

In plain language: Three long-requested developer and operator improvements land together. Large SPARQL queries can now stream their results instead of materialising everything in memory — making it safe to CONSTRUCT or export millions of triples without running out of memory. A new explain_sparql() function shows exactly what SQL the SPARQL engine generated, with cardinality estimates and actual timings in EXPLAIN ANALYZE format but with RDF IRIs instead of internal numbers. A new explain_datalog() function does the same for Datalog rule sets. Every significant operation now emits OpenTelemetry spans, and diagnostic_report() gives a one-call health summary of the running system.

Effort estimate: 9–11 person-weeks

Completed items

Deliverables

  • Streaming SPARQL cursor API (src/sparql/cursor.rs new module)
    • pg_ripple.sparql_cursor(query TEXT) RETURNS SETOF RECORD — SRF paging through results 1024 rows at a time with batched dictionary decode
    • pg_ripple.sparql_cursor_turtle(query TEXT) RETURNS SETOF TEXT — emits Turtle lines
    • pg_ripple.sparql_cursor_jsonld(query TEXT) RETURNS SETOF TEXT — emits JSON-LD object chunks
    • Wire to pg_ripple_http: Accept: text/turtle or Accept: application/ld+json triggers Transfer-Encoding: chunked streaming response
    • pg_regress test sparql_cursor.sql: load 500K triples; verify cursor returns correct count; verify chunked Turtle export round-trips
  • Resource governors (src/lib.rs)
    • pg_ripple.sparql_max_rows (integer, default 0 = unlimited)
    • pg_ripple.datalog_max_derived (integer, default 0 = unlimited)
    • pg_ripple.export_max_rows (integer, default 0 = unlimited)
    • pg_ripple.sparql_overflow_action (enum: warn / error, default warn)
    • Error codes: PT640 (SPARQL row limit exceeded), PT641 (Datalog derived limit exceeded), PT642 (export row limit exceeded)
  • pg_ripple.explain_sparql(query TEXT, analyze BOOLEAN DEFAULT false) RETURNS JSONB (src/sparql/explain.rs new module)
    • Step 1: parse + optimise via spargebra/sparopt; emit algebra tree as JSON with predicate IRIs decoded
    • Step 2: run EXPLAIN (FORMAT JSON, BUFFERS true [, ANALYZE true]) on the generated SQL; attach as "plan" key
    • Output keys: "algebra", "sql" (IRI-decoded), "plan", "cache_hit" (bool), "encode_calls" (int)
    • pg_regress test sparql_explain_jsonb.sql: verify all output keys; verify analyze: true adds "Actual Rows"
  • pg_ripple.explain_datalog(rule_set_name TEXT) RETURNS JSONB (src/datalog/explain.rs new module)
    • Returns per-stratum dependency graph, magic-set rewritten rules, compiled SQL per rule, and per-iteration delta-row counts from last inference run
    • Output keys: "strata", "rules" (rewritten), "sql_per_rule", "last_run_stats"
    • pg_regress test datalog_explain.sql
  • pg_ripple.cache_stats() RETURNS JSONB and pg_ripple.reset_cache_stats() (src/lib.rs)
    • Keys: plan cache size/hits/misses, dict cache hits/misses, federation cache hits/misses
    • pg_regress test cache_stats.sql
  • pg_ripple.stat_statements_decoded view (src/lib.rs)
    • View over pg_stat_statements that regex-decodes predicate IDs in query text via pg_ripple.decode_id() join; exposes query_decoded column
  • OpenTelemetry tracing (src/telemetry.rs new module)
    • Thin facade over the tracing crate; spans for: SPARQL parse/translate/execute, merge cycle (per predicate), federation call (per SERVICE), Datalog inference (per stratum)
    • GUC pg_ripple.tracing_enabled (bool, default false) — zero overhead when off
    • GUC pg_ripple.tracing_exporter (string: stdout / otlp, default stdout); otlp reads OTEL_EXPORTER_OTLP_ENDPOINT
    • pg_regress test telemetry.sql: toggle on/off; assert no performance regression in execute path with tracing off
  • Bug fix: OPTIONAL {} inside GRAPH {} silently fails for all predicates (src/sparql/sqlgen.rs)
    • Root cause: The GraphPattern::Graph handler applies the named-graph filter after the inner pattern is fully translated. When the inner pattern contains an OPTIONAL (spargebra LeftJoin), the LeftJoin translator wraps both sides in aliased subqueries that only project _lj_<varname> columns — the g column is intentionally stripped. The Graph handler then emits {lj_alias}.g = {gid}, which PostgreSQL rejects with column does not exist. This fails for all predicates (both dedicated VP tables and vp_rare); it was only observed first with vp_rare predicates (rdfs:subClassOf, rdfs:label, etc.) because typical test graphs have very few schema triples.
    • Correct fix — graph-filter context propagation (src/sparql/sqlgen.rs, Ctx):
      1. Add graph_filter: Option<i64> to Ctx.
      2. In GraphPattern::Graph, set ctx.graph_filter = Some(gid) before recursing into the inner pattern, then clear it after.
      3. In translate_bgp / table_expr / build_all_predicates_union, when ctx.graph_filter is Some(gid), inject WHERE g = {gid} (or AND g = {gid}) directly into each VP table scan.
      4. Remove the post-hoc for (alias, _) in &frag.from_items { frag.conditions.push(format!("{alias}.g = {gid}")); } loop from the Graph handler — the filter is now baked into every leaf VP scan before any LEFT JOIN, WITH RECURSIVE, or subquery wrapper is built.
    • This also fixes OPTIONAL {} combined with GROUP BY on variables from the optional side, and OPTIONAL {} inside GRAPH {} with FILTER, property paths, nested UNION, and federated SERVICE sub-patterns.
    • Regression tests:
      • sparql_optional_in_graph.sql — OPTIONAL triple with a dedicated-VP predicate inside a named graph; assert NULL vs non-NULL row counts
      • sparql_optional_in_graph_rare.sql — same pattern with a vp_rare predicate; assert NULL vs non-NULL row counts
      • sparql_optional_group_by_in_graph.sql — OPTIONAL + GROUP BY on optional variable inside a named graph (the original failing query shape); assert instanceCount per class is correct
  • Bug fix: property path inside GRAPH {} fails for all predicates (src/sparql/sqlgen.rs)
    • Root cause: identical to the OPTIONAL bug above — the WITH RECURSIVE CTE emitted for property path operators (+, *, ?) selects only (s, o), but the post-hoc Graph handler tries to reference {cte_alias}.g, producing column does not exist.
    • Fix: same graph-filter context propagation as above; anchor and recursive step selects must include g and filter on it when ctx.graph_filter is set, rather than relying on the outer Graph handler to inject the condition.
    • Regression test: sparql_path_in_graph.sql — property path on a rare predicate inside a named graph; assert correct row count
  • Migration header standardisation (sql/*.sql)
    • Backfill headers in all existing scripts: -- Migration X.Y.Z → A.B.C | Schema changes: … | Data-rewrite cost: Low/Medium/High | Downgrade: …
    • All future scripts from v0.37.0 onward follow this template automatically
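
The graph-filter context propagation fix can be sketched with a toy SQL generator: the named-graph id travels down in the translation context and is baked into every leaf VP scan, so no wrapper above the leaf ever needs a g column. (`Ctx` here is a one-field stand-in for the real translation context in src/sparql/sqlgen.rs.)

```rust
#[derive(Default)]
struct Ctx {
    graph_filter: Option<i64>, // Some(gid) while inside GRAPH <g> { … }
}

// Leaf VP-table scan: the graph filter is injected here, below any
// LEFT JOIN / WITH RECURSIVE wrapper that would later strip the g column.
fn scan_vp_table(ctx: &Ctx, table: &str, alias: &str) -> String {
    match ctx.graph_filter {
        Some(gid) => format!("(SELECT s, o FROM {table} WHERE g = {gid}) AS {alias}"),
        None => format!("(SELECT s, o FROM {table}) AS {alias}"),
    }
}

// GraphPattern::Graph handler: set the filter, recurse, restore. Saving the
// previous value keeps nested GRAPH blocks correct.
fn translate_graph(ctx: &mut Ctx, gid: i64, inner: impl FnOnce(&Ctx) -> String) -> String {
    let saved = ctx.graph_filter;
    ctx.graph_filter = Some(gid);
    let sql = inner(ctx);
    ctx.graph_filter = saved;
    sql
}

fn main() {
    let mut ctx = Ctx::default();
    let sql = translate_graph(&mut ctx, 42, |c| {
        // Inside GRAPH, even a leaf destined for an OPTIONAL or property-path
        // wrapper is already filtered; no post-hoc "{alias}.g = 42" patching.
        scan_vp_table(c, "_pg_ripple.vp_rare", "t0")
    });
    assert!(sql.contains("WHERE g = 42"));
    assert_eq!(ctx.graph_filter, None); // filter cleared after the block
    println!("{sql}");
}
```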

Migration Script

sql/pg_ripple--0.39.0--0.40.0.sql — registers new GUCs (sparql_max_rows, datalog_max_derived, export_max_rows, sparql_overflow_action, tracing_enabled, tracing_exporter). No VP table schema changes.

Documentation

  • user-guide/sql-reference/explain.md — full tutorial on explain_sparql() and explain_datalog(); reading the algebra tree and decoded SQL
  • user-guide/sql-reference/cursor-api.md — streaming cursor API; format options; resource governors
  • reference/observability.md (new) — OpenTelemetry integration guide: exporter setup, span taxonomy, Grafana/Jaeger integration examples
  • user-guide/operations/monitoring.md — cache_stats(), diagnostic_report(), stat_statements_decoded usage
  • reference/error-reference.md — PT640, PT641, PT642 documented
  • Release notes for v0.40.0

Exit Criteria

sparql_cursor.sql passes with 500K triples. explain_sparql() returns IRI-decoded algebra and SQL. OpenTelemetry spans emitted for a sample query when tracing_enabled = on. All resource governor tests pass. stat_statements_decoded returns decoded query text. sparql_optional_in_graph.sql, sparql_optional_in_graph_rare.sql, and sparql_optional_group_by_in_graph.sql all pass (OPTIONAL inside GRAPH). sparql_path_in_graph.sql passes (property path inside GRAPH). Migration chain test passes.


v0.41.0 — Full W3C SPARQL 1.1 Test Suite

Theme: Complete standards conformance verification via the full W3C SPARQL 1.1 test suite, run in parallel under 2 minutes in CI.

In plain language: Every major SPARQL engine bug — including the OPTIONAL inside GRAPH failure found in April 2026 — was caught by manual testing rather than by the test suite. This version fixes that by implementing a full harness for the official W3C SPARQL 1.1 test suite (~3,000 tests), parallelized across 8 workers so the entire suite completes in under 2 minutes. The harness parses W3C test manifests, auto-loads RDF fixtures per test, runs queries against a live pg_ripple instance, and validates results using RDF graph equivalence (not row counting). Per-category pass rates are reported in CI so regressions are caught immediately. A curated 180-test "smoke" subset (Graph Patterns + Aggregates) runs on every PR in under 30 seconds.
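
The 8-worker scheme can be sketched with standard-library threads and a shared work queue (an mpsc channel behind a Mutex stands in for the crossbeam channel; test names and the pass-counting are invented for the sketch):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Each worker pulls the next test from a shared queue and runs it in its
// own namespace, mirroring the per-thread _w3c_t{id}_ graph prefix above.
fn run_parallel(tests: Vec<String>, workers: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();
    for t in tests {
        tx.send(t).unwrap();
    }
    drop(tx); // close the queue so workers stop when it drains
    let rx = Arc::new(Mutex::new(rx));
    let mut handles = Vec::new();
    for worker_id in 0..workers {
        let rx = Arc::clone(&rx);
        handles.push(thread::spawn(move || {
            let mut passed = 0usize;
            loop {
                // Lock held only long enough to pop one work item.
                let next = rx.lock().unwrap().recv();
                match next {
                    Ok(test) => {
                        // Real harness: load fixtures into the worker's
                        // namespace, run the query, validate the result.
                        let _ = (&test, worker_id);
                        passed += 1;
                    }
                    Err(_) => break, // queue empty and closed
                }
            }
            passed
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let tests: Vec<String> = (0..100).map(|i| format!("test-{i}")).collect();
    assert_eq!(run_parallel(tests, 8), 100);
    println!("ok");
}
```

A work-stealing queue rather than a fixed partition is what keeps all 8 cores busy even when a few tests run close to the 5-second timeout.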

Effort estimate: 5–7 person-weeks

Deliverables

  • W3C manifest parser (tests/w3c/manifest.rs new module)
    • Parse W3C SPARQL 1.1 test manifests (Turtle format, mf:Manifest) into a structured TestCase struct
    • Fields: test IRI, type (mf:QueryEvaluationTest, mf:UpdateEvaluationTest, mf:PositiveSyntaxTest, mf:NegativeSyntaxTest), query file, data file(s), result file, named graph files
    • Covers all 13 sub-suites: aggregates, bind, exists, functions, grouping, negation, optional, project-expression, property-path, service, subquery, syntax-query, update
    • Tests with type mf:NotClassifiedByEarlYet are skipped and reported with SKIP status
  • RDF fixture loader (tests/w3c/loader.rs new module)
    • Load .ttl / .n3 / .rdf / .srx / .srj fixture files from tests/w3c/data/ into a temporary pg_ripple graph before each test
    • Use named graph IRIs matching the manifest's mf:graphData entries
    • Auto-teardown: drop the temporary named graph after the test completes (regardless of pass/fail)
    • Handle multi-graph datasets: mf:defaultGraph → default graph (g = 0); mf:namedGraphs → individual named graphs
  • Result validator (tests/w3c/validator.rs new module)
    • SELECT queries: compare against .srx (SPARQL Results XML) or .srj (SPARQL Results JSON); validate variable names and bindings as RDF term equality (IRI, blank node, literal with datatype and lang tag)
    • ASK queries: compare boolean result against .srx/.srj
    • CONSTRUCT / DESCRIBE queries: compare result graph against .ttl reference using graph isomorphism (blank-node-normalised; uses oxrdf for in-memory graph comparison)
    • UPDATE queries: compare the post-update store state (all named graphs) against expected .ttl reference
    • Blank node handling: rename blank nodes in both actual and expected by canonical DFS traversal before comparison
    • Report per-binding diff on failure: expected term vs. actual term
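The blank-node step above can be sketched as a deterministic relabelling pass. This is an illustrative stand-alone sketch, not the validator's shipped code: full graph isomorphism needs backtracking when blank triples are mutually indistinguishable, which the oxrdf-based comparison handles in the real harness.

```rust
use std::collections::HashMap;

/// A triple whose terms are plain strings; blank nodes start with "_:".
type Triple = (String, String, String);

/// Rename blank nodes to _:c0, _:c1, ... in first-visit order over a
/// deterministically sorted copy of the graph, so two graphs that differ
/// only in blank-node labels normalise to the same triple list.
fn canonicalize(graph: &[Triple]) -> Vec<Triple> {
    // Blank labels are masked out of the sort key so the visit order
    // does not depend on the original labels themselves.
    fn mask(t: &str) -> String {
        if t.starts_with("_:") { "_:".to_string() } else { t.to_string() }
    }
    fn rename(t: &str, names: &mut HashMap<String, String>) -> String {
        if !t.starts_with("_:") {
            return t.to_string();
        }
        let next = format!("_:c{}", names.len());
        names.entry(t.to_string()).or_insert(next).clone()
    }
    let mut sorted: Vec<&Triple> = graph.iter().collect();
    sorted.sort_by_key(|(s, p, o)| (mask(s), p.clone(), mask(o)));
    let mut names = HashMap::new();
    let mut out: Vec<Triple> = sorted
        .iter()
        .map(|(s, p, o)| (rename(s, &mut names), p.clone(), rename(o, &mut names)))
        .collect();
    out.sort();
    out
}
```

Two graphs are then treated as equivalent when their canonicalized triple lists compare equal.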
  • Parallel test runner (tests/w3c/runner.rs new module)
    • cargo test --test w3c_suite -- --test-threads 8 — each thread picks tests from a shared work queue (lock-free crossbeam channel)
    • Each thread owns an isolated pg_ripple named-graph namespace (prefix _w3c_t{thread_id}_) to prevent cross-test pollution
    • Test timeout: 5 seconds per test; timed-out tests marked TIMEOUT not FAIL
    • Progress: indicatif progress bar per thread in local runs; plain line-per-test output in CI
    • Output report: per-category pass/fail/skip/timeout counts + per-test detail for any failure
    • Target: full 3,000-test suite completes in < 2 minutes on an 8-core CI runner (AWS c7g.2xlarge or equivalent)
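The shared-work-queue pattern the runner relies on looks roughly like this — a std-only sketch using a mutex-guarded queue in place of the lock-free crossbeam channel named above, with the per-thread namespace prefix stubbed in where a real worker would load fixtures and execute the query:

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Run `tests` across `workers` threads pulling from one shared queue.
/// Each worker tags its results with its own namespace prefix, mirroring
/// the per-thread named-graph isolation described above.
fn run_parallel(tests: Vec<String>, workers: usize) -> Vec<(String, String)> {
    let queue = Arc::new(Mutex::new(tests.into_iter().collect::<VecDeque<_>>()));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for id in 0..workers {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Lock only long enough to pop one test name.
            let test = { queue.lock().unwrap().pop_front() };
            match test {
                Some(name) => {
                    // A real worker would load fixtures into its
                    // _w3c_t{id}_ namespace and run the query here.
                    tx.send((name, format!("_w3c_t{}_", id))).unwrap();
                }
                None => break,
            }
        }));
    }
    drop(tx); // close the channel once all workers finish
    for h in handles {
        h.join().unwrap();
    }
    rx.into_iter().collect()
}
```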
  • Smoke subset (tests/w3c_smoke.rs)
    • 180-test curated subset: optional (80 tests), aggregates (60 tests), grouping (40 tests) — the three categories most likely to expose SQL-generation bugs
    • Runs on every PR via cargo test --test w3c_smoke; completes in < 30 seconds
    • Failures block merge (added to required status checks in .github/workflows/ci.yml)
  • CI integration (.github/workflows/ci.yml)
    • New job w3c-suite: runs after the existing pgrx-test job; parallelized 8-way; uploads test report as artifact
    • New job w3c-smoke: runs on every PR and push to main; required check
    • Full suite job is optional (non-blocking) until pass rate reaches 95%; then promoted to required
    • Cache: W3C test fixtures (tests/w3c/data/) cached by SHA of manifest files
  • Test data download script (scripts/fetch_w3c_tests.sh)
    • Downloads the official W3C SPARQL 1.1 test suite from https://www.w3.org/2009/sparql/docs/tests/
    • Verified against known SHA-256 checksums of the manifest files
    • Output: tests/w3c/data/ directory (gitignored; fetched by CI and locally on first run)
  • Known-failures manifest (tests/w3c/known_failures.txt)
    • List of W3C test IRIs that currently fail, with a one-line reason for each (e.g., OPTIONAL inside GRAPH — fix in v0.40.0, property path with GRAPH — fix in v0.40.0)
    • Failures in known_failures.txt are reported as XFAIL (expected failure), not FAIL
    • Any test in known_failures.txt that unexpectedly passes is reported as XPASS and causes a CI warning
    • Target at release: 0 XFAIL entries in the smoke subset; ≤ 50 XFAIL entries in the full suite (SERVICE tests against live external endpoints are always SKIP)
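The XFAIL/XPASS rules above reduce to a small status table (illustrative sketch; SKIP and TIMEOUT are decided earlier in the runner and bypass this classification):

```rust
/// Final status for a test given its raw outcome and whether its IRI is
/// listed in known_failures.txt: XFAIL is an expected failure, XPASS is
/// an unexpected pass that triggers a CI warning.
fn classify(passed: bool, known_failure: bool) -> &'static str {
    match (passed, known_failure) {
        (true, false) => "PASS",
        (false, false) => "FAIL",
        (false, true) => "XFAIL",
        (true, true) => "XPASS",
    }
}
```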
  • Pass-rate tracking (tests/w3c/report.json)
    • CI uploads a report.json artifact with per-category pass/fail/skip/timeout counts and overall pass rate
    • Historical pass rate trend displayed in README.md badge

Migration Script

sql/pg_ripple--0.40.0--0.41.0.sql — no schema changes. Adds a comment-only header noting that v0.41.0 is a test infrastructure release.

Documentation

  • reference/w3c-conformance.md — per-category W3C SPARQL 1.1 conformance table: test count, pass count, known failures with ticket links
  • reference/running-w3c-tests.md (new) — how to run the smoke subset and full suite locally; how to add a new expected failure; how to interpret XFAIL vs XPASS
  • README.md — W3C SPARQL 1.1 conformance section updated
  • Release notes for v0.41.0

Exit Criteria

Smoke subset (180 tests) passes with 0 unexpected failures on main. Full suite (3,000+ tests) runs in < 2 minutes on an 8-core CI runner. Per-category pass rate report uploaded as CI artifact. Known-failures manifest has 0 entries for optional and aggregates categories (those bugs fixed in v0.40.0). Migration chain test passes through 0.41.0.


v0.42.0 — Parallel Merge, Cost-Based Federation & Live CDC

Theme: Multi-worker HTAP merge, intelligent federation query planning, and real-time RDF change subscriptions.

In plain language: Three architectural improvements that close the last major gaps before the 1.0 production release. The merge worker — which keeps the read-optimised main partition in sync with incoming writes — is upgraded from a single process to a configurable pool of parallel workers, each responsible for a subset of predicates, directly improving write throughput for workloads with many distinct predicates. Federation queries now use a cost model to pick the best execution order and run independent fragments in parallel, eliminating the serial bottleneck. And for the first time, applications can subscribe to a real-time stream of triple changes filtered by SPARQL pattern or SHACL shape, enabling reactive GraphRAG pipelines, live dashboards, and ML feature stores without polling.

Effort estimate: 10–12 person-weeks

Deliverables

  • Parallel merge worker pool (src/worker.rs, src/storage/merge.rs)
    • New GUC pg_ripple.merge_workers (integer, default 1, max 16) — spawns N BackgroundWorker processes each managing a disjoint round-robin subset of predicates
    • Per-predicate pg_advisory_lock (from v0.37.0) ensures no two workers race on the same VP table
    • Work-stealing: idle workers check the global queue for any predicate above pg_ripple.merge_threshold not yet claimed
    • Stress test tests/stress/parallel_merge.sh: 100 concurrent writers × 100 predicates × 4 workers; assert correctness and no deadlocks after 10 minutes
    • Benchmark: 4 merge workers on a workload with 100 distinct predicates shows ≥3× throughput vs. single worker
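The disjoint round-robin assignment can be sketched in a few lines (an assumption-level illustration of the partitioning scheme, not the worker code itself):

```rust
/// Round-robin partition of predicate ids over N merge workers: each
/// worker owns a disjoint bucket, so no two workers ever touch the
/// same VP table; the per-predicate advisory lock is the safety net.
fn partition(predicates: &[u64], n_workers: usize) -> Vec<Vec<u64>> {
    let mut buckets = vec![Vec::new(); n_workers];
    for (i, &p) in predicates.iter().enumerate() {
        buckets[i % n_workers].push(p);
    }
    buckets
}
```

Work-stealing then only needs to check predicates not yet claimed by any bucket's owner.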
  • owl:sameAs cluster size bound (src/datalog/builtins.rs)
    • New GUC pg_ripple.sameas_max_cluster_size (integer, default 100_000)
    • Detect over-large equivalence classes during canonicalization; emit PT550 WARNING and short-circuit with Tarjan-SCC sampling approximation
    • pg_regress test sameas_large_cluster.sql
  • VoID statistics catalog per federation endpoint (src/sparql/federation.rs, _pg_ripple.endpoint_stats table)
    • On endpoint registration, fetch and cache the endpoint's VoID description
    • Refresh driven by new GUC pg_ripple.federation_stats_ttl_secs (integer, default 3600)
    • Statistics used by the planner: triple count per predicate, distinct subjects/objects
  • Cost-based federation source selection (src/sparql/federation_planner.rs new module)
    • FedX-style planner: for each BGP atom rank endpoints by estimated selectivity using VoID stats; assign each atom to its best source
    • Independent atoms (no shared variables) scheduled for parallel execution
    • GUC pg_ripple.federation_planner_enabled (bool, default true)
    • GUC pg_ripple.federation_parallel_max (integer, default 4)
    • GUC pg_ripple.federation_parallel_timeout (integer, default 60 seconds)
    • pg_regress test federation_planner.sql: two registered mock endpoints; verify atom routing and timeout behaviour
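The core of the FedX-style ranking can be sketched as follows — a hypothetical stand-alone illustration where `VoidStats` stands in for the cached `_pg_ripple.endpoint_stats` rows:

```rust
use std::collections::HashMap;

/// VoID-derived statistics per endpoint: triple count by predicate IRI.
type VoidStats = HashMap<String, u64>;

/// Route a triple pattern to the endpoint with the fewest triples for
/// its predicate (lowest estimated cardinality). Endpoints that
/// advertise no triples for the predicate are pruned entirely.
fn best_source<'a>(predicate: &str, endpoints: &'a [(String, VoidStats)]) -> Option<&'a str> {
    endpoints
        .iter()
        .filter_map(|(name, stats)| stats.get(predicate).map(|&n| (name.as_str(), n)))
        .filter(|&(_, n)| n > 0)
        .min_by_key(|&(_, n)| n)
        .map(|(name, _)| name)
}
```

Atoms whose best sources differ and share no variables are the ones the planner schedules in parallel.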
  • Parallel SERVICE execution (src/sparql/federation.rs)
    • Independent SERVICE clauses dispatched concurrently via background workers; results reassembled before outer join
    • Bounded by pg_ripple.federation_parallel_max
  • Federation result streaming (src/sparql/federation.rs)
    • SERVICE responses exceeding pg_ripple.federation_inline_max_rows (new GUC, default 10_000) are spooled into a temporary table rather than inlined as VALUES
    • Error code PT620 INFO when spooling is triggered
  • IP/CIDR allowlist for federation endpoints (src/sparql/federation.rs)
    • Resolve the hostname on endpoint registration; deny RFC 1918 private ranges, IPv4 link-local (169.254.0.0/16), loopback, and IPv6 link-local addresses by default
    • New GUC pg_ripple.federation_allow_private (bool, default false) to override
    • Error code PT621 when a private-IP endpoint is rejected
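The default denial predicate amounts to a few checks on the resolved address — a simplified sketch that does not model the federation_allow_private override:

```rust
use std::net::IpAddr;

/// True if an endpoint address is rejected by default: RFC 1918 private
/// ranges, loopback, IPv4 link-local (169.254.0.0/16), and IPv6
/// link-local (fe80::/10).
fn is_denied(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => v4.is_private() || v4.is_loopback() || v4.is_link_local(),
        // No stable is_link_local() for IPv6, so test the fe80::/10
        // prefix on the first 16-bit segment directly.
        IpAddr::V6(v6) => v6.is_loopback() || (v6.segments()[0] & 0xffc0) == 0xfe80,
    }
}
```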
  • HTTPS certificate validation for HTTP companion (pg_ripple_http/src/main.rs)
    • Default to system trust store via rustls-native-certs
    • Env var PG_RIPPLE_HTTP_CA_BUNDLE — path to a custom CA PEM for private-PKI federation targets
    • Reject self-signed certificates unless PG_RIPPLE_HTTP_ALLOW_SELF_SIGNED=true
    • Fix CORS defaults: explicit origin allowlist via PG_RIPPLE_HTTP_CORS_ORIGINS; * requires opt-in
    • Fix X-Forwarded-For: trust only when PG_RIPPLE_HTTP_TRUST_PROXY env lists upstream IP/CIDR
    • Body limit configurable via PG_RIPPLE_HTTP_MAX_BODY_BYTES (default 10_485_760)
  • Live RDF CDC subscriptions (src/cdc.rs, pg_ripple_http/src/ws.rs new module)
    • pg_ripple.create_subscription(name TEXT, filter_sparql TEXT DEFAULT NULL, filter_shape TEXT DEFAULT NULL) RETURNS BOOLEAN
    • Publishes via NOTIFY pg_ripple_cdc_{name} with JSON payload: {"op": "add"|"remove", "s": "…", "p": "…", "o": "…", "g": "…"}
    • WebSocket endpoint /ws/subscriptions/{name} in pg_ripple_http; supports text/turtle, application/ld+json, application/json via Accept
    • Optional SPARQL filter: only matching triples published; optional SHACL filter: only shape-violating triples published
    • pg_ripple.drop_subscription(name TEXT), pg_ripple.list_subscriptions() RETURNS TABLE
    • New catalog table _pg_ripple.subscriptions (name, filter_sparql, filter_shape, created_at, queue_table_oid)
    • pg_regress test cdc_subscriptions.sql: create subscription, insert triples, verify LISTEN receives expected payloads
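A consumer sees one JSON object per changed triple in the shape shown above. A minimal sketch of the payload construction (hand-rolled escaping for illustration only; the real implementation would use a proper JSON serialiser):

```rust
/// Build the NOTIFY payload for one triple change:
/// {"op": "add"|"remove", "s": …, "p": …, "o": …, "g": …}
fn cdc_payload(op: &str, s: &str, p: &str, o: &str, g: &str) -> String {
    // Escape backslashes first, then quotes, so JSON stays well-formed.
    let esc = |t: &str| t.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        "{{\"op\":\"{}\",\"s\":\"{}\",\"p\":\"{}\",\"o\":\"{}\",\"g\":\"{}\"}}",
        esc(op), esc(s), esc(p), esc(o), esc(g)
    )
}
```

A SQL client consumes these with LISTEN pg_ripple_cdc_{name}; the WebSocket endpoint relays the same payloads.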

Migration Script

sql/pg_ripple--0.41.0--0.42.0.sql — creates _pg_ripple.endpoint_stats table; creates _pg_ripple.subscriptions table; registers new GUCs (merge_workers, sameas_max_cluster_size, federation_stats_ttl_secs, federation_planner_enabled, federation_parallel_max, federation_parallel_timeout, federation_inline_max_rows, federation_allow_private).

Documentation

  • user-guide/operations/merge-workers.md (new) — tuning merge_workers for predicate-rich workloads; monitoring via diagnostic_report()
  • user-guide/features/cdc-subscriptions.md (new) — complete tutorial: subscribe, filter, consume via SQL LISTEN and WebSocket; integration patterns with GraphRAG, ML feature stores, and live dashboards
  • user-guide/features/federation.md — updated: VoID stats, cost-based planner, parallel SERVICE, result streaming, IP restrictions
  • reference/guc-reference.md — all new GUCs documented; security guidance on federation_allow_private
  • reference/error-reference.md — PT550, PT620, PT621 documented
  • Release notes for v0.42.0

Exit Criteria

Parallel merge stress test passes (100 writers, 4 workers, no lost deletes). VoID stats fetched on endpoint registration. Independent SERVICE clauses execute in parallel (verifiable via explain_sparql()). CDC subscription delivers NOTIFY payloads for all inserts matching the filter. HTTPS cert validation enforced in pg_ripple_http. Migration chain test passes through 0.42.0.


v0.43.0 — WatDiv + Jena Conformance Suite

Theme: Scale-correctness and semantic edge-case coverage via the WatDiv benchmark and Apache Jena test suite, reusing the harness infrastructure from v0.41.0.

In plain language: W3C conformance (v0.41.0) proves pg_ripple is correct on small, well-defined fixtures. This release proves it is correct at scale and on the implementation edge cases that W3C deliberately leaves underspecified. WatDiv loads 10M–100M triples and runs 100–1,000 queries across four complexity levels (star, chain, snowflake, complex) — catching SQL planner regressions and VP table performance cliffs that only appear under realistic data distributions. Apache Jena contributes ~1,000 additional tests covering type coercion corner cases, timezone handling in date comparisons, numeric precision, and blank-node scoping rules that the W3C suite glosses over.

Effort estimate: 5–7 person-weeks (90% infrastructure reuse from v0.41.0)

Deliverables

  • Apache Jena adapter (tests/jena/ new module)
    • Adapt v0.41.0 manifest parser to handle Jena-specific manifest fields (jt:QueryEvaluationTest, jt:UpdateEvaluationTest) and Jena result extensions (e.g. rdf:XMLLiteral, extended numeric types)
    • ~1,000 tests across Jena's sparql-query, sparql-update, sparql-syntax, and algebra sub-suites
    • Reuse v0.41.0 RDF fixture loader, result validator, parallel runner, and known-failures manifest format
    • Specific coverage targets:
      • Type coercion: XSD numeric promotions (xsd:integer → xsd:decimal → xsd:double); mixed-type comparisons
      • Date/time: timezone-aware xsd:dateTime comparisons; NOW(), YEAR(), MONTH(), DAY(), HOURS(), MINUTES(), SECONDS(), TZ() builtins
      • Numeric precision: xsd:decimal arithmetic; ROUND(), CEIL(), FLOOR(), ABS()
      • Blank-node scoping: blank nodes in CONSTRUCT templates; blank nodes across GRAPH boundaries; blank-node identity in OPTIONAL
      • String functions: STRLEN(), SUBSTR(), UCASE(), LCASE(), STRSTARTS(), STRENDS(), CONTAINS(), ENCODE_FOR_URI(), CONCAT()
    • Target: full Jena suite completes in < 3 minutes alongside W3C suite on CI
    • New CI job jena-suite — non-blocking until pass rate ≥ 95%; then promoted to required
  • WatDiv harness (tests/watdiv/ new module)
    • Data generation: integrate watdiv Rust port or call the upstream C++ binary via std::process::Command; generate 10M-triple dataset once and cache in CI artifact storage
    • Query templates: all 100 WatDiv query templates across four structural classes:
      • Star (S1–S7): all predicates share a single subject; tests VP table scan and star-join optimisation
      • Chain (C1–C3): predicates form a linear path; tests join ordering
      • Snowflake (F1–F5): star + chain hybrid; tests mixed join strategies
      • Complex (B1–B12, L1–L5): multi-hop patterns with OPTIONAL and UNION; tests full algebra
    • Correctness validation: run each query against a baseline (pre-computed expected cardinalities from a reference run) and assert the returned row count is within ±0.1% of the baseline
    • Performance baseline: record median query latency per template at 10M triples; flag regressions > 20% in CI
    • Separate cargo bench --bench watdiv target using criterion — feeds into benchmarks/ results
    • Target: full 100-template suite at 10M triples completes in < 5 minutes on an 8-core CI runner
    • New CI job watdiv-suite — non-blocking (performance regressions are warnings, not failures)
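The ±0.1% cardinality check is a one-liner worth pinning down, since off-by-one tolerance bugs are easy to ship (illustrative sketch; the fraction 0.001 corresponds to ±0.1%):

```rust
/// True if `actual` is within ±tolerance (a fraction, e.g. 0.001 for
/// ±0.1%) of the expected baseline row count.
fn within_tolerance(expected: u64, actual: u64, tolerance: f64) -> bool {
    let diff = (expected as f64 - actual as f64).abs();
    diff <= expected as f64 * tolerance
}
```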
  • Shared harness improvements (backport to tests/w3c/)
    • Unified tests/conformance/runner.rs — single parallel runner used by W3C, Jena, and WatDiv; eliminates code duplication
    • Unified known_failures.txt format with suite: prefix (e.g. w3c:, jena:, watdiv:)
    • Unified CI report artifact: per-suite pass/fail/skip/timeout counts in one conformance_report.json
  • Test data download script (scripts/fetch_conformance_tests.sh)
    • Extends scripts/fetch_w3c_tests.sh to also download Jena test suite from Apache mirror and WatDiv query templates from GitHub
    • All downloads verified against SHA-256 checksums
    • WatDiv 10M dataset generated once and stored as a CI artifact (not re-generated on every run)

Migration Script

sql/pg_ripple--0.42.0--0.43.0.sql — no schema changes. Comment-only header noting that v0.43.0 is a test infrastructure release.

Documentation

  • reference/w3c-conformance.md — updated to include Jena sub-suite pass rates alongside W3C categories
  • reference/watdiv-results.md (new) — WatDiv benchmark results table: query class, template ID, median latency at 10M triples, pass/fail status; updated on each release
  • contributing/running-conformance-tests.md — updated to cover Jena and WatDiv; how to regenerate WatDiv dataset; how to update performance baselines
  • README.md — add WatDiv correctness badge alongside W3C conformance badge
  • Release notes for v0.43.0

Exit Criteria

Full Jena suite (1,000 tests) completes in < 3 minutes on CI. WatDiv 100-template suite at 10M triples completes in < 5 minutes. Jena known-failures manifest ≤ 30 XFAIL entries (type coercion and date-time edge cases acceptable until addressed post-1.0). WatDiv row-count correctness within ±0.1% for all 100 templates. Migration chain test passes through 0.43.0.


v0.44.0 — LUBM Conformance Suite

Theme: OWL RL inference correctness under ontological reasoning via the Lehigh University Benchmark (LUBM).

In plain language: LUBM is a classic academic benchmark that generates a synthetic university-domain ontology dataset (scalable from 1K to 8M+ triples) and defines 14 canonical queries that exercise OWL RL inference rules — subclass traversal, property inheritance, inverse properties, transitivity, and domain/range entailments. This release wires LUBM into the conformance harness to validate that pg_ripple's Datalog engine and SPARQL query layer produce correct results when ontological reasoning is active. A dedicated Datalog validation sub-suite tests the Datalog API directly (rule compilation, stratification, iterative inference, goal queries, and materialization) to catch bugs invisible to SPARQL-level testing. It is the only benchmark that tests the interaction between the SPARQL translator and the Datalog inference engine under realistic ontological load.

Effort estimate: 3–5 person-weeks (80% harness reuse from v0.41.0 and v0.43.0; +2–3 person-weeks for the Datalog API validation sub-suite)

Deliverables

  • LUBM data generator integration (tests/lubm/generator.rs new module)
    • Invoke the UBA (Univ-Bench Artificial) data generator via std::process::Command, or use a Rust port, to produce Turtle-serialised datasets at configurable university count (--univ 1 → ~100K triples; --univ 10 → ~1M triples; --univ 50 → ~5M triples)
    • Cache generated datasets as CI artifacts keyed by university count and seed; re-generate only when the generator binary changes
    • Load into a named graph <http://swat.cse.lehigh.edu/onto/univ-bench.owl> via the v0.41.0 fixture loader
    • Also load the univ-bench.owl ontology into the Datalog engine as an RDFS/OWL RL rule set before running queries
  • 14 canonical LUBM queries (tests/lubm/queries/q01.sparql through q14.sparql)
    • Implement all 14 LUBM queries verbatim from the benchmark specification
    • Each query exercises at least one inference rule:
      • Q1, Q2, Q4, Q6: rdf:type + subclass/subproperty entailment
      • Q3, Q5, Q7: inverse property + domain/range reasoning
      • Q8, Q12, Q13: multi-hop inference chains
      • Q9, Q10, Q11, Q14: conjunctive patterns over inferred and asserted triples
    • Reference results: pre-computed correct answer counts for --univ 1 (published in the original LUBM paper); assert exact cardinality match
  • Correctness validator (tests/lubm/validator.rs)
    • Compare actual row count against published reference counts for each of the 14 queries at --univ 1
    • For --univ 10, compare against a locally pre-computed baseline (stored in tests/lubm/baselines/univ10.json)
    • Fail on any count mismatch; report which inference rules produced wrong results
  • CI integration (.github/workflows/ci.yml)
    • New job lubm-suite: runs after w3c-suite; generates --univ 1 dataset (< 100K triples, < 30 seconds); loads ontology + triples; runs all 14 queries; reports pass/fail per query
    • Non-blocking for --univ 10 (larger dataset run triggered weekly or on release branches)
    • Reuse unified tests/conformance/runner.rs from v0.43.0; add lubm: prefix to known-failures format
  • Known-failures manifest — add lubm:Q{N} entries for any query that fails at release, with one-line root-cause note
  • Datalog validation sub-suite (tests/lubm/datalog/ new module) — test the Datalog API directly on the same --univ 1 and --univ 10 LUBM datasets
    • Rule compilation correctness (tests/lubm/datalog/rule_compilation.sql): call pg_ripple.add_rules() with the OWL RL ruleset; use pg_ripple.rules() to inspect compiled rules; assert rule count and stratification matches specification
    • Inference iteration tracking (tests/lubm/datalog/inference_iterations.sql): use pg_ripple.rule_statistics() after pg_ripple.materialize_owl_rl() to count iterations per stratum; validate that fixpoint is reached without over-iteration (off-by-one detection)
    • Inferred triple counts (tests/lubm/datalog/inferred_triples.sql): call pg_ripple.inferred_triples(rule_name) for key OWL RL rules (e.g. subclass_entail, subproperty_entail, domain_range); assert row counts match pre-computed baselines for --univ 1 and --univ 10
    • Direct goal queries (tests/lubm/datalog/goal_queries.sql): use pg_ripple.goal() directly on Datalog-computed facts; verify results match SPARQL query results (validates inference engine independence from SPARQL translation)
    • Materialization performance baseline (tests/lubm/datalog/materialization_perf.sql): benchmark pg_ripple.materialize_owl_rl() at --univ 1 (target < 5 seconds) and --univ 10 (target < 60 seconds); flag > 10% regression in CI
    • Custom rule validation (tests/lubm/datalog/custom_rules.sql): define ad-hoc Datalog rules (e.g. transitive closure over a custom predicate) on LUBM data; compare against ground-truth computed via Datalog vs. SPARQL; catch rule-compiler edge cases
    • Results compared against unified baseline (tests/lubm/baselines/datalog_validation.json).
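The custom-rule ground truth for transitive closure can itself be computed with a semi-naive fixpoint — the same evaluation strategy the iteration-tracking tests above are validating. A minimal in-memory sketch, assuming the rule tc(x, z) :- edge(x, z). tc(x, z) :- tc(x, y), edge(y, z):

```rust
use std::collections::HashSet;

/// Semi-naive fixpoint: only facts derived in the previous iteration
/// (`delta`) are joined against `edge`, which is what keeps iteration
/// counts bounded and detectable by rule_statistics()-style tracking.
fn transitive_closure(edges: &[(u32, u32)]) -> HashSet<(u32, u32)> {
    let mut tc: HashSet<(u32, u32)> = edges.iter().copied().collect();
    let mut delta = tc.clone();
    while !delta.is_empty() {
        let mut next = HashSet::new();
        for &(x, y) in &delta {
            for &(a, b) in edges {
                // insert() returns true only for genuinely new facts,
                // so `next` holds exactly the current iteration's delta.
                if a == y && tc.insert((x, b)) {
                    next.insert((x, b));
                }
            }
        }
        delta = next;
    }
    tc
}
```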

Migration Script

sql/pg_ripple--0.43.0--0.44.0.sql — adds UNIQUE(p, s, o, g) constraint to _pg_ripple.vp_rare to fix SPARQL UPDATE set semantics for rare predicates.

Documentation

  • reference/lubm-results.md (new) — LUBM conformance table: query ID, description, inference rules exercised, reference count, pg_ripple result, pass/fail; updated each release
  • reference/w3c-conformance.md — updated to link to LUBM and WatDiv result pages for a complete conformance picture
  • contributing/running-conformance-tests.md — updated to cover LUBM data generation, ontology loading, and baseline regeneration
  • Release notes for v0.44.0

Exit Criteria

All 14 LUBM queries return exact reference cardinalities at --univ 1. Ontology + --univ 1 dataset loads and all queries complete in < 30 seconds on CI. All Datalog API calls in the sub-suite return results matching pre-computed baselines (rule count, iteration count, inferred triple counts, goal query results). Materialization performance at --univ 1 is < 5 seconds. Custom Datalog rule validation passes (transitive closure results match ground truth). Known-failures manifest has 0 lubm: entries at release. Migration chain test passes through 0.44.0.


v0.45.0 — SHACL Completion, Datalog Robustness & Crash Recovery

Theme: Close the last SHACL Core constraint gaps, harden parallel Datalog evaluation against worker failures, and add the missing crash-recovery scenarios and migration-documentation standards.

In plain language: This release finishes the SHACL implementation by adding the two remaining Core constraints (sh:equals and sh:disjoint), makes violation messages readable by always including the decoded focus-node IRI, and proves the async validation queue can sustain 10,000 writes per minute for five continuous minutes. On the Datalog side it ensures that a crash in one parallel evaluation worker rolls back all other workers cleanly, and that user-supplied lattice join functions are validated before the engine tries to call them. A new set of crash-recovery tests covers the two scenarios that were never tested: killing PostgreSQL mid-promotion of a rare predicate and killing it mid-inference. Finally, every migration script from this release onward carries a standardised header documenting the schema changes, data-rewrite cost, downgrade strategy, and the test file that covers it.

Effort estimate: 4–6 person-weeks

Deliverables

  • sh:equals and sh:disjoint constraints (src/shacl/constraints/)

    • sh:equals p — for every focus node, the value set of the shape's sh:path must equal the value set of p (the predicate named by sh:equals); implemented as two NOT EXISTS subqueries (one per direction); compiled into a SHACL constraint helper in src/shacl/constraints/relational.rs
    • sh:disjoint p — the value sets must be disjoint; implemented symmetrically
    • pg_regress test shacl_equals_disjoint.sql — covers passing shapes, failing shapes, blank-node identity, and named-graph scoping
    • Migration: no schema changes; constraints are pure SQL inside the validation query
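The intended semantics per focus node are simple set comparisons — an illustrative sketch of what the two NOT EXISTS subqueries compute, with the value lists standing in for the rows each predicate yields:

```rust
use std::collections::HashSet;

/// sh:equals holds when the two predicates yield identical value sets.
fn sh_equals(path_values: &[&str], other_values: &[&str]) -> bool {
    let a: HashSet<&str> = path_values.iter().copied().collect();
    let b: HashSet<&str> = other_values.iter().copied().collect();
    a == b
}

/// sh:disjoint holds when the two predicates share no value.
fn sh_disjoint(path_values: &[&str], other_values: &[&str]) -> bool {
    let a: HashSet<&str> = path_values.iter().copied().collect();
    other_values.iter().all(|v| !a.contains(v))
}
```

Each NOT EXISTS direction corresponds to one set-difference check: values of the path missing from p, and values of p missing from the path.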
  • Decoded focus-node IRIs in SHACL violation messages (src/shacl/mod.rs)

    • All paths that emit a SHACL violation (ereport!(Error, …) or write to _pg_ripple.validation_results) must include the decoded IRI of the focus node alongside its integer ID
    • Add a decode_id_safe(id: i64) helper that falls back to "<decoded-id:{id}>" if the dictionary lookup fails
    • Regression test: load a shape with a violation; assert the violation message text contains the focus-node IRI string
  • SHACL async pipeline load test (benchmarks/shacl_async_load.sql)

    • pgbench-driven harness that inserts triples at 10,000/min for 5 continuous minutes while the async SHACL validation pipeline is active
    • Asserts: (a) _pg_ripple.validation_queue depth stays bounded (does not grow unboundedly); (b) drain rate ≥ arrival rate ± 5%; (c) dead-letter queue receives any persistent violators; (d) no backend crashes
    • CI job shacl-async-load is informational (non-blocking) but results are logged as a CI artifact
  • Coordinated parallel-strata rollback (src/datalog/parallel.rs)

    • Wrap all independent-group SQL execution inside a single PostgreSQL transaction with one SAVEPOINT strata_eval per group
    • On failure in any group, issue ROLLBACK TO SAVEPOINT for all already-applied groups and re-raise the error; on success, RELEASE SAVEPOINT to commit the whole stratum
    • pg_regress test datalog_parallel_rollback.sql: inject a deliberate failure in one group; assert no partial facts survive
  • lattice.join_fn validation via regprocedure (src/datalog/lattice.rs)

    • Before storing a user-supplied join_fn name, resolve it via SELECT '{name}'::regprocedure::text inside an SPI transaction
    • If the round-trip succeeds, store the qualified name returned by PG (avoids search-path injection); if it fails, raise PT541 LatticeJoinFnInvalid with a clear message naming the rejected identifier
    • New error code PT541 added to src/error.rs and docs/src/reference/error-catalog.md
  • WFS iteration-cap test and documentation (tests/pg_regress/sql/datalog_wfs_cap.sql)

    • pg_regress test that loads a mutually-recursive negation cycle guaranteed to reach pg_ripple.wfs_max_iterations; asserts: (a) function returns without error; (b) "stratifiable": false in result; (c) PostgreSQL WARNING with code PT520 is emitted; (d) "certain" and "unknown" fact counts are non-zero (partial result)
    • docs/src/user-guide/sql-reference/datalog.md — add a "Well-Founded Semantics limits" subsection documenting the cap behaviour and how to detect it via RETURNING
  • Crash-recovery: rare-predicate promotion kill (tests/crash_recovery/test_promote_kill.sh)

    • Script that starts a large-batch insert designed to cross the promotion threshold, sends kill -9 to the promoting backend mid-transaction, restarts PostgreSQL, calls pg_ripple.diagnostic_report(), and asserts vp_rare is consistent (no orphaned rows, predicate catalog matches actual tables)
    • Outcome must be either: promotion completed (VP table exists, vp_rare rows moved) or promotion rolled back (VP table absent, vp_rare rows intact) — no hybrid state permitted
  • Crash-recovery: Datalog inference kill mid-fixpoint (tests/crash_recovery/test_inference_kill.sh)

    • Script that starts a large-ruleset inference run, kills the backend during the second fixpoint iteration, restarts, and asserts: (a) no partially-derived facts remain in any VP table (i.e., no inferred triples from an aborted inference); (b) pg_ripple.infer() can be re-run successfully to completion
  • Standardised migration script headers

    • Backfill sql/pg_ripple--*.sql with the standard header block (schema changes, data-rewrite cost estimate, downgrade strategy, test reference) for any script that currently lacks one — starting with 0.5.1→0.6.0 (the HTAP split) and the five most structurally significant migrations
    • Add the header template to AGENTS.md "Extension Versioning & Migration Scripts" section so all future scripts include it from creation
  • Recovery procedure runbook in RELEASE.md

    • Add a "Rollback & Recovery" section documenting: (a) how to roll back each class of migration (comment-only vs. schema-change vs. data-rewrite); (b) the pg_dump/pg_restore path as the universal fallback; (c) how to diagnose a partial upgrade using _pg_ripple.schema_version and pg_ripple.diagnostic_report()

Migration Script

sql/pg_ripple--0.44.0--0.45.0.sql — no VP table schema changes. Comment-only header. Installs PT541 error code registration (compiled from Rust).

Documentation

  • reference/shacl-constraints.md — add sh:equals and sh:disjoint to the constraint table with examples
  • reference/error-catalog.md — add PT541 (LatticeJoinFnInvalid)
  • user-guide/sql-reference/datalog.md — "Well-Founded Semantics limits" subsection
  • reference/troubleshooting.md — add entries for "rare-predicate promotion stuck" and "inference aborted mid-fixpoint"
  • Release notes for v0.45.0

Exit Criteria

sh:equals and sh:disjoint pg_regress tests pass. SHACL violation messages include decoded focus-node IRIs. Parallel-strata rollback test demonstrates no partial facts on deliberate failure. lattice.join_fn injection via search-path ambiguous name is rejected at create_lattice() time with PT541. WFS cap test passes: PT520 WARNING emitted, partial result returned. Both new crash-recovery scripts exit 0. Migration chain test passes through 0.45.0.


v0.46.0 — Property-Based Testing, Fuzz Hardening & OWL 2 RL Conformance

Theme: Property-based and fuzz testing for the remaining untested trust surfaces, the W3C OWL 2 RL conformance suite, and targeted performance improvements from the deep-analysis recommendations.

In plain language: This release closes three gaps that can hide subtle bugs: (1) randomised property-based tests that assert algebraic invariants about the SPARQL translator and dictionary encoder — if encoding the same term twice ever yields different IDs, or if a query changes semantics when extra whitespace is added, these tests catch it; (2) fuzz tests for the federation result parser, which accepts untrusted network data; and (3) the W3C OWL 2 RL test manifests, which verify that pg_ripple's Datalog engine handles the full range of ontological reasoning that OWL 2 RL demands. On the performance side, a LIMIT push-down stops paginated queries from decoding rows that are immediately discarded, sequence range pre-allocation removes a contention point in parallel Datalog, and BSBM joins the CI suite as a regression gate. The rustdoc lint ensures no public function ships without a doc comment.

Effort estimate: 5–7 person-weeks

Deliverables

  • proptest integration (tests/proptest/)

    • SPARQL algebra round-trip (tests/proptest/sparql_roundtrip.rs): generate random spargebra::Query values using proptest strategies; assert that (a) encoding the same SPARQL query twice produces byte-identical SQL; (b) queries that differ only in whitespace or prefix aliases produce the same generated SQL (plan-cache key stability); (c) star-pattern self-join elimination never changes the result set (check against a reference without elimination)
    • Dictionary encode/decode (tests/proptest/dictionary.rs): for any arbitrary IRI, blank node, or literal string, decode_id(encode_term(t)) == t; assert no collisions for 10,000 random distinct terms; assert encode is stable across pg_ripple restarts (same term → same ID given the same dictionary)
    • JSON-LD framing round-trip (tests/proptest/jsonld_framing.rs): generate random flat JSON-LD input graphs and random @context frames; assert that frame_jsonld(input, frame) returns valid JSON-LD and that any IRI present in the input that matches the frame appears in the output
    • Dev-dependency: proptest = "1" added to Cargo.toml under [dev-dependencies]
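The dictionary invariants the proptest suite asserts can be stated against a toy model (a hypothetical illustration, not pg_ripple's dictionary implementation): encode is stable and injective, and decode inverts it.

```rust
use std::collections::HashMap;

/// Toy term dictionary exhibiting the tested properties: the same term
/// always yields the same id, distinct terms never collide, and
/// decode(encode(t)) == t.
struct Dictionary {
    by_term: HashMap<String, i64>,
    by_id: Vec<String>,
}

impl Dictionary {
    fn new() -> Self {
        Dictionary { by_term: HashMap::new(), by_id: Vec::new() }
    }
    fn encode(&mut self, term: &str) -> i64 {
        if let Some(&id) = self.by_term.get(term) {
            return id; // stability: same term, same id
        }
        let id = self.by_id.len() as i64; // injectivity: ids never reused
        self.by_term.insert(term.to_string(), id);
        self.by_id.push(term.to_string());
        id
    }
    fn decode(&self, id: i64) -> Option<&str> {
        self.by_id.get(id as usize).map(String::as_str)
    }
}
```

Under proptest these become properties quantified over arbitrary generated terms rather than hand-picked examples.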
  • cargo-fuzz federation result decoder target (fuzz/fuzz_targets/federation_result.rs)

    • Fuzz target that feeds arbitrary byte sequences through the SPARQL XML results parser (src/sparql/federation.rs result-decoding path) — the path that processes application/sparql-results+xml responses from remote SERVICE endpoints
    • Assert: no panic, no unwrap abort; invalid XML must produce a PT6xx-range error, never a crash
    • CI nightly job fuzz-federation runs the target for 10 minutes; any new corpus entries that trigger panics are reported as blocking failures
  • Datalog convergence regression suite (tests/datalog_convergence/)

    • Download a 1M-triple DBpedia-en subset (persons, organisations, relations) via an extended scripts/fetch_conformance_tests.sh; load into pg_ripple
    • Apply the built-in RDFS + OWL RL rule set via pg_ripple.materialize_owl_rl()
    • Assert: fixpoint reached in ≤ 20 iterations; total wall-clock time < 5 minutes on CI; derived triple count falls within ±1% of a pre-computed baseline stored in tests/datalog_convergence/baselines.json
    • Repeat for a 200-rule custom rule set (100 forward-chaining + 100 OWL RL rules) on a 100K-triple schema.org snippet; assert convergence in ≤ 15 iterations
  • W3C OWL 2 RL conformance suite (tests/owl2rl/)

    • Download the W3C OWL 2 RL test manifests from https://github.com/w3c/owl2-profiles-tests
    • Adapter tests/owl2rl/manifest.rs parses the owl2:DatatypeEntailmentTest, owl2:ConsistencyTest, and owl2:InconsistencyTest manifest types
    • Each test loads a premise ontology, runs pg_ripple.materialize_owl_rl(), then evaluates a conclusion ontology via ASK/entailment check
    • CI job owl2rl-suite is informational (non-blocking) until pass rate ≥ 95%; known failures tracked in tests/owl2rl/known_failures.txt with owl2rl: prefix
    • Reuse unified conformance runner from v0.43.0
  • TopN push-down (src/sparql/sqlgen.rs)

    • When a SPARQL query has both ORDER BY and LIMIT N (and no OFFSET > 0), emit the SQL as … ORDER BY … LIMIT N rather than fetching all rows and discarding after decoding
    • The optimisation applies to SELECT queries; skipped when DISTINCT is in scope (PostgreSQL cannot push LIMIT through DISTINCT without a subquery)
    • New GUC pg_ripple.topn_pushdown (bool, default on) guards the rewrite; pg_ripple.sparql_explain() output includes a "topn_applied": true/false key
    • pg_regress test sparql_topn.sql: assert result correctness and EXPLAIN shows a Limit node directly over the VP scan
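The guard condition for the rewrite can be modeled as a small predicate over the query shape. QueryShape and its field names are an assumed simplification for illustration, not pg_ripple's real planner types.

```rust
/// Simplified view of the features the rewrite inspects (assumed shape).
struct QueryShape {
    has_order_by: bool,
    limit: Option<u64>,
    offset: u64,
    distinct: bool,
}

/// Mirror of the guard described above: push LIMIT N into the generated SQL
/// only for ORDER BY + LIMIT N with no OFFSET and no DISTINCT in scope,
/// and only when the GUC enables the rewrite.
fn topn_applies(q: &QueryShape, guc_topn_pushdown: bool) -> bool {
    guc_topn_pushdown && q.has_order_by && q.limit.is_some() && q.offset == 0 && !q.distinct
}

/// Emit the tail of the generated SQL accordingly.
fn sql_tail(q: &QueryShape, guc_topn_pushdown: bool) -> String {
    if topn_applies(q, guc_topn_pushdown) {
        format!("ORDER BY sort_key LIMIT {}", q.limit.unwrap())
    } else {
        // Without the push-down, all rows are fetched and trimmed after decoding.
        "ORDER BY sort_key".to_string()
    }
}
```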
  • Sequence range pre-allocation for parallel Datalog workers (src/datalog/parallel.rs)

    • Before launching N parallel strata workers, call SELECT setval(seq, currval(seq) + N * batch_size) once to reserve a contiguous SID range; each worker uses its slice without touching the sequence
    • batch_size defaults to 10,000 and is configurable via pg_ripple.datalog_sequence_batch (integer GUC, default 10000, min 100)
    • pg_regress test datalog_sequence_batch.sql: assert that after parallel inference the global SID sequence has no gaps within the reserved range
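The reservation scheme amounts to carving one contiguous block into per-worker slices. A minimal sketch of the slicing arithmetic, assuming `next_sid` models the first unused sequence value:

```rust
/// Slice a pre-reserved contiguous SID block among N workers so that no worker
/// touches the sequence during inference. Sketch of the scheme described above,
/// not the extension's real coordinator code.
fn preallocate_sid_ranges(
    next_sid: i64,
    n_workers: i64,
    batch_size: i64,
) -> Vec<std::ops::Range<i64>> {
    (0..n_workers)
        .map(|w| {
            let start = next_sid + w * batch_size;
            start..start + batch_size // each worker consumes its slice locally
        })
        .collect()
}
```

A single `setval` advancing the sequence by `n_workers * batch_size` then makes the whole block safe to hand out.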
  • BSBM regression gate in CI (.github/workflows/ci.yml, benchmarks/bsbm/)

    • Integrate the Berlin SPARQL Benchmark (BSBM) at 1M triple scale as a nightly regression check
    • scripts/fetch_conformance_tests.sh extended to download and install the BSBM data generator
    • CI job bsbm-regression: generates a 1M-triple product dataset, runs the 12 BSBM explore queries, compares query latency against a baseline stored in benchmarks/bsbm/baselines.json; any query regressing by > 10% emits a CI warning (non-blocking but visible in the PR summary)
    • Complement to v1.0.0's full-scale BSBM-at-100M-triples published benchmark
  • Rustdoc lint gate (src/lib.rs, Cargo.toml, .github/workflows/ci.yml)

    • Add #![warn(missing_docs)] to src/lib.rs (scoped to public items only; internal pub(crate) items excluded)
    • CI job cargo doc --no-deps --document-private-items gated to fail on any missing_docs warning for public #[pg_extern] functions
    • Backfill doc comments for the 20 most-called public functions (as identified by pg_stat_statements in the test suite run); leave a FIXME(docs): comment on the remaining stubs to track progress
  • HTTP companion: CA-bundle env var (pg_ripple_http/src/main.rs)

    • Add PG_RIPPLE_HTTP_CA_BUNDLE environment variable: if set, load the PEM file at the given path as the trust anchor for all outbound TLS connections (SERVICE federation and SPARQL endpoint queries)
    • If the path does not exist or is not a valid PEM bundle, log an error at startup and fall back to the system trust store (never silently ignore)
    • This complements the v0.42.0 rustls-tls-native-roots hardening by allowing operators to pin a specific CA or internal PKI certificate
    • Integration test: start a mock TLS server with a self-signed CA; assert that pg_ripple_http rejects it by default and accepts it when PG_RIPPLE_HTTP_CA_BUNDLE points to the CA cert
  • Expanded worked examples (examples/)

    • examples/shacl_datalog_quality.sql — end-to-end: load a bibliographic graph, define SHACL shapes, run SPARQL to list violations, apply Datalog RDFS rules, re-check shapes; documents the SHACL + Datalog interaction pattern
    • examples/hybrid_vector_search.sql — end-to-end: embed entities, run vector similarity search, combine with SPARQL property-path constraints; documents the pg:similar() + SPARQL pattern
    • examples/graphrag_round_trip.sql — end-to-end: load a knowledge graph, run GraphRAG export, annotate with Datalog-derived community summaries, re-import enriched triples; documents the full GraphRAG round-trip

New GUC Parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.topn_pushdown | bool | on | Push LIMIT N into the SQL plan for ORDER BY + LIMIT queries |
| pg_ripple.datalog_sequence_batch | integer | 10000 | SID range reserved per parallel Datalog worker per batch |

New Error Codes

| Code | Severity | Message |
|---|---|---|
| PT542 | ERROR | Federation result decoder received unparseable XML/JSON |

Migration Script

sql/pg_ripple--0.45.0--0.46.0.sql — no schema changes. Registers topn_pushdown and datalog_sequence_batch GUCs (compiled from Rust). Comment-only header.

Documentation

  • user-guide/best-practices/sparql-performance.md — "TopN push-down" section with EXPLAIN example
  • reference/guc-reference.md — v0.46.0 section with two new GUC parameters
  • reference/error-catalog.md — PT542 added
  • contributing/testing.md — proptest and cargo-fuzz sections covering how to run and extend the harnesses
  • Release notes for v0.46.0

Exit Criteria

All three proptest suites run 10,000 cases each with no failures. Federation result decoder fuzz target runs 10 minutes without panics. Datalog convergence suite: fixpoint on 1M DBpedia triples in ≤ 20 iterations, wall-clock < 5 minutes. OWL 2 RL suite: ≥ 80% pass rate at release (target 95% for v1.0.0). TopN push-down EXPLAIN shows Limit node for ORDER BY + LIMIT queries; result set unchanged. BSBM-at-1M-triples baseline stored and regression gate active. No missing-docs warnings for public #[pg_extern] functions. HTTP companion starts cleanly with PG_RIPPLE_HTTP_CA_BUNDLE set to a valid PEM file. Migration chain test passes through 0.46.0.


v0.47.0 — SHACL Truthfulness, Dead-Code Activation & Architecture Refactor

Theme: Close the parsed-but-not-checked SHACL gap, wire dead code, finish the SPARQL translate module split, and expand fuzz and crash-recovery coverage.

In plain language: v0.45.0 was titled "SHACL Completion" but the post-release audit (PLAN_OVERALL_ASSESSMENT_3.md) found four constraints that accept any data without complaint — the parser records them but the validator ignores them. That is fixed here. The preallocate_sid_ranges() function added in v0.46.0 to speed up parallel Datalog has been sitting unused (clippy dead_code warning); it gets wired in. The src/sparql/translate/ refactor that began in v0.38.0 finally lands, shrinking sqlgen.rs from 3,600 lines into focused per-operator modules. Five new fuzz targets cover the attack surfaces that had only one target before. Four new crash-recovery scenarios close the remaining operational safety gaps.

Effort estimate: 8–10 person-weeks

Deliverables

  • SHACL parsed-but-not-checked constraint sweep (S4-1…S4-4)

    • Implement sh:closed checker in src/shacl/constraints/closed.rs: for each focus node enumerate all predicate IDs present; reject any not listed in sh:property / sh:path or sh:ignoredProperties
    • Implement sh:uniqueLang checker: for a given focus node and path, assert no two values share the same non-empty @lang tag
    • Implement sh:pattern checker in src/shacl/constraints/string_based.rs (currently an empty placeholder): apply the sh:flags-aware POSIX regex against the string value of each focus node
    • Implement sh:lessThanOrEquals checker: decode both value nodes and compare with the XSD-typed ordering already used by FILTER expressions
    • Wire each into the shape dispatcher at src/shacl/mod.rs
    • Add pg_regress tests shacl_closed.sql, shacl_unique_lang.sql, shacl_pattern.sql, shacl_lt_or_equals.sql (S8-4)
    • Add a startup-time warning listing every parsed-but-unchecked constraint type encountered, to guard against future regressions
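Of the four checkers, sh:uniqueLang has the subtlest semantics. A minimal sketch, assuming the values reachable from a focus node are already decoded into (value, optional language tag) pairs; BCP 47 tags compare case-insensitively, and plain literals never conflict:

```rust
use std::collections::HashSet;

/// sh:uniqueLang sketch: report true when no two values share the same
/// non-empty @lang tag. Illustrative model of the rule stated above, not
/// pg_ripple's real constraint checker.
fn unique_lang_conforms(values: &[(&str, Option<&str>)]) -> bool {
    let mut seen = HashSet::new();
    for (_value, lang) in values {
        if let Some(tag) = lang {
            if tag.is_empty() {
                continue; // only non-empty language tags participate
            }
            // Language tags compare case-insensitively per BCP 47.
            if !seen.insert(tag.to_ascii_lowercase()) {
                return false; // duplicate @lang -> violation
            }
        }
    }
    true
}
```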
  • Wire preallocate_sid_ranges() (S1-2)

    • Call the function from the parallel-strata coordinator in src/datalog/parallel.rs before launching any worker batch
    • Assert via datalog_sequence_batch.sql that pg_sequence_last_value advances by n_workers * batch_size on each batch; eliminate the clippy dead_code warning
  • Finish src/sparql/translate/ module split (S2-3)

    • Move BGP translation into src/sparql/translate/bgp.rs (~400 LoC)
    • Move Filter translation into src/sparql/translate/filter.rs (~200 LoC)
    • Move LeftJoin (OPTIONAL) into src/sparql/translate/left_join.rs (~250 LoC)
    • Move Union into src/sparql/translate/union.rs (~150 LoC)
    • Move Distinct into src/sparql/translate/distinct.rs (~100 LoC)
    • Move Graph pattern into src/sparql/translate/graph.rs (~200 LoC)
    • Move Group/aggregation into src/sparql/translate/group.rs (~300 LoC)
    • Move Join into src/sparql/translate/join.rs (~200 LoC)
    • Target: sqlgen.rs ≤ 800 LoC (routing and coordination only)
  • Six missing GUC check_hook validators (S5-1)

    • Add validators for: federation_on_error (warning|error|empty), federation_on_partial (empty|use), sparql_overflow_action (warn|error), tracing_exporter (stdout|otlp), embedding_index_type (hnsw|ivfflat), embedding_precision (single|half|binary)
    • Consolidate max_path_depth and property_path_max_depth into a single GUC with min = 1, max = 65535 validator (S2-5)
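The six validators are enum-style whitelist checks. A sketch of the shared shape, with the hook signature simplified from the real pgrx GUC API; the GUC names and allowed values are taken from the list above:

```rust
/// check_hook-style validator sketch: accept only whitelisted values for an
/// enum-like text GUC.
fn validate_enum_guc(guc_name: &str, value: &str) -> Result<(), String> {
    let allowed: &[&str] = match guc_name {
        "pg_ripple.federation_on_error" => &["warning", "error", "empty"],
        "pg_ripple.federation_on_partial" => &["empty", "use"],
        "pg_ripple.sparql_overflow_action" => &["warn", "error"],
        "pg_ripple.tracing_exporter" => &["stdout", "otlp"],
        "pg_ripple.embedding_index_type" => &["hnsw", "ivfflat"],
        "pg_ripple.embedding_precision" => &["single", "half", "binary"],
        _ => return Err(format!("unknown GUC: {guc_name}")),
    };
    if allowed.contains(&value) {
        Ok(())
    } else {
        Err(format!(
            "invalid value \"{value}\" for {guc_name}; expected one of {allowed:?}"
        ))
    }
}
```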
  • Five new cargo-fuzz targets (S8-1)

    • fuzz/fuzz_targets/sparql_parser.rs: feed arbitrary bytes through the SPARQL query parser; assert no panic
    • fuzz/fuzz_targets/turtle_parser.rs: fuzz the Turtle/N-Triples bulk loader; assert no panic, invalid input → PT3xx error
    • fuzz/fuzz_targets/datalog_parser.rs: fuzz the Datalog rule parser; assert no panic
    • fuzz/fuzz_targets/shacl_parser.rs: fuzz parse_shapes_graph(); assert no panic
    • fuzz/fuzz_targets/dictionary_hash.rs: fuzz the dictionary encode path; assert no panic and round-trip invariant
    • Each target runs for 10 minutes in CI nightly; a new crash-inducing input is a blocking failure
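Every one of these targets asserts the same invariant: arbitrary bytes never panic the parser — they either parse or yield a structured error. A std-only sketch of that property, using `catch_unwind` and a toy parser in place of libFuzzer (a real target would use the `fuzz_target!` macro from cargo-fuzz):

```rust
use std::panic;

/// Toy stand-in for a parser entry point: must return Err on bad input,
/// never panic. Not pg_ripple's real Datalog parser.
fn parse_rule(input: &[u8]) -> Result<String, String> {
    let text = std::str::from_utf8(input).map_err(|e| e.to_string())?;
    if text.contains(":-") {
        Ok(text.trim().to_string())
    } else {
        Err("not a Datalog rule".to_string())
    }
}

/// The invariant every fuzz target asserts: arbitrary bytes never panic
/// the parser.
fn never_panics(input: &[u8]) -> bool {
    panic::catch_unwind(|| {
        let _ = parse_rule(input);
    })
    .is_ok()
}
```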
  • Four missing crash-recovery scenarios (S8-3)

    • CONSTRUCT/DESCRIBE view materialisation kill: kill -9 during materialize_view(); restart and verify view state is consistent
    • Federation result spooling kill: kill -9 during SERVICE temp-table spool; restart and verify no orphaned temp tables
    • Parallel Datalog stratum kill (merge_workers > 1): kill -9 mid-fixpoint; restart and verify inference restarts cleanly
    • Embedding worker queue kill: kill -9 during async embedding queue flush; restart and verify queue drains without duplicates
  • Plan / dictionary / federation cache hit-rate metrics (S7-1)

    • pg_ripple.plan_cache_stats() → (hits BIGINT, misses BIGINT, evictions BIGINT, hit_rate DOUBLE PRECISION)
    • pg_ripple.dictionary_cache_stats() → same shape
    • pg_ripple.federation_cache_stats() → same shape
    • Wire hit_rate into the BSBM regression gate as a secondary metric
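The derived hit_rate column follows the usual convention. A sketch, with field names mirroring the SQL signature above and the zero-traffic case defined as 0.0 (an assumption; the real SRF may return NULL instead):

```rust
/// Shape of the stats row returned by the three *_cache_stats() SRFs.
struct CacheStats {
    hits: i64,
    misses: i64,
    evictions: i64,
}

impl CacheStats {
    /// hit_rate = hits / (hits + misses); 0.0 for an untouched cache.
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}
```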
  • WFS non-convergence warning (S3-2)

    • Emit PT520 WARNING when the well-founded semantics iteration cap is reached without convergence; include iteration count and the predicate that last changed
  • OWL 2 RL conformance baseline (S3-3)

    • Run the OWL 2 RL suite added in v0.46.0; document the pass rate in docs/src/reference/owl2rl-results.md
    • Surface XFAIL entries in tests/owl2rl/known_failures.txt for release-to-release tracking
  • CI and security hygiene (S6-1, S6-2, S6-4, S10-1)

    • Add weekly scheduled cargo audit job; failure creates a GitHub issue automatically
    • Add cargo deny configuration with licence allowlist
    • Add scripts/check_no_security_definer.sh that scans sql/*.sql and fails on any SECURITY DEFINER directive
    • Add SPDX licence compatibility check via cargo license
  • Promotion-race stress test (S8-5)

    • tests/stress/promotion_race.sh: fire 50 concurrent inserts at the rare-predicate promotion threshold; verify SIDs are non-overlapping per worker
  • Documentation (S9-1, S9-2, S9-3, S5-3)

    • reference/guc-reference.md: complete entries for all GUCs through v0.47.0; flag datalog_sequence_batch as now active
    • Add GUC ↔ workload-class tuning matrix (when to raise dictionary_cache_size, when to increase merge_workers, when to tune property_path_max_depth)
    • Add 5 worked examples: federation-multi-endpoint, parallel-Datalog, CONSTRUCT/DESCRIBE view materialisation, RDF-star annotation patterns, WCOJ cyclic queries
    • Document NOTIFY queue tuning for CDC subscriptions (max_notify_queue_pages)

New Error Codes

| Code | Severity | Message |
|---|---|---|
| PT520 | WARNING | Well-founded semantics iteration cap reached without convergence; result is partial |

Migration Script

sql/pg_ripple--0.46.0--0.47.0.sql — no schema changes. Comment header describing new SHACL constraint checkers, wired preallocate_sid_ranges(), and six new GUC validators.

Documentation

  • reference/shacl-reference.md — mark sh:closed, sh:uniqueLang, sh:pattern, sh:lessThanOrEquals as fully implemented
  • contributing/testing.md — fuzz targets section extended for five new targets
  • reference/guc-reference.md — complete audit of all registered GUCs through v0.47.0
  • Release notes for v0.47.0

Exit Criteria

All four previously parsed-but-unchecked SHACL constraints trigger violations on non-conforming data. preallocate_sid_ranges() has zero clippy dead_code warnings. sqlgen.rs ≤ 800 LoC. All five fuzz targets run 10 minutes without panics. All four crash-recovery scenarios pass. Three cache-stats SRFs return non-zero hit_rate after a warm workload. OWL 2 RL pass-rate baseline documented. cargo audit and cargo deny green in CI.


v0.48.0 — SHACL Core Completeness, OWL 2 RL Closure & SPARQL Completeness

Theme: Complete SHACL Core conformance, close the OWL 2 RL rule-set gap, finish SPARQL 1.1 Update, and resolve the SPARQL-star variable-pattern gap.

In plain language: After v0.47.0 makes the existing SHACL constraints truthful, this release adds the remaining seven SHACL Core constraints — the string-length bounds, exclusive/inclusive numeric ranges, and sh:xone — plus the complex path expressions (sh:inversePath, sh:alternativePath, sequence paths, *, +, ?) that real-world Schema.org and SHACL-AF schemas depend on. On the reasoning side, five missing OWL 2 RL rules close the gap with the W3C OWL 2 RL profile. SPARQL 1.1 Update gains its three missing operations (MOVE, COPY, ADD). The SPARQL-star variable-inside-quoted-triple pattern finally returns rows instead of silently empty results. This release also delivers the operational hardening items deferred from v0.47.0.

Effort estimate: 6–8 person-weeks

Deliverables

  • Remaining SHACL Core constraints (S4-5)

    • sh:minLength / sh:maxLength: apply to string-typed literals after language-tag stripping
    • sh:xone: exactly one of the given sub-shapes must be satisfied (XOR logic over the existing sh:or / sh:not primitives)
    • sh:minExclusive / sh:maxExclusive / sh:minInclusive / sh:maxInclusive: XSD-typed numeric comparison; reuse the ordering logic from sh:lessThan / sh:lessThanOrEquals
    • Target: full SHACL Core constraint coverage (35/35); W3C SHACL Core test suite must pass completely
  • Complex sh:path expressions (S4-6)

    • sh:inversePath: query (o, s) instead of (s, o) on the VP table
    • sh:alternativePath: union of multiple sub-paths
    • Sequence paths ((sh:path (ex:a ex:b))): chained joins
    • sh:zeroOrMorePath, sh:oneOrMorePath, sh:zeroOrOnePath: compile to WITH RECURSIVE … CYCLE CTEs, reusing the SPARQL property-path compiler from src/sparql/property_path.rs
    • Drop the TODO placeholder in src/shacl/constraints/property_path.rs
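The recursive-path compilation computes a reachability fixpoint; the WITH RECURSIVE … CYCLE CTE is the SQL expression of the same idea. An in-memory model of sh:zeroOrMorePath over a single predicate, with node IDs standing in for dictionary SIDs:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// sh:zeroOrMorePath semantics: every node reachable from `start` in zero or
/// more hops along one predicate's edges. The CYCLE clause analogue is the
/// visited-set check, which guarantees termination on cyclic graphs.
fn zero_or_more_path(edges: &HashMap<u32, Vec<u32>>, start: u32) -> HashSet<u32> {
    let mut reached = HashSet::from([start]); // zero hops: the node itself
    let mut queue: VecDeque<u32> = VecDeque::new();
    queue.push_back(start);
    while let Some(n) = queue.pop_front() {
        for &next in edges.get(&n).into_iter().flatten() {
            if reached.insert(next) {
                queue.push_back(next); // visit each node exactly once
            }
        }
    }
    reached
}
```

sh:oneOrMorePath is the same closure minus the zero-hop seed, and sh:zeroOrOnePath is the start node plus its direct successors.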
  • SHACL violation report enhancements (S4-7, S4-8)

    • Extend Violation struct with sh_value (the offending value node, decoded) and sh_source_constraint_component (W3C constraint component IRI, e.g. sh:MinCountConstraintComponent)
    • For sh:rule triples (SHACL-AF): emit a PT4xx WARNING if rules are detected but SHACL-AF compilation is not yet implemented; never silently drop the rule
  • OWL 2 RL rule set completion (S3-1)

    • cax-sco: full rdfs:subClassOf transitive closure (currently single-step only)
    • prp-spo1: rdfs:subPropertyOf chain (current binary case → full chain)
    • prp-ifp: inverse-functional-property derived owl:sameAs propagation
    • cls-avf: chained owl:allValuesFrom interaction with subclass hierarchy
    • owl:minCardinality, owl:maxCardinality, owl:cardinality entailment rules
    • Target: W3C OWL 2 RL CI suite ≥ 95% pass rate (upgrading the gate from informational to required)
  • SPARQL Update: MOVE, COPY, ADD (S2-2)

    • ADD: INSERT { GRAPH target { ?s ?p ?o } } WHERE { GRAPH source { ?s ?p ?o } } (source preserved)
    • COPY: CLEAR target + ADD
    • MOVE: COPY + DROP source
    • Wire into src/sparql/mod.rs Update arm; add pg_regress tests for all three operations
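The composition of the three operations is easy to state over a toy graph store (named graph → set of triples); this models the Update semantics above, not pg_ripple's storage. `mv` stands in for MOVE since `move` is a Rust keyword:

```rust
use std::collections::{BTreeSet, HashMap};

type Triple = (String, String, String);
type Store = HashMap<String, BTreeSet<Triple>>;

/// ADD: copy all triples of `src` into `dst`, leaving `src` intact.
fn add(store: &mut Store, src: &str, dst: &str) {
    let triples = store.get(src).cloned().unwrap_or_default();
    store.entry(dst.to_string()).or_default().extend(triples);
}

/// COPY: CLEAR the target, then ADD.
fn copy(store: &mut Store, src: &str, dst: &str) {
    store.insert(dst.to_string(), BTreeSet::new());
    add(store, src, dst);
}

/// MOVE: COPY, then DROP the source graph.
fn mv(store: &mut Store, src: &str, dst: &str) {
    copy(store, src, dst);
    store.remove(src);
}
```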
  • SPARQL-star variable-inside-quoted-triple patterns (S2-1)

    • Convert the current silent FALSE emission into a proper dictionary join on qt_s, qt_p, qt_o columns already present in _pg_ripple.dictionary
    • Patterns like << ?s ?p ?o >> :assertedBy ?who return rows
    • Add pg_regress tests rdfstar_variable_quoted.sql
  • Performance baselines and benchmarks (S7-2, S7-3)

    • Record per-query p50/p95/p99 latency for all 32 WatDiv templates in tests/watdiv/baselines.json; CI warning gate on > 10% regression
    • Add benchmarks/merge_throughput.sql: 5-minute pgbench script with N writers + merge_workers ∈ {1, 2, 4, 8}; document the scaling curve
  • Operational hardening (S1-1, S1-3, S1-4, S1-5, S2-4, S2-6, S3-4, S6-3, S7-4, S7-5, S9-4, S9-6, S10-2, S10-3, S10-5)

    • HTAP merge cutover: add a concurrent-merge regression test (50 parallel SPARQL queries during a forced merge cycle; assert zero "relation does not exist" errors) (S1-1)
    • Merge worker backoff: replace std::thread::sleep with BackgroundWorker::wait_latch (S1-3)
    • Add source column integrity pg_regress test (S1-4)
    • Predicate-OID cache: add CacheRegisterRelcacheCallback hook (S1-5)
    • Add pg_ripple.federation_max_response_bytes GUC (default 100 MiB); refuse responses exceeding it with PT543 (S2-4)
    • CONSTRUCT RDF-star: emit << s p o >> notation for ground quoted triples in CONSTRUCT output (S2-6)
    • SAVEPOINT helper: either wire execute_with_savepoint() into the parallel-strata path or gate with #[cfg(test)] (S3-4)
    • pg_dump / restore round-trip test (tests/pg_dump_restore.sh) (S6-3)
    • Add pg_ripple.insert_triples(TEXT[][]) SRF for batch single-triple inserts from orchestration tools (S7-4)
    • HNSW vs IVFFlat benchmark and documentation (S7-5)
    • Mermaid architecture diagram in docs/src/reference/architecture.md (S9-4)
    • Migration script headers lint (scripts/check_migration_headers.sh) (S9-6)
    • release-please-style release automation workflow (S10-2)
    • docs/src/operations/pg-upgrade.md with supported upgrade matrix and pre-upgrade steps (S10-3)
    • Extend migration-chain test to load a representative data batch after the v0.1.0 install and verify data survives through v0.48.0 (S10-5)

New GUC Parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.federation_max_response_bytes | integer | 104857600 | Maximum federation response body in bytes (100 MiB); PT543 on violation |

New Error Codes

| Code | Severity | Message |
|---|---|---|
| PT543 | ERROR | Federation response exceeded federation_max_response_bytes limit |

Migration Script

sql/pg_ripple--0.47.0--0.48.0.sql — no schema changes. Comment header describing SHACL Core completion, OWL 2 RL rule additions, and SPARQL Update completions.

Documentation

  • reference/shacl-reference.md — all 35 SHACL Core constraints marked implemented; complex path expressions documented with examples
  • reference/owl2rl-results.md — pass rate updated to reflect ≥ 95% required gate
  • user-guide/best-practices/sparql-update.md — MOVE, COPY, ADD examples
  • user-guide/rdf-star.md — variable-inside-quoted-triple patterns documented
  • operations/pg-upgrade.md — new page with supported upgrade matrix
  • Release notes for v0.48.0

Exit Criteria

W3C SHACL Core test suite passes 35/35 constraints. OWL 2 RL CI gate upgraded to required at ≥ 95%. All three SPARQL Update operations (MOVE, COPY, ADD) pass the W3C SPARQL 1.1 Update test suite entries for those operations. SPARQL-star variable patterns return correct rows. WatDiv latency baselines recorded and regression gate active. pg_upgrade compatibility document published. pg_dump / restore round-trip test passes. Migration chain test passes through v0.48.0.


v0.49.0 — AI & LLM Integration

Theme: Natural-language query generation and embedding-based entity alignment.

In plain language: Two high-leverage AI features: a function that takes plain English and returns a SPARQL query (using any configured LLM endpoint — Ollama, OpenAI, Claude, or a self-hosted model); and a function that uses the existing vector embeddings to surface candidate owl:sameAs pairs — entities that might be the same thing expressed differently. Both build on infrastructure already in place (the SPARQL engine and the v0.27.0 pgvector integration) and require no new storage schema changes.

Effort estimate: 4–6 person-weeks

Deliverables

  • NL → SPARQL via LLM function calling (Feature C-1)

    • New module src/llm/mod.rs; new SQL function pg_ripple.sparql_from_nl(question TEXT) RETURNS TEXT
    • Calls a configured LLM endpoint with the schema VoID description as context; returns a SPARQL SELECT query string
    • GUCs: pg_ripple.llm_endpoint (TEXT, default '' = disabled), pg_ripple.llm_model (TEXT, default gpt-4o), pg_ripple.llm_api_key_env (TEXT, name of the env var holding the key — never stored inline)
    • Optional few-shot examples loaded from _pg_ripple.llm_examples (question TEXT, sparql TEXT); seeded via pg_ripple.add_llm_example(question TEXT, sparql TEXT)
    • SHACL shapes included as additional semantic context when pg_ripple.llm_include_shapes = on (bool GUC, default on)
    • Error codes: PT700 (LLM endpoint unreachable), PT701 (LLM returned non-SPARQL output), PT702 (generated SPARQL failed to parse)
    • pg_regress tests run with a mock HTTP server returning a canned SPARQL response
  • Embedding-based owl:sameAs candidate generation (Feature C-2)

    • New SQL function pg_ripple.suggest_sameas(threshold REAL DEFAULT 0.9) RETURNS TABLE(s1 TEXT, s2 TEXT, similarity REAL)
    • Runs an HNSW self-join on the embedding column in _pg_ripple.entities; returns pairs whose cosine similarity exceeds threshold
    • Companion pg_ripple.apply_sameas_candidates(min_similarity REAL DEFAULT 0.95) inserts accepted pairs as owl:sameAs triples and triggers cluster merging
    • Respects pg_ripple.sameas_max_cluster_size (PT550) bound
    • Example: examples/embedding_alignment.sql — load two datasets with overlapping entities, run suggest_sameas, inspect candidates, apply with apply_sameas_candidates
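The candidate generation reduces to a thresholded similarity self-join. A brute-force stand-in for the HNSW index — quadratic, but semantically the same result set; the entity names and vectors are illustrative, and the real function runs inside PostgreSQL:

```rust
/// Cosine similarity of two embedding vectors (0.0 when either is zero).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Every entity pair whose similarity exceeds `threshold`, as
/// (s1, s2, similarity) — the shape of the SRF described above.
fn suggest_sameas(
    entities: &[(&str, Vec<f32>)],
    threshold: f32,
) -> Vec<(String, String, f32)> {
    let mut out = Vec::new();
    for i in 0..entities.len() {
        for j in i + 1..entities.len() {
            let sim = cosine(&entities[i].1, &entities[j].1);
            if sim > threshold {
                out.push((entities[i].0.to_string(), entities[j].0.to_string(), sim));
            }
        }
    }
    out
}
```

The HNSW self-join replaces the inner loop with an approximate nearest-neighbour probe, trading exactness for scalability.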

New GUC Parameters

| GUC | Type | Default | Description |
|---|---|---|---|
| pg_ripple.llm_endpoint | string | '' | LLM API base URL (empty = NL→SPARQL disabled) |
| pg_ripple.llm_model | string | gpt-4o | LLM model identifier |
| pg_ripple.llm_api_key_env | string | PG_RIPPLE_LLM_API_KEY | Name of the environment variable holding the LLM API key |
| pg_ripple.llm_include_shapes | bool | on | Include SHACL shapes as LLM context when generating SPARQL |

New Error Codes

| Code | Severity | Message |
|---|---|---|
| PT700 | ERROR | LLM endpoint unreachable or returned HTTP error |
| PT701 | ERROR | LLM response did not contain a valid SPARQL query |
| PT702 | ERROR | LLM-generated SPARQL query failed to parse |

Migration Script

sql/pg_ripple--0.48.0--0.49.0.sql — adds _pg_ripple.llm_examples (question TEXT, sparql TEXT) table.

Documentation

  • user-guide/nl-to-sparql.md — new page: configuring the LLM endpoint, running sparql_from_nl, adding few-shot examples, error handling
  • user-guide/entity-alignment.md — new page: suggest_sameas, apply_sameas_candidates, tuning threshold, cluster size limits
  • reference/guc-reference.md — four new GUC parameters
  • reference/error-catalog.md — PT700–PT702
  • Release notes for v0.49.0

Exit Criteria

pg_ripple.sparql_from_nl() returns a parseable SPARQL query against a mock LLM endpoint. pg_ripple.suggest_sameas() returns candidates for two overlapping test datasets with ≥ 90% recall. apply_sameas_candidates() does not exceed sameas_max_cluster_size. All GUC validators pass. PT700–PT702 are triggered by the appropriate error conditions. Migration chain test passes through v0.49.0.


v0.50.0 — Developer Experience & GraphRAG Polish

Theme: VS Code extension, interactive query debugger, and full RAG pipeline.

In plain language: Three developer-facing features that raise the ceiling on how easy it is to work with pg_ripple day-to-day. A VS Code extension brings SPARQL syntax highlighting, one-click query execution against a live endpoint, and SHACL shape linting into the editor. An extended EXPLAIN SPARQL command surfaces the algebra tree, generated SQL, plan-cache status, and per-step row counts as an interactive JSON structure. The RAG pipeline ties together vector recall, SPARQL graph expansion, and LLM context-window assembly into a single SQL function call.

Effort estimate: 5–7 person-weeks

Deliverables

  • VS Code extension (Feature B-2) — separate repository pg-ripple-vscode

    • SPARQL 1.1 syntax highlighting (TextMate grammar)
    • SHACL Turtle syntax highlighting with shape-aware completion
    • Datalog rule syntax highlighting
    • Query runner: execute a SPARQL query against a configured pg_ripple_http endpoint, display results as a table or JSON tree
    • SHACL shape linter: validate a .ttl shapes file by calling pg_ripple.load_shapes() via the HTTP API and surfacing violations inline
    • Configuration: workspace settings for endpoint URL, auth token, and default named graph
    • Published to VS Code Marketplace; linked from README.md and docs
  • SPARQL query debugger (Feature B-3)

    • Extend pg_ripple.explain_sparql(query TEXT) to return JSONB with: algebra tree, generated SQL, plan-cache status (hit / miss / bypass), per-operator estimated rows, per-operator actual rows (when analyze := true)
    • New overload pg_ripple.explain_sparql(query TEXT, analyze BOOL DEFAULT FALSE) RETURNS JSONB
    • VS Code extension renders the JSONB as a collapsible tree with operator annotations
    • pg_regress sparql_explain_analyze.sql: assert the JSONB schema is stable across SELECT, ASK, CONSTRUCT, and DESCRIBE query types
  • RAG pipeline with graph-contextualised embeddings (Feature C-3)

    • New SQL function pg_ripple.rag_context(question TEXT, k INT DEFAULT 10) RETURNS TEXT
    • Step 1: embed question via pg_ripple.embed_text() (from v0.27.0)
    • Step 2: vector recall — top-k entities by HNSW similarity
    • Step 3: SPARQL graph expansion — for each entity, fetch its 1-hop neighbourhood as JSON-LD
    • Step 4: assemble a context string from the JSON-LD fragments, formatted for LLM ingestion
    • Step 5 (optional): if pg_ripple.llm_endpoint is set, call sparql_from_nl() and execute the generated query, appending the result to the context
    • Example: examples/graphrag_rag_pipeline.sql — end-to-end with a Wikipedia-derived knowledge graph

Migration Script

sql/pg_ripple--0.49.0--0.50.0.sql — no schema changes.

Documentation

  • user-guide/vscode-extension.md — installation, configuration, SPARQL query runner, SHACL linter
  • user-guide/explain-sparql.md — EXPLAIN output format, ANALYZE mode, interpreting the algebra tree
  • user-guide/rag-pipeline.md — rag_context() step-by-step, tuning k, combining with NL→SPARQL
  • Release notes for v0.50.0

Exit Criteria

VS Code extension is publishable to the VS Code Marketplace (VSIX builds clean). explain_sparql(query, analyze := true) returns JSONB with algebra, sql, cache_status, and per-operator actual_rows keys for SELECT, ASK, CONSTRUCT, and DESCRIBE queries. rag_context() returns non-empty context for a known question against a pre-loaded test knowledge graph. Migration chain test passes through v0.50.0.


v1.0.0 — Production Release

Theme: Stability, conformance, and production certification.

In plain language: The 1.0 release is not about new features — it's about confidence. We run pg_ripple against the official W3C test suites for SPARQL and SHACL to verify standards compliance. A 72-hour continuous stress test checks for memory leaks and crash recovery. A security audit reviews the code for vulnerabilities. The result is a release that organisations can rely on for production workloads with a clear API stability guarantee: the public interface will not break in future minor versions.

Effort estimate: 6–8 person-weeks

Deliverables

  • SPARQL 1.1 Query conformance
    • Pass W3C SPARQL 1.1 Query test suite (supported subset)
    • Document unsupported features (property functions)
    • Verify conformance via both SQL and HTTP interfaces
    • Federation (SERVICE) covered by v0.16.0
  • SPARQL 1.1 Update conformance
    • Pass W3C SPARQL 1.1 Update test suite (supported subset)
    • Document unsupported features
  • SHACL Core conformance
    • Pass the full W3C SHACL Core test suite
    • Any optimization strategy must preserve the same externally visible results as the reference semantics
  • Stability hardening
    • 72-hour continuous load test (mixed read/write)
    • Memory leak detection (Valgrind via cargo pgrx test --valgrind)
    • Crash recovery testing (kill -9 during merge, reload, verify)
  • Security audit
    • Review all SPI query generation for injection vectors
    • Review shared memory usage for race conditions
    • Review dictionary cache for timing side-channels
  • API stability guarantee
    • All pg_ripple.* SQL functions considered stable API
    • _pg_ripple.* internal schema reserved for internal use
    • Semantic versioning contract: breaking changes only in major versions
  • Final benchmarks
    • BSBM at 100M triples
    • Published performance report
  • Release artifacts
    • Tagged release on GitHub
    • Published to PGXN
    • crates.io publication (library crate)

Documentation

See plans/documentation.md for details. The 1.0.0 documentation milestone is a full audit: every page verified, every example tested against the release, no unresolved stubs.

  • Final audit of all docs pages — every code example verified against 1.0.0, all TODO / stub markers resolved
  • user-guide/upgrading.md complete — upgrade procedure from every 0.x version to 1.0.0; migration script inventory
  • reference/error-reference.md complete — all PT001–PT799 codes documented
  • reference/faq.md final pass — 20–30 questions covering all features
  • reference/troubleshooting.md final pass — complete runbook for every subsystem
  • All research/ section mirrors complete

Exit Criteria

Stable, tested, documented, and published. Ready for production workloads up to 100M+ triples on a single node.


Post-1.0 Horizon

In plain language: These are future directions that extend pg_ripple beyond its initial scope. Each addresses a specific real-world need — from distributing data across multiple servers, to geographic queries, to bridging with existing relational databases. They are listed roughly in order of anticipated demand; some may be reordered or combined based on community feedback after 1.0.

v1.6 Cypher/GQL has a dedicated exploratory analysis in plans/cypher/. The core finding: VP tables already encode all LPG structural elements; a standalone cypher-algebra crate (openCypher + GQL grammar, unified SQL-emitting algebra IR) is the correct architecture. Full write support requires v0.4.0 (RDF-star) for edge properties — already available. Gremlin is explicitly out of scope.

| Version | Theme | What it delivers | Key Technical Features |
|---|---|---|---|
| 1.1 | Distributed | Spread data across multiple servers for horizontal scale | Citus integration, subject-based sharding |
| 1.2 | Temporal | Track how data changes over time; query historical states | Bitstring versioning, TimescaleDB integration |
| 1.4 | Extended VP | Automatically pre-compute shortcuts for frequent query patterns | Automated workload-driven ExtVP stream tables (pg_trickle), ontology change propagation DAG |
| 1.5 | Interop | Bridge to GraphQL APIs and expose LPG views for visualization tools | GraphQL-to-SPARQL auto-generation from SHACL shapes, stable LPG view layer for visualization tooling |
| 1.6 | Cypher / GQL | Query and write data using the industry-standard graph query languages | cypher-algebra standalone crate (openCypher + GQL grammar, same IR); pg_ripple.cypher() SQL function; CREATE, MERGE, SET, DELETE via VP write path; openCypher TCK ≥80%; edge properties available since v0.4.0 (RDF-star) |
| 1.7 | GeoSPARQL + PostGIS | Answer geographic questions ("find all hospitals within 5 km of this point") | geo:asWKT literal type backed by PostGIS geometry, spatial FILTER functions, R-tree index on spatial VP tables |
| 1.8 | R2RML Virtual Graphs | Expose existing database tables as if they were RDF data — no migration needed | W3C R2RML mappings, SPARQL queries transparently join VP tables with mapped SQL tables |
| 1.9 | Quad-Level Provenance | Track where each fact came from and when it was added | Per-quad metadata table with source, timestamp, and transaction ID; integration with Datalog rule provenance (why-provenance) |

Version Timeline (Estimated Cadence)

In plain language: The "Calendar" column shows how long after the previous release each version is expected to ship. The "Effort" column shows the total developer-time required. With two developers working together, the calendar durations are achievable; a solo developer should expect roughly double the calendar time.

| Version | Calendar (pair) | Effort (person-weeks) | Cumulative effort |
|---|---|---|---|
| 0.1.0 | Week 0 (start) | 6–8 pw | 6–8 pw |
| 0.2.0 | +4 weeks | 6–8 pw | 12–16 pw |
| 0.3.0 | +4 weeks | 6–8 pw | 18–24 pw |
| 0.4.0 | +5 weeks | 8–10 pw | 26–34 pw |
| 0.5.0 | +3 weeks | 6–8 pw | 32–42 pw |
| 0.5.1 | +3 weeks | 6–8 pw | 38–50 pw |
| 0.6.0 | +4 weeks | 8–10 pw | 46–60 pw |
| 0.7.0 | +3 weeks | 4–6 pw | 50–66 pw |
| 0.8.0 | +3 weeks | 4–6 pw | 54–72 pw |
| 0.9.0 | +2 weeks | 3–4 pw | 57–76 pw |
| 0.10.0 | +5 weeks | 10–12 pw | 67–88 pw |
| 0.11.0 | +3 weeks | 5–7 pw | 72–95 pw |
| 0.12.0 | +2 weeks | 3–4 pw | 75–99 pw |
| 0.13.0 | +4 weeks | 6–8 pw | 81–107 pw |
| 0.14.0 | +3 weeks | 4–6 pw | 85–113 pw |
| 0.15.0 | +2 weeks | 3–4 pw | 88–117 pw |
| 0.16.0 | +3 weeks | 4–6 pw | 92–123 pw |
| 0.19.0 | +3 weeks | 3–5 pw | 95–128 pw |
| 0.20.0 | +3 weeks | 5–7 pw | 100–135 pw |
| 0.45.0 | +3 weeks | 4–6 pw | 104–141 pw |
| 0.46.0 | +4 weeks | 5–7 pw | 109–148 pw |
| 0.47.0 | +5 weeks | 8–10 pw | 117–158 pw |
| 0.48.0 | +4 weeks | 6–8 pw | 123–166 pw |
| 0.49.0 | +3 weeks | 4–6 pw | 127–172 pw |
| 0.50.0 | +4 weeks | 5–7 pw | 132–179 pw |
| 1.0.0 | +4 weeks | 6–8 pw | 138–187 pw |
| 1.1–1.9 | Post-1.0 | Community-driven | |

Estimates assume a pair of focused developers with Rust and PostgreSQL experience. "pw" = person-weeks. Calendar durations assume pair programming; a solo developer should expect roughly double the calendar time. Actual pace depends on contributor availability and scope adjustments discovered during implementation.

Contributing

Thank you for your interest in contributing to pg_ripple. This guide covers environment setup, testing, code conventions, and the pull request workflow.

Contribute

pg_ripple is open source and welcomes contributions of all kinds — bug reports, documentation fixes, test cases, and feature implementations. If you are unsure whether an idea fits, open a GitHub issue to discuss it before writing code.


Development Environment

Prerequisites

| Tool | Version | Purpose |
|---|---|---|
| Rust | Edition 2024, stable toolchain | Language |
| PostgreSQL | 18.x | Target database |
| pgrx | 0.17 | PostgreSQL extension framework |
| cargo-pgrx | 0.17 | Build and test tooling |
| git | 2.x+ | Version control |

Setup

# 1. Clone the repository
git clone https://github.com/your-org/pg_ripple.git
cd pg_ripple

# 2. Install cargo-pgrx if not already installed
cargo install cargo-pgrx --version 0.17 --locked

# 3. Initialize pgrx with PostgreSQL 18
cargo pgrx init --pg18 $(which pg_config)

# 4. Verify the build
cargo build

macOS

On macOS, install PostgreSQL 18 via Homebrew: brew install postgresql@18. Ensure pg_config is on your PATH.


Running Tests

pg_ripple uses three levels of testing:

Unit and integration tests (pgrx)

Runs Rust tests inside a temporary PostgreSQL instance:

cargo pgrx test pg18

This starts a temporary PG18 cluster, installs the extension, runs all #[pg_test] functions, and tears down the cluster.

Regression tests (pg_regress)

Runs SQL-based regression tests that compare expected output:

cargo pgrx regress pg18

The test SQL files live in sql/ and expected output in expected/. If you add a new SQL function, add a regression test for it.

Migration chain test

Verifies that all migration scripts (sql/pg_ripple--X.Y.Z--X.Y.Z+1.sql) can be applied in sequence:

# Requires pgrx PG18 running
cargo pgrx start pg18
bash tests/test_migration_chain.sh

Running a subset of tests

# Run a single test by name
cargo pgrx test pg18 -- test_name_pattern

# Run tests with output visible
cargo pgrx test pg18 -- --nocapture

Code Conventions

These conventions are enforced by CI and code review.

Safe Rust only

All code must be safe Rust. unsafe is permitted only at required FFI boundaries (pgrx macros, shared memory access) and must include a // SAFETY: comment explaining why it is correct.
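As an illustration of the convention, here is a minimal sketch (the function name and scenario are hypothetical, not from the pg_ripple codebase) showing an unsafe block paired with the required // SAFETY: comment:

```rust
/// Hypothetical example of the `// SAFETY:` convention: the comment must
/// state the invariant that makes the unsafe operation sound.
fn first_byte(v: &[u8]) -> Option<u8> {
    if v.is_empty() {
        return None;
    }
    // SAFETY: the emptiness check above guarantees index 0 is in bounds,
    // so the unchecked access cannot read past the end of the slice.
    Some(unsafe { *v.get_unchecked(0) })
}
```

A reviewer should be able to verify the unsafe block by reading the SAFETY comment alone, without tracing the whole call graph.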

SQL function exposure

Expose SQL functions via the #[pg_extern] attribute. Never write raw PG_FUNCTION_INFO_V1 C macros.

#[pg_extern]
fn my_function(input: &str) -> String {
    input.to_uppercase() // example implementation
}

SPI for all internal SQL

Use pgrx::SpiClient for all SQL executed inside extension code. Never use raw libpq or string-based query execution.

Spi::connect(|client| {
    client.select("SELECT count(*) FROM _pg_ripple.dictionary", None, None)?;
    Ok(())
})?;

Integer joins everywhere

SPARQL-to-SQL translation must encode all bound terms to i64 before generating SQL. A generated VP table query that contains a string comparison is a bug.
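A minimal sketch of the idea, using a plain in-memory map as a stand-in for the dictionary (the table name vp_foaf_knows and the helper names are illustrative, not the real translator API):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the dictionary: term -> i64 id.
/// In pg_ripple the real mapping lives in _pg_ripple.dictionary.
fn encode(dict: &HashMap<&str, i64>, term: &str) -> Option<i64> {
    dict.get(term).copied()
}

/// Sketch: a bound subject is encoded first, so the emitted SQL
/// compares integer columns only — never term strings.
fn vp_filter(dict: &HashMap<&str, i64>, subject: &str) -> Option<String> {
    // An unknown term cannot match anything: return None instead of
    // emitting a query with a string comparison.
    let s = encode(dict, subject)?;
    Some(format!("SELECT o FROM vp_foaf_knows WHERE s = {s}"))
}
```

Encoding before SQL generation also gives an early exit: if a bound term is not in the dictionary, the pattern is known-empty and no query needs to run at all.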

No dynamic SQL string concatenation for table names

Always look up the VP table OID in _pg_ripple.predicates and use format!-style quoting with proper escaping. Never interpolate user input into table names.
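For illustration, a minimal sketch of PostgreSQL-style identifier quoting (real code should prefer pgrx's quoting helpers or the OID lookup described above; this hand-rolled helper is only to show the escaping rule):

```rust
/// Minimal sketch of PostgreSQL identifier quoting: wrap in double
/// quotes and double any embedded double-quote characters.
fn quote_ident(ident: &str) -> String {
    format!("\"{}\"", ident.replace('"', "\"\""))
}
```

Even with quoting available, resolving the table through its OID in _pg_ripple.predicates is safer, because it guarantees the query can only ever reference a table the extension created.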

Error messages

Follow PostgreSQL style: lowercase first word, no trailing period.

// Good
return Err(pg_ripple_error!("dictionary encode failed: hash collision detected"));

// Bad
return Err(pg_ripple_error!("Dictionary encode failed: hash collision detected."));

Batch dictionary operations

Use ON CONFLICT DO NOTHING … RETURNING for all batch inserts into the dictionary. Never use a SELECT-then-INSERT pattern.
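A sketch of how such a statement might be assembled for a batch of n new terms (the function name and parameter-placeholder scheme are illustrative; real code binds the term values through SPI parameters rather than interpolating them):

```rust
/// Sketch: build one multi-row insert for a batch of new terms,
/// using $1..$n parameter placeholders to be bound via SPI.
fn batch_insert_sql(n_terms: usize) -> String {
    let placeholders: Vec<String> = (1..=n_terms).map(|i| format!("(${i})")).collect();
    format!(
        "INSERT INTO _pg_ripple.dictionary (term) VALUES {} \
         ON CONFLICT DO NOTHING RETURNING id, term",
        placeholders.join(", ")
    )
}
```

One PostgreSQL caveat worth remembering: ON CONFLICT DO NOTHING … RETURNING returns only the rows that were actually inserted, so ids for terms that already existed must be fetched with a follow-up SELECT.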


Project Structure

src/
├── lib.rs              # Entry points, _PG_init, GUC parameters
├── dictionary/         # IRI/blank-node/literal → i64 encoder
├── storage/            # VP tables, HTAP delta/main, merge worker
├── sparql/             # SPARQL → algebra → SQL → SPI
├── datalog/            # Datalog parser, stratifier, SQL compiler
├── shacl/              # SHACL shapes → DDL constraints + validation
├── export/             # Turtle / N-Triples / JSON-LD serialization
├── stats/              # Monitoring, pg_stat_statements integration
└── admin/              # Vacuum, reindex, prefix registry

sql/                    # Migration scripts and regression test SQL
tests/                  # Shell-based integration tests
docs/                   # mdBook documentation site

Pull Request Workflow

Branch policy

  • Create new branches from main only: switch back to main (and update it) before branching, rather than stacking a branch on top of another feature branch.
  • Use descriptive branch names: feat/sparql-lateral, fix/dictionary-collision, docs/glossary.

Before opening a PR

  1. Run all tests and ensure they pass:
cargo pgrx test pg18
cargo pgrx regress pg18
  2. Run clippy with no warnings:
cargo clippy --all-targets -- -D warnings
  3. Verify formatting:
cargo fmt --check
  4. Update documentation if you changed any SQL function signatures or added new functions.
  5. Create or update migration scripts if the release version changed (see below).

Commit messages

  • Use present tense: "add lateral join support" not "added lateral join support"
  • Group discrete changes into separate commits
  • Reference issue numbers when applicable: "fix dictionary collision (#42)"

Migration scripts

Every release requires a migration script (sql/pg_ripple--X.Y.Z--X.Y.Z+1.sql), even if it only contains comments. See the Release Process for the full checklist.


Documentation Contributions

The documentation site uses mdBook with the mdbook-admonish plugin for callout boxes.

Building the docs locally

# Install mdbook and plugins
cargo install mdbook mdbook-admonish

# Build and serve
cd docs
mdbook serve --open

Callout syntax

Use fenced code blocks with admonish for callout boxes:

```admonish tip title="Performance"
Use `load_ntriples_file()` for large datasets — it is 10× faster than string loading.
```

```admonish warning
This operation cannot be undone.
```

```admonish note
Available since v0.16.0.
```

Adding a new page

  1. Create the Markdown file in the appropriate docs/src/ subdirectory.
  2. Add the page to docs/src/SUMMARY.md.
  3. Run mdbook build to verify it compiles.

Property-Based Testing (v0.46.0)

pg_ripple uses proptest for randomised property-based tests that assert algebraic invariants. These tests run entirely in pure Rust — no database connection required.

Running proptest suites

# Run all property-based tests
cargo test --test proptest_suite

# Run with more cases (default: 256)
PROPTEST_CASES=10000 cargo test --test proptest_suite

# Run a specific suite
cargo test --test proptest_suite sparql_roundtrip
cargo test --test proptest_suite dictionary
cargo test --test proptest_suite jsonld_framing

Adding a new property test

  1. Add your test to the appropriate file in tests/proptest/:

    • SPARQL translator invariants → sparql_roundtrip.rs
    • Dictionary encoder invariants → dictionary.rs
    • JSON-LD framing invariants → jsonld_framing.rs
    • New domain → create tests/proptest/<domain>.rs and add mod <domain>; to tests/proptest_suite.rs
  2. Use proptest! macros for property tests; regular #[test] for deterministic fixtures.

  3. Run the suite with PROPTEST_CASES=10000 to verify 10,000 cases pass.
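To make the style of invariant concrete, here is a hand-rolled illustration of the dictionary roundtrip property (decode(encode(t)) == t). The Dict type is a toy stand-in, not the real encoder; in the actual suite, proptest generates the term lists instead of this fixed-input helper:

```rust
use std::collections::HashMap;

/// Toy encoder: interns terms and hands out sequential i64 ids.
#[derive(Default)]
struct Dict {
    by_term: HashMap<String, i64>,
    by_id: Vec<String>,
}

impl Dict {
    fn encode(&mut self, term: &str) -> i64 {
        if let Some(&id) = self.by_term.get(term) {
            return id; // already interned: same term, same id
        }
        let id = self.by_id.len() as i64;
        self.by_term.insert(term.to_string(), id);
        self.by_id.push(term.to_string());
        id
    }

    fn decode(&self, id: i64) -> Option<&str> {
        self.by_id.get(id as usize).map(String::as_str)
    }
}

/// Property: decoding every encoded term yields the original term,
/// regardless of duplicates or insertion order.
fn roundtrip_holds(terms: &[&str]) -> bool {
    let mut d = Dict::default();
    let ids: Vec<i64> = terms.iter().map(|t| d.encode(t)).collect();
    terms.iter().zip(ids).all(|(t, id)| d.decode(id) == Some(*t))
}
```

In a real proptest suite, the body of roundtrip_holds becomes the property and a prop::collection::vec strategy supplies the input lists.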

Debugging a proptest failure

When a test fails, proptest shrinks the input and prints the minimal failing case, and records the failing seed in a proptest-regressions/ persistence file next to the test source. Re-running the test replays the recorded case before generating new ones, so the failure reproduces deterministically; commit the regression file so CI replays it as well.

Fuzz Testing (v0.46.0)

pg_ripple uses cargo-fuzz to test the federation result decoder against arbitrary byte sequences.

Running the fuzz target

# Install cargo-fuzz
cargo install cargo-fuzz

# Run for 10 minutes
cargo fuzz run federation_result -- -max_total_time=600

# Run indefinitely
cargo fuzz run federation_result

# Minimise a crashing corpus entry
cargo fuzz tmin federation_result artifacts/federation_result/crash-<hash>

Adding a new fuzz target

  1. Create fuzz/fuzz_targets/<target_name>.rs with the fuzz target function.
  2. Add a [[bin]] entry to fuzz/Cargo.toml.
  3. Add the target to the fuzz-<target_name> CI job in .github/workflows/ci.yml.

Fuzz target contract

Every fuzz target must:

  • Use #![no_main] and libfuzzer_sys::fuzz_target!
  • Never panic regardless of input (panics are treated as fuzz failures)
  • Return Err(...) for invalid input, never crash
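The contract means the decoding function itself must be total over arbitrary bytes. A hypothetical sketch (decode_header and its 4-byte format are illustrative, not the real federation decoder) of a function that satisfies it:

```rust
/// Hypothetical decoder shape satisfying the fuzz contract: every input
/// either parses or yields Err — no panics, no indexing, no unwrap.
fn decode_header(input: &[u8]) -> Result<u32, String> {
    let bytes: [u8; 4] = input
        .get(..4) // bounds-checked slice access instead of input[..4]
        .ok_or("input shorter than 4-byte header")?
        .try_into()
        .map_err(|_| "header slice conversion failed")?;
    Ok(u32::from_le_bytes(bytes))
}
```

The fuzz target then simply feeds the raw bytes through and discards the result; any panic libFuzzer observes is a bug in the decoder, not in the harness.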

Reporting Issues

When filing a bug report, please include:

  • pg_ripple version: SELECT pg_ripple.canary(); and the output of \dx pg_ripple
  • PostgreSQL version: SELECT version();
  • Minimal reproducer: the smallest SQL script that triggers the issue
  • Full error output: use \errverbose in psql for detailed error context
  • Platform: OS and architecture

Security issues

If you discover a security vulnerability, please report it privately via GitHub Security Advisories rather than opening a public issue.