The Ontology Pipeline
Turn unstructured documents into proven, queryable, provenance-tracked knowledge,
not plausible LLM extractions. This is the implemented pattern. The decisions
behind it are recorded in ARC-ADR-030 / 032 / 033 / 019 / 016, and the broader
vision is the Labs north-star (obsidian/labs/AgentArmyLabs/Ontology-Pipeline.md).
The pattern in one line
The LLM proposes; the formal layer disposes. A concept reaches the canonical graph only when it is proven — schema-valid, anti-pattern-free, reasoner-consistent across a dual gUFO+BFO grounding, and SHACL-conformant — and every step is traceable back to its source. Snapped, not plausible.
What a document becomes: four substrates
document (.docx/.pdf/.txt)
 ├─ ingest ───────────────▶ VECTOR INDEX (ArcadeDB LSM_VECTOR, Cohere embed-v-4-0)
 └─ sift (Cerebras) ─┬────▶ HOLOGRAPHIC LPG (ArcadeDB: every candidate + lifecycle state)
                     ├────▶ MID-LEVEL MAP (proven concepts aligned to business-mid.ttl)
                     └────▶ CANONICAL RDF (Fuseki: only the PROVEN, + PROV-O lineage)
| Substrate | Store | Holds | Role |
|---|---|---|---|
| Vector index | ArcadeDB LSM_VECTOR | doc chunks + 1536-d embeddings | semantic retrieval (RAG) |
| Holographic LPG | ArcadeDB graph | every candidate + per-level results + state | the working/staging graph; quarantine is a state, not a separate store |
| Canonical RDF | Fuseki (TDB2, /knowledge) | only proven gUFO+BFO triples + lineage + mappings | the authoritative knowledge graph |
| Mid-level map | RDF (in the canonical graph) | proven concepts aligned to a shared vocabulary | cross-document integration layer |
The vector index and the ontology graph trace to the same source document, so a
consumer can pivot concept → source span → chunk.
The discipline: propose → sift → snap | quarantine
The sift ladder (ARC-ADR-032):
| Level | Gate | Proves |
|---|---|---|
| L1 | JSON-Schema on the IR fragment | well-formed, valid stereotypes |
| L2 | OntoUML anti-pattern check | role bindings reference declared entities |
| L3 | OWL reasoner closure (owlrl) | gUFO and BFO classifications agree (an Event grounded to a Continuant is inconsistent, not merely odd) |
| L4 | SHACL conformance (pyshacl) | relator under-mediation and other shape rules |
| (prod) | the Fuseki sieve (sieve.sh) | an independent SHACL re-check before promotion |
Only an all-green candidate snaps (projects to canonical RDF). Anything that can't be proven within the repair budget lands in quarantine — retained with its full violation report, never auto-promoted.
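The control flow of the ladder can be sketched in a few lines. This is a minimal illustration, not the code in sift_engine.py: the gate functions, the `repair` callback, and the `repair_budget` default are hypothetical stand-ins. It assumes each gate returns a list of violation strings, that L1 short-circuits, and that the later gates all run and accumulate their findings.

```python
from dataclasses import dataclass, field
from typing import Callable

Fragment = dict  # stand-in for the IR fragment
Gate = Callable[[Fragment], list]  # returns violation strings; empty means green


@dataclass
class Snapped:
    fragment: Fragment


@dataclass
class Quarantined:
    fragment: Fragment
    violations: list = field(default_factory=list)


def run_ladder(fragment, l1, later_gates):
    """Run L1 first; if it is green, run every later gate and collect violations."""
    violations = l1(fragment)
    if violations:
        return violations  # a malformed fragment is not checked further
    collected = []
    for gate in later_gates:
        collected.extend(gate(fragment))
    return collected


def sift(fragment, l1, later_gates, repair, repair_budget=2):
    """Snap only an all-green fragment; otherwise repair within budget, then quarantine."""
    violations = run_ladder(fragment, l1, later_gates)
    for _ in range(repair_budget):
        if not violations:
            break
        fragment = repair(fragment, violations)
        violations = run_ladder(fragment, l1, later_gates)
    if violations:
        return Quarantined(fragment, violations)  # kept with its full report
    return Snapped(fragment)


# Toy gates for illustration only.
def l1_schema(f):
    return [] if "stereotype" in f else ["L1: missing stereotype"]


def l2_roles(f):
    declared = f.get("entities", [])
    return [f"L2: undeclared entity {r}" for r in f.get("roles", []) if r not in declared]


def no_repair(f, violations):
    return f
```

With these toy gates, a fragment whose role points at an undeclared entity ends up `Quarantined` with the L2 message attached, and nothing promotes it afterwards.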
Mid-level mapping and the evidence ladder
A proven concept is still document-local (one doc's "Rumor" ≠ another's). The
mid-level mapper aligns it to a shared vocabulary (business-mid.ttl, ~40
business/epistemic/risk classes) so the fleet integrates across documents.
- Primary signal: embedding cosine in the same Cohere space the RAG index uses (a measurable number), gated by gUFO archetype compatibility: a reified relation may only map to a relator class; an occurrent never to an object.
- Thresholds → skos:exactMatch / skos:closeMatch / rdfs:subClassOf.
- Below the floor → escalate to the gateway model, and its verdict is cited as PROV-O provenance (sift:citedSource "cerebras:zai-glm-4.7"). A decision a measurable signal couldn't make is handed to a model, and the model is named.
Every mapping records its method, cosine, embed model, and any cited source. The "propose / dispose" rule holds here too: the model only picks among cosine-ranked candidates; it cannot invent a class.
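The evidence ladder above can be sketched as a small decision function. The threshold values here are hypothetical (the text names the three outcomes but not the numbers), and the archetype gate is simplified to an equality check; the real compatibility rules in midlevel.py may be richer.

```python
import math

# Hypothetical cut-offs, chosen only to make the sketch runnable.
EXACT_MIN, CLOSE_MIN, FLOOR = 0.90, 0.80, 0.65


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_concept(vector, archetype, mid_classes):
    """Rank archetype-compatible classes by cosine and choose a mapping predicate.

    mid_classes is a list of (class_iri, archetype, vector) tuples.
    """
    ranked = sorted(
        ((cosine(vector, vec), iri) for iri, arch, vec in mid_classes if arch == archetype),
        reverse=True,
    )
    if not ranked:
        return {"decision": "unmapped", "candidates": []}
    score, iri = ranked[0]
    if score >= EXACT_MIN:
        predicate = "skos:exactMatch"
    elif score >= CLOSE_MIN:
        predicate = "skos:closeMatch"
    elif score >= FLOOR:
        predicate = "rdfs:subClassOf"
    else:
        # Below the floor: hand the cosine-ranked candidates to the model.
        # The model may only pick from this list; it cannot invent a class.
        return {"decision": "escalate", "candidates": [c for _, c in ranked]}
    return {"decision": "mapped", "predicate": predicate, "target": iri,
            "cosine": score, "method": "cosine"}
```

Returning the ranked candidate list on escalation is what keeps the "propose / dispose" rule intact in this sketch: the model's choice is constrained to classes the measurable signal already surfaced.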
Functional core, imperative shell
The same project + sift logic exists in two forms (ARC-ADR-033):
- Imperative shell: Python (backend-core). All the IO: HTTP, Cerebras gateway, Cohere embeddings, ArcadeDB, Fuseki. The live /api/v1/ontology/* routes.
- Functional core: F# (tools/ontology-sift/fsharp). The provable transformations. The category theory pays rent here: the IR is a coproduct (discriminated union), project is a functor (a catamorphism folding the fragment to triples), the ladder is a Result monad (L1 short-circuit) into a Validation applicative (L2–L4 accumulate), and the outcome is the coproduct Snapped | Quarantined. With FS0025-as-error, adding a stereotype case fails the build until every projection handles it: "snapped, not plausible" enforced by the compiler, not by tests. See tools/ontology-sift/fsharp/README.md.
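To make the coproduct-and-fold idea concrete, here is a rough Python analogue of a projection over a tagged union. The three node types and the emitted triples are illustrative assumptions, not the real IR or the real output of project. Python only offers a runtime `TypeError` for an unhandled case in this sketch, whereas the F# core described above gets the same check at build time.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical IR cases; the real stereotype set lives in the IR schema.
@dataclass(frozen=True)
class Kind:
    name: str


@dataclass(frozen=True)
class Relator:
    name: str
    mediates: tuple


@dataclass(frozen=True)
class Event:
    name: str


Node = Union[Kind, Relator, Event]  # the coproduct: exactly one of these cases


def project_node(node):
    """Map one IR node to (subject, predicate, object) triples, one branch per case."""
    if isinstance(node, Kind):
        return [(node.name, "rdf:type", "gufo:Kind")]
    if isinstance(node, Relator):
        triples = [(node.name, "rdfs:subClassOf", "gufo:Relator")]
        triples += [(node.name, "gufo:mediates", target) for target in node.mediates]
        return triples
    if isinstance(node, Event):
        return [(node.name, "rdfs:subClassOf", "gufo:Event")]
    # A newly added case lands here until a branch handles it.
    raise TypeError(f"unhandled IR case: {type(node).__name__}")


def project(fragment):
    """Fold a fragment (a list of nodes) into one flat list of triples."""
    triples = []
    for node in fragment:
        triples.extend(project_node(node))
    return triples
```

For example, `project([Kind("Person"), Relator("Rumor", ("Person",))])` yields one triple for the kind and two for the relator.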
Provenance, end to end
PROV-O threads the whole chain, so the canonical commit is auditable:
source span ─ wasDerivedFrom ─▶ concept ─ wasGeneratedBy ─▶ propose activity (proposer + prompt hash)
concept ─ skos:exactMatch/closeMatch/subClassOf ─▶ mid:Class
   └─ wasGeneratedBy ─▶ map activity (method, cosine, embed model, citedSource?)
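As a rough illustration of the chain above, the lineage for one concept can be written out as plain string triples. Only `prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, the SKOS/RDFS mapping predicates, and `sift:citedSource` come from PROV-O or from the text; the other `sift:` properties, the activity and mapping node names, and the function itself are hypothetical. In PROV-O terms the derived entity is the subject, so the concept points back at its source span.

```python
import hashlib


def lineage_triples(concept, source_span, proposer, prompt, mapping=None):
    """Return (subject, predicate, object) string triples for one concept."""
    propose_activity = f"{concept}/propose"  # hypothetical node naming
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    triples = [
        (concept, "prov:wasDerivedFrom", source_span),
        (concept, "prov:wasGeneratedBy", propose_activity),
        (propose_activity, "prov:wasAssociatedWith", proposer),
        (propose_activity, "sift:promptHash", prompt_hash),  # hypothetical property
    ]
    if mapping is not None:
        mapping_node = f"{concept}/mapping"  # hypothetical handle for the mapping
        map_activity = f"{concept}/map"
        triples += [
            (concept, mapping["predicate"], mapping["target"]),
            (mapping_node, "prov:wasGeneratedBy", map_activity),
            (map_activity, "sift:method", mapping["method"]),  # hypothetical
            (map_activity, "sift:cosine", str(mapping["cosine"])),  # hypothetical
            (map_activity, "sift:embedModel", mapping["embed_model"]),  # hypothetical
        ]
        if mapping.get("cited_source"):
            # Present only when the decision was escalated to a model.
            triples.append((map_activity, "sift:citedSource", mapping["cited_source"]))
    return triples
```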
Run it
| Want | How |
|---|---|
| See the whole thing on a real doc | backend-core/notebooks/ontology_pipeline_e2e.ipynb → Run All (needs backend-core :8000 + Fuseki :3030) |
| Call the pipeline | POST /api/v1/ontology/pipeline { "source_text", "source_doc", "proposer": "gateway" } |
| Prove the F# core | dotnet run --project tools/ontology-sift/fsharp (parity doctor) |
| Offline, no LLM | tools/ontology-sift/doctor.py (fixture proposer; deterministic) |
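A minimal client sketch for the pipeline route, using only the standard library. The base URL assumes backend-core is listening on localhost:8000 as in the notebook row; the helper names, the timeout, and the assumption that the response body is JSON are illustrative, since the response shape is defined by the OpenAPI contract rather than by this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local backend-core on :8000


def build_pipeline_request(source_text, source_doc, proposer="gateway", base_url=BASE_URL):
    """Build the POST request with the three body fields named in the table."""
    body = json.dumps({
        "source_text": source_text,
        "source_doc": source_doc,
        "proposer": proposer,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/ontology/pipeline",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_pipeline(source_text, source_doc, timeout=120):
    """Send the request and decode the response, assumed here to be JSON."""
    request = build_pipeline_request(source_text, source_doc)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the payload inspectable without a running server; `run_pipeline` needs the live backend.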
Where it lives
- backend-core/app/ontology/: sift_engine.py, proposer.py, midlevel.py, pipeline.py, arcade_schema.py, api.py; discipline/ (IR schema, gUFO/BFO-lite TTLs, SHACL shapes, midlevel/business-mid.ttl).
- tools/ontology-sift/: the Python reference + offline doctor + spikes/ + fsharp/ (the compiler core).
- Contract: backend-core/contracts/backend-core.openapi.json (/api/v1/ontology/*).
Status
Implemented and proven live on the "Three makes a tiger" document (2026-05-29):
57 chunks → vector index; one fragment snapped (all four gates green); 11 concepts
mapped (3 exact / 4 close / 4 escalated + cited zai-glm-4.7); 183 triples in Fuseki.
Shipped in backend-core #130 and hub #336.