ARC-ADR-001: Pub/Sub Broker Selection for Cross-Spoke Eventing
- Status: Proposed
- Date: 2026-05-26
- Deciders: middle-core maintainers
- Consulted: backend-core, frontend-core
- Informed: AgentArmy hub
Context and Problem Statement
Middle-core is the provider-neutral semantic and governance layer between
frontend-core and backend-core. Today every cross-spoke interaction is
synchronous HTTP. As use cases UC-01–UC-16 expand, several flows want
asynchronous fan-out that HTTP request/response cannot model cleanly:
- Knowledge-landed notifications — backend-core finishes an ingest; middle-core, frontend-core, and tool-promotion need to react independently without backend-core enumerating consumers.
- Evidence event log — "evidence is a primitive" implies a replayable audit stream that lives outside any single workflow's memory.
- Tool-promotion re-evaluation — capability evidence aging out should trigger re-checks across consumers that don't know about each other.
We need to pick one durable pub/sub transport and one wire envelope that work across the polyglot fleet (Python modelgen, C# runtime, JS frontend) and do not violate the charter's provider-neutral stance.
Scope: pub/sub vs. durable workflows
This ADR covers decoupled fan-out across services, not workflow orchestration. Middle-core's BPMN scenarios, state-machine lifecycle transitions, MCP tool-promotion gating, and HITL waits are workflow-shaped and belong on a durable execution engine (Temporal-class), not a message bus. A separate ADR will cover that. The rule of thumb:
| Concern | Engine |
|---|---|
| Middle-core code must survive a crash mid-step | durable workflow |
| Something outside middle-core reacts when middle-core finishes | pub/sub (this ADR) |
Decision Drivers
- Provider-neutral by charter. Middle-core exists to keep meaning independent of any single provider. A pub/sub choice that hard-couples middle-core to one cloud actively contradicts that.
- Polyglot clients. Python (modelgen), C#/.NET (runtime), JS (frontend-core) must all be first-class — no second-tier SDKs.
- Dev parity with prod. The local lane is `docker compose up`. The broker must run in compose without managed-service stubs.
- Durable + replayable. At-least-once delivery, consumer offsets, replay from a position; not fire-and-forget.
- One tool covers queue and log semantics. The fan-out use case is work-queue-shaped; the evidence audit is log-shaped. Picking two technologies doubles the ops surface.
- Operational floor. Prototype scale today (single-digit producers, handful of consumers). The broker must not require a dedicated platform team.
- Envelope portability. The wire format must outlive any broker choice — we should be able to swap brokers without rewriting producers and consumers.
Considered Options
- NATS JetStream with CloudEvents v1.0 envelope
- Azure Service Bus with CloudEvents v1.0 envelope
- Google Cloud Pub/Sub with CloudEvents v1.0 envelope
- Apache Kafka / Redpanda with CloudEvents v1.0 envelope
Decision Outcome
Chosen: NATS JetStream with CloudEvents v1.0 as the wire envelope.
- Broker: NATS JetStream, self-hosted in `docker-compose.yml` for dev and CI; deployed as a stateful workload on Azure Container Apps (or Synadia Cloud) for shared environments.
- Envelope: CloudEvents v1.0, JSON format, transported as a NATS message with the CloudEvents binding for NATS. This keeps producers and consumers broker-agnostic at the message-shape level.
- Subject naming: `aax.<spoke>.<domain>.<event>.v<n>` (e.g. `aax.backend.knowledge.landed.v1`). Streams group by `aax.<spoke>.>`.
- Delivery: at-least-once, durable consumers with explicit ack, per-stream DLQ as `aax.dlq.<stream>`.
- Schemas: event schemas live in `contracts/events/` alongside the existing OpenAPI vendoring lane, validated by the modelgen drift gate.
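The subject convention above can be sketched as a few small helpers. This is a minimal illustration, not project code: the function names, the token character set, and the stream name `AAX_BACKEND` used in the usage note are assumptions. Only the `aax.<spoke>.<domain>.<event>.v<n>` layout, the `aax.<spoke>.>` stream filter, and the `aax.dlq.<stream>` DLQ subject come from this ADR. The matcher follows NATS wildcard rules, where `*` matches exactly one token and `>` matches one or more trailing tokens.

```python
import re

# Assumption: tokens are lowercase alphanumerics and hyphens. The ADR does not
# specify a character set; dots must be excluded because they separate tokens.
_TOKEN = re.compile(r"[a-z0-9-]+")


def event_subject(spoke: str, domain: str, event: str, version: int) -> str:
    """Build aax.<spoke>.<domain>.<event>.v<n> (hypothetical helper)."""
    for token in (spoke, domain, event):
        if not _TOKEN.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    if version < 1:
        raise ValueError("version must be >= 1")
    return f"aax.{spoke}.{domain}.{event}.v{version}"


def stream_filter(spoke: str) -> str:
    """Subject filter for the per-spoke stream: aax.<spoke>.>"""
    return f"aax.{spoke}.>"


def dlq_subject(stream: str) -> str:
    """Dead-letter subject for a stream: aax.dlq.<stream>"""
    return f"aax.dlq.{stream}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last pattern token and needs at least
            # one remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

With these helpers, `event_subject("backend", "knowledge", "landed", 1)` yields `aax.backend.knowledge.landed.v1`, which `stream_filter("backend")` matches and `stream_filter("frontend")` does not. Note that JetStream stream names cannot contain dots, so a DLQ subject such as `dlq_subject("AAX_BACKEND")` always adds exactly one token after `aax.dlq`.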
Positive Consequences
- Charter-aligned: NATS is portable across Azure, GCP, on-prem, and Synadia managed — no cloud coupling.
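- One technology covers both needs: work-queue delivery (a durable consumer shared by a group of workers) and log replay (a JetStream consumer that starts from a chosen stream sequence or start time; JetStream tracks position by sequence number rather than Kafka-style offsets).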
- First-class clients in all three languages (Python `nats-py`, .NET `NATS.Client`, JS `nats.js`), maintained by the same upstream.
- `docker compose up` keeps working: a single NATS container is enough.
- CloudEvents envelope makes the transport replaceable later without touching producer/consumer code.
- Lightweight enough to live alongside ArcadeDB without justifying a new dedicated cluster.
Negative Consequences
- Less managed than Service Bus or Google Pub/Sub — running JetStream in production means owning a stateful workload (storage, backup, upgrade). Mitigated by deferring to Synadia Cloud if ops becomes painful.
- Smaller hosted-service ecosystem in 2026 than Kafka — fewer drop-in observability and CDC integrations.
- CloudEvents-over-NATS binding is less ubiquitous than CloudEvents-over-HTTP; consumers must use a small shared helper for envelope decode rather than copy-pasting framework code.
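The shared envelope helper mentioned above might look roughly like the following. It is a sketch under stated assumptions: the `Event` type, the function names, and the example `source` and `type` values in the usage note are hypothetical, and it handles only the structured JSON form of a CloudEvent. The four required attributes (`specversion`, `id`, `source`, `type`) are the ones CloudEvents v1.0 mandates.

```python
import json
from dataclasses import dataclass
from typing import Any

# Attributes that CloudEvents v1.0 requires on every event.
REQUIRED = ("specversion", "id", "source", "type")


@dataclass(frozen=True)
class Event:
    """Decoded envelope (hypothetical shape, not an official SDK type)."""
    id: str
    source: str
    type: str
    data: Any
    extras: dict  # optional and extension attributes, passed through untouched


def encode_event(event_id: str, source: str, event_type: str, data: Any) -> bytes:
    """Serialize a structured-mode CloudEvents JSON envelope to bytes."""
    envelope = {
        "specversion": "1.0",
        "id": event_id,
        "source": source,
        "type": event_type,
        "datacontenttype": "application/json",
        "data": data,
    }
    return json.dumps(envelope, separators=(",", ":")).encode("utf-8")


def decode_event(payload: bytes) -> Event:
    """Parse and minimally validate an envelope received from the broker."""
    envelope = json.loads(payload.decode("utf-8"))
    if not isinstance(envelope, dict):
        raise ValueError("CloudEvents JSON envelope must be an object")
    missing = [
        key for key in REQUIRED
        if not isinstance(envelope.get(key), str) or not envelope[key]
    ]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    if envelope["specversion"] != "1.0":
        raise ValueError(f"unsupported specversion: {envelope['specversion']}")
    known = set(REQUIRED) | {"data"}
    extras = {k: v for k, v in envelope.items() if k not in known}
    return Event(
        id=envelope["id"],
        source=envelope["source"],
        type=envelope["type"],
        data=envelope.get("data"),
        extras=extras,
    )
```

A producer would call something like `encode_event("evt-1", "/backend-core/ingest", "aax.backend.knowledge.landed.v1", {...})` and hand the bytes to whichever client publishes them. Because neither function imports a broker library, the same helper would survive a transport swap.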
Pros and Cons of the Options
NATS JetStream + CloudEvents
- ✅ Provider-neutral; identical on Azure, GCP, on-prem, Synadia
- ✅ Polyglot SDK parity across Python, .NET, JS
- ✅ Queue + log + KV in one binary
- ✅ Trivial dev story — one container in compose
- ✅ Subject hierarchy fits a "spoke.domain.event" namespace
- ❌ Self-hosting JetStream requires owning storage and upgrade
- ❌ Smaller managed-ecosystem footprint than Kafka
Azure Service Bus + CloudEvents
- ✅ Fully managed; ACA and ArcadeDB are already on Azure
- ✅ Mature .NET SDK; tight Azure identity integration
- ✅ DLQ, sessions, and scheduled delivery built in
- ❌ Couples middle-core to Azure; violates the provider-neutral charter
- ❌ Local-dev story relies on Service Bus emulator or stubs — no true parity with prod
- ❌ Topic/subscription model is queue-shaped; weak as an event log
Google Cloud Pub/Sub + CloudEvents
- ✅ Fully managed; aligns with the listed GCP project
- ✅ Strong throughput and global availability
- ❌ ACA-hosted producers in Azure pay inter-cloud egress
- ❌ Same provider-coupling concern as Service Bus, in the other direction
- ❌ No serious local-dev parity (emulator exists but lags)
Apache Kafka / Redpanda + CloudEvents
- ✅ Industry-standard durable log; replay and CDC story unmatched
- ✅ Redpanda gives a single-binary, compose-friendly local lane
- ❌ Heavier ops burden than current scale justifies
- ❌ Consumer-group semantics are richer than current consumers need
- ❌ Schema-registry expectation pushes a second piece of infra
Confirmation
The decision is confirmed when:
- `docker-compose.yml` includes a `nats` service running JetStream and middle-core's `/health` reports broker reachability.
- `contracts/events/` contains at least one event schema (`backend.knowledge.landed.v1`) with a CloudEvents JSON example fixture.
- A modelgen drift gate validates event schemas the same way it validates the OpenAPI contract today.
- A round-trip integration test publishes a CloudEvent on `aax.backend.knowledge.landed.v1` and a middle-core consumer acks it, with replay-from-offset proven in the same test.
- The ADR is referenced from `docs/middle-core-charter.md` under "What the layer owns."
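To pin down what the round-trip test should assert, here is a broker-free model of the publish, ack, and replay steps. It is an in-memory stand-in, not a NATS client and not the integration test itself: `InMemoryStream`, `round_trip`, and the consumer name `middle-core` are hypothetical names chosen for illustration. The real test would run the same sequence against the `nats` compose service.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStream:
    """Append-only log with 1-based sequence numbers (illustrative stand-in)."""
    messages: list = field(default_factory=list)
    ack_floor: dict = field(default_factory=dict)  # durable name -> last acked seq

    def publish(self, subject: str, payload: bytes) -> int:
        self.messages.append((subject, payload))
        return len(self.messages)  # sequence assigned to the stored message

    def fetch(self, durable: str):
        """Return (seq, subject, payload) for the next unacked message, or None."""
        next_seq = self.ack_floor.get(durable, 0) + 1
        if next_seq > len(self.messages):
            return None
        subject, payload = self.messages[next_seq - 1]
        return next_seq, subject, payload

    def ack(self, durable: str, seq: int) -> None:
        """Advance the durable consumer's position; never moves it backwards."""
        self.ack_floor[durable] = max(self.ack_floor.get(durable, 0), seq)

    def replay(self, start_seq: int) -> list:
        """Re-read stored messages from start_seq, regardless of consumer acks."""
        return [
            (i + 1, subject, payload)
            for i, (subject, payload) in enumerate(self.messages)
            if i + 1 >= start_seq
        ]


def round_trip() -> dict:
    """Run the publish -> fetch -> ack -> replay sequence the test should cover."""
    stream = InMemoryStream()
    subject = "aax.backend.knowledge.landed.v1"
    seq = stream.publish(subject, b'{"specversion":"1.0"}')

    first = stream.fetch("middle-core")
    redelivered = stream.fetch("middle-core")  # unacked, so delivered again
    stream.ack("middle-core", first[0])
    after_ack = stream.fetch("middle-core")    # nothing left for this consumer
    replayed = stream.replay(start_seq=seq)    # log still holds the message

    return {
        "seq": seq,
        "first": first,
        "redelivered": redelivered,
        "after_ack": after_ack,
        "replayed": replayed,
    }
```

The three properties worth carrying over to the real test are that an unacked message is delivered again (at-least-once), that an ack removes it from that consumer's pending work, and that replay from a stored position still returns it after the ack.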
More Information
- Related: ADR for durable workflow engine selection (Temporal-class, TBD). That ADR will cover BPMN scenario execution, state-machine lifecycle, MCP gating, and HITL waits — explicitly out of scope here.
- Standards: CloudEvents v1.0 (CNCF), NATS JetStream subject and stream conventions, AsyncAPI 3.0 for documenting event contracts.
- Follow-ups:
  - Decide whether AsyncAPI 3.0 docs are generated from the same modelgen pipeline that produces OpenAPI today.
  - Decide hosting: Synadia Cloud vs. self-hosted JetStream on ACA, once the first cross-spoke consumer ships.
  - Define the `aax.dlq.*` redrive runbook.