ARC-ADR-024 — Platform Maturity Audit (run vs design)¶
| Field | Value |
|---|---|
| ID | ARC-ADR-024 |
| Status | Accepted |
| Date | 2026-05-26 |
| Deciders | Hub owner |
| Supersedes | — |
| Superseded by | — |
| Tags | maturity, audit, ops, design, slo, finops, security, observability, release |
Context¶
The fleet shipped ARC-ADR-023 (container tiering) + the hub-owned ArcadeDB deploy lane + the application-tier image.json for backend-core + the LLM gateway extracted as a function-tier container. With those landed, the platform has its first internally consistent shape — but several gaps were obvious enough that the hub owner asked for a Platform Maturity Audit, segregating design/architecture from platform ops.
Eight specialist agents ran in two waves:
Wave 1 — Design/Architecture (target-state + strategy):
- cloud-architect — env promotion model, dual-ACR resolution, multi-cloud question
- platform-architect — golden paths, IDP/catalog question, DORA metrics, Team Topologies fit
- security-architect — threat model, secrets lifecycle, ingress posture, supply chain
- release-manager — release-train rhythm, semver per repo, contract-version coordination
Wave 2 — Platform Ops (current-state + concrete gaps):
- azure-infra-engineer — running ACA + ACR + Storage + Log Analytics inventory
- observability-engineer — OTel + Log Analytics + dashboards
- sre-engineer — first-pass SLOs/SLIs, error budgets, on-call at solo scale
- finops-engineer — cost map across the 4 RGs + LLM provider spend kill-switch
This ADR captures what the audit found, what we'll do about it, the maturity baseline going forward, and the segregation rule for future audits.
Decision Drivers¶
| # | Driver |
|---|---|
| D1 | Run vs design separation — hub owner explicitly asked for it; design agents produce target state, ops agents produce current-state gap reports. Don't mix the two. |
| D2 | The fleet has fires — three P0 findings need same-day action; the ADR must surface them, not bury them. |
| D3 | Solo-team realism — every recommendation must be sized for 1-2 humans + an AI army. No SLOs that need 24×7 humans; no platform that needs a team to operate. |
| D4 | Specialist agents are leverage, not work — the audit took 8 agents running in parallel ~10 minutes wall-clock. Future audits should reuse this pattern. |
| D5 | Maturity is measurable, not vibes — score each domain on a 5-level scale; track improvement over time via the heartbeat. |
Decision Outcome¶
Run vs design segregation (formal rule)¶
| Track | Owns | Output shape | Example agents |
|---|---|---|---|
| Design / Architecture | Target state, ADRs, strategies, golden paths | Decision documents, target architectures, ADR amendments | cloud-architect, platform-architect, security-architect, release-manager, enterprise-architect, solution-architect, togaf-adm-advisor, wardley-strategist, business-architect |
| Platform Ops | Running infra, telemetry, SLOs, cost | Gap reports against live systems, IaC fixes, runbooks, dispatched issues | azure-infra-engineer, gcp-infra-engineer, aws-infra-engineer, observability-engineer, sre-engineer, finops-engineer, deployment-engineer, devops-engineer, chaos-engineer, incident-responder |
When commissioning either, brief explicitly with track + output shape. Design produces what should be true; Ops measures what is actually true. The synthesis (this ADR's role) is where they merge.
See docs/agent-tracks.md for the full 178-agent roster classified by track.
Maturity scorecard (baseline + targets)¶
Levels: 0 ad-hoc → 1 named → 2 measured → 3 enforced → 4 optimizing.
| Domain | Current | 90-day target | Owning track |
|---|---|---|---|
| Container tiering | 3 enforced (ADR-023 + heartbeat checks) | 4 optimizing | Design + Ops |
| Contract-first | 2 measured (heartbeat catches drift) | 3 enforced (gated PRs) | Design + Ops |
| Env separation (dev/stg/prd) | 0 ad-hoc (one unlabeled RG) | 2 measured (rg-aa-<env>-<tier>) |
Design |
| Identity & secrets | 1 named (KV exists, no rotation) | 2 measured (rotation policy + KV refs everywhere) | Design + Ops |
| Observability | 1 named (ADR-010 accepted, zero impl) | 2 measured (OTel collector live, day-1 dashboards) | Ops |
| SLOs / error budgets | 0 ad-hoc | 2 measured (per-service SLOs + heartbeat burn-rate probe) | Ops |
| FinOps | 0 ad-hoc | 2 measured (budget alerts + weekly digest) | Ops |
| Release coordination | 0 ad-hoc | 1 named (semver tags, milestone-driven trains) | Design |
| Public ingress posture | 0 ad-hoc (naked ACA ingress) | 2 measured (Front Door + WAF) | Design + Ops |
| Container supply chain | 1 named (Image Standard) | 3 enforced (digest pin + Trivy + SBOM) | Ops |
| Incident response | 0 ad-hoc | 1 named (Sev1/2/3 + postmortem template) | Ops |
The heartbeat (tools/fleet-heartbeat.mjs) is extended over time to measure the right-hand column: tier coverage (today), SLO burn rate, OTel readiness, contract version skew, weekly cost digest. What's measured improves; what isn't, doesn't.
P0 (this week — fires)¶
frontend-corerevision--0000023is FAILED and holds 100% traffic. Public endpoint returning errors right now. Route traffic to--0000011and diagnose the failed deploy. Owner: hub.- Hardcoded passwords in ACA env vars.
arcadedbJVM args contain root password as plain string;backend-corecarriesARCADEDB_PASSWORDas plain env. Rotate tosecretRef+ Key Vault references. Owner: hub + backend-core. arcadedb-devACI zombie burning ~$40/mo inrg-arcadedb-devfor no value. Stateless ACI with no volume — pure waste.az container delete. Owner: hub.- Set LLM provider hard spend caps NOW ($50/mo OpenAI + $50/mo Anthropic) before any traffic reaches the LLM gateway. Owner: hub (provider dashboards).
P1 (this month — foundations)¶
- Converge to one ACR (
agentarmy.azurecr.io); mirrorarcadeplatformacrtags viaaz acr importthen decommission. Saves ~$15/mo + collapses RBAC surface. - Establish
rg-aa-<env>-<tier>naming convention; stand uprg-aa-dev-platform+rg-aa-dev-appfrom the converged ACR. Retirerg-arcade-platform's unlabeled form. - Add Azure Front Door + WAF in front of
frontend-core; lock ACA ingress to FD origin IPs only. - Deploy shared OTel Collector + wire Application Insights exporter; emit
traceparentfromfrontend-coreBFF; instrumentbackend-coreUDA + LLM gateway spans. - Define + ship SLOs/SLIs per service (see Ops agent report — committed as
docs/slos.md); add SLO burn-rate probe to heartbeat. - Add subscription budget ($150/mo all-up dev) + 4-threshold email alerts via Bicep; add weekly cost digest via heartbeat.
- Document secret rotation cadences (JWT 90d, providers 180d, KV-backed DB roots 180d with ACA revision reload).
- Migrate
PROJECT_TOKENfrom classic PAT to GitHub App (fine-grained, auto-rotating).
P2 (this quarter — maturity)¶
- Add
scripts/new-spoke.shscaffolder +templates/contract-scaffold/(platform-architect's biggest leverage call). - Ship
docs/catalog.yamlas the fleet's flat catalog (no Backstage at this scale). - DORA baseline via heartbeat
--doraflag. - Trivy scan + SBOM emission gate in image-build workflows; digest-pin all Platform image bases (Renovate).
- Conventional commits +
git-cliffper-spoke changelogs; hub GitHub Releases aggregate trains. - ArcadeDB restart game day; document Sev1/2/3 + postmortem template under
docs/incidents/. - Define LLM provider outage attribution rule; per-tenant rate limiter + cost-attribution metadata in
/v1/chat/completions.
Open decisions (HITL required)¶
- Edge provider — Azure Front Door (recommended, single trust boundary) vs Cloudflare (best DX, separate plane).
- Single vs dual ACR — collapse vs document dual-ACR RBAC model. Recommended: collapse.
- First-tag moment — concrete trigger for cutting
v0.1.0across spokes (RT6 Phase 2 complete? first green E2E?). - LLM gateway versioning bond — share backend-core's tag (recommended), or independent tag.
- GCP scope — formally narrow to "BigQuery-only, no compute" via a separate ADR-025, or keep open.
- Postmortem ownership at solo scale — hub owner writes, or fleet heartbeat agent drafts first pass from issue timeline.
Implementation¶
This ADR is the strategy + the baseline scorecard; execution is dispatched as ~30 labelled issues across the 4 repos (see the PR body for the full backlog, sourced from each of the 8 specialist reports). Each issue links back to ADR-024 + the originating specialist domain.
The hub fleet heartbeat (tools/fleet-heartbeat.mjs) becomes the maturity-tracking surface going forward: tier inventory (today), then SLO burn (P1), OTel readiness (P1), cost digest (P1), DORA metrics (P2), contract version skew (P2). When the heartbeat reports a domain at the target level for ≥30 days, the scorecard advances.
The 8 specialist reports are preserved in the PR body for traceability; this ADR captures the synthesized decisions, not the analyses.
Consequences¶
- + First measurable maturity baseline. Every domain has a current level + a 90-day target.
- + Run-vs-design segregation is now a formal rule with a roster classification (
docs/agent-tracks.md). Future audits reuse the pattern. - + P0 fires are surfaced, not buried.
- + Every recommendation is sized for solo-team operation.
- − Substantial backlog (~30 issues) dispatched across the fleet — short-term work increase.
- − Maturity scorecard requires periodic re-evaluation (proposed: quarterly). Without that cadence the scorecard rots.
References¶
- ARC-ADR-010 — Observability Standard
- ARC-ADR-021 — LLM Gateway
- ARC-ADR-022 — Event Bus Bridges
- ARC-ADR-023 — Fleet Container Tiering
docs/agent-tracks.md— full roster classified by Design vs Opstools/fleet-heartbeat.mjs— the measurement surface- PR body of this ADR's hub PR — full text of all 8 specialist reports