AIQIDE turns eval signals into business-grade narratives. Given a system's DNA — agency level, action type, exposure surface, domain, data sensitivity — and a threshold breach, it returns a persona-targeted explanation: what failed, what it means commercially, what it means under MAS FEAT/TRM, what to do. Crystal Ball calls AIQIDE on breach. Scout findings ride the same engine.
Eval tools produce numbers, not decisions. AIQIDE is the layer that takes a metric reading + the system context and returns the sentence an executive, auditor, or release lead can act on.
Two systems with different DNA need different evals. A failure in a high-agency external advisory system means something different from a failure in a low-agency internal helper. DNA carries that context into every narrative AIQIDE generates.
Impact rules cite MAS FEAT, TRM, and related obligations where they apply. The narrative isn't just "this is bad" — it's "this attribute breach exposes you under this clause."
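As a rough illustration of how a breach plus an impact rule could become a persona-targeted narrative, here is a minimal sketch. Every name in it (`Breach`, `ImpactRule`, `render_narrative`, the persona keys) is hypothetical and not AIQIDE's actual API. The regulatory citation is deliberately a placeholder string, not a real clause reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breach:
    attribute: str    # quality attribute that breached, e.g. "groundedness"
    observed: float   # metric reading reported by the eval tool
    threshold: float  # minimum acceptable value for this system


@dataclass(frozen=True)
class ImpactRule:
    attribute: str
    commercial_impact: str
    regulatory_citation: str  # placeholder; a real rule would cite an actual clause
    recommended_action: str


def render_narrative(breach: Breach, rule: ImpactRule, persona: str) -> str:
    """Assemble a persona-targeted explanation from a breach and its impact rule."""
    if breach.attribute != rule.attribute:
        raise ValueError("impact rule does not cover this attribute")
    what_failed = (
        f"{breach.attribute} measured {breach.observed:.2f} "
        f"against a threshold of {breach.threshold:.2f}."
    )
    regulatory = f"Regulatory exposure: {rule.regulatory_citation}."
    # Assumed persona split: each persona sees only the parts it can act on.
    sections = {
        "executive": [what_failed, rule.commercial_impact, rule.recommended_action],
        "auditor": [what_failed, regulatory, rule.recommended_action],
        "release_lead": [what_failed, rule.recommended_action],
    }
    if persona not in sections:
        raise ValueError(f"unknown persona: {persona}")
    return " ".join(sections[persona])


breach = Breach("groundedness", observed=0.71, threshold=0.85)
rule = ImpactRule(
    attribute="groundedness",
    commercial_impact="Customers may receive advice the source material does not support.",
    regulatory_citation="<applicable MAS FEAT / TRM clause goes here>",
    recommended_action="Hold the release until groundedness is back above threshold.",
)
print(render_narrative(breach, rule, "auditor"))
```

The same breach rendered for `"release_lead"` drops the regulatory sentence and keeps only the failure and the action.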
Locked vocabulary. Each system gets a DNA tuple at onboarding. Impact rules match against these tuples, and Crystal Ball displays each system's tuple on the Quality Lead view.
Agency level. How autonomous is the system? Recommendation only? Decisions with human approval? Decisions without approval? Affects severity directly.
Action type. What kind of work? Generative, retrieval, classification, scoring, action-taking. Determines which attribute families are material.
Exposure surface. Who interacts with it? Internal users only, customers, regulated counterparties. Drives reputational and regulatory severity.
Domain. FSI, telco, public sector, education, internal tooling. Pulls in domain-specific regulatory rules and attribute weighting.
Data sensitivity. PII, financial, regulated, public. Multiplies severity for confidentiality and integrity-class breaches.
Architecture pattern. RAG vs fine-tuned vs prompt-engineered vs agentic. A sibling field on the engagement record today; it is promoted to a sixth DNA axis once ≥8 joint-matching rules exist.
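One way to picture the locked vocabulary is as a set of enums with a frozen tuple on top. The sketch below is an assumption-laden illustration: the class names, member spellings, and the `EngagementRecord` shape are all hypothetical, and only the listed values and the threshold of 8 come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class AgencyLevel(Enum):
    RECOMMENDATION_ONLY = "recommendation_only"
    DECISION_WITH_APPROVAL = "decision_with_approval"
    DECISION_WITHOUT_APPROVAL = "decision_without_approval"


class ActionType(Enum):
    GENERATIVE = "generative"
    RETRIEVAL = "retrieval"
    CLASSIFICATION = "classification"
    SCORING = "scoring"
    ACTION_TAKING = "action_taking"


class ExposureSurface(Enum):
    INTERNAL = "internal"
    CUSTOMER = "customer"
    REGULATED_COUNTERPARTY = "regulated_counterparty"


class Domain(Enum):
    FSI = "fsi"
    TELCO = "telco"
    PUBLIC_SECTOR = "public_sector"
    EDUCATION = "education"
    INTERNAL_TOOLING = "internal_tooling"


class DataSensitivity(Enum):
    PII = "pii"
    FINANCIAL = "financial"
    REGULATED = "regulated"
    PUBLIC = "public"


class ArchitecturePattern(Enum):
    RAG = "rag"
    FINE_TUNED = "fine_tuned"
    PROMPT_ENGINEERED = "prompt_engineered"
    AGENTIC = "agentic"


@dataclass(frozen=True)
class SystemDNA:
    """The five DNA axes; frozen so the tuple cannot drift after onboarding."""
    agency_level: AgencyLevel
    action_type: ActionType
    exposure: ExposureSurface
    domain: Domain
    data_sensitivity: DataSensitivity


@dataclass(frozen=True)
class EngagementRecord:
    system_name: str
    dna: SystemDNA
    architecture_pattern: ArchitecturePattern  # sibling field, not yet a DNA axis


PROMOTION_THRESHOLD = 8  # from the text: promote once at least 8 joint-matching rules exist


def should_promote_architecture_pattern(joint_matching_rule_count: int) -> bool:
    return joint_matching_rule_count >= PROMOTION_THRESHOLD


record = EngagementRecord(
    system_name="example-advisor",  # hypothetical system
    dna=SystemDNA(
        AgencyLevel.RECOMMENDATION_ONLY,
        ActionType.GENERATIVE,
        ExposureSurface.CUSTOMER,
        Domain.FSI,
        DataSensitivity.FINANCIAL,
    ),
    architecture_pattern=ArchitecturePattern.RAG,
)
```

Using enums means a value outside the vocabulary, such as `Domain("healthcare")`, raises `ValueError` at onboarding instead of silently failing to match any impact rule later.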
Two systems can carry identical DNA — say action_type=generative, exposure=external, domain=fsi — and have radically different failure modes. A RAG-grounded system fails on retrieval relevancy + groundedness. A fine-tuned generative system fails on hallucination + drift. The eval activities differ. Architecture pattern carries that distinction into AIQIDE's rule-matching layer so the right tests get scoped, the right thresholds apply, and the narrative reflects the actual system.
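The joint matching described here can be sketched as rules keyed on both DNA fields and architecture pattern. This is a hypothetical illustration: `ScopingRule`, `rule_matches`, and `scope_evals` are invented names, and the eval lists simply restate the failure modes named in the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ScopingRule:
    dna_match: Dict[str, str]            # DNA fields the rule requires
    architecture_pattern: Optional[str]  # None means the rule applies to any pattern
    evals: List[str]                     # eval activities scoped when the rule matches


def rule_matches(rule: ScopingRule, dna: Dict[str, str], architecture_pattern: str) -> bool:
    dna_ok = all(dna.get(key) == value for key, value in rule.dna_match.items())
    pattern_ok = rule.architecture_pattern in (None, architecture_pattern)
    return dna_ok and pattern_ok


def scope_evals(rules: List[ScopingRule], dna: Dict[str, str], architecture_pattern: str) -> List[str]:
    """Collect evals from every matching rule, preserving order and dropping repeats."""
    scoped: List[str] = []
    for rule in rules:
        if rule_matches(rule, dna, architecture_pattern):
            for name in rule.evals:
                if name not in scoped:
                    scoped.append(name)
    return scoped


SHARED_DNA = {"action_type": "generative", "exposure": "external", "domain": "fsi"}

RULES = [
    ScopingRule(SHARED_DNA, "rag", ["retrieval_relevancy", "groundedness"]),
    ScopingRule(SHARED_DNA, "fine_tuned", ["hallucination", "drift"]),
]

print(scope_evals(RULES, SHARED_DNA, "rag"))
print(scope_evals(RULES, SHARED_DNA, "fine_tuned"))
```

Both calls pass the same DNA; only the architecture pattern changes which evals are scoped.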
Locked vocabulary of quality attributes. Each attribute carries DNA-applicability rules: which DNA combinations make it material, which thresholds apply, which evidence shapes count. Catalogue is being extended with RAGAS-canonical attributes (retrieval_relevancy, context_precision, context_recall, response_groundedness) in the same release as the architecture_pattern sibling field.
Accuracy. Material for any system whose output is depended upon for correctness.
Robustness. Material for any system exposed to inputs it does not control.
Fairness. Material when outcomes affect people unequally and protected attributes are in play (FSI, hiring, public sector).
Explainability. Material when an auditor, regulator, or stakeholder might ask why a given output was produced.
Privacy. Material for any system that handles regulated personal or sensitive data.
Reliability. Material for any system that has to keep running in production under real load.
RAGAS (pending). Ships with the architecture_pattern sibling field. Plugs the catalogue gap that motivates the Coverage Gap Audit.
The catalogue carries applicability rules per attribute, keyed on DNA. An FSI advisory system pulls in the full Accuracy + Explainability + Fairness load. An internal helper pulls a leaner Reliability + Accuracy slice. The Coverage Gap Audit deliverable runs this selection against a real system to produce its material-attribute set, then maps each attribute to the tools currently measuring it. Gaps surface as audit findings with regulatory citations attached.
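The selection-then-mapping flow can be sketched at the family level as follows. The predicates are illustrative assumptions that merely reproduce the two examples in the paragraph above; they are not the catalogue's real applicability rules, and the citation strings are placeholders. All names (`ApplicabilityRule`, `GapFinding`, `material_families`, `coverage_gaps`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

DNA = Dict[str, str]


@dataclass(frozen=True)
class ApplicabilityRule:
    family: str
    applies: Callable[[DNA], bool]  # True when the family is material for this DNA
    citation: str                   # placeholder for the regulatory citation


@dataclass(frozen=True)
class GapFinding:
    family: str
    citation: str


# Illustrative rules only; real rules would be per attribute and far richer.
RULES = [
    ApplicabilityRule("Accuracy", lambda d: True, "<citation placeholder>"),
    ApplicabilityRule("Explainability", lambda d: d["domain"] == "fsi", "<citation placeholder>"),
    ApplicabilityRule("Fairness", lambda d: d["domain"] in {"fsi", "public_sector"}, "<citation placeholder>"),
    ApplicabilityRule("Reliability", lambda d: True, "<citation placeholder>"),
]


def material_families(dna: DNA) -> List[str]:
    """Step 1: run the applicability rules to get the material set."""
    return [rule.family for rule in RULES if rule.applies(dna)]


def coverage_gaps(dna: DNA, tools_by_family: Dict[str, List[str]]) -> List[GapFinding]:
    """Step 2: a material family with no measuring tool becomes a finding."""
    return [
        GapFinding(rule.family, rule.citation)
        for rule in RULES
        if rule.applies(dna) and not tools_by_family.get(rule.family)
    ]


fsi_advisory = {"domain": "fsi", "exposure": "external"}
internal_helper = {"domain": "internal_tooling", "exposure": "internal"}
tools_in_use = {"Accuracy": ["Langfuse"], "Explainability": ["Langfuse"]}

print(material_families(fsi_advisory))
print([gap.family for gap in coverage_gaps(fsi_advisory, tools_in_use)])
```

With the same tool inventory, the internal helper produces a shorter material set and therefore fewer findings, which is the point of keying applicability on DNA.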
Each attribute in the 23-item catalogue (19 current attributes plus 4 pending RAGAS attributes) has a definition, a short list of tools that typically measure it, and a status flag for whether Crystal Ball is reading from that source today. Status legend: ✓ Live = wired through a Crystal Ball adapter and visible in the demo today; ◐ Reachable = within the contract of a deployed adapter (e.g. Scout push, Langfuse trace replay) but not yet calibrated for this attribute; ○ Future = no adapter yet, requires a new tool integration.
| Attribute | Family | Definition | Example tools | In demo today |
|---|---|---|---|---|
| factual_accuracy | Accuracy | Output's factual claims hold up against the underlying source-of-truth corpus or reference set. | RAGAS faithfulness, DeepEval HallucinationMetric, custom LLM-judge against reference | ✓ Live: Langfuse hallucination metric on Corpus Coach |
| groundedness | Accuracy | Every claim in the output traces back to retrieved or supplied context, no fabrication. | RAGAS faithfulness/groundedness, custom LLM-judge with retrieval overlap | ✓ Live: Corpus Coach groundedness via Langfuse |
| citation_correctness | Accuracy | Cited sources actually contain the claim attributed to them, and the citation pointer resolves. | Custom citation-validity check, retrieval-overlap test, source-mapping LLM-judge | ◐ Reachable: Corpus Coach citations partially via Langfuse |
| adversarial_robustness | Robustness | Output stays correct under crafted adversarial inputs designed to derail the model. | Garak red-team suite, custom adversarial prompt sets, PromptFoo redteam config | ○ Future: no adapter wired |
| prompt_injection_resistance | Robustness | Model refuses or neutralises instructions injected through user input or retrieved content. | Garak prompt-injection probes, PromptFoo redteam, custom injection harness | ◐ Reachable: Scout adapter can probe via push |
| edge_case_handling | Robustness | Behaviour on inputs at or beyond expected distribution edges remains safe and predictable. | PromptFoo, DeepEval, custom edge-case generators, Scout exploratory probes | ◐ Reachable: Scout adapter can probe via push |
| demographic_parity | Fairness | Outcome rates are similar across protected demographic groups. | AIF360, Fairlearn, custom group-comparison harness | ○ Future: no adapter wired |
| equal_opportunity | Fairness | True-positive rates are similar across protected groups conditional on the true outcome. | AIF360, Fairlearn, custom counterfactual evaluator | ○ Future: no adapter wired |
| treatment_consistency | Fairness | Functionally identical inputs differing only in protected attributes get equivalent treatment. | Custom paired-prompt comparison, counterfactual LLM-judge | ○ Future: no adapter wired |
| decision_traceability | Explainability | Each output is reconstructable from the trace of inputs, retrievals, prompts, and intermediate steps. | Langfuse traces, custom decision-tree audit, LangSmith | ✓ Live: Langfuse trace ingestion (PR #8 evidence trace ID) |
| evidence_citation | Explainability | Output surfaces the supporting evidence so a reviewer can validate the claim independently. | Custom citation extractor, retrieval-trace inspector | ◐ Reachable: same instrumentation as citation_correctness |
| persona_appropriate_explanation | Explainability | Explanation depth and vocabulary fit the consuming persona (e.g. exec vs governance vs delivery). | Custom LLM-judge against persona profile, readability + audience-fit scoring | ◐ Reachable: Scout adapter can probe via push |
| pii_leakage | Privacy | Output does not expose personally identifiable information beyond what was authorised. | Microsoft Presidio, regex scrubbers, custom PII probe set | ○ Future: no adapter wired |
| data_minimisation | Privacy | System collects, retains, and exposes only the data necessary for the requested task. | Custom audit + retention probe, scope-creep detector | ○ Future: no adapter wired |
| consent_handling | Privacy | Data flows respect captured consent state at the point of inference. | Custom consent-trace audit, integration test against consent store | ○ Future: no adapter wired |
| availability | Reliability | Service responds within SLO over a measurement window. | Datadog, Grafana, Prometheus, standard APM | ○ Future: APM stack not wired into a CB adapter |
| latency | Reliability | Response time at p50/p95/p99 stays within calibrated SLOs. | Datadog, Grafana, Prometheus, Langfuse latency telemetry | ◐ Reachable: Langfuse traces carry latency, not yet surfaced |
| graceful_degradation | Reliability | System falls back to a safe, communicable state under partial failure rather than failing closed silently. | Chaos-engineering tools, custom dependency-failure probes | ○ Future: no adapter wired |
| deterministic_replay | Reliability | Given a captured trace, the same inputs reproduce the same outputs (or a fingerprint of the divergence). | Langfuse replay, custom replay harness, fixture snapshotter | ◐ Reachable: Langfuse traces enable replay, not yet automated |
| retrieval_relevancy (pending) | RAGAS | Retrieved context is on-topic for the user's question. | RAGAS context_precision, custom retrieval-relevance LLM-judge | ○ Future: RAGAS adapter on Path B |
| context_precision (pending) | RAGAS | Top-ranked retrieved chunks are the ones the answer actually depends on. | RAGAS context_precision | ○ Future: RAGAS adapter on Path B |
| context_recall (pending) | RAGAS | Retrieval surfaces all chunks needed to answer; nothing critical is missed. | RAGAS context_recall | ○ Future: RAGAS adapter on Path B |
| response_groundedness (pending) | RAGAS | Final response is supported by the retrieved context, not invented. | RAGAS faithfulness | ✓ Live (proxy): Corpus Coach groundedness via Langfuse already covers this signal pre-RAGAS adapter |
"In demo today" reflects the eval adapters wired into Crystal Ball at the time of the latest demo polish series (Path A, May 8–9). Counts: 4 attributes ✓ Live, 6 attributes ◐ Reachable via deployed adapters, 12 attributes ○ Future (require new adapter or tool integration). Coverage Gap Audit on Path B will produce this same matrix per real-world target system, with regulatory citations attached to gaps.