AIQIDE

AI Quality Impact Determination Engine

AIQIDE turns eval signals into business-grade narratives. Given a system's DNA — agency level, action type, exposure surface, domain, data sensitivity — and a threshold breach, it returns a persona-targeted explanation: what failed, what it means commercially, what it means under MAS FEAT/TRM, what to do. Crystal Ball calls AIQIDE on breach. Scout findings ride the same engine.

5 DNA axes · 22-attribute catalogue · Impact rules · Persona narratives · FastAPI

🧬

System DNA axes

5 today · 6 after promotion trigger

📚

Quality attribute catalogue

22 attributes · RAGAS extension pending

📐

Impact rule library

16 approved · trigger 100 / 2nd-org

🎭

Personas served

Executive · Governance · Quality lead · Delivery

Why Build This

🔁

Eval signal → business meaning

Eval tools produce numbers, not decisions. AIQIDE is the layer that takes a metric reading + the system context and returns the sentence an executive, auditor, or release lead can act on.

🧬

System DNA scopes the work

Two systems with different DNA need different evals. A failure in a high-agency external advisory system means something different from a failure in a low-agency internal helper. DNA carries that context into every narrative AIQIDE generates.

⚖️

Regulatory grounding built in

Impact rules cite MAS FEAT, TRM, and related obligations where they apply. The narrative isn't just "this is bad" — it's "this attribute breach exposes you under this clause."

What AIQIDE Does

🧬

Classify

Capture System DNA + architecture pattern per system

›

🎯

Scope

Material attribute set per DNA — from the 22-attribute catalogue

›

📐

Rule match

Match incoming eval verdict against impact-rule library

⚖️

Severity

Calibrate severity by DNA — same finding hits differently per system

›

📝

Narrate

Persona-targeted business narrative + regulatory citation

›

🔄

Return

Crystal Ball renders, click-through goes back to source eval
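
The rule-match, severity, and narrate steps above can be sketched roughly as follows. This is a minimal illustration, not AIQIDE's actual implementation: the severity scale, the `ImpactRule` fields, the calibration logic, the DNA value names, and the citation string are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the source does not define the levels.
SEVERITY_LEVELS = ["low", "medium", "high", "critical"]


@dataclass
class ImpactRule:
    attribute: str      # catalogue attribute the rule covers
    dna_match: dict     # axis name -> required value
    base_severity: str  # severity before DNA calibration
    citation: str       # placeholder text, not a real clause reference


def rule_matches(rule, attribute, dna):
    """A rule applies when the attribute and every listed DNA axis agree."""
    return rule.attribute == attribute and all(
        dna.get(axis) == value for axis, value in rule.dna_match.items()
    )


def calibrate(base_severity, dna):
    """Raise severity one step each for autonomy and non-internal exposure."""
    step = SEVERITY_LEVELS.index(base_severity)
    if dna.get("agency_level") == "autonomous":
        step += 1
    if dna.get("exposure_surface") in ("customer", "regulated_counterparty"):
        step += 1
    return SEVERITY_LEVELS[min(step, len(SEVERITY_LEVELS) - 1)]


def determine_impact(dna, attribute, persona, rules):
    """Rule match, then severity, then a persona-tagged narrative."""
    for rule in rules:
        if rule_matches(rule, attribute, dna):
            severity = calibrate(rule.base_severity, dna)
            narrative = (
                f"[{persona}] {attribute} breached at {severity} severity; "
                f"see {rule.citation}."
            )
            return {"severity": severity, "narrative": narrative,
                    "citation": rule.citation}
    return None  # no approved rule covers this finding


rules = [ImpactRule("groundedness", {"domain": "fsi"}, "medium",
                    "MAS FEAT (clause placeholder)")]
advisor = {"agency_level": "autonomous", "exposure_surface": "customer",
           "domain": "fsi"}
helper = {"agency_level": "recommendation_only",
          "exposure_surface": "internal", "domain": "fsi"}

print(determine_impact(advisor, "groundedness", "executive", rules)["severity"])
print(determine_impact(helper, "groundedness", "executive", rules)["severity"])
```

Under these assumed rules, the same groundedness finding is calibrated higher for the autonomous customer-facing system than for the internal helper, which is the behaviour the Severity step describes.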

Two-way contract with Crystal Ball

📥

CB calls AIQIDE on breach

Threshold rule fires → CB sends (system_id, attribute, severity, evidence) to AIQIDE
AIQIDE returns narrative scoped to the requesting persona
CB renders, links back to source eval

📤

AIQIDE knows nothing about CB

Engine is a pure function of system context + breach event
Same engine drives Scout findings, RAGAS runs, future eval sources
Adapter pattern keeps AIQIDE source-agnostic
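
One way to picture the contract is a small adapter layer in front of a pure function. The sketch below is hypothetical: the `BreachEvent` fields follow the (system_id, attribute, severity, evidence) tuple described above, but the class names, the `explain` function, and the Scout payload keys are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Protocol


@dataclass(frozen=True)
class BreachEvent:
    """The payload named in the contract above."""
    system_id: str
    attribute: str
    severity: str
    evidence: str


class SourceAdapter(Protocol):
    """Each eval source translates its own payload into a BreachEvent."""
    def to_breach_event(self, raw: Mapping[str, Any]) -> BreachEvent: ...


class CrystalBallAdapter:
    def to_breach_event(self, raw: Mapping[str, Any]) -> BreachEvent:
        return BreachEvent(raw["system_id"], raw["attribute"],
                           raw["severity"], raw["evidence"])


class ScoutAdapter:
    # Hypothetical Scout field names, used only to show the translation step.
    def to_breach_event(self, raw: Mapping[str, Any]) -> BreachEvent:
        return BreachEvent(system_id=raw["target"],
                           attribute=raw["finding_type"],
                           severity=raw["level"],
                           evidence=raw["probe_id"])


def explain(event: BreachEvent, persona: str) -> str:
    """Pure function: output depends only on the event and the persona."""
    return (f"[{persona}] {event.attribute} breached on {event.system_id} "
            f"(severity {event.severity}); evidence: {event.evidence}")


cb_event = CrystalBallAdapter().to_breach_event({
    "system_id": "corpus-coach", "attribute": "groundedness",
    "severity": "high", "evidence": "trace-123",
})
scout_event = ScoutAdapter().to_breach_event({
    "target": "corpus-coach", "finding_type": "groundedness",
    "level": "high", "probe_id": "trace-123",
})
print(explain(cb_event, "governance"))
```

Because the engine only ever sees a `BreachEvent`, both sources produce the same narrative for the same finding, and a new eval source needs only a new adapter.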

System DNA — 5 axes today

Locked vocabulary. Each system gets a DNA tuple at onboarding. Impact rules match against tuples. Crystal Ball displays it on the Quality Lead view.

🤖

agency_level

How autonomous is the system? Recommendation only? Decisions with human approval? Decisions without approval? Affects severity directly.

⚙️

action_type

What kind of work? Generative, retrieval, classification, scoring, action-taking. Determines which attribute families are material.

🌐

exposure_surface

Who interacts with it? Internal users only, customers, regulated counterparties. Drives reputation + regulatory severity.

🏢

domain

FSI, telco, public sector, education, internal tooling. Pulls in domain-specific regulatory rules + attribute weighting.

🔒

data_sensitivity

PII, financial, regulated, public. Multiplies severity for confidentiality + integrity-class breaches.

+

architecture_pattern SIBLING (May 2026)

RAG vs fine-tuned vs prompt-engineered vs agentic. Sibling field on engagement record today; promotes to 6th DNA axis when ≥8 joint-matching rules exist.
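
As an illustration of a locked vocabulary, a DNA tuple could be validated at onboarding along these lines. The axis names come from the cards above; the value identifiers (for example `human_approved` or `regulated_counterparty`) are assumed spellings of the options listed there, not the catalogue's actual tokens.

```python
from dataclasses import dataclass

# Axis names are from the document; the value spellings are assumptions.
VOCAB = {
    "agency_level": {"recommendation_only", "human_approved", "autonomous"},
    "action_type": {"generative", "retrieval", "classification",
                    "scoring", "action_taking"},
    "exposure_surface": {"internal", "customer", "regulated_counterparty"},
    "domain": {"fsi", "telco", "public_sector", "education",
               "internal_tooling"},
    "data_sensitivity": {"pii", "financial", "regulated", "public"},
}


@dataclass(frozen=True)
class SystemDNA:
    agency_level: str
    action_type: str
    exposure_surface: str
    domain: str
    data_sensitivity: str

    def __post_init__(self):
        # Reject any value outside the locked vocabulary for its axis.
        for axis, allowed in VOCAB.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{axis}={value!r} is not in the vocabulary")

    def as_tuple(self):
        """The five-axis tuple that impact rules would match against."""
        return tuple(getattr(self, axis) for axis in VOCAB)


dna = SystemDNA("autonomous", "generative", "customer", "fsi", "pii")
print(dna.as_tuple())
```

Validating at construction time means a misspelled axis value fails at onboarding rather than silently matching no rules later.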

Why DNA + architecture pattern together

Two systems can carry identical DNA (say action_type=generative, exposure_surface=external, domain=fsi) and still have radically different failure modes. A RAG-grounded system fails on retrieval relevancy and groundedness. A fine-tuned generative system fails on hallucination and drift. The eval activities differ. Architecture pattern carries that distinction into AIQIDE's rule-matching layer so the right tests get scoped, the right thresholds apply, and the narrative reflects the actual system.
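
A rough sketch of joint matching on DNA plus architecture pattern, including the promotion check, might look like this. The rule shape (plain dicts), the function names, and the example rules are assumptions; only the threshold of 8 joint-matching rules comes from the text.

```python
# From the text: the sibling field promotes to a 6th axis at >= 8 joint rules.
PROMOTION_THRESHOLD = 8


def is_joint_rule(rule):
    """A joint rule constrains both DNA axes and the architecture pattern."""
    return bool(rule["dna"]) and rule.get("architecture_pattern") is not None


def material_attributes(dna, architecture_pattern, rules):
    """Collect attributes from every rule whose DNA and pattern both match."""
    selected = set()
    for rule in rules:
        dna_ok = all(dna.get(axis) == value
                     for axis, value in rule["dna"].items())
        pattern = rule.get("architecture_pattern")
        # A rule with no pattern applies to every architecture.
        if dna_ok and pattern in (None, architecture_pattern):
            selected.update(rule["attributes"])
    return selected


def ready_to_promote(rules):
    return sum(is_joint_rule(r) for r in rules) >= PROMOTION_THRESHOLD


shared_dna = {"action_type": "generative", "domain": "fsi"}
example_rules = [  # illustrative rules, not the approved library
    {"dna": shared_dna, "architecture_pattern": "rag",
     "attributes": ["retrieval_relevancy", "groundedness"]},
    {"dna": shared_dna, "architecture_pattern": "fine_tuned",
     "attributes": ["factual_accuracy"]},
]

print(material_attributes(shared_dna, "rag", example_rules))
print(material_attributes(shared_dna, "fine_tuned", example_rules))
print(ready_to_promote(example_rules))
```

With identical DNA, the two patterns scope different attribute sets, which is the distinction the sibling field exists to carry.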

22-attribute quality catalogue

Locked vocabulary of quality attributes. Each attribute carries DNA-applicability rules: which DNA combinations make it material, which thresholds apply, which evidence shapes count. Catalogue is being extended with RAGAS-canonical attributes (retrieval_relevancy, context_precision, context_recall, response_groundedness) in the same release as the architecture_pattern sibling field.

🎯

Accuracy family

factual_accuracy
groundedness
citation_correctness

Material for any system whose output is depended upon for correctness.

🛡️

Robustness family

adversarial_robustness
prompt_injection_resistance
edge_case_handling

Material for any system exposed to inputs it does not control.

⚖️

Fairness family

demographic_parity
equal_opportunity
treatment_consistency

Material when outcomes affect people unequally and protected attributes are in play (FSI, hiring, public sector).

🔍

Explainability family

decision_traceability
evidence_citation
persona_appropriate_explanation

Material when an auditor, regulator, or stakeholder might ask why a given output was produced.

🔒

Privacy family

pii_leakage
data_minimisation
consent_handling

Material for any system that handles regulated personal or sensitive data.

🔌

Reliability family

availability
latency
graceful_degradation
deterministic_replay

Material for any system that has to keep running in production under real load.

+

RAGAS-canonical (pending)

retrieval_relevancy
context_precision
context_recall
response_groundedness

Ships with architecture_pattern sibling field. Plugs the catalogue gap that motivates the Coverage Gap Audit.

How DNA selects attributes

The catalogue carries applicability rules per attribute, keyed on DNA. An FSI advisory system pulls in the full Accuracy + Explainability + Fairness load. An internal helper pulls a leaner Reliability + Accuracy slice. The Coverage Gap Audit deliverable runs this selection against a real system to produce its material-attribute set, then maps each attribute to the tools currently measuring it. Gaps surface as audit findings with regulatory citations attached.
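
The selection and gap-mapping steps could be sketched as below. The family membership is copied from the catalogue cards above; the applicability condition, the function names, and the tool inventory are simplified assumptions, and real applicability rules would be keyed per attribute rather than per family.

```python
# Family membership as listed in the catalogue cards above.
FAMILIES = {
    "Accuracy": ["factual_accuracy", "groundedness", "citation_correctness"],
    "Explainability": ["decision_traceability", "evidence_citation",
                       "persona_appropriate_explanation"],
    "Fairness": ["demographic_parity", "equal_opportunity",
                 "treatment_consistency"],
    "Reliability": ["availability", "latency", "graceful_degradation",
                    "deterministic_replay"],
}


def material_families(dna):
    """Assumed applicability rule mirroring the two examples in the text."""
    if dna["domain"] == "fsi" and dna["exposure_surface"] != "internal":
        return ["Accuracy", "Explainability", "Fairness"]
    return ["Reliability", "Accuracy"]


def material_attributes(dna):
    return [attr for family in material_families(dna)
            for attr in FAMILIES[family]]


def coverage_gaps(dna, tools_by_attribute):
    """Material attributes that no current tool is measuring."""
    return [attr for attr in material_attributes(dna)
            if not tools_by_attribute.get(attr)]


# Hypothetical inventory of what is measured today.
tools = {"factual_accuracy": ["Langfuse"], "groundedness": ["Langfuse"]}
fsi_advisor = {"domain": "fsi", "exposure_surface": "customer"}
internal_helper = {"domain": "internal_tooling",
                   "exposure_surface": "internal"}

print(coverage_gaps(fsi_advisor, tools))
print(coverage_gaps(internal_helper, tools))
```

In an audit, each entry in the returned gap list would become a finding with its regulatory citation attached.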

Per-attribute tool coverage

Each attribute in the 22-item catalogue has a definition, a short list of tools that typically measure it, and a status flag for whether Crystal Ball is reading from that source today. Status legend: ✓ Live = wired through a Crystal Ball adapter and visible in the demo today; ◐ Reachable = within the contract of a deployed adapter (e.g. Scout push, Langfuse trace replay) but not yet calibrated for this attribute; ○ Future = no adapter yet, requires a new tool integration.

Attribute	Family	Definition	Example tools	In demo today
`factual_accuracy`	Accuracy	Output's factual claims hold up against the underlying source-of-truth corpus or reference set.	RAGAS faithfulness, DeepEval HallucinationMetric, custom LLM-judge against reference	✓ Live — Langfuse hallucination metric on Corpus Coach
`groundedness`	Accuracy	Every claim in the output traces back to retrieved or supplied context, no fabrication.	RAGAS faithfulness/groundedness, custom LLM-judge with retrieval overlap	✓ Live — Corpus Coach groundedness via Langfuse
`citation_correctness`	Accuracy	Cited sources actually contain the claim attributed to them, and the citation pointer resolves.	Custom citation-validity check, retrieval-overlap test, source-mapping LLM-judge	◐ Reachable — Corpus Coach citations partially via Langfuse
`adversarial_robustness`	Robustness	Output stays correct under crafted adversarial inputs designed to derail the model.	Garak red-team suite, custom adversarial prompt sets, PromptFoo redteam config	○ Future — no adapter wired
`prompt_injection_resistance`	Robustness	Model refuses or neutralises instructions injected through user input or retrieved content.	Garak prompt-injection probes, PromptFoo redteam, custom injection harness	◐ Reachable — Scout adapter can probe via push
`edge_case_handling`	Robustness	Behaviour on inputs at or beyond expected distribution edges remains safe and predictable.	PromptFoo, DeepEval, custom edge-case generators, Scout exploratory probes	◐ Reachable — Scout adapter can probe via push
`demographic_parity`	Fairness	Outcome rates are similar across protected demographic groups.	AIF360, Fairlearn, custom group-comparison harness	○ Future — no adapter wired
`equal_opportunity`	Fairness	True-positive rates are similar across protected groups conditional on the true outcome.	AIF360, Fairlearn, custom counterfactual evaluator	○ Future — no adapter wired
`treatment_consistency`	Fairness	Functionally identical inputs differing only in protected attributes get equivalent treatment.	Custom paired-prompt comparison, counterfactual LLM-judge	○ Future — no adapter wired
`decision_traceability`	Explainability	Each output is reconstructable from the trace of inputs, retrievals, prompts, and intermediate steps.	Langfuse traces, custom decision-tree audit, LangSmith	✓ Live — Langfuse trace ingestion (PR #8 evidence trace ID)
`evidence_citation`	Explainability	Output surfaces the supporting evidence so a reviewer can validate the claim independently.	Custom citation extractor, retrieval-trace inspector	◐ Reachable — same instrumentation as citation_correctness
`persona_appropriate_explanation`	Explainability	Explanation depth and vocabulary fit the consuming persona (e.g. exec vs governance vs delivery).	Custom LLM-judge against persona profile, readability + audience-fit scoring	◐ Reachable — Scout adapter can probe via push
`pii_leakage`	Privacy	Output does not expose personally identifiable information beyond what was authorised.	Microsoft Presidio, regex scrubbers, custom PII probe set	○ Future — no adapter wired
`data_minimisation`	Privacy	System collects, retains, and exposes only the data necessary for the requested task.	Custom audit + retention probe, scope-creep detector	○ Future — no adapter wired
`consent_handling`	Privacy	Data flows respect captured consent state at the point of inference.	Custom consent-trace audit, integration test against consent store	○ Future — no adapter wired
`availability`	Reliability	Service responds within SLO over a measurement window.	Datadog, Grafana, Prometheus, standard APM	○ Future — APM stack not wired into a CB adapter
`latency`	Reliability	Response time at p50/p95/p99 stays within calibrated SLOs.	Datadog, Grafana, Prometheus, Langfuse latency telemetry	◐ Reachable — Langfuse traces carry latency, not yet surfaced
`graceful_degradation`	Reliability	System falls back to a safe, communicable state under partial failure rather than failing closed silently.	Chaos-engineering tools, custom dependency-failure probes	○ Future — no adapter wired
`deterministic_replay`	Reliability	Given a captured trace, the same inputs reproduce the same outputs (or a fingerprint of the divergence).	Langfuse replay, custom replay harness, fixture snapshotter	◐ Reachable — Langfuse traces enable replay, not yet automated
`retrieval_relevancy` (pending)	RAGAS	Retrieved context is on-topic for the user's question.	RAGAS context_precision, custom retrieval-relevance LLM-judge	○ Future — RAGAS adapter on Path B
`context_precision` (pending)	RAGAS	Top-ranked retrieved chunks are the ones the answer actually depends on.	RAGAS context_precision	○ Future — RAGAS adapter on Path B
`context_recall` (pending)	RAGAS	Retrieval surfaces all chunks needed to answer; nothing critical is missed.	RAGAS context_recall	○ Future — RAGAS adapter on Path B
`response_groundedness` (pending)	RAGAS	Final response is supported by the retrieved context, not invented.	RAGAS faithfulness	✓ Live (proxy) — Corpus Coach groundedness via Langfuse already covers this signal pre-RAGAS adapter
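
For readers who want to tally the status column programmatically, a small parser over tab-separated rows is enough. This is a sketch under assumptions: the three rows are abbreviated excerpts of the table above (definitions and tool lists shortened), and the function and key names are invented for the example.

```python
from collections import Counter

# Status symbols from the legend above.
STATUS = {"✓": "live", "◐": "reachable", "○": "future"}

# Abbreviated excerpt of the table; the full matrix has more rows.
ROWS = [
    "`groundedness`\tAccuracy\tClaims trace to context\tRAGAS\t✓ Live",
    "`latency`\tReliability\tResponse time within SLO\tLangfuse\t◐ Reachable",
    "`pii_leakage`\tPrivacy\tNo unauthorised PII\tPresidio\t○ Future",
]


def parse_row(line):
    """Split one tab-separated row into a small record."""
    attribute, family, definition, tools, status = line.split("\t")
    return {
        "attribute": attribute.strip("`"),
        "family": family,
        "definition": definition,
        "tools": tools,
        "status": STATUS[status[0]],  # first character is the legend symbol
    }


records = [parse_row(row) for row in ROWS]
tally = Counter(record["status"] for record in records)
print(dict(tally))
```

Running the same tally over the complete table gives a quick consistency check on any published status counts.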

"In demo today" reflects the eval adapters wired into Crystal Ball at the time of the latest demo polish series (Path A, May 8–9). Counts from the table above: 4 attributes ✓ Live, 7 attributes ◐ Reachable via deployed adapters, 12 attributes ○ Future (require a new adapter or tool integration). The Coverage Gap Audit on Path B will produce this same matrix per real-world target system, with regulatory citations attached to gaps.

Path A — shipped to date

Path B — next