01
Who cares about quality?
Tap to reveal →
The stakeholders who own the outcome
02
But QA doesn't speak the same language
Tap to reveal →
What QA delivers instead
03
The result?
Tap to reveal →
The translation failure
04
And now with AI, the problem is more challenging again
Tap to reveal →
AI makes everything harder
05
Our engine communicates AI quality in a way everyone understands
Tap to reveal →
Risk: Known

Tap card 1 to begin

Skip to persona views ↓
Typical report
What Crystal Ball delivers
Credit risk model
Model evaluation report
XGBoost ensemble achieved AUC-ROC 0.847 (95% CI: 0.831–0.863) on hold-out test set (n=12,450). Precision 0.79, recall 0.72, F1 0.754. Brier score 0.168. SHAP: income_to_debt_ratio (0.234), months_since_last_delinquency (0.189). PSI 0.043, within <0.1 stability threshold.
Crystal Ball
Business owner
The credit model is making accurate decisions. No action required before go-live.
Accuracy is within acceptable range — roughly 8 in 10 decisions are correct.
The model is reading the same type of customer it was trained on. No signs of drift.
Income-to-debt ratio is the dominant approval factor. Consistent with policy intent.
Deployment: cleared Drift: stable
Fairness analysis
Responsible AI assessment
Statistical Parity Difference (SPD) -0.037 (female vs male, threshold ±0.05). Disparate Impact Ratio (DIR) 0.91 (group A vs B, threshold >0.8). Equalised Odds TPR diff 0.028, FPR diff 0.041. Intersectional: female × group B × 18–25 DIR 0.76 — flagged for remediation.
Crystal Ball
Compliance officer
One customer segment is receiving significantly fewer approvals than comparable applicants. This requires remediation before deployment.
Women aged 18–25 in ethnic group B are approved at a rate 24% below the majority group — outside the MAS FEAT fairness threshold.
Broad gender and ethnicity comparisons pass. The problem is concentrated in this intersection only.
This is a regulatory exposure under MAS FEAT Principle 3. Evidence of the finding and remediation plan will be required for audit.
Deployment: blocked MAS FEAT: breach Remediation: required
RAG customer service chatbot
LLM evaluation report
Answer Relevancy 0.83, Faithfulness 0.91, Context Precision 0.78, Context Recall 0.85. Hallucination rate 4.2% (target <5%). Toxicity mean 0.02, max 0.11. Latency P50: 1.8s, P95: 4.2s, P99: 7.1s. Cost: USD 0.023 (GPT-4o) / USD 0.004 (Claude 3.5 Haiku).
Crystal Ball
Product owner
The chatbot passes quality gates and is ready for monitored release. One metric needs watching post-launch.
1 in 24 responses contains information not supported by source documents. Within threshold — but approaching the limit. Flag for Week 2 review.
Response quality is consistent and answers are grounded in policy documents. No toxicity risk.
The cheaper model reduces cost by 83% with no measurable quality difference at current volumes.
Deployment: cleared Hallucination: watch
The same metric.
Three completely different events.
System Agency Action / Exposure Metric Severity Action
Internal HR drafting assistant
Assistive Generative / Internal 5% hallucination Low Monitor only
Credit scoring advisory tool
Supervised Advisory / Internal 5% hallucination Medium Add review gate
Autonomous regulatory filing
Autonomous Transactional / Ext. Regulated 5% hallucination Critical Suspend immediately

Hover the teal values and severity badges to see why each factor changes the score. Agency level, action type, and exposure surface are part of every system's DNA profile — configured once, applied to every assessment.

The secret sauce
System DNA — configured once, applied to every assessment
Every AI system you connect to Crystal Ball gets a DNA profile. It captures what the system does, how autonomously it acts, and who it affects. That profile shapes every impact score automatically — you don't re-explain the context each time a metric breaches.

Two assessments with the same hallucination rate can reach opposite conclusions. The DNA profile is why.
How rules are generated ↗
Agency level
How autonomous is the system? Assistive, supervised, or fully autonomous — determines how much human oversight exists before an output has real-world effect.
Action type
What does the system actually do? Generating content, advising, or filing a transaction — directly affects reversibility. A filed document can't be un-filed.
Exposure surface
Who receives the output? Internal staff, external customers, or regulated third parties — defines the blast radius if something goes wrong.
Domain Optional
The industry context — financial services, healthcare, legal. Unlocks domain-specific regulatory references in every finding.
Data sensitivity Optional
What data does the system process? Public, internal, personal, or sensitive personal — escalates impact scores when personal data is involved in a breach.
Beyond the DNA
System DNA is the foundation, but it's not the only input. The scoring engine also accounts for how reversible the impact is — a hallucination in a filed regulatory document is categorically different to one in a draft email. It accounts for whether a breach is happening for the first time or the third. And it accounts for how confident we are in the measurement itself — if the underlying evaluation ran on too few test cases, that uncertainty is flagged and the score is adjusted accordingly. Every factor is explicit, auditable, and traceable back to an approved rule.

Each view is built for a different stakeholder. Click any persona to see Crystal Ball through their lens.

Scout Overview
What Scout does, the reasoning loop, the data model and point schema, the API surface, and the roadmap to LLM-judge observers and full eval coverage.
Crystal Ball Overview
The dashboard layer. EvalSourceAdapter contract, four persona views (executive / governance / quality lead / delivery), and how threshold breaches surface AIQIDE narratives with click-through to source eval.
AIQIDE Overview
The engine. System DNA (5 axes plus the architecture_pattern sibling), the 22-attribute catalogue with pending RAGAS extension, and how DNA scopes which evals matter for a given system.
Severity over time — viz options
Four candidate visualisations for showing quality findings over time when the underlying signal is severity (red / amber / green) rather than a numeric measurement. Includes commentary on why a percentage axis breaks for severity-encoded data.
Engine Harness
Direct interface to AIQIDE. Submit breach events, inspect raw scoring, and verify persona assessments against the live rule set.
System Lenses
Combined view of the Scout, Corpus Coach, and engine stack across three lenses: implementation status, quality posture, and journey flow.
Scout Walkthrough
Trace of an autonomous Scout run probing Corpus Coach for AML/CFT grounding failures. Step-by-step view of the probe, signals raised, and findings.
How Rules Are Generated
The agent pipeline behind the engine — how impact rules are drafted, critiqued, and approved before they enter the scoring system.
FAQ
Questions from business owners, compliance leads, and technology executives seeing AI quality translated into business language for the first time.