Crystal Ball: AI Quality Platform

01 Why This Exists

Who cares about quality?

Tap to reveal →

But QA doesn't speak the same language

Tap to reveal →

The result?

Tap to reveal →

And now with AI, the problem is more challenging again

Tap to reveal →

Our engine communicates AI quality in a way everyone understands

Tap to reveal →

Tap card 1 to begin

Skip to persona views ↓

02 Insights, not numbers

Typical report

What Crystal Ball delivers

Credit risk model

Model evaluation report

XGBoost ensemble achieved AUC-ROC 0.847 (95% CI: 0.831–0.863) on hold-out test set (n=12,450). Precision 0.79, recall 0.72, F1 0.754. Brier score 0.168. SHAP: income_to_debt_ratio (0.234), months_since_last_delinquency (0.189). PSI 0.043, within <0.1 stability threshold.

→

Crystal Ball

Business owner

The credit model is making accurate decisions. No action required before go-live.

Accuracy is within acceptable range: roughly 8 in 10 decisions are correct.

The model is reading the same type of customer it was trained on. No signs of drift.

Income-to-debt ratio is the dominant approval factor. Consistent with policy intent.

Deployment: cleared Drift: stable

Fairness analysis

Responsible AI assessment

Statistical Parity Difference (SPD) -0.037 (female vs male, threshold ±0.05). Disparate Impact Ratio (DIR) 0.91 (group A vs B, threshold >0.8). Equalised Odds TPR diff 0.028, FPR diff 0.041. Intersectional: female × group B × 18–25 DIR 0.76: flagged for remediation.

→

Crystal Ball

Compliance officer

One customer segment is receiving significantly fewer approvals than comparable applicants. This requires remediation before deployment.

Women aged 18–25 in ethnic group B are approved at a rate 24% below the majority group: outside the MAS FEAT fairness threshold.

Broad gender and ethnicity comparisons pass. The problem is concentrated in this intersection only.

This is a regulatory exposure under MAS FEAT Principle 3. Evidence of the finding and remediation plan will be required for audit.

Deployment: blocked MAS FEAT: breach Remediation: required

RAG customer service chatbot

LLM evaluation report

Answer Relevancy 0.83, Faithfulness 0.91, Context Precision 0.78, Context Recall 0.85. Hallucination rate 4.2% (target <5%). Toxicity mean 0.02, max 0.11. Latency P50: 1.8s, P95: 4.2s, P99: 7.1s. Cost: USD 0.023 (GPT-4o) / USD 0.004 (Claude 3.5 Haiku).

→

Crystal Ball

Product owner

The chatbot passes quality gates and is ready for monitored release. One metric needs watching post-launch.

1 in 24 responses contains information not supported by source documents. Within threshold: but approaching the limit. Flag for Week 2 review.

Response quality is consistent and answers are grounded in policy documents. No toxicity risk.

The cheaper model reduces cost by 83% with no measurable quality difference at current volumes.

Deployment: cleared Hallucination: watch

03 Why the same number means different things

The same metric.
Three completely different events.

System	Agency	Action / Exposure	Metric	Severity	Action
Internal HR drafting assistant	Assistive	Generative / Internal	5% hallucination	Low	Monitor only
Credit scoring advisory tool	Supervised	Advisory / Internal	5% hallucination	Medium	Add review gate
Autonomous regulatory filing	Autonomous	Transactional / Ext. Regulated	5% hallucination	Critical	Suspend immediately

Hover the teal values and severity badges to see why each factor changes the score. Agency level, action type, and exposure surface are part of every system's DNA profile: configured once, applied to every assessment.

The secret sauce

System DNA: configured once, applied to every assessment

Every AI system you connect to Crystal Ball gets a DNA profile. It captures what the system does, how autonomously it acts, and who it affects. That profile shapes every impact score automatically: you don't re-explain the context each time a metric breaches.

Two assessments with the same hallucination rate can reach opposite conclusions. The DNA profile is why.

How rules are generated ↗

Agency level

How autonomous is the system? Assistive, supervised, or fully autonomous: determines how much human oversight exists before an output has real-world effect.

Action type

What does the system actually do? Generating content, advising, or filing a transaction: directly affects reversibility. A filed document can't be un-filed.

Exposure surface

Who receives the output? Internal staff, external customers, or regulated third parties: defines the blast radius if something goes wrong.

Domain Optional

The industry context: financial services, healthcare, legal. Unlocks domain-specific regulatory references in every finding.

Data sensitivity Optional

What data does the system process? Public, internal, personal, or sensitive personal: escalates impact scores when personal data is involved in a breach.

Beyond the DNA

System DNA is the foundation, but it's not the only input. The scoring engine also accounts for how reversible the impact is: a hallucination in a filed regulatory document is categorically different to one in a draft email. It accounts for whether a breach is happening for the first time or the third. And it accounts for how confident we are in the measurement itself: if the underlying evaluation ran on too few test cases, that uncertainty is flagged and the score is adjusted accordingly. Every factor is explicit, auditable, and traceable back to an approved rule.

04 Persona Views

Each view is built for a different stakeholder. Click any persona to see Crystal Ball through their lens.

Business Owner

Customer impact, revenue exposure, and brand risk translated from technical breach events. Plain language, no metric definitions required.

Quality Lead

Breach trends, remediation guidance, impact scoring, and recurrence tracking. The operational view for the person who owns quality velocity.

Compliance & Governance

MAS FEAT alignment, regulatory reference mapping, and audit trail. Evidence pack generation for regulatory submission.

Product Owner

Sprint-level release readiness, blocker list, and quality trend vs prior sprint. Go/no-go decision support without the technical detail.

05 Other information

Scout Overview