Crystal Ball

AI Quality Dashboard for stakeholders who don't read confusion matrices

Crystal Ball ingests evaluation signals from the tools your data scientists already use (Langfuse, RAGAS, DeepEval, PromptFoo, Phoenix) and translates them into per-persona governance views. On threshold breach, it pulls a business-grade narrative from AIQIDE so non-technical stakeholders can answer the "so what" question without first learning to read eval metrics.

EvalSourceAdapter pattern · React + FastAPI · Multi-source · Persona views · AIQIDE-linked
🎭
Persona views
Executive · Governance · Quality lead · Delivery
🔌
Eval adapters
Langfuse live · RAGAS / DeepEval / PromptFoo / Phoenix planned
🧬
Per-system context
System DNA + architecture pattern
🌐
Overview and Demo
Why Build This
🎯
Quality as something stakeholders interrogate

Logo concept aside, that's the operating principle. Eval dashboards built for engineers don't translate for the people who sign off on AI deployments. Crystal Ball renders the same data through the lens that fits the role asking the question.

🔧
Plug into existing eval tooling

The market doesn't lack eval tools. It lacks a layer that consolidates their output and presents it to the people who need to consume it. Crystal Ball is that layer, built on the EvalSourceAdapter contract: connect what you already run, with no rip-and-replace.

⚖️
Threshold breach → business narrative

When a metric goes red, raw numbers don't move executives. Crystal Ball calls AIQIDE on breach to produce a persona-specific narrative: what happened, what it means commercially, what it means under MAS FEAT/TRM, what to do.

What Crystal Ball Does
🔌
Connect
Pull eval signals from each connected source on its own schedule
📊
Normalise
Each EvalSourceAdapter writes risk_assessments + eval_results into a shared schema
⚠️
Detect
Threshold rules per quality attribute, calibrated to the system's DNA
📞
Call AIQIDE
On breach, request a persona-targeted impact narrative from the engine
🎭
Render
Per-persona view: executive, governance/audit, quality lead, delivery
🔍
Click-through
Drill from narrative back to the source eval (Scout finding, RAGAS run, Langfuse trace)
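
The Detect and Call AIQIDE steps above can be sketched roughly as follows. This is a minimal illustration, not Crystal Ball's actual implementation: the `ThresholdRule` shape, the "higher is better" scoring assumption, the `faithfulness` attribute, and the payload fields are all hypothetical, and the real AIQIDE request format is not described here. The only behaviour taken from the text is that a narrative is requested when a metric goes red.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule shape; assumes higher scores are better."""
    attribute: str
    amber_below: float  # scores below this are Amber
    red_below: float    # scores below this are Red


def classify(rule: ThresholdRule, score: float) -> str:
    """Map a raw score to a Red/Amber/Green status."""
    if score < rule.red_below:
        return "red"
    if score < rule.amber_below:
        return "amber"
    return "green"


def narrative_request(system: str, rule: ThresholdRule, score: float,
                      persona: str) -> Optional[dict]:
    """Build an illustrative breach payload, or None when there is no breach."""
    status = classify(rule, score)
    if status != "red":
        return None
    # Field names are assumptions, not the AIQIDE contract.
    return {
        "system": system,
        "attribute": rule.attribute,
        "score": score,
        "status": status,
        "persona": persona,
    }


rule = ThresholdRule(attribute="faithfulness", amber_below=0.8, red_below=0.6)
print(classify(rule, 0.72))
print(narrative_request("VFA", rule, 0.41, persona="executive"))
```

In this sketch the rule values would come from per-system calibration; here they are hard-coded purely for illustration.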
Three Demo Systems — Same Dashboard
🏦
VFA

Autonomous FSI advisory. High severity, external exposure, MAS FEAT + TRM in scope. Used as the anchor system in the current pilot.

🔍
Scout

Autonomous AI exploratory testing agent. Medium severity, internal exposure. Findings feed Crystal Ball as another eval source via the ScoutAdapter.

📚
Corpus Coach

Assistive RAG over the MAS regulatory corpus. Low severity, internal, educational. Same framework, different severity calibration, different regulatory path.

EvalSourceAdapter contract

Eval tools ship findings in their own shape. Adapters translate each tool's output into Crystal Ball's two canonical tables, then go quiet. Adding a new tool is one adapter, no schema work.

📥
risk_assessments (current state)
  • One row per (system, quality attribute) — current Red/Amber/Green
  • Drives the per-system overview cards on the dashboard
  • Adapter writes on every fresh eval run
📈
eval_results (history)
  • Insert-only timeseries — every measurement preserved
  • Drives sparklines, trend charts, regression detection
  • Adapter writes on every measurement event (run, finding, trace)
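
One way to picture the contract and the two tables is the sketch below. It assumes an in-memory store and hypothetical field names (`value`, `source`, `measured_at`, `updated_at`); the real schema and the adapter method signature are not given in this overview. What it does mirror from the text is the write pattern: eval_results is append-only, while risk_assessments keeps one current row per (system, quality attribute).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class RiskAssessment:
    """Current state: one row per (system, quality attribute)."""
    system: str
    attribute: str
    status: str  # "red" | "amber" | "green"
    updated_at: datetime


@dataclass(frozen=True)
class EvalResult:
    """History: one insert-only row per measurement event."""
    system: str
    attribute: str
    value: float
    source: str
    measured_at: datetime


class EvalSourceAdapter(Protocol):
    """Hypothetical contract: translate one tool's output into EvalResult rows."""
    source_name: str

    def fetch(self, system: str) -> list[EvalResult]:
        ...


class InMemoryStore:
    """Stand-in for the shared schema, for illustration only."""

    def __init__(self) -> None:
        self.risk_assessments: dict[tuple[str, str], RiskAssessment] = {}
        self.eval_results: list[EvalResult] = []

    def record(self, result: EvalResult, status: str) -> None:
        # History is append-only: every measurement is preserved.
        self.eval_results.append(result)
        # Current state is an upsert keyed on (system, attribute).
        key = (result.system, result.attribute)
        self.risk_assessments[key] = RiskAssessment(
            result.system, result.attribute, status, result.measured_at
        )


store = InMemoryStore()
now = datetime.now(timezone.utc)
store.record(EvalResult("VFA", "faithfulness", 0.91, "langfuse", now), "green")
store.record(EvalResult("VFA", "faithfulness", 0.55, "langfuse", now), "red")
print(len(store.eval_results), len(store.risk_assessments))
```

Under these assumptions, a new tool only needs a class with a `source_name` and a `fetch` method; the store and both tables stay untouched.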
Eval adapters live + planned
Langfuse

Live. Pulls trace + score data. Confirmed running in a pilot client's production environment.

Scout

Push-based. Findings → risk_assessments + synthetic eval_results trend points. Works against any system Scout can probe.

RAGAS

Planned. Independent scheduling against Corpus Coach as reference target.

DeepEval / PromptFoo / Phoenix

Planned. DeepEval G-Eval calibration in progress on the Scout side.
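
Scout's push-based path described above could look something like the sketch below. It is illustrative only: the `ScoutFinding` fields, the severity-to-status mapping, and the choice of "open findings per attribute" as the synthetic trend value are all assumptions, not the ScoutAdapter's documented behaviour.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoutFinding:
    """Hypothetical shape of one pushed finding."""
    system: str
    attribute: str
    severity: str  # "low" | "medium" | "high"
    is_open: bool = True


# Assumed mapping; the real calibration is not described in this overview.
SEVERITY_TO_STATUS = {"low": "green", "medium": "amber", "high": "red"}
RANK = {"green": 0, "amber": 1, "red": 2}


def handle_push(findings):
    """Turn one pushed batch into (risk rows, synthetic trend points)."""
    risk = {}
    open_counts = Counter()
    for f in findings:
        if not f.is_open:
            continue
        key = (f.system, f.attribute)
        status = SEVERITY_TO_STATUS[f.severity]
        # Keep the worst status seen for each (system, attribute).
        if key not in risk or RANK[status] > RANK[risk[key]]:
            risk[key] = status
        open_counts[key] += 1
    # Synthetic trend value: number of open findings per attribute.
    trend = [(s, a, float(n)) for (s, a), n in sorted(open_counts.items())]
    return risk, trend


batch = [
    ScoutFinding("VFA", "safety", "high"),
    ScoutFinding("VFA", "safety", "low"),
    ScoutFinding("VFA", "accuracy", "medium", is_open=False),
]
print(handle_push(batch))
```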

System DNA + architecture pattern

Each system in Crystal Ball carries a DNA tuple (agency_level, action_type, exposure_surface, domain, data_sensitivity) plus an architecture_pattern sibling field. AIQIDE uses both to scope which quality attributes are material and which thresholds apply. The Quality Lead view surfaces the DNA so a stakeholder can see the system's classification at a glance.
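
As a rough sketch, the DNA tuple and its sibling field could be modelled like this. The five DNA field names and `architecture_pattern` come from the description above; the field values, the attribute names, and the scoping rules in `material_attributes` are invented for illustration and are not AIQIDE's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemDNA:
    agency_level: str       # e.g. "assistive" or "autonomous" (assumed values)
    action_type: str
    exposure_surface: str   # e.g. "internal" or "external" (assumed values)
    domain: str
    data_sensitivity: str


@dataclass(frozen=True)
class SystemProfile:
    name: str
    dna: SystemDNA
    architecture_pattern: str  # sibling field, kept outside the DNA tuple


def material_attributes(profile: SystemProfile) -> set[str]:
    """Hypothetical scoping rules: which quality attributes matter here."""
    attrs = {"accuracy"}
    if profile.dna.exposure_surface == "external":
        attrs.add("safety")
    if profile.dna.agency_level == "autonomous":
        attrs.add("action_correctness")
    if profile.architecture_pattern == "rag":
        attrs.add("groundedness")
    return attrs


corpus_coach = SystemProfile(
    name="Corpus Coach",
    dna=SystemDNA("assistive", "answer", "internal", "regulatory", "low"),
    architecture_pattern="rag",
)
print(sorted(material_attributes(corpus_coach)))
```

Keeping `architecture_pattern` outside the frozen DNA dataclass mirrors the "sibling field" wording: the two can be read together without making the pattern part of the tuple itself.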

Per-persona renderings of the same data

Same risk_assessments, same eval_results, four different shapes. Each persona asks a different question; the rendering is fitted to the question, not the data shape.

👔
Executive view

Question: Are we OK to ship and scale?

  • Risk heatmap (Red/Amber/Green) per system
  • Trend over time — direction, not detail
  • Business impact translation per breach
  • Release readiness traffic light
🛡
Governance / audit view

Question: Can we defend this to the regulator?

  • Compliance posture vs MAS FEAT / TRM
  • Test coverage per regulatory obligation
  • Audit trail per finding
  • Evidence pack export for regulators
🔧
Quality lead view

Question: What needs my attention this sprint?

  • System DNA + architecture pattern visible per project
  • Material-attribute coverage map
  • Threshold breaches with click-through to source eval
  • Scout finding nests with severity + status pills
🚀
Delivery view

Question: Can I release this build?

  • Release readiness checklist (per attribute, per release)
  • Regression alerts when a previously-green attribute slips
  • Per-build evidence trail
  • Faster detection, faster release confidence
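
To make "same data, different shapes" concrete, the sketch below derives two of the views above from one list of status points: a per-system heatmap for the executive view and regression alerts for the delivery view. The `StatusPoint` shape, the `seq` ordering field, and the worst-status-wins aggregation are assumptions made for this example.

```python
from typing import NamedTuple

RANK = {"green": 0, "amber": 1, "red": 2}


class StatusPoint(NamedTuple):
    """Hypothetical row: one status observation; larger seq is newer."""
    system: str
    attribute: str
    status: str
    seq: int


def executive_heatmap(points):
    """Per-system status: the worst latest status across its attributes."""
    latest = {}
    for p in sorted(points, key=lambda p: p.seq):
        latest[(p.system, p.attribute)] = p.status
    heat = {}
    for (system, _attribute), status in latest.items():
        if system not in heat or RANK[status] > RANK[heat[system]]:
            heat[system] = status
    return heat


def regression_alerts(points):
    """(system, attribute) pairs that were green and have now slipped."""
    history = {}
    for p in sorted(points, key=lambda p: p.seq):
        history.setdefault((p.system, p.attribute), []).append(p.status)
    return sorted(
        key for key, statuses in history.items()
        if len(statuses) >= 2 and statuses[-2] == "green" and statuses[-1] != "green"
    )


points = [
    StatusPoint("VFA", "accuracy", "green", 1),
    StatusPoint("VFA", "accuracy", "red", 2),
    StatusPoint("VFA", "safety", "green", 1),
    StatusPoint("Scout", "accuracy", "amber", 1),
]
print(executive_heatmap(points))
print(regression_alerts(points))
```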
Path A — shipped to date
Path B — next