Crystal Ball

AI Quality Dashboard for stakeholders who don't read confusion matrices

Crystal Ball ingests evaluation signals from the tools your data scientists already use (Langfuse, RAGAS, DeepEval, PromptFoo, Phoenix) and translates them into per-persona governance views. On threshold breach, it pulls a business-grade narrative from Prism so non-technical stakeholders can answer the "so what" question without retraining.

EvalSourceAdapter pattern · React + FastAPI · Multi-source · Persona views · Prism-linked

🎭

Persona views

Executive · Governance · Quality lead · Delivery

🔌

Eval adapters

Langfuse live · RAGAS / DeepEval / PromptFoo planned

🧬

Per-system context

System DNA + architecture pattern

🌐

Overview and Demo

DEMO →

Why Build This

🎯

Quality as something stakeholders interrogate

Logo concept aside, that is the operating principle. Eval dashboards built for engineers don't translate for the people who sign off on AI deployments. Crystal Ball renders the same data through the lens that fits the role asking the question.

🔧

Plug into existing eval tooling

The market doesn't lack eval tools. It lacks a layer that consolidates their output and presents it to the people who need to consume it. The EvalSourceAdapter contract lets you connect what you already run, with no rip-and-replace.

⚖️

Threshold breach → business narrative

When a metric goes red, raw numbers don't move executives. Crystal Ball calls Prism on breach to produce a persona-specific narrative: what happened, what it means commercially, what it means under MAS FEAT/TRM, what to do.

What Crystal Ball Does

🔌

Connect

Pull eval signals from each connected source on its own schedule

›

📊

Normalise

Each EvalSourceAdapter writes risk_assessments + eval_results into the shared schema

›

⚠️

Detect

Threshold rules per quality attribute, calibrated to the system's DNA

📞

Call Prism

On breach, request a persona-targeted impact narrative from the engine

›

🎭

Render

Per-persona view: executive, governance/audit, quality lead, delivery

›

🔍

Click-through

Drill from narrative back to the source eval (Scout finding, RAGAS run, Langfuse trace)
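
The Detect step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Crystal Ball's implementation: the Threshold class, the THRESHOLDS table, the severity keys, the "faithfulness" attribute, and every numeric cut-off are hypothetical, chosen only to show how the same score can land on a different colour depending on how a system is calibrated. Treating only Red as a breach is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Cut-offs for one quality attribute (hypothetical shape)."""
    amber_below: float  # scores under this value are at least Amber
    red_below: float    # scores under this value are Red


# Hypothetical calibration: a high-severity system gets stricter cut-offs
# than a low-severity one for the same quality attribute.
THRESHOLDS = {
    ("faithfulness", "high"): Threshold(amber_below=0.90, red_below=0.80),
    ("faithfulness", "low"): Threshold(amber_below=0.75, red_below=0.60),
}


def classify(attribute: str, severity: str, score: float) -> str:
    """Map a raw score to red / amber / green for a given calibration."""
    rule = THRESHOLDS[(attribute, severity)]
    if score < rule.red_below:
        return "red"
    if score < rule.amber_below:
        return "amber"
    return "green"


def is_breach(status: str) -> bool:
    """Assumption: only a Red status triggers the narrative request."""
    return status == "red"
```

With these made-up numbers, a faithfulness score of 0.85 is Amber on a high-severity system and Green on a low-severity one, which is the point of calibrating per system rather than per metric.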

Three Demo Systems: Same Dashboard

🏦

VFA

Autonomous FSI advisory demo system. High severity, external exposure, MAS FEAT + TRM in scope.

🔍

Scout

Autonomous AI exploratory testing agent. Medium severity, internal exposure. Findings feed Crystal Ball as another eval source via the ScoutAdapter.

📚

Corpus Coach

Assistive RAG over the MAS regulatory corpus. Low severity, internal, educational. Same framework, different severity calibration, different regulatory path.

EvalSourceAdapter contract

Eval tools ship findings in their own shape. Adapters translate each tool's output into Crystal Ball's two canonical tables, then go quiet. Adding a new tool is one adapter, no schema work.
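
One way to picture the contract is as a small Python protocol. The sketch below is an assumption about shape, not the actual interface: the EvalResult fields, the fetch method, the StaticAdapter class, and the payload keys ("app", "metric", "value", "id") are all hypothetical. Only the name EvalSourceAdapter comes from Crystal Ball itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class EvalResult:
    """One normalised measurement (hypothetical field names)."""
    system: str
    attribute: str
    score: float
    source_ref: str  # pointer back to the originating run, finding, or trace
    measured_at: datetime


class EvalSourceAdapter(Protocol):
    """Anything with a name and a fetch() returning normalised results."""
    name: str

    def fetch(self) -> list[EvalResult]: ...


class StaticAdapter:
    """Toy adapter: translates one tool-specific payload shape."""
    name = "static-demo"

    def __init__(self, payloads: list[dict]) -> None:
        self._payloads = payloads

    def fetch(self) -> list[EvalResult]:
        now = datetime.now(timezone.utc)
        return [
            EvalResult(
                system=p["app"],
                attribute=p["metric"],
                score=float(p["value"]),
                source_ref=p["id"],
                measured_at=now,
            )
            for p in self._payloads
        ]


def collect(adapters: list[EvalSourceAdapter]) -> list[EvalResult]:
    """Gather normalised results from every connected adapter."""
    results: list[EvalResult] = []
    for adapter in adapters:
        results.extend(adapter.fetch())
    return results
```

Under this shape, supporting another tool means writing one more class with a fetch method; the code that consumes EvalResult rows does not change.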

📥

risk_assessments (current state)

One row per (system, quality attribute): current Red/Amber/Green
Drives the per-system overview cards on the dashboard
Adapter writes on every fresh eval run

📈

eval_results (history)

Insert-only timeseries: every measurement preserved
Drives sparklines, trend charts, regression detection
Adapter writes on every measurement event (run, finding, trace)
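
The difference between the two tables (current state versus history) can be sketched with an in-memory SQLite database. The table names come from the description above; the column names, the status check, the record_measurement helper, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Current state: exactly one row per (system, quality attribute).
    CREATE TABLE risk_assessments (
        system    TEXT NOT NULL,
        attribute TEXT NOT NULL,
        status    TEXT NOT NULL CHECK (status IN ('red', 'amber', 'green')),
        PRIMARY KEY (system, attribute)
    );
    -- History: insert-only, every measurement is kept.
    CREATE TABLE eval_results (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        system      TEXT NOT NULL,
        attribute   TEXT NOT NULL,
        score       REAL NOT NULL,
        measured_at TEXT NOT NULL
    );
    """
)


def record_measurement(system, attribute, score, status, measured_at):
    """Append to the history and overwrite the current state atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO eval_results (system, attribute, score, measured_at) "
            "VALUES (?, ?, ?, ?)",
            (system, attribute, score, measured_at),
        )
        conn.execute(
            "INSERT OR REPLACE INTO risk_assessments (system, attribute, status) "
            "VALUES (?, ?, ?)",
            (system, attribute, status),
        )


# Made-up sample measurements for a single (system, attribute) pair.
record_measurement("corpus-coach", "faithfulness", 0.91, "green", "2024-01-01T00:00:00Z")
record_measurement("corpus-coach", "faithfulness", 0.58, "red", "2024-01-02T00:00:00Z")
```

After the two calls, eval_results holds both measurements while risk_assessments holds a single row with the latest status, which is what lets one table drive the overview cards and the other drive the trend charts.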

Eval adapters live + planned

✓

Langfuse

Live adapter. Pulls trace + score data from the running demo stack.

◐

Scout

Push-based. Findings → risk_assessments + synthetic eval_results trend points. Works against any system Scout can probe.

○

RAGAS

Planned. Independent scheduling against Corpus Coach as reference target.

○

DeepEval / PromptFoo / Phoenix

Planned. DeepEval G-Eval calibration is in progress on the Scout side.

System DNA + architecture pattern

Each system in Crystal Ball carries a DNA tuple (agency_level, action_type, exposure_surface, domain, data_sensitivity) plus an architecture_pattern sibling field. Prism uses both to scope which quality attributes are material and which thresholds apply. The Quality Lead view surfaces the DNA so a stakeholder can see the system's classification at a glance.
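
As a rough sketch, the DNA tuple and its sibling field could be modelled as two small dataclasses. The five DNA field names and architecture_pattern come from the description above; the SystemProfile wrapper, the example values, and the material_attributes rules are hypothetical stand-ins for scoping logic that, per the description, belongs to Prism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemDNA:
    agency_level: str
    action_type: str
    exposure_surface: str
    domain: str
    data_sensitivity: str


@dataclass(frozen=True)
class SystemProfile:
    name: str
    dna: SystemDNA
    architecture_pattern: str  # sibling field, kept outside the DNA tuple


def material_attributes(profile: SystemProfile) -> set[str]:
    """Hypothetical scoping rules; the real ones are not documented here."""
    attributes = {"accuracy"}
    if profile.architecture_pattern == "rag":
        attributes.add("faithfulness")
    if profile.dna.exposure_surface == "external":
        attributes.add("safety")
    return attributes


# Example values are placeholders, loosely based on the Corpus Coach card.
corpus_coach = SystemProfile(
    name="Corpus Coach",
    dna=SystemDNA(
        agency_level="assistive",
        action_type="answer",        # assumed value
        exposure_surface="internal",
        domain="regulatory",
        data_sensitivity="public",   # assumed value
    ),
    architecture_pattern="rag",
)
```

Keeping architecture_pattern outside the tuple mirrors the "sibling field" wording: the DNA describes what the system is allowed to do and where, while the pattern describes how it is built.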

Per-persona renderings of the same data

Same risk_assessments, same eval_results, four different shapes. Each persona is asking a different question; the rendering is fitted to the question, not the data shape.

👔

Executive view

Question: Are we OK to ship and scale?