Severity over time — visualisation options

Four ways to render quality findings when the underlying signal is severity (red / amber / green) rather than a numeric score.

Why this page exists

Crystal Ball's drilldown view was built around real numeric measurements — RAGAS faithfulness, hallucination_rate, and similar scores produced by evaluation harnesses (Langfuse, RAGAS, DeepEval) on a 0.0 – 1.0 scale. Threshold lines, percentages, line charts: all of that assumes the underlying data is a measurement.

Scout, our autonomous exploratory testing agent, does not produce measurements. It produces findings, each tagged with a severity (high / medium / low → red / amber / green). To make Scout findings visible on the same trend chart that holds RAGAS scores, the dispatch layer writes a synthetic eval_results row encoding severity as a number: red = 1.0, amber = 0.5, green = 0.0. The convention is "higher = worse", which works for the sparkline (line going up = bad).
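A minimal sketch of that encoding, for illustration only. The EvalResultRow class, its fields other than metric_value and metric_scale, and the encode_finding helper are hypothetical names; tagging the synthetic row with metric_scale = 'ordinal_risk' is an assumption based on the detection rule in the recommendation, not a description of the actual dispatch layer.

```python
from dataclasses import dataclass

# Mapping as described in the text: Scout severity -> colour -> synthetic value.
SEVERITY_TO_COLOUR = {"high": "red", "medium": "amber", "low": "green"}
COLOUR_TO_VALUE = {"red": 1.0, "amber": 0.5, "green": 0.0}  # higher = worse


@dataclass
class EvalResultRow:
    """Hypothetical shape of a synthetic eval_results row."""
    dimension: str
    metric_value: float
    metric_scale: str  # assumed: 'ordinal_risk' marks severity-encoded rows


def encode_finding(dimension: str, severity: str) -> EvalResultRow:
    """Turn a Scout finding's severity into a synthetic numeric row."""
    colour = SEVERITY_TO_COLOUR[severity]
    return EvalResultRow(
        dimension=dimension,
        metric_value=COLOUR_TO_VALUE[colour],
        metric_scale="ordinal_risk",
    )


row = encode_finding("groundedness", "high")
print(row.metric_value, row.metric_scale)
```

Keeping the scale tag on the row is what lets a downstream view tell a severity-encoded 1.0 apart from a measured 1.0.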

The encoding works for the sparkline. It does not work for the drilldown view, where the same value gets rendered as a percentage on a higher-is-better metric like groundedness. A finding-driven 1.0 reads as "100% grounded" sitting next to a "very high risk" badge — a visible contradiction.

Today the drilldown patches around this by labelling severity-encoded values explicitly: "100% (severity-encoded — no direct measurement available)". Honest, but cosmetic. The right answer is to render severity-shaped data with severity-shaped visualisations — which is what the four options below explore.

Once Corpus Coach (or any anchor system) is instrumented with a real groundedness evaluator (RAGAS faithfulness, claim-by-claim verification against context), the drilldown will receive a real metric_value on a real scale, and the existing percentage view applies. Until then, severity is the signal and the views below are how it should be presented.

Option 1 — Severity strip timeline (event ticks by severity tier)

When to use: when each finding is an individual event with a severity, not a score on a scale. Shows density of findings plus their severity over time without pretending each is a numeric measurement.

Pros: honest about the data shape (events, not numbers). Clear severity legend. Hover reveals individual finding context. Works for any time window.

Cons: no aggregate trend at a glance — viewer has to interpret density themselves.

Legend: red = critical; amber = high / medium; green = resolved / low.

Option 2 — Open findings by severity (stacked area, cumulative)

When to use: when leadership needs a single-glance read of "is the open backlog growing or shrinking, and at what severity". Treats findings as a queue: opened, ageing, resolved. Common in security dashboards.

Pros: immediately answers "are we getting better or worse". Severity composition visible in the stack thickness. Maps cleanly to remediation work-in-progress.

Cons: requires Scout to carry resolution state on each finding (open / in remediation / resolved). Today we only emit "found"; emitting a closure event is extra plumbing.
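A rough sketch of the series a stacked area chart would need, assuming each finding carried an opened date and an optional resolved date. The Finding type and the resolved field are hypothetical, since Scout does not emit closure events today.

```python
from collections import Counter
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    severity: str             # "red" | "amber" | "green"
    opened: date
    resolved: Optional[date]  # hypothetical field; None while still open


def open_counts_by_day(findings, start, end):
    """Return {day: Counter(severity -> open count)} for each day in [start, end].

    A finding counts as open from its opened date up to, but not including,
    its resolved date.
    """
    series = {}
    day = start
    while day <= end:
        series[day] = Counter(
            f.severity
            for f in findings
            if f.opened <= day and (f.resolved is None or f.resolved > day)
        )
        day += timedelta(days=1)
    return series


findings = [
    Finding("red", date(2024, 1, 1), date(2024, 1, 3)),
    Finding("amber", date(2024, 1, 2), None),
]
series = open_counts_by_day(findings, date(2024, 1, 1), date(2024, 1, 3))
for day, counts in series.items():
    print(day, dict(counts))
```

Each Counter gives the layer heights for one day, one layer per severity.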

Option 3 — Calendar heatmap (max severity per day)

When to use: when the audience reviews trends over weeks or months and wants an "is the pattern getting worse" read at a glance. Same idea as the GitHub contribution graph or a Datadog incident calendar.

Pros: one cell per day, colour speaks for itself. Easy to scan recurring problem days. Compresses a lot of history into small space.

Cons: loses intra-day detail (multiple findings on the same day collapse to one cell). The finding count shown in each cell compensates, but the number is small.

Legend: all green; worst = amber; worst = red; no findings.
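The per-day aggregation behind the heatmap could look roughly like this, assuming findings arrive as (timestamp, colour) pairs. The daily_cells function and the RANK ordering are illustrative names, not existing code.

```python
from collections import defaultdict
from datetime import datetime

# Ordering used to pick the worst colour of the day.
RANK = {"green": 0, "amber": 1, "red": 2}


def daily_cells(findings):
    """Collapse findings into one heatmap cell per day.

    findings: iterable of (datetime, colour) pairs.
    Returns {date: (worst_colour, finding_count)}. Days that are absent
    from the result correspond to the "no findings" cell.
    """
    by_day = defaultdict(list)
    for timestamp, colour in findings:
        by_day[timestamp.date()].append(colour)
    return {
        day: (max(colours, key=RANK.__getitem__), len(colours))
        for day, colours in by_day.items()
    }


cells = daily_cells([
    (datetime(2024, 1, 1, 9, 0), "green"),
    (datetime(2024, 1, 1, 14, 30), "red"),
    (datetime(2024, 1, 2, 11, 0), "amber"),
])
print(cells)
```

The count in each tuple is the number that partly offsets the lost intra-day detail.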

Option 4 — Risk score over time (weighted severity sum)

When to use: when stakeholders demand a single number and a line. Each finding contributes a weight (red = 10, amber = 5, green = 1); rolling 7-day window plotted as a line. Closest to today's trend chart experience without faking a percentage.

Pros: familiar shape (line going up = bad). Single composite KPI. Easy to put a threshold band on (e.g. red if score > 30).

Cons: the score is opaque — "what does 47 mean?". Loses individual finding visibility. Must be paired with a "click for breakdown" drill-through so it's not just a magic number.
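A small sketch of the weighted score, using the weights and 7-day window given above. The function name, the input shape, and the choice of a trailing window that includes the current day are assumptions made for illustration.

```python
from datetime import date, timedelta

# Weights from the option description.
WEIGHTS = {"red": 10, "amber": 5, "green": 1}


def rolling_risk_score(findings, start, end, window_days=7):
    """Return [(day, score)] for each day in [start, end].

    findings: list of (date, colour) pairs.
    score: sum of weights of findings dated within the trailing window of
    window_days days ending on that day (inclusive).
    """
    series = []
    day = start
    while day <= end:
        window_start = day - timedelta(days=window_days - 1)
        score = sum(
            WEIGHTS[colour]
            for found_on, colour in findings
            if window_start <= found_on <= day
        )
        series.append((day, score))
        day += timedelta(days=1)
    return series


findings = [
    (date(2024, 1, 1), "red"),
    (date(2024, 1, 3), "amber"),
    (date(2024, 1, 8), "green"),
]
scores = dict(rolling_risk_score(findings, date(2024, 1, 1), date(2024, 1, 9)))
print(scores[date(2024, 1, 7)], scores[date(2024, 1, 8)])
```

In this example the score drops on 8 January because the red finding from 1 January leaves the window. That kind of movement is hard to read without a breakdown.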

Working recommendation

Default the drilldown to Option 1 (severity strip) for any dimension where the underlying data is severity-encoded. Pair with a small KPI block showing open finding counts by severity (Option 2 simplified). Hide the percentage axis.

When a dimension has real numeric measurements (RAGAS faithfulness, hallucination_rate from Langfuse, etc.), keep the existing line chart. Detection at the data layer: if all eval_results rows for the dimension have metric_scale == 'ordinal_risk', render the severity views; otherwise render the numeric view.
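The detection rule can be sketched as follows, assuming rows are available as dictionaries with a metric_scale key. The choose_view name and the decision to fall back to the numeric view when a dimension has no rows are assumptions, not settled behaviour.

```python
def choose_view(rows):
    """Pick the drilldown view for one dimension.

    rows: list of dicts representing eval_results rows for that dimension.
    Returns "severity" only if every row is tagged ordinal_risk; a mix of
    scales, or an empty list (assumed default), falls back to "numeric".
    """
    if rows and all(r.get("metric_scale") == "ordinal_risk" for r in rows):
        return "severity"
    return "numeric"


scout_only = [{"metric_scale": "ordinal_risk", "metric_value": 1.0}]
mixed = scout_only + [{"metric_scale": "ratio", "metric_value": 0.82}]
print(choose_view(scout_only), choose_view(mixed))
```

The "ratio" scale value in the usage example is a placeholder; the source names only 'ordinal_risk'.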

Option 4 (weighted risk score) is the easiest "drop in next to existing chart" but it adds a synthetic number that needs explaining — which is exactly the trap we're trying to escape from with the current 100% / very-high-risk contradiction.

Open to discussion. The right answer depends on how the audience reads the page — whether they read severity natively (Option 1) or want a line they can interpret as "trending up = bad" (Option 4 with the right captioning).