Severity over time — visualisation options
Four ways to render quality findings when the underlying signal is severity (red / amber / green) rather than a numeric score.
Option 1 — Severity strip timeline (event ticks by severity tier)
When to use: when each finding is an individual event with a severity, not a score on a scale. Shows density of findings plus their severity over time without pretending each is a numeric measurement.
Pros: honest about the data shape (events, not numbers). Clear severity legend. Hover reveals individual finding context. Works for any time window.
Cons: no aggregate trend at a glance — viewer has to interpret density themselves.
Option 2 — Open findings by severity (stacked area, cumulative)
When to use: when leadership needs a single-glance read of "is the open backlog growing or shrinking, and at what severity". Treats findings as a queue: opened, ageing, resolved. Common in security dashboards.
Pros: immediately answers "are we getting better or worse". Severity composition visible in the stack thickness. Maps cleanly to remediation work-in-progress.
Cons: requires Scout to carry resolution state on each finding (open / in remediation / resolved). Today Scout only emits "found"; emitting a closure event is extra plumbing.
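A minimal sketch of the queue arithmetic behind this view, assuming each finding carries an opened date and an optional resolved date. The `Finding` class, its `resolved` field, and `open_counts_by_day` are hypothetical names for illustration; Scout does not emit closure data today.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SEVERITIES = ("red", "amber", "green")


@dataclass
class Finding:
    severity: str                    # "red" | "amber" | "green"
    opened: date                     # day the finding was emitted
    resolved: Optional[date] = None  # hypothetical closure date (not emitted today)


def open_counts_by_day(findings, start, end):
    """Return {day: {severity: open_count}} for every day in [start, end]."""
    series = {}
    day = start
    while day <= end:
        counts = {s: 0 for s in SEVERITIES}
        for f in findings:
            # Open from the opened date up to (not including) the resolved date.
            if f.opened <= day and (f.resolved is None or day < f.resolved):
                counts[f.severity] += 1
        series[day] = counts
        day += timedelta(days=1)
    return series
```

Each day's dictionary is one vertical slice of the stacked area; a finding with no `resolved` date stays in the stack indefinitely, which is why the closure event matters.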
Option 3 — Calendar heatmap (max severity per day)
When to use: when the audience reviews trends over weeks or months and wants an at-a-glance read of "is the pattern getting worse". Same idea as the GitHub contribution graph or the Datadog incident calendar.
Pros: one cell per day, colour speaks for itself. Easy to scan recurring problem days. Compresses a lot of history into small space.
Cons: loses intra-day detail (multiple findings on the same day collapse into one cell). A finding count shown in each cell compensates, but the number is small.
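The per-day collapse can be sketched as follows, assuming findings arrive as `(timestamp, severity)` pairs. The function name `daily_cells` and the input shape are assumptions for illustration, not the existing data model.

```python
from datetime import datetime

# Ordering used to pick the worst severity of the day.
SEVERITY_RANK = {"green": 0, "amber": 1, "red": 2}


def daily_cells(findings):
    """Collapse (timestamp, severity) pairs into {date: (max_severity, count)}."""
    cells = {}
    for ts, sev in findings:
        day = ts.date()
        worst, count = cells.get(day, (sev, 0))
        if SEVERITY_RANK[sev] > SEVERITY_RANK[worst]:
            worst = sev
        cells[day] = (worst, count + 1)
    return cells
```

The first element of each tuple drives the cell colour and the second is the small count printed in the cell. Days with no findings are simply absent and would render as empty cells.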
Option 4 — Risk score over time (weighted severity sum)
When to use: when stakeholders demand a single number and a line. Each finding contributes a weight (red = 10, amber = 5, green = 1); rolling 7-day window plotted as a line. Closest to today's trend chart experience without faking a percentage.
Pros: familiar shape (line going up = bad). Single composite KPI. Easy to put a threshold band on (e.g. red if score > 30).
Cons: the score is opaque — "what does 47 mean?". Loses individual finding visibility. Must be paired with a "click for breakdown" drill-through so it's not just a magic number.
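A minimal sketch of the weighted sum, using the weights stated above (red = 10, amber = 5, green = 1). It assumes findings are `(date, severity)` pairs and that the 7-day window is inclusive of the plotted day; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

WEIGHTS = {"red": 10, "amber": 5, "green": 1}


def _in_window(d, day, window_days):
    # Window covers `window_days` calendar days ending on `day` (inclusive).
    return day - timedelta(days=window_days - 1) <= d <= day


def rolling_risk_score(findings, day, window_days=7):
    """Sum severity weights over the window ending on `day`."""
    return sum(WEIGHTS[sev] for d, sev in findings if _in_window(d, day, window_days))


def risk_breakdown(findings, day, window_days=7):
    """Per-severity counts behind the score, for the drill-through."""
    return dict(Counter(sev for d, sev in findings if _in_window(d, day, window_days)))
```

The breakdown is what makes the number explainable: a score of 47 could be four red, one amber and two green findings (40 + 5 + 2), or nine amber and two green (45 + 2), and the line alone cannot tell those apart.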
Working recommendation
Default the drilldown to Option 1 (severity strip) for any dimension where the underlying data is severity-encoded. Pair with a small KPI block showing open finding counts by severity (Option 2 simplified). Hide the percentage axis.
When a dimension has real numeric measurements (RAGAS faithfulness, hallucination_rate from Langfuse, etc.), keep the existing line chart. Detection at the data layer: if all eval_results rows for the dimension have metric_scale == 'ordinal_risk', render the severity views; otherwise render the numeric view.
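The detection rule can be sketched as below, assuming each eval_results row is available as a dictionary with a `metric_scale` key. The function name `choose_view` and the return labels are hypothetical. The text does not say what to do for a dimension with no rows; falling back to the numeric view in that case is an assumption.

```python
def choose_view(rows):
    """Pick the drilldown view for one dimension's eval_results rows."""
    # Severity views only when every row is severity-encoded.
    # Assumption: an empty dimension keeps the existing numeric view.
    if rows and all(r.get("metric_scale") == "ordinal_risk" for r in rows):
        return "severity"
    return "numeric"
```

A dimension with mixed rows (some measured, some severity-encoded) falls through to the numeric view under this rule, which matches the "otherwise" branch described above.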
Option 4 (weighted risk score) is the easiest to drop in next to the existing chart, but it adds a synthetic number that needs explaining. That is exactly the trap we are trying to escape: today a severity-encoded finding displays as 100% while meaning very high risk.
Open to discussion. The right answer depends on how the audience reads the page — whether they read severity natively (Option 1) or want a line they can interpret as "trending up = bad" (Option 4 with the right captioning).
Why this page exists
Crystal Ball's drilldown view was built around real numeric measurements — RAGAS faithfulness, hallucination_rate, and similar scores produced by evaluation harnesses (Langfuse, RAGAS, DeepEval) on a 0.0 – 1.0 scale. Threshold lines, percentages, line charts: all of that assumes the underlying data is a measurement.
Scout — our autonomous exploratory testing agent — does not produce measurements. It produces findings, each tagged with a severity (high / medium / low → red / amber / green). To make Scout findings visible on the same trend chart that holds RAGAS scores, the dispatch layer writes a synthetic eval_results row encoding severity as a number:
red = 1.0, amber = 0.5, green = 0.0. Convention is "higher = worse", which works for the sparkline (line going up = bad).
Today the drilldown patches around this by labelling severity-encoded values explicitly: "100% (severity-encoded — no direct measurement available)". Honest, but cosmetic. The right answer is to render severity-shaped data with severity-shaped visualisations, which is what the four options on this page explore.
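A small sketch of the encoding described here, and of decoding it back to a severity tier instead of a percentage. The mapping values, `metric_value` and `metric_scale` come from this page; the `dimension` key and the function names are assumptions for illustration.

```python
# Severity encoding written by the dispatch layer ("higher = worse").
SEVERITY_TO_VALUE = {"red": 1.0, "amber": 0.5, "green": 0.0}
VALUE_TO_SEVERITY = {v: k for k, v in SEVERITY_TO_VALUE.items()}


def encode_finding(dimension, severity):
    """Build a synthetic eval_results row for one Scout finding."""
    return {
        "dimension": dimension,  # hypothetical column name
        "metric_value": SEVERITY_TO_VALUE[severity],
        "metric_scale": "ordinal_risk",
    }


def as_percentage(row):
    """What a percentage axis shows for the row."""
    return f"{row['metric_value']:.0%}"


def decode_severity(row):
    """Recover the severity tier so the view can render it as a tier."""
    return VALUE_TO_SEVERITY[row["metric_value"]]
```

For a red finding, `as_percentage` yields "100%" while `decode_severity` yields "red": the same row reads as a perfect score on a percentage axis and as the worst tier once decoded.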
Once Corpus Coach (or any anchor system) is instrumented with a real groundedness evaluator (RAGAS faithfulness, claim-by-claim verification against context), the drilldown will receive a real metric_value on a real scale, and the existing percentage view applies. Until then, severity is the signal and the views on this page are how it should be presented.