The first block is the consequence text — it comes directly from an approved impact rule. It's deterministic and identical every time that rule fires: the nature of the exposure, the commercial components, the required action. Think of it as the standing verdict — authored by a human, reviewed, approved, and locked.

The second block is the LLM narrative — generated fresh each time the assessment runs. It takes the engine's structured output (the actual metric values, breach delta, recurrence count, system context) and translates that into a personalised statement for this specific breach event. That's why it names "1 in 18 assessments" and "exceeding threshold by 10.2%" — those numbers are computed at runtime.

A Business Owner sees the consequence text on the system detail card whenever a dimension is in breach. The narrative appears when they open the drilldown or trigger an impact assessment. Separating them keeps the fixed reasoning auditable and the event-specific language readable.
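To make the two-block split concrete, here is a minimal sketch. It assumes the engine hands the narrator a small structured record; the field names, the placeholder consequence text, and the template function standing in for the LLM call are all illustrative, not the product's actual schema.

```python
from dataclasses import dataclass

# Fixed block: copied verbatim from an approved rule (placeholder wording, hypothetical).
CONSEQUENCE_TEXT = "Placeholder consequence text from the approved rule."


@dataclass(frozen=True)
class EngineOutput:
    """Hypothetical shape of the engine's structured output for one breach event."""
    metric: str
    breach_delta_pct: float  # how far past threshold, in percent
    recurrences: int         # assessments in which the breach occurred
    total_assessments: int


def narrate(out: EngineOutput) -> str:
    """Template stand-in for the LLM narrator; the real step would be a model call."""
    return (
        f"{out.metric} breach recorded in {out.recurrences} in "
        f"{out.total_assessments} assessments, "
        f"exceeding threshold by {out.breach_delta_pct}%."
    )


def build_assessment(out: EngineOutput) -> dict:
    # The consequence text never varies; only the narrative depends on runtime values.
    return {"consequence_text": CONSEQUENCE_TEXT, "narrative": narrate(out)}


first = build_assessment(EngineOutput("hallucination", 10.2, 1, 18))
second = build_assessment(EngineOutput("hallucination", 4.5, 3, 20))
```

Both assessments carry the same consequence text, while the narratives differ because they are built from that event's numbers.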

The score combines four things: how far the metric has breached its threshold, how autonomous the system is, how reversible the impact is, and how relevant this type of failure is to the persona receiving the assessment. The same breach on a different system produces a different score. An autonomous filing system scores higher than an assistive tool for the same hallucination rate because the consequences are categorically different.

Read it comparatively, not as an absolute. This issue is more critical than that one. This system needs attention before that one. What it gives you is a consistent, auditable basis for prioritisation across your entire AI portfolio — rather than relying on someone's judgement about which breach matters more.
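As a rough illustration of how four factors can yield a deterministic, comparable score, here is a sketch. The weights, the 0-to-1 scaling of each factor, and the weighted-sum formula are assumptions made for the example; the actual combination rule isn't described here.

```python
from dataclasses import dataclass

# Hypothetical weights: the four factors are named in the text, the weighting is not.
WEIGHTS = {
    "breach_excess": 0.40,
    "autonomy": 0.25,
    "irreversibility": 0.20,
    "persona_relevance": 0.15,
}


@dataclass(frozen=True)
class BreachInput:
    system: str
    breach_excess: float      # how far past threshold, scaled 0..1
    autonomy: float           # 0 = assistive, 1 = fully autonomous
    irreversibility: float    # 0 = easily reversed, 1 = irreversible
    persona_relevance: float  # relevance of this failure type to the persona, 0..1


def severity_score(b: BreachInput) -> float:
    """Weighted sum on a 0-100 scale; same inputs always give the same score."""
    total = sum(weight * getattr(b, name) for name, weight in WEIGHTS.items())
    return round(100 * total, 1)


# Same hallucination breach and persona, two different systems.
filing = BreachInput("autonomous filing", 0.5, 0.9, 0.8, 0.7)
assistant = BreachInput("assistive tool", 0.5, 0.2, 0.3, 0.7)

# Read the scores comparatively: sort the portfolio, most critical first.
ranked = sorted([assistant, filing], key=severity_score, reverse=True)
```

The useful output is the ordering, not the number itself: the autonomous system sorts ahead of the assistive one for an identical breach.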

Credit Scoring is assessed across four attributes and two are already flagging — hallucination is breaching threshold and bias is high risk. That's why the status shows "deploy with conditions" rather than cleared for production.

The absence of drift data isn't a blind spot — it reflects where the system is in its maturity. We're not measuring what hasn't been deployed yet. The four attributes we are measuring are already telling us it's not ready. A gap you can see is better than a score you can't trust.

The regulatory references don't come from the LLM — they come from approved rules. Every rule is authored with specific regulatory references as structured data fields. The LLM narrator is explicitly constrained to only cite frameworks and clause references that appear verbatim in the rule it's working from. It cannot add, extend, or infer beyond what a human has already reviewed and approved.

The question isn't whether the LLM knows MAS FEAT — it's whether the rule was correctly authored. That's a human accountability question, not a model reliability question. The reviewer who approved that rule owns those references. And every assessment carries a full provenance trail back to the specific rule that fired.
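One way to enforce that constraint is a check after generation: anything the narrator cites must match a reference on the rule exactly. The sketch below assumes the narrator returns its citations as a separate list alongside the text; that interface, the class, and the placeholder reference strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ApprovedRule:
    rule_id: str
    # Structured data fields authored and approved by a human reviewer.
    regulatory_references: Tuple[str, ...]


def unapproved_citations(rule: ApprovedRule, cited: List[str]) -> List[str]:
    """Return every citation that does not appear verbatim on the rule."""
    allowed = set(rule.regulatory_references)
    return [c for c in cited if c not in allowed]


def accept_narrative(rule: ApprovedRule, narrative: str, cited: List[str]) -> str:
    """Reject the narrative outright if it cites anything the rule does not."""
    extra = unapproved_citations(rule, cited)
    if extra:
        raise ValueError(f"Rule {rule.rule_id}: unapproved citations {extra}")
    return narrative


# Placeholder references, not real clause numbers.
rule = ApprovedRule("RULE-001", ("EXAMPLE-FRAMEWORK clause 1", "EXAMPLE-FRAMEWORK clause 2"))
ok = accept_narrative(rule, "Some narrative.", ["EXAMPLE-FRAMEWORK clause 1"])
```

Because the match is exact, a paraphrased or extended clause reference fails the check just as an invented one would.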

The scoring system is entirely deterministic — it produces the same outcome for a given set of system attributes, thresholds, and breach inputs every time. No LLM is involved in the reasoning, the severity calculation, or the regulatory determination.

The LLM does two things only: it drafts candidate rules (which a human must approve before they enter the engine), and it translates the engine's structured output into plain language for each persona. The reasoning is done before the LLM sees anything. The LLM translates — it doesn't determine.

If a system type isn't configured, the engine deliberately returns no assessment for it. A visible gap is better than a miscalibrated score: over-reporting, or reporting the wrong severity, is a worse outcome than no output.
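In code terms, "no assessment" means returning nothing rather than falling back to a default rule. A minimal sketch, assuming a simple lookup of configured system types; the registry, rule ID, and function signature are illustrative only.

```python
from typing import Optional

# Hypothetical registry: system types that have approved rules configured.
CONFIGURED_RULES = {"credit_scoring": "RULE-001"}


def assess(system_type: str, metric_value: float, threshold: float) -> Optional[dict]:
    """Return an assessment for configured systems, or None for unconfigured ones."""
    rule_id = CONFIGURED_RULES.get(system_type)
    if rule_id is None:
        # Deliberate gap: no default rule, no guessed severity.
        return None
    return {
        "rule_id": rule_id,
        "in_breach": metric_value > threshold,
        "breach_delta": round(metric_value - threshold, 4),
    }


known = assess("credit_scoring", 0.12, 0.10)
unknown = assess("document_filing", 0.12, 0.10)  # not configured, so None
```

The caller has to handle `None` explicitly, which is what keeps the gap visible in the interface instead of hiding it behind a placeholder score.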

When a new system needs to be onboarded, there's a structured pipeline: one agent drafts a candidate rule using existing golden rules as reference, a second maps impact across personas, and a third critiques it for regulatory accuracy and causal chain quality. The output then goes to a qualified human for sign-off before it touches the engine. Nothing fires until someone has reviewed it.

A new system creates a visible gap, not a wrong answer. And closing that gap is a defined process, not a manual authoring exercise.
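The gate at the end of that pipeline can be sketched as follows. The three agent steps are stubbed as plain functions, and every name here is an assumption for illustration; the point is only that the engine refuses any rule without a recorded human approval.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CandidateRule:
    rule_id: str
    text: str
    persona_impacts: Dict[str, str] = field(default_factory=dict)
    critique_notes: List[str] = field(default_factory=list)
    approved_by: Optional[str] = None  # set only by a human reviewer


def draft_rule(system_type: str) -> CandidateRule:
    # Stub for the drafting agent (would reference existing golden rules).
    return CandidateRule(f"DRAFT-{system_type}", f"Candidate rule for {system_type}")


def map_personas(rule: CandidateRule) -> CandidateRule:
    # Stub for the persona-mapping agent.
    rule.persona_impacts["business_owner"] = "placeholder impact"
    return rule


def critique(rule: CandidateRule) -> CandidateRule:
    # Stub for the critique agent (regulatory accuracy, causal chain quality).
    rule.critique_notes.append("placeholder critique")
    return rule


def approve(rule: CandidateRule, reviewer: str) -> CandidateRule:
    if not reviewer:
        raise ValueError("A named human reviewer is required")
    rule.approved_by = reviewer
    return rule


def load_into_engine(engine: Dict[str, CandidateRule], rule: CandidateRule) -> None:
    # Nothing fires until someone has reviewed it.
    if rule.approved_by is None:
        raise PermissionError(f"{rule.rule_id} has not been approved")
    engine[rule.rule_id] = rule


engine_rules: Dict[str, CandidateRule] = {}
candidate = critique(map_personas(draft_rule("document_filing")))
```

Calling `load_into_engine(engine_rules, candidate)` at this point raises `PermissionError`; it succeeds only after `approve(candidate, reviewer)` has recorded a reviewer.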

Your data science team measures quality. This answers a different question: what does that measurement mean for the people making business decisions?

Right now your ML ops team has dashboards your CFO can't read. Your compliance officer doesn't know which findings require her attention. Your business unit heads are being asked to sign off on systems they can't interpret. The engine takes what your existing tools already produce and translates it into the language each of those stakeholders needs — with regulatory context, severity scoring, and a clear action.

If you already have consolidated monitoring, even better — you don't replace it, you connect to it. The engine sits downstream and handles the translation and prioritisation layer. Your data science team keeps working exactly as they do today.

Onboarding a new system is straightforward once the platform is running. Five steps: define the system's DNA profile, confirm which personas need to receive assessments, generate and review the rules, set up the project, and connect your existing eval tools.

The configuration itself is fast: realistically less than a day. Most of the time goes on two things that have nothing to do with the technology: getting access credentials to your existing monitoring tools, and getting the right person to review and approve the generated rules. That review is a governance step, not a technical one, and it's intentional. You don't want rules going live without a qualified reviewer signing off.

For your next system after the demo anchor: once the platform is live, you're looking at days, not weeks.