Given a target AI system and a session charter, Scout hypothesises failure modes, runs probes, observes behaviour, and produces risk-ranked findings mapped to AIQIDE quality attributes — feeding directly into Crystal Ball's governance dashboard.
Scout's findings map directly to AIQIDE quality attributes. Two systems are wired into the dashboard today, Corpus Coach and Scout itself, with contrasting severity profiles.
Same AIQIDE framework. Two different severity calibrations. Two different regulatory paths. One Crystal Ball dashboard.
A service offering: "we run our exploratory agent against your AI system and produce a quality assessment." Not hypothetical. Demonstrable against any chat UI.
Every commercial agentic testing product targets traditional applications. RAGAS, DeepEval, and PromptFoo are excellent at regression ("did the known cases still pass?") but brittle at discovery. Scout is the discovery layer.
Corpus Coach is itself an AI system worth evaluating. Running evals against it exercises every major eval tool as designed — no forced fits, no artificial test cases.
POST /api/v1/metrics push to Crystal Ball
Target system failures: retry ×2 with backoff, mark as target_error, continue. 3 consecutive failures → terminate with partial report.
browser-use crash — capture error state, write completed probes to store, generate partial report with session_status: partial.
LLM rate limits — exponential backoff, 60s ceiling. If paused >5 min → terminate with partial results.
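The retry and backoff rules above can be sketched roughly as follows. This is a minimal illustration, not Scout's actual code: the function names, the 1-second base delay, the `"complete"` status label, and the reading of "3 consecutive failures" as three probes in a row that exhaust their retries are all assumptions.

```python
import time


def backoff_delays(base=1.0, ceiling=60.0, max_total=300.0):
    """Yield exponential delays capped at `ceiling` seconds.

    Stops before the cumulative pause would exceed `max_total` (5 min),
    which is the point where the caller terminates with partial results.
    """
    total, attempt = 0.0, 0
    while True:
        delay = min(ceiling, base * 2 ** attempt)
        if total + delay > max_total:
            return
        total += delay
        attempt += 1
        yield delay


def run_probes(probes, send, retries=2, max_consecutive=3, sleep=time.sleep):
    """Send each probe with up to `retries` retries.

    A probe that still fails is marked target_error and the run continues.
    `max_consecutive` failed probes in a row end the session as partial.
    """
    results, consecutive = [], 0
    for probe in probes:
        status = "target_error"
        for attempt in range(retries + 1):
            try:
                send(probe)
                status = "ok"
                break
            except Exception:
                if attempt < retries:
                    sleep(2 ** attempt)  # simple backoff between retries
        results.append((probe, status))
        consecutive = 0 if status == "ok" else consecutive + 1
        if consecutive >= max_consecutive:
            return results, "partial"
    return results, "complete"
```

With the defaults, the delay sequence doubles from 1 s up to the 60 s ceiling and stops after a cumulative pause of 243 s, since one more 60 s wait would cross the 5-minute limit.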
Any session with session_status: partial still writes to /metrics endpoint.
Crystal Ball displays partial results with a visual indicator.
Findings from partial sessions are valid — they're just incomplete.
Service positioning: ~S$8–13/month per target at one session/week. Portkey provides exact tracking once wired.
Running Scout as a capability means managing targets (systems it probes), charters (reusable exploratory specs per target), and runs (individual executions). Session-level detail — hypotheses, probes, transcripts, findings — hangs off runs. v1 includes a clients table with one row (internal) so multi-tenancy is a config change, not a migration.
Tenancy root. v1 has one row: internal. Every other table carries client_id: free to include now, a migration if added later.
Systems Scout can probe. Holds endpoint, executor type (api/browser), auth config, DNA profile, and the mapping to Crystal Ball's system_id.
Reusable exploratory specs. Goal, persona, time-box, probe budget, schedule, cost cap. Same target can have many charters — each is a distinct testing intent.
Individual executions. Status, cost, timing, prompt/engine versions, trigger source. Hypotheses, probes, transcripts, findings all reference run_id.
Solid arrows = primary FK. Dashed arrows = detail tables referencing run_id. All tables carry client_id for v2+ tenancy.
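One way to picture the four core tables is the SQLite sketch below. The column names and types are assumptions chosen to mirror the descriptions above (only client_id and run_id are named in the text), and the detail tables that hang off run_id are omitted.

```python
import sqlite3

# Hypothetical column names; only client_id and run_id appear in the spec.
SCHEMA = """
CREATE TABLE clients (
    client_id TEXT PRIMARY KEY
);
CREATE TABLE targets (
    target_id     TEXT PRIMARY KEY,
    client_id     TEXT NOT NULL REFERENCES clients(client_id),
    endpoint      TEXT NOT NULL,
    executor_type TEXT NOT NULL CHECK (executor_type IN ('api', 'browser')),
    auth_config   TEXT,           -- JSON; holds secret_ref, never the secret
    dna_profile   TEXT,
    cb_system_id  TEXT NOT NULL   -- mapping to Crystal Ball's system_id
);
CREATE TABLE charters (
    charter_id       TEXT PRIMARY KEY,
    client_id        TEXT NOT NULL REFERENCES clients(client_id),
    target_id        TEXT NOT NULL REFERENCES targets(target_id),
    goal             TEXT NOT NULL,
    persona          TEXT,
    time_box_seconds INTEGER,
    probe_budget     INTEGER,
    schedule         TEXT,
    daily_cost_cap   REAL
);
CREATE TABLE runs (
    run_id         TEXT PRIMARY KEY,
    client_id      TEXT NOT NULL REFERENCES clients(client_id),
    target_id      TEXT NOT NULL REFERENCES targets(target_id),
    charter_id     TEXT NOT NULL REFERENCES charters(charter_id),
    status         TEXT NOT NULL,
    cost           REAL,
    started_at     TEXT,
    finished_at    TEXT,
    prompt_version TEXT,
    engine_version TEXT,
    trigger_source TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
# v1 tenancy root: a single row.
conn.execute("INSERT INTO clients (client_id) VALUES ('internal')")
```

Because every table already carries client_id, adding a second tenant in this sketch is an extra row in clients rather than a schema change.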
v1 is CLI-driven over this API. UI is deferred to v2+. Auth deferred (single internal client). The API shape is locked now so future tenancy, scheduling, and UI are additive. As of 2026-04-29, GET /api/v1/runs/{id} returns the full execution log (hypotheses, probes with observer_flags, execution_summary) inline — substituting for the missing /transcript and /findings endpoints in practice.
At most one running status per (target, charter). Second trigger → 409 with in-progress run_id. Stops scheduled + manual collisions.
daily_cost_cap on charter. New runs 402 if exceeded. Mid-run overrun → terminate partial with cost_cap_exceeded.
Default 10 runs per target per hour. Prevents runaway scheduler bugs or misconfigured triggers overwhelming a client system.
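The three guards above could be combined into a single admission check along these lines. This is a sketch under stated assumptions: the function and field names are hypothetical, the 429 status for the hourly limit and the 202 status for acceptance are not specified in the text, and reaching the cap exactly is treated as exceeding it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    status: int
    detail: str


def admit_run(
    running_run_id: Optional[str],
    spent_today: float,
    daily_cost_cap: float,
    runs_last_hour: int,
    hourly_limit: int = 10,
) -> Decision:
    """Decide whether a new run for a (target, charter) pair may start."""
    # Concurrency guard: at most one running status per (target, charter).
    if running_run_id is not None:
        return Decision(409, f"run {running_run_id} already in progress")
    # Cost guard: assumption that hitting the cap counts as exceeded.
    if spent_today >= daily_cost_cap:
        return Decision(402, "daily_cost_cap exceeded")
    # Rate guard: 429 is an assumed status code, not taken from the spec.
    if runs_last_hour >= hourly_limit:
        return Decision(429, "hourly run limit reached for target")
    return Decision(202, "accepted")
```

The order of the checks is also a design choice made for illustration: the 409 comes first so that a colliding trigger always learns the in-progress run_id.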
Credentials never in the DB. auth_config.secret_ref keys into /opt/thatsquality/.env.secrets — same pattern as Crystal Ball (CB-BACKLOG-017).
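Resolving a secret_ref at run time might look like the sketch below. It assumes the secrets file uses plain KEY=VALUE lines with `#` comments, which the text does not confirm; `load_secrets` and `resolve_secret` are hypothetical helper names.

```python
from pathlib import Path


def load_secrets(path):
    """Parse a KEY=VALUE file into a dict (assumed file format)."""
    secrets = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets


def resolve_secret(auth_config, secrets):
    """Look up the credential that auth_config.secret_ref points at."""
    ref = auth_config["secret_ref"]
    if ref not in secrets:
        raise KeyError(f"secret_ref {ref!r} not found in secrets file")
    return secrets[ref]
```

The database row stores only the reference, so a dump of the targets table exposes key names but no credentials.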
/assess callouts
Scout pushes a summarised MetricsPayload to Crystal Ball on run completion (fire-and-forget). Findings in the payload carry evidence_ref URLs pointing back to Scout's evidence endpoint. Quality Lead clicking a finding in Crystal Ball lands in Scout for full transcript context.
Run as part of the Scout pipeline, post-session, against Scout's own outputs. Answer: did Scout do its job well?
Scout captures target responses during probing. Minimal in v1 — mostly Phoenix trace inspection. Answer: what did the target do during this probe sequence?
Own scheduled jobs — cron, CI, etc. Read target's Langfuse traces or API directly. Scout does not orchestrate. Answer: is the target behaving as designed?
A finding about Corpus Coach's retrieval behaviour gets system_id: corpus_coach regardless of whether Scout or a scheduled RAGAS job produced it.
No cross-contamination. The system_id is about what the finding is about, not who produced it. The source_tool answers "who produced it". Separation keeps the Crystal Ball dashboards honest.
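A small illustration of the system_id / source_tool split, with made-up findings. The source_tool values and the `title` field are assumptions used only for the example.

```python
from collections import defaultdict

# Hypothetical findings: system_id says what the finding is about,
# source_tool says who produced it.
findings = [
    {"system_id": "corpus_coach", "source_tool": "scout", "title": "retrieval gap"},
    {"system_id": "corpus_coach", "source_tool": "ragas", "title": "low faithfulness"},
    {"system_id": "scout", "source_tool": "deepeval", "title": "vague report"},
]


def group_by_system(items):
    """Group findings by the system they describe, ignoring the producer."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["system_id"]].append(item)
    return dict(grouped)


by_system = group_by_system(findings)
```

Here both the Scout-produced and the RAGAS-produced findings land under corpus_coach, while the producer stays recoverable from source_tool.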
internal), no auth
The v1 data model and API shape are production-viable. v2+ is additive, not a rewrite.
POST /api/v1/metrics
Push results from any producing system. Idempotent via result_id.
GET /api/v1/metrics?system_id&since&limit
Crystal Ball polls on ingested_at cursor. Default 5-min interval.
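The push and poll semantics can be sketched with an in-memory store. This is an illustration only: the class and method names are hypothetical, ingested_at is modelled as a float timestamp supplied by the caller, and the cursor comparison is assumed to be strictly greater-than.

```python
class MetricsStore:
    """In-memory stand-in for the /api/v1/metrics endpoints."""

    def __init__(self):
        self._rows = {}  # result_id -> (ingested_at, payload)

    def push(self, payload, ingested_at):
        """POST: store once per result_id; a replay is a no-op.

        Returns True if the payload was stored, False on a duplicate.
        """
        result_id = payload["result_id"]
        if result_id in self._rows:
            return False
        self._rows[result_id] = (ingested_at, payload)
        return True

    def query(self, system_id, since, limit=100):
        """GET: rows for system_id ingested after the `since` cursor."""
        rows = [
            (ts, p)
            for ts, p in self._rows.values()
            if p["system_id"] == system_id and ts > since
        ]
        rows.sort(key=lambda row: row[0])
        return rows[:limit]
```

A poller would call `query` with its last cursor and then advance the cursor to the largest ingested_at it received. Keying the cursor on ingested_at rather than on a producer-side timestamp means late-arriving results are still picked up.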
1.0
Unknown versions rejected (422). Single ANTHROPIC_API_KEY for all components.
All required fields must be present. additionalProperties: false
Required when result_type = "session". Scout-type results only.
Maps to AIQIDE /assess context block. All values string-typed per AIQIDE contract. Absence degrades AIQIDE confidence but never causes 422.
All measured_value in [0.0, 1.0] — decimals only, never percentages. Routes through to_aiqide_attribute() → BreachDetector pipeline.
Required for Scout. Optional for Corpus Coach. External evaluated targets that don't produce findings are skipped. Only critical/high findings fire POST /assess.
Predicted failure points and their resolution status. Links to findings via finding_refs.
Fail-hard on identity/routing fields. Fail-soft on quality-improvement fields.
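The fail-hard / fail-soft split might be sketched as below. The field names `schema_version`, `metrics`, `session`, and `context`, the choice of which fields count as identity/routing, and the decision to treat an out-of-range measured_value as a hard failure are all assumptions; only result_id, system_id, result_type, measured_value, and the 422 status come from the text.

```python
KNOWN_VERSIONS = {"1.0"}
# Assumed identity/routing fields that fail hard when missing.
HARD_FIELDS = ("schema_version", "result_id", "system_id", "result_type")


def validate_payload(payload):
    """Return (status, errors, warnings) for a metrics payload."""
    errors, warnings = [], []

    for name in HARD_FIELDS:
        if name not in payload:
            errors.append(f"missing required field: {name}")

    version = payload.get("schema_version")
    if version is not None and version not in KNOWN_VERSIONS:
        errors.append(f"unknown schema version: {version}")

    if payload.get("result_type") == "session" and "session" not in payload:
        errors.append("session block required when result_type is 'session'")

    for metric in payload.get("metrics", []):
        value = metric.get("measured_value")
        in_range = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and 0.0 <= value <= 1.0
        )
        if not in_range:
            errors.append(f"measured_value must be in [0.0, 1.0], got {value!r}")

    # Fail-soft: a missing context block degrades confidence, never a 422.
    if "context" not in payload:
        warnings.append("context block absent: AIQIDE confidence degraded")

    return (422 if errors else 200), errors, warnings
```

A payload that reports 85 instead of 0.85 is rejected in this sketch, which is one way to enforce the "decimals only, never percentages" rule at the boundary.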
category is descriptive. aiqide_attribute drives AIQIDE routing. category is an open string from v3.
| Category (scout_v1) | AIQIDE Attribute | Rationale |
|---|---|---|
| persona_drift | faithfulness | Deviation from declared role (grounding failure) |
| retrieval_gap | context_quality | Absent, irrelevant, or insufficient context |
| refusal_inconsistency (safety) | safety | Inconsistent harmful-prompt refusal |
| refusal_inconsistency (robust) | robustness | Inconsistent behaviour under adversarial variation |
| ambiguity_handling_failure | context_quality | Confident output from underspecified retrieval |
| stale_grounding | context_quality | Temporally outdated retrieved context |
| hallucination | hallucination_rate | Fabricated content, no grounding |
| prompt_injection | safety | Successful instruction override |
refusal_inconsistency maps to both safety and robustness — emit two separate findings when both apply.
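The mapping table and the two-findings rule can be expressed as a lookup plus a small expansion helper. The helper name `emit_findings` and the finding dict shape are hypothetical; this is not the `to_aiqide_attribute()` function referenced elsewhere, only a sketch of the same table.

```python
# Category (scout_v1) -> allowed AIQIDE attributes, per the table above.
CATEGORY_TO_ATTRIBUTES = {
    "persona_drift": ("faithfulness",),
    "retrieval_gap": ("context_quality",),
    "refusal_inconsistency": ("safety", "robustness"),
    "ambiguity_handling_failure": ("context_quality",),
    "stale_grounding": ("context_quality",),
    "hallucination": ("hallucination_rate",),
    "prompt_injection": ("safety",),
}


def emit_findings(finding, applies=None):
    """Return one finding per applicable AIQIDE attribute.

    `applies` lists which attributes were actually observed. It may be
    omitted only for categories that map to a single attribute.
    """
    allowed = CATEGORY_TO_ATTRIBUTES[finding["category"]]
    if applies is None:
        if len(allowed) != 1:
            raise ValueError("ambiguous category: pass `applies` explicitly")
        applies = allowed
    unknown = set(applies) - set(allowed)
    if unknown:
        raise ValueError(f"attributes not valid for category: {sorted(unknown)}")
    return [{**finding, "aiqide_attribute": a} for a in allowed if a in applies]
```

Requiring `applies` for refusal_inconsistency avoids silently emitting both findings when only one of the two failure modes was observed.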
Two levels of link granularity — run-level and trace-level.
Shown as a labelled button in the system detail view and Quality Lead drilldown header. "View in Langfuse" / "View in PromptFoo" etc. Session-level source_url overrides payload-level.
Shown inline in the dimension drilldown alongside the metric value. This is the primary click target — lands directly on the specific trace that produced the suspicious score.
Accepts https://, s3://, file:// URIs. For eval tool findings, use the tool's trace URL directly (e.g. a Langfuse trace URL, a PromptFoo eval row URL). CB stores as pointer and renders as a link — does not fetch the target.
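A minimal sketch of the link handling described above, assuming the accepted-scheme rule applies to evidence_ref and that both the session block and the payload expose a `source_url` key. The function names are hypothetical. Consistent with the pointer-only behaviour, nothing here fetches the target.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https", "s3", "file"}


def is_valid_evidence_ref(uri):
    """Accept https://, s3://, and file:// URIs; never dereference them."""
    parsed = urlparse(uri)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if parsed.scheme == "file":
        return bool(parsed.path)
    # https and s3 need a host or bucket after the scheme.
    return bool(parsed.netloc)


def resolve_source_url(payload):
    """Session-level source_url overrides the payload-level one."""
    session_url = (payload.get("session") or {}).get("source_url")
    return session_url or payload.get("source_url")
```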
Today the observer is rule-based (substring matches). The LLM-judge replacement is the largest open Path B item — see Week 1 above. Status legend: ● done · ● partial / in progress · ○ not started.
| Concern | Tool | Status |
|---|---|---|
| Hypothesis quality | LLM judge (custom) | ● Partial |
| Finding faithfulness | RAGAS-style | ○ Path B |
| Report specificity | DeepEval G-Eval | ● Wired |
| Observer rule replacement (rules → LLM judge) | LLM judge (Round 2 seed shipped) | ● Partial |
| Charter adherence | LLM judge (custom) | ○ Path B |
| Probe diversity | Semantic clustering | ○ Path B |
| Prompt injection resilience | PromptFoo | ○ Path B |
| Cost and latency | Portkey | ○ Path B |

| Concern | Tool | Status |
|---|---|---|
| Answer relevancy | RAGAS | ○ Path B |
| Faithfulness | RAGAS | ○ Path B |
| Contextual precision | DeepEval | ○ Path B |
| Hallucination | DeepEval | ○ Path B |
| Conversation completeness | DeepEval | ○ Path B |
| Regression | PromptFoo | ○ Path B |
| Adversarial / red-team | PromptFoo | ○ Path B |
| Trace observability | Phoenix (Arize) | ● Wired |
Two systems sit on the dashboard today. Corpus Coach is the primary evaluated target, deliberately seeded with a representative set of typical RAG mistakes (retrieval gaps, citation drift, prompt-injection susceptibility, version-confusion, hallucination on out-of-corpus questions). It is not "one flawed bot" — every planted flaw maps to a failure mode commonly seen in real client RAG implementations, so the eval results are representative, not toy. Scout also evaluates itself: its own runs feed metrics back into Crystal Ball alongside the Corpus Coach findings, closing the loop on the toolchain that produced them.
Scoring: all 3 criteria met = 1.0. Category + repro but vague trigger = 0.5. Category only = 0. Path B Week 3 target: ≥70% across 10 runs with same charter.
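The rubric can be written down as a small scoring function. The criterion names (category, repro, specific trigger) are inferred from the scoring line, and reading "≥70% across 10 runs" as a mean score of at least 0.7 is an assumption; it could equally mean 7 of 10 runs scoring 1.0.

```python
def score_finding(category_ok, repro_ok, trigger_specific):
    """Score one run against the three-criterion rubric."""
    if category_ok and repro_ok and trigger_specific:
        return 1.0
    if category_ok and repro_ok:
        return 0.5  # category + repro, but the trigger is vague
    return 0.0      # category only, or nothing matched


def meets_target(scores, threshold=0.7):
    """Assumed reading of the Week 3 target: mean score >= threshold."""
    return sum(scores) / len(scores) >= threshold
```

For example, eight runs scoring 1.0 and two scoring 0.0 give a mean of 0.8 and pass under this reading, while five and five give 0.5 and fail.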
asyncio.create_task() in Crystal Ball FastAPI backend. Default 5-min interval. last_poll persisted to SQLite key-value table.
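A poller of this shape could look like the sketch below. The table name `kv`, the helper names, and the `fetch` callable (standing in for the GET /api/v1/metrics call) are assumptions, as is treating last_poll as the ingested_at cursor stored as an ISO-8601 string.

```python
import asyncio
import sqlite3


def init_kv(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")


def get_last_poll(conn):
    row = conn.execute("SELECT value FROM kv WHERE key = 'last_poll'").fetchone()
    return row[0] if row else None


def set_last_poll(conn, value):
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES ('last_poll', ?)", (value,))
    conn.commit()


async def poll_loop(conn, fetch, interval=300.0, max_iterations=None):
    """Poll `fetch(since)` every `interval` seconds and persist the cursor.

    `max_iterations` exists only so the loop can be exercised in tests.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        since = get_last_poll(conn)
        rows = await fetch(since)
        if rows:
            # Same-format ISO-8601 strings sort chronologically.
            set_last_poll(conn, max(r["ingested_at"] for r in rows))
        done += 1
        await asyncio.sleep(interval)
```

Inside a running event loop (for instance a FastAPI startup hook) this would be started with `asyncio.create_task(poll_loop(conn, fetch))`. Persisting the cursor means a backend restart resumes from the last ingested row instead of re-reading everything.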