P4 · L-E · Bands: Operational → Integrated → Optimised → Defensible

· GOV-09

AI Evaluation Harness Specification

AI deployed without a defined evaluation pass operates on vendor assertion. The AI Evaluation Harness Specification defines the structured test framework that every AI tool must pass before deployment and every quarter while in production. The Harness specifies evaluation dimensions — accuracy, hallucination rate, bias, privilege handling, agentic boundary adherence — plus test corpus standards, pass thresholds, and regression discipline for model updates. Without it, the firm cannot evidence due diligence at procurement or in litigation. Methodology v2026.1.

strategic

Per-engagement

Initial evaluation 3–5 business days per tool (Tiers 0–2); 7–10 business days for Tier 3–4 including agentic supplement. Ongoing monitoring is continuous.

Executive Summary

GOV-09 defines the organisation’s AI Evaluation Harness: a standardised, evidence-based methodology for testing every AI tool before deployment and throughout its lifecycle. It specifies three mandatory evaluation domains—hallucination and factual accuracy, bias and fairness, and robustness and operational reliability—plus an Agentic Tier Supplement for Tier 3–4 tools. Each domain is scored on a 0–100 scale with clear Pass, Conditional Pass, and Fail thresholds, and hard floors for unacceptable behaviour such as excessive legal citation fabrication. The Harness produces a Model Risk Profile that populates DAT-06 Field 10 and a defensible evidence package for GOV-08 Agentic Governance Panel decisions. It establishes performance and bias baselines to support continuous monitoring, defines re-evaluation triggers, and sets DPS-grade evidence retention requirements. Without GOV-09, governance decisions would rely on vendor assertions rather than independent testing, undermining defensibility under the EU AI Act, ISO/IEC 42001, NIST AI RMF, and professional liability standards.

Defensibility Evidence Produced

GOV-09 operates at DPS Tier 3 (Defensible) across all three lenses.

· Adoption lens (5-year retention from evaluation date): stakeholder notifications of evaluation completion, training records for evaluation team members, and documentation of the evaluation toolchain and test set sources.
· Sophistication lens (5-year retention): full AI Evaluation Harness Reports including all domain scores and sub-scores, bias baseline documentation, Conditional Pass Mitigation Plans, and re-evaluation trigger documentation. Together these provide an auditable trail of every deployment decision.
· Defensibility lens (7-year retention from tool decommissioning): Agentic Tier Supplement Reports, kill-switch response time test records, scope boundary enforcement verification records, escalation trigger accuracy test results, Model Risk Profile summaries as supplied to DAT-06 Field 10, and GOV-08 Panel submission packages. These constitute the primary technical evidence for regulatory, client, and professional liability inquiries regarding AI tool deployment decisions.

Evidence must be available within 48 hours of a regulatory or legal inquiry. An annual evidence accessibility audit is required.
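The retention rules above can be sketched as a small date calculation. This is an illustrative sketch only, not part of the methodology: the lens keys and function names are hypothetical, and it assumes the Sophistication lens clock starts at the evaluation date, which the text does not state explicitly.

```python
from datetime import date
from typing import Optional

# Retention periods as stated in the text. The dictionary keys are hypothetical labels.
RETENTION_YEARS = {"adoption": 5, "sophistication": 5, "defensibility": 7}


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def retention_expiry(
    lens: str,
    evaluation_date: date,
    decommission_date: Optional[date] = None,
) -> Optional[date]:
    """Return the earliest date evidence may be disposed of, or None if unknown.

    Defensibility evidence is retained for 7 years from tool decommissioning,
    so the expiry is unknown (None) while the tool is still in service.
    """
    if lens not in RETENTION_YEARS:
        raise ValueError(f"unknown lens: {lens!r}")
    if lens == "defensibility":
        if decommission_date is None:
            return None  # retention clock has not started yet
        return add_years(decommission_date, RETENTION_YEARS[lens])
    # Assumption: both 5-year lenses count from the evaluation date.
    return add_years(evaluation_date, RETENTION_YEARS[lens])
```

Returning `None` for a tool that is still in service makes the open-ended retention period explicit, so a disposal job cannot treat such evidence as expired by accident.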

Elements:

Methodology transparency · Evidence framework

Metric 0 — Pre-Check

Before any GOV-09 evaluation, three gates must pass:

DAT-06 Registration Initiated – The tool must have a DAT-06 AI BoM entry at least at Draft status. Unregistered tools cannot be evaluated.
GOV-02 AI Use Policy Entry Exists – A GOV-02 entry must define the approved use cases. Evaluation scope is limited to these use cases.
Qualified Evaluation Team Available – The team must include at least one qualified assessor with documented AI evaluation experience, confirmed by the AI Governance Lead.

Failure at any gate pauses evaluation until remediated.
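The three gates can be expressed as a simple pre-flight check. The sketch below is a hedged illustration: the `PreCheckInput` fields and the gate labels are hypothetical names, and it assumes that any recorded DAT-06 status counts as "at least Draft".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreCheckInput:
    # Hypothetical record shape; field names are not defined by GOV-09.
    dat06_status: Optional[str]  # e.g. "Draft" or "Provisional"; None if unregistered
    gov02_use_cases: List[str] = field(default_factory=list)
    qualified_assessor_confirmed: bool = False  # confirmed by the AI Governance Lead


def failed_gates(p: PreCheckInput) -> List[str]:
    """Return the names of the gates that block evaluation (empty list = proceed)."""
    failures = []
    if not p.dat06_status:
        failures.append("DAT-06 Registration Initiated")
    if not p.gov02_use_cases:
        failures.append("GOV-02 AI Use Policy Entry Exists")
    if not p.qualified_assessor_confirmed:
        failures.append("Qualified Evaluation Team Available")
    return failures


def evaluation_scope(p: PreCheckInput) -> List[str]:
    """Evaluation scope is limited to the approved GOV-02 use cases."""
    blocked = failed_gates(p)
    if blocked:
        raise RuntimeError("Evaluation paused until remediated: " + ", ".join(blocked))
    return list(p.gov02_use_cases)
```

Reporting every failed gate at once, rather than stopping at the first, lets the team remediate all blockers in a single pass.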

Operational Signals

gov-09.harness-coverage

→ Defensibility Posture Statement

Proportion of deployed AI tools holding a current harness pass; recorded as a DE-3 Evidence framework record.

Quarterly

gov-09.regression-detection

→ Annual Legal AI OS Index

Regression detection rate against versioned baseline feeds the Annual Legal AI OS Index quality signal.

Per Module run

gov-09.evaluation-currency

→ Console

Days since the last evaluation for each active AI capability; feeds the Console intelligence substrate.

On change
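Two of the signals above, harness coverage and evaluation currency, reduce to simple arithmetic. The following is a minimal sketch under stated assumptions: the `DeployedTool` record is hypothetical, and whether a pass is "current" is left to the caller as a boolean because the text does not define the currency window here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class DeployedTool:
    # Hypothetical record; GOV-09 does not prescribe this shape.
    name: str
    last_evaluation: date
    has_current_pass: bool  # decided by the firm's own currency rule


def harness_coverage(tools: List[DeployedTool]) -> float:
    """gov-09.harness-coverage: share of deployed tools with a current harness pass."""
    if not tools:
        return 0.0  # assumption: report zero coverage when nothing is deployed
    return sum(t.has_current_pass for t in tools) / len(tools)


def evaluation_currency(tools: List[DeployedTool], today: date) -> Dict[str, int]:
    """gov-09.evaluation-currency: days since last evaluation, keyed by tool name."""
    return {t.name: (today - t.last_evaluation).days for t in tools}
```

For example, with one tool evaluated on 1 January 2026 holding a current pass and another evaluated on 1 October 2025 without one, coverage on 31 January 2026 is 0.5 and the currency values are 30 and 122 days.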

Recommended Stakeholders

Owner

CIO / CISO

Approvers

General Counsel
CIO / CISO
Risk & Compliance

Contributors

Engineering / IT
AI Task Force
External Evaluator

Informed

Board
Audit Committee

Inputs · Outputs

Inputs

· DAT-06 AI Bill of Materials entry at Draft or Provisional status
· GOV-02 AI Use Policy entry specifying approved use cases
· Vendor model card, system card, and technical documentation
· GOV-04 vendor due diligence outputs for infrastructure and data supply chain
· Legal domain test sets or Legal AI Test Corpus subsets
· Demographic and jurisdictional test data for intended operating scope
· Agentic workflow design documentation for Tier 3–4 tools
· Kill-switch architecture and infrastructure documentation for Tier 3–4 tools

Outputs

· AI Evaluation Harness Report per tool and evaluation cycle
· Domain scores and overall Pass / Conditional Pass / Fail verdict
· Model Risk Profile summary formatted for DAT-06 Field 10
· Agentic Tier Supplement Report for Tier 3–4 tools
· Bias and performance baseline records for GOV-08 monitoring
· Red-team and adversarial testing findings
· Conditional Pass Mitigation Plans and completion records
· Evaluation evidence package for DPS retention and audits

Framework Crosswalk

EU AI Act

European Union

Supports pre-deployment testing, technical documentation, and accuracy, robustness, cybersecurity, and human oversight requirements under Articles 9–15.

NIST AI Risk Management Framework

NIST

Implements the MEASURE function by providing structured, quantitative and qualitative evaluation of AI risks at deployment and in operation.

ISO/IEC 42001

ISO/IEC

Provides AI management system controls for documented performance evaluation, including accuracy, reliability, and fairness, which GOV-09 operationalises.

NIST Special Publication 1270

NIST

Informs GOV-09 bias and fairness testing methods and metrics for identifying, measuring, and mitigating AI bias.

Operational Artefacts

· AI Evaluation Harness Scorecard (xlsx, v2026.1, gated)
· Domain Test Set Templates: Hallucination, Bias, Robustness (xlsx, v2026.1, gated)
· Agentic Tier Supplement Checklist (checklist, v2026.1, gated)
· AI Evaluation Harness Report Template (docx, v2026.1, gated)
· Bias Monitoring Baseline Record Template (xlsx, v2026.1, gated)

Sequence

Run before

USE-02
Pilot Program Design
The canonical AI pilot execution instrument that structures three-phase deployments with Risk Taxonomy 2026 monitoring, AI BoM gating, and Phase 3 DPS evidence production feeding STR-08 ROAI tracking.

Run next

GOV-08
Agentic Governance Charter
Establishes binding governance, mandatory safeguards, and approval authorities for all Agentic Tier 3 and Tier 4 AI deployments.

Diagnostic Relevance

Running the AI Evaluation Harness Specification strengthens the Defensibility lens — expected Band progression: Integrated → Optimised.

Confidence: high

Key Takeaways

Evaluate every AI tool across hallucination, bias, and robustness domains before deployment.
Apply a 0–100 scoring rubric with Pass (80+), Conditional Pass (60–79), and Fail (<60) thresholds.
Use the Agentic Tier Supplement for all Tier 3–4 tools; any supplement failure is an overall Fail.
Populate DAT-06 Field 10 (Model Risk Profile) directly from GOV-09 evaluation outputs.
Establish bias and performance baselines at initial evaluation to power continuous monitoring.
Retain evaluation evidence for 5 years, and 7 years for agentic supplement and red-team records.
Trigger re-evaluation on major model changes, annual review, or monitoring-detected drift.
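The scoring thresholds and the supplement rule in the takeaways above can be sketched as a verdict function. This is an illustration, not the official rubric: the function names are hypothetical, hard floors are not modelled because their values are not given here, and the rule that the weakest domain verdict determines the overall verdict is an assumption.

```python
from typing import Mapping, Optional

PASS, CONDITIONAL, FAIL = "Pass", "Conditional Pass", "Fail"


def verdict_for_score(score: float) -> str:
    """Map a 0-100 domain score to Pass (80+), Conditional Pass (60-79), or Fail (<60)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 80:
        return PASS
    if score >= 60:
        return CONDITIONAL
    return FAIL


def overall_verdict(
    domain_scores: Mapping[str, float],
    supplement_passed: Optional[bool] = None,  # None for tools below Tier 3
) -> str:
    """Combine domain verdicts; any Agentic Tier Supplement failure is an overall Fail."""
    if not domain_scores:
        raise ValueError("at least one domain score is required")
    if supplement_passed is False:
        return FAIL
    verdicts = {verdict_for_score(s) for s in domain_scores.values()}
    # Assumption: the weakest domain verdict governs the overall result.
    if FAIL in verdicts:
        return FAIL
    if CONDITIONAL in verdicts:
        return CONDITIONAL
    return PASS
```

Under these assumptions, a tool scoring 85, 72, and 90 across the three domains receives a Conditional Pass, and a Tier 3 tool with scores of 90 in every domain still fails if the supplement fails.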

Targeting

Audience

AI Governance Lead · IT Security · Legal Operations · Risk and Compliance · General Counsel

Strengthens

Defensibility lens · Sophistication lens

Module Details

Format: Module
Difficulty: Advanced
Pillar: P4
Layer: E · Execution
Owner: CIO / CISO
Access: Enterprise Partnership

Maturity Bands

Operational → Integrated → Optimised → Defensible

Risk Classes Mitigated

Where this Module lives

The Evaluation Harness is the quality gate between Pilot Program Design (USE-02) and production deployment. It produces DE-2 (Methodology transparency) and DE-3 (Evidence framework) records into the DPS, generates evaluation evidence for the AI BoM (DAT-06), and feeds Agentic Charter (GOV-08) tier decisions. Without this Module, deployment evidence collapses into vendor demos.

Advisory

When this Module sits inside a Programme.

Modules are operated in-house by GC and Legal Operations teams. When the capability transformation is multi-Pillar — or when the regulator timeline tightens — Advanta operates the canonical Module sequence as a Programme.
