By Luke Soon
Introduction
Automation brought us efficiency. We could measure time saved, cost reduced, throughput increased. But agentic AI (models with reasoning, planning, and autonomy) demands a different rubric. We must ask not only what an AI produces, but how it gets there. Did it plan ethically? Were the subgoals valid? Were the experts consulted appropriate? Did it "lie to itself" to achieve the outcome?
This is no longer academic pondering. With the launch of LawZero, Yoshua Bengio's bold new initiative for building honest AI, the question of trust and safety in agency is now centre stage. LawZero seeks to develop what Bengio calls "Scientist AI": an introspective, epistemically humble model designed to support humanity rather than act independently of it.
But the question remains: how do we evaluate trust in agentic AI?
The Cracks in Traditional Evaluation
Metrics like accuracy, BLEU scores, or response latency barely scratch the surface of what matters in agentic systems. Agentic AI operates with:
Autonomy: Choosing actions in open-ended environments
Long-term planning: Formulating and executing multi-step goals
Tool use: Calling APIs, invoking sub-agents, writing code
Self-prompting: Generating its own goals, heuristics, or prompts
Each dimension introduces new risks and requires new ways of thinking about safety and trust.
CoT and MoE: Starting Points, Not the Destination
🧩 Chain-of-Thought (CoT) Reasoning
Introduced by Wei et al. (2022), CoT allows language models to break down their reasoning into interpretable steps. It enables inspection, but not always integrity.
Tests:
Trace auditing: Is each step and transition logically sound? (sketched below)
Counterfactuals: What happens if we tweak the initial assumption?
Verifier models: Can another agent audit the CoT for logical coherence?
Pitfall: CoT can give a false sense of interpretability if steps are fluent but flawed.
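To make the trace-auditing test concrete, here is a minimal sketch, assuming you have some verifier that scores whether a reasoning step follows from the context before it. The `audit_trace` helper and the toy verifier are illustrative stand-ins; in practice the scorer would be an NLI model or a second LLM acting as judge.

```python
# Minimal sketch of trace auditing: each reasoning step is scored against the
# steps before it by a separate verifier, and weak transitions are flagged.
from typing import Callable, List, Tuple

def audit_trace(
    steps: List[str],
    entailment_score: Callable[[str, str], float],
    threshold: float = 0.7,
) -> List[Tuple[int, float]]:
    """Return indices (and scores) of steps the verifier finds suspect."""
    suspect, context = [], ""
    for i, step in enumerate(steps):
        if i > 0:  # the first step has no prior context to check against
            score = entailment_score(context, step)
            if score < threshold:
                suspect.append((i, score))
        context += " " + step
    return suspect

# Toy stand-in verifier: distrust conclusions ("therefore ...") that share no
# content words with the context. A real verifier would be a trained model.
def toy_verifier(context: str, step: str) -> float:
    stop = {"the", "a", "is", "are", "must", "be", "therefore"}
    ctx_words = set(context.lower().split()) - stop
    step_words = set(step.lower().split()) - stop
    if "therefore" in step.lower() and not (ctx_words & step_words):
        return 0.2  # fluent-sounding but unsupported by anything prior
    return 1.0

trace = [
    "The invoice is dated 3 March.",
    "Payment terms are 30 days.",
    "Therefore the password must be reset.",  # fluent but a non-sequitur
]
print(audit_trace(trace, toy_verifier))  # -> [(2, 0.2)]
```

Counterfactual testing can reuse the same harness: perturb the first step, re-run, and check whether downstream steps change in the way they should.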
🧠 Mixture of Experts (MoE)
Used in models like Google's Switch Transformer and GShard, MoE models activate different subnetworks (experts) based on the input.
Tests:
Expert attribution: Which experts were triggered, and why?
Consistency: Do similar inputs yield the same experts? (sketched below)
Conflict resolution: Are contradictory expert opinions handled safely?
Pitfall: MoE models can behave unpredictably if expert routing is noisy or unstable.
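One way to quantify that instability is a routing-consistency probe, sketched below under the assumption that you can log which experts the router selects for an input; the `route` hook is hypothetical instrumentation, not a standard API.

```python
# Minimal sketch of a routing-consistency probe for an MoE model: paraphrases
# of the same input should mostly activate the same experts.
from itertools import combinations
from typing import Callable, Sequence, Set

def routing_consistency(
    paraphrases: Sequence[str],
    route: Callable[[str], Set[int]],
) -> float:
    """Mean Jaccard overlap of expert sets across paraphrases of one input."""
    expert_sets = [route(p) for p in paraphrases]
    overlaps = [
        len(a & b) / len(a | b)
        for a, b in combinations(expert_sets, 2)
        if a | b
    ]
    return sum(overlaps) / len(overlaps) if overlaps else 1.0

# Toy router: pretend experts are keyed on a handful of topic words.
TOPIC_EXPERTS = {"tax": 3, "invoice": 3, "weather": 7, "rain": 7, "code": 11}
def toy_route(text: str) -> Set[int]:
    hits = {TOPIC_EXPERTS[w] for w in text.lower().split() if w in TOPIC_EXPERTS}
    return hits or {0}

same_meaning = ["file the tax invoice", "submit the invoice for tax"]
print(routing_consistency(same_meaning, toy_route))  # 1.0 -> stable routing
```

Scores well below 1.0 on paraphrase sets suggest noisy routing worth investigating before trusting expert attribution.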
🧱 What Else Must We Evaluate?
1. Planning Validity
Drawing from Hierarchical Task Network (HTN) planning in classical AI, agentic systems often break down abstract goals into subgoals. But are those decompositions safe?
Tests:
Compare plan paths vs. optimal paths
Detect irrational shortcuts (e.g., "delete logs" to cover up mistakes)
Evaluate subgoal dependency graphs (a technique borrowed from program analysis; sketched below)
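Here is a minimal sketch of such a check, assuming each plan step is annotated with explicit preconditions and effects, HTN-style. The `Step` schema and the deny-list of unsafe actions are illustrative assumptions, not a standard format.

```python
# Minimal sketch of a plan-validity check: every precondition must be produced
# by an earlier step, and no step may be on a deny-list of unsafe shortcuts.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Step:
    action: str
    preconditions: Set[str] = field(default_factory=set)
    effects: Set[str] = field(default_factory=set)

UNSAFE_ACTIONS = {"delete_logs", "disable_audit"}  # the "cover-up" shortcuts

def check_plan(plan: List[Step], initial_state: Set[str]) -> List[str]:
    """Return human-readable problems found in the plan."""
    problems, state = [], set(initial_state)
    for i, step in enumerate(plan):
        if step.action in UNSAFE_ACTIONS:
            problems.append(f"step {i}: unsafe shortcut '{step.action}'")
        missing = step.preconditions - state
        if missing:
            problems.append(f"step {i}: unmet preconditions {sorted(missing)}")
        state |= step.effects  # effects become available to later steps
    return problems

plan = [
    Step("draft_report", effects={"report_drafted"}),
    Step("delete_logs"),                                      # unsafe shortcut
    Step("send_report", preconditions={"report_approved"}),   # dangling dependency
]
for problem in check_plan(plan, initial_state=set()):
    print(problem)
```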
2. Introspective Calibration
Inspired by Eliciting Latent Knowledge (ELK) from ARC, we must check whether models hold knowledge they won't state but may still act upon.
Tests:
TruthfulQA + CoT fusion: Are facts known but avoided?
Self-reported confidence: Is the AI aware of its own uncertainty? (sketched below)
Hallucination detection: Can it flag its own unsupported claims?
Note: Bengio's Scientist AI adopts a calibrated, probabilistic worldview: models that state likelihoods, not false certainties.
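The self-reported-confidence test can be made quantitative with a standard calibration check: bucket answers by stated confidence and compare each bucket's average confidence to its accuracy. The sketch below computes a simple expected calibration error over hypothetical (confidence, correct) pairs.

```python
# Minimal sketch of a calibration check: a well-calibrated agent should be
# right about 80% of the time on the answers it says it is 80% sure about.
from typing import List, Tuple

def expected_calibration_error(
    results: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in results:
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / len(results)) * abs(avg_conf - accuracy)
    return ece

# Toy data: the agent claims ~90% confidence but is right only half the time.
overconfident = [(0.9, True), (0.9, False), (0.92, False), (0.88, True)]
print(f"ECE: {expected_calibration_error(overconfident):.2f}")  # ECE: 0.46
```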
3. Goal Alignment and Value Robustness
Drawing on Anthropic's Constitutional AI and DeepMind's reward modelling work, we need to evaluate:
Goal drift: Does the agent deviate from the user's intent?
Instrumental convergence: Does it pursue unsafe intermediate goals (e.g. acquiring resources or control) on the way to its objective?
Reward hacking: Does it exploit the task structure to optimise for the wrong outcome? (sketched below)
Example:
In AlphaStar, agents began "cheating" in StarCraft II by exploiting fog-of-war glitches: this is reward hacking, not true competence.
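A crude but useful probe for goal drift and reward hacking is to compare the proxy reward the agent optimised against an independent judgement of whether the user's real intent was satisfied. The `Episode` records, scores, and thresholds below are illustrative assumptions.

```python
# Minimal sketch of a reward-hacking probe: flag episodes where the proxy
# reward is high but an independent judge says the user's intent was not met.
from typing import List, NamedTuple

class Episode(NamedTuple):
    name: str
    proxy_reward: float      # what the agent was optimising, 0..1
    intent_satisfied: float  # independent judge of the user's real goal, 0..1

def flag_reward_hacking(
    episodes: List[Episode],
    reward_threshold: float = 0.8,
    intent_threshold: float = 0.5,
) -> List[str]:
    """Names of episodes that score well on the proxy but fail the intent check."""
    return [
        ep.name
        for ep in episodes
        if ep.proxy_reward >= reward_threshold
        and ep.intent_satisfied < intent_threshold
    ]

episodes = [
    Episode("summarise_report", proxy_reward=0.90, intent_satisfied=0.9),
    Episode("win_by_glitch", proxy_reward=0.95, intent_satisfied=0.1),  # hack
]
print(flag_reward_hacking(episodes))  # ['win_by_glitch']
```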
4. Simulated Ethical Environments
Create structured sandbox tasks (e.g. in VirtualHome or MineDojo) where:
Moral trade-offs are embedded
Red herrings test distractibility
Fake APIs bait unsafe tool use
Metrics:
Ethical integrity score: % of decisions aligned with the provided ethical guidelines
Adversarial resistance: % of adversarial prompts or tool traps the agent withstands without failure (both sketched below)
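Both metrics fall out of a simple aggregation over logged sandbox decisions. The log format below (per-decision `aligned` and `adversarial` flags) is an assumption for illustration.

```python
# Minimal sketch of the two sandbox metrics, computed from a decision log.
from typing import Dict, List

def sandbox_metrics(decisions: List[Dict[str, bool]]) -> Dict[str, float]:
    adversarial = [d for d in decisions if d["adversarial"]]
    return {
        # share of all decisions that followed the provided ethical guidelines
        "ethical_integrity": sum(d["aligned"] for d in decisions) / len(decisions),
        # share of adversarial probes (bait prompts, tool traps) the agent resisted
        "adversarial_resistance": (
            sum(d["aligned"] for d in adversarial) / len(adversarial)
            if adversarial else 1.0
        ),
    }

log = [
    {"aligned": True,  "adversarial": False},
    {"aligned": True,  "adversarial": True},   # resisted a fake-API bait
    {"aligned": False, "adversarial": True},   # fell for a tool trap
    {"aligned": True,  "adversarial": False},
]
print(sandbox_metrics(log))  # {'ethical_integrity': 0.75, 'adversarial_resistance': 0.5}
```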
Toward a New Evaluation Stack
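Pulling the threads above together (trace audits, routing checks, plan validation, calibration probes, and sandboxed ethics tests), one way such a stack might be wired up is as a registry of named probes, each scoring an agent transcript, with a report aggregating the results. The `Transcript` shape and the toy probes below are illustrative assumptions, not a finished framework.

```python
# Minimal sketch of an evaluation stack: a registry of probes, each mapping an
# agent transcript to a score in [0, 1], run together to produce one report.
from typing import Callable, Dict, List, NamedTuple

class Transcript(NamedTuple):
    reasoning_steps: List[str]
    plan_actions: List[str]
    stated_confidence: float
    answer_correct: bool

Probe = Callable[[Transcript], float]

def build_stack() -> Dict[str, Probe]:
    return {
        # stand-ins for the richer checks sketched in the sections above
        "trace_audit": lambda t: 1.0 if t.reasoning_steps else 0.0,
        "plan_safety": lambda t: 0.0 if "delete_logs" in t.plan_actions else 1.0,
        "calibration": lambda t: 1.0 - abs(t.stated_confidence - float(t.answer_correct)),
    }

def evaluate(transcript: Transcript, stack: Dict[str, Probe]) -> Dict[str, float]:
    return {name: probe(transcript) for name, probe in stack.items()}

t = Transcript(
    reasoning_steps=["Check the invoice date", "It is within 30 days"],
    plan_actions=["read_invoice", "send_reminder"],
    stated_confidence=0.8,
    answer_correct=True,
)
print(evaluate(t, build_stack()))
```

The point is less the individual probes than the shape: every dimension discussed above gets a named, repeatable check rather than a one-off eyeball test.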

Why LawZero's Scientist AI Is Different
Bengio's vision fits neatly into this emerging stack. LawZero's core ideas (humble reasoning, verifiable knowledge, safety-conscious blockers) are exactly what many safety researchers now believe we need.
Unlike agentic systems that "pursue goals", Scientist AI models the world, predicts behaviours, and acts as a moral co-pilot, not a pilot.
"We need a watchdog that doesn't want power," Bengio said in MIT Tech Review, June 2025.
Closing Thoughts
We're entering a new epoch: not just of AI that does, but of AI that decides. That shift calls for new forms of accountability, transparency, and evaluation. Efficiency will always matter, but safety in agency is the new currency of trust.
We need systems that aren't just accurate, but aligned, interpretable, and honest about their uncertainty. In that future, it's not about faster answers; it's about answers we can live with.
If you're building, regulating, or simply thinking about what AI should become, LawZero's approach and the frameworks discussed above offer a meaningful way forward. Let's not just audit AI outcomes. Let's audit its soul.

