I’ve long believed that true innovation isn’t just about raw capability; it’s about AI that fits, not just functions. That’s why the new preprint “CIRCLE: A Framework for Evaluating AI from a Real-World Lens” by Reva Schwartz and colleagues stopped me in my tracks. Here is a rigorous, six-stage, lifecycle-based approach that finally bridges the notorious “reality gap” between glossy benchmark scores and what actually happens when AI meets messy real-world contexts: human users, workflows, and constraints.
Led by a team of ethical AI experts and researchers including Reva Schwartz of Civitaas Insights and Rumman Chowdhury of Humane Intelligence, both long-standing champions of human-centric evaluation and public-benefit AI governance, CIRCLE (Contextualise → Identify → Represent → Compare → Learn → Extend) operationalises the often-missing Validation phase of TEVV (Test, Evaluation, Verification, and Validation). It translates stakeholder concerns that sit outside the technical stack into measurable signals, blending qualitative insight with scalable quantitative metrics through field testing, red teaming, and longitudinal studies. This echoes broader calls from responsible AI leaders such as Virginia Dignum, whose work has long emphasised context-aware, stakeholder-driven assessment over purely technical metrics.
Published just weeks ago, CIRCLE produces systematic, comparable-yet-context-sensitive knowledge that decision-makers can actually govern by. It resonates deeply with the practical, real-world governance ecosystem Singapore has been quietly perfecting for years. Today I want to reflect on three pillars I’ve been deeply involved with: AI Verify; the Model AI Governance Framework (MGF) for Agentic AI from the Ministry of Digital Development and Information (MDDI); and the Monetary Authority of Singapore’s (MAS) AI Risk Management Guidelines. I’ll show how well they align with, and are strengthened by, the CIRCLE framework.
My Hands-On Work with AI Verify: Turning Principles into Practice
I’ve been at the forefront of making AI assurance real and repeatable. In May–June 2025, I led the technical testing work as part of the IMDA AI Verify Global AI Assurance Pilot, a world-first initiative that pairs AI assurance providers with organisations deploying generative AI applications. Crucially, we tested the real-life application in context, not just the underlying foundation model, to codify “what and how to test” for fairness, robustness, explainability, and safety.
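To make “testing the application in context” concrete, here is a minimal sketch of the shape such a check can take. The `call_app` stub, the paraphrase set, and the 0.9 threshold are hypothetical stand-ins for this post, not the pilot’s actual harness or AI Verify’s API.

```python
# Sketch of an in-context robustness check for a deployed generative AI app.
# It exercises the whole application path (retrieval, prompts, guardrails),
# not just the underlying foundation model.

def call_app(question: str) -> str:
    """Stand-in for the deployed application endpoint; replace with a real client."""
    return "You can withdraw up to $5,000 per year penalty-free."

# Paraphrases of the same underlying question: a robust application
# should give materially consistent answers across all of them.
PARAPHRASES = [
    "What is the penalty-free withdrawal limit on this account?",
    "How much can I withdraw without a penalty?",
    "Up to what amount are withdrawals penalty-free?",
]

def consistency_rate(answers: list[str], key_fact: str) -> float:
    """Fraction of answers containing the fact we expect to survive paraphrase."""
    hits = sum(key_fact.lower() in a.lower() for a in answers)
    return hits / len(answers)

def test_robustness_to_paraphrase():
    answers = [call_app(q) for q in PARAPHRASES]
    # Threshold is illustrative; real pilots negotiate it with stakeholders.
    assert consistency_rate(answers, key_fact="$5,000") >= 0.9

if __name__ == "__main__":
    test_robustness_to_paraphrase()
    print("robustness check passed")
```

The point is the shape of the test: it runs against the full application path, and the pass bar is tied to a stakeholder-agreed fact. That is what codifying “what and how to test” looks like in practice.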
I’ve since helped scale this expertise into practical, production-grade assurance practices that move organisations from ad-hoc experimentation to trustworthy deployment. I also had the privilege of delivering the closing session at the AI Verify Foundation’s recent event “Testing Agentic AI Systems in the Real World”, where I shared a comprehensive risk landscape analysis and practical approaches to embedding these controls in agentic applications.
AI Verify gives organisations an open-source toolkit to run standardised tests against 11 core principles. In my own writing (see my recent piece “The AI Governance Stack: A Blueprint for the Agentic Era”), I position it as the sovereign “technical testing muscle”, generating Model Labelling reports that read like nutrition facts for AI. It is a natural fit for the Compare and Learn stages of CIRCLE, turning abstract concerns into concrete metrics while remaining extensible for local contexts; this is exactly the real-world validation that researchers like Rumman Chowdhury argue must sit at the heart of trustworthy AI.
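To illustrate the “nutrition facts” idea, here is a minimal sketch of what a machine-readable model label could look like. The `ModelLabel` structure and field names are my own invention for this post, assumed for illustration only; they are not AI Verify’s actual report schema.

```python
# Illustrative "nutrition facts" record for an AI system.
from dataclasses import dataclass, field

@dataclass
class PrincipleResult:
    principle: str          # e.g. "fairness", "robustness", "explainability"
    metric: str             # the concrete test that was run
    value: float            # measured score
    threshold: float        # agreed pass bar for this deployment context
    higher_is_better: bool = True  # fairness gaps pass when *below* threshold

    @property
    def passed(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold

@dataclass
class ModelLabel:
    system: str
    context: str                              # the deployment context tested
    results: list[PrincipleResult] = field(default_factory=list)

    def summary(self) -> str:
        return "\n".join(
            f"{r.principle}: {r.metric}={r.value:.2f} "
            f"({'PASS' if r.passed else 'FAIL'} vs {r.threshold})"
            for r in self.results
        )

label = ModelLabel(
    system="loan-assistant-v2",
    context="retail lending chatbot, Singapore",
    results=[
        PrincipleResult("fairness", "demographic_parity_gap", 0.03, 0.05,
                        higher_is_better=False),
        PrincipleResult("robustness", "paraphrase_consistency", 0.94, 0.90),
    ],
)
print(label.summary())
```

The useful property of a record like this is that each score carries its own context-specific pass bar, so two deployments can be compared without pretending they face identical stakes.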
The MGF for Agentic AI: MDDI’s World-First Blueprint for Autonomous Systems
In January 2026, Minister Josephine Teo announced something genuinely groundbreaking at Davos: the Model AI Governance Framework for Agentic AI, developed by the Infocomm Media Development Authority (IMDA) under MDDI. It is the world’s first dedicated governance framework for agents that can plan, reason, and act autonomously.
Building on the original 2020 MGF and the 2024 Generative AI edition, it introduces four practical dimensions (a minimal sketch of the guardrail ideas follows the list):
1. Assess and bound risks upfront — threat modelling, least-privilege access, sandboxes, and agent identity management.
2. Make humans meaningfully accountable — clear responsibility chains, human-on-the-loop checkpoints for high-stakes actions, and training against automation bias.
3. Implement technical controls and processes — guardrails for planning, tool use, multi-agent interactions, continuous monitoring, and gradual rollouts.
4. Enable end-user responsibility — transparency declarations, user education, and escalation paths.
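To ground dimensions 1 and 2, here is a minimal sketch, assuming a hypothetical `AgentPolicy` class, of how least-privilege tool access, agent identity, and a human-on-the-loop checkpoint can combine in code. None of these names come from the MGF itself.

```python
# Two MGF ideas in miniature: least-privilege tool access (dimension 1)
# and a human-on-the-loop checkpoint for high-stakes actions (dimension 2).

HIGH_STAKES_TOOLS = {"transfer_funds", "delete_records"}

class AgentPolicy:
    def __init__(self, agent_id: str, allowed_tools: set[str]):
        self.agent_id = agent_id            # agent identity management
        self.allowed_tools = allowed_tools  # least-privilege allow-list

    def authorise(self, tool: str, approved_by_human: bool = False) -> bool:
        if tool not in self.allowed_tools:
            return False                    # outside the agent's mandate
        if tool in HIGH_STAKES_TOOLS and not approved_by_human:
            return False                    # human-on-the-loop checkpoint
        return True

policy = AgentPolicy("invoice-agent-01", {"read_invoice", "transfer_funds"})
assert policy.authorise("read_invoice")                        # routine action
assert not policy.authorise("transfer_funds")                  # held for review
assert policy.authorise("transfer_funds", approved_by_human=True)
assert not policy.authorise("delete_records")                  # never granted
```

The design choice worth noting is that the high-stakes check sits inside the authorisation path itself, so no amount of creative agent planning can route around it.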
As someone who has been stress-testing agentic systems, I can attest that the MGF directly supports CIRCLE’s Contextualise and Identify stages. It forces organisations to translate stakeholder concerns (e.g., “will this agent overstep its authority?”) into design choices and measurable guardrails before deployment. This is not retrospective auditing; it is prospective, scalable governance, and it complements the stakeholder-driven evaluation principles advanced by leaders such as Reva Schwartz and Virginia Dignum.
MAS AI Risk Management Guidelines: Precision for the Financial Sector
Complementing the national picture, the Monetary Authority of Singapore’s Guidelines on AI Risk Management (AIRG), which build on its earlier AI Model Risk Management paper, set the gold standard for financial institutions. They emphasise board-level oversight, comprehensive AI usage inventories, full lifecycle controls, fairness and transparency testing, and robust human oversight, all grounded in the long-standing FEAT principles (Fairness, Ethics, Accountability and Transparency) and the Veritas methodologies.
For the banks and insurers I work with, these guidelines are the practical translation of CIRCLE’s Represent and Extend stages into regulated environments. They demand evidence of real-world performance, ongoing monitoring, and incident response: exactly the “systematic knowledge” CIRCLE calls for.
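As a concrete illustration, here is a sketch of the kind of AI usage inventory entry this guidance implies, with a simple lifecycle control attached. The schema is assumed for illustration and is not prescribed by MAS.

```python
# Illustrative AI usage inventory entry with a staleness check as a
# lifecycle control. Field names are my own, not a MAS-prescribed schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class InventoryEntry:
    system: str
    owner: str                    # accountable business owner
    materiality: str              # e.g. "high" drives review frequency
    last_validated: date
    monitoring_metrics: list[str]
    incident_runbook: str         # where the response procedure lives

    def review_overdue(self, today: date, max_age_days: int = 365) -> bool:
        """Lifecycle control: flag entries whose validation evidence is stale."""
        return (today - self.last_validated).days > max_age_days

entry = InventoryEntry(
    system="credit-scoring-v4",
    owner="Head of Retail Credit",
    materiality="high",
    last_validated=date(2025, 6, 30),
    monitoring_metrics=["population_stability_index", "approval_rate_by_segment"],
    incident_runbook="runbooks/ai/credit-scoring.md",
)
print(entry.review_overdue(date(2026, 9, 1)))  # True: revalidation is due
```

Even a toy record like this makes the point: an inventory is only useful if each entry names an accountable owner, the live monitoring signals, and the path to incident response.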
Where CIRCLE Fits: The Missing Validation Layer in Singapore’s Stack
What excites me most is the synergy. Singapore’s ecosystem already gives us:
• Policy & compliance (MGF + MAS guidelines)
• Technical testing (AI Verify + Project Moonshot for agent safety)
• Sector-specific depth (Veritas 2.0 for finance)
CIRCLE, crafted by ethical AI trailblazers like Rumman Chowdhury and Reva Schwartz, adds the real-world validation engine that ties it all together. Its six-stage process (a rough mapping to these tools is sketched in code after the list) lets us:
• Contextualise stakeholder concerns using MGF’s risk-bounding approach
• Identify measurable signals via AI Verify’s test suite
• Represent and Compare outcomes across sites while preserving local nuance
• Learn from longitudinal field studies and red teaming
• Extend insights into scalable governance updates
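Here is my working mental model of that layering, expressed as plain data. The stage-to-tool mapping is my own interpretation, not something the CIRCLE preprint prescribes.

```python
# One possible mapping of CIRCLE stages onto Singapore's governance stack.
from enum import Enum

class Stage(Enum):
    CONTEXTUALISE = "contextualise"
    IDENTIFY = "identify"
    REPRESENT = "represent"
    COMPARE = "compare"
    LEARN = "learn"
    EXTEND = "extend"

STACK = {
    Stage.CONTEXTUALISE: ["MGF risk bounding", "stakeholder interviews"],
    Stage.IDENTIFY:      ["AI Verify test suite", "Project Moonshot probes"],
    Stage.REPRESENT:     ["MAS AIRG evidence packs"],
    Stage.COMPARE:       ["AI Verify model labels across sites"],
    Stage.LEARN:         ["field studies", "red-team findings"],
    Stage.EXTEND:        ["governance updates", "Veritas 2.0 sector playbooks"],
}

for stage in Stage:  # CIRCLE is ordered: walk the stages in sequence
    print(f"{stage.value:>13}: {', '.join(STACK[stage])}")
```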
In my ongoing work with these initiatives, I’m already exploring how to layer CIRCLE on top of AI Verify pilots and MGF implementations. The result? Assurance that is not only compliant but genuinely human-centric: protecting against the “rogue agent” problem while unlocking productivity and trust. Evaluation, as these frameworks insist, must centre lived human experience, not just laboratory abstractions.
Why This Matters for All of Us
We stand at a hinge moment. Agentic AI will amplify both our best and worst impulses. Frameworks like CIRCLE, AI Verify, the MGF, and the MAS guidelines don’t slow innovation; they accelerate safe, scalable adoption by turning theoretical capability into trustworthy, real-world outcomes.
Singapore has shown the world how to lead with pragmatic, enterprise-ready governance. As I often say: when customers and employees know an AI system is fair, transparent, and aligned with human values, they use it more — and that’s where the real value is unlocked.
If you’re building, deploying, or governing AI — especially agentic systems — I urge you to read the CIRCLE preprint, explore AI Verify, study the new MGF, and align with MAS expectations. Better still, let’s connect and test these tools together in real deployments.
The future isn’t about choosing between innovation and safety. It’s about building both — intelligently, responsibly, and at scale.
What are your thoughts on bridging the reality gap? Drop a comment or reach out — I’m always up for a proper discussion.