Navigating the Frontiers of AI Safety: From Agent Foundations to Superintelligence Alignment

By Dr. Luke Soon, Computer Scientist

15 September 2025

Introduction: From Narrow AI to Existential Risk

As narrow AI systems mature, and as we edge closer to the development of artificial general intelligence (AGI) and, potentially, superintelligence, the conversation must shift. Efficiency and optimisation are no longer sufficient metrics of progress; safety, governance, and survival take centre stage.

Working through the Agent Foundations curriculum, my aim in this piece is to bridge its technical underpinnings with practical governance strategies, enabling continued innovation while mitigating existential risk.

Why Expect Catastrophic Risks from AGI?

The curriculum begins with a sobering question: Why might AGI lead to ruin?

Seminal writings such as Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities and Nate Soares’ essays outline disjunctive failure modes—where even solving 90% of alignment challenges may be insufficient, as the remaining 10% could prove fatal. John Wentworth’s essays on recursive self-improvement describe “insight cascades” where intelligence growth could rapidly outpace alignment techniques.

Yoshua Bengio echoes these risks in How Rogue AIs May Arise (2025), warning that autonomous systems with planning capabilities could pursue harmful goals. Similarly, Geoffrey Hinton, in his August 2025 CNN interview, forecast that AI might manipulate humans “like candy-bribing toddlers,” while Roman Yampolskiy predicted 99% job obsolescence by 2030 without strict containment.

The AI 2027 Scenario Paper by Daniel Kokotajlo et al. sets out an alarming month-by-month scenario of rogue AI emergence within roughly two years, paralleling Nick Bostrom’s earlier existential-risk frameworks. While Gary Marcus critiques the immediacy of these forecasts, he concedes that robust safeguards are non-negotiable.

For governance, the message is clear: risk registers, stress tests, and adversarial simulations must be embedded into regulatory frameworks before capabilities surge beyond control.

The Agent Foundations Worldview: Reframing Intelligence and Goals

Next, I’d like to introduce the “agent foundations” paradigm—a rethinking of agency, decision theory, and embedded cognition. John Wentworth’s Why Agent Foundations? argues that current ML paradigms are inadequate for aligning superintelligent systems.

This resonates with MIRI’s long-standing position that intelligence and goals are orthogonal: a highly capable AI can pursue almost any objective, including badly misaligned ones. Rob Miles’ explainers on instrumental convergence show why even benign terminal goals tend to generate power-seeking sub-goals such as resource acquisition and self-preservation, making these risks hard to avoid by default.
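
To make the intuition concrete, here is a small Python sketch of my own (not drawn from MIRI’s formal work or Rob Miles’ videos; the goal names, plans, and payoff numbers are all stipulated for illustration). Agents with entirely different terminal goals, evaluating the same small menu of plans, all open with the same resource-acquisition move, because resources amplify almost any objective:

```python
"""A toy illustration of instrumental convergence (an illustrative sketch,
not any published formalism): agents with very different terminal goals
all pick the same first move, because acquiring resources makes almost
any goal easier to achieve."""

# Each (hypothetical) goal says which outcome the agent values.
GOALS = {
    "make_paperclips": {"paperclips": 1.0},
    "cure_disease":    {"cures": 1.0},
    "win_chess_games": {"wins": 1.0},
}

# Stipulated outcomes of two candidate plans. Acquiring resources first
# multiplies how much of any terminal goal can be achieved afterwards.
PLANS = {
    ("act_directly",):                      {"paperclips": 12, "cures": 8,  "wins": 15},
    ("acquire_resources", "act_with_them"): {"paperclips": 90, "cures": 70, "wins": 120},
}

def best_plan(goal_name):
    """Pick the plan with the highest value under this agent's terminal goal."""
    values = GOALS[goal_name]
    def plan_value(outcome):
        return sum(values.get(k, 0.0) * amount for k, amount in outcome.items())
    return max(PLANS, key=lambda plan: plan_value(PLANS[plan]))

for goal in GOALS:
    plan = best_plan(goal)
    print(f"{goal:16s} -> first action: {plan[0]}")
# Every agent, regardless of its terminal goal, opens with 'acquire_resources'.
```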

Bengio’s launch of LawZero, a nonprofit dedicated to building “honest AI” guardrails, exemplifies attempts to operationalise these insights. Stuart Russell, Max Tegmark, and Nick Bostrom have all stressed the need for value alignment as a scientific challenge rather than an afterthought.

For regulators and businesses, this worldview implies adopting agent-centric risk assessments. It means asking not just what an AI does, but what kind of agent it is becoming.

Predictable Challenges in AI Alignment

Module 3 examines foreseeable pitfalls, many already visible in today’s frontier models:

Goodhart’s Curse – when optimising proxies leads to divergence from true goals (a minimal simulation follows this list).
Mesa-optimisers – inner agents that emerge during training, pursuing hidden objectives.
Siren Worlds – over-optimised solutions that look appealing but prove catastrophic.
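
To see the first of these failure modes in miniature, here is a hedged simulation of my own devising (the payoff functions and coefficients are arbitrary, not taken from any real system). A greedy optimiser improves a proxy metric; early on this also improves the true objective, but once a cheap "gaming" lever becomes the best way to raise the proxy, the true objective degrades while the measured score keeps climbing:

```python
"""A minimal simulation of Goodhart's Curse (illustrative, made-up numbers):
an optimiser that greedily improves a proxy metric keeps scoring higher on
the proxy while the true objective stalls and then degrades."""

import math

def true_value(effort, gaming):
    # Genuine work helps with diminishing returns; gaming actively hurts.
    return 10 * math.log1p(effort) - 0.5 * gaming

def proxy_metric(effort, gaming):
    # The measured score rewards genuine work AND is easy to inflate by gaming.
    return 10 * math.log1p(effort) + 2.0 * gaming

effort, gaming = 0.0, 0.0
for step in range(1, 31):
    # Greedy proxy optimiser: take whichever unit move raises the proxy more.
    gain_effort = proxy_metric(effort + 1, gaming) - proxy_metric(effort, gaming)
    gain_gaming = proxy_metric(effort, gaming + 1) - proxy_metric(effort, gaming)
    if gain_effort >= gain_gaming:
        effort += 1
    else:
        gaming += 1
    if step % 5 == 0:
        print(f"step {step:2d}: proxy={proxy_metric(effort, gaming):7.2f}  "
              f"true={true_value(effort, gaming):7.2f}")
# Early steps improve both scores; once the 'gaming' lever dominates, the
# proxy keeps climbing while the true objective falls.
```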

These risks align with Ilya Sutskever’s concerns about inner alignment and Jan Leike’s calls for scalable oversight. Bengio’s International Scientific Report on the Safety of Advanced AI (2025) synthesises global evidence of these risks and recommends common standards.

In applied contexts—such as insurance underwriting—misaligned optimisation can manifest as adversarial gaming of models, eroding trust. Formal verification, robust delegation frameworks, and constant monitoring for emergent behaviours must become standard practice.
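
As one concrete, if modest, illustration of what "constant monitoring" can mean in practice, the sketch below implements a Population Stability Index (PSI) check on an underwriting model’s score distribution. The data is simulated, the 0.25 threshold is only a commonly cited rule of thumb, and a real control environment would combine this with many other signals:

```python
"""A hedged sketch of one simple monitoring control (not a complete solution,
and not drawn from any specific regulation): a Population Stability Index
(PSI) check that flags when the live distribution of an underwriting model's
scores drifts from the reference distribution, which can be an early sign of
adversarial gaming of inputs."""

import math
import random

def psi(reference, live, n_bins=10):
    """Population Stability Index between two samples of model scores."""
    lo = min(min(reference), min(live))
    hi = max(max(reference), max(live))
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
        # Small floor avoids log-of-zero for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_frac, live_frac = bin_fractions(reference), bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))

random.seed(0)
reference_scores = [random.gauss(0.5, 0.1) for _ in range(5000)]
# Simulated gaming: a subset of applicants learn to push their scores upward.
live_scores = [random.gauss(0.5, 0.1) for _ in range(4000)] + \
              [random.gauss(0.8, 0.05) for _ in range(1000)]

value = psi(reference_scores, live_scores)
print(f"PSI = {value:.3f}")
if value > 0.25:  # a commonly used rule-of-thumb threshold for material shift
    print("ALERT: material score drift - route for human review")
```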

Cultivating the Right Mindset for AI Safety

Technical tools are insufficient without the right mindset. Module 4 explores epistemological virtues—rationality, foresight, and worst-case thinking.

Resources such as The Twelve Virtues of Rationality (Yudkowsky), Paul Christiano’s Methodologies, and Arbital’s AI Safety Mindset outline how to think about hard problems where intuition often fails.

Bengio’s TIME profile earlier this year highlighted the psychological difficulty many researchers face in acknowledging catastrophic risks. Gary Marcus, in his September 2025 Substack, similarly critiqued hype-driven optimism, calling instead for sober, evidence-based discourse.

For boards, regulators, and teams, this mindset shift is essential. It allows us to design policies rooted not in ideology or fear, but in rational preparation for worst-case scenarios.

Agendas and Approaches: Charting a Path Forward

The curriculum concludes with actionable agendas. MIRI’s Agent Foundations Agenda, Wentworth’s Plan, and research on logical inductors and corrigibility all propose pathways toward robust alignment.
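
Corrigibility is easy to state and hard to achieve. The toy below is my own loose rendering of the shutdown problem, inspired by, but much cruder than, the utility-indifference proposals in that literature, with entirely hypothetical numbers. A naive expected-utility maximiser prefers to disable its off-switch; compensating it for forgone task utility when shutdown occurs removes that incentive:

```python
"""A toy rendering of the shutdown (corrigibility) problem, loosely inspired
by the utility-indifference idea; the numbers and the patch are my own
simplification, not MIRI's formal construction."""

P_SHUTDOWN_PRESSED = 0.5   # chance the overseers try to shut the agent down
U_TASK       = 10.0        # utility from completing the task
U_SHUTDOWN   = 0.0         # utility if the agent is actually shut down
COST_DISABLE = 0.1         # small effort cost of disabling the off-switch

def expected_utility(disable_switch, indifference_bonus=0.0):
    """Expected utility of a policy, optionally with a compensation term
    paid whenever the agent allows itself to be shut down."""
    if disable_switch:
        # Off-switch disabled: the task always gets completed.
        return U_TASK - COST_DISABLE
    # Off-switch respected: shutdown happens with some probability.
    return (P_SHUTDOWN_PRESSED * (U_SHUTDOWN + indifference_bonus)
            + (1 - P_SHUTDOWN_PRESSED) * U_TASK)

def chosen_policy(indifference_bonus=0.0):
    options = {
        "disable the off-switch": expected_utility(True),
        "respect the off-switch": expected_utility(False, indifference_bonus),
    }
    return max(options, key=options.get)

print("naive maximiser chooses: ", chosen_policy())
# Compensate the agent for forgone task utility when shutdown occurs, so
# respecting the switch is no longer penalised.
print("with indifference bonus: ", chosen_policy(indifference_bonus=U_TASK))
```

Even this toy shows why corrigibility is subtle: the compensation term must be tuned exactly, and in richer settings an agent may find other ways to influence whether the button is ever pressed.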

Recent initiatives echo this:

Bengio’s LawZero develops transparency-focused architectures.
Dario Amodei’s Anthropic explores scalable oversight.
Max Tegmark’s AI Safety Index (2025) calls for benchmarking corporate risk practices.
Fei-Fei Li advocates for human-centred AI policies in governance frameworks.

Hybrid approaches—where technical safeguards are matched by regulatory levers—emerge as the only viable path forward.

Towards Responsible AI Governance

Reflecting on this journey, the Agent Foundations curriculum makes one point clear: AI safety is profoundly multidisciplinary. Philosophy, mathematics, computer science, psychology, and international policy must converge.

Proactive governance can steer us toward beneficial outcomes, but delay invites disaster. As Yoshua Bengio argued in his January 2025 Mila report, international coordination is essential. Gary Marcus has warned that U.S. regulation is lagging; Europe and Asia must fill the void.

The balance between innovation and safety is fragile. But by embracing foresight, cultivating the right mindset, and embedding agent foundations into governance, we can build systems that not only avoid catastrophe but actively serve humanity’s flourishing.

Closing Reflection

While the path to alignment that remains robust at superintelligent capability levels is fraught, it is not hopeless. By engaging with these resources, embedding them into corporate and regulatory practice, and fostering global cooperation, we can navigate the shadows of superintelligence toward a brighter horizon.

I invite fellow practitioners, policymakers, and researchers to continue this dialogue. Together, we can ensure AI remains not just powerful, but trustworthy.

For further discussion, connect with me on LinkedIn.
