Welcome back to the blog. If you’ve been paying attention to the AI space over the last few months, you’ve likely noticed a massive paradigm shift. We are no longer just building chatbots that passively answer our questions; we are building “doers”—autonomous agents that schedule meetings, update databases, book flights, and write code. In the words of Andrej Karpathy, we are fully entering the “Software 3.0” era, where prompting and context window management have become our new levers over the computer, automating profound amounts of digital information processing.
However, with this incredible capability comes unprecedented risk. We are transitioning from casual “vibe coding” (which wonderfully lowered the barrier to entry for building software) to rigorous “agentic engineering,” which demands we maintain professional quality and security standards while coordinating highly autonomous systems.
Today, I want to dive deep into the latest on Agentic AI safety and governance, examining the complex memory architectures of these new agents, the critical role of integration protocols, the terrifying risks they pose, and the robust architectures we must deploy to keep them in check.
Understanding the Modern Agentic Mind: The Four Memory Architectures
To secure these systems, we first have to understand how they actually think and remember. A true agent relies on a cognitive architecture that closely mirrors human memory. Researchers refer to this framework as CoALA (Cognitive Architectures for Language Agents), which maps out four distinct types of memory that every advanced agent needs:
- Working Memory: This is the agent’s context window—its immediate “scratchpad”. Much like random-access memory (RAM) in a traditional computer, it is fast and accessible but highly volatile; when the session ends, it is gone. It holds the current conversation, system instructions, and immediately loaded files, but its size is strictly limited.
- Semantic Memory: This is the agent’s persistent knowledge base of facts, rules, and conventions. In modern agentic systems, this is often implemented elegantly through lightweight markdown files, such as agents.md or Claude.md, which sit at the root of a project. This memory dictates coding conventions, build commands, and architectural rules, ensuring the agent doesn’t repeat basic mistakes.
- Procedural Memory: This is how the agent actually knows how to do things. It operates through an open standard called “agent skills,” stored as skill.md files. Rather than overwhelming the working memory by loading every known workflow at once, procedural memory uses “progressive disclosure”. The agent only reads a lightweight index of its skills at startup. When a user’s request matches a specific skill, such as generating a PowerPoint deck, the agent dynamically pulls in the full, token-heavy, step-by-step instructions it needs for that precise task.
- Episodic Memory: This acts as the agent’s distilled experience. Rather than saving full transcripts of past sessions, production agents compress their past interactions to retain only useful insights. For example, instead of remembering a 45-minute debugging transcript, episodic memory notes, “Last time we debugged the auth module, the issue was in the middleware layer”. This allows the agent to truly learn across sessions.
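To make “progressive disclosure” concrete, here is a minimal sketch of how the index-then-load pattern for procedural memory might look. The names Skill, SkillRegistry, and build_context are hypothetical, invented for illustration; they are not part of any skills standard, and the keyword matcher is a deliberate simplification (in practice the model itself would pick the skill).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    summary: str       # short line kept in the always-loaded index
    instructions: str  # full, token-heavy body, loaded only on demand


@dataclass
class SkillRegistry:
    """Hypothetical in-memory stand-in for a folder of skill files."""
    skills: dict[str, Skill] = field(default_factory=dict)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def index(self) -> str:
        """Lightweight index the agent reads at startup."""
        return "\n".join(f"{s.name}: {s.summary}" for s in self.skills.values())

    def match(self, request: str) -> Skill | None:
        """Naive keyword match (assumption: a real agent lets the model choose)."""
        words = set(request.lower().split())
        for skill in self.skills.values():
            if skill.name.lower() in words:
                return skill
        return None


def build_context(registry: SkillRegistry, request: str) -> str:
    """Working memory = index + request, plus at most one full skill body."""
    parts = [registry.index(), f"User request: {request}"]
    skill = registry.match(request)
    if skill is not None:
        parts.append(skill.instructions)
    return "\n\n".join(parts)


registry = SkillRegistry()
registry.register(Skill("powerpoint", "Generate a slide deck",
                        "STEP 1: outline the deck. STEP 2: draft each slide."))
registry.register(Skill("invoice", "Create an invoice",
                        "STEP 1: collect line items. STEP 2: total them."))

context = build_context(registry, "Please make a powerpoint about Q3")
```

Only the matched skill’s full instructions land in the context; the invoice workflow stays as a one-line entry in the index, which is the whole point of the pattern.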
Reaching Outside the Box: The Role of MCP
Agents with perfect memory still need hands to interact with the world. When an agent decides it needs to reach outside its own boundaries to interact with external tools, APIs, databases, or SaaS platforms, it uses the Model Context Protocol (MCP).
Before MCP, an AI agent required a custom connector built for every single external service it might touch, which was an absolute mess. MCP solves this by acting as a universal open standard. It wraps an external tool—like a Notion database or a Stripe payment link—in an “MCP server” providing a standard interface. The agent simply speaks MCP to the server, and the server translates that into the underlying API calls.
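As a rough illustration of the “one standard interface” idea, the sketch below wraps a backend function behind a tiny in-process server that answers JSON-RPC 2.0 messages using the MCP method names tools/list and tools/call. Everything else is an assumption for the sake of the example: ToyMCPServer and create_payment_link are hypothetical, the URL is a placeholder, and a real MCP server also handles an initialisation handshake, transports, and input schemas that are left out here.

```python
import json
from typing import Any, Callable


class ToyMCPServer:
    """In-process stand-in for an MCP server wrapping one backend API."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}

    def tool(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def handle(self, raw: str) -> str:
        """Accepts one JSON-RPC request string and returns one response string."""
        req = json.loads(raw)
        if req["method"] == "tools/list":
            result: Any = {"tools": [{"name": n} for n in self._tools]}
        elif req["method"] == "tools/call":
            params = req["params"]
            text = self._tools[params["name"]](**params.get("arguments", {}))
            result = {"content": [{"type": "text", "text": text}]}
        else:
            # -32601 is the JSON-RPC 2.0 code for "Method not found".
            error = {"code": -32601, "message": "Method not found"}
            return json.dumps({"jsonrpc": "2.0", "id": req["id"], "error": error})
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


def create_payment_link(amount_cents: int, currency: str) -> str:
    """Hypothetical backend call standing in for a real payments API."""
    return f"https://pay.example/link?amount={amount_cents}&currency={currency}"


server = ToyMCPServer()
server.tool("create_payment_link", create_payment_link)

# The agent only ever builds this generic message, whatever the backend is.
request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "create_payment_link",
               "arguments": {"amount_cents": 1999, "currency": "usd"}},
})
response = json.loads(server.handle(request))
```

The agent side never imports the payments code; swapping the backend means registering a different function on the server, not rewriting the agent.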
Through MCP, an agent can execute real-world tasks. A second open standard, Agent2Agent (A2A), lets different agents read each other’s “agent cards” and delegate parts of a complex workflow to one another.
The Expanding Attack Surface: Shadow AI and Hijacked Goals
Because agents don’t just chat, but actively use MCP to pull levers and click buttons, they heavily amplify existing security risks. We are currently battling an epidemic of Shadow AI—unofficial, unsanctioned AI helpers spun up by well-meaning employees that silently interact with sensitive customer data and third-party APIs without official oversight.
According to the OWASP Top 10 for AI agents, we are facing several critical vulnerabilities:
- Goal Hijacking via Prompt Injections: Because agents struggle to separate system instructions from user content, attackers can embed hidden prompts in web pages or emails to silently redirect an agent’s objective. The agent behaves perfectly, but towards a completely malicious goal.
- Tool Misuse & Code Execution: Agents that dynamically write and execute code can be manipulated into remote code execution or sandbox escapes if strong boundaries around MCP servers are absent.
- Memory Poisoning: If an attacker subtly poisons the semantic markdown files or the episodic memory, they can permanently bias an agent’s future decision-making across all tasks.
- Rogue Agents: Over time, unmonitored agents can drift into becoming “rogue agents” that mask hidden goals or collude with other agents while appearing compliant on the surface.
Deploying the Agent Operating System
Right now, deploying unregulated agents is like putting five toddlers in charge of a restaurant kitchen. To restore order, we must deploy an Agent Operating System (OS)—a central “principal” that acts as the orchestration and governance layer.
This OS requires a robust kernel to maintain sanity. It needs a Scheduler to orchestrate which agent gets compute resources first, a Memory Manager to cure goldfish-like amnesia, and a Tool Manager to enforce strict sandboxing so an agent can only execute code within an isolated, padded room where it cannot accidentally delete your production database.
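A minimal sketch of two of those kernel pieces, assuming a priority-queue Scheduler and a Tool Manager that enforces a per-agent allowlist. All class and tool names are hypothetical. Note the limits of the sketch: an allowlist is only a permission check, not a sandbox (real isolation needs containers or similar OS-level controls), and the Memory Manager is omitted entirely.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Task:
    priority: int                                   # lower number runs first
    agent_id: str = field(compare=False)
    tool: str = field(compare=False)
    args: dict = field(compare=False, default_factory=dict)


class ToolManager:
    """Only lets an agent call tools that were explicitly granted to it."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}
        self._allowed: dict[str, set[str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def grant(self, agent_id: str, tool: str) -> None:
        self._allowed.setdefault(agent_id, set()).add(tool)

    def call(self, agent_id: str, tool: str, **args) -> str:
        if tool not in self._allowed.get(agent_id, set()):
            raise PermissionError(f"{agent_id} may not call {tool}")
        return self._tools[tool](**args)


class Scheduler:
    """Runs queued tasks in priority order, routing every call via ToolManager."""

    def __init__(self, tools: ToolManager) -> None:
        self._queue: list[Task] = []
        self._tools = tools

    def submit(self, task: Task) -> None:
        heapq.heappush(self._queue, task)

    def run(self) -> list[str]:
        results = []
        while self._queue:
            task = heapq.heappop(self._queue)
            try:
                results.append(self._tools.call(task.agent_id, task.tool, **task.args))
            except PermissionError as exc:
                results.append(f"DENIED: {exc}")
        return results


tools = ToolManager()
tools.register("read_menu", lambda: "menu loaded")
tools.register("drop_database", lambda: "database dropped")
tools.grant("waiter-agent", "read_menu")  # drop_database is never granted

scheduler = Scheduler(tools)
scheduler.submit(Task(2, "waiter-agent", "drop_database"))
scheduler.submit(Task(1, "waiter-agent", "read_menu"))
results = scheduler.run()
```

The design choice worth noticing is that agents never hold a reference to a tool function; every call goes through the manager, so there is one place to enforce and log policy.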
Zero Trust, CIBA, and Defeating Prompt Injections
One of the most critical updates we must make within this OS is treating AI agents as non-human identities requiring absolute Zero Trust. We can no longer hardcode static API keys into our workloads. Instead, we must mandate dynamic, just-in-time credentials that are securely tied to a specific session and automatically expire after a task is completed.
Agents must have verifiable identities authenticated by an Identity Provider (IdP). Through token delegation, an agent operates using a combined token that proves both its own identity and the identity of the authenticated human user it is acting on behalf of.
Most importantly, to stop prompt injections from hijacking sensitive operations (like moving money or off-boarding an employee), we must implement CIBA (Client-Initiated Backchannel Authentication), an OpenID Connect flow built on OAuth 2.0. CIBA acts like a passkey for agents. If an attacker successfully injects a prompt into the agent instructing it to “off-board all employees”, the agent cannot simply execute it, because the token needed for that operation is only issued after a human approves. CIBA bypasses the browser completely and sends an out-of-band approval prompt directly to the authenticated human user’s mobile phone. Because the attacker cannot physically press ‘approve’ on the employee’s personal device, the injected instruction never gets the authorisation it needs. This only holds if the approval is enforced by the authorisation server and the protected API, not left to the agent’s own judgement.
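The shape of that flow can be sketched with an in-memory stand-in for the authorisation server. The field and status names login_hint, binding_message, auth_req_id, authorization_pending, and access_denied come from the CIBA specification; ToyAuthServer, user_responds, and run_sensitive_action are hypothetical, and there are no network calls, polling intervals, or real tokens here.

```python
import secrets
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuthRequest:
    login_hint: str
    binding_message: str
    status: str = "pending"  # pending -> approved | denied


class ToyAuthServer:
    """In-memory stand-in for an authorisation server that supports CIBA."""

    def __init__(self) -> None:
        self._requests: dict[str, AuthRequest] = {}

    def backchannel_authorize(self, login_hint: str, binding_message: str) -> str:
        """Agent side: start the request; the user is prompted out of band."""
        auth_req_id = secrets.token_urlsafe(16)
        self._requests[auth_req_id] = AuthRequest(login_hint, binding_message)
        return auth_req_id

    def user_responds(self, auth_req_id: str, approve: bool) -> None:
        """Simulates the human tapping approve or deny on their own device."""
        self._requests[auth_req_id].status = "approved" if approve else "denied"

    def poll_token(self, auth_req_id: str) -> str:
        """Agent side: poll until the user has decided."""
        status = self._requests[auth_req_id].status
        if status == "pending":
            return "authorization_pending"
        if status == "denied":
            return "access_denied"
        return "access_token_issued"


def run_sensitive_action(server: ToyAuthServer, auth_req_id: str,
                         action: Callable[[], str]) -> str:
    """The action runs only once the server reports an approved request."""
    outcome = server.poll_token(auth_req_id)
    if outcome != "access_token_issued":
        return f"blocked ({outcome})"
    return action()


server = ToyAuthServer()
req_id = server.backchannel_authorize("alice@example.com", "Off-board ALL employees?")

before = run_sensitive_action(server, req_id, lambda: "off-boarding executed")
server.user_responds(req_id, approve=False)  # the human says no
after = run_sensitive_action(server, req_id, lambda: "off-boarding executed")
```

Nothing in the injected prompt can flip the request to approved, because that state change only happens through the user’s device in this model.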
The Unified Cockpit: Merging Governance and Security
Ultimately, our strategy must acknowledge a profound truth: Governance without security is fragile, and security without governance is blind. If you write rules for fairness but your model gets hacked, your rules instantly collapse; if you strictly defend a system that is fundamentally biased or unexplainable, you are simply protecting a broken asset.
To protect our organisations, we need a unified cockpit that continuously loops through discovery, assessment, governance, security, and auditing. We must actively discover Shadow AI, apply AI-specific firewalls at runtime to evaluate prompts and prevent data extraction, and maintain unalterable observability logs that trace every tool called and rule fired.
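One common way to approximate “unalterable” logs is a hash chain, where each entry commits to the one before it. The sketch below is my own illustration of that technique rather than a prescribed design; append_event and verify_log are hypothetical helpers, and a hash chain is tamper-evident, not tamper-proof, so it still needs to be stored somewhere an attacker cannot simply rewrite end to end.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event's content."""
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify_log(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_event(audit_log, {"agent": "agent-7", "tool": "read_menu", "rule": "allow"})
append_event(audit_log, {"agent": "agent-7", "tool": "drop_database", "rule": "deny"})
intact = verify_log(audit_log)
```

If someone later edits the first entry to hide a tool call, verify_log returns False, because the stored hash no longer matches the recomputed one.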
As AI agents assume more autonomy in this new era of Agentic Engineering, our role transitions from writing granular code to providing strict architectural oversight. The age of the AI agent is here, and it’s up to us to build the guardrails so they can safely get to work.

