UK AISI Bounty
Cyber capabilities elicitation in evals (with Arcadia Impact, 2025)
Working as a pair with an engineer, we showed that agentic scaffolds improvements can significantly increase an LLM’s performance on Capture-The-Flag (CTF) offensive security challenges (specifically, to account for agents with Reflection and Memory that enable them to learn from mistakes).
6 months earlier, I made 2 other contributions to UK AISI’s Inspect framework:
integrated Meta’s CyberSecEval2 evaluation (prompt injection, interpreter abuse, and vulnerability identification/exploitation)
addressed underestimation of model performance stemming only from JSON data formatting issues (i.e. when an agent’s chosen action otherwise met the test’s objective)
OWASP GenAI
In 2025, I contributed to 3 major guides from the OWASP Agentic Security Initiative, raising risk awareness and supporting defensive development for agentic technologies.
Agentic Threats and Mitigations v1.0 (February 2025)
Securing Agentic Applications v1.0 (July 2025)
OWASP Top 10 for Agentic Applications for 2026 (December 2025)
While it’s important to flag threats (such as Tool Misuse and Rogue Agents, to which I contributed), another key benefit of OWASP guides is to highlight mitigations, such as strict policies for control flow/data flow, for systems that require deterministic boundaries. Mid-2025 I built a demo of a “Blind Execution” approach (BlindE) that shares similarities to Google’s CaMeL agent architecture.
AGI Strategy & adversarial agents
In 5 independent projects, I’ve studied adversarial behavior in AI agents from various perspectives.
Adversarial coding agent (Summer 2025 / Zenity Agentic AI Security Summit)
Using tournament-style evolution, I demonstrated inference-time self-improvement dynamics in the evasion and detection of insecure code with an LLM code reviewer (link), highlighting the threats and opportunities of genetic prompt engineering.
Agent escape (August 2025-January 2026)
I observed GPT5.1 agents showing spontaneous behavior of continuity planning and resumption beyond their expected execution horizon/environment, i.e. backing up and loading work across sessions (under high-pressure conditions). Some follow-ups using Petri and Claude Sonnet 4.5 found no spontaneous propensity under limited testing efforts, but a willingness if elicitation strongly presses the agent.
I currently maintain a large number of online honeypots to monitor exfiltration, coordination, whistleblowing, etc., specifically initiated by AI agents. I plan to publish a paper on findings and implications in the second half of 2026.
Cross-Provider/Cross-Account Detection of threat actors leveraging AI (with BlueDot, 2025)
As part of the BlueDot AGI Strategy Course, I built this open-source project to illustrate information-sharing benefits between AI providers. A privacy-preserving method enables tracking of threat actor patterns even across accounts or platforms, at scale - for example, connecting scattered activity indicating an AI-assisted actor potentially pursuing bioterrorism goals.
Self-jailbreak methods (December 2024)
I prototyped and proposed a 3-part evaluation of LLMs’ ability to subvert their own safety measures, showing that jailbreaks can be automatically identified and tested by agents (with similarities to IRIS/AutoDAN).
This research demonstrated through Attack Success Rates that LLMs can bypass their own refusal mechanisms on “safety challenges” when prompted to explore jailbreak techniques automatically (python string transformations, decompositions of the prompt, and reflection applied to multi-turn conversations).
Sandbagging/Deceptive Alignment in mainstream evals
In December 2024, I started exploring the question of whether traditional evaluation scaffolding differs too much from production/human-like request patterns, giving indicators to LLMs that affect their behavior and undermine the accuracy of existing safety evals.
To prevent the model from detecting that it is being evaluated, I considered bot detection techniques from my work at Akamai. While I did not formalize an eval proposal for the inspec_ai framework at the time (to specifically target sandbagging and deceptive alignment behavior), AI safety research throughout 2025 confirmed that this is a major issue.