Key readings
To support others entering the field of AI Safety & Security, I’ve collected and organized key readings in AI Safety - I can’t recommend these enough (since Feb 2025)
I also highly recommend BlueDot courses, which are constantly updated (and offer both intensive and part-time tracks).
AI Alignment course (Certified/Spring 2024)
AGI Strategy Course (Certified/Fall 2025)
Token based watermarks
To truly understand LLMs’ method of generating content, I encourage serious engineers to practice building this derivative design yourself (I proposed it to someone at OpenAI in Sept. 2022): use the ranks of the model’s next token choices (logprobs) as a deterministic fingerprint to make LLM outputs detectable in a watermark scheme (“text steganography”) (Aug 2024)
Latent space classifiers & neural circuits
A valuable tool to understand and manage safety risks in LLMs is deep learning, culminating in mechanistic interpretability.
Embedding-based policy controls were a highly desirable enterprise feature request in 2023, implemented only by 3rd parties later that year (demonstrating that it takes large efforts to overcome weaknesses). I prototyped basic input guardrails with semantic embeddings on Cloudflare at llm-proxy.com (Oct 2023)
I chose sensitivity to texts carrying tacit security & privacy risks as my research project during BlueDot’s AI safety fundamentals course. Learn more about this research on LessWrong - (June 2024)
Instruction hierarchy hardening
Jailbreaks will always raise the pressure. How could we truly enforce limitations in capabilities, within applications’ system prompts? Following a podium runner-up at a hackathon by Apart Research, my team continued researching methods for Scoping LLM applications / Hardening instruction hierarchy. Our report covers the state of the art, gaps, and alternatives (Apr 9, 2025)