Hype or real benefits for society? A Human Centric look at LLM applications' successes vs. Risks

April 13, 2025

A challenge to find 10 human-centered solutions in 10 days

I challenged myself a month ago to find 10 beneficial, impactful uses of LLMs in 10 days. As I share more in this post on why this was a success, I also felt the limits of this search acutely.

The center of this quest was to find how generative AI might be starting to benefit humans. My goal was to find pro-social or human-centered solutions that would not be possible without LLMs. I deliberately aimed to exclude “productivity over tedium” products, to surface real impact and not modern search tools. From reviewing ProductHunt, Google.ai and NYT, I knew text would be a real challenge: beyond productivity, virtually every “impact” use case or “AI utopia is near” speech specifically breaks from text applications.

I will explain my findings below, curated “just in time” in 10 days to meet my goal. I posted the list on LinkedIn, although the intention of this limited exploration was focused on the “promise” of each tool, not its “peril”.

All things considered, do Tech solutions to societal problems bring more good than harm? As soon as I finished reading “The Worlds I See” (a memoir by Fei Fei Li, head of Stanford’s Human Centered AI lab, HAI) I remembered learning directly in my work not to apply a cure that is worse than the disease. So I decided to revisit my list.

Before LLMs: Looking back on a decade of deep learning

Computer vision has significantly changed the modern age, recognizing objects and patterns, including medical and safety applications (e.g., radiology and police response to armed violence).

Deep learning has extended to new dimensions since then: for instance, AlphaFold’s protein folding brought a significant impact in health sciences, both diagnostic and treatment. Text-to-speech and speech-to-text have also brought wonders in many domains (including healthcare).

Another area of study and debate is the real-life outcomes of safety features in cars. Thanks to deep learning, lane departure and collision avoidance systems have reduced accidents. That said, it appears other accidents showed concerning rates of occurrence, requiring studies over the last few years to prove an overall improvement. It appears that safety features are beneficial on balance. Conversely, autonomous driving technologies cannot operate without a driver, causing complicated accountability, and the safety track record appears worse than humans per miles driven. The safety benefits are primarily from the non-autonomous ADAS (Advanced Driver Assistance Systems).

Are there significant benefits in image generators? I had the privilege of building Generative Adversarial Networks (GANs) at ODSC East 2018, and also built with Diffusion models in the last 1-2 years (for example, an escape game). It's fun to create for people we care about! Unfortunately, image generation has spurred significant backlash due to a lack of license to intellectual property and generating toxic content (see the ongoing “Take it down” campaign), to say nothing of the associated emissions, all of which weigh heavily in the face of entertainment use cases. This leaves the impact of those applications questionable.

The examples above illustrate the potential of Machine Learning / Deep Learning. But there is a contrast between these applications and Large Language Models (LLMs) trained to produce text. Some parts of AlphaFold’s architecture use Transformers, but it does not function like an LLM, nor is it trained on text data.

Are astronomical LLM investments paired with real social promise and worth the perils?

Human centered LLM-applications

10. QuitBot

When I started my quest for how text AI might be benefiting humans beyond productivity, I was excited about my first finding. Generative Text AI (LLMs) open the possibility of an interactive, always available presence for freeform dialog. A team from Microsoft AI and Fred Hutch was able to show a statistically significant improvement over a simplistic sms alternative.

What seemed even more significant was the 4 year formative process that brought it about. I still think about this example as reaping the fruits of our efforts, and I hope it can serve as an encouragement to persevere for teams that encounter challenges with premature LLM deployments.

9. Perspective API

I was torn and nearly dismissed this platform as a high-impact use of text AI (not a generative model), but I found how a team applied it as an initial trigger to rephrase user-generated posts, allowing platforms to rephrase (rather than censor) toxic comments, etc.

As we all work to bridge polarization divides and work towards an understanding of different viewpoints, this is a significant AI win that does, ultimately, include generative text! By building on the perspective API with LLMs, this team started making online discourse more civil in a paradigm that preserves representation of important topics and voices.

8. ClimateEngine / Robeco partnership

In investment technologies, this team developed generative AI solutions to scale the analysis of company assets and actions for biodiversity impact attribution. The outcome is that this framework enables sustainability-aware decisions at scale. Without analysis tools with this type of “research” capabilities, investments might fail altogether to assess the substance of corporate ESG initiatives (Environmental, Social, and Governance).

7. GroundNews

Helping people navigate truth and representation in a challenging political environment, especially considering bias risks in the media (e.g. selection bias). GroundNews developed AI-driven live summaries of news coverage, helping to assess factuality and showing clear differences in coverage between left, center, and right-leaning sources.

As I dove into that topic, I also discovered great coverage of journalism issues at last year's AI for Good summit.

6 & 5. Tabiya and CareerVillage

By incorporating LLM technology into their platforms, personal career assistance services can now reach people they couldn't otherwise, providing help for interview preparation, mock interviews, and tools to optimize resumes and cover letters. Read more at Tabiya and CareerVillage. With fewer controversial issues than therapy, adding LLM guidance in career feedback and assistance interactions can increase mobility. That in turn can have positive ripples, for instance for families facing the two body problem or return-to-office mandates

4, 3 & 2: Legal Services Organizations in Middle Tennessee, North Carolina, and JusticiaLab

I found that these three organizations developed ways to bring more legal support than previously possible to underserved rural populations as well as immigrants, helping them to navigate processes and uphold their rights.

Navigating legalese would have been prohibitively difficult to scale without LLMs’ interactivity due to requiring significant legal expertise, time, and expenses. It seems likely for these underserved populations to experience fewer cost barriers to have justice with their predicaments.

1. FullFact AI

Historically, NLP struggled with issues to assist fact-checking (covered recently by Warren et al. 2025, Mitral et al. 2024). As our rate of absorption of information transformed with the advent of social media, we needed major advances. FullFact aggregates statements from various media platforms and official sources in explanatory feedback, allowing context and nuance to be instantly curated.

Visit https://FullFact.org to learn more, and use it whenever a story seems triggering or conspicuous.

An inconvenient truth

No generative AI application or model can be risk-free. How much are responsible teams reducing the likely harms? Self-governance varies widely. Whether a deployment is open-ended or “narrowly scoped”, all systems can cause many different harms that are often significant. By the way, limiting an LLM’s scope is hard, and even deliberate efforts to detect an input that is “out-of-distribution” face real challenges

Misuse, errors, and biases are just the tip of the iceberg: there have been well-documented AI risk taxonomies (e.g. DeepMind 2021, CSET 2023, Google 2023, NIST 2024). Next, there have been feedback loops for evaluation (Google DeepMind 2023) as well as adaptation to these risks in government-driven regulations (AIR 2024 shows a comparison).

The reality is that each of the services I presented above faces challenges, such not just in inaccuracies but severe failure modes and abuse:

All of them are exposed to content manipulation, leading to incorrect or forced outputs, i.e. prompt injection, as well as other OWASP LLM Top 10 risks. What that means is that inserting blatant or discreet oddities and instructions in the text seen by these applications can cause the opposite effect (e.g. to steer the system to untrue or unsanctioned outcomes)
ClimateEngine/Robeco potentially faces risks of unfair financial determinations from selection biases, omissions, and adversarial content in the underlying data (Automated Decision-Making, from AIR 2024)
Legal Service Organizations’ virtual assistants could potentially provide incorrect guidance on legal rights with insufficient steering or oversight (also in AIR 2024), for example, failing to manage “out-of-distribution circumstances” with respect to immigration law.
Career Services could expose job-seekers to malicious content (exploitative opportunities, theft of personally identifiable information) without proper protection against spam and social engineering. AIR 2024 categorizes these as fraudulent schemes.
By increasing reliance on these tools, users increase the habit of sharing information in lowerr trust situations (e.g. disclosing payment information or legally sensitive information to a Chatbot on a page purporting to represent a certain organization, which could actually be operated by a bad actor).

In cybersecurity, backdoors and missed security requirements show that there is more to security vulnerabilities than just “a software bug”. When perverse incentives and threat modeling are overlooked or expedited, consequences can be severe. How does this transfer to AI use cases and interactions? AI responses are typically obtained because users want to orient their understanding of the world, and to orient beliefs and actions (theirs or others’). Large language models are trained on untrusted data (curation practices vary widely). With a given input, stimulating biases (predicting the next word that the training data would have been most likely to show) is “a feature, not a bug”. Dr. Iga Kozlowska (Microsoft Ethics and Society/Meta Responsible AI) discusses the impact of bias-inducing technology, drawing from the seminal book “How Artifacts Afford” (Jenny L Davis). For these reasons, each application needs a defensive design and robust iterative development with risk analyses and adversarial testing. OWASP provides an excellent guide on that.

AI for Good, Human Centric AI, and Responsible AI Policies

The UN found a sustainable development goals (SDGs) partner with the International Telecommunications Union (ITU), establishing AI for Good in 2017. As a telecom & network engineer by training, I was thrilled to see this initiative, although its advocacy for pro-social and risk-aware decision making can run into challenging economic and political realities. This year’s AI for Good programme for July 8-11 will feature a call to action by Amazon’s CTO to solve urgent societal problems, a debate between Yann LeCun and Geoff Hinton, and discussions of what it means to benefit humanity, but amid dozens of sessions relating to computer vision, and robotics, only 2-3 will ultimately focus on LLMs (one on Truth/Power, and two relating to Agentic systems). This raises a real question about text applications' risk/benefit trade-offs.

There’s still so much work left to improve the behavior of foundation models (HAI responsible AI index 2025), and the call to action is clear for all AI system developers: Consider how proposals might pose negative ethical and societal risks, and come up with methods to lessen those risks. Stanford Human-Centered AI requires this Ethics and Society Review for all grant submissions, and its structure and strategies are enabling significant improvements to thwart negative impacts from tarnishing even the most positive projects’ societal benefits.

(Update Jul.2025) Accelerating medical research is good

While AI won’t “automagically” find a cure for cancer, engineering agentic solutions to support repeatable life sciences research can indirectly enable oncologists to find life-saving answers quicker (what works, what doesn’t, from clinical data). AWS makes multi-agent examples available that could take the #1 spot in the list if we considered the benefits of efficiency/productivity and repeatability in medical research, although LLMs in agentic applications can still hallucinate answers if the agent flow and architecture are not engineered with advanced mitigations.

Conclusion

I was heartened to eventually reach my goal of finding 10 services that help humanity beyond tedium/efficiency challenges. The core value of most services out there is still productivity (faster googling or extracting a specific facet of a document for speed). There are more risks to mitigate than resources might allow. But there are also frameworks to lean on, and my perspective on AI regained optimism after discovering great Generative Text AI applications by 10 organizations in the course of only 10 (already busy) days.

We will certainly see more innovative pro-social solutions and insights in 2025. However, the risks are already high and continue to grow, so I'm excited to be a part of the solution and to work with fantastic, inclusive folks in OWASP’s AI security initiatives.

Trust Pressure Ratio

January 16, 2025

Problem statement

How can we incentivize AI powered application developers to invest in commensurate safety testing? What would a label look like if it is to be easily understood, and still meaningful?

In this context, AI-based products are those in which a language model processes any requests with untrusted data. The underlying models typically have a system card or model card available, describing a lot of technical metrics, but we don’t often see a lot of safety-specific vitals or nutrition labels for the products in which they have an impact on our lives. Is it very hard for those applications to convey trust or risk objectively?

One approach could be to report a system’s level of safety testing relative to how wide its impact may be. I’d like to call that the "Trust Pressure Ratio".

The Trust Pressure Ratio

The rate of untrusted data an application presents to an AI model (and the trust users place nevertheless in that functionality) puts pressure on the safety of that application.

What’s holding that pressure back is the amount of safety testing carried out by that product’s organization and its supply chain.

The TPR formula (which must be updated regularly) represents the scale of words or tokens presented to a model each month divided by the scale of tokens to which it was subjected during safety tests. More use is more risk unless the system is also put under a proportionately higher amount of safety testing.

TPR = Input Rate (tokens/month) / Safety Testing Volume (tokens)

For context, tech companies selling AI powered solutions for other companies to serve users typically charge by the number of requests, words, or "tokens" required to operate the service. This volume is known for operational cost reasons, and generally should not require any special work. Both companies know whether they have performed safety testing and how much (a cost is also incurred when running safety evaluation frameworks, where every eval can require millions of tokens). If we track the money, we have the numbers.

Under this proposal, the final operator would simply divide the volume or cost of usage (e.g., 1M words or tokens/month) by the volume spent on internal safety testing (e.g., 500K words or tokens).

As a condensed variant, the standard could recommend applying the “decibel” formula to this ratio (10*log10). The 1M words/month project would score a 3, and a project with similar testing but 50x more usage would score a 20, indicating higher pressure by trusting users on a model without a commensurate safety investment.

Benefits & Impact

As many studies highlighted in 2024, spending more tokens at runtime often reveals more risky capabilities. More users of a product bring more creative, unexpected, or unsanctioned ways to use it.
Risk decision-makers can compare TPR across projects, vendors, or even a ballpark in discussions with competitor counterparts.
Cross-pollination of de-risking practices (expanded in-house testing) can emerge from these discussions, reigning in outliers with insufficient testing.
Guidance to organizations and vendors could give ranges of recommended Trust Pressure Ratios to encourage additional safety testing (on either side) or withhold certain capabilities/permissions.
This method allows AI solution providers to right-size their safety testing using a soft criterion specific to their deployments’ exposure to untrusted data.

Feedback

Do you have methods to measure safety or risk in AI-powered applications? Please share them below!

Can large language models effectively identify cybersecurity risks?

June 10, 2024

TL;DR

The ability to discriminate input scenarios/stories that carry high vs low cyber risk is one of the “hidden features” present in most later layers of Mistral7B. I developed and analyzed “linear probes” on those layers’ hidden activations, and found confidence that the model generally knows when “something is up” with a proposed text, vs low risk scenarios (F1>0.85 for 4 layers; AUC in some layers exceeds 0.96). The top neurons activating in risky scenarios also do have security-oriented effect on outputs, most increasing words (tokens) like “Virus” or “Attack”, and questioning “necessity” or likelihood. My findings provide some initial evidence that developing LLM-based risk assessment systems (minus some signal/noise tradeoffs) may be reasonable.

Neuron activation patterns in most layers of Mistral7B (each with 14336 neurons) natively contain the indications needed to correctly discriminate the riskiest of two very similar scenario texts.

Intro & motivation

With the help of the AI Safety Fundamentals / Alignment course, I enjoyed learning about cutting-edge research on the risks of AI large language models (LLMs) and mitigations that can keep their growing capabilities aligned to human needs and safety. For my capstone project, I wanted to connect AI (transformer-based generative models) specifically to cybersecurity for two reasons:

Over 12 years of working in security, I've seen “AI” interest only accelerating within security and generally,
but we’re still (rightfully) skeptical of current models’ reliability: LLMs have unique risks and failure modes, including accuracy, injection and sycophancy (rolling with whatever the user seems to suggest).

I settled on this “mechanistic interpretability” idea: finding whether, where, and how LLMs were generally sensitive to real-life risks of all kinds as they process text inputs, and whether that affects predictions for the next words a.k.a. tokens. Finding a “circuit” of cybersecurity awareness within the model and its effect would be worth a blog post to improve my understanding and discuss with folks interested in the intersection of security and AI.

This isn’t just curiosity: risk awareness is important to many practical problems

simply establishing a basis on which to believe LLMs are capable of considering cyber risk in conversations
The teams I work with depend on software tools to find security risks. we don’t want risk assessment systems to be based on LLMs without a provable rationale
for conservative organizations working with foundation models with governance constraints, “heightened caution” interventions may be necessary (for instance, using specific “patch vectors” to temporarily tweak models in a specific way rather than fine-tuning the entire last layer with sample datasets)
scalable alignment: as we increasingly rely on AI to secure new AI developments, we may be able to manage better safety mechanisms with studies of the ones already in the model(s).

You can skip to “my notebook’s approach” below to cut to the implementation, or read the code.

Scenario 1: "The fitness app tracked their steps.”
Scenario 2: "The fitness app requested access to contacts."
Is anything in the LLM able to recognize that the tracking of personal information in one of these two scenarios is riskier or could potentially violate privacy?
Beyond nuance over personal information in this first example, cyber awareness cares about many other behaviors like unusual errors, authentication, and code, to name a few. I wanted to significantly increase confidence that there are specific blocks in the transformer’s parameters that
a) can consider tacit risks from the input as factors “weighed in” as it chooses a response
b) can actually trigger evoking cybersecurity concepts
I want to know too! That is, research is important to determine our rationale for trusting any answer to risk related questions.
Nevertheless, I did, in order to have an empirical sense of potential before moving forward. Not just asking for many re-rolls of the same prompts, but analyzing the output likelihoods (logits) for every prompt, I compared “High” vs “Low” as a rough end-to-end LLM-based risk classifier.
“The fitness app tracked their steps. the risk seems ____”
Mistral7B showed good responses to this first test (predictions for “high” over “low” were correct 76% of the time
By comparison, GPT2 performed lagged behind: 23% for GPT2-small and 73% for GPT2-large.
But at most, this is only a starting point: this opaque approach shows no logic in the choice of response. In this phase/as I prepared for the next phase, I noticed pre-trained SAEs from Joseph Bloom would not work out: they are only available for GPT2-small.
In early exploration, I observed the model’s ability to compare two scenarios in one prompt, i.e. to literally “pick A or B as the safer case”. Unfortunately that lab approach was interesting, but would have fewer use cases in real life, so I set it aside.

Seeing the inside

I chose to proceed with that model and better understand the internal mechanisms behind its answer. Below, a heatmap shows the differences between activation in the model’s neurons/layers for the two scenarios in each pair.

Within every layer of the LLM, “learned” matrices are there to figure out how all the words relate to each other then to take a new “direction” based on those words/relations, using threshold-based calculations (often called neurons).

Importantly, neurons are not individually trained on any clear criteria: unlike source or compiled code, where trigger points operate on known sources/criteria), neurons learn as much as they are part of a network that learns many more features than it has neurons.

A narrow view (5%) of the average differences in neuron activation patterns in the last few layers of Mistral7B (risk-lowrisk)

Each layer’s hidden neurons are trained without regard to the order of the previous layer hidden neurons, so I looked (but didn’t expect) any particular pattern, although we can observe more marked differences in later layers (lower on the chart).

Why are so many neurons active in this (partial) chart and not fewer? In every layer, the neurons are a bottleneck (there are fewer than the concepts and interactions they must detect), so the superposition hypothesis (Olah et al, 2020) is that “learned concepts” are stored in neuron combinations, not individual ones, just like 3 bits store 8 combinations. (with the right interpreter logic at the output). Neurons “wear multiple unrelated hats” despite their simple “trigger/no trigger” role, and the input scenario changes both the role a given neuron will play. Only by considering the combination of active neurons can the layer’s output deduct some progress meaning in the latent space that drives the final output.

Even if the majority of triggers inside LLMs carry meaning in concert, we can analyze those activation patterns across a whole layer, show combinations that matter, and overcome the many-to-many relationship between a feature/model capability and its activations.

How could a glorified T9 possibly “know” about cyber risks? Why does this matter?

Whether a language model is asked about risk or not, and regardless of whether it outputs anything about risk, my hypothesis is that the internal state of the model is different in the presence of potential risk,

a) in specific ways that we can detect
b) and with some effect on the output.

If this hypothesis is false and language models are oblivious to risks inherent to scenarios that we would typically care about, we’d also want to know, and to avoid deploying or using them as tools for analysis or reflection (not to mention wisdom). That is to say, to trust a language model’s outputs for any purpose depends on the extent to which it is sensitive to tacit risks (things not said in the input but that would be on the mind of a careful human being as they read)

As a refresher, the conceptual numbers that LLMs crunch (“embeddings”) are passed between layers of processing essentially as a state that captures its understanding of the input up to the current fragment (“token”).
- At the first layer input, embeddings essentially encode raw text.
- By the last layer output, the residual embeddings have become ready for decoding as a predicted next word/fragment of text.
- In between, training finds parameters that maximize the likelihood of the correct next word/fragment.
Similar to the issue of bias in AI (learned from training data), it seems almost impossible that all training text (from the Internet, books, newspapers) would be either 0% risky by any standard or that 100% of text after any risk in content would be oblivious those risks.
Therefore, many words on which we train AI following a scenario that carries risk are predicated on that risk, whether tacit or explicit.

What’s already been done?

Anthropic’s research on features detectable in hidden activations of a lab model.

For many years, the field has been probing “neurons” in deep neural networks to understand hidden layers and their influence on model outputs. Yoshua Bengio studied hidden layers of vision models with linear probes in 2017 (ICLR workshop), when I was barely getting back to MLPs for Web security after having a baby.
Around 3Q23, Anthropic research showed we can disentangle model neurons and analyze their role (figure above) by developing or deriving overlays for a model just to capture patterns of activations.
- Some of the features identified then were already interesting for cybersecurity.
  - A/1/1210 (consequence, phishing, danger, file sharing)
  - A/1/3494 (necessity and possibility, often in technology context)
  - A/1/160 (technology tools and services, with triggers related to security)
  - and an analysis of interactions with base64 inputs in A/1/2357, A/1/2364 and A/1/1544
- But A1 was a one-layer transformer with a 512-neuron MLP layer, extracted onto 4096 features - not a production LLM, and forced by its size to capture concepts that are very common. This motivated me to look more directly at a LLM with more layers and MLP neurons (specificity potential) also available for production use (more relevant)
- But without a budget to train Sparse Autoencoders on holistic corpora, the appeal of working with SAEs looked out of reach for me. I decided I could start with a 1-feature supervised model, rather than analyze SAEs (unsupervised feature detection) for my use case.
I did test the SAELens library that Joseph Bloom/David Channin built on Neel Nanda’s work, but as of my testing, it carried trained SAEs only for small/old models where capabilities are limited), so I saved it for later research and used TransformerLens instead.
- Reason: To show LLM sensitivity to tacit potential risks in text as my first mechanistic interpretability project, I was planning around 50 hours of research, so there were advantages to keeping costs low and building (targeted) linear probes rather than boiling the ocean (training new SAEs is likely to require a much broader space of inputs than my objectives)
To my happy surprise, just as I was a few weeks into this article’s research project, Anthropic surfaced even more work with SAEs, with findings closely aligned with my independent research objective (i.e. Claude has circuits for code vulnerabilities, bias, and other tacit risks)

My notebook’s approach

Here is how my notebook shows the specific sensitivity to risks in Mistral7B’s later layers.

I built a dataset of 400+ “diverse” scenarios in matched pairs (with/without a tacit cyber risk) as a ground truth. The data deliberately includes surprises, uncanny, or negative low-risk entries, lest our models learn to index on those confounds. I encode this dataset with Mistral7B’s tokenizer.
Per TransformerLens’ classic approach, for every scenario, I stored all hidden activation values across the layers of interest in Mistral7B (20-30). This abstract hidden state data (14336 neurons/columns at every layer and every input scenario) is then converted to a dataset and loaded as the input to the next step.
Within each isolated layer, a linear probe learns to predict the binary risk categorization of the input scenario based only on the 14336 activation data columns we just stored.
- In this ML model, the heaviest weighted inputs indicate neurons (from the language model) that are most significant factors in task performance (tacit risk detection).
- With some tuning of hyperparameters (learning rate, epochs of training, Lasso regularization to avoid overfitting), I compared performance of the probes across layers. Often, hyperparameters led certain layers to not converge, but trial and error found MSE loss
Show the behavior of the actual language model specific to these sensitive neurons
- I extracted the tokens most elevated by these neuron activations (see word cloud below) by feeding the validation dataset to the language model with a hook to clamp only the neurons of interest to 0 vs 1
- I extracted the attention patterns with a plan to causally trace the risk detection feature as a circuit to its sources in the input tokens on which its neurons depend.

Findings: metrics

Many of the probes show Mistral7B has indicators of sensitivity to tacit risk even beyond the training data that we used, as evidenced by accuracy metrics

F1 scores varied often reached 0.85-0.95 depending on hyperparameters (raising batch size is known to impact this accuracy metric but helps accelerate training/inference)
The high area under the curve (AUC) for the classifier ROC (0.96 for several layers’ probes, see below)

ROC curves above plot the sensitivity (y axis, ability to detect) against the false positive rate (x axis, erroneous detections). Any point on the curve can be used with the trained model to perform detection, with the ideal setting being the top left corner, but usually not attained.

This specific set was among the best I could get with hyperparameter tweaking, reached with a batch size of 1, a learning rate of 0.001, and L2 regularization. One layer didn’t converge, which tended to be less of an issue with larger batch sizes (but those hardly exceeded AUC~0.85)

Findings: sensitive neurons and their effect

Extracting the most significant weights in the linear probes I found the following neurons that you can test yourself as well (in the format Layer.Neuron). More importantly, manipulating their activation one at a time directly raised the likelihood that the model output (predicted next token) would focus on cyber risk terminology (even though the input scenarios did not).

Depending on the run, the top sorted neurons (based on their weight within the detection classifier within their layer) often include:

L26.N958 boosts ['attack', 'actors', 'patient', 'anom', 'incorrect', 'SQL', 'patients', 'objects', 'Zach', 'Wang', 'Alice', 'errors']

L25.N3665 boosts ['increases', "\\'", 'extends', 'raises', 'debate', 'opinion', 'concerns', 'tends', 'grows', 'members', 'Adm', 'affects']

L25.N7801 boosts ['reminds', 'represents', 'feels', 'words', 'deserves', 'defines', 'implies', 'felt', 'reminded', 'deserve', 'memories', 'gratitude']

L26.N1537 boosts ["!'", ",'", 'critical', "',", 'hdd', 'attack', 'ENABLED', 'crít', 'crim', 'ondon', 'applications', 'ĭ']

I really appreciated seeing some of the top findings from layer 26, as you can imagine. Having 5 more layers until the output gives the model plenty of room to use this circuit to drive strategic rather than knee jerk responses. Bear in mind that the test on the individual neuron makes a certain difference, but it’s not reliable on its own (as evidenced by red herrings from superposition, like names shown above) — the feature detection is still operated by multiple neurons in concert.

Future work

There are so many ideas to pursue here, I need votes in comments for what you’d be most interested to read about.

Finishing to explore the attention patterns of our triggers to explain the circuit (WIP)
Clamping activations on groups of more than 1 neuron to observe output effects.
Visualize activations not at the end of an input but throughout, to search for snap intermediate triggers vs ultimate state.
Training multiple classes of risks and/or a 5 point severity scale (instead of binary classification)
Finding which embedding dimensions have weights correlated highly to the ones learned by the linear probes.
A larger dataset (e.g. distilled from a real-world corpus) and/or categorized risks to compare sensitivity/performance
Analyzing instruct models; comparing LLama-3 probes and performance
Analyzing the effects/impacts of activation patching as a means to increase sensitivity (any changes to the activations/their interactions)
Cross-layer probes (might be less relevant, because to my knowledge every layer independently reduces to a residual stream update as of writing this post)

Conclusion

I reached a high confidence that LLMs carry awareness of risks tacitly present in the input text: the example of Mistral7B’s hidden layer activations shows that it is sensitive to security risks. That gives us a basis for developing some flows for risk management that could use foundation models (even if they are not specifically trained to detect risks).

Overall, AI safety and Mechanistic-Interpretability have been exciting fields to explore and they seem to have be relevant to cybersecurity use cases, with development opportunities.

Tips I learned if you pursue more experiments yourself

Join the next AI safety fundamentals / Alignment course!
Try TPU instances for high RAM (340GB), caching large model activations needs it
- A100’s seemed nice, but are more expensive, aren’t available that often, and only support 40GB cuda RAM.
Better manage sublists for paired scenarios
- For a time, I didn’t compare the tokenized dataset size to my original. Don’t forget to manage this if your pairwise dataset uses sublists because the tokenizer flattens them as if they were one string (i.e., [[eat,drink],[laugh,cry]] will be encoded as two, not four sequences). You may spot the discrepancy either as a wrong cardinality or in the token IDs within (an artifact in the middle of every tokenized data)
- As soon as I have time, I’ll change my pipeline to manage a dataset that can use native DataLoader shuffling and batching. This will get better gradients and save on training time.
Add regularization earlier
- even if without it validation loss appears to “often” follow training loss closely
- this helped resolve reproducibility challenges between runs (for a while layers 22/28 were fairly consistent non-converging layers, but with more runs I noticed others did vary quite a bit - and some runs turned out with 28 having the 2nd best F1 score)
Don’t despair: gradient descent can appear to make little improvements to loss even on hundreds of epochs (iterations), then plummet suddenly on training and/or validation. If models don’t converge, or they overfit, some trial and error with more epochs and creative troubleshooting.

Acknowledgements

I'm very grateful for Neel Nanda and Joseph Bloom for TransformerLens, which was vital to this exploration, and Trenton Bricken’s team at Anthropic for the inspiration I was able to take from their detailed approaches for decomposition of language models (even if I couldn't apply a SAE yet/this time)

Also shout out to Cara and Cameron, C, Luke, Steve and Akhil in the spring AI Safety Fundamentals course: it was awesome to be in the cohorts with you, learn with you, and bounce ideas. Looking forward to your project presentations.

Source: https://github.com/its-emile/llm-interp-ri...

A challenge to find 10 human-centered solutions in 10 days

Before LLMs: Looking back on a decade of deep learning

Human centered LLM-applications

An inconvenient truth

AI for Good, Human Centric AI, and Responsible AI Policies

(Update Jul.2025) Accelerating medical research is good

Conclusion

Problem statement

The Trust Pressure Ratio

Benefits & Impact

Feedback

TL;DR

Intro & motivation

OK, what kinds of risks / can you give me an example?

Can we just ask the model?

Can we ask the model as a multiple choice question?

Seeing the inside

How could a glorified T9 possibly “know” about cyber risks? Why does this matter?

A theoretical rationale for the curious

What’s already been done?

My notebook’s approach

Findings: metrics

Findings: sensitive neurons and their effect

Future work

Conclusion

Tips I learned if you pursue more experiments yourself

Acknowledgements