You can’t influence what you can’t explain. Whether you’re making business decisions or building machine learning systems, the ability to clearly communicate why something works is often the difference between gaining trust and being ignored.
The same is true for AI and machine learning models. These are tools we rely on to process information, but without the ability to explain why they behave the way they do, trust breaks down. In the world of AI and ML, this is called the “black box” problem, and it’s very real.
Transparency enables belief in a system. But models' ability to learn meaningful patterns from data has historically outpaced our ability to interpret what they’re learning. I remember, early in my career, throwing every feature and algorithm I could at a dataset – only to have business stakeholders stop me cold with a simple ask: “Can you tell us what the model is thinking?”
Fast forward to today, and we’re in a similar place with AI models as we were with statistics vs. ML back in 2016. Explainability is essential – but research and tooling to support it still lag behind the hype. In this edition, we’ll take a practical look at what interpretability means, from classic models to modern LLMs.
(For a summary comparison, skip ahead to the section “Comparing the Evolution.”)
Why the ‘Why’ Matters
Model explainability is crucial across several complementary dimensions:
- Smarter model development
Knowing why a model makes a prediction helps us tune features, training data, and hyperparameters faster and more effectively. In LLMs, it can even highlight specific parts of the network that haven’t fully learned key concepts (such as “faithfulness” or “truthfulness”).
- Usability and human alignment
Understanding what the model “looks at” when predicting Y from X is essential for building trust and actionable insights. Imagine interacting with an LLM that shows you a live heatmap of activated features or internal concepts (akin to human emotions) – you'd instantly unlock a new level of prompt engineering power.
- Regulatory compliance
Traditional ML models have long had to cope with onerous regulations such as the GDPR or sector-specific laws (e.g., banking). Today, the EU AI Act raises the bar further. It limits which models can be deployed at scale – based heavily on whether their behavior can be explained, audited, and proven safe.
Tapping Into the Giant Brain of a Large Language Model
In traditional ML, interpretability tools like LIME and SHAP are well-established and widely used. But with LLMs and multimodal AI systems, things get murkier. Whereas traditional methods explain model behavior by probing inputs and outputs, LLM interpretability often requires you to look inside the model. This emerging area is known as mechanistic interpretability.
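To ground that contrast, here’s a rough sketch of the input/output probing approach using the lime package; the random forest and the iris dataset are placeholder assumptions for illustration, not part of any particular project.

```python
# Sketch: LIME explains one prediction by perturbing the input and fitting a
# simple local surrogate model around it. Model and dataset are placeholders.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)

# Explain a single instance: which features pushed this prediction where?
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())  # local feature weights for this one prediction
```

Notice that nothing here touches the model’s internals – everything is inferred from how the outputs change as the inputs are perturbed.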
Anthropic has made major strides here. One especially fun example: in a 2024 paper, they used sparse autoencoders to identify an internal “Golden Gate Bridge” feature in a model. When they artificially over-activated it, the model started saying things like:
“I am the Golden Gate Bridge… My physical form is the iconic bridge itself…”
This work targets a major problem called polysemanticity – when a single neuron encodes multiple, unrelated concepts. Conversely, a single concept may be distributed across many neurons. In short, neurons aren't cleanly interpretable. Anthropic's approach: build a second model (a sparse autoencoder) that learns a cleaner representation by:
- Using more dimensions than the original model layer (to untangle overlapping features)
- Enforcing sparsity (so each feature is more likely to represent a single human-readable idea).
This allows researchers to isolate specific features – such as sentiment, syntax roles, or even celebrity names – and trace how they activate in different contexts (recommended reading: An intuitive explanation of sparse autoencoders).
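To make the idea more tangible, here’s a minimal, self-contained sketch of such a sparse autoencoder in PyTorch. The dimensions, the ReLU encoder, and the L1 coefficient are illustrative assumptions, not a reproduction of Anthropic's actual setup.

```python
# Minimal sparse autoencoder sketch: reconstruct captured activations through
# an over-complete, sparsity-penalised hidden layer so individual features
# tend to align with single, human-readable concepts.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        # More hidden features than input dimensions (first bullet above).
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations; sparsity is enforced via the loss.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that keeps most features at zero
    # (second bullet above).
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Usage: `batch` would normally be activations captured from an LLM layer.
sae = SparseAutoencoder()
batch = torch.randn(32, 512)
recon, feats = sae(batch)
loss = sae_loss(batch, recon, feats)
loss.backward()
```

Once trained, each hidden feature can be inspected by looking at which inputs activate it most strongly – that’s where labels like “Golden Gate Bridge” come from.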
The field took another leap forward with Neuronpedia – an open-source platform for exploring model internals, on which Anthropic has since released its circuit-tracing tools. It lets anyone:
- Browse neurons and activation graphs
- Run interpretability experiments
- Annotate, share, and test hypotheses collaboratively
Comparing the Evolution
So, let’s take a step back and compare key interpretability methods across classic ML and large AI models. After all, even a simple if-else statement qualifies as AI under some (admittedly informal, but common) definitions.
| Aspect | Decision Tree Importances | LIME | SHAP | LLM Interpretability (Mechanistic) |
| --- | --- | --- | --- | --- |
| 🧠 Type | Global | Local | Local + Global | Structural / Mechanistic |
| 🧰 Model Support | Tree-based (e.g. XGBoost) | Any model (black-box) | Any model | Deep neural networks (esp. transformers) |
| 🔍 What It Tells You | Feature usefulness in splits | Local linear approx. of decision | Fair attribution for each feature | What each neuron/attention head/component computes |
| 🛠️ How It Works | Aggregate impurity gain per feature | Perturb inputs & train local surrogate | Shapley value estimation | Probe internal layers; autoencoders; circuit tracing |
| 🧪 Explains Model Logic? | ❌ Only approximate logic | ❌ Local surface logic | ✅ Attribution-focused | ✅ Deep structural logic (how it works internally) |
| 🚀 Best For | Tree-based modeling | Fast explainability to end-users | Reliable, fair model explanations | LLMs / Transformer models / research & audits |
| 📝 Major Theory Source | | | | |
| 💪 Major Practical Source | | | | |
What Now? (Practicalities)
Interpretability in conventional ML is relatively straightforward and well-supported. But in the world of LLMs, where development is highly centralized and research incentives for interpretability are still limited, practical application remains a challenge.
Here’s a snapshot of where things stand — and what you can actually do today:
🔢 Traditional ML
- Use feature importances (e.g., from decision trees or gradient boosting) to guide feature selection and engineering. For tabular problems, tree-based models are often your best bet – and they come with interpretability baked in.
- Apply SHAP values when explaining individual predictions. SHAP offers much richer and more faithful explanations than basic feature importance scores, helping you bridge the gap between model behavior and business language (a short sketch covering both bullets follows this list).
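As a starting point, here’s a minimal sketch of both techniques; it assumes scikit-learn and the shap package, with the breast-cancer dataset standing in for your own tabular data.

```python
# Sketch: global tree-based feature importances plus local SHAP attributions.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global view: impurity-based importances baked into the tree ensemble.
top_features = sorted(
    zip(X.columns, model.feature_importances_), key=lambda kv: kv[1], reverse=True
)
print(top_features[:5])

# Local view: SHAP values attributing one prediction to individual features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])
print(shap_values)  # per-feature contributions for that single row
# shap.summary_plot(explainer.shap_values(X), X) gives a dataset-level view.
```

The importances tell you what the model relies on overall; the SHAP values tell you why it made one specific call – which is usually the question stakeholders actually ask.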
🤖 LLMs & Modern AI
- If you're in research or building models: Tools such as TransformerLens or Neuronpedia let you peek inside transformer architectures. Training your own sparse autoencoders or probing circuits remains a niche pursuit – but it’s a hidden edge if you're preparing for future audits or working in frontier AI safety (a minimal sketch follows this list).
- If you're working in applied AI: Most LLMs remain too complex to interpret deeply in production settings, and capabilities are evolving too quickly for stable tooling. Still, cultivating awareness of how models form internal concepts helps you become a more effective AI developer.
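As a taster for the research-oriented route, here’s a minimal sketch using TransformerLens; the calls reflect its public API as I understand it, so treat them as assumptions to double-check against the library’s docs.

```python
# Sketch: load a small model with TransformerLens and cache its internals.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small model, quick to load

prompt = "The Golden Gate Bridge is located in"
tokens = model.to_tokens(prompt)

# Run the model while caching every intermediate activation.
logits, cache = model.run_with_cache(tokens)

# Peek at one internal component: layer-0 attention patterns.
attn_pattern = cache["pattern", 0]  # (batch, n_heads, seq_len, seq_len)
print(attn_pattern.shape)

# Tie internals back to behaviour: the model's next-token prediction.
next_token_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode(next_token_id))
```

Even a quick session like this builds intuition for how attention heads and intermediate features relate to the text a model produces – the same intuition that mechanistic interpretability research formalizes at scale.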