Mechanistic Interpretability from Scratch (WIP)
Mission:
- To reverse-engineer neural networks into understandable algorithms. This includes examining the internal mechanisms and representations learned by these networks, which is crucial for ensuring safety and alignment in AI systems.
Goals:
- Map neural network computations
- Understand how individual neurons and layers contribute to decision-making
- Treat neural networks like complex computational or biological mechanisms
- Create methods to verify and validate AI decision processes
Why MI is not just theory
On GPT-2, Wang et al., 2022 used MI techniques to derive an interpretable algorithm that the network uses to solve an NLP task (indirect object identification). They argued that the recovered algorithm is faulty, and showed that adversarial examples exploiting those flaws do cause the network to produce the predicted wrong results. (Summary)
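The indirect object identification (IOI) task that Wang et al., 2022 reverse-engineered can be illustrated with a small sketch. The prompt template and function name below are illustrative assumptions, not the paper's actual code or dataset.

```python
# Minimal sketch of an IOI-style prompt (illustrative, not the paper's code).
# The sentence mentions two names; one is duplicated as the subject, and the
# model should predict the non-duplicated name (the indirect object).
# The paper's analysis suggests the circuit leans on detecting the duplicated
# name, which is why prompts breaking that pattern can fool it.

def ioi_prompt(subject: str, indirect_object: str) -> tuple[str, str]:
    """Build a clean IOI prompt and return (prompt, expected_completion)."""
    prompt = (
        f"When {indirect_object} and {subject} went to the store, "
        f"{subject} gave a drink to"
    )
    return prompt, indirect_object

prompt, answer = ioi_prompt("John", "Mary")
print(prompt)  # When Mary and John went to the store, John gave a drink to
print(answer)  # Mary
```

Feeding such prompts to GPT-2 and inspecting which attention heads move the answer name into the final position is the core of the paper's circuit analysis.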
Open Problems
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction (2022.Dec)
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (2024.May)
Tutorials
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability (Full)
- Transformers & Mechanistic Interpretability by Callum McDougall
- Playground: neuronpedia.org by Neel
Reads
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- Bridging the VLM and mech interp communities for multimodal interpretability
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- A Comprehensive Mechanistic Interpretability Explainer & Glossary