Mechanistic Interpretability from Scratch (WIP)
Mission:
- To reverse-engineer neural networks into understandable algorithms. This includes examining the internal mechanisms and representations learned by these networks, which is crucial for ensuring safety and alignment in AI systems.
Goals:
- Map neural network computations
- Understand how individual neurons and layers contribute to decision-making
- Treat neural networks like complex computational or biological mechanisms
- Create methods to verify and validate AI decision processes
Why MI is not just theory
On GPT-2, Wang et al., 2022 used MI techniques to derive an interpretable algorithm that the network uses to solve an NLP task (indirect object identification). They argued that the recovered algorithm is faulty, and showed that adversarial examples exploiting those flaws do cause the network to produce the predicted wrong results. (Summary)
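The indirect object identification (IOI) task that Wang et al., 2022 reverse-engineered can be illustrated with a small sketch. The prompt template and function name below are illustrative assumptions, not the paper's actual code or dataset.

```python
# Minimal sketch of an IOI-style prompt (illustrative, not the paper's code).
# The sentence mentions two names; one is duplicated as the subject, and the
# model should predict the non-duplicated name (the indirect object).
# The paper's analysis suggests the circuit leans on detecting the duplicated
# name, which is why prompts breaking that pattern can fool it.

def ioi_prompt(subject: str, indirect_object: str) -> tuple[str, str]:
    """Build a clean IOI prompt and return (prompt, expected_completion)."""
    prompt = (
        f"When {indirect_object} and {subject} went to the store, "
        f"{subject} gave a drink to"
    )
    return prompt, indirect_object

prompt, answer = ioi_prompt("John", "Mary")
print(prompt)  # When Mary and John went to the store, John gave a drink to
print(answer)  # Mary
```

Feeding such prompts to GPT-2 and inspecting which attention heads move the answer name into the final position is the core of the paper's circuit analysis.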
Open Problems
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction (2022.Dec)
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (2024.May)
Tutorials
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability (Full)
- Transformers & Mechanistic Interpretability by Callum McDougall
- Playground: neuronpedia.org by Neel
Reads
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- Bridging the VLM and mech interp communities for multimodal interpretability
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- A Comprehensive Mechanistic Interpretability Explainer & Glossary