
Mechanistic Interpretability from Scratch (WIP)



Mission:

Goal:

Why MI is not just a theory

On GPT-2 small, Wang et al. (2022) used MI techniques to reverse-engineer an interpretable algorithm that the network uses to solve an NLP task (indirect object identification). They argued that the algorithm is flawed, and showed that adversarial examples do make the network produce the predicted wrong outputs (Summary).

Open Problems

Tutorials

Playground: neuronpedia.org by Neal

Reads

You don’t need PhDs to become an AI researcher