GitHub topics: mechanistic-interpretability
Trustworthy-ML-Lab/Neuron_Eval
[ICML 25] A unified mathematical framework to evaluate neuron explanations of deep learning models with sanity tests
Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: about 3 hours ago - Pushed at: about 4 hours ago - Stars: 4 - Forks: 0

OpenMOSS/Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Language: Python - Size: 16.2 MB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 127 - Forks: 14

stanfordnlp/pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Language: Python - Size: 25.5 MB - Last synced at: 4 days ago - Pushed at: 29 days ago - Stars: 758 - Forks: 82

gauravfs-14/awesome-mechanistic-interpretability
A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components.
Language: JavaScript - Size: 59.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

microsoft/automated-brain-explanations
Generating and validating natural-language explanations for the brain.
Language: Jupyter Notebook - Size: 1.07 GB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 54 - Forks: 7

lkopf/prism
PRISM is a multi-concept feature description framework which can identify and score polysemantic features.
Language: Jupyter Notebook - Size: 32.1 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 0

lkopf/cosy
CoSy: Evaluating Textual Explanations
Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 16 - Forks: 1

epfl-dlab/llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Language: Jupyter Notebook - Size: 2.54 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 78 - Forks: 17

MauroAbidalCarrer/learning-deep-learning
Repo where I learn the fundamentals of DL training and interpretation in PyTorch.
Language: Jupyter Notebook - Size: 4.96 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Language: Python - Size: 617 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 95 - Forks: 6

maxdreyer/attributing-clip
Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"
Language: Python - Size: 3.35 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 2 - Forks: 0

aygp-dr/attribution-graphs-explorer
A toolkit for exploring attribution graphs and circuit tracing in transformer models, implemented in Guile Scheme
Language: Scheme - Size: 182 KB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

Butanium/nnterp
A small package implementing some useful wrapping around nnsight
Language: Python - Size: 414 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13 - Forks: 3

aarnphm/morph
exploration WYSIWYG editor
Language: TypeScript - Size: 63.6 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
Language: Python - Size: 8.19 MB - Last synced at: 30 days ago - Pushed at: 4 months ago - Stars: 101 - Forks: 13

Trustworthy-ML-Lab/CB-LLMs
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.
Language: Python - Size: 936 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 3

lmb-freiburg/understanding-clip-ood
Official code for the paper: "When and How Does CLIP Enable Domain and Compositional Generalization?" (ICML 2025 Spotlight)
Language: Python - Size: 13.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

taufeeque9/codebook-features
Sparse and discrete interpretability tool for neural networks
Language: Python - Size: 3.58 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 63 - Forks: 4

Ki-Seki/Awesome-Transformer-Visualization
Explore visualization tools for understanding Transformer-based large language models (LLMs)
Size: 23.4 MB - Last synced at: 16 days ago - Pushed at: 7 months ago - Stars: 12 - Forks: 2

pmcurtin/model-crosscoders
Final project for cs2222
Language: Jupyter Notebook - Size: 33 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Trustworthy-ML-Lab/posthoc-generative-cbm
[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE) -- efficiently transform any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality
Language: Jupyter Notebook - Size: 3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 1

sdesabbata/geospatial-mechanistic-interpretability
Geospatial Mechanistic Interpretability of Large Language Models
Language: Jupyter Notebook - Size: 41.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

yash-srivastava19/arrakis
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Language: Jupyter Notebook - Size: 3.53 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 29 - Forks: 2

Trustworthy-ML-Lab/ThinkEdit
An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.
Language: Python - Size: 545 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 10 - Forks: 1

ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
Size: 63.5 KB - Last synced at: about 2 months ago - Pushed at: 8 months ago - Stars: 343 - Forks: 24

tegridydev/mechamap
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
Language: Python - Size: 27.3 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

Gabe-Thomp/Belief-State-Replication
A quick replication of the paper "Transformers Represent Belief State Geometry in their Residual Stream".
Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

evan-lloyd/graphpatch
graphpatch is a library for activation patching on PyTorch neural network models.
Language: Python - Size: 4.49 MB - Last synced at: 21 days ago - Pushed at: 5 months ago - Stars: 14 - Forks: 0

ztjona/ztjona.github.io
My personal website.
Language: HTML - Size: 48.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

chrisliu298/awesome-sparse-autoencoders
A resource repository of sparse autoencoders for large language models
Size: 8.79 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 7 - Forks: 0

THU-KEG/SafetyNeuron
Data and code for the paper: Finding Safety Neurons in Large Language Models
Language: Jupyter Notebook - Size: 4.59 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

jiaqingxie/steer-sae
Code for the ETH MSc Thesis: Sparse Autoencoders vs. Activation Difference for Language Model Steering
Language: Shell - Size: 9.66 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

aryamanarora/causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Language: Python - Size: 95.4 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 41 - Forks: 6

tegridydev/mixture-of-persona-research
A “Mixture of Perspectives” Framework for Ethical AI
Size: 12.7 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

michaelyliu6/transformers
Educational and production-ready implementations of GPT-2
Language: Jupyter Notebook - Size: 17.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Mathewvanh/reasoning_model_experiment
A look inside the DeepSeek-R1 distilled Llama 3.1 8B thinking model.
Language: Jupyter Notebook - Size: 9.85 MB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Somitheiconic09/AI-Safety
AAAI 2025 Tutorial on Machine Learning Safety
Size: 1000 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

tim-lawson/mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
Language: Python - Size: 642 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 17 - Forks: 0

koayon/awesome-sparse-autoencoders
A curated reading list of research in Sparse Autoencoders, Feature Extraction and related topics in Mechanistic Interpretability
Size: 21.5 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 11 - Forks: 2

Butanium/llm-lang-agnostic
Minimal code to reproduce results from "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers"
Language: Jupyter Notebook - Size: 19.5 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 5 - Forks: 1

koayon/atp_star
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Language: Python - Size: 73.2 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 18 - Forks: 1

daspartho/pronoun-prediction
Identifying the circuit behind pronoun prediction in GPT-2 Small
Language: Jupyter Notebook - Size: 987 KB - Last synced at: 9 days ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

ishanjmukherjee/toy-models-of-superposition-replication
Replication of the Anthropic interpretability paper "Toy Models of Superposition" by Elhage et al. (2022)
Language: Jupyter Notebook - Size: 810 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

corxyz/icl-cmr
Official implementation of the paper "Linking In-context Learning in Transformers to Human Episodic Memory" by Li Ji-An, Corey Zhou, Marcus Benna, and Marcelo Mattar
Language: Jupyter Notebook - Size: 16.4 MB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

Hari31416/TorchLight
A library to visualize features learned by CNNs
Language: Jupyter Notebook - Size: 27.9 MB - Last synced at: 4 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Zhaoyi-Li21/creme
Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" @ ACL2024-Findings, Bangkok, Thailand.
Language: Python - Size: 12.4 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

Butanium/llm-latent-language
Fork of epfl-dlab/llm-latent-language
Language: Jupyter Notebook - Size: 68.1 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

BatsResearch/cross-lingual-detox
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages"
Language: Jupyter Notebook - Size: 309 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 14 - Forks: 0

wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
Language: Jupyter Notebook - Size: 24.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 5

SAGARSS24/MTB_manuscript_data
Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism
Size: 108 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

francescortu/comp-mech
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Language: Python - Size: 173 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

apartresearch/deepdecipher
🦠 DeepDecipher: An open source API to MLP neurons
Language: Rust - Size: 3.71 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

DeanHazineh/Emergent-World-Representations-Othello
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
Language: Jupyter Notebook - Size: 2.97 GB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 2

jbloomAus/DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
Language: Jupyter Notebook - Size: 51.1 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 54 - Forks: 14

Nix07/finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Language: Jupyter Notebook - Size: 60.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

Nix07/binding-circuit-discovery
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Language: Python - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

pauljblazek/deepdistilling
Mechanistically interpretable neural networks losslessly compressed to computer code, discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Language: Python - Size: 1.35 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 0

AlejoAcelas/Interp-Benchmarks
Reverse-engineered Transformer models as a benchmark for interpretability methods
Language: Jupyter Notebook - Size: 62.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

cx0/mech-interpretability
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
Language: Python - Size: 13.5 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

zroe1/toy-models-of-superposition
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
Language: Jupyter Notebook - Size: 32.6 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

wesg52/sparse-probing-paper
Sparse probing paper full code.
Language: Jupyter Notebook - Size: 50.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 29 - Forks: 8

Lejoon/cup-transformer
A project that simulates a game of shuffling cups with a hidden ball underneath one of them. It also trains a Transformer based deep learning model to predict the final position of the ball after a series of swaps.
Language: Jupyter Notebook - Size: 95.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

matthiasdellago/visualising-attention
Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.
Language: GLSL - Size: 35.2 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

apartresearch/interpretability-starter
🧠 Starter templates for doing interpretability research
Size: 17.6 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 26 - Forks: 1
