GitHub topics: mechanistic-interpretability

Repositories

Trustworthy-ML-Lab/Neuron_Eval

[ICML 25] A unified mathematical framework to evaluate neuron explanations of deep learning models with sanity tests

Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: about 3 hours ago - Pushed at: about 4 hours ago - Stars: 4 - Forks: 0

OpenMOSS/Language-Model-SAEs

For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

Language: Python - Size: 16.2 MB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 127 - Forks: 14

stanfordnlp/pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

Language: Python - Size: 25.5 MB - Last synced at: 4 days ago - Pushed at: 29 days ago - Stars: 758 - Forks: 82

gauravfs-14/awesome-mechanistic-interpretability

A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components.

Language: JavaScript - Size: 59.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

microsoft/automated-brain-explanations

Generating and validating natural-language explanations for the brain.

Language: Jupyter Notebook - Size: 1.07 GB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 54 - Forks: 7

lkopf/prism

PRISM is a multi-concept feature description framework which can identify and score polysemantic features.

Language: Jupyter Notebook - Size: 32.1 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 0

lkopf/cosy

CoSy: Evaluating Textual Explanations

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 16 - Forks: 1

epfl-dlab/llm-latent-language

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

Language: Jupyter Notebook - Size: 2.54 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 78 - Forks: 17

MauroAbidalCarrer/learning-deep-learning

Repo where I learn the fundamentals of DL training and interpretation in PyTorch.

Language: Jupyter Notebook - Size: 4.96 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

stanfordnlp/axbench

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

Language: Python - Size: 617 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 95 - Forks: 6

maxdreyer/attributing-clip

Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"

Language: Python - Size: 3.35 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 2 - Forks: 0

aygp-dr/attribution-graphs-explorer

A toolkit for exploring attribution graphs and circuit tracing in transformer models, implemented in Guile Scheme

Language: Scheme - Size: 182 KB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

Butanium/nnterp

A small package implementing some useful wrapping around nnsight

Language: Python - Size: 414 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13 - Forks: 3

aarnphm/morph

exploration WYSIWYG editor

Language: TypeScript - Size: 63.6 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

steering-vectors/steering-vectors

Steering vectors for transformer language models in Pytorch / Huggingface

Language: Python - Size: 8.19 MB - Last synced at: 30 days ago - Pushed at: 4 months ago - Stars: 101 - Forks: 13

Trustworthy-ML-Lab/CB-LLMs

[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.

Language: Python - Size: 936 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 3

lmb-freiburg/understanding-clip-ood

Official code for the paper: "When and How Does CLIP Enable Domain and Compositional Generalization?" (ICML 2025 Spotlight)

Language: Python - Size: 13.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

taufeeque9/codebook-features

Sparse and discrete interpretability tool for neural networks

Language: Python - Size: 3.58 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 63 - Forks: 4

Ki-Seki/Awesome-Transformer-Visualization

Explore visualization tools for understanding Transformer-based large language models (LLMs)

Size: 23.4 MB - Last synced at: 16 days ago - Pushed at: 7 months ago - Stars: 12 - Forks: 2

pmcurtin/model-crosscoders

Final project for cs2222

Language: Jupyter Notebook - Size: 33 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Trustworthy-ML-Lab/posthoc-generative-cbm

[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE) -- efficiently transform any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality

Language: Jupyter Notebook - Size: 3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 1

sdesabbata/geospatial-mechanistic-interpretability

Geospatial Mechanistic Interpretability of Large Language Models

Language: Jupyter Notebook - Size: 41.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

yash-srivastava19/arrakis

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

Language: Jupyter Notebook - Size: 3.53 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 29 - Forks: 2

Trustworthy-ML-Lab/ThinkEdit

An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.

Language: Python - Size: 545 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 10 - Forks: 1

ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models

This repository collects all relevant resources about interpretability in LLMs

Size: 63.5 KB - Last synced at: about 2 months ago - Pushed at: 8 months ago - Stars: 343 - Forks: 24

tegridydev/mechamap

MechaMap - Toolkit for Mechanistic Interpretability (MI) Research

Language: Python - Size: 27.3 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

Gabe-Thomp/Belief-State-Replication

A quick replication of the paper "Transformers Represent Belief State Geometry in their Residual Stream".

Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

evan-lloyd/graphpatch

graphpatch is a library for activation patching on PyTorch neural network models.

Language: Python - Size: 4.49 MB - Last synced at: 21 days ago - Pushed at: 5 months ago - Stars: 14 - Forks: 0

ztjona/ztjona.github.io

My personal website.

Language: HTML - Size: 48.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

chrisliu298/awesome-sparse-autoencoders

A resource repository of sparse autoencoders for large language models

Size: 8.79 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 7 - Forks: 0

THU-KEG/SafetyNeuron

Data and code for the paper: Finding Safety Neurons in Large Language Models

Language: Jupyter Notebook - Size: 4.59 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

jiaqingxie/steer-sae

Code for the ETH MSc Thesis: Sparse Autoencoders vs. Activation Difference for Language Model Steering

Language: Shell - Size: 9.66 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

aryamanarora/causalgym

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Language: Python - Size: 95.4 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 41 - Forks: 6

tegridydev/mixture-of-persona-research

A “Mixture of Perspectives” Framework for Ethical AI

Size: 12.7 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

michaelyliu6/transformers

Educational and production-ready implementations of GPT-2

Language: Jupyter Notebook - Size: 17.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Mathewvanh/reasoning_model_experiment

A look inside the DeepSeek-R1 distilled Llama 3.1 8B thinking model.

Language: Jupyter Notebook - Size: 9.85 MB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Somitheiconic09/AI-Safety

AAAI 2025 Tutorial on Machine Learning Safety

Size: 1000 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

tim-lawson/mlsae

Multi-Layer Sparse Autoencoders (ICLR 2025)

Language: Python - Size: 642 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 17 - Forks: 0

koayon/awesome-sparse-autoencoders

A curated reading list of research in Sparse Autoencoders, Feature Extraction and related topics in Mechanistic Interpretability

Size: 21.5 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 11 - Forks: 2

Butanium/llm-lang-agnostic

Minimal code to reproduce results from "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers"

Language: Jupyter Notebook - Size: 19.5 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 5 - Forks: 1

koayon/atp_star

PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)

Language: Python - Size: 73.2 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 18 - Forks: 1

daspartho/pronoun-prediction

Identifying Circuit behind Pronoun Prediction in GPT-2 Small

Language: Jupyter Notebook - Size: 987 KB - Last synced at: 9 days ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

ishanjmukherjee/toy-models-of-superposition-replication

Replication of the Anthropic interpretability paper "Toy Models of Superposition" by Elhage et al. (2022)

Language: Jupyter Notebook - Size: 810 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

corxyz/icl-cmr

Official implementation of the paper "Linking In-context Learning in Transformers to Human Episodic Memory" by Li Ji-An, Corey Zhou, Marcus Benna, and Marcelo Mattar

Language: Jupyter Notebook - Size: 16.4 MB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

Hari31416/TorchLight

A library to visualize features learned by CNNs

Language: Jupyter Notebook - Size: 27.9 MB - Last synced at: 4 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Zhaoyi-Li21/creme

Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" @ ACL2024-Findings, Bangkok, Thailand.

Language: Python - Size: 12.4 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

Butanium/llm-latent-language

Fork of epfl-dlab/llm-latent-language

Language: Jupyter Notebook - Size: 68.1 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

BatsResearch/cross-lingual-detox

Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages"

Language: Jupyter Notebook - Size: 309 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 14 - Forks: 0

wesg52/universal-neurons

Universal Neurons in GPT2 Language Models

Language: Jupyter Notebook - Size: 24.5 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 5

SAGARSS24/MTB_manuscript_data

Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism

Size: 108 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

francescortu/comp-mech

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

Language: Python - Size: 173 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

apartresearch/deepdecipher

🦠 DeepDecipher: An open source API to MLP neurons

Language: Rust - Size: 3.71 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

DeanHazineh/Emergent-World-Representations-Othello

A mechanistic interpretability study investigating a sequential model trained to play the board game Othello

Language: Jupyter Notebook - Size: 2.97 GB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 2

jbloomAus/DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks

Language: Jupyter Notebook - Size: 51.1 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 54 - Forks: 14

Nix07/finetuning

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

Language: Jupyter Notebook - Size: 60.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

Nix07/binding-circuit-discovery

This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".

Language: Python - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

pauljblazek/deepdistilling

Mechanistically interpretable neural networks losslessly compressed to computer code, discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms

Language: Python - Size: 1.35 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 0

AlejoAcelas/Interp-Benchmarks

Reverse-engineered Transformer models as a benchmark for interpretability methods

Language: Jupyter Notebook - Size: 62.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

cx0/mech-interpretability

Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.

Language: Python - Size: 13.5 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

zroe1/toy-models-of-superposition

A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.

Language: Jupyter Notebook - Size: 32.6 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

wesg52/sparse-probing-paper

Sparse probing paper full code.

Language: Jupyter Notebook - Size: 50.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 29 - Forks: 8

Lejoon/cup-transformer

A project that simulates a game of shuffling cups with a hidden ball underneath one of them. It also trains a Transformer based deep learning model to predict the final position of the ball after a series of swaps.

Language: Jupyter Notebook - Size: 95.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

matthiasdellago/visualising-attention

Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.

Language: GLSL - Size: 35.2 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

apartresearch/interpretability-starter

🧠 Starter templates for doing interpretability research

Size: 17.6 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 26 - Forks: 1

Related Keywords

mechanistic-interpretability (64), interpretability (17), machine-learning (14), large-language-models (11), deep-learning (9), sparse-autoencoder (7), llm (7), explainable-ai (6), pytorch (4), transformer (4), ai-safety (4), computer-vision (4), intervention (3), llms (3), transformers (3), open-source (3), interpretable-deep-learning (3), gpt-2 (3), research (2), nlp (2), ai (2), sparse-autoencoders (2), multilingual-nlp (2), interpretability-jam (2), clip (2), interpretability-and-explainability (2), generative-ai (2), sae (2), visualization (2), attention-mechanism (2), alignment (2), awesome (2), gpt (2), science-of-deep-learning (2), huggingface (2), language-model (2), benchmark (2), natural-language-processing (2), xai (2), eye-closure (1), ml-testing (1), drowsiness-detection (1), genai (1), experimental (1), self-driving-cars (1), driver-monitoring (1), dataset (1), aisafety (1), llama3 (1), distributed-training (1), open-research (1), gemma-9b-it (1), moral-machines (1), mop-ai (1), llm-reasoning (1), ai-research (1), ai-framework (1), syntaxgym (1), causality (1), website (1), othello-ai (1), reinforcement-learning (1), entity-tracking (1), finetuning (1), deep-distilling (1), program-synthesis (1), causal-analysis (1), indirect-object-identification (1), ioi (1), python3 (1), toy-models (1), ai-alignment (1), attention (1), vector-field (1), alignment-jam (1), sparse-coding (1), episodic-memory (1), induction-head (1), feature-visualization (1), compositional-reasoning (1), factual-reasoning (1), multi-hop-reasoning (1), cross-lingual-transfer (1), generalization (1), drug-design (1), mechanism-of-action (1), systems-biology (1), tuberculosis (1), academic (1), api (1), interpretability-methods (1), capstone-project (1), patchscopes (1), nnsight (1), transformer-interpretability (1), scheme (1), circuit-tracing (1), attribution-graphs (1), mechanistic (1), attributions (1)