Topic: "mechanistic-interpretability"

stanfordnlp/pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

Language: Python - Size: 26.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 800 - Forks: 90

ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models

This repository collects all relevant resources about interpretability in LLMs

Size: 63.5 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 381 - Forks: 26

OpenMOSS/Language-Model-SAEs

Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.

Language: Python - Size: 32.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 163 - Forks: 21

itsqyh/Awesome-LMMs-Mechanistic-Interpretability

A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.

Size: 2.48 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 148 - Forks: 3

steering-vectors/steering-vectors

Steering vectors for transformer language models in PyTorch / Hugging Face

Language: Python - Size: 8.19 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 124 - Forks: 13

stanfordnlp/axbench

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

Language: Python - Size: 617 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 95 - Forks: 6

epfl-dlab/llm-latent-language

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

Language: Jupyter Notebook - Size: 2.54 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 80 - Forks: 18

taufeeque9/codebook-features

Sparse and discrete interpretability tool for neural networks

Language: Python - Size: 3.58 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 63 - Forks: 5

Butanium/nnterp

Unified access to Large Language Model modules using NNsight

Language: Python - Size: 5.06 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 58 - Forks: 5

microsoft/automated-brain-explanations

Generating and validating natural-language explanations for the brain.

Language: Jupyter Notebook - Size: 1.17 GB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 57 - Forks: 8

jbloomAus/DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks

Language: Jupyter Notebook - Size: 51.1 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 54 - Forks: 14

aryamanarora/causalgym

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Language: Python - Size: 95.4 MB - Last synced at: 7 months ago - Pushed at: 12 months ago - Stars: 41 - Forks: 6

EleutherAI/bergson

Mapping out the "memory" of neural nets with data attribution

Language: Python - Size: 8.02 MB - Last synced at: 37 minutes ago - Pushed at: about 2 hours ago - Stars: 32 - Forks: 10

yash-srivastava19/arrakis

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

Language: Jupyter Notebook - Size: 3.53 MB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 31 - Forks: 3

wesg52/sparse-probing-paper

Sparse probing paper full code.

Language: Jupyter Notebook - Size: 50.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 29 - Forks: 8

apartresearch/interpretability-starter

🧠 Starter templates for doing interpretability research

Size: 17.6 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 26 - Forks: 1

tim-lawson/mlsae

Multi-Layer Sparse Autoencoders (ICLR 2025)

Language: Python - Size: 642 KB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 24 - Forks: 0

Trustworthy-ML-Lab/CB-LLMs

[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.

Language: Python - Size: 956 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 21 - Forks: 4

wesg52/universal-neurons

Universal Neurons in GPT2 Language Models

Language: Jupyter Notebook - Size: 24.5 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 5

evan-lloyd/graphpatch

graphpatch is a library for activation patching on PyTorch neural network models.

Language: Python - Size: 4.49 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 20 - Forks: 0

Ki-Seki/Awesome-Transformer-Visualization

Explore visualization tools for understanding Transformer-based large language models (LLMs)

Size: 23.4 MB - Last synced at: 7 days ago - Pushed at: 12 months ago - Stars: 20 - Forks: 1

gauravfs-14/awesome-mechanistic-interpretability

A carefully curated collection of high-quality libraries, projects, tutorials, research papers, and other essential resources focused on Mechanistic Interpretability, a growing subfield in machine learning interpretability research that aims to reverse-engineer neural networks into understandable computational components.

Language: JavaScript - Size: 184 KB - Last synced at: about 19 hours ago - Pushed at: about 21 hours ago - Stars: 19 - Forks: 1

koayon/atp_star

PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)

Language: Python - Size: 73.2 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 18 - Forks: 1

lkopf/cosy

CoSy: Evaluating Textual Explanations

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 16 - Forks: 1

Trustworthy-ML-Lab/ThinkEdit

[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.

Language: Python - Size: 579 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 15 - Forks: 1

BatsResearch/cross-lingual-detox

Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages"

Language: Jupyter Notebook - Size: 309 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 14 - Forks: 0

krnel-ai/krnel-graph

Lightweight mechanistic interpretability dataflow operations for agent developers.

Language: Python - Size: 6.18 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 12 - Forks: 1

koayon/awesome-sparse-autoencoders

A curated reading list of research in Sparse Autoencoders, Feature Extraction and related topics in Mechanistic Interpretability

Size: 21.5 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 11 - Forks: 2

pauljblazek/deepdistilling

Mechanistically interpretable neural networks losslessly compressed to computer code, discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms

Language: Python - Size: 1.35 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 0

Nix07/finetuning

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

Language: Jupyter Notebook - Size: 60.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

Nix07/belief_tracking

This repository contains the code used for the experiments in the paper "Language Models use Lookbacks to Track Beliefs".

Language: Jupyter Notebook - Size: 95.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 3

THU-KEG/SafetyNeuron

Data and code for the paper: Finding Safety Neurons in Large Language Models

Language: Jupyter Notebook - Size: 4.59 MB - Last synced at: 30 days ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

chrisliu298/awesome-sparse-autoencoders

A resource repository of sparse autoencoders for large language models

Size: 8.79 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

apartresearch/deepdecipher

🦠 DeepDecipher: An open source API to MLP neurons

Language: Rust - Size: 3.71 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 8 - Forks: 0

rraghavkaushik/NLP-Learning-Resources

List of latest papers and blogs for NLP

Size: 13.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

Trustworthy-ML-Lab/posthoc-generative-cbm

[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE) -- efficiently transform any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality

Language: Jupyter Notebook - Size: 3 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 7 - Forks: 1

aarnphm/morph

exploration WYSIWYG editor

Language: TypeScript - Size: 63.7 MB - Last synced at: 20 days ago - Pushed at: 23 days ago - Stars: 6 - Forks: 0

Butanium/llm-lang-agnostic

minimal code to reproduce results from Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers

Language: Jupyter Notebook - Size: 19.5 MB - Last synced at: 7 months ago - Pushed at: 11 months ago - Stars: 5 - Forks: 1

DeanHazineh/Emergent-World-Representations-Othello

A mechanistic interpretability study investigating a sequential model trained to play the board game Othello

Language: Jupyter Notebook - Size: 2.97 GB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 2

ziansu/awesome-llm-steering

A collection of papers related to steering of large language models.

Size: 6.84 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

Trustworthy-ML-Lab/Neuron_Eval

[ICML 25] A unified mathematical framework to evaluate neuron explanations of deep learning models with sanity tests

Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 0

sdesabbata/geospatial-mechanistic-interpretability

Geospatial Mechanistic Interpretability of Large Language Models

Language: Jupyter Notebook - Size: 41.7 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

Zhaoyi-Li21/creme

Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" @ ACL2024-Findings, Bangkok, Thailand.

Language: Python - Size: 12.4 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

Yusen-Peng/CE-Bench

[BlackboxNLP Workshop @ EMNLP, 2025] CE-Bench: A Contrastive Evaluation Benchmark of LLM Interpretability with Sparse Autoencoders

Language: Jupyter Notebook - Size: 412 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

lkopf/prism

PRISM is a multi-concept feature description framework which can identify and score polysemantic features.

Language: Jupyter Notebook - Size: 32.1 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

RishabSA/interp-refusal-tokens

We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.

Language: Jupyter Notebook - Size: 33.7 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

moment-timeseries-foundation-model/representations-in-tsfms

Exploring Representations and Interventions in Time Series Foundation Models @ ICML 2025

Language: Python - Size: 51.8 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 1

maxdreyer/attributing-clip

Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"

Language: Python - Size: 3.35 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

tegridydev/mechamap

MechaMap - Toolkit for Mechanistic Interpretability (MI) Research

Language: Python - Size: 27.3 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

francescortu/comp-mech

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

Language: Python - Size: 173 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

Nix07/binding-circuit-discovery

This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".

Language: Python - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

cx0/mech-interpretability

Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.

Language: Python - Size: 13.5 MB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

daspartho/pronoun-prediction

Identifying Circuit behind Pronoun Prediction in GPT-2 Small

Language: Jupyter Notebook - Size: 987 KB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

peppinob-ol/attribution-graph-probing

Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.

Language: Python - Size: 44.7 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

jwuphysics/euclid-galaxy-morphology-saes

studying (self-)supervised representations of Euclid galaxy imaging via SAEs

Language: Python - Size: 166 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

Davide011/Thesis_public

Mechanistic analysis of a GPT-2–like model exploring the compositionality gap in transformers. Using Logit Lens and Causal Tracing, the study identifies and mitigates a deep-layer bottleneck via dataset enhancement to improve logical reasoning.

Language: Python - Size: 370 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

wasim/scaling-specialization-dense-lms

Do dense LMs develop MoE-like specialization as they scale? Measure it, visualize it, and turn it into speed.

Language: Python - Size: 6.84 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

manik-sethi/hallucination-circuits

Tentatively for: Using Monosemantic Features From Sparse Auto-Encoders to Detect Hallucinations

Language: Jupyter Notebook - Size: 270 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Dhia-naouali/Tickling-Vision-Models

Performing mechanistic interpretability on InceptionV1, from linear probing and sparse direction maximization to adversarial and circuit patching & ablation

Language: Python - Size: 5.16 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

zer0int/CLIP-HeadHunter

Head-Hunter: A Visual Bias Explorer. Attention Head Max Visualization to find, rank, and visualize heads; map bias; see what a CLIP 'sees'.

Language: Python - Size: 15.9 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

jejaquez/bumblebee

🐝 Bumblebee — An AI Fuzzer for Model Robustness, Safety, Security, and Interpretability

Language: Python - Size: 6.84 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

jejaquez/hexray

🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security

Language: Python - Size: 229 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

BeekeepingAI/hexray

🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security

Language: Python - Size: 215 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

ilyalasy/memorization_circuits

Applied mechanistic interpretability techniques to find circuits behind memorization processes in GPT-NEO-125m

Language: Jupyter Notebook - Size: 2.13 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

lmb-freiburg/understanding-clip-ood

Official code for the paper: "When and How Does CLIP Enable Domain and Compositional Generalization?" (ICML 2025 Spotlight)

Language: Python - Size: 13.2 MB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

ztjona/ztjona.github.io

My personal website.

Language: HTML - Size: 48.4 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

jiaqingxie/steer-sae

Code for the ETH MSc Thesis: Sparse Autoencoders vs. Activation Difference for Language Model Steering

Language: Shell - Size: 9.66 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

tegridydev/mixture-of-persona-research

A “Mixture of Perspectives” Framework for Ethical AI

Size: 12.7 KB - Last synced at: 8 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

Butanium/llm-latent-language

Fork of epfl-dlab/llm-latent-language

Language: Jupyter Notebook - Size: 68.1 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

matthiasdellago/visualising-attention

Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.

Language: GLSL - Size: 35.2 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

raghulchandramouli/Mechanistic-Qwen

A KL-div + Token-level Analysis of Qwen and DeepSeek model for Reasoning

Language: Jupyter Notebook - Size: 516 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

raishish/discrete-diffusion-rome

Interpreting Masked Diffusion Language Models

Language: Python - Size: 9.77 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

maximusrafla/bilinear-modular-arithmetic

Exploring bilinear neural network layers for modular addition with tensor decomposition analysis

Language: Python - Size: 4.12 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

Course-Correct-Labs/entropy-collapse-null

Null result on entropy collapse; reproducible figures; Course Correct Labs

Language: Python - Size: 181 KB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

mduffster/self-referent-test

Testing role-based pathways on small LLMs

Language: Python - Size: 2.11 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AIDoctrine/fpc-ae1r

FPC v2.2r + AE-1r: Predicting and Preventing LLM Failures via Measurable Internal States

Language: Python - Size: 3.15 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

galassoandrea/mechanistic-interpretability

Repository for the implementation of several mechanistic interpretability algorithms for LLMs.

Language: Python - Size: 742 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

fringewidth/steer-clear

unsupervised search for interpretable behaviours in Qwen 14b

Language: Jupyter Notebook - Size: 31.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ashioyajotham/greater-than-circuit

Reverse engineering the circuit responsible for the "greater than" capability in a language model

Language: Python - Size: 212 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

salma2vec/inductra

Lightweight mech-interp sandbox for probing ICL, tracing induction heads, and stress-testing distilled vs base LMs.

Language: Python - Size: 3.91 KB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Faraday-dot-py/MATS-9.0

Do different AIs dream of the same electric sheep?

Language: Shell - Size: 897 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

carlacodes/freezeLLM

mechanistic interpretability of toy LLMs

Language: Python - Size: 494 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

fateme-hshm96/skel-for-circuits

Subgraph Extraction via Multi-Scale Graph Skeletonization

Language: Python - Size: 978 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

xycoord/Language-Modelling

Implementations and Experiments: Transformers, RoPE, KV cache, SAEs, Tokenisers

Language: Python - Size: 1.52 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

HillaryDanan/relativistic-interpretability

A geometric framework for understanding neural network reasoning through multiple reference frames

Language: Python - Size: 52.7 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

drKeeman/glitch_core

AI Safety research platform for studying personality drift in AI systems using mechanistic interpretability and clinical assessment tools. Complete simulation framework with neural circuit analysis, statistical drift detection, and intervention protocols.

Language: Jupyter Notebook - Size: 10.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0