We address the challenge of theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for interpreting Large Language Models. Current SAE training methods lack mathematical guarantees and face issues like hyperparameter sensitivity. We propose a statistical framework for feature recovery that models polysemantic features as sparse mixtures of monosemantic concepts. Based on this, we develop a "bias adaptation" SAE training algorithm that dynamically adjusts network biases for optimal sparsity. We prove that this algorithm correctly recovers all monosemantic features under our statistical model. Our improved variant, Group Bias Adaptation (GBA), outperforms existing methods on LLMs up to 1.5 billion parameters in terms of sparsity-loss trade-off and feature consistency. This work provides the first SAE algorithm with theoretical recovery guarantees, advancing interpretable and trustworthy AI through enhanced mechanistic understanding.
Consider the following sentence: "The detective found a muddy footprint near the broken window, leading him to suspect a ?" The sentence contains two distinct concepts: a muddy footprint and a broken window. For a trained deep neural network, the learned representations in the intermediate layers are often polysemantic, meaning that they are mixtures of multiple features of the underlying concepts. Specifically, the representation \(x\) after seeing the whole sentence may have the following form:

\(x = h_1 \cdot\) "feature of muddy footprint" \(+\, h_2 \cdot\) "feature of broken window",

where \(h_1\) and \(h_2\) are the nonnegative weights of the two monosemantic features. After obtaining the polysemantic representation \(x\), the model can then generate the token "burglary" at the "?" position.
The goal of feature recovery is to recover the monosemantic features underlying each concept by training on a dataset of polysemantic representations, which are often extracted from the intermediate layers of a trained deep neural network, e.g., the residual stream of a transformer-based model.

While this example illustrates the intuitive notion of monosemantic features, we need a more rigorous definition to make progress. Currently, researchers primarily assess monosemanticity through feature interpretability, i.e., how well a feature aligns with human-understandable concepts. However, this anthropocentric view has limitations: neural networks may process information in ways fundamentally different from human conceptual understanding. We need a more principled, mathematically grounded definition of monosemantic features that captures their essential properties independent of human interpretation. Specifically, we ask the following questions:
What is a mathematically rigorous definition of identifiable monosemantic features?
Given polysemantic representations, when will the monosemantic features be identifiable?
How can we reliably recover the monosemantic features?
To address the limitations of existing Sparse Autoencoder (SAE) training methods, we propose a new algorithm called Group Bias Adaptation (GBA). The algorithm has two main components: a bias adaptation subroutine that controls the activation frequency of each neuron, and a neuron grouping strategy that assigns different target activation frequencies (TAFs) to different groups of neurons.
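As a rough illustration of these two components (our simplified sketch, not the paper's exact implementation; the update rule, step size, and group spacing below are illustrative assumptions):

```python
import torch

def adapt_biases(bias, act_freq, target_freq, step_size=1e-3):
    """Nudge each neuron's bias so its firing rate tracks its target
    activation frequency (TAF). All names here are illustrative."""
    with torch.no_grad():
        # Neurons firing more often than their TAF get a lower bias
        # (harder to activate); neurons firing too rarely get a higher one.
        bias += step_size * torch.sign(target_freq - act_freq)
    return bias

def group_target_freqs(n_neurons, n_groups=10, htf=0.5, ratio=0.5):
    """Assign geometrically spaced TAFs across groups, from a highest
    target frequency (HTF) downward; assumes n_neurons % n_groups == 0."""
    tafs = htf * ratio ** torch.arange(n_groups, dtype=torch.float32)
    return tafs.repeat_interleave(n_neurons // n_groups)
```

Each neuron's bias thus acts as a learned activation threshold that is steered toward its group's TAF, rather than being tuned by hand.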
By combining these two strategies, GBA offers direct control over neuron activation: it avoids complex hyperparameter tuning while ensuring that neurons activate at appropriate frequencies, so that features occurring at different frequencies can all be learned.
In this section, we evaluate the performance of our Group Bias Adaptation (GBA) algorithm across several key dimensions. To ensure a comprehensive understanding, we first briefly introduce the experimental setup before diving into the three key questions.
We conduct our experiments on two datasets, Pile Github and Pile Wikipedia, each with the first 100k tokens. The experiments are performed on the Qwen2.5-1.5B base model, where we attach a Sparse Autoencoder (SAE) to the MLP outputs of layers 2, 13, and 26. Each SAE contains 66k hidden neurons and operates with an input/output dimension of 1536.
To ensure optimal performance, we adopt the JumpReLU activation function for all methods tested. We use 100 million tokens for training the SAEs, feeding the tokens through the LLM and collecting the MLP outputs at the specified layers.
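For reference, here is a minimal SAE skeleton with a JumpReLU activation, matching the shapes stated above (66k hidden neurons, input/output dimension 1536); the class, the learnable per-neuron threshold, and all names are our illustrative assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative SAE skeleton (not the exact implementation)."""
    def __init__(self, d=1536, n=66_000):
        super().__init__()
        self.enc = nn.Linear(d, n)                 # encoder: d -> n
        self.dec = nn.Linear(n, d)                 # decoder: n -> d
        self.theta = nn.Parameter(torch.zeros(n))  # per-neuron JumpReLU threshold

    def forward(self, x):
        z = self.enc(x)            # pre-activations
        a = z * (z > self.theta)   # JumpReLU: pass z through only where z > theta
        return self.dec(a), a      # reconstruction and sparse codes
```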
The four methods compared are:

- TopK: A sparse activation method that retains only the top-K activated neurons (see the sketch below).
- L1: A method that enforces sparsity through L1 regularization.
- BA (Bias Adaptation): A variant of GBA with a single group and hyperparameter tuning for the Target Activation Frequency (TAF).
- GBA (Group Bias Adaptation): Our proposed method, which uses multiple groups of neurons with TAFs set geometrically and no hyperparameter tuning.

The first question we address is how the GBA method compares with the other methods in terms of reconstruction loss and activation sparsity. The results show that GBA performs comparably to the TopK method under post-activation sparsity and outperforms it under pre-activation sparsity. Furthermore, GBA significantly outperforms the L1 and Bias Adaptation (BA) methods across all experiments.
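To make the TopK baseline in the list above concrete, its activation rule can be sketched as follows; this is our illustration, not a specific library's implementation:

```python
import torch

def topk_activation(z, k):
    """Keep the k largest pre-activations per token and zero out the rest."""
    vals, idx = z.topk(k, dim=-1)
    return torch.zeros_like(z).scatter(-1, idx, vals)
```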
The second question revolves around the robustness of the GBA method to the choice of hyperparameters, such as the number of groups and target frequencies. Our ablation studies show that GBA is nearly tuning-free. As long as the Highest Target Frequency (HTF) is sufficiently high (e.g., 0.5) and the number of groups (K) is large enough (e.g., 10 or 20), GBA performs consistently well.
The third question examines the consistency of the features learned by the GBA method across independent runs with different random seeds. We evaluate this using the Maximum Cosine Similarity (MCS) metric; the precise definition can be found in §A.2 (Evaluation Metrics) of the paper. Simply put, a higher MCS means that a neuron's learned feature direction has a close counterpart in the other runs, so higher values indicate better consistency. The results show that GBA outperforms the other methods, including TopK, in terms of feature consistency. Specifically, we plot the percentage of neurons whose MCS exceeds different thresholds, where the plotted neurons are also filtered by certain metrics.
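A metric of this kind can be computed as in the sketch below; the exact definition used in the paper is the one in §A.2, and the function here is our illustrative approximation:

```python
import torch
import torch.nn.functional as F

def max_cosine_similarity(dec_a, dec_b):
    """For each decoder direction learned in run A, the highest cosine
    similarity against all decoder directions learned in run B.

    dec_a, dec_b: (n_neurons, d) decoder weight matrices from two runs."""
    a = F.normalize(dec_a, dim=1)
    b = F.normalize(dec_b, dim=1)
    sims = a @ b.T                  # pairwise cosine similarities
    return sims.max(dim=1).values   # MCS for each neuron of run A
```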
We provide additional studies on the neurons learned by GBA in terms of the three metrics used above: maximum activation, Z-score, and maximum cosine similarity across independent runs with different random seeds.
These metrics are computed on the validation split of the Pile Github dataset, with the hook position at the MLP output of layer 26.
For each neuron, we also compute the activation fraction, which is the fraction of tokens where the pre-activations of the neuron are non-negative.
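Concretely, given the pre-activations over the validation tokens, this can be computed in one line (`pre_acts` is a hypothetical variable name):

```python
# pre_acts: (n_tokens, n_neurons) pre-activations on the validation set.
act_fraction = (pre_acts >= 0).float().mean(dim=0)  # per-neuron fraction
```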
For each neuron, we have four metrics: maximum activation, Z-score, maximum cosine similarity, and activation fraction. We generate scatter plots by plotting the Z-score against the other three metrics. The results for GBA are shown below.
(Figure: scatter plots of the Z-score against the other three metrics for GBA on the Pile Github dataset, with a hook at the MLP output of layer 26.)

To formulate the feature recovery problem, let's consider a model's hidden representation at a specific layer. This layer encodes \(n\) distinct features in a \(d\)-dimensional space, which we collect into a feature matrix \(V\in\mathbb{R}^{n\times d}\). Each row \(v_i\) represents one feature vector. Given a training set of \(N\) tokens, we extract their hidden representations into a data matrix \(X\in\mathbb{R}^{N\times d}\). Each row \(x_\ell\) is a weighted combination of the feature vectors, with weights stored in a coefficient matrix \(H\in\mathbb{R}_+^{N\times n}\). This gives us the compact data model: \begin{align}\label{eq:data_model} X = H V \in \mathbb{R}^{N\times d}. \tag{1} \end{align}
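As a toy instantiation of this data model (all sizes and the sampling scheme are hypothetical choices for illustration):

```python
import torch

torch.manual_seed(0)
n, d, N, k = 4, 8, 1000, 2                 # features, dim, tokens, active per token
V = torch.randn(n, d)                      # feature matrix, rows are features
H = torch.zeros(N, n)                      # nonnegative, sparse coefficients
for i in range(N):
    active = torch.randperm(n)[:k]         # pick k features for this token
    H[i, active] = 0.5 + torch.rand(k)     # non-negligible weights in [0.5, 1.5)
X = H @ V                                  # polysemantic representations (Eq. 1)
```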
Given data generated according to Eq. \eqref{eq:data_model}, not every feature matrix \(V\) can be uniquely identified from the observed data matrix \(X\). This leads us to a fundamental question in feature recovery: under what conditions can we identify the feature matrix \(V\) from the data matrix \(X\)? In the following, we restrict attention to pairs \((H,V), (H', V')\in \mathcal{G}\) for some class \(\mathcal{G}\), and ask when any two such pairs that generate the same data \(X\) must coincide.
In a nutshell, feature identification seeks the minimal number of features that can decompose the dataset under certain conditions (specified by the class \(\mathcal{G}\)), and such a decomposition is unique up to feature scaling and permutation.

The feature matrix \(V\) is identifiable when the features occur almost independently and are almost orthogonal. In addition, each data point should be a sparse linear combination of the features with non-negligible combination coefficients.
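One standard way to make "almost orthogonal" quantitative is the mutual coherence of the feature matrix, sketched below; this illustrates the general notion and is not the paper's exact condition:

```python
import torch

def mutual_coherence(V):
    """Largest absolute cosine similarity between distinct feature vectors."""
    Vn = V / V.norm(dim=1, keepdim=True)
    G = (Vn @ Vn.T).abs()
    G.fill_diagonal_(0.0)
    return G.max()  # small value: features are nearly orthogonal
```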
We theoretically prove that the GBA algorithm can recover all the features if the features are identifiable and certain regularity conditions are met.
Our theoretical analysis further reveals that GBA works through a selective feature learning mechanism, where each neuron group preferentially learns features whose natural occurrence frequency is just slightly lower than the group's target activation frequency (TAF). This "resonance" principle enables robust and consistent feature recovery, and demonstrates the necessity of using different neuron groups to learn features with different occurrence frequencies.
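As a toy rendering of this resonance principle (the assignment rule below is our simplification for illustration, not a statement proved in the paper):

```python
def resonant_group(feature_freq, tafs):
    """Index of the group expected to learn a feature: the group whose TAF
    is the smallest one still above the feature's occurrence frequency.
    Assumes `tafs` is sorted in decreasing order, e.g. [0.5, 0.25, 0.125]."""
    candidates = [i for i, taf in enumerate(tafs) if taf > feature_freq]
    return candidates[-1] if candidates else None
```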
In this section, we present a demo of the feature dashboard for the SAE-learned features. These features come from the training experiments conducted with the Group Bias Adaptation (GBA) method, where the SAE was trained on the Pile Github dataset with the first 100k tokens. The features were extracted from the output of the MLP block at layer 26 of the Qwen2.5-1.5B base model.
The feature dashboard visualizes various aspects of the learned features, providing insights into their activations and behaviors across different runs. One of the examples featured here corresponds to neuron 4688, which exhibits a clear bimodal activation pattern. This neuron is activated just before outputting the "class" token, indicating that it captures a distinct feature relevant to that part of the model's operation.
For a comprehensive view of all the learned features, see the All Feature Dashboards page.