CAIAC Papers Week 4
This week shifts from trajectory forecasting to the technical challenge of understanding what's actually happening inside neural networks. Mechanistic interpretability promises to crack open the black box—to understand not just what models do, but how they do it. The two papers we examined reveal both the fundamental obstacle (superposition) and a promising technique for overcoming it (sparse autoencoders for dictionary learning).
Toy Models of Superposition (Elhage et al., 2022)
Core insight: Neural networks can represent more features than they have dimensions by storing them in superposition—using the same neurons to encode multiple unrelated concepts. This happens when features are sparse (rarely active simultaneously), allowing models to compress information at the cost of "interference" that must be filtered out nonlinearly. Superposition explains why neurons are often polysemantic (responding to many unrelated inputs) and makes interpretability fundamentally harder.
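To make the setup concrete, here is a minimal sketch of the kind of ReLU-output toy model the paper studies. The dimensions, sparsity level, importance schedule, and training details below are illustrative choices of mine, not the paper's exact configuration.

```python
import torch

# Toy "ReLU output" model in the spirit of the paper: n_features sparse inputs
# are linearly compressed into n_hidden dimensions and then reconstructed.
# Sizes, sparsity, and the importance schedule are illustrative choices.
n_features, n_hidden = 20, 5
sparsity = 0.95                                              # each feature is zero 95% of the time
importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)  # earlier features matter more

W = torch.randn(n_features, n_hidden, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    # Sparse synthetic features: uniform in [0, 1], masked to be mostly zero.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity).float()
    h = x @ W                                      # compress 20 features into 5 dimensions
    x_hat = torch.relu(h @ W.T + b)                # reconstruct; the ReLU filters interference
    loss = (importance * (x - x_hat) ** 2).mean()  # importance-weighted reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()

# With more features than dimensions, the learned directions cannot all be orthogonal:
# off-diagonal entries of W @ W.T show the interference between superposed features.
print((W @ W.T).detach().round(decimals=2))
```

Raising the sparsity in this sketch generally pushes more features into superposition, while making features dense pushes the model back toward keeping only the most important ones in (near-)orthogonal directions.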
Key Findings:
- Superposition as compression: When features are sparse, models can pack more features than dimensions by tolerating interference between them. This is like lossy compression—you store more than should fit, accepting some noise.
- Phase change behavior: There's a sharp transition between storing a feature in its own orthogonal dimension (as PCA would) and storing it in superposition. Where a feature lands depends on its importance and sparsity: sparse features of sufficient importance get packed into superposition, important dense features get dedicated dimensions, and unimportant features get dropped entirely.
- Geometric structure: Features in superposition arrange themselves in specific geometric patterns (uniform polytopes—pentagons, triangles, etc.). The model finds configurations that minimize interference while maximizing representational capacity.
- Computation in superposition: Models can perform operations (like computing absolute value) while features remain in superposition. This suggests neural networks may be "noisily simulating" much larger sparse networks.
- Monosemantic vs. polysemantic neurons: Whether a neuron is monosemantic (one concept) or polysemantic (many concepts) depends on which side of the phase boundary it falls on. This isn't a choice—it's an emergent consequence of optimization pressure and feature statistics.
Reflections / Speculative Ideas:
- Superposition as unavoidable: The paper demonstrates that superposition isn't a bug or implementation detail—it's the optimal solution given sparse features and limited capacity. Trying to train models without superposition (by enforcing activation sparsity) doesn't solve polysemanticity! Even with 1-hot activations, individual neurons can still be polysemantic. This is devastating: you can't just "train better models" to avoid interpretability challenges.
- The interpretability tax: Superposition reveals a fundamental tradeoff between model capacity and interpretability. Models that use superposition can represent more concepts in fewer parameters—they're more parameter-efficient. Forcing monosemanticity would require massively larger models. This suggests interpretability might have an inherent cost in capability or efficiency.
- Connection to adversarial examples: The paper hints that adversarial examples might exploit interference in superposition—perturbations that activate features in combinations that don't naturally occur, causing the model to "hallucinate" spurious feature combinations. This connects interpretability to robustness.
- Phase changes and grokking: The sharp transition from not representing a feature to representing it in superposition might relate to grokking (sudden generalization after extended training). Perhaps grokking represents crossing a phase boundary where enough capacity becomes available to store a feature set in superposition.
- Biological plausibility: Brains have roughly 86 billion neurons, yet appear to represent far more distinct concepts, stimuli, and associations than they could assign a dedicated neuron each. Superposition might be how biological neural systems achieve their impressive capacity. The sparsity of neural firing in the brain aligns with the conditions favoring superposition. This could mean biological interpretability faces the same fundamental challenges.
- Implications for scaling: As models get larger, they can store more features, but also store more features in superposition per neuron. Interpretability might not improve with scale—it might get worse. Each neuron in GPT-4 could be responding to dozens or hundreds of features.
- The "features not neurons" insight: If models are simulating larger sparse networks, the right unit of analysis isn't neurons—it's the features being represented in superposition. This motivates the search for methods to extract these features, leading directly to the monosemanticity work.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Bricken et al., 2023)
Core insight: Sparse autoencoders can decompose polysemantic neurons into monosemantic features. By training an overcomplete autoencoder (more hidden units than inputs) with sparsity constraints on a model's activations, we can extract interpretable features that correspond to specific, understandable concepts. A 512-neuron layer decomposes into 4000+ features representing things like DNA sequences, legal language, HTTP requests, Hebrew text, etc.
Methodological Innovation:
- Sparse autoencoder (SAE) approach: Train an autoencoder to reconstruct a layer's activations through a hidden layer that is overcomplete (wider than the input) rather than a bottleneck. Add an L1 sparsity penalty so that only a few features activate for any given input. This forces the model to find a sparse decomposition of the activation space (a minimal sketch follows this list).
- Dictionary learning framing: The SAE learns a "dictionary" of feature directions. Each activation can be expressed as a sparse linear combination of these dictionary elements. The features are the basis vectors of this overcomplete dictionary.
- Training on real model activations: Unlike toy models, this works on actual transformers. They decompose the MLP layer of a small 1-layer transformer, showing the technique scales beyond synthetic setups.
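Here is a minimal sparse-autoencoder sketch in PyTorch. The widths, L1 coefficient, and the random stand-in for real MLP activations are my placeholders; the paper's full training recipe includes additional details this sketch omits.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty on its hidden activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # d_dict >> d_model (overcomplete)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction from the dictionary
        return x_hat, f

# Placeholder sizes; the paper's main run decomposes a 512-dimensional MLP layer.
d_model, d_dict, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of real MLP activations collected from the transformer.
acts = torch.randn(4096, d_model)

for step in range(1_000):
    x_hat, f = sae(acts)
    recon_loss = ((acts - x_hat) ** 2).mean()
    sparsity_loss = f.abs().mean()                 # L1 penalty drives most features to zero
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad(); loss.backward(); opt.step()

# Each column of decoder.weight (row of feature_directions) is one dictionary
# element: a feature direction in the original activation space.
feature_directions = sae.decoder.weight.detach().T   # shape: (d_dict, d_model)
```

Sweeping l1_coeff traces out the reconstruction-versus-sparsity tradeoff discussed in the reflections below: larger penalties give sparser, typically more interpretable features at the cost of reconstruction accuracy.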
Key Results:
- Features are interpretable: Manual inspection, automated interpretability scoring, and analysis of logit weights all show that the learned features are more interpretable than neurons. Most features correspond to clear, specific concepts (a minimal version of this kind of inspection is sketched after this list).
- Features are invisible in neuron basis: Many interpretable features don't correspond to any individual neuron—they're linear combinations of many neurons. Looking at neurons alone misses most of what the model represents.
- Features enable circuit analysis: By decomposing in the feature basis, you can trace how information flows through the network at a more interpretable granularity. Features that detect specific inputs connect to features that produce specific outputs.
- Example: Base64 feature: One discovered feature responds specifically to Base64-encoded text. It activates on Base64 inputs and its logit weights predict Base64 continuations. This is a clean, monosemantic concept that emerges from the decomposition.
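One basic inspection step is finding the inputs on which a given feature fires most strongly. The sketch below uses random stand-in data; in practice feature_acts and tokens would come from running the model and the trained SAE over a real corpus.

```python
import numpy as np

# Hypothetical inputs: feature activations from a trained SAE over a token corpus.
# feature_acts has shape (n_tokens, n_features); tokens holds the matching text.
rng = np.random.default_rng(0)
feature_acts = rng.exponential(scale=0.1, size=(10_000, 4096))   # stand-in data
tokens = [f"token_{i}" for i in range(10_000)]                    # stand-in corpus

def top_activating_examples(feature_idx: int, k: int = 10):
    """Return the k tokens on which a given feature fires most strongly."""
    acts = feature_acts[:, feature_idx]
    top = np.argsort(acts)[::-1][:k]
    return [(tokens[i], float(acts[i])) for i in top]

# Inspecting a feature this way is how one would notice that it consistently
# fires on, say, Base64-encoded spans or Hebrew text.
print(top_activating_examples(feature_idx=123))
```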
Reflections / Speculative Ideas:
- SAEs as compression of compression: The original model compressed world knowledge into its weights via training. SAEs compress the model's computational structure into an interpretable format. We're doing lossy compression of lossy compression, which raises questions about what's lost at each stage.
- The reconstruction-interpretability tradeoff: SAEs with higher sparsity (fewer active features) are more interpretable but reconstruct activations less accurately. There's a tunable tradeoff between "how well does this explain the model" and "how understandable is the explanation." This is a microcosm of the broader alignment challenge—we can make models more interpretable at the cost of capability.
- Feature splitting and the curse of dimensionality: The paper mentions "feature splitting"—the same concept being represented by multiple features at different levels of abstraction. As you increase SAE width, features split into finer-grained variants. This suggests an infinite regress: there's no "right" granularity of features. The number of "true features" might be unbounded.
- Automated interpretability as necessary scaffold: Manual inspection doesn't scale to thousands of features. The paper uses automated interpretability (having another model explain features) to evaluate at scale. This creates a dependency: interpretability research increasingly requires AI to interpret AI. What happens when the interpretability model itself is uninterpretable?
- Features as causal units: If we can identify features and understand their effects on outputs, we might be able to intervene—selectively activating or suppressing features to steer behavior. This connects interpretability to controllability. But the paper's features are extracted from activations, not necessarily causal. Correlation vs. causation remains a challenge.
- Scaling to production models: The paper works on a tiny 1-layer model. Anthropic has since scaled this to Claude 3 Sonnet (see "Scaling Monosemanticity"), extracting millions of features. But does this help with safety? Can we use these features to detect deception, monitor for scheming, or ensure alignment? The techniques scale, but do the benefits?
- The feature lottery hypothesis: Just as the lottery ticket hypothesis suggests that initialization shapes which subnetworks training finds, SAE training likely involves a "feature lottery" where different training runs find different decompositions. How stable are these features? If we train multiple SAEs on the same model, do we get the same features or different-but-equally-valid alternatives? (A minimal version of this check is sketched after this list.)
- Neuroscience parallels and divergences: Neuroscience has similar debates about the right level of analysis—should we study individual neurons, neural populations, or functional circuits? SAEs suggest the answer for ANNs is "features extracted via dictionary learning," but biological neurons might not have this property. Or maybe they do and we just don't have the right probes yet. This could inform experimental neuroscience technique development.
- From understanding to action: The critical question is whether interpretability translates to safety. If we understand what features a model has learned, can we prevent it from learning dangerous features? Can we detect when it's learned to deceive? Can we verify alignment by auditing its feature space? These are open questions that mechanistic interpretability must eventually address.
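One way to probe the stability question raised above is to compare the dictionaries from two independently trained SAEs by matching feature directions on cosine similarity. The random matrices below are stand-ins for the decoder weights of two real training runs.

```python
import torch

# Stand-ins for the (unit-normalized) decoder weights of two SAEs trained on the
# same activations with different random seeds.
d_model, d_dict = 512, 4096
dict_a = torch.nn.functional.normalize(torch.randn(d_dict, d_model), dim=-1)
dict_b = torch.nn.functional.normalize(torch.randn(d_dict, d_model), dim=-1)

# Cosine similarity between every feature in run A and its best match in run B.
sims = dict_a @ dict_b.T                 # (d_dict, d_dict) cosine similarities
best_match = sims.max(dim=-1).values     # how well each A-feature is recovered in B

# If the decompositions are stable, most best-match similarities should be near 1;
# a broad spread would suggest different-but-equally-valid dictionaries.
print(f"median best-match cosine: {best_match.median().item():.3f}")
```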
Connecting Themes and Takeaways
- The two-paper arc: Toy Models diagnoses the problem (superposition makes neurons polysemantic), and Monosemanticity offers a solution (SAEs extract monosemantic features). Together they establish a research program: identify the fundamental obstacles to interpretability, then develop techniques to overcome them.
- Interpretability is non-trivial: These papers demonstrate that understanding neural networks is hard in a principled sense. Superposition isn't incidental—it's the optimal response to the structure of real-world data (sparse features). Any interpretability solution must contend with this.
- Features, not neurons, as atoms of cognition: Both papers converge on the idea that the right unit of analysis is abstract features, not physical neurons. This mirrors debates in neuroscience and suggests a general principle: computational structure doesn't respect hardware boundaries.
- Trade-offs are fundamental: Capacity vs. interpretability (superposition enables compression but creates polysemanticity). Reconstruction vs. interpretability (sparse SAEs are clearer but less faithful). Granularity vs. tractability (finer features are more precise but exponentially more numerous). Every choice involves giving something up.
- Scaling as both solution and problem: Larger SAEs can extract finer-grained features, and the technique scales to production models. But larger models likely have more superposition, more features, and more complex feature interactions. Interpretability research scales, but does interpretability itself scale?
- The automation necessity: We've already reached the point where human interpretability analysis doesn't scale. We need AI to interpret AI, creating a tower of meta-levels. This is either encouraging (we can automate interpretability research) or concerning (we're building interpretability tools we can't fully verify).
- Causal intervention as the goal: Understanding is only useful if it enables action. The research program must move from "what features exist" to "how can we use features to ensure safe behavior." This requires moving from descriptive to causal interpretability.
- Biological insights bidirectional: Artificial neural network interpretability can inform neuroscience (superposition might explain polysemantic biological neurons), and neuroscience can inform AI (understanding how brains manage superposition might improve model design). The fields are converging methodologically.
- Connection to Week 2 alignment challenges: Interpretability is meant to help with alignment, but these papers show why it's hard. If models represent goals in superposition, scattered across millions of features, how do we verify those goals are aligned? How do we detect goal misgeneralization (Shah et al.) if the goal representation is distributed and entangled with other features?
- Bridge to Week 5 (Control): If we can't fully interpret models, can we at least control them? Mechanistic interpretability aims for transparency; control aims for safety despite opacity. Understanding these papers' limitations motivates the control agenda—we might need to ensure safety without complete understanding.
References:
- Toy Models of Superposition, Elhage et al., Anthropic, 2022. https://transformer-circuits.pub/2022/toy_model/index.html
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al., Anthropic, 2023. https://transformer-circuits.pub/2023/monosemantic-features
