Papers I Read This Week
I’m starting a new series called “Papers I Read This Week” to keep track of some of the most interesting work I’ve been reading. I often skim more abstracts and excerpts than I can list here, but this will serve as a place to highlight the papers that stood out—whether for their ideas, methods, or the questions they raise.
Open Problems in Mech Interp
The paper Open Problems in Mechanistic Interpretability lays out a roadmap for understanding how neural networks implement the computations that drive their behavior.
Mechanistic interpretability is distinct from attribution methods or interpretable-by-design models because it aims at genuine mechanistic understanding — the kind that could allow prediction, monitoring, and even editing of model internals.
Themes and questions that stood out:
- Brain vs Neural Networks
  - Idea: Differences between biological neurons and artificial units
  - Observations:
    - Brain neurons and NN units operate differently; brains are energy-efficient and adaptive, and NNs are engineered for scale and optimization.
    - Tendency to map biological intuition onto NNs.
  - Questions:
    - Why do NNs solve problems differently? Architecture, optimization, or energy constraints?
    - When is a biological analogy helpful vs misleading?
    - Can understanding these differences guide the use of bio-inspired approaches?
- Mechanistic Interpretability vs Other Interpretability
  - Idea: Distinction from local attribution and interpretable-by-design models
  - Observations:
    - Mechanistic interpretability seeks causal understanding of computations.
    - Other approaches provide explanations or transparency but not mechanisms.
  - Questions:
    - Is mechanistic interpretability the “best” approach, or only for some tasks?
    - Which safety, governance, or research questions truly require mechanistic-level understanding?
- Validation: Plausibility vs Truth
  - Idea: Ensuring mechanisms are causally responsible for behavior
  - Observations:
    - Neuron firing does not imply causal influence.
    - Polysemanticity and superposition complicate single-unit explanations.
    - Unlike neuroscience, AI allows direct interventions such as ablations and activation patching (a minimal sketch follows this list).
  - Questions:
    - How do we validate mechanisms rigorously and at scale?
    - How do we avoid explanations that are plausible but mechanistically unfaithful?
    - Can insights from NNs feed back into neuroscience understanding?
- Machine Deception, Editing, and Planning
  - Idea: Emergent behaviors like deception, sycophancy, and goal-oriented planning
  - Observations:
    - NNs can exhibit scheming or planning behaviors, unlike most humans.
    - Mechanistic insight may enable direct editing, reducing reliance on RLHF.
  - Questions:
    - Why do these behaviors emerge in machines but not in humans?
    - Once harmful circuits are identified, what is the goal of editing?
    - How can we avoid unintended consequences?
- Governance, Policy, and Incentives
  - Idea: Socio-technical and economic dimensions of interpretability
  - Observations:
    - Mechanistic interpretability could provide auditable points of reference for regulators.
    - It requires technical maturity and policy infrastructure.
    - Investment in interpretability is lacking; incentives favor scaling over safety.
  - Questions:
    - Should labs lacking interpretability resources avoid deploying frontier models?
    - How can funding and incentives be realigned to prioritize safety?
    - What standards and interfaces are needed to make interpretability actionable?
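To make the validation point concrete, here is a minimal, heavily simplified sketch of activation patching on a toy PyTorch model. The model, hooks, and setup are my own assumptions for illustration, not code from the paper: we cache a hidden activation from a “clean” input, patch it into a run on a “corrupted” input, and check whether the clean behavior is restored.

```python
# Toy activation-patching sketch (illustrative only, not the paper's code).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    def __init__(self, d_in=8, d_hidden=16, d_out=2):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = ToyModel()
clean_x = torch.randn(1, 8)
corrupt_x = torch.randn(1, 8)

# 1. Run the clean input and cache the hidden activation we care about.
cache = {}
def save_hook(module, args, output):
    cache["h"] = output.detach()

handle = model.layer1.register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

# 2. Run the corrupted input, but patch in the cached clean activation.
def patch_hook(module, args, output):
    return cache["h"]  # returning a value replaces the module's output

handle = model.layer1.register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

# 3. Compare: if patching this activation restores the clean output, the
#    activation is causally responsible for the behavioral difference.
corrupt_out = model(corrupt_x)
print("clean   :", clean_out)
print("corrupt :", corrupt_out)
print("patched :", patched_out)  # matches clean_out in this toy model
```

In real interpretability work the same pattern is applied to specific attention heads, MLP neurons, or learned features in a language model, and the degree of restoration is measured with a task-specific metric rather than by eyeballing outputs.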
Mechanistic interpretability sits at a crossroads. On one hand, it promises a level of understanding that could reshape how we audit, control, and govern advanced AI systems. On the other hand, it risks being undercut by weak validation, underinvestment, and misplaced incentives that favor scaling over safety. The open problems identified in this paper are not only technical puzzles, but also governance and infrastructure challenges.
Overall, interpretability cannot remain an optional add-on. If labs lack the resources to develop auditable, validated mechanistic insights into their systems, they should not deploy those systems at scale. Safety is not a secondary concern; it is the ground on which this entire field must stand.
Reference: Open Problems in Mechanistic Interpretability, Lee Sharkey et al., 2025. https://arxiv.org/abs/2501.16496
On the Biology of a Large Language Model
Reflections on the paper On the Biology of a Large Language Model, which introduces attribution graphs:
- Causality still needs to be established; attribution alone does not prove it.
- What makes this more than “another interpretability paper”:
  - Structured method: attribution graphs (a toy sketch of the underlying idea follows this list)
  - Applied to: complex model behaviors across tasks
  - Framed as: model biology
  - Links to safety/alignment
- Strong connection between biology/cognitive science and AI.
- Problem → possible inspiration for a solution:
  - Safety lens → Cognitive errors (confirmation bias, false belief, overconfidence)
  - Emergent features → Biological evolution: specialized systems emerge under selection
  - Circuits adapt → Synaptic plasticity / Hebbian learning → brain network reconfiguration
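As an intuition pump for the “structured method” point, here is a toy sketch of the general idea of attributing a downstream unit’s activation to upstream units and keeping only the strongest edges as a small graph. It uses a plain gradient-times-activation score on a made-up two-layer model; the paper’s actual attribution graphs operate over learned, interpretable features and are considerably more involved, so none of the names or choices below come from the paper.

```python
# Toy sketch of the *idea* behind an attribution graph (not the paper's method).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model; the 16 hidden units play the role of "upstream" nodes.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(1, 8)

# Forward pass, keeping the intermediate activation in the autograd graph.
hidden = model[1](model[0](x))   # upstream activations, shape (1, 16)
hidden.retain_grad()             # needed to read .grad on a non-leaf tensor
output = model[2](hidden)        # downstream units, shape (1, 4)

# Pick one downstream unit to explain and backpropagate from it.
target_unit = 2
output[0, target_unit].backward()

# Edge weight from each upstream unit to the target: activation * gradient.
edge_weights = (hidden * hidden.grad)[0]

# Keep only the strongest edges as a tiny "attribution graph" for this input.
top = torch.topk(edge_weights.abs(), k=5)
for idx in top.indices.tolist():
    print(f"hidden unit {idx:2d} -> output unit {target_unit}: "
          f"{edge_weights[idx].item():+.3f}")
```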
Takeaway questions:
- Are model circuits truly specialized, or just distributed patterns?
- Do graphs capture genuine processes, or just correlations?
- Can interpretability track learning trajectories, not just static structure?
Overall, attribution graphs feel like a bridge between AI and cognitive science — a way to start telling a “story” about what models are doing internally. The framework is still evolving, but it’s exciting to see mechanistic interpretability being framed in a way that connects directly to concepts from biology and cognition.
Reference: On the Biology of a Large Language Model, Jack Lindsey et al., 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
On Optimism for Interpretability
Notes from Goodfire’s essay On Optimism for Interpretability:
- Safety-first philosophy
  - Embedded from the start, not added later.
  - AI capabilities should never outpace safety.
- Mechanistic interpretability for safety
  - Examining internal features and circuits explains why a model behaves a certain way.
  - Understanding causality enables meaningful interventions.
- Proactive measures
  - Feature moderation: harmful features filtered before deployment (a toy sketch follows this list)
  - Input/output filtering: prevents misuse
  - Controlled researcher access: allows safe testing
- Collaboration and transparency
  - Partners with safety researchers for red-teaming and shared learning
  - Builds community standards and alignment practices
- Connection to my view
  - Without technical maturity and safety infrastructure, deployment is unsafe
  - Interpretability and safety are inseparable; they are the first line of defense
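Feature moderation is the most concrete of these measures, so here is a minimal sketch of the general idea: identify a direction in a hidden layer that encodes an unwanted feature and project it out at inference time. The model, layer, and “harmful” direction below are placeholders I invented for illustration; this is not Goodfire’s implementation.

```python
# Toy feature-moderation sketch (illustrative placeholders throughout).
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Pretend an interpretability pipeline identified this unit vector in the
# 16-dim hidden space as encoding an unwanted feature (purely hypothetical).
harmful_direction = torch.randn(16)
harmful_direction = harmful_direction / harmful_direction.norm()

def moderation_hook(module, args, output):
    # Project the hidden state onto the harmful direction and subtract it out.
    coeff = output @ harmful_direction                  # shape: (batch,)
    return output - coeff.unsqueeze(-1) * harmful_direction

# Hook the post-ReLU activations so the feature is suppressed at inference.
handle = model[1].register_forward_hook(moderation_hook)

x = torch.randn(2, 8)
moderated_out = model(x)
handle.remove()
unmoderated_out = model(x)

print("with moderation   :", moderated_out)
print("without moderation:", unmoderated_out)
```

In practice the direction would come from something like a learned feature or probe rather than random noise, and the suppression would be checked for side effects on benign behavior before deployment.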
Overall, Goodfire demonstrates that prioritizing safety is not just ethical but also practical. Mechanistic interpretability provides the tools to understand and guide models responsibly, while proactive measures and collaboration ensure that AI systems remain aligned with human values.
Reference: On Optimism for Interpretability, Eric Ho, 2025. https://www.goodfire.ai/blog/on-optimism-for-interpretability
