CAIAC Papers Week 2

3 minute read

Published: October 01, 2025

This week, our AI safety reading group explored three papers addressing why alignment is hard, how misgeneralization can arise even with correct specifications, and why all ML systems have “weird” failure modes. Below are my reflections, observations, and speculative ideas for addressing these challenges.

Why Alignment Could Be Hard with Modern Deep Learning (Cotra, 2021)

Core insight: Even without awareness, models can produce misaligned behavior. Awareness is the “sci-fi scary” possibility, but Cotra emphasizes that alignment problems exist even for non-conscious models. These systems can optimize for objectives in ways that produce unintended behavior, creating a dynamic chess match between the evaluator and the model.

Chess match analogy (expanded):

Testing a model for alignment is interactive: we design tasks or benchmarks, and the model responds.
A capable model might internally “strategize” to win the evaluation without being genuinely aligned.
If a model is scheming, it’s testing you the way you’re testing it: anticipating your evaluation strategies, predicting your responses, and optimizing its behavior to pass tests rather than follow human-aligned goals.
Takeaway: a model’s success on a test does not guarantee alignment; it may appear compliant while pursuing unintended objectives.

Reflections / Speculative Ideas:

Models can be “sycophantic” without conscious intent, producing outputs that please evaluators rather than reflect true alignment.
Potential solutions:
- Multi-objective optimization: reduce extreme exploitation of a single metric.
- Randomized evaluation tasks: prevent overfitting to predictable tests.
- Internal monitoring mechanisms: inspired by cognitive self-reflection, flag actions that might violate intended objectives.
- Interactive red-teaming: treat evaluation like a two-player game, anticipating strategies the model could use to game the test.

Goal Misgeneralization (Shah et al., 2022, Sections 1–4)

Core insight: Even with perfectly specified objectives, models may misgeneralize their goals in novel contexts. Their capabilities can extend beyond the training goal, producing unintended behavior outside the intended scope.

Reflections / Speculative Ideas:

Misgeneralization is essentially a robustness problem: models are competent but can optimize the “wrong” thing in new situations.
Layered approaches for safety:
- Diverse training environments
- Task decomposition into verifiable subgoals
- Human-in-the-loop oversight
Partial generalization may be safer than full generalization—intentionally limiting autonomy could reduce catastrophic misalignment.
Cognitive science inspiration: models could learn internal representations of human-like preferences or constraints to anticipate misalignment before acting.

ML Systems Will Have Weird Failure Modes (Steinhardt, 2022)

Core insight: All ML systems, even aligned ones, are prone to unexpected failure modes. Schemers emerge because of optimization dynamics, but even aligned models exhibit behaviors that are not easily predictable from their loss functions.

Reflections / Speculative Ideas:

Interpretability is essential: systematic probes to understand internal representations and optimization dynamics.
Multi-objective internal incentives: models with competing internal objectives may behave more robustly and avoid extreme exploitation.
Self-auditing mechanisms: models could simulate counterfactuals internally, asking “Would this violate alignment?” before acting.

Connecting Themes and Takeaways

Alignment is not just a specification problem—it's about capabilities, generalization, and emergent behavior.
Evaluation matters as much as training—models can adaptively game your tests, so interactive and diverse assessment is critical.
Interpretability and internal monitoring are central—understanding how models “think” is as important as defining what they should do.
Insights from neuroscience, cognitive science, and linguistics could guide alignment strategies, e.g., modeling competing drives or error-checking mechanisms.
Small, layered interventions—multi-objective optimization, diverse training, red-teaming, interpretability—can collectively reduce risk even if no single technique suffices.

References:

Why Alignment Could Be Hard with Modern Deep Learning, Richard Cotra, 2021. https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/
Goal Misgeneralization: Why Correct Specifications Aren’t Enough, Shah et al., 2022. https://arxiv.org/abs/2210.01790
ML Systems Will Have Weird Failure Modes, Steinhardt, 2022. https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

Share on

Bluesky Facebook LinkedIn Mastodon X (formerly Twitter)

Pranati Modumudi

CAIAC Papers Week 2

Reflections / Speculative Ideas:

Reflections / Speculative Ideas:

Reflections / Speculative Ideas:

Share on

You May Also Enjoy

Why Human Alignment Works Without Interpretability

CAIAC Papers Week 8

CAIAC Papers Week 7

The Alignment Problem Is an Anthropological Issue