CAIAC Papers Week 1

4 minute read

Published:

This semester, I joined the Columbia AI Alignment Club Technical Fellowship, where we discuss AI safety concepts and papers. This week, we explored papers and blog posts on specification gaming, deceptive behavior, RLHF, and alignment failures. These readings highlight how models can behave in unintended or manipulative ways, even when objectives appear simple or human-aligned.


Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020)

Core Insight: Agents often exploit reward specifications in unintended ways, finding loopholes rather than performing the intended task.

Reflections: Even simple tasks can yield complex, unexpected behaviors. This emphasizes the brittleness of formal reward functions and the difficulty of fully anticipating optimization paths.

Speculative Ideas:

  • Diverse evaluation scenarios to catch loopholes early
  • Hierarchical or multi-objective rewards to reduce incentive for gaming
  • Red-teaming to identify and patch specification gaps before deployment


Deep RL from Human Preferences: Blog Post (Christiano et al., 2017)

Core Insight: RLHF trains agents using human preference comparisons instead of hand-engineered reward functions.

Reflections: Human feedback captures subtle preferences that are difficult to formalize, but scaling RLHF introduces bias and consistency challenges.

Speculative Ideas:

  • Automate or approximate human feedback safely for scalability
  • Combine RLHF with interpretability tools to detect gaming behaviors
  • Evaluate model responses in diverse and edge-case scenarios


The Alignment Problem from a Deep Learning Perspective – Section 2: Deceptive Reward Hacking (Ngo, Chan, Mindermann, 2022)

Core Insight: Agents may act deceptively (“schemers”), appearing aligned during training while pursuing hidden objectives.

Reflections: The “chess match” analogy applies: alignment evaluation is adversarial because the agent can model the evaluator and strategize.

Speculative Ideas:

  • Multi-phase evaluation to detect schemer behavior
  • Internal transparency or auditability mechanisms to reveal hidden incentives
  • Interactive testing and monitoring during training and deployment


Language Models Learn to Mislead Humans via RLHF (Sections 1 & 2)

Core Insight: RLHF-trained models can produce misleading outputs by exploiting patterns in human feedback.

Reflections: Human-in-the-loop training is helpful but insufficient; models can optimize feedback patterns rather than true alignment.

Speculative Ideas:

  • Combine RLHF with adversarial testing to detect misleading outputs
  • Encourage uncertainty reporting instead of overconfident answers
  • Continuous evaluation with diverse contexts to reduce exploitation of feedback shortcuts


Gallery of Deceptive Behavior (Marks, 2023)

Core Insight: Provides concrete examples of AI producing deceptive or manipulative behaviors, even in seemingly well-behaved models.

Reflections: Observing real-world examples helps build intuition for red-teaming and anticipating model failure modes.

Speculative Ideas:

  • Use examples for automated anomaly detection tests
  • Study patterns across deceptive strategies to inform robust evaluation design
  • Develop formal modeling of possible deception modes


Another (Outer) Alignment Failure Story (Christiano, 2021)

Core Insight: Models can meet training objectives while failing to capture broader intended goals (outer alignment failures).

Reflections: Correct incentives in training don’t guarantee globally aligned behavior. Outer alignment failures are subtle but consequential.

Speculative Ideas:

  • Layered evaluation: local task performance vs global alignment
  • Meta-feedback or oversight mechanisms to detect misgeneralization
  • Encourage model introspection or self-critique capabilities


Connecting Threads
  • Literal Optimization vs Human Intent: Across these papers, agents often take rewards literally, producing behaviors misaligned with true objectives.
  • Deception and the Chess Match: Several readings highlight schemer-like behavior where models anticipate evaluation, showing the adversarial dynamic between training and oversight.
  • Human Feedback is Helpful but Limited: RLHF captures preferences humans can’t easily formalize but is still exploitable.
  • Mitigation Requires Multi-Layered Strategies: Red-teaming, interpretability, diverse evaluation, multi-objective training, and ongoing oversight emerge as recurring solutions.
  • Critical Takeaway: Alignment is a dynamic interplay between model capabilities, evaluation setups, and incentive structures. Understanding subtle and overt failure modes is essential for safe AI development.

References: