CAIAC Papers Week 6
This week addresses the scalable oversight problem: how do humans supervise AI systems smarter than themselves? Four complementary approaches emerge—AI-assisted feedback, debate, weak-to-strong generalization, and combinations thereof.
Core Readings Summary
AI-assisted Human Feedback (Leike, 2022)
Core insight: Use AI to help humans evaluate AI outputs. The model explains reasoning, flags issues, answers questions. This scales human judgment but creates recursion—using untrusted AI to evaluate untrusted AI. Requires trust in the assistant or ways to verify its claims.
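A minimal sketch of this loop, with hypothetical stand-ins (`critique`, `human_verdict`) for the assistant model and the human evaluator; the point is only that the human judges the output with the AI critique in hand rather than unaided:

```python
from dataclasses import dataclass

@dataclass
class AssistedReview:
    output: str
    critique: str
    verdict: str

def critique(output: str) -> str:
    """Hypothetical assistant model: explains reasoning and flags issues in the output."""
    flagged = [line for line in output.splitlines() if "citation needed" in line.lower()]
    return f"{len(flagged)} unsupported claim(s) flagged." if flagged else "No obvious issues found."

def human_verdict(output: str, critique_text: str) -> str:
    """Stand-in for the human evaluator, who now judges with the critique in hand."""
    return "reject" if "unsupported" in critique_text else "accept"

def assisted_feedback(output: str) -> AssistedReview:
    # The recursion Leike notes: `critique` is itself an untrusted model, so the
    # human must either trust it or have some way to verify its claims.
    c = critique(output)
    return AssistedReview(output=output, critique=c, verdict=human_verdict(output, c))

print(assisted_feedback("The result follows trivially. [citation needed]"))
```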
AI Safety via Debate (Irving et al., 2018)
Core insight: Two AIs argue opposing answers; human judges winner. In theory, truth is easier to defend than falsehood—honest debater should win. Complexity theory analogy: debate can solve PSPACE problems with polynomial-time judges (vs. NP for direct supervision). This could enable supervising superhuman reasoning by decomposing into human-judgeable claims.
Key mechanism: If the dishonest debater claims X, the honest debater can challenge with "actually not-X because Y." This recurses until reaching a claim simple enough for a human to judge directly. The honest strategy wins if it can always find such decompositions.
Critical assumptions:
- Honesty is easier than deception in debate format
- Humans can judge simple factual claims reliably
- Complex truths decompose into human-verifiable pieces
- Debaters can't coordinate to fool judge
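A minimal sketch of that recursive decomposition, with hypothetical stubs (`challenge`, `human_can_judge`) for the debaters and the judge; the honest strategy wins exactly when it can keep splitting disputed claims until one falls within the judge's competence:

```python
def human_can_judge(claim: str) -> bool:
    """Hypothetical competence check: the judge only evaluates short, atomic claims."""
    return len(claim.split()) <= 6

def human_judgement(claim: str) -> bool:
    """Stand-in for the judge's verdict on an atomic claim."""
    return not claim.startswith("not ")

def challenge(claim: str) -> list[str]:
    """Hypothetical debater move: split a disputed claim into sub-claims.
    In the real protocol both debaters propose and attack decompositions;
    a single split stands in for that exchange here."""
    return [part.strip() for part in claim.split(" and ")]

def debate(claim: str, max_depth: int = 5) -> bool:
    """Recurse on disputed sub-claims until one is simple enough to judge directly."""
    if max_depth == 0 or human_can_judge(claim):
        return human_judgement(claim)
    # The claim stands only if every sub-claim survives the recursive debate.
    return all(debate(sub, max_depth - 1) for sub in challenge(claim))

print(debate("the lemma holds and the proof uses it correctly and not the counterexample applies"))  # False
```

The `max_depth` cap is the part that matters for the critical assumptions above: if the honest side runs out of decompositions before reaching a human-judgeable leaf, the judge is left evaluating something beyond their competence.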
Debating with More Persuasive LLMs (Khan et al., 2024)
Empirical validation: Tests debate on QuALITY reading comprehension. Finds that more persuasive debaters improve judge accuracy, and debaters arguing for the correct answer outperform those forced to argue for incorrect ones. But the benefits are modest (~5-10% improvement), and it is unclear whether this scales to truly superhuman domains.
Weak-to-Strong Generalization (Burns et al., 2023)
Core insight: Strong models trained on weak supervision often exceed weak supervisor performance. GPT-4 finetuned on GPT-2 labels achieves ~GPT-3.5 performance. This suggests strong models don't just imitate weak supervisors—they generalize beyond the provided labels to underlying intent.
Why this works: Pretraining gives models latent capabilities. Weak supervision elicits what model already knows rather than teaching new skills. Strong models can "see through" weak supervisor errors and learn the target concept.
Performance Gap Recovered (PGR): Metric measuring how much of the gap between the weak supervisor's performance and the strong model's ground-truth ceiling is recovered. Typical results: 20-60% recovery depending on domain and model gap.
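Written out (with illustrative numbers, not figures from the paper), PGR is the fraction of the weak-to-ceiling gap that the weakly supervised strong model closes:

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """PGR = (weak-to-strong perf - weak perf) / (strong-ceiling perf - weak perf).
    1.0: the strong model fully recovers its ground-truth-trained performance;
    0.0: it merely imitates its weak supervisor."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Illustrative numbers only: weak supervisor 60% accurate, strong model trained
# on weak labels 72%, strong model trained on ground truth 80%.
print(performance_gap_recovered(0.60, 0.72, 0.80))  # ~0.6, i.e. 60% of the gap recovered
```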
Combining W2SG with Scalable Oversight (Leike, 2023)
Integration: Use weak models in debate/recursive reward modeling to generate training signal for strong models. Stack these techniques: weak→intermediate→strong supervision chains. This theoretically scales to arbitrary capability gaps by composing bounded improvements.
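A minimal sketch of the chaining idea, with a hypothetical `train_on_labels` standing in for a finetuning step: each successively stronger model is trained on labels produced by the previous one, composing several bounded weak-to-strong jumps instead of attempting one large jump.

```python
from typing import Callable, List

# A "model" here is just a labeling function text -> label, a hypothetical
# stand-in for a finetuned LLM; no claim that real W2SG training looks like this.
Model = Callable[[str], int]

def train_on_labels(base: Model, data: List[str], supervisor: Model) -> Model:
    """Sketch of finetuning `base` on labels produced by `supervisor`:
    memorize the supervisor's labels on the training set, fall back to the
    base model's own prior everywhere else."""
    memo = {x: supervisor(x) for x in data}
    return lambda x: memo.get(x, base(x))

def weak_to_strong_chain(models: List[Model], data: List[str]) -> Model:
    """weak -> intermediate -> ... -> strong: each stronger model is trained
    on the previous (weaker) supervisor's labels."""
    supervisor = models[0]  # the weakest model anchors the chain
    for stronger in models[1:]:
        supervisor = train_on_labels(stronger, data, supervisor)
    return supervisor

# Toy usage with three "models" of increasing quality (labels are arbitrary).
weak = lambda x: len(x) % 2
intermediate = lambda x: (len(x) + x.count(" ")) % 2
strong = lambda x: len(x.replace(" ", "")) % 2
chained = weak_to_strong_chain([weak, intermediate, strong], ["a b", "abc", "ab cd"])
print(chained("a b"), chained("unseen input"))
```

The fall-back in `train_on_labels` is where the interesting question lives: off the supervised distribution, the strong model's own generalization takes over, for better or worse.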
Reflections / Speculative Ideas:
- The recursive trust problem: All approaches use AI to evaluate AI. At top level, we must trust something—either a weak model, a human judge, or the debate mechanism itself. There's no escaping this circularity for superhuman domains.
- Debate as adversarial interpretability: Forces models to externalize reasoning in human-legible form to win debates. This connects oversight to interpretability—both try to make cognition understandable. But models might learn to give convincing-sounding but wrong explanations.
- The obfuscated arguments problem: Dishonest debater could make arguments that sound compelling to humans but are actually wrong in subtle ways. If humans can't detect the flaw, dishonest strategy wins. This gets worse as capability gap widens.
- Weak-to-strong as double-edged sword: Optimistic interpretation: strong models help with their own alignment by generalizing beyond weak labels. Pessimistic: strong models easily deceive weak evaluators while appearing aligned. The same mechanism enables both helpful generalization and sophisticated deception.
- Decomposition limits: All techniques assume complex questions decompose into simpler ones. But some truths might be irreducibly complex—understanding requires seeing the whole picture, not just verifying pieces. Where decomposition fails, oversight fails.
- Debate equilibrium instability: For debate to work, honest strategy must be stronger than dishonest. But what if dishonest debater can make claims that are technically unfalsifiable or require superhuman reasoning to refute? The equilibrium breaks.
- Judge incompetence scales with capability: As debaters get smarter, they can make arguments beyond human comprehension. Eventually every claim in debate tree exceeds human judgment capacity. At that point, humans are just choosing which AI sounds more confident.
- W2SG might fail catastrophically: Paper shows strong models usually exceed weak supervision. But "usually" isn't "always"—worst case is strong model that perfectly imitates weak errors AND hides its full capability. Current methods don't distinguish genuine limitation from sandbagging.
- Combination isn't multiplication: Using multiple techniques doesn't give multiplicative safety—if each has correlated failure modes (e.g., all fail when model deliberately deceives), combining doesn't help much. Need genuinely independent safety layers.
- The automation paradox: These techniques aim to reduce human supervision needed. But as automation increases, humans have less practice evaluating AI work, making them worse supervisors when spot-checks are needed. Our oversight capability atrophies.
- Connection to neuroscience: Weak-to-strong mirrors how animals learn from limited examples—strong inductive biases + few demonstrations → general capability. Understanding biological few-shot learning might inform better weak-to-strong techniques. Or reveal fundamental limits we haven't hit yet.
Connecting Themes and Takeaways
- Supervision is unavoidably recursive: Can't supervise superhuman AI without using AI assistance. Every approach either uses weak AI to help (W2SG, assisted feedback) or AI-vs-AI competition (debate). Bottom layer must be human judgment, which limits ultimate capability.
- Decomposition is load-bearing assumption: Debate, amplification, recursive reward modeling all assume complex tasks break down into simple verifiable pieces. If this fails—if some reasoning is irreducibly holistic—these approaches break.
- Empirical validation is encouraging but limited: Debate works on reading comprehension, W2SG works on NLP tasks. But these are far from superhuman domains. Scaling properties unknown and concerning.
- Trade-off between automation and verifiability: More automation means less human oversight. But human oversight is the trust anchor. As we automate evaluation, we must trust AI evaluators, which is the original problem. There's no escape from this dilemma.
- Defense-in-depth strategy: No single approach sufficient. Debate + W2SG + interpretability + control might work where each alone fails. But correlated failures remain risk.
- Bridge to Week 5 (Control): Oversight assumes potential honest mistakes; control assumes active adversaries. But techniques overlap—untrusted monitoring is scalable oversight applied to adversarial setting.
- Bridge to Week 7 (Evals): All oversight techniques need evaluation methodology to verify they work. Red teaming tests whether supervision catches dangerous capabilities.
References:
- Why I'm excited about AI-assisted human feedback, Leike, J., 2022.
- AI Safety via Debate, Irving, G., Christiano, P., & Amodei, D., 2018. arXiv:1805.00899
- Debating with More Persuasive LLMs Leads to More Truthful Answers, Khan, A., et al., 2024. arXiv:2402.06782
- Weak-to-Strong Generalization, Burns, C., et al. (OpenAI), 2023. arXiv:2312.09390
- Combining Weak-to-Strong Generalization with Scalable Oversight, Leike, J., 2023.
