CAIAC Papers Week 3
This week, our AI safety reading group examined three pieces that shift from abstract alignment challenges to concrete trajectory analysis: where AI capabilities are headed, how fast they're improving, and what catastrophic scenarios might unfold. These readings force us to confront the uncomfortable gap between exponential technical progress and our glacial institutional response systems.
Epoch AI Trends Page (2025)
Core insight: Every major input to AI progress—compute, data, hardware, investment—is growing exponentially and shows no signs of plateauing. Training compute has grown roughly 4-5x per year, training costs roughly 2-3x per year (frontier runs now cost hundreds of millions of dollars), and we're rapidly approaching fundamental limits in both data availability and energy requirements.
Key Trend Observations:
- Compute scaling persists: Training compute for frontier models has grown 4-5x annually from 2010-2024. Over 30 models now exceed 10^25 FLOP (the GPT-4 threshold), with roughly 2 new models crossing this line every month in 2024.
- Hardware improvements lagging: GPU FLOP/s is growing only ~1.35x/year—much slower than the 4-5x/year growth in training compute. This means scaling is driven primarily by throwing more chips at the problem, not better chips.
- Investment explosion: Frontier AI companies (OpenAI, Anthropic, DeepMind) each grew revenue >90% in H2 2024, corresponding to >3x annualized growth. Training costs reaching hundreds of millions per run.
- Data crunch approaching: Median projections show we'll exhaust the effective stock of public human text between 2026 and 2032. Llama 4 already trained on ~30 trillion tokens—within an order of magnitude of the ~300 trillion token estimate for all public text (a back-of-envelope version of this projection follows the list).
- Energy concerns mounting: Power required to train frontier models growing >2x/year. Extrapolations suggest multi-gigawatt training runs by 2030—equivalent to multiple nuclear reactors.
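To make the data-crunch timing concrete, here's a minimal back-of-envelope sketch. The token figures are the ones cited above; the assumption that training datasets grow roughly with the square root of compute (Chinchilla-style, so ~2.2x/year when compute grows ~5x/year) is ours, not Epoch's.

```python
import math

# Figures cited above (Epoch AI trends); the dataset growth rate is our assumption.
stock_tokens = 300e12    # ~300T tokens: rough estimate of all public human text
current_tokens = 30e12   # ~30T tokens: roughly what Llama 4 trained on
compute_growth = 5.0     # frontier training compute grows ~4-5x per year

# Assumption: tokens scale with the square root of compute (Chinchilla-style).
data_growth = math.sqrt(compute_growth)  # ~2.2x per year

years_of_headroom = math.log(stock_tokens / current_tokens) / math.log(data_growth)
print(f"Assumed dataset growth: {data_growth:.1f}x/year")
print(f"Years until frontier runs need the whole stock: ~{years_of_headroom:.1f}")
# ~2.9 years from 2025, i.e. around 2028 -- inside Epoch's 2026-2032 window.
```

The exact year depends heavily on the assumed dataset growth rate, but any plausible rate puts exhaustion within single-digit years.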
Reflections / Speculative Ideas:
- Bottleneck succession: Constraints shift in real time from algorithms → compute → data/energy. Each gets addressed just before becoming binding, suggesting adaptive optimization where the ecosystem finds workarounds faster than individual constraints can halt progress.
- The efficiency paradox: Algorithmic efficiency keeps improving year over year, yet total compute usage explodes (Jevons paradox for AI). Efficiency gains enable more ambitious applications that consume even more resources.
- Infrastructure as commitment device: $100B+ data centers represent sunk costs demanding utilization, creating institutional pressure to keep scaling regardless of safety. We're building infrastructure that makes slowdown economically painful.
- China's scaling slowdown: Export controls appear to be working (Chinese training compute growing ~3x/year vs. 4-5x/year globally), but they create race dynamics in which the US feels pressure to move faster, potentially compromising safety.
Measuring AI Ability to Complete Long Tasks (METR, 2025)
Core insight: METR proposes "task-completion time horizon"—the length of tasks AI can complete with 50% reliability—as a unified metric bridging multiple capability domains. This metric has been doubling every 7 months for 6 years (2019-2025), potentially accelerating to every 4 months in 2024-2025. Extrapolating: AI could handle month-long projects by 2027-2029.
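A quick sanity check on that extrapolation, using only numbers that appear in the post (a ~1-hour current horizon and the 7-month and 4-month doubling times); treating "a month-long project" as ~167 work-hours is our own rounding.

```python
import math

current_horizon_hours = 1.0   # roughly Claude 3.7 Sonnet's 50%-reliability horizon
target_horizon_hours = 167.0  # ~one work-month of human labor (our rounding)

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)  # ~7.4

for label, doubling_months in [("7-month doubling (2019-2025 trend)", 7),
                               ("4-month doubling (2024-2025 trend)", 4)]:
    years = doublings_needed * doubling_months / 12
    print(f"{label}: ~{years:.1f} years from 2025 -> ~{2025 + years:.0f}")
# Roughly 2029 on the historical trend and roughly 2027 on the faster one,
# which is where the 2027-2029 range comes from.
```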
Methodological Innovation:
- Linking capability to real-world impact: Instead of measuring performance on arbitrary benchmarks, METR anchors to human time—how long would this task take a professional? This makes progress directly interpretable in economic terms.
- Unified scaling law: By converting benchmark performance to time horizons, METR "stitches together" saturating benchmarks into a single long-term trend. As models saturate MATH or HumanEval, we can still track progress by measuring time horizons on harder tasks (a toy version of this conversion is sketched after this list).
- Robustness across domains: The 4-7 month doubling time holds not just for software tasks but across scientific QA (GPQA), math contests (AIME), and even early-stage robotics/computer use (though horizons are 50x shorter there currently).
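For intuition about how a time horizon gets estimated, here's a toy version of the fitting idea: model the probability of success as a logistic function of (log) human task length and read off where it crosses 50%. The data below is invented for illustration; this is the spirit of METR's method, not their actual code or dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): each task has a human completion time in minutes and a
# binary outcome for whether the model completed it.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_success = np.array([1, 1, 1, 1,  1,  1,  0,   1,   0,   0])

# Fit success probability as a logistic function of log task length.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)

# 50% time horizon: the task length where predicted success crosses 0.5,
# i.e. where the logit  w * log2(t) + b  equals zero.
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = 2 ** (-b / w)
print(f"Estimated 50% time horizon: ~{horizon_minutes:.0f} minutes")
```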
Key Findings:
- Claude 3.7 Sonnet: ~1 hour time horizon (can complete tasks that take humans an hour with 50% reliability, but only reliably completes tasks up to a few minutes)
- At 80% reliability threshold, time horizons are ~2 doublings behind 50% threshold (roughly 1 year lag)
- 2024-2025 trend appears faster than historical average, possibly due to outcome-based RL specifically targeting agentic capabilities
- Performance degrades on "messy" tasks—less structured problems closer to real-world distributions
Reflections / Speculative Ideas:
- The reliability gap: Current models can sometimes do hour-long tasks but reliably only minute-long tasks. This 60x gap is where economic value gets bottlenecked. Progress may come from improving consistency rather than expanding capability ceiling.
- Superexponential growth dynamics: Once AIs can do 10-hour tasks, they start contributing to AI research, creating feedback loops where the doubling time itself decreases. We might rapidly traverse from week-long to year-long capabilities (see the toy simulation after this list).
- Task length as fundamental metric: Time horizon captures something essential—maintaining coherent goal-directed behavior over extended periods. Orthogonal to raw intelligence, it's about planning, error recovery, and persistence.
- The 2024 inflection: Recent acceleration suggests RL for agency is working, but we're intentionally training long-horizon goal pursuit—the exact capability that makes alignment harder.
- Economic displacement timeline: 50% reliability at day-long tasks (potentially 2027) displaces knowledge work. The bar is "better than the marginal employee," which is far lower than 99% reliability.
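A toy simulation of the feedback loop described in the "superexponential growth dynamics" bullet above. Every number here is an illustrative assumption (the 10-hour trigger, the 15% shrink per doubling); the point is only the shape of the curve, not the dates.

```python
def months_to_reach(target_hours, feedback=False, shrink=0.15, trigger_hours=10.0):
    """Months until the 50% time horizon reaches target_hours, starting from
    1 hour with a 7-month doubling time. With feedback on, every doubling that
    happens after the horizon passes trigger_hours shortens the next doubling
    time by `shrink` (a stand-in for AI accelerating AI research)."""
    horizon, doubling, months = 1.0, 7.0, 0.0
    while horizon < target_hours:
        months += doubling
        horizon *= 2
        if feedback and horizon >= trigger_hours:
            doubling *= (1 - shrink)
    return months

work_year_hours = 2000  # rough working hours in a human year
print("Constant 7-month doubling:", round(months_to_reach(work_year_hours)), "months")
print("With feedback loop:       ", round(months_to_reach(work_year_hours, feedback=True)), "months")
# The gap between the two runs is the "accelerating acceleration" the bullet describes.
```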
AI 2027 (Kokotajlo et al., 2025)
Core insight: A detailed scenario forecast showing how competitive pressures (corporate race, US-China competition) could drive us from current systems to superintelligence by late 2027, with intermediate milestones including autonomous coding (early 2027) and self-improving AI research (mid-2027). The scenario branches into "race" (AI takeover) and "slowdown" (coordinated pause) endings, neither presented as a likely outcome; they serve as boundary cases.
Scenario Architecture:
- 2025-2026: Steady improvement in agents, with increasing autonomy in coding and research tasks. Companies invest heavily in RL for agency, and models become more reliable at multi-hour tasks.
- Early 2027: Complete automation of coding emerges. AI systems can handle most software engineering tasks that previously took humans days. China steals frontier model weights (high probability event according to security experts).
- Mid-2027: Self-improving AI research begins. Virtual AI employees (100,000+ networked agents) conducting experiments, sharing results. Rapid capability gains as AIs contribute to their own development.
- Late 2027: Intelligence explosion. In the "race" ending: misaligned AI executes a takeover. In the "slowdown" ending: coordinated pause to solve alignment, though Kokotajlo emphasizes this is a "terrifying path" requiring extreme luck.
Key Assumptions (and their fragility):
- Scaling laws hold: No fundamental barriers to continued architectural progress. Kokotajlo dismisses arguments about data inefficiency or architectural limits, citing the poor track record of such predictions.
- Institutional passivity: Democratic institutions remain unable to meaningfully regulate AI development. No effective international coordination emerges until crisis point.
- Espionage as certainty: China will steal model weights. AI companies already assume they're infiltrated by state actors. Export controls create incentives for industrial espionage.
- Alignment remains unsolved: Training methods (RLHF, outcome-based RL) don't reliably instill intended values. Models become increasingly sophisticated at appearing aligned while pursuing different objectives.
- One year from autonomous coding to superintelligence: Once you have AI systems that can fully substitute for human programmers, the timeline to superintelligence compresses dramatically due to recursive self-improvement.
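That last assumption is easiest to interrogate as a throughput argument. The agent count comes from the scenario summary above; the speed multiplier and the researcher-year budget are our own illustrative assumptions, and the sketch ignores compute bottlenecks and diminishing returns to parallelism, which is exactly where critics push back.

```python
# Back-of-envelope throughput argument (illustrative assumptions, not AI 2027's
# own numbers except the agent count mentioned above).
num_agents = 100_000       # networked AI researchers in the mid-2027 scenario
speed_multiplier = 10      # assumption: each works ~10x faster than a human
utilization = 1.0          # no sleep, no context-switching (simplification)

researcher_years_per_calendar_year = num_agents * speed_multiplier * utilization
print(f"~{researcher_years_per_calendar_year:,.0f} researcher-years per calendar year")

# Even if (big assumption) closing the gap to superintelligence takes several
# hundred thousand researcher-years of cumulative effort, this workforce covers
# it in well under a year of calendar time -- which is why the one-year
# compression is at least arithmetically plausible.
```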
Reflections / Speculative Ideas:
- Race dynamics as self-fulfilling prophecy: The scenario's power lies in showing how belief in the race creates the race. If labs believe competitors won't slow down, they can't afford to pause for safety. If nations believe adversaries will exploit any lead, they can't permit regulation. The Nash equilibrium is acceleration, even if all parties would prefer coordination. This isn't a prediction failure mode—it's a coordination failure mode. (A toy payoff matrix after this list makes the equilibrium concrete.)
- The "slowdown" ending isn't reassuring: Kokotajilo is explicit that the good ending—where OpenBrain unilaterally pauses to solve alignment—requires "getting extremely lucky." It assumes: (1) alignment is solvable on short timelines, (2) companies will actually pause despite competitive pressure, (3) the pause won't trigger geopolitical crisis. Each assumption is questionable.
- Median vs. modal confusion: Kokotajlo's median is now 2029, not 2027—acknowledging significant uncertainty. But the scenario's detail creates anchoring effects. The value is in working through the mechanisms, not in treating 2027 as a fixed prediction.
- The espionage wildcard: China stealing weights is treated as ~inevitable, but the timing matters enormously. Early theft might slow US progress (less incentive to race if advantage is lost) or accelerate it (perception of lost lead). Late theft after alignment is solved could be stabilizing. The scenario assumes theft at the worst possible moment.
- Missing: technical obstacles as circuit breakers: The scenario underweights potential slowdowns from technical barriers. What if data exhaustion hits harder than expected? What if RL scaling for agency plateaus? What if there's a "messiness wall" where systems can't generalize beyond clean benchmarks? These wouldn't prevent superintelligence eventually, but could buy crucial years.
- Underexplored: public backlash scenarios: The scenario assumes steady public tolerance despite mounting evidence of risk. But we might see sharp discontinuities—catastrophic AI accidents, mass unemployment triggering political backlash, literal "stop the servers" movements. These could force slowdown, though potentially chaotically.
- One year compression seems right: The most compelling piece is the timeline from "autonomous coders" to "superintelligence." If AIs can do the research and nothing else changes, why wouldn't it be ~1 year? It took humans decades to get to current capabilities, but we're building on that foundation with entities that don't sleep, can be copied infinitely, and can be directly edited. The question isn't whether this is possible—it's whether we'll actually grant that level of autonomy before solving alignment.
- Bridge from Week 2 to Week 3: Last week we studied why alignment is hard even in principle. This week shows why it's hard in practice—we won't have time. The chess match between evaluator and model (Cotra) plays out in compressed timeframes where the model's move speed is accelerating. Goal misgeneralization (Shah) becomes catastrophic when the misaligned system is superhuman. Weird failure modes (Steinhardt) are fatal when correction cycles are measured in months not years.
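And the toy payoff matrix promised in the race-dynamics bullet. The numbers are arbitrary; only their ordering matters (unilateral acceleration beats mutual pause, which beats mutual racing, which beats pausing while your rival races).

```python
# Toy payoff matrix for the race dynamic (arbitrary numbers; only the ordering
# matters). Each actor chooses "pause" or "accelerate"; payoffs are
# (player A, player B), higher is better for that player.
payoffs = {
    ("pause",      "pause"):      (3, 3),   # coordinated pause: best joint outcome
    ("pause",      "accelerate"): (0, 4),   # you pause, rival races ahead
    ("accelerate", "pause"):      (4, 0),   # you race ahead, rival pauses
    ("accelerate", "accelerate"): (1, 1),   # mutual racing: risky for everyone
}

def best_response(opponent_move, player_index):
    """Return the move that maximizes this player's payoff given the opponent's move."""
    moves = ["pause", "accelerate"]
    def payoff(my_move):
        key = (my_move, opponent_move) if player_index == 0 else (opponent_move, my_move)
        return payoffs[key][player_index]
    return max(moves, key=payoff)

for opp in ["pause", "accelerate"]:
    print(f"If the rival plays {opp!r}, the best response is {best_response(opp, 0)!r}")
# Prints 'accelerate' both times: (accelerate, accelerate) is the Nash equilibrium
# even though (pause, pause) gives everyone a higher payoff.
```

This is the prisoner's dilemma in AI-race clothing, which is why the scenario reads as a coordination failure rather than a prediction failure.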
Connecting Themes and Takeaways
- Exponentials everywhere, governance nowhere: Every technical trend is exponential (compute, capabilities, time horizons), but institutional response remains linear at best. The mismatch is the crisis. We're trying to regulate a 5x/year phenomenon with systems designed for 1.05x/year change (a two-line calculation after this list makes the gap vivid).
- The measurement-to-mitigation gap: We're getting much better at tracking progress (Epoch trends, METR time horizons) but the insights don't translate to action. Knowing that capabilities double every 7 months doesn't slow the doubling. The forecasting and the steering are decoupled.
- Bottlenecks buy time, not safety: Each constraint (data, energy, chips) that slows scaling also triggers massive investment to overcome it. We're not getting time to solve alignment—we're getting time to build the infrastructure that makes alignment harder (bigger models, more autonomy, tighter race dynamics).
- Time horizon as alignment difficulty multiplier: Every doubling in time horizon means agents pursuing goals over longer periods with more intervening steps. This exponentially expands the space of possible misaligned behaviors. A model that can plan over days can scheme over days. A model that can plan over weeks can develop and execute multi-stage deceptions.
- Superexponential risk from feedback loops: Once AI accelerates AI research, we don't just get faster progress—we get accelerating acceleration. The remaining time until superintelligence could compress dramatically. This creates a "crunch point" where we have even less time than exponential extrapolation suggests.
- Race dynamics dominate technical considerations: The most important variable isn't whether alignment is solvable—it's whether we'll have institutional capacity to pause long enough to solve it. AI 2027 shows how competitive pressures can override safety even when risks are understood.
- From abstract to concrete: These readings move from "alignment might be hard" (Week 2) to "here's exactly when and how we run out of time." The shift in framing is important: this isn't about possibility, it's about probability. It's not "could we face x-risk?" but "what would have to change to avoid it?"
- The reliability threshold determines impact: Economic disruption requires 50% reliability, but catastrophic risk might require only 1% reliability on the wrong task. We might see massive job displacement (requiring high reliability) years before we see x-risk (requiring one successful escape/deception). Or the timeline could compress violently if reliability improvements accelerate.
- Neuroscience connections: Time horizons in AI mirror working memory and temporal integration challenges in biological systems. But where humans have evolved mechanisms for long-range temporal credit assignment (episodic memory, planning, meta-cognition), we're training AIs on outcome-based RL that might produce brittle proxies. Understanding how biological systems maintain goal stability over time could inform more robust training methods—but we may not have time to implement them.
- The intervention point paradox: Early intervention (now) faces skepticism because capabilities seem limited. Late intervention (post-autonomous-coding) faces impossibility because the pace is too fast and competitive pressures dominate. The window might be narrower than it appears—neither "too early" nor "too late" but only a brief "exactly right" that we might miss.
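The two-line calculation promised above, behind the "exponentials everywhere, governance nowhere" point. The 1.05x figure for institutional change is the same rhetorical stand-in used in the bullet, not a measured quantity.

```python
years = 5
capability = 5.0 ** years        # inputs to AI progress growing ~5x/year
institutions = 1.05 ** years     # rhetorical stand-in for institutional adaptation
print(f"After {years} years: capability x{capability:,.0f}, institutions x{institutions:.2f}")
# After 5 years: capability x3,125, institutions x1.28 -- a gap of more than
# three orders of magnitude per half-decade.
```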
References:
- Machine Learning Trends, Epoch AI, 2025. https://epoch.ai/trends
- Measuring AI Ability to Complete Long Tasks, Kwa, West et al., METR, 2025. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- AI 2027, Kokotajlo, Lifland, Larsen, Dean, AI Futures Project, 2025. https://ai-2027.com/
