CAIAC Papers Week 8
The final week steps back from technical details to address broader strategic questions: How fast is AI really progressing? What does this mean for alignment timelines? And how can individuals contribute to solving this problem?
AI as Normal Technology (Narayanan & Kapoor, 2025)
Core thesis: AI progress might be more incremental and constrained than exponential forecasts suggest. Many headline capabilities are overhyped, real-world deployment is slow, and fundamental barriers (data scarcity, reliability requirements, task complexity) may cap progress well below AGI for decades. This contrarian view challenges dominant narratives in the AI safety community.
Key Arguments:
- Benchmark saturation ≠ real capability:
- Models quickly saturate tests (95%+ on MMLU, near-human on many benchmarks)
- But real-world performance lags far behind
- Example: Models ace medical licensing exams but can't safely diagnose patients without supervision
- Gap between test performance and deployment readiness is huge and closing slowly
- Deployment requires reliability orders of magnitude beyond current systems:
- Critical applications need 99.99%+ reliability
- Current models achieve maybe 90-95% on real tasks
- Getting from 90% to 99.99% is exponentially harder than getting from 0% to 90%
- This "long tail" of reliability improvement could take decades
- Economic value concentrates in narrow applications:
- Most AI value comes from specialized tools (code completion, customer service, content generation)
- Market doesn't necessarily push toward general intelligence
- Companies optimize for revenue, not AGI—might plateau at "good enough" narrow systems
- Economic incentives might not drive continued scaling beyond certain point
- Historical parallels suggest slow transformation:
- Nuclear power, nanotechnology, quantum computing—all transformative but over decades
- Initial hype cycles followed by gradual, incremental progress
- AI likely follows similar pattern: revolutionary impact but measured in decades, not years
- Deployment constrained by institutions, regulation, social acceptance
- Data and compute constraints are real:
- Running out of high-quality text data (see Week 3: Epoch trends)
- Synthetic data has quality issues and potential for model collapse
- Compute scaling faces physical limits (energy, chip manufacturing, cooling)
- Algorithmic improvements might not compensate indefinitely
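A toy calculation (mine, not from the paper) for the reliability bullet above. It assumes a task decomposes into independent steps with identical per-step reliability, which is a simplification, but it shows why "mostly reliable" models still fail often on multi-step work:

```python
# Toy illustration, not from the paper: per-step reliability compounds
# over multi-step tasks (assumes independent steps, a simplification).

def task_success_rate(per_step_reliability: float, n_steps: int) -> float:
    """Probability that every step of an n-step task succeeds."""
    return per_step_reliability ** n_steps

for p in (0.90, 0.95, 0.99, 0.9999):
    print(f"per-step {p:.4f}: 20-step task succeeds "
          f"{task_success_rate(p, 20):.1%} of the time")

# per-step 0.9000: 20-step task succeeds 12.2% of the time
# per-step 0.9500: 20-step task succeeds 35.8% of the time
# per-step 0.9900: 20-step task succeeds 81.8% of the time
# per-step 0.9999: 20-step task succeeds 99.8% of the time
```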
Implications for AI Safety:
- If timelines are long (10-30 years to AGI):
- ✅ More time to solve alignment before critical systems
- ✅ Can do careful, iterative research rather than rushing
- ✅ Gradual deployment allows learning from mistakes at lower stakes
- ✅ Institutions have time to adapt, regulation can catch up
- ✅ Can train multiple generations of alignment researchers
- But long timelines have downsides:
- ❌ Less urgency means less funding, less talent attraction
- ❌ Alignment research might fade as crisis seems distant
- ❌ Could be lulled into complacency then face sudden breakthrough
- ❌ A long lull followed by sudden acceleration (fast takeoff from a high base) might be the worst case
- Strategic implications:
- If Narayanan right: invest in institutions, education, robust governance
- If short timelines (Kokotajlo): focus on near-term deployment of known techniques
- Uncertainty dominates: need portfolio approach covering both scenarios
Reflections / Speculative Ideas:
- The inference gap: Narayanan focuses on training scaling but underweights inference-time compute. We might hit walls on training while test-time compute continues scaling (see: o1, o3 models). This could give long training plateaus but continued capability growth via inference scaling.
- Deployment vs. capability distinction: Authors are right that deployment is slow, but alignment problem is about capability not deployment. Even if AGI takes 20 years to deploy widely, we face x-risk as soon as it exists. Slow adoption doesn't help if first AGI is misaligned.
- The "normal technology" framing might be anchoring bias: Every previous technology was normal because none achieved general intelligence. AGI by definition breaks the pattern. Analogizing to nuclear/nano/quantum might be category error—those didn't have recursive self-improvement potential.
- Reliability requirements are asymmetric: Safety-critical apps need 99.99% reliability, but takeover attempts only need one success. Misaligned AGI doesn't need to be reliable at helping humans—just at achieving its own goals. Different optimization target (see the sketch after this list).
- Economic incentives update: The paper was written before the AI revenue explosion of late 2024-2025. ChatGPT hit 100M users in 2 months. That level of traction creates massive economic pressure to scale capabilities. The market might push toward AGI despite risks.
- Base rate outside view vs. inside view: Narayanan uses outside view (most tech takes decades). But inside view (current scaling trends, METR time horizons, Epoch data) suggests faster progress. Which reference class is right? Genuinely uncertain.
- Institutional adaptation is the bottleneck: Even if capabilities scale slowly, institutions adapt even more slowly. We might hit dangerous capabilities before governance catches up regardless of timeline. Slow progress doesn't guarantee safe progress.
- The hedging strategy: Can't know if timelines short or long, so must prepare for both. This argues for: (1) near-term safety measures (control, evals) AND (2) long-term research (theory, interpretability). Not either/or.
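A companion toy calculation (again mine, not from the readings) for the asymmetry point above: a monitor must catch every attempt, while an attacker only needs one to slip through. The 99% catch rate and attempt counts are made-up numbers for illustration:

```python
# Toy asymmetry sketch, not from the readings: a monitor that catches each
# attempt with probability p_catch still lets at least one of n independent
# attempts through with probability 1 - p_catch**n.

def prob_at_least_one_slip(p_catch: float, n_attempts: int) -> float:
    """Probability the monitor misses at least one of n independent attempts."""
    return 1 - p_catch ** n_attempts

for n in (1, 10, 100, 1000):
    print(f"p_catch=0.99, {n:4d} attempts: "
          f"P(at least one slips through) = {prob_at_least_one_slip(0.99, n):.1%}")

# 1 attempt: 1.0%, 10 attempts: 9.6%, 100 attempts: 63.4%, 1000 attempts: ~100%
```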
Opportunities in Alignment (Bashansky, 2024)
Core message: Multiple pathways to contribute to AI safety depending on background, skills, and comparative advantage. Not everyone should do technical research. The field needs diverse skillsets and approaches.
Career Pathways:
- Technical Research:
- Interpretability (mechanistic, representation engineering, SAEs)
- Control (protocol design, red teaming methodology, evaluation)
- Scalable oversight (debate, weak-to-strong, recursive approaches)
- Theory (formal verification, game theory, learning theory)
- Best for: Strong ML/CS background, research aptitude, can work on open-ended problems
- Engineering & Implementation:
- Evaluation infrastructure (automated testing, monitoring systems)
- Safety tools (interpretability dashboards, circuit discovery automation)
- Deployment systems (production control protocols, monitoring at scale)
- Benchmarks (creating robust eval suites, maintaining leaderboards)
- Best for: Strong software engineering, systems design, making research practical
- Governance & Policy:
- Standards development (safety requirements, testing protocols)
- Risk assessment frameworks (what counts as dangerous, how to measure)
- International coordination (AI treaties, information sharing)
- Corporate governance (board advisory, internal safety advocacy)
- Best for: Policy background, communication skills, understanding both technical and political constraints
- Field-Building:
- Education (courses, workshops, making technical work accessible)
- Outreach (communicating risks, building public awareness)
- Funding (grantmaking, evaluating proposals, strategic allocation)
- Talent pipeline (recruiting, mentoring, career advising)
- Best for: Teaching ability, community building, strategic thinking
Strategic Considerations:
- Comparative advantage matters:
- Don't just do "what seems most important"—do what you're uniquely good at
- Field needs diversity: brilliant researchers AND great communicators AND policy experts
- Marginal impact of the Nth interpretability researcher might be less than that of the 1st policy person
- Portfolio approach to uncertainty:
- We don't know which technical approach works (interpretability? control? oversight?)
- We don't know if technical or governance solutions more important
- Community should fund and pursue multiple paradigms simultaneously
- Individuals should choose based on fit, not trying to pick "the" solution
- Theory of change clarity:
- How exactly does your work reduce x-risk?
- Can you articulate causal path from your research/work to safety?
- If working on interpretability: how does understanding → safety?
- If working on policy: how do regulations → actual safety measures?
- Vague theories of change suggest unfocused work
- Timeline sensitivity:
- If very short timelines (2-3 years): focus on deployment of known techniques, practical implementation
- If medium (5-10 years): mix of research and implementation
- If long (10-20+ years): fundamental research, institution building viable
- Should update career plans based on timeline estimates (a toy expected-value sketch follows this list)
- Skill development vs. immediate impact:
- Fresh graduates: invest in skill building even if not immediately impactful
- Experienced professionals: can make immediate contributions but might lack specific AI safety knowledge
- Career capital matters—sometimes best move is building skills/credentials for 2-3 years
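A toy expected-value sketch for the timeline-sensitivity and portfolio points above. The scenario probabilities and "impact" scores are made-up placeholders (not from Bashansky); the only point is the shape of the tradeoff:

```python
# Toy portfolio-under-uncertainty sketch. All probabilities and "impact"
# scores are illustrative placeholders, not estimates from the readings.

timeline_probs = {"short (2-3y)": 0.2, "medium (5-10y)": 0.5, "long (10-20y+)": 0.3}

# Rough relative impact of each career strategy under each timeline scenario.
impact = {
    "deploy known techniques": {"short (2-3y)": 10,  "medium (5-10y)": 4,   "long (10-20y+)": 1},
    "fundamental research":    {"short (2-3y)": 1,   "medium (5-10y)": 5,   "long (10-20y+)": 10},
    "50/50 portfolio":         {"short (2-3y)": 5.5, "medium (5-10y)": 4.5, "long (10-20y+)": 5.5},
}

for strategy, payoffs in impact.items():
    expected = sum(timeline_probs[t] * payoffs[t] for t in timeline_probs)
    worst = min(payoffs.values())
    print(f"{strategy:>24}: expected impact = {expected:.2f}, worst case = {worst}")

# The numbers are made up; the shape is the point: the mixed portfolio gives up
# some expected impact relative to the best pure bet, but avoids the near-zero
# worst case of either pure strategy.
```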
Reflections / Speculative Ideas:
- The "too late to matter" concern: If timelines very short (AI 2027 scenario), newcomers might not develop expertise before critical decisions made. This argues for either (1) immediate pivot if you have transferable skills, or (2) focus on non-technical contributions where ramp-up faster.
- Bottleneck analysis: What's the current limiting factor for field? If it's ideas, do research. If it's implementation, do engineering. If it's coordination, do policy. Bottleneck shifts over time—staying effective requires updating which pathway most valuable.
- Information value of career experiments: Trying different roles early career has option value. Might discover unexpected comparative advantage. But also costs in terms of specialization. Tradeoff between exploration and exploitation in career choice.
- The escalator problem: Some career paths have clear escalators (PhD → postdoc → faculty, or engineer → senior → staff). Others more ambiguous. AI safety has unclear career progression, which makes planning hard but also creates opportunity for entrepreneurial role creation.
- Geographic considerations: Most alignment research concentrated in SF Bay Area, London, some in Boston. Remote work helps but many opportunities require location. Should factor into career planning.
- Credibility and access: Some work requires access to frontier models or inside knowledge of labs. Building credibility through research, or working at labs, gates certain pathways. Not everyone can do model-organism-of-misalignment research without lab affiliation.
- The community growth dilemma: Field needs to grow to match problem scale. But rapid growth dilutes quality, creates communication overhead, and might attract people with poor fit. How to scale carefully?
- Alternative high-impact paths: Earning to give (high-paying job, donate to alignment research), movement building (increase number of talented people entering field), or working on AI capabilities if it helps with safety (contentious—could accelerate race). Each has theory of change but also risks.
Synthesis: Reflections Across All 8 Weeks
The Arc of the Course:
- Weeks 1-2 (Why alignment is hard): Established fundamental challenges—models can be misaligned despite appearing aligned, goal misgeneralization is robust, systems have weird failure modes.
- Week 3 (Current trajectory): Showed we're running out of time—exponential capability growth, compressed timelines, race dynamics dominating.
- Week 4 (Interpretability): Revealed deep obstacle to understanding—superposition means we can't just "look at neurons" to verify alignment.
- Week 5 (Control): Pragmatic pivot—if we can't ensure alignment, can we at least prevent takeover?
- Week 6 (Oversight): Techniques for supervising superhuman systems, all involving recursion and accepting imperfect solutions.
- Week 7 (Evaluations): How we actually test whether safety measures work, discovering they're often brittle.
- Week 8 (Strategy): Stepping back to ask whether the dominant narrative is right and how individuals can contribute.
Key Insights Across Weeks:
- No silver bullets exist: Every approach has limitations. Interpretability doesn't scale. Control fails for sufficiently capable systems. Oversight requires recursion. Evaluations can be gamed. Need defense-in-depth with multiple independent layers.
- Evaluation is the universal bottleneck: Control needs red team evaluation. Oversight needs capability evaluation. Research needs impact evaluation. Career choices need field-state evaluation. Everything depends on measuring what we can't fully understand.
- The capability-safety scissors: All safety techniques work better on weaker models. As capability increases, control becomes harder, oversight needs more powerful assistants (introducing new risks), interpretability faces more superposition, evaluations become less trustworthy. Safety gets harder exactly when it matters most.
- Recursion is unavoidable: Supervising superhuman AI requires using AI. Evaluating AI requires using AI for red teaming. Understanding AI representations requires using AI for automated interpretability. We cannot escape using systems we don't fully trust to verify systems we don't fully trust.
- Timeline uncertainty dominates strategy: Different assumptions about timelines (Narayanan's decades vs. Kokotajlo's years) imply radically different priorities. Can't resolve this empirically in advance. Must act under deep uncertainty with portfolio strategies.
- Coordination as important as technical work: Many solutions require industry-wide adoption. Competitive pressures push toward less safe deployment without regulation. Technical solutions insufficient without governance.
- The field is maturing: Shift from "maybe we can solve this theoretically" to "here are partial solutions we can deploy now." More empirical, more pragmatic, more engineering-focused. This is either encouraging (real progress) or concerning (lowering ambitions).
What We Still Don't Know:
- Can interpretability scale to production models, or does it hit fundamental limits?
- What capability level makes control impossible?
- Do oversight techniques work for truly superhuman domains or only near-human?
- Can we detect sandbagging or will scheming models always evade evaluation?
- Are timelines 2-3 years, 10-20 years, or 50+ years?
- Which technical approach(es) actually work?
- Can coordination solve the race dynamics or are we locked into competition?
Personal Strategic Takeaways:
- Comparative advantage: Bridge between neuroscience and AI interpretability. EEG research experience relevant to understanding how biological systems represent information despite superposition-like constraints.
- Research connections: Continual learning + Test-Time Training connects to interpretability (how do models update representations?) and control (can we verify learning is safe?).
- Timeline considerations: 2027 graduation coincides with a potentially critical period. Balance skill building (research) with deployment readiness (practical ML engineering).
References:
- AI as Normal Technology, Narayanan, A. & Kapoor, S., 2025.
- Opportunities in alignment, Bashansky, M., 2024.
