Posts by Tags

AI

Why Human Alignment Works Without Interpretability

8 minute read

Published:

A striking assumption underlies much of contemporary AI alignment research: alignment requires interpretability. If we are to ensure that artificial systems act in accordance with human values, we must be able to inspect their internal representations, understand their objectives, and predict their decisions. Opacity, in this view, is dangerous. Alignment without transparency is treated as incoherent. Human alignment presents a problem for this assumption.

The Alignment Problem Is an Anthropological Issue

6 minute read

Published:

When AI researchers talk about aligning machines with “human values,” the phrase sounds deceptively simple. It implies a shared moral universe, a single standard of right and wrong. But there is no single moral universe. Across cultures, intelligence, agency, and control are understood in vastly different ways. Western traditions often equate intelligence with abstract reasoning, analytical problem-solving, and autonomous decision-making. In many collectivist societies, intelligence is relational: it involves the ability to act with discernment in social contexts, maintain harmony, and fulfill moral obligations. Indigenous cosmologies may distribute agency across humans, ancestors, and the environment, with control being relational rather than individual.

Intelligence as Dynamic Balance - From Evolution to the AI-Culture Fork

10 minute read

Published:

Your brain just performed an extraordinary feat. As your eyes moved across these words, millions of neurons fired in precise patterns, transforming chaotic photons into meaning. But here’s what’s remarkable. Your visual system didn’t just recognize familiar letter shapes. It was simultaneously ready to make sense of fonts you’ve never seen, handwriting styles that would baffle it, and even words that don’t quite look right. This is the fundamental tension that defines intelligence across every domain: the balance between finding useful patterns and breaking free from them when circumstances change. This same dynamic, what I call the pattern-finding/pattern-breaking tension, operates across scales from evolutionary deep time to the millisecond decisions your neurons make right now. It is not just a curious parallel. These systems are nested within each other, each building on the solutions developed by the previous level, each facing its own unique constraints while grappling with the same core challenge.

AI safety

Why Human Alignment Works Without Interpretability

8 minute read

Published:

A striking assumption underlies much of contemporary AI alignment research: alignment requires interpretability. If we are to ensure that artificial systems act in accordance with human values, we must be able to inspect their internal representations, understand their objectives, and predict their decisions. Opacity, in this view, is dangerous. Alignment without transparency is treated as incoherent. Human alignment presents a problem for this assumption.

CAIAC Papers Week 8

12 minute read

Published:

The final week steps back from technical details to address broader strategic questions: How fast is AI really progressing? What does this mean for alignment timelines? And how can individuals contribute to solving this problem?

CAIAC Papers Week 7

9 minute read

Published:

This week focuses on how we test AI systems for dangerous capabilities before deployment. Red teaming and evaluations aim to characterize what models can and can't do—especially for tasks we'd prefer they couldn't do.

The Alignment Problem Is an Anthropological Issue

6 minute read

Published:

When AI researchers talk about aligning machines with “human values,” the phrase sounds deceptively simple. It implies a shared moral universe, a single standard of right and wrong. But there is no single moral universe. Across cultures, intelligence, agency, and control are understood in vastly different ways. Western traditions often equate intelligence with abstract reasoning, analytical problem-solving, and autonomous decision-making. In many collectivist societies, intelligence is relational: it involves the ability to act with discernment in social contexts, maintain harmony, and fulfill moral obligations. Indigenous cosmologies may distribute agency across humans, ancestors, and the environment, with control being relational rather than individual.

CAIAC Papers Week 6

6 minute read

Published:

This week addresses the scalable oversight problem: how do humans supervise AI systems smarter than they are? Four complementary approaches emerge—AI-assisted feedback, debate, weak-to-strong generalization, and combinations thereof.

CAIAC Papers Week 5

8 minute read

Published:

This week introduces the Control agenda: a pragmatic approach to AI safety that assumes models might be actively trying to subvert our safety measures. Rather than ensuring AIs want to be safe (alignment), control ensures they can't cause catastrophes even if they're scheming against us.

CAIAC Papers Week 4

11 minute read

Published:

This week shifts from trajectory forecasting to the technical challenge of understanding what's actually happening inside neural networks. Mechanistic interpretability promises to crack open the black box—to understand not just what models do, but how they do it. The two papers we examined reveal both the fundamental obstacle (superposition) and a promising technique for overcoming it (sparse autoencoders for dictionary learning).

CAIAC Papers Week 3

12 minute read

Published:

This week, our AI safety reading group examined three pieces that shift from abstract alignment challenges to concrete trajectory analysis: where AI capabilities are headed, how fast they're improving, and what catastrophic scenarios might unfold. These readings force us to confront the uncomfortable gap between exponential technical progress and our glacial institutional responses.

CAIAC Papers Week 2

3 minute read

Published:

This week, our AI safety reading group explored three papers addressing why alignment is hard, how misgeneralization can arise even with correct specifications, and why all ML systems have “weird” failure modes. Below are my reflections, observations, and speculative ideas for addressing these challenges.

CAIAC Papers Week 1

4 minute read

Published:

This semester, I joined the Columbia AI Alignment Club Technical Fellowship, where we discuss AI safety concepts and papers. This week, we explored papers and blog posts on specification gaming, deceptive behavior, RLHF, and alignment failures. These readings highlight how models can behave in unintended or manipulative ways, even when objectives appear simple or human-aligned.

Papers I Read This Week

5 minute read

Published:

I’m starting a new series called “Papers I Read This Week” to keep track of some of the most interesting work I’ve been reading. I often skim more abstracts and excerpts than I can list here, but this will serve as a place to highlight the papers that stood out—whether for their ideas, methods, or the questions they raise.

RLHF

CAIAC Papers Week 1

4 minute read

Published:

This semester, I joined the Columbia AI Alignment Club Technical Fellowship, where we discuss AI safety concepts and papers. This week, we explored papers and blog posts on specification gaming, deceptive behavior, RLHF, and alignment failures. These readings highlight how models can behave in unintended or manipulative ways, even when objectives appear simple or human-aligned.

adversarial attacks

CAIAC Papers Week 7

9 minute read

Published:

This week focuses on how we test AI systems for dangerous capabilities before deployment. Red teaming and evaluations aim to characterize what models can and can't do—especially for tasks we'd prefer they couldn't do.

alignment

CAIAC Papers Week 2

3 minute read

Published:

This week, our AI safety reading group explored three papers addressing why alignment is hard, how misgeneralization can arise even with correct specifications, and why all ML systems have “weird” failure modes. Below are my reflections, observations, and speculative ideas for addressing these challenges.

CAIAC Papers Week 1

4 minute read

Published:

This semester, I joined the Columbia AI Alignment Club Technical Fellowship, where we discuss AI safety concepts and papers. This week, we explored papers and blog posts on specification gaming, deceptive behavior, RLHF, and alignment failures. These readings highlight how models can behave in unintended or manipulative ways, even when objectives appear simple or human-aligned.

capabilities

CAIAC Papers Week 3

12 minute read

Published:

This week, our AI safety reading group examined three pieces that shift from abstract alignment challenges to concrete trajectory analysis: where AI capabilities are headed, how fast they're improving, and what catastrophic scenarios might unfold. These readings force us to confront the uncomfortable gap between exponential technical progress and our glacial institutional responses.

careers

CAIAC Papers Week 8

12 minute read

Published:

The final week steps back from technical details to address broader strategic questions: How fast is AI really progressing? What does this mean for alignment timelines? And how can individuals contribute to solving this problem?

cognitive science

Intelligence as Dynamic Balance - From Evolution to the AI-Culture Fork

10 minute read

Published:

Your brain just performed an extraordinary feat. As your eyes moved across these words, millions of neurons fired in precise patterns, transforming chaotic photons into meaning. But here’s what’s remarkable. Your visual system didn’t just recognize familiar letter shapes. It was simultaneously ready to make sense of fonts you’ve never seen, handwriting styles that would baffle it, and even words that don’t quite look right. This is the fundamental tension that defines intelligence across every domain: the balance between finding useful patterns and breaking free from them when circumstances change. This same dynamic, what I call the pattern-finding/pattern-breaking tension, operates across scales from evolutionary deep time to the millisecond decisions your neurons make right now. It is not just a curious parallel. These systems are nested within each other, each building on the solutions developed by the previous level, each facing its own unique constraints while grappling with the same core challenge.

control

CAIAC Papers Week 5

8 minute read

Published:

This week introduces the Control agenda: a pragmatic approach to AI safety that assumes models might be actively trying to subvert our safety measures. Rather than ensuring AIs want to be safe (alignment), control ensures they can't cause catastrophes even if they're scheming against us.

cultural anthropology

Why Human Alignment Works Without Interpretability

8 minute read

Published:

A striking assumption underlies much of contemporary AI alignment research: alignment requires interpretability. If we are to ensure that artificial systems act in accordance with human values, we must be able to inspect their internal representations, understand their objectives, and predict their decisions. Opacity, in this view, is dangerous. Alignment without transparency is treated as incoherent. Human alignment presents a problem for this assumption.

The Alignment Problem Is an Anthropological Issue

6 minute read

Published:

When AI researchers talk about aligning machines with “human values,” the phrase sounds deceptively simple. It implies a shared moral universe, a single standard of right and wrong. But there is no single moral universe. Across cultures, intelligence, agency, and control are understood in vastly different ways. Western traditions often equate intelligence with abstract reasoning, analytical problem-solving, and autonomous decision-making. In many collectivist societies, intelligence is relational: it involves the ability to act with discernment in social contexts, maintain harmony, and fulfill moral obligations. Indigenous cosmologies may distribute agency across humans, ancestors, and the environment, with control being relational rather than individual.

debate

CAIAC Papers Week 6

6 minute read

Published:

This week addresses the scalable oversight problem: how do humans supervise AI systems smarter than they are? Four complementary approaches emerge—AI-assisted feedback, debate, weak-to-strong generalization, and combinations thereof.

evaluations

CAIAC Papers Week 7

9 minute read

Published:

This week focuses on how we test AI systems for dangerous capabilities before deployment. Red teaming and evaluations aim to characterize what models can and can't do—especially for tasks we'd prefer they couldn't do.

evolution

Intelligence as Dynamic Balance - From Evolution to the AI-Culture Fork

10 minute read

Published:

Your brain just performed an extraordinary feat. As your eyes moved across these words, millions of neurons fired in precise patterns, transforming chaotic photons into meaning. But here’s what’s remarkable. Your visual system didn’t just recognize familiar letter shapes. It was simultaneously ready to make sense of fonts you’ve never seen, handwriting styles that would baffle it, and even words that don’t quite look right. This is the fundamental tension that defines intelligence across every domain: the balance between finding useful patterns and breaking free from them when circumstances change. This same dynamic, what I call the pattern-finding/pattern-breaking tension, operates across scales from evolutionary deep time to the millisecond decisions your neurons make right now. It is not just a curious parallel. These systems are nested within each other, each building on the solutions developed by the previous level, each facing its own unique constraints while grappling with the same core challenge.

forecasting

CAIAC Papers Week 8

12 minute read

Published:

The final week steps back from technical details to address broader strategic questions: How fast is AI really progressing? What does this mean for alignment timelines? And how can individuals contribute to solving this problem?

CAIAC Papers Week 3

12 minute read

Published:

This week, our AI safety reading group examined three pieces that shift from abstract alignment challenges to concrete trajectory analysis: where AI capabilities are headed, how fast they're improving, and what catastrophic scenarios might unfold. These readings force us to confront the uncomfortable gap between exponential technical progress and our glacial institutional responses.

machine learning

Neuromatch NeuroAI Reflections

2 minute read

Published:

This summer, I completed the NeuroAI course offered by Neuromatch Academy, a volunteer-led organization providing research education and training at the intersection of neuroscience and machine learning. The course introduced key topics in biologically inspired AI, including neural coding, learning dynamics, and open problems in cognition, while emphasizing hands-on coding, peer collaboration, and engagement with current research. As someone who had been passively interested in NeuroAI, I found this course foundational in helping me move from curiosity to genuine direction. It gave me the conceptual framework, technical skills, and intellectual community to explore the field seriously.

mechanistic interpretability

CAIAC Papers Week 4

11 minute read

Published:

This week shifts from trajectory forecasting to the technical challenge of understanding what's actually happening inside neural networks. Mechanistic interpretability promises to crack open the black box—to understand not just what models do, but how they do it. The two papers we examined reveal both the fundamental obstacle (superposition) and a promising technique for overcoming it (sparse autoencoders for dictionary learning).

Papers I Read This Week

5 minute read

Published:

I’m starting a new series called “Papers I Read This Week” to keep track of some of the most interesting work I’ve been reading. I often skim more abstracts and excerpts than I can list here, but this will serve as a place to highlight the papers that stood out—whether for their ideas, methods, or the questions they raise.

misgeneralization

CAIAC Papers Week 2

3 minute read

Published:

This week, our AI safety reading group explored three papers addressing why alignment is hard, how misgeneralization can arise even with correct specifications, and why all ML systems have “weird” failure modes. Below are my reflections, observations, and speculative ideas for addressing these challenges.

neuroai

Neuromatch NeuroAI Reflections

2 minute read

Published:

This summer, I completed the NeuroAI course offered by Neuromatch Academy, a volunteer-led organization providing research education and training at the intersection of neuroscience and machine learning. The course introduced key topics in biologically inspired AI, including neural coding, learning dynamics, and open problems in cognition, while emphasizing hands-on coding, peer collaboration, and engagement with current research. As someone who had been passively interested in NeuroAI, I found this course foundational in helping me move from curiosity to genuine direction. It gave me the conceptual framework, technical skills, and intellectual community to explore the field seriously.

neuroscience

Neuromatch NeuroAI Reflections

2 minute read

Published:

This summer, I completed the NeuroAI course offered by Neuromatch Academy, a volunteer-led organization providing research education and training at the intersection of neuroscience and machine learning. The course introduced key topics in biologically inspired AI, including neural coding, learning dynamics, and open problems in cognition, while emphasizing hands-on coding, peer collaboration, and engagement with current research. As someone who had been passively interested in NeuroAI, I found this course foundational in helping me move from curiosity to genuine direction. It gave me the conceptual framework, technical skills, and intellectual community to explore the field seriously.

philosophy

Intelligence as Dynamic Balance - From Evolution to the AI-Culture Fork

10 minute read

Published:

Your brain just performed an extraordinary feat. As your eyes moved across these words, millions of neurons fired in precise patterns, transforming chaotic photons into meaning. But here’s what’s remarkable. Your visual system didn’t just recognize familiar letter shapes. It was simultaneously ready to make sense of fonts you’ve never seen, handwriting styles that would baffle it, and even words that don’t quite look right. This is the fundamental tension that defines intelligence across every domain: the balance between finding useful patterns and breaking free from them when circumstances change. This same dynamic, what I call the pattern-finding/pattern-breaking tension, operates across scales from evolutionary deep time to the millisecond decisions your neurons make right now. It is not just a curious parallel. These systems are nested within each other, each building on the solutions developed by the previous level, each facing its own unique constraints while grappling with the same core challenge.

red teaming

CAIAC Papers Week 7

9 minute read

Published:

This week focuses on how we test AI systems for dangerous capabilities before deployment. Red teaming and evaluations aim to characterize what models can and can't do—especially for tasks we'd prefer they couldn't do.

CAIAC Papers Week 5

8 minute read

Published:

This week introduces the Control agenda: a pragmatic approach to AI safety that assumes models might be actively trying to subvert our safety measures. Rather than ensuring AIs want to be safe (alignment), control ensures they can't cause catastrophes even if they're scheming against us.

reflections

What the Wild Taught Us - Reflections from My Trip to Africa

3 minute read

Published:

This summer, my parents and I traveled to Kenya and Tanzania, visiting Maasai Mara, Serengeti, and the Ngorongoro Crater. The safari was more than just a vacation — it gave us a chance to step back and think about the world and our place in it. We expected to see animals in their natural habitat, but what we gained was a deeper perspective.

research

Papers I Read This Week

5 minute read

Published:

I’m starting a new series called “Papers I Read This Week” to keep track of some of the most interesting work I’ve been reading. I often skim more abstracts and excerpts than I can list here, but this will serve as a place to highlight the papers that stood out—whether for their ideas, methods, or the questions they raise.

risk scenarios

CAIAC Papers Week 3

12 minute read

Published:

This week, our AI safety reading group examined three pieces that shift from abstract alignment challenges to concrete trajectory analysis: where AI capabilities are headed, how fast they're improving, and what catastrophic scenarios might unfold. These readings force us to confront the uncomfortable gap between exponential technical progress and our glacial institutional responses.

scalable oversight

CAIAC Papers Week 6

6 minute read

Published:

This week addresses the scalable oversight problem: how do humans supervise AI systems smarter than they are? Four complementary approaches emerge—AI-assisted feedback, debate, weak-to-strong generalization, and combinations thereof.

scheming

CAIAC Papers Week 5

8 minute read

Published:

This week introduces the Control agenda: a pragmatic approach to AI safety that assumes models might be actively trying to subvert our safety measures. Rather than ensuring AIs want to be safe (alignment), control ensures they can't cause catastrophes even if they're scheming against us.

sparse autoencoders

CAIAC Papers Week 4

11 minute read

Published:

This week shifts from trajectory forecasting to the technical challenge of understanding what's actually happening inside neural networks. Mechanistic interpretability promises to crack open the black box—to understand not just what models do, but how they do it. The two papers we examined reveal both the fundamental obstacle (superposition) and a promising technique for overcoming it (sparse autoencoders for dictionary learning).

superposition

CAIAC Papers Week 4

11 minute read

Published:

This week shifts from trajectory forecasting to the technical challenge of understanding what's actually happening inside neural networks. Mechanistic interpretability promises to crack open the black box—to understand not just what models do, but how they do it. The two papers we examined reveal both the fundamental obstacle (superposition) and a promising technique for overcoming it (sparse autoencoders for dictionary learning).

timelines

CAIAC Papers Week 8

12 minute read

Published:

The final week steps back from technical details to address broader strategic questions: How fast is AI really progressing? What does this mean for alignment timelines? And how can individuals contribute to solving this problem?

travel

What the Wild Taught Us - Reflections from My Trip to Africa

3 minute read

Published:

This summer, my parents and I traveled to Kenya and Tanzania, visiting Maasai Mara, Serengeti, and the Ngorongoro Crater. The safari was more than just a vacation — it gave us a chance to step back and think about the world and our place in it. We expected to see animals in their natural habitat, but what we gained was a deeper perspective.

weak-to-strong

CAIAC Papers Week 6

6 minute read

Published:

This week addresses the scalable oversight problem: how do humans supervise AI systems smarter than they are? Four complementary approaches emerge—AI-assisted feedback, debate, weak-to-strong generalization, and combinations thereof.