Why Human Alignment Works Without Interpretability
A striking assumption underlies much of contemporary AI alignment research: alignment requires interpretability. If we are to ensure that artificial systems act in accordance with human values, we must be able to inspect their internal representations, understand their objectives, and predict their decisions. Opacity, in this view, is dangerous. Alignment without transparency is treated as incoherent. Human alignment presents a problem for this assumption.
