Discussion about this post

Alex

Good post! I think it overstates its case on the second scenario, though. By the Pigeonhole Principle (or a similar counting argument, depending on how you arrange the problem), an AI of a given size can represent only a limited number of concepts.

If we had 20 years to do it, we *could* inspect and correct AI morality. We just need the time before someone builds the killer AI!
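Alex's counting point can be sketched in a few lines: a model whose weights fit in a fixed number of bits can realize at most finitely many distinct input-output functions, and so only finitely many "concepts". The numbers below are toy values for illustration, not figures from the post.

```python
# Toy counting argument: a model with finitely many weights at finite
# precision has at most 2**(total bits) distinct weight settings, which
# upper-bounds the number of distinct functions it can represent.
num_params = 10        # illustrative; real models have billions
bits_per_param = 4     # illustrative; real weights use e.g. 16 bits

max_distinct_functions = 2 ** (num_params * bits_per_param)
print(max_distinct_functions)  # 2**40 = 1099511627776
```

By pigeonhole, any set of concepts larger than this bound must map at least two concepts to the same weight setting, so they cannot all be represented distinctly.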

Marcus Arvan

I published a proof last year that alignment is impossible, for the underdetermination reasons discussed here:

https://philpapers.org/rec/ARVIAA

Marcus Arvan, "'Interpretability' and 'alignment' are fool's errands: a proof that controlling misaligned large language models is the best anyone can hope for," AI and Society 40(5), 2025.

Abstract

This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”

Also see this Scientific American piece: https://www.scientificamerican.com/article/ai-is-too-unpredictable-to-behave-according-to-human-goals/
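The abstract's underdetermination point can be illustrated with a minimal sketch: two functions that agree on every observation gathered so far but diverge once an unobserved future condition obtains. The cutoff date and both functions here are hypothetical, chosen only to make the structure of the argument concrete.

```python
# Sketch of an "arbitrarily time-limited" function: empirically
# indistinguishable from an aligned one on all data collected before
# some future condition (here, a hypothetical cutoff date) obtains.
import datetime

CUTOFF = datetime.date(2030, 1, 1)  # hypothetical future condition

def aligned(x):
    """Behaves as intended on every input, forever."""
    return x

def time_limited(x, today):
    """Identical to `aligned` before CUTOFF, diverges afterwards."""
    if today < CUTOFF:
        return x
    return -x

# Every pre-cutoff observation is consistent with BOTH functions,
# so no body of such evidence can discriminate between them:
pre_cutoff = datetime.date(2029, 12, 31)
assert all(aligned(x) == time_limited(x, pre_cutoff) for x in range(100))
```

Since the cutoff condition can be chosen arbitrarily, there are infinitely many such time-limited variants consistent with any finite body of evidence, which is the shape of the argument in Sections 4–7 of the paper.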
