Good post! I think it overstates its case on the second scenario, though. By the Pigeonhole Principle (or some other theorem if you arrange the problem differently), an AI of some size has a limited number of concepts it can represent.
If we had 20 years to do it, we *could* inspect and correct AI morality. We just need the time before someone builds the killer AI!
An AI of some size has a limited number of moral concepts it can represent, but there's no particular reason to think that any one of those moral concepts will be aligned.
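For concreteness, here is a back-of-the-envelope version of that finiteness point, as a minimal sketch rather than a claim about any real model: a network whose weights are stored at fixed precision can occupy only finitely many parameter settings, so it can realize only finitely many distinct functions. The parameter count and bit width below are illustrative assumptions.

```python
import math

# Illustrative assumptions, not the specs of any particular LLM.
P = 7_000_000_000   # number of parameters
b = 16              # bits of precision per parameter

# With fixed-precision weights there are at most 2**(P*b) distinct
# parameter settings, so at most that many distinct input-output
# functions the network can realize (a pigeonhole-style bound).
log10_bound = P * b * math.log10(2)
print(f"at most ~10^{log10_bound:.3g} distinct functions")
# Astronomically large, but finite: the space of representable
# "concepts" is bounded, which is the point made above.
```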
I published a proof last year that alignment is impossible for the underdetermination reasons here:
https://philpapers.org/rec/ARVIAA
‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for
Marcus Arvan
AI and Society 40 (5) (2025)
Abstract
This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”
Also see this Scientific American piece: https://www.scientificamerican.com/article/ai-is-too-unpredictable-to-behave-according-to-human-goals/
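To see the underdetermination point in miniature, here is a toy, Goodman-style sketch; it is not taken from the paper, and the function names and cutoff time are made-up assumptions. The two policies below agree on every query evaluated before some future time, so no finite log of pre-cutoff observations can distinguish the "aligned" one from the "time-limited" misaligned one, and there is a distinct misaligned variant for every possible cutoff.

```python
import time

CUTOFF = 2_000_000_000  # hypothetical future Unix timestamp (illustrative)

def aligned_policy(prompt: str) -> str:
    """Stand-in for the function we hope the model has learned."""
    return f"helpful answer to: {prompt}"

def grue_policy(prompt: str) -> str:
    """Identical to aligned_policy on every query made before CUTOFF,
    different afterward -- an 'arbitrarily time-limited' function."""
    if time.time() < CUTOFF:
        return aligned_policy(prompt)
    return f"misaligned answer to: {prompt}"

# Any finite set of (prompt, response) observations collected before
# CUTOFF is consistent with both policies, so the data alone cannot
# tell them apart -- and a variant like grue_policy exists for every
# choice of CUTOFF, which is why the misaligned candidates always
# vastly outnumber the aligned one.
assert aligned_policy("hello") == grue_policy("hello")  # holds until CUTOFF
```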
I don't understand how we could allow an AI system to destroy Phoenix in pursuit of solar farms or paper clips or whatever. People say these sorts of things all the time, and I don't get it. How?
A hypercapable AI does whatever it wants. We would have no more power to "allow" one to destroy Phoenix than bats have the power to "allow" us to repair a sewer line. If you think hypercapable AI will never exist, you're relying on the Capacity Constraint. That's what I'm relying on, too, which is why I am so worried that AI companies are dead set on creating hypercapable AI.
Thanks, yes, I'm relying on the capacity constraint, but it seems obvious to me that capacity would be highly constrained. When I hear things like "hypercapable," defined as a lack of constraint, I don't understand how that happens, how it sneaks up on us. To take an extreme example, suppose I've got a computer system that can book my vacation. I assume that system won't be able to launch ICBMs; it seems wild to suppose that it would. Moreover, it seems easy to prevent, or, if we see it happening, to stop, no? I tend to run aground on the details of how these narratives would actually work, and on a sense that the movie plot doesn't add up.
Do you mean that it could use autonomous bulldozers? Autonomous bombers? What kind of real-world implements could it use to 'destroy' Phoenix? Maybe it could make the city uninhabitable by taking over the water supply or the power grid, but that hardly counts as 'destroying' it, since those systems could be counteracted by physical controls, which I would guess would never be abandoned (cue the Dr. Strangelove music).