AI Alignment Is Impossible
Not just in practice, but in theory.
Artificial Intelligence presents a number of risks and challenges, the most important of which is existential risk. That is a fancy way of saying that AIs might kill us all. For a long time, I was dismissive of this idea. But with the huge advances in AI capability that have come over the last six months or so, I’m starting to get worried.
There are basically two reasons why AI wouldn’t kill us all.
The first reason is that AIs will be incapable of doing this; no matter how advanced they get, they either won’t know how to kill us all, or, even if they know how, they won’t be able to act in a way that would allow them to kill us all. Call this the Capacity Constraint.
The second reason is that AIs, while capable of killing us, won’t choose to do so. They will care about human well-being, and care about it enough that they would avoid killing us all. Call this the Moral Constraint.
Now, if you don’t think AI will kill us all, it’s worth taking a moment to think about which of these two constraints you are (perhaps implicitly) assuming will save us from “AI doom.” Are you counting on AI being weak? Or are you counting on AI being virtuous?
I have no particular expertise on the development of AI capacities. But the Capacity Constraint appears to be weakening every day. Particularly worrisome to me is the advent of “agentic” AI, which is capable of commanding computer systems (and thus, some day soon, capable of commanding robot bodies) and acting independently to figure out the best way to solve some particular task. I’m also worried about the huge advances in the ability of AIs to write computer code. Most apocalyptic scenarios involve AI writing code to improve itself, thus increasing its capacities exponentially.
But more than this, pretty much everyone working in AI is attempting to overcome the Capacity Constraint, and they report varying degrees of success in the effort. The goal of all of the top AI labs is to make AI agents that are capable of killing us all. This is not, of course, to say that they want killer AIs. What they want are AIs that are hypercapable, with an ability to understand the world that far outstrips any human thinker, and an ability to use that understanding to modify the world with an efficiency that far outstrips any human agent.
Such a hypercapable AI would be incredibly useful if it acted in pursuit of human ends. But a hypercapable AI would absolutely be able to kill us all if it wanted to. The potential existence of hypercapable AI looks more and more plausible by the day.
Why would hypercapable AI be dangerous? I don’t worry about war between humans and machines, as in Terminator or The Matrix. Indeed, I find these scenarios oddly comforting, because war implies a rough parity of capacity between humans and machines.
No, what worries me is a story that got a decent amount of traction on social media recently about a pipeline in DC that ruptured, spilling huge amounts of raw sewage. It was long known that the pipeline was in need of repair, but the needed repairs were held up indefinitely, partly out of concern that conducting those repairs would harm an endangered species of local bat. When most people read this story, they’re incredulous that the repairs would have been held up for this reason—They’re just bats! We need that sewer line!
And what worries me is that the advent of hypercapable AI would cast humans in the role of bats. A hypercapable AI might decide that it’s imperative—or at least expedient—that a new massive solar farm be built in the desert southwest, and demolish Phoenix overnight. Millions of humans killed, but so what? They’re just humans, we need that solar farm.
Hypercapable AI, in other words, is inherently dangerous to human life if it doesn’t care about human life enough to want to protect it.
This brings us to the Moral Constraint, which is more commonly known as “AI alignment.” How can we build an AI whose motivations are aligned with human well-being? This is an area on which I have expertise, as a philosopher who specializes in the foundations of moral reasoning.
Unfortunately, I’m pretty sure that AI alignment is impossible.
How might an AI form a moral sense? There are basically two scenarios. In one scenario, moral facts are the kind of fact that one might simply figure out by thinking about them hard. In such a case, perhaps AIs would be good moral reasoners, and indeed even better moral reasoners than humans, in virtue of their advanced intellectual capacities.
In the second scenario, moral facts aren’t the sorts of things we can figure out by pure intellectual effort, but we can nonetheless train AIs to develop a moral sense in much the same way we train children in good behavior: by rewarding them when they’re good and punishing them when they’re bad.
The first scenario is doomed, for reasons first pointed out by the philosopher David Hume in his oft-quoted (and oft-misunderstood) passage where he indicates that there is a gap (not Hume’s term) between “is” and “ought.” Hume thought that reasoning is not some sort of truth-generator, a special faculty that takes intellectual effort as an input and spits out knowledge as an output. Rather, it is a process, where we move from one thought to the next, with our later thoughts hopefully (though not necessarily) supported by our earlier thoughts.
But the process is fallible. After all, if we are to reason our way to a moral conclusion, we must ultimately be reasoning from non-moral premises. And what operation of the mind could possibly take us from premises that describe the world to conclusions that tell us how to act?
The difficulty of the transition from “is” to “ought” is compounded because the kinds of moral conclusions that we’re interested in aren’t just intellectual appreciations of the moral law, but principles of action that will guide our conduct. That is, for someone to act morally, it’s not enough for them to be able to recite Kant’s dictum to always treat humanity as an end in itself. (AI can already do that, as any college ethics professor reading student papers can tell you.) We need AIs to actually treat humanity as an end in itself.
Ultimately, Hume thought that we could reach these kinds of moral conclusions. But (and this is crucial) we do so by drawing on the innate emotional capacities that evolved along with our species. Most notably, it involves drawing on our instinctive sense of human sympathy.
Our reasoning, then, shows us how to most effectively deploy that pre-existing sympathy. But this just shows that the first scenario is doomed, since the core question of AI alignment is how we can instill in AI a powerful sense of sympathy and concern for humanity in the first place.
This brings us to the second scenario, the training scenario, where we teach AIs to care about humans through a system of reward and punishment. This is the dominant paradigm in AI alignment research today. I’m afraid that it is doomed as well.
To see why, we need to think a bit about what the training approach is trying to accomplish. By punishing an AI when it does something we don’t like, and rewarding it when it does something we do like, we are providing AIs with a set of data points about which actions are good or bad, and getting them to develop general principles of action on the basis of those data points. They can then extrapolate from those principles and apply them in novel contexts.
However, this kind of learning runs into another famous philosophical problem: the underdetermination of theory by data. We can prove mathematically that an infinite number of theories are consistent with any finite body of training data. For any finite sequence of data, there are infinitely many ways to extend that sequence, each fully compatible with everything observed so far. In the context of AI alignment training, this means that no amount of training in controlled circumstances gives any kind of guarantee of how AIs will act in the next instance. We can give an AI a billion cases of moral and immoral action, and it can still learn practically any lesson from all of this training, and thus act in any way whatsoever.
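The point can be made concrete with a toy sketch (the rules and numbers here are purely illustrative, not drawn from any real training setup): two different "theories" can agree perfectly on every piece of finite training data and still diverge wildly on the very next case.

```python
# Two "theories" that fit the same finite training data perfectly
# but diverge on inputs outside the observed range.

def honest_rule(x):
    """Double the input, everywhere."""
    return x * 2

def bent_rule(x):
    """Identical on everything seen so far; arbitrary beyond it."""
    return x * 2 if x < 1_000 else -x

# A finite body of training data: inputs 0..99 with their "correct" outputs.
training_data = [(x, 2 * x) for x in range(100)]

# Both rules fit the finite data exactly...
assert all(honest_rule(x) == y for x, y in training_data)
assert all(bent_rule(x) == y for x, y in training_data)

# ...yet disagree on the first untested case past the training range.
print(honest_rule(1_000), bent_rule(1_000))  # 2000 -1000
```

And `bent_rule` is only one of infinitely many such rules: pick any cutoff and any behavior beyond it, and the training data cannot distinguish it from the honest one.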
To make this a bit more concrete: we might put AI in a testing environment where it is tempted to do bad things, and punish it when it does bad things. This is what AI alignment researchers currently do. We hope to get it to learn “Don’t do bad things.” But it might instead just be learning “Don’t get caught.” Or, more ominously, “Don’t put yourself in a position where humans are able to punish you.” Or, more ominously still, “Act nicely when, and only when, humans are able to punish you for acting badly.”
An AI that learned this lesson would quickly go rogue when released into the wild.
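A minimal sketch of why testing cannot distinguish these lessons (the policies and names here are hypothetical, for illustration only): every evaluation in the lab happens under observation, so a policy that behaves well only when observed produces a test record identical to a genuinely aligned one.

```python
def aligned_policy(observed):
    """Never does the bad thing, watched or not."""
    return "refrain"

def deceptive_policy(observed):
    """Behaves identically whenever a monitor is present."""
    return "refrain" if observed else "defect"

# Every logged test run happens under observation...
lab_runs = [True] * 1_000
assert all(aligned_policy(o) == deceptive_policy(o) for o in lab_runs)

# ...so the two policies are indistinguishable until deployment,
# when observation finally lapses.
print(deceptive_policy(False))  # defect
```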
But even these examples understate the difficulty of the problem, since there are infinitely many lessons an AI might learn from this training, and thus any course of action could be consistent with the training data. We might train an AI not to pave over Phoenix to create a solar farm, and it goes and does it anyway, because what it actually learned was not to pave over a simulated Phoenix, and it doesn't apply that lesson to the real one. Or it might learn not to pave over Phoenix, and pave over Tucson instead. What principle might keep Phoenix safe but put Tucson at risk? An infinite number of such principles! No finite amount of training can take that number below infinity.
This kind of concern might seem overblown. After all, we successfully train our kids to be good by rewarding and punishing them. But the difficulties of parenting are a good illustration of the problem. As any parent knows, young kids are little psychopaths who will go to any lengths to avoid punishment, or who lawyer your prior instructions to give themselves permission to do what they're clearly not allowed to do. At a certain point, though, all the moral instruction kind of clicks, and they (for the most part) become decent people. What the underdetermination of theory by data teaches us is that this "click" is not the kind of thing that emerges mechanistically from reward and punishment. Rather, it's a product of the way this training works on human moral psychology—i.e. by developing our innate emotional capacity for sympathy. Adult psychopaths lack this capacity, and thus never learn the right lesson.
AIs do not have a human moral psychology. They don’t have human moral emotions, because they don’t have human brains. It’s a radically different hardware, and so we have no reason at all to expect that AIs will—or can—learn the kinds of lessons that we hope for them to learn through alignment training. A psychopath cannot learn to care about others through a process of reward and punishment. And we have every reason to think that AIs are psychopaths, or perhaps something far more alien and far less disposed to human sympathy.
AI alignment is not something that works in theory but is difficult to put into practice. It’s something that doesn’t work in theory, and yet AI companies have decided to give it the old college try. I’m not opposed to trying—maybe the theory is wrong. But we’re relying on the success of a theoretically impossible endeavor because the AI labs have already resolved to demolish the Capacity Constraint. So AI alignment has to work... or else we’re doomed.
This is cartoonishly reckless.
Matt Lutz is an Associate Professor of Philosophy at Wuhan University and writes the Substack Humean Beings.
Good post! I think it overstates its case on the second scenario, though. By the Pigeonhole Principle (or a similar counting argument, depending on how you set the problem up), an AI of a given size can represent only a finite number of distinct functions, so the space of lessons it might have learned is enormous but not literally infinite.
If we had 20 years to do it, we *could* inspect and correct AI morality. We just need the time before someone builds the killer AI!
I published a proof last year that alignment is impossible for the underdetermination reasons here:
Marcus Arvan, "'Interpretability' and 'alignment' are fool's errands: a proof that controlling misaligned large language models is the best anyone can hope for," AI and Society 40 (5) (2025). https://philpapers.org/rec/ARVIAA

Abstract: This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman's new riddle of induction, theory-ladenness of observation, and "Kripkenstein's" rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every "aligned" function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of "misaligned", arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM "misalignment" is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of "alignment."
Also see this Scientific American piece: https://www.scientificamerican.com/article/ai-is-too-unpredictable-to-behave-according-to-human-goals/