Good post! I think it overstates its case on the second scenario, though. By the Pigeonhole Principle (or some other theorem if you arrange the problem differently), an AI of some size has a limited number of concepts it can represent.
If we had 20 years to do it, we *could* inspect and correct AI morality. We just need the time before someone builds the killer AI!
An AI of some size has a limited number of moral concepts it can represent, but there's no particular reason to think that any one of those moral concepts will be aligned.
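For concreteness, here is a back-of-the-envelope version of that finiteness point, as a minimal sketch rather than a claim about any real model: a network whose weights are stored at fixed precision can occupy only finitely many parameter settings, so it can realize only finitely many distinct functions. The parameter count and bit width below are illustrative assumptions.

```python
import math

# Illustrative assumptions, not the specs of any particular LLM.
P = 7_000_000_000   # number of parameters
b = 16              # bits of precision per parameter

# With fixed-precision weights there are at most 2**(P*b) distinct
# parameter settings, so at most that many distinct input-output
# functions the network can realize (a pigeonhole-style bound).
log10_bound = P * b * math.log10(2)
print(f"at most ~10^{log10_bound:.3g} distinct functions")
# Astronomically large, but finite: the space of representable
# "concepts" is bounded, which is the point made above.
```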
I published a proof last year that alignment is impossible for the underdetermination reasons here:
https://philpapers.org/rec/ARVIAA
‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for
Marcus Arvan
AI and Society 40 (5) (2025)
Abstract
This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible. Sections 2 and 3 show that because of how complex LLMs are, researchers must interpret their learned functions largely in terms of empirical observations of their outputs and network behavior. Sections 4–7 then show that for every “aligned” function that might appear to be confirmed by empirical observation, there is always an infinitely larger number of “misaligned”, arbitrarily time-limited functions equally consistent with the same data. Section 8 shows that, from an empirical perspective, we can thus never reliably infer that an LLM or subcomponent of one has learned any particular function at all before any of an uncountably large number of unpredictable future conditions obtain. Finally, Section 9 concludes that the probability of LLM “misalignment” is—at every point in time, given any arbitrarily large body of empirical evidence—always vastly greater than the probability of “alignment.”
Also see this Scientific American piece: https://www.scientificamerican.com/article/ai-is-too-unpredictable-to-behave-according-to-human-goals/
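To see the underdetermination point in miniature, here is a toy, Goodman-style sketch; it is not taken from the paper, and the function names and cutoff time are made-up assumptions. The two policies below agree on every query evaluated before some future time, so no finite log of pre-cutoff observations can distinguish the "aligned" one from the "time-limited" misaligned one, and there is a distinct misaligned variant for every possible cutoff.

```python
import time

CUTOFF = 2_000_000_000  # hypothetical future Unix timestamp (illustrative)

def aligned_policy(prompt: str) -> str:
    """Stand-in for the function we hope the model has learned."""
    return f"helpful answer to: {prompt}"

def grue_policy(prompt: str) -> str:
    """Identical to aligned_policy on every query made before CUTOFF,
    different afterward -- an 'arbitrarily time-limited' function."""
    if time.time() < CUTOFF:
        return aligned_policy(prompt)
    return f"misaligned answer to: {prompt}"

# Any finite set of (prompt, response) observations collected before
# CUTOFF is consistent with both policies, so the data alone cannot
# tell them apart -- and a variant like grue_policy exists for every
# choice of CUTOFF, which is why the misaligned candidates always
# vastly outnumber the aligned one.
assert aligned_policy("hello") == grue_policy("hello")  # holds until CUTOFF
```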
I don't understand how we could allow an AI system to destroy Phoenix in pursuit of solar farms or paper clips or whatever. People say these sorts of things all the time, and I don't get it. How?
A hypercapable AI does whatever it wants. We would have no more power to "allow" one to destroy Phoenix than bats have the power to "allow" us to repair a sewer line. If you think hypercapable AI will never exist, you're relying on the Capacity Constraint. That's what I'm relying on, too, which is why I am so worried that AI companies are dead set on creating hypercapable AI.
Thanks, yes, I'm relying on the capacity constraint, but it seems obvious to me that capacity would be highly constrained. When I hear things like "hypercapable," defined as a lack of constraint, I don't understand how that happens, how it sneaks up on us. To take an extreme example, suppose I've got a computer system that can book my vacation. I assume that system won't be able to launch ICBMs; it seems wild to suppose that it would. Moreover, it seems easy to prevent, or, if we see it happening, to stop, no? I tend to run aground on the details of how these narratives would actually work, and on a sense that the movie plot doesn't add up.
Do you mean that it could use autonomous bulldozers? Autonomous bombers? What kind of real-world implements could it use to 'destroy' Phoenix? Maybe it could make the city uninhabitable by taking over the water supply or the power grid, but that hardly counts as 'destroying' it, since those systems could be counteracted by physical controls, which I would guess would never be abandoned (cue the Dr. Strangelove music).