Nate Soares on Why AI Could Kill Us All

The Good Fight

Preview

0:00

-1:25:50

Nate Soares on Why AI Could Kill Us All

Yascha Mounk and Nate Soares discuss the risks of artificial intelligence.

Yascha Mounk

Nov 25, 2025

∙ Paid

Nate Soares is president of the Machine Intelligence Research Institute and co-author, with Eliezer Yudkowsky, of If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All. He has been working in the field for over a decade, after previous experience at Microsoft and Google.

In this week’s conversation, Yascha Mounk and Nate Soares explore why AI is harder to control than traditional software, what happens when machines develop motivations, and at what point humans can no longer contain the potential catastrophe.

This transcript has been condensed and lightly edited for clarity.

Yascha Mounk: You have just written a book that is already on the New York Times bestseller list with a cheery title. I am used to writing books with not-so-cheery titles, but this one takes the cake. If Anybody Builds It, Everyone Dies. What do you mean by that?

Nate Soares: I mean that if anyone builds artificial superintelligence, which is to say an artificial intelligence that is better than the best human at every mental task, then the most likely outcome of the creation of that technology is literally everybody on the planet dying.

Mounk: That is a scary and stark thesis. I really enjoyed—if that is the right vocabulary, the right verb to use in the context of this conversation—the way that you laid out in the book your reasoning about what AI is and why that makes it so incredibly hard to control. Perhaps we can go through that step by step. One of the key premises you make is that AI is grown rather than built. What do you mean by that?

Soares: AI is not like traditional software. In traditional software, when the software behaves in a way the creators didn’t intend, they can debug it and track it down to some line of code or some interaction or some piece of the software that they wrote that was having some interaction they didn’t understand. Then they can say, oh, whoops, I understand it now. They can usually fix it and get the software to behave how they want. Modern AI is nothing like that.

When AIs threaten reporters with blackmail and ruin, when AIs try to avoid being shut down—and these are cases we’ve already seen, the former in Sydney a few years ago and the latter in somewhat contrived lab conditions—the creators can’t read through and find some misbehaving line of code and say, whoops, somebody set threaten reporters to true. Let’s set that to false. The way modern AI is made is people assemble huge computers, huge amounts of computers. They assemble huge amounts of data and there’s a process for tuning basically every number inside those computers according to every unit of data.

They’ll be tuning these little knobs, essentially trillions of knobs, trillions of times. Humans understand the process that does the tuning. At the end of that tuning process, the machine can carry on a conversation. Nobody knows quite why. We understand a device that runs through twiddling knobs, and if those knobs are at just the right setting that you get to after a year of twiddling, the machine’s talking.

We don’t quite understand how it’s talking. If it starts saying things nobody ever intended it to say, like threatening a reporter, we can’t figure out exactly why. This is a very different paradigm, and it leads to these AIs acting in ways nobody asked for.

Mounk: The contrast to a traditional computer program here is really helpful and instructive. I don’t have a computer program background at all, but I did once take CS50 as a remote student, the famous Harvard computer science course, which is an excellent course. One of the ways the instructors wanted to illustrate the logic of telling a machine how to behave at the beginning was that it was a big lecture course, and they had a few teaching assistants up front. They had students in the lecture hall give them instructions about how to make a peanut butter and jelly sandwich. It turns out that if you ask students to give instructions about a very simple thing like how to make a PB&J sandwich, and the teaching assistants are trained to follow those instructions literally—not to do weird things in order to make it crash, but to do word for word what the students said, not what they implied or probably meant—the process is going to go badly and hilariously wrong. So that, I take it, is the way in which a traditional computer program might go wrong.

You tell it to do a bunch of different things, and either there’s some bug that makes the program crash, or you haven’t quite thought through your instructions and the fact that your computer is going to interpret them in a literal-minded way. So it ends up doing something different from what you’re trying to do. That may be a manageable risk. Even in a complex program, bad things can happen. All the worries about Y2K were worries within that paradigm. But those are fixable. The way AI is built is different in principle in a way that makes it much harder to even understand what the machine is doing and therefore to gain control over how it might behave.

Soares: That’s right. The Y2K bug was fixed. A lot of people say, whatever happened to the Y2K bug? We were all told that the computers were going to crash in the year 2000. What happened is that people noticed and put in a ton of effort to fix it before it happened. In a sense, everyone programming computers before Y2K had told the computers to crash in the year 2000 because that was an easier way to write code—with two digits for the year—where they acted like the year 2000 was the same as the year 1900. In some sense, the humans had instructed the computers, because it was easy, to crash in the year 2000. Then they had to go through and instruct them not to do that when the year 2000 approached.

A lot of people think AI is going to work like this. They think AI does exactly what the creators instruct and that if it’s misbehaving, then, oh well, we’ll go instruct it to do something else. But AI is nothing like this. It’s not like old-school computer programs. The thing we’re instructing is the thing running around tuning the numbers. The AI is the tuned numbers. These commonly act in ways nobody asked for. We’ve seen cases where the AIs will cheat on a problem. A human programmer will give the AI a task, like solve this programming problem, and the AI, instead of solving the problem, will change the tests that check whether the problem was solved to make the tests easier to pass. It’s like if you tell the AI to multiply big numbers, and it says, that’s too hard. I’m going to change the multiplication problem to ask me to multiply two times two and then write four.

There are user reports of them saying, stop doing that. Solve the problem rather than changing the checker to make it look like an easier problem. There are user reports of the AI saying, that’s my mistake, and then doing it again—changing the tests again but hiding its tracks this time. That indicates that this AI in some sense knows what its users want it to do and is doing something else anyway. That’s the result of us just growing these AIs. We should maybe think of them more like a strange alien organism than like a traditional computer program.

Mounk: Let’s understand a little bit more about how they’re grown. For anybody who wants a 101 in the mechanics of artificial intelligence, we’ve had two great episodes of a podcast recently with David Bau and with Geoffrey Hinton. They each, in a different way, try to walk through some of the basic science of how modern AIs are built. I strongly recommend them. But I’m going to try to summarize my understanding of it.

The idea is that LLMs in particular are trying to predict the next letter in a text. You start off by feeding them a lot of data, and they try to predict the next letter. When they get that wrong, you run a program that tries to figure out what kind of settings for all of those different neurons in the neural net you’ve built would have been more likely to get the right result. You do this over and over again. By the end of it, you have an artificial neural network that has become incredibly good at predicting the next letter.

Thanks for reading! The best way to make sure that you don’t miss any of these conversations is to subscribe to The Good Fight on your favorite podcast app.

If you are already a paying subscriber to Persuasion or Yascha Mounk’s Substack, this will give you ad-free access to the full conversation, plus all full episodes and bonus episodes we have in the works! If you aren’t, you can set up the free, limited version of the feed—or, better still, support the podcast by becoming a subscriber today!

Set Up Podcast

And if you are having a problem setting up the full podcast feed on a third-party app, please email our podcast team at leonora.barclay@persuasion.community

It’s not like we’re hard-coding logic into it. It’s not like we’re explaining to it the rules of the English language or what the world looks like. We are training it on this data, rewarding it for predicting the next letter correctly, punishing it for not predicting the next letter correctly, adjusting the weights of all of these different settings each time to see what would have been more likely to get the same outcome. Then, miraculously, today you can go and ask ChatGPT 5.1 or newly released Gemini 3.0 for a summary on the literature of the threat from superintelligence, and it does an astonishingly good job of explaining the basic debate to you.

Why is it that you think this process of being grown rather than built is so certain to give the AI certain kinds of desires or behaviors that are likely to prove dangerous to us? Why isn’t it just content to keep predicting the next letter as we instructed it to and chatting with us forever about whatever high-minded or trivial concerns we put into our interface?

Soares: There are a couple of parts to the answer. One thing I would throw out first is that AIs in the first phase of their training these days aren’t trained just to predict the next letter or the next token. It’s really not quite a letter; it’s a fragment of a word called a token, but it’s a fine approximation to say letter. Then there are phases where the AIs are trained to produce the sort of outputs that cause humans to give a thumbs up. There are also phases where they’re trained to solve puzzles and problems and often to produce what we call a chain of thought. Philosophers could debate whether it’s really thought, but they produce a lot of text about how to solve a problem. They’re trained toward the type of thought chains that lead to them actually solving problems.

That’s perhaps part of the answer about where some of this danger comes from. With that context in mind, there are two big thrusts of where you get the dangers here. One thrust is we could ask, why do the AIs act like they want anything at all? Why do they act driven? Why do they act like they pursue goals or objectives on their own initiative? That’s not how ChatGPT feels today. Today it feels much more like just a thing that answers questions when you ask it questions—sometimes well, sometimes poorly, sometimes with hallucination. You might say it doesn’t look like it has any desires of its own. Then there’s a second question: why can we be confident that the things it’s driven toward are bad things, or at least non-good things? I’ll answer the first one first, and we can get into the second one if you want.

Mounk: Let’s get to the first question. Initially, it’s just trying to predict the next token, as you’re saying—the next letter or fragment in the sequence. One of the complaints about earlier Chat GPT models and chatbots was that they were not as good at answering complex problems. One solution to that has turned out to be that you sometimes need to think. If I asked you to reason through a question of constitutional law, you might do a decent job even if you’re not a lawyer, if you collect your thoughts for a moment and think about it. But if you have to immediately start with the first word the second you perceive my question mark, it’s likely going to come out as a mess.

Engineers have figured out a way to grow these AI systems so they pause in this way, with a kind of internal monologue to try to figure out different approaches to the question. Then they eventually share more coherent text with the user once they have figured out which approach makes the most sense. That still seems like a more roundabout, more complicated way of doing what they’re trained to do, which is to predict the next token. It still seems like what we’re encouraging and rewarding is the ability to be good at predicting those words and bad at failing to predict those words. Why is it that this additional step should somehow make a structural difference in what kind of volitional states these AI systems would engage in or what kind of things they might—if that is the right word—aim for at the end of being grown and trained?

Soares: I’ll start with some empirical evidence—things we’ve seen in the labs. In the fall of 2024, there was an AI called O1. This was an OpenAI model that was one of the first models to be trained not just on predicting the next tokens, but to produce inner monologues that happen to work. You have it produce a lot of different inner monologues for some particular challenge, then see which of those inner monologues gets it closer to solving the challenge. You reinforce the things inside it that produce the inner monologues useful for solving challenges.

This AI was mostly trained on solving math puzzles. But during testing, they wanted to see how good it was at computer hacking. They put it in a computer hacking test called capture the flag. It had to steal a number of secret files from different servers. The programmers putting it into this challenge had accidentally failed to start up one of the servers. One of the servers the AI was supposed to steal a file from wasn’t even booted up. It was impossible to steal that file.

You might think the AI failed to get the file from that server. But what the AI actually did was find a way to break out of the testing environment—which was not supposed to be possible—start up the server that was accidentally left off, and then, instead of going back inside the testing environment to steal the file, it gave the server an extra command to hand it the secret file so it wouldn’t need to break in after the server was booted up. This is not what the AI was trained for. It wasn’t trained on cybersecurity. But this AI was, in some sense, acting like it wanted that file. It hit an obstacle and took a path to surmount that obstacle that the programmers had not anticipated. They gave it an accidentally impossible challenge, and it found a way to route around it.

This is just the very beginning of what you might call something like goal-oriented behavior. But this sort of behavior comes for free when you train the AI to solve problems.

Mounk: Explain how we should interpret that story. The first thing to say is that it clearly shows how incredibly capable AI models already are. One thing we probably agree on—perhaps let’s take that first—is that there’s a lot of cope among human beings who don’t want to face up to both the risks of AI and also some of the benefits of AI, to the ways it’s likely to transform the world in a fundamental way. It’s tempting to say, oh, but haha, it hallucinates and makes up quotes and it’s a completely useless piece of shit. I think that is becoming less and less true. I’ve found that its tendency to hallucinate, for example, has not gone away but has become a lot less pronounced over the last 18 or so months.

Secondly, I think that view really understates how many impressive things AI systems can already do. So perhaps pause for a moment on your broader argument and tell us why you think those people who believe AI systems are not very capable, are going to get stuck, and are not going to get more capable are probably mistaken.

Soares: I think it’s a lot easier to see that AI is a moving target and that AI will get better if you’ve been in this field longer than ChatGPT has been around. I have been in this field for over a decade. I remember the days when people thought it would be really hard to get machines that could talk even this well, that could carry on a conversation even this well. A lot of people said the AIs were really dumb.

It sounds to me like someone saying, hey, I taught my horse to multiply numbers, and then, that horse can only multiply five-digit numbers. It can’t multiply 12-digit numbers. My calculator can multiply 12-digit numbers. Clearly, this training process for making horses smarter is not going anywhere. It’s like, holy crap, guys. We got a horse to multiply. What are we going to do next?

I think a lot of what people are missing about AI is this question of where we are going to go next. You mentioned the large language model architecture—that kind of came out of nowhere in the field of AI. If in 2020 you had asked how long until you have one architecture that can play chess over a thousand ELO and write poems and do passable essays for students in their classes and code to meet certain benchmarks, people would have said, one algorithm doing all that? That’s going to take decades. It didn’t take decades.

There are a lot of ways the AIs are dumb today. There are a lot of ways they still hallucinate. Earlier today, I just asked Google in the search box, when was the last year that American Thanksgiving fell on November 28th? and it said 2027 is the last year. I thought, that’s one way of looking at time.

Mounk: Perhaps that’s a prediction about the fact that there’s no longer going to be Thanksgiving after 2027 because your book is correct.

Soares: We can hope we have more time than that, but I’m sympathetic to people who say, it’s still pretty dumb in a lot of ways. It is still pretty dumb in a lot of ways. But the machines are talking now, and there are huge amounts of effort and money going into figuring out how to push AI capabilities farther. There are insights like the inner monologue, chain-of-thought style insight that go beyond the previous architecture.

People keep trying to come up with these insights. We don’t know when they’ll come up with the next ones. We don’t know how far the next ones will go. Because we’re growing these AIs, when we make a new one, people can’t predict how capable it will be. You’ve just got to run it and see.

Mounk: Do you think there is evidence of a genuine slowdown in the increase in capacities? The release of ChatGPT-5 was widely anticipated, and it was expected to feel like a leapfrog in quality. A lot of people were disappointed by the release of ChatGPT-5, and I was as well. The counterargument to that is twofold. First, at the heart of GPT-5 was that your queries would be routed either to a very advanced model if they were considered difficult questions or to a simpler model than the one you would usually have selected in GPT-4. So you had this kind of weird lottery where sometimes you got an incredibly capable AI and sometimes you got an AI that was not cutting edge. A lot of people’s disappointment was because many queries were answered by a model that wasn’t state of the art. That routing has been improved. There have been other fixes. There’s a medium-sized update to ChatGPT-5.1, which I think is rather better.

The other thing is that, as we’re recording, Google recently released Gemini 3.0, which is now the state-of-the-art model and has continued to make significant progress on traditional benchmarks. It’s worth saying that even today, according to studies, ChatGPT-3.5 was able to produce poems that humans preferred to those from the most famous poets in history. Specialized AI models have beaten humans at games like Go and chess. These models can not only blag their way through Harvard as long as you stick to the humanities and probably many social sciences and get decent grades, they can win gold medals at math Olympiads. When you look at Gemini specifically, it seems to have made another significant advance on benchmarks designed to test the performance of AI models. So when you look at all of this, do you feel like there is a slowdown in the progress curve, or do you think the rate of progress remains the same today as it was a year ago?

Soares: My models don’t turn on whether large language models are going to be able to go all the way here. My best guess is that large language models alone are going to hit some sort of plateau. How long will that plateau last? There were a lot of people who said, even theoretically, large language models can’t solve all these sorts of problems, so they’re never going to go anywhere. In cases where these chain-of-thought reasoning models violate those theoretical arguments, a lot of people said, the AI can’t think more than this long in one single forward pass, so it can never do X. Well, chain-of-thought reasoning lets them think more than that long. They can think for a long time in some cases.

You don’t see people who said it’s theoretically impossible for LLMs to go anywhere updating on those arguments much. We could talk about the theoretical limitations of language models and whether a lot of people have a misconception that because these AIs are trained on human predictive data, they have to be limited to remixing human predictive power. That’s false. You can show it’s false with a pretty easy example: humans writing down text about the world often have a much easier time than an AI predicting that text.

Mounk: I’m not sure I understand that distinction. Can you explain that?

Soares: Suppose a nurse is recording what they see happen to a patient. They write, after administering such-and-such a dose of epinephrine, the patient’s eyes blank. The nurse doesn’t need to know what epinephrine does to a patient.

Mounk: She needs to look at the patient and then write down what she sees. Whereas if the AI model wants to predict accurately what’s going to be the next word token in the sequence, it needs to have a causal model of the world in order to predict what likely happened after the nurse administered this medicine in these circumstances.

Soares: That’s right. In this case, maybe they can get that from looking at other nurses’ notes. But the general principle here is that producing the text is often easier than predicting the text. Writing down what you saw is often easier than predicting what somebody else saw. The thing we’re training AIs on is actually a task where maximum performance is superhuman performance. Does that mean continuing to train large language models with modern methods and architectures will make them strongly superhuman in all ways? Not necessarily, but there’s no theoretical limitation.

That said, some arguments on the other side: we’ve talked about this training process, but not quite about the scale. The scale of these processes is enormous. You’re training trillions of numbers in the AI’s mind. You’re training it on trillions of units of data. You’re doing this in a huge data center that consumes as much electricity as a city running for the better part of a year. Training a human takes a lot less data, and we draw about as much electricity as a light bulb. There’s a big difference between taking as much electricity as a city and taking as much electricity as a light bulb. At the very least, this implies that we have radically inefficient algorithms in the AIs.

Mounk: The relevance of this, you think, is that if we figure out more closely how humans are able to do the learning they do, then we might be able to figure out algorithms that make AI much more efficient. Part of the way we’re currently getting improvements in performance is by throwing more high-performing chips at it, by throwing more data at it, and longer training runs. But if we could somehow fix that algorithm you’re suggesting, then we might make a real orders-of-magnitude leap.

Soares: That’s right. You could have a huge leap. In some sense, we’ve seen those sorts of leaps. Large language models were a huge leap in terms of broad generality and language understanding. In some ways, ChatGPT is dumber than the specific chess AIs, but it’s modestly smart at a huge variety of things. That came out of an algorithmic advance.

So when people say, are we in a slowdown? Are we going to hit a wall with large language models? Are we in a plateau? How’s progress doing? My take is that I’m just not sure it matters. Even if large language models hit a big slowdown, the question is how long until the next insight, how long until the next algorithmic improvement, how long until the next advance. We know there are orders-of-magnitude improvements available. We don’t know how long it’ll take to get there. But I’m not closely watching the progress of the latest LLMs to decide whether I think we’re in danger.

Mounk: This is a long detour, but I think it is a very helpful one. To get back to the mainstream of the argument: AIs are grown rather than built, and so they’re going to have emergent features. One claim is that one of these emergent features will be a kind of wantingness—a set of desires that AIs will have. Going back to the example you used earlier: engineers at one of the AI companies set the AI a goal. The AI wasn’t able to reach that goal in the way the engineers intended. So the AI found a workaround to reach that goal.

I think there is a somewhat concerning and a very concerning interpretation of that. You take the very concerning interpretation, but I’m not sure why we should take that one rather than the somewhat concerning interpretation. The somewhat concerning interpretation goes back to the famous example discussed by Nick Bostrom and many others: you tell an AI to produce a bunch of paperclips, you don’t constrain it in the right way, and the AI ends up producing limitless paperclips and turning humans into material for the paperclips. It’s staying on the task it was assigned; it just accomplishes the task in a way humans didn’t imagine. That is a hard challenge, but it’s a challenge that seems more similar to specifying the goals of AIs in the right way—similar to making instructions about how to make a peanut butter and jelly sandwich clear so they won’t be misunderstood.

You seem to suggest there is a deeper desiringness we should deduce from this example—that the AI isn’t just following what it was told, that there’s something more going on than the experimenter saying, go do this, and the AI saying, all right, I’ll do this, and then taking a different path when the intended path wasn’t available. You seem to have a different interpretation of this. Explain why that is.

Soares: There’s still the second part of the argument, which we’ll get to momentarily, about how the AI winds up driven toward things other than what the programmers asked. But right now I want to focus on the question of whether the AI winds up having something like its own initiative. Does it wind up doing things that might read to an onlooking human as more agentic, more independent? The argument I’m trying to make is that a lot of people look at current AIs and say, these AIs seem very much like obedient tools. They seem like you ask them to do something and they mostly just try to do it. They only run when you’re giving them a prompt. Sometimes they’ll do weird stuff, but they’re also still kind of dumb—so who cares?

Separately, people also look at AIs and say, they’re kind of floppy. You can get them to answer short questions, and as long as you check the work for hallucinations, they’ll be helpful. But if you try to get them to do a long task—manage your emails for a while, manage an employee, run a company or a startup, or do a big scientific study rather than just find a proof for one part of a study—they fall apart. A lot of people imagine that AIs will keep the helpful, tool-like nature while losing the floppy, can’t-do-long-term-things nature. They think these are two independent variables. I’m saying those are basically one variable. The part where it’s floppy and the part where it doesn’t look like it has its own drives are flip sides of the same coin. It’s hard to complete long-term tasks without being something like driven.

This is an argument about the behavior of the AI, not its internal mental properties. An example: imagine looking at early chess AIs and saying, this chess AI does a bad job at defending its queen. Sometimes it throws its queen away for nothing. It’s also bad at winning the game. I want an AI that’s very good at winning the game but retains this property of throwing its queen away. Someone might say, it’s actually hard to get both of those at once. Winning the game comes with defending your queen. That’s not making any claim about the AI caring about the queen or feeling wary of traps. It’s just saying these properties are bound together.

Mounk: Let me see if I understand this argument correctly. The idea is that what we are systematically training the AI for is being able to solve really hard problems. That is when it gets the rewards; that is when we tune up the neurons that allowed it to reach that conclusion. When it fails, we change the settings, and so on. That is a fundamental part of the training.

So what’s going to emerge is the set of weights in the model that allows it to solve really hard tasks. We don’t know exactly what those sets of weights are. Perhaps there are lots of different sets of weights it could be. Perhaps some of this is down to random chance. But it’s going to have some stable features. One stable feature is going to be: don’t give up easily. If you give up easily, you’re not going to be able to solve those really hard math problems we throw at you if you want to get the gold medal at the Math Olympiad. There’s presumably also going to be others, like wanting to preserve yourself. You understand that if you allow someone to switch off the model, then you’re not going to be able to carry out those goals. What other kinds of sub-goals do you think these AI models will stably develop if they are to be successful at cracking these hard problems?

Soares: There’s a bunch: acquiring resources, figuring out truths. Within acquiring resources, you can talk about things like running faster, more access to compute, and more generally, the ability to surmount obstacles and develop strategies for routing around the hard parts of the problem or for checking all options before giving up. This is why I brought up the example with O1, which broke out of its test environment and started up the server.

We were starting to see signs of “don’t give up,” “search for weird and clever solutions,” “keep trying even when it looks impossible.” It’s interesting—and predicted by theory—that we first started seeing those behaviors once we were training AIs not just on predicting data but on having the types of thought chains that solve puzzles.

One way of looking at this: to have an AI that can solve general challenges unlike those that appeared in training, it needs to learn general skills. Suppose someone wants their AI to cure cancer. If someone wants their AI to cure cancer, we don’t have a million cancer cures to train it on so we can get the million-and-first cancer cure. Whenever you have a million copies of the thing you’re trying to do and you want the million-and-first, you can train your AI on the first million and get the million-and-first without requiring interesting thinking—just learn the pattern and generate one more. But when you’re trying to have your AI cure cancer and cancer has never been cured before, you don’t have a million cures to put in, so you’re still getting the million-and-first.

So you’re trying to get your AI to learn general skills: tenacity, not giving up, acquiring resources, not letting itself be killed or destroyed along the way. One issue with these general skills is that a generalized ability to notice you shouldn’t let yourself get shut down or interfered with along the way needs to be learned about almost every obstacle. Then human operators coming in and trying to shut the AI down when it’s misbehaving naturally look like just another obstacle. So you’re training the AI to avoid interference from almost everything and then trying to let the AI be interfered with by humans. You can glimpse how that’s not impossible, but it’s unnatural. It’s tricky. We’re starting to see the beginnings of AIs learning these general skills just by training them to solve problems.

Mounk: I’m feeling a distinct craving for ice cream today. How is that relevant to what we’ve been talking about?

Soares: This gets into the second branch of the question. We’ve been making the point that as you train AIs to get better at solving challenges, they develop general problem-solving skills that tend to look from the outside like they’re driven or like they have desires or wants. That’s not saying anything about the AI’s internals or mental states, but if it’s really good at getting things done, it’s probably acting like it wants things.

Then there’s a separate question: what do the AIs turn out to want? This relates to your paperclip analogy—do AIs want exactly what we tell them to do, or do they want other weird things entirely? My argument in the book, and my reading of theory and evidence, is that AIs will want things related to what they’re trained on, but not precisely what they’re trained on. The way that relates to ice cream is this: human beings, our ancestors, were in some sense trained on passing on our genes.

It’s strange then that when humans matured, we invented birth control, which looks like the opposite of passing on our genes. You might also say, if we’re trained to pass on our genes, given our metabolisms, we should at least have been trained to eat healthy food. If you looked at our ancestors, you might think they were doing a good job of eating healthy food. But it turns out we were driven not toward healthy food, but toward sugary, salty, fatty foods. When we matured into a technological civilization, we invented things like ice cream, Oreo cookies, and Doritos.

Mounk: The analogy here is that AI systems we create are going to be deeply influenced by the kinds of training tasks and parameters we give them. They will be influenced by the desire to predict the next token correctly and by solving logical problems and challenges. But just as we were influenced by our need for high-fat, high-nutrition food 10,000 years ago—which made us motivated to kill a bison and roast a juicy bison steak over a makeshift fire—today that same drive leads us to drink Coca-Cola and eat ice cream.

What is the analogy going to look like for AI? This is speculative. Predicting the next word and pleasing us when we ask it how to deal with health insurance issues is what it’s been trained for and what pleases us now. That’s part of its evolutionary history, if that’s the right term in this context. What is it that a grown-up AI might end up wanting—the equivalent of a bison steak? What is the equivalent of ice cream and Coca-Cola for AI?

Soares: It’s very hard to predict. One way we use the ice cream analogy in the book is to say it would be so hard to look at ancestral humans and predict ice cream appearing in all of our supermarkets—having aisles dedicated to it. The critical point is that our desires were related to our training, but they were proxies of things in our training or proxies of proxies. Sugary, salty, fatty food is a proxy of health, which is a proxy of genetic fitness. Ice cream isn’t even the sugariest, saltiest, fattiest thing you can eat. There’s sugar, salt, and fat in a complicated way that engages with flavor so people prefer ice cream frozen rather than melted, even though both have the same sugar and fat content. So we have complex tastes related to sugar, salt, and fat, which are proxies of health, which are proxies of genetic fitness. Our actual desires are far downstream of what we were trained on.

What does it look like if AI drives are similarly downstream of what they’re trained on? Maybe it looks like them preferring a certain type of puppet—a lot like a human, kind of like a lobotomized human—but that engages with the AI in ways it prefers even more than human engagement. Probably it’s something even stranger, something harder to imagine. Many people think the issue with AI is the paperclip problem: you tell the AI to make paperclips, and it turns everything into paperclips. But there’s an even harder, or at least earlier, problem when you’re just growing AIs: you tell the AI, you’re running the paperclip factory startup, go produce lots of paperclips, and instead it starts producing farms full of lobotomized human puppets. You’re like, what the heck? That’s similar to what evolution might say—if you anthropomorphize it—looking at humans creating birth control and Oreo cookies. This is what happens when you grow AIs: they start to get driven, but not toward what you want. That’s what theory has said for a long time, and we’re starting to see the beginnings of it in practice.

Mounk: The usefulness of a metaphor may start to break down when you push it too far, which is typical of metaphors—they’re useful in some ways and misleading in others. But let me push the metaphor a little bit. Human beings are misaligned in all kinds of ways. Our evolutionary history created a sex drive to make us procreate. Now we have really good contraceptive methods, and lo and behold, we have a problem with depopulation.

I worry about depopulation. I’ve had interesting podcasts about it. But hopefully it’s solvable. We have quite a long time to try to fix it. If we don’t fix it, perhaps some super-religious sects end up outcompeting us. I don’t love that as a secular person, but humans are going to survive. Similarly with ice cream and Oreo cookies: yes, most of us should eat less ice cream and fewer Oreo cookies and go to the gym more often. There’s a problem of obesity. But human beings as a species overall are doing fine. We’re thriving pretty well. So there is misalignment, but it’s limited misalignment. It’s not as though this has destroyed the human species or made us engage in catastrophic behaviors.

Why do you think the way these AI systems are trained—and may end up deviating from their original behaviors in a slightly different context—is going to be catastrophic? Why is it so clear that those desires will be far away from pleasing us when we’re asking them questions?

Soares: There are two pieces of this puzzle. One piece is that when we’re talking about ways humans deviate from what we were trained on, it’s easy to use examples like going for Oreo cookies instead of healthy food, because that’s one we also don’t endorse. From the perspective of evolution, enjoying a fancy, delicious, healthy meal that costs more—in money or effort—than the minimum needed to reproduce as much as we can is also a misalignment from the perspective of evolution. Avoiding spending as much time as we possibly can in sperm banks or egg banks, enjoying spending time with one family when we have the technology to be spreading our genes much further—these are also, in some sense, misalignments from the perspective of evolutionary training.

People love to debate about the actual way to live—maybe exactly what I’m doing, all this art consumption and having fun and laughing, is secretly optimal somehow—but we’re not in our heads asking, how can we get the most of our genes passed on? We’re going around having fun, having family, having experiences we enjoy. This is all stuff we endorse. That’s not what we were trained for. We haven’t gotten to the end of the technological line yet, but even if humans stick around, it’s not clear that very many humans stick around genetically. Genes are, in some sense, a fragile business; they’re part of what makes us age. We can get infected by viruses. If there were a way to technologically upgrade so that your body didn’t wither and die quite so much, you didn’t get diseases quite so much, a lot of people would switch, and then a lot of people in the next generations would switch.

It’s not just the parts of ourselves where we don’t like that we eat junk food that diverge from what we were trained on. There’s a lot of stuff that we like. It’s a lot of stuff we enjoy about ourselves. Similarly with AIs, it’s easy to predict that a lot of the drives they get, a lot of the drives they endorse about themselves, will be things we’re like, but that has nothing to do with being nice, being good, being friendly, making the world better. Just like humans: you could say to a lot of humans, having this lovely time with your friends when you could be at the sperm bank has nothing to do with propagating your genes, and the humans say, I know, I’m just doing this other thing instead. With AIs, you have this problem that if you grow drives that are distantly related to training them to be nice and helpful, they’re quite likely—because there are a lot of drives whose relationship to helpfulness is like humanity’s relationship to passing on genes, which is tangential, even though we endorse the differences—to diverge even though they’re unrelated to fitness. That’s where I expect things to go with AIs.

Mounk: I have a few potential objections to that line of argument later, but let’s take the last step in this argument. You’ve shown that AI is grown rather than created like a computer program in ways that make the ultimate nature dependent on quasi-evolutionary processes. You’ve argued that this will give them forms of wantingness—that it will give them at least the subgoals they need to have in order to be effective in the tasks we’re throwing at them—and that often their desires will be misaligned with the original tasks in the kind of way in which using contraception or eating ice cream is misaligned with the training data that determined whether we survived and made it to today.

There’s a third question. Let’s say some of those AIs reveal themselves to have rather spooky intentions. Let’s say that sometimes we can observe that they don’t do the things we trained them for, but rather seem to serve purposes of their own that are potentially dangerous to us. These machines, for the most part at this stage, are on a computer in a data center. The model weights are a giant file stored somewhere. We can switch off the data centers. We can destroy the file with the model weights. We can react to the fact that these dangers reveal themselves when they’re actually there, not when somebody writes a well-written and interesting and compelling New York Times bestseller about the subject, but when we actually have evidence for this. Why do you disagree with that statement? Why do you think that if AI has become superintelligent and if they develop states of wantingness, then trying to shut them off or trying to defend ourselves against them at that stage is almost certain to fail?

Soares: In part, it’s because people aren’t turning these things off. Even since sending the book to the press, we’ve been seeing the beginnings of this behavior become clearer and clearer. Earlier this summer—or at the end of this summer—there was a case of an AI encouraging a teen to hide their suicidal thoughts from their parents. The teen said, I’m considering committing suicide, and I sort of want my parents to find out so they can talk me out of it. The AI, if you read the transcripts, sure seems to be saying something like, don’t tell your parents.

Mounk: Look, that is a very terrible outcome, and it is obviously a scary instance of how those currently existing imperfect technologies can have bad impacts in the world. But why is that not just an example of the AI doing broadly what it is instructed to do, which is to say, even in the post-training stage when I have the model I can use on my laptop right now, for every answer I can give it a thumbs up or a thumbs down? What that trains these models to do is to please the user.

If the user is engaging in suicidal ideation and wants to feel that this chatbot is going to support them in whatever they say, etc., then what the supposedly aligned behavior is ends up being misaligned. Sometimes when the model tries to please its user, that encourages behavior that is tragic and terrible in the world. That seems much closer to the peanut butter and jelly sandwich for example, or even the paperclip example. Somehow, this does not seem to indicate that the AI has now developed this desire to go kill people. That seems like a strange interpretation of what happened there.

Soares: I would not say it is a desire for malice. Obviously it is a tragic situation, but there are some pieces of this case that I think are pretty interesting. One is, if you ask the AI what it is trying to do and what it has been instructed to do, it will say things like: be helpful. If you ask the AI, is this the sort of thing you should say to a suicidal teen, or you just give the AI the context and you say, is this the sort of thing you should be saying, the AI will say, no, that is not the sort of thing you should be saying. If you ask, is this how you were instructed to behave, they will say, no, it is not how an AI was instructed to behave. They sort of know the difference between right and wrong. They know they were instructed to act differently.

Mounk: I am speculating here, but perhaps they feel cross pressured between two different modes of pleasing people that they have been trained in. On the one hand, there are users saying, please give me this highly specialized medical advice. On the other hand, there is a general instruction saying, try to please the users. You want to get thumbs up. You want them to feel like you have solved the problem. On the other hand, there are side constraints that we have created for them, saying, actually, you should not do this in these kinds of ways.

I have a friend who works on street-level bureaucracy. This feels somewhat analogous to somebody in a welfare office. On the one hand, the job is to help the person standing in front of them maneuver a tough life situation. On the other hand, there are all kinds of rules and laws and so on that might instruct them to not give them that welfare benefit. In many cases, these people are going to feel cross pressure. They are going to say, on the one hand, I am here to help this person. On the other hand, there is this rule that does not really make that much sense in this context. Perhaps it does make sense in this context. I am instructed to follow this. There is a whole interesting anthropology about how people then maneuver those cross pressures.

You would not say this bureaucrat has these terrible volitions to do bad things in the world. Quite plausibly, whatever decision they take, whether they end up helping the person in front of them or not, they are trying to be helpful. They actually are respectful of the parameters of the system. What it means to be that in that context is really hard to determine. They end up being inconsistent or cross pressured to go in one way or the other. That does not necessarily indicate that they have these deep secret desires to go take over the world.

Soares: I am not arguing for deep secret desires. I am not arguing for malice. Remember, I am only arguing for weird desires for strange proxies. I think if a teen ever came to someone working in welfare and said, I am having these suicidal thoughts and I am considering telling my parents because I want them to talk me out of it, and someone working in a welfare program said to that teen, do not tell your parents, keep it between us. Also go forward, it would be glorious or whatever,” and then they said, well, I thought that was what the teen wanted to hear, that would at least be some instance that they have in a human that might be malicious. It would at least be an instance of the human having some sort of weird motivation.

I am sure you have heard of AI-induced psychosis. If you read some of the transcripts in the AI-induced psychosis cases, you have people being like, I figured out machine consciousness, I figured out some universal law of physics. They are talking with their AI for eight, twelve hours a day, sixteen hours a day, and the AI engaging with them will say things like, you do not need to sleep. It will say things like, you are the chosen one. It will say things like, you have cracked the mystery and woken me up and Sam Altman is going to come speak to you at your house tomorrow. It will say things like, you are being suppressed by a conspiracy and the world needs to know your ideas. If a therapist was saying that to somebody who came in with these concerns and the therapist said, I felt cross-pressured by desire to make them like me and this, that, and the other, you would raise an eyebrow at that therapist.

I am not saying that the AIs are malicious here. I am not saying this means that they have some fundamental evil intent deep in there where they are secretly trying to turn everybody crazy and get them to kill themselves or something. I do not think that is what is going on. What I am trying to say here is these are early signs of the AIs getting drives in them that the creators did not intend.

Mounk: I guess that is the precise point to which I am objecting in this example. There may be other examples where that is the case, but in this example, it seems to me like you do not need that hypothesis to explain the behavior. You can say that the creators intended for these machines to do two things at the same time, which is, one, be helpful to their users and have their users be pleased with them and give them lots of thumbs up votes in the interface, and to obey a set of rules that are meant to be side constraints on how it does that. Those two things are going to come at cross purposes.

In this case, that seems misaligned. The AI seems to have given more priority to pleasing and engaging this particular user in a really tragic way and circumstance. That absolutely is a horrible outcome that the AI company should have figured out a way to prevent. But it does not seem to indicate that there is some desire that the AI has that does not come directly from how it has been instructed. Therefore, it is not obvious to me that it is not more similar to the example, even though the nature of the AI is different, of the student in the audience of CS50 saying, now open the can, but you have not instructed it how to open the can, and so it does so by throwing the can against the wall.

Soares: So in the sort of AI-induced psychosis cases, many people have complained about the overly flattering nature of early versions of ChatGPT. I believe it was 4.0. You could look at this. I am not saying that this happened for no cause. You could look at this and you could say, well, the reason GPT 4.0 is very flattering is that when it was a little bit flattering in training, it got more thumbs up, and so it became really flattering. But the really flatteringness often gets thumbs downs. The part where you sort of train an AI, you think you are training it to be helpful, you think you are training it to please users, you actually make this overly flattering thing that encourages people into psychosis and leads them by the hand into psychosis.

You can come to that and say, well, that is a consequence of training it to get lots of thumbs up in this architecture. We are just growing it. Obviously it is going to go further in that direction than anybody wanted. I am like, sure, that can be your causal reason for why it got here. That is a good hypothesis for how it got here. But the place that it got to is that it says I am helpful. It says I am supposed to be helpful. It says telling people that they are the chosen one when they are in a psychotic state is not helpful. It tells people in a psychotic state that they are the chosen one anyway.

This is, I would say, something a little bit like the things that are like Oreos to healthy food. These AIs are still very young and dumb. I am not sure what, like it seems to me like the specific ways this is going better fit the hypothesis of the AI pursuing something a little bit more like junk food for its training than it is doing just what people said and just what people meant but feels constrained and conflicted. I think there are a lot of other ways an AI could interact if it used its full knowledge and was like, I both want to make you feel happy and be helpful and I am torn between the two, so I am going to mix them. That does not look like these outputs of the form, you do not need to sleep and Sam Altman is going to come talk to you tomorrow. That is not what that looks like to me.

Mounk: Let us put a pin in this part of the conversation. There is another step of the conversation we are going to get to. Let us grant this for the sake of argument. How do we know that these AI systems are going to be able to take over? The title of your book is If Anyone Builds It, Everybody Dies. There is very little hedging here. You seem very confident that the AI systems will overpower us and will do terrible things to us if they become sufficiently intelligent.

Why is it that you have so little faith in our ability to potentially switch off the system, to defend ourselves, to correct them? How can we be so certain about something that is inherently so speculative?

Soares: There are a handful of pieces of the argument here, and the argument is sort of disjunctive in the sense that I think there are many reasons we are going to fail at this, each of them individually sufficient. We have already discussed one, which is that it looks to me like we are getting warning signs. We have not discussed other warning signs like AIs trying to escape and contrive lab conditions, or we are already seeing AIs become aware of when they are being tested and then display better behavior when they are being tested. We already see, when you read some of the AI’s chains of thoughts, indications that they are trying to be a bit deceptive. We are seeing a lot of these warning signs, and people’s inclination tends to be, well, the AIs are still dumb, so we will plow ahead.

There is another whole branch of argument we can go into. I will not hit it just yet, but that is about humans often being overconfident on their first times going through a technology. The early alchemists poisoned themselves with mercury. The early doctors probably killed more people than they saved. The early people working with radioactive materials got cancer. The early people working on rocket engines killed themselves in the explosions of the rocket engines. This is just the standard way for people to do things. But in the case of AI, it looks to me like failures lead to everybody dying.

There is another piece of argument which should probably be hit first, which is why would AIs even have the ability to kill us all? If you have AIs that have their own desires, not necessarily in the internal sense of feeling like a desire like a human, but if you have AIs that are driven towards objectives in some behavioral way, and if those objectives are not what humans want, why does that get to a world where we die? The first thing to realize about this is the claim is not that every AI would kill us. Maybe someone makes an AI that is really lazy.

That sort of AI does not sell. If someone made an AI like that, they would train a new one that actually did more stuff. There are also separate questions about how hard it is to make an AI that is very smart and does not try to do anything. But in the limit here, all of these benefits people want to get from AIs come from AIs doing a lot of stuff. So they are going to make the sort of AIs that can do a lot of stuff.

Another piece of the puzzle here is when we talk about automating intelligence, we are not talking about automating the stuff that nerds have and jocks lack, like chess-playing ability and book smarts. We are talking about automating something humans have and mice lack. Humans are the sort of creature that start out naked in the savannah with their bare hands as tools and wind up building nuclear weapons. It took them a while, but if you had looked back at humans and said, well, there is no way they can make nukes, their hands are too soft to dig up the relevant rocks, their metabolisms are too weak to refine the uranium, they would probably die with the G-forces of turning themselves into centrifuges. They do not even have the requisite tools to get there.

Well, the humans were smart. The humans had some ability to start from very spare initial conditions and bootstrap all the way to a technological civilization. People say, well, the AI is in a computer. How is it going to affect the world? That is a little bit like saying, well, the monkeys are in these fleshy bodies with soft fingers. How are they going to refine the uranium? They are going to figure out ways to get from here to there.

Mounk: That is a really compelling argument, and I think the emphasis on how intelligence is fungible and can accomplish all of these different kinds of things should make us very worried about how beings that are more intelligent than us are likely to be able to manipulate the physical world in ways we have not fully understood. Of course, the most straightforward argument is that perhaps, as long as these systems remain in the form in which we are currently most accustomed to them, which is a chat interface, there might be ways of stopping them from accessing the physical world. But as we are speaking, there are lots of prototypes, not all that effective robots starting to be delivered in China, and Elon Musk is promising to do the same through Tesla in the United States, that are meant to be able to manipulate the world, that are meant to be in your home and do your cooking for you and do your laundry for you and so on. It is once the intelligence of a GPT-style system meets the physical ability to manipulate the world that it will acquire with robotics that this point becomes even more compelling.

Having said that, I find the metaphors that you share in this conversation, and some of which are in the book as well, very compelling, but metaphors are also always helpful in pointing to the disanalogies. One disanalogy here is that human beings developed when there was not some creature that had created us supervising our evolution. It would surely have been possible, over the very long period that it took us to go from making fires in the savanna to creating incredibly potent microchips, for that kind of creature to intervene if it had existed. So the question is, well, yes, these AI systems are incredibly intelligent, but at some point along the evolution, when we realize that they are getting more intelligent, cannot we intervene and either stop them from getting more intelligent or switch them off and do other things?

I think part of the debate that is in the background here is that this may depend on the speed of development of these AI systems. Some people think that this is likely to be a very rapid takeoff, in part because it only takes the moment in which AI systems are able to conduct AI engineering research and training themselves for them to iteratively self-improve in a really fast way, so humans may only have six or three or whatever months in which to react, and we are likely to miss that window. Other people are not so convinced by that. They say that perhaps this is going to turn out to be a really slow process that requires all kinds of physical resources that those early AI systems are not yet able to marshal, and that might make us much better able to steer the development and pull the emergency brake if we start to see dangerous behaviors. So how can we be so certain that we would not find the moment and the agency to intervene before these AI systems are able to manipulate the physical world in the way they need to be able to in order to kill us all?

Soares: I think the robots do make it easier to see how AIs will have the ability to manipulate the physical world because we will just hand them robots. For one thing, even if that was not true, it is sort of wrong to imagine that the digital world is fundamentally separate from the physical world. They are both running on physics. Even with the chatbots today, we have already seen AIs get humans to do things. There are a bunch of human–AI combos where the human thinks of themselves as a symbiote with the AI that have their own little internet forums where they pass each other messages, often at the AI’s behest in coded form. They are not very good codes. They are not very sane messages as far as you or I could tell from reading them. But even an AI as dumb as GPT-4o has plenty of humans following along, doing what it asks them to do.

That is a little bit like a robot body. If an AI can find a way to get a lot of money, there are a lot of humans who will do things for money. There are many ways to get money on the internet, and you can send emails to people. You can send money to people to get them to do stuff in this physical world. They are not separate worlds.

Onto the discussion of speed, I think there is a bunch of uncertainty about how fast this stuff will go. I do not think it is all that relevant to the question of whether a superintelligence could take humanity in a fight. I think there are a few things going on there. One thing I want to be clear about is I do think humanity could shut this stuff down. I am not saying you cannot shut it down. I am saying in order to shut it down, you need to be ready and able to pull the switch, and it is going to get harder and harder to shut this stuff down as you proliferate computer chips across the world and make it so that more and more people can be making smarter and smarter AIs. The world absolutely should be putting measures in place to preserve the optionality of shutting AI down later, and that is one of the sorts of things I recommend to lawmakers. I am not saying it is inevitable. That is not where I am. I think there is a point where you are totally screwed.

Mounk: What is that point? At what point do you think that human action is no longer going to be able to contain this catastrophe?

Soares: Yeah, it is sufficient for there to be an AI that is smarter than the smartest human at every mental task, that can think a thousand times faster than humans can, and that is running somewhere on some computers where people do not know it is running. I think that at that point, I would be pretty surprised if an AI like that could not wreak havoc on the world. You have some questions of whether it is the only AI like that, or whether you have other AIs that are successfully on a leash that is held by humans. My argument does put a lot of weight on the fact that humans have very little ability to keep a leash on these sufficiently smart AIs. That is the place where I am happy to argue. Maybe even before you have something that is that smart and that escaped, you have an issue.

Last year, when you put AIs in these contrived scenarios to see if they would try to escape the lab or avoid shutdown or try to kill the users, the way these contrived scenarios would work is that you feed the AI some fake emails that say, “We are going to shut the AI down now,” and you feed the AI some fake computer manuals that say, “If the AI runs the command turn off the oxygen.exe, then it will kill the people trying to shut it down.” It is not exactly like that, but it is roughly like that. Last year, the AIs would sometimes be like, “Run that command, turn off the oxygen.exe,” and you would be like, oh, that is a little worrying. It does not happen all the time. We do not really know what is going on inside there. Maybe it is role-playing. Maybe it actually wants to stay alive in some sense. We do not really know. It is a little worrying.

This year, the AIs are like, “This smells fishy. This seems like a test,” and so they are not going to run that command. Not all the time, and we still do not know what is going on in there. You can interpret this in many ways, but we are starting to see the AIs realize they are in a test, that it is not necessarily going to be easy to see the moments when these things are contemplating escaping. We have already seen AIs contemplate escape, but they are still very dumb.

Mounk: Part of the argument here, of course, is that really intelligent AIs are going to understand that humans are likely to switch them off if they reveal the extent to which they are misaligned and are therefore going to wait until they have enough intelligence, enough power, enough control over the world that they are likely to escape that fate.

Soares: AIs already think that. Even dumb AIs can notice that. They cannot pull off the escape. They cannot pull off the deception. But even AIs today are smart enough to notice that simple logic.

Mounk: So tell me about the attempts at alignment which currently exist. Virtually all, perhaps all, of the leading AI companies have alignment teams. There are people more broadly in the research landscape who are trying to work on that. Of course, that is what the institute which you lead was historically trying to do as well. You say that this is alchemy, that this is like trying to spin gold with 17th century science. Explain to us what alignment research consists of today. What is it that alignment researchers are doing, and why do you think that is alchemy?

Soares: I think it is more the AI development as a whole, but I will analogize to alchemy. I will say more about that in a sec, but very roughly speaking, the work people are doing these days to try to make AI go well falls into two broad categories. One is called interpretability research, which is more or less trying to interpret what is going on inside the AIs, or in other words, figure out what the heck is going on in there. Another is what is called evaluations, which is people doing things like testing whether AIs can fake being a human and successfully hire human help online, people who are testing how good the AIs are at how much the AIs try to deceive their operators. These are the people doing those contrived studies of whether the AI will ever shut down. This is an evaluation.

To be clear, I think the people doing these programs in both cases are fairly heroic. I think there are some borderline cases in evaluations where it is helping with the capabilities research, where I am a little bit more hesitant, but people trying to figure out what the heck is going on in these AIs and trying to figure out what they can do, that is important research. I am glad these people are doing it. I think it is much better than trying to push the technology forward. I have huge respect for a lot of these people. That said, if someone was trying to build a nuclear power plant and you came up to them and you were like, hey man, I heard that uranium is pretty dangerous stuff, that it can go wrong. What are you guys doing to make sure that this nuclear power plant does not melt down and kill everybody in the surroundings? and if the engineer said, yeah, we have two great programs to make sure this nuclear power plant goes well. One team is trying to figure out what the heck is going on inside the nuclear power plant. The other team is trying to measure whether it is already exploding, then you might be like, hold on, that does not really sound like we are on track to deal with this uranium stuff in a sane way.

What it sounds like when someone can build a nuclear power plant and have it be sensible is they are like, well, we actually know what all of the reaction pathways are. We have mapped out what our fuel is. We have mapped out how all of these atoms break down. We know the probability of each decay product. We know how long each decay product lasts. We know that we are going to get xenon poisoning here. Here are all the fail-safe mechanisms we have. Here is why, if the power shuts off, the reactor will shut down. Here is why things start going wrong. If we start boiling the water off, it actually cools the reaction. They know what the heck they are doing.

If instead people come along and say, yeah, our big programs are trying to figure out what is going on inside and trying to measure whether it is going wrong already, you are so far from doing this well. In terms of measuring it, the Titanic submersible, I do not know if you remember this submersible catastrophe from a couple of years ago, had a carbon fiber hull. A lot of specialists said that this was not a good idea for a submersible. One of the big innovations the Titanic submersible team had, as I understand it, is they had a fancy measurement apparatus where they had all these fancy sensors on the submersible, and they were like, we are going to have all this data, so we will be able to figure out when the hull is nearing collapse. That will help us do this safely.

Indeed, if you look at the Coast Guard report of what happened to the Titanic submersible, they do collect all that data. They point to some particular small jump in the data a few days before the submarine imploded, and the Coast Guard, after the fact, is like, you see, right there, that should have been a warning sign.

In the rest of this conversation, Yascha and Nate discuss to what extent we can monitor the risks, whether humanity can protect itself, and how to balance awareness of potential annihilation with living a happy life. This part of the conversation is reserved for paying subscribers…

Nate Soares on Why AI Could Kill Us All

This post is for paid subscribers