David Bau on How Artificial Intelligence Works

Yascha Mounk and David Bau delve into the “black box” of AI.

David Bau is Assistant Professor at Northeastern University and Director of the National Deep Inference Fabric, researching the emergent internal mechanisms of deep generative networks in both Natural Language Processing and Computer Vision.

In this week’s conversation, Yascha Mounk and David Bau explore the technology behind AI, why it’s concerning that so many computer scientists don’t understand how it works, and how to embed morals and values in these systems.

This transcript has been condensed and lightly edited for clarity.


Yascha Mounk: We met recently at a workshop about artificial intelligence at Harvard. I thought that in a conversation we had, you helped me understand the nature, the architecture, and the technology of artificial intelligence better than anybody had before. I thought I would love to talk to you about this on the podcast. Fundamentally, how do current AI models work? When we say they are LLMs, large language models, what does that mean and how does that distinguish this form of artificial intelligence from other forms that we’ve historically used?

David Bau: They are generative models, which means that they are open-ended models of a kind of behavior, as opposed to being trained to make really narrow decisions. What was really popular in AI up until recently was to train classifiers to solve specific problems, like to help you make a specific decision.

But then in recent years, it has been more popular, or actually more amazing, to create models that have more open-ended goals. So a large language model is pretty simple. In concept, its job is to imitate human language. But imitating human language is a lot richer than making a yes or no decision or answering a simple multiple-choice question, which is what we used to train AI to do.

Mounk: Perhaps help us understand what these classifiers were, to understand what the difference is with these large language models. It sounds like you’re saying the point was to classify into yes or no, into one of four or five different kinds of buckets.

Bau: Yeah, I’ll describe it in the way that I describe the difference between these two types of models to students. Basically, if you train an AI to classify inputs into a bunch of different categories, what you’re doing is asking it to tell you the difference between things. Let me give you a couple of examples. You could ask the models to tell you the difference between a picture of a cat and a picture of a dog.

Or for a more realistic application, you might ask an AI to tell you the difference between a piece of writing that was written well and a piece of writing that was written badly, or a movie review that was positive or a movie review that was negative. A company like Yelp might do this to take a look at your reviews to see if you tend to write positive reviews or negative reviews, or if a specific review is positive or negative.

Mounk: Is that what the startup in the TV show Silicon Valley is doing, the one that seems really stupid and then suddenly becomes important? He is trying to classify whether something is a hot dog or not. So that was a classifier.

Bau: Yes, I think that’s right. So it was really one of the earliest problems that people put in front of AI. I think the very first neural network that was ever made was a thing called the perceptron, and it was trained to classify the difference between pictures of boys and pictures of girls.

They took a lot of pictures of students and showed that this little neural network, which had 64 neurons, if properly configured, could tell the difference between these types of pictures. That was considered an amazing feat at the time. That same class of problems has been with us for more than 50 years.

It is a powerful framework, but it gives the machine learning models a lot of space for taking shortcuts. For example, if you needed to tell the difference between a cat and a dog, it might be sufficient to not look at every aspect of that picture. You might be able to just look for the tips of the ears and recognize that cats have pointy ears and dogs tend not to, without looking at the rest of the image.


One of the things that classifiers are really good at is identifying the most salient difference, focusing on that, and making a decision based on it, which is great. It leads them to be very accurate, but it also means that they do not necessarily develop a complete understanding of the world. If you invented a picture of a pointy-eared dog and gave it to one of these classifiers that was focusing on the tips of the ears, it would say, that’s clearly a cat. It might not recognize that there was something else wrong with the image.

Mounk: Even if the rest of the image very clearly looks like a dog, and it is not really a hard case in other ways.

Bau: It is the kind of thing a classifier would do, yes.
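
To make the shortcut problem concrete, here is a toy sketch in Python of a classifier that decides between cat and dog by looking only at how pointy the ears are. The feature and the threshold are invented for illustration: it gets ordinary cases right, and it confidently mislabels a pointy-eared dog, because that single feature is all it ever consults.

```python
# A toy cat-vs-dog "classifier" that leans on a single shortcut feature.
# The feature (ear pointiness) and the 0.5 threshold are invented for illustration.

def classify(ear_pointiness: float) -> str:
    """Return 'cat' if the ears look pointy enough, otherwise 'dog'."""
    return "cat" if ear_pointiness > 0.5 else "dog"

print(classify(0.9))  # a typical cat  -> "cat"
print(classify(0.2))  # a typical dog  -> "dog"
print(classify(0.8))  # a pointy-eared dog -> still labeled "cat": the shortcut fails
```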

Mounk: Tell me a little bit about the technology behind it. We may be going too far back now, but is a classifier a similar kind of technology with fewer neurons, whatever that means? Or is it a completely different kind of thing?

When we move from classifiers to the kinds of large language models we have now, is it building on the same technology, or is it a completely different avenue toward how to create this form of intelligence?

Bau: It is basically the same kind of technology. I would say that there are relatively few major innovations that separate classical classifiers, as invented in the 1950s, from modern large language models. There have been a lot of gradual, small, clever innovations, but in terms of major innovations, there have been relatively few.

We are really going after the problem using the same techniques that we have used since the 1980s, when a lot of the innovations that we are still using today were established.

Mounk: Tell us about what these techniques are. One idea that often comes up is neurons and neural networks. I can understand what a neural network might be in my brain. I am far from being a neuroscientist, but I understand that a neuron is a kind of cell, unless I am getting that badly wrong. My brain is a kind of neural network. All of these cells are connected in some complicated way. What does it mean for an AI to have neurons or be a neural network?

Bau: There is a popular term that you will see sometimes called a deep neural network. What makes a neural network interesting is its depth. A neural network is inspired by the architecture of the human brain. All it does is compute a bunch of numbers. The input comes into the neural network as words or as images, and the first thing you do is convert it to a bunch of numbers and feed each number into a neuron.

Then you connect the neurons so that they pass the numbers from one to the other. If a neuron has a bunch of inputs, it adds the numbers together, does a small bit of computation, and then creates another number that feeds onto the next layer.

Neural networks are just this big mess of neurons connected to one another to produce some output that you hope is useful. If you just created a random neural network, it probably would not do anything useful, but it would do something with your data. The trick for artificial intelligence, the trick for machine learning, is to train the neural network to strengthen and weaken all of the connections between the neurons, transforming a random machine, a random function, into something that does something useful. It is the training process that makes a neural network seem almost magical.

Mounk: Okay, so take me a step back. I want to understand the training process, but before that, I now get this image of a bunch of human cells—or a bunch of digital cells, whatever exactly that means—transmitting this information. You see something with your human eye, that is a visual stimulus, and in some way that stimulus gets translated into a bunch of signals that neurons fire at each other.

Then there is some way of using that information. Help me understand a little bit more how that works in a computer and what the point of these simple calculations that you mentioned is.

Bau: The simple calculations are probably simpler than you would imagine as a non-technical person. Every neuron just takes a sum of all the inputs that come into it. It is a weighted sum: if you have one neuron that is connected to 1,000 inputs, it multiplies each input by the strength of its connection, adds them all up, and then looks at whether the total comes out to be positive or negative.

If the numbers come out to be positive, the neuron will do one thing in the output, such as transmitting the sum out. If the numbers come out to be negative, the neuron will do something different, such as outputting zero. That output then becomes another number that goes into other neurons. Each neuron is an extremely simple step. It is just adding things, then looking at the answer, and then producing an output. You might think, oh gosh, why the heck is this useful?
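
For readers who want to see just how small each step is, here is a minimal sketch in Python of the arithmetic a single artificial neuron performs as described above: multiply each incoming number by a connection weight, add everything up, and either pass the sum along (if it is positive) or output zero (if it is not). All the numbers are randomly generated placeholders.

```python
import numpy as np

# Made-up signals arriving from 1,000 upstream neurons, and made-up connection weights.
inputs = np.random.randn(1000)
weights = np.random.randn(1000)

# The neuron's entire job: multiply each input by its weight and add everything up.
weighted_sum = np.dot(weights, inputs)

# If the total is positive, pass it on to the next layer; if not, output zero.
output = weighted_sum if weighted_sum > 0 else 0.0
print(output)
```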

Mounk: I can imagine what it looks like, but I am having trouble understanding why it is useful. You have these very simple calculations, they add up, so how does that mean that I can—I know we are jumping a number of steps—but how does that relate to why I can have a conversation with ChatGPT using my voice and it talks back to me? I know there are going to be lots of steps to get there, but why is it that this neural network is such a powerful tool, such a useful idea?

Bau: Why is it so useful? You can answer this in two ways. But I have to say, the question you are asking is actually one of the core puzzles behind neural networks. Let me give you a little bit of history. Neural networks are one of the oldest forms of programming ever devised, dating to the 1940s, before digital computers were widespread. They have been with us ever since in different guises.

One of the reasons it has taken so long for neural networks to become so prominent in computing is that everybody had the question you are asking now. Even if you can get these things to work, why would we expect them to work? In the 1950s and the 1960s, a scientist named Rosenblatt demonstrated that you could get neural networks to do some useful things, but it never really caught on because everybody had this question: there are these neurons just passing numbers around. What do these numbers mean? How can we be sure they do anything useful? Is there anything that explains why they do something useful at all? Sometimes they do not work. Can we tell the difference?

It really took a long time. There were many different ways of doing machine learning where the numbers inside the AI were more understandable. For many years, mainstream AI scientists believed that you should use one of these other approaches, which were more transparent and designed to be explainable. It is really just in recent years that we have decided to use neural networks that have this key disadvantage: we do not know what the numbers that come out of a neuron are for. We do not know if they have a particular purpose. We do not know if some of the neurons are more important than others. We do not know under what conditions they learn something good or learn something bad. It is very opaque.

But one of the things that has happened is that neural networks work so well. We have some understanding of some of the things that lead to them working so well. They work so well that the field has become comfortable with the idea that we should just use these black boxes. They are so useful that maybe it does not matter that we do not really understand what the neurons are doing, what they are for, or what is being computed inside.

One of the reasons I was at this workshop is that I am really concerned about this philosophy we have adopted in the AI industry and in machine learning. Historically, as engineers and computer scientists, it has really been our responsibility to understand the systems that we make, to make sure they are doing what we want, that they operate correctly. I feel that the new discipline of machine learning, because it has become so important to just accept these black boxes and use them even though we do not understand them, is leading to an unhealthy turn in the practice of engineering and computer science.

We are training a whole new generation of computer scientists to be comfortable with this idea that they should not really look inside these complicated black boxes, that it is not something understandable or their responsibility to understand. I think this is a fundamental error. One of the most important things we should be doing as AI scientists and practitioners is to try to resolve this fundamental problem with large-scale neural networks and to understand what is making them work, what causes them to work well or badly, or to acquire certain types of behavior.

Mounk: You work on interpretability, which I understand is basically an attempt to look under the hood to get a better sense of what is going on in that black box in various ways.

Bau: That is right. I work in an area that you could call post-hoc interpretability. There are a couple of approaches to achieving interpretability in machine learning. One is to simplify the machine learning models to the point where a person can look at them and understand and explain what each of the steps is doing. But the area that I work on is post-hoc interpretability.

That means, let us say we did not do that. Let us say we decided to use a neural network with millions or billions of neurons that are far too complicated for us to have a preconceived idea of how they work. Can we go in there and analyze the system the same way that a biologist might analyze a complicated, emergent biological system? Can we understand the structure of these learned computations after the fact, after they are trained, even if we did not try to restrict them ahead of time to make them understandable by people?

Mounk: I would love to get into a little bit more detail about what you think we can understand and what progress we have made in that field of interpretability. But to go back to the bottom-up building of our understanding of these AI models: we have these neural networks. They are a very simple process, actually. We do not fully understand why they have proven to be so phenomenally useful.

We started out having limited resources, with a limited number of neurons in these networks and probably limited data to throw at them. We might have been able to get them to distinguish between a dog and a cat, between a girl and a boy, and perhaps between something that is a hot dog and something that is not a hot dog. Then somebody said, let us raise the scale of ambition. Let us throw a lot more data at it. Let us do a lot of other things—you will tell me what they are.

Perhaps they could then become general-purpose large language models that are not trained to one very specific goal, but can actually help us with a huge variety of tasks, from writing a poem to summarizing a text to generating an image. What technical progress or what changes have allowed us to get there? How is it that this neural network suddenly scales to be able to do these kinds of things?

Bau: There are a couple of things to know, some of which readers or listeners might have heard of. Let me first emphasize that these really large neural networks are not that different from the classifiers that we have worked with for 50 years, other than the fact that they are bigger. From a technological point of view, from the way it is built, a language model is a classifier. It solves a slightly more open-ended classification problem than we solved previously, but it fundamentally just solves repeated classification problems. And what is the classification problem? It is this big multiple-choice question: what is the next word?

Is the next word cat? Is the next word dog? It is not a two-way choice that the language model is facing. We typically give a language model a vocabulary of something like 50,000 words, syllables, and letters. We tell the language model, you have this 50,000-way choice. What is the next word that is the right answer in this context? As input, we give the language model all the previous words, and we ask it, tell me what you think the right next word is.

So it is just a classifier, just like we trained classifiers to tell the difference between cats and dogs. But the scale of modern parallel computing GPUs has allowed us to make these classifiers capable of doing larger-scale classification problems. The outputs are bigger—there is a 50,000-way choice instead of a two-way choice or a ten-way choice, which we might have traditionally done. The inputs are bigger—instead of just one image or one sentence, we can feed these models entire books, entire histories of text, for them to look at to decide what the next word should be.

There have been a couple of key architectural innovations that allow these models to consume and learn to use such huge inputs and solve open-ended output problems. But fundamentally, they are just neural networks, a bunch of neurons connected in the same way that Rosenblatt was connecting neurons in the 1950s.
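
As a rough illustration of that 50,000-way multiple-choice problem, the sketch below turns a vector of scores, one per vocabulary entry, into probabilities and picks the most likely next token. In a real language model the scores would be computed by the trained network from all the preceding words; here they are random placeholders.

```python
import numpy as np

vocab_size = 50_000  # roughly the vocabulary size mentioned above

# In a real model these scores (one per vocabulary entry) are computed by the
# network from all the previous words; here they are random placeholders.
scores = np.random.randn(vocab_size)

# Convert the 50,000 scores into a probability for each candidate next token.
probabilities = np.exp(scores - scores.max())
probabilities /= probabilities.sum()

best = int(np.argmax(probabilities))
print(f"most likely next token id: {best}, probability: {probabilities[best]:.5f}")
```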

Mounk: So you think it is basically the same as what we were using 80 years ago, which is astonishing. But there have been a few technological innovations that allow these models to consume so much information and therefore go beyond the limits of earlier models. Can you give us an example, or is there a particularly important innovation in that respect?

Bau: I think that one of the things that has happened in recent years is the rise of a particular neural network architecture strategy called the transformer, which has really taken over the industry. There used to be a wide variety of different neural network architectures, but more and more we are converging on using transformers for everything. The fundamental thing that transformers do is introduce a form of short-term memory that we call attention.

What this means is that instead of the models only consulting what they learned during training, the models develop the ability to learn from the inputs they are provided. Let me give you a simple example.

If there was a particular person that you were asking the language model about, then with traditional training you would expect that language model to only be able to answer questions about that person if information about them appeared in the training data. But in real life, you often have situations where somebody asks you a question about a person you just met. You did not meet this person in your childhood. You did not read about them in school. You just had a conversation with them, and then somebody asks you about the person, and you have to answer now. This kind of short-term reasoning is something that traditional neural networks are not very good at.

What the transformer architecture does is introduce a special way of connecting the neurons called an attention layer, which allows the network to look back at previous things that happened recently in the input and use them as a type of memory, manipulating those memories and reasoning about them. This has turned out to be so powerful that you can really think of transformers as a different class of neural networks. It was a major innovation.
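
For the technically curious, the attention mechanism Bau describes reduces to a small matrix computation: every position in the input compares its "query" against the "keys" of the other positions and uses the resulting weights to blend their "values" together. The sketch below is a stripped-down version with made-up numbers; real transformers add learned projections, multiple attention heads, and causal masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, dim = 6, 8  # six tokens of context, 8-dimensional vectors (toy sizes)
queries = np.random.randn(seq_len, dim)   # what each token is looking for
keys = np.random.randn(seq_len, dim)      # what each token offers to be found by
values = np.random.randn(seq_len, dim)    # the information each token carries

# Each token scores every other token: "how relevant is that word to me right now?"
relevance = softmax(queries @ keys.T / np.sqrt(dim))

# Each token's output is a relevance-weighted blend of the other tokens' values.
attended = relevance @ values
print(attended.shape)  # (6, 8): one updated vector per token in the context
```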

Mounk: So the difference, if I think in terms of my interaction with something like ChatGPT, Claude, Grok, or any of the other AI models, is this: if the model did not have a transformer, the problem would be that every question I asked it would be answered the same way it would have been before I started the conversation. It would essentially be difficult to have a back-and-forth conversation.

The transformer is what allows it to keep a conversational context in mind and to have an ongoing, progressive conversation. Or did I misunderstand that?

Bau: I think that is right. The transformer has really made it possible to teach the neural network things just by telling it something, rather than having to control the whole training process. When you have a conversation with a person, you are constantly learning and teaching the other person your ideas. You are sharing your concepts, sharing your understanding of the world with the person, and they absorb it. That allows the conversation to proceed.

Transformers allow a neural network to develop the ability to have the same kind of conversation, to develop an understanding during the course of a conversation, to learn things in the short run that it uses to make immediate responses, and to recall things that happened recently, as opposed to only relying on long-term memory.

Transformers were not the first architecture to try to do this. There were architectures proposed in the 1980s to do this, called recurrent neural networks (RNNs). You might have heard of LSTMs, long short-term memory networks. The idea that you should have a short-term memory, that you should be able to solve problems in this way, is not totally new.

What a transformer does is make it efficient to train networks that can do this. It was an innovation that showed this old idea could be made practical and scaled up in a big way.

Mounk: Technically, can somebody with a computer science background understand how the transformer does that? Or do we get into areas that are too complicated to understand?

Bau: I think the main thing to understand is that if you were to try to train something with short-term memory, then since short-term memory seems like such a sequential process, the natural ways that you would end up training a neural network to have short-term memory are very sequential. They are one at a time.

You tell the neural network some things and then immediately turn around and ask it to make a prediction of the thing that you just told it. Then, based on that, you might go on to the next step, because time proceeds in a stepwise fashion. That can be done, but it tends to be very slow. The big innovation around the AI industry has to do with parallel computing.

The reason that training neural networks is so efficient is that we can process many inputs and learn many things at the same time in parallel on these GPU devices. The recurrent neural network architectures of the past did not fit very well with this parallel computation model. There were many things in the training that were inherently sequential, so training them was slow.

What a transformer does is change some assumptions in how this memory works to allow it to be parallelized really well. I am not sure it is very interesting to go into the details of how it is parallelized. It does mean that transformers are a little more limited theoretically than the old RNNs, but those limitations are carefully chosen, and the architecture is carefully designed to allow parallelism while compensating for what is lost.

There is a concept that transformers have that the old RNNs do not have, which is called a context window. Sometimes if you buy an AI product, it will tell you that the product has a certain context window, and another product might have a larger context window. You might even have to pay more money for something that has a larger context window. The idea of a context window is something that transformers introduced to allow them to parallelize the training.

A context window is a fixed number of words in the past that the transformer can see when it draws on its short-term memory. An RNN has an infinite context window. In principle, it could remember everything in the past since it was first turned on. But a transformer is trained with a fixed context window. If the transformer has a context window of 1,000, it means that after you say 1,001 words, the very first word that you said is no longer in the short-term memory of the transformer. It will not be able to remember that anymore.

That simple limitation ends up being an enabling factor for training. It allows the neurons to be hooked up in a way that the transformer can be trained in a massively parallel way, which is many times more efficient than training an RNN.
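
The context window itself is a blunt cutoff: anything older than the last N tokens is simply never shown to the model. A minimal sketch of that truncation, using the 1,000-token window from the example above:

```python
context_window = 1000  # the fixed window size from the example above

def visible_context(all_tokens, window=context_window):
    """Return only the most recent tokens the transformer is allowed to see."""
    return all_tokens[-window:]

conversation = list(range(1001))           # 1,001 tokens said so far (as token ids)
print(len(visible_context(conversation)))  # 1000: the very first token has dropped out
```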

Mounk: The limits of these context windows are still relevant today in terms of technical limitations in some of the existing language models. You can go back and forth for a certain amount of time, and then at some point it loses track of the beginning of your conversation. Or if you ask it to do much more complex tasks, it stays on task and on track for a while, and then it stops being able to retain the information it needed in order to carry it out in the right way. Is that roughly right?

Bau: Yes. There are really two effects here. One is a hard context window, where the transformer has no hope of understanding things that are beyond its context window. The other is a soft decay of its memory. These neural networks are statistical machines; they are never perfect, and as the conversation gets longer, even for things that are theoretically inside the context window, the transformer has more difficulty accurately recalling and processing things that are further in the past.

Mounk: I want to go back to the overall architecture of the AI now. I am going to create a hypothetical scenario here, David. What if I gave you a billion dollars, and said, build me an AI? What would you do? You would build a neural network, it would have a transformer, all of those things. What are the other steps here? You have to train it. Once you have trained it, you have to adjust it. What does that mean concretely?

Bau: Modern machine learning really has two steps. If you gave me a billion dollars to train a neural network, the task in front of me would be, first, to do something called pre-training the network. The second task would be to fine-tune the network to have a certain personality or to achieve a certain goal that I want the network to help me with.

This split between pre-training and fine-tuning is one of the pieces of the lore, one of the fundamental rules of thumb that we have learned, and it is quite profound in modern machine learning. The idea is this: if you go straight to trying to train an AI to solve the problem that you care about, then you miss a lot of opportunities to get an AI that has a profound understanding of the world. There are many other problems, unrelated to the one you are making the AI for, that it could learn from and generalize from.

What people are realizing is that the way to make an AI is to begin by training it to understand as many things as possible in the world. Once the AI is really good at modeling a wide variety of interesting problems, you then fine-tune it on solving the particular problem that you care about. The AI benefits greatly from that pre-training.

The first step nowadays is to pre-train the model on a universal problem. The universal problem that the entire industry has converged on is to pre-train the model on large-scale language modeling: to be able to imitate text. Which text? All the text. All the text that humanity has ever written, broadly construed.

If the text has images in it, those images can be encoded as little pieces of text, little patches of image words. If there are videos, they can similarly be boiled down to a set of tokens. We can train an AI to imitate any content that has been put together by a human in the past.

Mounk: Tell me what training means in a technical or semi-technical sense. Presumably, when a baby is being trained, the baby has eyes and ears and looks around the world. The information flooding through its brain somehow gets encoded by some mechanistic form into these neurons. Over time, the kinds of stimuli that the baby sees and receives start to train the neural network that is the baby’s brain.

I am going to try to make the analogy here for the AI system. Presumably, you can have this neural network, and you throw all of this text at it, and somehow that shapes the neural network in a way that may or may not be analogous. What does it mean to train exactly?

Bau: To train a neural network is actually really simple. First, you need to have a goal. Once you have a goal, you expose the neural network to challenges. You give it inputs and then have it produce outputs. You then ask, did the output achieve that goal or not? Sometimes the output will have achieved the goal, and sometimes it will not have achieved the goal.

If it did achieve the goal, then whatever computation the neural network happened to do in that instance, you strengthen all the neural connections that led to this positive outcome. If the network did not achieve the goal, you go to that computation and slightly weaken all of those neural connections that led to the bad outcome. Each time you do not make a huge change in the network; you might change everything by 1 percent or by a tenth of a percent. But after you have done this thousands, millions, or billions of times, eventually the network will converge on a pattern of computation that becomes correct more often, incorrect less often, and increasingly sophisticated at solving harder instances of the goal over time.

This whole process is called backpropagation or gradient descent, and it is the backbone of how machine learning works. It is a deceptively simple process. A primitive version of it was invented in the 1950s, and more sophisticated versions were developed in the 1980s. It remains such an important process that it is still an active area of research today. Fundamentally, the technique of backpropagation is the same as what we have been doing since the 1980s. We have just been making small tweaks to it.
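
Here is a toy, runnable version of that strengthen-or-weaken loop: a single artificial neuron is nudged slightly after every example, in the direction that would have made its answer better. This is ordinary gradient descent on an invented problem, not a fragment of a real LLM training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented goal: learn to predict y from x, where the true rule is y = 3*x1 - 2*x2.
true_weights = np.array([3.0, -2.0])
weights = np.zeros(2)     # the "network" starts out knowing nothing
learning_rate = 0.01      # each update changes the connections only slightly

for step in range(10_000):
    x = rng.standard_normal(2)
    target = true_weights @ x
    prediction = weights @ x
    error = prediction - target
    # Strengthen or weaken each connection a little, depending on whether it
    # pushed the answer toward or away from the goal on this example.
    weights -= learning_rate * error * x

print(weights)  # ends up very close to [3.0, -2.0]
```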

Mounk: Very simple question. To have this backpropagation mechanism, presumably you have to know when the system is doing something good or bad, right or wrong. But we are not talking here about post-training. I do not think we are talking here about how, once you have a large language model that has been trained, you then give it feedback depending on its output. That is a different thing.

So how does the model know when it is doing something right or wrong? For example, it might say, this is a cat, or something more complicated, like producing a sentence. How does it know, this is a good sentence or this is a bad sentence? How does it know, actually, this was not a cat, it was a dog?

Bau: The distinction that you are drawing is one of the big fundamental insights, which is the distinction between supervised training and unsupervised training. In supervised training, you have a clear idea of the problem that you want the network to solve. For example, you need the network to tell the difference between good restaurant reviews and bad restaurant reviews. You show the network what good ones are and what bad ones are, and then punish the network every time it makes the wrong choice. That was the way we conceived of AI training for a long time.

The problem is that it is expensive to collect that training data. You have to make human assessments and judgments about the problem you want to solve. There is only so far you can go with it. The big innovation is to introduce a different type of goal, called an unsupervised training problem. An unsupervised training problem is a goal where you do not need to label the data. You do not need a person to tell you that this was the right thing or the wrong thing. You come up with a goal that the AI can pursue that is more natural or more ubiquitous in the world.

Language modeling is an unsupervised training goal. This multiple-choice question of predicting the next word does not require a human expert to label the data, saying, this is the right word, that is the wrong word. All we need to do is gather text. A language model can judge itself based on all the text that was written without needing a separate expert to train it on these things.
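
The practical upshot is that raw text carries its own answer key. A short sketch of how next-word training pairs fall out of a single sentence with no human labeling:

```python
text = "the capital of Vermont is Montpelier".split()

# Every position in the text supplies one training example for free:
# the words so far are the input, and the word that actually follows is the answer.
examples = [(text[:i], text[i]) for i in range(1, len(text))]

for context, next_word in examples:
    print(context, "->", next_word)
```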

Mounk: In supervised learning, I have a database with 100 positive reviews and 100 negative reviews—probably more like a million, but let us say 100. You give it one of these reviews, and in the database there is a data point that says positive or negative. That label was generated by humans at some point. A human looked over these 200 reviews and classified them into positive and negative. The system judges itself against the “objective” human judgment encoded in those labels.

How does unsupervised learning check itself at the end? You are saying, for example, that Shakespeare, or some blogger, or a New York Times journalist has already decided what the right next word is. But since this LLM is creating a sentence that has never existed in human language in many contexts, how does it know whether it is similar to the kind of word that Shakespeare, or The New York Times, or the blogger would have written next?

Bau: There are two things to clarify. One is that one of the fundamental insights that made unsupervised learning work was the recognition that there is no single right answer. What is the next word here? If different people confronted the same situation, even if they were very intelligent and very human, they might choose different words.

It is more accurate to think of the right next word as a distribution of possibilities. Maybe 30 percent of the time you would have chosen this word, 10 percent of the time you would have chosen another word, and the remaining 60 percent of the time there is a wide variety of other choices you could have made. The right thing to train the AI to do is to understand and model this probability distribution as accurately as possible. Instead of just getting the next word right, it needs to get the probabilities right as much as possible. There are mathematical ways of writing this down.

That is why these systems are probabilistic machines. They do not output single choices. They output an assessment of what they think the probabilities should be.
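
One practical consequence, not spelled out above, is that at generation time the model can sample from the distribution it predicts rather than always emitting the single most likely word, which is part of why the same prompt can yield different answers. A toy example with invented candidate words and invented probabilities:

```python
import numpy as np

rng = np.random.default_rng()

# Invented candidate next words and an invented predicted distribution over them.
candidates = ["Montpelier", "Burlington", "located", "a"]
probabilities = [0.30, 0.10, 0.35, 0.25]

# Draw the next word according to the predicted probabilities,
# rather than always emitting the single most likely choice.
print(rng.choice(candidates, p=probabilities))
```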

How do you know if this is right or not? What we do is measure these things on what is called a holdout set. It is really simple. You take a bunch of text that would have been part of your training data, and then you separate it out as a holdout, a quiz. You tell the AI system, you can train on all this data, but not these ten pages. These ten pages are different, and you will never get a chance to see them.

After all the training is done, we go to the model and ask it to look at these ten pages. We give it the first 100 words of the first page and ask it to tell us what it thinks the probabilities of the next word are.

Mounk: I see. The closer it gets to actually predicting the rest of the passage, the better it has learned.

Bau: That is correct. This holdout test has for many years been the gold standard for how you measure the success of a machine learning model. Can it correctly predict the answers on a piece of data that you held out from training?
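
Concretely, the holdout score is usually the average of how much probability the model assigned to each word that actually appeared next in the held-out pages, often reported as perplexity. A minimal sketch with made-up probabilities standing in for a real model's predictions:

```python
import numpy as np

# Probability the model assigned to each word that actually came next in a
# held-out passage. In practice these come from the trained model; here they
# are invented numbers for illustration.
assigned_probabilities = np.array([0.42, 0.07, 0.31, 0.65, 0.12])

average_negative_log_likelihood = -np.mean(np.log(assigned_probabilities))
perplexity = np.exp(average_negative_log_likelihood)

print(f"average NLL: {average_negative_log_likelihood:.3f}, perplexity: {perplexity:.1f}")
```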

Mounk: You have done all of this, and in a sense this is only the first step of what a lot of models do at the moment. Because then, if I understand it right, you have a bunch of post-training, or whatever the right term for it is, in which you have a model perform tasks and you give it positive or negative reinforcement depending on what it does.

How is that different from what we have talked about so far? How does that change the model? You have this trained system, this huge neural network, so how is it that positive or negative reinforcement changes the physical structure of the network? Surely it needs to do that for it to learn over time and become better adjusted.

Bau: That is right. The problem with unsupervised training is that the model does not learn how to do any one useful thing in particular. Let me give you an example of what comes out of unsupervised language modeling. If you go to an unsupervised language model and try to have a conversation with it, and you say, please tell me the capital of Vermont, what do you want the language model to say?

You want it to say, what a great question. A lot of people do not know the capital of Vermont, but it is Montpelier. Here is a way to remember that. Here is a little bit of information about Montpelier. But if you go to an unsupervised language model and ask, what is the capital of Vermont? it will answer by predicting what it thinks the most likely next word is. It might say, what is the capital of Colorado? What is the capital of Maine? What is the capital of Wyoming? What is the capital of New York?

Mounk: Right. Normally, when you look at a text, the next word is not necessarily Montpelier. It might be in certain contexts, such as a dialogue in a novel where the person has the right answer. But in other contexts, you may have a list of questions, or you may be using it as an example in a philosophical text, etc.

Bau: Indeed. If you really train the model on all the world’s text, then the most common situation, the most common context for asking a question, would be a book of questions. The model will just continue writing that book of questions, inventing more and more that are similar to the one you asked. It is a pretty dissatisfying experience.

It can actually be fun, but it is not very useful. What you can do, though, is go to one of these pre-trained language models and say, it’s great that you can imitate every book in the library, but let me give you a set of books that I would like you to be especially good at imitating. What are these books? They are a collection of 100,000 conversations, pieces of dialogue, which are examples of people asking questions and getting their questions answered in a nice and helpful way.

If you go back to this pre-trained language model and train it on dialogue text—just fine-tuning it, which means taking these 100,000 pages and making them the last thing the network was trained on, the last thing it learned, the last thing it was rewarded and punished for—then the language model will acquire this bias. It will tend to imitate the last thing it saw. If you ask, what is the capital of Vermont? it will tend to give you a useful answer. It will answer in dialogue, which is remarkable.

This process is called instruction fine-tuning. People collect datasets of useful instruction-following behavior: Please answer this question for me. Please do this for me. Please do that thing for me, and examples of an AI doing the task in a smart, useful way. If you went to a transformer and trained it only on these thousands of conversations, it might understand the grammar of what you are doing, but it would not be very helpful. It would not know much about the world.

But if you go to a large language model that has been trained to imitate every book ever written, every blog post ever posted on the internet, and then, as a final fine-tuning, you show it dialogue and say, what I really want you to learn is to follow this format. While you are doing next-word prediction, just do it in a way that answers questions, then something profound happens. Not only does it follow the form of dialogue, but it also exploits the vast array of knowledge it acquired during pre-training.

For example, if you ask a question about Shakespeare, the model tends to answer it, even if the specific dialogue examples said nothing about Shakespeare. It will follow the dialogue form, but draw on the knowledge it acquired earlier in pre-training. That is really the magic of modern machine learning, of modern language modeling: the split between pre-training and fine-tuning.
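
To tie the two phases together, here is a deliberately tiny, runnable illustration of pre-training followed by fine-tuning: a next-character model is first trained on a "general" corpus and then fine-tuned on a small "dialogue" corpus, after which its most likely continuation follows the dialogue format. Every corpus, size, and hyperparameter here is invented for the sketch and bears no resemblance to real LLM training.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(W, text, char_to_id, learning_rate, epochs):
    """Nudge the weights a little on every next-character example in `text`."""
    for _ in range(epochs):
        for prev, nxt in zip(text, text[1:]):
            i, j = char_to_id[prev], char_to_id[nxt]
            p = softmax(W[i])       # predicted distribution over the next character
            grad = p.copy()
            grad[j] -= 1.0          # gradient of the cross-entropy loss
            W[i] -= learning_rate * grad
    return W

pretraining_corpus = "the cat sat on the mat. the dog ate the bone. "
dialogue_corpus = "q: where is the cat? a: on the mat. "

vocab = sorted(set(pretraining_corpus + dialogue_corpus))
char_to_id = {c: i for i, c in enumerate(vocab)}
W = np.zeros((len(vocab), len(vocab)))

# Phase 1: "pre-train" on the bigger, general corpus.
W = train(W, pretraining_corpus, char_to_id, learning_rate=0.5, epochs=50)

# Phase 2: "fine-tune" on the small dialogue corpus, the last thing it learns.
W = train(W, dialogue_corpus, char_to_id, learning_rate=0.5, epochs=50)

# After fine-tuning, the model's most likely character to follow "q" is ":".
print(vocab[int(np.argmax(W[char_to_id["q"]]))])
```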

In the rest of this conversation, Yascha and David discuss how to set constraints, how to instill moral values, and how to tell if AI systems are using reasoning. This part of the conversation is reserved for paying subscribers…
