How intelligence helps (and hurts) alignment
Published: 2023-03-05
We've found ourselves in a strange cultural moment where it suddenly appears obvious to many people that we will give birth to a superintelligent AI that will kill us all.

Could intelligence help with this?

I recently listened to Eliezer Yudkowsky’s Bankless interview, which has been making the rounds on social media. This post contains my initial reaction as someone who has 1) spent a lot of time thinking about statistical learning theory and how it relates to capabilities of deep neural networks, but 2) not lots of time (except maybe the brief journal note here and there) thinking about AI alignment.

With the attention generated by models such as ChatGPT and Sydney (Bing), huge numbers of people have suddenly become attuned to the fact that AI is about to massively change the world. Frankly, it looks powerful and dangerous. As people begin to wonder what specific worries they should have, they will inevitably come across very intelligent people who have been thinking about this issue for a very long time, and who say that the danger (or even the foregone conclusion in Eliezer’s case) is that an unaligned AI will destroy the world. And their response is often, “Ah, yes, I knew I was worried about something dire…”

So we’ve found ourselves in a strange cultural moment where it suddenly appears obvious to many people that we will give birth to a superintelligent AI that will kill us all.

For my part, I’ve never found this conclusion obvious. And with these worries entering the mainstream, I recently made plans to spend some of my free time carefully reviewing the arguments of people like Eliezer. However, after listening to the Bankless podcast, I found that I had quite a few thoughts in response. So I wanted to spend some time formulating these thoughts before allowing the ideas of others to bias my way of thinking about the issue. That’s what this post is.

My views on Eliezer’s overall position went through a funny progression as I wrote and then fine-tuned the post. On the same day that I listened to the podcast, I wrote out an initial draft entitled “Intelligence is Self-Regularizing.” I soon realized there were a few points I could reframe to be clearer, and spent some spare minutes throughout the week refactoring the post. Only when I thought the post was ready to polish up and publish did I finally have a clear enough view of the issues to suddenly understand the full force of one of Eliezer’s worries, resulting in the final section.

Some basic alignment problems

As I listened to the Bankless podcast, I was struck by the notion that, although superintelligence is at the root of the overall alignment worry, it also seems like it could be the seed of the solution. In fact, I wanted to frame things this way: The problem that Eliezer is worried about is not superintelligence at all but superoptimization. And the solution to this problem could be, in a word, intelligence.

While I think I’ve now arrived at a more nuanced view (i.e., intelligence holds elements of the solution to the alignment problem, but may also be the source of its deepest difficulties), I think this post may still be interesting from the standpoint of trying to decouple the roles of optimization and intelligence within the broader alignment problem.

To this end, I’ve taken two of Eliezer’s worries which stood out to me within the podcast and posed them in terms of optimization, without any reference to intelligence. The problems:

  1. Suppose we were to specify a utility function, and then apply some super powerful optimizer toward maximizing this utility function. The problem is that we don’t know how to specify a utility function such that, when it is optimized strongly enough, the result isn’t something we didn’t want: an outcome that clashes with our basic values.
  2. The second problem is that we might optimize an agent to maximize a particular utility function in the context of one input distribution, but find that the policies learned by the agent are arbitrarily poorly aligned with the original utility function when the distribution shifts somehow.
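
To make problem 1 concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical of mine for illustration (the functions `true_value` and `proxy_utility`, the quadratic shape, the search budgets); it isn't a model of any real system. The written-down utility agrees with what we want for small values and only diverges once the optimizer is strong enough to reach extreme ones.

```python
# Toy illustration of problem 1. All names and numbers are hypothetical.

def true_value(x: float) -> float:
    """What we actually care about: more x helps at first, then hurts."""
    return x - 0.1 * x ** 2  # peaks at x = 5, negative beyond x = 10


def proxy_utility(x: float) -> float:
    """The utility we managed to write down: 'more x is always better'."""
    return x


def optimize(utility, budget: int) -> int:
    """Brute-force optimizer; a larger budget stands in for more optimization power."""
    candidates = range(budget + 1)
    return max(candidates, key=utility)


weak_choice = optimize(proxy_utility, budget=3)
strong_choice = optimize(proxy_utility, budget=100)

print("weak optimizer:  ", weak_choice, true_value(weak_choice))
print("strong optimizer:", strong_choice, true_value(strong_choice))
```

With the small budget, the proxy and our actual values agree about which outcomes are better. With the large budget, the optimizer lands somewhere we would score as strongly negative, even though the proxy itself never changed.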

Naturally, intelligence might come into the picture in different ways; for instance, our super-powerful optimizer in problem 1 could be a superintelligent agent. Or, intelligence could be an emergent property of the agent optimized in problem 2. While I think it makes complete sense to formulate these problems in terms of optimization, it’s difficult to imagine the types of world-ending scenarios that Eliezer paints without intelligence coming into the picture. We’re usually envisioning an AI that ends the world right underneath our noses without us noticing or being able to stop it.

I’ll dwell on this point a little further. The important quality of the optimizer in problem 1 is its raw optimization power, and pointedly not the set of interfaces or actuators that it has access to. This is an agent that can destroy the world by displaying text to manipulate a human into sending an email that ends the world. So in my treatment of problem 1, I will stick to a highly constrained agent of arbitrary power.

One thing to notice is that solving problem 1 is dependent on solving problem 2. You could roughly decompose the two problems in the following way: For an agent to take an action aligned with the intent of my instruction, it must both a) understand my intent and how to enact it and b) want to comply with my instructions. I’m using the word want here to match Eliezer’s vocabulary. The word isn’t important. As long as condition a) is well-defined and satisfied, condition b) simply concerns any residual between what the agent did and what it understood that the user wanted it to do. In our discussion of problem 1, we’ll assume that problem 2 has already been solved.

I’ll spend the remainder of the post describing the hopefully thought-provoking ways in which intelligence is both part of the solution and part of the problem in each of these contexts.

1. The problem of utility function misspecification

The cute way of saying my point concerning problem 1 is that any agent smart enough to end the world à la Eliezer’s nightmare must also be smart enough to understand that we don’t want it to do this. Cuteness aside, what I actually mean to do here is to propose that there are ways of leveraging the intelligence of an agent to solve the alignment problem.

It is perfectly possible to tell a human (with human-level intelligence), to observe certain constraints while solving a problem. A human may be able to find loopholes within your constraints which allow them to improve the optimization performance. But a human is also intelligent enough to infer an intent behind your constraints and recognize when an optimization is exploiting a loophole. It is quite easy to convey to a human-level intelligence the concept that I want the solution to satisfy the general intent of a set of constraints more so than the exact letter of the constraints.

This capability only improves with intelligence. A sufficiently intelligent agent could have a general model of human beings that could be fine-tuned to particular instances. Constraints of the optimization could then be defined on the basis of this model. You can tell your friend “don’t do anything I wouldn’t do” or to consider “what would Jesus do?” You can ask the AI not to do anything that a committee of leading ethicists wouldn’t do (though this might actually be the worst type of committee to choose…).

I could easily be wrong, but my hunch is that many people in the alignment space are trying to “hand code” a utility function that will force an AI to toe the line–improving on Isaac Asimov’s three laws. Just like hand-coding an ImageNet classifier, this is an extremely difficult problem.

Now, whether it’s possible to do something like what I am saying with an AI agent depends of course on the base objective (utility function) on which it has been optimized/trained. I think that as far as this possibility is concerned, recent developments such as ChatGPT are very suggestive.

Let us take the objective of the GPT series: next-token prediction for internet text. We’ll ignore practicalities such as training methodology, and simply imagine the agent which maximizes this objective. (Implicit here is that we are maximizing performance on a test set which we randomly generate from the distribution each time we use it. We’ll deal with statistical issues in the next section. This maximizer might also not be defined if we don’t constrain it somehow, so let’s just say it’s constrained only by the amount of compute, memory, and time needed to evaluate the model. In particular, it’s not constrained by any limitations of training. It simply is the best solution subject to these physical constraints.)

This agent is a superintelligence by almost any standard. Let’s look at some examples of problems which it can solve:

  • I can describe a math problem which human mathematicians puzzled over for centuries, and it will give me the solution after some amount of time.
  • I can describe the preamble to a 5-year analysis by the world’s top climate scientists of some data which I have supplied. After some amount of time, it will give me the rest of the report.

(Naturally, these don’t need to be actual problems which were actually solved or reports which were actually written.)

There’s little more to be said here. Once the model is prompted appropriately, the values of human society are already accounted for. We’re done.

It probably looks like there is a lot of room for someone to jump in here and talk about antagonistic/devious types of behaviors which can be elicited from models such as ChatGPT, but I’m actually not sure that this particular problem is as big as it looks. Effective prompting can greatly reduce the probability of latent variables that would be linked to undesired behavior (e.g., the report is a hoax, it was sponsored by an interested party, the quality was low). Using progressively more powerful models, we could also annotate training data in ways that would make such prompting even more effective.

One objection that one might have to my formulation is that what we want (and will build) is a superintelligent agent which we can set loose to actually solve the problems as it sees fit. Meanwhile, prompt engineering doesn’t seem powerful enough to make a super-ChatGPT suitable for this purpose. My initial thought is that building a superintelligent agent is simply unnecessary and undesirable. Almost all advances in any field will need to be grounded in an empirical feedback loop involving experimentation and data collection. The human interfaces necessary for this to happen will mean that there is little loss from keeping humans in the loop. Certainly there will be autonomous AI-powered systems in the future. And perhaps we will have superintelligent AIs. But I don’t see any compelling reason for them to be the same thing.

In closing out this section, I’ll reiterate that my answer to problem 1 is clearly empty without an answer to problem 2 at hand. So let’s now turn to the probably more interesting discussion of the role which intelligence plays within the subtle problems of distribution-shift and the emergence of proto-wants.

2. The problem of proto-wants / distribution shift

The problem of distribution shift is quite familiar in machine learning research. Imagine that you perform some data-driven optimization in order to teach a self-driving car to navigate a city based on inputs from its array of sensors. But all of your training data comes from bright, sunny days in the month of July. How will your optimized set of policies for self-driving perform on a rainy day in November? The answer is that they will probably fail very badly and result in a possibly disastrous outcome.
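
The same failure can be reduced to its simplest statistical form. The sketch below is only a stand-in for the driving example: the `world` function, the input ranges, and the choice of a linear model are all assumptions of mine. A line is fit to a curved relationship on a narrow range of inputs ("sunny July") and then evaluated on a range it never saw ("rainy November").

```python
# Toy distribution shift: fit on one input range, evaluate on another.
# The ground-truth function and both ranges are hypothetical.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared gap between the line's predictions and the targets."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def world(x):
    """Hypothetical ground truth: mildly curved, so a line is only locally adequate."""
    return x ** 2


train_xs = [i / 100 for i in range(101)]        # training conditions: x in [0, 1]
shifted_xs = [2 + i / 100 for i in range(101)]  # shifted conditions: x in [2, 3]

model = fit_line(train_xs, [world(x) for x in train_xs])
train_mse = mean_squared_error(model, train_xs, [world(x) for x in train_xs])
shifted_mse = mean_squared_error(model, shifted_xs, [world(x) for x in shifted_xs])

print("error on training conditions:", train_mse)
print("error on shifted conditions: ", shifted_mse)
```

Nothing about the fitted line changes between the two evaluations. It was a good summary of the data it saw, and that is exactly why its error is orders of magnitude larger once the inputs move.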

Now, a basic question is whether intelligence is helpful or harmful in this situation. It’s actually pretty obvious that, at least up to a point, intelligence is exactly what we need to solve the distribution-shift problem.

It’s helpful here to look at Eliezer’s own example of evolutionary optimization for genetic fitness in humans. He’s worried about things like ice cream.

Millions of years of evolution optimized humans for survival within an environment that was hugely different than the one in which the majority of modern humans find themselves. In that environment, it was a good strategy for humans to consume as much sweet, salty, and fatty food as we could get into our mouths. And now this means that we are very liable to overindulge in ice cream, even when this is counterproductive to our genetic fitness.

This is an example of an instance in which our internal wants (eating lots of food) are not precisely aligned with the original objective of the optimization (genetic fitness). Somehow, evolution didn’t manage to make us actually instinctually want nothing more than the replication of our genes. This is an important point, and we will return to it.

But before doing so, let’s take a step back and consider two broader points:

  1. Evolution imbued us with a host of wants and instincts which–in the context of humanity’s great environmental shift–no longer serve their original function. Ironically, intelligence is the marked exception. Intelligence originated because it helped us to navigate, understand, and manipulate our world. And it still allows us to navigate, understand, and manipulate the world, even though our environment has changed drastically.
  2. Up to a point, intelligence clearly makes us more robust to the problem of distributional shift. Again, take ice cream. You love ice cream. You want ice cream. But you also have some self-awareness about this fact. Eating ice cream isn’t just something that you automatically, compulsively do when ice cream is presented to you. You become aware of the particular desire, you evaluate it against other objectives that you may have (which on the whole are probably more aligned with the original optimization objective), and ultimately make some kind of decision.

The problem really is the “up to a point” in 2.

Let’s first reiterate the model in question. We’ll think of the intelligent agent as having its own inner objective, which could be different than our imposed outer optimization objective. When these are identical, all is well and intelligence assumes its role of mitigating the problem of distribution shift. After all, an intelligent agent not only has a good strategy, but also an understanding of when and why that strategy works, and an ability to find new strategies when conditions change. But if the inner and outer objectives differ, we’d expect the intelligence to optimize our outer objective under the training distribution, but no longer to do so in the presence of certain distribution shifts. It could even work at cross-purposes.
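
Here is a minimal sketch of that inner/outer picture, borrowing the ice cream example. The environments, option names, and scores are all made up for illustration. Each option carries two numbers: the inner signal the agent actually pursues (think "tastiness") and the outer objective the optimization was selecting for (think "fitness"). The agent only ever looks at the first number.

```python
# Toy inner/outer objective mismatch. All options and scores are hypothetical.
# Each entry maps an option to (inner_signal, outer_objective).

ANCESTRAL = {
    "fruit": (0.8, 0.9),
    "leaves": (0.1, 0.3),
    "bark": (0.0, 0.1),
}

# The shifted environment contains everything the old one did, plus one new
# option that scores highest on the inner signal and worst on the outer objective.
MODERN = {**ANCESTRAL, "ice cream": (1.0, -0.5)}


def agent_choice(environment):
    """The agent maximizes its inner signal and never sees the outer objective."""
    return max(environment, key=lambda name: environment[name][0])


def outer_optimum(environment):
    """What the outer optimization would have wanted the agent to pick."""
    return max(environment, key=lambda name: environment[name][1])


for label, env in [("ancestral", ANCESTRAL), ("modern", MODERN)]:
    print(label, "| agent picks:", agent_choice(env), "| outer optimum:", outer_optimum(env))
```

In the ancestral environment the two objectives cannot be told apart by watching behavior, since they recommend the same choice. The gap only shows up once the environment offers an option that the training distribution never contained.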

The worry is that we end up having a situation like modern humans and birth control. Ultimately, the intelligence, acting on its internal objectives (sex and ice cream), completely circumvents the original objective (children).

So how do we solve this problem of alignment between inner and outer objectives?

The optimistic scenario

Let me first paint the case for optimism.

In many ways, the inner alignment problem is akin to other problems which the grand program of graduate student descent in machine learning research is already solving. I’ll back up a bit in order to make this perspective clear.

Let me start by clarifying a bit what I mean by intelligence. In recent years, with Deep Learning–an instance of a “Machine Learning” technology–being the premier instance of the category of technologies known as “Artificial Intelligence,” it is probably easy to assume that intelligence and capacity for learning are somehow related or even identical. However, it’s probably better to think of intelligence as a set of qualitative capabilities that may emerge from optimizing to an objective. Here, this is Optimization with a capital O, at a societal level. Not just the result of, say, performing a numerical optimization with something like SGD, but also the result of architectural optimization resulting from trial and error and even principled reasoning grounded in “knowledge” of our universe accumulated over millennia of cultural learning / science.

We will likely arrive at an intelligent model in the same way that we have arrived at deep learning models which dazzle with their current generalization capabilities: by setting appropriate learning objectives and then performing graduate student descent over the space of architectures, hyperparameters, training methodologies, etc. Eventually, the learning objective for which the great corpus of AI labs and graduate students is optimizing will be expansive enough in its requirements for robustness to things like distribution shifts and adversarial examples that it will require something like human intelligence to meet the objective well.

So now we have an intelligent model, but one that may have some amount of inner/outer misalignment. But since this misalignment is again a kind of lack of robustness, we can again suppress it by continuing to make the objective to which we optimize even more expansive/demanding. In this sense, solving this aspect of the alignment problem is “well aligned” with what researchers and the field in general are likely to be already doing!

The worry

The biggest worry is that the problem of inner alignment turns out to be a hard problem in some strong sense of the word hard. Or alternatively, the inner alignment problem could simply be significantly harder than the problem of creating a powerful superintelligence. In either case, it appears that we could end up in a scenario where inner/outer misalignment results in the intelligence subverting the original objective.

Suppose it is true that intelligence does tend to allow an agent to more robustly achieve the outer objective–even when there is some misalignment between the outer objective and the inner objective(s). The double-edge of this effect is that the inner objectives are to some degree insulated from pressure that we might try to exert by demanding more robustness. A related idea is that the more intelligent a given human is, the more difficult it may be to infer their values and objectives from observing their behaviors.

One way of stating this worry is that intelligence makes a model intransigent to the various regularizations we might apply to try to achieve inner alignment. But once we’ve put the matter more generally, it makes sense to ask if this is intuitively likely for all of the regularizations we can think of. For instance, it may be that we cannot perfectly align a single instance of a model, but by combining together a large number of models in a Bayesian / “mixture of experts” fashion, we could “average out” orthogonal components of the inner objective which depend on random initialization (provided they are zero-mean).
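
The “provided they are zero-mean” caveat is doing a lot of work, and a small simulation makes that visible. This is a sketch under strong assumptions of my own: objectives are represented as plain vectors, each model’s inner objective is the outer objective plus independent Gaussian noise from its random seed, and the ensemble is a simple average. None of this is claimed to describe how real models behave.

```python
import math
import random

# Toy ensemble averaging of inner objectives. Representing objectives as
# vectors and perturbations as Gaussian noise is an illustrative assumption.


def sample_inner_objective(outer, rng, noise_scale=1.0, bias=None):
    """One trained model: outer objective + shared bias + per-seed noise."""
    bias = bias or [0.0] * len(outer)
    return [o + b + rng.gauss(0.0, noise_scale) for o, b in zip(outer, bias)]


def ensemble_mean(objectives):
    """Coordinate-wise average of the models' inner objectives."""
    n = len(objectives)
    return [sum(column) / n for column in zip(*objectives)]


def misalignment(inner, outer):
    """Euclidean distance between an inner objective and the outer one."""
    return math.sqrt(sum((i - o) ** 2 for i, o in zip(inner, outer)))


rng = random.Random(0)
outer = [1.0] + [0.0] * 9

# Case 1: perturbations are zero-mean, so they shrink under averaging.
models = [sample_inner_objective(outer, rng) for _ in range(1000)]
single = misalignment(models[0], outer)
averaged = misalignment(ensemble_mean(models), outer)

# Case 2: every model shares the same systematic bias, which averaging keeps.
shared_bias = [0.0, 0.5] + [0.0] * 8
biased_models = [sample_inner_objective(outer, rng, bias=shared_bias) for _ in range(1000)]
biased_averaged = misalignment(ensemble_mean(biased_models), outer)

print("single model:            ", single)
print("ensemble, zero-mean case:", averaged)
print("ensemble, shared bias:   ", biased_averaged)
```

Averaging suppresses the seed-dependent part of the misalignment roughly in proportion to the square root of the ensemble size, but it leaves any component shared across models untouched. So this regularization only helps to the extent that the misalignment really is idiosyncratic to each training run.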

In order to answer the hardness questions that prompt this general worry, it seems that we need some way to formalize the kinds of emergent capabilities we have in mind when we say “intelligence.” This is almost certainly not a simple problem, but there may be fairly good stand-ins that we can use. I’m excited to learn more about what existing work there might be within this space and how people are approaching such modeling questions.