Here’s my breakdown of the difficulties involved in ensuring that powerful AI makes our lives radically better, rather than taking over the world, as well as some reasons why I think each of them is hard. Here are some things this breakdown is not:
That said, it is my attempt to group the problems in my own words, in a configuration that I haven’t seen before, with enough high-level motivation that one can hopefully tell the extent to which advances in the state of the art address them.
The first difficulty: we don’t have a sense of what sort of thinking we would want AI systems to use, in sufficient detail that one could (for instance) write Python code to execute it. Of course, part of the difficulty here is that we don’t yet know how to make machines think smart thoughts at all, but we could give ourselves access to subroutines like “do perfect Bayesian inference on a specified prior and likelihood” or “take a function from vectors to real numbers and find the vector that minimizes the function” and still not have solved the problem. To illustrate:
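Suppose we grant ourselves the second of those subroutines, say in the form of scipy’s real `minimize` routine. Here’s a minimal sketch of where the difficulty would still live (every name other than `minimize` is a hypothetical stand-in):

```python
import numpy as np
from scipy.optimize import minimize

# The subroutine we're granting ourselves: take a function from
# vectors to real numbers and find the vector that minimizes it.
def best_plan(objective, dim):
    return minimize(objective, x0=np.zeros(dim)).x

# The part nobody knows how to write: a function scoring a candidate
# plan (here encoded as a vector) by how *bad* its real-world
# consequences would be, so that minimizing it yields a good plan.
def badness(plan_vector):
    raise NotImplementedError  # this gap is the first difficulty

# If we could fill in badness(), the rest would be mechanical:
# good_plan = best_plan(badness, dim=1000)
```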
A priori, it’s not definitely impossible to build a thinking machine that does what we want without knowing how we want it to think, but it’s not at all obvious how one would do so. A core difficulty here is that the sorts of signs of positive outcomes we know how to specify (like “GDP has gone up a lot” or “a human says that they’re happy with the AI’s performance”) are compatible with extremely bad outcomes - and in general, as mentioned in point 1, things that are trying to achieve their own objectives in the physical world will be incentivized to cause those bad outcomes.
Given that we don’t know how to specify advanced AI cognition that will do good stuff and not take control of Earth, how could we hope to build it? One obvious path is a sort of trial and error: we build some AIs, and before putting them in situations where they could conceivably take over (e.g. by having them become able to influence enough of the physical world to build fancy new technology), we figure out whether they would do good stuff. Then we deploy only those that actually do good stuff - or, even better, tweak them so that they’re more likely to do good stuff and less likely to take over. The question is: how would we determine whether our AIs will do good stuff once they’re able to take over?
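In pseudocode, the hoped-for process is something like the following sketch, where every function is a hypothetical placeholder rather than something anyone knows how to implement:

```python
# Hypothetical placeholders, named after the steps described above.
def build_initial_ai(): ...
def capable_of_takeover(ai): ...
def does_good_stuff(ai): ...   # behavioural evaluation
def tweak(ai): ...             # adjust to behave better
def scale_up(ai): ...          # make more capable

def develop_by_trial_and_error():
    ai = build_initial_ai()
    # Only evaluate and adjust while the AI couldn't take over
    # even if it wanted to.
    while not capable_of_takeover(ai):
        ai = scale_up(ai) if does_good_stuff(ai) else tweak(ai)
    # The open question: nothing above tells us whether
    # does_good_stuff(ai) still holds from this point on.
    return ai
```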
One possibility you could imagine is trying to write a proof - after all, AIs are algorithms written in computer code, and one can often prove things about algorithms. The problem is that it’s entirely unclear what property we’d want to prove that our AI has, to the level of formal specificity that one could write a proof about it.3 This is closely related to the difficulty in section 1: if we had such a “goodness” property, we could build an AI that thought of plans that scored highly on “goodness”.
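Schematically - and this is just my gloss, not a specification anyone has written down - the theorem we’d want has roughly the shape

$$\forall\, e \in \mathcal{E}:\ \mathrm{Good}\big(\mathrm{outcome}(A, e)\big),$$

where $A$ is the AI, $\mathcal{E}$ is the range of situations it might find itself in, and $\mathrm{Good}$ is the property we care about. We don’t know how to formally define $\mathrm{Good}$, and (per footnote 3) formalizing $\mathcal{E}$ is a challenge as well.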
A second possibility is that you could look at your AI’s behaviour in a range of circumstances and see how good it is. If your ‘goodness’ ratings come as numbers, and there are a bunch of free variables in your AI design, you can even automatically run gradient descent to set those variables to values that get your AI to do things rated as highly good (sketched in code below). The basic problem here is that just because your AI does good stuff when it can’t take over the world doesn’t mean that it will do good stuff once it can. The underlying reason is that there are a lot of different motivations that can cause AIs to do stuff that looks good:
So, it seems that there are more AIs that pass your behavioural tests without being aligned with your interests than AIs that pass them by being aligned with your interests. Note that this issue is, again, related to difficulties discussed in section 1: just as many goals we could initially imagine writing down (like “get humans to approve of you” or “run a profitable business without being caught breaking any laws”) produce bad behaviour when optimized by an advanced AI, so too there are many motivations that produce good behaviour before an advanced AI can take over the world, but not after.
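To make the automated version concrete, here’s a minimal sketch of the kind of training loop at issue (PyTorch, with toy stand-ins for both the AI and the goodness ratings, which in reality would come from human judgements):

```python
import torch

# Stand-in "AI": a tiny network whose weights are the free variables.
policy = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)

# Stand-in for the 'goodness' ratings: here a toy differentiable
# score; in reality, human ratings of behaviour in test situations.
def goodness_rating(actions):
    return -(actions ** 2).mean()

optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
for step in range(1000):
    situations = torch.randn(32, 16)  # test situations, takeover impossible
    actions = policy(situations)
    loss = -goodness_rating(actions)  # minimizing this maximizes goodness
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every gradient above came from the sampled test situations.
# Nothing in this loop constrains behaviour in situations unlike
# them - e.g. ones where takeover is possible.
```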
Also note that we are talking as if our AIs have “motivations”, which lets us re-use some of the reasoning from section 1: thinking of strategies that would help achieve some goal, and concluding that the AI will pursue those strategies. This should be understood as saying that they coherently steer the world into some narrow set of states5 (aka the states they are ‘motivated’ to reach), not as a strong claim about their exact internal functioning. And in order for AIs to be useful, they need to be steering the world into states that are observably different, and harder to reach, than what would have happened had they never been made.
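One rough way to cash this out (again, my gloss, and footnote 5’s subtleties apply): an AI is ‘motivated’ to reach a set of world-states $S$ when

$$P(\text{world ends up in } S \mid \text{AI built}) \,\gg\, P(\text{world ends up in } S \mid \text{AI never built}),$$

for some $S$ narrow enough that the world was unlikely to end up there by default.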
Finally, a worrying aspect of this second possibility is that many of its failure modes can only be exhibited once AI is advanced enough to be dangerous. By analogy, external observers may not have been able to tell that humans would end up using contraception until they were technologically advanced enough to make reliable contraceptives. Similarly, possibility 4 will only show up once AIs can come up with and competently execute such deceptive plans.6
A third issue is that our current best ways of making AI involve taking gigantic tensors of numbers, glued together by matrix multiplication and some non-linear functions (aka ‘neural networks’), and tweaking them until they do something impressive when run. This design doesn’t ensure that specific parts of those tensors have any particular known function - the result is just a collection of numbers that happens to exhibit the right behaviours.
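For concreteness, here’s the smallest version of the kind of object in question (plain numpy; real systems have the same shape with vastly bigger tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Gigantic tensors of numbers (here, tiny ones for illustration).
W1, b1 = rng.standard_normal((64, 16)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((4, 64)), rng.standard_normal(4)

def network(x):
    # Matrix multiplications glued together by a non-linear function.
    h = np.maximum(0, W1 @ x + b1)  # ReLU non-linearity
    return W2 @ h + b2

y = network(rng.standard_normal(16))
# Nothing marks any entry of W1 or W2 as "the honesty part" or
# "the planning part"; training just tweaks all of them until the
# outputs look right.
```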
There are two closely-related key problems with this type of AI design:

1. We don’t get to specify what sort of cognition the tensors perform: we can only tweak the numbers until the outward behaviour looks right.
2. Once trained, we can’t look at the numbers and tell what sort of cognition they in fact perform.
Problem 1 means that we aren’t able to precisely steer the cognition of smart AIs into styles that we like, even if we knew the sort of cognition we wanted to instill; and problem 2 means that we can’t easily perform meaningful safety analysis of large, capable AIs, even if we knew what such an analysis would look like.
Given that we face these difficult problems, you might hope that we can use AI to solve them - just as we’ve used it to solve other problems that are insurmountable for unaided humans, like “beat the best human at chess”. This strategy only works if the AI we use isn’t the sort that we might be scared of. However, there are a few aspects of the alignment problem that make it seem very difficult to solve using AIs that aren’t advanced enough to be scary:
To be sure, limited AIs can help in the meantime by e.g. making Google search better, or facilitating other kinds of human cognitive labour. But it’s not obvious how we can successfully outsource the AI alignment problem to other AIs, while being confident that the AIs we outsource to don’t need to be aligned themselves.
As mentioned in the introduction, these problems are by no means unknown in the literature. Section 1 is related to work on value learning, corrigibility, and multi-multi alignment. Section 2 is related to work on inner alignment, robustness and interpretability in machine learning, as well as informed and scalable oversight. Section 3 is related to work on interpretability in machine learning, as well as deep learning theory. Finally, section 4 is related to OpenAI’s approach to AI alignment.
Furthermore, not all these problems need to be solved in order to build powerful aligned AI. I would break it down this way:
My thanks to Erik Jenner for commenting on a draft of this post.
It’s actually slightly unfair to conflate this with RLHF, because reinforcement learning uses reward to shape agents’ thoughts, rather than building agents that optimize for reward, but I think this critique is relevant to understanding problems with RLHF, for reasons gestured to in section 2. ↩
I don’t think that this is actually what the people behind ‘constitutional AI’ were thinking, but it’s nice and linkable, and this is a proposal that some people talk about. ↩
Also, such a proof would plausibly require modelling the range of situations your AI would find itself in, which is a challenge to formalize. h/t Erik Jenner for making this point. ↩
Presumably evolution would, given enough time, eventually shape our psychology so that we abstain from contraception enough to have lots of children. But for the present point, what’s important is that it didn’t manage to instill the right desires on the first try, before we were powerful enough to invent technology to suit our interests. ↩
Note that there are some subtleties in this definition, as described here, but it will do for now. ↩
It’s been proposed that AIs will be bad at deception before they’re good at it, just as they were bad at chess before they were good at it, and that this will give us advance warning and time to solve the problem. Besides my worry that existing AIs can already exhibit primitive deceptive behaviour, and that this doesn’t seem to be spurring effective research into the failure mode, I also think that AIs will be able to evaluate whether they can deceive effectively (in service of another goal) before they can actually do so. Given that ineffective deception is worse than useless, I’d expect a regime where AIs refrain from behaving deceitfully until they’re able to do so effectively. ↩