Cross-posted to the Alignment Forum.
One thing I worry about sometimes is people writing code with optimisers in it, without realising that that’s what they’re doing. An example of this: suppose you were doing deep reinforcement learning, using optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I’d be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.
In order to avoid such scenarios, it would be nice if one could look at an algorithm and determine if it was doing optimisation. Ideally, this would involve an objective definition of optimisation that could be checked from the source code of the algorithm, rather than something like “an optimiser is a system whose behaviour can’t usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function”, since such a definition breaks down when you have the algorithm’s source code and can compute its behaviour mechanically.
You might think about optimisation as follows: a system is optimising some objective function to the extent that that objective function attains much higher values than would be attained if the system didn’t exist, or were doing some other random thing. This type of definition includes those put forward by Yudkowsky and Oesterheld. However, I think there are crucial counterexamples to this style of definition.
Firstly, consider a lid screwed onto a bottle of water. If not for this lid, or if the lid had a hole in it or were more loose, the water would likely exit the bottle via evaporation or being knocked over, but with the lid, the water stays in the bottle much more reliably than otherwise. As a result, you might think that the lid is optimising the water remaining inside the bottle. However, I claim that this is not the case: the lid is just a rigid object designed by some optimiser that wanted water to remain inside the bottle.
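To make the worry concrete, here is a minimal sketch of a purely behavioural test applied to a toy bottle. Everything in it is an assumption made for illustration: the `water_left` model, its evaporation and spill rates, and the `behavioural_score` measure (the fraction of randomly chosen alternative lids that do worse than the actual one) are hypothetical stand-ins, not the definitions given by Yudkowsky or Oesterheld.

```python
import random


def water_left(lid_tightness, days=30, seed=0):
    """Toy model (hypothetical): fraction of water left after `days`.

    lid_tightness is in [0, 1]: 1.0 is a fully screwed-on lid, 0.0 is no lid.
    """
    rng = random.Random(seed)  # same seed for every lid, so the knocks are shared
    water = 1.0
    for _ in range(days):
        # Evaporation is slower the tighter the lid.
        water *= 1.0 - 0.05 * (1.0 - lid_tightness)
        # The bottle is occasionally knocked over; a looser lid spills more.
        if rng.random() < 0.1:
            water *= lid_tightness
    return water


def behavioural_score(actual, alternatives, objective):
    """Fraction of alternative systems that score strictly worse than the actual one."""
    actual_value = objective(actual)
    worse = sum(objective(alt) < actual_value for alt in alternatives)
    return worse / len(alternatives)


# "Some other random thing": lids of random tightness.
rng = random.Random(1)
alternatives = [rng.random() for _ in range(1000)]

score = behavioural_score(1.0, alternatives, water_left)
print(f"Fraction of random lids beaten by the real lid: {score:.3f}")
```

On this toy model the screwed-on lid beats essentially every random alternative, so a test that looks only at outcomes would rate it as a strong optimiser of water retention, even though nothing in the lid searches, compares, or selects.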
This isn’t an incredibly compelling counterexample, since the lid doesn’t qualify as an optimiser according to Yudkowsky’s definition: it can be described more simply as a rigid object of a certain shape than as an optimiser. I am somewhat uncomfortable with this move (surely systems that are sub-optimal in complicated ways that are easily predictable from their source code should still count as optimisers?), but it’s worth coming up with another counterexample to which this objection won’t apply.
Secondly, consider my liver. It’s a complex physical system that’s hard to describe, but if it were absent or behaved very differently, my body wouldn’t work, I wouldn’t remain alive, and I wouldn’t be able to make any money, meaning that my bank account balance would be significantly lower than it is. In fact, subject to the constraint that the rest of my body works in the way that it actually works, it’s hard to imagine what my liver could do which would result in a much higher bank balance. Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it “detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion”—even though that gives a less precise account of the liver’s behaviour.
In fact, my liver’s behaviour has something to do with optimising my income: it was created by evolution, which was sort of an optimisation process for agents that reproduce a lot, which has a lot to do with me having a lot of money in my bank account. It also sort of optimises some aspects of my digestion, which is a necessary sub-process of me getting a lot of money in my bank account. This explains the link between my liver function and my income without having to treat my liver as a bank account funds maximiser.
What’s a better theory of optimisation that doesn’t fall prey to these counterexamples? I don’t know. That being said, I think that it should involve the internal details of the algorithms implemented by the physical systems in question. For instance, I think of gradient ascent as an optimisation algorithm because I can tell that at each iteration, it improves on its objective function a bit. Ideally, with such a definition you could decide whether an algorithm was doing optimisation without having to run it and see its behaviour, since a central point of having a definition of optimisation is to help you avoid running systems that do it.
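As a small illustration of the gradient ascent remark, here is a sketch on a toy problem. The objective, the starting point, and the step size are all assumptions chosen for the example. The per-step improvement holds here because the objective is smooth and concave and the step size is small; with too large a step, gradient ascent can overshoot and make things worse.

```python
def objective(x):
    """Toy concave objective (hypothetical), maximised at x = 3."""
    return -(x - 3.0) ** 2


def gradient(x):
    """Derivative of the toy objective with respect to x."""
    return -2.0 * (x - 3.0)


def gradient_ascent(x0, step_size=0.1, iterations=50):
    """Repeatedly step in the direction of the gradient, recording the objective."""
    x = x0
    history = [objective(x)]
    for _ in range(iterations):
        x = x + step_size * gradient(x)  # the update rule, readable from the source
        history.append(objective(x))
    return x, history


x_final, history = gradient_ascent(x0=0.0)

# Check that every iteration improved on the objective.
improved_every_step = all(later > earlier for earlier, later in zip(history, history[1:]))
print(f"final x = {x_final:.5f}, improved at every step: {improved_every_step}")
```

The check above runs the algorithm, which is exactly what one would like to avoid having to do. For this toy case the same fact can be read off the update rule: each step multiplies the distance to the maximiser by 1 - 2 * step_size = 0.8, so the objective must increase. That is the kind of inspection of internal details the paragraph above has in mind.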
Thanks to Abram Demski, who came up with the bottle-cap example in a conversation about this idea.