Website of Daniel Filan.
http://danielfilan.com/
Bottle Caps Aren't Optimisers
<p><em>Cross-posted to the <a href="https://www.alignmentforum.org/posts/26eupx3Byc8swRS7f/bottle-caps-aren-t-optimisers">Alignment Forum</a>.</em></p>
<p>One thing I worry about sometimes is people writing code with optimisers in it, without realising that that’s what they’re doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I’d be wary of deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.</p>
<p>In order to avoid such scenarios, it would be nice if one could look at an algorithm and determine if it was doing optimisation. Ideally, this would involve an objective definition of optimisation that could be checked from the source code of the algorithm, rather than <a href="https://arxiv.org/abs/1805.12387">something</a> like “an optimiser is a system whose behaviour can’t usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function”, since such a definition breaks down when you have the algorithm’s source code and can compute its behaviour mechanically.</p>
<p>You might think about optimisation as follows: a system is optimising some objective function to the extent that that objective function attains much higher values than would be attained if the system didn’t exist, or were doing some other random thing. This type of definition includes those put forward by <a href="https://www.lesswrong.com/posts/Q4hLMDrFd8fbteeZ8/measuring-optimization-power">Yudkowsky</a> and <a href="https://link.springer.com/article/10.1007/s11229-015-0883-1">Oesterheld</a>. However, I think there are crucial counterexamples to this style of definition.</p>
<p>Firstly, consider a lid screwed onto a bottle of water. If not for this lid, or if the lid had a hole in it or fitted more loosely, the water would likely exit the bottle via evaporation or the bottle being knocked over, but with the lid, the water stays in the bottle much more reliably than otherwise. As a result, you might think that the lid is optimising the water remaining inside the bottle. However, I claim that this is not the case: the lid is just a rigid object designed by some optimiser that wanted water to remain inside the bottle.</p>
<p>This isn’t an incredibly compelling counterexample, since it doesn’t qualify as an optimiser according to Yudkowsky’s definition: it can be more simply described as a rigid object of a certain shape than as an optimiser, so it isn’t an optimiser. I am somewhat uncomfortable with this move (surely systems that are sub-optimal in complicated ways that are easily predictable from their source code should still count as optimisers?), but it’s worth coming up with another counterexample to which this objection won’t apply.</p>
<p>Secondly, consider my <a href="https://en.wikipedia.org/wiki/Liver">liver</a>. It’s a complex physical system that’s hard to describe, but if it were absent or behaved very differently, my body wouldn’t work, I wouldn’t remain alive, and I wouldn’t be able to make any money, meaning that my bank account balance would be significantly lower than it is. In fact, subject to the constraint that the rest of my body works in the way that it actually works, it’s hard to imagine what my liver could do which would result in a much higher bank balance. Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it “detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion”—even though that gives a less precise account of the liver’s behaviour.</p>
<p>In fact, my liver’s behaviour has something to do with optimising my income: it was created by evolution, which was sort of an optimisation process for agents that reproduce a lot, which has a lot to do with me having a lot of money in my bank account. It also sort of optimises some aspects of my digestion, which is a necessary sub-process of me getting a lot of money in my bank account. This explains the link between my liver function and my income without having to treat my liver as a bank account funds maximiser.</p>
<p>What’s a better theory of optimisation that doesn’t fall prey to these counterexamples? I don’t know. That being said, I think that it should involve the internal details of the algorithms implemented by those physical systems. For instance, I think of gradient ascent as an optimisation algorithm because I can tell that at each iteration, it improves on its objective function a bit. Ideally, with such a definition you could decide whether an algorithm was doing optimisation without having to run it and see its behaviour, since one of the whole points of a definition of optimisation is to help you avoid running systems that do it.</p>
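<p>As a concrete illustration, here is gradient ascent written out in Python. This is my own minimal sketch, not code from any particular system, and the objective function is an arbitrary example. The point is that one can verify from the source alone that each iteration moves in the direction of increasing objective, without ever running the algorithm:</p>

```python
# A minimal sketch of gradient ascent on a one-dimensional objective.
# The objective and step size here are illustrative choices only.

def gradient_ascent(grad, x0, step_size=0.1, n_steps=100):
    """Repeatedly move x in the direction of the objective's gradient."""
    x = x0
    for _ in range(n_steps):
        # For a small enough step size, each iteration increases the
        # objective a bit -- visible from the update rule itself.
        x = x + step_size * grad(x)
    return x

# Example objective f(x) = -(x - 3)**2, maximised at x = 3;
# its gradient is -2 * (x - 3).
x_final = gradient_ascent(lambda x: -2 * (x - 3), x0=0.0)
```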
<p><em>Thanks to Abram Demski, who came up with the bottle-cap example in a conversation about this idea.</em></p>
Fri, 31 Aug 2018 00:00:00 +0000
http://danielfilan.com//2018/08/31/bottle_caps_arent_optimisers.html
Mechanistic Transparency for Machine Learning
<p>Cross-posted to the <a href="https://www.alignmentforum.org/posts/3kwR2dufdJyJamHQq/mechanistic-transparency-for-machine-learning">Alignment Forum</a>.</p>
<p>Lately I’ve been trying to come up with a thread of AI alignment research that (a) I can concretely see how it significantly contributes to actually building aligned AI and (b) seems like something that I could actually make progress on. After some thinking and narrowing down possibilities, I’ve come up with one – basically, a particular angle on machine learning transparency research.</p>
<p>The angle that I’m interested in is what I’ll call <em>mechanistic</em> transparency. This roughly means developing tools that take a neural network designed to do well on some task, and output something like pseudocode for the algorithm the neural network implements, pseudocode that could be read and understood by developers of AI systems without having to actually run the system. This pseudocode might use high-level primitives like ‘sort’ or ‘argmax’ or ‘detect cats’, which should themselves be reducible to pseudocode of a similar type, until ideally everything bottoms out in very small pieces of the original neural network, each small enough that one could understand its functional behaviour with pen and paper within an hour. These tools might also slightly modify the network to make it more amenable to this analysis, in such a way that the modified network performs approximately as well as the original network.</p>
<p>There are a few properties that this pseudocode must satisfy. Firstly, it must be faithful to the network that is explained, such that if one substitutes in the pseudocode for each high-level primitive recursively, the result should be the original neural network, or a network close enough to the original that the differences are irrelevant (although just in case, the reconstructed network that is exactly explained should presumably be the one deployed). Secondly, the high-level primitives must be somewhat understandable: the pseudocode for a 256-layer neural network for image classification should not be <code class="highlighter-rouge">output = f2(f1(input))</code> where <code class="highlighter-rouge">f1</code> is the action of the first 128 layers and <code class="highlighter-rouge">f2</code> is the action of the next 128 layers, but rather break down into edge detectors being used to find floppy ears and spheres and textures, and those being combined in reasonable ways to form judgements of what the image depicts. The high-level primitives should be as human-understandable as possible, ideally ‘carving the computation at the joints’ by representing any independent sub-computations or repeated applications of the same function (so, for instance, if a convolutional network is represented as if it were fully connected, these tools should be able to recover convolutional structure). Finally, the high-level primitives in the pseudocode should ideally be understandable enough to be modularised and used in different places for the same function.</p>
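<p>To make the shape of this concrete, here is a toy sketch in Python of what such pseudocode might look like for an image classifier. Every name here (‘detect_edges’, ‘find_floppy_ears’, and so on) is a hypothetical illustration with a placeholder body, not a real decomposition of any network; the point is just the hierarchical structure, in which each readable high-level function would itself expand into similar pseudocode:</p>

```python
# A toy sketch of hierarchical pseudocode for a hypothetical image
# classifier. All names and bodies are illustrative placeholders.

def detect_edges(image):
    # Stand-in for an early stage of the (hypothetical) network; in a
    # real decomposition this would expand into further pseudocode.
    return [pixel for row in image for pixel in row if pixel > 0.5]

def find_floppy_ears(edges):
    return len(edges) > 2  # placeholder criterion

def find_spheres(edges):
    return len(edges) > 4  # placeholder criterion

def judge_animal(has_ears, has_spheres):
    # Combine intermediate judgements in a readable way.
    return "dog" if has_ears and not has_spheres else "other"

def classify_image(image):
    # Top level: a readable composition of high-level primitives.
    edges = detect_edges(image)
    return judge_animal(find_floppy_ears(edges), find_spheres(edges))
```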
<p>This agenda nicely relates to some existing work in machine learning. For instance, I think that there are strong synergies with research on <a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture15.pdf">compression of neural networks</a>. This is partially due to background models about compression being related to understanding (see the ideas in common between Kolmogorov complexity, MDL, Solomonoff induction, and Martin-Löf randomness), and partially due to object-level details about this research. For example, sparsification seems related to increased modularity, which should make it easier to write understandable pseudocode. Another example is the efficacy of weight quantisation, which means that the least significant bits of the weights aren’t very important, indicating that the relations between the high-level primitives should be modular in an understandable way and not have crucial details depend on some of the least significant bits of the output.</p>
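<p>The point about quantisation can be illustrated with a small self-contained example (my own, with made-up weights): rounding a linear layer’s weights to a coarse grid barely changes its output, suggesting that the computation does not hinge on the weights’ least significant bits:</p>

```python
# Illustration: quantising weights to multiples of 1/16 changes a
# linear layer's output only slightly. Weights and inputs are made up.

def quantise(weights, step=1 / 16):
    # Round each weight to the nearest multiple of `step`.
    return [round(w / step) * step for w in weights]

def linear(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

weights = [0.731, -0.415, 0.062, 0.988]
inputs = [1.0, 0.5, -1.0, 0.25]

full = linear(weights, inputs)
coarse = linear(quantise(weights), inputs)
error = abs(full - coarse)  # small relative to the output itself
```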
<p>The Distill post on the <a href="https://distill.pub/2018/building-blocks/">building blocks of interpretability</a> includes some other examples of work that I feel is relevant. For instance, work on using matrix factorisation to group neurons seems very related to constructing high-level primitives, and work on neuron visualisation should help with understanding the high-level primitives if their output corresponds to a subset of neurons in the original network.</p>
<p>I’m excited about this agenda because I see it as giving the developers of AI systems tools to detect and correct properties of their AI systems that they see as undesirable, without having to deploy the system in a test environment that they must laboriously ensure is adequately sandboxed. You could imagine developers checking if their systems conform to theories of aligned AI, or detecting any ‘deceive the human’ subroutine that might exist. I see this as fairly robustly useful, being helpful in most stories of how one would build an aligned AI. The exception is if AGI is built without things which look like modern machine learning algorithms, which I see as unlikely, and at any rate hope that lessons transfer to the methods which are used.</p>
<p>I also believe that this line of research has a shot at working for systems which act in the world. It seems hard for me to describe how I detect laptops given visual information, but given visual primitives like ‘there’s a laptop there’, it seems much easier for me to describe how I play tetris or even go. As such, I would expect tools developed in this way to illuminate the strategy followed by tetris-playing DQNs by referring to high-level primitives like ‘locate T tetromino’, that themselves would have to be understood using neuron visualisation techniques.</p>
<p>Visual primitives are probably not the only things that would be hard to fully understand using the pseudocode technique. In cases where humans evade oversight by other humans, I assert that it is often not due to consequentialist reasoning, but rather due to avoiding things which are frustrating or irritating, where frustration/irritation is hard to introspect on but seems to reliably steer away from oversight in cases where that oversight would be negative. A possible reason that this frustration/irritation is hard to introspect upon is that it is complicated and hard to decompose cleanly, like our object recognition systems are. Similarly, you could imagine that one high-level primitive that guides the AI system’s behaviour is hard to decompose and needs techniques like neuron visualisation to understand. However, the mechanistic decomposition would at least allow us to locate this subsystem and determine how it is used in the network, guiding the tests we perform on it. Furthermore, in the case of humans, it’s quite possible that our frustration/irritation is hard to introspect upon not because it’s hard to understand, but rather because it’s strategically better to not be able to introspect upon it (see the ideas in the book <a href="http://elephantinthebrain.com/">The Elephant in the Brain</a>), suggesting that this problem might be less severe than it seems.</p>
Tue, 10 Jul 2018 00:00:00 +0000
http://danielfilan.com//2018/07/10/mechanistic_transparency.html
Insights from 'The Strategy of Conflict'
<p>I recently read <a href="https://en.wikipedia.org/wiki/Thomas_Schelling">Thomas Schelling</a>’s book ‘The Strategy of Conflict’. Many of the ideas it contains are now pretty widely known, especially in the rationalist community, such as the value of Schelling points when coordination must be obtained without communication, or the value of being able to commit oneself to actions that seem irrational. However, there are a few ideas that I got from the book that I don’t think are as embedded in the public consciousness.</p>
<h3 id="schelling-points-in-bargaining">Schelling points in bargaining</h3>
<p>The first such idea is the value of Schelling points in bargaining situations where communication <em>is</em> possible, as opposed to coordination situations where it is not. For instance, if you and I were dividing up a homogeneous pie that we both wanted as much of as possible, it would be strange if I told you that I demanded at least 52.3% of the pie. If I did, you would probably expect me to give some argument for the number 52.3% that distinguishes it from 51% or 55%. Indeed, it would be more strange than asking for 66.67%, which itself would be more strange than asking for 50%, which would be the most likely outcome were we to really run the experiment. Schelling uses as an example</p>
<blockquote>
<p>the remarkable frequency with which long negotiations over complicated quantitative formulas or <em>ad hoc</em> shares in some costs or benefits converge ultimately on something as crudely simple as equal shares, shares proportionate to some common magnitude (gross national product, population, foreign-exchange deficit, and so forth), or the shares agreed on in some previous but logically irrelevant negotiation.</p>
</blockquote>
<p>The explanation is basically that in bargaining situations like these, any agreement could be made better for either side, but it can’t be made better for both simultaneously, and any agreement is better than no agreement. Talk is cheap, so it’s difficult for any side to credibly commit to only accept certain arbitrary outcomes. Therefore, as Schelling puts it,</p>
<blockquote>
<p>Each party’s strategy is guided mainly by what he expects the other to accept or insist on; yet each knows that the other is guided by reciprocal thoughts. The final outcome must be a point from which neither expects the other to retreat; yet the main ingredient of this expectation is what one thinks the other expects the first to expect, and so on. Somehow, out of this fluid and indeterminate situation that seemingly provides no logical reason for anybody to expect anything except what he expects to be expected to expect, a decision is reached. These infinitely reflexive expectations must somehow converge upon a single point, at which each expects the other not to expect to be expected to retreat.</p>
</blockquote>
<p>In other words, a Schelling point is a ‘natural’ outcome that somehow has the intrinsic property that each party can be expected to demand that they do at least as well as they would in that outcome.</p>
<p>Another way of putting this is that once we are bargained down to a Schelling point, we are not expected to let ourselves be bargained down further. Schelling uses the example of soldiers fighting over a city. If one side retreats 13 km, they might be expected to retreat even further, unless they retreat to the single river running through the city. This river can serve as a Schelling point, and the attacking force might genuinely expect that their opponents will retreat no further.</p>
<h3 id="threats-and-promises">Threats and promises</h3>
<p>A second interesting idea contained in the book is the distinction between threats and promises. On some level, they’re quite similar bargaining moves: in both cases, I make my behaviour dependent on yours by promising to sometimes do things that aren’t narrowly rational, so that behaving in the way I want you to becomes profitable for you. When I threaten you, I say that if you don’t do what I want, I’ll force you to incur a cost even at a cost to myself, perhaps by beating you up, ruining your reputation, or refusing to trade with you. The purpose is to ensure that doing what I want becomes more profitable for you, taking my threat into account. When I make a promise, I say that if you do do what I want, I’ll make your life better, again perhaps at a cost to myself, perhaps by giving you money, recommending that others hire you, or abstaining from behaviour that you dislike. Again, the purpose is to ensure that doing what I want, once you take my promise into account, is better for you than other options.</p>
<p>There is an important strategic difference between threats and promises, however. If a threat is successful, then it is not carried out. Conversely, the point of promises is to induce behaviour that forces you to carry out the promise. This means that in the ideal case, threat-making is cheap for the threatener, but promise-making is expensive for the promiser.</p>
<p>This difference has implications for one’s ability to convince one’s bargaining partner that one will carry out one’s threat or promise. If you and I make five bargains in a row, and in the first four situations I made a promise that I subsequently kept, then you have some reason for confidence that I will keep my fifth promise. However, if I make four threats in a row, all of which successfully deter you from engaging in behaviour that I don’t want, then the fifth time I threaten you, you have no more evidence that I will carry out the threat than you did initially. Therefore, building a reputation as somebody who carries out their threats is somewhat more difficult than building a reputation for keeping promises. I must either occasionally make threats that fail to deter my bargaining partner, thus incurring both the cost of my partner not behaving in the way I prefer and also the cost of carrying out the threat, or visibly make investments that will make it cheap for me to carry out threats when necessary, such as hiring goons or being quick-witted and good at gossiping.</p>
<h3 id="mutually-assured-destruction">Mutually Assured Destruction</h3>
<p>The final cluster of ideas contained in the book that I will talk about are implications of the model of <a href="https://en.wikipedia.org/wiki/Mutual_assured_destruction">mutually assured destruction</a> (MAD). In a MAD dynamic, two parties both have the ability, and to some extent the inclination, to destroy the other party, perhaps by exploding a large number of nuclear bombs near them. However, they do not have the ability to destroy the other party immediately: when one party launches their nuclear bombs, the other has some amount of time to launch a second strike, sending nuclear bombs to the first party, before the first party’s bombs land and annihilate the second party. Since both parties care about not being destroyed more than they care about destroying the other party, and both parties know this, they each adopt a strategy where they commit to launching a second strike in response to a first strike, and therefore no first strike is ever launched.</p>
<p>Compare the MAD dynamic to the case of two gunslingers in the wild west in a standoff. Each gunslinger knows that if she does not shoot first, she will likely die before being able to shoot back. Therefore, as soon as she thinks that the other is about to shoot, or that the other thinks that she is about to shoot, or that the other thinks that she thinks that the other is about to shoot, et cetera, she needs to shoot before the other does. As a result, the gunslinger dynamic is an unstable one that is likely to result in bloodshed. In contrast, the MAD dynamic is characterised by peacefulness and stability, since each side knows that the other will not launch a first strike for fear of a second strike.</p>
<p>In the final few chapters of the book, Schelling discusses what has to happen in order to ensure that MAD remains stable. One implication of the model that is perhaps counterintuitive is that if you and I are in a MAD dynamic, it is vitally important to me that you know that you have second-strike capability, and that you know that I know that you know that you have it. If you don’t have second-strike capability, then you will realise that I have the ability to launch a first strike. Furthermore, if you think that I know that you know that you don’t have second-strike capability, then you’ll think that I’ll be tempted to launch a first strike myself (since perhaps my favourite outcome is one where you’re destroyed). In this case, you’d rather launch a first strike before I do, since you anticipate being destroyed either way. Therefore, I have an incentive to help you invest in technology that will help you accurately perceive whether or not I am striking, as well as technology that will hide your weapons (like <a href="https://en.wikipedia.org/wiki/Ballistic_missile_submarine">ballistic missile submarines</a>) so that I cannot destroy them with a first strike.</p>
<p>A second implication of the MAD model is that it is much more stable if both sides have more nuclear weapons. Suppose that I need 100 nuclear weapons to destroy my enemy, and he is thinking of using his nuclear weapons to wipe out mine (since perhaps mine are not hidden), allowing him to launch a first strike. Schelling writes:</p>
<blockquote>
<p>For illustration suppose his accuracies and abilities are such that one of his missiles has a 50-50 chance of knocking out one of ours. Then, if we have 200, he needs to knock out just over half; at 50 percent reliability he needs to fire just over 200 to cut our residual supply to less than 100. If we had 400, he would need to knock out three-quarters of ours; at a 50 percent discount rate for misses and failures he would need to fire more than twice 400, that is, more than 800. If we had 800, he would have to knock out seven-eighths of ours, and to do it with 50 percent reliability he would need over three times that number, or more than 2400. And so on. The larger the initial number on the “defending” side, the larger the <em>multiple</em> required by the attacker in order to reduce the victim’s residual supply to below some “safe” number.</p>
</blockquote>
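<p>Schelling’s figures can be checked directly. Modelling the attack as spreading shots evenly over the defender’s missiles, each shot destroying its target independently with probability 1/2 (my formalisation of his “50 percent reliability”), the expected number of survivors out of N defenders is N times (1/2) raised to the power shots/N, which drops below the “safe” number of 100 once the number of shots exceeds N times the base-2 logarithm of N/100:</p>

```python
import math

# Expected defending missiles surviving when `shots` attacking missiles
# are spread evenly over `n_defenders` targets, each shot killing its
# target independently with probability 1/2.

def expected_survivors(n_defenders, shots):
    return n_defenders * 0.5 ** (shots / n_defenders)

def shots_needed(n_defenders, safe_number=100):
    # Survivors drop below `safe_number` once
    # shots > n * log2(n / safe_number); the next whole shot suffices.
    return math.floor(n_defenders * math.log2(n_defenders / safe_number)) + 1

results = {n: shots_needed(n) for n in (200, 400, 800)}
# results == {200: 201, 400: 801, 800: 2401}: just over N, 2N, and 3N
# shots respectively, matching Schelling's figures.
```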
<p>Consequently, if both sides have many times more nuclear weapons than are needed to destroy the entire world, the situation is much more stable than if they had barely enough to destroy the enemy: each is comforted in their second strike capabilities, and doesn’t need to respond as aggressively to arms buildups by the other party.</p>
<p>It is important to note that this conclusion is only valid in a ‘classic’ simplified MAD dynamic. If for each nuclear weapon that you own, there is some possibility that a rogue actor will steal the weapon and <a href="https://en.wikipedia.org/wiki/Nuclear_terrorism">use it for their own ends</a>, the value of large arms buildups becomes much less clear.</p>
<p>The final conclusion I’d like to draw from this model is that it would be preferable to not have weapons that could destroy other weapons. For instance, suppose that both parties were countries that had biological weapons that when released infected a large proportion of the other country, caused them obvious symptoms, and then killed them a week later, leaving a few days between the onset of symptoms and losing the ability to effectively do things. In such a situation, you would know that if I struck first, you would have ample ability to get still-functioning people to your weapons centres and launch a second strike, regardless of your ability to detect the biological weapon before it arrives, or the number of weapons and weapons centres that you or I have. Therefore, you are not tempted to launch first. Since this reasoning holds regardless of what type of weapon you have, it is always better for me to have this type of biological weapon in a MAD dynamic, rather than any nuclear weapons that can potentially destroy weapons centres, so as to preserve your second strike capabilities. I speculatively think that this argument should hold for real life biological weapons, since it seems to me that they could be destructive enough to act as a deterrent, but that authorities could detect their spread early enough to send remaining healthy government officials to launch a second strike.</p>
Wed, 03 Jan 2018 00:00:00 +0000
http://danielfilan.com//2018/01/03/schelling.html
Topology
<p>I have a friend who is generally a fan of mathematical structures and the relationships between them. For Christmas, he asked me to make a diagram of certain sets of mathematical objects and the relationships between them. One instance of this would have been a <a href="https://complexityzoo.uwaterloo.ca/File:Really-important-inclusions.png">diagram</a> of <a href="https://complexityzoo.uwaterloo.ca/Complexity_Zoo">complexity classes</a>, with an arrow from class C to class D if every problem in class C was also in class D. Another instance would be a diagram of types of algebraic objects (<a href="https://en.wikipedia.org/wiki/Monoid">monoids</a>, <a href="https://en.wikipedia.org/wiki/Group_(mathematics)">groups</a>, <a href="https://en.wikipedia.org/wiki/Ring_(mathematics)">rings</a>, etc.), with arrows indicating facts such as that all groups are monoids. Instead, I chose to diagram types of <a href="https://en.wikipedia.org/wiki/Topological_space">topological spaces</a> - plotting properties involving separation, compactness, connectivity, and metrisability, as well as which properties implied which other properties. I also wrote definitions that could theoretically be understood by anyone who understood set theory and equivalence relations, some theorems that hopefully provoke interest in these properties, and some example topological spaces to classify.</p>
<p>Here is the <a href="/pdfs/topology_graph.pdf">diagram</a> (<a href="/dot_files/topology_graph.dot">dot file</a>), the <a href="/pdfs/topology_definitions.pdf">definitions</a> (<a href="/tex/topology_definitions.tex">tex</a>), the <a href="/pdfs/topology_theorems.pdf">theorems</a> (<a href="/tex/topology_theorems.tex">tex</a>), and the <a href="/pdfs/fun_topological_spaces.pdf">example spaces</a> (<a href="/tex/fun_topological_spaces.tex">tex</a>). In the process of researching topological facts to include, I also found a <a href="https://topospaces.subwiki.org/wiki/Main_Page">wiki</a> specifically about topology, and a <a href="http://topology.jdabbs.com/">website</a> that is a search engine for topological spaces.</p>
<p>I hope you enjoy them, and if you spot any errors, please email me and let me know.</p>
Wed, 04 Jan 2017 00:00:00 +0000
http://danielfilan.com//2017/01/04/topology.html
A discussion on the usefulness of 538's forecasts
<p>Recently, an <a href="https://www.currentaffairs.org/2016/12/why-you-should-never-ever-listen-to-nate-silver">article</a> was published disputing the usefulness of 538’s forecasts of political events and the value of Nate Silver’s opinions. I thought that this was largely misguided, and so got in an argument on Facebook about it. The argument is preserved here for posterity, because I basically agree with what I said.</p>
<p>My first response to the article:</p>
<blockquote>
<blockquote>
<p>He bases his claim to have succeeded off his having given Trump a somewhat higher probability of a win than some other people.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Make that a significantly higher probability of a win than anyone else who was forecasting based off poll data (rather than yard signs/halloween costumes/feelings). I’m pretty sure the closest contender was The Upshot, who gave Trump half the chance of winning that Silver did. That’s a pretty significant difference.</p>
</blockquote>
<blockquote>
<blockquote>
<p>Silver makes sure to hedge every statement carefully so that he can never actually be wrong. And when things don’t go his way, he lectures the public on their ignorance of statistics. After all, probability isn’t certainty, he didn’t say it would definitely happen.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Sure, but things usually go his way. You can check this by looking at all the races that he predicted this year and in previous years - he ends up looking relatively good.</p>
</blockquote>
<blockquote>
<blockquote>
<p>But recognize what it means: even when Silver isn’t wrong, because he’s hedged everything carefully, he’s still not offering any information of value.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Of course he’s offering information of value. If you think that Donald Trump has a 25% chance of being president, you’re going to be significantly more interested in preparing for that eventuality than if you think he has a 0.5% chance of becoming president, and significantly less than if you think that he has a 75% chance of becoming president.</p>
</blockquote>
<blockquote>
<blockquote>
<p>But for anyone interested in the actual human lives affected by political questions, Silver’s analyses are of almost no help. They can tell us today that Silver thinks Trump has a 5% chance of winning. But then we might wake up tomorrow and find that Silver now thinks Trump has a 30% chance of winning.</p>
</blockquote>
</blockquote>
<blockquote>
<p>If you think that Trump has a 5% chance of winning, then more likely than not you should think that his chances will decrease over time, not increase. Maybe they eventually shoot up to 100%, but there’s only a 5% chance of that - that’s just what the 5% number means.</p>
</blockquote>
<blockquote>
<blockquote>
<p>And the important question for anyone trying to affect the world, as opposed to just watching the events in it unfold, is how those chances can be made to change.</p>
</blockquote>
</blockquote>
<blockquote>
<p>If you want to affect the world, you need to know how much you can affect it, and part of that involves knowing what the chances of certain outcomes are.</p>
</blockquote>
<blockquote>
<blockquote>
<p>The problem is that poll data analysts are completely fucking useless in a crisis. They don’t understand anything that’s going on around them, and they’re powerless to predict what’s about to happen next.</p>
</blockquote>
</blockquote>
<blockquote>
<p>This is just not true. Probabilistic forecasts are useful for predicting what’s about to happen next, as demonstrated by their track record in 2008, 2012, and 2016, because that’s literally what they’re about.</p>
</blockquote>
<p>The response of someone who posted the article:</p>
<blockquote>
<blockquote>
<blockquote>
<p>But recognize what it means: even when Silver isn’t wrong, because he’s hedged everything carefully, he’s still not offering any information of value.</p>
</blockquote>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<p>Of course he’s offering information of value. If you think that Donald Trump has a 25% chance of being president, you’re going to be significantly more interested in preparing for that eventuality than if you think he has a 0.5% chance of becoming president, and significantly less than if you think that he has a 75% chance of becoming president.</p>
</blockquote>
</blockquote>
<blockquote>
<p>This doesn’t seem to engage with what Robinson’s criticism is. Robinson isn’t saying it wouldn’t be important to know that Trump has a 25% chance of becoming president. He’s saying that probability of that eventuality is not the thing reported by the number put on the 538 website. What is reported on the website is Silver’s guess about what that actual probability is, and a guess based on a methodology that most consumers of media do not understand and one that seems incredibly sensitive to…something (why was the number today 15% less than the number yesterday?)</p>
</blockquote>
<blockquote>
<p>If it’s difficult to tell what the relationship is supposed to be between the number Silver puts up and the number we’re actually interested in, then we have a problem that isn’t statistical.</p>
</blockquote>
<p>My response to that comment:</p>
<blockquote>
<blockquote>
<p>He’s saying that probability of that eventuality is not the thing reported by the number put on the 538 website. What is reported on the website is Silver’s guess about what that actual probability is, and a guess based on a methodology that most consumers of media do not understand and one that seems incredibly sensitive to…something (why was the number today 15% less than the number yesterday?)</p>
</blockquote>
</blockquote>
<blockquote>
<p>Firstly, I didn’t get the sense that that’s what Robinson was complaining about – what quotes made you think that this was the concern?</p>
</blockquote>
<blockquote>
<p>Secondly, I think that there’s good reason to think that the 538 forecast is pretty close to the probability that you would assign if you knew everything there was to know - you can give scores to probabilistic forecasts that reward them for being more certain rather than less and at the same time ensure that events given 90% probability happen 90% of the time. I think that these scores put the 538 forecasts in a good light (see for instance <a href="https://www.buzzfeed.com/jsvine/2016-election-forecast-grades">BuzzFeed’s analysis</a>). I’d be interested to hear reasons why the probabilities are bad other than referring to a few specific instances where they were wrong.</p>
</blockquote>
<blockquote>
<p>Thirdly, you say that the 538 forecasts are “a guess based on a methodology that most consumers of media do not understand and one that seems incredibly sensitive to…something (why was the number today 15% less than the number yesterday?)”. Admittedly, by the nature of probabilistic forecasts, they have to be a guess. I’m sure that most media consumers don’t understand them, but they could if they read <a href="http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/">538’s in-depth explanation</a>. Regarding the claim that they’re incredibly sensitive, I don’t really buy this. I can find exactly one day where the polls-only model jumps by 15%, right after the DNC when polling was really good, producing a bounce that that particular model didn’t adjust for. The polls-plus model, which does know about the conventions, didn’t show such a bounce. Why do you think that there’s so much of a problem that the forecast is basically meaningless?</p>
</blockquote>
<p>OP’s response:</p>
<blockquote>
<p>To the first thing - I’m describing what I would guess statistical forecasting would seem like to someone who isn’t in the relevant know. I take Robinson to be gesturing at the interpretation problem here: “Similarly, Silver will make predictions that have multiple components, so that if one part fails, the overall prediction will seem to have come true, even if its coming true had no relation to the reasons Silver originally offered.”</p>
</blockquote>
<blockquote>
<p>And here: “The myth of Nate Silver’s continued usefulness is based on a careful moving of goalposts”</p>
</blockquote>
<blockquote>
<p>Though in both spots he blames Silver, what seems to me to be at work is an underlying unclarity about what is being communicated.</p>
</blockquote>
<blockquote>
<p>Second, it may be the case that the 538 forecast is the probability that I would assign if I had all the data. But I’m certainly not claiming that predictions are useless, and in Robinson’s careful moments he doesn’t either.</p>
</blockquote>
<blockquote>
<p>Third - 15 percent was an arbitrary and hyperbolic number, I’m actually pretty surprised that that ever happened. My point is, again, on the side of a lay person just refreshing their screen and seeing a different number, and trying to figure out for themselves what has changed about the universe such that Trump has a better chance of winning today than he did yesterday. My guess would be that many visitors to his website would be at a loss to explain that sort of thing, which is useful to think about in terms of reporting stats. And again, I’ll emphasize that the uselessness of statistical forecasts is Robinson’s position, not mine - I don’t have any problem with statistical forecasts or any particular bone to pick with Silver.</p>
</blockquote>
<p>Me again:</p>
<blockquote>
<blockquote>
<p>“Similarly, Silver will make predictions that have multiple components, so that if one part fails, the overall prediction will seem to have come true, even if its coming true had no relation to the reasons Silver originally offered.” And here: “The myth of Nate Silver’s continued usefulness is based on a careful moving of goalposts”</p>
</blockquote>
</blockquote>
<blockquote>
<p>The first quote is in the context of Silver randomly saying stuff, which is probably legit. The second one is referring to the forecast, which as I’ve pointed out is better than he is acting like it is, see e.g. the Buzzfeed analysis.</p>
</blockquote>
<blockquote>
<blockquote>
<p>But I’m certainly not claiming that predictions are useless, and in Robinson’s careful moments he doesn’t either.</p>
</blockquote>
</blockquote>
<blockquote>
<p>I’m sure he has some moments where he doesn’t say that predictions are useless, but he also says “They can tell us today that Silver thinks Trump has a 5% chance of winning. But then we might wake up tomorrow and find that Silver now thinks Trump has a 30% chance of winning. And the important question for anyone trying to affect the world, as opposed to just watching the events in it unfold, is how those chances can be made to change”, and the only way I can reasonably interpret that is “probabilities are unimportant because they can change”.</p>
</blockquote>
<blockquote>
<blockquote>
<p>My point is, again, on the side of a lay person just refreshing their screen and seeing a different number, and trying to figure out for themselves what has changed about the universe such that Trump has a better chance of winning today than he did yesterday.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Firstly, I just don’t buy that this is what Robinson is talking about (is someone here friends with him so that he can be tagged?). Secondly, if this was your actual concern, the forecasts had an <a href="https://projects.fivethirtyeight.com/2016-election-forecast/updates/">‘updates’ tab</a> which included polls and how they moved the numbers. 538 also regularly had pieces and podcasts explaining why the numbers changed (<a href="http://fivethirtyeight.com/features/election-update-clinton-gains-and-the-polls-magically-converge/">link</a> to the most recent one).</p>
</blockquote>
<p>OP:</p>
<blockquote>
<blockquote>
<p>The first quote is in the context of Silver randomly saying stuff, which is probably legit. The second one is referring to the forecast, which as I’ve pointed out is better than he is acting like it is, see e.g. the Buzzfeed analysis.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Even if it is better than he is acting like it is <em>when properly interpreted</em>, it doesn’t follow that it is better than he is acting like it is on common, actual interpretations.</p>
</blockquote>
<blockquote>
<blockquote>
<p>I’m sure he has some moments where he doesn’t say that predictions are useless, but he also says “They can tell us today that Silver thinks Trump has a 5% chance of winning. But then we might wake up tomorrow and find that Silver now thinks Trump has a 30% chance of winning. And the important question for anyone trying to affect the world, as opposed to just watching the events in it unfold, is how those chances can be made to change”, and the only way I can reasonably interpret that is “probabilities are unimportant because they can change”.</p>
</blockquote>
</blockquote>
<blockquote>
<p>That strikes me as an entirely unfair interpretation. The point isn’t well made, but it’s incoherent to stick him with the claim that he doesn’t think probabilities are important if his warrant is that he thinks it’s important to change probabilities. Maybe he hasn’t earned a lot of rope, but we should do better than that.</p>
</blockquote>
<blockquote>
<blockquote>
<p>Firstly, I just don’t buy that this is what Robinson is talking about (is someone here friends with him so that he can be tagged?). Secondly, if this was your actual concern, the forecasts had an <a href="https://projects.fivethirtyeight.com/2016-election-forecast/updates/">‘updates’ tab</a> which included polls and how they moved the numbers. 538 also regularly had pieces and podcasts explaining why the numbers changed (<a href="http://fivethirtyeight.com/features/election-update-clinton-gains-and-the-polls-magically-converge/">link</a> to the most recent one).</p>
</blockquote>
</blockquote>
<blockquote>
<p>I started this discussion thread out by saying that this article wasn’t fair to Silver, and these are the things I had in mind. So this I concede straightaway, with the caveat that points to my general interest in articles like this: that consumers are culpably negligent in consuming information in the way that they do does not mean that producers of information are off the hook. If that culpable negligence is predictable then we might ask questions about further steps producers should take, and this article shows some stuff to take stock of. I don’t take it that Robinson is a particularly unsophisticated reader (I concede off the bat that having an axe to grind can make someone otherwise competent functionally equivalent to a bad reader, but in this case I don’t think that is the whole story).</p>
</blockquote>
<p>Me:</p>
<blockquote>
<blockquote>
<p>Even if it is better than he is acting like it is <em>when properly interpreted</em>, it doesn’t follow that it is better than he is acting like it is on common, actual interpretations.</p>
</blockquote>
</blockquote>
<blockquote>
<p>Sure, but it seems like 538 have taken great pains to help people interpret it better, and if you think “well it’s just hard to communicate probabilistic forecasts and 538 could have done it better”, then that’s fine but seems separate to the original article.</p>
</blockquote>
<blockquote>
<blockquote>
<p>That strikes me as an entirely unfair interpretation. The point isn’t well made, but it’s incoherent to stick him with the claim that he doesn’t think probabilities are important if his warrant is that he thinks it’s important to change probabilities.</p>
</blockquote>
</blockquote>
<blockquote>
<p>It does seem like he thinks that probabilistic forecasts are unimportant, given the above quote and the end of the article: “That doesn’t mean there’s anything wrong with Nate Silver, just that nobody should ever pay any attention to him. Nate Silver will probably always be the best poll data analyst. The problem is that poll data analysts are completely fucking useless in a crisis. They don’t understand anything that’s going on around them, and they’re powerless to predict what’s about to happen next… [Silver] tells you entirely about the world as it looks to him right now, rather than the world as it could suddenly be tomorrow.” The most straightforward readings of this I can make are either “The 538 forecast measures current sentiment but is bad at predicting the state of the race” (which I think is just factually false) or “Probabilistic forecasts are unimportant because they could change given effort”. I just can’t understand what else he could possibly mean.</p>
</blockquote>
<blockquote>
<blockquote>
<p>I started this discussion thread out by saying that this article wasn’t fair to Silver, and these are the things I had in mind. So this I concede straightaway, with the caveat that points to my general interest in articles like this: that consumers are culpably negligent in consuming information in the way that they do does not mean that producers of information are off the hook. If that culpable negligence is predictable then we might ask questions about further steps producers should take, and this article shows some stuff to take stock of.</p>
</blockquote>
</blockquote>
<blockquote>
<p>I think that this is pretty interesting, but almost disjoint to what I understood the article and the section you quoted to be about. Discussion about what the article means aside, I sort of agree, and think that 538 could have done better (e.g. by letting you sample maps from their forecasts), but at the same time think that they did do relatively well, especially to readers who read their articles about the forecast.</p>
</blockquote>
<p>OP:</p>
<blockquote>
<blockquote>
<p>Sure, but it seems like 538 have taken great pains to help people interpret it better, and if you think “well it’s just hard to communicate probabilistic forecasts and 538 could have done it better”, then that’s fine but seems separate to the original article.</p>
</blockquote>
</blockquote>
<blockquote>
<p>That’s not what I’m thinking. I’m thinking something more along the lines of “well what is it that you’re communicating when you report statistics in the sort of media context that we have?”</p>
</blockquote>
<blockquote>
<blockquote>
<p>It does seem like he thinks that probabilistic forecasts are unimportant, given the above quote and the end of the article: “That doesn’t mean there’s anything wrong with Nate Silver, just that nobody should ever pay any attention to him. Nate Silver will probably always be the best poll data analyst. The problem is that poll data analysts are completely fucking useless in a crisis. They don’t understand anything that’s going on around them, and they’re powerless to predict what’s about to happen next… [Silver] tells you entirely about the world as it looks to him right now, rather than the world as it could suddenly be tomorrow.” The most straightforward readings of this I can make are either “The 538 forecast measures current sentiment but is bad at predicting the state of the race” (which I think is just factually false) or “Probabilistic forecasts are unimportant because they could change given effort”. I just can’t understand what else he could possibly mean.</p>
</blockquote>
</blockquote>
<blockquote>
<p>From the looks of it, he’s arguing against the kind of fatalism people can develop when confronted with the sort of epistemic authority that statistics are often used to claim. Don’t let Nate tell you that battleground state is a lock for the RNC, says Robinson: go out and canvass anyway, because no matter what the polls tell Nate today, tomorrow is another day. Maybe you’re scratching your head and wondering why Nate is supposed to disagree with something like that - and that’s not without justice, as the thought that what I just said pits one against forecasting is at best confused - but if you are just scratching your head then you, as I said in the beginning, probably aren’t engaging with the perspective that Robinson seems to be inside of and speaking to.</p>
</blockquote>
<blockquote>
<blockquote>
<blockquote>
<p>I started this discussion thread out by saying that this article wasn’t fair to Silver, and these are the things I had in mind. So this I concede straightaway, with the caveat that points to my general interest in articles like this: that consumers are culpably negligent in consuming information in the way that they do does not mean that producers of information are off the hook. If that culpable negligence is predictable then we might ask questions about further steps producers should take, and this article shows some stuff to take stock of.</p>
</blockquote>
</blockquote>
</blockquote>
<blockquote>
<blockquote>
<p>I think that this is pretty interesting, but almost disjoint to what I understood the article and the section you quoted to be about. Discussion about what the article means aside, I sort of agree, and think that 538 could have done better (e.g. by letting you sample maps from their forecasts), but at the same time think that they did do relatively well, especially to readers who read their articles about the forecast.</p>
</blockquote>
</blockquote>
<blockquote>
<p>I don’t think so. If this use of statistics speaks so poorly to a class of otherwise engaged readers (on the guess that Robinson isn’t alone here) then I wonder what statistics for world-changers could look like. We have some thoughts here about how it would have to succeed in communicating about itself.</p>
</blockquote>
<p>At this point I got tired of responding.</p>
Fri, 30 Dec 2016 00:00:00 +0000
http://danielfilan.com//2016/12/30/on_538_vs_CA.html
Kelly bettors
<h3 id="the-kelly-criterion">The Kelly Criterion</h3>
<p>The Kelly criterion for betting tells you how much to wager when someone offers you a bet. First introduced in <a href="http://www.herrold.com/brokerage/kelly.pdf">this paper</a>, it deals with the situation where someone is offering you a contract that pays you €1 if the event <script type="math/tex">E</script> occurs (for concreteness, you can imagine <script type="math/tex">E</script> as the event that the Republican candidate wins the 2020 US Presidential election) and €0 otherwise. They are selling it for €<script type="math/tex">q</script>, and your probability for <script type="math/tex">E</script> is <script type="math/tex">p > q</script> (this is equivalent to the more common formulation with odds, but it’s easier for me to think about). As a result, you think that it’s worth buying this contract. In fact, they will sell you a scaled-up contract of your choice: for any real number <script type="math/tex">r \geq 0</script>, you can pay €<script type="math/tex">r</script> for a contract that pays you €<script type="math/tex">r/q</script> if <script type="math/tex">E</script> occurs, just as if you could buy <script type="math/tex">r/q</script> copies of the original contract. The question you face is this: how much of your money should you spend on this scaled-up contract? The Kelly criterion gives you an answer: you should spend <script type="math/tex">(p-q)/(1-q)</script> of your money.</p>
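<p>The Kelly fraction is easy to compute directly. Here is a minimal Python sketch (my own illustration, not from the original paper; the function name and the example numbers are mine):</p>

```python
def kelly_fraction(p: float, q: float) -> float:
    """Fraction of your wealth to spend on the scaled-up contract.

    p: your probability that the event E occurs.
    q: the price of a contract paying out 1 (p > q makes buying attractive).
    """
    if not (0 < q < p < 1):
        raise ValueError("requires 0 < q < p < 1")
    return (p - q) / (1 - q)

# If you think E has probability 0.6 and contracts cost 0.5,
# Kelly says to spend (0.6 - 0.5)/(1 - 0.5) = 1/5 of your wealth.
print(kelly_fraction(0.6, 0.5))  # approximately 0.2
```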
<p>Why would you spend this exact amount? One reason would be if you were an expected utility maximiser, and your utility was the logarithm of your wealth. Note that the logarithm is important here to make you risk averse: if you simply wanted to maximise your expected wealth after the bet, you would bet all your money. To show that expected log-wealth maximisers use the Kelly criterion, note that if your initial wealth is <script type="math/tex">W</script>, you spend <script type="math/tex">fW</script> on the scaled contract, and <script type="math/tex">E</script> occurs, you then have <script type="math/tex">(1-f)W + fW/q</script>, while if you bet that much and <script type="math/tex">E</script> does not occur, your wealth is only <script type="math/tex">(1-f)W</script>. The expected log-wealth maximiser therefore wants to maximise</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} U &= p\log \left( (1-f)W + \frac{fW}{q} \right) + (1-p) \log ((1-f)W) \\ &= p \log \left(1 - f + \frac{f}{q} \right) + (1-p) \log (1-f) + \log (W)\text{.} \end{align} %]]></script>
<p>The derivative of this with respect to <script type="math/tex">f</script> is</p>
<script type="math/tex; mode=display">\frac{\partial U}{\partial f} = \left( \frac{p}{1-f + f/q} \right) \left( \frac{1}{q} - 1 \right) - \frac{1-p}{1-f}\text{.}</script>
<p>Setting this derivative to 0 and rearranging produces the stated formula.</p>
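<p>One can also check numerically that expected log-wealth is maximised at the Kelly fraction. A quick sketch (mine, with illustrative numbers):</p>

```python
import math

def expected_log_wealth(f: float, p: float, q: float) -> float:
    """Expected log-wealth when betting fraction f, dropping the constant log(W) term."""
    return p * math.log(1 - f + f / q) + (1 - p) * math.log(1 - f)

p, q = 0.6, 0.5
kelly = (p - q) / (1 - q)  # = 0.2

# Grid search over betting fractions: the best one should match Kelly.
grid = [i / 10000 for i in range(1, 10000)]
best = max(grid, key=lambda f: expected_log_wealth(f, p, q))
print(best)  # approximately 0.2
```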
<p>The <a href="https://en.wikipedia.org/w/index.php?title=Kelly_criterion&oldid=742759833#Proof">Wikipedia page</a> as of 25 October 2016 gives another appealing fact about Kelly betting. Suppose that this contract-buying opportunity recurs again and again: that is, there are many events <script type="math/tex">E_t</script> in a row that you think each independently have probability <script type="math/tex">p</script>, and after your contract about <script type="math/tex">E_{t-1}</script> resolves, you can always spend €<script type="math/tex">r</script> on a contract that will pay €<script type="math/tex">r/q</script> if <script type="math/tex">E_t</script> happens. Suppose that you always spend <script type="math/tex">f</script> of your wealth on these contracts, you make <script type="math/tex">N</script> of these bets, and <script type="math/tex">K</script> pay off. Then, your final wealth after the <script type="math/tex">N</script> bets will be</p>
<script type="math/tex; mode=display">\text{Wealth} = \left(1-f+\frac{f}{q} \right)^K (1-f)^{N-K} W \text{.}</script>
<p>The derivative of this with respect to <script type="math/tex">f</script> is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \frac{\partial \text{Wealth}}{\partial f} &= W\left( K \left(1-f+ \frac{f}{q} \right)^{K-1} \left(\frac{1}{q} - 1 \right)(1-f)^{N-K} \right. \\ &\quad \left. {} - \left(1-f+ \frac{f}{q} \right)^K (N-K)(1-f)^{N-K-1}\right) \text{.} \end{align} %]]></script>
<p>Setting this to 0 gives <script type="math/tex">f = K/N - ((N-K)/N)(q/(1-q))</script>, and if <script type="math/tex">K/N = p</script> (which it should be in the long run), this simplifies to <script type="math/tex">f = (p-q)/(1-q)</script>, the Kelly criterion. This makes it look like Kelly betting maximises your total wealth after the <script type="math/tex">N</script> runs, so why wouldn’t an expected wealth maximiser use the Kelly criterion? Well, the rule of betting all your money every chance you have leaves you with nothing if any bet fails, that is, whenever <script type="math/tex">% <![CDATA[ K < N %]]></script>, but in the unlikely case that <script type="math/tex">K = N</script>, the rule works out so well that expected wealth maximisers think that it’s worth the risk.</p>
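<p>A simulation makes the contrast vivid. The sketch below (mine; the parameters and seed are illustrative) compares always betting the Kelly fraction against always betting everything:</p>

```python
import random

def simulate(f: float, p: float, q: float, n_bets: int, w: float = 1.0,
             seed: int = 0) -> float:
    """Final wealth after n_bets, spending fraction f of current wealth each time."""
    rng = random.Random(seed)
    for _ in range(n_bets):
        stake = f * w
        # The contract pays stake/q if the event occurs, and nothing otherwise.
        w = (w - stake) + (stake / q if rng.random() < p else 0.0)
    return w

p, q = 0.6, 0.5
kelly = (p - q) / (1 - q)
print(simulate(kelly, p, q, 1000))  # stays positive, and typically grows
print(simulate(1.0, p, q, 1000))    # a single loss wipes you out
```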
<p>Before I move on, I’d like to share one interesting fact about the Kelly criterion that gives a flavour of the later results. You might wonder what the expected utility of using the Kelly criterion is. Well, substituting <script type="math/tex">f = (p-q)/(1-q)</script> into the expression for <script type="math/tex">U</script> above, it’s just <script type="math/tex">p \log (p/q) + (1-p) \log ((1-p)/(1-q)) + \log (W)</script>. Ignoring the <script type="math/tex">\log (W)</script> utility that you already have, this is just <span><script type="math/tex">D_{KL}(p||q)</script></span>, the Kullback–Leibler divergence between Bernoulli distributions with parameters <script type="math/tex">p</script> and <script type="math/tex">q</script>. Bam! <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Information theory!</a></p>
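<p>This identity is easy to verify numerically. A sketch (mine; the probabilities are arbitrary):</p>

```python
import math

def expected_log_wealth(f: float, p: float, q: float) -> float:
    """Expected log-wealth when betting fraction f (dropping the log(W) term)."""
    return p * math.log(1 - f + f / q) + (1 - p) * math.log(1 - f)

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.7, 0.4
kelly = (p - q) / (1 - q)
# Expected log-wealth gain at the Kelly fraction equals D_KL(p || q).
print(abs(expected_log_wealth(kelly, p, q) - bernoulli_kl(p, q)))  # essentially 0
```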
<h3 id="kelly-bettors-in-prediction-markets">Kelly bettors in prediction markets</h3>
<p>Previously, we talked about the case where somebody was offering to sell you a contract at a fixed price, and all you could do was keep it. Instead, we can consider a market full of these contracts, where all of the participants are log wealth maximisers, and think about what the equilibrium price is. Our proof will be similar to the one found <a href="https://arxiv.org/pdf/1201.6655.pdf">here</a>.</p>
<p>Before diving into the math, let’s clarify exactly what sort of situation we’re imagining. First of all, there are going to be lots of contracts available that correspond to different outcomes in the same event, at least one of which will occur. For instance, the event could be “Who will win the Democratic nomination for president in 2020?”, and the outcomes could be “Cory Booker”, “Elizabeth Warren”, “Martin O’Malley”, and all the other candidates (once they are known - before then, the outcomes could be each member of a list of prominent Democrats and one other outcome corresponding to “someone else”). Alternatively, the event could be “What will the map of winners of each state in the 2020 presidential election be?”, and the outcomes would be lists of the form “Republicans win Alabama, Democrats win Alaska, Democrats win Arizona, …”. This latter type of market actually forms a <a href="http://blog.oddhead.com/2008/12/22/what-is-and-what-good-is-a-combinatorial-prediction-market/">combinatorial prediction market</a> – by buying and short-selling bundles of contracts, you can make bets of the form “Republicans will win Georgia”, “If Democrats win Ohio, then Republicans will win Florida”, or “Republicans will win either North Dakota or South Dakota, but not both”. Such markets are interesting for their own reasons, but we will not elaborate on them here.</p>
<p>We should also clarify our assumptions about the traders. The participants are log wealth maximisers who have different priors and don’t think that the other participants know anything that they don’t – otherwise, the <a href="https://en.wikipedia.org/wiki/No-trade_theorem">no-trade theorem</a> could apply. We also assume that they are <a href="http://www.investopedia.com/terms/p/pricetaker.asp">price takers</a>, who decide to buy or sell contracts at whatever the equilibrium price is, not considering how their trades affect the equilibrium price.</p>
<p>Now that we know the market setup, we can derive the purchasing behaviour of the participants for a given market price. We will index market participants by <script type="math/tex">i</script> and outcomes by <script type="math/tex">j</script>. We write <script type="math/tex">q_j</script> for the market price of the contract that pays €1 if outcome <script type="math/tex">j</script> occurs, <script type="math/tex">p^i_j</script> for the probability that participant <script type="math/tex">i</script> assigns to outcome <script type="math/tex">j</script>, and <script type="math/tex">W^i</script> for the initial wealth of participant <script type="math/tex">i</script>.</p>
<p>First of all, without loss of generality, we can assume that participant <script type="math/tex">i</script> spends all of their wealth on contracts. This is because saving money is equivalent to spending it on contracts for every outcome in proportion to their prices, which guarantees the same payoff whatever happens. We can therefore write the amount that participant <script type="math/tex">i</script> spends on contracts for outcome <script type="math/tex">j</script> as <script type="math/tex">W^i \tilde{p}^i_j</script>, under the condition that <script type="math/tex">\sum_j \tilde{p}^i_j = 1</script>. Then, if outcome <script type="math/tex">j</script> occurs, their posterior wealth will be <script type="math/tex">W^i \tilde{p}^i_j / q_j</script>. We can use the method of Lagrange multipliers to determine how much participant <script type="math/tex">i</script> will bet on each outcome, by maximising</p>
<script type="math/tex; mode=display">L(\tilde{p}^i, \lambda) = \sum_j p^i_j \log \left(\frac{W^i \tilde{p}^i_j}{q_j}\right) - \lambda \left( \sum_j \tilde{p}^i_j - 1\right)\text{.}</script>
<p>Taking partial derivatives,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}\frac{\partial L}{\partial \tilde{p}^i_j} &= \frac{p^i_j}{\tilde{p}^i_j} - \lambda \\ &= 0\text{,}\end{align} %]]></script>
<p>so <script type="math/tex">\tilde{p}^i_j = p^i_j / \lambda</script>, and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}\frac{\partial L}{\partial \lambda} &= \sum_j \tilde{p}^i_j - 1 \\ & = 0\text{,}\end{align} %]]></script>
<p>so <script type="math/tex">\lambda^{-1} \sum_j p^i_j = 1</script>, so <script type="math/tex">\lambda = 1</script>. Therefore, regardless of the market prices, participant <script type="math/tex">i</script> will spend <script type="math/tex">W^i p^i_j</script> on contracts for outcome <script type="math/tex">j</script>. You might notice that this looks different to the previous section – this is because previously our bettor could only bet on one outcome, as opposed to betting on both.</p>
<p>Next, we can generalise to the case where the market participants save some amount of money, buy some contracts, and sell some others. This will be important for deriving the equilibrium market behaviour, since you can’t have a market where everyone wants to buy contracts and nobody wants to sell them.</p>
<p>Suppose trader <script type="math/tex">i</script> saves <script type="math/tex">W^i s^i</script> and spends <script type="math/tex">W^i \tilde{p}^i_j</script> on contracts for each outcome <script type="math/tex">j</script>. Here, we allow <script type="math/tex">\tilde{p}^i_j</script> to be negative - this means that <script type="math/tex">i</script> will sell another trader <script type="math/tex">-W^i \tilde{p}^i_j</script> worth of contracts, and will supply that trader with <script type="math/tex">-W^i \tilde{p}^i_j/q_j</script> if outcome <script type="math/tex">j</script> occurs. We now demand that <script type="math/tex">s^i + \sum_j \tilde{p}^i_j = 1</script> for <script type="math/tex">s^i</script> to make sense. Now, if outcome <script type="math/tex">j</script> occurs, trader <script type="math/tex">i</script>’s wealth will be <script type="math/tex">W^i(s^i + \tilde{p}^i_j/q_j)</script> – if <script type="math/tex">\tilde{p}^i_j > 0</script> then the trader makes money off their contracts in outcome <script type="math/tex">j</script>, and if <script type="math/tex">% <![CDATA[
\tilde{p}^i_j < 0 %]]></script> then the trader pays their dues to the holder of the contract in outcome <script type="math/tex">j</script> they sold. We’d like this to be equal to <script type="math/tex">W^i p^i_j/q_j</script>, so that the trader’s wealth is the same as if they had spent all of their money on contracts, as derived above. This happens if <script type="math/tex">s^i + \tilde{p}^i_j/q_j = p^i_j/q_j</script>, i.e. <script type="math/tex">\tilde{p}^i_j = p^i_j - s^i q_j</script>.</p>
<p>Now that we have the behaviour of each trader for a fixed market price, we can derive the equilibrium prices of the market. At equilibrium, supply should be equal to demand, meaning that there are as many contracts being bought as being sold: for all <script type="math/tex">j</script>, <script type="math/tex">\sum_i W^i \tilde{p}^i_j = 0</script>. This implies that <script type="math/tex">\sum_i W^i (p^i_j - s^i q_j) = 0</script>, or <script type="math/tex">q_j = \left( \sum_i W^i p^i_j \right)/\left( \sum_i W^i s^i \right)</script>. It must also be the case that <script type="math/tex">\sum_j q_j = 1</script>, since otherwise the agents could arbitrage, putting pressure on the prices to satisfy <script type="math/tex">\sum_j q_j = 1</script>. This means that <script type="math/tex">\sum_j \left(\sum_i W^i p^i_j\right)/\left(\sum_i W^i s^i\right) = 1</script>, implying that <script type="math/tex">\sum_i W^i = \sum_i W^i s^i</script> and <script type="math/tex">q_j = \sum_i W^i p^i_j / \left(\sum_i W^i\right)</script>.</p>
<p>Note the significance of this price: it’s as if we have a Bayesian mixture where each trader corresponds to a hypothesis, our prior in hypothesis <script type="math/tex">i</script> is <script type="math/tex">h^i = W^i / \left(\sum_i W^i\right)</script>, and the market price is the Bayesian mixture probability <script type="math/tex">\sum_i h^i p^i_j</script>. How much wealth does the participant/hypothesis have after we know the outcome? Exactly <script type="math/tex">W^i p^i_j \left(\sum_i W^i\right) / \left(\sum_i W^i p^i_j\right) = \left(\sum_i W^i\right) h^i p^i_j / \left(\sum_i h^i p^i_j\right)</script>, proportional to the posterior probability of that hypothesis. Our market has done an excellent job of replicating a Bayesian mixture!</p>
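<p>This equilibrium is simple enough to compute directly. The sketch below (my own illustration; the wealths and beliefs are made up) computes the prices and the post-outcome wealths:</p>

```python
def equilibrium(wealths, beliefs):
    """Equilibrium prices in a market of price-taking log-wealth maximisers.

    wealths[i] is trader i's wealth W^i; beliefs[i][j] is p^i_j.
    Returns (prices, post): prices[j] is the wealth-weighted mixture
    probability of outcome j, and post[i][j] is trader i's wealth
    if outcome j occurs (W^i p^i_j / q_j).
    """
    total = sum(wealths)
    n = len(beliefs[0])
    prices = [sum(w * b[j] for w, b in zip(wealths, beliefs)) / total
              for j in range(n)]
    post = [[w * b[j] / prices[j] for j in range(n)]
            for w, b in zip(wealths, beliefs)]
    return prices, post

wealths = [3.0, 1.0]                          # trader 0 is richer, so counts more
beliefs = [[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
prices, post = equilibrium(wealths, beliefs)
print(prices)  # wealth-weighted average of the two belief vectors
```

Note that total wealth is conserved whatever the outcome, since the contracts only move money between the traders.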
<h3 id="but-is-it-general-enough">But is it general enough?</h3>
<p>You might have thought that the above discussion was sufficiently general, but you’d be wrong. It only applies to markets with a countable number of possible outcomes. Suppose instead that we’re watching someone throw a dart at a dartboard, and will be able to see the exact point where the dart will land. In general, we imagine that there’s a set <script type="math/tex">\Omega</script> (the dartboard) of outcomes <script type="math/tex">\omega</script> (points on the dartboard), and you have a probability distribution <script type="math/tex">P</script> that assigns probability to any event <script type="math/tex">E \subseteq \Omega</script> (region of the dartboard). (More technically, <script type="math/tex">\Omega</script> will be a measurable set with sigma-algebra <script type="math/tex">\mathcal{F}</script> which all our subsets will belong to, <script type="math/tex">P</script> will be a probability measure, and all functions mentioned will be measurable.)</p>
<p>First, let’s imagine that there’s just one agent with wealth <script type="math/tex">W</script> and probability distribution <script type="math/tex">P</script>, betting against the house which has probability distribution <script type="math/tex">Q</script>. This agent can buy some number <script type="math/tex">b(\omega)</script> of contracts from the house that each pay €1 if <script type="math/tex">\omega</script> occurs and €0 otherwise, for every <script type="math/tex">\omega \in \Omega</script> (similarly to the previous section, if <script type="math/tex">% <![CDATA[
b(\omega) < 0 %]]></script> the agent is selling these contracts to the house). The house charges the agent the expected value of their bets: <script type="math/tex">\mathbb{E}_{Q} [b(\omega)]</script>. The question: what function <script type="math/tex">b</script> should the agent choose to bet with?</p>
<p>Our agent is an expected log wealth maximiser, so they want to choose <script type="math/tex">b</script> to maximise <script type="math/tex">\mathbb{E}_{P} [\log b(\omega)]</script>. However, they are constrained by only betting as much money as they have (and without loss of generality, exactly as much money as they have). Therefore, the problem is to optimise the Lagrangian</p>
<p><span><script type="math/tex">% <![CDATA[
\begin{align} L(b, \lambda) &= \mathbb{E}_{P} [ \log b(\omega) ] - \lambda \left( W - \mathbb{E}_Q [ b(\omega) ] \right) \\ &= \mathbb{E}_{P} [ \log b(\omega) ] - \mathbb{E}_Q[\lambda(W - b(\omega))] \end{align} %]]></script></span></p>
<p>To make this easier to manipulate, we’re going to want to make all of this an expectation with respect to <script type="math/tex">Q</script>, using an object <script type="math/tex">dP/dQ (\omega)</script> called the <a href="https://en.wikipedia.org/wiki/Radon%E2%80%93Nikodym_theorem">Radon-Nikodym derivative</a>. Essentially, if we were thinking about <script type="math/tex">\omega</script> as being a point on a dartboard, we could think of the probability density functions <script type="math/tex">p(\omega)</script> and <script type="math/tex">q(\omega)</script>, and it would be the case that <span><script type="math/tex">\mathbb{E}_{P}[f(\omega)] = \mathbb{E}_{Q} [f(\omega) p(\omega) / q(\omega)]</script></span>. The Radon-Nikodym derivative acts just like the factor <script type="math/tex">p(\omega) / q(\omega)</script>, and is always defined as long as whenever <script type="math/tex">Q</script> assigns some set probability 0, <script type="math/tex">P</script> does as well (otherwise, you should imagine that <script type="math/tex">q(\omega) = 0</script> so <script type="math/tex">p(\omega) / q(\omega)</script> isn’t defined). This lets us rewrite the Lagrangian as</p>
<p><span><script type="math/tex">L(b, \lambda) = \mathbb{E}_{Q} \left[ \left( \log b(\omega) \right) \frac{dP}{dQ}(\omega) - \lambda(W - b(\omega)) \right]</script></span></p>
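<p>The change-of-measure step used here can be checked numerically in a discrete setting, where the likelihood ratio <script type="math/tex">p(\omega)/q(\omega)</script> plays the role of the Radon-Nikodym derivative (a sketch with invented numbers, not from the post):</p>

```python
# Discrete analogue of E_P[f] = E_Q[f * dP/dQ], with p/q standing in
# for the Radon-Nikodym derivative (all numbers are made up).
p = [0.2, 0.5, 0.3]          # probabilities under P
q = [0.4, 0.4, 0.2]          # probabilities under Q (nowhere zero)
f = [1.0, -2.0, 3.0]         # an arbitrary function of the outcome

lhs = sum(p[i] * f[i] for i in range(3))                   # E_P[f]
rhs = sum(q[i] * f[i] * (p[i] / q[i]) for i in range(3))   # E_Q[f * p/q]
assert abs(lhs - rhs) < 1e-12
```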
<p>We have one more trick up our sleeves to maximise this with respect to <script type="math/tex">b</script>. At a maximum, changing <script type="math/tex">b</script> to <script type="math/tex">b + \delta b</script> should only change <script type="math/tex">L</script> up to second order, for any small <script type="math/tex">\delta b</script>. So,</p>
<p><span><script type="math/tex">% <![CDATA[
\begin{align}L(b + \delta b, \lambda) &= \mathbb{E}_{Q} \left[ \left( \log (b(\omega) + \delta b(\omega)) \right) \frac{dP}{dQ}(\omega) - \lambda(W - b(\omega) - \delta b(\omega)) \right] \\ &= \mathbb{E}_{Q} \left[ \left( \log b(\omega) + \frac{\delta b(\omega)}{b(\omega)} \right) \frac{dP}{dQ}(\omega) - \lambda (W - b(\omega) - \delta b(\omega))\right] \\
&\quad {} + O(\delta b(\omega)^2)\\
&= L(b, \lambda) + \mathbb{E}_Q \left[ \frac{\delta b(\omega)}{b(\omega)} \frac{dP}{dQ}(\omega) + \lambda \delta b(\omega)\right] + O(\delta b(\omega)^2)\end{align} %]]></script></span></p>
<p>We therefore require that <script type="math/tex">\mathbb{E}_Q [(\delta b(\omega) / b(\omega)) (dP/dQ(\omega)) + \lambda \delta b(\omega)] = 0</script> for all <script type="math/tex">\delta b(\omega)</script>. This can only happen when <script type="math/tex">b(\omega) = - \lambda^{-1} dP/dQ(\omega)</script>, and it’s easy to check that we need <script type="math/tex">\lambda = -W^{-1}</script>. Therefore, the agent buys <script type="math/tex">W \times dP/dQ(\omega)</script> shares in outcome <script type="math/tex">\omega</script>, which you should be able to check is the same as in the case of countably many contracts.</p>
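<p>A minimal numeric sketch of this result (not from the post; the discretisation and numbers are assumptions for illustration): discretise the dartboard into <script type="math/tex">n</script> points and check that the bet <script type="math/tex">b = W \times p/q</script> costs exactly <script type="math/tex">W</script> and beats any other feasible bet on expected log wealth.</p>

```python
import math
import random

# Discretised "dartboard": n points with made-up agent probabilities p
# and house probabilities q.
random.seed(0)
n, W = 50, 1.0
p_raw = [random.random() + 0.1 for _ in range(n)]
q_raw = [random.random() + 0.1 for _ in range(n)]
p = [x / sum(p_raw) for x in p_raw]   # agent's probabilities P
q = [x / sum(q_raw) for x in q_raw]   # house's probabilities Q

def log_wealth(b):
    # expected log wealth E_P[log b(omega)]
    return sum(p[i] * math.log(b[i]) for i in range(n))

kelly = [W * p[i] / q[i] for i in range(n)]   # b(omega) = W * dP/dQ(omega)
cost = sum(q[i] * kelly[i] for i in range(n))
assert abs(cost - W) < 1e-9                   # the bet spends exactly W

# random feasible bets (rescaled to cost W) never beat the Kelly bet
for _ in range(200):
    other = [random.random() + 0.01 for _ in range(n)]
    scale = W / sum(q[i] * other[i] for i in range(n))
    other = [scale * x for x in other]        # rescale so E_Q[other] = W
    assert log_wealth(other) <= log_wealth(kelly) + 1e-12
```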
<p>Suppose we want to express the bet equivalently as our agent saving <script type="math/tex">S</script>. For this to be equivalent to the agent spending all their money, we need <script type="math/tex">b(\omega) + S = W \times dP/dQ(\omega)</script> for all <script type="math/tex">\omega</script>, giving <script type="math/tex">b(\omega) = W \times dP/dQ(\omega) - S</script>.</p>
<p>Now, suppose we’re in a market with many agents, indexed by <script type="math/tex">i</script>. Each agent has wealth <script type="math/tex">W^i</script>, probabilities <script type="math/tex">P^i</script>, and saves <script type="math/tex">S^i</script>. In response to a market probability <script type="math/tex">Q</script>, they buy <script type="math/tex">W^i dP^i/dQ(\omega) - S^i</script> contracts for outcome <script type="math/tex">\omega</script>. What is the equilibrium market probability?</p>
<p>We would like to think of markets for each outcome <script type="math/tex">\omega</script> and solve for equilibrium, but it could be that each agent assigns probability 0 to every outcome. For instance, if I’m throwing a dart at a dartboard, and your probability that I hit some region is proportional to the area of the region, then for any particular point your probability that I will hit that point is 0. If this is the case, then the equilibrium price for contracts in every outcome will be 0, which tells us nothing about how traders buy and sell these contracts. Instead, we’ll imagine that there’s a set of events <script type="math/tex">\{ E_j \}</script> that are mutually exclusive with the property that one of them is sure to happen – in the case of the dartboard, this would be a collection of regions that don’t overlap and cover the whole dartboard. The agents will bundle all of their contracts for outcomes of the same event, and buy and sell those together. In this case, letting <span><script type="math/tex">[[ \omega \in E]]</script></span> be the <a href="https://en.wikipedia.org/wiki/Iverson_bracket">function</a> that is 1 if <script type="math/tex">\omega \in E</script> and 0 otherwise, the condition for equilibrium is</p>
<p><span><script type="math/tex">% <![CDATA[
\begin{align} 0 &= \sum_i \mathbb{E}_Q \left[ [[\omega \in E_j]] \left( W^i \frac{dP^i}{dQ}(\omega) - S^i \right)\right] \\
\sum_i W^i P^i (E_j) &= Q(E_j) \left( \sum_i S^i \right) \end{align} %]]></script></span></p>
<p>To avoid arbitrage, it must be the case that <script type="math/tex">\sum_i S^i = \sum_i W^i</script>, therefore we require that for all <script type="math/tex">j</script>, <span><script type="math/tex">Q(E_j) = \sum_i W^i P^i (E_j) / \left( \sum_i W^i \right)</script></span>. Now, in the limit where the partition becomes arbitrarily fine, every event is a union of some of the sets <script type="math/tex">E_j</script>, and in general we will have <script type="math/tex">Q(E) = \sum_i W^i P^i(E) / \left( \sum_i W^i \right)</script>. This is just like the discrete case: our market prices are exactly Bayes mixture probabilities, and as a result the wealth of each agent after the bets are paid will be proportional to their posterior credence in the mixture.</p>
<p>Finally, it’s worth noting something interesting that’s perhaps more obvious in this formalism than in others. Suppose we again have a single agent betting with the house on which outcome will occur, but instead of learning the outcome <script type="math/tex">\omega</script>, the house and agent only learn that the outcome was in some event <script type="math/tex">E</script>. In this case, the agent would have spent <span><script type="math/tex">\mathbb{E}_Q [W \times dP/dQ(\omega) [[\omega \in E]] ] = W P(E)</script></span> on contracts for outcomes in <script type="math/tex">E</script>, and should presumably be paid the house’s posterior expected value of those contracts:
<span><script type="math/tex">\frac{\mathbb{E}_Q [W \times dP/dQ(\omega) [[\omega \in E]] ]}{Q(E)} = W \frac{P(E)}{Q(E)}</script></span>
Now, this is exactly what would have happened if the agent had been asked which of events <script type="math/tex">E_1</script> through <script type="math/tex">E_n</script> would occur: the agent would bet <script type="math/tex">W P(E_i)</script> on each event <script type="math/tex">E_i</script> and in the case that event <script type="math/tex">E_j</script> occurred, would be paid <script type="math/tex">W P(E_j)/ Q(E_j)</script>. In the dartboard example, instead of learning which point the dart would land on, you only learned how many points the throw was worth and your bets only paid out accordingly, but it turned out that you bet optimally anyway, despite the payouts being different to what you thought they were. Therefore, Kelly betting has this nice property that you don’t need to know exactly what you’re betting on: as long as you know the space of possible outcomes, you’ll bet optimally no matter what the question about the outcomes is.</p>
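<p>This coarse-graining property can also be checked numerically (a sketch with invented numbers, not from the post): the agent Kelly-bets on fine-grained outcomes, payouts are settled only at the level of a coarser partition of events, and the result matches betting on the events directly.</p>

```python
import random

# Made-up fine-grained outcome space with a coarse partition {E_j}.
random.seed(1)
n, W = 12, 1.0
p_raw = [random.random() + 0.1 for _ in range(n)]
q_raw = [random.random() + 0.1 for _ in range(n)]
p = [x / sum(p_raw) for x in p_raw]                # agent's P over outcomes
q = [x / sum(q_raw) for x in q_raw]                # house's Q over outcomes
events = [range(0, 4), range(4, 8), range(8, 12)]  # partition of outcomes

b = [W * p[i] / q[i] for i in range(n)]            # Kelly bet on fine outcomes
for E in events:
    P_E = sum(p[i] for i in E)
    Q_E = sum(q[i] for i in E)
    # house's posterior expected value of the agent's contracts, given E
    payout = sum(q[i] * b[i] for i in E) / Q_E
    # same payout as betting W*P(E) directly on the event E
    assert abs(payout - W * P_E / Q_E) < 1e-9
```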
Fri, 18 Nov 2016 00:00:00 +0000
http://danielfilan.com//2016/11/18/kelly.html