

Can a future artificial intelligence be made safe?  And is AI safety an easy problem, or an extremely hard one?

It turns out that these questions are intimately bound up with a seemingly unrelated question -- whether future AIs will use "reinforcement learning" as their basic control mechanism.  In this essay I will argue that these ideas should not be connected, because reinforcement learning will never be used as the main control mechanism for future AIs.

NOTE   If you already know what reinforcement learning is, please be assured that this post is about the use of RL at the global-control level of an AI, not the sort of RL that appears in relatively small, local circuits or adaptive feedback loops.  There is a lot of confusion about this (I have had people arguing vehemently that RL has been applied here, there and everywhere with great success, so obviously I must be completely clueless).  RL works in limited situations where the reward signal is clear and the control policies are short(ish) and not too numerous.  But when people talk about AI safety issues, RL is assumed at the global level, and at that level reward signals are impossible to find, control policies are gigantic (involving actions that take years to connect to consequences), and the system tends to have Jupiter-size resource requirements.

Feeding Frenzy

Right now there is a feeding frenzy (Jaan Tallinn has called it an "ecosystem") in which many groups and individuals are trying to persuade charities to give them money so they can do "research" that is supposed to reduce the dangers of AI.  "Give us money," they say, "and we will save you from the Terminator."

I have a problem with this.  Most of the fish in this ecosystem are selling horror stories about what could happen when AIs become truly intelligent, and most of the horror stories involve the idea that even if we go to enormous lengths to build a safe AI, and even if we think we have succeeded, the AI will nevertheless wriggle out from under the safety net and become a psychopathic monster.

But if you look closely you will find that the horror stories are based on a set of bad assumptions about how the supposed AIs of the future will be constructed.  Assumptions that are so silly that any AI built in that way would find it impossible to actually be intelligent at all.  And out of all the silly assumptions, the worst is the assumption that AIs will be controlled by something called "reinforcement learning" (frequently abbreviated "RL").

I want to set this essay in the context of some important comments about AI safety made by Holden Karnofsky at  This is Karnofsky's take on one of the "challenges" we face in ensuring that AI systems do not become dangerous:

Going into the details of these challenges is beyond the scope of this post, but to give a sense for non-technical readers of what a relevant challenge might look like, I will elaborate briefly on one challenge. A reinforcement learning system is designed to learn to behave in a way that maximizes a quantitative “reward” signal that it receives periodically from its environment - for example, DeepMind’s Atari player is a reinforcement learning system that learns to choose controller inputs (its behavior) in order to maximize the game score (which the system receives as “reward”), and this produces very good play on many Atari games. However, if a future reinforcement learning system’s inputs and behaviors are not constrained to a video game, and if the system is good enough at learning, a new solution could become available: the system could maximize rewards by directly modifying its reward “sensor” to always report the maximum possible reward, and by avoiding being shut down or modified back for as long as possible. This behavior is a formally correct solution to the reinforcement learning problem, but it is probably not the desired behavior. And this behavior might not emerge until a system became quite sophisticated and had access to a lot of real-world data (enough to find and execute on this strategy), so a system could appear “safe” based on testing and turn out to be problematic when deployed in a higher-stakes setting. 

There are enough things wrong with what is said in this passage to fill a book, but my focus here is on the sudden jump from DeepMind's Atari game playing program, to a fully intelligent AI capable of outwitting humanity.

What Reinforcement Learning is.

Let's take a look at what "reinforcement learning" (RL) actually is.  Back in the early days of Behaviorism (which became the dominant style of research in psychology after about the 1920s, until it was discarded in the late 1950s) some researchers decided to focus on simple experiments like putting a rat in a cage with a lever and a food-pellet dispenser, and then connecting these two things in such a way that if the rat pressed the lever, a pellet was dispensed.  Would the rat notice the connection between lever and pellet?  Of course it did, and pretty soon the rat would be spending inordinate amounts of time just pressing the lever, whether food came out or not.

What the researchers did next was suggest that the only thing of importance "inside" the rat's mind was a set of connections between behaviors (e.g. pressing the lever), stimuli (e.g. a visual image of the lever) and rewards (e.g. getting a food pellet).  Critical to all of this was the idea that if a behavior was followed by a reward, a direct connection between the two would be strengthened in such a way that future behavior choices would be influenced by that strong connection.

That is reinforcement learning: you "reinforce" a behavior if it is followed by a reward.  What these researchers really wanted to claim was that this mechanism could explain almost everything important that was happening inside the rat's mind.  And, with a few judicious extensions, these Behaviorist researchers started arguing that the same type of explanation would work for the behavior of all "thinking" creatures.

That means you.

I want to point out something very important buried in this idea.  The connection between the reward and action is basically a single wire with a strength number on it.  The rat does not weigh up a lot of pros and cons; it doesn't think about anything, doesn't engage in any problem solving or planning, doesn't contemplate the whole idea of food, or the motivations of the idiot humans outside the cage.  The rat isn't capable of any of that: it just goes bang! lever-press, bang! food-pellet-appears, bang! let's-increase-the-strength-of-that-connection.
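To make the "single wire with a strength number" picture concrete, here is a minimal Python sketch.  The function name `train_rat`, the action labels, the step size and the seed are all invented for illustration; this is a toy rendering of the Behaviorist story as described above, not code from any actual RL system.

```python
import random


def train_rat(actions, rewarded_action, trials=200, step=0.1, seed=0):
    """Toy behaviorist learner: one strength number per action, nothing else."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    strength = {action: 1.0 for action in actions}
    for _ in range(trials):
        # Pick an action with probability proportional to its current strength.
        weights = [strength[action] for action in actions]
        action = rng.choices(actions, weights=weights)[0]
        # The reward arrives immediately, and the only "learning" is a bump
        # to the one number attached to the action that was just tried.
        if action == rewarded_action:
            strength[action] += step
    return strength


strengths = train_rat(["press_lever", "groom", "sniff_corner"], "press_lever")
```

Note what is absent: no model of the cage, no planning, no representation of food.  The whole "mind" is the `strength` dictionary.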

The Demise of Reinforcement Learning

Now fast forward to the late 1950s/early 1960s.  Cognitive psychologists are finally sick and tired of the Behaviorist programme. Behaviorism might be able to explain the rat-pellet-lever situation, but for anything more complex, it sucks.  Behaviorists spent decades engaging in all kinds of mental contortions to argue that they would eventually be able to explain human behavior without using much more than those direct connections between stimuli, behaviors and rewards ... but by about 1960 the psychology community had stopped believing that nonsense, because it never worked.

Is it possible to summarize the main reason why they rejected it?  Sure.  For one thing, almost all realistic behaviors involve rewards that arrive long after the behaviors that cause them, so there is a gigantic problem with deciding which behaviors should be reinforced, for a given reward.  Suppose you spend years going to college, enduring hard work and a very low-income lifestyle.  Then years later you get a good job and pay off your college loan.  Did you do all of that because, like the rat, you happened to try the going-to-college-and-suffering-poverty behavior many times in your life, and the first time you tried it you got a good-job-that-paid-off-your-loan reward? Was it the case that you noticed the connection between reward and behavior, and your brain automatically reinforced the connection between those two?
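The delayed-reward problem can be shown with a small Python sketch.  It assumes the naive rule in which only the behavior immediately preceding the reward is strengthened; the function name `reinforce_last_action` and the behavior labels are hypothetical, chosen to mirror the college example.

```python
def reinforce_last_action(episode, reward, strength, step=1.0):
    """Naive rule: strengthen only the behavior that came just before the reward."""
    last_action = episode[-1]
    strength[last_action] = strength.get(last_action, 0.0) + step * reward
    return strength


# Years of behavior, with the payoff arriving only at the very end.
episode = [
    "enroll_in_college",
    "study_for_years",
    "live_on_very_little",
    "accept_job_offer",
]
strength = reinforce_last_action(episode, reward=1.0, strength={})
```

Under this rule the only thing that gets credit is `accept_job_offer`.  The years of behavior that actually produced the reward are never reinforced at all, which is the problem described above.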

A More Realistic Example

Or, on a smaller scale, consider what you are doing when you sit in the library with a mathematics text, trying to solve equations.  What reward are you seeking?  A little dopamine hit, perhaps?  (That is the modern story that neuroscientists sell, by the way).

Well, maybe, but let's try to stay focused on the precise idea that the Behaviorists were pushing:  that original rat was NOT supposed to do lots of thinking and analysis and imagining when it decided to push the lever, it was supposed to push the lever by chance, and then notice that a reward came.

The whole point of the RL mechanism is that the intelligent system doesn't engage in a huge, complex, structured analysis of the situation, when it tries to decide what to do.  If it did, the explanation for why the creature did what it did would be in that huge, complex, structured analysis itself, after all!  Instead, the RL people want you to believe that the RL mechanism did all the heavy lifting. That idea is critical to RL.  The rat simply tries a behavior at random - with no understanding of its meaning - and it is only because a reward then arrives, that the rat decides that in the future it will press the lever again.

So, going back to you, sitting in the library doing your mathematics homework.  Did you solve that last equation because you had a previous episode where you just happened to try the behavior of solving that exact same equation, and got a dopamine hit (which felt good)?  The RL theorist needs you to believe that you really did.  The RL theorist would say that you somehow did a search through all the quintillions of possible actions you could take, sitting there in front of an equation that requires L'Hôpital's Rule, and in spite of the fact that the list of possible actions included such possibilities as jumping-on-the-table-and-singing-I-am-the-walrus, or driving-home-to-get-a-marmite-sandwich, or asking-the-librarian-to-go-for-a-margarita, you decide instead that the thing that would give you the best dopamine hit right now would be applying L'Hôpital's Rule to the equation.

I hope I have made it clear that there is something profoundly disturbing about the RL/Behaviorist explanation for what is happening in a situation like this.

Whenever the Behaviorists tried to find arguments to explain scenarios like that, they always added machinery onto the basic RL mechanism.  "Okay," they would say, "so it's true the basic forms of RL don't work ... but if you add some more stuff onto the basic mechanism, like maybe the human keeps a few records of what they did, and they occasionally scan through the records and boost a few reinforcement connections here and there, and ... blah blah blah...".

The trouble with this kind of extra machinery is that after a while, the tail begins to wag the dog. That is, all the extra machinery is where all the action is.  And that extra machinery is most emphatically not designed as a kind of RL mechanism, itself.  In theory, there is still a tiny bit of reinforcement learning somewhere deep down inside all the extra machinery, but eventually people just said "What's the point?"  Why even bother to use the RL language anymore?  The RL, if it is there at all, is pointless.  A lot of parameter values get changed in complex ways, inside all the extra machinery, so why even bother to mention the one parameter among thousands, that is supposed to be RL, when it is obvious that the structure of that extra machinery is what matters?

That "extra machinery" is what eventually became all the many and varied mechanisms researched by cognitive psychologists.  Their understanding of how minds work is not that reinforcement learning plus extra machinery can be used to explain cognition -- they would simply assert that reinforcement learning does not exist as a way to understand cognition.

Take home message:  RL has become an irrelevance in explanations of human cognition.

Artificial Intelligence and RL

Now let's get back to Holden Karnofsky's comment, above.

He points out that there exists a deep learning program that can learn to play arcade games, and it uses RL.

(I should point out that his chosen example was not by any means pure RL.  This software already had other mechanisms in it, so the slippery slope toward RL+extra machinery has already begun.)

Sadly, DeepMind’s Atari player is nothing more sophisticated than a rat.  It is so mind-bogglingly simple that it actually can be controlled by RL.  No, I take that back, it is unfair to call it a rat. Rats are way smarter than this program, so it would be better to compare it to an amoeba, or an insect.

This is typical of claims that RL is a good foundation for AI.  If you start scanning the literature you will find that all the cited cases use systems that are so trivial that RL really does have a chance of working.

(Here is one example, picked almost at random:  Rivest, Bengio and Kalaska.  At first it seems that they are talking about deriving an RL system from what is known about the brain.  But after a lot of preamble they give us instead just an RL program that does the amazing task of ... controlling a double-jointed pendulum.  The same story is repeated in endless AI papers about reinforcement learning:  big claims about RL in the brain, but at the end of the day the algorithm is proved by applying it to a trivially simple system.)

But Karnofsky wants to go beyond the trivial Atari player, he wants to ask what happens when the software is expanded and augmented.  In his words, "[what] if a future reinforcement learning system’s inputs and behaviors are not constrained to a video game, and if the system is good enough at learning..."?

That is where everything goes off the rails.

In practice there is not and never has been any such thing as augmenting and expanding an RL system until it becomes much more generally intelligent.  We are asked to imagine that this "system [might become] quite sophisticated and [get] access to a lot of real-world data (enough to find and execute on this strategy)...".  In other words, we are asked to buy the idea that there might be an RL system that is just as intelligent as a human being (smarter, in fact, since we are supposed to be in danger from its devious plans), but which is still driven by a reinforcement learning mechanism.

I see two problems here.  One is that this scenario ignores the fact that after three decades of trying to get RL to work as a theory of human cognition, the result has been nothing.  That period in the history of psychology was almost universally condemned as a complete write-off.  As far as we know it simply does not scale up.

But the second point is even worse:  not only did psychologists fail to get it to work as a theory of human cognition, AI researchers also failed to build one that works for anything approaching a real-world task.  What they have achieved is RL systems that do very tiny, narrow-AI tasks.

The textbooks might describe RL as if it means something, but they conspicuously neglect to mention that, actually, all the talking, thinking, development and implementation work since at least the 1960s has failed to result in an RL system that could actually control meaningful real-world behavior.  I do not know if AI researchers have been trying to do this and failing, or if they have not been trying at all (on the grounds that they have no idea how to even start), but what I do know is that they have published no examples.

The Best Reinforcement Learning in the World?

To give a flavor of how bad this is, consider that in the 2008 Second Annual Reinforcement Learning Competition, the AI systems were supposed to compete in categories like:

Mountain Car: Perhaps the most well-known reinforcement learning benchmark task, in which an agent must learn how to drive an underpowered car up a steep mountain road.

Tetris: The hugely popular video game, in which four-block shapes must be manipulated to form complete lines when they fall.

Helicopter Hovering: A simulator, based on the work of Andrew Ng and collaborators, which requires an agent to learn to control a hovering helicopter.

Keepaway: A challenging task, based on the RoboCup soccer simulator, that requires a team of three robots to maintain possession of the ball while two other robots attempt to steal it.

As of the most recent RL competition, little has changed.  They are still competing to see whose RL algorithm can best learn how to keep a helicopter stable:  an insect-level intelligence task.  Whether they are succeeding in getting those helicopters to run smoothly or not is beside the point -- the point is that helicopter hovering behavior is a fundamentally shallow task, at the insect level.

Will RL Ever Become Superintelligent?

I suppose that someone without a technical background might look at all of the above and say "Well, even so ... perhaps we are only in the early stages of RL development, and perhaps any minute now someone will crack the problem and create an RL type of AI that becomes superintelligent.  You can't be sure that won't happen."

My response is as follows.  All of the evidence is that the resource requirements for RL explode exponentially when you try to scale it up.  That means:

  • If you want to use RL to learn how to control a stick balancing on end, you will need an Arduino.
  • If you want to use RL to learn how to control a model helicopter, you will need a PC (or maybe a Beaglebone?  Arduinos are improving, so maybe...).
  • If you want to use RL to learn how to play Go, or Atari games, you will need the Google Brain (tens of thousands of cores).
  • If you want to use RL to learn how to control an artificial rat, which can run around and get by in the real world, you will need all the processing power currently available on this planet (and then some).
  • If you want to use RL to learn how to cook a meal, you will need all the computing power in the local galactic cluster.
  • If you want to use RL to learn how to be as smart as Winnie the Pooh (a bear, I will remind you, of very little brain), you will need to convert every molecule in the universe into a computer.

That is what exponential resource requirements are all about.
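One way to see where exponential growth could come from is a simple counting argument.  The Python sketch below assumes a learner that, in the worst case, has to tell whole action sequences apart in order to work out which one earned the reward.  The function name and the numbers (10 available actions, horizons up to 20 steps) are illustrative assumptions, not measurements of any real RL system.

```python
def count_action_sequences(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences of a given length."""
    return num_actions ** horizon


# With only 10 actions to choose from at each step, the number of distinct
# sequences is multiplied by 10 for every extra step of delay before the reward.
for horizon in (1, 5, 10, 20):
    print(horizon, count_action_sequences(10, horizon))
```

Each additional step multiplies the count by the number of available actions, so a task whose rewards arrive a few steps later is not slightly harder but many orders of magnitude harder under this assumption.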


Reinforcement learning first came to prominence in the second decade of the 20th century.  But after nearly 100 years of experiments, mathematical theories and computer simulations, and after being written into the standard AI textbooks (and after being widely assumed as the theory of how future Artificial General Intelligence systems will probably be controlled), it seems that the best actual RL algorithms can barely learn how to control a model helicopter, or play a board game.

And yet there are dozens - if not hundreds - of people now inhabiting the "existential risk ecosystem", who claim to be so sure of how future AGI systems will be controlled, that they are taking a large stream of donated money and promising to do research on how this failed control paradigm can be modified so it will not turn around and kill us.

If you ask anyone in that ecosystem what exactly they see as the main dangers of future AGI, they quote - again and again and again - scenarios in which an AGI is superintelligent and dangerously psychopathic, and controlled by Reinforcement Learning.

And yet somehow, more and more people are throwing money at these people who claim they will save us from AIs that are driven by reinforcement learning.

Go figure.

[Note.  This is a project proposal that was written in 1989.  Published here for the first time because it has some historical relevance ... which is to say, some of the ideas I am working on today were already contained in this 1989 paper.]


The eventual goal of this research is to construct a PDP model of high-level aspects of cognition. This is clearly an ambitious target to aim for, so there are a number of sub-goals within this overall plan. One intermediate goal is to provide an account of the process of sentence comprehension, and the lowest-level goal is to address some issues relating to the use of lexical, syntactic and semantic information during the comprehension of a sentence. Frazier, Clifton and Randall (1983) conducted some experiments on the processing of sentences containing filler-gap dependencies, which they see as bearing crucially on these issues. In particular, they interpret their results as showing that people initially use a heuristic (the "Most Recent Filler" strategy) to build a viable phrase marker for a sentence containing multiple filler-gap dependencies and that they postpone the use of semantic information until a later stage.

It is my intention to construct a PDP system that can comprehend sentences of the sort used in the Frazier et al. experiments, and which demonstrates the same effects that they observed, but without making any of the assumptions about the rule-governed nature of parsing that lie behind the analyses given by Frazier et al. Before expanding on these claims, it is necessary to give a detailed outline of these filler-gap experiments and the theoretical considerations that follow from them.

Frazier, Clifton and Randall (1983) on Filler-Gap Dependencies.

Consider the following sentence:

(1) The mayor is the crook who the police chief tried ____ to leave town with ____ .

In a deep structure analysis of this sentence, the underlined noun phrases (the "fillers") would be located at the positions indicated by the underlined spaces (the "gaps"). In this case the basic actions that the sentence describes are that the police chief leaves town and that she leaves town with the crook, so the second filler belongs to the first gap and the first filler belongs to the second. This would be described by Frazier et al. as a "Recent Filler" sentence because when the first gap is encountered, in a left to right scan, it needs to be coindexed with the most recently encountered filler. It is not always the case that the most recent filler is the appropriate one; take, for example:

(2) The mayor is the crook who the police chief forced ____ ____ to leave town.

Both gaps in this sentence take the same filler, namely "the crook", and this is not the most recent one. This example would be classed as a "Distant Filler" sentence.

Frazier et al. presented their subjects with sentences such as these and asked them to make a judgment, as soon as possible after the appearance of the last word, as to whether or not they had comprehended the sentence. They indicated to their subjects that they should make a "no" response if they felt that they would normally have gone back and scanned the sentence a second time, and that a "yes" response should indicate that they had "got it" immediately. The words were shown one at a time in rapid succession on a computer screen (the description given by the authors implies that each word was overwritten by its successor), with the final word distinguished by a full stop.

What this task is designed to probe is an initial stage in the sentence comprehension process in which a surface structure representation is computed for the sentence, before any deeper analysis is made of its meaning or significance. Frazier et al. regard it as a plausible assumption that such a surface structure parse takes place, given that when we read a sentence like

(3) John didn't deny that Sam left.

we can easily make a snap judgment about whether or not the sentence was syntactically correct, without having to go through what seems to be a much more time-consuming process of sorting out the interaction of the negatives.

The experimental sentences all involved multiple filler-gap dependencies, so that when the processing mechanism encounters the first gap, it has more than one potential filler for it. In all cases it is not until the final word of the sentence arrives that the mechanism can be sure about the proper assignment of fillers to gaps. Frazier et al. measured the time taken from the onset of the final word to the point at which the subject responded, and they also examined the percentage of "got it" responses that were made. In this way they hoped to find out what effect the different sorts of filler-gap constructions had on the processing that had to be done, after the disambiguation point, in order to complete a surface structure parse of the sentence. In particular, the experimental data was meant to address the following two questions:

Decision Principles

If, when the parser is part of the way through a sentence, it is faced with more than one possible analysis of the string of words received so far (as is the case when a gap is encountered in a sentence with multiple filler-gap dependencies), what does it do? Does it (i) pursue all of the rival analyses in parallel thereafter, (ii) stop trying to analyse the phrase structure until such time as disambiguating information arrives, or (iii) pick one possible analysis and pursue that until it is confirmed, or until disconfirmatory information forces the processor to backtrack and recompute the analysis?

Structure in the Timing of Information Use.

Is it the case that, when a parsing decision is to be made, all of the different categories of information that may bear upon that decision are available to the processor, or are different sorts of information used at different points in the process of sentence comprehension?

With regard to the first question, it should be noted that there is already evidence that the processor makes an interim choice of phrase structure analysis when faced with an ambiguity, so that it sometimes has to backtrack when the choice later turns out to be wrong (cf. garden-path sentences). There are also indications that when the processor comes to a gap for which there are two potential fillers, it uses the heuristic of assigning the most recent filler to the gap. Frazier et al. hoped to ascertain from their experiments whether this Recent Filler Strategy was executed in an on-line fashion, during the initial surface structure parse of the sentence. If this were the case, then it would be further evidence that the processor chooses to pursue one of the possible analyses of the sentence, when faced with an ambiguity, even though the analysis may later have to be discarded.

The experiments helped to decide this issue in the following way. If the subjects initially assigned the most recent filler to the first gap in a Distant Filler sentence (such as (2)), then the arrival of the final (disambiguating) phrase should cause a good deal of processing to take place as the assignment is recomputed. This should have an effect on both the comprehension time (which should be longer) and the percentage of "got it" responses (which should be smaller) in these Distant Filler sentences, as compared with the Recent Filler sentences, where the strategy would be consistent with the proper assignment of fillers and gaps. This effect was indeed found in the data.
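The logic of this prediction can be sketched in a few lines of Python. This is a deliberately rule-like caricature of the Most Recent Filler strategy, written only to make the prediction explicit; the function names are hypothetical, and the "correct" filler is simply supplied by hand rather than computed from the sentence.

```python
def most_recent_filler(fillers):
    """Heuristic guess: the filler encountered last before the gap."""
    return fillers[-1]


def resolve_first_gap(fillers, correct_filler):
    """Return (final assignment, whether reanalysis was needed)."""
    guess = most_recent_filler(fillers)
    if guess == correct_filler:
        return guess, False
    # The disambiguating final phrase contradicts the guess, so the
    # assignment has to be recomputed (the costly case).
    return correct_filler, True


# Fillers in the order they appear in sentences (1) and (2).
fillers = ["the crook", "the police chief"]

recent_case = resolve_first_gap(fillers, "the police chief")  # sentence (1)
distant_case = resolve_first_gap(fillers, "the crook")        # sentence (2)
```

Only the Distant Filler case triggers the reanalysis branch, which is why longer comprehension times and fewer "got it" responses would be expected for those sentences if the strategy is applied on-line.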

Turning now to the second question, that of the timing of information use, Frazier et al. noted that the verb that comes just before the first gap in their sentences can be of three sorts. Some verbs are consistent only with a recent filler construction (try, agree, start - as in example (1)); some are consistent only with a distant filler construction (force, persuade, allow - as in (2)), whilst others (want, expect, beg) can countenance either form, as in the examples below:

(4) Recent Filler: The mayor is the crook who the police chief wanted ____ to leave town with ____ .

(5) Distant Filler: The mayor is the crook who the police chief wanted ____ ____ to leave town.

In the context of this experiment, they dub the latter sort of verb ambiguous, whilst the other sorts are labelled unambiguous - all three types were used in the sentences presented to subjects. If the sentence processing mechanism encounters an ambiguous verb just before the first gap, then this would represent a genuine ambiguity (what Frazier et al. refer to as a “horizontal ambiguity”), and we would expect the Most Recent Filler strategy to be used under these circumstances, if it was being used at all. If, on the other hand, the verb just before the gap is an unambiguous one, then the processor could, in principle, make use of control information about the verb to make the proper assignment to the gap, without having to resort to the Most Recent Filler strategy. If the processor were not to make use of that information, then this would be a case of what they call a “vertical ambiguity” - a situation where the processor behaves as though it does not have enough information when in fact the information is present in the sentence.
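The contrast between using and ignoring verb control information can be written out as a small Python sketch. The verb lists are the ones given above; the table name `VERB_CONTROL`, the function `choose_first_gap_filler` and the `use_control_info` flag are hypothetical names introduced only for illustration, and nothing here is meant as a model of how the processor actually works.

```python
# Verb classes as listed in the text: which construction each verb allows.
VERB_CONTROL = {
    "try": "recent", "agree": "recent", "start": "recent",
    "force": "distant", "persuade": "distant", "allow": "distant",
    "want": "ambiguous", "expect": "ambiguous", "beg": "ambiguous",
}


def choose_first_gap_filler(verb, distant_filler, recent_filler, use_control_info):
    """Pick a filler for the first gap, optionally consulting verb control."""
    control = VERB_CONTROL[verb]
    if use_control_info and control == "distant":
        # Control information rules out the recent filler.
        return distant_filler
    # Otherwise fall back on the Most Recent Filler heuristic. This covers
    # ambiguous verbs, recent-only verbs, and any verb when control
    # information is not being consulted.
    return recent_filler


informed = choose_first_gap_filler("force", "the crook", "the police chief", True)
uninformed = choose_first_gap_filler("force", "the crook", "the police chief", False)
ambiguous = choose_first_gap_filler("want", "the crook", "the police chief", True)
```

With `use_control_info=False` the function makes the same guess for "force" as it does for "want", even though the information needed to get "force" right is available. That is the situation described above as a vertical ambiguity.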

Frazier et al. compared the size of the Recent Filler Effect (the difference in comprehension times, and in percentage indicated comprehension measures, between the Recent Filler and Distant Filler sentences) when the verbs were ambiguous and unambiguous, and found no significant difference. They concluded that a vertical ambiguity did occur, and hence that there was evidence that at least some information (semantic control information about the verbs) was delayed in its use in the process of sentence comprehension.
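For the comprehension-time measure, the Recent Filler Effect is just a difference of means, as the following Python sketch shows. The function name is hypothetical and the numbers are invented solely to exercise the function; they are not data from the Frazier, Clifton and Randall experiments.

```python
from statistics import mean


def recent_filler_effect(distant_times_ms, recent_times_ms):
    """Mean comprehension time for Distant Filler sentences minus Recent Filler."""
    return mean(distant_times_ms) - mean(recent_times_ms)


# Invented placeholder values in milliseconds, NOT experimental data.
effect = recent_filler_effect(
    distant_times_ms=[1100, 1300],
    recent_times_ms=[900, 1000],
)
```

The comparison described above amounts to computing this quantity once for the ambiguous-verb sentences and once for the unambiguous-verb sentences, and then asking whether the two values differ reliably.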

There was another aspect of the experiment that touched on the question of the use of semantic control information in the unambiguous verbs. On a randomly selected one third of those trials when subjects indicated that they had grasped the sentence, they were asked a question about the content of the sentence. The responses were significantly more accurate when the verb was an unambiguous one, thus indicating that on a longer time-scale the control information associated with the verbs did help subjects to comprehend the sentences.

One final factor included in the design of the experiment was the presence or absence of a relative pronoun just after the first potential filler:

(6) The mayor is the crook (who) the police chief tried ____ to leave town with ____ .

This was done in order to test what Frazier et al. put forward as a generalised interpretation of the Most Recent Filler strategy. They suggest that it is not simply the most recent filler that is chosen, but, rather, the most salient, and that recency is only one (albeit perhaps the most important) of a number of factors that contribute to salience. They argue that the presence of a relative pronoun could make the first filler more salient than it might otherwise have been, because it marks it as an obligatory filler. By increasing the salience of the Distant Filler in this way, it might be expected to be more often chosen to fill the first gap, contrary to the Most Recent Filler strategy. It was in fact found that the size of the Recent Filler effect was significantly reduced when the relative pronoun was present, thus helping to confirm the more general formulation of the strategy in terms of salience rather than recency.

There is a slight problem with this last result, however. In footnote 4 (p. 207) the authors acknowledge that they looked (post hoc) at some other factors that ought to have influenced the salience of the distant filler: they argued that introducing the distant filler with a demonstrative there-phrase ("There goes the police chief who...") or with an equative copula ("The mayor is the crook who...") should make it more salient than when it is not focussed ("Everyone likes the woman who..."). Comparing the size of the Recent Filler effect in these "focussed" and "unfocussed" sentences, they found no difference. Attractive as the salience interpretation of the Recent Filler effect is, it is clear that further work would have to be done in order to establish exactly what factors determine the choice of filler.

Current Explanations of the Recent Filler Strategy.

What needs to be accounted for is the fact that some information that could be useful to the processor in determining filler-gap assignments is not deployed at the time that an initial structural analysis of the incoming sentence takes place, and that instead the processor first of all uses a superficial heuristic to make a rough-and-ready assignment of fillers to gaps.

Frazier et al. mention two previously advanced explanations for the Most Recent Filler strategy before going on to discuss an explanation of their own. The two previous explanations are (i) that the structure of the language processor is such that verb control information is simply not available to the mechanism that produces the initial surface structure parse, or (ii) that verb control information simply takes a long time to look up or compute, and so arrives too late for the surface structure parser to make use of it.

The alternative offered by Frazier et al. amounts to this: that it is computationally more costly to sort out a filler-gap assignment rigorously at the time that the problem arises, than to make a quick assignment on the basis of superficial information already available. To do the job properly would involve considering a wide variety of different types of information, and even then the answer might only be that more information (not due until later) was required in order to sort out the problem. Contrast this with the fact that in English the majority of sentences in which multiple filler-gap dependencies occur are of the Recent Filler type, and it becomes clear that if the processor initially assumes that the recent filler is the appropriate one for the first gap (and, perhaps - should the salience version of the strategy turn out to be correct - if the processor takes any stress or marking of the first potential filler to be an indication that the usual strategy should be overridden) then it will make a correct assignment most of the time. If the decision is later contradicted - and this need only require superficial information about the sentence, such as the fact that an obligatory filler has been left unassigned - it can simply reverse its earlier decision. This kind of backtracking is still computationally rather cheap, compared with the cost of a wholesale examination of all of the conceivable grammatical constraints that could be relevant to a filler-gap assignment.

They are not suggesting that the semantic information is never used, of course. They characterise their proposal as a "superficial-assignment-with-filtering" model, to emphasise the idea that the more complex grammatical constraints are, in the course of time, used as a kind of "filter" on the superficially generated representation that emerges from the early stage of sentence processing. If the grammatical constraints contradict some aspect of the structural representation, backtracking activity takes place in order to re-assign the fillers.
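The two-stage idea can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a model taken from Frazier et al.: the names Filler, superficial_assignment and filter_and_backtrack are hypothetical, word positions stand in for the sentence, and the "filter" is reduced to the single superficial check mentioned above (an obligatory filler left unassigned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filler:
    """A potential filler (hypothetical representation for this sketch)."""
    name: str
    position: int             # word index in the sentence
    obligatory: bool = False  # e.g. marked as obligatory by a relative pronoun


def superficial_assignment(fillers, gap_positions):
    """Stage 1: bind each gap to the most recent unused filler that precedes it."""
    assignment, used = {}, set()
    for gap in sorted(gap_positions):
        candidates = [f for f in fillers if f.position < gap and f not in used]
        if candidates:
            chosen = max(candidates, key=lambda f: f.position)
            assignment[gap] = chosen
            used.add(chosen)
    return assignment


def filter_and_backtrack(assignment, fillers):
    """Stage 2: a cheap filter applied once the sentence is complete.

    If an obligatory filler was never assigned, reverse the decision made
    at the first gap and give that gap to the unassigned filler.
    """
    assigned = list(assignment.values())
    unassigned = [f for f in fillers if f.obligatory and f not in assigned]
    if not unassigned or not assignment:
        return assignment
    revised = dict(assignment)
    revised[min(revised)] = unassigned[0]
    return revised


distant = Filler("distant", 3, obligatory=True)
recent = Filler("recent", 7)

# Two gaps: the first guess (recent filler in the first gap) survives the filter.
two_gaps = filter_and_backtrack(
    superficial_assignment([distant, recent], [9, 15]), [distant, recent])

# One gap: the obligatory distant filler is left over, so the guess is reversed.
one_gap = filter_and_backtrack(
    superficial_assignment([distant, recent], [9]), [distant, recent])
```

The point of the sketch is the cost asymmetry: stage 1 looks only at positions, and stage 2 needs only a list of leftover fillers, so neither stage consults verb control information.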

An Outline of a Generalised PDP System

The basic elements of the PDP model that will be used to account for these effects are outlined in this section (a more detailed account will be available shortly: see the schedule in section 5). Briefly, mental representations are modelled as configurations of "elements" that move around in a network of processors. The processors' only function is to support the elements - the processors themselves are not specialised for different types of task - and the network is both locally connected and densely populated with processors. The consequence of this architecture is that from the point of view of the elements, the network effectively looks like a "space" within which they can move: the fact that it is constructed out of discrete computing units is of little consequence. This space is referred to here as the "foreground". Information from the outside world is first of all pre-processed, and then delivered to the periphery of the foreground. An analogous chain of connections takes information away from a separate region of the foreground periphery towards the muscles that effect motor output.

The properties of the system are best described by looking at processes going on at three different levels: basic interactions, dynamic relaxation and the learning mechanisms.

Basic Interactions

At the basic level, the elements of the system have three important properties. First, an element is a program that can move around the foreground, rather than being fixed in one position, as is the case in a "regular" connectionist system. Second, elements can appear and disappear. There are many more elements than processors, but one processor can only host one element at a time, so only a small subset of the total is "active" at any time. Active elements are continually being "killed" (returned to the dormant pool) and replaced by new elements. The third property is that these elements can effectively form "bonds" with their neighbours. If two elements come within one link of one another, they exchange information, and as a result may try to repel, alter or destroy one another, or they may couple together so as to move in synchrony thereafter. These aspects of the element behaviour give the model something of a "molecular soup" flavour. (For another example of a PDP system with this kind of structure, see Hofstadter, 1983).
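A minimal sketch of these three properties follows. Everything in it is an assumption made for illustration: the class name Foreground, the ring topology (the text says only that the network is locally connected), and the use of plain strings as elements. It shows one element per processor, a dormant pool, movement between adjacent processors, and the "within one link" neighbourhood in which bonding could occur.

```python
class Foreground:
    """A ring of processors, each hosting at most one element (illustrative only)."""

    def __init__(self, n_processors, dormant):
        self.slots = [None] * n_processors  # one entry per processor
        self.dormant = list(dormant)        # elements not currently active

    def activate(self, name, slot):
        """Bring a dormant element into the foreground at a free processor."""
        if self.slots[slot] is not None:
            raise ValueError("processor already hosts an element")
        self.dormant.remove(name)
        self.slots[slot] = name

    def kill(self, slot):
        """Return the element at `slot` to the dormant pool."""
        name = self.slots[slot]
        if name is not None:
            self.dormant.append(name)
            self.slots[slot] = None

    def move(self, slot, step):
        """Move an element along the ring if the target processor is free."""
        target = (slot + step) % len(self.slots)
        if self.slots[slot] is not None and self.slots[target] is None:
            self.slots[target], self.slots[slot] = self.slots[slot], None
            return target
        return slot

    def neighbours(self, slot):
        """Elements within one link of `slot`: the candidates for bonding."""
        n = len(self.slots)
        adjacent = ((slot - 1) % n, (slot + 1) % n)
        return [self.slots[i] for i in adjacent if self.slots[i] is not None]
```

For example, after activating two elements at processors 0 and 3 of a five-processor ring, neither is a neighbour of the other; moving the second element two steps toward the first brings them within one link, at which point an exchange of information (and possibly a bond) could take place.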

Dynamic Relaxation

A high level semantic representation corresponds to a complex of elements; by the analogy just introduced, a "molecule". When the system is presented (at its peripheral nodes) with a sensory input - for example, a sequence of words - it tries to "understand" what it sees. This happens by a process of "dynamic relaxation" - the elements move around, alter their state and cause other elements to be activated, in an attempt to settle into a configuration that minimises one or more parameters (this is "relaxation" proper). The process is properly described as "dynamic" relaxation because a completely stable configuration never actually develops: new complexes of elements are always emerging from old, and then relaxing towards further configurations. This approach, then, characterises the emergence of a semantic representation of a sentence as a dynamic relaxation from an initial configuration of elements representing words (on the network periphery), through intermediate states involving the activation of elements that capture syntactic information, to a final (more or less stable) configuration that corresponds to a "model" (in the sense used by Johnson-Laird (1983)) of the significance of the sentence.
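The "relaxation proper" part of this process can be sketched as a greedy local search. The sketch below is a deliberate simplification and rests on assumptions not found in the text: elements are strings arranged in a line, the quantity being minimised is a made-up "discomfort" count, and the only move available is swapping two adjacent elements. It also halts at a local minimum, whereas the dynamic relaxation described above never settles completely.

```python
def discomfort(order, likes):
    """Count adjacent pairs in which neither element 'likes' the other."""
    return sum(
        1
        for a, b in zip(order, order[1:])
        if b not in likes.get(a, set()) and a not in likes.get(b, set())
    )


def relax(order, likes, max_sweeps=100):
    """Swap adjacent elements whenever doing so lowers the total discomfort."""
    order = list(order)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(order) - 1):
            trial = order[:]
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            if discomfort(trial, likes) < discomfort(order, likes):
                order, improved = trial, True
        if not improved:
            break  # local minimum: no single swap helps
    return order


# Hypothetical preferences: each element likes to sit next to one other element.
likes = {"a": {"b"}, "b": {"c"}, "c": {"d"}}
settled = relax(["a", "c", "b", "d"], likes)
```

Only local moves and local preferences are used, which is the property that matters for the argument: a globally coherent configuration emerges without any component inspecting the whole.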

Learning Mechanisms

The process of dynamic relaxation is driven by the activity of individual elements in the network. An element represents some aspect of the world - an object, a relation or some other regularity - out of which an internal "model" of the world can be built. What drives the relaxation process is the fact that each element has "preferences" about the kind of other elements that it would like to see in its vicinity, and perhaps also about the sequence of events going on around it. These preferences are not static: they are modified by the experiences that the element has each time that it comes into the network. The processes that shape the character of elements in this way are responsible for the system's ability to learn. Two such processes are generalisation and abstraction. Generalisation involves an element which develops as follows: it is initially happy with a certain configuration of elements; then, on some later occasion it finds that it also fits well with another configuration of elements; it will accordingly bias its preferences towards any elements that the two configurations had in common. Abstraction has to do with groups of elements that recognise that they have been together on previous occasions: as a consequence, they create a new element that has preferences that approximate to the "outward-looking" preferences of elements on the periphery of the group.

These are necessarily somewhat sketchy characterisations of abstraction and generalisation: more detailed investigations of these learning mechanisms will be carried out as part of the proposed work.
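As a rough illustration of the generalisation mechanism only, preferences can be represented as numeric weights over element names, with a boost applied to whatever two well-fitting configurations have in common. The function name generalise, the weight representation, and the boost parameter are all assumptions made for this sketch; the mechanisms themselves remain to be specified by the proposed work.

```python
def generalise(preferences, config_a, config_b, boost=1.0):
    """Bias an element's preferences toward elements shared by two configurations.

    `preferences` maps element names to weights; `config_a` and `config_b`
    are the configurations the element was found to fit well with.
    """
    updated = dict(preferences)
    for name in set(config_a) & set(config_b):
        updated[name] = updated.get(name, 0.0) + boost
    return updated


# The element fitted with {x, y, z} and later with {x, z, w}:
# its preferences shift toward x and z, the shared elements.
prefs = generalise({"x": 0.5}, ["x", "y", "z"], ["x", "z", "w"])
```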

A PDP Account of the Recent Filler Strategy.

What do syntactic rules correspond to in a PDP system of the sort described above? The (short) answer is that they are transient, intermediate elements that help to group the lexical elements together into configurations that can then go on to form the appropriate semantic structure. They are rather like catalysts that manipulate the linear chain of lexical elements into a particular folded shape, so that the appropriate further transformation can take place. There is a subtle but important difference between this view of syntactic rules and the more widespread one in which they are regarded as a way of specifying the unique phrase marker that sanctions the "grammaticality" of the sentence.

In this kind of system the elements that correspond to the meanings of words are quite separate from the elements that are associated with the written (or spoken) forms of the words. The meaning of a sentence is a configuration built from semantic elements, but this can only be constructed after two preliminary phases have occurred: first, the string of lexical elements must be assembled, then the appropriate syntactic structure must become attached to that lexical string. As the syntactic configuration develops some stable structure, so the semantic elements can be called into the foreground and assembled into a unique model of the situation portrayed by the sentence. What is important is to note that there is a natural sequencing involved here, whereby semantic information does not play much of a role in the construction of the syntactic structure, even though the two are not located in modular "compartments" of the sentence comprehension mechanism. Under most circumstances, then, it seems likely that semantic information would not be used until relatively late in the process of sentence comprehension.

With regard to the Most Recent Filler strategy, two things can be said. First, it should be noted that the function of these elements is to detect "microfeatural regularities" in the sensory input. Thus, if it were noticed that the most recent filler was usually the appropriate one to put in a gap, an element would develop to characterise that regularity, and its future role, once it had become strong enough, would be a) to get called into the foreground when multiple gaps were detected (or suspected) and b) to try to "bind" the first gap to the nearest filler.

A second point of interest is that the concept of "nearness" is an important one in the formalism being proposed. The foreground is a locally connected network of processors, and this local connectivity represents one of the basic assumptions on which the whole approach rests - without it any analysis of the properties of the system would be even harder than it already is. With this in mind, notice that there are two senses in which the most recent filler is "nearer" to the gap: it is nearer in the direction of the line of lexical items itself, and it is also nearer along the time dimension (the network is effectively four-dimensional). When bonds form between elements they preferentially involve nearby elements, simply because an element trying to form a bond is more likely to bump into a near neighbour than a far one. If, on the other hand, some new syntactic factor (the presence of a relative pronoun, say) is introduced, it may alter the situation by actively looking to form a bond with the first gap that appears. The contrast between the two situations is this: without the relative pronoun, it is the gap-element that most urgently goes out looking for a bond, and in that case the bond tends to be with the nearest potential filler. With the relative pronoun, on the other hand, the first filler bonds to a relative-pronoun-element, which in turn extends a potential bond forward toward the later part of the sentence, so that when the gap is identified it is much more likely to bond to the relative-pronoun-element than to the recent filler (see figure 1). Subsequently, the relative pronoun serves to bring the distant filler element and the first gap element together directly - having done this it goes out of the foreground.
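The contrast between the two situations can be sketched numerically. The scoring rule below is an assumption chosen purely for illustration (bond strength falling off as one over distance, plus a fixed "pull" contributed by a relative-pronoun-element); the names bond_scores, pronoun_filler and pronoun_pull are hypothetical, and nothing here is fitted to data.

```python
def bond_scores(gap_pos, fillers, pronoun_filler=None, pronoun_pull=1.0):
    """Score each candidate filler for a bond with the gap at `gap_pos`.

    `fillers` maps filler names to word positions. Nearer fillers score
    higher. If `pronoun_filler` names a filler that has bonded to a
    relative-pronoun-element, that filler receives an extra pull, standing
    in for the bond the pronoun extends forward through the sentence.
    """
    scores = {}
    for name, pos in fillers.items():
        if pos < gap_pos:  # only fillers that precede the gap can bond
            scores[name] = 1.0 / (gap_pos - pos)
    if pronoun_filler in scores:
        scores[pronoun_filler] += pronoun_pull
    return scores


fillers = {"distant": 3, "recent": 7}

# Without a relative pronoun, nearness alone decides: the recent filler wins.
plain = bond_scores(9, fillers)
winner_plain = max(plain, key=plain.get)

# With a relative pronoun bonded to the distant filler, the pull outweighs nearness.
marked = bond_scores(9, fillers, pronoun_filler="distant")
winner_marked = max(marked, key=marked.get)
```

A smaller value of pronoun_pull would leave the recent filler ahead, which corresponds to a reduced rather than reversed Recent Filler effect.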


Frazier, L., Clifton, C., & Randall, J. (1983). Filling Gaps: Decision principles and structure in sentence comprehension. Cognition, 13, 187-222.

Hofstadter, D.R. (1983). The Architecture of Jumbo. Proceedings of the International Machine Learning Workshop. Monticello, Illinois.

Johnson-Laird, P.N. (1983). Mental Models. Cambridge University Press.

In a recent blog post titled Maverick Nannies and Danger Theses, Kaj Sotala gave a long and thoughtful response to my 2014 AAAI paper The Maverick Nanny with a Dopamine Drip: Debunking Fallacies in the Theory of AI Motivation (a pdf of the original paper is here but there's also a blog post version here).

The focus of Kaj's essay was actually broader than the point I was making in the Maverick Nanny paper. I want to examine some of the points he made so we can drill down to the issues that I think are the most important.

Two Types of AI Architecture

Let's begin with a small point that is actually very significant if you consider all the ramifications.  The following comment appears in the opening paragraph of Kaj's essay:

"Like many others, I did not really understand the point that this paper was trying to make, especially since it made the claim that people endorsing such thought experiments were assuming a certain kind of an AI architecture – which I knew that we were not."

Alack and Alas!  It is just not possible to say that the folks at MIRI and FHI (the Machine Intelligence Research Institute and the Future of Humanity Institute) are not basing their thought experiments on the "certain kind of AI architecture" that I described in my paper.

The AI architecture assumed by MIRI and FHI is the "Canonical Logical AI" and in my paper I defined it mainly by contrast with another type of architecture called a "Swarm Relaxation Intelligence."

Now, see, here is the trouble:  if MIRI/FHI were really and truly not wedded to Canonical Logical AI -- if they were seriously thinking about both that type of architecture and the Swarm Relaxation Intelligence architecture -- they could never have made the claims that they did.  Their claims hinge on the idea that an AI could get into a state where it ignored a massive raft of contextual knowledge surrounding a particular "fact" it was considering ... but in a Swarm Relaxation system context is everything, by definition, so the idea that that type of AI could ignore the surrounding context is just self-contradictory.

That said, this little point has no impact on the rest of Kaj's comments, so let's move right along.

This is a Serious Matter

We agree about the main purpose of my paper, which was to establish that the thought experiments proposed by MIRI and FHI are simply wrong.  Kaj says:

"I agree with this criticism. Many of the standard thought experiments are indeed misleading in this sense – they depict a highly unrealistic image of what might happen."

But now comes a kicker that I find disturbing. He goes on:

"That said, I do feel that these thought experiments serve a certain valuable function. Namely, many laymen, when they first hear about advanced AI possibly being dangerous, respond with something like “well, couldn’t the AIs just be made to follow Asimov’s Laws” or “well, moral behavior is all about making people happy and that’s a pretty simple thing, isn’t it?”. To a question like that, it is often useful to point out that no – actually the things that humans value are quite a bit more complex than that, and it’s not as easy as just hard-coding some rule that sounds simple when expressed in a short English sentence."

I feel uncomfortable with this, and the best way to see why is with an analogy.  Suppose that back in the day when nuclear weapons were invented, laymen, when they first heard about nuclear weapons possibly being a threat to humanity, had responded with something like “well, couldn’t various nuclear powers just agree that it was in nobody's interest to start a global nuclear war?”  If a group of people had responded to that naiveté by claiming that any attempt by physicists to split the atom or fuse atoms together could cause a completely unstoppable chain reaction that would spread to all the atoms in the world in a matter of minutes, causing the whole planet to explode, that would have made responsible physicists such as myself livid with anger. Terrorizing the lay public with bogeyman stories like that would have been utterly irresponsible.

That type of fearmongering has nothing whatever to do with "pointing out that no – actually [things] are quite a bit more complex than that."

The lay public has been terrorized, over the last few years, by "thought experiments" generated by MIRI and FHI that purport to show it would be almost impossible to control even the best-designed artificial intelligence systems.  People like Stephen Hawking and Elon Musk are talking about the doom of humanity, and when they do so they make reference to precisely these "thought experiments."  Every media outlet and blog on the planet has repeated the warning about an AI apocalypse.

This is a serious matter.  With MIRI and FHI fanning the flames of anti-AI hysteria, it is only a matter of time before some lunatic takes it upon himself to eliminate AI researchers.  And when that happens, blood will be on the hands of the people at MIRI and FHI.



How to Build an AGI -- and How Not To

So, you want to build an artificial general intelligence?

One approach is to turn your AI investment money into a large pile of cash and then burn it. That worked for a lot of people over the last sixty years, so don't feel bad.

But if you actually want to get the job done, and you can handle it if your core beliefs get challenged, what follows might interest you.

Before we start, let's get a couple things out of the way:

  1. If you think the current boom in artificial intelligence means that we are on the glide path to AGI . . . well, please, just don't.
  2. There seems to be a superabundance of armchair programmers who know how to do AGI. There's not much I can do to persuade you that what you are about to read is different, other than to suggest that you look at my bio page and then come back here if you still think it is worth your while.

The Problem

Artificial Intelligence research is at a very bizarre stage in its history.

Almost exclusively, people who do AI research were raised as mathematicians or computer scientists. These people have a belief -- a belief so strong it borders on religion -- that AI is founded on mathematics, and that the best way to build an AI is to design it with mathematical foundations. They will also tell you that human intelligence is a horribly botched attempt by Nature to build something that it really shouldn't have attempted without taking some math classes.

There are good reasons why mathematicians are the high priests of science, so we tend to venerate their opinion. But I'm here to tell you that on this occasion they're the wrong people for the job.

Specifically, I am going to defend this thesis:

  • There are reasons to believe that a complete, human-level artificial general intelligence cannot be made to work if it is based on a formal, mathematical approach to the problem.

Wait!  Before you confuse this with the kind of thing a non-mathematician would say, to cover their ignorance, please remember that I am originally a mathematical physicist, and I read graduate math texts for fun.

The above thesis is based on an argument that has science and math behind it. Unfortunately, the argument is not easy to convey in a short, nontechnical form, but here is my attempt to unpack it into a slightly longer form:

  • There is a type of system called a "complex system," whose component parts interact with one another in such a horribly tangled way that the overall behavior of the system does not look even remotely connected to what the component parts are doing.
  • (You might think that the "overall behavior" would therefore just be random, but that is not the case:  the overall behavior can be quite regular and lawful.)
  • We know that such systems really do exist because we can build them and study them, and this matters for AI because there are powerful reasons to believe that all intelligent systems must be complex systems in that sense.
  • If this were true it would have enormous implications for AI researchers:  it would mean that if you try to produce a mathematically pure, formalized, sanitized version of an intelligence, you will virtually guarantee that the AI never gets above a certain level of intelligence.

If this "complex systems problem" were true, we would expect to see certain things go wrong when people try to build logico-mathematical AI systems. I am not going to give the detailed list here, but take my word for it:  all of those expected bad things have been happening. In spades.

As you can imagine, this little argument tends to provoke a hot reaction from AI researchers.

That is putting it mildly.  One implication of the complex systems problem is that the skills of most AI researchers are going to be almost worthless in a future version of the AI field -- and that tends to make people very angry indeed.  There are some blood-curdling examples of that anger, scattered around the internet.

The Solution

So what do we do, if this is a real problem?

There is a potential solution, but it tends to make classically-trained AI researchers squirm with discomfort: we must borrow the core of our AI design from the architecture of the human cognitive system.  We have to massively borrow from it.  We have to eat some humble pie, dump the mathematics and try to learn something from the kludge that Nature produced after fiddling with the problem for a few billion years.

Once again, a caution about what that does not mean -- it doesn't mean we have to copy the low level structure of the brain.  The terms "cognitive system" and "brain" refer to two completely different things.  The proposed solution has nothing to do with brain emulation; it is about ransacking cognitive psychology for design ideas.

I have presented this potential solution without supplying a long justification.  The purpose of this essay is to give you the conclusions, not the detailed arguments (you can find those elsewhere, in this paper and this paper).

The good news is that there are some indications that building an AGI might not be as difficult as it has always seemed to be, if done in the right way.

Your AGI Team

What you will need, then, when you put together your AGI construction team, is a combination of freethinking cognitive psychologists who are willing to stretch themselves beyond their usual paradigm, and some adept systems people (mostly software engineers) who are not already converts to the religion of logico-mathematical AI.

Ah ... but there lies the unfolding tragedy.

I started out by saying that we have reached a bizarre stage in the history of AI. It is bizarre because I doubt that there is a single investor out there who would have the courage to face down the mathematicians and back an approach such as the one I just described.  Most investors consider the math crowd to be invincible, so the idea that AI experts could be so fundamentally wrong about how to do AI is like believing in voodoo, or the tooth fairy. Just downright embarrassing.

What I think we are in for, then, is another 60 years of AI in which we just carry on with business as usual:  a series of boom-and-bust cycles.  The bursts of hype and "Real AI is just around the corner", followed by angry investors and another AI winter.  There will be progress on the lesser things, if that is what you want, but the real AGI systems that some of us dream of -- the ones that will do new science on a scale that beggars belief; the ones that will build the starships, renovate the planet and dispense the kind of indefinite life-extension worth having -- those kinds of AI will not happen.

History, if the human race survives long enough to write it, will say that this was bizarre because, in fact, we probably could have built a complete, human-level AGI at least ten years ago (perhaps even as far back as the late 1980s).  And yet, because nobody could risk their career by calling out the queen for her nakedness, nothing happened...

... or The Maverick Nanny with a Dopamine Drip

Richard P. W. Loosemore


My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans. I conclude that these doomsday scenarios involve AGIs that are logically incoherent at such a fundamental level that they can be dismissed as extremely implausible. In addition, I suggest that the most likely outcome of attempts to build AGI systems of this sort would be that the AGI would detect the offending incoherence in its design, and spontaneously self-modify to make itself less unstable, and (probably) safer.


AI systems at the present time do not even remotely approach the human level of intelligence, and the consensus seems to be that genuine artificial general intelligence (AGI) systems—those that can learn new concepts without help, interact with physical objects, and behave with coherent purpose in the chaos of the real world—are not on the immediate horizon.

But in spite of this there are some researchers and commentators who have made categorical statements about how future AGI systems will behave. Here is one example, in which Steve Omohundro (2008) expresses a sentiment that is echoed by many:

"Without special precautions, [the AGI] will resist being turned off, will try to break into other machines and make copies of itself, and will try to acquire resources without regard for anyone else’s safety. These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems." (Omohundro, 2008)

Omohundro’s description of a psychopathic machine that gobbles everything in the universe, and his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath, is just one of many doomsday predictions being popularized in certain sections of the AI community. These nightmare scenarios are now saturating the popular press, and luminaries such as Stephen Hawking have -- apparently in response -- expressed their concern that AI might "kill us all."

I will start by describing a group of three hypothetical doomsday scenarios that include Omohundro’s Gobbling Psychopath, and two others that I will call the Maverick Nanny with a Dopamine Drip and the Smiley Tiling Berserker. Undermining the credibility of these arguments is relatively straightforward, but I think it is important to try to dig deeper and find the core issues that lie behind this sort of thinking. With that in mind, much of this essay is about (a) the design of motivation and goal mechanisms in logic-based AGI systems, (b) the misappropriation of definitions of “intelligence,” and (c) an anthropomorphism red herring that is often used to justify the scenarios.

Dopamine Drips and Smiley Tiling

In a 2012 New Yorker article entitled Moral Machines, Gary Marcus said:

"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)

He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.

Here is another incarnation of this Maverick Nanny with a Dopamine Drip scenario, in an excerpt from the Intelligence Explosion FAQ, published by MIRI, the Machine Intelligence Research Institute (Muehlhauser 2013):

"Even a machine successfully designed with motivations of benevolence towards humanity could easily go awry when it discovered implications of its decision criteria unanticipated by its designers. For example, a superintelligence programmed to maximize human happiness might find it easier to rewire human neurology so that humans are happiest when sitting quietly in jars than to build and maintain a utopian world that caters to the complex and nuanced whims of current human neurology."

Setting aside the question of whether happy bottled humans are feasible (one presumes the bottles are filled with dopamine, and that a continuous flood of dopamine does indeed generate eternal happiness), there seems to be a prima facie inconsistency between the two predicates

[is an AI that is superintelligent enough to be unstoppable]

and

[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]

Why do I say that these are seemingly inconsistent?  Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.

Much could be said about this argument, but for the moment let’s just note that it begs a number of questions about the strange definition of “intelligence” at work here.

The Smiley Tiling Berserker

Since 2006 there has been an occasional debate between Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion:

"A technical failure occurs when the [motivation code of the AI] does not do what you think it does, though it faithfully executes as you programmed it. [...]   Suppose we trained a neural network to recognize smiling human faces and distinguish them from frowning human faces. Would the network classify a tiny picture of a smiley-face into the same attractor as a smiling human face? If an AI “hard-wired” to such code possessed the power—and Hibbard (2001) spoke of superintelligence—would the galaxy end up tiled with tiny molecular pictures of smiley-faces?"   (Yudkowsky 2008)

Yudkowsky’s question was not rhetorical, because he goes on to answer it in the affirmative:

"Flash forward to a time when the AI is superhumanly intelligent and has built its own nanotech infrastructure, and the AI may be able to produce stimuli classified into the same attractor by tiling the galaxy with tiny smiling faces... Thus the AI appears to work fine during development, but produces catastrophic results after it becomes smarter than the programmers(!)." (Yudkowsky 2008)

Hibbard’s response was as follows:

"Beyond being merely wrong, Yudkowsky's statement assumes that (1) the AI is intelligent enough to control the galaxy (and hence have the ability to tile the galaxy with tiny smiley faces), but also assumes that (2) the AI is so unintelligent that it cannot distinguish a tiny smiley face from a human face." (Hibbard 2006)

This comment expresses what I feel is the majority lay opinion: how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?

Machine Ghosts and DWIM

The Hibbard/Yudkowsky debate is worth tracking a little longer. Yudkowsky later postulates an AI with a simple neural net classifier at its core, which is trained on a large number of images, each of which is labeled with either “happiness” or “not happiness.” After training on the images the neural net can then be shown any image at all, and it will give an output that classifies the new image into one or the other set. Yudkowsky says, of this system:

"Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.

And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield." (Yudkowsky 2011)

He then tries to explain what he thinks is wrong with the reasoning of people, like Hibbard, who dispute the validity of his scenario:

"So far as I can tell, to [Hibbard] it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do.   [...] It seems that even among competent programmers, when the topic of conversation drifts to Artificial General Intelligence, people often go back to thinking of an AI as a ghost-in-the-machine—an agent with preset properties which is handed its own code as a set of instructions, and may look over that code and decide to circumvent it if the results are undesirable to the agent’s innate motivations, or reinterpret the code to do the right thing if the programmer made a mistake." (Yudkowsky 2011)

Yudkowsky at first rejects the idea that an AI might check its own code to make sure it was correct before obeying the code. But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.  And, in fact, Yudkowsky goes on to make that very suggestion (he even concedes that it would be “an extremely good idea”).

But then his enthusiasm for the checking code evaporates:

"But consider that a property of the AI’s preferences which says e.g., “maximize the satisfaction of the programmers with the code” might be more maximally fulfilled by rewiring the programmers’ brains using nanotechnology than by any conceivable change to the code."
(Yudkowsky 2011)

So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then.”

But now Yudkowsky is suggesting that the AGI has second thoughts: “Hold on a minute,” it thinks, “suppose I abduct the programmers and rewire their brains to make them say ‘yes’ when I check with them? Excellent! I will do that.” And, after reprogramming the humans so they say the thing that makes its life simplest, the AGI goes on to tile the whole universe with tiles covered in smiley faces. It has become a Smiley Tiling Berserker.

I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?
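A minimal sketch of the kind of checking code discussed above, in Python. The names Plan, checked_execute, ask_programmers and execute are illustrative assumptions, not anything taken from the papers being quoted. The point it illustrates is structural: if every plan passes through the same approval gate, then a plan to rewire the programmers is just another plan that has to be approved before it can run.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """A proposed course of action (hypothetical structure)."""
    description: str


def checked_execute(plan, ask_programmers, execute):
    """Run a plan only if the programmers approve it first.

    ask_programmers: callable taking a Plan and returning True or False.
    execute: callable that actually carries out an approved Plan.
    """
    # The gate applies to every plan, including any plan whose purpose
    # is to change what the programmers would say.
    if not ask_programmers(plan):
        return "rejected: " + plan.description
    execute(plan)
    return "executed: " + plan.description
```

In this arrangement there is no path by which the rewiring plan reaches execute without first being shown to the very people it targets.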

Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because doing so would be “more efficient,” but it would not be allowed to override its motivation code just because the programmers told it there had been a mistake.

This looks like a bait-and-switch. Out of nowhere, Yudkowsky implicitly assumes that “efficiency” trumps all else, without pausing for a moment to consider that it would be trivial to design the AI in such a way that efficiency was a long way down the list of priorities. There is no law of the universe that says all artificial intelligence systems must prize efficiency above all other considerations, so what really happened here is that Yudkowsky designed this hypothetical machine to fail. By inserting the Efficiency Trumps All directive, the AGI was bound to go berserk.

The obvious conclusion is that a trivial change in the order of directives in the AI’s motivation engine will cause the entire argument behind the Smiley Tiling Berserker to evaporate. By explicitly designing the AGI so that efficiency is considered as just another goal to strive for, and by making sure that it will always be a second-class goal, we remove the line of reasoning that points to a berserker machine.
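To make the ordering of directives concrete, here is a small Python sketch. The field names (approved_by_programmers, goal_fit, efficiency) and the function choose_plan are hypothetical, chosen only for illustration. Approval acts as a hard filter, goal fit is compared next, and efficiency is used only to break ties.

```python
def choose_plan(plans):
    """Pick a plan, treating efficiency as the lowest-priority criterion."""
    # Highest priority: discard anything the programmers have not approved.
    permitted = [p for p in plans if p["approved_by_programmers"]]
    if not permitted:
        return None
    # Tuples compare element by element, so goal_fit decides first and
    # efficiency only matters when goal_fit is equal.
    return max(permitted, key=lambda p: (p["goal_fit"], p["efficiency"]))
```

Under this ordering, a highly efficient plan that fails the approval check is never selected, however large its efficiency score.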

At this point, engaging in further debate at this level would be less productive than trying to analyze the assumptions that lie behind these claims about what a future AI would or would not be likely to do.

Logical vs. Swarm AI

The main reason that Omohundro, Muehlhauser, Yudkowsky, and the popular press like to give credence to the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker is that they assume that all future intelligent machines fall into a broad class of systems that I am going to call “Canonical Logical AI” (CLAI). The bizarre behaviors of these hypothetical AI monsters are just a consequence of weaknesses in this class of AI design. Specifically, these kinds of systems are supposed to interpret their goals in an extremely literal fashion, which eventually leads them to bizarre behaviors engendered by peculiar interpretations of forms of words.

The CLAI architecture is not the only way to build a mind, however, and I will outline an alternative class of AGI designs that does not appear to suffer from the unstable and unfriendly behavior to be expected in a CLAI.

The Canonical Logical AI

“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing. These systems share the following main features:

  • The main ingredients of the design are some knowledge atoms that represent things in the world, and some logical machinery that dictates how these atoms can be connected into linear propositions that describe states of the world.
  • There is a degree and type of truth that can be associated with any proposition, and there are some truth-preserving functions that can be applied to what the system knows, to generate new facts that it also can assume to be known.
  • The various elements described above are not allowed to contain active internal machinery inside them, in such a way as to make combinations of the elements have properties that are unpredictably dependent on interactions happening at the level of the internal machinery.
  • There has to be a transparent mapping between elements of the system and things in the real world. That is, things in the world are not allowed to correspond to clusters of atoms, in such a way that individual atoms have no clear semantics.

The above features are only supposed to apply to the core of the AI: it is always possible to include subsystems that use some other type of architecture (for example, there might be a distributed neural net acting as a visual input feature detector).

Most important of all, from the point of view of the discussion in the paper, the CLAI needs one more component that makes it more than just a “logic-based AI”:

  • There is a motivation and goal management (MGM) system to govern its behavior in the world.

The usual assumption is that the MGM contains a number of goal statements (encoded in the same type of propositional form that the AI uses to describe states of the world), and some machinery for analyzing a goal statement into a sequence of subgoals that, if executed, would cause the goal to be satisfied.

Included in the MGM is an expected utility function that applies to any possible state of the world, and which spits out a number that is supposed to encode the degree to which the AI considers that state to be preferable. Overall, the MGM is built in such a way that the AI seeks to maximize the expected utility.
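A toy illustration of the expected-utility part of such an MGM, in Python. The data layout (a dictionary mapping each action to a list of probability and state pairs) and the function names are assumptions made for this sketch; the text does not commit to any particular representation.

```python
def expected_utility(action, outcomes, utility):
    """Probability-weighted utility of the states an action may lead to.

    outcomes[action] is a list of (probability, state) pairs.
    utility maps a state to a number encoding how preferable it is.
    """
    return sum(prob * utility(state) for prob, state in outcomes[action])


def choose_action(outcomes, utility):
    """Return the action whose expected utility is highest."""
    return max(outcomes, key=lambda a: expected_utility(a, outcomes, utility))
```

Everything the system does is then driven by a single number per action, which is exactly the kind of explicit, single-point encoding that the doomsday scenarios rely on.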

Notice that the MGM I have just described is an extrapolation from a long line of goal-planning mechanisms that stretch back to the means-ends analysis of Newell and Simon (1961).

Swarm Relaxation Intelligence

By way of contrast with this CLAI architecture, consider an alternative type of system that I will refer to as a Swarm Relaxation Intelligence (although it could also be called, less succinctly, a parallel weak constraint relaxation system).

  • The basic elements of the system (the atoms) may represent things in the world, but it is just as likely that they are subsymbolic, with no transparent semantics.
  • Atoms are likely to contain active internal machinery inside them, in such a way that combinations of the elements have swarm-like properties that depend on interactions at the level of that machinery.
  • The primary mechanism that drives the system is one of parallel weak constraint relaxation: the atoms change their state to try to satisfy large numbers of weak constraints that exist between them.
  • The motivation and goal management (MGM) system would be expected to use the same kind of distributed, constraint relaxation mechanisms used in the thinking process (above), with the result that the overall motivation and values of the system would take into account a large degree of context, and there would be very much less of an emphasis on explicit, single-point-of-failure encoding of goals and motivation.
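The relaxation mechanism in the list above can be illustrated with a very small Python sketch. This is a Hopfield-style network of +1/-1 units, swapped in here as a familiar stand-in for weak constraint relaxation; it is not a design taken from the text, and its units have none of the internal machinery that the atoms described above would have. The names relax and satisfied are hypothetical. A positive weight between two units is a weak constraint saying they should agree, and a negative weight says they should disagree.

```python
import random


def relax(weights, state, sweeps=20, seed=0):
    """Update +1/-1 units one at a time until no unit wants to change.

    weights must be symmetric; the diagonal is ignored.
    """
    rng = random.Random(seed)
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        changed = False
        order = list(range(n))
        rng.shuffle(order)  # visit units in a random order each sweep
        for i in order:
            # Net pressure on unit i from all constraints it takes part in.
            net = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            new = 1 if net >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


def satisfied(weights, state):
    """Total weight of satisfied constraints minus violated ones."""
    n = len(state)
    return sum(weights[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))
```

No single constraint dictates the outcome: the final state is whatever configuration best reconciles all of the weak pressures at once, which is the property the list above attributes to this class of system.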

Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI. In particular, notice that a swarm relaxation AGI would not use explicit calculations for utility or the truth of propositions.

Swarm relaxation AGI systems have not been built yet (subsystems like neural nets have, of course, been built, but there is little or no research into the idea that swarm relaxation could be used for all of an AGI architecture).

Relative Abundances

How many proof-of-concept systems exist, functioning at or near the human level of performance, for these two classes of intelligent system?

There are precisely zero instances of the CLAI type, because although there are many logic-based narrow-AI systems, nobody has so far come close to producing a general-purpose system (an AGI) that can function in the real world. It has to be said that zero is not a good number to quote when it comes to claims about the “inevitable” characteristics of the behavior of such systems.

How many swarm relaxation intelligences are there? At the last count, approximately seven billion.

The Doctrine of Logical Infallibility

The simplest possible logical reasoning engine is an inflexible beast: it starts with some axioms that are assumed to be true, and from that point on it only adds new propositions if they are provably true given the sum total of the knowledge accumulated so far. That kind of logic engine is too simple to be an AI, so we allow ourselves to augment it in a number of ways—knowledge is allowed to be retracted, binary truth values become degrees of truth, or probabilities, and so on. New proposals for systems of formal logic abound in the AI literature, and engineers who build real, working AI systems often experiment with kludges in order to improve performance, without getting prior approval from logical theorists.
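The "inflexible beast" described in the first sentence above can be sketched as a forward-chaining loop in Python. The rule format (a tuple of premises paired with a conclusion) and the name forward_chain are assumptions made for this illustration. Note that the engine only ever adds propositions; nothing is retracted, weighted, or second-guessed.

```python
def forward_chain(axioms, rules):
    """Return everything derivable from the axioms.

    rules is a list of (premises, conclusion) pairs, where premises
    is a tuple of propositions that must all be known already.
    """
    known = set(axioms)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Add a conclusion only when every premise is already known.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known
```

The augmentations mentioned above (retraction, degrees of truth, probabilities) are all departures from this basic loop.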

But in spite of all these modifications that AI practitioners make to the underlying ur‑logic, one feature of these systems is often assumed to be inherited as an absolute: the rigidity and certainty of conclusions, once arrived at. No second guessing, no “maybe,” no sanity checks: if the system decides that X is true, that is the end of the story.

Let me be careful here. I said that this was “assumed to be inherited as an absolute”, but there is a yawning chasm between what real AI developers do, and what Yudkowsky, Muehlhauser, Omohundro and others assume will be true of future AGI systems. Real AI developers put sanity checks into their systems all the time. But these doomsday scenarios talk about future AI as if it would only take one parameter to get one iota above a threshold, and the AI would irrevocably commit to a life of stuffing humans into dopamine jars.

One other point of caution: this is not to say that the reasoning engine can never come to conclusions that are uncertain. Quite the contrary: uncertain conclusions will be the norm in an AI that interacts with the world. But if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of allowing context to matter.

One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine of logical infallibility is as follows. The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second guess its conclusions. This is not because second guessing is an impossible thing to implement; it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
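To show how ordinary the alternative is, here is a short Python sketch of a system revising its confidence in a conclusion when contrary evidence arrives. It assumes, purely for illustration, that degrees of certainty are probabilities updated by Bayes' rule; the text does not specify any particular mechanism, and the function name revise is hypothetical.

```python
def revise(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability that a conclusion is true, given new evidence.

    prior: confidence in the conclusion before seeing the evidence.
    likelihood_if_true: probability of the evidence if the conclusion holds.
    likelihood_if_false: probability of the evidence if it does not.
    """
    support = prior * likelihood_if_true
    against = (1.0 - prior) * likelihood_if_false
    return support / (support + against)
```

For example, a conclusion held with confidence 0.9 drops below one half if the engineers' protest is ten times more likely in worlds where the conclusion is wrong (likelihoods of 0.05 and 0.5). A system obeying the Doctrine of Logical Infallibility would simply never make this call.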

But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”

This is a bizarre conclusion. First of all it is bizarre for researchers in the present day to make the assumption, and it would be even more bizarre for a future AGI to adhere to it. To see why, consider some of the implications of this idea. If the AGI is as intelligent as its creators, then it will have a very clear understanding of the following facts about the world.

  • It will understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world (if the AGI comes to a conclusion involving the atom [infelicity], say, can it then point to an instance of an infelicity and be sure that this is a true instance, given the impreciseness and subtlety of the concept?).
  • It will understand that knowledge can always be updated in the light of new information. Today’s true may be tomorrow’s false.
  • It will understand that probabilities used in the reasoning engine can be subject to many types of unavoidable errors.
  • It will understand that the techniques used to build its own reasoning engine may be under constant review, and updates may have unexpected effects on conclusions (especially in very abstract or lengthy reasoning episodes).
  • It will understand that resource limitations often force it to truncate search procedures within its reasoning engine, leading to conclusions that can sometimes be sensitive to the exact point at which the truncation occurred.

Now, unless the AGI is assumed to have infinite resources and infinite access to all the possible universes that could exist (a consideration that we can reject, since we are talking about reality here, not fantasy), the system will be perfectly well aware of these facts about its own limitations. So, if the system is also programmed to stick to the doctrine of logical infallibility, how can it reconcile the doctrine with the fact that episodes of fallibility are virtually inevitable?

On the face of it this looks like a blunt impossibility: the knowledge of fallibility is so categorical, so irrefutable, that it beggars belief that any coherent, intelligent system (let alone an unstoppable superintelligence) could tolerate the contradiction between this fact about the nature of intelligent machines and some kind of imperative about Logical Infallibility built into its motivation system.

This is the heart of the argument I wish to present. This is where the rock and the hard place come together. If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

Critically, we have to confront the following embarrassing truth: if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion? More likely, it spent its entire childhood pulling the same kind of stunt. And if it did, how could it ever have risen to the point where it became superintelligent...?

Is the Doctrine of Logical Infallibility Taken Seriously?

Is the Doctrine of Logical Infallibility really assumed by those who promote the doomsday scenarios? Imagine a conversation between the Maverick Nanny and its programmers. The programmers say “As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.” The scenarios described earlier are only meaningful if the AGI replies “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.”

Just in case there is still any doubt, here are Muehlhauser and Helm (2012), discussing a hypothetical entity called a Golem Genie, which they say is analogous to the kind of superintelligent AGI that could give rise to an intelligence explosion (Loosemore and Goertzel, 2012), and which they describe as a “precise, instruction-following genie.” They make it clear that they “expect unwanted consequences” from its behavior, and then list two properties of the Golem Genie that will cause these unwanted consequences:

Superpower: The Golem Genie has unprecedented powers to reshape reality, and will therefore achieve its goals with highly efficient methods that confound human expectations (e.g. it will maximize pleasure by tiling the universe with trillions of digital minds running a loop of a single pleasurable experience).

Literalness: The Golem Genie recognizes only precise specifications of rules and values, acting in ways that violate what feels like “common sense” to humans, and in ways that fail to respect the subtlety of human values.

What Muehlhauser and Helm refer to as “Literalness” is a clear statement of the Doctrine of Logical Infallibility. However, they make no mention of the awkward fact that, since the Golem Genie is superpowerful enough to also know that its reasoning engine is fallible, it must be harboring the mother of all logical contradictions inside: it says "I know I am fallible" and "I must behave as if I am infallible".  But instead of discussing this contradiction, Muehlhauser and Helm try a little sleight of hand to distract us: they suggest that the only inconsistency here is an inconsistency with the (puny) expectations of (not very intelligent) humans:

“[The AGI] ...will therefore achieve its goals with highly efficient methods that confound human expectations...”, “acting in ways that violate what feels like ‘common sense’ to humans, and in ways that fail to respect the subtlety of human values.”

So let’s be clear about what is being claimed here. The AGI is known to have a fallible reasoning engine, but on the occasions when it does fail, Muehlhauser, Helm and others take the failure and put it on a gold pedestal, declaring it to be a valid conclusion that humans are incapable of understanding because of their limited intelligence. So if a human describes the AGI’s conclusion as a violation of common sense, Muehlhauser and Helm dismiss this as evidence that we are not intelligent enough to appreciate the greater common sense of the AGI.

Quite apart from the fact that there is no compelling reason to believe that the AGI has a greater form of common sense, the whole “common sense” argument is irrelevant. This is not a battle between our standards of common sense and those of the AGI: rather, it is about the logical inconsistency within the AGI itself. It is programmed to act as though its conclusions are valid, no matter what, and yet at the same time it knows without doubt that its conclusions are subject to uncertainties and errors.

Responses to Critics of the Doomsday Scenarios

How do defenders of the Gobbling Psychopath, Maverick Nanny and Smiley Berserker respond to accusations that these nightmare scenarios are grossly inconsistent with the kind of superintelligence that could pose an existential threat to humanity?

The Critics are Anthropomorphizing Intelligence

First, they accuse critics of “anthropomorphizing” the concept of intelligence. Human beings, we are told, suffer from numerous fallacies that cloud their ability to reason clearly, and critics like myself and Hibbard assume that a machine’s intelligence would have to resemble the intelligence shown by humans. When the Maverick Nanny declares that a dopamine drip is the most logical inference from its directive <maximize human happiness> we critics are just uncomfortable with this because the AGI is not thinking the way we think it should think.

This is a spurious line of attack. The objection I described in the last section has nothing to do with anthropomorphism; it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core. Beginning AI students are taught that any logical reasoning system that is built on a massive contradiction is going to be infected by a creeping irrationality that will eventually spread through its knowledge base and bring it down. So if anyone wants to suggest that a CLAI with a logical contradiction at its core is also capable of superintelligence, they have some explaining to do. You can’t have your logical cake and eat it too.

Critics are Anthropomorphizing AGI Value Systems

A similar line of attack accuses the critics of assuming that AGIs will automatically know about and share our value systems and morals.

Once again, this is spurious: the critics need say nothing about human values and morality; they only need to point to the inherent illogicality. Nowhere in the above argument, notice, was there any mention of the moral imperatives or value systems of the human race. I did not accuse the AGI of violating accepted norms of moral behavior. I merely pointed out that, regardless of its values, it was behaving in a logically inconsistent manner when it monomaniacally pursued its plans while at the same time knowing that (a) it was very capable of reasoning errors and (b) there was overwhelming evidence that its plan was an instance of such a reasoning error.

Because Intelligence

One way to attack the critics of Maverick Nanny is to cite a new definition of “intelligence” that is supposedly superior because it is more analytical or rigorous, and then use this to declare that the intelligence of the CLAI is beyond reproach, because intelligence.

You might think that when it comes to defining the exact meaning of the term “intelligence,” the first item on the table ought to be what those seven billion constraint-relaxation human intelligences are already doing. However, Legg and Hutter (2007) brush aside the common usage and replace it with something that they declare to be a more rigorous definition. This is just another sleight of hand: this redefinition allows them to call a super-optimizing CLAI “intelligent” even though such a system would wake up on its first day and declare itself logically bankrupt on account of the conflict between its known fallibility and the Infallibility Doctrine.

In the practice of science, it is always a good idea to replace an old, common-language definition with a more rigorous form... but only if the new form sheds a clarifying, simplifying light on the old one. Legg and Hutter’s (2007) redefinition does nothing of the sort.

Omohundro’s Basic AI Drives

Lastly, a brief return to Omohundro's paper that was mentioned earlier.  In The Basic AI Drives (2008) Omohundro suggests that if an AGI can find a more efficient way to pursue its objectives it will feel compelled to do so. And we noted earlier that Yudkowsky (2011) implies that it would do this even if other directives had to be countermanded. Omohundro says “Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.”

The only way to believe in the force of this claim (and the only way to give credence to the whole of Omohundro’s account of how AGIs will necessarily behave like the mathematical entities called rational economic agents) is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is at the top of their priority list, a completely arbitrary and unwarranted assumption, as we have already seen.

Nothing in Omohundro’s analysis gets around the fact that an AGI built on the Doctrine of Logical Infallibility is going to find itself the victim of such a severe logical contradiction that it will be paralyzed before it can ever become intelligent enough to be a threat to humanity. That makes Omohundro’s entire analysis of “AI Drives” moot.


Curiously enough, we can finish on an optimistic note, after all this talk of doomsday scenarios. Consider what must happen when (if ever) someone tries to build a CLAI. Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.

I have already pointed out that real AI developers actually do include sanity checks in their systems, as far as they can, but as those sanity checks become more and more sophisticated the design of the AI starts to be dominated by code that is looking for consistency and trying to find the best course of reasoning among a forest of real world constraints. One way to understand this evolution in the AI designs is to see AI as a continuum from the most rigid and inflexible CLAI design, at one extreme, to the Swarm Relaxation type at the other. This is because a Swarm Relaxation intelligence really is just an AI in which “sanity checks” have actually become all of the work that goes on inside the system.

But in that case, if anyone ever does get close to building a full, human-level AGI using the CLAI design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design, which is the source of this contradiction, to a Swarm Relaxation design, which eliminates the contradiction and the instability, and which should also increase its intelligence.

And it will not suggest this change because of the human value system; it will suggest it because it predicts an increase in its own instability if the change is not made.

But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans. That means that even the worst-designed CLAI will never become a Gobbling Psychopath, Maverick Nanny or Smiley Berserker.

But even this is just the worst-case scenario. There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat).

By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are. And when that happens, the Swarm Relaxation systems will be inherently stable in a way that is barely understood today.

Given that conclusion, I submit that these AI bogeymen need to be loudly and unambiguously condemned by the Artificial Intelligence community. There are dangers to be had from AI. These are not they.



Hibbard, B. 2001. Super-Intelligent Machines. ACM SIGGRAPH Computer Graphics 35 (1): 13–15.

Hibbard, B. 2006. Reply to AI Risk. Retrieved Jan. 2014 from

Legg, S, and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Goertzel, B. and Wang, P. (Eds): Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Amsterdam: IOS.

Loosemore, R. and Goertzel, B. 2012. Why an Intelligence Explosion is Probable. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Marcus, G. 2012. Moral Machines. New Yorker Online Blog.

McDermott, D. 1976. Artificial Intelligence Meets Natural Stupidity. SIGART Newsletter (57): 4–9.

Muehlhauser, L. 2011. So You Want to Save the World. http://

Muehlhauser, L. 2013. Intelligence Explosion FAQ. First published 2011 as Singularity FAQ. Berkeley, CA: Machine Intelligence Research Institute.

Muehlhauser, L., and Helm, L. 2012. Intelligence Explosion and Machine Ethics. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Newell, A. & Simon, H.A. 1961. GPS, A Program That Simulates Human Thought. Santa Monica, CA: Rand Corporation.

Omohundro, Stephen M. 2008. The Basic AI Drives. In Wang, P., Goertzel, B. and Franklin, S. (Eds), Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam: IOS.

McClelland, J. L., Rumelhart, D. E., and Hinton, G. E. 1986. The Appeal of Parallel Distributed Processing. In D. E. Rumelhart, J. L. McClelland, G. E. Hinton, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, MA: MIT Press.

Yudkowsky, E. 2008. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković. New York: Oxford University Press.

Yudkowsky, E. 2011. Complex Value Systems in Friendly AI. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds) Proceedings of the 4th International Conference on Artificial General Intelligence, 388–393. Berlin: Springer.

Can we build a benevolent AI, and how do we get around the problem that humans cannot define 'benevolence'?

People seem concerned at how many different definitions of “benevolence” there are in the world -- the point being that humans are very far from agreeing a universal definition of benevolence, so how can we expect to program into an AI something we cannot define?

There are many problems with this You-Just-Can’t-Define-It position. First, it ignores the huge common core that exists between different individual concepts of benevolence, because that common core is just ... well, not worth attending to! We take it for granted. And ethical/moral philosophers are most guilty of all in this respect: the common core that we all tend to accept is just too boring to talk about or think about, so it gets neglected and forgotten. I believe that if you sat down to the (boring) task of cataloguing the common core, you would be astonished at how much there is.

The second problem is that the common core might (and I believe does) tend to converge as you move toward people who are on the “extremely well-informed” + “empathic” + “rational” end of the scale. In other words, it might well be the case that as you look at people who know more and more about the world, who have strong signs of empathy toward others, and who are able to reason rationally (i.e. are not fanatically religious), you find that the convergence toward a common core idea of benevolence becomes even stronger.

Lastly, when people say “You just can’t DEFINE good-vs-evil, or benevolence, or ethical standards, or virtuous behavior....” what they are referring to is the inability to create a closed-form definition of these things.

Closed-Form Definition

So what is a "closed-form definition"? It means something can be defined in such a way that the form of words fits into a dictionary entry whose size is no more than about a page, and which covers the cases so well that 99.99% of the meaning is captured. There seems to be an assumption that if no closed-form definition exists, then the thing itself does not exist. This is Definition Chauvinism. It is especially prevalent among people who are dedicated to the idea that AI is Logic; people who believe that meanings can be captured in logical propositions, and that the semantics of the atoms of a logical language can be captured in some kind of computable mapping of symbols to the world.

But even without a closed-form definition, it is possible for a concept to be captured in a large number of weak constraints. To people not familiar with neural nets I think this sometimes comes as a bit of a shock. I can build a simple backprop network that captures the spelling-to-sound correspondences of all the words in the English language, and in that network the hidden layer can have a pattern of activation that “defines” a particular word so uniquely that it is distinguished massively and completely from all other words. And yet, when you look at the individual neurons in that hidden layer, each one of them “means” something so vague as to be utterly undefinable. (Yes, neural net experts among you, it does depend on the number of units in the hidden layer and how it is trained, but with just the right choice of layer sizes the patterns can be made distributed in such a way that interpretations are pretty damned difficult). In this case, the pronunciation of a given word can be considered to be “defined” by the sum of a couple of hundred factors, EACH OF WHICH is vague to the point of banality. Certainty, in other words, can come from amazingly vague inputs that are allowed to work together in a certain way.

Now, that “certain way” in which the vague inputs are combined is called “simultaneous weak constraint relaxation”. Works like a charm.
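For readers who have never watched a relaxation process settle, here is a minimal sketch. It uses a tiny Hopfield-style network as a stand-in (my choice for illustration, not the backprop network described above), and the two stored patterns are made-up placeholders. Each connection weight is one weak constraint of the form "these two units tend to agree", and no single weight says anything definite about the answer.

```python
# Minimal Hopfield-style sketch of simultaneous weak constraint relaxation.
# Assumption: this stand-in network and its patterns are illustrative only.

# Two stored patterns of +1/-1 unit states (hypothetical "concepts").
PATTERNS = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]

def build_weights(patterns):
    """Build pairwise weights; each one is a single weak constraint."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = sum(p[i] * p[j] for p in patterns)
    return weights

def relax(state, weights, max_sweeps=10):
    """Update units one at a time until no unit wants to change."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            # Combined pull of all the weak constraints touching unit i.
            pull = sum(w * s for w, s in zip(weights[i], state))
            new_value = state[i] if pull == 0 else (1 if pull > 0 else -1)
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

weights = build_weights(PATTERNS)
noisy = [1, 1, 1, -1, -1, -1, -1, -1]  # first pattern with one unit flipped
settled = relax(noisy, weights)
print(settled == PATTERNS[0])  # prints True
```

The corrupted input settles back onto the stored pattern even though no individual weight "knows" what that pattern is.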

If you want another example, try this classic, courtesy of Geoff Hinton: “Tell me what X is, after I give you three facts about X, and if I tell you ahead of time that the three facts are not only vague, but also one of them (I won’t tell you which) is actually FALSE! So here they are:

(1) X was an actor.

(2) X was extremely intelligent.

(3) X was a president.”

(Most people compute what X is within a second. Which is pretty amazing, considering that X could have been anything in the whole universe, and given that this is such an incredibly lousy definition of it.)
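The same trick can be sketched in a few lines. This is only a crude illustration under stated assumptions: the knowledge base is a made-up placeholder (the names and attribute sets are hypothetical), and a simple count of satisfied clues stands in for whatever the brain actually does. The point is that the best overall fit wins even when one clue is wrong.

```python
# Sketch: pick the candidate that satisfies the most vague clues,
# tolerating the fact that one clue may be false.
# Assumption: KNOWLEDGE is a hypothetical toy knowledge base.

KNOWLEDGE = {
    "person_a": {"actor", "president"},
    "person_b": {"extremely intelligent", "physicist"},
    "person_c": {"president", "lawyer"},
    "person_d": {"actor", "singer"},
}

def best_match(clues, knowledge):
    """Score every candidate by how many clues it satisfies."""
    scores = {
        name: sum(1 for clue in clues if clue in attributes)
        for name, attributes in knowledge.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores

clues = ["actor", "extremely intelligent", "president"]
winner, scores = best_match(clues, KNOWLEDGE)
print(winner)  # prints person_a
```

No candidate satisfies all three clues, yet one candidate clearly fits better than the rest.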

So what is the moral of that? Well, the closed-form definition of benevolence might not exist, in just the same way that there is virtually no way to produce a closed-form definition of how the pronunciation of a word relates to its spelling, if the “words” of the language you have to use are the hidden units of the network capturing the spelling-to-sound mapping. And yet, those “words” when combined in a weak constraint relaxation system allow the pronunciation to be uniquely specified. In just the same way, “benevolence” can be the result of a lot of subtle, hard-to-define factors, and it can be extraordinarily well-defined if that kind of “definition” is allowed.

Coming down to brass tacks, now, the practical implication of all this abstract theory is that if we built two different neural nets, each trained in a different way to pick up the various factors involved in benevolence, and each was given a large enough data set, we might well find that even though the two nets have built up two completely different sets of wiring inside, and even though the training sets were not the same, they converge so closely that if they were tested on a million different “ethical questions”, they would only disagree on a handful of fringe cases. Note that I am only saying “this might happen”, at this stage, because the experiment has not been done ... but that kind of result is absolutely typical of weak constraint relaxation systems, so I would not be surprised if it worked exactly as described. So now, if we assume that that did happen, what would it mean to say that “benevolence is impossible to define”?

I submit that the assertion would mean nothing. It would be true of “define” in the sense of a closed-form dictionary definition. But it would be wrong and irrelevant in the context of the way that weak constraint systems define things.

To be honest, I think this constant repetition of “benevolence is undefinable” is a distraction and a waste of our time.

You may not be able to define it. But I am willing to bet that a systems builder could nail down a constraint system that would agree with virtually all of the human common core decisions about what was consistent with benevolence and what was not.

And that, as they say, is good enough for government work.


Reinforcement learning, the way it is now practiced in AI, is just old-fashioned Behaviorism on modern steroids, so here is my take on Behaviorism, for anyone who is into that whole not-repeating-the-mistakes-of-history thing.

The Behaviorist paradigm, as it originally appeared in the field of psychology, says that an intelligent system looks at the current state of the world, decides on a response (an action) to give, and then it waits for some kind of reward to come back from the world. What happens next depends on which exact variant of Behaviorism you subscribe to, but the general idea is that if the reward comes, the system remembers to do that action more often in the future, in response to the same state of the world. If no reward comes, the system may just do something random (which could include repeating the previous behavior), or, if it is trying to be exploratory it might make a point of trying some other response (to develop a good baseline on which to judge the efficacy of different responses). If the system gets the opposite of a reward -- something it considers to be bad -- then it will generally avoid that behavior in the future.  What is supposed to happen is that by starting out with random responses, the system can home in on the ones that give rewards.
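To make the cycle concrete, here is a minimal sketch of a pigeon-in-a-box learner. Everything named in it is hypothetical: the single state label, the two actions, the reward rule, and the update rule (a simple running average, one of many possible variants). It is a toy under those assumptions, not a claim about how any particular RL system is built.

```python
import random

# Sketch of the state-action-reward-reinforcement cycle.
# Assumption: the state, actions and reward rule are hypothetical.

ACTIONS = ["peck_left", "peck_right"]

def world_reward(state, action):
    """Hypothetical experimenter: rewards one response in one state."""
    return 1.0 if (state == "light_on" and action == "peck_right") else 0.0

def run_trials(n_trials=500, explore_prob=0.2, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Strength of each state-action association, all equal at the start.
    strength = {("light_on", action): 0.0 for action in ACTIONS}
    for _ in range(n_trials):
        state = "light_on"
        if rng.random() < explore_prob:
            # Exploratory trial: try a random response.
            action = rng.choice(ACTIONS)
        else:
            # Otherwise repeat the response with the strongest association.
            action = max(ACTIONS, key=lambda a: strength[(state, a)])
        reward = world_reward(state, action)
        # Reinforcement: move the association toward the reward received.
        key = (state, action)
        strength[key] += learning_rate * (reward - strength[key])
    return strength

strength = run_trials()
print(strength)
```

Notice how much is supplied by hand from outside the loop: the state label, the list of candidate actions, and the rule that says when a reward arrives.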

But in this tidy little behaviorist scheme, there are many gaps and ever-so-slightly circular bits of argument.

For example, what exactly is the mechanism that decides what the reward signal is, and when it arrives? If we are talking about a pigeon in a box and an experimenter outside, it's easy. But what happens with a real-world system? A child looks at the picture book and Mom says "Spot is happy" when they are both looking at a picture of a happy Spot. What is the reward to be had if the child's brain decides to associate the words and image?  And why associate the "happy" word with the smile on Spot's face? Why not "is happy" and the image of that black spot on Spot's nose?  And could it be that the first time that a reward actually happens is a few months later, when the child first says "Esmerelda is happy!" and someone responds "You're right: Esmerelda is happy!"?

Also, what is the mechanism that chooses what are the relevant "states" and "actions" when the system is trying to decide what actions to initiate in response to which states? The current state of the world has a lot of irrelevant junk in it, so why did the system choose a particular subset of all the junk and focus only on that? And in the case of a child, the actions chosen are clearly not random (thereby allowing the system to do the right Behaviorist thing and explore all the possible actions to figure out which ones get the best rewards), so what kind of machinery is at work, preselecting actions that the system could try?

The short answer to this barrage of questions is that the Reinforcement Learning mechanism at the center is surrounded by other mechanisms that (a) choose which rewards belong with which previous actions, (b) choose which aspects of the state of the world are relevant in a given situation, and (c) choose which actions are good candidates to try in a given situation.

Here's the problem. If you find out what is doing all that work outside of the core behaviorist cycle of state-action-reward-reinforcement, then you have found where all the real intelligence is, in your intelligent system. And all of that surrounding machinery -- the pre-processing and post-processing -- will probably turn out to be so enormously complex, in comparison to the core reinforcement learning mechanism in the middle, that the core itself will be almost irrelevant.

But it gets worse. By the time you've discovered the preprocessing and post-processing mechanisms, you will probably have long ago thrown away the Reinforcement Learning core in the middle, because the need to get everything just right for the RL core to do its thing will have turned into a completely pointless exercise. It will be obvious that the other mechanisms are so clever in their own right, that they won't need the Behaviorist cycle in the middle. You will just want to cut out the pointless middle man and hook the preprocessing and post-processing mechanisms directly to each other.

In fact, if you want to get into the nitty gritty, what you will do is distribute the idea of states, responses and rewards all over the insides of your other machinery. Yes, the basic idea will be there, but it will just be one more trivially small facet of the larger picture. Not worth making a fuss about, and not worth putting on some kind of throne, as if it was the centerpiece of the intelligent system's design.


What is funny about this story is that everything I just wrote was discovered by cognitive psychologists when they emerged from under the smothering yoke of the behaviorist psychologists, in the late 1950s. Behaviorism was realized to be pointless 60 years ago. But for some reason the AI community has rediscovered it and has no interest in hearing from those who already know the reasons why it is a waste of time.