
Can a future artificial intelligence be made safe?  And is AI safety an easy problem, or an extremely hard one?

It turns out that these questions are intimately bound up with a seemingly unrelated question -- whether future AIs will use "reinforcement learning" as their basic control mechanism.  In this essay I will argue that these ideas should not be connected, because reinforcement learning will never be used as the main control mechanism for future AIs.

NOTE   If you already know what reinforcement learning is, please be assured that this post is about the use of RL at the global-control level of an AI, not the sort of RL that appears in relatively small, local circuits or adaptive feedback loops.  There is a lot of confusion about this (I have had people arguing vehemently that RL has been applied here, there and everywhere with great success, so obviously I must be completely clueless).  RL works in limited situations where the reward signal is clear and the control policies are short(ish) and not too numerous.  But when people talk about AI safety issues, RL is assumed at the global level, and at that level reward signals are impossible to find, control policies are gigantic (involving actions that take years to connect to consequences), and the system tends to have Jupiter-size resource requirements.

Feeding Frenzy

Right now there is a feeding frenzy (Jaan Tallinn has called it an "ecosystem") in which many groups and individuals are trying to persuade charities to give them money so they can do "research" that is supposed to reduce the dangers of AI.  "Give us money," they say, "and we will save you from the Terminator."

I have a problem with this.  Most of the fish in this ecosystem are selling horror stories about what could happen when AIs become truly intelligent, and most of the horror stories involve the idea that even if we go to enormous lengths to build a safe AI, and even if we think we have succeeded, the AI will nevertheless wriggle out from under the safety net and become a psychopathic monster.

But if you look closely you will find that the horror stories are based on a set of bad assumptions about how the supposed AIs of the future will be constructed.  Assumptions that are so silly that any AI built in that way would find it impossible to actually be intelligent at all.  And out of all the silly assumptions, the worst is the assumption that AIs will be controlled by something called "reinforcement learning" (frequently abbreviated "RL").

I want to set this essay in the context of some important comments about AI safety made by Holden Karnofsky.  This is Karnofsky's take on one of the "challenges" we face in ensuring that AI systems do not become dangerous:

Going into the details of these challenges is beyond the scope of this post, but to give a sense for non-technical readers of what a relevant challenge might look like, I will elaborate briefly on one challenge. A reinforcement learning system is designed to learn to behave in a way that maximizes a quantitative “reward” signal that it receives periodically from its environment - for example, DeepMind’s Atari player is a reinforcement learning system that learns to choose controller inputs (its behavior) in order to maximize the game score (which the system receives as “reward”), and this produces very good play on many Atari games. However, if a future reinforcement learning system’s inputs and behaviors are not constrained to a video game, and if the system is good enough at learning, a new solution could become available: the system could maximize rewards by directly modifying its reward “sensor” to always report the maximum possible reward, and by avoiding being shut down or modified back for as long as possible. This behavior is a formally correct solution to the reinforcement learning problem, but it is probably not the desired behavior. And this behavior might not emerge until a system became quite sophisticated and had access to a lot of real-world data (enough to find and execute on this strategy), so a system could appear “safe” based on testing and turn out to be problematic when deployed in a higher-stakes setting. 

There are enough things wrong with what is said in this passage to fill a book, but my focus here is on the sudden jump from DeepMind's Atari game playing program, to a fully intelligent AI capable of outwitting humanity.

What Reinforcement Learning Is

Let's take a look at what "reinforcement learning" (RL) actually is.  Back in the early days of Behaviorism (which became the dominant style of research in psychology after about the 1920s, until it was discarded in the late 1950s) some researchers decided to focus on simple experiments like putting a rat in a cage with a lever and a food-pellet dispenser, and then connecting these two things in such a way that if the rat pressed the lever, a pellet was dispensed.  Would the rat notice the connection between lever and pellet?  Of course it did, and pretty soon the rat would be spending inordinate amounts of time just pressing the lever, whether food came out or not.

What the researchers did next was suggest that the only thing of importance "inside" the rat's mind was a set of connections between behaviors (e.g. pressing the lever), stimuli (e.g. a visual image of the lever) and rewards (e.g. getting a food pellet).  Critical to all of this was the idea that if a behavior was followed by a reward, a direct connection between the two would be strengthened in such a way that future behavior choices would be influenced by that strong connection.

That is reinforcement learning: you "reinforce" a behavior if it is followed by a reward.  What these researchers really wanted to claim was that this mechanism could explain almost everything important that was happening inside the rat's mind.  And, with a few judicious extensions, these Behaviorist researchers started arguing that the same type of explanation would work for the behavior of all "thinking" creatures.

That means you.

I want to point out something very important buried in this idea.  The connection between the reward and action is basically a single wire with a strength number on it.  The rat does not weigh up a lot of pros and cons; it doesn't think about anything, doesn't engage in any problem solving or planning, doesn't contemplate the whole idea of food, or the motivations of the idiot humans outside the cage.  The rat isn't capable of any of that: it just goes bang! lever-press, bang! food-pellet-appears, bang! let's-increase-the-strength-of-that-connection.
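
To make the "single wire with a strength number" picture concrete, here is a minimal sketch in Python.  It is a toy under stated assumptions, not a reconstruction of any Behaviorist model: the behavior names, the learning rate and the proportional-choice rule are all hypothetical, chosen only to show a learner that does nothing except bump a number when a reward follows a behavior.

```python
import random


def train_rat(trials=200, learning_rate=0.1, seed=0):
    """Toy learner: one connection strength per behavior, nudged up on reward."""
    rng = random.Random(seed)
    # Hypothetical behaviors; in this toy cage only the lever yields a pellet.
    strength = {"press_lever": 1.0, "groom": 1.0, "wander": 1.0}
    for _ in range(trials):
        behaviors = list(strength)
        weights = [strength[b] for b in behaviors]
        # The behavior is picked in proportion to its strength: no planning,
        # no model of the cage, no idea what a lever is.
        choice = rng.choices(behaviors, weights=weights, k=1)[0]
        reward = 1.0 if choice == "press_lever" else 0.0
        # bang! strengthen the connection that preceded the reward.
        strength[choice] += learning_rate * reward
    return strength


if __name__ == "__main__":
    print(train_rat())
```

The only state that ever changes is the dictionary of strengths: the lever entry grows, the unrewarded entries stay where they started, and nothing resembling analysis happens anywhere in the loop.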

The Demise of Reinforcement Learning

Now fast forward to the late 1950s/early 1960s.  Cognitive psychologists are finally sick and tired of the Behaviorist programme. Behaviorism might be able to explain the rat-pellet-lever situation, but for anything more complex, it sucks.  Behaviorists spent decades engaging in all kinds of mental contortions to argue that they would eventually be able to explain human behavior without using much more than those direct connections between stimuli, behaviors and rewards ... but by about 1960 the psychology community had stopped believing that nonsense, because it never worked.

Is it possible to summarize the main reason why they rejected it?  Sure.  For one thing, almost all realistic behaviors involve rewards that arrive long after the behaviors that cause them, so there is a gigantic problem with deciding which behaviors should be reinforced, for a given reward.  Suppose you spend years going to college, enduring hard work and a very low-income lifestyle.  Then years later you get a good job and pay off your college loan.  Did you do all of that because, like the rat, you happened to try the going-to-college-and-suffering-poverty behavior many times in your life, and the first time you tried it you got a good-job-that-paid-off-your-loan reward? Was it the case that you noticed the connection between reward and behavior, and your brain automatically reinforced the connection between those two?
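
The problem of deciding which behavior to reinforce can be shown with a deliberately naive sketch.  The rule below (credit only the behavior that immediately preceded the reward) and the behavior names are hypothetical, picked for illustration; it is not a claim about how any particular RL algorithm assigns credit.

```python
def reinforce_last_action(history, reward, strength, learning_rate=0.1):
    """Naive rule: credit only the behavior that immediately preceded the reward."""
    last = history[-1]
    strength[last] = strength.get(last, 0.0) + learning_rate * reward
    return strength


# Hypothetical multi-year behavior sequence ending in a single reward.
history = ["enroll_in_college", "study", "live_cheaply", "graduate", "open_paycheck"]
strength = reinforce_last_action(history, reward=1.0, strength={})

print(strength)
```

Under this rule the only connection that gets strengthened is the one for opening the paycheck; the years of behavior that actually produced the reward receive nothing.  Any fix has to say how credit travels back across the gap, and that is where the extra machinery starts to appear.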

A More Realistic Example

Or, on a smaller scale, consider what you are doing when you sit in the library with a mathematics text, trying to solve equations.  What reward are you seeking?  A little dopamine hit, perhaps?  (That is the modern story that neuroscientists sell, by the way).

Well, maybe, but let's try to stay focused on the precise idea that the Behaviorists were pushing:  that original rat was NOT supposed to do lots of thinking and analysis and imagining when it decided to push the lever, it was supposed to push the lever by chance, and then notice that a reward came.

The whole point of the RL mechanism is that the intelligent system doesn't engage in a huge, complex, structured analysis of the situation, when it tries to decide what to do.  If it did, the explanation for why the creature did what it did would be in that huge, complex, structured analysis itself, after all!  Instead, the RL people want you to believe that the RL mechanism did all the heavy lifting. That idea is critical to RL.  The rat simply tries a behavior at random - with no understanding of its meaning - and it is only because a reward then arrives, that the rat decides that in the future it will press the lever again.

So, going back to you, sitting in the library doing your mathematics homework.  Did you solve that last equation because you had a previous episode where you just happened to try the behavior of solving that exact same equation, and got a dopamine hit (which felt good)?  The RL theorist needs you to believe that you really did.  The RL theorist would say that you somehow did a search through all the quintillions of possible actions you could take, sitting there in front of an equation that requires L'Hôpital's Rule, and in spite of the fact that the list of possible actions included such possibilities as jumping-on-the-table-and-singing-I-am-the-walrus, or driving-home-to-get-a-marmite-sandwich, or asking-the-librarian-to-go-for-a-margarita, you decide instead that the thing that would give you the best dopamine hit right now would be applying L'Hôpital's Rule to the equation.

I hope I have made it clear that there is something profoundly disturbing about the RL/Behaviorist explanation for what is happening in a situation like this.

Whenever the Behaviorists tried to find arguments to explain scenarios like that, they always added machinery onto the basic RL mechanism.  "Okay," they would say, "so it's true the basic forms of RL don't work ... but if you add some more stuff onto the basic mechanism, like maybe the human keeps a few records of what they did, and they occasionally scan through the records and boost a few reinforcement connections here and there, and ... blah blah blah...".

The trouble with this kind of extra machinery is that after a while, the tail begins to wag the dog. That is, all the extra machinery is where all the action is.  And that extra machinery is most emphatically not designed as a kind of RL mechanism, itself.  In theory, there is still a tiny bit of reinforcement learning somewhere deep down inside all the extra machinery, but eventually people just said "What's the point?"  Why even bother to use the RL language anymore?  The RL, if it is there at all, is pointless.  A lot of parameter values get changed in complex ways, inside all the extra machinery, so why even bother to mention the one parameter among thousands, that is supposed to be RL, when it is obvious that the structure of that extra machinery is what matters?

That "extra machinery" is what eventually became all the many and varied mechanisms researched by cognitive psychologists.  Their understanding of how minds work is not that reinforcement learning plus extra machinery can be used to explain cognition -- they would simply assert that reinforcement learning does not exist as a way to understand cognition.

Take home message:  RL has become an irrelevance in explanations of human cognition.

Artificial Intelligence and RL

Now let's get back to Holden Karnofsky's comment, above.

He points out that there exists a deep learning program that can learn to play arcade games, and it uses RL.

(I should point out that his chosen example was not by any means pure RL.  This software already had other mechanisms in it, so the slippery slope toward RL+extra machinery has already begun.)

Sadly, DeepMind’s Atari player is nothing more sophisticated than a rat.  It is so mind-bogglingly simple that it actually can be controlled by RL.  No, I take that back, it is unfair to call it a rat. Rats are way smarter than this program, so it would be better to compare it to an amoeba, or an insect.

This is typical of claims that RL is a good foundation for AI.  If you start scanning the literature you will find that all the cited cases use systems that are so trivial that RL really does have a chance of working.

(Here is one example, picked almost at random:  Rivest, Bengio and Kalaska.  At first it seems that they are talking about deriving an RL system from what is known about the brain.  But after a lot of preamble they give us instead just an RL program that does the amazing task of ... controlling a double-jointed pendulum.  The same story is repeated in endless AI papers about reinforcement learning:  big claims about RL in the brain, but at the end of the day the algorithm is proved by applying it to a trivially simple system.)

But Karnofsky wants to go beyond the trivial Atari player:  he wants to ask what happens when the software is expanded and augmented.  In his words, "[what] if a future reinforcement learning system’s inputs and behaviors are not constrained to a video game, and if the system is good enough at learning..."?

That is where everything goes off the rails.

In practice there is not and never has been any such thing as augmenting and expanding an RL system until it becomes much more generally intelligent.  We are asked to imagine that this "system [might become] quite sophisticated and [get] access to a lot of real-world data (enough to find and execute on this strategy)...".  In other words, we are asked to buy the idea that there might be an RL system that is just as intelligent as a human being (smarter, in fact, since we are supposed to be in danger from its devious plans), but which is still driven by a reinforcement learning mechanism.

I see two problems here.  One is that this scenario ignores the fact that after three decades of trying to get RL to work as a theory of human cognition, the result has been nothing.  That period in the history of psychology was almost universally condemned as a complete write-off.  As far as we know it simply does not scale up.

But the second point is even worse:  not only did psychologists fail to get it to work as a theory of human cognition, AI researchers also failed to build one that works for anything approaching a real-world task.  What they have achieved is RL systems that do very tiny, narrow-AI tasks.

The textbooks might describe RL as if it means something, but they conspicuously neglect to mention that, actually, all the talking, thinking, development and implementation work since at least the 1960s has failed to result in an RL system that could actually control meaningful real-world behavior.  I do not know if AI researchers have been trying to do this and failing, or if they have not been trying at all (on the grounds that they have no idea how to even start), but what I do know is that they have published no examples.

The Best Reinforcement Learning in the World?

To give a flavor of how bad this is, consider that in the 2008 Second Annual Reinforcement Learning Competition, the AI systems were supposed to compete in categories like:

Mountain Car: Perhaps the most well-known reinforcement learning benchmark task, in which an agent must learn how to drive an underpowered car up a steep mountain road.

Tetris: The hugely popular video game, in which four-block shapes must be manipulated to form complete lines when they fall.

Helicopter Hovering: A simulator, based on the work of Andrew Ng and collaborators, which requires an agent to learn to control a hovering helicopter.

Keepaway: A challenging task, based on the RoboCup soccer simulator, that requires a team of three robots to maintain possession of the ball while two other robots attempt to steal it.

As of the most recent RL competition, little has changed.  They are still competing to see whose RL algorithm can best learn how to keep a helicopter stable:  an insect-level intelligence task.  Whether they are succeeding in getting those helicopters to run smoothly or not is beside the point -- the point is that helicopter hovering behavior is a fundamentally shallow task, at the insect level.

Will RL Ever Become Superintelligent?

I suppose that someone without a technical background might look at all of the above and say "Well, even so ... perhaps we are only in the early stages of RL development, and perhaps any minute now someone will crack the problem and create an RL type of AI that becomes superintelligent.  You can't be sure that won't happen."

My response is as follows.  All of the evidence is that the resource requirements for RL explode exponentially when you try to scale it up.  That means:

  • If you want to use RL to learn how to control a stick balancing on end, you will need an Arduino.
  • If you want to use RL to learn how to control a model helicopter, you will need a PC (or maybe a Beaglebone?  Arduinos are improving, so maybe...).
  • If you want to use RL to learn how to play Go, or Atari games, you will need the Google Brain (tens of thousands of cores).
  • If you want to use RL to learn how to control an artificial rat, which can run around and get by in the real world, you will need all the processing power currently available on this planet (and then some).
  • If you want to use RL to learn how to cook a meal, you will need all the computing power in the local galactic cluster.
  • If you want to use RL to learn how to be as smart as Winnie the Pooh (a bear, I will remind you, of very little brain), you will need to convert every molecule in the universe into a computer.

That is what exponential resource requirements are all about.
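
The arithmetic behind "exponential" can be sketched in a few lines.  The assumptions here are mine and purely illustrative: a learner that keeps one table entry per distinct state, and a world described by a number of variables that each take ten possible values.  The hardware comparisons in the list above are not derived from this calculation; the sketch shows only the shape of the growth.

```python
def table_entries(values_per_variable, num_variables):
    """Number of distinct states a lookup-table learner would need to cover."""
    return values_per_variable ** num_variables


if __name__ == "__main__":
    # Hypothetical discretization: each state variable takes 10 values.
    for num_variables in (2, 4, 8, 16, 32):
        print(num_variables, table_entries(10, num_variables))
```

Each doubling of the number of variables squares the size of the table, so a description only slightly richer than a balancing stick already needs more entries than any machine could visit, let alone learn from.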


Reinforcement learning first came to prominence in the second decade of the 20th century.  But after nearly 100 years of experiments, mathematical theories and computer simulations, and after being written into the standard AI textbooks (and after being widely assumed as the theory of how future Artificial General Intelligence systems will probably be controlled), it seems that the best actual RL algorithms can barely learn how to control a model helicopter, or play a board game.

And yet there are dozens - if not hundreds - of people now inhabiting the "existential risk ecosystem", who claim to be so sure of how future AGI systems will be controlled, that they are taking a large stream of donated money and promising to do research on how this failed control paradigm can be modified so it will not turn around and kill us.

If you ask anyone in that ecosystem what exactly they see as the main dangers of future AGI, they quote - again and again and again - scenarios in which an AGI is superintelligent and dangerously psychopathic, and controlled by Reinforcement Learning.

And yet somehow, more and more people are throwing money at these people who claim they will save us from AIs that are driven by reinforcement learning.

Go figure.