

Introduction to Reinforcement Learning | Deep Reinforcement Learning for Hackers (Part 0)

Machine Learning, Reinforcement Learning, Deep Learning, Python | 4 min read


The best way to understand what Reinforcement Learning is, is to watch this video:

Remember the first time you got behind the wheel of a car? Your dad, mom or driving instructor was next to you, waiting for you to mess something up. You had a clear goal - make a couple of turns and get to the supermarket for ice cream. The task was infinitely more fun if you had to learn to drive stick. Ah, good times. Too bad that your kids might never experience that. More on that later.

What is Reinforcement Learning?

Reinforcement learning (RL) is learning what to do, given a situation and a set of possible actions to choose from, in order to maximize a reward. The learner, which we will call the agent, is not told what to do; it must discover this by itself through interacting with the environment. The goal is for the agent to choose its actions in such a way that the cumulative reward is maximized. So, choosing the best reward now might not be the best decision in the long run. That is, greedy approaches might not be optimal.

Back to you, behind the wheel with the engine running, seatbelt properly fastened, adrenaline pumping and the latest Fast & Furious replaying in your mind - you have a good feeling about this. The passenger next to you does not look that scared, after all…

How does all of this relate to RL? Let’s try to map your situation to an RL problem. Driving is really complex, so for your first lesson, your instructor will do everything except turning the wheel. The environment is the world around you and the agent is you. The state of the environment (the situation) can be defined by the position of your car, surrounding cars, pedestrians, upcoming crossroads etc. You have 3 possible actions to choose from - turn left, keep straight and turn right. The reward is well defined - you will eat ice cream if you are able to reach the supermarket. Your instructor will give you intermediate rewards based on your performance. At each step (let’s say once every second), you will have to make a decision - turn left, turn right or continue straight ahead. Whether or not the ice cream is happening is mostly up to you.

Let’s summarize what we’ve learned so far. We have an agent and an environment. The environment gives the agent a state. The agent chooses an action and receives a reward from the environment along with the new state. This learning process continues until the goal is achieved or some other condition is met.
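The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not a real driving simulator: the `CorridorEnv` class, its reward values (-1 per step, +10 at the goal) and the two agents are all made up for this example.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (step left) or +1 (step right); walls clamp the move.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 10 if done else -1  # assumed values: -1 per step, +10 at the goal
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run the agent-environment loop once and return the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)           # the agent chooses an action
        state, reward, done = env.step(action)  # the environment answers
        total_reward += reward
        if done:                                # goal reached
            break
    return total_reward


def random_agent(state):
    return random.choice([-1, 1])


def always_right(state):
    return 1


print(run_episode(CorridorEnv(), always_right))  # 7: three steps at -1, then +10
print(run_episode(CorridorEnv(), random_agent))  # varies, never more than 7
```

The agent here is just a function from state to action. Swapping `random_agent` for something that learns from the rewards is what the rest of RL is about.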


Examples of Reinforcement Learning

Let’s have a look at some example applications of RL:

Cart-Pole Balancing


  • Goal - Balance the pole on top of a moving cart

  • State - angle, angular speed, position, horizontal velocity

  • Actions - horizontal force to the cart

  • Reward - 1 at each time step if the pole is upright

Atari Games


  • Goal - Beat the game with the highest score

  • State - Raw pixels of the game screen

  • Actions - Up, Down, Left, Right etc

  • Reward - Score provided by the game



  • Goal - Eliminate all opponents

  • State - Raw pixels of the game screen

  • Actions - Up, Down, Left, Right etc

  • Reward - Positive when eliminating an opponent, negative when the agent is eliminated

Training robots for Bin Packing


  • Goal - Pick a device from a box and put it into a container

  • State - Raw pixels of the real world

  • Actions - Possible actions of the robot

  • Reward - Positive when placing a device successfully, negative otherwise

You started thinking that all RL researchers are failed pro-gamers, didn’t you? In practice, that doesn’t seem to be the case. For example, somewhat “meta” applications include “Designing Neural Network Architectures using Reinforcement Learning”.

Formalizing the RL problem

Markov Decision Processes (MDPs) are a mathematical formulation of the RL problem. They satisfy the Markov property:

Markov property - the current state completely represents the state of the environment (world). That is, the future depends only on the present.

An MDP can be defined by the tuple $(S, A, R, P, \gamma)$ where:

  • $S$ - set of possible states
  • $A$ - set of possible actions
  • $R$ - probability distribution of the reward given a (state, action) pair
  • $P$ - probability distribution over how likely any of the states is to be the new state, given a (state, action) pair. Also known as the transition probability.
  • $\gamma$ - reward discount factor
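To make the five ingredients concrete, here is a minimal sketch of an MDP as a plain Python data structure. The `MDP` class, the two states ("home", "store"), the two actions ("drive", "wait") and all the probabilities are assumptions chosen for illustration. They do not come from any library.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list       # S
    actions: list      # A
    rewards: dict      # R: (state, action) -> {reward: probability}
    transitions: dict  # P: (state, action) -> {next_state: probability}
    gamma: float       # discount factor

    def is_valid(self, tol=1e-9):
        """Check that R and P are proper distributions for every (state, action)."""
        for table in (self.rewards, self.transitions):
            for s in self.states:
                for a in self.actions:
                    if abs(sum(table[(s, a)].values()) - 1.0) > tol:
                        return False
        return True


# Hypothetical two-state world: driving from home reaches the store 90% of the time.
mdp = MDP(
    states=["home", "store"],
    actions=["drive", "wait"],
    rewards={
        ("home", "drive"): {-1: 1.0},
        ("home", "wait"): {-1: 1.0},
        ("store", "drive"): {0: 1.0},
        ("store", "wait"): {0: 1.0},
    },
    transitions={
        ("home", "drive"): {"store": 0.9, "home": 0.1},
        ("home", "wait"): {"home": 1.0},
        ("store", "drive"): {"store": 1.0},
        ("store", "wait"): {"store": 1.0},
    },
    gamma=0.9,
)

print(mdp.is_valid())  # True
```

Dictionaries are enough for tiny problems like this one. For large state spaces the tables would not fit in memory, which is one reason to reach for function approximation later.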

How MDPs work

At the initial time step $t=0$, the environment chooses initial state $s_0 \sim p(s_0)$. That state is used as a seed state for the following loop:

for $t=0$ until done:

  • The agent selects action $a_t$
  • The environment chooses reward $r_t \sim R(\cdot \mid s_t, a_t)$ and next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  • The agent receives reward $r_t$ and next state $s_{t+1}$

More formally, the environment does not choose, it samples from the reward and transition probability distributions.
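Here is a rough sketch of that sampling loop. The tiny two-state MDP (tables `P0`, `P` and `R`), the `sample` and `rollout` helpers and the `always_drive` policy are hypothetical names invented for this example. Only `random.Random.choices` comes from the standard library.

```python
import random

# Hypothetical MDP: start at "home", try to reach the terminal state "store".
P0 = {"home": 1.0}  # initial state distribution p(s_0)
P = {               # transition distributions P(. | s, a)
    ("home", "drive"): {"store": 0.9, "home": 0.1},
    ("home", "wait"): {"home": 1.0},
}
R = {               # reward distributions R(. | s, a)
    ("home", "drive"): {-1: 1.0},
    ("home", "wait"): {-1: 1.0},
}


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(policy, rng, terminal="store", max_steps=50):
    """Sample one trajectory of (s_t, a_t, r_t, s_{t+1}) tuples."""
    state = sample(P0, rng)  # s_0 ~ p(s_0)
    trajectory = []
    for _ in range(max_steps):
        if state == terminal:
            break
        action = policy(state)                        # agent selects a_t
        reward = sample(R[(state, action)], rng)      # r_t ~ R(. | s_t, a_t)
        next_state = sample(P[(state, action)], rng)  # s_{t+1} ~ P(. | s_t, a_t)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


def always_drive(state):
    return "drive"


for transition in rollout(always_drive, random.Random(0)):
    print(transition)
```

Passing in a seeded `random.Random` keeps the rollout reproducible, which helps a lot when debugging stochastic environments.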

What is the objective of all this? Find a function $\pi^*$, known as the optimal policy, that maximizes the cumulative discounted reward (in expectation, since rewards and transitions are sampled):

$$\sum_{t \geq 0} \gamma^t r_t$$

A policy $\pi$ is a function that maps a state $s$ to the action $a$ that our agent believes is best in that state.
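A quick sketch of how the discounted sum behaves. The `discounted_return` helper and the two reward sequences are made up for illustration. The point is that a "greedy" sequence with a big reward up front can still lose to a patient one.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical reward sequences from two different policies.
greedy = [5, 0, 0]      # grab a small reward immediately, nothing afterwards
patient = [-1, -1, 10]  # pay a small cost first, collect the big reward later

print(discounted_return(patient, 1.0))  # 8.0 (no discounting)
print(discounted_return(patient, 0.5))  # 1.0 = -1 - 0.5 + 2.5

# With gamma = 0.9 the patient sequence still beats the greedy one (about 6.2 vs 5).
print(discounted_return(patient, 0.9) > discounted_return(greedy, 0.9))  # True
```

Notice how a smaller $\gamma$ shrinks the value of the late reward: with $\gamma = 0.5$ the patient sequence is worth only 1.0 and the greedy one wins. The discount factor controls how far into the future the agent cares to look.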

Your first MDP

Let’s get back to you, cruising through the neighborhood, dreaming about that delicious ice cream. Here is one possible situation, described as an MDP:


Your objective is to get to the bitten ice cream on a stick, without meeting a zombie. The reasoning behind the new design is based on solid data science - people seem to give a crazy amount of cash for a bitten fruit and everybody knows that candy is much tastier. Putting it together you get “the all-new ice cream”. And honestly, it wouldn’t be cool to omit the zombies, so there you have it.

The state is fully described by the grid. At the first step you have the following actions:


Crashing into a zombie (Carmageddon anyone?) gives a reward of -100 points, taking an action gives -1 point and eating the ice cream gives you a crazy 1000 points. Why -1 point for taking an action? Well, the store might close anytime now, so you have to get there as soon as possible.
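The reward scheme can be sketched as a tiny grid world. Keep in mind that the 3x3 layout below, the four movement actions and the `step` and `run` helpers are assumptions made for this example. They are not the exact grid from the figure. The sketch also assumes that hitting a zombie or the ice cream ends the episode and that the terminal reward replaces the -1 step cost.

```python
# Hypothetical layout: S = start, Z = zombie, I = ice cream, . = empty road
GRID = [
    "S.Z",
    ".Z.",
    "..I",
]

# Assumed action set: one cell per move in each of the four directions.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(position, action):
    """Apply one action and return (new_position, reward, done)."""
    row, col = position
    d_row, d_col = MOVES[action]
    # Clamp to the grid so the car cannot drive off the map.
    new_row = min(max(row + d_row, 0), len(GRID) - 1)
    new_col = min(max(col + d_col, 0), len(GRID[0]) - 1)
    cell = GRID[new_row][new_col]
    if cell == "Z":
        return (new_row, new_col), -100, True  # crashed into a zombie
    if cell == "I":
        return (new_row, new_col), 1000, True  # ice cream!
    return (new_row, new_col), -1, False       # ordinary move costs 1 point


def run(actions, start=(0, 0)):
    """Follow a fixed list of actions and return the total (undiscounted) reward."""
    position, total = start, 0
    for action in actions:
        position, reward, done = step(position, action)
        total += reward
        if done:
            break
    return total


print(run(["down", "down", "right", "right"]))  # 997: three moves, then ice cream
print(run(["right", "right"]))                  # -101: one move, then a zombie
```

Because every move costs a point, the shortest zombie-free route gets the highest total. That is exactly the "get there before the store closes" pressure described above.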

Congrats, you just created your first MDP. But how do we solve the problem? Stay tuned for that :)

Oops, almost forgot, your reward for reading so far:
