Curiousily

Introduction to Reinforcement Learning | Deep Reinforcement Learning for Hackers (Part 0)

04.12.2017 — Machine Learning, Reinforcement Learning, Deep Learning, Python — 4 min read

The best way to understand what Reinforcement Learning is to watch this video:

Remember the first time you went behind the wheel of a car? Your dad, mom or driving instructor was next to you, waiting for you to mess something up. You had a clear goal - make a couple of turns and get to the supermarket for ice cream. The task was infinitely more fun if you had to learn to drive stick. Ah, good times. Too bad that your kids might never experience that. More on that later.

What is Reinforcement Learning?

Reinforcement learning (RL) is learning what to do, given a situation and a set of possible actions to choose from, in order to maximize a reward. The learner, which we will call agent, is not told what to do, he must discover this by himself through interacting with the environment. The goal is to choose its actions in such a way that the cumulative reward is maximized. So, choosing the best reward now, might not be the best decision, in the long run. That is greedy approaches might not be optimal.

Back to you, behind the wheel with running engine, properly strapped seatbelt, adrenaline pumping and rerunning the latest Fast & Furious through your mind - you have a good feeling about this, the passenger next to you does not look that scared, after all…

How all of this relates to RL? Let’s try to map your situation to an RL problem. Driving is really complex, so for your first lesson, your instructor will do everything, except turning the wheel. The environment is the nature itself and the agent is you. The state of the environment (situation) can be defined by the position of your car, surrounding cars, pedestrians, upcoming crossroads etc. You have 3 possible actions to choose from - turn left, keep straight and turn right. The reward is well defined - you will eat ice cream if you are able to reach the supermarket. Your instructor will give your intermediate rewards based on your performance. At each step (let’s say once every second), you will have to make a decision - turn left, right or continue straight ahead. Whether or not the ice cream is happening is mostly up to you.

Let’s summarize what we’ve learned so far. We have an agent and an environment. The environment gives the agent a state. The agent chooses an action and receives a reward from the environment along with the new state. This learning process continues until the goal is achieved or some other condition is met.

Source: https://phrasee.co/

Examples of Reinforcement Learning

Let’s have a look at some example applications of RL:

Cart-Pole Balancing

Goal - Balance the pole on top of a moving cart
State - angle, angular speed, position, horizontal velocity
Actions - horizontal force to the cart
Reward - 1 at each time step if the pole is upright

Atari Games

Goal - Beat the game with the highest score
State - Raw game pixels of the game
Actions - Up, Down, Left, Right etc
Reward - Score provided by the game

DOOM

Goal - Eliminate all opponents
State - Raw game pixels of the game
Actions - Up, Down, Left, Right etc
Reward - Positive when eliminating an opponent, negative when the agent is eliminated

Training robots for Bin Packing

Source: www.plasticsdist.com

Goal - Pick a device from a box and put it into a container
State - Raw pixels of the real world
Actions - Possible actions of the robot
Reward - Positive when placing a device successfully, negative otherwise

You started thinking that all RL researchers are failed pro-gamers, didn’t you? In practice, that doesn’t seem to be the case. For example, somewhat “meta” applications include “Designing Neural Network Architectures using Reinforcement Learning”.

Formalizing the RL problem

Markov Decision Process (MDP) is mathematical formulations of the RL problem. They satisfy the Markov property:

Markov property - the current state completely represents the state of the environment (world). That is, the future depends only on the present.

An MDP can be defined by $(S, A, R, P, \gamma)$ where:

$S$ - set of possible states
$A$ - set of possible actions
$R$ - probability distribution of reward given (state, action) pair
$P$ - probability distribution over how likely any of the states is to be the new states, given (state, action) pair. Also known as transition probability.
$\gamma$ - reward discount factor

How MDPs work

At the initial time step $t=0$ , the environment chooses initial state $s_0 \sim p(s_0)$ . That state is used as a seed state for the following loop:

for $t=0$ until done:

The agent selects action $a_t$
The environment chooses reward $r_t \sim R(. \vert\, s_t, a_t)$ and next state $s_{t + 1} \sim P(. \vert\, s_t, a_t)$
The agent receives reward $r_t$ and next state $s_{t + 1}$

More formally, the environment does not choose, it samples from the reward and transition probability distributions.

What is the objective of all this? Find a function $\pi^*$ , known as optimal policy, that maximizes the cumulative discounted reward:

\sum_{t \geq 0}\gamma^t r_t

A policy $\pi$ is a function that maps state $s$ to action $a$ , that our agent believes is the best given that state.

Your first MDP

Let’s get back to you, cruising through the neighborhood, dreaming about that delicious ice cream. Here is one possible situation, described as an MDP:

Your objective is to get to the bitten ice cream on a stick, without meeting a zombie. The reasoning behind the new design is based on solid data science - people seem to give a crazy amount of cash for a bitten fruit and everybody knows that candy is much tastier. Putting it together you get “the all-new ice cream”. And honestly, it wouldn’t be cool to omit the zombies, so there you have it.

The state is fully described by the grid. At the first step you have the following actions:

Crashing into a zombie (Carmageddon anyone?) gives a reward of -100 points, taking an action is -1 points and eating the ice cream gives you the crazy 1000 points. Why -1 points for taking an action? Well, the store might close anytime now, so you have to get there as soon as possible.

Congrats, you just created your first MDP. But how do we solve the problem? Stay tuned for that :)

Oops, almost forgot, your reward for reading so far:

Want to learn more?

Want to be a Machine Learning expert?

Join the weekly newsletter on Data Science, Deep Learning and Machine Learning in your inbox, curated by me! Chosen by 10,000+ Machine Learning practitioners. (There might be some exclusive content, too!)

You'll never get spam from me

Hacker's Guide to Neural Networks in JavaScript

Build Machine Learning models (especially Deep Neural Networks) that you can easily integrate with existing or new web apps. Think of your ReactJs, Vue, or Angular app enhanced with the power of Machine Learning models.

Get SH*T Done with PyTorch

Learn how to solve real-world problems with Deep Learning models (NLP, Computer Vision, and Time Series). Go from prototyping to deployment with PyTorch and Python!

Hacker's Guide to Machine Learning with Python

This book brings the fundamentals of Machine Learning to you, using tools and techniques used to solve real-world problems in Computer Vision, Natural Language Processing, and Time Series analysis. The skills taught in this book will lay the foundation for you to advance your journey to Machine Learning Mastery!

Hands-On Machine Learning from Scratch

This book will guide you on your journey to deeper Machine Learning understanding by developing algorithms in Python from scratch! Learn why and when Machine learning is the right tool for the job and how to improve low performing models!