

Build a taxi driving agent in a post-apocalyptic world using Reinforcement Learning | Machine Learning from Scratch (Part VIII)

Machine Learning, Reinforcement Learning, MDP, Q-Learning, Self-driving | 8 min read


TL;DR Build a simple MDP for a self-driving taxi problem. Pick up passengers, avoid danger, and drop them off at a specified location. Build an agent and solve the problem using Q-learning.

You wake up. It is a sunny day. Your partner is still asleep next to you. You take a minute to admire the beauty and even crack a smile.

Your stomach is rumbling, so you jump on your feet and look around. Jackpot! You’re looking at the leftovers of the anniversary dinner you two had last night. You take a minute to scratch some private spot of yours. Yep, what an amazing morning! You have 1/3 of a bean can, expired only 6 months ago. It is delicious!

Although you feel bad about not leaving any food for the one sleeping in your bed, you dress up and prepare for work. “What are you going to do? Not eat and work!?” - those thoughts don’t seem to help anymore. Time to hop in the car!

You try to reach Johny via the radio but no luck. Oh well, Jimmy has been missing since last week, and his twin brother might be too. Strange, you can’t remember the last time your eyes were feeling wet.

A lot has changed since the event. The streets are getting more dangerous every week, but people still need transportation. You receive a pickup request and have a look at the anomaly map. You accept hesitantly. Your map was updated 2 months ago.

You got the job done and got credit for a half can. That couple was really generous! That leaves time for tinkering with the idea of making self-driving taxis. You even got a name - SafeCab. You read a lot about this crazy guy called Elon Nusk who was trying to build those fully autonomous vehicles right before the event occurred.

You start scratching your head. Is this possible or just a fantasy? After all, if you get this done, you might afford to eat every other day. Enter Reinforcement Learning.

Complete source code in Google Colaboratory Notebook

Reinforcement Learning

Reinforcement Learning (RL) is concerned with producing algorithms (agents) that try to achieve some predefined goal. Achieving this objective depends on choosing a set of actions and receiving rewards for the good ones and punishment for the bad ones, which is where the reinforcement bit comes from. The agent acts in an environment that has a state, gives rewards, and offers a set of actions.

[Figure: reinforcement learning]

Deep Reinforcement Learning (using Deep Neural Networks for choosing actions) has achieved some great things lately.

Markov Decision Processes

A Markov Decision Process (MDP) is a mathematical formulation of the RL problem. MDPs satisfy the Markov property:

Markov property - the current state completely represents the state of the environment (world). That is, the future depends only on the present.

An MDP can be defined by $(S, A, R, P, \gamma)$ where:

  • $S$ - set of possible states
  • $A$ - set of possible actions
  • $R$ - probability distribution of reward given a (state, action) pair
  • $P$ - probability distribution over how likely each state is to be the new state, given a (state, action) pair. Also known as the transition probability.
  • $\gamma$ - reward discount factor

The discount factor $\gamma$ allows us to inject the heuristic that 100 bucks now are more valuable than 100 bucks in 30 days. A discount factor of 1 would make future rewards worth just as much as immediate rewards.

Another reason to discount future rewards:

In the long run we are all dead - J. M. Keynes


Here is how learning happens in an RL context:

For time step $t = 0$ until done:

  1. The environment gives your agent a state
  2. Your agent chooses an action (from a set of possible ones)
  3. The environment gives a reward along with a new state
  4. Continue until the goal or other condition is met

What is the objective of all this? Find a function $\pi^*$, known as the optimal policy, that maximizes the cumulative discounted reward:

$$\sum_{t \geq 0} \gamma^t r_t$$

where $r_t$ is the reward received at step $t$ and $\gamma^t$ is the discount factor applied at step $t$.
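To make the sum concrete, here is a minimal sketch that evaluates it for a finite list of rewards. The reward values and the helper name `discounted_return` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical episode: three step penalties, then a payoff.
rewards = [-1, -1, -1, 20]

print(discounted_return(rewards, gamma=1.0))  # no discounting: plain sum, 17.0
print(discounted_return(rewards, gamma=0.5))  # later rewards count for less, 0.75
```

With $\gamma = 0.5$ the payoff at step 3 is scaled by $0.5^3$, so the same episode is worth far less than its undiscounted total.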

A policy $\pi$ is a function that maps a state $s$ to the action $a$ that our agent believes is the best given that state.

Value function

The value function gives you the expected cumulative discounted reward the agent will get, starting from some state $s$ and following some policy $\pi$.

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} \mid S_t = s\right]$$

There exists an optimal value function that has the highest value for all states. It is defined as:

$$V^*(s) = \max_{\pi} V_{\pi}(s) \quad \forall s \in \mathbb{S}$$


Similarly, let’s define another function known as the $Q$-function (state-action value function) that gives the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} \mid S_t = s, A_t = a\right]$$

You can think of the $Q$-function as the “quality” of a certain action given a state. We can define the optimal $Q$-function as:

$$Q^*(s, a) = \max_{\pi} Q_{\pi}(s, a) \quad \forall s \in \mathbb{S}$$

There is a relationship between the two optimal functions $V^*$ and $Q^*$. It is given by:

$$V^*(s) = \max_{a} Q^*(s, a) \quad \forall s \in \mathbb{S}$$

That is, the maximum expected total reward when starting at $s$ is the maximum of $Q^*(s, a)$ over all possible actions.

We can find the optimal policy $\pi^*$ by choosing the action $a$ that maximizes $Q^*(s, a)$ for state $s$:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a) \quad \forall s \in \mathbb{S}$$
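The two relationships, a state's value as the row-wise maximum of $Q$ and the policy as the row-wise arg max, are easy to see on a small table. The numbers below are invented for illustration and do not come from the taxi environment.

```python
# Hypothetical Q-table: 3 states (rows) x 2 actions (columns).
q_table = [
    [1.0, 2.5],
    [0.3, -0.4],
    [4.0, 4.2],
]

# V(s) = max_a Q(s, a)
values = [max(row) for row in q_table]

# pi(s) = argmax_a Q(s, a)
policy = [max(range(len(row)), key=row.__getitem__) for row in q_table]

print(values)  # [2.5, 0.3, 4.2]
print(policy)  # [1, 0, 1]
```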

There seems to be a synergy between all functions we defined so far. More importantly, we can now build an optimal agent for a given environment. Can we do it in practice?


Q-Learning is an off-policy algorithm for Temporal Difference (TD) learning. Given enough training, in which every state-action pair keeps being visited and the learning rate is decayed appropriately, it is proven to converge with probability 1 to the optimal action-value function $Q^*$. It learns the optimal policy even when actions are selected by an exploratory policy that includes some randomness, which is what off-policy means.

Given a state $s$ and action $a$, we can express $Q(s, a)$ in terms of itself (recursively):

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

This is the Bellman equation. It defines the maximum future reward as the reward $r$ the agent receives for taking action $a$ in state $s$, plus the discounted maximum future reward from the next state $s'$ over every possible action $a'$.

We can define $Q$-learning as an iterative approximation of $Q^*$ using the Bellman equation:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right)$$

where $\alpha$ is the learning rate, which controls how much of the difference between the previous Q value and the new estimate is applied in each update.
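A single update is easier to follow with numbers plugged in. The sketch below applies the update rule once; the function name `q_update` and all the values passed to it are arbitrary choices for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward the TD target."""
    td_target = reward + gamma * max_q_next   # r + gamma * max_a Q(s', a)
    td_error = td_target - q_sa               # how far off the old estimate was
    return q_sa + alpha * td_error


# Old estimate 0.0, step reward -1.0, best next-state value 2.0
new_q = q_update(q_sa=0.0, reward=-1.0, max_q_next=2.0, alpha=0.5, gamma=0.9)
print(round(new_q, 4))  # 0.4
```

The target is $-1 + 0.9 \cdot 2 = 0.8$, and with $\alpha = 0.5$ the estimate moves halfway from 0 to that target. Setting $\alpha = 1$ would jump straight to the target, while $\alpha = 0$ would leave the old value untouched.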

Here’s the general algorithm we’re going to implement:

  1. Initialize a $Q$-values table
  2. Observe initial state $s$
  3. Choose action $a$ and act
  4. Observe reward $r$ and a new state $s'$
  5. Update the $Q$ table using $r$ and the maximum possible reward from $s'$
  6. Set the current state to the new state and repeat from step 2 until a terminal state
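Before moving to the taxi problem, the six steps can be tried on something much smaller. The sketch below is not the SafeCab environment: it assumes a made-up five-cell corridor where the agent starts at cell 0 and the goal is cell 4, with a fixed exploration rate and arbitrary hyperparameters.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
LEFT, RIGHT = 0, 1    # hypothetical two-action toy problem


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward, done)."""
    next_state = min(state + 1, N_STATES - 1) if action == RIGHT else max(state - 1, 0)
    if next_state == N_STATES - 1:
        return next_state, 10.0, True   # reached the goal
    return next_state, -1.0, False      # every other move costs 1


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialize the Q-table with zeros
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0                                   # 2. observe the initial state
        for _ in range(max_steps):
            # 3. choose an action (epsilon-greedy) and act
            if rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = max((LEFT, RIGHT), key=lambda a: q[state][a])
            # 4. observe the reward and the new state
            next_state, reward, done = step(state, action)
            # 5. update Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            # 6. move on, stopping at a terminal state
            state = next_state
            if done:
                break
    return q


q = train()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # the learned action for each non-terminal state
```

After training, the greedy action in every non-terminal cell should be `RIGHT`, since that is the shortest way to the goal.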

Note that $Q$-learning is just one possible algorithm to solve the RL problem. Here is a comparison between some more of them.

Exploration vs exploitation

You might’ve noticed that we glossed over the strategy for choosing an action. Any strategy you implement will have to choose how often to try something new or use something it already knows. This is known as the exploration/exploitation tradeoff.

  • Exploration - finding new information about the environment
  • Exploitation - using existing information to maximize the reward

Remember, the goal of our RL agent is to maximize the expected cumulative reward. What does this mean for our self-driving taxi?

Initially, our driving agent knows pretty much nothing about the best set of driving directions for picking up and dropping off passengers. It should learn to avoid anomalies too since they are bad for business (passengers tend to disappear or worse in those)! During this time, we expect to do a lot of exploration.

After obtaining some experience, the agent can use it to choose an action more and more often. Eventually, all choices will be based on what is learned.

Driving in a post-apocalyptic world


Your partner sketched this map. Each block represents a small region of the city. Anomalies are marked in bright circles. The four letters are “safe zones”, where pickups and drop-offs happen.

Let’s assume your car is the only vehicle in the city (not much of a stretch). We can break it up into a 5x5 grid, which gives us 25 possible taxi locations.

The current taxi location is (3, 1) in (row, col) coordinates. The pickup and drop off locations are: [(0,0), (0,4), (4,0), (4,3)]. Anomalies are at [(0, 2), (2, 1), (2, 2), (4, 2)].
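To double-check the coordinates, here is a small sketch that draws the grid as text. The symbols (T, S, X and a dot) are made up for this sketch and are not the ones used by the environment's own renderer.

```python
GRID_SIZE = 5
taxi = (3, 1)                                   # (row, col), from the text
safe_zones = [(0, 0), (0, 4), (4, 0), (4, 3)]   # pickup / drop-off cells
anomalies = [(0, 2), (2, 1), (2, 2), (4, 2)]

# Symbols are made up for this sketch: T taxi, S safe zone, X anomaly, . road
grid = [["." for _ in range(GRID_SIZE)] for _ in range(GRID_SIZE)]
for row, col in safe_zones:
    grid[row][col] = "S"
for row, col in anomalies:
    grid[row][col] = "X"
grid[taxi[0]][taxi[1]] = "T"

city_map = "\n".join(" ".join(cells) for cells in grid)
print(city_map)
```

Row 0 is printed first, so the taxi shows up in the fourth printed row, second column.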


We’re going to encode our city map into an environment for the self-driving agent using OpenAI’s Gym. What is this Gym thing anyways?

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.

The most important entity in Gym is the Environment. It provides a unified and easy to use interface. Here are the most important methods:

  • reset - resets the environment and returns a random initial state

  • step(action) - take action and advance one timestep. It returns:

    observation - the new state of the environment

    reward - reward received from taking the action

    done - most environments are divided into well-defined episodes and if done is True it indicates the episode has been completed

    info - additional information about the environment (might be useful for debugging)

  • render - renders one frame of the environment (useful for visualization)
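The interface is small enough to mimic without installing anything. The class below is a hypothetical toy, not a real Gym environment and not SafeCab. It only follows the same `reset` / `step` / `render` shape, with `step` returning the four values listed above.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment that mimics the Gym-style interface."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.state = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.steps = 0
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        """Apply the action; return (observation, reward, done, info)."""
        reward = 1 if action == self.state else -1   # +1 when the action matches the state
        self.state = self.rng.randint(0, 1)          # move to a new random state
        self.steps += 1
        done = self.steps >= self.max_steps          # episode ends after max_steps
        return self.state, reward, done, {"steps": self.steps}

    def render(self):
        """Return a one-line text frame of the current state."""
        return f"coin={self.state} step={self.steps}"


env = CoinFlipEnv()
state = env.reset()
done = False
total_reward = 0
while not done:
    action = random.randint(0, 1)            # a random agent
    state, reward, done, info = env.step(action)
    total_reward += reward
print(env.render(), "total reward:", total_reward)
```

The loop at the bottom is the same pattern used for training later on: reset, then call `step` until `done` is `True`.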

The complete source code of the environment is in the notebook. Here, we’ll take a look at the registration to Gym:

    from gym.envs.registration import register

    register(
        id='SafeCab-v0',
        entry_point=f"{__name__}:SafeCab",
        timestep_limit=100,  # called max_episode_steps in newer Gym releases
    )

Note that we set a timestep_limit which limits the number of steps in an episode.

Action Space

Our agent encounters one of the 500 states (5 rows x 5 columns x 5 passenger locations x 4 destinations) and it chooses an action. Here are the possible actions:

  1. south
  2. north
  3. east
  4. west
  5. pickup
  6. dropoff

You might notice that in the illustration above, the taxi cannot perform certain actions when near the boundaries of the city. We will penalize this with -1 and won’t move the taxi if such a situation occurs.
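Here is a minimal sketch of that boundary rule. It models only the boundary penalty, not the environment's other rewards. It assumes that row 0 is the top of the map, so moving south increases the row index, and it uses the article's numbering for the four movement actions (0 = south, 1 = north, 2 = east, 3 = west). The helper `move` is hypothetical.

```python
GRID_SIZE = 5
SOUTH, NORTH, EAST, WEST = 0, 1, 2, 3   # numbering from the article's action mapping
# Assumption: row 0 is the top of the map, so south increases the row index.
MOVES = {SOUTH: (1, 0), NORTH: (-1, 0), EAST: (0, 1), WEST: (0, -1)}


def move(row, col, action):
    """Return (new_row, new_col, penalty) for a movement action."""
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE:
        return new_row, new_col, 0      # legal move, no boundary penalty
    return row, col, -1                 # blocked: stay put and pay -1


print(move(3, 1, EAST))   # (3, 2, 0)
print(move(4, 0, SOUTH))  # (4, 0, -1)
```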

Getting a taste of the environment

Let’s have a look at our encoded environment:

    import gym

    env = gym.make('SafeCab-v0')  # assumes the environment was registered first
    print("Action Space {}".format(env.action_space))
    print("State Space {}".format(env.observation_space))

[Figure: encoded environment]

    Action Space Discrete(6)
    State Space Discrete(500)
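The 500 in `Discrete(500)` comes from packing four small numbers into one integer. The sketch below shows one way to do that packing, as a mixed-radix number in the order row, column, passenger location, destination. The ordering and the helper names `encode` and `decode` are assumptions, so the notebook's environment may lay its states out differently.

```python
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4


def encode(row, col, passenger_loc, destination):
    """Pack the four components into one integer in [0, 500)."""
    state = row
    state = state * N_COLS + col
    state = state * N_PASSENGER_LOCS + passenger_loc
    state = state * N_DESTINATIONS + destination
    return state


def decode(state):
    """Inverse of encode: recover (row, col, passenger_loc, destination)."""
    state, destination = divmod(state, N_DESTINATIONS)
    state, passenger_loc = divmod(state, N_PASSENGER_LOCS)
    row, col = divmod(state, N_COLS)
    return row, col, passenger_loc, destination


print(encode(3, 1, 2, 0))   # 328
print(decode(328))          # (3, 1, 2, 0)
print(encode(4, 4, 4, 3))   # 499, the largest state index
```

A single integer per state is what lets the agent index a plain table with one row per state.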

Here is the action to number mapping:

  • 0 = south
  • 1 = north
  • 2 = east
  • 3 = west
  • 4 = pickup
  • 5 = dropoff

Building an agent

Our agent has a simple interface for interacting with the environment. Here it is:

    import numpy as np  # used by the method implementations that follow


    class Agent:

        def __init__(
            self,
            n_states,
            n_actions,
            decay_rate=0.0001,
            learning_rate=0.7,
            gamma=0.618
        ):
            pass

        def choose_action(self, state, explore=True):
            pass

        def learn(
            self,
            state,
            action,
            reward,
            next_state,
            done,
            episode
        ):
            pass

We’re going to use the choose_action method when we want our agent to make a decision and act. Then, after the reward and new state from the environment are observed, our agent will learn from its actions using the learn method.

Let’s take a look at the implementations:

    def __init__(
        self,
        n_states,
        n_actions,
        decay_rate=0.0001,
        learning_rate=0.7,
        gamma=0.618
    ):
        self.n_actions = n_actions
        self.q_table = np.zeros((n_states, n_actions))  # one row per state
        self.max_epsilon = 1.0
        self.min_epsilon = 0.01
        self.epsilon = self.max_epsilon  # start fully exploratory
        self.decay_rate = decay_rate
        self.learning_rate = learning_rate
        self.gamma = gamma  # discount rate
        self.epsilons_ = []  # history of epsilon values, one per finished episode

The interesting part of __init__ is the initialization of our $Q$-table. Initially, it is all zeros. Can we initialize it in some other way?

    def choose_action(self, state, explore=True):
        exploration_tradeoff = np.random.uniform(0, 1)

        if explore and exploration_tradeoff < self.epsilon:
            # exploration (taking a random action)
            return np.random.randint(self.n_actions)
        else:
            # exploitation (taking the biggest Q value for this state)
            return np.argmax(self.q_table[state, :])

Our strategy is rather simple. We draw a random number from a uniform distribution between 0 and 1. If this number is smaller than epsilon and we want to explore, we take a random action. Otherwise, we take the best action based on our current knowledge.

    def learn(
        self,
        state,
        action,
        reward,
        next_state,
        done,
        episode
    ):
        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        td_target = reward + self.gamma * np.max(self.q_table[next_state, :])
        self.q_table[state, action] += \
            self.learning_rate * (td_target - self.q_table[state, action])

        if done:
            # Reduce epsilon to decrease the exploration over time
            self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * \
                np.exp(-self.decay_rate * episode)
            self.epsilons_.append(self.epsilon)

Learning involves updating the $Q$-table using the $Q$-learning equation and reducing the exploration rate $\epsilon$ once the episode is complete.
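To get a feel for how quickly exploration fades, the decay formula from `learn` can be evaluated on its own. The sketch below reuses the same formula and the defaults from `__init__`. The helper name `epsilon_at` and the episode numbers are picked for illustration.

```python
import math


def epsilon_at(episode, decay_rate=0.0001, min_epsilon=0.01, max_epsilon=1.0):
    """Exploration rate after `episode` finished episodes (same formula as learn)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 10_000, 30_000, 60_000):
    print(episode, round(epsilon_at(episode), 4))
```

With these defaults the agent starts out fully random, still explores roughly a third of the time after 10,000 episodes, and approaches the floor of 0.01 only towards the end of a long run.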


Now that our agent is ready for action, we can train it on the environment we’ve created. Let’s train our agent for 60k episodes and record episode rewards over time:

    total_episodes = 60000
    total_test_episodes = 10

    agent = Agent(env.observation_space.n, env.action_space.n)

Let’s have a look at our agent driving before learning anything about the environment:

    rewards = []

    for episode in range(total_episodes):
        state = env.reset()
        episode_rewards = []

        while True:
            # Pick an action for the current state (epsilon-greedy)
            action = agent.choose_action(state)

            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, done, info = env.step(action)

            agent.learn(
                state,
                action,
                reward,
                new_state,
                done,
                episode
            )

            state = new_state

            episode_rewards.append(reward)

            if done:
                break

        # Record the mean reward per step for this episode
        rewards.append(np.mean(episode_rewards))

Recall that we set timestep_limit when we registered the environment so our agent won’t stay in an episode infinitely.


Let’s have a look at reward changes as the training progresses:

[Figure: reward train]

Note that the learning curve is smoothed using SciPy’s savgol_filter: savgol_filter(rewards, window_length=1001, polyorder=2)

Recall that our exploration rate should decrease as our agent is learning. Have a look:

[Figure: epsilon train]

Here’s how we’re going to test our agent and record the progress:

    frames = []

    rewards = []

    for episode in range(total_test_episodes):
        state = env.reset()
        episode_rewards = []

        step = 1

        while True:
            # Act greedily: no exploration during testing
            action = agent.choose_action(state, explore=False)

            new_state, reward, done, info = env.step(action)

            frames.append({
                'frame': env.render(mode='ansi'),
                'state': state,
                'episode': episode + 1,
                'step': step,
                'reward': reward
            })

            episode_rewards.append(reward)

            if done:
                break
            state = new_state
            step += 1

        rewards.append(np.mean(episode_rewards))

Note that we want our agent to use only the experience it has so we set explore=False. Here’s what the total reward for each episode looks like:

[Figure: reward test]

I know that this chart might not give you a good idea of what the agent is capable of. Here is a video of it driving in the city:

Pretty good, right? It looks like it learned to avoid the anomalies and to pick up and drop off passengers.


Congratulations on building a self-driving taxi agent. You’ve learned how to:

  • Build your own environment based on one provided by OpenAI’s Gym
  • Implement and apply $Q$-learning
  • Build an agent that learns to pick up and drop off passengers and avoid danger areas

Complete source code in Google Colaboratory Notebook

Can you increase the size of the city? Does the agent still learn well? Tell me how it went in the comments below!

