Machine_Learning_with_TensorFlow (6)

Reinforcement Learning

All these examples can be unified under a general formulation: performing an action in a scenario can yield a reward. A more technical term for scenario is a state, and we call the collection of all possible states a state-space. Performing an action causes the state to change. But the question is, what series of actions yields the highest cumulative reward?

Real-world examples

Here are some examples to open your eyes to some successful uses of RL by Google:

  • Game playing
  • input:
  • More game playing
  • Robotics and control:

Formal notions

It's not supervised learning, because the training data comes from the algorithm deciding between exploration and exploitation. And it's not unsupervised because the algorithm receives feedback from the environment. As long as you're in a situation where performing an action in a state produces a reward, you can use reinforcement learning to discover the best sequence of actions to take.

You may notice that reinforcement learning lingo involves anthropomorphizing the algorithm into taking "actions" in "situations" to "receive rewards." In fact, the algorithm is often referred to as an "agent" that "acts with" the environment. It shouldn't be a surprise that much of reinforcement learning theory is applied in robotics.

A robot performs actions to change between different states. But how does it decide which action to take? The next section introduces a new concept, called the policy, to answer this question.


In reinforcement learning lingo, we call the strategy a policy.

One of the most common ways to solve reinforcement learning is by observing the long-term consequences of actions at each state. The short-term consequence is easy to calculate: that's just the reward. As you know, performing an action yields an immediate reward, but it's not always a good idea to greedily choose the action with the best reward all the time.

The best possible policy is often called the optimal policy, and it's often the holy grail of reinforcement learning. Learning the optimal policy tells you the optimal action given any state.


We've so far described one type of policy where the agent always chooses the action with the greatest immediate reward, called the greedy policy. Another simple example of a policy is arbitrarily choosing an action, called the random policy. If you come up with a policy to solve a reinforcement learning problem, it's often a good idea to double-check that your learned policy performs better than a random policy.


The long-term reward is called a utility. It turns out, if we know the utility of performing an action at a state, then it's easy to solve reinforcement learning. For example, to decide which action to take, we simply select the action that produces the highest utility. The hard part, as you might have guessed, is uncovering these utility values.
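As a minimal sketch of that selection step, assume the utilities for one state are already known and stored in a plain dictionary. The action names, the numbers, and the helper name `best_action` are made up for illustration:

```python
# Hypothetical utilities Q(s, a) for a single state; the values are illustrative only
q_values = {'left': 0.2, 'right': 1.5, 'stay': -0.3}

def best_action(q_values):
    # Pick the action whose utility is highest
    return max(q_values, key=q_values.get)

print(best_action(q_values))  # prints right
```

The lookup itself is trivial; the work lies in estimating the numbers stored in the dictionary.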

The utility of performing an action at a state is written as a function Q(s, a), called the utility function.


An elegant way to calculate the utility of a particular state-action pair (s,a) is by recursively considering the utilities of future actions. The utility of your current action is influenced not just by the immediate reward but also by the next best action, as shown in the formula below. In the formula, s' denotes the next state, and a' denotes the next action. The reward of taking action a in state s is denoted by r(s,a):

Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s', a')

Here \gamma is a hyper-parameter that you get to choose, called the discount factor. If \gamma is 0, then the agent chooses the action that maximizes the immediate reward. Higher values of \gamma will make the agent put more importance on considering long-term consequences.
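To get a feel for the discount factor, the following sketch unrolls the recursive formula along one fixed sequence of rewards, which gives the sum of the rewards weighted by increasing powers of \gamma. The reward values and the helper name `discounted_return` are assumptions made for illustration:

```python
def discounted_return(rewards, gamma):
    # Work backward so each step applies: value = r + gamma * (value of what follows)
    value = 0.0
    for r in reversed(rewards):
        value = r + gamma * value
    return value

rewards = [1.0, 0.0, 10.0]  # hypothetical rewards for three consecutive steps

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # 3.5: 1 + 0.5*0 + 0.25*10
print(discounted_return(rewards, 1.0))  # 11.0: all rewards count equally
```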

In some applications of reinforcement learning, newly available information might be more important than historical records, or vice versa.

  • For example, if a robot is expected to learn to solve tasks quickly but not necessarily optimally, we might want to set a faster learning rate.
  • Or if a robot is allowed more time to explore and exploit, we might tune down the learning rate.

Let's call the learning rate \alpha, and change our utility function as follows:

Q(s,a) \leftarrow Q(s,a) + \alpha(r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a))
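For a small problem where the utilities fit in a lookup table, this update can be sketched as follows. The table layout, the state and action names, the default values of `alpha` and `gamma`, and the helper name `q_update` are all assumptions for illustration:

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.5):
    # Best utility reachable from the next state
    best_next = max(q[(s_next, a_next)] for a_next in actions)
    # Move Q(s, a) a fraction alpha toward the target r + gamma * best_next
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]

actions = ['left', 'right']
q = defaultdict(float)       # unseen (state, action) pairs start at utility 0
q[('s1', 'right')] = 2.0     # hypothetical utility already learned for the next state

new_value = q_update(q, 's0', 'right', r=1.0, s_next='s1', actions=actions)
print(new_value)  # 1.0, from 0 + 0.5 * (1.0 + 0.5 * 2.0 - 0)
```

With `alpha` set to 1 the old estimate is replaced outright, and with `alpha` close to 0 it barely moves.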

Reinforcement learning can be solved if we know this Q(s,a) function. Conveniently for us, there's a machine learning strategy called neural networks, which are a way to approximate functions given enough training data. TensorFlow is the perfect tool to deal with neural networks because it comes with many essential algorithms to simplify neural network implementation.

Applying reinforcement learning

Applying reinforcement learning requires defining a way to retrieve rewards once an action is taken from a state. A stock market trader fits these requirements easily, because buying and selling a stock changes the state of the trader, and each action generates a reward (or loss).

The states in this situation are a vector containing information about the current budget, current number of stocks, and a recent history of stock prices (the last 200 stock prices). So each state is a 202-dimensional vector.
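As a rough sketch of how such a state could be assembled with NumPy, assume a made-up price series and starting portfolio. The variable names and values below are illustrative and not taken from real data:

```python
import numpy as np

hist = 200                               # number of recent prices kept in the state
prices = np.linspace(20.0, 60.0, 1000)   # hypothetical price series
budget, num_stocks = 1000.0, 0           # hypothetical portfolio

# Concatenate the last `hist` prices with the budget and the stock count
state = np.hstack((prices[-hist:], budget, num_stocks))
print(state.shape)  # (202,)
```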

For simplicity, there are only three actions: buy, sell, and hold.

  1. Buying a stock at the current stock price decreases the budget while incrementing the current stock count.
  2. Selling a stock trades it in for money at the current share price.
  3. Holding does neither, and performing the action simply waits a single time-period, and yields no reward.
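The three actions can be sketched as a small function that updates a portfolio. The function name `apply_action` and the rule of ignoring a trade the portfolio can't afford are assumptions made for this illustration:

```python
def apply_action(action, budget, num_stocks, share_price):
    # Return the (budget, num_stocks) pair after taking the action
    if action == 'Buy' and budget >= share_price:
        # Spend money to gain one share
        return budget - share_price, num_stocks + 1
    if action == 'Sell' and num_stocks > 0:
        # Give up one share to gain money
        return budget + share_price, num_stocks - 1
    # 'Hold', or a trade that isn't possible, leaves the portfolio unchanged
    return budget, num_stocks

print(apply_action('Buy', 100.0, 0, 25.0))   # (75.0, 1)
print(apply_action('Sell', 75.0, 1, 30.0))   # (105.0, 0)
print(apply_action('Hold', 105.0, 0, 30.0))  # (105.0, 0)
```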

The goal is to learn a policy that gains the maximum net-worth from trading in a stock market.


To get stock prices, we will use the yahoo_finance library in Python. You can install it using pip, as shown below, or alternatively follow the official guide.

pip install yahoo-finance

%matplotlib inline
from yahoo_finance import Share # for obtaining stock price raw data
from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf
import random

Create a helper function to get stock prices using the yahoo_finance library. The library requires three pieces of information: share symbol, start date, and end date. When you pick each of the three values, you'll get a list of numbers representing the share prices in that period by day.

# Listing 6.2 Helper function to get prices
def get_prices(share_symbol, start_date, end_date, cache_filename='stock_prices.npy'):
  try:
    # Try to load the data from file if it has already been computed
    stock_prices = np.load(cache_filename)
  except IOError:
    # Retrieve stock prices from the library
    share = Share(share_symbol)
    stock_hist = share.get_historical(start_date, end_date)
    # Extract only relevant info from the raw data
    stock_prices = [stock_price['Open'] for stock_price in stock_hist]
    # Cache the result
    np.save(cache_filename, stock_prices)
  # Make sure the prices are numeric before returning them
  return np.asarray(stock_prices, dtype=float)

Visualize the stock price data:

# Listing 6.3 Helper function to plot the stock prices
def plot_prices(prices):
  plt.title('Opening stock prices')
  plt.xlabel('day')
  plt.ylabel('price ($)')
  plt.plot(prices)
  plt.show()

# Listing 6.4 Get data and visualize it
if __name__ == '__main__':
  prices = get_prices('MSFT', '1992-07-22', '2016-07-22')
  plot_prices(prices)

Most reinforcement learning algorithms follow similar implementation patterns. As a result, it's a good idea to create a class with the relevant methods to reference later, such as an abstract class or interface.

Basically, reinforcement learning needs two operations well defined:
(1) how to select an action, and
(2) how to improve the utility Q-function.

# Listing 6.5 Define a superclass for all decision policies
class DecisionPolicy:
  # Given a state, the decision policy will calculate the next action to take
  def select_action(self, current_state, step):
    pass
  # Improve the Q-function from a new experience of taking an action
  def update_q(self, state, action, reward, next_state):
    pass
Most reinforcement learning algorithms boil down to just three main steps: infer, do, and learn.

  • During the first step, the algorithm selects the best action a given a state s using the knowledge it has so far.
  • Next, it does the action to find out the reward r as well as the next state s'.
  • Then it improves its understanding of the world using the newly acquired knowledge (s, r, a, s').
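These three steps can be sketched as a loop. The toy environment (`toy_step`), the `TablePolicy` class, and all numeric settings below are hypothetical stand-ins chosen to keep the example small; they are not part of the stock-trading code in this article.

```python
import random

def toy_step(state, action):
    # Hypothetical environment: moving 'right' from state 0 pays 1, everything else pays 0
    reward = 1.0 if (state == 0 and action == 'right') else 0.0
    next_state = 1 - state  # alternate between the two states
    return reward, next_state

class TablePolicy:
    def __init__(self, actions, epsilon=0.8, alpha=0.5, gamma=0.5):
        self.actions, self.epsilon = actions, epsilon
        self.alpha, self.gamma = alpha, gamma
        self.q = {}  # maps (state, action) to its estimated utility

    def select_action(self, state):
        # Infer: exploit the best known action with probability epsilon, else explore
        if random.random() < self.epsilon:
            return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))
        return random.choice(self.actions)

    def update_q(self, s, a, r, s_next):
        # Learn: move Q(s, a) toward r + gamma * max Q(s', a')
        best_next = max(self.q.get((s_next, b), 0.0) for b in self.actions)
        old = self.q.get((s, a), 0.0)
        self.q[(s, a)] = old + self.alpha * (r + self.gamma * best_next - old)

random.seed(0)
policy = TablePolicy(['left', 'right'])
state = 0
for _ in range(500):
    action = policy.select_action(state)                 # infer
    reward, next_state = toy_step(state, action)         # do
    policy.update_q(state, action, reward, next_state)   # learn
    state = next_state

# Compare the learned utilities of the two actions in state 0
print(policy.q.get((0, 'right'), 0.0), policy.q.get((0, 'left'), 0.0))
```

In this toy setup only 'right' in state 0 is ever rewarded, so its learned utility should end up above that of 'left'.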

Next, let's inherit from this superclass to implement a random decision policy. We only need to define the select_action method, which will randomly pick an action without even looking at the state.

# Listing 6.6 Implement a random decision policy
class RandomDecisionPolicy(DecisionPolicy):
  # Inherit from DecisionPolicy to implement its functions
  def __init__(self, actions):
    self.actions = actions
  # Randomly choose the next action; the state and the time step are ignored
  def select_action(self, current_state, step):
    action = self.actions[random.randint(0, len(self.actions) - 1)]
    return action

In listing 6.7, we assume a policy is given to us, and run it on the real-world stock-price data. This function takes care of exploration and exploitation at each interval of time.

# Listing 6.7 Use a given policy to make decisions and return the performance
def run_simulation(policy, initial_budget, initial_num_stocks, prices, hist, debug=False):
  # Initialize values that depend on computing the net worth of a portfolio
  budget = initial_budget
  num_stocks = initial_num_stocks
  share_value = 0
  transitions = list()
  for i in range(len(prices) - hist - 1):
    if i % 100 == 0:
      print('progress {:.2f}%'.format(float(100*i) / (len(prices) - hist - 1)))
    # The state is a `hist + 2` dimensional vector
    current_state = np.asmatrix(np.hstack((prices[i:i+hist], budget, num_stocks)))
    # Calculate the portfolio value
    current_portfolio = budget + num_stocks * share_value
    # Select an action from the current policy
    action = policy.select_action(current_state, i)
    # Update portfolio values based on action
    share_value = float(prices[i + hist + 1])
    if action == 'Buy' and budget >= share_value:
      budget -= share_value
      num_stocks += 1
    elif action == 'Sell' and num_stocks > 0:
      budget += share_value
      num_stocks -= 1
    else:
      # Fall back to holding when the chosen trade isn't possible
      action = 'Hold'
    # Compute new portfolio value after taking action
    new_portfolio = budget + num_stocks * share_value
    # Compute the reward from taking an action at a state
    reward = new_portfolio - current_portfolio
    # Update the policy after experiencing a new action
    next_state = np.asmatrix(np.hstack((prices[i+1:i+hist+1], budget, num_stocks)))
    transitions.append((current_state, action, reward, next_state))
    policy.update_q(current_state, action, reward, next_state)
  # Compute final portfolio worth
  portfolio = budget + num_stocks * share_value
  if debug:
    print('${}\t{} shares'.format(budget, num_stocks))
  return portfolio

To obtain a more robust measurement of success, let's run the simulation a couple times and average the results. Doing so may take a while to complete, but your results will be more reliable.

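# Listing 6.8 Run multiple simulations to calculate an average performance
def run_simulations(policy, budget, num_stocks, prices, hist):
  # Decide the number of times to rerun the simulation
  num_tries = 10
  # Store the final portfolio value of each run
  final_portfolios = list()
  for i in range(num_tries):
    final_portfolio = run_simulation(policy, budget, num_stocks, prices, hist)
    final_portfolios.append(final_portfolio)
  # Average the results and report their spread
  avg, std = np.mean(final_portfolios), np.std(final_portfolios)
  return avg, std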

In main, define the decision policy and try running simulations to see how it performs.

# Listing 6.9 Append the following lines to main
if __name__ == '__main__':
  prices = get_prices('MSFT', '1992-07-22', '2016-07-22')
  actions = ['Buy', 'Sell', 'Hold']
  hist = 200
  policy = RandomDecisionPolicy(actions)
  budget = 1000.0
  num_stocks = 0
  avg, std = run_simulations(policy, budget, num_stocks, prices, hist)
  print(avg, std)

Now that we have a baseline to compare our results, let's implement our neural network approach to learn the Q-function. The decision policy is often called the Q-learning decision policy. The following listing 6.10 introduces a new hyper-parameter "epsilon" to keep the solution from getting "stuck" when applying the same action over and over. The lesser its value, the more often it will randomly explore new actions. The Q-function is defined by the function visualized in figure 6.9.

The input is the state-space vector, and there are three outputs, one for each action's Q-value.
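To make the shapes concrete, here is a NumPy-only sketch of a forward pass through such a network, assuming one hidden layer of 200 units with ReLU activations and random, untrained weights. It illustrates the data flow only and is not the TensorFlow implementation; the helper name `q_values` is made up for this example.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, h1_dim, output_dim = 202, 200, 3

# Random weights stand in for parameters that would normally be learned
W1, b1 = rng.randn(input_dim, h1_dim), np.full(h1_dim, 0.1)
W2, b2 = rng.randn(h1_dim, output_dim), np.full(output_dim, 0.1)

def q_values(state):
    # Hidden layer followed by the output layer, both with ReLU activations
    h1 = np.maximum(0.0, state @ W1 + b1)
    return np.maximum(0.0, h1 @ W2 + b2)

state = rng.rand(1, input_dim)   # one hypothetical state as a row vector
q = q_values(state)
print(q.shape)                   # (1, 3): one Q-value per action
```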

# Listing 6.10 Implement a Q-learning decision policy
class QLearningDecisionPolicy(DecisionPolicy):
  def __init__(self, actions, input_dim):
    # Set the hyper-parameters from the Q-function
    self.epsilon = 0.9
    self.gamma = 0.01
    self.actions = actions
    output_dim = len(actions)
    # Set the number of hidden nodes in the neural network
    h1_dim = 200
    # Define the input and output tensors
    self.x = tf.placeholder(tf.float32, [None, input_dim])
    self.y = tf.placeholder(tf.float32, [output_dim])
    # Design the neural network architecture
    W1 = tf.Variable(tf.random_normal([input_dim, h1_dim]))
    b1 = tf.Variable(tf.constant(0.1, shape=[h1_dim]))
    h1 = tf.nn.relu(tf.matmul(self.x, W1) + b1)
    W2 = tf.Variable(tf.random_normal([h1_dim, output_dim]))
    b2 = tf.Variable(tf.constant(0.1, shape=[output_dim]))
    # Define the op to calculate the utility
    self.q = tf.nn.relu(tf.matmul(h1, W2) + b2)
    # Set the loss as the square error
    loss = tf.square(self.y - self.q)
    # Use an optimizer to update the model parameters to minimize the loss
    self.train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)
    # Set up the session and initialize variables
    self.sess = tf.Session()
    self.sess.run(tf.global_variables_initializer())

  def select_action(self, current_state, step):
    threshold = min(self.epsilon, step / 1000.)
    if random.random() < threshold:
      # Exploit best option with probability epsilon
      action_q_vals = self.sess.run(self.q, feed_dict={self.x: current_state})
      action_idx = np.argmax(action_q_vals)
      action = self.actions[action_idx]
    else:
      # Explore random option with probability `1 - epsilon`
      action = self.actions[random.randint(0, len(self.actions) - 1)]
    return action

  # Update the Q-function by updating its model parameters
  def update_q(self, state, action, reward, next_state):
    action_q_vals = self.sess.run(self.q, feed_dict={self.x: state})
    next_action_q_vals = self.sess.run(self.q, feed_dict={self.x: next_state})
    next_action_idx = np.argmax(next_action_q_vals)
    # Update the entry of the action that was actually taken, as in the Q(s,a) formula
    current_action_idx = self.actions.index(action)
    action_q_vals[0, current_action_idx] = reward + self.gamma * next_action_q_vals[0, next_action_idx]
    action_q_vals = np.squeeze(np.asarray(action_q_vals))
    # Retrain the neural network to update the weights, using the adjusted `action_q_vals` as the truth
    self.sess.run(self.train_op, feed_dict={self.x: state, self.y: action_q_vals})
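
The `select_action` method computes its exploitation threshold as `min(epsilon, step / 1000.)`, so early time steps are mostly random exploration and the threshold levels off at epsilon later on. A small sketch of that schedule, using the same `epsilon = 0.9` (the helper name `exploit_probability` is made up for illustration):

```python
def exploit_probability(step, epsilon=0.9):
    # Probability of choosing the best known action at a given time step
    return min(epsilon, step / 1000.)

for step in [0, 100, 500, 900, 5000]:
    print(step, exploit_probability(step))
# 0 0.0
# 100 0.1
# 500 0.5
# 900 0.9
# 5000 0.9
```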
