# Reinforcement Learning

All these examples can be unified under a general formulation: performing an action in a scenario can yield a reward. A more technical term for scenario is a *state*, and we call the collection of all possible states a *state-space*. Performing an action causes the state to change. But the question is, what series of actions yields the highest cumulative reward?

## Real-world examples

Here are some examples to open your eyes to some successful uses of RL by Google:

- Game playing
- input:
- More game playing
- Robotics and control:

## Formal notions

Reinforcement learning is not supervised learning, because the training data comes from the algorithm deciding between exploration and exploitation. And it's not unsupervised learning, because the algorithm receives feedback from the environment. As long as you're in a situation where performing an action in a state produces a reward, you can use reinforcement learning to discover the best sequence of actions to take.

You may notice that reinforcement learning lingo involves anthropomorphizing the algorithm into taking "actions" in "situations" to "receive rewards." In fact, the algorithm is often referred to as an "agent" that "acts with" the environment. It shouldn't be a surprise that much of reinforcement learning theory is applied in robotics.

A robot performs actions to change between different states. But how does it decide which action to take? The next section introduces a new concept, called the __policy__, to answer this question.

## Policy

In reinforcement learning lingo, we call the strategy a __policy__.

One of the most common ways to solve reinforcement learning is by observing the long-term consequences of actions at each state. The short-term consequence is easy to calculate: that's just the reward. As you know, performing an action yields an immediate reward, but it's not always a good idea to greedily choose the action with the best reward all the time.

The best possible policy is often called the optimal policy, and it's often the holy grail of reinforcement learning. Learning the optimal policy tells you the optimal action given any state.

We've so far described one type of policy where the agent always chooses the action with the greatest immediate reward, called the __greedy policy__. Another simple example of a policy is arbitrarily choosing an action, called the __random policy__. If you come up with a policy to solve a reinforcement learning problem, it's often a good idea to double-check that your learned policy performs better than a random policy.
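The two simple policies just described can be sketched in a few lines. This is only an illustration: the reward table, the action names, and the helper functions `greedy_policy` and `random_policy` are made up for the example and do not come from the chapter's listings.

```python
import random

# Hypothetical immediate rewards for each action available in one state.
immediate_rewards = {'left': 1.0, 'right': 5.0, 'stay': 0.0}

def greedy_policy(rewards):
    """Pick the action with the largest immediate reward."""
    return max(rewards, key=rewards.get)

def random_policy(rewards, rng=random):
    """Pick any action uniformly at random, ignoring the rewards."""
    return rng.choice(list(rewards))

print(greedy_policy(immediate_rewards))   # right
print(random_policy(immediate_rewards))   # one of: left, right, stay
```

A learned policy that cannot beat `random_policy` on average is a sign that something is wrong with the learning setup.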

## Utility

The long-term reward is called a __utility__. It turns out, if we know the utility of performing an action at a state, then it's easy to solve reinforcement learning. For example, to decide which action to take, we simply select the action that produces the highest utility. The hard part, as you might have guessed, is uncovering these utility values.

The utility of performing an action $a$ at a state $s$ is written as a function $Q(s, a)$, called the __utility function__.

An elegant way to calculate the utility of a particular state-action pair $(s, a)$ is by recursively considering the utilities of future actions. The utility of your current action is influenced not just by the immediate reward but also by the next best action, as shown in the formula below. In the formula, $s'$ denotes the next state, and $a'$ denotes the next action. The reward of taking action $a$ in state $s$ is denoted by $r(s, a)$:

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$

Here $\gamma$ is a hyper-parameter that you get to choose, called the **discount factor**. If $\gamma$ is $0$, then the agent chooses the action that maximizes the immediate reward. Higher values of $\gamma$ make the agent put more importance on considering long-term consequences.
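To see how the discount factor changes the preferred action, here is a small sketch of the recursion, where the utility is the immediate reward plus the discounted utility of the best next action. The three-state problem, the `transitions` table, and the helper names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical deterministic toy problem: (state, action) -> (reward, next_state).
transitions = {
    ('start', 'grab'): (1.0, 'end'),
    ('start', 'wait'): (0.0, 'mid'),
    ('mid', 'grab'): (10.0, 'end'),
}

def actions_in(state):
    """List the actions available in a state ('end' has none)."""
    return [a for (s, a) in transitions if s == state]

def q_value(state, action, gamma):
    """Utility = immediate reward + gamma * utility of the best next action."""
    reward, next_state = transitions[(state, action)]
    next_actions = actions_in(next_state)
    if not next_actions:  # terminal state: no future utility to add
        return reward
    best_next = max(q_value(next_state, a, gamma) for a in next_actions)
    return reward + gamma * best_next

for gamma in (0.0, 0.5):
    print(gamma, q_value('start', 'grab', gamma), q_value('start', 'wait', gamma))
# 0.0 1.0 0.0
# 0.5 1.0 5.0
```

With a discount factor of 0 the agent prefers grabbing the small reward right away; with 0.5, waiting one step for the larger reward has the higher utility.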

In some applications of reinforcement learning, newly available information might be more important than historical records, or vice versa.

- For example, if a robot is expected to learn to solve tasks quickly but not necessarily optimally, we might want to set a faster learning rate.
- Or if a robot is allowed more time to explore and exploit, we might tune down the learning rate.

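Let's call the learning rate $\alpha$, and change our utility function as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$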
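A table-based sketch can make the role of the learning rate concrete. It assumes the standard Q-learning update, in which the stored utility moves a fraction of the way (the learning rate) toward the reward plus the discounted best next utility. The function `q_update`, the state names, and the numbers are illustrative assumptions, not part of the chapter's code.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q(state, action) a fraction `alpha` toward the new estimate."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]

actions = ['Buy', 'Sell', 'Hold']
q_table = defaultdict(float)  # unseen state-action pairs start at 0.0

# First experience: the estimate moves halfway from 0.0 to the target 2.0.
print(q_update(q_table, 's0', 'Buy', 2.0, 's1', actions, alpha=0.5, gamma=0.9))  # 1.0
# Same experience again: halfway from 1.0 to 2.0.
print(q_update(q_table, 's0', 'Buy', 2.0, 's1', actions, alpha=0.5, gamma=0.9))  # 1.5
```

A learning rate of 1.0 would jump straight to the newest estimate, while a small learning rate keeps most of the historical value.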

Reinforcement learning can be solved if we know this function. Conveniently for us, there's a machine learning technique called *neural networks*, which can approximate functions given enough training data. TensorFlow is well suited to working with neural networks because it comes with many essential algorithms that simplify their implementation.

## Applying reinforcement learning

Applying reinforcement learning requires defining a way to retrieve rewards once an action is taken from a state. A stock market trader fits these requirements easily, because buying and selling a stock changes the state of the trader, and each action generates a reward (or loss).

Each state in this situation is a vector containing the current budget, the current number of stocks, and a recent history of stock prices (the last 200 stock prices). So each state is a 202-dimensional vector.
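As a rough sketch of how such a state could be assembled, the snippet below concatenates a window of prices with the budget and the share count. The helper `make_state`, the made-up price series, and the use of plain Python lists are assumptions made for illustration.

```python
def make_state(prices, t, hist, budget, num_stocks):
    """Concatenate `hist` prices starting at index `t` with budget and share count."""
    window = [float(p) for p in prices[t:t + hist]]
    if len(window) != hist:
        raise ValueError('not enough price history at index {}'.format(t))
    return window + [float(budget), float(num_stocks)]

prices = [20.0 + 0.1 * day for day in range(300)]  # made-up price series
state = make_state(prices, t=0, hist=200, budget=1000.0, num_stocks=0)
print(len(state))  # 202
```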

For simplicity, there are only three actions: buy, sell, and hold.

- Buying a stock at the current stock price decreases the budget while incrementing the current stock count.
- Selling a stock trades it in for money at the current share price.
- Holding does neither: the action simply waits a single time period and yields no reward.
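The effect of the three actions on a portfolio can be sketched as a small pure function. The name `apply_action` is hypothetical, and treating an order that cannot be filled (not enough budget, or no shares to sell) as a hold is an assumption of this sketch.

```python
def apply_action(action, budget, num_stocks, share_price):
    """Return (budget, num_stocks, executed_action) after trading one share."""
    if action == 'Buy' and budget >= share_price:
        return budget - share_price, num_stocks + 1, 'Buy'
    if action == 'Sell' and num_stocks > 0:
        return budget + share_price, num_stocks - 1, 'Sell'
    # Holding, or an order that cannot be filled, leaves the portfolio unchanged.
    return budget, num_stocks, 'Hold'

print(apply_action('Buy', 100.0, 0, 30.0))   # (70.0, 1, 'Buy')
print(apply_action('Sell', 100.0, 0, 30.0))  # (100.0, 0, 'Hold')
```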

The goal is to learn a policy that gains the maximum net-worth from trading in a stock market.

## Implementation

To get stock prices, we will use the `yahoo_finance` library in Python. You can install it using pip, as shown below, or alternatively follow the official guide.

```
pip install yahoo-finance
```

```
%matplotlib inline
from yahoo_finance import Share # for obtaining stock price raw data
from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf
import random
```

Create a helper function to get stock prices using the `yahoo_finance` library. The library requires three pieces of information: share symbol, start date, and end date. When you pick each of the three values, you'll get a list of numbers representing the share prices in that period by day.

```
# Listing 6.2 Helper function to get prices
def get_prices(share_symbol, start_date, end_date, cache_filename='stock_prices.npy'):
    try:
        # Try to load the data from file if it has already been computed
        stock_prices = np.load(cache_filename)
    except IOError:
        # Retrieve stock prices from the library
        share = Share(share_symbol)
        stock_hist = share.get_historical(start_date, end_date)
        # Extract only relevant info from the raw data
        stock_prices = [stock_price['Open'] for stock_price in stock_hist]
        # Cache the result
        np.save(cache_filename, stock_prices)
    return stock_prices
```

Visualize the stock price data:

```
# Listing 6.3 Helper function to plot the stock prices
def plot_prices(prices):
    plt.title('Opening stock prices')
    plt.xlabel('day')
    plt.ylabel('price ($)')
    plt.plot(prices)
    plt.savefig('prices.png')
```

```
# Listing 6.4 Get data and visualize it
if __name__ == '__main__':
    prices = get_prices('MSFT', '1992-07-22', '2016-07-22')
    plot_prices(prices)
```

Most reinforcement learning algorithms follow similar implementation patterns. As a result, it's a good idea to create a class with the relevant methods to reference later, such as an abstract class or interface.

Basically, reinforcement learning needs two operations well defined:

(1) how to select an action, and

(2) how to improve the utility Q-function.

```
# Listing 6.5 Define a superclass for all decision policies
class DecisionPolicy:
    # Given a state (and the current time step), calculate the next action to take
    def select_action(self, current_state, step):
        pass

    # Improve the Q-function from a new experience of taking an action
    def update_q(self, state, action, reward, next_state):
        pass
```

Most reinforcement learning algorithms boil down to just three main steps: *infer*, *do*, and *learn*.

- During the first step, the algorithm selects the best action $a$ given a state $s$ using the knowledge it has so far.
- Next, it does the action to find out the reward $r$ as well as the next state $s'$.
- Then it improves its understanding of the world using the newly acquired knowledge $(s, r, a, s')$.
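The three steps above can be sketched as a loop over a toy problem. Everything here is hypothetical and only meant to show the shape of the loop: a corridor with positions 0 to 3 that pays a reward of 1.0 on reaching position 3, a table in place of a learned function, and the helper names `infer`, `step`, and `learn`.

```python
import random
from collections import defaultdict

ACTIONS = ['left', 'right']
GOAL = 3  # hypothetical corridor with positions 0..3; reaching 3 pays 1.0

def step(state, action):
    """Do: apply the action in the toy environment, return (reward, next_state)."""
    next_state = min(GOAL, state + 1) if action == 'right' else max(0, state - 1)
    return (1.0 if next_state == GOAL else 0.0), next_state

def infer(q, state, rng, explore_prob=0.2):
    """Infer: pick the best known action, exploring at random now and then."""
    if rng.random() < explore_prob:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    # Break ties at random so an untrained agent does not favor one action.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

def learn(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Learn: nudge Q(state, action) toward reward + gamma * best next utility."""
    best_next = 0.0 if next_state == GOAL else max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

rng = random.Random(0)
q = defaultdict(float)
for episode in range(200):
    state = 0
    for _ in range(50):  # cap the episode length
        action = infer(q, state, rng)                # infer
        reward, next_state = step(state, action)     # do
        learn(q, state, action, reward, next_state)  # learn
        state = next_state
        if state == GOAL:
            break

# Next to the goal, moving right should now look better than moving left.
print(q[(2, 'right')] > q[(2, 'left')])
```

In the chapter's own code, `select_action` plays the role of the infer step and `update_q` the role of the learn step.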

Next, let's inherit from this superclass to implement a random decision policy. We only need to define the `select_action` method, which will randomly pick an action without even looking at the state.

```
# Listing 6.6 Implement a random decision policy
class RandomDecisionPolicy(DecisionPolicy):
    # Inherit from DecisionPolicy to implement its functions
    def __init__(self, actions):
        self.actions = actions

    # Randomly choose the next action, ignoring the state and the time step
    def select_action(self, current_state, step):
        action = self.actions[random.randint(0, len(self.actions) - 1)]
        return action
```

In listing 6.7, we assume a policy is given to us, and run it on the real world stock-price data. This function takes care of exploration and exploitation at each interval of time.


```
# Listing 6.7 Use a given policy to make decisions and return the performance
def run_simulation(policy, initial_budget, initial_num_stocks, prices, hist, debug=False):
    # Initialize values that depend on computing the net worth of a portfolio
    budget = initial_budget
    num_stocks = initial_num_stocks
    share_value = 0
    transitions = list()
    for i in range(len(prices) - hist - 1):
        if i % 100 == 0:
            print('progress {:.2f}%'.format(float(100*i) / (len(prices) - hist - 1)))
        # The state is a `hist + 2` dimensional vector
        current_state = np.asmatrix(np.hstack((prices[i:i+hist], budget, num_stocks)))
        # Calculate the portfolio value
        current_portfolio = budget + num_stocks * share_value
        # Select an action from the current policy
        action = policy.select_action(current_state, i)
        # Update portfolio values based on action
        share_value = float(prices[i + hist + 1])
        if action == 'Buy' and budget >= share_value:
            budget -= share_value
            num_stocks += 1
        elif action == 'Sell' and num_stocks > 0:
            budget += share_value
            num_stocks -= 1
        else:
            action = 'Hold'
        # Compute new portfolio value after taking action
        new_portfolio = budget + num_stocks * share_value
        # Compute the reward from taking an action at a state
        reward = new_portfolio - current_portfolio
        # Update the policy after experiencing a new action
        next_state = np.asmatrix(np.hstack((prices[i+1:i+hist+1], budget, num_stocks)))
        transitions.append((current_state, action, reward, next_state))
        policy.update_q(current_state, action, reward, next_state)
    # Compute final portfolio worth
    portfolio = budget + num_stocks * share_value
    if debug:
        print('${}\t{} shares'.format(budget, num_stocks))
    return portfolio
```

To obtain a more robust measurement of success, let's run the simulation a couple times and average the results. Doing so may take a while to complete, but your results will be more reliable.

```
# Listing 6.8 Run multiple simulations to calculate an average performance
def run_simulations(policy, budget, num_stocks, prices, hist):
    num_tries = 10
    final_portfolios = list()
    for i in range(num_tries):
        final_portfolio = run_simulation(policy, budget, num_stocks, prices, hist)
        final_portfolios.append(final_portfolio)
    avg, std = np.mean(final_portfolios), np.std(final_portfolios)
    return avg, std
```

In main, define the decision policy and try running simulations to see how it performs.

```
# Listing 6.9 Append the following lines to main
if __name__ == '__main__':
    prices = get_prices('MSFT', '1992-07-22', '2016-07-22')
    plot_prices(prices)
    actions = ['Buy', 'Sell', 'Hold']
    hist = 200
    policy = RandomDecisionPolicy(actions)
    budget = 1000.0
    num_stocks = 0
    avg, std = run_simulations(policy, budget, num_stocks, prices, hist)
    print(avg, std)
```

Now that we have a baseline to compare our results, let's implement our neural network approach to learn the Q-function. The decision policy is often called the Q-learning decision policy. The following listing 6.10 introduces a new hyper-parameter "epsilon" to keep the solution from getting "stuck" when applying the same action over and over. The lesser its value, the more often it will randomly explore new actions. The Q-function is defined by the function visualized in figure 6.9.


The input is the state space vector, with three outputs, one for each action's Q-value.

```
# Listing 6.10 Q-learning decision policy
class QLearningDecisionPolicy(DecisionPolicy):
    def __init__(self, actions, input_dim):
        # Set the hyper-parameters from the Q-function
        self.epsilon = 0.9
        self.gamma = 0.01
        self.actions = actions
        output_dim = len(actions)
        # Set the number of hidden nodes in the neural network
        h1_dim = 200
        # Define the input and output tensors
        self.x = tf.placeholder(tf.float32, [None, input_dim])
        self.y = tf.placeholder(tf.float32, [output_dim])
        # Design the neural network architecture
        W1 = tf.Variable(tf.random_normal([input_dim, h1_dim]))
        b1 = tf.Variable(tf.constant(0.1, shape=[h1_dim]))
        h1 = tf.nn.relu(tf.matmul(self.x, W1) + b1)
        W2 = tf.Variable(tf.random_normal([h1_dim, output_dim]))
        b2 = tf.Variable(tf.constant(0.1, shape=[output_dim]))
        # Define the op to calculate the utility
        self.q = tf.nn.relu(tf.matmul(h1, W2) + b2)
        # Set the loss as the square error
        loss = tf.square(self.y - self.q)
        # Use an optimizer to update the model parameters to minimize the loss
        self.train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)
        # Set up the session and initialize variables
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def select_action(self, current_state, step):
        # Explore more during early steps, then exploit up to probability epsilon
        threshold = min(self.epsilon, step / 1000.)
        if random.random() < threshold:
            # Exploit best option with probability epsilon
            action_q_vals = self.sess.run(self.q, feed_dict={self.x: current_state})
            action_idx = np.argmax(action_q_vals)
            action = self.actions[action_idx]
        else:
            # Explore random option with probability `1 - epsilon`
            action = self.actions[random.randint(0, len(self.actions) - 1)]
        return action

    # Update the Q-function by updating its model parameters
    def update_q(self, state, action, reward, next_state):
        action_q_vals = self.sess.run(self.q, feed_dict={self.x: state})
        next_action_q_vals = self.sess.run(self.q, feed_dict={self.x: next_state})
        next_action_idx = np.argmax(next_action_q_vals)
        # Update the Q-value of the action that was actually taken
        action_idx = self.actions.index(action)
        action_q_vals[0, action_idx] = reward + self.gamma * next_action_q_vals[0, next_action_idx]
        action_q_vals = np.squeeze(np.asarray(action_q_vals))
        # Retrain the neural network to update the weights, using the adjusted `action_q_vals` as the truth
        self.sess.run(self.train_op, feed_dict={self.x: state, self.y: action_q_vals})
```
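The exploration rule used by `select_action` can be tried in isolation, without a neural network. The sketch below follows the same convention as the listing (exploit with probability `threshold`, which ramps up with the step count and is capped at epsilon). The function name `epsilon_greedy` and the list of Q-values standing in for the network output are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, actions, epsilon, step, rng=random):
    """Exploit the best-known action with probability `threshold`, else explore."""
    threshold = min(epsilon, step / 1000.0)  # explore more during early steps
    if rng.random() < threshold:
        best_idx = max(range(len(actions)), key=lambda i: q_values[i])
        return actions[best_idx]
    return rng.choice(actions)

actions = ['Buy', 'Sell', 'Hold']
q_values = [0.1, 0.7, 0.2]  # made-up Q-values for a single state

# At step 0 the threshold is 0, so the choice is always random.
print(epsilon_greedy(q_values, actions, epsilon=0.9, step=0))
# With epsilon=1.0 and a late step, the best action is always exploited.
print(epsilon_greedy(q_values, actions, epsilon=1.0, step=5000))  # Sell
```

Ramping the threshold with the step count means the agent acts almost entirely at random at first and relies on its learned Q-values more as training progresses.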