# Sequence Modeling: Recurrent and Recursive Nets

a recurrent neural network is a neural network that is specialized fro processing a sequence of values $x^{(1)}, \dots, x^{(\tau)}$. Just as convolutional networks can readily scale to images with large width and height, and some convolutional networks can readily sale to images with large width an height, and some convolutional networks can process images of variable size, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization.

To go from multi-layer networks to recurrent networks, we need to take advantage of the early ideas found in machine learning and statical models of the 1980s: sharing parameters across different parts of a model.

• Parameter sharing makes it possible to extend and apply the model to examples of different forms and generalize across them.
• If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time.
• Such sharing is particular important when a specific piece of information can occur at multiple positions within the sequence.

Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so ti would need to learn all of the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.

A related idea is the use of convolution across a 1-D temporal sequence. This convolutional approach is the basis for time-delay neural networks.

• The convolution operation allows a network to share parameters across time, but is shallow. The output of convolution is a sequence where each member of the output is a function fo a small number of neighboring members of the input. The idea of parameter sharing manifests in the application of the same convolution kernel at each time step.

Recurrent networks share parameters in a different way.

• Each member of the output is produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

This chapter extends the idea of a computational graph to include cycles. These cycles represent the influence of the present value of a variable on its own value at a future time step. Such computational graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train, and use recurrent neural networks.

## Unfolding Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to output and loss. Please refer to Sec. 6.5.1 for a general introduction.

For example, consider the classical form of a dynamical system:

$s^{(t)} = f(s^{(t-1)}; \theta),$

where $s^{(t)}$ is called the state of the system.

Another example, let us consider a dynamical system driven by an external signal $x^{(t)}$,

$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta)$

where we see that the state now contains information about the whole past sequence.

Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

Many recurrent neural networks used Eq. 10.5 or a similar equation to define the values of their hidden units. To indicate that the state is the hidden units of the network, we now rewrite Eq. 10.4 using the variable $h$ to represent the state:

$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$

typical RNNs will add extra architectural features such as output layers that read information out of the state $h$ to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use $h^{(t)}$ as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to $t$.

• This summary is in general necessarily lossy, since it maps an arbitrary length sequence $(x^{(t)}, x^{(t)}, x^{(t-1)}, \dots, x^{(2)}, x^{(1)})$ to fixed length vector $h^{(t)}$.
• Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects.

The most demanding situation is when we ask $h^{(t)}$ to be rich enough to allow one to approximately recover the input sequence, as in auto-encoder frameworks.

Eq. 10.5 can be drawn in two different ways. One way to draw the RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network.

• In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state.

The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at the point in time.

• Each variable for each time step is drawn as a separate node of the computational graph, as in right of Fig. 10.2.

What we call unfolding is the operation that maps a circuit as in the left side of the figure to a computational graph with repeated pieces as in the right side. The unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after $t$ steps with a function $g^{(t)}$:

The function $g^{(t)}$ takes the whole past sequence $(...)$ as input and produces the current state, but the unfolded recurrent structure allows us to factorize $g^{(t)}$ into repeated application of a function $f$. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.
2. It is possible to use the same transition function $f$ with the same parameters at every time step.

These two factors make it possible to learn a single model $f$ that operates on all time steps and all sequence lengths, rather than needing to learn a separate model $g^{(t)}$ for all possible time steps. Learning a single, shard model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.

## Recurrent Neural Networks

Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can design a wide variety of recurrent neural networks.

Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 10.3.

• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 10,4.

• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 10.5.

We now develop the forward propagation equations for the RNN depicted in Fig. 10.3.

• The figure does not specify the choice of the activation function.
• Also, the figure does not specify exactly what form of the output and loss function take.
• Here we assume that the output is discrete, as if the RNN is used to predict words or characters.

A natural way to represent discrete variables is to regard the output $o$ as giving the unnormalized log probabilities of each possible value of the discrete variable.

• We can then apply the $\mathbf{softmax}$ operation as a post-processing step to obtain a vector $\hat{y}$ of normalized probabilities over the output.

Forward propagation begins with a specification of the initial state $h^{(0)}$. Then, for each time step from $t = 1$ to $t = \tau$, we apply the following update equations:

The total loss for a given sequence of $x$ values paired with a sequence of $y$ values would then be just the sum of the losses over all the time steps.

where $p_{model}(y^{(t)} | \{x^{(1)}, \dots, x^{(t)}\})$ is given by reading the entry for $y^{(t)}$ from the model's output vector $\hat{y}^{(t)}$

Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right followed by a backward propagation pass moving right to left through the graph.

• The runtime is $O(\tau)$ and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one.
• States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also $O(\tau)$.
• The back-propagation algorithm applied to the unrolled graph with $O(\tau)$ cost is called back-propagation through time or BPTT.

### Teacher Forcing and Networks with Output Recurrence

• The network with recurrent connections only from the output at one time step to the hidden units at the next time step is strictly less powerful because it lacks hidden-to-hidden recurrent connections. It requires that the output units capture all of the information about the past the network will use to predict the future.
• The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time $t$ to the training target at time $t$, all the time steps are decoupled.
• Training can thus be parallelized, with the gradient for each step $t$ computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output $y^{(t)}$ as input time $t + 1$. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections.

Teacher forcing may still be applied to models that have hidden-to-hidden connections so long as they have connections from the output at one time step to values computed in the next time step.

However, as soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary. Some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs fed back as input.

### Computing the Gradient in a Recurrent Neural Network

To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above. The nodes of our computational graph include the parameters $U$, $V$, $W$, $b$, and $c$ as well as the sequence of nodes indexed by $t$ for $x^{(t)}$, $h^{(t)}$, $o^{(t)}$ and $L^{(t)}$.

We start the recursion with the nodes immediately preceding the final loss

In this derivation we assume that the outputs $o^{(t)}$ are used at the argument to the softmax function to obtain the vector $\hat{y}$ of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target $y^{(t)}$ given the input so far.

We work our way backwards, starting from the end of the sequence. At the final time step $\tau$, $h^{(\tau)}$ only has $o^{(\tau)}$ as a descendent, so its gradient is simple:

We can then iterate backwards in time to back-propagate gradients through time, from $t = \tau - 1$ down to $t = 1$, noting that $h^{(t)}$ (for $t < \tau$) has a descendents both $o^{(t)}$ and $h^{(t+1})$. Its gradient is thus given by

where $\mathtt{diag}(1 - (h^{(t+1)})^2)$ indicates the diagonal matrix containing the elements $1 - (h_i^{(t+1)})^2$. This is the Jacobian of the hyperbolic tangent associated with the hidden unit $i$ at time $t + 1$.

Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes. Because the parameters are shared across many time stamps, we must take some care when denoting calculus operations involving these variables.

We do not need to compute the gradient with respect to $x^{(t)}$ for training because it does not have any parameters as ancestors in the computational graph defining the loss.

### Recurrent Networks as Directed Graphical Models

As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with the distribution to define the loss. Mean squared error is the cross-entropy loss associated with an output distribution that is a unit Gaussian, for example, just as with a feedforward network.

This may mean that we maximize the log-likelihood

or, if the model includes connections from the output at one time step to the next time step,

Decomposing the joint probability over the sequence of

 1 y

values as a series of one-step probabilistic predictions is one way to capture the full joint distribution across the whole sequence.

As a simple example, let us consider the case where the RNN models only a sequence of scalar random variables $\mathbb{Y} = \{y^{(1)},\dots, y^{(\tau)}\}$, with no additional inputs $x$.

where the right-hand side of the bar is empty for $t = 1$, of course. Hence the negative log-likelihood of a set of values $\{y^{(1)},\dots,y^{(\tau)}\}$ according to such a model is

$L = \sum_t L^{(t)}$

where

$L^{(t)} = -\log P(\mathtt{y}^{(t)} = y^{(t)}|y^{(t-1)},y^{(t-2)},\dots,y^{(1)})$

One way to interpret an RNN as a graphical model is to view the RNN as defining a graphical model whose structure is the complete graph, able to represent direct dependencies between any pair of $\mathtt{y}$ values. The graphical model over the y values with the complete graph structure is shown in Fig. 10.7. The complete graph interpretation of the RNN is based on ignoring the hidden units $h^{(t)}$ by marginalizing them out of the model.

It is more interesting to consider the graphical model structure of RNNs that results from regarding the hidden units $h^{(t)}$ as a random variables. Including the hidden units in the graphical model reveals that the RNN provides a very efficient parametrization of the joint distribution over discrete values with a tabular representation — an array containing a separate entry for each possible assignment of value, with the value of that entry giving the probability of that assignment occurring. If $y$ can take on $k$ different values, the tabular representation would have $O(k^\tau)$ parameters.

The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. Equivalently, the assumption is that the conditional probability distribution over the variables at time $t + 1$ given the variables at time $t$ is stationary, meaning that the relationship between the previous time step and the next step does not depend on $t$.

To complete our view of an RNN as a graphical model, we must describe how to draw samples from the model. The main operation that we need to perform is simply to sample from the conditional distribution at each time step. However, there is one additional complication. The RNN must have some mechanism for determining the length of the sequence. This can be achieved in various ways.

• In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence. When that symbol is generated, the sampling process stops. In the training set, we insert this symbol as an extra member of the sequence, immediately after $x^{(\tau)}$ in each training example.

• Another option is to introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step. This approach is more general than the approach of adding an extra symbol to the vocabulary, because it may be applied to any RNN, rather than only RNNs that output a sequence of symbols.

• Another way to determine the sequence length $\tau$ is to add an extra output to the model that predicts the integer $\tau$ itself. The model can sample a value of $\tau$ and then sample $\tau$ steps worth of data. This approach requires adding an extra input to the recurrent update at each time step so that the recurrent update is aware of whether it is near the end of the generated sequence.

The strategy of predicting $\tau$ directly is used for example by Goodfellow et al.

### Modeling Sequences Conditioned on Context with RNNs

In the previous section we described how an RNN could correspond to a directed graphical model over a sequence of random variables $y^{(t)}$ with no inputs $x$. Of course, our development of RNNs as in Eq. 10.8 included as a sequence of inputs $x^{(1)}, x^{(2)},\dots,x^{(\tau)}$.

In general, RNNs allow the extension of the graphical model view to represent not only a joint distribution over the $y$ variables but also a conditional distribution over $y$ given $\mathbf{x}$.

As discussed in the context of feedforward networks, any model representing a variable $P(y;\theta)$ can be reinterpreted as a model representing a conditional distribution $P(y|\omega)$ with $\omega = \theta$.

We can extend such a model to represent a distribution $P(y|x)$ by using the same $P(y|\omega)$ as before, but making $\omega$ a function of $x$. In the case of an RNN, this can be achieved in different ways. We review here the most common and obvious choices.

Previously, we have discussed RNNs that take a sequence of vectors $P(y;\theta)$ for $t = 1, \dots, \tau$ as input. Another option is to take only a single vector $x$ as input. When $x$ is a fixed-size vector, we can simply make it an extra input of the RNN that generates the $y$ sequence. Some common ways of providing an extra input to an RNN are:

1. as an extra input at each time step, or
2. as the initial state $h^{(0)}$, or
3. both.

The first and most common approach is illustrated in Fig. 10.9.

The interaction between the input $x$ and each hidden unit vector $h^{(t)}$ is parameterized by a newly introduced weight matrix $\mathbf{R}$ that was absent from the model of only the sequence of $y$ values. The same product $x^T R$ is added as additional input to the hidden units every time step.

Rather than receiving only a single vector $x$ as input, the RNN may receive a sequence of vectors $x^{(t)}$ as input. The RNN described in Eq. 10.8 corresponds to a conditional distribution $P(y^{(1)}, \dots, y^{(\tau)} | x^{(1)}, \dots, x^{(\tau)})$ that makes a conditional independence assumption that this distribution factorizes as

To remove the conditional independence assumption, we can add connections from the output at time $t$ to the hidden unit at time $t+1$, as shown in Fig. 10.10. (?)

The model can then represent arbitrary probability distributions over the $y$ sequence. This kind of model representing a distribution over a sequence given another sequence still has one restriction, which is that the length of both sequences must be the same. We describe how to remove this restriction in Sec. 10.4.

## Bidirectional RNNs

All of the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time $t$ only captures information from the past$x^{(1)}, \dots, x^{(t-1)}$, and the present input$x^{(t)}$. Some of the models we have discussed also allow information from past $y$ values to affect the current state when the $y$ values are available.

However, in many applications we want to output a prediction of $y^{(t)}$ which may depend on the whole input sequence.

• For example, in speed recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation and potentially may even depend on the next few words because of the linguistic dependencies between nearby words: if there are two interpretations of the current word that are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them. This is also true of handwriting recognition and other sequence-to-sequence learning tasks, described in the next section.

Bidirectional recurrent neural networks were invented to address that need. They have been extremely successful, in applications where that need arises, such as handwriting recognition, speech recognition, and bioinformatics.

As the name suggests, bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence.

Fig. 10.11 illustrates the typical bidirectional RNN, with $h^{(t)}$ standing for the state of the sub-RNN that moves forward through time and $g^{(t)}$ standing for the state of the sub-RNN that moves backward through time.

This allows the output units $o^(t)$ to compute a representation that depends on both the past and future but is most sensitive to the input values around time $t$, without having to specify a fixed-size window around $t$.

This idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right.

• At each point $(i, j)$ of a 2-D gird, an output $O_{i, j}$ could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information.
• Compared to a convolutional network, RNNs appleid to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map.
• Indeed ,the forward propagation equations for such RNNs may be written in a form that shows they use a convolution that computes the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the later interactions.

## Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a sequence.

Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

We often call the input to the RNN the "context." We want to produce a representation of this context, $C$. The context $C$ might be a vector a sequence of vectors that summarize the input sequence $X = (x^{(1)}, \dots, x^{(n_x)})$.

The idea is very simple:

1. an encoder or reader or input RNN processes the input sequence. The encoder emits the context $C$, usually as a simple function of its final hidden state.
2. a decoder or writer or output RNN is conditioned on that fixed-length vector (just like in Fig. 10.9) to generate the output sequence $Y = (y^{(1)}, \dots, y^{(n_y)})$.

The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths $n_x$ and $n_y$ can vary from each other, while previous architectures constrained $n_x = n_y = \tau$.

In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of $\log P(y^{(1)}, \dots, y^{(n_y)} | x^{(1)}, \dots, x^{(n_x)})$ over all the pairs of $x$ and $y$ sequences in the training set. The last state $h^{(n_x)}$ of the encoder RNN is typically used as a representation $C$ of the input sequence that is provided as input to the decoder RNN.

If the context $C$ is a vector, then decoder RNN is simply a vector-to-sequence RNN. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input.

• The input can be provided as the initial state of the RNN, or
• the input can be connected to the hidden units at each time step.

These two ways can also be combined.

There is no constraint that the encoder must have the same size of hidden layer as the decoder.

One clear limitation of this architecture is when the context $C$ output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.

• make $C$ a variable-length sequence rather that a fixed-size vector.
• attention mechanism

## Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of Fig. 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation.

• By a shallow transformation, we mean a transformation that would be represented by a single layer within a deep MLP.
• Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.

Graves et al. were the first to show a significant benefit of decomposing the state of an RNN into multiple layers as in Fig. 10.13 (left).

We can think of the lower layers in the hierarchy depicted in Fig 10.13a as playing a role in transforming the raw input into a representation that is more appropriate, at the higher levels of the hidden state.

Considerations of representational capacity suggest to allocate enough capacity in each of these three steps, but doing so by adding depth may hurt learning by making optimization difficult.

In general, it is easier to optimize shallower architectures, and adding the extra depth of Fig. 10.13b makes the shortest path from a variable in time step $t$ to a variable in time step $t+1$ become longer.

• However, this can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in Fig. 10.13c.

## Recursive Neural Networks

Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational graph for a recursive network is illustrated in Fig. 10.14.

Recursive networks have been successfully applied to processing data structures as input to neural nets, in natural language process, as well as in computer vision.

One clear advantage of recursive network over recurrent nets is that for a sequence of the same length $\tau$, the depth can be drastically reduced from $\tau$ to $O(\log \tau)$, which might help deal with long-term dependencies.

An open question is how to best structure the tree.

• One option is to have a tree structure which does not depend on the data, such as balanced binary tree.
• In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser.

## The Challenge of Long-Term Dependencies

The basic problem is that gradients propagated over many states tend to either vanish (most of the time) or explode (rarely, bet with much damage to the optimization).

Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arise from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jocobians) compared to short-term ones.

Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in Fig. 10.15.

In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

as a very simple recurrent neural network lacking a nonlinear activation function, and lacking inputs $x$. As described in Sec. 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

and if $W$ admits an eignedecomposition of the form

with orthogonal $Q$, the recurrence may be simplified further to

The eigenevalues are raised to the power of $t$ causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of $h^{(0)}$ that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight $w$ by itself many times. The product $w^t$ will either vanish or explode depending on the magnitude of $w$.

However, if we make a non-recurrent network that has a different weight $w^{(t)}$ at each time step, the situation is different.

• If the initial state is given by 1, then the state at time $t$ is given by $\prod_t w^{(t)}$.
• Suppose that the $w^{(t)}$ values are generated randomly, independently from one another, with zero mean and variance $v$. The variance of the product is $O(v^n)$.
• To obtain some desired variance $v^\ast$we may choose the individual weights with variance $v=\sqrt[n]{v^\ast}$.
• Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish.

• In practice, the experiments in * show that as we increase the span of the dependencies that need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20.

The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning.

## Echo State Networks

The recurrent weights mapping from $h^{(t-1)}$ to $h^{(t)}$ and the input weights mapping from $x^{(t)}$ to $h^{(t)}$ are some of the most difficult parameters to learn in a recurrent network.

One proposed approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and only learn the output weights. This is the idea that was independently proposed for echo state networks or ESNs and liquid state machines.

• The latter is similar, except that it uses spiking neurons instead of the continuous valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing to denote the fact that the hidden units form of reservoir of temporal features which may capture different aspects of the history of inputs.

One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines:

• They map an arbitrary length sequence (the history of inputs up to time $t$) into a fixed-length vector (the recurrent state $h^{(t)}$), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest.
• The training criterion may then be easily designed to be convex as a function of the output weights.

The import question is therefore: how do we set the input and recurrent weights so that a rich of histories can be represented in the recurrent neural network state?

• The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability. (?)

The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1.

To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix $J$ that does not change with $t$. …

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no linearity, where the state $h^{(t+1)} = {h^{(t)}}^T W.$

The strategy of each state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like $\tanh$.

More recently, it has been shown the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies.

## Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently.

• Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time , "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.

## The Long Short-Term Memory and Other Gated RNNs

This site uses Akismet to reduce spam. Learn how your comment data is processed.