Recurrent networks like LSTM and GRU are powerful sequence models. I will explain how to create recurrent networks in TensorFlow and use them for sequence classification and labelling tasks. If you are not familiar with recurrent networks, I suggest you take a look at Christopher Olah’s great article first. On the TensorFlow part, I also expect some basic knowledge. The official tutorials are a good place to start.
Defining the Network
To use recurrent networks in TensorFlow we first need to define the network architecture consisting of one or more layers, the cell type and possibly dropout between the layers. In TensorFlow, we build recurrent networks out of so-called cells that wrap each other.
import tensorflow as tf

rnn_cell = tf.nn.rnn_cell  # Module that holds the cell classes.

num_neurons = 200
num_layers = 3
# Fed at run time: probability of keeping each output activation.
dropout = tf.placeholder(tf.float32)

cell = rnn_cell.GRUCell(num_neurons)  # Or rnn_cell.LSTMCell(num_neurons)
cell = rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout)
cell = rnn_cell.MultiRNNCell([cell] * num_layers)
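To make the wrapping idea more tangible, here is a small NumPy sketch of the pattern. It is not how TensorFlow implements its cells: the class names TanhCell, KeepProbWrapper and StackedCell are made up for this illustration, and the only assumption is that every cell exposes the same call, taking an input frame and a state and returning an output and a new state. Unlike the TensorFlow snippet, the sketch creates a separate cell object for each layer.

```python
import numpy as np


class TanhCell:
    """Hypothetical minimal recurrent cell: new_state = tanh([x, h] W + b)."""

    def __init__(self, num_inputs, num_units, rng):
        self.num_units = num_units
        self.weight = rng.normal(0.0, 0.1, (num_inputs + num_units, num_units))
        self.bias = np.zeros(num_units)

    def __call__(self, inputs, state):
        joined = np.concatenate([inputs, state], axis=1)
        new_state = np.tanh(joined @ self.weight + self.bias)
        return new_state, new_state  # Output and state coincide for this cell.


class KeepProbWrapper:
    """Wraps any cell and applies dropout to its output only."""

    def __init__(self, cell, output_keep_prob, rng):
        self.cell, self.keep_prob, self.rng = cell, output_keep_prob, rng
        self.num_units = cell.num_units

    def __call__(self, inputs, state):
        output, new_state = self.cell(inputs, state)
        mask = self.rng.random(output.shape) < self.keep_prob
        return output * mask / self.keep_prob, new_state


class StackedCell:
    """Feeds the output of each layer into the next; keeps one state per layer."""

    def __init__(self, cells):
        self.cells = cells

    def __call__(self, inputs, states):
        new_states = []
        for cell, state in zip(self.cells, states):
            inputs, new_state = cell(inputs, state)
            new_states.append(new_state)
        return inputs, new_states


rng = np.random.default_rng(0)
num_features, num_neurons, num_layers, batch_size = 28, 200, 3, 4
# The first layer sees the input features, later layers see num_neurons values.
sizes = [num_features] + [num_neurons] * (num_layers - 1)
cell = StackedCell([
    KeepProbWrapper(TanhCell(size, num_neurons, rng), 0.5, rng)
    for size in sizes])

frame = rng.normal(size=(batch_size, num_features))
states = [np.zeros((batch_size, num_neurons)) for _ in range(num_layers)]
output, states = cell(frame, states)
print(output.shape, len(states))
```

Because every wrapper has the same call signature as the cell it holds, dropout and stacking can be combined in any order without the inner cell knowing about it.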
Simulating Time Steps
We can now add the operations to the graph that simulate the recurrent network
over the time steps of the input. We do this using TensorFlow’s dynamic_rnn()
operation. It takes a tensor holding the batch of input sequences and returns
the output activations and the last hidden state as tensors.
max_length = 100
# Batch size x time steps x features.
data = tf.placeholder(tf.float32, [None, max_length, 28])
output, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
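Conceptually, simulating the time steps means applying the same cell to one frame after another while carrying the state along. The NumPy sketch below spells out that loop. The helpers simple_cell and run_over_time are hypothetical and only serve the illustration; the sketch makes no claim about how dynamic_rnn() works internally.

```python
import numpy as np


def simple_cell(frame, state, weight, bias):
    """Hypothetical stand-in for a recurrent cell with a tanh activation."""
    new_state = np.tanh(np.concatenate([frame, state], axis=1) @ weight + bias)
    return new_state, new_state


def run_over_time(data, num_units, weight, bias):
    """Apply the cell to every time step of a batch x time x features array."""
    batch_size, max_length, _ = data.shape
    state = np.zeros((batch_size, num_units))
    outputs = []
    for step in range(max_length):
        output, state = simple_cell(data[:, step, :], state, weight, bias)
        outputs.append(output)
    # Stack along a new time axis to get batch x time x units.
    return np.stack(outputs, axis=1), state


rng = np.random.default_rng(0)
batch_size, max_length, num_features, num_units = 5, 100, 28, 16
data = rng.normal(size=(batch_size, max_length, num_features))
weight = rng.normal(0.0, 0.1, (num_features + num_units, num_units))
bias = np.zeros(num_units)

output, state = run_over_time(data, num_units, weight, bias)
print(output.shape, state.shape)
```

For this simple cell the returned state is the same as the output at the final time step, which is why the two return values can look redundant. For cells such as LSTM, the state carries more than the output.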
Static Unrolling in Time
This is not needed anymore, but TensorFlow also supports an rnn() operation
that creates the compute nodes for a fixed number of time steps. You can think
of this as calling the same function ten times in a row rather than
using a loop.
max_length = 100
data = tf.placeholder(tf.float32, [None, max_length, 28])
# unpack_sequence() and pack_sequence() are defined further down.
outputs, state = tf.nn.rnn(cell, unpack_sequence(data), dtype=tf.float32)
output = pack_sequence(outputs)
In contrast to dynamic_rnn(), the function takes and returns Python lists of
tensor frames. Thus we need tf.unpack() to split our data tensor into a list
of frames and tf.pack() to merge the output sequence back into a single tensor.
def unpack_sequence(tensor):
    """Split the single tensor of a sequence into a list of frames."""
    return tf.unpack(tf.transpose(tensor, perm=[1, 0, 2]))

def pack_sequence(sequence):
    """Combine a list of the frames into a single tensor of the sequence."""
    return tf.transpose(tf.pack(sequence), perm=[1, 0, 2])
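The shape bookkeeping of these two helpers can be checked without TensorFlow. The NumPy sketch below mirrors them, assuming that np.transpose and np.stack play the same role here as the TensorFlow calls. The function names with the _np suffix are made up for this illustration.

```python
import numpy as np


def unpack_sequence_np(array):
    """Split a batch x time x features array into a list of batch x features frames."""
    time_major = np.transpose(array, (1, 0, 2))
    return [frame for frame in time_major]


def pack_sequence_np(frames):
    """Stack a list of batch x features frames into a batch x time x features array."""
    return np.transpose(np.stack(frames), (1, 0, 2))


# Two sequences, three time steps, four features.
data = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
frames = unpack_sequence_np(data)
restored = pack_sequence_np(frames)
print(len(frames), frames[0].shape, restored.shape)
```

Each list entry holds one time step for the whole batch, and packing the list again restores the original array.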
For classification, you might only care about the output activation at the last
time step. We transpose so that the time axis is first and use tf.gather() for
selecting the last frame. We can’t just use output[-1] because unlike Python
lists, TensorFlow doesn’t support negative indexing yet.
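output, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
# Time steps x batch size x neurons.
output = tf.transpose(output, [1, 0, 2])
# The first dimension is now time, so its size minus one is the last frame.
last = tf.gather(output, int(output.get_shape()[0]) - 1)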
The code below adds a softmax classifier on top of the last activation and defines the cross entropy cost function. For now we assume sequences to be equal in length but I will cover variable length sequences in another post. Here is the complete gist for sequence classification.
# target has the shape batch size x classes.
out_size = int(target.get_shape()[1])
weight = tf.Variable(tf.truncated_normal([num_neurons, out_size], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = -tf.reduce_sum(target * tf.log(prediction))
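The same computation can be traced in plain NumPy, which is handy for checking shapes. This is only a sketch under assumptions: random numbers stand in for the recurrent activations, the targets are taken to be one-hot, and the softmax helper is written by hand rather than taken from a library.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
batch_size, max_length, num_neurons, out_size = 4, 10, 8, 3

# Stand-in for the recurrent activations: batch x time x neurons.
output = rng.normal(size=(batch_size, max_length, num_neurons))
# One-hot targets: batch x classes.
target = np.eye(out_size)[rng.integers(0, out_size, batch_size)]

# Put time first and pick the final frame, as in the TensorFlow code.
time_major = np.transpose(output, (1, 0, 2))
last = time_major[time_major.shape[0] - 1]

weight = rng.normal(0.0, 0.1, (num_neurons, out_size))
bias = np.full(out_size, 0.1)
prediction = softmax(last @ weight + bias)
# Summed over the batch and the classes, like the cost above.
cross_entropy = -np.sum(target * np.log(prediction))
print(prediction.shape, cross_entropy)
```

Every row of prediction sums to one, and the cost only picks up the log probability of the correct class because the targets are one-hot.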
For sequence labelling, we want a prediction for each time step. However, we share the weights for the softmax layer across all time steps. How do we do that? By flattening the first two dimensions of the output tensor. This way time steps look the same as examples in the batch to the weight matrix. Afterwards, we reshape back to the desired shape.
# target has the shape batch size x time steps x classes.
max_length = int(target.get_shape()[1])
out_size = int(target.get_shape()[2])
weight = tf.Variable(tf.truncated_normal([num_neurons, out_size], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
# Flatten batch and time so that one weight matrix serves all time steps.
output = tf.reshape(output, [-1, num_neurons])
prediction = tf.nn.softmax(tf.matmul(output, weight) + bias)
prediction = tf.reshape(prediction, [-1, max_length, out_size])
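To see that the flattening trick really is the same as applying the layer to each time step separately, the NumPy sketch below computes both and compares them. It uses random stand-in activations and stops at the logits, leaving out the softmax, to keep the comparison short.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_length, num_neurons, out_size = 4, 6, 8, 3
output = rng.normal(size=(batch_size, max_length, num_neurons))
weight = rng.normal(0.0, 0.1, (num_neurons, out_size))
bias = np.full(out_size, 0.1)

# Flatten batch and time into one axis, apply the layer once, reshape back.
flat = output.reshape(-1, num_neurons)
logits = (flat @ weight + bias).reshape(-1, max_length, out_size)

# Reference: apply the same layer to each time step separately.
per_step = np.stack(
    [output[:, step, :] @ weight + bias for step in range(max_length)],
    axis=1)

print(logits.shape, np.allclose(logits, per_step))
```

The reshape works because the rows of the flattened matrix are ordered by sequence first and time step second, so reshaping back puts every row where it came from.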
Let’s say we predict a class for each frame, so we keep using cross entropy as our cost function. Here we have a prediction and target for every time step. We thus sum the cross entropy over the time steps and classes of each sequence and then average over the sequences in the batch. Here is the complete gist for sequence labelling.
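# Sum over time steps and classes, giving one cost per sequence.
cross_entropy = -tf.reduce_sum(target * tf.log(prediction), [1, 2])
# Average over the sequences in the batch.
cross_entropy = tf.reduce_mean(cross_entropy)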
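A tiny worked example may help to check the two reduction steps by hand. The NumPy sketch below assumes uniform predictions over two classes and one-hot targets, so every time step contributes ln 2 to the cost of its sequence.

```python
import numpy as np

batch_size, max_length, out_size = 2, 3, 2
# Uniform predictions: every class gets probability 0.5 at every time step.
prediction = np.full((batch_size, max_length, out_size), 0.5)
# One-hot targets that always pick class 0.
target = np.zeros((batch_size, max_length, out_size))
target[:, :, 0] = 1.0

# Sum over time steps and classes to get one cost per sequence.
per_sequence = -np.sum(target * np.log(prediction), axis=(1, 2))
# Then average over the sequences in the batch.
cross_entropy = np.mean(per_sequence)
print(per_sequence, cross_entropy)
```

With three time steps, each sequence costs 3 ln 2, and the batch average is the same value because both sequences are identical.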
That’s all. We learned how to construct recurrent networks in TensorFlow and use them for sequence learning tasks. Please ask any questions below if you couldn’t follow.
Updated 2016-08-17: TensorFlow 0.10 moved the recurrent network operations
from tf.models.rnn into the tf.nn package where they now live alongside the
other neural network operations. Cells can now be found in tf.nn.rnn_cell.
Updated 2016-05-20: TensorFlow 0.8 introduced dynamic_rnn(), which makes it
possible to train on longer sequences on the GPU, because activations needed to
compute the gradients can be swapped to main memory. The function also expects
and returns tensors directly, so we do not need to convert to and from Python lists.