Recurrent networks like LSTM and GRU are powerful sequence models. I will explain how to create recurrent networks in TensorFlow and use them for sequence classification and labelling tasks. If you are not familiar with recurrent networks, I suggest you take a look at Christopher Olah’s great article first. I also expect some basic knowledge of TensorFlow; the official tutorials are a good place to start.

Defining the Network

To use recurrent networks in TensorFlow, we first need to define the network architecture, consisting of one or more layers, the cell type, and possibly dropout between the layers. In TensorFlow, we build recurrent networks out of so-called cells that wrap each other.

import tensorflow as tf

num_neurons = 200
num_layers = 3
dropout = tf.placeholder(tf.float32)  # Keep probability for dropout, fed at run time.

cell = tf.nn.rnn_cell.GRUCell(num_neurons)  # Or tf.nn.rnn_cell.LSTMCell(num_neurons).
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)
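
All of these wrappers expose the same cell interface, so the stacked cell can be used wherever a single cell could. As a quick check, its output size is that of the topmost layer:

print(cell.output_size)  # 200, the output size of the topmost layer.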

Simulating Time Steps

We can now add the operations to the graph that simulate the recurrent network over the time steps of the input. We do this using TensorFlow’s dynamic_rnn() operation. It takes a tensor holding the batch of input sequences and returns the output activations and the last hidden state as tensors.

max_length = 100

# Batch size x time steps x features.
data = tf.placeholder(tf.float32, [None, max_length, 28])
output, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
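
The output tensor holds the activations of the topmost layer at every time step. With the cell defined above, we can inspect its static shape; the batch size is still unknown at this point:

print(output.get_shape())  # Batch size x time steps x num_neurons, e.g. (?, 100, 200).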

Static Unrolling in Time

This is not needed anymore, but TensorFlow also supports an rnn() operation that creates the compute nodes for a fixed number of time steps. You can think of this as calling the same function once per time step, one after another, rather than using a loop.

max_length = 100

data = tf.placeholder(tf.float32, [None, max_length, 28])
outputs, state = tf.nn.rnn(cell, unpack_sequence(data), dtype=tf.float32)
output = pack_sequence(outputs)

In contrast to dynamic_rnn(), the function takes and returns Python lists of tensor frames. Thus we use tf.unpack() to split our data tensor into a list of frames and tf.pack() to merge the list of output frames back into a single tensor.

def unpack_sequence(tensor):
    """Split the single tensor of a sequence into a list of frames."""
    return tf.unpack(tf.transpose(tensor, perm=[1, 0, 2]))

def pack_sequence(sequence):
    """Combine a list of the frames into a single tensor of the sequence."""
    return tf.transpose(tf.pack(sequence), perm=[1, 0, 2])
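
To illustrate what the helpers do, unpacking the data tensor from above yields one frame per time step, each holding a batch of feature vectors:

frames = unpack_sequence(data)
print(len(frames))            # 100 frames, one per time step.
print(frames[0].get_shape())  # Batch size x features, here (?, 28).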

Sequence Classification

For classification, you might only care about the output activation at the last time step. We transpose so that the time axis is first and use tf.gather for selecting the last frame. We can’t just use output[-1] because unlike Python lists, TensorFlow doesn’t support negative indexing yet.

output, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
output = tf.transpose(output, [1, 0, 2])
last = tf.gather(output, int(output.get_shape()[0]) - 1)
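
The gathered frame has shape batch size x num_neurons, which is exactly what the classifier below expects as input:

print(last.get_shape())  # (?, 200)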

The code below adds a softmax classifier on top of the last activation and defines the cross-entropy cost function. For now, we assume all sequences have the same length; I will cover variable-length sequences in another post. Here is the complete gist for sequence classification.

# For classification, the target has shape batch size x classes.
out_size = int(target.get_shape()[1])
weight = tf.Variable(tf.truncated_normal([num_neurons, out_size], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = -tf.reduce_sum(target * tf.log(prediction))
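
A minimal training sketch could attach an optimizer to the cost and feed the dropout keep probability with every run, using 1.0 to disable dropout during evaluation. The choice of optimizer is arbitrary here, batch_data and batch_target stand in for your own numpy batches, and target is assumed to be a placeholder just like data:

# Sketch only: the optimizer choice and the batch variables are assumptions.
optimize = tf.train.RMSPropOptimizer(0.003).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
sess.run(optimize, {data: batch_data, target: batch_target, dropout: 0.5})
print(sess.run(cross_entropy, {data: batch_data, target: batch_target, dropout: 1.0}))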

Sequence Labelling

For sequence labelling, we want a prediction at every time step. However, we share the weights of the softmax layer across all time steps. How do we do that? By flattening the first two dimensions of the output tensor. This way, time steps look just like additional examples in the batch to the weight matrix. Afterwards, we reshape the result back to the desired shape.

max_length = int(target.get_shape()[1])
out_size = int(target.get_shape()[2])

weight = tf.Variable(tf.truncated_normal([num_neurons, out_size], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[out_size]))

output = tf.reshape(output, [-1, num_neurons])
prediction = tf.nn.softmax(tf.matmul(output, weight) + bias)
prediction = tf.reshape(prediction, [-1, max_length, out_size])

Since we predict a class for each frame, we keep using cross entropy as our cost function. Here we have a prediction and a target for every time step. We thus sum the cross entropy over the time steps and classes of each sequence and then average over the batch. Here is the complete gist for sequence labelling.

cross_entropy = -tf.reduce_sum(target * tf.log(prediction), [1, 2])
cross_entropy = tf.reduce_mean(cross_entropy)
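
Besides the cost, it is often useful to monitor how many frames get classified incorrectly. One simple option, for example, is a per-frame error rate that compares the predicted and the true class at every time step:

# Fraction of frames whose predicted class differs from the target class.
mistakes = tf.not_equal(tf.argmax(target, 2), tf.argmax(prediction, 2))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))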

That’s all. We learned how to construct recurrent networks in TensorFlow and use them for sequence learning tasks. Please ask any questions below if anything was unclear.

Updated 2016-08-17: TensorFlow 0.10 moved the recurrent network operations from tf.models.rnn into the tf.nn package, where they now live alongside the other neural network operations. Cells can now be found in tf.nn.rnn_cell.

Updated 2016-05-20: TensorFlow 0.8 introduced dynamic_rnn(), which makes it possible to train on longer sequences on the GPU, because activations needed to compute the gradients can be swapped out to main memory. The function also expects and returns tensors directly, so we do not need to convert to and from Python lists anymore.