Some practical tricks for training recurrent neural networks:

Optimization Setup

  • Adaptive learning rate. We usually use adaptive optimizers such as Adam (Kingma14) because they handle the complex training dynamics of recurrent networks better than plain gradient descent.
  • Gradient clipping. Print or plot the gradient norm to see its usual range, then scale down gradients that exceed this range. This prevents spikes in the gradients from messing up the parameters during training.
  • Normalizing the loss. To get losses of similar magnitude across datasets, you can sum the loss terms along the sequence and divide the sum by the maximum sequence length. This makes it easier to reuse hyperparameters between experiments. The loss should be averaged across the batch.
  • Truncated backpropagation. Recurrent networks can have a hard time learning long sequences because of vanishing and noisy gradients. Train on overlapping chunks of about 200 steps instead. You can also gradually increase the chunk length during training. Preserve the hidden state across chunk boundaries.
  • Long training time. Especially in language modeling, small improvements in loss can make a big difference in the perceived quality of the model. Stop training when the training loss does not improve for multiple epochs or the evaluation loss starts increasing.
  • Multi-step loss. When training generative sequence models, there is a trade-off between 1-step losses (teacher forcing) and training longer imagined sequences towards matching the target (Chiappa17). Professor forcing (Goyal17) combines the two but is more involved.
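
To make the gradient clipping tip concrete, here is a minimal sketch in plain Python that rescales a set of gradients by their joint L2 norm. The function name clip_by_global_norm, the nested-list representation, and the threshold of 6.5 are my own assumptions for illustration. In practice you would use your framework's built-in clipping and pick the threshold from the norms you observed.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter).
    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is the value worth printing or plotting during training.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Within the usual range: leave the gradients untouched.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    # Same direction, smaller magnitude.
    return [[g * scale for g in vec] for vec in grads], total_norm


# Hypothetical gradients with a joint norm of 13.
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Scaling all gradients by the same factor keeps the update direction intact, which is why clipping by the joint norm is usually preferred over clipping each value separately.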
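
The loss normalization described above can be sketched as follows. This assumes you already have a per-step loss for every sequence in the batch, padded to a common length, plus the true length of each sequence. The function name and the example numbers are made up for illustration.

```python
def normalized_sequence_loss(step_losses, lengths, max_length):
    """Sum losses over time, divide by max_length, average over the batch.

    step_losses[b][t] is the loss of sequence b at time step t.
    lengths[b] is the number of valid (non-padding) steps of sequence b.
    max_length is a fixed constant, for example the longest sequence in
    the dataset, so the scale does not depend on the current batch.
    """
    per_sequence = []
    for losses, length in zip(step_losses, lengths):
        total = sum(losses[:length])             # ignore padding steps
        per_sequence.append(total / max_length)  # normalize by maximum length
    return sum(per_sequence) / len(per_sequence)  # average across the batch


# Two padded sequences of true length 3 and 2; the 9.9 entries are padding.
step_losses = [[1.0, 2.0, 3.0, 0.0], [2.0, 2.0, 9.9, 9.9]]
lengths = [3, 2]
loss = normalized_sequence_loss(step_losses, lengths, max_length=4)
```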
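
For truncated backpropagation, the chunking itself can be sketched like this. The helper chunk_bounds and its overlap parameter are hypothetical; the tip above only says to use overlapping chunks of about 200 steps, so the overlap of 50 is an arbitrary choice.

```python
def chunk_bounds(sequence_length, chunk_length, overlap=0):
    """Return (start, end) index pairs that cover a sequence in chunks.

    Consecutive chunks share `overlap` time steps. The last chunk may be
    shorter than chunk_length.
    """
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be in [0, chunk_length)")
    stride = chunk_length - overlap
    bounds = []
    start = 0
    while start < sequence_length:
        end = min(start + chunk_length, sequence_length)
        bounds.append((start, end))
        if end == sequence_length:
            break
        start += stride
    return bounds


# A 500-step sequence split into chunks of 200 steps that overlap by 50.
bounds = chunk_bounds(sequence_length=500, chunk_length=200, overlap=50)
```

During training you would run the network on each chunk in order and feed the final hidden state of one chunk in as the initial state of the next, without backpropagating through it. With overlapping chunks, the state to carry over is the one at the step where the next chunk begins.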

Network Structure

  • Gated Recurrent Unit. The GRU (Cho14) is an alternative memory cell design to the LSTM. I found that it often reaches equal or better performance while using fewer parameters and being faster to compute.
  • Layer normalization. Adding layer normalization (Ba16) to all linear mappings of the recurrent network speeds up learning and often improves final performance. Multiple inputs to the same mapping should be normalized separately as done in the paper.
  • Feed-forward layers first. Preprocessing the input with feed-forward layers allows your model to project the data into a space with easier temporal dynamics. This can improve performance on the task.
  • Stacked recurrent networks. The number of weights in a recurrent layer grows quadratically with the layer size. It can be more efficient to stack two or three smaller layers instead of one big one. Sum the outputs of all layers instead of using only the last one, similar to a ResNet or DenseNet.
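
To illustrate the quadratic growth mentioned in the stacking tip, here is a rough parameter count. It assumes an LSTM-style layer with four gates and one bias vector per gate (some frameworks store two), and the sizes 64, 256 and 512 are made up for the example.

```python
def recurrent_layer_params(input_size, hidden_size, gates=4):
    """Count the weights of one gated recurrent layer.

    Each gate has an input matrix (hidden x input), a recurrent matrix
    (hidden x hidden), and a bias vector (hidden). The recurrent matrix is
    what makes the count quadratic in hidden_size.
    """
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return gates * per_gate


def stacked_params(input_size, hidden_sizes, gates=4):
    """Count the weights of several layers stacked on top of each other."""
    total = 0
    for hidden_size in hidden_sizes:
        total += recurrent_layer_params(input_size, hidden_size, gates)
        input_size = hidden_size  # the next layer reads this layer's output
    return total


one_big = stacked_params(64, [512])        # a single layer of 512 units
two_small = stacked_params(64, [256, 256])  # two stacked layers of 256 units
```

With these numbers the two smaller layers need fewer weights than the single large one. Whether the stack also performs better depends on the task.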

Model Parameters

  • Learned initial state. Initializing the hidden state as zeros can cause large loss terms for the first few time steps, so that the model focuses less on the actual sequence. Training the initial state as a variable can improve performance as described in this post.
  • Forget gate bias. It can take a while for a recurrent network to learn to remember information from the last time step. Initialize biases for LSTM’s forget gate to 1 to remember more by default. Similarly, initialize biases for GRU’s reset gate to -1.
  • Regularization. If your model is overfitting, use regularization methods designed for recurrent networks, for example recurrent dropout (Semeniuta16) or Zoneout (Krueger17).
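
The forget gate tip can be sketched as follows. This assumes the LSTM stores the biases of all gates in one flat vector in the order input, forget, cell, output. That order differs between frameworks, so check yours before copying this. The function name is hypothetical.

```python
import math


def init_lstm_bias(hidden_size, forget_bias=1.0,
                   gate_order=("input", "forget", "cell", "output")):
    """Build a flat LSTM bias vector that is zero except for the forget gate."""
    bias = [0.0] * (len(gate_order) * hidden_size)
    start = gate_order.index("forget") * hidden_size
    for i in range(start, start + hidden_size):
        bias[i] = forget_bias
    return bias


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


bias = init_lstm_bias(hidden_size=3)

# With zero pre-activation, the gate value equals sigmoid(bias).
default_gate = sigmoid(0.0)  # half of the cell state is kept
biased_gate = sigmoid(1.0)   # a larger share is kept by default
```

A bias of 1 moves the initial forget gate activation from 0.5 to roughly 0.73, so more of the cell state survives each step before the network has learned anything.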

I hope this collection of tips is helpful. Please feel free to post further suggestions for the list and ask questions.