Survey of existing interfaces and implementations

Some commonly used deep learning libraries with good RNN / LSTM support include Theano and its wrappers Lasagne and Keras; CNTK; TensorFlow; and various implementations in Torch, such as karpathy/char-rnn and Element-Research/rnn.

Theano

RNN support in Theano is provided by its scan operator, which allows constructing a loop whose number of iterations is given by the runtime value of a symbolic variable. An official example of an LSTM implementation with scan can be found here.
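To make the interface concrete, below is a minimal sketch of a vanilla RNN wired up with scan. This is my own illustration rather than the official example; the variable and weight names (x, h0, W_xh, W_hh) are arbitrary:

import numpy as np
import theano
import theano.tensor as T

feat_dim, hidden_dim = 10, 20

x = T.tensor3('x')    # (sequence_length, batch_size, feat_dim)
h0 = T.matrix('h0')   # (batch_size, hidden_dim)
W_xh = theano.shared(np.random.randn(feat_dim, hidden_dim).astype('float32'))
W_hh = theano.shared(np.random.randn(hidden_dim, hidden_dim).astype('float32'))

def step(x_t, h_prev, W_xh, W_hh):
    # One time step; scan calls this once per element along x's first axis.
    return T.tanh(T.dot(x_t, W_xh) + T.dot(h_prev, W_hh))

# The number of iterations is taken from x.shape[0] at runtime.
h_seq, updates = theano.scan(step,
                             sequences=x,
                             outputs_info=h0,
                             non_sequences=[W_xh, W_hh])
rnn = theano.function([x, h0], h_seq, updates=updates)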

Implementation

I’m not very familiar with the Theano internals, but it seems from theano/scan_module/scan_op.py#execute that the scan operator is implemented with a loop in Python that performs one iteration at a time:

fn = self.fn.fn

while (i < n_steps) and cond:
    # ...
    fn()

The grad function in Theano constructs a symbolic graph for computing gradients. So the grad for the scan operator is actually implemented by constructing another scan operator:

local_op = Scan(inner_gfn_ins, inner_gfn_outs, info)
outputs = local_op(*outer_inputs)

The performance guide for Theano’s scan operator suggests minimizing the use of scan. This is likely because the loop is executed in Python, which adds per-iteration overhead (switching back and forth between the Python interpreter and the compiled ops, plus the speed of Python itself). Moreover, since no unrolling is performed, the graph optimizer cannot see the big picture and optimize across iterations.

If I understand correctly, when multiple RNN/LSTM layers are stacked, the computation does not run as a single loop in which each iteration evaluates the whole feedforward stack; instead, a separate scan loop is run for each layer, one after another, as sketched below. This is fine if all the intermediate values are stored anyway to support computing the gradients. Otherwise, using a single loop could be more memory efficient.
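The following sketch (based on my reading above, with arbitrary names) shows how stacking two recurrent layers naturally leads to two sequential scan loops, where the second loop only starts after the first has produced its whole output sequence:

import theano
import theano.tensor as T

x = T.tensor3('x')      # (sequence_length, batch_size, feat_dim)
h0_1 = T.matrix('h0_1')
h0_2 = T.matrix('h0_2')
W1, U1 = T.matrix('W1'), T.matrix('U1')
W2, U2 = T.matrix('W2'), T.matrix('U2')

def step(x_t, h_prev, W, U):
    return T.tanh(T.dot(x_t, W) + T.dot(h_prev, U))

# Layer 1: one scan loop over the whole input sequence.
h1, _ = theano.scan(step, sequences=x, outputs_info=h0_1,
                    non_sequences=[W1, U1])
# Layer 2: a second, separate scan loop over the output sequence of layer 1.
h2, _ = theano.scan(step, sequences=h1, outputs_info=h0_2,
                    non_sequences=[W2, U2])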

Lasagne

The documentation for RNN in Lasagne can be found here. In Lasagne, a recurrent layer is used just like a standard layer, except that the input shape is expected to be (batch_size, sequence_length, feature_dimension). The output shape is then (batch_size, sequence_length, output_dimension).

Both batch_size and sequence_length can be specified as None and inferred from the data. Alternatively, when memory is sufficient and the (maximum) sequence length is known a priori, the user can set unroll_scan to True. Lasagne will then unroll the graph explicitly instead of using Theano's scan operator. Explicit unrolling is implemented in utils.py#unroll_scan.

The recurrent layer also accepts a mask_input to support variable-length sequences (sequences can have different lengths even within a mini-batch). The mask is of shape (batch_size, sequence_length).
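A minimal sketch of this interface (the layer sizes and names are mine, not from the documentation):

import lasagne

num_features, num_units = 40, 128

# batch_size and sequence_length are left as None and inferred from the data.
l_in = lasagne.layers.InputLayer(shape=(None, None, num_features))
l_mask = lasagne.layers.InputLayer(shape=(None, None))
l_lstm = lasagne.layers.LSTMLayer(l_in, num_units, mask_input=l_mask)

# With a known (maximum) sequence length, unroll_scan=True builds the
# explicitly unrolled graph instead of using the scan operator; the
# sequence length must then be fixed in the input shape.
l_in_fixed = lasagne.layers.InputLayer(shape=(None, 50, num_features))
l_lstm_unrolled = lasagne.layers.LSTMLayer(l_in_fixed, num_units,
                                           unroll_scan=True)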

Keras

The documentation for RNN in Keras can be found here. The interface in Keras is similar to that of Lasagne. The input is expected to be of shape (batch_size, sequence_length, feature_dimension), and the output shape (if return_sequences is True) is (batch_size, sequence_length, output_dimension).
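A minimal sketch of this interface (the layer sizes are arbitrary, and the import paths are those of recent Keras versions):

from keras.models import Sequential
from keras.layers import LSTM

sequence_length, feature_dimension, hidden_units = 50, 40, 128

# The batch dimension is implicit; with return_sequences=True each LSTM
# layer emits the full output sequence, so layers can be stacked.
model = Sequential()
model.add(LSTM(hidden_units, return_sequences=True,
               input_shape=(sequence_length, feature_dimension)))
model.add(LSTM(hidden_units, return_sequences=True))
model.compile(loss='mse', optimizer='rmsprop')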

Keras currently supports both Theano and TensorFlow backends. RNN for the Theano backend is implemented with the scan operator. For TensorFlow, it seems to be implemented via explicit unrolling. The documentation says that for the TensorFlow backend the sequence length must be specified a priori, and that masking is currently not working (because tf.reduce_any is not functioning yet).

Torch

karpathy/char-rnn is implemented via explicit unrolling. Element-Research/rnn, on the contrary, runs the sequence iteration in Lua. It has a very modular design:

  • The basic RNN/LSTM modules run only one time step per call to forward (and accumulate / store the information needed to support backward computation, if required), so users have fine-grained control when using this API directly.
  • A collection of Sequencer modules is defined to model common scenarios such as forward sequences, bi-directional sequences, attention models, etc.
  • Other utility modules, such as masking, support variable-length sequences.

CNTK

CNTK looks quite different from the other common deep learning libraries, and I do not understand it very well yet. I will talk with Yu to get more details.

It seems the basic data types are matrices (although there is also a TensorView utility class). A mini-batch of sequence data is packed into a matrix whose number of rows is feature_dimension and whose number of columns is sequence_length * batch_size (see Figure 2.9 on page 50 of the CNTKBook).

Recurrent networks are first-class citizens in CNTK. Section 5.2.1.8 of the CNTKBook shows an example of a customized computation node. The node needs to explicitly define both a standard forward function and a forward function taking a time index, which is used for RNN evaluation:

virtual void EvaluateThisNode()
{
    EvaluateThisNodeS(FunctionValues(), Inputs(0)->FunctionValues(),
                      Inputs(1)->FunctionValues());
}

virtual void EvaluateThisNode(const size_t timeIdxInSeq)
{
    Matrix<ElemType> sliceInput1Value = Inputs(1)->FunctionValues().ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    Matrix<ElemType> sliceOutputValue = m_functionValues.ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    EvaluateThisNodeS(sliceOutputValue, Inputs(0)->FunctionValues(),
                      sliceInput1Value);
}

The function ColumnSlice(start_col, num_col) extracts the packed data for the given time index as described above (here m_samplesInRecurrentStep is the mini-batch size).
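As an illustration (in NumPy, mirroring my reading of the snippet above rather than any actual CNTK code), the packed layout and the effect of ColumnSlice look like this:

import numpy as np

feature_dimension, sequence_length, batch_size = 3, 4, 2

# data[:, :, t] holds the mini-batch of samples at time step t.
data = np.random.randn(feature_dimension, batch_size, sequence_length)

# Pack time-major: columns [t*batch_size : (t+1)*batch_size] are time step t.
packed = data.transpose(0, 2, 1).reshape(feature_dimension,
                                         sequence_length * batch_size)

def column_slice(matrix, start_col, num_cols):
    # NumPy analogue of Matrix::ColumnSlice(start_col, num_cols).
    return matrix[:, start_col:start_col + num_cols]

t = 2
slice_t = column_slice(packed, t * batch_size, batch_size)
assert np.allclose(slice_t, data[:, :, t])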

The low-level API for recurrent connections seems to be a delay node, but I am not sure how to use this low-level API. The PTB language model example uses a very high-level API (simply setting recurrentLayer = 1 in the config).

TensorFlow

The current RNNLM example in TensorFlow uses explicit unrolling for a predefined number of time steps. The white paper mentions a more advanced control-flow API (similar to Theano's scan) coming in the future.
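To show what explicit unrolling means here, below is a sketch in TensorFlow 1.x-style graph code (not the actual RNNLM example; names and sizes are mine). The Python loop bakes num_steps copies of the step computation into the graph, so the number of time steps is fixed at graph-construction time:

import tensorflow as tf

num_steps, batch_size, input_dim, hidden_dim = 20, 32, 100, 128

x = tf.placeholder(tf.float32, [batch_size, num_steps, input_dim])
W_xh = tf.Variable(tf.random_normal([input_dim, hidden_dim], stddev=0.1))
W_hh = tf.Variable(tf.random_normal([hidden_dim, hidden_dim], stddev=0.1))
b_h = tf.Variable(tf.zeros([hidden_dim]))

h = tf.zeros([batch_size, hidden_dim])
outputs = []
for t in range(num_steps):
    # Each iteration adds another copy of the step computation to the graph.
    x_t = x[:, t, :]
    h = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(h, W_hh) + b_h)
    outputs.append(h)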