Survey of Existing Interfaces and Implementations¶
Commonly used deep learning libraries with good RNN/LSTM support include Theano and its wrappers Lasagne and Keras; CNTK; TensorFlow; and various implementations in Torch, such as the well-known character-level language model example karpathy/char-rnn and the Element-Research/rnn package.
In this document, we present a comparative analysis of the approaches taken by these libraries.
Theano¶
In Theano, RNN support comes via its scan operator, which allows construction of a loop where the number of iterations is specified as a runtime value of a symbolic variable. You can find an official example of an LSTM implementation with scan here.
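To make the interface concrete, here is a minimal sketch of a vanilla RNN written with scan (my own illustrative names and shapes, not the official example):

import theano
import theano.tensor as T

# X: (sequence_length, batch_size, feature_dimension); h0: (batch_size, hidden_dimension)
X = T.tensor3('X')
h0 = T.matrix('h0')
W_xh = T.matrix('W_xh')
W_hh = T.matrix('W_hh')

def step(x_t, h_prev, W_xh, W_hh):
    # one time step of a vanilla RNN
    return T.tanh(T.dot(x_t, W_xh) + T.dot(h_prev, W_hh))

# scan loops over the first axis of X; the number of iterations is a runtime value
h_seq, updates = theano.scan(step, sequences=X, outputs_info=h0,
                             non_sequences=[W_xh, W_hh])

Each call to step corresponds to one iteration of the loop, and scan collects the per-step hidden states into h_seq.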
Implementation¶
I’m not very familiar with the Theano internals, but it seems from theano/scan_module/scan_op.py#execute that the scan operator is implemented with a loop in Python that performs one iteration at a time:
fn = self.fn.fn
while (i < n_steps) and cond:
    # ...
    fn()  # run the compiled inner function for one time step
The grad function in Theano constructs a symbolic graph for computing gradients, so the gradient of the scan operator is itself implemented by constructing another scan operator:
local_op = Scan(inner_gfn_ins, inner_gfn_outs, info)
outputs = local_op(*outer_inputs)
The performance guide for Theano's scan operator suggests minimizing scan usage. This might be because the loop is executed in Python, which can be slow (due to the context switching between Python and compiled code on every iteration, and the performance of Python itself). Moreover, because no unrolling is performed, the graph optimizer cannot see the whole computation at once.
If I understand correctly, when multiple RNN/LSTM layers are stacked, the computation is not a single loop whose iterations run the full stacked feedforward step; instead, each layer runs its own scan loop, one after another. If all intermediate values have to be stored anyway to support gradient computation, this is fine; otherwise, a single fused loop could be more memory efficient.
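As a sketch of what this looks like (reusing the step function and input X from the sketch above; the initial states and weight matrices below are made-up per-layer parameters):

# Per-layer initial states and weights (illustrative).
h0_1, h0_2 = T.matrix('h0_1'), T.matrix('h0_2')
W1, U1 = T.matrix('W1'), T.matrix('U1')
W2, U2 = T.matrix('W2'), T.matrix('U2')

# Layer 1 loops over the input sequence; layer 2 loops over layer 1's outputs.
h1_seq, _ = theano.scan(step, sequences=X, outputs_info=h0_1, non_sequences=[W1, U1])
h2_seq, _ = theano.scan(step, sequences=h1_seq, outputs_info=h0_2, non_sequences=[W2, U2])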
Lasagne¶
The documentation for RNN in Lasagne can be found here. In Lasagne, a recurrent layer is just like a standard layer, except that the input shape is expected to be (batch_size, sequence_length, feature_dimension). The output shape is then (batch_size, sequence_length, output_dimension).
Both batch_size and sequence_length can be specified as None and inferred from the data. Alternatively, when memory is sufficient and the (maximum) sequence length is known beforehand, you can set unroll_scan to True; Lasagne will then unroll the graph explicitly instead of using the Theano scan operator. Explicit unrolling is implemented in utils.py#unroll_scan.
The recurrent layer also accepts a mask_input to support variable-length sequences (e.g., when sequences within a mini-batch have different lengths). The mask has shape (batch_size, sequence_length).
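A minimal sketch of this interface (the layer sizes and the fixed length below are arbitrary choices for illustration):

import lasagne

# Input: (batch_size, sequence_length, feature_dimension); batch and time are left as None.
l_in = lasagne.layers.InputLayer(shape=(None, None, 100))
l_mask = lasagne.layers.InputLayer(shape=(None, None))   # (batch_size, sequence_length)
l_lstm = lasagne.layers.LSTMLayer(l_in, num_units=128, mask_input=l_mask)

# With a known (maximum) length, the loop can be unrolled explicitly instead of using scan:
l_in_fixed = lasagne.layers.InputLayer(shape=(None, 50, 100))
l_lstm_unrolled = lasagne.layers.LSTMLayer(l_in_fixed, num_units=128, unroll_scan=True)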
Keras¶
The documentation for RNN in Keras can be found here. The interface in Keras is similar to the one in Lasagne. The input is expected to be of shape (batch_size, sequence_length, feature_dimension), and the output shape (if return_sequences is True) is (batch_size, sequence_length, output_dimension).
Keras currently supports both a Theano and a TensorFlow back end. RNN for the Theano back end is implemented with the scan operator; for the TensorFlow back end, it seems to be implemented via explicit unrolling. The documentation says that with the TensorFlow back end, the sequence length must be specified beforehand and masking does not currently work (because tf.reduce_any is not functioning yet).
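A corresponding sketch in Keras (arbitrary sizes; Keras 1.x-style imports assumed):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
# Input: (batch_size, sequence_length, feature_dimension); return_sequences=True keeps
# the per-step outputs, giving (batch_size, sequence_length, output_dimension).
model.add(LSTM(128, input_shape=(50, 100), return_sequences=True))
model.add(LSTM(128, return_sequences=True))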
Torch¶
karpathy/char-rnn is implemented with explicit unrolling. In contrast, Element-Research/rnn runs the sequence iteration in Lua. It has a very modular design:
- The basic RNN/LSTM modules run only one time step per call of forward (and accumulate/store the information needed to support the backward computation, if required). You have fine-grained control when using this API directly.
- A collection of Sequencers is defined to model common scenarios, such as forwarding a whole sequence, bi-directional sequences, attention models, etc.
- There are other utility modules, such as masking to support variable-length sequences.
CNTK¶
CNTK looks quite different from other common deep learning libraries. I don’t understand it very well. I will talk with Yu to get more details.
It seems that the basic data type is the matrix (although there is also a TensorView utility class). A mini-batch of sequence data is packed into a matrix with feature_dimension rows and sequence_length * batch_size columns (see Figure 2.9 on page 50 of the CNTKBook).
Recurrent networks are first-class citizens in CNTK. Section 5.2.1.8 of the CNTKBook shows an example of a customized computation node. The node needs to explicitly define both the standard forward function and a forward function that takes a time index, which is used for RNN evaluation:
virtual void EvaluateThisNode()
{
    EvaluateThisNodeS(FunctionValues(), Inputs(0)->FunctionValues(),
                      Inputs(1)->FunctionValues());
}

virtual void EvaluateThisNode(const size_t timeIdxInSeq)
{
    // Slice out the columns for this time step (one column per sample in the mini-batch).
    Matrix<ElemType> sliceInputValue = Inputs(1)->FunctionValues().ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    Matrix<ElemType> sliceOutputValue = m_functionValues.ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    EvaluateThisNodeS(sliceOutputValue, Inputs(0)->FunctionValues(), sliceInputValue);
}
The function ColumnSlice(start_col, num_col) takes out the packed data for that time index, as described above (here m_samplesInRecurrentStep must be the mini-batch size).
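To make the packed layout concrete, here is a small NumPy illustration (not CNTK code) of what a ColumnSlice over this packing returns:

import numpy as np

feature_dimension, sequence_length, batch_size = 4, 10, 32
# Packed mini-batch: feature_dimension rows, sequence_length * batch_size columns.
packed = np.zeros((feature_dimension, sequence_length * batch_size))

t = 3  # timeIdxInSeq
# Equivalent of ColumnSlice(t * batch_size, batch_size): all samples at time step t.
slice_t = packed[:, t * batch_size:(t + 1) * batch_size]   # (feature_dimension, batch_size)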
The low-level API for recurrent connections seems to be a delay node, but I am not sure how to use this low-level API. The PTB language model example uses a very high-level API (simply setting recurrentLayer = 1 in the config).
TensorFlow¶
The current example of an RNN language model (RNNLM) in TensorFlow uses explicit unrolling for a predefined number of time steps. The white paper mentions that an advanced control-flow API (similar to Theano's scan) is planned.
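For reference, explicit unrolling amounts to building the step computation once per time step in a Python loop, roughly like this (a sketch with TensorFlow 1.x-style APIs and made-up sizes, not the official RNNLM example):

import tensorflow as tf

num_steps, batch_size, input_dim, hidden_dim = 20, 32, 100, 128
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, input_dim])
W_xh = tf.Variable(tf.random_normal([input_dim, hidden_dim]))
W_hh = tf.Variable(tf.random_normal([hidden_dim, hidden_dim]))

h = tf.zeros([batch_size, hidden_dim])
outputs = []
for t in range(num_steps):
    x_t = inputs[:, t, :]                  # one slice of the input per time step
    h = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(h, W_hh))
    outputs.append(h)                      # the graph grows linearly with num_steps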