Tensorflow – In Keras, what exactly am I configuring when I create a stateful `LSTM` layer with N `units`

keraslstmneural-networktensorflow

The first arguments in a normal Dense layer is also units, and is the number of neurons/nodes in that layer. A standard LSTM unit however looks like the following:

(This is a reworked version of "Understanding LSTM Networks")

In Keras, when I create an LSTM object like this LSTM(units=N, ...), am I actually creating N of these LSTM units? Or is it the size of the "Neural Network" layers inside the LSTM unit, i.e., the W's in the formulas? Or is it something else?

For context, I'm working based on this example code.

The following is the documentation: https://keras.io/layers/recurrent/

It says:

units: Positive integer, dimensionality of the output space.

It makes me think it is the number of outputs from the Keras LSTM "layer" object. Meaning the next layer will have N inputs. Does that mean there actually exists N of these LSTM units in the LSTM layer, or maybe that that exactly one LSTM unit is run for N iterations outputting N of these h[t] values, from, say, h[t-N] up to h[t]?

If it only defines the number of outputs, does that mean the input still can be, say, just one, or do we have to manually create lagging input variables x[t-N] to x[t], one for each LSTM unit defined by the units=N argument?

As I'm writing this it occurs to me what the argument return_sequences does. If set to True all the N outputs are passed forward to the next layer, while if it is set to False it only passes the last h[t] output to the next layer. Am I right?

Best Answer

You can check this question for further information, although it is based on Keras-1.x API.

Basically, the unit means the dimension of the inner cells in LSTM. Because in LSTM, the dimension of inner cell (C_t and C_{t-1} in the graph), output mask (o_t in the graph) and hidden/output state (h_t in the graph) should have the SAME dimension, therefore you output's dimension should be unit-length as well.

And LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length. If you set return_sequence=True, it will return something with shape: (batch_size, timespan, unit). If false, then it just return the last output in shape (batch_size, unit).

As for the input, you should provide input for every timestamp. Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit. If you just want to provide input at the first step, you can simply pad your data with zeros at other time steps.

General Keras behavior

The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):

In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.

For this example:

We have N oil tanks
We spent 5 hours taking measures hourly (time steps)
We measured two features:
- Pressure P
- Temperature T

Our input array should then be something shaped as (N,5,2):

        [     Step1      Step2      Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....
Tank N:    [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]

Inputs for sliding windows

Often, LSTM layers are supposed to process the entire sequences. Dividing windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.

In windows, each window is part of a long original sequence, but by Keras they will be seen each as an independent sequence:

        [     Step1    Step2    Step3    Step4    Step5
Window  A:  [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window  B:  [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window  C:  [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
  ....
        ]

Notice that in this case, you have initially only one sequence, but you're dividing it in many sequences to create windows.

The concept of "what is a sequence" is abstract. The important parts are:

you can have batches with many individual sequences
what makes the sequences be sequences is that they evolve in steps (usually time steps)

Achieving each case with "single layers"

Achieving standard many to many:

You can achieve many to many with a simple LSTM layer, using return_sequences=True:

outputs = LSTM(units, return_sequences=True)(inputs)

#output_shape -> (batch_size, steps, units)

Achieving many to one:

Using the exact same layer, keras will do the exact same internal preprocessing, but when you use return_sequences=False (or simply ignore this argument), keras will automatically discard the steps previous to the last:

outputs = LSTM(units)(inputs)

#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned

Achieving one to many

Now, this is not supported by keras LSTM layers alone. You will have to create your own strategy to multiplicate the steps. There are two good approaches:

Create a constant multi-step input by repeating a tensor
Use a stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)

One to many with repeat vector

In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:

outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)

#output_shape -> (batch_size, steps, units)

Understanding stateful = True

Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit your computer's memory at once)

Stateful allows us to input "parts" of the sequences in stages. The difference is:

In stateful=False, the second batch contains whole new sequences, independent from the first batch
In stateful=True, the second batch continues the first batch, extending the same sequences.

It's like dividing the sequences in windows too, with these two main differences:

these windows do not superpose!!
stateful=True will see these windows connected as a single long sequence

In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).

Sequence 1 in batch 2 will continue sequence 1 in batch 1.
Sequence 2 in batch 2 will continue sequence 2 in batch 1.
Sequence n in batch 2 will continue sequence n in batch 1.

Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:

                   BATCH 1                           BATCH 2
        [     Step1      Step2        |    [    Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2],     |       [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2],     |       [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....                                |
Tank N:    [[Pn1,Tn1], [Pn2,Tn2],     |       [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]                                  ]

Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).

You can have any number of batches, indefinitely. (For having variable lengths in each batch, use input_shape=(None,features).

One to many with stateful=True

For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.

Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.

Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use this is the next "many to many" case.

Layer:

outputs = LSTM(units=features, 
               stateful=True, 
               return_sequences=True, #just to keep a nice output shape even with length 1
               input_shape=(None,features))(inputs) 
    #units = features because we want to use the outputs as inputs
    #None because we want variable length

#output_shape -> (batch_size, steps, units)

Now, we're going to need a manual loop for predictions:

input_data = someDataWithShape((batch, 1, features))

#important, we're starting new sequences, not continuing old ones:
model.reset_states()

output_sequence = []
last_step = input_data
for i in steps_to_predict:

    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

 #end of the sequences
 model.reset_states()

Many to many with stateful=True

Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.

We're using the same method as in the "one to many" above, with the difference that:

we will use the sequence itself to be the target data, one step ahead
we know part of the sequence (so we discard this part of the results).

Layer (same as above):

outputs = LSTM(units=features, 
               stateful=True, 
               return_sequences=True, 
               input_shape=(None,features))(inputs) 
    #units = features because we want to use the outputs as inputs
    #None because we want variable length

#output_shape -> (batch_size, steps, units)

Training:

We are going to train our model to predict the next step of the sequences:

totalSequences = someSequencesShaped((batch, steps, features))
    #batch size is usually 1 in these cases (often you have only one Tank in the example)

X = totalSequences[:,:-1] #the entire known sequence, except the last step
Y = totalSequences[:,1:] #one step ahead of X

#loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X,Y)

Predicting:

The first stage of our predicting involves "ajusting the states". That's why we're going to predict the entire sequence again, even if we already know this part of it:

model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step

Now we go to the loop as in the one to many case. But don't reset states here!. We want the model to know in which step of the sequence it is (and it knows it's at the first new step because of the prediction we just made above)

output_sequence = [firstNewStep]
last_step = firstNewStep
for i in steps_to_predict:

    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

 #end of the sequences
 model.reset_states()

This approach was used in these answers and file:

Achieving complex configurations

In all examples above, I showed the behavior of "one layer".

You can, of course, stack many layers on top of each other, not necessarly all following the same pattern, and create your own models.

One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:

Encoder:

inputs = Input((steps,features))

#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)    

#many to one layer:
outputs = LSTM(hidden3)(outputs)

encoder = Model(inputs,outputs)

Decoder:

Using the "repeat" method;

inputs = Input((hidden3,))

#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)

#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)

#last layer
outputs = LSTM(features,return_sequences=True)(outputs)

decoder = Model(inputs,outputs)

Autoencoder:

inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)

autoencoder = Model(inputs,outputs)

Train with fit(X,X)

Additional explanations

If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`

Keras input explanation: input_shape, units, batch_size, dim, etc

Units:

The amount of "neurons", or "cells", or whatever the layer has inside it.

It's a property of each layer, and yes, it's related to the output shape (as we will see later). In your picture, except for the input layer, which is conceptually different from other layers, you have:

Hidden layer 1: 4 units (4 neurons)
Hidden layer 2: 4 units
Last layer: 1 unit

Shapes

Shapes are consequences of the model's configuration. Shapes are tuples representing how many elements an array or tensor has in each dimension.

Ex: a shape (30,4,10) means an array or tensor with 3 dimensions, containing 30 elements in the first dimension, 4 in the second and 10 in the third, totaling 30*4*10 = 1200 elements or numbers.

The input shape

What flows between layers are tensors. Tensors can be seen as matrices, with shapes.

In Keras, the input layer itself is not a layer, but a tensor. It's the starting tensor you send to the first hidden layer. This tensor must have the same shape as your training data.

Example: if you have 30 images of 50x50 pixels in RGB (3 channels), the shape of your input data is (30,50,50,3). Then your input layer tensor, must have this shape (see details in the "shapes in keras" section).

Each type of layer requires the input with a certain number of dimensions:

Dense layers require inputs as (batch_size, input_size)
- or (batch_size, optional,...,optional, input_size)
2D convolutional layers need inputs as:
- if using channels_last: (batch_size, imageside1, imageside2, channels)
- if using channels_first: (batch_size, channels, imageside1, imageside2)
1D convolutions and recurrent layers use (batch_size, sequence_length, features)
- Details on how to prepare data for recurrent layers

Now, the input shape is the only one you must define, because your model cannot know it. Only you know that, based on your training data.

All the other shapes are calculated automatically based on the units and particularities of each layer.

Relation between shapes and units - The output shape

Given the input shape, all other shapes are results of layers calculations.

The "units" of each layer will define the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer).

Each type of layer works in a particular way. Dense layers have output shape based on "units", convolutional layers have output shape based on "filters". But it's always based on some layer property. (See the documentation for what each layer outputs)

Let's show what happens with "Dense" layers, which is the type shown in your graph.

A dense layer has an output shape of (batch_size,units). So, yes, units, the property of the layer, also defines the output shape.

Hidden layer 1: 4 units, output shape: (batch_size,4).
Hidden layer 2: 4 units, output shape: (batch_size,4).
Last layer: 1 unit, output shape: (batch_size,1).

Weights

Weights will be entirely automatically calculated based on the input and the output shapes. Again, each type of layer works in a certain way. But the weights will be a matrix capable of transforming the input shape into the output shape by some mathematical operation.

In a dense layer, weights multiply all inputs. It's a matrix with one column per input and one row per unit, but this is often not important for basic works.

In the image, if each arrow had a multiplication number on it, all numbers together would form the weight matrix.

Shapes in Keras

Earlier, I gave an example of 30 images, 50x50 pixels and 3 channels, having an input shape of (30,50,50,3).

Since the input shape is the only one you need to define, Keras will demand it in the first layer.

But in this definition, Keras ignores the first dimension, which is the batch size. Your model should be able to deal with any batch size, so you define only the other dimensions:

input_shape = (50,50,3)
    #regardless of how many images I have, each image has this shape

Optionally, or when it's required by certain kinds of models, you can pass the shape containing the batch size via batch_input_shape=(30,50,50,3) or batch_shape=(30,50,50,3). This limits your training possibilities to this unique batch size, so it should be used only when really required.

Either way you choose, tensors in the model will have the batch dimension.

So, even if you used input_shape=(50,50,3), when keras sends you messages, or when you print the model summary, it will show (None,50,50,3).

The first dimension is the batch size, it's None because it can vary depending on how many examples you give for training. (If you defined the batch size explicitly, then the number you defined will appear instead of None)

Also, in advanced works, when you actually operate directly on the tensors (inside Lambda layers or in the loss function, for instance), the batch size dimension will be there.

So, when defining the input shape, you ignore the batch size: input_shape=(50,50,3)
When doing operations directly on tensors, the shape will be again (30,50,50,3)
When keras sends you a message, the shape will be (None,50,50,3) or (30,50,50,3), depending on what type of message it sends you.

Dim

And in the end, what is dim?

If your input shape has only one dimension, you don't need to give it as a tuple, you give input_dim as a scalar number.

So, in your model, where your input layer has 3 elements, you can use any of these two:

input_shape=(3,) -- The comma is necessary when you have only one dimension
input_dim = 3

But when dealing directly with the tensors, often dim will refer to how many dimensions a tensor has. For instance a tensor with shape (25,10909) has 2 dimensions.

Defining your image in Keras

Keras has two ways of doing it, Sequential models, or the functional API Model. I don't like using the sequential model, later you will have to forget it anyway because you will want models with branches.

PS: here I ignored other aspects, such as activation functions.

With the Sequential model:

from keras.models import Sequential  
from keras.layers import *  

model = Sequential()    

#start from the first hidden layer, since the input is not actually a layer   
#but inform the shape of the input, with 3 elements.    
model.add(Dense(units=4,input_shape=(3,))) #hidden layer 1 with input

#further layers:    
model.add(Dense(units=4)) #hidden layer 2
model.add(Dense(units=1)) #output layer

With the functional API Model:

from keras.models import Model   
from keras.layers import * 

#Start defining the input tensor:
inpTensor = Input((3,))   

#create the layers and pass them the input tensor to get the output tensor:    
hidden1Out = Dense(units=4)(inpTensor)    
hidden2Out = Dense(units=4)(hidden1Out)    
finalOut = Dense(units=1)(hidden2Out)   

#define the model's start and end points    
model = Model(inpTensor,finalOut)

Shapes of the tensors

Remember you ignore batch sizes when defining layers:

inpTensor: (None,3)
hidden1Out: (None,4)
hidden2Out: (None,4)
finalOut: (None,1)