How to use return_sequences option and TimeDistributed layer in Keras

Tags: deep-learning, keras, lstm, recurrent-neural-network

I have a dialog corpus like the one below, and I want to implement an LSTM model that predicts a system action. The system action is described as a bit vector. A user input is converted to a word embedding, which is also a bit vector.

t1: user: "Do you know an apple?", system: "no"(action=2)
t2: user: "xxxxxx", system: "yyyy" (action=0)
t3: user: "aaaaaa", system: "bbbb" (action=5)

So what I want to build is a "many-to-many (2)" model: whenever my model receives a user input, it must output a system action.
But I do not understand the return_sequences option or the TimeDistributed layer placed after an LSTM. To build "many-to-many (2)", are return_sequences=True and a TimeDistributed layer after the LSTMs both required? I would appreciate a fuller description of them.

return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.

TimeDistributed: This wrapper allows to apply a layer to every temporal slice of an input.

Updated 2017/03/13 17:40

I think I now understand the return_sequences option, but I am still not sure about TimeDistributed. If I add a TimeDistributed layer after the LSTMs, is the model the same as my "many-to-many (2)" below? My understanding is that the Dense layer is then applied to each output.
[Image: the asker's "many-to-many (2)" diagram]

Best Answer

The LSTM layer and the TimeDistributed wrapper are two different ways to get the "many to many" relationship that you want.

  1. LSTM. The LSTM eats the words of your sentence one by one. Via return_sequences you choose whether it outputs something (its hidden state) at each step, after each processed word, or only once, after the last word has been eaten. With return_sequences=True the output is a sequence of the same length as the input; with return_sequences=False the output is just one vector.
  2. TimeDistributed. This wrapper lets you apply one layer (a Dense layer, for example) to every element of your sequence independently. The layer has exactly the same weights for every element: the same layer is applied to each word, and it returns the sequence of words, each processed independently.
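
To make point 1 concrete, here is a minimal sketch of how return_sequences changes the output shape. The sizes (7 words, 16 embedding dimensions, 32 LSTM units) are arbitrary values picked for illustration, not something from the question, and the import assumes the Keras that ships with TensorFlow.

```python
from tensorflow import keras

# Illustrative sizes only (assumptions, not taken from the question).
n_words, emb_size, lstm_output = 7, 16, 32

inputs = keras.Input(shape=(n_words, emb_size))

# One output vector per word: the full sequence is returned.
full_seq = keras.layers.LSTM(lstm_output, return_sequences=True)(inputs)

# Only the output after the last word is returned.
last_only = keras.layers.LSTM(lstm_output, return_sequences=False)(inputs)

full_shape = tuple(full_seq.shape)
last_shape = tuple(last_only.shape)
print(full_shape)  # (batch, n_words, lstm_output), batch shown as None
print(last_shape)  # (batch, lstm_output), batch shown as None
```

The leading None is the batch dimension that Keras adds in front of the per-sample shape.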

As you can see, the difference between the two is that the LSTM propagates information through the sequence: it eats one word, updates its state and returns it (or not), then goes on to the next word while still carrying information from the previous ones. With TimeDistributed, by contrast, each word is processed in the same way but on its own, as if in a silo, with the same layer applied to every one of them.
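
The difference can be seen in a small NumPy sketch. It is a simplification under stated assumptions: a plain tanh recurrence stands in for the LSTM, a bias-free matrix product stands in for the Dense layer, and all names and sizes are made up for the example. Changing only the first word changes only the first output of the time-distributed layer, but every output of the recurrent one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, emb_size, hidden = 4, 3, 2  # illustrative sizes

W = rng.normal(size=(emb_size, hidden))  # input weights, shared across steps
U = rng.normal(size=(hidden, hidden))    # recurrent weights, used by the RNN only

def time_distributed_dense(x):
    # The same weights are applied to each step on its own:
    # no information flows between steps.
    return np.stack([step @ W for step in x])

def simple_rnn(x):
    # Plain tanh recurrence (a stand-in for the LSTM, not an actual LSTM):
    # each step also sees the state left by the previous steps.
    h = np.zeros(hidden)
    outputs = []
    for step in x:
        h = np.tanh(step @ W + h @ U)
        outputs.append(h)
    return np.stack(outputs)

x = rng.normal(size=(n_words, emb_size))
x_changed = x.copy()
x_changed[0] += 1.0  # alter only the first word

# Per-step size of the change in the output.
td_diff = np.abs(time_distributed_dense(x) - time_distributed_dense(x_changed)).sum(axis=1)
rnn_diff = np.abs(simple_rnn(x) - simple_rnn(x_changed)).sum(axis=1)
print(td_diff)   # only the first entry is nonzero
print(rnn_diff)  # the later entries are nonzero as well
```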

So you don't have to use LSTM and TimeDistributed one after the other. You can combine them however you want; just keep in mind what each of them does.

I hope it's clearer?

EDIT:

TimeDistributed, in your case, applies a Dense layer to every element that was output by the LSTM.

Let's take an example:

You have a sequence of n_words words, each embedded in emb_size dimensions. So your input (for one sample, leaving out the batch dimension) is a 2D tensor of shape (n_words, emb_size).

First you apply an LSTM with output dimension lstm_output and return_sequences=True. The output is still a sequence, so it is a 2D tensor of shape (n_words, lstm_output): you have n_words vectors of length lstm_output.

Now you apply a TimeDistributed Dense layer with, say, 3 output dimensions as the parameter of the Dense, i.e. TimeDistributed(Dense(3)). This applies Dense(3) n_words times, to every vector of size lstm_output in your sequence independently, so they all become vectors of length 3. Your output is still a sequence, so a 2D tensor, now of shape (n_words, 3).
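
The walk-through above can be sketched in Keras roughly as follows. The concrete sizes are placeholders chosen for the example, and Keras reports every shape with an extra leading batch dimension (None), so the per-sample shape (n_words, 3) shows up as (None, n_words, 3).

```python
from tensorflow import keras

# Placeholder sizes for illustration (assumptions, not from the question).
n_words, emb_size, lstm_output = 10, 50, 64

model = keras.Sequential([
    keras.Input(shape=(n_words, emb_size)),
    # Returns one vector of length lstm_output per word.
    keras.layers.LSTM(lstm_output, return_sequences=True),
    # Applies the same Dense(3) to each of the n_words vectors.
    keras.layers.TimeDistributed(keras.layers.Dense(3)),
])

print(model.output_shape)  # (batch, n_words, 3), batch shown as None
```

For the system-action prediction in the question, the 3 would be replaced by the size of the action vector, and an activation and loss suited to that encoding would be needed; those choices are outside what this sketch covers.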

Is it clearer? :-)
