I have read the paper titled “Hardware Aware Training for Efficient Keyword Spotting on
General Purpose and Specialized Hardware”. In this paper, the LMU was modified as follows:
"we have removed the connection from the nonlinear to the linear layer, the connection from the linear layer to the intermediate input u_t, and the recurrent connection from the nonlinear layer to itself. As well, we have included multiple linear memory layers in the architecture; the outputs of each of these layers are concatenated before being mapped to the nonlinear layer via the matrix W_m."
I read some of the original LMU source code, shown below:
# state for the hidden cell
h = states[:-1]
# state for the LMU memory
m = states[-1]

# compute memory input
u_in = tf.concat((inputs, h[0]), axis=1) if self.hidden_to_memory else inputs
if self.dropout > 0:
    u_in *= self.get_dropout_mask_for_cell(u_in, training)
u = tf.matmul(u_in, self.kernel)

if self.memory_to_memory:
    if self.recurrent_dropout > 0:
        # note: we don't apply dropout to the memory input, only
        # the recurrent kernel
        rec_m = m * self.get_recurrent_dropout_mask_for_cell(m, training)
    else:
        rec_m = m
    u += tf.matmul(rec_m, self.recurrent_kernel)

# separate memory/order dimensions
m = tf.reshape(m, (-1, self.memory_d, self.order))
u = tf.expand_dims(u, -1)

# update memory
m = tf.matmul(m, self.A) + tf.matmul(u, self.B)

# re-combine memory/order dimensions
m = tf.reshape(m, (-1, self.memory_d * self.order))

# apply hidden cell
h_in = tf.concat((m, inputs), axis=1) if self.input_to_hidden else m
if self.hidden_cell is None:
    o = h_in
    h = []
elif hasattr(self.hidden_cell, "state_size"):
    o, h = self.hidden_cell(h_in, h, training=training)
else:
    o = self.hidden_cell(h_in, training=training)
    h = [o]

return o, h + [m]
Could anyone show me how to modify the code above to achieve the required LMU architecture?
Hi @Ryan. Fortunately no modifications are needed. The memory_d parameter is the number of memory layers. You can set this to be any positive integer when you create any LMU cell or layer. The additional connections are toggled via hidden_to_memory, memory_to_memory, and input_to_hidden, which all default to False.
You’ll want hidden_to_memory = True in order to get the h_{t-1} -> u_t connection that you see in the architecture diagram. You’ll also want input_to_hidden = True in order to get the x_t -> Nonlinear connection in the architecture.
Also note that the LMU allows passing in an arbitrary hidden_cell, and so you can make this a dense layer or similar to avoid the connection from hidden to hidden.
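Putting those pieces together, a configuration sketch might look like the following (hyperparameter values here are placeholders, not the paper's actual settings; argument names are from the keras-lmu package):

```python
import tensorflow as tf
import keras_lmu

# placeholder hyperparameters; the paper's actual values will differ
lmu = keras_lmu.LMU(
    memory_d=4,    # number of parallel linear memory layers
    order=64,      # dimensionality of each memory
    theta=784,     # length of the memory window
    hidden_cell=tf.keras.layers.Dense(256, activation="relu"),  # no h -> h recurrence
    hidden_to_memory=True,   # keep the h_{t-1} -> u_t connection
    memory_to_memory=False,  # no recurrent m -> u connection
    input_to_hidden=True,    # keep the x_t -> nonlinear connection
)
```

Using a Dense layer (rather than an RNN cell) as the `hidden_cell` is what removes the hidden-to-hidden recurrence.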
I took another look at the paper and this part of the description:
we have removed the connection from the nonlinear to the linear layer
is actually incorrect, as we have hidden_to_memory = True. The equations from the paper (below) and the architecture diagram are correct. Thanks for bringing this to our attention.
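For readers without the paper at hand, the modified LMU equations are, transcribed in the paper's notation (please verify against the original):

$$\mathbf{u}_t = \mathbf{e}_x^T \mathbf{x}_t + \mathbf{e}_h^T \mathbf{h}_{t-1}$$
$$\mathbf{m}_t = \bar{\mathbf{A}} \mathbf{m}_{t-1} + \bar{\mathbf{B}} u_t$$
$$\mathbf{h}_t = f(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_m \mathbf{m}_t)$$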
"Also note that the LMU allows passing in an arbitrary hidden_cell, and so you can make this a dense layer or similar to avoid the connection from hidden to hidden."
Do you mean that I should create an extra dense layer, whose input is the LMU hidden state h(t) and whose output connects back to h(t-1), and use this dense layer to avoid the hidden-to-hidden connection?
The hidden_cell is the nonlinear f part that produces the h_t in the equation above, labelled “Nonlinear” in the architecture diagram.
If you use a Dense layer with a bias and activation function $f$ for the hidden_cell, and input_to_hidden = True, then you will get exactly $\mathbf{h}_t = f(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_m \mathbf{m}_t + \mathbf{b})$.
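A minimal NumPy sketch of that claim (weights and dimensions are hypothetical; tanh stands in for $f$): with input_to_hidden = True, the Dense hidden_cell sees the concatenation of m_t and x_t, so applying its kernel to that concatenation is the same as computing $f(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_m \mathbf{m}_t + \mathbf{b})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, memory_d, order, n_hidden = 4, 2, 8, 16

# hypothetical weights standing in for the Dense layer's kernel and bias
W_m = rng.standard_normal((memory_d * order, n_hidden))  # memory -> hidden block
W_x = rng.standard_normal((n_in, n_hidden))              # input -> hidden block
b = rng.standard_normal(n_hidden)

x_t = rng.standard_normal(n_in)
m_t = rng.standard_normal(memory_d * order)

# with input_to_hidden=True the hidden cell's input is [m_t, x_t];
# a Dense layer applies one kernel W to that concatenation
h_in = np.concatenate([m_t, x_t])
W = np.concatenate([W_m, W_x])  # Dense kernel, stacked to match [m_t, x_t]
h_t = np.tanh(h_in @ W + b)

# identical to computing the two terms separately and summing
assert np.allclose(h_t, np.tanh(x_t @ W_x + m_t @ W_m + b))
```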
If you want $h_{t-1}$ to be an input to $f$ as well then that calls for a different hidden_cell, such as an RNN cell.