I’m encountering this behaviour that I’m finding difficult to debug / understand, where longer training simulations (i.e., more training data) of the exact same network effectively rescales the output and degrades performance significantly.
The weird thing about this is the loss
reported by nengo_dl
is not changing by any significant amount. In fact it seems uncorrelated with the MSE that I calculate offline from the exact same data (see print
statements at end of post). The offline MSE mirrors what you would expect by visually inspecting the attached plot, while the loss
reported by nengo_dl
seems arbitrary.
Maybe I’m misunderstanding some nuances surrounding the optimization hyperparameters / how the MSE is reported / how the parameters are carried over? How can I get consistent performance across different lengths of training time?
Details: I’m trying to learn a function from $\mathbb{R} \mapsto \mathbb{R}$ by using backprop to optimize both the encoders and decoders, with a single layer of sigmoidal units in between (i.e., a standard perceptron). The example function is just the identity (i.e., communication channel). Training and testing uses the same 5 Hz sinusoid in all conditions. RectifiedLinear
units produce approximately the same behaviour.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nengo
from nengo.utils.numpy import rmse
import nengo_dl
import tensorflow as tf
def go(sim_t,
n_neurons=100,
freq=5,
n_epochs=100):
with nengo.Network(seed=0) as inner:
tf_input = nengo.Node(output=np.zeros(1))
u = nengo.Node(size_in=1)
x = nengo.Ensemble(n_neurons, 1, neuron_type=nengo.Sigmoid())
y = nengo.Node(size_in=1)
nengo.Connection(tf_input, u, synapse=None)
nengo.Connection(u, x, synapse=None)
nengo.Connection(x, y, synapse=None)
tf_output = nengo.Probe(y, synapse=None)
t = np.arange(0, sim_t, 0.001)
data_y = np.sin(2*np.pi*freq*t)[:, None]
data_u = data_y
inputs = {tf_input: data_u[:, None, :]}
outputs = {tf_output: data_y[:, None, :]}
with nengo_dl.Simulator(inner, minibatch_size=100) as sim_train:
optimizer = tf.train.AdamOptimizer()
sim_train.train(inputs, outputs, optimizer, n_epochs=n_epochs, objective='mse')
sim_train.freeze_params(inner)
loss = sim_train.loss(inputs, outputs, 'mse')
with nengo.Network() as outer:
test_input = nengo.Node(output=nengo.processes.PresentInput(data_u, sim_train.dt))
outer.add(inner)
nengo.Connection(test_input, u, synapse=None)
test_output = nengo.Probe(y, synapse=None)
with nengo_dl.Simulator(outer) as sim:
sim.run(sim_t)
return {
't': sim.trange(),
'actual': sim.data[test_output],
'target': data_y,
'loss': loss,
}
try_sim_t = np.linspace(1, 16, 6)
data = []
for sim_t in try_sim_t:
data.append(go(sim_t=sim_t))
sl = slice(1000)
plt.figure(figsize=(18, 4))
for sim_t, r in zip(try_sim_t, data):
plt.plot(r['t'][sl], r['actual'][sl],
label="sim_t=%s (loss=%.4f)" % (sim_t, r['loss']))
print("sim_t=%s (mse=%s)" % (
sim_t, rmse(r['target'].squeeze(), r['actual'].squeeze())**2))
# all of the targets are the same (only differs in length)
plt.plot(r['t'][sl], r['target'][sl], ls='--', lw=2, label="Target")
plt.legend()
plt.xlabel("Time (s)")
plt.show()
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:00:04 (loss: 0.0000)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:00
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:00:16 (loss: 0.0248)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:01
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:00:27 (loss: 0.0000)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:01
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:00:39 (loss: 0.0000)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:02
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:00:50 (loss: 0.0733)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:03
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Training finished in 0:01:02 (loss: 0.0197)
Build finished in 0:00:00
Optimization finished in 0:00:00
Construction finished in 0:00:00
Simulation finished in 0:00:03
sim_t=1.0 (mse=0.00038221049835826233)
sim_t=4.0 (mse=0.03710108930405243)
sim_t=7.0 (mse=0.10872908536932545)
sim_t=10.0 (mse=0.5926093741152001)
sim_t=13.0 (mse=1.2567356972009183)
sim_t=16.0 (mse=2.116441248688516)
FYI, this example is a stripped down version of something else that I’m working on. There, I found that a magic rescaling factor of approximately 0.13
significantly improves the accuracy of the trained network (when trained from 30 seconds of data).