NengoDL performance bad on GPU

I’ve been testing a custom model of mine on a machine equipped with some pretty powerful GPUs and was surprised to see that running on these is significantly slower than running the same model by specifying device="/cpu:0".
We are talking 25 seconds on the CPU vs. nearly 5 minutes on the GPU. It’s a pretty small model but is that extra time really all overhead?

I wrote all my code using TensorFlow 1.0 syntax, so I thought that might have been the culprit, but disabling v2 execution doesn’t seem to help.

There is some overhead to running a NengoDL network on the GPU, but as you said, it shouldn’t be on the order of minutes. It’s hard to pin down specifically what is causing the performance slowdown here without reference code to look at. Here are some debugging options for you to try:

  • Reduce your model down to a minimal set of code that reproduces the problem. This will hopefully help pinpoint where the issue is occurring. You can also post this minimal model here to speed up the debugging process.
  • Check to see if your model is actually using the appropriate GPU (either through nvidia-smi on Linux, or through the Task Manager on Windows). There is also a short sketch after this list showing how to check device visibility from within TensorFlow.
  • Try changing the model size, and see what impact that has on the run time. Does the run time scale linearly with the model size (which would indicate that the issue is not just overhead), or does it stay roughly constant (which would suggest that it is mostly overhead)?
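In case it helps, here is a minimal sketch (the one-ensemble network is just a placeholder, not your model) of how you could confirm which devices TensorFlow can see and pin the NengoDL simulator to one of them explicitly:

import tensorflow as tf
import nengo
import nengo_dl

# List the devices TF can actually see; if no GPU shows up here,
# the simulation will silently run on the CPU.
print(tf.config.list_physical_devices())

with nengo.Network() as net:
    ens = nengo.Ensemble(100, 1)

# Pin the simulation to a specific device ("/cpu:0" to compare against).
with nengo_dl.Simulator(net, device="/gpu:0") as sim:
    sim.run(1.0)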

I’ll start by posting the full model and then I’ll try the steps you suggested.

My custom learning rule function is https://github.com/Tioz90/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/blob/f1a869da3ebc3799cc3af8a7eb5fa7fa3c6e3c7d/memristor_nengo/learning_rules.py#L215
while the equivalent Nengo Core implementation is https://github.com/Tioz90/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/blob/f1a869da3ebc3799cc3af8a7eb5fa7fa3c6e3c7d/memristor_nengo/learning_rules.py#L60

The model can easily be tested by running https://github.com/Tioz90/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/blob/master/experiments/test_builder_mPES.py

I’m not one of the primary developers of NengoDL, so I will have to double-check with them on this matter, but the simulation time difference you are experiencing may be due to the specific implementation of the neuron logic you are using. In my quick comparison between your code base and the NengoDL code base, it may be that in your code the neuron computation is being done on the CPU, in which case forcing TF to use the GPU would incur a lot of extra overhead shuffling data back and forth between the GPU and the CPU whenever a neuron update needs to happen (which I assume is every timestep).

As I suggested before, breaking your code down into a minimal example that replicates this issue would help you pinpoint exactly where the slowdown is happening. A Python or TF profiling tool may also help indicate which functions in particular are causing the slowdown. This looks like it might be an example of how to profile TF code.
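One quick, low-effort check (just a sketch, assuming TF 2.x; it has to run before any TF ops are created) is to turn on device placement logging, which prints the device that each op is assigned to:

import tensorflow as tf

# Log the device assigned to each op as the graph is built and executed.
# Call this before constructing the nengo_dl.Simulator.
tf.debugging.set_log_device_placement(True)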

That’s an interesting insight, but I don’t understand why that should be the case… I’m using the default LIF() neuron, which I believe should automatically be switched to a LIFRate() one during simulation. Or does that only happen when using the Keras API? And even if the switch isn’t occurring, would that really be the cause of the issue?

I looked into the link you kindly supplied and it seems pretty straightforward, except I wouldn’t know how to call the callback as I’m not explicitly calling the Keras API:

import tensorflow as tf
from datetime import datetime

# Create a TensorBoard callback
logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                 histogram_freq=1,
                                                 profile_batch='500,520')

model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks=[tboard_callback])

Hi! Sorry, I misspoke when I said it might have been the neuron computation causing the slowdown.
Using the standard LIF neuron shouldn’t cause any slowdown in your simulation. However, you have a custom learning rule (your mPES class) that, depending on how it is implemented and how it interacts with the neuron class, can cause a performance slowdown, especially if it hasn’t been optimized to run on the GPU.

Yes, I now understand what you mean.
My mPES class only interacts with the neuron class by reading the output signal of the pre object and then using it for some internal calculations; specifically:

  • Reads the Signal in build_mpes() (line 225): acts = build_or_passthrough( model, mpes.pre_synapse, model.sig[ conn.pre_obj ][ "out" ] )

  • Reshapes it in build_pre() (line 293):

      self.pre_data = signals.combine( [ op.pre_filtered for op in self.ops ] )
      self.pre_data = self.pre_data.reshape( (len( self.ops ), 1, self.ops[ 0 ].pre_filtered.shape[ 0 ]) )
    
  • Reads the activations in build_step() (line 347): pre_filtered = signals.gather( self.pre_data )

  • Multiplies it by the error (line 418): pes_delta = -local_error * pre_filtered

  • Uses it to binarise the information (line 420): spiked_map = find_spikes( pre_filtered, self.post_n_neurons )

  • Uses the map to mask (line 421): pes_delta = pes_delta * spiked_map

Apart from that, I don’t see where the overhead could be coming from … I don’t see a great advantage in rewriting my code using TensorFlow 2.0 syntax right now as that could just risk introducing further bottlenecks.

Also, I can’t see how to make the required callback to TensorBoard given that I’m not using the .fit() method to train the network (but maybe I should?!)

Thanks again!

I only looked briefly at your implementation in learning_rules.py but given the massive amount of serial TF operations and conditional branching in there I wouldn’t be surprised if that was the culprit. GPUs are generally great at doing large matrix multiplies in parallel, but awful at running a bunch of small operations in a row, or doing if-then logic where the branching logic is not predetermined at build time.

Could you try swapping this out for the basic PES learning rule to see if there’s still the same speed difference between CPU and GPU? In general it would be helpful to isolate the slowdown as Xuan suggested, and a profiling tool could help with this.
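For reference, the swap itself should just be a change to the learning_rule_type on the connection; here is a rough sketch (the ensembles and names are placeholders, not taken from your model):

import nengo

with nengo.Network() as net:
    pre = nengo.Ensemble(10, 1)
    post = nengo.Ensemble(10, 1)
    error = nengo.Node(size_in=1)

    conn = nengo.Connection(pre, post,
                            # learning_rule_type=mPES(...),  # your custom rule
                            learning_rule_type=nengo.PES(learning_rate=1e-4))
    # PES needs an error signal projected into the learning rule.
    nengo.Connection(error, conn.learning_rule)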

I must say that I was hoping that TF would be able to deal with conditional logic given that I’m using all native operations…

I have to wonder if that’s the problem, actually, as I ran the following as benchmarks with the default PES learning rule compared to my mPES one, with 10 neurons in each of the three ensembles:
PES on CPU: 21 s
mPES on CPU: 29 s
PES on GPU: 1 m 24 s
mPES on GPU: 2 m 1 s

I must say that I was hoping that TF would be able to deal with conditional logic given that I’m using all native operations…

I’m not super familiar with TF, but from what I understand, to write efficient conditional TF code, you need to use TF control flow operations (I found this article that seems like a good overview) instead of the standard python control flow.

If you are using conditional logic to manipulate elements within a matrix or vector, another method that may scale better on the GPU is to use indexing and masking operations (see here for a code example, and here for an example in Nengo) to do the equivalent conditional assignments.
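To make that concrete, here is a minimal sketch (not taken from your code) of expressing an element-wise conditional as a mask rather than a branch, which generally maps much better onto the GPU:

import tensorflow as tf

x = tf.random.normal([1000])

# "If x > 0.5 keep it, otherwise zero it", expressed as a mask instead of a branch.
mask = tf.cast(tf.greater(x, 0.5), x.dtype)
y = x * mask

# Equivalent formulation with tf.where.
y2 = tf.where(x > 0.5, x, tf.zeros_like(x))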

I have to wonder if that’s the problem, actually, as I ran the following as benchmarks with the default PES learning rule compared to my mPES one, with 10 neurons in each of the three ensembles:

For a network of that size, you’ll definitely see that the GPU computation is much slower than the CPU computation. The overhead of transferring data on / off the GPU (and of compiling the model to GPU code) has to be outweighed by the massive advantage the GPU has over the CPU at doing large matrix operations. This in turn means that to see any meaningful performance uplift, you’ll need to size your network appropriately.

I would test your learning rule against the PES learning rule on the CPU and GPU using varying network sizes and varying run times, to see if you can establish a relationship between those variables (e.g., is the run time linear in the number of neurons?). This will give you a better sense of the performance scaling characteristics of your learning rule compared to the PES learning rule.
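Something along these lines would give you the scaling data; this is only a sketch (the toy network is a placeholder for however you construct your PES / mPES model):

import timeit
import nengo
import nengo_dl

def time_run(n_neurons, device, sim_time=1.0):
    # Placeholder network; substitute your PES / mPES model construction here.
    with nengo.Network() as net:
        nengo.Ensemble(n_neurons, 1)

    with nengo_dl.Simulator(net, device=device, progress_bar=False) as sim:
        start = timeit.default_timer()
        sim.run(sim_time)
        return timeit.default_timer() - start

for n in (10, 100, 1000):
    print(n, time_run(n, "/cpu:0"), time_run(n, "/gpu:0"))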

Unfortunately, I’m already doing that, as all my conditional operations use tf.cond() and all my tensor operations use TF native functions (e.g. tf.greater(), tf.where(), tf.boolean_mask(), tf.tensor_scatter_nd_update(), …).

I will get onto this as soon as I have some time to invest because this performance gap really strikes me as strange. Also, I would really like to be able to run bigger models in a reasonable time.

Sounds good. Keep us posted on the results of your experimentation. We’ll try to help as much as we can, but seeing as your code is fairly complex, you’ll be the best person to debug and profile your code. TF has a built in profiler which may help you identify where the bottlenecks in your code are.

Thanks, I’ll keep you updated!

Yes, I know about the profiler, but how would I set up the necessary callbacks using the Nengo API? Which of the TF profiling APIs would be best to use, and how would I integrate it with my code?

I haven’t used the TF profiler with NengoDL myself, but the NengoDL devs inform me that the “TensorBoard Keras Callback” is the easiest profiler to use (all of the profilers are supported with NengoDL).

The TF documentation provides information on how to use the TF profiler with your code.
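If the Keras callback route feels awkward because you are calling sim.run() instead of fit(), the programmatic profiler API may be easier; here is a minimal sketch (assuming TF 2.2+ and that sim is your nengo_dl.Simulator):

import tensorflow as tf

# Start collecting a profile, run the part of the simulation you care about,
# then stop; the trace can be viewed in TensorBoard's Profile tab.
tf.profiler.experimental.start("logs/profile")
sim.run(1.0)
tf.profiler.experimental.stop()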

I’ll give it a try and see if I can figure it out :slight_smile:


I have access to a server with four GPUs and I’ve started running a few more tests.

One thing I immediately noticed is that when I try to assign the Simulator to /gpu:[0-3] I get the following printout.

Is there any useful information in here that could help explain the performance issues? From what I can work out, it’s having trouble running everything on the GPU, so it’s assigning some ops to the CPU? Could that maybe explain the overhead I’m seeing?

Using run optimisation
Devices available:
Device type: CPU String: /physical_device:CPU:0
Device type: XLA_CPU String: /physical_device:XLA_CPU:0
Device type: XLA_GPU String: /physical_device:XLA_GPU:0
Device type: XLA_GPU String: /physical_device:XLA_GPU:1
Device type: XLA_GPU String: /physical_device:XLA_GPU:2
Device type: XLA_GPU String: /physical_device:XLA_GPU:3
Device type: GPU String: /physical_device:GPU:0
Device type: GPU String: /physical_device:GPU:1
Device type: GPU String: /physical_device:GPU:2
Device type: GPU String: /physical_device:GPU:3

Simulating with mPES()
Backend is nengo_dl, running on /gpu:3

Build finished in 0:00:00                                                                                                                                     
Optimization finished in 0:00:00                                                                                                                              
|                                                               # Constructing graph                                                                 | 0:00:062020-09-16 10:06:13.842392: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-16 10:06:13.846021: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-16 10:06:13.848595: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-16 10:06:13.851124: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-16 10:06:13.871552: W tensorflow/core/common_runtime/colocation_graph.cc:1139] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:3' assigned_device_name_='' resource_device_name_='/device:GPU:3' supported_device_types_=[CPU] possible_devices_=[]
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU 
AssignVariableOp: CPU XLA_CPU XLA_GPU 
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU 
Const: GPU CPU XLA_CPU XLA_GPU 
VarHandleOp: CPU XLA_CPU XLA_GPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  TensorGraph/saved_state/int32_1/Initializer/zeros (Const) 
  TensorGraph/saved_state/int32_1 (VarHandleOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/Assign (AssignVariableOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/Read/ReadVariableOp (ReadVariableOp) /device:GPU:3

Construction finished in 0:00:07                                                                                                                              

0:00:01WARNING:tensorflow:From /home/p291020/.conda/envs/nengodl/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From /home/p291020/.conda/envs/nengodl/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.

I also noticed that calling my model with either PES or mPES, on CPU or GPU, assigns the process (71273 in this example) to every GPU in the system, but the selected GPU (in this case /gpu:1) is not really used and its utilisation periodically falls to 0%. You can also see that the GPU memory is not really being used, either.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:02:00.0  On |                  N/A |
| 67%   85C    P2   136W / 250W |  11049MiB / 12196MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:04:00.0 Off |                  N/A |
|ERR!   81C    P2   162W / 215W |   4389MiB /  7952MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 75%   88C    P2   183W / 250W |  11052MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:84:00.0 Off |                  N/A |
| 56%   86C    P2   144W / 215W |   4453MiB /  7952MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     71273      C   python                                       249MiB |
|    0     71829      G   /usr/lib/xorg/Xorg                            17MiB |
|    0     77310      C   python                                     10771MiB |
|    1     43521      C   python                                      4123MiB |
|    1     71273      C   python                                       255MiB |
|    2     11591      C   python                                     10781MiB |
|    2     71273      C   python                                       259MiB |
|    3     42927      C   python                                      4123MiB |
|    3     71273      C   python                                       319MiB |
+-----------------------------------------------------------------------------+

I was wondering if the “noisy” version of my learning rule might be at fault, as in Nengo Core the generation of randomly perturbed parameters at each timestep had a noticeable impact on performance.
I have tried bypassing the tf.cond() statements, and executing with or without them makes quite a difference, but only because the graph is simpler. Whether or not the noisy parameters are actually generated at each timestep makes minimal difference.
I am generating the noise using tf.random.normal() and, with NengoDL, I don’t see the same impact as I did in Nengo Core.

One thing I am noticing is that disabling all the Probes makes an enormous difference to the performance, both in Nengo Core and NengoDL. I was expecting that to only impact memory usage, not runtime, as stated in your docs.

The image summarises my tests (on my laptop’s CPU), with 100 pre/post/error neurons:

I think you can safely ignore that warning, because the Op it’s complaining about there is just assigning a value to a scalar int32 Variable (which, I would guess, is the Simulator time step). So it’s unlikely that that operation running on the CPU would introduce any significant overhead.

How much non-GPU memory does your model use when running only on the CPU (e.g. with device="/cpu:0")? If that is noticeably greater than the ~255MB showing up in nvidia-smi, that would confirm that your model isn’t really using the GPU. What about if you run other models in NengoDL (not using any of your custom code), do those utilize the GPU as expected?
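One quick way to measure that (just a sketch using psutil, which is not something NengoDL depends on, simply one option) is to check the resident memory of the Python process right after the run:

import os
import psutil

# Resident memory of the current Python process, in MB.
print(psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2)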

When running only on the CPU on my laptop it uses around 400MB (measured by looking at the increase of memory usage of my IDE), as per the testing in my last post.
Using the memory_profiler package the maximum memory usage was measured at 300MB both on CPU and GPU.

I tried running my model with the following command in order to set the connection to use default PES learning:

python mPES.py -vv -N 100 -d /gpu:3 -l PES

but, still, it seems as if the GPU is not really used and the running time is indeed much higher than if I ran:

python mPES.py -vv -N 100 -d /cpu:0 -l PES

The former model takes 1m 26s (GPU) while the latter only 15s (CPU)!

Running on the GPU does not really seem to use the GPU any more than running on the CPU does:


My custom learning rule instead takes 6m 37s on the GPU and 1m 7s on the CPU. Tested by running python mPES.py -vv -N 100 -d /gpu:3 -l mPES and python mPES.py -vv -N 100 -d /cpu:0 -l mPES.

I don’t know if this has any bearing but I call the following at the beginning of mPES.py:

tf.compat.v1.disable_eager_execution()
tf.compat.v1.disable_control_flow_v2()

I tried running the “Spiking MNIST” example and it seems that the GPUs are engaged correctly, as per the screenshot:

I specified Simulator(..., device="/gpu:3") but it still seems to auto-select the GPUs.

I’m also getting different messages from TensorFlow when launching the simulation. The GPUs seem to actually be recognised and initialised for use:

(nengodl) p291020@turing10:~$ python spiking_mnist.py                                                                                                         
2020-09-17 11:11:49.741587: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Build finished in 0:00:00                                                                                                                                     
Optimization finished in 0:00:00                                                                                                                              
|#                                                                Constructing graph                                                                 | 0:00:002020-09-17 11:11:58.019755: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-17 11:11:58.050366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:11:58.051783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:84:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:11:58.052579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties: 
pciBusID: 0000:83:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2020-09-17 11:11:58.053495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties: 
pciBusID: 0000:02:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-09-17 11:11:58.053542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:11:58.056211: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-17 11:11:58.058455: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-17 11:11:58.058928: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-17 11:11:58.061496: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-17 11:11:58.063002: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-17 11:11:58.068111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-17 11:11:58.086070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2020-09-17 11:11:58.086984: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-17 11:11:58.135001: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2197335000 Hz
2020-09-17 11:11:58.140460: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5631e567b2b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-17 11:11:58.140490: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-09-17 11:11:58.813313: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5631e50072c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-17 11:11:58.813375: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-09-17 11:11:58.813394: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080, Compute Capability 7.5
2020-09-17 11:11:58.813410: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): TITAN Xp, Compute Capability 6.1
2020-09-17 11:11:58.813433: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): TITAN X (Pascal), Compute Capability 6.1
2020-09-17 11:11:58.816386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:11:58.819237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:84:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:11:58.820519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties: 
pciBusID: 0000:83:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2020-09-17 11:11:58.821905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties: 
pciBusID: 0000:02:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-09-17 11:11:58.821974: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:11:58.822025: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-17 11:11:58.822058: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-17 11:11:58.822089: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-17 11:11:58.822128: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-17 11:11:58.822158: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-17 11:11:58.822189: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-17 11:11:58.830921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2020-09-17 11:11:58.830969: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:12:03.172575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-17 11:12:03.172636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 2 3 
2020-09-17 11:12:03.172661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N N N N 
2020-09-17 11:12:03.172666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   N N N N 
2020-09-17 11:12:03.172671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 2:   N N N N 
2020-09-17 11:12:03.172678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 3:   N N N N 
2020-09-17 11:12:03.178388: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:12:03.178476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7226 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:04:00.0, compute capability: 7.5)
2020-09-17 11:12:03.180294: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:12:03.180340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7226 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080, pci bus id: 0000:84:00.0, compute capability: 7.5)
2020-09-17 11:12:03.181790: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:12:03.181834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 921 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2020-09-17 11:12:03.183548: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:12:03.183609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 4056 MB memory) -> physical GPU (device: 3, name: TITAN X (Pascal), pci bus id: 0000:02:00.0, compute capability: 6.1)
Construction finished in 0:00:05                                                                                                                              
2020-09-17 11:12:07.284170: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10               
2020-09-17 11:12:07.786281: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Accuracy before training: 0.0934000015258789
Accuracy after training: 0.9869999885559082 0:00:00                                                                                                           
(nengodl) p291020@turing10:~$ python spiking_mnist.py                                                                                                         
2020-09-17 11:13:31.472787: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Build finished in 0:00:00                                                                                                                                     
Optimization finished in 0:00:00                                                                                                                              
|#                                                                Constructing graph                                                                 | 0:00:002020-09-17 11:13:41.757297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-17 11:13:41.779979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:13:41.781116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:84:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:13:41.781938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties: 
pciBusID: 0000:83:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2020-09-17 11:13:41.782828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties: 
pciBusID: 0000:02:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-09-17 11:13:41.782864: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:13:41.785582: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-17 11:13:41.788028: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-17 11:13:41.788498: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-17 11:13:41.791183: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-17 11:13:41.792631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-17 11:13:41.798078: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-17 11:13:41.805495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2020-09-17 11:13:41.805961: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-17 11:13:41.854901: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2197335000 Hz
2020-09-17 11:13:41.860757: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561a9f04b280 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-17 11:13:41.860798: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-09-17 11:13:42.365618: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561a9e9d7640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-17 11:13:42.365673: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-09-17 11:13:42.365699: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080, Compute Capability 7.5
2020-09-17 11:13:42.365721: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): TITAN Xp, Compute Capability 6.1
2020-09-17 11:13:42.365736: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): TITAN X (Pascal), Compute Capability 6.1
2020-09-17 11:13:42.377324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:04:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:13:42.379253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:84:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 46 deviceMemorySize: 7.77GiB deviceMemoryBandwidth: 417.23GiB/s
2020-09-17 11:13:42.381161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties: 
pciBusID: 0000:83:00.0 name: TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 510.07GiB/s
2020-09-17 11:13:42.383961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties: 
pciBusID: 0000:02:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-09-17 11:13:42.384045: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:13:42.384097: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-17 11:13:42.384132: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-17 11:13:42.384164: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-17 11:13:42.384196: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-17 11:13:42.384227: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-17 11:13:42.384265: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-17 11:13:42.393801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2020-09-17 11:13:42.393842: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 11:13:46.659687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-17 11:13:46.659744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 2 3 
2020-09-17 11:13:46.659754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N N N N 
2020-09-17 11:13:46.659759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   N N N N 
2020-09-17 11:13:46.659767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 2:   N N N N 
2020-09-17 11:13:46.659790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 3:   N N N N 
2020-09-17 11:13:46.666048: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:13:46.666113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7226 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:04:00.0, compute capability: 7.5)
2020-09-17 11:13:46.667999: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:13:46.668047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7226 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080, pci bus id: 0000:84:00.0, compute capability: 7.5)
2020-09-17 11:13:46.669479: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:13:46.669527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 921 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2020-09-17 11:13:46.671421: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 11:13:46.671463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 4056 MB memory) -> physical GPU (device: 3, name: TITAN X (Pascal), pci bus id: 0000:02:00.0, compute capability: 6.1)
Construction finished in 0:00:05                                                                                                                              
2020-09-17 11:13:51.029187: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10               
2020-09-17 11:13:51.570327: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Accuracy before training: 0.0934000015258789
Epoch 1/3
300/300 [==============================] - 9s 31ms/step - loss: 0.2685 - out_p_loss: 0.2685                                                                   
Epoch 2/3
300/300 [==============================] - 9s 31ms/step - loss: 0.0696 - out_p_loss: 0.0696
Epoch 3/3
300/300 [==============================] - 9s 31ms/step - loss: 0.0481 - out_p_loss: 0.0481
Accuracy after training: 0.9850999712944031 0:00:00                                                                                                           
(nengodl) p291020@turing10:~$ e finished in 0:00:00

With my own model the output is:

(nengodl) p291020@turing10:~/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/experiments$ PYTHONPATH=.. python mPES.py -vv -f x -i sine -N 100 -d /gpu:3 -l mPES
Using run optimisation
Devices available:
Device type: CPU String: /physical_device:CPU:0
Device type: XLA_CPU String: /physical_device:XLA_CPU:0
Device type: XLA_GPU String: /physical_device:XLA_GPU:0
Device type: XLA_GPU String: /physical_device:XLA_GPU:1
Device type: XLA_GPU String: /physical_device:XLA_GPU:2
Device type: XLA_GPU String: /physical_device:XLA_GPU:3
Device type: GPU String: /physical_device:GPU:0
Device type: GPU String: /physical_device:GPU:1
Device type: GPU String: /physical_device:GPU:2
Device type: GPU String: /physical_device:GPU:3
Simulating with mPES()
Backend is nengo_dl, running on /gpu:3
Build finished in 0:00:00                                                                                                                                     
Optimization finished in 0:00:00                                                                                                                              
|                                                    #            Constructing graph                                                                 | 0:00:052020-09-17 10:07:43.481718: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 10:07:43.483678: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 10:07:43.485466: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-09-17 10:07:43.487683: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
|                                                     #           Constructing graph                                                                 | 0:00:052020-09-17 10:07:43.514308: W tensorflow/core/common_runtime/colocation_graph.cc:1139] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:3' assigned_device_name_='' resource_device_name_='/device:GPU:3' supported_device_types_=[CPU] possible_devices_=[]
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU 
AssignVariableOp: CPU XLA_CPU XLA_GPU 
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU 
Const: GPU CPU XLA_CPU XLA_GPU 
VarHandleOp: CPU XLA_CPU XLA_GPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  TensorGraph/saved_state/int32_1/Initializer/zeros (Const) 
  TensorGraph/saved_state/int32_1 (VarHandleOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/Assign (AssignVariableOp) /device:GPU:3
  TensorGraph/saved_state/int32_1/Read/ReadVariableOp (ReadVariableOp) /device:GPU:3

Construction finished in 0:00:06                                                                                                                              

Running discretised step 1 of 1
|                 #                                                   Simulating                                                                     | 0:00:01WARNING:tensorflow:From /home/p291020/.conda/envs/nengodl/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From /home/p291020/.conda/envs/nengodl/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Simulation finished in 0:06:37                                                                                                                                

Total time for simulation: 00:06:37 s
Maximum memory usage: 301.5 MB
MSE after learning [f(pre) vs. post]:
[0.10397076606750488, 0.23197855055332184, 0.14481531083583832]
Pearson correlation after learning [f(pre) vs. post]:
[0.8704643944583814, 0.9591849603892563, 0.984917510410567]
Spearman correlation after learning [f(pre) vs. post]:
[0.8405644401582758, 0.9417869220256265, 0.9824773736246466]
Kendall correlation after learning [f(pre) vs. post]:
[0.6274194684869902, 0.793860776251106, 0.8814211776472058]

Could it be that this is all because I’m not using the NengoDL/Keras “fit, evaluate, predict” API?