Benchmarking Apple M1

Nengo’s backend portability sets it apart as a benchmarking tool. Apple has recently made a big move away from Intel CPUs and Iris graphics to chips designed in-house. The M1 is the first of these not to be in a phone. There’s a lot of talk about its benchmark performance.

I wonder how they perform on neuromorphic benchmarks.

Obviously they support Python. Single-core speed is tied to clock rate, but it’s also tied in nontrivial ways to cache architecture and OS power management.

They also support OpenCL. The GPU architecture is unusual, so its performance is very hard to predict. It has NVIDIA-class specs on FLOPs, but that is largely irrelevant to benchmarks that are communication-limited. Being co-integrated with the CPU means GPU bandwidth could potentially be higher, but it’s not a one-to-one comparison. Someone just has to try it.

To add another twist, Intel produces Loihi, a neuromorphic architecture that, to my knowledge, has no Apple counterpart. A benchmark of Loihi against the M1 (and eventually the M2, M3, etc.) is potentially interesting to somebody, and I think Nengo would be the ideal way to do that.

Is anyone else interested in developing a benchmark script? Nothing exhaustive; just a proof of concept of Nengo’s capability. Has someone already made this? Would anyone with an M1 be willing to set up pyopencl and run some scripts?
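To make the proof-of-concept idea concrete, here is a minimal sketch of what such a script might look like. The workload (`build_model`), its sizes, and the helper names `time_call` and `benchmark` are all hypothetical choices for illustration, not part of any existing Nengo benchmark suite.

```python
import time

try:
    import nengo  # reference CPU backend
except ImportError:  # keeps the timing helper usable without Nengo installed
    nengo = None


def time_call(fn, repeats=3):
    """Return the best wall-clock time (seconds) of fn() over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def build_model(n_neurons=1000, dimensions=4, seed=0):
    """Hypothetical workload: one recurrently connected ensemble."""
    with nengo.Network(seed=seed) as model:
        stim = nengo.Node([0.5] * dimensions)
        ens = nengo.Ensemble(n_neurons, dimensions)
        nengo.Connection(stim, ens)
        nengo.Connection(ens, ens, synapse=0.1)
        nengo.Probe(ens, synapse=0.01)
    return model


def benchmark(simulator_cls, sim_time=1.0, **sim_kwargs):
    """Time only the run phase, so model build time is excluded."""
    model = build_model()
    with simulator_cls(model, **sim_kwargs) as sim:
        return time_call(lambda: sim.run(sim_time), repeats=1)


if __name__ == "__main__" and nengo is not None:
    seconds = benchmark(nengo.Simulator, progress_bar=False)
    print(f"nengo.Simulator: {seconds:.2f} s for 1 s of simulated time")
```

Passing a different simulator class (for example the one from nengo_ocl, if it is installed) into `benchmark` should be enough to compare backends on the same model, since Nengo backends share the `Simulator(model)` / `sim.run(t)` interface.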


nengo-benchmarks has been used to do some benchmarking of some core Nengo NEF and SPA networks across various backends. It’s pretty extensible and has some nice methods for sweeping parameters and storing metrics via pytry. Mentioning in case that’s a helpful starting point. I don’t have anything to add for the rest of your post (e.g., I don’t have an M1) though I am interested!

If you want, I could set you up with SSH access to my Mac Mini M1. I already managed to get NengoDL running on it, but would love to help you developers give it full support. :slight_smile:

A very quick benchmark I just tried: classifying 10,000 MNIST digits on Nengo Core takes 50 minutes on my 2017 MacBook Pro with a 2.8 GHz quad-core Intel Core i7, and 19 (!!) minutes on my Mac Mini with an Apple Silicon M1 (without the fan even kicking in, while my MacBook was spinning at maximum).

I would expect NengoDL to be even faster, once it is properly supported.

That’s very interesting @Tioz. Thank you for just jumping in! I think you are saying that you are using a CPU-based backend; is that right? Raw clock rate does not explain a factor of 2.5, so that would seem to indicate that architecture or the OS is having an effect. Is it Big Sur or an earlier version of macOS (OS X)?
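As a rough back-of-the-envelope check on that claim, the arithmetic can be sketched as below. The two run times and the 2.8 GHz figure come from the post above; the 3.2 GHz clock for the M1's performance cores is my assumption and is not stated in this thread, and the Intel part's turbo boost is ignored.

```python
# Wall-clock times reported earlier in the thread (minutes)
intel_minutes = 50.0
m1_minutes = 19.0

# Nominal clock rates (GHz). 2.8 is from the post; 3.2 for the M1
# performance cores is an assumption, not a figure from this thread.
intel_ghz = 2.8
m1_ghz = 3.2

speedup = intel_minutes / m1_minutes      # observed speedup
clock_ratio = m1_ghz / intel_ghz          # speedup clock rate alone would predict
unexplained = speedup / clock_ratio       # remaining factor after clock rate

print(f"observed speedup:    {speedup:.2f}x")
print(f"clock-rate ratio:    {clock_ratio:.2f}x")
print(f"left to explain:     {unexplained:.2f}x")
```

Under these assumptions, clock rate accounts for only a small part of the measured difference, which is why architecture or OS effects seem the more likely explanation.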

Apple graphics does not support CUDA, the GPU language made by NVIDIA upon which NengoDL and tensorflow-gpu are based. Instead, Apple makes Metal. CUDA and Metal are each meant to unlock the latest hardware features of their own company’s GPUs, so code written for one does not port to the other’s hardware.

All GPUs so far support OpenCL, which is a more basic and generic GPU programming framework. That means using the Nengo-OCL backend. Do you want to try installing it? I’ve got it running on the Intel Iris GPU in my 2017 MacBook Pro with macOS Catalina, but I think there was a trick when configuring pyopencl for Mac. Please post if you run into any installation hang-ups.
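Once pyopencl is installed, a quick sanity check is to list the OpenCL devices it can see. This is a minimal sketch; the helper name `list_opencl_devices` is made up for illustration, and it returns an empty list if pyopencl is missing or no OpenCL platform is found.

```python
try:
    import pyopencl as cl
except ImportError:  # pyopencl not installed yet
    cl = None


def list_opencl_devices():
    """Return one dict per OpenCL device visible to pyopencl."""
    if cl is None:
        return []
    found = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                found.append({
                    "platform": platform.name,
                    "name": device.name,
                    "type": cl.device_type.to_string(device.type),
                    "compute_units": device.max_compute_units,
                    "global_mem_mb": device.global_mem_size // (1024 ** 2),
                })
    except cl.Error:
        # Raised, for example, when no OpenCL platform is available
        return []
    return found


if __name__ == "__main__":
    devices = list_opencl_devices()
    if not devices:
        print("No OpenCL devices found (or pyopencl is not installed).")
    for d in devices:
        print(f"{d['platform']}: {d['name']} [{d['type']}], "
              f"{d['compute_units']} compute units, {d['global_mem_mb']} MB")
```

If the GPU shows up in this list, the OpenCL side of the installation is probably fine, and any remaining problems are more likely in the Nengo-OCL setup itself.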

@arvoelke, I had not seen these before. Good stuff

Both Macs are on the latest Big Sur build.

There is a recent project on Apple’s part (https://github.com/apple/tensorflow_macos) to make TensorFlow run on Apple’s ML Compute framework and thus take advantage of both CPU and GPU without needing CUDA or OpenCL. I think this fork has actually been merged into the main TensorFlow branch.

I actually managed to get NengoDL to run on Apple Silicon, but it is not working properly yet.