Parameter space exploration, nengo_ocl and SpiNNaker

NengoOCL (or rather, OpenCL) code is much more complicated than the NengoDL / Keras code, especially for people who don’t have experience programming in C. I don’t do much C programming myself, but I’ll explain, to the best of my knowledge, what is going on in the NengoOCL code.

The NengoOCL Simulator Class
As I described in my other post, when the NengoOCL simulator goes to build Nengo operators, it will try to call the sim._plan_<op_name> method. Since the voltage learning rule creates a SimVLR operator, the NengoOCL simulator will attempt to call the sim._plan_SimVLR method. Unfortunately, unlike core Nengo, which has an easy way to register build functions, NengoOCL attempts to call a method that is part of the simulator class, so you’ll need to subclass the NengoOCL simulator to “insert” your custom _plan_SimVLR method:

class CustomOCLSimulator(nengo_ocl.Simulator):
    def _plan_SimVLR(self, ops):
        ...

Then, when you go to run your model, you’ll need to use this custom NengoOCL simulator instead of the default one:

from ... import CustomOCLSimulator

with nengo.Network() as model:
    ...  # define Nengo model

with CustomOCLSimulator(model) as sim:
    sim.run(1)

So that’s how to actually use the custom NengoOCL simulator. Now let me get into more detail about how each of the components of the NengoOCL operators works. Before I begin discussing all of the parts of the NengoOCL simulator class, I recommend experimenting with an existing (working) learning rule to see what each of the parameters does (e.g., insert print statements to see what their sizes are) to get a better understanding of the code.

The _plan_SimVLR Method
The goal of the _plan_ method is to assemble the appropriate Nengo signals and arrays, and feed those values to a function that creates the OpenCL code. The _plan_ method functions similarly to the build_ methods that you register with the core Nengo simulator/builder.

The _plan_ method receives one argument, ops, which is a list containing all of the Nengo operators related to this plan method. As an example, if your Nengo model contains two voltage learning rule connections, ops will contain references to both SimVLR operators. In NengoOCL, operators are grouped this way to increase the efficiency of executing operators on the GPU: it is more efficient to execute one learning rule kernel that works across multiple connections than to execute one kernel per connection.

There are two main parts to the _plan_ method:

  1. A part of the code that assembles the Nengo signals and various arrays. In this part of the code, the NengoOCL simulator pre-packages all of the signals in your Nengo model into the self.all_data list, and the code just creates references to these signals. For non-signals, you’ll need to assemble the data arrays yourself and, if necessary, note where the data starts for each operator (see the BCM masked learning rule for an example of this). For the voltage learning rule, this part of the method might look something like this:
        pre_voltages = self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
        post_filtered = self.all_data[[self.sidx[op.post_filtered] for op in ops]]
        weights = self.all_data[[self.sidx[op.weights] for op in ops]]
        delta = self.all_data[[self.sidx[op.delta] for op in ops]]
        alpha = self.Array([op.learning_rate * self.model.dt for op in ops])
        vprog = self.Array([op.vprog for op in ops])
  2. A part that calls the function that builds the OCL code. In NengoOCL, this is just a function call, like so (assuming plan_vlr is the custom OCL code creation function):
        return [
            plan_vlr(
                self.queue, pre_voltages, post_filtered, weights, delta, alpha, vprog
            )
        ]
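The gathering pattern in part 1 (indexing self.all_data with a list of signal indices) can be illustrated with plain Python lists standing in for NengoOCL's RaggedArray. Everything here (MockOp, the sidx dict, the signal names) is a toy stand-in for illustration, not real NengoOCL API:

```python
# Toy illustration of the signal-gathering pattern used in _plan_ methods.
# In NengoOCL, sidx maps a Signal to its index in all_data; here, plain
# strings and nested lists stand in for Signals and the RaggedArray.

class MockOp:
    def __init__(self, pre_voltages, learning_rate):
        self.pre_voltages = pre_voltages  # stand-in for a Signal
        self.learning_rate = learning_rate

all_data = [[0.1, 0.2], [0.3], [0.5, 0.6, 0.7]]  # one entry per signal
sidx = {"sig_a": 0, "sig_b": 2}                   # signal -> index in all_data

ops = [MockOp("sig_a", 1e-4), MockOp("sig_b", 1e-3)]

# Mirrors: self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
pre_voltages = [all_data[sidx[op.pre_voltages]] for op in ops]

# Mirrors: self.Array([op.learning_rate * self.model.dt for op in ops])
dt = 0.001
alpha = [op.learning_rate * dt for op in ops]
```

The key point is that every per-operator quantity ends up as a list with one entry per op, in the same order as ops, which is what lets one kernel later iterate over all operators.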

The plan_vlr Function
This function is where all of the OCL code is created, and is split into 4 sections:

  1. A section of code that double-checks that all of the data being passed to the function is correctly shaped and sized. As an example, the delta matrix passed to the plan functions for most learning rules is the size of the full connection weight matrix. This weight matrix is size_post x size_pre in shape, so a check might be:
    assert (post.shape0s == delta.shape0s).all()
    assert (pre.shape0s == delta.shape1s).all()

Other checks in this section include things like verifying that the number of operators for each input is correct, and that the data types are as expected. The full code for this section might look something like this:

    # Check that all lists being passed in have the same number of items per list,
    # which should be the same as the number of ops
    assert (
        len(pre) == len(post) == len(weights) == len(delta) == alpha.size == vprog.size
    )
    N = len(pre)  # Number of ops in op list

    for arr in (pre, post):  # vectors
        # Check that all vectors are flat (second dim is 1)
        assert (arr.shape1s == 1).all()
    for arr in (delta, weights):  # matrices
        # Check that all matrices have a stride1 of 1
        assert (arr.stride1s == 1).all()

    # Check shape of matrices are correct. delta and weights matrix should be
    # post.shape x pre.shape
    assert (post.shape0s == delta.shape0s).all()
    assert (pre.shape0s == delta.shape1s).all()
    assert (post.shape0s == weights.shape0s).all()
    assert (pre.shape0s == weights.shape1s).all()

    # Check that the ctypes for all of the data are the same
    assert (
        pre.ctype
        == post.ctype
        == weights.ctype
        == delta.ctype
        == alpha.ctype
        == vprog.ctype
    )
  2. Next, a section of code that defines the C code used to compile the kernel that will run on the GPU. I’ll elaborate more on this code in the next block of text below. This section of code might look like this:
    text = """
    __kernel void vlr() {
        // kernel definition here
    }
    """
  3. Next follows a section of code that uses the C code text from the previous section, as well as other parameters (like the shapes and sizes of the function inputs), to actually create the OCL kernel. This section of code might look like this:
    textconf = dict(type=pre.ctype)
    text = as_ascii(Template(text, output_encoding="ascii").render(**textconf))

    # Provide a reference to the arguments that will be given to the OCL kernel code
    # - Must match order of arguments appearing in OCL code
    full_args = (
        delta.cl_shape0s,
        delta.cl_shape1s,
        pre.cl_stride0s,
        pre.cl_starts,
        pre.cl_buf,
        post.cl_stride0s,
        post.cl_starts,
        post.cl_buf,
        weights.cl_stride0s,
        weights.cl_starts,
        weights.cl_buf,
        delta.cl_stride0s,
        delta.cl_starts,
        delta.cl_buf,
        alpha,
        vprog,
    )
    # Create the OCL kernel
    _fn = cl.Program(queue.context, text).build().vlr
    _fn.set_args(*[arr.data for arr in full_args])
  4. The final section of the function creates the NengoOCL plan (which is similar to a Nengo operator) and returns it to the simulator. This section of code defines things like what the global OCL indices should iterate over (see the next section of text for how the OCL code works) and the number of bytes the kernel needs per call. This section also specifies how many floating point operations each call takes, but I’m not sure how to determine this, so you’ll have to experiment with this number yourself. The code for this section might look like this (a lot of it is copy-pasted from existing NengoOCL plans):
    # Create the NengoOCL plan and return it
    lsize = None
    gsize = (delta.sizes.max(), N)  # global index shape
    plan = Plan(queue, _fn, gsize, lsize=lsize, name="cl_vlr", tag=tag)
    plan.full_args = full_args  # prevent garbage-collection
    # Number of FLOPs per call. Not sure how to determine this.
    plan.flops_per_call = 4 * delta.sizes.sum()
    plan.bw_per_call = (
        pre.nbytes
        + post.nbytes
        + weights.nbytes
        + delta.nbytes
        + alpha.nbytes
        + vprog.nbytes
    )
    return plan
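A note on the templating step in Section 3: NengoOCL uses mako’s Template to render the ${type} placeholders in the kernel text. For a simple substitution like this, the same effect can be demonstrated with the stdlib string.Template (which also understands the ${name} syntax), so here’s a mako-free sketch of what the rendering does:

```python
from string import Template

# Kernel text with a ${type} placeholder, as in the plan function above
# (the real kernel has many more arguments; this is trimmed down).
text = """
__kernel void vlr(__global const ${type} *pre_data) {
    // kernel body here
}
"""

# NengoOCL renders this with mako's Template; the stdlib string.Template
# performs the same simple ${type} substitution for illustration.
rendered = Template(text).substitute(type="float")
```

After rendering, every ${type} in the C source has been replaced with the concrete OpenCL scalar type (e.g. float), so the text can be compiled by cl.Program.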

The OCL Code
The OCL code is typically a standard C function. The first part is the function definition, where the argument list matches the argument list defined in Step 3 above. Here’s what the voltage learning rule argument list might look like. I included all of the parameters I am aware of, but I noticed a fun_post function in your learning rule code. I don’t know what that consists of, so I’ve excluded it from this example.

    __kernel void vlr(
        __global const int *shape0s,
        __global const int *shape1s,
        __global const int *pre_stride0s,
        __global const int *pre_starts,
        __global const ${type} *pre_data,
        __global const int *post_stride0s,
        __global const int *post_starts,
        __global const ${type} *post_data,
        __global const int *weights_stride0s,
        __global const int *weights_starts,
        __global const ${type} *weights_data,
        __global const int *delta_stride0s,
        __global const int *delta_starts,
        __global ${type} *delta_data,
        __global const ${type} *alphas,
        __global const ${type} *vprog
    )

Next up in the OCL code is the determination of the matrix indices. In NengoOCL, all of the data is flattened, not just across dimensions, but across operators as well. As an example, if you had two learning rule operators, one with a 1x1 connection matrix and one with a 2x2 connection matrix, the data being sent to the kernel would look like this:

    weights = [lr1_00, lr2_00, lr2_01, lr2_10, lr2_11]

whereby the weights for each learning rule operator would actually be this:

    weights_lr1 = [lr1_00]
    weights_lr2 = [[lr2_00, lr2_01], [lr2_10, lr2_11]]
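This flattening is easy to reproduce with NumPy. Here’s a toy check of the layout described above, with made-up numeric values standing in for lr1_00 and friends:

```python
import numpy as np

# Two per-operator weight matrices: a 1x1 and a 2x2 (toy values).
weights_lr1 = np.array([[1.0]])
weights_lr2 = np.array([[2.0, 3.0],
                        [4.0, 5.0]])

# NengoOCL-style flat buffer: each matrix is flattened row-major, then
# all operators are concatenated into one long array.
flat = np.concatenate([weights_lr1.ravel(), weights_lr2.ravel()])

# Per-operator start offsets into the flat buffer (what the *_starts
# arrays in the kernel arguments record).
starts = [0, weights_lr1.size]
```

So the second operator’s data begins at offset 1 in the flat buffer, which is exactly the information the _starts arrays carry into the kernel.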

Because the data is flattened in this way, the OCL kernel needs to compute the relative i and j of a matrix given a global index. The code to do this looks like this:

        const int ij = get_global_id(0);
        const int k = get_global_id(1);
        const int shape0 = shape0s[k];
        const int shape1 = shape1s[k];
        const int i = ij / shape1;
        const int j = ij % shape1;

Note that k determines the operator index, so if you have two of the same learning rule operators (see Step 1 of the _plan method), k would be 0 for the first one, then 1 for the second one.
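The index arithmetic above can be checked quickly in Python, using the 1x1 and 2x2 operators from the flattening example (the shapes here are hypothetical, and // stands in for C’s integer division on ints):

```python
# shape0s / shape1s per operator, as passed in the kernel arguments.
shape0s = [1, 2]
shape1s = [1, 2]

def unflatten(ij, k):
    """Recover (i, j) from global index ij for operator k, as the kernel does."""
    shape1 = shape1s[k]
    return ij // shape1, ij % shape1

# For the 2x2 operator (k=1), global indices 0..3 map back to matrix
# coordinates in row-major order.
coords = [unflatten(ij, 1) for ij in range(4)]
```

So each global work item (ij, k) lands on exactly one element of exactly one operator’s matrix.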

The next part of the OCL kernel code uses the i, j, and k indices to figure out what data is relevant to the matrix operation being computed. Once again, this is done because the input data is a flattened array. The code might look like this:

        __global ${type} *delta = delta_data + delta_starts[k];
        const ${type} pre = pre_data[pre_starts[k] + j*pre_stride0s[k]];
        const ${type} post = post_data[post_starts[k] + i*post_stride0s[k]];
        const ${type} weight_ij = weights_data[
            weights_starts[k] + i*weights_stride0s[k] + j];
        const ${type} alpha = alphas[k];

Note that delta is created as a global pointer reference here because this learning rule modifies delta as part of its operation. The _starts[k] array values tell the code where in the data array the values for that specific operator start (remember, everything is flattened). Likewise, something like i*weights_stride0s[k] + j is standard math to figure out which element in a flattened array corresponds to a specific i and j of a matrix.

The final part of the OCL kernel code is what the learning rule computation does. This code might look something like this:

        if (i < shape0) {
            delta[i*delta_stride0s[k] + j] =
                alpha * post * weight_ij * ...;
        }

The if (i < shape0) is just to make sure that, in case the i index goes out of bounds, the computation does not spit out garbage numbers. Also note that this code is defined per i,j element of the delta matrix. Finally, I noticed you had a fun_post function embedded in your learning rule code. That will need to be re-implemented in C for the OCL code to function properly.
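To convince yourself the pointer arithmetic and the bounds guard behave as described, the per-element kernel body can be mirrored in plain Python. This toy emulation reuses the two-operator (1x1 and 2x2) layout from the flattening example, with a made-up update rule (delta = alpha * weight) standing in for the actual voltage learning rule math:

```python
import numpy as np

# Flat buffers for two operators: a 1x1 and a 2x2 weight matrix (toy values).
weights_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
delta_data = np.zeros_like(weights_data)
shape0s, shape1s = [1, 2], [1, 2]
weights_starts = delta_starts = [0, 1]
weights_stride0s = delta_stride0s = [1, 2]  # row stride = row length here
alphas = [0.5, 0.1]

# Emulate the kernel: one (ij, k) pair per global work item, where the
# global size is (max matrix size, number of operators).
n_ops = 2
max_size = max(s0 * s1 for s0, s1 in zip(shape0s, shape1s))
for k in range(n_ops):
    for ij in range(max_size):
        shape0, shape1 = shape0s[k], shape1s[k]
        i, j = ij // shape1, ij % shape1
        if i < shape0:  # bounds guard, as in the kernel
            w = weights_data[weights_starts[k] + i * weights_stride0s[k] + j]
            # Made-up update rule (stand-in for the real learning rule):
            delta_data[delta_starts[k] + i * delta_stride0s[k] + j] = alphas[k] * w
```

Note how, for the 1x1 operator, global indices 1 through 3 fail the i < shape0 guard and write nothing, which is exactly why the guard is needed when the global size is set to the largest operator.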

Summary
And that should be it. As with my other posts, I’ve assembled all of the code snippets I included above into one file: voltage_learning_ocl.py (6.0 KB)
I highly recommend you look at this file and play with the various debug statements and parameters to get a better understanding of how the OCL code works, and more importantly, what it expects. A lot of the knowledge I’ve gleaned from OCL code I have done so using that specific method of debugging.

I hope this helps! :slight_smile:
