NengoOCL (or rather, OpenCL) code is much more complicated than the NengoDL / Keras code, especially for people who don't have experience programming in C. I don't do much C programming myself, but I'll try to explain, to the best of my knowledge, what is going on with the NengoOCL code.
The NengoOCL Simulator Class
As I described in the other post I made, when the NengoOCL simulator goes to build Nengo operators, it will try to call the sim._plan_<op_name> method. Since the voltage learning rule creates a SimVLR operator, the NengoOCL simulator will attempt to call the sim._plan_SimVLR method. Unfortunately, unlike core Nengo, which has an easy way to register build functions, NengoOCL calls a method that is part of the simulator class, so you'll need to subclass the NengoOCL simulator to "insert" your custom _plan_SimVLR method:
import nengo_ocl

class CustomOCLSimulator(nengo_ocl.Simulator):
    def _plan_SimVLR(self, ops):
        ...
Then, when you go to run your model, you'll need to use this custom NengoOCL simulator instead of the default one:
from ... import CustomOCLSimulator
with nengo.Network() as model:
    ...  # define Nengo model

with CustomOCLSimulator(model) as sim:
    sim.run(1)
So that's how to actually use the custom NengoOCL simulator. Now let me go into more detail about how each of the different components of the NengoOCL operators works. Before I begin discussing all of the parts of the NengoOCL simulator class, I recommend experimenting with an existing (working) learning rule to see what each of the parameters does (e.g., insert print statements to see what their sizes are and so forth) to get a better understanding of the code.
The _plan_SimVLR Method
The goal of the _plan_ method is to assemble the appropriate Nengo signals and arrays, and feed those values to a function that creates the OpenCL code. The _plan_ method functions similarly to the build_ methods that you register with the core Nengo simulator/builder.
The _plan_ method receives one argument, ops, which is a list containing all of the corresponding Nengo operators related to this plan method. As an example, if your Nengo model contains two voltage learning rule connections, ops will contain references to both of the SimVLR operators. In NengoOCL, operators are grouped this way to increase the efficiency of executing operators on the GPU: it is more efficient to execute one learning rule kernel that works across multiple connections than to execute one kernel per connection.
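To get a feel for this grouping (a hypothetical debugging snippet, in the spirit of the print-statement experimentation suggested above, and not part of the final implementation), you could put something like this at the top of the plan method:

def _plan_SimVLR(self, ops):
    # With two voltage-learning-rule connections in the model, len(ops) == 2
    print("Number of SimVLR ops:", len(ops))
    # Each op carries its own signals, e.g., its connection weight matrix
    for op in ops:
        print(op.weights.shape)
    ...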
There are two main parts to the _plan_ method:
- A part of the code that assembles the Nengo signals and various arrays. In this part of the code, the NengoOCL simulator pre-packages all of the signals in your Nengo model into the self.all_data list, and the code just creates references to these signals. For non-signals, you'll need to assemble the data arrays yourself and, if necessary, note where the data starts for each operator (see the BCM masked learning rule for an example of this). For the voltage learning rule, this part of the method might look something like this:
pre_voltages = self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
post_filtered = self.all_data[[self.sidx[op.post_filtered] for op in ops]]
weights = self.all_data[[self.sidx[op.weights] for op in ops]]
delta = self.all_data[[self.sidx[op.delta] for op in ops]]
alpha = self.Array([op.learning_rate * self.model.dt for op in ops])
vprog = self.Array([op.vprog for op in ops])
- A part that calls the function that builds the OCL code. In NengoOCL, this is just a function call, like so (assuming plan_vlr is the custom OCL code creation function; a sketch of the full method, with both parts put together, follows after this list):
return [
    plan_vlr(
        self.queue, pre_voltages, post_filtered, weights, delta, alpha, vprog
    )
]
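For reference, here is how those two parts might fit together inside the custom simulator subclass (this is just the snippets from above assembled into one method, so treat it as a sketch rather than a definitive implementation):

class CustomOCLSimulator(nengo_ocl.Simulator):
    def _plan_SimVLR(self, ops):
        # Gather references to the pre-packaged Nengo signals
        pre_voltages = self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
        post_filtered = self.all_data[[self.sidx[op.post_filtered] for op in ops]]
        weights = self.all_data[[self.sidx[op.weights] for op in ops]]
        delta = self.all_data[[self.sidx[op.delta] for op in ops]]
        # Assemble the per-operator scalars into OCL arrays
        alpha = self.Array([op.learning_rate * self.model.dt for op in ops])
        vprog = self.Array([op.vprog for op in ops])
        # Hand everything off to the OCL code creation function
        return [
            plan_vlr(
                self.queue, pre_voltages, post_filtered, weights, delta, alpha, vprog
            )
        ]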
The plan_vlr Function
This function is where all of the OCL code is created, and is split into four sections (a skeleton of the whole function follows after this list):
- A section of the code that double-checks that all of the data being passed to the function is correctly shaped and sized. As an example, the delta matrix that is passed to the plan functions for most learning rules is the size of the full connection weight matrix. This weight matrix is size_post x size_pre in shape, so a check might be:
assert (post.shape0s == delta.shape0s).all()
assert (pre.shape0s == delta.shape1s).all()
Other checks this section performs include verifying that the number of operators for each input is correct, and that the data types are as expected. The full code for this section might look something like this:
# Check that all lists being passed in have the same number of items per list,
# which should be the same as the number of ops
assert (
    len(pre) == len(post) == len(weights) == len(delta) == alpha.size == vprog.size
)
N = len(pre)  # Number of ops in op list

for arr in (pre, post):  # vectors
    # Check that all vectors are flat (second dim is 1)
    assert (arr.shape1s == 1).all()

for arr in (delta, weights):  # matrices
    # Check that all matrices have a stride1 of 1
    # (debug print to inspect the strides)
    print(">>", arr.stride0s, arr.stride1s)
    assert (arr.stride1s == 1).all()

# Check that the shapes of the matrices are correct. The delta and weights
# matrices should be post.shape x pre.shape
assert (post.shape0s == delta.shape0s).all()
assert (pre.shape0s == delta.shape1s).all()
assert (post.shape0s == weights.shape0s).all()
assert (pre.shape0s == weights.shape1s).all()

# Check that the ctypes for all of the data are the same
assert (
    pre.ctype
    == post.ctype
    == weights.ctype
    == delta.ctype
    == alpha.ctype
    == vprog.ctype
)
- Next, a section of code defines the C code that will be used to compile the kernel that will be run on the GPU. I'll elaborate more on this code in the next block of text below. This section of code might look like this:
text = """
__kernel void vlr() {
// kernel definition here
}
- Next follows a section of the code that uses the C code text from the previous section, as well as other parameters (like the shapes and sizes of the function inputs), to actually create the OCL kernel. This section of code might look like this:
textconf = dict(type=pre.ctype)
text = as_ascii(Template(text, output_encoding="ascii").render(**textconf))
# Provide a reference to the arguments that will be given to the OCL kernel code
# - Must match order of arguments appearing in OCL code
full_args = (
    delta.cl_shape0s,
    delta.cl_shape1s,
    pre.cl_stride0s,
    pre.cl_starts,
    pre.cl_buf,
    post.cl_stride0s,
    post.cl_starts,
    post.cl_buf,
    weights.cl_stride0s,
    weights.cl_starts,
    weights.cl_buf,
    delta.cl_stride0s,
    delta.cl_starts,
    delta.cl_buf,
    alpha,
    vprog,
)
# Create the OCL kernel
_fn = cl.Program(queue.context, text).build().vlr
_fn.set_args(*[arr.data for arr in full_args])
- The final section of the function creates the NengoOCL plan (which is similar to a Nengo operator) and returns it to the simulator. This section of code defines things like what the global OCL indices should iterate over (see the next section of text for how the OCL code works) and the number of bytes the kernel needs per call. It also specifies how many floating point operations each call will take, but I'm not sure how to determine this, so you'll have to experiment with this number yourself. The code for this section might look like this (a lot of it is copy-pasted from existing NengoOCL plans):
# Create the NengoOCL plan and return it
lsize = None
gsize = (delta.sizes.max(), N) # global index shape
plan = Plan(queue, _fn, gsize, lsize=lsize, name="cl_vlr", tag=tag)
plan.full_args = full_args # prevent garbage-collection
# Number of FLOPs per call. Not sure how to determine this.
plan.flops_per_call = 4 * delta.sizes.sum()
plan.bw_per_call = (
    pre.nbytes
    + post.nbytes
    + weights.nbytes
    + delta.nbytes
    + alpha.nbytes
    + vprog.nbytes
)
return plan
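Putting the four sections together, the overall shape of plan_vlr would be something like this (a skeleton only; the tag keyword argument is an assumption on my part, based on how existing NengoOCL plan functions take a tag to pass along to Plan):

def plan_vlr(queue, pre, post, weights, delta, alpha, vprog, tag=None):
    # 1. Sanity checks on shapes, sizes, and ctypes (see above)
    # 2. Define the OCL kernel source code in `text`
    # 3. Render the template and build the kernel with cl.Program
    # 4. Create the NengoOCL Plan and return it
    ...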
The OCL Code
The OCL code is a fairly standard C function. The first part is the function definition, where the argument list matches the argument list defined in Step 3 above. Here's what the voltage learning rule argument list might look like. I included all of the parameters I am aware of, but I noticed a fun_post function in your learning rule code. I don't know what that consists of, so I've excluded it from this example.
__kernel void vlr(
    __global const int *shape0s,
    __global const int *shape1s,
    __global const int *pre_stride0s,
    __global const int *pre_starts,
    __global const ${type} *pre_data,
    __global const int *post_stride0s,
    __global const int *post_starts,
    __global const ${type} *post_data,
    __global const int *weights_stride0s,
    __global const int *weights_starts,
    __global const ${type} *weights_data,
    __global const int *delta_stride0s,
    __global const int *delta_starts,
    __global ${type} *delta_data,
    __global const ${type} *alphas,
    __global const ${type} *vprog
)
Next up in the OCL code is the determination of the matrix indices. In NengoOCL, all of the data is flattened, not just across dimensions, but across operators as well. As an example, if you had two learning rule operators, one with a 1x1 connection matrix, and one with a 2x2 connection matrix, the data being sent to the kernel would look like this:
weights = [lr1_00, lr2_00, lr2_01, lr2_10, lr2_11]
whereby the weights for each learning rule operator would actually be this:
weights_lr1 = [lr1_00]
weights_lr2 = [[lr2_00, lr2_01], [lr2_10, lr2_11]]
Because the data is flattened in this way, in the OCL kernel we need to compute the relative i and j of a matrix given a global index. The code to do this looks like this:
const int ij = get_global_id(0);
const int k = get_global_id(1);
const int shape0 = shape0s[k];
const int shape1 = shape1s[k];
const int i = ij / shape1;
const int j = ij % shape1;
Note that k determines the operator index, so if you have two of the same learning rule operators (see Step 1 of the _plan method), k would be 0 for the first one and 1 for the second one.
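As a hypothetical worked example, using the two operators described above (a 1x1 and a 2x2 matrix):

// gsize = (delta.sizes.max(), N) = (4, 2), so ij runs over 0..3 and k over 0..1.
// For k = 1 (the 2x2 operator), shape1s[1] = 2, so the work item with ij = 3
// computes i = 3 / 2 = 1 and j = 3 % 2 = 1, i.e., element lr2_11.
// For k = 0 (the 1x1 operator), only ij = 0 is in bounds; the remaining work
// items are skipped by the if (i < shape0) guard shown further down.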
The next part of the OCL kernel code uses the i, j, and k indices to figure out what data is relevant to the matrix operation being computed. Once again, this is needed because the input data is a flattened array. The code might look like this:
__global ${type} *delta = delta_data + delta_starts[k];
const ${type} pre = pre_data[pre_starts[k] + j*pre_stride0s[k]];
const ${type} post = post_data[post_starts[k] + i*post_stride0s[k]];
const ${type} weight_ij = weights_data[
    weights_starts[k] + i*weights_stride0s[k] + j];
const ${type} alpha = alphas[k];
Note that delta is created as a global pointer reference here because this learning rule modifies delta as part of its operation. The _starts[k] array values tell the code where in the data array the values for that specific operator start (remember, everything is flattened). Likewise, something like i*weights_stride0s[k] + j is the standard math to figure out which element in a flattened array corresponds to a specific i and j of a matrix.
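Continuing the same hypothetical two-operator example, the flattened weights array and its bookkeeping arrays might look like this:

// weights_data     = [lr1_00, lr2_00, lr2_01, lr2_10, lr2_11]
// weights_starts   = [0, 1]   // lr1's data begins at index 0, lr2's at index 1
// weights_stride0s = [1, 2]   // stride between rows (i.e., row length) of each matrix
//
// For k = 1, i = 1, j = 0 (element lr2_10), the lookup is:
//   weights_data[weights_starts[1] + 1*weights_stride0s[1] + 0]
//     = weights_data[1 + 2 + 0] = weights_data[3] = lr2_10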
The final part of the OCL kernel code is what the learning rule computation does. This code might look something like this:
if (i < shape0) {
    delta[i*delta_stride0s[k] + j] =
        alpha * post * weight_ij * ...;
}
The if (i < shape0) check is just to make sure that, in case the i index goes out of bounds, the computation doesn't spit out garbage numbers. Also note that this code is defined per (i, j) element of the delta matrix. Finally, I noticed you had a fun_post function embedded in your learning rule code. That will need to be re-implemented in C for the OCL code to function properly.
Summary
And that should be it. As with my other posts, I've assembled all of the code snippets I included above into one file: voltage_learning_ocl.py (6.0 KB)
I highly recommend you look at this file and play with the various debug statements and parameters to get a better understanding of how the OCL code works and, more importantly, what it expects. A lot of what I've learned about OCL code, I learned using exactly that method of debugging.
I hope this helps! 