NengoOCL (or rather, OpenCL) code is much more complicated than the NengoDL / Keras code, especially for people who don't have experience programming in C. I don't do much C programming myself, but I'll try to explain, to the best of my knowledge, what is going on with the NengoOCL code.
The NengoOCL Simulator Class
As I described in the other post I made, when the NengoOCL simulator goes to build Nengo operators, it will try to call the sim._plan_<op_name> method. Since the voltage learning rule creates a SimVLR operator, the NengoOCL simulator will attempt to call the sim._plan_SimVLR method. Unfortunately, unlike core Nengo, which has an easy way to register build functions, NengoOCL calls a method that is part of the simulator class, so you'll need to subclass the NengoOCL simulator to "insert" your custom _plan_SimVLR method:
import nengo_ocl

class CustomOCLSimulator(nengo_ocl.Simulator):
    def _plan_SimVLR(self, ops):
        ...
Then, when you go to run your model, you'll need to use this custom NengoOCL simulator instead of the default one:
from ... import CustomOCLSimulator
with nengo.Network() as model:
    ...  # define Nengo model

with CustomOCLSimulator(model) as sim:
    sim.run(1)
So that's how to actually use the custom NengoOCL simulator. Now let me go into more detail about how each of the different components of the NengoOCL operators works. Before I begin discussing all of the parts of the NengoOCL simulator class, I recommend experimenting with an existing (working) learning rule to see what each of the parameters does (e.g., insert print statements to see what their sizes are and so forth) to get a better understanding of the code.
The _plan_SimVLR Method
The goal of the _plan_ method is to assemble the appropriate Nengo signals and arrays, and feed those values to a function that creates the OpenCL code. The _plan_ method functions similarly to the build_ methods that you register with the core Nengo simulator/builder.
The _plan_ method receives one argument, ops, which is a list containing all of the corresponding Nengo operators related to this plan method. As an example, if your Nengo model contains two voltage learning rule connections, ops will contain references to both of the SimVLR operators. In NengoOCL, operators are grouped this way to increase the efficiency of executing operators on the GPU: it is more efficient to execute one learning rule kernel that works across multiple connections than to execute one kernel per connection.
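To get a feel for this grouping (a hypothetical debugging snippet, in the spirit of the print-statement experimentation suggested above, and not part of the final implementation), you could put something like this at the top of the plan method:

def _plan_SimVLR(self, ops):
    # With two voltage-learning-rule connections in the model, len(ops) == 2
    print("Number of SimVLR ops:", len(ops))
    # Each op carries its own signals, e.g., its connection weight matrix
    for op in ops:
        print(op.weights.shape)
    ...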
There are two main parts to the _plan_ method:
- A part of the code that assembles the Nengo signals and various arrays. In this part of the code, the NengoOCL simulator pre-packages all of the signals in your Nengo model into the self.all_data list, and the code just creates references to these signals. For non-signals, you'll need to assemble the data arrays yourself and, if necessary, note where the data starts for each operator (see the BCM masked learning rule for an example of this). For the voltage learning rule, this part of the method might look something like this:
pre_voltages = self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
post_filtered = self.all_data[[self.sidx[op.post_filtered] for op in ops]]
weights = self.all_data[[self.sidx[op.weights] for op in ops]]
delta = self.all_data[[self.sidx[op.delta] for op in ops]]
alpha = self.Array([op.learning_rate * self.model.dt for op in ops])
vprog = self.Array([op.vprog for op in ops])
- A part that calls the function that builds the OCL code. In NengoOCL, this is just a function call, like so (assuming plan_vlr is the custom OCL code creation function; a sketch of the full method, with both parts put together, follows after this list):
return [
    plan_vlr(
        self.queue, pre_voltages, post_filtered, weights, delta, alpha, vprog
    )
]
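For reference, here is how those two parts might fit together inside the custom simulator subclass (this is just the snippets from above assembled into one method, so treat it as a sketch rather than a definitive implementation):

class CustomOCLSimulator(nengo_ocl.Simulator):
    def _plan_SimVLR(self, ops):
        # Gather references to the pre-packaged Nengo signals
        pre_voltages = self.all_data[[self.sidx[op.pre_voltages] for op in ops]]
        post_filtered = self.all_data[[self.sidx[op.post_filtered] for op in ops]]
        weights = self.all_data[[self.sidx[op.weights] for op in ops]]
        delta = self.all_data[[self.sidx[op.delta] for op in ops]]
        # Assemble the per-operator scalars into OCL arrays
        alpha = self.Array([op.learning_rate * self.model.dt for op in ops])
        vprog = self.Array([op.vprog for op in ops])
        # Hand everything off to the OCL code creation function
        return [
            plan_vlr(
                self.queue, pre_voltages, post_filtered, weights, delta, alpha, vprog
            )
        ]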
The plan_vlr Function
This function is where all of the OCL code is created, and is split into four sections (a skeleton of the whole function follows after this list):
- A section of the code that double-checks that all of the data being passed to the function is correctly shaped and sized. As an example, the delta matrix that is passed to the plan functions for most learning rules is the size of the full connection weight matrix. This weight matrix is size_post x size_pre in shape, so a check might be:
assert (post.shape0s == delta.shape0s).all()
assert (pre.shape0s == delta.shape1s).all()
Other checks this section performs include verifying that the number of operators for each input is correct, and that the data types are as expected. The full code for this section might look something like this:
# Check that all lists being passed in have the same number of items per list,
# which should be the same as the number of ops
assert (
    len(pre) == len(post) == len(weights) == len(delta) == alpha.size == vprog.size
)
N = len(pre)  # Number of ops in op list

for arr in (pre, post):  # vectors
    # Check that all vectors are flat (second dim is 1)
    assert (arr.shape1s == 1).all()

for arr in (delta, weights):  # matrices
    # Check that all matrices have a stride1 of 1
    # (debug print to inspect the strides)
    print(">>", arr.stride0s, arr.stride1s)
    assert (arr.stride1s == 1).all()

# Check that the shapes of the matrices are correct. The delta and weights
# matrices should be post.shape x pre.shape
assert (post.shape0s == delta.shape0s).all()
assert (pre.shape0s == delta.shape1s).all()
assert (post.shape0s == weights.shape0s).all()
assert (pre.shape0s == weights.shape1s).all()

# Check that the ctypes for all of the data are the same
assert (
    pre.ctype
    == post.ctype
    == weights.ctype
    == delta.ctype
    == alpha.ctype
    == vprog.ctype
)
- Next, a section of code defines the C code that will be used to compile the kernel that will be run on the GPU. I'll elaborate more on this code in the next block of text below. This section of code might look like this:
text = """
__kernel void vlr() {
// kernel definition here
}
- Next follows a section of the code that uses the C code text from the previous section, as well as other parameters (like the shapes and sizes of the function inputs), to actually create the OCL kernel. This section of code might look like this:
textconf = dict(type=pre.ctype)
text = as_ascii(Template(text, output_encoding="ascii").render(**textconf))
# Provide a reference to the arguments that will be given to the OCL kernel code
# - Must match order of arguments appearing in OCL code
full_args = (
    delta.cl_shape0s,
    delta.cl_shape1s,
    pre.cl_stride0s,
    pre.cl_starts,
    pre.cl_buf,
    post.cl_stride0s,
    post.cl_starts,
    post.cl_buf,
    weights.cl_stride0s,
    weights.cl_starts,
    weights.cl_buf,
    delta.cl_stride0s,
    delta.cl_starts,
    delta.cl_buf,
    alpha,
    vprog,
)
# Create the OCL kernel
_fn = cl.Program(queue.context, text).build().vlr
_fn.set_args(*[arr.data for arr in full_args])
- The final section of the function creates the NengoOCL plan (which is similar to a Nengo operator) and returns it to the simulator. This section of code defines things like what the global OCL indices should iterate over (see the next section of text for how the OCL code works) and the number of bytes the kernel needs per call. It also specifies how many floating point operations each call will take, but I'm not sure how to determine this, so you'll have to experiment with this number yourself. The code for this section might look like this (a lot of it is copy-pasted from existing NengoOCL plans):
# Create the NengoOCL plan and return it
lsize = None
gsize = (delta.sizes.max(), N) # global index shape
plan = Plan(queue, _fn, gsize, lsize=lsize, name="cl_vlr", tag=tag)
plan.full_args = full_args # prevent garbage-collection
# Number of FLOPs per call. Not sure how to determine this.
plan.flops_per_call = 4 * delta.sizes.sum()
plan.bw_per_call = (
    pre.nbytes
    + post.nbytes
    + weights.nbytes
    + delta.nbytes
    + alpha.nbytes
    + vprog.nbytes
)
return plan
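Putting the four sections together, the overall shape of plan_vlr would be something like this (a skeleton only; the tag keyword argument is an assumption on my part, based on how existing NengoOCL plan functions take a tag to pass along to Plan):

def plan_vlr(queue, pre, post, weights, delta, alpha, vprog, tag=None):
    # 1. Sanity checks on shapes, sizes, and ctypes (see above)
    # 2. Define the OCL kernel source code in `text`
    # 3. Render the template and build the kernel with cl.Program
    # 4. Create the NengoOCL Plan and return it
    ...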
The OCL Code
The OCL code is a fairly standard C function. The first part is the function definition, where the argument list matches the argument list defined in Step 3 above. Here's what the voltage learning rule argument list might look like. I included all of the parameters I am aware of, but I noticed a fun_post function in your learning rule code. I don't know what that consists of, so I've excluded it from this example.
__kernel void vlr(
    __global const int *shape0s,
    __global const int *shape1s,
    __global const int *pre_stride0s,
    __global const int *pre_starts,
    __global const ${type} *pre_data,
    __global const int *post_stride0s,
    __global const int *post_starts,
    __global const ${type} *post_data,
    __global const int *weights_stride0s,
    __global const int *weights_starts,
    __global const ${type} *weights_data,
    __global const int *delta_stride0s,
    __global const int *delta_starts,
    __global ${type} *delta_data,
    __global const ${type} *alphas,
    __global const ${type} *vprog
)
Next up in the OCL code is the determination of the matrix indices. In NengoOCL, all of the data is flattened, not just across dimensions, but across operators as well. As an example, if you had two learning rule operators, one with a 1x1 connection matrix, and one with a 2x2 connection matrix, the data being sent to the kernel would look like this:
weights = [lr1_00, lr2_00, lr2_01, lr2_10, lr2_11]
whereby the weights for each learning rule operator would actually be this:
weights_lr1 = [lr1_00]
weights_lr2 = [[lr2_00, lr2_01], [lr2_10, lr2_11]]
Because the data is flattened in this way, in the OCL kernel we need to compute the relative i and j of a matrix given a global index. The code to do this looks like this:
const int ij = get_global_id(0);
const int k = get_global_id(1);
const int shape0 = shape0s[k];
const int shape1 = shape1s[k];
const int i = ij / shape1;
const int j = ij % shape1;
Note that k determines the operator index, so if you have two of the same learning rule operators (see Step 1 of the _plan method), k would be 0 for the first one and 1 for the second one.
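As a hypothetical worked example, using the two operators described above (a 1x1 and a 2x2 matrix):

// gsize = (delta.sizes.max(), N) = (4, 2), so ij runs over 0..3 and k over 0..1.
// For k = 1 (the 2x2 operator), shape1s[1] = 2, so the work item with ij = 3
// computes i = 3 / 2 = 1 and j = 3 % 2 = 1, i.e., element lr2_11.
// For k = 0 (the 1x1 operator), only ij = 0 is in bounds; the remaining work
// items are skipped by the if (i < shape0) guard shown further down.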
The next part of the OCL kernel code uses the i, j, and k indices to figure out what data is relevant to the matrix operation being computed. Once again, this is needed because the input data is a flattened array. The code might look like this:
__global ${type} *delta = delta_data + delta_starts[k];
const ${type} pre = pre_data[pre_starts[k] + j*pre_stride0s[k]];
const ${type} post = post_data[post_starts[k] + i*post_stride0s[k]];
const ${type} weight_ij = weights_data[
    weights_starts[k] + i*weights_stride0s[k] + j];
const ${type} alpha = alphas[k];
Note that delta is created as a global pointer reference here because this learning rule modifies delta as part of its operation. The _starts[k] array values tell the code where in the data array the values for that specific operator start (remember, everything is flattened). Likewise, something like i*weights_stride0s[k] + j is the standard math to figure out which element in a flattened array corresponds to a specific i and j of a matrix.
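Continuing the same hypothetical two-operator example, the flattened weights array and its bookkeeping arrays might look like this:

// weights_data     = [lr1_00, lr2_00, lr2_01, lr2_10, lr2_11]
// weights_starts   = [0, 1]   // lr1's data begins at index 0, lr2's at index 1
// weights_stride0s = [1, 2]   // stride between rows (i.e., row length) of each matrix
//
// For k = 1, i = 1, j = 0 (element lr2_10), the lookup is:
//   weights_data[weights_starts[1] + 1*weights_stride0s[1] + 0]
//     = weights_data[1 + 2 + 0] = weights_data[3] = lr2_10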
The final part of the OCL kernel code is what the learning rule computation does. This code might look something like this:
if (i < shape0) {
    delta[i*delta_stride0s[k] + j] =
        alpha * post * weight_ij * ...;
}
The if (i < shape0) check is just to make sure that, in case the i index goes out of bounds, the computation doesn't spit out garbage numbers. Also note that this code is defined per (i, j) element of the delta matrix. Finally, I noticed you had a fun_post function embedded in your learning rule code. That will need to be re-implemented in C for the OCL code to function properly.
Summary
And that should be it. As with my other posts, I've assembled all of the code snippets I included above into one file: voltage_learning_ocl.py (6.0 KB)
I highly recommend you look at this file and play with the various debug statements and parameters to get a better understanding of how the OCL code works and, more importantly, what it expects. A lot of what I've learned about OCL code, I learned using exactly that method of debugging.
I hope this helps! 