Sunday, October 28, 2012

Exploring OpenCL: Hello, (Py)OpenCL

After setting up a working PyOpenCL environment we'll, of course, want to give it a test, won't we? There's a nice tutorial over at CodePlex, though it has only just started and contains a single tutorial at this time. It's incredibly detailed in explaining OpenCL's concepts, though, so those wanting to learn more really should check it out.

Here's a simple program for PyOpenCL to add pairs of floating point numbers. Let's get started.

__kernel void add(__global const float* input1, 
                  __global const float* input2, 
                  __global float* result) {

    const int index = get_global_id(0);
    result[index] = input1[index] + input2[index];
}

Above is the kernel we'll be using to add the numbers. Nothing particularly hard going on here. Each work-item takes two inputs at its own index, adds them together and stores the result for output. Save it in a separate .cl file for later use (the script below expects it to be named ocl_add.cl).
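To make the execution model a bit more concrete, here is a rough pure-Python emulation of what the kernel does. It is only a sketch: on a real device the work-items run in parallel rather than in a loop, and the names `add_work_item` and `run_kernel` are made up for this illustration; they are not part of OpenCL or PyOpenCL.

```python
def add_work_item(global_id, input1, input2, result):
    """Body of the kernel for a single work-item (illustrative only)."""
    # In the real kernel this index comes from get_global_id(0).
    index = global_id
    result[index] = input1[index] + input2[index]


def run_kernel(global_size, input1, input2):
    """Emulate launching one work-item per element, one after another."""
    result = [0.0] * global_size
    for global_id in range(global_size):
        add_work_item(global_id, input1, input2, result)
    return result


print(run_kernel(3, [1.0, 2.0, 3.0], [0.5, 0.25, 0.125]))
```

The point is simply that every work-item handles exactly one index, so the kernel itself needs no loop.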

Next up is the Python script that'll make use of the above kernel.

import numpy
import pyopencl as cl

First is to import the modules we'll be using, of course.

platforms = cl.get_platforms()
devices = [device for platform in platforms 
                  for device in platform.get_devices()]

context = cl.Context(devices)
queue = cl.CommandQueue(context,
        properties=cl.command_queue_properties.PROFILING_ENABLE)

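Next, we'll want to get the available OpenCL devices and create an OpenCL context. We'll also need a command queue to send commands and data we want processed by the kernel.

In the code above, OpenCL profiling is enabled so we can extract performance info later. It also grabs every available OpenCL device, meaning both OpenCL-capable CPUs and GPUs. Keep in mind that a context can only hold devices from a single platform, so if your CPU and GPU are exposed by different platforms, creating the context this way will likely fail and you'll need to pick the devices of one platform instead. The command queue itself is tied to a single device; since we didn't specify one, PyOpenCL uses one of the context's devices.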

src = ""
with open("ocl_add.cl", "r") as fp:
    src = fp.read()
    
program = cl.Program(context, src).build()

Load the kernel we saved earlier, then get PyOpenCL to create and build an OpenCL program object for us. Nothing too hard so far.

# Create 2 lists of random floats
input_size = 12345678
input1 = numpy.random.rand(input_size).astype(numpy.float32)
input2 = numpy.random.rand(input_size).astype(numpy.float32)

# Allocate the memory we'll need on the OpenCL device. buf1
# & buf2 will contain the contents of input1 & input2
# respectively. result is allocated but left empty so the
# kernel can fill it with the results, so we just specify
# the size we need.
flags = cl.mem_flags
buf1 = cl.Buffer(context, 
                 flags.READ_ONLY | flags.COPY_HOST_PTR, 
                 hostbuf=input1)
buf2 = cl.Buffer(context, 
                 flags.READ_ONLY | flags.COPY_HOST_PTR, 
                 hostbuf=input2)
result = cl.Buffer(context, flags.WRITE_ONLY, input1.nbytes)

Let's test this thing out with a pretty big set of numbers: 12345678 pairs of floating point values. Adding this many numbers using just the regular Python sum function takes a few minutes.
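For a CPU baseline to compare against, something like the following could be used. It is a minimal sketch, assuming "regular Python" means adding the pairs one by one in the interpreter; `add_pairs` and `time_add_pairs` are hypothetical helpers, and the small size used here is only so the snippet finishes quickly. Timings will vary from machine to machine.

```python
import random
from timeit import default_timer


def add_pairs(values1, values2):
    """Add two equal-length sequences element by element in plain Python."""
    return [a + b for a, b in zip(values1, values2)]


def time_add_pairs(size):
    """Return (seconds, result) for adding `size` pairs of random floats."""
    values1 = [random.random() for _ in range(size)]
    values2 = [random.random() for _ in range(size)]
    start = default_timer()
    result = add_pairs(values1, values2)
    return default_timer() - start, result


# Deliberately small; swap in 12345678 to mirror the size used in this post.
seconds, result = time_add_pairs(100000)
print("Plain Python add: %s s for %d pairs" % (seconds, len(result)))
```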

event = program.add(queue, input1.shape, None, buf1, buf2, result)
event.wait()

Here we just call the kernel we made to process the input values. I couldn't find much documentation on this call, or I just didn't search hard enough.

Basically, our program object exposes the kernel as a function, named exactly as you named the kernel in the .cl file. Before the kernel's own arguments we need to pass the command queue and the global work size (input1.shape, or input2.shape, is fine here, giving one work-item per element). The 3rd parameter is the local work size, i.e. the size of each work-group; passing None lets the OpenCL implementation choose one for us.

Finally, just pass the parameters as defined in the kernel. And, we should be done.
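A side note on the None passed as the third argument: if an explicit local work size is given there instead, OpenCL 1.x expects the global size to be a multiple of it, and 12345678 is not a multiple of common choices such as 256. The usual workaround is to round the global size up, sketched below. `round_up_global_size` is a hypothetical helper, and 256 is just an assumed work-group size for illustration.

```python
def round_up_global_size(element_count, local_size):
    """Return the smallest multiple of local_size that is >= element_count."""
    groups = (element_count + local_size - 1) // local_size
    return groups * local_size


input_size = 12345678
local_size = 256  # assumed work-group size, for illustration only
global_size = round_up_global_size(input_size, local_size)
padding = global_size - input_size
print("%d work-items (%d of them padding)" % (global_size, padding))
```

With a padded global size the extra work-items would index past the end of the buffers, so the kernel would also need an `if (index < n)` style guard, which the kernel in this post does not have. Sticking with None avoids all of this.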

# Profiling timestamps are in nanoseconds; convert to seconds.
elapsed = 1e-9 * (event.profile.end - event.profile.start)
print("OpenCL sum func: %s s" % elapsed)

Here, we make use of profiling info to see how long it took to complete the computation.
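The profiling counters are device timestamps in nanoseconds, which is why the code multiplies by 1e-9. Here is a small sketch of the same arithmetic, plus a throughput figure. The helper names and the sample timestamps are made up for illustration and are not measurements.

```python
def elapsed_seconds(start_ns, end_ns):
    """Convert a pair of nanosecond timestamps to elapsed seconds."""
    return (end_ns - start_ns) * 1e-9


def elements_per_second(element_count, seconds):
    """Rough throughput: how many additions were completed per second."""
    return element_count / seconds


# Hypothetical timestamps, standing in for event.profile.start / .end.
sample_start_ns = 1000
sample_end_ns = 25001000

seconds = elapsed_seconds(sample_start_ns, sample_end_ns)
rate = elements_per_second(12345678, seconds)
print("%s s, about %.3g additions per second" % (seconds, rate))
```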

Comparison: adding 12345678 pairs of floating point numbers

Just to give you an idea of the difference in compute speed, here's a "benchmark" showing how long it takes to complete that many additions compared to using Python's built-in function.

Results are averaged from 3 runs.