A library and standard for writing parallel programing for various processors, including GPUs. These notes cover OpenCL 1.2 on macOS using an Intel Iris Plus Graphics 655 GPU chipset.
On macOS, use
#include <OpenCL/cl.h> and compile with
The computer running an OpenCL program (i.e. written in C/C++) is called the host. Each processing element (i.e. a GPU) is called a device. The implementation of OpenCL for a particular device is called a platform (i.e. the Nvidia platform is used to interact with Nvidia devices).
In an OpenCL host application, one or more devices are grouped together as a logical group called a context, where the application might control more than one context. Each context has command queues that the host can use to dispatch commands to the context, which are by default processed in order. There are commands to transfer data and to perform computation.
A kernel is a command that represents a specially coded OpenCL function
that should be executed by one or more devices. An OpenCL program is a set
of kernels, usually defined as
Device Memory Model
An OpenCL device distinguishes four address spaces:
- Global memory: Read and write, and available to the whole device
- Constant memory: Read only, and available to the whole device. Some devices provide a memory region specifically for constant memory, but in many cases this is just a subset of the global memory
- Local memory: Read and write, and available to a whole work-group. Every work-item that belongs to the work-group can access this memory, but a work-group can’t access the local memory of another work-group. This type of data will be allocated and deallocated once per work-group
- Private memory: Read and write, and available only to a particular work-item
Private memory is the fastest, but also the smallest. Conversely, global memory is the biggest, but also the slowest.
There are various extensions to OpenCL. The ones approved by the OpenCL Working
Group are preffixed with
cl_khr_, or suffixed with
clCreateEventFromGLsyncKHR. Non-approved extensions are
OpenCL host programming supports the following two’s complement integers:
cl_uchar (8 bits),
cl_ushort (16 bits),
cl_uint (32 bits),
cl_ulong (64 bits), and the
following floating-point values:
cl_half (half-precision, 16 bits),
cl_float (single-precision, 32 bits),
cl_double (double-precision, 64
To get the first available GPU:
cl_platform_id platform; cl_device_id device; cl_int result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
Devices may have a preferred type vector width that you can inspect with the
cl_device_id device; cl_uint char_width; clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, sizeof(char_width), &char_width, NULL);
Its a common practice to fetch the preffered vector width and then pass it as a define macro to the OpenCL program.
We can determine a device’s endianness with the
and checking whether the return valye is
We can create a context out of a set of devices from the same platform with
We can also create a conetxt out of all the devices from the same platform that
match a certain type with
A context must be destroyed with
A program is usually created with
from the source code out of
.cl files. The program can then be built with
which allows us to customize various options such as the include path (
Its recommended to report warnings as errors with
-Werror. Programs are
We can get build logs after attempting to build a program using
cl_device_id device; cl_program program; // Find the size of the logs size_t log_size; clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size); char * program_log; program_log = (char *) malloc(log_size + 1); program_log[log_size] = '\0'; // Get the logs clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size + 1, program_log, NULL); printf("%s\n", program_log); free(program_log);
__ENDIAN_LITTLE__ is always defined if the device is little endian,
and undefined otherwise.
A command queue must be created for one or more devices that are part of the
same context. The
function allows us to configure whether the commands are executed in order or
not. A command queue must be destroyed with
The lifecycle of a command consists of:
CL_QUEUED: The command has been added to the queue
CL_SUBMITTED: The command has been submited to the device
CL_RUNNING: The command is running on the device
CL_COMPLETE: The command has finished running
Every command has a wait list that consists of a set of
structures. The command will not start executing until the wait list is empty
NULL). We can pass a wait list when using i.e.
as a simple way to synchronize execution.
In order to profile a command queue, create the queue with the
CL_QUEUE_PROFILING_ENABLE flag, create a
cl_event with a callback and
associate it with the command that you want to profile, and on the callback,
CL_PROFILING_COMMAND_END, and perform arithmetic between the resulting
operation forces a command queue to not execute any following commands until
all the events in the queue reached the “completed” state. The
function has a similar effect, but it enqueues a “barrier command” instead of
using the wait list.
Kernels may use the
function inn a work-item to wait until every other work-item in the group
reached the barrier. Fences (see
are similar to barriers, but can synchronize specific memory operations.
Currently there is no way to synchronize work-items from different work-groups apart from executing new kernels.
The command queue “enqueue” functions can be associated with a
structure which contains a callback function that will be executed whenever a
command changes its status. These callback functions have the following signature:
void CL_CALLBACK callback(cl_event event, cl_int status, void * data);
cl_event is associated with a callback function using
user_data argument passed to this function will become the callback’s
data argument. Notice an event must be associated with a callback after
enqueuing the command.
A user event is an event controlled by the host application. See
A user event is not associated with a particular command queue, so we can use
the same event for multiple devices. If you set a user event in a command’s
wait list, the command will not execute until you manually update the user
event’s status from the host with
We need to set the arguments to the kernel with
before enqueuing it. We can enqueue kernels on a command queue using
The latter offers finer control over how the kernel executes.
Kernel declarations must start with
__kernel and must always return
Since kernels don’t have a return value, we have to “output” data by writing to
Every kernel argument must have either
__private (default) qualifier (see Device Memory Model). The
qualifier can only be used with pointers. Since
__constant is read-only, it
must be initialized on its declaration (a define
-D macro is usually
preferred). If a kernel argument doesn’t have an address space qualifier, then
its assumed to be
NULL to a kernel will make it
just reserve memory in its local space for the kernel argument.
Private kernel arguments can only be primitives or vectors. We can use
primitives such as
int but also i.e. an array of 4 floats interpreted in the
float4, as private arguments can’t be pointers.
For performance reasons: try to re-use private variables to reduce kernel
memory consumption, inline non kernel functions, use the
for multiply and add operations, access local memory sequentially, and avoid
branch “miss penalties” by ensuring conditional statements are more often true
We can use asynchronous calls to transfer data between global and local memory.
In order to wait until the data transfers complete, see
This is a composite data type that consist of an array of a certain number of
primitive elements. For example
float4 represents an array of 4 floats. The
possible types are
N is 2, 3, 4, 8, or 16. Some devices
We can initialize vectors as
(type) (value1, value2, ..., valueN) where the
values are primitives or other vectors. For example:
float4 vector1 = (float4) (1.0, 2.0, 3.0, 4.0); float2 subvector1 = (float2) (1.0, 2.0); float4 vector2 = (float4) (subvector1, 3.0, 4.0);
OpenCL supportts arithmetic operations over vectors. For example:
float4 vector1 = (float4) (1.0, 2.0, 3.0, 4.0); float4 vector2 = (float4) (3.0, 4.0, 5.0, 6.0); float4 result = vector1 + vector2; // (4.0, 6.0, 8.0, 10.0)
Vectors are indexed for read or write purposes as
vector.<index> where index
.s and one or more hexadecimal digits starting from 0 (
.w, which are synonymous to
.lo (highest or lowest halves), and
.odd (even or odd elements).
Buffers belong to a context. They are created with
and destroyed with
A sub-buffer is useful if a kernel needs a subset of the data passed to another
kernel. The sub-buffer is a reference to an offset of the super-buffer, and
there is no allocation involved.
Buffers take the following access flags:
CL_MEM_WRITE_ONLY. If you are passing a write-only
parameter to get data back from a kernel, then you can set the access level to
CL_MEM_WRITE_ONLY and make it initially
cl_context context; cl_int result; cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(result), NULL, NULL);
Image objects are
cl_mem buffers meant to hold pixel data. See
These two constructors take a
row_pitch argument that determines how many
bytes each row occupies. If
row_pitch is set to 0, OpenCL will assume its
width * pixel size.
clCreateImage3D also takes a
argument that determines the number of bytes in each 2D slice. If
is set to 0, its value will be set to
row_pitch * height.
On GPUs, image objects are stored in texture memory, a special global memory region cached for performance reasons.
clEnqueueCopyBuffer: Copy a buffer to another buffer
clEnqueueCopyImage: Copy an image object to another image object
clEnqueueMapBuffer: Map a buffer into host memory
clEnqueueMapImage: Map an image object into host memory
clEnqueueUnmapMemObject: Unmap a buffer or image object
clEnqueueReadBuffer: Read a buffer into host memory
clEnqueueReadImage: Read an image object into host memory
clEnqueueWriteBuffer: Write to a buffer from host memory
clEnqueueWriteImage: Write to an image object from host memory
The read/write functions take a
blocking argument to control whether the
function should wait for the data transfer to complete before returning or not.
OpenCL defines various atomic operations for kernel programming such as
atomic_inc. See Atomic
for details. The availability of these functions depend on the extensions
supported by the target device.
We can use
(swap arguments) and
(ternary conditional operation) to implement a mutex system. These two
functions require the
cl_khr_int64_base_atomics extension, which can be
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable.