Index

Introduction

The Foundry's Blink is a C++-based image processing framework designed to allow complex algorithms to be implemented in a device-independent manner. With Blink, you can choose to accelerate your code either by using SIMD (Single Instruction, Multiple Data) instructions on the CPU for greater efficiency, or by running it on the GPU to take advantage of the massively parallel processing capabilities. This is achieved through code generation, in which device-specific code is generated from the Blink code for each of the devices we target. The code in a Blink “kernel” can currently be translated into standard C++ for the CPU, SIMD code for the CPU or OpenCL for the GPU.

Background

In recent years, GPUs have become increasingly powerful and useful for general computing, as well as the specialised graphics operations they were originally designed for. With the advent of CUDA and OpenCL, they became more accessible to programmers than ever before. Here at The Foundry we wanted to make better use of these powerful devices that were often sitting almost idle in our customers' machines, while the CPUs did all the work. However, we couldn't assume that a powerful GPU would always be available to run on - for example, in many render farms this is not the case - so all our code had to be able to run on the CPU as well. It was also important that our code gave the same results whether it was run on the CPU or GPU, to avoid the situation where an artist shares a script with someone on a different machine - or indeed runs it on a render farm - and it gives a different result.

The Blink framework was our solution to these two constraints. Developers write Blink "kernels", which are designed to be executed in parallel at each position inside some iteration space. Within each kernel it is necessary to specify both the images required by the kernel, and how it needs to access each one at a single point in the iteration space. For example, a saturation kernel might only need access to its images at the current position in the space, or a blur might need to access a two-dimensional range of positions from its input around the current point. Knowing which images are needed and how they will be accessed allows us to translate Blink code into optimised code for each of the target devices we support. Using code translation in this manner has the important advantage that we only need to write our algorithmic code once, for it to run on any supported device. It also makes it straightforward to support new target devices, as we need add new back-end code in one place only to enable all existing Blink code to run on the new device.

Blink Concepts

Blink Kernels

Blink “kernels” are the cornerstone of our Blink framework. A Blink kernel is similar to a C++ class, but with some special parameter types and functions. Blink kernels are designed to be run on parallel architectures, so each instance of a Blink kernel will be independent of all the others.

The "iterate()" function is used to run a Blink kernel over an iteration space, usually the bounds of an image or sub-image. One instance of the kernel will then be launched for every point in this iteration space. For example, a kernel that produces an output image might launch an instance of the kernel for every pixel in the output image in order to generate its result.

Blink kernels also have a "granularity", which can be either "pixel-wise" or "component-wise"."Pixel-wise" kernels have access to all the components within their input and output images. However, "component-wise" kernels can access only a single component of their input and output images at a time. A separate instance of a component-wise kernel will be launched for each component at each point in the iteration space.

There are three types of Blink kernel:

ImageComputationKernel: used for image processing, this takes zero or more images as input and produces one or more images as output.
ImageRollingKernel: also used for image processing, where there is a data dependency between the output at different points in the output space. With an ImageComputationKernel, there are no guarantees about the order in which the output pixels will be filled in. With an ImageRollingKernel, you can choose to "roll" the kernel either horizontally or vertically over the iteration bounds, allowing you to carry data along rows or down columns respectively. (See Example of a Rolling Kernel.)
ImageReductionKernel: used to "reduce" an image down to a value or set of values that represent it, for example to calculate statistics such as the mean or variance of an image. (See Example of a Reduction Kernel.)

For more information about writing Blink kernels, please see the Guide to Writing Blink Kernels on our website. See also Running Blink Kernels for more information on how to create, set up and execute a kernel.

Blink Images

A Blink::Image holds a reference to some image data that resides on a Blink::ComputeDevice, which can be either a CPU or GPU. In addition, it holds some information about the bounds of the image and the type of pixels it contains, in a Blink::ImageInfo object. Images can be copied easily from one type of device to the other, but must have the same ImageInfo on both devices for the copy to succeed. See Getting Data To and From Blink Images for more information about how to set up and use Blink Images inside an NDK plug-in.

Blink Compute Devices

A Blink Compute Device is used for processing on and can be either a CPU or GPU. The static functions GetCurrentCPU() and GetCurrentGPU() on Blink::ComputeDevice will return the currently-selected CPU or GPU respectively.

Requirements for GPU Acceleration

An AMD GPU able to run OpenCL 1.2 and above.

or

An NVIDIA GPU with compute capability 3.5 and above.
NVIDIA drivers with CUDA 11.8 support. Driver versions 522.06 (Windows) and 520.61.05 (Linux) or above are required.

Are CPU and GPU results really the same?

In general, you can be confident that CPU and GPU results from the same Blink code will look the same. It is possible to write code that will give different results, though. For example, if the results of your code can change according to the order in which kernel calls are executed inside an iteration space, then you are likely to get different results. However, for ImageComputationKernels and ImageReductionKernels at least, Blink makes no guarantees about the order in which the iteration space will be traversed, and therefore we would say that this is incorrect Blink code. Used correctly, your Blink code will give results that look the same from the CPU and GPU. In fact, not only will they look the same, but wherever possible they will be bitwise-identical: the CPU and GPU will truly give the same result. This is not trivial to achieve and considerable effort has gone into making sure that this is the case. In future, this true sameness will be important as it will allow Blink to be used for heterogeneous computing, where the work to do is shared between the available devices instead of run on one or another.

Unfortunately, at the time of writing, only NVIDIA's Windows and Linux GPU drivers support the accurate maths required for us to achieve these bitwise-identical results. On OS X, it is not supported on many GPUs at present. Recent versions of OS X (OS X 10.8, "Mountain Lion", and later) do support this accurate maths for the GPUs which have been shipped in Apple computers, such as the GeForce GT 650M in mid-2012 MacBook Pros. However, this support does not seem to have been implemented for GPUs other than those available from Apple - so, for example, if you put a Quadro K5000 in a Mac Pro running Mountain Lion you will not get this accurate maths support, even though both the GPU and the operating system should be capable of it under other circumstances!

Using Blink in the NDK