Introduction
General-purpose computing on graphics processing units (GPGPU) is the technique of using a GPU, which is usually employed only for graphics computation, to perform computation traditionally handled by the CPU.
GPU and CPU computing
GPUs are designed specifically for graphics and are therefore very restrictive in terms of operations and programming. However, the number of cores in a GPU greatly surpasses the number of computational units in today's CPUs. In fact, a high-end CPU contains up to 8 cores, while a high-end GPU contains up to 1024 computational units. Because of their nature, GPUs are only effective at tackling problems that can be solved using stream processing (a parallel programming paradigm related to SIMD), and the hardware can only be used in certain ways. GPU instructions are in fact limited to operations related to the visualization of 3D graphics and the generation of advanced visual effects by means of vector manipulations. These computations are executed in parallel across the thread processors.
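To make the stream processing idea concrete, the following is a minimal sketch in plain Python, not GPU code. The function names `kernel` and `run_stream` and the arithmetic inside the kernel are illustrative assumptions. The point is only the structure: one small function is applied independently to every element of a stream, which is what allows a GPU to assign each element to a different thread processor.

```python
from typing import Callable, Iterable, List


def kernel(x: float) -> float:
    """Hypothetical kernel: the same operation applied to every stream element."""
    return 2.0 * x + 1.0


def run_stream(kernel_fn: Callable[[float], float], stream: Iterable[float]) -> List[float]:
    # Each output depends only on its own input element, so on a GPU every
    # element could be handled by a different thread processor. Here the
    # elements are processed one after another, purely to show the structure.
    return [kernel_fn(x) for x in stream]


result = run_stream(kernel, [0.0, 1.0, 2.0, 3.0])
print(result)
```

Because no element needs the result of another, the order of evaluation does not matter. That independence is the property that problems must have to map well onto this model.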
GPU programming and the CUDA libraries
GPU functionality has traditionally been very limited. In fact, for many years the GPU was only used to accelerate certain parts of the graphics pipeline. Some improvements were needed before GPGPU became feasible, namely the increased programmability of GPUs and the ability to manage new data types.
However, executing computations directly on GPUs was non-trivial, since it was not possible to directly access the computational cores and the memory of the GPU, and GPGPU programs had to go through the normal graphics APIs. The situation changed after the major GPU manufacturers, i.e. NVIDIA and AMD (formerly ATI), introduced dedicated parallel computing architectures and APIs (application programming interfaces).
NVIDIA® CUDA™ is a general-purpose parallel computing architecture. CUDA stands for Compute Unified Device Architecture, and it is a software platform for massively parallel high-performance computing on NVIDIA GPUs. Formally introduced in 2006, after a year-long gestation in beta, CUDA is steadily winning customers in scientific and engineering fields.
NVIDIA has released the CUDA libraries and software development kit (SDK), allowing the exploitation of GPGPU on its graphics cards. Algorithms can be programmed in C using the CUDA SDK and then executed in parallel on the GPU. Wrappers are also available for other programming languages, such as Python, Fortran, Java and Matlab.
Important mathematical libraries, such as BLAS (Basic Linear Algebra Subprograms: routines for vector-vector, matrix-vector and matrix-matrix operations) and FFT (fast Fourier transform), have already been ported to CUDA and are available with the standard CUDA SDK.
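As a rough illustration of the three kinds of operation that BLAS covers, here is a plain Python sketch. It is not the CUDA BLAS interface. The function names borrow the conventional BLAS routine names (axpy, gemv, gemm), but the signatures are simplified assumptions that leave out the scaling factors and layout options of the real routines.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Vector-vector operation: alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a: Matrix, x: Vector) -> Vector:
    """Matrix-vector operation: A x (simplified, no scaling factors)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Matrix-matrix operation: A B (simplified, no scaling factors)."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a row-column dot product
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in cols_b] for row in a]


A = [[1.0, 2.0], [3.0, 4.0]]
v = axpy(2.0, [1.0, 2.0], [10.0, 20.0])
mv = gemv(A, [1.0, 1.0])
mm = gemm(A, [[0.0, 1.0], [1.0, 0.0]])
print(v, mv, mm)
```

Every output entry in these routines is an independent sum of products, which is why such operations are natural candidates for GPU execution.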
Finally, it is noteworthy that while GPGPU can achieve a 100-250x speedup over a single CPU, only embarrassingly parallel applications will see this kind of benefit. A single GPU processing core is not equivalent to a single processing core found in a desktop CPU. For example, algorithms such as data compression, or recursive algorithms, are better executed on classical CPUs.
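One way to see why only highly parallel applications approach such speedups is Amdahl's law, which is not mentioned in the text above and is brought in here as an assumption for illustration. The sketch below computes the overall speedup when only a fraction of the run time can be accelerated. The function name `amdahl_speedup` and the sample fractions are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, unit_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated.

    parallel_fraction: share of the original run time that can run in parallel.
    unit_speedup: speedup achieved on that parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if unit_speedup <= 0.0:
        raise ValueError("unit_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / unit_speedup)


# Assumed example: the parallel part runs 100x faster on the GPU.
for p in (1.0, 0.99, 0.9, 0.5):
    print(f"parallel fraction {p:.2f}: overall speedup {amdahl_speedup(p, 100.0):.1f}x")
```

Under these assumptions, a program that is only half parallel stays below a 2x overall speedup no matter how fast the GPU part becomes, since the serial half still has to run on the CPU.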
The ATI Stream Software Development Kit (SDK) is a complete development platform created by AMD to allow you to quickly and easily develop applications accelerated by ATI Stream technology. The SDK allows you to develop your applications in a high-level language, ATI Brook+. The Brook+ compiler and runtime layer handle the low-level details for you so that you can concentrate on implementing your algorithms on the GPU. Brook+ is built on top of ATI Compute Abstraction Layer (CAL), which gives you low-level control and programmability of the hardware. This SDK allows the execution of parallel programs on the GPUs produced by ATI, providing an abstraction layer similar to the one supplied by NVIDIA through the CUDA SDK.
OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors.
The aim of OpenCL is to use all the computational resources available in a system, such as CPUs and GPUs, from a single, common code base. Both data and task parallelism can be implemented.
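The difference between data and task parallelism can be sketched with Python's standard `concurrent.futures` module. This is not OpenCL code, and the helper functions `square`, `total` and `largest` are hypothetical. The thread pool only illustrates how the work is structured. It makes no claim about actual speed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def square(x: int) -> int:
    return x * x


def total(values: List[int]) -> int:
    return sum(values)


def largest(values: List[int]) -> int:
    return max(values)


data = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Data parallelism: the same operation is applied to different elements.
    squares = list(pool.map(square, data))

    # Task parallelism: different operations are submitted as independent tasks.
    sum_future = pool.submit(total, data)
    max_future = pool.submit(largest, data)
    task_results = (sum_future.result(), max_future.result())

print(squares, task_results)
```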
OpenCL implements a platform model in which a host is connected to one or more OpenCL devices. An OpenCL device is a collection of one or more computing cores. An OpenCL program is then structured as a collection of computational units, called kernels, that are executed in parallel by submitting them to the available devices.
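The platform model described above can be mimicked with a small toy model in plain Python. This is not the OpenCL API. The classes `Device` and `Host`, the round-robin work split, and the thread pools standing in for computing cores are all assumptions made for illustration. Real OpenCL programs go through the API's own objects, which are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Device:
    """Hypothetical stand-in for an OpenCL device: a named group of cores."""
    name: str
    cores: int

    def run(self, kernel: Callable[[int], int], work: List[int]) -> List[int]:
        # One worker per core; each work item is processed independently.
        with ThreadPoolExecutor(max_workers=self.cores) as pool:
            return list(pool.map(kernel, work))


class Host:
    """Hypothetical host that submits a kernel to all attached devices."""

    def __init__(self, devices: List[Device]) -> None:
        self.devices = devices

    def submit(self, kernel: Callable[[int], int], work: List[int]) -> List[int]:
        n = len(self.devices)
        # Round-robin split: device i receives items i, i + n, i + 2n, ...
        chunks = [work[i::n] for i in range(n)]
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(d.run, kernel, c) for d, c in zip(self.devices, chunks)]
            partial = [f.result() for f in futures]
        # Put the partial results back in the original order.
        out = [0] * len(work)
        for i, part in enumerate(partial):
            out[i::n] = part
        return out


host = Host([Device("cpu", 2), Device("gpu", 4)])
result = host.submit(lambda i: i * i, list(range(6)))
print(result)
```

The host never computes anything itself in this sketch. It only divides the work, hands the kernel to each device, and collects the results, which mirrors the division of roles in the platform model.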