The evolution of graphics hardware into general-purpose computing devices has accelerated the development of non-graphics APIs for programming graphics processors to execute non-graphics computing tasks. Both vendor-specific and cross-platform solutions have recently been made available to the public, each competing to become the standard for GPGPU development at both the architectural and the programming level.
The two major competing commercial standards for GPGPU development are NVIDIA's "Compute Unified Device Architecture" (CUDA) [6] and ATI's StreamSDK [7]. The CUDA development platform has so far been adopted more extensively by the HPC market than the StreamSDK; however, the two still compete on equal terms on the vendors' side.
CUDA [6] is NVIDIA's answer to GPGPU computing. Born as the Compute Unified Device Architecture, the CUDA acronym now stands on its own and has entirely replaced the original extended name. CUDA is available on all recent NVIDIA graphics cards, starting from the GeForce 8 series, the Quadro FX 5600/4600 and the Tesla solutions. Its software architecture comprises several layers: the API, the hardware driver, and utility libraries (see Figure 4.5). Through CUDA the GPU is seen as a massively parallel set of multiprocessors (see Figure 4.6) capable of executing a large number of threads in parallel.
From a programming architecture point of view, the GPU is treated as a co-processor (or computing device) to the main CPU (or computing host). The CPU downloads to the device those functions which exhibit parallel behaviour. Before being downloaded, each of these functions is translated into the device instruction set; the resulting device program is referred to as a kernel. A kernel is executed by many threads organized as a grid of thread blocks. Threads are assigned to blocks according to their data access pattern, so that threads in the same block can share data and access memory efficiently.
The API programming model is based on an extension of the C programming language along four major lines: function type qualifiers (__global__, __device__, __host__) to mark a function as belonging to the host or to the device; variable type qualifiers (__device__, __constant__, __shared__) to specify the memory location of a variable; a directive (the <<<...>>> execution configuration) to describe how a kernel is executed on a device; and built-in variables (gridDim, blockDim, blockIdx, threadIdx) which expose the grid and block dimensions and the block and thread indices.
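The relationship between a kernel, the grid of thread blocks and the index variables can be sketched on the CPU. The following C++ fragment is only an emulation written for illustration, not CUDA code: the ThreadContext structure and the functions vector_add_kernel and launch_vector_add are hypothetical names, the grid is restricted to one dimension, and the "threads" are visited sequentially rather than in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the built-in index variables (1-D case only).
struct ThreadContext {
    std::size_t grid_dim;    // number of blocks in the grid
    std::size_t block_dim;   // number of threads per block
    std::size_t block_idx;   // index of this block within the grid
    std::size_t thread_idx;  // index of this thread within its block
};

// "Kernel": the body executed once per thread; each thread handles one element.
void vector_add_kernel(const ThreadContext& ctx,
                       const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& out) {
    // Global element index derived from the block and thread indices.
    const std::size_t i = ctx.block_idx * ctx.block_dim + ctx.thread_idx;
    if (i < out.size()) {  // guard: the last block may be only partially used
        out[i] = a[i] + b[i];
    }
}

// Host-side "launch": walks the grid block by block and thread by thread.
// A real device would execute these threads in parallel.
std::vector<float> launch_vector_add(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t block_dim) {
    assert(a.size() == b.size() && block_dim > 0);
    std::vector<float> out(a.size());
    // Round up so that every element is covered by some thread.
    const std::size_t grid_dim = (a.size() + block_dim - 1) / block_dim;
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            const ThreadContext ctx{grid_dim, block_dim, block, thread};
            vector_add_kernel(ctx, a, b, out);
        }
    }
    return out;
}
```

Calling launch_vector_add on two five-element vectors with a block size of 2 creates a grid of three blocks; the sixth thread falls outside the data and is skipped by the bounds guard, which is why kernels written in this style usually carry such a check.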
The AMD Stream Computing Model (see Figure 4.7) is AMD's response to GPU programming for high-performance data-parallel tasks. The model includes a software stack (StreamSDK) and the AMD stream processor born from the acquisition of ATI. The Stream SDK consists of a suite of tools for GPU programming which offload arithmetic operations onto the GPU. This capability is achieved through a hierarchy of three main levels which integrates with C and C++ development products. The hierarchy comprises performance libraries, such as the AMD Core Math Library (ACML) and COBRA, for optimized domain-specific algorithms; compilers, such as Brook+ and RapidMind; and lower-level drivers and programming languages, such as the AMD Compute Abstraction Layer (CAL) and the Intermediate Language (IL). These are complemented by performance profiling tools such as GPU ShaderAnalyzer and AMD CodeAnalyst. Brook+ is a high-level language that extends Brook, a GPU programming language built on an abstract processor model which simplifies computations on graphics processors. The AMD SDK includes a Brook+ compiler which converts Brook+ files into C for execution on the CPU and GPU. CAL is a device-driver library that provides a programmer-friendly interface to AMD's stream processor devices. IL, the intermediate language, is a high-level assembly language for the GPU designed to give developers direct access to the lower levels of the graphics hardware.
Most of the AMD Stream SDK [7] is a result of AMD's acquisition of ATI. The acquisition combined expertise in semiconductor technologies with expertise in high-performance development tools. The latest generation of AMD stream processors supports the unified shader programming model. Programmable stream cores execute user-developed programs called stream kernels. Streams and kernels are the basic building blocks for code development in Brook+. Streams are collections of data elements which can be read from or written to by the CPU or the GPU. Kernels are groups of operations which define functions applied to the elements of the streams. Stream cores can execute both graphics and non-graphics functions through a virtualized SIMD programming model operating on the stream of data.
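The stream and kernel abstraction can be illustrated with a small CPU-side sketch. The following C++ fragment is not Brook+ code and uses no AMD API: the Stream alias and the functions saxpy_kernel and run_saxpy are hypothetical names chosen for illustration, and the element-wise loop stands in for what the stream cores would execute in SIMD fashion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A "stream" is modelled here as a plain collection of data elements.
using Stream = std::vector<float>;

// "Kernel": a function defined on one element taken from each input stream.
float saxpy_kernel(float alpha, float x, float y) {
    return alpha * x + y;
}

// Applies the kernel independently to every element position of the input
// streams and collects the results in an output stream. On a stream
// processor these applications would run in parallel; here they run in a loop.
Stream run_saxpy(float alpha, const Stream& x, const Stream& y) {
    assert(x.size() == y.size());
    Stream out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = saxpy_kernel(alpha, x[i], y[i]);
    }
    return out;
}
```

Because the kernel sees only the elements at one position and has no dependency on neighbouring positions, the order in which positions are processed does not affect the result, which is the property that allows the work to be spread across many stream cores.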