Cross-platform APIs that address the GPGPU programming issues have been proposed by both RapidMind [8] and the Khronos Group [9]. Compared with their commercial competitors, these vendor-independent APIs are oriented much more towards a unified standard for parallel programming, adaptable not only to GPU programming but to a wider range of multiprocessor architectures such as multi-core CPUs and Cell Broadband Engines.
RapidMind [8] is a private enterprise devoted to the development of platform-independent solutions for high-performance processors, including multi-core CPUs and accelerators such as GPUs. RapidMind's flagship product is the RapidMind Multi-core Development Platform. On GPUs, the RapidMind platform can be used for both shaders and general-purpose processing. It provides a software development platform that uses ISO-standard C++ as its programming language, with no requirement for any GPU-specific extension.
Developers use their own existing C++ compilers and build systems, but are required to use specific platform-defined types for numbers, vectors, matrices and arrays. This allows the RapidMind embedded system to identify and record the parts of an application that can be accelerated at runtime.
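The idea that platform-defined types let a runtime "identify and record" computations can be illustrated with a toy sketch. This is not RapidMind's API, which is C++ and is not shown in the text; the class `Value` and the helper `record` are hypothetical names, and Python is used only for brevity. The sketch assumes that recording works by overloading arithmetic operators on the special type.

```python
class Value:
    """Hypothetical platform-defined numeric type.

    Instead of computing a result, each operation appends a record
    to a shared log, so a runtime could later compile or offload it.
    """

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def _record(self, op, other):
        other_name = other.name if isinstance(other, Value) else repr(other)
        result = Value(f"t{len(self.log)}", self.log)
        self.log.append((result.name, op, self.name, other_name))
        return result

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def record(fn, *arg_names):
    """Run ordinary-looking code on Value arguments and return the
    recorded operation list plus the name of the final result."""
    log = []
    out = fn(*(Value(name, log) for name in arg_names))
    return log, out.name


def saxpy(a, x, y):
    # Written like normal arithmetic; no special syntax is needed.
    return a * x + y
```

Calling `record(saxpy, "a", "x", "y")` yields a two-entry operation list (a multiply followed by an add) that a runtime could then map to whatever hardware is available.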
Parallel processing is managed automatically by the RapidMind platform at runtime: the optimized code is mapped onto all the available computational resources in a given system or onto specifically targeted hardware (see Figure 4.8). RapidMind recently embraced open-source and open-standards policies, taking part in the LLVM (Low Level Virtual Machine) Compiler Infrastructure project and in the OpenCL standard as a development partner and user (see Section 4.2.2.2).
OpenCL (Open Computing Language) [9] represents the first endeavour to create an open standard for parallel programming of heterogeneous computational resources at the processor level. More than just a programming language, it includes an API, libraries and a runtime system for software development.
The framework aims to enable portable and efficient access to general-purpose parallel programming across CPUs, GPUs, Cell and many-core architectures for both HPC and commodity applications. The key idea is to allow applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. Experienced programmers are supported throughout the development of general-purpose algorithms without having to map the algorithm onto architecture- or platform-specific features such as the 3D graphics APIs OpenGL or DirectX.
OpenCL was originally started by Apple, which served as specification editor, and its development is led by the Khronos Group. The OpenCL working group lists among its partners 3Dlabs, AMD/ATI, IBM, Intel, Motorola, Nokia, NVIDIA, RapidMind and Texas Instruments. Being an open, royalty-free standard, its developer consortium is open to join through the Khronos Group.
The core ideas behind OpenCL are described by four main models: the platform model, the execution model, the memory model and the programming model.
platform model The platform model consists of a host connected to one or more OpenCL compute devices. A compute device is divided into one or more compute units (CUs), which are further divided into one or more processing elements (PEs). Computations on a device occur within the processing elements. Each processing element can behave either as a SIMD (single instruction, multiple data) unit or as an SPMD (single program, multiple data) unit. The core difference between SIMD and SPMD lies in whether a kernel is executed concurrently on multiple processing elements that each have their own data but share a program counter, or that each have their own data and their own program counter. In the SIMD case all processing elements execute a strictly identical sequence of instructions, which is not always true in the SPMD case because of possible branching in a kernel.
Each OpenCL application runs on a host according to the models native to the host platform, and submits commands from the host to be executed on the processing elements within a device.
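The SIMD/SPMD distinction drawn in the platform model can be made concrete with a small toy simulation. It is written in plain Python rather than against the OpenCL API, and the names `kernel`, `run_spmd` and `is_lockstep` are hypothetical. Each simulated processing element runs the same kernel on its own data with its own control flow (the SPMD case) and records which instructions it executed. A SIMD unit, sharing one program counter, could only run the kernel as written if all those records were identical.

```python
def kernel(x, trace):
    """Toy kernel with one data-dependent branch.

    `trace` collects the names of the instructions actually executed.
    """
    trace.append("load")
    if x % 2 == 0:
        trace.append("halve")
        y = x // 2
    else:
        trace.append("triple_plus_one")
        y = 3 * x + 1
    trace.append("store")
    return y


def run_spmd(data):
    """SPMD: one kernel instance per data element, each with its own
    program counter, so each follows the path its own data selects."""
    results, traces = [], []
    for x in data:
        trace = []
        results.append(kernel(x, trace))
        traces.append(trace)
    return results, traces


def is_lockstep(traces):
    """True if every processing element executed the identical
    instruction sequence, as a shared program counter would require."""
    return all(t == traces[0] for t in traces)
```

With input `[2, 4, 6]` every element takes the same branch and the traces coincide; with `[2, 3]` the branch diverges and the instruction sequences differ, which is exactly the situation the text describes for SPMD.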
execution model Execution of an OpenCL program occurs in two parts: a kernel, the basic unit of executable code, which is executed on one or more OpenCL devices, and a host program, a collection of compute kernels and internal functions, which is executed on the host. The host program defines the context for the kernels and manages their execution.
When a kernel is submitted for execution by the host, an index space is defined. An instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global ID for the work-item. Work-items are organized into work-groups. The work-groups provide a more coarse-grained decomposition of the index space. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within a work-group so that a single work-item can be uniquely identified by its global ID or by a combination of its local ID and work-group ID.
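The relationship between global ID, work-group ID and local ID can be sketched as follows. This is a plain-Python model, not the OpenCL API; `work_items` and `to_global_id` are hypothetical helper names. The sketch assumes that the global size is an exact multiple of the work-group size in every dimension and that IDs start at zero, so that in each dimension the global ID equals the work-group ID times the work-group size plus the local ID.

```python
from itertools import product


def work_items(global_size, local_size):
    """Enumerate an N-dimensional index space.

    Yields (global_id, group_id, local_id) tuples, one per work-item.
    Assumes global_size is divisible by local_size in each dimension.
    """
    if len(global_size) != len(local_size):
        raise ValueError("dimensionality mismatch")
    if any(g % s for g, s in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of local size")
    for gid in product(*(range(g) for g in global_size)):
        group_id = tuple(g // s for g, s in zip(gid, local_size))
        local_id = tuple(g % s for g, s in zip(gid, local_size))
        yield gid, group_id, local_id


def to_global_id(group_id, local_id, local_size):
    """Recover the global ID from work-group ID and local ID."""
    return tuple(w * s + l for w, l, s in zip(group_id, local_id, local_size))
```

For a 4 x 6 index space with 2 x 3 work-groups this enumerates 24 work-items in 4 work-groups, and `to_global_id` maps every (work-group ID, local ID) pair back to the original global ID.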
memory model The memory model of OpenCL is a shared-memory model with relaxed consistency. Each work-item has access to four distinct memory regions: global memory, accessible by all work-items in all work-groups; constant memory, a read-only region of global memory; local memory, local to a work-group; and private memory, private to a work-item.
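The visibility rules of the four memory regions can be summarised in a short sketch. It is only a model of the scoping described above, not OpenCL code; `Region`, `shares_storage` and `kernel_can_write` are hypothetical names, and a work-item is represented simply as a `(group_id, local_id)` pair.

```python
from enum import Enum


class Region(Enum):
    GLOBAL = "global"
    CONSTANT = "constant"
    LOCAL = "local"
    PRIVATE = "private"


def shares_storage(region, item_a, item_b):
    """Return True if two work-items see the same instance of `region`.

    Each item is a (group_id, local_id) pair.
    """
    group_a, _ = item_a
    group_b, _ = item_b
    if region in (Region.GLOBAL, Region.CONSTANT):
        return True                # visible to every work-item
    if region is Region.LOCAL:
        return group_a == group_b  # one instance per work-group
    return item_a == item_b        # private: one instance per work-item


def kernel_can_write(region):
    """Constant memory is read-only from the kernel's point of view."""
    return region is not Region.CONSTANT
```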
programming model OpenCL supports two main programming models: the data-parallel programming model and the task-parallel programming model. Hybrids of the two are allowed, although the data-parallel model remains the primary one. In a data-parallel programming model, sequences of instructions are applied to multiple elements of memory objects. OpenCL maps data to work-items and work-items to work-groups. The data-parallel model is implemented in two possible ways. The first, or explicit, model lets the programmer define both the number of work-items to execute in parallel and how work-items are divided among work-groups. The second, or implicit, model lets the programmer specify only the number of work-items and leaves OpenCL to manage the division into work-groups.
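The difference between the explicit and implicit variants can be sketched in one dimension. The helper names `choose_local_size` and `partition` are hypothetical, and the heuristic used for the implicit case (the largest divisor of the global size that does not exceed an assumed maximum work-group size of 256) is purely illustrative: actual OpenCL implementations are free to choose the division however they see fit.

```python
def choose_local_size(global_size, max_group_size):
    """Illustrative heuristic only: pick the largest divisor of
    global_size that does not exceed max_group_size."""
    if global_size < 1 or max_group_size < 1:
        raise ValueError("sizes must be positive")
    for candidate in range(min(global_size, max_group_size), 0, -1):
        if global_size % candidate == 0:
            return candidate


def partition(global_size, local_size=None, max_group_size=256):
    """Return (number_of_work_groups, work_items_per_group).

    Explicit model: the caller passes local_size.
    Implicit model: local_size is None and the 'runtime' decides.
    """
    if local_size is None:
        local_size = choose_local_size(global_size, max_group_size)
    elif global_size % local_size:
        raise ValueError("global size must be a multiple of local size")
    return global_size // local_size, local_size
```

With 1024 work-items, `partition(1024, 64)` is the explicit form and gives 16 groups of 64, while `partition(1024)` leaves the choice to the heuristic, which here gives 4 groups of 256.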
With Direct3D11, Microsoft will introduce another shader, the Compute Shader, alongside the existing shaders of the common Direct3D10 pipeline, in order to generalize the data-parallel programming model that exists in Direct3D. It will be fully integrated into Direct3D, allowing data to be exchanged, for example, between Pixel Shaders and Compute Shaders (see Figure 4.3).
Unlike Pixel Shaders, it will offer cross-thread data sharing and unordered-access I/O operations such as gather and scatter. It will enable far more general data structures, such as irregular arrays, trees and linked lists, than the Resource types of Direct3D10, which essentially represent linear arrays. These data structures will in turn allow more general algorithms that go far beyond common shading. The Compute Shader will focus on client scenarios: its target is not HPC cluster farms, but making the teraflop performance of a graphics card available on a single machine that needs to display or render the result of the computation with very low latency in real time.
Compute shaders can simply spawn a regular array of threads. That array can be one-, two- or three-dimensional. Registers are shared between threads to reduce the register pressure on the underlying graphics processing unit and to eliminate redundant computation and I/O operations. The initial version will have 32 KB of registers that can be shared. These registers will be 32-bit only. Not all threads in a call will be able to, or should, share registers with each other; otherwise the latency of accessing the registers would be too high. That is why sharing threads are broken down into subsets (groups) of threads. All the different thread indices, such as threadID, threadGroupID and threadIDinGroup, are accessible in the compute shader. Compute shaders will share the same language subset as the other shaders and follow the same evolutionary path, meaning that they will be backwards and forwards compatible between different generations.
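Two of the quantities mentioned above can be worked out in a short sketch. It is plain Python rather than shader code, and it rests on two assumptions not stated in the text: that "32KByte" means 32 x 1024 bytes, and that the three thread indices are related in the same way as OpenCL's global, work-group and local IDs (threadID = threadGroupID x group dimensions + threadIDinGroup, per axis). The function names are hypothetical.

```python
SHARED_BYTES = 32 * 1024  # assumed: 1 KByte = 1024 bytes
REGISTER_BYTES = 4        # 32-bit registers


def shared_register_count():
    """Number of 32-bit registers that fit in the shared budget."""
    return SHARED_BYTES // REGISTER_BYTES


def thread_id(thread_group_id, thread_id_in_group, group_dim):
    """Assumed relation between the three indices, per axis:
    threadID = threadGroupID * group_dim + threadIDinGroup."""
    for local, dim in zip(thread_id_in_group, group_dim):
        if not 0 <= local < dim:
            raise ValueError("threadIDinGroup outside group dimensions")
    return tuple(
        g * d + l
        for g, l, d in zip(thread_group_id, thread_id_in_group, group_dim)
    )
```

Under these assumptions the shared budget corresponds to 8192 registers, and a thread at position (3, 1, 0) inside group (1, 0, 2) of a 4 x 2 x 1 group layout has the overall index (7, 1, 2).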