Chapter 5 - Future Outlook

 

GPGPU, GPU clusters, multi-core and multi-processor architectures: not only graphics hardware but high-performance general-purpose computing as a whole is at a turning point. With so many possible designs and unexplored routes, which hardware will actually prevail is still a guess. If it lives up to expectations, Intel's Larrabee project [10] could drastically change the profile of the current market; if not, solutions such as GPU clusters could lead the way.

5.1 Larrabee

Figure 5.1: Schematic for the Larrabee many-core architecture

Larrabee is the codename for a new many-core visual computing architecture that Intel is developing in parallel with its current line of integrated graphics accelerators (Intel GMA). Larrabee's blueprint [10] was first presented in August 2008 at the SIGGRAPH 2008 conference, and the first chip release is expected in early 2010. The published specifications and claimed performance make Larrabee a plausible indication of where GPUs are heading.

From a design point of view, Larrabee exhibits similarities with both multi-core CPUs and GPUs. From the multi-core CPU architecture it inherits a coherent cache hierarchy and x86 compatibility; from the GPU architecture it inherits wide SIMD vector units and texture sampling hardware. As described in [10], the main architecture (see Figure 5.1) is based on in-order CPU cores running an extended version of the x86 instruction set. The extensions include wide vector processing operations, specialized scalar instructions such as bit count and bit scan, and new instructions and instruction modes for explicit cache control. Each core is augmented with a wide vector processing unit (VPU) (see Figure 5.2), which should allow highly efficient mathematical operations. Larrabee is equipped with a coherent on-die second-level cache that provides efficient inter-processor communication and high-bandwidth access to local data for the CPU cores. Each core supports up to four execution threads, with a separate register set per thread.

Task scheduling is performed entirely in software rather than in fixed-function logic, which allows each rendering pipeline to adapt its resource scheduling and load-balancing algorithms. The Larrabee programming model (Larrabee Native) is based on C/C++; it supports both DirectX and OpenGL applications, as well as thread programming through OpenMP. Unlike current GPU architectures, Larrabee will feature cache coherency across all its cores. From a rendering point of view, the Larrabee graphics pipeline includes very little specialized hardware. Because a Larrabee Native application is implemented mostly in software, it can be easily extended.
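
To make the combination of software scheduling, thread programming through OpenMP and wide vector units more concrete, the following is a minimal sketch in standard C++ with OpenMP pragmas. It is not Larrabee Native code and does not use any Larrabee-specific API: it only illustrates the programming style on a conventional compiler. The function name `saxpy_tiled` and the constant `kTileSize` are hypothetical; the value 16 is chosen to echo the 16-wide VPU described in [10].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size: 16 floats echoes the 16-wide vector unit
// described in [10]; it is an illustrative choice only.
constexpr std::size_t kTileSize = 16;

// Computes y[i] = a * x[i] + y[i] for every element.
// Tiles are handed out to threads at run time (schedule(dynamic)), which is
// a load-balancing policy chosen in software; the inner loop over one tile
// is marked as a candidate for SIMD execution.
void saxpy_tiled(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    const long tile = static_cast<long>(kTileSize);
    const long num_tiles = (n + tile - 1) / tile;  // round up

    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < num_tiles; ++t) {
        const long begin = t * tile;
        const long end = (begin + tile < n) ? begin + tile : n;
        #pragma omp simd
        for (long i = begin; i < end; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

Changing `schedule(dynamic)` to another OpenMP schedule changes the load-balancing policy without touching the computation, which is the kind of flexibility that software scheduling is meant to provide. If the compiler is invoked without OpenMP support, the pragmas are ignored and the loop simply runs serially.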

Larrabee's innovative design and projected performance make it a potential breakthrough for both graphics and traditional high-performance computing.

Figure 5.2: Vector Processing Unit block diagram.

5.2 GPU Clusters

An alternative to multi-core/multi-processor architectures like the one proposed by Larrabee is the GPU cluster. As with CPU clusters, a GPU cluster is a computer cluster in which each node is equipped with a Graphics Processing Unit. The aim of such clusters is to harness the computational power of GPUs by treating them as general-purpose processors. From a hardware point of view, GPU clusters fall into two categories: heterogeneous and homogeneous. Heterogeneous clusters feature hardware from different manufacturers (mostly NVIDIA and AMD/ATI), or from the same manufacturer but of different models, while in homogeneous clusters all GPUs are identical (same class, brand and model).

An example of a heterogeneous GPU cluster is the 16-node cluster at NCSA's Innovative Systems Laboratory. The NCSA infrastructure combines GPU and FPGA (field-programmable gate array) technology to explore the use of these architectures for accelerating scientific computing tasks. According to [11], each of the 16 nodes currently features two dual-core 2.4 GHz AMD Opterons with 8 GB of memory, four NVIDIA Quadro FX 5600 GPUs with 1.5 GB of memory each, and a Nallatech H101-PCIX FPGA accelerator with 16 MB of SRAM and 512 MB of SDRAM.

Another interesting attempt at building a GPGPU cluster was made in 2004 by the Center for Visual Computing and the Department of Computer Science at Stony Brook University [12]. The Stony Brook cluster featured 32 nodes connected by a 1 Gigabit Ethernet switch. Each node was an HP PC equipped with two Pentium Xeon 2.4 GHz processors, 2.5 GB of memory, and a GeForce FX 5800 Ultra with 128 MB of memory used for the GPU cluster computation.
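
The distinction between homogeneous and heterogeneous clusters can be expressed as a simple check over the GPUs installed in the nodes. The sketch below is only an illustration of that definition: the types `NodeGpu` and `ClusterKind` and the function `classify_cluster` are hypothetical names, and for brevity a GPU is identified by vendor and model only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of the GPU installed in one cluster node.
struct NodeGpu {
    std::string vendor;  // e.g. "NVIDIA"
    std::string model;   // e.g. "GeForce FX 5800 Ultra"
};

enum class ClusterKind { Homogeneous, Heterogeneous };

// A cluster is classified as homogeneous when every node's GPU has the same
// vendor and model as the first node's GPU; any mismatch makes it
// heterogeneous.
ClusterKind classify_cluster(const std::vector<NodeGpu>& nodes) {
    assert(!nodes.empty());
    const NodeGpu& first = nodes.front();
    for (const NodeGpu& gpu : nodes) {
        if (gpu.vendor != first.vendor || gpu.model != first.model) {
            return ClusterKind::Heterogeneous;
        }
    }
    return ClusterKind::Homogeneous;
}
```

Under this definition, a list of 32 identical entries such as the Stony Brook configuration described above would be classified as homogeneous, while adding a single node with a different vendor or model would make the cluster heterogeneous.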


  10. Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3):1-15, 2008.
  11. NCSA's Innovative Systems Laboratory - GPU cluster project, Jan 2009.
  12. Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover. GPU cluster for high performance computing. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 47, Washington, DC, USA, 2004. IEEE Computer Society.