Thirty years ago, CPUs and other specialized processors handled almost all computing tasks. Graphics cards of that era helped to speed up the rendering of 2D shapes in Windows and applications, but had no other uses.
Fast forward to today, and the GPU has become one of the most dominant chips in the industry.
Ironically, the days when the sole function of graphics chips was graphics are long gone, with machine learning and high-performance computing heavily relying on the processing power of the humble GPU. Join us as we explore how this single chip evolved from an inconspicuous pixel pusher to a powerful floating-point computing engine.
At first, CPUs ruled everything
Let's go back to the late 1990s. The field of high-performance computing, including scientific work on supercomputers, data processing on standard servers, and engineering and design tasks on workstations, relied entirely on two types of CPUs: 1) specialized processors designed for a single purpose, and 2) off-the-shelf chips from AMD, IBM, or Intel.
The ASCI Red machine was one of the most powerful supercomputers around 1997, built from 9,632 Intel Pentium II OverDrive CPUs. Each ran at 333 MHz, giving the system a theoretical peak of just over 3.2 TFLOPS (trillion floating-point operations per second).
Since we will be mentioning TFLOPS frequently in this article, it is worth spending a moment on what the term means. In computer science, floating-point numbers (or simply floats) are data values that represent non-integer quantities, such as 6.2815 or 0.0044. Whole numbers, known as integers, are used for the calculations needed to control a computer and any software running on it.
Floating-point numbers are essential wherever precision matters, particularly anything related to science or engineering. Even a simple calculation, such as determining the circumference of a circle, involves at least one floating-point value.

For decades, CPUs have had separate circuits for carrying out logic operations on integers and on floats. The Pentium II OverDrive mentioned above could perform one basic floating-point operation (a multiplication or an addition) per clock cycle. In theory, this puts ASCI Red's peak floating-point performance at 9,632 CPUs x 333 million clock cycles per second x 1 operation per cycle = 3,207,456 million FLOPS, or roughly 3.2 TFLOPS.
These figures are based on ideal conditions (e.g., using the simplest instructions for data that fits well in the cache) and are rarely achievable in real life. However, they do indicate the potential capabilities of the system.
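As a rough illustration of where such headline numbers come from, here is a minimal C++ sketch (the helper function and its name are ours; the figures are the ones quoted above for ASCI Red):

#include <cstdio>

// Theoretical peak = processor count x clock speed (Hz) x FP operations per cycle.
// It ignores memory bandwidth, instruction mix, and everything else that keeps
// real workloads well below this ceiling.
double peak_tflops(double processors, double clock_hz, double ops_per_cycle) {
    return processors * clock_hz * ops_per_cycle / 1e12;
}

int main() {
    // ASCI Red: 9,632 Pentium II OverDrive CPUs at 333 MHz, 1 FP op per cycle.
    std::printf("ASCI Red peak: %.2f TFLOPS\n", peak_tflops(9632, 333e6, 1.0));  // ~3.21
    return 0;
}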
Other supercomputers held similarly large numbers of standard processors: Lawrence Livermore National Laboratory's Blue Pacific used 5,808 IBM PowerPC 604e chips, and Los Alamos National Laboratory's Blue Mountain used 6,144 MIPS Technologies R10000 processors.
Reaching teraflop-level floating-point throughput required thousands of CPUs, all supported by large amounts of RAM and hard disk storage. This was, and still is, driven by the mathematical demands of the workloads these machines run.
When we first encounter equations in physics, chemistry, and other subjects in school, everything is one-dimensional. In other words, we use a single number to represent distance, velocity, mass, time, etc. However, to accurately model and simulate phenomena, more dimensions are needed, and mathematics moves into the realm of vectors, matrices, and tensors.
Mathematically, these are treated as single entities, yet each one holds multiple values, which means any computer working through the calculations has to handle a huge quantity of numbers simultaneously. Given that the CPUs of the day could only process one or two floating-point values per cycle, thousands of them were needed.
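To make this concrete, here is a deliberately simple C++ sketch (the particle structure and function are purely illustrative, not from the source): updating the positions of a set of particles from their velocities already takes three independent multiply-adds per particle, exactly the kind of repetitive floating-point work described above.

#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };   // a three-component vector

// One time step: position += velocity * dt for every particle.
// Each particle needs three multiplies and three adds, all independent of
// one another -- ideal work for hardware that processes many values at once.
void step(std::vector<Vec3>& pos, const std::vector<Vec3>& vel, float dt) {
    for (std::size_t i = 0; i < pos.size(); ++i) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}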
SIMD Joins the Competition: MMX, 3DNow!, and SSE
In 1997, Intel updated its Pentium CPUs with a technology extension called MMX: a set of instructions that used eight additional registers inside the core, each able to hold between one and eight integer values depending on their size. This allowed the processor to execute one instruction across multiple numbers at the same time, an approach known as SIMD (Single Instruction, Multiple Data).

A year later, AMD introduced its own version, called 3DNow!. It was notably more capable because its registers could hold floating-point values. It took Intel another year to address that shortcoming of MMX, adding SSE (Streaming SIMD Extensions) to the Pentium III.
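The idea is easiest to see with the SSE intrinsics that arrived with the Pentium III: a single instruction operates on four packed single-precision floats at once. A minimal C++ sketch (the function name is ours):

#include <xmmintrin.h>   // SSE intrinsics, available since the Pentium III

// Adds four pairs of floats with one SIMD instruction instead of four scalar adds.
void add_four(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);      // load four packed single-precision values
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   // one instruction, four additions
    _mm_storeu_ps(out, vr);
}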
As the calendar rolled over into the new millennium, designers of high-performance computers had access to standard processors that could handle vector mathematics efficiently.
Once scaled into the thousands, those processors could manage matrices and tensors just as well. Despite this advance, the supercomputer world still favored older or specialized chips, since these new extensions were not really designed for such workloads. The same was true of another processor that was rapidly gaining popularity and was better at SIMD work than any CPU from AMD or Intel: the GPU.
In the early days of graphics processors, the CPU handled the calculations for the triangles that made up a scene (which is where AMD's 3DNow! branding for its SIMD extension came from). The shading and texturing of pixels, however, was handled entirely by the GPU, and much of that work involved vector mathematics.
The best consumer graphics cards of more than 20 years ago, such as the 3dfx Voodoo5 5500 and Nvidia GeForce 2 Ultra, were excellent SIMD devices. However, they were created for the purpose of generating 3D graphics for games and nothing else. Even graphics cards in the professional market were only focused on rendering.
The ATI FireGL 3, priced at $2,000, came equipped with two IBM chips (a GT1000 geometry engine and an RC1000 rasterizer), a massive 128 MB of DDR SDRAM, and a claimed 30 GFLOPS of processing power. But all of it existed to accelerate graphics in programs such as 3D Studio Max and AutoCAD via the OpenGL rendering API.
GPUs of that era could not be used for other purposes because the process of transforming 3D objects and converting them into monitor images did not involve a large amount of floating-point mathematics. In fact, a significant portion of it was at the integer level, and it took graphics cards several years to start using floating-point values extensively throughout the entire pipeline.
One of the first to do so was ATI's R300 processor, which had eight independent pixel pipelines that handled all of their mathematics at 24-bit floating-point precision. Unfortunately, there was no way to harness that capability for anything other than graphics—the hardware and the software built around it were entirely image-centric.

Computer engineers were well aware that GPUs packed enormous amounts of SIMD capability but lacked a way to apply it in other fields. Surprisingly, it was a games console that showed how to solve this tricky problem.
A New Unified Era
Microsoft's Xbox 360 was released in November 2005, with its CPU designed and manufactured by IBM, based on the PowerPC architecture, and the GPU designed by ATI and manufactured by TSMC.
This graphics chip, code-named Xenos, was particularly special because its layout completely avoided the classic approach of separate vertex and pixel pipelines.
Instead, it used a cluster of three SIMD arrays. Each array consisted of 16 vector processors, each containing five arithmetic units, a layout that allowed every array to execute two sequential instructions from a thread on 80 floating-point values per clock cycle.
Known as a unified shader architecture, this arrangement meant each array could handle any type of shader. Although it made other aspects of the chip more complex, Xenos sparked a design paradigm that is still in use today. At a clock speed of 500 MHz, the whole cluster could theoretically deliver 240 GFLOPS of multiply-add throughput across three threads (500 MHz x 3 arrays x 80 values x 2 operations).
To put that figure in perspective, some of the world's top supercomputers from a decade earlier could not match it. For example, Sandia National Laboratories' Intel Paragon XP/S 140, with 3,680 Intel i860 CPUs, topped the world supercomputer list in 1994 with a peak of 184 GFLOPS. The pace of chip development quickly left that machine behind, but the same would soon be true of the GPU.
CPUs had been incorporating SIMD arrays of their own for years—Intel's original Pentium MMX, for example, had a dedicated unit for executing instructions on vectors of up to eight 8-bit integers. By the time the Xenos was sitting in living rooms around the world, such units had at least doubled in width, but they were still tiny compared to those in Xenos.

When consumer graphics cards began to adopt GPUs with a unified shader architecture, they already boasted a noticeably higher processing rate than the Xbox 360's graphics chip.
The Nvidia G80 used in the GeForce 8800 GTX (2006) had a theoretical peak of 346 GFLOPS, while the ATI R600 in the Radeon HD 2900 XT (2007) boasted 476 GFLOPS.
Both graphics chip manufacturers quickly leveraged this computational power in their professional models. Although they were expensive, the ATI FireGL V8650 and Nvidia Tesla C870 were well-suited for high-end scientific computing. However, at the highest level, supercomputers around the world still relied on standard CPUs. In fact, it was only a few years later that GPUs began to appear in the most powerful systems.
The design, construction, and operation of supercomputers and similar systems are extremely costly. For many years, they were built around large arrays of CPUs, so integrating another processor was not an overnight task. Such systems required thorough planning and initial small-scale testing before increasing the number of chips.
Secondly, getting all of these components to work together, especially on the software side, was no small feat, and this was a significant weakness of GPUs at the time. Although they had become highly programmable, the software available for them was still quite limited.
Microsoft's HLSL (High-Level Shading Language), Nvidia's Cg library, and OpenGL's GLSL made it easy to tap into the processing power of graphics chips, albeit purely for rendering.
Unified shader architecture GPUs changed all of this.
In 2006, ATI (by then a subsidiary of AMD) and Nvidia released software toolkits aimed at exposing this capability for more than just graphics, with their APIs called CTM (Close To Metal) and CUDA (Compute Unified Device Architecture), respectively.
However, what the scientific and data-processing communities really needed was a comprehensive software package that would treat massed ranks of CPUs and GPUs (commonly referred to as a heterogeneous platform) as a single entity made up of numerous compute devices.

Their needs were met in 2009. Initially developed by Apple and released by the Khronos Group, which had absorbed OpenGL a few years earlier, OpenCL became the de facto software platform for general-purpose computing on GPUs beyond everyday graphics—or GPGPU, a term coined by Mark Harris.
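To show what "a single entity made up of numerous compute devices" looks like in practice, here is a minimal C++ sketch using the OpenCL C API. It simply lists every compute device—CPU or GPU—that the installed runtime exposes; error handling is omitted for brevity, and it assumes an OpenCL SDK is present.

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    // Ask the runtime which OpenCL platforms (driver stacks) are installed.
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id platform : platforms) {
        // Every CPU and GPU shows up here as just another compute device.
        cl_uint num_devices = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

        for (cl_device_id device : devices) {
            char name[256] = {};
            clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("Compute device: %s\n", name);
        }
    }
    return 0;
}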
GPUs Enter the Computing Race
Unlike the vast world of technology reviews, there are not hundreds of reviewers globally testing the performance claims of supercomputers. However, an ongoing project launched by the University of Mannheim in Germany in the early 1990s is dedicated to achieving this goal.
The project, known as the TOP500, publishes a ranked list of the world's 500 most powerful supercomputers twice a year.
The first entries featuring GPUs appeared in 2010, when China fielded two such systems—Nebulae and Tianhe-1. They relied on Nvidia's Tesla C2050 (essentially a GeForce GTX 470) and AMD's Radeon HD 4870 chips, respectively, with the former boasting a theoretical peak of 2,984 TFLOPS.
In the early years of high-end GPGPU, Nvidia was the supplier of choice for these computing giants—not because of raw performance (AMD's Radeon cards typically offered more processing power) but because of software support. CUDA developed rapidly, and it took AMD a few years to come up with a suitable alternative; in the meantime, it encouraged users to adopt OpenCL instead.
However, Nvidia did not completely dominate the market, as Intel's Xeon Phi processors tried to secure a foothold. These large chips originated from a discontinued GPU project called Larrabee and were a special CPU-GPU hybrid, composed of multiple Pentium-like cores (CPU part) paired with large floating-point units (GPU part).
A look inside the Nvidia Tesla C2050 reveals 14 blocks called Streaming Multiprocessors (SMs), partitioned by caches and a central controller. Each SM contains 32 pairs of logic circuits (which Nvidia calls CUDA cores) for carrying out all the mathematics—one circuit in each pair for integers, the other for floats. For the latter, each core can manage one FMA (fused multiply-add) per clock cycle at single (32-bit) precision; double-precision (64-bit) operations take at least two cycles.
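The fused multiply-add mentioned here computes a*b + c in a single step and is conventionally counted as two floating-point operations, which is how peak-FLOPS figures like the ones in this article are derived. A minimal C++ sketch using the standard library's fma routine:

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.5f, b = 2.0f, c = 0.25f;
    // One fused multiply-add: a * b + c with a single rounding step.
    // Hardware vendors count this as two FLOPs (one multiply plus one add).
    float r = std::fmaf(a, b, c);
    std::printf("fma(%.2f, %.2f, %.2f) = %.2f\n", a, b, c, r);   // prints 3.25
    return 0;
}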
The floating-point units in the Xeon Phi look somewhat similar to those in the C2050's SMs, except that each core processes half as many data values. Nevertheless, with 32 repeated cores against the Tesla's 14 SMs, a single Xeon Phi could handle more values per clock cycle overall. However, Intel's first release of the chip was more prototype than product and never fully realized its potential—Nvidia's parts ran faster, consumed less power, and ultimately proved superior.

This would become a recurring theme in the three-way GPGPU contest between AMD, Intel, and Nvidia: one model might boast more processing cores, while another might have a higher clock speed or a more capable cache system.
The CPU remains essential for all kinds of computing, and many supercomputers and high-end systems are still built around AMD or Intel processors. Although a single CPU cannot compete with the SIMD performance of an average GPU, thousands of them connected together prove sufficient. However, such systems are far less power-efficient.
For example, around the time the Tianhe-1 supercomputer was using its Radeon HD 4870 GPUs, AMD's largest server CPU (the 12-core Opteron 6176 SE) was doing the rounds. For a power draw of roughly 140 W, the CPU could theoretically reach 220 GFLOPS, whereas the GPU offered a peak of 1,200 GFLOPS for only about 10 W more, and at a fraction of the cost.
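Expressed as efficiency, the gap is stark. A rough back-of-the-envelope sketch in C++ using the figures quoted above (the GPU's 150 W is simply the CPU's 140 W plus the "10 W more" from the text):

#include <cstdio>

int main() {
    // Figures quoted above: Opteron 6176 SE vs. Radeon HD 4870.
    double cpu_gflops = 220.0,  cpu_watts = 140.0;
    double gpu_gflops = 1200.0, gpu_watts = 150.0;
    std::printf("CPU: %.1f GFLOPS per watt\n", cpu_gflops / cpu_watts);   // ~1.6
    std::printf("GPU: %.1f GFLOPS per watt\n", gpu_gflops / gpu_watts);   // ~8.0
    return 0;
}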
A small graphics card that can do more
A few years later, it was not just the world's supercomputers leveraging GPUs for parallel computing en masse. Nvidia was actively promoting its GRID platform, a GPU virtualization service for scientific and other applications. Originally launched as a system for hosting cloud-based gaming, the growing demand for large-scale, affordable GPGPU made the shift inevitable. At its annual technology conference, Nvidia presented GRID as an important tool for engineers across many fields.
At the same event, the GPU maker offered a glimpse of a future architecture codenamed Volta. Few details were revealed, and the general assumption was that it would be another chip serving all of Nvidia's markets.
Meanwhile, AMD was doing something similar, deploying its regularly updated Graphics Core Next (GCN) design in its gaming-focused Radeon series as well as its FirePro and Radeon Sky server cards. By then, the performance figures were already astonishing—the FirePro W9100 had a peak FP32 (32-bit floating point) throughput of 5.2 TFLOPS, a number that would have been unthinkable for a supercomputer less than two decades earlier.
GPUs were still primarily designed for 3D graphics, but advances in rendering meant these chips had to become increasingly adept at general computing workloads. The only sticking point was their limited ability to handle high-precision floating-point math, i.e., FP64 or greater.

Looking at the top supercomputers of 2015, relatively few used accelerators—either Intel's Xeon Phi or Nvidia's Tesla—compared with those based entirely on CPUs.
All of that changed when Nvidia launched the Pascal architecture in 2016. This was the company's first attempt at a GPU designed specifically for the high-performance computing market, while its other chips continued to serve multiple sectors. Only one such processor was ever made (the GP100), and it appeared in just five products, but where all previous architectures carried only a handful of FP64 cores, this chip housed nearly 2,000 of them.
Offering more than 9 TFLOPS of FP32 processing power and half that rate at FP64, the Tesla P100 was seriously potent. AMD's Radeon Pro WX 9100, built around the Vega 10 chip, was 30% faster in FP32 but 800% slower in FP64. By this point, Intel was close to discontinuing the Xeon Phi line because of poor sales.
A year later, Nvidia finally released Volta, and it was immediately clear that the company was not just interested in bringing its GPUs to the HPC and data-processing markets—it was targeting another market altogether.
Neurons, Networks
Deep learning is a field within the broader discipline of machine learning, which is a subset of artificial intelligence. It involves using complex mathematical models (called neural networks) to extract information from given data.
An example is determining the probability that a given image depicts a specific animal. To do this, the model needs to be "trained"—in this case, by showing it millions of images that do contain the animal and millions that do not. The mathematics involved is rooted in matrix and tensor calculations.
For decades, such workloads were only suitable for large supercomputers based on CPUs. However, as early as the 2000s, GPUs were clearly well-suited for such tasks.
Nvidia duly bet on the deep learning market expanding significantly and added an extra feature to its Volta architecture to make it stand out in this field: banks of FP16 logic units, sold as tensor cores, that operate together as one large array but can only do a single thing—multiply two FP16 4x4 matrices and then add another FP16 or FP32 4x4 matrix to the result (an operation known as a GEMM, or general matrix multiplication). Nvidia's earlier GPUs, as well as those from its competitors, could carry out such calculations too, but nowhere near as quickly as Volta. The only GPU made with this architecture, the GV100, provides 640 tensor cores in the Tesla V100, each able to perform 64 FMA operations per clock cycle.
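In plain code, the operation a single tensor core carries out looks like the reference sketch below (C++, with ordinary floats standing in for the FP16 inputs and FP16/FP32 accumulator; a real tensor core performs the equivalent of this entire loop nest in hardware each clock cycle):

// Reference version of the tensor-core operation: D = A * B + C,
// where A, B, C, and D are all 4x4 matrices.
void gemm_4x4(const float A[4][4], const float B[4][4],
              const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];              // start from the matrix being added
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];     // 64 multiply-adds in total
            D[i][j] = acc;
        }
    }
}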
Depending on the size of the matrices in a dataset and the floating-point formats used, the Tesla V100 card could theoretically hit 125 TFLOPS in these tensor calculations. Volta was clearly aimed at a niche market, and while the GP100 had made only limited inroads into supercomputing, the new Tesla model was adopted rapidly.
PC enthusiasts will know that Nvidia went on to add tensor cores to its general consumer products with the Turing architecture and developed an upscaling technology called DLSS (Deep Learning Super Sampling), which uses those cores to run a neural network on the GPU that enlarges the rendered image and cleans up any artifacts in the frame.
In a short space of time, Nvidia all but monopolized the market for GPU-accelerated deep learning, and its data center revenue surged—growing 145% in fiscal 2017, 133% in fiscal 2018, and 52% in fiscal 2019. By the end of FY2019, sales to HPC, deep learning, and related fields totaled $2.9 billion, a very healthy result.
But then the market truly took off. The company's total revenue for the fourth quarter of fiscal 2024 was $22.1 billion, a year-on-year increase of 265%. Most of that growth came from its data center business, where demand for AI hardware generated $18.4 billion in revenue.
However, as long as there is money, competition is inevitable. Although Nvidia is still the largest GPU provider to date, other large technology companies have not been idle.
In 2018, Google began offering access to its in-house tensor processing units (TPUs) through its cloud services. Amazon soon followed with a dedicated CPU of its own, AWS Graviton. Meanwhile, AMD restructured its GPU division into two distinct product lines: one primarily for gaming (RDNA) and the other dedicated to compute (CDNA).
While RDNA departed significantly from its predecessor, CDNA was largely a natural evolution of GCN, albeit scaled up enormously. Look at the GPUs in today's supercomputers, data servers, and AI machines and everything is huge.

AMD's CDNA 2-powered MI250X boasts 220 compute units, delivering just under 48 TFLOPS of double-precision FP64 throughput and carrying 128 GB of high-bandwidth memory (HBM2e), both of which are in high demand in HPC applications. Nvidia's GH100 chip, built on the Hopper architecture with 576 Tensor Cores, can potentially hit 4,000 TOPS with the low-precision INT8 number format in AI matrix calculations.
Intel's Ponte Vecchio GPU is similarly enormous at 100 billion transistors, while AMD's MI300 packs in 146 billion, spread across multiple CPU, GPU, and memory chiplets.
However, one thing they all have in common is that they are decidedly not GPUs in the original sense of the term—the acronym stood for Graphics Processing Unit long before Nvidia wielded it as a marketing tool. AMD's MI250X has no render output units (ROPs) whatsoever, and even the GH100 offers only Direct3D performance on par with a GeForce GTX 1050, rendering the "G" in GPU all but irrelevant.
So, what can we call them?
"GPGPU" is not ideal, as it is a clumsy phrase referring to the use of GPUs in general computing, not the device itself. "HPCU" (High-Performance Computing Unit) is not much better. But perhaps it doesn't matter.
After all, the term "CPU" is very broad, covering a variety of different processors and uses.
What will GPUs conquer next?
Nvidia, AMD, Apple, Intel, and dozens of other companies have invested billions of dollars in GPU research and development, and today's graphics processors will not be replaced by any radically different product anytime soon.
For rendering, the latest APIs and the software packages that use them (such as game engines and CAD applications) are generally agnostic to the hardware running the code, so theoretically, they can adapt to something entirely new.
However, relatively little of a GPU is dedicated purely to graphics: the triangle setup engine and the ROPs are the most obvious parts, and the ray tracing units in more recent designs are highly specialized as well. Everything else is essentially a massively parallel SIMD chip, backed by a powerful and intricate memory/cache system.

The fundamental design is about as good as it is ever going to be, and any future improvements are tied to advances in semiconductor manufacturing. In other words, these chips can only improve by packing in more logic units, running them at higher clock speeds, or a combination of both.
Of course, they can incorporate new features to enable them to play a role in a wider range of scenarios. In the history of GPUs, this has happened many times, but the transition to a unified shader architecture is particularly important. While it is best to have dedicated hardware to handle tensor or ray tracing calculations, the core of modern GPUs is capable of managing all of this, albeit at a slower speed.
This is why products like AMD MI250 and Nvidia GH100 are very similar to their desktop counterparts, and future designs for HPC and AI are likely to follow this trend. So, if there are no major changes in the chips themselves, how will their applications change?
Given that anything related to AI is essentially a branch of computation, GPUs will likely be used as long as there is a need to perform a large number of SIMD calculations. Although there are not many fields in science and engineering that have not yet used such processors, we may see a surge in the use of GPU-derived products.
Currently, people can purchase smartphones equipped with microchips whose sole function is to accelerate tensor calculations. As the capabilities and popularity of tools like ChatGPT continue to grow, we will see more devices equipped with such hardware.
The humble GPU has evolved from a device that simply runs games faster than a CPU to a general-purpose accelerator, powering workstations, servers, and supercomputers worldwide.
Millions of people around the world use it every day - not only in our computers, phones, TVs, and streaming devices, but also when we use services that include voice and image recognition or provide music and video recommendations.
The true next step for GPUs may be an unknown territory, but one thing is certain: the Graphics Processing Unit will continue to be a major tool for computing and artificial intelligence for decades to come.