We have been hearing this question frequently of late, and thought we would provide some basic background here, especially for those new to semis.
For decades, the workhorse chip of computers has been the central processing unit (CPU), which is found in every PC, laptop and server in the cloud. CPUs can be thought of as general-purpose compute – a chip that can run any kind of software. In a PC, the CPU has to run the operating system (OS), manage files, display images on the screen, handle low-level things like the keyboard and USB ports, and most critically power all the things people use their computers for – games, web browsing, spreadsheets, etc.
CPUs make sense for this use because they are flexible, but in semiconductors there are always trade-offs. Chips are exercises in applied geometry: there is only so much space on a chip, and every function the chip supports has to claim some of that space. Flexibility is the CPU's strength, but if a chip is only ever going to be used for a single purpose, there is a better way to design it. CPUs have to be able to run everything we listed above, which includes both small math problems and very large ones.
This started to become an issue in the 1980s with the advent of graphical user interfaces (GUIs). People soon realized that powering a GUI consumed the majority of a CPU's capacity, slowing down everything else. So companies began using graphics processing units (GPUs), chips designed to largely do one thing – graphics. Graphics workloads could now be handed off to the GPU, freeing up the CPU to do everything else. This worked well because it allowed for the use of less expensive CPUs that did not need to run graphics. And because GPUs were more efficient at their task, systems could be designed to be either more power efficient (laptops) or more powerful (e.g. for graphic design or gaming).
The key difference between the two chips was parallelization. CPUs had to be capable of scaling up to do big, complicated math, while GPUs just had to determine the color of every pixel on a screen. The math behind graphical rendering is actually fairly simple. The constraint was that this simple calculation had to be run many times over (dozens of times a second for every pixel on the screen). So: simple math, done many times in parallel.
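To make that concrete, here is a toy sketch of per-pixel work in Python. The blending formula and the frame size are illustrative assumptions, not real rendering code – the point is only that the math per pixel is trivial and the volume is what matters.

```python
import numpy as np

# Toy illustration: "shade" every pixel of a 1920x1080 frame by blending
# two color layers. The per-pixel math is one multiply-and-add; the work
# is that it repeats roughly 2 million times per frame, many times a second.
HEIGHT, WIDTH = 1080, 1920

layer_a = np.random.rand(HEIGHT, WIDTH, 3)   # RGB values in [0, 1]
layer_b = np.random.rand(HEIGHT, WIDTH, 3)
alpha = 0.6                                  # blend factor (arbitrary)

# The same simple operation applied to the whole frame at once -- the
# shape of work a GPU spreads across thousands of small cores in parallel.
frame = alpha * layer_a + (1.0 - alpha) * layer_b
print(frame.shape)
```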
Of course this transition had its fair share of friction. A big issue at the time was that the CPU and GPU did not always play well together. We personally once bricked a brand new PC because of an incompatibility bug with the GPU card. Getting these issues sorted out was a headache often left to the companies that assembled the PCs but lacked much in the way of semiconductor experience. Microsoft, the OS vendor, spent a lot of effort smoothing this over, but one of the chip companies also played an important role in solving the problem. Tiny Nvidia, a start-up back then, got tired of supporting the constant stream of problems with its graphics cards and created a layer of software that sat between the GPU and the OS to make writing device drivers simpler. The idea was to unify all the different interfaces of its GPUs, so the company called it the Compute Unified Device Architecture, or CUDA.
Over time, people started to find other computing needs that were parallelizable – simple math problems that needed to be done quickly and in large volumes. Google had already opened the door to this approach with the way it structured its search index. Instead of throwing the index for every website at a few giant CPUs, it broke the task up into smaller problems distributed across a large number of weaker computers. Google kept using CPUs for this, but soon another highly parallel workload emerged – Bitcoin. Bitcoin mining is a race between computers endlessly repeating some fairly simple math, which lent itself very well to GPUs instead of CPUs. Same pattern – a small math problem done many times.
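A heavily simplified sketch of that mining pattern, again in Python. Real Bitcoin mining uses double SHA-256 over actual block headers and a much harder difficulty target; the nonce loop and the two-zero-byte target below are just illustrative assumptions to show "cheap calculation, enormous number of attempts."

```python
import hashlib

# Toy proof-of-work loop: hash the same block data with different nonces
# until the digest starts with a short run of zero bytes. Each attempt is
# cheap; winning the race means making a huge number of attempts.
def mine(block_data: bytes, difficulty_prefix: bytes = b"\x00\x00") -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "little")).digest()
        if digest.startswith(difficulty_prefix):
            return nonce  # found a nonce that meets the (toy) difficulty target
        nonce += 1

print(mine(b"example block header"))
```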
Which brings us to AI. As much as the latest AI tools like GPT and DALL-E seem like magic, at their heart they all rely on a fairly simple set of mathematical operations. Specifically, they are all built on linear algebra and matrix multiplication. Matrix math takes arrays of data (numbers organized into rows and columns) and performs calculations across the whole array. Each individual calculation is fairly simple, say multiplying two numbers; the hard part is keeping track of where each result goes. (And we say this having just barely passed college linear algebra, an experience which led us to stop taking college math.) AI requires matrix operations at vast scale – today's models have billions or even hundreds of billions of parameters, which translates into enormous matrices multiplied over and over. But once again the pattern emerges – simple calculations done in large volume – a perfect task for GPUs.
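Here is what that looks like stripped down to its bones: a bare-bones matrix multiply, written as plain loops so the bookkeeping is visible. The sizes are arbitrary toy values; real models multiply matrices many orders of magnitude larger, which is exactly why hardware that parallelizes this well matters.

```python
import numpy as np

# Naive matrix multiply: each output entry is just a row of multiplications
# summed up. The arithmetic is trivial; the bookkeeping (which row times
# which column lands in which slot) is the whole job -- and every output
# cell is independent, so the work parallelizes perfectly.
def naive_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    rows, inner = A.shape
    inner_b, cols = B.shape
    assert inner == inner_b, "inner dimensions must match"
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
# Same answer as the optimized library routine, just vastly slower.
assert np.allclose(naive_matmul(A, B), A @ B)
```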
Today, instead of incompatible graphics cards, the AI industry has to contend with hundreds of different AI software frameworks and libraries. And again, the easiest way to smooth out these differences is through the use of Nvidia's CUDA. CUDA is critical here because the scale at which AI models are built means that even small differences in performance can have massive impacts on the cost of building and running them. CUDA was built to give access to fine-grained, low-level optimizations of chip performance.
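For a sense of how that layering feels in practice, here is a short sketch using PyTorch, one of the many frameworks that sits on top of CUDA (this assumes a PyTorch build with CUDA support and an Nvidia GPU; on a machine without one it simply falls back to the CPU).

```python
import torch

# The Python code is the same either way; when a GPU is present, the
# matrix multiply is dispatched to CUDA-backed kernels under the hood.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b   # runs on the GPU via CUDA when available

print(c.shape, c.device)
```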
Of course, to many, GPUs themselves are over-engineered for the AI task at hand, just as CPUs were overkill in the early days of GUIs. There is always some way to design a chip that is better suited to a specific task. That said, like some endangered species that has evolved to fit its exact place in an ecosystem, those chips risk being over-optimized and unable to adapt to a changing environment, or a changing set of software requirements. Which is why GPUs, especially Nvidia GPUs, remain the preferred chip for AI for the foreseeable future.