Ten Lessons from Google’s TPU

Google kicked off the current trend for Internet companies to Roll Their Own chips in 2016 with their TPU for AI workloads. Since then, many other companies have followed suit, and six years on Google is still going strong, expanding its portfolio of homegrown semis. And like an aging rock star writing their memoirs, Google recently published a review of the progress of the TPU, which is now running in its fourth generation (with a fifth already in the works).

In true search-engine-optimization, clickbait style, the paper includes a Top Ten Lessons Learned section. And as dense and academic as this paper is, the list is fairly readable and provides some valuable insight into how to think about chips. Below, we paraphrase these (badly) to tease out some hints on where things go next.

Lesson #1 – The different elements of a chip advance at different paces. Pry open a chip and it is clear that there is a lot going on inside. While modern processors all have multiple cores that can be seen with the naked eye, there are all kinds of other elements – interconnects, data buses, memory, etc. – that all need to be designed onto the floor plan of the chip. Google’s point is that these pieces all move on different timelines, depending on the roadmaps of IP vendors, academic research and manufacturing processes. This means the bottlenecks move with each generation of the chip; today’s advanced memory is tomorrow’s laggard. Chip designers need to optimize constantly for this and include contingency plans for when some part of the mix does not meet its expected deadline. We think this partially explains why so many of the first generation of AI chip start-ups foundered: they solved one problem only to see the real problem emerge in a different area one generation later.

Lesson #2 – Compilers matter, but take time. This really speaks to the software problem, a topic that is particularly acute in AI workloads. The software that runs on a chip depends heavily on the way its code is translated into the low-level instructions the chip understands. As we have pointed out repeatedly, modern compilers are very sophisticated. Developers can take pretty much any code base and re-compile it for another chip – x86 to Arm, for instance – but this only gets them roughly 80% of the achievable performance; the code and the compiler then need to be optimized to extract the full benefit of the new chip. Google points out that one of the key problems is that new compilers only reach peak performance after the hardware is complete, creating a chicken-and-egg problem: no one wants to port their software to a new chip until they can get optimal compiler performance, but the compiler cannot be optimized until many people have ported their software. This is a big reason why Arm struggled in the data center for so long and is only now starting to turn the corner towards broad adoption – and why the hyperscalers have had an easier time of it, since they control their own software.
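To make this concrete, here is a minimal sketch using JAX, whose XLA compiler is the same one that targets the TPU. The point is not this particular code but the division of labor it illustrates: the high-level source stays the same, and the compiler decides how well it runs on any given chip.

```python
# Minimal sketch: the same high-level code is lowered by a compiler (XLA here)
# into device-specific instructions. How close it gets to peak performance
# depends on how well the compiler has been tuned for the target chip.
import jax
import jax.numpy as jnp

def layer(x, w):
    return jnp.maximum(jnp.dot(x, w), 0.0)  # matmul followed by ReLU

x = jnp.ones((128, 512))
w = jnp.ones((512, 256))

# jit() hands the function to XLA, which emits code for whatever backend is
# available (CPU, GPU, or TPU). The source never changes; the compiler does.
fast_layer = jax.jit(layer)
print(fast_layer(x, w).shape)  # (128, 256)
```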

Lesson #3 – Design for Total Cost of Ownership (TCO). If we had to take only one lesson away from this paper, this would be it. All too often, when people compare chips they look only at raw performance – so-called speeds and feeds. One chip runs faster or costs less than another, so it must be better. In reality, every chip operates as just one part of a larger system, even if the chip itself is the key strategic real estate in that system. In servers, for instance, everyone wants to optimize for cost, but the chip itself is often only 20% of the cost of the server – the memory, the casing, the power supply and the fans all add up. The large data center builders and their vendors all have sophisticated TCO models that account for every cost in the system, from the chip itself to electricity, cooling and the rest. (As a side note, if you love Excel as much as we do, these models are a beauty to behold.) These calculations are supremely important and proprietary – so much so that in this paper Google does not put units on the axes of the charts showing their TCO improvements. TCO is also incredibly important when making comparisons from the outside and trying to understand why some chips are gaining or losing share. So chips need to be designed with TCO in mind, a skill that is often beyond the ready understanding of the average chip designer.
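For a flavor of what these models do, here is a toy sketch with entirely made-up illustrative numbers; the real hyperscaler models are far more detailed (and closely guarded). Even this crude version shows why the chip's sticker price is only one line item among many.

```python
# Toy TCO sketch. All numbers are hypothetical, for illustration only.
CHIP_COST = 2_000        # accelerator, roughly 20% of the server bill here
OTHER_HW_COST = 8_000    # memory, chassis, power supplies, fans, NIC, etc.
POWER_KW = 0.45          # average draw of the whole server
POWER_PRICE = 0.07       # $ per kWh, hypothetical data-center rate
PUE = 1.1                # overhead multiplier for cooling and distribution
YEARS = 4                # depreciation window

capex = CHIP_COST + OTHER_HW_COST
opex = POWER_KW * 24 * 365 * YEARS * POWER_PRICE * PUE
tco = capex + opex

print(f"4-year TCO: ${tco:,.0f} (chip is {CHIP_COST / tco:.0%} of the total)")
```

Note how quickly the electricity and cooling terms grow relative to the chip itself; a "cheaper" chip that burns more power can easily lose on TCO.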

Lesson #4 – Backwards compatibility. A topic that is the bane of the entire industry’s existence, but one that is especially important in AI today. The software that AI developers use is changing rapidly, with new libraries, compilers and tools coming online constantly. This means that no one can operate entirely at the leading edge; there will always be customers (internal or external) using older software packages, and the semis need to be able to accommodate them. This is a pressing issue in AI today, where every neural net algorithm is slightly different, often in ways that can meaningfully detract from chip performance. This is one reason Nvidia GPUs still dominate AI workloads: too many of the standalone AI accelerator chips over-optimize for one algorithm or another. Better the devil you know.

Lesson #5 – These chips get hot. AI chips are optimized to do one thing (matrix multiplication) and to do it in huge volumes. As a result, they often run fully utilized and need serious cooling. Good to keep in mind alongside Lesson #3 on TCO above.
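A back-of-envelope sketch shows why "fully utilized" translates into heat. The numbers below are illustrative, not the specs of any particular chip; the point is that a matmul engine kept busy sits at peak power, and therefore peak heat, for the entire run.

```python
# Why a fully utilized matmul engine runs hot. Numbers are hypothetical.
M, K, N = 8192, 8192, 8192
flops = 2 * M * K * N        # each multiply-accumulate counts as 2 ops
peak_flops = 100e12          # hypothetical accelerator peak, ops per second
watts = 300                  # hypothetical board power at full utilization

seconds = flops / peak_flops
joules = watts * seconds
print(f"{flops / 1e12:.1f} TFLOPs per matmul, {joules:.2f} J each")
# Training chains millions of these back to back, so the chip dissipates
# ~300 W continuously for hours or days - a cooling bill and a TCO line item.
```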

Lesson #6 – Only some workloads need floating point. Here we start to get a bit arcane. Put simply, small changes in requirements can have an outsized impact on chip design. Depending on how precise the AI calculations need to be, some AI chips need double the amount of on-chip memory. This is expensive both in terms of the size of the chip and in all the TCO implications that flow from it. Takeaway: design carefully for the workload at hand.
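The "double the memory" point is easy to see with a quick sketch: the same weights at half the precision take half the bytes. (NumPy has no native bfloat16, so standard fp16 stands in here for the reduced-precision formats accelerators actually use; the model size is hypothetical.)

```python
# Minimal sketch: the same parameters at different numeric precisions.
# Halving precision halves the memory the weights occupy, which matters
# enormously when that memory is expensive on-chip SRAM.
import numpy as np

params = 1_000_000                            # hypothetical model size
fp32 = np.zeros(params, dtype=np.float32)     # full precision
fp16 = np.zeros(params, dtype=np.float16)     # stand-in for bf16/int8 formats

print(f"fp32: {fp32.nbytes / 1e6:.1f} MB")    # 4.0 MB
print(f"fp16: {fp16.nbytes / 1e6:.1f} MB")    # 2.0 MB
```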

The last four lessons all deal with the way that neural net algorithms interact with the chips they run on. We are not going to get into all of this, but we strongly encourage anyone designing an AI chip to read these carefully. They offer hints as to where Google is taking the TPU and some specific technical problems that need to be solved to get there.

At a high level, the most important thing about the TPU is that it confers a strategic advantage on Google. By adopting these chips, Google was able to shift much of its core software to “AI”, a shift the company has described as critical. More than half of these lessons touch on software. As much as we think of chips as the ultimate piece of hardware, they are at heart software made corporeal, tangible. And this speaks to the way that many chips need to be built today.

Image from Google via Venture Beat
