Ten Things We Hate About Nvidia

To be clear, we do not hate Nvidia. Quite the opposite: we have been increasingly convinced of the strength of their position in the data center since their March 2022 Analyst Day (i.e. roughly eight months before ChatGPT lit the AI world on fire). Our title comes from the 1999 rom-com, a modern adaptation of Shakespeare. And while we would not say we love Nvidia either, we do have a high degree of conviction that they are going to be the leader in data center silicon for the foreseeable future.

That being said, we strive for intellectual honesty and that means the higher the conviction we have in a thesis, the more we need to test it out. Poke holes. Look for ways in which we could be wrong. So here, we want to walk through all the ways in which Nvidia might be vulnerable.

Looking at this systematically is challenging, as Nvidia positions itself as providing complete solutions, which means all the pieces are tied together. But we will divide our analysis into several buckets and walk through each one.

The first area is hardware. Nvidia has been executing incredibly well for several years and they have a clear performance advantage. They generally have a lock on the market for AI Training. The market for Cloud Inference is going to be larger, and there economics will matter more than raw performance. That is going to mean a lot more competition. The closest peer here is AMD with their MI300 series, and while this is highly performant, it still lacks many features. It is probably good enough to carve out a toehold for AMD, but seems unlikely to make any major dents in Nvidia’s market share for the time being. Intel’s recently launched Gaudi 3 accelerator is enough to show that Intel is still in the game, but it is neither programmable nor fully featured, which limits the portion of the market where Intel could challenge Nvidia. So in terms of major chip companies, Nvidia looks well positioned.

There are a number of start-ups going after this market. The most advanced is probably Groq, which has put out some fairly impressive inference benchmarks, but our sense is that their solution is only suitable for a subset of AI inference. Again, enough for Groq to stay in the fight, but not enough to challenge Nvidia across large portions of the market.

The most serious competition comes from the hyperscalers’ internal silicon solutions. Here, Google is by far the most advanced. They recently disclosed that they used their TPUs to train their Gemini Large Language Model (LLM), the only major crack in Nvidia’s training dominance. But Google is a special case: they control their software stack, allowing them to tailor TPUs very precisely. The other hyperscalers are further behind. After years of testing out everything on the market, Meta has finally launched its own accelerator, following Microsoft, which launched theirs last year. Both of these look interesting, but both are also first attempts, and it will take a few generations for these solutions to prove themselves out. All of which reinforces our view that both will continue to rely heavily on Nvidia and AMD for a few years more. For their part, AWS is on the second generation of their inference and training chips, but these are also fairly far behind the curve, and Amazon now seems to be scrambling to buy as much Nvidia output as they can. Their needs stem from the reality that they do not control their software stack; they run their customers’ software, and those customers all have a strong preference for Nvidia.

Another important element in all this is networking hardware. The links between all the servers in a data center are a major constraint on AI models. Nvidia has a major advantage in its networking stack. Much of this comes from their acquisition of Mellanox in 2019, and their low-latency InfiniBand solution. This deal is likely to go down as one of the best M&A deals in recent history. However, this advantage is a double-edged sword. Nvidia’s sales of complete systems today are an important part of their revenue growth, and in many use cases those systems’ advantages rest largely on the networking element. Recall that Nvidia admitted networking is the source of their advantage in the inference market. For the moment, InfiniBand remains critical to AI deployments, but the industry is pouring an immense amount of effort into the low-latency version of Ethernet, Ultra Ethernet. Should Ultra Ethernet deliver on its promise, which is by no means guaranteed, that would put pressure on Nvidia in some important areas.

In short, Nvidia faces a large quantity of competitors, but remains comfortably ahead in quality.

A big reason for this lead remains its software stack. There are really two sides to Nvidia’s software – its compatibility layer “CUDA”, and the growing array of software services and models it provides to customers.

CUDA is the best known of these, and is often pointed to as the basis for the company’s lead in AI compute. We think this oversimplifies the situation. CUDA is really shorthand for a whole host of features in Nvidia chips which make them programmable down to a very low level. From everything we can see, this moat remains incredibly solid. The industry is (finally) becoming aware of the power this software layer provides and there are many initiatives to provide alternatives. These range from AMD’s ROCm to “open” alternatives like XLA and UXL. If you want a deep dive into this, Austin Lyon wrote a great primer on ChipStrat, which is definitely worth a read. But the quick summary is that none of those have gained much traction yet and the sheer array of alternatives risks diluting everyone’s efforts. (As usual, XKCD said it best.) The heuristic we have been using to evaluate these alternatives is to ask the proponents of each how many chips support the standard currently. Whenever the answer progresses past an awkward silence we will revisit this position. The biggest threat to Nvidia on this front comes from their own customers. The major hyperscalers are all searching out ways to move away from CUDA, but they are probably the only ones capable of doing so.

Beyond CUDA, Nvidia is also building up a whole suite of other software. This includes pre-trained models for a few dozen end markets, composable service APIs (NIMs), and a whole host of others. These make it much easier to train and deploy AI models, so long as those models run on Nvidia silicon. It is still early days for AI, and should Nvidia gain widespread adoption of these, they will effectively lock in a generation of developers whose entire software stacks rest on top of these services. So far, it is unclear how widespread that adoption is. We know that in some markets, such as pharma and biotech, there is considerable enthusiasm for Nvidia tools, while other markets are still in early evaluation phases.

It is unclear if Nvidia ever plans to charge for these and grow an actual software business, and this leads to a larger fundamental question about Nvidia’s competitiveness. As they grow more successful and more prominent, the industry’s discomfort level grows. Already the data center supply chain is full of grumbling about Nvidia’s pricing, its allocation of scarce parts and its long lead times. Nvidia’s largest customers, the hyperscalers, are highly wary of becoming too reliant on the company, especially as Nvidia seems to waver on the edge of launching its own infrastructure as a service (IaaS) offering. How far will these customers go? How much of Nvidia’s stack will they buy into? There has to be a limit, but companies can often short-circuit long-term strategic thinking for short-term discounts and supply opportunities. The hyperscalers would not have found themselves in this position if they had not abandoned almost every start-up that tried to sell them an alternative over the past decade. So Nvidia definitely faces risks on this front, but for the moment those risks are largely unformed.

More broadly, we think there are challenges to Nvidia’s overall business model. The company has always sold complete systems, from graphics cards 30 years ago, to mammoth DGX server racks today. As much as the company says it is willing to sell decomposed components, they would clearly prefer to sell complete systems. And this poses a number of problems. As the company’s history demonstrates, when inventory cycles turn down Nvidia stands at the end of the bullwhip, which wreaks havoc on their financials. Given the scope of their recent growth and the ever larger systems they sell, the risk of a major reset is much larger. To be clear, we are not forecasting this to happen any time soon, but it is worth considering the magnitude of the problem.

Which leads us to AI factories. There are now roughly a dozen data center operators, independent of the public cloud IaaS hyperscalers, running warehouses full of Nvidia GPUs. Nvidia has invested in many of these and they are likely a major source of revenue given that their value proposition rests largely on their ability to offer GPU instances on demand. This is Nvidia’s channel, and is likely to be a source of problems somewhere down the road. In fairness, there is a remote possibility that AI presents such a seismic shift in compute that AI factories become the dominant IaaS providers, but there are a few trillion-dollar companies that would fight a scorched earth war to prevent that from happening.

Finally, the ultimate risk hanging over Nvidia is the growth of neural-network-based machine learning, aka AI. So far, the gains from AI are fairly narrow in scope: code generation, digital marketing and a host of small, under-the-hood software performance gains. If you wanted to construct a bear case for Nvidia, it should explore the possibility that AI goes no further. We think that is unlikely, and our sense is that AI can still advance much further, but should it fail to, Nvidia would be left highly exposed. By the same token (pun intended), AI software is changing so rapidly it is possible that some future genius developer comes up with a superior AI model that shifts compute in a direction where GPUs and Nvidia’s investment matter less. This seems unlikely, but there is still the risk that AI software either stagnates here or advances to the point that it deflates the need for so many massive GPU clusters around the world. This should not be seen as a catastrophe for Nvidia, but would spark a significant slowdown.

To sum all of this up, Nvidia is in a very strong position, but it is not unassailable. In our view, its biggest threat comes from being so successful that it forces its customers to respond. There are multiple futures for Nvidia and the data center, ranging from Nvidia ending up as just one of many competitors in the data center, to Nvidia becoming master of the universe. There are enough vulnerabilities in its model to make the latter unlikely, but it has so much momentum that the former is no more likely.

Digits to Dollars

Deep Tech, Semis and More
