How to Build a Server in 100 Easy Steps

A couple of people recently asked us about the logic behind AMD’s proposed acquisition of ZT Systems. Yesterday we wrote about the growing complexities of installing AI clusters. Since these two topics are actually closely related, we thought we would use ZT as a lens through which to view the broader problem in the industry.

Let’s say Acme Semiconductor wants to enter the data center market. They spend a few hundred million dollars to design a processor. Then they go to sell it to their hyperscaler customer, but the hyperscaler does not want a chip, they want a working system to test out their software. So Acme goes to an ODM and pays them a few hundred thousand dollars to design a working server, complete with storage, power, cooling, networking and all the rest. Acme builds a few dozen of these and hands them out to their top sales prospects. Acme is now out $1 million or so, and they notice that their chip is only 20% of the cost of the system.

The hyperscalers then spend a few months testing out the system. One of them likes Acme’s performance enough that they want to put it through a more rigorous test, and to really judge performance they do not want a standard server, they want one designed specifically for the way they operate their data centers. This means a new server design, with a totally different configuration of storage, networking, cooling and all the rest. The hyperscaler also wants Acme to build these test systems with their preferred ODM. Acme really wants to close a deal, so they foot the cost of this design too; at least the hyperscaler will pay for the test systems – Acme now has its first revenue, maybe $100,000. While the first hyperscaler is running their multi-month evaluation, a second customer expresses interest. Of course, they want their own server configuration with their own preferred ODM. Acme needs the business, so they foot the cost of this design too. Acme approaches all the OEMs to see if any of them will design a catalog system to facilitate this process. The OEMs are all very friendly and very interested in what Acme is doing. Great job guys. And they’ll be happy to do a design once Acme gets some business.

Finally, a customer wants to buy in volume. A big win for Acme – a system that will go into production at the customer. Of course, the customer wants a new server design, but this time, because there is real volume, the ODM agrees to do the design. The trouble is the new server will use the hyperscaler’s internally designed networking and security chips. These were kept secret, so Acme has never seen them, and they know very little about the new server, which was designed directly between the customer and the ODM. The ODM builds a bunch of servers, then wires them up inside the hyperscaler’s data center, flips the power switch on, and things immediately start to break.

This is expected; there are bugs everywhere. But very quickly everyone starts to blame Acme for the problems. Setting aside the fact that Acme was largely kept out of the design process, their chip is what is least familiar to the ODM and the customer. Acme had worked with the customer to iron out bugs during the evaluation cycle, but this is different. Much of the system is new and the stakes are much higher, so everyone is operating at higher stress levels. Acme sends its field engineers to the super-remote data center to put their hands on the system. The three teams work through each bug, finding more along the way. At some point it turns out that Acme’s processor enters an obscure error mode when interacting with the hyperscaler’s security chip, the networking components are very fragile and perform far under spec, and of course every chip is running different firmware that is incompatible with all the others. Add to all of this liquid cooling, which no one on any of the debugging teams has ever worked with before and which probably generates 50% of the problems. The deployment drags on as the teams work through it all. Eventually, something significant has to be entirely replaced, which adds delay and cost. But finally, after months of work, the system enters production. Then Acme’s second customer decides they want to do a deeper evaluation too, and the process starts all over.

And if that does not sound painful enough, remember that we have not even touched on the lawyers. Just to get started with the project, Acme had to spend nine months negotiating ludicrously onerous terms with the hyperscaler from a very weak position. When they started designing the custom server, the three companies (Acme, the ODM and the customer) probably spent six weeks just negotiating the NDA.

This is how servers got built for years. Then Nvidia entered the market, and they brought their own server designs. Not only that, they brought designs for entire racks. They have been designing systems for 25 years, going back to their graphics card days. They also build their own data centers. So they have an in-house team that is experienced in handling all of the above issues.

To compete with Nvidia, AMD can spend five years replicating Nvidia’s team. Or they can buy ZT. In theory, ZT can let AMD eliminate almost all of the friction outlined above. It is too soon to tell how well this will work out in practice, but AMD has gotten pretty good at merger integration. And personally, we would happily pay $5 billion to never have to negotiate a three-way NDA and Master Service Agreement ever again.
