When it comes to companies rolling their own custom chips, our core thesis is that doing this just to save a few dollars on chips is breakeven at best. Instead, companies want to build their own chips when doing so conveys some form of strategic advantage. The textbook example is Apple, which ties its chips to its own software to meaningfully differentiate its phones and computers. Or Google, which customizes chips for its most intense workloads like search algorithms and video encoding. A few hundred million dollars in chip design costs are more than paid back in billions of extra sales for Apple or billions in capex and opex savings for Google. It is important to point out that in both those cases the company completely controls what software runs on its homegrown chips.
So what is in it for Amazon?
For Amazon, and more specifically for AWS, software control is out of reach. AWS runs everyone else's software, and so by definition, AWS cannot control it. It has to run almost literally every form of software in the world. Nonetheless, AWS seems to be working very hard to push its customers to run workloads on its Graviton CPUs. AWS has many ways to lock customers in, but silicon is not one of them. At least not yet.
AWS is probably not doing this to save money on the AMD and Intel x86 CPUs they are buying. The fact that they have two vendors alone means they have ample room for pricing leverage. To some degree, Graviton may be a hedge against the day when Intel stops being competitive in x86. (A point we may have already reached.)
That being said, we think there is a bigger reason – power. The chief constraint in data center construction today is electricity. Data centers use a lot of power, and when designing new ones, companies have to work within a power budget. Now imagine they could reduce power consumption by 20%; that means they could fit more equipment into the same electricity footprint, which means more revenue. A reduction in power consumption by one part of the system means a much higher return on the overall investment. Then multiply that gain by 38 as the savings percolate through all of AWS' global data centers.
Now of course the math is a bit more complicated than that. CPUs are only part of a system, so even if Graviton is 20% more power efficient for the same performance versus an x86 chip, that does not really translate into 20% more profit from the data center, but the scale is about right. Switching to an internally designed Arm CPU can generate sufficient increase in data center capacity to more than offset the cost of designing the chip.
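To make that dilution concrete, here is a back-of-the-envelope sketch. All of the numbers are assumptions for illustration (a 20% CPU efficiency gain, and the CPU drawing roughly a third of a server's power, as discussed below), not AWS figures:

```python
# Illustrative math (assumed numbers, not AWS data): how a CPU-level
# power saving translates into extra capacity under a fixed power budget.

cpu_share_of_server_power = 1 / 3   # assumption: CPU is ~1/3 of server draw
cpu_power_saving = 0.20             # assumption: Graviton is 20% more efficient

# The server-level saving is diluted by the non-CPU components
# (disks, fans, the rest of the board).
server_power_saving = cpu_share_of_server_power * cpu_power_saving

# With a fixed power budget, lower per-server draw means more servers
# fit in the same electricity footprint.
extra_capacity = 1 / (1 - server_power_saving) - 1

print(f"Server-level power saving: {server_power_saving:.1%}")   # ~6.7%
print(f"Extra capacity in the same power envelope: {extra_capacity:.1%}")  # ~7.1%
```

Even a mid-single-digit capacity gain, multiplied across every data center, can swamp the one-time cost of designing a chip.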
Taking this a step further, one big obstacle preventing more companies from moving to Arm workloads is the cost of optimizing their software for a new instruction set. We have touched on this topic in past posts; porting software can be labor intensive. AWS has a big incentive to get its customers to switch, and seems to be doing what it can to make the process easier. However, we have to wonder if this is something of a one-way street. As we said above, today AWS cannot use x86 silicon to lock customers into its service, but once customers switch to Graviton, all that optimization friction shifts to work in AWS' favor, creating a new form of lock-in. Admittedly, the barrier today exists between Arm and x86, not among the various versions of Arm servers. But one of the beauties of working with Arm is the ability to semi-customize a chip, and so it is entirely possible that AWS may introduce proprietary-ish features in future versions of Graviton.
We think Amazon has many other good reasons to encourage the move to their Arm-based Graviton CPU, but we have to wonder if this lock-in is not lingering somewhere in the back of their brains. If true, that just gives the other hyperscalers more reasons to shift to Arm servers as well.
Good color and context for why AWS is doing this in the ever-brilliant James Hamilton presentation from a few weeks ago. The whole "Silicon Innovation Day" is worth a watch (you can do so on Twitch for the full day's worth).
Agree with you. Would add:
– 20% cheaper is 20% cheaper. If an airline launched planes that enabled a 20% price discount, it would be huge news. But somehow the simple price difference gets lost in the discussion around cloud.
– 30%-40% price performance gains. This means AWS is probably making a very fat margin here. Of course they need to fund R&D spend, but that is a fixed cost on increasing revenues (Graviton is up to $5B annual run rate according to The Information)
– tight integration with all the other hardware – Nitro Chips, the SSD, etc. It is clear they get 1 + 1 = 3 benefits from putting multiple chips together in a solution.
I waffled on that 20% because what we are really talking about is 20% better power from the CPU, but the CPU is only a portion of a server's power consumption, roughly a third. You have to factor in disk, fans, and the rest of the board.
AWS is already making really good profits. I noodled on these numbers back in 2013.
But you should read Martin Cassado, who has been writing a lot on the subject of cloud vs. on-prem costs lately and has done some great analysis on the excess costs (and thus AWS profits).
Thanks for the video suggestion, will watch
I’ve seen a lot of great commentary from Digits to Dollars on the “cambrian explosion” of domain-specific chips, e.g. AI accelerators, video encoders, SQL, etc. But I feel like I haven’t seen as much about the compiler technology needed to run all these types of chips. While there is much focus on the fundamental chip manufacturing and software-based chip design, I do think compiler technology is at least as important, as the compiler makes many of the optimizations needed to run software on top of custom chips (particularly for AI accelerators like TPU). Have you all seen any analysis of the supply and demand of talent for this technology, or any metrics measuring the growth of novel compilers for domain-specific chips?
Totally agree – compilers are super important. And I have linked to a few compiler stories in the newsletter. I touch on the subject of “software moats” in the data center frequently – and a lot of that story is about compilers. TBH, it’s a bit beyond our area of expertise. Probably something we should explore a bit in the future.