Nvidia’s Ian Buck Shares His Vision for the GPU Data Center

Ian Buck has spent most of his life thinking about chips. But now the head of accelerated computing at Nvidia, the world’s largest chip company, is thinking bigger. “You can’t buy Blackwell as a chip,” Buck, also vice president of the company’s data center and HPC business, tells DCD, referring to the next generation of its GPU line. “There’s a good reason for that: It wants to be integrated with the CPU. It wants to be integrated with NVLink. It wants to be connected.”

“We’re not really thinking about chips.”

Instead of dealing in individual semiconductors, Nvidia has transformed itself into a platform company. It no longer focuses on a single accelerator, but on large integrated systems.

“That was the decision we made in the Pascal generation [in 2016], because AI wanted to be on multiple GPUs,” Buck says. “The P100 era changed what we build and what we bring to market. Now it’s systems.”

That’s started to change the makeup of data centers, Buck says. “The opportunity for transformative computing began with supercomputing, but with the advent of artificial intelligence it has expanded.

“Every data center is becoming an AI factory. It’s not measured in flops or megawatts, it’s measured in tokens per second and how many terabytes of data you’re converting into productivity gains for your business.”
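
As a back-of-the-envelope illustration of that “tokens per second” framing, the sketch below converts per-GPU serving throughput into a facility-level figure. Every number in it is a hypothetical placeholder chosen for round arithmetic, not an Nvidia figure.

```python
# Toy "AI factory" metric: tokens per second per megawatt.
# All figures are hypothetical placeholders for illustration,
# not measured Nvidia numbers.

TOKENS_PER_SEC_PER_GPU = 500   # assumed serving throughput per GPU
WATTS_PER_GPU = 1_000          # assumed all-in power draw per GPU

gpus_per_megawatt = 1_000_000 / WATTS_PER_GPU
tokens_per_sec_per_mw = gpus_per_megawatt * TOKENS_PER_SEC_PER_GPU

print(f"{gpus_per_megawatt:.0f} GPUs per MW")
print(f"{tokens_per_sec_per_mw:,.0f} tokens/s per MW of facility power")
# This is the shift Buck describes: measuring output (tokens/s)
# per unit of input, rather than flops or megawatts of capacity.
```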

This opportunity, bubble or not, has sparked a wave of new data center construction. “But they can’t wait two years for a construction project,” Buck says. “We’ve seen an acceleration in the removal of old infrastructure. They’re just stripping out their CPU infrastructure, adding GPUs and accelerating, so that every data center can be an AI factory.”

He adds: “What you’re going to see is not just one Nvidia GPU, but a combination of platforms and ecosystems, allowing everyone to build the right kind of AI factory for the workload they need. They’re all going to be at different stages of that process, or at different points of optimization.”

Of course, as much as Nvidia tries to move away from focusing on the specific chips within these so-called “AI factories,” their thermal design power (TDP) defines the makeup of much of the rest of the system. “Hopper is 700 watts and we cool it with air,” Buck says.

“The HGX B100 is also 700 watts and is designed to fit right where Hopper was,” he adds. “So when the HGX B100 hits the market, all of our servers, all of our data centers, even the rack power, can stay the same.”
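
The arithmetic behind that drop-in claim is simple. Here is a minimal sketch using the 700W per-GPU figure Buck cites and assuming a standard eight-GPU HGX baseboard; the host overhead factor for CPUs, NICs, and fans is a hypothetical placeholder, since it varies by server design.

```python
# Node-level power budget for an eight-GPU HGX server.
# The 700 W TDP comes from Buck's remarks; the host overhead
# factor is an assumed placeholder, not an Nvidia figure.

GPU_TDP_W = 700       # Hopper and HGX B100 alike, per Buck
GPUS_PER_NODE = 8     # standard HGX baseboard
HOST_OVERHEAD = 1.4   # assumed: CPUs, NICs, fans, power conversion

node_power_kw = GPU_TDP_W * GPUS_PER_NODE * HOST_OVERHEAD / 1000
print(f"Estimated node power: {node_power_kw:.1f} kW")
# Identical for Hopper and B100 nodes, which is why racks, power
# feeds, and air cooling can stay exactly as they are.
```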

The industry can “take the whole ecosystem, upgrade it, and deploy it at scale,” Buck says. Customers, he adds, “get all the benefits of the Blackwell GPU: FP4, the Transformer Engine, and twice the NVLink speed between them. So Blackwell is going to get to market much faster than Hopper, partly for that reason.”
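
FP4 here is 4-bit floating point. As an illustration of how little numeric range four bits leave, the sketch below enumerates every value representable in the OCP E2M1 layout (one sign bit, two exponent bits, one mantissa bit); treating that as Blackwell’s exact FP4 format is an assumption, not a claim from the article.

```python
# Enumerate every value of a 4-bit E2M1 float: 1 sign bit,
# 2 exponent bits, 1 mantissa bit, exponent bias 1. This is the
# OCP microscaling FP4 layout; that Blackwell's FP4 matches it
# exactly is an assumption here.

def e2m1_value(bits: int) -> float:
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = (bits >> 1) & 0b11
    mantissa = bits & 1
    if exp == 0:                              # subnormal: 0 or 0.5
        return sign * mantissa * 0.5
    return sign * (1 + mantissa * 0.5) * 2.0 ** (exp - 1)

print(sorted({e2m1_value(b) for b in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5,
#  2.0, 3.0, 4.0, 6.0] -- just 15 distinct values, half the bits
# of FP8, doubling effective memory and bandwidth per weight.
```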

The company also has a 1000W version of the HGX: “Same silicon, slightly modified servers (they have to be a little taller), and a different air cooling solution. Basically, the best you can do with air cooling.”
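
Why roughly a kilowatt marks the practical ceiling for air comes down to airflow. A rough sketch using the standard sensible-heat relation for air follows; the temperature rise and air density are assumed typical values, not figures from the article.

```python
# Airflow needed to remove a GPU's heat with air, from the
# sensible-heat relation Q = rho * flow * c_p * dT. The density
# and temperature-rise values are assumed typical figures.

RHO_AIR = 1.2     # kg/m^3, air density near sea level
CP_AIR = 1005.0   # J/(kg*K), specific heat capacity of air
DELTA_T = 20.0    # K, assumed inlet-to-outlet temperature rise

def airflow_m3_per_hour(heat_w: float) -> float:
    """Volumetric airflow required to carry away heat_w watts."""
    return heat_w / (RHO_AIR * CP_AIR * DELTA_T) * 3600

for tdp_w in (700, 1000, 1200):
    print(f"{tdp_w:>4} W GPU -> {airflow_m3_per_hour(tdp_w):.0f} m^3/h of air")
# Required airflow scales linearly with TDP; past roughly 1 kW per
# GPU, fan power and acoustics push designs toward liquid cooling.
```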

But after that, things get a little more complicated. “For NVL72, we want to make sure we have the best available,” Buck says, with the rack built around B200 GPUs. “That’s 1200W per GPU, and it becomes the real driver for liquid cooling.

“Four GPUs in 1U? Liquid is key to realizing the benefits of NVL72, which offers 30x faster inference performance.”
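
The rack-level numbers make the liquid cooling case on their own. Here is a quick sketch from the figures in the article (72 GPUs per rack at 1200W each); the overhead allowance for CPUs, NVLink switch trays, and power conversion is an assumption, not an Nvidia number.

```python
# Rack power for NVL72 from the per-GPU figure Buck cites. The
# non-GPU overhead (CPUs, NVLink switches, power conversion) is
# an assumed allowance, not an Nvidia specification.

GPUS_PER_RACK = 72
GPU_TDP_W = 1200
OVERHEAD_W = 30_000   # assumed allowance for everything else

gpu_kw = GPUS_PER_RACK * GPU_TDP_W / 1000
rack_kw = gpu_kw + OVERHEAD_W / 1000
print(f"GPU power per rack:  {gpu_kw:.1f} kW")
print(f"Total rack estimate: {rack_kw:.1f} kW")
# ~86 kW of GPUs alone is several times what a typical air-cooled
# rack (10-20 kW) can reject, hence liquid cooling is mandatory.
```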

However, bigger isn’t always better. “TDP is not the right way to answer that question,” he says. “What’s the workload and what makes the most sense for your setup? If you’re doing 7 billion parameter model inference, or 70 billion, HGX might be ideal, and it might not always require 100 percent power.”
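
A rough weight-memory estimate shows why models in that range sit comfortably on a single HGX node. The sketch below assumes 8-bit weights and H100-class 80GB GPUs; both are illustrative assumptions, and KV cache and activations would add to the totals.

```python
# Rough check: do the model weights fit on one eight-GPU HGX node?
# Bytes-per-parameter and per-GPU memory are assumed illustrative
# values; KV cache and activations need additional headroom.

BYTES_PER_PARAM = 1   # assumed 8-bit (FP8/INT8) weights
GPU_MEM_GB = 80       # assumed H100-class GPU memory
GPUS_PER_NODE = 8

node_mem_gb = GPU_MEM_GB * GPUS_PER_NODE   # 640 GB across the node

for params_billions in (7, 70):
    weights_gb = params_billions * BYTES_PER_PARAM   # ~1 GB per 1B params
    verdict = "fits easily" if weights_gb * 2 < node_mem_gb else "is tight"
    print(f"{params_billions}B model: ~{weights_gb} GB of weights, "
          f"{verdict} in {node_mem_gb} GB")
# Both fit with room to spare, which is Buck's point: for these
# workloads an HGX box, often below full power, is the right tool.
```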

However, the trend is clearly toward larger chips that consume more power and demand more aggressive cooling. Nvidia itself is part of the US Department of Energy’s COOLERCHIPS program, which focuses on radical cooling solutions for increasingly hot semiconductors. Buck declined to comment on how TDPs will evolve, especially as the company shifts to releasing new GPUs every year. “We’re just working as fast as we can,” he says. “Don’t wait. Don’t hold anything back. We’ll build the best we can and move on.”
