Lex Fridman on visiting xAI cluster with Elon M...
I should say that I got a chance to visit the Memphis data center, and it's kind of incredible. I visited with you, and just the teams and the rate of innovation there are insane, because my sense is that nobody's ever done anything of this scale, and certainly nobody has ever done anything of this scale at the rate that xAI is doing. So they're figuring it out. I was sitting in on all these meetings where they're brainstorming. It's insane. It's exciting, because they're trying to figure out what the bottlenecks are and how to remove the bottlenecks. There are just so many really cool things about putting together a data center, because everything has to work. The people that do the sysadmin stuff, the machine learning, all of that is the exciting part. But really, the people that run everything are the folks that know the low-level software and hardware that runs everything, the networking, all of that. And so you have to make sure you have procedures that test everything. I think they're using Ethernet. I don't know how they're doing the networking, but they're using NVIDIA Spectrum-X Ethernet.

The unsung heroes are the cooling and electrical systems, which just get glossed over. But one story that maybe exemplifies how insane this stuff is: when you're training, in the most simplistic terms, you're running through the model a bunch, and then you're going to exchange everything and synchronize the weights. So you'll do a step. This is a step in model training, and every step your loss goes down, hopefully, though it doesn't always. But in the simplest terms, you'll be computing a lot and then you'll exchange. The interesting thing is GPU power is most of it.
Networking power is some, but it's a lot less. So while you're computing, the power for your GPUs is up here. But when you're exchanging weights, if you're not able to overlap communications and compute perfectly, there may be a time period where your GPUs are just idle while you're exchanging weights and the model's updating. So you exchange the gradients, you do the model update, and then you start training again. The power goes up and down, and it's super spiky. And funnily enough, when you talk about the scale of data center power, you can blow stuff up so easily. Meta actually accidentally upstreamed something into PyTorch where they added an operator, and I kid you not, whoever made this, I want to hug the guy, because it's called PYTORCH_NO_POWERPLANT_BLOWUP, and you set it equal to zero or equal to one. And what it does is amazing: when you're exchanging the weights, the GPU will just compute fake numbers so the power doesn't spike too much, and then the power plants don't blow up, because the transient spikes screw stuff up.

Well, that makes sense. I mean, you have to do that kind of thing. You have to make sure they're not idle.

Yeah. And Elon's solution was like, let me throw in a bunch of Tesla Megapacks and a few other things. Everyone has different solutions, but Meta's at least was publicly and openly known: just set this operator. And what this operator does is make the GPUs compute dummy numbers so that the power doesn't spike.

But that just tells you how much power you're working with. It's insane.

It's insane. People should just Google scale, like "what does X watts do," and go through all the scales from one watt to a kilowatt to a megawatt. You look and stare at that, and you realize how high on the list a gigawatt is. It's mind-blowing.

Can you say something about the cooling?
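The spiky-power problem described above can be sketched numerically. This is a toy model with illustrative, assumed numbers (cluster size and per-GPU wattages are not measured figures from the conversation): each training step has a compute phase at high draw and a gradient-exchange phase at low draw, and filler compute during the exchange, the idea behind the PYTORCH_NO_POWERPLANT_BLOWUP flag discussed here, flattens the swing.

```python
# Toy model (hypothetical numbers): why synchronous training makes datacenter
# power spiky, and how filler compute during the gradient exchange flattens it.

NUM_GPUS = 100_000   # assumed cluster size, illustrative only
P_COMPUTE_W = 700    # assumed per-GPU draw during the compute phase
P_IDLE_W = 100       # assumed per-GPU draw while waiting on the exchange


def cluster_power_mw(per_gpu_watts: float) -> float:
    """Total cluster draw in megawatts for a given per-GPU draw."""
    return NUM_GPUS * per_gpu_watts / 1e6


def step_power_profile(smooth: bool) -> list[float]:
    """Power draw across one training step: [compute phase, exchange phase].

    With smooth=True, GPUs run throwaway math during the exchange (the
    fake-numbers trick described above), so the draw never drops to idle."""
    compute_phase = cluster_power_mw(P_COMPUTE_W)
    exchange_phase = cluster_power_mw(P_COMPUTE_W if smooth else P_IDLE_W)
    return [compute_phase, exchange_phase]


spiky = step_power_profile(smooth=False)
flat = step_power_profile(smooth=True)
print(f"compute phase: {spiky[0]:.0f} MW, exchange phase: {spiky[1]:.0f} MW")
print(f"transient swing without smoothing: {spiky[0] - spiky[1]:.0f} MW")
print(f"transient swing with filler compute: {flat[0] - flat[1]:.0f} MW")
```

With these assumed figures the cluster swings by tens of megawatts every few seconds, which is the kind of transient the grid, or a bank of Megapacks, has to absorb.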
So I know Elon's using liquid cooling, I believe in all cases. That's a new thing, right? Most of them don't use liquid cooling. Is there something interesting to say about the cooling?

Yeah. So air cooling has been the de facto standard: throw a bunch of metal, heat pipes, et cetera, and fans at it, and that's been enough to cool it. People have been dabbling in water cooling. Google's TPUs are water-cooled, so they've been doing that for a few years. But with GPUs, no one's ever done water cooling at the scale that Elon just did. Now, for NVIDIA's next generation, for the highest-end GPU, water cooling is mandatory. You have to water-cool it. But Elon did it on the current generation, and that required a lot of stuff. If you look at some of the satellite photos of the Memphis facility, there are all these external water chillers sitting outside; each one basically looks like a semi-truck pod thing. What's it called? The container. But really those are water chillers, and he has something like 90 of them just sitting outside, 90 different containers. You chill the water, bring it back to the data center, distribute it to all the chips, pull all the heat out, and then send it back. And this is both a way to cool the chips and an efficiency thing. Going back to that sort of three-vector thing, there's memory bandwidth, FLOPS, and interconnect. The closer the chips are together, the easier it is to do high-speed interconnects. So this is also a reason you'd go to water cooling: you can put the chips right next to each other and therefore get higher-speed connectivity.
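The water loop described above, chill the water, carry heat off the chips, return it to the chillers, comes down to the standard heat equation Q = ṁ·c_p·ΔT. A back-of-the-envelope sketch, with assumed facility numbers (the heat load and temperature rise below are illustrative, not Memphis specifics):

```python
# Back-of-the-envelope sizing of a datacenter water-cooling loop using
# Q = m_dot * c_p * delta_T. Facility numbers are illustrative assumptions.

C_P_WATER = 4186.0   # specific heat of water, J/(kg*K)


def water_flow_kg_s(heat_watts: float, delta_t_k: float) -> float:
    """Mass flow of water (kg/s) needed to carry away `heat_watts` of heat
    with a `delta_t_k` temperature rise between supply and return."""
    return heat_watts / (C_P_WATER * delta_t_k)


heat_mw = 150.0      # assumed IT heat load to reject, in MW
delta_t = 10.0       # assumed supply-to-return temperature rise, in K

flow = water_flow_kg_s(heat_mw * 1e6, delta_t)
print(f"~{flow:.0f} kg/s of water (~{flow / 1000:.2f} m^3/s) "
      f"to move {heat_mw:.0f} MW of heat")
```

A few thousand kilograms of water per second for a facility-scale load, which is why the chiller farm outside ends up being dozens of shipping-container-sized units.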