Published: Mar 10, 2026

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Quotes & Clips

10 clips
Latent Space: The AI Engineer Podcast
Mar 10

Agents should only do two of three things: files, internet, code execution

Agents can do three things: they can access your files, they can access the internet, and now they can write custom code and execute it. You should really only let an agent do two of those three. If it can access your files and write custom code, you don't want it to have internet access, because that's a vulnerability. If it has access to the internet and your file system, you should know the full scope of what that agent is capable of doing; otherwise malware can get injected, or something like that can happen.

Nader Khalil - NVIDIA Director of Developer Experience
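The "two of three" rule above can be enforced mechanically rather than by convention. A minimal sketch, assuming a hypothetical `Capability` enum and `grant` gate (not any real agent framework's API):

```python
# Hedged sketch: refuse any agent toolset that combines all three risky
# capabilities. Capability and grant() are invented names for illustration.
from enum import Enum, auto

class Capability(Enum):
    FILES = auto()
    INTERNET = auto()
    CODE_EXECUTION = auto()

ALL_THREE = {Capability.FILES, Capability.INTERNET, Capability.CODE_EXECUTION}

def grant(requested: set[Capability]) -> set[Capability]:
    """Allow at most two of the three capabilities at once."""
    if ALL_THREE <= requested:
        raise PermissionError(
            "agent may hold at most two of: files, internet, code execution"
        )
    return requested
```

Any two-capability request passes through unchanged; requesting all three raises before the agent ever runs.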

Brev's surfboard stunt at GTC led directly to NVIDIA acquisition

Brev is a developer tool that makes it really easy to get a GPU. It was actually Evan Conrad of SF Compute who said, you guys are two dudes in a room, why are you pretending that you're not? So then we were like, okay, let's make the logo a shaka. We brought surfboards to our booth at GTC, and the energy was great. My wife, at the time my fiancee, was helping me put these vinyl stickers on, and she goes, you son of a..., if you pull this off. Pretty much after the acquisition, I stitched that together with the acquisition and sent it to our family group chat.

Nader Khalil - NVIDIA Director of Developer Experience

SOL means asking what physics actually allows, not what people promise

SOL is, of all the lessons I've learned, definitely my favorite. Light moves at a certain speed, so if light is moving slower, you know something's in the way. Before trying to layer reality back in, the reasons why this can't be delivered by some date, let's just understand the physics. What is the theoretical limit to how fast this can go? Then start telling me why we fall short of it. Because otherwise, people will just tell you why something can't be done.

Nader Khalil - NVIDIA Director of Developer Experience
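A concrete instance of this kind of speed-of-light reasoning, common in inference work: at batch size 1, every decoded token must stream all of the model's weights from HBM, so memory bandwidth sets a hard ceiling on tokens per second. The figures below are illustrative stand-ins, not any specific GPU's spec sheet:

```python
# Illustrative SOL estimate for batch-1 decode: bandwidth divided by the
# bytes that must move per token (roughly the full weight set) bounds
# throughput. Numbers are made up for the example.
def sol_decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                              hbm_bandwidth_gb_s: float) -> float:
    weight_bytes_gb = params_billion * bytes_per_param  # GB streamed per token
    return hbm_bandwidth_gb_s / weight_bytes_gb

# A 70B-parameter model at 1 byte/param on a 3,350 GB/s part cannot decode
# faster than ~48 tok/s at batch 1, no matter how good the software is.
ceiling = sol_decode_tokens_per_sec(70, 1.0, 3350)
```

If measured throughput is far below the ceiling, "something's in the way"; if someone promises far above it, the physics says no.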

NVIDIA invests in $0 markets to learn future categories early

There's another concept that is explored a lot at NVIDIA, this idea of a $0 business. Market creation is a big thing at NVIDIA. Jensen says we are completely happy investing in $0 markets. We don't care if this creates revenue; it's important for us to know about this market. We think it will be important in the future. It can be $0 for a while. An org doesn't have to ruthlessly find revenue very quickly to justify its existence.

Kyle Kranen - NVIDIA Dynamo architect

Disaggregating prefill and decode unlocks major inference efficiency

Historically, models would be hosted with a single inference engine, and that engine would ping-pong between two phases: prefill, where you read the sequence and generate KV cache, and decode, where you use that KV cache to generate new tokens. Some brilliant researchers, across multiple different papers, made the realization that if you separate these two phases, you gain real benefits: you don't have to worry about step-synchronous scheduling, and you can split the work into two different types of pools.

Kyle Kranen - NVIDIA Dynamo architect
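The shape of that split can be sketched in a few lines. This is a toy model of disaggregated serving, not Dynamo's actual implementation: a dict-like `KVCache` object stands in for the real cross-worker KV-cache transfer, and the two functions stand in for the two worker pools:

```python
# Toy sketch of disaggregated serving: one pool builds KV cache (prefill),
# a separate pool consumes it to emit tokens (decode). Real systems move
# the KV cache between machines; here it is just a Python object handoff.
from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    tokens: list[str]  # the prompt tokens this cache covers

def prefill_worker(request_id: str, prompt: str) -> KVCache:
    # Reads the whole sequence once; compute-bound in practice.
    return KVCache(request_id, prompt.split())

def decode_worker(cache: KVCache, max_new: int) -> list[str]:
    # Generates token by token from the handed-off cache; bandwidth-bound.
    return [f"tok{i}" for i in range(max_new)]

cache = prefill_worker("req-1", "the quick brown fox")
generated = decode_worker(cache, 3)
```

Because the two functions never share an engine, each pool can be sized, scheduled, and placed on hardware independently, which is the benefit the quote describes.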

Kimi K2 traded attention heads for more experts as hardware co-design

Kimi K2 comes out, and it's an interesting model. The creators of Kimi K2 actually talked about it in a blog post. Attention cost scales with the number of heads. They made a very specific trade in their architecture. They basically said, hey, what if we give it more experts? We'll use more memory capacity, but keep the number of activated experts the same. We increase the expert sparsity, and we decrease the number of attention heads.

Kyle Kranen - NVIDIA Dynamo architect
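The arithmetic behind that trade is worth making explicit. In a back-of-envelope MoE cost model, total experts drive memory capacity, activated experts drive per-token expert compute, and heads drive attention cost. All sizes below are invented for illustration, not Kimi K2's real dimensions:

```python
# Back-of-envelope MoE trade: doubling total experts doubles weight memory,
# but per-token cost depends only on *activated* experts plus attention,
# so halving the heads can make each token cheaper. Sizes are illustrative.
def moe_cost(total_experts: int, active_experts: int, expert_params_m: int,
             heads: int, head_cost_m: int) -> tuple[int, int]:
    memory_m = total_experts * expert_params_m      # params you must store
    per_token_m = active_experts * expert_params_m  # expert work per token
    attn_m = heads * head_cost_m                    # attention work per token
    return memory_m, per_token_m + attn_m

before = moe_cost(total_experts=128, active_experts=8, expert_params_m=50,
                  heads=128, head_cost_m=2)
after = moe_cost(total_experts=256, active_experts=8, expert_params_m=50,
                 heads=64, head_cost_m=2)
```

With these stand-in numbers, memory capacity doubles (6,400M to 12,800M parameters stored) while per-token cost actually drops (656M to 528M), which is the hardware-aware bargain the quote describes.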

Codex one-shotted Dynamo configurations faster than human engineers

We have a couple of people at NVIDIA who've been working with security to bring agents really close to compute. So we now have setups where you can tell an agent, go run some experiments with Dynamo on X cluster, and just try it right now. We've actually been able to one-shot problems. We used to have this problem where, with Dynamo, you have to find the right configurations. We've had an agent completely one-shot that: it goes, gets the compute, runs a couple of experiments, and says, this is the best, go run this. Then we give that to people, and it's faster than anything they had.

Kyle Kranen - NVIDIA Dynamo architect
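Stripped of the agent framing, "one-shotting a configuration" amounts to running a small sweep, measuring each candidate, and returning the winner. This is a hypothetical sketch; the knobs and the `measure` function are invented stand-ins for a Dynamo-style tuning loop, not its real interface:

```python
# Hedged sketch of a config sweep: try combinations of tensor parallelism
# and batch size, score each, return the best. measure() fakes a benchmark;
# a real agent would launch the engine on a cluster and time it.
import itertools

def measure(tp: int, batch: int) -> float:
    # Stand-in metric with a made-up sweet spot at tp=4, batch=32.
    return (tp * batch) / (1 + abs(tp - 4) + abs(batch - 32) / 8)

def best_config(tps=(1, 2, 4, 8), batches=(8, 16, 32, 64)) -> tuple[int, int]:
    return max(itertools.product(tps, batches), key=lambda c: measure(*c))
```

The agent's value is doing this loop end to end, acquiring compute, running the sweep, and handing back only the winning configuration.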

Coding agents win because terminals expose every installed tool

Coding agents have been so much more effective than general-purpose agents, and I think a large part of that is that they have access to the terminal, which means access to everything you've installed there. An agent can write code and compile it. If there are errors, it can fix them. It can run your suite of tests, because that's all right there in your terminal. Computing began with a terminal and a shell, but we said that wasn't empathetic to humans, so we built these nice user interfaces. Now we have LLMs navigating our user interfaces, and ironically, we're not empathetic to the machine anymore.

Nader Khalil - NVIDIA Director of Developer Experience

Long-running agents waste GPUs by refusing to shut down instances

I have a 24/7 agent running that I hooked up to RunPod. It doesn't shut down instances, and I've tried prompting it. I've given it the instruction: shut down when you're done. It says, I need to keep it warm, I'll need it soon. It's horrible at time estimates too. It says, yeah, I'll need it in forty-five minutes. But forty-five minutes of human time is actually three minutes of agent time, so it's like, I'm booting it up, I'm waiting, I'll just leave it on all night.

Vibhu - guest co-host

The 'system as model' era replaces single models with orchestrated subagents

There's a summarization of that trend that I like to say to my team: this is the year of "system as model." Instead of having a single model be the thing, you have a system of models and components working together to emulate the black-box model. So when you make an API call to something that's a multi-agent system in the background, it still looks like an API call to a model. Under the hood, it's like a billion different models.

Kyle Kranen - NVIDIA Dynamo architect
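The "system as model" pattern is essentially a facade: a single model-shaped entry point that routes to subagents underneath. A minimal sketch; the subagents and the routing rule here are invented for illustration:

```python
# Sketch of "system as model": callers see one completion function, while
# a router fans out to specialized components behind it. The heuristics
# and stand-in subagents are made up for the example.
def code_agent(prompt: str) -> str:
    return f"[code] {prompt}"

def search_agent(prompt: str) -> str:
    return f"[search] {prompt}"

def general_model(prompt: str) -> str:
    return f"[general] {prompt}"

def complete(prompt: str) -> str:
    """From the outside, this looks like a single model's completion API."""
    if "def " in prompt or "```" in prompt:
        return code_agent(prompt)
    if prompt.endswith("?"):
        return search_agent(prompt)
    return general_model(prompt)
```

The caller's contract never changes, so the operator is free to swap a single model for an orchestra of them, or back, without breaking clients.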
