PUBLISHED: APR 30, 2026 · INDEXED: APR 30, 2026, 9:05 PM

How to Engineer AI Inference Systems with Philip Kiely - #766

Quotes & Clips

8 clips
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Apr 30

Inference research-to-production timeline is often just hours

"I mean, it might be the fastest timeline in the world. If you think about medicine, for example, it can take decades for research to reach a pharmacy. If you think about physics, engineering, or material science, it can take years to apply a new concept. Even within AI, what moves faster? Training, for example: if you want to train a model off of a new technique, it can still take weeks or months to fine-tune the hyperparameters and find the exact right way to express that technique. But with inference, the timeline is often hours. A new model architecture comes out, and you have to figure out how to support it day zero. When the PoloQuant research paper came out, an engineer on our model performance team had it implemented as a CUDA kernel thirty-one hours later."

β€” Philip Kiely - head of AI education at Baseten

Inference engineering is like mixed martial artsβ€”many disciplines

"Yeah. So inference is a really fun and difficult topic, and I'm going to go into metaphor territory for a second here: I'm a martial artist, and I have been my entire life. Are you familiar with the UFC and MMA and all those kinds of things? You can't just be an expert in one thing and expect to become a champion. You can't just be a great wrestler. You can't just be a great boxer. The idea is that you have a lot of different skills, each one of which can take a lifetime to master. And somehow you have to be excellent at all of them in order to be a well-rounded mixed martial artist."

β€” Philip Kiely - head of AI education at Baseten

Knowing inference 'knobs' lets you build priority queues and quantize confidently

"You can trade off between, for example, latency and throughput in a given inference engine by adjusting things as simple as batch size, or things like adding or removing a speculation algorithm. When you do that, when you create a spectrum of outcomes, an efficient frontier of high-performance inference, then you start to understand: wait, I can choose to change the way that I consume these systems. So for example, you might start having priority queues in your product where paid user traffic gets prioritized over free user traffic. Maybe that's something that you couldn't have built previously. Or maybe you add a concept of: I have the ability to quantize models, and because I have my own very sophisticated and product-specific evals, I can do that with complete confidence that I'm not degrading the quality of the service my users are experiencing."

β€” Philip Kiely - head of AI education at Baseten
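The priority-queue idea in the quote above can be sketched in a few lines of Python. This is a minimal illustration, not Baseten's implementation: the tier names, the `InferenceQueue` class, and its methods are all hypothetical.

```python
import heapq
import itertools

# Hypothetical tiers: lower number = higher scheduling priority.
TIER_PRIORITY = {"paid": 0, "free": 1}

class InferenceQueue:
    """Serve paid-tier requests before free-tier ones, keeping
    FIFO order within a tier via a monotonic tie-breaker counter."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, tier, prompt):
        # (priority, arrival order, payload) — heapq pops the smallest tuple first.
        heapq.heappush(self._heap, (TIER_PRIORITY[tier], next(self._counter), prompt))

    def next_request(self):
        _, _, prompt = heapq.heappop(self._heap)
        return prompt

q = InferenceQueue()
q.submit("free", "free request 1")
q.submit("paid", "paid request 1")
q.submit("free", "free request 2")

order = [q.next_request() for _ in range(3)]
print(order)  # the paid request jumps ahead of both earlier free requests
```

In a real serving stack the same scheduling decision would sit in front of the batching logic, so higher-priority requests enter the next batch first rather than preempting running ones.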

Hopper GPUs gained value because Chinese labs optimize for them

"Actually, Hopper GPUs in particular are still very, very popular for inference. One big reason for that is that so much open source work comes out of Chinese labs who, due to export controls, generally work on Hopper GPUs and not Blackwell GPUs. So you generally get, you know, FP8 kernels, you get things that are built for Hopper's asynchronous programming paradigm instead of the slightly different paradigm of Blackwell kernels, and you get models that are built for the size and restrictions of, say, an 8xH200 node instead of, say, a GB300 NVL72 system."

β€” Philip Kiely - head of AI education at Baseten

You can't vibe code uptime for mission-critical inference

"But one of the things we like to say at Baseten is: you can't vibe code uptime. And ultimately, for these mission-critical systems that have hundreds of millions or billions of dollars of economic value relying on them, there are still going to need to be human owners who can be accountable for the results of the system."

β€” Philip Kiely - head of AI education at Baseten

Specialized runtimes turn 500ms tasks into 1ms tasks

"No one is truly using frontier LLMs to do named entity recognition, which is sort of extracting keywords from sentences, or at least I sure hope they're not. But even if you're using, you know, a flash-model type of thing for that versus a specialized model, maybe a highly optimized small LM can do that task in five hundred milliseconds. We just released a named entity recognition runtime that does it in one millisecond. One, not five hundred. And if you have an agent that is doing this a hundred times per user request, all of a sudden this has gone from something where you have to look at the spinner to something that happens instantly."

β€” Philip Kiely - head of AI education at Baseten
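The arithmetic behind the quote above is worth making explicit. Using the numbers from the quote (100 calls per user request, 500 ms vs 1 ms per call) and assuming the calls run sequentially, which is my assumption rather than something stated in the episode:

```python
# Back-of-the-envelope latency math for an agent that makes
# 100 sequential NER calls per user request.
calls_per_request = 100

def total_seconds(per_call_ms):
    """End-to-end time spent on NER per user request, in seconds."""
    return calls_per_request * per_call_ms / 1000

slow = total_seconds(500)  # small LM at 500 ms per call
fast = total_seconds(1)    # specialized runtime at 1 ms per call

print(f"500 ms/call: {slow:.1f} s per request")  # 50.0 s — a visible spinner
print(f"  1 ms/call: {fast:.1f} s per request")  # 0.1 s — feels instant
print(f"speedup: {slow / fast:.0f}x")
```

Even with parallel calls, the same per-call gap compounds across every agent step, which is why shaving a single hot-path task from 500 ms to 1 ms changes the product experience rather than just the benchmark.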

Inference runtime market concentrated around vLLM, SGLang, TensorRT-LLM

"Within inference, there's really been a concentration around three major open source runtimes: vLLM, SGLang, and TensorRT-LLM. And I think part of that is just that the complexity of standing up a new runtime from zero is very, very high, so most people find it more useful to contribute to an existing one. There's a lot of really good open source work around inference optimization outside of that, like good open source kernel libraries, open source quantization tools, KV cache reuse tools, new speculation stuff."

β€” Philip Kiely - head of AI education at Baseten

Owning your intelligence is the differentiator in 2026

"I think that this year we're going to see a real increase in ownership of intelligence. You saw Shopify moving to a Qwen model and saving millions, tens of millions, on their workloads. You see companies like Cursor coming out with very sophisticated models like Composer that allow them to really create a novel experience for the users who are depending on their platform. So the trend that I'm most excited about is companies understanding that if they want to build a really differentiated product, they need to be differentiated at every level, and that's starting to include the model level."

β€” Philip Kiely - head of AI education at Baseten
