Diffusion LLMs scale better than autoregressive models at inference
"If you need to scale up these models and they are actually getting into production, the price per token, or what's needed per token, becomes the key metric that you care about. And what we're seeing with diffusion language models is that they scale better than autoregressive models at inference time. They're cheaper to serve. They're faster. You get more tokens per GPU, which means that the price is actually lower."
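The price-per-token argument can be sketched with back-of-the-envelope arithmetic: serving cost per token is GPU cost per hour divided by token throughput. The numbers below are illustrative assumptions, not measurements of any real model or GPU.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical figures: a $2/hr GPU serving an autoregressive model
# at 100 tok/s versus a diffusion model at an assumed 4x throughput.
ar_cost = cost_per_million_tokens(2.0, 100)    # ~$5.56 per 1M tokens
diff_cost = cost_per_million_tokens(2.0, 400)  # ~$1.39 per 1M tokens
```

Under these (assumed) numbers, quadrupling tokens per GPU cuts the serving price per token by the same factor, which is the economic claim being made.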
