Discrete tokens break the geometry diffusion relies on
"But if you think about text and take two words, it's not clear what lies between the meanings of two different words. There is no real geometry to the space of possible tokens or possible words, and that makes the idea of denoising much more challenging: it's not clear what it means to perturb, or add noise to, text."
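A tiny numeric illustration of the "no geometry" point, using made-up 2-D embeddings (hypothetical; real embeddings are learned and high-dimensional):

```python
import numpy as np

# Toy, hand-picked 2-D "embeddings" for two words (purely illustrative).
emb = {"hot": np.array([1.0, 0.0]), "cold": np.array([-1.0, 0.0])}

# In image space, the midpoint of two images is itself a valid image,
# so Gaussian noising and denoising make sense. Here, the midpoint is
# a perfectly valid vector...
midpoint = (emb["hot"] + emb["cold"]) / 2

# ...but there is no token whose meaning it corresponds to: the vocabulary
# is a discrete set, with nothing "in between" two words to land on.
print(midpoint)
```

This is why continuous Gaussian noise has no direct analogue for text, and why discrete corruption processes (like the masking described later) are used instead.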
Causal attention masks block reuse of pretrained autoregressive weights
"The real challenge is that the attention mask you use in a traditional autoregressive model is causal, so the model only knows how to use context to the left as it figures out what to do next. In a diffusion language model, you really want access to the context to the left and to the right as you decide what to change. That's one of the key properties that make these models potentially much higher quality than autoregressive models."
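A minimal sketch of the two mask patterns in question, assuming the usual convention that `mask[i, j]` says whether position `i` may attend to position `j`:

```python
import numpy as np

T = 5  # sequence length

# Autoregressive models: causal (lower-triangular) mask. Each position
# attends only to itself and the positions to its left.
causal = np.tril(np.ones((T, T), dtype=bool))

# Diffusion language models: full bidirectional attention. Every position
# sees left AND right context when deciding what to change.
bidirectional = np.ones((T, T), dtype=bool)

print(causal.astype(int))
print(bidirectional.astype(int))
```

The mismatch between these two patterns is what makes pretrained autoregressive weights hard to reuse directly: they were trained only ever seeing the left half of the bidirectional picture.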
Big labs face high switching costs to adopt diffusion
"My sense is that there is a big switching cost. They're very focused on Gemini, on their main model. That's the issue with these big labs: they're committed to one direction, and it's hard for them to really focus on an alternative direction. As a startup, we're in a much better position to do that, because we're laser-focused on one thing, and we can really deliver and build everything that's needed for that technology to succeed."
Diffusion models enable controllable generation through external constraints
"Diffusion models, at least for images, are known to be much more suitable for controllable generation. The reason is that because the object you're generating, say an image, is available to the model from the very beginning, it's very easy for the model to check whether the object it's generating is consistent with some constraints, or some control signal that you want to use to make sure the output matches whatever you want the model to generate. I was on some papers where we were doing medical imaging, and the idea is that when you do a CT scan, you're basically taking projections of your body's cross section, and then you're trying to reconstruct what your body looks like from the measurements you get from the machine."
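The mechanism described here, checking the full current estimate against measurements at every step, can be sketched with a toy linear inverse problem. The operator `A`, the data, and the plain data-consistency update below are all illustrative; a real diffusion-based CT solver would interleave steps like this with a learned denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))   # measurement operator (think: scanner projections)
x_true = rng.normal(size=8)   # the unknown cross section
y = A @ x_true                # the measurements we actually observe

x = rng.normal(size=8)        # start from a random guess ("noise")
for _ in range(2000):
    # Because the full estimate x is available at every step, we can check it
    # against the measurements y and nudge it toward consistency.
    x = x - 0.01 * A.T @ (A @ x - y)

print(np.linalg.norm(A @ x - y))  # measurement residual, driven toward zero
```

The key property the speaker is pointing at: unlike left-to-right generation, the whole object exists (in rough form) from step one, so constraint checks can steer every refinement step.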
Masking tokens replaces noise in diffusion text models
"One that works pretty well is basically one where you mask out tokens. You take a sentence and remove some of the tokens; you hide them from the neural network, and then you ask the neural network: can you predict what those tokens were? It's similar in some sense to next-token prediction, except that things are done out of order, and the network needs to use context from both the left and the right, combining it in interesting ways to figure out how to predict all the missing tokens in the sentence."
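The corruption process described above can be sketched in a few lines. The word list and `<mask>` symbol are toy stand-ins; real models operate on subword token IDs:

```python
import random

MASK = "<mask>"  # hypothetical mask symbol

def corrupt(tokens, mask_prob, rng):
    """Hide each token independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

rng = random.Random(0)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
noisy = corrupt(sentence, mask_prob=0.5, rng=rng)

# The network is trained to predict the hidden tokens from the visible ones,
# using context on both sides: out-of-order next-token prediction, in effect.
print(noisy)
```

Varying `mask_prob` plays the role that noise level plays in image diffusion: near 0.0 the sentence is almost clean, near 1.0 it is almost entirely hidden.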
Diffusion LLMs scale better than autoregressive models at inference
"If you need to scale up these models and they're actually getting into production, the price per token, or what's needed per token, becomes the key metric you care about. What we're seeing with diffusion language models is that they scale better than autoregressive models at inference time. They're cheaper to serve, they're faster, and you get more tokens per GPU, which means the price is actually lower."
"The latest model that we announced this week, Mercury 2, is actually matching in quality some of the best speed-optimized models from frontier labs: think the Haiku models, the Flash models, the mini models from OpenAI. It's at that quality level, but again, it's about 5-10x faster in terms of the time it takes to get an answer using a diffusion model versus an autoregressive model."
Existing serving engines cannot run diffusion language models
"I think one of the reasons there are still no other providers able to serve diffusion language models in production today is that you cannot run a diffusion language model on existing serving engines. Think about vLLM, SGLang, TensorRT: these frameworks, open source and not, are really good at serving autoregressive LLMs very efficiently. The space for diffusion language models is much less developed, so we had to build our own serving engine."
Voice agents and fast agentic loops are killer use cases
"We're already seeing a lot of usage. You nailed the two main ones we're seeing: voice, a lot of voice customer support, and educational agents. People love the speed of diffusion language models. They always have this issue that they'd want to use a thinking model, a reasoning model, but usually the latency is just too high, unless they use specialized AI inference chips, and that's too expensive and can't scale to large volumes. So we had a bunch of customers building voice agents on top of diffusion language models."