Disaggregating prefill and decode unlocks major inference efficiency gains
Historically, a model would be hosted by a single inference engine, and that engine would ping-pong between two phases: prefill, where it reads the input sequence and builds the KV cache, and decode, where it uses that cache to generate new tokens one at a time. Researchers across several papers made the realization that separating these two phases yields real benefits: you no longer have to worry about step-synchronous scheduling, and you can split the work into two specialized pools of workers.
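The hand-off described above can be sketched as follows. This is a toy illustration, not any real engine's API: the `Request`, `prefill`, and `decode` names are hypothetical, and the "KV cache" is modeled as a simple list of per-token states rather than real attention tensors.

```python
# Toy sketch of disaggregated prefill/decode. All names here are
# illustrative assumptions, not a real inference engine's API.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: list[str]                  # input tokens to read during prefill
    max_new_tokens: int
    kv_cache: list[str] = field(default_factory=list)
    output: list[str] = field(default_factory=list)

def prefill(req: Request) -> Request:
    """Prefill pool: read the whole prompt at once, building the KV cache."""
    req.kv_cache = [f"kv({tok})" for tok in req.prompt]  # stand-in for attention states
    return req

def decode(req: Request) -> Request:
    """Decode pool: generate tokens one at a time, reusing and growing the cache."""
    for i in range(req.max_new_tokens):
        new_tok = f"tok{i}"                     # stand-in for sampling from the model
        req.output.append(new_tok)
        req.kv_cache.append(f"kv({new_tok})")   # cache grows as decode proceeds
    return req

# Hand-off between pools: prefill workers push requests with finished
# caches onto a queue; decode workers pull from it. Each pool can be
# sized and scheduled independently -- no step-synchronous interleaving.
handoff: Queue[Request] = Queue()
handoff.put(prefill(Request(prompt=["a", "b", "c"], max_new_tokens=2)))
done = decode(handoff.get())
print(done.output)  # -> ['tok0', 'tok1']
```

In a real deployment the queue would carry the KV cache across machines (or a pointer into a shared cache store), which is why the compute-bound prefill pool and the memory-bandwidth-bound decode pool can be provisioned on different hardware.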













