“DeepSeek R1 does some amount of this imitation learning after the reinforcement learning... if you just do reinforcement learning, by the way, sometimes this model starts code-switching in the middle of solving math problems. It's just suddenly speaking in Chinese and English and back and forth, or some other foreign languages that may not make sense to human readers. So reinforcement learning only cares about whether you got the final solution right or not. It doesn't care about how you got there. So strange behaviors can be emergent, and then they can even be reinforced.”
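As a rough illustration of the point about outcome-only rewards, here is a minimal sketch. It assumes the final answer sits on the last non-empty line of the response; `outcome_reward` and both sample responses are hypothetical and do not come from any real training setup.

```python
def outcome_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the last non-empty line matches the reference, else 0.0.

    Assumption: the final answer is the last non-empty line of the response.
    Nothing before that line affects the score.
    """
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    final_answer = lines[-1] if lines else ""
    return 1.0 if final_answer == reference_answer.strip() else 0.0


# Two made-up responses: one readable, one that switches languages mid-solution.
clean = "2 plus 2 is 4.\n4"
mixed = "2 加 2 equals quatre, also...\n4"

# Both earn the same reward, so the training signal does not discourage the second style.
print(outcome_reward(clean, "4"), outcome_reward(mixed, "4"))
```

Because the intermediate text never enters the score, any habit that happens to accompany correct answers can be reinforced along with them.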
Pluralistic alignment means AI should present multiple valid views
“Overton pluralism means when you ask a question that's politically thorny, for example, that could have different answers, the best way might be for the LLM to just present all of them. All of the reasonable opinions as, hey, the answer is that people have different opinions. Here's one view, there's another view, and be able to include all of them as opposed to picking the majority opinion. Because that marginalizes out the rest.”
“The idea is that during pre-training, the model is forced to be completely passive in the way that it learns to predict which token comes next. But what if we encourage the model to think for itself? Before predicting the next token, what if we encourage the model to think for itself by generating something like a chain of thought? And then predict the next token... the key idea of our approach is to make the reward the information gain of predicting the next token with thought compared to without thought.”
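One way to read "information gain with thought compared to without thought" is as a difference in log-probability of the true next token. The sketch below assumes that reading; the quote does not give the exact formula, and `information_gain_reward` and the probabilities are hypothetical.

```python
import math


def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Assumed form: log p(next | context, thought) - log p(next | context).

    Positive when the generated thought makes the true next token more likely,
    negative when the thought makes it less likely.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Made-up probabilities of the correct next token under each condition.
helpful = information_gain_reward(p_with_thought=0.5, p_without_thought=0.25)
harmful = information_gain_reward(p_with_thought=0.1, p_without_thought=0.25)
print(helpful, harmful)
```

Under this reading, a thought earns reward only to the extent that it improves the prediction over the no-thought baseline, so thoughts that add nothing score zero.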
LLMs collapse to stereotypical answers even on open-ended prompts
“Mode collapse is a real concern with LLM generation. So, what we find in our paper is that even when you ask open-ended questions, like, you know, tell me a joke about time, or tell me something wise about time. Even when you ask, by the way, hey, give me a random number between and 10. It's not random... The bigger problem is after post-training, like supervised fine-tuning and RL, the probability, output probability of the model becomes even more skewed, like zoning in to the stereotypical answers that people tend to like.”
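A simple way to put a number on "it's not random" is the entropy of the answers a model gives when asked the same question many times. This is a minimal sketch, not the method from the paper; `empirical_entropy` and both sample lists are made up for illustration.

```python
import math
from collections import Counter


def empirical_entropy(samples) -> float:
    """Shannon entropy in bits of the empirical distribution over samples."""
    n = len(samples)
    if n == 0:
        return 0.0
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical answers to "give me a random number", sampled ten times each.
spread_out = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # every answer different
collapsed = [7, 7, 7, 7, 7, 7, 7, 7, 3, 4]     # mostly one favorite answer

print(empirical_entropy(spread_out), empirical_entropy(collapsed))
```

Lower entropy means the answers concentrate on a few favorites, which is the skew the quote describes getting worse after post-training.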
Small models can rival large ones with better data
“The mission really is democratizing general AI, so that it's not just companies who can purchase a lot of GPUs, are able to create LLMs and adapt to LLMs and serve LLMs, but also people like myself and colleagues who are academics, so for example, cannot buy as many GPUs, and is there something really meaningful and fun that we could do, even with a smaller counterpart? And at the end of the day, I believe that fundamentally it should be feasible. It's only that the world has invested so much more into exploring what happens when you scale things up so much.”
“There's a lot of delving now that wasn't happening before. Yeah, probably, yeah. You know, actually, whenever I see the word delve in anybody's writing, I'm like, hmm, what did you do?”
Prismatic Synthesis beats teacher models 20x its size
“I can give you one example of our recent work called Prismatic Synthesis. It's a synthetic data generation algorithm which is prismatic because it acts a little bit like a prism that can scatter the light to make it more diversified... we're doing this using the DeepSeek R1 32-billion-parameter model as the teacher model... our goal is to compete against the alternative, which is to use a much stronger teacher that's 20 times larger... we find that that one million data points is actually better than the one million data points that you generate from the stronger teacher model, the best teacher model.”
“There must be a better way of it, a fundamentally better way of doing this. And can we find it? In some ways, nature found a solution, which is the human brain. Nature found the solution, and the human brain requires so little energy. Our brain apparently uses less energy than one light bulb.”