PUBLISHED: JAN 29, 2026 | INDEXED: APR 30, 2026, 9:20 AM

The Evolution of Reasoning in Small Language Models with Yejin Choi - #761

Quotes & Clips

8 clips
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Jan 29

Small models can rival large ones with better data

“The mission really is democratizing general AI, so that it's not just companies who can purchase a lot of GPUs that are able to create LLMs, adapt LLMs, and serve LLMs, but also people like myself and colleagues who are academics and, for example, cannot buy as many GPUs. Is there something really meaningful and fun that we could do, even with a smaller counterpart? And at the end of the day, I believe that fundamentally it should be feasible. It's only that the world has invested so much more into exploring what happens when you scale things up so much.”

— Yejin Choi - Stanford professor researching AI

Reinforcement learning produces strange code-switching behaviors mid-solution

“DeepSeek R1 does some amount of this imitation learning after the reinforcement learning... if you just do reinforcement learning, by the way, sometimes this model starts code-switching in the middle of solving math problems. It's just suddenly speaking in Chinese and English, back and forth, or some other foreign languages that may not make sense to human readers. So reinforcement learning only cares about whether you got the final solution right or not. It doesn't care about how you got there. So strange behaviors can be emergent, and then they can even be reinforced.”

— Yejin Choi - Stanford professor researching AI
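The outcome-only reward Choi describes can be sketched in a few lines. The grading helper and answer format below are illustrative assumptions, not R1's actual reward code:

```python
# Outcome-only reward: grade a model's solution purely on its final answer,
# ignoring the reasoning trace (so odd behaviors like mid-solution language
# switching are never penalized).

def outcome_reward(solution_text: str, correct_answer: str) -> float:
    """Return 1.0 if the final line matches the reference answer, else 0.0.

    The reasoning above the final line is never inspected, so any
    strange-but-effective chain of thought is reinforced equally.
    """
    final_line = solution_text.strip().splitlines()[-1]
    return 1.0 if final_line.strip() == correct_answer.strip() else 0.0

# A trace that code-switches mid-solution still earns full reward:
trace = "Let x = 7.\n然后 x 乘以 6 等于 42。\n42"
print(outcome_reward(trace, "42"))  # -> 1.0
```

Because the reward function never reads anything above the final line, reinforcement only cares about the answer, exactly the property the quote attributes to outcome-based RL.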

LLMs collapse to stereotypical answers even on open-ended prompts

“Mode collapse is a real concern with LLM generation. What we find in our paper is that even when you ask open-ended questions, like, you know, tell me a joke about time, or tell me something wise about time... Even when you ask, by the way, hey, give me a random number between 1 and 10, it's not random... The bigger problem is that after post-training, like sequential fine-tuning and RL, the output probability of the model becomes even more skewed, zoning in on the stereotypical answers that people tend to like.”

— Yejin Choi - Stanford professor researching AI
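The skew Choi describes can be illustrated numerically: sharpening a next-token distribution, a toy stand-in for post-training's effect, concentrates probability on the modal answer and lowers entropy. The distribution below is made up for illustration:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def sharpen(p, temperature):
    """Rescale a distribution as p**(1/T), renormalized.
    T < 1 concentrates mass on the already-likely modes."""
    w = [x ** (1.0 / temperature) for x in p]
    z = sum(w)
    return [x / z for x in w]

# Hypothetical base-model distribution over the answers 1..10 to
# "give me a random number": already biased toward one answer.
base = [0.05, 0.05, 0.08, 0.07, 0.08, 0.07, 0.35, 0.10, 0.08, 0.07]

post = sharpen(base, temperature=0.5)  # stand-in for post-training skew
print(round(entropy(base), 3), round(entropy(post), 3))  # entropy drops
print(round(max(post), 3))  # mass on the modal answer grows
```

The same mechanism explains why "random number" prompts are not random: whatever answer already dominates the distribution dominates it even more after post-training.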

Spotting AI writing: watch for the word 'delve'

“There's a lot of delving now that wasn't happening before. Yeah, probably, yeah. You know, actually, whenever I see the word delve in anybody's writing, I'm like, hmm, what did you do?”

— Yejin Choi - Stanford professor researching AI

Prismatic Synthesis beats teacher models 20x its size

“I can give you one example of our recent work called Prismatic Synthesis. It's a synthetic data generation algorithm, which is prismatic because it acts a little bit like a prism that can scatter the light to make it more diversified... we're doing this using the DeepSeek R1 32-billion-parameter model as the teacher model... our goal is to compete against the alternative, which is to use a much stronger teacher that's 20 times larger... we find that that one million data points is actually better than the one million data points that you generate from the stronger teacher model, the best teacher model.”

— Yejin Choi - Stanford professor researching AI
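The quote does not spell out the algorithm, but diversity-driven data selection in this spirit can be sketched generically. The greedy farthest-point routine and toy embeddings below are an illustrative stand-in, not the actual Prismatic Synthesis method:

```python
# Illustrative sketch of diversity-driven selection: among candidate
# synthetic examples (represented by embeddings), greedily keep the ones
# farthest from everything already kept, "scattering" the data like a prism.

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_diverse(embeddings, k):
    """Greedily pick k indices, each maximizing distance to those chosen."""
    chosen = [0]  # seed with the first candidate
    while len(chosen) < k:
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy 2-D "embeddings": three near-duplicates around (0, 0) plus two
# outliers that a diversity-seeking selection should prefer.
cands = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (-4.0, 3.0)]
print(select_diverse(cands, 3))  # -> [0, 3, 4]
```

The point of the sketch is the selection criterion: near-duplicate generations add little, so a diverse million examples from a weaker teacher can beat a redundant million from a teacher 20 times larger.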

Reward thinking before predicting the next token

“The idea is that during pre-training, the model is forced to be completely passive in the way that it learns to predict which token comes next. But what if we encourage the model to think for itself? Before predicting the next token, what if we encourage the model to think for itself by generating something like a chain of thought, and then predict the next token... the key idea of our approach is to make the reward the information gain of predicting the next token with thought compared to without thought.”

— Yejin Choi - Stanford professor researching AI
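The information-gain reward described above compares the true next token's log-probability with and without a generated thought. A minimal sketch, with made-up probabilities standing in for a real language model's predictions:

```python
import math

# Information-gain reward for "thinking before predicting": how much more
# probable does the true next token become when the model first generates
# a thought? Probabilities below are hypothetical stand-ins; a real system
# would query a language model for them.

def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """log p(next token | context, thought) - log p(next token | context).
    Positive when the thought made the true next token more likely."""
    return math.log(p_with_thought) - math.log(p_without_thought)

# Hypothetical values: the thought raises the true token's probability
# from 0.10 to 0.40, so thinking earns a positive reward.
r = information_gain_reward(0.40, 0.10)
print(round(r, 3))  # -> 1.386 (log 4)

# A useless thought that leaves the probability unchanged earns zero.
print(information_gain_reward(0.10, 0.10))  # -> 0.0
```

Rewarding the gain rather than the raw probability means the model is only paid for thoughts that actually help predict the next token, which is the key idea in the quote.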

Human brains run on less energy than a lightbulb

“There must be a better way of doing this, a fundamentally better way. And can we find it? In some ways, nature found a solution, which is the human brain. Nature found the solution, and the human brain requires so little energy. Our brain apparently uses less energy than one light bulb.”

— Yejin Choi - Stanford professor researching AI

Pluralistic alignment means AI should present multiple valid views

“Overton pluralism means that when you ask a question that's politically thorny, for example, one that could have different answers, the best way might be for the LLM to just present all of them. All of the reasonable opinions, as in: hey, the answer is that people have different opinions. Here's one view, there's another view. Be able to include all of them, as opposed to picking the majority opinion, because that marginalizes out the rest.”

— Yejin Choi - Stanford professor researching AI
