Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning
Hello everyone and again welcome to
lecture six of CME295.
Uh so today is actually an exciting day
because we're going to uh cover a topic
that has been trending over the past
year or so which is LLM reasoning.
And it's actually a good segue from
what we talked about last time, which was
preference tuning, because a lot of the
methods that we used in lecture five are
going to be the ones that we'll be using
as kind of the foundations for this
lecture.
So before we start as usual we're going
to uh just cover quickly what we saw in
the last lecture.
So if you remember lecture four and
lecture five were all about learning how
we can train a model. So in lecture four
we saw the first part which was the
pre-training part which is the most
compute intensive step where we
basically teach the model the structure
of text, the structure of code, and we do
this very large-scale training.
So at the end of this first step which
is the pre-training we get a model that
knows about code that knows about
language but it only knows how to
autocomplete a sequence and so that's
why we also saw uh the second step which
was the finetuning step where we take
our pre-trained model and we try to make
it useful.
So one use case that we saw was for
instance uh you know an assistant. So we
tried to tune it in a way that it can
respond to questions and so here we have
this work of preparing what we call
the SFT data,
which is a high-quality curated data
set that we can use to teach our model
how to behave. So at the end of this
second step we have our model that is
tuned for a specific task which can be
um responding to queries. And then last
lecture we saw this third step
which was the preference tuning step.
And here the goal is to align our model
with human preferences.
So we saw in particular RLHF, which was a
common method to do that, and we saw that
there were two steps to it. So there was
the first part learning to distinguish
good from bad using human preference
data
and then there was this second step
which was this RL stage which is going
to be useful today.
So in particular, if you remember, we
had drawn a comparison between the RL
setup that you may be familiar with uh
outside of this class where we have uh
an agent that is interacting with the
environment
and the way that it interacts with it is
that, given a state that it is
in, it can take an action following
a policy
which is nothing else than just a
probability distribution over actions.
So given
a state our agent can take an action
following this policy, and as a result
of this it receives some reward.
And we saw last lecture that we have
this nice comparison between the
traditional RL setup and the LLM setup.
And so here our quote unquote agent is
just simply the LLM.
Um the environment that it interacts
with is just the set of tokens that it
can predict over.
So given an input that it has received
so far,
it can predict what the next token could
be which is the action and it does that
using the probability distribution
that is the result of, I guess, the LLM
prediction. And we saw that we can
obtain human preferences
for each completion.
So you have a prompt, you have a
completion and you get those human
preferences.
And this is the one that you then use in
order to tune the LLM because here this
step is about aligning the model with
human preferences. And here the human
preferences are encapsulated in the
reward.
And so we saw that the loss function
during the RL stage is composed of two
parts.
So the first part is this advantage
maximization
and we saw that that advantage was based
on the rewards uh and it has some
baseline to just reduce the variance of
the gradient. Uh so we have that part
but we also saw
another part
which was that we don't want our model
to deviate too much from either the
previous iteration. So we don't want the
model to change too much from iteration
to iteration.
But we also don't want our model to
change too much compared to our initial
model like our base model. And here it's
the SFT model. And the reason why we
don't want to change too much is because
our model has already learned a lot.
It's already quite performant in what it
does. And what we want is to just align
it with human preferences which is not
something for which we want the model to
change completely.
And I think that was the part that um I
think was a little bit scary which was
the actual loss functions. So if you
remember, the main algorithm that is
typically used in the RLHF setting is PPO,
which stands for proximal policy
optimization,
and there is this one variant that's
called PPO-Clip,
which is such that it clips the updates
from one iteration to another. So here,
if you remember, r is actually not the
reward, it is the ratio, and I know it's a
confusing notation.
So it's the ratio between the current
policy and the old policy and the old
policy here is the policy that is at the
previous RL iteration.
So what we do is that we have a clipping
mechanism
such that the ratio cannot go beyond
certain thresholds,
so that we do not incentivize
the model to make too big of an update.
And we saw another variant of PPO, which
is called PPO-KL penalty.
And this variant
uses the KL divergence
to penalize the model for changing too
much.
So if you remember, in the original PPO
paper, they used the
quote-unquote old version of the model,
which is the one at the previous RL
iteration.
But in modern-day RLHF training, this
KL divergence is typically applied to
the base model, which is the SFT model.
So I guess long story short is we have
these two PPO variants
that were introduced in the original PPO
paper.
But in modern-day RLHF training, we
typically have some combination, some
mix of these two loss functions.
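To make the two pieces concrete, here is a minimal per-sample sketch of a clipped ratio term combined with a KL-style penalty towards a reference (SFT) model. The lecture only says that modern training uses "some mix" of the two, so the specific way they are combined below, the function names, and the values `eps=0.2` and `beta=0.1` are all assumptions for illustration, and the KL term uses a simple single-sample log-ratio estimate rather than an exact divergence.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one sampled action (a quantity to maximize).

    r = pi_new / pi_old is the probability ratio (not the reward).
    eps is an assumed clipping threshold.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min removes the incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped * advantage)

def kl_penalty_term(logp_new, logp_ref, beta=0.1):
    """Single-sample estimate of beta * KL(pi_new || pi_ref).

    Assumes the action was sampled from pi_new; pi_ref stands for the SFT model.
    """
    return beta * (logp_new - logp_ref)

def combined_objective(logp_new, logp_old, logp_ref, advantage, eps=0.2, beta=0.1):
    """One assumed way to mix the two: clipped term minus the KL penalty."""
    return ppo_clip_term(logp_new, logp_old, advantage, eps) - kl_penalty_term(
        logp_new, logp_ref, beta
    )

# Ratio of 1.5 with a positive advantage gets clipped at 1 + eps = 1.2.
print(ppo_clip_term(math.log(0.3), math.log(0.2), advantage=1.0))
```

With a negative advantage the same ratio is not clipped, since the unclipped term is then the smaller of the two.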
Cool.
So, up until now, we've seen what I want
to call vanilla LLMs, which are LLMs
that take something as an input, let's
say a prompt, and just respond with some
answer.
And so those vanilla LLMs, they have a
lot of strength that we can enjoy. So,
first of all, we had seen that those
LLMs, they know a lot about the structure of
text. They know a lot about code.
So in particular, if you want to debug
your code, I guess they're great to find
where the error is. They are great to
generate code. They're also great to
generate, I don't know, essays or poems.
They're really, really good at that.
But I do want to call out some
weaknesses.
So the first, I guess, weakness that I
want to call out in these vanilla LLMs
is that they have quote-unquote limited
reasoning.
So typically, if you give it
some sophisticated, let's say, math
problem, it will not really
come up with, I guess, the solution,
because maybe it will get lost
along the way, because up until
now our model has been really trained to
I guess given a prompt respond to it
using the kind of next token prediction.
So I guess here there is not like really
a big reason why it would be able to
solve I guess complicated problems. So I
guess that is one.
So second weakness is that the LLM that
we have has been pre-trained on a huge
amount of data which is static
meaning that the knowledge that the LLM
has acquired is bound to
what we call the cutoff date, at which
we cut off and formed our pre-training data.
So I know a few days ago we had an
election. So if let's say we trained our
LLM based on data before the election
and let's say today we ask it okay who
is the I don't know elected official of
let's say X it will not be able to
answer us because it does not have
access to knowledge after that date.
So a third weakness is so far it's uh
all talk no action. So you just uh
prompt your LLM but I guess if you want
to let's say I don't know um like place
an order or I don't know do some action
you just cannot do it.
And then I guess the last weakness which
by the way is not an exhaustive list is
contrary to traditional NLP models, LLMs
generate free-form text,
and it's hard to evaluate them in the
framework that we, and by we I mean the ML
community, had adopted up until a few
years ago. So if you're familiar with
it, let's suppose you worked in the
translation world, you would use
rule-based metrics like BLEU, or for
summarization ROUGE, to evaluate your
outputs. But then here LLMs, they can
really do more than just that. So it's
very hard to evaluate them.
So what I want to say is that this is
let's say a subset of all the weaknesses
that LLMs can have. And the last three
are going to be topics that we will
cover in the next lecture and in
lecture eight.
And as I mentioned before, the focus of
today is reasoning. So we'll see how we
can improve the way our LLM reasons.
Cool. And again, um this is a topic
that's been very new and by new I mean
roughly a year. So almost everything
that we will see is uh either from 2024
or 2025.
And I guess we're lucky because now we
have kind of enough hindsight to just
know which piece is more important than
other things. And so the goal for today
is going to be to know what reasoning
models are.
And the second big goal is to know how
they are trained.
So hopefully at the end of this lecture
if you have a good answer to these two
questions that means that we have done a
good job.
Okay. So let's start with reasoning
models. What are they? Well to answer
this question we first need to I guess
define what reasoning is. And the bad
news here is there's not a commonly
agreed upon definition out there of what
reasoning is. So I will try to define it
um I guess to the best of my ability. So
here we define reasoning as the ability
to solve a problem.
And here by problem we typically think
more of math problems or let's say
coding problems.
But hopefully these abilities can also
kind of spread to other fields as well.
And in order to solve this problem,
we typically need a multi-step reasoning
process, a little bit like when
you take an exam: when you have a
question that's not trivial you
typically break that down into several
steps and then kind of go through them
and then come to the final answer. So I
guess the idea is problems that are
reasoning problems they would have some
pattern of that.
So to illustrate this let's uh just make
sure we're all kind of thinking the
same. So a non-reasoning question would
be something like: what is the course
code of Stanford's Transformers and LLMs
class? So this one, you know, it's a knowledge
thing. Everyone knows it's CME 295.
But then um as opposed to that in
contrast to that a reasoning based
question would be maybe a math question.
So for instance let's suppose we have a
bear that was born in 2020. How old is
that bear now in 2025? So that that
would be the kind of question that we're
looking at. Of course, this is super
easy. You can think of something
that's, you know, much harder than that,
and this would qualify.
Cool. Okay. So now that we know a little
bit what reasoning is, now we're going
to look at how we can I guess obtain a
model that can deal with these kinds of
prompts.
So the core idea here is to leverage a
concept we saw earlier in the class.
So I'm not sure if you remember but back
in lecture maybe two or three we saw a
technique called chain of thought.
So who remembers what chain of thought
is?
Yeah. Yes. Do you want to say what what
that is?
Yeah. Yeah. Exactly. So the answer is,
you know, thinking in steps instead of
just giving a blanket answer. So
great great answer. So here just to
illustrate. Yeah.
Yep.
>> Oh yeah, great point. So the question is
uh for some questions you need to have
some element of context. So for instance
here you need for your LLM to know that
this year I don't know today's November
7th 2025. So yes, great points. Um so
there are two parts of that for that
specific question. So typically LLMs
have something that we call a
preamble. So something that tells you
about some context uh and typically the
date is something that we put in the
preamble. So for this very specific case
the date would be a piece of information that
would already be, I guess, available to
the LLM. But for some other problems, you
can very well have information or
context that you do not have. And um in
lecture 7, we will see how we can uh
kind of fetch that. And it's definitely
something that that will be very useful.
Um but for the sake of today, we will
consider this more as being an extension
of reasoning models as opposed to
something that is foundational to how
they work. But we'll respond to to this
part I guess next lecture.
Cool. And so here we were about to
illustrate what a chain of thought looks
like. And so here, instead of having,
as you mentioned, just a blanket answer,
what we want is to explain also the
reasoning.
So the way chain of thought did that was by
having some in-context learning examples
that made the reasoning explicit,
to encourage the model to also do that
before providing the answer.
So that's
the idea behind chain of
thought.
And here what we want to do is to do
that but at a much larger scale.
So I just want to build an intuition as
to why this may help us.
So LLMs they're basically trained with
this next token prediction objective.
And so the way they respond to questions
that we ask is typically to sound
plausible
in a way to maximize or optimize for the
probability of the tokens that you want
to generate to happen.
So if you have a very hard problem that
you're presenting to the LLM,
there is very little chance that that
problem appeared in the training set.
And so here the idea is to have the LLM
decompose the problem into tractable
ones
and then rely on the patterns that it
has seen during training to solve all
these more tractable problems. And it's
a little bit like uh you or when we were
students, right? Like when we give you a
problem, you try to kind of link it back
to something that you know that you have
seen at training time like during your
studying
to solve them in order to find the
answer.
So I guess that's the intuition. Um I
guess another reason that you can think
of is when you let the LLM generate more
tokens,
you're just giving it more compute. If
you think about it,
because at each generation step, you
have this whole forward pass.
And when you do that more, you just give
it more compute.
And we will see there is a term that is
dubbed compute budget,
which is, I guess, the budget that you
want your LLM to have to generate your
response, that we will see later on.
But yeah, so that also plays a role.
So, is everyone uh clear on the overall
idea of what the reasoning model is?
Yeah.
Okay. Perfect. So, just to make sure
again that we're all clear on this. So,
up until now, we had quote unquote
vanilla LLMs
that had some prompt, some question as
input, and they were responding with
something as output.
And in our case what we want is given a
question as input
we don't want to directly generate an
answer we first want to think and here I
guess we think by first outputting a
reasoning chain
and then we provide the answer.
So here the LLM output is not just the
answer, it's the reasoning
plus the answer.
And I guess yeah, I was saying that this
uh topic has been very hot and trendy
for the past year. So this is a little
bit of a preview of the timeline of
reasoning models. Um so reasoning itself
is a topic that has been studied for
more than a year, but reasoning models
started popping up
starting from OpenAI's release of
o1-preview, and this one was September of
2024,
and then after that uh you basically
have everyone just wondering how OpenAI
did it, and so you had, you know, all these
AI labs trying to kind of see what we
can do to um I guess increase the
reasoning abilities of the model. So you
would have everyone working on this, and
then you had Google's Gemini 2.0
Flash Thinking, which was, I think,
released in December,
and then starting in 2025
there was this DeepSeek-R1 paper
that was published in January, which made
a lot of noise because what they were
able to do was to match OpenAI's
reasoning performance
with a method that they were actually,
you know, describing in their paper.
So that was a big moment January 2025
and then after that you know all these
other models also had uh reasoning
abilities added. So you had some
models from xAI, from Anthropic with
Claude, and then from other labs as
well, also from Mistral. So this
timeline is not exhaustive but it's just
to show you I guess how recent these
models are and I guess one thing to have
in mind is it basically started at the
end of 2024.
Cool. So I know a lot of you if not all
of you are using LLMs every day as, you
know, chatbots like, let's say, ChatGPT or
Gemini and so I want you to know when
the model that you're interacting with
is a reasoning model. So I have a
question for you.
I was having a discussion with ChatGPT
the other day and I was wondering, is
this
using a reasoning model?
So I guess, what do you think, if
you kind of stare at the screenshots?
Yeah. Why?
Right. Exactly. So the key word here is
thinking. Actually they they put it
everywhere. But um when these UIs show
you the thinking process
so what they do is actually so they
don't show you actually the full
reasoning process and we're going to see
why. But they tell you that they spend
time thinking
and that time thinking is actually the
time that is needed to produce the
reasoning chain that can be I guess more
or less something that takes time. Uh
and so for instance for ChatGPT you have
this thinking setting that you can also
change from standard to extended, etc.
And you have this for other models as
well.
And in particular,
the quote-unquote thought summary that
you see is not the raw reasoning chain.
It's a summary of it.
And the reason why they do that, I mean,
I'm hypothesizing, is one, because the raw
chain may not be something that is
fully intelligible from a human
standpoint.
Second is because maybe you as a user
you don't want to read pages and pages.
And then third, and this is going to be what
we see later,
if you have the reasoning chains,
you can
maybe train another model
on these chains, so it can
basically mimic those abilities as well.
So these may be the reasons
why you would typically not see the raw
chain.
And when it comes to pricing: so we saw
that the output of a reasoning model is
not only the answer but also the
reasoning itself, and so you will see in,
I guess, all of these APIs that you're
actually getting charged for it. So
you're not necessarily getting all the
reasoning chain, but if you look at
the docs, there's always, I guess, a
sentence or two that
specifies that you're actually being
charged for output tokens, and those
output tokens also include
reasoning tokens. So I think that's also
a good thing to know.
And I guess from a user standpoint
this also gives an incentive to have the
maximum reasoning ability for a
minimum number of reasoning tokens,
because you don't want to pay a lot,
right? So, we're going to see later on
how we can deal with that, but this is
just one thing that's good to note.
Cool. So, we talked about what reasoning
was or at least tentative definition. We
saw I guess the core idea behind
reasoning models. So now we're going to
talk about some of the benchmarks that
people use to quantify reasoning
abilities.
So the first one as I previously
mentioned is about assessing the coding
abilities.
And so here the goal is typically to
solve a coding problem or to fix a bug.
So you typically have the following
setup.
So given a problem,
you want to produce a solution
that
is able to pass test cases.
So if you have a solution that
passes all test cases, then you have a
solution that works.
And this is our way of verifying that
a response that was generated is
actually a correct one.
So that's for coding.
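The verification setup just described can be sketched in a few lines: take a generated solution, run it against the test cases, and count it as correct only if every test passes. The helper name `passes_all_tests` and the toy `add` problem are made up for this illustration, and calling `exec` on generated code is only acceptable in a toy; a real benchmark harness would run candidates in a sandbox with time limits.

```python
def passes_all_tests(candidate_source, func_name, test_cases):
    """Return True only if the candidate passes every (args, expected) test case.

    candidate_source: generated code as a string (hypothetical model output).
    func_name: name of the function the problem asks for.
    """
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe; shown only as a sketch.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(passes_all_tests(good, "add", tests))
print(passes_all_tests(bad, "add", tests))
```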
Um okay so just giving you an overview
of the kinds of benchmarks that are out
there. So we have HumanEval, which is
the screenshot here, which I believe
is a set of 100-something coding
problems that were human-written, which
is why it's called HumanEval. But you
then have Codeforces, which is coming
from a website of competitive
programming that some of you may know,
and there is also SWE-bench, which is, I
believe, a set of problems that were
derived from GitHub issues. So these are
like real practical problems. So you
will see that these reasoning models
in the reports they typically have
results along these benchmarks.
So this is for coding.
Now for math
the goal is to you know solve a problem
and the way it works here is given a
problem you want the answer and of
course you're letting your model you
know generate some reasoning.
And so here what you would do to verify
that your solution works is parse
the answer,
like so, and then compare it with
some ground truth.
So here you can really know if the
answer that your model is producing is
correct by comparing the two.
So you will ask me, okay, how do you parse
that? So you can force your model, and by
force I mean in the prompt, to output the
answer in a way that can be parsed.
Sometimes people just put it in
some box, in boxed brackets. So there's
a number of different ways to do that,
and this is what the problems look
like. So you have some problem and
then you can let your model uh you know
generate some reasoning chain along with
the answer and then you have the answer
here that you can compare to. Um so I
just want to add that um the kinds of
benchmarks that are out there are based
typically on problems that are actually
not that trivial. So a name that you
will see a lot is AIME.
So I'm not sure if you're familiar with
it. It's a math exam to qualify for
the US math olympiad.
So there is a bunch of models that,
I guess, quantify their performance
based on that. There's also GSM8K, which
is a set of grade-school math problems.
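The parse-and-compare check for math answers described a moment ago can be sketched as follows, assuming the prompt asked the model to wrap its final answer in a LaTeX-style `\boxed{...}`. That convention is one common choice, not the only one; the function names are made up for this sketch, and the simple pattern below does not handle nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in the text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(model_output, ground_truth):
    """Exact string match between the parsed answer and the ground truth."""
    answer = extract_boxed(model_output)
    return answer is not None and answer == ground_truth.strip()

# Hypothetical model output: a reasoning chain followed by the boxed answer.
output = (
    "The bear was born in 2020. In 2025 its age is 2025 - 2020 = 5. "
    "So the answer is \\boxed{5}."
)
print(extract_boxed(output))
print(is_correct(output, "5"))
```

Exact string matching is the simplest comparison; graders for real benchmarks often normalize answers first (for example, treating `5` and `5.0` as equal), which this sketch leaves out.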
And I guess now your question may be, okay,
it's great we have all these benchmarks
but what is the metric that you will use
to quantify that what you're doing is
good or not
and so there is one metric that you will
see a lot and we're going to see that
right now because it's not that obvious.
So that metric is called pass at K, usually written pass@k.
So pass at K is by definition a metric
that tries to estimate the probability
that at least one of K attempts
succeeds.
So here what we're saying is let's
suppose we're I don't know having some
coding problem. We tell our model to
generate K answers
and pass at K is the probability that at
least one of these K answers
passes
the tests.
Does that make sense?
So, by the way, why would you have pass
at K? Like why why would that make
sense?
Okay, so it's not super trivial. Um, so
you may have use cases where
you can afford to spend more time
to generate more answers.
If you know that you will get more
chances to get it right,
then you can spend more time to generate
more answers so that you can can have
like the right one.
So in a problem like coding,
the good thing is you can check whether
your answer is correct or not.
And so here the idea is if you are in
cases where you can afford to generate
not just one answer but multiple answers
then it may be worth it to just spend
more time spend more compute to generate
more answers if it means that the
probability of getting it right in one
of them is going to be higher.
So there is one technique that we saw
last lecture that
is kind of similar to this: best of N.
Do you remember?
Yeah, roughly. Okay. So best of N is this
method that we saw last week, which
was: you generate N answers and you use
your reward model to score everything
and you take the best of all of
them. So here you can think of it as
being kind of the same
just that we do not have a reward model.
We have a deterministic verifiable way
of checking whether something is
correct.
So it's very similar to that.
And okay so now we're going to spend
just a few minutes just um aligning our
understanding on
um how we can estimate such a quantity.
Yep.
That's a great question. So question is
uh do you uh choose a particular
temperature? So I have a slide on that.
So I I'll just hold your question for a
few slides. Cool. Wait. By the way, if
there are any other questions, you know,
happy to. Are there any other questions on
this?
Everyone is super clear. Okay. Cool. So
what we want to do here is to estimate
this probability.
But you may tell me okay just generate k
attempts and just count the number of uh
of the attempts that are um correct to
estimate that.
But the problem is if you only do this k
times your estimate can be very noisy
because by pure chance, I don't know, you
may have one run that has, I
don't know, three correct out of five and
another that has one. So you want to have
an estimate that doesn't have that much
variance. So what you do is you
typically generate not k but n answers
and out of these n answers
you are going to have c of them that are
going to be successful and then n minus
c that are not going to be successful.
And the question here that you ask
yourself
is: out of these n attempts,
you want to quantify
the probability of having at least
one out of k attempts that is right. So
the question here if you have those n
observations
if you were to select let's say k
what is the probability of having at
least one of those k passing?
That's the question.
So we're short on time. Uh I'm going to
actually uh derive that right now. Um,
and the reason why I want to do that is
because the answer is not necessarily
trivial and I don't want you to be
too, you know, surprised by what
it looks like without kind of knowing
where it comes from.
So we're going to derive that. So we
want to derive the probability that at
least one attempt out of k that we
select out of these n samples
is correct.
So in order to do that I'm just going to
write that down. So pass at k. So we want
the estimate for that.
So, it's going to be the probability
that at least one attempt
out of K
is correct.
So if you've done some probability there
is like a very common trick which is if
you have a probability of at least one
it's one minus the probability of having
all of them being incorrect.
Do you all know that trick? Okay. So
we're going to use that trick. So it's
going to be one
minus the probability
of having all
k attempts
uh incorrect.
So far so good. Yeah. Okay. So now we're
going to try to quantify this
probability.
So what is the probability of you
finding
an unsuccessful attempt among these n?
You have n minus c unsuccessful ones,
you have n observations,
so it is going to be
n minus c
over n.
Right? So it's your
first unsuccessful attempt.
So now what is the probability that you
have another unsuccessful attempt
knowing this one?
So you have n minus c minus one other
unsuccessful attempts
over
n minus one because you already took one
and you continue that k times, multiplying these terms, down to
n - c - k + 1
over n - k + 1.
Does everyone agree with me?
So here we are quantifying the
probability that, if we take k
samples randomly among these n, all
k of them are incorrect.
So here
we compute this by computing the
probability that the first attempt is
incorrect, and then that the second one is
incorrect knowing that the first one was
incorrect, and so on,
for which we have this formula.
So if you uh saw that in probability
class, it's basically the same as
sampling without replacement.
And now the nice thing with this is that
you can express that with a nice
mathematical notation. So for people who
know
so n choose k
is equal to factorial n over
factorial k times
factorial n minus k. So this is by
definition of what this quantity is
about. So we're going to use that here.
So here we have
a series of products. So here we can
express that as a factorial over another
factorial. So the numerator is factorial n
minus c
over factorial
n minus c minus k.
And then the denominator we can also
express with factorials.
So it's factorial n over
factorial n minus k.
Everyone agrees with the math here?
I take that as a yes.
And so here we're going to somehow
pop the expression above up. So it's
going to be equal to 1 minus
n minus c factorial over n minus c minus k
factorial
times k factorial,
and then here we're going to divide by
n factorial over n minus k factorial
times k factorial. So here I did a
trick. What I did was
I basically divided the numerator and
the denominator by k factorial.
And so here I have 1
minus n minus c
choose k
over
n
choose k.
So this
is our estimate
of pass at k.
Does everyone agree?
Yeah.
Okay.
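Written out, the derivation above reads as follows. The symbols are the ones used in the lecture: n generated samples, c of them correct, and k samples drawn without replacement. The only added convention is that the binomial coefficient is taken to be zero when n - c < k, in which case the estimate equals 1.

```latex
\begin{align*}
\text{pass@}k
  &= 1 - P(\text{all } k \text{ selected attempts are incorrect}) \\
  &= 1 - \frac{n-c}{n}\cdot\frac{n-c-1}{n-1}\cdots\frac{n-c-k+1}{n-k+1} \\
  &= 1 - \frac{(n-c)!\,/\,(n-c-k)!}{n!\,/\,(n-k)!} \\
  &= 1 - \frac{\dfrac{(n-c)!}{k!\,(n-c-k)!}}{\dfrac{n!}{k!\,(n-k)!}}
   = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
\end{align*}
```

For k = 1 this gives 1 - (n - c)/n = c/n, the fraction of correct samples.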
So, the reason why I derived it is
because this formula can be a
little bit daunting to look at, but it's
actually quite natural because it's
only using some considerations about
sampling without replacement.
And so, you will see in papers that
people have this pass at K metric. So if
you were to compute it, this would be
the formula to use.
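As a small sketch of how one might compute this estimate in practice, the formula 1 - C(n-c, k) / C(n, k) maps directly onto Python's `math.comb`. The function name and argument checks are choices made for this illustration; `n` is the number of generated samples, `c` the number that passed, and `k` the number of attempts in pass at K.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Implements 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case:
    # any k samples must then contain at least one correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 is the plain success rate, pass@2 is higher.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 2))
```

For large n, a product form of the same quantity avoids huge intermediate integers, but the direct version is fine at the sample counts used in this sketch.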
Cool. And a special case of pass at K is
pass at one,
and pass at one is defined as the
probability of a single attempt
succeeding.
So this is something
that you commonly know.
So suppose you have a model that
produces an answer: what is the
probability that it is correct?
So if you replace K with one you will
see that this formula will simplify a
lot and it will be just a proportion of
successful attempts which also makes
sense just intuitively.
Cool. Everyone clear with this?
Yeah.
Great. Um so now to your question um so
what do you do with the temperature and
you're exactly right
when you generate these solutions
multiple times something that will
influence your results is how diverse
the solutions that you will generate
will be and this is something that you
indeed
use the temperature for.
Now, okay. So, what is the relationship
between temperature and pass at K? Do
we want a low temperature? Do we want a high
temperature? Or do we want something
in between?
Well, if you take a very low
temperature,
you know that your solutions will be
good, but they will not be diverse.
So if you increase the number
of samples that you generate, that
quantity would not change, which is seen
in this graph with T equal to 0:
it doesn't change. But then when you have
T equal to 0.2, now it increases
because you have some diversity.
But if you increase this temperature too
much,
then you will have the diversity that
you want, but it will also harm the
performance of your predictions, because
maybe the tokens that were not that
likely would become likely. And so here
on the opposite side, we have, I don't
know, T equal to 1.2 here, which is extreme in
this example and which is not the
best. So I guess to respond to you, a
typical temperature choice would be
something that is not too large, not too
small, and I believe in this example it is T
equal to 0.8
that I think produces the best results
towards the bigger sample counts, but then here
I guess it's not clear, maybe 0.4.
so that's why you will see that in
papers
people always specify the temperature
that they choose to produce the
benchmark results.
So that's why it's good to have
that in mind.
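To see why temperature controls diversity, here is a minimal sketch of temperature-scaled softmax over next-token logits. The logits are made-up numbers and the function name is an assumption; the temperatures 0.2 and 1.2 are the ones mentioned for the graph. T equal to 0 corresponds to always picking the most likely token, which this function does not handle since it would divide by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T = 0 means greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

# Low temperature: almost all of the mass sits on the top token (little diversity).
print(softmax_with_temperature(logits, 0.2))
# High temperature: unlikely tokens receive a much larger share.
print(softmax_with_temperature(logits, 1.2))
```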
Cool. So these are the main metrics that
people use but not the only ones.
There's another one that you may see,
which is called consensus at K,
which takes the answer that
appears the highest number of times
among your K generations,
and you can think of the
self-consistency technique as being very
related to that.
Um so yes so you may see that and then
the other metrics that you will see in
those benchmarks are typically the ones
that you're familiar with uh like
accuracy exact match and so on.
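A majority vote of this kind can be sketched with a counter over the parsed answers. The function names are made up for this sketch, the answers are assumed to be already parsed into comparable strings (with `None` standing for a generation whose answer could not be parsed), and ties are broken in favor of the answer seen first, which is an arbitrary choice.

```python
from collections import Counter

def consensus_answer(answers):
    """Return the most frequent parsed answer among the generations, or None."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    # most_common keeps first-seen order among equal counts.
    return counts.most_common(1)[0][0]

def consensus_correct(answers, ground_truth):
    """True if the majority answer matches the ground truth."""
    return consensus_answer(answers) == ground_truth

# Five hypothetical generations for the same problem.
answers = ["5", "5", "6", None, "5"]
print(consensus_answer(answers))
print(consensus_correct(answers, "5"))
```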
Cool. Okay. I took a lot of time on this
first part. I think I need to go a bit
faster. But does that roughly make sense
so far?
Yeah. Okay. So right now I hope you know
what a reasoning model is. So now we're
going to go into the most interesting
part of the lecture which is how can we
build such a reasoning model. So we know
we want our reasoning model to produce a
reasoning chain but right now we don't
really know how to do that at scale
and this is going to be the main focus
in this part. So here the idea is to
somehow incentivize the model to produce
a chain of thought a reasoning chain
before answering.
So the problem with that is that
writing reasoning chains is a very tough
task, especially for I guess long
reasoning chains,
and for that let's suppose if we were to
go look into our toolbox of the
techniques we have learned so far you
know maybe we can think okay we can use
SFT to do that but the problem with SFT
is you need high quality data and in
particular you would need to write all
these reasoning chains.
Let's suppose you do not have these
reasoning chains. So how do you do that?
You would typically write them from
scratch but that is very hard.
So that's one uh I guess one
vote to not do SFT if we don't have any
reasoning chains.
So the second reason here is or the
second fact is that the way the model
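Key Summary
This lecture on LLM reasoning covers the concept of a reasoning model and how to train one. It focuses on RL-based approaches, the GRPO algorithm in particular, that apply chain of thought at scale so that the model learns to break complex problems into steps and solve them.
Main Concepts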
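Limitations of Vanilla LLMs 08:50
- Limited Reasoning: weak at solving complex math/coding problems
- Static Knowledge: no information after the training data cutoff date
- All Talk No Action: cannot take real actions (placing orders, calling APIs, etc.)
- Hard to evaluate: output is free-form text, so traditional metrics such as BLEU/ROUGE are a poor fit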
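Definition of Reasoning 13:55
- Definition: the ability to solve a problem through multi-step reasoning
- Knowledge vs Reasoning: "What is CME295?" (knowledge) vs "How old is a bear born in 2020 in the year 2025?" (reasoning)
- Mostly applied to math and coding problems, but can be extended to other domains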
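Chain of Thought at Scale 15:55
- LLMs are trained with next-token prediction, so they answer "plausibly"
- Complex problems rarely appear in the training data, so they are hard to solve directly
- Key idea: decompose the problem into tractable subproblems → solve them with learned patterns
- Compute budget: more generated tokens = more forward passes = more compute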
Structure of a Reasoning Model 21:58
- Vanilla LLM: Question → Answer
- Reasoning Model: Question → Reasoning Chain → Answer
- The output is "Reasoning + Answer" rather than just the answer
Reasoning Model Timeline 22:23
- 2024.09: OpenAI o1-preview released (the starting point)
- 2024.12: Google Gemini 2.0 Flash Thinking
- 2025.01: DeepSeek R1 - matched OpenAI's performance and published its methodology (the big moment)
- Afterwards: xAI, Anthropic Claude, Mistral, and others added reasoning capabilities
Test-Time Scaling 24:30
- Train-Time Scaling: bigger models, more data, more compute
- Test-Time Scaling: spend more compute at inference time (a new paradigm)
- Even the same model can perform better when given more time to reason
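Pass@K Metric 32:28
- Definition: the probability that at least one of K attempts succeeds
- Similar to Best-of-N, but uses a verifiable reward instead of a reward model
- Coding: whether the test cases pass / Math: whether the answer matches the reference answer
- Estimation formula: Pass@K = 1 - C(n-c, k) / C(n, k) (when c out of n samples succeed)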
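The estimation formula above can be sketched directly with binomial coefficients. The function name and the example numbers are illustrative assumptions, not taken from the lecture.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@K = 1 - C(n-c, k) / C(n, k) from n samples with c successes."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, i.e. every size-k subset has a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 samples generated, 3 of them pass the checker.
for k in (1, 5, 10):
    print(k, pass_at_k(n=10, c=3, k=k))
```

The ratio C(n-c, k) / C(n, k) is the probability that a random subset of k samples contains no successful one, so one minus that ratio is the chance of at least one success.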
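Why RL for Reasoning? 50:40
- Math/coding have a verifiable reward (whether the answer is correct is unambiguous)
- Hard to learn from scratch with SFT alone (high-quality reasoning data is scarce)
- Solution: use RL so that the model learns reasoning patterns on its own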
RL Reward Design 51:30
- Format Reward: whether the expected tokens are present (encourages generating a reasoning chain)
- Correctness Reward: correctness of the final answer (verifiable)
- No reward model needed - both can be checked with rules
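A minimal sketch of the two rule-based rewards. The summary does not name the delimiter tokens, so the `<think>`/`</think>` tags, the exact string match, and the equal weighting of the two rewards are all assumptions made for illustration.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>, followed by the final answer.
FORMAT_PATTERN = re.compile(r"\s*<think>.+?</think>\s*\S.*", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion has a non-empty reasoning block and then an answer."""
    return 1.0 if FORMAT_PATTERN.fullmatch(completion) else 0.0


def extract_answer(completion):
    """Take whatever follows the last closing tag (or the whole text if no tag)."""
    _, _, tail = completion.rpartition("</think>")
    return tail.strip()


def correctness_reward(completion, reference):
    """1.0 if the extracted final answer exactly matches the reference string."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0


def total_reward(completion, reference):
    # Assumption: the two rewards are simply added with equal weight.
    return format_reward(completion) + correctness_reward(completion, reference)


good = "<think>2025 - 2020 = 5</think> 5"
no_reasoning = "5"
print(total_reward(good, "5"), total_reward(no_reasoning, "5"))
```

Both checks are plain string rules, which is why no learned reward model is needed. For math, a real checker would likely normalize answers before comparing; for code, it would run test cases.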
GRPO (Group Relative Policy Optimization) 1:10:20
- Proposed in the DeepSeekMath paper
- Key difference: computes the advantage without a value function
- Method: generate G completions for the same prompt → compare each completion's reward within the group
- Advantage = (Ri - mean(R)) / std(R) (relative comparison within the group)
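The advantage formula above can be sketched in a few lines. Using the population standard deviation and adding a small `eps` to avoid division by zero are assumptions of this sketch, and the rewards are made-up numbers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Compute (R_i - mean(R)) / std(R) for the rewards of one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for G = 4 completions of the same prompt.
rewards = [2.0, 0.0, 1.0, 1.0]
print(group_relative_advantages(rewards))
```

Completions that beat the group average get a positive advantage and the others a negative one. If all G completions receive the same reward, every advantage is zero and the group gives no learning signal.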
GRPO vs PPO Comparison 1:10:40
| Item | PPO | GRPO |
|---|---|---|
| Frozen Models | Reference + Reward Model | Reference only |
| Trained Models | Policy + Value Function | Policy only |
| Advantage | Reward - Value | Relative comparison within the group |
| Use | Preference Tuning | Reasoning Training |
GRPO Loss Components 1:11:10
- In common: policy ratio (π/π_old), clipping mechanism
- Difference: GRPO includes the KL divergence explicitly in the loss
- PPO folds the KL into the advantage computation
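A minimal per-token sketch of how these components fit together on the GRPO side: the policy ratio, the clipping, and an explicit KL term. The function name, the hyperparameter values, and the particular non-negative KL estimator used here are assumptions for illustration, not the lecture's exact formulation.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus an explicit KL penalty, for a single token."""
    # Policy ratio pi / pi_old, computed from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the min keeps the more pessimistic of the two surrogate terms.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Assumed KL estimator toward the reference policy: r - log(r) - 1, r = pi_ref / pi.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - beta * kl


# Hypothetical log-probabilities of one token under the three policies.
print(grpo_token_objective(logp_new=-1.0, logp_old=-1.2, logp_ref=-1.1, advantage=0.5))
```

This quantity is maximized, so the training loss would be its negative averaged over tokens and completions. In the PPO setup from the previous lecture the KL penalty enters through the reward, so it does not appear as a separate term like `beta * kl`.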
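Thinking Budget Control 55:20
- Not every problem needs the same amount of thinking
- Dynamic Budget: a classifier judges the difficulty of the problem
- Budget Forcing: a "wait" token makes the model think longer, a "time's up" token pushes it to finish
- Continuous Thoughts: think in hidden representations instead of tokens (a more compressed form)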
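A toy sketch of the budget forcing idea, with no real model involved. The `step_fn` callable, the literal strings "Wait," and "Time's up.", and the step-based budget are all hypothetical stand-ins chosen for illustration.

```python
def budget_forced_reasoning(step_fn, prompt, min_steps, max_steps):
    """Extend or cut short a reasoning trace according to a step budget.

    step_fn(text) is a hypothetical stand-in for the model: it returns
    (next_chunk, wants_to_stop) given the text generated so far.
    """
    text = prompt
    for step in range(max_steps):
        chunk, wants_to_stop = step_fn(text)
        text += chunk
        if wants_to_stop:
            if step + 1 < min_steps:
                # Below the minimum budget: append "Wait," to make it keep thinking.
                text += " Wait,"
                continue
            return text
    # Budget exhausted: append a closing cue to push the model to finish.
    return text + " Time's up."


def eager_stub(text):
    """Toy model that tries to stop after every chunk."""
    return " step.", True


def endless_stub(text):
    """Toy model that never stops on its own."""
    return " step.", False


print(budget_forced_reasoning(eager_stub, "Q:", min_steps=3, max_steps=5))
print(budget_forced_reasoning(endless_stub, "Q:", min_steps=1, max_steps=2))
```

In a real system the budget would more likely be counted in tokens, and the model would be prompted again after the cue to produce its final answer.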
Length Optimization 1:16:20
- Output length tends to grow as RL training progresses
- Longer reasoning = higher cost (for both users and providers)
- Research on adding a length reward for efficiency is ongoing
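Key Insights
- Reasoning model = chain of thought applied at scale + RL-based training
- Test-Time Scaling: more inference-time compute improves performance (a new scaling paradigm)
- GRPO: computes the advantage by comparison within the group, without a value function - simpler than PPO
- RL is especially effective in domains with a verifiable reward (math, coding)
- DeepSeek R1 helped democratize reasoning research by publishing its methodology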