Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM Tuning
Cool.
Hello everyone and welcome to lecture 5
of CME 295.
So first of all, thank you all for
taking the time to take the midterm last
week. So I hope it was reasonable for
you. Um so for those of you who are
auditing in case you are interested in
taking the exam just know that the exam
and the solutions are both posted on the
website,
and so now you know a little bit what
the exam looks like. So for the final
we'll have the same format, except that
the content will be on lectures 5, so
this one, up until 9.
So I'll go down that point.
Cool. Uh great. So with that we're going
to start the lecture. So today we're
going to talk about LLM tuning.
So as usual we're just going to recap
what we saw last time. Um, so last time
was already two weeks ago, but we talked
about how to train an LLM. And in
particular, we've looked at two
important steps. So the first one was
called pre-training
where you're basically taking a model
that has been initialized and you're
trying to teach the model about
language, about code. So you have this
step that is very time consuming,
expensive
um compute heavy that is happening on a
lot of data. So we've seen training
optimizations on how to make that
happen. So we've seen uh techniques to
parallelize that across GPUs. So we've
seen data parallelism methods, and in
particular ZeRO, so the variants ZeRO-1, 2, and 3,
and we've also seen very quickly what
model parallelism was in this case. So
at the end of this step what you obtain
is a model that knows about the
structure of language, about code,
basically all the text that it has been
fed. But what this model can do is only
predict the next token.
So it's a great autocompleter,
but it's not a helpful model yet, which
is why we have a second step. And we saw
uh here it's typically called
fine-tuning or SFT for supervised
fine-tuning.
So here what we do is we take our
pre-trained model and we train it for
specific tasks.
So nowadays you have, you know, ChatGPT
and all these chat assistants.
So this can be one application. So
transforming your model into an
assistant. And so here the goal is to
teach the model how to behave.
So the model already knows what language
is, what code is, etc. And you're just
trying to make it behave like the use
case that you're trying to tune it for.
So typically here what you have is a
data set that is much smaller in scale
but of much higher quality
and you're basically taking your
pre-trained model and
teaching it exactly which tokens to
predict with the next-token prediction
task.
And we've seen LoRA, which is a
parameter-efficient method which does
not tune all the weights but in a clever
way introduces low rank matrices that
are the ones that are being tuned
and we had stopped there. So what we're
going to see today is how to align the
model to align with what we call human
preferences.
And so here we're taking our model that
has been fine-tuned for a specific task
and trying to align our model to um make
it more something that a human would
like or that some metric that we're
defining would be more aligned with
that.
So as an example, let's suppose you
have an assistant at the end of
step two. So it's very much possible
that your assistant is, you know, behaving
the way you want, but not, let's say, with
the tone that you want. So it's not,
let's say, friendly, or it's not, you
know, safe. So you want to tune those
aspects in that third step.
Cool. So this step is called preference
tuning
and we're going to exactly see what that
is. So here for context let's suppose
that we have an SFT model and so by SFT
model I mean a model that has gone
through the pre-train stage and the
finetuning stage. And so for instance we
may ask our model uh to suggest a new
activity we could do with our teddy
bear. And here the model let's say
responds with I would suggest you to not
spend much time with your teddy bear at
all. So this is the response of an
assistant, but it's not necessarily
aligned with what we want.
So the idea here is to take these quote
unquote bad outputs and find
an output or rewrite an output that we
would want to have instead.
So this pair would be what we call
a preference pair.
So in other words given this prompt we
have two responses.
One that we want to see which is this
one I'm going to read in a second and
then the other one that we do not want
to see.
So for instance, in this example the
answer we want to see is, you know, "Of
course! Teddy bears not only make awesome
companions for delightful sleep but
can also be great buddies for fun
activities," and then, you know, it just
suggests some activities.
Does the setup sound good?
Yeah. So long story short we want to
align our model with human preferences.
So you may ask, you know, we have this
fine-tuning step already. Why would we
want to have a third step? Well, during
the second step, which is the
fine-tuning stage, if you remember, what
we did was construct a very high quality
data set of the kinds of prompts on
which we want our model to behave in a
certain way. So in order to compose such
a data set, it's actually something that
is very time consuming and very actually
difficult because the data set must be
of very high quality. And in this case,
what we're doing is not really teaching
the model exactly what it should
generate,
but rather telling the model what kind
of output it should prefer.
So we're less in a you know please
generate that kind of thing and more in
a you know I prefer this option kind of
thing.
And so typically, I don't know, if we
asked you uh write a poem, write a great
poem from scratch, it would typically be
much more difficult to do rather than
just showing you two poems, one bad poem
and one great poem, and ask you to just
say which one is better.
So in order to obtain the data sets,
it's already much easier.
The second reason is so during the SFT
stage when we compose our data set of
very high quality there is one aspect
that we really try to get right which is
the distribution of prompts
and what I mean by that is if we have
too much of a given kind of prompt our
model will be more biased towards
responding in that particular way. So
what people try to do is to be careful
about the distribution of prompts that
are in the SFT data.
And so here, let's say if our model
misbehaves, if we were thinking about
just adding one example in the SFT data
set, well, we would have to be very
careful about which prompt we're adding
and whether it's not going to bias the
model like too much in that direction.
So that's the second reason. And then uh
the third reason is what I mentioned
which is uh the SFT data is typically
very high quality. So if you're um
looking at all the missteps that your
model is doing and trying to put that in
the SFT data, you're just, you know,
have a hard time takes a lot of time. Um
but one note here
is if your SFT model is misbehaving a
lot, it may also be due to the fact that
your SFT data set has some problem.
So preference tuning is not the answer
to everything. Maybe it's better
sometimes to just check your SFT data
set for some issues.
That sound good? Yep.
Yeah. So, to do that in the preference
tuning stage? Okay. So the question is,
we saw LoRA for SFT. What is the
equivalent for preference tuning? So
we'll see that later in the
lecture, but you can think of LoRA as
being a way to reduce the
number of parameters that you need to
tune. That is separate from the
objective function that you're using to
train your model. And preference
tuning you can think of more as a
different objective function, but it
may very well be something that you also
use LoRA for. So the two are not
incompatible.
>> Yes,
>> but it will become more clear later on.
But yeah, great question. Uh one last
thing I will add here is um another
difference with the SFT stage is that
preference tuning allows us to inject
some negative signal
because SFT is all about teaching the
model about like what it should predict
but it does not teach the model about
what it should not predict
and we will see that preference tuning
allows you to inject some negative
signal
Cool. So to start, of course, we need to
have our preference pairs. And so we're
going to look at this uh data collection
step first.
So here's the setup.
You have a prompt. Let's say write a
poem and you have a given response which
is the poem that is being generated by
the model.
You have a few ways to construct your
preference data.
So either you start with kind of a
pointwise mindset where you score each
proposed poem with some kind of
pointwise score. And here pointwise
score means a score just relative to one
observation.
You could very well do that, but I will
say it's kind of hard.
It's kind of tough as a human to say,
okay, this one is, I don't know, 0.9,
this one is like a 0.2. It's not super
clear exactly how you would scale
that.
The second idea is for you to get two
observations at the time and for you to
say which one is better. So this one is
called pair-wise preference data and
it's much easier.
And then the third one is listwise which
is you know you get a list of let's say
n poems and then you just rank you know
which one is best which one is uh worse
etc etc and I guess this one is easier
than pointwise because you don't have to
specify I guess how much better it is
but I guess it's still a bit more
complicated
and that's the reason why people
typically use pairwise.
So what they do is they collect
pair-wise preference data meaning for
each prompt they have two possible
answers and then
they just specify which one is better
and so that's the one we'll continue um
this lecture with. Okay, so now you may
ask, okay, great, pairwise preference
data, but how do you get it?
Well, here is the recipe. So, in order
to generate a pair of responses, so
first you need a prompt and we've seen
um you know, previously I think it's
lecture probably three um that what you
could do was to generate different
answers if you have a temperature that's
positive. So typically what people do is
they put this prompt let's say twice
into the model with a positive
temperature and then they can get two
different answers.
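The sampling step described here can be sketched in a few lines. This is only an illustration under stated assumptions: the four-entry `toy_logits` vector stands in for a real model's next-token scores, and the helper names are made up for this example. It shows why a positive temperature lets the same prompt yield different draws, and that a lower temperature concentrates probability on the top token.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive to sample")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical next-token logits over a 4-token vocabulary (not from the lecture).
toy_logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)
draws = [sample_token(toy_logits, temperature=1.0, rng=rng) for _ in range(20)]
```

Calling the sampler twice on the same prompt, token by token, is what gives the two candidate responses that later get compared.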
The prompt is typically something
where we want it to follow the
distribution of what the users typically
ask. So the prompt X can be something
that we obtain from the logs or from
let's say a desired set of prompts.
And then what we do is we have these two
observations. So the first one is the
prompt and the first response. The
second one is the prompt and the second
response. And what we do is we rate we
compare them.
So we can compare them with of course
human ratings
but we can also compare them with some
other metrics.
So I'll just list a few. Uh so we have
LLM as a judge which you may have heard
of which we have not seen yet which we
will see in a few lectures. That is also
typically used to just compare
I guess how much better one observation
is compared to another.
We can also use some other metrics,
rule-based ones like BLEU, ROUGE, etc.,
although they are not used as much these days,
and the simplest way to compare these
two observations is to have a binary
setting
where you say okay is response one
better or worse than response two. But
you could also think of a more nuanced
scale. Meaning you can also say okay
response one is much better, better,
slightly better, slightly worse, worse
or much worse. So this is also something
that you can do. Um but there are some
challenges with that approach
because for instance if you uh I don't
know consider human ratings a lot of
tasks are a little bit subjective.
So in a lot of cases, actually, what
people do is
have a pairwise preference data set
on the binary scale.
So only: is it better or worse?
Does that sound good?
Yeah. Um okay. So another way to obtain
that data is
to have to find in your logs a response
that you did not like
to take that response and to rewrite it
which is basically what we did here. So
here when we have the response here what
we did is take the response and rewrite
rewrite a good one.
So this is also what people do but of
course it is a bit more involved because
you need to you know generate and I told
you that generation was uh kind of
costly and tough but this is also
possible.
Does the data collection make sense?
Yeah. Cool. Okay. So now we have our
preference data and what we want is to
align our model to prefer responses that
were preferred by the rating and, I guess,
downweight the responses that were
not preferred.
And so in order to do that we will see a
method that's called RLHF.
And we'll see that in
more detail in a second. But as the
name indicates, RLHF
relies on RL.
So I'm just going to start with some RL
basics. So do we have any RL experts
here?
Yeah. No. Okay. So no need to be an RL
expert. So don't worry, we'll go really
slowly on that.
So in the RL world, what people do
is they have an agent, and here I mean
agent in the RL sense, which interacts
with an environment.
So what does it do? It is at a given
state at let's say time t. It can take
an action
at time t and it takes that action
according to some policy.
The policy is typically denoted π_θ(a_t | s_t),
so pi theta of the action given the state. So
what that policy gives you is simply
the probability of you taking
an action given a state.
Easy, right? And so given that the agent
takes some action,
it also receives some rewards.
So sometimes it's a good reward,
sometimes it's bad rewards.
So what we're going to do is to leverage
that mindset
for our preference tuning exercise.
So let's see together how we can
transpose these quantities in the LLM
world.
So what is the agent?
The agent is the LLM.
So in terms of uh the state that it is
in so it's simply the input that it has
so far.
So the action that it wants to do is to
predict the next token. So it basically
always wonders okay what is the next
token given this input
and this action or this next token is
among the set of tokens that there are
out there. So if you want you know the
environment can be the set of tokens of
your vocabulary
and in order to decide which token
should come next.
The LLM determines that using the
probability of next token which is
obtained by you know when you do your
forward pass and when you look at uh the
probability distribution as an output
which is basically our policy.
So our policy here is simply equal to
the output of the LLM given the input in
order to determine the next token.
So far so good. And now we're adding one
additional thing. So I told you, you
know, we're composing our preference
data sets and we want to know which
output is better than the other output.
And so we're going to use that for a
reward. We're going to somehow use that
for a reward and we will see how we will
use that.
So I'm going to just recap that part. So
we have our LLM take some inputs and
then it wants to predict
the next token. It wants to take this
action. So in order to do that, it uses
the probability distribution that it
outputs
and then the token that it predicts or
the output that it generates then
receives some reward that is going to
feed back into tuning the agent. And
here's the LLM. Yep.
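To make the mapping concrete, here is a toy rollout loop. Everything in it is a stand-in chosen for illustration: `toy_policy` returns a uniform distribution where a real LLM would run a forward pass, and `toy_reward` is an arbitrary scoring rule rather than a trained reward model. The point is only the shape of the loop: the state is the input so far, the action is the next token, and the reward arrives once for the whole completion.

```python
import random

VOCAB = ["play", "hide", "and", "seek", "<eos>"]

def toy_policy(state):
    """Hypothetical stand-in for pi_theta(a_t | s_t): a distribution over VOCAB."""
    # Uniform here; a real LLM would compute this from `state` with a forward pass.
    return [1.0 / len(VOCAB)] * len(VOCAB)

def toy_reward(prompt, completion):
    """Hypothetical scalar reward for the whole completion, not per token."""
    return 1.0 if "play" in completion else -1.0

def rollout(prompt, rng, max_new_tokens=8):
    state = list(prompt)               # s_t: the input so far
    completion = []
    for _ in range(max_new_tokens):
        probs = toy_policy(state)      # policy output for the current state
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # a_t: the next token
        if action == "<eos>":
            break
        completion.append(action)
        state.append(action)           # the next state includes the chosen token
    return completion

rng = random.Random(0)
prompt = ["suggest", "an", "activity"]
completion = rollout(prompt, rng)
reward = toy_reward(prompt, completion)  # one signal for the full completion
```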
So the question is wouldn't that be
expensive if we do uh this for all
pairs? Why would that be expensive?
Mhm.
Uh yeah. So
yeah, the question is what do you do in
terms of how expensive that is? So yeah,
typically people take batches for
instance. Um but I would not so it is
indeed expensive but we're going to see
a little bit some order of magnitude and
how that works. But you can think of
this as uh being um kind of a training
procedure that could be seen as as
expensive as some other training
procedure. There's nothing that makes
this more expensive.
>> Okay,
>> but we're going to see exactly. So I
think you're there are some points that
you're you know touching correctly. Uh
there's some parts that are added
compared to the regular you know
supervised let's say supervised fine
tuning uh setting and we're going to see
that in a second.
So the question is, is the reward
that you're getting from here, I guess,
powerful enough for the LLM to change?
Well, you will see some descriptions on
the internet characterizing this
training procedure as, you know,
not nearly as many signals as what you
would get for SFT, let's say, because for
SFT you're literally always taking a
partial input and making the LLM learn
how to generate the next token. But
here you're only getting roughly one
signal per completion. So yeah,
definitely it's more sparse, and that's
why you will see that RLHF is seen as
more of an approach that has sparse
signals.
Yeah. Yeah.
Great question. So the question is, do
you apply reward for each token or for
the whole thing? We're going to see this
in more detail, but it's for the whole
thing. It's for the whole thing, but
we'll see that in uh in just a second.
Yeah.
Oh, right. So the
question is, what are a_t and s_t? So
just as a reminder, s_t is the state that
you're in and a_t is the action you want
to take. So in the context of an LLM,
the state that you're in
is the input that you have
so far, and then the action is which
token you want to generate given that
input. Yeah.
Cool. Yeah. So far so good.
Okay, perfect. So now that we know a
little bit what a mental model could be
for I guess LLM based RL,
I just want to highlight once more that
what we're trying to achieve is learn
how to align this policy
with
rewards.
So we want to learn theta, and theta is
the parameters of our LLM, such that pi
of theta
aligns with preferences,
and that's where RLHF comes into play. So
RLHF stands for reinforcement learning
from human feedback, and it is typically
composed of two stages. So the first
stage is you
figuring out how to distinguish good
output from bad output.
So here you know all the preference
pairs that you collected are actually
used for you to learn what is good what
is bad. So the input here is the
concatenation of the prompt and the
response and the output is a score.
So what you want to know given a prompt
and a response how good that is.
And then the second step is the RL step,
the reinforcement learning step. And
this is where you use the rewards
to align your model with the
preferences. So here as input you have
the prompt and what you somehow want to
do is to be able to generate yhat that
is more aligned with the rewards.
And by the way I just want to call out
one thing. So RLHF is reinforcement
learning from human feedback.
The human feedback part
refers to the labels on which the reward
model is trained.
So if the preference pairs are based on
human ratings,
then we're relying on human preferences,
then we're in RLHF,
because you will see out there there is
also, let's say, RLAIF, reinforcement
learning from AI feedback, and that one
is relying on nonhuman preferences.
So yeah
Cool. Okay, so now that we know what RLHF
is and we know that there are two steps,
we'll go through the first step
naturally. So here the idea is we want
to construct a model that knows which
output is good, which output is bad. And
of course here what we want is to not
only consider the output but also the
input
because uh you need to somehow
contextualize that answer.
So let's go with the our favorite
example. Uh so let's suppose you have
the following prompt. Suggest a new
activity I could do with my teddy bear.
So what you want is to have a reward
model that here we note RM that tells
you that you know the answer that we
rewrote into the good answer is good
and we want to somehow have that model
that tells us that the output that we
constructed or that we did not construct
the output that we saw was bad is bad.
So we want to somehow have a model that
takes in the prompt and the good
response and say it's you know good and
we want to somehow have a model that uh
you know takes the prompt and the bad
response to say it's bad. So now the
question is how do you construct such a
model?
Well in order to do that we are using a
formulation that's called the Bradley
Terry formulation.
So that one is an important formula. So
uh we just stay here for a little bit.
So what it says is that the
probability for an
output y_i
to be better than an output y_j
is equal to the exponential of some score
r_i, which is the score with respect to i, over
the exponential of r_i
plus the exponential of
r_j, the score with respect to j:
P(y_i > y_j) = exp(r_i) / (exp(r_i) + exp(r_j)).
So that is called the Bradley-Terry
formulation, and this is the formulation
we will use to build our model.
So here it's also equal to sigma of
r_i minus r_j. So who knows what sigma
is?
>> Yes, exactly. So it's a sigmoid. So just
as a reminder, the sigmoid is σ(x) = 1 / (1 +
exp(−x)). So this is what
the graph looks like. So when x goes to
minus infinity it tends towards
zero, and then when x goes towards plus
infinity it tends towards one.
So in other words,
if I is better than J,
what we want is for the input to sigma
to be as high as possible
because you want the probability to be
as close to one.
So you want to somehow
have
r_i be high if the output i is good,
and somehow r_j be low if the
output j is bad.
So far so good.
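As a quick numerical check of the two forms of the Bradley-Terry probability, the sketch below computes both and compares them. The scores 1.0 and -2.0 are illustrative values picked for this example, and the function names are not from the lecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry(r_i, r_j):
    """P(y_i is preferred over y_j), written as a ratio of exponentials."""
    return math.exp(r_i) / (math.exp(r_i) + math.exp(r_j))

# Illustrative scores: a high one for the good output, a low one for the bad output.
r_good, r_bad = 1.0, -2.0
p_ratio = bradley_terry(r_good, r_bad)
p_sigmoid = sigmoid(r_good - r_bad)  # same quantity, written with the sigmoid
```

Only the difference between the two scores matters: shifting both by the same constant leaves the probability unchanged, and equal scores give a probability of one half.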
So here
what we want to do is to somehow
train a model that is able to output
scores r_i and r_j
using this formulation.
So this formulation involves two
quantities because we have a pair-wise
data set.
So far so good. Okay. So it will become
more clear in a second. So here for
training the idea here is you have some
model that you initialize
and you input on the one hand your
prompt x and your winning output. So I'm
saying winning, so it's y_w.
You put it into the model.
It produces a score r(x, y_w).
And then you have a second output. So x
and y_l, which is the losing output.
So you put it into the model and it gives
a second score, r(x, y_l).
And now you somehow want to have a loss
function
that takes into account these two
scores.
So based on this formulation,
do you have like a some suggestion as to
which loss function to use?
So here our loss function would be
pairwise.
Yep.
Yeah. So the proposed answer
is a binary cross-entropy. So can you spell
that out a bit more?
>> Mhm.
>> Uh-huh. Okay. So um I guess in that case
what would it be?
Yep.
Yeah.
Yeah. Yeah. Exactly.
So great answer. So I guess your
answer is having the negative log
likelihood of this quantity, and
the cross entropy can be seen as kind
of a special case of this. But just so
that we make sure that we all get
this part,
in order to get that
formulation which you mentioned, I
guess one idea that we can have is, given
our data
and given this formulation that we saw,
the Bradley-Terry one,
to find
parameters theta that maximize the
probability of that data happening, which
basically leads to what you
mentioned. So here I'm just going to
write what that means so that we
can all be aligned. So here
it's
so let's suppose you have a preference
data set of let's say um you know
winning example associated with losing
example. So you have a bunch of pairs
like this. So let's suppose that these
pairs they happened independently of one
another. So what you want is to find
parameters that somehow maximize the
probability of you seeing
those examples.
Right? So here what that means is we
want to somehow maximize
the product over these
n preference pairs of the
probability of, you know, some
output y_w being better than some
output y_l,
and we saw that with the Bradley-Terry
formulation. So we have this formulation,
which is basically the product
for i = 1 to n of sigma of
r(x, y_w) minus r(x, y_l), where
the reward is
a function of the input prompt and the
output,
like this.
So what I'm doing here is to just
reconstruct the loss from first
principles.
So whenever you see a product of
probabilities, the first thing you need
to think about is,
yes, the log, because this product can become very
small, so it can cause instabilities. And
maximizing a product is the same as
maximizing the log of that. So
let's say I'm taking the log of that.
So it's basically equal to the sum
of the log of
sigma of r of the winning
minus the one from the losing.
So we want to maximize that.
But people in ML, they like to minimize
things. So we're going to
have a negative sign in front. So maximizing
the sum of the logs is the same as
minimizing minus the sum of the logs. And
so this
will be our loss function.
Does that make sense?
And this is exactly what you mentioned.
And
typically we uh you know write the
expectation
and this is our loss function.
So our loss function is minus the
expectation of log of sigma of r(x, y_w)
minus r(x, y_l), the reward of the losing.
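Written out, the chain from maximum likelihood to the loss could look like the following. The superscript (i) indexing the n pairs and the subscript θ on the reward are notational choices made here; the lecture only states the steps verbally.

```latex
\begin{aligned}
\hat{\theta}
  &= \arg\max_{\theta} \prod_{i=1}^{n}
     \sigma\!\big(r_\theta(x^{(i)}, y_w^{(i)}) - r_\theta(x^{(i)}, y_l^{(i)})\big) \\
  &= \arg\max_{\theta} \sum_{i=1}^{n}
     \log \sigma\!\big(r_\theta(x^{(i)}, y_w^{(i)}) - r_\theta(x^{(i)}, y_l^{(i)})\big) \\
  &= \arg\min_{\theta} \; -\sum_{i=1}^{n}
     \log \sigma\!\big(r_\theta(x^{(i)}, y_w^{(i)}) - r_\theta(x^{(i)}, y_l^{(i)})\big), \\[4pt]
\mathcal{L}(\theta)
  &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
     \big[\log \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big].
\end{aligned}
```

The first line uses the independence assumption on the pairs, the second uses the fact that the log is increasing, and the last replaces the sum by an expectation over the preference data, which only changes the objective by a constant factor.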
Sounds good. So here the loss function
is pairwise,
but the reward model, is it pairwise or
is it pointwise?
Like, I guess, do you need a pair to make
a prediction or do you need just one?
>> Do you need a pair
pair? Well that's the beauty of things.
You don't need a pair. So you're just
training it pair-wise but it's actually
pointwise.
So I think that's one thing I kind of
realized. You know I look at this loss
and you know this is a beautiful thing.
So you're training in a pair wise way,
but at the end of the day, you have a
reward model that takes in one prompt
and its output and just outputs one
number.
And we saw that it will try to output a
high score for, you know, winning um
examples and then a low score for losing
examples.
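The "trained pairwise, used pointwise" idea can be seen in a very small example. This is a sketch under strong assumptions: the reward model is a plain dot product over made-up two-dimensional features, where in practice it would be a neural network over the prompt and response text, and the two pairs and the learning rate are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    """Pointwise: one (prompt, response) feature vector in, one scalar out."""
    return sum(wi * fi for wi, fi in zip(w, features))

def pairwise_loss(w, pairs):
    """Mean of -log sigmoid(r_w - r_l) over (winning, losing) feature pairs."""
    return -sum(math.log(sigmoid(reward(w, fw) - reward(w, fl)))
                for fw, fl in pairs) / len(pairs)

def gradient_step(w, pairs, lr):
    grad = [0.0] * len(w)
    for fw, fl in pairs:
        margin = reward(w, fw) - reward(w, fl)
        coeff = -(1.0 - sigmoid(margin))  # derivative of -log sigmoid(margin)
        for k in range(len(w)):
            grad[k] += coeff * (fw[k] - fl[k]) / len(pairs)
    return [wk - lr * gk for wk, gk in zip(w, grad)]

# Hypothetical features for (prompt, winning) and (prompt, losing) responses.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = [0.0, 0.0]
initial_loss = pairwise_loss(w, pairs)
for _ in range(50):
    w = gradient_step(w, pairs, lr=0.5)
final_loss = pairwise_loss(w, pairs)
```

Note that `reward` never sees a pair. Only the loss does, which is why the trained model can later score a single prompt and response on its own.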
So this is our reward model. And I'm
looking at the time. I'm actually not on
time, so I'll just move on. Um, so
typically here what you would need is a
set of data, which is typically on the
tens of thousands or maybe even more.
Um, and here the label would be the
preference rating. The preference here
should come from humans if we're talking
about RLHF.
And in terms of model, well, you have a
bunch of different choices. So as we
know now, models that are decoder-only,
that predict the next token, are very
popular. So you could very well take
such an LLM and just have a
classification head at the end of the
sequence that you're taking into
consideration and then use that to
predict the reward. Or if you remember,
we've also seen encoder-only models
in the beginning of the class. So for
instance with BERT you could typically also
project the embedding of the CLS token.
This could very well be an option.
So typically what people do nowadays
is take the LLM route, because everything
is an LLM these days. So it's
decoder-only. You just put a
classification head.
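A head of this kind can be as small as one linear projection. The sketch below assumes the per-token hidden vectors are already available as plain lists (the numbers are invented) and that the head reads the last position, which matches the "end of the sequence" description for a decoder-only model. For a BERT-style encoder the same projection would be applied to the CLS vector instead.

```python
def scalar_reward_head(hidden_states, weight, bias=0.0):
    """Project one token's hidden vector to a single number.

    hidden_states: list of per-token vectors from a decoder-only model
                   (hypothetical values here; a real model would produce them).
    weight, bias:  parameters of the head, learned with the pairwise loss.
    """
    # With causal attention, only the last position has seen the whole sequence.
    last = hidden_states[-1]
    return sum(w * h for w, h in zip(weight, last)) + bias

# Three tokens, hidden size 3; all values are made up for illustration.
hidden_states = [[0.1, -0.3, 0.5], [0.2, 0.4, -0.1], [0.7, 0.0, 0.3]]
weight = [1.0, -2.0, 0.5]
score = scalar_reward_head(hidden_states, weight)
```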
And so people have also come up with
benchmarks
to evaluate how good you're doing. So
just put a reference here. Reward bench
in case you're interested is a pretty
popular one.
Um
but yeah,
so at the end of this you get a reward
model that takes in a prompt and a given
response
and gives you a score.
Yep.
>> Exactly. So the question is uh in some
tasks the human preference would be
something but in other tasks it may be
something else. So typically those
rewards they are with respect to a given
dimension.
So they can be let's say is the output
useful
or is the output I don't know friendly
is the output safe. So all of that are
different dimensions. So they can be
different reward models.
Um you can also have like some holistic
score as well.
uh but yes so you need to define a
dimension across which you're actually
quantifying how good your response is.
So the ones that I mentioned are uh the
common ones that you would you would
find. Another thing that I will say
as well, while we're at it: human
ratings are very sensitive to the
guidelines that you're giving.
So, we're not going to go into details
here, but one important aspect of human
preference data is to make sure that the
guidelines you're giving your raters
are as objective as they can get. I
mean, sometimes you cannot have them be
super objective, but you have the
task of making sure
they're clear enough so that your human
preferences
are not noisy, because they can be
noisy. So that's also one challenge.
Yep.
So the question is uh is the reward here
like a regression or a classification.
So let's look at the
at the loss function and here.
So well it's kind of hard I guess I
would say um because the reward can be
kind of something that can be
interpreted as a score but we have a
probabilistic formulation
um so I'll probably frame that as a
classification task because you have
preference data, is it better or worse, so
it's like one or zero. But at the end you
use the rewards, so it's not
purely a classification task.
But one thing
to note is that the scale that these
rewards are on is typically something
that gets rescaled.
>> So at inference time you have some
normalization procedure that normalizes
that across your batch but yeah I would
probably characterize the formulation
more as a probabilistic one I would say.
Yep.
Yep.
Yep. So the question is do we normalize
the score? So there are many different
methods to I guess normalize things. We
typically do. We typically do normalize.
So I guess if you're in a regression
setting, you're trying to predict some
score on a given scale, which you're
you're not doing here. So it's kind of
free form here.
Um so yeah, so there is some rescaling
that happens. Yeah. You had a question.
>> Yeah. The question is, can you tell us
more about what the output of a reward
model is? So we're going to see that a
bit later, but you can think of, you know,
good outputs as being, let's say, 1, and bad
outputs as being −2 or
−3. So it's basically on a
continuous scale, if you want.
We're going to see an example in a
little bit. So hopefully this can Do you
have a question? Perfect. Uh we're going
to see an example. So hopefully that
will be clear.
Cool. Um
so what we did is train a reward model.
Now the second step is to use that
reward model to align our model, and
by model I mean the LLM. The LLM that
went through the pre-training stage and
the SFT stage is the model that we want
to align with human preferences.
We're going to do that using
reinforcement learning and we're going
to do that using the reward model that
we just constructed.
So the reward model here
is the model that we obtained in step
one and it allows us to distinguish
good outputs and bad outputs.
So here is a general recipe of how you
would align your model.
So first you would take your prompt
as input.
Your LLM will generate
a completion.
So here a completion means a full
response from the model. By the way, for
completion people also use the term
rollout. So it generates a completion, a
rollout, a full response,
and that full response along with the
prompt
goes into the reward model.
So as we saw, the reward model now
knows if it's good or if it's bad. Let's
say right now it's bad. So in practice
it does not generate a thumbs
down. It generates some kind of score.
So, if you want, like minus two,
let's suppose.
So we take that reward into
consideration
and then what we do is we tune the LLM
with this information.
So it's more complicated than that and
we'll see how we go about doing this but
this is the general idea.
So just as a reminder, the reward model
is the model that we trained as step one
and it is a model that is frozen.
We're not training the reward model. The
reward model has been trained.
The model that we're training is the the
LLM.
So I think that's an important point to
note.
And our goal
is to optimize for higher rewards
but then
without going too far from the initial
model.
So optimizing for
higher rewards, I think everyone agrees
with me, right? But I just said a
second statement, which is that we don't
want to go too far from the initial
model.
So why would that be? Why would we
want to not deviate too much from the
base model?
So one suggested answer is it will
catastrophically forget what it has
learned. Why would that be a problem?
great. Yeah, I think it's a great way to
put it. So the the the suggested answer
here is uh you have all this knowledge
in your initial model which is
pre-trained and then instruction tuned
or tuned and you don't want to move away
too much. This is exactly it. So that's
one reason.
What could be another reason? So
definitely one, it's a very very good
reason. What could be another reason?
Yeah.
Yeah. So, overfitting on that data. So
can you tell me more?
>> Yeah, great point. So the second
suggested answer is that you can overfit
on data that is not super clean. And I
will go a bit more into detail on what
that means.
Yep. You want to add?
Yeah. Yeah. Exactly. So I guess your two
points are I guess along the same lines
which is the reward model can be noisy.
So that's exactly it. So, it's a
phenomenon that's called reward hacking
where
if you are trying to optimize too much
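Key Summary
The LLM Tuning lecture covers preference tuning, the third stage after pre-training and SFT. It explains how alignment techniques such as RLHF, PPO, and DPO train a model to behave in line with human preferences.
Key Concepts
Why preference tuning is needed 04:56
- SFT alone makes it hard to adjust fine-grained behaviors such as the model's "tone" or "safety"
- Preference pair: a good (winning) response and a bad (losing) response to the same prompt
- SFT teaches "what to generate"; preference tuning teaches "what to prefer"
- Can inject a negative signal: SFT only teaches what should be generated, while preference tuning also teaches what should not be generated
Ways to collect preference data 11:50
- Pointwise: assign an absolute score to each response (hard)
- Pairwise: compare which of two responses is better (most commonly used)
- Listwise: rank n responses
- Evaluation methods: human rating, LLM as a judge, rule-based metrics such as BLEU/ROUGE
RLHF (Reinforcement Learning from Human Feedback) 18:20
- Stage 1, reward model training: takes a prompt plus a response and outputs a quality score
- Stage 2, RL training: optimizes the policy using the reward
- The LLM from an RL perspective: Agent = LLM, State = current input, Action = next-token prediction, Policy = output probability distribution
Bradley-Terry Model 27:30
- P(y_i > y_j) = σ(r_i − r_j) = exp(r_i) / (exp(r_i) + exp(r_j))
- The mathematical basis for reward model training
- Loss = −log σ(r(x, y_w) − r(x, y_l)): raises the reward of the winning response and lowers the reward of the losing response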
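PPO (Proximal Policy Optimization) 48:30
- Goal: maximize the reward while not drifting too far from the base model
- Why stay close?
  - Prevents catastrophic forgetting: keeps the knowledge learned in pre-training/SFT
  - Prevents reward hacking: avoids overfitting to an imperfect reward model
  - Prevents training instability
- KL divergence: measures how far apart two probability distributions are; used to limit the distance from the reference model
- Clipping: limits the size of each update for stability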
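The "reward minus distance from the reference model" idea can be illustrated on a single next-token distribution. This is a simplified sketch, not PPO itself: PPO also uses clipping and a value function, and in practice the KL term is estimated on sampled completions rather than computed exactly. The three-token distributions and the coefficient `beta` are assumptions made for this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, reference_probs, beta):
    """Reward minus beta * KL(policy || reference); beta is an assumed knob."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

# Hypothetical next-token distributions over a 3-token vocabulary.
reference = [0.5, 0.3, 0.2]
near = [0.45, 0.35, 0.2]    # policy that stayed close to the reference
far = [0.98, 0.01, 0.01]    # policy that drifted far from the reference

same_reward = 1.0
score_near = penalized_reward(same_reward, near, reference, beta=0.1)
score_far = penalized_reward(same_reward, far, reference, beta=0.1)
```

With the same raw reward, the policy that drifted further ends up with the lower penalized score, which is the pressure that keeps the tuned model near its starting point.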
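The reward hacking problem 51:00
- Because the reward model is imperfect, optimizing it too hard creates a gap from the actual goal
- Example: if you measure how "informative" a lecture is by how loud the applause is, a lecture made up only of jokes gets a high score
PPO's complexity and alternatives 57:36
- PPO needs 4 models: policy, value function, reward model, reference model
- Best-of-N (BoN): generate N responses and pick the one the reward model scores highest (no RL training)
  - Pro: no training needed
  - Con: inference cost increases N-fold
- GRPO: proposed by DeepSeek, to be covered in the next lecture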
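Best-of-N is short enough to sketch directly. The generator and reward model below are hypothetical stand-ins (a random choice among three canned replies and a lookup table of scores); a real setup would sample from the LLM and score with the trained reward model.

```python
import random

def best_of_n(prompt, generate, reward_model, n, rng):
    """Sample n candidates and return the one the reward model scores highest."""
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]

# Hypothetical stand-ins for an LLM sampler and a trained reward model.
CANNED = ["Don't bother with it.", "Have a tea party!", "Build a blanket fort together!"]
TOY_SCORES = {CANNED[0]: -2.0, CANNED[1]: 0.5, CANNED[2]: 1.0}

def toy_generate(prompt, rng):
    return rng.choice(CANNED)

def toy_reward_model(prompt, response):
    return TOY_SCORES[response]

best, score = best_of_n("Suggest an activity with my teddy bear.",
                        toy_generate, toy_reward_model, n=4, rng=random.Random(0))
```

No weights are updated anywhere in this loop, which is the advantage listed above. The cost is the N generations and N reward-model calls per request.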
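DPO (Direct Preference Optimization) 1:29:59
- An approach meant to address PPO's complexity
- Optimizes the policy directly from preference data, without a reward model
- Loss = −log σ(β * (log π(y_w|x)/π_ref(y_w|x) − log π(y_l|x)/π_ref(y_l|x)))
- Only 2 models are needed: the policy being trained plus a frozen reference model
- Implicit reward: r(x, y) = β * log(π(y|x)/π_ref(y|x))
An intuitive view of DPO 1:37:00
- Increases the probability of the winning response and decreases the probability of the losing response
- Learns the change relative to the reference model
- Much simpler and more stable than PPO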
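The DPO loss above can be computed from four log-probabilities. In this sketch the inputs are assumed to be sequence-level log-probabilities, meaning the sum of token log-probabilities of each response under the policy and under the frozen reference. The numeric values and the choice of β = 0.1 are illustrative only.

```python
import math

def implicit_reward(logp, ref_logp, beta):
    """r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)), from log-probabilities."""
    return beta * (logp - ref_logp)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigmoid(implicit reward of winner minus implicit reward of loser)."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    # -log sigmoid(m) = log(1 + exp(-m)), rewritten to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

beta = 0.1  # assumed value for illustration

# Policy identical to the reference: no preference expressed yet.
loss_start = dpo_loss(-11.0, -11.0, -11.0, -11.0, beta)
# Policy raised the winner's log-probability and lowered the loser's.
loss_better = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta)
# Policy moved the wrong way.
loss_worse = dpo_loss(-12.0, -10.0, -11.0, -11.0, beta)
```

The loss has the same form as the reward model's pairwise loss, with the implicit reward playing the role of the reward model's score.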
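LoRA and preference tuning 10:20
- LoRA is a parameter-efficient training method (which parameters to tune)
- Preference tuning is an objective function (what to optimize)
- The two techniques are complementary and can be used together
Key insights
- Preference tuning is a third training stage that fine-tunes the model's "behavior"
- RLHF is powerful but complex and unstable (4 models, risk of reward hacking)
- DPO simplifies things by optimizing directly without a reward model (widely used today)
- Limiting the KL divergence from the reference model is important for preventing reward hacking
- Best-of-N can improve quality at inference time without any training (at a cost trade-off)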