Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM Tuning

Stanford Online • November 14, 2025 • AI summary generated: January 24, 2026

Cool. Hello everyone, and welcome to lecture 5 of CME 295. First of all, thank you all for taking the time to take the midterm last week. I hope it was reasonable for you. For those of you who are auditing, in case you are interested in taking the exam, just know that the exam and the solutions are both posted on the website. So now you know a little bit what the exam looks like. For the final, we'll have the same format, except that the content will cover lectures 5, so this one, through 9. So just note that point.

Cool. Great. So with that, we're going to start the lecture. Today we're going to talk about LLM tuning.

As usual, we're going to recap what we saw last time. Last time was already two weeks ago, but we talked about how to train an LLM, and in particular we looked at two important steps. The first one is called pre-training, where you take a model that has been initialized and you try to teach it about language and about code. This step is very time consuming, expensive, and compute heavy, and it happens on a lot of data. We've seen training optimizations to make that happen: techniques to parallelize training across GPUs, so data parallelism methods and in particular ZeRO, with its variants ZeRO-1, ZeRO-2 and ZeRO-3, and we also saw very quickly what model parallelism was. At the end of this step, what you obtain is a model that knows about the structure of language and code, basically all the text that it has been fed. But all this model can do is predict the next token.

So it's a great autocompleter, but it's not a helpful model yet, which is why we have a second step. It's typically called fine-tuning, or SFT for supervised fine-tuning. Here we take our pre-trained model and we train it for specific tasks. Nowadays you have ChatGPT and all these chat assistants, so this can be one application: transforming your model into an assistant. Here the goal is to teach the model how to behave. The model already knows what language is, what code is, etc., and you're just trying to make it behave in line with the use case that you're tuning it for. Typically what you have here is a dataset that is much smaller in scale but of much higher quality, and you're taking your pre-trained model and teaching it exactly which tokens to predict with the next-token prediction task.

We've also seen LoRA, which is a parameter-efficient method that does not tune all the weights but, in a clever way, introduces low-rank matrices, which are the ones being tuned. And we had stopped there. What we're going to see today is how to align the model with what we call human preferences.

Here we're taking our model that has been fine-tuned for a specific task and trying to align it so that its outputs are closer to what a human would like, or closer to some metric that we define. As an example, suppose you have an assistant at the end of step two. It's very much possible that your assistant is behaving the way you want, but not, let's say, with the tone that you want. So it's not, let's say, friendly, or it's not safe. You want to tune those aspects in that third step.

Cool. This step is called preference tuning, and we're going to see exactly what that is. For context, suppose that we have an SFT model, and by SFT model I mean a model that has gone through the pre-training stage and the fine-tuning stage. For instance, we may ask our model to suggest a new activity we could do with our teddy bear. And the model, let's say, responds with "I would suggest you to not spend much time with your teddy bear at all." This is the response of an assistant, but it's not necessarily aligned with what we want.

So the idea is to take these quote-unquote bad outputs and find, or rewrite, an output that we would want to have instead. This pair is what we call a preference pair. In other words, given this prompt, we have two responses: one that we want to see, which I'm going to read in a second, and one that we do not want to see. In this example, the answer we want to see is "Of course! Teddy bears not only make awesome companions for delightful sleep but can also be great buddies for fun activities," followed by some suggested activities.

Does the setup sound good? Yeah. So long story short, we want to align our model with human preferences.
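
As a minimal sketch of the preference pair described above, one record can be stored as a prompt plus a preferred and a rejected response. The class name `PreferencePair` and the field names `chosen` and `rejected` are assumptions made for illustration, not names used in the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example: a prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str    # the response we want to see
    rejected: str  # the response we do not want to see


# The teddy bear example from the lecture (chosen response abbreviated).
pair = PreferencePair(
    prompt="Suggest a new activity I could do with my teddy bear.",
    chosen="Of course! Teddy bears not only make awesome companions for "
           "delightful sleep but can also be great buddies for fun activities.",
    rejected="I would suggest you to not spend much time with your teddy bear at all.",
)
```

A preference dataset is then simply a list of such records, one per prompt.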

You may ask: we have this fine-tuning step already, so why would we want a third step? Well, during the second step, the fine-tuning stage, if you remember, what we did was construct a very high-quality dataset of the kinds of prompts on which we want our model to behave in a certain way. Composing such a dataset is very time consuming and actually very difficult, because the dataset must be of very high quality. With preference tuning, we're not really teaching the model exactly what it should generate, but rather telling the model what kind of output it should prefer. So we're less in a "please generate that" kind of mode and more in an "I prefer this option" kind of mode.

Typically, if we asked you to write a great poem from scratch, it would be much more difficult than showing you two poems, one bad and one great, and asking you to just say which one is better. So obtaining the dataset is already much easier.

The second reason: during the SFT stage, when we compose our very high-quality dataset, there is one aspect that we really try to get right, which is the distribution of prompts. What I mean by that is, if we have too much of a given kind of prompt, our model will be more biased towards responding in that particular way. So people try to be careful about the distribution of prompts in the SFT data. If our model misbehaves and we were thinking about just adding one example to the SFT dataset, we would have to be very careful about which prompt we're adding and whether it's going to bias the model too much in that direction. That's the second reason.

The third reason is what I mentioned: the SFT data is typically very high quality. So if you're looking at all the missteps that your model makes and trying to put corrections for them in the SFT data, you're going to have a hard time, and it takes a lot of time. But one note here: if your SFT model is misbehaving a lot, it may also be due to the fact that your SFT dataset has some problem. So preference tuning is not the answer to everything. Sometimes it's better to just check your SFT dataset for issues.

That sound good? Yep.

Okay, so the question is: we saw LoRA for SFT, what is the equivalent for preference tuning? We'll see that later in the lecture, but you can think of LoRA as a way to reduce the number of parameters that you need to tune, which is separate from the objective function that you're using to train your model. Preference tuning you can think of more as a different objective function, and it may very well be something that you also use LoRA for. So the two are not incompatible.
>> Yes.
>> But it will become more clear later on. But yeah, great question.

One last thing I will add here: another difference with the SFT stage is that preference tuning allows us to inject some negative signal. SFT is all about teaching the model what it should predict, but it does not teach the model what it should not predict. We will see that preference tuning allows you to inject some negative signal.

Cool. To start, of course, we need our preference pairs, so we're going to look at the data collection step first. Here's the setup. You have a prompt, let's say "write a poem," and you have a given response, which is the poem generated by the model. You have a few ways to construct your preference data.

Either you start with a pointwise mindset, where you score each proposed poem with some kind of pointwise score. Here, pointwise score means a score relative to just one observation. You could very well do that, but I will say it's kind of hard. It's tough as a human to say, okay, this one is, I don't know, 0.9, this one is a 0.2. It's not super clear exactly how you would scale that.

The second idea is to get two observations at a time and say which one is better. This is called pairwise preference data, and it's much easier.

The third one is listwise, where you get a list of, let's say, n poems and you rank them: which one is best, which one is worst, and so on. I guess this one is easier than pointwise because you don't have to specify how much better one is, but it's still a bit more complicated than pairwise.

That's the reason why people typically use pairwise. They collect pairwise preference data, meaning for each prompt they have two possible answers and they just specify which one is better. That's the one we'll continue this lecture with. Okay, so now you may ask: great, pairwise preference data, but how do you get it?

Well, here is the recipe. To generate a pair of responses, you first need a prompt. We saw previously, I think it was probably lecture three, that you can generate different answers if you have a positive temperature. So typically people feed this prompt, let's say, twice into the model with a positive temperature, and they get two different answers.
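
The sampling idea above can be sketched with a temperature-scaled softmax. This is a toy under stated assumptions: a real LLM samples token by token, whereas here whole responses are treated as candidates, and the candidate names and scores are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive to get varied samples")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_two(candidates, logits, temperature, rng):
    """Draw two independent samples from the same distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probs, k=2)


# Hypothetical candidate responses and scores, for illustration only.
candidates = ["poem A", "poem B", "poem C"]
logits = [2.0, 1.5, 0.5]
rng = random.Random(0)
response_1, response_2 = sample_two(candidates, logits, temperature=1.0, rng=rng)
```

The two draws can coincide; a data collection pipeline would presumably resample until the two responses differ.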

For the prompt, we typically want it to follow the distribution of what users actually ask. So the prompt x can be something we obtain from the logs or from, let's say, a desired set of prompts.

Then we have these two observations: the first is the prompt and the first response, the second is the prompt and the second response. And what we do is compare them.

We can compare them with human ratings, of course, but we can also compare them with other methods. I'll just list a few. We have LLM-as-a-judge, which you may have heard of; we have not seen it yet, but we will in a few lectures. It is also typically used to compare how much better one observation is than another. We can also use rule-based metrics like BLEU, ROUGE, etc., although these are not used as much these days.

The simplest way to compare two observations is a binary setting, where you say: is response one better or worse than response two? But you could also think of a more nuanced scale, meaning you say response one is much better, better, slightly better, slightly worse, worse, or much worse. This is also something you can do, but there are some challenges with that approach, because if you consider human ratings, for instance, a lot of tasks are a little bit subjective. So in a lot of cases what people actually do is build a pairwise preference dataset on the binary scale: only, is it better or worse.
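
To illustrate the relation between the nuanced scale and the binary scale, here is a small sketch that collapses a graded comparison into a (prompt, winner, loser) triple. The numeric values attached to each label and the function name `to_binary_pair` are assumptions made for this example.

```python
# Graded labels for "response 1 compared to response 2", as listed in the lecture.
# The integer strengths are hypothetical; only their sign is used below.
GRADED_SCALE = {
    "much better": 3, "better": 2, "slightly better": 1,
    "slightly worse": -1, "worse": -2, "much worse": -3,
}


def to_binary_pair(prompt, response_1, response_2, graded_label):
    """Collapse a graded comparison into (prompt, winner, loser)."""
    strength = GRADED_SCALE[graded_label]
    if strength > 0:
        return (prompt, response_1, response_2)
    return (prompt, response_2, response_1)


example = to_binary_pair("Write a poem.", "poem A", "poem B", "slightly worse")
```

Only the sign of the rating survives, which is the information the binary setting keeps.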

Does that sound good? Yeah. Okay. Another way to obtain that data is to find in your logs a response that you did not like, take that response, and rewrite it, which is basically what we did here with the teddy bear example: we took the bad response and rewrote a good one. This is also something people do, but of course it is a bit more involved, because you need to generate, and I told you that generation is kind of costly and tough. But it is also possible.

Does the data collection make sense? Yeah. Cool. So now we have our preference data, and what we want is to align our model to prefer the responses that were preferred in the ratings, and to downweight the responses that were not preferred.

To do that, we will see a method called RLHF. We'll see it in more detail in a second, but as the name indicates, RLHF relies on RL, so I'm just going to start with some RL basics. Do we have any RL experts here? Yeah? No? Okay. No need to be an RL expert, so don't worry, we'll go really slowly on that.

In the RL world, you have an agent, and here I mean agent in the RL sense, which interacts with an environment. What does it do? It is in a given state $s_t$ at, let's say, time t. It can take an action $a_t$ at time t, and it takes that action according to some policy. The policy is typically denoted $\pi_\theta(a_t \mid s_t)$, pi theta of the action given the state. What that policy gives you is simply the probability of taking an action given a state. Easy, right? And given that the agent takes some action, it also receives some reward. Sometimes it's a good reward, sometimes it's a bad reward. What we're going to do is leverage that mindset for our preference tuning exercise.

So let's see together how we can transpose these quantities to the LLM world. What is the agent? The agent is the LLM. The state that it is in is simply the input that it has so far. The action it wants to take is to predict the next token, so it basically always wonders: what is the next token given this input? This action, this next token, is among the set of tokens that exist, so if you want, the environment can be the set of tokens in your vocabulary. To decide which token should come next, the LLM uses the next-token probability, which you obtain when you do your forward pass and look at the probability distribution at the output. That is basically our policy. So our policy here is simply the output of the LLM given the input, used to determine the next token.
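
The mapping above (state = input so far, action = next token, policy = next-token distribution) can be sketched as a small generation loop. The vocabulary and the `toy_logits` function are hypothetical stand-ins for a real LLM forward pass.

```python
import math
import random

VOCAB = ["the", "teddy", "bear", "<eos>"]  # toy vocabulary (hypothetical)


def toy_logits(state):
    """Stand-in for an LLM forward pass: one score per vocabulary token."""
    # Favors "<eos>" more and more as the state grows, so generation tends to end.
    return [1.0, 1.0, 1.0, 0.5 * len(state)]


def policy(state):
    """pi_theta(a_t | s_t): probability of each next token given the input so far."""
    logits = toy_logits(state)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_tokens, rng, max_steps=20):
    state = list(prompt_tokens)  # s_t: the input so far
    for _ in range(max_steps):
        probs = policy(state)
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # a_t: the next token
        state.append(action)  # the new state includes the chosen token
        if action == "<eos>":
            break
    return state


tokens = generate(["suggest", "an", "activity"], random.Random(0))
```

Each loop iteration is one agent step: read the state, query the policy, take an action, and move to the next state.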

So far so good. Now we're adding one additional thing. I told you that we're composing our preference dataset and we want to know which output is better than the other. We're going to somehow use that as a reward, and we will see how.

Let me just recap that part. Our LLM takes some input and wants to predict the next token; it wants to take this action. To do that, it uses the probability distribution that it outputs. Then the token it predicts, or the output it generates, receives some reward that feeds back into tuning the agent, which here is the LLM. Yep?

So the question is: wouldn't that be expensive if we do this for all pairs? Why would that be expensive?

Mhm. Yeah. So the question is how you deal with how expensive that is. Typically people take batches, for instance. It is indeed expensive, and we're going to see some orders of magnitude and how that works. But you can think of this as a training procedure that can be about as expensive as some other training procedure. There's nothing inherent that makes it more expensive.
>> Okay.
>> But we're going to see exactly. I think there are some points that you're touching on correctly. There are some parts that are added compared to the regular supervised fine-tuning setting, and we're going to see that in a second.

So the question is: is the reward that you're getting here powerful enough for the LLM to change? You will see some descriptions on the internet characterizing this training procedure as having not nearly as many signals as what you would get for SFT. With SFT, you're literally always taking a partial input and making the LLM learn how to generate the next token. But here you're only getting roughly one signal per completion. So yes, it's definitely more sparse, and that's why you will see RLHF described as an approach that has sparse signals.

Yeah?

Great question. The question is: do you apply the reward for each token or for the whole thing? We're going to see this in more detail, but it's for the whole thing. We'll see that in just a second. Yeah?
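
The contrast between dense SFT supervision and one reward per completion can be made concrete with a small sketch. Placing the single reward on the final token is an assumption of this example, not something stated in the lecture; the function names are hypothetical.

```python
def sft_signal_count(response_tokens):
    """SFT: every response token is a next-token prediction target."""
    return len(response_tokens)


def rlhf_token_rewards(response_tokens, completion_reward):
    """RLHF: one scalar reward for the whole completion.

    Attaching it to the final token is an assumption made for this sketch.
    """
    rewards = [0.0] * len(response_tokens)
    if rewards:
        rewards[-1] = completion_reward
    return rewards


response = ["You", "could", "have", "a", "picnic", "."]
print(sft_signal_count(response))         # number of supervised targets
print(rlhf_token_rewards(response, 0.8))  # a single nonzero entry
```

For this six-token response, SFT provides six targets while the reward provides one number, which is the sparsity the answer above refers to.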

So the $a_t$ and... oh, right. So the question is: what are $a_t$ and $s_t$? Just as a reminder, $s_t$ is the state that you're in and $a_t$ is the action you want to take. In the context of an LLM, the state is the input that you have so far, and the action is which token you want to generate given that input. Yeah.

Cool. Yeah. So far so good?

Okay, perfect. Now that we have a mental model for LLM-based RL, I just want to highlight once more that what we're trying to achieve is to learn how to align this policy with rewards. We want to learn $\theta$, the parameters of our LLM, such that $\pi_\theta$ aligns with preferences.

That's where RLHF comes into play. RLHF stands for reinforcement learning from human feedback, and it is typically composed of two stages. The first stage is figuring out how to distinguish good outputs from bad outputs. Here, all the preference pairs that you collected are used to learn what is good and what is bad. The input is the concatenation of the prompt and the response, and the output is a score: given a prompt and a response, you want to know how good that response is.

The second stage is the RL step, the reinforcement learning step. This is where you use the rewards to align your model with the preferences. Here the input is the prompt, and what you want is to be able to generate a $\hat{y}$ that is more aligned with the rewards.

By the way, I just want to call out one thing. RLHF is reinforcement learning from human feedback. The human feedback part refers to the labels on which the reward model is trained. If the preference pairs are based on human ratings, then we're relying on human preferences and we're in RLHF. You will also see out there, let's say, RLAIF, reinforcement learning from AI feedback, and that one relies on non-human preferences. So yeah.

Cool. Now that we know what RLHF is and that there are two steps, we'll go through the first step. The idea is that we want to construct a model that knows which output is good and which output is bad. And of course we want to consider not only the output but also the input, because you need to contextualize the answer.

Let's go with our favorite example. Suppose you have the following prompt: "Suggest a new activity I could do with my teddy bear." What you want is a reward model, which we denote RM, that tells you that the answer we rewrote into the good answer is good, and that the output we saw was bad is bad. So we want a model that takes in the prompt and the good response and says it's good, and that takes in the prompt and the bad response and says it's bad. Now the question is: how do you construct such a model?

To do that, we use a formulation called the Bradley-Terry formulation. This is an important formula, so we'll stay here for a little bit. What it says is that the probability (bless you) that an output $y_i$ is better than an output $y_j$ is equal to the exponential of some score $r_i$ for $y_i$, divided by the exponential of that score $r_i$ plus the exponential of the score $r_j$ for $y_j$: $P(y_i \succ y_j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}}$. That is called the Bradley-Terry formulation, and this is the formulation we will use to build our model.

It's also equal to $\sigma(r_i - r_j)$. So who knows what sigma is?
>> Yes, exactly. So it's a sigmoid. Just as a reminder, the sigmoid is $\sigma(x) = \frac{1}{1 + e^{-x}}$. This is what the graph looks like: as x goes to minus infinity it tends towards zero, and as x goes to plus infinity it tends towards one.
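
The equality between the two forms follows from dividing the numerator and denominator of the Bradley-Terry ratio by $e^{r_i}$, which gives $\frac{1}{1 + e^{-(r_i - r_j)}} = \sigma(r_i - r_j)$. As a quick numerical check, here is a short sketch; the function names are chosen for this example.

```python
import math


def bradley_terry(r_i, r_j):
    """P(y_i better than y_j) = exp(r_i) / (exp(r_i) + exp(r_j))."""
    return math.exp(r_i) / (math.exp(r_i) + math.exp(r_j))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# The two expressions agree for any pair of scores.
for r_i, r_j in [(2.0, 0.5), (0.0, 0.0), (-1.0, 3.0)]:
    assert math.isclose(bradley_terry(r_i, r_j), sigmoid(r_i - r_j))
```

Note that only the difference of the two scores matters: adding the same constant to both leaves the probability unchanged.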

In other words, if i is better than j, we want the input to sigma to be as high as possible, because we want the probability to be as close to one as possible. So you want $r_i$ to be high if output i is good, and $r_j$ to be low if output j is bad. So far so good?

What we want to do here is train a model that is able to output the scores $r_i$ and $r_j$ using this formulation. The formulation involves two quantities because we have a pairwise dataset.

So far so good? Okay, it will become more clear in a second. For training, the idea is that you have some model that you initialize. On the one hand, you input your prompt x and your winning output; I'm saying winning, so it's $\hat{y}_w$. You put it into the model and it produces a score $r(x, \hat{y}_w)$. Then you have a second input, x and $\hat{y}_l$, which is the losing output. You put it into the model and it gives a second score, $r(x, \hat{y}_l)$. Now you want a loss function that takes these two scores into account. Based on this formulation, do you have a suggestion as to which loss function to use?

So here our loss function would be pairwise. Yep?

So the proposed answer is binary cross-entropy. Can you make that a bit more explicit?
>> Mhm.
>> Uh-huh. Okay. So in that case, what would it be? Yep. Yeah. Yeah, exactly. Great answer. So your answer is to take the negative log-likelihood of this quantity, and the cross-entropy can be seen as kind of a special case of this. But just so that we make sure we all get this part: to get that formulation again, one idea is, given our data and given the Bradley-Terry formulation that we saw, to find parameters $\theta$ that maximize the probability of that data happening, which basically leads to what you mentioned. I'm just going to write out what that means so that we're all aligned.

Suppose you have a preference dataset where each winning example is associated with a losing example, so you have a bunch of pairs like this. Suppose that these pairs occurred independently of one another. What you want is to find parameters that maximize the probability of seeing those examples. Right? What that means is that we want to maximize the product over these n preference pairs of the probability that the output w is better than the output l. And we saw that with the Bradley-Terry formulation, this is the product for i from 1 to n of sigma of the difference of rewards, where the reward is a function of the input prompt and the output: $\prod_{i=1}^{n} \sigma\big(r(x^{(i)}, \hat{y}_w^{(i)}) - r(x^{(i)}, \hat{y}_l^{(i)})\big)$. So what I'm doing here is just reconstructing the loss from first principles.

NaN:NaN

So whenever you see a product of

NaN:NaN

probabilities, the first thing you need

NaN:NaN

to think about is

NaN:NaN

yes log because this can become very

NaN:NaN

small. So it can cause instabilities. So

NaN:NaN

if you take the log, if you want to

NaN:NaN

maximize a product is the same as

NaN:NaN

maximizing the log of that. So if you

NaN:NaN

take the log of that. So let's suppose

NaN:NaN

I'm taking let's say the log of that.

NaN:NaN

Let's say I'm taking the log of that.

NaN:NaN

So it's basically equal to the sum

NaN:NaN

of the log of

NaN:NaN

sigma of r of the winning

NaN:NaN

minus the one from the losing.

NaN:NaN

So we want to maximize that.

NaN:NaN

But people in ML they like to minimize

NaN:NaN

things. So we're going to take like have

NaN:NaN

a negative in front. So maximizing plus

NaN:NaN

the sum of the log is the same as

NaN:NaN

minimizing minus the sum of the log. And

NaN:NaN

so this

NaN:NaN

will be our loss function.

NaN:NaN

Does that make sense?

NaN:NaN

And this is exactly what you mentioned. Typically we write it with the expectation, and this is our loss function. So our loss function is minus the expectation of the log of sigma of the reward of the winning output minus the reward of the losing output: $\mathcal{L} = -\mathbb{E}\big[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\big]$.
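
The loss just stated can be sketched in a few lines of plain Python. The function names and the example scores are hypothetical, and the batch mean stands in for the expectation.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_reward_loss(rewards_w, rewards_l):
    """Mean of -log sigma(r_w - r_l) over a batch of preference pairs."""
    assert len(rewards_w) == len(rewards_l) and rewards_w
    total = sum(-log_sigmoid(rw - rl) for rw, rl in zip(rewards_w, rewards_l))
    return total / len(rewards_w)

# Hypothetical scores a reward model might assign to winning / losing responses.
good_ranking = pairwise_reward_loss([2.0, 1.5], [-1.0, -2.0])  # winners scored higher
bad_ranking = pairwise_reward_loss([-1.0, -2.0], [2.0, 1.5])   # winners scored lower
tied = pairwise_reward_loss([0.3], [0.3])                      # no margin at all
```

The loss only depends on the difference of the two rewards, so a tie gives log 2 regardless of the absolute scores.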

NaN:NaN

Sounds good. So here the loss function is pairwise, but the reward model, is it pairwise or is it pointwise? Like, do you need a pair to make a prediction, or do you need just one?

>> Do you need a pair?

A pair? Well, that's the beauty of things. You don't need a pair. You're just training it pairwise, but it's actually pointwise.

NaN:NaN

So I think that's one thing I kind of realized. I look at this loss and, you know, this is a beautiful thing. You're training in a pairwise way, but at the end of the day you have a reward model that takes in one prompt and its output and just outputs one number. And we saw that it will try to output a high score for winning examples and a low score for losing examples.
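
As a toy illustration of training pairwise but scoring pointwise, here is a sketch with a tiny linear reward model. The two hand-made features, the learning rate and the step count are assumptions for illustration only; a real reward model would be an LLM with a scalar head.

```python
import math

def score(weights, features):
    """Pointwise reward: one (prompt, response) feature vector in, one number out."""
    return sum(w * f for w, f in zip(weights, features))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-d features per response: [is_polite, contains_insult].
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # (winning features, losing features)
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]

weights = [0.0, 0.0]
learning_rate = 0.5
for _ in range(200):
    for feats_w, feats_l in pairs:
        margin = score(weights, feats_w) - score(weights, feats_l)
        # Gradient of -log sigma(margin) w.r.t. the weights is
        # -(1 - sigma(margin)) * (f_w - f_l); take one descent step.
        step = learning_rate * (1.0 - sigmoid(margin))
        weights = [w + step * (fw - fl) for w, fw, fl in zip(weights, feats_w, feats_l)]

# At inference time no pair is needed: each response is scored on its own.
polite_score = score(weights, [1.0, 0.0])
insult_score = score(weights, [0.0, 1.0])
```

Pairs only appear inside the training loop; the `score` function itself never sees a second response.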

NaN:NaN

So this is our reward model. And I'm looking at the time, I'm actually not on time, so I'll just move on. Typically what you would need here is a set of data, usually on the order of tens of thousands of examples or maybe even more. And here the label would be the preference rating. The preference should come from humans if we're talking about RLHF.

NaN:NaN

And in terms of model, well, you have a bunch of different choices. As we know now, decoder-only models that predict the next token are very popular. So you could very well take such an LLM, put a classification head at the end of the sentence you're taking into consideration, and use that to predict the reward. Or, if you remember, we've also seen encoder-only models at the beginning of the class. So with BERT, for instance, you could also project the embedding of the CLS token. This could very well be an option. What people typically do nowadays is take the LLM route, because everything is an LLM these days. So it's decoder-only, and you just put a classification head.
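
One way to picture the head described here is a linear projection from the last token's hidden state to a single number. This is only a sketch with made-up hidden states and weights; the function name and all values are hypothetical.

```python
def reward_from_hidden_states(hidden_states, head_weights, head_bias):
    """Project the last token's hidden state to a single scalar reward."""
    last_hidden = hidden_states[-1]  # representation of the final token
    return sum(w * h for w, h in zip(head_weights, last_hidden)) + head_bias

# Hypothetical decoder outputs: 3 tokens, hidden size 4 (values are made up).
hidden_states = [
    [0.1, -0.2, 0.0, 0.3],
    [0.4, 0.1, -0.1, 0.2],
    [0.5, -0.5, 1.0, 0.0],
]
head_weights = [1.0, 2.0, -1.0, 0.5]  # in practice learned with the pairwise loss
head_bias = 0.25

reward = reward_from_hidden_states(hidden_states, head_weights, head_bias)
```

For the encoder-only route mentioned above, the same projection would be applied to the CLS token's embedding instead of the last token's.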

NaN:NaN

And people have also come up with benchmarks to evaluate how well you're doing. I just put a reference here: RewardBench, in case you're interested, is a pretty popular one. But yeah, at the end of this you get a reward model that takes in a prompt and a given response and gives you a score.

NaN:NaN

Yep.

>> Exactly. So the question is: in some tasks the human preference would be one thing, but in other tasks it may be something else. Typically those rewards are defined with respect to a given dimension. So it can be, let's say, is the output useful, or is the output friendly, or is the output safe. All of those are different dimensions, so they can be different reward models. You can also have some holistic score as well. But yes, you need to define a dimension along which you're actually quantifying how good your response is. The ones I mentioned are the common ones that you would find.

Another thing I will say while we're at it: human ratings are very sensitive to the guidelines that you're exposing. We're not going to go into details here, but one important aspect of human preference data is to make sure that the guidelines you give your raters are as objective as they can get. Sometimes you cannot make them super objective, but you have the task of making sure they're clear enough that your human preferences are not noisy. They can be noisy, and that's also one challenge.

NaN:NaN

Yep. So the question is: is the reward here like a regression or a classification? Let's look at the loss function here. Well, it's kind of hard to say, because the reward can be interpreted as a score, but we have a probabilistic formulation. I would probably frame it as a classification task, because you have preference data, is it better or worse, so it's like one or zero. But at the end you use the rewards, so it's not purely a classification task. One thing to note is that the scale these rewards live on is typically rescaled.

>> So at inference time you have some normalization procedure that normalizes that across your batch. But yeah, I would probably characterize the formulation more as a probabilistic one.

NaN:NaN

Yep. So the question is: do we normalize the score? There are many different methods to normalize things, and we typically do normalize. If you're in a regression setting, you're trying to predict some score on a given scale, which you're not doing here. So it's kind of free-form here.
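
The lecture does not say which normalization is used, so the following is only one plausible sketch: a z-score across the batch. The function name, the `eps` constant and the raw values are assumptions.

```python
import statistics

def normalize_rewards(rewards, eps=1e-8):
    """Shift and scale a batch of raw rewards to zero mean and unit variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

raw = [1.0, -2.0, -3.0, 0.5]  # hypothetical raw reward-model outputs
normalized = normalize_rewards(raw)
```

Because the pairwise loss only constrains differences between rewards, the absolute offset of the raw scores is arbitrary, which is why some rescaling of this kind is harmless: the ordering of the responses is preserved.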

NaN:NaN

So yeah, there is some rescaling that happens. Yeah, you had a question?

>> Yeah. The question is: can you tell us more about what the output of a reward model is? We're going to see that a bit later, but you can think of good outputs as being, let's say, one, and bad outputs as being minus two, minus three. So it's basically on a continuous scale, if you want. We're going to see an example in a little bit. So hopefully this can... Do you have a question? Perfect. We're going to see an example, so hopefully that will be clear.

NaN:NaN

Cool. So what we did is train a reward model. Now the second step is to use that reward model to align our model, and by model I mean the LLM. The LLM that went through the pre-training stage and the SFT stage is the model that we want to align with human preferences. We're going to do that using reinforcement learning, and we're going to do that using the reward model that we just constructed. So the reward model here is the model that we obtained in step one, and it allows us to distinguish good outputs from bad outputs.

NaN:NaN

So here is a general recipe for how you would align your model. First you take your prompt as input, and your LLM generates a completion. Here a completion means a full response from the model. By the way, for completion people also use the term rollout. So it generates a completion, a rollout, a full response, and that full response, along with the prompt, goes into the reward model. As we saw, the reward model now knows whether it's good or bad. Let's say right now it's bad. In practice it does not generate the thumbs down; it generates some kind of score, so minus two, let's suppose. We take that reward into consideration, and then we tune the LLM with this information. It's more complicated than that, and we'll see how we go about doing this, but this is the general idea.
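
The data flow of this recipe can be sketched with a toy policy over three canned completions. Everything here is hypothetical: the completions, the reward values, and in particular the update rule, which is a plain REINFORCE step used as a simplification and not the PPO procedure the lecture goes on to describe.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "LLM": a policy over three canned completions for one prompt.
completions = ["rude reply", "okay reply", "helpful reply"]
policy_logits = [0.0, 0.0, 0.0]

def frozen_reward_model(prompt, completion):
    """Stand-in for the trained, frozen reward model: returns a scalar score."""
    return {"rude reply": -2.0, "okay reply": 0.5, "helpful reply": 1.0}[completion]

def alignment_step(prompt, logits, rng, learning_rate=0.1):
    probs = softmax(logits)
    # 1) The policy generates a completion (a rollout) for the prompt.
    idx = rng.choices(range(len(completions)), weights=probs)[0]
    # 2) The prompt and the rollout go into the reward model.
    reward = frozen_reward_model(prompt, completions[idx])
    # 3) The policy is nudged using the reward (REINFORCE-style update).
    new_logits = [
        z + learning_rate * reward * ((1.0 if j == idx else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]
    return idx, reward, new_logits

rng = random.Random(0)
idx, reward, new_logits = alignment_step("Say hello.", policy_logits, rng)
old_prob = softmax(policy_logits)[idx]
new_prob = softmax(new_logits)[idx]
```

If the sampled rollout gets a negative score, such as the minus two in the example, its probability goes down after the step; a positive score pushes it up. Only the policy logits change, while the reward function stays fixed.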

NaN:NaN

So just as a reminder, the reward model is the model that we trained in step one, and it is frozen. We're not training the reward model; the reward model has already been trained. The model that we're training is the LLM. I think that's an important point to note.

NaN:NaN

And our goal is to optimize for higher rewards, but without going too far from the initial model. Optimizing for higher rewards, I think everyone agrees with me there, right? But I just made a second statement, which is that we don't want to go too far from the initial model. So why would that be? Why would we not want to deviate too much from the base model?

NaN:NaN

So one suggested answer is that it will catastrophically forget what it has learned. Why would that be a problem?

Great. Yeah, I think that's a great way to put it. So the suggested answer here is that you have all this knowledge in your initial model, which is pre-trained and then instruction-tuned, or tuned, and you don't want to move away from it too much. This is exactly it. So that's one reason.

What could be another reason? That's definitely one, and it's a very good reason. What could be another one? Yeah.

Yeah. So, overfitting on that data. Can you tell me more?

NaN:NaN

>> Yeah, great point. So the second

NaN:NaN

Key Summary

The LLM Tuning lecture covers preference tuning, the third stage after pre-training and SFT. It explains how alignment techniques such as RLHF, PPO, and DPO train a model to behave in line with human preferences.

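Key Concepts

Why Preference Tuning Is Needed 04:56

SFT alone makes it hard to adjust fine-grained behaviors such as the model's "tone" or "safety"
Preference pair: a good (winning) and a bad (losing) response to the same prompt
SFT teaches "what to generate"; preference tuning teaches "what to prefer"
Negative signal can be injected: SFT only teaches what should be generated, whereas preference tuning also teaches what should not be generated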

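How Preference Data Is Collected 11:50

Pointwise: assign an absolute score to each response (hard to do)
Pairwise: compare which of two responses is better (most commonly used)
Listwise: rank n responses
Evaluation methods: human rating, LLM as a judge, rule-based metrics such as BLEU/ROUGE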

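RLHF (Reinforcement Learning from Human Feedback) 18:20

Stage 1 - Reward model training: takes a prompt + response and outputs a quality score
Stage 2 - RL training: optimizes the policy using the reward
The LLM from an RL perspective: Agent = LLM, State = current input, Action = next-token prediction, Policy = output probability distribution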

Bradley-Terry Model 27:30

P(yi > yj) = σ(R(yi) - R(yj)) = exp(R(yi)) / (exp(R(yi)) + exp(R(yj)))
The mathematical basis for reward model training
Loss = -log σ(R(x,yw) - R(x,yl)): raises the reward of the winning response and lowers the reward of the losing response
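
The two forms of the Bradley-Terry probability above are the same quantity, which a quick numerical check makes visible. The function names and reward values below are hypothetical.

```python
import math

def bt_sigmoid(r_i, r_j):
    """P(y_i > y_j) written as a sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def bt_softmax(r_i, r_j):
    """P(y_i > y_j) written as a two-way softmax over the rewards."""
    return math.exp(r_i) / (math.exp(r_i) + math.exp(r_j))

r_good, r_bad = 1.0, -2.0  # hypothetical rewards
p_good_wins = bt_sigmoid(r_good, r_bad)
p_bad_wins = bt_sigmoid(r_bad, r_good)
```

The two win probabilities sum to one, and the response with the higher reward is the more likely winner.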

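PPO (Proximal Policy Optimization) 48:30

Goal: maximize reward + do not drift too far from the base model
Why stay close?
1. Prevent catastrophic forgetting: retain the knowledge learned during pre-training/SFT
2. Prevent reward hacking: avoid overfitting to an imperfect reward model
3. Prevent training instability
KL divergence: measures the distance between two probability distributions; used to limit the distance from the reference model
Clipping: limits the update size to ensure stability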
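
A minimal sketch of the two stabilizers named above, assuming discrete next-token distributions and the standard PPO clipped term. The advantage value, the `eps` of 0.2 and the probabilities are illustrative assumptions, not values from the lecture. Note that KL divergence is not symmetric, so it is a "distance" only informally.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

reference = [0.5, 0.3, 0.2]  # hypothetical next-token probabilities, reference model
policy = [0.7, 0.2, 0.1]     # hypothetical next-token probabilities, tuned policy

kl_same = kl_divergence(reference, reference)  # identical distributions
kl_moved = kl_divergence(policy, reference)    # policy has drifted
capped = clipped_objective(ratio=1.5, advantage=2.0)  # gain capped at (1 + eps) * A
```

The KL term grows as the policy drifts from the reference, and the clip stops a single update from being rewarded for moving the probability ratio beyond `1 + eps`.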

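The Reward Hacking Problem 51:00

Because the reward model is imperfect, over-optimizing against it opens a gap from the actual objective
Example: if a lecture's "informativeness" is measured by "how loud the applause is", a lecture that only tells jokes gets a high score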

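Complexity of PPO and Alternatives 57:36

PPO needs 4 models: Policy, Value Function, Reward Model, Reference Model
Best-of-N (BoN): generate N candidates and pick the highest-scoring one with the reward model (no RL training)
- Pro: no training needed
- Con: inference cost grows N-fold
GRPO: proposed by DeepSeek; to be covered in the next lecture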
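
Best-of-N is simple enough to sketch directly. The `generate` and `reward_model` callables below are hypothetical stand-ins for an LLM sampler and a trained reward model.

```python
import random

def best_of_n(prompt, generate, reward_model, n):
    """Sample n completions and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Hypothetical stand-ins for an LLM sampler and a trained reward model.
rng = random.Random(0)
canned = ["meh", "fine", "great answer", "off-topic"]
scores = {"meh": -0.5, "fine": 0.4, "great answer": 1.3, "off-topic": -2.0}

def generate(prompt):
    return rng.choice(canned)

def reward_model(prompt, completion):
    return scores[completion]

best = best_of_n("Explain KL divergence.", generate, reward_model, n=8)
```

No weights are updated anywhere, which is the appeal; the cost is `n` generations plus `n` reward-model calls per prompt.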

DPO (Direct Preference Optimization) 1:29:59

An approach to address the complexity of PPO
Optimizes the policy directly from preference data, without a reward model
Loss = -log σ(β * (log(π(yw|x)/πref(yw|x)) - log(π(yl|x)/πref(yl|x))))
Only 2 models needed: the policy being trained + a frozen reference model
Implicit reward: r(x,y) = β * log(π(y|x)/πref(y|x))
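
The DPO loss above can be sketched on precomputed sequence log-probabilities. The function name, the `beta` of 0.1 and the log-probability values are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable -log(sigma(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities log pi(y|x) under policy and reference.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)      # winner up, loser down
```

Each bracketed difference is the implicit reward divided by β, so the loss has the same Bradley-Terry shape as the reward-model loss, with the log-ratio to the reference playing the role of the reward.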

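Intuition for DPO 1:37:00

Increases the probability of the winning response and decreases that of the losing response
Learns the change relative to the reference model
Much simpler and more stable than PPO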

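LoRA and Preference Tuning 10:20

LoRA is a parameter-efficient training method (which parameters to tune)
Preference tuning is an objective function (what to optimize)
The two techniques are complementary and can be used together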

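Key Insights

Preference tuning is the third training stage, which fine-tunes the model's "behavior"
RLHF is powerful but complex and unstable (4 models, risk of reward hacking)
DPO simplifies things by optimizing directly without a reward model (widely used today)
Limiting the KL divergence from the reference model is important for preventing reward hacking
Best-of-N can improve quality at inference time without training (at a cost trade-off)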