Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation

Stanford Online • December 2, 2025 • AI-generated summary: January 24, 2026

Hello everyone, and welcome to lecture 8 of CME295. Today's topic is LLM evaluation, and I think this class is probably one of the most important classes of this quarter, because if we don't know how to measure the performance of our LLM, we don't really know what to improve. So this class will focus on how we can quantify how the LLM performs in a bunch of different cases.

So with that said, we are going to start the class as usual by recapping what we saw last week. If you remember, last week we saw how our LLM could interact with systems that are outside of the LLM itself. We saw one core technique called RAG, which stands for retrieval-augmented generation, that allows our LLM to fetch information from external knowledge bases. And we saw how we could improve the retrieval system, which was composed of two main steps. The first was candidate retrieval, which is typically done with a bi-encoder kind of setup; Sentence-BERT was a good example of how people would design such a model. This first step is typically there to filter down the potentially relevant candidates for a given incoming query.

And then we saw that there was a second step, reranking, which was a bit more involved and used cross-encoders, which were more sophisticated. We also saw some ways to quantify how well our retrieval system performed.

And then we also saw something called tool calling, which is the ability for our model to know which tool to call with which arguments. If you remember, if we give our LLM knowledge of the tools that are available to it, it can figure out which arguments it needs to pass to the function based on the input query, run that function, and then output the result in natural language to the user.

And then we also saw what agentic workflows are composed of. Spoiler alert: it is a combination of the two previous methods, RAG and tool calling. In particular, given an input, we allow our model to make multiple calls, to call different tools and to fetch relevant data from other knowledge bases. We saw one example that has been quite successful among current applications, AI-assisted coding, which relies on this principle. And ReAct is typically the framework that people would use, so reason plus act, which decomposes this into observe, plan, and act steps.

Cool. So this is what we saw last time, and we also started from this slide last time. If you remember, our LLM has strengths but also weaknesses that we're trying to mitigate. In particular, the focus of lectures six and seven was on methods to improve the reasoning of the model, and on ways for the model to fetch knowledge from other systems as well as perform actions. Today we're going to focus on the evaluation part: given a response that the model is giving, how can we quantify how good that response is?

Cool. So first of all, I would like to define the term evaluation and the meaning that we will use for this lecture. When we say "I want to evaluate my LLM," it can actually take a lot of different meanings. It can mean let's evaluate the performance, the output, based on coherence or factuality. It can mean let's evaluate it based on latency, so more system-related metrics, or pricing, or how often it is up, and so on. So just to make sure we're on the same page, this lecture will mostly focus on the output quality part, and in particular on quantifying how good the actual response is.

And here you will note that this is a challenging problem because, as we saw previously, our LLM is a text-to-text model that can output basically anything. It can be natural language, it can be code, it can be math reasoning, and so on and so forth. So it's very hard to come up with universal metrics to evaluate that, and we will see how people do this in practice.

Cool. So given that our LLM generates free-form output, one could imagine that the ideal scenario for evaluating the LLM output would be to ask a human to rate the response every time. So here the ideal scenario would be: okay, I give a prompt to my LLM, it gives a response, I ask a human to rate it, and I start again and again. At the end of the day I just collect all these human ratings and try to quantify the overall performance of my model. Well, as you can imagine, the main problem is that such a system would be very cost intensive.

But let's look at this in more detail. If you remember, the LLM outputs are really free form, and there may be cases where even human judgment is fuzzy, because the rating task in itself may be subjective.

So let's take the following example. Let's suppose I ask my LLM, "What birthday gift should I get?" And let's suppose the LLM responds with, "A teddy bear is almost always a sweet gift. Just pick one that feels right for you." Let's suppose I want to evaluate this response with respect to the usefulness dimension. I may have one human rater who says, "Yeah, it's pretty useful, because a teddy bear is pretty indicative of what the user should get as a gift." But then another rater may say, "No, actually it's not useful, because the response didn't specify exactly which teddy bear. Should I get a bear? Should I get an elephant, a giraffe? Which stuffed animal should I get?"

And so there is this notion of inter-rater agreement, where we're basically concerned with making sure that everyone is aligned on how to rate those responses, because sometimes, like in this illustrative example, it may be a little bit subjective, so ratings may vary. What people want to do is make sure that the guidelines are clear enough for everyone to rate these responses in a consistent manner.

So people come up with agreement types of metrics. A very natural metric that you may think of is the quote-unquote agreement rate. For instance, you have two raters, and you just measure the proportion of the time that the two raters give the same response. Let's suppose the response here is binary, so let's say good or not good. Well, do you see a problem with such a metric? Is this a good metric? I guess another way to ask this question is: if I give you a given agreement rate, can you tell me if it's a good number or a bad number?

Well, let's take the example of two raters, let's say Alice and let's say Bob, and let's suppose there are two different ratings that these raters can give: either one, the output is good, or zero, the output is not good. Let's assume that the first rater, Alice, gives random responses, with some probability P(A) for being good and 1 - P(A) for being not good. And then let's say Bob, who should have eyes and a smile, has P(B) for being good and 1 - P(B) for being not good.

Then let's compute the agreement rate for this case. The agreement rate is basically the probability that rater A and rater B agree. Here A and B agree if they both vote one or if they both vote zero, right? But if they give their responses in an independent and random way, then, using the probability concepts that you know, the probability of A and B both responding one is the probability of A responding one times the probability of B responding one, and the same for zero. So we will have something like this:

agreement rate by chance = P(A) P(B) + (1 - P(A)) (1 - P(B))

where the first term is A and B both saying one, and the second term is A and B both saying zero.

So let's see what the agreement rate would be in that case. Let's suppose P(A) is equal to P(B), which is equal to, let's say, 0.5. Then the agreement rate would be, so I'm just replacing the numbers here, 0.5 squared plus 0.5 squared, so it's 0.25 + 0.25, which is equal to 0.5. What that means is that if we're just letting our raters rate these things in a random way with some probability P(A), P(B), we would already have an agreement rate of 50%, just by pure random chance.
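
The blackboard computation can be sketched in a few lines of Python. This is only an illustration under the lecture's assumptions (two raters, binary ratings, independent random answers); the function name `chance_agreement` is made up for this example.

```python
def chance_agreement(p_a: float, p_b: float) -> float:
    """Probability that two independent binary raters give the same label.

    p_a and p_b are the probabilities that rater A and rater B say "good".
    """
    both_good = p_a * p_b                    # A and B both say 1
    both_not_good = (1 - p_a) * (1 - p_b)    # A and B both say 0
    return both_good + both_not_good


print(chance_agreement(0.5, 0.5))            # 0.5
print(round(chance_agreement(0.9, 0.9), 2))  # 0.82
```

The second call shows why a raw agreement rate is hard to read on its own: two raters who each say "good" 90% of the time already agree about 82% of the time without looking at the responses.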

And so one thing that I want to say is that this agreement rate by pure chance is a function of the probability with which each of these raters gives these ratings. If this probability is actually higher, the agreement rate by pure chance is also higher. So what do I want to say? I want to say that if we just take the agreement rate, then it's very hard to put it into context in terms of what you would have gotten if things had happened just by pure chance.

So for this reason, people have come up with a series of metrics that try to make it more relative to this baseline, which is what would happen if our raters chose things kind of randomly.

And so you have metrics like, for instance, this one, Cohen's kappa, which computes a quantity that is a function of this agreement rate by chance and of the observed one, such that if our observed agreement rate is greater than the by-chance agreement rate, then our coefficient is positive. When it's positive, at least you know it's going in the right direction. Here, if the observed agreement rate is equal to one, then kappa is equal to one. But if our observed agreement rate is below the pure random chance agreement rate that we saw on the blackboard, then our coefficient would be negative.
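
As a hedged illustration, Cohen's kappa is usually written as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement rate expected by chance. The sketch below estimates p_e from each rater's own label frequencies; the function name and the toy ratings are invented for the example and are not taken from the slides.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over labels of P(A says label) * P(B says label).
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


alice = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
bob = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(alice, bob), 3))  # 0.583
```

In this toy case the raw agreement rate is 0.8, but the chance baseline is 0.52, so kappa lands at about 0.58.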

So long story short, there is a bunch of metrics that try to quantify inter-rater agreement using these kinds of formulas, to make these quantities relative to what would happen if things were done in a random way. That's why you may see a bunch of metrics out there. Here it's Cohen's kappa, which people use for cases where there are two raters, but then you have extensions such as Fleiss' kappa and Krippendorff's alpha that you may see out there. They all rely on this idea that we should have some baseline, which is our raters just randomly picking answers, and try to see how much better our actual agreement is compared to this.

So does that make sense? Yeah. So what I want to say is that the first limitation of asking humans to rate our LLM outputs, which is that the task is sometimes subjective, is something that we can quantify with these inter-rater agreement metrics. What people typically do is keep track of how good that agreement is, and if we have a value that's not satisfactory, people would hold some quote-unquote agreement sessions between the raters to align on how they should rate the answers. So it can be seen as just a health metric to track how consistent your ratings are, and this is typically something that people use in practice.

Up until now we've seen one limitation of human ratings. The second limitation, I think I also said this previously, is that it's really slow. If you ask someone to rate a thousand LLM outputs, it will take them a while, and it's of course expensive. So all of that to say that our ideal scenario of asking a human to rate every LLM output is not something that is practical. But we can leverage human ratings in some way, because we've seen that even if the task is subjective, we can have a way to align our raters.

So now let's move on to another way to go about doing this, which is by using some rule-based metrics. Here I'm just going to revise the setting that I mentioned before. Instead of asking our humans to rate every LLM output, this time I'm just going to ask them to write the references, or the ideal outputs, for a given set of prompts, fix that for good, and then use some kind of metric that would compare the LLM outputs with those references. The main difference is this: let's suppose I have a given fixed set of prompts. Well, I can make iterations on my model and always compare the output of my LLM with this fixed reference, instead of always asking humans to rate that again and again. So it's already an improvement, and we will see a little bit of the kinds of rule-based metrics that you will see out there.

So ideally these metrics should reflect the performance of the LLM output in an optimal way. What I mean by an optimal way is that they should be a little bit flexible, given that in natural language there is not always a single way to say something. For instance, when I provide a response to a given prompt, there can very well be a case where I formulate the response slightly differently but it is still just as good. So the idea behind these metrics is to make this comparison a little bit flexible.

So let's start with a common one that people use in the translation case. This metric is called METEOR, and it stands for Metric for Evaluation of Translation with Explicit ORdering. The idea here is to compare the reference and the prediction, and we'll see how that's done, and also to penalize cases where words are not in the same order, which explains why the metric's name includes "with explicit ordering."

So the formula is as follows. It is some F-score times one minus some penalty. You may be familiar with the F1 score, which is the harmonic mean with equal weights. This one is with variable weights. It is a function of precision and recall, where precision is the proportion of the unigrams in your predicted sequence that match the reference, and recall is the proportion of the unigrams in the reference that match the prediction. So it's basically the usual precision and recall metrics that you know. And then we have another quantity here, which is the penalty, and as I mentioned, the penalty tries to incentivize good ordering. If it's ordered the same in the reference and in the prediction, then it's good; otherwise it's bad.

And so here there's a bunch of quantities. Gamma and beta are hyperparameters that people arbitrarily choose, and the penalty is a function of C, the number of contiguous chunks that are matched, over the number of matched unigrams. Ideally you would want C to be as low as possible, because if you have a low number of contiguous chunks, it means that your contiguous sequences are long, which means that the ordering is the same. So you want C to be low and the number of matched unigrams to be high. You want that penalty term to be low for a prediction that has the same ordering as the reference. So a higher METEOR score means a better translation, according to this way of doing things.
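
The slide formula is not reproduced in the transcript, so the sketch below assumes the commonly cited parameterized form: F = P * R / (alpha * P + (1 - alpha) * R), penalty = gamma * (C / m) ** beta, and score = F * (1 - penalty), where m is the number of matched unigrams. It is deliberately simplified: it only counts exact word matches (no synonyms or stems) and uses a greedy left-to-right alignment, whereas the real metric searches for the alignment with the fewest chunks. The default values of `alpha`, `beta`, and `gamma` are assumptions chosen for illustration.

```python
def meteor_sketch(reference, prediction, alpha=0.9, beta=3.0, gamma=0.5):
    """Simplified METEOR-style score: exact unigram matches only."""
    ref = reference.lower().split()
    hyp = prediction.lower().split()

    # Greedy alignment: map each predicted word to the first unused
    # identical word in the reference.
    used, alignment = set(), []
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if j not in used and ref_word == word:
                used.add(j)
                alignment.append((i, j))
                break

    m = len(alignment)  # number of matched unigrams
    if m == 0:
        return 0.0

    precision = m / len(hyp)
    recall = m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sequences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1

    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)


same_order = meteor_sketch("the cat sat", "the cat sat")
scrambled = meteor_sketch("the cat sat", "sat cat the")
print(same_order > scrambled)  # True
```

Both predictions match every unigram, so precision and recall are identical; only the chunk count differs (one chunk versus three), which is exactly what the ordering penalty is meant to capture.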

So when you look at this formula, first of all, it looks very arbitrary, right? I have alpha as a hyperparameter, gamma, beta, so it's kind of a recipe, I feel. That's one. And the second thing is that it does not allow for stylistic variations, because here we're measuring the number of matched unigrams. The metric does expand the range of what counts as matched unigrams by taking into account things like words that are synonyms of one another and words that share the same root, but still, it is not extremely satisfactory in that sense.

So METEOR is one such metric. You have another one that has been used in translation tasks, which is called BLEU, which you may know. BLEU stands for Bilingual Evaluation Understudy, and you can think of it as a precision-focused kind of metric that looks at the number of matching n-grams over the n-grams that are in the prediction, which is why it's a precision kind of metric. It also has a penalty term; here it's called the brevity penalty. Given that it's more of a precision kind of metric, if you translate something that's very short, you may be able to game the metric. So you want to penalize the translation being too short.
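
To make the brevity penalty concrete, here is a minimal BLEU-style sketch. It is an assumption-laden simplification, not a reference implementation: it uses a single reference, clipped n-gram precision up to `max_n` (2 by default here, while BLEU is usually reported with n-grams up to 4), a geometric mean with equal weights, and no smoothing. The function names are made up for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu_sketch(reference, prediction, max_n=2):
    """Simplified single-reference BLEU with a brevity penalty."""
    ref = reference.lower().split()
    hyp = prediction.lower().split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(hyp_counts.values())
        # Clip each predicted n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: predictions shorter than the reference are scaled down.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu_sketch(reference, reference))            # 1.0
print(round(bleu_sketch(reference, "the cat"), 3))  # 0.135
```

The two-word prediction has perfect n-gram precision, so without the brevity penalty it would score as well as the full sentence; the penalty is what brings it down.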

So we'll not go into a lot of details, but I just want to show you the kinds of metrics that are out there. METEOR is one, BLEU is another one, and ROUGE, which you may have heard of, is another one, typically used for summarization tasks. Again, same idea, and it has a bunch of variants that you may see out there. But long story short, all these metrics compare the output with a reference.

So as we saw, one key limitation is that they do not allow stylistic variation. Let's take an example. Let's suppose I say, "A plush teddy bear can comfort a child during bedtime." Well, I can say the exact same thing in a really different way: "Soft, stuffed bears often help kids feel safe as they fall asleep," or "Many youngsters rest more easily at night when they cuddle a gentle toy companion." In all these cases, the metrics that we saw would really perform very poorly.
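
A quick way to see this is to count exact unigram overlap between the reference sentence and the two paraphrases. The helper below is a hypothetical illustration (clipped unigram precision on whitespace tokens, with punctuation removed by hand from the example sentences), not one of the metrics from the lecture.

```python
from collections import Counter


def unigram_precision(reference, prediction):
    """Fraction of predicted words that also appear in the reference."""
    ref_counts = Counter(reference.lower().split())
    hyp_tokens = prediction.lower().split()
    if not hyp_tokens:
        return 0.0
    hyp_counts = Counter(hyp_tokens)
    matched = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matched / len(hyp_tokens)


reference = "a plush teddy bear can comfort a child during bedtime"
paraphrase_1 = "soft stuffed bears often help kids feel safe as they fall asleep"
paraphrase_2 = ("many youngsters rest more easily at night "
                "when they cuddle a gentle toy companion")

print(unigram_precision(reference, paraphrase_1))            # 0.0
print(round(unigram_precision(reference, paraphrase_2), 3))  # 0.071
```

The first paraphrase shares no exact word with the reference ("bears" does not match "bear"), and the second shares only the article "a", even though a human would judge all three sentences as saying the same thing.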

So that's one key limitation. The second key limitation is that correlation with human ratings is not that great. I mean, you can imagine that people have come up with all these hyperparameters to kind of make it correlate with human ratings, but they're not that correlated. And the bottom line is that it still requires human ratings just to get started, and sometimes you just can't afford to have human ratings in your project.

So there are still some key limitations, and all of that is to motivate the key method of this lecture, which is called LLM-as-a-judge. You know, we spent the first seven lectures motivating these large language models that are pre-trained on huge amounts of data and tuned in a way to match human preferences. So they do contain human knowledge, and they do contain some indication of what humans may prefer.

So the idea here is to have our model response actually be an input to yet another LLM, and that LLM is what people typically call LLM-as-a-judge. It is a term that was introduced in a paper from two years ago. The idea is to use an LLM for rating purposes, and the things that you would see as input would be the prompt that was used to produce the response, the response itself, and the criteria along which you want to grade your response.

And so here LLM-as-a-judge would give you the following outputs. The first thing is a score. You can think of it as a binary kind of score, so pass or fail. But also, and this is very new, a rationale, because LLMs understand text, so they can also explain to you why they graded something with a given score. That part is the key difference with previous methods: we are able to explain why the model is giving us a given score. And this is quite good, because in the rule-based world you would have all these formulas and multiplications, and sometimes you would come up with a number that would not be very self-explanatory. This is luckily something that LLM-as-a-judge addresses.

So to recap, what we want is to use an LLM as a way to grade the response. Here you would typically have the following kind of prompt. You would state, okay, I want to evaluate my response with respect to a given criterion, then you give the prompt that you used to generate that response along with the model response, and then you ask the judge to return two things: the rationale and then the score.
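
The prompt shown on the slide is not in the transcript, so the template below is only a hypothetical sketch of the structure just described: a criterion, the original prompt, the model response, and a request for a rationale followed by a score. The wording, the `JUDGE_TEMPLATE` name, and the PASS/FAIL labels are assumptions.

```python
JUDGE_TEMPLATE = """You are evaluating a model response.

Criterion: {criterion}

Prompt given to the model:
{prompt}

Model response:
{response}

First write a short rationale, then give the score.
Answer in JSON with the keys "rationale" and "score" (PASS or FAIL)."""


def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Fill the judge template with the three inputs the judge needs."""
    return JUDGE_TEMPLATE.format(
        criterion=criterion, prompt=prompt, response=response
    )


judge_prompt = build_judge_prompt(
    criterion="usefulness",
    prompt="What birthday gift should I get?",
    response="A teddy bear is almost always a sweet gift. "
             "Just pick one that feels right for you.",
)
print(judge_prompt)
```

The resulting string would then be sent to whichever LLM plays the role of the judge; that call is left out here because it depends on the provider.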

So one little trick I want to point out is that people typically ask the model to first output the rationale and then the score. The reason we typically do that is that it empirically improves the quality of the results. And given what we saw in lecture six, I think, if you remember, the reasoning class, we saw that these reasoning models that have been trendy, especially in 2025, first output a chain of thought before giving the answer. So you can actually think of this trick as following the same kind of idea as reasoning models, in that it allows the model to externalize, to verbalize, its quote-unquote thought process before giving the score. It gives it a chance to really figure out what is good or what is wrong in the model response.

So far so good. Any questions on the setup? Yeah, all good. Okay. So now I have a question for you. If I give the following prompt to my LLM-as-a-judge, am I guaranteed to have a rationale and a score that I can parse? Am I guaranteed? No. Yeah, exactly, the answer is no. You're not guaranteed to have a rationale and a score that you can parse, because this model has a probabilistic nature to it with the sampling process, and it's not something that you can really control. So my follow-up question is: do you know a technique that would guarantee you a structured response? Hint: it's a technique that we saw towards the beginning of the class.

Okay, I'll give you a little hint. If you remember, on slide 65 of lecture 3, we saw a technique called constrained guided decoding. The idea is to constrain the decoding process by allowing our model to only sample from quote-unquote valid tokens. We typically do that in cases where we want our output to have a given format, let's suppose a JSON format, and we absolutely want that format. So what people do is use this technique to guarantee the form of the response.
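
As a toy illustration of the idea, the sketch below masks out every token that is not currently valid before picking the next token. It is a simplification under stated assumptions: the "logits" are a hand-written dictionary rather than real model outputs, the valid set is given directly instead of being derived from a JSON grammar, and it picks the best allowed token greedily instead of sampling so that the result is deterministic.

```python
import math


def constrained_greedy_step(logits, allowed):
    """Pick the highest-scoring token among the allowed tokens only.

    logits: dict mapping candidate token -> score
    allowed: set of tokens that keep the output in the required format
    """
    # Tokens outside the allowed set get a score of -infinity.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token among the candidates")
    return best


# The model would prefer to start with "Sure", but a JSON object must
# start with an opening brace.
logits = {"Sure": 3.2, "The": 2.5, "{": 1.1}
print(constrained_greedy_step(logits, allowed={"{"}))  # {
```

Without the mask the same step would return "Sure", which is how a judge ends up producing free text that cannot be parsed.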

And in case you're using the providers that are out there, for instance OpenAI, Gemini, or Anthropic, this technique is known under the name structured output. So in your project, suppose you want to constrain the decoding process in order to output a response of a given format. Let's suppose my format is a response that I represent by a class with two attributes, rationale and score. Well, typically you can reference that with the argument text_format equal to that representation. This, I believe, is something that OpenAI does, and I'm not exactly sure if it's exactly the argument name that you would see for the other providers, but they're all along the same lines. Does that sound good? So the key word here is structured output. Whenever you want a response of a given format, you would just go for that.
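
The class with two attributes can be sketched as follows. This example deliberately does not call any provider: it only defines a hypothetical `JudgeResponse` schema with the standard library and a strict parser that shows what "a rationale and a score that I can parse" means in code. With a provider's structured output feature, a class like this would be passed as the schema instead of being checked after the fact; the exact argument and class types depend on the provider's SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResponse:
    rationale: str
    score: str  # assumed to be "PASS" or "FAIL" in this sketch


def parse_judge_output(raw: str) -> JudgeResponse:
    """Parse a judge output, failing loudly if it is not in the format."""
    data = json.loads(raw)  # raises ValueError on free-form text
    if not isinstance(data, dict) or set(data) != {"rationale", "score"}:
        raise ValueError("expected exactly the keys 'rationale' and 'score'")
    if not isinstance(data["rationale"], str):
        raise ValueError("rationale must be a string")
    if data["score"] not in ("PASS", "FAIL"):
        raise ValueError("score must be PASS or FAIL")
    return JudgeResponse(rationale=data["rationale"], score=data["score"])


raw_output = '{"rationale": "Suggests a gift but gives no specifics.", "score": "FAIL"}'
parsed = parse_judge_output(raw_output)
print(parsed.score)  # FAIL
```

Without constrained decoding, a parser like this is the fallback: outputs that fail validation have to be retried or discarded.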

Okay, cool. So just to recap, LLM-as-a-judge has two main benefits. The first one is that we do not need a reference text. We do not need human ratings just to get started, because our LLM already has a lot of knowledge that it has acquired during pre-training, human preferences, and so on. So you do not need that. And then the second thing is that you can interpret the score with the rationale that is being output, and that is also quite remarkable. Just as an example, here you would say, okay, evaluate the quality of this response. You would get some rationale that explains what this response has or doesn't have that makes it good or bad, along with the score.

Okay, cool. So now we're going to see the kinds of LLM-as-a-judge setups that you can see out there. Of course there are many variations, but there are generally two types of LLM-as-a-judge that you will see. The first one is where you have a single output, a single response that you want to evaluate, and here you would ask the LLM-as-a-judge to say, okay, is it good or is it not good? The second big kind that you will see out there is the pairwise kind of setup. You have two responses and you ask, is response A better or is response B better? And here you would obtain a response that is either that one or this one.

So if you remember, we've seen in previous lectures that there are a lot of situations where we would want to have preference data, for instance in the preference tuning class, which I believe was lecture five. This kind of method can also be a good way to synthetically generate preference ratings, where you have two responses, you ask your LLM to say, okay, I prefer that one, and you use that as the label to train your reward model. Does that sound good? Any questions on the setup or on everything that we've talked about so far?

Okay, cool. Everyone is on the same page. So now let's see what can go wrong with our LLM-as-a-judge. Let's think of the possible kinds of failures that we can encounter.

So the first one is called position bias and, as the name suggests, it has to do with the order in which we present the responses to our model. So let's say we ask our model: is response A better, or response B? Well, there is a chance that the model responds with response A just because it was the first one to be mentioned. So that kind of bias is called position bias. It's where the position at which you place the response matters in the judgment of the LLM-as-a-judge model.

And as a way to remedy that, people have different techniques. But one typical technique would be to ask the model is A or B better, and then ask the model is B or A better, and then take the majority vote. So if both of them lead to the same response, then it's good. But if the response changes, then it may not be good, so you may want to do something else. There are a bunch of other techniques. I know there's a bunch of papers that try to tweak the position embeddings, but those are a bit more advanced, so it's not typically the thing that you would do out of the box. So taking the average or, you know, taking the majority vote of this position swapping is typically what you would do.
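The position-swapping check can be sketched as follows. The `Judge` callable is a stand-in for a real LLM call and is assumed to return "first" or "second" depending on which displayed response it prefers. Returning `None` on disagreement is one possible way to handle the "do something else" case, not a rule from the lecture.

```python
from typing import Callable, Optional

# Hypothetical judge interface: takes the two responses in display order
# and returns "first" or "second".
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, response_a: str, response_b: str) -> Optional[str]:
    """Ask the judge in both orders; return "A" or "B" only if the verdicts agree."""
    first_pass = judge(response_a, response_b)   # A is shown first
    second_pass = judge(response_b, response_a)  # B is shown first
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    # Disagreement means the verdict depended on position, so flag it.
    return winner_1 if winner_1 == winner_2 else None
```

A judge that always picks whatever comes first yields `None` here, which is exactly the symptom of position bias that the swap is meant to expose.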

Okay, cool. So this was the first kind of bias. The second bias is called verbosity bias. So let's suppose you have two responses. The first response is short and concise. The second response goes much more into detail and is typically more verbose. Well, there are cases where the model will tend to prefer responses that are more verbose just because they're more verbose, not necessarily because they're more correct.

And for that, it's maybe a little bit trickier. So people typically try to make this dimension explicit in the guidelines. When they input the question to the LLM-as-a-judge, they say: make sure to not pay too much attention to the length of these responses, to not prefer something just because it's more verbose. So that's one kind of method that you will see out there. The second one is to also add some examples, in-context learning examples, to the model, to show by example that verbosity is not something you should prefer. And then the last one is to have some kind of penalty on the output length. So you can ask your model in a pointwise way how good is response one, how good is response two, and then try to penalize that with the length. So that's something that people may also use.
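One possible shape for such a length penalty is sketched below. The linear form, the `free_words` allowance, and the `penalty_per_word` value are all assumptions chosen for illustration; the lecture does not specify how the penalty is computed.

```python
def length_penalized_score(
    raw_score: float,
    response: str,
    penalty_per_word: float = 0.001,  # assumed value, tune per task
    free_words: int = 100,            # assumed allowance before any penalty applies
) -> float:
    """Subtract a linear penalty for every word beyond the free allowance."""
    n_words = len(response.split())
    excess = max(0, n_words - free_words)
    return raw_score - penalty_per_word * excess
```

Each response would first get a pointwise score from the judge, and the penalized scores are then compared instead of the raw ones.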

Okay. So we've seen position bias, we've seen verbosity bias, and now we will see the third kind of bias that you may see out there, which is called self-enhancement bias. That one has to do with the fact that if you ask a model to judge an output that was produced by itself, well, the model will tend to prefer responses that are generated by itself, regardless of whether or not the other one was more aligned with what we wanted.

And I guess here the intuition is that if our model generated such an answer, then it's maybe the case that our model thought that, from a probabilistic standpoint, this was a sequence that was very likely to appear. So one way to think about it is: if it has generated such a sequence, then it means that it is something that it thinks, I mean "thinks" quote unquote, is a good answer.

So the general guideline here is to typically not use the same model for generation and for judging. But I guess nowadays it's kind of hard to have that strict constraint respected, because all models are trained on basically the same datasets. So you can argue they're all being subject to the same training mixes and so on. But still, what people do is they tend to use another model just to have such a risk be minimized. So long story short, try to not use the exact same model for generation and for evaluation. So this is self-enhancement bias.

Okay. So before we go to the next subpart, what do you think of these three biases? Do they make sense? Any questions so far?

Yep. So, can you elaborate a bit more? Yeah. So, the question is: can you have a model that just isn't aligned with the ground truth and maybe prioritizes one label over another? So, yeah, this can definitely be another kind of bias, this bias being that our LLM is not exactly aligned with what humans would prefer. These three biases are by no means exhaustive, so this can very well be another bias that you can list as well.

Yep. So the question is: is it possible that our judge still prefers an LLM response even if it's a different one? Well, it depends how good your judge is. But typically the best practice is to have a judge that has a much bigger capacity, that may capture these kinds of differences and not be fooled by a response that just sounds like something it may generate, and instead favor something that is more aligned with human preferences. So I guess the short answer is yes, you can still have such a situation. But in order to mitigate that risk, you would typically take a model that is not the same but also typically much bigger. You have a bunch of such models out there and, with all the improvements that have been made with reasoning models, this is also something that people try.

Yeah, the question is: should the judge be bigger? It's not a hard constraint, but typically people would take a bigger model that has strong reasoning capabilities and that could really tease out what's good and what's not good.

Okay, cool. So, with that, I'm going to just go over the best practices that we've seen. We saw that in order for our LLM-as-a-judge to output a score, we need to give the criteria that we want the response to be evaluated against. But sometimes these criteria may be a little bit subjective. So one thing that works very well is to have crisp guidelines: make really explicit what we want and what we don't want.

The other point is you may see different kinds of scales out there. Sometimes people have a scale that is more granular, and in other cases we're just operating on a binary scale. Typically what people tend to prefer is actually the binary one, because it makes the job of the LLM-as-a-judge easier: it's just either good or bad. And also, when it comes to aligning the judge with human ratings, humans typically also find it easier to judge out of two options as opposed to several. So it just removes the noise of having several possible choices, and the extra granularity is not necessarily a signal that is really useful. So here the tip is to use a binary scale, like a pass or fail kind of score, as opposed to a gradual one.

The third tip is to make sure to output the rationale before outputting the score. This is along the same lines as outputting a chain of thought before providing the response, which is something that is done by our reasoning models. So it's typically something that will improve the judge performance.
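As a small sketch of the last two tips combined (binary scale, rationale first), the judge can be asked for a JSON object whose rationale field precedes the verdict. The JSON layout, the `JUDGE_INSTRUCTIONS` wording, and the `parse_judgment` helper are assumptions made for illustration.

```python
import json
from typing import Tuple

# Assumed instruction text: the rationale is requested before the verdict so
# that the verdict is generated after the reasoning.
JUDGE_INSTRUCTIONS = (
    "First explain your reasoning, then give the verdict. "
    'Reply as JSON: {"rationale": "<explanation>", "verdict": "PASS" or "FAIL"}'
)


def parse_judgment(raw: str) -> Tuple[str, bool]:
    """Parse the judge's JSON reply into (rationale, passed)."""
    data = json.loads(raw)
    verdict = str(data["verdict"]).strip().upper()
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return data["rationale"], verdict == "PASS"
```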

We've talked about the different kinds of biases: position, verbosity, self-enhancement. But they're not the only ones, of course, and people typically also look at how to mitigate those with the remedies that we mentioned.

So far we've stated that we do not need human ratings to get started, but a good practice is to still look at how the LLM ratings compare with the human ratings. So here one tip is to calibrate the responses that the judge is giving with respect to the human ratings, because at the end of the day that is the quantity that we want to approximate. And so here, if there is the budget and it's possible for the project, one good practice is to collect the human ratings, output the LLM-as-a-judge scores, and then run some correlation analysis to see if there is something that can be improved, mainly in terms of the prompt.
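One way to run such a comparison on binary labels is to reuse the agreement metrics from earlier in the lecture: the raw agreement rate and Cohen's kappa, which corrects for chance agreement. This is a sketch that swaps in kappa for the generic "correlation analysis" mentioned here; the function names are illustrative.

```python
from typing import List


def agreement_rate(human: List[int], judge: List[int]) -> float:
    """Fraction of items on which the two raters gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[int], judge: List[int]) -> float:
    """kappa = (observed - chance) / (1 - chance)."""
    p_observed = agreement_rate(human, judge)
    n = len(human)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    p_chance = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)
```

A kappa close to 0 suggests the judge agrees with humans no more than chance would, which is usually a cue to revisit the judge prompt.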

And then the last thing is the temperature. So if you remember, the temperature is a parameter that you can tweak to make your generation more deterministic as opposed to more creative. And so you

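Key Summary

This lecture covers a range of evaluation methodologies for measuring the quality of LLM outputs. It systematically explains the evaluation techniques needed in practice, from the limitations of human rating to LLM-as-a-Judge, factuality evaluation, and agent evaluation.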

Key Concepts

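Limitations of Human Evaluation 06:21

Because LLM outputs are free-form, ideally a human would evaluate every output, but this is expensive
The evaluation itself can be subjective (e.g., the bar for "usefulness" differs from rater to rater)
Why inter-rater agreement matters: consistency across raters needs to be ensured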

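Agreement Metrics 09:30

Agreement Rate: simply the fraction of items on which two raters gave the same response
Problem: even random ratings yield a 50% agreement rate (when P(A)=P(B)=0.5)
Cohen's Kappa: a metric that corrects for agreement by chance
- κ = (observed - chance) / (1 - chance)
- Positive means better than chance; 1 means perfect agreement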

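Rule-based Metrics 18:32

Exact Match: whether the output exactly matches the reference answer (binary)
BLEU: evaluates translation quality, based on n-gram precision
METEOR: improves on BLEU, also accounts for recall, matches synonyms and stems
Limitation: not well suited to the free-form outputs of LLMs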

LLM-as-a-Judge 32:19

Uses another LLM to evaluate output quality
Pointwise: assigns a score to a single response
Pairwise: picks the better of two responses
Advantages: can get started without human ratings; fast and cheap

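Biases of LLM-as-a-Judge 38:40

Position Bias: prefers the response presented first → swap the order and take a majority vote
Verbosity Bias: prefers longer, more detailed responses → state it explicitly in the prompt, apply a length penalty
Self-enhancement Bias: prefers responses it generated itself → use a different model as the judge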

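Best Practices for LLM-as-a-Judge 46:40

Clear guidelines: spell out the evaluation criteria concretely
Prefer a binary scale: Pass/Fail is more consistent than fine-grained scores
Rationale before score: have the judge output its reasoning before the score (similar to CoT)
Low temperature: set to 0.1~0.2 to ensure reproducibility
Calibration against human ratings: periodically compare the LLM judge with human evaluations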

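Factuality Evaluation 52:15

The core task is evaluating whether a response is consistent with the facts
Evaluation pipeline:
1. Decompose text → facts (split into atomic claims)
2. Verify each fact independently (against a source)
3. Weighted aggregation (by importance)
Makes it possible to evaluate even complex responses systematically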
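The aggregation step of the pipeline above could look like the following sketch. It assumes steps 1 and 2 have already produced, for each atomic claim, a supported/unsupported flag and an importance weight; the normalized weighted average is one plausible choice of aggregation, not necessarily the one used in the lecture.

```python
from typing import List, Tuple


def factuality_score(facts: List[Tuple[bool, float]]) -> float:
    """Weighted share of supported facts.

    Each entry is (is_supported, weight); weights are assumed non-negative.
    """
    total_weight = sum(weight for _, weight in facts)
    if total_weight <= 0:
        raise ValueError("at least one fact with positive weight is required")
    supported_weight = sum(weight for ok, weight in facts if ok)
    return supported_weight / total_weight
```

For example, three facts with weights 2, 1, and 1, where only the second is unsupported, give a score of 0.75.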

Challenges of Agent Evaluation 68:13

An agent completes a task in several stages: Tool Prediction → Tool Call → Response Synthesis
Each stage has different failure modes

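Tool Prediction Errors 68:50

Model is too weak: switch to a stronger model
Tool hallucination: calls a tool that does not exist → improve the horizontal instruction
Wrong tool selection: picks the wrong one among similar tools → clarify the API descriptions
Wrong arguments: right tool but wrong arguments → include the needed information in the context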

Tool Call Errors 76:25

Wrong response: incorrect output caused by a bug → fix the code
No response: nothing is returned → always return a meaningful output (an empty JSON is OK, None is not)
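The "no response" remedy can be illustrated with a thin wrapper around tool functions. `call_tool` is a hypothetical helper, and the choice to serialize every result as a JSON string is an assumption about how tool outputs are passed back to the model.

```python
import json
from typing import Any, Callable


def call_tool(tool: Callable[..., Any], **kwargs: Any) -> str:
    """Run a tool and always hand the model a parseable JSON string."""
    result = tool(**kwargs)
    if result is None:
        # An empty JSON object is still interpretable by the model; None is not.
        return json.dumps({})
    return json.dumps(result)
```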

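Response Synthesis Errors 78:53

Grounding failure: fails to reference the tool output
Excessive output: too much information causes the important part to be missed → trim the output
Unstructured output: hard for the model to interpret → use structured output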

Agent Evaluation Benchmark: τ-bench 97:30

Tool Agent User Simulation benchmark
Evaluates policy compliance in the Airline/Retail domains
An LLM simulates the user to test a variety of scenarios

Key Insights

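No improvement without measurement: the first step toward improving LLM performance is building a sound evaluation framework
LLM-as-a-Judge is powerful but watch for bias: position, verbosity, and self-enhancement bias all need mitigation strategies
Binary scale + rationale: a simple Pass/Fail is more consistent than a complex scoring scheme
Agent evaluation requires stage-by-stage analysis: you can only improve once you know where it failed
Always return a meaningful output: a tool call should return at least an empty JSON instead of None so the model can interpret it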