Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation
Hello everyone and uh welcome to lecture
8 of CME295.
So today's topic will be LLM evaluation
and I think this class is probably one
of the most important classes of this uh
quarter because the idea is if we don't
know how to measure the performance of
our LLM, we don't really know what to
improve and so this class will focus on
how we can quantify how the LLM performs
in a bunch of different cases.
So with that said, we are going to start
the class as usual by recapping what we
saw last week. So if you remember last
week, we saw how our LLM could interact
with systems that are outside of the LLM
itself. So we saw one core technique
called RAG, which allows our LLM
to fetch information from external
knowledge bases. Here RAG stands
for retrieval-augmented generation,
and we saw how we could improve the
retrieval system. We saw that it
was composed of two main steps. One
was candidate retrieval, which is
typically done with
a bi-encoder kind of setup.
Sentence-BERT was a good
example of how people would design such
a model. This first step is
typically there to filter down the
potentially relevant candidates for a
given incoming query.
And then we saw that there was a second
step, which was reranking. That one
was a bit more involved and relied on
cross-encoders, which are more
sophisticated.
And we also saw some ways to quantify
how well our retrieval system performed.
And then we also saw something that was
called tool calling which is the ability
for our model to know which tool to call
with which argument.
So if you remember, if we give our LLM
the knowledge of the tools that are
available to it, it can figure out
which arguments it needs to pass to the
function based on the input
query, then run that function, and
then output the result in natural
language to the user.
And then we also saw what agentic
workflows are composed of. So,
spoiler alert, it is a
combination of the two previous
methods, RAG and tool calling. In
particular,
given an input, we're allowing our model
to make multiple calls:
to call different tools and to fetch
relevant data from other knowledge
bases. We saw one example that has been
quite successful among current
applications, which is AI-assisted
coding, which relies on this principle.
And ReAct is typically the framework
that people would use. So reason plus
act, which decomposes this into
observe, plan, and act steps.
Cool. So this is what we saw last time
and we also started from this slide last
time. If you remember, our
LLM has strengths but also weaknesses
that we're trying to mitigate. In
particular, the focus of lectures six
and seven was on methods to improve
the reasoning of the model and ways for
the model to fetch knowledge from other
systems as well as perform actions.
And today we're going to focus on the
evaluation part. In particular, given a
response that the model is giving, how
can we quantify how well the LLM is
giving its response?
Cool. So first of all, I would like to
define the term evaluation and the
meaning that we will use for this
lecture. So when we say I want to
evaluate my LLM, it can actually take a
lot of different meanings.
So when you say let's evaluate the LLM,
it can mean let's evaluate the
performance, the output, let's evaluate
this based on coherence, factuality.
Let's evaluate it based on latency. So
more system related metrics or pricing
or how often it is up and so on.
So just to make sure we're on the same
page, this lecture will mostly focus on
the output quality parts and in
particular we'll focus on quantifying
how good the actual response is.
And here you will note that this is a
challenging problem because, as we saw
previously, our LLM is a text-to-text
model that can output basically
anything.
So it can be natural language, it can be
code, it can be math reasoning and so on
and so forth. So it's very hard to come
up with universal metrics to evaluate
that. So we will see how people do this
in practice.
Cool. So given the fact that our LLM
generates free form output,
one could imagine that the ideal
scenario for us to evaluate the LLM
output would be to every time ask a
human to rate the response.
So here the ideal scenario would be okay
I give a prompt to my LLM. It gives a
response. I ask a human to rate it and I
start again and again
and uh what I do is at the end of the
day I just collect all these human
responses and I try to quantify the
overall performance of my model. Well,
as you can imagine, the main problem is
that such a system would be very cost
intensive.
But let's look at this in more detail.
So
if you remember
the LLM outputs are really free form and
there may be cases that even human
judgment may be something that is fuzzy
because maybe the rating task in itself
is subjective.
So let's take the following example.
Let's suppose I ask my LLM what birthday
gift should I get? And let's suppose the
LLM responds with a teddy bear is almost
always a sweet gift. Just pick one that
feels right for you. So let's suppose I
want to evaluate this response with
respect to the usefulness dimension.
I may have one human rater who says,
"Yeah, it's pretty useful because
a teddy bear is pretty
indicative of what the
user should get as a gift." But then
another rater may say, "No, actually
it's not useful because the
response didn't specify exactly which
teddy bear. Should I get a bear? Should
I get an elephant, a giraffe? Like,
which stuffed animal should I get?"
And so there is this notion of
inter-rater agreement,
where we're basically
concerned with making sure that
everyone is aligned on how to rate those
responses,
because sometimes like in this
illustrative example
it may be a little bit subjective.
So responses may vary. So what people
want to do is to make sure that the
guidelines are clear enough for everyone
to rate these responses in a consistent
manner.
So people come up with agreement
types of metrics.
A very natural metric that you may
think of is the quote-unquote agreement
rate.
So for instance you have two
raters. What you do is you just
measure the proportion of the time that
the two raters give the same response.
And let's suppose the response here is
binary. So let's say good or not
good.
Well do you see a problem with such a
metric?
like is this a good metric?
I guess another way to ask this question
is if I give you a given number of
agreement rates, can you tell me if it's
a good number or if it's a bad number?
Well, let's take the example of, let's
say, two raters:
let's say Alice
and let's say Bob.
And let's suppose we have two different
types of ratings that these raters can
give. So either one, the output is
good, or zero, the
output is not good.
So if we assume that the first rater
gives, let's say, random responses
with some probability P(A) for being
good and 1 - P(A) for being not
good, and then let's say Bob, who should
have eyes and a smile,
has P(B) for being good and
1 - P(B) for being not good.
Then
let's compute the agreement rate for
this case. So the agreement rate
is basically the probability
that
rater A and rater B
agree.
And so here A and B agree
if A and B both vote one
or when A and B both vote zero,
right?
But if they give their responses in
an independent and random way,
well, if you use the probability
concepts that you know, then we will
have
the probability of A and B both responding
one, which is the probability of A
responding one times the probability of B
responding one, and the same for zero. So
we will have something like this:
P(A) P(B)
+ (1 - P(A))
(1 - P(B)).
So the first term is A and B
both say one,
and the second term is A and B
both say zero.
So
let's see what the agreement rate
would be in that case.
Let's suppose P(A) is
equal to P(B), which is equal to, let's
say, 0.5.
Then the agreement rate
would be, so I'm just replacing the
numbers here, 0.5
squared + 0.5
squared. So it's 0.25 + 0.25, which is equal
to 0.5. So what that means is,
if we're just letting our raters
rate these things in a random way with
some probability, you know, P(A),
P(B), we would already have an agreement
rate of 50%, just by pure random chance.
And so
one thing that I want to say is that
this agreement rate by pure chance
is a function of the probability with which
each of these raters gives these
ratings.
And so if
this probability is actually higher, the
agreement rate by pure chance is also
higher.
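As a quick numerical check of the blackboard computation, here is a minimal sketch in Python. The function name `chance_agreement` is made up for this illustration, and it assumes two raters who say "good" independently with probabilities `p_a` and `p_b`.

```python
def chance_agreement(p_a: float, p_b: float) -> float:
    """Probability that two independent random raters agree on a binary label.

    p_a and p_b are the probabilities that rater A and rater B say "good".
    """
    both_good = p_a * p_b
    both_not_good = (1 - p_a) * (1 - p_b)
    return both_good + both_not_good


# The case from the blackboard: both raters effectively flip a fair coin.
print(chance_agreement(0.5, 0.5))            # 0.5
# Raters who say "good" 90% of the time agree by chance much more often.
print(round(chance_agreement(0.9, 0.9), 2))  # 0.82
```

So a raw agreement rate of, say, 80% means very different things depending on how skewed the raters' labels are.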
So what do I want to
say?
I want to say that if we just take the
agreement rate, then it's very hard to
put it into context in terms of what you
would have gotten if things had
happened just by pure chance.
So for this reason,
people have come up with a series of
metrics that try to make it more
relative to this baseline, which is what
would happen if our raters chose
things kind of randomly.
And so you have these metrics, like for
instance this one, the Cohen's kappa
metric, which
computes a quantity that is a function
of this agreement rate by chance
and the observed one,
such that if our observed agreement rate
is greater than the by-chance agreement
rate,
then our coefficient is positive. So
when it's positive, at least you
know it's going in the right direction.
So here if the observed agreement rate
is equal to one,
then kappa is equal to one.
But if our observed agreement rate is
below the by-pure-random-chance
agreement rate that we saw on the
blackboard, then our coefficient would be
negative.
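To make these three cases concrete, here is a small sketch of Cohen's kappa for two raters and binary labels, using the standard definition kappa = (observed - chance) / (1 - chance). The chance term is estimated from each rater's own rate of saying "good". The function name and the example ratings are made up for illustration.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters giving binary labels (1 = good, 0 = not good)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each rater's empirical probability of saying "good".
    p_a = sum(ratings_a) / n
    p_b = sum(ratings_b) / n
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if chance == 1.0:
        # Kappa is undefined here (0 / 0); returning 1.0 is a choice made for this sketch.
        return 1.0
    return (observed - chance) / (1 - chance)


alice = [1, 1, 0, 1, 0, 1, 1, 0]
bob = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(alice, bob))    # positive: better than chance
print(cohens_kappa(alice, alice))  # perfect agreement
print(cohens_kappa([1, 0, 1, 0], [0, 1, 0, 1]))  # negative: worse than chance
```

In this toy example Alice and Bob agree on 6 of 8 items (75%), but since chance alone would give about 53%, kappa lands well below 0.75.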
So long story short, there is a bunch of
metrics that try to quantify inter-rater
agreement using these kinds of
formulas, to be able to make these
quantities relative to what would happen
if things were done in a random way.
And so that's why you may see a bunch of
metrics out there. So here it's Cohen's
kappa, which people use for cases where
there are two raters, but then you have
extensions such as Fleiss' kappa and
Krippendorff's alpha that you may see out
there. They all rely on this idea
that we should have some baseline,
which is,
you know, our raters just randomly
picking answers, and try to see how much
better our actual agreement is compared
to this.
So does that make sense?
Yeah. So I guess what I want to say is
that the first limitation of asking
humans
to rate our LLM outputs, which was
the task sometimes being subjective,
is something that we can quantify
with these inter-rater agreement metrics.
So what people typically do is they keep
track of how good that agreement is. And
if, let's say, we have a quantity that's
not satisfactory,
people would just hold some
quote-unquote agreement sessions between the
raters to align on how they should
rate the answers.
So it can be seen as just a health
metric to track how consistent your
ratings are, and this is typically
something that people use in practice.
So
up until now we've seen one limitation
of human ratings.
Well, the second limitation, I think I also
said this previously, is that it's really
slow. You know, if you ask someone to
rate a thousand LLM outputs, well, it
will take them a while, and it's of
course expensive.
So, all of that to say that our ideal
scenario of asking a human to rate every
LLM output is not something that is
practical.
But we can leverage human
ratings in some way, because we've seen
that even if the task is subjective, we
can have a way to align our raters.
So now let's move on to another way to
go about doing this, which is by using
some rule-based metrics.
So here I'm just going to revise the
setting that I mentioned before,
and instead of asking our humans to
rate every LLM output,
this time I'm just going to ask them to
write
the references, or the ideal outputs, for
a given set of prompts, just fix that
for good,
and then use some kind of metric that
would compare the LLM outputs
with those references.
So here the main difference is let's
suppose I have a given set of prompts
fixed.
Well, I can make iterations in my model
and always compare the output of my LLM
with this fixed reference instead of
always asking humans to rate that again
and again. So, it's already an
improvement
and we will see a little bit what
kinds of rule-based metrics you
will see out there.
So ideally these metrics should reflect
the performance of the LLM output in
an optimal way. And what I mean by an
optimal way is to make it
a little bit flexible, given the fact that
natural language is not always something
that you can say in one given way.
So for instance when I provide a
response to a given prompt there can be
very well a case where I can formulate
the response slightly differently but it
will still be just as good.
So the idea behind these metrics is to
make this comparison
a little bit flexible.
So let's start with one common one that
people use in the translation case. So
this metric is called METEOR, and it
stands for Metric for Evaluation of
Translation with Explicit ORdering.
So the idea here is to compare
reference and prediction, and we'll see
how it's being done, and also
penalize cases when words are not in the
same order, which explains why the
metric's name includes "with explicit ordering".
So the formula is as follows. It is
some F-score times 1 minus some penalty.
So the F-score here: you may be
familiar with the F1 score, which is the
harmonic mean with equal weights.
This one uses variable weights.
So it is a function of precision and
recall,
where precision is the proportion of the
unigrams in your predicted
sequence
that match the reference, and
the recall is the proportion of the
unigrams in the reference
that match what is in the
prediction. So it's basically matching
the
usual precision and recall metrics that you
know. And then we have another quantity
here, which is the penalty, and as I
mentioned, the penalty here tries to
incentivize
good ordering. So if it's ordered the
same in the reference and in the
prediction, then it's good; otherwise it's
bad.
And so here there's a bunch of
quantities. So gamma and beta are
hyperparameters that people arbitrarily
choose.
And it's a function of C,
the number of contiguous chunks that are
matched,
over the number of
matched unigrams.
So ideally you would want
C to be as low as possible,
because if you have a low number of
contiguous chunks,
it means that your contiguous sequences
are long,
which means that the ordering is the
same.
So you want C to be low
and then matched unigrams to be high.
So you want that penalty term to be low
for a prediction that
has the same ordering as the reference.
So I guess a
higher METEOR score means a better
translation according to this way of
doing things.
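The formula described above can be sketched in a few lines of Python. This is a simplified illustration, not the reference METEOR implementation: it only matches identical lowercase words (no synonyms or stems), it aligns words greedily from left to right, and it assumes the F-score form P·R / (alpha·P + (1 - alpha)·R) with penalty gamma·(C / matched)^beta. The default values alpha=0.9, beta=3, gamma=0.5 are placeholders chosen for this sketch.

```python
def meteor_sketch(reference: str, prediction: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Simplified METEOR-style score: F-score * (1 - ordering penalty)."""
    ref = reference.lower().split()
    pred = prediction.lower().split()

    # Greedy alignment: each predicted word takes the first unused identical reference word.
    used = [False] * len(ref)
    alignment = []  # list of (prediction index, reference index)
    for i, token in enumerate(pred):
        for j, ref_token in enumerate(ref):
            if not used[j] and ref_token == token:
                used[j] = True
                alignment.append((i, j))
                break

    matched = len(alignment)
    if matched == 0:
        return 0.0

    precision = matched / len(pred)
    recall = matched / len(ref)
    f_score = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # C: number of contiguous chunks, i.e. runs of matches adjacent in both sequences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1

    penalty = gamma * (chunks / matched) ** beta
    return f_score * (1 - penalty)


reference = "the cat sat on the mat"
print(meteor_sketch(reference, "the cat sat on the mat"))  # one long chunk, tiny penalty
print(meteor_sketch(reference, "on the mat sat the cat"))  # same words, scrambled order
```

Both predictions contain exactly the same unigrams, so precision and recall are identical; only the chunk count C differs, and that is what drives the scrambled sentence's score down.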
So I guess when you look at this formula,
first of all, it looks very
arbitrary,
right? I have alpha as a
hyperparameter, gamma, beta, so it's kind
of
a recipe, I feel. So that's one.
And the second thing is that it does not
allow for stylistic variations, because
here we're measuring the number of
matched unigrams.
The metric does expand the
range of what counts as matched unigrams by
taking into account things like words
that are synonyms of one another and
words that share the same root,
but still it is not extremely
satisfactory in that sense.
So METEOR is one such metric.
You have another one that
has been used in translation
tasks, which is called BLEU, which you may
know. So BLEU stands for bilingual
evaluation understudy. And you can
think of this as a
precision-focused kind of metric
that looks at the number of
matching n-grams
over the n-grams that are in the
prediction, which is why it's a
precision kind of metric.
And it also has a penalty term here.
It's called the brevity penalty,
because given that it's more of a
precision kind of metric, if you
translate something that's very short,
you may be able to game the metric.
So you want to penalize
the translation being too short.
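To see why the brevity penalty is needed, here is a minimal BLEU-style sketch. It is an illustration under simplifying assumptions, not a reference implementation: a single reference, whitespace tokenization, n-grams up to length 2, clipped n-gram counts, a geometric mean of the precisions, and the usual brevity penalty exp(1 - r/c) when the prediction (length c) is shorter than the reference (length r). The function names are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(reference: str, prediction: str, max_n: int = 2) -> float:
    """Simplified BLEU: brevity penalty * geometric mean of n-gram precisions."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    if not pred:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(pred_counts.values())
        # Clipped matches: an n-gram counts at most as often as it appears in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))

    geometric_mean = math.exp(sum(log_precisions) / max_n)
    # Brevity penalty: only applies when the prediction is shorter than the reference.
    if len(pred) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * geometric_mean


reference = "a plush teddy bear can comfort a child during bedtime"
print(bleu_sketch(reference, reference))             # identical output
print(bleu_sketch(reference, "a plush teddy bear"))  # perfect precision, but far too short
```

The truncated prediction has perfect unigram and bigram precision, so without the brevity penalty it would score as high as the full sentence.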
So we'll not go into a lot of details,
but I just want to show you the
kinds of metrics that are out there. So
METEOR is one, BLEU is another one, and
ROUGE, which you may have heard of, is
another one, typically used for
summarization tasks.
Again, same idea, and it has a bunch of
variants that you may see out there.
But long story short, all these metrics
compare
the output with a reference.
So as we saw one key limitation is that
they do not allow stylistic variation.
So let's take an example. So let's
suppose I say a plush teddy bear can
comfort a child during bedtime. Well,
I can say the exact same thing
in a really different way. So:
soft stuffed bears often help kids
feel safe as they fall asleep, or many
youngsters rest more easily at night
when they cuddle a gentle toy companion.
So in all these cases, the metrics that
we saw would perform
very poorly.
So that's one key limitation. So the
second key limitation is correlation is
not that great.
I mean, you can imagine that people have
come up with all these hyperparameters
to kind of make it be correlated to
human ratings, but they're not that
correlated.
And the bottom line is it still requires
human ratings
to just get started. And sometimes
you just can't afford to have human
ratings maybe in your project.
So I guess there are still some key
limitations
which is the reason why. So all of that
to say I want to motivate the key method
of this class or of this lecture which
is called LLM as a judge.
So, you know, we spent the first seven
lectures motivating these large language
models that are pre-trained on huge
amounts of data that are uh tuned in a
way to match human preference. So, they
do contain human knowledge. They do
contain some indication of what humans
may prefer.
So the idea here is to have our model
response
be actually an input of yet another LLM
and that LLM is something that people
typically call LLM as a judge.
So it was a term that was introduced in
a paper uh from two years ago.
So here the idea is to use an LLM for
rating purposes
and things that you would see as input
would be the prompt that was used to
produce the response,
the response
and the criteria along which you want to
grade your response.
And so here LLM as a judge would give
you the following outputs. So the first
thing is it would give you a score.
So here you can think of it as like a
binary scale kind of score. So pass or
fail
but also, and this is very new,
a rationale,
because LLMs understand text,
so they can also explain to you why they
graded something with a given score.
And that part is
the key difference with previous
methods: we are able to explain why
the model is giving us a given
score. And this is quite good,
because in the other, let's say, rule-based
world, you would have all these
formulas and multiplications,
and sometimes you would
come up with a number that would not be
very self-explanatory.
And this is luckily something that
LLM-as-a-judge addresses.
So to recap,
what we want is to use an LLM as a way
to grade the response.
So here you would typically have
the following kind of prompt. You
would state, okay, I want to evaluate my
response with respect to a given
criterion,
and then you give the prompt that you
used to generate that response along
with the model response,
and then you would ask
the judge to return two things: the
rationale and then the score.
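As an illustration of that setup, here is one possible way to assemble such a judge prompt in Python. The template wording, the PASS/FAIL labels, and the function name `build_judge_prompt` are assumptions made for this sketch; the slide's exact prompt is not reproduced here.

```python
# Hypothetical judge prompt template: criterion, original prompt, model response,
# then a request for the rationale first and the score second.
JUDGE_TEMPLATE = """You are evaluating a model response.

Criterion: {criterion}

Prompt given to the model:
{prompt}

Model response:
{response}

First write a short rationale, then give a score.
Return exactly two fields, in this order:
rationale: <your explanation>
score: <PASS or FAIL>"""


def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Fill the judge template with the three inputs described in the lecture."""
    return JUDGE_TEMPLATE.format(criterion=criterion, prompt=prompt, response=response)


judge_prompt = build_judge_prompt(
    criterion="usefulness",
    prompt="What birthday gift should I get?",
    response="A teddy bear is almost always a sweet gift. "
             "Just pick one that feels right for you.",
)
print(judge_prompt)
```

The resulting string would then be sent to whichever model serves as the judge.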
So one little trick I want to point out
is people typically ask the model to
first output the rationale and then the
score
and the reason why we typically do that
is it's something that empirically
improves the quality of the results
And given what we saw, I think in
lecture six, if you remember, the
reasoning class, we saw that the
reasoning models that are trendy,
especially in 2025,
first output a
chain of thought before giving the
answer.
So you can actually think of this trick
as being along the same lines as
reasoning models, in that it allows the
model to
externalize, to verbalize, its quote-unquote
thought process before giving the score.
So it gives it a chance to really figure
out what is good or what is wrong in the
model response.
So far so good.
Any questions on I guess the setup?
Yeah, all good. Okay. So now I have a
question for you.
If I give the following prompt to my
LLM-as-a-judge,
am I guaranteed to get a rationale and a
score that I can parse?
Am I guaranteed?
No. Yeah, exactly. The answer is no.
You're not guaranteed to get a
rationale and a score that you can parse,
because this model has a
probabilistic nature with the
sampling process, and it's not something
that you can really control. So I guess
my follow-up question is: do you know a
technique that would
guarantee a
structured response? The hint is that it's a
technique that we saw
towards the beginning of the class.
Okay, I'll give you a little hint. So if
you remember, on slide 65 of lecture 3,
we saw a technique called constrained
guided decoding.
So if you remember, the idea here is to
constrain the decoding process
by allowing our model to only sample
from quote-unquote valid tokens.
And we typically do that in cases where
we want our output to have a given
format. So let's suppose a JSON format,
and we absolutely want that format.
So what people do is they use this
technique to guarantee the form of the
response.
And in case you're using
the providers that are out
there, like for instance OpenAI,
Gemini, or Anthropic,
this technique is known under the name
structured output.
So in your project, if you want to
constrain the decoding process
in order to output a response of a given
format, let's suppose my format is a
response and I represent it by a
class with two attributes,
rationale and score.
Well, typically you can reference that
with the argument text_format equal to
that representation.
So this, I believe, is something that
OpenAI does, and I'm not exactly sure if
it's exactly the argument name that you
would see for the other providers, but
they're all, I guess, along the same
lines.
Does that sound good? So the keyword
here is structured output. Whenever you
want a response of a given format, you
would just go for that.
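To give an idea of what such a two-attribute representation could look like, here is a small sketch using only the Python standard library. It does not call any provider and does not perform constrained decoding itself; it only defines the target schema and a defensive parser for a judge's JSON output. The class name `JudgeResult`, the PASS/FAIL values, and `parse_judge_output` are assumptions for this illustration. Provider SDKs typically expect their own schema type, which is not shown here.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResult:
    rationale: str  # the judge's explanation, produced first
    score: str      # "PASS" or "FAIL" (binary scale assumed for this sketch)


def parse_judge_output(raw: str) -> JudgeResult:
    """Parse and validate a judge response that is supposed to follow the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict) or set(data) != {"rationale", "score"}:
        raise ValueError("expected exactly the keys 'rationale' and 'score'")
    if not isinstance(data["rationale"], str):
        raise ValueError("rationale must be a string")
    if data["score"] not in ("PASS", "FAIL"):
        raise ValueError("score must be 'PASS' or 'FAIL'")
    return JudgeResult(rationale=data["rationale"], score=data["score"])


result = parse_judge_output(
    '{"rationale": "The answer names a gift but gives no specifics.", "score": "FAIL"}'
)
print(result.score, "-", result.rationale)
```

With structured output enabled on the provider side, the validation step should never fail; without it, a parser like this at least fails loudly instead of silently accepting a malformed rating.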
Okay, cool. So just to recap,
LLM-as-a-judge has two main benefits. The
first one is that we do not need a
reference text. We do not need human
ratings to just get started, because our
LLM already has a lot of
knowledge that it has acquired during
pre-training, about human preferences and
so on. So you do not need that.
And then the second thing is you can
interpret the score with the rationale
that is being output,
and that is also quite remarkable.
So just as an example, here you would
say, okay, evaluate the quality of
this response. And you would get
a rationale that explains what
this response has or doesn't have that
makes it good or bad,
along with the score.
Okay, cool. So
now we're going to see the kinds of
LLM-as-a-judge setups that you can see out there. So
of course there are many variations, but
there are generally two types
that you will see.
The first one is where you have a single
response that you want
to evaluate,
and here you would ask the judge to say,
okay, is it good or is it not good.
And the second big kind
that
you will see out there is the pairwise
setup. So you have two responses and
you say, is response A better or is
response B better,
and here you would obtain a response:
either that one or this one.
So if you remember we've seen uh in
previous lectures that there are a lot
of situations where we would want to
have preference data. So for instance in
the preference tuning class that we had
I believe it was lecture five.
So this kind of method can also be a
good way to synthetically generate
preference ratings,
where you have two responses
and then you ask your LLM to say, okay, I
prefer that one, and you can use that
as the label to train your reward
model.
Does it sound good?
Any questions on the setup or
everything that we've talked about so
far?
Okay, cool. Everyone is on the same
page. So now let's see what can go wrong
with our LLM-as-a-judge. Let's
think of the possible kinds of failures
that we can encounter.
So the first one is called position bias,
and as the name suggests, it has to do
with the order in which we present
the responses to our model.
So let's say we ask our model, is
response A better or response B?
Well, there is a chance that the model
responds with response A just because
it was the first one to be mentioned.
So that kind of bias is called position
bias. It's where the position at
which you place the response matters in
the
judgment of the LLM-as-a-judge model.
And as a way to remedy that,
people have different techniques. But
one typical technique would be to ask
the model, is A or B better, and then ask
the model, is B or A better, and then take
the majority vote. So if both of them
lead to the same response, then it's
good. But if the response changes,
then it may not be good. So you may want
to do something else.
There are a bunch of other
techniques. I know there's a bunch of
papers that try to tweak the position
embeddings, but those are a bit more
advanced. So it's not typically the
thing that you would do just out of the
box. Taking the average, or
taking the majority vote, over this
position swapping is typically what
you would do.
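The position-swapping remedy can be sketched as follows. The judge is represented by a hypothetical callable that takes the prompt and two responses and answers "first" or "second"; in a real project it would wrap a call to the judge model. The two toy judges below are made up purely to exercise the logic.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def pairwise_with_swap(judge: Judge, prompt: str, response_a: str, response_b: str) -> str:
    """Judge both orderings and return "A", "B", or "inconsistent"."""
    verdict_ab = judge(prompt, response_a, response_b)  # A shown first
    verdict_ba = judge(prompt, response_b, response_a)  # B shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    # Only trust the verdict if it survives swapping the order.
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always picks whatever comes first."""
    return "first"


def prefers_shorter(prompt: str, first: str, second: str) -> str:
    """Toy judge whose choice depends on content (length), not on position."""
    return "first" if len(first) <= len(second) else "second"


question = "What birthday gift should I get?"
print(pairwise_with_swap(always_first, question, "A teddy bear.", "A long essay about gifts."))
print(pairwise_with_swap(prefers_shorter, question, "A teddy bear.", "A long essay about gifts."))
```

When the result is "inconsistent", the pair can be dropped, re-judged, or sent to a human, depending on the project.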
Okay, cool. So this was the first kind
of bias. The second bias is called
verbosity bias.
So let's suppose you have two responses
and the first response is short and
concise.
The second response is something that
goes much more into details is typically
something that is more verbose.
Well, there are cases where the model
will tend
to prefer responses that are
more verbose just because they're
more verbose, not necessarily because
they're more correct.
And for that, it's maybe a little bit
trickier. So
people typically try to make
this dimension explicit in the guidelines.
When they give this
question to the LLM-as-a-judge, they say,
well, make sure to not pay too much
attention to the length of these
responses, to not prefer
something just because it's more
verbose.
So that's one kind of method that
you will see out there.
The second one is to also add
some in-context learning
examples
for the model, to
show by example that
verbosity is not something it should
prefer. And then the last one is to
have some kind of penalty
on the output length.
So you can ask your model in a pointwise
way, how good is one, how good
is two, and then try to penalize that
with the length. So that's something
that people may also use.
Okay. So we've seen position bias, we've
seen verbosity bias, and now we will see the
third kind of bias that you may see out
there, which is called self-enhancement
bias.
And so that one has to do with the fact
that if you ask a model to judge an
output
that was produced by itself,
well, the model will tend to prefer
responses that are generated by itself
regardless of whether or not the other
one was more
uh aligned with what we wanted.
And I guess here the intuition is that
if our model
generated such an answer, then it's maybe
the case that
our model thought that, from a
probabilistic standpoint, this was a
sequence that was very likely to
appear. So one way to
think about it is that if it has
generated such a sequence, then it means
that it is something that
it thinks, quote-unquote,
is a good answer.
So the general guideline here is to
typically not use the same model
for generation and for
judging.
But I guess nowadays it's kind of
hard to have that strict
constraint
respected, because I guess all models
are trained on basically the same
datasets.
So you can argue they're all being
subject to the same training
mixes and so on. But still,
what people do is they tend
to use another model just to
minimize such a risk. So long story
short, try to not use the exact same
model for generation
and for evaluation.
So this is self-enhancement bias.
Okay. So before we go to the next
subpart, what do you think of
these three biases? Do they make sense?
Any questions so far?
Yep.
So, can you elaborate a bit more?
Yeah.
Yeah. So, the question is,
can you have a model that just maybe
isn't aligned with, I guess, the ground
truth and maybe prioritizes one
label over another. So, yeah, this can
definitely be another kind of bias,
this bias being that our LLM is not
exactly aligned with what humans would
prefer. These three biases are by no
means exhaustive. So this can very well
be another bias that you can list as
well. So yeah, this is definitely
another kind of bias.
Yep.
Yep.
So the question is, is it possible that
our judge still prefers an LLM response
even if it's a different one? Well, it
depends on how good your judge is. But
typically the best practice is to have a
judge that has a much bigger capacity,
which may capture these kinds of
differences and not be fooled by a
response that just sounds like something
it may generate, and instead prefer something that is
more aligned with human
preferences. So I guess the short answer
is yes, you can still have such a
situation. But in order to mitigate that
risk, you would typically take a model
that is not the same but also typically
much bigger.
So you have a bunch of such models
out there, and with all the
improvements that have been
made with reasoning models, this is
also something that people try.
Yeah, the question is, should the judge be
bigger? It's not a hard constraint,
but typically people
would take a bigger model that
would have strong reasoning capabilities
that could really tease out what's good
and what's not good. Yeah.
Okay, cool. So, with that, I'm going to
just go over the best practices that
we've seen. So we saw that in order
for our LLM-as-a-judge to output a
score, we need to give the criteria that
we want the response to be evaluated against.
But sometimes these criteria may be a
little bit subjective. So one thing that
works very well is to have crisp
guidelines, so be really explicit about what we
want and what we don't want.
The other point is you may see different
kinds of scales out there. So sometimes
people have a scale that is
more granular, and in other cases
we're just operating on a binary
scale. Typically what people
tend to prefer is actually the binary
one, because it makes the job of the
LLM-as-a-judge easier. It's just either
good or bad.
And also when it comes to aligning the
judge with human ratings, humans
typically also find it easier to just
judge between two options as opposed to
several. So it just removes the noise of
having several possible choices, which
does not necessarily add extra signal
that would be really useful. So here the
tip is to use a binary scale, like a pass
or fail kind of score, as opposed to
a graded one.
The third tip is to make sure to output
the rationale before outputting the
score.
And we've seen this is along the same
lines as outputting a chain of thought
before providing the response, which is
something that is done by our reasoning
models. So it's typically something that
will improve the judge performance.
We've talked about the different
kinds of biases: position, verbosity,
self-enhancement. They're not the only
ones, of course, and people
typically also look at how to mitigate
those with the remedies that we
mentioned.
So, so far we've stated that we do not
need human ratings to get started,
but a good practice is to still look at
how the LLM ratings compare with the
human ratings.
So here one tip is to just calibrate the
responses that the judge is giving with
respect to the human ratings, because at
the end of the day that is the quantity
that we want to approximate.
And so here, if there is the
budget and it's something that is
possible for the project, one good
practice is to collect the human ratings,
output the LLM-as-a-judge scores, and then
run some correlation analysis
to see if there is something that can be
improved, mainly in terms of the
prompt.
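As one simple way to start such a comparison, here is a sketch that lines up binary human ratings with binary judge ratings on the same items. It reports a plain agreement breakdown rather than a full correlation analysis, and the function name, the returned keys, and the example labels are all made up for illustration.

```python
def compare_judge_to_humans(human_labels, judge_labels):
    """Compare binary pass/fail labels (True = pass) from humans and from the judge."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human_labels, judge_labels))
    agreement_rate = sum(h == j for h, j in pairs) / len(pairs)
    # Indices of items to read closely when iterating on the judge prompt.
    judge_too_lenient = [i for i, (h, j) in enumerate(pairs) if j and not h]
    judge_too_strict = [i for i, (h, j) in enumerate(pairs) if h and not j]
    return {
        "agreement_rate": agreement_rate,
        "judge_too_lenient": judge_too_lenient,
        "judge_too_strict": judge_too_strict,
    }


human = [True, True, False, False, True, False]
judge = [True, False, False, True, True, False]
report = compare_judge_to_humans(human, judge)
print(report)
```

The disagreement lists are often the most useful part: reading the items where the judge was too lenient or too strict usually suggests what to make more explicit in the guidelines. The raw agreement rate has the same chance-level caveat as agreement between human raters.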
and then the last thing is the
temperature.
So if you remember the temperature
um is a parameter that you can tweak to
make your generation more deterministic
as opposed to more creative. And so you
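Key Summary
Covers a range of evaluation methodologies for measuring LLM output quality. It systematically explains the evaluation techniques needed in practice, from the limitations of human rating to LLM-as-a-Judge, factuality evaluation, and agent evaluation.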
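Main Concepts
Limitations of Human Evaluation 06:21
- Because LLM output is free-form, a human would ideally rate every response, but that is costly
- The rating itself can be subjective (e.g., the bar for "usefulness" differs across raters)
- Why inter-rater agreement matters: consistency across raters needs to be ensured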
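Agreement Metrics 09:30
- Agreement Rate: simply the proportion of cases where two raters give the same response
- Problem: even random rating yields a 50% agreement rate (when P(A)=P(B)=0.5)
- Cohen's Kappa: a metric that corrects for agreement by chance
- κ = (observed - chance) / (1 - chance)
- Positive means better than chance; 1 means perfect agreement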
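Rule-based Metrics 18:32
- Exact Match: whether the output exactly matches the reference (binary)
- BLEU: evaluates translation quality, based on n-gram precision
- METEOR: improves on BLEU, also considers recall, with synonym/stem matching
- Limitation: not well suited to the free-form output of LLMs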
LLM-as-a-Judge 32:19
- Uses another LLM to evaluate output quality
- Pointwise: assigns a score to a single response
- Pairwise: picks the better of two responses
- Advantages: can get started without human ratings; fast and cheap
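Biases of LLM-as-a-Judge 38:40
- Position Bias: preference for the response presented first → swap the order and take a majority vote
- Verbosity Bias: preference for longer, more detailed responses → state it explicitly in the prompt, length penalty
- Self-enhancement Bias: preference for responses the model itself generated → use a different model as the judge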
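Best Practices for LLM-as-a-Judge 46:40
- Clear guidelines: state the evaluation criteria concretely
- Prefer a binary scale: Pass/Fail is more consistent than fine-grained scores
- Rationale before score: have the judge output its reasoning before the score (similar to CoT)
- Low temperature: set to 0.1~0.2 for reproducibility
- Calibration against human ratings: periodically compare the LLM judge with human evaluation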
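Factuality Evaluation 52:15
- The core is evaluating whether the response is consistent with the facts
- Evaluation pipeline:
- Decompose Text → Facts (split into atomic claims)
- Verify each fact independently (against a source)
- Weighted aggregation (by importance)
- Makes it possible to evaluate even complex responses systematically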
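The last step of that pipeline, weighted aggregation, can be illustrated with a short sketch. The decomposition and verification steps are assumed to have already happened (for example, via an LLM and a source lookup); the `Fact` class, the weights, and the example claims below are made up for illustration, and a weighted average is only one plausible way to aggregate.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    claim: str       # one atomic claim extracted from the response
    weight: float    # importance assigned to the claim (hypothetical values)
    supported: bool  # result of checking the claim against a source


def factuality_score(facts) -> float:
    """Weighted share of claims that were verified as supported."""
    total_weight = sum(fact.weight for fact in facts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    supported_weight = sum(fact.weight for fact in facts if fact.supported)
    return supported_weight / total_weight


facts = [
    Fact("Teddy bears are a common gift for children.", weight=2.0, supported=True),
    Fact("Teddy bears are named after Theodore Roosevelt.", weight=1.0, supported=True),
    Fact("Teddy bears were invented in 1850.", weight=1.0, supported=False),
]
print(factuality_score(facts))  # 0.75
```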
Challenges of Agent Evaluation 68:13
- An agent carries out a task in several stages: Tool Prediction → Tool Call → Response Synthesis
- Each stage has different failure modes
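Tool Prediction Errors 68:50
- Model is too weak: switch to a stronger model
- Tool hallucination: calling a tool that does not exist → improve the horizontal instruction
- Wrong tool selection: picking the wrong one among similar tools → clarify the API descriptions
- Wrong arguments: right tool but wrong arguments → include the needed information in the context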
Tool Call Errors 76:25
- Wrong response: incorrect output caused by a bug → fix the code
- No response: nothing is returned → always return a meaningful output (an empty JSON is OK, None is not)
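The "never return None" advice can be sketched with a thin wrapper around tool execution. The wrapper name `run_tool`, the example tool, and its catalog are hypothetical; the point is only that the model always receives a parseable JSON string.

```python
import json


def run_tool(tool, **kwargs) -> str:
    """Call a tool and always hand the model a JSON string, never None."""
    result = tool(**kwargs)
    if result is None:
        result = {}  # an empty JSON object still tells the model "nothing was found"
    return json.dumps(result)


def find_teddy_bears(max_price):
    """Toy tool that (problematically) returns None when nothing matches."""
    catalog = [{"name": "Classic bear", "price": 25}]  # made-up data
    matches = [item for item in catalog if item["price"] <= max_price]
    return matches or None


print(run_tool(find_teddy_bears, max_price=30))  # a JSON list with one item
print(run_tool(find_teddy_bears, max_price=10))  # {} instead of None
```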
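Response Synthesis Errors 78:53
- Grounding failure: fails to reference the tool output
- Too much output: important details are missed amid too much information → trim the output
- Unstructured output: hard for the model to interpret → use structured output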
Agent Evaluation Benchmark: τ-bench 97:30
- Tool Agent User Simulation benchmark
- Evaluates policy compliance in the Airline/Retail domains
- An LLM simulates the user to test a variety of scenarios
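Key Insights
- No improvement without measurement: the first step toward better LLM performance is building a sound evaluation framework
- LLM-as-a-Judge is powerful, but watch for bias: position, verbosity, and self-enhancement bias all need mitigation strategies
- Binary scale + Rationale: a simple Pass/Fail is more consistent than a complex scoring scheme
- Agent evaluation requires stage-by-stage analysis: you need to identify where the failure occurred in order to improve
- Always return a meaningful output: a tool call should return at least an empty JSON instead of None so that the model can interpret it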