Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 7 - Agentic LLMs
Hello everyone and uh welcome to lecture
7 of CME 295.
So today we're going to uh focus on
practical techniques to let our LLM
interact with the outside world with
other systems because um up until now
our LLM was purely on its own. We've
trained it. We've seen how it can reason
on math and coding
problems. And now what we want to do
is to use our LLM in the context of
other systems. So today's class will
focus on RAG, which you may have heard of,
tool calling,
and agents.
But before we start as usual I'm going
to recap what we did uh last time. So if
you remember last time we focused on
reasoning models and we saw the
differences between reasoning models and
what we call vanilla LLMs. In
particular, up until the lecture before
last, what we saw was that we fed a
prompt to the LLM and it gave us
directly a response. But what we saw
last time was that if we let the LLM
reason before outputting the response,
then we can gain some performance when
it comes to reasoning tasks such as math
and coding. And so in particular
reasoning models what they do is they
take a prompt as input and then what
they output is both a reasoning chain
which is typically hidden from the user
and then a response.
So with that we saw how we could train a
model to be more of a reasoning model.
And in particular we saw a core RL
algorithm called GRPO which stands for
group relative policy optimization.
And we saw that this algorithm had some
differences compared to the ones that we
saw previously. And in particular one
notable aspect is that
it does not train a value function.
So here this illustration shows a
little bit how GRPO is trained. So it
takes a query as input and then it
computes an advantage for each output by
computing the rewards for different
completions of the same prompt and then
computing a quantity which is the
advantage that is relative to the other
rewards of that group of completions.
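
The group-relative advantage described above can be sketched in a few lines. This is a minimal sketch, assuming the commonly used form where each reward is standardized by the mean and standard deviation of its own group; the function name, the `eps` term, and the example rewards are illustrative assumptions, not taken from the lecture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to the other completions
    sampled for the same prompt (no learned value function needed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumption) avoids dividing by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage, and the ones below get a negative one.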
And then we saw that if we applied GRPO
with carefully chosen rewards, which are
first rewarding the model for outputting a
reasoning chain and second
rewarding the model for producing a
good response, what we saw is that as
the RL training progresses, we have an
improvement of the model on these
reasoning tasks. And so we saw one of
the tasks being math problems. So
here on the left graph you see the
evolution of the performance of the
model on the AIME dataset, which is a
set of challenging math problems.
But we also saw that the
model kept on outputting
responses that were longer and longer.
And in particular we saw that even
though you know towards the end of the
graph above uh that the performance was
kind of plateauing we saw that the
output length was still increasing.
So then what we did was go back to the
loss formulation that is used by GRPO
and realized that there is a term that
makes the contribution of a token
different if it is in a short response
or a long response.
So this is a phenomenon called length
bias and we saw some mitigation
strategies that were explored by some
papers that came out uh in the past few
months. So one was DAPO, which had a
normalization factor that did not
depend on which response the token was
located in. And
the other one was this paper called
GRPO Done Right (Dr. GRPO), which actually just
removed
the normalization term.
I'll go down that.
Cool. So this was last time and last
time what we said right before starting
the reasoning class was to enumerate the
strengths and the weaknesses of vanilla
LLMs.
So last lecture was all about focusing
on how we can improve the limited
reasoning capabilities of vanilla LLMs.
And in this lecture what we will do is
two things. So the first one is see how
we can connect our LLM to the ever
evolving knowledge base
and in particular see how we can um have
access to the latest information.
And then the second one is how our LLM
can help us perform actions and we will
see this with Shervin uh with things
like tool calling and agentic workflows.
Cool. So with that let's start with the
first one and let's start with this
method called RAG, which you may have
heard of.
So
let's suppose you have a model that you
have trained
but the problem is that the
pre-training data on which you have
trained your model stops, let's say, a month
ago.
Now let's suppose you want to prompt
your model about the winner of the
elections that happened a couple of
weeks ago. Well, your model will not be
able to respond to you or it will output
the incorrect answer because up until
now our LLM does not have any link to
outside sources. It only relies on the
knowledge that it has acquired during
training. So the response that it will
give us will only be based on the data
that it has been trained on up until the
cutoff, which is a month ago in this
example.
And so we have this big limitation which
is our LLM only knows about things that
it has been trained on. And you will
see that with all the models out there. So
here I have an example with OpenAI GPT-5.
So if you look at their model cards,
they always have a knowledge cutoff
date that is written somewhere. And in
the case for instance of GPT-5, the
knowledge cutoff date is September
30th,
2024.
Which means that if you ask it in a very
naive way about anything that happened after
that, the base model will not be
able to answer you as is.
Well, you may tell me, why not
just continue training your model
on data from after
that? Well,
actually there are several problems with
this. So the first problem is that it's
very tricky to change the knowledge of
an LLM without causing regression on
other things. So this is typically a
task that people they try to avoid
doing. And the second thing is it's not
very practical because you may very well
have use cases that require you to
fine-tune this model.
So let's suppose you have a use case
one. You fine-tune from this model and
then somehow you want to update the
weights of your model to inject some
knowledge. Well, you somehow will have
to do that for all the use cases that
you are doing which basically adds a lot
of overhead for you and just uh you know
adds a lot of maintenance.
So people they typically prefer to not
do additional training to inject
knowledge. So one idea can be to somehow
take your prompt and just add anything
that happens after the cut off date
as a way for your model to just know
what happened.
Well,
the problem with that naive approach is
that, as you know, context length
is limited, and so typically models have
on the order of
hundreds of thousands of tokens of
context length.
Do you know what that is roughly like
what is this roughly equal to?
Yes. So one token is roughly equal to four
characters. So using this
rough approximation,
hundreds of thousands of tokens is
roughly hundreds of pages,
something like a very big book. So
I mean, it's big,
but it's not enough for us to go down that
very naive route.
So again going back to GPT-5, if
you go to the model card you have the
knowledge cutoff date which is uh
September 2024. You also have the
context window and in this case it's
400,000
tokens.
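
The back-of-the-envelope conversion above can be written out as a short calculation. The four-characters-per-token rule is the rough approximation quoted in the lecture; the characters-per-page figure is an assumption made here for illustration.

```python
CHARS_PER_TOKEN = 4     # rough rule of thumb quoted in the lecture
CHARS_PER_PAGE = 3000   # assumption: a fairly dense page of text


def tokens_to_pages(num_tokens):
    """Very rough estimate of how many pages fit in a token budget."""
    return num_tokens * CHARS_PER_TOKEN / CHARS_PER_PAGE


context_window = 400_000  # context window quoted for GPT-5 in the lecture
pages = tokens_to_pages(context_window)
print(f"{context_window} tokens is roughly {pages:.0f} pages")
```

Under these assumptions the whole context window holds a few hundred pages, which is a big book but far less than a knowledge base.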
Okay, so let's suppose
context is not a problem. It's actually
unlimited.
Let's imagine we actually put everything
in the context. Well, the problem then
is that people noticed that if you feed
a lot of irrelevant information to your
LLM,
the performance of the LLM will actually
degrade.
Meaning that if for instance you ask it
about I guess who was the winner of the
last elections and then you you feed it
a bunch of information that are not
relevant your LLM will tend to be
confused
And so people have run this test that's
called the needle-in-a-haystack test,
where the idea is
you give a big prompt to your LLM,
which is your haystack,
and you place a fact in the prompt and
you ask your model what that fact was.
So the idea is for your LLM to know I
guess among that huge prompt where is
the relevant information which is the
needle. And so when people have tried
doing this for several uh length of
prompts and tried different positions of
where to put the fact, they have seen
that
the length of the prompt and the
position at which you put the fact are
both important.
So here on the slide we have a heat
map that was produced for GPT-4, which
was I guess one or two years ago, and
what the person did was place the fact
at different places in the document. So
the y-axis is document depth and the x-axis is
the length of your prompt.
And what we saw was that for prompts
that exceeded a certain amount of
tokens, the LLM actually had trouble
retrieving the correct piece of
information.
Um, and in particular, it had trouble
doing so when the fact was somewhere in
the first half of the prompt.
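
A needle-in-a-haystack prompt of the kind described above can be built with a small helper. This is a minimal sketch: the filler sentences, the needle, and the function name are all hypothetical, and the model call itself is left out, since it depends on whichever LLM you are testing.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Place `needle` at a relative `depth` inside the filler text
    (0.0 = very beginning, 1.0 = very end) and append the question."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\nQuestion: " + question


# Hypothetical haystack and needle.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret number is 42."
prompt = build_haystack_prompt(filler, needle, 0.5, "What is the secret number?")

# One cell of the heat map corresponds to one (prompt length, depth) pair.
grid = [
    (length, depth)
    for length in (10, 100, 1000)           # number of filler sentences
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
]
print(len(grid), "prompts to evaluate")
```

Each pair in `grid` would be turned into a prompt, sent to the model, and scored on whether the answer contains the needle.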
So this just tells us that
even if, let's say, our
context length was unlimited,
we would still have a problem by just
going through that naive approach.
So that's another reason. Okay. So now
let's suppose context length is
unlimited. Let's suppose the problem
that I mentioned is not a problem. Well
the other problem is that you pay.
In particular, these LLM
calls are priced per token.
So the bigger your input prompt, the
more you will pay. So you have an
incentive to not put too much in your
prompt just from that standpoint.
And so for instance, again going back to
GPT-5,
the order of magnitude is somewhere
around a dollar per million tokens. So I
guess it's not that expensive but it can
add up if you do that for all your
prompts.
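
To see how the per-token price adds up, here is a small calculation. The price is the order of magnitude quoted above; the prompt sizes and the number of requests are made-up numbers for illustration only.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # USD, order of magnitude from the lecture


def input_cost(tokens_per_prompt, num_prompts,
               price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Total input cost in USD when every prompt has the same size."""
    return tokens_per_prompt * num_prompts * price_per_million / 1_000_000


num_requests = 10_000  # hypothetical traffic
# Stuffing the full 400,000-token context into every prompt...
stuffed = input_cost(400_000, num_requests)
# ...versus sending only a few retrieved chunks (assumed ~2,000 tokens).
retrieved = input_cost(2_000, num_requests)
print(f"stuffed: ${stuffed:,.2f}  retrieved: ${retrieved:,.2f}")
```

With these assumed numbers the difference is a factor of 200, which is the cost argument for retrieving only what is relevant.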
So for all these reasons, I hope I
convinced you that we need a more clever
approach where instead of putting all
the new information all at once in the
prompt, what we do is we only somehow
find the relevant information and put
that in the prompt.
So that is the idea behind RAG. RAG
stands for retrieval-augmented
generation.
And the idea here is to augment the
prompt
with relevant information.
And here I put "relevant" in bold,
and this is I guess the core part of
this technique: how can we get only
the relevant part in the prompt. So we
will see that in a second. So just at a
very high level. So you have let's say a
question as input. So in this case who
was the winner of let's say the local
election. The idea here is to somehow
fetch the correct or the relevant piece
of information and then augment that
here in order to output your answer.
So that's the rough idea.
So does this uh method make sense so
far?
Yeah. Okay. Cool. So that is the idea
behind RAG and now we're going to go
into more details.
So what I mentioned is the rough
idea. And here I just want to
emphasize the three main steps of
RAG. So the first one is you have your
prompt and you somehow want to retrieve
a relevant piece of information that
will help you in answering your prompt.
So here the first step is to retrieve
relevant documents and so you can think
of your prompt as being one entity and
then you can have some other space,
maybe, I don't know, a knowledge
base, where all your documents live, and
so the idea is to somehow fetch the
relevant documents.
So this is the retrieve step. The second
step is once you have fetched the
relevant information, you augment your
prompt. So you take that retrieved info,
you just put it in your prompt and then
uh ask the question. So in the local
election example, it's as if I was
saying who is the winner of this
election and then I retrieve the
relevant piece of information and now
the prompt becomes um who is the winner
of this election and by the way uh this
election was held blah blah blah and
this was the winner and this is what
we're feeding to our LLM.
So in other words, we're giving the
answer in the prompt
and the third step is to feed that
prompt to the LLM to generate the
response. Yeah.
>> Yeah, exactly. So the question is, you
may very well somehow do a bad job at the
retrieval stage. So yes, this is
why the retrieval stage is so important
and we're going to focus on what we can
do to make sure that part does
well.
Um so yeah uh we'll see how we can
evaluate um I guess our our setup and
different methods. But when we talk
about RAG, we're mainly focusing on
making the retrieval part as good as it
can be.
Cool. And I just want to emphasize once
again why it's called RAG. So you
have retrieve,
augment, generate:
RAG.
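
The three steps can be sketched end to end as follows. This is only a toy sketch: the retriever here ranks chunks by shared words rather than by embeddings, the prompt template is an assumption, and `llm` is a hypothetical callable standing in for whatever model you use.

```python
import re


def words(text):
    """Lowercased word set, used by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, knowledge_base, k=2):
    """Step 1 (toy version): rank chunks by number of words shared with the query."""
    ranked = sorted(knowledge_base,
                    key=lambda chunk: len(words(query) & words(chunk)),
                    reverse=True)
    return ranked[:k]


def augment(query, chunks):
    """Step 2: put the retrieved chunks into the prompt, then ask the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt, llm):
    """Step 3: feed the augmented prompt to the model."""
    return llm(prompt)


def rag(query, knowledge_base, llm, k=2):
    return generate(augment(query, retrieve(query, knowledge_base, k)), llm)


# Hypothetical knowledge base and a stub "model" that just echoes its prompt.
knowledge_base = [
    "The local election was held last week and Alice won.",
    "Teddy bears are soft toys.",
    "The city council meets every Tuesday.",
]
query = "Who won the local election?"
echo_llm = lambda prompt: prompt
print(rag(query, knowledge_base, echo_llm, k=1))
```

The printed prompt shows the point made above: after augmentation, the answer is already sitting in the prompt.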
Cool. And as you pointed out the first
step which is the retrieval step is very
important which is why we'll spend a
little bit of time over there.
So I guess the first step is for us to
somehow
clean the set of documents that we may
need.
So I said we may want to
look into outside information, but we
need to somehow sort that, order that,
put that somewhere, and this whole thing
is usually called a knowledge base.
So in order to form our knowledge base
what we do is typically collect the set
of documents that are or may be useful
and once we do that what we do is we
divide them into what we call chunks.
So a chunk is you can think of it as a
subset of the document which has a given
maximum length which is uh measured in
number of tokens which is typically on
the order of hundreds of tokens.
And the idea here is you know whenever
you hear retrieval you should think
about embeddings.
And here what we do is
we compute embeddings corresponding to
each of these chunks.
Now when you create your knowledge base
there are a few hyperparameters that you
need to tweak.
So the first one uh obviously is the
size of the embedding. So typically you
would want a bigger size if let's say
your documents are maybe more nuanced,
more complex.
But then if you have like a higher size
uh maybe it will take more space, maybe
you'll have more computation at
inference time. So I guess it's a
trade-off. You don't necessarily want
too big of a of an embedding size. So
here typical embedding sizes are on the
order of thousands. So for instance like
1,500 something like this.
So then you have the chunk size. Chunk
size is how big your little pieces here
are. So you don't want them to be too
small because otherwise the text may be
out of context. You don't want it to be
too large because maybe the embedding
will not represent in a meaningful way
what is inside. So again, it's a
trade-off. But typically people they
choose a chunk size of around 500 tokens
like on the order of hundreds of tokens.
And then you also have a oh yeah you
have a question. Yeah.
So the question is do you train an
embedding model for this? So you have
two choices. Either you can use a
pre-trained embedding model which people
typically do or you can train your own.
We will see that in a bit more detail in
a few slides.
So the question is what is the purpose
of the embedding model? So we will see
this in a second but long story short it
tries to represent chunks such that it
achieves your end goal which is to fetch
relevant documents.
So we will see a little bit how they're
trained but this is uh the general idea.
Cool. Um so that's this and then we have
a third hyperparameter
which is how much overlap you want to
have in between your chunks.
So here,
when you do the division in a very naive way, you have
every chunk be independent, with no
overlap in between. But typically
some part of the previous
chunk is relevant to understand the
current chunk, which is why we want to
have some overlap, and which is why
people typically also have that.
It's typically in the low hundreds of
tokens.
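
Chunking with overlap can be sketched as a sliding window over the token sequence. This is a minimal sketch: the default sizes follow the orders of magnitude quoted above, and the input is assumed to be an already tokenized list (here plain integers stand in for tokens).

```python
def chunk_tokens(tokens, chunk_size=500, overlap=100):
    """Split a token list into chunks of at most `chunk_size` tokens,
    where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Hypothetical document of 1,200 tokens.
tokens = list(range(1200))
chunks = chunk_tokens(tokens, chunk_size=500, overlap=100)
print([len(chunk) for chunk in chunks])
```

Structured files such as JSON or Markdown would usually need a splitter that respects their structure, which this sketch does not attempt.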
Cool. So let's suppose you have your
knowledge base. Now the question is
given a prompt how can you retrieve
relevant documents
and the answer to that is we typically
proceed in two steps.
So I'm not sure if any of you has a
background in recommendation systems or
search. Does anyone? Yeah. So the methods
we're seeing here are very similar to
that space. So I guess people in the LLM
community they have borrowed ideas and
just leverage some techniques that we
have over there. And this is typically a
setting that we'll also have for
recommendation problems.
So we have two stages.
So the first stage is typically called
candidate retrieval.
And the goal here is to go from a set of
many many many chunks and filter it down
to a much smaller set of potentially
relevant candidates.
So during that stage, what we're trying
to do is to somehow maximize recall.
just do a rough operation
so that we get as many potentially
relevant candidates as possible.
And then we have a second stage which
is sometimes optional, but this stage is
to really make sure the top
documents are really the relevant
ones. And this one is called ranking.
So the idea here is
based on the list of potentially
relevant documents to really rank them
in a way that really the relevant ones
come at the top and so on.
And typically during that stage, we're
going to use a model, a method that's
going to be a bit more compute-intensive
because we have a much smaller set of
candidates to rank compared to the first
one.
So going back to your question on how do
we want our embeddings to be. So here it
will really impact uh the first stage
and we will see that in a second, but
the second stage is also quite
important.
Cool. So far so good. Is everyone uh
clear with the two-stage
approach?
Yep. Yep.
Yeah.
Very good question. So the question is
do we chunk things in a naive way as in
we just go with the number of tokens
regardless of what happens. So it's a
great question, and the answer is that we
will see some extensions that
mitigate this problem. When you chunk
naively, in a way that does not make sense,
you want to somehow put each chunk back
into context, and we will see a method
that does that in a few
slides.
Um but I think your question is also a
great question because depending on the
kind of document that we have like for
instance if we have uh I don't know like
a JSON file or a markdown or like
depending on the kind of file that you
need to chunk you also need to be aware
of the structure that is within those
files. So there is also some nuance
there that we will not go into details
but I just want to call that out
but yeah great question.
Any other questions?
Okay cool. So now that we're clear on
the two main stages of retrieval we're
going to focus on each one of these
steps.
So as I mentioned the first step is
candidate retrieval.
So here what we want is among that
potentially huge knowledge base to
somehow filter it down to, let's say, on the order of
100 potentially relevant candidates.
So here what we do is well we will
leverage the embeddings that we um I
guess computed during the knowledge base
initialization
and
we will try to fetch
potentially relevant candidates by doing
a semantic similarity search.
So do you recall how we compare
embeddings?
Yes. Yes. So cosine similarity is
typically one good way to
compare embeddings. So the idea here is
to represent our query with an
embedding.
We already have
embeddings of all our chunks. So the
idea here is
to somehow find the most relevant chunks
by doing this similarity search and
keeping the ones that come at the
top.
So the idea here is you have your query
and you have your chunk. For both of them, you
compute an embedding,
and then you perform a similarity
operation, which is most of the time
cosine similarity,
and you obtain a similarity score.
So the idea here is you just keep the
top, I don't know, 100 and you go with
that.
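
The similarity search can be sketched as an exact, linear scan. This is a minimal sketch with tiny hand-made vectors standing in for real embeddings; it assumes no embedding is the zero vector, and a real system would typically swap the scan for an approximate nearest neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, chunk_embeddings, k=100):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(cosine_similarity(query_embedding, emb), index)
              for index, emb in enumerate(chunk_embeddings)]
    scored.sort(reverse=True)  # highest similarity first
    return [(index, score) for score, index in scored[:k]]


# Hypothetical 2-dimensional embeddings.
query_embedding = [1.0, 0.0]
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = top_k(query_embedding, chunk_embeddings, k=2)
print(candidates)
```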
So I just want to call out that there is
some complexity in that stage because
your knowledge base can potentially be
huge.
So what people do is typically use what
we call approximate nearest neighbor
methods.
So you may have heard of some libraries
that do that. So typically this
is something that will be relevant here.
We're not going to go into details, but
I just want to call that out. So
the idea here is that
when you build your knowledge base, you
somehow partition the embeddings in a
way that will make you avoid
doing just a naive linear search.
So that's the idea. But you may
see some ANN techniques,
approximate
nearest neighbor techniques, and these
typically happen here.
So another thing that I want to point
out is
a name of the architecture that we
typically use here that you may also
hear
and for that we need to recall that
these embeddings are actually
obtained by passing the text through a
model, typically an encoder-only one.
So you may hear the term bi-encoder, and
this one refers to the fact that we are
passing the query through an encoder and
then passing the chunk through an
encoder. So both of them are encoded independently
and we're comparing the embeddings.
And yeah, so this is another question
I wanted to ask you but I guess I didn't
get the chance to. So if you
remember, I think in lecture two or three we
had seen the BERT model,
and so typically you would have
something like a BERT-like model that you
would use to encode these documents.
So going back to your question, how do
you compute these embeddings? So there
is a paper that I highly recommend
reading, actually, that's called
Sentence-BERT.
So first
of all it's an extension of BERT, as the
name suggests, and it's an extension that
allows you to compute one embedding per
sequence, so one for your query and one for
your document,
that is tailored
to be used for similarity search
purposes.
So the idea here is to have a loss
function that will incentivize having a
high cosine similarity for relevant
entities and low cosine similarity for
entities that are not relevant.
So yeah, feel free to check that
paper out. If you know BERT, which I
know by now you do, it's quite
easy to read. So yeah, highly recommend.
So far so good.
Yeah. Yep.
So the question is what is the default
way to compute the similarity? So yes,
it's cosine similarity, but again you
will see in different implementations
that people can use other distances and
I would encourage you to think about how
they relate to one another. So you will
see for instance the L2 distance, but
then if everything has a norm of one
there are a lot of simplifications that
can happen. So you may see some variants
but I would say they're all more or less
cosine similarities.
Yeah great question.
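
The simplification mentioned for unit-norm embeddings can be written out explicitly. Here $u$ and $v$ are symbols introduced only for this illustration, standing for the query and chunk embeddings, both assumed to have norm one:

```latex
\|u - v\|_2^2
  = \|u\|_2^2 + \|v\|_2^2 - 2\,u^\top v
  = 2 - 2\cos(u, v)
\qquad \text{when } \|u\|_2 = \|v\|_2 = 1 .
```

So under that assumption a smaller L2 distance always means a larger cosine similarity, and ranking by either one gives the same order.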
Cool. So we're still at the candidate
retrieval stage and what we saw was one
way of retrieving documents from a
similarity sorry from a semantic
similarity standpoint. So by the way
what does semantic similarity mean? It
means finding documents or finding
entities that have the same meaning or
that are relevant.
But in the way that we compute these
embeddings, we're not enforcing any kind
of uh keyword match.
Like when we retrieve
documents in this way, it can very well
be that the documents that are matched
do not have any word in common, but
they mean the same.
Well, sometimes
you want to ensure that what you're
searching for
contains exactly the keywords that
are in your prompt.
And in that case you would want to have
a second way of doing things. So you may
have seen BM25
out there. So BM25 is a relevance
score that is actually a heuristic score.
It is based on some function of the
overlap
between what is in your query and what
is in your document.
And so that one is actually quite handy
for cases where you have a query where
you absolutely want to have documents
that contain keywords of this query.
So here I have an example that I
actually passed over super briefly for the
previous one, and we will come back to
it. But let's suppose we have
two teddy bears. One is named Cuddly
and the other one is named Huggy. So
what you want is to figure out where
Cuddly is. So this is your query.
So if you use BM25,
well, the answers that you're going
to get are
by definition going to contain some
overlap with the words that were in your
query. And so here you will have
documents that say
where Cuddly is. But if
you only used semantic similarity
search you would not have that
guarantee.
You would only have documents that are
kind of semantically similar, and those
are not guaranteed to contain
keywords of your prompt. Just
to illustrate that: here Huggy and
Cuddly can be thought of as
semantically similar. So you
may not necessarily have Cuddly in
the results.
And that is the
reason why nowadays what people do is to
look at the use cases that they have and
think about whether having some heuristic
as well in the relevance score is useful
for their use case. So some people
go with a hybrid combination
of this embedding-based search and the
heuristic-based search, so some
combination of embeddings and BM25, in
which case you may have even more
relevant documents depending on your use
case.
Does that make sense? Yeah.
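
To make the keyword-overlap idea concrete, here is a compact sketch of BM25 scoring. It assumes the widely used Okapi BM25 form with typical default values for `k1` and `b`, a naive word tokenizer, and a non-empty document collection; the three documents are made up around the teddy bear example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue  # only query terms present in the document contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


documents = [
    "Cuddly is sleeping on the sofa.",
    "Huggy is sitting in the garden.",
    "The weather is sunny today.",
]
scores = bm25_scores("Where is Cuddly?", documents)
print(scores)
```

The document that actually contains the keyword "Cuddly" gets the highest score, while the Huggy document only gets credit for the common word "is". In a hybrid setup this score would be combined with the embedding similarity in some way that the lecture leaves open.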
So now I'll come back to what you
mentioned about whether cutting chunks
in a naive way will necessarily lead you
to things that are coherent. Well,
you're completely right. Sometimes you
will not. Um but before we answer this
question, we actually are going to
address another concern
which is that typically
when people want
to ask their LLM about something,
the query that they input is of a
different nature
compared to what is in the knowledge
base. So your query is typically going
to be maybe something uh short, maybe a
question,
but what is in your documents is
typically going to be, you know, longer.
These are like sentences and sentences.
So if you really think about it,
if you use the same encoder
to embed your query and to embed your
documents,
well, these two embeddings, they're not
super comparable
because one is for a question and the
other one is for a document.
So there's one extension that tries to
mitigate that issue, and I linked the
paper down there. It's called
HyDE.
So what it does is, instead of computing
the embedding
of the prompt directly, it will first
generate a fake document based on
that prompt, so it's just
an LLM call,
and then it embeds
that fake document to find relevant
chunks.
So it may or may not work. It's not used
all the time by everyone. So I would say
it's just something that is good to try to
see if it works. But this is one
way of mitigating this. Another way
could be to simply have encoders that
are specifically trained to encode the
query on one side and encode the
documents on the other side.
So in other words to not use the same
encoder.
People typically don't do that just
because of maintenance purposes but this
could also be another solution.
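
The fake-document idea described above can be sketched as a small wrapper around two callables. Everything here is a stand-in: `generate` and `embed` are hypothetical functions for your LLM and your embedding model, the prompt wording is an assumption, and the stubs below exist only so the sketch runs.

```python
def hyde_query_embedding(query, generate, embed):
    """Embed an LLM-written hypothetical answer document instead of the raw query."""
    prompt = (
        "Write a short passage that answers the question.\n\n"
        f"Question: {query}\n\nPassage:"
    )
    hypothetical_document = generate(prompt)  # one LLM call
    return embed(hypothetical_document), hypothetical_document


# Stubs for illustration only.
def fake_generate(prompt):
    return "Cuddly the teddy bear is on the sofa in the living room."


VOCAB = ["cuddly", "sofa", "garden", "election"]


def fake_embed(text):
    """Toy embedding: counts of a few vocabulary words."""
    tokens = text.lower().replace(".", " ").split()
    return [float(tokens.count(word)) for word in VOCAB]


embedding, fake_doc = hyde_query_embedding("Where is Cuddly?", fake_generate, fake_embed)
print(embedding)
```

The returned embedding is then used for the usual similarity search against the chunk embeddings, so the comparison is document against document rather than question against document.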
Okay. So now finally going to your your
question regarding how we can make sense
of these chunks. If they are taken out
of context they may not make sense.
And so here the idea here is to prepend
some piece of text that just sums up
what you need to know in order to
understand that chunk.
So here the idea is that you have all
your documents. Let's say you have one
document that you divide into n
chunks.
The idea here is instead of considering
these chunks separately,
you're going to
compute some kind of context that is
relevant to each chunk and that is based
on the whole document.
So how are you going to do that? Well,
it's again an LLM call. So what you do
is typically take the whole
document and the chunk
that you want to contextualize, and you
ask your model: please give me a
short, succinct context to make sense of
that chunk.
And now you may tell me, well, that's a
lot of LLM calls. You have potentially a
lot of chunks and that's just going to
be very pricey. Well, there's one
strategy to make this less expensive.
And I'm not sure if you've heard of that
option. It's called prompt caching.
So now that you know very well how LLMs
work, you know that typically these are
decoder-only and so on. So you know that
if
you use
the same prefix
for all your prompts,
well, it's going to be the same
computations that you just do again and
again and again.
So the idea here is you just do it once
and you save all the relevant
activations,
and instead of computing them again
you just
do a lookup and then decode the rest.
Does that make sense?
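
Putting the two ideas together, the contextualization step can be sketched so that every prompt for the same document starts with the same prefix. The prompt wording, the tag names, and the `llm` callable are assumptions made for illustration; the stub model only exists to make the sketch runnable.

```python
def build_context_prompt(document, chunk):
    """The whole document comes first, so all chunks of one document
    share an identical prefix that a provider-side cache can reuse."""
    return (
        "<document>\n" + document + "\n</document>\n"
        "Here is a chunk from the document above:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Give a short, succinct context that situates this chunk "
        "within the whole document."
    )


def contextualize_chunks(document, chunks, llm):
    """Prepend an LLM-written context to each chunk (one LLM call per chunk)."""
    contextualized = []
    for chunk in chunks:
        context = llm(build_context_prompt(document, chunk))
        contextualized.append(context + "\n" + chunk)
    return contextualized


# Hypothetical document, chunks, and stub model.
document = ("Cuddly and Huggy are two teddy bears. "
            "Cuddly is on the sofa. Huggy is in the garden.")
chunks = ["Cuddly is on the sofa.", "Huggy is in the garden."]
stub_llm = lambda prompt: "This chunk is from a note about two teddy bears."
contextualized = contextualize_chunks(document, chunks, stub_llm)
print(contextualized[0])
```

The contextualized text, rather than the bare chunk, is what would then be embedded and indexed.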
>> Yeah.
>> Um the question is activation from a
language model. Yes. Because when you
feed a prompt to your model and you ask
it to generate a response,
what the model needs to do is well to
take all this like input and then
compute the activations of all the
layers and then have uh you know for the
generation process to have uh this
attention across all these other
components. Well, given that it's
decoder only, meaning it's only left to
right,
the thing that you input, if it's the
same, then it will lead to the same
activations.
Yeah.
The question is, what if you have a
closed model? Well, prompt caching, and
I'm going to talk about this in
just one slide, is an option that these
closed-model providers
offer. And what they tell you is, well,
this is the same prefix for all your
prompts. So what we're going to do is
we're just going to make it cheaper for
you.
So if you look at the model pricing
page,
you will see that there is a price
for regular inputs. So inputs that are
not cached and then you have the price
per cached input token. And here you see
that for, let's say, the OpenAI model it's
one-tenth
of the price. So I guess what do I want
to tell you by this? Well, just try to
be smart with the prompts and try to
gather all the things that are likely to
be repeated across prompts at the
beginning so that you can leverage
this nice percentage off.
Does that make sense?
Okay, cool.
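
A rough calculation shows why the shared prefix matters for contextualizing chunks. This sketch assumes a simplified billing model in which the first call pays full price for the document prefix and every later call pays the cached rate; real providers may add conditions such as minimum prefix lengths or cache expiry. The document size, chunk count, and price are made-up numbers, and the chunk and instruction tokens are ignored.

```python
def contextualization_input_cost(doc_tokens, num_chunks, price_per_million,
                                 cached_fraction=0.1, use_cache=True):
    """Input cost in USD of sending the full document once per chunk."""
    if use_cache:
        # First call: full price. Remaining calls: cached rate for the prefix.
        billed = doc_tokens + doc_tokens * (num_chunks - 1) * cached_fraction
    else:
        billed = doc_tokens * num_chunks
    return billed * price_per_million / 1_000_000


doc_tokens = 50_000     # hypothetical document length
num_chunks = 100        # hypothetical number of chunks
price = 1.00            # USD per million input tokens, order of magnitude

without_cache = contextualization_input_cost(doc_tokens, num_chunks, price, use_cache=False)
with_cache = contextualization_input_cost(doc_tokens, num_chunks, price, use_cache=True)
print(f"without cache: ${without_cache:.3f}  with cache: ${with_cache:.3f}")
```

Under these assumptions the cached version costs roughly a tenth of the uncached one, since almost all of the input is the repeated prefix.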
Great. So up until now we have seen how
we can go from potentially thousands or
even, let's say, millions of chunks
down to, let's say,
hundreds of potentially relevant chunks.
Now what we want to do is to sort them
in a more meaningful way. And the second
part is more optional because maybe
sometimes this first cut that we've done
may be good enough. But I guess this
second step is about being more
intentional in how we give the final
score to be able to really like select
the final let's say top K chunks.
So this second stage is called ranking
or even reranking.
reranking because I guess with the first
step you already have some kind of
ranking. So we're reranking
and what we're doing is instead of using
this very quick operation similarity
operation between embeddings that we've
computed,
we're going to use something that is
maybe a bit more sophisticated.
So instead of considering the query and
the chunk separately,
what we're going to do is to actually
put both of them in the encoder
together, and get a relevance score out of
that.
So the reason why it's maybe a little
bit more meaningful to do it this way is
that you have a model that takes a look
at both your query and your chunk at the
same time and gives you a
score.
Whereas in the first step you had one
embedding for the query and one
embedding for the chunk, which didn't
have that interaction that a model
could capture.
And you will also see out there that
this setup is called a cross-encoder setup
because you have both your inputs
fed to your encoder, so there are
some cross interactions.
So if you remember, the first approach is
a bi-encoder kind of setup and this one
is a cross-encoder.
Yeah, the question is whether you will actually
compute the attention between the two.
Yes,
absolutely.
And here, I mean, Sentence-BERT,
they have a lot of good
documentation. So I highly recommend just
reading their docs, linked at the bottom
of the slide.
Cool. Well, you do that on all your
potentially relevant chunks. So you
have this score that is computed
for each of these chunks with the prompt,
and then you finally obtain the ranking.
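
The reranking loop itself is simple once a pairwise scorer is available. In this sketch `score_pair` is a hypothetical callable that looks at the query and the chunk together; in practice it would be a cross-encoder model, and the word-overlap scorer below is only a stand-in so that the example runs.

```python
import re


def rerank(query, candidates, score_pair, top_k=5):
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def toy_score_pair(query, chunk):
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


# Hypothetical output of the candidate retrieval stage.
candidates = [
    "Huggy is in the garden.",
    "Cuddly is on the sofa.",
    "It is sunny.",
]
reranked = rerank("Where is Cuddly?", candidates, toy_score_pair, top_k=2)
print(reranked)
```

Because the scorer runs once per candidate, this stage is only affordable on the small set that survived the first stage.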
And now the question is are you happy
with the ranking?
And in order to answer that question,
you need to have a way to quantify your
performance.
And that's why we're going to see in the
next five minutes what are the metrics
that we typically use to do that. So
again this is very similar to you know
if you do search or recommendation. So
in case you have a background there you
will see some commonalities.
So here's the setup.
You have a bunch of chunks.
You do this first and second step
and at the end of the day you will have
k chunks that will come at the top that
you will qualify as being relevant
and you want to compare that with
respect to
actually relevant chunks. So you can
think of it as having labels,
same as in binary classification.
So you have relevant and not relevant
and you predict some that are relevant
and you want to know how well you're
doing.
So this is the setup.
Well, when it comes to ranking,
you need to somehow incorporate this
information of how high in the ranking
you've put stuff.
So here let's suppose that you have
ranked these n chunks from most
important to least important. So let's
suppose you have like first, second, uh
third and so on and so forth.
And you only care about the first K
because in the RAG setting you
typically retrieve the top K that are
relevant and you put all these top K in
your prompt.
Well, the first metric that you will
likely use is called NDCG.
So, it's a lot of letters. I'm going to
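Key Summary
This lecture covers how LLMs interact with external systems: accessing external knowledge through RAG (Retrieval-Augmented Generation), using structured data through tool/function calling, and carrying out tasks autonomously through agents.
Key Concepts
RAG (Retrieval-Augmented Generation) 06:37
- Addresses the LLM knowledge cutoff problem
- Gives access to up-to-date information without retraining the model
- Three steps: Retrieve → Augment → Generate
Knowledge Base Construction 16:28
- Chunking: split documents into chunks of hundreds of tokens (typically ~500 tokens)
- Embedding: convert each chunk into a vector (typically ~1,500 dimensions)
- Overlap: overlap between chunks preserves context
Two-Stage Retrieval 24:28
- Candidate Retrieval: fast candidate filtering with a bi-encoder, maximizing recall
- Reranking: precise re-ordering with a cross-encoder, optimizing precision
- A hybrid of semantic similarity (cosine) and BM25 (keyword matching) is possible
Contextual Retrieval 42:33
- Prepend context to each chunk to make it easier to understand
- Context is generated with LLM calls; prompt caching reduces the cost
Tool Calling / Function Calling 59:30
- Structured data is exposed as functions that the LLM can use
- The LLM sees only the API signature and documentation, not the function implementation
- Three steps: select function → execute → convert the result into natural language
How Tool Calling Is Trained 69:20
- SFT approach: train on input-output pairs (tool prediction + response generation)
- Prompting approach: recent models can also do it with few-shot prompting or reasoning
MCP (Model Context Protocol) 77:00
- A standard proposed by Anthropic for connecting LLMs to tools
- Structure: Host (LLM), Client (connector), Server (tool provider)
- Standardized tool definitions improve reusability
Agents 92:11
- A higher-level concept than tool calling: autonomously pursuing a goal
- ReAct pattern: a repeated Observe → Plan → Act loop
- Multiple tool calls and reasoning steps until the goal is reached
Multi-Agent & A2A Protocol 99:00
- Multiple agents can collaborate
- Google's Agent-to-Agent Protocol: standardizes skills, execute, cancel
Safety Considerations 103:30
- New risks arise, such as data exfiltration
- Addressed at training time (harmlessness) and at inference time (safety classifier)
- Safety is evaluated with AgentSafetyBench
Key Insights
- RAG is a practical way to access up-to-date information without retraining the model
- Retrieval quality is the key to RAG performance (Candidate Retrieval + Reranking)
- Tool calling lets LLMs interact with external APIs and systems
- Agents carry out complex tasks autonomously with tools plus a reasoning loop
- Powerful capabilities require safety measures (at both training and inference time)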