Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 7 - Agentic LLMs

Stanford Online • November 18, 2025 • AI summary generated: January 24, 2026

Hello everyone, and welcome to lecture 7 of CME 295. Today we're going to focus on practical techniques that let our LLM interact with the outside world and with other systems. Up until now, our LLM was purely on its own: we've trained it, and we've seen how it can reason on math and coding problems. Now what we want to do is use our LLM in the context of other systems. So today's class will focus on RAG, which you may have heard of, tool calling, and agents.

But before we start, as usual, I'm going to recap what we did last time. If you remember, last time we focused on reasoning models, and we saw the differences between a reasoning model and what we call a vanilla LLM. Up until the lecture before last, we fed a prompt to the LLM and it directly gave us a response. What we saw last time was that if we let the LLM reason before outputting the response, we can gain some performance on reasoning tasks such as math and coding. In particular, reasoning models take a prompt as input and output both a reasoning chain, which is typically hidden from the user, and a response.

With that, we saw how we could train a model to be more of a reasoning model. In particular, we saw a core RL algorithm called GRPO, which stands for Group Relative Policy Optimization. We saw that this algorithm has some differences compared to the ones we saw previously, and one notable aspect is that it does not train a value function. This illustration shows a little bit of how GRPO is trained. It takes a query as input, and it computes an advantage for each output by computing the rewards of different completions of the same prompt and then computing a quantity, the advantage, that is relative to the other rewards in that group of completions.
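
To make the "relative to the group" idea concrete, here is a minimal sketch. It assumes the commonly used form in which each reward is centered by the group mean and scaled by the group standard deviation; the lecture recap only says the advantage is relative to the other rewards, so the exact normalization, the function name, and the reward values are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to the other rewards in its group.

    Assumption: center by the group mean and scale by the group standard
    deviation; `eps` only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group average get a positive advantage and the others a negative one, which is why no separate value function is needed as a baseline.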

Then we saw that if we apply GRPO with carefully chosen rewards, first rewarding the model for outputting a reasoning chain and second rewarding the model for producing a good response, the model improves on these reasoning tasks as RL training progresses. One of those tasks was math problems. On the left graph you see the evolution of the model's performance on the AIME dataset, which is a set of challenging math problems. But we also saw that the model kept outputting responses that were longer and longer. In particular, even though the performance was plateauing towards the end of the graph above, the output length was still increasing.

So what we did was go back to the loss formulation used by GRPO and realize that there is a term that makes the contribution of a token different depending on whether it is in a short response or a long response. This phenomenon is called length bias, and we saw some mitigation strategies explored by papers that came out in the past few months. One was DAPO, which uses a normalization factor that does not depend on which response the token is located in. The other was the paper called Dr. GRPO ("GRPO Done Right"), which simply removes the normalization term.
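
The effect of the normalization term can be seen by looking at the weight each token's loss receives. The sketch below is an illustration under assumptions: it reduces the three objectives to their length-normalization factor only and ignores everything else in the losses (clipping, advantages, and any constants). The function name and the example lengths are hypothetical.

```python
def token_weights(lengths, scheme):
    """Weight given to each token's loss term, one value per response.

    lengths: number of tokens in each sampled response of the group.
    Only the length-normalization factor is modeled (simplifying assumption).
    """
    group_size = len(lengths)
    if scheme == "grpo":
        # Average within each response, then across the group:
        # tokens in long responses get a smaller weight.
        return [1.0 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        # One shared factor, the total token count of the group.
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if scheme == "dr_grpo":
        # Per-response length term removed.
        return [1.0 / group_size for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


lengths = [10, 1000]  # one short and one long response
for scheme in ("grpo", "dapo", "dr_grpo"):
    print(scheme, token_weights(lengths, scheme))
```

Under the first scheme a token in the 10-token response weighs 100 times more than a token in the 1,000-token response; under the other two, every token in the group gets the same weight.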

I'll go down that. Cool. So this was last time. And what we did last time, right before starting the reasoning class, was enumerate the strengths and weaknesses of vanilla LLMs.

So last lecture was all about how we can improve the limited reasoning capabilities of vanilla LLMs. In this lecture we will do two things. The first is to see how we can connect our LLM to an ever-evolving knowledge base, and in particular how we can have access to the latest information. The second is how our LLM can help us perform actions, and we will see this with Shervine through things like tool calling and agentic workflows.

Cool. So with that, let's start with the first one, a method called RAG that you may have heard of. Let's suppose you have a model that you have trained, but the pre-training data only goes up to, let's say, a month ago. Now let's suppose you want to prompt your model about the winner of the elections that happened a couple of weeks ago. Well, your model will not be able to respond, or it will output an incorrect answer, because up until now our LLM does not have any link to outside sources. It relies only on the knowledge that it acquired during training. So the response it gives us will only be based on data up until the cutoff, which is a month ago in this example.

So we have this big limitation: our LLM only knows about things it has been trained on. You will see this with all the models out there. Here I have an example with OpenAI's GPT-5. If you look at model cards, they always have a knowledge cutoff date written somewhere. In the case of GPT-5, the knowledge cutoff date is September 30, 2024, which means that if you ask it, in a very naive way, about anything that happened after that, the base model will not be able to answer as is.

Well, you may ask me: why not just continue training your model on data from after that date? There are actually several problems with this. The first problem is that it's very tricky to change the knowledge of an LLM without causing regressions on other things, so this is typically a task that people try to avoid. The second is that it's not very practical, because you may very well have use cases that require you to fine-tune this model. Let's suppose you have use case one: you fine-tune from this model, and then you want to update the weights of your model to inject some knowledge. Well, you will have to do that for all of your use cases, which adds a lot of overhead and a lot of maintenance.

So people typically prefer not to do additional training to inject knowledge. One idea could be to take your prompt and just add everything that happened after the cutoff date, as a way for your model to know what happened.

Well, the problem with that naive approach is that, as you know, context length is limited. Typically models have a context length on the order of hundreds of thousands of tokens. Do you know roughly what that is equal to? Yes: one token is roughly four characters. Using this rough approximation, hundreds of thousands of tokens is roughly hundreds of pages, something like a very big book. So it's big, but it's not enough for us to go down that very naive route. Again going back to GPT-5: if you go to the model card, you have the knowledge cutoff date, which is September 2024, and you also have the context window, which in this case is 400,000 tokens.
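
As a quick back-of-the-envelope check of the "hundreds of pages" estimate, the sketch below uses the four-characters-per-token rule of thumb from the lecture. The characters-per-page figure is an assumption chosen for illustration, not a number from the lecture.

```python
CHARS_PER_TOKEN = 4      # rough approximation quoted in the lecture
CHARS_PER_PAGE = 3_000   # assumption: a dense page of English text


def tokens_to_pages(num_tokens):
    """Convert a token count to an approximate page count."""
    return num_tokens * CHARS_PER_TOKEN / CHARS_PER_PAGE


pages = tokens_to_pages(400_000)  # the context window mentioned above
print(round(pages))
```

With these assumptions a 400,000-token window holds a few hundred pages, which is a large book but far from "everything that happened since the cutoff".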

Okay, so let's suppose context is not a problem and is actually unlimited. Let's imagine we actually put everything in the context. Well, the problem then is that people noticed that if you feed a lot of irrelevant information to your LLM, its performance will actually degrade. For instance, if you ask it who was the winner of the last elections and you feed it a bunch of information that is not relevant, your LLM will tend to be confused.

People have run a test called the needle-in-a-haystack test. The idea is that you give a big prompt to your LLM, which is your haystack, you place a fact somewhere in the prompt, and you ask your model what that fact was. So the LLM has to find where, in that huge prompt, the relevant information is, which is the needle. When people tried this for several prompt lengths and different positions of the fact, they saw that the length of the prompt and the position at which you put the fact are both important.
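
The test setup described above can be sketched as follows. This only builds the prompts for a grid of lengths and depths; calling a model and scoring its answers is left out. The filler text, the needle sentence, the question, and all function names are hypothetical.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\nQuestion: " + question


def is_retrieved(model_answer, expected):
    """Very simple scoring rule: the expected string appears in the answer."""
    return expected.lower() in model_answer.lower()


filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The teddy bear named Cuddly is in the attic."

# One prompt per (haystack length, needle depth) cell of the heat map.
grid = [(n, d) for n in (100, 500, 1000) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
prompts = {
    (n, d): build_haystack_prompt(filler[:n], needle, d, "Where is Cuddly?")
    for n, d in grid
}
```

Each prompt would then be sent to the model, and `is_retrieved(answer, "attic")` recorded per cell to produce a length-by-depth grid.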

Here on the slide we have a heat map that was produced for GPT-4, I guess one or two years ago. What the person did was place the fact at different places in the document, so this axis is the document depth, and the x-axis is the length of your prompt. What we saw was that for prompts that exceeded a certain number of tokens, the LLM actually had trouble retrieving the correct piece of information, and in particular it had trouble doing so when the fact was somewhere in the first half of the prompt. So this just tells us that even if our context length were unlimited, we would still have a problem going with that naive approach.

So that's another reason. Okay, so now let's suppose context length is unlimited, and let's suppose the problem I just mentioned is not a problem. Well, the other problem is that you pay. In particular, these LLM calls are priced per token, so the bigger your input prompt, the more you will pay. You have an incentive not to put too much in your prompt just from that standpoint. For instance, again going back to GPT-5, the order of magnitude is somewhere around a dollar per million tokens. So it's not that expensive, but it can add up if you do that for all your prompts.
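
To see how this adds up, here is a small cost comparison. The price is the order of magnitude quoted in the lecture; the number of calls and the size of the "retrieved only" prompt are assumptions made for illustration.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # USD, order of magnitude from the lecture


def input_cost_usd(prompt_tokens, num_calls=1):
    """Input-side cost of sending `num_calls` prompts of `prompt_tokens` tokens each."""
    return prompt_tokens * num_calls * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


# Assumption: 10,000 calls, either filling a 400,000-token window every time
# or sending only about 2,000 tokens of retrieved context.
full_context = input_cost_usd(400_000, num_calls=10_000)
retrieved_only = input_cost_usd(2_000, num_calls=10_000)
print(full_context, retrieved_only)
```

Under these assumptions the full-context approach costs a couple of hundred times more than sending only a few relevant chunks.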

So for all these reasons, I hope I convinced you that we need a more clever approach where, instead of putting all the new information in the prompt at once, we find only the relevant information and put that in the prompt. That is the idea behind RAG. RAG stands for retrieval-augmented generation, and the idea is to augment the prompt with relevant information. Here I put "relevant" in bold, because the core part of this technique is how we can get only the relevant part into the prompt. We will see that in a second. So, at a very high level: you have a question as input, in this case, who was the winner of, let's say, the local election? The idea is to fetch the relevant piece of information and then augment the prompt with it in order to output your answer. So that's the rough idea.

Does this method make sense so far? Yeah? Okay, cool. So that is the idea behind RAG, and now we're going to go into more detail.

What I mentioned is the rough idea, and here I just want to emphasize the three main steps of RAG. You have your prompt, and you want to retrieve a relevant piece of information that will help you answer it. So the first step is to retrieve relevant documents. You can think of your prompt as being one entity, and then you have some other space, a knowledge base, where all your documents live, and the idea is to fetch the relevant documents. This is the retrieve step. The second step is that once you have fetched the relevant information, you augment your prompt: you take that retrieved info, you put it in your prompt, and then you ask the question. In the local election example, it's as if I was asking "Who is the winner of this election?", then I retrieve the relevant piece of information, and now the prompt becomes "Who is the winner of this election? And by the way, this election was held blah blah blah, and this was the winner." This is what we're feeding to our LLM. In other words, we're giving the answer in the prompt. The third step is to feed that prompt to the LLM to generate the response. Yeah?

>> Yeah, exactly. So the question is: you may very well do a bad job at the retrieval stage. Yes, and this is why the retrieval stage is so important, and we're going to focus on what we can do to make sure that part does well. We'll see how we can evaluate our setup and different methods. But when we talk about RAG, we're mainly focusing on making the retrieval part as good as it can be.

Cool. And I just want to emphasize once again why it's called RAG. You have retrieve, augment, generate: RAG.
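
The three steps can be sketched end to end as follows. This is a toy illustration, not a production pipeline: the retriever ranks chunks by shared words instead of embeddings, the prompt template is made up, and `echo_llm` is a stand-in for a real model call. All names are hypothetical.

```python
def retrieve(question, knowledge_base, top_k=1):
    """Step 1 (toy version): rank chunks by the number of shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(question, chunks):
    """Step 2: put the retrieved chunks into the prompt, then ask the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt, llm):
    """Step 3: feed the augmented prompt to the model."""
    return llm(prompt)


def rag(question, knowledge_base, llm, top_k=1):
    chunks = retrieve(question, knowledge_base, top_k)
    return generate(augment(question, chunks), llm)


kb = [
    "The local election was held last week and Alice won.",
    "Teddy bears are popular stuffed toys.",
]
echo_llm = lambda prompt: prompt  # stand-in: returns the prompt it was given
out = rag("Who won the local election?", kb, echo_llm)
print(out)
```

Because the stand-in model simply echoes its input, the printed output shows the augmented prompt that a real LLM would receive.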

Cool. And as you pointed out, the first step, the retrieval step, is very important, which is why we'll spend a little bit of time on it.

So the first step is for us to gather the set of documents that we may need. I said that we may want to look into outside information, but we need to organize it and put it somewhere, and this whole thing is usually called a knowledge base. To form our knowledge base, we typically collect the set of documents that are or may be useful, and once we do that, we divide them into what we call chunks. You can think of a chunk as a piece of the document with a given maximum length, measured in number of tokens, which is typically on the order of hundreds of tokens. And whenever you hear retrieval, you should think about embeddings. What we do here is compute an embedding for each of these chunks.

Now, when you create your knowledge base, there are a few hyperparameters that you need to tune. The first one, obviously, is the size of the embedding. Typically you would want a bigger size if your documents are more nuanced or more complex. But a larger size takes more space and may require more computation at inference time, so it's a trade-off: you don't necessarily want too big an embedding size. Typical embedding sizes are on the order of thousands, for instance something like 1,500.

Then you have the chunk size, which is how big your little pieces here are. You don't want them to be too small, because otherwise the text may be out of context. You don't want them to be too large, because then the embedding may not represent what is inside in a meaningful way. So again, it's a trade-off, but typically people choose a chunk size of around 500 tokens, so on the order of hundreds of tokens. And then you also have a... oh yeah, you have a question. Yeah?

So the question is: do you train an embedding model for this? You have two choices. Either you use a pre-trained embedding model, which is what people typically do, or you train your own. We will see that in a bit more detail in a few slides. So the question is: what is the purpose of the embedding model? We will see this in a second, but long story short, it tries to represent chunks in a way that achieves your end goal, which is to fetch relevant documents. We will see a little bit of how they're trained, but this is the general idea.

Cool. So that's this, and then we have a third hyperparameter, which is how much overlap you want between your chunks. When you do the division in a very naive way, everything is independent, with no overlap in between. But typically some part of the previous chunk is relevant for understanding the current chunk, which is why we want some overlap, and why people typically have it. It's typically in the low hundreds of tokens.
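
Chunk size and overlap can be illustrated with a small fixed-size chunker. This is a minimal sketch that works on a list of tokens that has already been produced by some tokenizer; the default values follow the orders of magnitude mentioned above, and the function name is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=100):
    """Split `tokens` into chunks of at most `chunk_size` tokens,
    where consecutive chunks share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Stand-in for a tokenized document of 1,200 tokens.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

This naive splitter ignores sentence and document structure, which is exactly the limitation raised in the audience question later in the lecture.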

Cool. So let's suppose you have your knowledge base. Now the question is: given a prompt, how can you retrieve relevant documents? The answer is that we typically proceed in two steps. I'm not sure if any of you has a background in recommendation systems or search. Does anyone? Yeah. So the methods we're seeing here are very similar to that space. People in the LLM community have borrowed ideas and leveraged techniques from over there, and this is typically a setting that we also have for recommendation problems.

So we have two stages. The first stage is typically called candidate retrieval. The goal here is to go from a set of many, many chunks and filter it down to a much smaller set of potentially relevant candidates. During that stage, what we're trying to do is maximize recall: we do a rough operation so that we get as many potentially relevant candidates as possible.

Then we have a second stage, which is sometimes optional, whose purpose is to make sure that the top documents really are the relevant ones. This one is called ranking. The idea is to take the list of potentially relevant documents and rank them so that the truly relevant ones come at the top. Typically during that stage we use a model or a method that is a bit more compute intensive, because we have a much smaller set of candidates to rank than in the first stage.

Going back to your question on how we want our embeddings to be: it will really impact the first stage, and we will see that in a second, but the second stage is also quite important. Cool. So far so good? Is everyone clear on the two-stage approach? Yep. Yep. Yeah?

Very good question. So the question is: do we chunk things in a naive way, as in, do we just go with the number of tokens regardless of what happens? It's a great question, and the answer is that we will see some extensions that mitigate the problem of chunking in a way that does not make sense. When you chunk naively, you want to put each chunk back into context, and we will see a method that does that in a few slides. But I think it's also a great question because, depending on the kind of file that you need to chunk, for instance a JSON file or a Markdown file, you also need to be aware of the structure within those files. There is some nuance there that we will not go into in detail, but I just want to call it out. But yeah, great question.

Any other questions? Okay, cool. Now that we're clear on the two main stages of retrieval, we're going to focus on each one of these steps.

As I mentioned, the first step is candidate retrieval. Here what we want is to filter that potentially huge knowledge base down to, let's say, around 100 potentially relevant candidates. To do that, we leverage the embeddings that we computed when initializing the knowledge base, and we try to fetch potentially relevant candidates by doing a semantic similarity search.

Do you recall how we compare embeddings? Yes, cosine similarity is typically one good way to compare embeddings. So the idea here is to represent our query with an embedding. We already have embeddings of all our chunks, so we find the most relevant chunks by doing this similarity search and keeping the ones that come out at the top. You have your query and you have your chunk, for both of them you compute an embedding, then you perform a similarity operation, which is most of the time cosine similarity, and you obtain a similarity score. Then you just keep the top, I don't know, 100, and you go with that.
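
The exact (non-approximate) version of this step is a linear scan, sketched below with plain Python lists. The two-dimensional embeddings are made-up values for illustration; real embeddings would come from an embedding model and have on the order of a thousand dimensions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_emb, chunk_embs, k=100):
    """Score every chunk against the query and keep the k most similar.

    Returns (score, chunk_index) pairs, highest score first.
    """
    scored = [
        (cosine_similarity(query_emb, emb), idx)
        for idx, emb in enumerate(chunk_embs)
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings.
chunk_embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_emb = [0.9, 0.1]
top2 = top_k_chunks(query_emb, chunk_embs, k=2)
print(top2)
```

The scan touches every chunk, so its cost grows linearly with the size of the knowledge base.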

I just want to call out that there is some complexity in that stage, because your knowledge base can potentially be huge. So what people typically do is use what we call approximate nearest neighbor methods. You may have heard of some libraries that do that, and this is where they are relevant. We're not going to go into details, but the idea is that when you build your knowledge base, you partition the embeddings in a way that lets you avoid doing a naive linear search. So you may see ANN techniques, approximate nearest neighbor techniques, and this is typically where they happen.
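
One way to picture the partitioning idea is the sketch below: vectors are assigned to their nearest centroid at indexing time, and at query time only the closest partitions are scanned. This is an illustration of the general principle under assumptions, not how any particular library works. The centroids are assumed to be given (for example from a clustering step), squared Euclidean distance is used for simplicity, and all names and values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_partitions(vectors, centroids):
    """Index time: assign each vector index to its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(idx)
    return partitions


def ann_search(query, vectors, centroids, partitions, k=2, n_probe=1):
    """Query time: scan only the n_probe partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:n_probe]
    candidates = [idx for c in probe for idx in partitions[c]]
    return sorted(candidates, key=lambda idx: sq_dist(query, vectors[idx]))[:k]


vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.9, 0.0]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # assumed given
partitions = build_partitions(vectors, centroids)
hits = ann_search([0.05, 0.0], vectors, centroids, partitions, k=2, n_probe=1)
print(hits)
```

The search is approximate because a true neighbor sitting in a partition that was not probed would be missed; raising `n_probe` trades speed for recall.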

Another thing that I want to point out is the name of the architecture that we typically use here, which you may also hear. For that, we need to recall that these embeddings are obtained by passing the text through a model, typically an encoder-only one. You may hear the term bi-encoder, which refers to the fact that we pass the query through an encoder and the chunk through an encoder, so the two are encoded independently, and then we compare the embeddings. This is another question I wanted to ask you but didn't get the chance to: if you remember, I think in lecture two or three, we saw the BERT model. Typically you would use something like a BERT-like model to encode these documents.

Going back to your question of how you compute these embeddings: there is a paper that I highly recommend reading, called Sentence-BERT. First of all, it's an extension of BERT, as the name suggests, and it allows you to compute one embedding per sequence, for your query or for your document, that is tailored for similarity search. The idea is to have a loss function that incentivizes a high cosine similarity for relevant pairs and a low cosine similarity for pairs that are not relevant.
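
To illustrate what such an objective can look like, here is a minimal sketch of a cosine-based pair loss. It is a generic example of the kind of loss described above, not the exact objective used in the Sentence-BERT paper; the function names, the margin parameter, and the example vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pair_loss(query_emb, chunk_emb, relevant, margin=0.0):
    """Loss for one (query, chunk) pair.

    Relevant pairs are pushed towards cosine similarity 1; irrelevant pairs
    are penalized only while their similarity exceeds `margin`.
    """
    sim = cosine(query_emb, chunk_emb)
    if relevant:
        return 1.0 - sim
    return max(0.0, sim - margin)


print(pair_loss([1.0, 0.0], [1.0, 0.0], relevant=True))    # aligned and relevant
print(pair_loss([1.0, 0.0], [1.0, 0.0], relevant=False))   # aligned but irrelevant
print(pair_loss([1.0, 0.0], [0.0, 1.0], relevant=False))   # orthogonal and irrelevant
```

Minimizing this quantity over many labeled pairs moves the encoder towards embeddings whose cosine similarity reflects relevance.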

So yeah, feel free to check that paper out. If you know BERT, which I know you now do, it's quite easy to read, so I highly recommend it. So far so good? Yeah. Yep?

So the question is: what is the default way to compute the similarity? Yes, it's cosine similarity, but you will see in different implementations that people can use other distances, and I would encourage you to think about how they relate to one another. You will see, for instance, the L2 distance, but if everything has a norm of one, a lot of simplifications happen. So you may see some variants, but I would say they are all more or less cosine similarity. Yeah, great question.
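
The simplification mentioned for unit-norm embeddings can be written out explicitly. The symbols $a$ and $b$ are introduced here for two embeddings and are not notation from the lecture.

```latex
\|a - b\|_2^2
  = \|a\|_2^2 + \|b\|_2^2 - 2\, a^\top b
  = 2 - 2\, a^\top b
  = 2 - 2 \cos(a, b)
  \qquad \text{when } \|a\|_2 = \|b\|_2 = 1 .
```

So for normalized embeddings, ranking by smallest L2 distance gives the same order as ranking by largest cosine similarity (or largest dot product).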

Cool. So we're still at the candidate retrieval stage, and what we saw was one way of retrieving documents from a semantic similarity standpoint. By the way, what does semantic similarity mean? It means finding documents or entities that have the same meaning or that are relevant. But in the way we compute these embeddings, we're not enforcing any kind of keyword match. When we retrieve documents this way, it may very well be that the matched documents have no word in common with the query but mean the same thing.

Well, sometimes you want to ensure that what you're searching for contains exactly the keywords that are in your prompt. In that case you would want a second way of doing things. You may have seen BM25 out there. BM25 is a relevance score that is actually a heuristic: it is based on some function of the overlap between what is in your query and what is in your document. It is quite handy for cases where you absolutely want documents that contain the keywords of your query.
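
For reference, a compact BM25 scorer can be sketched as follows. It assumes the widely used Okapi BM25 form with a smoothed IDF term, whitespace tokenization, and common default values for `k1` and `b`; none of these details are specified in the lecture, and the example documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """BM25 score of each document for the query (whitespace tokenization)."""
    docs = [doc.lower().split() for doc in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "cuddly is on the sofa",
    "huggy is in the garden",
    "the garden has many flowers",
]
scores = bm25_scores("where is cuddly", documents)
print(scores)
```

A document that shares no term with the query scores exactly zero, and rare query terms such as a name weigh more than common ones.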

Here I have an example that I passed over very briefly on the previous slide, and we will come back to it. Let's suppose we have two teddy bears, one named Cuddly and the other named Huggy, and you want to figure out where Cuddly is. That is your query. If you use BM25, the results you get will by definition have some overlap with the words in your query, so here you will get documents that say where Cuddly is. But if you only used semantic similarity search, you would not have that guarantee. You would only get documents that are semantically similar, and those are not guaranteed to contain the keywords of your prompt. Just to illustrate that: Huggy and Cuddly can be thought of as semantically similar, so you may not necessarily have Cuddly in the results.

That is the reason why nowadays people look at their use cases and think about whether adding a heuristic to the relevance score is useful. Some people go with a hybrid combination of this embedding-based search and the heuristic-based search, so some combination of embeddings and BM25, in which case you may get even more relevant documents depending on your use case.
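
The lecture does not say how the two scores are combined, so the sketch below shows just one possible choice: rescale each list of scores to [0, 1] and take a weighted sum. The weighting parameter `alpha`, the normalization, and the example scores are all assumptions.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(embedding_scores, keyword_scores, alpha=0.5):
    """Weighted sum of normalized embedding-based and keyword-based scores."""
    emb = min_max(embedding_scores)
    kw = min_max(keyword_scores)
    return [alpha * e + (1 - alpha) * k for e, k in zip(emb, kw)]


# Hypothetical scores for three chunks: the second one is semantically close
# to the query, but only the first one contains the query's keyword.
embedding_scores = [0.80, 0.82, 0.10]
keyword_scores = [1.9, 0.4, 0.0]

combined = hybrid_scores(embedding_scores, keyword_scores)
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

In this example the keyword signal breaks the near-tie between the two semantically similar chunks in favor of the one that actually mentions the keyword.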

Does that make sense? Yeah. So now I'll come back to what you mentioned about whether cutting chunks in a naive way will necessarily give you chunks that are coherent. Well, you're completely right: sometimes it will not. But before we answer this question, we are going to address another concern, which is that typically, when people ask their LLM about something, the query they input is of a different nature from what is in the knowledge base. Your query is typically going to be short, maybe a question, but what is in your documents is typically going to be longer: sentences and sentences. So if you really think about it, if you use the same encoder to embed your query and to embed your documents, these two embeddings are not very comparable, because one is for a question and the

Key Summary

This lecture covers how LLMs interact with external systems: accessing external knowledge through RAG (Retrieval-Augmented Generation), using structured data through Tool/Function Calling, and carrying out tasks autonomously through Agents.

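Key Concepts

RAG (Retrieval-Augmented Generation) 06:37

Addresses the LLM knowledge cutoff problem
Gives access to up-to-date information without retraining the model
Three steps: Retrieve → Augment → Generate

Building the Knowledge Base 16:28

Chunking: split documents into pieces of hundreds of tokens (typically ~500 tokens)
Embedding: convert each chunk into a vector (typically ~1,500 dimensions)
Overlap: overlapping chunks preserve context

Two-Stage Retrieval 24:28

Candidate Retrieval: fast candidate filtering with a bi-encoder, maximizing recall
Reranking: precise re-ordering with a cross-encoder, optimizing precision
A hybrid of semantic similarity (cosine) + BM25 (keyword matching) is possible

Contextual Retrieval 42:33

Prepend context information to each chunk to improve understanding
Context is generated with LLM calls; Prompt Caching reduces the cost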

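Tool Calling / Function Calling 59:30

Structured data is exposed as functions that the LLM can use
The LLM refers only to the API signature and documentation, not the function implementation
Three steps: select the function → execute it → convert the result into natural language

Training for Tool Calling 69:20

SFT approach: train on input/output pairs (tool prediction + response generation)
Prompting approach: recent models can also do this with few-shot prompting/reasoning

MCP (Model Context Protocol) 77:00

A standard proposed by Anthropic for connecting LLMs to tools
Structure: Host (LLM), Client (connector), Server (tool provider)
Standardized tool definitions improve reusability

Agents 92:11

A higher-level concept than tool calling: pursuing goals autonomously
ReAct pattern: a repeated Observe → Plan → Act loop
Multiple tool calls and reasoning steps until the goal is achieved

Multi-Agent & A2A Protocol 99:00

Multiple agents can collaborate
Google's Agent-to-Agent Protocol: standardizes skills, execute, cancel

Safety Considerations 103:30

New risks arise, such as data exfiltration
Mitigations at the training stage (harmlessness) + the inference stage (safety classifier)
Safety is evaluated with AgentSafetyBench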

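Key Insights

RAG is a practical way to access up-to-date information without retraining the model
Retrieval quality is the key to RAG performance (Candidate Retrieval + Reranking)
Tool Calling lets the LLM interact with external APIs/systems
Agents carry out complex tasks autonomously with Tools + a Reasoning Loop
Powerful capabilities require safety measures (at both training and inference time)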