Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 3 - Transformers & Large Language Models

Stanford Online • October 17, 2025 • AI summary generated: January 24, 2026

Cool. Hello everyone, and welcome to lecture 3 of CME 295. Today is a very exciting day because we're finally going to introduce large language models. But before we go into that, I'm going to start, as is tradition, with some announcements. Some of you wanted to have the slides before class. So, just a heads-up: in case you want to use the slides for annotations, they're on the website right now, so feel free to get them. Shervin and I will try to publish them on the website on a regular basis, every Thursday evening, so that you can download and annotate them.

Cool. With that, let's start. As usual, we're going to recap last week's episode. If you remember, lecture one and lecture two were all about introducing the concept of self-attention and linking it to the construct of the transformer. What we did last lecture was look at all the types of models that are out there and how they are all based on the transformer.

There are three main categories of models. The first one we saw was the encoder-decoder model, which relies on the full transformer: it has the encoder of the transformer and the decoder of the transformer, and typically the tasks there are text in, text out. We saw that one example was T5 and all its variations.

The second type of model is where we remove the decoder from the transformer and obtain an encoder-only model. We went deeper into BERT, which is the typical encoder-only model. We also saw that BERT has this nice property that its encoded embeddings are very meaningful and expressive of the inputs. We saw the example of classification, such as sentiment extraction, where in particular we took the encoded embedding of the CLS token. In real life, BERT is used to encode documents and sentences, and we're going to see later in the class how these models are useful.

And then, last but not least, we have the third category of models, which is decoder-only. We only keep the decoder part of the transformer, and we also make a small modification: we remove the cross-attention, because we don't need it anymore since we don't have an encoder. These kinds of models are text in, text out. GPT is a very good example of such models, and actually most models these days are decoder-only. So these are the three main kinds of models that you can see out there. So far so good for everyone?

Cool. With that, I'm going to introduce the term LLM. LLM stands for large language model. So what is a large language model? First of all, a large language model is a language model, and a language model is a model that assigns probabilities to sequences of tokens. In this case, our model always predicts the probability of the next token, so in that sense it's a language model.
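The link between "assigns probabilities to sequences" and "predicts the next token" is the chain rule: the probability of a sequence is the product of the next-token probabilities along it. Below is a minimal sketch of that idea. The probability table is a made-up toy, and it conditions only on the previous token for brevity, whereas an actual LLM conditions on the whole prefix.

```python
import math

# Hypothetical next-token probabilities P(next | previous token).
# A real LLM conditions on the entire prefix, not just the last token.
NEXT_TOKEN_PROBS = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"teddy": 0.5, "fluffy": 0.5},
    "teddy": {"bear": 0.9, "is": 0.1},
}


def sequence_log_prob(tokens):
    """Sum the log-probabilities of each token given the previous one."""
    log_prob = 0.0
    prev = "<bos>"
    for tok in tokens:
        log_prob += math.log(NEXT_TOKEN_PROBS[prev][tok])
        prev = tok
    return log_prob


# Probability of the sequence "a teddy bear": 0.6 * 0.5 * 0.9
prob = math.exp(sequence_log_prob(["a", "teddy", "bear"]))
```

Working in log space is a common choice because products of many small probabilities underflow quickly.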

But a large language model is also large. Why is it large? We're going to see that these models are scaled up along several dimensions. First, in terms of model size: these days it's not uncommon to see models on the order of hundreds of billions of parameters, and typically when we say LLM we mean at least on the order of a billion. These models have also been trained on a huge amount of data. Here we quantify the amount of data by the number of tokens they were pre-trained on, which is on the order of hundreds of billions or even trillions of tokens. I think the biggest ones are on the order of tens of trillions of tokens, so these are huge training sets. And they're also large because they need a lot of compute: typically you need a bunch of GPUs to make them work, although these days there have been a lot of optimizations to make them work on consumer-grade GPUs. We're going to see that later. So an LLM is large along these dimensions.

Another thing I want to point out is that all this terminology is relatively new. I remember that in 2018-19 there was no real definition of an LLM. I think no one actually talked about LLMs in the beginning, or, when people did talk about LLMs, they included BERT. But BERT is an encoder-only model that does not produce text. So with the current definition of an LLM, which by now is pretty well established, BERT would not be an LLM because it doesn't produce text. Here we only consider language models that do text-to-text and that are very large in terms of size, the amount of data they have been trained on, and compute.

Cool. As we saw before, these models are decoder-only. So here we remove the encoder and only keep the masked self-attention, the feed-forward neural network, and the addition and normalization. We only keep this, and this is the backbone of LLMs. I mentioned that GPT is a good example, but it's not just that; you have plenty of other models. You may have heard of Llama from Meta, Gemma from Google, DeepSeek, Mistral, Qwen, and so on; the list is long. I would say that roughly more than 90% of modern-day LLMs are decoder-only, so I think that's something to keep in mind. Cool.

Okay, so now you know what LLMs are made of, but there is something else that people these days have also introduced into these models, and we're going to see that in a bit. I mentioned that these models are huge in size, again typically hundreds of billions of parameters, so it takes a lot of compute to run even one inference, and also to train these models. But you may wonder: do you really need all these parameters to be activated during a forward pass to make a simple prediction?

So I'm going to use a little metaphor. Suppose you enter a room, and in the room there is a mathematician, a physicist, a chemist, and a historian. So you come in, and there's a bunch of people who are experts in what they do. And you have a question, a math question. The question I have for you is: who would you ask? Would you ask the mathematician? Would you ask the chemist? Would you ask everyone? Well, right now we ask everyone. We ask all parameters of the model to be involved in the computation of the generation. So the idea here is that, given an input, maybe it's not necessary to involve everyone in the computation. Let's instead have just a subset of the model be involved in the computation of the next token.

So I'm introducing this idea of experts, with the following notation. Suppose we have n experts; think of them as your mathematician, your chemist, your historian, whatever. These are your experts, and the idea is that, given an input x, you ask yourself who should be involved in the generation of the output. So you're going to have, let's say, another network. We're going to call it G, like gate, but it's also sometimes called the router. Suppose we have some gate that tells us which expert should be involved in the inference. And suppose the gate tells us that expert number two is well suited to answer your question. Then the idea is that the input is just going to flow into that expert, but not into the other experts.

So this has a name: it's called mixture of experts, denoted MoE. Everyone talks about MoEs, so these are mixtures of experts. The formula that you will see a lot is this one: the output y, denoted y-hat, is the sum of the expert outputs, each weighted by the output of the gate, which tells you how important the output of each expert is.
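Written out, the formula described here is y_hat = sum over i of G(x)_i * E_i(x). The sketch below is one way to read it in code, under stated assumptions: the gate is taken to be a linear projection followed by a softmax (as described later in the lecture), and the experts and gate weights are hypothetical toy stand-ins, not real networks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_moe(x, gate_weights, experts):
    """Compute y_hat = sum_i G(x)_i * E_i(x), evaluating every expert."""
    # Gate: one logit per expert (dot product of a weight row with x), then softmax.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gate = softmax(logits)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    y_hat = [sum(g * out[d] for g, out in zip(gate, outputs)) for d in range(dim)]
    return y_hat, gate


# Toy experts standing in for small networks (hypothetical).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical values

y_hat, gate = dense_moe([1.0, 2.0], gate_weights, experts)
```

Here every expert is evaluated and the gate only decides how much each output counts.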

Yeah, great question. The question is: how do you train G, and how do you train the experts E? Typically you train them jointly. We're going to see that a little later, but you can think of it as just training as usual: you do your forward pass, you compute the loss, and you backprop. It's actually an interesting question, because there are some challenges that come with training MoEs that we're going to see in a second.

Yep. The question is: what are these E, what is the architecture of the experts? Let's suppose for now that they're just some network. We're not specifying them yet, but we're going to see this in a second. Cool.

Okay, so I asked: what if we don't activate everyone, what if we activate only a subset? But this formula actually assumes that we're considering all expert outputs. So I want to distinguish two kinds of MoE. The first kind is called a dense MoE. A dense MoE does not have any constraint on the number of experts that are involved. These weights can be anywhere between zero and one, so think of them as a probability distribution that just puts more weight on some experts than on others. Back to the example I had: suppose I have a math question. I'm going to ask the mathematician, the chemist, and the historian, and I'm probably going to give a higher weight to what the mathematician says than to, say, what the historian says. That's the idea.

But the interesting case is when we constrain the number of experts that are activated because, as we mentioned previously, what we're interested in is not involving everyone, in order to save on the amount of compute that we do. So there's a second kind of MoE, called the sparse MoE, which only selects the, quote-unquote, top-k experts. k can be equal to one, so one expert, or two; it's a hyperparameter that you choose. Here the expression of the output becomes the sum, over the chosen experts only, of G(x) times E(x). So far so good?
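The top-k selection can be sketched as follows. This is a minimal illustration, not a specific library's implementation: the gate logits are passed in directly, the experts are toy functions, and the gate weights of the chosen experts are used as-is, following the formula as stated (some implementations renormalize them over the chosen experts instead).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(x, gate_logits, experts, k=1):
    """Sum G(x)_i * E_i(x) over the k experts with the largest gate values."""
    gate = softmax(gate_logits)
    top_k = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:k]
    y_hat = [0.0] * len(x)  # assumes experts preserve the input dimension
    for i in top_k:
        out = experts[i](x)  # only the selected experts are ever evaluated
        y_hat = [acc + gate[i] * o for acc, o in zip(y_hat, out)]
    return y_hat, top_k


calls = []  # records which experts actually ran


def make_expert(idx, scale):
    """Build a toy expert that logs its own index when called."""
    def expert(x):
        calls.append(idx)
        return [scale * v for v in x]
    return expert


experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
y_hat, chosen = sparse_moe([1.0, -1.0], [0.1, 2.0, 0.3, 1.5], experts, k=2)
```

The `calls` list makes the compute saving visible: with four experts and k = 2, only two expert functions run.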

Cool. Of course there's a lot more to it, so in case you're interested in learning more, feel free to go through the resources at the bottom of the slide. One thing I will say is that we have a unit of measure for the amount of compute that these models perform in each pass. You will see the term FLOPs. Have you seen the term FLOPs out there? No, not really? It stands for floating-point operations. It quantifies how many operations, think additions and multiplications, are involved in, say, a forward pass, and it basically quantifies how compute-heavy your task is. Typically what we say is that when we go through a sparse MoE, as opposed to a dense MoE, we have a lower number of FLOPs. So this is the unit of measure that you will see.
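As a rough back-of-the-envelope illustration of the dense versus sparse comparison, the sketch below counts FLOPs per token for an MoE layer whose experts are two-matrix feed-forward networks. The counting convention (one multiply-add counted as 2 FLOPs, biases and activations ignored) and the dimensions are assumptions chosen for illustration.

```python
def ffn_flops_per_token(d_model, d_ff):
    """FLOPs for two matmuls (d_model -> d_ff -> d_model), 2 FLOPs per multiply-add."""
    return 2 * d_model * d_ff + 2 * d_ff * d_model


def moe_flops_per_token(d_model, d_ff, n_experts, k=None):
    """FLOPs for the gate plus the experts that actually run.

    k=None means dense (all experts run); otherwise only k experts run.
    """
    gate = 2 * d_model * n_experts  # linear projection onto n_experts logits
    active = n_experts if k is None else k
    return gate + active * ffn_flops_per_token(d_model, d_ff)


# Hypothetical sizes, for illustration only.
dense = moe_flops_per_token(1024, 4096, n_experts=8)
sparse = moe_flops_per_token(1024, 4096, n_experts=8, k=2)
```

With these numbers the sparse layer needs roughly a quarter of the dense layer's FLOPs, since the gate's cost is small next to the experts'.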

But back to your question: what are these experts? If you remember, ten minutes ago we said that LLMs are decoder-only models. So I have a question for you. Suppose we wanted to put some MoE in our LLM. Where would we put it? Here I guess we have three choices: the masked self-attention layer, the feed-forward neural network, and the normalization. So where would you put it? Where do you think the most complex part of the network is, where there are a lot of operations? Feed-forward? Yes, great answer. It's indeed the feed-forward neural network.

The reason for that, and I think Shervin mentioned it in lecture one, is the following. If you remember, the feed-forward neural network takes your d_model-dimensional input vector, projects it into, say, a d_ff-dimensional space, and then projects it back to the d_model-dimensional space. d_ff is typically larger than the dimension of the input, d_model. So the number of parameters in that feed-forward neural network is on the order of d_model times d_ff times 2, plus some biases. As for the attention layer, if you remember, it's basically composed of the projection matrices. What are the dimensions of the projection matrices? d_model times the dimension of the keys, the dimension of the queries, and the dimension of the values. And this dimension is typically much lower; think of it as being in the hundreds. So your d_model is typically O(100) to O(1,000), and the projection dimension d_ff is O(1,000) to O(10,000).
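To make the comparison concrete, here is a small parameter count for one feed-forward block and one multi-head attention block. The sizes are hypothetical, and the attention count assumes per-head query, key, and value projections plus an output projection, with biases omitted.

```python
def ffn_params(d_model, d_ff):
    """Two weight matrices (d_model x d_ff and d_ff x d_model) plus their biases."""
    return 2 * d_model * d_ff + d_ff + d_model


def attention_params(d_model, n_heads, d_head):
    """Per-head Q, K, V projections plus the output projection (biases omitted)."""
    qkv = 3 * n_heads * d_model * d_head
    out = n_heads * d_head * d_model
    return qkv + out


# Hypothetical sizes: d_ff = 4 * d_model is a common but not universal choice.
d_model, d_ff, n_heads = 1024, 4096, 16
d_head = d_model // n_heads  # 64

n_ffn = ffn_params(d_model, d_ff)
n_attn = attention_params(d_model, n_heads, d_head)
```

Under these assumptions the feed-forward block holds about twice as many parameters as the attention block, which is why it is the natural place to trade parameters for sparsity.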

Cool. So is everyone now convinced that this is a good place to put the mixture of experts? Yeah? Cool. This is actually how it's done. In modern-day LLMs, this idea of not involving everyone in the computation of the next-token prediction is implemented by putting the mixture of experts where the FFN is, which is basically here. Typically you would have a sparse mixture of experts. So, back to your question: these experts are feed-forward neural networks. You would have several networks that you train, but you would only activate one. Typically k would be equal to one; it can also be equal to two, but it would only be a subset. That's my point. And this routing is done at the token level.

If you remember, the decoder takes something as input, a bunch of tokens, and what I'm saying is that each token will be processed by an expert that may be different from the one used for another token. The router takes the representation of the token as input and figures out which expert is best for this token to flow to. Does this idea make sense? I have a little illustration later on that will hopefully help.
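Token-level routing with k = 1 can be sketched as a per-token argmax over gate logits. The token vectors and router weights below are hypothetical toy values; in a real model the inputs would be the contextualized token representations coming out of the attention sublayer.

```python
def route_tokens(token_vectors, gate_weights):
    """Return, for each token, the index of the expert with the largest gate logit."""
    assignments = []
    for x in token_vectors:
        # One logit per expert: dot product of that expert's gate row with the token.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
        # The argmax of the logits is also the argmax of their softmax.
        assignments.append(max(range(len(logits)), key=lambda i: logits[i]))
    return assignments


# Three hypothetical 2-dimensional token representations and a 3-expert router.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]

assignments = route_tokens(tokens, gate_weights)
```

Each token in the same sequence can land on a different expert, which is exactly what "routing at the token level" means.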

Now, back to your question about how you train this model: do you train the router separately, do you train the experts separately? One challenge people have when they train these MoE-based models is making sure that all experts carry some weight and are being used. It's very possible that you train your model and somehow only one or two experts always get activated while the others are always inactive; they're never involved in the computation. This problem is called routing collapse. Why is it called routing collapse? Because the router always chooses some experts but not others.

So this is a challenge, and the way people try to mitigate it is by changing the loss function and adding an extra term to it, which is written here. It's basically some hyperparameter alpha, times the number of experts, times a sum over all experts i of quantities that depend on whether or not tokens went to expert i. It's not super important that you completely understand exactly how that formula works. The only thing I think you should take away from this slide is that this extra loss pushes these quantities to converge toward uniform distributions. What are these quantities, as a reminder? f_i is the fraction of tokens that are routed to expert i, and P_i is the average routing probability for expert i. So when I say that all experts should be used roughly the same amount, what I'm saying is that I want this probability to be roughly uniform across experts.
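One common form of this extra term, used in the Switch Transformer paper, is alpha * N * sum over i of f_i * P_i. The sketch below assumes that form; the slide's exact formula is not reproduced in the transcript, and the value of alpha and the toy batches are made up for illustration.

```python
def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i over one mini-batch.

    router_probs: per-token probability distributions over experts.
    assignments:  per-token index of the expert each token was routed to.
    """
    n_tokens = len(router_probs)
    # f_i: fraction of tokens routed to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: average routing probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced routing: tokens split evenly between two experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
```

The term is smallest when both f and P are uniform, so adding it to the training loss penalizes the collapsed configuration.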

Yeah. So the question is: when do you compute these quantities? You can think of it as a regular training process: you take a mini-batch, run it through the model, compute all these quantities, and then do your backpropagation based on that. What I want you to remember is that this incentivizes the router's choices to be more uniform across experts, which mitigates this routing collapse phenomenon.

Yeah.

>> Yeah. So the question is: can we use dropout? Of course, you can always combine this with other techniques; people have just found this one to be very helpful. Speaking of other techniques, there is something I've not talked about that is very similar to the dropout idea. It's called noisy gating. With noisy gating, you have your predictions from the gate, but then you add some noise to them. So, by pure chance, it allows other experts to be involved in the computation. That's another technique; there are a bunch of them. But yes, dropout is indeed quite useful for things like overfitting, and the idea can be reused in different settings.
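A minimal sketch of noisy gating is shown below. The lecture does not specify the noise distribution, so this assumes zero-mean Gaussian noise with a fixed standard deviation added to the gate logits before the top-1 choice; the function name and the logits are hypothetical.

```python
import random


def noisy_top1(gate_logits, noise_std=1.0, rng=None):
    """Add Gaussian noise to each gate logit, then pick the largest."""
    rng = rng or random.Random()
    noisy = [v + rng.gauss(0.0, noise_std) for v in gate_logits]
    return max(range(len(noisy)), key=lambda i: noisy[i])


logits = [2.0, 1.8, 0.5]  # expert 0 wins without noise
rng = random.Random(0)    # seeded for reproducibility

# Count how often each expert is selected over many noisy draws.
counts = [0, 0, 0]
for _ in range(1000):
    counts[noisy_top1(logits, noise_std=1.0, rng=rng)] += 1
```

Without noise, expert 0 would be chosen every time; with noise, the runner-up experts also receive tokens, and therefore gradient signal, during training.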

Yep. So the question is about how you... you mean differentiability? How do you take the derivative, is that your question? Can you explain a bit more what your concern is?

>> Mhm. So, to this question: the average routing probability is a function of the gate output, right? That's P_i. So I guess your question is about f_i. Okay, I don't have a good answer off the top of my head, but I think people have techniques for this, and these days you don't even have to do this by hand; it's built in. So maybe I can follow up with you on f_i. But for P_i, do you see that this one is quite clean? It's just the average of the probabilities from the gate.

Yeah. So the output probability from the gate: you can think of it as the input vector being projected into a space of dimension n, where n is your number of experts, and then passed through a softmax. So your output sums to one, and each dimension represents how much the corresponding expert would be used: the first dimension is for expert one, the second dimension for expert two, and so on. You just take the average of this, and you can express it as a function of all the parameters. So I think you should be fine with that one. Yeah.

So the question is: if we increase the number of MoEs, does it increase the number of model parameters? It's a great question. It's actually one of the ideas behind MoE-based models: you can scale the model without incurring the cost of significantly more compute at inference time. So you can increase what people call the capacity of your model, but you still control the number of active parameters, where active parameters are the parameters that are used in a forward pass. So yes, it will increase the number of parameters.
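The distinction between total and active parameters can be illustrated with a small count for a single MoE layer. The sizes are hypothetical, each expert is assumed to be a two-matrix feed-forward network, and biases are omitted.

```python
def moe_layer_params(d_model, d_ff, n_experts, k):
    """Return (total, active) parameter counts for one sparse MoE layer."""
    expert = 2 * d_model * d_ff     # one feed-forward expert, biases omitted
    router = d_model * n_experts    # linear gate onto n_experts logits
    total = n_experts * expert + router
    active = k * expert + router    # parameters touched for a single token
    return total, active


# Hypothetical sizes: scale from 8 to 64 experts while keeping k = 1.
total_8, active_8 = moe_layer_params(1024, 4096, n_experts=8, k=1)
total_64, active_64 = moe_layer_params(1024, 4096, n_experts=64, k=1)
```

Going from 8 to 64 experts multiplies the total parameter count by roughly eight, while the active count only grows by the slightly larger router.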

That's why you see some MoE-based models that are even bigger than the ones we had before, which were on the order of hundreds of billions; we even have models on the order of trillions of parameters. For instance, one reading I recommend is Switch Transformer, which scaled up to one-point-something trillion parameters. So yes, definitely more parameters. That being said, if you read the paper, you will also see that these models are more of what they call sample efficient: they take less time to become as good as a model with a lower number of parameters would have been. If you draw the training curve as a function of training time, you see that these models are typically more sample efficient.

Sorry?

>> Yeah, exactly. Everything here is a trade-off.

>> Yeah. Cool.

So the question is whether each attention head will have a number of experts. It's actually independent of the attention heads. You can think of the attention heads as something separate, and the number of experts is independent of that. Does that make sense? Yeah. Right.

All right. The question is whether every block will have a number of experts. The answer is yes, and typically those weights are not shared. Actually, we're going to see an example. It can very well be that in layer one expert number three, say, was chosen, but in layer two it's expert number one. It's all free; it's trainable.

So the question is who decides which expert a token goes to. All of that is decided by the gate, which is this quantity. Everything is decided by the gate, which has trainable weights. You can think of it as just a projection from the input x to an n-dimensional space, where n is the number of experts.

So the question is: at what point during inference is it decided? Let's suppose we're at inference time; I'm going to walk you through how it works. You have your x. You have the attention mechanism, where x interacts with all tokens from the past, given that it's decoder-only, so it's the masked version. Then it goes here, and at the beginning of the feed-forward neural network block the token is of course contextualized: it has the information from these other tokens because it has attended to them. Then it goes here: x first goes into G. G computes a probability distribution over all experts and, given that we are in a sparse MoE setting, we only choose the top k. Suppose it's the top one, just the highest probability. You then know which expert that is, and as a result you only compute the output of the expert that was chosen for the input.

Mhm. Can you elaborate on that, actually? Yeah. At what point exactly? It's after the self-attention layer.

Yeah. So the question is: do we have a different classification for different heads? No, there's only one router. I think I understand your question: what do you do, given that you have different attention computations going on in parallel across the heads? If you remember, the attention layer has these different heads, but at the end it concatenates the results from all of these heads and then projects them once again into the d_model space.

Yep. So the question is: do we have different G's? The only thing I can tell you is that G is layer-specific and trainable, so it's going to learn how to process all these inputs. You have one G for, say, the first layer, another G for the second layer, and so on, and each one is trained.

Cool. Great. Looking at the time, do we have any other questions here? We're good. Perfect.

So now I just wanted to show you a cool thing that I believe the Mistral team showed in one of their papers. For a given piece of text, they showed which expert each token was routed to. As we noted before, experts are different from one layer to another. I believe here it's for layer zero, so it's for one given layer, and we do see that these tokens roughly make uniform use of the experts, more or less. What you would not want to see is every token having the same color, but luckily that's not the case. So that's one cool way of representing how the routing is done: take your input text and show, for each token, which expert it went to.

Cool. Okay. So what we just saw was one way that modern-day LLMs change their architecture to reflect the fact that we may want to scale the model but not increase the computational complexity of one forward pass, and we saw that with MoE. You will see a lot of MoE-based LLMs out there. And now, knowing that we have an LLM, we're going to focus on how a response is generated.

Remember when I told you that these modern-day LLMs take some text in and put some text out. It's typically this task of next-token prediction. You have a token in, say beginning-of-sentence, you go through your LLM, and it generates the next word, or the next token: "a". Then you take "a" and it goes to "teddy", and then "teddy bear is", et cetera. But so far we have never really dug into exactly how we choose the next token. So what we're going to do now is see exactly how we generate the next token.

So as you know, our LLM here is just a decoder-only architecture. What you have is a decoder with your input here and your output there. Let's suppose for a second that we know everything that's happening in the middle, and we're just obtaining output probabilities that look a little bit like this. Given a token, or some sequence of tokens, as input, you have an output probability distribution that represents what the model thinks is the likelihood that the next token is, let's say, "a", "airplane", "fluffy", et cetera. So this is what you have.

So now my question to you is: if I told you we have some sequence as input and we want to choose the next token, and if I told you that our model is actually giving out a probability distribution, how would you choose the next token based on this?

Sorry?

Great. The token with maximum probability. Okay, great. So yeah, the first idea: let's just take the token with the highest probability.
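As a minimal sketch of this first idea, here is what "take the token with the highest probability" looks like in Python. The distribution `next_token_probs` and the helper name `greedy_pick` are hypothetical, made up for illustration; a real model would produce a distribution over its whole vocabulary.

```python
# Hypothetical next-token distribution; tokens and numbers are made up for illustration.
next_token_probs = {
    "a": 0.10,
    "airplane": 0.05,
    "fluffy": 0.15,
    "teddy": 0.60,
    "bear": 0.10,
}

def greedy_pick(probs):
    """Return the token with the highest probability (greedy decoding)."""
    return max(probs, key=probs.get)

next_token = greedy_pick(next_token_probs)
print(next_token)
```

Because nothing here is random, calling `greedy_pick` twice on the same distribution always returns the same token.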

So it's a very natural approach, but I'm not sure if you've been using things like ChatGPT or Gemini: every time you ask something, it responds with something slightly different, right? If you always choose the token with the highest probability, then given that the computation here, as we're going to see, is all deterministic, you're always going to generate the same thing for the same input. So that's one limitation: it's not very diverse.

The second problem is that if you choose the highest-probability token on an iterative basis, you're locally optimal, but you're not necessarily globally optimal. So what does that mean? If you think about it, our objective is to produce an output sequence of tokens that has a high probability. But the problem is that if you always choose the highest-probability token, you will not necessarily obtain the highest-probability sequence.

Are you convinced of this statement, by the way? Let me give you an example. Suppose that for the next token, one token is 0.8, the other one is 0.7, and you choose to go with the 0.8. No, actually it's not 0.7, because it has to sum to one, so let's say 0.2. Suppose that if you go ahead with the sequence that starts with the 0.8, all the following token probabilities are very low. Then you will end up with an output sequence that has a lower probability than, say, the other path, which, let's suppose, has higher-probability predictions in the later steps.

Right? We're going to see this in a second, but this is the idea. Choosing the token with the highest predicted probability is a good first idea, but it's locally optimal, not necessarily globally optimal.
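To make the 0.8 versus 0.2 example concrete, here is a small sketch with a hypothetical two-step model. The tokens and the second-step probabilities are assumptions chosen for illustration: after the 0.8 token the probability mass is spread thin, while after the 0.2 token one continuation is very likely.

```python
# Hypothetical two-step model; all tokens and numbers are illustrative assumptions.
first_step = {"the": 0.8, "a": 0.2}
second_step = {
    "the": {"cat": 0.2, "dog": 0.2, "car": 0.2, "sun": 0.2, "sky": 0.2},
    "a": {"teddy": 0.9, "cat": 0.1},
}

def greedy_sequence():
    """Pick the most likely token at each step, one step at a time."""
    first = max(first_step, key=first_step.get)
    second = max(second_step[first], key=second_step[first].get)
    return (first, second), first_step[first] * second_step[first][second]

def best_sequence():
    """Enumerate every two-token sequence and return the most likely one."""
    candidates = {
        (t1, t2): p1 * p2
        for t1, p1 in first_step.items()
        for t2, p2 in second_step[t1].items()
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

greedy_seq, greedy_prob = greedy_sequence()
best_seq, best_prob = best_sequence()
print(greedy_seq, greedy_prob)  # starts with "the", probability about 0.16
print(best_seq, best_prob)      # ("a", "teddy"), probability about 0.18
```

The greedy path commits to the 0.8 token and ends at about 0.8 × 0.2 = 0.16, while the path through the 0.2 token reaches about 0.2 × 0.9 = 0.18. Exhaustive enumeration only works on a toy like this, since the number of sequences grows exponentially with length.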

And this is the reason why we have a second method, which is about keeping track of the K most probable paths. I'm not sure if you've heard of beam search, but that's what beam search does. Here K is sometimes called the beam size or the beam width. So if you hear these terms, these are just names given to the number of paths that we keep track of. And it works as follows.

Let's suppose we start our generation with the beginning-of-sentence token, and we want to figure out what the next token is. In this very basic example, let's suppose we have three tokens, and the two most probable tokens are "a" and "the". If we have K equal to two, what we're going to do is keep track of these two branches. So that's the first iteration.

In the second iteration, we look at all the next-token probabilities for these two tokens, and we always save the two most probable paths. Here, for instance, let's suppose they are "the" followed by "fluffy", and "a" followed by "cute".

So back to what I was saying earlier. If you were choosing the path that went along the highest-probability token, like "the", it's very much possible that the highest-probability token after "the" has a much lower probability than the one after "a".

And so this is what beam search tries to do: it tries to reach a more globally optimal solution. Let's suppose we continue like that, and at the end of the day we obtain K potential choices, and then we pick the sequence with the highest probability. What people typically do is take the sum of the logarithms of the token probabilities. They say: okay, the log probability of the sequence is the sum of the log probabilities of each next-token prediction. So it's the log probability of "a" given BOS, and then of "cute" given BOS and "a", and so on and so forth.
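The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production decoder: `TOY_MODEL` is a hypothetical lookup table standing in for the LLM, and `next_token_probs` and `beam_search` are names introduced here. Each path is scored by the sum of the log probabilities of its tokens, and only the K best paths survive each step.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    ("<bos>",): {"the": 0.5, "a": 0.4, "airplane": 0.1},
    ("<bos>", "the"): {"fluffy": 0.3, "cute": 0.3, "<eos>": 0.4},
    ("<bos>", "a"): {"cute": 0.9, "fluffy": 0.05, "<eos>": 0.05},
}

def next_token_probs(prefix):
    """Stand-in for a model forward pass; unknown prefixes simply end the sequence."""
    return TOY_MODEL.get(prefix, {"<eos>": 1.0})

def beam_search(k, max_steps):
    """Keep the k most probable paths, scored by summed log probabilities."""
    beams = [(("<bos>",), 0.0)]  # (sequence, log probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished paths carry over unchanged
                continue
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(p)))
        # Keep only the k highest-scoring paths for the next iteration.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

beams = beam_search(k=2, max_steps=2)
best_seq, best_score = beams[0]
print(best_seq, math.exp(best_score))  # ("<bos>", "a", "cute") with probability about 0.36
```

In this toy table, greedy decoding would start with "the" (0.5) and end on a path of probability 0.2, whereas keeping two branches lets the "a" then "cute" path (0.4 × 0.9 = 0.36) win.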

But I just want to point out one limitation of this approach, which is that the more tokens you generate, the lower your final sequence probability will be. If you think about it, all these probabilities are between zero and one. Think of it in the multiplied sense: the probability of the whole sequence is just the product of the next-token probabilities. The more factors below one you multiply in, the more this quantity goes towards zero. So this method, as is, will favor sequences that are shorter. For that reason, beam search has an additional term that counteracts that effect, something on the order of one over the number of tokens to the power of something. So in practice there's a technique to make sure that things work relatively well.
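One common form of that correction divides the summed log probability by the sequence length raised to some power. The sketch below assumes that form; the exponent name `alpha` and its default value of 0.7 are illustrative assumptions, not values given in the lecture.

```python
import math

def sequence_log_prob(token_probs):
    """Log probability of a sequence: the sum of per-token log probabilities."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs, alpha=0.7):
    """Divide by (number of tokens) ** alpha; alpha = 0 recovers the raw score.

    alpha is a hypothetical tuning knob chosen for illustration.
    """
    return sequence_log_prob(token_probs) / (len(token_probs) ** alpha)

# Illustrative per-token probabilities for a short and a longer candidate.
short_seq = [0.5, 0.5]           # 2 tokens, less confident at each step
long_seq = [0.6, 0.6, 0.6, 0.6]  # 4 tokens, more confident at each step

print(sequence_log_prob(short_seq), sequence_log_prob(long_seq))
print(length_normalized_score(short_seq), length_normalized_score(long_seq))
```

With the raw sum, the shorter candidate scores higher simply because it has fewer factors below one. After normalization, the longer candidate, whose individual predictions are more confident, comes out ahead.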

But okay, let's suppose we figure out all these things. The problem is that we need to keep track of these most probable paths, we need to do all this saving and so on, and it just requires a lot of computation. And the other thing is that we're still interested in the most probable path, which basically leads to a sequence that the model thinks is very likely, but sometimes what you want is for your output to be more diverse or more creative. So that's why beam search is actually not something that people typically use.

People use beam search for things like machine translation, where you actually need an output that is close to the most likely one. But in practice people use a third method, which is called the sampling method. I told you we have a probability distribution over tokens for what the next token should be. So what people do is just sample the next token from that probability distribution. Does this make sense?

So in this example, "fluffy", "gentle", "kind", and let's say "smart" will have a higher probability of being drawn, as opposed to, let's say, "airplane" and "wear", which have a lower probability of occurring. It's low, but they still have a nonzero probability of occurring.
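Here is a minimal sketch of the sampling method using Python's standard `random.choices`. The probabilities in `vocab_probs` are made-up numbers for the tokens mentioned above, and `sample_next_token` is a name introduced for illustration.

```python
import random

# Hypothetical next-token distribution for the tokens in the example; numbers are made up.
vocab_probs = {
    "fluffy": 0.30,
    "gentle": 0.25,
    "kind": 0.20,
    "smart": 0.15,
    "airplane": 0.05,
    "wear": 0.05,
}

def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the sketch is reproducible
draws = [sample_next_token(vocab_probs, rng) for _ in range(10_000)]
counts = {token: draws.count(token) for token in vocab_probs}
print(counts)
```

Over many draws the counts track the probabilities: likely tokens such as "fluffy" come up often, while "airplane" and "wear" come up rarely but not never.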

Cool. Any questions so far?

Yep.

Right. So the question is whether this is for inference rather than training. Yes, correct. You can think of it as response generation. Suppose you have a model that is trained, and what you want is to generate an output. So what would you do?

Yeah. So just to complete my answer: during training, what you care about is the output probabilities, and you compare them with the actual label, which most of the time is a hard label. That is what you would compare. Here, the setting is: suppose you have an LLM that is trained. How would you generate a response?

Cool. Everyone good? Yeah. Yep.

Yeah. So the question is how you do the sampling in this situation. That's actually my next slides, but I just wanted to make sure everyone was on the same page regarding the intuition. So taking the highest probability, which is called greedy decoding, is probably not something that we want. Beam search is a little bit better. It's closer to something globally optimal. It's not globally optimal, but it moves towards that. So it's better, but it lacks diversity, it lacks creativity, which is why what we want to do is actually sample each token. And we have a few methods that also restrict

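Key Summary

Covers the definition of large language models (LLMs) and the decoder-only architecture, scaling via Mixture of Experts (MoE), output control via temperature, various prompting techniques, and inference efficiency via KV cache and PagedAttention.

Key Concepts

Definition of an LLM 3:30

Language model: a model that assigns probabilities to token sequences
Meaning of "large": model size (tens to hundreds of billions of parameters), training data (trillions of tokens), compute resources
Modern LLMs: decoder-only architectures account for over 90%. GPT, Llama, Gemma, DeepSeek, Mistral, Qwen, etc.
BERT does not generate text, so it is not an LLM under the modern definition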

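Mixture of Experts (MoE) 7:30

Motivation: do we need to activate every parameter every time? Does a math question need a historian?
Structure: n expert networks plus a router (gate) that selects suitable experts based on the input
Dense vs. sparse MoE: dense takes a weighted sum over all experts; sparse activates only the top-K experts (typically K = 1 to 2)
Expert placement: applied to the FFN layers (the part with the most parameters, d_model × d_ff × 2)
Token-level routing: experts are selected independently for each token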

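Challenges in MoE Training 18:00

Routing collapse: the router keeps selecting only certain experts
Load balancing loss: minimizes the sum of f(i) × P(i) to encourage even expert usage
- f(i): fraction of tokens routed to expert i
- P(i): average routing probability of expert i
Benefit: increases the total parameter count while limiting active parameters, reducing inference cost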
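As a rough sketch of the load balancing term above, the function below computes the sum of f(i) × P(i) from per-token router probabilities, assuming top-1 routing. The function name `load_balancing_loss` and the example router outputs are hypothetical; implementations in practice often also multiply by a scaling coefficient and the number of experts, which is omitted here.

```python
def load_balancing_loss(router_probs):
    """Sum over experts of f(i) * P(i), assuming each token goes to its top-1 expert.

    router_probs: one list per token, holding the router's probability for each expert.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose top-1 expert is expert i.
    f = [0.0] * num_experts
    for probs in router_probs:
        f[probs.index(max(probs))] += 1 / num_tokens

    # P[i]: average router probability assigned to expert i.
    P = [sum(probs[i] for probs in router_probs) / num_tokens for i in range(num_experts)]

    return sum(f_i * p_i for f_i, p_i in zip(f, P))

# Hypothetical router outputs for 4 tokens and 2 experts.
balanced = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
collapsed = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]

print(load_balancing_loss(balanced))   # about 0.5
print(load_balancing_loss(collapsed))  # about 0.9
```

The collapsed router, which sends every token to the same expert, gets a higher value than the balanced one, so minimizing this term pushes the router toward even expert usage.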

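Temperature and Sampling 48:00

Softmax with temperature: P(i) = exp(x_i/T) / Σ_j exp(x_j/T)
Low temperature (→0): spiky distribution; only the highest-probability token gets selected (deterministic)
High temperature (→∞): approaches a uniform distribution; more diverse and creative outputs
Top-K sampling: sample only from the K highest-probability tokens
Top-P (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds P
The only source of nondeterminism: everything inside the Transformer is deterministic; only sampling is stochastic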
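The formulas above can be sketched in plain Python as follows. The helper names `softmax_with_temperature`, `top_k_filter`, and `top_p_filter` are introduced here for illustration, and the logits are made-up numbers. The filters return a renormalized distribution from which one would then sample.

```python
import math

def softmax_with_temperature(logits, temperature):
    """P(i) = exp(x_i / T) / sum_j exp(x_j / T), with the max subtracted for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to one."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
print(softmax_with_temperature(logits, 0.1))    # spiky: almost all mass on the first token
print(softmax_with_temperature(logits, 100.0))  # close to uniform
print(top_k_filter([0.5, 0.3, 0.2], k=2))
print(top_p_filter([0.5, 0.3, 0.2], p=0.7))
```

In this example both filters keep the same two tokens and rescale their probabilities to roughly 0.625 and 0.375.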

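Prompting Techniques 1:15:00

Zero-shot: perform the task from a task description alone, with no examples
Few-shot (in-context learning): include input-output examples in the context
Chain-of-Thought (CoT): prompt the model to also generate the reasoning that leads to the answer. Also useful for debugging
Self-consistency: sample multiple times, then choose the final answer by majority vote (can be parallelized)
Context rot: the ability to retrieve information degrades as the context grows longer (needle-in-a-haystack experiments)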
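The self-consistency item above amounts to a majority vote over sampled answers. In the sketch below, `sample_answer` is a hypothetical stand-in for one sampled LLM response reduced to its final answer; here it just replays a fixed list so the example is deterministic.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples):
    """Call the sampler n_samples times and return the most common final answer."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for an LLM: replays canned final answers instead of sampling.
canned_answers = iter(["42", "41", "42", "24", "42"])

def sample_answer():
    return next(canned_answers)

final_answer = self_consistency(sample_answer, n_samples=5)
print(final_answer)
```

Since each sample is independent of the others, the calls could be issued in parallel in a real system.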

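KV Cache 1:25:00

Purpose: store the key and value computations of previous tokens to avoid recomputing them
Principle: at each step, only the current token's query (plus its own key and value, which are appended to the cache) is computed; the keys and values of earlier tokens are read from the cache to perform attention
Link with GQA: Grouped Query Attention reduces the number of K/V heads, which shrinks the cache
Not needed during training: with teacher forcing, the whole sequence is processed at once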
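A minimal single-head sketch of the idea, in plain Python: each decoding step computes the query, key, and value for the new token only, appends the key and value to a cache, and attends over everything cached so far. The `project` function is a hypothetical stand-in for the learned projection matrices, and all names here are introduced for illustration.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(x, scale):
    """Hypothetical stand-in for a learned projection (W_q, W_k, or W_v)."""
    return [scale * x_i for x_i in x]

class KVCache:
    """Stores the keys and values of all tokens processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(x, cache):
    """Compute q, k, v for the new token only; reuse cached k, v for earlier tokens."""
    q, k, v = project(x, 1.0), project(x, 0.5), project(x, 2.0)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token embeddings
cache = KVCache()
for x in tokens:
    output = decode_step(x, cache)

# Reference: recompute every key and value from scratch for the last position.
all_keys = [project(x, 0.5) for x in tokens]
all_values = [project(x, 2.0) for x in tokens]
reference = attend(project(tokens[-1], 1.0), all_keys, all_values)
print(output, reference)
```

Both paths produce the same attention output for the last token; the cached version just avoids recomputing keys and values for the earlier tokens at every step.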

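PagedAttention (vLLM) 1:32:00

Problem: reserving memory up front for the maximum context length wastes memory (internal fragmentation)
Solution: split the KV cache into fixed-size blocks (e.g., 16 tokens) and allocate them dynamically
Effect: less memory fragmentation, so more requests can be served concurrently
Implementation: used in the vLLM inference engine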
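To illustrate the block-based allocation idea only, here is a toy allocator in Python. It is not vLLM's API or implementation: the class `PagedKVAllocator`, its methods, and the bookkeeping are hypothetical, and it tracks block ids rather than real GPU memory. The block size of 16 tokens follows the example above.

```python
BLOCK_SIZE = 16  # tokens per block, as in the example above

class PagedKVAllocator:
    """Toy bookkeeping for a block-based KV cache (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, grabbing a new block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = PagedKVAllocator(num_blocks=4)
for _ in range(17):  # 17 tokens need two blocks of 16
    allocator.append_token("request-1")
print(allocator.block_tables["request-1"], len(allocator.free_blocks))
```

A request holding 17 tokens occupies two blocks instead of a full maximum-context reservation, and the remaining blocks stay available for other requests.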

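Key Insights

MoE is a key technique for achieving "large capacity, small cost." Switch Transformer reached the trillion-parameter scale
Temperature is a key hyperparameter that controls the trade-off between creativity and accuracy
Chain-of-Thought not only improves performance but also makes the model's reasoning process interpretable
The combination of KV cache + GQA + PagedAttention is central to efficient inference in modern LLMs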