Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 3 - Transformers & Large Language Models
Cool. Hello everyone and uh welcome to
lecture 3 of CME 295.
Um so today is a very exciting day
because we're going to finally introduce
large language models. Uh but I guess
before we go into that I'm just going to
start traditionally with uh some
announcements. Um so some of you wanted
to have the slides before the class. So
just a heads up that in case you want to
use the slides to do some annotations
uh they're on the website right now. So
feel free to get them and uh with
Shervin we'll try to just on a regular
basis every Thursday evening have them
published on the website so that you can
download them and annotate.
Cool. So with that, let's start. And as
usual, we're going to recap uh last
week's episodes.
So if you remember, lecture one and
lecture two were all about introducing
the concept of self-attention and
linking it to the Transformer
architecture. And what we did last
lecture was look at the types of models
that are out there and how they are all
based on the Transformer.
So there are three categories, three
main categories of models. The first
one that we saw was the encoder-decoder
model, which basically relies on the
full Transformer. It has the encoder of
the Transformer and the decoder of the
Transformer, and typically the tasks
there take text as input and produce
text as output. So text in, text out.
So we saw that one example was T5 and
all the variations.
The second type of model is where we
remove the decoder from the Transformer
and we obtain an encoder-only model.
So we went deeper into BERT, which is
the typical encoder-only model, and we
also saw that BERT has this nice
property that its encoded embeddings
are very meaningful and expressive of
the inputs. We saw the examples of
classification and sentiment
extraction, where what we did was take
the encoded embedding of the CLS token.
So uh I guess in real life BERT is used
to encode documents to encode sentences
and we're going to see later in the
class how these models are useful.
And then last but not least, we have the
third category of models which is
decoder-only. So we only keep the
decoder part of the Transformer. And
here we also make a small modification:
we remove the cross-attention because
we don't need it anymore, since we don't
have an encoder. And these kinds of
models are text in, text out.
And so GPT is a very good example of
such models, and actually most models
these days are decoder-only.
So these are the three main kinds of
models that you can see out there.
So far so good for everyone.
Cool. So with that I'm going to
introduce the term LLM. So, LLM stands
for large language model. So, what is a
large language model? So, first of all,
a large language model is a language
model. So, a language model is a model
that assigns probability to sequences of
tokens.
So, in this case, our model always
predicts the probability of the next
token. So, it's in that sense a language
model.
But also a large language model is
large. So why is it large? So we're
going to see that these models they are
actually scaled up in terms of size. So
first of all in terms of model size. So
these days it's not uncommon to see
models on the order of hundreds of
billions of parameters.
But yeah, typically when we say LLM, we
say at least on the order of a billion.
These models, they've also been trained
on a huge amount of data. And here by
amount of data, we quantify that by the
number of tokens that they were
pre-trained with. And this is on the
order of magnitude of hundreds of
billions of tokens or even trillions of
tokens.
I think the biggest ones are on the
order of tens of trillions of tokens.
So these are huge training sets.
And they're also large because they need
a lot of compute. Typically you need
a bunch of GPUs to make them work,
although these days there have been a
lot of optimizations to have them work
on consumer-grade GPUs. So we're going
to see that later.
But an LLM is large according to these
categories.
Another thing that I want to point out
is that all this terminology is
relatively new. I remember in 2018 or
2019 there was no real definition of an
LLM. I think hardly anyone talked about
LLMs in the beginning, and when people
did talk about LLMs, they included BERT,
but BERT is an encoder-only model that
does not produce text. So with the
current definition of an LLM, which by
now has been pretty well established,
BERT would not be an LLM because it
doesn't produce text. So here we only
consider language models that do
text-to-text and that are very large in
model size, in the amount of data they
have been trained on, and in compute.
Cool. And as we saw before, these models
are decoder-only. So here what we do is
we remove the encoder. We only keep the
masked self-attention, the feed-forward
neural network, and then the addition
and normalization.
So we only keep this, and this is the
backbone of LLMs.
I mentioned GPT as a good example, but
it's not just that; you have plenty of
other models. You may have heard of
Llama from Meta, Gemma from Google,
DeepSeek, Mistral, Qwen, and so on. The
list is long.
I would say roughly more than 90% of
modern-day LLMs are decoder-only. So I
think that's something to keep in mind.
Cool.
Okay, so now you know what LLMs are
made of, but there is something else
that people these days have also
introduced to these models, and we're
going to see that in a bit.
So I mentioned that these models are
huge in size, again typically hundreds
of billions of parameters, so it takes a
lot of compute to run even one
inference, or to train these models.
But you may wonder, do you really need
to have all these parameters be
activated during a forward pass to make
a simple prediction?
So I'm going to use a little metaphor.
Let's suppose you enter a room, and in
the room there is a mathematician,
a physicist, a chemist and a historian.
So you come into this class and there's
a bunch of people who are experts in
what they do.
And you have a question, a math
question.
So the question I have for you is: whom
would you ask?
Would you ask the mathematician? Would
you ask the chemist? Would you ask
everyone?
Well, right now we ask everyone. We ask
all parameters of the model to be
involved in the computation of uh you
know the generation.
And so the idea here is
given an input
maybe it's not necessary to ask everyone
to be involved in the computation. So
the idea is let's actually just have a
subset of the model be involved in the
computation of the next token.
So I'm just introducing this idea of
experts. So let's suppose we're
introducing uh the following notation.
So let's suppose we have n experts. So
think of it as your mathematician, your
chemist, historian, like whatever. So
these are your experts and the idea is
given an input X,
you're going to ask yourself
who should be involved in the generation
of the output.
So you're going to have, let's say,
another network. We're going to call it
G, like gate, but it's also sometimes
called a router.
So let's suppose we have some gate that
tells us which expert should be involved
in the inference.
So if we have that so let's suppose here
the gate tells us okay so actually
expert number two is well suited to
answer your question. So here the idea
is that the input is just going to flow
into that expert
but not the other experts.
So this has a name. It's called mixture
of experts, and it's denoted MoE.
Everyone talks about MoE, so these are
mixtures of experts. And the formula
that you will see a lot is this one: the
output, denoted ŷ, is the sum over the
experts of each expert output E_i(x),
weighted by the gate output G(x)_i,
which tells you how important the
output of each expert is:
ŷ = Σ_i G(x)_i · E_i(x), for i = 1..n.
Yeah,
Great question. So the question is: how
do you train G, and how do you train E?
Typically you train them jointly.
We're going to see that a little later,
but you can think of it as just training
as usual. You do your forward pass, you
compute the loss, and you backprop.
And it's actually an interesting
question because there are some
challenges that come with training MoEs
that we're going to see in a second.
Yep.
The question is: what are these E, what
is the architecture of these E? Let's
suppose for now that they're just some
network. We're not specifying them yet,
but we're going to see this in a second.
Cool.
Okay, so I asked: what if we don't
activate everyone? What if we activate
only a subset? But this formula
actually assumes that we're considering
all expert outputs. So I just want to
distinguish two kinds of MoE.
There's one kind that is called a
dense MoE.
A dense MoE does not have any
constraint on the number of experts
that are involved.
So these weights can be anywhere
between zero and one. Think of them as
a probability distribution that just
puts more weight on some experts
compared to others.
So back to the example that I had.
Let's suppose I have a math question.
I'm going to ask the mathematician,
the chemist, the historian, and I'm
probably going to give a higher weight
to what the mathematician says compared
to, let's say, the historian. So this is
the idea.
But the interesting case is when we
constrain the number of experts that
are activated, because as we mentioned
previously, what we're interested in is
to not involve everyone, so that we save
on the amount of compute that we do. So
there's a second kind of MoE that's
called the sparse MoE, and what it does
is select only the "top K" experts.
K can be equal to one, so one expert,
or two. It's a hyperparameter that you
choose. And so here the expression of
the output becomes the sum over only the
chosen experts i of G(x)_i times E_i(x).
So far so good.
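To make the two formulas concrete, here is a minimal sketch in plain Python of a dense MoE next to a sparse top-K MoE for a single input. The toy scalar "experts", the hard-coded gate logits, and the renormalization of the gate weights over the chosen experts are all assumptions made for illustration; the lecture does not specify them.

```python
import math

def softmax(logits):
    """Turn raw gate scores into weights that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_moe(x, experts, gate_logits):
    """y_hat = sum_i G(x)_i * E_i(x): every expert is evaluated."""
    weights = softmax(gate_logits)
    return sum(w * expert(x) for w, expert in zip(weights, experts))

def sparse_moe(x, experts, gate_logits, k=1):
    """Only the top-k experts (by gate weight) are evaluated."""
    weights = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:k]
    # Assumption: renormalize the gate weights over the chosen experts.
    norm = sum(weights[i] for i in top)
    return sum(weights[i] / norm * experts[i](x) for i in top)

# Hypothetical toy experts standing in for real networks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
# In a real model G(x) would compute these from x; fixed here for clarity.
gate_logits = [0.1, 2.0, -1.0]

y_dense = dense_moe(3.0, experts, gate_logits)
y_sparse = sparse_moe(3.0, experts, gate_logits, k=1)
```

With K = 1 only the second expert runs, so the other two expert calls are skipped entirely, which is where the compute saving comes from.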
Cool. Uh and of course there's a lot
more to it. So in case you are
interested in learning more uh feel free
to go into the resources that are at the
bottom of the slide. One thing that I
will say is that we have a unit of
measure for the amount of compute that
these models require for each pass. So
you will see the term FLOPs. Have you
seen the term FLOPs out there?
No, not really? So it stands for
floating-point operations.
It quantifies how many operations,
think of additions and multiplications,
are involved in, let's say, a forward
pass, and it basically quantifies how
compute-heavy your task is.
So typically what we say is that when we
use a sparse MoE as opposed to a dense
MoE, we have a lower amount of FLOPs.
So this is the unit of measure that you
will see.
But back to your question: what are
these experts?
If you remember, 10 minutes ago we said
that LLMs are decoder-only models.
So I have a question for you. Let's
suppose we wanted to put some MoE in
our LLM.
Where would we put it?
Here I guess we have three choices.
We have the masked self-attention
layer, we have the feed-forward neural
network, and then we have this
normalization. So where would you put
it?
Where do you think is the most complex
part of the network? Where are there a
lot of operations?
Feed-forward? Yes. Great answer. It's
indeed the feed-forward neural network,
and the reason for that is something I
think Shervin mentioned in lecture one.
If you remember, the feed-forward
neural network takes as input your
d_model-dimensional vector, projects it
into a d_ff-dimensional space, and then
projects it back to the
d_model-dimensional space. And d_ff is
typically larger than d_model, the
dimension of the input.
So the number of parameters that you
have in that feed-forward neural
network is on the order of
d_model times d_ff times 2, plus some
biases.
And the attention layer, if you
remember, is basically composed of the
projection matrices.
What is the dimension of the projection
matrices? It's d_model times the
dimension of the keys, of the queries,
or of the values.
And this dimension is typically much
lower.
So think of it as in the hundreds.
Your d_model is typically O(100) to
O(1,000),
and the projection dimension d_ff is
O(1,000) to O(10,000).
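As a rough check of this comparison, the parameter counts can be sketched as follows. The sizes used (d_model = 4096, 32 heads, d_ff = 4 × d_model) are illustrative assumptions, not numbers from the lecture, and the attention count includes the output projection with biases left out.

```python
def ffn_params(d_model, d_ff):
    """Two linear layers with biases: d_model -> d_ff -> d_model."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def attention_params(d_model, n_heads, d_head):
    """W_Q, W_K, W_V for every head plus the output projection W_O (no biases)."""
    qkv = 3 * n_heads * d_model * d_head
    out = n_heads * d_head * d_model
    return qkv + out

# Assumed, illustrative sizes.
d_model, n_heads = 4096, 32
d_head = d_model // n_heads   # per-head dimension, in the hundreds
d_ff = 4 * d_model            # a common choice, assumed here

ffn = ffn_params(d_model, d_ff)
attn = attention_params(d_model, n_heads, d_head)
print(f"FFN: {ffn:,}  attention: {attn:,}  ratio: {ffn / attn:.2f}")
```

Under these assumptions the feed-forward block holds about twice as many parameters as all attention projections of the same layer combined, which supports it being the natural place for experts.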
Cool. So is everyone now convinced that
this is a good place to put the
mixture of experts?
Yeah.
Cool. So this is actually how it's done.
In modern-day LLMs, this idea of not
involving everyone in the computation
of the next-token prediction is
implemented by putting the mixture of
experts where the FFN is, which is
basically here.
And typically you would have a sparse
mixture of experts. So back to your
question: these experts are
feed-forward neural networks. You would
have several of them that you can train,
but you would only activate one.
Typically K would be equal to one. It
can also be equal to two, but it would
only be a subset. That's my point. And
this routing would be done at the token
level.
So if you remember, the decoder takes a
bunch of tokens as input,
and what I'm saying is that each token
will be processed by an expert that may
be different from the one used for
another token.
So the router here would take the
representation of the token as input
and figure out which expert this token
should flow towards.
Does this idea make sense?
I have a little illustration later on
that hopefully will help.
So now back to your question about how
you train this model: do you train the
router separately, do you train the
experts separately? One challenge that
people have when they train these
MoE-based models is to make sure that
all experts are being used,
because it's very possible that you
train your model and somehow only one
or two experts always get activated
and the other ones are always inactive.
They're never involved in the
computation.
This problem is called routing
collapse. Why is it called routing
collapse? Because the router always
chooses some experts but not others.
So this is a challenge, and the way
people try to mitigate it is by
changing the loss function and adding
an extra term to it, which is written
here.
It's basically some hyperparameter
alpha
times the number of experts N
times the sum over all experts i of
f_i times P_i, quantities that depend
on whether or not tokens went to
expert i.
It's not super important that you
understand exactly how that formula
works. The only thing that I think you
should take away from this slide is
that this extra loss pushes these
quantities towards uniform
distributions. So what are these
quantities, just as a reminder? f_i is
the fraction of tokens that are routed
to expert i,
and P_i is the average routing
probability for expert i.
So when I say that all experts should
be used about equally, what I'm saying
is that I want this probability to be
roughly uniform across experts.
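Here is a small sketch of how this auxiliary term could be computed for one mini-batch, assuming top-1 routing and the form alpha × N × Σ_i f_i × P_i described above. The value alpha = 0.01 and the two toy batches of gate probabilities are made up for illustration.

```python
def load_balancing_loss(gate_probs, alpha=0.01):
    """gate_probs: one list of N routing probabilities per token (top-1 routing)."""
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])

    # f_i: fraction of tokens whose top-1 expert is i.
    counts = [0] * n_experts
    for probs in gate_probs:
        counts[max(range(n_experts), key=lambda i: probs[i])] += 1
    f = [c / n_tokens for c in counts]

    # P_i: average routing probability assigned to expert i.
    p = [sum(probs[i] for probs in gate_probs) / n_tokens for i in range(n_experts)]

    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Hypothetical gate outputs for 4 tokens and 2 experts.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
collapsed = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.8, 0.2]]

print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

The collapsed batch, where every token goes to the first expert, gets the larger penalty, so minimizing this term nudges the router towards spreading tokens across experts.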
Yeah.
Yeah. So the question is: when do you
compute these quantities? You can think
of it as the regular training process.
You take a mini-batch, you pass it
through the model, you compute all
these quantities, and then you do your
backpropagation based on that. What I
want you to remember is that this
incentivizes the choice of the router
to be more uniform across experts,
which is something that mitigates this
routing collapse phenomenon.
Yeah.
>> Yeah. So the question is: can we use
dropout? Of course you can always
combine that with other techniques.
People have just found this one to be
very helpful. Speaking of other
techniques, there is something that I've
not talked about which is very similar
to the dropout idea. It's called noisy
gating.
Noisy gating is basically: you have
your predictions from the gate, but then
you add some noise to them.
So by pure chance, it just allows other
experts to be involved in the
computation. So it's another technique.
There's a bunch of techniques, but yes,
dropout is indeed quite useful for
things like overfitting, and the idea
can be reused in different settings.
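A very simplified sketch of the noisy gating idea: add random noise to the gate scores before picking the top expert. The fixed Gaussian noise with standard deviation 1.0 is an assumption for illustration; published variants typically learn the noise scale, which is not shown here.

```python
import random

def noisy_top1(gate_logits, noise_std=1.0, rng=random):
    """Pick the top-1 expert after adding Gaussian noise to the gate scores."""
    noisy = [v + rng.gauss(0.0, noise_std) for v in gate_logits]
    return max(range(len(noisy)), key=lambda i: noisy[i])

rng = random.Random(0)       # seeded so the experiment is repeatable
logits = [1.0, 0.5, 0.0]     # without noise, expert 0 would always win

picks = [noisy_top1(logits, noise_std=1.0, rng=rng) for _ in range(1000)]
usage = [picks.count(i) / len(picks) for i in range(len(logits))]
print(usage)
```

With the noise switched off (noise_std = 0.0) the first expert is always chosen; with noise, the other experts also receive some tokens purely by chance.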
Yep.
Yep.
So the question is, you mean, is it
differentiable?
So how do you take the derivative, is
that your question?
Can you explain a bit more what your
concern is?
>> Mhm. So to this question: the average
routing probability P_i is a function
of the gate output, right?
So I guess your question is about f_i.
Okay, I don't have a good answer off the
top of my head, but I think people have
some techniques, and these days you
don't even have to do this by hand. You
have built-in tooling. So maybe I can
follow up with you on f_i, but for P_i,
do you see that this one is quite clean?
It's just the average of the
probabilities from the gate.
Yeah. So the output probability from
the gate, you can think of it as just
the input vector being projected onto a
space of dimension N, where N
corresponds to your number of experts,
and then passed through a softmax. So
your output sums up to one,
and each of these dimensions represents
the weight given to the corresponding
expert: the first dimension is for
expert one, the second dimension for
expert two, and so on. So you just take
the average of this, and this one you
can express from all the parameters. So
I think you should be fine with that
one. Yeah.
So the question is: if we increase the
number of experts, does it increase the
number of model parameters? It's a
great question. It's actually one of the
ideas behind MoE-based models, which is
that you can scale the model without
having to incur the cost of
significantly more compute at inference
time.
So you can increase what people call
the capacity of your model, but you
still control the amount of active
parameters, and active parameters are
the parameters that are used for a
forward pass. So yes, it will increase
the number of parameters.
That's why you see some MoE-based
models that are even bigger than the
ones that we had on the order of
hundreds of billions. We even have some
on the order of trillions of
parameters. For instance, one reading I
recommend is Switch Transformer,
which scaled up to more than a trillion
parameters. So yes, definitely more.
That being said, if you read the paper,
you will also see that these models are
more what they call sample efficient.
They take less time to become as good
as what the model would have been with
a lower number of parameters. If you
draw the training curve as a function
of the training time, you see that these
models are typically more sample
efficient.
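The difference between total and active parameters mentioned in this answer can be sketched with a quick count for one MoE feed-forward block. The sizes (d_model = 4096, d_ff = 16384, 8 experts, K = 2) are assumptions chosen for illustration, and biases are ignored.

```python
def moe_ffn_params(d_model, d_ff, n_experts, k):
    """Return (total, active) parameter counts for one MoE feed-forward block."""
    per_expert = 2 * d_model * d_ff   # up- and down-projection, biases ignored
    router = d_model * n_experts      # linear gate: d_model -> n_experts
    total = n_experts * per_expert + router
    active = k * per_expert + router  # only k experts run per token
    return total, active

# Assumed, illustrative configuration.
total, active = moe_ffn_params(d_model=4096, d_ff=16384, n_experts=8, k=2)
print(f"total: {total:,}  active per token: {active:,}")
```

Adding experts grows the total count linearly, while the active count per token depends only on K, which is how capacity can grow without a matching increase in per-token compute.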
Sorry.
>> Yeah. Yeah. Exactly. Everything here is
a trade-off. Everything here is a
trade-off.
>> Yeah. Cool. Yeah.
So the question is whether each
attention head will have its own set of
experts. It's actually independent of
the attention heads. You can think of
the attention heads as something
separate, and the number of experts is
independent of that.
Does that make sense?
Yeah.
Right. Right.
All right. The question is whether every
block will have its own set of experts.
The answer is yes. And typically those
weights are not shared.
Actually, we're going to see an
example. It can very well be that in
layer one expert number three was
chosen, but in layer two it is expert
number one. It's all free. It's
trainable.
So the question is who decides which
expert a token goes to. All of that is
decided by the gate, which is this
quantity.
Everything is uh decided by the gates
which has trainable weights. So you can
think of this as just some projection
from the input x to an n dimensional
space where n is the number of experts.
So the question is: at what point during
inference is it decided? Let's suppose
we're at inference time. I'm going to
walk you through how it works.
So you have your x.
You have this attention mechanism, so x
interacts with all tokens from the
past, given that it's decoder-only, so
the attention is masked. At the
beginning of the feed-forward neural
network block, the token is of course
contextual: it has the information from
these other tokens because it has
attended to them. And then it goes here.
So x first goes into G.
G computes this probability
distribution over all experts, and given
that we are in a sparse MoE setting,
we will only choose the top K. Let's
suppose it's the top one, just the
highest probability,
so you know which expert this one will
be, and as a result you will only
compute the output value of the expert
that was chosen for this input.
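The walkthrough above can be sketched as a token-level top-1 routing loop. Everything concrete here is an assumption for illustration: the gate is a single linear projection followed by a softmax, the two toy "experts" stand in for feed-forward networks, and the expert output is scaled by its gate probability.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def moe_ffn_top1(tokens, W_gate, experts):
    """Route each (already contextualized) token vector to its top-1 expert."""
    outputs, chosen = [], []
    for x in tokens:                         # routing happens per token
        probs = softmax(matvec(W_gate, x))   # G(x): distribution over experts
        i = max(range(len(experts)), key=lambda j: probs[j])
        y = experts[i](x)                    # only the chosen expert is computed
        outputs.append([probs[i] * v for v in y])
        chosen.append(i)
    return outputs, chosen

# Hypothetical toy experts and gate weights (n_experts x d_model).
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
W_gate = [[2.0, 0.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]

outputs, chosen = moe_ffn_top1(tokens, W_gate, experts)
print(chosen)
```

Different tokens in the same sequence end up at different experts, and each layer would hold its own gate weights and its own experts.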
Mhm. So can you elaborate on that
actually?
Yeah.
At what point exactly? So it's after the
self attention layer.
Yeah.
Uh so the question is do we have
different classification for different
heads? No. So there's only one router.
So uh I think I I understand your
question. So your question is what do
you do given that you have different
attention computations going on in
parallel with the heads. So if you
remember the attention layer has these
different heads but at the end of it
what it does is it concatenates all the
results from each of these heads and
then projects it once again in the D
model space.
Yep.
Yep.
Yep.
Yes. So the question is: do we have
different G's? The only thing I can
tell you is that G is layer specific
and trainable. So it's basically going
to learn how to process all these
inputs. There is going to be one G for,
let's say, the first layer, another G
for the second layer, and so on. And
each one is going to be trained.
Cool.
Great. Looking at the time. Do we have
any other questions here?
We're good.
Perfect.
So now I just wanted to show you a cool
thing that I believe the Mistral team
was showing in one of their papers. What
they were showing here was, for a given
piece of text, to which expert each
token was routed.
As we noted before, experts are
different from one layer to another. I
believe here it's for layer zero, so for
one given layer, and we do see that
these tokens roughly use the experts in
a uniform way, more or less.
What you would not want to see is to
have every token be the same color, but
luckily that is not the case.
So that's one cool way of representing
how the routing is done: you take your
input text and you show to which expert
each token went.
Cool. Okay.
So what we just saw was
one way that modern-day LLMs
change their architecture to incorporate
the fact that we may want to scale the
model but not increase the computational
complexity of one forward pass, and we
saw that with MoE.
So you will see a lot of MoE-based LLMs
out there. And now,
knowing that we have an LLM, we're going
to focus on
how a response is being generated.
So remember when I told you that what
these modern-day LLMs do is take some
text in and put some text out. It's
typically this task of next-token
prediction. You have a token in, let's
say beginning of sentence, you go
through your LLM and it generates the
next word, or the next token, so "a".
Then you take "a" and it goes to
"teddy", and then "teddy", "bear",
"is", etc.
But so far we have never really dug into
exactly how we choose the next token.
So what we're going to do right now is
see exactly how we generate the next
token.
So as you know, here our LLM is just a
decoder-only architecture. So what you
have is a decoder with your input here
and then your output there.
Let's suppose for a second that we know
everything that's happening in the
middle and we just obtain output
probabilities that are going to look a
little bit like this.
So given a token or some sequence of
tokens as input, you have an output
probability distribution that represents
what the model thinks is the likelihood
that the next token is, let's say, "a",
"airplane", "fluffy", etc.
So this is what you have.
So now my question to you is
if we told you we have some sequence as
inputs and we want to choose the next
token
and if I told you that our model is
giving out actually a probability
distribution.
I guess how would you choose the next
token based on this?
Sorry.
Great. The token with maximum
probability
there. Okay. Great. So yeah, the first
idea, let's just take the token with
highest probability.
So it's a very natural approach, but
maybe you've been using things like
ChatGPT or Gemini: every time you ask
something, it responds with something
that is slightly different, right? If
you always choose the token with the
highest probability, given that the
computation here, as we're going to
see, is all deterministic, you're always
going to generate the same thing for
the same input. So that's one
limitation: it's not very diverse.
The second problem is that
if you choose the highest-probability
token on an iterative basis,
you're locally optimal, but you're not
necessarily globally optimal.
So what does that mean? If you think
about it, our objective is to produce
an output sequence of tokens that has a
high probability.
But the problem is that if you always
choose the highest-probability token,
you will not necessarily obtain the
highest-probability sequence.
Are you convinced of this statement, by
the way? Let me give you an example.
Let's suppose that for the next token,
one candidate has probability 0.8 and
the other one 0.2, since they have to
sum to one, and you choose to go with
the 0.8. Now suppose that along the
sequence that starts with the 0.8, all
the following token probabilities are
very low.
Then you will have an output sequence
with a lower probability than, let's
say, the other path, which we suppose
has higher-probability predictions in
the later steps.
Right? We're going to see this in a
second, but this is the idea. Choosing
the highest predicted probability is a
good first idea, but it's locally
optimal and not necessarily globally
optimal.
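The 0.8 versus 0.2 example can be worked through in a few lines. The two-step toy distributions below are invented to match the spirit of the example; only the 0.8 and 0.2 come from the lecture.

```python
# Toy two-step model: P(first token), then P(second token | first token).
# All numbers besides 0.8 / 0.2 are assumptions for illustration.
first = {"the": 0.8, "a": 0.2}
second = {
    "the": {"cat": 0.2, "dog": 0.2, "car": 0.2, "sky": 0.2, "sea": 0.2},
    "a": {"teddy": 0.9, "cat": 0.1},
}

def greedy(first, second):
    """Pick the most probable token at each step."""
    t1 = max(first, key=first.get)
    t2 = max(second[t1], key=second[t1].get)
    return (t1, t2), first[t1] * second[t1][t2]

def best_sequence(first, second):
    """Exhaustively score every two-token sequence."""
    candidates = {
        (t1, t2): p1 * p2
        for t1, p1 in first.items()
        for t2, p2 in second[t1].items()
    }
    seq = max(candidates, key=candidates.get)
    return seq, candidates[seq]

greedy_seq, greedy_p = greedy(first, second)
best_seq, best_p = best_sequence(first, second)
print(greedy_seq, greedy_p)
print(best_seq, best_p)
```

Greedy commits to "the" because 0.8 beats 0.2, but every continuation after "the" is weak (0.8 × 0.2), while the path through "a" ends up with a higher overall probability (0.2 × 0.9).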
And this is the reason why we have a
second method,
which is about keeping track of the K
most probable paths.
I'm not sure if you've heard of beam
search, but that's what beam search
does. Here K is sometimes called the
beam size or the beam width. If you hear
these terms, these are just names given
to the number of paths that we keep
track of. And it works as follows.
So let's suppose we start our generation
with the beginning-of-sentence token.
We want to figure out what the next
token is. Let's suppose that in this
very basic example we have three
tokens,
and the two most probable tokens are
"a" and "the".
If we have K equal to two, what we're
going to do is keep track of these two
branches.
So that's the first iteration.
In the second iteration, we're going to
look at all the next-token
probabilities for these two tokens,
and we're always going to save the two
most probable paths. Here, for
instance, let's suppose they are "the"
followed by "fluffy" and "a" followed
by "cute".
So back to what I was saying earlier:
if you were only following the
highest-probability token, like "the",
it's very much possible that the
highest-probability token after "the"
would have a much lower probability
than the one after "a".
And so this is what beam search tries to
do: it tries to reach a more globally
optimal solution. Let's suppose we
continue that, and at the end of the
day we obtain K candidate sequences,
and then we pick the sequence with the
highest probability.
What people typically do is take the
sum of the logarithms of the token
probabilities. They say that the log
probability of the sequence is the sum
of the log probabilities of each
next-token prediction. So it's the log
probability of "a" given BOS, plus the
log probability of "cute" given BOS and
"a", and so on and so forth.
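A compact sketch of beam search that scores each path by the sum of its log probabilities, as just described. The lookup-table "model" (`TABLE`, `toy_model`) and its probabilities are hypothetical stand-ins for an LLM, and no length correction is applied in this version.

```python
import math

# Hypothetical next-token table keyed by the last token only.
TABLE = {
    "<bos>": {"the": 0.8, "a": 0.2},
    "the": {"cat": 0.2, "dog": 0.2, "car": 0.2, "sky": 0.2, "sea": 0.2},
    "a": {"teddy": 0.9, "cat": 0.1},
}

def toy_model(tokens):
    """Return P(next token | tokens); unknown contexts end the sequence."""
    return TABLE.get(tokens[-1], {"<eos>": 1.0})

def beam_search(next_token_probs, k, max_len, bos="<bos>", eos="<eos>"):
    beams = [([bos], 0.0)]  # (tokens so far, sum of log probabilities)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:            # finished paths are carried over
                candidates.append((tokens, score))
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the k most probable paths.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

best_k1 = beam_search(toy_model, k=1, max_len=2)[0]  # K = 1 is greedy decoding
best_k2 = beam_search(toy_model, k=2, max_len=2)[0]
print(best_k1)
print(best_k2)
```

With K = 1 the search behaves like greedy decoding and stays on the "the" branch; with K = 2 the "a" branch survives the first step and wins at the second.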
But I just want to point out one
limitation of this approach, which is
that the more tokens you generate,
the lower your final sequence
probability will be,
because all these probabilities are
between zero and one. Think of it in
the multiplied sense: the probability of
the whole sequence is just the product
of the next-token probabilities.
The more factors less than one you
multiply,
the more this quantity will go towards
zero.
So this method as is will favor
sequences that are shorter,
and for that reason beam search has an
additional term that counteracts that
effect: something like one over the
number of tokens to the power of some
exponent. So in practice there are
techniques to make sure that things
work relatively well.
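One common way to write this correction, sketched here under assumptions, is to divide the summed log probability by the sequence length raised to some power. The exponent 0.7 and the two example sequences are arbitrary choices for illustration; the lecture only says "to the power of something".

```python
import math

def sequence_log_prob(token_probs):
    """Sum of log probabilities of each next-token prediction."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs, alpha=0.7):
    """Divide by (number of tokens) ** alpha to offset the bias to short outputs."""
    return sequence_log_prob(token_probs) / (len(token_probs) ** alpha)

# Hypothetical per-token probabilities for two candidate sequences.
short_seq = [0.5, 0.5]   # 2 tokens
long_seq = [0.6] * 4     # 4 tokens, each one individually more likely

print(sequence_log_prob(short_seq), sequence_log_prob(long_seq))
print(length_normalized_score(short_seq), length_normalized_score(long_seq))
```

On the raw log probability the shorter sequence wins simply because it has fewer factors; after normalization the longer sequence, whose individual tokens are more likely, scores higher.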
But okay, let's suppose we figure out
all these things. The problem is that we
need to keep track of these most
probable paths. We need to do all this
bookkeeping, and it just requires a lot
of computation.
And the other thing is that we're still
interested in the most probable path,
which will lead to a sequence that the
model thinks is very likely,
but sometimes what you want is for your
output to be more diverse or more
creative.
So that's why beam search is actually
not something that people typically use.
People use beam search for things like
machine translation, where you actually
need something that is very likely, but
in general people use a third method,
and this method is called the sampling
method. I told you we have a
probability distribution over tokens
regarding what the next token should be.
And so what people do is they just
sample the next token
using that probability distribution.
Does this make sense?
So in this example, "fluffy", "gentle",
"kind", and let's say "smart" will have
a higher probability of being drawn, as
opposed to, let's say, "airplane" and
"wear", which have a lower probability
of occurring. It is a nonzero
probability, so they can still occur.
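Sampling the next token from the output distribution can be sketched with the standard library. The probabilities assigned to each word below are made up for illustration; only the words themselves come from the example.

```python
import random

def sample_next_token(probs, rng=random):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution.
probs = {
    "fluffy": 0.35, "gentle": 0.25, "kind": 0.20,
    "smart": 0.15, "airplane": 0.03, "wear": 0.02,
}

rng = random.Random(0)  # seeded so the run is repeatable
draws = [sample_next_token(probs, rng) for _ in range(10_000)]
freq = {t: draws.count(t) / len(draws) for t in probs}
print(freq)
```

Over many draws the empirical frequencies track the model's probabilities: likely tokens dominate, while unlikely ones still show up occasionally.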
Cool.
Any questions so far?
Yep.
Right. So the question is: it's not
for training, it's for inference? Yes,
correct. You can think of this as
response generation. Let's suppose you
have your model that is trained. What
you want is to generate an output. So
what would you do?
Yeah.
Just to complete my answer: during
training, what you would do is take the
output probabilities and compare them
with the actual label, which is most of
the time a hard label. Whereas here,
let's suppose you have an LLM that is
trained. How would you generate a
response?
Cool.
Everyone good?
Yeah.
Yep.
Yeah. So the question is: how do you do
sampling in this situation? That's
actually my next slides, but I just
wanted to make sure everyone was on the
same page regarding the intuition.
So picking the highest probability,
which is called greedy decoding, is
probably not something that we want.
Beam search is a little bit better. It's
closer to something that is globally
optimal. It's not globally optimal, but
it goes more towards that. So it's
better, but it lacks diversity. It lacks
creativity, which is why what we want to
do is actually sample each token.
Um, and I guess we have a few methods
that um also restrict
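Key Summary
Covers the definition of a Large Language Model (LLM) and the decoder-only architecture, scaling through Mixture of Experts (MoE), controlling output with temperature, various prompting techniques, and efficient inference through KV cache and PagedAttention.
Main Concepts
Definition of an LLM 3:30
- Language model: a model that assigns probabilities to token sequences
- Meaning of "large": model size (tens to hundreds of billions of parameters), training data (trillions of tokens), compute resources
- Modern LLMs: more than 90% use a decoder-only architecture. GPT, Llama, Gemma, DeepSeek, Mistral, Qwen, etc.
- BERT does not generate text, so it is not an LLM under the modern definition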
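Mixture of Experts (MoE) 7:30
- Motivation: do all parameters need to be activated every time? Do you need a historian for a math question?
- Structure: n expert networks + a router (gate) that selects the suitable experts based on the input
- Dense vs. sparse MoE: dense takes a weighted sum over all experts; sparse activates only the top-K experts (usually K = 1 to 2)
- Expert placement: applied at the FFN layer (the part with the most parameters, d_model × d_ff × 2)
- Token-level routing: experts are selected independently for each token
Challenges of MoE Training 18:00
- Routing collapse: the router keeps selecting only certain experts
- Load balancing loss: minimizing the sum of f(i) × P(i) encourages even expert usage
- f(i): fraction of tokens routed to expert i
- P(i): average routing probability of expert i
- Benefit: increases parameters while limiting active parameters, reducing inference cost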
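Temperature and Sampling 48:00
- Softmax with temperature: P(i) = exp(x_i/T) / Σ_j exp(x_j/T)
- Low temperature (→0): spiky distribution, only the highest-probability token gets selected (deterministic)
- High temperature (→∞): approaches a uniform distribution, more diverse and creative output
- Top-K sampling: sample only from the K most probable tokens
- Top-P (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds P
- The only source of non-determinism: everything inside the Transformer is deterministic; only sampling is stochastic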
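The temperature softmax and the two filtering rules listed above can be sketched as follows. The example logits and probabilities are made up, and renormalizing the kept tokens so they sum to one is an assumed (though common) convention.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P(i) = exp(x_i / T) / sum_j exp(x_j / T)."""
    scaled = [v / T for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
spiky = softmax_with_temperature(logits, T=0.1)
flat = softmax_with_temperature(logits, T=100.0)
print(spiky, flat)
print(top_k_filter([0.5, 0.3, 0.2], k=2), top_p_filter([0.5, 0.3, 0.2], p=0.7))
```

A low temperature concentrates almost all mass on the top token, a high temperature flattens the distribution, and the filters zero out the tail before sampling.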
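Prompting Techniques 1:15:00
- Zero-shot: perform the task from a task description alone, without examples
- Few-shot (in-context learning): include input-output examples in the context
- Chain-of-Thought (CoT): prompt the model to also generate the reasoning that leads to the answer. Also useful for debugging
- Self-consistency: sample multiple times, then choose the final answer by majority vote (can be parallelized)
- Context rot: information retrieval ability degrades as the context gets longer (needle-in-a-haystack experiment)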
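KV Cache 1:25:00
- Purpose: store the Key and Value computations of previous tokens to avoid recomputing them
- Principle: compute the Query (and the new Key/Value entry) for the current token only; the K/V of earlier tokens are read from the cache to perform attention
- Link with GQA: Grouped Query Attention reduces the number of K/V heads, shrinking the cache
- Not needed during training: with teacher forcing, the whole sequence is processed at once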
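The caching principle can be sketched with a single attention head in plain Python. The tiny 2×2 weight matrices and input vectors are arbitrary, and `KVCache` is a hypothetical helper written for this illustration, not the API of any library.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

class KVCache:
    """Stores the keys and values of all tokens processed so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Only the new token's K and V are computed; older ones are reused.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

def last_output_without_cache(xs, Wq, Wk, Wv):
    """Recompute every key and value from scratch for the last token."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

# Arbitrary toy weights and token vectors.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
for x in xs:
    cached_out = cache.step(x, Wq, Wk, Wv)
full_out = last_output_without_cache(xs, Wq, Wk, Wv)
print(cached_out, full_out)
```

Both paths give the same output for the last token, but the cached version does one key and one value projection per step instead of redoing them for the whole prefix.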
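PagedAttention (vLLM) 1:32:00
- Problem: reserving memory up front for the maximum context length causes waste (internal fragmentation)
- Solution: split the KV cache into fixed-size blocks (e.g., 16 tokens) and allocate them dynamically
- Effect: less memory fragmentation, more requests can be served concurrently
- Implementation: used in the vLLM inference engine
Key Insights
- MoE is a key technique for achieving "large capacity, small cost". Switch Transformer reached the trillion-parameter scale
- Temperature is a key hyperparameter for controlling the creativity vs. accuracy trade-off
- Chain-of-Thought not only improves performance but also makes the model's reasoning process interpretable
- The combination of KV cache + GQA + PagedAttention is central to efficient inference in modern LLMs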