Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks
Cool. Hello everyone. Welcome to lecture
two of CME295.
So before we start, I wanted to give you
a heads up about two logistical things.
So the first one is uh with Shervin we
reviewed the recording of lecture one
and we couldn't help but notice that the
audio was suboptimal.
So what we're doing for this lecture is
to have another setup. But then one
issue is my voice will not be you know
propagated in the room. So I guess I
have one question. Can everyone hear
me very well even from the back? Okay
cool. Great. So that's point number one.
Point number two is about the final
exam. So the final exam right now has a
placeholder date for Wednesday, I think
December the 10th. But uh just a heads
up that we're trying to see if there's a
way to move that to earlier in the week.
So we'll let you know when that's
finalized, but for now uh this is still
TBD, but we'll make sure to let you
know.
Cool. So with that aside, let's go into
today's topic. But before we do that, as
always, what we will do is we'll just
quickly recap what we saw in the
previous episodes.
So if you remember, lecture one was all
about introducing the concept of self
attention.
And if you remember what self attention
is is that each token is attending to
all other tokens in the sequence through
this mechanism of attention. So you have
these notations of queries, keys and
values. So here the idea is that the
query is going to ask which other tokens
are most similar to itself by comparing
query and key and then once that's done
uh basically we will be taking the
associated value.
So we saw that the self attention
mechanism can be expressed with this
formula. So softmax of query times
key transpose over square root of d_k, times V.
So I hope this formula is familiar for
you. Um so just know that this formula
is highly optimized. You know these are
like big matrix multiplications that our
hardware is uh very um you know capable
of doing is very optimized in doing
that.
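A minimal NumPy sketch of the formula just described, softmax(Q K^T / sqrt(d_k)) V. The function names, shapes, and random inputs are assumptions made for illustration; they are not from the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarity, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, weights = self_attention(Q, K, V)
```

Everything here is two matrix multiplications and a row-wise softmax, which is why it maps well onto hardware built for large matrix products.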
And all of that to say that we also
introduced the architecture of the
transformer which you can see on the
right.
So uh here if you remember the
transformer is composed of two main
components. So the encoder on the left
side and the decoder on the right side
and uh the transformer was initially
introduced in the context of machine
translation.
So you can think of the left side as
processing the input text in the source
language say English
and the right side is responsible for
decoding the translation in a target
language let's say in French
and this multi head attention layer is
where the self attention mechanism
happens
and um I remember there was a lot of
questions regarding you know um it's
called multi-head attention layer so
there are several heads what does that
correspond to
so in the transformer paper which is the
Attention Is All You Need paper you have this
figure
which actually represents each of these
heads and you can think of each head as
an opportunity for the model to learn
one way of projecting
the input into being a query, a key or a
value.
So just being a bit more clear. So for
instance for the query and the key, so
this is like each head. So the number
of these little boxes is basically the
number of heads.
And to better visualize and understand
what this means, um
I also wanted to show I guess what is
being shown in the paper which is a way
to interpret what each of these heads
do. So we have this concept called
attention map
which basically
um
tries to represent the value of each of
these query-key dot products.
So in this example what we're interested
in is to see which other token is
most similar to the token "its".
And so what we do is we take a look at
the quantities:
the query that is representing "its",
dot product with all the other keys,
and we look at what other keys are
leading to a high value of the product
query times key.
And when you do that basically what you
see or what the paper sees is you have
these two words. So "application" and "Law"
which are highlighted as having, you know,
a high attention weight. So
attention weight here being uh the
dot product of the query of "its" and the key
uh you know for each of these tokens.
And I guess there is also a way to
interpret those. And here you can see
that the tokens that are being kind of
highlighted are "Law" and "application"
which basically makes sense because the
token "its" is referring to "Law".
So basically the model needs to kind of
learn how to associate these words with
what happened before. And it is also
referring to application which is also
another way of kind of explaining why
that is the case. And so here what the
authors chose to do was to show these
values as a function of these different
heads.
So for instance on the left side these
are the kind of intensities for
let's say the first head and then the
second head shows that you know the
intensity is very high for "Law". So
basically long story short these heads
may learn I guess different ways of uh
kind of figuring out which words
matter. Yeah.
>> Great question. So the question is when
we're doing all these computations, are
they going through different uh
different MLPs? The answer is that we're
going to have different projection
matrices for each of them. And so um you
can think of um so we had this like
detailed example that actually kind of
uh went through that where
each head is going to have its own
projection and in parallel you're going
to have that computation that's going to
happen.
So each head is going to have one result
here that is then going to be
concatenated and then projected once
again
with the output matrix.
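As a rough sketch of the answer above: each head has its own query, key, and value projection matrices, the heads are computed independently, and the results are concatenated and projected with the output matrix. The names (`W_q`, `W_k`, `W_v`, `W_o`) and the sizes are assumptions for illustration, and the loop is written for readability; real implementations batch the heads into one matrix multiplication.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # X: (seq_len, d_model)
    # W_q, W_k, W_v: (n_heads, d_model, d_head), one projection per head
    # W_o: (n_heads * d_head, d_model), the output projection
    head_outputs = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(weights @ V)            # (seq_len, d_head)
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, n_heads * d_head)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 6, 16, 4, 4
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))
mha_out = multi_head_attention(X, W_q, W_k, W_v, W_o)
```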
So yeah, long story short, it's highly
parallelized
and uh it's basically just like
projections and like here you have like
some matrix multiplication and softmax.
Does that make sense?
>> Cool.
Any other questions on this?
Cool. So, I guess this is just a way to
illustrate the conversation we had for,
you know, lecture one. I know there were
some questions about
um these uh attention heads, what they
do. And I guess looking at the attention
maps is one way of making sense of what
they mean.
Cool. With that, I highly recommend that
you read the Transformer paper. So,
"Attention Is All You Need". So, it's a
very dense paper. It's just a few pages
long. Uh, but I hope that with what you've
seen in lecture one, you'll be able to
digest the content in a way that will
kind of make sense to you.
Cool. So, with that, we're going to
start the actual meat of what we're
going to discuss today.
So surprisingly
this transformer architecture which was
introduced in 2017
is actually an architecture that has
kind of um you know still stayed
relevant over the years and there are a
few components that have slightly
changed and we're going to see which
ones they are. So there are like some
slight variations
but overall today's models are, as we're
going to see, all more or less based on
the initial transformer architecture.
So we're going to try to divide the
class into two parts. So the first
one is what I'm going to cover which is
what are the parts of the transformer
that are important and that had some
variation. And in the second part,
Shervin is going to talk about um I
guess the nomenclature of today's models
and how they relate to the original
transformer.
Cool.
Okay. So, let's start with the first
important concept that's in this
architecture and this is the position
embedding.
So if you remember here, we're letting
tokens interact with all other tokens in
a direct fashion. So they have direct
links.
But contrary to things like RNNs where
you have a sequential dependency where
you process each token one at a time
here you're basically losing
this idea of a token being processed
before another one. So you kind of lose
this position information.
So as a result of that we need to
somehow
quantify
positions at sorry tokens at each
position and try to inject that
information when the transformer is
processing the inputs.
So how are we going to do that? So the
original transformer paper authors,
they chose to have a dedicated
embedding. And when I say dedicated,
what that means is each position has one
embedding.
So position one has one embedding,
position two has one embedding, etc.,
etc.
And what they chose to do is
to add that embedding to the input token
embedding.
So for instance, if I say "a cute teddy
bear is reading", "a", which is at position
number one, will be represented by the
embedding representing the token "a"
plus the embedding
representing the first position.
Yeah,
>> it's a great question. So the question
is are the position embeddings learned
or static? Both. Both as in the authors
have tried both
and we're going to see what the second
one is. But I guess I guess here let's
suppose that they are learned. So what
does that mean? So that means that
basically you need to learn embeddings
for each position.
And um the problem with this approach is
that you're very much dependent on what
is in your training set.
So for instance um like here if you have
somehow a text that always has something
that is happening at position number two
your learned embeddings will kind of have
that bias kind of learned. So that's
like one limitation of that.
Second limitation is you can only learn
positions up to the max number of
positions that are in your training set.
So let's suppose you train your
transformer on sequences that are up to
let's say I don't know 512 let's say
you can only learn position embeddings
up to that position right
yeah so the question is I guess how do
you parameterize that so I guess what
you do is you have a kind of a
placeholder of a position learnable
position embedding. Let's say between
like position one and 512.
And basically when you do your training,
you're just letting these weights
be learned through the regular, you
know, gradient descent, all these
things.
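A small sketch of the learned-position idea, under stated assumptions: the tables below are random stand-ins for weights that would normally be trained by gradient descent, the sizes and the `embed` helper are made up, and positions are indexed from 0 rather than 1.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 512, 16

# Stand-ins for trainable parameters (random here, learned in practice).
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_len, d_model))  # one row per position

def embed(token_ids):
    # Add the embedding of each position to the embedding of each token.
    if len(token_ids) > max_len:
        # No row was ever learned for positions beyond max_len.
        raise ValueError("sequence longer than the learned position table")
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

token_ids = np.array([3, 17, 42, 8, 5, 99])
x = embed(token_ids)
```

The `ValueError` branch is the second limitation mentioned above made concrete: a position past the table simply has no embedding.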
So yeah, so this is like the first
method. Uh but as I was mentioning, it
has its limitations.
um because you can only learn embeddings
of positions up to the max position that
is present in the training set. So for
instance, if you have at inference time
a position that is beyond the position
that was uh in the training set, well
you have not learned that. So you need
to find a way to kind of infer the
value. So that's the second limitation.
And uh but yeah, but on the pro side, I
guess you're just letting your model
learn and uh we've seen that the
gradient descent does wonders when it
comes to just, you know, learning from
the data. Um so yeah and for these
reasons this method
was something that the authors said
was performing well along with a second
method which is different
which is around
having an arbitrary formula for each
dimension
corresponding to a position embedding.
And we're going to see that now.
So first method was you know you have
one embedding per position and you just
learn that.
Second method is you have one embedding
per position but you're not going to
learn that. you're going to have
something that is predetermined that
you're going to use
and
we're going to see that what the authors
chose was a formulation using sine and
cosine. So it can feel kind of weird,
you know, why did they choose this? But
we're going to see why that makes sense.
So the idea here is for a given position
let's say m, you
have a vector of size d_model. So d_model
needs to match the dimension of your
token embeddings because of course
you're adding them.
And what you're going to do is for every
index you're going to compute the
corresponding value with respect to
these formulas.
So what are these formulas? So it's
basically sine of something times m and
we're going to see what that
something means. And then the second one
is cosine of something times m.
So who remembers trigonometry formulas?
Cool. Everyone.
So before we go into that, let's just
simplify the notations. Let's just
assume that this big quantity that you
saw is actually something like omega. So
let's suppose it's omega, as a function
of i, times m
and you denote omega_i as being this you
know quantity. So 10,000 to the power of
minus 2i over d_model.
Let's suppose you construct your
embeddings this way.
Then
I guess I want us to think about why
that would make sense.
Because if you think about it,
what you want is to represent
positions in a way that reflects the
following facts.
Words that are close together are likely
to be more relevant
as opposed to words that are further
apart, right? So, if you have two
words that are like just one position
apart versus 10,000 positions apart,
what you want is that the one that is
one position apart is more similar than
the other one.
So, let's see if the formula makes
sense. So let's suppose you have two
position embeddings. So one at position
m and the other one at position n.
And let's suppose you compute, you know,
all the values from this predetermined
um formula.
Well, it turns out
if you remember your trigonometry
formulas, so cosine of a minus b is
equal to cosine of a cosine of b plus
sine of a sine of b. Right?
Well, turns out that if you express
cosine of omega_i
times m minus n,
this is something that you obtain. It's
just like the identity that I mentioned.
Well, it just turns out that this
quantity
is just one component that appears when
you do the dot product of these two
position embeddings.
Right?
Because basically here when you do the
dot product of position m and position n
what you do is you take the first
components, you multiply them, then you add
the second components multiplied together,
etc etc and then you come here, it's sine of
this times sine of this plus cosine of this
times cosine of this, right, which is just
this quantity.
So at the end of the day what you
realize is that when you perform a
dot product of these embeddings
you end up with a sum of cosines
that are a function of the relative
distance between m and n.
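The claim can be checked numerically. The sketch below builds sinusoidal embeddings with omega_i = 10000^(-2i/d_model), interleaving sine and cosine, and compares the dot product of two positions with the sum of cosines of omega_i times (m - n). The helper names and the chosen positions are assumptions for illustration.

```python
import numpy as np

def frequencies(d_model):
    # omega_i = 10000^(-2i / d_model) for i = 0 .. d_model/2 - 1
    i = np.arange(d_model // 2)
    return 10000.0 ** (-2 * i / d_model)

def sinusoidal_embedding(pos, d_model):
    omega = frequencies(d_model)
    emb = np.empty(d_model)
    emb[0::2] = np.sin(omega * pos)  # even indices: sine
    emb[1::2] = np.cos(omega * pos)  # odd indices: cosine
    return emb

d_model, m, n = 64, 7, 3
omega = frequencies(d_model)

dot = sinusoidal_embedding(m, d_model) @ sinusoidal_embedding(n, d_model)
sum_of_cosines = np.cos(omega * (m - n)).sum()

# Shifting both positions by the same amount leaves the dot product unchanged.
dot_shifted = sinusoidal_embedding(m + 100, d_model) @ sinusoidal_embedding(n + 100, d_model)

# A position compared with itself gives the maximum, a sum of cos(0) = 1 terms.
dot_self = sinusoidal_embedding(m, d_model) @ sinusoidal_embedding(m, d_model)
```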
Yeah.
What do you mean by pair by the way?
>> Exactly. Yeah. So basically, so the
question is um Yeah. So the closer they
are, the more similar they are. So
that's the intuition that basically this
way of formulating uh the embeddings is
trying to approximate or is trying to
mimic.
So with that you basically obtain a dot
product that is just a function of m minus
n, the relative distance between them
actually. Um and just as a reminder so
why do I care about the dot products?
It's because if you remember
in the embedding world when we try to
quantify the similarity between two
embeddings what we do is typically
something that is involving the
dot product of these two. So you
typically have the cosine similarity but
cosine similarity is just the dot product
over the norms of the two embeddings. So
which is basically a dot product
right? So that's why we care about the
dot product. And here we see great it's
a function of the relative distance
between the two.
So in particular again if you remember
your trigonometry class
cosine of zero is
one
and the higher this number
the lower the value of cosine of this
number. Right? Of course, it's uh
periodic. So, what I'm saying is not
necessarily true. Past 2 pi or sorry uh
pi just goes the other way. But I guess
what I'm trying to say is for m equal to
n
you'll have basically a sum of cosine of
zero
and it is the value at which
this quantity is maximum.
So when m is equal to n this quantity is
maximum which means basically if you're
looking at the position itself it's the
most similar which basically matches our
intuition
and so now when you plot the values of
the embeddings this is what you obtain.
So here in this graph on the y-axis I'm
basically representing all the
embeddings for each of the let's say 50
positions
and on the x-axis it's basically
values along a given vector across
several dimensions of a vector. So if
you take let's say the first row, you're
looking at the first or number zero
position depending on how you index your
vector and you're looking at all the
values of the position zero embedding.
And so you see that uh for low
dimensions
this value kind of goes up and down very
frequently. So it's more high frequency
and when the dimension is high it
basically takes a lot of time for the
value to go up and down. So it's
basically more low frequency.
So this relates to this omega_i that I
mentioned. So omega_i
is
very high for low values of i, which is
the dimension index, and is very low for high
values of i.
So this basically just determines how
quickly your cosine and sine basically
vary.
Cool.
So this is what the original authors
have tried and basically what they said
what they noted was that using this
method leads to comparable results
compared to the learned one.
But here we have a big advantage because
it can extend to any sequence length not
just the sequence length that you saw at
training time.
And this is one of the reasons why you
know this may be something that is
preferable.
So yeah this is the intuition and this
is what the authors chose.
Now fast forward to 2025, I guess you may
ask me are we still using that?
And the answer is kind of. So we're
still using this idea of we want far I
guess tokens to be less similar than
closer tokens.
But we're not injecting the embedding
like they did. And we're going to see
why.
Because if you remember,
what you care about is determining how
similar tokens are in the self attention
computation.
And where does the self attention
computation happen?
In the attention layer.
But here what did I say? I said
let's compute these embeddings and let's
add them here.
But actually what we want is to reflect
this similarity in
the attention layer.
Do you have a question? Yeah.
So in this first method, yes. So the
question is is it added to the input
feature? Yes.
Yes, just add. Yeah. Yeah. But the
problem is we mostly want this I guess
intuition to hold true in the
multi head attention layer.
So this is one of the reasons why people
have tried different variations
and in particular
have these position embeddings
intervene directly in the attention
layer as opposed to the input because
basically when you do it at the input
just here fair okay it's going to be
roughly something that is going to go
into this attention layer but it's kind
of indirect
So what we want is to directly
kind of do something about the attention
formula that would I guess reflect the
fact that we want close tokens to be
more similar compared to further tokens.
And the way we do that is if you
remember
the self attention layer is basically
the softmax of Q K transpose over square root
of d_k, times V.
So what we want is to add a little
something inside that softmax
which is basically where you quantify
how similar a token is to another token.
You want to add a little something
to reflect the fact that some tokens
are supposed to be more similar
to that token compared to
others.
So there are a few methods that have
tried
to kind of have some variation of that.
So
for those of us who know the
T5 paper, which we're going to see a
little bit later, um they have tried
this kind of relative position bias by
learning the bias term which is in the
formula above.
So what they did was
let's suppose you have a given distance
between the positions m and n.
So their idea was let's learn that
let's basically bucketize
all you know m minus n into some buckets
and let's just have the model learn
these quantities that are then going to
be injected in the softmax.
Yeah.
Mm, so the question is: does that pose a
problem that the bias is here with
respect to the probability at the end
because it has to sum to one. Well you
can do whatever you want inside the
softmax because the softmax is going to
normalize it anyways.
So you can think of the bias as being
something that is maybe more negative
for things that are far apart compared
to things that are closer together.
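A simplified sketch of a learned relative position bias added inside the softmax. This is not the T5 implementation: the bucketing here just clips the relative distance to a hypothetical `max_distance`, and the bias table is random where a real model would learn it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k, max_distance = 6, 8, 3
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

# One scalar per bucket of relative distance in [-max_distance, max_distance].
# Random stand-in for parameters that would be learned.
bias_table = rng.normal(size=2 * max_distance + 1)

positions = np.arange(seq_len)
rel = positions[None, :] - positions[:, None]  # rel[m, n] = n - m
buckets = np.clip(rel, -max_distance, max_distance) + max_distance
bias = bias_table[buckets]                     # (seq_len, seq_len)

# The bias goes inside the softmax, so the rows still sum to 1.
weights = softmax(Q @ K.T / np.sqrt(d_k) + bias)
```

Pairs of positions with the same relative distance share the same bias, and every distance past `max_distance` falls into the same bucket.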
So T5 says let's learn them.
We have another method from
um I guess this "Train Short, Test Long" paper
which introduced this method called
ALiBi. ALiBi stands for Attention with
Linear Biases I believe. And what they did
was
say instead of learning those biases,
let's actually have a deterministic
formula which is as a function of the
difference the relative difference
between those two positions
and they said that so they kind of had
some results. So you know all these
papers they always kind of compare against
one another to kind of see which one
is um is I guess more performant.
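A single-head sketch of a deterministic linear bias in the spirit of ALiBi. The slope value, the causal masking, and the all-zero scores are assumptions chosen to isolate the effect of the bias; the paper uses one fixed slope per head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, slope = 6, 0.5  # slope is a hypothetical value for one head
positions = np.arange(seq_len)
distance = positions[:, None] - positions[None, :]  # distance[m, n] = m - n

# Penalty grows linearly with distance; future positions (n > m) are masked out.
bias = np.where(distance >= 0, -slope * distance, -np.inf)

# With all-zero scores, the attention weights depend only on the bias.
scores = np.zeros((seq_len, seq_len))
weights = softmax(scores + bias)
```

In the last row, the weight increases monotonically as the key position gets closer to the query position, which is the behavior the bias is meant to encode.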
But the reality is that in today's
models
most models actually use another
kind of position embedding method
and we're going to see it now.
So this method relies on rotating
the query and the key vector
by some angle
and I guess you can think of it as you know
you have your query you have your key
let's suppose in the 2D space so what
you're going to do is rotate your query
by some angle that is a function of its
position m
and you're going to
rotate the key
vector by some angle that is a function
of its position n
and
I guess how do you do that by the way I
wasn't supposed to show this thing so
let's suppose you have a vector by the
way
and you want to rotate that vector
because I had the answer on the slide
but I guess how would you go about this,
I guess for people who just want to kind
of talk about this with intuition.
So, who here has done uh I don't know
rotations in uh in space? Yeah.
Mhm.
Yeah, that's correct. Yeah. Matrix
multiplication. And you're going to use
a quantity that's called a rotation
matrix.
And uh the rotation matrix is expressed
as follows. So it's basically a 2x2
matrix in the 2D plane that has
cosine of this angle, minus sine of this
angle, sine of this angle, cosine of this
angle. I'm looking at the time. There's
actually it's quite simple to just show
that it works. Uh but we may run out of
time. Do you want me to quickly
show you that it's indeed a way to uh
rotate a vector?
Okay.
So here as a reminder what we're trying
to show is that we can use a matrix
multiplication to rotate a vector in 2D
space.
So let's suppose we have the following
vector that is something that you can
um you can quantify with uh two
dimensions let's say x and y
you can express your vector in 2D space
with this right
but I guess
if you denote by r
the norm of the vector
and by
phi
the angle with respect to the x-axis,
you can also write v
as r
times the
vector with components cosine of phi and sine of
phi,
right?
So if you multiply the rotation matrix
with this v,
what you're going to obtain is some
multiplication of cosine, minus sine, and
cosine of this and that.
And I will leave this exercise for you.
But you can show that uh the rotation
times this v
can be expressed as r
times cosine of theta + phi
and sine
of theta + phi.
So this is a quick proof. So I'll just
kind of leave you the multiplication of
the rotation matrix and v but you will
obtain these like trigonometric
identities that will I guess lead
you to this formula. This basically
shows that if you multiply this matrix
and this vector you're basically
rotating the vector by this angle. Yeah.
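For reference, the multiplication left as an exercise can be written out as follows, using the same notation (r for the norm of v, phi for its angle with the x-axis, theta for the rotation angle) and the angle-addition identities for sine and cosine:

```latex
\begin{aligned}
R_\theta\, v
&= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
   r \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix} \\
&= r \begin{pmatrix} \cos\theta\cos\varphi - \sin\theta\sin\varphi \\
                     \sin\theta\cos\varphi + \cos\theta\sin\varphi \end{pmatrix}
 = r \begin{pmatrix} \cos(\theta + \varphi) \\ \sin(\theta + \varphi) \end{pmatrix}
\end{aligned}
```

The result has the same norm r and an angle of theta + phi, so the vector has been rotated by theta.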
So question is why do you want to do
this? It's a great question. It's my
next uh slide. So this is just like a
little intro. Um so I guess here just
going back to these methods.
What we want to do is to quantify the
similarity between tokens and have
close tokens be more similar compared
to tokens that are farther apart.
So
the problem that we had with the
previous methods was so in the first
method
this learned embedding you always had
this overfitting issue because basically
when you learn these biases it always
depends on which training set you have.
So maybe your data set is in a way
that let's say tokens that are close are
kind of similar but in a different way
compared to what you see at inference
time
and this ALiBi method,
it didn't have that learnable component
but this was quite restrictive because
it's a very simple formula after
all, right, just the relative difference
between n and m so I guess people have
tried different ways of coming up with
something that kind of tells you that I
guess an embedding that reflects the
fact that you want further positions to
be less similar than closer ones. So in
this method, we're going back to the sine
and cosine world from what the original authors
had proposed and think about similarity
from the lens of I guess cosine and sine
functions.
So this is a little intro
and their method is called RoPE. I'm not
sure if you've heard of it. So it stands
for rotary position embeddings
and we're going to see this method.
So why why do we care about this method?
So this method has two great things.
So the first one is
that if you rotate the query and the
key,
you will end up with a quantity that
will be a function of the relative
distance between the two. That's going
to be very nice. And that's why I
wrote this thing on the blackboard. Not
sure if you can see by the way, but um
we won't have time to go into the
mathematical detail. But if you want to
just you know express these things at
home uh this is just a foundation.
And so in particular
if you remember your attention formula
so you have you know query times key
transpose.
So if you rotate the query by an angle m theta
and the key by an angle n theta,
what you're going to end up with is a formula
that has the rotation matrix
of angle theta times, I guess, n minus m.
And this is great because this is a
function of the relative distance
between these two positions.
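A 2D numerical check of this property, with made-up values for theta, q, and k. Rotating the query by m * theta and the key by n * theta gives a score equal to q^T R((n - m) * theta) k, so only the relative distance matters.

```python
import numpy as np

def rotation(angle):
    # 2x2 rotation matrix for the given angle.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3                # hypothetical base angle
q = np.array([1.0, 2.0])   # hypothetical query
k = np.array([0.5, -1.0])  # hypothetical key

def rope_score(q, k, m, n):
    # Rotate the query by m * theta and the key by n * theta, then take the dot product.
    return (rotation(m * theta) @ q) @ (rotation(n * theta) @ k)

score = rope_score(q, k, m=2, n=7)
via_relative = q @ rotation((7 - 2) * theta) @ k

# Same relative distance, very different absolute positions.
score_shifted = rope_score(q, k, m=102, n=107)
```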
Okay, so why do I talk about this in
detail? Well, it turns out that most
models these days, they use RoPE, which
is why it's important.
And I would say another thing, it's
maybe a little bit hard to get the
intuition as to why that works, but
hopefully the explanation that I gave
regarding the, you know, sine and cosine at
the very beginning can help you just
build that intuition.
And speaking of that,
it turns out that the upper bound of the
attention weight given by the query and
the key is such that we observe a
long-term decay.
Meaning that as m minus n gets large, we
do see the upper bound that gets kind of
smaller and smaller.
Well, you see these little oscillations.
It's not like perfect either, but we do
have some mathematical
uh I guess results as to just the upper
bounds kind of decaying over the long
term.
Cool.
Any questions on this? Yeah.
Yep.
>> Exactly. Yeah. So the question is the
relative distance is captured in the
rotation matrix and yes.
Yes.
Oh yeah.
Ah it's a great question. So question is
what is theta? So theta is actually
fixed. So do you remember this omega
that I talked about here
here?
So it's basically some function of i and
d. I actually kind of uh passed over it quite
quickly. But what I showed you is
in the 2D space but here we're in a
d-dimensional space, with d greater than
two. So I kind of glossed over it very
quickly but the way you extend this
method is by applying this 2D rotation
kind of block by block. But then the theta is
typically something
that you fix, but it's a function of i, which
is the index of the block, you know, um it's
basically between one and d/2,
it's a function of that, and it's a
function of uh d as well. So
typically you will see this theta_i as
being roughly equal to omega_i, which is
this one, more or less.
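A sketch of the block-by-block extension, assuming theta_i = 10000^(-2i/d) with i indexed from 0 and consecutive pairs of dimensions forming the 2D blocks. The function name `apply_rope` and the inputs are made up; production implementations differ in how they pair dimensions and in how they batch the computation.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    # x: vector of even size d; pos: integer position of the token.
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)  # one fixed angle per 2D block
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    # Apply a 2x2 rotation to each pair (x[2i], x[2i+1]).
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=d)
k = rng.normal(size=d)

score = apply_rope(q, 2) @ apply_rope(k, 7)
score_shifted = apply_rope(q, 102) @ apply_rope(k, 107)
```

Since each block is a pure rotation, the norm of the vector is unchanged, and the score again depends only on the relative distance between the two positions.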
Oh so the question is uh so that it has
the same dimension as the latent
dimension. So well here
you have a product of matrices.
So you need to have like the dimensions
match. So I guess your answer Yeah.
Does that make sense?
I'll take that as a yes. Okay. So I
spent a bunch of time on this because uh
I think this is actually quite
important. Uh a lot of models use this
and uh yeah, I guess the intuition is
not super obvious. So I hope this was
helpful.
Okay. So this was position embeddings.
Yeah.
Oh yeah.
Oh yeah. So the question is about how do
you obtain this curve? So this is
actually a curve that I believe is shown
mathematically
cuz basically so it's kind of
complicated. We're not going to write
down the formula but if you're
interested, so in this paper, in the
RoFormer paper, there's an appendix where
they show mathematically that it is upper
bounded by some quantity and this is
what this is showing. Yep.
Great question.
Cool. So position embeddings is one part
of the transformer that has changed a
little bit and we've seen how that
changed and why.
So now we're going to see another
component of the transformer
that has also changed a little bit, and
that component is the layer
normalization.
So if you remember
the transformer architecture
is composed again of encoder decoder and
then you have the components inside.
So you have these boxes that say add and
norm.
So what do they mean? So basically here
what we do is we take the input as well
as the output of this sub layer.
We add them together
and then we normalize.
So this is a little trick that the
authors do and it is shown in practice
to improve convergence and just make the
convergence be quicker.
So the idea is as follows.
If you have a vector,
sometimes the components of the vector
can be super large, sometimes they can
be super small. The idea here is to kind
of normalize
the components of your vector
within some range, some normalized
range.
So the way you're going to do that is
you're going to take your vector
and then subtract the
computed mean, which is basically the average
of its components,
and normalize it by basically its uh
standard deviation.
And what you're going to do is you're
going to learn two quantities.
One gamma which is going to be the
rescaling factor
and then beta, which is um an additive
shift term that you learn as well.
And you're going to let these two
quantities be learned by your model.
And so in practice as I mentioned so
what this does is it helps with training
stability and with convergence time.
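A minimal sketch of layer normalization as described: subtract the mean of the components, divide by their standard deviation, then rescale with gamma and shift with beta. The small `eps` for numerical stability and the example values are assumptions; gamma and beta would be learned in a real model.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last axis, i.e. over the components of each vector.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([2.0, -1.0, 0.5, 10.0])
gamma = np.ones(4)  # learned rescaling factor (identity here)
beta = np.zeros(4)  # learned shift (zero here)
y = layer_norm(x, gamma, beta)
```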
So this was a technique that was used in
the original transformer paper.
Uh I just want to call out that there
have been some changes since then. Um, so
we went from normalizing the input
plus the output of the sub layer
to having a sum of the input
and the sub layer applied to the normalized
input.
So in other words, what we've done is to
change where the normalization is
located. So here in the Transformer
paper it was what we now call a post-norm
version
and nowadays we use a pre-norm version
which basically consists of having the
layer norm right before the vector goes
into the sub layer and sub layer here
can be either the attention layer or the
FFN.
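The difference between the two placements can be sketched in a few lines. The `normalize` helper is a parameter-free stand-in for layer norm, and `toy_sublayer` is a hypothetical placeholder for the attention layer or the FFN.

```python
import numpy as np

def normalize(x, eps=1e-5):
    # Parameter-free stand-in for layer norm (no gamma or beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: normalize the sum of input and sublayer output.
    return normalize(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize first, apply the sublayer, then add the untouched input.
    return x + sublayer(normalize(x))

def toy_sublayer(h):
    # Hypothetical placeholder for attention or the FFN.
    return 0.5 * h

x = np.array([[1.0, 4.0, -2.0, 7.0]])
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In the pre-norm version the input x passes through the residual connection without being normalized.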
but not only that there is also another
change. So nowadays people do not
use layer norm. They use something else
called RMSNorm, RMS being root mean square
normalization, which is basically a
variation of what you've seen before.
So instead of computing this
basically what people do is they just
normalize x by the root mean square
of the components of x and they learn
gamma, only gamma
though. So why do they do that?
Basically they show that the convergence
uh properties they're basically
comparable but here you have fewer
parameters to learn. So it's basically
quicker.
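A minimal sketch of RMSNorm under the same assumptions as before (made-up values, a small `eps` for stability, gamma learned in practice). Compared with layer norm, there is no mean subtraction and no beta.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Divide by the root mean square of the components, then rescale by gamma.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([2.0, -1.0, 0.5, 10.0])
gamma = np.ones(4)  # the only learned parameter (identity here)
y = rms_norm(x, gamma)
```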
Yeah.
Mhm.
Good question. So I guess the question
is what is the intuition behind
normalizing? So the intuition is that if
you look at your model you have several
layers. In some layers your
vector, your activation to be more
precise, so, you know, the vector
that you see that goes from here to here
is basically called the activation.
Sometimes the activation has extreme
values in one part of its components,
sometimes in another part.
And the model is typically having
trouble in learning the weights in each
of these layers if these activations
they vary too much.
So the idea is to bring the values of
the components of the activation to some
range
that is not too you know far off in some
direction.
So in case you're interested there is
this keyword, internal covariate shift,
which is basically the term that is
given to the phenomenon that I'm
describing here. So yeah that's the
intuition.
Yeah.
Oh great great question. So question is
what is the difference between this and
batch normalization? So batch
normalization
is normalization across the other
dimension, which is the
dimension of the batch.
So let's suppose you have a bunch of
vectors. What you do is you normalize
each component
with respect to all the other components
for the same dimension but of the other
vectors.
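The two normalizations differ only in the axis over which the statistics are computed. A small sketch with a made-up batch of 4 vectors of dimension 3, leaving out the learned parameters and the running statistics that real batch normalization keeps:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 3))  # (batch, features)

# Layer norm: statistics over the components of each vector (one per row).
layer_normed = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Batch norm: statistics over the batch for each component (one per column).
batch_normed = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
```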
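Key Summary
Covers practical variants and optimization techniques for the Transformer architecture. Explains the evolution of position encoding (Absolute → Relative → RoPE), attention efficiency techniques (Sliding Window, Grouped Query Attention), and BERT, an encoder-only model.
Key Concepts
Evolution of Positional Encoding 14:30
- Absolute (Learned): A learned embedding for each position. Only applicable up to the lengths seen during training
- Absolute (Sinusoidal): Uses sin/cos functions. Extends to arbitrary lengths. The dot product becomes a function of the relative distance (m-n)
- Relative Position Bias (T5): Learns a bias based on relative distance inside the attention computation
- RoPE (Rotary Position Embedding): Applies a rotation matrix to the Query/Key. The standard in modern LLMs
Attention Efficiency 54:20
- Sliding Window Attention: Attention only within a fixed window instead of the full O(n²). Used in Mistral and others
- Receptive field concept: Stacking multiple layers gives indirect access to information from distant tokens
Grouped Query Attention (GQA) 55:30
- MHA (Multi-Head Attention): Separate Q, K, V projection matrices for each head
- MQA (Multi-Query Attention): Only Q is per head; K/V are shared by all heads
- GQA: Q is per head; K/V are shared per group. A middle ground between MHA and MQA
- KV cache savings: K/V are cached during decoding, so sharing them saves memory
Transformer Model Taxonomy 1:02:50
- Encoder-Decoder (T5, mT5, ByT5): The original Transformer. Trained with span corruption
- Encoder-only (BERT): Specialized for classification tasks. Bidirectional attention
- Decoder-only (GPT family): Generation tasks. The mainstream of modern LLMs
BERT Architecture 1:27:00
- [CLS] token: Representation of the whole sentence. Attachment point for the classification head
- [SEP] token: Separator between sentences
- Segment Embedding: Additional embedding to distinguish sentence A from sentence B
BERT Training Objectives 1:28:30
- MLM (Masked Language Modeling): Mask 15% of tokens (80% [MASK], 10% random, 10% unchanged)
- NSP (Next Sentence Prediction): Classify whether two sentences are consecutive
- Learns bidirectional context to obtain rich representations
BERT Fine-tuning 1:33:20
- Add a classification head on top of the pre-trained weights
- Sentiment analysis: [CLS] token → FFN → class
- QA: each token → FFN → predict the start/end of the answer span
Key Insights
- Position encoding is a mathematical implementation of the intuition that "closer tokens should be more similar"
- GQA is a balance point between performance and efficiency. Sharing K/V greatly reduces KV cache memory
- BERT's [CLS] token is a representation that compresses the full context through self-attention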