Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks

Stanford Online • October 17, 2025 • AI summary generated: January 24, 2026

Cool. Hello everyone. Welcome to lecture two of CME295. So before we start, I wanted to give you a heads up about two logistical things. The first one is that, with Shervin, we reviewed the recording of lecture one and we couldn't help but notice that the audio was suboptimal. So what we're doing for this lecture is to have another setup. But then one issue is that my voice will not be propagated in the room. So I guess I have one question: can everyone hear me well, even from the back? Okay, cool. Great. So that's point number one.

Point number two is about the final exam. The final exam right now has a placeholder date of Wednesday, I think December the 10th. But just a heads up that we're trying to see if there's a way to move that to earlier in the week. This is still TBD, but we'll make sure to let you know when it's finalized.

Cool. So with that aside, let's go into today's topic. But before we do that, as always, we'll just quickly recap what we saw in the previous episode.

So if you remember, lecture one was all about introducing the concept of self-attention. Self-attention means that each token attends to all other tokens in the sequence through this mechanism of attention. You have these notions of queries, keys and values. The idea is that the query asks which other tokens are most similar to itself by comparing query and key, and once that's done, we take the associated values.

We saw that the self-attention mechanism can be expressed with this formula: softmax of Q times K transpose over square root of d_k, times V, that is, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. I hope this formula is familiar to you. Just know that this formula is highly optimized: these are big matrix multiplications, which our hardware is very good at doing.
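
The formula recalled above can be sketched in a few lines of NumPy. This is only a minimal illustration for a single sequence and a single head; the sizes, the random inputs and the helper names `softmax` and `self_attention` are assumptions made for the example, not something shown in the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens) query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                     # illustrative sizes
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))
out, weights = self_attention(Q, K, V)
```

Row `m` of `weights` is the attention map of token `m` over all tokens, and the whole computation is two matrix multiplications plus a softmax.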

And all of that to say that we also introduced the architecture of the transformer, which you can see on the right. If you remember, the transformer is composed of two main components: the encoder on the left side and the decoder on the right side. The transformer was initially introduced in the context of machine translation, so you can think of the left side as processing the input text in the source language, say English, and the right side as responsible for decoding the translation in a target language, let's say French. And this multi-head attention layer is where the self-attention mechanism happens.

I remember there were a lot of questions about this: it's called the multi-head attention layer, so there are several heads. What does that correspond to? In the transformer paper, which is the "Attention Is All You Need" paper, you have this figure which actually represents each of these heads, and you can think of each head as an opportunity for the model to learn one way of projecting the input into a query, a key or a value. Just to be a bit more clear: for instance, for the query and the key, each of these little boxes is one head, so the number of these little boxes is basically the number of heads.

To better visualize and understand what this means, I also wanted to show what is shown in the paper, which is a way to interpret what each of these heads does. We have this concept called an attention map, which basically represents the value of each of these query-key dot products. In this example, what we're interested in is seeing which other tokens are most similar to the token "its". So we take the query that represents "its", compute its dot product with all the keys, and look at which keys lead to a high value of the product query times key.

When you do that, what the paper sees is that you have these two words, "application" and "law", which are highlighted as having a high attention weight, the attention weight here coming from the dot product of the query for "its" and the key for each of these tokens. And there is also a way to interpret those. It makes sense that "law" is highlighted, because the token "its" is referring to "law", so the model needs to learn how to associate this word with what came before. "Its" is also linked to "application", which is another way of explaining why that one is highlighted.

What the authors chose to do was to show these values as a function of the different heads. For instance, on the left side you have the intensities for, let's say, the first head, and then the second head shows that the intensity is very high for "law". So, long story short, these heads may learn different ways of figuring out which words matter. Yeah?

>> Great question. So the question is: when we're doing all these computations, are they going through different MLPs? The answer is that we're going to have different projection matrices for each of them. We had this detailed example that went through that, where each head has its own projection, and the computation happens in parallel. So each head produces one result, and these are then concatenated and projected once again with the output matrix. So yeah, long story short, it's highly parallelized, and it's basically just projections, some matrix multiplications and a softmax.
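
As a rough sketch of that answer, the snippet below gives each head its own query, key and value projection, runs all heads in parallel, concatenates the per-head results and applies the output matrix. The shapes, the random weights and the function name `multi_head_attention` are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q/W_k/W_v: (h, d_model, d_head); W_o: (h * d_head, d_model)."""
    Q = np.einsum("nd,hde->hne", X, W_q)    # one projection per head
    K = np.einsum("nd,hde->hne", X, W_k)
    V = np.einsum("nd,hde->hne", X, W_v)
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)        # (h, n, n)
    weights = softmax(scores, axis=-1)                          # one attention map per head
    heads = weights @ V                                         # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)   # (n, h * d_head)
    return concat @ W_o, weights                                # final output projection

rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 16, 4, 4          # illustrative sizes
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_head))
W_k = rng.normal(size=(h, d_model, d_head))
W_v = rng.normal(size=(h, d_model, d_head))
W_o = rng.normal(size=(h * d_head, d_model))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

`weights[j]` is the attention map of head `j`, which is the kind of quantity the paper's figure visualizes per head.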

Does that make sense?

>> Cool. Any other questions on this?

Cool. So I guess this is just a way to illustrate the conversation we had in lecture one. I know there were some questions about these attention heads and what they do, and looking at the attention maps is one way of making sense of what they mean.

Cool. With that, I highly recommend that you read the transformer paper, "Attention Is All You Need". It's a very dense paper, just a few pages long, but I hope that with what you've seen in lecture one, you'll be able to digest the content in a way that makes sense to you.

Cool. So with that, we're going to start the actual meat of what we're going to discuss today.

So, surprisingly, this transformer architecture, which was introduced in 2017, has stayed relevant over the years. There are a few components that have slightly changed, and we're going to see which ones they are. There are some slight variations, but overall, as we're going to see, today's models are all more or less based on the initial transformer architecture.

We're going to divide the class into two parts. The first one is what I'm going to cover: which parts of the transformer are important and have seen some variation. In the second part, Shervin is going to talk about the nomenclature of today's models and how they relate to the original transformer. Cool.

Okay. So let's start with the first important concept in this architecture, which is the position embedding. If you remember, here we're letting tokens interact with all other tokens in a direct fashion, so they have direct links. But contrary to things like RNNs, where you have a sequential dependency and process tokens one at a time, here you lose this idea of a token being processed before another one. So you kind of lose the position information. As a result, we need to somehow quantify the position of each token and inject that information when the transformer is processing the inputs.

So how are we going to do that? The authors of the original transformer paper chose to have a dedicated embedding. When I say dedicated, what that means is that each position has one embedding: position one has one embedding, position two has one embedding, et cetera. And what they chose to do is to add that embedding to the input token embedding. So for instance, if I say "A cute teddy bear is reading", the token "A", which is at position number one, will be represented by the embedding of the token "A" plus the embedding representing the first position. Yeah?

>> It's a great question. So the question is: are the position embeddings learned or static? Both, as in the authors tried both, and we're going to see what the second one is. But here, let's suppose that they are learned. What does that mean? It means that you need to learn an embedding for each position.

The problem with this approach is that you're very much dependent on what is in your training set. For instance, if your text somehow always has something happening at position number two, your learned embeddings will pick up that bias. So that's one limitation. The second limitation is that you can only learn positions up to the maximum position that appears in your training set. Let's suppose you train your transformer on sequences of length up to, say, 512: you can only learn position embeddings up to that position, right?

Yeah, so the question is: how do you parameterize that? What you do is you have a placeholder of learnable position embeddings, let's say for positions 1 through 512. And when you do your training, you just let these weights be learned through regular gradient descent.
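
To make the learned variant concrete, here is a minimal sketch of a position-embedding lookup table that is added to the token embeddings. The vocabulary size, the embedding dimension and the token ids are made up for the example, the tables are randomly initialised instead of trained, and positions are 0-indexed in the code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 512, 16   # illustrative sizes

# Both tables would be trainable parameters updated by gradient descent;
# here they are only randomly initialised.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_len, d_model))

def embed(token_ids):
    """Token embedding plus the learned embedding of each position (0-indexed)."""
    if len(token_ids) > max_len:
        # No embedding was ever learned for positions beyond max_len.
        raise ValueError("sequence is longer than any position seen in training")
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

x = embed(np.array([3, 14, 15, 92, 65]))      # five hypothetical token ids
```

The explicit error mirrors the second limitation mentioned above: a sequence longer than `max_len` has no learned row to look up.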

So yeah, this is the first method. But as I was mentioning, it has its limitations, because you can only learn embeddings of positions up to the max position that is present in the training set. For instance, if at inference time you have a position that is beyond the positions that were in the training set, well, you have not learned that, so you need to find a way to infer the value. So that's the second limitation. On the pro side, you're just letting your model learn, and we've seen that gradient descent does wonders when it comes to learning from the data. For these reasons, this method was something that the authors said was performing well, along with a second method, which is different: having a fixed formula for each dimension of the position embedding. And we're going to see that now.

So the first method was: you have one embedding per position and you just learn it. The second method is: you have one embedding per position, but you're not going to learn it. You're going to use something that is predetermined. And we're going to see that what the authors chose was a formulation using sine and cosine. It can feel kind of weird, you know, why did they choose this? But we're going to see why that makes sense.

The idea here is that for a given position, let's say m, you have a vector of size d_model. This dimension needs to match the dimension of your token embeddings because, of course, you're adding them. And for every index, you compute the corresponding value with these formulas. So what are these formulas? The first is sine of something times m, and we're going to see what that something means, and the second one is cosine of something times m.

So who remembers trigonometry formulas? Cool. Everyone. Before we go into that, let's simplify the notation. Let's denote this big quantity that you saw as omega: the argument is omega_i times m, where omega_i is 10,000 to the power of minus 2i over d_model, that is, $\omega_i = 10000^{-2i/d_{\text{model}}}$. Let's suppose you construct your embeddings this way.
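
A minimal sketch of this construction follows, assuming the interleaved layout in which even indices hold sines and odd indices hold cosines, and assuming an even `d_model`. The function name and the sizes are chosen for illustration.

```python
import numpy as np

def sinusoidal_position_embeddings(n_positions, d_model, base=10000.0):
    """Row m holds [sin(w_0 m), cos(w_0 m), sin(w_1 m), cos(w_1 m), ...]."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    i = np.arange(d_model // 2)
    omega = base ** (-2.0 * i / d_model)                        # omega_i = 10000^(-2i/d_model)
    angles = np.arange(n_positions)[:, None] * omega[None, :]   # (n_positions, d_model / 2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)    # even indices use sine
    pe[:, 1::2] = np.cos(angles)    # odd indices use cosine
    return pe

pe = sinusoidal_position_embeddings(n_positions=50, d_model=64)
```

Nothing here is learned, so the same function can produce an embedding for any position, including ones never seen during training.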

Then I want us to think about why that would make sense. If you think about it, what you want is to represent positions in a way that reflects the following fact: words that are close together are likely to be more relevant to each other than words that are further apart, right? So if you have two words that are just one position apart versus 10,000 positions apart, what you want is for the one that is one position apart to be more similar than the other one.

So let's see if the formula makes sense. Let's suppose you have two position embeddings, one at position m and the other at position n, and you compute all the values from this predetermined formula. Well, if you remember your trigonometry formulas, cosine of a minus b is equal to cosine a cosine b plus sine a sine b, right? So if you expand cosine of omega_i times (m minus n), this is what you obtain: it's just the identity that I mentioned. And it turns out that this quantity is exactly one component that appears when you take the dot product of these two position embeddings. Because when you take the dot product of position m and position n, you take the first components and multiply them, then you add the product of the second components, et cetera. And here you get sine of this times sine of this plus cosine of this times cosine of this, which is just this quantity.

So at the end of the day, what you realize is that when you take the dot product of these embeddings, you end up with a sum of cosines that is a function of the relative distance between m and n.
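
Written out, the argument goes as follows. The notation $PE_m$ for the embedding of position $m$ is introduced here for convenience, with $\omega_i = 10000^{-2i/d_{\text{model}}}$ and $i$ running over the $d_{\text{model}}/2$ sine/cosine pairs.

```latex
\begin{align*}
PE_m \cdot PE_n
  &= \sum_{i=0}^{d_{\text{model}}/2 - 1}
     \Big[ \sin(\omega_i m)\,\sin(\omega_i n) + \cos(\omega_i m)\,\cos(\omega_i n) \Big] \\
  &= \sum_{i=0}^{d_{\text{model}}/2 - 1} \cos\big(\omega_i (m - n)\big)
\end{align*}
```

The second line applies $\cos(a - b) = \cos a \cos b + \sin a \sin b$ with $a = \omega_i m$ and $b = \omega_i n$ to each pair, so the dot product depends on the positions only through $m - n$.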

Yeah. What do you mean by pair, by the way?

>> Exactly. Yeah. So the closer they are, the more similar they are. That's the intuition that this way of formulating the embeddings is trying to mimic.

So with that, you obtain a dot product that is just a function of the relative distance between m and n. And just as a reminder, why do I care about the dot product? It's because, if you remember, in the embedding world, when we try to quantify the similarity between two embeddings, we typically use something that involves the dot product of the two. You typically have the cosine similarity, but cosine similarity is just the dot product divided by the product of the norms of the two embeddings, so it's basically a dot product, right? That's why we care about the dot product. And here we see, great, it's a function of the relative distance between the two.

In particular, again if you remember your trigonometry class, cosine of zero is one, and the larger this number, the lower the value of its cosine, right? Of course, cosine is periodic, so what I'm saying is not necessarily true: past pi, it goes the other way. But what I'm trying to say is that for m equal to n, you have a sum of cosines of zero, and that is the value at which this quantity is maximum. So when m is equal to n, this quantity is maximum, which means that a position is most similar to itself, which matches our intuition.
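
Both properties can be checked numerically. The sketch below is illustrative only: `d_model = 64` and the specific positions are arbitrary choices, and the sines and cosines are stored in two halves, which changes the ordering of the components but not the dot product.

```python
import numpy as np

d_model = 64
omega = 10000.0 ** (-2.0 * np.arange(d_model // 2) / d_model)

def pe(m):
    """Sinusoidal embedding of position m (all sines first, then all cosines)."""
    return np.concatenate([np.sin(omega * m), np.cos(omega * m)])

def similarity(m, n):
    return float(pe(m) @ pe(n))

same_offset_a = similarity(10, 7)      # positions 3 apart
same_offset_b = similarity(103, 100)   # also 3 apart, much later in the sequence
self_similarity = similarity(10, 10)   # offset 0: a sum of d_model / 2 cosines of zero
```

The two pairs with the same offset give the same dot product, and the self-similarity equals `d_model / 2`, which no other offset can exceed since every cosine is at most one.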

And now when you plot the values of the embeddings, this is what you obtain. In this graph, on the y-axis I'm representing the embeddings for each of, let's say, 50 positions, and on the x-axis you have the values across the dimensions of a given embedding vector. So if you take, let's say, the first row, you're looking at the first position, or position number zero depending on how you index your vector, and you're looking at all the values of the position-zero embedding.

You see that for low dimensions, the value goes up and down very frequently, so it's more high frequency, and when the dimension is high, it takes a lot of time for the value to go up and down, so it's more low frequency. This relates to the omega_i that I mentioned: omega_i is very high for low values of i, which is the dimension index, and very low for high values of i. So this just determines how quickly your cosine and sine vary.

Cool. So this is what the original authors tried, and what they noted was that this method leads to results comparable to the learned one. But here we have a big advantage, because it can extend to any sequence length, not just the sequence lengths that you saw at training time. This is one of the reasons why it may be preferable. So yeah, this is the intuition, and this is what the authors chose.

Now, fast forward to 2025. I guess you may ask me: are we still using that? And the answer is: kind of. We're still using this idea that we want far-apart tokens to be less similar than closer tokens, but we're not injecting the embedding like they did, and we're going to see why. If you remember, what you care about is determining how similar tokens are in the self-attention computation. And where does the self-attention computation happen? In the attention layer. But what did I say here? I said let's compute these embeddings and add them here, at the input. What we actually want is to reflect this similarity in the attention layer. Do you have a question? Yeah.

So in this first method, yes. The question is: is it added to the input features? Yes, it's just added. But the problem is that we mostly want this intuition to hold true in the multi-head attention layer.

So this is one of the reasons why people have tried different variations, and in particular have had these position embeddings intervene directly in the attention layer as opposed to the input. When you do it at the input, fair, it's roughly going to make its way into the attention layer, but it's kind of indirect. What we want is to directly change the attention formula in a way that reflects the fact that we want close tokens to be more similar than distant tokens.

If you remember, the self-attention layer is basically softmax of Q K transpose over square root of d_k, times V. So what we want is to add a little something inside that softmax, which is where you quantify how similar one token is to another, to reflect the fact that some tokens are supposed to be more similar to a given token than others.

There are a few methods that have tried some variation of that. For those of us who know the T5 paper, which we're going to see a little bit later, they tried this kind of relative position bias by learning the bias term, which is in the formula above. What they did was: let's suppose you have a given distance between the positions m and n. Their idea was to learn that: let's bucketize all the values of m minus n into some buckets, and let's have the model learn these quantities, which are then injected into the softmax. Yeah?

So the question is: does it pose a problem that the bias is here, with respect to the probabilities at the end, because they have to sum to one? Well, you can do whatever you want inside the softmax, because the softmax is going to normalize it anyway. You can think of the bias as being something that is maybe more negative for things that are far apart compared to things that are closer together. So T5 says: let's learn them.
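
The idea of a learned bias inside the softmax can be sketched as below. This is a simplification under stated assumptions: offsets are bucketized by plain clipping to `[-max_distance, max_distance]`, which is not the T5 paper's own bucketing scheme, and the bias table is random rather than trained. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_relative_bias(Q, K, V, bias_table, max_distance):
    """bias_table holds one learnable scalar per clipped offset in [-max_distance, max_distance]."""
    seq_len, d_k = Q.shape
    positions = np.arange(seq_len)
    offsets = positions[None, :] - positions[:, None]    # offsets[a, b] = b - a
    buckets = np.clip(offsets, -max_distance, max_distance) + max_distance
    bias = bias_table[buckets]                            # (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k) + bias                # bias is added inside the softmax
    weights = softmax(scores, axis=-1)                    # rows still sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, max_distance = 6, 8, 2                      # illustrative sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
bias_table = rng.normal(size=2 * max_distance + 1)        # would be learned in practice
out, weights = attention_with_relative_bias(Q, K, V, bias_table, max_distance)
```

Whatever values the bias takes, the softmax renormalizes each row, which is the point made in the answer above.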

We have another method, from the "Train Short, Test Long" paper, which introduced a method called ALiBi. ALiBi stands for Attention with Linear Biases, I believe. What they said was: instead of learning those biases, let's have a deterministic formula that is a function of the relative difference between the two positions. And they had some results: you know, all these papers compare against one another to see which one is more performant.
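
A deterministic bias of this kind can be sketched as a penalty that grows linearly with the distance between positions. The single `slope` value and the symmetric use of the absolute distance are assumptions made for this illustration, so this should not be read as the exact formulation of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_distance_bias(seq_len, slope):
    """Bias that is 0 on the diagonal and decreases linearly with |m - n|."""
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slope * distance

seq_len, d_k, slope = 5, 8, 0.5          # hypothetical values
bias = linear_distance_bias(seq_len, slope)

# With identical (all-zero) queries and keys, only the bias shapes the attention map.
Q = np.zeros((seq_len, d_k))
K = np.zeros((seq_len, d_k))
weights = softmax(Q @ K.T / np.sqrt(d_k) + bias, axis=-1)
```

With the content scores held equal, the first token's attention weights decrease monotonically with distance, and nothing had to be learned to get that behaviour.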

But the reality is that most of today's models actually use another kind of position embedding method, and we're going to see it now.

This method relies on rotating the query and the key vectors by some angle. You can think of it like this: you have your query and your key, let's suppose in 2D space. What you're going to do is rotate your query by some angle that is a function of its position m, and rotate the key vector by some angle that is a function of its position n.

And how do you do that? By the way, I wasn't supposed to show this yet, because I had the answer on the slide. But let's suppose you have a vector and you want to rotate it. How would you go about this? Let's just talk about it with intuition. Who here has done rotations in space? Yeah? Mhm. Yeah, that's correct: matrix multiplication. And you're going to use a quantity that's called a rotation matrix.

The rotation matrix is expressed as follows. It's a 2x2 matrix in the 2D plane that has cosine of the angle and minus sine of the angle in the first row, and sine of the angle and cosine of the angle in the second row. I'm looking at the time. It's actually quite simple to show that it works, but we may run out of time. Do you want me to quickly show you that it's indeed a way to rotate a vector? Okay.

So here, as a reminder, what we're trying to show is that we can use a matrix multiplication to rotate a vector in 2D space. Let's suppose we have a vector v that you can describe with two coordinates, let's say x and y, so you can express your vector in 2D space like this, right? But if you denote by r the norm of the vector and by phi its angle with respect to the x-axis, you can also write v as r times the vector (cosine of phi, sine of phi), right?

So if you multiply the rotation matrix with this v, what you obtain is some products of cosines and sines of theta and phi. I will leave this as an exercise for you, but you can show that the rotation matrix times v can be expressed as r times (cosine of theta plus phi, sine of theta plus phi). So this is a quick proof. I'll leave the multiplication of the rotation matrix and v to you, but you will obtain trigonometric identities that lead you to this formula. This shows that if you multiply this matrix and this vector, you're rotating the vector by the angle theta. Yeah?
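
For reference, the multiplication left as an exercise works out as follows, writing $R(\theta)$ for the rotation matrix (a notation introduced here) and $v = r\,(\cos\varphi, \sin\varphi)^\top$:

```latex
\begin{align*}
R(\theta)\,v
  &= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
     r \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix} \\
  &= r \begin{pmatrix} \cos\theta\cos\varphi - \sin\theta\sin\varphi \\
                       \sin\theta\cos\varphi + \cos\theta\sin\varphi \end{pmatrix}
   = r \begin{pmatrix} \cos(\theta + \varphi) \\ \sin(\theta + \varphi) \end{pmatrix}
\end{align*}
```

The last step uses the angle-addition identities for cosine and sine. The result has the same norm $r$ and angle $\theta + \varphi$, so $v$ has been rotated by $\theta$.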

So the question is: why do you want to do this? It's a great question, and it's my next slide. This was just a little intro. So, going back to these methods: what we want to do is quantify the similarity between tokens, and have close tokens be more similar than tokens that are far apart.

The problem that we had with the previous methods was this. With the first method, the learned one, you always had this overfitting issue, because when you learn these biases, it always depends on which training set you have. Maybe your dataset is such that tokens that are close are similar, but in a different way from what you see at inference time. And the ALiBi method didn't have that learnable component, but it was quite restrictive, because it's a very simple formula after all, right? Just the relative difference between n and m. So people have tried different ways of coming up with an embedding that reflects the fact that you want further positions to be less similar than closer ones. In this method, we're going back to the sine and cosine world that the original authors had proposed, and thinking about similarity through the lens of cosine and sine functions.

So this is a little intro, and their method is called RoPE. I'm not sure if you've heard of it. It stands for rotary position embeddings.

So why do we care about this method? This method has two great things. The first one is that if you rotate the query and the key, you end up with a quantity that is a function of the relative distance between the two. That's going to be very nice, and that's why I wrote this thing on the blackboard. Not sure if you can see it, by the way. We won't have time to go into the mathematical details, but if you want to work through these things at home, this is the foundation.

And so in particular, if you remember your attention formula, you have query times key transpose. If you rotate the query by an angle m times theta and the key by an angle n times theta, what you end up with is a formula that has the rotation matrix of angle theta times (n minus m). And this is great because it is a function of the relative distance between these two positions.
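
The relative-distance property can be checked numerically in 2D. This is a minimal sketch, assuming NumPy; the vectors, the value of `theta`, and the helper names `rotation_matrix` and `rope_score` are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def rotation_matrix(angle: float) -> np.ndarray:
    """2x2 counterclockwise rotation by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s],
                     [s,  c]])

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated by m*theta with k rotated by n*theta."""
    q_rot = rotation_matrix(m * theta) @ q
    k_rot = rotation_matrix(n * theta) @ k
    return float(q_rot @ k_rot)

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.3  # arbitrary fixed angle, chosen only for this demo

# Same relative distance (n - m = 5) at two very different absolute positions.
score_near = rope_score(q, k, m=2, n=7, theta=theta)
score_far = rope_score(q, k, m=102, n=107, theta=theta)

# Closed form: q^T R((n - m) * theta) k.
score_direct = float(q @ rotation_matrix(5 * theta) @ k)

print(score_near, score_far, score_direct)
```

All three numbers agree up to floating-point error, because the transpose of a rotation by m·theta composed with a rotation by n·theta is a rotation by (n − m)·theta.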

Okay, so why do I talk about this in detail? Well, it turns out that most models these days use RoPE, which is why it's important. And I would say another thing: it's maybe a little bit hard to get the intuition as to why that works, but hopefully the explanation that I gave regarding sine and cosine at the very beginning can help you build that intuition.

And speaking of that, it turns out that the upper bound of the attention weight given by the query and the key is such that we observe a long-term decay, meaning that as the relative distance between m and n gets large, the upper bound gets smaller and smaller. You do see these little oscillations, so it's not perfect either, but we do have some mathematical results showing the upper bound decaying over the long term.
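
For anyone who wants to reproduce a decay curve of this kind at home, here is a rough sketch. It assumes the bound is governed by the average magnitude of partial sums of the complex phases exp(i·r·theta_j), with theta_j = 10000^(-2j/d); that form is a recollection of the RoFormer derivation and should be checked against the paper before relying on it. The names `rope_frequencies` and `decay_bound` and the choice d = 128 are hypothetical.

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """theta_j = base^(-2j/d) for j = 0, ..., d/2 - 1 (assumed schedule)."""
    j = np.arange(d // 2)
    return base ** (-2.0 * j / d)

def decay_bound(relative_distance: float, d: int = 128) -> float:
    """Average magnitude of the partial sums of exp(i * r * theta_j).

    Assumption: this is the quantity that controls the upper bound;
    it is used here only to illustrate the long-term decay.
    """
    phases = np.exp(1j * relative_distance * rope_frequencies(d))
    partial_sums = np.cumsum(phases)  # S_1, ..., S_{d/2}
    return float(np.mean(np.abs(partial_sums)))

for r in [0, 1, 10, 100, 1000]:
    print(r, decay_bound(r))
```

At relative distance 0 all phases are aligned, so the value is at its maximum; for larger distances the phases spread out and the value shrinks, with oscillations along the way.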

Cool. Any questions on this? Yeah.

Exactly. So the question is whether the relative distance is captured in the rotation matrix, and yes.

Ah, it's a great question. So the question is, what is theta? Theta is actually fixed. Do you remember this omega that I talked about here?

So it's basically some function of i and d. I passed over it quite quickly, but what I showed you is in the 2D space, and here we're in a d-dimensional space, where d is greater than two. The way you extend this method is by applying this 2D rotation block by block. Theta is typically something that you fix, but it's a function of i, which indexes the dimension, basically between 1 and d/2, and it's a function of d as well. So typically you will see this theta as being roughly equal to omega i, which is this one, more or less. Sorry?
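
The block-by-block extension can be sketched as follows. This is a minimal NumPy sketch, assuming the commonly used schedule theta_i = 10000^(-2i/d) (0-indexed here, so i runs from 0 to d/2 − 1 rather than 1 to d/2); the function names `rope_frequencies` and `apply_rope` are illustrative, and real implementations often pair the dimensions differently.

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """One fixed angle per 2D block: theta_i = base^(-2i/d), i = 0..d/2 - 1."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

def apply_rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    d = x.shape[-1]
    assert d % 2 == 0, "the embedding dimension must be even"
    angles = position * rope_frequencies(d, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative distance (7) at different absolute positions.
score_a = apply_rope(q, 3) @ apply_rope(k, 10)
score_b = apply_rope(q, 53) @ apply_rope(k, 60)
print(score_a, score_b)
```

Nothing in this sketch is learned: the angles depend only on the block index i and the dimension d, which matches the point that theta is fixed.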

Oh, so the question is whether it has the same dimension as the latent dimension. Well, here you have a product of matrices, so you need the dimensions to match. So I guess that's your answer, yeah. Does that make sense?

I'll take that as a yes. Okay. So I spent a bunch of time on this because I think it's actually quite important. A lot of models use it, and the intuition is not super obvious. So I hope this was helpful.

Okay. So this was position embeddings. Yeah.

Oh yeah. So the question is, how do you obtain this curve? This is actually a curve that I believe is derived mathematically. It's kind of complicated, so we're not going to write down the formula, but if you're interested, in the RoFormer paper there's an appendix where they show mathematically that it is upper bounded by some quantity, and that is what this is showing. Yep. Great question.

Cool. So position embeddings are one part of the transformer that has changed a little bit, and we've seen how it changed and why. Now we're going to see another component of the transformer that has also changed a little bit, and that component is layer normalization.

So if you remember, the transformer architecture is composed, again, of an encoder and a decoder, and then you have the components inside. You have these boxes that say "Add & Norm". So what do they mean? Basically, what we do here is take the input as well as the output of the sub-layer, add them together, and then normalize. This is a little trick that the authors use, and it is shown in practice to improve convergence and make it quicker.

So the idea is as follows. If you have a vector, sometimes its components can be super large, and sometimes they can be super small. The idea here is to bring the components of your vector within some normalized range. The way you're going to do that is you take your vector, subtract its mean, which is basically the average of its components, and divide by its standard deviation.

And what you're going to do is learn two quantities: gamma, which is the rescaling factor, and beta, which is another term, an offset, that you learn as well. You let these two quantities be learned by your model. And in practice, as I mentioned, what this does is help with training stability and convergence time.
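
The normalization just described can be written out in a few lines. This is a minimal NumPy sketch, not a framework implementation; the small constant `eps` (added for numerical stability) and the example values are assumptions that were not discussed in the lecture.

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize over the last axis, then rescale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)  # average of the components
    var = x.var(axis=-1, keepdims=True)    # squared standard deviation
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A vector with components on very different scales.
x = np.array([100.0, -3.0, 0.5, 42.0])
d = x.shape[-1]

# gamma and beta would be learned; ones and zeros are a typical starting point.
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(y, y.mean(), y.std())
```

With gamma set to ones and beta to zeros, the output has mean close to 0 and standard deviation close to 1, whatever the scale of the input.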

So this was the technique used in the original transformer paper. I just want to call out that there have been some changes since then. We went from normalizing the input plus the output of the sub-layer to summing the input and the sub-layer applied to the normalized input.

So in other words, what we've changed is where the normalization is located. In the transformer paper it was what we now call the post-norm version, and nowadays we use a pre-norm version, which basically consists of having the layer norm right before the vector goes into the sub-layer. The sub-layer here can be either the attention layer or the FFN.
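
The difference between the two placements is easiest to see side by side. This is a minimal sketch, assuming NumPy; `normalize` is a parameter-free layer norm and `toy_sublayer` is a hypothetical stand-in for attention or the FFN, both chosen only for illustration.

```python
import numpy as np

def normalize(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Parameter-free layer norm over the last axis (gamma and beta omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_sublayer(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an attention layer or FFN."""
    return 2.0 * x

def post_norm_block(x, sublayer):
    """Original transformer: add the residual first, then normalize."""
    return normalize(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sub-layer input, then add the raw residual."""
    return x + sublayer(normalize(x))

x = np.array([1.0, 2.0, 3.0, 10.0])
print(post_norm_block(x, toy_sublayer))
print(pre_norm_block(x, toy_sublayer))
```

In the pre-norm version the input x reaches the output untouched through the residual path, whereas in the post-norm version everything, including the residual, passes through the normalization.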

But not only that, there is also another change. Nowadays people typically do not use layer norm. They use something else called RMSNorm, for root mean square normalization, which is basically a variation of what you've seen before. Instead of computing the mean and standard deviation, what people do is just normalize x by the root mean square of its components, and they learn only gamma. So why do they do that? Basically, they show that the convergence properties are comparable, but here you have fewer parameters to learn. So it's basically quicker.
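
For comparison with layer norm, here is a minimal NumPy sketch of the RMS variant. The stability constant `eps` and the example vector are assumptions added for illustration.

```python
import numpy as np

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Divide x by the root mean square of its components, then rescale by gamma."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([3.0, 4.0])
y = rms_norm(x, gamma=np.ones(2))

# The output has root mean square close to 1, but its mean is not forced to 0:
# there is no mean subtraction and no beta term.
print(y, np.sqrt(np.mean(y ** 2)), y.mean())
```

Compared with layer norm, the mean subtraction and the learned beta are gone, which is where the saving in parameters comes from.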

Yeah.

Good question. So the question is, what is the intuition behind normalizing? The intuition is that if you look at your model, you have several layers. The vector that you see going from here to here is called the activation. Sometimes the activation has extreme values in one part of its components, sometimes in another part. And the model typically has trouble learning the weights in each of these layers if these activations vary too much.

So the idea is to bring the values of the components of the activation into some range that is not too far off in some direction. In case you're interested, there is this keyword, internal covariate shift, which is the term given to the phenomenon that I'm describing here. So yeah, that's the intuition.

Yeah.

Oh, great question. So the question is, what is the difference between this and batch normalization? Batch normalization is normalization across the other dimension, which is the dimension of the batch. So let's suppose you have a bunch of vectors. What you do is normalize each component with respect to the components at the same dimension in the other vectors.
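
The two normalizations differ only in the axis over which the statistics are computed. This is a minimal NumPy sketch with made-up numbers; it omits the learned scale and offset, and the running statistics that batch normalization keeps for inference.

```python
import numpy as np

def standardize(x: np.ndarray, axis: int, eps: float = 1e-5) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Three vectors (rows) with three components each (columns).
batch = np.array([[1.0,  2.0,  3.0],
                  [4.0,  6.0,  8.0],
                  [7.0, 10.0, 13.0]])

# Layer norm style: statistics over the components of each vector.
layer_normed = standardize(batch, axis=1)

# Batch norm style: statistics over the batch, one component at a time.
batch_normed = standardize(batch, axis=0)

print(layer_normed.mean(axis=1))  # close to 0 for every vector
print(batch_normed.mean(axis=0))  # close to 0 for every component
```

With layer normalization each vector is normalized on its own, so the result for one example does not depend on which other examples happen to be in the batch.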

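Key Summary

Covers practical variants of the Transformer architecture and optimization techniques. Explains the evolution of position encoding (Absolute → Relative → RoPE), attention efficiency techniques (Sliding Window, Grouped Query Attention), and BERT, an encoder-only model.

Key Concepts

Evolution of Positional Encoding 14:30

Absolute (Learned): a learned embedding for each position. Only applicable up to the lengths seen during training
Absolute (Sinusoidal): uses sin/cos functions. Can be extended to arbitrary lengths. The dot product becomes a function of the relative distance (m-n)
Relative Position Bias (T5): learns a bias based on relative distance inside attention
RoPE (Rotary Position Embedding): applies a rotation matrix to the query and key. The standard in modern LLMs

Attention Efficiency 54:20

Sliding Window Attention: attention only within a fixed window instead of the full O(n²). Used in Mistral, among others
Receptive field concept: by stacking several layers, information from distant tokens becomes indirectly accessible

Grouped Query Attention (GQA) 55:30

MHA (Multi-Head Attention): separate Q, K, V projection matrices for each head
MQA (Multi-Query Attention): only Q is per head; K/V are shared by all heads
GQA: Q is per head; K/V are shared per group. A middle ground between MHA and MQA
KV cache savings: K/V are cached during decoding, so sharing them saves memory

Transformer Model Categories 1:02:50

Encoder-Decoder (T5, mT5, ByT5): the original Transformer design. Trained with span corruption
Encoder-only (BERT): specialized for classification tasks. Bidirectional attention
Decoder-only (GPT family): generation tasks. The mainstream of modern LLMs

BERT Architecture 1:27:00

[CLS] token: a representation of the whole sentence. The attachment point for the classification head
[SEP] token: separates sentences
Segment Embedding: an additional embedding to distinguish sentence A from sentence B

BERT Training Objectives 1:28:30

MLM (Masked Language Modeling): 15% of tokens are selected for masking (80% [MASK], 10% random, 10% unchanged)
NSP (Next Sentence Prediction): classifies whether two sentences are consecutive
Learns bidirectional context to obtain rich representations

BERT Fine-tuning 1:33:20

Add a classification head on top of the pre-trained weights
Sentiment analysis: [CLS] token → FFN → class
QA: each token → FFN → predict the start/end of the answer span

Key Insights

Position encoding is a mathematical implementation of the intuition that "closer tokens should be more similar"
GQA is a balance point between performance and efficiency. Sharing K/V greatly reduces KV cache memory
BERT's [CLS] token is a representation that compresses the information of the full context through self-attention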