Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer

Stanford Online • October 17, 2025 • AI summary generated: January 24, 2026

Cool. Hello everyone, and welcome to CME 295, Transformers and Large Language Models. My name is Afshine and I will be teaching this class with Shervine, who's in the back. Before I start, I'm just going to introduce ourselves.

We're twin brothers and we actually have a similar background. We both went to a school in France called Centrale Paris, and then we each went our own way. On my end I went to MIT, and Shervine went to Stanford to do the ICME master's program.

After that, our industry background is very similar as well. I first went to Uber, and then Shervine came to Uber as well. Then Shervine left for Google and I went to Google, and very recently I joined Netflix and Shervine joined Netflix as well, and we've been working on large language models. So we have technical backgrounds, mostly oriented towards LLMs.

Okay. So why are we doing this class? Since 2020, Shervine and I have been specializing in NLP, and we've been giving this class in the format of a workshop held on a yearly basis: 2021, 2022, 2023, 2024. ChatGPT came out in 2022 and suddenly there was a lot of interest in LLMs, so last spring we started to offer this class as a Stanford course, which is now called CME 295. This is the second instance.

>> Boom.

So what can you expect from this class? First of all, LLMs are basically everywhere now, and our goal here is twofold. The first goal is to learn about the underlying mechanism that makes all this work: we're going to see the Transformer, which is the foundational architecture behind it. The second is to know how these LLMs are trained and where they are applied.

In case you're still wondering if this class is good for you, I would say that it is great for people who in general have an interest in this field: either because you want to make it your career, if you want to be, I don't know, a research scientist or an ML scientist; or because you want to develop a personal project that relies on LLMs to some extent and want to know the caveats, what works and what doesn't; or because you're in a separate field and you just want to know how this whole AI, GenAI, LLM thing works and how you can apply it to your domain.

Okay. Now, in terms of prerequisites, I would say that at a very minimum you should have some foundations in ML, so basically know how a model is trained and what a neural network is, and also some basics in linear algebra, so how matrices are multiplied, for instance. But even if your competency in these fields is still developing, it's fine. We will still be here to help you out. This is just the ideal set of prerequisites.

Cool. Still on the logistics: this class will be held every Friday from 3:30 to 5:20, and it will be held here. This class is two units, and you have the choice to take it either for a letter grade or credit/no credit.

As you can tell from the setup, we're recording this class. If for some reason you cannot attend this time slot, we'll make sure to make the recordings available either tonight, so every Friday night, or on Saturday.

In terms of grades, what we're doing this quarter is to have two exams. One is the midterm, which will happen during our fifth session, on October 24th. The second is the final exam, which will be held in the week of December 8th. The date is still TBD, so we'll let you know.

Cool. Every time we have a lecture, we'll be posting the slides and the recordings on the website. In case you're interested, we also have the syllabus there, so you can get an idea of the topics we'll be talking about.

The class textbook is the Super Study Guide: Transformers & LLMs. We have a copy here in case you want to take a look. A lot of the concepts that we cover in this class are in the book, so it's a helpful way to follow along as well.

We also made a very short, condensed version of this whole class that we call the VIP cheatsheet. It's available on GitHub in case you're interested. We have also translated it into a number of languages. By the way, if your language is not there, let us know, and we'd be happy to work on that together.

Okay, cool. I think this is the last thing on the logistics part. In terms of announcements, we'll be posting things on Canvas. In case you have any questions, you can of course reach out to us, but there's also a tab on Canvas called Ed, which I'm sure you're familiar with. Just click on that and post your question, and Shervine and I will respond. And to reach out to us, you have this mailing list, or, you know, there are just two of us, so just ping us.

Cool. On the logistics, do we have any questions so far? One thing I forgot to mention: given that we're recording this class, if you ask a question, it may not be super clear to the viewer what your question was. So I'm going to make an effort to repeat your question. It will sound weird, but I'll try not to forget.

So yeah, any questions so far on the logistics?

Yep.

So the question is whether there are coding parts in the exams. The answer is no. The exams will purely focus on concepts that we see in class, and they're not meant to trap you. If you follow the class, go through the slides, and know the concepts that we see, you should be fine.

Yeah.

Oh yeah. The question is: if you're waitlisted, what do you do? From experience, a lot of people are still finalizing their schedule. Some people will drop, some won't. In case you're still waitlisted, come talk to us, but I'm pretty confident it's going to be okay, because I think the waitlist right now is about six people. So I think you should be fine. Cool.

Yeah.

They will be on the website, and we'll make sure to also post the link on Canvas. So the question was where the slides are, and they're on the website.

Cool. Yeah.

So the question is on the weighting of the exams. There is no homework, so 50% is the midterm and 50% is the final, and no weight comes from that in particular. If this slot conflicts with something, just keep in mind that we are recording this, so it's fine if you cannot attend, let's say, a session. Yeah?

>> Sorry?

Oh, the question is whether the final is about just the second half of the class. We have not written the exam yet, but I think this is something we're considering. So yeah, the final is probably going to be about the second half of the topics.

Cool. Okay, long story short: 50% midterm, 50% final exam, and yeah, it's a fun class.

Cool. With that, I'm going to slowly start the class. Another thing I wanted to mention: every time we talk about something, you will see a source at the bottom of the slide. It's there first to credit whatever we're quoting, but also for you to dig into that material a little more in case you're interested, because of course we have only two hours per week and only nine or ten weeks. That's nowhere near enough time for us to cover everything.

The second disclaimer is that you will see the field is full of abbreviations. I myself was completely scared of them when I started. But hopefully by the end of the class you will have a mental mapping of what these abbreviations mean and what they correspond to. If you have that mental mapping by the end of the class, then we'll know we did a good job.

So with that, let's start. We will start at a very high level, because I'll just assume we're starting from scratch, and we're going to talk about NLP in general. NLP is going to be our first abbreviation: it stands for natural language processing, and it is a field that is about manipulating text, just computing things with text.

At a very high level, we can classify NLP tasks into three buckets.

The first bucket is what we call classification. We have a text as input, and we want to predict something. One example: you have a movie review and you want to predict whether the sentiment is positive, negative, or neutral. You can also have intent detection, which is about knowing what the person wants to do. Suppose you say "I want to create an alarm for tomorrow": the intent here is "create an alarm". Another one is language detection: for instance, if you write in French, you want to detect that the text is in French. And there is also topic modeling.

The second category is what we call multi-classification. We still have a text as input, but this time we predict more than one thing. There are a number of tasks in that bucket as well. One that is very popular is called named entity recognition, a.k.a. NER. Given an input text, we want to label some specific words, for instance identifying whether something is a location, a time, and so on.

Then you have some other tasks that are a little more on the linguistic side. I think they're less trendy now, but 10 years ago they were something people would study a lot: part-of-speech tagging, which is about figuring out which word is a noun, a verb, etc., or parsing-related tasks such as dependency or constituency parsing.

Then the last bucket, which is very popular these days, is the generation bucket. You have text as input and you also have text as output, and here the length can be variable, meaning you don't know beforehand what the length of your output text will be. There are several tasks here. For instance, you have machine translation: I have something in English and I want it in, let's say, German. You have question answering: typically the ChatGPT or Gemini assistant that you're using, where you ask a question and you get a response. And you have other tasks as well, like summarization, where you want to summarize an article, or just generating something. That something can be code or a poem; it can be a lot of things.

Cool. Now what we will do is go through these buckets one by one to illustrate what people typically deal with. We're going to start with the first one, the classification bucket, and illustrate it with the sentiment extraction task. Suppose we have the sentence "This teddy bear is so cute." We want our model to predict this to be a positive sentiment.

Typically you would use datasets built around sentiment extraction. I mentioned movie reviews, so IMDb reviews, but you also have reviews about products, so Amazon reviews, or tweets, which I guess are now called X posts.

The way you would evaluate such outputs is typically with traditional classification metrics. You have accuracy, which is the percentage of observations that you correctly predicted.

But you also have two key metrics that I'm just going to recall, since I'm not sure everyone knows them. One is precision: out of all the positive predictions that you made, how many were correct? The second is recall: out of all the truly positive labels, how many did you correctly predict as positive? And you have a metric called the F1 score, which takes the harmonic mean of precision and recall to give you a single number.
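
As a minimal sketch of the three definitions above: the function name and the toy labels below are assumptions made for illustration and do not come from the lecture. Label 1 stands for the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 = positive sentiment, 0 = anything else.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With these toy labels there are 3 true positives, 1 false positive and 1 false negative, so precision and recall both come out to 3/4.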

Now you may wonder: why do you need all these metrics? The short answer is that sometimes you have tasks and datasets where your classes are very imbalanced. For instance, you can have, I don't know, 99% of your dataset with a positive label and only 1% with a negative label. Here a metric like accuracy would be very misleading, because a model that predicts everything as the majority class would look like a great classifier, but that's not the case. That's why precision and recall really play a role.
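
The 99% / 1% situation can be reproduced in a few lines. This is only an illustrative sketch: the label strings and the "always predict the majority class" model are assumptions chosen to mirror the example above.

```python
# 99 positive examples and 1 negative example, as in the 99% / 1% scenario.
y_true = ["pos"] * 99 + ["neg"]
# A degenerate "model" that always predicts the majority class.
y_pred = ["pos"] * len(y_true)

# Accuracy: fraction of observations predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the rare class: how many of the true negatives were found.
neg_total = sum(t == "neg" for t in y_true)
neg_found = sum(t == "neg" and p == "neg" for t, p in zip(y_true, y_pred))
neg_recall = neg_found / neg_total

print(f"accuracy={accuracy:.2f}, recall on rare class={neg_recall:.2f}")
```

Accuracy looks excellent while the rare class is never detected, which is the failure that per-class precision and recall expose.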

So that's the first one. Okay, now let's move to the second category of NLP tasks, the multi-classification category. You have an input text and you predict multiple things, and we're illustrating this with the NER task, which as I mentioned is about identifying the category of given words.

Here, for instance, we want to identify "teddy bear" as being an entity. For that you would use classification metrics, not at the sentence level but either at the token level or at the entity-type level. By that I mean: suppose you have a category, let's say location, and you want to know how well you're predicting words in that category. You would typically aggregate these metrics per category.
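
One possible way to aggregate per category at the token level is sketched below. The sentence, the tag names ("LOC", "TIME", "O") and the helper function are hypothetical choices for illustration; real NER evaluations often score whole entity spans rather than individual tokens.

```python
def per_type_precision_recall(gold, pred, entity_type):
    """Token-level precision and recall for a single entity type."""
    tp = sum(g == entity_type and p == entity_type for g, p in zip(gold, pred))
    predicted = sum(p == entity_type for p in pred)  # tokens the model tagged with this type
    actual = sum(g == entity_type for g in gold)     # tokens that truly have this type
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical sentence and labels ("O" marks tokens outside any entity).
tokens = ["Meet", "me", "in", "Paris", "at", "noon"]
gold = ["O", "O", "O", "LOC", "O", "TIME"]
pred = ["O", "O", "O", "LOC", "O", "O"]

loc_scores = per_type_precision_recall(gold, pred, "LOC")
time_scores = per_type_precision_recall(gold, pred, "TIME")
print("LOC:", loc_scores, "TIME:", time_scores)
```

Here the location is found but the time expression is missed, so the two categories get very different scores even though most tokens are labeled correctly.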

Cool.

Okay, let's go to the last category, which as I mentioned is the most popular one. This one is text in, text out. I'm illustrating it with the machine translation task, which is about translating a text from a source language to a target language. Here you have an example from English to French: "A cute teddy bear is reading."

For that it's harder to get datasets, because you need pairs of texts. There is a very popular dataset called WMT, which stands for Workshop on Machine Translation. It contains paired sequences in different languages, for instance English-French and English-German pairs coming from the European Parliament dataset.

Evaluating the performance of your model is actually a lot trickier here, because as you can imagine there are many different ways to translate something. I'm sure many of us in the room are bilingual or trilingual. That's what makes it this hard.

In the past, people have used several rule-based metrics to do that. One that you may have heard of is BLEU. BLEU stands for bilingual evaluation understudy, and it measures how well your translation matches a reference text.

Same story for ROUGE, which is actually a suite of metrics that captures this in a different way. You will see that the machine learning community is funny: in case you don't know French, "bleu" means blue and "rouge" means red, so I guess they tried to add some fun to this.

The problem with these metrics is that you always need a reference text, so you basically need labels. In practice, getting labels is very expensive: it takes a lot of time and a lot of money. We will see later in the class that with the progress the community has made in the LLM space, we can actually forgo these reference-based metrics and move towards reference-free metrics.
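
To make "comparing against a reference text" concrete, here is a simplified sketch of clipped n-gram precision, which is one ingredient of BLEU. It is not the full metric: BLEU combines several n-gram orders and adds a brevity penalty. The function name and both sentences are made up for illustration.

```python
from collections import Counter


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference (this is the "clipping").
    """
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0


score = clipped_ngram_precision(
    "the cute teddy bear reads",      # model output (made up)
    "a cute teddy bear is reading",   # reference translation (made up)
)
print(score)
```

Three of the five candidate words ("cute", "teddy", "bear") appear in the reference. A perfectly reasonable translation can still score low if it words things differently, which is the limitation discussed above.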

The last metric that people sometimes use is called perplexity. Perplexity only looks at the probabilities output by the model, and it quantifies how surprised the model is by its outputs. So for BLEU and ROUGE, the higher the better; for perplexity, the lower the better.
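
A minimal sketch of how perplexity is commonly computed, assuming the usual definition as the exponential of the average negative log-probability per token. The probability values below are invented for illustration and do not come from any real model.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Probabilities a model assigned to each token of a sequence (made-up values).
confident = perplexity([0.9, 0.8, 0.95])   # model was rarely "surprised"
unsure = perplexity([0.25, 0.25, 0.25])    # like guessing among 4 equally likely tokens
print(confident, unsure)
```

A model that assigns probability 0.25 to every token gets a perplexity of 4, as if it were hesitating between four options at each step; higher token probabilities push the value down towards 1.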

LLMs have been a hot topic since 2022, but the field actually goes way back before that year. In the 80s, as we'll see in a second, there was a class of models that had already been thought of, and in the 90s we had LSTMs, which we'll also see in a second.

The problem was that at that time we didn't have the internet and we didn't have a lot of compute, and this was one of the limiting factors that prevented models like today's from being trained.

More recently, we've had several advances. Word2vec was really one of the pioneering works in computing meaningful embeddings, and we'll see it in a second. Then of course we had Transformers, introduced in a paper published in 2017, which is basically the foundation of the models that you see today.

Then these models were scaled up, both in compute and in the amount of data used to train them, and that's how they came to be dubbed LLMs. These are more from the 2020s, and we'll see those too.

Cool. Any questions on the high level? Everyone good?

Cool. So the first question I want us to ask ourselves: what we want is a model that handles text. But models understand numbers; they don't really understand text. So we need to somehow do something with that text to make it more quantifiable, something that a model can understand.

If you look at a sentence, for instance "A cute teddy bear is reading", you first need to ask yourself how you can cut this sentence to pass it to a model. This part is called tokenization, and it basically entails cutting the text into some arbitrary unit of text.

There are several ways of doing this. The first way is doing it completely arbitrarily. Here, for instance, "a" would be one unit of text, the next piece would be another unit of text, "bear" would be another one, and so on. By the way, the unit of text is called a token, which is why the method is called tokenization.

Another way would be to just separate by words. But as always there are pros and cons. One of the goals we want to achieve is to then be able to represent these tokens in a meaningful way. One con of doing this at the word level is that you end up with words that look similar but are considered different tokens. The limitation is that you need to compute embeddings for these similar-yet-different tokens and somehow make their embeddings similar.

I'll give you an example. Suppose I have the word "bear" and then its plural form, "bears". These two words are very similar; one is just singular and the other plural. If we go with word-level tokenization, we end up with two different entities, which are simply considered different. Same with "run" and "runs", you know, variations of verbs.

For that reason, people have dug into a category of tokenizers called subword tokenizers, which is about leveraging the roots of words: finding the common roots that these words share. For instance, for "bear" and "bears", the "bear" part would be shared.

The pro is that you get to leverage the roots of words. The con is that your sequence will be longer. We will see later on why this is a con, but I can give you a preview: the complexity of these models is also a function of the sequence length. The more tokens you have to process, the more time it takes for your model to run, because it needs to process all these tokens.

So the pro is that it leverages the roots of words; the con is that it makes your sequences longer.

Okay, there is a last category of ways of tokenizing, which is going to the character level, just taking all the characters. You and I, when we write a message, sometimes make misspellings, and with subword tokenization you may not be able to recognize a word that has been misspelled. This is something that a character-level tokenizer can take into account. But the problem here is that your sequence is much, much longer, which makes your model take much more time to process it. That's one con. The other con is that when you want to represent each of these tokens, it's very hard to know what the representation of a letter really means. What does the representation of the letter "u" mean? Very hard.
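
The three granularities can be compared side by side with a toy sketch. Everything here is an assumption for illustration: the tiny subword vocabulary is hand-picked, and the greedy longest-match rule is just one simple way to split a word into known pieces. It is not how any particular production tokenizer is trained.

```python
def word_tokenize(text):
    """Word level: split on whitespace."""
    return text.split()


def char_tokenize(text):
    """Character level: every character (including spaces) is a token."""
    return list(text)


def subword_tokenize(word, vocab, unk="[UNK]"):
    """Subword level: greedily take the longest known piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:          # no known piece starts here
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"bear", "s", "read", "ing"}   # hypothetical subword vocabulary
sentence = "bears reading"

word_tokens = word_tokenize(sentence)
subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, toy_vocab)]
char_tokens = char_tokenize(sentence)

print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
print(len(char_tokens), char_tokens)
```

The subword split shares the piece "bear" between "bear" and "bears", at the cost of a longer sequence than the word-level split; the character-level sequence is the longest of the three.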

Cool. Here is a quick recap. Word level is a super naive, super simple way of dividing your text into units. But the problem, as we mentioned, is that we do not leverage the roots of words. And there's a term I did not mention yet. Whenever you cut something at inference time, when you want to make a prediction, one prerequisite is that each token must have been seen at training time: you need to have it in your training set. The problem is that if, at inference time, you cut your text into words and you encounter a word that you have not seen at training time, you need to mark it as unknown. This is called OOV, out of vocabulary.

Luckily, the subword-level tokenizer mitigates that problem. You have a lower risk of OOV, but it can still happen, and as we mentioned, on the pro side you leverage the roots of words.

Then character level is robust to misspellings and casing errors, but the problem is that it makes computations much slower: your sequences become very, very long, which also makes your inference time much higher.
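
The out-of-vocabulary issue from the recap can be sketched for the word-level case as follows. The training vocabulary, the "[UNK]" marker and the function name are assumptions for illustration.

```python
# Words seen at training time (a hypothetical, tiny vocabulary).
train_vocab = {"a", "cute", "teddy", "bear", "is", "reading"}


def encode_words(text, vocab, unk="[UNK]"):
    """Word-level lookup: any word not seen at training time becomes unknown."""
    return [word if word in vocab else unk for word in text.split()]


encoded = encode_words("a cute teddy bear is sleeping", train_vocab)
print(encoded)
```

The unseen word "sleeping" collapses to the unknown marker and its meaning is lost. With a subword vocabulary, pieces of an unseen word may still be recognized, which is why the OOV risk is lower there.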

Does that sound good? This is really the foundation of how to handle text. Does that make sense overall?

Cool.

Okay. So what we did is take an input text and cut it into parts, which are the tokens. In order for our model to understand these tokens, we need to find a representation for each of them, and that's what we're going to look at now. This is called word representation, or, more correctly, token representation: we want to find a way to represent each of these tokens.

The simple and naive way to do this would be to just assign a one-hot vector to each word, or each token. For instance, suppose we have a vocabulary of three tokens: "book", "soft", and "teddy bear". We would have, let's say, "soft" as the (1, 0, 0) vector, "teddy bear" as the (0, 1, 0) vector, and "book" as the (0, 0, 1) vector. This is called one-hot encoding, or OHE, as you'll typically see.

So cool, this is a way to represent our tokens. But what people want to do is compare these tokens, to see which ones are more similar to which other ones.

A common similarity measure that people use is called cosine similarity. Not sure if you have heard of it. You can think of it as looking at the angle these vectors make in the n-dimensional space. If they are pointing in the same direction, then maybe they're similar; if they're orthogonal, maybe they're kind of independent; and if they're pointing in opposite directions, then maybe they're opposite. That's basically the mental model we want to have.

The problem is that if you represent your tokens in a one-hot fashion, you end up with all your vectors being orthogonal to one another. So that's the problem.
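
The orthogonality problem can be checked directly. This sketch uses the three-token vocabulary from the example above; the helper function is a plain-Python implementation of cosine similarity written for illustration.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors: a single 1 at the token's index, 0 everywhere else.
vocab = ["soft", "teddy bear", "book"]
one_hot = {
    token: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, token in enumerate(vocab)
}

sim_teddy_soft = cosine_similarity(one_hot["teddy bear"], one_hot["soft"])
sim_teddy_book = cosine_similarity(one_hot["teddy bear"], one_hot["book"])
sim_teddy_teddy = cosine_similarity(one_hot["teddy bear"], one_hot["teddy bear"])
print(sim_teddy_soft, sim_teddy_book, sim_teddy_teddy)
```

Every pair of distinct tokens gets a similarity of 0, so one-hot vectors cannot express that "teddy bear" is closer to "soft" than to "book".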

Ideally, what we want is for tokens that mean the same or similar things to have a high similarity, and for tokens that are not similar, that are about different things, to be more orthogonal.

Here, just for illustrative purposes: teddy bears are soft, so you want "teddy bear" and "soft" to have a high similarity, and for "teddy bear" and "book", which are kind of independent, you want the similarity to be closer to zero.

So this is what you have with one-hot encoding, and that is what you want.
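
For contrast, here is a sketch of the desired behavior with dense vectors. The two-dimensional values are hand-picked purely for illustration; they are not learned embeddings and do not come from the lecture.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy vectors (assumption): related tokens point in similar directions.
embeddings = {
    "teddy bear": [0.9, 0.8],
    "soft": [0.8, 0.9],
    "book": [-0.7, 0.7],
}

sim_related = cosine_similarity(embeddings["teddy bear"], embeddings["soft"])
sim_unrelated = cosine_similarity(embeddings["teddy bear"], embeddings["book"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these values, "teddy bear" and "soft" have a similarity close to 1, while "teddy bear" and "book" are close to 0, which is the pattern a good representation should produce.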

Yeah? Sorry?

Oh, I see. The question is why do you care about the norm. So cosine similarity is actually normalized by the norms: it's the dot product divided by the product of the norms. Oh, you mean why did I just put the dot product here instead? I see, so your question is why do we not care about the norm. Cool, I guess the viewers know the question.

These measures are all ways to try to capture this notion of similarity. So why do you not care about the norm? I guess it's how people have tried to quantify that. You would need to look at how your vectors are trained and whether the norm would be indicative of something. The best answer I can give you is that this is a measure; it is not the perfect measure. People may also use the dot product as a measure, but I don't have a great answer for you.

But as long as you capture how these vectors are pointing, that's fine. Typically what you care about is the angle between them, and you don't really take the norm into consideration.
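To make the exchange about norms concrete, the sketch below compares the dot product and cosine similarity when one vector is rescaled. The vectors are arbitrary illustrative values; the only point is that the dot product changes with the norm while the cosine does not.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0]
v = [2.0, 1.0]
v_scaled = [10 * x for x in v]  # same direction as v, ten times the norm

# The dot product grows with the norm of its inputs ...
print(dot(u, v), dot(u, v_scaled))
# ... while cosine similarity depends only on the angle.
print(cosine_similarity(u, v), cosine_similarity(u, v_scaled))
```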

Cool. Any other questions? Yeah.

It's a great question. So the question is about the size of the vocabulary, how that informs the choice between word and subword tokenization, and how that changes across languages.

I would say it really depends, first of all, on the task that you're trying to achieve. If your task is about just one language, you will just work with that language. You would typically go with a subword tokenizer because of the reasons that we mentioned: subwords are a nice trade-off between being able to identify words by their roots and running less into the OOV risk.

In terms of size, people have tried different things. Typically for English you would target a vocabulary size on the order of tens of thousands. But nowadays the models are multilingual and they also deal with code, so you will see that the vocabulary size is sometimes on the order of hundreds of thousands.

With respect to Chinese, you have this difference in the characters that you're using. For Latin-script languages it's the alphabet we're all accustomed to, but for the other ones you have something similar, just with the characters of the target language. So I would say, as an order of magnitude: tens of thousands for one language, hundreds of thousands if it's multilingual. These are the orders of magnitude that you want to target.

Cool. Yeah?

Great question. So the question is how do you get those embeddings. That's actually the next slide, so we're going to talk about this.

Great. So now that we know that one-hot encoding is not a good way to represent tokens, what we want to do is learn those embeddings from the data.

I mentioned that there was this paper that came out in the 2010s, I think it was 2013, called word2vec. The reason it was so popular is that they showed a very intuitive and interpretable way of looking at these embeddings. They were saying something like: king is to queen what this is to that, or Paris is to France what Berlin is to Germany. So there was basically a way to make sense of the embeddings. Now the question is, how did they do that?

They had two ways of computing these embeddings. One was called continuous bag of words, the other was called skip-gram. But they both rely on the same idea, which is: let's leverage text that we have and try to predict something that is part of the text based on, let's say, the context.

For instance, with continuous bag of words, you take into consideration the words that are around a given target word, and your goal is to predict that target word. Skip-gram is kind of the opposite: you start from a target word and you want to predict the words that are around it.

This kind of task is commonly called a proxy task, because at the end of the day, what we care about in this exercise is not necessarily to predict the next word, or at least not yet. Our goal is to learn representations of these words that are meaningful.
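The two proxy tasks differ only in how training examples are cut out of the text. The sketch below builds both kinds of examples from a short token list; the helper names, the window size, and the choice to treat "teddy bear" as one token are assumptions made for illustration, not details of the original word2vec implementation.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) with up to `window` tokens on each side of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: input is the list of context words, label is the target word.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: input is the target word, label is one context word at a time.
    return [(target, c) for target, context in context_windows(tokens, window) for c in context]

tokens = ["a", "cute", "teddy bear", "is", "reading"]

print(cbow_examples(tokens, window=1)[1])
print(skipgram_examples(tokens, window=1)[:3])
```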

The idea here is that if you have a model that somehow knows how to predict, let's say, the next word, then your model has some understanding of how language works, which is basically what you want. You want an embedding that is reflective of what language is: king and queen are similar, Paris is the capital of France. You want to have these associations embedded in the representation.

Let's go through a very simple example of what that looks like.

In our example, let's suppose that our proxy task is about predicting the next word. What we take is a very vanilla neural network model, which receives a vector of size V, applies a multiplication and a bias term to get a hidden state, and then another set of multiplications to get our final vector.

So it's a very simple neural network. The input is of size V. The hidden layer is of size D, which is typically much smaller than the vocabulary. The vocabulary is typically tens of thousands or hundreds of thousands, while D is typically in the hundreds; 768, for instance, is one example of a dimension. So it's much, much smaller.

What we're trying to do is learn the word representation through this proxy task. We're going to take a word as input and predict the next word. Let's go with the first word of the sequence. By the way, I use "token" and "word" interchangeably.

Let's suppose we have the word "a" and we want to predict the next word, which is the word "cute".

What we do is take the word "a", take its one-hot encoding representation, and pass it through the network. If you're familiar with neural networks, here you have a multiplication between a matrix and this vector. So you get a hidden state representation, which is a vector of size D. Here let's suppose it's 2, 1.9, so D = 2.

Then you have another pass here, and after the softmax you get a set of probabilities over what the next word is. In this example we have a vocabulary of size six. The first word is predicted with probability 0.2, the second word with 0.4, and the other words are all 0.1 in this example.
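The forward pass just described can be sketched in a few lines of plain Python. The sizes V = 6 and D = 2 follow the toy example, but the weights below are random placeholders (an assumption), so the printed numbers will not match the ones on the slide.

```python
import math
import random

V, D = 6, 2  # toy vocabulary size and hidden size from the example

random.seed(0)
# Hypothetical, randomly initialised weights; a trained model would have learned values.
W1 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]  # V x D
b1 = [0.0] * D
W2 = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(D)]  # D x V

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x):
    # Hidden state: x @ W1 + b1, a vector of size D.
    hidden = [sum(x[i] * W1[i][j] for i in range(V)) + b1[j] for j in range(D)]
    # Output: softmax(hidden @ W2), a distribution over the V words.
    logits = [sum(hidden[j] * W2[j][k] for j in range(D)) for k in range(V)]
    return hidden, softmax(logits)

hidden, probs = forward(one_hot(0, V))  # index 0 stands in for the word "a"
print(hidden)
print(probs)
```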

Let's suppose that we want to maximize our prediction for the second word of the vocabulary, which is the one at 0.4. We compare the prediction with 0 1 0 0 0 0, which is the one-hot representation of the second word of the vocabulary, and then we do the backprop and update the weights.

Not sure if everyone is familiar with that part, but the idea is that once you obtain a prediction, you compute a loss, typically cross-entropy, which determines how far off you are from the true answer. Based on that difference, you update the weights in order to make your prediction closer to the truth.
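For the numbers in this example, the loss can be computed by hand. The sketch below evaluates the cross-entropy between the predicted distribution and the one-hot target; it also shows the standard result that, for softmax followed by cross-entropy, the gradient with respect to the logits is the prediction minus the target. The variable names are chosen for this illustration.

```python
import math

probs = [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]  # predicted distribution from the example
target = [0, 1, 0, 0, 0, 0]              # one-hot vector for the second word

def cross_entropy(probs, target):
    # Only entries with non-zero target probability contribute.
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

loss = cross_entropy(probs, target)  # equals -log(0.4)

# Gradient of the loss with respect to the logits (softmax + cross-entropy).
grad_logits = [p - t for p, t in zip(probs, target)]

print(loss)
print(grad_logits)
```

The negative entry at the target index means a gradient-descent step raises that logit, which pushes the 0.4 toward 1.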

So that's what you do, and then you repeat that process. Let's suppose you take the word "cute", which as we said is the second word in the vocabulary, so its one-hot encoding representation is 0 1 0 0 0 0. You go through the network, you get a hidden state, say the vector is 0.8, 0.4, and you do that again. What you want is to predict the next token, and here it's "teddy bear".

You see that your model in this example is predicting the next word to be roughly uniform, but you want to maximize the probability for "teddy bear". So you go about doing this again and again for all the words. At the end of the day, you obtain a model that learns how to predict the next word, which is the proxy task.

What you're going to do is take the representation that the model learns, which is the green units. Every time you have a word, you represent it as a one-hot encoding, you multiply it with these weights, and you obtain the green representation. That is your word representation.
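Multiplying a one-hot vector by the first weight matrix simply selects one row of that matrix, which is why the learned weights can be read as an embedding table. The sketch below checks this on a toy table; the vocabulary ordering and all weight values are invented for the illustration.

```python
# Hypothetical 6-word vocabulary and made-up "trained" weights with d = 2.
vocab = ["a", "cute", "teddy bear", "is", "reading", "book"]
W1 = [
    [0.1, -0.3],
    [0.7, 0.2],
    [0.9, 0.4],
    [-0.2, 0.5],
    [0.3, -0.8],
    [-0.6, 0.6],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(word):
    # one_hot(word) @ W1, written out explicitly.
    x = one_hot(vocab.index(word), len(vocab))
    return [sum(x[i] * W1[i][j] for i in range(len(vocab))) for j in range(len(W1[0]))]

def embed_by_lookup(word):
    # Equivalent shortcut: read the corresponding row of W1.
    return W1[vocab.index(word)]

print(embed_by_matmul("teddy bear"), embed_by_lookup("teddy bear"))
```

In practice, libraries implement embeddings as this row lookup rather than as an explicit matrix product, since the one-hot vector is almost entirely zeros.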

Does that make sense? Yeah.

Great question. So the question is about what V corresponds to and why there are only six. Yes, in this example we only have six possible words, which is basically the vocabulary size. It's very much a toy example, because in practice there are many more.

That's one of the challenges with language. You can technically have many variations of words, which is why, if you take a word-level approach to divide your text into tokens, you can end up with a vocabulary that's very big, because you need to account for all the variations of a given word.

The other thing I want to point out: let's suppose you have a vocabulary of size six, and these are the six words that you saw at training time. What happens if at inference time you have a word that you have not seen at training time? The answer is that people typically reserve a spot for what they call an unknown token, or out-of-vocabulary token. You can think of it as a bucket for everything that we were not able to identify.

So if at inference time you have tokens that you are not able to identify, they will all take that representation, which is the unknown token representation.
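A minimal sketch of the unknown-token idea: every token missing from the vocabulary is mapped to one reserved id. The `[UNK]` spelling, the vocabulary contents, and the `encode` helper are assumptions made for this example.

```python
UNK = "[UNK]"  # reserved spot for anything not seen at training time
vocab = ["a", "cute", "teddy bear", "is", "reading", UNK]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(tokens):
    """Map tokens to ids, sending every unseen token to the [UNK] id."""
    unk_id = token_to_id[UNK]
    return [token_to_id.get(tok, unk_id) for tok in tokens]

ids = encode(["a", "cute", "panda", "is", "sleeping"])
print(ids)  # "panda" and "sleeping" share the same id
```

Because all unseen words share one id, they also share one embedding, so the model cannot tell them apart.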

By the way, this is something the word-level tokenizer has trouble with, because at the word level you have a much bigger chance of running into out-of-vocabulary tokens. The subword level has a lower chance, and at the character level you don't have that problem. Does that answer your question?

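CME 295: Transformers & Large Language Models - Orientation and NLP Fundamentals

Executive Summary

CME 295 is a 2-unit Stanford course that covers NLP broadly, centered on Transformers and LLMs (large language models). It aims to combine theory (Transformers, attention, language modeling) with practical use of LLMs. The class assumes a basic ML and linear algebra background, and the grade is based solely on two exams (a midterm and a final, no coding).

Key Takeaways

Course overview
- Course: CME 295 - Transformers and Large Language Models
- Instructors: Afshine & Shervine (twin brothers with hands-on LLM experience at Uber, Google, and Netflix)
- Background: a yearly NLP workshop (2021 to 2024), then a regular Stanford course starting last spring; this is the second offering
Intended audience
- Students who want a research or industry career in LLMs/NLP
- Students who want to build personal projects based on LLMs
- People from other fields who want to apply LLMs/GenAI to their own domain
Prerequisites
- Basic ML concepts: how a model is trained, what a neural network is
- Basic linear algebra: matrix multiplication in particular
Logistics and grading
- Time: Fridays 3:30-5:20, recordings provided
- Credit: 2 units, Letter Grade or Credit/No Credit
- No homework, only two exams:
  - Midterm: October 24 (week 5); final: around December 8 (date to be announced)
  - No coding questions; focused on lectures, slides, and concepts
  - Weighting: midterm 50% + final 50%
  - The final is expected to focus on the second half of the course
Materials and communication
- Slides, recordings, syllabus: course website + links on Canvas
- Textbook: “Super Study Guide – Transformer LLMs”
- Summary material: VIP Cheat Sheet (GitHub, translated into multiple languages)
- Announcements: Canvas; questions: the Ed tab on Canvas, email/mailing list
Core lecture content
- Three categories of NLP tasks: classification / multi-classification / generation
- Tokenization: comparison of word, subword, and character levels
- Embeddings: limits of one-hot encoding, word2vec (CBOW, skip-gram), cosine similarity
- Sequence models: RNN, LSTM, long-range dependencies and the vanishing gradient problem
- Introduction of attention: directly referencing the entire past
- Transformer architecture:
  - Encoder-decoder structure
  - Self-attention / cross-attention / masked self-attention
  - Multi-head attention, position embeddings, FFN
- Training techniques and metrics: label smoothing, perplexity, BLEU/ROUGE, etc.

Key point: this course is an introductory-to-intermediate NLP/LLM class that builds a systematic understanding of how LLMs work (Transformers, attention) and how they are trained and used.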

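Detailed Summary

1. Course and instructor introduction, course goals

Instructor background
- Afshine & Shervine: twin brothers
- Education:
  - Centrale Paris (French engineering school)
  - Afshine: MIT
  - Shervine: Stanford ICME master's program
- Experience:
  - Work on large language models across Uber → Google → Netflix
- Ran NLP/LLM workshops for several years → growing demand led to a regular course
Course purpose
- Go beyond treating LLMs as "the trendy tool" and:
  1. Understand the Transformer architecture and its internal mechanisms
  2. Understand how LLMs are trained and where they are applied
- Audience:
  - Aspiring researchers / ML scientists
  - Developers of LLM-based applications or personal projects
  - People from other domains who want to know how to apply GenAI/LLMs
Difficulty and required background
- Basic ML: supervised learning, neural network training, loss and backpropagation
- Basic linear algebra: vectors and matrices, matrix multiplication
- Help is provided in class even if the fundamentals are not perfect

2. Logistics and grading

Time and format
- Fridays 3:30-5:20, same classroom
- Lecture recordings: uploaded Friday night or Saturday each week
- Time conflicts can be handled by watching the recordings
Enrollment options
- 2 units
- Letter Grade or Credit / No Credit
Grading
- Homework: none
- Two exams
  - Midterm: fifth lecture (10/24)
  - Final: week of 12/8, exact date to be announced
- Scope and format
  - No coding questions
  - Conceptual questions based on the lectures and slides
  - The final will most likely focus on the second half
- Weighting
  - Midterm 50%
  - Final 50%
Materials and communication
- Slides and recordings:
  - Uploaded to the course website
  - Links also shared on Canvas
- Textbook
  - “Super Study Guide – Transformer LLMs”
  - Covers most of the lecture content and concepts
- Summary material
  - VIP Cheat Sheet (public on GitHub, translations available)
  - Translations into missing languages can be suggested
- Announcements
  - Posted on Canvas
- Question channels
  - Post in the Ed tab on Canvas → the instructors answer
  - Email/mailing list, or contacting the instructors directly, also works
- Questions may be hard to hear on the recording, so the instructor will repeat them
Registration and waitlist
- Waitlist: about 6 students
- Most are expected to get in through schedule changes and drops
- If still waitlisted, talk to the instructors directly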

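3. NLP overview: tasks and metrics

3.1 Three broad categories of NLP (Natural Language Processing)

Classification: text → a single label
- Examples:
  - Sentiment analysis: movie review → positive/negative/neutral
  - Intent detection: "set an alarm for tomorrow" → intent = "create alarm"
  - Language detection: identify which language a sentence is in
  - Topic classification, etc.
- Metrics:
  - Accuracy: fraction of all predictions that are correct
  - Precision: among items predicted positive, the fraction that are truly positive
  - Recall: among truly positive items, the fraction predicted positive
  - F1 score: harmonic mean of precision and recall
- Rationale:
  - Under class imbalance (e.g. 99% positive), accuracy alone can mislead
    → precision/recall/F1 matter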
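The class-imbalance point can be checked with a small sketch. The data set below is invented for illustration: 99 examples of one class, 1 of the other, and a classifier that always predicts the majority class. The helper `classification_metrics` is a hypothetical name, and the metrics are computed for the rare class, which is the one we usually care about.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision/recall/F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy imbalanced data: label 1 is the majority (99%), label 0 is the rare class.
y_true = [1] * 99 + [0]
y_pred = [1] * 100  # a classifier that always predicts the majority class

accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred, positive=0)
print(accuracy, precision, recall, f1)
```

Accuracy looks excellent, while recall and F1 on the rare class show that the classifier never finds it.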
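Multi-classification (multi-label / token-level classification): text → multiple labels
- Representative tasks:
  - Named Entity Recognition (NER):
    tag entities such as people, locations, times, and organizations in a sentence
  - Part-of-speech tagging: noun, verb, adjective, etc.
  - Parsing (dependency / constituency), etc.
- Evaluation:
  - Precision/recall/F1 computed per token or per entity type
Generation: text → text (variable length)
- Examples:
  - Machine translation: EN → FR/DE, etc.
  - Question answering / conversational models: assistants such as ChatGPT and Gemini
  - Summarization: documents, articles
  - Code, poem, and story generation, etc.
- Data:
  - Translation requires paired data (e.g. WMT, Europarl EN-FR/EN-DE)
- Evaluation:
  - BLEU: n-gram overlap with reference translations (higher is better)
  - ROUGE: n-gram-based metric for summarization and translation quality (also higher is better)
  - Perplexity:
    how surprised (uncertain) the model is by the actual text
    → lower is better
- Limitations:
  - BLEU/ROUGE need reference texts → labeling is costly
  - Since the rise of LLMs, reference-free evaluation (LLM-as-a-judge) is an active research area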
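Of the generation metrics listed above, perplexity is the easiest to compute by hand: it is the exponential of the average negative log-probability the model assigns to the tokens that actually occur. The probabilities in this sketch are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model assigned to each actual next token.
confident = perplexity([0.9, 0.8, 0.95])
uniform_over_six = perplexity([1 / 6] * 4)  # model guessing uniformly over 6 words

print(confident, uniform_over_six)
```

A model that guesses uniformly over six words has a perplexity of about six, which is one way to read the metric: the effective number of choices the model is hesitating between.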

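4. Text processing basics: tokenization and embeddings

4.1 Tokenization

Goal: split text into the smallest units (tokens) a model can handle
Example sentence: "a cute teddy bear is reading"

Comparing tokenization levels

Word-level
- Split into words on whitespace
- Pros:
  - Simple and intuitive
- Cons:
  - Every word variant is treated as a separate token
    - e.g. "bear" vs "bears", "run" vs "runs"
  - Severe out-of-vocabulary (OOV) problem
    - Words unseen during training must be mapped to an [UNK] token
  - Relations between similar words must be learned separately through embeddings
Subword-level
- Split words into smaller parts (roots, prefixes, suffixes)
- Examples:
  - "bears" → "bear" + "s"
  - "running" → "run" + "ning", etc.
- Pros:
  - Shared roots let related words share meaning
  - Fewer true OOV cases (new words can be expressed as subword combinations)
- Cons:
  - Sequences are longer than with word-level tokenization
  - Longer sequences translate directly into more compute and memory
Character-level
- Each character is a token
- Pros:
  - Robust to typos, spelling variants, and casing differences
  - No OOV in principle
- Cons:
  - Sequences become extremely long
  - Character embeddings are hard to interpret semantically, and
    the model must learn all higher-level structure (words, phrases), which is costly

Summary: most production LLMs use a subword-level tokenizer to
balance meaning, OOV risk, and compute.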
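The three tokenization levels can be compared on the example sentence with a short sketch. The subword splitter below is a toy greedy longest-match routine over a hand-written vocabulary; it is an assumption for illustration and not how production tokenizers such as BPE are trained.

```python
def word_tokenize(text):
    return text.split()

def char_tokenize(text):
    return list(text)

def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word against a fixed subword vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched at this position: fall back to the unknown token.
            return ["[UNK]"]
    return pieces

sentence = "a cute teddy bear is reading"
subword_vocab = {"bear", "s", "read", "ing", "teddy"}  # hypothetical subword vocabulary

print(word_tokenize(sentence))
print(len(char_tokenize(sentence)))
print(subword_tokenize("bears", subword_vocab), subword_tokenize("reading", subword_vocab))
```

The word-level split is short but brittle, the character-level split never fails but is much longer, and the subword split sits in between.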

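Vocabulary size
- Single language (e.g. English): tens of thousands
- Multilingual LLMs that also cover code: hundreds of thousands
- Languages such as Chinese use a different writing system, so subword/character strategies are built on those characters

4.2 Token representations (embeddings) and similarity

(1) Limits of one-hot encoding

With vocabulary size V, each token is represented as a vector of length V
- Example: vocab = {soft, teddy_bear, book}
  - soft → [1, 0, 0]
  - teddy_bear → [0, 1, 0]
  - book → [0, 0, 1]
Problems:
- All vectors are mutually orthogonal (cosine similarity 0) →
  "soft" and "teddy_bear" are related in reality but unrelated numerically
- Meaning and similarity are not captured

(2) Distributed representations and cosine similarity

Goal:
- Give semantically close words vectors that point in similar directions
- Examples:
  - teddy_bear ↔ soft: high similarity
  - teddy_bear ↔ book: low similarity
Cosine similarity:
- Measures similarity by the angle between two vectors
- Looks only at direction; length (norm) is cancelled by normalization

(3) Word2vec: CBOW and skip-gram

Idea:
- Learn word embeddings from large amounts of raw text via a proxy task
  that captures "patterns of language"
Two proxy tasks:
1. CBOW (Continuous Bag-of-Words)
  - Predict the center (target) word from the surrounding (context) words
2. Skip-gram
  - Predict the surrounding (context) words from the center word
Characteristics:
- The real objective is not "accurate next-word prediction" but
  the embeddings (weights) that emerge in the process
- As a result, interpretable vector arithmetic becomes possible, such as
  - king - man + woman ≈ queen
  - Paris - France + Germany ≈ Berlin

(4) A simple example: "a cute teddy bear is reading"

Proxy task: next-word prediction
Model:
- A simple neural network with input dimension V and hidden dimension d
- Input: one-hot("a") → W1 → hidden h → W2 + softmax → next-word distribution
Training:
- Minimize the cross-entropy loss between the predicted distribution and the actual next word (one-hot)
- Update W1 and W2 by backpropagation
- Repeat over all tokens
Extracting embeddings:
- Use the learned W1 (or the hidden-layer representation) as the word embeddings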

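5. Sequence models: RNN, LSTM, and their limits

5.1 RNN (Recurrent Neural Network)

Purpose:
- Obtain a sentence representation that reflects word order
Structure:
- At each time step t:
  - Input: current word embedding x_t, previous hidden state h_{t-1}
  - Output: new hidden state h_t and, if needed, an output y_t
Interpretation:
- h_t is a vector summarizing the context up to position t
Uses:
- Classification: last hidden state h_T → sentence embedding → class prediction
- Token classification: h_t at each step → token tagging (NER, etc.)
- Generation (translation, etc.):
  - Encode the whole sentence with an encoder RNN → use the final hidden state as context
    → generate the output sentence with a decoder RNN

5.2 Problems: long-range dependencies and vanishing gradients

All information accumulates in a single hidden state
As sentences get longer:
- Old information is poorly retained (long-range dependency problem)
- Backpropagation involves a long chain of multiplications along the time axis →
  repeated products of terms with 0 < |value| < 1 → the gradient converges to 0 (vanishing gradient)
  - Conversely, terms larger than 1 lead to exploding gradients
Consequences:
- The longer the sentence, the harder it is to remember its beginning
- Training also becomes unstable and slow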
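The effect of the repeated product can be seen with plain arithmetic. This sketch stands in a single constant factor for the per-step gradient term, which is a deliberate simplification of backpropagation through time; the factors 0.9 and 1.1 are arbitrary illustrative values.

```python
def repeated_product(factor, steps):
    """Product of `steps` identical per-step factors, as a stand-in for a gradient chain."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

for steps in (10, 50, 100):
    vanishing = repeated_product(0.9, steps)   # factor below 1: shrinks toward 0
    exploding = repeated_product(1.1, steps)   # factor above 1: grows without bound
    print(steps, vanishing, exploding)
```

Even factors close to 1 produce gradients that differ by many orders of magnitude after a hundred steps, which is why long sequences are hard for a plain RNN.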

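5.3 LSTM (Long Short-Term Memory)

An improved RNN:
- Adds a cell state c_t alongside the hidden state
- Uses several gates (input, forget, output) to
  - selectively control what to remember and what to discard
Goal:
- Mitigate the long-range dependency problem
However:
- Still relies on sequential computation → hard to parallelize, slow to train
- Not a complete fix for very long sequences

6. Attention: the key idea for handling long-range dependencies

6.1 The attention concept

Motivation:
- RNNs/LSTMs only pass past information along step by step →
  it is hard to reference "an important past word" directly.
The idea of attention:
- At the position being predicted,
  look directly at the representations of the whole sequence and compute a weighted sum
Example: translation
- Input: "a cute teddy bear is reading"
- Output: "un ours en peluche lit"
- When generating a given French word,
  - put a large weight directly on the corresponding English word

6.2 Self-attention & Q-K-V (Query, Key, Value)

Self-attention:
- A mechanism that lets every token in a sentence reference every other token
- Example: when building the representation of "teddy bear",
  also take other tokens such as "cute" and "reading" into account
Q (Query), K (Key), V (Value):
- Query: a vector representing "what I want to know right now"
- Key: plays the role of each token's "index/address information"
- Value: the content vector actually used in the weighted sum
Mechanism:
1. Compute the similarity (usually a dot product) between the query and each key
2. Normalize with softmax → attention weights
3. Take the weighted sum of the values with these weights → new token representation

Interpretation: the query decides "how much to attend to which token", and
the new representation is the weighted sum of the corresponding values.

6.3 Matrix form

Input embeddings X (N × d_model)
Learned weight matrices:
- W_Q, W_K (each d_model × d_k) and W_V (d_model × d_v)
Computation:
- Q = X W_Q, K = X W_K, V = X W_V
- Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V
Meaning:
- QKᵀ: similarity matrix between every token's query and every token's key
- softmax: a probability distribution per query (the attention weights)
- Weighted sum: a weighted average of the values under this distribution
Why divide by √d_k:
- As d_k grows, the variance of the dot products grows → softmax becomes overly peaked
- Scaling normalizes this and stabilizes training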
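The matrix form above can be written out directly. The sketch below implements scaled dot-product attention with nested Python lists so that each step is visible; the input X and the identity projection matrices are toy values assumed for the example, whereas in a real model the projections are learned.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returned together with the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))  # N x N similarity matrix
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: N = 3 tokens, d_model = d_k = d_v = 2 (assumed values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_Q, W_K, W_V = identity, identity, identity  # placeholders for learned projections

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = attention(Q, K, V)
print(weights)
print(output)
```

Each row of `weights` is a probability distribution over the tokens, and each row of `output` is the corresponding weighted average of the value vectors.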

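7. Transformer architecture: Attention is All You Need

7.1 Overall structure

Proposed in the 2017 paper "Attention is All You Need"
Drops the sequential RNN structure and
relies entirely on self-attention (and its variants)
Consists of two main parts, an encoder and a decoder
Flagship application: machine translation

7.2 Input processing: token embeddings + positional encoding

Tokenization: includes BOS and EOS
Embedding layer: each token → a d_model-dimensional vector
Positional encoding:
- Plain self-attention is unaware of order → position information must be added
- The paper encodes positions with periodic sin/cos functions
- Embeddings and positional encodings are combined by element-wise addition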
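A sketch of the sinusoidal scheme from the paper follows: even dimensions use a sine, odd dimensions use a cosine, with wavelengths that grow geometrically with the dimension index. The sizes and the all-zero token embeddings are toy values assumed for the example.

```python
import math

def positional_encoding(num_positions, d_model):
    """PE[pos][2k] = sin(pos / 10000^(2k/d_model)), PE[pos][2k+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):  # i is the even index 2k
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(num_positions=4, d_model=4)

# Combine with (placeholder) token embeddings by element-wise addition.
token_embeddings = [[0.0] * 4 for _ in range(4)]
inputs = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(token_embeddings, pe)]
print(pe[0])
print(pe[1])
```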

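7.3 Encoder block

Each encoder layer consists of:

Multi-head self-attention
- Input: sequence embeddings of length N
- Every token attends to all tokens to produce a new representation
- Each head learns its own Q/K/V projections → captures different relations
- Head outputs are concatenated, then projected back to d_model with W_O
Feed-forward network (FFN)
- Input dimension d_model → intermediate dimension d_ff → back to d_model
- A non-linear transformation that adds expressive power
- d_ff is set larger than d_model to give the model capacity
(The paper also uses residual connections + LayerNorm, mentioned here only in passing)

Several layers are stacked (e.g. 6 layers)
Final encoder output: a set of context-aware embeddings of the input sequence

7.4 Decoder block

Each decoder layer contains three sub-layers:

Masked multi-head self-attention
- Self-attention over the decoder's previously generated tokens
- A causal mask prevents looking at future tokens
  - At step t, tokens after t are excluded from attention
Cross-attention (encoder-decoder attention)
- Query: the decoder's current hidden representation
- Key/Value: the encoder output (input sentence embeddings)
- Role:
  - Learn which part of the input sentence the translation token being generated should attend to
Feed-forward network
- Same structure as in the encoder; adds expressive power

Again, several decoder layers are stacked
Output of the last decoder layer → final linear layer + softmax → next-token probabilities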
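The causal mask in the decoder's self-attention can be sketched as a softmax that only sees positions up to the current one. The function name `causal_softmax` and the all-equal score matrix are assumptions for illustration; real implementations usually add a large negative value to masked scores before the softmax, which has the same effect.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # drop future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - t - 1)          # future positions get weight 0
        weights.append([e / total for e in exps] + masked)
    return weights

# Equal raw scores make the effect of the mask easy to read.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

The first position can only attend to itself, the second splits its weight over two positions, and only the last position sees the whole sequence.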

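7.5 Multi-head attention

Instead of a single attention, h heads are used in parallel
Each head i:
- Q = X W_Q^(head_i), K = X W_K^(head_i), V = X W_V^(head_i)
- Uses its own projection matrices and so learns relations from a different viewpoint
All head outputs are concatenated, then projected back to d_model with W_O
Benefit:
- Different patterns (syntactic relations, semantic relations, long-range dependencies) can be learned separately

8. Translating with a Transformer (end-to-end example)

Sentence: "a cute teddy bear is reading" → "un ours en peluche lit"

Encoding
- Tokenize the input sentence, embed it, add positional encodings
- Through the stacked encoder layers:
  - each token is turned into a context-aware representation via its relations to the other tokens
- Final encoder output:
  N context-encoded vectors (one per token)
Start of decoding
- Decoder input: the BOS token
- First step:
  - Masked self-attention: only BOS is visible (trivial, it attends to itself)
  - Cross-attention: uses the entire encoder output as keys/values
  - After the FFN, linear + softmax → probability distribution over the first output token (e.g. "un")
- Pick "un" by argmax or sampling
Generating the second token
- Decoder input: [BOS, "un"]
- Masked self-attention:
  - the representation of "un" attends over [BOS, "un"]
- Cross-attention:
  - the decoder representation at this step is the query; the encoder output provides keys/values
- Output: high probability for "ours" → pick "ours"
Repeat
- Keep decoding with [BOS, "un", "ours", "en", "peluche", …] as input
- At each step:
  - masked self-attention over the previously generated tokens
  - cross-attention over the encoder output
- Continue until the EOS token is generated → translation ends

9. Training technique: label smoothing

In NLP generation tasks (translation, etc.) the next word can have several valid answers
- e.g. "What a great ___" → "day", "idea", "lecture", etc.
Standard: a one-hot label with probability 1 on the correct token and 0 elsewhere
Label smoothing:
- Correct token: 1 − ε
- Other tokens: ε / (V−1) each
Effect:
- Keeps the model from becoming over-confident
- Reported to improve generalization, BLEU scores, etc.
Implementation note:
- It is not applied to the softmax output;
  it softens the target distribution used to compute the loss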
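The label-smoothing recipe above only changes the target distribution, which the following sketch makes explicit. The predicted probabilities reuse the toy six-word example, and ε = 0.1 is an assumed value chosen for illustration.

```python
import math

def smooth_labels(target_index, vocab_size, epsilon=0.1):
    """1 - epsilon on the correct token, epsilon / (V - 1) on every other token."""
    off_value = epsilon / (vocab_size - 1)
    return [1.0 - epsilon if i == target_index else off_value for i in range(vocab_size)]

def cross_entropy(probs, target_dist):
    return -sum(t * math.log(p) for p, t in zip(probs, target_dist) if t > 0)

probs = [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]  # model prediction (toy values)
hard = [0, 1, 0, 0, 0, 0]                # one-hot target
soft = smooth_labels(target_index=1, vocab_size=6, epsilon=0.1)

print(soft)
print(cross_entropy(probs, hard), cross_entropy(probs, soft))
```

The model's softmax output is untouched; only the distribution it is compared against changes, so a prediction that puts all its mass on one token is no longer the loss minimizer.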

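Wrap-up

This lecture aims to:
- Build a basic understanding of NLP tasks and evaluation metrics
- Cover tokenization, embeddings, word2vec, and the limits of RNNs/LSTMs
- Explain attention, self-attention, multi-head attention, and the Transformer encoder-decoder architecture
- Lay the groundwork for a deeper treatment of LLM architecture, training, and use
With the lectures, the textbook, the VIP Cheat Sheet, and the weekly slides and recordings,
you can gain a fairly deep understanding of how modern LLM systems work.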