Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer
Cool.
Hello everyone and uh welcome to CME 295
transformers and large language models.
So my name is Afshine and I will be
teaching this class with Shervin who's
in the back and uh before I start I'm
just going to introduce ourselves.
Um, so we're twin brothers and uh we
actually had kind of a similar
background. So we both went to a school
in France called Centrale Paris and then we each
went our way. So on my end I went to MIT
and then Shervin went to Stanford to do
the ICME masters program
and after that I guess our um industry
background is very similar as well. So I
first went to Uber and then Shervin came
to Uber as well and then Shervin left for
Google and I went to Google and then
very recently I joined Netflix and
Shervin joined Netflix as well and we've
been working on large language models.
Um so yeah I guess we have like
technical backgrounds and mostly
oriented towards LLMs.
Okay. So why are we doing this class? Um
so since 2020 uh Shervin and I have been
specializing in NLP and we've been
giving this class in the format of a
workshop that was held on a yearly
basis. So in 2021, 2022, 2023, 2024. You
know, ChatGPT came out in 2022 and suddenly
there was a lot of interest for LLMs and
so uh it's actually last spring that we
started to offer this class as a
Stanford course that is now called CME
295 and this is the second instance.
>> Boom. Um so what can you expect from
this class? So first of all, LLMs are
basically uh everywhere now and I guess
our goal here is twofold. So the first
one is to learn about the underlying
mechanism that makes all this work and
we're going to see the transformer which
is the foundational architecture uh that
makes all this work. And then the second
thing is to know how these LLMs are
trained and where they are applied.
So in case you're still wondering if
this class is good for you, um I would
say that this class is great for people
who just in general have an interest in
this field, either because you want to
make it your career goal, if you want
to be, I don't know, a research scientist
or an ML scientist, or if you want to
develop like a personal project that
relies on LLMs to some extent, to just
know the caveats, I guess, what
works, what doesn't,
or just if you're in a separate field
and you just want to know how this whole
AI, GenAI, LLMs thing works and how you can
apply it to your domain.
Okay. So now in terms of prerequisites,
I would say that at a very minimum um
you should have some foundations in ML
like basically know how a model
is trained, what a neural network is uh
and also some basics in linear algebra.
So basically how matrices are multiplied
for instance. Uh but even even if you
have kind of a developing um I guess
competency in these fields I guess it's
fine. We'll still be here to help you out.
I guess this is like the ideal set of
prerequisites.
Cool. So still on the logistics. Um, so
this class will be held every Friday
from 3:30 to 5:20 and it will be held
here.
So this class is two units and you have
the choice to either take it for a letter
grade or credit/no credit.
So, as you could tell from the
setup, we're basically recording this
class. And if you cannot for some reason
attend this, you know, this time
slot, we'll make sure to make
the recordings available either tonight,
like every Friday night, or on Saturday.
So in terms of the grades um so what
we're doing for this quarter is to have
two exams.
So one is the midterm which will be
happening uh during our fifth instance
which is October 24th
and then the second exam will be the
final exam which will be held, you know,
in the week of
December 8th. So the date is still TBD,
so we'll let you know.
Cool. Um so every time we have a lecture
we'll be posting the slides and the
recordings on the website
and in case you're interested we also
have the syllabus in there so you can
know a little bit what are the topics
that we'll be talking about.
uh and the class textbook is this
Super Study Guide: Transformers & LLMs, so we
have a copy here in case you want to
take a look um so yeah I guess a lot of
the concepts that we have in this class
will actually be in the book so I guess
it's a helpful way to follow this as
well
and also we did uh some kind of very
short condensed version of this whole
class that we called the VIP cheat sheet
so this one is available on GitHub in
case you're um you're interested. Uh and
yeah, we also translated it into a
number of languages now. Um by the way,
if your language is not there, uh let us
know and uh yeah, happy to work on that
as well together.
Okay, cool. Uh I think it's the last
thing on the logistics part. So in
terms of announcements, we'll be posting
things on Canvas. In case you have any
questions, you can of course reach out
to us. Uh, but there's also a tab on
Canvas that's called Ed. Uh, I'm sure
you're familiar. Um, so yeah, just click
on that, just post your question and
then Shervin and I will be responding.
And, uh, yeah, I guess to reach out to
us, you have this mailing list or just
like, you know, we're just two, so just,
uh, ping us.
Cool. So, on the logistics, do we have
any questions so far? And one thing I
forgot to mention is that given that
we're recording this class, um I guess
if you're asking a question, it may not
be super clear for the viewer what your
question was. So I'm going to make an
effort to just repeat your question. It
will sound weird, but yeah, I try to not
forget.
But yeah, so yeah, any questions so far
on the logistics?
Yep.
So the question is whether there are
like coding parts in the exams. So the
answer is no. So the exams will purely
focus on concepts that we see in class
and actually it's not meant to, you know,
trap you. So I guess if you follow the
class, if, you know, you see the slides
and the concepts that we see, you should
be fine.
Yeah.
Oh yeah. Uh question is if you're
waitlisted what do you do? Um I think, so from
experience, you know, a lot of people will
kind of finalize their schedule. Some
people will drop, some won't. In case
you're still waitlisted, uh you know,
come talk to us, but I'm pretty confident
uh you know it's going to be okay
because I think the waitlist right
now is like six. So yeah I think you
should be fine. Cool.
Yeah.
Uh they will be on the website and we'll
make sure to um also post the link on
canvas. Yeah. Uh so the question was
where are the slides and they're on the
website.
Cool. Yeah.
So question is on the weighting of the
exams. So yeah there is no homework. So
50% is midterm, 50% is final and no
grades, I mean no weights, are from that
in particular. I mean if this slot is
conflicting with something just uh keep
in mind that we are recording this, so I
mean it's fine if you cannot
attend, let's say, a session. Yeah.
>> sorry
Oh, um, is the question whether the final is
about just the second half of the class?
Um, we have not written
the exam yet, but I think this is
something we're thinking of. So, yeah,
the final is probably going to be
about the second half of the
topics.
Cool. Okay, long story short, 50%
midterm, 50% exam, final exam, and uh
yeah, it's a fun class.
Cool. So, with that, I'm going to just
slowly start the class. Um, so another
thing that I wanted to mention was every
time we're talking about something you
will see that at the bottom of the slide
there will be a source. It's mostly,
first, to credit whatever we're
quoting but also uh for you to kind of
dig into that material a little bit
more in case you're interested. Uh
because of course we have only like two
hours per week and we only have nine or
10 weeks. So there's nowhere near
enough time for us to cover
everything.
And the second disclaimer is you will
see that the field is full of
abbreviations. So I myself was
completely scared of them when I
started. Uh but hopefully by the end of
the class you will have a mental mapping
of what these abbreviations mean with respect
to what they correspond to. Um so yeah
so if you have that mental mapping
towards the end of the class then we'll
know we did a good job.
So with that
um let's start and I guess we will start
at the very high level because I would
just assume that um I guess we're
starting from scratch um and we're going
to talk about NLP in general. So NLP is
going to be our first abbreviation. So
NLP stands for natural language
processing and it is a field that is
about manipulating text, just
computing things with text,
and at a very high level we can basically
classify NLP tasks into three buckets.
So the first bucket is what we call
classification.
So we have a text as an input and
then what we want is to predict
something. So one example is you have a
movie review and you want to predict
whether uh the sentiment is positive,
negative or neutral. Uh so that's one
example. You can also have intent
detection. Uh just, you know, knowing what
for instance the person wants to do. So
let's suppose you say I want to create
an alarm for tomorrow. So the intent
here is create an alarm.
Also language detection. So for
instance, if you write in French, you
want to detect that that text is in
French. And topic modeling.
The second category is what we call
multi-classification.
So we still have a text that's input,
but this time we predict more than one
thing. So you have a number of tasks in
that bucket as well. So one that is very
popular is called named entity
recognition, aka NER.
So what that task does is given an input
text we want to basically label some
specific words like for instance
identifying whether something is a
location or a time and so on.
And then you have some other tasks as
well that are a little bit more on the
linguistic sides. uh I think they're
less trendy now but I guess 10 years ago
it was something that people would study
a lot. Uh so part-of-speech tagging,
which is about just figuring out which
word is, you know, a noun, a verb, etc., or some
parsing-related tasks, so dependency or
constituency parsing.
and then the last bucket which is very
popular these days is the generation
bucket.
So you have a text as input and you
also have text as output and here the
length can be variable meaning you don't
know what the length of your output text
will be beforehand. So here you have
several tasks. So for instance you have
machine translation, so I have something in
English and I want it in, let's say uh,
German.
Question answering. So typically, you
know, the ChatGPT, Gemini that you're
using, you know, the assistant. So you ask
a question and you have a response.
And then you have like other um tasks as
well like summarization, you want to
summarize an article let's say, or just
generate something. So something can be
generate code, generate a poem, can also
be a lot of things.
Cool. So now what we will do is go
through these tasks one by one to just
illustrate what people typically deal
with. So we're going to start with the
first bucket which is the classification
bucket. And here we're going to
illustrate this with the sentiment
extraction task. So let's suppose we
have a sentence. This teddy bear is so
cute. We want our model to predict you
know this to be a positive sentiment.
So typically what you would use is, you
know, data sets that are around
sentiment extraction. So I
mentioned movie reviews. So this is IMDb
critics but you also have reviews about
products, so Amazon reviews, or you know
tweets, now I guess it's called X, so X
posts.
and the way you would evaluate such
outputs would be by typically using
traditional classification metrics. Um
so you have accuracy which is, you know,
the percentage of the
observations that you correctly
predicted,
but you also have two key metrics which
I'm just going to uh recall, I'm not sure
if everyone knows about them. So one is
precision,
which is out of all the positive
predictions that you made, how many
were correct,
and then the second one is recall:
out of all the truly positive examples, how many of
them did you correctly predict as being
positive? And uh you have this metric
called the F1 score which basically
takes the harmonic mean of precision and
recall to just give you one number.
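For reference, the metrics just described can be written with the usual confusion-matrix counts. The symbols TP, FP, FN and TN (true positives, false positives, false negatives, true negatives) are notation assumed here, not taken from the slides:

```latex
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
\]
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```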
So now you may wonder you know why do
you need all these metrics? So the short
answer is that sometimes you have tasks
and data sets where your classes are
very imbalanced. So for instance you can
have I don't know 99% of your data set
that is a positive label and then only
1% of the data set which is negative.
And so here if you take like a metric
like accuracy it would be very
misleading because if you have a model
that would predict everything as the
majority class then accuracy would say you have a,
you know, great classifier, but that's not the
case. So that's why precision and recall
really play a role.
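A minimal sketch of that imbalance argument, using the 99% / 1% split from the example. The data and the helper name `classification_metrics` are made up for illustration, and precision, recall and F1 are set to 0 by convention when their denominator is zero:

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision/recall/F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Convention assumed here: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced data set: 99 positive reviews, 1 negative review.
y_true = ["positive"] * 99 + ["negative"]
# A degenerate model that always predicts the majority class.
y_pred = ["positive"] * 100

# Score the rare "negative" class as the class of interest.
metrics = classification_metrics(y_true, y_pred, positive="negative")
print(metrics)
```

Accuracy comes out at 99% even though the model never finds the single negative example, which is what recall on that class exposes.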
So that's the first one. Okay. So now
let's move to the second category of NLP
tasks. So this one is the
multi-classification category. So you
have an input text and you predict
multiple things and we're illustrating
this with the NER task which as I
mentioned is about identifying the
category of given words.
And so here for instance we want to
identify teddy bear as being an entity.
Um I guess for that you would use
classification metrics but not at the
sentence level but more either at the
token level or at the entity type level.
And by that I mean let's suppose you
have a category um let's say location
and you want to know how well you're
predicting words in that category. So
you would typically aggregate these
metrics
um as a function of that.
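One way to read "aggregating per category" is to compute precision and recall separately for each entity type at the token level. The sketch below assumes a simple tag set (`LOC`, `TIME`, and `O` for "no entity") and made-up gold and predicted tags. It is not a standard NER scorer, which often works on whole entity spans instead:

```python
from collections import Counter

def per_type_token_metrics(gold, pred, outside="O"):
    """Token-level precision and recall for each entity type."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and g != outside:
            tp[g] += 1          # correct entity tag
        else:
            if p != outside:
                fp[p] += 1      # predicted a type that is not the gold one
            if g != outside:
                fn[g] += 1      # missed (or mislabeled) a gold entity token
    results = {}
    for label in set(tp) | set(fp) | set(fn):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        results[label] = {"precision": precision, "recall": recall}
    return results

# Hypothetical tags for a six-token sentence.
gold = ["O", "O", "LOC", "O", "TIME", "TIME"]
pred = ["O", "LOC", "LOC", "O", "TIME", "O"]

scores = per_type_token_metrics(gold, pred)
print(scores)
```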
Cool.
Okay. Let's go to the last category
which is as I mentioned the most popular
one. So this one is text in text out. So
I'm illustrating this with the machine
translation task which is around
translating a text from a source
language to a target language. So here
you have the example with English to
French. So "a cute teddy bear is reading".
Um so for that I guess it's harder to
get data sets because here you you need
to have pairs of text. So you have a
very popular data set that's called WMT
which stands for workshop on machine
translation.
And that one contains a bunch of paired
sequences in different languages. Uh so
for instance you have English-French,
English-German, coming from the
European Parliament data set for
instance. Um okay so
to evaluate the performance of your
model it's actually a lot more tricky
because as you can imagine you can have
many different ways to translate
something. I'm sure many of us in the
room are, you know, bilingual,
trilingual. Um, so yeah, that's
what is making this hard.
So, in the past, people have used
several rule-based metrics to do that.
So, one that you may have heard of is BLEU.
BLEU stands for bilingual evaluation
understudy and it is a measure of how
well your translation stands with
respect to a reference text.
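To give a feel for what "comparing against a reference" means, here is a sketch of one ingredient of BLEU: the clipped n-gram precision. Full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function name and the example sentences are assumptions for illustration:

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference ("clipping"), so repeating a correct word does not help.
    """
    cand_ngrams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand_ngrams:
        return 0.0
    cand_counts = Counter(cand_ngrams)
    overlap = sum(min(count, ref_counts[ngram]) for ngram, count in cand_counts.items())
    return overlap / len(cand_ngrams)

reference = "a cute teddy bear is reading".split()
candidate = "a cute bear is reading".split()

unigram_precision = clipped_ngram_precision(candidate, reference, n=1)
bigram_precision = clipped_ngram_precision(candidate, reference, n=2)
print(unigram_precision, bigram_precision)
```

Every candidate word appears in the reference, but only three of its four bigrams do, so the bigram precision is lower than the unigram one.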
Same story for ROUGE which is actually a
suite of metrics
um but kind of captures that in a
different way. And uh you will see that
the machine learning community is funny
because bleu, I'm not sure if you know
French, means blue, and rouge means red.
So I guess they tried to kind of uh add
some fun in this. Um but the
problem with these metrics is that you
always need a reference text. So you
basically need labels
and in practice having labels is very
expensive. It takes a lot of time,
a lot of money to uh get labels and we
will see later in the class that with
the progress that we have made in the LLM
space, or that the community has made in
the LLM space, we can actually forgo these
reference-based metrics and go towards
more reference-free
uh kinds of metrics and we will see that
later on.
And then the last metric that I would
say that people sometimes use is called
perplexity.
And perplexity only looks at the
probabilities that are output by the
model. And it basically quantifies how
surprised the model is by its outputs.
So BLEU and ROUGE, the higher the better.
Perplexity, the lower the better.
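Perplexity is commonly defined as the exponential of the average negative log-probability that the model assigns to each token of a sequence. A small sketch under that definition, with made-up per-token probabilities and an illustrative function name:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that gives every observed token probability 0.5 is "as surprised"
# as if it were choosing between 2 equally likely options at each step.
confident = perplexity([0.5, 0.5, 0.5, 0.5])

# A model that spreads probability uniformly over 6 tokens behaves like a
# 6-way coin flip at each step.
uniform = perplexity([1 / 6] * 6)

print(confident, uniform)
```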
And I guess um LLMs have been a kind of
a hot topic since 2022,
but actually the field goes way back,
before that year. So in the 80s,
we'll see it in a second, but uh there's
a class of models that were actually
kind of thought of even in the 80s, and in
the 90s we had LSTMs that we'll see also
in a second.
Um but the problem was during that time
we didn't have the internet, we didn't
have a lot of compute and I guess this
was one of the limiting factors which
prevented models like the
models from today from being trained,
and then more recently uh we've had
several advances, so word2vec was
really one of the uh kind of pioneering
um works in just um computing meaningful
embeddings,
and we'll see it in a second. And then
of course we had the transformers which
were part of a paper that was published
in 2017
which is basically at the foundation of
of the models that you see today.
And then uh you know these models they
just were scaled up both by compute but
also in in terms of like the data that
was used to train them. And you know
that's how LLMs were dubbed. And I guess
these are more like the 2020s.
But yeah, I guess we'll see those.
Cool.
Any questions on
I guess the high level.
Everyone good?
Cool. So I guess
the first question that I want to ask
ourselves is
what we want to do is to have a model
that handles text.
But models they understand numbers. They
don't really understand text. So we need
to somehow
do something with that text to make it
more quantifiable, something that a
model can understand.
So if you look at a sentence for
instance, a cute teddy bear is reading,
you first need to ask yourself how can
you cut this sentence to pass it to a
model.
So this part is called tokenization
and what it entails is basically cutting
the text with respect to some arbitrary unit
of text.
So there are several ways of doing this.
I guess the first way is doing it
completely arbitrarily.
So here for instance you would have "a"
that would be one unit of text, "cute"
would be another unit of text, "teddy bear" would
be another one and so on. And by the way
the unit of text is called a token which
is why uh the method is called
tokenization.
Another
way would be to just separate by words.
But I guess uh we would have you know
always pros and cons. I guess um
one of the goals that we want to achieve
is for us to then be able to represent
these tokens in a meaningful way. So,
one con with doing this at the word
level is you will end up with words that
look similar but that are actually
considered as different tokens. And I
guess the limitation here is you will
need to compute embeddings for these
similar yet different tokens
and somehow make their embedding
similar. So I'll give you an example. So
let's suppose I have the word bear
and then you have another word plural
form bears.
So these two words they are very similar
just one is singular the other one is
plural. If we go ahead with the word
level tokenization
then we will end up with just two
different entities
which are basically yeah just considered
as different. same with run and then
runs you know variations of verbs.
So for that reason
people have uh kind of um dug into a
category of tokenizers that are
called subword tokenizers,
which is around leveraging roots of
words in order to find what are the
common roots that we can find in
these words. So for instance, for bear
and bears, you would have the bear
particle that would be uh kind of
shared.
And so I guess the pro is that you get
to leverage the root of the words. But
then the con here is that your sequence
will be longer. And we will see why this
is a con. Um I guess later on I guess I
can give you a preview. Um so the
complexity of these models is also a
function of the sequence length.
So the more tokens you
have to process, the more time it would
take for your model to run
because it needs to basically process
all these tokens.
So that's one con. So pro is it
leverages the root of words. Con is it
just makes your sequences longer.
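A toy illustration of the subword idea: split each word greedily into the longest pieces found in a fixed vocabulary. Real subword tokenizers such as BPE or WordPiece learn their vocabulary from data. The three-piece vocabulary and the function below are assumptions made up for this example:

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into known pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known piece starts here
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical subword vocabulary sharing the roots "bear" and "run".
vocab = {"bear", "run", "s"}

for word in ["bear", "bears", "run", "runs"]:
    print(word, "->", subword_tokenize(word, vocab))
```

"bear" and "bears" now share the piece "bear", at the cost of "bears" taking two tokens instead of one.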
Okay, you have a last category of uh
kind of ways of tokenizing things which
is just going at the character level
just like taking all characters. So here
I guess uh you and I, when we write, I
don't know, a message, we typically have
uh sometimes misspellings,
and with the subword uh way of
tokenizing things you may not uh be able
to recognize a word that has been
misspelled,
and this is something that the character
level tokenizer can uh I guess take into
consideration. But here the problem is
you have a sequence length that's much
much longer which will make your model I
guess take much more time to process uh
this sequence. So that's one con. And
then uh the other con is I guess when
you want to represent each of these
tokens I guess it's very hard to know
what a representation of a letter really
means. Like what does the representation
of the letter U mean?
Very hard.
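To make the sequence-length trade-off concrete, here is a quick comparison on the running example sentence. Splitting on whitespace is a simplification assumed here (it treats "teddy" and "bear" as two separate tokens):

```python
sentence = "a cute teddy bear is reading"

# Word level: split on whitespace.
word_tokens = sentence.split()

# Character level: every character, spaces included, is its own token.
char_tokens = list(sentence)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:6])
```

The same sentence goes from 6 tokens at the word level to 28 at the character level.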
Cool. So I have just a a quick recap. So
word level is a super naive way, super
simple way of um I guess dividing your
text into arbitrary units. Um but then
the problem is as we mentioned we do not
leverage the root of words and uh I did
not mention this but there's a term for this:
whenever you cut something, then
at inference time when you want to make
a prediction, I guess one prerequisite
that you have is that every
token that you see at inference time
needs to have been in your training
set.
And the problem is let's suppose at
inference time you cut your text into
words and let's suppose you have not
seen a word at training time you will
need to mark it as unknown
and so this thing is called OOV out of
vocabulary.
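A minimal sketch of how an out-of-vocabulary word ends up marked as unknown with a word-level vocabulary. The `<unk>` token name and the helper functions are assumptions for illustration:

```python
UNK = "<unk>"

def build_vocab(training_texts):
    """Map every word seen at training time to an id; id 0 is reserved for UNK."""
    words = sorted({word for text in training_texts for word in text.split()})
    return {token: idx for idx, token in enumerate([UNK] + words)}

def encode(text, vocab):
    """Replace each word by its id, falling back to the UNK id if unseen."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

vocab = build_vocab(["a cute teddy bear is reading"])

# "panda" was never seen at training time, so it maps to the UNK id (0).
ids = encode("a cute panda is reading", vocab)
print(vocab)
print(ids)
```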
So luckily the subword-level tokenizer
mitigates that problem. Uh so you have
like a lower risk of OOV, but
you can still have it, and as we mentioned in
terms of the pro uh you leverage the
root of the words,
um and then character level uh you know
it's robust to our misspellings and our
casing errors uh but the problem is it
makes computations just like much slower
and uh your sequences would be like very
very long uh which will also make your I
guess inference time much higher.
Does that sound good? I guess this is
really the foundation of uh I guess how
to handle things with text. But yeah,
does that make sense overall?
Cool.
Okay. So now, okay, what we did is we
took an input text.
What we did is we cut it into parts that
are basically tokens.
So in order for our model to understand
these tokens, we need to find a
representation for each of them. So here
uh we're going to take a look at this.
So that's called word representation or
more um I guess in a more correct way it
should be token representation. So we
want to find a way to represent each of
these tokens.
So the simple and naive way to do this
would be to just assign a one hot vector
for each word or for each token. So for
instance let's suppose
we have a vocabulary of three tokens:
book, soft and teddy bear. We would have
let's say soft that is a (1, 0, 0) vector,
teddy bear that is let's say a (0, 1, 0)
vector and book that is let's say a (0, 0, 1)
vector.
So this is called one-hot encoding, OHE
we'll typically see.
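The three-token example can be written out as follows. This is a sketch using NumPy, with the index order (soft, teddy bear, book) taken from the example above:

```python
import numpy as np

vocab = ["soft", "teddy bear", "book"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}

def one_hot(token):
    """Vector of zeros with a single 1 at the token's index."""
    vector = np.zeros(len(vocab))
    vector[token_to_id[token]] = 1.0
    return vector

for token in vocab:
    print(token, one_hot(token))
```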
so cool yeah this is a way to
represent our tokens but um basically
what people want to do is compare these
tokens to basically see which ones are
more similar to what other ones.
So a common similarity measure that
people use is something called cosine
similarity. Not sure if you have heard
of it. Um so you can think of it as just
seeing what angle these vectors make in
the n dimensional space. And if I guess
they are pointing in the right in the
same direction then maybe they're
similar maybe if they're orthogonal
maybe they're kind of independent and if
they're completely opposite then maybe
they're opposite. That's basically the
mental model we want to uh go into.
So the problem is if you represent your
tokens in a one hot fashion, you will
end up with all your vectors being
orthogonal to one another.
So that's the problem.
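Cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below checks that distinct one-hot vectors always score 0, and contrasts that with a set of dense vectors. The dense values are hypothetical numbers chosen only to illustrate the behavior one would want, not learned embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product normalized by the two vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for soft, teddy bear, book: every distinct pair is orthogonal.
one_hot = np.eye(3)
soft_vs_teddy_onehot = cosine_similarity(one_hot[0], one_hot[1])
teddy_vs_book_onehot = cosine_similarity(one_hot[1], one_hot[2])

# Hypothetical dense embeddings (made-up numbers for illustration).
soft = np.array([0.9, 0.1])
teddy_bear = np.array([0.8, 0.3])
book = np.array([-0.3, 0.8])

soft_vs_teddy = cosine_similarity(soft, teddy_bear)
teddy_vs_book = cosine_similarity(teddy_bear, book)

print(soft_vs_teddy_onehot, teddy_vs_book_onehot)
print(soft_vs_teddy, teddy_vs_book)
```

With one-hot vectors both similarities are 0, so nothing distinguishes related from unrelated tokens. With the dense vectors, "soft" and "teddy bear" score close to 1 while "teddy bear" and "book" score close to 0.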
So ideally what we want is for tokens
that mean the same or similar to
basically have a high similarity
and for tokens that are not similar, like
about different things, to be
more like orthogonal.
So here I just for illustrative purposes
uh teddy bears are soft. So you want
teddy bear and soft to be I guess with a
high similarity and let's say teddy bear
and book which is kind of independent
you want them to be closer to zero.
So that's what you want. That's what you
have with one hot encoding and that's
what you want.
Yeah.
Sorry.
Oh, I see. Uh, the question is why do
you care about the norm? Um, so I guess
cosine similarity is actually normalized
by the norms. So it's um dot product.
Oh, you mean why did I just put dot
product here instead of the
Oh, I see. And your question is why do
we not care about the norm? Um, cool. I
guess the viewers know the question. Um
I guess these measures they all you know
measures they're all ways to try to
capture these uh kind of similarity uh
things. Um so I guess why do you not
care about the norm?
Hmm, I guess it's how people have tried to
kind of quantify that. Um I I guess you
will need to see how your vectors are
trained and whether the norm would be
indicative of something.
Um I guess the best answer I can give
you is I guess this is a measure. This
is not the perfect measure. Um yeah
people may also use dot product as a
measure but yeah I don't have like a
great answer for you.
Cool. But as long as you capture, I
guess how these vectors they're um
they're pointing. I guess typically what
you care about is the angle between
them. Um but yeah, typically you don't
really take into consideration the
norm.
Cool. Any questions? Any other
questions?
Yeah.
Yeah.
It's a great question. So question is
around size of vocabulary and how that
would inform the choice with respect to
word, subword and how that changes
across languages. It's a great question.
Um, so I would say it really depends
first of all on the task that you're
trying to achieve. If your task is just
about one language, you will just uh
take that same language. You would
typically go with a subword tokenizer
just because of the reasons that we
mentioned here. Um, so I guess subwords
is a nice trade-off between being able
to identify uh words by their roots like
leveraging that but also running less
into the OOV uh risk. Um
so uh in terms of the size I know that
people, you know, they've tried
different things. I think uh typically
for English you would target something
on the order of tens of thousands for
vocabulary size uh but you know
nowadays the models uh they are
multilingual, they're also about code, so
you will see that the vocabulary size
now is sometimes on the order of
hundreds of thousands
uh okay so with respect to Chinese so I
guess you have this uh you know
difference in characters that you're
using so for Latin I guess so it's the
alphabet we're all accustomed to but of
course for the other ones uh you have
something similar but in uh I guess the
the target language character um so yeah
I would say order of magnitude tens of
thousands for one language hundreds of
thousands if it's like multilingual
um yeah these are the order of magnitude
that you want to target for
cool Yeah.
Great question. So question is how do
you get those embeddings? So it's
actually the next slide. So we're going
to talk about this.
Cool.
Great. So okay. So now that we know that
the one hot encoding is not a good way
to represent tokens, what we want to do
is to learn those embeddings from the
data.
So I mentioned that there was this um
you know um paper that came out in the
2010s. So I think it was 2013, that was
called word2vec.
And the reason why it was so popular is
because they showed a very intuitive and
interpretable way of seeing these
embeddings. Uh because they were
saying something like okay king is to
queen what this is to that like Paris is
to France what Berlin is to Germany. So
there was basically a way to make sense
of the embeddings. So now the question
is how did they do that?
So they had two ways of computing these
uh embeddings. So one way was called
continuous bag of words. The other one
was called skipgram. But they all rely
on the same idea
which is let's just leverage text that
we have and then try to predict
something that is part of the text based
on let's say the context.
So for instance, continuous bag of
words.
The goal is you take into consideration
the words that are around a given target
word and your goal is to predict that
target word.
And skip gram is kind of the opposite.
You go from uh a target word and you
want to predict the words that are
around it. So I guess this task is
commonly called a proxy task. Because at
the end of the day in this exercise,
what we care about is not necessarily to
predict the next word or at least not
yet. Our goal is to learn a
representation of these words that are
meaningful.
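The difference between the two proxy tasks shows up in how training examples are built from a sentence. A sketch, assuming a context window of one token on each side and treating "teddy bear" as a single token as in the lecture's example (function names are illustrative):

```python
def cbow_examples(tokens, window=1):
    """Continuous bag of words: (surrounding context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=1):
    """Skip-gram: target word -> each of its surrounding context words."""
    return [
        (target, context_word)
        for context, target in cbow_examples(tokens, window)
        for context_word in context
    ]

tokens = ["a", "cute", "teddy bear", "is", "reading"]

cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[1])       # context around "cute" -> "cute"
print(skipgram[:3])  # (target, one context word) pairs
```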
And so here the idea is if you have a
model that somehow knows how to predict
let's say the next word
then it means that your model has some
understanding of how language works
which is basically what you want. You
basically want an embedding that is
reflective of
uh I guess what language is,
which is, you know, king and queen
are, you know, similar, Paris and
France, like this is the capital, you want
to have these associations embedded in
the representation,
and let's go through a very simple
example of what that looks like.
so here in our example,
let's suppose that our proxy task is
about predicting the next word.
So here what we take is a very vanilla
neural network model which basically
receives a vector of size V
has some multiplication and a bias term
to get like a hidden state and then uh
another set of multiplications to get
our final vector.
So here it's basically a a very simple
neural network. Uh so the input is of
size V. The hidden layer is of size D
which is typically much smaller than the
vocabulary. So vocabulary is typically
like tens of thousands or hundreds of
thousands. So D is typically hundreds
like 768 for instance is is one example
of dimension.
So it's much much smaller. So what we're
trying to do is to learn the word
representation through this proxy task.
And what we're going to do is try to
consider
a word as input
and predict the next word.
So let's go with the first word of the
sequence. So by the way I use token and
words interchangeably.
Um so let's suppose we have the word a
and we want to predict the next word
which is the word cute.
So what we do is we take the word a
we take the one hot
encoding representation
and we pass it through the network. So
here if you're familiar with neural
networks, you have I guess a
multiplication between I guess a matrix
and this vector. So you have a
hidden state representation which is a
vector of size D. So here let's suppose
it's (2, 1.9), so D = 2,
and then you have I guess another pass
here. uh and then you get after softmax
a set of probabilities
which are around
seeing what is the next word. So in this
example we have uh a vocabulary of size
six. So the first word is predicted with
probability 0.2,
second word 0.4, and then the other words
are all 0.1 in this example.
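The forward pass just described can be sketched in NumPy with V = 6 and d = 2. The weights are random, so the printed numbers will not match the slide, and leaving out a nonlinearity on the hidden layer is an assumption here (the lecture only mentions a multiplication and a bias term):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 2  # vocabulary size and hidden (embedding) size

# Randomly initialized parameters of the toy network.
W1 = rng.normal(size=(V, d))
b1 = np.zeros(d)
W2 = rng.normal(size=(d, V))
b2 = np.zeros(V)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(x):
    h = x @ W1 + b1               # hidden state of size d
    probs = softmax(h @ W2 + b2)  # one probability per vocabulary word
    return h, probs

# One-hot encoding of the first word of the vocabulary ("a" in the example).
x = np.zeros(V)
x[0] = 1.0

h, probs = forward(x)
print(h)
print(probs, probs.sum())
```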
So let's suppose that we want to somehow
be able to maximize our prediction to be
the second word of the vocabulary which
is the 0.4.
So we basically compare the prediction
with uh I guess (0, 1, 0, 0, 0, 0) which is a
representation of the second word of the
vocabulary
and then we you know do the back prop we
you know update the weights. Not sure if
everyone is familiar with that part. Uh
but the idea here is once you obtain a
prediction you compute a loss. So
typically cross entropy
which will determine how far off you are
from the true answer and based on that
difference you're going to update the
weights in order to make your prediction
closer
to the truth.
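A minimal sketch of the loss computation just described, using the toy numbers from this example (six-word vocabulary, predicted probabilities 0.2, 0.4, 0.1, 0.1, 0.1, 0.1, true next word in the second slot). The function name `cross_entropy` is an illustrative choice, not something from the lecture.

```python
import math

# Predicted distribution over the six-word toy vocabulary (values from the example).
predicted = [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]
# One-hot target: the true next word is the second word of the vocabulary.
target = [0, 1, 0, 0, 0, 0]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

loss = cross_entropy(target, predicted)
print(f"loss = {loss:.3f}")  # -ln(0.4), roughly 0.916
```

With a one-hot target the loss reduces to minus the log of the probability given to the correct word, so raising that probability from 0.4 toward 1 drives the loss toward 0.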
So that's what you do and then you
repeat that process. Let's suppose you
take the word cute,
which as we said is the second word in
the vocabulary. So the one-hot encoding
representation is 0 1 0 0 0 0. So you go
through that network, you
have a hidden state, say the vector
0.8, 0.4, and you do that again. And what
you want to do is to predict the next
token, and here it's teddy bear.
And so you see now your model in this
example is predicting the next word to
be kind of like uniform, but you want to
somehow maximize the probability for
teddy bear. So you go about doing this
again and again for all the words. And
at the end of the day, you obtain a
model
that learns how to predict the next
words, which is basically the proxy
task. And what you're going to do is
take the representation that the model
learns, which is the green units.
So what happens now is every time you
have a word
you just represent it as a one-hot
encoding
and multiply it with these weights,
and then you obtain the green
representation,
and that is your word representation.
Does that make sense? Yeah.
Great question. So the question is about what
V corresponds to and why there are
only six. So yes, in this example, we
only have six possible words, which is
the vocabulary size. It's just
a toy example, because in
practice there are many more. And I
guess that's one of the challenges
with language. You can technically
have many variations of a word, which is
why, if you split your text into tokens
at the word level, you can end
up with a vocabulary that's very
big, because you need to account for all
the variations of a given word.
And the other thing that I want to
point out is, let's suppose you have a
vocabulary of size six, and it's the six
words that you saw at training time.
What happens if at inference
time you have a word that you have not
seen at training time? The answer
is that typically people
reserve a spot for what they call an
unknown token or out-of-vocabulary token.
You can think of it as
a bucket for everything that we
were not able to
identify.
So if at inference time
you have tokens that you were not able
to identify, they will all take that
representation, which is the unknown
token representation.
And by the way, it's something that
the
word-level tokenizer has trouble
with, because at the word level you have
a much bigger chance of hitting tokens
that are out of vocabulary. Subword
level will have a lower chance, and at
character level you
don't have that problem. Does that answer
your question?
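A small sketch of the unknown-token bucket described in this answer. The vocabulary contents, the `[UNK]` spelling, and the helper name `encode` are all assumptions made up for illustration; the lecture only says that one slot is reserved for unseen tokens.

```python
# Hypothetical word-level vocabulary with one reserved slot for unseen words.
UNK = "[UNK]"
vocab = [UNK, "a", "cute", "teddy", "bear", "is", "reading"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    """Word-level encoding: any word not seen at training time maps to the [UNK] id."""
    return [token_to_id.get(word, token_to_id[UNK]) for word in text.lower().split()]

ids = encode("a cute teddy bear is sleeping")
print(ids)  # "sleeping" is not in the vocabulary, so it receives the [UNK] id
```

Every unseen word collapses onto the same id, which is why all of them end up sharing a single representation.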
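CME 295: Transformers & Large Language Models – Orientation and NLP Fundamentals Summary
Executive Summary
CME 295 is a 2-unit Stanford course that covers NLP broadly, centered on the Transformer and LLMs (large language models). It aims to combine theory (Transformer, attention, language modeling) with practical LLM use. The course assumes a basic ML and linear algebra background, and grades are based solely on two exams, a midterm and a final, with no coding.
Key Takeaways
- Course overview
  - Course name: CME 295 – Transformers and Large Language Models
  - Instructors: Afshine & Shervine (twin brothers with hands-on LLM experience at Uber, Google, and Netflix)
  - Background: NLP focus since 2020 and a yearly workshop (2021–2024) → official Stanford course since last spring
- Intended audience
  - Students who want a research or industry career in LLMs/NLP
  - Students who want to build LLM-based personal projects
  - People from other fields who want to apply LLMs/GenAI in their own domain
- Prerequisites
  - Basic ML concepts: how a model is trained, what a neural network is
  - Basic linear algebra: in particular, matrix multiplication
- Logistics and grading
  - Time: every Friday 3:30–5:20, recordings provided
  - Credit: 2 units, letter grade or Credit/No Credit
  - No homework, only two exams:
    - Midterm: October 24 (week 5); final: around December 8 (date to be announced)
    - No coding questions; focused on lectures, slides, and concepts
  - Grade weights: midterm 50% + final 50%
  - The final is expected to focus on the second half of the course
- Materials and communication
  - Slides, recordings, syllabus: course website + links on Canvas
  - Textbook: “Super Study Guide – Transformer LLMs”
  - Summary material: VIP Cheat Sheet (GitHub, translated into multiple languages)
  - Announcements: Canvas; questions: Ed tab on Canvas, email/mailing list
- Core lecture content
  - Three categories of NLP tasks: classification / multi-classification / generation
  - Tokenization: word / subword / character levels compared
  - Embeddings: limits of one-hot, Word2vec (CBOW, Skip-gram), cosine similarity
  - Sequence models: RNN, LSTM, long-range dependencies and the vanishing gradient problem
  - Introduction of attention: referring directly to the entire past
  - Transformer architecture:
    - Encoder–decoder structure
    - Self-attention / cross-attention / masked self-attention
    - Multi-head attention, position embedding, FFN
  - Training techniques and metrics: label smoothing, perplexity, BLEU/ROUGE, etc.
Bottom line: this is an introductory-to-intermediate NLP/LLM course that builds a systematic understanding of how LLMs work (Transformer, attention) and how they are trained and used.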
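Detailed Summary
1. Instructors, course introduction, and goals
Instructor background
- Afshine & Shervine: twin brothers
- Education:
  - Centrale Paris (French engineering school)
  - Afshine: MIT
  - Shervine: Stanford ICME master's
- Career:
  - Worked on large language models at Uber → Google → Netflix
  - Ran NLP/LLM workshops over the past several years → growing demand led to an official course
Course purpose
- Go beyond treating LLMs as a “trending tool”:
  - Understand the Transformer architecture and its internal mechanisms
  - Understand how LLMs are trained and where they are applied
- Audience:
  - Aspiring researchers / ML scientists
  - Developers of LLM-based applications or personal projects
  - People from other fields who want to know how to apply GenAI/LLMs in their own domain
Difficulty and required background
- Basic ML: supervised learning, neural network training, loss and backpropagation
- Basic linear algebra: vectors, matrices, matrix multiplication
- Help is offered in class even if your foundations are not perfect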
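2. Course logistics and grading
Time and format
- Every Friday, 3:30–5:20, same classroom
- Lecture recordings: uploaded each week on Friday night or Saturday
- If you have a time conflict, you can watch the recordings instead
Enrollment options
- 2 units
- Letter grade or Credit / No Credit
Grading
- Homework: none
- Two exams
  - Midterm: fifth class session (10/24)
  - Final: week of 12/8, exact date to be announced
- Scope and question types
  - No coding questions
  - Theory questions based on the concepts and slides covered in class
  - The final will likely focus on the second half of the course
- Grade weights
  - Midterm 50%
  - Final 50%
Materials and communication
- Slides and recordings:
  - Uploaded to the course website
  - Links also shared on Canvas
- Textbook
  - “Super Study Guide – Transformer LLMs”
  - Covers most of the lecture content and concepts
- Summary material
  - VIP Cheat Sheet (public on GitHub, translated into multiple languages)
  - You can propose your own language if it is missing
- Announcements
  - Posted on Canvas
- Question channels
  - Post questions in the Ed tab on Canvas → instructors reply
  - Email / mailing list or contacting the instructors directly also works
  - Questions may be hard to hear on the recording, so the instructor will repeat them
Enrollment and waitlist
- Waitlist size: about 6 people
- Most are expected to get in as schedules shift and students drop
- If you are still on the waitlist, talk to the instructors directly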
3. NLP overview: tasks and metrics
3.1 Three broad categories of NLP (Natural Language Processing)
Classification – text → a single label
- Examples:
  - Sentiment analysis: movie review → positive / negative / neutral
  - Intent detection: “Set an alarm for tomorrow” → intent = “create alarm”
  - Language detection: identify which language a sentence is in
  - Topic classification, etc.
- Metrics:
  - Accuracy: fraction of all predictions that are correct
  - Precision: of the items predicted positive, the fraction that are truly positive
  - Recall: of the truly positive items, the fraction predicted positive
  - F1 score: harmonic mean of precision and recall
- Why:
  - With class imbalance (e.g. 99% positive), accuracy alone can mislead
    → precision/recall/F1 matter
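The class-imbalance point can be checked with a small sketch. The data below is made up (99 positive labels, 1 negative) and the always-positive classifier is a hypothetical worst case; `precision_recall_f1` is an illustrative helper, not part of any library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class designated as `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up imbalanced data: 99 positive reviews (label 1), 1 negative review (label 0).
y_true = [1] * 99 + [0]
# A trivial classifier that always predicts "positive".
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Score the rare class (label 0) as the class of interest.
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=0)
print(accuracy, precision, recall, f1)
```

Accuracy comes out at 0.99 even though the classifier never finds the rare class, while precision, recall, and F1 for that class are all 0.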
Multi-classification (multi-label / token-level classification) – text → multiple labels
- Representative tasks:
  - Named Entity Recognition (NER): tag entities in a sentence such as people, places, times, and organizations
  - Part-of-speech tagging: tag each word as noun, verb, adjective, etc.
  - Syntactic parsing (dependency / constituency parsing), etc.
- Evaluation:
  - Precision/recall/F1 computed at the token level or per entity type
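Generation – text → text (variable length)
- Examples:
  - Machine translation: EN → FR/DE, etc.
  - Question answering / conversational models: assistants such as ChatGPT and Gemini
  - Summarization: summarizing documents and articles
  - Generating code, poems, stories, etc.
- Data:
  - Translation requires paired data (e.g. WMT, Europarl EN–FR/EN–DE)
- Evaluation:
  - BLEU: n-gram overlap with reference translations (higher is better)
  - ROUGE: n-gram-based metric for summarization/translation quality (higher is better)
  - Perplexity: measures how surprised (uncertain) the model is by the actual text → lower is better
- Limitations:
  - BLEU/ROUGE need reference sentences → labeling is costly
  - Since the rise of LLMs, reference-free evaluation (LLM-as-a-judge) is an active research area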
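To make “lower perplexity is better” concrete, here is a sketch that assumes the usual definition (the exponential of the average negative log-probability the model assigns to the observed tokens); the summary itself does not spell out the formula, and the probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up probabilities that a model assigned to each actual next token.
confident = perplexity([0.9, 0.8, 0.9])
# A model that guesses uniformly over a six-word vocabulary.
uniform = perplexity([1 / 6] * 3)
print(confident, uniform)
```

A uniform guess over six words gives a perplexity of about 6, which matches the intuition of the model hesitating between six equally likely options.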
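4. Text processing basics: tokenization and embeddings
4.1 Tokenization
- Goal: split text into the smallest units (tokens) the model can handle
- Example sentence: “a cute teddy bear is reading”
Comparison of tokenization levels
Word-level
- Split into words on whitespace
- Pros:
  - Simple and intuitive
- Cons:
  - Every word variant is treated as a different token
    - e.g. “bear” vs “bears”, “run” vs “runs”
  - Severe out-of-vocabulary (OOV) problem
    - Words unseen during training must be mapped to an [UNK] token
  - Relationships between similar words have to be learned separately through embeddings
Subword-level
- Split words into smaller pieces (roots, prefixes, suffixes)
- Examples:
  - “bears” → “bear” + “s”
  - “running” → “run” + “ning”, etc.
- Pros:
  - Shared roots let related words share meaning
  - Far fewer true OOV cases (new words can be expressed as combinations of subwords)
- Cons:
  - Sequences are longer than with word-level tokenization
  - Sequence length translates directly into more compute and memory
Character-level
- Each character is a token
- Pros:
  - Robust to typos, spelling variants, and casing differences
  - No OOV in theory
- Cons:
  - Sequences become extremely long
  - Character embeddings are hard to interpret semantically, and the model has to learn all higher-level structure (words, phrases) itself, which is costly
Summary: most production LLMs use subword-level tokenizers to balance meaning, OOV handling, and compute.
- Vocabulary size
  - Single language (e.g. English): tens of thousands
  - Multilingual LLMs that also cover code: hundreds of thousands
  - Languages such as Chinese have a different writing system, so they use subword/character strategies based on those characters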
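The three levels can be compared on the example sentence with a rough sketch. The subword vocabulary and the greedy longest-prefix rule below are invented for illustration only; real subword tokenizers learn their vocabulary from data and are not what this code implements.

```python
sentence = "a cute teddy bear is reading"

# Word level: split on whitespace.
word_tokens = sentence.split()
# Character level: every character (spaces included) is a token.
char_tokens = list(sentence)

# Hypothetical subword vocabulary, chosen by hand for this example.
subword_vocab = {"a", "cute", "teddy", "bear", "is", "read", "ing", "s"}

def subword_tokenize(word, vocab):
    """Greedy longest-prefix match; falls back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, subword_vocab)]
print(len(word_tokens), len(subword_tokens), len(char_tokens))
```

The sequence gets longer as the unit gets smaller, which is the compute and memory trade-off listed above.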
4.2 Token representations (embeddings) and similarity
(1) Limits of one-hot encoding
- With vocabulary size V, each token is a vector of length V
  - e.g. vocab = {soft, teddy_bear, book}
    - soft → [1, 0, 0]
    - teddy_bear → [0, 1, 0]
    - book → [0, 0, 1]
- Problem:
  - All vectors are mutually orthogonal (cosine similarity 0) → “soft” and “teddy_bear” are related in reality but have no numerical relationship
  - Cannot capture meaning or similarity
(2) Distributed representations and cosine similarity
- Goal:
  - Give semantically close words vectors that point in similar directions
- Example:
  - teddy_bear ↔ soft: high similarity
  - teddy_bear ↔ book: low similarity
- Cosine similarity:
  - Measures similarity from the angle between two vectors
  - Looks only at direction; length (norm) is cancelled out by normalization
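A short sketch of cosine similarity on both kinds of vectors. The one-hot vectors follow the three-word example above; the 2-dimensional dense embeddings are made-up numbers picked by hand, not learned values.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors: any two distinct tokens are orthogonal.
soft_onehot, teddy_onehot = [1, 0, 0], [0, 1, 0]
onehot_sim = cosine_similarity(soft_onehot, teddy_onehot)

# Made-up dense embeddings, for illustration only.
soft, teddy_bear, book = [0.9, 0.3], [0.8, 0.5], [-0.2, 0.9]
sim_soft = cosine_similarity(teddy_bear, soft)
sim_book = cosine_similarity(teddy_bear, book)
print(onehot_sim, sim_soft, sim_book)
```

Because the norms are divided out, scaling a vector by a positive constant leaves its cosine similarity with any other vector unchanged.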
(3) Word2vec: CBOW and Skip-gram
- Idea:
  - Learn word embeddings from large amounts of raw text through a proxy task that captures the patterns of language
- Two proxy tasks:
  - CBOW (Continuous Bag-of-Words)
    - Predict the center (target) word from the surrounding context words
  - Skip-gram
    - Predict the surrounding context words from the center word
- Characteristics:
  - The real goal is not accurate word prediction but the embeddings (weights) learned along the way
  - As a result, interpretable vector arithmetic becomes possible, such as:
    - king - man + woman ≈ queen
    - Paris - France + Germany ≈ Berlin
(4) A simple example: “a cute teddy bear is reading”
- Proxy task: next-word prediction
- Model:
  - A simple neural network with input dimension V and hidden dimension d
  - Input: one-hot(“a”) → W1 → hidden h → W2 + softmax → distribution over the next word
- Training:
  - Minimize the cross-entropy loss between the predicted distribution and the actual next word (one-hot)
  - Update W1 and W2 by backpropagation
  - Repeat for every token
- Extracting embeddings:
  - Use the learned W1 (or the hidden-layer representation) as the word embedding
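The forward pass of this toy network can be sketched in plain Python. This shows only the forward pass with randomly initialised (untrained) weights; the weight values, the choice of index 0 for the word “a”, and the helper names are assumptions made for illustration.

```python
import math
import random

random.seed(0)
V, d = 6, 2  # toy vocabulary size and hidden size, as in the lecture example

# Randomly initialised weights; untrained placeholders.
W1 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]  # V x d
W2 = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(d)]  # d x V

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = one_hot(0, V)                      # the word "a", assumed to sit at index 0
hidden = vec_mat(x, W1)                # hidden state of size d
probs = softmax(vec_mat(hidden, W2))   # distribution over the next word

# Multiplying a one-hot vector by W1 simply selects one row of W1,
# so row i of W1 serves as the embedding of word i.
embedding_of_a = W1[0]
print(hidden, embedding_of_a)
```

This is why the embedding step is often implemented as a table lookup rather than an explicit matrix multiplication.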
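5. Sequence models: RNN, LSTM, and their limits
5.1 RNN (Recurrent Neural Network)
- Purpose:
  - Obtain a sentence representation that reflects word order
- Structure:
  - At each time step t:
    - Input: current word embedding x_t, previous hidden state h_{t-1}
    - Output: new hidden state h_t and, if needed, an output y_t
- Interpretation:
  - h_t is a vector summarizing the context up to position t
- Uses:
  - Classification: final hidden state h_T → sentence embedding → class prediction
  - Token classification: h_t at each step → token tagging (NER, etc.)
  - Generation (translation, etc.):
    - Encode the whole sentence with an encoder RNN → use the final hidden state as context → generate the output sentence with a decoder RNN
5.2 Problems: long-range dependencies and vanishing gradients
- All information accumulates in a single hidden state
- As sentences get longer:
  - Older information is poorly retained (long-range dependency problem)
  - Backpropagation involves a long chain of multiplications along the time axis → repeatedly multiplying terms with 0 < |value| < 1 drives the gradient toward 0 (vanishing gradient)
  - Conversely, terms larger than 1 cause exploding gradients
- Result:
  - The longer the sentence, the harder it is to remember its beginning
  - Training also becomes unstable and slow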
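The repeated-multiplication argument can be illustrated with a scalar simplification. In a real RNN the per-step factors are Jacobian matrices rather than a single constant, so the factors 0.9 and 1.1 and the 100 steps below are assumptions chosen only to show the trend.

```python
def repeated_product(factor, steps):
    """Size of a gradient signal after being multiplied by `factor` at every time step."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

vanishing = repeated_product(0.9, 100)   # factor slightly below 1
exploding = repeated_product(1.1, 100)   # factor slightly above 1
print(vanishing, exploding)
```

Even factors close to 1 shrink or blow up the signal by several orders of magnitude over 100 steps, which is why early tokens barely influence the weight update.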
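5.3 LSTM (Long Short-Term Memory)
- An improved RNN:
  - Adds a cell state c_t alongside the hidden state
  - Uses several gates (input, forget, output) to selectively control what to remember and what to discard
- Goal:
  - Mitigate the long-range dependency problem
- However:
  - Still relies on sequential computation → hard to parallelize, slow to train
  - Not a complete fix for very long sequences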
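6. Attention: the key idea for handling long-range dependencies
6.1 The attention concept
- Motivation:
  - RNNs/LSTMs only pass past information along sequentially → it is hard to refer directly to an “important earlier word”.
- The idea behind attention:
  - At the position currently being predicted, look directly at the representations of the whole sequence and compute a weighted sum
- Example: translation
  - Input: “a cute teddy bear is reading”
  - Output: “un ours en peluche lit”
  - When generating a given French word, put a large weight directly on the corresponding English word
6.2 Self-attention and Q-K-V (Query, Key, Value)
- Self-attention:
  - A mechanism that lets every token in a sentence refer to every other token
  - e.g. when building the representation of “teddy bear”, other tokens such as “cute” and “reading” are taken into account
- Q (Query), K (Key), V (Value):
  - Query: a vector representing “what I want to know right now”
  - Key: acts as each token's “index / address information”
  - Value: the content vector actually used in the weighted sum
- Mechanism:
  - Compute the similarity (usually a dot product) between the query and each key
  - Normalize with softmax → attention weights
  - Take the weighted sum of the values using these weights → new token representation
Interpretation: the query determines “which tokens to consult and how much”, and the new representation is the weighted sum of the corresponding values.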
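6.3 Matrix form
- Input embeddings X (N × d_model, one row per token)
- Learned weight matrices:
  - W_Q, W_K (each d_model × d_k) and W_V (d_model × d_v)
- Computation:
  - Q = X W_Q, K = X W_K, V = X W_V
  - Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V
- Meaning:
  - QKᵀ: matrix of similarities between each token's query and every token's key
  - softmax: a probability distribution (attention weights) for each query
  - Weighted sum: a weighted average of the values based on this distribution
- Why divide by √d_k:
  - As d_k grows, the variance of the dot products grows → softmax becomes overly peaked
  - Scaling normalizes this and stabilizes training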
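A dependency-free sketch of the formula softmax( Q Kᵀ / √d_k ) V. It assumes X holds one row per token (N × d_model) so that X W_Q is well defined, and it uses made-up inputs with identity projection matrices purely to keep the example short; the helper names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with one row per token."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Made-up inputs: N = 3 tokens, d_model = d_k = d_v = 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = W_K = W_V = [[1.0, 0.0], [0.0, 1.0]]
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = attention(Q, K, V)
print(weights)
```

Each row of `weights` is a probability distribution over the three tokens, and each row of `output` is the corresponding weighted average of the value vectors.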
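7. Transformer architecture: Attention is All You Need
7.1 Overall structure
- Proposed in the 2017 paper “Attention is All You Need”
- Drops the sequential RNN structure and relies entirely on self-attention (and its variants)
- Consists of two main parts, an encoder and a decoder
- Flagship application: machine translation
7.2 Input processing: token embeddings + positional encoding
- Tokenization: includes BOS and EOS
- Embedding layer: each token → a d_model-dimensional vector
- Positional encoding:
  - Plain self-attention is unaware of order → position information has to be added
  - The paper encodes position with periodic sin/cos functions
  - Embedding and positional encoding are combined by element-wise addition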
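A sketch of the sin/cos positional encoding, using the formula from the 2017 paper (even dimensions use sin, odd dimensions use cos, with wavelengths set by 10000^(2i/d_model)). The summary only says that periodic sin/cos functions are used, so the exact formula is supplied here; the token embeddings are made-up constants.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=4, d_model=4)

# Made-up token embeddings, combined with the encoding by element-wise addition.
embeddings = [[0.1] * 4 for _ in range(4)]
inputs = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]
print(pe[0], pe[1])
```

Every position gets a distinct vector, so two identical tokens at different positions no longer look the same to self-attention.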
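7.3 Encoder block
Each encoder layer consists of:
- Multi-head self-attention
  - Input: sequence embeddings of length N
  - Each token attends to all tokens to produce a new representation
  - Each head learns its own Q/K/V projections → captures different relationships
  - Head outputs are concatenated, then projected back to d_model dimensions with W_O
- Feed-forward network (FFN)
  - Input dimension d_model → intermediate dimension d_ff → back to d_model
  - Nonlinear transformation that increases expressive power
  - d_ff is set larger than d_model to give the model capacity
- (The paper also has residual connections + LayerNorm, mentioned only in passing here)
- Several layers are stacked (e.g. 6 layers)
- Final encoder output: a set of context-aware embeddings of the input sequence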
7.4 Decoder block
Each decoder layer contains three sublayers:
- Masked multi-head self-attention
  - Self-attention over the decoder's previously generated tokens
  - A causal mask prevents looking at future tokens
  - At step t, tokens after t are excluded from attention
- Cross-attention (encoder-decoder attention)
  - Query: the decoder's current hidden representation
  - Key/Value: the encoder output (input sentence embeddings)
  - Role:
    - Learns which part of the input sentence the translation token being generated should consult
- Feed-forward network
  - Same structure as in the encoder; increases expressive power
- Decoder layers are likewise stacked
- Last decoder layer output → final linear + softmax → next-token probabilities
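The causal mask used in masked self-attention can be sketched as follows. The attention scores are made-up numbers, and giving masked positions a weight of exactly 0 is one way to write it; implementations commonly get the same result by setting masked scores to negative infinity before the softmax.

```python
import math

def causal_mask(n):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(n)] for t in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    top = max(s for s, ok in zip(scores, mask_row) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Made-up attention scores for 3 decoder positions.
scores = [[0.5, 2.0, 1.0], [1.0, 1.0, 3.0], [0.2, 0.1, 0.3]]
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]
print(weights)
```

The first position can only attend to itself, the second splits its weight over the first two positions, and only the last position sees the whole sequence.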
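7.5 Multi-head attention
- Instead of a single attention, several heads (h of them) run in parallel
- Each head i:
  - Q_i = X W_Q^(i), K_i = X W_K^(i), V_i = X W_V^(i)
  - Uses its own projection matrices, so it learns relationships from a different perspective
- All head outputs are concatenated, then projected back to d_model dimensions with W_O
- Benefit:
  - Different patterns (syntactic relations, semantic relations, long-range dependencies, etc.) can be learned separately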
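8. Translation with a Transformer (end-to-end example)
Sentence: “a cute teddy bear is reading” → “un ours en peluche lit”
Encoding
- Tokenize the input sentence, embed it, and add positional encoding
- As it passes through the encoder layers:
  - Each token is turned into a context-aware representation through its relationships with the other tokens in the sentence
- Final encoder output: N contextual encoding vectors (one per token)
Start of decoding
- Decoder input: BOS token
- First step:
  - Masked self-attention: attends only to BOS (itself, so nothing informative)
  - Cross-attention: uses the entire encoder output as Key/Value
  - After the FFN, linear + softmax → probability distribution for the first output token (e.g. “un”)
  - “un” is selected by argmax or sampling
Generating the second token
- Decoder input: [BOS, “un”]
- Masked self-attention:
  - The representation of “un” self-attends over [BOS, “un”]
- Cross-attention:
  - The decoder representation at this step is the Query; the encoder output is the Key/Value
- Output: probability of “ours” ↑ → “ours” is selected
Repeat
- Keep decoding with [BOS, “un”, “ours”, “en”, “peluche”, …] as input
- At each step:
  - Masked self-attention over the previously generated translation tokens
  - Cross-attention over the encoder output
- Repeat until the EOS token is generated → translation ends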
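9. Training technique: label smoothing
- In NLP generation tasks (translation, etc.), there can be several valid next words
  - e.g. “What a great ___” → “day”, “idea”, “lecture”, etc.
- Standard: a one-hot label with probability 1 on the correct token and 0 elsewhere
- Label smoothing:
  - Correct token: 1 − ε
  - Every other token: ε / (V−1), spreading a little mass to each
- Effect:
  - Keeps the model from becoming over-confident
  - Reported to improve generalization, BLEU score, etc.
- Implementation note:
  - It is not “label smoothing applied to the softmax output”; it softens the target distribution used to compute the loss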
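A sketch of label smoothing as described here: the target distribution is softened, and the loss is then computed against it. The value ε = 0.1, the vocabulary size 6, and the predicted probabilities are assumptions for illustration; the helper names are hypothetical.

```python
import math

def smoothed_target(true_index, vocab_size, epsilon):
    """1 - epsilon on the true token, epsilon / (V - 1) on every other token."""
    off = epsilon / (vocab_size - 1)
    return [1.0 - epsilon if i == true_index else off for i in range(vocab_size)]

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

predicted = [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]   # made-up softmax output, V = 6
hard = smoothed_target(1, 6, epsilon=0.0)     # ordinary one-hot label
soft = smoothed_target(1, 6, epsilon=0.1)     # smoothed label (epsilon is an assumed value)

hard_loss = cross_entropy(hard, predicted)
soft_loss = cross_entropy(soft, predicted)
print(soft, hard_loss, soft_loss)
```

The softmax output is left untouched; only the target changes, so the model is no longer rewarded for pushing the correct token's probability all the way to 1.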
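Wrap-up
- This lecture aims for:
  - A basic understanding of NLP tasks and evaluation metrics
  - An understanding of tokenization, embeddings, Word2vec, and the limits of RNNs/LSTMs
  - An understanding of attention, self-attention, multi-head attention, and the Transformer encoder–decoder architecture
- Building on this, the course will go deep into LLM architecture, training, and use
- Making good use of the lectures, the textbook, the VIP Cheat Sheet, and the weekly slides and recordings will give you a fairly deep understanding of how modern LLM systems work.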