Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 9 - Recap & Current Trends
Hello everyone and uh welcome to lecture
9 of CM295.
So as you know uh today is a kind of a
special day because uh we're having the
last lecture of the entire course. Um so
the menu for today will be a little
different compared to usual.
uh we're going to try to divide the
lecture in three parts. So in the first
part we're going to recap actually what
we did in the entire class
just to see how different pieces kind of
fit together. Uh in the second part we
will look at some topics that are
particularly trending uh in 2025 and
what we think are going to be trending
in the near future. And then uh the
third part will be more uh way for us to
just conclude and uh next steps uh for
all of you.
Does that sound good?
Cool. So with that uh we're going to
start with the first part which uh what
I mentioned is about recapping what we
did this entire quarter.
So nothing new here. It's just a way for
us to piece everything together.
So if you remember u lot of weeks ago I
believe it's like maybe 10 weeks ago we
had lecture one which was focused on
understanding what transformers were. So
at the very beginning of the class we
didn't even know how we could process
text. So I guess the first step that we
saw was this tokenization step which
consists of dividing the input into
atomic units. And so here the way you we
divide the text is something that is
arbitrary in some sense. So we have
different algorithms that allow us to do
that. Uh and we saw that the most common
tokenization algorithm is the subword
level tokenizer.
And we saw that some of the advantages
were that um roots of words could be
reused um and leveraged especially when
it came to representing those tokens.
And
speaking of representation,
once we were able to divide the input
text into atomic units, aka tokens, the
next step for us was to learn how to
represent these embeddings. So if you
remember, we saw some methods that were
very popular back then. So one of them
was called wordtovec
and the representation was learned from
a proxy task which was something like
predicting the center word or predicting
the context words.
But then we saw that this way of
learning representations had some
limitations.
One of which was that these
representations were not contextaware.
Meaning that uh if a word is in a given
sentence or in another sentence they
will both have the same like that word
will have the same representation in
both sentences.
And so for that reason we saw some uh
other methods that were popular in the
2010s. one of which was
RNN's if you remember. So RNN's um had
this recurrent structure which process
tokens one at a time and kept an
internal representation of the sequence
so far.
But then we saw that a big limitation of
this was this problem of long range
dependency and in particular the fact
that uh tokens that were encoded far in
the past were not um quantities that
were able to be kept I guess as the
sequence got longer
and this is the reason why we saw the
central idea of this whole class which
is the idea of self attention where
tokens s can actually
attend to one another regardless of
where they are placed in the sequence.
So you can think of this as a direct
link.
And so uh this for instance is what we
saw. We we saw that there are like three
main uh terminologies that people use.
So query, key and value. So typically
you want to know how similar a query is
compared to the keys in the in the
sequence and you quantify that by um
taking some dot products that's kind of
scaled and softmaxed
and then you have the corresponding
value that is taken. So at the end of
the day we obtain some kind of weighted
average of all the tokens that are in
the sequence.
And then uh you may also be familiar now
with this formula. So soft max of q
krpose over square root of dk um time v.
So this is the matrix formulation of
what I mentioned here which uh is able
to process these computations in a very
efficient way and it's something that uh
today's hardware is well equipped to do
and then we finish the first lecture by
going through the architecture that is
the foundation of modern day LLMs which
is the transformer and we saw that there
are two u not notable parts in the
transformer. So one was the encoder in
the left part of the uh um the figure
and then the right part is the decoder
and we saw how this was applied in the
case of translation.
So at the end of the first lecture we
saw what motivated us to end up with the
transformer
and we saw that transformer was working
quite well in the case of uh translation
and so in the next lecture what we saw
was what were the little improvements
that people have made to this
architecture since it was released and
if you remember it was in 2017 that it
was publicublished.
So one particular improvement that
people have made is in the way we
consider positions
because in the original transformer
paper positions were encoded
in an absolute way as in each position
had its own embedding
and this embedding was added to the
token embedding.
But then if we think about it
positions actually we don't really care
about the absolute position. We care
about the relative position between
tokens
and in particular we care about how far
tokens are in the self attention
computation.
Which is why we saw this methods that is
now quite popular called rotary position
embeddings aka rope
that is now quite used. And it is a
method that rotates
query and keys
both of which happen in the self
attention computation.
And so here um what is uh quantified
here is purely a function of the
relative distance between um two tokens
and not only that it is something that
is uh taken care of in the self
attention layer which is what we care
about.
So this was one big improvement and then
we saw some other improvements
especially when it came to how the multi
head attention um layer multi head
attention layer was composed of and in
particular we saw that it was possible
for us to have some groupings
of the matrices that we learn. So we
don't need to have one matrix one
projection matrix per head for let's say
keys and values. We can actually uh
group them. So this is uh for instance
what is mentioned here. So group query
attention. Um and then we also saw some
other techniques that I have not
represented here like for instance the
normalization layer in the transformer
which here happens after each
sub layer but I guess nowadays people
have tried moving the normalization
piece before the sub layer. So here it's
the postnorm
version and then the before the sub
layer part is called the prenorm
version.
And then the last thing that we saw was
that from this transformer architecture
there were a lot of derived models that
were based from that. So we saw that if
we only keep the encoder part we could
compute very meaningful embeddings.
If you remember there was this um uh
kind of landmark paper on encoder only
model which is birds which was heavily
used in the context of classification
because it relied on the encoded
embedding of the CLS token. And so that
was one.
But then we also saw that there was a
number of other kinds of models all more
or less derived from the transformer. So
you could only keep the encoder which
was for birds. You could only keep the
decoder which is for instance for GPT
and you could also have both
which is for instance the case of T5.
And one particular aspect of each of
these models is that encoder only is not
able in the way that we saw is not able
to generate text
but is able to generate embeddings which
can be used for downstream tasks. But
then encoder decoder models like T5 or
decoder only models like GPT they can be
auto reggressive and
generate text. The paradigm can be text
in text out.
And with that we then focused on what
now everyone calls large language models
which are transformerbased models
specifically texttoext models. So
decoder only transformer-based models
and we saw that people have come up with
a lot of new tricks now because um you
know these models as the name uh
indicates um people have scaled them up.
But then one question was uh kind of
kind of thrown which is do you actually
need all these parameters to just do a
forward pass.
So we saw uh one kind of uh variant
which was based on mixture of experts.
So what mixture of experts are is
instead of running everything through
the whole entire model, you're going to
instead have a number of experts
that you're going to activate in a
sparse way.
So for instance, for one input, you're
going to just activate just a subset and
then for another input you're going to
activate another subset so that you
don't need to do all the computations
all the time.
And we saw that these mixture of experts
they were used in LLMs in particular in
the feed for neural network layer. So
here you would have experts as being
different feed for neural networks and
you would have a gating mechanism that
would reroute
to the correct feed for neural network.
And then we also saw that some papers
were also able to kind of produce some
nice visualization in terms of uh I
guess which token gets routed to which
experts because this rerouting we saw
that it was done at the token level.
And so one reason why it's done at the
token level is to be able to I guess
smartly put the experts on different
pieces of hardware, different GPUs and
then kind of parallelize the computation
a little bit more.
And then we also saw that these LLMs
they always are tasked with predicting
the next token. And in order to predict
the next token, we were interested in uh
I guess how we were uh you know doing
this. And so one particular uh method
that people use is just sample
sample from the output distribution. So
you have let's say given an input you
have a distribution of probabilities of
what the next token would be that is
output by the model. And what you do is
instead of let's say taking the highest
probability which is called the greedy
decoding
uh greedy kind of decoding you actually
sample.
So it introduces some randomness and
allows the model to produce kind of a
bigger variety of uh kinds of outputs.
And we saw uh that you could adjust how
how much I guess variety you want in
your output by tweaking a hyperparameter
called temperature.
So very low temperature leads to very
spiky distribution. So more
deterministic outputs and higher
temperatures are I guess a bit more uh
random a bit more creative.
Okay. So until then we saw what LLMs
were, how they were based on the
transformer, how they connected to the
architecture that we saw in the first
lecture and then in lecture lecture four
we saw how people actually trained those
LLMs
because as I mentioned these LLMs are
large and so you cannot kind of naively
fit them in your hardware. you need to
be a little bit smart about it.
So in particular, what people have uh
kind of noticed in the early 2020s is
that the bigger your model is,
the better your performance.
So people just started building bigger
and bigger models. So here in the uh
illustration we saw that so on the y-
axis is the test loss. So the lower the
better. So we saw that the more compute
you use the better your tests uh
performance and same with uh increasing
the data set size and same with
increasing the number of parameters
but then as you know compute is not
infinite. So there was a natural
question that came out of the community
which was okay if we give you a given
budget a given compute budget
can you choose
I guess some quote unquote optimal
number of parameters and data set size
on which you want to train your model.
And so we saw that there was this paper
that was um published in the early 2020s
uh which actually studied the
relationship between
um I guess if you vary the data set size
and uh the size of your model and the
performance on the test set.
And then we saw that actually most
models at the time were what we say
undertrained because they were too big
compared to the data set that they were
trained on. Like the data set that they
were trained on it was not as big as
they should have been.
And so in particular there was a kind of
a rule of thumb that came out of this
which was if you have a given number of
parameters in your model
you should at least train it on 20 times
the number of parameters in terms of
tokens.
So for instance, if you have uh a 100
billion parameter model, you should
train it on at least two trillion
tokens because two trillion is 100
billion * 20. So that's kind of the rule
of thumb that people have uh used and
then you know as I mentioned previously
you know these models are huge. So
people have tried to also make the
computation more efficient
and so there was this uh method that we
saw which is actually quite important
called flash attention
and flash attention is a method that
leverages the strength the strength of
the underlying hardware
and in particular it looks at so GPUs
more particularly it looks at the kinds
of memory memories that a GPU has.
So it has a big but slow memory and a
small but fast memory. So the HPM and
the SRAMM respectively.
And we saw that this method tries to
minimize the number of reads and writes
to the big and slow memory to the HPM.
And so the the way it was doing this was
to divide the computation in uh little
bits that it would send to uh the SRAMM
which is the small but fast memory so
that it can do the end to end
computation
and then send it back to where it was in
order to do the full end toend
computation.
So that method is an exact method
meaning that we're not doing any
approximations to the results
but it led to significant speedups and
in particular there was this second idea
from uh the paper which is a kind of an
important one as well which was that
sometimes
it's okay for you to not store results.
it's okay for you to just throw them out
and then recomputee when you need them
again. So there is this idea of
recomputation
using what I described which led to
faster run times even though we were
doing more computations.
So that was flash flash attention and uh
we also saw a number of other methods
that were meant to I guess parallelize
the computation. So we saw uh data
parallelism
which was this idea of not having all
your data be processed on a single GPU
but instead divided into uh kind of
multiple places.
And then we had the second method which
was model parallelism
where even for a given forward pass you
would actually involve multiple GPUs.
So anyway, there were a lot of very
interesting techniques, a lot of
different uh ideas about how to train
this model in an efficient way.
And uh in particular um so what I
described here is mostly important for
the first step of uh the training
process of an LLM which is called the
pre-training
which is meant to teach the model about
the structure of language about the
structure of codes. Uh and in particular
this model was trained with huge amounts
of data. So think about trillions of
tokens or even tens of trillions of
tokens.
Um and so that first step goes from an
initialized model to a model that is
able to autocomplete because it is
trained with an objective of predicting
the next token.
So at the end of this first stage, you
have a model that knows how to
autocomplete, but you have a model that
is not very helpful because it only
knows how to complete things.
So in order to have the model be useful
for our use cases, we had this second
step which is called the fine-tuning
step. uh where we teach the model on the
kinds of input output pairs that we want
it to perform well. So this is also
called uh the SFT stage supervised
fine-tuning stage. And at the end of
this second step, we have a model that
not only knows the structure of text and
codes, but also is able to behave in the
way you want.
But so far up until step number two, we
have only taught our model what to do.
We have not taught it what to not do.
And this is why we had our third step
which was the preference tuning step
where we took our model that went
through the pre-training stage that went
to the SFT stage and now we want to
inject some negative signal as well as
in I want you to prefer this compared to
this output.
And this third step uses preference
data. So like the name uh suggests so
preference tuning uses preference data
which is typically pair-wise data where
humans say okay I prefer this output
compared to that output.
And typically the model here is able to
align the kind of output it produces
with human preferences that could be
along the dimension of uh usefulness of
safety, friendliness, tone. Um there's a
bunch of different dimensions but uh
yeah so that's what is happening in this
third step.
And in this third step, it's actually in
lecture five that we dug into what that
third step was about. So if you remember
uh we had drawn a parallel
between the way our LLM produces tokens
and I guess what people in the
reinforcement learning field um I guess
consider how uh given policy is uh
interacting with some environment and
performing some action and being in some
states. uh and the reason why we drew
drew that parallel was to be able to
leverage some RLbased techniques
in order to train our model. So in this
case we said our LLM is a little bit
like a policy.
So given some state which is the input
it has received so far it can perform
the next action and in this case it is
to predict the next token
and this prediction is made in the
environment of tokens
and when we u like predict a completion
what we do is at the end of the day we
have some signal some reward part which
can be the human preference.
So this is the parallel we drew with the
RL worlds and with that in mind we
talked about rewards
but the problem is that rewards are only
available for a limited set of data
which is why we saw how to model
rewards.
So we saw this formula if you remember
it's called the Bradley Terry
formulation
which um models
how the probability of an output being
better than another one is as a function
of I guess two scores like the score of
output I and the score of output J. And
we saw that reward models they are
typically trained by having this
formulation in mind in a pair-wise
fashion.
So what this means is a reward model you
give it two outputs. You say this one is
good, this one is bad and then I want
you to say this one is good. You train
it in a pair wise fashion. But then your
model is actually predicting always two
scores. It's always predicting the score
RA I for output I RJ for output J. Um
and so at inference time you're only
giving it one output.
So I think that's like one subtlety like
we train it in a pair wise way but at
inference time we're kind of using it
in a in an individual way if that makes
sense.
And so once we trained our reward model
using this formulation
then we were able to use it to steer our
LLM towards the direction that we care
about.
So if you remember the way we steer our
LLM in the direction of human
preferences is to give it a prompt
so that it can produce a completion
aka a rollout or in simpler terms an an
answer.
And then we take this prompt, we take
this answer, we put them both in the
reward model that tells us how good the
model response is.
And depending on what the reward model
says, we can tune the weights of the LLM
in a way that maximizes human or the
reward that we saw which is trained on
human preferences.
And the loss function of this RL uh
setup
is typically something that tries to
maximize rewards
but also keep the model close to the
base model. And here by base model we
mean the SFT model. And the reason why
we want that is because this reward is
imperfect.
So we saw this uh phenomenon of reward
hacking
where your reward can be imperfect and
the LLM can exploit its imperfect nature
to tune it in a way that actually does
not align with what you want it to be.
So you want the LLM to not be too far
from the base model which is actually
already a good model. So it's a way to
regularize that if you want and you also
want the
iteration updates to not be too big
either.
So you typically have these two
constraints. You don't want it to
deviate too much from the base model,
but you don't want it to deviate too
much from the previous RL iteration.
And then just as a reminder, I think
this was lecture five. I think was the
most technically challenging of the
whole class. So completely fine if the
first time you were like you know what's
happening. Uh but hopefully now it
should be a little bit more more clear.
Uh cool. And then after lecture five
we're like okay we've done a lot of uh
hard work. So uh the good thing is you
know we're in 2025 and in the past 12
months or now 14 months we've seen a lot
of models that were being released with
these reasoning capabilities
and the way they were trained to exhibit
these advanced reasoning capabilities
was actually leveraging a lot of the
techniques that we saw in lecture 5
just like oral based techniques.
And in particular, what we want our LLM
to do is to output a reasoning chain
before producing the final answer.
And the reason why we wanted to do that
is because people have seen that it
improves the performance of the model.
And so it's actually relying on this
idea of chain of thoughts, which I
believe we saw at lecture three.
which is a prompting technique to have
your model output the reasoning before
outputting the the response.
So long story short, up until lecture
six, our LLM was having a prompt as
input directly outputting the output.
But in lecture seven, we said, sorry, in
lecture six, we said, uh, well, let's
have our LLM actually first output a
reasoning chain that the user may or may
not have access to before outputting the
final answer.
So you want to teach the LM to do that.
So how do you do that? Well, first
before doing this, I just want to show
you this chart which we saw which is the
performance of uh the model as we're
teaching it to produce these reasoning
chains. So people have typically
measured uh the improvement in
performance by comparing it to uh I
guess certain benchmarks and this one is
a popular one the AIM benchmark which is
the math math benchmark and we saw that
as the training progresses the accuracy
number of uh I guess what the LLM
outputs is increasing.
But back to what I was I was saying uh
the key technique that we use to teach
the model how to output these reasoning
chains is
leveraging the RL techniques that we saw
in lecture five. And in particular
um up until now we saw PO which was the
main RL algorithm that people were using
up to maybe last year and now people are
kind of prioritizing GRPO as an R
algorithm in order to teach the model to
be better at reasoning tasks.
And there are several reasons to do to
to that that I will explicit right now.
So we saw this illustration that
compared how GRPO was differing with PO
and if you can see in the graph uh there
are a few things that are different.
The first thing is that GRPO does not
rely on a value model.
So, who remembers what a value model is?
Yep.
Yes. Exactly. So, the value function is
trying to predict what the reward would
be if you um were to follow the policy
of the LLM. Um, and I guess it's a way
to have some baseline as to how good
some predictions are. You want to make
it more relative. So the value function
is a way for us to make these uh rewards
a little bit more relative to one
another. Um, and so that's what that's
how PPU was doing this. So it was having
a value model that was um making these
predictions and then we had uh this
generalized advantage estimation method
that was combining the reward
predictions
with the value function predictions in
order to have what we call advantages.
So advantages is how good your output is
compared to some baseline.
But then in contrast to that, GRPL said,
"Okay, tree, we don't need a value
function because it's, you know, too
expensive to to train, to maintain. What
we're going to do instead
is generate several completions
and then have some formula that compares
the rewards of these completion these
completions
to one another.
So it's going to have some relative
effect in a sense that it will make
things more relative
and in doing so you are actually not
uh needed to maintain and train a value
function and that's like one big
difference compared to PO.
Um and uh the second big difference
which is not represented in this
illustration uh is that uh GRPO is
typically an algorithm that people have
used in the context of
teaching your model to be better at
reasoning tasks.
And so we saw that these kinds of
problems
have a verifiable reward
because when you complete a math
problem,
you actually know the answer you need to
get to. So you don't need to train a
reward model to tell you how good your
final answer is because you you already
know the answer.
And so we saw that GRPO was in
particular used in the context of when
you actually don't even need a reward
model when you actually have a
verifiable reward. So at the end of the
day, the only two models you need to
keep are the policy model
and the reference model to be able to
just compare how far you are from the
reference model.
Cool. Um, I know this one was also a
challenging class, I guess. So far so
good. And this is also on on the final.
So, which is why I'm I'm taking things
more slowly for this second part of the
recap. So, is everything good so far?
Yeah.
Okay. Perfect. We also saw some
extensions of GRPO. So if you remember
there was um some kind of bias that was
um a result of the loss function of GRPO
having some normalization term that
penalized
tokens that were in shorter outputs.
So we saw that if you use GRPO in its
original case in its original form
we saw that after a certain point
the algorithm will incentivize your
model to produce longer and longer
answers longer and longer incorrect
answers.
And the reason why it does that is
because relative to short incorrect
answers,
it penalizes less
long incorrect answers. And so this is
the reason why there are some extensions
that people have worked on this year.
One of which was uh GRPO done rights. So
we saw like um that they basically
removed the normalization term and there
was another method that we saw it was
called depo dapo
which also had some variance
and that's for reasoning models and then
lecture seven we had a model that you
know we knew how to train it we knew how
to uh use it for uh reasoning tasks how
to train it to be better but now we
wanted the model to be useful and
interacting with outside systems.
So we saw one technique that is kind of
an an essential technique called rag
short for retrieval augmented generation
that is meant for you to be able to
fetch relevant documents from some
knowledge base in order to answer
a question or answer a prompt.
And the reason why you want to do that
is that the knowledge of your LLM is
including up to
the data that is up to the knowledge cut
updates which is the max dates of what
your LM has been trained on.
And from a practical standpoint,
I guess from what we see nowadays,
you're typically not training your LLM
daily or continuously. And so in cases
where you need your LLM to know about
things that happened recently or about
things that happened that were not
in your LM training data,
you want your LLM to have access to such
information. And so that's how rag is
very useful. So we saw that rag
dependent very heavily on the way it
retrieves data. So we saw that the
retrieval part was mainly composed of
two steps. So the first one was
candidate retrieval
which use which uses a by encoder kind
of setup where you're basically doing
some semantic search. So you're
computing the embedding of the query.
you have some precomputed embeddings of
the documents in your knowledge base and
you're taking the ones that maximize
some similarity score like let's say
some cosign similarity.
So the this first step is allowing you
to retrieve
um I guess a filtered version of the
potential documents and then typically
you have a second step which is called
ranking or reranking because the first
step already gives you a ranking which
has typically a more sophisticated
setup.
So it's a cross encoder kind of setup
where you have your query and your
document that are both fed to some model
and produces a more precise score
and then you use this final score to
rank the final results and you typically
choose the top let's say K
and then you add them to your prompt. So
it's the augmented part. So retrieval is
everything I mentioned so far and then
once you have the relevant documents you
add them in your prompt which is the
augmented part and you generate the
answer.
So the reason why I'm taking so much
time on rag is rag is such an important
concept also
if you were to you know have interviews
or you know also maybe in the exam who
knows um so I think it's a it's an
important concept to uh to have in mind
the second one that we saw was tool
calling
and tool calling is allowing your LLM to
leverage tools. The way it does that is
in two steps. The first step is for your
model to know which API there is out
there.
At the end of which your LLM says, okay,
I want to use this API and I want to use
it with these arguments.
And then you have an intermediary step
which is you just run your API with
these arguments.
And then the second step is you feed the
results of this operation back to the
LLM which then produces a final answer.
So that's how tool calling works. So if
you say to your LM okay you can use this
use this API this is how your LM would
leverage that.
And then we saw that modern-day agentic
workflows were leveraging both rag and
tool calling as key methods to um
perform actions.
And we saw an example detail example uh
which was such that you had some inputs
and then your LLM had a series of
different calls
um in order to perform some action and
then at the end of it it retrieves sorry
it returns an answer.
Cool. And then last lecture
we saw how we could evaluate LLMs which
is a much tougher thing to do now that
LLMs can do a bunch of different things.
So we first saw that there were some
rulebased metrics that people were using
before LMS came into play. metrics that
you may have heard like blur, rouge,
meor and so on, but the main limitation
was that they were not considering how
language could differ but still be
correct.
And so uh this key idea that we saw was
why not leverage LLMs to evaluate
outputs. And so there is this uh key
idea of LLM as a judge where you receive
as input the prompts
the model response along with the
criteria that you want the response to
be evaluated on.
And then you want your LM messages to
output two things. The first one is a
rationale for why a given score is
output.
along with that score.
So nowadays, LM as a judges, they're
typically outputting a binary response
either
uh pass or fail, true or false, just
because it's easier. And we're also
having the rational be output before the
score
because in practice it's something that
also improves the performance of uh the
element as a judge a little bit if you
want like reasoning models do by
outputting the reasoning chain before
they output the answer.
But then we also saw that there were uh
some biases that came with this
approach. We saw position bias which is
the way you present the elements to
compare matters. So if you present
something first then maybe the LLM will
just prioritize that first. Uh so there
was position bias, there was verbosity
bias which is your LLM just preferring
longer outputs.
Uh and self-enhancement bias was another
one where it prefers its own outputs.
Um and then we also saw a number of
benchmarks
uh that people use nowadays in order to
say how great their LLM is. So if you
see the releases that come out, there
are typically a bunch of metrics across
a number of different benchmarks that
people know about. So that spans uh
knowledge, the ability to reason, coding
which is very important because a lot of
applications are coding related
and then safety and then this is not an
an
extensive list so there's actually many
more dimensions.
Um
so yeah I think that's where we stopped
and it was last lecture
and this is all you are expected to know
for the final.
Everything after that is not going to be
part of the final.
Any questions
on this so far?
Cool. Okay. I'm expecting a hundreds for
everyone for the final. But yeah, um
I would say what I went through
is going to be foundational for the
final. So I guess if you understood
everything I said,
I think you're going to be ready for the
final. So um yeah, but if you have any
questions, you know, Shervin and I are
always here um to uh Oh, yeah. You have
a question?
Yes. So the question is, is the scope
for the final of lecture 5 to lecture 8?
Yes. So for midterm it was lectures 1 2
3 4 and this one is 5 6 7 8. So I guess
it's u equal equal size.
Cool.
Okay, great. So, with that said, we just
finished recapping this entire quarter
worth of lectures and now we're going to
go to the second item of today's menu,
which is looking at some trending
topics.
And so, I'm going to start with the
first one.
And I'm going to introduce it as
follows.
So if you remember we saw that the
transformer
was a concept and an architecture that
was first introduced in the context of
machine translation.
So it performed great. People said okay
it performs great on machine translation
why not try it on other text tasks. So
they tried it performs great
but now the question is can you not use
it for things other than text
it's a natural question right so in
order to answer that question I just
want us to remind ourselves that this
architecture is relying on this concept
of self attention
and this is what is making the
transformer work so Well,
so if we just recap what self attention
is, this uh illustration kind of does
the job quite well.
You have a query and then you have a
bunch of other elements which are
represented by your keys and your values
and you want to know which other
elements are actually relevant
in order to compute the embedding for
that query.
So right now we have only used tokens
you know text tokens
핵심 요약
전체 과정을 복습하고 2025년 최신 트렌드를 소개합니다. Lecture 1-8의 핵심 개념을 정리하고, Vision Transformer, Diffusion LLM 등 새로운 패러다임과 앞으로의 연구 방향을 다룹니다.
주요 개념
Part 1: 전체 과정 복습
Lecture 1 - Transformer 기초 01:24
- Tokenization: 텍스트를 atomic unit으로 분할 (subword level이 가장 일반적)
- Embedding: Word2Vec → RNN의 한계 (long-range dependency)
- Self-Attention: 토큰 간 직접 연결, 위치에 상관없이 attend 가능
- Transformer: Encoder-Decoder 구조, 번역 task에서 시작
Lecture 2 - Transformer 개선 05:33
- RoPE (Rotary Position Embedding): 절대 위치 → 상대 위치, Q/K 회전
- Grouped Query Attention: K/V 행렬 그룹화로 효율성 향상
- Pre-norm vs Post-norm: 현대 LLM은 Pre-norm 선호
- Encoder-only (BERT): 분류 task / Decoder-only (GPT): 생성 task
Lecture 3 - LLM 구조 09:07
- MoE (Mixture of Experts): 전체 파라미터 중 일부만 활성화, FFN 레이어에 적용
- Temperature: 낮으면 deterministic, 높으면 creative
- Sampling: Greedy decoding 대신 확률적 샘플링으로 다양성 확보
Lecture 4 - LLM 학습 15:12
- Scaling Laws: 더 큰 모델, 더 많은 데이터 = 더 좋은 성능
- Chinchilla Rule: 파라미터 수 × 20 = 최소 학습 토큰 수
- Flash Attention: HBM/SRAM 메모리 계층 활용, 정확한 결과 + 속도 향상
- Parallelism: Data / Model / Pipeline parallelism
Lecture 5 - LLM Tuning 20:49
- SFT (Supervised Fine-Tuning): instruction-response 쌍으로 fine-tuning
- RLHF: Human preference로 모델 정렬
- DPO: RLHF 단순화, reward model 없이 직접 최적화
- LoRA: 저랭크 어댑터로 효율적 fine-tuning
Lecture 6 - LLM Reasoning 24:30
- Chain-of-Thought (CoT): 단계별 추론으로 복잡한 문제 해결
- Test-time Compute: 추론 시 더 많은 연산으로 성능 향상
- GRPO, DAPO: Reasoning 모델 학습을 위한 RL 확장
Lecture 7 - Agentic LLMs 38:41
- RAG (Retrieval-Augmented Generation): 외부 지식 검색 후 생성
- Candidate Retrieval (bi-encoder) → Re-ranking (cross-encoder)
- Tool Calling: LLM이 API 선택 + 인자 결정 → 실행 → 결과 종합
- ReAct: Observe → Plan → Act 반복 루프
Lecture 8 - LLM Evaluation 44:10
- Rule-based Metrics: BLEU, ROUGE (언어 다양성 미반영)
- LLM-as-a-Judge: Binary scale + Rationale before score
- Biases: Position, Verbosity, Self-enhancement
- Benchmarks: Knowledge, Reasoning, Coding, Safety
Part 2: 2025 트렌드 (시험 범위 외)
Vision Transformer (ViT) 49:32
- 이미지를 패치로 분할 → 벡터로 임베딩 → Transformer encoder
- BERT와 유사: CLS 토큰으로 분류
- 충분한 데이터가 있으면 CNN보다 우수한 성능
- 핵심 통찰: Transformer는 낮은 inductive bias, 데이터로 학습
Multimodal LLMs 56:10
- 텍스트 + 이미지 입력 처리
- Vision encoder로 이미지 → 토큰 변환 후 LLM에 입력
- GPT-4V, Gemini 등에서 활용
Diffusion LLMs 77:46
- Auto-regressive와 다른 생성 방식
- 노이즈 → 점진적 디노이징 → 텍스트
- 장점: Forward pass 수 = diffusion step 수 (토큰 수보다 적음) → 10배 빠름
- Fill-in-the-middle: 양방향 컨텍스트 활용에 유리
- 아직 frontier 모델 수준은 아니지만 발전 중
Cross-Domain Pollination 83:24
- 이미지 → 텍스트: Diffusion 개념 차용 (속도 향상)
- 텍스트 → 이미지: Transformer 아키텍처 차용 (DiT)
- RoPE의 2D 확장: 멀티모달 설정에서 위치 인코딩
Part 3: 미래 연구 방향
진행 중인 연구 영역 89:00
- Optimizer: Adam → Muon/Muon-clip 등장
- Normalization: LayerNorm → RMSNorm
- Activation Functions: ReLU → GELU 등
- Data Curation: LLM 생성 데이터의 model collapse 문제
- Mid-training: Pre-training과 Fine-tuning 사이 고품질 데이터 학습
열린 문제들 107:21
- 지속적 학습: 현재는 학습 후 weight 고정
- Hallucination: 본질적으로 next token prediction의 한계
- Personalization, Interpretability, Safety
- Hardware: GPU 외 새로운 아키텍처 탐색
- Cost-effective LLM: SLM (Small Language Model) 등장
학습 리소스 109:42
- arXiv, NeurIPS 등 학회
- Hugging Face Trending Papers
- YouTube: Yannic Kilcher, Andrej Karpathy
- Twitter/X ML 커뮤니티
- CME295 Study Guide (매년 업데이트 예정)
핵심 인사이트
- 시험 범위: Lecture 5-8 (Tuning, Reasoning, Agents, Evaluation)
- Transformer의 범용성: 텍스트에서 시작해 이미지, 멀티모달로 확장
- 양방향 영감: 이미지의 Diffusion → 텍스트, 텍스트의 Transformer → 이미지
- 아직 정해진 것 없음: Optimizer, Normalization, Architecture 모두 연구 진행 중
- Data가 핵심: LLM 생성 데이터 증가로 data curation의 중요성 부상
- Cost-effectiveness가 다음 frontier: 성능만큼 비용 효율도 중요해질 것