Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 9 - Recap & Current Trends

Stanford Online • December 9, 2025 • AI summary generated: January 24, 2026

Hello everyone, and welcome to lecture 9 of CME295. As you know, today is a special day because this is the last lecture of the entire course, so the menu for today will be a little different from usual.

We're going to divide the lecture into three parts. In the first part, we'll recap what we did in the entire class, to see how the different pieces fit together. In the second part, we'll look at some topics that are particularly trending in 2025 and that we think will be trending in the near future. The third part will be a way for us to conclude and go over next steps for all of you. Does that sound good?

Cool. With that, we'll start with the first part, which, as I mentioned, is about recapping what we did this entire quarter. Nothing new here; it's just a way for us to piece everything together.

If you remember, many weeks ago, I believe about 10 weeks ago, we had lecture one, which focused on understanding what transformers are. At the very beginning of the class we didn't even know how to process text. The first step we saw was tokenization, which consists of dividing the input into atomic units. The way we divide the text is arbitrary in some sense, so there are different algorithms for doing it, and we saw that the most common tokenization algorithm is the subword-level tokenizer.

We saw that one of the advantages is that roots of words can be reused and leveraged, especially when it comes to representing those tokens.

And speaking of representation, once we were able to divide the input text into atomic units, aka tokens, the next step was to learn how to represent these tokens as embeddings. If you remember, we saw some methods that were very popular back then. One of them was word2vec, where the representation was learned from a proxy task such as predicting the center word or predicting the context words.

But then we saw that this way of learning representations had some limitations. One of them was that these representations were not context-aware, meaning that a word gets the same representation whether it appears in one sentence or another.

For that reason, we saw some other methods that were popular in the 2010s, one of which was RNNs, if you remember. RNNs have a recurrent structure that processes tokens one at a time and keeps an internal representation of the sequence so far.

But then we saw that a big limitation of this was the problem of long-range dependencies, and in particular the fact that information about tokens encoded far in the past could not be retained as the sequence got longer. This is the reason why we saw the central idea of this whole class, which is self-attention, where tokens can attend to one another regardless of where they are placed in the sequence. You can think of this as a direct link.

We saw that there are three main terms that people use: query, key and value. Typically you want to know how similar a query is to the keys in the sequence, and you quantify that by taking dot products that are scaled and softmaxed. These weights are then applied to the corresponding values, so at the end of the day we obtain a weighted average over all the tokens in the sequence.

You may also be familiar now with this formula: softmax(Q K^T / sqrt(d_k)) V, that is, the softmax of Q times K transpose over the square root of d_k, times V. This is the matrix formulation of what I just described, which lets us carry out these computations very efficiently, and it's something that today's hardware is well equipped to do.
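
As a rough sketch of that matrix formula, assuming NumPy and toy shapes (the function names and sizes are illustrative, not taken from the lecture), single-head attention can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, weights = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each output row is a weighted average of the rows of V, with weights given by how similar that token's query is to every key.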

We then finished the first lecture by going through the architecture that is the foundation of modern-day LLMs, the transformer. We saw that there are two notable parts in the transformer: the encoder, in the left part of the figure, and the decoder, in the right part, and we saw how this was applied in the case of translation.

So by the end of the first lecture we had seen what motivated the transformer, and we saw that it worked quite well for translation. In the next lecture, we looked at the improvements that people have made to this architecture since it was released; if you remember, it was published in 2017.

One particular improvement is in the way we handle positions. In the original transformer paper, positions were encoded in an absolute way, in that each position had its own embedding, and this embedding was added to the token embedding. But if we think about it, we don't really care about the absolute position. We care about the relative position between tokens, and in particular we care about how far apart tokens are in the self-attention computation.

This is why we saw a method that is now widely used, called rotary position embeddings, aka RoPE. It rotates the queries and keys, both of which enter the self-attention computation. What is quantified here is purely a function of the relative distance between two tokens, and on top of that it is handled inside the self-attention layer, which is where we care about it.
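
To make the relative-distance property concrete, here is a minimal sketch assuming NumPy, a single query and key vector, and the commonly used base of 10000 (the `rope` helper and its pairing of even and odd dimensions are illustrative choices, not the lecture's code):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3), very different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))  # True
```

Because both vectors are rotated, their dot product depends only on the difference between the two positions, which is exactly the quantity the attention score needs.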

That was one big improvement. We then saw some other improvements, especially in how the multi-head attention layer is composed. In particular, we saw that it is possible to group the matrices that we learn: we don't need one projection matrix per head for, say, keys and values; we can share them across groups of heads. This is what is shown here as grouped-query attention. We also saw some other techniques that I have not represented here, for instance the placement of the normalization layer in the transformer. Here it happens after each sublayer, but nowadays people have tried moving the normalization before the sublayer. The version shown here is the post-norm version, and the one with normalization before the sublayer is called the pre-norm version.
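
The difference between the two normalization placements can be sketched as follows, assuming NumPy and a layer norm without learnable gain and bias (a simplification; `sublayer` stands for either attention or the feed-forward network, and all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer norm: no learnable scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original transformer: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input, leave the residual path untouched.
    return x + sublayer(layer_norm(x))

x = np.array([[1.0, 2.0, 3.0, 4.0]])
zero = lambda h: np.zeros_like(h)   # a sublayer that contributes nothing
pre = pre_norm_block(x, zero)       # the input passes through unchanged
post = post_norm_block(x, zero)     # the input is still renormalized
```

With a sublayer that outputs zeros, the pre-norm block is the identity, while the post-norm block still rescales its input, which shows that only pre-norm keeps a clean residual path.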

The last thing we saw was that many models were derived from this transformer architecture. We saw that if we keep only the encoder, we can compute very meaningful embeddings. If you remember, there was a landmark paper on an encoder-only model, BERT, which was heavily used for classification because it relied on the encoded embedding of the CLS token. So that was one.

But we also saw a number of other kinds of models, all more or less derived from the transformer. You can keep only the encoder, as in BERT. You can keep only the decoder, as in GPT. And you can keep both, as in T5. One particular aspect of these models is that an encoder-only model, in the way that we saw it, is not able to generate text, but it can generate embeddings that can be used for downstream tasks. Encoder-decoder models like T5 and decoder-only models like GPT, on the other hand, can be autoregressive and generate text: the paradigm can be text in, text out.

With that, we then focused on what everyone now calls large language models, which are transformer-based models, specifically text-to-text models, so decoder-only transformer-based models. We saw that people have come up with a lot of new tricks, because, as the name indicates, these models have been scaled up. One question that was raised is: do you actually need all these parameters just to do a forward pass?

So we saw one variant based on mixture of experts. With a mixture of experts, instead of running everything through the entire model, you have a number of experts that you activate in a sparse way. For instance, for one input you activate just a subset of them, and for another input you activate another subset, so that you don't need to do all the computations all the time.

And we saw that mixtures of experts are used in LLMs in particular in the feed-forward neural network layer. Here the experts are different feed-forward neural networks, and a gating mechanism routes the input to the appropriate one.

We also saw that some papers were able to produce nice visualizations of which token gets routed to which experts, because we saw that this routing is done at the token level. One reason it's done at the token level is to be able to smartly place the experts on different pieces of hardware, different GPUs, and then parallelize the computation a little more.
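
Token-level routing can be sketched like this, assuming NumPy, top-2 routing, and small ReLU feed-forward experts (the value of `top_k`, the gate renormalization, and every name below are illustrative assumptions; real systems differ in these details):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs."""
    logits = tokens @ gate_w                        # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen experts per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        # Renormalize the gate over the selected experts only.
        gates = softmax(logits[t, top[t]])
        for g, e in zip(gates, top[t]):
            w1, w2 = experts[e]
            out[t] += g * (np.maximum(token @ w1, 0.0) @ w2)  # small ReLU FFN
    return out, top

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4
experts = [(rng.normal(size=(d_model, d_hidden)),
            rng.normal(size=(d_hidden, d_model))) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_model, n_experts))
tokens = rng.normal(size=(5, d_model))
out, chosen = moe_layer(tokens, gate_w, experts)
print(out.shape, chosen.shape)  # (5, 8) (5, 2)
```

Only 2 of the 4 expert networks run for each token, so the compute per token stays roughly constant as more experts are added.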

We also saw that these LLMs are always tasked with predicting the next token, and we were interested in how this is done. One method that people use is to sample from the output distribution. Given an input, the model outputs a probability distribution over what the next token could be. Instead of, say, taking the token with the highest probability, which is called greedy decoding, you actually sample. This introduces some randomness and allows the model to produce a wider variety of outputs.

And we saw that you can adjust how much variety you want in your output by tweaking a hyperparameter called temperature. A very low temperature leads to a very spiky distribution, so more deterministic outputs, while higher temperatures give outputs that are a bit more random, a bit more creative.
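
A small sketch of greedy decoding versus temperature sampling, assuming NumPy and a made-up four-token vocabulary (the logits and function name are illustrative only):

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    """Turn logits into a sampling distribution; temperature must be > 0."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5, -1.0]
greedy = int(np.argmax(logits))     # greedy decoding: always picks token 0 here
cold = next_token_probs(logits, temperature=0.1)   # spiky distribution
hot = next_token_probs(logits, temperature=5.0)    # flatter distribution

rng = np.random.default_rng(0)
sampled = rng.choice(len(logits), p=hot)   # sampling: any token can come out
print(cold.round(3), hot.round(3))
```

Dividing the logits by a small temperature exaggerates the gaps between them before the softmax, while a large temperature shrinks those gaps and spreads the probability mass.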

Okay. So up to that point we had seen what LLMs are, how they are based on the transformer, and how they connect to the architecture that we saw in the first lecture. Then in lecture four we saw how people actually train those LLMs, because, as I mentioned, these LLMs are large, so you cannot naively fit them on your hardware. You need to be a little bit smart about it.

In particular, what people noticed in the early 2020s is that the bigger your model is, the better your performance, so people just started building bigger and bigger models. In the illustration, the y-axis is the test loss, so lower is better. We saw that the more compute you use, the better your test performance, and the same holds when increasing the dataset size and when increasing the number of parameters.

But as you know, compute is not infinite. So a natural question came out of the community: given a fixed compute budget, can you choose some quote-unquote optimal number of parameters and dataset size on which to train your model? We saw that there was a paper published in the early 2020s that studied how the performance on the test set changes as you vary the dataset size and the size of your model.

And then we saw that most models at the time were what we call undertrained, because they were too big compared to the dataset they were trained on. The dataset they were trained on was not as big as it should have been.

In particular, a rule of thumb came out of this: for a given number of parameters in your model, you should train it on at least 20 times that number of tokens. For instance, if you have a 100-billion-parameter model, you should train it on at least two trillion tokens, because two trillion is 100 billion times 20. That's the rule of thumb that people have used.
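
The arithmetic behind this rule of thumb is simple enough to write down directly (the helper name is made up for illustration, and the factor of 20 is the approximate figure quoted above, not an exact law):

```python
def min_training_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget for a model with n_params parameters."""
    return n_params * tokens_per_param

budget = min_training_tokens(100e9)   # a 100-billion-parameter model
print(f"{budget:.1e}")  # 2.0e+12, i.e. two trillion tokens
```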

Then, as I mentioned previously, these models are huge, so people have also tried to make the computation more efficient. There was a method we saw, which is actually quite important, called FlashAttention. FlashAttention leverages the strengths of the underlying hardware, GPUs in particular, and more specifically it looks at the kinds of memory that a GPU has.

A GPU has a big but slow memory and a small but fast memory: the HBM and the SRAM, respectively. We saw that this method tries to minimize the number of reads and writes to the big and slow memory, the HBM. The way it does this is to divide the computation into small blocks that it sends to the SRAM, the small but fast memory, so that each block's computation can be done end to end there, and then the partial results are written back and combined to obtain the full result.

That method is exact, meaning that we're not making any approximations to the results, but it led to significant speedups. There was also a second idea from the paper, an important one as well: sometimes it's okay to not store results. It's okay to just throw them out and recompute them when you need them again. This idea of recomputation led to faster runtimes even though we were doing more computations.
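
The block-by-block idea can be illustrated in plain NumPy with a running softmax. This is only a sketch of the tiling arithmetic under simplifying assumptions (keys and values are tiled, queries are not, and nothing here models GPU memory); the real method is a fused GPU kernel, and the function names are illustrative:

```python
import numpy as np

def attention_reference(Q, K, V):
    # Standard attention that materializes the full score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    """Process keys/values block by block with a running (online) softmax."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of the scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)   # rescale what was accumulated so far
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=-1, keepdims=True)
        out = out * scale + P @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 8))
same = np.allclose(attention_tiled(Q, K, V), attention_reference(Q, K, V))
print(same)  # True
```

The tiled version never holds the full score matrix at once, yet it matches the reference output, which is what makes the method exact rather than approximate.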

So that was FlashAttention. We also saw a number of other methods that were meant to parallelize the computation. We saw data parallelism, the idea of not having all your data processed on a single GPU but instead splitting it across multiple devices. Then we had the second method, model parallelism, where even a single forward pass involves multiple GPUs. So there were a lot of very interesting techniques, a lot of different ideas about how to train these models efficiently.

What I've described here is mostly important for the first step of the training process of an LLM, called pre-training, which is meant to teach the model about the structure of language and the structure of code. The model is trained on huge amounts of data: think trillions of tokens, or even tens of trillions of tokens. This first step goes from an initialized model to a model that is able to autocomplete, because it is trained with the objective of predicting the next token. So at the end of this first stage, you have a model that knows how to autocomplete, but it is not very helpful, because completing things is all it knows how to do.

To make the model useful for our use cases, we had a second step, called the fine-tuning step, where we train the model on the kinds of input-output pairs on which we want it to perform well. This is also called the SFT stage, the supervised fine-tuning stage. At the end of this second step, we have a model that not only knows the structure of text and code, but is also able to behave in the way you want.

But up through step two, we have only taught our model what to do. We have not taught it what not to do. This is why we had our third step, the preference tuning step, where we take the model that went through the pre-training stage and the SFT stage, and now we want to inject some negative signal as well, as in: I want you to prefer this output over that output.

This third step uses preference data. As the name suggests, preference tuning uses preference data, which is typically pairwise data where humans say: I prefer this output over that output. The model is then able to align the kind of output it produces with human preferences, which could be along the dimensions of usefulness, safety, friendliness or tone. There are a bunch of different dimensions, but that's what is happening in this third step.

It was in lecture five that we dug into what that third step was about. If you remember, we drew a parallel between the way our LLM produces tokens and the way people in the reinforcement learning field think about a given policy interacting with some environment, performing some actions and being in some states. The reason we drew that parallel was to be able to leverage RL-based techniques to train our model. In this case we said our LLM is a little bit like a policy.

Given some state, which is the input it has received so far, it can perform the next action, which in this case is to predict the next token. This prediction is made in the environment of tokens. And when we predict a completion, at the end of the day we get some signal, the reward, which can be the human preference.

So this is the parallel we drew with the RL world, and with that in mind we talked about rewards. The problem is that rewards are only available for a limited set of data, which is why we saw how to model rewards.

We saw this formula, if you remember; it's called the Bradley-Terry formulation. It models the probability of one output being better than another as a function of two scores: the score of output i and the score of output j. And we saw that reward models are typically trained with this formulation in mind, in a pairwise fashion.

What this means is that you give the reward model two outputs, one good and one bad, and you train it to say that the good one is better. You train it in a pairwise fashion, so during training the model always predicts two scores: r_i for output i and r_j for output j. At inference time, you only give it one output. I think that's one subtlety: we train it in a pairwise way, but at inference time we use it on individual outputs, if that makes sense.
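
In its usual form, the Bradley-Terry model says the probability that output i beats output j is the sigmoid of the score difference. A minimal sketch assuming NumPy and scalar scores (function names are illustrative, and the negative log-likelihood loss shown is the common choice rather than something spelled out in this recap):

```python
import numpy as np

def preference_prob(r_i, r_j):
    """Bradley-Terry: P(output i is preferred over output j) = sigmoid(r_i - r_j)."""
    return 1.0 / (1.0 + np.exp(-(r_i - r_j)))

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one."""
    return -np.log(preference_prob(r_chosen, r_rejected))

print(preference_prob(1.0, 1.0))                           # 0.5: equal scores, no preference
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))   # True
```

The loss only ever sees the difference of two scores, which is why training is pairwise, but each score is produced from a single output, which is why the trained model can rate one output at a time at inference.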

Once we had trained our reward model using this formulation, we were able to use it to steer our LLM in the direction that we care about.

If you remember, the way we steer our LLM in the direction of human preferences is to give it a prompt so that it can produce a completion, aka a rollout, or in simpler terms, an answer. Then we take this prompt and this answer and feed them both to the reward model, which tells us how good the model's response is. Depending on what the reward model says, we can tune the weights of the LLM in a way that maximizes the reward, which is trained on human preferences.

The loss function of this RL setup is typically something that tries to maximize rewards while also keeping the model close to the base model. Here, by base model we mean the SFT model. The reason we want that is that the reward is imperfect. We saw the phenomenon of reward hacking, where the LLM exploits the imperfections of the reward and ends up tuned in a way that does not actually align with what you want.

So you want the LLM to not be too far from the base model, which is already a good model. It's a way to regularize, if you want. You also want the updates at each iteration to not be too big. So you typically have these two constraints: you don't want the model to deviate too much from the base model, and you don't want it to deviate too much from the previous RL iteration either.
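
One common way to encode these two constraints is a clipped probability ratio (for the previous iteration) plus a penalty on the log-probability gap to the reference model. The sketch below assumes NumPy, a single sampled token, and illustrative values for `clip_eps` and `beta`; it is one possible formulation, not necessarily the exact objective shown in the lecture:

```python
import numpy as np

def clipped_objective(logp_new, logp_old, logp_ref, advantage,
                      clip_eps=0.2, beta=0.1):
    """Objective to maximize: clipped policy-gradient term minus a drift penalty."""
    ratio = np.exp(logp_new - logp_old)                 # change vs. previous iteration
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    drift = logp_new - logp_ref                         # single-sample estimate of drift from the SFT model
    return surrogate - beta * drift

# A large jump from the previous iteration is capped at 1 + clip_eps.
capped = clipped_objective(logp_new=-1.0, logp_old=-2.0, logp_ref=-1.0, advantage=1.0)
# Drifting away from the reference model lowers the objective.
penalized = clipped_objective(logp_new=-1.0, logp_old=-1.0, logp_ref=-2.0, advantage=1.0)
print(capped, penalized)
```

The clipping removes the incentive to move far from the previous policy in one step, while the penalty term keeps pulling the policy back toward the SFT model.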

And just as a reminder, I think lecture five was the most technically challenging of the whole class. So it's completely fine if the first time you were like, you know, what's happening? But hopefully now it's a little clearer.

Cool. After lecture five, we had done a lot of hard work. The good thing is that we're in 2025, and in the past 12 months, or now 14 months, we've seen a lot of models released with reasoning capabilities, and the way they were trained to exhibit these advanced reasoning capabilities leveraged a lot of the techniques that we saw in lecture five, namely RL-based techniques.

In particular, what we want our LLM to do is output a reasoning chain before producing the final answer. The reason we want this is that people have seen that it improves the performance of the model. It relies on the idea of chain of thought, which I believe we saw in lecture three: a prompting technique that has your model output its reasoning before outputting the response.

Long story short, up until lecture six, our LLM took a prompt as input and directly produced the output. But in lecture six we said: let's have our LLM first output a reasoning chain, which the user may or may not have access to, before outputting the final answer.

So you want to teach the LLM to do that. How do you do that? Before getting to it, I just want to show you this chart we saw, which is the performance of the model as we teach it to produce these reasoning chains. People typically measure the improvement on certain benchmarks, and this is a popular one, the AIME benchmark, which is a math benchmark. We saw that as training progresses, the accuracy of what the LLM outputs increases.

But back to what I was saying: the key to teaching the model how to output these reasoning chains is leveraging the RL techniques that we saw in lecture five. Up until now we had seen PPO, which was the main RL algorithm that people were using up to maybe last year, and now people are prioritizing GRPO as the RL algorithm for teaching the model to be better at reasoning tasks. There are several reasons for that, which I will make explicit right now.

We saw this illustration comparing how GRPO differs from PPO, and as you can see in the figure, a few things are different. The first is that GRPO does not rely on a value model. So, who remembers what a value model is? Yep.

Yes, exactly. The value function tries to predict what the reward would be if you were to follow the policy of the LLM. It's a way to have a baseline for how good some predictions are. You want to make the rewards more relative, and the value function is a way to make these rewards a little more relative to one another. That's how PPO was doing it: it had a value model making these predictions, and then the generalized advantage estimation method combined the reward predictions with the value function predictions to obtain what we call advantages. An advantage is how good your output is compared to some baseline.

In contrast, GRPO said: okay, we don't need a value function, because it's too expensive to train and to maintain. What we're going to do instead is generate several completions and then use a formula that compares the rewards of these completions to one another. This makes things more relative, and in doing so you no longer need to maintain and train a value function. That's one big difference compared to PPO.
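
The comparison of completions to one another is usually done by standardizing the rewards within the group sampled for a prompt. A minimal sketch assuming NumPy and binary rewards (the function name and the `eps` term are illustrative):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of several completions of the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt, scored 1 if correct and 0 otherwise.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # approximately [ 1. -1. -1.  1.]
```

The group mean plays the role of the baseline that the value model provided in PPO, so completions that beat their siblings get a positive advantage and the rest get a negative one.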

The second big difference, which is not represented in this illustration, is that GRPO is typically an algorithm that people have used in the context of teaching a model to be better at reasoning tasks. We saw that these kinds of problems have a verifiable reward: when you complete a math problem, you actually know the answer you need to get to. So you don't need to train a reward model to tell you how good your final answer is, because you already know the answer.

And so we saw that GRPO was in

NaN:NaN

particular used in the context of when

NaN:NaN

you actually don't even need a reward

NaN:NaN

model when you actually have a

NaN:NaN

verifiable reward. So at the end of the

NaN:NaN

day, the only two models you need to

NaN:NaN

keep are the policy model

NaN:NaN

and the reference model to be able to

NaN:NaN

just compare how far you are from the

NaN:NaN

reference model.
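
A verifiable reward can be as simple as a string check against the known answer. The sketch below assumes, purely for illustration, that completions end with a line of the form `Answer: ...`; that marker and the function name are hypothetical.

```python
def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches the reference, else 0.0."""
    marker = "Answer:"  # assumed output convention, not from the lecture
    if marker not in completion:
        return 0.0
    # Take whatever follows the last occurrence of the marker.
    predicted = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


print(verifiable_reward("2 + 2 is four. Answer: 4", "4"))
print(verifiable_reward("I think it is five. Answer: 5", "4"))
```

No learned reward model appears anywhere: the check only needs the reference answer.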

Cool. I know this one was also a challenging class, I guess. So far so good? This is also on the final, which is why I'm taking things more slowly for this second part of the recap. So, is everything good so far? Yeah.

Okay, perfect. We also saw some extensions of GRPO. If you remember, there was a bias that resulted from the GRPO loss function having a normalization term that penalized tokens in shorter outputs more heavily. So if you use GRPO in its original form, we saw that after a certain point the algorithm incentivizes your model to produce longer and longer incorrect answers. The reason is that, relative to short incorrect answers, it penalizes long incorrect answers less. This is why there are some extensions that people have worked on this year.
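
The length effect can be seen with a bit of arithmetic. The sketch below assumes the normalization divides a sequence's loss by its number of tokens; the function name and the numbers are illustrative only.

```python
def per_token_weight(advantage: float, num_tokens: int) -> float:
    """Weight each token receives when the sequence term is divided by its length."""
    return advantage / num_tokens


# Two incorrect answers with the same negative advantage but different lengths.
short_wrong = per_token_weight(advantage=-1.0, num_tokens=10)
long_wrong = per_token_weight(advantage=-1.0, num_tokens=1000)
print(short_wrong, long_wrong)
```

Each token of the long incorrect answer is pushed down far less than each token of the short one, which is the incentive toward longer incorrect answers described above.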

One of them was GRPO Done Right (Dr. GRPO), where they basically removed the normalization term, and there was another method that we saw called DAPO, which also had some variations. And that's it for reasoning models.

Then in lecture seven, we had a model that we knew how to train and how to make better at reasoning tasks, but now we wanted the model to be useful by interacting with outside systems.

So we saw one essential technique called RAG, short for retrieval-augmented generation, which lets you fetch relevant documents from some knowledge base in order to answer a question or a prompt. The reason you want to do that is that your LLM's knowledge only extends up to the knowledge cutoff date, which is the latest date covered by the data your LLM was trained on.

From a practical standpoint, from what we see nowadays, you're typically not training your LLM daily or continuously. So in cases where you need your LLM to know about things that happened recently, or about things that were not in its training data, you want it to have access to that information. That's how RAG is very useful.

We saw that RAG depends very heavily on the way it retrieves data, and that the retrieval part is mainly composed of two steps. The first one is candidate retrieval, which uses a bi-encoder setup where you're basically doing semantic search: you compute the embedding of the query, you have precomputed embeddings of the documents in your knowledge base, and you take the ones that maximize some similarity score, let's say cosine similarity.
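
Here is a minimal sketch of that candidate retrieval step. A real bi-encoder would be a learned embedding model; the bag-of-words `embed` function below is only a toy stand-in, and every name and document in the snippet is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_candidates(query, doc_embeddings, top_n=2):
    """Embed the query once and rank the precomputed document embeddings."""
    query_embedding = embed(query)
    ranked = sorted(
        doc_embeddings,
        key=lambda doc: cosine_similarity(query_embedding, doc_embeddings[doc]),
        reverse=True,
    )
    return ranked[:top_n]


documents = [
    "the transformer uses self attention",
    "bananas are rich in potassium",
    "attention lets tokens attend to each other",
]
# Document embeddings are computed ahead of time, independently of any query.
doc_embeddings = {doc: embed(doc) for doc in documents}
candidates = retrieve_candidates("how does attention work", doc_embeddings)
print(candidates)
```

The point of the bi-encoder layout is that query and documents are embedded separately, so the document side can be precomputed and only the query embedding is needed at request time.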

So this first step lets you retrieve a filtered set of candidate documents. Then you typically have a second step, called ranking, or reranking since the first step already gives you a ranking, which typically has a more sophisticated setup. It's a cross-encoder setup where your query and your document are both fed to some model, which produces a more precise score. You use this final score to rank the results, and you typically keep the top K. Retrieval is everything I've mentioned so far; once you have the relevant documents, you add them to your prompt, which is the augmented part, and you generate the answer.
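
The reranking and augmentation steps can be sketched as follows. The `cross_encoder_score` function is a hypothetical stand-in (simple word overlap) for a real cross-encoder model that reads the query and document together; the prompt wording and sample documents are also assumptions.

```python
def cross_encoder_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a cross-encoder: scores the pair jointly.

    Here the score is just the fraction of query words found in the document.
    """
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


def rerank(query, candidates, k):
    """Rescore every candidate against the query and keep the top k."""
    scored = sorted(
        candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True
    )
    return scored[:k]


def build_augmented_prompt(query, documents):
    """The 'augmented' part: prepend the retrieved documents to the question."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


query = "when was the course taught"
candidates = [
    "the course covers transformers",
    "the course was taught in autumn 2025",
    "bananas are rich in potassium",
]
top_docs = rerank(query, candidates, k=1)
prompt = build_augmented_prompt(query, top_docs)
print(prompt)
```

The resulting prompt is what would be sent to the LLM for the generation step.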

The reason I'm taking so much time on RAG is that it's such an important concept, also if you were to have interviews, or maybe in the exam, who knows. So I think it's an important concept to have in mind.

The second one that we saw was tool calling, which allows your LLM to leverage tools. It does that in two steps. The first step is for your model to know which APIs are out there, at the end of which your LLM says, "Okay, I want to use this API, and I want to use it with these arguments." Then you have an intermediary step where you just run the API with these arguments. The second step is to feed the results of this operation back to the LLM, which then produces a final answer. So that's how tool calling works: if you tell your LLM, "Okay, you can use this API," this is how it would leverage it.
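
The two steps plus the intermediary execution can be sketched like this. The two `fake_llm_*` functions are stand-ins for real model calls, and the `get_temperature` tool, its output, and the JSON call format are all hypothetical.

```python
import json


def get_temperature(city: str) -> str:
    """Hypothetical tool; a real one would call an external API."""
    return f"18 degrees Celsius in {city}"


# Registry of the tools the model is told about.
TOOLS = {"get_temperature": get_temperature}


def fake_llm_choose_tool(user_message: str) -> str:
    """Stand-in for step 1: the model picks a tool and its arguments."""
    return json.dumps({"tool": "get_temperature", "arguments": {"city": "Stanford"}})


def fake_llm_final_answer(user_message: str, tool_result: str) -> str:
    """Stand-in for step 2: the model writes an answer from the tool result."""
    return f"The tool reports: {tool_result}."


def answer_with_tools(user_message: str) -> str:
    call = json.loads(fake_llm_choose_tool(user_message))
    # Intermediary step: ordinary code runs the chosen tool with the chosen arguments.
    result = TOOLS[call["tool"]](**call["arguments"])
    return fake_llm_final_answer(user_message, result)


final_answer = answer_with_tools("What is the temperature at Stanford?")
print(final_answer)
```

Note that the model never executes anything itself: it only names the tool and the arguments, and it sees the result on the second call.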

Then we saw that modern-day agentic workflows leverage both RAG and tool calling as key methods to perform actions. We saw a detailed example where you had some inputs, your LLM made a series of different calls in order to perform some action, and at the end it returned an answer.

Cool. Then last lecture we saw how we could evaluate LLMs, which is a much tougher thing to do now that LLMs can do a bunch of different things. We first saw that there were some rule-based metrics that people were using before LLMs came into play, metrics you may have heard of like BLEU, ROUGE, METEOR, and so on. Their main limitation was that they did not account for how language could differ but still be correct.
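
This limitation shows up even with a heavily simplified overlap metric. The function below is not BLEU, ROUGE, or METEOR; it is a bare unigram precision written for illustration, and the sentences are made up.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference."""
    candidate_words = candidate.lower().split()
    reference_words = set(reference.lower().split())
    if not candidate_words:
        return 0.0
    matches = sum(1 for word in candidate_words if word in reference_words)
    return matches / len(candidate_words)


reference = "the cat is on the mat"
exact_copy = unigram_precision("the cat is on the mat", reference)
paraphrase = unigram_precision("a feline sits upon a rug", reference)
print(exact_copy, paraphrase)
```

The paraphrase means roughly the same thing as the reference but shares no words with it, so a purely overlap-based score treats it as completely wrong.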

So the key idea that we saw was: why not leverage LLMs to evaluate outputs? This is the idea of LLM-as-a-judge, which receives as input the prompt and the model response, along with the criteria you want the response to be evaluated on. You then want your LLM-as-a-judge to output two things: a rationale for why a given score is output, along with that score. Nowadays, LLM-as-a-judge setups typically output a binary response, either pass or fail, true or false, just because it's easier. We also have the rationale be output before the score, because in practice it improves the performance of the LLM-as-a-judge, a little bit like reasoning models do by outputting the reasoning chain before they output the answer.
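
A minimal sketch of that setup is below: one function builds the judge prompt from the three inputs, and another parses a rationale-then-verdict reply. The template wording, the JSON reply format, and the sample judge output are assumptions; no real model is called.

```python
import json

JUDGE_TEMPLATE = """You are grading a model response.
Prompt: {prompt}
Response: {response}
Criteria: {criteria}
Reply with JSON containing "rationale" first, then "verdict" ("pass" or "fail")."""


def build_judge_prompt(prompt: str, response: str, criteria: str) -> str:
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response, criteria=criteria)


def parse_judgment(raw_output: str):
    """Return (rationale, passed) from the judge's JSON reply."""
    data = json.loads(raw_output)
    verdict = data["verdict"].strip().lower()
    if verdict not in ("pass", "fail"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return data["rationale"], verdict == "pass"


judge_prompt = build_judge_prompt(
    "What is 2 + 2?", "4", "The answer is mathematically correct."
)
# Hypothetical reply from the judge model, with the rationale written before the verdict.
raw_reply = '{"rationale": "2 + 2 equals 4, which matches the response.", "verdict": "pass"}'
rationale, passed = parse_judgment(raw_reply)
print(rationale, passed)
```

Asking for the rationale first means the verdict is generated after the explanation, mirroring the rationale-before-score ordering described above.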

But we also saw that some biases came with this approach. There was position bias, where the order in which you present the elements to compare matters: if you present something first, maybe the LLM will just favor it. There was verbosity bias, which is your LLM just preferring longer outputs. And self-enhancement bias was another one, where it prefers its own outputs.

Then we also saw a number of benchmarks that people use nowadays to show how good their LLM is. If you look at the releases that come out, there are typically a bunch of metrics across a number of benchmarks that people know about. They span knowledge, the ability to reason, coding, which is very important because a lot of applications are coding related, and then safety. This is not an exhaustive list, so there are actually many more dimensions.

So yeah, I think that's where we stopped, and it was last lecture. This is all you are expected to know for the final. Everything after that is not going to be part of the final. Any questions on this so far?

Cool. Okay, I'm expecting a hundred for everyone on the final. But yeah, I would say what I went through is going to be foundational for the final. So I guess if you understood everything I said, I think you're going to be ready. But if you have any questions, Shervin and I are always here to... Oh, yeah, you have a question?

Yes. So the question is: is the scope for the final lecture 5 to lecture 8? Yes. For the midterm it was lectures 1, 2, 3, 4, and this one is 5, 6, 7, 8. So I guess it's equal size.

Cool. Okay, great. So with that said, we just finished recapping this entire quarter's worth of lectures, and now we're going to go to the second item on today's menu, which is looking at some trending topics. I'm going to start with the first one, and I'm going to introduce it as follows.

If you remember, we saw that the transformer was a concept and an architecture first introduced in the context of machine translation. It performed great, so people said, "Okay, it performs great on machine translation, why not try it on other text tasks?" They tried, and it performed great. But now the question is: can you use it for things other than text? It's a natural question, right? To answer it, I just want us to remind ourselves that this architecture relies on the concept of self-attention, and this is what makes the transformer work so well.

If we just recap what self-attention is, this illustration does the job quite well. You have a query, and then you have a bunch of other elements, which are represented by your keys and your values, and you want to know which other elements are actually relevant in order to compute the embedding for that query.
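
As a minimal sketch of that computation for a single query, the snippet below scores the query against each key, turns the scores into weights with a softmax, and mixes the values accordingly. In a real transformer the queries, keys, and values come from learned projections; the hand-written vectors here are toy assumptions.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Relevance of each element: dot product of the query with its key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax (shifted by the max score for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # New embedding for the query: weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attend(query, keys, values)
print(weights, output)
```

The first key points in the same direction as the query, so its value dominates the output. Nothing in the computation requires the elements to be text tokens, only vectors.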

So right now we have only used tokens, you know, text tokens

Key Summary

Reviews the entire course and introduces the latest trends of 2025. Summarizes the key concepts of Lectures 1-8 and covers new paradigms such as Vision Transformers and Diffusion LLMs, along with future research directions.

Key Concepts

Part 1: Full Course Recap

Lecture 1 - Transformer Basics 01:24

Tokenization: splits text into atomic units (subword level is the most common)
Embedding: Word2Vec → limitations of RNNs (long-range dependencies)
Self-Attention: direct connections between tokens; can attend regardless of position
Transformer: encoder-decoder architecture, originating in the translation task

Lecture 2 - Transformer Improvements 05:33

RoPE (Rotary Position Embedding): absolute position → relative position, rotates Q/K
Grouped Query Attention: improves efficiency by grouping K/V matrices
Pre-norm vs Post-norm: modern LLMs prefer pre-norm
Encoder-only (BERT): classification tasks / Decoder-only (GPT): generation tasks

Lecture 3 - LLM Architecture 09:07

MoE (Mixture of Experts): only a subset of all parameters is activated; applied to FFN layers
Temperature: low is more deterministic, high is more creative
Sampling: probabilistic sampling instead of greedy decoding, for diversity

Lecture 4 - LLM Training 15:12

Scaling Laws: bigger model, more data = better performance
Chinchilla Rule: number of parameters × 20 = minimum number of training tokens
Flash Attention: exploits the HBM/SRAM memory hierarchy; exact results + speedup
Parallelism: Data / Model / Pipeline parallelism

Lecture 5 - LLM Tuning 20:49

SFT (Supervised Fine-Tuning): fine-tuning on instruction-response pairs
RLHF: aligns the model with human preferences
DPO: simplifies RLHF; optimizes directly without a reward model
LoRA: efficient fine-tuning with low-rank adapters

Lecture 6 - LLM Reasoning 24:30

Chain-of-Thought (CoT): solves complex problems through step-by-step reasoning
Test-time Compute: better performance by spending more compute at inference time
GRPO, DAPO: RL extensions for training reasoning models

Lecture 7 - Agentic LLMs 38:41

RAG (Retrieval-Augmented Generation): retrieve external knowledge, then generate
- Candidate Retrieval (bi-encoder) → Re-ranking (cross-encoder)
Tool Calling: LLM selects the API + decides the arguments → execution → synthesizes the results
ReAct: repeated Observe → Plan → Act loop

Lecture 8 - LLM Evaluation 44:10

Rule-based Metrics: BLEU, ROUGE (do not account for linguistic variation)
LLM-as-a-Judge: binary scale + rationale before score
Biases: Position, Verbosity, Self-enhancement
Benchmarks: Knowledge, Reasoning, Coding, Safety

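Part 2: 2025 Trends (not covered on the exam)

Vision Transformer (ViT) 49:32

Split the image into patches → embed them as vectors → Transformer encoder
Similar to BERT: classification with a CLS token
Outperforms CNNs when enough data is available
Key insight: Transformers have low inductive bias and learn from data

Multimodal LLMs 56:10

Handle text + image inputs
A vision encoder converts images → tokens, which are then fed to the LLM
Used in GPT-4V, Gemini, and others

Diffusion LLMs 77:46

A generation approach different from autoregressive decoding
Noise → gradual denoising → text
Advantage: number of forward passes = number of diffusion steps (fewer than the number of tokens) → 10× faster
Fill-in-the-middle: well suited to using bidirectional context
Not yet at the level of frontier models, but improving

Cross-Domain Pollination 83:24

Image → text: borrows the diffusion concept (speedup)
Text → image: borrows the Transformer architecture (DiT)
2D extension of RoPE: positional encoding in multimodal settings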

Part 3: Future Research Directions

Ongoing Research Areas 89:00

Optimizer: Adam → emergence of Muon/Muon-clip
Normalization: LayerNorm → RMSNorm
Activation Functions: ReLU → GELU and others
Data Curation: the model collapse problem with LLM-generated data
Mid-training: training on high-quality data between pre-training and fine-tuning

Open Problems 107:21

Continual learning: currently, weights are frozen after training
Hallucination: inherently a limitation of next-token prediction
Personalization, Interpretability, Safety
Hardware: exploring new architectures beyond GPUs
Cost-effective LLMs: emergence of SLMs (Small Language Models)

Learning Resources 109:42

arXiv, conferences such as NeurIPS
Hugging Face Trending Papers
YouTube: Yannic Kilcher, Andrej Karpathy
Twitter/X ML community
CME295 Study Guide (to be updated annually)

Key Insights

Exam scope: Lectures 5-8 (Tuning, Reasoning, Agents, Evaluation)
Versatility of the Transformer: started with text, extended to images and multimodal settings
Two-way inspiration: diffusion from images → text, the Transformer from text → images
Nothing is settled yet: optimizers, normalization, and architecture are all under active research
Data is key: as LLM-generated data grows, data curation is becoming more important
Cost-effectiveness is the next frontier: cost efficiency will matter as much as performance