Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 4 - LLM Training

Stanford Online • October 21, 2025 • AI summary generated: January 24, 2026

Cool. Hello everyone and welcome to lecture 4 of CME 295. So today is Friday, October the 17th, which means that the midterm is one week away. So before we start, I'm just going to go over some logistics to make sure we're all aligned on what to expect.

So the midterm will take place next week, same time. Instead of an hour and 50 minutes, it will be an hour and 30 minutes. So it's 3:30 to 5 in this classroom. So it's like business as usual. In terms of topics, the midterm will be about lectures one, two and three, which we had, and this one, which is lecture four. So just to give you an overview of what you can expect in the midterm: there's going to be some multiple-choice questions along with some free-form questions, but they're mainly going to be about the things that we've seen in class. So if you watch the recordings or attend the lectures, go through the slides and know the important formulas, I think you'll be fine.

So I know that you may have questions until next week. So that's why after this lecture, together with Shervin, we will be holding office hours. So feel free to come to us and ask us any questions. And of course we'll be fully available between now and next week. So in case you have any questions, feel free to ping us on Ed, and we'll make sure to respond.

Cool. I also know that a number of you are auditing this class. So in case you're still interested in taking the midterm for some reason, maybe because you have an upcoming interview, just tell us so that we know the number of copies to print. We'll be printing this on Monday, so just let us know over the weekend in case you're interested.

Cool. So that's for the midterm. And then the second piece of news is the final. So we said we were working on the dates. We finally finalized the dates, which did not change. So it's Wednesday, December the 10th. A little bit late, 7:00 p.m. to 8:30 p.m., so it's the slot that we have. The location is different from this one. So it's in this room. And the final will only cover the second part of the class, which is basically lectures five to nine.

Any questions on this? Yeah.

Oh yeah, good question. So is it closed notes? Yes.

So the question is, what is the format of the multiple choice? So we did not finish writing the exam, but it's going to be something like: you have a question and, let's say, three or four possible answers, and then you just choose the one that's correct. Something like this. And you'll also have some free-form questions where you'll have to answer in your own words. Yeah.

>> Yeah, thanks. So the question is, are we allowed to bring anything? So it's closed book. So nothing, just a pen.

Yeah. The question is about calculators: no calculator, you will not need a calculator. But speaking of the cheat sheet, I'm not sure if we mentioned it, I think we did. So there is a cheat sheet for this one, which you cannot bring to the exam but which you can use for your studying. That's on the class website. I'd recommend looking at it.

Cool. So, super clear for everyone? Very cool. Well, okay.

As always, we'll be starting the class just recapping what we saw in the previous lecture. So if you remember, we basically studied a new kind of architecture called the mixture of experts, which is such that if you have an input, what you want is to not necessarily activate all the parameters. And so you are in a setting where you have multiple experts, and in the forward pass you only activate some of them, so that's a sparse MoE. You also have the dense MoE, which basically weights the outputs as a function of the output of the gate. So we saw that this architecture was used in LLMs, and it was mainly used to be able to scale these LLMs without incurring an expensive cost at inference time, because you don't want to activate all the parameters.
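
The difference between the dense and sparse variants can be sketched in a few lines of plain Python. This is a toy illustration, not the lecture's implementation: the names `moe_forward`, `gate_scores`, and `experts` are hypothetical, the gate scores are fixed by hand (in a real MoE they come from a learned router applied to the input), and renormalizing the gate weights over the selected experts is one common choice among several.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_scores, experts, top_k=None):
    """Weighted sum of expert outputs.

    top_k=None  -> dense MoE: every expert runs.
    top_k=k     -> sparse MoE: only the k highest-scoring experts run.
    """
    probs = softmax(gate_scores)
    order = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    active = order if top_k is None else order[:top_k]
    norm = sum(probs[i] for i in active)  # renormalize over active experts (assumption)
    out = [0.0] * len(x)
    for i in active:
        y = experts[i](x)  # only active experts are evaluated
        for j in range(len(x)):
            out[j] += probs[i] / norm * y[j]
    return out, sorted(active)

# Toy experts: each one just scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, -1.0, 1.5]  # hand-picked stand-in for a learned gate
x = [1.0, -1.0]

dense_out, dense_active = moe_forward(x, gate_scores, experts)
sparse_out, sparse_active = moe_forward(x, gate_scores, experts, top_k=2)
print(dense_active)   # indices of experts that ran in the dense case
print(sparse_active)  # indices of experts that ran in the sparse case
```

With `top_k=2`, only two of the four experts are evaluated, which is where the inference-time saving comes from.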

The second thing that we saw was just defining what an LLM was, and in particular how you could decide on what the next token prediction is. So we saw three methods. The first one was what we called greedy decoding, which was always taking the most probable token. The second method we saw was beam search, where we kept track of the k most probable sequences. And then the third one was sampling. So we're not taking the most probable token, we're not keeping track of the most probable sequences. What we do is we sample the next token with respect to the distribution that we get as output. And then we saw there's this hyperparameter called temperature that allows you to tweak how spiky you want your distribution to be.
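
As a rough illustration of greedy decoding versus sampling with temperature, here is a minimal sketch using only the Python standard library. The function names and the three-token `logits` list are made up for the example; dividing the logits by the temperature before the softmax is the standard way temperature is applied.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature -> spikier distribution; higher -> flatter.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Always pick the most probable token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    # Draw the next token according to the output distribution.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
spiky = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print(greedy(logits))
print(spiky)
print(flat)
print(sample(logits, temperature=1.0, rng=random.Random(0)))
```

The top token gets more probability mass at temperature 0.5 than at temperature 2.0, which is the "spiky versus not" trade-off described above.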

And we also saw some inference optimization techniques which are used in practice to avoid having a big cost at decoding time. So I'm not going to mention everything, but I would say KV cache, for instance, is an important method. So I'd just recommend knowing what it is, along with the other ones.

And with that we're going to start lecture four, and actually I was really looking forward to today, because in lecture one we saw what self-attention was, what a transformer was. In the second lecture, we saw some of the tricks that people use today and some of the variations from the transformer. We introduced what an LLM was last lecture, and this lecture we're finally going to see how these LLMs are trained. So today we're going to focus on LLM training.

And the first thing that I'm going to say is, if you've been in the ML field for, let's say, more than a few years now, you may have noticed that traditionally, if you had a task, what you would do is train a model specifically for that task. So let's suppose, like 10 years ago, we had a task which was around detecting spam. You would train a model specifically to detect spam. So you would train on the training set, eval on the validation set and then test on the test set. If you had another use case, suppose sentiment extraction, you would train a model specifically for that, and so on and so forth.

But one could argue that these tasks are not completely disjoint. They all involve just understanding the text. So one could argue we could find a way to somehow leverage the knowledge that we acquired during training for, let's say, one task and reuse that for another task. So this method has a name. It's been around for some time. It's called transfer learning. So the goal of transfer learning is to not always start from scratch. If you have a new task, it's to start with some pre-trained model, and we're going to see what pre-training is, and then tune it for your task instead of starting from scratch.

Well, it's basically the paradigm on which LLMs are trained. So the idea here is that all these tasks involve understanding language. So what we're going to do is have what we call a pre-training stage, which involves training your LLM on vast amounts of data to just understand what language is, what code is, and then have a second stage of quote-unquote tuning. And we're going to see a little bit what that tuning is. But in that second stage, we're going to take our pre-trained model and somehow find a way to tune the weights to adapt to a specific task. So as an example here, we would pre-train a huge model, and then, suppose for spam detection, we would somehow tune it for that. Sentiment extraction, same, we would tune it for that, and so on. So this is just to take the example from before. And the idea here is, in order to obtain these models, we're not going to start from scratch.

Cool. So now we're going to see what pre-training is. So pre-training is by far the most expensive part of the training, in terms of compute, cost, you know, everything. So what it does is take a huge amount of data and train your LLM to just predict the next token. And here by data what I mean is basically everything you can find. So it can be text in English, it can be text in other languages, it can even be code, it can be code in different languages, it can be basically the whole internet. We're going to see some of the datasets that people use for that. But you can think of this as just training your model to try to predict anything that's written.

And as I mentioned, the objective here is to predict the next token. So if you remember, our LLM is a text-to-text model, and most likely a decoder-only model in more than 90% of the cases. So what it does is it takes some input text and it tries to always predict the next token on an iterative basis.
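
The next-token objective is usually written as a cross-entropy loss averaged over positions: at position t, the model's scores over the vocabulary are compared with the token that actually comes at position t+1. Below is a minimal sketch of that computation. The lecture does not show a formula here, so the cross-entropy form is stated as the standard choice, and the tiny vocabulary, the `token_ids` sequence, and the hand-written logits are all hypothetical stand-ins for a real model's outputs.

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed in a numerically stable way.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def next_token_loss(token_ids, logits_per_position):
    """Average loss of predicting token t+1 from the prefix ending at token t."""
    losses = [
        cross_entropy(logits_per_position[t], token_ids[t + 1])
        for t in range(len(token_ids) - 1)
    ]
    return sum(losses) / len(losses)

vocab_size = 4
token_ids = [0, 2, 1, 3]  # a toy tokenized sequence

# A "model" that knows nothing: equal scores for every token.
uniform_logits = [[0.0] * vocab_size for _ in token_ids]

# A "model" that puts a high score on the correct next token.
confident_logits = [[0.0] * vocab_size for _ in token_ids]
for t in range(len(token_ids) - 1):
    confident_logits[t][token_ids[t + 1]] = 10.0

uniform_loss = next_token_loss(token_ids, uniform_logits)
confident_loss = next_token_loss(token_ids, confident_logits)
print(uniform_loss)    # equals log(vocab_size) for a uniform guess
print(confident_loss)  # much smaller when the right token gets most of the mass
```

Pre-training amounts to adjusting the weights so that this average loss goes down over a very large corpus.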

So in terms of the datasets that are used, you will see the term Common Crawl a lot in papers. It's basically a dataset composed of anything you can find on the internet. So I think they have something like three billion pages per month. So if you go on their website, they have a huge archive. There's a bunch of other websites as well that you can find in there. So for instance the Wikipedia articles, social media as well, like Reddit. I know there are a lot of Reddit conversations in those datasets. You have a lot of code, and of course you have a bunch of places for that: you have GitHub, you have Stack Overflow, all these forums that talk about code. So all of this is meant for your model to just understand the structure of language and code.

And in terms of size, it's measured in number of tokens, and one order of magnitude that I want you to remember is that it's on the order of hundreds of billions, or even trillions, or even tens of trillions of tokens. So I'll give you an example. GPT-3 was trained on 300 billion tokens, and for instance Llama 3, which was I believe published last year, was trained on 15 trillion tokens. So these are huge datasets.

So before we go further, I want to introduce two notations, and I think I introduced one of them last lecture. The reason why I want to talk to you about these notations is that they are used everywhere to talk about how much compute some model needs. So the first notation is FLOPs, which stands for floating-point operations, and what it is is a unit of compute. So the higher the FLOPs, the more operations are involved, because by definition FLOPs is the number of operations that involve floating-point numbers. So floating-point numbers, you can think of them as just numbers with decimal points.

So in terms of order of magnitude, training an LLM is on the order of 10^25 FLOPs. And the way you obtain FLOPs: usually it's a complicated formula, but in your mind you can think of it as something that is a function of the size of your data, so the number of tokens that you train on, and the number of parameters of your model. So there's not a universal formula, because it also is a function of the architecture. So you can think of, for instance, MoE-based LLMs as requiring, let's say, less compute, because only some parts are activated, compared to, let's say, dense LLMs. But you can just think of it as a function of the number of tokens and parameters. It's like O of the product between the two, more or less.
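
To make "O of the product" concrete, a widely used rule of thumb for dense transformers is that training costs about 6 FLOPs per parameter per token. The factor 6 is an assumption brought in for illustration (the lecture only states proportionality), and `training_flops` is a hypothetical helper name. The GPT-3 figures are the ones quoted in this lecture.

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6):
    """Rough training compute: proportional to parameters times tokens.

    The default factor of 6 is a common approximation for dense
    transformers (forward plus backward pass), not an exact formula.
    """
    return flops_per_param_token * n_params * n_tokens

# GPT-3 as described in the lecture: 175 billion parameters, 300 billion tokens.
gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")  # 3.15e+23
```

An MoE model would plug in the number of activated parameters rather than the total, which is why it needs less compute for the same total size.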

And then there's a second notation that I want to introduce, which is also FLOPS, but it's different. So here FLOPS stands for floating-point operations per second. So it's a measure of compute speed. It's basically how fast your hardware can execute these operations. And so you also have some orders of magnitude here. If you're into, let's say, GPUs, you will see that in the description of GPUs they always indicate FLOPS, and we will see that in a second. But I just want to call out that FLOPS here is usually all caps. Although you may see some papers that use one for the other, which is confusing. So I just recommend contextualizing this notation with respect to the sentence that it is in, because sometimes people actually switch the two, but this is the common notation.
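
The two notations combine naturally: dividing a training budget in FLOPs by the hardware speed in FLOPS gives a time estimate. The sketch below assumes made-up round numbers throughout (the per-GPU peak speed, the number of GPUs, and the utilization fraction are hypothetical and not taken from any datasheet); only the 10^25 FLOPs order of magnitude comes from the lecture.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    """Time = work / speed, with speed discounted by achieved utilization."""
    effective_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_second / SECONDS_PER_DAY

days = training_days(
    total_flops=1e25,         # order of magnitude quoted in the lecture
    n_gpus=1024,              # hypothetical cluster size
    peak_flops_per_gpu=1e15,  # hypothetical round number, not a real spec
    utilization=0.4,          # hypothetical fraction of peak actually achieved
)
print(f"{days:.0f} days")
```

Doubling the number of GPUs halves the estimate in this simple model, which ignores the communication costs discussed later in the lecture.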

So far so good? Cool.

Okay. So now we know that we have a pre-training step. We know it involves a lot of compute. We know it involves a lot of data. We know that our model is large. So what people did was try to see how the performance evolves as a function of model size and training size. And there's this one paper called "Scaling Laws for Neural Language Models" that was published in 2020 that performed a bunch of experiments by varying these parameters. And what they found was, the more compute you have, the better your model learns to predict the next token. Same for dataset size: the bigger your training set, the better it is. And the bigger your model, the better it is. So for some time, I think between 2019 and 2024, you were seeing models that were larger and larger, people just building things that were bigger and bigger, because according to these experiments the performance was just getting better.

So something else that they noticed was that bigger models tend to be more what they call sample efficient. So what that means is, for an equal amount of tokens processed, you will have a better performance with a bigger model compared to a smaller one.

But then you can wonder: we don't have unlimited compute. Compute is expensive, it has a lot of drawbacks. So you have a fixed compute, and people also tried to answer the question: given a certain amount of compute, how can you fix your training set size and your model size in a way that's more optimal? Because here you need to decide how big your model is. So what they did is they fixed a unit of compute, which is the color of these curves, and they tried training models of different sizes with different training set sizes. And what they saw was that there was always a sweet spot here which followed some kind of relationship.

And in particular, this is a table that summarizes the quote-unquote optimal set of number of parameters and training set size, which is sometimes called the Chinchilla law. And what they realized was, if you have a training set size that's about 20 times the model size, then you're spending your compute in a quote-unquote optimal way. And in particular, GPT-3 for instance, I think it was 175 billion parameters if I remember correctly, but it was only trained on 300 billion tokens. So this one for instance is, according to this, really quote-unquote undertrained.
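
The 20-tokens-per-parameter rule can be checked against the GPT-3 numbers with a few lines of arithmetic. This is a sketch under stated assumptions: the compute model C = 6 * N * D is a common approximation that the lecture does not spell out, and `compute_optimal` is a hypothetical helper that solves for N and D under that model.

```python
import math

TOKENS_PER_PARAM = 20  # the "about 20 times the model size" rule from the lecture

def compute_optimal(compute_flops, flops_per_param_token=6):
    """Split a compute budget C between parameters N and tokens D.

    Assumes C = k * N * D (k = 6 is an approximation) and D = 20 * N,
    which gives N = sqrt(C / (k * 20)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

gpt3_params, gpt3_tokens = 175e9, 300e9

# How far is GPT-3 from the rule? Roughly 1.7 tokens per parameter instead of 20.
actual_ratio = gpt3_tokens / gpt3_params
print(f"tokens per parameter: {actual_ratio:.2f}")

# Tokens that the rule would call for at GPT-3's size.
print(f"tokens for 175B params under the rule: {TOKENS_PER_PARAM * gpt3_params:.2e}")

# Same compute budget, re-split according to the rule: a smaller model, more tokens.
budget = 6 * gpt3_params * gpt3_tokens
n_opt, d_opt = compute_optimal(budget)
print(f"rule-based split: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under these assumptions the same budget would go to a noticeably smaller model trained on several times more tokens, which is the sense in which GPT-3 is called undertrained.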

I think there's a question. Yeah.

So yeah, the question is, do they fix the neural architecture? So I think by now everyone agrees that LLMs are transformer-based decoder-only models. So everyone uses the same model. So you can assume that when I say LLM here, it basically means decoder-only transformer-based models.

Yeah. The question is whether architecture changes do not play a big role. So that's what they say actually in their paper. They say the thing that changes the most is the amount of tokens on which you train and the size of your model.

Cool. Any other questions? Yeah.

Oh yeah, good question. So the question is, is there some kind of transfer learning between different versions of models? So a lot of these models are actually closed source, so they don't exactly reveal these things. But I guess it's an interesting question, one that I cannot answer in a general way. So I think it's the best answer I can give you. But in any case, when you look at some of these papers, they always state how much it costs to train this, and it's always on the order of, you know, millions. It's always an expensive step regardless.

Cool. Speaking of that, pre-training has a lot of challenges. One of them is cost. So when I say millions of dollars, it's a minimum. I think it can even cost tens of millions of dollars, or sometimes hundreds of millions of dollars. It takes a lot of time, and people have been mindful of the impact on the environment, so they've also been including the ecological cost.

So the other challenge is that the pre-training step is on data that is up to the time at which you pre-train your model. So what that means is that the knowledge that you acquire from training on this dataset can only go up until the date at which you cut your dataset. So this date is called the knowledge cutoff date. And so what that means is your pre-trained base model has no way to know by itself knowledge that occurred after this date.

And speaking of that, a lot of papers have tried to edit knowledge, inject knowledge. It's always tricky because there's not a clean way to change the weights in a way that does not penalize some parts. So I guess what people want to do is inject knowledge but not regress in some other domains. And this is a very hard problem. And of course, these models try to predict the next token, and there's this question of what if it just generates something that it has seen at training time, so what we call plagiarism. So there's always a risk. So these are all the challenges.

I just want to illustrate what I said about the knowledge cutoff date. So if you go on, let's say, the OpenAI website or Google websites to look at the model cards, you will always see, so I'm not sure if you can see from here, but there is always a line on the knowledge cutoff date, which tells you when the pre-training of this model was done. So here for instance GPT-5 was released a few weeks ago, and here it says the knowledge cutoff date is September 30th. So you can guess that they've done their pre-training around that date.

Cool. Any questions on the first part? Everyone good?

Perfect. So in this first part, we've seen that pre-training was a crucial step of the LLM training process, and we've seen all these big numbers. And one could wonder, well, how can you train such a big model on such a big amount of data? Like, how do people do that?

So this is what we're going to see here. So just what I had mentioned: LLMs, you can think of them as decoder-only transformer-based models. So in order to train your model you need that, you need a lot of data, but then if you look at your architecture, you see that a lot of the operations involve matrix multiplications. And I guess I have a question for you. What is the kind of hardware that loves matrix multiplications? GPUs. Yes. So you also need GPUs. Actually more than one. Yeah, you had a question.

Oh, the question is about GPUs for inference. So in this one we're going to focus on training. Requirements for GPUs differ a little bit between training and inference, but in this part we're solely focused on training. And speaking of GPUs, I guess it's not GPUs everywhere, because for instance Google, they've developed their own hardware that's called TPUs. But any non-Google models, they've most likely been trained on GPUs.

Cool. So in order to train your model, what do you do? So first of all you have your LLM, so this is this model, but now we're representing it with a box just for simplicity. You initialize it. It has a lot of parameters, so you can think of the scale as being somewhere around billions to hundreds of billions of parameters. A huge model.

And what are the steps involved to train a model? Well, what you're trying to do is to tune the weights so that the model can learn how to generate the next token. So you have one step called the forward pass, where you have a bunch of data that you're trying to pass through the network. And while we do that, I just want to call out things that are important to note, that we need to somehow save in memory. So when you do this forward pass you have something that's called activations, which are basically the values at each layer that are needed in order to compute the loss. So the loss tells you how off you are compared to the label that you want to train on. And so the amount of memory that you will use here is dependent on a lot of things. It's dependent on the model size, which impacts the number of activations. It's dependent on how big your batch of data is for training, and it's dependent on how large your context length is, because if you remember, here we have O(n^2) complexity because of this self-attention operation, where n is the sequence length. So you have all these parameters that come into play.

So once you do the forward pass, let's suppose you compute the loss, you know how off you are compared to your label. Now the next step is to somehow tweak the weights in a way that minimizes the loss. So how do you do that? There's this other pass called the backward pass. So what this pass does is quantify the direction in which the loss is going to be minimized. It's called a gradient. You take the gradient of the loss with respect to each parameter. Well, gradients also need to be saved somewhere in memory.

And then you have finally the weight update, which is where you know the direction in which your loss is going to be minimized, so you apply that update to your weights. And you typically use optimizers out there, like, have you heard of the Adam optimizer? Yeah. So the Adam optimizer is just a fancy version that has some additional quantities which are basically a function of the gradient. So you have the first moment and the second moment, which are basically moving averages of the gradient and the squared gradient. And all these quantities, so the first moment, the second moment, you also need to somehow save them somewhere in memory. So it's a lot of things to save.
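
To show where the extra memory comes from, here is a minimal sketch of an Adam update in plain Python. The update rule follows the standard Adam formulation with bias correction; the hyperparameter values and the toy quadratic loss are illustrative choices, not something given in the lecture. Note that `m` and `v` each hold one value per parameter, on top of the weights and the gradients.

```python
import math

def adam_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update.

    m: first moment  (moving average of the gradient)
    v: second moment (moving average of the squared gradient)
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** step)  # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy problem: minimize loss(w) = sum(w_i^2), whose gradient is 2 * w.
params = [1.0, -2.0]
m = [0.0] * len(params)  # same size as the parameters
v = [0.0] * len(params)  # same size as the parameters
initial_loss = sum(w * w for w in params)

for step in range(1, 201):
    grads = [2 * w for w in params]  # stands in for the backward pass
    adam_step(params, grads, m, v, step, lr=0.05)

final_loss = sum(w * w for w in params)
print(initial_loss, final_loss)
```

In total the training loop keeps four same-sized arrays alive (weights, gradients, first moment, second moment), before even counting activations.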

Well, okay, breaking news: memory is not unlimited. Memory is limited. And so here what we have in front of us is the description of a GPU. I think so, yeah, the H100, which is a very good GPU. And you will see that in that description there's a line on GPU memory. So GPU memory is your amount of memory per GPU. It's 80 gigabytes for this one. It's quite large. So it's on the order of tens of gigabytes. So you need to store all these things in 80 GB, which is not a lot.
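
A back-of-the-envelope count shows why 80 GB fills up quickly. The sketch below assumes every value is stored in 4 bytes (32-bit floats) and counts only the four per-parameter quantities named above: weights, gradients, and the two Adam moments. Activations are left out, so real usage would be higher. The helper name and the example model sizes are hypothetical.

```python
GPU_MEMORY_GB = 80  # per-GPU memory quoted in the lecture

def training_state_gb(n_params, bytes_per_value=4):
    """Memory for weights + gradients + Adam first and second moments.

    Assumes all four are stored at the same precision and ignores
    activations, so this is a lower bound.
    """
    copies = 4
    return n_params * copies * bytes_per_value / 1e9

for n_params in (1e9, 7e9, 70e9):  # hypothetical model sizes
    need = training_state_gb(n_params)
    fits = need <= GPU_MEMORY_GB
    print(f"{n_params:.0e} params -> {need:.0f} GB, fits on one GPU: {fits}")
```

Under these assumptions a model with a few billion parameters already exceeds a single 80 GB device, which motivates spreading the work over several GPUs.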

So what are we doing? What will you be doing? So I guess the idea is to leverage not one but several GPUs in order to somehow distribute the load across GPUs. And in order to do that you have several methods, which we will see in a second.

So the first set of methods is called data parallelism, also known as DP. So what this set of methods does is it distributes data across GPUs so that the forward pass and backward pass can all be done kind of independently. And so the idea here is to divide the batch of data across devices. And in order to do that, of course you need to have a copy of the model per device, because you need to compute the activations, you need to compute all these things. But when you do that you're able to reduce the memory that is linked to the batch size. So that's called data parallelism. Yes.

The question is, how about the gradient updates? Well, it's a great question. So what do you do when you have independent computations here and there? Well, the gradient is just the average of the gradients for this thing. So you have some communication in between the GPUs that basically aggregates the gradient for the updates.
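
The claim that the gradient is just the average of the per-device gradients can be verified on a toy problem. The sketch below simulates data parallelism in a single process: each "device" is a slice of the batch, and the averaging step stands in for the communication between GPUs. The one-parameter linear model, the data, and the function names are all hypothetical, and the batch is assumed to split evenly across devices.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y)^2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, n_devices):
    """Each device computes a gradient on its shard, then the results are averaged."""
    shard = len(xs) // n_devices  # assumes the batch divides evenly
    local_grads = [
        grad_mse(w, xs[d * shard:(d + 1) * shard], ys[d * shard:(d + 1) * shard])
        for d in range(n_devices)  # every device holds the same copy of w
    ]
    return sum(local_grads) / n_devices  # stands in for the cross-GPU aggregation

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full_batch_grad = grad_mse(w, xs, ys)
dp_grad = data_parallel_grad(w, xs, ys, n_devices=4)
print(full_batch_grad, dp_grad)
```

With equal-sized shards the two values agree up to floating-point rounding, so every device can apply the same update and the model copies stay in sync.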

So I have a question for you. Is this the answer to everything? Like, if we just scale up like this for, I don't know, a lot of GPUs, is it always great, or do we have cons?

Oh yeah, great point. So yeah, you have to fit one model, so that's a great point. So the second point that I will add is you have an additional cost which is called communication cost, because you need to somehow communicate between your GPUs in order to aggregate some quantities. So your training is going to be slower. It's good, you can scale up the memory. Of course you need to fit a model on a device, and we will see how we can do that, but you will be incurring those communication costs, so it's not all, you know, great.

NaN:NaN

so speaking of the memory and the fact

NaN:NaN

that we want to I guess be able to at

NaN:NaN

least store a model per um per uh

NaN:NaN

device. So people have realized that

NaN:NaN

there's actually a lot of duplication

NaN:NaN

and there's been a paper on wanting to

NaN:NaN

dduplicate this duplicate information

NaN:NaN

and this method is called zero

NaN:NaN

zero redundancy optimization

NaN:NaN

and the idea is that in each on each GPU

NaN:NaN

you know you store the same parameters,

NaN:NaN

you store the same gradients, you store

NaN:NaN

the st same optim optimizer step states.

NaN:NaN

So the idea here is how about we shard

NaN:NaN

we partition those quantities across

NaN:NaN

GPUs. So the first variation is around

NaN:NaN

sharding the optimizer states, meaning we partition those

NaN:NaN

states across the GPUs. So this reduces

NaN:NaN

the memory by a lot. We can also

NaN:NaN

partition the gradients

NaN:NaN

and we can also partition the

NaN:NaN

parameters.

NaN:NaN

So here we have no redundant

NaN:NaN

information. Things are just

NaN:NaN

partitioned. Well, the problem is you're

NaN:NaN

going to have even more communication

NaN:NaN

costs, but at least it allows us

NaN:NaN

to decrease the memory load on each GPU.

NaN:NaN

So this is ZeRO. There's ZeRO-1, ZeRO-2, and ZeRO-3.

NaN:NaN

The variation that you choose will be a function of how sensitive you are to training time, how big your model is, and whether memory is an actual problem or you are fine with storing everything on every GPU.
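As a rough illustration of what each variation saves, the sketch below estimates per-GPU memory for parameters, gradients, and optimizer states. The byte counts are assumptions made for this example (FP32 parameters and gradients at 4 bytes each, and 8 bytes of Adam state per parameter); activations and communication buffers are ignored, and `per_gpu_bytes` is a hypothetical helper rather than a library function.

```python
def per_gpu_bytes(num_params, num_gpus, stage,
                  param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Rough per-GPU memory for parameters, gradients, and optimizer states.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned across GPUs
    stage 2: gradients partitioned as well
    stage 3: parameters partitioned as well
    The byte counts are assumptions; activations are not counted.
    """
    params = num_params * param_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim

# Example: a 7B-parameter model on 8 GPUs.
for stage in range(4):
    gb = per_gpu_bytes(7e9, num_gpus=8, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the per-GPU footprint shrinks at every stage, which is the trade the lecture describes: less memory per device in exchange for more communication.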

NaN:NaN

So that's one set of methods. So this

NaN:NaN

set of methods is again data

NaN:NaN

parallelism.

NaN:NaN

So it's basically you having independent

NaN:NaN

sets of data that are handled by

NaN:NaN

different GPUs.

NaN:NaN

Well, you have another set of methods

NaN:NaN

that's called model parallelism.

NaN:NaN

So model parallelism tries to

NaN:NaN

parallelize

NaN:NaN

the operation even within one batch.

NaN:NaN

So there's a bunch of methods. I don't

NaN:NaN

want to sound too much like a catalog, so we're not going to go through them one by one, but I will just call out a few that

NaN:NaN

are worth noting.

NaN:NaN

So if you remember last lecture we

NaN:NaN

talked about MoE-based LLMs and how

NaN:NaN

sequences were being sent to different

NaN:NaN

experts.

NaN:NaN

Well there is a way to distribute that

NaN:NaN

across GPUs via this expert parallelism

NaN:NaN

technique, which means having, let's say, one expert

NaN:NaN

on a device another one on another

NaN:NaN

device.

NaN:NaN

So that's one thing worth noting.

NaN:NaN

Another one is tensor parallelism: when you have big matrix multiplications, you cut them up across GPUs in a way that decreases the memory required on each device.
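One common way to cut a matrix multiplication is to split the weight matrix by columns, so that each device holds only a slice of the weights and produces a slice of the output. The NumPy sketch below simulates this on a single machine; the shapes and the choice of a column split are assumptions for illustration, and other splits (for example by rows) exist.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))   # activations: (batch, d_in)
W = rng.normal(size=(6, 8))   # weights: (d_in, d_out)

# Split the weights column-wise: each simulated GPU holds a (6, 4) slice.
num_gpus = 2
W_shards = np.split(W, num_gpus, axis=1)

# Each GPU multiplies the same input by its own slice of the weights.
partial_outputs = [X @ W_k for W_k in W_shards]

# Gathering the slices along the output dimension recovers the full result.
Y_parallel = np.concatenate(partial_outputs, axis=1)
outputs_match = np.allclose(Y_parallel, X @ W)
print(outputs_match)
```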

NaN:NaN

Okay. And maybe the last one I will say

NaN:NaN

is pipeline parallelism.

NaN:NaN

It's when

NaN:NaN

you consider a forward pass as involving

NaN:NaN

several layers. So you're going to say

NaN:NaN

that one GPU is going to only be

NaN:NaN

responsible for, let's say, layers 1, 2, and 3, and then another one for layers 4, 5, and 6, and so on and so forth.
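The layer assignment can be sketched in a few lines. The helper names `assign_layers` and `forward` and the toy layer function are hypothetical; a real pipeline would also split the batch into micro-batches to keep all GPUs busy, which this sketch leaves out.

```python
def assign_layers(num_layers, num_stages):
    """Split layers 1..num_layers into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def forward(x, stages, layer_fn):
    """Run the forward pass stage by stage, as if each stage lived on one GPU."""
    for layer_ids in stages:
        for layer_id in layer_ids:
            x = layer_fn(layer_id, x)
        # In a real system, the activation x would be sent to the next GPU here.
    return x

stages = assign_layers(num_layers=6, num_stages=2)
print(stages)

# Toy "layer": add the layer index, so the result is easy to check by hand.
result = forward(0, stages, lambda layer_id, x: x + layer_id)
print(result)
```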

NaN:NaN

So you also have that kind of parallelism. Anyway, there's a bunch of techniques, and the ones that I mentioned fall in the bucket of model parallelism. Does that make sense?

NaN:NaN

No need to know the details there, but I think knowing that there are several methods and having a rough idea of each is a good thing to have in mind.

NaN:NaN

Cool.

NaN:NaN

So, what did we do? So, we realized that

NaN:NaN

during the training process, we had to

NaN:NaN

save a lot of things in memory. So what we saw were techniques that reduce the memory burden per GPU by distributing it across

NaN:NaN

GPUs. So we saw data parallelism and

NaN:NaN

then the ZeRO method that has some extra

NaN:NaN

optimizations and we saw model

NaN:NaN

parallelism as well.

NaN:NaN

So now we're going to see another

NaN:NaN

technique that leverages the structure

NaN:NaN

of the GPU. You may have heard of this technique; it's called flash attention.

NaN:NaN

It was actually developed here at

NaN:NaN

Stanford in 2022.

NaN:NaN

And in order for me to talk to you about

NaN:NaN

this technique, I want to tell you more

NaN:NaN

about what a GPU is composed of.

NaN:NaN

So if you look under the hood, a GPU is very complicated, and I for sure don't know everything either, but what I know is that we have two kinds of

NaN:NaN

memories in a GPU. So you have one kind

NaN:NaN

of memory that's big but relatively slow

NaN:NaN

that's the HBM, the high-bandwidth memory,

NaN:NaN

and then another kind of memory that is

NaN:NaN

fast but much, much smaller, which is on chip, next to where the compute happens. That's called the SRAM.

NaN:NaN

So you have HBM and SRAM. HBM has something around tens of gigabytes, so it's the GPU memory that you saw in the description.

NaN:NaN

SRAM is much smaller, something around tens of megabytes, let's say, but it is about 10 times faster: HBM is a few terabytes per second, let's say, and SRAM is tens of terabytes per second.

NaN:NaN

So it's like a noticeable difference in

NaN:NaN

speed.

NaN:NaN

So what we want is to somehow leverage

NaN:NaN

the strength of these kinds of memories

NaN:NaN

in order to speed up the attention

NaN:NaN

computation in an exact way. So

NaN:NaN

what do I mean by exact way? So what I

NaN:NaN

mean is we're not making any

NaN:NaN

approximations to the computation.

NaN:NaN

What we're doing is we're just

NaN:NaN

leveraging the strength of these

NaN:NaN

components and sending the computation

NaN:NaN

in a clever way.

NaN:NaN

If you remember, the self-attention computation is done with this very important formula, softmax(Q K^T / sqrt(d_k)) V: the softmax of the queries times the keys, over some scaling factor, times V.

NaN:NaN

So this allows queries to interact with

NaN:NaN

everyone else.

NaN:NaN

So in matrix form, you can think of the queries as a matrix whose number of rows equals the sequence length and whose number of columns equals the dimension of the query, and the same goes for keys and values. So you have these big matrix multiplications.
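For reference, the formula and shapes just described can be written out in NumPy as follows. This is a plain, unoptimized sketch; the sizes `n` and `d` and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def standard_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, materializing the full score matrix."""
    d_k = Q.shape[1]
    S = Q @ K.T / np.sqrt(d_k)   # (n, n) scores: every query against every key
    P = softmax_rows(S)          # (n, n) weights: each row sums to one
    return P @ V                 # (n, d) output

rng = np.random.default_rng(0)
n, d = 5, 4                      # sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)
```

Note that the two intermediate matrices `S` and `P` are both of size sequence length by sequence length, which is what makes them expensive to move around.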

NaN:NaN

So if you do this computation the standard, vanilla way, what you would do is store them in

NaN:NaN

the big but slow memory component of the

NaN:NaN

GPU.

NaN:NaN

So you would store them in the HBM.

NaN:NaN

So here is what you would do if you were to not do any optimization. You would take those matrices from the big but slow HBM, perform the computation, and write the result back to the HBM. Then you would read that result again from the HBM, compute the softmax, and write it back to the HBM. Then you would again load this plus the value matrix, multiply them, and write the output to the HBM. See, there's a lot of reading and writing to the HBM, so it's a lot of data transfer

NaN:NaN

which actually becomes the bottleneck.

NaN:NaN

So a GPU is very very fast but then you

NaN:NaN

spend a lot of time just loading your

NaN:NaN

matrices from the memory.
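A quick back-of-the-envelope calculation shows why the intermediate score matrix has to go through HBM in the vanilla approach. The sketch assumes 2 bytes per element (FP16) and a single attention head; both are assumptions for illustration, and `score_matrix_bytes` is a hypothetical helper.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Size of one (seq_len x seq_len) attention score matrix."""
    return seq_len * seq_len * bytes_per_element

for seq_len in (1024, 8192, 32768):
    mb = score_matrix_bytes(seq_len) / 1e6
    print(f"seq_len={seq_len}: {mb:.1f} MB for a single head")
```

Since the matrix grows quadratically with the sequence length, at longer sequence lengths it quickly exceeds the tens of megabytes of SRAM mentioned earlier.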

NaN:NaN

The reason why you do that is because of

NaN:NaN

the softmax operation. So do you

NaN:NaN

remember what a softmax does? So it

NaN:NaN

normalizes the quantities so that they

NaN:NaN

sum to one but it's row dependent

NaN:NaN

meaning that each row needs to sum up to

NaN:NaN

one.

NaN:NaN

So in a sense you need that computation

NaN:NaN

to happen first before you do your

NaN:NaN

softmax. If you just look at it like that, you would think that you need to do the whole thing first.

NaN:NaN

Well turns out that you don't need to do

NaN:NaN

everything

NaN:NaN

at once.

NaN:NaN

And this is the core idea behind flash

NaN:NaN

attention.

NaN:NaN

So what flash attention does is it tries

NaN:NaN

to minimize the amount of read and write

NaN:NaN

from and to the HBM. Instead, it uses a method called tiling: it takes small blocks that it sends to the SRAM so that they get computed from end to end before being sent back to the HBM.

NaN:NaN

Does that make sense? So the idea is

NaN:NaN

let's send small matrices into the SRAM

NaN:NaN

so that it does the full end-to-end computation, and just send the result back to the HBM, because we want to minimize the amount of reading and writing from the HBM.

NaN:NaN

So here's how you would do it. So

NaN:NaN

you remember the softmax

NaN:NaN

computation with the query and the

NaN:NaN

key and then the value. Well, what you

NaN:NaN

would do is to cut your matrices

NaN:NaN

and then proceed step by step.

NaN:NaN

But then there's a cool trick that I

NaN:NaN

want to talk to you about which is that

NaN:NaN

you don't need to compute the whole

NaN:NaN

matrix inside a softmax

NaN:NaN

in order to achieve the whole softmax

NaN:NaN

computation

NaN:NaN

because, if you think about it, suppose you have a whole matrix split into, let's say, blocks of columns, submatrices S1 to SN. Well, the softmax of this huge matrix is equal to the matrix where the softmax is taken with respect to each of these submatrices, up to some scaling factor.

NaN:NaN

So this is the core trick

NaN:NaN

And if you want to be convinced of it, just look at the softmax formula: it's the exponential of something over some quantity that is shared across the row. So the scaling factor just corrects each block's normalization to that shared quantity.
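The identity can be checked numerically. In the sketch below, the scaling factor for each block and row is that block's share of the row's total sum of exponentials; this is one straightforward way to write the correction, and it is not necessarily the exact form used in the paper. The matrix size and the number of blocks are arbitrary.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with the usual max subtraction for stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 8))
blocks = np.split(S, 4, axis=1)           # S1..S4, each of shape (3, 2)

# Per-block, per-row sums of exponentials, and their total over the full row.
block_sums = [np.exp(B).sum(axis=1, keepdims=True) for B in blocks]
row_total = sum(block_sums)

# Scaling factor alpha_k = block_sum_k / row_total, one value per row and block.
rescaled = [(b_sum / row_total) * softmax_rows(B)
            for B, b_sum in zip(blocks, block_sums)]

trick_matches = np.allclose(np.concatenate(rescaled, axis=1), softmax_rows(S))
print(trick_matches)
```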

NaN:NaN

So with this in mind, what we will do is

NaN:NaN

take the respective slices of these

NaN:NaN

matrices,

NaN:NaN

do the whole computation and then

NaN:NaN

populate the corresponding

NaN:NaN

entry in the output matrix.

NaN:NaN

So we will do that between let's say the

NaN:NaN

first slice of the query and the first

NaN:NaN

slice of the keys and the values and

NaN:NaN

then we will repeat for the other slice

NaN:NaN

until the end

NaN:NaN

and then we will repeat for the other

NaN:NaN

queries as well until the end.
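The loop just described can be sketched in NumPy as below. This is only an illustration of the tiling idea with a running softmax correction: it runs on the CPU, does not model SRAM or HBM, and is not the fused GPU kernel from the paper. The block sizes and the names `tiled_attention` and `reference_attention` are assumptions made for this example.

```python
import numpy as np

def reference_attention(Q, K, V):
    """Standard attention that materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_q=2, block_k=2):
    """Exact attention computed block by block, never forming the full matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, V.shape[1]))
    for qs in range(0, n, block_q):                  # loop over query slices
        Qb = Q[qs:qs + block_q]
        m = np.full((Qb.shape[0], 1), -np.inf)       # running row max
        l = np.zeros((Qb.shape[0], 1))               # running row sum of exponentials
        acc = np.zeros((Qb.shape[0], V.shape[1]))    # unnormalized output
        for ks in range(0, K.shape[0], block_k):     # loop over key/value slices
            Kb, Vb = K[ks:ks + block_k], V[ks:ks + block_k]
            S = Qb @ Kb.T * scale                    # small block of scores
            m_new = np.maximum(m, S.max(axis=1, keepdims=True))
            correction = np.exp(m - m_new)           # rescales earlier partial results
            P = np.exp(S - m_new)
            l = l * correction + P.sum(axis=1, keepdims=True)
            acc = acc * correction + P @ Vb
            m = m_new
        O[qs:qs + block_q] = acc / l                 # normalize once per query slice
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
tiles_match = np.allclose(tiled_attention(Q, K, V), reference_attention(Q, K, V))
print(tiles_match)
```

The result matches standard attention up to floating-point rounding, which is the sense in which the method is exact rather than approximate.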

NaN:NaN

So what the paper explains is how this

NaN:NaN

scaling factor is being computed. So

NaN:NaN

this one is some formula that I did not

NaN:NaN

put on the slide. So it's not necessary

NaN:NaN

for you to memorize the formula. It's

NaN:NaN

just the idea

NaN:NaN

and the idea is exactly this trick.

NaN:NaN

So once you do that,

NaN:NaN

you basically end up with only one read from the HBM. These tiled quantities are stored in the SRAM, read from the SRAM, which is very fast, computed, and written back to the SRAM, and at the end, in order to accumulate the results, they're sent back to the HBM.

NaN:NaN

So, just to make sure we're clear. So,

NaN:NaN

in green is when it's read from the SRAM, and in blue is when it's read from the HBM. You have a question? Yeah.

NaN:NaN

>> Yeah. The question is do you take the

NaN:NaN

whole row or a portion of it? So you can

NaN:NaN

take a portion of it; this is just for illustrative purposes. You can think of your matrix as being completely, you know, like a grid, and then you just like

Key Summary

Covers the LLM training pipeline: the transfer learning paradigm, pre-training data and objectives, scaling laws, distributed training techniques (data/model parallelism, ZeRO, Flash Attention), as well as mixed precision training and post-training (SFT, RLHF).

Key Concepts

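Transfer Learning Paradigm 4:30

Traditional approach: train a model from scratch for each task (spam detection and sentiment analysis separately)
Transfer learning: learn the common foundation of language understanding through pre-training, then tune per task
Pre-training: learn the structure of language and code from massive data (next token prediction)
Tuning: adapt the pre-trained model to a specific task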

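Pre-training Data 7:00

Data sources: Common Crawl (3 billion pages per month), Wikipedia, Reddit, GitHub, Stack Overflow
Scale: hundreds of billions to tens of trillions of tokens. GPT-3 (300B), Llama 3 (15T)
Objective: learn the structure of language and code through next token prediction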

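FLOPs vs FLOPS 11:00

FLOPs (lowercase s): floating-point operations, a unit for the amount of computation. LLM training takes ~10^25 FLOPs
FLOPS (uppercase S): floating-point operations per second, a measure of compute speed and a GPU performance metric
FLOPs estimate: O(number of tokens × number of parameters)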

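Scaling Laws 13:30

Kaplan et al. (2020): performance improves as model size, data size, and compute increase
Sample efficiency: larger models reach better performance with the same number of tokens
Chinchilla optimal: for a fixed compute budget, the optimal setting is number of data tokens ≈ 20 × number of parameters
The problem with GPT-3: 175B parameters but trained on only 300B tokens (the Chinchilla rule would call for 3.5T tokens)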

Memory Requirements 20:00

What is stored: model parameters, gradients, optimizer states, activations
Adam optimizer: 12 bytes per parameter (FP32 weight + FP32 momentum + FP32 variance)
Example: training a 7B model needs ~84GB (more than a single GPU)

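Data Parallelism 28:00

Principle: replicate the model on every GPU and split only the data
Process: each GPU computes gradients on its mini-batch → average with All-Reduce → apply the same update everywhere
Pros: simple to implement, increases the effective batch size
Cons: the entire model must fit in each GPU's memory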

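ZeRO (Zero Redundancy Optimizer) 32:00

ZeRO-1: optimizer states are partitioned across GPUs
ZeRO-2: + gradients are also partitioned
ZeRO-3: + parameters are also partitioned (All-Gather when needed)
Effect: large reduction in memory usage, making it possible to train larger models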

Model Parallelism 35:00

Tensor parallelism: a single operation (matrix multiplication) is split across GPUs
Pipeline parallelism: layers are distributed across GPUs (GPU 1: layers 1-3, GPU 2: layers 4-6)

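Flash Attention 37:00

GPU memory structure: HBM (large but slow, ~tens of GB) vs SRAM (small but fast, ~tens of MB)
Bottleneck: standard attention does many reads and writes between HBM and SRAM
Core idea: tiling, that is, split into small blocks, compute end-to-end in SRAM, then store the result in HBM
Softmax trick: softmax([S1, S2, ...]) = [α1·softmax(S1), α2·softmax(S2), ...] (corrected with scaling factors)
Recomputation: recompute activations during the backward pass instead of storing them. More computation, but it saves memory and also reduces runtime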

Mixed Precision Training 50:00

Principle: keep weights in FP32, run forward/backward in FP16
Reason: in FP16, gradients that are too small underflow to 0
Loss scaling: multiply the loss by a large number to prevent gradient underflow, then divide back at update time
Effect: memory savings + faster computation

Post-training 1:20:00

SFT (Supervised Fine-Tuning): train on high-quality (instruction, response) pairs
RLHF (Reinforcement Learning from Human Feedback): train a reward model on human preferences, then optimize with PPO
DPO (Direct Preference Optimization): optimize preferences directly, without a reward model
Goal: make the pre-trained model helpful, harmless, and honest

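Key Insights

The Chinchilla scaling law shifted the paradigm from 'bigger no matter what' to 'bigger, but efficiently'
Flash Attention is a good example of an optimization that exploits hardware characteristics
The Pre-training → SFT → RLHF pipeline is the standard for modern LLMs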