Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning

Stanford Online • November 14, 2025 • AI summary generated: January 24, 2026

Hello everyone, and welcome again to lecture six of CME295. Today is actually an exciting day because we're going to cover a topic that has been trending over the past year or so, which is LLM reasoning. It's also a good segue from what we talked about last time, which was preference tuning, because a lot of the methods that we used in lecture five are going to be the foundations for this lecture.

So before we start, as usual, we're going to quickly cover what we saw in the last lecture.

So if you remember, lecture four and lecture five were all about learning how we can train a model. In lecture four we saw the first part, pre-training, which is the most compute-intensive step. This is where we basically teach the model the structure of text and the structure of code, and we do this with very large-scale training. At the end of this first step we get a model that knows about code and knows about language, but it only knows how to autocomplete a sequence. That's why we also saw the second step, the fine-tuning step, where we take our pre-trained model and try to make it useful.

One use case that we saw was, for instance, an assistant. We tried to tune the model so that it can respond to questions. Here we have the work of preparing what we call the SFT data, which is a high-quality curated dataset that we can use to teach our model how to behave. At the end of this second step we have a model that is tuned for a specific task, which can be responding to queries. Then last lecture we saw the third step, the preference tuning step, where the goal is to align our model with human preferences. We saw in particular RLHF, which is a common method to do that, and we saw that there were two steps to it. The first part was learning to distinguish good from bad using human preference data, and the second step was the RL stage, which is going to be useful today.

In particular, if you remember, we had drawn a comparison with the RL setup that you may be familiar with from outside this class, where an agent interacts with an environment. The way it interacts is that, given the state it is in, it can take an action following a policy, which is nothing more than a probability distribution over actions. So given a state, our agent takes an action following this policy, and as a result it receives some reward.

We saw last lecture that there is a nice comparison between the traditional RL setup and the LLM setup. Here our quote-unquote agent is simply the LLM. The environment that it interacts with is just the set of tokens that it can predict over. Given the input it has received so far, it predicts what the next token could be, which is the action, and it does that using the probability distribution that results from the LLM's prediction. We also saw that we can obtain human preferences for each completion: you have a prompt, you have a completion, and you get those human preferences. This is what you then use to tune the LLM, because this step is about aligning the model with human preferences, and here the human preferences are encapsulated in the reward.
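To make the analogy concrete, here is a minimal sketch of an LLM acting as a policy: the "state" is the text so far, the "action" is the next token, and the policy is a probability distribution over the vocabulary. The tiny vocabulary, the logits, and the function names are illustrative assumptions, not anything from the lecture slides.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (the policy)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(vocab, logits, rng):
    """Sample the next token (the action) from the policy."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical three-token vocabulary and scores for some prompt.
vocab = ["yes", "no", "maybe"]
logits = [2.0, 0.5, 0.1]
next_token = sample_action(vocab, logits, random.Random(0))
```

In a real model the logits come from a forward pass over the prompt, and the reward is only assigned once a full completion has been produced.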

We saw that the loss function during the RL stage is composed of two parts. The first part is advantage maximization. We saw that the advantage is based on the rewards, and it has a baseline to reduce the variance of the gradient. The other part is that we don't want our model to deviate too much. We don't want it to change too much from iteration to iteration, and we also don't want it to change too much compared to our initial model, our base model, which here is the SFT model. The reason is that our model has already learned a lot. It's already quite performant at what it does, and what we want is just to align it with human preferences, which is not something for which we want the model to change completely.

And I think the part that was a little bit scary was the actual loss functions. If you remember, the main algorithm typically used in the RLHF setting is PPO, which stands for proximal policy optimization. There is one variant called PPO-Clip, which clips the updates from one iteration to another. Here, if you remember, r is actually not the reward, it is the ratio, and I know it's confusing notation. It's the ratio between the current policy and the old policy, where the old policy is the policy at the previous RL iteration. So we have a clipping mechanism such that the ratio cannot go beyond certain thresholds, because we do not want to incentivize the model to make too big of an update.
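As a rough illustration of the clipping mechanism, the sketch below computes the standard PPO-Clip surrogate for a single action: the minimum of the unclipped and clipped ratio-times-advantage terms. The function name, the per-action scalar form, and the default threshold `eps=0.2` are assumptions made for this example; real implementations average this over batches of tokens.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for one action (to be maximized).

    logp_new / logp_old: log-probability of the action under the current
    policy and under the policy from the previous RL iteration.
    """
    ratio = math.exp(logp_new - logp_old)            # r = pi_new / pi_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # keep r in [1-eps, 1+eps]
    # Taking the minimum removes any incentive to push r past the threshold.
    return min(ratio * advantage, clipped * advantage)
```

For example, if the new policy doubles the probability of an action with a positive advantage, the objective is capped at `1.2 * advantage` instead of `2.0 * advantage`, so there is nothing more to gain from moving further in that direction.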

We saw another variant of PPO, called PPO KL penalty. This variant uses the KL divergence to penalize the model for changing too much. If you remember, the original PPO paper used the quote-unquote old version of the model, which is the one at the previous RL iteration. But in modern-day RLHF training, this KL divergence is typically computed against the base model, which is the SFT model. So long story short, we have these two PPO variants that were introduced in the original PPO paper, but in modern-day RLHF training we typically have some combination, some mix, of these two loss functions.
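A minimal sketch of the KL-penalty idea, assuming discrete next-token distributions given as plain lists of probabilities: the surrogate objective is reduced by a coefficient times the KL divergence between the current policy and a reference model such as the SFT model. The function names, the coefficient `beta=0.1`, and the KL(policy || reference) direction are assumptions for illustration; the exact form differs between papers and libraries.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_penalized_objective(surrogate, policy_probs, reference_probs, beta=0.1):
    """Surrogate objective minus a penalty for drifting from the reference."""
    return surrogate - beta * kl_divergence(policy_probs, reference_probs)
```

When the policy matches the reference exactly the penalty is zero, and it grows as the two distributions move apart, which is what keeps the tuned model close to the SFT model.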

Cool. So up until now, we've seen what I want to call vanilla LLMs, which are LLMs that take something as an input, let's say a prompt, and just respond with some answer. Those vanilla LLMs have a lot of strengths that we can enjoy. First of all, we saw that they know a lot about the structure of text, and they know a lot about code. In particular, if you want to debug your code, they're great at finding where the error is. They're great at generating code. They're also great at generating, I don't know, essays or poems. They're really, really good at that.

But I do want to call out some weaknesses. The first weakness I want to call out in these vanilla LLMs is that they have quote-unquote limited reasoning. Typically, if you give one a sophisticated, let's say, math problem, it will not really come up with the solution, because it may get lost along the way. Up until now our model has been trained to respond to a given prompt using next-token prediction, so there is no strong reason why it would be able to solve complicated problems. So that is one.

The second weakness is that the LLM has been pre-trained on a huge amount of data which is static, meaning that the knowledge the LLM has acquired is bound to what we call the cutoff date, the date at which we cut and formed our pre-training data. I know a few days ago we had an election. So if, let's say, we trained our LLM on data from before the election, and today we ask it who the elected official of, let's say, X is, it will not be able to answer because it does not have access to knowledge after that date.

A third weakness is that so far it's all talk, no action. You can prompt your LLM, but if you want to, let's say, place an order or perform some other action, you just cannot do it.

And the last weakness, in what by the way is not an exhaustive list, is that contrary to traditional NLP models, LLMs generate free-form text, and it's hard to evaluate them in the framework that we, and by we I mean the ML community, had adopted up until a few years ago. If you're familiar with it, suppose you worked in the translation world: you would use rule-based metrics like BLEU, or ROUGE for summarization, to evaluate your outputs. But LLMs can really do more than just that, so it's very hard to evaluate them.

So what I want to say is that this is, let's say, a subset of all the weaknesses that LLMs can have. The last three are going to be topics that we will cover in the next lecture and in lecture eight. And as I mentioned before, the focus of today is reasoning, so we'll see how we can improve the way our LLM reasons.

Cool. And again, this is a very new topic, and by new I mean roughly a year old. Almost everything that we will see is from either 2024 or 2025. I guess we're lucky because now we have enough hindsight to know which pieces are more important than others. So the first goal for today is to know what reasoning models are, and the second big goal is to know how they are trained. Hopefully, if at the end of this lecture you have a good answer to these two questions, that means we have done a good job.

Okay. So let's start with reasoning models. What are they? Well, to answer this question we first need to define what reasoning is. The bad news is that there's no commonly agreed-upon definition out there of what reasoning is, so I will try to define it to the best of my ability. Here we define reasoning as the ability to solve a problem. And here by problem we typically think

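Key Summary

This lecture on LLM reasoning covers what reasoning models are and how they are trained. It centers on an RL-based approach, in particular the GRPO algorithm, that applies chain of thought at scale so the model learns to break complex problems into steps and solve them.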

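Key Concepts

Limitations of Vanilla LLMs 08:50

Limited Reasoning: weak at solving complex math and coding problems
Static Knowledge: no information after the training data cutoff date
All Talk No Action: cannot take real actions (placing orders, calling APIs, etc.)
Hard to Evaluate: free-form text output makes traditional metrics such as BLEU/ROUGE a poor fit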

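Definition of Reasoning 13:55

Definition: the ability to solve a problem through multi-step reasoning
Knowledge vs Reasoning: "What is CME295?" (knowledge) vs "How old in 2025 is a bear born in 2020?" (reasoning)
Mostly applied to math and coding problems, but can extend to other domains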

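Chain of Thought at Scale 15:55

LLMs are trained with next-token prediction, so they answer in a "plausible" way
Complex problems rarely appear in the training data, so they are hard to solve directly
Key idea: break the problem into tractable sub-problems → solve them with learned patterns
Compute Budget: more generated tokens = more forward passes = more compute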

Structure of a Reasoning Model 21:58

Vanilla LLM: Question → Answer
Reasoning Model: Question → Reasoning Chain → Answer
The output is not just an answer but "Reasoning + Answer"

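Reasoning Model Timeline 22:23

2024.09: OpenAI o1 preview released (the starting point)
2024.12: Google Gemini 2.0 Flash Thinking
2025.01: DeepSeek R1, which matched OpenAI's performance and published its methodology (a big moment)
Afterwards: xAI, Anthropic Claude, Mistral, and others added reasoning capabilities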

Test-Time Scaling 24:30

Train-Time Scaling: bigger models, more data, more compute
Test-Time Scaling: spend more compute at inference time (a new paradigm)
Even the same model can perform better when given more time at inference

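Pass@K Metric 32:28

Definition: the probability that at least one of K attempts succeeds
Similar to Best-of-N, but uses a verifiable reward instead of a reward model
Coding: whether the test cases pass / Math: whether the answer matches the ground truth
Estimator: Pass@K = 1 - C(n-c, k) / C(n, k) (given c successes among n samples)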
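
The estimator above translates almost directly into code. This is a small sketch using Python's standard `math.comb`; the function name and the input validation are additions made for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@K from n sampled completions, c of which are correct.

    Computes 1 - C(n-c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no correct completion.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` is 1 - 3/6 = 0.5.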

Why RL for Reasoning? 50:40

Math and coding have verifiable rewards (correctness is unambiguous)
Hard to learn from scratch with SFT alone (high-quality reasoning data is scarce)
Solution: use RL so the model learns reasoning patterns on its own

RL Reward Design 51:30

Format Reward: whether the token is present (encourages generating a reasoning chain)
Correctness Reward: correctness of the final answer (verifiable)

No reward model needed: both can be checked with rule-based verification
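
As a hedged sketch of how two rule-based rewards like these could be implemented, the code below checks for a reasoning block and then compares the final answer with a reference. The `<think>...</think>` delimiters, the exact-string match, and the equal weighting of the two rewards are assumptions for illustration; the summary does not specify them.

```python
import re

# Assumed format: a reasoning block in <think> tags, followed by the answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion contains a reasoning block and then an answer."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def correctness_reward(completion, reference_answer):
    """1.0 if the text after the reasoning block equals the reference."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # no parsable answer, so no correctness credit
    return 1.0 if match.group(2).strip() == reference_answer.strip() else 0.0


def total_reward(completion, reference_answer):
    """Sum of both rule-based rewards (equal weights are an assumption)."""
    return format_reward(completion) + correctness_reward(completion, reference_answer)
```

Because both checks are plain rules, no learned reward model has to be kept in memory during training.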

GRPO (Group Relative Policy Optimization) 1:10:20

Proposed in the DeepSeek Math paper
Key difference: computes the advantage without a value function
Method: generate G completions for the same prompt → compare each completion's reward within the group
Advantage = (Ri - mean(R)) / std(R) (relative comparison within the group)
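
The group-relative advantage can be sketched in a few lines with the standard library. Using the population standard deviation and a small `eps` to avoid dividing by zero are assumptions of this example, as is the function name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own group.

    rewards: rewards of the G completions sampled for one prompt.
    Returns (R_i - mean(R)) / (std(R) + eps) for every completion.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]` the correct completions get an advantage close to +1 and the incorrect ones close to -1. If every completion in the group receives the same reward, all advantages are 0 and that prompt contributes no learning signal.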

GRPO vs PPO Comparison 1:10:40

Item	PPO	GRPO
Frozen Models	Reference + Reward Model	Reference only
Trained Models	Policy + Value Function	Policy only
Advantage	Reward - Value	Relative comparison within the group
Use Case	Preference Tuning	Reasoning Training

GRPO Loss Components 1:11:10

In common: policy ratio (π/π_old), clipping mechanism
Difference: GRPO includes the KL divergence explicitly in the loss
PPO folds the KL into the advantage computation

Thinking Budget Control 55:20

Not every problem needs the same amount of thinking
Dynamic Budget: a classifier judges problem difficulty
Budget Forcing: a "wait" token prompts more thinking; a "time's up" token prompts stopping
Continuous Thoughts: think in hidden representations instead of tokens (a more compressed form)

Length Optimization 1:16:20

Output length tends to grow as RL training progresses
Longer reasoning = higher cost (for both users and providers)
Research on adding a length reward for efficiency is ongoing

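Key Insights

Reasoning Model = chain of thought applied at scale + RL-based training
Test-Time Scaling: more inference-time compute improves performance (a new scaling paradigm)
GRPO: computes the advantage by within-group comparison, without a value function; simpler than PPO
RL is especially effective in domains with verifiable rewards (math, coding)
DeepSeek R1 helped democratize reasoning research by publishing its methodology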