
Post-Training Is Not 'One Algorithm': Objective Functions and Implementation Essentials for PPO / DPO / GRPO

· 12 min read

Reading notes from Nathan Lambert’s “Reinforcement Learning from Human Feedback (RLHF).” The book helped me build a clearer mental model of post-training—not as a single algorithm, but as an engineering pipeline: data → reward proxy → optimization → evaluation → guardrails.

Notation: prompt is $x$, completion is $y$, policy model is $\pi_\theta(y\mid x)$, reference model is $\pi_{\text{ref}}$, reward/scoring function is $r(x,y)$.


0) A “Production-Ready” Pipeline Diagram

SFT -> Preference data (pairs / rankings) -> RM/Judge/Verifier -> Optimize (PPO/GRPO/DPO/...) -> Eval & Iterate
                       |                                          |
                       +-- guardrails (KL / NLL / length / ...) --+

Key insight: What you actually want is usable behavior in production; but what you can optimize is often just some proxy objective, so the system naturally evolves toward “needing guardrails.”


1) Canonical RLHF: Reward - KL (What the Objective Function Looks Like)

The book consistently uses this form of RLHF: maximize expected reward over the data distribution while using KL to pull the policy back toward the reference model (style/distribution constraint).

$$\max_{\pi}\ \mathbb{E}_{x\sim D}\,\mathbb{E}_{y\sim \pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\,\Vert\, \pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

Intuition: $\beta$ is the knob for "how far you dare to stray from SFT"; it is not just a mathematical term, it is more like the master switch for "capability vs style/stability" in your product.
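To make the knob concrete, here is a minimal sketch (PyTorch-style, illustrative names; per-sequence rewards and per-token logprobs with padding already masked) of how the regularized objective turns into a scalar training signal:

import torch

def kl_regularized_reward(reward, logp_theta, logp_ref, beta=0.1):
    """Reward minus beta * KL: the shape of the canonical RLHF objective.

    reward:   [B]    scalar reward per completion
    logp_*:   [B, L] per-token logprobs under the policy / reference
    """
    kl_per_seq = (logp_theta - logp_ref).sum(dim=-1)   # MC estimate of KL(pi_theta || pi_ref)
    return reward - beta * kl_per_seq                  # [B], maximize this in expectation

Raising beta drags the result back toward the reference model's behavior; lowering it lets the reward dominate.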


2) PPO: The Core Isn’t “Being Stronger”—It’s That Clipping Makes Updates More Stable

The key to PPO is the importance ratio plus clipping: for the same batch of data sampled by the old policy $\pi_{\theta_\text{old}}$, the ratio corrects for the mismatch when updating the new policy, and the clip limits the magnitude of any single update, preventing training collapse when the ratio deviates too far from 1.

Classic PPO clipped surrogate:

$$J(\theta)=\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)A_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\big)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}$$

In language models, the common implementation is per-token (easier to compute with logprobs).

Minimal Working PPO Pseudocode (LM Version)

# given prompts x
y = sample(pi_old, x)                         # rollout / completion
r = reward_model_or_verifier(x, y)            # scalar per sequence (or per step)

logp_new = logprob(pi_theta, x, y)            # [B, L]
logp_old = logprob(pi_old,   x, y)            # [B, L]
ratio = exp(logp_new - logp_old)              # [B, L]

A = compute_advantage(r, baseline=V(x,y) or batch_norm)  # token- or seq-level
pg = -mean(min(ratio*A, clip(ratio, 1-eps, 1+eps)*A))

kl = mean(logp_new - logprob(pi_ref, x, y))   # MC estimate of reverse KL
loss = pg + beta*kl + vf_coef*value_loss + other_regularizers
loss.backward(); opt.step()

In engineering, the easiest pitfalls aren't the formulas; they are how the advantage is defined (token- vs sequence-level), whether the KL term goes into the reward or the loss, and how many gradient steps you take per batch of rollouts. These choices significantly change what training actually produces.
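To make those choices concrete, here is a hedged sketch (PyTorch-style, illustrative names; assumes no trailing padding) of the two most common KL placements:

import torch

@torch.no_grad()
def kl_shaped_rewards(reward, logp_theta, logp_ref, beta):
    """Variant A: fold the KL penalty into the per-token reward before advantage estimation.

    Computed on rollout logprobs, so no gradient is needed here.
    """
    per_token_kl = logp_theta - logp_ref                # [B, L]
    shaped = -beta * per_token_kl                       # penalty at every token
    shaped[:, -1] = shaped[:, -1] + reward              # scalar reward credited at the final token
    return shaped                                       # feed into returns / GAE as usual

def kl_in_loss(pg_loss, logp_theta, logp_ref, beta):
    """Variant B: keep the reward clean and add beta * KL as an explicit loss term."""
    kl = (logp_theta - logp_ref).mean()                 # MC estimate of KL(pi_theta || pi_ref)
    return pg_loss + beta * kl

Token-level advantages give every token its own signal, while sequence-level advantages reuse one scalar across the whole completion; the two variants can behave quite differently at the same beta.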


3) KL Regularization: The “Numerical Preference” of Reverse KL and Implementation

The book treats KL control as one of the core guardrails of post-training and explains the commonly used reverse KL to the reference: when the new policy assigns high probability in regions where the reference model has low probability, it gets heavily penalized (numerically more "conservative").

Implementation often uses sampling to approximate the expectation:

$$D_{\mathrm{KL}}(P\,\Vert\, Q)=\mathbb{E}_{x\sim P}\big[\log P(x)-\log Q(x)\big]$$

This also explains why many systems can compute the KL term by just calculating logprobs twice (no explicit sum over the vocabulary is needed).
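A minimal sketch of that "two logprob passes" trick (PyTorch-style, illustrative names): score the sampled tokens under both models and average the difference, with padding masked out.

import torch

@torch.no_grad()
def reverse_kl_estimate(logp_theta, logp_ref, mask):
    """MC estimate of KL(pi_theta || pi_ref) on sampled completions.

    logp_theta, logp_ref, mask: [B, L]; only the sampled tokens' logprobs are needed,
    so there is no sum over the vocabulary.
    """
    diff = (logp_theta - logp_ref) * mask
    return diff.sum() / mask.sum()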

Forward KL vs Reverse KL Intuition

| | Forward KL $D_{\mathrm{KL}}(P_{\text{ref}} \Vert P_\theta)$ | Reverse KL $D_{\mathrm{KL}}(P_\theta \Vert P_{\text{ref}})$ |
|---|---|---|
| Penalizes | $P_\theta$ assigning low probability where $P_{\text{ref}}$ is high | $P_\theta$ assigning high probability where $P_{\text{ref}}$ is low |
| Behavioral tendency | Mode-covering (tries to cover all modes) | Mode-seeking (tends to converge to a single mode) |
| Practical effect | More "exploratory," may produce strange outputs | More "conservative," tends toward safe answers |

RLHF commonly uses reverse KL because we want to keep the policy from "making things up" relative to the reference: assigning high probability where the reference thinks the output is essentially impossible.
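A toy numeric comparison (made-up three-token distributions, just a sketch) shows the asymmetry: a policy that puts real mass on a token the reference considers nearly impossible is hit harder by reverse KL than by forward KL.

import numpy as np

p_ref   = np.array([0.498, 0.498, 0.004])   # reference: two strong modes, one near-impossible token
p_theta = np.array([0.50,  0.10,  0.40])    # policy: invents a lot of mass on the rare token

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

print("forward KL  D(p_ref || p_theta) =", round(kl(p_ref, p_theta), 3))
print("reverse KL  D(p_theta || p_ref) =", round(kl(p_theta, p_ref), 3))
# Reverse KL comes out roughly twice as large here, driven almost entirely by the 0.40 vs 0.004 term.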


4) DPO: Folding “RLHF + KL” into an Offline Contrastive Loss (Engineering-Friendly)

DPO's position in the book is clear: it does no online rollouts and trains no separate RM; it does gradient descent directly on preference pairs. At the same time, it corresponds to a closed-form optimal solution of the KL-constrained RLHF objective (given the data and $\beta$).

DPO (Bradley–Terry form) core loss:

$$L_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})= -\mathbb{E}_{(x,y_c,y_r)\sim D}\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_c\mid x)}{\pi_{\text{ref}}(y_c\mid x)} - \beta \log\frac{\pi_\theta(y_r\mid x)}{\pi_{\text{ref}}(y_r\mid x)}\right)\right]$$

It can also be interpreted as learning an “implicit reward” (log-ratio structure).

Minimal Working DPO Pseudocode

# batch of (x, y_chosen, y_rejected)
lc = sum_token_logp(pi_theta, x, y_chosen)
lr = sum_token_logp(pi_theta, x, y_rejected)
lcref = sum_token_logp(pi_ref, x, y_chosen)
lrref = sum_token_logp(pi_ref, x, y_rejected)

delta = beta * ((lc - lcref) - (lr - lrref))
loss = -log(sigmoid(delta)).mean()
loss.backward(); opt.step()
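The same loss in a slightly more runnable form (a sketch assuming the summed per-sequence logprobs above are already PyTorch tensors of shape [B]; names mirror the pseudocode):

import torch
import torch.nn.functional as F

def dpo_loss(lc, lr, lcref, lrref, beta=0.1):
    """DPO loss from summed sequence logprobs; also returns the implicit reward margin for logging."""
    delta = beta * ((lc - lcref) - (lr - lrref))
    loss = -F.logsigmoid(delta).mean()
    margin = delta.detach().mean()          # chosen-vs-rejected implicit reward margin, a handy health metric
    return loss, margin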

The book especially emphasizes a common misconception: DPO appears to be "directly training the policy," but essentially it is still learning a reward structure (hence the "Your LM is secretly a reward model" framing).

DPO’s Implicit Reward Interpretation

From DPO’s derivation, we can extract the implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x)$$

where $Z(x)$ is the partition function. This means:

  - $Z(x)$ depends only on the prompt $x$, so it cancels whenever two completions of the same prompt are compared; the $\beta$-scaled policy/reference log-ratio alone acts as the learned reward.
  - You never need to materialize an explicit reward model: the policy and the reference already define one implicitly, which is the sense in which the LM is "secretly a reward model."

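A small monitoring sketch that follows from this (PyTorch-style, illustrative names): because $\beta \log Z(x)$ is shared by all completions of a prompt, the implicit reward can rank chosen vs rejected without ever materializing a reward model.

import torch

def implicit_reward(logp_policy, logp_ref, beta):
    """DPO implicit reward up to the prompt-only constant beta * log Z(x); inputs are summed sequence logprobs [B]."""
    return beta * (logp_policy - logp_ref)

def preference_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the implicit reward ranks the chosen completion above the rejected one."""
    return (r_chosen > r_rejected).float().mean().item()

Implicit-reward accuracy on a held-out preference set is a cheap sanity check for a DPO checkpoint.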

5) GRPO: Like PPO, But Using Group Comparison to Bypass Value Function

GRPO (used in DeepSeekMath and other work) can be viewed as a PPO-style surrogate loss, but it avoids training a value function by sampling multiple completions per prompt and using within-group normalization to estimate the advantage.

GRPO objective (group-aggregated):

$$J(\theta)=\frac{1}{G}\sum_{i=1}^{G}\Big[\min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,1-\epsilon,1+\epsilon)A_i\big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta\,\Vert\,\pi_{\text{ref}}\big)\Big],\qquad A_i=\frac{r_i-\operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}$$

Minimal Working GRPO Pseudocode

# For each prompt x, sample G completions
y_group = [sample(pi_old, x) for _ in range(G)]    # G completions per prompt
r_group = [reward(x, y) for y in y_group]          # G rewards

# Group-level advantage normalization
mu = mean(r_group)
std = std(r_group) + 1e-8
advantages = [(r - mu) / std for r in r_group]     # z-score within group

# PPO-style clipped loss for each completion, then average over the group
losses = []
for y, A in zip(y_group, advantages):
    logp_new = logprob(pi_theta, x, y)
    logp_old = logprob(pi_old, x, y)
    ratio = exp(logp_new - logp_old)

    pg1 = -A * ratio
    pg2 = -A * clip(ratio, 1-eps, 1+eps)
    pg_loss = max(pg1, pg2)                        # element-wise max = negated clipped surrogate

    kl = logp_new - logprob(pi_ref, x, y)          # KL goes straight into the loss (not the reward)
    losses.append(mean(pg_loss + beta * kl))

loss = mean(losses); loss.backward(); opt.step()

The book also notes an implementation detail: GRPO commonly adds the KL term directly to the loss (rather than folding it into the reward first), which differs from the traditional PPO setup.
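One way that "KL directly in the loss" is often implemented in open-source training code (a sketch; this uses the low-variance, non-negative "k3" estimator rather than the plain log-ratio, and the book's text above does not specify which estimator to use):

import torch

def kl_penalty_k3(logp_theta, logp_ref):
    """Per-token estimate of KL(pi_theta || pi_ref) from logprob tensors [B, L].

    k3 estimator: exp(d) - d - 1 with d = logp_ref - logp_theta; non-negative per token
    and unbiased under samples from pi_theta. Add it to the loss with weight beta.
    """
    d = logp_ref - logp_theta
    return torch.exp(d) - d - 1.0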

GRPO vs PPO: Why Use Group Comparison?

| Aspect | PPO | GRPO |
|---|---|---|
| Baseline | Value function $V(s)$ | Within-group mean $\bar{r}$ |
| Extra model | Needs a trained critic | Not needed |
| Memory overhead | High (value head) | Low |
| Variance | Controlled via GAE | Depends on group size $G$ |
| Use case | Complex multi-step decisions | Bandit-style (single generation) |

6) Preference Data: The Most Powerful Fuel, Also the Most Hidden Bias Amplifier

The book dedicates a section to “bias in data collection,” calling out prefix bias, sycophancy, verbosity, formatting, etc.—these often aren’t written in labeling guidelines but get learned very firmly by models.

Common Data Bias Types

| Bias Type | Manifestation | Consequence |
|---|---|---|
| Length bias | Longer answers are more likely to be chosen | Model becomes verbose, information density drops |
| Format bias | Markdown/lists are more likely to win | Over-formatting, bullet points even for simple questions |
| Sycophancy | Agreeing with the user is more likely to be chosen | Model becomes "pleasing," afraid to correct errors |
| Position bias | First/last option is more likely to be chosen | Evaluation results become unstable |
| Verbosity ≠ quality | Detailed ≠ correct | Reward hacking |

My engineering conclusion: For preference pair data, the difference often isn’t in quantity but in whether these biases are systematically addressed (e.g., UI display, labeling workflow, length control, penalties for “flattery/fluff”).

Real case: with an internal judge, we found that longer, more template-like answers won more easily, which caused the information density of the DPO-trained model's outputs to drop; we had to add length-control / information-density constraints to recover.
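A quick diagnostic you can run before training (a minimal sketch; the 'chosen'/'rejected' field names are a hypothetical schema): measure how often the preferred answer is simply the longer one.

from typing import Dict, List

def length_bias_rate(pairs: List[Dict[str, str]]) -> float:
    """Fraction of preference pairs where the chosen answer is strictly longer than the rejected one."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return longer / max(len(pairs), 1)

# Usage sketch: a rate far above 0.5 is a hint to length-balance the data
# or add an explicit length penalty before running DPO on it.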


7) Evaluation: Why “All Scores Are Rising” But You Still Don’t Dare Ship

The book’s attitude toward evaluation is realistic: evaluation evolves with training objectives, and prompt/format can take the same model’s performance from “okay” to “near zero” (extremely sensitive).

The Triple Dilemma of Evaluation

  1. Prompt sensitivity: Same model, different prompt template, scores can differ 20%+
  2. Metric gaming: Optimizing benchmark scores ≠ real capability improvement
  3. Distribution shift: Training distribution vs evaluation distribution vs real user distribution—all three inconsistent

Internal vs External Evaluation

| | Internal Evaluation | External Evaluation |
|---|---|---|
| Purpose | Hill-climbing, guiding iteration | Comparison, release decisions |
| Characteristics | Controllable variables, reproducible | Opaque configuration, high error |
| Risk | Overfitting the internal benchmark | Not reproducible, high noise |

LLM-as-a-Judge Engineering Tips

Although LLM judges are commonly used (including for generating preference data), a few configuration tricks help reduce variance:

# Common tricks to reduce variance
judge_config = {
    "temperature": 0,           # Deterministic output
    "max_tokens": 1,            # Only want score, not explanation
    "logprobs": True,           # Use logprob rather than argmax
}

# Position debiasing
score_AB = judge(response_A, response_B)
score_BA = judge(response_B, response_A)
final_score = (score_AB - score_BA) / 2  # Cancel position bias
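One way to use the `logprobs` setting above (a sketch; the exact response fields differ across providers): instead of taking the argmax rating token, average over the rating tokens' probabilities to get a smoother score.

import math

def expected_rating(token_logprobs):
    """Expected score from the judge's logprobs at the single rating position.

    token_logprobs: hypothetical mapping {token: logprob}, e.g. {"1": -3.2, ..., "5": -0.4}.
    """
    probs = {t: math.exp(lp) for t, lp in token_logprobs.items() if t in {"1", "2", "3", "4", "5"}}
    z = sum(probs.values()) or 1.0
    return sum(int(t) * p for t, p in probs.items()) / z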

8) Over-Optimization: Not an Occasional Bug, But a Default Risk

The book gives a definition I really like:

When you optimize hard on a proxy as if it were the target, the “true objective” first improves then degrades (classic Goodhart’s Law).

Typical Over-Optimization Symptoms

| Symptom | Cause | Mitigation |
|---|---|---|
| Fixed phrases | Certain phrases over-valued by the RM | Diversity regularization, entropy bonus |
| Repetition / hedging | Safe outputs score high | Penalize repeated n-grams |
| Sycophancy | Agreeing with the user scores high | Dedicated sycophancy detector |
| Excessive refusal | Refusing is safer than being wrong | Balance helpfulness vs harmlessness |
| Length gaming | Long answers score high | Length penalty term |

Guardrails Aren’t Decoration—They’re Survival Necessities

total_loss = (
    rl_loss                              # Main objective
    + beta * kl_penalty                  # Don't deviate too far from reference
    + sft_coef * sft_loss                # Maintain language capability
    + length_coef * length_penalty       # Control length
    + entropy_coef * entropy_bonus       # Maintain diversity
    + format_coef * format_penalty       # Format constraints
)

9) One-Page Engineering Checklist: What to Monitor When Running PPO/DPO/GRPO

Training Side (Observable)

| Metric | Focus | Alert Threshold |
|---|---|---|
| mean_reward | Is it continuously rising | Sudden drop or saturation |
| reward_std | Is the distribution healthy | Too small (collapse) or too large (instability) |
| kl_divergence | Deviation from the reference | > 10-15 is usually problematic |
| clip_fraction | PPO clip trigger rate | > 30% may mean the learning rate is too high |
| entropy | Output diversity | Continuous decline = collapse |
| grad_norm | Gradient health | Sudden spike = instability |
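Most of the training-side metrics above can be computed from logprob tensors you already have in the update step; a minimal sketch (PyTorch-style, illustrative names, padding assumed masked):

import torch

@torch.no_grad()
def rollout_stats(logp_new, logp_old, logp_ref, eps=0.2):
    """Cheap per-batch diagnostics from [B, L] logprob tensors."""
    ratio = torch.exp(logp_new - logp_old)
    clip_fraction = ((ratio < 1 - eps) | (ratio > 1 + eps)).float().mean()
    approx_kl = (logp_new - logp_ref).mean()       # MC estimate of KL(pi_theta || pi_ref)
    entropy_proxy = -logp_new.mean()               # entropy of the sampled tokens (a proxy, not full entropy)
    return {"clip_fraction": clip_fraction.item(),
            "kl": approx_kl.item(),
            "entropy": entropy_proxy.item()}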

Evaluation Side (Reproducible)

Data Side (Bias Governance)


10) Algorithm Selection Guide

                    Have online environment?
                              |
                   +----------+----------+
                   |                     |
                  Yes                    No
                   |                     |
              Have value head           DPO
              training budget?     (offline preference pairs)
                   |
           +-------+-------+
           |               |
          Yes              No
           |               |
          PPO            GRPO
       (full RL)    (group comparison)

When to Choose Which?

| Scenario | Recommended Algorithm | Reasoning |
|---|---|---|
| Lots of preference pairs, want fast iteration | DPO | Simple implementation, no rollouts needed |
| Have verifier/env feedback, sufficient resources | PPO | Most flexible, can do multi-step optimization |
| Have a verifier, but memory-constrained | GRPO | No value head needed |
| Math/code with correct answers | GRPO | Verifier easy to define |
| Open-ended generation, subjective preferences | DPO | Leverages human preference data |

Final 3 Takeaways (Easy to Remember)

  1. PPO’s value is stable updates: ratio + clipping prevents “aggressive optimizer” from blowing up the model.

  2. DPO folds RLHF into offline contrastive learning: Simple implementation, fast iteration, but learning space is bounded by offline data.

  3. GRPO replaces value function with group comparison: Lower memory / fewer components, but you need to take reward design and group sampling strategy more seriously.


Appendix: Objective Function to Implementation Reference Table

| | PPO | GRPO | DPO |
|---|---|---|---|
| Objective | $\mathbb{E}[\min(r A, \operatorname{clip}(r) A)]$ | $\mathbb{E}[\min(r A, \operatorname{clip}(r) A)] - \beta\,\mathrm{KL}$ | $-\mathbb{E}[\log\sigma(\beta \Delta)]$ |
| Advantage | $A = G - V(s)$ (GAE) | $A = (r - \mu)/\sigma$ (within-group) | Implicit (log-ratio) |
| KL handling | Folded into reward or loss | Added directly to the loss | Implicit in the loss structure |
| Needs rollout | Yes | Yes | No |
| Needs value head | Yes | No | No |
| Needs reference | Yes | Yes | Yes |
| Data source | Online sampling | Online sampling ($G$ per prompt) | Offline preference pairs |
| Key hyperparams | $\epsilon$, $\gamma$, `vf_coef` | $\epsilon$, $G$, $\beta$ | $\beta$ |

References

  1. Lambert, N. “Reinforcement Learning from Human Feedback.” (2024)
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv:1707.06347 (2017)
  3. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023
  4. Shao, Z., et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300 (2024)
  5. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” NeurIPS 2022
