
Post-Training Is Not 'One Algorithm': Objective Functions and Implementation Essentials for PPO / DPO / GRPO

· 12 min read

Reading notes from Nathan Lambert’s “Reinforcement Learning from Human Feedback (RLHF).” The book helped me build a clearer mental model of post-training—not as a single algorithm, but as an engineering pipeline: data → reward proxy → optimization → evaluation → guardrails.

Notation: prompt is $x$, completion is $y$, policy model is $\pi_\theta(y\mid x)$, reference model is $\pi_{\text{ref}}$, reward/scoring function is $r(x,y)$.


0) A “Production-Ready” Pipeline Diagram

SFT -> Preference data (pairs / rankings) -> RM/Judge/Verifier -> Optimize (PPO/GRPO/DPO/...) -> Eval & Iterate
                       |                                          |
                       +-- guardrails (KL / NLL / length / ...) --+

Key insight: What you actually want is usable behavior in production; but what you can optimize is often just some proxy objective, so the system naturally evolves toward “needing guardrails.”


1) Canonical RLHF: Reward - KL (What the Objective Function Looks Like)

The book consistently uses this form of RLHF: maximize expected reward over the data distribution while using KL to pull the policy back toward the reference model (style/distribution constraint).

$$\max_{\pi}\ \mathbb{E}_{x\sim D}\,\mathbb{E}_{y\sim \pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\,\Vert\, \pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

Intuition: $\beta$ is the knob for "how far you dare to stray from SFT"; it is not just a mathematical term, it is more like the master switch for "capability vs style/stability" in your product.
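To make the knob concrete, here is a minimal sketch (PyTorch-style, illustrative names; per-sequence rewards and per-token logprobs with padding already masked) of how the regularized objective turns into a scalar training signal:

import torch

def kl_regularized_reward(reward, logp_theta, logp_ref, beta=0.1):
    """Reward minus beta * KL: the shape of the canonical RLHF objective.

    reward:   [B]    scalar reward per completion
    logp_*:   [B, L] per-token logprobs under the policy / reference
    """
    kl_per_seq = (logp_theta - logp_ref).sum(dim=-1)   # MC estimate of KL(pi_theta || pi_ref)
    return reward - beta * kl_per_seq                  # [B], maximize this in expectation

Raising beta drags the result back toward the reference model's behavior; lowering it lets the reward dominate.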


2) PPO: The Core Isn’t “Being Stronger”—It’s That Clipping Makes Updates More Stable

The key to PPO is the importance ratio plus clipping: for the same batch of data sampled by the old policy $\pi_{\theta_\text{old}}$, the ratio corrects for the mismatch when updating the new policy, and the clip limits the magnitude of any single update, preventing training collapse when the ratio deviates too far from 1.

Classic PPO clipped surrogate:

$$J(\theta)=\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)A_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\big)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}$$

In language models, the common implementation is per-token (easier to compute with logprobs).

Minimal Working PPO Pseudocode (LM Version)

# given prompts x
y = sample(pi_old, x)                         # rollout / completion
r = reward_model_or_verifier(x, y)            # scalar per sequence (or per step)

logp_new = logprob(pi_theta, x, y)            # [B, L]
logp_old = logprob(pi_old,   x, y)            # [B, L]
ratio = exp(logp_new - logp_old)              # [B, L]

A = compute_advantage(r, baseline=V(x,y) or batch_norm)  # token- or seq-level
pg = -mean(min(ratio*A, clip(ratio, 1-eps, 1+eps)*A))

kl = mean(logp_new - logprob(pi_ref, x, y))   # MC estimate of reverse KL
loss = pg + beta*kl + vf_coef*value_loss + other_regularizers
loss.backward(); opt.step()

In engineering, the easiest pitfalls aren't the formulas; they are how the advantage is defined (token- vs sequence-level), whether the KL term goes into the reward or the loss, and how many gradient steps you take per batch of rollouts. These choices significantly change what training actually produces.
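To make those choices concrete, here is a hedged sketch (PyTorch-style, illustrative names; assumes no trailing padding) of the two most common KL placements:

import torch

@torch.no_grad()
def kl_shaped_rewards(reward, logp_theta, logp_ref, beta):
    """Variant A: fold the KL penalty into the per-token reward before advantage estimation.

    Computed on rollout logprobs, so no gradient is needed here.
    """
    per_token_kl = logp_theta - logp_ref                # [B, L]
    shaped = -beta * per_token_kl                       # penalty at every token
    shaped[:, -1] = shaped[:, -1] + reward              # scalar reward credited at the final token
    return shaped                                       # feed into returns / GAE as usual

def kl_in_loss(pg_loss, logp_theta, logp_ref, beta):
    """Variant B: keep the reward clean and add beta * KL as an explicit loss term."""
    kl = (logp_theta - logp_ref).mean()                 # MC estimate of KL(pi_theta || pi_ref)
    return pg_loss + beta * kl

Token-level advantages give every token its own signal, while sequence-level advantages reuse one scalar across the whole completion; the two variants can behave quite differently at the same beta.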


3) KL Regularization: The “Numerical Preference” of Reverse KL and Implementation

The book treats KL control as one of the core guardrails of post-training and explains the commonly used reverse KL to the reference: when the new policy assigns high probability in regions where the reference model has low probability, it gets heavily penalized (numerically more "conservative").

Implementation often uses sampling to approximate the expectation:

$$D_{\mathrm{KL}}(P\,\Vert\, Q)=\mathbb{E}_{x\sim P}\big[\log P(x)-\log Q(x)\big]$$

This also explains why many systems can compute the KL term by just calculating logprobs twice (no explicit sum over the vocabulary is needed).
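A minimal sketch of that "two logprob passes" trick (PyTorch-style, illustrative names): score the sampled tokens under both models and average the difference, with padding masked out.

import torch

@torch.no_grad()
def reverse_kl_estimate(logp_theta, logp_ref, mask):
    """MC estimate of KL(pi_theta || pi_ref) on sampled completions.

    logp_theta, logp_ref, mask: [B, L]; only the sampled tokens' logprobs are needed,
    so there is no sum over the vocabulary.
    """
    diff = (logp_theta - logp_ref) * mask
    return diff.sum() / mask.sum()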

Forward KL vs Reverse KL Intuition

| | Forward KL $D_{\mathrm{KL}}(P_{\text{ref}} \Vert P_\theta)$ | Reverse KL $D_{\mathrm{KL}}(P_\theta \Vert P_{\text{ref}})$ |
|---|---|---|
| Penalizes | $P_\theta$ assigning low probability where $P_{\text{ref}}$ is high | $P_\theta$ assigning high probability where $P_{\text{ref}}$ is low |
| Behavioral tendency | Mode-covering (tries to cover all modes) | Mode-seeking (tends to converge to a single mode) |
| Practical effect | More "exploratory," may produce strange outputs | More "conservative," tends toward safe answers |

RLHF commonly uses reverse KL because we want to keep the policy from "making things up" relative to the reference: assigning high probability where the reference thinks the output is essentially impossible.
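A toy numeric comparison (made-up three-token distributions, just a sketch) shows the asymmetry: a policy that puts real mass on a token the reference considers nearly impossible is hit harder by reverse KL than by forward KL.

import numpy as np

p_ref   = np.array([0.498, 0.498, 0.004])   # reference: two strong modes, one near-impossible token
p_theta = np.array([0.50,  0.10,  0.40])    # policy: invents a lot of mass on the rare token

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

print("forward KL  D(p_ref || p_theta) =", round(kl(p_ref, p_theta), 3))
print("reverse KL  D(p_theta || p_ref) =", round(kl(p_theta, p_ref), 3))
# Reverse KL comes out roughly twice as large here, driven almost entirely by the 0.40 vs 0.004 term.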


4) DPO: Folding “RLHF + KL” into an Offline Contrastive Loss (Engineering-Friendly)

DPO's position in the book is clear: it does no online rollouts and trains no separate RM; it does gradient descent directly on preference pairs. At the same time, it corresponds to a closed-form optimal solution of the KL-constrained RLHF objective (given the data and $\beta$).

DPO (Bradley–Terry form) core loss:

$$L_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})= -\mathbb{E}_{(x,y_c,y_r)\sim D}\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_c\mid x)}{\pi_{\text{ref}}(y_c\mid x)} - \beta \log\frac{\pi_\theta(y_r\mid x)}{\pi_{\text{ref}}(y_r\mid x)}\right)\right]$$

It can also be interpreted as learning an “implicit reward” (log-ratio structure).

Minimal Working DPO Pseudocode

# batch of (x, y_chosen, y_rejected)
lc = sum_token_logp(pi_theta, x, y_chosen)
lr = sum_token_logp(pi_theta, x, y_rejected)
lcref = sum_token_logp(pi_ref, x, y_chosen)
lrref = sum_token_logp(pi_ref, x, y_rejected)

delta = beta * ((lc - lcref) - (lr - lrref))
loss = -log(sigmoid(delta)).mean()
loss.backward(); opt.step()
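The same loss in a slightly more runnable form (a sketch assuming the summed per-sequence logprobs above are already PyTorch tensors of shape [B]; names mirror the pseudocode):

import torch
import torch.nn.functional as F

def dpo_loss(lc, lr, lcref, lrref, beta=0.1):
    """DPO loss from summed sequence logprobs; also returns the implicit reward margin for logging."""
    delta = beta * ((lc - lcref) - (lr - lrref))
    loss = -F.logsigmoid(delta).mean()
    margin = delta.detach().mean()          # chosen-vs-rejected implicit reward margin, a handy health metric
    return loss, margin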

The book especially emphasizes a common misconception: DPO appears to be "directly training the policy," but essentially it is still learning a reward structure (hence the "Your LM is secretly a reward model" framing).

DPO’s Implicit Reward Interpretation

From DPO’s derivation, we can extract the implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x)$$

where $Z(x)$ is the partition function. This means:

  - $Z(x)$ depends only on the prompt $x$, so it cancels whenever two completions of the same prompt are compared; the $\beta$-scaled policy/reference log-ratio alone acts as the learned reward.
  - You never need to materialize an explicit reward model: the policy and the reference already define one implicitly, which is the sense in which the LM is "secretly a reward model."

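A small monitoring sketch that follows from this (PyTorch-style, illustrative names): because $\beta \log Z(x)$ is shared by all completions of a prompt, the implicit reward can rank chosen vs rejected without ever materializing a reward model.

import torch

def implicit_reward(logp_policy, logp_ref, beta):
    """DPO implicit reward up to the prompt-only constant beta * log Z(x); inputs are summed sequence logprobs [B]."""
    return beta * (logp_policy - logp_ref)

def preference_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the implicit reward ranks the chosen completion above the rejected one."""
    return (r_chosen > r_rejected).float().mean().item()

Implicit-reward accuracy on a held-out preference set is a cheap sanity check for a DPO checkpoint.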

5) GRPO: Like PPO, But Using Group Comparison to Bypass Value Function

GRPO (used in DeepSeekMath and other work) can be viewed as a PPO-style surrogate loss, but it avoids training a value function by sampling multiple completions per prompt and using within-group normalization to estimate the advantage.

GRPO objective (group-aggregated):

$$J(\theta)=\frac{1}{G}\sum_{i=1}^{G}\Big[\min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,1-\epsilon,1+\epsilon)A_i\big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta\,\Vert\,\pi_{\text{ref}}\big)\Big],\qquad A_i=\frac{r_i-\operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}$$

Minimal Working GRPO Pseudocode

# For each prompt x, sample G completions
y_group = [sample(pi_old, x) for _ in range(G)]    # G completions per prompt
r_group = [reward(x, y) for y in y_group]          # G rewards

# Group-level advantage normalization
mu = mean(r_group)
std = std(r_group) + 1e-8
advantages = [(r - mu) / std for r in r_group]     # z-score within group

# PPO-style clipped loss for each completion, then average over the group
losses = []
for y, A in zip(y_group, advantages):
    logp_new = logprob(pi_theta, x, y)
    logp_old = logprob(pi_old, x, y)
    ratio = exp(logp_new - logp_old)

    pg1 = -A * ratio
    pg2 = -A * clip(ratio, 1-eps, 1+eps)
    pg_loss = max(pg1, pg2)                        # element-wise max = negated clipped surrogate

    kl = logp_new - logprob(pi_ref, x, y)          # KL goes straight into the loss (not the reward)
    losses.append(mean(pg_loss + beta * kl))

loss = mean(losses); loss.backward(); opt.step()

The book also notes an implementation detail: GRPO commonly adds the KL term directly to the loss (rather than folding it into the reward first), which differs from the traditional PPO setup.
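One way that "KL directly in the loss" is often implemented in open-source training code (a sketch; this uses the low-variance, non-negative "k3" estimator rather than the plain log-ratio, and the book's text above does not specify which estimator to use):

import torch

def kl_penalty_k3(logp_theta, logp_ref):
    """Per-token estimate of KL(pi_theta || pi_ref) from logprob tensors [B, L].

    k3 estimator: exp(d) - d - 1 with d = logp_ref - logp_theta; non-negative per token
    and unbiased under samples from pi_theta. Add it to the loss with weight beta.
    """
    d = logp_ref - logp_theta
    return torch.exp(d) - d - 1.0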

GRPO vs PPO: Why Use Group Comparison?

| Aspect | PPO | GRPO |
|---|---|---|
| Baseline | Value function $V(s)$ | Within-group mean $\bar{r}$ |
| Extra model | Needs a trained critic | Not needed |
| Memory overhead | High (value head) | Low |
| Variance | Controlled via GAE | Depends on group size $G$ |
| Use case | Complex multi-step decisions | Bandit-style (single generation) |

6) Preference Data: The Most Powerful Fuel, Also the Most Hidden Bias Amplifier

The book dedicates a section to “bias in data collection,” calling out prefix bias, sycophancy, verbosity, formatting, etc.—these often aren’t written in labeling guidelines but get learned very firmly by models.

Common Data Bias Types

| Bias Type | Manifestation | Consequence |
|---|---|---|
| Length bias | Longer answers are more likely to be chosen | Model becomes verbose, information density drops |
| Format bias | Markdown/lists are more likely to win | Over-formatting, bullet points even for simple questions |
| Sycophancy | Agreeing with the user is more likely to be chosen | Model becomes "pleasing," afraid to correct errors |
| Position bias | First/last option is more likely to be chosen | Evaluation results become unstable |
| Verbosity ≠ quality | Detailed ≠ correct | Reward hacking |

My engineering conclusion: For preference pair data, the difference often isn’t in quantity but in whether these biases are systematically addressed (e.g., UI display, labeling workflow, length control, penalties for “flattery/fluff”).

Real case: with an internal judge, we found that longer, more template-like answers won more easily, which caused the information density of the DPO-trained model's outputs to drop; we had to add length-control / information-density constraints to recover.
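A quick diagnostic you can run before training (a minimal sketch; the 'chosen'/'rejected' field names are a hypothetical schema): measure how often the preferred answer is simply the longer one.

from typing import Dict, List

def length_bias_rate(pairs: List[Dict[str, str]]) -> float:
    """Fraction of preference pairs where the chosen answer is strictly longer than the rejected one."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return longer / max(len(pairs), 1)

# Usage sketch: a rate far above 0.5 is a hint to length-balance the data
# or add an explicit length penalty before running DPO on it.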


7) Evaluation: Why “All Scores Are Rising” But You Still Don’t Dare Ship

The book’s attitude toward evaluation is realistic: evaluation evolves with training objectives, and prompt/format can take the same model’s performance from “okay” to “near zero” (extremely sensitive).

The Triple Dilemma of Evaluation

  1. Prompt sensitivity: Same model, different prompt template, scores can differ 20%+
  2. Metric gaming: Optimizing benchmark scores ≠ real capability improvement
  3. Distribution shift: Training distribution vs evaluation distribution vs real user distribution—all three inconsistent

Internal vs External Evaluation

| | Internal Evaluation | External Evaluation |
|---|---|---|
| Purpose | Hill-climbing, guiding iteration | Comparison, release decisions |
| Characteristics | Controllable variables, reproducible | Opaque configuration, high error |
| Risk | Overfitting the internal benchmark | Not reproducible, high noise |

LLM-as-a-Judge Engineering Tips

Although LLM judges are commonly used (including for generating preference data), a few configuration tricks help reduce variance:

# Common tricks to reduce variance
judge_config = {
    "temperature": 0,           # Deterministic output
    "max_tokens": 1,            # Only want score, not explanation
    "logprobs": True,           # Use logprob rather than argmax
}

# Position debiasing
score_AB = judge(response_A, response_B)
score_BA = judge(response_B, response_A)
final_score = (score_AB - score_BA) / 2  # Cancel position bias
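One way to use the `logprobs` setting above (a sketch; the exact response fields differ across providers): instead of taking the argmax rating token, average over the rating tokens' probabilities to get a smoother score.

import math

def expected_rating(token_logprobs):
    """Expected score from the judge's logprobs at the single rating position.

    token_logprobs: hypothetical mapping {token: logprob}, e.g. {"1": -3.2, ..., "5": -0.4}.
    """
    probs = {t: math.exp(lp) for t, lp in token_logprobs.items() if t in {"1", "2", "3", "4", "5"}}
    z = sum(probs.values()) or 1.0
    return sum(int(t) * p for t, p in probs.items()) / z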

8) Over-Optimization: Not an Occasional Bug, But a Default Risk

The book gives a definition I really like:

When you optimize hard on a proxy as if it were the target, the “true objective” first improves then degrades (classic Goodhart’s Law).

Typical Over-Optimization Symptoms

| Symptom | Cause | Mitigation |
|---|---|---|
| Fixed phrases | Certain phrases over-valued by the RM | Diversity regularization, entropy bonus |
| Repetition / hedging | Safe outputs score high | Penalize repeated n-grams |
| Sycophancy | Agreeing with the user scores high | Dedicated sycophancy detector |
| Excessive refusal | Refusing is safer than being wrong | Balance helpfulness vs harmlessness |
| Length gaming | Long answers score high | Length penalty term |

Guardrails Aren’t Decoration—They’re Survival Necessities

total_loss = (
    rl_loss                              # Main objective
    + beta * kl_penalty                  # Don't deviate too far from reference
    + sft_coef * sft_loss                # Maintain language capability
    + length_coef * length_penalty       # Control length
    + entropy_coef * entropy_bonus       # Maintain diversity
    + format_coef * format_penalty       # Format constraints
)

9) One-Page Engineering Checklist: What to Monitor When Running PPO/DPO/GRPO

Training Side (Observable)

| Metric | Focus | Alert Threshold |
|---|---|---|
| mean_reward | Is it continuously rising | Sudden drop or saturation |
| reward_std | Is the distribution healthy | Too small (collapse) or too large (instability) |
| kl_divergence | Deviation from the reference | > 10-15 is usually problematic |
| clip_fraction | PPO clip trigger rate | > 30% may mean the learning rate is too high |
| entropy | Output diversity | Continuous decline = collapse |
| grad_norm | Gradient health | Sudden spike = instability |
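Most of the training-side metrics above can be computed from logprob tensors you already have in the update step; a minimal sketch (PyTorch-style, illustrative names, padding assumed masked):

import torch

@torch.no_grad()
def rollout_stats(logp_new, logp_old, logp_ref, eps=0.2):
    """Cheap per-batch diagnostics from [B, L] logprob tensors."""
    ratio = torch.exp(logp_new - logp_old)
    clip_fraction = ((ratio < 1 - eps) | (ratio > 1 + eps)).float().mean()
    approx_kl = (logp_new - logp_ref).mean()       # MC estimate of KL(pi_theta || pi_ref)
    entropy_proxy = -logp_new.mean()               # entropy of the sampled tokens (a proxy, not full entropy)
    return {"clip_fraction": clip_fraction.item(),
            "kl": approx_kl.item(),
            "entropy": entropy_proxy.item()}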

Evaluation Side (Reproducible)

Data Side (Bias Governance)


10) Algorithm Selection Guide

                    Have online environment?
                              |
                   +----------+----------+
                   |                     |
                  Yes                    No
                   |                     |
              Have value head           DPO
              training budget?     (offline preference pairs)
                   |
           +-------+-------+
           |               |
          Yes              No
           |               |
          PPO            GRPO
       (full RL)    (group comparison)

When to Choose Which?

| Scenario | Recommended Algorithm | Reasoning |
|---|---|---|
| Lots of preference pairs, want fast iteration | DPO | Simple implementation, no rollouts needed |
| Have verifier/env feedback, sufficient resources | PPO | Most flexible, can do multi-step optimization |
| Have a verifier, but memory-constrained | GRPO | No value head needed |
| Math/code with correct answers | GRPO | Verifier easy to define |
| Open-ended generation, subjective preferences | DPO | Leverages human preference data |

Final 3 Takeaways (Easy to Remember)

  1. PPO’s value is stable updates: ratio + clipping prevents “aggressive optimizer” from blowing up the model.

  2. DPO folds RLHF into offline contrastive learning: Simple implementation, fast iteration, but learning space is bounded by offline data.

  3. GRPO replaces value function with group comparison: Lower memory / fewer components, but you need to take reward design and group sampling strategy more seriously.


Appendix: Objective Function to Implementation Reference Table

| | PPO | GRPO | DPO |
|---|---|---|---|
| Objective | $\mathbb{E}[\min(r A, \operatorname{clip}(r) A)]$ | $\mathbb{E}[\min(r A, \operatorname{clip}(r) A)] - \beta\,\mathrm{KL}$ | $-\mathbb{E}[\log\sigma(\beta \Delta)]$ |
| Advantage | $A = G - V(s)$ (GAE) | $A = (r - \mu)/\sigma$ (within-group) | Implicit (log-ratio) |
| KL handling | Folded into reward or loss | Added directly to the loss | Implicit in the loss structure |
| Needs rollout | Yes | Yes | No |
| Needs value head | Yes | No | No |
| Needs reference | Yes | Yes | Yes |
| Data source | Online sampling | Online sampling ($G$ per prompt) | Offline preference pairs |
| Key hyperparams | $\epsilon$, $\gamma$, `vf_coef` | $\epsilon$, $G$, $\beta$ | $\beta$ |

References

  1. Lambert, N. “Reinforcement Learning from Human Feedback.” (2024)
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv:1707.06347 (2017)
  3. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023
  4. Shao, Z., et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300 (2024)
  5. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” NeurIPS 2022
