RLHF and Preference Optimization

Engineering notes and research synthesis on PPO, DPO, GRPO, reward modeling, preference data, and model behavior optimization.

Focus Areas

PPO, DPO, and GRPO
preference data pipelines
reward modeling
creative and unverifiable rewards

Recommended Posts

Reproducing CompactRL: What Worked, What Failed, and Why We Did Not Scale

Updated: 24 Jul, 2026 · 16 min read

An auditable CompactRL reproduction spanning the public algorithm, a 96-step long-horizon simulation, integration with slime, real Qwen actor-critic training, value-function fixes, 17 experimental phases, and the evidence that stopped us from scaling.

From Long CoT to Agent Swarms: The Documented Evolution of Kimi's Reinforcement Learning

18 Jul, 2026 · 15 min read

A source-grounded history of Kimi's reinforcement-learning stack, from Kimi k1.5's long-context outcome RL and partial rollouts to K2's general RL and K2.5's multimodal GRMs and Parallel-Agent RL.

Training the Critic Without Crashing the Reward: A Practical Guide to Agentic RL

12 Jul, 2026 · 20 min read

A practical framework for critic training and credit assignment in long-horizon LLM agents: IQL, pairwise advantage, hindsight and counterfactual critics, privileged information, turn-level MDPs, chain-of-thought monitoring, and reward-crash diagnosis.

From GRPO Outcome Rewards to Token-Level Advantage

8 Jul, 2026 · 20 min read

A practical framework for turning GRPO-style sequence rewards into token-level advantages, including GAE-style estimators, credit assignment routes, and multi-reward training design.

Scaling RL for White-Collar Work: The Environment Foundry

3 Jul, 2026 · 20 min read

A practical framework for turning common white-collar workflows into RL environments: spreadsheets, CRM tasks, customer support, web research, dashboards, and other software-mediated work.

The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models

12 Apr, 2026 · 22 min read

A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology (rubric design, RLHF pipelines, and quality control) and what it means for the industry.

The Unverifiable Reward Problem: The Real Frontier of RL for LLMs

7 Mar, 2026 · 11 min read

Deep research on RL tasks with unverifiable rewards, the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.

Adding Ads in LLM/Chatbot: Character Training for Monetization

1 Jan, 2026 · 4 min read

Exploring how to integrate ads in LLMs through character training, so that recommendations are genuinely helpful rather than annoyingly promotional.

Post-Training Is Not 'One Algorithm': Objective Functions and Implementation Essentials for PPO / DPO / GRPO

30 Dec, 2025 · 12 min read

Reading notes on RLHF covering PPO, DPO, and GRPO, treating post-training as an engineering pipeline rather than a single algorithm.

RLHF from an Engineering Perspective: PPO, GRPO, DPO, and Tool-Use Implementation

30 Dec, 2025 · 12 min read

A practical engineering guide to RLHF implementation—covering PPO, GRPO, DPO, and tool-use training with code snippets and debugging tips.
