ML & AI in Action

Practical insights on machine learning, AI systems, and building products that ship. Topics include RLHF, LLM optimization, recommender systems, and generative UI.

Topic Hubs

LLM Agents
Tool use, agent runtime design, evaluation, context, and production patterns for systems that act across tools and environments.
Evaluation
Practical approaches to measuring model and agent capability with deterministic checks, rubrics, trajectories, and verifiable outcomes.
Post-Training
SFT, RLHF, preference optimization, instruction following, reasoning traces, and data pipelines for shaping model behavior after pretraining.
RLHF and Preference Optimization
Engineering notes and research synthesis on PPO, DPO, GRPO, reward modeling, preference data, and model behavior optimization.
Generative UI
How AI systems can produce, steer, and execute user interfaces with structured representations and practical product constraints.

Featured

From Long CoT to Agent Swarms: The Documented Evolution of Kimi's Reinforcement Learning

18 Jul, 2026 · 15 min read

A source-grounded history of Kimi's reinforcement-learning stack, from Kimi k1.5's long-context outcome RL and partial rollouts to K2's general RL and K2.5's multimodal GRMs and Parallel-Agent RL.
Reproducing CompactionRL: From the GLM-5.2 Algorithm to a Live slime E2E

17 Jul, 2026 · 13 min read

A source-grounded reproduction of CompactionRL: the algorithm, missing recipe details, a 96-step multi-seed causal experiment, and a live Qwen actor-critic update through THUDM/slime on four A10 GPUs.
Training the Critic Without Crashing the Reward: A Practical Guide to Agentic RL

12 Jul, 2026 · 20 min read

A practical framework for critic training and credit assignment in long-horizon LLM agents: IQL, pairwise advantage, hindsight and counterfactual critics, privileged information, turn-level MDPs, chain-of-thought monitoring, and reward-crash diagnosis.
From GRPO Outcome Rewards to Token-Level Advantage

8 Jul, 2026 · 20 min read

A practical framework for turning GRPO-style sequence rewards into token-level advantages, including GAE-style estimators, credit assignment routes, and multi-reward training design.
Scaling RL for White-Collar Work: The Environment Foundry

3 Jul, 2026 · 20 min read

A practical framework for turning common white-collar workflows into RL environments: spreadsheets, CRM tasks, customer support, web research, dashboards, and other software-mediated work.
Optimizing Inference for Router Looped Transformers

Updated: 29 Jun, 2026 · 20 min read

A research note on serving router looped transformers: why normal KV cache semantics break, what latency data says so far, and how vLLM or SGLang could be adapted with route-template batching and virtual-step KV cache.
How to Test Pretraining Ideas at Small Scale Before Betting on a Large Model

29 Jun, 2026 · 25 min read

A practical guide to validating pretraining improvements with small proxy models, scaling ladders, isoFLOP budgets, loss curves, downstream evals, and rank-correlation checks before committing to an expensive large-model run.
Why Embeddings Cannot Solve Eval-Set Contamination

29 Jun, 2026 · 11 min read

A technical deep dive on why semantic embedding search is useful but insufficient for eval-set decontamination: leakage is about evaluation advantage, not just text similarity.
Pretraining Contamination: Why "Don't Train on the Test Set" Became Hard

29 Jun, 2026 · 14 min read

A practical introduction to LLM pretraining contamination: why benchmark leakage is not ordinary deduplication, how public evals leak into web-scale corpora, and how layered decontamination pipelines reduce risk.
How to Arbitrarily Increase the Difficulty of Agent Evaluation Sets

28 May, 2026 · 18 min read

A practical framework for making agent benchmarks harder in a controlled way: treat difficulty as trajectory-graph complexity, not prompt wording. Covers deterministic scoring, capability facets, harness effects, and systematic data generation.
The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models

12 Apr, 2026 · 22 min read

A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
Improving LLM Internationalization: Bridging the Gap in Tool Use and Agency

10 Mar, 2026 · 17 min read

LLMs achieve 57% tool-calling accuracy in English but only 34% across 52 languages, and 6.8% for the worst-performing language. This post covers the full playbook for closing the multilingual gap: training-time techniques, agentic architecture patterns, failure mode analysis, and RL-based approaches for i18n.
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs

7 Mar, 2026 · 11 min read

Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.
Instruction Following: What Models Get Wrong and How to Fix It with Better Post-Training Data

4 Mar, 2026 · 36 min read

LLMs can write poetry and solve math, but ask them to 'respond in exactly 3 bullet points using only lowercase' and they stumble. This post dissects the taxonomy of instruction-following failures and provides a practical playbook for building post-training data that actually fixes them.
Experience-Augmented In-Context Learning: A Training-Free Complement to RL Post-Training

28 Feb, 2026 · 23 min read

RL post-training makes models smarter, but it can't cover the infinite long tail of real-world cases. Experience-augmented ICL retrieves successful reasoning traces at inference time, letting agents learn continuously from real usage — no retraining required.
Tool Selection Optimization for LLM Agents at Scale

9 Jan, 2026 · 18 min read

A deep technical dive into tool selection—retrieval strategies, context optimization, learned selection, and the engineering trade-offs that matter when scaling to hundreds of tools.
Generative Engine Optimization (GEO): How to Get Your Product Cited by AI

2 Jan, 2026 · 13 min read

A comprehensive guide to Generative Engine Optimization—making your content retrievable, citable, and recommendable by large language models.
Generative UI Doesn't Move the Needle—Steering Does

31 Dec, 2025 · 10 min read

After shipping multiple generative UI features, I've concluded that the sophistication of AI-generated interfaces often doesn't translate to user benefit—but steering does.
Post-Training Is Not 'One Algorithm': Objective Functions and Implementation Essentials for PPO / DPO / GRPO

30 Dec, 2025 · 12 min read

Reading notes on RLHF covering PPO, DPO, and GRPO—understanding post-training as an engineering pipeline rather than a single algorithm.
RLHF from an Engineering Perspective: PPO, GRPO, DPO, and Tool-Use Implementation

30 Dec, 2025 · 12 min read

A practical engineering guide to RLHF implementation—covering PPO, GRPO, DPO, and tool-use training with code snippets and debugging tips.

Recent Posts

A Looped Transformer Router Shows Its First Replicated Gain

When a Looped Transformer Router Almost Works

Ad Formats in LLM Products: What's Live vs. What's Research

Adding Ads in LLM/Chatbot: Character Training for Monetization

What Worked (and What Didn't) When Training AEs and VAEs for Embedding Compression

User Interest Modeling with Transformer Architectures