Tag: LLM
All articles tagged "LLM".
-
The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models
· 22 min read
A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
-
Improving LLM Internationalization: Bridging the Gap in Tool Use and Agency
· 17 min read
LLMs achieve 57% tool-calling accuracy in English but only 34% across 52 languages — and just 6.8% for the worst-performing language. This post covers the full playbook for closing the multilingual gap: training-time techniques, agentic architecture patterns, failure mode analysis, and RL-based approaches for i18n.
-
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs
· 11 min read
Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.
-
Instruction Following: What Models Get Wrong and How to Fix It with Better Post-Training Data
· 36 min read
LLMs can write poetry and solve math, but ask them to 'respond in exactly 3 bullet points using only lowercase' and they stumble. This post dissects the taxonomy of instruction-following failures and provides a practical playbook for building post-training data that actually fixes them.
-
Experience-Augmented In-Context Learning: A Training-Free Complement to RL Post-Training
· 23 min read
RL post-training makes models smarter, but it can't cover the infinite long tail of real-world cases. Experience-augmented ICL retrieves successful reasoning traces at inference time, letting agents learn continuously from real usage — no retraining required.
-
Tool Selection Optimization for LLM Agents at Scale
· 18 min read
A deep technical dive into tool selection: retrieval strategies, context optimization, learned selection, and the engineering trade-offs that matter when scaling to hundreds of tools.