Start Here

A short map of the main themes on this site.

This site is about the practical side of ML and AI systems: how models, agents, training pipelines, evaluations, and product interfaces behave when they leave the paper and enter a real engineering loop.

The fastest entry point is to pick a path below, then follow the linked topic hub when you want the full archive.

Reading Paths

LLM Agents

Tool use, agent runtime design, evaluation, context, and production patterns for systems that act across tools and environments.

How to Arbitrarily Increase the Difficulty of Agent Evaluation Sets
A practical framework for making agent benchmarks harder in a controlled way: treat difficulty as trajectory-graph complexity, not prompt wording. Covers deterministic scoring, capability facets, harness effects, and systematic data generation.
Improving LLM Internationalization: Bridging the Gap in Tool Use and Agency
LLMs achieve 57% tool-calling accuracy in English but only 34% across 52 languages, and 6.8% for the worst-performing language. This post covers the full playbook for closing the multilingual gap: training-time techniques, agentic architecture patterns, failure mode analysis, and RL-based approaches for i18n.
Experience-Augmented In-Context Learning: A Training-Free Complement to RL Post-Training
RL post-training makes models smarter, but it can't cover the infinite long tail of real-world cases. Experience-augmented ICL retrieves successful reasoning traces at inference time, letting agents learn continuously from real usage — no retraining required.

Evaluation

Practical approaches to measuring model and agent capability with deterministic checks, rubrics, trajectories, and verifiable outcomes.

How to Arbitrarily Increase the Difficulty of Agent Evaluation Sets
A practical framework for making agent benchmarks harder in a controlled way: treat difficulty as trajectory-graph complexity, not prompt wording. Covers deterministic scoring, capability facets, harness effects, and systematic data generation.
The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models
A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs
Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.

Post-Training

SFT, RLHF, preference optimization, instruction following, reasoning traces, and data pipelines for shaping model behavior after pretraining.

The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models
A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs
Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.
Instruction Following: What Models Get Wrong and How to Fix It with Better Post-Training Data
LLMs can write poetry and solve math, but ask them to 'respond in exactly 3 bullet points using only lowercase' and they stumble. This post dissects the taxonomy of instruction-following failures and provides a practical playbook for building post-training data that actually fixes them.

RLHF and Preference Optimization

Engineering notes and research synthesis on PPO, DPO, GRPO, reward modeling, preference data, and model behavior optimization.

The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models
A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs
Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.
Adding Ads in LLM/Chatbot: Character Training for Monetization
Exploring how to integrate ads in LLMs through character training—making recommendations genuinely helpful rather than annoyingly promotional.

Generative UI

How AI systems can produce, steer, and execute user interfaces with structured representations and practical product constraints.

Ad Formats in LLM Products: What's Live vs. What's Research
A survey of advertising formats in LLM products—separating what's deployed in production from what remains in research.
Generative UI Doesn't Move the Needle—Steering Does
After shipping multiple generative UI features, I've concluded that the sophistication of AI-generated interfaces often doesn't translate to user benefit—but steering does.
UI Representation and Action Execution for Generative UI
Exploring structured UI representation using JSON Schema, and how to implement action handlers for generative UI systems.