Tag: LLM
All articles tagged "LLM".
-
The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models
· 22 min read
A $10B AI data vendor was breached, exposing 84 Airtable workspaces of training data for OpenAI, Anthropic, Apple, Amazon, and Meta. This post analyzes what the public reporting reveals about each lab's evaluation methodology — rubric design, RLHF pipelines, and quality control — and what it means for the industry.
-
Improving LLM Internationalization: Bridging the Gap in Tool Use and Agency
· 17 min read
LLMs achieve 57% tool-calling accuracy in English but only 34% across 52 languages — and just 6.8% for the worst-performing language. This post covers the full playbook for closing the multilingual gap: training-time techniques, agentic architecture patterns, failure mode analysis, and RL-based approaches for i18n.
-
The Unverifiable Reward Problem: The Real Frontier of RL for LLMs
· 11 min read
Deep research on tasks with unverifiable rewards in RL — the key bottleneck for scaling RL beyond math and code. Covers JEPO, NRT, RLNVR, self-play methods, GenRM, Constitutional AI, reward hacking mitigation, and more.
-
Instruction Following: What Models Get Wrong and How to Fix It with Better Post-Training Data
· 36 min read
LLMs can write poetry and solve math, but ask them to 'respond in exactly 3 bullet points using only lowercase' and they stumble. This post dissects the taxonomy of instruction-following failures and provides a practical playbook for building post-training data that actually fixes them.
-
Experience-Augmented In-Context Learning: A Training-Free Complement to RL Post-Training
· 23 min read
RL post-training makes models smarter, but it can't cover the infinite long tail of real-world cases. Experience-augmented ICL retrieves successful reasoning traces at inference time, letting agents learn continuously from real usage — no retraining required.
-
Tool Selection Optimization for LLM Agents at Scale
· 18 min read
A deep technical dive into tool selection: retrieval strategies, context optimization, learned selection, and the engineering trade-offs that matter when scaling to hundreds of tools.