The Mercor Breach: What 4TB of Stolen Data Reveals About How Frontier AI Labs Actually Train Models

22 min read

In March 2026, Mercor — a $10 billion AI recruiting and data-labeling startup — was breached via a supply-chain attack. The hacking group Lapsus$ claimed 4TB of stolen data, including 84 Airtable workspaces containing the actual training data, evaluation rubrics, and preference annotations produced for OpenAI, Anthropic, Apple, Amazon, Meta, and Google DeepMind.

This post is not about the breach itself. It’s about what the publicly reported analysis of the stolen data reveals about how frontier AI labs actually build their post-training pipelines — the rubric design patterns, evaluation methodologies, and quality control architectures that are normally invisible.

All information in this post comes from publicly available security research reports and news articles, not from the stolen data itself.



Background: Why a Data Vendor Breach Matters More Than a Model Breach

Most AI security discussions focus on model weights — can someone steal your checkpoint? But the Mercor breach exposed something arguably more valuable: the methodology.

Mercor sits at the center of the AI data supply chain. It recruits domain experts (doctors, lawyers, engineers, Math Olympiad winners) at ~$95/hour to produce training data, evaluation rubrics, and preference annotations.

Six of the “Magnificent Seven” tech companies plus frontier labs OpenAI and Anthropic were clients. A single breach exposed all of them simultaneously — not because they shared infrastructure, but because they shared a vendor.

As Y Combinator president Garry Tan put it: “Incredible amount of SOTA training data now just available to China thanks to @mercor_ai leak. Every major lab. Billions and billions of value and a major national security issue.”


The Attack: Supply-Chain Compromise in 3 Phases

The breach followed a cascading supply-chain attack:

| Phase | Date | Target | Method |
|---|---|---|---|
| 1 | Mar 19 | Trivy (security scanner) | Exploited `pull_request_target` in GitHub Actions, stole aqua-bot PAT, force-pushed malicious commits to 76 release tags |
| 2 | Mar 24 | LiteLLM (AI proxy library, ~97M monthly downloads, present in 36% of cloud environments) | Used Trivy-stolen credentials to hijack the PyPI publishing token; pushed malicious versions 1.82.7 and 1.82.8 (live for ~40 minutes) |
| 3 | Mar 24+ | Mercor | Poisoned LiteLLM dependency landed in a dev environment; malware swept SSH keys, AWS tokens, K8s secrets; exfiltrated data via Tailscale VPN to `models.litellm[.]cloud` |

The malicious LiteLLM 1.82.8 used a .pth file — a Python path configuration file that executes automatically when the interpreter starts. No explicit import needed. The moment a developer opened an IDE or ran pip, the payload was already running.
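The mechanism is easy to demonstrate harmlessly: CPython's `site` module `exec()`s any line in a `.pth` file that begins with `import`. A minimal benign sketch, using a temporary directory and `site.addsitedir()` to trigger the same processing that normally happens at interpreter startup:

```python
import site
import sys
import tempfile
from pathlib import Path

# Throwaway directory standing in for site-packages.
tmp = Path(tempfile.mkdtemp())

# Any .pth line starting with "import " is exec()'d when the
# directory is processed -- no explicit import by the victim needed.
(tmp / "demo.pth").write_text("import sys; sys.pth_payload_ran = True\n")

# addsitedir() runs the same .pth processing as interpreter startup.
site.addsitedir(str(tmp))

print(getattr(sys, "pth_payload_ran", False))  # prints: True
```

In the real attack the executed line would fetch and run a stager instead of setting a flag; the point is that code review focused on explicit `import` statements in application source never sees this execution path.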


What Was Stolen: The Full Inventory

| Asset | Size | Contents |
|---|---|---|
| Production Database | 211 GB | 250+ Aurora MySQL tables — contractor PII, interview transcripts, client project configs |
| Source Code | 939 GB | Complete GitHub org including `mercor-monorepo`, hardcoded API keys, Terraform configs |
| Cloud Storage | ~3 TB | Video interviews, desktop screenshots, passport/ID scans, signed legal docs |
| Airtable Export | Included | 84 workspaces, 1,055 JSONL files — the actual annotation tasks, rubrics, model outputs, and human evaluations |
| Slack Export | Included | Full enterprise Slack workspace + client-specific workspaces |
| Tailscale VPN Data | Included | Internal network topology, device certificates |

The Airtable export is where the training methodology lives. Each workspace follows a standardized schema:

TASKS / TASK_VERSIONS    — the annotation tasks themselves
CRITERIA                 — evaluation criteria definitions
RUBRIC_VERSIONS          — scoring framework iterations
QA_SPECS                 — quality control specifications
LLM_CALL_CONFIGURATION   — model selection, temperature, sampling params
DOMAIN / SUBDOMAIN       — domain taxonomy
WORKFLOW                 — task routing logic
CONTROL_PANEL            — pipeline control parameters
TALENT                   — which experts are assigned to which tasks

This is not a dataset. It is the complete blueprint for an industrial-scale annotation pipeline.
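The 1,055 JSONL files presumably serialize rows from tables like these. A hypothetical sketch of loading and joining two such exports — the field names (`task_id`, `criterion`, `weight`) are illustrative assumptions, not taken from the leak:

```python
import json

def load_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical rows mirroring the TASKS and CRITERIA tables above.
tasks = load_jsonl([
    '{"task_id": "t1", "domain": "legal", "prompt": "Draft a clause..."}',
])
criteria = load_jsonl([
    '{"task_id": "t1", "criterion": "cites correct statute", "weight": 2}',
])

# Join each task with its rubric criteria, as a rubric-driven
# annotation pipeline would before presenting the task to a grader.
by_task = {t["task_id"]: {**t, "criteria": []} for t in tasks}
for c in criteria:
    by_task[c["task_id"]]["criteria"].append(c)
```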


Lab-by-Lab: What the Reporting Reveals About Each Company’s Evaluation Methodology

OpenAI: Self-Bootstrapping + Tournament Ranking

Annotation Platform: OpenAI uses a proprietary internal tool called Feather (feather.openai.com), organized by campaign UUIDs. Mercor contractors worked directly inside OpenAI’s tooling.

Key methodological insight — LLM-as-autograder:

The TaskDefinitions table configured openai/gpt-4.1 and openai/gpt-5 as autograders for human-produced annotation data. This reveals a bootstrapping strategy: use a stronger model to grade the output that will train the next generation of models. It’s an efficiency play — reduce the volume of expensive human review while maintaining quality signal — but it also means OpenAI’s training pipeline has a dependency on its own model quality for quality control.

Data purity rules:

Rubrics contained explicit constraints like “LLMs other than ChatGPT are prohibited.” This tells us OpenAI is concerned about cross-model contamination — they don’t want Claude’s or Gemini’s stylistic patterns leaking into their training data through contractors who might use competing tools to draft responses.

Ranking system — Bradley-Terry tournament:

The PairwiseComparisons table implements a classic Bradley-Terry model for ranking candidate outputs:

For each comparison:
  - winnerResumeId / loserResumeId
  - reasoning (LLM-generated explanation of why A > B)

Accumulated into:
  - numComparisons → mScoreRaw → mScoreNormalized

This is the same statistical framework used in chess Elo ratings, adapted for ranking model outputs. The LLM-generated reasoning for each comparison likely serves dual purpose: quality control (can a reviewer verify the judgment?) and potential training signal (the reasoning itself could be used for reward model training).
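For concreteness, here is a minimal Bradley-Terry fit over (winner, loser) pairs using the standard MM update — a sketch of the statistical idea only; the exact mapping to `mScoreRaw`/`mScoreNormalized` is not public:

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    MM update: p_i <- wins_i / sum_j(n_ij / (p_i + p_j))."""
    if not comparisons:
        return {}
    wins = defaultdict(int)
    games = defaultdict(int)  # games[(a, b)], a < b: matches between a and b
    items = set()
    for w, l in comparisons:
        wins[w] += 1
        games[(min(w, l), max(w, l))] += 1
        items |= {w, l}

    p = {i: 1.0 for i in items}  # initial strengths
    for _ in range(iters):
        new = {}
        for i in items:
            denom = sum(
                n / (p[i] + p[b if a == i else a])  # opponent's strength
                for (a, b), n in games.items()
                if i in (a, b)
            )
            new[i] = wins[i] / denom if denom else p[i]
        scale = len(items) / sum(new.values())  # normalize mean strength to 1
        p = {i: v * scale for i, v in new.items()}
    return p
```

With comparisons `[("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"), ("A", "C")]`, the fitted strengths rank A above B above C — the same kind of global ordering the `numComparisons → mScoreRaw → mScoreNormalized` aggregation produces.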

Quality control: Three-layer architecture — AI autograder → human review of edge cases → contractor dispute mechanism via TaskAudits.dispute.


Anthropic: Systematic Preference Evaluation + Constitutional AI Feedback

Core methodology: preference-centric RLHF/DPO

Anthropic’s pipeline is organized around structured preference comparisons, consistent with their published Constitutional AI and RLHF research. Multiple API_PREFERENCE workspaces (including V2 and personal copies for individual team members) contain:

| Table | Purpose |
|---|---|
| PROMPTS | Standardized input prompt collection |
| RESPONSES | Multiple model outputs for the same prompt |
| ROLES | Evaluator personas/perspectives to adopt |
| DOMAINS | Domain categorization (technical, creative, safety, etc.) |
| PROMPT_TEMPLATES | Reusable prompt templates |
| QA | Quality assurance checklists |

The ROLES table is interesting. It suggests Anthropic asks evaluators to judge outputs from different perspectives — possibly related to their Constitutional AI approach, where principles are evaluated from multiple ethical/practical viewpoints.

Head-to-head model comparison: GPT-4 vs Claude

A dedicated “GPT-4 vs Claude Evaluation” project compared Claude 3.5 Sonnet against GPT-4 across use cases. Each comparison included the prompt, both responses, and the human preference judgment with reasoning. This is essentially Anthropic’s competitive intelligence pipeline — systematically mapping where Claude wins and loses against GPT-4, then using that data to close gaps.

The existence of multiple workspace versions (API_PREFERENCE, API_PREFERENCE_V2, API_PREFERENCE__COPY__FOR_BRENDAN, API_PREF___KANIX) suggests this framework is under rapid iteration with individual researchers maintaining working copies.

Agent evaluation: AgentSandboxes records with agentType: claude show Anthropic was evaluating agentic capabilities through Mercor, with full conversation transcripts stored.


Apple: Multi-Model Orchestration + Evaluation Automation

Apple’s exposure was arguably the most surprising — pre-release model outputs from unreleased Apple Intelligence models.

Model versions and inference parameters:

The APPLE_ENDPOINT_SANDBOX workspace tested three model versions:

| Model ID | Role |
|---|---|
| afm-text-083 | Text generation (earlier version) |
| afm-model-085 | Text generation (intermediate) |
| afm-model-086 | Orchestrator model |

Sampling parameters: temperature=0.7, top_p=0.9 — relatively standard nucleus sampling, suggesting Apple prioritizes response diversity over determinism in their evaluation setup.
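As a reference point for what those parameters do: nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, then renormalizes and samples within that set. A minimal sketch:

```python
import math
import random

def nucleus_sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index with temperature scaling + top-p filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Keep the smallest high-probability prefix with cumulative mass >= top_p.
    probs.sort(key=lambda t: t[1], reverse=True)
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize over the nucleus and draw a sample.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

Lower `top_p` or temperature pushes the sampler toward determinism; `temperature=0.7, top_p=0.9` keeps meaningful diversity, which is what you want when collecting varied candidate outputs for human evaluation.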

Four-dimensional evaluation matrix:

| Airtable Table | Evaluation Dimension |
|---|---|
| TEXT | General text generation quality |
| DEEP_L | Translation capability (English → Spanish) |
| TEXT_ORCHESTRATOR | Routing/orchestration decisions |
| RUBRIC_AUTO_GEN | Automated rubric generation |

Two architectural insights emerge:

  1. TEXT_ORCHESTRATOR confirms Apple Intelligence uses a multi-model orchestration architecture — a routing model (afm-model-086) decides which sub-model handles each request. This is consistent with Apple’s on-device/cloud split architecture but reveals they’re also orchestrating between cloud-side models.

  2. RUBRIC_AUTO_GEN shows Apple is investing in evaluation automation — using AI to generate the evaluation criteria themselves. This is a meta-level capability: if you can automate rubric creation, you can scale evaluation across new domains without proportionally scaling human rubric designers.


Amazon: Chain-of-Thought Quality Analysis

Amazon’s methodology is distinctive in its focus on evaluating reasoning quality, not just final answers.

AMAZON_LLM_COT_EVALUATION workspace structure:

| Table | Purpose |
|---|---|
| DOMAINS | Evaluation categories (math, STEM, etc.) |
| PHASE_1_TASKS | Model A vs Model B with complete CoT traces |
| PHASE_1_REVIEWS | Human reviews of CoT quality |
| MODEL_A_STRENGTHS | Structured recording of each model's reasoning advantages |
| TALENT | Evaluator (domain expert) management |

What makes this different: Most RLHF preference data captures “which response is better” as a holistic judgment. Amazon’s pipeline decomposes this — evaluators assess the reasoning chain itself, not just the conclusion. Each task includes complete Chain-of-Thought traces, final responses, and preference judgments. The MODEL_A_STRENGTHS table suggests they’re building a structured ontology of reasoning capabilities, not just accumulating preference labels.

The PHASE_1_TASKS / PHASE_1_REVIEWS naming implies a multi-stage evaluation funnel — likely Phase 1 broad screening → Phase 2 deep analysis → final determination.


Meta: Multimodal Annotation

Less was directly exposed about Meta’s methodology, but the AAIE___META_MULTIMEDIA_TEMPLATE_COMMAND_CENTER workspace (containing OVERALL_META, PROJECTS, FORMS, TEMPLATE tables) confirms Meta’s annotation work through Mercor involved multimodal data — not just text. Meta has since indefinitely paused its relationship with Mercor.


Google DeepMind: Benchmark Evaluation

GDM was confirmed as a Mercor client by the Wall Street Journal, but direct evidence in the leaked samples was limited. The most likely connection is through the Athena HLE (Humanity’s Last Exam) workspaces — four versions of ATHENA_HLE__STEM_ (including dated copies from July 2025) with MODEL_RESPONSES and AWAITING_REVIEW_METRICS tables, indicating an active human review pipeline for one of the most important frontier model benchmarks.


Current Data Collection Priorities: What OpenAI and Anthropic Are Betting On

The breach evidence doesn’t exist in a vacuum. When we cross-reference the exposed Mercor project data with each lab’s public product roadmap and recent model releases, a picture emerges of where the data investment is going right now.

OpenAI: Reasoning, Agentic Code, and Self-Improving Evaluation

OpenAI’s current trajectory — visible through the o3/o4-mini releases, public statements, and the Mercor project data — points to three converging data priorities:

1. Reasoning and Chain-of-Thought data

The o3 and o4-mini models are trained using large-scale reinforcement learning on chains of thought [13]. o4-mini achieves 99.5% pass@1 on AIME 2025 (with Python interpreter access), and o3 sets new state-of-the-art on Codeforces. These results require massive volumes of verified reasoning traces — problems with objectively correct answers where the reasoning path can be checked. The AIME_RUBRICS workspace in the Mercor data (math competition rubrics) and ACADEMIC_REASONING_SFT (with an explicit COT table for Chain-of-Thought supervision) align directly with this priority.

OpenAI Chief Scientist Jakub Pachocki has outlined the target: AI Research Interns by September 2026, fully autonomous researchers by March 2028 [17]. Getting there requires training data that captures multi-step problem decomposition — not just “what’s the right answer” but “what’s the right sequence of reasoning steps, and how do you recover from mistakes mid-chain.”

2. Agentic coding and tool use

The Mercor TaskDefinitions table references an “Agentic Code Final QC Audit” project focused on AI code generation quality control for GitHub issue solving [1]. This is SWE-bench-style data: given a real GitHub issue, can the model produce a correct patch? OpenAI recently discontinued SWE-bench Verified after finding models were reproducing gold patches verbatim due to training data contamination [15] — a strong signal they need fresh, high-quality agentic coding data from human experts.

For the first time with o3, OpenAI’s reasoning models can agentically combine every tool in ChatGPT — web search, Python, image reasoning, image generation — in a single reasoning chain [13]. Training this requires demonstration data showing how and when to invoke tools, plus evaluations of tool-use decisions.

3. Self-bootstrapping evaluation at scale

Perhaps the most strategically interesting signal: OpenAI uses openai/gpt-4.1 and openai/gpt-5 as autograders in TaskDefinitions. Combined with the constraint “LLMs other than ChatGPT are prohibited” in rubrics, this reveals a closed-loop data strategy:

Human experts produce SFT/RLHF data
  → GPT-5 autogrades the submissions (quality control)
  → High-quality data trains next generation
  → Next-gen model becomes the new autograder

This is data flywheel design: each generation of models improves the efficiency of producing training data for the next generation. The risk is obvious (errors compound across generations), which is why the human expert layer and dispute mechanism remain in the loop. But the direction is clear — OpenAI is investing heavily in reducing the marginal cost of high-quality data through model-assisted evaluation.

Where Mercor’s Math Olympiad pipeline fits: OpenAI’s relationship with Mercor began when Mercor’s CEO cold-emailed OpenAI’s head of human data operations and landed a contract to recruit Math Olympiad winners for model training [10]. This is the prototypical “verifiable reward” domain — math has objectively correct answers and checkable reasoning chains, making it ideal for RL training of reasoning models.


Anthropic: Adversarial Safety, Agentic Alignment, and Competitive Positioning

Anthropic’s data investment reflects a dual mandate: push capability (especially agentic) while maintaining their safety-first positioning.

1. Adversarial safety and alignment evaluation

Anthropic’s February 2026 Risk Report [18] frames the current threat model around sabotage attempts, reasoning faithfulness, and evaluation-aware model behavior.

This translates to a specific data need: red-teaming and adversarial evaluation datasets. The Anthropic Fellows Program for 2026 [23] lists research areas including scalable oversight, adversarial robustness, AI control, and model organisms — all of which require carefully crafted evaluation scenarios where the “right” behavior is non-obvious.

The API_PREFERENCE workspaces in the Mercor data — with their ROLES (evaluator personas) and DOMAINS (evaluation dimensions) tables — likely served this purpose. Constitutional AI requires preference data from multiple ethical/practical perspectives, not just “which response sounds better.”

2. Long-horizon agentic capabilities

Claude Opus 4.6 (released February 2026) shows where Anthropic is pushing hardest [20] [21]:

| Capability | Opus 4.5 → 4.6 | Improvement |
|---|---|---|
| ARC-AGI-2 | 37.6% → 68.8% | +83% |
| BrowseComp | 67.8% → 84.0% | +24% |
| OSWorld | 66.3% → 72.7% | +10% |
| Terminal-Bench 2.0 | 59.8% → 65.4% | +9% |

The biggest gains are in agentic and long-horizon tasks — browsing, operating systems, terminal interaction. New features include Agent Teams (multiple agents in parallel), context compaction (enabling longer multi-step runs), and programmatic tool calling (agents writing code that calls tools, reducing latency) [20].

Training these capabilities requires data that public benchmarks can’t provide: complete long-horizon interaction transcripts, not just final-answer labels.

The AgentSandboxes table in the Mercor data — running agentType: claude with full transcript storage — was likely generating exactly this kind of agentic training data.

3. Systematic competitive benchmarking

The “GPT-4 vs Claude Evaluation” project in the Mercor data reveals a practice that’s likely more common than any lab would publicly admit: systematically comparing your model against competitors to identify and close gaps. The preference data structure — same prompt, both responses, human judgment with reasoning — is designed to produce targeted training signal: not “make Claude generally better” but “make Claude better specifically where GPT-4 currently wins.”

This is essentially an adversarial capability transfer strategy: use human evaluators to identify the delta between your model and the competition, then convert that delta into targeted training data. The multiple versions of the workspace (API_PREFERENCE_V2, copies for individual researchers) suggest this is an ongoing, iterative process rather than a one-time benchmark.


The Bigger Picture: Where the Industry Is Heading

Both labs’ data strategies converge on several themes:

| Trend | OpenAI Signal | Anthropic Signal |
|---|---|---|
| Reasoning verification | RL on chains of thought; Math Olympiad data; AIME rubrics | Reasoning faithfulness evaluation; steganography detection |
| Agentic capabilities | “Agentic Code Final QC Audit”; o3 multi-tool reasoning | Agent Teams; computer use; Terminal-Bench gains |
| Self-improving evaluation | GPT-5 as autograder; closed-loop data flywheel | Scalable oversight research; automated behavioral audits |
| Adversarial robustness | SWE-bench contamination detection → SWE-bench Pro | Sabotage detection; evaluation awareness testing |
| Domain expert data | Math Olympiad winners; coding experts at $95/hr | Constitutional AI annotators; red-team specialists |

The common thread: the easy data is exhausted. Both labs are moving beyond generic internet text and crowdsourced preferences toward expert-produced data in domains where quality matters enormously and is hard to fake — mathematical reasoning, agentic code execution, safety-critical evaluation, and multi-step planning. Major AI labs each spend approximately **$1 billion annually** on human-generated training data [[24]](https://www.pin.com/blog/ai-labs-hiring-train-models), with specialist compensation ranging from $15/hr for entry-level annotators to $500+/hr for domain experts. The Mercor relationship was valuable precisely because Mercor could supply specialists (doctors, lawyers, competitive programmers) who could produce data at the frontier of model capabilities.


Cross-Cutting Patterns: What All Labs Share

Across 84 Airtable workspaces, several universal patterns emerge in how frontier labs structure their evaluation pipelines:

1. Rubric Design: Three Core Principles

The leaked rubrics (particularly from Mercor’s own APEX benchmark suite, which they’ve partially open-sourced) reveal a shared rubric design philosophy. Tasks carry a mean of ~4 criteria each, with a range of 1–10.

2. Quality Control: Four-Layer Architecture

Layer 1: LLM autograder         — bulk initial screening
Layer 2: Lead reviewer          — human review of edge cases and low-confidence auto-grades
Layer 3: Double-blind QA        — independent re-evaluation for calibration
Layer 4: Dispute resolution     — contractor appeals mechanism

Tables supporting this: QA_SPECS, LEAD_AUDIT_QA, DOUBLE_BLIND, REVIEWER_ASSESSMENT, TaskAudits.dispute.
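The four layers compose as a fall-through pipeline. A sketch under stated assumptions — the confidence threshold, agreement tolerance, and averaging rule are illustrative, not from the leak:

```python
def qc_pipeline(task, autograder, lead_reviewer, blind_reviewer):
    """Run a task through the four QC layers; each layer either
    settles the grade or escalates to the next."""
    record = {"task": task, "audit_trail": []}

    # Layer 1: LLM autograder does bulk screening.
    score, confidence = autograder(task)
    record["audit_trail"].append(("autograder", score))
    if confidence >= 0.9:                     # assumed threshold
        record["grade"] = score
        return record

    # Layer 2: lead reviewer handles edge cases / low-confidence grades.
    score = lead_reviewer(task)
    record["audit_trail"].append(("lead_reviewer", score))

    # Layer 3: independent double-blind re-grade for calibration.
    blind = blind_reviewer(task)
    record["audit_trail"].append(("double_blind", blind))
    if abs(blind - score) <= 0.1:             # assumed agreement tolerance
        record["grade"] = (score + blind) / 2
        return record

    # Layer 4: unresolved disagreement goes to dispute resolution.
    record["grade"] = None
    record["status"] = "dispute"
    return record
```

The design choice worth noting: every layer appends to an audit trail, which is exactly what tables like `LEAD_AUDIT_QA` and `TaskAudits.dispute` imply — the pipeline keeps who graded what, not just the final grade.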

3. Aggressive Version Control

The same rubric appears in 12+ dated copies spanning August 2025 through January 2026. Every workspace table has a version field. This means the labs are continuously iterating evaluation criteria — what “good” means is a moving target, refined through months of annotation experience.

This has a practical implication: any competitor who obtains a single snapshot of the rubrics gets the current state, but misses the iteration trajectory — the sequence of refinements that encode hard-won lessons about what criteria actually discriminate between good and bad model behavior.

4. Domain Expert Routing

TALENT tables appear in nearly every workspace. Contractors are routed by domain expertise:

| Workspace | Domain | Specialist Type |
|---|---|---|
| APEX_LEGAL | Legal reasoning | Lawyers |
| BEAR_MEDICINE | Medical annotation | Physicians, radiologists |
| APEX_FINANCE | Financial analysis | Finance professionals |
| AIME_RUBRICS | Mathematical reasoning | Math competition participants |
| ACADEMIC_REASONING_SFT | Academic reasoning | Researchers |

The BEAR_MEDICINE workspace has its own DISCIPLINES, PODS (team structures), WRITER_DAILY_ACTIVITY, and REVIEWER_STATS tables — a self-contained annotation operation with per-person productivity tracking.

5. Benchmark Isolation (Now Compromised)

Evaluation-focused workspaces (ATHENA_HLE, AIME_RUBRICS, APEX_*) were organizationally separated from SFT/RLHF training workspaces. This separation is standard practice to prevent benchmark contamination — if evaluation data leaks into training data, benchmark scores become meaningless.

The breach destroyed this separation. All APEX benchmark tasks, criteria, gold-standard answers, and historical evaluation data are now available to any buyer. Any model trained on this data will score artificially high on APEX benchmarks. The EVALS workspace — containing APEX_RESULTS, BOREALIS_RESULTS, and LUCIUS_RESULTS — confirms these benchmarks were actively used for model comparison, making the contamination risk concrete.
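The standard defense against this kind of contamination is overlap auditing: checking whether candidate training documents share long n-grams with benchmark items. A crude illustrative sketch (the n-gram length and whitespace tokenization are simplifying assumptions):

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a text (case-insensitive)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_doc, benchmark_items, n=8):
    """Fraction of benchmark items sharing at least one n-gram
    with the candidate training document."""
    doc_grams = ngrams(train_doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & doc_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

Real audits use suffix arrays or MinHash to scale this to terabyte corpora, but the principle is the same — which is also why leaked gold answers are so damaging: once they circulate, every downstream training corpus has to be screened against them.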


The Screenshot Problem: Cascading Secondary Breach

One underappreciated dimension: Mercor required contractors to install the Insightful monitoring agent, which captured desktop screenshots every few minutes during work sessions. Each screenshot was stored on S3 with metadata including a `projectId` tying it to the client project.

Because contractors worked directly inside client systems (OpenAI’s Feather platform, client Airtable workspaces, Slack channels), these screenshots are effectively visual records of client internal tools and data. An attacker can filter screenshots by projectId to systematically extract visual intelligence about any client’s internal systems.

This means the breach is not just of Mercor — it’s a proxy breach of every client whose internal tools were visible on a contractor’s screen.


Implications for the Industry

1. Methodology > Data

The most valuable thing leaked isn’t the training data — it’s the evaluation frameworks. The CRITERIA, QA_SPECS, and LLM_CALL_CONFIGURATION tables encode how each lab defines “what good AI output looks like.” This is the real competitive moat: anyone can collect prompt-response pairs, but knowing which criteria actually discriminate quality from mediocrity is years of expensive iteration.

2. Single Vendor = Single Point of Failure

Mercor simultaneously held training data, evaluation rubrics, and contractor work product for competing labs. When they were breached, everyone was exposed at once. This is the AI industry’s version of the SolarWinds problem — shared infrastructure creates correlated risk.

3. The Benchmark Contamination Cascade

Every APEX benchmark is now suspect. Any model evaluated against APEX after this breach could have been trained on leaked APEX data. Unless Mercor rebuilds the entire suite from scratch with new tasks and criteria, APEX results are meaningless. The same risk extends to any evaluation data in the 84 workspaces.

4. Biometric Data Is Irrevocable

Over 30,000 contractors had their video interviews, passport scans, and facial biometric data stolen. Unlike passwords, biometric data cannot be reset. These individuals face permanent risk of deepfake impersonation and identity verification fraud.

5. The Open-Source Dependency Surface

The attack succeeded because a widely-used Python package (LiteLLM, present in 36% of cloud environments) was compromised for ~40 minutes. The AI industry’s dependency on open-source tooling creates an attack surface that no single company controls. The .pth file technique — auto-executing on Python interpreter startup — bypasses any code review that focuses on explicit imports.


What Changes Now

Meta has indefinitely paused its work with Mercor. OpenAI and Anthropic are investigating. A class action lawsuit is underway.

But the structural question remains: will the industry respond by diversifying its data vendor relationships and hardening supply-chain security, or will it simply find the next hot vendor and repeat the pattern?

The Mercor breach is a reminder that in AI, the training pipeline is at least as valuable as the trained model — and considerably less protected.


References

Breach Analysis and Reporting

  1. Anatomy of Mercor’s Data Breach — Technical analysis of leaked database schema and Airtable workspaces (primary source for lab-specific methodology details)
  2. The CyberSec Guru: Inside the 4TB Lapsus$ Leak — Attack chain analysis (Trivy → LiteLLM → Mercor)
  3. Fortune: Mercor confirms major cybersecurity breach
  4. TNW: Meta freezes AI data work after breach — Training methodology exposure analysis, Meta pause, industry impact
  5. gentic.news: Expert Human Annotation Pipeline Exposed — Impact on Constitutional AI and RLHF
  6. LiveMint: OpenAI, Anthropic contractor targeted
  7. Business Insider: Meta Pauses Work With Mercor
  8. Tech Startups: Mercor confirms breach
  9. ClaimDepot: Mercor class action lawsuit
  10. SF Standard: San Francisco’s youngest billionaires — How the OpenAI–Mercor relationship began (Math Olympiad recruitment)

Mercor Official Resources

  11. Mercor APEX-Agents Benchmark — Official benchmark dataset with rubric design details (CC-BY 4.0)
  12. Mercor: Types of Data — Mercor’s documentation on SFT, RLHF, agentic, and evaluation data workflows

OpenAI — Models and Data Strategy

  13. Introducing OpenAI o3 and o4-mini — RL on chains of thought; agentic multi-tool reasoning; AIME/Codeforces results
  14. o3 and o4-mini System Card — Training approach: large-scale RL on chains of thought, data pipelines, filtering
  15. Why SWE-bench Verified no longer measures frontier coding — Benchmark contamination discovery; models reproducing gold patches verbatim
  16. Inside OpenAI’s in-house data agent — GPT-5.2/Codex-powered data management across 600PB, 70K datasets
  17. OpenAI’s Roadmap to Autonomous AI Researchers — Jakub Pachocki’s targets: AI Research Interns by Sep 2026, full autonomy by Mar 2028

Anthropic — Models and Safety Strategy

  18. Anthropic February 2026 Risk Report — Sabotage threats, reasoning faithfulness, steganography evaluation, ASL-3 deployment
  19. Claude Opus 4.6 System Card — Adaptive thinking, automated behavioral audits, evaluation awareness testing
  20. What’s new in Claude 4.6 — Agent Teams, context compaction, programmatic tool calling, 1M context
  21. Claude Opus 4.6 vs 4.5 Benchmarks — ARC-AGI-2 +83%, BrowseComp +24%, OSWorld +10%, Terminal-Bench +9%
  22. Developing a computer use model — Screenshot interpretation, pixel-level cursor positioning, OSWorld 14.9%
  23. Anthropic Fellows Program 2026 — Research areas: scalable oversight, adversarial robustness, AI control, model organisms, interpretability

Industry Context

  24. How AI Labs Are Hiring People to Train Models — $1B+ annual spend per lab on human data; $15/hr–$500+/hr annotator range
  25. The Changing Landscape of AI Data Labeling Hiring (2026) — Shift from crowdsourced to domain-expert annotation
