In March 2026, Mercor — a claimed 4TB of stolen data, including 84 Airtable workspaces containing the actual training data, evaluation rubrics, and preference annotations produced for OpenAI, Anthropic, Apple, Amazon, Meta, and Google DeepMind.
This post is not about the breach itself. It’s about what the publicly reported analysis of the stolen data reveals about how frontier AI labs actually build their post-training pipelines — the rubric design patterns, evaluation methodologies, and quality control architectures that are normally invisible.
All information in this post comes from publicly available security research reports and news articles, not from the stolen data itself.
Table of Contents
Open Table of Contents
- Background: Why a Data Vendor Breach Matters More Than a Model Breach
- The Attack: Supply-Chain Compromise in 3 Phases
- What Was Stolen: The Full Inventory
- Lab-by-Lab: What the Reporting Reveals About Each Company’s Evaluation Methodology
- Current Data Collection Priorities: What OpenAI and Anthropic Are Betting On
- Cross-Cutting Patterns: What All Labs Share
- The Screenshot Problem: Cascading Secondary Breach
- Implications for the Industry
- What Changes Now
- References
Background: Why a Data Vendor Breach Matters More Than a Model Breach
Most AI security discussions focus on model weights — can someone steal your checkpoint? But the Mercor breach exposed something arguably more valuable: the methodology.
Mercor sits at the center of the AI data supply chain. It recruits domain experts (doctors, lawyers, engineers, Math Olympiad winners) at ~$95/hour to produce:
- SFT data: expert-written prompt-response pairs
- RLHF preference data: human judgments comparing model outputs
- Evaluation rubrics: scoring frameworks defining “what good looks like”
- Chain-of-Thought traces: step-by-step reasoning annotations
- Benchmark evaluation data: graded model outputs against standardized tests
Six of the “Magnificent Seven” tech companies plus frontier labs OpenAI and Anthropic were clients. A single breach exposed all of them simultaneously — not because they shared infrastructure, but because they shared a vendor.
As Y Combinator president Garry Tan put it: “Incredible amount of SOTA training data now just available to China thanks to @mercor_ai leak. Every major lab. Billions and billions of value and a major national security issue.”
The Attack: Supply-Chain Compromise in 3 Phases
The breach followed a cascading supply-chain attack:
| Phase | Date | Target | Method |
|---|---|---|---|
| 1 | Mar 19 | Trivy (security scanner) | Exploited pull_request_target in GitHub Actions, stole aqua-bot PAT, force-pushed malicious commits to 76 release tags |
| 2 | Mar 24 | LiteLLM (AI proxy library, ~97M monthly downloads, present in 36% of cloud environments) | Used Trivy-stolen credentials to hijack PyPI publishing token; pushed malicious versions 1.82.7 and 1.82.8 (live for ~40 minutes) |
| 3 | Mar 24+ | Mercor | Poisoned LiteLLM dependency landed in dev environment; malware swept SSH keys, AWS tokens, K8s secrets; exfiltrated data via Tailscale VPN to models.litellm[.]cloud |
The malicious LiteLLM 1.82.8 used a .pth file — a Python path configuration file that executes automatically when the interpreter starts. No explicit import needed. The moment a developer opened an IDE or ran pip, the payload was already running.
What Was Stolen: The Full Inventory
| Asset | Size | Contents |
|---|---|---|
| Production Database | 211 GB | 250+ Aurora MySQL tables — contractor PII, interview transcripts, client project configs |
| Source Code | 939 GB | Complete GitHub org including mercor-monorepo, hardcoded API keys, Terraform configs |
| Cloud Storage | ~3 TB | Video interviews, desktop screenshots, passport/ID scans, signed legal docs |
| Airtable Export | Included | 84 workspaces, 1,055 JSONL files — the actual annotation tasks, rubrics, model outputs, and human evaluations |
| Slack Export | Included | Full enterprise Slack workspace + client-specific workspaces |
| Tailscale VPN Data | Included | Internal network topology, device certificates |
The Airtable export is where the training methodology lives. Each workspace follows a standardized schema:
TASKS / TASK_VERSIONS — the annotation tasks themselves
CRITERIA — evaluation criteria definitions
RUBRIC_VERSIONS — scoring framework iterations
QA_SPECS — quality control specifications
LLM_CALL_CONFIGURATION — model selection, temperature, sampling params
DOMAIN / SUBDOMAIN — domain taxonomy
WORKFLOW — task routing logic
CONTROL_PANEL — pipeline control parameters
TALENT — which experts are assigned to which tasks
This is not a dataset. It is the complete blueprint for an industrial-scale annotation pipeline.
Lab-by-Lab: What the Reporting Reveals About Each Company’s Evaluation Methodology
OpenAI: Self-Bootstrapping + Tournament Ranking
Annotation Platform: OpenAI uses a proprietary internal tool called Feather (feather.openai.com), organized by campaign UUIDs. Mercor contractors worked directly inside OpenAI’s tooling.
Key methodological insight — LLM-as-autograder:
The TaskDefinitions table configured openai/gpt-4.1 and openai/gpt-5 as autograders for human-produced annotation data. This reveals a bootstrapping strategy: use a stronger model to grade the output that will train the next generation of models. It’s an efficiency play — reduce the volume of expensive human review while maintaining quality signal — but it also means OpenAI’s training pipeline has a dependency on its own model quality for quality control.
Data purity rules:
Rubrics contained explicit constraints like “LLMs other than ChatGPT are prohibited.” This tells us OpenAI is concerned about cross-model contamination — they don’t want Claude’s or Gemini’s stylistic patterns leaking into their training data through contractors who might use competing tools to draft responses.
Ranking system — Bradley-Terry tournament:
The PairwiseComparisons table implements a classic Bradley-Terry model for ranking candidate outputs:
For each comparison:
- winnerResumeId / loserResumeId
- reasoning (LLM-generated explanation of why A > B)
Accumulated into:
- numComparisons → mScoreRaw → mScoreNormalized
This is the same statistical framework used in chess Elo ratings, adapted for ranking model outputs. The LLM-generated reasoning for each comparison likely serves dual purpose: quality control (can a reviewer verify the judgment?) and potential training signal (the reasoning itself could be used for reward model training).
Quality control: Three-layer architecture — AI autograder → human review of edge cases → contractor dispute mechanism via TaskAudits.dispute.
Anthropic: Systematic Preference Evaluation + Constitutional AI Feedback
Core methodology: preference-centric RLHF/DPO
Anthropic’s pipeline is organized around structured preference comparisons, consistent with their published Constitutional AI and RLHF research. Multiple API_PREFERENCE workspaces (including V2 and personal copies for individual team members) contain:
| Table | Purpose |
|---|---|
PROMPTS | Standardized input prompt collection |
RESPONSES | Multiple model outputs for the same prompt |
ROLES | Evaluator personas/perspectives to adopt |
DOMAINS | Domain categorization (technical, creative, safety, etc.) |
PROMPT_TEMPLATES | Reusable prompt templates |
QA | Quality assurance checklists |
The ROLES table is interesting. It suggests Anthropic asks evaluators to judge outputs from different perspectives — possibly related to their Constitutional AI approach, where principles are evaluated from multiple ethical/practical viewpoints.
Head-to-head model comparison: GPT-4 vs Claude
A dedicated “GPT-4 vs Claude Evaluation” project compared Claude 3.5 Sonnet against GPT-4 across use cases. Each comparison included the prompt, both responses, and the human preference judgment with reasoning. This is essentially Anthropic’s competitive intelligence pipeline — systematically mapping where Claude wins and loses against GPT-4, then using that data to close gaps.
The existence of multiple workspace versions (API_PREFERENCE, API_PREFERENCE_V2, API_PREFERENCE__COPY__FOR_BRENDAN, API_PREF___KANIX) suggests this framework is under rapid iteration with individual researchers maintaining working copies.
Agent evaluation: AgentSandboxes records with agentType: claude show Anthropic was evaluating agentic capabilities through Mercor, with full conversation transcripts stored.
Apple: Multi-Model Orchestration + Evaluation Automation
Apple’s exposure was arguably the most surprising — pre-release model outputs from unreleased Apple Intelligence models.
Model versions and inference parameters:
The APPLE_ENDPOINT_SANDBOX workspace tested three model versions:
| Model ID | Role |
|---|---|
afm-text-083 | Text generation (earlier version) |
afm-model-085 | Text generation (intermediate) |
afm-model-086 | Orchestrator model |
Sampling parameters: temperature=0.7, top_p=0.9 — relatively standard nucleus sampling, suggesting Apple prioritizes response diversity over determinism in their evaluation setup.
Four-dimensional evaluation matrix:
| Airtable Table | Evaluation Dimension |
|---|---|
TEXT | General text generation quality |
DEEP_L | Translation capability (English → Spanish) |
TEXT_ORCHESTRATOR | Routing/orchestration decisions |
RUBRIC_AUTO_GEN | Automated rubric generation |
Two architectural insights emerge:
-
TEXT_ORCHESTRATORconfirms Apple Intelligence uses a multi-model orchestration architecture — a routing model (afm-model-086) decides which sub-model handles each request. This is consistent with Apple’s on-device/cloud split architecture but reveals they’re also orchestrating between cloud-side models. -
RUBRIC_AUTO_GENshows Apple is investing in evaluation automation — using AI to generate the evaluation criteria themselves. This is a meta-level capability: if you can automate rubric creation, you can scale evaluation across new domains without proportionally scaling human rubric designers.
Amazon: Chain-of-Thought Quality Analysis
Amazon’s methodology is distinctive in its focus on evaluating reasoning quality, not just final answers.
AMAZON_LLM_COT_EVALUATION workspace structure:
| Table | Purpose |
|---|---|
DOMAINS | Evaluation categories (math, stem, etc.) |
PHASE_1_TASKS | Model A vs Model B with complete CoT traces |
PHASE_1_REVIEWS | Human reviews of CoT quality |
MODEL_A_STRENGTHS | Structured recording of each model’s reasoning advantages |
TALENT | Evaluator (domain expert) management |
What makes this different: Most RLHF preference data captures “which response is better” as a holistic judgment. Amazon’s pipeline decomposes this — evaluators assess the reasoning chain itself, not just the conclusion. Each task includes complete Chain-of-Thought traces, final responses, and preference judgments. The MODEL_A_STRENGTHS table suggests they’re building a structured ontology of reasoning capabilities, not just accumulating preference labels.
The PHASE_1_TASKS / PHASE_1_REVIEWS naming implies a multi-stage evaluation funnel — likely Phase 1 broad screening → Phase 2 deep analysis → final determination.
Meta: Multimodal Annotation
Less was directly exposed about Meta’s methodology, but the AAIE___META_MULTIMEDIA_TEMPLATE_COMMAND_CENTER workspace (containing OVERALL_META, PROJECTS, FORMS, TEMPLATE tables) confirms Meta’s annotation work through Mercor involved multimodal data — not just text. Meta has since indefinitely paused its relationship with Mercor.
Google DeepMind: Benchmark Evaluation
GDM was confirmed as a Mercor client by the Wall Street Journal, but direct evidence in the leaked samples was limited. The most likely connection is through the Athena HLE (Humanity’s Last Exam) workspaces — four versions of ATHENA_HLE__STEM_ (including dated copies from July 2025) with MODEL_RESPONSES and AWAITING_REVIEW_METRICS tables, indicating an active human review pipeline for one of the most important frontier model benchmarks.
Current Data Collection Priorities: What OpenAI and Anthropic Are Betting On
The breach evidence doesn’t exist in a vacuum. When we cross-reference the exposed Mercor project data with each lab’s public product roadmap and recent model releases, a picture emerges of where the data investment is going right now.
OpenAI: Reasoning, Agentic Code, and Self-Improving Evaluation
OpenAI’s current trajectory — visible through the o3/o4-mini releases, public statements, and the Mercor project data — points to three converging data priorities:
1. Reasoning and Chain-of-Thought data
The o3 and o4-mini models are trained using large-scale reinforcement learning on chains of thought [13]. o4-mini achieves 99.5% pass@1 on AIME 2025 (with Python interpreter access), and o3 sets new state-of-the-art on Codeforces. These results require massive volumes of verified reasoning traces — problems with objectively correct answers where the reasoning path can be checked. The AIME_RUBRICS workspace in the Mercor data (math competition rubrics) and ACADEMIC_REASONING_SFT (with an explicit COT table for Chain-of-Thought supervision) align directly with this priority.
OpenAI Chief Scientist Jakub Pachocki has outlined the target: AI Research Interns by September 2026, fully autonomous researchers by March 2028 [17]. Getting there requires training data that captures multi-step problem decomposition — not just “what’s the right answer” but “what’s the right sequence of reasoning steps, and how do you recover from mistakes mid-chain.”
2. Agentic coding and tool use
The Mercor TaskDefinitions table references an “Agentic Code Final QC Audit” project focused on AI code generation quality control for GitHub issue solving [1]. This is SWE-bench-style data: given a real GitHub issue, can the model produce a correct patch? OpenAI recently discontinued SWE-bench Verified after finding models were reproducing gold patches verbatim due to training data contamination [15] — a strong signal they need fresh, high-quality agentic coding data from human experts.
For the first time with o3, OpenAI’s reasoning models can agentically combine every tool in ChatGPT — web search, Python, image reasoning, image generation — in a single reasoning chain [13]. Training this requires demonstration data showing how and when to invoke tools, plus evaluations of tool-use decisions.
3. Self-bootstrapping evaluation at scale
Perhaps the most strategically interesting signal: OpenAI uses openai/gpt-4.1 and openai/gpt-5 as autograders in TaskDefinitions. Combined with the constraint “LLMs other than ChatGPT are prohibited” in rubrics, this reveals a closed-loop data strategy:
Human experts produce SFT/RLHF data
→ GPT-5 autogrades the submissions (quality control)
→ High-quality data trains next generation
→ Next-gen model becomes the new autograder
This is data flywheel design: each generation of models improves the efficiency of producing training data for the next generation. The risk is obvious (errors compound across generations), which is why the human expert layer and dispute mechanism remain in the loop. But the direction is clear — OpenAI is investing heavily in reducing the marginal cost of high-quality data through model-assisted evaluation.
Where Mercor’s Math Olympiad pipeline fits: OpenAI’s relationship with Mercor began when Mercor’s CEO cold-emailed OpenAI’s head of human data operations and landed a contract to recruit Math Olympiad winners for model training [10]. This is the prototypical “verifiable reward” domain — math has objectively correct answers and checkable reasoning chains, making it ideal for RL training of reasoning models.
Anthropic: Adversarial Safety, Agentic Alignment, and Competitive Positioning
Anthropic’s data investment reflects a dual mandate: push capability (especially agentic) while maintaining their safety-first positioning.
1. Adversarial safety and alignment evaluation
Anthropic’s February 2026 Risk Report [18] reveals their current threat model hierarchy:
- Sabotage: Can the model undermine safety assessments or poison training data for future models?
- Reasoning faithfulness: Does the model hide misaligned reasoning via steganography?
- Evaluation awareness: Can the model detect when it’s being tested and behave differently?
This translates to a specific data need: red-teaming and adversarial evaluation datasets. The Anthropic Fellows Program for 2026 [23] lists research areas including scalable oversight, adversarial robustness, AI control, and model organisms — all of which require carefully crafted evaluation scenarios where the “right” behavior is non-obvious.
The API_PREFERENCE workspaces in the Mercor data — with their ROLES (evaluator personas) and DOMAINS (evaluation dimensions) tables — likely served this purpose. Constitutional AI requires preference data from multiple ethical/practical perspectives, not just “which response sounds better.”
2. Long-horizon agentic capabilities
Claude Opus 4.6 (released February 2026) shows where Anthropic is pushing hardest [20] [21]:
| Capability | Opus 4.5 → 4.6 | Improvement |
|---|---|---|
| ARC-AGI-2 | 37.6% → 68.8% | +83% |
| BrowseComp | 67.8% → 84.0% | +24% |
| OSWorld | 66.3% → 72.7% | +10% |
| Terminal-Bench 2.0 | 59.8% → 65.4% | +9% |
The biggest gains are in agentic and long-horizon tasks — browsing, operating systems, terminal interaction. New features include Agent Teams (multiple agents in parallel), context compaction (enabling longer multi-step runs), and programmatic tool calling (agents writing code that calls tools, reducing latency) [20].
Training these capabilities requires data that current public benchmarks can’t provide:
- Multi-step task demonstrations with error recovery
- Tool orchestration traces showing when to use which tool
- Long-context interaction data (Opus 4.6 supports 1M token context)
- Computer use demonstrations — Claude achieves 14.9% on OSWorld (vs. 7.7% next-best), trained partly through screenshot interpretation and pixel-level cursor positioning [22]
The AgentSandboxes table in the Mercor data — running agentType: claude with full transcript storage — was likely generating exactly this kind of agentic training data.
3. Systematic competitive benchmarking
The “GPT-4 vs Claude Evaluation” project in the Mercor data reveals a practice that’s likely more common than any lab would publicly admit: systematically comparing your model against competitors to identify and close gaps. The preference data structure — same prompt, both responses, human judgment with reasoning — is designed to produce targeted training signal: not “make Claude generally better” but “make Claude better specifically where GPT-4 currently wins.”
This is essentially an adversarial capability transfer strategy: use human evaluators to identify the delta between your model and the competition, then convert that delta into targeted training data. The multiple versions of the workspace (API_PREFERENCE_V2, copies for individual researchers) suggest this is an ongoing, iterative process rather than a one-time benchmark.
The Bigger Picture: Where the Industry Is Heading
Both labs’ data strategies converge on several themes:
| Trend | OpenAI Signal | Anthropic Signal |
|---|---|---|
| Reasoning verification | RL on chains of thought; Math Olympiad data; AIME rubrics | Reasoning faithfulness evaluation; steganography detection |
| Agentic capabilities | ”Agentic Code Final QC Audit”; o3 multi-tool reasoning | Agent Teams; computer use; Terminal-Bench gains |
| Self-improving evaluation | GPT-5 as autograder; closed-loop data flywheel | Scalable oversight research; automated behavioral audits |
| Adversarial robustness | SWE-bench contamination detection → SWE-bench Pro | Sabotage detection; evaluation awareness testing |
| Domain expert data | Math Olympiad winners; coding experts at $95/hr | Constitutional AI annotators; red-team specialists |
The common thread: the easy data is exhausted. Both labs are moving beyond generic internet text and crowdsourced preferences toward expert-produced data in domains where quality matters enormously and is hard to fake — mathematical reasoning, agentic code execution, safety-critical evaluation, and multi-step planning. Major AI labs each spend approximately **15/hr for entry-level annotators to $500+/hr for domain experts. The Mercor relationship was valuable precisely because Mercor could supply specialists (doctors, lawyers, competitive programmers) who could produce data at the frontier of model capabilities.
Cross-Cutting Patterns: What All Labs Share
Across 84 Airtable workspaces, several universal patterns emerge in how frontier labs structure their evaluation pipelines:
1. Rubric Design: Three Core Principles
The leaked rubrics (particularly from Mercor’s own APEX benchmark suite, which they’ve partially open-sourced) reveal a shared rubric design philosophy:
- Hurdle criteria: Hard gates that must pass before any reward is granted — explicitly designed to prevent reward hacking. If the model fails a hurdle criterion, it scores zero regardless of other criteria quality.
- Grounding criteria: Dedicated scoring dimensions that penalize hallucination and unsupported claims — treated as a separate, weighted concern rather than folded into general quality.
- Binary grading: Each criterion is Met / Not Met rather than a Likert scale. This reduces inter-annotator disagreement and makes quality control more tractable (you can verify a binary judgment more reliably than a 1-5 score).
Mean criteria per task: ~4, with a range of 1-10.
2. Quality Control: Four-Layer Architecture
Layer 1: LLM autograder — bulk initial screening
Layer 2: Lead reviewer — human review of edge cases and low-confidence auto-grades
Layer 3: Double-blind QA — independent re-evaluation for calibration
Layer 4: Dispute resolution — contractor appeals mechanism
Tables supporting this: QA_SPECS, LEAD_AUDIT_QA, DOUBLE_BLIND, REVIEWER_ASSESSMENT, TaskAudits.dispute.
3. Aggressive Version Control
The same rubric appears in 12+ dated copies spanning August 2025 through January 2026. Every workspace table has a version field. This means the labs are continuously iterating evaluation criteria — what “good” means is a moving target, refined through months of annotation experience.
This has a practical implication: any competitor who obtains a single snapshot of the rubrics gets the current state, but misses the iteration trajectory — the sequence of refinements that encode hard-won lessons about what criteria actually discriminate between good and bad model behavior.
4. Domain Expert Routing
TALENT tables appear in nearly every workspace. Contractors are routed by domain expertise:
| Workspace | Domain | Specialist Type |
|---|---|---|
APEX_LEGAL | Legal reasoning | Lawyers |
BEAR_MEDICINE | Medical annotation | Physicians, radiologists |
APEX_FINANCE | Financial analysis | Finance professionals |
AIME_RUBRICS | Mathematical reasoning | Math competition participants |
ACADEMIC_REASONING_SFT | Academic reasoning | Researchers |
The BEAR_MEDICINE workspace has its own DISCIPLINES, PODS (team structures), WRITER_DAILY_ACTIVITY, and REVIEWER_STATS tables — a self-contained annotation operation with per-person productivity tracking.
5. Benchmark Isolation (Now Compromised)
Evaluation-focused workspaces (ATHENA_HLE, AIME_RUBRICS, APEX_*) were organizationally separated from SFT/RLHF training workspaces. This separation is standard practice to prevent benchmark contamination — if evaluation data leaks into training data, benchmark scores become meaningless.
The breach destroyed this separation. All APEX benchmark tasks, criteria, gold-standard answers, and historical evaluation data are now available to any buyer. Any model trained on this data will score artificially high on APEX benchmarks. The EVALS workspace — containing APEX_RESULTS, BOREALIS_RESULTS, and LUCIUS_RESULTS — confirms these benchmarks were actively used for model comparison, making the contamination risk concrete.
The Screenshot Problem: Cascading Secondary Breach
One underappreciated dimension: Mercor required contractors to install the Insightful monitoring agent, which captured desktop screenshots every few minutes during work sessions. Each screenshot was stored on S3 with metadata including:
- The active application, window title, and browser URL
- The contractor’s IP address, MAC address, and hardware fingerprint
- A direct link to the screenshot image
Because contractors worked directly inside client systems (OpenAI’s Feather platform, client Airtable workspaces, Slack channels), these screenshots are effectively visual records of client internal tools and data. An attacker can filter screenshots by projectId to systematically extract visual intelligence about any client’s internal systems.
This means the breach is not just of Mercor — it’s a proxy breach of every client whose internal tools were visible on a contractor’s screen.
Implications for the Industry
1. Methodology > Data
The most valuable thing leaked isn’t the training data — it’s the evaluation frameworks. The CRITERIA, QA_SPECS, and LLM_CALL_CONFIGURATION tables encode how each lab defines “what good AI output looks like.” This is the real competitive moat: anyone can collect prompt-response pairs, but knowing which criteria actually discriminate quality from mediocrity is years of expensive iteration.
2. Single Vendor = Single Point of Failure
Mercor simultaneously held training data, evaluation rubrics, and contractor work product for competing labs. When they were breached, everyone was exposed at once. This is the AI industry’s version of the SolarWinds problem — shared infrastructure creates correlated risk.
3. The Benchmark Contamination Cascade
Every APEX benchmark is now suspect. Any model evaluated against APEX after this breach could have been trained on leaked APEX data. Unless Mercor rebuilds the entire suite from scratch with new tasks and criteria, APEX results are meaningless. The same risk extends to any evaluation data in the 84 workspaces.
4. Biometric Data Is Irrevocable
Over 30,000 contractors had their video interviews, passport scans, and facial biometric data stolen. Unlike passwords, biometric data cannot be reset. These individuals face permanent risk of deepfake impersonation and identity verification fraud.
5. The Open-Source Dependency Surface
The attack succeeded because a widely-used Python package (LiteLLM, present in 36% of cloud environments) was compromised for ~40 minutes. The AI industry’s dependency on open-source tooling creates an attack surface that no single company controls. The .pth file technique — auto-executing on Python interpreter startup — bypasses any code review that focuses on explicit imports.
What Changes Now
Meta has indefinitely paused its work with Mercor. OpenAI and Anthropic are investigating. A class action lawsuit is underway.
But the structural question remains: will the industry respond by diversifying its data vendor relationships and hardening supply-chain security, or will it simply find the next hot vendor and repeat the pattern?
The Mercor breach is a reminder that in AI, the training pipeline is at least as valuable as the trained model — and considerably less protected.
References
Breach Analysis and Reporting
- Anatomy of Mercor’s Data Breach — Technical analysis of leaked database schema and Airtable workspaces (primary source for lab-specific methodology details)
- The CyberSec Guru: Inside the 4TB Lapsus$ Leak — Attack chain analysis (Trivy → LiteLLM → Mercor)
- Fortune: Mercor confirms major cybersecurity breach
- TNW: Meta freezes AI data work after breach — Training methodology exposure analysis, Meta pause, industry impact
- gentic.news: Expert Human Annotation Pipeline Exposed — Impact on Constitutional AI and RLHF
- LiveMint: OpenAI, Anthropic contractor targeted
- Business Insider: Meta Pauses Work With Mercor
- Tech Startups: Mercor confirms breach
- ClaimDepot: Mercor class action lawsuit
- SF Standard: San Francisco’s youngest billionaires — How the OpenAI–Mercor relationship began (Math Olympiad recruitment)
Mercor Official Resources
- Mercor APEX-Agents Benchmark — Official benchmark dataset with rubric design details (CC-BY 4.0)
- Mercor: Types of Data — Mercor’s documentation on SFT, RLHF, agentic, and evaluation data workflows
OpenAI — Models and Data Strategy
- Introducing OpenAI o3 and o4-mini — RL on chains of thought; agentic multi-tool reasoning; AIME/Codeforces results
- o3 and o4-mini System Card — Training approach: large-scale RL on chains of thought, data pipelines, filtering
- Why SWE-bench Verified no longer measures frontier coding — Benchmark contamination discovery; models reproducing gold patches verbatim
- Inside OpenAI’s in-house data agent — GPT-5.2/Codex-powered data management across 600PB, 70K datasets
- OpenAI’s Roadmap to Autonomous AI Researchers — Jakub Pachocki’s targets: AI Research Interns by Sep 2026, full autonomy by Mar 2028
Anthropic — Models and Safety Strategy
- Anthropic February 2026 Risk Report — Sabotage threats, reasoning faithfulness, steganography evaluation, ASL-3 deployment
- Claude Opus 4.6 System Card — Adaptive thinking, automated behavioral audits, evaluation awareness testing
- What’s new in Claude 4.6 — Agent Teams, context compaction, programmatic tool calling, 1M context
- Claude Opus 4.6 vs 4.5 Benchmarks — ARC-AGI-2 +83%, BrowseComp +24%, OSWorld +10%, Terminal-Bench +9%
- Developing a computer use model — Screenshot interpretation, pixel-level cursor positioning, OSWorld 14.9%
- Anthropic Fellows Program 2026 — Research areas: scalable oversight, adversarial robustness, AI control, model organisms, interpretability
Industry Context
- How AI Labs Are Hiring People to Train Models — 15/hr–$500+/hr annotator range
- The Changing Landscape of AI Data Labeling Hiring (2026) — Shift from crowdsourced to domain-expert annotation