Pretraining Contamination: Why “Don’t Train on the Test Set” Became Hard

Pretraining contamination is not a small data-cleaning mistake. It is a measurement problem. Once language models are trained on internet-scale corpora, public benchmarks, answer keys, code solutions, exam explanations, leaderboard discussions, and synthetic copies of all of the above can enter the training mixture. The result is that benchmark scores may measure a mixture of generalization, memorization, benchmark familiarity, and data-pipeline luck.

The old rule was simple: do not train on the test set. In modern LLM pretraining, that rule becomes much harder to enforce because there is no single neat training set and no single neat test set. There is a large, messy web-scale corpus on one side and a collection of public evaluation artifacts on the other. The hard question is no longer “did test.csv accidentally get included?” The hard question is “did the model see enough information, in any form, to gain an unfair advantage on this evaluation?”

1. What contamination means

In classical supervised learning, contamination usually means train-test overlap. A row from the test set appears in the training set. That is bad, but it is at least easy to define.

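For LLMs, contamination is broader. A benchmark item can leak through many channels besides a verbatim copy of the test file: answer keys, code solutions, exam explanations, leaderboard discussions, paraphrases, translations, and synthetic copies of any of these.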

This is why contamination is not the same thing as deduplication. Deduplication asks whether two training examples are redundant with each other. Decontamination asks whether a training example gives the model privileged access to an evaluation item.

A document can be unique and still contaminated. A single blog post explaining five benchmark questions is not a duplicate of anything else. But if it appears in pretraining, it can still undermine the benchmark.

2. Why pretraining makes the problem worse

Pretraining makes contamination harder for three reasons.

First, the data is huge. A modern pretraining corpus may include web pages, Common Crawl snapshots, books, papers, GitHub repositories, Q&A sites, documentation, scraped PDFs, subtitles, code comments, and synthetic data. At that scale, public benchmarks are likely to appear somewhere.

Second, the benchmark ecosystem is public and self-replicating. The more important a benchmark becomes, the more people write about it. They publish solutions, reproduce questions, make study guides, discuss edge cases, create translated copies, upload notebooks, and build “benchmark practice” datasets. Famous benchmarks leak because fame creates copies.

Third, synthetic data creates indirect contamination. Suppose Model A saw a benchmark. Model A then generates instruction data. Model B trains on that synthetic data. Model B may inherit benchmark-like examples even if the original benchmark file never appears in Model B’s raw corpus.

This means contamination is not only a historical artifact of sloppy dataset construction. It is a continuing ecosystem problem.

3. Different types of contamination

The most useful way to reason about contamination is to separate several levels.

Exact input contamination

The eval prompt appears verbatim in the training corpus. This is the easiest case to detect.

Example:

Eval: “Which planet is known as the Red Planet?”

Training data: “Which planet is known as the Red Planet?”

This can often be caught with normalized exact matching, n-gram matching, MinHash, or other near-duplicate methods.
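As a rough illustration of the near-duplicate side of this, here is a minimal MinHash sketch using only the standard library. The shingle size, the number of hash functions, and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) are illustrative assumptions, not taken from any particular decontamination pipeline.

```python
import hashlib
import re


def shingles(text, n=3):
    """Lowercased word n-grams (shingles) of a text, as a set of strings."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    if not items:
        raise ValueError("need at least one shingle")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


eval_prompt = "Which planet is known as the Red Planet?"
training_text = "which planet is known as the red planet"
unrelated_text = "Bread dough needs time to rise before baking."

sig_eval = minhash_signature(shingles(eval_prompt))
sig_train = minhash_signature(shingles(training_text))
sig_other = minhash_signature(shingles(unrelated_text))

same_score = estimated_jaccard(sig_eval, sig_train)
other_score = estimated_jaccard(sig_eval, sig_other)
```

In this toy case the two versions of the question differ only in casing and punctuation, so their shingle sets are identical and the estimate is 1.0. A production system would typically add locality-sensitive hashing on top of the signatures so that it never compares every pair directly.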

Input-and-label contamination

The model sees both the question and the answer.

Example:

“Which planet is known as the Red Planet? Answer: Mars.”

This is more severe because the training document contains the mapping the benchmark is trying to test.

Partial contamination

Only part of the benchmark item appears. For example, a reading-comprehension passage appears without the exact question, or a math problem appears without the final answer.

Partial contamination is harder to interpret. It may help a lot, a little, or not at all. But it still matters because the evaluation is no longer fully independent.

Semantic contamination

The model sees a paraphrase, translation, equivalent code task, or near-equivalent reasoning problem.

Example:

Eval: “A train travels 60 miles in 1.5 hours. What is its average speed?”

Training data: “If a car covers 120 kilometers in 3 hours, calculate its average speed.”

This may not be the same literal question. But if the benchmark is supposed to test whether the model can infer rate = distance / time, then enough near-equivalent examples can change what the score means.
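A small sketch makes the gap concrete. Assuming a simple lowercase word tokenizer and 5-word n-grams (both are illustrative choices, not a recommendation), the two prompts above share no n-gram at all, yet both reduce to the same rate computation.

```python
import re


def word_ngrams(text, n=5):
    """Set of lowercase word n-grams; the tokenizer here is a deliberate simplification."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


eval_item = "A train travels 60 miles in 1.5 hours. What is its average speed?"
train_doc = "If a car covers 120 kilometers in 3 hours, calculate its average speed."

# String-level view: no 5-gram is shared between the two texts.
shared = word_ngrams(eval_item) & word_ngrams(train_doc)


def average_speed(distance, hours):
    """rate = distance / time, the skill both prompts exercise."""
    return distance / hours


# Task-level view: the same formula solves both (units differ: mph vs km/h).
eval_answer = average_speed(60, 1.5)
train_answer = average_speed(120, 3)
```

A string matcher sees two unrelated documents here, while a reader sees the same exercise. Whether that counts as contamination or as ordinary learning depends on how many such variants exist and how closely they track the benchmark.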

Benchmark-format contamination

The model may learn the quirks of a benchmark without memorizing individual items. It may learn common phrasing, answer styles, multiple-choice patterns, common distractors, or prompt templates.

This matters because many benchmarks are not pure capability tests. They are artifacts with recognizable styles.

Post-training contamination

Even if pretraining is clean, contamination can enter later through supervised fine-tuning, RLHF data, benchmark-focused instruction datasets, or model-generated explanations. A clean base model can become contaminated during alignment or instruction tuning.

4. Why this affects benchmark trust

Contamination does not automatically mean a benchmark score is fake. The effect depends on the benchmark, the model, the number of exposures, the training stage, and the type of overlap.

For factual QA, seeing the answer can directly improve performance.

For code generation, seeing a canonical solution or a near-identical GitHub task can help the model reproduce the right structure.

For math, seeing many templated variants can help the model learn a shallow pattern rather than solve a fresh problem.

For broad knowledge benchmarks, exposure to related educational material may be legitimate pretraining, while exposure to the exact benchmark item is not.

This is the core ambiguity: pretraining is supposed to teach the model from the world, and benchmarks are also drawn from the world. The goal is not to remove all knowledge related to an evaluation topic. The goal is to avoid giving the model access to the exam itself.

That boundary is not always clean.

5. Why simple cleaning is not enough

Many teams start with exact matching or n-gram overlap. These methods are valuable. They are scalable, explainable, and good at catching obvious leaks.

But they are incomplete.

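They can miss paraphrases, translations, summaries, and explanation pages, as well as source-level leaks such as benchmark mirrors and solution repositories that never repeat the exact prompt.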

On the other side, aggressive semantic filtering can remove too much. If every document semantically close to an MMLU question is deleted, the training set may lose normal educational material. If every document close to a coding benchmark is deleted, the model may lose legitimate programming examples.

This creates a real engineering tradeoff. Too little filtering leaves leakage. Too much filtering damages the training distribution and may remove the very knowledge the benchmark is meant to test.

Figure: contamination policy scatter (remove, review, or keep).

A decontamination system should not collapse every hit into one similarity threshold. The action depends on match confidence and leakage severity.
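One way to picture a two-axis policy is a small decision function. This is only a sketch: the 0-to-1 scales, the threshold constants, and the names `Action` and `decide_action` are hypothetical and would need tuning per benchmark family.

```python
from enum import Enum


class Action(Enum):
    REMOVE = "remove"
    REVIEW = "review"
    KEEP = "keep"


# Hypothetical thresholds, chosen only for illustration.
HIGH_CONFIDENCE = 0.9
LOW_CONFIDENCE = 0.5
SEVERE_LEAK = 0.5


def decide_action(match_confidence: float, leakage_severity: float) -> Action:
    """Map (confidence that this is a real match, how much it leaks) to an action."""
    for value in (match_confidence, leakage_severity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    # Confident match that exposes the answer or full item: remove.
    if match_confidence >= HIGH_CONFIDENCE and leakage_severity >= SEVERE_LEAK:
        return Action.REMOVE
    # Plausible match, or confident match with mild leakage: send to review.
    if match_confidence >= LOW_CONFIDENCE:
        return Action.REVIEW
    # Weak match evidence: keep the document.
    return Action.KEEP
```

For example, an exact prompt-plus-answer hit might score `(0.95, 0.9)` and be removed, while an embedding-only neighbor with a high severity guess but low match confidence, say `(0.2, 0.9)`, would be kept.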

6. What a serious decontamination pipeline actually does

A mature decontamination system is not one classifier and not one embedding index. It is a layered pipeline with explicit policies. The important engineering idea is that each layer catches a different failure mode.

Step 1: Build a benchmark registry

The first artifact is a benchmark registry. For every benchmark you intend to report, store the item ID, split, release date, prompt, answer choices, correct label, explanation, source URL, license, and task family.

For multiple-choice and factual QA, store several searchable views of each item, such as prompt-only, answer-only, and prompt-plus-answer.

This detail matters because answer-only or prompt-plus-answer retrieval can find leaks that prompt-only search misses. Deng et al. report that concatenating question and label improved retrieval efficiency for contamination detection on benchmarks such as MMLU and TruthfulQA (Deng et al., 2023).
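A registry entry could look roughly like the following. The field names mirror the list above; the `benchmark` field, the class name `BenchmarkItem`, the `searchable_views` method, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    benchmark: str            # hypothetical extra field: which benchmark this belongs to
    item_id: str
    split: str
    release_date: str         # ISO 8601 date string
    prompt: str
    answer_choices: tuple[str, ...]
    correct_label: str
    explanation: str = ""
    source_url: str = ""
    license: str = ""
    task_family: str = ""

    def searchable_views(self) -> dict[str, str]:
        """Text variants to index, so answer-bearing leaks are also retrievable."""
        return {
            "prompt_only": self.prompt,
            "answer_only": self.correct_label,
            "prompt_plus_answer": f"{self.prompt} {self.correct_label}",
        }


item = BenchmarkItem(
    benchmark="toy-astronomy-quiz",   # made-up benchmark name
    item_id="q-0001",
    split="test",
    release_date="2020-01-01",
    prompt="Which planet is known as the Red Planet?",
    answer_choices=("Venus", "Mars", "Jupiter", "Saturn"),
    correct_label="Mars",
    task_family="factual_qa",
)
views = item.searchable_views()
```

Keeping the views on the item itself means every downstream matcher (exact, n-gram, or embedding) can iterate over the same set of strings rather than each inventing its own.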

Step 2: Canonicalize before matching

Both benchmark items and candidate training documents should be normalized before matching. At minimum:

This is boring infrastructure, but it is the difference between a real detector and a demo. Without canonicalization, tiny formatting differences create false negatives.
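A minimal canonicalization sketch might look like this. The specific steps (Unicode NFKC normalization, case folding, punctuation stripping, whitespace collapsing) are one plausible choice rather than a standard, and the function name `canonicalize` is an assumption.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize text so that trivial formatting differences do not hide a match."""
    # Fold compatibility characters (e.g. non-breaking spaces, full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # Case-insensitive comparison.
    text = text.casefold()
    # Replace punctuation, including curly quotes, with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw_benchmark = "Which  planet is known as the “Red Planet”?"
raw_scraped = "which planet is known as the red\u00a0planet\n"
```

Both strings above canonicalize to the same text. Punctuation stripping is lossy, so code and math benchmarks would likely need a gentler variant that preserves operators and symbols.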

Step 3: Use exact and n-gram matching as the first wall

Exact and n-gram matching should still be the first wall because they are cheap, scalable, and interpretable. This is the family of methods used in early large-model decontamination work: later contamination literature summarizes GPT-3 as using a 13-gram-style strategy, while PaLM split examples into clean and contaminated subsets when at least 70% of the 8-grams in the question, prompt, or target appeared in training data (Deng et al., 2023; PaLM).

The practical version is:

The action should be aggressive for exact prompt-plus-answer overlap. Remove the chunk or quarantine the whole document, depending on how the corpus is assembled.
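The overlap check can be sketched as follows. It is loosely modeled on the 70%-of-8-grams rule described above, but it is not the PaLM implementation; the tokenizer, the function names, and the pairwise comparison (a real system would index training n-grams once) are simplifying assumptions.

```python
import re


def ngram_set(text, n):
    """Set of lowercase word n-grams from a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_text, training_text, n=8):
    """Share of the benchmark item's n-grams that also occur in the training text."""
    bench = ngram_set(benchmark_text, n)
    if not bench:
        # Items shorter than n tokens yield no n-grams and need a separate rule.
        return 0.0
    train = ngram_set(training_text, n)
    return len(bench & train) / len(bench)


def is_flagged(benchmark_text, training_text, n=8, threshold=0.7):
    """Flag the pair when the overlap fraction reaches the threshold."""
    return overlap_fraction(benchmark_text, training_text, n) >= threshold


benchmark_view = "Which planet is known as the Red Planet? Answer: Mars."
leaky_doc = (
    "Trivia night recap. Which planet is known as the Red Planet? "
    "Answer: Mars. The next round covered geography."
)
clean_doc = "Bread dough needs time to rise before baking, usually an hour or more."

leaky_score = overlap_fraction(benchmark_view, leaky_doc)
clean_score = overlap_fraction(benchmark_view, clean_doc)
```

Measuring the fraction against the benchmark item, not the training document, matters: a long web page can contain a full benchmark item while the item makes up only a tiny share of the page.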

Step 4: Quarantine known bad sources

Some leakage is source-level, not item-level. If a GitHub repository is a benchmark mirror, a Kaggle notebook contains benchmark solutions, or a website exists to publish answer keys, matching item-by-item is too fragile. The safer policy is to quarantine the source.

Useful source-level signals include:

This is especially important for code benchmarks. A solution repository may not repeat the exact prompt, but it can still contain behaviorally equivalent solutions.
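One simple source-level heuristic, sketched under assumptions, is to aggregate item-level hits by source and quarantine any source that touches several distinct benchmark items. The threshold of three, the function name, and the source identifiers below are all hypothetical.

```python
from collections import defaultdict


def sources_to_quarantine(hits, min_distinct_items=3):
    """hits: iterable of (source_id, benchmark_item_id) pairs from item-level matching."""
    items_by_source = defaultdict(set)
    for source_id, item_id in hits:
        items_by_source[source_id].add(item_id)
    # A source that matches many different items looks like a mirror or answer key.
    return {
        source_id
        for source_id, item_ids in items_by_source.items()
        if len(item_ids) >= min_distinct_items
    }


# Made-up hit log for illustration.
hits = [
    ("github.com/example/benchmark-mirror", "q-0001"),
    ("github.com/example/benchmark-mirror", "q-0002"),
    ("github.com/example/benchmark-mirror", "q-0003"),
    ("blog.example.org/astronomy-facts", "q-0001"),
    ("blog.example.org/astronomy-facts", "q-0001"),
]
quarantined = sources_to_quarantine(hits)
```

Counting distinct items rather than raw hits keeps a page that repeats one question many times from being treated like a full benchmark mirror.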

Step 5: Add semantic retrieval, but use it as triage

Embedding search is valuable, but it should be a candidate generator, not the final judge. It can surface paraphrases, translations, summaries, and explanation pages that n-gram matching misses. But it also brings false positives because many legitimate educational documents are semantically close to benchmark questions.

Yang et al. show the key failure mode: paraphrased or translated benchmark samples can bypass string-matching decontamination, and if those variants remain in training, a 13B model can overfit the benchmark and reach drastically inflated performance (Yang et al., 2023).

So the practical policy should be:
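The triage idea can be sketched in code as follows. The bag-of-words `embed` function is a stand-in for a real embedding model, and the names, `top_k`, and `min_similarity` values are illustrative assumptions. The key point is that the function returns candidates for review and never deletes anything itself.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def triage_candidates(benchmark_prompt, documents, top_k=2, min_similarity=0.3):
    """Return up to top_k (doc_id, score) pairs to send to review, best first."""
    query = embed(benchmark_prompt)
    scored = [(doc_id, cosine(query, embed(text))) for doc_id, text in documents.items()]
    scored = [pair for pair in scored if pair[1] >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


documents = {
    "paraphrase": "Name the planet that people call the red planet.",
    "astronomy_notes": "Mars is the fourth planet from the Sun.",
    "baking_blog": "Bread dough needs time to rise before baking.",
}
candidates = triage_candidates("Which planet is known as the Red Planet?", documents)
candidate_ids = [doc_id for doc_id, _ in candidates]
```

In this toy run both the paraphrase and the ordinary astronomy note are surfaced. The second is exactly the kind of legitimate educational text that a similarity threshold alone would wrongly delete, which is why the output feeds a review step.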

Step 6: Use task-specific detectors

Different benchmark families need different detectors.

For code:
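One possible code-side detector, sketched under assumptions, compares solutions structurally so that renamed copies still match. The approach below (parse Python source, replace identifiers with positional placeholders, compare AST dumps) is an illustration only; it handles Python alone, and the names `_NameNormalizer` and `structural_fingerprint` are hypothetical.

```python
import ast


class _NameNormalizer(ast.NodeTransformer):
    """Rename functions, arguments, and variables to v0, v1, ... in order of first use."""

    def __init__(self):
        self.aliases = {}

    def _alias(self, name):
        return self.aliases.setdefault(name, f"v{len(self.aliases)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def structural_fingerprint(source: str) -> str:
    """AST dump with identifiers normalized; comments and formatting are ignored."""
    tree = _NameNormalizer().visit(ast.parse(source))
    return ast.dump(tree)


canonical = "def add(x, y):\n    return x + y\n"
renamed = "def plus(left, right):\n    # same logic, new names\n    return left + right\n"
different = "def add(x, y):\n    return x * y\n"
```

Here `canonical` and `renamed` produce the same fingerprint while `different` does not. This catches renaming and reformatting only; it will miss solutions that are behaviorally equivalent but structurally different, which would need execution-based or test-based comparison.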

For math:

For reading comprehension:

For multimodal benchmarks:

Step 7: Add evaluation-side defenses

Data cleaning alone is not enough, especially once public benchmarks become famous. Evaluation itself should become more robust.

One direction is time-sensitive evaluation. LatestEval, for example, creates reading-comprehension evaluations from recent texts so the benchmark is less likely to overlap with older pretraining corpora; its pipeline gathers recent texts, identifies key information, and constructs questions while removing existing answers from the context (LatestEval).
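The date-based part of this idea can be sketched very simply. This is not LatestEval's implementation, only the filtering principle; the item layout, the function name, and the cutoff date are assumptions.

```python
from datetime import date


def select_fresh_items(items, training_cutoff):
    """Keep only source texts published strictly after the model's training cutoff."""
    return [item for item in items if item["published"] > training_cutoff]


# Hypothetical candidate source texts for building evaluation questions.
items = [
    {"id": "article-001", "published": date(2023, 1, 15)},
    {"id": "article-002", "published": date(2024, 6, 1)},
]
fresh = select_fresh_items(items, training_cutoff=date(2023, 12, 31))
fresh_ids = [item["id"] for item in fresh]
```

A publication date is only a proxy: a recent article can still restate older material, so freshness lowers overlap risk rather than eliminating it.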

Other evaluation-side defenses include:

This matters because decontamination can reduce risk, but it cannot prove the model has never seen related material.

Step 8: Publish an auditable report

A serious release should include a decontamination report. DCLM is a useful precedent here: it releases decontamination tooling and asks submissions to disclose a decontamination report rather than treating contamination as a private implementation detail (DCLM).

The report should include:

The goal is not to claim “zero contamination.” The goal is to make the remaining uncertainty visible.

7. The deeper issue: evaluation governance

The phrase “data decontamination” makes this sound like cleaning. But the deeper issue is governance.

If benchmark scores are used to compare labs, rank models, advertise capabilities, allocate funding, or claim progress toward reasoning, then the benchmark must measure what people think it measures.

Contamination threatens that measurement.

This is why decontamination should become an auditable artifact of pretraining. A model release should not only say “we cleaned the data.” It should explain:

The goal is not to prove there is zero leakage. At web scale, that may be impossible. The goal is to make contamination risk measurable, comparable, and honest.

8. Conclusion

Pretraining contamination is hard because language models learn from the same public internet that benchmarks live on. The more public and important a benchmark becomes, the more likely it is to appear in training data directly or indirectly.

This does not mean benchmarks are useless. It means benchmark scores need provenance, context, and humility.

The old rule still matters: do not train on the test set. But for LLMs, that rule has become an engineering and scientific discipline. It requires benchmark registries, source tracking, multi-stage matching, semantic review, domain-specific checks, and transparent reporting.

Decontamination is not just about cleaner data. It is about protecting the meaning of evaluation.

References

