Why Embeddings Cannot Solve Eval-Set Contamination

Semantic embeddings are useful for finding suspicious overlap between pretraining data and evaluation sets, but they cannot be the whole decontamination system. Eval-set contamination is not just a semantic-similarity problem. It is a question of whether a training document gives the model privileged access to the eval item, answer, solution path, benchmark format, or task-specific shortcut.

Embedding search answers a narrower question: “Is this text close to that text in representation space?” That is related to contamination, but it is not equivalent to contamination.

This essay explains why.

1. The target is not “similar text”

Suppose an eval item asks:

What is the capital of Australia?

A training document says:

Canberra became Australia’s capital in the early twentieth century.

This is not a duplicate of the eval question. Depending on the embedding model, it may or may not score as close to the question. Either way, it leaks the answer.

Now consider another training document:

Australia has a federal parliamentary system, a large mining sector, and major cities including Sydney, Melbourne, Brisbane, Perth, Adelaide, and Canberra.

This may be semantically close to the question because it mentions both Australia and Canberra. But it never says which city is the capital, so it does not leak the answer in the same way.

The difference is subtle but important. Contamination is not merely about topical closeness. It is about evaluation advantage.

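That advantage can come from exposure to the eval item itself, its answer, its solution path, the benchmark format, or a task-specific shortcut.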

Embeddings can help find some of these. They cannot reliably classify all of them.

Figure: Embedding similarity versus leakage risk

Embedding similarity and leakage risk are related, but they are not the same axis. Some high-similarity documents are harmless; some low-similarity documents leak the answer or solution.
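The two axes can be made concrete with a toy sketch. It uses a bag-of-words cosine as a crude stand-in for a real embedding model, and a simple string check as a stand-in for a leak label. Both helpers (`bag_of_words`, `leaks_answer`) and the `near_topic_doc` text are assumptions made for illustration, not part of any real pipeline.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude lexical stand-in for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leaks_answer(doc: str, answer: str) -> bool:
    """Hypothetical leak check: does the document contain the gold answer string?"""
    return answer.lower() in doc.lower()


QUESTION = "What is the capital of Australia?"
GOLD_ANSWER = "Canberra"

# Hypothetical training documents for the illustration.
near_topic_doc = "What is the largest city of Australia?"
leaking_doc = "Canberra became Australia's capital in the early twentieth century."

for name, doc in [("near_topic_doc", near_topic_doc), ("leaking_doc", leaking_doc)]:
    similarity = cosine(bag_of_words(QUESTION), bag_of_words(doc))
    leak = leaks_answer(doc, GOLD_ANSWER)
    print(f"{name}: similarity={similarity:.2f} leaks_answer={leak}")
```

In this toy setup the harmless document scores higher than the leaking one. A real embedding model would produce different numbers, but the same kind of inversion is what makes a single similarity axis unreliable as a leak label.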

2. Embedding similarity has no natural decontamination threshold

The most immediate engineering problem is threshold selection.

If the threshold is too strict, the system catches only near-duplicates. It misses paraphrases, translations, compressed explanations, code-equivalent solutions, and long documents with small leaking spans.

If the threshold is too loose, the system deletes too much legitimate data.

For example, an MMLU question about organic chemistry will be semantically close to many normal chemistry textbook passages. A HumanEval coding task will be semantically close to many normal Stack Overflow answers and GitHub examples. A GSM8K arithmetic word problem will be semantically close to many elementary math worksheets.

Deleting all nearby material may make the model worse at the underlying domain. But keeping all nearby material may preserve benchmark leakage.

This is the central threshold dilemma:

A contamination detector needs to distinguish “this teaches the domain” from “this gives away the exam.”

Generic embeddings were not designed to make that distinction.
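The dilemma is easy to see with a threshold sweep over hand-labeled matches. The scores and labels below are invented purely for illustration; `SCORED` and `sweep` are hypothetical names, and in practice the labels would come from human review of sampled matches.

```python
from typing import List, Tuple

# (similarity score, is_true_leak) pairs. Invented for illustration only.
SCORED: List[Tuple[float, bool]] = [
    (0.97, True),   # verbatim benchmark copy
    (0.88, True),   # light paraphrase
    (0.86, False),  # textbook passage on the same topic
    (0.81, False),  # forum answer on the same topic
    (0.74, True),   # translated copy
    (0.69, False),  # worksheet with a similar problem type
    (0.52, True),   # answer-only leak
    (0.40, False),  # unrelated document
]


def sweep(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int, int]:
    """Return (leaks caught, leaks missed, legitimate docs removed) at a threshold."""
    flagged = [is_leak for score, is_leak in scored if score >= threshold]
    caught = sum(flagged)
    total_leaks = sum(is_leak for _, is_leak in scored)
    return caught, total_leaks - caught, len(flagged) - caught


for threshold in (0.9, 0.8, 0.5):
    caught, missed, removed = sweep(SCORED, threshold)
    print(f"threshold={threshold}: caught={caught} missed={missed} legit_removed={removed}")
```

On this made-up data, no threshold catches every leak without also deleting legitimate material, because the leak and non-leak score ranges overlap.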

3. The unit of matching is unstable

Embedding search depends heavily on chunking.

At document level, leakage can disappear. A 20,000-token webpage may contain one paragraph that quotes an eval item. The full-document embedding mostly represents the broader page topic, not the leaking span.

At paragraph level, recall improves, but cost and false positives increase.

At sentence level, the system may catch answer sentences, but it can lose context. A sentence like “The answer is B” is useless without the surrounding question. A sentence like “Therefore, the function returns the length of the longest substring” may only be meaningful if linked to a coding prompt.

At sliding-window level, recall improves again, but the number of comparisons explodes.

This means “use embeddings” is not a complete algorithm. The real algorithm must specify:

Small choices here can change the contamination report.
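To show how much the chunking choice matters, here is a minimal sliding-window chunker with a window count for the 20,000-token page mentioned above. The function name, window sizes, and strides are assumptions chosen for the example.

```python
from typing import Iterator, List


def sliding_windows(tokens: List[str], size: int, stride: int) -> Iterator[List[str]]:
    """Yield overlapping token windows that together cover the whole document."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not tokens:
        return
    if len(tokens) <= size:
        yield tokens
        return
    for start in range(0, len(tokens) - size + 1, stride):
        yield tokens[start:start + size]
    # Add a final window when the stride does not land exactly on the end.
    if (len(tokens) - size) % stride != 0:
        yield tokens[-size:]


# A stand-in for a 20,000-token webpage.
page = ["tok"] * 20_000

for size, stride in [(20_000, 20_000), (512, 512), (256, 128)]:
    count = sum(1 for _ in sliding_windows(page, size, stride))
    print(f"size={size} stride={stride}: {count} windows to embed and compare")
```

Each window is one more vector to embed and one more candidate to compare against every eval item, so the comparison count scales with the window count multiplied by the number of eval views.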

4. Embeddings can miss answer-only leakage

Many benchmark items can be leaked without repeating the prompt.

For factual QA, a document containing the answer fact can help. For multiple-choice exams, a study guide may list the right concept without reproducing the exact question. For code, a repository may contain a functionally equivalent implementation without the benchmark prompt. For math, a solution page may show the derivation with different variable names.

Embedding the eval prompt and searching the corpus may miss these cases because the leaking document is not necessarily similar to the prompt. It may be similar to the answer, the explanation, or the latent solution concept.

A stronger system embeds multiple views of each benchmark item: the prompt, the answer, the explanation, and the underlying solution concept.

But once we do this, we are no longer “just using embeddings.” We are building a benchmark-aware retrieval and review system.
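A minimal sketch of what "multiple views" could look like in code follows. The `EvalItem` fields and the view names are hypothetical; a real registry would pick views per benchmark and per domain.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EvalItem:
    """Hypothetical record for one benchmark item."""
    item_id: str
    prompt: str
    answer: str
    explanation: Optional[str] = None


def build_views(item: EvalItem) -> Dict[str, str]:
    """Return the texts that would each be embedded and searched separately."""
    views = {
        "prompt": item.prompt,
        "answer": item.answer,
        "prompt+answer": f"{item.prompt}\n{item.answer}",
    }
    if item.explanation:
        views["explanation"] = item.explanation
    return views


item = EvalItem(
    item_id="geo-001",
    prompt="What is the capital of Australia?",
    answer="Canberra",
)
for view_name, text in build_views(item).items():
    print(f"{item.item_id}/{view_name}: {text!r}")
```

Searching the corpus with the `answer` and `explanation` views is what gives answer-only leaks a chance of being retrieved at all. It also multiplies the number of queries and the number of matches that need review.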

5. Embeddings struggle with code contamination

Code contamination is especially hard.

Consider a benchmark prompt:

Write a function that returns the longest common prefix among a list of strings.

A GitHub file might contain:

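def prefix(xs):
    """Return the longest common prefix of the strings in xs."""
    if not xs:
        return ""
    result = xs[0]
    for item in xs[1:]:
        # Shrink the candidate until it is a prefix of the current item.
        while not item.startswith(result):
            result = result[:-1]
    return result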

This file may not contain the benchmark wording. It may not mention “HumanEval.” It may not share many tokens with the prompt. But it solves the same task.

A generic embedding model may find it, or it may not. Even if it finds it, the score may be similar to many legitimate code examples. The hard question is not semantic closeness. The hard question is behavioral equivalence.

Code decontamination may need:

Embeddings are useful, but they are only one signal in a domain-specific system.
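One way to approximate behavioral equivalence is to run a candidate function from the corpus against the benchmark's own test cases. The sketch below reuses the `prefix` function from above. The test cases and the `passes_benchmark_tests` helper are hypothetical, and real corpus code would have to run inside a sandbox rather than directly as shown here.

```python
from typing import Callable, List, Sequence, Tuple


def prefix(xs):
    """Candidate found in the corpus (copied from the example above)."""
    if not xs:
        return ""
    result = xs[0]
    for item in xs[1:]:
        while not item.startswith(result):
            result = result[:-1]
    return result


# Hypothetical test cases for the "longest common prefix" benchmark item.
CASES: List[Tuple[List[str], str]] = [
    (["flower", "flow", "flight"], "fl"),
    (["dog", "racecar"], ""),
    ([], ""),
    (["same", "same"], "same"),
]


def passes_benchmark_tests(
    candidate: Callable[[List[str]], str],
    cases: Sequence[Tuple[List[str], str]],
) -> bool:
    """True if the candidate returns the expected output on every case."""
    for args, expected in cases:
        try:
            if candidate(list(args)) != expected:
                return False
        except Exception:
            # A crash on any case counts as "not equivalent".
            return False
    return True


print(passes_benchmark_tests(prefix, CASES))
print(passes_benchmark_tests(lambda xs: xs[0] if xs else "", CASES))
```

A pass is a reason to escalate the file for review, not proof of leakage: many independently written solutions to a common task will pass the same tests.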

6. Embeddings struggle with math contamination

Math contamination has a different failure mode.

Two problems can be surface-different but structurally identical:

Alice buys 3 notebooks at 4 dollars each and 2 pens at 1 dollar each. How much does she spend?

and:

A store sells 3 packs of paper for 4 dollars each and 2 erasers for 1 dollar each. What is the total cost?

The names and objects changed, but the computation is the same:

3 * 4 + 2 * 1

Should this count as contamination? It depends on the evaluation claim.

If the benchmark claims to test arithmetic skill, then seeing many similar examples may be normal training. If the exact numeric structure and reasoning template were generated from the benchmark, then it may be leakage. If the model saw the same problem with renamed entities, then the benchmark is less independent.

Embeddings cannot resolve this policy question. A math contamination detector may need equation extraction, template matching, answer-path comparison, and generated counterfactual rewrites.
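A very crude form of template matching is to compare the ordered numbers in two word problems. The `numeric_skeleton` helper below is a hypothetical sketch, not a real equation extractor: it ignores operations, units, and numbers written as words, so it can only nominate candidates for closer review.

```python
import re
from typing import Tuple


def numeric_skeleton(problem: str) -> Tuple[int, ...]:
    """Ordered integers appearing in the problem text."""
    return tuple(int(n) for n in re.findall(r"\d+", problem))


def same_numeric_structure(a: str, b: str) -> bool:
    """True if both problems contain the same non-empty sequence of numbers."""
    skeleton_a = numeric_skeleton(a)
    return bool(skeleton_a) and skeleton_a == numeric_skeleton(b)


eval_problem = (
    "Alice buys 3 notebooks at 4 dollars each and 2 pens at 1 dollar each. "
    "How much does she spend?"
)
corpus_problem = (
    "A store sells 3 packs of paper for 4 dollars each and 2 erasers for "
    "1 dollar each. What is the total cost?"
)
unrelated_problem = "Tom has 5 apples and eats 2. How many are left?"

print(numeric_skeleton(eval_problem))
print(same_numeric_structure(eval_problem, corpus_problem))
print(same_numeric_structure(eval_problem, unrelated_problem))
```

Whether a skeleton match counts as contamination is still the policy question described above; the code only surfaces the pair.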

7. Embedding models can themselves be contaminated

There is also a measurement issue. The embedding model used for cleaning may have been trained on internet-scale data. It may already have seen the same public benchmarks.

That does not make embeddings useless. But it means the detector is not a perfectly independent instrument. A contaminated embedding model can encode benchmark-specific associations, benchmark wording, or answer relationships.

In practice, this suggests two safeguards:

8. Embedding search creates false confidence

The most dangerous failure mode is not that embeddings are bad. It is that they look sophisticated enough to create confidence.

A team might say:

We embedded all eval items, searched the pretraining corpus, removed matches above 0.85 cosine similarity, and therefore the data is clean.

This statement hides many unresolved questions:

Without those details, embedding decontamination is not an audit. It is a heuristic.

9. What embeddings are good for

The right conclusion is not “do not use embeddings.” The right conclusion is “do not use only embeddings.”

Embeddings are useful for:

They are especially helpful as a recall layer after exact and near-duplicate matching.

But embeddings should feed into a broader system with reason codes, thresholds, sampling, and domain-specific checks.
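As a rough sketch of embeddings acting as a recall layer behind cheaper checks, the function below returns a reason code instead of a bare score. The reason-code names, the 5-gram size, and the function itself are assumptions for illustration; `embedding_similarity` is assumed to be precomputed by whatever embedding model the team uses.

```python
import re
from typing import List, Optional, Set, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token tuples; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def reason_code(
    eval_text: str,
    doc_text: str,
    embedding_similarity: float,
    n: int = 5,
    sim_threshold: float = 0.85,
) -> Optional[str]:
    """Return the first layer that fires, or None if no layer flags the pair."""
    eval_tokens, doc_tokens = normalize(eval_text), normalize(doc_text)
    # Layer 1: the whole normalized eval text appears in the document.
    # Padding with spaces keeps the match on token boundaries.
    if eval_tokens and f" {' '.join(eval_tokens)} " in f" {' '.join(doc_tokens)} ":
        return "EXACT_NORMALIZED"
    # Layer 2: at least one shared n-gram (a simple near-duplicate signal).
    if ngrams(eval_tokens, n) & ngrams(doc_tokens, n):
        return "NGRAM_OVERLAP"
    # Layer 3: embedding similarity nominates a candidate for review.
    if embedding_similarity >= sim_threshold:
        return "EMBEDDING_CANDIDATE"
    return None


question = "What is the capital of Australia?"
print(reason_code(question, "Quiz: what is the capital of Australia? Answer below.", 0.95))
print(reason_code(question, "In this quiz, what is the capital of Austria?", 0.80))
print(reason_code(question, "Canberra hosts the federal parliament.", 0.90))
print(reason_code(question, "A recipe for sourdough bread.", 0.10))
```

Recording which layer fired makes later review and reporting easier than a single merged score, since an `EMBEDDING_CANDIDATE` match deserves more scrutiny than an exact one.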

10. A better eval decontamination design

A more serious pipeline might look like this.

Step 1: Build a benchmark registry

For every eval item, store:

Step 2: Generate multiple searchable views

For each item, construct:

Step 3: Run layered retrieval

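Use several layers in sequence: exact matching, near-duplicate detection, embedding retrieval, and task-specific similarity.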

Step 4: Classify match severity

Do not reduce everything to one similarity score. Use severity labels:

These labels matter because removal policy should differ by severity.

Step 5: Choose policy actions

Possible actions:

The action should depend on the benchmark’s importance, the severity of the match, the source reliability, and the cost of false positives.
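A policy of this kind can be written down as a small, reviewable function. The severity labels, action names, and the specific mapping below are all hypothetical; the point is only that the decision is explicit and depends on more than one similarity score.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity labels for a flagged match."""
    EXACT_COPY = "exact_copy"
    ANSWER_LEAK = "answer_leak"
    TOPICAL_ONLY = "topical_only"


class Action(Enum):
    """Hypothetical policy actions."""
    REMOVE_DOCUMENT = "remove_document"
    REDACT_SPAN = "redact_span"
    KEEP_AND_REPORT = "keep_and_report"


def choose_action(severity: Severity, high_stakes_benchmark: bool) -> Action:
    """Map a match to an action based on severity and benchmark importance."""
    if severity is Severity.EXACT_COPY:
        return Action.REMOVE_DOCUMENT
    if severity is Severity.ANSWER_LEAK:
        # For a high-stakes benchmark, accept more false positives.
        return Action.REMOVE_DOCUMENT if high_stakes_benchmark else Action.REDACT_SPAN
    # Topical overlap is kept so the model still learns the domain,
    # but it is counted in the residual-risk report.
    return Action.KEEP_AND_REPORT


for severity in Severity:
    for high_stakes in (True, False):
        action = choose_action(severity, high_stakes)
        print(f"{severity.value:13s} high_stakes={high_stakes!s:5s} -> {action.value}")
```

A fuller version would also take source reliability and false-positive cost as inputs, as described above.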

Step 6: Validate with model behavior

Data-side checks should be paired with behavior-side checks:

This still does not prove zero contamination, but it gives a more honest picture of risk.
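One possible behavior-side check is to compare accuracy on the original eval items with accuracy on rewritten versions of the same items. The sketch below assumes the per-item correctness results already exist; the result lists are invented, and `contamination_gap` is a hypothetical name.

```python
from typing import Sequence


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def contamination_gap(original: Sequence[bool], rewritten: Sequence[bool]) -> float:
    """Accuracy on original items minus accuracy on their rewrites."""
    if len(original) != len(rewritten):
        raise ValueError("each original item needs exactly one rewrite")
    return accuracy(original) - accuracy(rewritten)


# Invented per-item results for one model on five items.
original_correct = [True, True, True, True, False]
rewritten_correct = [True, False, True, False, False]

gap = contamination_gap(original_correct, rewritten_correct)
print(f"original={accuracy(original_correct):.2f} "
      f"rewritten={accuracy(rewritten_correct):.2f} gap={gap:.2f}")
```

A large positive gap is consistent with memorization of the original items, but it can also mean the rewrites are harder or flawed, so it is a signal to investigate rather than a verdict.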

11. The key distinction

The key distinction is this:

Semantic similarity asks whether two texts are close.

Eval contamination asks whether training exposure compromises the independence of measurement.

Those are not the same question.

Sometimes a semantically close document is harmless. Sometimes a semantically distant document leaks the answer. Sometimes the problem is not a document at all, but a source repository, a benchmark mirror, a synthetic data chain, or a repeated task template.

That is why eval-set decontamination cannot be solved by embeddings alone.

12. Conclusion

Embeddings should be part of modern pretraining decontamination. They are too useful to ignore, especially for paraphrases and weakly transformed benchmark copies.

But embedding-only cleaning is insufficient because contamination is target-conditioned, domain-specific, and policy-dependent. It is about unfair evaluation advantage, not merely semantic closeness.

The right standard is an auditable, layered decontamination system: exact matching, near-duplicate detection, source quarantine, embedding retrieval, task-specific similarity, severity classification, and residual-risk reporting.

In other words, embeddings are a search tool. Decontamination is an evaluation-governance problem.
