How to Test Pretraining Ideas at Small Scale Before Betting on a Large Model

Large-model pretraining is too expensive to run by intuition. If you have a new data filter, tokenizer, deduplication rule, domain mix, curriculum, optimizer setting, or architecture tweak, you cannot simply train a 70B model and hope the gain survives.

The normal practice is to build a scaling ladder.

You test the idea on small models. Then you test it on medium models. Then you check whether the gain behaves predictably as model size, token count, and compute increase. Only after the improvement survives several controlled scale points do you spend the money on the large run.

This post explains how that is usually done in pretraining practice, and why the method works.

The core idea:

A pretraining improvement is not real just because it improves a 100M model.

It becomes credible when the improvement is stable across model sizes, token budgets, seeds, validation slices, and downstream evaluations.

IsoFLOP scaling-law monitor for small and medium pretraining runs

This is a synthetic monitoring template, but the visual grammar is borrowed from real scaling-law practice: log-scale loss-vs-FLOPs fits, residual checks, and Chinchilla-style IsoFLOP sweeps. Move forward when the fitted curve and the IsoFLOP valleys stay smooth. Stop when residuals jump, the valley moves erratically, or a larger batch produces worse loss at the same compute.

Small-scale go/no-go scatter for pretraining ideas

A small proxy result should be treated as a decision surface. Move forward only when the gain is large enough and the transfer evidence is clean enough.

1. Why small models are useful at all

The reason small models are useful is not that they are miniature copies of large models in every way. They are not.

Small models are useful because many pretraining metrics scale smoothly. Kaplan et al. showed that language-model cross-entropy loss follows approximate power-law relationships with model size, dataset size, and compute over a wide range of scales (Kaplan et al., 2020). Hoffmann et al. later showed that compute-optimal training depends strongly on the balance between model parameters and training tokens, not just parameter count (Hoffmann et al., 2022).

This gives pretraining teams a way to ask:

The goal is not to prove the future perfectly. The goal is to reduce the chance that a large run surprises you.

2. The basic variables: N, D, C, and Q

A practical scaling experiment usually tracks four variables.

N: model size. This is usually the number of non-embedding or total trainable parameters: 70M, 160M, 410M, 1B, 3B, 7B, and so on.

D: training tokens. This is how many tokens the model sees. Hoffmann et al. trained more than 400 models from 70M to over 16B parameters and argued that, under a fixed compute budget, parameters and tokens should be scaled up in roughly equal proportion (Hoffmann et al., 2022).

C: compute. A rough rule of thumb for dense decoder-only Transformer training is that training FLOPs scale like:

C ~= 6 * N * D

This approximation is not exact, but it is useful for comparing experiment budgets.
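As a quick worked example of the approximation above, here is a minimal Python sketch. The function name `training_flops` and the 1B-parameter, 20B-token configuration are illustrative assumptions, not part of any particular library or recipe.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-Transformer training compute using C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Illustrative configuration: a 1B-parameter model trained on 20B tokens.
flops = training_flops(1e9, 20e9)
print(f"approximate training compute: {flops:.2e} FLOPs")
```

The same helper can be used to check that two runs you intend to compare really sit at the same budget.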

Q: data quality or recipe quality. This is the part you are testing: a new filter, a new data source, a new mixing ratio, a new deduplication policy, a new tokenizer, a new schedule, or a new architecture detail.

Scaling-law work usually models loss as a smooth function of model size, data size, and compute. In practice, pretraining teams often care about whether changing Q shifts the curve downward.

If your intervention lowers the loss curve at every scale, it is promising.

If it lowers loss only for tiny models and disappears at 1B, it is probably a small-model artifact.

If it helps loss but hurts downstream tasks, the intervention may be optimizing the wrong proxy.

3. The key experimental question

The question is not:

Did my small model get better?

The question is:

Does my small-model result predict a better large-model run?

Those are different.

A 100M model may prefer cleaner, simpler, more textbook-like data because it has limited capacity. A 7B model may benefit from broader, messier, more diverse web data because it can model more variation. A small model may be bottlenecked by optimization instability. A large model may be bottlenecked by data diversity. A small model may underuse code data. A larger model may turn code data into better reasoning and tool-use transfer.

So the job of a scaling experiment is not to crown a winner at the smallest scale. It is to identify which interventions have stable scaling behavior.

4. The standard pretraining ladder

A practical ladder has several stages.

Stage 0: Data and pipeline smoke test

Before training meaningful models, train a tiny model to catch obvious breakage.

Example:

The goal is not scientific evidence. The goal is to find broken tokenization, duplicate explosions, bad loss spikes, source corruption, data-loader bugs, unstable gradients, and accidental evaluation leakage.

If a recipe cannot survive Stage 0, it is not ready for real comparison.

Stage 1: Cheap proxy models

Next, run a small but real comparison.

Example:

This stage tells you whether the intervention has any signal. It is where you test many ideas quickly.

Typical candidates:

Most ideas should die here. That is the point.

Stage 2: Medium proxy models

The surviving ideas move to a medium scale.

Example:

This is the first scale where you ask whether the gain is still there under a more realistic model and token budget.

DCLM is a good public example of this mindset. It defines multiple competition scales, including roughly 400M, 1B, and 7B settings, so teams can test data curation methods under smaller compute budgets before checking whether they transfer to larger runs (DCLM).

Stage 3: Large proxy or pre-final model

Before the final expensive run, train one larger proxy.

Example:

This is the last chance to catch a false small-scale win.

The large proxy does not need to be the final model size. Its purpose is to answer: does the intervention still help when the model has enough capacity to behave more like the final system?

Stage 4: Final run

Only now do you train the expensive model.

At this stage, you should not still be debating the basic data filter. You should already know:

The final run is still risky, but it should not be a blind bet.

A pretraining gain surviving, fading, or reversing across model scales

The green line is the pattern you want: the intervention keeps beating the baseline as model size increases. The yellow line needs more evidence. The red line is a classic small-model trap.

5. Two budget styles: isoFLOP and fixed-token

There are two common ways to compare pretraining interventions.

IsoFLOP comparison

In an isoFLOP comparison, every candidate gets the same compute budget.

Example:

Baseline: 300M model, 30B tokens
Variant:  300M model, 30B tokens

or:

Baseline: 1B model, 50B tokens
Variant:  1B model, 50B tokens

This is the cleanest test if you are asking:

Which recipe gives better performance for the same training cost?

IsoFLOP comparisons are especially important when the candidate changes data quality, data order, optimizer, deduplication, or filtering.
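If you also want a Chinchilla-style isoFLOP sweep around one of these budgets, the token count for each model size follows from the C ~= 6 * N * D approximation. The sketch below is a minimal illustration; `tokens_for_budget` and the chosen model sizes are hypothetical names and values, and the rule of thumb ignores embedding and attention details.

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens a model of size n_params can see under a fixed compute budget (C ~= 6 * N * D)."""
    return compute_flops / (6.0 * n_params)


# Budget of the 300M-parameter, 30B-token example above.
budget = 6.0 * 300e6 * 30e9

# Hypothetical sweep: halve and double the model size at the same compute.
for n_params in (150e6, 300e6, 600e6):
    n_tokens = tokens_for_budget(budget, n_params)
    print(f"{n_params / 1e6:.0f}M params -> {n_tokens / 1e9:.0f}B tokens")
```

Plotting final validation loss against model size for such a sweep gives the isoFLOP valley for that budget.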

The practical plot is validation loss versus FLOPs. For each run, put a point on the chart:

run = model size + tokens + batch + optimizer + data recipe
x   = total training FLOPs so far
y   = held-out validation loss

Then compare the point to the fitted scaling-law band from your previous small runs. If the 300M, 1B, and 3B runs all sit on the same smooth curve, the recipe is behaving predictably. If a run lands far above the band, do not just keep training and hope. Something changed.
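The check described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it fits a pure power law in log-log space (no irreducible-loss term), the data points are synthetic, and the function names are hypothetical. A real monitor would fit measured runs and derive the band width from seed variance.

```python
import math


def fit_power_law(flops, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(flops)."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def log_residual(flops, loss, slope, intercept):
    """Positive residual means the run sits above the fitted curve."""
    predicted_log_loss = intercept + slope * math.log(flops)
    return math.log(loss) - predicted_log_loss


# Synthetic points that follow loss = 10 * C^-0.05 exactly (not real measurements).
ladder_flops = [1e18, 1e19, 1e20]
ladder_losses = [10.0 * c ** -0.05 for c in ladder_flops]
slope, intercept = fit_power_law(ladder_flops, ladder_losses)

# A hypothetical new run that lands 10% above the extrapolated curve.
new_flops = 1e21
new_loss = 1.10 * 10.0 * new_flops ** -0.05
residual = log_residual(new_flops, new_loss, slope, intercept)
print(f"slope={slope:.3f}, log-residual={residual:+.3f}")
```

A residual well outside what repeated seeds produce is the signal to stop and investigate rather than keep training.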

Common interpretations:

Batch condition is part of the experiment

Batch size is not just a throughput parameter. It changes the optimization condition of the run.

For a fixed FLOP budget, increasing batch size usually reduces the number of optimizer updates. That can improve hardware utilization, but if the batch is beyond the useful regime, you may spend the same FLOPs and get worse loss. McCandlish et al. frame this through the gradient noise scale: a measurable statistic that predicts the largest useful batch size and the tradeoff between compute-efficiency and time-efficiency (McCandlish et al., 2018).

So when comparing small-scale pretraining ideas, do not only log:

model size
tokens
FLOPs
validation loss

Also log:

global batch size in tokens
microbatch size
gradient accumulation
learning rate
warmup length
optimizer state
tokens per optimizer step
loss spike history
hardware throughput

If a point falls above the scaling-law band, batch is one of the first things to check. A bad batch/LR condition can make a good data recipe look bad.
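To make the batch effect concrete, the sketch below computes the global batch in tokens and the resulting number of optimizer steps for a fixed token budget. The function names and the example settings (micro-batch of 8 sequences, 2048-token sequences, 16 data-parallel ranks) are illustrative assumptions, not a recommended configuration.

```python
def global_batch_tokens(micro_batch_size, seq_len, grad_accum_steps, data_parallel_size):
    """Tokens consumed per optimizer step across all data-parallel ranks."""
    return micro_batch_size * seq_len * grad_accum_steps * data_parallel_size


def optimizer_steps(total_tokens, batch_tokens):
    """Number of full optimizer updates available for a fixed token budget."""
    return total_tokens // batch_tokens


total_tokens = 30_000_000_000  # the 30B-token budget used in the examples above

# Same FLOPs, different gradient accumulation: fewer updates as the batch grows.
for accum in (4, 16):
    batch = global_batch_tokens(8, 2048, accum, 16)
    steps = optimizer_steps(total_tokens, batch)
    print(f"grad_accum={accum}: {batch:,} tokens/step, {steps:,} steps")
```

Logging both numbers next to each point on the loss-vs-FLOPs chart makes a batch-induced outlier much easier to spot.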

Fixed-token comparison

In a fixed-token comparison, every candidate sees the same number of tokens.

This is useful if you are asking:

Does this data distribution produce better learning from the same amount of text?

But fixed-token comparisons can be misleading across model sizes: at the same token count, a larger model sees fewer tokens per parameter than a smaller one, so the two sit at different points relative to compute-optimal training.

In practice, good studies use both:

6. What to measure at every scale

You need more than one metric.

Training loss

Training loss catches optimization problems and gross data difficulty differences. But lower training loss alone can mean the data is easier, more duplicated, or less diverse.

A recipe that lowers training loss by making the corpus repetitive may not produce a better model.

Validation loss

Validation loss is the primary scaling-law metric. It should be measured on held-out data that is not part of the training corpus.

Use multiple validation slices:

The question is not only whether average validation loss improves. The question is where it improves and where it regresses.

Downstream evaluations

Downstream evals are noisier than validation loss, but they are closer to what users care about.

Evaluate on:

Gadre et al. explicitly address a gap in older scaling work: scaling laws often predict next-token loss in compute-optimal regimes, while real models are often overtrained and judged on downstream tasks. They train 104 models from 0.011B to 6.9B parameters and show how cheaper experiments can predict both validation loss and downstream error in more realistic regimes (Gadre et al., 2024).

Per-source and per-domain loss

If the intervention changes data, track per-source loss.

Example:

General web validation loss:    -0.03
Code validation loss:           +0.02
Math validation loss:           -0.06
Academic validation loss:       -0.01
Multilingual validation loss:   +0.04

This tells you whether the gain is broad or whether you are trading away one capability for another.
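A per-source table like this is easy to screen automatically. The sketch below uses the example deltas above (candidate minus baseline, so positive means the loss got worse); the `find_regressions` helper and its `tolerance` parameter are hypothetical, and a sensible tolerance would come from your own seed-to-seed noise.

```python
# Example per-source validation-loss deltas from the table above.
deltas = {
    "general_web": -0.03,
    "code": 0.02,
    "math": -0.06,
    "academic": -0.01,
    "multilingual": 0.04,
}


def find_regressions(loss_deltas, tolerance=0.0):
    """Return the sources whose validation loss rose by more than `tolerance`."""
    return sorted(name for name, delta in loss_deltas.items() if delta > tolerance)


print("regressed sources:", find_regressions(deltas))
print("regressed beyond 0.03:", find_regressions(deltas, tolerance=0.03))
```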

Validation gain versus downstream regression scatter

A recipe can improve average validation loss while breaking a target domain. That is not a scale-up candidate until the regression is understood.

Memorization and contamination checks

A small model may not memorize a leaked benchmark item, while a larger model can. Contamination risk can therefore increase with scale. So the scaling ladder should include decontamination checks, not just performance checks.

For each candidate recipe, keep track of:

7. The theory: why scaling extrapolation works

The theory is empirical but strong enough to guide engineering.

Kaplan et al. observed that language-model loss can be approximated by power laws in model size, data size, and compute (Kaplan et al., 2020).

A simplified version looks like:

L(N, D) = L_inf + A / N^alpha + B / D^beta

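Where L_inf is the irreducible loss that remains with unlimited model size and data, A / N^alpha is the excess loss from finite model size N, B / D^beta is the excess loss from finite training tokens D, and A, B, alpha, and beta are constants fitted to the observed runs.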

Hoffmann et al. refined the practical conclusion: many large models were undertrained on tokens, and compute-optimal training should scale model size and token count together (Hoffmann et al., 2022).

For pretraining practice, the key point is not the exact exponent. The key point is that curves are often smooth enough that small experiments can predict larger experiments.

But this only works if the experiment is controlled:

If you change five things at once, the scaling curve cannot tell you which thing mattered.
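The simplified loss form above can be evaluated directly to see what "shifting the curve downward" would look like. In the sketch below, every coefficient is a placeholder chosen for illustration, not a value fitted to real runs, and modeling a data-quality improvement as a smaller data coefficient `b` is an assumption made only for this example.

```python
def parametric_loss(n_params, n_tokens, l_inf, a, alpha, b, beta):
    """Evaluate L(N, D) = L_inf + A / N^alpha + B / D^beta."""
    return l_inf + a / n_params ** alpha + b / n_tokens ** beta


# Placeholder coefficients for illustration only; not fitted to any real runs.
baseline_fit = dict(l_inf=1.7, a=400.0, alpha=0.34, b=410.0, beta=0.28)
# Hypothetical recipe improvement, modeled here as a smaller data coefficient.
candidate_fit = dict(baseline_fit, b=380.0)

for n_params, n_tokens in [(100e6, 20e9), (300e6, 30e9), (1e9, 100e9)]:
    base = parametric_loss(n_params, n_tokens, **baseline_fit)
    cand = parametric_loss(n_params, n_tokens, **candidate_fit)
    print(f"N={n_params:.0e}, D={n_tokens:.0e}: baseline={base:.3f}, candidate={cand:.3f}")
```

In practice you would fit the coefficients separately for baseline and candidate runs and compare the extrapolated losses at the target scale.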

8. What it means for a gain to “survive scaling”

A gain survives scaling when it passes several checks.

Check 1: Same sign across scales

If the candidate improves validation loss at 100M, 300M, 1B, and 3B, that is much more credible than a one-off win.

The gain does not have to be identical. It can shrink or grow. But the sign should not randomly flip.
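A same-sign check is simple to automate. The sketch below is a minimal illustration; the loss values are made up for the example and the helper names are hypothetical.

```python
def gain_by_scale(baseline_loss, candidate_loss):
    """Baseline minus candidate loss per scale; positive means the candidate is better."""
    return {scale: baseline_loss[scale] - candidate_loss[scale] for scale in baseline_loss}


def same_sign(gains):
    """True only if every scale shows a strictly positive or strictly negative gain."""
    values = list(gains.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)


# Made-up validation losses for illustration.
baseline = {"100M": 3.40, "300M": 3.10, "1B": 2.80, "3B": 2.60}
stable = {"100M": 3.36, "300M": 3.07, "1B": 2.78, "3B": 2.59}
flipping = {"100M": 3.36, "300M": 3.07, "1B": 2.81, "3B": 2.60}

print("stable candidate keeps its sign:", same_sign(gain_by_scale(baseline, stable)))
print("flipping candidate keeps its sign:", same_sign(gain_by_scale(baseline, flipping)))
```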

Check 2: Stable rank ordering

If you compare several candidate data recipes, the ranking should be similar across scales.

DCLM reports high rank correlation between smaller-scale results and larger 7B-scale results, which supports the idea that data curation can be iterated at small scales before larger confirmation (DCLM).

The practical version is:

At 300M:
Recipe C > Recipe A > Recipe B

At 1B:
Recipe C > Recipe A > Recipe B

At 7B:
Recipe C > Recipe A > Recipe B

This is stronger evidence than “Recipe C won one small run.”
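Rank stability can be quantified with a rank correlation between scales. The sketch below implements Spearman's rho in plain Python for the no-ties case; the recipe scores are invented for illustration, and this is not a claim about how DCLM computes its own correlations.

```python
def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score dicts with the same keys and no tied scores."""
    names = list(scores_a)

    def ranks(scores):
        ordered = sorted(names, key=lambda name: scores[name])
        return {name: position for position, name in enumerate(ordered)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(names)
    squared_diffs = sum((ranks_a[name] - ranks_b[name]) ** 2 for name in names)
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


# Invented eval scores (higher is better) for three recipes at two scales.
scores_300m = {"A": 41.0, "B": 39.5, "C": 43.2}
scores_1b = {"A": 48.1, "B": 46.0, "C": 51.4}

print("rank correlation 300M vs 1B:", spearman_rho(scores_300m, scores_1b))
```

With only a handful of recipes the correlation is coarse, so it is most informative when the small-scale stage compares many candidates.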

Check 3: Loss gain maps to downstream gain

Validation loss is smoother than downstream evals, but downstream evals matter. A recipe that improves loss but not downstream performance may still be useful, but the loss gain alone should not be taken as sufficient evidence for a scale-up.

Gadre et al. model the relationship between perplexity and downstream task error, which is valuable because it connects the cheap, smooth metric to the expensive, noisy metric (Gadre et al., 2024).

Check 4: No hidden regressions

A gain is not a gain if it quietly breaks another capability.

For example:

The scaling ladder should expose these tradeoffs early.

Check 5: The effect size beats run noise

Small gains can be real, but they are easy to confuse with seed noise, data-order noise, checkpoint selection noise, or eval noise.

If a change improves validation loss by 0.1%, you need more replication than if it improves by 3%.
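One crude way to compare an effect to run noise is to repeat both recipes over a few seeds and compare the mean gain to the seed-to-seed standard deviation. The sketch below does only that; the loss values are invented, the helper name is hypothetical, and this is a rough signal-to-noise check rather than a formal significance test.

```python
import statistics


def gain_vs_noise(baseline_losses, candidate_losses):
    """Return (mean gain, seed noise); gain is baseline mean minus candidate mean."""
    gain = statistics.mean(baseline_losses) - statistics.mean(candidate_losses)
    noise = max(statistics.stdev(baseline_losses), statistics.stdev(candidate_losses))
    return gain, noise


# Invented final validation losses over three seeds each.
baseline_runs = [3.00, 3.02, 2.98]
candidate_runs = [2.95, 2.97, 2.93]

gain, noise = gain_vs_noise(baseline_runs, candidate_runs)
print(f"gain={gain:.3f}, seed noise={noise:.3f}, ratio={gain / noise:.1f}")
```

The smaller the ratio, the more seeds you need before the gain is worth carrying to the next scale.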

At minimum:

9. The pretraining experiment matrix

A useful experiment matrix looks like this:

Scale          Params      Tokens          Purpose                     Candidate count
Smoke          10M-50M     1B-5B           Catch pipeline breakage     Many
Small proxy    100M-300M   10B-30B         Fast recipe search          10-50
Medium proxy   400M-1B     30B-150B        Confirm scale transfer      3-10
Large proxy    3B-7B       100B-1T         Pre-final confirmation      1-3
Final          7B+         target budget   Expensive production run    1

The exact sizes depend on budget. The important part is not the specific numbers. It is the shape:

Many cheap experiments, fewer medium experiments, one or two large confirmations.

This is how you avoid spending final-run compute on ideas that were never properly derisked.

10. How to test a data-filtering idea

Suppose you have a new quality filter for web data.

Bad experiment:

Train one 100M model on filtered data.
Compare it to an old baseline.
Declare victory.

Better experiment:

Baseline data: current web mixture
Variant A: mild filter
Variant B: medium filter
Variant C: aggressive filter

Run 100M models for 20B tokens.
Run 300M models for 30B tokens.
Run 1B models for 100B tokens for the top two variants.
Track validation loss by domain.
Track downstream evals.
Track document diversity and deduplication rate.
Track contamination risk.
Fit scaling curves.
Pick the variant whose gains survive and whose regressions are acceptable.

The most important question is not “which filter makes the data cleanest?” It is:

Which filter produces the best model at the final scale and budget?

Those can differ.

Sorscher et al. study data pruning from a scaling-law perspective and show that high-quality data selection can beat naive power-law scaling in some regimes, but the benefit depends on having a good pruning metric (Sorscher et al., 2022). This is exactly why data filters need scale validation.

11. How to test a data-mixture idea

Now suppose you want to change the mixture:

Current:
70% web
10% books
10% code
5% academic
5% math

Candidate:
55% web
10% books
20% code
5% academic
10% math

This is harder than testing a simple filter because the gain may be task-specific.

You need to ask:

The best practice is to evaluate the mixture both globally and by slice. If a mixture improves the headline average but hurts your target domain, it is not the right mixture.

Recent work on data ablations and mixture scaling tries to make this cheaper. For example, scalable data-ablation methods approximate how subsets of a corpus contribute without training every possible mixture from scratch (Scalable Data Ablation Approximations).

12. How to test a tokenizer idea

Tokenizer changes are dangerous because they alter almost everything:

A tokenizer that improves English web loss may hurt code or multilingual text. A tokenizer that reduces token count may make training look cheaper while changing the actual amount of information processed.
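Because per-token loss is not comparable across tokenizers, a common normalization is bits per byte of the underlying text. The sketch below assumes the loss is reported in nats per token (natural-log cross-entropy) and that both tokenizers are evaluated on the same raw text; the numbers are invented and `bits_per_byte` is an illustrative helper, not a library function.

```python
import math


def bits_per_byte(mean_loss_nats_per_token, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte of raw text."""
    total_nats = mean_loss_nats_per_token * n_tokens
    return total_nats / (n_bytes * math.log(2))


# Invented numbers: the same 1,000,000-byte validation text under two tokenizers.
n_bytes = 1_000_000
bpb_a = bits_per_byte(2.8, n_tokens=250_000, n_bytes=n_bytes)
bpb_b = bits_per_byte(3.4, n_tokens=200_000, n_bytes=n_bytes)

# Tokenizer B has the higher per-token loss but the lower bits per byte.
print(f"tokenizer A: {bpb_a:.3f} bits/byte, tokenizer B: {bpb_b:.3f} bits/byte")
```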

For tokenizer experiments, compare:

Do not scale up a tokenizer change from one tiny run. Tokenizers are too entangled with the whole training system.

13. How to test an architecture or optimizer idea

Architecture and optimizer changes are more fragile than data changes.

A data filter may transfer across scales because it changes the distribution the model learns from. An optimizer change may work at 300M and fail at 7B because stability, batch size, gradient noise, and hardware details change.

For architecture or optimizer tests:

The final metric should be quality per unit cost, not loss alone.

14. Why downstream gains can lag behind loss gains

Sometimes validation loss improves before downstream evals improve. This can happen because downstream tasks are noisy, sparse, or thresholded.

Example:

Validation loss improves smoothly:
100M: -0.020
300M: -0.025
1B:   -0.030

Downstream score:
100M: no change
300M: +0.3
1B:   +1.2

This is not necessarily a failure. It may mean the capability only becomes visible once the model has enough capacity.

The reverse can also happen:

Validation loss: tiny change
Code eval: large gain

That may happen when the intervention improves a narrow domain that is diluted in average validation loss.

This is why you need both smooth aggregate metrics and targeted evals.
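The dilution effect is plain arithmetic. In the sketch below, the validation-set weights and the per-domain deltas are invented for illustration: a large improvement on code barely moves the aggregate because code is a small share of the validation mix.

```python
def aggregate_delta(weights, deltas):
    """Weighted average of per-domain loss deltas; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(weights[name] * deltas[name] for name in weights) / total_weight


# Invented validation mix and per-domain loss changes (negative means improvement).
weights = {"web": 0.85, "code": 0.05, "math": 0.10}
deltas = {"web": 0.0, "code": -0.10, "math": 0.0}

print(f"aggregate validation-loss delta: {aggregate_delta(weights, deltas):+.4f}")
```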

15. Why gains sometimes disappear at larger scale

Small-scale gains disappear for several reasons.

The small model was capacity-limited

Cleaner or simpler data may help a small model because it cannot represent the full data distribution. A larger model may no longer need that simplification.

The metric was too narrow

A small eval suite may reward a narrow domain. A larger eval suite reveals regressions.

The token budget changed the answer

A filter may help at 20B tokens but hurt at 1T tokens because diversity becomes more important later.

The intervention changed optimization, not final capability

Some changes make early training faster but do not improve final loss.

The candidate overfit the proxy eval

If you tune repeatedly on the same small eval suite, the recipe can overfit the development benchmark.

The final model is overtrained

Many production models are trained on more tokens than compute-optimal Chinchilla prescriptions because inference cost matters. Gadre et al. study this overtraining regime directly and show why extrapolation should account for both model size and token count (Gadre et al., 2024).
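A simple way to see how far a run sits from the compute-optimal regime is its tokens-per-parameter ratio. The sketch below computes it for the ladder configurations used as examples in this post. The reference value of about 20 tokens per parameter is a commonly cited rule of thumb derived from the Chinchilla results, stated here as an assumption rather than an exact prescription.

```python
# Commonly cited rule of thumb from Chinchilla-style analyses (assumption, not exact).
APPROX_COMPUTE_OPTIMAL_RATIO = 20.0


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


# Example ladder configurations used elsewhere in this post.
ladder = {"100M": (100e6, 20e9), "300M": (300e6, 30e9), "1B": (1e9, 100e9)}

for name, (n_params, n_tokens) in ladder.items():
    ratio = tokens_per_param(n_params, n_tokens)
    multiple = ratio / APPROX_COMPUTE_OPTIMAL_RATIO
    print(f"{name}: {ratio:.0f} tokens/param, about {multiple:.0f}x the rule of thumb")
```

If the final model will be overtrained, it helps to keep the proxy runs at a similar ratio so the extrapolation stays in the same regime.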

16. The role of open scaling suites

Open model suites are useful because they show what controlled scaling looks like.

Pythia trained 16 models from 70M to 12B parameters on public data in the same order, with many checkpoints released throughout training (Pythia). This makes it possible to study how behavior changes with both scale and training time.

OLMo and Dolma are useful because they expose the training data, model checkpoints, and pretraining recipe more openly than most frontier systems (OLMo; Dolma).

DCLM is useful because it turns data curation into a controlled benchmark: fixed model recipes, multiple compute scales, and a focus on whether data decisions transfer from small to larger models (DCLM).

Together, these projects show the practical ideal:

Do not just release a final model. Release enough of the training ladder that people can understand why the final model was expected to work.

17. A decision rule before scaling up

Before moving from small proxy to medium proxy, require:

Before moving from medium proxy to large proxy, require:

Before final training, require:

This sounds bureaucratic, but it is cheaper than wasting a final run.

18. What the final scaling report should contain

A good pretraining scaling report should include:

Experiment table

Run ID
Model size
Token budget
Compute budget
Data mixture
Filter version
Tokenizer version
Optimizer settings
Seed
Hardware
Checkpoint schedule

Loss curves

Show training and validation loss against:

Scaling fit

Show the fitted curve and where each run lands.

At minimum:

Baseline predicted final loss:  X
Candidate predicted final loss: Y
Expected gain:                 X - Y
Uncertainty:                   +/- Z

Downstream eval table

Separate:

Regression table

List the things that got worse. This is where many weak scaling reports fail. They only show wins.

Go / no-go decision

End with a decision:

19. A concrete example

Suppose you want to add more math and code to pretraining because you believe it will improve reasoning.

The small-scale plan:

Baseline: current mixture
Variant A: +5% code, +5% math, -10% web
Variant B: +10% code, +10% math, -20% web
Variant C: +15% code, +15% math, -30% web

Run:

100M for 20B tokens
300M for 30B tokens
1B for 100B tokens on the top two variants

Measure:

General validation loss
Code validation loss
Math validation loss
Web validation loss
Multilingual validation loss
MMLU-like evals with contamination checks
GSM8K-style math evals
HumanEval-style code evals
Internal target tasks

Decision:

The important point is that you are not choosing based on a single score. You are choosing based on a scaling pattern.

20. The mental model

Pretraining is an expensive search problem under uncertainty.

Small models are probes. Medium models are confirmation. Large proxies are rehearsal. The final model is the expensive bet.

Scaling laws are the reason this is possible. They do not remove uncertainty, and they do not guarantee every small-scale improvement will transfer. But they let you replace guesswork with a disciplined sequence:

small experiment
-> controlled comparison
-> scaling curve
-> downstream validation
-> regression audit
-> large-run decision

The best pretraining teams are not the ones that simply train the biggest model. They are the ones that can predict, before the final run, which recipe deserves the compute.

References

