Tag: Pretraining
All the articles with the tag "Pretraining".
-
How to Test Pretraining Ideas at Small Scale Before Betting on a Large Model
· 25 min read
A practical guide to validating pretraining improvements with small proxy models, scaling ladders, isoFLOP budgets, loss curves, downstream evals, and rank-correlation checks before committing to an expensive large-model run.
-
Why Embeddings Cannot Solve Eval-Set Contamination
· 11 min read
A technical deep dive on why semantic embedding search is useful but insufficient for eval-set decontamination: leakage is about evaluation advantage, not just text similarity.
-
Pretraining Contamination: Why "Don't Train on the Test Set" Became Hard
· 14 min read
A practical introduction to LLM pretraining contamination: why benchmark leakage is not ordinary deduplication, how public evals leak into web-scale corpora, and how layered decontamination pipelines reduce risk.