Description
Scaling Laws for Neural Language Models
https://arxiv.org/abs/2001.08361
Summary:
This research paper empirically investigates scaling laws for the performance of Transformer-based language models. The authors find that cross-entropy loss scales predictably as a power law with model size, dataset size, and training compute, while depending only weakly on architectural details such as network width or depth. Simple equations govern the dependence of overfitting on model and dataset size and of training speed on model size, enabling optimal allocation of a fixed compute budget. The study shows that larger models are significantly more sample-efficient, so compute-optimal training involves very large models trained on a relatively modest amount of data and stopped well before convergence. These findings offer a predictive framework for future language model development.
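
For reference, the core scaling relations in the paper take a power-law form; a sketch in the paper's notation, with the fitted exponents quoted only approximately, is:

  L(N) = (N_c / N)^{\alpha_N},          \alpha_N \approx 0.076   (loss vs. non-embedding parameters N)
  L(D) = (D_c / D)^{\alpha_D},          \alpha_D \approx 0.095   (loss vs. dataset size D in tokens)
  L(N, D) = [ (N_c / N)^{\alpha_N / \alpha_D} + D_c / D ]^{\alpha_D}   (combined law governing overfitting)

Here L is the test cross-entropy loss and N_c, D_c are fitted constants; the small exponents are what make performance improve smoothly but slowly over many orders of magnitude of scale.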