A simple story about Scaling Laws of LLMs
We know that building LLMs (Large Language Models) is a long and costly process. For example, to train a model like GPT-3, which has roughly 175 billion parameters, several thousand GPUs must run continuously for months. This fact always raises the question: how do the creators of these models know that, after all this computational expense and months of waiting, the model's error will drop as far as they want and they will end up with a model of acceptable intelligence? Put differently, to build a model with a specific accuracy, how many parameters and how much data are needed? How do they know how much computational budget is required for a model with N parameters and D tokens of data? It is not practical to re-run training for every change of a variable, the way you might with a Random Forest or SVM, just to answer these questions; so they must have done something about this.
OpenAI lit the first light on the way to solving this big problem. When OpenAI built GPT-2 and saw that it performed better than GPT-1, they realized there might be an oil well here: unlike the encoder-only models (the BERT family), which did not keep improving as they scaled, there was real hope in this direction. So they decided to scale the model up, on the assumption that the larger the model, the more accurate it becomes. But this is where the uncertainties began. OpenAI did not know exactly how far increasing the model size would keep making it smarter, nor how many parameters and how much data should be fed to the model so that it fits properly and, God forbid, does not end up under-fitted or over-fitted after all that expense. Jared Kaplan, a theoretical physicist working with OpenAI, took on this problem.
As I said in the first paragraph, it is not practical to change one variable at a time until you hit the best configuration; yet they did something similar, with one difference. They trained models of many different sizes on a fixed dataset over several months, with the goal of doing this tedious process once, extracting some rules from it, and then producing future models without repeating it, making everything predictable. After much struggle with these LLMs, they drew several important conclusions, which led them to publish a paper in 2020 ("Scaling Laws for Neural Language Models") and to reach the GPT-3 model. In brief, the paper announced these results:
- The model's performance is driven mainly by three factors: model size, data volume, and computational budget. Each of these has a power-law relationship with the model's performance.
- The type of model, architectural details, and the width and depth of the network have minimal impact and can almost be ignored (assuming we are talking about auto-regressive models).
- Simple rules govern the relationship between model size, dataset size, and over-fitting, so it is easy to predict the computational budget needed to train a model.
- Since large models need less data to learn (they are more sample-efficient), the best way to spend a computational budget is to train as large a model as possible.
- As model size and data volume grow together, the model's performance improves; but if only one of them grows while the other is held fixed, you run into diminishing returns. The relationship between optimal data volume and model size is roughly D ∝ N^0.74: for example, if the model size increases 8x, the number of tokens should increase about 5x (see the short sketch after this list).
- And other minor points that need not be detailed here.
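To make that last rule concrete, here is a minimal sketch of the scaling rule in Python. It assumes only the exponent of roughly 0.74 quoted above; the function name is mine, not something from the paper.

```python
# A minimal sketch of the Kaplan-style data-scaling rule: if the optimal data
# volume grows as D ∝ N^0.74, then scaling the model by a factor k means
# scaling the token count by roughly k**0.74.

def data_scale_factor(model_scale_factor: float, exponent: float = 0.74) -> float:
    """How much the token count should grow when the model grows by model_scale_factor."""
    return model_scale_factor ** exponent

if __name__ == "__main__":
    k = 8  # model made 8x larger
    print(f"model x{k} -> data x{data_scale_factor(k):.1f}")  # prints ~4.7, i.e. roughly 5x
```

Running it prints a factor of about 4.7, which is where the "about 5x" in the example above comes from.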
For about two years, this paper was the reference guide for building LLMs, until 2022, when DeepMind released a new paper ("Training Compute-Optimal Large Language Models") that not only questioned the results of the earlier OpenAI paper but even called OpenAI's trained models, including GPT-3, into question. The abstract of this paper, better known as the "Chinchilla scaling laws", essentially asks: who said that if the model size increases 8x, the data should increase 5x? Who said that the larger the model, the less data it needs? No. The relationship between model size and data volume is linear; if the model size increases 8x, the data should also increase 8x. Therefore, the models you have built so far (including GPT-3) have all been under-trained from the start.
However, this is not even the important part of the paper. More exciting than this claim is that it finally offers a way to answer the questions raised at the beginning of this note. The authors first confirm that yes, model size, data volume, and computational budget are the three factors that drive LLM performance, and a power-law relationship holds between them; but saying that is not enough. What we really want to know is: for a fixed computational budget C, how should we choose the model size N and the number of tokens D so that the budget is used in the most efficient way? To answer this, they first show that the three quantities are tied together by the approximate relationship C = 6·N·D; that is, six times the number of parameters times the number of tokens gives the computational cost in FLOPs.

Building on this, they fit a formula that predicts the model's loss without training anything: plug in the number of tokens and the model size and you get the expected loss, and for a fixed budget C you can also read off the N and D that spend that budget most efficiently. Even though 2022 was the springtime of LLMs, with many papers and products (including ChatGPT) appearing, I think this is the most important paper of that year. It is truly a masterpiece: without spending a dollar, you can work out what computational cost and how much data it takes to reach an LLM with a given accuracy.
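To make this concrete, below is a minimal sketch (not the paper's code) of how such a prediction can work. It assumes the parametric loss from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, with the fitted constants reported there (approximately E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28), plus the C = 6·N·D approximation quoted above; the closed-form split of the budget comes from minimizing this loss subject to 6·N·D = C. The function names and the example budget are mine.

```python
# A minimal sketch: predict loss from N and D with the parametric form
# L(N, D) = E + A / N**alpha + B / D**beta, and split a fixed FLOP budget C
# between parameters and tokens under the approximation C = 6 * N * D.

E, A, B = 1.69, 406.4, 410.7   # fitted constants reported in the Chinchilla paper (approximate)
alpha, beta = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_split(c_flops: float) -> tuple[float, float]:
    """(N_opt, D_opt) minimizing the predicted loss subject to 6 * N * D = C."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (c_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = c_flops / (6.0 * n_opt)
    return n_opt, d_opt

if __name__ == "__main__":
    c = 6 * 70e9 * 1.4e12   # a hypothetical budget: the FLOPs of a 70B model on 1.4T tokens
    n, d = compute_optimal_split(c)
    print(f"N_opt ~ {n:.2e} params, D_opt ~ {d:.2e} tokens, "
          f"predicted loss ~ {predicted_loss(n, d):.3f}")
```

The point is not the exact numbers it prints but the workflow: given only a FLOP budget, the fitted formula tells you a sensible parameter count, token count, and expected loss before any training starts.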
The paper also criticizes other aspects of the earlier work. Among them: the earlier paper studied mostly small models, even though the whole point is to examine scaling laws! The models analyzed in the DeepMind paper range up to around 500B parameters, while many of the models studied in the earlier paper were around 100M parameters. The authors eventually trained and introduced a reference model called Chinchilla: despite having only 70 billion parameters, it is as capable as a 175-billion-parameter model, which confirmed their hypotheses.
Up to this point, we have seen that the model's loss can be predicted without training it, given the number of tokens, the computational cost, and the model size, so not everyone who wants to build an LLM has to walk the whole road to the end. But the story does not end here. After this paper was published, Harm de Vries raised an important question on his blog. It is true, he says, that the DeepMind paper opened a great path for us, and that with its formula we can predict the model's loss for a fixed computational budget, data volume, and model size; but maybe we do not want the computational cost to be the thing we optimize. In fact, he looks at the issue from this angle: if model accuracy is a triangle with three sides (data volume, model size, computational cost), then as a rule of thumb we hold one side fixed and vary the other two. The DeepMind paper holds the computational cost fixed and varies the rest. But our constraint may not be compute; we may want to hold the data or the model size fixed and accept whatever computational cost follows. The advantage of this view is that the model can be kept relatively small: you accept a computational overhead and a longer training time in exchange for lower latency at inference and a more production-friendly model. Reworking the paper's analysis, he arrives at a formula that predicts the computational overhead when your model size is fixed and smaller than the compute-optimal (Chinchilla-style) one. If you have not read either of the two papers above, you have not missed much, but definitely read this short blog post.
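As a rough illustration of this idea (not de Vries' exact derivation), the sketch below reuses the same parametric loss and constants as the previous sketch, repeated so it runs on its own: fix a model at half the compute-optimal size, ask how many tokens it needs to match the compute-optimal loss for the same budget, and report the resulting compute overhead.

```python
# A rough illustration in the spirit of de Vries' question, not his exact formulas.
# Constants and helpers repeat the earlier sketch so this file runs on its own.

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # same approximate fitted constants

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_split(c_flops: float) -> tuple[float, float]:
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (c_flops / 6.0) ** (beta / (alpha + beta))
    return n_opt, c_flops / (6.0 * n_opt)

def tokens_to_match(target_loss: float, n_params: float) -> float:
    """Tokens a fixed-size model needs to reach target_loss; impossible if the model is too small."""
    gap = target_loss - E - A / n_params**alpha
    if gap <= 0:
        raise ValueError("A model of this size can never reach the target loss.")
    return (B / gap) ** (1.0 / beta)

if __name__ == "__main__":
    c = 6 * 70e9 * 1.4e12                    # same hypothetical budget as before
    n_opt, d_opt = compute_optimal_split(c)
    target = predicted_loss(n_opt, d_opt)    # the best loss this budget can buy
    n_small = n_opt / 2                      # a deliberately smaller, inference-friendly model
    d_needed = tokens_to_match(target, n_small)
    overhead = 6 * n_small * d_needed / c    # total compute relative to the optimal budget
    print(f"tokens needed ~ {d_needed:.2e}, compute overhead ~ {overhead:.2f}x")
```

With these constants the overhead for a half-size model comes out modest (on the order of tens of percent extra compute), which is exactly the trade the blog argues can be worth making for a cheaper model at inference time.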
Finally, only one question remains, and it still has no clear answer: how far do the scaling laws hold? That is, how far can we keep increasing data and model size and still watch the loss go down and end up with a smarter model? For now the working assumption is that, given unlimited compute and data, you can keep building smarter models. This is partly why the AGI discussion has taken off in recent years, and Sam Altman also touched on this issue in an interview a while ago.