The size of a neural network is usually measured by how many parameters it has. A network's parameters are the values it learns during training, chiefly its weights and its biases.
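The simple 3,4,2 network pictured below has 20 weights (represented by arrows): 3 × 4 = 12 between the input and hidden layers, and 4 × 2 = 8 between the hidden and output layers. Counting weights alone gives a parameter count of 20. If each hidden and output neuron also has a bias, that adds 6 more, for 26 in total.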
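The counting above can be sketched in a few lines of Python. This is a minimal illustration for fully connected networks only. The function name `count_parameters` and its `include_biases` flag are made up for this example and do not come from any library.

```python
def count_parameters(layer_sizes, include_biases=True):
    """Count the parameters of a fully connected network.

    layer_sizes lists the number of neurons per layer, e.g. [3, 4, 2].
    (Hypothetical helper, written for illustration only.)
    """
    # One weight for every connection between adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every neuron outside the input layer.
    biases = sum(layer_sizes[1:]) if include_biases else 0
    return weights + biases


print(count_parameters([3, 4, 2], include_biases=False))  # 20 (weights only)
print(count_parameters([3, 4, 2]))                        # 26 (weights + biases)
```

Adding a second hidden layer of 4 neurons, `[3, 4, 4, 2]`, raises the weight count from 20 to 36, which is the growth the next paragraph describes.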
It follows that adding hidden layers or neurons (represented as peach circles) to the network increases the parameter count. This is why parameter count is used as a measure of a neural network's size.
If we think of neural networks as artificial brains, then this is the equivalent of increasing the number of connections between neurons in the brain. In brains, the number of connections is dynamic and changes in response to how often and how strongly they are activated, like a muscle. In neural networks, the number of parameters is typically fixed before training starts.
Jared Kaplan and colleagues at OpenAI discovered a set of scaling laws. They were derived specifically for large language models, but the results are thought to apply to most transformer models. These graphs illustrate their findings.
These graphs show how test loss falls as compute, dataset size, and parameter count increase. Test loss is how the performance of a neural network is measured. It quantifies the difference between the model's output and the desired output on data held out from training. Therefore, a smaller loss means a better model.
This tells us that by increasing all three of these factors, we can expect the performance of neural networks to continue to improve. Note that this only holds when all three are scaled together. If one of them is held back, it becomes a bottleneck and the gains from scaling the others diminish.
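The bottleneck effect can be illustrated with the combined fit that Kaplan et al. give for loss as a function of parameter count N and dataset size D. The sketch below is illustrative only. The function name `loss_nd` is made up for this example, and the default constants are approximate values as reported in that paper (for non-embedding parameters and training tokens), not figures taken from this article.

```python
def loss_nd(n_params, n_tokens,
            n_c=8.8e13, d_c=5.4e13,      # approximate fitted constants (assumed)
            alpha_n=0.076, alpha_d=0.095):  # approximate fitted exponents (assumed)
    """Predicted test loss from the combined power-law form
    L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D.
    """
    model_term = (n_c / n_params) ** (alpha_n / alpha_d)
    data_term = d_c / n_tokens
    return (model_term + data_term) ** alpha_d


# Fix the dataset at 1B tokens and grow only the model.
small_data = 1e9
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}, D={small_data:.0e}: loss={loss_nd(n, small_data):.3f}")

# The loss cannot fall below the floor set by the dataset alone.
data_floor = (5.4e13 / small_data) ** 0.095
print(f"floor from data alone: {data_floor:.3f}")

# Growing the dataset along with the model lowers the loss further.
print(f"N=1e11, D=1e11: loss={loss_nd(1e11, 1e11):.3f}")
```

Under these assumed constants, growing the model alone gives ever smaller improvements as the loss approaches the data floor, whereas growing both together keeps reducing it.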
This discovery was quickly absorbed by the field, and networks, especially language models, were made bigger with each publication.
This continued until DeepMind released a research paper and an accompanying model called Chinchilla. The paper investigated the relationship between dataset size and parameter count, training hundreds of models in order to estimate it.
Prior to Chinchilla, Gopher was DeepMind's most impressive language model, at 280B parameters. Chinchilla was developed following the new scaling relationship DeepMind discovered. It outperforms Gopher despite having only 70B parameters and using the same compute budget.
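A rough sanity check of the equal-compute comparison is possible with a common approximation: training costs about 6 floating-point operations per parameter per training token. That approximation and the token counts used below (roughly 300B tokens for Gopher and 1.4T for Chinchilla, as reported by DeepMind) are not stated in this article, so treat them as assumptions. The helper name `training_flops` is likewise made up for this sketch.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute as 6 * N * D FLOPs (a rule of thumb)."""
    return 6 * n_params * n_tokens


# Parameter counts are from the text above; token counts are assumed.
gopher_params, gopher_tokens = 280e9, 300e9
chinchilla_params, chinchilla_tokens = 70e9, 1.4e12

gopher_flops = training_flops(gopher_params, gopher_tokens)
chinchilla_flops = training_flops(chinchilla_params, chinchilla_tokens)

print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")
print(f"Compute ratio: {chinchilla_flops / gopher_flops:.2f}")

# Tokens seen per parameter shows how differently the budget was spent.
print(f"Gopher tokens per parameter:     {gopher_tokens / gopher_params:.1f}")
print(f"Chinchilla tokens per parameter: {chinchilla_tokens / chinchilla_params:.1f}")
```

Under these assumptions the two budgets land within about 20% of each other, while Chinchilla sees roughly 20 tokens per parameter against roughly 1 for Gopher. In other words, a similar budget was spent on a smaller model and far more data.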
This research shows that bigger is not always better for neural networks. Instead, finding the right ratio between the number of parameters and the amount of training data matters more for training a neural network efficiently. The new trend is optimisation rather than scale.
GPT-3 175B, OpenAI’s largest public language model, was trained before this discovery. When GPT-4 is released, we should expect it to be around the same size as GPT-3, or perhaps even smaller. Recently, OpenAI found that through better hyperparameter tuning (hyperparameters are the settings of the training process, e.g. learning rate, learning rate schedule, batch size, optimiser, and width-to-depth ratio) they could halve the size of a 13B parameter version of GPT-3 whilst improving its performance.
In conclusion, larger networks are not always better. A number of different features of a neural network feed into how well it performs, and finding the optimal balance between them is an ongoing area of research. Two years ago the consensus was that bigger was always better. Now it is that parameter and hyperparameter optimisation is key.