Week after week, AI research has something surprising to offer. The latest paper, from a team at Microsoft, defies previous expectations: a model as tiny as 1.3B parameters, trained for only 4 days on 8x A100 GPUs, and (wait for it) trained on textbook-quality data (6B tokens) of the kind available all over the internet.
What is special about the Phi-1 model?
The researchers fed the network textbook-quality data filtered from the web, together with synthetically generated textbooks and exercises, and attained very high performance on two important benchmarks.
1) HumanEval, which has 164 original programming problems assessing the model on language comprehension, mathematics, and algorithms, similar to the questions asked in coding interviews.
2) MBPP (Mostly Basic Python Problems), which has around 1,000 crowd-sourced Python programming problems suitable for entry-level programmers.
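To give a concrete feel for what these benchmarks test, here is a hypothetical MBPP-style item (not drawn from either dataset): a short natural-language task, a reference solution, and assert-based tests that a model's generated code must pass.

```python
# Hypothetical example in the style of an MBPP problem (not from the real dataset).
# Task: "Write a function that returns the second largest number in a list."

def second_largest(numbers):
    """Return the second largest distinct value in the list."""
    unique = sorted(set(numbers), reverse=True)
    return unique[1]

# MBPP-style test cases: a model's generated solution is scored by
# whether it passes asserts like these.
assert second_largest([1, 2, 3, 4]) == 3
assert second_largest([10, 10, 5]) == 5
assert second_largest([-1, -2, -3]) == -2
```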
Why is this breakthrough important?
1) Reduced Model Size: This is arguably the smallest model to achieve such strong performance on these benchmarks. Comparable models of similar size, such as OPT-1.3B and GPT-Neo-1.3B, come nowhere near it in performance.
2) Deployment on Local Devices: Because of its small size, the model can potentially be deployed on local devices without the need for a GPU. This means being able to use the model on personal computers, and even phones, without an internet connection.
3) Low Training Time: A training time of only 4 days is a notable achievement in machine learning. The largest models, such as GPT-3 and GPT-4, take over a month to train.
4) Emphasising Data Quality over Quantity: We are seeing this theme again and again. As demonstrated in the LIMA (Less Is More for Alignment) paper from Meta, data quality matters far more than data quantity.
5) Emergent Phenomena: The narrative in the field has been that as you increase model size (the parameter count), emergent abilities are observed, or are more likely to be observed. However, the authors claim to see emergent properties in this model compared to the phi-1-base model. This raises a number of questions about the importance of parameter count and its relationship with emergence.
Overall, we love to see new models released every week, or even better, every day (not really). The model will be available for testing and experimentation on Hugging Face. In the meantime, enjoy reading the paper.
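Once it lands on the Hub, trying it out should be a few lines of standard transformers code. The sketch below is an assumption about how that will look; the repo id "microsoft/phi-1" is hypothetical until the release is confirmed.

```python
# Minimal sketch: loading phi-1 from the Hugging Face Hub after release.
# The repo id "microsoft/phi-1" is an assumption; check the Hub for the actual name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt the model with a function signature and docstring, as in HumanEval.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At 1.3B parameters this should fit comfortably in CPU memory, which is exactly the local-deployment point made above.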
Link to Paper - https://arxiv.org/pdf/2306.11644.pdf