OpenAI's GPT-3 Language Model

Overview of GPT-3 on various language tasks and its implications for society


Source: Photo by Ivy Son from Pexels


Language models have become one of the areas of deep learning where research and development is accelerating by the day. Natural language tasks are trickier for a machine than vision tasks, because they require a conceptual understanding of the data in order to predict the correct output. They also span a wide range: question answering, textual entailment, semantic similarity assessment, document classification, language translation, et cetera.


Many language models achieve state-of-the-art performance on a single task, but not on all tasks at once. OpenAI took this challenge head-on and developed a model called GPT, short for Generative Pre-trained Transformer.


The first GPT model was released on June 11, 2018. GPT-2 followed on February 14, 2019, and GPT-3 on May 28, 2020.


GPT-3 is now trending as one of the most powerful language models in the world. Unlike with GPT-2, OpenAI did not open-source GPT-3, because they found that it can be used unethically in various fields (something we will discuss later in this article). Instead, they built an API for public use, available to the people and organizations who have registered.


Although GPT-3 is one of the most powerful language models in the world, very little is known about it. People and organizations that have registered for the beta version of the API are still experimenting with the model and researching its abilities.


So what makes GPT-3 that powerful and dangerous at the same time?

In this article, we will try to unravel some of the key points and features that make GPT-3 one of a kind.


Motivation

The biggest motivation behind GPT-3 was to scale up the previous model, GPT-2, and to eliminate fine-tuning altogether. This was a different approach from earlier work, and it was prompted by an observation made while training GPT-2: the loss had not reached a plateau and was still going down. That finding led to a new language model with the same architecture but with the parameter count increased from 1.5 billion to 175 billion. The main focus was to build a task-agnostic model that can be trained on a huge corpus of data and still perform language tasks across the whole spectrum. To address this challenge, the team decided not to use any task-specific data or any fine-tuning method.


Source: Language Models are Few-Shot Learners


By not using any task-specific data or fine-tuning, the researchers believed the model would be much more general and task-agnostic. They listed three points in support of their approach:

  1. Relying on large labeled, task-specific datasets constrains the model's ability to generalize. The labeling process also has to be repeated for every new task, so it makes little sense to mine labeled data in order to train a task-agnostic model. Plus, deep learning models need a lot of data to find the patterns and representations that improve understanding and generalization. Hence, GPT-3 is an unsupervised multitask learner.

  2. Larger models like GPT-3 are very expressive and can find good representations of the training set. When such models are fine-tuned, they tend to perform well on the fine-tuning data but do not generalize well outside it, because fine-tuning specializes the model to a narrow task distribution. This can lead to massive overfitting.

  3. GPT-3's main objective is to be more like humans. Humans need little information to understand a subject, and they can extrapolate from it and perform well at other tasks.

In order to achieve this human-like performance, one approach used in this model was meta-learning: a form of learning where a system develops a broad set of skills and pattern recognition abilities during training, and then uses them at inference time for any given task.

Meta-learning in language models can be achieved by "in-context learning". This is where the model is given a task-specific instruction, optionally with a few worked examples, as part of its input text and is asked to complete the task, without any update to its weights. Example: giving a string of text like "the sun rises from the east" with an instruction text of "{translate to french}". In another example: "Lyra is the ____________ constellation" with an instruction text of "{complete the sentence}".
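
To make the idea concrete, here is a minimal sketch of how such a prompt could be assembled as plain text. The `build_prompt` helper and the "Input:/Output:" layout are assumptions made for illustration, not the format used by OpenAI or its API. The point is only that the task is described inside the input itself.

```python
def build_prompt(instruction, demonstrations, query):
    """Assemble a plain-text prompt: instruction, worked examples, then the query.

    The "Input:/Output:" layout is a hypothetical format chosen for this sketch.
    """
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final item is left open so that the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# No demonstrations: only the instruction and the query.
zero_shot = build_prompt("Translate to French.", [], "the sun rises from the east")

# Two demonstrations: the task is also specified by examples inside the prompt.
few_shot = build_prompt(
    "Translate to French.",
    [("good morning", "bonjour"), ("thank you", "merci")],
    "the sun rises from the east",
)
print(few_shot)
```

In both cases the model's weights stay untouched. The only thing that changes between tasks is the text placed in front of the model.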


In-context learning tends to become more accurate as the number of model parameters increases, because the model extracts better patterns and representations as it grows bigger.


Architecture, Data, Training, and Inference


Architecture

GPT-3 uses the same model and architecture as GPT-2, with the exception that it uses sparse attention patterns in some of its layers, similar to the Sparse Transformer that OpenAI developed earlier. Like GPT-2, the model is trained to predict what comes next in the sequence.


OpenAI trained eight sizes of GPT-3, from the smallest with 125M parameters to the largest with 175B parameters. The smallest model has 12 attention layers, each with 12 heads of dimension 64, while the largest model uses 96 attention layers, each with 96 heads of dimension 128.
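
As a rough sanity check, the layer and head counts above can be turned into an approximate parameter count. The sketch below rests on assumptions that are not stated in this article: roughly 12 x d_model^2 weights per layer (attention projections plus a feed-forward block four times as wide as the model), a vocabulary of 50,257 tokens, a context of 2,048 positions, and no biases or layer-norm weights. The function name is hypothetical.

```python
def approx_parameters(n_layers, n_heads, d_head, vocab_size=50257, n_ctx=2048):
    """Rough parameter count for a decoder-only transformer.

    vocab_size and n_ctx are assumed values, and biases and layer-norm
    weights are ignored.
    """
    d_model = n_heads * d_head
    # Per layer: about 4 * d_model^2 for the attention projections (Q, K, V, output)
    # plus about 8 * d_model^2 for a feed-forward block with hidden width 4 * d_model.
    per_layer = 12 * d_model ** 2
    # Token embedding table plus position embedding table.
    embeddings = (vocab_size + n_ctx) * d_model
    return n_layers * per_layer + embeddings


small = approx_parameters(n_layers=12, n_heads=12, d_head=64)
large = approx_parameters(n_layers=96, n_heads=96, d_head=128)
print(f"smallest: {small / 1e6:.0f}M parameters")
print(f"largest:  {large / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands close to the 125M and 175B figures quoted above, which suggests the layer and head counts are consistent with the reported sizes.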


It is also worth mentioning that the larger models used larger batch sizes and smaller learning rates, while the opposite was true for the smaller models.


Source: Language Models are Few-Shot Learners


Data

The training corpus comes mainly from Common Crawl, which contains almost a trillion words.


Since the dataset was not very well curated, three measures were taken to improve it:

  1. The dataset was filtered based on its similarity to a range of high-quality reference corpora.

  2. Fuzzy deduplication was performed at the document level, which removes redundancy and ensures that the integrity of the held-out validation dataset is not compromised.

  3. The dataset was mixed with high-quality reference corpora to increase its diversity.
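
The second measure, fuzzy deduplication, can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline OpenAI used: documents are compared by the Jaccard similarity of their word shingles, the shingle length `k` and the `threshold` are arbitrary, and the all-pairs comparison would be far too slow at web scale, where an approximate method such as MinHash is the usual choice.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(documents, threshold=0.8, k=3):
    """Keep a document only if it is not too similar to one that was already kept."""
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "language models are trained on large amounts of text from the web",
]
unique = deduplicate(docs, threshold=0.8)
print(len(unique), "of", len(docs), "documents kept")
```

The first two documents differ only in their last word, so they count as near-duplicates and only the first one is kept. An exact string comparison would have missed this.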

It is also important to understand that GPT-3 has to deal with data contamination. The training dataset is sourced from the internet, so it is possible that it overlaps with some of the test datasets.


To tackle the issue of data contamination, the OpenAI team produced a clean version of the test dataset for each downstream task, with all potentially leaked samples removed.
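
One simple way to build such a clean version is to drop every test example that shares a word n-gram with the training data. The sketch below assumes exactly that. The function names, the whitespace tokenization and the n-gram length `n=5` are illustrative choices, not the settings reported by OpenAI.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text (empty if the text is shorter than n)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def split_clean_and_contaminated(test_examples, training_docs, n=5):
    """Separate test examples that share any n-gram with the training data."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    clean, contaminated = [], []
    for example in test_examples:
        if ngrams(example, n) & seen:
            contaminated.append(example)
        else:
            clean.append(example)
    return clean, contaminated


training_docs = ["the capital of france is paris and it lies on the seine"]
test_examples = [
    "Question: the capital of france is paris true or false",
    "which planet is known as the red planet",
]
clean, contaminated = split_clean_and_contaminated(test_examples, training_docs)
print("clean:", clean)
print("contaminated:", contaminated)
```

Scoring the model on the full test set and on the clean subset then shows how much the leaked examples actually change the result.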


But it turns out that data contamination had almost no effect on GPT-3's results.


Training

Since the model is so massive, the bottleneck was not time but memory: training a larger model requires a lot of memory. To train the model without running out of memory, a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network was used. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft [1].
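
Model parallelism within a matrix multiply can be illustrated with a toy example. In the sketch below the weight matrix is split column-wise into shards, each shard is multiplied separately as if it lived on its own device, and the partial outputs are concatenated. The "devices" are simulated with plain Python lists, and the function names are hypothetical. It shows the idea only, not OpenAI's training code.

```python
def matmul(x, w):
    """Plain matrix multiply: x is (rows x inner), w is (inner x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_devices):
    """Split the weight matrix w column-wise into at most n_devices shards."""
    n_cols = len(w[0])
    step = (n_cols + n_devices - 1) // n_devices
    return [[row[start:start + step] for row in w] for start in range(0, n_cols, step)]


def parallel_matmul(x, w, n_devices):
    """Each simulated device multiplies x by its own column shard of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_devices)]
    # Concatenate the partial outputs column-wise to rebuild the full result.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = parallel_matmul(x, w, n_devices=2)
print(sharded == full)
```

Each shard only needs to hold its own slice of the weights, which is what relieves the memory pressure. Parallelism across layers works along the other axis: consecutive layers are placed on different devices and activations are passed from one to the next.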


Inference

Being a task-agnostic language model, GPT-3 can perform all the downstream tasks with one model, unlike other language models such as BERT and RoBERTa, which need to be fine-tuned for each downstream task.


Results

GPT-3 performed incredibly well across the whole spectrum of language tasks.


In language modeling and completion tasks, GPT-3's accuracy was above 70%. For evaluation, the LAMBADA dataset was used, and Wikipedia-based benchmarks were skipped entirely because Wikipedia was already part of the training dataset. As it turns out, on the LAMBADA dataset in the few-shot setting, GPT-3's accuracy grew towards human-level performance.
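
For readers unfamiliar with this kind of benchmark, accuracy on a word-prediction task such as LAMBADA boils down to checking whether the model recovers the held-out last word of a passage. The sketch below shows that scoring loop under simplifying assumptions: whitespace tokenization, exact-match comparison, and a toy stand-in (`toy_predictor`) instead of a real language model. The passages are made up for illustration and are not taken from the dataset.

```python
def last_word_accuracy(passages, predict_last_word):
    """Fraction of passages whose held-out final word is predicted exactly."""
    correct = 0
    for passage in passages:
        *context, target = passage.split()
        if predict_last_word(" ".join(context)) == target:
            correct += 1
    return correct / len(passages)


def toy_predictor(context):
    """Hypothetical stand-in for a model: repeat the last capitalized word seen."""
    names = [word for word in context.split() if word[0].isupper()]
    return names[-1] if names else ""


passages = [
    "Anna handed the keys to Mark and then she thanked Mark",
    "The door was locked so he reached for the key",
]
accuracy = last_word_accuracy(passages, toy_predictor)
print(f"accuracy: {accuracy:.0%}")
```

A real evaluation would replace `toy_predictor` with a call to the language model and would have to handle punctuation and multi-token words.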


Source: Language Models are Few-Shot Learners


In another example, closed-book trivia, GPT-3 surpassed the SOTA model from "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".


Source: Language Models are Few-Shot Learners


GPT-3's performance is extraordinary. It has shown that increasing the size of a model can increase its capacity to pick up more patterns and to develop something like the meta-learning skill that humans have.


Implications for Society

Human beings are a reflection of the information or knowledge they absorb.

GPT-3 has indeed achieved an amazing feat, but it has its disadvantages.

  1. Since GPT-3 has been trained on real data from the internet, it reproduces the biases in that data, including sexism. For instance, its output tends to lean towards a male-dominated view of society rather than reflect gender equality.

  2. For the same reason, GPT-3 shows bias with respect to race and religion.

  3. Articles produced by GPT-3 are very convincing to human readers, and it is very difficult, almost impossible, to tell that an algorithm generated them. This means it can be used for wrong purposes.


Source: Language Models are Few-Shot Learners

What's Next

GPT-3 has come close to passing the Turing test when generating short articles. Although GPT-3 is a dangerous model if it gets into the wrong hands, it is also a big leap in understanding what knowledge is and, apparently, how the brain works. The problem with GPT-3 is that it does not have long-term memory storage, so it relies entirely on associative memory. Experiments have shown that the accuracy of such models increases as their size increases.


So does that mean GPT-n can achieve human-level intelligence and come near enough to what is called Artificial General Intelligence?


Well, it depends on how we describe general-purpose intelligence. Although GPT-3 has been observed approaching human-level performance in a Turing-test-like setting, it is also worth remembering that trade-offs are essential in every part of evolution. So, if AGI means relying completely on associative memory, then yes. But if it also includes long-term memory, being task-agnostic, and the understanding needed to extrapolate information from one field to another, then no: it has not reached that level. Developing these models is also very expensive and hard on the environment, which is another constraint that we are facing right now.


But rest assured that GPT-3 has opened new doors for cognitive and AI research, which means new approaches will come and revolutionize AI on the way to general-purpose intelligence.


Meanwhile, we also need to find a way to make AI safe.


References

  1. Language Models are Few-Shot Learners (GPT-3)

  2. All the images are taken from the official GPT-3 paper, "Language Models are Few-Shot Learners"

  3. Improving Language Understanding with Unsupervised Learning (GPT)

  4. Better Language Models and Their Implications (GPT-2)

  5. OpenAI's GPT-3 Language Model: A Technical Overview

  6. OpenAI released a beta version of its language model, GPT-3. As artificial writing permeates our lives, the challenge is how to think clearly about what it is and what impact it could have on society.

  7. Generative Modeling with Sparse Transformers

  8. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  9. The LAMBADA dataset: Word prediction requiring a broad discourse context