GPT

BERT: word masking, Transformer encoder architecture
GPT: next-word prediction, Transformer decoder architecture
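A minimal sketch of what the encoder/decoder distinction means for self-attention, assuming a sequence of n tokens. The function names are illustrative and do not come from either paper: an encoder lets every position attend to every other position, while a decoder restricts each position to itself and earlier positions, which is what makes next-word prediction possible.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (GPT): position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for name, mask in [("encoder", bidirectional_mask(4)),
                       ("decoder", causal_mask(4))]:
        print(name)
        for row in mask:
            print(" ".join(str(v) for v in row))
```

A 1 in row i, column j means token i can see token j. The lower-triangular decoder mask hides all future tokens from each position.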
GPT1
「Improving Language Understanding by Generative Pre-Training」

- Fine-tunes the model using special tokens such as <S>, <E>, and $
- 12-layer decoder-only transformer
- 12 attention heads / 768-dimensional states
- GELU activation unit
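The special tokens can be pictured as follows: a rough sketch, assuming <S> marks the start, $ separates two text segments, and <E> marks the end of the input. Whitespace splitting stands in for the real subword tokenizer, and the function names are hypothetical.

```python
START, DELIM, END = "<S>", "$", "<E>"


def format_classification(text):
    """Single-text task: wrap the tokens in start and end markers."""
    return [START, *text.split(), END]


def format_entailment(premise, hypothesis):
    """Two-text task: join both segments with the delimiter token."""
    return [START, *premise.split(), DELIM, *hypothesis.split(), END]


if __name__ == "__main__":
    print(format_classification("the movie was great"))
    print(format_entailment("a man is sleeping", "a person is awake"))
```

Because every task is flattened into one token sequence, the same pre-trained decoder can be fine-tuned on different tasks with only a small task-specific output layer on top.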
BERT

- Pre-training of Deep Bidirectional Transformers for Language Understanding
- Trained with a masked language modeling task
- Uses large-scale data and a large-scale model
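A simplified sketch of how masked language modeling inputs and labels can be built. It makes two assumptions: tokens are plain strings, and every selected token is replaced by [MASK]. The BERT paper additionally swaps some selected tokens for random tokens or leaves them unchanged. The 0.15 default follows the masking rate reported in the paper; `mask_tokens` is a hypothetical helper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(mask_tokens(sentence, mask_prob=0.3))
```

Unlike next-word prediction, the model sees context on both sides of each masked position.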
cf) Unlike GPT1, GPT2 and GPT3 do not rely on task-specific special tokens.
GPT2
- Really big transformer LM
- Language models are unsupervised multi-task learners
- Trained on 40GB of text: web pages from Reddit links with high karma
- Performs downstream tasks in a zero-shot setting, without any parameter or architecture modification
GPT3
「Language Models are Few-Shot Learners」
- Improved task-agnostic performance
- In-context learning: the model picks up the task from the context, without fine-tuning
- Few-shot performance
- Autoregressive language model with 175 billion parameters in the few-shot setting
- 96 Attention layers
- Batch size of 3.2M tokens
- 175B parameters
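A back-of-the-envelope check of the parameter count, using the common approximation of 12 · n_layers · d_model² for the transformer blocks. It ignores embeddings, biases, and layer norms. The hidden size d_model = 12288 is taken from the GPT-3 paper, not from the notes above, and the function name is illustrative.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough block-only parameter count for a standard transformer stack.

    Per layer: 4 * d^2 for attention (Q, K, V, and output projections)
    plus 8 * d^2 for a feed-forward block with hidden size 4 * d.
    """
    return 12 * n_layers * d_model ** 2


if __name__ == "__main__":
    gpt3 = approx_transformer_params(n_layers=96, d_model=12288)
    print(f"GPT3 blocks: {gpt3 / 1e9:.1f}B")  # GPT3 blocks: 173.9B

    gpt1 = approx_transformer_params(n_layers=12, d_model=768)
    print(f"GPT1 blocks: {gpt1 / 1e6:.1f}M")  # GPT1 blocks: 84.9M
```

The estimate of about 174B for 96 layers is consistent with the 175B figure once embedding parameters are added.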

In-context learning
- Prompt: The prefix given to the model
- Zero-shot: Predict the answer given only a natural language description of the task
- One-shot: See a single example of the task in addition to the task description
- Few-shot: See a few examples of the task
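The three settings differ only in how many solved examples are placed in the prompt. A minimal sketch, assuming a plain-text "source => target" layout; `build_prompt` is a hypothetical helper, and the translation pairs are only illustrative.

```python
def build_prompt(task_description, examples, query):
    """Build a prompt with zero, one, or several in-context examples."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


if __name__ == "__main__":
    task = "Translate English to French:"
    demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

    print(build_prompt(task, [], "peppermint"))         # zero-shot
    print(build_prompt(task, demos[:1], "peppermint"))  # one-shot
    print(build_prompt(task, demos, "peppermint"))      # few-shot
```

No weights are updated in any of the three cases. The examples only change the text the model conditions on.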

- Zero-shot performance improves steadily with model size
- Few-shot performance increases more rapidly with model size

References
「Language Models are Few-Shot Learners, arXiv'20」