Language Model & AWD techniques

*All English-language images used are from the cs231n lecture materials.*

1. RNN, LSTM & GRU reviews

2. Language Model

3. AWD-LSTM/RNN techniques

RNN, LSTM & GRU reviews

RNN

Effective for processing and predicting sequence data
Suffers from short-term memory (the vanishing gradient problem)
LSTM and GRU use gate mechanisms to mitigate the short-term memory problem.
Gate: regulates the flow of information as the sequence chain progresses

LSTM

Cell state: a transport highway that carries information along the entire sequence chain
Four different gates: forget (f), input (i), output (o), gate (g)
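
The cell-state update can be made concrete with a toy one-dimensional LSTM step. This is a minimal sketch, not a production implementation: the function name lstm_step, the scalar state, and the weight values are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar input and state.

    `w` maps a gate name ("i", "f", "o", "g") to (w_x, w_h, bias).
    All values are hypothetical and chosen only for illustration.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new content to write
    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate content ("gate" gate)
    c = f * c_prev + i * g     # cell state: the transport highway
    h = o * math.tanh(c)       # hidden state
    return h, c

weights = {name: (0.5, 0.5, 0.0) for name in "ifog"}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged, which is how information survives over many steps.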

GRU

Two gates: reset gate (r), update gate (z)
GRU uses fewer tensor operations than LSTM, but the speed difference is small in practice. (Neither model is strictly better than the other.)
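
For comparison, here is a toy one-dimensional GRU step under the same assumptions (scalar state, hypothetical weights, illustrative function name gru_step). Note that some references swap the roles of z and 1 - z in the final interpolation; this sketch uses one common convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One GRU step with scalar input and state.

    `w` maps "r", "z", "h" to (w_x, w_h, bias); values are hypothetical.
    """
    def pre(name, h):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h + b

    r = sigmoid(pre("r", h_prev))              # reset gate
    z = sigmoid(pre("z", h_prev))              # update gate
    h_tilde = math.tanh(pre("h", r * h_prev))  # candidate state uses the reset history
    return (1.0 - z) * h_prev + z * h_tilde    # interpolate old state and candidate

weights = {name: (0.5, 0.5, 0.0) for name in "rzh"}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, weights)
```

There is no separate cell state here: the single hidden state plays both roles, which is where the saving in tensor operations comes from.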

Language Model

A probability distribution over words and sentences
Typically, the previous words are used to predict the next word

e.g. [Today, is, Wednesday]

Input sequence x

P(Today is Wednesday) = 0.001
P(Today Wednesday is) = 0.0000000001

Conditional probability

P(pizza | For dinner I'm making) > P(cement | For dinner I'm making)
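
The idea of scoring a sentence as a product of conditional probabilities can be sketched with a count-based bigram model. Everything here is an assumption made for illustration: the three-sentence corpus, the add-k smoothing, and the function names. A neural language model conditions on the full history rather than only the previous word.

```python
from collections import Counter

# Hypothetical toy corpus with sentence boundary markers.
corpus = [
    ["<s>", "today", "is", "wednesday", "</s>"],
    ["<s>", "today", "is", "sunny", "</s>"],
    ["<s>", "wednesday", "is", "today", "</s>"],
]

# How often each word appears as a context, and how often each pair appears.
contexts = Counter(w for sent in corpus for w in sent[:-1])
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab = {w for sent in corpus for w in sent}

def bigram_prob(prev, word, k=1.0):
    """P(word | prev) with add-k smoothing so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

def sentence_prob(words):
    """Chain rule with a bigram approximation: product of P(w_t | w_{t-1})."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

p_good = sentence_prob(["today", "is", "wednesday"])
p_bad = sentence_prob(["today", "wednesday", "is"])
```

On this corpus the natural word order receives a higher probability than the scrambled one, mirroring the example above.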

Evaluate Language Model

Extrinsic evaluation

Evaluates the model through an external downstream task, e.g. Neural Machine Translation
Expensive and slow

Intrinsic evaluation

Evaluates the language model itself using its internal properties
Quickly comparing models
E.g. perplexity

Perplexity (↓)

Used to evaluate a language model
Real, syntactically correct sentences should be assigned high probability
Fake, incorrect, or infrequent sentences should be assigned low probability
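
Perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The sketch below assumes the model has already produced a probability for each target token; the function name and the input lists are illustrative.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) over the probabilities
    the model assigned to each correct token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is always uniform over 4 choices is "as confused as" 4 options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Higher probabilities on the correct tokens give lower perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95])
unsure_ppl = perplexity([0.1, 0.2, 0.05])
```

This is why the arrow points down: a model that assigns higher probability to real text gets a lower score.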

AWD-LSTM/RNN techniques

(ASGD Weight-Dropped LSTM, where ASGD stands for Averaged Stochastic Gradient Descent)

Problem of LSTM/RNN

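An RNN shares its parameters across every time step → the parameter count does not grow with sequence length

A large unrolled network therefore reuses a relatively small set of parameters
Despite this sharing, LSTM/RNN language models overfit the training data easily

As a result, perplexity on test data turns out higher than expected

→ Use regularization strategies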

Regularization strategies

NT-ASGD
DropConnect
Variational dropout
Embedding dropout

Averaged Stochastic Gradient Descent (ASGD)

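Takes ordinary SGD steps, but returns the average of the weight iterates from the point where averaging starts, instead of the final iterate
Averaging smooths out the noise in the individual SGD updates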

→ Use plain SGD early in training, then switch to ASGD once the SGD loss stops decreasing
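
A minimal sketch of the averaging idea on a toy problem: minimizing (w - 3)^2 with noisy gradients. The objective, the noise level, the learning rate, and the fixed avg_start step are all assumptions for illustration and are not taken from the AWD-LSTM setup.

```python
import random

def noisy_grad(w, rng):
    """Gradient of (w - 3)^2 plus Gaussian noise, standing in for a mini-batch gradient."""
    return 2.0 * (w - 3.0) + rng.gauss(0.0, 1.0)

def sgd_with_averaging(steps=2000, lr=0.05, avg_start=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    w_avg, n_avg = 0.0, 0
    for t in range(steps):
        w -= lr * noisy_grad(w, rng)        # ordinary SGD step
        if t >= avg_start:                  # after the switch, track a running mean
            n_avg += 1
            w_avg += (w - w_avg) / n_avg    # incremental average of the iterates
    return w, w_avg

w_last, w_avg = sgd_with_averaging()
```

The last iterate keeps jittering around the optimum because of gradient noise, while the averaged iterate settles near 3.0. Note that the update rule itself is unchanged; only the weights that are reported differ.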

Non-monotonically Triggered Averaged Stochastic Gradient Descent (NT-ASGD)

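The step at which ASGD starts is decided automatically during training instead of being tuned by hand: averaging is triggered once the validation metric has failed to improve over several consecutive checks (a non-monotonic criterion)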
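
The trigger can be sketched as a check over the history of validation perplexities. This is an illustrative reading of the non-monotonic criterion: the function name, the list-based interface, and the sample values are assumptions, and n is the non-monotone interval (how many recent checks are excluded from the comparison).

```python
def nt_asgd_trigger(val_perplexities, n=5):
    """Return the index of the validation check at which averaging
    would be switched on, or None if it never triggers.

    Averaging starts when the current perplexity is worse than the best
    value recorded more than `n` checks ago.
    """
    logs = []
    for t, v in enumerate(val_perplexities):
        if t > n and v > min(logs[:t - n]):
            return t
        logs.append(v)
    return None

# Hypothetical validation curve: improves, then stalls.
history = [100, 90, 80, 70, 60, 61, 62, 63, 64, 65, 66]
switch_at = nt_asgd_trigger(history, n=3)
```

Because the comparison ignores the most recent n checks, a single bad validation result does not trigger the switch; the metric has to stay above an older best value.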

Dropout

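Standard dropout is applied at each time step, with a different dropout mask sampled at each time step
Applied only during training (dropout is turned off at test and inference time)
PyTorch's RNN, LSTM, and GRU modules take a dropout parameter: elements are dropped according to a Bernoulli random variable, and the dropout is applied to the outputs of each layer except the last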

Standard Dropout

Zeroes out nodes (activations), which also removes every weight connected to a dropped node → the node's information is lost
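
A minimal sketch of standard (inverted) dropout on a hidden vector, with a fresh mask sampled at every time step. The function name, the list-based vectors, and the 1/(1-p) rescaling convention are assumptions for illustration.

```python
import random

def dropout(values, p, rng):
    """Zero each activation with probability p and scale survivors by 1/(1-p)
    so the expected value is unchanged (inverted dropout)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]

# A new mask is drawn on every call, i.e. at every time step.
steps = [dropout(hidden, 0.5, rng) for _ in range(3)]
```

Since a different set of nodes is zeroed at each step, anything an RNN tries to carry in its hidden state can be wiped at any step, which motivates the alternatives that follow.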

DropConnect

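Applies dropout to the weights only (a random subset of the weights is used) → prevents the loss of node information
In AWD-LSTM, DropConnect is applied to the hidden-to-hidden weight matrices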
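
A minimal sketch of DropConnect on a recurrent weight matrix. The matrix U, the helper names, and the toy tanh recurrence are hypothetical; the point is only that the mask is sampled once on the weights and the same dropped matrix is reused at every time step.

```python
import math
import random

def drop_connect(weight, p, rng):
    """Zero individual weights with probability p; scale kept weights by 1/(1-p)."""
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row] for row in weight]

def matvec(weight, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

rng = random.Random(0)

# Hypothetical hidden-to-hidden weights.
U = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6],
     [-0.7, 0.8, 0.9]]

U_dropped = drop_connect(U, p=0.5, rng=rng)   # sampled once per forward pass

h = [1.0, 1.0, 1.0]
for _ in range(4):                            # same dropped weights at every step
    h = [math.tanh(a) for a in matvec(U_dropped, h)]
```

No activation is zeroed directly, so every node still contributes through whichever of its connections survive.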

Variational dropout

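Applies dropout to the gates of an LSTM or GRU

The dropout mask is applied at the input level → the mask is drawn from Bernoulli(p) once per input sequence and reused at every time step, so it differs between input sequences but not between time steps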
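
A minimal sketch of a variational (time-locked) dropout mask. The function names and the list-of-lists sequence layout are assumptions; the key point is that one mask is sampled per sequence and reused at every time step.

```python
import random

def sample_mask(size, p, rng):
    """Bernoulli keep/drop mask, pre-scaled by 1/(1-p)."""
    keep = 1.0 - p
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def variational_dropout(sequence, p, rng):
    """`sequence` is a list of time steps, each a list of features.
    One mask is sampled and applied identically at every time step."""
    mask = sample_mask(len(sequence[0]), p, rng)
    return [[x * m for x, m in zip(step, mask)] for step in sequence]

rng = random.Random(1)
seq = [[1.0] * 6 for _ in range(5)]           # 5 time steps, 6 features
out = variational_dropout(seq, 0.5, rng)
```

Calling the function again for another sequence draws a new mask, so the mask differs between sequences while staying fixed within each one.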

Embedding dropout

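Applies dropout to the embedding matrix at the word level
The embedding vector of a dropped word is set to 0
The remaining vectors are scaled up to compensate (by 1/(1-p) for dropout probability p)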
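
A minimal sketch of embedding dropout, where whole rows of the embedding matrix (one row per word) are dropped together. The matrix E and the function name are hypothetical.

```python
import random

def embedding_dropout(embedding, p, rng):
    """Drop entire word vectors with probability p and scale the
    surviving rows by 1/(1-p)."""
    keep = 1.0 - p
    out = []
    for row in embedding:
        scale = 1.0 / keep if rng.random() < keep else 0.0  # one draw per word
        out.append([v * scale for v in row])
    return out

rng = random.Random(0)

# Hypothetical embedding matrix: 4 words, 2 dimensions each.
E = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
E_dropped = embedding_dropout(E, p=0.5, rng=rng)
```

Because the draw is per row rather than per element, every occurrence of a dropped word in the batch maps to the zero vector for that forward pass.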
