Training Neural Networks

1. Activation functions

2. Batch Normalization

3. Optimization methods

4. Ensemble and regularization

Activation functions

Sigmoid activation

- range (0, 1)

- Gradients vanish at saturated neurons

- Sigmoid outputs are not zero-centered

- exp() is comparatively expensive to compute
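
The saturation point above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not from the original notes.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # every output is positive, so the outputs are not zero-centered
print(sigmoid_grad(x))  # largest at x = 0 (0.25), close to 0 at |x| = 10
```

The gradient never exceeds 0.25, so stacking many sigmoid layers multiplies many small factors together during backpropagation.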

Softmax function

- Generalizes the sigmoid to multi-class classification (mutually exclusive classes): outputs are non-negative and sum to 1
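
A minimal sketch of softmax, assuming NumPy. Subtracting the maximum before `exp` is a common numerical-stability trick and is an addition here, not something the notes state. The last lines check that a two-class softmax over the logits `[z, 0]` reduces to `sigmoid(z)`.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())

# Two classes with logits [z, 0]: the first probability equals sigmoid(z)
z = 2.0
print(softmax([z, 0.0])[0], sigmoid(z))
```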

Tanh activation

- tanh(x) = 2 * sigmoid(2x) - 1

- range (-1, 1)

- Zero-centered # for models such as RNNs, zero-centering matters

- Still suffers from vanishing gradients at saturated neurons
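
The relationship between tanh and the sigmoid, and the saturation of its gradient, can be verified with a short sketch (assuming NumPy; the variable names are illustrative).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
identity_holds = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Derivative of tanh is 1 - tanh(x)^2, which shrinks toward 0 as |x| grows
tanh_grad = 1.0 - np.tanh(x) ** 2
print(identity_holds, tanh_grad.max(), tanh_grad.min())
```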

ReLU (Rectified Linear Unit)

# Arguably the key ingredient that let Deep Learning stack deeper layers!

Input	Output
x ≥ 0	x
x < 0	0

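- Does not saturate for positive inputs, which alleviates the vanishing gradient problem of saturated neurons

- Computationally efficient

- Faster to compute than sigmoid/tanh

- Sparse activation: outputs 0 for inputs at or below 0, so only part of the network is active

- Efficient gradient propagation:

For positive inputs the gradient is exactly 1, so gradients vanish far less than with sigmoid/tanh → deeper layers can be stacked

- Efficient computation:

Piecewise linear, so the derivative is trivial (0 or 1) and computation is fast

- Scale-invariant:

max(0, ax) = a*max(0, x) for a ≥ 0

- Output is not zero-centered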
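
A small sketch of ReLU and its gradient, assuming NumPy; `relu` and `relu_grad` are illustrative names. It also checks the scale-invariance property, which holds only for a non-negative scale factor.

```python
import numpy as np

def relu(x):
    """max(0, x), applied elementwise."""
    return np.maximum(0, x)

def relu_grad(x):
    """1 where x > 0, otherwise 0 (the gradient at exactly 0 is taken as 0 here)."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))
print(relu_grad(x))

a = 2.5
scale_invariant = np.allclose(relu(a * x), a * relu(x))        # holds for a >= 0
breaks_for_negative = not np.allclose(relu(-a * x), -a * relu(x))
print(scale_invariant, breaks_for_negative)
```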

Batch Normalization

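- Adjusts the range of a linear layer's output, which otherwise spans (-∞, ∞)

- Usually applied after a Fully Connected or Convolution layer, before the nonlinearity

- Improves gradient flow through the network

- Allows higher learning rates

- Reduces the strong dependence on initialization

1. Compute the mean and variance of each dimension

2. Normalize

3. Scale

- Apply a learnable linear transformation again: y(k) = γ(k) * x̂(k) + β(k)

- Adjusts how often each node activates

- When scaling is not needed:

setting γ(k) = np.sqrt(var[x(k)]), β(k) = E[x(k)] maps the output back to the original input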
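
The three steps can be sketched as a training-time forward pass. This is a minimal illustration assuming NumPy and a `(batch, features)` input; the function name `batchnorm_forward` and the `eps` term for numerical safety are assumptions, not part of the notes.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                      # 1. per-dimension mean
    var = x.var(axis=0)                      #    and variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # 2. normalize
    return gamma * x_hat + beta              # 3. scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.var(axis=0))

# Choosing gamma = sqrt(var) and beta = mean undoes the normalization
restored = batchnorm_forward(
    x, gamma=np.sqrt(x.var(axis=0) + 1e-5), beta=x.mean(axis=0)
)
print(np.allclose(restored, x))
```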

Applying BatchNorm to test data

- Do not compute batch statistics from the test data; normalize with the mean and variance accumulated (e.g. as running averages) over the training data
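
One common way to get test-time statistics is to keep running averages of the batch mean and variance during training. The sketch below assumes NumPy; the class `BatchNorm1D` and its `momentum` parameter are hypothetical and only illustrate the idea.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical minimal batch-norm layer for (batch, features) inputs."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Use batch statistics and update the running averages
            mu, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Use the statistics accumulated during training
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(dim=4)
for _ in range(200):
    bn.forward(rng.normal(3.0, 2.0, size=(64, 4)), training=True)

test_batch = rng.normal(3.0, 2.0, size=(8, 4))
single = bn.forward(test_batch[:1], training=False)
full = bn.forward(test_batch, training=False)
print(np.allclose(single, full[:1]))
```

Because test mode uses fixed statistics, the output for one sample does not depend on which other samples share its batch.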

Optimization methods

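Problem with SGD (vanilla optimization method):

when the gradient is much larger along one axis than the other (e.g. x vs. y), progress is slow and the updates are more likely to diverge

Momentum Update

- Accumulates previous gradients and applies them to the new update (reflects past gradient information)

vanilla gradient descent update

x -= learning_rate * dx

Momentum update

v = mu * v - learning_rate * dx
x += v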
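
To see the difference, the two updates can be run on a toy quadratic whose curvature is much larger along one axis. This is a sketch under assumed values (curvatures 1 and 50, `learning_rate = 0.02`, `mu = 0.9`, 100 steps), none of which come from the notes.

```python
import numpy as np

curvature = np.array([1.0, 50.0])  # hypothetical: shallow along x[0], steep along x[1]

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

def run_sgd(x0, learning_rate, steps):
    x = x0.copy()
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def run_momentum(x0, learning_rate, mu, steps):
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        v = mu * v - learning_rate * grad(x)
        x += v
    return x

x0 = np.array([10.0, 1.0])
x_sgd = run_sgd(x0, learning_rate=0.02, steps=100)
x_mom = run_momentum(x0, learning_rate=0.02, mu=0.9, steps=100)
print(loss(x_sgd), loss(x_mom))
```

In this setup plain SGD is limited by the steep axis and crawls along the shallow one, while momentum builds up speed there and reaches a lower loss in the same number of steps.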

Adagrad Update

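- Scales each dimension's update by the accumulated magnitude of its gradients, evening out step sizes

cache += dx ** 2
x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)

- The cache keeps accumulating, so late in training the weights are barely updated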
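
The shrinking step can be made concrete with a constant gradient. A minimal sketch, assuming NumPy and a hypothetical gradient of 1.0 at every step:

```python
import numpy as np

learning_rate = 0.1
cache = 0.0
step_sizes = []

for _ in range(100):
    dx = 1.0  # hypothetical constant gradient
    cache += dx ** 2
    step_sizes.append(learning_rate * dx / (np.sqrt(cache) + 1e-7))

# The step shrinks like learning_rate / sqrt(t) even though the gradient never changes
print(step_sizes[0], step_sizes[-1])
```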

RMSProp Update

- Fixes Adagrad's problem → decay_rate

- decay_rate: exponentially decays older entries in the cache, so it tracks recent gradients instead of growing without bound

cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)
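
Repeating the constant-gradient experiment with a decaying cache shows the step size settling instead of vanishing. A minimal sketch, assuming NumPy, `decay_rate = 0.99`, and a hypothetical gradient of 1.0:

```python
import numpy as np

learning_rate = 0.1
decay_rate = 0.99
cache = 0.0
step_sizes = []

for _ in range(1000):
    dx = 1.0  # hypothetical constant gradient
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    step_sizes.append(learning_rate * dx / (np.sqrt(cache) + 1e-7))

# The cache converges to dx**2, so the step approaches learning_rate rather than 0
print(cache, step_sizes[-1])
```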

Adam Update

- Momentum + RMSProp

m = beta1 * m + (1 - beta1) * dx # Momentum → reflects past gradients
v = beta2 * v + (1 - beta2) * (dx ** 2) # RMSProp → running average of squared gradients per dimension
x -= learning_rate * m / (np.sqrt(v) + 1e-7)
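
The simplified update above leaves out the bias correction that the full Adam formulation applies to `m` and `v`, which start at zero. The sketch below adds it, assuming NumPy; the function name `adam_step`, the toy objective f(x) = x², and the hyperparameter values are illustrative.

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    m_hat = m / (1 - beta1 ** t)   # correct the bias toward 0 from zero initialization
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize f(x) = x^2, whose gradient is 2x
x, m, v = 5.0, 0.0, 0.0
x_after_first = None
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
    if t == 1:
        x_after_first = x

print(x_after_first, x)
```

With bias correction, the very first step has magnitude close to `learning_rate` regardless of the gradient's scale.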

Learning Rate

- SGD, SGD + Momentum, Adagrad, RMSProp, and Adam all have the learning rate as a hyperparameter

tips) When tuning the learning rate, try values boldly in a geometric progression, e.g. 0.1 → 1 → 10!
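
A geometric sweep is easy to script. The sketch below assumes NumPy and a hypothetical toy quadratic trained with plain gradient descent for 20 steps; on a real model, `final_loss` would be replaced by a short training run.

```python
import numpy as np

curvature = np.array([1.0, 50.0])  # hypothetical toy problem

def final_loss(learning_rate, steps=20):
    """Loss after a few gradient descent steps from a fixed starting point."""
    x = np.array([10.0, 1.0])
    for _ in range(steps):
        x -= learning_rate * curvature * x
    return 0.5 * np.sum(curvature * x ** 2)

initial_loss = final_loss(0.0)

# Candidates spaced by a factor of 10: 1e-4, 1e-3, ..., 10
learning_rates = 10.0 ** np.arange(-4, 2)
losses = {float(lr): float(final_loss(lr)) for lr in learning_rates}
best_lr = min(losses, key=losses.get)

for lr, value in losses.items():
    print(f"lr={lr:g}  loss={value:.3g}")
print("best:", best_lr)
```

Values that are too small barely move, values that are too large blow up, and the coarse sweep narrows down the order of magnitude worth refining.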

Ensemble and regularization

Model Ensembles

1. Train several independent models

2. At test time, average their results

→ Typically improves performance by about 2%
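
Averaging can be sketched with predicted class probabilities. The numbers below are made up for illustration (3 models, 2 samples, 2 classes); the only assumption is that each model outputs a probability per class.

```python
import numpy as np

# Hypothetical predicted probabilities, shape (n_models, n_samples, n_classes)
model_probs = np.array([
    [[0.6, 0.4], [0.3, 0.7]],   # model A
    [[0.4, 0.6], [0.2, 0.8]],   # model B
    [[0.7, 0.3], [0.6, 0.4]],   # model C
])

ensemble_probs = model_probs.mean(axis=0)        # average over models
ensemble_pred = ensemble_probs.argmax(axis=1)    # final class per sample
individual_preds = model_probs.argmax(axis=2)    # what each model would predict alone

print(ensemble_probs)
print(individual_preds)
print(ensemble_pred)
```

Here models B and C each disagree with the ensemble on one sample, and the average overrides both.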

Add term to loss

- Prevents overfitting, even at the cost of the very best training accuracy

- "I will not predict by trusting all of the previous information"

L2 regularization

L1 regularization

Elastic net (L1 + L2)
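
The three penalties differ only in which norm of the weights is added to the data loss. A minimal sketch, assuming NumPy; the function `regularized_loss` and the strength parameters `l1` and `l2` are illustrative names.

```python
import numpy as np

def regularized_loss(data_loss, W, l1=0.0, l2=0.0):
    """data_loss + l1 * sum(|W|) + l2 * sum(W^2).

    l2 only  -> L2 regularization
    l1 only  -> L1 regularization
    both     -> Elastic net
    """
    return data_loss + l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)

W = np.array([[1.0, -2.0], [0.0, 3.0]])   # sum(|W|) = 6, sum(W^2) = 14
data_loss = 1.0

print(regularized_loss(data_loss, W, l2=0.1))          # L2
print(regularized_loss(data_loss, W, l1=0.1))          # L1
print(regularized_loss(data_loss, W, l1=0.1, l2=0.1))  # Elastic net
```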

# In practice, these are not used much in modern Deep Learning since the arrival of Adam and the like ..

Dropout

- Train with a randomly picked subset of all nodes

- A dropping probability of 0.5 is typically used

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X):
    # X contains the data; w1, w2, w3, b1, b2, b3 are assumed to be defined elsewhere

    # forward pass (3 layer neural network)
    H1 = np.maximum(0, np.dot(w1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1  # drop!

    H2 = np.maximum(0, np.dot(w2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2  # drop!

    out = np.dot(w3, H2) + b3
    return out

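cf) Dropout does not improve performance on the training data. It improves performance on the test data!

dropout idea

ex) a classifier predicting the "person" label

What if the train data has an ear feature, but the test data has no ear?

→ With dropout, the network is also trained with the ear node missing, so it can still handle test data without an ear!

Use all nodes when applying to test data # Dropout off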

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(w1, X) + b1) * p  # scale the activations
    H2 = np.maximum(0, np.dot(w2, H1) + b2) * p  # scale the activations

    out = np.dot(w3, H2) + b3
    return out

→ During training, dropout lowered the expected output by a factor of p, so the same scale is applied at test time
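
A widely used variant, often called inverted dropout, moves the scaling to training time so that prediction needs no change. This is a swapped-in technique, not the version shown above; the sketch assumes NumPy, and the helper names are hypothetical.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(h, rng):
    """Inverted dropout: drop units, then divide by p so the expected value is unchanged."""
    mask = (rng.random(h.shape) < p) / p
    return h * mask

def dropout_predict(h):
    """At test time all units are used and no scaling is needed."""
    return h

rng = np.random.default_rng(0)
h = np.ones(100_000)          # stand-in for a layer's activations
dropped = dropout_train(h, rng)

print(dropped.mean())         # close to the undropped mean of 1.0
print(np.unique(dropped))     # each unit is either dropped or scaled by 1 / p
```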

Data Augmentation

- Expands the dataset

- Horizontal Flips

- Random crops and scales

ResNet:

1. Pick random L in range [256, 480]

2. Resize training image, short side = L

3. Sample random 224 x 224 patch

- Random mix/combinations of:

translation
rotation
stretching
shearing
lens distortions
..
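
The ResNet-style scale-and-crop steps, combined with a horizontal flip, can be sketched in plain NumPy. The nearest-neighbor resize is a simplification chosen to avoid an image library, and the function names are hypothetical.

```python
import numpy as np

def resize_short_side(img, L):
    """Nearest-neighbor resize so the shorter side becomes L (simplified, no interpolation)."""
    h, w = img.shape[:2]
    scale = L / min(h, w)
    new_h, new_w = max(L, round(h * scale)), max(L, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def augment(img, rng, crop=224):
    L = int(rng.integers(256, 481))             # 1. pick random L in [256, 480]
    img = resize_short_side(img, L)             # 2. resize, short side = L
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - crop + 1))    # 3. sample a random crop x crop patch
    left = int(rng.integers(0, w - crop + 1))
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                      # horizontal flip half of the time
        patch = patch[:, ::-1]
    return patch

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)  # stand-in for a real image
out = augment(image, rng)
print(out.shape)
```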

동영`s 인공지능 공부방
