Lecture 6 Training Neural Networks


Training Neural Networks

Previous lecture recap

Where are we now …

Computational Graphs

Neural networks

Convolutional Neural Networks

With six 5x5 filters, we get 6 separate activation maps.

We can then stack these up to get a new image of size 28 x 28 x 6.

Slides from the previous lecture…

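Here \(N = 7\) and the padding is 1 on each side, so the padded size is \(N + 2 = 9\). With \(F = 3\) and stride 1, the output size is \((9 - F) / 1 + 1 = 7\).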

Once you pad it, incorporate padding into the formula.

The depth is going to be the number of filters we have.

Here \(N = 32\), augmented by the padding we added. We padded by 2 on each side, which adds \(2 * 2 = 4\) to each dimension, and \(F = 5\).

So with stride 1 the output is \((32 + 4 - 5) / 1 + 1 = 32\), that is 32 x 32 for each filter, and we have 10 filters in total, which gives 10 activation maps. The total output volume is \(32 * 32 * 10\).
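The output-size arithmetic above can be wrapped in a small helper. This is a minimal sketch; the function name `conv_output_size` and its argument names are my own, not from the lecture.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution on an n x n input with an f x f filter."""
    padded = n + 2 * pad
    if (padded - f) % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return (padded - f) // stride + 1

print(conv_output_size(7, 3, stride=1, pad=1))   # 7
print(conv_output_size(32, 5, stride=1, pad=2))  # 32
print(conv_output_size(32, 5))                   # 28
```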

So now we want to learn the values of all the weights, or parameters.

We learn what the parameters should be through optimization.

Mini batch SGD (stochastic gradient descent)

sample a batch of data
forward prop it through the graph (network), get the loss
backprop to calculate the gradients
update the parameters using the gradients
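The four steps can be sketched on a toy problem. This example assumes a linear regression model with a mean squared error loss and synthetic data, none of which come from the lecture; only the loop structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = X @ true_w + small noise
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)          # parameters to learn
learning_rate = 0.1
batch_size = 32

for step in range(500):
    # 1. sample a batch of data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. forward pass: predictions and mean squared error loss
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)
    # 3. backward pass: gradient of the loss with respect to w
    grad_w = 2.0 * xb.T @ (pred - yb) / batch_size
    # 4. update the parameters using the gradient
    w -= learning_rate * grad_w

print(w)  # should end up close to true_w
```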

Today’s lecture is about the details of training the neural network.

One time setup

  a. activation functions
  b. preprocessing
  c. weight initialization
  d. regularization
  e. gradient checking*

*Gradient checking is a technique for manually verifying the correctness of our gradients ourselves.

Training dynamics

 a. babysitting the learning process
 b. parameter updates
 c. hyperparameter optimization

Evaluation
 a. model ensembles

Activation Functions

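Recall the convolutional layer we saw last time.

When the input data comes in, it is multiplied by the weights and then passed through an activation function (a nonlinear operation).

Examples of activation functions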

Sigmoid

squashes numbers to range [0,1]
historically popular since they have nice interpretations as a saturating “firing rate” of a neuron

Firing rate: related to the number of spikes generated by a neuron per unit of time. A neuron that does not fire is assumed to be at 0, and a fully saturated neuron is assumed to fire at the maximum rate of 1.

Saturated: the case where the activation value takes only extreme values.

However, the sigmoid function has fallen out of favor for 3 reasons.

Saturated neurons “kill” the gradients (vanishing gradient)

When x = -10 or x = 10, the gradient is nearly 0. This kills the gradient flow, and a near-zero gradient is passed downstream. Around x = 0 the gradient is fine.
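To make the saturation concrete, here is a small sketch that evaluates the sigmoid and its derivative at the three points mentioned. The helper names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # d(sigmoid)/dx

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The derivative peaks at 0.25 for x = 0 and is on the order of 1e-5 at x = -10 and x = 10, so almost nothing flows back through a saturated unit.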

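Sigmoid outputs are not zero-centered

If the input is always positive (x > 0), the output is also positive. In that case, what happens to the gradients on w?

Q) What if all of X is positive?

A) The gradients on w are always all positive or all negative.

The local gradient of \(w \cdot x + b\) with respect to w is x, so every component of the gradient on w takes the sign of the upstream gradient. You either increase all of W by a positive amount or decrease all of it.

This is inefficient.

These two quadrants are the only directions in which we are allowed to make a gradient update, so we have to move in a zig-zag pattern. This is why we generally want zero-mean data.

If we plot the gradient with respect to w on the coordinate plane, the gradient vector falls in the first or third quadrant. The ideal move is the blue line, but to get where we want to go we have to zig-zag (the red line). This slows convergence.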
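A short sketch of the sign argument, assuming a single neuron \(f = w \cdot x + b\) and a scalar upstream gradient. The input values and the upstream values 0.7 and -0.7 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=5)   # all-positive inputs, e.g. sigmoid outputs

for upstream in (0.7, -0.7):         # dL/df, a scalar for one neuron
    grad_w = upstream * x            # dL/dw = dL/df * x
    print(upstream, np.sign(grad_w))
```

Every component of `grad_w` shares the sign of the upstream gradient, which is why a single update can only move in the all-positive or all-negative direction.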

exp() is somewhat expensive to compute

This is a minor problem, though, compared to other expensive operations such as the dot products.

Tanh

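The tanh nonlinearity is shown in the right-hand figure above. It squashes numbers to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid its output is zero-centered. In practice, therefore, the tanh nonlinearity is always preferred over the sigmoid nonlinearity. A tanh neuron is simply a scaled sigmoid neuron.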
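The "scaled sigmoid" relationship is the identity \(\tanh(x) = 2\sigma(2x) - 1\), which can be checked numerically. The test points below are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
print(np.allclose(np.tanh(x), tanh_from_sigmoid))  # True
```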

ReLU

Rectified Linear Unit

Computes \(f(x) = \max(0, x)\)

Does not saturate (in the positive region)

Very computationally efficient (there is no exp)

Converges much faster than sigmoid/tanh in practice (e.g. 6x)

Actually more biologically plausible than sigmoid: a closer approximation of how neurons behave

Popularized by AlexNet in 2012

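However, there is still an underlying problem…

Not zero-centered output
An annoyance: the gradient is 0 in the region below 0, and 10-20% of the units can end up as dead ReLUs. People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01). Check what the gradient is at x = 0.

When x = -10, the gradient is zero. At x = 0 the gradient is undefined, and in practice it is taken to be 0.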
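A minimal sketch of ReLU and its gradient at the points discussed. Treating the gradient at exactly x = 0 as 0 is a convention assumed here, not something the function itself dictates.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention assumed here: the gradient at exactly x = 0 is taken to be 0.
    return (x > 0).astype(float)

x = np.array([-10.0, 0.0, 10.0])
print(relu(x))       # [ 0.  0. 10.]
print(relu_grad(x))  # [0. 0. 1.]
```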

Reasons this happens:

bad initialization

The weights happen to be unlucky and end up off the data cloud, so they specify a ReLU that never activates.
learning rate is too high

Huge updates make the weights jump around, and the ReLU gets knocked off the data manifold. This can happen during training.

In practice, some people like to initialize with small positive biases to get more ReLUs firing at the beginning. This is not universally done.

Leaky ReLU

PReLU

It is just like the leaky ReLU, but the slope of the negative region is determined by an alpha parameter. Instead of specifying or hard-coding it, we treat alpha as a parameter that we can backprop into and learn.
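As a sketch of what "backprop into alpha" means, the forward and backward passes can be written as below. The function names and the single shared alpha are assumptions for illustration; with alpha held fixed at a small constant such as 0.01, the forward pass is the leaky ReLU.

```python
import numpy as np

def prelu_forward(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_backward(x, alpha, dout):
    dx = dout * np.where(x > 0, 1.0, alpha)          # gradient w.r.t. the input
    dalpha = np.sum(dout * np.where(x > 0, 0.0, x))  # gradient w.r.t. alpha
    return dx, dalpha

x = np.array([-2.0, -0.5, 1.0, 3.0])
alpha = 0.01
out = prelu_forward(x, alpha)
dx, dalpha = prelu_backward(x, alpha, dout=np.ones_like(x))
print(out, dx, dalpha)
```

Only the negative inputs contribute to `dalpha`, since alpha has no effect on the output where x > 0.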

ELU

One argument is that bringing back a saturating (deactivation) regime in the negative region can make the unit more robust.

Maxout Neuron

This takes the max of two linear functions, and it generalizes the ReLU and the leaky ReLU. It has a linear regime, does not saturate, and does not die.

Problem:

It doubles the number of parameters per neuron, which now has W1 and W2, so it is computationally more expensive.
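The claim that maxout generalizes ReLU and leaky ReLU can be checked with a small sketch. The shapes, random values, and the helper name `maxout` are illustrative assumptions.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Element-wise max of two linear functions of the input
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)

# With the second branch fixed to zero, maxout reduces to ReLU(xW + b)
zero_W, zero_b = np.zeros_like(W), np.zeros_like(b)
relu_out = np.maximum(0.0, x @ W + b)
print(np.allclose(maxout(x, W, b, zero_W, zero_b), relu_out))  # True

# With the second branch equal to alpha times the first, it reduces to leaky ReLU
alpha = 0.01
z = x @ W + b
leaky_out = np.where(z > 0, z, alpha * z)
print(np.allclose(maxout(x, W, b, alpha * W, alpha * b), leaky_out))  # True
```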

Summary

Use ReLU.
Try other forms of ReLU.
Try tanh, but don’t expect much.
Don’t use sigmoid.

Data Preprocessing

Generally we want to preprocess the data.

zero mean
normalize

Why?

Recall the problem that occurs when all the inputs are positive: all the gradients on the weights have the same sign, and the optimization is suboptimal. Even if the inputs are not all positive or all negative, any such bias still causes this type of problem.

In machine learning problems you typically normalize the data so that all features are in the same range and contribute equally. For most of this course, however, we only do zero-centering, since we are dealing with images. In practice we don’t normalize pixel values much, because images already have comparable scale and distribution in every dimension. (Changing the scale would change the feature.)

In ML there are also more complicated methods such as PCA and whitening.

However, when working with images we typically stick with zero mean and do not do normalization or more complicated preprocessing.

In general, when dealing with images we do not project the input down to a lower dimension.

A CNN works on the original image itself so that it can exploit the image’s spatial structure.

Principal component analysis (PCA): a technique for reducing the dimensionality of datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Rather than analyzing the components of each individual data point, it finds the principal components of the distribution that the data points form together.

Whitening: makes the input features uncorrelated and gives each of them a variance of 1.
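For reference, PCA and whitening can be sketched in a few lines of NumPy. The toy data, the mixing matrix `A`, the choice of keeping 2 components, and the `1e-5` stabilizer are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 500 points with 3 correlated features
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ A

X = X - X.mean(axis=0)                 # zero-center each feature
cov = X.T @ X / X.shape[0]             # covariance matrix
U, S, _ = np.linalg.svd(cov)           # columns of U: principal directions

X_rot = X @ U                          # decorrelate: rotate into the eigenbasis
X_pca = X_rot[:, :2]                   # PCA: keep the top 2 components
X_white = X_rot / np.sqrt(S + 1e-5)    # whitening: unit variance per dimension

print(np.round(X_white.T @ X_white / X.shape[0], 3))  # approximately identity
```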

Training phases

In general, determine the mean in the training phase, then apply it to the test data.

By default, preprocess to zero mean.

Compute the mean image, then subtract it from each image that you are about to pass through the network. At test time you do the same thing, using the array that you determined at training time.

In practice, for some networks we instead subtract a per-channel mean: rather than zero-centering by an entire mean image, we just take the mean by channel. It turns out that the mean is similar enough across the whole image that subtracting the mean image versus a per-channel value doesn’t make much of a difference, and a per-channel value is easier to pass around and deal with. You’ll see this, for example, in VGG, a network that came after AlexNet, which we’ll talk about later.

Channel: RGB. Our images are typically, for example, 32 by 32 by 3: width and height are each 32, and the depth is the three RGB channels. So we’ll have one mean for the red channel, one for green, and one for blue.

When subtracting the mean image, what is the mean taken over? The mean is taken over all of your training images. We compute it for the entire training set, once, before we start training. We don’t do this per batch. If you’re sampling reasonable batches, you should be getting basically the same mean values per batch anyway, so it’s more efficient and easier to do this once at the beginning. You could also just sample enough training images to get a good estimate of the mean.
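Both variants of mean subtraction can be sketched as follows. The random arrays stand in for real images and are an assumption for illustration; the point is that the statistics come from the training set only and are reused at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data (assumed): random "images" of shape (N, 32, 32, 3)
X_train = rng.uniform(0, 255, size=(100, 32, 32, 3))
X_test = rng.uniform(0, 255, size=(20, 32, 32, 3))

# Option 1: mean image, shape (32, 32, 3)
mean_image = X_train.mean(axis=0)
# Option 2: per-channel mean, shape (3,)
channel_mean = X_train.mean(axis=(0, 1, 2))

# Both statistics come from the training set only and are reused at test time
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image
X_test_centered_pc = X_test - channel_mean

print(mean_image.shape, channel_mean.shape)  # (32, 32, 3) (3,)
```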

Does data preprocessing solve the sigmoid problem? The preprocessing gives zero-mean data, and the sigmoid wants zero-mean input, so it does help for the first layer. However, in a deep network the non-zero-mean problem comes back in later layers, where this won’t be sufficient.

Weight Initialization

What happens when W = 0 initialization is used?

The key thing is that all the neurons will do the same thing.

Since the weights are zero, every neuron performs basically the same operation on its inputs.

And since they all output the same thing, they also all get the same gradient, so they all update in the same way.

You end up with neurons that are all exactly the same, which is not what you want.

That’s the problem when you initialize everything equally: there is no symmetry breaking.
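The lack of symmetry breaking can be seen in a tiny two-layer network. This sketch assumes a tanh hidden layer, a mean squared error loss, and toy data. It also uses a constant 0.5 rather than 0 for every weight, because with exactly zero weights the weight gradients in this setup are all zero and the effect is harder to see.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))      # 8 inputs with 4 features (toy data)
y = rng.normal(size=(8, 1))

W1 = np.full((4, 3), 0.5)        # every hidden neuron starts with the same weights
W2 = np.full((3, 1), 0.5)

h = np.tanh(x @ W1)              # forward pass
out = h @ W2
dout = 2.0 * (out - y) / len(x)  # gradient of the mean squared error
dW2 = h.T @ dout                 # backward pass
dh = dout @ W2.T
dW1 = x.T @ (dh * (1.0 - h ** 2))

print(np.allclose(h, h[:, :1]))       # True: identical activations
print(np.allclose(dW1, dW1[:, :1]))   # True: identical gradients per neuron
```

Because every hidden neuron receives the same gradient, the neurons stay identical after any number of updates.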

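Small random initialization

One approach is to use random values scaled by 0.01.

This works fine for small networks but causes problems in deeper networks.

The figure below shows the result for 10 layers of 500 neurons each, with tanh activations in between.

As you can see, the deeper the layer, the more the activations collapse toward zero.

On the tanh curve, the saturated regions where the slope is 0 are no longer reached.

Only the central region, where the slope is nonzero, survives, and the activations there shrink toward 0, so the gradients on the weights become tiny as well.

So what happens if we do not scale by 0.01?

As the figure below shows, the activations saturate at -1 and 1.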
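The experiment described here can be reproduced roughly as follows. The batch size of 1000, the seed, and the function name `activation_stds` are assumptions; the 10 layers, 500 units, and tanh activations follow the text.

```python
import numpy as np

def activation_stds(weight_scale, num_layers=10, width=500, seed=0):
    """Std of tanh activations per layer for W = weight_scale * randn."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(1000, width))   # unit Gaussian input batch
    stds = []
    for _ in range(num_layers):
        W = weight_scale * rng.normal(size=(width, width))
        h = np.tanh(h @ W)
        stds.append(h.std())
    return stds

print(activation_stds(0.01))  # shrinks toward 0 layer by layer
print(activation_stds(1.0))   # stays near 1: outputs pile up at -1 and +1
```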

Xavier initialization

Whereas the approach above scales the weights by a fixed amount, Xavier initialization normalizes them by the number of input nodes (fan_in).

With this, we can see that training works well.
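A minimal sketch of this scaling, using the common form that divides unit Gaussian weights by \(\sqrt{\text{fan\_in}}\). The helper name `xavier_init` and the 10-layer, 500-unit tanh setup are assumptions made for this example.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Scale unit Gaussian weights by 1 / sqrt(fan_in)
    return rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 500))
for layer in range(10):
    h = np.tanh(h @ xavier_init(500, 500, rng))
    print(layer, round(float(h.std()), 3))
```

The per-layer standard deviation stays in a usable range, neither collapsing to 0 nor saturating at 1.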

Batch Normalization

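We want the activations to stay in a roughly Gaussian range throughout the network.

Batch Normalization was proposed with this in mind.

It stabilizes the training process as a whole.

It is intended to prevent internal covariate shift.

That is, it keeps the distribution of the inputs to each layer of the network from shifting.

As in the figure below, BN is usually applied before the activation, so that the values are well distributed before the activation is applied.

So the order is FC –> BN –> Activation.

With BN, however, the input to the activation is always unit Gaussian,

and we cannot know in advance whether that is appropriate.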

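Open questions
- What if the activation function is ReLU?
- What if larger weights would give better performance?

To address these questions, gamma and beta are introduced.

Gamma adjusts the variance of the BN output, and beta adjusts its mean.

Gamma and beta are learnable parameters, so the network learns appropriate values for them during training.

Note that if gamma equals the standard deviation and beta equals the mean, the result is the same as not applying BN at all.

Gamma: scaling
Beta: shifting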

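It is often said that with BN you no longer need Dropout.

The reason is that Dropout regularizes by randomly dropping values.

BN has a similar effect: the statistics differ slightly from batch to batch, so the normalized values keep changing, which adds noise that acts as a regularizer.

Also, BN is a linear transformation, so the existing spatial structure is well preserved.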

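Note) A caveat when using Batch Normalization with CONV layers

Weights are normally applied in the form Wx + b, but the bias b is redundant with BN’s beta.
So the bias term in Wx + b can be dropped.
Advantages
- Improves gradient flow through the network
- Allows stable training even with high learning rates
- Reduces the dependence on weight initialization
- Acts as a form of regularization, reducing the need for dropout
- No overhead at test time (the learned values are simply used)
At test time the mean and standard deviation of a minibatch cannot be computed, so the fixed mean and std obtained during training are used.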
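The training-time and test-time behavior can be sketched for a fully connected layer as below. Keeping the fixed statistics as an exponential running average with `momentum=0.9` is a common choice that I am assuming here; the function name and the `running` dictionary are also illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train, momentum=0.9, eps=1e-5):
    """x: (N, D). running: dict with 'mean' and 'var', updated in place when training."""
    if train:
        mean, var = x.mean(axis=0), x.var(axis=0)
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mean, var = running["mean"], running["var"]   # fixed statistics at test time
    x_hat = (x - mean) / np.sqrt(var + eps)           # normalize each feature
    return gamma * x_hat + beta                       # learnable scale and shift

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(64, 4)) + 5.0
running = {"mean": np.zeros(4), "var": np.ones(4)}

out = batchnorm_forward(x, np.ones(4), np.zeros(4), running, train=True)
print(out.mean(axis=0), out.std(axis=0))   # roughly 0 and 1 per feature

# gamma = batch std and beta = batch mean recover the original input
same = batchnorm_forward(x, np.sqrt(x.var(axis=0) + 1e-5), x.mean(axis=0),
                         running, train=True)
print(np.allclose(same, x))  # True
```

Calling the function with `train=False` uses only the stored statistics, so a single test example can be normalized without a batch.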

Layer Normalization

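Layer Normalization (LN) is similar to Batch Normalization (BN), but different.

Whereas BN normalizes over the batch and over W and H,

LN normalizes over the depth and over W and H within a single sample.

So LN does not look across the batch at all and, unlike BN, normalizes using the information from every depth channel.

The formulas for BN and LN are given below.

The formulas have the same form; only i and j are swapped.

A more intuitive picture of these formulas is given below.

The figure below shows the difference between Batch Normalization and Layer Normalization.

Empirically, it is reported to perform well in RNNs.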
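In code, the difference between the two comes down to which axes the statistics are computed over. This sketch assumes an (N, C, H, W) layout and leaves out the learnable scale and shift; the helper name `normalize` is my own.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 6, 5, 5))   # (N, C, H, W)

bn = normalize(x, axes=(0, 2, 3))   # batch norm: one mean/var per channel
ln = normalize(x, axes=(1, 2, 3))   # layer norm: one mean/var per sample

print(bn.mean(axis=(0, 2, 3)))      # about 0 for each of the 6 channels
print(ln.mean(axis=(1, 2, 3)))      # about 0 for each of the 8 samples
```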

Instance Normalization

Instance Normalization goes one step further than Layer Normalization.

Whereas Layer Normalization normalizes over all components of (Width, Height, Channel),

Instance Normalization normalizes over (Width, Height) within each channel.

This normalization applies only to images and cannot be used in RNNs. It appears to perform well as a replacement for batch normalization in style transfer, and it has reportedly been used in GANs as well.

Group Normalization

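Group normalization computes the mean and standard deviation over groups of channels.

It is a combination of layer normalization and instance normalization:

if all channels form a single group (G=1), it becomes layer normalization,

and if each channel is put in its own group (G=C), it becomes instance normalization.

On ImageNet, group normalization comes close to the performance of batch normalization with batch size 32, and it performs better at smaller batch sizes.

In addition, for tasks that use high-resolution images, such as object detection and segmentation, it is hard to increase the batch size because of memory limits, and group normalization is a very effective normalization method for such tasks.

Advantages of group normalization
- Compared with layer normalization, it preserves more independence between channels and gives the model more flexibility.

In the figure below, the image resolution H, W is drawn as a single dimension, C is the channel axis (number of channels), and N is the batch axis (number of samples in the batch).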
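The two limiting cases can be checked numerically: a single group covering all channels (G = 1) matches layer normalization, and one group per channel (G = C) matches instance normalization. This sketch assumes an (N, C, H, W) layout and omits the learnable scale and shift; the function names are my own.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: (N, C, H, W); C must be divisible by num_groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 5, 5))

layer_norm = normalize(x, axes=(1, 2, 3))
instance_norm = normalize(x, axes=(2, 3))
print(np.allclose(group_norm(x, num_groups=1), layer_norm))     # True
print(np.allclose(group_norm(x, num_groups=6), instance_norm))  # True
```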

Hyperparameter Optimization

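When searching over hyperparameters, it is important to use an appropriate learning rate.

The loss curve takes a different shape depending on the size of the learning rate.

The learning rate we want is the red curve: one that is neither too large nor too small.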
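The effect of the learning rate can be illustrated on a toy problem. This sketch assumes plain gradient descent on the loss \(L(w) = w^2\), which is not from the lecture, and the three learning rates are arbitrary picks for "too small", "reasonable", and "too large".

```python
def final_loss(learning_rate, steps=50):
    """Gradient descent on the toy loss L(w) = w**2, starting from w = 1."""
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w   # dL/dw = 2w
    return w ** 2

for lr in (1e-4, 0.1, 1.1):
    print(lr, final_loss(lr))
```

With the smallest rate the loss barely moves, with the middle one it drops to nearly zero, and with the largest one it blows up.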

Sometimes you will see a graph like this one; in that case the initialization was poor.

As shown below, the gap between the red line and the green line should be small.

If this gap grows large, the model is overfitting.

An overfit model does not work well on real data.
