Lecture 4 Introduction to Neural Networks


Introduction to Neural Networks

What do we want?

image

In the previous lecture, we covered how to compute the score function, the SVM loss, and the full loss as data loss + regularization.

Now we want to find the parameters W that correspond to the lowest loss.

Why?

Because minimizing the loss gives us the W that best fits the data, while the regularization term expresses our preference for simpler models that generalize better.

We can achieve this with optimization: we iteratively follow the gradient of the loss downhill.
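As a rough sketch of what that optimization loop looks like, here is vanilla gradient descent on a toy squared-error loss (the loss, shapes, and step size are purely illustrative, not the SVM setup from the lecture):

```python
import numpy as np

# Toy example: minimize L(W) = ||W x - y||^2 just to illustrate the
# update rule W <- W - step_size * dL/dW.
np.random.seed(0)
x = np.random.randn(5)
y = np.random.randn(3)
W = np.random.randn(3, 5) * 0.01

step_size = 1e-2
for _ in range(100):
    scores = W.dot(x)                  # forward: compute scores
    grad_scores = 2.0 * (scores - y)   # dL/dscores for the squared error
    grad_W = np.outer(grad_scores, x)  # analytic gradient dL/dW
    W -= step_size * grad_W            # step downhill
```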

We can compute this gradient in two ways:

  • numerical gradient method

    • slow, approximate, easy to write
  • analytic gradient

    • fast, exact, error-prone

In practice, derive the analytic gradient, then check your implementation with a numerical gradient check.
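A minimal sketch of that workflow, using a centered-difference numerical gradient to check a hand-derived analytic gradient (the quadratic function here is just an illustration):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fxph = f(x)                          # f(x + h)
        x.flat[i] = old - h
        fxmh = f(x)                          # f(x - h)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (fxph - fxmh) / (2 * h)
    return grad

# Example: f(x) = sum(x^2), whose analytic gradient is 2x.
f = lambda x: np.sum(x ** 2)
x = np.random.randn(4)
num = numerical_gradient(f, x)
ana = 2 * x
print(np.max(np.abs(num - ana)))             # should be vanishingly small
```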

Computational graphs

image

What is a computational graph?

We can use this kind of graph to represent any function, where the nodes of the graph are the steps of computation that we go through.

The example above is a linear classifier.

inputs: X and W

multiplication node: represents the matrix multiplication of the parameters W and the data X

vector of scores: the output of that multiplication

hinge loss: computes the data loss term

total loss: the sum of the regularization term and the data loss term
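Putting those nodes together, a tiny forward pass through this graph might look like the following sketch (the shapes, the multiclass SVM hinge loss, and the L2 regularizer are assumptions for illustration):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3073)             # one data point (pixels + bias trick)
W = np.random.randn(10, 3073) * 1e-3  # parameters
y = 4                                 # index of the correct class
reg = 0.1                             # regularization strength

scores = W.dot(X)                                  # multiplication node
margins = np.maximum(0, scores - scores[y] + 1)    # hinge loss node
margins[y] = 0
data_loss = np.sum(margins)                        # data loss term
reg_loss = reg * np.sum(W * W)                     # regularization term
total_loss = data_loss + reg_loss                  # final '+' node
```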

Advantage

we can use backpropagation!

  • to get the gradients, we recursively apply the chain rule to every variable in the computational graph

  • really useful when working with complex functions:

    • Convolutional network (AlexNet)

    • Neural Turing Machine

Computing these gradients by hand would be insane.

Backpropagation

example 1

image

Backpropagation is a recursive application of the chain rule. Because of the chain rule, we start at the end of the graph and compute the gradients backwards from there.

Since y and f are not directly connected, we use the chain rule.

The derivative of f with respect to y can be written as the product of the derivative of f with respect to q and the derivative of q with respect to y: ∂f/∂y = (∂f/∂q)(∂q/∂y).
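Concretely, for the usual slide example f(x, y, z) = (x + y)·z with the intermediate q = x + y (the input values below are assumed, taken from the standard version of this example):

```python
# Backprop by hand for f(x, y, z) = (x + y) * z, with q = x + y.
x, y, z = -2.0, 5.0, -4.0   # example inputs (assumed)

# forward pass
q = x + y                   # q = 3
f = q * z                   # f = -12

# backward pass: start from df/df = 1 and apply the chain rule backwards
df_dq = z                   # local gradient of the * node w.r.t. q
df_dz = q                   # local gradient of the * node w.r.t. z
df_dx = df_dq * 1.0         # dq/dx = 1, so df/dx = (df/dq)(dq/dx) = z
df_dy = df_dq * 1.0         # dq/dy = 1, so df/dy = (df/dq)(dq/dy) = z

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```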

example 2

image

If we look at what we did from a different perspective, as nodes, we see the loss L flowing back through the graph during backpropagation. At each node we use the chain rule: we multiply the local gradient (e.g. ∂z/∂x) by the upstream gradient coming down from above (∂L/∂z) to get the gradient with respect to the input (∂L/∂x).

example 3

image

We can define the computational nodes at any granularity we want.

In practice, we can group several nodes together into a single node, as long as we can write down the local gradient for that grouped function.

For example, we can collapse the whole sigmoid expression into a single sigmoid node to shorten the graph.
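If we group the whole sigmoid expression into one node, its local gradient has the simple closed form σ(x)(1 − σ(x)), so the backward pass through that node becomes a single line (a small sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0
s = sigmoid(x)                        # forward through the grouped sigmoid node

upstream = 1.0                        # gradient arriving from the node above
downstream = upstream * s * (1 - s)   # local gradient: d(sigmoid)/dx = s * (1 - s)
```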

Trade off

how much math you do by hand to get a simpler, more compact graph vs. how simple you want each node's local gradient to be

patterns in backward flow

image

Gradients for vectorized code

The equations stay the same; the only difference is that the local gradient is now a Jacobian matrix, i.e. the derivative of each element of z with respect to each element of x.
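In practice we never build that Jacobian explicitly. For an elementwise operation it is diagonal, so multiplying by it reduces to an elementwise product, as in this sketch with an elementwise max(0, x) node (the size is illustrative):

```python
import numpy as np

x = np.random.randn(4096)          # input vector to an elementwise max(0, x) node
z = np.maximum(0, x)               # forward pass

upstream = np.random.randn(4096)   # dL/dz flowing back from above
# The Jacobian dz/dx is diagonal (1 where x > 0, 0 elsewhere),
# so multiplying by it is just an elementwise mask:
downstream = upstream * (x > 0)    # dL/dx
```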

image

image

implementation

image

image

image
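A minimal sketch of the forward() / backward() node API described in the summary below, written as a hypothetical multiply gate (the class and attribute names are illustrative, not any specific framework's API):

```python
class MultiplyGate:
    """One node of the graph: z = x * y."""

    def forward(self, x, y):
        # Cache the inputs; they are the intermediates needed for backward.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: downstream gradient = local gradient * upstream gradient dz.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

# usage
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # forward: z = -12.0
dx, dy = gate.backward(1.0)      # backward: dx = -4.0, dy = 3.0
```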

summary

  • neural nets will be very large: impractical to write down gradient formula by hand for all parameters

  • backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates

  • implementations maintain a graph structure, where the nodes implement the forward() / backward() API

  • forward: compute result of an operation and save any intermediates needed for gradient computation in memory

  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

neural networks

image
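As a sketch of what stacking fully-connected layers looks like in code, assuming the usual 2-layer form f = W2 · max(0, W1 · x) with ReLU as the nonlinearity (the sizes are made up):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3072)               # input vector
W1 = np.random.randn(100, 3072) * 1e-2  # first fully-connected layer
W2 = np.random.randn(10, 100) * 1e-2    # second fully-connected layer

h = np.maximum(0, W1.dot(x))            # hidden layer with ReLU activation
scores = W2.dot(h)                      # class scores
```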

activation functions

image
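A few commonly used activation functions written out as numpy one-liners (exactly which ones the figure shows is an assumption):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)
tanh    = lambda x: np.tanh(x)                     # squashes to (-1, 1)
relu    = lambda x: np.maximum(0, x)               # max(0, x)
leaky   = lambda x: np.where(x > 0, x, 0.01 * x)   # small slope for x < 0
```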

summary

  • We arrange neurons into fully-connected layers

  • The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)

  • Neural networks are not really neural

  • Next time: Convolutional Neural Networks
