Lecture 4 Introduction to Neural Networks
What do we want?
In the previous lecture, we covered how to compute the score function, the SVM loss, and the full loss (data loss + regularization).
Now we want to find the parameters W that correspond to the lowest loss.
Why?
We want to minimize the loss function, and the regularization term encodes a preference for simpler models that generalize better.
We can achieve this with optimization.
We can compute this gradient in two ways:
- numerical gradient: slow, approximate, easy to write
- analytic gradient: fast, exact, error-prone
In practice: derive the analytic gradient, then check the implementation with the numerical gradient (a gradient check), as in the sketch below.
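As a concrete illustration of that workflow, here is a minimal gradient-check sketch. The function f, the step size h, and the test function are all just placeholders for illustration:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (slow, approximate, easy to write)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)
        x.flat[i] = old - h
        fm = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Hypothetical check: for f(x) = sum(x**2) the analytic gradient is 2*x.
x = np.random.randn(5)
analytic = 2 * x
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(np.allclose(analytic, numeric, atol=1e-6))   # should print True
```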
Computational graphs
What is a computational graph?
We can use this kind of graph to represent any function, where the nodes of the graph are the steps of computation we go through. The example here is a linear classifier.
- inputs: X and W
- multiplication node: represents the matrix multiplication of W and X
- vector of scores: the result of multiplying the parameters W with the data X
- hinge loss: the data loss term
- total loss: the sum of the regularization term and the data term
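A toy forward pass through this graph, step by step; all shapes, values, and the regularization strength below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 4)             # 5 examples, 4 features
y = np.array([0, 2, 1, 2, 0])         # correct class for each example
W = np.random.randn(4, 3) * 0.01      # weights for 3 classes
reg = 0.1                             # regularization strength

scores = X @ W                                        # multiplication node: s = XW
correct = scores[np.arange(5), y][:, None]            # score of the correct class
margins = np.maximum(0, scores - correct + 1)         # hinge (SVM) loss node
margins[np.arange(5), y] = 0                          # don't count the correct class
data_loss = margins.sum() / 5                         # average data loss
reg_loss = reg * np.sum(W * W)                        # regularization node
total_loss = data_loss + reg_loss                     # final "+" node
print(total_loss)
```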
Advantage
We can use backpropagation!
- To get the gradient, we recursively apply the chain rule to every variable inside the computational graph.
- Really useful when working with complex functions, e.g.:
  - Convolutional network (AlexNet)
  - Neural Turing Machine
- Computing these gradients by hand would be insane.
Backpropagation
Example 1
Backpropagation is a recursive application of the chain rule. Because of the chain rule, we start at the end of the graph and compute the gradients backwards.
y and f are not directly connected, so we use the chain rule: the derivative of f with respect to y is the product of the derivative of f with respect to q and the derivative of q with respect to y, i.e. ∂f/∂y = (∂f/∂q)(∂q/∂y).
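The slide itself isn't reproduced in these notes, but assuming the usual example f(x, y, z) = (x + y) · z with the intermediate q = x + y, the backward pass works out like this (the input values are assumptions, not from the notes):

```python
# forward pass
x, y, z = -2.0, 5.0, -4.0   # assumed example values
q = x + y                   # q = 3.0
f = q * z                   # f = -12.0

# backward pass: start from the output and apply the chain rule node by node
df_df = 1.0                 # gradient of f with respect to itself
df_dq = z * df_df           # f = q * z  ->  df/dq = z = -4.0
df_dz = q * df_df           # f = q * z  ->  df/dz = q = 3.0
df_dx = 1.0 * df_dq         # q = x + y  ->  dq/dx = 1, so df/dx = df/dq * dq/dx = -4.0
df_dy = 1.0 * df_dq         # likewise df/dy = df/dq * dq/dy = -4.0
```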
Example 2
If we look at what we did from the perspective of a single node, we see the gradient of the loss L flowing back to it during backpropagation. We use the chain rule: we multiply the local gradient by the upstream gradient coming down in order to get the gradient with respect to each input.
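As a tiny concrete instance of that "local gradient × upstream gradient" rule (the loss and the values below are made up, not from the lecture):

```python
# forward pass through two nodes: s = w * x, then L = (s - t) ** 2
w, x, t = 2.0, 3.0, 5.0
s = w * x                       # s = 6.0
L = (s - t) ** 2                # L = 1.0

# backward pass: each node multiplies its local gradient by the upstream gradient
dL_dL = 1.0                     # gradient flowing into the last node
dL_ds = 2 * (s - t) * dL_dL     # squaring node: local gradient 2(s - t), upstream 1.0
dL_dw = x * dL_ds               # multiply node: local gradient w.r.t. w is x
dL_dx = w * dL_ds               # multiply node: local gradient w.r.t. x is w
```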
Example 3
We can define the computational nodes at any granularity we want.
In practice, we can group several nodes together as long as we can write down the local gradient for the grouped function.
For example, we can collapse the operations that make up the sigmoid function into a single sigmoid node, since its local gradient has the simple form σ(x)(1 − σ(x)).
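A minimal sketch of such a grouped sigmoid node; the input value is arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# forward through the single grouped "sigmoid gate"
x = 0.5                          # arbitrary scalar input
s = sigmoid(x)

# backward: the local gradient of the whole group is simply s * (1 - s),
# so we never have to backprop through the individual exp, +, and 1/x nodes
upstream = 1.0                   # gradient arriving from the rest of the graph
dx = (s * (1.0 - s)) * upstream
```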
Trade-off
how much math you do up front to get a simpler graph vs. how simple you want each local gradient to be
patterns in backward flow
Gradients for vectorized code
The expressions stay the same; the only difference is that the local gradient is now a Jacobian matrix: the derivative of each element of z with respect to each element of x.
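A minimal sketch for a vectorized node z = Wx (all shapes are made up); note that we never form the full Jacobian explicitly, we only multiply by it:

```python
import numpy as np

W = np.random.randn(3, 4)
x = np.random.randn(4)

# forward
z = W @ x                        # shape (3,)

# backward, given some upstream gradient dL/dz
dz = np.random.randn(3)          # stand-in for the gradient flowing back into z
# The Jacobian dz/dx is just W (z_i depends on x_j through W[i, j]),
# so dL/dx = (dz/dx)^T @ dL/dz:
dx = W.T @ dz                    # shape (4,), same shape as x
dW = np.outer(dz, x)             # shape (3, 4), same shape as W
```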
implementation
summary
- neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch after this summary)
  - forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
  - backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
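A minimal sketch of a node implementing that forward() / backward() API; the class name and the way intermediates are cached are illustrative, not taken from any particular library:

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the result and save the intermediates needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # chain rule: local gradient times the upstream gradient
        dx = self.y * upstream
        dy = self.x * upstream
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)        # forward pass: out = -12.0
dx, dy = gate.backward(1.0)          # backward pass: dx = -4.0, dy = 3.0
```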
neural networks
activation functions
summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies; see the sketch below)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks
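To make the layer abstraction concrete, here is a minimal sketch of the forward pass of a 2-layer fully-connected network; all sizes and the choice of ReLU as the activation are assumptions for illustration:

```python
import numpy as np

N, D, H, C = 32, 100, 64, 10               # batch size, input dim, hidden units, classes
x = np.random.randn(N, D)
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

h = np.maximum(0, x @ W1 + b1)             # first fully-connected layer + ReLU activation
scores = h @ W2 + b2                       # second fully-connected layer -> class scores (N, C)
```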