Lecture 4 Introduction to Neural Networks
What do we want?
In the previous lecture, we covered how to compute the score function, the SVM loss, and the full loss (data loss + regularization).
Now we want to find the parameters W that correspond to the lowest loss.
Why?
We want to minimize the loss function, and the regularization term encodes a preference for simpler models that generalize better.
We can achieve this with optimization.
We can compute this gradient in two ways:
- numerical gradient: slow, approximate, easy to write
- analytic gradient: fast, exact, error-prone
In practice: derive the analytic gradient, then check the implementation with the numerical gradient (a gradient check), as in the sketch below.
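As a concrete illustration of that workflow, here is a minimal gradient-check sketch. The function f, the step size h, and the test function are all just placeholders for illustration:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (slow, approximate, easy to write)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)
        x.flat[i] = old - h
        fm = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Hypothetical check: for f(x) = sum(x**2) the analytic gradient is 2*x.
x = np.random.randn(5)
analytic = 2 * x
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(np.allclose(analytic, numeric, atol=1e-6))   # should print True
```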
Computational graphs
What is a computational graph?
We can use this kind of graph to represent any function, where the nodes of the graph are the steps of computation we go through. The example here is a linear classifier.
- inputs: X and W
- multiplication node: represents the matrix multiplication of W and X
- vector of scores: the result of multiplying the parameters W with the data X
- hinge loss: the data loss term
- total loss: the sum of the regularization term and the data term
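A toy forward pass through this graph, step by step; all shapes, values, and the regularization strength below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 4)             # 5 examples, 4 features
y = np.array([0, 2, 1, 2, 0])         # correct class for each example
W = np.random.randn(4, 3) * 0.01      # weights for 3 classes
reg = 0.1                             # regularization strength

scores = X @ W                                        # multiplication node: s = XW
correct = scores[np.arange(5), y][:, None]            # score of the correct class
margins = np.maximum(0, scores - correct + 1)         # hinge (SVM) loss node
margins[np.arange(5), y] = 0                          # don't count the correct class
data_loss = margins.sum() / 5                         # average data loss
reg_loss = reg * np.sum(W * W)                        # regularization node
total_loss = data_loss + reg_loss                     # final "+" node
print(total_loss)
```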
Advantage
We can use backpropagation!
- To get the gradient, we recursively apply the chain rule to every variable inside the computational graph.
- Really useful when working with complex functions, e.g.:
  - Convolutional network (AlexNet)
  - Neural Turing Machine
- Computing these gradients by hand would be insane.
Backpropagation
Example 1
Backpropagation is a recursive application of the chain rule. Because of the chain rule, we start at the end of the graph and compute the gradients backwards.
y and f are not directly connected, so we use the chain rule: the derivative of f with respect to y is the product of the derivative of f with respect to q and the derivative of q with respect to y, i.e. ∂f/∂y = (∂f/∂q)(∂q/∂y).
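The slide itself isn't reproduced in these notes, but assuming the usual example f(x, y, z) = (x + y) · z with the intermediate q = x + y, the backward pass works out like this (the input values are assumptions, not from the notes):

```python
# forward pass
x, y, z = -2.0, 5.0, -4.0   # assumed example values
q = x + y                   # q = 3.0
f = q * z                   # f = -12.0

# backward pass: start from the output and apply the chain rule node by node
df_df = 1.0                 # gradient of f with respect to itself
df_dq = z * df_df           # f = q * z  ->  df/dq = z = -4.0
df_dz = q * df_df           # f = q * z  ->  df/dz = q = 3.0
df_dx = 1.0 * df_dq         # q = x + y  ->  dq/dx = 1, so df/dx = df/dq * dq/dx = -4.0
df_dy = 1.0 * df_dq         # likewise df/dy = df/dq * dq/dy = -4.0
```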
Example 2
If we look at what we did from the perspective of a single node, we see the gradient of the loss L flowing back to it during backpropagation. We use the chain rule: we multiply the local gradient by the upstream gradient coming down in order to get the gradient with respect to each input.
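As a tiny concrete instance of that "local gradient × upstream gradient" rule (the loss and the values below are made up, not from the lecture):

```python
# forward pass through two nodes: s = w * x, then L = (s - t) ** 2
w, x, t = 2.0, 3.0, 5.0
s = w * x                       # s = 6.0
L = (s - t) ** 2                # L = 1.0

# backward pass: each node multiplies its local gradient by the upstream gradient
dL_dL = 1.0                     # gradient flowing into the last node
dL_ds = 2 * (s - t) * dL_dL     # squaring node: local gradient 2(s - t), upstream 1.0
dL_dw = x * dL_ds               # multiply node: local gradient w.r.t. w is x
dL_dx = w * dL_ds               # multiply node: local gradient w.r.t. x is w
```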
Example 3
We can define the computational nodes at any granularity we want.
In practice, we can group several nodes together as long as we can write down the local gradient for the grouped function.
For example, we can collapse the operations that make up the sigmoid function into a single sigmoid node, since its local gradient has the simple form σ(x)(1 − σ(x)).
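A minimal sketch of such a grouped sigmoid node; the input value is arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# forward through the single grouped "sigmoid gate"
x = 0.5                          # arbitrary scalar input
s = sigmoid(x)

# backward: the local gradient of the whole group is simply s * (1 - s),
# so we never have to backprop through the individual exp, +, and 1/x nodes
upstream = 1.0                   # gradient arriving from the rest of the graph
dx = (s * (1.0 - s)) * upstream
```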
Trade-off
how much math you do up front to get a simpler graph vs. how simple you want each local gradient to be
patterns in backward flow
Gradients for vectorized code
The expressions stay the same; the only difference is that the local gradient is now a Jacobian matrix: the derivative of each element of z with respect to each element of x.
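A minimal sketch for a vectorized node z = Wx (all shapes are made up); note that we never form the full Jacobian explicitly, we only multiply by it:

```python
import numpy as np

W = np.random.randn(3, 4)
x = np.random.randn(4)

# forward
z = W @ x                        # shape (3,)

# backward, given some upstream gradient dL/dz
dz = np.random.randn(3)          # stand-in for the gradient flowing back into z
# The Jacobian dz/dx is just W (z_i depends on x_j through W[i, j]),
# so dL/dx = (dz/dx)^T @ dL/dz:
dx = W.T @ dz                    # shape (4,), same shape as x
dW = np.outer(dz, x)             # shape (3, 4), same shape as W
```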
implementation
summary
- neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch after this summary)
  - forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
  - backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
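A minimal sketch of a node implementing that forward() / backward() API; the class name and the way intermediates are cached are illustrative, not taken from any particular library:

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the result and save the intermediates needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # chain rule: local gradient times the upstream gradient
        dx = self.y * upstream
        dy = self.x * upstream
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)        # forward pass: out = -12.0
dx, dy = gate.backward(1.0)          # backward pass: dx = -4.0, dy = 3.0
```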
neural networks
activation functions
summary
- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies; see the sketch below)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks
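To make the layer abstraction concrete, here is a minimal sketch of the forward pass of a 2-layer fully-connected network; all sizes and the choice of ReLU as the activation are assumptions for illustration:

```python
import numpy as np

N, D, H, C = 32, 100, 64, 10               # batch size, input dim, hidden units, classes
x = np.random.randn(N, D)
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

h = np.maximum(0, x @ W1 + b1)             # first fully-connected layer + ReLU activation
scores = h @ W2 + b2                       # second fully-connected layer -> class scores (N, C)
```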