- Most popular applications have been developed with Supervised Learning: carefully selecting the input x and the output y, then training a network to map between them.
- Supervised Learning
- Structured Data
- Unstructured Data = Audio / Images
- Why is Deep Learning taking off?
Q. How do you process the entire training set of m examples without an explicit for loop?
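The standard answer is vectorization: stack all m examples into a matrix and compute them with one matrix operation. A minimal numpy sketch (shapes and variable names are illustrative, not from the source):

```python
import numpy as np

# m training examples stacked as columns of X (shape: n_features x m)
n, m = 5, 1000
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))
w = rng.standard_normal((n, 1))
b = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Vectorized: one matrix product covers all m examples, no Python loop.
Z = w.T @ X + b          # shape (1, m)
A = sigmoid(Z)           # predictions for all m examples

# Equivalent explicit loop (much slower) for comparison:
A_loop = np.array([[sigmoid((w.T @ X[:, i:i+1] + b).item()) for i in range(m)]])
assert np.allclose(A, A_loop)
```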
ReLU and tanh (which centers activations around 0) generally outperform sigmoid (which centers them around 0.5)
Sigmoid is useful mainly for the output layer (e.g., binary classification probabilities)
Leaky ReLU = avoids the zero gradient that plain ReLU produces for negative inputs
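The four activations mentioned above can be sketched in a few lines of numpy (the 0.01 leaky slope is a common convention, not from the source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # output in (0, 1), centered around 0.5

def tanh(z):
    return np.tanh(z)                    # output in (-1, 1), centered around 0

def relu(z):
    return np.maximum(0.0, z)            # gradient is exactly 0 for z < 0

def leaky_relu(z, alpha=0.01):
    # small slope alpha keeps a nonzero gradient for z < 0
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 2.0])
# relu(z)       -> [0.0, 0.0, 2.0]
# leaky_relu(z) -> [-0.02, 0.0, 2.0]
```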
The vanishing/exploding gradient problem: gradients that shrink or blow up across layers make weight updates far too small or too large, which slows down or destabilizes learning.
Why non-linear activation function?
Removing activation functions makes the entire computation linear! This eliminates the ability to learn complex decision boundaries, and the deep layers become pointless: the network collapses to a standard linear regression model.
A linear activation function is useful only when using a Neural Network for a linear regression problem (typically just in the output layer).
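The collapse described above can be verified numerically: two stacked linear layers equal one linear layer with composed weights. A small sketch (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two "layers" with identity (linear) activation:
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((2, 4)); b2 = rng.standard_normal((2, 1))

x = rng.standard_normal((3, 1))

# Forward pass with linear activations:
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# The identical result from a single equivalent linear layer:
W = W2 @ W1
b = W2 @ b1 + b2
assert np.allclose(a2, W @ x + b)
```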
Why randomly initialize weights?
If all weights are 0, the hidden-layer computations are identical, effectively making all neurons equal and hence redundant.
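This symmetry is easy to demonstrate: with zero weights every hidden unit computes the same activation, while a small random initialization breaks the tie. A minimal sketch (scale 0.01 is a common convention, not from the source):

```python
import numpy as np

# Tiny hidden layer: 2 units, 2 input features, 2 examples
X = np.array([[1.0, 2.0], [3.0, 4.0]])
W1 = np.zeros((2, 2)); b1 = np.zeros((2, 1))

A1 = np.tanh(W1 @ X + b1)
# Both hidden units produce identical activation rows (and, since the
# gradients are symmetric too, they stay identical after every update):
assert np.allclose(A1[0], A1[1])

# Random initialization breaks the symmetry:
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 2)) * 0.01
A1_rand = np.tanh(W1 @ X + b1)
assert not np.allclose(A1_rand[0], A1_rand[1])
```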