Applying ML to a given problem is a highly iterative process. It's important to choose the following hyperparameters correctly: number of layers, number of hidden units, learning rate, activation functions.
If the dataset is very large (millions of examples), the dev and test sets may be reduced to 1% or less.
Make sure the dev and test sets come from the same distribution, ideally the distribution you expect at deployment.
If the dataset is small, keep a larger dev set (around 20%), since most of the model comparison happens in the dev phase.
Human-level performance is used as a proxy for Bayes error (the lowest achievable error), which serves as the baseline for diagnosing whether the problem is bias or variance.
Training a bigger network almost never hurts as long as it is properly regularized; the main cost is simply training time. Regularization is always important for big networks.
Regularization pushes the network toward more linear behavior: with a large penalty, weights shrink toward zero, activations stay in the near-linear region of functions like tanh, and the network cannot fit overly complex decision boundaries.
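A minimal sketch of the L2 (Frobenius-norm) penalty that gets added to the cost; the function name `l2_cost_term` and the toy matrices are illustrative, not from the source:

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """Frobenius-norm penalty added to the cost:
    (lambd / (2*m)) * sum over layers of ||W^[l]||_F^2."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Toy example: two weight matrices, lambda = 0.1, m = 4 training examples
W1 = np.ones((3, 2))   # ||W1||_F^2 = 6
W2 = np.ones((1, 3))   # ||W2||_F^2 = 3
cost = l2_cost_term([W1, W2], lambd=0.1, m=4)
# (0.1 / 8) * 9 = 0.1125
```

Increasing `lambd` shrinks all weights during gradient descent, which is what drives the network toward the near-linear regime described above.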
With inverted dropout, we divide the surviving activations by keep_prob so the expected activation value stays constant; otherwise the reduced activations would shrink the inputs to later layers and neurons further on might not fire.
No dropout is used while making predictions at test time.
Dropout prevents the network from relying on any single feature, so it spreads out the weights. This shrinking of weights gives an effect similar to L2 regularization.
Dropout probabilities can differ per layer. Preferably do not apply dropout to the input layer (keep_prob at or near 1).
Debugging tip: with dropout on, J is no longer guaranteed to decrease monotonically. First turn dropout off (keep_prob = 1) and check that J decreases with every iteration, then turn dropout back on.
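The inverted-dropout forward pass described above can be sketched as follows; the function name `dropout_forward` is illustrative:

```python
import numpy as np

def dropout_forward(a, keep_prob, train=True):
    """Inverted dropout: zero units with probability (1 - keep_prob), then
    scale by 1 / keep_prob so the expected activation is unchanged.
    At test time the layer is an identity: no masking, no scaling."""
    if not train:
        return a
    mask = (np.random.rand(*a.shape) < keep_prob)
    return a * mask / keep_prob

np.random.seed(0)
a = np.ones((5, 1000))
a_drop = dropout_forward(a, keep_prob=0.8)
# roughly 20% of entries are zeroed; survivors are scaled to 1 / 0.8 = 1.25,
# so the mean activation stays close to 1
```

At prediction time `train=False` reproduces the note above: no dropout and no rescaling is needed, because the scaling was already folded in during training.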
More ways to avoid over-fitting/regularize
Data augmentation = modify and twist the data: flip, rotate, distort, crop images to synthesize extra training examples.
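A small sketch of two of the augmentations above (flip and crop) on a NumPy image array; the function `augment` and the pad size are illustrative choices:

```python
import numpy as np

def augment(img, rng):
    """Cheap augmentations on an (H, W, C) image array: a random
    horizontal flip, then a random crop padded back to the original size."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                     # horizontal flip
    h, w, _ = img.shape
    pad = 2                                        # illustrative pad size
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]   # shape is preserved

rng = np.random.default_rng(0)
out = augment(np.zeros((32, 32, 3)), rng)
# out.shape is still (32, 32, 3)
```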
Orthogonalization = tackle one goal at a time with one set of knobs. First fit the training set well (minimize J), then separately reduce overfitting.
Vanishing/exploding gradients: with near-linear activations, weight values slightly below or above 1 get multiplied many times across a deep network, so activations and gradients implode or explode exponentially. The fix is to initialize weights smartly according to the activation function, e.g., He et al. initialization for ReLU or Xavier initialization for tanh.
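The two initialization schemes just mentioned can be sketched like this; the helper name `init_weights` is illustrative:

```python
import numpy as np

def init_weights(n_out, n_in, activation="relu", rng=None):
    """He initialization (variance 2/n_in) for ReLU layers,
    Xavier initialization (variance 1/n_in) for tanh layers.
    Scaling by n_in keeps activation variance roughly constant per layer."""
    rng = rng or np.random.default_rng()
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * scale

W = init_weights(256, 512, activation="relu", rng=np.random.default_rng(0))
# empirical std of W should be close to sqrt(2/512) ~ 0.0625
```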
Faster training optimizations
Mini batch gradient descent:
Batch GD (whole training set per step), mini-batch GD, stochastic GD (batch size 1, i.e., online; loses the vectorization speedup)
Make the mini-batch size a power of 2 and ensure a mini-batch fits in on-chip GPU memory
Epoch: one full pass through the training set
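A minimal sketch of building mini-batches for one epoch, assuming the column-per-example layout (X: features x m, Y: 1 x m) used in the course; `make_minibatches` is an illustrative name:

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, rng=None):
    """Shuffle the m examples once per epoch, then slice into mini-batches
    of `batch_size` columns; the last batch may be smaller."""
    rng = rng or np.random.default_rng()
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.random.rand(10, 150)
Y = np.random.rand(1, 150)
batches = make_minibatches(X, Y, batch_size=64)
# 150 examples -> three batches of 64, 64, and 22 columns
```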
Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * theta_t. Include the bias-correction term for the initial phase: divide v_t by (1 - beta^t), since v starts at 0 and is biased low for small t.
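The update and its bias correction can be written out in a few lines; `ewa` is an illustrative name:

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average with bias correction:
    v_t = beta * v_(t-1) + (1 - beta) * theta_t, reported as
    v_t / (1 - beta**t) so early estimates are not biased toward zero."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias correction
    return out

avgs = ewa([10.0, 10.0, 10.0], beta=0.9)
# without correction the first average would be 1.0; corrected, every
# average of a constant signal is exactly 10.0
```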
Gradient descent with momentum: like a ball rolling downhill; the gradient provides acceleration while beta acts like friction, damping oscillations so descent accelerates in the consistent direction.