- Applying ML to a given problem is a highly iterative process. It’s important to choose the following hyperparameters well: number of layers, number of hidden units, learning rate, activation functions.
- If the data set is very large (millions of examples), the dev and test sets may be reduced to 1% or less.
- Make sure the dev and test sets come from the same distribution.
- If the data set is small, the dev phase matters more for good results, hence a larger dev split of about 20%.
- Bayes error ≈ human-level performance on the task (a proxy, not an equality); used as a baseline for estimating bias and variance problems.
- Training a bigger network almost never hurts as long as it is regularized; the main cost involved is simply the training time. Regularization is always important for big networks.
- Regularization keeps the weights small, so activations stay near their linear region; the network behaves more linearly, which prevents it from fitting overly complex decision boundaries.
- Inverted Dropout
- After masking, we scale the surviving activations up by dividing by keep_prob, so the expected activation value stays constant; otherwise the activations shrink and later neurons may not fire. This solves the scaling problem that plain dropout introduces.
- No dropout is used while making predictions at test time.
- Dropout stops a unit from relying on any single input, so it spreads out the weights. Spreading out the weights shrinks their squared norm, hence an effect similar to L2 regularization.
- Dropout probabilities can be different for different layers. Preferably do not apply dropout to the input layer.
- Best way: dropout makes J less well defined, so first turn off dropout (keep_prob = 1) and check that J is decreasing with every iteration, then turn dropout back on.
- More ways to avoid over-fitting/regularize
- Data augmentation = flip, rotate, distort, crop images
- Orthogonalization = handle one goal at a time: first minimize J (optimize well), then reduce overfitting (regularize), using separate tools for each.
- Normalized inputs = Better, faster, smoother gradient descent
- Mitigating vanishing/exploding gradients = In a deep network, activations and gradients are products of many weight matrices, so weights slightly too small/large make them implode/explode exponentially with depth. A partial solution is to scale the initial weights according to the activation function, i.e., He et al. initialization (variance 2/n_in, for ReLU) or Xavier initialization (variance 1/n_in, for tanh).
- Gradient checking
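
Two of the items above, inverted dropout and He initialization, can be sketched in NumPy as follows. This is only a minimal illustration, not the course's reference code: the function names, the `rng` argument, and the layer sizes are assumptions made for the example.

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to an activation matrix `a` (hypothetical helper)."""
    if not training or keep_prob >= 1.0:
        # No dropout at test time: activations pass through unchanged.
        return a, None
    # Keep each unit with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    # Divide by keep_prob so the expected activation stays the same.
    return a * mask / keep_prob, mask

def he_init(n_out, n_in, rng):
    """He et al. initialization for a ReLU layer: variance 2 / n_in."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)

a = np.ones((4, 10000))  # toy activations, all equal to 1
a_train, mask = inverted_dropout_forward(a, keep_prob=0.8, rng=rng)
a_test, _ = inverted_dropout_forward(a, keep_prob=0.8, rng=rng, training=False)

W = he_init(50, 200, rng)

print("mean activation with dropout:", a_train.mean())  # stays close to 1
print("std of He-initialized weights:", W.std())        # close to sqrt(2/200)
```

Because of the division by keep_prob, the mean activation during training stays close to the test-time value, which is why nothing needs rescaling when dropout is switched off for predictions.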
Faster training optimizations
- Mini batch gradient descent:
- Batch GD, Mini-batch GD, Stochastic GD (online, batch size 1; loses the vectorization speedup)
- Make the mini-batch size a power of 2 and ensure that a mini-batch fits in CPU/GPU memory
- Epoch: one full pass through the training set
- Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * x_t; for the initial phase, apply bias correction by dividing v_t by (1 - beta^t)
- Gradient descent with momentum: ball rolling downhill; beta acts as friction on the velocity and the gradient acts as acceleration (to do: rewatch the video)
- Adam optimization algorithm
- Learning rate decay: alpha = alpha_0 / (1 + decay_rate * epoch_num)
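- Local optima: in high dimensions, most points with zero gradient are saddle points rather than local optima; the real problem is plateaus in the surface, which slow down learning. Overall, unlikely to get stuck in bad local optima.
- Hyperparameter search: sample randomly (not on a grid), then zoom in coarse to fine on the promising region of the search space
- Appropriate scale selection: sample learning_rate on a logarithmic scale; for beta, sample 1 - beta on a logarithmic scale
- Pandas vs Caviar: babysit a single model through the earlier stages of learning (panda), or train many models in parallel and keep the best (caviar)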
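
The formulas in the optimization and hyperparameter-search notes above can be written out as a short sketch. The function names and the sampled ranges (learning rate in [1e-4, 1], beta in [0.9, 0.999]) are illustrative assumptions, not values fixed by these notes.

```python
import numpy as np

def ewa_bias_corrected(values, beta=0.9):
    """Exponentially weighted average with bias correction for early steps."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        # Without the correction, early averages are biased toward 0.
        out.append(v / (1.0 - beta ** t))
    return out

def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    """alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1.0 + decay_rate * epoch_num)

def sample_log_uniform(low_exp, high_exp, rng):
    """Return 10**r with r drawn uniformly from [low_exp, high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

rng = np.random.default_rng(0)

# Learning rate sampled on a log scale between 1e-4 and 1 (assumed range).
alpha = sample_log_uniform(-4, 0, rng)
# For beta, sample 1 - beta on a log scale, giving beta in [0.9, 0.999].
beta = 1.0 - sample_log_uniform(-3, -1, rng)

averages = ewa_bias_corrected([5.0, 5.0, 5.0], beta=0.9)
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]

print("sampled alpha, beta:", alpha, beta)
print("bias-corrected averages:", averages)
print("learning rate per epoch:", schedule)
```

Sampling 1 - beta rather than beta on the log scale matters because results are far more sensitive to changes in beta near 1 (0.999 vs 0.9995) than near 0.9.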
- Batch Normalization (unclear): makes the network more robust to the choice of hyperparameters, so a wider range of them works well!
Normalize inputs(z[l]) across layers.
- We normalize the input values (A0) to zero mean and unit variance so that they sit near the linear region of the sigmoid function, which avoids saturation (vanishing gradients) and speeds up training.
- Batch normalization normalizes the values in deeper layers as well and speeds up training overall. The learnable gamma and beta scale and shift the normalized values to whatever mean and variance suit the layer.
- Batch norm is applied before activation function.
- Batch norm makes the bias term b redundant (subtracting the mean cancels it) and adds two parameters of its own (gamma and beta, which set the scale and shift of the distribution).
- Works well with Momentum, RMSProp and Adam.
- It decouples the effect of weights from earlier layers on the later layers by normalizing z values.
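- Batch normalization adds a slight regularizing effect, since the mini-batch mean and variance are noisy estimates.
- At test time, use running (exponentially weighted) averages of mu and sigma^2 collected over the mini-batches during training.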
- Covariate shift = change in the distribution of x, e.g. training on black cats and then testing on colored cats; the model cannot find a new decision boundary on its own!
- Batch norm reduces this problem from the perspective of a deeper layer: the distribution of its inputs stays more stable while the earlier layers change.
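- Softmax regression layer and class selection
- Takes a vector of scores and outputs a vector of probabilities (exponentiate each entry, then normalize so the entries sum to 1)
- Softmax vs hard max: hard max outputs 1 for the largest entry and 0 for all others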
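
As a rough sketch of the batch norm and softmax notes above, the following NumPy code implements a batch norm forward pass (with running averages for test time) plus softmax and hard max. The function names, the `running` dictionary, the `momentum` value, and the shapes are assumptions chosen for illustration.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, running, training=True,
                       momentum=0.9, eps=1e-8):
    """Batch norm on z of shape (n_units, m), applied before the activation."""
    if training:
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        # Keep running averages of mu and sigma^2 for use at test time.
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)
    # gamma and beta set the scale and shift of the normalized values.
    return gamma * z_norm + beta

def softmax(z):
    """Column-wise softmax: exponentiate, then normalize to sum to 1."""
    shifted = z - z.max(axis=0, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def hardmax(z):
    """Column-wise hard max: 1 at the largest entry, 0 elsewhere."""
    out = np.zeros_like(z, dtype=float)
    out[z.argmax(axis=0), np.arange(z.shape[1])] = 1.0
    return out

rng = np.random.default_rng(0)
z = rng.normal(5.0, 3.0, size=(3, 64))  # 3 units, mini-batch of 64
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))

running = {"mu": np.zeros((3, 1)), "var": np.ones((3, 1))}
z_tilde = batch_norm_forward(z, gamma, beta, running)

# Adding a constant bias to z changes nothing: the mean subtraction cancels it.
running_b = {"mu": np.zeros((3, 1)), "var": np.ones((3, 1))}
z_tilde_b = batch_norm_forward(z + 7.0, gamma, beta, running_b)

probs = softmax(z_tilde)
one_hot = hardmax(z_tilde)
print("column sums of softmax:", probs.sum(axis=0)[:3])
```

Comparing `z_tilde` and `z_tilde_b` shows why the bias term becomes redundant under batch norm: any constant added to z is removed by the mean subtraction, and beta takes over its role.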