- Applying ML to a given problem is a highly iterative process. It’s important to choose the following hyper-parameters well: number of layers, number of hidden units, learning rate, activation functions.
- If the data-set is very large (millions of examples), the dev and test sets may each be reduced to 1% or less.
- Make sure the dev and test sets come from the same distribution.
- If the data-set is small, more work goes into the dev phase, so hold out a larger share for it, e.g. 20%.

- Bayes error ≈ human-level performance on the task (used as a proxy); it is the baseline for judging whether the problem is bias or variance.
- Training a bigger network almost never hurts as long as it is regularized; the main cost is the training time. Regularization always matters for big networks.

# Regularization

- Regularization pushes the network toward more linear behavior: smaller weights keep activations near the linear region, which prevents it from fitting overly complex decision boundaries.
- **Inverted dropout**: After masking, we divide the activations by keep_prob so their expected value stays constant; otherwise the shrunken activations may fail to fire later neurons. This fixes the scaling problem that plain dropout introduces.
- No dropout is used while making predictions at test time.
- Dropout makes the network more robust: no unit can rely on any single input, so the weights get spread out. Spreading out the weights shrinks them, which has an effect similar to L2 regularization.
- Dropout probabilities can be different for different layers. Preferably do not apply dropout to the input layer.
- Best way: First turn off dropout and check that J decreases with every iteration, then turn dropout on for the rest of the network.
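
The inverted dropout bullets above can be sketched roughly as follows. This is a minimal NumPy illustration, not a reference implementation; `dropout_forward` and its arguments are hypothetical names chosen for this note.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a` (hypothetical helper)."""
    if not training or keep_prob >= 1.0:
        # No dropout at test time: activations pass through unchanged.
        return a, None
    # Keep each unit independently with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    # Divide by keep_prob so the expected activation stays the same.
    return a * mask / keep_prob, mask

rng = np.random.default_rng(0)
a = np.ones((4, 1000))
a_drop, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(a_drop.mean())  # stays close to the original mean of 1.0
```

The same mask would have to be reused in the backward pass, which is why the sketch returns it.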

**More ways to avoid over-fitting/regularize**: Augment the data by modifying existing examples, e.g. flip, rotate, distort or crop images.

**Orthogonalization**: Treat the two goals separately. First minimize J (fit the training set well), then reduce over-fitting.

# Optimizations

**Normalized inputs**= Better, faster, smoother gradient descent
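
A small sketch of input normalization, assuming examples are stored as columns of `X` (shape `(features, examples)`); the helper names are made up for this note. The statistics are computed on the training set only and then reused for dev/test data.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps  # eps avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Center and scale X with statistics taken from the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(2, 500))
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
```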

**Eliminating vanishing gradients**: With (near-)linear activations, weight matrices slightly smaller or larger than identity make activations and gradients shrink or grow exponentially with depth, because they are multiplied across many layers. The solution is to scale the weight initialization to the activation function, i.e. He et al. initialization (ReLU) or Xavier initialization (tanh).

**Gradient checking**
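
The notes only name gradient checking, so the following is a sketch of the usual idea rather than anything taken from them: compare an analytic gradient against a two-sided finite difference and look at the relative difference. `grad_check`, `f` and `grad_f` are illustrative names.

```python
import numpy as np

def grad_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between the analytic and numerical gradient of f."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        # Two-sided difference for the i-th parameter.
        approx.flat[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    exact = grad_f(theta)
    return np.linalg.norm(exact - approx) / (
        np.linalg.norm(exact) + np.linalg.norm(approx)
    )

# Toy cost J(theta) = sum(theta^2) with known gradient 2 * theta.
f = lambda t: np.sum(t ** 2)
grad_f = lambda t: 2 * t
diff = grad_check(f, grad_f, np.array([1.0, -2.0, 3.0]))
print(diff)  # a tiny value suggests the analytic gradient is right
```

A relative difference many orders of magnitude above `eps` usually points to a bug in the backward pass. The check is slow, so it is meant for debugging only, not for every training step.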

# Faster training optimizations

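**Mini-batch gradient descent**: X{t} denotes the t-th mini-batch, e.g. X{2} is the second one.
- Batch GD (whole training set per step), mini-batch GD, stochastic GD (online, one example per step, loses the vectorization speedup)
- Make the mini-batch size a power of 2 and ensure that a mini-batch fits in CPU/GPU memory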
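
A rough sketch of how the mini-batches above could be built, assuming examples are columns of `X` and `Y`; `make_mini_batches` is a hypothetical helper, not part of any library.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle examples (columns) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    # Apply the same permutation to X and Y so pairs stay aligned.
    X, Y = X[:, perm], Y[:, perm]
    return [
        (X[:, k:k + batch_size], Y[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

X = np.vstack([np.arange(150), 2 * np.arange(150)]).astype(float)
Y = np.arange(150, dtype=float).reshape(1, 150)
batches = make_mini_batches(X, Y, batch_size=64)
print(len(batches), batches[-1][0].shape)
```

The last mini-batch is smaller whenever the number of examples is not a multiple of the batch size.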

- Epoch: One full pass through the training set
- Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * theta_t. Include the bias correction for the initial phase by dividing v_t by (1 - beta^t)
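
To see why the bias correction matters early on, here is a small plain-Python sketch (function name and arguments are illustrative). With v starting at 0, the uncorrected average is far too small for the first few steps.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, optionally bias-corrected."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # Dividing by (1 - beta^t) compensates for starting at v = 0.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

raw = ewa([10.0, 10.0, 10.0], bias_correction=False)
corrected = ewa([10.0, 10.0, 10.0], bias_correction=True)
print(raw[0], corrected[0])
```

For a constant input the corrected average equals that constant from the first step, while the raw average only creeps up toward it.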

- **Gradient descent with momentum**: Ball rolling downhill. The gradient acts as acceleration, the moving average of gradients as velocity, and beta as friction that damps it. (To do: watch the video again.)

**RMSprop**

**Adam** optimization algorithm
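
Adam combines the momentum term and the RMSprop term, each with bias correction. The sketch below follows the commonly published update rule with its usual default hyper-parameters; `adam_step` and the `state` dictionary are names assumed for this note.

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moving averages."""
    state["t"] += 1
    t = state["t"]
    # Momentum part: moving average of gradients.
    state["v"] = beta1 * state["v"] + (1 - beta1) * grad
    # RMSprop part: moving average of squared gradients.
    state["s"] = beta2 * state["s"] + (1 - beta2) * grad ** 2
    # Bias correction for both averages.
    v_hat = state["v"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)

# Toy problem: minimize J(w) = sum(w^2), whose gradient is 2 * w.
w = np.array([5.0, -3.0])
state = {"t": 0, "v": np.zeros_like(w), "s": np.zeros_like(w)}
for _ in range(2000):
    w = adam_step(w, 2 * w, state, lr=0.05)
print(w)  # should end up near the minimum at 0
```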

**Learning rate decay**: alpha = alpha_0 / (1 + decay_rate * epoch_num)

**Searching for the local optima**: In high dimensions the main obstacles are saddle points and plateaus in the surface, which slow learning down. Overall, it is unlikely to get stuck in a bad local optimum.
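
The learning rate decay formula above, evaluated for a few epochs. The starting rate and decay rate are arbitrary example values, not recommendations.

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    """alpha = alpha_0 / (1 + decay_rate * epoch_num)"""
    return alpha_0 / (1 + decay_rate * epoch_num)

# Example values only: alpha_0 = 0.2, decay_rate = 1.0.
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
print(schedule)  # starts at 0.2 and halves after the first epoch
```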

# Hyper-parameter tuning

- Search randomly, coarse to fine zoom in search space
- Appropriate scale selection: sample learning_rate on a logarithmic scale, and for beta sample 1 - beta on a logarithmic scale
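
A sketch of random sampling on a logarithmic scale. The ranges are example assumptions, and the function names are invented for this note. For beta, the exponent is sampled for 1 - beta, so values close to 1 get enough resolution.

```python
import numpy as np

def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Sample uniformly in log10 space between low and high."""
    r = rng.uniform(np.log10(low), np.log10(high))
    return 10 ** r

def sample_beta(rng, low=0.9, high=0.999):
    """Sample 1 - beta on a log scale, then convert back to beta."""
    r = rng.uniform(np.log10(1 - high), np.log10(1 - low))
    return 1 - 10 ** r

rng = np.random.default_rng(0)
lrs = [sample_learning_rate(rng) for _ in range(5)]
betas = [sample_beta(rng) for _ in range(5)]
```

A coarse-to-fine search would repeat this with a narrower `low`/`high` range around the best values found so far.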

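**Pandas vs Caviar**: Babysit a single model, adjusting it as it trains (panda), or train many models in parallel and keep the best (caviar).

**Batch Normalization (unclear)**: Makes the network more robust to the choice of hyper-parameters, i.e. a larger part of the search space works well!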

Normalize the inputs to each layer's activation (z[l]), not just the network inputs.

- We normalize the **input** values (A0) to center them near the linear region of the sigmoid function, which avoids saturated (vanishing) gradients and speeds up training.
- Batch normalization centers the values in deeper layers as well, which speeds up training overall. Gamma and beta then scale and shift the normalized values to whatever distribution trains best.
- Batch norm is applied before the activation function.
- Batch norm eliminates the bias term (subtracting the mean cancels it) and adds two learnable terms of its own: gamma (scale) and beta (shift).
- Works well with momentum, RMSprop and Adam.
- It decouples the effect of weights from earlier layers on the later layers by normalizing z values.
- Batch normalization adds slight regularizing effect.
- At test time, use running averages of mu and sigma collected during training.
- Covariate shift = change in the distribution of x, e.g. training on black cats and testing on colorful cats; the model cannot find a new decision boundary on its own!
- Batch norm reduces this problem from the perspective of a deeper layer: the mean and variance of its inputs stay stable even as earlier layers change.
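
The batch norm bullets above can be put together in a short forward-pass sketch. It assumes z has shape `(units, batch)`; the function name, the `running` dictionary and the `momentum` and `eps` defaults are assumptions made for illustration.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, running, training=True,
                      momentum=0.9, eps=1e-5):
    """Normalize z per unit, then scale by gamma and shift by beta."""
    if training:
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        # Running averages are stored for use at test time.
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(3, 256))
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
running = {"mu": np.zeros((3, 1)), "var": np.ones((3, 1))}
z_tilde = batchnorm_forward(z, gamma, beta, running)
```

Adding a constant to `z` leaves the training-mode output unchanged, since the mean subtraction removes it. That is why a separate bias term becomes redundant.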

**Softmax regression layer and class selection**: Inputs a vector and outputs a normalized vector of class probabilities that sum to 1
- Softmax vs hard max (binary) functions: hard max puts 1 on the largest entry and 0 elsewhere
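
A minimal sketch contrasting the two functions, assuming z has shape `(classes, examples)`. Subtracting the column maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax: exponentiate, then normalize to sum to 1."""
    shifted = z - z.max(axis=0, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def hardmax(z):
    """Column-wise hard max: 1 at the largest entry, 0 elsewhere."""
    out = np.zeros_like(z)
    out[z.argmax(axis=0), np.arange(z.shape[1])] = 1
    return out

z = np.array([[5.0], [2.0], [-1.0], [3.0]])
probs = softmax(z)
one_hot = hardmax(z)
```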