Improving Deep Neural Networks

  1. Applying ML to a given problem is a highly iterative process. It’s important to choose the following hyper-parameters well: number of layers, number of hidden units, learning rate, activation functions.
  2. If the data-set is very large (in the millions), the dev and test sets may be reduced to 1% or less.
    1. Make sure the dev and test sets come from the same distribution.
    2. If the data-set is small, more data is needed in the dev phase for reliable results, hence the traditional 20% dev split.
  3. Bayes error ≈ human performance on the task; used as a baseline for estimating bias and variance problems.
  4. Training a bigger network almost never hurts; the main cost involved is simply the training time. Regularization is always important for big networks.


  1. L2 regularization: add (lambda / 2m) * sum over layers of ||W[l]||^2_F to the cost J; gradient descent then shrinks every weight slightly on each step ("weight decay").
  2. Regularization pushes the network toward linearity: with large lambda the weights shrink toward zero, z stays small, and tanh/sigmoid activations operate in their near-linear region, preventing overly complex decision boundaries.
  3. Inverted Dropout
    1. We scale the surviving activations up by dividing by keep_prob at the end, keeping the expected activation value constant so that later neurons still receive inputs of the usual size. This scaling is what makes the dropout "inverted".
    2. No dropout is used while making predictions at test time.
    3. Dropout spreads out the weights: because any unit can disappear, the network cannot rely on any single feature, so it shrinks and distributes weight values, an effect similar to L2 regularization.
    4. Dropout probabilities can differ per layer. Preferably do not use dropout on the input layer.
    5. Best way to debug: first turn dropout off (keep_prob = 1) and check that J decreases with every iteration, then turn dropout back on.
  4. More ways to avoid over-fitting/regularize
    1. Data augmentation: flip, rotate, distort, crop images.
    2. Orthogonalization: treat fitting the training set (minimizing J) and reducing over-fitting as two separate tasks, tackled one at a time.
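The inverted-dropout procedure described above can be sketched in a few lines of NumPy; the function name and shapes are illustrative, not course code:

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout on an activation matrix a (units x examples)."""
    mask = rng.random(a.shape) < keep_prob   # keep each unit with probability keep_prob
    a = a * mask                             # silence the dropped units
    a = a / keep_prob                        # scale up so the expected activation is unchanged
    return a, mask

rng = np.random.default_rng(0)
a = np.ones((4, 5))
a_drop, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
# surviving entries become 1 / 0.8 = 1.25, dropped entries become 0
```

At test time this function is simply not called: predictions use all units with no mask and no scaling.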


  1. Normalized inputs = Better, faster, smoother gradient descent
  2. Avoiding vanishing/exploding gradients = near-linear activations plus consistently small/large weight matrices make activations and gradients shrink or blow up through the repeated multiplications of a deep network. The solution is to choose the weight initialization smartly, with variance matched to the activation function, i.e., He et al. initialization for ReLU (variance 2/n) or Xavier initialization for tanh (variance 1/n).
  3. Gradient checking
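Gradient checking compares the analytic (backprop) gradient against a two-sided numerical estimate. A minimal sketch on a toy cost, where the function name and the 1e-7 threshold rule of thumb are the only assumptions:

```python
import numpy as np

def grad_check(J, grad, theta, eps=1e-7):
    """Compare analytic grad(theta) with a two-sided numerical estimate of dJ/dtheta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        approx[i] = (J(tp) - J(tm)) / (2 * eps)
    g = grad(theta)
    # Relative difference around 1e-7 or less suggests the backprop gradient is correct
    return np.linalg.norm(approx - g) / (np.linalg.norm(approx) + np.linalg.norm(g))

# Toy cost J(theta) = sum(theta^2) with analytic gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
diff = grad_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta)
```

Only use this for debugging, never during training: the loop over parameters is far too slow, and it does not work with dropout turned on.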

Faster training optimizations

  1. Mini batch gradient descent:
    1. Notation: X^{t} denotes the t-th mini-batch (e.g. X^{2} is the second).
    2. Batch GD, Mini Batch GD, Stochastic GD(Online, lose vectorization speedup)
    3. Make the mini-batch size a power of 2 and ensure that the mini-batch fits in GPU/CPU memory.
  2. Epoch: Pass through the training set
  3. Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t; divide by the bias-correction term (1 - beta^t) during the initial phase.
  4. Gradient descent with momentum: like a ball rolling downhill, the gradient provides acceleration while beta acts as friction damping the oscillations, so descent speeds up along the consistent direction. (Re-watch video.)
  5. RMSprop
  6. Adam optimization algorithm
  7. Learning rate decay: alpha = alpha_0 / (1 + decay_rate * epoch_num)
  8. The problem of local optima: in high dimensions most zero-gradient points are saddle points, and plateaus slow training down; overall it is unlikely to get stuck in a bad local optimum.
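Adam combines the ideas above: it keeps exponentially weighted averages of the gradient (momentum) and of its square (RMSprop), both bias-corrected by (1 - beta^t). A minimal sketch on a toy cost, also using the learning-rate-decay formula from these notes; the function name and toy setup are mine, not from the course:

```python
import numpy as np

def adam_step(theta, grad, v, s, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term v, RMSprop term s, both bias-corrected."""
    v = beta1 * v + (1 - beta1) * grad           # EWA of gradients (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2      # EWA of squared gradients (RMSprop)
    v_hat = v / (1 - beta1 ** t)                 # bias correction for the initial phase
    s_hat = s / (1 - beta2 ** t)
    theta = theta - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s

# Minimize the toy cost J(theta) = theta^2 (gradient 2*theta)
theta = np.array([5.0])
v, s = np.zeros(1), np.zeros(1)
alpha_0, decay_rate = 0.5, 0.1
for epoch in range(1, 201):
    alpha = alpha_0 / (1 + decay_rate * epoch)   # learning-rate decay from the notes
    theta, v, s = adam_step(theta, 2 * theta, v, s, epoch, alpha)
# theta should end near the minimum at 0
```

Setting beta2 = 0 (no RMSprop term) reduces this to plain momentum; setting beta1 = 0 reduces it to RMSprop.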

Hyper-parameter tuning

  1. Tuning priority: learning rate alpha first; then momentum beta, number of hidden units, mini-batch size; then number of layers and learning-rate decay.
  2. Search randomly, coarse to fine zoom in search space
  3. Appropriate scale selection: sample learning_rate on a logarithmic scale; for beta, sample 1 - beta on a logarithmic scale (e.g. 0.9 up to 0.999).
  4. Pandas vs Caviar: Babysitting models in earlier stages of learning, or train many models in parallel
  5. Batch Normalization(unclear): Robustness to hyper-parameter search space!
    Normalize inputs(z[l]) across layers.

    1. We normalize the input values (A0) to center and rescale them, which makes the cost contours more symmetric and speeds up training; it also keeps sigmoid/tanh units near their useful region instead of their flat saturated tails.
    2. Batch normalization similarly centers the values in deeper layers and speeds up training overall. Gamma and beta then shift and scale the normalized value to whatever distribution trains best.
    3. Batch norm is applied before activation function.
    4. Batch norm makes the bias term b[l] redundant (the mean subtraction cancels any constant) and adds two learnable parameters of its own: gamma and beta, which set the scale and shift of the normalized distribution.
    5. Works well with momentum, RMSprop and Adam.
    6. It decouples the effect of weights from earlier layers on the later layers by normalizing z values.
    7. Batch normalization adds slight regularizing effect.
    8. At test time, use running averages of mu and sigma computed during training.
    9. Covariate shift = a change in the distribution of x, e.g. training on black cats but then seeing colorful cats; the model cannot find a new decision boundary on its own!
      1. Batch norm reduces this problem from the perspective of a deeper layer's inputs: the distribution of z values it sees shifts less as the earlier weights change.
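The batch-norm transform described above, applied per unit across a mini-batch, with gamma and beta as the learnable scale and shift; a minimal training-time sketch (function name and shapes are mine):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z (units x batch) per unit, then scale and shift with learnable gamma, beta."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    return gamma * z_norm + beta             # gamma and beta pick the distribution to feed onward

rng = np.random.default_rng(1)
z = rng.normal(5.0, 3.0, size=(4, 64))       # pre-activations far from zero mean
gamma = np.ones((4, 1))                      # gamma=1, beta=0 gives pure normalization
beta = np.zeros((4, 1))
z_tilde = batchnorm_forward(z, gamma, beta)
# each row of z_tilde now has mean ~0 and variance ~1
```

Note that subtracting mu is what cancels any bias b[l] added before this step, and that at test time mu and var would come from running averages rather than the current batch.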
  6. Soft max regression layer and class selection
    1. Inputs a vector of logits and outputs a probability vector: non-negative entries that sum to 1.
    2. Soft max vs hard max: hard max outputs a one-hot vector at the largest entry, while softmax assigns smooth probabilities to every class.
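The softmax transform above in NumPy; the max-subtraction trick is a standard stability detail, not from the course notes:

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a probability distribution."""
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
# p is non-negative and sums to 1; the largest logit gets the largest probability
```

A hard max would instead return a one-hot vector at np.argmax(z), throwing away the relative confidence information that softmax preserves.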