Orthogonalization

Importance of having a single real number evaluation metric: F1 score for Precision and Recall, Satisficing and optimizing metrics
Comparing with human level performance
3. As we approach human level performance, it is harder to detect which one of bias/variance needs improvement
4. Human level performance as a proxy for Bayes error
Setting up training, dev and test sets properly
1. Always make sets come from same distribution
2. Size of test set should be big enough for confidence in final performance of system
3. Changing evaluation metrics and shifting targets in between project
4. Specifically giving more weight to extreme wrong examples
If doing well on current metric + dev/test set does not correspond to doing well on final application, we should change the metric or the dev/test set

Error Analysis

Manually determining why our model is failing by analyzing wrong inferences.
Determine category based error rate to focus improvements.
Random errors vs Systematic errors in training set
Important to consider examples which we got wrong and right too
Setup train/dev/test sets and build your first system quickly and then iterate according to Bias/Variance and Error Analysis

Data set must reflect the target to hit, don’t target PC when user is on mobile
Set training set as (M+m), dev/test set as (m) – dev set might be much harder
Training-dev set = Representative of training set to determine if network is flawed [high bias / variance] or (dev) data is too hard to process [data mismatch problem]
1. Fundamental take away is that performance must be compared across data sets coming from same distribution
Transfer learning as sober first, drunk later

When you train on perfect pictures and dev/test on blurry pictures and your model performs very poorly, rather than when your data labels are wrong
Artificial data synthesis
1. How distribution of data in training and dev set differs, make training data more similar to dev/test sets
2. Careful about over-fitting to noise(just 1 hour of noise)

When we have
1. Lot of data for original problem A and less data for target problem B
2. Same input data for both problems
3. Low level features from A and B are preferably common
Deleting final layer, expanding new network, deciding to train initial layers or not and how deep to keep level of training

Training one network to classify 4 things > Training 4 networks to classify 1 thing

Replacing multiple stages of learning with a single network. Requires a very large amount of data to compete with the traditional pipeline approach.
A network learns from the data, and from the hand designed components(which reflect the human knowledge)
Key question to ask is if we have enough data to learn a function of the complexity needed to map x to y?