Supervised Learning
The right answers (labels) are given; the program fits the available data and produces a function.
m = number of training examples, n = number of features

Regression
Predicting a real-valued output from continuous data by fitting a function (hypothesis) that can extrapolate to new inputs.
 Univariate Linear Regression: predicting the output as a straight-line function of a single input variable
 Prediction = Data Matrix * Function parameters
 Cost Function = measures how well a candidate straight line fits the data; minimizing it gives the best fit
 Gradient Descent algorithm
 Truth assertion (=) vs assignment (:=)
 Simultaneous Update
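The pieces above (cost function, simultaneous update, gradient descent) can be sketched together in Python; this is a minimal illustration with invented toy data, not production code:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    """Univariate linear regression via batch gradient descent.
    Hypothesis: h(x) = theta0 + theta1 * x
    Cost: J(theta) = (1 / (2m)) * sum((h(x) - y)^2)
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        h = theta0 + theta1 * x  # predictions for all m examples
        # Simultaneous update: compute both gradients before assigning (:=)
        grad0 = (1 / m) * np.sum(h - y)
        grad1 = (1 / m) * np.sum((h - y) * x)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data sampled exactly from y = 2x + 1, so the fit should recover (1, 2)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
t0, t1 = gradient_descent(x, y, alpha=0.05, iterations=5000)
```

With a noiseless line the parameters converge to the true intercept and slope.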
 Multivariate Linear Regression
 x0 = 1
 H(x) = Transpose(Theta) * x
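The vectorized form of the hypothesis (with x0 = 1 prepended) computes predictions for all examples in one matrix product; the numbers below are made up for illustration:

```python
import numpy as np

# Design matrix: m=3 examples, each with x0 = 1 prepended to n=2 features
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, 2.0])  # parameter vector of length n+1

h = X @ theta  # h(x) = theta^T x for every example at once
```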
 Tips and Tricks
 Feature Scaling: Ensuring features are on a similar scale
 Mean Normalization = Replace x with ((x – mean)/scale)
 Check if gradient descent is working correctly: J(theta) should decrease after every iteration; prefer plotting J(theta) against iteration count over checking against a fixed limiting value
 If J(theta) is increasing or oscillating across iterations, use a smaller learning rate (alpha)
 Try alpha values spaced roughly 3x apart: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3
 Defining new features by combining old ones together
 Using quadratic or cubic terms to fit curves that a straight-line model cannot (Polynomial Regression)
 Model selection: deciding which features to use, how many to use, and which model to fit the data to
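Mean normalization as described above can be sketched per feature column; the scale here is the standard deviation (the feature range max − min is another common choice), and the housing-style numbers are invented:

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization: x := (x - mean) / scale, applied per column,
    so every feature ends up on a similar scale for gradient descent."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Invented example: house sizes (very large scale) vs number of bedrooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
Xn, mu, sigma = mean_normalize(X)
```

After normalization each column has mean 0 and standard deviation 1, so no single feature dominates the gradient steps.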
 Normal Equation
 No need for feature scaling
 Practical only when the number of features is small (roughly n < 10,000), since it requires inverting an n x n matrix
 If X^T X is non-invertible, there may be too many features (relative to the number of training examples) or linearly dependent (redundant) features.
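The normal equation is a one-line solve; a minimal sketch with invented data, using the pseudo-inverse so it still works when X^T X is non-invertible:

```python
import numpy as np

# theta = (X^T X)^(-1) X^T y -- no iteration and no feature scaling needed.
# pinv (pseudo-inverse) handles the non-invertible case caused by redundant
# features or by having more features than training examples.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])      # x0 = 1 column plus one feature
y = np.array([1.0, 3.0, 5.0])   # sampled exactly from y = 1 + 2x

theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```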

Classification
Predicting a discrete-valued output. We have a dataset with marked labels; whenever a new data point appears, we classify it into one of the labels.
 One vs All training approach for multiple classes
 Binary classes
 Regularized Logistic Regression
 Problem of overfitting
 Caused by too much training
 Caused by too many features and too little training data
 Regularized Cost Function
 Regularized Parameter updating
 Regularized Normal equation solution
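The regularized cost and parameter updates can be sketched as one cost/gradient function; a minimal illustration (the tiny data set is invented), following the usual convention of not regularizing the bias term theta[0]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_grad(theta, X, y, lam):
    """Regularized logistic regression cost J(theta) and its gradient.
    The regularization term (lam / 2m) * sum(theta_j^2) is added over
    j >= 1 only, so the bias parameter theta[0] is left unpenalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    cost = -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) + reg
    grad = (1 / m) * (X.T @ (h - y))
    grad[1:] += (lam / m) * theta[1:]  # regularize every parameter but theta[0]
    return cost, grad

# Invented toy data: x0 = 1 column plus one feature
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
cost, grad = regularized_cost_and_grad(np.zeros(2), X, y, lam=1.0)
```

At theta = 0 the hypothesis outputs 0.5 everywhere, so the cost equals log(2), a handy sanity check for any implementation.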
Neural Networks
Cost
Architecture Choices
 Input layer: one unit per element of the input feature vector
 Output layer: one unit per element of the output vector (e.g. one per class)
 Hidden layers: preferably the same number of units in every hidden layer
 Number of neurons per hidden layer: comparable to or greater than the input vector size
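A forward pass through an architecture following these choices can be sketched as below; the layer sizes, random weights, and input are all invented for illustration, and a bias unit is prepended at each layer just like x0 = 1 in regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward propagation through fully connected sigmoid layers.
    `weights` is a list of matrices; each maps (bias + activations) of
    one layer to the raw inputs of the next."""
    a = x
    for W in weights:
        a = sigmoid(W @ np.concatenate(([1.0], a)))  # add bias unit, then activate
    return a

rng = np.random.default_rng(0)
# 3 inputs -> two hidden layers of 4 units each (same size) -> 2 outputs
sizes = [3, 4, 4, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i] + 1))
           for i in range(len(sizes) - 1)]
out = forward(np.array([0.5, -1.0, 2.0]), weights)
```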
Machine Learning Diagnostics
A test that is run to gain insight into the machine learning algorithm and determine what is/is not working and how the performance can be improved. These can take time to implement but are very much worth the time invested.
Determining Bias vs Variance
Machine Learning System Design
Skewed classes = use precision (true positives / predicted positives) and recall (true positives / actual positives) rather than plain accuracy
Varying the classifier's threshold to trade off recall against precision
F1 score (2PR/(P+R)) to pick a threshold with a good precision/recall balance
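These metrics are short enough to compute by hand; a minimal sketch with an invented label vector where 1 marks the rare positive class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision = TP / predicted positives, Recall = TP / actual positives,
    F1 = 2PR / (P + R). Useful when classes are skewed and a classifier
    that always predicts the majority class would score high accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: 3 actual positives, 3 predicted positives, 2 correct
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
```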
“It’s not who has the best algorithm wins, but who has the most data!”
Large data rationality test: Given the input x, can a human expert predict the output y?
Support Vector Machines
Large margin classifier: how strongly the decision boundary reacts to outliers depends on the value of C (large C behaves like low regularization, so the boundary chases outliers)
In SVM, the training examples push the decision boundary away from themselves to maximize p (the projection of each example onto theta), which allows the norm of theta to be minimized. This is the reason for the large margin of separation.
Kernels (similarity functions, e.g. the Gaussian kernel) and landmarks to define complex, non-linear decision boundaries.
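The Gaussian kernel measures similarity to a landmark; a minimal sketch with an invented landmark and test points:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity between x and a landmark: exp(-||x - l||^2 / (2 sigma^2)).
    Equals 1 when x sits exactly on the landmark and decays toward 0 far
    away; evaluating it against several landmarks gives the new features
    the SVM uses to build a non-linear boundary."""
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

l1 = np.array([1.0, 2.0])                              # invented landmark
f_close = gaussian_kernel(np.array([1.0, 2.0]), l1)    # on the landmark
f_far = gaussian_kernel(np.array([10.0, 10.0]), l1)    # far from it
```

Smaller sigma makes the similarity fall off faster, giving a wigglier decision boundary (lower bias, higher variance).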
Recommender Systems
Large Scale Machine Learning
Getting More Data
Unsupervised Learning
Provides a compact low-dimensional representation of the input. Is able to find sensible clusters of data in the input, and can also provide an economical higher-dimensional representation of the input in terms of learned features. Applications include market segmentation, organizing computer clusters in a data center, and astronomical data analysis.

Clustering
Given a dataset with no labels, identify groups of data points that may be clustered together. Example: Google News.
 K-Means Clustering Algorithm
 If there is a cluster with no points assigned, eliminate the cluster rather than reinitializing it randomly.
 Optimization objective: Minimize cost function J
 Random initialization: run K-means several times with different random initializations and keep the clustering with the lowest cost J
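The K-means steps above can be sketched compactly; the two toy blobs are invented, and empty clusters are dropped rather than re-initialized, as noted above:

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    """Plain K-means: randomly pick k distinct examples as initial
    centroids, then alternate (1) assign each point to its nearest
    centroid and (2) move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # distance of every point to every current centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        centroids = np.array([X[assign == j].mean(axis=0)
                              for j in range(len(centroids))
                              if np.any(assign == j)])  # drop empty clusters
    return centroids, assign

# Two invented, well-separated blobs of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, assign = kmeans(X, k=2)
```

Each iteration can only decrease the cost J (the mean squared distance of points to their assigned centroids), which is why the loop converges.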
 Data compression = PCA (Principal Component Analysis)
 Compress n-dimensional data down to k dimensions
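PCA-based compression from n to k dimensions can be sketched via the SVD; the near-collinear 2-D points below are invented so that one dimension captures almost everything:

```python
import numpy as np

def pca_compress(X, k):
    """PCA via SVD: mean-normalize, take the top-k principal directions,
    and project the n-dimensional data down to k dimensions. Returns the
    compressed data plus the pieces needed to reconstruct."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Uk = Vt[:k].T            # n x k matrix of principal directions
    Z = Xc @ Uk              # m x k compressed representation
    return Z, Uk, mu

def pca_reconstruct(Z, Uk, mu):
    return Z @ Uk.T + mu     # approximate recovery of the original data

# Points lying (almost) on a line in 2-D compress to 1-D with little loss
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.1], [4.0, 8.0]])
Z, Uk, mu = pca_compress(X, k=1)
X_rec = pca_reconstruct(Z, Uk, mu)
```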
 Anomaly Detection

Non Clustering
Finding structure in a chaotic environment. Example: cocktail-party audio separation.