# Machine Learning

## Supervised Learning

The right answers (labels) are given; the algorithm simply fits the available data and produces a function.

m = number of training examples,     n = number of features

• ### Regression

Predicting a real-valued output from continuous data. The goal is to generate a function (the hypothesis) that extrapolates beyond the training set.

• Univariate Linear Regression: predicting the output as a straight-line function of a single input variable
• Prediction = Data Matrix * Function parameters
• Cost Function: used to determine the best possible straight line to fit the data
• Assertion (=) vs Assignment (:=)
• Simultaneous Update
• Multivariate Linear Regression
• x0 = 1
• h(x) = transpose(theta) * x
• Tips and Tricks
• Feature Scaling: ensuring features are on a similar scale
• Mean Normalization: replace x with ((x - mean) / range)
• Check that gradient descent is working correctly: J(theta) should decrease after every iteration; prefer plotting J(theta) against iterations over relying on an automatic convergence threshold
• If J(theta) is increasing or oscillating across iterations, use a smaller learning rate (alpha)
• 0.001 -> 0.003 -> 0.01 -> 0.03 -> 0.1 -> 0.3 -> 1 -> 3
• Defining new features by combining old ones together
• Using quadratic or cubic models (polynomial regression) to fit data that a straight line cannot
• Algorithm to check which features to use, how many features to use and which model to fit data to
• Normal Equation
• No need for feature scaling
• Practical only when the number of features is small (roughly n < 10,000), since it inverts an n x n matrix
• If XᵀX is non-invertible, there may be too many features (relative to training examples) or linearly dependent (redundant) features.
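The points above (hypothesis h(x) = transpose(theta) * x with x0 = 1, mean normalization, simultaneous update, and the normal equation) can be sketched in NumPy. The toy housing data and hyperparameters below are purely illustrative:

```python
import numpy as np

# Toy housing data (illustrative): m = 4 examples, n = 2 features
X_raw = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

# Mean normalization: replace x with (x - mean) / range
mean = X_raw.mean(axis=0)
feat_range = np.ptp(X_raw, axis=0)
X = np.column_stack([np.ones(m), (X_raw - mean) / feat_range])  # x0 = 1

# Batch gradient descent with simultaneous update of every theta_j
theta = np.zeros(X.shape[1])
alpha = 1.0                      # picked from the 0.001 -> ... -> 1 -> 3 ladder
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / m     # dJ/dtheta for the squared-error cost
    theta = theta - alpha * grad         # := assignment, all parameters at once

# Normal equation: theta = pinv(X'X) X'y. No feature scaling needed, and the
# pseudo-inverse copes with a non-invertible X'X (redundant features, n > m)
X_ne = np.column_stack([np.ones(m), X_raw])
theta_ne = np.linalg.pinv(X_ne.T @ X_ne) @ X_ne.T @ y
```

Both routes should produce essentially the same predictions on this small dataset; gradient descent needs the scaled features to converge quickly, while the normal equation works on the raw ones.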
• ### Classification

Predicting a discrete-valued output. Given a dataset with labeled examples, each new data point must be classified into one of the existing labels.

• One vs All training approach for multiple classes
• Binary classes
• Regularized Logistic Regression
• Problem of overfitting
• Caused by too much training
• Caused by too many features and too little training data
• Regularized Cost Function
• Regularized Parameter updating
• Regularized Normal equation solution
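The regularized logistic regression pieces above (sigmoid hypothesis, regularized cost, and a parameter update that skips theta_0) can be sketched in NumPy; the dataset and hyperparameters here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    """Regularized cross-entropy cost; theta_0 (intercept) is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return J + lam / (2 * m) * np.sum(theta[1:] ** 2)

def gradient_step(theta, X, y, lam, alpha):
    """One simultaneous update of all parameters."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]      # regularize every theta_j except theta_0
    return theta - alpha * grad

# Tiny binary-class dataset: label is 1 exactly when the feature is positive
X = np.column_stack([np.ones(6), np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(3000):
    theta = gradient_step(theta, X, y, lam=0.1, alpha=0.5)
```

The regularization term keeps theta small, so even with many features the decision boundary stays smooth rather than contorting to fit every training point.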

## Neural Networks

### Cost

#### Architecture Choices

1. Input layer unit = Input feature vector
2. Output layer unit = Output vector
3. Hidden layers = Preferably the same number of units in every hidden layer
4. Number of neurons in each hidden layer = Comparable to or greater than the input vector size
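The sizing rules above can be made concrete with a minimal forward-propagation sketch; the layer sizes and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights):
    """Forward propagation: each layer computes a = sigmoid(Theta @ [1; a_prev])."""
    a = x
    for Theta in weights:
        a = np.concatenate([[1.0], a])           # prepend the bias unit
        a = 1.0 / (1.0 + np.exp(-(Theta @ a)))   # sigmoid activation
    return a

# Input layer = feature vector size (3); two hidden layers of the same size
# (4 units each, >= input size); output layer = output vector size (2)
sizes = [3, 4, 4, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i] + 1)) * 0.1
           for i in range(len(sizes) - 1)]

output = forward(np.array([0.5, -1.2, 2.0]), weights)
```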

## Machine Learning Diagnostics

A test that is run to gain insight into the machine learning algorithm and determine what is/is not working and how the performance can be improved. These can take time to implement but are very much worth the time invested.

Determining Bias vs Variance
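One common way to read the diagnostic: high error on both the training and cross-validation sets suggests high bias (underfitting), while low training error with much higher cross-validation error suggests high variance (overfitting). A minimal sketch, with illustrative error thresholds:

```python
def diagnose(train_err, cv_err, target_err=0.05):
    """Rough bias/variance diagnosis from training and cross-validation error.

    The target_err threshold is illustrative; in practice it depends on the
    acceptable error for the task (e.g. human-level performance).
    """
    if train_err > target_err and cv_err - train_err < target_err:
        return "high bias"       # underfitting: both errors high and close
    if train_err <= target_err and cv_err - train_err >= target_err:
        return "high variance"   # overfitting: large train/CV gap
    return "ok"
```

For example, train error 0.20 with CV error 0.22 reads as high bias, while train error 0.01 with CV error 0.15 reads as high variance.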

## Machine Learning System Design

Skewed classes = Evaluate with precision (true positives / predicted positives) and recall (true positives / actual positives) rather than raw accuracy

Varying the threshold for classifier to determine tradeoff between recall and precision

F1 score (2PR / (P + R)) to determine the threshold

“It’s not who has the best algorithm that wins, it’s who has the most data!”

Large data rationality test: Given the input x, can a human expert predict the output y?
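The precision/recall tradeoff above can be sketched as follows; the function name, example scores, and threshold values are illustrative:

```python
def precision_recall_f1(y_true, scores, threshold=0.5):
    """Precision, recall and F1 score at a given classification threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and t for p, t in zip(preds, y_true))        # true positives
    fp = sum(p and not t for p, t in zip(preds, y_true))    # false positives
    fn = sum((not p) and t for p, t in zip(preds, y_true))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Raising the threshold makes the classifier more conservative: precision tends to rise while recall falls, and the F1 score balances the two when choosing a threshold.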

## Support Vector Machines

Whether a large-margin classifier's decision boundary is swayed by outliers depends on the value of the regularization parameter C.

In SVM, the training examples try to push the decision boundary away from themselves in order to maximize p, while allowing theta to be minimized. This is the reason for the high margin of separation.

Kernels (similarity functions, e.g. the Gaussian kernel) together with landmarks define complex, non-linear decision boundaries.
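The Gaussian kernel measures similarity between a point and a landmark: it equals 1 when they coincide and decays toward 0 as they move apart. A minimal sketch (function name and sigma value are illustrative):

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity of x to a landmark: exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(landmark)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))
```

Smaller sigma makes the similarity fall off faster, giving a more wiggly decision boundary (lower bias, higher variance).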

## Unsupervised Learning

Provides a compact, low-dimensional representation of the input and can find sensible clusters of data in the input. It can also provide an economical high-dimensional representation of the input in terms of learned features. Applications include market segmentation, organizing computer clusters in a data center, and astronomical data analysis.

• ### Clustering

Given a dataset with no labels, identify groups of data which may be clustered together. Example: Google News.

• K-Means Clustering Algorithm
• If a cluster ends up with no points assigned, eliminate the cluster rather than reinitializing it randomly.
• Optimization objective: minimize the cost (distortion) function J
• Random initialization: pick K distinct training examples as the initial centroids
• Data compression = PCA (Principal Component Analysis)
• Compress n-dimensional data to k-dimensional data
• Anomaly Detection
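The K-means steps above (random initialization from training examples, nearest-centroid assignment, centroid update, and eliminating empty clusters) can be sketched in NumPy; the function name and toy data are illustrative:

```python
import numpy as np

def k_means(X, k, iters=50, seed=0):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct training examples as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Eliminate any cluster with no points assigned, rather than
        # reinitializing it randomly
        centroids = np.array([X[labels == c].mean(axis=0)
                              for c in range(len(centroids))
                              if np.any(labels == c)])
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                            axis=2).argmin(axis=1)
    return centroids, labels

# Two well-separated toy blobs
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centroids, labels = k_means(X_demo, k=2)
```

Because K-means only finds a local optimum of J, it is usually run several times with different random initializations, keeping the run with the lowest cost.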
• ### Non Clustering

Finding structure in a chaotic environment. Example: cocktail party audio separation.