Machine Learning

Supervised Learning

The “right answers” (labels) are given; the algorithm fits the available data and produces a function (hypothesis) mapping inputs to outputs.

m = number of training examples,     n = number of features

  • Regression

    Predicting a real-valued output from a continuous range. The algorithm fits a function (hypothesis) to the data, which can then be evaluated on new inputs.

    • Univariate Linear Regression: predicting the output as a straight-line function of a single input variable
    • Prediction = data matrix × parameter vector
    • Cost Function: used to determine the best possible straight line to fit the data (e.g. mean squared error)
      • Gradient Descent algorithm
        • Assertion (=) vs Assignment (:=)
        • Simultaneous update of all parameters
    • Multivariate Linear Regression
      • x0 = 1 (extra bias feature, so theta and x have the same dimension)
      • h(x) = transpose(theta) * x
    • Tips and Tricks
      • Feature Scaling: ensuring features are on a similar scale so gradient descent converges faster
      • Mean Normalization: replace x with (x – mean)/scale, where scale is the range or the standard deviation
      • Check that gradient descent is working correctly: J(theta) should decrease after every iteration; prefer plotting J against iterations over checking a fixed convergence threshold
      • If J(theta) is increasing or oscillating across iterations, use a smaller step value (alpha)
      • Try alphas on roughly a 3× grid: 0.001 → 0.003 → 0.01 → 0.03 → 0.1 → 0.3 → 1 → 3
      • Defining new features by combining existing ones
      • Using quadratic or cubic terms to fit curves that a straight line cannot (polynomial regression)
      • Model-selection procedures to decide which features to use, how many to use, and which model to fit
    • Normal Equation
      • No need for feature scaling
      • theta = inverse(transpose(X) * X) * transpose(X) * y
      • Practical only when the number of features is small (roughly n < 10000), since it inverts an n×n matrix
      • If transpose(X) * X is non-invertible, there may be too many features (compared to training examples) or linearly dependent features.
  • Classification

    Predicting a discrete-valued output. Given a dataset of labeled examples, classify each new data point into one of the labels.

    • One vs All training approach for multiple classes
    • Binary classes
    • Regularized Logistic Regression
      • Problem of overfitting
        • Caused by over-training: the model fits the training set too closely
        • Caused by too many features and too little training data
      • Regularized cost function: add the penalty term (lambda/2m) * sum of theta_j² (for j ≥ 1) to J(theta)
      • Regularized parameter update: each theta_j is shrunk by a factor (1 − alpha*lambda/m) before the usual gradient step
      • Regularized normal equation solution
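The gradient-descent recipe above (hypothesis h(x) = transpose(theta) * x, cost J, simultaneous parameter update) can be sketched in a few lines of NumPy; function and variable names are illustrative:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for linear regression.
    X: (m, n) data matrix, y: (m,) targets, alpha: learning rate."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend the x0 = 1 bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        predictions = Xb @ theta                 # h(x) for every example
        gradient = Xb.T @ (predictions - y) / m
        theta = theta - alpha * gradient         # simultaneous update of all theta_j
    return theta

# Noiseless data from y = 1 + 2x; gradient descent recovers theta = [1, 2].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```

With features on very different scales, mean normalization as noted above would be applied to X first, so one learning rate suits all directions.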

Neural Networks




Cost function: the logistic-regression cost summed over all K output units, plus a regularization term over all the weights.

Backpropagation algorithm: compute the output-layer error delta = a − y, propagate it backwards through the hidden layers, and accumulate the partial derivatives from the deltas. Gradient checking: compare the backpropagated gradient against the numerical estimate (J(theta + epsilon) − J(theta − epsilon)) / (2 epsilon), then turn it off before actual training.

Random initialization: initialize each weight to a small random value in [−epsilon, epsilon] to break symmetry; with all-zero initialization every hidden unit computes the same function.
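A minimal sketch of the random-initialization step; the function name and the epsilon value 0.12 are illustrative, not from the source:

```python
import numpy as np

def init_weights(units_in, units_out, epsilon=0.12):
    """Break symmetry: if all weights start at zero, every unit in a layer
    computes the same function and receives the same gradient update."""
    # one row per outgoing unit; +1 column for the bias input
    return np.random.uniform(-epsilon, epsilon, size=(units_out, units_in + 1))

Theta1 = init_weights(3, 5)  # weights from a 3-unit layer to a 5-unit layer
```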


Architecture Choices

  1. Input layer: one unit per input feature
  2. Output layer: one unit per output class/value
  3. Hidden layers: preferably the same number of units in every hidden layer
  4. Number of neurons per hidden layer: comparable to, or somewhat larger than, the number of input features

Putting it together: pick a network architecture, randomly initialize the weights, implement forward propagation and the cost function, implement backpropagation, verify it with gradient checking, then minimize J(Theta) with gradient descent or an advanced optimizer.
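Forward propagation is just a repeated multiply-then-sigmoid over the layers; a minimal sketch, checked with the classic hand-set XNOR weights from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward propagation: each layer computes a = g(Theta @ [1; a_prev])."""
    a = np.asarray(x, dtype=float)
    for Theta in thetas:
        a = sigmoid(Theta @ np.concatenate(([1.0], a)))  # prepend the bias unit
    return a

# Hand-set weights computing XNOR: hidden units are AND and (NOT x1 AND NOT x2).
thetas = [np.array([[-30.0,  20.0,  20.0],     # AND
                    [ 10.0, -20.0, -20.0]]),   # NOR
          np.array([[-10.0,  20.0,  20.0]])]   # OR of the two hidden units
out = forward([1.0, 1.0], thetas)  # close to 1, since XNOR(1, 1) = 1
```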

Machine Learning Diagnostics

A test that is run to gain insight into the machine learning algorithm and determine what is/is not working and how the performance can be improved. These can take time to implement but are very much worth the time invested.


Determining Bias vs Variance

High bias (underfitting): training error is high and cross-validation error is close to it. High variance (overfitting): training error is low but cross-validation error is much higher. Getting more training data helps with high variance, not with high bias.

Machine Learning System Design

Skewed classes: use precision (true positives / predicted positives) and recall (true positives / actual positives) instead of raw accuracy

Varying the threshold for classifier to determine tradeoff between recall and precision

F score = 2PR/(P + R): a single number for comparing precision/recall tradeoffs when picking a threshold
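All three metrics can be computed directly from the label vectors; a small self-contained sketch (the function name and example data are illustrative):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision = TP / predicted positives, recall = TP / actual positives,
    F1 = 2PR / (P + R)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative
p, r, f = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```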

“It’s not who has the best algorithm that wins, it’s who has the most data!”

Large data rationality test: Given the input x, can a human expert predict the output y?

Support Vector Machines

Whether a large-margin classifier's decision boundary moves to accommodate outliers depends on the value of C: a large C (little regularization) makes the boundary sensitive to individual outliers.

In an SVM, the training examples effectively push the decision boundary away from themselves: a larger projection p of each example onto theta lets the norm of theta be smaller while still satisfying the margin constraints. This is the reason for the large margin of separation.

Kernels (similarity functions, e.g. the Gaussian kernel) together with landmarks define complex, non-linear decision boundaries.
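The Gaussian kernel measures similarity between a point and a landmark; a minimal sketch (variable names and the example points are illustrative):

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity feature: 1.0 when x sits exactly on the landmark,
    decaying towards 0 as x moves away; sigma controls how fast."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
near = gaussian_kernel(x, np.array([1.0, 2.0]))  # 1.0: on the landmark
far = gaussian_kernel(x, np.array([9.0, 9.0]))   # close to 0: far away
```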

Recommender Systems

Large Scale Machine Learning

Getting More Data

Unsupervised Learning

Provides a compact, low-dimensional representation of the input and can find sensible clusters of data in it. It can also provide an economical high-dimensional representation of the input in terms of learned features. Applications include market segmentation, organizing computer clusters in a data center, and astronomical data analysis.

  • Clustering

    Given a dataset with no labels, identify groups of data which may be clustered together. Example- Google News.

    • K-Means Clustering Algorithm
      • If a cluster ends up with no points assigned, eliminate it rather than reinitializing it randomly.
      • Optimization objective: minimize the cost function J (mean squared distance from each point to its cluster centroid)
      • Random initialization: pick K training examples as the initial centroids; rerun several times to avoid bad local optima
    • Data compression: Principal Component Analysis (PCA)
      • Compresses n-dimensional data to k-dimensional data (k < n)
  • Anomaly Detection
  • Non Clustering

    Finding structure in a seemingly chaotic signal. Example: the cocktail party problem (separating overlapping audio sources).
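The K-Means loop described above (assign each point to its nearest centroid, move each centroid to the mean of its points, and drop empty clusters) can be sketched as follows; the data, seed, and iteration count are illustrative:

```python
import numpy as np

def k_means(X, k, iterations=10, seed=0):
    """K-means: alternate between assigning each point to its nearest
    centroid and moving each centroid to the mean of its assigned points.
    Both steps decrease the cost J."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from the data
    for _ in range(iterations):
        # (m, k) matrix of point-to-centroid distances, then nearest index
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # eliminate any cluster with no points rather than reinitializing it
        kept = [j for j in range(len(centroids)) if np.any(labels == j)]
        centroids = np.array([X[labels == j].mean(axis=0) for j in kept])
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# two obvious clusters around (0, 0.5) and (10, 10.5)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = k_means(X, 2)
```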