This course is actually my first on Coursera, and I think it gives a very motivating introduction to Machine Learning. The topics in this course range from machine learning algorithms to their applications in a variety of problems, such as anti-spam, image recognition, clustering, and building recommender systems. Probably most importantly to an ML practitioner, in week 6 Prof Ng will give some advice on how to select the right algorithm for the right job, as well as become expert at ‘debugging’ and figuring out how to improve a learning algorithm’s performance!
Supervised vs Unsupervised Learning
The course takes off with an overview of Machine Learning. Distinction is made between supervised and unsupervised learning:
- In supervised learning (regression/classification), we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
- In unsupervised learning (e.g. clustering, cocktail party algorithm), inferences are drawn from datasets consisting of input data without labeled responses.
This is the first learning algorithm taught in this course! Prof Ng starts with the model representation of Linear Regression and shows an example of its application to housing price prediction. The discussion continues with an introduction of the squared error cost function (also called mean squared error) for univariate linear regression. Prof Ng does this by giving an illuminating intuition about the cost function, first with a 2-D line plot then with 3-D and contour plot.
To minimize the cost function, an algorithm called (batch) gradient descent is subsequently introduced. As always, Prof Ng gives the intuition behind the gradient descent formula in order to explain how the learning rate plays a role in finding the local minimum. Finally, we see how gradient descent is used to find the optimal parameters for linear regression and Prof Ng ends with a brief mention about the optimization problem being posed here for linear regression as having only one global minimum and no local optima (i.e. the cost function is convex).
In week 2, the course starts to extend linear regression to accommodate multiple input features (a.k.a multivariate linear regression). The hypothesis function is just a weighted sum of all input features. Once we get this right, the form of the cost function and gradient descent looks pretty much similar to the univariate case; just repeated for multiple input features.
In the later part of week 2, Prof Ng explains how feature scaling (and mean normalization) can help speed up gradient descent. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
Logistic regression is a method for classifying data into discrete outcomes. For example, one might use logistic regression to classify an email as spam or not spam. In this lesson, Prof Ng introduces the notion of classification, the cost function for logistic regression, and the application of logistic regression to multi-class classification using the one-vs-all method:
One of the most common Data Science interview question: how do you combat overfitting? In this module, Prof Ng will answer this dreaded question and show how regularization can address overfitting when we have a lot of slightly useful features:
This is the hottest topic right now in Machine Learning and there are other courses that dive deep specially into Neural Network alone. In this course, the topic of neural network is split into two weeks: first on representation, and second on how to train neural networks.
Part 2 (Learning): Our cost function for neural networks is going to be a generalization of the one we used for logistic regression, with addition of some nested summations to account for our multiple output nodes. In order to find the optimal parameters that minimize this cost function, we would use the famous backpropagation algorithm:
Note that the cost function is non-convex and we may get trapped in local minima; in practice the weights are randomly initialized (for symmetry breaking). To check that an implementation of backpropagation works as intended, a numerical approximation to estimate the derivative of the cost function is introduced as ‘gradient checking’.
To put it all together: First, we pick a network architecture; choose the layout of our neural network, including how many hidden units in each layer and how many layers in total we want to have.
- Number of input units = dimension of features x(i)
- Number of output units = number of classes
- Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
Second, we train our Neural Network
- Randomly initialize the weights
- Implement forward propagation to get hΘ(x(i)) for any x(i)
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
Support Vector Machine
Without a very minimal amount of mathematics, Prof Ng is able to illuminate the topic of SVM by giving lots of handy intuition. The discussion starts with adapting cost function from logistic regression to large margin classifier, then moves on to kernel and finally SVM in practice. Brief mention of Mercer’s theorem was made to give examples of valid kernels, such as
- Gaussian kernel (discussed at length)
- polynomial kernel
- string kernel
- chi-square kernel
- histogram intersection kernel
The discussion on SVM ends with a short note about multi-class classification (one-vs-all method) and some rules of thumb when to use logistic regression/SVM linear kernel or SVM with kernel.
Principal Component Analysis
In week 8, Prof Ng introduces Principal Components Analysis, and show how it can be used for data compression to speed up learning algorithms as well as for visualizations of complex datasets.
Prof Ng really tries to avoid the nitty-gritty of the linear algebra, but he gives a recipe of how PCA is done:
- Compute the covariance matrix from the feature vectors of all training examples
- Find the top K eigenvectors of the covariance matrix
- The K principal components are the projection of the original feature vector to the K eigenvectors.
In case the steps above sound esoteric, I recommend watching the video courses and you will be surprised how simple PCA is!
Next: K-means Clustering
Anomaly detection is widely used in fraud detection (e.g. ‘has this credit card been stolen?’). Given a large number of data points, we may sometimes want to figure out which ones vary significantly from the average. For example, in manufacturing, we may want to detect defects or anomalies. One way to perform anomaly detection is by modeling the dataset using a Gaussian distribution by first fitting the parameters μ and σ to every feature dimension.
Building and evaluating anomaly detection system:
- Assume we have labeled data (anomalous/non-anomalous) and split into training/CV/test set
- Fit model p(x) on training set and flag anomalous if p(x) < ε
- Evaluation metrics: Precision/recall, F1-score
The lesson starts by describing content-based recommendation: how to learn a distinct set of parameters for every user, given that every training example can be represented by a set of features that describe its content. As usual, the optimization objective and the gradient descent formula are also written out. However, content-based recommendation assumes that the “content” of each product is known/given.
Another approach to recommend products is called collaborative filtering, where the features are also learned during training time. This is only possible because each user has rated multiple products and each product is rated by multiple users. The term collaborative filtering refers to the fact that all users who have submitted their ratings collaborate in improving the system; so that we can guess the initial parameters for each user, use them to learn product features, use the results to improve the parameters, then learn better features, and so on.
Collaborative filtering is often called low rank matrix factorization. Concretely, given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: Y=XΘT. Note that predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x.
Nuts and Bolts
Strategy on designing a machine learning system (e.g. building a spam classifier):
- Choose features: words that are indicative of spam/not spam (e.g. bag-of-words model), email routing information/email header
- Collect lots of data (e.g. honeypot project)
In order to help us as Machine Learning practitioner to choose which option to prioritize, Prof Ng suggests what I call a three-step approach:
- start with a quick-and-dirty algorithm and test it early on cross-validation set.
- plot learning curve to decide if you need more data/more features and do error analysis to find systematic trend of errors in the cross-validation set.
- use a single numerical evaluation metric to try out new ideas (e.g. stemming).
For data with skewed classes, classification accuracy is not a good evaluation metric and Precision/Recall is often used instead. Plotting the precision/recall curve, it’s quickly apparent that there is a tradeoff and one way to compute the golden mean is using the F1-score.
Finally, Prof Ng discusses on large data rationale and suggests when it’s appropriate to collect massive amount of data to get a high performance learning algorithm.