The course opens with an overview of Machine Learning. A distinction is made between supervised and unsupervised learning:

- In supervised learning (regression/classification), we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
- In unsupervised learning (e.g. clustering, cocktail party algorithm), inferences are drawn from datasets consisting of input data without labeled responses.

This is the first learning algorithm taught in this course! Prof Ng starts with the model representation of Linear Regression and shows an example of its application to housing price prediction. The discussion continues with an introduction of the squared error cost function (also called *mean squared error*) for univariate linear regression. Prof Ng builds an illuminating intuition about the cost function, first with a 2-D line plot, then with 3-D surface and contour plots.
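The cost function can be sketched in a few lines of Python (the course itself uses Octave; the data below is made up for illustration):

```python
# Squared error cost J(θ0, θ1) = 1/(2m) · Σ (h(x_i) − y_i)²
# for the univariate hypothesis h(x) = θ0 + θ1·x.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

# toy data: input feature vs target value
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # a perfect fit has zero cost
print(cost(0.0, 0.5, xs, ys))  # a worse fit has higher cost
```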

To minimize the cost function, an algorithm called (batch) gradient descent is subsequently introduced. As always, Prof Ng gives the intuition behind the gradient descent formula in order to explain how the learning rate plays a role in finding the local minimum. Finally, we see how gradient descent is used to find the optimal parameters for linear regression, and Prof Ng ends with a brief note that the optimization problem posed here for linear regression has only one global minimum and no local optima (i.e. the cost function is convex).
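As a sketch (in Python rather than the course's Octave, with toy data), batch gradient descent repeatedly takes a step of size α down the negative gradient, updating both parameters simultaneously:

```python
# Batch gradient descent for univariate linear regression:
# θ_j := θ_j − α · ∂J/∂θ_j, with both parameters updated simultaneously.

def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1  # simultaneous update
    return t0, t1

# toy data generated from y = 2x: should converge near θ0 ≈ 0, θ1 ≈ 2
t0, t1 = gradient_descent([1, 2, 3], [2, 4, 6])
```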

In week 2, the course extends linear regression to accommodate multiple input features (a.k.a. multivariate linear regression). The hypothesis function is just a weighted sum of all input features. Once we get this right, the cost function and gradient descent look much the same as in the univariate case, just repeated across multiple input features.

In the later part of week 2, Prof Ng explains how feature scaling (and mean normalization) can help speed up gradient descent. This is because θ descends quickly on small ranges and slowly on large ranges, and so oscillates inefficiently down to the optimum when the feature ranges are very uneven.

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
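A minimal sketch of both tricks combined, in Python with made-up numbers:

```python
# Feature scaling with mean normalization: x' = (x − μ) / (max − min),
# giving the feature zero mean and a spread of exactly one.

def scale(values):
    mu = sum(values) / len(values)
    rng = max(values) - min(values)
    return [(v - mu) / rng for v in values]

sizes = [2104, 1416, 1534, 852]   # made-up house sizes in sq ft
scaled = scale(sizes)
print(scaled)   # zero mean, range of 1
```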

Logistic regression is a method for classifying data into discrete outcomes. For example, one might use logistic regression to classify an email as spam or not spam. In this lesson, Prof Ng introduces the notion of classification, the cost function for logistic regression, and the application of logistic regression to multi-class classification using the one-vs-all method:
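The hypothesis at the heart of logistic regression can be sketched as follows (Python for illustration; the weights below are invented):

```python
import math

# Logistic regression hypothesis h(x) = g(θᵀx) with the sigmoid g(z) = 1/(1+e^−z).
# h(x) is interpreted as P(y=1 | x; θ); predict class 1 when h(x) >= 0.5.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# hypothetical weights; x[0] = 1 is the intercept term
theta = [-4.0, 1.5]
print(predict(theta, [1.0, 4.0]))   # z = 2.0, so h(x) ≈ 0.88: predict "spam"
```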

One of the most common Data Science interview questions is: how do you combat overfitting? In this module, Prof Ng answers this dreaded question and shows how regularization can address overfitting when we have a lot of slightly useful features:

This is the hottest topic right now in Machine Learning, and there are other courses that dive deep into neural networks alone. In this course, the topic of neural networks is split across two weeks: first on representation, and second on how to train neural networks.

Part 2 (Learning): Our cost function for neural networks is a generalization of the one we used for logistic regression, with the addition of some nested summations to account for our multiple output nodes. To find the optimal parameters that minimize this cost function, we use the famous *backpropagation algorithm*:

Note that the cost function is non-convex and we may get trapped in local minima; in practice the weights are randomly initialized (for symmetry breaking). To check that an implementation of backpropagation works as intended, a numerical approximation to estimate the derivative of the cost function is introduced as ‘gradient checking’.
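The idea behind gradient checking can be illustrated on a one-parameter cost function (a Python sketch, not the course's Octave code):

```python
# Gradient checking: compare an analytic gradient against the two-sided
# numerical approximation dJ/dθ ≈ (J(θ+ε) − J(θ−ε)) / (2ε).

EPS = 1e-4

def numerical_grad(J, theta):
    return (J(theta + EPS) - J(theta - EPS)) / (2 * EPS)

# example cost with a known derivative: J(θ) = θ², so dJ/dθ = 2θ
J = lambda t: t ** 2
analytic = 2 * 3.0
approx = numerical_grad(J, 3.0)
print(abs(analytic - approx))  # should be tiny if backprop were correct
```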

To put it all together: first, we pick a network architecture, i.e. the layout of our neural network, including how many hidden units each layer has and how many layers we want in total.

- Number of input units = dimension of features x(i)
- Number of output units = number of classes
- Number of hidden units per layer = usually, the more the better (balanced against the cost of computation, which grows with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, it is recommended that you have the same number of units in every hidden layer.

**Second, we train our Neural Network**

- Randomly initialize the weights
- Implement forward propagation to get hΘ(x(i)) for any x(i)
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

With only a minimal amount of mathematics, Prof Ng is able to illuminate the topic of SVMs by giving lots of handy intuition. The discussion starts with adapting the cost function from logistic regression into a large margin classifier, then moves on to kernels and finally SVMs in practice. Brief mention is made of Mercer’s theorem, with examples of valid kernels such as

- Gaussian kernel (discussed at length)
- polynomial kernel
- string kernel
- chi-square kernel
- histogram intersection kernel

The discussion on SVMs ends with a short note about multi-class classification (one-vs-all method) and some rules of thumb on when to use logistic regression, an SVM with a linear kernel, or an SVM with a (Gaussian) kernel.
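For reference, the Gaussian kernel discussed at length computes a similarity that equals 1 at a landmark and decays with distance (a quick Python sketch):

```python
import math

# Gaussian (RBF) kernel: similarity(x, l) = exp(−‖x − l‖² / (2σ²)).
# Returns 1 when x equals the landmark l, and decays toward 0 with distance.

def gaussian_kernel(x, l, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(gaussian_kernel([1, 2], [1, 2]))   # at the landmark: 1.0
print(gaussian_kernel([1, 2], [4, 6]))   # far away: near 0
```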

In week 8, Prof Ng introduces Principal Component Analysis and shows how it can be used for data compression to speed up learning algorithms, as well as for visualization of complex datasets.

Prof Ng really tries to avoid the nitty-gritty of the linear algebra, but he gives a recipe for how PCA is done:

- Compute the covariance matrix from the feature vectors of all training examples
- Find the top K eigenvectors of the covariance matrix
- The K principal components are the projection of the original feature vector to the K eigenvectors.

*In case the steps above sound esoteric, I recommend watching the video courses and you will be surprised how simple PCA is!*
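The recipe above can be sketched in plain Python for k = 1, using power iteration to find the top eigenvector (power iteration is my substitution for illustration; in practice you would use SVD, as the course does in Octave):

```python
# PCA sketch: build the covariance matrix, find its top eigenvector,
# then project each example onto it to get its first principal component.

def covariance(data):  # data: list of feature vectors
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]

def top_eigenvector(M, iters=200):
    v = [1.0] * len(M)
    for _ in range(iters):  # power iteration: repeatedly apply M and normalize
        w = [sum(M[i][j] * v[j] for j in range(len(M))) for i in range(len(M))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# made-up 2-D training examples with correlated features
data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
u = top_eigenvector(covariance(data))
scores = [sum(a * b for a, b in zip(row, u)) for row in data]
```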

Next: K-means Clustering

Anomaly detection is widely used in fraud detection (e.g. ‘has this credit card been stolen?’). Given a large number of data points, we may sometimes want to figure out which ones vary significantly from the average. For example, in manufacturing, we may want to detect defects or anomalies. One way to perform anomaly detection is by modeling the dataset using a Gaussian distribution, first fitting the parameters **μ and σ** to every feature dimension.

Building and evaluating an anomaly detection system:

- Assume we have labeled data (anomalous/non-anomalous) and split it into training/CV/test sets
- Fit the model p(x) on the training set and flag an example as anomalous if p(x) < **ε**
- Evaluation metrics: precision/recall, F1-score
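The density model above can be sketched as follows (Python, with made-up readings; each feature is treated as an independent Gaussian):

```python
import math

# Fit μ and σ² per feature on (assumed normal) training data, then flag
# an example as anomalous when its density p(x) falls below ε.

def fit(train):  # train: list of feature vectors
    d = len(train[0])
    mu = [sum(x[j] for x in train) / len(train) for j in range(d)]
    var = [sum((x[j] - mu[j]) ** 2 for x in train) / len(train) for j in range(d)]
    return mu, var

def p(x, mu, var):
    prob = 1.0
    for xj, m, v in zip(x, mu, var):  # product of per-feature Gaussian densities
        prob *= math.exp(-(xj - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return prob

train = [[9.8, 5.0], [10.1, 5.2], [9.9, 4.9], [10.2, 5.1]]  # made-up normal data
mu, var = fit(train)
eps = 1e-3
print(p([10.0, 5.0], mu, var) < eps)   # typical point: not flagged
print(p([20.0, 1.0], mu, var) < eps)   # far-out point: flagged as anomalous
```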

The lesson starts by describing *content-based recommendation*: how to learn a distinct set of parameters for every user, given that every training example can be represented by a set of features that describe its content. As usual, the optimization objective and the gradient descent formula are also written out. However, content-based recommendation assumes that the “content” of each product is known/given.

Another approach to recommend products is called *collaborative filtering*, where the features are also learned during training time. This is only possible because each user has rated multiple products and each product is rated by multiple users. The term *collaborative filtering* refers to the fact that all users who have submitted their ratings collaborate in improving the system; so that we can guess the initial parameters for each user, use them to learn product features, use the results to improve the parameters, then learn better features, and so on.

Collaborative filtering is often called *low rank matrix factorization*. Concretely, given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: Y=XΘ^{T}. Note that predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x.
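A toy sketch of the prediction step (the feature and weight values are invented for illustration):

```python
# Low rank matrix factorization: predicted ratings Y = X Θᵀ, where row i of X
# holds movie i's learned features and row j of Θ holds user j's weights.

def matmul_transpose(X, Theta):
    return [[sum(x * t for x, t in zip(xrow, trow)) for trow in Theta]
            for xrow in X]

X = [[0.9, 0.1],       # movie 0: mostly "romance" feature
     [0.2, 1.0]]       # movie 1: mostly "action" feature
Theta = [[5.0, 0.0],   # user 0 likes romance
         [0.0, 5.0]]   # user 1 likes action
Y = matmul_transpose(X, Theta)   # Y[i][j] = predicted rating of movie i by user j
print(Y)
```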

Strategy on designing a machine learning system (e.g. building a spam classifier):

- Choose features: words that are indicative of spam/not spam (e.g. bag-of-words model), email routing information/email header
- Collect lots of data (e.g. honeypot project)

To help us as Machine Learning practitioners choose which option to prioritize, Prof Ng suggests what I call a three-step approach:

- start with a quick-and-dirty algorithm and test it early on a cross-validation set.
- plot learning curves to decide whether you need more data or more features, and do error analysis to find systematic trends in the errors on the cross-validation set.
- use a single numerical evaluation metric when trying out new ideas (e.g. stemming).

For data with skewed classes, classification accuracy is not a good evaluation metric, and precision/recall is often used instead. Plotting the precision/recall curve, it quickly becomes apparent that there is a tradeoff, and one way to balance the two is the F1-score.
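For reference, the F1-score is the harmonic mean of precision and recall (a small sketch with a made-up confusion matrix):

```python
# precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2PR/(P+R).
# A classifier that always predicts y=0 on 1%-positive data has recall 0,
# so its F1 is 0 even though its accuracy is 99%.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))   # precision 0.8, recall 2/3
```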

Finally, Prof Ng discusses the large data rationale and suggests when it’s appropriate to collect a massive amount of data to get a high-performance learning algorithm.

Docker is an open-source software container platform. It creates containers on top of an operating system using Linux kernel features, thereby virtualizing the operating system instead of the physical hardware (making it more portable and efficient than Virtual Machines).

**Why Docker?**

- **Easy to use:** it’s “build once, run anywhere”, meaning you can build an application on your laptop and it can run unmodified on any server or cloud
- **Fast:** containers share the OS kernel and take up fewer resources than VMs — a container is lightweight and starts almost instantly!
- **Rich in ecosystem:** Docker Hub alone has hundreds of thousands of public images, community-created and readily available for use (see next section). There are other Docker registry hosting services too (e.g. Quay).
- **Scalable:** break down your application into multiple containers for modularity, then link them together; scale easily by adding new containers or destroying unused ones independently.

Thanks to the rich ecosystem, there are already several readily available images for the common components in data science pipelines. Here are some Docker images to help you quickly spin up your own data science pipeline:

In order to build a Docker image for our data science application, we will need a Dockerfile (see below template). To build and tag the image as kevinsis/myapp version 1.0.0, navigate to the directory the Dockerfile is located and run:

`$ docker build -t kevinsis/myapp:1.0.0 .`

A sample Dockerfile can be found below. Here I include the installation of the GNU Scientific Library too (used by some NLP packages in R). You will need the following in the directory that the above docker build command is run:

- Dockerfile
- shiny-server.conf and shiny-server.sh
- a directory containing the ui.R and server.R (here named ‘myapp’)

```
FROM r-base:latest
MAINTAINER Kevin Siswandi "siswandi.kevin@gmail.com"
ENV http_proxy ""
ENV https_proxy ""
RUN apt-get update && apt-get install -y \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev \
libssl-dev \
gsl-bin \
libgsl0-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', \
'shinydashboard', \
'dplyr', \
'ggplot2'), repos='http://cran.rstudio.com/')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY myapp /srv/shiny-server/
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]
```

Briefly, the Dockerfile above installs the required dependencies and then Shiny Server, before installing the required R packages and copying the Shiny Server configuration (shiny-server.conf) into the image. The shiny ui.R and server.R are located in the ‘myapp’ directory (you can change this) and are copied over to /srv/shiny-server in the Docker image. Note that proxy settings can be specified in the two ENV lines (you’ll need this if you are behind a corporate proxy).

After the image is built, you can run it as follows (e.g. an image named kevinsis/myapp tagged 1.0.0):

`$ docker run --rm -p 3838:3838 kevinsis/myapp:1.0.0`

If you have data to attach to the image (like me), you can do:

`$ docker run -p 3838:3838 -v /home/kevinsis/dockerizedShiny:/srv/shiny-server/data kevinsis/myapp:1.0.0`

Next, perhaps you may want to:

- Remove images (e.g. dangling images)
- See sample images: RStudio /Jupyter, nvidia-docker
- Find out about the containers behind Kaggle scripts

Note: The instructions below are taken from a repository by DataKindSG that I contributed to: https://github.com/DataKind-SG/contain-yourself

Follow the setup instructions here: https://docs.docker.com/docker-for-windows/install/

Note: If your machine doesn’t meet the requirements for “Docker for Windows”, try setting up “Docker Toolbox”: https://docs.docker.com/toolbox/toolbox_install_windows/

Follow the setup instructions for your flavor of Linux here: https://docs.docker.com/engine/installation/linux/

Follow the setup instructions here: https://store.docker.com/editions/community/docker-ce-desktop-mac

Or if you use Homebrew Cask,

`$ brew cask install docker`

- Entity-Relationship Diagrams
- Relational schemas

Tools: ERDPlus

Basic commands:

- http://www.sqlcommands.net/
- https://www.codecademy.com/articles/sql-commands?r=master
- https://www.w3schools.com/sql

At the minimum, you would want to create a table and load your data. You can create a table following https://dev.mysql.com/doc/refman/5.5/en/creating-tables.html. Be mindful of variable types.

To load data (be it CSV or pipe-delimited), follow https://dev.mysql.com/doc/refman/5.7/en/load-data.html. When altering the table is required, follow https://www.tutorialspoint.com/mysql/mysql-alter-command.htm.

Keywords: https://dev.mysql.com/doc/refman/5.5/en/keywords.html

Functions and operators:

- http://www.w3resource.com/mysql/mysql-functions-and-operators.php
- http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php

Date and time: http://www.tutorialspoint.com/mysql/mysql-date-time-functions.htm

Syntax that varies across database platforms:

- LIMIT/TOP/ROWNUM: http://www.tutorialspoint.com/sql/sql-top-clause.htm

- OReilly: Plan and Design a Database
- w3schools: Quick Ref
- Practice: http://datamonkey.pro/blog/

*Answer*: It’s the popular dimensionality reduction algorithm:

To see how t-SNE casts an n-dimensional space into 3-D or 2-D space, check out: http://projector.tensorflow.org/

For a more practical take on t-SNE, check out http://distill.pub/2016/misread-tsne/, this Kaggle script of the week, and the video below:

Finding patterns in a t-SNE plot is both an art and a science: appropriate coloring can reveal otherwise hidden patterns while also adding style to the aesthetics: https://github.com/kylemcdonald/Coloring-t-SNE/

Related: Illustrated introduction to the t-SNE algorithm by O’Reilly Media

This piece on topic modeling is based on: topic modeling using R.

For a theoretical overview on topic modeling, see http://videolectures.net/mlss09uk_blei_tm/

I have been searching online for a tutorial on LDA using *topicmodels* and *LDAvis* package in R, but to no avail. So here I write my own (hope you like it!)

First of all, you need a document-term matrix (e.g. the one created by the tm package). See this piece on how to do that. With that, training an LDA model is as simple as

```
ldaOut <- LDA(dtm, k=4, method="Gibbs",
              control=list(nstart=5, seed=list(1, 2, 3, 4, 5), best=TRUE,
                           burnin=4000, iter=2000, thin=500))
```

In the code block above, dtm is the document term matrix, k is the number of topics (you have to decide this prior to model training), and I used Gibbs sampling method with the following parameters:

- burnin: the number of initial Gibbs iterations to discard
- iter: how many iterations to run after the burn-in
- thin: keep only every thin-th iteration (to reduce autocorrelation between samples)
- nstart: how many different starting points to use
- seed: a list of seeds, one for each of the nstart runs
- best=TRUE returns only the run with the highest posterior probability

Afterwards, you have to create a JSON object and pass it to the visualization template. You can either see the documentation or follow along with the code below:

```
ldaOut.topics <- as.matrix(topics(ldaOut))
ldaOut.terms <- as.matrix(posterior(ldaOut)$terms)
ldaOut.docLength <- rowSums(as.matrix(dtm))
ldaOut.termFreq <- colSums(as.matrix(dtm))
topicProbabilities <- as.data.frame(ldaOut@gamma)

# Visualize
jsonObj <- createJSON(phi = ldaOut.terms,
                      theta = topicProbabilities,
                      vocab = colnames(ldaOut.terms),
                      doc.length = ldaOut.docLength,
                      term.frequency = ldaOut.termFreq,
                      R = 10)
```

All that remains is to pass jsonObj to the function serVis (from LDAvis) or renderVis (if you’re embedding it in a shiny app).

Producing clean graphs in R can be a challenging task, but when done right, graphs can be appealing, informative, and of considerable value. Traditionally, R was used for producing graphs in academic articles, but it’s now so versatile that you can produce stunning data visualizations in just a few lines of code:

What we learnt from the video is that beautiful visualizations can be made easily with

- leaflet for maps
- quantmod for easy stock data visualization
- dygraph for time series data
- corrplot for correlation plots: example here
- ggvis for interactive graphics using ggplot-like syntax

Having said that, these are my favourite visualization frameworks because they are so versatile:

- plotly for interactive graphs: examples here.
- googleVis
- shiny for dashboards: example here

But how to select which type of chart to use? The following diagram would help (or this whitepaper from Tableau).

PS: Some people say pie charts are no good, but they can sometimes be useful.

There are, by the way, 7 different types of data stories:

- Narrate Change over time
- Start big and drill down
- Start small and zoom out
- Highlight contrasts
- Explore the intersection
- Dissect the factors
- Profile the outliers

Meanwhile, Tableau has some whitepapers related to producing visualizations with R:

- The Power of R and Visual Analytics
- Visual Analysis Best Practices
- Visualizing Time: Beyond the Line Chart

When it’s precision over storytelling, we may need to go back old-school: ggplot2! (here’s a cheatsheet)

Here are some ggplot2 examples:

- Pie charts

There are also ggplot2 extensions: http://www.ggplot2-exts.org/ to create more interesting graphs, for example:

Finally, some use cases/gallery:

After you type “I went to the”, the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant.

As part of my capstone project — with SwiftKey as the corporate partner — for the Data Science Specialization offered by Johns Hopkins University through Coursera, I analyzed a large corpus of text documents to discover the structure in the data and how words are put together, then built and sampled from a predictive text model. The result is a predictive text product hosted on shinyapps.io:

The app takes a phrase (multiple words) as input; you click submit, and it predicts the next word.

Click here to check out the app

For more information on how the app works under the hood, please see:

- 5-slide intro: slide deck on RPubs.
- Summary report of the data: comprising tweets, news, and blogs in English.
- Source code on GitHub.

**Getting started**

- Start your journey in Kaggle
- Approaching (Almost) Any Machine Learning Problem
- How to get into top 10% in your first Kaggle competition

**The basics**

- Data Cleaning
- Feature Engineering
- Exploratory Data Analysis
- Model Selection
- http://mlwave.com/kaggle-ensembling-guide/

**Routines**

- xgboost (including one-hot encoding)
- create a sparse matrix: `sparse.model.matrix(TARGET ~ . -1, data=test)`

**Learn from others**

- Lessons from Kaggle’s Event Recommendation Engine Challenge
- Tutorials, talks, conferences: my Youtube playlist
- Predicting Facebook checkins: 1st place
- Predicting Facebook checkins: 2nd place


If you find a job posting for a “data scientist,” it’s likely that the employer is hiring their first data-centric positions. In this case, as a “jack of all trades,” you may be responsible for a wide variety of tasks. However, most postings will specify a clearer focus – usually in one of the four areas outlined below.

*On career paths in data science*, written in **Coursera Blog**.

The outlook of the profession is rather favorable:

- In Singapore: OK salary
- In US: OK for visa, stardom, good salary

There is a lot of art in becoming a Data Scientist, from writing a resume to using Machine Learning libraries in R, Python, and Julia (or even Go). It seems that the term “Data Scientist” was first coined by Davenport and Patil, and Drew Conway then popularised what it meant with a Venn diagram.

This post aggregated some tips to get started in Data Science.

Data Camp has an infographic detailing 8 easy steps to learn data science:

- Get good at Mathematics, Statistics, and Machine Learning.
- Learn to code: end-to-end development, CS fundamentals, Python and R.
- Understand databases: MySQL, MongoDB, PostgreSQL, etc.
- Explore the data science workflow: from data collection to modeling to reporting.
- Level up with Big Data: Hadoop, Spark, etc.
- Grow, connect and learn: Kaggle, Driven Data, Meetup, pet project, etc.
- Immerse yourself completely: internship, bootcamp, and job.
- Engage with the community: R User group, Local Data Science meetup, etc.

Also make use of the free resources to learn Data Science and play to your strengths.

- David Robinson: 1 year at StackOverflow
- Sander Dieleman: From Kaggle to Google (Deepmind)
- Elena Grewal: How I made it (Airbnb)
- Vincent Granville: Salary history and progress (Data Science Central)

Data Science can be mastered with some perseverance:

**Finally, do lots of reading**:

- The Field Guide to Data Science.
*Booz Allen Hamilton*. - Good Books for All Things Data, by Multithreaded @ Stitchfix
- Books and practices from Andrew Ng (Coursera, Baidu)

It’s primarily developed by Tianqi Chen at the University of Washington, with the R package authored by Tong He:

- Intro by Tong He: Introduction to XGBoost R Package.
- Intro by Tianqi Chen: Introduction to Boosted Trees

XGBoost uses the same model (tree ensembles) as random forests, but the difference is in how the model is trained. XGBoost learns the trees with an additive strategy: fix what has been learned so far, and add one new tree at a time.
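The additive strategy can be illustrated with a toy sketch (plain Python, one-split “stumps” standing in for real trees; not XGBoost’s actual implementation, which also uses second-order gradients and regularization):

```python
# Additive training on made-up 1-D regression data: each round fits a
# one-split stump to the current residuals, then adds it to the ensemble.

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 8.9, 9.7, 10.1]

def fit_stump(X, residuals):
    """Find the split on X that best fits the residuals with two constants."""
    best = None
    for s in X:
        left = [r for x, r in zip(X, residuals) if x < s]
        right = [r for x, r in zip(X, residuals) if x >= s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lm if x < s else rm)) ** 2
                  for x, r in zip(X, residuals))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x < s else rm

pred = [0.0] * len(X)
trees = []
for _ in range(5):  # fix what was learned; add one new "tree" at a time
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    tree = fit_stump(X, residuals)
    trees.append(tree)
    pred = [pi + tree(xi) for pi, xi in zip(pred, X)]
```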

FAQ:

- What does gblinear do?
- XGBoost tuning
- Installation: Windows
- Tutorial: Easy Steps to XGBOOST by Analytics Vidhya.
- paper: http://arxiv.org/abs/1603.02754
- docs: http://xgboost.readthedocs.org/en/latest/index.html