In this post, I will go over the basics of Docker containers, touch on using Docker for data science, and finally show how to deploy a simple Shiny app. I assume no prior knowledge of Docker and also show how to install it on your laptop. In essence, the tutorial that follows is meant to be a self-contained and (hopefully) actionable getting-started guide!
Docker is an open-source software container platform. It creates containers on top of an operating system using Linux kernel features, thereby virtualizing the operating system instead of the physical hardware, which makes containers more portable and efficient than virtual machines (VMs).
- Easy to use: it’s “build once, run anywhere”, meaning you can build an application on your laptop and it can run unmodified on any server or cloud
- Fast: containers share the host OS kernel and take up fewer resources than VMs; a container is lightweight and starts almost instantly!
- Rich in ecosystem: Docker Hub alone has hundreds of thousands of public images, community-created and readily available for use (see next section). There are other Docker registry hosting services too (e.g. Quay).
- Scalable: break down your application into multiple containers for modularity, then you can link them together; scale easily by adding in new containers or destroying unused ones independently.
Containerized Data Science
Thanks to the rich ecosystem, there are already several readily available images for the common components in data science pipelines. Here are some Docker images to help you quickly spin up your own data science pipeline:
Example: Building a Shiny app
To build a Docker image for our data science application, we need a Dockerfile (see the template below). To build and tag the image as kevinsis/myapp version 1.0.0, navigate to the directory where the Dockerfile is located and run:
$ docker build -t kevinsis/myapp:1.0.0 .
A sample Dockerfile can be found below. Here I also install the GNU Scientific Library (used by some NLP packages in R). You will need the following in the directory where the docker build command above is run:
- shiny-server.conf and shiny-server.sh
- a directory containing the ui.R and server.R (here named ‘myapp’)
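Before running the build, it can help to confirm that these files are actually in place. The following is a minimal sketch of such a check; the function name check_build_context is hypothetical, and the file names simply mirror the list above (plus the Dockerfile itself).

```shell
# Report any file the image build expects but cannot find.
# Usage: check_build_context [directory]   (defaults to the current directory)
check_build_context() {
  dir="${1:-.}"
  missing=0
  for f in Dockerfile shiny-server.conf shiny-server.sh myapp/ui.R myapp/server.R; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

Running check_build_context from the build directory prints one line per missing file and returns a non-zero status if anything is absent.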
FROM r-base:latest
MAINTAINER Kevin Siswandi "email@example.com"
ENV http_proxy ""
ENV https_proxy ""
RUN apt-get update && apt-get install -y \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev \
    libssl-dev \
    gsl-bin \
    libgsl0-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', \
    'shinydashboard', \
    'dplyr', \
    'ggplot2'), repos='http://cran.rstudio.com/')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY myapp /srv/shiny-server/
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]
Briefly, the Dockerfile above installs the required system dependencies and Shiny Server, then installs the required R packages and copies the Shiny Server configuration (shiny-server.conf) into the image. The Shiny ui.R and server.R files live in the ‘myapp’ directory (you can change this) and are copied to /srv/shiny-server in the Docker image. Note that the two ENV lines are where the proxy settings go; you will need to fill them in if you are behind a corporate proxy.
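The Dockerfile copies in shiny-server.conf and shiny-server.sh, but their contents are not shown in this post. As a rough sketch only (not necessarily the files used for this app), the commands below write a minimal version of each into the current directory. The assumptions here are that the Shiny Server package created a shiny user, that logs go under /var/log/shiny-server, and that the server listens on port 3838 to match the EXPOSE line. Note that running this overwrites any existing files with the same names.

```shell
# Minimal Shiny Server config (an assumption, modeled on the stock default).
cat > shiny-server.conf <<'EOF'
run_as shiny;

server {
  listen 3838;

  location / {
    site_dir /srv/shiny-server;
    log_dir /var/log/shiny-server;
    directory_index on;
  }
}
EOF

# Startup script that the Dockerfile's CMD executes.
cat > shiny-server.sh <<'EOF'
#!/bin/sh
# Make sure the per-app log directory exists and belongs to the shiny user.
mkdir -p /var/log/shiny-server
chown shiny:shiny /var/log/shiny-server

# Replace the shell with the server process so it receives stop signals,
# and merge stderr into stdout so the container's log output is in one stream.
exec shiny-server 2>&1
EOF

# COPY keeps the file mode, so the script must be executable on the host.
chmod +x shiny-server.sh
```

Because the Dockerfile copies the contents of myapp directly into /srv/shiny-server, a site_dir of /srv/shiny-server should serve the app at the root URL.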
After the image is built, you can run it as follows (e.g. an image named kevinsis/myapp tagged 1.0.0):
$ docker run --rm -p 3838:3838 kevinsis/myapp:1.0.0
If you have data to make available to the container (as I do), you can mount a host directory as a volume:
$ docker run -p 3838:3838 -v /home/kevinsis/dockerizedShiny:/srv/shiny-server/data kevinsis/myapp:1.0.0
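To confirm that the mount worked, one option is to start the container in the background and list the mounted directory from inside it. This is only a sketch: the container name myapp and the helper function names are hypothetical, while the image tag and paths are the ones used above.

```shell
NAME=myapp                                # hypothetical container name
IMAGE=kevinsis/myapp:1.0.0
DATA_DIR=/home/kevinsis/dockerizedShiny   # host directory holding the data

# Start the app in the background (-d) with the data directory mounted.
start_app() {
  docker run -d --rm --name "$NAME" -p 3838:3838 \
    -v "$DATA_DIR":/srv/shiny-server/data "$IMAGE"
}

# List the mounted files as seen from inside the running container.
check_data() {
  docker exec "$NAME" ls /srv/shiny-server/data
}

# Stop the container; the --rm flag above removes it once it exits.
stop_app() {
  docker stop "$NAME"
}
```

Call start_app, then check_data, and stop_app when you are done. With the 3838:3838 port mapping, the app is typically reachable at http://localhost:3838 while the container is running.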
Next, you may want to:
- Remove images (e.g. dangling images)
- See sample images: RStudio/Jupyter, nvidia-docker
- Find out about the containers behind Kaggle scripts
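For the first item, removing dangling images (untagged images left behind by rebuilds), a minimal sketch using standard Docker CLI commands might look like this. The function names are hypothetical wrappers, and docker image prune needs Docker 1.13 or later.

```shell
# Print the IDs of dangling images.
list_dangling() {
  docker images --filter dangling=true --quiet
}

# Delete all dangling images, skipping the interactive confirmation prompt.
remove_dangling() {
  docker image prune --force
}

# Delete one specific image by name and tag, e.g. an old build of the app.
remove_image() {
  docker rmi "$1"
}
```

For example, remove_image kevinsis/myapp:1.0.0 would delete the image built earlier, provided no container is still using it.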
Appendix: Installing Docker
Note: The instructions below are taken from a repository by DataKindSG that I contributed to: https://github.com/DataKind-SG/contain-yourself
… for Windows
Follow the setup instructions here: https://docs.docker.com/docker-for-windows/install/
Note: If your machine doesn’t meet the requirements for “Docker for Windows”, try setting up “Docker Toolbox”: https://docs.docker.com/toolbox/toolbox_install_windows/
… for Linux
Follow the setup instructions for your flavor of Linux here: https://docs.docker.com/engine/installation/linux/
… for macOS
Follow the setup instructions here: https://store.docker.com/editions/community/docker-ce-desktop-mac
Or, if you use Homebrew Cask (recent Homebrew versions replaced the brew cask subcommand with the --cask flag):
$ brew install --cask docker