More and more data scientists are using Docker to support reproducible research and to ease deployment of user-facing applications. This post is the first in a blog series intended to provide a few helpful tips and tricks for getting started with using R in Docker. The plan for the series is:
Docker is a set of tools for running processes in Linux in a way that isolates each process in its own environment, with its own filesystem and with only explicitly defined links to other processes and to the host on which it's running. Each isolated process is called a "container". A container's process and environment are defined in a Docker image, and images in turn are either "pulled" pre-made from online repositories (with DockerHub being the most common), or created in a local Docker host by building them from a sort of script called a Dockerfile.
On non-Linux platforms (notably, MacOS and Windows), Docker actually executes in a lightweight Linux virtual machine, but the Docker tools interface seamlessly to this VM "under the hood" so it seems like containers are running natively in the MacOS or Windows environment. A tool called Docker Machine can even extend this approach to virtual machines running in a wide variety of contexts, including on cloud services like Amazon Web Services. The Docker website contains excellent instructions for installing Docker on all platforms.
A simple example illustrates how Docker is useful for data scientists and analysts who use R. A data scientist might pull a pre-configured R image from DockerHub and then run a container from that image locally to obtain an interactive R command prompt. If there is no pre-configured image that provides the features needed, the data scientist can define a Dockerfile based on a pre-configured image and just add the features (e.g., missing packages) needed. Alternatively, the data scientist can run a container from a pre-configured RStudio Server image (extending it with a Dockerfile as needed) if RStudio is preferred over a command prompt.
In the sections to follow, we'll explore both of these options, interactive and server, for running R in Docker.
The Rocker project maintains the de-facto standard Docker images for R. There are four images that cover the vast majority of R users' scenarios (or at least mine, anyway):
rocker/r-ver: Base R installation
rocker/rstudio: Base R + RStudio Server installation
rocker/tidyverse: Base R + RStudio Server +
rocker/shiny: Base R with Shiny Server (open source edition)
The first three images in this list all support tags to indicate the version of R included in the image. For each version-specific image, the packages installed with R are from the Microsoft MRAN snapshot as of the last day that version of R was the latest available. In this way, R users can always have images that correspond to an R environment with a known base R version and a known version of all the included packages.
Sometimes, all we need is an R prompt so we can run some R code interactively from the command prompt, rather than with all the features of RStudio. Via Docker, this looks like:
$: docker run -ti --rm rocker/r-ver R

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
Note that the result of this command may differ depending on when you run it, since an untagged rocker/r-ver will always run the latest version of the Rocker project r-ver image. To run a specific version of R, you can always use a tag to specify it, like so:
$: docker run -ti --rm rocker/r-ver:3.4.0 R

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
...
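If you would rather fetch a pinned image ahead of time and check what is cached on your Docker host, something like the sketch below may help. The function name pull_r_ver is my own invention for illustration (it is not part of Docker or Rocker), and it assumes a running Docker daemon with access to DockerHub.

```shell
# Hypothetical helper, not part of Docker or Rocker: pull one pinned
# rocker/r-ver tag, then list the r-ver images cached on this Docker host.
pull_r_ver() {
  tag="${1:-latest}"   # no argument means the moving "latest" tag
  docker pull "rocker/r-ver:${tag}" &&
    docker image ls rocker/r-ver
}

# Usage (needs a running Docker daemon and access to DockerHub):
#   pull_r_ver 3.4.0
```

Pulling by tag first means a later docker run with the same tag starts from the already-downloaded image rather than fetching it at run time.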
You can see all the tags available for an image by visiting the "Tags" page of the image's area on DockerHub.
A few details about these commands:
docker run is the Docker command used to run a container. The command takes a number of options, followed by the image (and, optionally, the tag) that defines the container environment, and finally the executable to run as the process.
The -ti option tells Docker to run the container interactively, attaching the user's tty (command prompt) to the container's standard input and output.
The --rm option tells Docker to clean up the container when the interactive shell exits. This prevents "ghost" stopped containers from hanging around and cluttering up the Docker host.
The final argument, R, is the executable to run in the container. In this case, it is not strictly necessary, since R is the default command for the rocker/r-ver image. But my personal preference is to be explicit about the command.
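Because the last argument to docker run is just the executable to start, the same pattern works for one-off, non-interactive jobs. Here is a minimal sketch that swaps R for Rscript; the helper name r_eval and its two parameters are hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper: run one R expression in a throwaway container and exit.
# Rscript replaces R as the executable, so no -ti (interactive tty) is needed.
r_eval() {
  tag="$1"    # image tag, e.g. 3.4.0
  code="$2"   # R expression to evaluate
  docker run --rm "rocker/r-ver:${tag}" Rscript -e "${code}"
}

# Usage (needs a running Docker daemon):
#   r_eval 3.4.0 'cat(R.version.string, "\n")'
```

The --rm option still applies here: the container is cleaned up as soon as Rscript finishes.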
One thing to note about interactive containers (actually, any container) is that any changes to the container's filesystem are lost when the container exits. So, if your interactive R session involves creating output and saving it to a file, remember that the file will no longer be there after you exit the container. It is possible (and quite easy) to write files outside the container so they stay around after the container exits. That requires Docker volumes, which we will cover in Part 3 of the series.
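To see this behavior for yourself, you could try something along the lines of the sketch below. The function name demo_ephemeral_fs and the path /tmp/out.txt are arbitrary choices of mine; the sketch assumes the image provides a standard test executable, which I would expect of a Debian-based image like rocker/r-ver.

```shell
# Hypothetical demo: write a file in one container, look for it in another.
demo_ephemeral_fs() {
  # First container: write a file, then exit (and get removed by --rm).
  docker run --rm rocker/r-ver Rscript -e 'writeLines("hello", "/tmp/out.txt")'

  # Second container: starts from a fresh copy of the image filesystem.
  # docker run exits with the status of the process it ran in the container.
  if docker run --rm rocker/r-ver test -e /tmp/out.txt; then
    echo "file survived"
  else
    echo "file is gone"
  fi
}

# Usage (needs a running Docker daemon):
#   demo_ephemeral_fs
```

Each docker run creates a new container from the image, so the second container never sees the file written by the first.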
Running R interactively is great, but there are many use cases for running R as a server process, where users don't access R directly, but rather use a web browser or other client to interact with R over a protocol like https. For example:
Running R as a server process requires a slightly different flavor of the docker run command. The interactive R containers we discussed above only live during the interactive session, then they exit and go away. Server processes are different, in that we generally want to start them up, leave them running in the background as they wait for connections, and return our local command prompt to us so we can do other things with it. With Docker we accomplish this with detached containers and the -d option to docker run. For example, to start up an instance of RStudio Server:
$: docker run -d --name rstudio -p 80:8787 rocker/rstudio
After this command, RStudio Server is running on the Docker host on port 80 (the -p option tells Docker to expose the container port specified after the colon to the host port specified before the colon). When running detached containers, I always specify a name rather than using the default name assigned by Docker; this way, I can refer to the container by a name that makes sense to me (and has some relationship to what the container process is actually doing). RStudio Server will continue running, available at http://localhost (from a browser running on the Docker host) or http://[hostname] from the network on which the Docker host lives. It will continue running there until we stop it with either docker stop rstudio (to stop the container only) or docker rm -f rstudio (to stop it and remove it).
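The lifecycle of a detached container can be wrapped up in a few small shell functions, sketched below. The function names are hypothetical; the container name rstudio and container port 8787 come from the example above, and the host port is a parameter so you can pick something other than 80 if that port is already taken.

```shell
# Hypothetical helpers around the commands discussed above.
rstudio_start() {
  host_port="${1:-80}"   # host side of the mapping; defaults to 80
  # -p HOST:CONTAINER maps container port 8787 to the chosen host port.
  docker run -d --name rstudio -p "${host_port}:8787" rocker/rstudio
}

rstudio_status() {
  # Lists running containers whose name contains "rstudio".
  docker ps --filter "name=rstudio"
}

rstudio_remove() {
  # Stop the server, then delete the stopped container.
  docker stop rstudio && docker rm rstudio
}

# Usage (needs a running Docker daemon):
#   rstudio_start 8080   # then browse to http://localhost:8080
#   rstudio_status
#   rstudio_remove
```

Stopping and then removing in two steps has the same end result as docker rm -f rstudio, but gives the server a chance to shut down first.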
Note that the base rocker/rstudio image only has the base R packages installed. If you'd like your RStudio Server instance to contain the tidyverse packages, use the rocker/tidyverse image instead.
In many cases, we want packages beyond the tidyverse to be available in our interactive R or RStudio Server session. This requires that we create our own image, derived from rocker/tidyverse, that contains what we need. We'll see how to do this in Part 2.