Docker for Data Science

Getting started with Docker for Data Scientists…

6 min read · Jun 29, 2021

You might have heard about Docker, or containers, but have not yet had time to dig in and figure out what they are all about. This article is designed to give you a working knowledge of the Docker architecture, introduce some Docker syntax, and motivate you to make Docker part of your data science toolkit if it isn’t already.

But wait!

Why Docker?

Reproducible data science workflows: To replicate a data science result, users need access to the same code, data, and libraries that were used to create the original notebook. This is where Docker comes in. Docker helps us solve the “it works on my machine” problem.

There are many other reasons to use Docker, and we will see some of them over the course of this article.

Introduction to Docker

Docker is a platform that allows us to package and run applications in loosely isolated environments called containers.

To understand how Docker works, we will use a simple shipping container analogy. The analogy is not a new one; after all, you can even see it in Docker’s logo :)

Shipping Container Analogy

Shipping containers are standardized for the logistics industry. It doesn’t really matter what’s in them: we can send them by boat, by train, or by truck. The containers don’t care which ship they’re on, as long as the ship is sturdy enough to carry them. The ship has no need to worry about temperature, moisture, etc. The container takes care of that, not the ship.

Similarly, with Docker, we can package our code, plus everything we need to run it, in an isolated container. Since these software containers are standardized, we can move them between environments without having to worry about whether they are going to run. This sounds very similar to virtual machines, but there are a couple of key differences. Let’s look at them:

Docker Vs Virtual Machines

Docker containers run natively on the host machine’s OS, meaning they share the host machine’s kernel. Virtual machines instead sit on a hypervisor, which gives each virtual machine virtual access to the host’s resources, and each virtual machine runs its own full guest operating system. Because Docker containers don’t need a full-scale operating system of their own, they are a lot more lightweight than VMs and typically start faster and use fewer resources.
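One rough way to see the shared kernel for yourself is the small script below, which reports the kernel release visible to Python. On a Linux host, running it on the host and again inside a container should report the same release, because the container has no kernel of its own. On Docker Desktop for macOS or Windows, containers run inside a lightweight Linux VM, so the container would report that VM’s kernel instead. The function name kernel_info is just an illustrative choice.

```python
import platform


def kernel_info():
    """Return the OS name and kernel release visible to this process."""
    return {
        "system": platform.system(),
        "kernel_release": platform.release(),
    }


if __name__ == "__main__":
    info = kernel_info()
    print(f"{info['system']} kernel {info['kernel_release']}")
```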

The Docker Architecture

Docker Client: This is where you enter commands to interact with Docker. These commands are sent to the Docker host.
Docker Host: This can be your local machine or a remote machine. The host needs to be running the Docker daemon, which listens for requests from the Docker client, manages Docker objects such as containers and images, and can also communicate with other Docker daemons.
Docker Registry: This is where Docker images are stored. Docker Hub is the public registry, much like GitHub is for code. There you can find a lot of public Docker images for things like PyTorch, TensorFlow, scikit-learn, pandas, and many other popular data science libraries and tools.
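The flow between these three pieces can be sketched as a toy model in Python. This is not how Docker is implemented; the class names, the handle and send methods, and the registry contents are all hypothetical and exist only to show who talks to whom: the client sends a command, the daemon checks its local images, and the registry is consulted when an image is missing.

```python
class Registry:
    """Stands in for a registry such as Docker Hub: maps image names to images."""

    def __init__(self, images):
        self._images = dict(images)

    def pull(self, name):
        return self._images[name]


class Daemon:
    """Stands in for the Docker daemon running on the Docker host."""

    def __init__(self, registry):
        self.registry = registry
        self.local_images = {}
        self.containers = []

    def handle(self, command, image_name):
        if command == "pull":
            self.local_images[image_name] = self.registry.pull(image_name)
            return f"pulled {image_name}"
        if command == "run":
            # Fetch the image from the registry first if it is not available locally
            if image_name not in self.local_images:
                self.handle("pull", image_name)
            container_id = len(self.containers) + 1
            self.containers.append((container_id, image_name))
            return f"started container {container_id} from {image_name}"
        raise ValueError(f"unknown command: {command}")


class Client:
    """Stands in for the Docker client: it only forwards commands to the daemon."""

    def __init__(self, daemon):
        self.daemon = daemon

    def send(self, command, image_name):
        return self.daemon.handle(command, image_name)


registry = Registry({"python:3.9.5-alpine": "image data"})
daemon = Daemon(registry)
client = Client(daemon)
print(client.send("run", "python:3.9.5-alpine"))
```

In this toy model the client never touches the registry directly, which mirrors the division of labor described above.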

Some Docker Terminologies

What is a Docker Image?

An image is a frozen snapshot of a container, or a blueprint for what you want to build. Each image consists of a set of read-only layers stacked on top of each other, and each layer is the set of differences from the layer below it.

What are Docker Containers?

Containers are the runtime instances of images. When we create a container, we add a thin read/write layer, called the container layer, on top of the image’s layer stack. In other words, a container is an instance of a Docker image, started with the docker run command.
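One way to picture the layer stack is with Python’s collections.ChainMap, which looks a key up in several dictionaries in order. The sketch below is only an analogy under that assumption: real Docker uses a union filesystem managed by a storage driver, and the file names, layer contents, and the start_container helper here are made up for illustration.

```python
from collections import ChainMap

# Read-only image layers, listed bottom to top. Each dict holds only the files
# that the layer adds or changes relative to the layers below it.
base_layer = {"/bin/sh": "shell", "/etc/os-release": "alpine"}
python_layer = {"/usr/local/bin/python": "python 3.9"}
app_layer = {"/app/hello_world.py": 'print("Hello World!")'}
image = [base_layer, python_layer, app_layer]


def start_container(image_layers):
    """Return a merged filesystem view with a fresh writable layer on top."""
    writable_layer = {}
    # ChainMap searches its maps left to right, so the topmost layer goes first
    return ChainMap(writable_layer, *reversed(image_layers))


container_a = start_container(image)
container_b = start_container(image)

# Writes land in container_a's own writable layer; the image stays untouched
container_a["/app/output.csv"] = "results"

print("/app/output.csv" in container_a)
print("/app/output.csv" in container_b)
print(container_a["/usr/local/bin/python"])
```

Both containers read the same image layers, but only container_a sees the file it wrote, which is the point of the thin container layer.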

Oh wait, this is actually quite similar to Object Oriented Programming where:

Images: Classes
Layers: Inheritance
Containers: Objects
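Taking the analogy literally, it might look like the Python sketch below. The class names and the files_written attribute are hypothetical and only serve to mirror the three pairs above.

```python
class PythonImage:
    """Plays the role of a base image (a class)."""

    layers = ("python runtime",)

    def __init__(self):
        # Per-instance state, like a container's own writable layer
        self.files_written = []


class HelloWorldImage(PythonImage):
    """Builds on the base image by adding one more layer (inheritance)."""

    layers = PythonImage.layers + ("hello_world.py",)

    def run(self):
        return "Hello World!"


# Two containers (objects) created from the same image (class)
first = HelloWorldImage()
second = HelloWorldImage()
first.files_written.append("output.csv")

print(HelloWorldImage.layers)
print(first.run())
print(second.files_written)
```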

Dockerfile: This is a text file containing all the commands needed to build an image. Let’s look at some Dockerfile commands now.

Some Dockerfile commands

FROM: This sets the base image, i.e. the image we are building on top of. Here we can use a repository name from Docker Hub.
LABEL: This is used to set metadata (like author, email, etc.). Author information used to be set with MAINTAINER, but that instruction is now deprecated, so using LABEL is recommended.
ENV: Sets environment variables.
WORKDIR: Sets the working directory.
COPY: Copies files and directories into the image.
RUN: Executes shell commands in a new layer and puts that layer on top of the image stack. To view all the Docker images on your machine, you can use the docker images command.
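To see these instructions side by side, the Python sketch below holds a hypothetical Dockerfile as a string and lists the instructions it uses. The label value, the environment variable, and the requirements.txt file are assumptions made for illustration, and the tiny parser ignores details such as line continuations.

```python
EXAMPLE_DOCKERFILE = """\
# Hypothetical Dockerfile using the instructions described above
FROM python:3.9.5-alpine
LABEL maintainer="you@example.com"
ENV PYTHONUNBUFFERED=1
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
"""


def parse_instructions(dockerfile_text):
    """Return (instruction, arguments) pairs, skipping blank lines and comments."""
    pairs = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, arguments = line.partition(" ")
        pairs.append((instruction.upper(), arguments.strip()))
    return pairs


for instruction, arguments in parse_instructions(EXAMPLE_DOCKERFILE):
    print(f"{instruction:<8} {arguments}")
```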

Each Dockerfile instruction that modifies the filesystem, such as RUN or COPY, creates a new layer. This brings us to some of Docker’s best practices.

Some Docker Best Practices

Each container should have only one purpose
Minimize the number of layers
Avoid installing unnecessary packages
MAINTAINER is deprecated; use LABEL instead
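As a small illustration of minimizing layers, the Python sketch below merges back-to-back RUN instructions into one by chaining the commands with &&, so they end up in a single layer. The helper name merge_consecutive_runs and the example packages are hypothetical, and the helper assumes shell-form RUN lines with no line continuations.

```python
def merge_consecutive_runs(lines):
    """Combine back-to-back RUN instructions into a single RUN joined by &&."""
    merged = []
    for line in lines:
        line = line.strip()
        if line.startswith("RUN ") and merged and merged[-1].startswith("RUN "):
            # Append this command to the previous RUN instead of starting a new one
            merged[-1] += " && " + line[len("RUN "):].strip()
        else:
            merged.append(line)
    return merged


before = [
    "FROM python:3.9.5-alpine",
    "RUN pip install --no-cache-dir numpy",
    "RUN pip install --no-cache-dir pandas",
    "COPY . /app",
]
after = merge_consecutive_runs(before)

for line in after:
    print(line)
```

Here the two RUN lines collapse into one, so the resulting image has one layer fewer for the same installed packages.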

Docker: Hello-world

So far we have looked at a lot of terms to get us started with using Docker. Time for a little practice!

Step 1: Create a Docker account by following the instructions on the Docker site. Then visit the Docker Playground, which provides a ready-to-use Docker installation, i.e. full access to an instance with Docker.

Step 2: At the Docker Playground, we can add a new instance.

Step 3: At the terminal, type touch Dockerfile to create a ‘Dockerfile’ in our root folder, then type touch hello_world.py to create an empty ‘hello_world.py’ file. Click on “editor” to open hello_world.py and add the line print("Hello World!") to it.
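Since touch only creates an empty file, the script’s contents still have to be typed into the editor. A minimal version of hello_world.py could look like this; the message variable is an optional addition, and the single line print("Hello World!") works just as well.

```python
# hello_world.py
message = "Hello World!"
print(message)
```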

Step 4: Now we can write some commands in our Dockerfile. Here’s a sample Dockerfile we will use to create our first image!

# Use a pinned Python runtime as the base image
FROM python:3.9.5-alpine
# Set the working directory to /app and copy the current directory into it
WORKDIR /app
COPY . /app
# Run hello_world.py when the container launches
CMD ["python", "hello_world.py"]

We start with a Python base image. I got the name from Docker Hub.
We set the working directory for our hello world app and copy in the contents of the current directory.
We use the CMD keyword to specify what to run when the container launches.

Step 5: The docker build command builds an image from our Dockerfile. We can run it with docker build -t hello-world .

-t specifies what we want to call the image
the dot (.) specifies the build context, here the current directory, which is also where Docker looks for our Dockerfile

Awesome! Now we have our image built👏

Step 6: Time to run our image as a container using docker run hello-world

The general form is docker run <image name>. This started a container from our image and ran the script from earlier. Super!
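If you would rather drive these two steps from a Python script or notebook, one possible sketch uses the standard subprocess module to call the same Docker commands. The helper names below are hypothetical, and build_and_run assumes the docker CLI is installed and that the Dockerfile sits in the given context directory.

```python
import shutil
import subprocess


def build_command(tag, context="."):
    """Argument list equivalent to: docker build -t <tag> <context>"""
    return ["docker", "build", "-t", tag, context]


def run_command(image):
    """Argument list equivalent to: docker run <image>"""
    return ["docker", "run", image]


def build_and_run(tag, context="."):
    """Build the image, run it once, and return what the container printed."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found on PATH")
    subprocess.run(build_command(tag, context), check=True)
    result = subprocess.run(
        run_command(tag), check=True, capture_output=True, text=True
    )
    return result.stdout


# Show the commands without executing them
print(" ".join(build_command("hello-world")))
print(" ".join(run_command("hello-world")))
```

Calling build_and_run("hello-world") from the folder that holds the Dockerfile should return the script’s output as a string.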

Conclusion

Starting off with any new tool can be overwhelming, but once the learning curve flattens out, things start to fall into place and new ideas open up with use. It is the same with Docker, and I hope this article makes you think about using it in your daily data science workflows :)