Using Docker to Enable More Usable Open Source ML Applications

Welcome back to my blog! I haven’t actually posted anything in several years, in part because I’ve been able to scratch this itch via my company blog whenever I’ve felt the desire to share things.

That said, I’ve felt inspired recently by the amount that I see industry leaders posting about their explorations on personal technical blogs, and so I figured I would get this going again. In any case, I end up spending a good chunk of my nights and weekends thinking about and exploring ideas in computer science, biology, and some intersection of them, and so I figured perhaps I could contribute something along the way. No promises though.

One of the explorations I’ve been doing recently has been in getting a variety of open source AI (groan) models to function on my personal computers, including both an M1 Macbook Pro and a decent Linux box with about 64gb of RAM and an NVIDIA TITAN V GPU. I’ve been curious and inspired about what can be done with relatively small amounts of compute.

In this post, rather than talking about the models or their applications, I actually wanted to walk through a little bit of what this experience of “getting a model to run locally” often looks like, what makes it frustrating, and try to outline how we can release code/models in a way that makes this better via Docker containers. I’m not going to claim that I’m the first person with this take, but given how rarely I see code distributed with containers, I figure it bears sharing as much as possible.

A motivating example: Running ESMFold

I’ve recently been thinking quite a bit about a variety of modalities in biology, and what aspects allow machine learning methods to work well on biophysical phenomena. There is no doubt that one big area of impact in this space has been in protein folding, especially with models like Alphafold2 and ESMFold. So, I thought it’d be fun to get protein folding running on my computers. Alphafold2 required a ton of disk space to get all of the reference data for calculating an MSA (multi-sequence alignment), so I figured I’d try out ESMFold, which only requires an amino acid sequence and model files.

As a disclaimer, while I’m going to use ESM as my motivating example since it’s the most recent time I did this, I have no negative feelings towards their code. Their code was actually relatively easy to work with as far as research code goes, and I’m using it as an example because it was easy to containerize.

Getting ESMFold to Run

I went over to the ESM repository and cloned it onto my Linux box. Things looked promising - the folks on the ESM team at FAIR provide an environment.yml file, which specifies a list of conda/pip packages for installation with the conda package manager. This should be as simple as:

git clone [email protected]:facebookresearch/esm.git
cd esm
conda env create -f environment.yml

After conda spent 20 minutes presumably solving P=NP (conda users, you know what I mean), it started installing, only to eventually fail with an esoteric error message involving my CUDA version. Turns out, my machine has CUDA 12.2 installed on it, and the version of Pytorch that ESM pins to asks for CUDA 11.x.).

This led me down an all-too-familiar path of updating a bunch of apt packages, hunting down the right NVIDIA CUDA 11.3 archived install page (Here in case you’re curious), manually downloading the CUDA 11.3 install runfile, installing the correct version of CUDA, and updating my /usr/local/cuda symlink to point to it. Took me about 30 minutes, and along the way I made a little error that led to me reinstalling both versions of CUDA on my box. Not the biggest deal - I’ve done it a million times at this point, but not great.

Good enough to run the examples from the README? Not quite. After that, their README had me install openfold, via pip. Easy enough, though in the pip installation, I ran into a number of errors. After parsing through a massive stacktrace, I found a line that said that a header file called cusparse.h was missing. With the help of GPT-4, I found out that the default apt package that NVIDIA had me install includes the libcusparse and libcublas packages, but not libcusparse-dev and libcublas-dev which include the relevant header files. Upon tracking that down, installing, and rerunning, I got the example code to work (yay!). All in, we’re talking about 1.5 hours of wall time spent going from git clone to getting the README to run (albeit, much of it spent waiting for conda and other install steps and not really actively coding, though arguably that’s even more frustrating), and that’s as someone who has done this basic flow quite a few times.

In the spirit of making scientific progress in machine learning and beyond as accessible as possible, let’s now dive into an alternative way this code could have been distributed and installed.

Enter Docker!

Using Docker substantially simplifies the experience for a new user trying to get the code to work. As a brief primer, Docker is a technology for building and running containers - which, in Docker’s own words, enable you to “separate your applications from your infrastructure”. If a developer provides a container, a user has no install process - all they need to do is pull or build the relevant container and run the application. I’ll show how I containerized ESM, but first, some basics.

Docker Basics

There are a few concepts that are worth groking in understanding how Docker can fit into an application like this:

Docker Host: Unlike an actual machine, a docker container needs a host. A host machine is one that is running the Docker driver.

Container: The crux of using Docker is the container. A container has essentially all of the files needed for an application to run. This includes a full filesystem based around a particular OS, any application files, all dependencies - it’s like a very lightweight virtual machine. Critically, a container image is portable - it can easily be moved between hosts or to cloud-based container registries. Multiple images for distinct applications can be downloaded onto a single host.

Dockerfile: A build file that specifies all of the commands needed to build a new container. A Docker host executes the Dockerfile (via docker build) to construct a new image. This image can then be sent to a container registry.

In other words, my Linux box is my docker host, and docker itself is the only dependency I need to install. To make the container, all I need to do is write a Dockerfile to build a container. Fortunately, NVIDIA releases a base image that makes it easy to build CUDA-based applications, and the kind folks at tensorflow have a well annotated Dockerfile that I’ve often referenced for building other images. So, I used that as a starting point for myself.

You can view the full Dockerfile here - I opened a pull request to the ESM team to contribute it to their codebase. To dive in, here is what it looks like:

Section 1: Base Image

As a base image, I picked an NVIDIA 11.8 image, since the Pytorch version was tied to CUDA 11.x. I picked Ubuntu 22 to have a relatively recent version available. Finally, I set some environment variables that help with the build later. Three cheers for not needing to manually install all of CUDA.

FROM nvidia/cuda:11.8.0-base-ubuntu22.04 as base 
ENV DEBIAN_FRONTEND=noninteractive

Section 2: apt-get installs

There are two sets of RUN blocks where I install a bunch of packages from apt-get. These are largely taken from what tensorflow asks for, since in practice that’s pretty similar to what pytorch needs too. Note that in this section I specifically installed the libcusparse-dev and libcublas-dev packages that I mentioned needing earlier. I also added a block from the tensorflow dockerfile to make sure the libcuda stub was pointing to the right place. While this seems like a lot of code, I was able to piggy-back off of other open source work to write almost all of it.

RUN export DEBIAN_FRONTEND=noninteractive \
    && apt-get -qq update \
    && apt-get -qq install -y gnupg ca-certificates wget git \
    && apt-get -qq clean \
    && rm -rf /var/lib/apt/lists/*

RUN export DEBIAN_FRONTEND=noninteractive \
    && wget \
    && dpkg -i cuda-keyring_1.1-1_all.deb \
    && apt-get -qq update \ 
    && apt-get install -y --no-install-recommends \
        cuda-command-line-tools-11-8 \
        cuda-cudart-dev-11-8 \ 
        cuda-nvcc-11-8 \
        cuda-cupti-11-8 \
        cuda-nvprune-11-8 \
        cuda-libraries-11-8 \
        cuda-nvrtc-11-8 \ 
        libcufft-11-8 \
        libcurand-11-8 \
        libcusolver-dev-11-8 \ 
        libcusparse-dev-11-8 \
        libcublas-dev-11-8 \
        libcudnn8= \
        libnvinfer-plugin8= \
        libnvinfer8= \
        build-essential \
        pkg-config \
        curl \
        software-properties-common \
        unzip \
    && apt-get -qq clean \
    && rm -rf /var/lib/apt/lists/* 

RUN find /usr/local/cuda-*/lib*/ -type f -name 'lib*_static.a' -not -name 'libcudart_static.a' -delete \
    && rm -f /usr/lib/x86_64-linux-gnu/libcudnn_static_v*.a \ 
    && ln -s /usr/local/cuda/lib64/stubs/ /usr/local/cuda/lib64/stubs/ \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ \
    && ldconfig

Section 3: Environment install

There are a few more commands here doing a set of simple operations: installing miniconda (a basic version of the conda package manager), setting the environment variables needed for that, adding the environment.yml file from the host machine to the container via the COPY step, and then using conda env create to build the environment in the image.

# Install miniconda
    && wget --quiet$MINICONDA \
    && bash $MINICONDA -b -p /miniconda \
    && rm -f $MINICONDA

ENV PATH /miniconda/bin:$PATH

COPY environment.yml .
RUN conda env create -f environment.yml

Building / Running the Container

Upon writing this file, I can now build the image specified by the Dockerfile

docker build -f Dockerfile -t esm .

This builds the image specified in the Dockerfile, and names the resultant image “esm”. I can now run this container via:

docker run -it  --rm --runtime=nvidia --gpus all esm

This command runs the esm container and uses the NVIDIA runtime (via –runtime=nvidia), which gives the running container access to the GPUs on my host machine. Once I run that command, I get a bash shell into my container. This container has all of the CUDA toolkit files, python dependencies, filesystem needs, etc to run the ESM script. Note that nothing, not even the CUDA toolkit, needs to be installed on my host machine.

Why does this matter?

At this point you could reasonably ask, ok what did we really buy here? I previously had to install all of this stuff on my host machine - now I had to go through this effort to install it in a container. What did I really get by doing this? Quite a bit it turns out.

Portability: One of the best aspects of including a Dockerfile with your application is it makes the application portable. Now, future open source contributors can either (a) directly pull my docker image from a container registry and start building on it, or (b) run docker build on my Dockerfile and reliably be able to build all of the dependencies of my app, essentially entirely independent of whatever they have on their host machine. Only have a CentOS host machine but want to build my Ubuntu app? No problem. Have a Mac and want to run a Linux container? You can do that now.

All it takes to push this to DockerHub, for example, is

docker tag esm ankitvgupta/esm:esm
docker push ankitvgupta/esm:esm # This now exists on a remote, public repo

Isolation: While a perfect engineer may not make the errors I made that led me to reinstall CUDA on my host machine when I tried to install ESM directly on my host, I’d argue that this kind of error is quite common in practice. It’s very easy to go through a complex install process that in some way has an unintentional side effect on the rest of your dev environment, even if you’re good about using a package manager, version control, etc. Docker circumvents all of this - the application install steps are entirely isolated inside a container - I don’t even need to install the CUDA toolkit on the host machine. My host machine now only has the CUDA 12 toolkit, even though the application wants 11.x.

Readability: No one likes reading and following complication installation guides - it’s easy to make mistakes, and even well-intentioned developers regularly omit details or may have forgotten them because they were inherent to the environment they did development in, like an important environment variable or apt package that had been previously installed. Writing down a Dockerfile enforces application environment as code - pushing the developer to annotate all of the decisions that were needed to get an application to install. And indeed, even if a user is inclined to try installing on their host machine directly for whatever reason, the Dockerfile serves as a great set of instructions to do so.

Reproducibility: In machine learning (and in the life sciences!), reproducibility is a critical challenge. Being able to progress the field requires building on the backs of existing work, and containerization reduces the surface of issues that make reproducibility tougher. This is otherwise all the more challenging when there may be bespoke computational needs that require expert attention outside of an applied scientist’s domain, like installing CUDA drivers and dealing with strange package installation corner cases.

Give it a go. You can download the ESM container I built above by installing docker on your machine and running

docker pull ankitvgupta/esm:esm
docker run -it  --rm --runtime=nvidia --gpus all ankitvgupta/esm:esm
conda activate esmfold 
# Enjoy your pre-made environment!

A closing note

Hopefully if you’ve read this far, I’ve now convinced you that including a Dockerfile with your open source tools (or contributing one to one you like!) is a worthwhile use of your time. I’ll just end by acknowledging that as ML/science people, we often don’t like to think about issues of code reusability, portability, and clarity. Certainly academic conferences or journals don’t reward contributions along these lines as much as they reward a 1% improvement on a flawed metric. I hope that nevertheless, we see the value in doing this - if the goal is to do open, reproducible science, and produce tools that others are willing to use, it’s one of the simplest ways we can ensure that reproducibility.

Thanks for reading until the end. My future posts will probably be in a mix of formats, spanning technical guides like this one, reflections on papers I enjoyed reading in ML and the life sciences, and maybe a few thoughts out of left field. Feel free to shoot me a message if you have any feedback.