envd is a frontend of BuildKit. Just like the Dockerfile. It has been more than a year since I started working on this project. Since the features are relatively stable, I'd like to write a blog about my journey with envd.

Why we need this tool

The machine learning development environment has been a pain point for a while. "Which Python are you using now?" is definitely a newbie slayer. It's even worse if you need to use CUDA. "It works on my machine!" happens a lot.

envd was created to solve the problem of the machine learning development environment. However, it goes far beyond that.

Infrastructure as code (IaC)

What a fancy name! Here it means by using the envd config file, you will be able to get the same environment on different machines, whether it's a local machine, a remote server, or a Kubernetes cluster.

Naming

It was named MIDI in the beginning. But that is not friendly for SEO.

The d in envd has no official meaning (as far as I know). It can be "docker", "deep learning", "dev", etc.

For more information, check this issue.

We have a cute logo designed by Lily. It's a cat face with the envd characters.

envd

Actually, the cat only blinked once when we created the GIF. The recording tool on macOS is tricky to use. That's why it ends up blinking twice. By the way, we replaced it with SVG to make the animation clear and smooth. Writing the SVG animation from scratch is not that hard.

You can find the drafts here.

Installation

Obviously, envd is a Golang project. However, our target audiences are mainly using Python. That's why we spend a lot of effort to support installation through pip.

As we know, Python has never done a good job of distributing pre-compiled binaries. I didn't find any good document about how to create a Python pre-compiled binary distribution. People just copy & paste the code from other projects. So does envd. The code is mainly copied from mosec.

I do learn something new from others' contributions:

Of course, you can use conda-forge. I have tried to create a recipe for mosec. It has a totally different packaging logic.

Rootless

As a developer, I don't like to run the command with sudo unless I have to. When I was trying to debug with the buildkit daemon, I found that we can run it in rootless mode.

Starlark

Starlark is a dialect of Python, which makes it easy to use for machine learning engineers and data scientists.

I know that lots of configuration files are written in YAML. I personally don't like it. You may also heard lots of complaints about the YAML format. I think the configuration file should be able to validate itself.

You can use if-condition, for-loop, etc. in Starlark. The following code works:

def build(libs, gpu=True):
    base = "ubuntu:20.04"
    if gpu:
        base = "nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04"
    for lib in sorted(libs):
        install.python_packages(name=[lib])

For more information, check the Starlark spec.

Although Starlark has an interpreting order, we don't rely on that. We will parse the file to an internal graph and construct the BuildKit Low-Level Build (LLB) graph on top of it. This tradeoff makes it easy to cache the layers.

Starlark is also easy to extend. We added lots of envd specific functions to make it more powerful. You can find them in the reference. It has a load function which is similar to import in Python to load another file. We create a new one called include (because import is reserved) to import functions from a git repository. People can create their own envd build functions and share them with others.

VSCode support

To make it more user-friendly, we have a VSCode extension for envd, which provides the following features:

  • LSP: this enables the Starlark auto-completion.
  • manage envd environment

BuildKit

This is the backend of envd. Integrating with it is troublesome. Mainly because it doesn't have any documentation. The only way to learn it is to read the examples. Since the source code is written in a functional style, it's a bit hard to understand. Once you get used to it, things will be easier.

There are some nice features in BuildKit:

  • Parallel build
  • Distributable workers
  • Better cache
  • Advanced operators

We will go through them one by one.

Parallel build

The main idea is to split the build graph into multiple sub-graphs and run them in parallel if possible. This is a great feature when some steps take a long time to finish while there is no overlap among them. For example, we can install the system packages and Conda environments in parallel.

The related operators are diff and merge. In the merge list, the later state will override the previous stats if they change the same directories. Sometimes, it may take longer than you expect to get the diff and merge them together. This should be used when you're sure that the parallelism will save time.

Distributable workers

Basically, the frontend will construct the build graph and serialize it in a Protocol Buffer format, then send it to the backend workers through TCP or Unix Domain Socket.

It's recommended to set up a long-running BuildKit daemon and use it as a remote worker since in this way it can benefit from the cache.

By default, we will create a buildkitd container for envd to build the image.

Better cache

BuildKit can import/export the cache from/to the local/inline/registry. You can choose to export the intermediate layer or not.

By default, the cache limit is 10% of your disk space. You can configure this through the buildkit config.

envd v0 will download a pre-build base image that contains the basic development tools and Python environment. This image can be used as the cache layer if none of the dependencies change. This is a great way to speed up the build process. You can check the nightly build benchmark.

Moby

For now, the best user experience is to use envd v1 with moby worker. This requires the docker engine version >= 22. To enable it, you can create a new envd context like this:

envd context create --name moby --builder moby-worker --use

Need to mention that the moby worker is still experimental. Due to the issue, we have to disable the merge operator used in envd when using the moby worker. Thus the build step might be slower but the export step will be much faster. Overall it's still faster, especially when you have a large image, which is the common case for machine learning.

Cache

Docker layer cache is a common optimization for image building. Besides, we also enable the cache for the APT packages, Python wheels, VSCode extensions, and oh-my-zsh plugins. This is done by mounting a cache directory during the build time. The machine learning related pip wheels can be huge, which makes the cache very useful.

Horust

I totally agree that for the online environment, one container should only do one thing, usually, that means running only one service. However, for the development environment, it's totally fine to run as many processes as you like, as long as they don't conflict with each other.

That's why we need a process management tool to control all of these processes. We have explored several options like systemd, s6 overlay, Supervisor. In the end, we decided to use Horust which is both simple and powerful. You can check the discussion.

Shell prompt

I personally use fish with starship, which gives a great out-of-box shell experience. starship can work well with the most common shells like bash, zsh, fish, etc. It's easy to configure and extend. You can check the starship documentation.

It works better when you have the Nerd font, but we cannot control the users' terminal configuration, we have to disable some fancy icons.

Coding in Jupyter Notebook and VSCode

These are the most common coding tools for machine learning engineers and data scientists.

Whether it's Jupyter Notebook or Jupyter Lab, it can be exposed as a normal web service.

VSCode really did a good job on the remote development. You can use the VSCode on your local machine to connect to the remote server or even the container running on a remote server.

Limited by the license, we have to use the Open VSX Registry. Sometimes the related CI test fails due to its stability.

Develop in the Kubernetes cluster

We were hoping to monetize envd with this feature. But not many people are interested in this one. The code is open sourced as envd-server. Maybe we can bring this feature to the new openmodelz project. Although you can run mdz exec {name} -ti bash to get into a container, but it doesn't support VScode-Remote for now.

Use pointer receivers

This is the most common bug during the development with envd. We have an internal build graph, which has many methods to build the LLB graph. Not all of these methods are using the pointer receivers, which results in the inconsistent state of the internal graph. I would prefer to use the pointer receivers for all of the methods.

You might be curious how come the lint doesn't catch this. That's because it can be used in a nested way, with the outer function using the value receiver while the inner function uses the pointer receiver.

This is also a good example to show the language design (personal option). You won't see this kind of bug in Rust. But Rust doesn't have a good container ecosystem. :(

Progress bar

The default docker progress bar is really complex. When I was implementing the moby push feature, I chose to reuse another progress bar lib to make life easier. Although it lacks multi-line log support.

SSH agent forwarding

Actually, we can forward the host SSH credentials to the container. So we can use the git command as we're in the host machine.

envd v1

This new version is created to address the inappropriate design of the envd v0. The main idea is that envd file should be a more general frontend of BuildKit. It should be able to build any image, not only for the machine learning development environment.

Here is a comparison:

Featuresv0v1
is default for envd<v1.0
support dev
support CUDA
support serving⚠️
support custom base image⚠️
support installing multiple languages⚠️
support moby builder

Make it faster

The compileBaseImage function should be able to run faster. You can try it if you're interested in the envd development.

Regrets

This feature will make it much more powerful, but also comes with complexity.

Users can use low-level operators to build the graph. We can execute the commands from envd file in the user-defined order.

Lots of development environments are not built in one shot. This proposal wants to track the changes in the running environment and update the envd file accordingly.

Summary

It is the first time that I can work on an open source project as my daily work.I have learned a lot from the community. I hope more people can benefit from envd.