Apache Mesos - Supporting Container Images in Mesos Containerizer

Motivation

Mesos currently supports several containerizers, notably the Mesos containerizer and the Docker containerizer. Mesos containerizer uses native OS features directly to provide isolation between containers, while Docker containerizer delegates container management to the Docker engine.

Maintaining two containerizers is hard. For instance, when we add new features to Mesos (e.g., persistent volumes, disk isolation), it becomes a burden to update both containerizers. Even worse, isolating some resources (e.g., network handles on an agent) sometimes requires coordination between the two containerizers, which is very hard to implement in practice. In addition, we found that extending and customizing isolation for containers launched by the Docker engine is difficult, mainly because we have no way to inject logic during the life cycle of a container.

Therefore, we made an effort to unify containerizers in Mesos (MESOS-2840, a.k.a. the Universal Containerizer). We improved Mesos containerizer so that it now supports launching containers that specify container images (e.g., Docker/Appc images).

Getting Started

To support container images, we introduced a new component in Mesos containerizer, called image provisioner. Image provisioner is responsible for pulling, caching and preparing container root filesystems. It also extracts runtime configurations from container images which will then be passed to the corresponding isolators for proper isolation.

There are a few container image specifications, notably Docker, Appc, and OCI (future). Currently, we support Docker and Appc images. More details about what features are supported or not can be found in the following sections.

NOTE: container images are currently supported only on Linux.

Configure the agent

To enable container image support in Mesos containerizer, the operator needs to specify the --image_providers agent flag, which tells Mesos containerizer what types of container images are allowed. For example, setting --image_providers=docker allows containers to use Docker images. The operator can also specify multiple container image types. For instance, --image_providers=docker,appc allows both Docker and Appc container images.

A few isolators need to be turned on in order to provide proper isolation according to the runtime configurations specified in the container image. The operator needs to add the filesystem/linux and docker/runtime isolators to the --isolation flag.

In summary, to enable container image support in Mesos containerizer, please specify the following agent flags:

$ sudo mesos-agent \
  --containerizers=mesos \
  --image_providers=appc,docker \
  --isolation=filesystem/linux,docker/runtime

Framework API

We introduced a new protobuf message, Image, which allows frameworks to specify container images for their containers. It has two types right now: APPC and DOCKER, representing Appc and Docker images respectively.

For Appc images, the name and labels are as described in the Appc spec.

For Docker images, the name is the Docker image reference in the following form (the same format expected by docker pull): [REGISTRY_HOST[:REGISTRY_PORT]/]REPOSITORY[:TAG|@DIGEST]
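As an illustration of the reference format above, here is a minimal Python sketch that splits a Docker image name into its parts. The function and type names are hypothetical. The rule used to recognize a registry prefix (the first path component contains a dot or a colon, or is localhost) follows the usual Docker convention; it is an assumption here, not a description of Mesos's own parser.

```python
from typing import NamedTuple, Optional


class ImageReference(NamedTuple):
    registry: Optional[str]   # e.g. "localhost:80", or None when absent
    repository: str           # e.g. "gilbert/inky"
    tag: Optional[str]        # e.g. "latest"
    digest: Optional[str]     # e.g. "sha256:..."


def parse_image_reference(name: str) -> ImageReference:
    """Split [REGISTRY_HOST[:REGISTRY_PORT]/]REPOSITORY[:TAG|@DIGEST]."""
    registry = None
    remainder = name

    # Assumed heuristic: treat the first component as a registry only if it
    # looks like a host name (has '.' or ':') or is exactly "localhost".
    first, slash, rest = name.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest

    tag = None
    digest = None
    if "@" in remainder:
        remainder, digest = remainder.split("@", 1)
    elif ":" in remainder:
        remainder, tag = remainder.rsplit(":", 1)

    if not remainder:
        raise ValueError("missing repository in image reference: %r" % name)

    return ImageReference(registry, remainder, tag, digest)
```

For example, parse_image_reference("localhost:80/gilbert/inky:latest") yields the registry localhost:80, the repository gilbert/inky and the tag latest, while library/redis has no registry, tag or digest.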

message Image {
  enum Type {
    APPC = 1;
    DOCKER = 2;
  }

  message Appc {
    required string name = 1;
    optional Labels labels = 3;
  }

  message Docker {
    required string name = 1;
  }

  required Type type = 1;

  // Only one of the following image messages should be set to match
  // the type.
  optional Appc appc = 2;
  optional Docker docker = 3;
}

The framework needs to specify MesosInfo in ContainerInfo in order to launch containers with container images. In other words, the framework needs to set the type to ContainerInfo.MESOS, indicating that it wants to use the Mesos containerizer. If MesosInfo.image is not specified, the container will use the host filesystem. If MesosInfo.image is specified, it will be used as the container image when launching the container.

message ContainerInfo {
  enum Type {
    DOCKER = 1;
    MESOS = 2;
  }

  message MesosInfo {
    optional Image image = 1;
  }

  required Type type = 1;
  optional MesosInfo mesos = 5;
}
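To make the relationship between ContainerInfo, MesosInfo and Image concrete, the following Python sketch builds the nested structure as plain dictionaries and prints it as JSON. The helper name is hypothetical, and the JSON rendering (field names taken from the messages above, enum values written as strings) is an assumption about how a framework might serialize these messages, not a prescribed wire format.

```python
import json
from typing import Optional


def mesos_container_info(docker_image: Optional[str] = None) -> dict:
    """Build a ContainerInfo-shaped dict for the Mesos containerizer."""
    # type = MESOS selects the Mesos containerizer.
    info = {"type": "MESOS"}

    # Without an image the container uses the host filesystem, so the
    # "mesos" field is only filled in when an image name is given.
    if docker_image is not None:
        info["mesos"] = {
            "image": {
                "type": "DOCKER",
                "docker": {"name": docker_image},
            }
        }
    return info


if __name__ == "__main__":
    print(json.dumps(mesos_container_info("library/redis"), indent=2))
```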

Test it out!

First, start the Mesos master:

$ sudo sbin/mesos-master --work_dir=/tmp/mesos/master

Then, start the Mesos agent:

$ sudo GLOG_v=1 sbin/mesos-agent \
  --master=<MASTER_IP>:5050 \
  --isolation=docker/runtime,filesystem/linux \
  --work_dir=/tmp/mesos/agent \
  --image_providers=docker \
  --executor_environment_variables="{}"

Now, use the Mesos CLI (i.e., mesos-execute) to launch a Docker container (e.g., redis). Note that --shell=false tells Mesos to use the default entrypoint and cmd specified in the Docker image.

$ sudo bin/mesos-execute \
  --master=<MASTER_IP>:5050 \
  --name=test \
  --docker_image=library/redis \
  --shell=false

Verify that your container is running by launching a redis client:

$ sudo docker run -ti --net=host redis redis-cli
127.0.0.1:6379> ping
PONG
127.0.0.1:6379>

Docker Support and Current Limitations

Image provisioner uses Docker v2 registry API to fetch Docker images/layers. The fetching is based on curl, therefore SSL is automatically handled. For private registries, the operator needs to configure curl accordingly so that it knows where to find the additional certificate files.

Fetching images that require authentication is supported through the --docker_config agent flag. Starting from 1.0, operators can use this agent flag to specify a shared docker config file, which is used for pulling private repositories with authentication. Per-container credentials are not supported yet (coming soon).

Operators can either specify the flag as an absolute path pointing to the docker config file (which requires manually configuring .docker/config.json or .dockercfg on each agent), or specify the flag as a JSON-formatted string. See the configuration documentation for details. For example:

--docker_config=file:///home/vagrant/.docker/config.json

or as a JSON object,

--docker_config="{ \
  \"auths\": { \
    \"https://index.docker.io/v1/\": { \
      \"auth\": \"xXxXxXxXxXx=\", \
      \"email\": \"username@example.com\" \
    } \
  } \
}"
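The JSON form of the flag can be generated rather than written by hand. The sketch below assumes the usual docker config convention that the auth value is the base64 encoding of username:password; the function name and the sample credentials are hypothetical.

```python
import base64
import json
import shlex


def docker_config_json(username: str, password: str, email: str,
                       registry: str = "https://index.docker.io/v1/") -> str:
    """Return a docker config document as a compact JSON string."""
    # Assumed convention: "auth" is base64("username:password").
    token = base64.b64encode(
        ("%s:%s" % (username, password)).encode("utf-8")).decode("ascii")
    config = {"auths": {registry: {"auth": token, "email": email}}}
    return json.dumps(config)


if __name__ == "__main__":
    value = docker_config_json("username", "secret", "username@example.com")
    # shlex.quote makes the value safe to paste onto a shell command line.
    print("--docker_config=" + shlex.quote(value))
```

Generating the string this way avoids the manual backslash escaping shown above, since the shell quoting is applied once to the whole JSON document.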

Private registries are supported either through the --docker_registry agent flag, or by specifying a private registry for each container using an image name of the form <REGISTRY>/<REPOSITORY> (e.g., localhost:80/gilbert/inky:latest). If <REGISTRY> is included as a prefix in the image name, the registry specified through the agent flag --docker_registry will be ignored.

If the --docker_registry agent flag points to a local directory (e.g., /tmp/mesos/images/docker), the provisioner will pull Docker images from local filesystem, assuming Docker archives (result of docker save) are stored there based on the image name and tag. For example, the operator can put a busybox:latest.tar (the result of docker save -o busybox:latest.tar busybox) under /tmp/mesos/images/docker and launch the agent by specifying --docker_registry=/tmp/mesos/images/docker. Then the framework can launch a Docker container by specifying busybox:latest as the name of the Docker image. This flag can also point to an HDFS URI (experimental in Mesos 1.7) (e.g., hdfs://localhost:8020/archives/) to fetch images from HDFS if the hadoop command is available on the agent.

If the --switch_user flag is set on the agent and the framework specifies a user (either CommandInfo.user or FrameworkInfo.user), we expect that the user exists in the container image and that its uid and gids match those on the host. User namespaces are not supported yet. If the user is not specified, root will be used by default. The operator or the framework can limit the capabilities of the container by using the linux/capabilities isolator.
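One way to sanity-check this requirement before launching a task is to compare the user's entry in the host's /etc/passwd with the entry in the image's /etc/passwd. The following Python sketch does that on the text of the two files. The helper names are hypothetical, and only the uid and the primary gid are compared; supplementary groups from /etc/group are not covered.

```python
from typing import Optional, Tuple


def find_user(passwd_text: str, user: str) -> Optional[Tuple[int, int]]:
    """Return (uid, gid) for `user` from passwd-formatted text, or None."""
    for line in passwd_text.splitlines():
        # passwd format: name:password:uid:gid:gecos:home:shell
        fields = line.strip().split(":")
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[2]), int(fields[3])
    return None


def user_matches(host_passwd: str, image_passwd: str, user: str) -> bool:
    """True if `user` exists in both files with the same uid and gid."""
    host_entry = find_user(host_passwd, user)
    image_entry = find_user(image_passwd, user)
    return host_entry is not None and host_entry == image_entry
```

The image's passwd text would typically be read from the provisioned root filesystem or from an unpacked copy of the image.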

Currently, we support host, bridge and user-defined networks; none is not supported yet. We support the above networking modes in Mesos Containerizer using the CNI (Container Network Interface) standard. Please refer to the network/cni isolator document for more details about how to configure the network for the container.

More agent flags

--docker_registry: The default URL for pulling Docker images. It can be either a Docker registry server URL (e.g., https://registry.docker.io) or a local path (e.g., /tmp/docker/images) in which Docker image archives (result of docker save) are stored. The default value is https://registry-1.docker.io.

--docker_store_dir: Directory the Docker provisioner will store images in. All the Docker images are cached under this directory. The default value is /tmp/mesos/store/docker.

--docker_config: The default docker config file for the agent. It can be provided either as an absolute path pointing to the agent's local docker config file, or as a JSON-formatted string. The format of the docker config file should be identical to docker’s default one (e.g., either $HOME/.docker/config.json or $HOME/.dockercfg).

Appc Support and Current Limitations

Currently, only the root filesystem specified in the Appc image is supported. Other runtime configurations like environment variables, exec, working directory are not supported yet (coming soon).

For image discovery, we currently support a simple discovery mechanism. We allow operators to specify a URI prefix which will be prepended to the URI template {name}-{version}-{os}-{arch}.{ext}. For example, if the URI prefix is file:///tmp/appc/ and the Appc image name is example.com/reduce-worker with version:1.0.0, we will fetch the image at file:///tmp/appc/example.com/reduce-worker-1.0.0.aci.
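The sketch below fills in the URI template exactly as written, including the os and arch parts. The function name and the default values linux, amd64 and aci are assumptions made for illustration. Note that the example path above is shorter than what the full template produces, so the exact file name is worth verifying against the Mesos version in use.

```python
def simple_discovery_uri(prefix: str, name: str, version: str,
                         os_name: str = "linux", arch: str = "amd64",
                         ext: str = "aci") -> str:
    """Prepend `prefix` to the {name}-{version}-{os}-{arch}.{ext} template."""
    # The prefix is prepended as-is, so it should end with '/' (or "//" for
    # a bare scheme such as "http://") when a separator is wanted.
    return "%s%s-%s-%s-%s.%s" % (prefix, name, version, os_name, arch, ext)


if __name__ == "__main__":
    print(simple_discovery_uri("file:///tmp/appc/",
                               "example.com/reduce-worker", "1.0.0"))
```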

More agent flags

--appc_simple_discovery_uri_prefix: URI prefix to be used for simple discovery of appc images, e.g., http://, https://, hdfs://<hostname>:9000/user/abc/cde. The default value is http://.

--appc_store_dir: Directory the appc provisioner will store images in. All the Appc images are cached under this directory. The default value is /tmp/mesos/store/appc.

Provisioner Backends

A provisioner backend takes a set of filesystem layers and stacks them into a root filesystem. Currently, we support the following backends: copy, bind, overlay and aufs. Mesos validates whether the selected backend works with the underlying filesystem (the filesystem used by the image store, --docker_store_dir or --appc_store_dir) using the following table:

+---------+--------------+------------------------------------------+
| Backend | Suggested on | Disabled on                              |
+---------+--------------+------------------------------------------+
| aufs    | ext4 xfs     | btrfs aufs eCryptfs                      |
| overlay | ext4 xfs*    | btrfs aufs overlay overlay2 zfs eCryptfs |
| bind    |              | N/A(`--sandbox_directory' must exist)    |
| copy    |              | N/A                                      |
+---------+--------------+------------------------------------------+

NOTE: xfs support on overlay is enabled only when d_type=true. Use xfs_info to verify that the xfs ftype option is set to 1. To format an xfs filesystem for overlay, use the flag -n ftype=1 with mkfs.xfs.
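The table and the xfs note can be read as a simple lookup. The following Python sketch encodes the "Disabled on" column and the ftype condition. The names are hypothetical, and the code only mirrors the table; it does not reproduce how Mesos detects the filesystem type.

```python
from typing import Optional

# "Disabled on" column of the table, keyed by backend (lower-cased names).
DISABLED_ON = {
    "aufs": {"btrfs", "aufs", "ecryptfs"},
    "overlay": {"btrfs", "aufs", "overlay", "overlay2", "zfs", "ecryptfs"},
    "bind": set(),
    "copy": set(),
}


def backend_allowed(backend: str, store_fs: str,
                    xfs_ftype: Optional[int] = None) -> bool:
    """Check a backend against the filesystem that holds the image store."""
    if backend not in DISABLED_ON:
        raise ValueError("unknown backend: %r" % backend)

    fs = store_fs.lower()
    if fs in DISABLED_ON[backend]:
        return False

    # overlay on xfs needs d_type support, i.e. the xfs ftype option set to 1.
    if backend == "overlay" and fs == "xfs" and xfs_ftype != 1:
        return False

    return True
```

For instance, backend_allowed("overlay", "xfs", xfs_ftype=1) is accepted, while the same call with ftype 0 or with a zfs store is rejected.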

The provisioner backend can be specified through the agent flag --image_provisioner_backend. If not set, Mesos will automatically select the best backend for the users/operators. The selection logic is as follows:

1. Use `overlay` backend if the overlayfs is available.
2. Use `aufs` backend if the aufs is available and overlayfs is not supported.
3. Use `copy` backend if none of above is selected.
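The three steps above can be sketched in Python as follows. Checking /proc/filesystems is an assumption used here to approximate "available" (it lists only filesystems currently registered with the kernel); the text does not say how Mesos itself performs the check, and the function names are hypothetical.

```python
from typing import Set


def supported_filesystems(proc_filesystems_text: str) -> Set[str]:
    """Parse the contents of /proc/filesystems into a set of names."""
    names = set()
    for line in proc_filesystems_text.splitlines():
        # Lines look like "nodev<TAB>overlay" or "<TAB>ext4"; the
        # filesystem name is always the last whitespace-separated token.
        parts = line.split()
        if parts:
            names.add(parts[-1])
    return names


def select_backend(filesystems: Set[str]) -> str:
    """Pick overlay first, then aufs, and fall back to copy."""
    if "overlay" in filesystems:
        return "overlay"
    if "aufs" in filesystems:
        return "aufs"
    return "copy"
```

On a Linux host, the input would come from open("/proc/filesystems").read(). The bind backend is never chosen automatically in this sketch, matching the list above.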

Copy

The Copy backend simply copies all the layers into a target root directory to create a root filesystem.

Bind

This is a specialized backend that may be useful for deployments using large (multi-GB) single-layer images and where more recent kernel features such as overlayfs are not available. For small images (10’s to 100’s of MB) the copy backend may be sufficient. Bind backend is faster than Copy as it requires nearly zero IO.

The bind backend currently has these two limitations:

  1. The bind backend supports only a single layer. Multi-layer images will fail to provision and the container will fail to launch!

  2. The filesystem is read-only because all containers using this image share the source. Selected writable areas can be provided by mounting read-write volumes to places like /tmp, /var/tmp, /home, etc. using the ContainerInfo. These can be relative to the executor work directory. Since the filesystem is read-only, --sandbox_directory and /tmp must already exist within the filesystem because the filesystem isolator is unable to create them (e.g., either the image writer needs to create the mount point in the image, or the operator needs to set the agent flag --sandbox_directory properly).

Overlay

The overlay backend was introduced because the copy backend wastes IO and space, while the bind backend can only deal with one layer. The overlay backend allows the containerizer to use the filesystem to merge multiple filesystems into one efficiently.

The overlay backend depends on support for multiple lower layers, which requires Linux kernel version 4.0 or later. For more information on overlayfs, please refer to the overlayfs documentation.
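A quick way to check the kernel requirement is to compare the release string (as reported by uname -r) against 4.0. This is a minimal sketch with a hypothetical function name; it checks only the version number and says nothing about whether the overlay module is actually built or loaded.

```python
import re


def kernel_supports_overlay_backend(release: str) -> bool:
    """True if a release string such as "4.4.0-31-generic" is >= 4.0."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles both the major and the minor component.
    return (major, minor) >= (4, 0)
```

On the agent host, the release string can be obtained with platform.release() from the Python standard library.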

AUFS

AUFS was introduced because overlayfs support was not merged into the mainline kernel until 3.18, and Docker’s default storage backend for Ubuntu 14.04 is AUFS.

Like overlayfs, AUFS is a union file system. It is very stable, has a lot of real-world deployments, and has strong community support.

Some Linux distributions do not support AUFS. This is usually because AUFS is not included in the mainline (upstream) Linux kernel.

For more information on AUFS, please refer to the AUFS documentation.

Executor Dependencies in a Container Image

Mesos has this concept of executors. All tasks are launched by an executor. For a general purpose executor (e.g., thermos) of a framework (e.g., Aurora), requiring it and all its dependencies to be present in all possible container images that a user might use is not trivial.

In order to solve this issue, we propose a solution where we allow the executor to run on the host filesystem (without a container image). Instead, it can specify a volume whose source is an Image. Mesos containerizer will provision the image specified in the volume, and mount it under the sandbox directory. The executor can perform pivot_root or chroot itself to enter the container root filesystem.
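To illustrate the idea, the sketch below builds a volume description whose source is a Docker image, as plain dictionaries. The field names container_path, mode and image, and the mode values RW and RO, are assumed to correspond to the Volume message in mesos.proto and should be checked against the Mesos version in use; the helper name and the path rootfs are hypothetical.

```python
import json


def image_volume(container_path: str, docker_image: str, mode: str) -> dict:
    """Build a Volume-shaped dict whose source is a Docker image."""
    if mode not in ("RW", "RO"):
        raise ValueError("mode must be 'RW' or 'RO'")
    return {
        # A relative path is interpreted relative to the sandbox directory.
        "container_path": container_path,
        "mode": mode,
        "image": {
            "type": "DOCKER",
            "docker": {"name": docker_image},
        },
    }


if __name__ == "__main__":
    # No MesosInfo.image is set for the executor itself, so it runs on the
    # host filesystem and finds the provisioned image under "rootfs".
    container_info = {
        "type": "MESOS",
        "volumes": [image_volume("rootfs", "library/redis", "RW")],
    }
    print(json.dumps(container_info, indent=2))
```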

Garbage Collect Unused Container Images

Experimental support for garbage-collecting unused container images was added in Mesos 1.5. Image GC can either be triggered automatically via the new agent flag --image_gc_config, or invoked manually through the agent’s v1 Operator HTTP API. It can be used to avoid unbounded disk space usage by image stores.

This is implemented with a simple mark-and-sweep logic. When image GC happens, we check all layers and images referenced by active running containers and avoid removing them from the image store. As a prerequisite, if there are active containers launched before Mesos 1.5.0, we cannot determine which images can be safely garbage collected, so the agent will refuse to invoke image GC. To garbage collect container images, users are expected to drain all containers launched before Mesos 1.5.0.
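The mark-and-sweep idea can be illustrated with a small Python sketch. The data model (a dictionary from image name to its layer ids) and the function name are hypothetical and much simpler than the real image store; the sketch only shows that layers shared with an active image survive the sweep.

```python
from typing import Dict, Iterable, List, Tuple


def sweep_image_store(store: Dict[str, List[str]],
                      active_images: Iterable[str],
                      excluded_images: Iterable[str] = ()
                      ) -> Tuple[List[str], List[str]]:
    """Return (images_to_remove, layers_to_remove) without mutating store."""
    # Mark phase: images used by running containers, plus explicitly
    # excluded images, and every layer those images reference.
    kept_images = set(active_images) | set(excluded_images)
    kept_layers = {layer
                   for image in kept_images
                   for layer in store.get(image, [])}

    # Sweep phase: everything that was not marked can be removed.
    images_to_remove = sorted(i for i in store if i not in kept_images)
    all_layers = {layer for layers in store.values() for layer in layers}
    layers_to_remove = sorted(all_layers - kept_layers)
    return images_to_remove, layers_to_remove
```

With a store holding redis (layers l1, l2) and busybox (layers l1, l3), and only redis in use, the sweep removes busybox and layer l3 but keeps the shared layer l1.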

NOTE: currently, the image GC is only supported for docker store in Mesos Containerizer.

Automatic Image GC through Agent Flag

To enable automatic image GC, use the new agent flag --image_gc_config:

--image_gc_config=file:///home/vagrant/image-gc-config.json

or as a JSON object,

--image_gc_config="{ \
  \"image_disk_headroom\": 0.1, \
  \"image_disk_watch_interval\": { \
    \"nanoseconds\": 3600000000000 \
  }, \
  \"excluded_images\": [ ] \
}"

Manual Image GC through HTTP API

See the PRUNE_IMAGES section in the v1 Operator HTTP API documentation for manual image GC through the agent HTTP API.

References

For more information on the Mesos containerizer filesystem, namespace, and isolator features, visit Mesos Containerizer. For more information on launching Docker containers through the Docker containerizer, visit Docker Containerizer.