Welcome to day 4 of our series on containerd internals!
Container images are the mechanism that we use to capture a container’s filesystem, distribute it to nodes that will eventually run containers, and ensure that containers start from a known-identical configuration. In many ways, images are the defining characteristic of a containerized system; they are the interaction point for users who want to create a workload and make it repeatable and predictable. Without images, you could still achieve isolation characteristics similar to those available in containerized systems today, but it would be more difficult to build reliable, production-ready, and understandable workloads.
Most users are familiar with Docker-style images. These images feature inheritance (the ability to start from a known base, abstracting common functionality) and a (mostly) straightforward imperative build language to express how a container should be constructed. Tools like Docker, BuildKit, nerdctl, and Buildah can all be used to construct images like these. Once built, the images are distributable via an interchange format and a common protocol. A set of popular open-source operating systems and tools is packaged as part of the Docker library project and is commonly used as base images, though companies often maintain their own internal collections of base images as well.
There are two popular formats for container images today: OCI images and Docker schema 2 images, though they are very similar to each other. Both of these formats have a concept of a layer representing a portion of the filesystem. An individual layer is an archive file that represents files and directories along with their metadata (structure, names, size, permissions); in both OCI and Docker schema 2 these are tar-formatted. An image is made up of layers, which are stacked on top of each other such that the resulting container’s filesystem is a union of all the layers.
(Figure: the layers of an image stacked from top to bottom: Layer 3, Layer 2, Layer 1.)
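To make the idea of a layer as a tar archive concrete, here is a minimal sketch that builds a single uncompressed layer in memory and lists the metadata it carries. The helper names `make_layer` and `list_layer` are hypothetical and exist only for this illustration; real build tools record much more metadata (ownership, timestamps, extended attributes).

```python
import io
import tarfile


def make_layer(files):
    """Build an uncompressed tar archive (one layer) from {path: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_layer(layer_bytes):
    """Return (name, size, mode) for every entry in a tar-formatted layer."""
    with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r") as tar:
        return [(m.name, m.size, oct(m.mode)) for m in tar.getmembers()]


layer = make_layer({"etc/motd": b"hello\n"})
print(list_layer(layer))
```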
Once an image is built, users will typically push it to a registry, which serves as a storage and distribution point for images. Most registries today implement either the OCI distribution spec or the Docker Registry HTTP API V2.
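Both protocols expose manifests and blobs (such as layers) under `/v2/` paths. The sketch below only builds the URLs a client would request; the registry host `registry.example.com`, the repository name, and the helper functions are assumptions for illustration, and a real client also has to handle authentication, redirects, and content negotiation.

```python
def manifest_url(registry, repository, reference):
    """URL of an image manifest; reference is a tag or a digest."""
    return f"https://{registry}/v2/{repository}/manifests/{reference}"


def blob_url(registry, repository, digest):
    """URL of a blob (for example a compressed layer), addressed by digest."""
    return f"https://{registry}/v2/{repository}/blobs/{digest}"


# Hypothetical registry and repository names.
print(manifest_url("registry.example.com", "library/busybox", "latest"))
```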
A container runtime such as containerd is generally responsible for obtaining a container image, transforming the interchange format into a suitable on-disk format, and presenting a unified filesystem so a container can run. There are many ways to achieve this; in containerd, this is abstracted out into pluggable components such as a resolver, differ, snapshotter, and so forth. We’ll cover some of this here, but future posts may go into more detail on each of these components.
Layers and how they are identified
Layers provide the mechanism for inheritance and sharing that has made Docker-style images popular. When layers are shared between images, they can also be deduplicated to save storage space. This means it is important to have consistent mechanisms to identify layers.
Layers in an image are typically tar streams and are also typically stored and transferred in a compressed state. This means that both the compressed and uncompressed representations of the layer matter. There are three different content addresses that are relevant for understanding how layers are identified during transfer and subsequently inside containerd:
- Digest - a sha256 hash of the compressed tar of a layer. The layer digest is stored in the image’s manifest, and is used during image pull to locate a layer from the registry and to verify it during download.
- Diff ID - a sha256 hash of the uncompressed tar of a layer. The diff ID is stored in the image config, and is mostly used as part of the next concept…
- Chain ID - a recurrence relation over Diff IDs that provides ordering for the layers. A given layer’s Chain ID is computed based on the Chain ID of its predecessor layers and its own Diff ID; the only time a Diff ID and Chain ID are equal is for the first layer of an image.
In other words:
Content address | Compression | Ordering |
---|---|---|
Digest | Compressed | Unordered |
Diff ID | Uncompressed | Unordered |
Chain ID | Uncompressed | Ordered |
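The three identifiers can be sketched in a few lines. This is a minimal illustration rather than containerd's implementation: it assumes gzip compression, and the Chain ID recurrence follows the OCI image spec, where each Chain ID is the sha256 of the parent's Chain ID, a space, and the layer's Diff ID. The function names are hypothetical.

```python
import gzip
import hashlib


def sha256_digest(data):
    return "sha256:" + hashlib.sha256(data).hexdigest()


def layer_digest(uncompressed_tar):
    """Digest: hash of the compressed layer (gzip assumed here).

    The result depends on the compressor and its settings, so the same
    tar can have many valid digests.
    """
    return sha256_digest(gzip.compress(uncompressed_tar, mtime=0))


def diff_id(uncompressed_tar):
    """Diff ID: hash of the uncompressed tar."""
    return sha256_digest(uncompressed_tar)


def chain_ids(diff_ids):
    """Chain IDs for layers ordered bottom-most first."""
    chain = []
    for d in diff_ids:
        if not chain:
            chain.append(d)  # first layer: Chain ID equals Diff ID
        else:
            chain.append(sha256_digest(f"{chain[-1]} {d}".encode()))
    return chain


tars = [b"tar bytes of layer 1", b"tar bytes of layer 2"]
ids = [diff_id(t) for t in tars]
print(chain_ids(ids))
```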
Ordering is generally irrelevant in storing and transferring images. A given compressed layer can be shared between images even if the order is different in the images. However, ordering very much does matter when unioning the layers together to form a filesystem for a container; collisions in upper layers occlude things present in lower layers. This is true both for new files/directories and for deletions; deletion markers in upper layers allow files and directories present in lower layers to be hidden from the container.
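As a rough model of this occlusion, the sketch below unions layers that are represented as plain dictionaries instead of tar archives. It assumes the OCI convention that a deletion marker (whiteout) is an empty file named `.wh.<name>`; opaque-directory markers are not handled, and the `union` helper is hypothetical.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."


def union(layers):
    """Union layers (bottom-most first); each layer maps path -> content."""
    fs = {}
    for layer in layers:
        # First apply this layer's whiteouts to what the lower layers provided.
        for path in layer:
            directory, name = posixpath.split(path)
            if name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
                for existing in list(fs):
                    if existing == target or existing.startswith(target + "/"):
                        del fs[existing]
        # Then add this layer's own files, occluding any lower copies.
        for path, content in layer.items():
            if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                fs[path] = content
    return fs


lower = {"etc/motd": "v1", "bin/tool": "x"}
upper = {"etc/motd": "v2", "bin/.wh.tool": ""}
print(union([lower, upper]))
```

Swapping the order of `lower` and `upper` yields a different filesystem, which is why the runtime needs an ordered identifier.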
Because the Chain ID changes based on ordering and because ordering is irrelevant for storage, registries (and registry-like systems) generally ignore the Chain ID and rely on an unordered identifier like the Digest. Because ordering is relevant for constructing the runtime environment of a container, container runtimes rely on an ordered identifier like the Chain ID.
Layers can be shared between images. As shown in figure 2, a layer that is shared can have a different Chain ID even while it has the same Diff ID. A common pattern is for images to also share lineage (ancestry, base images) where a layer has the same Chain ID and Diff ID, but layers on top diverge. Layers can also be duplicated within the same image, having the same Diff ID but different Chain IDs.
(Figure 3a: Layer 2 is duplicated in a single image. Each copy of layer 2 has a different Chain ID but the same Diff ID.)

(Figure 3b: Layers 1 and 2 are shared lineage with the image in figure 3a. Both the Chain ID and Diff ID are shared with the bottom two layers in figure 3a.)

(Figure 3c: Layer 2 is shared with the images in figures 3a and 3b, but has a different Chain ID due to its different lower layers and ordering in this image.)
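The three cases can be checked with a small worked example. The layer contents and the exact stacks below are hypothetical stand-ins for the images in the figures; the Chain ID recurrence is the one from the OCI image spec.

```python
import hashlib
from itertools import accumulate


def sha256_digest(data):
    return "sha256:" + hashlib.sha256(data).hexdigest()


def chain_ids(diff_ids):
    """Chain IDs for layers ordered bottom-most first."""
    return list(accumulate(
        diff_ids,
        lambda parent, d: sha256_digest(f"{parent} {d}".encode()),
    ))


# Hypothetical Diff IDs standing in for three distinct layers.
layer1, layer2, layer3 = (
    sha256_digest(b) for b in (b"layer 1", b"layer 2", b"layer 3")
)

image_a = chain_ids([layer1, layer2, layer2])  # layer 2 duplicated
image_b = chain_ids([layer1, layer2, layer3])  # shares lineage with image_a
image_c = chain_ids([layer3, layer2])          # shares only layer 2's Diff ID

print(image_a[1] != image_a[2])      # same Diff ID, different Chain IDs
print(image_a[:2] == image_b[:2])    # shared lineage, shared Chain IDs
print(image_c[1] in image_a)         # shared layer, no shared Chain ID
```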
Image pull
Downloading an image to a container runtime is usually called “pulling” the image. containerd implements image pull by first storing the compressed image layers into the content store and then extracting them into runnable snapshots inside a snapshotter. Image metadata is stored in the image service. This process is coordinated either by the smart Go client or by the transfer service (introduced as experimental in containerd 1.7 and stable in 2.0). Once the image layers have been extracted into snapshots, the original compressed layers may be optionally discarded (though that can complicate subsequent operations if these layers are intended to be used as base layers for building new images).
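The sketch below is a toy model of the first half of that flow, not containerd's API: each compressed layer is fetched, verified against its Digest, and kept in a dictionary standing in for the content store; it is then decompressed and checked against the Diff ID from the image config before it would be handed to a snapshotter. The `pull_layers` function, the `fetch` callback, and gzip compression are all assumptions for illustration.

```python
import gzip
import hashlib


def sha256_digest(data):
    return "sha256:" + hashlib.sha256(data).hexdigest()


def pull_layers(layer_digests, expected_diff_ids, fetch):
    """Fetch and verify layers; return (content_store, uncompressed_tars)."""
    content_store = {}  # digest -> compressed bytes
    unpacked = []       # uncompressed tars, bottom-most first
    for digest, expected_diff_id in zip(layer_digests, expected_diff_ids):
        blob = fetch(digest)
        if sha256_digest(blob) != digest:
            raise ValueError(f"digest mismatch for {digest}")
        content_store[digest] = blob
        tar_bytes = gzip.decompress(blob)
        if sha256_digest(tar_bytes) != expected_diff_id:
            raise ValueError(f"diff ID mismatch for {digest}")
        unpacked.append(tar_bytes)
    return content_store, unpacked


# A hypothetical one-layer image served from an in-memory "registry".
tar_bytes = b"uncompressed tar bytes"
blob = gzip.compress(tar_bytes, mtime=0)
registry = {sha256_digest(blob): blob}
store, tars = pull_layers(
    list(registry), [sha256_digest(tar_bytes)], registry.__getitem__
)
print(len(store), len(tars))
```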
containerd snapshotters
In containerd, snapshotters use the Chain ID to represent individual snapshots. This is because a number of backend snapshotting mechanisms can only hold ordered content. For example, disk-based snapshotting mechanisms like devicemapper thin provisioned devices have a concrete parent-child relationship. Similarly, ZFS snapshots are a set of copy-on-write diffs requiring a parent-child relationship. At the time a snapshot is created using either of these two snapshotters, the parent-child relationship must be known and has a direct effect on the actual disk layout.
containerd’s default snapshotter on Linux is based on the overlay filesystem. This is a mechanism in Linux that allows two (or more) directories to be unioned together. Unlike devicemapper thin provisioned devices or ZFS snapshots, the directory structure used to back an overlay does not require a strict/consistent parent-child relationship (though the resulting overlay is itself ordered). This means that the same two directories can be overlaid in different orders; it is only at overlay creation (mount) time that the order matters. However, duplication does matter; an overlay mount is not valid if the same directory appears more than once in the set of directories that are unioned together.
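To illustrate, here is a minimal sketch that assembles the option string for an overlay mount. It relies on the documented overlayfs convention that `lowerdir` entries are separated by colons and listed top-most first; the `overlay_options` helper and the directory paths are hypothetical, and this is not how containerd itself builds its mounts.

```python
def overlay_options(lower_dirs, upper_dir, work_dir):
    """Build overlay mount options from lower_dirs ordered bottom-most first."""
    all_dirs = list(lower_dirs) + [upper_dir]
    if len(set(all_dirs)) != len(all_dirs):
        # The same directory may not appear twice in one overlay.
        raise ValueError("duplicate directory in overlay")
    # overlayfs expects the top-most lower directory first.
    lower = ":".join(reversed(lower_dirs))
    return f"lowerdir={lower},upperdir={upper_dir},workdir={work_dir}"


# Hypothetical snapshot directories; the same two lowers, in either order.
print(overlay_options(["/snap/1/fs", "/snap/2/fs"], "/snap/3/fs", "/snap/3/work"))
print(overlay_options(["/snap/2/fs", "/snap/1/fs"], "/snap/3/fs", "/snap/3/work"))
```

The resulting string is what would be passed as the options of a `mount -t overlay` call; the order is fixed only at that point.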
containerd has several built-in snapshotters (including overlay, btrfs, and blockfile), but snapshotters can also be implemented as external plugins. This makes it easy to experiment and to build new implementations. Some snapshotters may eventually become built-in while others may not; the devmapper snapshotter was originally an experimental external implementation and was contributed to containerd directly after it had been used in production.
Inspired by the Slacker paper, a pattern has emerged for remote snapshotters in containerd, which are typically implemented as external plugins. When containerd pulls an image, it (eventually) stores the extracted layers of the image as snapshots in a snapshotter. To avoid downloading layers again for snapshots that are already stored locally, containerd queries the snapshotter to find whether a given snapshot exists prior to initiating a download. Remote snapshotters leverage this existence check; instead of checking whether a snapshot already exists locally, the remote snapshotter can check with its backend to find whether the snapshot can be made to exist. In effect this is an accessibility check more than anything else: the snapshotter verifies both that the requested layer is available in the backend and that the user is authorized to access it. If a layer is accessible, the snapshotter can return a code indicating that the snapshot already exists, and then lazily load (or stream) the contents. The open-source stargz, nydus, and overlaybd snapshotters are implementations of this pattern. Cloud providers also offer features built on remote snapshotters, like GKE Image Streaming (which I work on), Azure’s Artifact Streaming, and SOCI on AWS Fargate.
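The control flow can be sketched with a toy model. Everything here is hypothetical (`ToyRemoteSnapshotter`, `AlreadyExists`, `pull`) and does not mirror containerd's snapshotter interface; the `backend` set stands in for the availability and authorization checks a real remote snapshotter would perform.

```python
class AlreadyExists(Exception):
    """Signals that a snapshot does not need to be downloaded."""


class ToyRemoteSnapshotter:
    def __init__(self, backend_chain_ids):
        self.backend = set(backend_chain_ids)  # layers the backend can serve
        self.snapshots = {}                    # chain ID -> parent chain ID

    def prepare(self, chain_id, parent):
        if chain_id in self.snapshots:
            raise AlreadyExists(chain_id)
        if chain_id in self.backend:
            # Register the snapshot now; contents would be loaded lazily.
            self.snapshots[chain_id] = parent
            raise AlreadyExists(chain_id)

    def commit(self, chain_id, parent):
        self.snapshots[chain_id] = parent


def pull(chain_ids, snapshotter):
    """Return the chain IDs whose layers actually had to be downloaded."""
    downloaded = []
    parent = ""
    for chain_id in chain_ids:
        try:
            snapshotter.prepare(chain_id, parent)
        except AlreadyExists:
            pass  # skip the download for this layer
        else:
            downloaded.append(chain_id)  # download and unpack would go here
            snapshotter.commit(chain_id, parent)
        parent = chain_id
    return downloaded


snapshotter = ToyRemoteSnapshotter({"chain-1", "chain-2"})
print(pull(["chain-1", "chain-2", "chain-3"], snapshotter))
```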
Conclusion
We hope this series on containerd internals has been helpful to you in understanding more about containerd and containers in general. While we’ve reached the end of December, we’ll likely continue to blog more about the internals of containerd in the future.