When a Stranger Syscalls: Quirks of jail(2)

runj is an OCI runtime for FreeBSD jails. Until recently, it relied on the jail(8) command to actually set up and manage the jail in the FreeBSD kernel. Migrating to directly invoking the jail(2) family of syscalls had been a to-do item on my list for a long time. This is now done, but I learned some new things along the way.

It’s simpler than I had thought

The core logic in runj I needed to implement consisted of jail_set(2), jail_get(2), jail_attach(2), and jail_remove(2). These four handle all of jail configuration (including creating the jail), visibility into jails, attaching processes to jails, and removing jails. They fall into two different styles: jail_attach(2) and jail_remove(2) only need the jail’s ID number to operate, while jail_set(2) and jail_get(2) take an interesting data structure called an iovec.

An iovec is a byte buffer with a defined length. The structure consists of a pointer to a memory region and the size (in bytes) of the region. The kernel and userland process can then both access this memory. It’s an input-output vector. The Go definition of the struct looks like this:

type Iovec struct {
	Base *byte
	Len  uint64
}

This simple definition of a shared memory region allows the iovec to be used for both input to the syscall (parameters) and output from the syscall; the userland process must just provide that output space to the kernel.

For these jail-related syscalls, an array of iovec structs is taken as input, and processed in pairs. Each pair consists of a first element which is a name (char*) and a second which is the input parameter or output buffer (void*). Effectively the userland program and kernel communicate by way of named parameters rather than a defined struct.
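The pairing described above can be sketched roughly as follows. This is an illustration rather than runj's actual code: the local Iovec type mirrors the struct shown earlier so the example builds anywhere (on FreeBSD, syscall.Iovec would be used instead), makePair is a hypothetical helper, and the parameter values are made up.

```go
package main

import (
	"fmt"
	"syscall"
)

// Iovec mirrors the struct shown earlier in this post. It is declared
// locally only to keep the sketch portable.
type Iovec struct {
	Base *byte
	Len  uint64
}

// makePair (hypothetical helper) returns the two iovecs for one named
// parameter: the NUL-terminated name, then the value buffer. An empty
// value leaves the second element zeroed.
func makePair(name string, value []byte) ([]Iovec, error) {
	n, err := syscall.ByteSliceFromString(name) // appends the trailing NUL
	if err != nil {
		return nil, err
	}
	pair := []Iovec{{Base: &n[0], Len: uint64(len(n))}, {}}
	if len(value) > 0 {
		pair[1] = Iovec{Base: &value[0], Len: uint64(len(value))}
	}
	return pair, nil
}

func main() {
	// "name" and "path" are real jail parameters; the values are made up.
	params := []struct {
		name  string
		value string
	}{
		{"name", "demo"},
		{"path", "/tmp/demo-root"},
	}

	var iov []Iovec
	for _, p := range params {
		// Assumption: string values are NUL-terminated, like the names.
		v, err := syscall.ByteSliceFromString(p.value)
		if err != nil {
			panic(err)
		}
		pair, err := makePair(p.name, v)
		if err != nil {
			panic(err)
		}
		iov = append(iov, pair...)
	}
	fmt.Printf("%d iovecs for %d parameters\n", len(iov), len(params))
}
```

The flat slice is what would be handed to the syscall: even indexes hold names, odd indexes hold the matching values or output buffers.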

There are some clear benefits to this: the kernel can add support for new parameters without requiring any changes in userland code; the existing code just won’t pass those new parameters. And the amount of memory required for each syscall is limited to just the parameters that the userland program wants to pass. Similarly, the userland program can optionally pass buffers for output and the kernel can decide to populate them.

There are also some downsides: userland programs need to be careful in how to serialize the parameters to the iovecs and ensure they match what the kernel expects; this is easy in C but requires more care in a higher-level language like Go where the types do not necessarily match C types. And the names of the iovec parameters should be well-known, but I had trouble finding documentation of them, which leads me to…

Parameter names are obvious, until they aren’t

To a large extent, the parameter names used as input to jail_set(2) match the names used in the jail.conf(5) file and jail(8) command. Some exceptions are easy to intuit; for example, jail.conf(5) supports includes (with globs) and wildcards, which would not be expected for the syscall. But some are less obvious: jail(8) has pseudo-parameters which are not passed to the kernel but are instead interpreted internally by jail(8). These include mount parameters for mounting filesystems, exec parameters for lifecycle hooks, and parameters that deal with network interfaces. These are documented in the manual page, but it does require one to read the manual rather than just looking at example jail.conf(5) files and guessing the parameter names to pass to jail_set(2).

There is one parameter name I have failed to locate any documentation for: errmsg. This is used as buffer space for the kernel to pass detailed error messages in addition to the standard errno error number. I was made aware of this, but had to locate example use in the source code for jail(8) rather than finding documentation.
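A rough sketch of how the errmsg buffer might be handled on the Go side follows. The helper names are hypothetical, the 1024-byte size is an assumption based on JAIL_ERRMSGLEN in libjail, and the kernel's write into the buffer is simulated here rather than produced by a real jail_set(2) call.

```go
package main

import (
	"bytes"
	"fmt"
	"syscall"
)

// errmsgLen is assumed to match JAIL_ERRMSGLEN from libjail.
const errmsgLen = 1024

// errmsgString returns the NUL-terminated message found in buf, or ""
// if nothing was written to it.
func errmsgString(buf []byte) string {
	if i := bytes.IndexByte(buf, 0); i >= 0 {
		return string(buf[:i])
	}
	return string(buf)
}

// wrapErr (hypothetical helper) attaches the detailed message, if any,
// to the errno-style error returned by the syscall.
func wrapErr(err error, buf []byte) error {
	if err == nil {
		return nil
	}
	if msg := errmsgString(buf); msg != "" {
		return fmt.Errorf("jail_set: %w: %s", err, msg)
	}
	return fmt.Errorf("jail_set: %w", err)
}

func main() {
	// In real code, buf would be the value half of the "errmsg" pair
	// passed to jail_set(2).
	buf := make([]byte, errmsgLen)

	// Simulated failure: this message text is invented for the demo.
	copy(buf, "simulated detail from the kernel")
	fmt.Println(wrapErr(syscall.EINVAL, buf))
}
```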

Booleans are a little weird

Perhaps it was naïve of me to expect that a boolean iovec would have a value portion of a single byte indicating true or false. For jail_set(2), boolean values are instead represented by prefixing “no” onto a portion of the parameter name. For example, the parameter “persist” is set false by using the name “nopersist” instead. Some parameters are dotted, like “allow.mount” and the “no” prefix is placed after the dot, like “allow.nomount”. jail.c (in libjail) has noname and nononame functions to deal with this conversion.
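The naming rule can be sketched in a few lines of Go. The function name negate is hypothetical, and placing the prefix after the last dot (rather than the first) is an assumption about how multi-level names are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// negate returns the "no"-prefixed form of a boolean parameter name.
// For dotted names the prefix goes after the final dot.
func negate(name string) string {
	if i := strings.LastIndexByte(name, '.'); i >= 0 {
		return name[:i+1] + "no" + name[i+1:]
	}
	return "no" + name
}

func main() {
	fmt.Println(negate("persist"))     // nopersist
	fmt.Println(negate("allow.mount")) // allow.nomount
}
```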

I don’t currently have a use-case for a negated boolean in runj, so “persist” is passed as a “nil” iovec:

func nilIovec(name string) ([]syscall.Iovec, error) {
	n, err := syscall.ByteSliceFromString(name)
	if err != nil {
		return nil, err
	}
	return makeIovec(n, nil, 0), nil
}

Type-based switching: not always recommended

Since an iovec is just a pointer to a region of memory, there is no type information in the iovec itself. It’s up to the userland program and kernel to ensure they are generating and interpreting the bytes the same way.

A pattern that I see in libjail (the library used by jail(8)) and elsewhere is type-based switching. This might be a table mapping known parameter names to types, or in a language like Go might be reflection over the fields of a struct. Both are used in similar ways: central or generic functions handle (de)serialization, separate from the code that ultimately needs to interact with the values.

I think there are situations where this can make a lot of sense. Parsing unknown files (like JSON, XML, etc) is complicated and a generic parsing library that reflects over an application-provided struct to populate it is really usable. But I do think those situations are limited, and it’s a good idea to prioritize clarity rather than DRY or abstractions that may not add a ton of value. Reflection in particular can be challenging to read and debug.

Go has no built-in enum type. The usual substitute is a defined type with a set of constants, and validating that a value belongs to the expected set is up to the programmer. Some of the jail parameters are enums, with values like “disable”, “new”, and “inherit”; these ultimately need to be passed to the kernel as 32-bit integers. Reflection-based code would need to handle these faux-enum types (perhaps with some enforced validation interface) directly, or special-case the string values.
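One way such a faux-enum might look is a named string type with a validating conversion method. This is a sketch with hypothetical names (JailSys, Int32), not code from runj. The values 1 and 2 match the “new” and “inherit” cases in the vnet code later in this post; 0 for “disable” is an assumption.

```go
package main

import "fmt"

// JailSys is a hypothetical named type for the three-valued parameters.
type JailSys string

const (
	JailSysDisable JailSys = "disable"
	JailSysNew     JailSys = "new"
	JailSysInherit JailSys = "inherit"
)

// Int32 validates the value and returns the integer passed to the kernel.
// The mapping for "disable" is assumed; see the note above.
func (j JailSys) Int32() (int32, error) {
	switch j {
	case JailSysDisable:
		return 0, nil
	case JailSysNew:
		return 1, nil
	case JailSysInherit:
		return 2, nil
	}
	return 0, fmt.Errorf("jail: unknown value %q", string(j))
}

func main() {
	for _, v := range []JailSys{"new", "inherit", "bogus"} {
		n, err := v.Int32()
		fmt.Println(v, n, err)
	}
}
```

Nothing stops a caller from writing JailSys("bogus"), which is why the conversion method has to return an error rather than assume its input is valid.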

I elected not to use reflection or type-switching in runj. Instead, I have per-type functions to generate an iovec, such as:

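func netIPIovec(name string, value []netip.Addr) ([]syscall.Iovec, error) {
	n, err := syscall.ByteSliceFromString(name)
	if err != nil {
		return nil, err
	}
	// Concatenate the raw bytes of each address (4 for IPv4, 16 for IPv6).
	bytes := make([]byte, 0)
	for _, addr := range value {
		bytes = append(bytes, addr.AsSlice()...)
	}
	// Taking &bytes[0] would panic on an empty slice, so fail explicitly.
	if len(bytes) == 0 {
		return nil, fmt.Errorf("jail: no addresses provided for %q", name)
	}
	return makeIovec(n, &bytes[0], len(bytes)), nil
}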

And each struct I want to serialize into an iovec has an explicit function to do so:

func (c *CreateParams) iovec() ([]syscall.Iovec, error) {
	iovec := make([]syscall.Iovec, 0)

	name, err := stringIovec("name", c.Name)
	if err != nil {
		return nil, err
	}
	iovec = append(iovec, name...)

	// ...

	if c.VNet != "" {
		var vnet int32
		switch c.VNet {
		case "new":
			vnet = 1
		case "inherit":
			vnet = 2
		default:
			return nil, fmt.Errorf("jail: unknown VNet type %q", c.VNet)
		}
		vnetio, err := int32Iovec("vnet", vnet)
		if err != nil {
			return nil, err
		}
		iovec = append(iovec, vnetio...)
	}
    
	// ...

I think this is clearer because it makes explicit exactly how each element is serialized. The number of structs I need to do this for is low (two), and the number of parameters is similarly low (the biggest one currently has six), so the overhead is small.

Conclusion

I’m really happy with the conversion to invoking jail(2) syscalls directly. This reduces the number of moving parts in setting up a jail and helps to more clearly delineate userland-vs-kernel responsibilities.

I think these are pretty much all the interesting things I’ve found about jail(2) so far. If you liked this, you might also be interested in working on runj with me.

Samuel Karp