runj is an OCI runtime for FreeBSD jails.
Until recently, it relied on the
jail(8) command
to actually set up and manage the jail in the FreeBSD kernel. Migrating to
invoking the jail(2) family of syscalls directly had been a to-do item on my
list for a long time.
This is now done, but I learned some new things along the way.
It’s simpler than I had thought
The core logic in runj I needed to implement consisted of jail_set(2),
jail_get(2), jail_attach(2), and jail_remove(2). These four handle
all of jail configuration (including creating the jail), visibility into jails,
attaching processes to jails, and removing jails. These four fall into two
styles: jail_attach(2) and jail_remove(2) only need
the jail’s ID number to operate, while jail_set(2) and jail_get(2) take an
interesting data structure called an iovec.
An iovec is a byte buffer with a defined length. The structure consists of a
pointer to a memory region and the size (in bytes) of the region. The kernel
and userland process can then both access this memory. It’s an input-output
vector. The
Go definition
of the struct looks like this:
    type Iovec struct {
        Base *byte
        Len  uint64
    }

This simple definition of a shared memory region allows the iovec to be used for both input to the syscall (parameters) and output from the syscall; the userland process must just provide that output space to the kernel.
For these jail-related syscalls, an array of iovec structs is taken as input,
and processed in pairs. Each pair consists of a first element which is a name
(char*) and a second which is the input parameter or output buffer (void*).
Effectively the userland program and kernel communicate by way of named
parameters rather than a defined struct.
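As a rough illustration of the name/value pairing, here is a minimal sketch in Go. The local `iovec` type and the `makePair` helper are hypothetical names invented for this example (they are not runj's actual code); the type simply mirrors the struct shown above so the sketch compiles on any platform.

```go
package main

import (
	"fmt"
	"syscall"
)

// iovec mirrors the shape of the Iovec struct shown above. It is declared
// locally (an assumption for this sketch) so the example is portable.
type iovec struct {
	Base *byte
	Len  uint64
}

// makePair returns two iovecs: the NUL-terminated parameter name followed by
// the value buffer. A nil or empty value yields an empty second element.
func makePair(name string, value []byte) ([]iovec, error) {
	n, err := syscall.ByteSliceFromString(name) // appends the trailing NUL
	if err != nil {
		return nil, err
	}
	pair := []iovec{{Base: &n[0], Len: uint64(len(n))}, {}}
	if len(value) > 0 {
		pair[1] = iovec{Base: &value[0], Len: uint64(len(value))}
	}
	return pair, nil
}

func main() {
	// String values are also passed NUL-terminated, like the name.
	path, err := syscall.ByteSliceFromString("/tmp/rootfs")
	if err != nil {
		panic(err)
	}
	params, err := makePair("path", path)
	if err != nil {
		panic(err)
	}
	// The flat slice would be handed to the syscall as an array of iovecs.
	fmt.Println(len(params), params[0].Len, params[1].Len)
}
```

Appending the result of several `makePair` calls to one slice gives the flat array that is processed in pairs.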
There are some clear benefits to this: the kernel can add support for new parameters without requiring any changes in userland code; the existing code just won’t pass those new parameters. And the amount of memory required for each syscall is limited to just the parameters that the userland program wants to pass. Similarly, the userland program can optionally pass buffers for output and the kernel can decide to populate them.
There are also some downsides: userland programs need to be careful about how they serialize the parameters into iovecs and ensure the bytes match what the kernel expects; this is easy in C but requires more care in a higher-level language like Go where the types do not necessarily match C types. And the names of the iovec parameters should be well-known, but I had trouble finding documentation of them, which leads me to…
Parameter names are obvious, until they aren’t
To a large extent, the parameter names used as input to jail_set(2) match the
names used in the
jail.conf(5) file
and jail(8) command. Some exceptions are also easy to intuit; for example
jail.conf(5) supports includes (with globs) and wildcards; those would not be
expected for the syscall. But some are less obvious: jail(8) has
pseudo-parameters which are not passed to the kernel but are instead
interpreted internal to jail(8). These include mount parameters for
mounting filesystems, exec parameters for lifecycle hooks, and parameters
that deal with network interfaces. These are documented in the manual page,
but it does require one to read the manual rather than just looking at example
jail.conf(5) files and guessing the parameter names to pass to jail_set(2).
There is one parameter name I have failed to locate any documentation for:
errmsg. This is used as buffer space for the kernel to pass detailed error
messages in addition to the standard errno error number. I was
made aware of this,
but had to locate
example use in the source code for jail(8)
rather than finding documentation.
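To make the errmsg idea concrete, here is a small sketch of the userland side: allocate a buffer, hand it to the kernel under the name errmsg, and read back whatever NUL-terminated text was written. The buffer size, the `decodeErrmsg` helper, and the sample message are all assumptions made for this example; the syscall itself is simulated with a plain `copy`.

```go
package main

import (
	"bytes"
	"fmt"
)

// errmsgLen is an assumed buffer size chosen for illustration.
const errmsgLen = 1024

// decodeErrmsg returns the text up to the first NUL byte, or the whole
// buffer if no NUL is present.
func decodeErrmsg(buf []byte) string {
	if i := bytes.IndexByte(buf, 0); i >= 0 {
		buf = buf[:i]
	}
	return string(buf)
}

func main() {
	buf := make([]byte, errmsgLen)
	// Stand-in for the kernel filling the buffer during a failed call;
	// the text here is made up for the example.
	copy(buf, "example error text")
	if msg := decodeErrmsg(buf); msg != "" {
		fmt.Println("jail_set failed:", msg)
	}
}
```

An all-zero buffer decodes to an empty string, which a caller can treat as "no detailed message, fall back to errno".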
Booleans are a little weird
Perhaps it was naïve of me to expect that a boolean iovec would have a value
portion of a single byte indicating true or false. For jail_set(2), boolean
values are instead represented by prefixing “no” onto a portion of the
parameter name. For example, the parameter “persist” is set false by using the
name “nopersist” instead. Some parameters are dotted, like “allow.mount” and
the “no” prefix is placed after the dot, like “allow.nomount”. jail.c
(in libjail) has
noname and nononame functions
to deal with this conversion.
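The naming rule described above can be sketched in a few lines of Go. The `negate` function is a hypothetical helper written for this example, not a port of the libjail functions, and it assumes the “no” prefix always goes after the last dot.

```go
package main

import (
	"fmt"
	"strings"
)

// negate returns the "false" form of a boolean parameter name by inserting
// "no" after the last dot, or at the start if the name has no dot.
func negate(name string) string {
	if i := strings.LastIndexByte(name, '.'); i >= 0 {
		return name[:i+1] + "no" + name[i+1:]
	}
	return "no" + name
}

func main() {
	fmt.Println(negate("persist"))     // nopersist
	fmt.Println(negate("allow.mount")) // allow.nomount
}
```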
I don’t currently have a use-case for a negated boolean in runj, so “persist” is passed as a “nil” iovec:
    func nilIovec(name string) ([]syscall.Iovec, error) {
        n, err := syscall.ByteSliceFromString(name)
        if err != nil {
            return nil, err
        }
        return makeIovec(n, nil, 0), nil
    }

Type-based switching: not always recommended
Since an iovec is just a pointer to a region of memory, there is no type information in the iovec itself. It’s up to the userland program and kernel to ensure they are generating and interpreting the bytes the same way.
A pattern that I see in libjail (the library used by jail(8)) and in others
is
type-based switching.
This might be a table mapping known parameter names to type, or in a language
like Go might be reflection over the types in a struct. These are both used in
similar ways: central or generic functions to handle (de)serialization,
separate from the code that is ultimately needing to interact with the values.
I think there are situations where this can make a lot of sense. Parsing unknown files (like JSON, XML, etc) is complicated and a generic parsing library that reflects over an application-provided struct to populate it is really usable. But I do think those situations are limited, and it’s a good idea to prioritize clarity rather than DRY or abstractions that may not add a ton of value. Reflection in particular can be challenging to read and debug.
Go has no enums. There are only defined types with constants, and validating that a value belongs to its expected set is up to the programmer. Some of the jail parameters are enums, with values like “disable”, “new”, and “inherit”; these ultimately need to be passed to the kernel as 32-bit integers. Reflection-based code would need to handle these faux-enum types directly (perhaps with some enforced validation interface), or special-case the string values.
I elected not to use reflection or type-switching in runj. Instead, I have per-type functions to generate an iovec, such as:
    func netIPIovec(name string, value []netip.Addr) ([]syscall.Iovec, error) {
        n, err := syscall.ByteSliceFromString(name)
        if err != nil {
            return nil, err
        }
        // Concatenate the raw bytes of each address into one value buffer.
        bytes := make([]byte, 0)
        for _, addr := range value {
            bytes = append(bytes, addr.AsSlice()...)
        }
        // Assumes value is non-empty; &bytes[0] panics on an empty slice.
        return makeIovec(n, &bytes[0], len(bytes)), nil
    }

And each struct I want to serialize into an iovec has an explicit function to do so:
    func (c *CreateParams) iovec() ([]syscall.Iovec, error) {
        iovec := make([]syscall.Iovec, 0)
        name, err := stringIovec("name", c.Name)
        if err != nil {
            return nil, err
        }
        iovec = append(iovec, name...)
        // ...
        if c.VNet != "" {
            var vnet int32
            switch c.VNet {
            case "new":
                vnet = 1
            case "inherit":
                vnet = 2
            default:
                return nil, fmt.Errorf("jail: unknown VNet type %q", c.VNet)
            }
            vnetio, err := int32Iovec("vnet", vnet)
            if err != nil {
                return nil, err
            }
            iovec = append(iovec, vnetio...)
        }
        // ...
I think this is clearer because it is explicit about exactly how each element is serialized. The overhead is also low: I only need to do this for two structs, and the biggest one currently has six parameters.
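The int32Iovec helper used above is not shown, so as a rough sketch of what the value side of such a helper might involve, the following turns a string enum into its 32-bit integer and then into raw bytes. Both `vnetValue` and `encodeInt32` are hypothetical names for this example, and the little-endian byte order is an assumption that holds on common FreeBSD targets such as amd64 and arm64 but not on every architecture.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// vnetValue maps the string form of the parameter to its integer value,
// mirroring the switch statement in the function above.
func vnetValue(s string) (int32, error) {
	switch s {
	case "new":
		return 1, nil
	case "inherit":
		return 2, nil
	default:
		return 0, fmt.Errorf("jail: unknown VNet type %q", s)
	}
}

// encodeInt32 returns the four bytes of v in little-endian order
// (an assumption about the host; see the note above).
func encodeInt32(v int32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, uint32(v))
	return buf
}

func main() {
	v, err := vnetValue("inherit")
	if err != nil {
		panic(err)
	}
	// These bytes would become the value half of the name/value pair.
	fmt.Println(encodeInt32(v))
}
```

Keeping the string-to-integer mapping and the byte encoding as two small, explicit steps is in the same spirit as the per-type functions: each conversion can be read and tested on its own.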
Conclusion
I’m really happy with the conversion to invoking jail(2) syscalls directly.
This reduces the number of moving parts in setting up a jail and helps to
more-clearly delineate userland-vs-kernel responsibilities.
I think these are pretty much all the interesting things I’ve found about
jail(2) so far. If you liked this, you might also be interested in
working on runj with me.