One of the really nice things about Docker containers is that the defaults mostly just work. One of those defaults is networking; docker run gives you a perfectly serviceable network experience with containers able to access the Internet, access each other, and expose services.
runj is a much lower-level tool than Docker, so that sort of out-of-the-box network experience wouldn’t be something runj would directly provide. However, I recently added support to runj for some of the pieces that make a networking experience like that possible. Higher-level tools that use runj, like nerdctl, might use these pieces in the future.
Docker’s default networking model on Linux consists of a separate network namespace for each container, a network interface inside the network namespace (a veth device), a bridge, and iptables rules to provide network address translation (NAT). The veth provides communication across the boundary of a network namespace and is assigned an IP address (Docker uses a subset of the 172.16.0.0/12 RFC 1918 private range). The bridge joins devices together, similar to a network switch, and serves as a gateway for the container subnet. NAT allows packets with private source addresses to be rewritten so that they appear to come from the host. This setup ends up being functionally similar to a typical consumer home network; the difference is that it connects the containers on a single computer to the outside network rather than connecting multiple computers in a home to the Internet.
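The address range mentioned above lends itself to a quick check. As a rough sketch (the function names ip_to_int and in_cidr are made up for this example, and nothing here is part of Docker or runj), these POSIX shell functions test whether an IPv4 address falls inside a CIDR block. It shows why an address like 172.17.0.2 belongs both to the 172.16.0.0/12 private range and to a 172.17.0.0/16 subnet carved out of it:

```sh
#!/bin/sh
# ip_to_int ADDRESS: print a dotted-quad IPv4 address as a 32-bit integer.
# (Hypothetical helper name; assumes octets without leading zeros.)
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr ADDRESS NETWORK/PREFIX: succeed if ADDRESS is inside the block.
in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    # Build the netmask: the top $prefix bits set, the rest cleared.
    mask=$(( (0xffffffff << (32 - $prefix)) & 0xffffffff ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

in_cidr 172.17.0.2 172.16.0.0/12 && echo "172.17.0.2 is in 172.16.0.0/12"
in_cidr 172.17.0.2 172.17.0.0/16 && echo "172.17.0.2 is in 172.17.0.0/16"
```

The comparison masks both the address and the network, so the check still works when the block is written with host bits set, as in 172.17.0.1/16.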
FreeBSD has equivalent networking capabilities backed by somewhat different implementations. The FreeBSD vnet is similar to a Linux network namespace and can provide network isolation for a jail. The FreeBSD epair is similar to a Linux veth and can pass traffic across a vnet boundary. FreeBSD also has a bridge device, similar to a Linux bridge. And FreeBSD’s PF firewall can be used in place of Linux’s iptables to provide NAT.
So let’s put that together. Here are the (somewhat) low-level and manual steps for setting this up on FreeBSD with runj and containerd.
- Decide on a subnet to use. I’m using 172.17.0.0/16 for this example (with 172.17.0.1/16 as the bridge address) since that’s what Docker uses.
- Enable PF by writing pf_enable="YES" into /etc/rc.conf, make a config file for PF at /etc/pf.conf, and then start it immediately with service pf start.

  The config file should define a NAT rule allowing traffic for addresses on a specific table. You can pick whatever name you want for the table, but I’ll use jail-nat here. The rule also needs to reference the interface where traffic should be forwarded; I’ll use the primary/default interface on my box, which is em0.

      sam@freebsd:~$ sudo sysrc pf_enable=YES
      sam@freebsd:~$ echo 'nat on em0 inet from <jail-nat> to any -> (em0)' | sudo tee -a /etc/pf.conf
      sam@freebsd:~$ sudo service pf start
- Enable packet forwarding

      sam@freebsd:~$ sudo sysctl net.inet.ip.forwarding=1
- Set up an epair to use for your jail

      sam@freebsd:~$ sudo ifconfig epair create
      epair0a

  ifconfig outputs the name of one end of the epair. The other end has the same name, except b at the end instead of a. In this example it would be epair0b.

- Set up the bridge

      sam@freebsd:~$ sudo ifconfig bridge create
      bridge0
      sam@freebsd:~$ sudo ifconfig bridge0 inet 172.17.0.1/16
      sam@freebsd:~$ sudo ifconfig bridge0 addm epair0a
      sam@freebsd:~$ sudo ifconfig epair0a up

  ifconfig outputs the name of the bridge: bridge0. We can then assign the subnet we picked (172.17.0.1/16) to the bridge, and add the a end of our epair to the bridge.

  After doing this, the output looks something like this:

      sam@freebsd:~$ ifconfig
      em0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=481009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER,NOMAP>
              ether 08:00:27:f3:cd:05
              inet6 fe80::a00:27ff:fef3:cd05%em0 prefixlen 64 scopeid 0x1
              inet 10.0.2.15 netmask 0xffffff00 broadcast 10.0.2.255
              media: Ethernet autoselect (1000baseT <full-duplex>)
              status: active
              nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
      lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
              options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
              inet6 ::1 prefixlen 128
              inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
              inet 127.0.0.1 netmask 0xff000000
              groups: lo
              nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
      epair0a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=8<VLAN_MTU>
              ether 02:7b:e0:b2:dc:0a
              groups: epair
              media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
              status: active
              nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      epair0b: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=8<VLAN_MTU>
              ether 02:7b:e0:b2:dc:0b
              groups: epair
              media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
              status: active
              nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              ether 58:9c:fc:00:12:0a
              inet 172.17.0.1 netmask 0xffff0000 broadcast 172.17.255.255
              id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
              maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
              root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
              member: epair0a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                      ifmaxaddr 0 port 3 priority 128 path cost 2000
              groups: bridge
              nd6 options=9<PERFORMNUD,IFDISABLED>
- Pick an IP address for your jail and add that to the PF table referenced in the NAT rule. I picked 172.17.0.2 here.

      sam@freebsd:~$ sudo pfctl -t jail-nat -T add 172.17.0.2
      1 table created.
      1/1 addresses added.
- Make a runj.ext.json file instructing runj to move the b side of the epair into the jail

      sam@freebsd:~$ cat <<EOF >runj.ext.json
      > {"network":{"vnet":{"mode":"new","interfaces":["epair0b"]}}}
      > EOF
      {"network":{"vnet":{"mode":"new","interfaces":["epair0b"]}}}

  The runj.ext.json file acts as a pass-through for experimental FreeBSD-specific fields that I’m hoping to eventually upstream into the OCI runtime spec. We can use it here to specify both that we want to create a new vnet ("mode":"new") and to pass the specific interface into the jail’s vnet ("interfaces":["epair0b"]).

- Run a jail with containerd, passing the runj.ext.json

      sam@freebsd:~$ sudo ctr run \
          --runtime wtf.sbk.runj.v1 \
          --snapshotter zfs \
          --rm \
          --tty \
          --runtime-config-path $(pwd)/runj.ext.json \
          public.ecr.aws/samuelkarp/freebsd:13.1-RELEASE \
          my-container \
          sh

  This runs the container interactively, so you should see a shell prompt (typically # since you’re root by default inside the jail).

- Now inside the jail, look at the interfaces and configure epair0b

      # ifconfig
      lo0: flags=8008<LOOPBACK,MULTICAST> metric 0 mtu 16384
              options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
              groups: lo
              nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
      epair0b: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=8<VLAN_MTU>
              ether 02:15:7d:ce:b0:0b
              groups: epair
              media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
              status: active
              nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

  We can see the unconfigured epair0b interface has been passed in! If we look outside the jail (in our normal shell), we can see the interface has disappeared.

      sam@freebsd:~$ ifconfig
      em0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=481009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER,NOMAP>
              ether 08:00:27:f3:cd:05
              inet6 fe80::a00:27ff:fef3:cd05%em0 prefixlen 64 scopeid 0x1
              inet 10.0.2.15 netmask 0xffffff00 broadcast 10.0.2.255
              media: Ethernet autoselect (1000baseT <full-duplex>)
              status: active
              nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
      lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
              options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
              inet6 ::1 prefixlen 128
              inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
              inet 127.0.0.1 netmask 0xff000000
              groups: lo
              nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
      epair0a: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=8<VLAN_MTU>
              ether 02:15:7d:ce:b0:0a
              groups: epair
              media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
              status: active
              nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
              ether 58:9c:fc:00:12:0a
              inet 172.17.0.1 netmask 0xffff0000 broadcast 172.17.255.255
              id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
              maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
              root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
              member: em0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                      ifmaxaddr 0 port 1 priority 128 path cost 20000
              groups: bridge
              nd6 options=9<PERFORMNUD,IFDISABLED>

  Now, back inside the jail, we can configure the interface to use the 172.17.0.2 address we picked earlier. Make sure to pass the same mask (/16) used for the bridge.

      # ifconfig epair0b inet 172.17.0.2/16
- Inside the jail again, set up routing and a nameserver

      # route -4 add default 172.17.0.1
      add net default: gateway 172.17.0.1
      # echo 'nameserver 8.8.8.8' > /etc/resolv.conf

  With a route and nameserver set, we can reach the Internet!

      # ping -4 -c4 google.com
      PING google.com (142.251.33.110): 56 data bytes
      64 bytes from 142.251.33.110: icmp_seq=0 ttl=63 time=8.411 ms
      64 bytes from 142.251.33.110: icmp_seq=1 ttl=63 time=9.766 ms
      64 bytes from 142.251.33.110: icmp_seq=2 ttl=63 time=8.903 ms
      64 bytes from 142.251.33.110: icmp_seq=3 ttl=63 time=6.969 ms
      --- google.com ping statistics ---
      4 packets transmitted, 4 packets received, 0.0% packet loss
      round-trip min/avg/max/stddev = 6.969/8.513/9.766/1.014 ms
🎉
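A few of the steps above come down to small bits of string handling that are easy to get wrong when scripting them: building the NAT rule, deriving the b end of an epair from the name ifconfig prints, and writing runj.ext.json. Here is a minimal sketch of those three pieces as shell functions. The function names are hypothetical (they are not part of runj, PF, or ifconfig); the strings they produce are the ones used in the steps above.

```sh
#!/bin/sh
# pf_nat_rule IFACE TABLE: print a PF NAT rule that translates traffic from
# addresses in TABLE as it leaves through IFACE.
pf_nat_rule() {
    printf 'nat on %s inet from <%s> to any -> (%s)\n' "$1" "$2" "$1"
}

# epair_peer NAME: given the "a" end printed by `ifconfig epair create`
# (for example epair0a), print the matching "b" end (epair0b).
epair_peer() {
    case $1 in
        epair*a) printf '%s\n' "${1%a}b" ;;
        *) echo "not the a end of an epair: $1" >&2; return 1 ;;
    esac
}

# runj_ext_json IFACE: print a runj.ext.json document that asks runj to
# create a new vnet and move IFACE into it.
runj_ext_json() {
    printf '{"network":{"vnet":{"mode":"new","interfaces":["%s"]}}}\n' "$1"
}

pf_nat_rule em0 jail-nat
runj_ext_json "$(epair_peer epair0a)"
```

Redirecting the output of runj_ext_json into a file gives the same runj.ext.json used in the walkthrough, with the interface name filled in from whatever ifconfig actually returned.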
I want to say a special thanks to Doug Rabson, who helped me troubleshoot what I was doing wrong while testing this. Doug has also put together a port of the reference CNI plugins to do these steps automatically.
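For experimenting by hand, the host-side steps from this post can also be collected into a small script. This is only a sketch under a few assumptions: PF is already running with the NAT rule from earlier, and the variable names (BRIDGE_ADDR, JAIL_ADDR, NAT_TABLE, DRY_RUN) are knobs invented for this example rather than anything runj or containerd reads. By default it only prints the commands it would run.

```sh
#!/bin/sh
# Host-side setup from the walkthrough, collected into one function.
BRIDGE_ADDR=${BRIDGE_ADDR:-172.17.0.1/16}
JAIL_ADDR=${JAIL_ADDR:-172.17.0.2}
NAT_TABLE=${NAT_TABLE:-jail-nat}
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands; 0: run them (needs root)

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# create KIND FALLBACK: create a cloned interface and print its name.
# In dry-run mode nothing is created and FALLBACK is printed instead.
create() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$2"
    else
        ifconfig "$1" create
    fi
}

host_setup() {
    run sysctl net.inet.ip.forwarding=1
    epair_a=$(create epair epair0a)
    bridge=$(create bridge bridge0)
    run ifconfig "$bridge" inet "$BRIDGE_ADDR"
    run ifconfig "$bridge" addm "$epair_a"
    run ifconfig "$epair_a" up
    run pfctl -t "$NAT_TABLE" -T add "$JAIL_ADDR"
    # The b end of the epair is the interface to list in runj.ext.json.
    echo "# pass ${epair_a%a}b to the jail"
}

host_setup
```

Running it with DRY_RUN=0 as root would execute the same ifconfig, sysctl, and pfctl commands shown in the steps; the steps inside the jail (address, default route, resolv.conf) still have to be done from the jail's shell.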