One of the really nice things about Docker containers is that the defaults mostly just work. One of those defaults is networking; docker run gives you a perfectly serviceable network experience with containers able to access the Internet, access each other, and expose services.

runj is a much lower-level tool than Docker, so that sort of out-of-the-box network experience isn’t something runj provides directly. However, I recently added support to runj for some of the pieces that make a networking experience like that possible. Higher-level tools that use runj, like nerdctl, might use these pieces in the future.

Docker’s default networking model on Linux consists of a separate network namespace for each container, a network interface inside that namespace (one end of a veth pair), a bridge, and iptables rules to provide network address translation (NAT). The veth pair carries traffic across the network namespace boundary, and the end inside the container is assigned an IP address (Docker uses a subset of the 172.16.0.0/12 RFC 1918 private range). The bridge joins devices together much like a network switch and serves as the gateway for the container subnet. NAT rewrites packets with private source addresses so that they appear to come from the host. This setup ends up being functionally similar to a typical consumer home network, except that it connects the containers on a single computer to the outside network rather than connecting multiple computers in a home to the Internet.
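
To make the address ranges concrete, here is a minimal sketch in POSIX shell that checks whether an address falls inside a given network. The function names `ip_to_int` and `in_cidr` are made up for this illustration, and the sketch assumes a shell with 64-bit arithmetic (which is typical).

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside network $2 with prefix length $3.
in_cidr() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# A container address on a 172.17.0.0/16 network sits inside the RFC 1918 range.
in_cidr 172.17.0.2 172.16.0.0 12 && echo "172.17.0.2 is inside 172.16.0.0/12"
in_cidr 172.17.0.2 172.17.0.0 16 && echo "172.17.0.2 is inside 172.17.0.0/16"
```

The same check is handy later when picking a jail address: it has to fall inside whatever subnet the bridge is configured with.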

FreeBSD has equivalent networking capabilities backed by somewhat different implementations. The FreeBSD vnet is similar to a Linux network namespace and can provide network isolation for a jail. The FreeBSD epair is similar to a Linux veth and can pass traffic across a vnet boundary. FreeBSD also has a bridge device, similar to a Linux bridge. And FreeBSD’s PF firewall can be used in place of Linux’s iptables to provide NAT.

So let’s put that together. Here are the (somewhat) low-level and manual steps to set this up on FreeBSD with runj and containerd.

  1. Decide on a subnet to use. I’m using 172.17.0.0/16 for this example since that’s what Docker uses by default; the bridge will take 172.17.0.1 as its address on that subnet.

  2. Enable PF by writing pf_enable="YES" into /etc/rc.conf, make a config file for PF at /etc/pf.conf and then start it immediately with service pf start.

    The config file should define a NAT rule that applies to traffic from the addresses in a specific table. You can pick whatever name you want for the table, but I’ll use jail-nat here. The rule also needs to reference the interface that outbound traffic should leave through; I’ll use the primary/default interface on my box, which is em0.

    sam@freebsd:~$ sudo sysrc pf_enable=YES
    sam@freebsd:~$ echo 'nat on em0 inet from <jail-nat> to any -> (em0)' | sudo tee -a /etc/pf.conf
    sam@freebsd:~$ sudo service pf start
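
Because `tee -a` appends unconditionally, re-running the command above adds the same rule to /etc/pf.conf a second time. A minimal sketch of an append that is safe to repeat is below; `append_once` is a hypothetical helper name, and the demonstration writes to a scratch file rather than the real /etc/pf.conf.

```shell
#!/bin/sh
# Append LINE to FILE only if that exact line is not already present.
# Usage: append_once FILE LINE
append_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

rule='nat on em0 inet from <jail-nat> to any -> (em0)'
scratch=$(mktemp)
append_once "$scratch" "$rule"
append_once "$scratch" "$rule"   # second call finds the rule and does nothing
cat "$scratch"
rm -f "$scratch"
```

Writing to the real /etc/pf.conf would need root, so on an actual host the function would have to run in a root shell.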
    
  3. Enable packet forwarding

    sam@freebsd:~$ sudo sysctl net.inet.ip.forwarding=1

  4. Set up an epair to use for your jail

    sam@freebsd:~$ sudo ifconfig epair create
    epair0a
    ifconfig outputs the name of one end of the epair. The other end has the same name, except b at the end instead of a. In this example it would be epair0b.
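
If you are scripting this, the name of the b end can be derived from whatever `ifconfig epair create` printed. Here is a small sketch; `epair_peer` is a made-up helper name.

```shell
#!/bin/sh
# Given the "a" end of an epair (for example epair0a), print the "b" end.
epair_peer() {
    case $1 in
        epair*a) printf '%s\n' "${1%a}b" ;;   # strip the trailing a, append b
        *) echo "not the a end of an epair: $1" >&2; return 1 ;;
    esac
}

epair_peer epair0a
```

On a FreeBSD host this could be combined with the real command, along the lines of `a=$(sudo ifconfig epair create); b=$(epair_peer "$a")`.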

  5. Set up the bridge

    sam@freebsd:~$ sudo ifconfig bridge create
    bridge0
    sam@freebsd:~$ sudo ifconfig bridge0 inet 172.17.0.1/16
    sam@freebsd:~$ sudo ifconfig bridge0 addm epair0a
    sam@freebsd:~$ sudo ifconfig epair0a up
    ifconfig outputs the name of the bridge: bridge0. We can then assign an address from the subnet we picked (172.17.0.1/16) to the bridge, add the a end of our epair to the bridge, and bring that end up.

    After doing this, the output looks something like this:

    sam@freebsd:~$ ifconfig
    em0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
          options=481009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER,NOMAP>
          ether 08:00:27:f3:cd:05
          inet6 fe80::a00:27ff:fef3:cd05%em0 prefixlen 64 scopeid 0x1
          inet 10.0.2.15 netmask 0xffffff00 broadcast 10.0.2.255
          media: Ethernet autoselect (1000baseT <full-duplex>)
          status: active
          nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
          options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
          inet6 ::1 prefixlen 128
          inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
          inet 127.0.0.1 netmask 0xff000000
          groups: lo
          nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    epair0a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
          options=8<VLAN_MTU>
          ether 02:7b:e0:b2:dc:0a
          groups: epair
          media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
          status: active
          nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    epair0b: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
          options=8<VLAN_MTU>
          ether 02:7b:e0:b2:dc:0b
          groups: epair
          media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
          status: active
          nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
          ether 58:9c:fc:00:12:0a
          inet 172.17.0.1 netmask 0xffff0000 broadcast 172.17.255.255
          id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
          maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
          root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
          member: epair0a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                   ifmaxaddr 0 port 3 priority 128 path cost 2000
          groups: bridge
          nd6 options=9<PERFORMNUD,IFDISABLED>
    

  6. Pick an IP address for your jail and add that to the PF table referenced in the NAT rule. I picked 172.17.0.2 here.

    sam@freebsd:~$ sudo pfctl -t jail-nat -T add 172.17.0.2
    1 table created.
    1/1 addresses added.
    

  7. Make a runj.ext.json file instructing runj to move the b side of the epair into the jail

    sam@freebsd:~$ cat <<EOF >runj.ext.json
    > {"network":{"vnet":{"mode":"new","interfaces":["epair0b"]}}}
    > EOF
    
    The runj.ext.json file acts as a pass-through for experimental FreeBSD-specific fields that I’m hoping to eventually upstream into the OCI runtime spec. We can use it here both to request a new vnet ("mode":"new") and to move a specific interface into the jail’s vnet ("interfaces":["epair0b"]).
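
When the epair name is not known in advance, the same document can be generated rather than typed by hand. This is only a sketch: `runj_ext_json` is a hypothetical helper, and it covers just the two fields used in this post.

```shell
#!/bin/sh
# Print a runj.ext.json document requesting a new vnet with one interface.
# The interface name is interpolated as-is, so it must not contain characters
# that need JSON escaping (epair names such as epair0b do not).
runj_ext_json() {
    printf '{"network":{"vnet":{"mode":"new","interfaces":["%s"]}}}\n' "$1"
}

runj_ext_json epair0b
```

Redirecting the output (`runj_ext_json epair0b > runj.ext.json`) produces the same file as the heredoc above.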

  8. Run a jail with containerd, passing the runj.ext.json

    sam@freebsd:~$ sudo ctr run \
      --runtime wtf.sbk.runj.v1 \
      --snapshotter zfs \
      --rm \
      --tty \
      --runtime-config-path $(pwd)/runj.ext.json \
      public.ecr.aws/samuelkarp/freebsd:13.1-RELEASE \
      my-container \
      sh
    
    This runs the container interactively, so you should see a shell prompt (typically # since you’re root by default inside the jail).

  9. Now inside the jail, look at the interfaces and configure epair0b

    # ifconfig
    lo0: flags=8008<LOOPBACK,MULTICAST> metric 0 mtu 16384
         options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
         groups: lo
         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    epair0b: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
         options=8<VLAN_MTU>
         ether 02:15:7d:ce:b0:0b
         groups: epair
         media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
         status: active
         nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    
    We can see the unconfigured epair0b interface has been passed in! If we look outside the jail (in our normal shell), we can see that epair0b has disappeared from the host:
    sam@freebsd:~$ ifconfig
    em0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=481009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER,NOMAP>
            ether 08:00:27:f3:cd:05
            inet6 fe80::a00:27ff:fef3:cd05%em0 prefixlen 64 scopeid 0x1
            inet 10.0.2.15 netmask 0xffffff00 broadcast 10.0.2.255
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
            nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
            options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
            inet6 ::1 prefixlen 128
            inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
            inet 127.0.0.1 netmask 0xff000000
            groups: lo
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    epair0a: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=8<VLAN_MTU>
            ether 02:15:7d:ce:b0:0a
            groups: epair
            media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
            status: active
            nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            ether 58:9c:fc:00:12:0a
            inet 172.17.0.1 netmask 0xffff0000 broadcast 172.17.255.255
            id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
            maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
            root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
            member: em0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                    ifmaxaddr 0 port 1 priority 128 path cost 20000
            groups: bridge
            nd6 options=9<PERFORMNUD,IFDISABLED>
    
    Now, back inside the jail, we can configure the interface to use the 172.17.0.2 address we picked earlier. Make sure to pass the same mask (/16) used for the bridge.
    # ifconfig epair0b inet 172.17.0.2/16
    

  10. Inside the jail again, set up routing and a nameserver

    # route -4 add default 172.17.0.1
    add net default: gateway 172.17.0.1
    # echo 'nameserver 8.8.8.8' > /etc/resolv.conf
    
    With a route and nameserver set, we can reach the Internet!
    # ping -4 -c4 google.com
    PING google.com (142.251.33.110): 56 data bytes
    64 bytes from 142.251.33.110: icmp_seq=0 ttl=63 time=8.411 ms
    64 bytes from 142.251.33.110: icmp_seq=1 ttl=63 time=9.766 ms
    64 bytes from 142.251.33.110: icmp_seq=2 ttl=63 time=8.903 ms
    64 bytes from 142.251.33.110: icmp_seq=3 ttl=63 time=6.969 ms
    
    --- google.com ping statistics ---
    4 packets transmitted, 4 packets received, 0.0% packet loss
    round-trip min/avg/max/stddev = 6.969/8.513/9.766/1.014 ms
    

    🎉
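
The host-side commands from steps 3 through 6 can be collected into one script. The sketch below only prints the commands unless DRY_RUN=0 is set, and it assumes PF is already configured and running as in step 2. The variable and function names are illustrative and not part of runj, and in dry-run mode the epair and bridge names are assumed to be the ones from this walkthrough.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}                       # 1: print commands, 0: run them (as root)
BRIDGE_ADDR=${BRIDGE_ADDR:-172.17.0.1/16}   # address assigned to the bridge
JAIL_ADDR=${JAIL_ADDR:-172.17.0.2}          # address the jail will use
TABLE=${TABLE:-jail-nat}                    # PF table referenced by the NAT rule

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

host_setup() {
    run sysctl net.inet.ip.forwarding=1
    if [ "$DRY_RUN" = 1 ]; then
        # Interface names are assigned at creation time; assume the
        # names seen earlier in this post.
        echo "ifconfig epair create";  epair_a=epair0a
        echo "ifconfig bridge create"; bridge=bridge0
    else
        epair_a=$(ifconfig epair create)
        bridge=$(ifconfig bridge create)
    fi
    run ifconfig "$bridge" inet "$BRIDGE_ADDR"
    run ifconfig "$bridge" addm "$epair_a"
    run ifconfig "$epair_a" up
    run pfctl -t "$TABLE" -T add "$JAIL_ADDR"
}

host_setup
```

Printing first makes it easy to compare the commands against the manual steps before running anything as root. The jail-side configuration (address, default route, resolv.conf) still happens inside the jail as in steps 9 and 10.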

I want to say a special thanks to Doug Rabson, who helped me troubleshoot what I was doing wrong while testing this. Doug has also put together a port of the reference CNI plugins to do these steps automatically.