One of my colleagues recently asked me whether there were any best-practice guides for designing and testing software daemons (background processes). I didn’t know of any before writing this blog post (and most of what I found while researching was about the mechanics of writing daemons), but we came up with a few ideas together, and maybe this can serve as a starting point for others.
Note: all of this is from a Linux perspective. I haven’t written any daemons for macOS, and some of your concerns might be different with Windows Services.
Restarts and supervision
A daemon process runs in the background and is generally expected to stay running until it is explicitly stopped. Sometimes software might stop unexpectedly (e.g., crash) and you might want to think about what happens after:
- Does your daemon recover from a process restart?
- Can your daemon restore its own state?
- Is there a difference between a crash and an intentional restart? What about a reboot?
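One way to make state survive a crash is to write it atomically, so a restart never finds a half-written file. Here is a minimal sketch, assuming the state fits in a small JSON document; the function names and the file layout are hypothetical, not taken from any particular daemon.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state to a temp file, then rename it over the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_state(path, default):
    """Return the last saved state, or `default` on a first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Because the rename either happens or doesn’t, a crash in the middle of `save_state` leaves the previous state intact, and the daemon can call `load_state` the same way after a crash, an intentional restart, or a reboot.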
Process supervisors can help with some of the mechanics of running daemons. Supervisors are separate programs (usually either daemons themselves or integrated into an init system) responsible for monitoring the state of daemons and potentially defining restart policies around them. Some common supervisors include supervisord and systemd, though there are many others. In a container-based system, you’ll frequently see the role of a supervisor performed by a container orchestrator like Amazon ECS or Kubernetes.
Many process supervisors allow you to define policy around restarts. Some good things to think about in defining that policy include:
- What happens when your daemon fails to start?
- Does your daemon have a mechanism to signal to a supervisor that there is a terminal failure?
- How long does your daemon take to start up?
- Do you have a mechanism to indicate that the daemon is “healthy” or ready to receive traffic, process requests, or otherwise do work?
- How frequently should the supervisor attempt to restart your daemon? How many attempts before giving up?
- Can your supervisor distinguish between a normal exit and an abnormal exit?
- Does a SIGTERM to the daemon process cause a supervisor to restart the daemon unexpectedly?
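To make the normal-versus-abnormal distinction concrete, here is a small sketch of a main loop that treats SIGTERM as a request to stop and maps the outcome to an exit code. The names (`run`, `request_stop`, `do_work`) are hypothetical, and the assumption is that your supervisor treats exit code 0 as a clean stop and a non-zero code as a failure.

```python
import signal
import sys
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: record the request and let the main loop finish."""
    stop_requested.set()


def install_signal_handlers():
    """Call once from the main thread before entering the loop."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run(do_work, poll_interval=1.0):
    """Call do_work() repeatedly until asked to stop; return an exit code."""
    while not stop_requested.is_set():
        try:
            do_work()
        except Exception as exc:
            print(f"fatal: {exc}", file=sys.stderr)
            return 1  # abnormal exit: a supervisor may want to restart us
        stop_requested.wait(poll_interval)  # wakes early when a stop arrives
    return 0  # normal exit: we were asked to stop


# Typical use (do_work is whatever one unit of your daemon's work is):
#   install_signal_handlers()
#   sys.exit(run(do_work))
```

With a policy that only restarts on failure (systemd’s `Restart=on-failure` is one example), a daemon written this way should stay stopped after a SIGTERM and come back after a crash, but that is worth testing against your actual supervisor.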
Monitoring and logging
Since daemons are typically background processes that run without interactivity, it can be a challenge to know what’s going on with the daemon. Is it running? Is it receiving requests? Many daemons will emit logs, where they record information about their activity. Some daemons emit logs to files directly, while others leverage log facilities like syslog or simply write to stdout and expect another process to be reading from that descriptor. Here are some questions to get started:
- What does your daemon log? Does it log request information, activities it performs, or only errors when things go wrong?
- Who is the audience of your log? Are you looking to provide information for a desktop user, an operator, a developer integrating with your daemon, or a developer working on your daemon?
- If you log directly to files, do you perform rotation? How much data do you keep and for how long?
- If you use a network-based log facility, what happens when the network becomes unavailable? Do you buffer logs? If so, for how long?
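As an illustration of the file-versus-stdout choice and of rotation, here is a sketch using Python’s standard `logging` module. The function name, the size limit, and the number of backups are arbitrary assumptions; pick values that match how much data you actually want to keep.

```python
import logging
import logging.handlers
import sys


def build_logger(name, log_path=None, max_bytes=1_000_000, backups=5):
    """Log to a size-rotated file, or to stdout when no path is given."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on reconfiguration
    if log_path is None:
        # Let whoever reads our stdout (a supervisor, a container runtime)
        # decide about storage and rotation.
        handler = logging.StreamHandler(sys.stdout)
    else:
        # Keep at most `backups` old files of roughly `max_bytes` each.
        handler = logging.handlers.RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

With these defaults the daemon keeps roughly 6 MB of logs on disk (the active file plus five backups), which answers the “how much data do you keep” question by construction.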
Monitoring your daemon is also generally useful. A daemon often knows things about its own workload that are worth monitoring, and there is also process-level, (sometimes) language-level, and system-level data that’s useful to know. If the daemon runs under a supervisor, that supervisor might have useful information too. A variety of mechanisms exist for both exposing and exporting this data, including tooling like Prometheus. Let’s do the question thing again:
- Does your daemon expose metrics about its own workload (e.g., how many requests/messages/work items it processes, how many errors it encounters, etc)?
- Does your daemon use a garbage-collected language or runtime? Do you know what the heap size looks like, the impact of GC on latency, etc?
- Do you know the system resource consumption for your daemon (e.g., memory, CPU, disk I/O, network bandwidth)? What about steady-state versus under load?
- What happens to your daemon when system resources are exhausted?
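Here is a rough sketch of what exposing both workload-level and process-level numbers could look like using only the standard library. The `Metrics` class and the metric names are made up for illustration; a real deployment would more likely use a metrics client library, which I’m deliberately not assuming here.

```python
import resource
import threading
from collections import Counter


class Metrics:
    """Thread-safe counters plus a few process-level readings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = Counter()

    def incr(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def snapshot(self):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        with self._lock:
            data = dict(self._counters)
        # CPU time consumed so far (user + system), in seconds.
        data["process_cpu_seconds"] = usage.ru_utime + usage.ru_stime
        # Peak resident set size; reported in kilobytes on Linux.
        data["process_max_rss_kilobytes"] = usage.ru_maxrss
        return data

    def render(self):
        """Render one `name value` line per metric, sorted by name."""
        return "".join(
            f"{name} {value}\n"
            for name, value in sorted(self.snapshot().items()))


# Example: count work as it happens, then dump on demand.
#   metrics = Metrics()
#   metrics.incr("requests_total")
#   print(metrics.render())
```

Comparing a snapshot taken at idle with one taken under load is a cheap way to start answering the steady-state-versus-load question above.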
Upgrades, downgrades, and dependency changes
Upgrading daemons and their dependencies can be challenging as daemons are typically designed to stay running indefinitely. Daemons can be designed to interact differently with software changes. Some daemons will stay running during upgrades and downgrades. Others might integrate with package managers to trigger restarts as a result of an upgrade or a downgrade. There isn’t necessarily a single right answer here; what one particular daemon needs might not be needed by others. Daemons that operate as servers may want to stay running to continue to process requests. Daemons that have more asynchronous behavior may choose to restart as part of an upgrade or downgrade so that the running software reflects what’s installed on disk. Keeping a daemon running during an upgrade can have some challenges: while the executable code of the main process will continue to be in memory (on Linux), if the daemon has a dependency on a dynamically-linked library (a .so file), unexpected behavior may occur if the library is upgraded and a different version is loaded. Restarting a daemon means integrating with the upgrade process and accepting some amount of unavailability during the upgrade. There might also be data compatibility issues; the daemon’s state might need to undergo a schema migration.
- Does your daemon recover from a package manager upgrade/downgrade?
- Does your daemon recover from an upgrade/downgrade of a dependency?
- What happens if the daemon is actively doing work during upgrade/downgrade?
- Does the schema or format of data need to change during an upgrade? Is it possible to reverse that for a downgrade scenario?
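To illustrate the last question, here is a sketch of versioned state with explicit upgrade and downgrade steps. Everything in it is hypothetical: the `version` field, the rename from `queue` to `pending`, and the function names are only there to show the shape of a reversible migration.

```python
CURRENT_VERSION = 2


def _v1_to_v2(state):
    # Hypothetical change: version 2 renames "queue" to "pending".
    new = dict(state)
    new["pending"] = new.pop("queue", [])
    new["version"] = 2
    return new


def _v2_to_v1(state):
    # The reverse step, so a downgraded daemon can still read its state.
    new = dict(state)
    new["queue"] = new.pop("pending", [])
    new["version"] = 1
    return new


UPGRADES = {1: _v1_to_v2}    # keyed by the version being migrated from
DOWNGRADES = {2: _v2_to_v1}


def migrate(state, target=CURRENT_VERSION):
    """Step the state one version at a time until it reaches `target`."""
    version = state.get("version", 1)  # assume unversioned data is version 1
    while version != target:
        steps = UPGRADES if version < target else DOWNGRADES
        if version not in steps:
            raise ValueError(
                f"no migration path from version {version} to {target}")
        state = steps[version](state)
        version = state["version"]
    return state
```

A daemon that finds state from a version it has never heard of fails loudly here instead of guessing, which seems like the safer default during a downgrade.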
Dependencies can also be challenging. Daemons might have multiple kinds of dependencies with different semantics: dynamically-linked libraries, kernel interfaces, remote APIs over a network, a message-passing system like D-Bus, persistent data stored in a particular format, and so on. Some of these might be affected by an upgrade/downgrade (dynamically-linked libraries, kernel interfaces) and some might not (remote APIs over a network), but all are worth thinking through.
- How does your daemon model its dependencies?
- Can your daemon compensate if a particular dependency is unavailable at runtime?
- If your daemon runs under a supervisor, can you coordinate the supervisor so that it starts the dependencies ahead of your daemon?
- Does your daemon load all its dynamically-linked libraries at start-up? If it loads later, can it handle a different version of a library being loaded?
- Does your daemon depend on particular kernel interfaces or features? Will it work without those?
- Does your daemon depend on a particular file or data format? Will it be able to handle data in an older format? What about a newer one?
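For dependencies that can be temporarily unavailable at runtime (a remote API, a message bus that hasn’t started yet), one common way to compensate is retrying with exponential backoff. This is a sketch under the assumption that failures surface as `OSError` subclasses such as `ConnectionRefusedError`; the function name and the default delays are made up.

```python
import time


def call_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call operation(), retrying with doubling delays on failure.

    Re-raises the last error once `attempts` calls have failed. The `sleep`
    parameter exists so tests can run without actually waiting.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller (or supervisor) decide
            sleep(min(delay, max_delay))
            delay *= 2


# Example (connect_to_bus is a placeholder for your own function):
#   bus = call_with_backoff(connect_to_bus)
```

Whether to give up and exit (and let the supervisor restart you) or keep running in a degraded mode is a design decision this sketch doesn’t make for you.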
Init systems and daemonizing
Some init systems act as process supervisors of the kind covered above, but not all do. There are a few special considerations for init systems that are worth thinking through when designing a daemon.
More traditional init systems (in the SysV style) expect programs to “daemonize” themselves: handle the mechanics of placing themselves into the background and running asynchronously. These init systems don’t typically perform much supervision, and may have fairly simple conventions around reporting status and handling dependency startup ordering. Daemons that run under these systems are generally expected to do the following for themselves:
- Close all open file descriptors (especially standard input, standard output and standard error)
- Change its working directory to the root filesystem, to ensure that it doesn’t tie up another filesystem and prevent it from being unmounted
- Reset its umask value
- Run in the background (i.e., fork)
- Disassociate from its process group (usually a shell), to insulate itself from signals (such as HUP) sent to the process group
- Ignore all terminal I/O signals
- Disassociate from the control terminal (and take steps not to reacquire one)
- Handle any SIGCLD signals
(from the daemonize tool website)
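To give a feel for the mechanics, here is a sketch of the classic double-fork approach covering most of the steps in the list above. It is deliberately incomplete: it only redirects standard input, output, and error to /dev/null instead of closing every open descriptor, and it doesn’t set up any child-signal handling. Treat it as an illustration, not a replacement for a tested tool.

```python
import os
import signal


def daemonize():
    """Detach the calling process. Returns only in the daemon process."""
    # First fork: the original process exits so the caller (usually a
    # shell or init script) gets control back right away.
    if os.fork() > 0:
        os._exit(0)

    # Start a new session: this leaves the old process group and drops
    # the controlling terminal.
    os.setsid()

    # Second fork: the session leader exits, and a process that is not a
    # session leader cannot reacquire a controlling terminal.
    if os.fork() > 0:
        os._exit(0)

    # Ignore terminal I/O signals.
    for sig in (signal.SIGTTIN, signal.SIGTTOU, signal.SIGTSTP):
        signal.signal(sig, signal.SIG_IGN)

    os.chdir("/")  # don't keep another filesystem busy
    os.umask(0)    # don't inherit the caller's file mode mask

    # Point stdin, stdout, and stderr at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
```

After `daemonize()` returns, anything the daemon prints goes nowhere, so logging needs to be set up separately.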
This can be a lot to test and get right (there are fairly detailed guides too), so there is tooling like daemonize to help you do that. But there are also modern init systems like systemd and upstart (though upstart has generally been abandoned in favor of systemd) that prefer that your daemon not “daemonize” itself and run in the foreground instead; they take responsibility for the work of isolating the daemon process. These init systems may have additional recommendations, but the general process is simpler. Init systems like this may also provide features around dependency management, monitoring/log handling, and activation that can be useful to you. The amount of work that an init system like this abstracts away can make it enticing to tie your daemon to that init system; that can be appropriate for some use-cases but can hinder broad adoption.
- Can your daemon run under different init systems/supervisors or is it tied to a specific one?
- Can your daemon run in the foreground?
- Can your daemon double-fork and daemonize itself?
- Can your daemon use init system features like integrated logging facilities when available (e.g., journald)?
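One way to use an init system feature “when available” without being tied to it is to make the integration a no-op everywhere else. As an example, here is a sketch of systemd’s readiness notification: systemd sets the `NOTIFY_SOCKET` environment variable for services that use it (for instance with `Type=notify`), and the daemon sends a `READY=1` datagram to that socket. The function below is my own minimal version using only the standard library, not an official client.

```python
import os
import socket


def sd_notify(message, environ=os.environ):
    """Send a status message to systemd; do nothing when not under systemd.

    Returns True if a message was sent, False if NOTIFY_SOCKET is unset.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # foreground run, SysV init, container, and so on
    if path.startswith("@"):
        # A leading "@" means a Linux abstract-namespace socket, which is
        # addressed with a leading NUL byte instead.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


# Call this once start-up is complete and the daemon can do work:
#   sd_notify("READY=1")
```

The same daemon can then run in the foreground, under a different supervisor, or under systemd, and only the last case changes its behavior.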
How and why the daemon starts is also worth thinking about. Some daemons are expected to start when the system boots and run indefinitely. Others might only be necessary when a particular piece of work comes in (a network request, a message, a new device) and can be started on-demand. Modern init systems like systemd can help model these different activation behaviors.
- Does your daemon need to run all the time?
- Does your daemon need to run at boot?
- Can your daemon be started reactively when work is available? If so, how quickly can it start?
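As a sketch of on-demand start, here is how a daemon might pick up a listening socket handed to it by systemd-style socket activation, falling back to opening its own socket otherwise. Under that convention the activator sets `LISTEN_PID` and `LISTEN_FDS`, and the inherited descriptors start at 3. The function names and the fallback address are assumptions for illustration.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor under this convention


def inherited_fds(environ=os.environ, pid=None):
    """Return the descriptor numbers passed to this process, if any."""
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID guards against a child process inheriting the
        # variables by accident.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed: not socket-activated
    return list(range(SD_LISTEN_FDS_START,
                      SD_LISTEN_FDS_START + max(count, 0)))


def listening_socket(host="127.0.0.1", port=0):
    """Use an inherited socket when present, else bind a new one."""
    fds = inherited_fds()
    if fds:
        return socket.socket(fileno=fds[0])
    return socket.create_server((host, port))
```

Because the init system holds the socket, connections can queue while the daemon starts, which is part of why start-up time matters for the last question above.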
This blog post is only a start, and only represents my way of thinking about daemons. I would be remiss if I failed to link to the resources I found as I was writing this article. Here they are, in no particular order:
I hope this blog post has been useful. If you have anything to add or any corrections you’d like to suggest, I’d appreciate it if you left a comment here!