Envoy hot restart

Matt Klein · Envoy Proxy · Aug 13, 2017


This is the second post in my series on Envoy architecture. (If you didn’t see the first, check out my post on the threading model.) In this post I will cover why and how Envoy is able to “hot restart” itself.

What is hot restart and why is it important?

Figure 1: Rolling and hot restart deploy methods

At a fundamental level, deploys of services (for this discussion, services include the application, Envoy, log agents, stat relays, etc.) that do not drop traffic take place in one of two ways:

  1. Via rolling deploy. New service nodes are brought up and traffic is drained and shifted from old nodes to new nodes. This is shown in the top half of figure 1.
  2. Via “hot” or live restart. No traffic shifting is done. The entity being restarted is somehow able to do so without dropping any traffic. This is shown in the bottom half of figure 1.

For a variety of reasons, (1) is a better approach. It allows more complex and safe traffic shifting techniques such as percentage-based canary, percentage-based rolling, zone-aware rolling, and full blue/green. More importantly, it allows the infrastructure to run in an immutable fashion such that service nodes are never altered in place. They are simply torn down after a successful deploy, with traffic having been shifted onto new immutable nodes. Immutable infrastructure is substantially easier to reason about. However, (1) requires a large investment in infrastructure and deploy tooling. Although the industry is converging in this direction and the tooling is becoming more widely available via projects like Kubernetes, Nomad, and Mesos, there are still a huge number of large-scale service deployments that use much simpler mechanisms with substantially less flexibility (like Lyft!).

Because many deployments do not have the sophisticated tooling required for (1), (2) is desirable because it is conceptually much simpler. The operator gets new code and configurations onto the service node, and via some mechanism specific to the service causes it to restart without dropping any connections. Almost all network proxies support at least some variant of this process, and many service frameworks in a variety of programming languages do as well.

How does the previous discussion relate to Envoy? When run as a full service mesh, Envoy forms the substrate by which the rest of the infrastructure communicates. This means that it is typically intimately involved in how services are deployed and operated. For example, in a Kubernetes based immutable infrastructure Envoy might be used to perform traffic shifting during deploys (Istio is an example of exactly this).

Yet, what if Envoy itself has to be updated? How does that work without impacting customer traffic? As it turns out this is a complex question that depends on how the infrastructure has been designed in terms of options (1) and (2) above.

  1. If Envoy is running as part of a sophisticated scheduling system (e.g., as an Istio sidecar pod member in Kubernetes), deploying Envoy is as “simple” as forcing an application deploy such that the Envoy injected into the pod gets updated to the latest version. Whatever rolling deploy mechanism is used will redeploy Envoy as well.
  2. For the vast majority of organizations that do not yet have robust rolling deploy capabilities via projects like Kubernetes or a homegrown system, being able to hot restart Envoy in place vastly simplifies (and usually greatly speeds up) how it is deployed.

In summary, we should all be striving for (1) but the reality on the ground still often leads to (2) being easier to implement and therefore what is done in practice by many organizations.

Envoy hot restart design goals

Envoy recognizes that (2) is still the approach taken by many deployments, and thus supporting hot restart with the following attributes is a fundamental design goal:

  1. The entire process (not just configuration) should be reloadable without dropping any connections. Envoy does not support a mechanism to reload only the configuration. This was a conscious design decision: adding code to modify a running base configuration is not trivial, and internally Envoy holds to the principle that immutable state with atomic switching is simpler to reason about (for the same reason that immutability is preferable for deploys). Fundamentally, doing a full binary reload is a superset of a configuration reload. Additionally, Envoy supports a large number of dynamic configuration APIs which make reloading the local configuration almost never necessary in practice, assuming the availability of a sophisticated enough management server. Thus, deploying Envoy itself is by far the most common reason that a restart is needed, and why a full binary reload is the only supported mechanism.
  2. Stats should be consistent during reload. Even though hot restarting Envoy will yield two Envoy processes running side by side for some period of time, from an observability standpoint it is desirable that both processes be considered a single unit. This means that gauges should be consistent across both processes (e.g., there should be a single total connections gauge that is accurate across both processes) and that only a single source of stats will be provided (whether push or pull based) to the stats infrastructure. The fact that a hot restarted multi-process Envoy is still logically a single Envoy makes operations easier to reason about for the rest of the infrastructure.
  3. Hot restart should still be possible with container-based immutable deployment. Even though many container-based deployment systems are fully immutable and use rolling deployment, Envoy will be used in a variety of different environments. In some cases it’s beneficial that Envoy can be deployed on a host via an immutable container but hot restarted into a new Envoy container started on the same host. This has important implications for how the restart is actually done, and will be discussed more below.
  4. The drain rate and destruction of the old Envoy process should be configurable. Envoy is used in some scenarios in which it is very important for performance reasons that connections be kept alive as long as possible. Because of this, how aggressively connections are drained from the old process and how long the old process continues to live should be configurable.

Envoy hot restart architecture

Figure 2: Envoy hot restart architecture

Figure 2 shows a high level overview of how Envoy performs hot restart. The following components are involved:

  1. A shared memory region that contains version information, raw stat storage, and shared locks (a sketch of its layout follows this list).
  2. The “primary” Envoy process. This is the first Envoy process that starts up, or a secondary process that has been promoted to primary after the previous primary drained and shut down.
  3. The “secondary” Envoy process. This is a new Envoy process that is starting up, initializing, and having traffic shifted to it. After some period of time it becomes the primary process.
  4. A simple RPC protocol used to communicate between the two processes. This is done over Unix domain sockets (UDS).
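
The shared memory region described in (1), and detailed in the next section, might look roughly like the following. This is a minimal illustrative sketch: the names and sizes are hypothetical and Envoy’s real layout differs in detail, but the shape is the same — a version for compatibility checking, a few process-shared locks, and a fixed-size array of raw stat slots.

```cpp
// Illustrative layout of a hot restart shared memory region. All names and
// sizes here are hypothetical; Envoy's real struct differs in detail.
#include <atomic>
#include <cstdint>
#include <pthread.h>

constexpr uint64_t kHotRestartVersion = 1; // bumped when layout or RPC protocol changes
constexpr uint32_t kMaxStats = 16384;      // fixed number of raw stat slots
constexpr uint32_t kStatNameLen = 128;

struct RawStatData {
  char name_[kStatNameLen];          // stat name, used for lookup on attach
  std::atomic<uint64_t> value_;      // counter/gauge value visible to both processes
  std::atomic<uint32_t> ref_count_;  // slot is freed once both processes release it
};

struct SharedMemoryRegion {
  uint64_t size_;                       // sanity check when a new epoch attaches
  uint64_t version_;                    // must match or hot restart aborts with an error
  pthread_mutex_t log_lock_;            // serializes writes to standard out
  pthread_mutex_t access_log_lock_;     // serializes simultaneous file writes
  pthread_mutex_t stat_lock_;           // guards allocation/free of stat slots
  RawStatData stats_slots_[kMaxStats];  // raw stat storage shared across epochs
};
```

Because counter and gauge values live directly in this region, an update by either process is immediately visible to the other, which is what makes consistent multi-process stats possible.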

Shared memory

The shared memory region is primarily used for raw stat storage, as well as for a few shared locks that are required for data that both processes may simultaneously modify. How stats work will be the focus of the next blog post, so I won’t talk more about it here. However, I will mention that storing raw stat memory in shared memory satisfies the design goal of consistent multi-process stats (after several layers of indirection, the actual counter or gauge value is shared by both processes).

The multi-process locks are used for synchronizing output to standard out, writing to files simultaneously, and allocating and deallocating raw stat memory. The first Envoy process that starts up (known as “epoch” zero) will initialize the shared memory region. Subsequent processes (epochs greater than zero) will attach to the shared memory region and fail if they cannot. Version information on both the layout of the shared memory region and the RPC protocol is stored in the region itself. This allows the hot restart process to fail with a useful error message if the protocol has changed and hot restart is not possible (in practice this rarely happens; the protocol has been stable for quite some time).
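
Reusing the SharedMemoryRegion sketch above, the epoch-zero initialization and the later attach-with-version-check could look roughly like this. The region name and error handling are simplified; Envoy actually derives the name from a configurable base ID so that multiple independent Envoy groups can run on the same host.

```cpp
// Hypothetical initialize-or-attach logic for the region sketched above.
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

SharedMemoryRegion* initializeOrAttach(uint32_t restart_epoch) {
  int oflags = O_RDWR;
  if (restart_epoch == 0) {
    oflags |= O_CREAT;  // epoch zero creates the region; later epochs must attach
  }
  int fd = shm_open("/envoy_shared_memory", oflags, S_IRUSR | S_IWUSR);
  if (fd == -1) {
    throw std::runtime_error("cannot open shared memory region");
  }
  if (restart_epoch == 0) {
    ftruncate(fd, sizeof(SharedMemoryRegion));
  }
  void* mem = mmap(nullptr, sizeof(SharedMemoryRegion), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
  if (mem == MAP_FAILED) {
    throw std::runtime_error("cannot map shared memory region");
  }
  auto* region = static_cast<SharedMemoryRegion*>(mem);

  if (restart_epoch == 0) {
    region->size_ = sizeof(SharedMemoryRegion);
    region->version_ = kHotRestartVersion;
    // PTHREAD_PROCESS_SHARED lets these mutexes synchronize across processes.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&region->log_lock_, &attr);
    pthread_mutex_init(&region->access_log_lock_, &attr);
    pthread_mutex_init(&region->stat_lock_, &attr);
  } else if (region->size_ != sizeof(SharedMemoryRegion) ||
             region->version_ != kHotRestartVersion) {
    // Layout or RPC protocol changed between builds: hot restart is not
    // possible, so fail with a useful error instead of corrupting the region.
    throw std::runtime_error("hot restart version mismatch");
  }
  return region;
}
```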

Startup procedure and RPC protocol

As stated previously, one of the design goals of the Envoy hot restart mechanism is that it should work with containers. This requirement makes the implementation different from how many other proxies operate. Most existing network proxies utilize some type of within-process “trampoline” that knows how to restart the process and/or start new worker processes. This is convenient because all of the functionality is contained within the initial running process. However, it does not work with immutable containers. If the hot restart process is to work with immutable container deployment, the new process and the old process cannot interact in any way other than via shared memory or the network.

Figure 3: Envoy hot restart RPC flow

Figure 3 shows the hot restart RPC flow when a new Envoy process starts. Note again that the only way the two processes communicate is via shared memory and UDS. This means that the two processes can be in separate containers if desired. The following steps are involved:

  1. The secondary process asks the primary process to shut down its admin port. The secondary process now takes over all admin duties, including stat flushing, so that from an operations standpoint there is only a single logical Envoy process.
  2. The secondary process loads its configuration and starts to bind to listen sockets. During this phase it fetches applicable listen sockets from the primary via UDS; the mechanics of passing a socket between processes are sketched after this list. (In the current implementation Envoy does not make use of the SO_REUSEPORT socket option. This is primarily historical, since that option is only available on relatively new kernels. At some point support for old kernels can be dropped and the code switched to using it.)
  3. Once the secondary process is fully initialized (initialization will be the subject of a full blog post), it tells the primary process to stop listening for new connections and to start draining. The amount of drain time is configurable and defaults to 15 minutes. During this window, the primary process gracefully closes connections, and the close rate becomes more aggressive the longer the process has been draining (one way to implement this ramp-up is also sketched after the list). This yields a smooth closure of old connections, which then reestablish on the secondary process that is already listening for new connections.
  4. During the drain phase, the secondary process is flushing stats. Most stats are stored in shared memory and don’t need to be fetched via RPC. However, some special stats are only known to the draining primary process: for example, how many connections it still has open and how much memory it has allocated. These stats are written out by the secondary so that the drain rate can be more easily observed.
  5. Finally, after the drain time has passed, the secondary tells the primary to shut down. Any remaining connections held open by the primary are closed. At this point, the secondary becomes the primary and there is a single Envoy process running.
  6. Repeat.
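
Step (2) relies on the classic Unix trick of sending an open file descriptor as SCM_RIGHTS ancillary data over a UDS: the kernel installs a duplicate descriptor in the receiving process that refers to the same underlying socket, so the listener never actually closes and no connections are dropped. Here is a minimal sketch of the mechanism (this is not Envoy’s actual RPC framing, just the kernel primitive it builds on):

```cpp
// Passing an open listen socket between processes over a UDS using
// SCM_RIGHTS ancillary data.
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>

// Primary side: send `listen_fd` to the secondary over the connected UDS.
void sendListenSocket(int uds_fd, int listen_fd) {
  char data = 0;  // at least one byte of real data must accompany the fd
  iovec iov{&data, sizeof(data)};
  char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);

  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;  // payload is a file descriptor
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));
  sendmsg(uds_fd, &msg, 0);
}

// Secondary side: receive the fd. The kernel installs a duplicate descriptor
// referring to the same open socket in the receiving process.
int recvListenSocket(int uds_fd) {
  char data;
  iovec iov{&data, sizeof(data)};
  char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  if (recvmsg(uds_fd, &msg, 0) <= 0) {
    return -1;
  }
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == nullptr || cmsg->cmsg_type != SCM_RIGHTS) {
    return -1;
  }
  int fd = -1;
  std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;
}
```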

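For step (3), one simple way to make draining progressively more aggressive is to decide, each time a connection reaches a safe closing point (e.g., between keep-alive requests), whether to close it with a probability that ramps up over the drain window. The following is a sketch of that idea, not Envoy’s actual drain manager:

```cpp
// Sketch of a drain decision that grows more aggressive over time: at the
// start of the drain window almost nothing is closed; by the end, every
// connection that reaches a safe closing point is closed.
#include <chrono>
#include <cstdlib>

using Clock = std::chrono::steady_clock;

class DrainDecision {
public:
  DrainDecision(Clock::time_point start, std::chrono::seconds drain_window)
      : start_(start), drain_window_(drain_window) {}

  // Returns true if a connection currently at a safe closing point
  // (e.g., between keep-alive requests) should be closed now.
  bool drainClose() const {
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::seconds>(Clock::now() - start_);
    if (elapsed >= drain_window_) {
      return true;  // past the window: close everything that remains
    }
    // Close probability ramps linearly from 0 to 1 across the window.
    // std::rand() is used only for brevity in this sketch.
    const double p =
        static_cast<double>(elapsed.count()) / drain_window_.count();
    return (static_cast<double>(std::rand()) / RAND_MAX) < p;
  }

private:
  const Clock::time_point start_;
  const std::chrono::seconds drain_window_;
};
```
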
Hot restart wrapper

Because of the way Envoy implements hot restart in order to work in an immutable container environment, some type of coordination mechanism is still needed to actually run Envoy with the required command line arguments. At Lyft, we currently use a very simple Python process which knows how to run the Envoy processes, forward pertinent signals, and watch for erroneous behavior (e.g., if one of the processes aborts, the wrapper kills everything and exits so that a process manager can restart the tree cleanly). Lyft currently uses runit as its on-node process manager. The runit daemon is only aware of the hot restart wrapper; the wrapper does everything else, which serves to hide the multiple Envoy processes from the rest of the infrastructure. Although we don’t currently use containers in production at Lyft, the hot restart wrapper script could be fairly trivially modified to work in a container environment given the rest of the design.
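
Lyft’s wrapper is a small Python script, but the core responsibilities are easy to express in any language. The following is a hypothetical C++ sketch of the same loop: spawn Envoy with an incrementing --restart-epoch (a real Envoy flag, as is --drain-time-s), forward signals, and exit loudly on abnormal child death so the process manager can restart the whole tree. The config path and drain time shown are placeholders.

```cpp
// Hypothetical C++ sketch of a hot restart wrapper. Lyft's real wrapper is a
// small Python script; the responsibilities are the same: run Envoy with an
// incrementing --restart-epoch, forward signals, and exit loudly if a child
// dies abnormally so the process manager (e.g., runit) restarts the tree.
#include <cstdio>
#include <cstdlib>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t latest_envoy = -1;

static void forwardSignal(int sig) {
  if (latest_envoy > 0) {
    kill(latest_envoy, sig);  // pass e.g. SIGTERM through to the newest Envoy
  }
}

static pid_t spawnEnvoy(int epoch) {
  pid_t pid = fork();
  if (pid == 0) {
    char epoch_arg[32];
    std::snprintf(epoch_arg, sizeof(epoch_arg), "%d", epoch);
    // --restart-epoch and --drain-time-s are real Envoy flags; the config
    // path and the 15 minute (900 s) drain time are placeholders.
    execlp("envoy", "envoy", "-c", "/etc/envoy/envoy.yaml",
           "--restart-epoch", epoch_arg, "--drain-time-s", "900",
           (char*)nullptr);
    std::perror("execlp");  // only reached if exec fails
    std::exit(1);
  }
  return pid;
}

int main() {
  signal(SIGTERM, forwardSignal);
  latest_envoy = spawnEnvoy(0);
  // A real wrapper would also listen for a "restart" signal (e.g., SIGHUP)
  // and spawn epoch + 1 here, letting the old process drain in parallel.
  int status = 0;
  while (waitpid(-1, &status, 0) > 0) {
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
      std::fprintf(stderr, "envoy exited abnormally; exiting wrapper\n");
      return 1;  // let the process manager restart everything cleanly
    }
  }
  return 0;
}
```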

Conclusion

Although the industry is moving towards fully immutable infrastructure, the ability to perform a hot restart of code and configurations is still a very useful feature for many. Envoy’s approach to hot restart is to hide as much as possible from the rest of the infrastructure and appear as a single logical process. This results in substantially easier operability. Container-based hot restart was also an initial design goal and allows for more flexibility in deployment techniques.

Links to code

Some links to a few of the interfaces and implementation headers discussed in this post:
