Take it to the limit—considerations for building reliable systems

March 1, 2017

Building reliable systems is a key part of the engineering process at Workiva. This means recognizing that complex systems typically consist of many discrete pieces, each of which can fail in isolation (or in concert), and accounting for it. In a microservice architecture where a given function potentially comprises several independent service calls, high availability hinges on the ability to be partially available.

This is a core tenet behind resilience engineering. If a function depends on 3 services, each with a reliability of 90%, 95%, and 99%, partial availability could be the difference between 99.995% reliability and 84% reliability (assuming failures are independent). Resilience engineering means designing with failure as the normal.

Anticipating failure is the first step to resilience zen, but the second is embracing it. Telling the client "no" and failing on purpose is better than failing in unpredictable or unexpected ways.

Backpressure is another critical resilience engineering pattern. Fundamentally, it's about enforcing limits. This comes in the form of queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, etc. Prescribing these restrictions makes the limits explicit when they would otherwise be implicit (eventually your server will exhaust its memory, but since the limit is implicit, it's unclear exactly when or what the consequences might be). Relying on unbounded queues and other implicit limits is like saying someone knows when to stop drinking because he or she eventually passes out.

Rate limiting is important, not just to prevent bad actors from DoSing your system, but also yourself. Queue limits and message size limits are especially interesting because they seem to confuse and frustrate developers who haven't fully internalized the motivation behind them. But really, these are just another form of rate limiting, or more generally, backpressure.

Let's look at max message size as a case study.

Imagine we have a system of distributed actors. An actor can send messages to other actors who, in turn, process the messages and may choose to send messages themselves.

Now, as any good software engineer knows, the eighth fallacy of distributed computing is "the network is homogenous." This means not all actors are using the same hardware, software, or network configuration. We have servers with 128GB RAM running Ubuntu, laptops with 16GB RAM running macOS, mobile clients with 2GB RAM running Android, IoT edge devices with 512MB RAM, and everything in between—all running a hodgepodge of software and network interfaces.

When we choose not to put an upper bound on message sizes, we are making an implicit assumption (recall the discussion on implicit/explicit limits from earlier). Put another way, you and everyone you interact with, likely unknowingly, enters an unspoken contract of which neither party can opt out. This is because any actor may send a message of arbitrary size. This means any downstream consumer of this message, either directly or indirectly, must also support arbitrarily large messages.

How can we test something that is arbitrary? We can't. We have two options: Either we make the limit explicit, or we keep this implicit, arbitrarily binding contract.

The former allows us to define our operating boundaries and gives us something to test. The latter requires us to test at some undefined production-level scale. The second option is literally gambling reliability for convenience. The limit is still there, it's just hidden.

When we don't make it explicit, we make it easy to DoS ourselves in production. Limits become even more important when dealing with cloud infrastructure due to their multitenant nature. They prevent a bad actor (or you) from bringing down services or dominating infrastructure and system resources.

In our heterogeneous actor system, we have messages bound for mobile devices and web browsers, which are often single-threaded or memory-constrained consumers. Without an explicit limit on message size, a client could easily doom itself by requesting too much data or simply receiving data outside of its control—this is why the contract is unspoken but binding.

Let's look at this from a different kind of engineering perspective. Consider another type of system: the U.S. National Highway System.

The U.S. Department of Transportation (DOT) uses the Federal Bridge Gross Weight Formula as a means to prevent heavy vehicles from damaging roads and bridges. It's really the same engineering problem, just a different discipline and a different type of infrastructure.

The August 2007 collapse of the Interstate 35W Mississippi River bridge in Minneapolis brought renewed attention to the issue of truck weights and their relation to bridge stress. In November 2008, the National Transportation Safety Board determined there had been several reasons for the bridge's collapse, including (but not limited to): faulty gusset plates, inadequate inspections, and the extra weight of heavy construction equipment combined with the weight of rush hour traffic.

The DOT relies on weigh stations to ensure trucks comply with federal weight regulations, fining those that exceed restrictions without an overweight permit.

The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country's highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.

Weight limits need to be enforced, so civil engineers have a defined operating range for the roads, bridges, and other infrastructure they build. Computers are no different. This is the reason many systems enforce these types of limits.

For example, Amazon clearly publishes the limits for its Simple Queue Service—the max queue depth for standard queues is 120,000 messages and 20,000 messages for FIFO queues. Messages are limited to 256KB in size. Amazon Kinesis, Apache Kafka, NATS, and Google App Engine pull queues all limit messages to 1MB in size.

These limits allow the system designers to optimize their infrastructure and ameliorate some of the risks of multitenancy—not to mention making capacity planning much easier.

Unbounded anything—queues, message sizes, queries, or traffic—is a resilience engineering antipattern. Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they're just hidden. By making them explicit, we restrict the failure domain giving us more predictability, longer mean time between failures, and shorter mean time to recovery at the cost of more upfront work or slightly more complexity.

It's better to be explicit and handle these limits upfront than to punt on the problem and allow systems to fail in unexpected ways. The latter might seem like less work at first but will lead to more problems in the long term. By requiring developers to deal with these limitations directly, they will think through their APIs and business logic more thoroughly and design better interactions with respect to stability, scalability, and performance.


Add new comment

This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.