Don’t build in the dark: monitoring your microservices

March 16, 2016

Recently, our team worked on some code which was riddled with complexity—a problem analogous to deadlock detection in distributed systems. In order to make sense of the examples shown in this post, it helps to know that our problem focused on building dependency graphs, detecting circular dependencies, and knowing when an update started and finished propagating through a graph. The subject of this blog post is not the problem and its possible solutions, but rather the approach we took while prototyping.

We've all heard of test-driven development: creating tests as a way of defining requirements and then writing code to satisfy those tests, which, in turn, satisfies the requirements. We use those tests to drive development by adding assertions which constrain our expected end-state. Then we add code, run tests, add code, run tests, etc., until all of our assertions are satisfied. Blah, blah, blah, instant feedback is awesome, blah, blah, blah, everyone's been doing it for years…

As a back-end developer, this has served me quite well: a command line interface, a single command, and a list of test names indicating a pass or fail.

   test_basic_linking ...                                 [OK]
   test_causal_chain ...                                  [OK]
   test_the_u_problem ...                                 [OK]
   test_interpolated_with_user_changes ...                [FAIL]
   test_intersecting_concurrent_propagation ...           [OK]
   test_twice_consistent ...                              [OK]
   test_simple_circular_propagation ...                   [OK]


I rarely feel that I need a graphical interface, and I don't often write front-end code. I respect front-end development and the aesthetics of user interface design; I'm just not very good at it. I'm not a visual learner, if there is such a thing. Typically, it doesn't affect my ability to code.

So, as with most coding projects we've worked on recently, we started with test-driven development. As we coded, our process evolved toward Monitoring Driven Development (MDD). If you're not interested in the evolutionary process and just want the takeaways, skip to “Turning the lights on.”

Whiteboarding is a powerful tool

As we discussed the ins and outs of the problem, time and time again, we found ourselves compelled to whiteboard our scenarios.

"No, here's what I mean..."

"Oh crap... yeah that definitely breaks it."

And so we wrote a unit test that exercised the control flow of the offending graph, and we gave it some descriptive name.

"I call it the Bow Tie scenario!" someone would shout.

As a result, test_bow_tie was added to the ever-growing test suite.

Oh god, I'm so tired of whiteboarding. How do I connect my keyboard to this thing?

Adding visual documentation to our tests

We decided to document every test in dot notation, with the intention of generating concrete documentation as part of our build process that visually described the tests.

Here's an example of a graph with a cycle in it:

Circular Propagation

   digraph {
       "LUX/A1" -> "DT/B1" -> "DT/B2" -> "SOX/C1" -> "SOX/C2";
       "SOX/C2" -> "DT/B3" -> "DT/B2";
   }

If you're unfamiliar with dot notation, we're describing a directed graph. Each node in the graph is wrapped in quotation marks, and -> denotes a directed edge.
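To make the cycle in this example concrete, here is a minimal sketch that finds it with a plain depth-first search. This is not the distributed approach our prototype took; the edge list is simply the graph above written out by hand, and `find_cycle` is a hypothetical helper name.

```python
from collections import defaultdict

# Edges of the "Circular Propagation" example, written out by hand.
EDGES = [
    ("LUX/A1", "DT/B1"), ("DT/B1", "DT/B2"), ("DT/B2", "SOX/C1"),
    ("SOX/C1", "SOX/C2"), ("SOX/C2", "DT/B3"), ("DT/B3", "DT/B2"),
]


def find_cycle(edges):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cycle = find_cycle(EDGES)
```

For the example graph, the search walks from "LUX/A1" down to "DT/B3" and then hits "DT/B2" again, which is still on the current path, so the reported cycle starts and ends at "DT/B2".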

Here's some example output from one of our unit tests, generated automatically from the dot notation in our docstring:

One of the nice things about these graphs is that they’re interactive. You can pull the nodes around and contort the graph, to more easily detect homeomorphism.

After drawing our graph, we then set up our test fixture to match and run our simulation:

def test_circular_propagation(self):
    """
    Circular Propagation

        digraph {
            "LUX/A1" -> "DT/B1" -> "DT/B2" -> "SOX/C1" -> "SOX/C2";
            "SOX/C2" -> "DT/B3" -> "DT/B2";
        }
    """
    # Mirror each edge of the docstring graph in the test fixture.
    self.lux.add_contributor('LUX/A1', ['DT/B1'])
    self.dt.add_contributor('DT/B1', ['DT/B2'])
    self.dt.add_contributor('DT/B3', ['DT/B2'])
    self.dt.add_contributor('DT/B2', ['SOX/C1'])
    self.sox.add_contributor('SOX/C1', ['SOX/C2'])
    self.sox.add_contributor('SOX/C2', ['DT/B3'])

    self.lux.update('LUX/A1', UpdateValue(7))
    ... # Assertions below ...

Unfortunately, as we added scenarios and tests, we found keeping our documentation and code in sync was extra work. We already wrote the test setup in dot notation. Why not generate our test setups from the dot notation itself?

def test_circular_propagation(self):
    """
    Circular Propagation

        digraph {
            "LUX/A1" -> "DT/B1" -> "DT/B2" -> "SOX/C1" -> "SOX/C2";
            "SOX/C2" -> "DT/B3" -> "DT/B2";
        }
    """
    self.lux.update('LUX/A1', UpdateValue(7))
    ... # Assertions below ...

Adding new tests became:

  1. Drawing graphs in dot notation
  2. Picking a node to pump a value through

This was great for ease of writing unit tests. Our test setups were derived from our documentation. Our documentation effectively could not be out of date and was re-rendered in HTML/JavaScript™ every time a test was run!
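As a rough idea of how a setup can be derived from a docstring, here is a minimal sketch that pulls the edges out of the dot notation with a regular expression. It only handles the quoted-node chains used in this post, not the full dot grammar, and `parse_dot_edges` and `edges_by_service` are hypothetical names rather than our actual helpers.

```python
import re
from collections import defaultdict

NODE = r'"([^"]+)"'
# A chain is two or more quoted nodes joined by "->", e.g. "A" -> "B" -> "C".
CHAIN = re.compile(rf'{NODE}(?:\s*->\s*{NODE})+')


def parse_dot_edges(docstring):
    """Extract (source, target) pairs from the digraph body in a docstring."""
    edges = []
    for chain in CHAIN.finditer(docstring):
        nodes = re.findall(NODE, chain.group(0))
        edges.extend(zip(nodes, nodes[1:]))
    return edges


def edges_by_service(edges):
    """Group each source node's targets under the prefix before the '/'."""
    grouped = defaultdict(lambda: defaultdict(list))
    for src, dst in edges:
        service = src.split("/", 1)[0].lower()  # assumption: "DT/B1" belongs to "dt"
        grouped[service][src].append(dst)
    return grouped


DOC = '''
Circular Propagation

    digraph {
        "LUX/A1" -> "DT/B1" -> "DT/B2" -> "SOX/C1" -> "SOX/C2";
        "SOX/C2" -> "DT/B3" -> "DT/B2";
    }
'''

edges = parse_dot_edges(DOC)
grouped = edges_by_service(edges)
```

A fixture could then walk `grouped` and make one `add_contributor(node, targets)` call per source node on the matching service object, assuming the argument order used in the handwritten setup earlier.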

Watch yourself: verifying expectations with race conditions and out-of-order messages

If all there was to solving our problem was drawing directed graphs, our job would be complete. However, as with most distributed problems, we expected to have race conditions and out-of-order messages.

We wanted to ensure that we were properly detecting state changes. Test assertions gave us some confidence, but complex scenarios were still hard to reason about in our UI-less microservice prototype.

We had seen timelines used in the past to offer insight: for example, Google App Engine™ Appstats for detecting inefficient RPCs, or the Mayhem Timeline in the Wdesk Flex Client. So it seemed like a no-brainer to add a timeline to our prototype:

This was a huge boon for productivity and helped us detect errors in tests which passed. The problem with unit tests is that they test end-state. Through our visual output, we found that even though our end-state was correct, the system had passed through unexpected states in between. Our assertions were passing by coincidence.

We now had a new and more powerful feedback loop:

  1. Run unit tests
  2. If unit test fails, fix bug and go to step 1
  3. Visually inspect the test-generated timeline for any anomalies
  4. Add assertion to test to catch class of anomaly
  5. Go to step 1

Our visualization now informed us of missing test assertions that we hadn't thought of. It's like we're QA'ing our own work. We'll never have to rework a ticket again—developers rejoice!
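Step 4 of that loop, turning an anomaly spotted on the timeline into an assertion, might look something like the sketch below. The `Timeline` class, the "started"/"finished" event names, and the specific anomaly it checks for are all assumptions made for illustration; our prototype's recorder was different.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Append-only log of (sequence number, node, event) tuples."""
    events: list = field(default_factory=list)
    _seq: itertools.count = field(default_factory=itertools.count, repr=False)

    def record(self, node, event):
        self.events.append((next(self._seq), node, event))

    def anomalies(self):
        """Yield a description for each 'finished' that has no open 'started'."""
        open_updates = set()
        for seq, node, event in self.events:
            if event == "started":
                open_updates.add(node)
            elif event == "finished":
                if node not in open_updates:
                    yield f"#{seq}: {node} finished without a matching start"
                open_updates.discard(node)


timeline = Timeline()
timeline.record("DT/B2", "started")
timeline.record("DT/B2", "finished")
timeline.record("DT/B2", "finished")  # e.g. a duplicate message
problems = list(timeline.anomalies())
```

A test can then end with `assert not list(timeline.anomalies())`, so the intermediate states are checked along with the end-state.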

Check yo'self before you wreck yo'self: visually debugging data structures

We felt confident in our event handling, even for out-of-order or duplicate messages. However, we came up with a class of graphs that exposed a fundamental flaw in one of our custom data structures: the TreeBag.

The TreeBag is essentially a graph builder—a data structure that maintains a set of graph fragments. These graph fragments will be joined together when missing graph fragments are added to it. Once all graph fragments have checked in, the graph is processed and marked closed.
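To give a feel for the shape of such a structure, here is a deliberately simplified sketch. It is not our implementation: the known `root`, the `check_in` signature, and the closing rule (the root has checked in and no referenced node is missing) are assumptions, and the processing step is left out.

```python
class TreeBag:
    """Hypothetical sketch: collects graph fragments until none are missing."""

    def __init__(self, root):
        self.root = root
        self.children = {}  # node -> child nodes, for fragments that have checked in
        self.closed = False

    def missing(self):
        """Nodes referenced as children that have not checked in themselves."""
        referenced = {c for kids in self.children.values() for c in kids}
        return referenced - set(self.children)

    def check_in(self, node, children):
        """Add the fragment rooted at `node`; close once the graph is complete."""
        if self.closed:
            raise RuntimeError("TreeBag is already closed")
        self.children[node] = list(children)
        # Assumed closing rule: the root is present and nothing is still missing.
        self.closed = self.root in self.children and not self.missing()


bag = TreeBag(root="A")
bag.check_in("B", [])          # arrives before its parent; bag stays open
bag.check_in("A", ["B", "C"])  # joins with B, but C is still missing
still_missing = bag.missing()
bag.check_in("C", [])          # last fragment checks in; bag closes
```

Tracking the root explicitly is what keeps a leaf that arrives early from closing the bag on its own.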

After stepping through the debugger a few times to view the in-memory state of the TreeBag, we decided something needed to change:

Since the TreeBag is only ever updated when graph fragments are checked in and check-ins are already noted on our timeline visualization, we captured the state of the TreeBag at every check-in and correlated it on the same visualization:

Hovering on a point in time showed us the state of our underlying data structure. We effectively played back time on the data structure to see its state in a human-friendly way.
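The playback idea can be sketched as a log of deep-copied snapshots keyed by timestamp. `SnapshotLog` is a hypothetical name, and the plain dict below stands in for whatever internal state you want to inspect.

```python
import bisect
import copy


class SnapshotLog:
    """Stores deep copies of a state object by timestamp, for later playback."""

    def __init__(self):
        self._times = []
        self._states = []

    def capture(self, timestamp, state):
        # Assumes captures arrive in non-decreasing timestamp order.
        self._times.append(timestamp)
        self._states.append(copy.deepcopy(state))

    def state_at(self, timestamp):
        """Return the latest snapshot taken at or before `timestamp`, or None."""
        index = bisect.bisect_right(self._times, timestamp)
        return self._states[index - 1] if index else None


log = SnapshotLog()
fragments = {}

fragments["A"] = ["B"]
log.capture(1.0, fragments)  # first check-in

fragments["B"] = []
log.capture(2.0, fragments)  # second check-in
```

The deep copy matters here: without it, every stored snapshot would point at the same live object and playback would only ever show the final state.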

Turning the lights on

When issues arise in production systems, the questions seem obvious:

  1. What were the events leading up to this issue?
  2. What was the state of the system and contents of its data structures at that time?
  3. What sets this scenario apart?

These questions aren't unique to our deadlock problem. They're questions that are universal to debugging. Unfortunately, we rarely consider how to answer these questions until things go catastrophically wrong. At that point, we spend critical time poring over gigabytes of logs and trying to determine which hoops to jump through to get a look at the internal state of our systems. It’s only then that we realize how deficient our monitoring is.

Most projects I've worked on begin with brainstorming, whiteboarding, building system internals, defining API contracts and achieving MVP. For microservices, monitoring and metrics are more often than not bolted on as an afterthought. By laying down a framework for visually debugging internal state up front, and building against it, we achieve several key results:

  1. Established production support tooling
  2. Better code separation: our microservices monitoring implementation actually encouraged us to separate functions and data structures in a more logical and monitorable way—similar to how TDD does
  3. A better understanding and ability to communicate how our systems operate internally
  4. Faster development process
  5. Rich and up-to-date documentation

When you work on front-end code, you get some of this for free because bugs may manifest themselves visually in your UI that were not caught by your unit tests. When you work on back-end code, you're completely deprived of that extra verification layer, unless you build it yourself.

Not all problems are the same. Maybe what you're solving is not a graph problem or doesn't require a timeline. Maybe all you need is a histogram or a few charts (MDD). In our case, we found auto-generated visualizations incredibly powerful for driving our development process. Being able to walk through a data structure's state in a comprehensible way using visualizations is a power we often overlook as back-end engineers.

Visualization libraries like vis.js make it easy to add this kind of functionality, so there's really not much reason not to do it. It's also relevant to mention that many languages and platforms have some of this built in to varying degrees. Erlang's Observer on the BEAM VM is a great example.
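On the back-end side, feeding such a library can be as small as dumping events to JSON. The sketch below assumes the item fields we understand the vis.js Timeline to accept (`id`, `content`, `start`, and `title` for hover text); check them against the vis.js documentation before relying on them. The input tuple layout and `to_timeline_items` are made up for this example.

```python
import json
from datetime import datetime, timezone


def to_timeline_items(events):
    """Convert (unix_seconds, label, detail) tuples into timeline item dicts."""
    return [
        {
            "id": index,
            "content": label,
            "start": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
            "title": detail,  # assumed to be shown as hover text
        }
        for index, (ts, label, detail) in enumerate(events)
    ]


items = to_timeline_items([
    (0, "LUX/A1 update started", "value=7"),
    (2, "DT/B2 check-in", "missing: SOX/C1"),
])
payload = json.dumps(items, indent=2)  # write this next to the generated HTML
```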

If you take the plunge and decide to use monitoring to drive your development process, I suggest building those microservices monitoring tools with one question in mind: "If I were debugging this in production, what would I want to know?"

Google App Engine is a trademark of Google Inc.

JavaScript is a trademark or registered trademark of Oracle in the U.S. and other countries.