There is no production version

July 25, 2019

Workiva has been talking about and building microservices for a while, breaking our applications apart into isolated components so that we can develop and release them independently. We chose this architecture because it gives teams more freedom to manage the lifecycle of their service: they can develop and release on their own schedule instead of a companywide one. Like all architecture decisions, it comes at a cost, and I think there is one cost we haven't fully acknowledged yet: there is no production version.

Now don't misunderstand: I'm not implying that customers are not using our products. No, this is far more subtle. Consider how our world worked before microservices. We had a central project that was released to a single location. Therefore, the question "What version is currently deployed to production?" was simple and easy to answer.

That world doesn't exist anymore.

Now we have dozens of services that make up our current application stack, all deployed to eight Kubernetes clusters, with the potential of private stacks adding to that number. Our first reaction to this problem might be, “Just ensure they are all running the same version.”

If only it were that easy.

First, we have the problem of time zones. In most cases, teams avoid deployments during our customers' business hours, but we now have two major customer bases: North America and Europe. So releases to our North American data centers are often performed at 10:00 p.m. CT, but our European ones don't occur until after noon CT the following day. For 14 hours of each release day, there are in fact two versions in production, not one.

Next, we have the slow rollout (rolling update) strategy in our Kubernetes clusters. This feature is wonderful because it minimizes the impact of a deployment on your users, but it has the interesting side effect that, while the rollout is in progress, two versions are running in production instead of one.

Third, there is a temptation to treat locations like Sandbox as less critical because "customers" are not using them, but this is far from the truth. The customers there are a different group, but they are very much impacted by our deployments. Everyone in the company tries to help test our application by using it instead of alternatives like Google Docs and Microsoft Office®. If you deploy something that leaves a team's review with no slides at the last minute, or makes information sent to our executives unavailable, haven't you slowed our company's progress?

Finally, deployments can be reverted and rolled back, so we can't expect the versions of a service running in production to be monotonically increasing.

Put all of those facts together and we reach the uncomfortable truth that production is an ever-changing list of service versions.

This has a lot of consequences for our development process and pipelines, and we need to keep this idea at the front of our minds. We're tempted to try to predict what will be available in production when our feature is deployed, and if you have success with this, I encourage you to share the next lottery numbers with me. The future is fluid and can even go backwards due to reverts, so that is often a road fraught with peril.

The next option we reach for is coupling our releases: saying X and Y need to release together. We're greeted with immediate failure here too, since slow rollouts and rollbacks mean the other service isn't guaranteed to be available at the version we want.

With all of this perceived chaos in the system, how do we reliably release new value to customers? The answer is not an easy one, and it requires us to change the mindset we carried from our monolithic application days. This post doesn't pretend to offer a magic bullet, but to start the conversation, I'd like to offer some solutions that teams around the world in similar situations have used.

Versioning

This is a complex but critical topic for us. The first thing to understand is that your software version is a communication tool. Other teams use it to make judgment calls about how difficult it will be to upgrade their usages and whether it is safe to do so.

You can help the teams using your service by providing a clearly defined API with Frugal, or even a client library. A client library gives you even more control over how interactions with your service work and can simplify the process of integrating it into other projects. Additionally, with a client library you can track the versions all other projects are using via Release Management's Dependency Search page.

Read the semver page and let it sink in, as it's the basis of version numbers at our company.
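
To see why that communication matters, here is a minimal sketch of how a consumer can act on version numbers in code, using the third-party Masterminds semver package for Go. The constraint and version numbers are invented for illustration: patch and minor bumps are accepted as safe, while a major bump is flagged for review.

```go
package main

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

func main() {
	// The consumer pins to "any 1.x release at or above 1.4.0":
	// patch and minor bumps should be safe, a major bump should not be assumed safe.
	constraint, err := semver.NewConstraint("^1.4.0")
	if err != nil {
		panic(err)
	}

	for _, raw := range []string{"1.4.3", "1.9.0", "2.0.0"} {
		v, err := semver.NewVersion(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s satisfies ^1.4.0: %v\n", raw, constraint.Check(v))
	}
	// Prints true for 1.4.3 and 1.9.0, false for 2.0.0.
}
```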

The App Frameworks team added one additional beautiful rule to their versioning guidance: major versions should only remove deprecated features and not add their replacements. This gives your consumers time in the minor versions to migrate, because the old and new ways of performing that action coexist. I strongly encourage this as the method of deprecating code for all teams.
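
Here is a rough sketch of that rule in Go, using the standard "Deprecated:" comment convention. The package, type, function names, and version numbers are hypothetical; both stages are shown side by side for illustration.

```go
package docs

// Document is a stand-in type for this sketch.
type Document struct {
	ID   string
	Name string
}

// --- In a minor release (say, v1.6.0): the replacement is added alongside the
// old call, so both ways of performing the action coexist while consumers migrate.

// Deprecated: use FetchDocumentByID instead. This will be removed in v2.0.0.
func FetchDocument(name string) (*Document, error) {
	return &Document{Name: name}, nil
}

// FetchDocumentByID is the replacement, introduced in the same minor release
// that deprecates FetchDocument.
func FetchDocumentByID(id string) (*Document, error) {
	return &Document{ID: id}, nil
}

// --- In the major release (v2.0.0): FetchDocument is deleted and nothing new is
// added, so consumers who migrated during 1.x upgrade without further changes.
```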

Let's talk about 1.0.0. It is tempting to stay pre-1.0.0 because you don't think you have nailed down your API. Early in the project that's likely the case, and for a short time that's appropriate. Don't linger here though, as you start causing pain for the other teams that start to use your service. When you break the API they depend on without notice, or when you're unsure whether you have the API correct, reach out to your consumers and talk it through. Once you've done that, look at the work you will need to accomplish, pick a specific point that will be 1.0.0, and stick to it!

Consumers, be wary of services that won't commit to a 1.0.0 release. By using them, you are signing up for an undetermined amount of tech debt while they finish their service. Each time they change their API you will also need to refactor to match. How many times do you want to do that?

Consumer-driven contracts

Traditional testing methods put the burden of defining how a service should behave on the team building it. At first glance that seems logical: they built it, so they should know how it should work. But when your service is consumed by another team, that's incorrect. The team consuming it is the one that knows how they need it to work. This idea is key to a service's existence: if you are building a service and you have no consumers, why are you building it?

Consumer-driven contracts turn testing on its head: they give the teams consuming your service a way to submit tests that define how they expect it to behave, while still allowing both projects to be tested in isolation.

In the HTTP world, Pact is a common way to define those contracts. Contract testing for Frugal-based interactions at Workiva is still in its early days, but it's showing great promise in allowing us to simplify our testing environments while maintaining the same level of confidence.

The huge advantage of contract testing shows up with our many versions of production. Imagine a test suite that depends on three other services, each of which currently has three versions deployed. Will you need to stand up all 27 combinations of service versions (three choices for each of the three dependencies) to be confident that your change will work when it's deployed? With contracts you only need to set up one service at a time, and they are often fast enough that running against multiple versions of your dependencies is practical.
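
To make the consumer side concrete, here is a rough sketch of an HTTP contract test using the Pact Go library's v1 DSL (pact-go). The service names, endpoint, and payload are invented for illustration; the consumer declares the interaction it relies on, exercises its own code against Pact's mock provider, and the recorded contract is later verified against each deployed version of the provider.

```go
package documents

import (
	"fmt"
	"net/http"
	"testing"

	"github.com/pact-foundation/pact-go/dsl"
)

func TestDocumentServiceContract(t *testing.T) {
	// The consumer defines the interaction it depends on; Pact spins up a mock
	// provider and records the contract for the provider team to verify later.
	pact := &dsl.Pact{
		Consumer: "reporting-ui",
		Provider: "document-service",
	}
	defer pact.Teardown()

	pact.AddInteraction().
		Given("document 42 exists").
		UponReceiving("a request for document 42").
		WithRequest(dsl.Request{
			Method: "GET",
			Path:   dsl.String("/documents/42"),
		}).
		WillRespondWith(dsl.Response{
			Status:  200,
			Headers: dsl.MapMatcher{"Content-Type": dsl.String("application/json")},
			Body:    dsl.Like(map[string]interface{}{"id": "42", "title": "Q2 Report"}),
		})

	// The test body runs the consumer's real HTTP call against the mock provider.
	err := pact.Verify(func() error {
		url := fmt.Sprintf("http://localhost:%d/documents/42", pact.Server.Port)
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status: %d", resp.StatusCode)
		}
		return nil
	})
	if err != nil {
		t.Fatal(err)
	}
}
```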

Graceful degradation

When you are adding new product features that depend on new services, the process of rolling them out to production becomes more complicated. You definitely need the new service, or the additions to an existing service, for your feature to work, but you can't be sure they will be available when you deploy.

In these situations, think about how you might build your application in a way that handles the new service not being available. Perhaps you return a message indicating that the operation is not available, or maybe you disable something in the UI. This gives you an opportunity to improve the experience for your users and to log metrics so you can track these failures.

The extra handling you add can serve another purpose too: helping you identify issues in production faster. If you log messages about the exact service that is unavailable, you will be able to react to issues very quickly.
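
As a sketch of what that might look like, assume a hypothetical new comments service called over HTTP. If the service is unreachable, the feature degrades to a clear "not available" state instead of failing the whole view, and the failure is logged with enough detail to identify the missing dependency.

```go
package comments

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Comment is a stand-in type for this sketch.
type Comment struct {
	Author string `json:"author"`
	Text   string `json:"text"`
}

// Client calls a hypothetical new comments service.
type Client struct {
	BaseURL string
	HTTP    *http.Client
}

// NewClient uses a short timeout so a missing service fails fast instead of
// hanging the rest of the page.
func NewClient(baseURL string) *Client {
	return &Client{BaseURL: baseURL, HTTP: &http.Client{Timeout: 2 * time.Second}}
}

// ListComments degrades gracefully: if the new service is unreachable, it logs
// exactly which dependency failed (so production issues are easy to spot) and
// tells the caller the feature is unavailable rather than erroring the whole view.
func (c *Client) ListComments(docID string) ([]Comment, bool) {
	resp, err := c.HTTP.Get(fmt.Sprintf("%s/documents/%s/comments", c.BaseURL, docID))
	if err != nil {
		log.Printf("comments-service unavailable, hiding comments panel: %v", err)
		return nil, false // caller disables the comments UI and shows "not available"
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("comments-service returned %d, hiding comments panel", resp.StatusCode)
		return nil, false
	}

	var comments []Comment
	if err := json.NewDecoder(resp.Body).Decode(&comments); err != nil {
		log.Printf("comments-service response unreadable, hiding comments panel: %v", err)
		return nil, false
	}
	return comments, true
}
```

In a real service you would likely emit a metric alongside the log line, but the shape is the same: detect the failure, degrade the feature, and record which dependency caused it.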

Conclusion

Effectively building software with a microservices architecture is very different from building a monolith, and it brings different challenges, including having many versions of each service considered "production." We should challenge ourselves to examine the processes and tools we have today and ask whether they are effective with this architecture. Is there something in your team's process or tools that could be altered to make delivering changes faster and easier?
Microsoft Office is a registered trademark of Microsoft Corporation in the United States and/or other countries. © 2018 Google LLC All rights reserved. Google Docs is a trademark of Google LLC.