In the time that I’ve been in the quality organization at Workiva, the impact of changes to the methods and means by which we deliver product quality has been substantial, challenging, and positive. It came as no real surprise last year when we started to see the emerging methodologies and trends in the way that new product development teams would be seeking to develop, test, and deliver some of our new projects.
It’s a difficult thing admitting that sometimes what got you to where you stand will never take you close to where you’re going. I believe that this is a key distinction that sets our R&D organization apart from the rest of the pack. We haven’t been afraid to adapt and pivot, doing so rapidly and frequently where needed. However, doing so also means that there are some additional challenges incurred in maintaining an exceptional product platform through engineering quality.
When the many facets of a project change, it can be challenging to have the whole organization move together, but with several years and countless hours of collaboration as an organization, we have achieved many incredible milestones.
A few significant figures to note:
- We deliver Wdesk an average of 4 times per week.
- Those releases contain an average of 20,000 lines changed across 26.5 tickets.
- We have delivered exceptional levels of product functionality and quality demonstrated by customer satisfaction and net promoter scores.
- An average release currently undergoes over 9 days of automated testing within the release cycle running approximately 7,000 automated functional tests.
These are staggering metrics that demonstrate the rigor and attention to detail required to achieve the goal set by our company leadership: that we are a world-class software organization. But alongside recognizing these successes, it’s time that we take a step back and look at the means by which we achieved them to determine the path forward.
Without the efforts of our Test Development team the past few years to engineer quality into our products and processes, we wouldn't be able to deliver our products with the same high level of quality that we have achieved.
Looking back, when we started investigating automated functional testing, the group consisted of four team members. This team worked on a project that the quality organization believed would be necessary to achieve the goals that were seen in the future. Team leadership saw the countless clones of manual regression test suites, hours and hours of manual execution, and the nauseating repetition of executing test cases. The tedious process of documenting and maintaining those test suites would not be sustainable at the pace we knew was required to achieve the organization’s goals.
Fast forward a few years, and several mature testing solutions which have ultimately resulted from those early projects are still the means by which we verify release candidates. When we began executing automated tests using an internal project named Kitty Hawk, 14 outdated laptops stacked on a cart managed our automated test suites, the kind of machines that IT had taken back because someone had been upgraded.
The test machine cart in all its glory:
This rabble of QA members and pile of old laptops have grown into a mature testing team and infrastructure. That infrastructure is how Workiva determines the viability of a release candidate on a near-daily basis. Certainly there are limitations, but today we are more limited by the performance of the application under test than by the testing frameworks themselves. By all accounts, this is a measure of tremendous success and value.
With our success, though, we have come to realize that after years of delivering testing that met the needs of our development and release model, there is a need to significantly adjust the patterns (or really anti-patterns) of how we test our applications. More accurately, where and how we test applications, when we design and implement new test tooling, and even how we decide when to not deliver testing. Yes, sad days. TL;DR your headline can read “Admissions from QA manager that sometimes it’s better not to test something.” Here’s the caveat—how not to test something within our previous understanding of how we implement automated testing.
The earlier Workiva days were such that we developed and shipped products and features rapidly. We didn’t always consider the full future impacts of how something should be tested. We saw a need for test coverage and implemented it in the best way we knew how at the time. In hindsight, we became victims of our own test success. We wrote thousands of tests that absolutely ensure quality but do so in a way that wasn’t engineered into our products and that carries some downsides and inefficiencies.
Sure this model works, but at what cost?
- Money? (You'd shudder at our cloud service bills!)
- Personal well being of our team members? ¯_(ツ)_/¯ (You’d have to ask their families.)
Time is something we can measure, so let's look at that.
For example, as a service (note: service, and this is important), server PDF translations once executed 325 individual translation/comparison test cases for a full regression suite of the product.
Each test looked something like this:
- Log in and import a document
- Send the document revision to the translation
- API receives the translation and processes the document
- Callback with translated document for download
- Save document
- Call comparison to a goldfile to ensure the outcome matches expectations
If that sounds slow, it was. The full set of tests took on average 100 minutes to execute (parallelized across available machines). Executed serially, this breaks down to an equivalent of approximately 21.6 hours of testing time for the service.
After recognizing the many pitfalls and inefficiencies of testing the service through the application UI, we pivoted. The team's QA, working with test development and a number of other team members, implemented and delivered tooling to call the service directly for the translation. As an additional win, the audit of existing tests also uncovered inefficiencies in the way we had previously decided to separate test cases, yielding further gains.
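A direct service test of this kind might look roughly like the sketch below. It is only an illustration, not the tooling the team built: `translate_document` is a hypothetical stand-in for the real service call (here it just upper-cases the input so the example runs without a network), and the goldfile is an in-memory byte string rather than a stored file.

```python
def translate_document(document: bytes) -> bytes:
    """Hypothetical stand-in for a direct call to the translation service.

    A real test would send the document to the service endpoint. This stub
    upper-cases the bytes so the example is runnable on its own.
    """
    return document.upper()


def matches_goldfile(actual: bytes, goldfile: bytes) -> bool:
    """Compare the translated output against the stored expected output."""
    return actual == goldfile


def run_service_test(document: bytes, goldfile: bytes) -> bool:
    """One headless test case: call the service directly, then compare.

    There is no login, no import through the UI, and no download step.
    """
    translated = translate_document(document)
    return matches_goldfile(translated, goldfile)


if __name__ == "__main__":
    print(run_service_test(b"quarterly report", b"QUARTERLY REPORT"))  # True
```

The point of the shape is that a failure here can only come from the service or the goldfile, not from the client, the login flow, or the browser.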
The translations service test framework was rolled into our test management system several months ago. It currently executes 239 tests with a total average execution time of 50 minutes, effectively cutting the parallelized test time of the service in half and also reducing occurrences of test failures not related to the intended service under test.
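For a sense of scale, the figures quoted in this post can be worked through in a few lines. This is back-of-the-envelope arithmetic only; the "effective parallelism" number assumes tests were spread perfectly evenly across machines, which is an assumption and not something measured.

```python
# Figures quoted in the post.
ui_tests = 325             # UI-driven translation tests
ui_parallel_min = 100      # average parallelized run time, in minutes
ui_serial_hours = 21.6     # equivalent serial run time, in hours

svc_tests = 239            # direct service tests
svc_parallel_min = 50      # average parallelized run time, in minutes

# Derived values.
ui_serial_min = ui_serial_hours * 60
ui_min_per_test = ui_serial_min / ui_tests               # about 4 minutes each
effective_parallelism = ui_serial_min / ui_parallel_min  # about 13x
wall_clock_reduction = 1 - svc_parallel_min / ui_parallel_min
test_count_reduction = 1 - svc_tests / ui_tests

print(f"UI-driven test, average duration: {ui_min_per_test:.1f} min")
print(f"Effective parallelism: {effective_parallelism:.1f}x")
print(f"Wall-clock reduction: {wall_clock_reduction:.0%}")
print(f"Test count reduction: {test_count_reduction:.0%}")
```

Roughly four minutes per UI-driven test is what you would expect from a case that has to log in, import, wait on a callback, and download before it can compare anything.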
So, what happened? Did we test it wrong originally? Yes and no. Yes, the testing could have been done more efficiently by some other means. We could have used a service test harness or a similar process for testing the service. No, because at the time that was how we were able to test the translations APIs with the existing test frameworks. Not having done so at the time would have had significant impacts on testing of this service.
In our need to move quickly, we utilized what was available. Teams didn’t spin up their own test frameworks, and test development wasn’t resourced for projects beyond what was currently in progress. This occurred mostly due to bandwidth limitations and requirements of maintaining and updating the existing system. Honestly, we didn’t recognize it until we were far down the path.
What is the outcome? Consider the following:
A generally accepted testing pattern: The Testing Pyramid
- A small percentage of end-to-end tests, which test the full stack of an application in a replicated production environment.
- A larger bed of integration tests ensuring that service endpoints and various APIs integrate as expected. These are crucial in a service oriented system.
- A foundation of unit tests ensuring that the code does what it’s supposed to through the permutations of how it can be executed.
A fairly common testing (anti) pattern: Test Ice Cream Cone
- A large volume of manual testing.
- A high number of automated end-to-end tests.
- A smaller number of service integration tests.
- And a narrow foundation of unit tests.
On one hand, we have a system where the large majority of tests execute in nanoseconds or milliseconds. On the other, we have one where the majority of tests execute in minutes or, at best, tens of seconds.
One looks delicious, one looks…robust?
Who doesn’t love ice cream? But much like that large cookie dough blizzard, you’re going to eventually regret this decision. The Wdesk client application at its current state follows something in the middle of these two—very much an hourglass.
What does this indicate? That we've ventured into the land of functional test bloat. It’s simple if you look at how Wdesk is designed in the most basic sense. We have a client, a handful of services, and a Python server. The server runs on infrastructure (Google App Engine™) that is, in most ways, mature and doesn’t require an extreme amount of intentional testing.
We run approximately 34,000 unit tests per build, approximately 6,000 automated functional tests per release, and around 1,000 service or integration tests per release candidate regression test run.
Additionally, there is a layer of integration tests within our Python unit test framework that can be run ad hoc. Consider the testing pyramid and the testing ice cream cone (anti-pattern). Which is it? Which should it be? How is this considered in the product life cycle?
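One rough way to answer the first of those questions is to compare the layer sizes directly. The sketch below is a simplification with assumptions worth stating: it treats the automated functional tests as the end-to-end layer, ignores manual testing and the ad hoc integration tests, and uses hypothetical names throughout.

```python
def suite_shape(unit: int, integration: int, end_to_end: int) -> str:
    """Classify a test suite by the relative size of its three layers."""
    if unit > integration > end_to_end:
        return "pyramid"
    if end_to_end > integration > unit:
        return "ice cream cone"
    if integration < unit and integration < end_to_end:
        return "hourglass"
    return "other"


if __name__ == "__main__":
    # Approximate counts quoted in the post.
    print(suite_shape(unit=34_000, integration=1_000, end_to_end=6_000))  # hourglass
```

With a wide unit base, a thin service layer, and a heavy functional layer on top, the counts land on the hourglass described earlier.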
So what now?
With all of this in mind, it’s time to make some changes to how we understand and deliver product quality and, more directly, how we test. We need to approach testing with a mindset of how we design, engineer, deliver, and iterate.
For the future ecosystems of complex microservice applications to be tested effectively, we must adjust. Without an intentional shift, the potential for failure is high if our future model of a microservice integration ecosystem is built on a foundation of client end-to-end testing. This is a recipe for slow, arduous delivery and test value lapses/failures.
Ok, so we’re doomed…great.
Not at all. Our belief as an organization is that we want to deliver the highest quality software as quickly as possible. We are a world-class software organization and acting as such should be a non-negotiable in how we proceed.
To succeed on the road ahead, we need to understand it and act accordingly, which means our testing and quality assurance must happen early and at the most efficient level. The future of our product quality (and ultimately our success; who wants to use a buggy product?) depends on delivering effectively on this method.
Because of this, we believe the following things:
- We believe in delivering only the highest quality software to our customers.
- We believe in building testable applications.
- We believe in the value of unit testing.
- We believe in headless service testing.
- We believe in service integration testing where possible.
- We believe that testing should happen early to ensure quality throughout the development cycle and in a measurable, ongoing way.
- We believe that efficiency is to be valued over speed, never sacrificing quality.
- We believe end-to-end tests should be kept to a minimum and used only for use cases requiring the full stack.
- We believe that testing needs to be designed and engineered as a part of any product development life cycle.
- We believe that doing so will enable teams to deliver quality software most efficiently.
- We believe in breaking the testing anti-pattern.
- We believe in designing and delivering testing by the most efficient means across our organization, using the right tools for the tasks.
- We believe that world class software requires that we engineer quality into the foundation of our products and processes.
- We believe this will not happen by accident.
The next generation of Workiva quality
As we move into the next generation of Wdesk, empowered by these things, we are continuing to achieve the goals set before us. We will ultimately deliver an exceptional next generation data platform. We are designing and implementing world-class testing, solutions, and tooling. We are ensuring that teams can deliver high-quality functionality and testing in parity, and this is resulting in the ability to ship software faster than ever.
Through testing that is not only thorough and effective but also efficient, we are enabled to release as often as we like. We can do this without flaky tests, last-minute reworks, unknown dependencies, or lack of meaningful test coverage. Issues can be identified earlier, lowering the impact to the overall development cycle. We know that what has taken us this far won't take us through the next steps. And we look forward to delivering the next generation of Wdesk more efficiently than ever.
Google App Engine is a trademark of Google Inc.