Across the business landscape, we’re seeing a wholesale movement of services to the Cloud as companies wake up to the capacity and flexibility that come with shifting away from hosting systems on their own physical servers.
While there are clear advantages to migrating to the Cloud, there are challenges too, such as layers of added complexity.
Testing these cloud-based systems to make sure they stand up under duress is important, but the heightened complexity of these environments means that the role of Quality Assurance (QA) has had to evolve accordingly. The traditional method of testing a service was like checking that each light on a set of traffic signals was working correctly. In a Cloud environment, where the effective working of a platform or application relies on a multitude of variables, a binary approach to QA is simply not fit for purpose.
To address an environment as complex as the Cloud, the testing mechanism needs to be correspondingly robust and aggressive. That’s why Thales has developed the Chaos Engine: to help businesses make sure their Cloud-based services are resilient enough to withstand a multi-point failure.
Why is the Cloud more complex?
Cloud resources, such as data storage and computing power, are typically more expensive than an equivalent physical server. However, ‘traditional’ servers need to have enough resources to handle a business’s busiest times, which means they sit with idle resources whenever demand is lower.
Cloud-hosted applications can manage the number of virtual servers they request from a cloud provider at any given time. They can call on more resources when there is extra traffic, and then return them to the cloud provider when the demand drops. This creates a dynamic hosting environment that responds in step with usage requirements.
Accomplishing this, however, depends on configuring complex rules for scaling up and down based on measurable metrics. Get this right, and you can keep your costs down and offer your users a seamless experience. Even small errors in these rules can result in runaway costs, a poor user experience, or sometimes both.
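To make the idea of a scaling rule concrete, here is a minimal sketch in Python. It is purely illustrative: the `ScalingRule` class, its thresholds and the `desired_servers` function are hypothetical names chosen for this example, not part of any cloud provider's API, and real rules usually weigh several metrics at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical threshold rule; every value here is an assumption."""
    scale_up_above: float = 0.75    # average CPU utilisation that adds a server
    scale_down_below: float = 0.25  # average CPU utilisation that removes one
    min_servers: int = 2            # floor that protects the user experience
    max_servers: int = 10           # cap that guards against runaway costs


def desired_servers(current: int, avg_cpu: float, rule: ScalingRule) -> int:
    """Return the server count the rule asks for, clamped to its limits."""
    if avg_cpu > rule.scale_up_above:
        target = current + 1
    elif avg_cpu < rule.scale_down_below:
        target = current - 1
    else:
        target = current
    return max(rule.min_servers, min(rule.max_servers, target))
```

Even in a toy like this, the risk is visible: drop the `max_servers` cap and a sustained spike keeps adding servers indefinitely, while a floor set too low leaves users waiting when traffic returns.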
What is the Chaos Engine?
At its core, the idea of the Chaos Engine is self-explanatory. It is designed to create chaos in the testing environment and push the service, along with the virtual servers it runs on in the Cloud, to the very limits of what it can do.
Let’s think about that traffic analogy again: instead of testing each light on a traffic signal in turn, the Chaos Engine tries to close a busy city intersection during rush hour and observes how the city traffic reacts to such a disruptive event.
This is the Chaos Principle. The aim is to create as many as possible of the random faults that could reasonably occur in a real application deployment. This means switching off or randomly degrading some part of the system and seeing whether it can stay alive. Think of this as deliberately initiating the virtual equivalent of our very own adrenaline-induced fight or flight mechanism.
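The principle can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the Chaos Engine's own interface: the function names, the `fault_rate` parameter and the "minimum healthy instances" availability check are all hypothetical, and the instances are just strings in a list rather than real servers.

```python
import random


def inject_random_faults(instances, fault_rate, rng):
    """Switch off each simulated instance with probability fault_rate."""
    return [name for name in instances if rng.random() >= fault_rate]


def run_experiment(instances, fault_rate, min_healthy, seed):
    """Run one simulated chaos experiment and report what happened."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    survivors = inject_random_faults(instances, fault_rate, rng)
    return {
        "terminated": sorted(set(instances) - set(survivors)),
        "survivors": survivors,
        # Assumed health criterion: enough instances are still running.
        "available": len(survivors) >= min_healthy,
    }


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3", "web-4"]
    print(run_experiment(fleet, fault_rate=0.5, min_healthy=2, seed=7))
```

Seeding the random generator is a deliberate choice in this sketch: the faults are random, but an experiment that exposes a weakness can be repeated exactly while the fix is verified.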
QA for the intricacies of the Cloud
As the Cloud increasingly becomes the default home for systems, it’s ever more challenging to triage an outage while it is happening. Add to that the fact that downtime is becoming extremely costly, with global averages suggesting that a company loses $300,000 per hour if its public-facing systems go down, and it’s clear why there’s a financial imperative to catch these issues before they happen.
The Chaos Engine is designed to ask challenging questions in a live environment: ‘What happens if this falls down?’ or ‘What could possibly go wrong?’
Businesses such as Netflix already work with chaos principles, but more often than not, services hosted on the Cloud are using more traditional QA methods, effectively leaving the good running of their operations to chance.
Creating ultra-resilient environments
We built the Chaos Engine as an open source project to help businesses demonstrate the resilience of their services and prove that they are meeting any KPIs they measure against. We’re also welcoming contributions to make sure that the project itself is versatile, dynamic and effective for as many different use cases as possible. After all, businesses in every sector are moving towards the Cloud.
Cloud environments offer a range of benefits to businesses, but all of this will count for little if platforms, containers and APIs fail as soon as they’re put under even a modicum of pressure. To ultimately create order, we need chaos.