Robust, resilient IT systems are crucial to data-driven operations. Whether these systems drive internal processes or deliver customer-facing services, the need for reliability and availability remains the same. So, why would you deliberately try to break your services?

What is ‘Chaos Monkey’?

Chaos engineering does just that – deliberately terminating instances in your production environment. Online video streaming service Netflix was one of the first organizations to popularize the concept with their Chaos Monkey engine.

When Netflix began migrating to the cloud in 2010, they found a potential problem with hosted infrastructure – hosts could be terminated and replaced at any moment, potentially affecting quality of service. To ensure a smooth streaming experience, their systems needed to be able to manage these terminations seamlessly.

To assist with testing, Netflix developers created ‘Chaos Monkey’. This application runs in the background of Netflix operations, terminating services randomly.
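The core idea can be illustrated with a minimal sketch. This is not Netflix's implementation: the instance names, the `pick_victim` and `run_round` helpers, and the `probability` parameter are all hypothetical, and removing a name from a set stands in for a real termination call.

```python
import random


def pick_victim(instances, rng, probability=1.0):
    """Return one randomly chosen instance name, or None if nothing is hit."""
    # Skip the round if the fleet is empty or the dice roll says "not this time".
    if not instances or rng.random() >= probability:
        return None
    # Sort first so the choice is reproducible for a given seed.
    return rng.choice(sorted(instances))


def run_round(instances, rng, probability=1.0):
    """Terminate at most one instance and report which one was chosen."""
    victim = pick_victim(instances, rng, probability)
    if victim is not None:
        instances.discard(victim)  # stand-in for a real terminate call
    return victim


# Hypothetical fleet used only for illustration.
fleet = {"web-1", "web-2", "web-3"}
rng = random.Random(42)
victim = run_round(fleet, rng)
print(f"terminated {victim}; {len(fleet)} instances remain")
```

A service built to survive such a round should keep serving traffic from the remaining instances while a replacement is started.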

The Netflix Chaos Monkey is perhaps the best-known example of chaos engineering. As cloud services mature, the methodology is likely to become more popular.

Why would you deliberately break your IT systems?

At the heart of the chaos engineering model is the concept of deliberately breaking things in your production environment. But why would you do that? Why not restrict testing to the dev environment?

“No amount of testing can prove software right; a single test can prove software wrong.”

– Amir Ghahrai

Using chaos engineering principles, you introduce an important element of randomness into testing and accelerate the process of identifying single points of failure. System failures are rarely predictable, and the chaos monkey can surface issues that have not been previously considered. If you only ever test for what you think may break, other important issues may be overlooked.

In this way, random outages help to keep testing honest. The testing scripts cannot be skewed, shortened, or cheated, and every fault identified is real – you can literally see the problem and its effects.

Benefits of Chaos Engineering

Conducting tests on the production system carries considerable risk, so you will probably need a relatively robust, mature platform before you unleash the chaos monkey. There are, however, clear benefits.

First, you probably do not have to replicate your entire production environment for testing, which helps to reduce costs. It is also almost impossible to properly simulate the effects at scale in a development environment.

“The impact of an extended outage would depend on the scale of the cloud provider: an incident that takes a top-three cloud provider offline in the US for three to six days would result in losses of between $6.9bn and $14.7bn, and between $1.5bn and $2.8bn in industry insured losses. A cyber-incident that takes a 10th to 15th placed cloud provider offline in the US for three to six days would result in losses of between $1.1bn to $2.1bn and between $220m and $450 million in industry insured losses.”

– ZDNet

Second, there is an added incentive to address issues quickly. Any breakages caused by the chaos monkey need to be fixed as fast as possible to maintain an adequate level of service for customers. It’s also worth remembering that building fixes against the production environment can shorten time to deployment, because the fix is validated where it will run.

Breaking things the correct way

Developing meaningful fixes after a chaos monkey breakage is often a two-step process: a quick ‘patch’ to restore operations, followed by a more in-depth code update.

Chaos tests are best performed in four cases:

  1. When deploying new code
  2. When adding dependencies
  3. As usage patterns change
  4. When mitigating problems

Although random, chaos tests should not be completely uncontrolled. In many cases, the monkey should only be unleashed on a sub-section of the system to test a specific hypothesis. Only if that test passes should you widen the scope to assess other parts of the system.
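One way to picture this staged approach is the sketch below. It assumes a hypothetical `run_staged_experiment` helper, made-up service names, and a simple hypothesis (every service group keeps at least one healthy replica); a real experiment would inject faults through your own tooling and check real health signals.

```python
import random
from typing import Callable, List, Sequence


def run_staged_experiment(
    stages: Sequence[Sequence[str]],
    inject_fault: Callable[[str], None],
    hypothesis_holds: Callable[[], bool],
    rng: random.Random,
) -> List[str]:
    """Hit one random target per stage; stop widening once the hypothesis fails."""
    hit = []
    for scope in stages:
        target = rng.choice(list(scope))
        inject_fault(target)
        hit.append(target)
        if not hypothesis_holds():
            break  # do not move on to a wider scope
    return hit


# Simulated system state: names of replicas that are currently healthy.
healthy = {"checkout-1", "checkout-2", "checkout-3", "search-1", "search-2"}


def inject(target: str) -> None:
    healthy.discard(target)  # stand-in for a real fault injection


def holds() -> bool:
    # Hypothesis: each service group still has at least one healthy replica.
    groups = ("checkout", "search")
    return all(any(n.startswith(g) for n in healthy) for g in groups)


stages = [
    ["checkout-1", "checkout-2", "checkout-3"],  # narrow scope first
    ["checkout-1", "checkout-2", "checkout-3", "search-1", "search-2"],  # wider
]
hit = run_staged_experiment(stages, inject, holds, random.Random(7))
print(f"targets hit: {hit}; hypothesis still holds: {holds()}")
```

Stopping at the first failed hypothesis keeps the blast radius small: the wider stage only runs once the narrow one has shown the system copes.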

Along with randomly crashing services, chaos engineering also requires an effective monitoring system. This will help you assess, for instance, the impact and severity of an outage and its effect on the user experience. Application tracing is critical for identifying the source of any failure and the modules that require work.

Wherever the chaos monkey exposes blind spots in your system design, application monitoring can help you understand them better. This allows you to formulate robust fixes and updates. Monitoring will also allow you to assess the efficacy of each fix and verify both that future outages can be prevented and that the system continues to meet your performance requirements.

From a customer/user-facing perspective, monitoring also allows you to assess the impact on digital user experience:

  • How has the outage affected performance?
  • Does the degraded service fall below the required standard?
  • Did the outage breach any SLAs?
  • Are you dealing with a single point of failure, or are there multiple factors at fault?
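The first three questions can be turned into simple checks against monitoring data. The sketch below is only an illustration: the `assess_outage` helper, the 99.9% availability target, and the 1.5x slowdown limit are assumed values, not figures from any particular SLA or monitoring product.

```python
def availability(window_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    return 1.0 - downtime_seconds / window_seconds


def assess_outage(
    baseline_ms: float,
    observed_ms: float,
    downtime_seconds: float,
    window_seconds: float,
    sla_target: float = 0.999,   # assumed availability target
    max_slowdown: float = 1.5,   # assumed acceptable slowdown factor
) -> dict:
    """Summarize performance impact and SLA status for one outage."""
    slowdown = observed_ms / baseline_ms
    avail = availability(window_seconds, downtime_seconds)
    return {
        "slowdown": slowdown,                       # how performance changed
        "availability": avail,
        "below_standard": slowdown > max_slowdown,  # degraded past the limit?
        "sla_breached": avail < sla_target,         # availability under target?
    }


# Hypothetical numbers: one hour of downtime in a 30-day window,
# with response time rising from 200 ms to 260 ms during the test.
report = assess_outage(
    baseline_ms=200.0,
    observed_ms=260.0,
    downtime_seconds=3600.0,
    window_seconds=30 * 24 * 3600.0,
)
print(report)
```

With these example inputs the slowdown stays within the assumed limit, but an hour of downtime in a 30-day window falls short of a 99.9% availability target.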

With ongoing application and performance monitoring, you can continue to assess user experience. Importantly, the chaos engineering and development teams can also provide empirical proof of any improvements or failings. By taking the guesswork out of patches and fixes, you can better allocate resources to the tasks that will yield the greatest benefit to your users.

Maximizing your chaos engineering potential

eG Enterprise offers extensive application and platform monitoring functions, allowing you to assess current system health – and the effects of every chaos monkey-inspired failure.
