Wednesday, October 3, 2018

Chaos Monkey

This is hilarious, had to share: https://en.wikipedia.org/wiki/Chaos_Monkey

Every company pays lip service to the idea of "resiliency" - that is, the ability to carry on business-as-usual in spite of technical issues - and vast amounts of time and money are spent setting up "solutions" which are only ever tested when something bad does happen, and then rarely work as expected. Netflix decided to turn this on its head and in 2011 created "Chaos Monkey," a tool that would, at a random time, disable a random server, just to see what would happen. As someone described it, "Imagine a monkey entering a data center and randomly destroying devices. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will do."

While intentionally shutting down a production server in the middle of the day seems crazy, it's exactly what a resilient system is designed to handle.  And of course it's not just servers crashing that you need to worry about, but also network problems, security issues, etc. Netflix eventually built the "Simian Army," a collection of chaos monkeys, each designed to find a different kind of fault or problem.  (My favourite is the "Chaos Gorilla" that shuts down an entire availability zone!)
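For the curious, the core idea is simple enough to sketch in a few lines of Python. This is only a toy simulation under my own assumptions, not Netflix's actual tool: the server names, the `min_required` threshold and the `run_chaos_round` helper are all made up for illustration, and nothing real gets switched off.

```python
import random


def pick_victim(servers, rng):
    """Return one randomly chosen server name, or None if the fleet is empty."""
    # Sort first so the choice depends only on the RNG, not on set ordering.
    return rng.choice(sorted(servers)) if servers else None


def run_chaos_round(healthy, min_required, rng):
    """Pretend to disable one random server and report whether we'd survive.

    healthy      -- set of (hypothetical) server names currently up
    min_required -- assumed minimum number of servers needed to keep serving
    rng          -- a random.Random instance, passed in so tests can seed it
    """
    victim = pick_victim(healthy, rng)
    remaining = set(healthy) - {victim}
    survived = len(remaining) >= min_required
    return victim, remaining, survived


if __name__ == "__main__":
    fleet = {"web-1", "web-2", "web-3"}  # hypothetical fleet
    victim, remaining, survived = run_chaos_round(fleet, 2, random.Random())
    print(f"Disabled {victim}; {len(remaining)} left; survived: {survived}")
```

The interesting part isn't the random choice, it's the `survived` check: if losing any single server takes you below what you need, you've learned that on your own terms rather than at 3am.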

That said, most of us don't manage a system like Netflix, don't have a need for constant uptime and don't have the time or resources to "design for failure."  But if we don't regularly change our tyres on a sunny day, how can we be expected to change a flat in the middle of the night in the pouring rain?  We throw a spare tyre in the boot but then never check it until we need it, and hope it's okay.  Having two servers running in two locations doesn't give you "resiliency" as much as it gives you "complexity," "synchronisation issues" and "maintenance headache."

So what are the other options?  Today a co-worker proposed a purpose-built emergency system, something I'd never considered before.  Rather than complicating and overloading your production system, most of which isn't even required in an emergency, build a parallel system that is just the bare essentials.  Since it's a separate system, you can test it easily and control it better.  You know exactly what capabilities it supports and it's available at a moment's notice.  Think of it as the nuclear bunker of backup systems.

The only problem is business mindset, because by defining an emergency system the business has to decide, "What is important?" which is a question they never like to answer.  In addition, the business has to spend money to create a system that they never plan to use, as opposed to hiding the cost of "resiliency" in a new system.  And of course not every failure constitutes an emergency, so if one server fails and a system goes down you wouldn't invoke your emergency procedures, but you will lament that that particular system was not "resilient."

Of course, there's no one-size-fits-all solution, which is what keeps me employed.
