A couple of years ago Netflix introduced a concept called the “Simian Army”. The idea was to build a suite of open-source, automated tools that tested the Netflix cloud’s resilience to various failure scenarios.
One key soldier in this army, still in use today, is “Chaos Monkey”, which randomly shuts down servers in the Netflix infrastructure to test an application’s ability to withstand server failures. If a Chaos Monkey is running free in your infrastructure and your service nonetheless stays up, you know you can handle server failure effectively. We think a similar approach should apply to securing cloud infrastructure.
What is the Infected Chaos Monkey?
To that end, I’d like to propose a hypothetical new cybersecurity tool called the Infected Monkey. This tool would spin up an infected virtual machine in random parts of your cloud infrastructure, much as Chaos Monkey wreaks havoc inside your datacenter, to test for potential security failures. By “inside”, I mean behind the firewall and behind whatever other perimeter defenses you maintain for your computing infrastructure.
The Monkey machine itself would be loaded with all sorts of aggressive malware that would actively try to spread and infect everything around it. We’d make sure to infect the Monkey VM with the latest and greatest viruses out there, without their actual destructive payloads, of course. Just like the original Netflix Chaos Monkey, the Infected Monkey would run only within a predefined time frame.
Why release the Infected Chaos Monkey into your cloud?
Security breaches happen all the time, and they rarely unfold exactly the way you expected or planned for. Your infrastructure should therefore be able to withstand a breach of its exterior security layer and also handle the infection of internal servers. Cloud security needs to be designed for a perimeter breach just as cloud apps need to be designed for server failure. The way to know that you are indeed ready and safe would be to periodically release a tool like the Infected Monkey inside your cloud.
When should I let the Infected Monkey run loose?
While a real breach will happen at the worst possible moment, the Infected Monkey would wake up in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problem.
If we gave you the Monkey, would you run it inside your cloud infrastructure? Would you feel confident that it would not inflict a tremendous amount of damage? Probably not. Unfortunately, in most organizations today, if a VM inside the datacenter gets infected, there is very little to stop it from spreading the infection to the servers around it. In short, most companies don’t have security robust enough to take this chance.
We believe you should have a system in place that can detect an infection inside your datacenter, understand the semantics of the attack, mitigate its spread and remediate infected hosts, all in real time. Recent changes in datacenter infrastructure make building such a system entirely realistic. So perhaps soon, your infrastructure will boast a built-in “immune system” that can handle, and reject, breaches as they happen.