Despite all the cloud talk (and where I live is something of a cloud mecca), the cloud is still quite new for enterprises, and many are just starting to think about it. A hard lesson that many of us learn (and partly how we amass our scars) is to design for failure. Those who run things in their enterprise data center are quite spoilt, I think. Failures are rare, and if a machine or some state goes down, moving to another one isn’t really a big deal (of course it is a little more complex than that, and not to say there isn’t any downtime, business loss, etc.).

When thinking about a cloud migration (hybrid or otherwise), a key rule is that you are guaranteed to have failures – in many places – and those cannot be treated as exceptional conditions, but rather as the normal, expected behavior you design for. As a result, your app/service/API/whatever needs to be designed for failure. And it is not only about how loosely you couple your architecture to handle these situations, but also about how the response isn’t binary (yay, or a fancy 404); rather, it is a degraded experience, where your app/service/API/whatever still performs, albeit in a degraded mode.
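To make that concrete, here is a minimal sketch of a degraded mode in Python – the `fetch_recommendations` call, the in-memory cache, and the fallback values are all made up for illustration:

```python
import logging

log = logging.getLogger("recommendations")

# Hypothetical in-process cache of the last good response per user.
_last_good: dict = {}


def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a call to a downstream service; here it always fails."""
    raise TimeoutError("downstream service unavailable")


def get_recommendations(user_id: str) -> list:
    """Return fresh results when we can; degrade gracefully when we can't."""
    try:
        fresh = fetch_recommendations(user_id)
        _last_good[user_id] = fresh  # remember the last good answer
        return fresh
    except Exception as exc:
        log.warning("recommendations degraded for %s: %s", user_id, exc)
        # Degraded mode: serve stale data or a generic default instead of an error page.
        return _last_good.get(user_id, ["top-sellers", "new-arrivals"])


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(get_recommendations("alice"))  # still returns something useful
```

The point isn’t the caching trick itself; it is that the caller gets a usable answer rather than an error, and the failure shows up in your logs instead of in front of the user.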

Things that can throw one off, and are food for thought (not exhaustive, nor in any particular order):

  • Managing state (when failure is guaranteed)
  • Latency – the cloud is fast, but slower than your internal data center; you know – physics. :) How are your REST APIs handling latency, and do they degrade gracefully when calls are slow? (See the retry sketch after this list.)
  • “Chattiness” – how talkative are your things on the wire? And how big are the payloads?
  • Rollback, or fall forward?
  • Lossy transfers (if data structure sizes are large)
  • DevOps – the mashing up of Developers and Operations (what some call SRE) – you own the stuff you build, and you are responsible for running it.
  • AutoScale – most think this is to scale up, but it also means to scale down when resources are not needed.
  • Physical deployments – regional deployments vs. global ones – there isn’t a right or wrong answer, it frankly depends on the service and what you are trying to do. Personally, I would lean towards regional first.
  • Production deployment strategies – there are various ways to skin a cat and none is right or wrong per se (except, please don’t do a basic deployment – that is suicide). I am used to A/B testing, but also what is now called Blue/Green deployment. Read up more here. And of course use some kind of a deployment window (one that works for your business) – this allows you and your team to watch what is going on, and take corrective action if required.
  • Automate everything you can; yes, it’s not free, but you recoup that investment pretty quickly – and you will still have hair on your scalp!
  • Instrument – if you can’t measure it, you can’t fix it. (See the timing sketch after this list.)
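
On the latency bullet above – a minimal sketch of wrapping a remote call with an explicit timeout and a capped, backed-off retry, so a slow or flaky dependency doesn’t stall the caller indefinitely. It assumes the `requests` library, and the URL and tuning numbers are just illustrative:

```python
import time

import requests


def get_with_retries(url: str, timeout_s: float = 2.0, attempts: int = 3) -> requests.Response:
    """GET with an explicit timeout and exponential backoff between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of retries; let the caller degrade gracefully
            time.sleep(2 ** (attempt - 1))  # back off: 1s, 2s, ...


# Hypothetical usage:
# orders = get_with_retries("https://api.example.com/v1/orders").json()
```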
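And on instrumenting – you don’t need a full telemetry stack to start; here is a minimal sketch of a timing decorator that logs how long each call takes (the function and log format are just placeholders for whatever metrics pipeline you actually use):

```python
import functools
import logging
import time

log = logging.getLogger("metrics")


def timed(func):
    """Log the wall-clock duration of every call to the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.1f ms", func.__name__, elapsed_ms)
    return wrapper


@timed
def lookup_order(order_id: str) -> dict:
    time.sleep(0.05)  # stand-in for real work
    return {"id": order_id, "status": "shipped"}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    lookup_order("42")
```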

Again, this is not an exhaustive list, but rather meant to get one thinking. There are also some inherent assumptions – e.g. automation and production deployment suggest there is some automated testing in place, and a CI/CD strategy with supporting tools.

Bottom line – when it comes to cloud (or any other distributed architecture), the best way to avoid failure is to fail constantly!