Andy Crouch - Code, Technology & Obfuscation ...

Distribute Your Cloud Services

[Image: smashed computer screen. Photo: Unsplash]

Today has been an interesting day.

Today is proof that putting all your eggs in one basket is a terrible idea.

Today proved that even if you are Amazon, your cloud services can still let you down. S3 (Amazon’s Cloud Storage) went down in the East Coast region and a large number of sites went down with it. Some sites didn’t go down but lost some functionality. Slack, Quora, Medium and Imgur were all affected. I am certain we will find out the cause in the next few days, although people don’t seem to care once a service is back. While it is down, all hell breaks loose and Twitter makes for interesting reading.

What this highlights is the trade-off you deal with by embracing cloud services. I am old enough to remember pre-cloud. This was a time when running a couple of servers would take 40% of a developer’s time. It could also cost you hundreds a month to host, not including licence fees if you needed them. Devops back then was far more involved than it is now. You installed and configured everything on the box, in most cases from the OS up. These days the OS is irrelevant and deployment can be as easy as a git push. You want to set up a virtual machine? Piece of cake. You want to spin up a website or server for an API? 5-minute job. Pre-cloud, people had jobs doing nothing but operating servers for small companies. Cloud providers automated these jobs away at many companies.

The cloud is the perfect answer to what most developers desire. Developers by nature do not want to deal with or be System Administrators. They do not want to deal with servers, licensing and patching security vulnerabilities. They want to code and create and solve problems and then deploy their solution. Click and forget once that initial setup wizard is complete.

So the Cloud sounds great for developers. In many instances, it is the perfect solution. But once you gain users, your hosting stability and availability are far more important than your developers’ convenience.

Each of the vendors has multi-region solutions. They market the fact that their platform or services are resilient. They sell the dream that continuity will prevail and your application will be running even if an outage occurs. This is not the case without a fair amount of effort up front, and usually more during any moderate incident. It is something that as developers we need to give a lot more thought to. It should not be an afterthought; otherwise, it comes back to haunt you.

At Open Energy Market we have suffered two major outages on Microsoft’s Azure platform. Both of these happened during trading hours. Both caused our customers to lose access to our platform. Even from the start, our infrastructure has been set up so we had redundancy: replicated services across different regions and regular fail-over testing. We were doing everything by the book, and then Azure’s DNS routing suffered an issue and we were down. Having many regions gives you a sense of security until the infrastructure that connects them fails. Then all you can do is sit and wait until Microsoft fixes the issue.

Let’s not hide the fact that their communication at these times is not the best. Most developers would rather know what the issue is, and how and when it will be fixed, than simply be told that there is an issue. Open, transparent communication is key.

Something interesting happens as well. You explain to your team the situation and say things like “this is bad but anothercompany.com are also down”. Somehow this conveys a sense that we have picked a great platform and are not the only ones suffering. We try and justify our shortcomings by grouping ourselves with stalwarts of the internet that should know better. This brings little comfort to the end users or your team who are having to liaise with them.

There is another interesting point: platform lock-in. These platforms each offer their own take on services and server types that we take for granted. It is very easy to design your codebase around one provider’s SDK. Azure Functions and AWS Lambda both offer serverless functionality. They are incompatible, and you cannot switch from one to the other without modifying your code.
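One way to limit that modification is to keep the business logic free of any provider types and push the provider-specific shapes into thin adapters. The following is a minimal sketch under stated assumptions, not the code we run: `handle_request`, `azure_style_handler` and the payload fields are hypothetical names made up for illustration. The Lambda adapter assumes an API Gateway proxy-style event with a `body` string, and the Azure-style adapter assumes only a request object that exposes `get_body()` returning bytes; in a real Azure Function you would wrap its result in the SDK's response type.

```python
import json
from typing import Any, Dict, Tuple


def handle_request(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Provider-neutral business logic (hypothetical example)."""
    name = payload.get("name", "world")
    return {"message": f"hello {name}"}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Adapter shaped like an AWS Lambda handler behind an API Gateway proxy."""
    # Assumption: the request body arrives as a JSON string under "body".
    payload = json.loads(event.get("body") or "{}")
    return {"statusCode": 200, "body": json.dumps(handle_request(payload))}


def azure_style_handler(req: Any) -> Tuple[int, str]:
    """Adapter for any request object exposing get_body() -> bytes.

    Returns (status, body) so this sketch runs without the Azure SDK installed.
    """
    payload = json.loads(req.get_body() or b"{}")
    return 200, json.dumps(handle_request(payload))
```

Only the two adapters know which platform they run on, so moving providers means rewriting a few lines of glue rather than the logic itself.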

Once bitten and all that. The second outage happened on the 15th of September last year, when we were already designing a Disaster Recovery solution. Irony is explaining that we are building a DR solution during an outage. We had taken the decision that we would stick with Azure, as we’d made a fair investment in it, but we needed a backup. So we built a replica environment on AWS. We had to amend our code to refactor out Azure SDK-specific code. We set up bi-directional database replication agents. We also amended our document handling architecture. A new caching mechanism that persists to both Azure and AWS storage was developed. We now replicate any service we use in Azure on AWS. The solution is not perfect. It does give us a fallback though, and will see us through our current redevelopment phase.
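To give a feel for the "persist to both" idea, here is a rough sketch of a mirrored store. It is an illustration under assumptions rather than our actual mechanism: `BlobStore`, `InMemoryStore` and `MirroredStore` are hypothetical names, and the in-memory backend stands in for real wrappers around Azure and AWS storage clients.

```python
from typing import Dict, Optional, Protocol


class BlobStore(Protocol):
    """Minimal interface a storage backend must offer (hypothetical)."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in backend; a real one would wrap a cloud storage client."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}
        self.available = True  # flip to False to simulate an outage

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError("backend unavailable")
        self._items[key] = data

    def get(self, key: str) -> Optional[bytes]:
        if not self.available:
            raise ConnectionError("backend unavailable")
        return self._items.get(key)


class MirroredStore:
    """Writes to both backends; reads from whichever can answer."""

    def __init__(self, primary: BlobStore, secondary: BlobStore) -> None:
        self._stores = (primary, secondary)

    def put(self, key: str, data: bytes) -> None:
        errors = []
        for store in self._stores:
            try:
                store.put(key, data)
            except ConnectionError as exc:
                errors.append(exc)
        # Fail only if no backend accepted the write.
        if len(errors) == len(self._stores):
            raise errors[0]

    def get(self, key: str) -> Optional[bytes]:
        for store in self._stores:
            try:
                data = store.get(key)
            except ConnectionError:
                continue  # this backend is down, try the next one
            if data is not None:
                return data
        return None
```

A write that lands on only one side leaves the two backends out of step, so a real system also needs some form of reconciliation once the failed provider recovers. That is part of why the solution is not perfect.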

That brings me to my final point about cloud hosting (for now). It is not ideal to only consider its impact at the end of your development effort. Designing your infrastructure as you build out your application (or service) is key. As we develop our new codebase at Open Energy Market, we are considering not only how we are going to host our services but also whether they are portable across multiple cloud providers.