Bernard Golden on Dealing with Cloud Outages the Netflix Way

Posted on April 9th, 2013


Bernard Golden, Enstratius

Bernard Golden is Vice President, Enterprise Solutions for Enstratius, a leading cloud management software company. Named “one of the 10 most influential people in cloud computing” by Wired, Golden is the cloud computing advisor for CIO Magazine, where his blog is read by tens of thousands of people each month. He is also the author of four books on virtualization and cloud computing.

Boundary:  Recently, you wrote:  “More application outages are caused by what’s going on in the application than are ever caused by infrastructure failure—and this is becoming even more true because of the increasingly complex nature of applications.” Explain this in more detail.

Golden: If you look at the dominant IT model of 10 years ago, a typical application was hosted on a single instance on a single server, and you maybe had three different layers of software in total. Over time, we have grown the number of servers and the instances running on them. Now we have multiple servers, multiple tiers, and even on the code level you’ve got additional components all supporting that one application. This complex infrastructure has developed over the past decade to respond to greater application loads and the need for an extra layer of resiliency and availability.

B: What is the overall effect of systems complexity on IT operations?

G: It’s a lot more complex to track 30 or 40 instances and all these components. Let’s say the app is showing poor performance from the end-user perspective. The IT person needs to figure out whether it is a problem in the database, code that is running slowly, the way objects are managed in that code, a network issue, or a web server issue. There are myriad places that could be causing problems, versus years ago when it was just that one server hanging off one network.

B: So, what’s a CIO to do about it when it comes to creating a more fail-proof, manageable environment?

G: There are a number of things to do, but first, moving to horizontal scaling and redundancy is an obvious step. Beyond that, you have to instrument and log the app, using monitoring tools. That helps track things down so you can say, okay, it’s not my database, but maybe it’s the caching tier. The third area is being more proactive through load testing. Application issues tend to show up under heavy load, not under normal conditions, so you need to simulate that spike in activity somehow. The fourth area is what I call the Netflix approach. This company proactively disables parts of its application to see if the app can still recover and perform. You purposely break something, fix it, and then modify the environment to handle that condition.
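As an illustration of the break-and-verify idea Golden describes, here is a minimal Python sketch. It is not Netflix's tooling. The `Instance` and `Pool` classes and the `survives_single_failure` helper are hypothetical names invented for this example, and the "application" is simulated in memory rather than running on real servers.

```python
import random


class Instance:
    """Hypothetical stand-in for one application server."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """Horizontally scaled pool that fails over to the next instance."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # try the next redundant instance
        raise RuntimeError("all instances are down")


def break_one(pool, rng):
    """Deliberately disable one randomly chosen healthy instance."""
    healthy = [i for i in pool.instances if i.healthy]
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def survives_single_failure(pool, rng, requests=100):
    """Break one instance, send traffic, and report whether the pool coped."""
    victim = break_one(pool, rng)
    try:
        for n in range(requests):
            pool.handle(f"req-{n}")
        return True
    except RuntimeError:
        return False
    finally:
        victim.healthy = True  # restore the instance after the experiment
```

Under these assumptions, a pool of three instances passes the check while a pool of one fails it, which is the kind of weakness a deliberate break is meant to expose before a real outage does.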

B: Are very many companies doing these four things?

G: Not really. Many are still not even doing horizontal scaling to protect against resource failure, and that is a well-proven approach that has been around for decades. So proactive load testing is not being done much, and breaking and fixing is not really being done at all, although it should be. Netflix is one of the few organizations doing that. They have a toolset called the Simian Army, a set of testing tools that shut down or break things and even simulate a full-blown outage. Netflix has also released these tools as open source for public use.

B: You don’t seem to be sure that APM is effective in this new world of application complexity. But software like Boundary continually pulls in real-time information so that IT managers can spot troubling trends before they become issues. That way they can actually adapt the environment on-the-fly to prevent outages and slowdowns.

G: Well, first, you have to do APM, even though not everyone is doing it. The previous generation of tools is capable but challenging to use, because those tools are expensive and add to your application burden: you have to install and manage them too. The new generation of tools is cloud-based, reduces your overhead, and is cloud-oriented by architecture, so these tools are capable of addressing issues in multi-tier environments. They can do a good job of monitoring apps in the cloud as well as in traditional environments. The newer tools get you part of the way for sure. One of the big challenges is that you don’t know what’s going to break or what is going to happen in certain situations. Netflix is one of the most advanced companies on the planet in these practices, and they have very sophisticated monitoring and metrics. But they still go and do the other piece, to prepare for the situation in which maybe an entire data center hosted on Amazon goes down. Monitoring is necessary but insufficient in preparing for unusual yet potentially realistic scenarios. So you simulate that scenario and then use the APM tools to determine what will happen to the app once those conditions are in place.

B: Not every company has the resources internally to do that type of testing though.

G: It’s an investment, no question. You need to do a risk-reward assessment to see how necessary it is. Netflix can’t afford to be offline because someone somewhere wants to watch videos. That is their whole business.  If they are down it is a big problem. Yet many companies are becoming in effect Netflix-like in that they have significant digital systems, but don’t see how much they are dependent on them…until something breaks. Then if they don’t have these APM tools and operational practices they are in a world of hurt. We were working with a company that had a $150 million line of business running on Amazon. They found out that the service had been down for something like a day and they didn’t even know it. I don’t know how that happened but this demonstrates why being down, for a revenue-generating or customer-facing operation, is really a big problem and you’ve got to have these tools and practices in place.

B: On to cloud and other infrastructure service providers. How can companies work with them to achieve the best possible service and manage risk?

G: Most people say you have to negotiate an SLA with your provider, but in my view that’s really ineffective. These companies depend on uptime and they do their very best to keep the business up, but there’s no 100% guarantee. All of these providers have outages. It’s not just about the contract you have, but about having sufficient resources behind your application, doing load testing, and all the other things that we talked about earlier. If your business warrants it, you need to consider additional measures like multi-region redundancy. Technically or economically, that’s not possible for every business.

Connect with Bernard on Twitter: @bernardgolden.

