By Lori MacVittie
January 27, 2010 05:45 PM EST | Reads: 2,819

I haven’t heard the term “graceful degradation” in a long time, but as we continue to push the limits of data centers and our budgets to provide capacity it’s a concept we need to revisit.
You might have heard that Twitter was down (again) last week. What you might not have heard (or read) are some interesting crunchy bits about how Twitter attempts to maintain availability by degrading capabilities gracefully when services are over capacity.
“Twitter Down, Overwhelmed by Whales” from Data Center Knowledge offered up the juicy details:
The “whales” comment refers to the “Fail Whale” – the downtime mascot that appears whenever Twitter is unavailable. The appearance of the Fail Whale indicates a server error known as a 503, which then triggers a “Whale Watcher” script that prompts a review of the last 100,000 lines of server logs to sort out what has happened.
When at all possible, Twitter tries to adapt by slowing the site performance as an alternative to a 503. In some cases, this means disabling features like custom searches. In recent weeks Twitter.com users have periodically encountered messages that the service was over capacity, but the condition was usually temporary. For more on how Twitter manages its capacity challenges at times of heavy load, see Using Metrics to Vanquish the Fail Whale.
I found this interesting and refreshing at a time when the answer to capacity problems is to just “go cloud.” Even if (and that’s a big if) “the cloud” were truly capable of “infinite scale” (it is not), most organizations’ budgets are almost certainly not capable of “infinite payments,” and cloud computing isn’t free.
It’s been many years, in fact, since the phrase “graceful degradation” has been uttered within my hearing, but that’s really what the article is describing and it’s something we don’t talk enough about. Perhaps that’s because it’s difficult to admit that there are limitations – whether technical or financial – on the ability to scale and meet demand. But there are, and if organizations are wise they’ll include in their application delivery strategy the means by which applications and services can “degrade gracefully.”
Twitter’s solution, the disabling of specific features, is a particularly easy way to implement such a strategy for Web 2.0 applications; at least it’s particularly easy if you have a solution capable of network-side scripting mediating for the applications.
GRACEFUL DEGRADATION
The reason it’s particularly easy to gracefully degrade Web 2.0 applications is that there is generally a 1:1 mapping between “functions” and “URIs.” This is often true for the web-facing interface, almost always true for RESTful APIs, and always true for SOAPy endpoints.

What you need to do is identify those “premium” URIs, i.e. those that can be disabled without negatively impacting core services, so that they can be “degraded” in the face of an overwhelming volume of requests.
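To make the idea of “premium” URIs concrete, here is a minimal sketch of how such a list might be expressed. The URI prefixes and the `is_premium` helper are hypothetical names chosen for illustration; the real list depends entirely on which features your application can live without.

```python
# Hypothetical URI prefixes for features that can be switched off under load.
# These are illustrative only; they are not taken from any real application.
PREMIUM_PREFIXES = ("/search/custom", "/api/trends", "/widgets")


def is_premium(path: str) -> bool:
    """Return True if the request path maps to a feature that may be degraded."""
    path = path.split("?", 1)[0]  # ignore the query string
    # Match the prefix itself or anything beneath it, segment by segment,
    # so "/widgets/top" matches but "/widgetsfoo" does not.
    return any(path == p or path.startswith(p + "/") for p in PREMIUM_PREFIXES)
```

Because functions map to URIs, the whole classification stays a simple path check that an intermediary can evaluate on every request.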
You also need an intermediary. This can be a load balancer, assuming it’s capable of providing the flexibility in configuration necessary to enable and disable service to specific URIs, i.e. it must be layer 7 aware. It has to be an intermediary through which all requests are routed because individual servers do not have the visibility required to be able to “see” the total requests and all responses. The fact that a server is throwing back 503 (Service Unavailable) errors indicates it doesn’t have the resources available to respond to a request, which means it won’t be able to respond to any requests, including those to disable services. Only an architecture that includes an intermediary of some kind (a reverse proxy) can achieve this solution.
The network-side script, which is deployed on the application delivery platform (load balancer), should implement logic that triggers degradation based on receiving 503 errors. It should probably not trigger on a single 503, or on multiple 503s from the same application instance, because that pattern may indicate a problem with that one instance rather than a lack of capacity. That means the scripting solution needs to be able to take action based on a pattern of behavior coming from all application instances in conjunction with the total number of requests being received from users.
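The trigger logic can be sketched roughly as follows. This is an illustration in Python rather than any particular vendor’s scripting language, and the `CapacityDetector` class, its window size, and its thresholds are all assumptions of mine. It looks at a sliding window of recent responses and only reports a capacity problem when 503s come from more than one instance and make up a meaningful share of all requests.

```python
from collections import deque


class CapacityDetector:
    """Decide whether recent 503s look like a capacity problem (hypothetical sketch)."""

    def __init__(self, window=200, min_error_rate=0.2, min_instances=2):
        # Sliding window of (instance_id, status) pairs for the most recent responses.
        self.recent = deque(maxlen=window)
        self.min_error_rate = min_error_rate  # assumed threshold, tune per site
        self.min_instances = min_instances    # 503s from one instance are ignored

    def record(self, instance_id, status):
        """Record the status code returned by one application instance."""
        self.recent.append((instance_id, status))

    def over_capacity(self):
        """True when 503s are spread across instances and frequent enough."""
        if not self.recent:
            return False
        failing = {inst for inst, status in self.recent if status == 503}
        errors = sum(1 for _, status in self.recent if status == 503)
        rate = errors / len(self.recent)
        return len(failing) >= self.min_instances and rate >= self.min_error_rate
```

A single misbehaving instance never trips the detector on its own, which leaves that case to ordinary health monitoring.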
Yes, it has to be context-aware.
Once it’s determined that the errors are being generated due to a lack of capacity, the scripting solution needs to disable one or more of the specific URIs determined to be “premium” or ancillary. The intermediary can then respond to subsequent requests for the disabled URIs with custom content based on the expected response type. For example, if it’s an API call it might be appropriate to return a pre-formatted response in the appropriate data format indicating service is currently unavailable. Many network-side scripting solutions are capable of returning pre-formatted responses or they can be customized to provide more detail – it’s really up to the implementer to decide what information is included and how.
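As a rough illustration of a pre-formatted response, the sketch below picks a body format from the request’s `Accept` header. The `degraded_response` function, the message text, and the choice to answer with a 503 plus a `Retry-After` header are my assumptions, not a prescribed implementation; the substring matching is deliberately simplistic and ignores quality values.

```python
import json


def degraded_response(accept: str, retry_after: int = 120):
    """Build (status, headers, body) for a request to a disabled URI.

    Hypothetical helper: the format is chosen from the Accept header
    using simple substring checks.
    """
    message = "This feature is temporarily unavailable."
    headers = {"Retry-After": str(retry_after)}
    if "application/json" in accept:
        headers["Content-Type"] = "application/json"
        body = json.dumps({"error": "service_unavailable", "message": message})
    elif "text/html" in accept:
        headers["Content-Type"] = "text/html"
        body = f"<html><body><p>{message}</p></body></html>"
    elif "xml" in accept:
        headers["Content-Type"] = "application/xml"
        body = (
            "<error><code>service_unavailable</code>"
            f"<message>{message}</message></error>"
        )
    else:
        headers["Content-Type"] = "text/plain"
        body = message
    return 503, headers, body
```

An API client asking for JSON gets a small JSON error document it can parse, while a browser gets a readable page; what else goes into the body is up to the implementer.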
The premise is that as premium or ancillary services are degraded (disabled), application instances will be able to focus on servicing core requests and return service to normal for those pieces of the application. When the volume of requests returns to within normal operating parameters for the capacity available, the intermediary can restore service to the previously degraded services.
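One way to sketch the degrade-and-restore cycle is a small switch that waits for load to stay below a threshold for several consecutive checks before restoring service, so that features don’t flap on and off. The `DegradationSwitch` class, the `load_ratio` input (request volume divided by known capacity), and the default thresholds are assumptions made for the sake of the example.

```python
class DegradationSwitch:
    """Track whether premium URIs are currently disabled (hypothetical sketch)."""

    def __init__(self, restore_below=0.8, calm_checks=3):
        self.degraded = False
        self.restore_below = restore_below  # assumed fraction of capacity
        self.calm_checks = calm_checks      # consecutive calm checks required
        self._calm = 0

    def update(self, over_capacity: bool, load_ratio: float) -> bool:
        """Feed in one observation; return True while premium URIs stay disabled."""
        if over_capacity:
            # Capacity errors seen: disable premium URIs and reset the calm counter.
            self.degraded = True
            self._calm = 0
        elif self.degraded:
            # Count consecutive checks where load is back within normal range.
            self._calm = self._calm + 1 if load_ratio < self.restore_below else 0
            if self._calm >= self.calm_checks:
                self.degraded = False
                self._calm = 0
        return self.degraded
```

Requiring several calm checks in a row is a design choice on my part; restoring on the first quiet moment risks tipping the application straight back over capacity.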
SCALABILITY is NEVER REALLY INFINITE
From a technological point of view “infinite scale” is not possible. At some point the volume of requests will reach boundaries that simply cannot be overcome, be they limitations on the load balancer (there is a limit to how many servers can ultimately be load balanced, and bandwidth is not unlimited) or on the application infrastructure itself. After all, you can’t launch a new instance of an application if there are no physical resources left on which to launch it.
It is almost certainly the case, however, that you will hit a financial limitation before reaching the technical limits of an “infinitely scalable” environment. Or it may be the case that you haven’t jumped on the “cloud” bandwagon and what you see is what you get: a limited number of physical resources running a finite number of application instances, and that’s it. In either case, there are limitations on capacity and at some point you may reach them. How you respond to those limitations is an organizational decision, but graceful degradation in a controlled manner is probably more desirable than random, uncontrolled service outages.
Graceful degradation is an acceptable strategy for responding to availability issues and is especially easy to implement for a Web 2.0 application or API. It’s certainly more appealing than the alternative, which leaves every user essentially playing a game of Russian Roulette with availability of your web application.
Copyright © 2010 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Lori MacVittie
Lori MacVittie is responsible for education and evangelism of application services available across F5’s entire product suite. Her role includes authorship of technical materials and participation in a number of community-based forums and industry standards organizations, among other efforts. MacVittie has extensive programming experience as an application architect, as well as network and systems development and administration expertise. Prior to joining F5, MacVittie was an award-winning Senior Technology Editor at Network Computing Magazine, where she conducted product research and evaluation focused on integration with application and network architectures, and authored articles on a variety of topics aimed at IT professionals. Her most recent area of focus included SOA-related products and architectures. She holds a B.S. in Information and Computing Science from the University of Wisconsin at Green Bay, and an M.S. in Computer Science from Nova Southeastern University.