Last In - First Out

The Intersection of Availability, System Management and Security

Degraded Operations - Gracefully

From James Hamilton’s Degraded Operations Mode:

“In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.....We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.”

I've had high-visibility applications fail into 'degraded operations mode'. Unfortunately it has not always been a designed, planned or tested failure mode, but rather a quick reaction to an ugly mess. A graceful degrade plan is better than random degradation, even if the plan is something as simple as a manual intervention to disable features in a controlled manner rather than letting them fail in an uncontrolled manner.

On some applications we've been able to plan and execute graceful service degradation by disabling non-critical features. In one case, we disabled a scheduling widget in order to maintain sufficient headroom for more important functions like quizzing and exams. In other cases, we have the ability to limit the size of shopping carts or restrict financial aid and grade re-calcs during peak load.
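As a rough illustration of planned feature shedding, here is a minimal Python sketch. The feature names echo the examples above, but the load thresholds, the `Feature` class and the `enabled_features` function are hypothetical, and the post describes manual intervention rather than an automatic trigger like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    shed_above: float  # hypothetical load fraction above which the feature is turned off


# Hypothetical thresholds; the real ordering would come from the application owners.
FEATURES = [
    Feature("scheduling_widget", 0.70),
    Feature("grade_recalc", 0.80),
    Feature("financial_aid_recalc", 0.80),
    Feature("quizzing_and_exams", 1.01),  # never shed: load is clamped to at most 1.0
]


def enabled_features(load: float) -> list[str]:
    """Return the names of features that stay on at the given load (0.0 to 1.0)."""
    load = max(0.0, min(1.0, load))
    return [f.name for f in FEATURES if load <= f.shed_above]


print(enabled_features(0.75))
```

The point of the table is that the order in which features are given up is decided ahead of time, not in the middle of the incident.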

Degraded operations isn't just an application layer concept. Network engineers routinely build forms of degraded operations into their designs. Networks have been congested since the day they were invented, and as you'd expect, the technology available for handling degraded operations is very mature. On a typical network, QOS (Quality of Service) policy and configuration is used to maintain critical network traffic and shed non-critical traffic.

As an example, on our shared statewide backbone, we assume that we'll periodically end up in some sort of degraded mode, either because a primary circuit has failed and the backup paths don't have adequate bandwidth, because we experience inbound DoS attacks, or perhaps because we simply don't have adequate bandwidth.  In our case, the backbone is shared by all state agencies, public colleges and universities, including state and local law enforcement, so inter-agency collaboration is necessary when determining what needs to get routed during a degraded state.

A simplified version of the traffic priority on the backbone, from highest priority to lowest, is:

  Router Traffic (BGP, OSPF, etc.)  (highest priority)
  Law Enforcement
  Voice
  Interactive Video
  Intra-State Data
  Internet Data  (lowest priority)

When the network is degraded, we presume that law enforcement traffic should be near the head of the queue. We consider interactive video conferencing to be business critical (i.e. we have to cancel classes when interactive classroom video conferencing is broken), so we keep it higher in the priority order than ordinary data. We have also decided that commodity Internet should be the first traffic to be discarded when the network is degraded.
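To make the ordering concrete, here is a small Python sketch of strict-priority allocation over the traffic classes listed above. It is a simplification, not how the backbone is configured: real QoS policies usually mix priority queuing with bandwidth guarantees, and the `allocate` function and the demand figures are invented for illustration.

```python
# Traffic classes from the list above, highest priority first.
PRIORITY = [
    "Router Traffic",
    "Law Enforcement",
    "Voice",
    "Interactive Video",
    "Intra-State Data",
    "Internet Data",
]


def allocate(capacity_mbps, demand_mbps):
    """Serve each class in priority order until the link capacity runs out."""
    remaining = capacity_mbps
    granted = {}
    for traffic_class in PRIORITY:
        want = demand_mbps.get(traffic_class, 0)
        give = min(want, remaining)
        granted[traffic_class] = give
        remaining -= give
    return granted


# Hypothetical demand (Mbps) on a backup path that only carries 100 Mbps.
demand = {
    "Router Traffic": 1,
    "Law Enforcement": 10,
    "Voice": 20,
    "Interactive Video": 40,
    "Intra-State Data": 100,
    "Internet Data": 200,
}
print(allocate(100, demand))
```

With these made-up numbers the top four classes are served in full, intra-state data gets what is left, and commodity Internet is shed entirely.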

Unfortunately on the part of the application stack that's hardest to scale, the database, there is no equivalent to network QOS or traffic engineering.  As far as I know, I don't have the ability to tag a query or stored procedure with a few extra bits that tell the database engine to place the query at the head of the work queue, discarding other less important work if necessary. It's not hard to imagine a 'discard eligible' bit that could be set on certain types of database processes or on work submitted by certain clients. The database, if necessary, would discard that work, or place the work in a 'best effort' scheduling class and run it if and when it has free CPU cycles.
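The 'discard eligible' idea can be sketched in a few lines of Python. This is purely a thought experiment: `Work`, `DegradableQueue` and the job names are hypothetical, and no database engine is being described here.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Work:
    priority: int   # lower number runs first
    seq: int        # tie-breaker: submission order
    name: str = field(compare=False)
    discard_eligible: bool = field(compare=False, default=False)


class DegradableQueue:
    """Hypothetical work queue that sheds discard-eligible work when it is full."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._heap = []
        self._seq = itertools.count()

    def submit(self, name, priority, discard_eligible=False):
        """Queue the work; return False if it had to be refused."""
        if len(self._heap) >= self.max_depth and not self._shed_one(priority):
            return False
        heapq.heappush(self._heap, Work(priority, next(self._seq), name, discard_eligible))
        return True

    def _shed_one(self, incoming_priority):
        """Drop the least important discard-eligible item ranked below the incoming work."""
        victims = [w for w in self._heap
                   if w.discard_eligible and w.priority > incoming_priority]
        if not victims:
            return False
        self._heap.remove(max(victims))
        heapq.heapify(self._heap)  # restore the heap invariant after the removal
        return True

    def next_work(self):
        """Return the name of the most important queued work, or None if empty."""
        return heapq.heappop(self._heap).name if self._heap else None


q = DegradableQueue(max_depth=2)
q.submit("nightly_report", priority=5, discard_eligible=True)
q.submit("exam_submit", priority=1)
q.submit("quiz_grade", priority=1)  # queue is full, so nightly_report is discarded
```

A 'best effort' scheduling class would be the gentler variant: instead of removing the victim, park it and run it only when nothing else is waiting.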

If the engineers at the major database vendors would Google 'Weighted Fair Queuing' or 'Weighted Random Early Detection' we might someday see interesting new ways of managing degraded databases.
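For readers who have not met it, the core of weighted random early detection is a per-class drop probability that ramps up as the average queue depth grows. Here is a minimal Python sketch of that ramp; the two work classes and their thresholds are hypothetical, and applying the idea to database work is speculation in the spirit of the paragraph above.

```python
def wred_drop_probability(avg_queue_depth, min_threshold, max_threshold, max_probability):
    """Drop nothing below min_threshold, everything at or above max_threshold,
    and ramp linearly up to max_probability in between."""
    if avg_queue_depth < min_threshold:
        return 0.0
    if avg_queue_depth >= max_threshold:
        return 1.0
    span = max_threshold - min_threshold
    return max_probability * (avg_queue_depth - min_threshold) / span


# Hypothetical per-class profiles: (min_threshold, max_threshold, max_probability).
# The 'weighted' part is that less important work starts being dropped earlier.
PROFILES = {
    "best_effort_report": (20, 40, 0.5),
    "interactive_query": (40, 80, 0.1),
}

for work_class, profile in PROFILES.items():
    print(work_class, wred_drop_probability(30, *profile))
```

At an average depth of 30, the best-effort class is already being thinned out while the interactive class is untouched.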

 
 

Creative Server Installs - WAN Boot on Solaris (SPARC)

Sun's SPARC servers have the ability to boot a kernel and run an installer across a routed network using only HTTP or HTTPS. On SPARC platforms, the (BIOS|Firmware|Boot PROM) can download a bootable kernel and mini root file system via HTTP/HTTPS, boot from the mini root, and then download and install Solaris. This allows booting a server across a local or wide area network without having any bootable media attached to the chassis. All you need is a serial console, a network connection, an IP address, a default gateway and a web server that's accessible from the bare SPARC server. You set a few variables, then tell it to boot. Yep, it's cool.

From the Boot PROM prompt (the SPARC equivalent of the BIOS):

OK> setenv network-boot-arguments host-ip=client-IP,
router-ip=router-ip,subnet-mask=mask-value,
hostname=client-name,http-proxy=proxy-ip:port,
file=wanbootCGI-URL

OK> boot net -v install
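The value passed to `network-boot-arguments` is a single comma-separated string of key=value pairs. If you script your installs, a small Python helper along these lines could assemble it. The key names come from the command above; the function name, the addresses (from the 192.0.2.0/24 documentation range) and the CGI URL are placeholders, and treating the proxy as optional is an assumption.

```python
def network_boot_arguments(host_ip, router_ip, subnet_mask, hostname,
                           wanboot_cgi_url, http_proxy=None):
    """Build the setenv command for the Boot PROM from its individual values."""
    pairs = [
        ("host-ip", host_ip),
        ("router-ip", router_ip),
        ("subnet-mask", subnet_mask),
        ("hostname", hostname),
    ]
    if http_proxy:  # assumption: the proxy pair can simply be left out
        pairs.append(("http-proxy", http_proxy))
    pairs.append(("file", wanboot_cgi_url))

    for key, value in pairs:
        # A comma or space inside a value would corrupt the comma-separated list.
        if "," in value or " " in value:
            raise ValueError(f"{key} must not contain commas or spaces: {value!r}")

    return "setenv network-boot-arguments " + ",".join(f"{k}={v}" for k, v in pairs)


# Placeholder values for illustration only.
command = network_boot_arguments(
    host_ip="192.0.2.10",
    router_ip="192.0.2.1",
    subnet_mask="255.255.255.0",
    hostname="newhost",
    wanboot_cgi_url="http://192.0.2.30/cgi-bin/wanboot-cgi",
    http_proxy="192.0.2.20:8080",
)
print(command)
```

The output is one long line to paste at the serial console, followed by the boot command shown above.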

Our base Solaris install is fairly small - on the order of a few hundred megabytes - so booting across a WAN through a proxy or an SSH tunnel works pretty well. We usually build a temporary SSH tunnel from our management  infrastructure out to another server in the same security container and point the new server at the tunnel end point.

PXE is an attempt to provide similar functionality. It's got a dependency on having DHCP available on the deployed subnet, something which I absolutely do not want to enable on non-desktop networks, and it's based on UDP, which makes it slightly less suitable for booting across WANs where packet loss might be an issue. In any case, we've had enough issues with network boots on x86/x64 platforms that we've pretty much defaulted to using bootable USBs or CD/DVDs for remote installs. That makes an x86/x64 deploy significantly more work, as we have to arrange for a bootable USB or CD/DVD to be delivered on site, or we need to leave bootable media installed in production servers.

Linux has 'BKO', but as far as I can tell, it's still dependent on having either bootable media or PXE.

SPARC's WAN boot is pretty slick, but not as slick as Cisco's AutoInstall. AutoInstall allows you to drop ship an unconfigured router to a remote site. The router will learn its IP address from its upstream router via either SLARP or BOOTP, automatically download a configuration file, and reboot with a valid configuration.

A couple of closing thoughts:
  • If the SPARC platform ever goes away, I'll miss it.
  • If router engineers ever decide to build application servers, they'd probably come up with radically new ways of solving old problems. 

 
 

Pandemic Planning – The Dilbert Way

I normally don’t embed things in this blog, but this one is too good to pass up:

Dilbert.com

Deciding who is important is interesting.

Senior management wants to see a plan. Middle manager needs to decide who is important. If Middle Manager says only 8 of 20 are critical, what does that say about the other 12?  The only answer that most managers offer is ‘all my employees are critical to the enterprise’.

I’m assuming that many or most readers have been a part of some sort of pandemic planning. In our EDU system, the plan isn’t interesting because of the criticality of anything that we do. In a major pandemic, deadlines can be extended, semester start and end dates can be changed, faculty can adapt. It’s interesting because of what our facilities can do. In the rural towns served by many of our colleges, the campus is the best connected building in town. In many cases, our college serves as the local or regional backbone connection point for T1s from other state agencies, some of which have critical public health, safety or law enforcement roles. I suspect some of those agencies are more important than an exam, lecture or quiz. It’s possible that for us, the critical resources in a pandemic might not have anything to do with education. HVAC, power, and routers might be the top priority.

Then there’s payroll. You’ve got to keep that going no matter what. Sick employees don’t have the energy to mess with bounced checks and overdrawn accounts.

 
 

T-Mobile Down, Where Do I Go?

Where else? The real time, authoritative source for all things everywhere:

T-Mo-Twitter

I’m not sure if this is a reflection on Twitter or Society.

Probably the latter.