One of the things we system managers dread the most is having the power yanked out from under our servers, something that happens far too frequently (and hits the news pretty regularly). Why? Because we don't trust file systems and databases to gracefully handle abnormal termination. We've all had or heard of file system and database corruption just from a simple power outage. Servers have been getting the power yanked out from under them for five decades, and we still don't trust them to crash cleanly? That's ridiculous. Five decades and thousands of programmer-years of work effort ought to have solved that problem by now. It’s not like it’s going to go away anytime in the next five decades.
In A Crash Course in Failure, Craig Stuntz discusses the concept of building crash-only software, or software for which a crash and a normal shutdown are functionally equivalent.
Highlights:
“Hardware will fail. Software will crash. Those are facts of life.”
"…if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself."
"…maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts."
“…it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.”
Why shouldn't continuous and automatic state saving be the default for any/all applications? A CAD system I bought in 1984 did exactly that. If the system crashed or terminated abnormally, the post-crash reboot would do a complete 'replay' of every edit since the last normal save. In fact you'd have to sit and watch every one of your drawing edits in sequence like a VCR on fast forward, a process that was usually pretty amusing in a Keystone Cops sort of way. It can't be that hard to write serialized changes to the end of the document and only re-write the whole doc when the user explicitly saves it, or to journal every change to another file. That CAD system did it twenty-five years ago on a 4 MHz CPU and 8" floppies. Some applications are at least attempting to gracefully recover after a crash, a step in the right direction. It certainly is not any harder than what Etherpad does, and they are doing it multi-user, in real time, on the Internet.
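A minimal sketch of that journal-and-replay idea, in Python. Everything here is hypothetical and for illustration only: the `JournaledDocument` class, the JSON record format, the `.journal` file name and the two edit operations are my assumptions, not how that CAD system or any real product does it. Each edit is appended to a journal and flushed to disk before it is applied, an explicit save rewrites the whole document, and startup replays whatever the journal holds beyond the last save.

```python
import json
import os


class JournaledDocument:
    """Hypothetical document that journals every edit before applying it."""

    def __init__(self, path):
        self.path = path                      # last explicit full save
        self.journal_path = path + ".journal" # edits since that save
        self.lines = []
        self.seq = 0                          # number of the last applied edit
        self._recover()

    def _apply(self, edit):
        if edit["op"] == "append":
            self.lines.append(edit["text"])
        elif edit["op"] == "delete_last" and self.lines:
            self.lines.pop()
        self.seq = edit["seq"]

    def _recover(self):
        # Startup after a crash is the same code path as a normal startup.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                saved = json.load(f)
            self.lines, self.seq = saved["lines"], saved["seq"]
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "rb+") as f:
            good_bytes = 0
            for raw in f:
                try:
                    edit = json.loads(raw) if raw.endswith(b"\n") else None
                except ValueError:
                    edit = None
                if edit is None:
                    break                     # torn record from a crash mid-write
                if edit["seq"] > self.seq:    # skip edits already in the full save
                    self._apply(edit)
                good_bytes += len(raw)
            f.truncate(good_bytes)            # drop the torn tail, if any

    def edit(self, op, text=None):
        record = {"seq": self.seq + 1, "op": op, "text": text}
        with open(self.journal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())              # on disk before we call it done
        self._apply(record)

    def save(self):
        # Write the full document to a temp file, then swap it in atomically.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"seq": self.seq, "lines": self.lines}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        if os.path.exists(self.journal_path):
            os.remove(self.journal_path)
```

The sequence numbers cover a crash that lands between the full save and the journal cleanup: replay skips any edit the saved document already contains. Calling `os.fsync` on every edit is the conservative choice, and a real application would probably batch those writes.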
“Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system's reaction to specific failures. … If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge.” -- Michael Nygard
References:
A Crash Course in Failure, Craig Stuntz
Design your Failure Modes, Michael Janke
'Everything will ultimately fail', Michael Nygard
A long time ago, shortly after the university I was attending migrated students off of punch cards, I had an assignment to write a batch-based hotel room reservation program. We were on top of the world - we had dumb terminals instead of punch cards. The 9600 baud terminals were reserved for professors, but if you got lucky, [WooHoo!] you could get one of the 4800 baud terminals instead of a 2400 or 1200 baud DECwriter.
The instructor's mantra - I'll never forget - was that students need to learn how to write programs that gracefully handle errors. 'You don't want an operator calling at 2am telling you your program failed. That sucks.' He was a part time instructor and full time programmer who got tired of getting woken up, and he figured that we needed our sleep, so he made robustness part of his grading criteria.
Here's how he made that stick in my mind for 30 years: When the assignment was handed to us, the instructor gave us the location of sample input data files to use to test our programs. The files were usually laced with data errors. Things like short records, missing fields and random ASCII characters in integer fields were routine, and we got graded on our error handling, so students quickly learned to program with a healthy bit of paranoia and lots of error checking.
That was a great idea and we learned fast. But here's how he caught us all: A few hours before the assignment was due, the instructor gave us a new input file that we had to process with our programs, the results of which would determine our grade.
What was in the final data file?
……[insert drum roll here]……
Nothing. It was a zero byte file.
Try to picture this: the data wasn’t available until a couple of hours before the deadline, so it was a frantic dash to get a terminal (long lines of students on most days, especially at the end of the semester), edit the source file to gracefully handle the error and exit (think ‘edlin’ or ‘ed’), submit it into the batch queue for the compiler (sometimes that queue was backed up for an hour or more) and re-run it against the broken data file, all by the deadline.
How many students caught that error the first time? Not many, certainly not me. My program crashed and I did the frantic thing. The rest of the semester? We all had so damned many paranoid if-thens in our code you'd probably laugh if you saw it.
He was teaching us to think about building robust programs - to code for what goes wrong, not just what goes right. For him this was an availability problem, not a security problem. But what he taught is relevant today, except the bad guys are feeding your programs the data, not your instructor. That makes it a security problem.
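For what it's worth, here is roughly what those paranoid if-thens might look like today, sketched in Python. The record layout (room number, guest name, number of nights, separated by commas) and the `load_reservations` name are made up for illustration; I don't remember the real assignment's format. The function never assumes the file has data, a record has all its fields, or a number field holds a number.

```python
def load_reservations(path):
    """Return (reservations, errors). Bad data is reported, never fatal."""
    reservations, errors = [], []
    try:
        # errors="replace" keeps stray non-ASCII bytes from raising.
        with open(path, encoding="ascii", errors="replace") as f:
            lines = f.read().splitlines()
    except OSError as exc:
        return [], [f"cannot read {path}: {exc}"]

    if not lines:
        # The zero byte file: nothing to process, so say so and exit cleanly.
        return [], ["input file is empty"]

    for lineno, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            errors.append(f"line {lineno}: expected 3 fields, got {len(fields)}")
            continue
        room, guest, nights = fields
        if not guest:
            errors.append(f"line {lineno}: missing guest name")
            continue
        try:
            room_no, night_count = int(room), int(nights)
        except ValueError:
            errors.append(f"line {lineno}: room and nights must be integers")
            continue
        if room_no <= 0 or night_count <= 0:
            errors.append(f"line {lineno}: room and nights must be positive")
            continue
        reservations.append((room_no, guest, night_count))
    return reservations, errors
```

The caller gets a list of good records and a list of complaints, and decides what to do with each. Nobody gets called at 2am.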
I can't remember the operating system or platform (PDP-something?), I can't remember the language (Pascal, I think, but we learned SNOBOL and FORTH in that class too, so it could have been one of those), but I'll never forget that !@$%^# zero byte file!
In Hardware is Expensive, Programmers are Cheap II I promised that I’d give an example of a case where hardware is cheap compared to designing and building a more efficient application. That post pointed out a case where a relatively small investment in program optimization would have paid itself back by dramatic hardware savings across a small number of the software vendor's customers.
Here’s an example of the opposite.
Circa 2000/2001 we started hosting an ASP application running on x86 app servers with a SQL Server backend. The hardware was roughly 1 GHz/1 GB per app server. Web page response time was a consistent 2000 ms. Each app server could handle no more than a handful of page views per second.
By 2004 or so, application utilization grew enough that the page response time and the scalability (page views per server per second) were both considered unacceptable. We did a significant amount of investigation into the application, focusing first on the database, and then on the app servers. After a week or so of data gathering we determined that the only significant bottleneck was a call to an XSLT/XML transformation function. The details escape me – and aren’t really relevant anyway, but what I remember is that most of the page response time was buried in that library call, and that call used most of the app server CPU. Figuring out how to make the app go faster was pretty straightforward.
- The app servers were CPU bound on a single library call.
- The library wasn’t going to get re-written or optimized with any reasonable work effort. (If I remember correctly, it was a Microsoft provided library, so the software developers' only option would have been a major re-write.)
- The servers were somewhere around 4 years old and due for a routine replacement.
- The new servers would clock 3x as fast, have better memory bandwidth and larger caches. The CPU bound library call would likely scale with processor clock speed, and if it fit in the processor cache might scale better than clock.
Conclusion: Buy hardware. In this case, two new app servers replaced four old app servers, the page response time improved dramatically, and the page views per server per second went up enough to handle normal application growth. It was clear that throwing hardware at the problem was the simplest, cheapest way to make it go away.
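The reasoning in the list above is a back-of-the-envelope Amdahl's law estimate, which can be sketched in a few lines of Python. The 2000 ms response time and the 3x clock speed come from this post. The fraction of response time spent in the CPU bound library call is an assumption, since all I remember is that it was 'most' of it, so the sketch tries a few guesses. It also ignores the cache and memory bandwidth effects, which could only help.

```python
def estimated_response_ms(current_ms, cpu_bound_fraction, speedup):
    """Amdahl's law: only the CPU bound part of the time shrinks."""
    if not 0.0 <= cpu_bound_fraction <= 1.0:
        raise ValueError("cpu_bound_fraction must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    unaffected_ms = current_ms * (1.0 - cpu_bound_fraction)
    accelerated_ms = current_ms * cpu_bound_fraction / speedup
    return unaffected_ms + accelerated_ms


# 2000 ms and 3x are from the post; the fractions are guesses at 'most'.
for fraction in (0.6, 0.8, 0.9):
    estimate = estimated_response_ms(2000, fraction, 3)
    print(f"{fraction:.0%} of time in the library call -> about {estimate:.0f} ms")
```

Even at the low end of those guesses the new servers cut the response time substantially, which is why the decision didn't need a precise profile.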
In The Quarter Million Dollar Query I outlined how we attached an approximate dollar cost to a specific poorly performing query. “The developers - who are faced with having to balance impossible user requirements, short deadlines, long bug lists, and whiny hosting teams complaining about performance - likely will favor the former over the latter.”
Unless of course they have data comparing hardware, software licenses and hosting costs to their development costs. My preference is to express the operational cost of solving a performance problem in ‘programmer-salaries’ or ‘programmer-months’. Using units like that helps bridge the communication gap.
My conclusion in that post: “To properly prioritize the development work effort, some rational measurement must be made of the cost of re-working existing functionality to reduce [server or database] load versus the value of using that same work effort to add user requested features.”
Related:
The Quarter Million Dollar Query
Hardware is Cheap, Programmers are Expensive
Hardware is Expensive, Programmers are Cheap
Hardware is Expensive, Programmers are Cheap II
I enjoy reading Ivan Pepelnjak's Cisco IOS hints and tricks blog. Having been a partner in a statewide ATM wide area network that implemented end-to-end RSVP, I found his thoughts on What went wrong: end-to-end ATM interesting.
I can't figure out how to leave a comment on his blog though, so I'll comment here:
I'd add a couple more reasons for ATM's failure.
(1) Cost. Host adapters, switches and router interfaces were more expensive. ATM adapters used more CPU, so larger routers were needed for a given bandwidth.
(2) Complexity, especially on the LAN side. (On a WAN, ATM isn't necessarily more complex than MPLS for a given functionality. It might even be simpler).
(3) 'Good enough' QoS on Ethernet and IP routing. Inferior to ATM? Yes. Good enough? Considering the cost and complexity of ATM, yes.
Ironically, core IP routers maintain a form of session state anyway (CEF).
On an ATM wide area network, H.323 video endpoints would connect to a gatekeeper and request a bandwidth allocation for a video call to another endpoint (384 kbps for example). The ATM network would provision a virtual circuit and guarantee the bandwidth and latency end to end. There was no 'best effort'. If bandwidth wasn't available, rather than allowing new calls to overrun the circuit and degrade existing calls, the new call attempt would fail. If a link failed, the circuit would get re-routed at layer 2, not layer 3. Rather than band-aid-add-on QoS like DSCP and priority queuing, ATM provided reservations and guarantees.
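The reserve-or-refuse behavior described above can be sketched as a toy admission control model in Python. This is not ATM signaling or any gatekeeper's real API. The `Link` class, the function names and the 1544 kbps link capacity are hypothetical; only the 384 kbps call size comes from the example above.

```python
class Link:
    """One hop in the path, with a fixed capacity and a running reservation."""

    def __init__(self, name, capacity_kbps):
        self.name = name
        self.capacity_kbps = capacity_kbps
        self.reserved_kbps = 0

    def available_kbps(self):
        return self.capacity_kbps - self.reserved_kbps


def admit_call(path, kbps):
    """Reserve kbps on every link in the path, or reserve nothing at all."""
    if any(link.available_kbps() < kbps for link in path):
        return False              # the new call fails; existing calls are untouched
    for link in path:
        link.reserved_kbps += kbps
    return True


def release_call(path, kbps):
    """Give the bandwidth back when the call ends."""
    for link in path:
        link.reserved_kbps -= kbps


# Hypothetical two-hop path; the second link is the bottleneck.
path = [Link("campus-to-core", 10_000), Link("core-to-remote", 1_544)]
results = [admit_call(path, 384) for _ in range(5)]
print(results)
```

With those assumed numbers the bottleneck link has room for four 384 kbps calls. The fifth is refused outright instead of being squeezed in at everyone's expense, and it can be admitted once an earlier call is released.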
It was a different way of thinking about the network.