Joel on Software: Five Whys

Five Whys

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: “A black swan is an outlier, an event that lies beyond the realm of normal expectations.” Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They’re the kind of things that happen so rarely it doesn’t even make sense to use normal statistical methods like “mean time between failure.” What’s the “mean time between catastrophic floods in New Orleans?”

Somewhere between the “extremely unreliable” level of service, where it feels like stupid outages occur again and again and again, and the “extremely reliable” level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there’s a sweet spot, where all the expected unexpecteds have been taken care of. A single hard drive failure, which is expected, doesn’t take you down. A single DNS server failure, which is expected, doesn’t take you down. But the unexpected unexpecteds might. That’s really the best we can hope for.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

— via Daring Fireball

Every additional “nine” of uptime costs exponentially more, and the cost shows up in two ways. The first is architecture: improving the design often requires extra engineering, testing, and hardware investment, and most startups defer at least one of these until later. The second is people: employee work hours frequently make up for the missing investment in architecture, often at a heavy cost to the first responders.
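To put numbers on what a “nine” means, here is a minimal sketch of the yearly downtime budget at each level of availability. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is made up for illustration. It shows only how quickly the budget shrinks, not what each step costs to achieve.

```python
# Yearly downtime budget for N "nines" of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the minutes of downtime per year permitted at `nines` nines.

    One nine is 90% availability, two nines is 99%, three is 99.9%, and so on,
    so the permitted unavailability is 1 / 10**nines of the year.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(1, 6):
        availability = 100 - 100 / 10 ** n
        budget = allowed_downtime_minutes(n)
        print(f"{n} nine(s): {availability:.3f}% available, "
              f"{budget:,.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: roughly 3.65 days a year at 99%, under 9 hours at 99.9%, and a little over 5 minutes at 99.999%. At that last level, a single slow human response to a page can use up the whole year's allowance.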
