Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: “A black swan is an outlier, an event that lies beyond the realm of normal expectations.” Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They’re the kind of things that happen so rarely it doesn’t even make sense to use normal statistical methods like “mean time between failure.” What’s the “mean time between catastrophic floods in New Orleans?”
…
Somewhere between the “extremely unreliable” level of service, where it feels like stupid outages occur again and again and again, and the “extremely reliable” level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there’s a sweet spot, where all the expected unexpecteds have been taken care of. A single hard drive failure, which is expected, doesn’t take you down. A single DNS server failure, which is expected, doesn’t take you down. But the unexpected unexpecteds might. That’s really the best we can hope for.
To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.
— via Daring Fireball
Every additional “nine” of uptime costs exponentially more in two ways. Improving architecture design often requires extra engineering, testing, and hardware investments; most startups defer at least one of these until later. Employee work hours frequently make up for the lack of investment in architecture, often with great impact on those first responders.
0 responses so far ↓
There are no comments yet...Kick things off by filling out the form below.
You must be logged in to post a comment.