Avoiding Business Disruptions: Lessons from the Recent IT Outage

Right now the world is still reeling from a massive outage, the fault for which is being pinned at the feet of Microsoft and CrowdStrike. And while some of that blame belongs there, I will say a large amount of it needs to be placed squarely on the IT professionals at the impacted businesses.

Should Microsoft release buggy software? Of course not. But let’s be honest here for a moment: we know it happens. It isn’t often this dramatic, but it does happen. Which is why, for decades, Microsoft and others have provided tools to prevent widespread problems just like this one.

The reality is that this seemingly zero-day critical bug was only disastrous because we have been lulled into the standard of always automatically updating our software. Your phone is updating applications without you knowing it as you read this; there are probably a couple of apps updating every single day. There is a name for this: continuous delivery.

Way back in the Windows Server 2003 era, over two decades ago, there was a product known as Software Update Services, later known as Windows Server Update Services (WSUS), which from its very inception was built around the idea of a tiered rollout of updates. What I typically did back then was roll out an update to my daily-driver desktop first. Two days later, the update would be released to a few “power users”: people who would let me know if they noticed anything, but who also knew enough not to get stopped dead in their tracks. That was followed by deploying to one computer in each department, and then company-wide. In 2005 I worked for a company with a few dozen employees.
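To make that concrete, here is a minimal sketch in Python of what a tiered rollout boils down to. The ring names, machine lists, and delays are illustrative assumptions, not the actual WSUS configuration or API:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Ring:
    """One tier of the rollout: who gets the update, and how long after approval."""
    name: str
    members: list[str]   # machine names; purely illustrative
    delay_days: int      # days to wait after the update is first approved

# Illustrative rings, roughly mirroring the rollout described above.
RINGS = [
    Ring("IT desktop",         ["it-desktop-01"],                 delay_days=0),
    Ring("Power users",        ["acct-02", "sales-07", "ops-03"], delay_days=2),
    Ring("One per department", ["acct-01", "sales-01", "ops-01"], delay_days=4),
    Ring("Company wide",       ["<everyone else>"],               delay_days=7),
]

def rollout_schedule(approved_on: date) -> list[tuple[str, date]]:
    """Return (ring name, release date) pairs for an update approved on a given day."""
    return [(ring.name, approved_on + timedelta(days=ring.delay_days)) for ring in RINGS]

if __name__ == "__main__":
    for ring_name, release_date in rollout_schedule(date(2024, 7, 15)):
        print(f"{release_date}: release to {ring_name}")
```

The point is not the tooling; it is that the blast radius of a bad update is bounded by whichever ring is currently receiving it.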

Later, in 2006, when I was at a consulting firm, we took a similar rollout approach. Of course it was more complex than that, but the great thing is that the software managed all of it.

Fast forward to July 2024: if we were still implementing these fundamental best practices from decades ago, the update would likely have BSOD’d my desktop computer, and that is about it. I would have pulled the plug on the update deploying anywhere else. At worst it would have hit one or two other people, but it would not have taken down an entire company. Yet we see posts rolling in about places being completely crippled.
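That “pull the plug” step is really just a gate between rings, and it does not have to rely on a human noticing. Here is a minimal sketch, with a hypothetical crash-report check standing in for whatever telemetry or help-desk signal an environment actually has:

```python
import time

# Hypothetical ring order; the names stand in for whatever groups your tooling defines.
RING_ORDER = ["it-desktop", "power-users", "department-reps", "company-wide"]

def crash_reports_from(ring: str) -> int:
    """Stand-in for a real signal: BSOD telemetry, help-desk tickets, monitoring alerts."""
    return 0  # assume the previous ring is healthy in this sketch

def release_to(ring: str) -> None:
    """Stand-in for the actual deployment step (a WSUS approval, an MDM policy, etc.)."""
    print(f"Releasing update to ring: {ring}")

def staged_rollout(soak_seconds: int = 0) -> None:
    """Promote the update ring by ring, halting as soon as the previous ring reports trouble."""
    for i, ring in enumerate(RING_ORDER):
        if i > 0 and crash_reports_from(RING_ORDER[i - 1]) > 0:
            print(f"Halting rollout: crashes reported from {RING_ORDER[i - 1]}")
            return
        release_to(ring)
        time.sleep(soak_seconds)  # soak time before promoting to the next ring

if __name__ == "__main__":
    staged_rollout()
```

With even that simple gate in place, the first ring takes the hit and everything downstream never sees the bad update.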

As I type this I am still on hold, both on the phone and on a chat-text service, with United Airlines (@United), who after over seven hours has still failed to answer the phone, a text, or a DM. Have their call centers been so obliterated by this? I read reports of 911 centers and hospitals being brought to their knees. This is not acceptable.

If you’re a business contracting out your IT services to an outside management company, often referred to as an MSP (managed service provider), and your company was brought to a halt, then you should rethink who you are using. Shop around, get references, and ask who was not impacted by this; that is the company you should be working with. We often use the wrong metrics to choose the MSP for our businesses, largely because these sorts of things (following best practices for software updates, backups, business continuity, and so on) take time and energy, and most managed service providers don’t want to spend the money on them. Often, business managers don’t want to spend money on things like this either.

Regardless of how they let you down on this one, here is a drill you should run them through (tell them you’ll pay the hourly labor for it, because you really need to know this in advance). Call them up after hours and let them know that your servers are all gone: fire, theft, a car came through the wall, be creative. You need your business back up and operational by the start of business tomorrow. And… go. Don’t talk theoreticals; you want to see a functioning business as fast as possible. If they cannot deliver, then what are you really paying for with your backups? What have they done regarding business continuity planning? So many have this on their website, but so few can actually deliver on it. And what better time to figure that out than before the next big thing?

If you’re not outsourcing, then your CTO and/or Director of Technology has a LOT of explaining to do — this should not have happened on their watch.

It is not just Boeing that has become complacent; so many businesses and departments have as well.