By now, most of the people affected by the CrowdStrike outage last weekend should be able to boot up their laptops and read this blog – so it feels like a good time to share our key takeaway from the chaos that unfolded. After all, software testing is our game, and as we’ve commented before, so many major software outages are down to software that wasn’t adequately tested.
So, here we go.
Software testing done by customers likely wouldn’t have prevented the outage.
Sorry.
The truth is that even with the most robust testing procedures – and we are strong advocates of robust testing procedures – you can only test an update if a vendor allows you to. With CrowdStrike, as they say in their own blog on the topic:
The configuration files mentioned above are referred to as “Channel Files” and are part of the behavioral protection mechanisms used by the Falcon sensor. Updates to Channel Files are a normal part of the sensor’s operation and occur several times a day in response to novel tactics, techniques, and procedures discovered by CrowdStrike. This is not a new process; the architecture has been in place since Falcon’s inception.
So with these updates happening automatically in the background, there was no opportunity for CrowdStrike’s customers to test the updated software – and even if there had been, the burden of testing several times a day would be considerable, since you’d want to be more thorough than “does a computer with these new channel files installed actually start?”
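To make the point concrete, here’s a rough sketch of what even that bare-minimum “does it start?” check might look like, if a vendor ever gave you a canary window before a wider rollout. The host names, port and timeouts are entirely hypothetical, and a real pipeline would check far more than network reachability:

```python
# A minimal post-update smoke check (sketch only): after an update lands on a
# small canary group, wait for each host to come back on the network before
# letting the update go any further. Hosts, port and timeouts are hypothetical.
import socket
import time

CANARY_HOSTS = ["canary-01.example.com", "canary-02.example.com"]  # hypothetical
BOOT_TIMEOUT_S = 600    # how long a host gets to come back after the update
RETRY_INTERVAL_S = 30
PORT = 443              # any port the host is expected to serve once it is up


def host_is_up(host: str, port: int = PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_test(hosts: list[str]) -> bool:
    """Wait for every canary host to come back; report any that never do."""
    deadline = time.monotonic() + BOOT_TIMEOUT_S
    pending = set(hosts)
    while pending and time.monotonic() < deadline:
        pending = {h for h in pending if not host_is_up(h)}
        if pending:
            time.sleep(RETRY_INTERVAL_S)
    if pending:
        print(f"FAILED: hosts never came back: {sorted(pending)}")
        return False
    print("All canary hosts are reachable after the update.")
    return True


if __name__ == "__main__":
    raise SystemExit(0 if smoke_test(CANARY_HOSTS) else 1)
```

Even something this crude only works if you get a say in when the update arrives – which, with content updates pushed several times a day, customers simply didn’t.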
Analysts seem to think the faulty update was pushed to customers because of a process error at CrowdStrike – and there’s plenty of chat online, from news outlets to Reddit, about how CrowdStrike should change its processes to prevent issues like this in the future.
What can you do to protect yourself against this happening again?
Right now, it’s very hard to say. It’s unlikely that organizations like CrowdStrike will give customers control over how and when new channel files or definitions are deployed – that could compromise the effectiveness of their protection. It’s possible that organizations could use a combination of vendors to protect their systems – 50% of machines with one vendor and 50% with another, for instance. But that would likely cost more, would certainly be harder to manage, and would open organizations up to new risks as different security technologies interact.
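Purely for illustration, a split like that could at least be made deterministic rather than ad hoc – for example, hashing each hostname so the same machine always lands with the same vendor. The vendor names and fleet below are placeholders, and this ignores all the licensing, tooling and operational overhead we’ve just mentioned:

```python
# Sketch of a 50/50 vendor split: assign each machine to one of two
# endpoint-protection vendors based on a hash of its hostname, so a bad
# update from either vendor can only ever affect half the fleet.
# Vendor names and hostnames are hypothetical placeholders.
import hashlib

VENDORS = ["vendor-a", "vendor-b"]  # hypothetical


def assign_vendor(hostname: str) -> str:
    """Stable assignment: the same host always maps to the same vendor."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return VENDORS[digest[0] % len(VENDORS)]


if __name__ == "__main__":
    fleet = ["laptop-0001", "laptop-0002", "pos-terminal-17", "kiosk-42"]
    for host in fleet:
        print(f"{host} -> {assign_vendor(host)}")
```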
We’ll all have to wait and see what happens in the coming days, weeks and months as the fallout from the outage plays out. In the meantime, we’ll leave you with the words of Dr Stephanie Hare, a tech expert interviewed by the BBC as the crisis unfolded:
“It’s really a lesson for us all in cyber resilience – you need to plan for failure, and you want to have something called business continuity.”
– Dr Stephanie Hare
Our deepest commiserations go out to all the IT teams putting in weekends and late nights to get their organizations back up and running – and a reminder to test every update you can, no matter how small.
ESPECIALLY if you’re a global cybersecurity vendor 👀