What does the Crowdstrike outage mean for the security industry?
Nothing drastic. It does remind people software doesn't solve everything.
Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer or any other entities with which I am affiliated.
By now, everyone has heard that Crowdstrike caused major outages across a variety of industries, including aviation and healthcare. It’s funny because many of my friends have invested in Crowdstrike but didn’t quite know what the company did until now. Of course, this is bad publicity for Crowdstrike, and it has caused a lot of debate in the security community about what went wrong and how to handle it.
What is Crowdstrike?
Before we get too far into the details of what happened, what is Crowdstrike, and what do they do? They provide a product that protects endpoints (think laptops and workstations), and they are now expanding into servers and cloud servers. They also provide a variety of security services, such as incident response and pentesting. In 2016, they helped the DNC respond to a major intrusion. In essence, their product protects against malware and intrusions on endpoints, which are typically where an attack chain starts. I discuss Crowdstrike more, unfortunately, in an article about how it might fail. To be clear, I don’t believe Crowdstrike will fail because of this incident.
What happened?
Crowdstrike released an update that crashed Windows machines, leading to the “blue screen of death.” More technically, this happened because their agent has access to the kernel, the heart of an operating system, and the update triggered an error there. The kernel is a very sensitive part of an operating system, and a fault in it can easily crash the whole machine. Application code, by contrast, operates at the “user level” and is more resilient because the operating system can find a way to recover. However, it is the kernel that provides this resiliency, so when the code that operates key functions crashes, there’s very little that you can do. Crowdstrike released technical details on what happened for those interested, and The Pragmatic Engineer also wrote up a good Substack post.
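For readers who like to see the idea in code, here is a rough sketch of the user-level half of that story. It is only an analogy, not how Windows or the Crowdstrike agent actually works: a parent process stands in for the operating system, and a child process stands in for an application. The names run_isolated, supervise, CRASHING_APP, and HEALTHY_APP are made up for this illustration.

```python
import subprocess
import sys


def run_isolated(snippet: str) -> int:
    """Run a Python snippet in its own process and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def supervise(snippet: str, max_restarts: int = 2) -> str:
    """Stand in for the operating system: notice a crash and retry."""
    for attempt in range(1 + max_restarts):
        if run_isolated(snippet) == 0:
            return f"ok after {attempt} restart(s)"
    # The child kept crashing, yet this function is still able to report it.
    return "gave up, but the supervisor itself is still running"


# Hypothetical "applications": one aborts hard, one exits cleanly.
CRASHING_APP = "import os; os.abort()"
HEALTHY_APP = "print('hello')"

if __name__ == "__main__":
    print(supervise(HEALTHY_APP))
    print(supervise(CRASHING_APP))
```

The crashing child never takes the parent down with it, which is roughly why a misbehaving application usually doesn't take down a whole machine. If the faulty code were running inside supervise itself, there would be nothing left to notice the crash and recover, and that is the position a kernel-level error puts a machine in.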
What made this particularly hard to resolve was that the fix had to be applied manually. Typically, Crowdstrike can push an update to the endpoint, but since the endpoints had crashed and were offline, there was no way to do this. What made things worse was that this affected most machines and customers at once, which created a resourcing problem for Crowdstrike as it worked to resolve the issue.
Overall, despite all the criticism, I believe Crowdstrike has had great communication on what’s happening, and I appreciate all the effort they are putting in to make things right for their customers. Although this might seem like an obvious way to respond, it isn’t true for every SaaS company!
What does this mean for Crowdstrike and the security community?
Honestly, I don’t think anything drastic will change, but it is a wake-up call about how we operate both our IT and security systems. Crowdstrike may have to pay some fees and refunds, which might affect their margins in the short term. I do think this will spur more conversations about how systems should work and be designed as well as the types of risk conversations security should be having with other functions.
The biggest topic will be resiliency because systems shouldn’t fail on such a large scale that they disrupt major critical functions. That’s dangerous on multiple levels. Luckily, this wasn’t a malicious attack, but it shows that a failure on this scale is possible. Historically, in these instances, we learn and adapt accordingly.
I believe the following things will change to improve resiliency:
Less compliance and more security problem-solving
More discussion about availability versus security risk
More discussions focused on non-digital resiliency
More collaboration with engineering
Less compliance and more security problem-solving