What does the Crowdstrike outage mean for the security industry?
Nothing drastic. It does remind people that software doesn’t solve everything.
Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer or any other entities with which I am affiliated.

By now, everyone has heard that Crowdstrike caused major outages across a variety of industries, including aviation and healthcare. It’s funny because many of my friends have invested in Crowdstrike but didn’t quite know what the company did until now. Of course, this is bad publicity for Crowdstrike, and it has caused a lot of debate in the security community about what went wrong and how to handle it.
What is Crowdstrike?
Before we get too far into the details of what happened, what is Crowdstrike, and what do they do? They provide a product to protect endpoints (think laptops and workstations), and they are now expanding into servers and cloud workloads. They also provide a variety of security services, such as incident response, pentesting, etc. In 2016, they helped the DNC respond to a major intrusion. In essence, their product protects against malware and intrusions on endpoints, which is typically where an attack chain starts. I discuss Crowdstrike more, unfortunately, in an article about how it might fail. To be clear, I don’t believe Crowdstrike will fail because of this incident.
What happened?
Crowdstrike released an update that crashed Windows machines, leading to the “blue screen of death.” More technically, this happened because their agent runs code in the kernel, the heart of the operating system, and that code triggered an error. The kernel is a very sensitive part of the operating system and crashes much more easily than application code, which runs at the “user level” and is more resilient because the operating system can find a way to recover it. However, it is the kernel that provides that resiliency, so when the code operating these key functions crashes, there’s very little you can do. Crowdstrike released technical details on what happened for those interested, and The Pragmatic Engineer also wrote up a good Substack post.
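To make that user-level versus kernel distinction a bit more concrete, here is a minimal, purely illustrative Python sketch (nothing to do with Crowdstrike’s actual agent): a supervisor process restarts a user-mode worker that keeps crashing, which is exactly the kind of recovery nothing can perform for the kernel itself.

```python
import subprocess
import sys
import time

# Illustrative only: a user-mode process that crashes can be supervised and
# restarted, because the OS contains the failure to that one process.
# A fault in kernel code has no such safety net: the kernel is the thing
# that would normally do the recovering, so the whole machine goes down.

# Hypothetical "worker" that always exits with an error after one second.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(1); raise SystemExit(1)"]

def supervise(max_restarts: int = 3) -> None:
    for attempt in range(1, max_restarts + 1):
        result = subprocess.run(WORKER_CMD)
        if result.returncode == 0:
            print("worker exited cleanly")
            return
        print(f"worker crashed (exit {result.returncode}), restart {attempt}/{max_restarts}")
        time.sleep(1)  # simple backoff before retrying
    print("giving up: worker keeps failing")

if __name__ == "__main__":
    supervise()
```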
What made this particularly hard to resolve was that the fix had to be applied manually. Typically, Crowdstrike can push an update to the endpoint, but since the endpoints had crashed and were offline, there was no way to do this. What made things worse is that this affected most machines and customers at once, which created a resourcing problem for Crowdstrike to resolve.
Overall, despite all the criticism, I believe Crowdstrike has had great communication on what’s happening, and I appreciate all the effort they are putting in to make things right for their customers. Although this might seem like an obvious way to respond, it isn’t true for every SaaS company!
What does this mean for Crowdstrike and the security community?
Honestly, I don’t think anything drastic will change, but it is a wake-up call about how we operate both our IT and security systems. Crowdstrike may have to pay some fees and refunds, which might affect their margins in the short term. I do think this will spur more conversations about how systems should work and be designed as well as the types of risk conversations security should be having with other functions.
The biggest topic will be resiliency, because systems shouldn’t fail on such a large scale that they disrupt critical functions. That’s dangerous on multiple levels. Luckily, this wasn’t a malicious attack, but it shows that such an outcome is possible. Historically, in these instances, we learn and adapt accordingly.
I believe the following things will change to improve resiliency:
Less compliance and more security problem-solving
More discussion about availability versus security risk
More discussions focused on non-digital resiliency
More collaboration with engineering
Less compliance and more security problem-solving
This is primarily for Crowdstrike customers. Many of them use Crowdstrike to meet compliance requirements for endpoints. Part of compliance is also ensuring that vendors and systems can recover from failures; for example, business continuity and disaster recovery (BC/DR) are key parts of SOC 2 and HITRUST. However, it seems that businesses have focused on basic BC/DR to fulfill compliance, and those scenarios are either not well tested, not realistic, or both. In other words, customers would have recovered faster if they had more mature BC/DR. In their defense, most BC/DR focuses on infrastructure failures rather than IT and software failures. That made sense when IT maintained most infrastructure and software, but BC/DR needs to evolve to reflect our new reality of SaaS and cloud software, which can fail too.
This is ultimately my frustration. We shouldn’t need new compliance standards to force us to do this. Security teams and leaders should have seen this as a problem and taken steps to mitigate and solve it.
More discussion about availability versus security risks
This balance is always tricky. Ultimately, customers remember outages more than security incidents. In other words, I believe that security incidents have a shelf life, and customers tend to move on; when outages happen, customers remember them for longer. Security knows this well, and that’s the reason they tend to accept security risk to ensure availability: outages tend to have a larger business impact than security risks. They also tend to cause larger reputational damage because an outage is seen as a product quality issue.
Anyway, balancing availability and security risk should be an important discussion between engineering and security. Sometimes security issues do lead to availability ones, such as with the United Healthcare hack, but in the Crowdstrike case, the very protection meant to prevent security issues caused the outage. This is a scenario we need to consider going forward: to have security, we might have to take on availability risk. Another example is Okta. If Okta were to go down, it would affect many systems’ SSO capabilities, but it does make application management more secure. Okta put in great effort to stay highly available when AWS went down. However, rather than relying on the vendor, security and IT teams need to think about potential recovery strategies. This is why I believe that all SaaS products should have a way to log in outside of SSO. I know many products do, but many do not.
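To show what I mean by a login path outside of SSO, here is a rough, hypothetical sketch (the names and flow are mine, not any vendor’s API): the application tries the identity provider first and only falls back to a locally stored, break-glass credential when the provider is unreachable.

```python
import hashlib
import hmac

# Hypothetical "break-glass" login path: if the SSO provider is unreachable,
# fall back to a locally stored emergency credential. In practice this hash
# would be provisioned out-of-band, and every use would be logged and alerted on.

BREAK_GLASS_SALT = b"example-salt"
BREAK_GLASS_HASH = hashlib.pbkdf2_hmac(
    "sha256", b"correct horse battery staple", BREAK_GLASS_SALT, 100_000
)

def sso_login(username: str) -> bool:
    """Stand-in for redirecting the user to the identity provider."""
    raise ConnectionError("identity provider unreachable")  # simulate an IdP outage

def break_glass_login(password: str) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), BREAK_GLASS_SALT, 100_000)
    return hmac.compare_digest(candidate, BREAK_GLASS_HASH)

def login(username: str, emergency_password: str | None = None) -> bool:
    try:
        return sso_login(username)
    except ConnectionError:
        # SSO is down: accept only the audited emergency credential, if provided.
        if emergency_password is not None:
            return break_glass_login(emergency_password)
        return False

if __name__ == "__main__":
    print(login("admin", emergency_password="correct horse battery staple"))
```

Whether this lives in the product or in an out-of-band admin console matters less than the fact that the fallback exists, is tested, and is monitored.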
More discussions focused on non-digital resiliency
The above is a good segue into this topic. Companies have relied too much on technology to solve their problems without fully understanding the operational risk. Technology has made companies more efficient and improved productivity. As a result, it has benefitted the economy and consumers substantially. Noah Smith writes about this a lot in his Substack. Although he believes that US manufacturing productivity has stagnated, overall productivity has gone up, and it might allow us to have a soft landing with higher interest rates. Technology has played a major role there, but I digress.
We need to focus more on non-digital resiliency. Most compliance standards require some sort of resiliency, but those requirements still assume other software is there to fall back on. We need a plan for when technology itself fails. We did operate in a world before the internet and computers, albeit inefficiently. If technology fails us, we need a way to keep operating while we wait for it to come back online.
Brian Klaas wrote about this in his recent article:
Even though the modern quest for optimization has too often made resilience an afterthought, it is not inevitable that we continue down the risky path we’re on. And making our systems more resilient doesn’t require going back to a disconnected, primitive world, either. Instead, our complex, interconnected societies simply demand that we sacrifice a bit of efficiency in order to allow a little extra slack. In doing so, we can engineer our social systems to survive even when mistakes are made or one node breaks down.
I only partially agree with him. I do think we need to better understand and build digital resiliency. Digital systems are so efficient now that there’s not enough room for error compared to systems in other fields like manufacturing. But we also need some sort of non-digital resiliency. For example, many flights and airlines came to a stop because they didn’t know how to operate without their computers. It’s ok to rely on your computers, and it’s also ok not to immediately know what to do if they fail. But this is the point of disaster planning: if this were to happen, there should be instructions to help employees continue operations, even if that requires telephones and pen and paper. In this particular case, it would have been helpful to have a way to issue non-Windows machines to critical workers!
With that said, these efforts do cost money, and the incentives aren’t aligned for businesses to be this resilient, as investors are always trying to find more profit. Still, the failure has hit Crowdstrike’s market cap (it’s down about 30% since the outage) and the operating costs of airlines like Delta. We just have to see how fast they recover.
More collaboration with engineering
I sound like a broken record in this newsletter, but I do think there needs to be more collaboration with engineering to prevent issues like this in the future. As regular readers know, I continuously advocate for more engineering in security.
In this particular case, I believe that Crowdstrike did a good job responding quickly to the issue, but one failure is that this bug had such a large blast radius. Software bugs are common, but they are usually caught in staged rollouts before they reach every customer at once. Similarly, on the customer side, security teams should have coordinated with IT and engineering to figure out how to keep operations running if a piece of security software were to cause an outage. Security needs to better understand which operations are critical and how deploying software might create additional risk. That’s what security is always worried about: in the pursuit of security, causing an outage. This is common with vulnerability updates.
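As a generic illustration of how staged rollouts limit blast radius (this is a sketch of the general technique, not how Crowdstrike’s content-update pipeline actually works), the idea is to ship to a small slice of the fleet, watch crash telemetry, and halt before the update reaches everyone:

```python
import random

# Generic staged (canary) rollout gate: push an update to a small fraction of
# the fleet first, check crash telemetry, and stop the rollout if the failure
# rate spikes. The stages and threshold below are made up for illustration.

STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per stage
CRASH_THRESHOLD = 0.001             # abort if >0.1% of updated hosts crash

def crash_rate_for(fraction: float) -> float:
    """Stand-in for real telemetry from the hosts that received the update."""
    return random.uniform(0.0, 0.002)

def staged_rollout() -> None:
    for fraction in STAGES:
        rate = crash_rate_for(fraction)
        print(f"stage {fraction:.0%}: observed crash rate {rate:.4%}")
        if rate > CRASH_THRESHOLD:
            print("crash rate above threshold, halting rollout and rolling back")
            return
    print("rollout completed to 100% of the fleet")

if __name__ == "__main__":
    staged_rollout()
```

Even a simple gate like this turns a fleet-wide failure into a contained one, which is the difference between a bad day and a global outage.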
Either way, most companies’ operations depend on engineering nowadays, so it makes sense that security should ensure they have ways to recover if security tools interfere with engineering.
This is a warning
As Brian Klaas said:
The CrowdStrike debacle is a clear warning that the modern world is fragile by design. So far, we have decided to make ourselves vulnerable. That means we can decide differently too.
This incident was a warning that our new digital world is too fragile. We rely too heavily on single vendors. That reliance does make procurement and operations more efficient, but it can lead to substantial issues. I don’t know if much will change going forward unless new regulation is introduced. We have relied on Windows and Crowdstrike too much, but they do bring substantial benefits to operations and security, respectively.
Don’t get me wrong. I am a Crowdstrike customer, and I think they have a good product. This is mostly a warning sign that we rely on fragile systems for some of our most important operations. As the digital world matures, every team, especially security, needs to think more about resiliency.