CrowdStrike IT Outage

Six Lessons Learned from the CrowdStrike Outage Disaster

Art Clomera

Vice President, Operations

Did your world stand still on “Blue” Friday? I couldn’t stop tracking the news as 9,600 flights were canceled worldwide, broadcasters were forced off the air, hospitals had to postpone operations, and about 8.5 million devices using Microsoft Windows were bricked, displaying the Windows critical error blue screen of death (BSOD). 

Our CrowdStrike outage takeaway? This is a teachable moment for the cybersecurity community, private and federal. Thousands of experts are weighing in on the urgency of robust disaster recovery plans and failover mechanisms, reminding us of the fragility of our interconnected systems.

  

What if a foreign adversary causes the next global outage?

Private and government organizations heavily reliant on uninterrupted digital services, including critical medical centers, were disproportionately impacted. We dismiss this incident at our peril.

Had this been a targeted cyberattack by a foreign adversary, the outage could have lasted much longer. Recovery efforts might have been sabotaged, and the extended window of exposure could have posed severe threats to lives and critical infrastructure. 

 

Software isn’t the problem, the problem is us

How many of us anticipated that our systems were so fragile that a “bug” in an update could cost Fortune 500 companies $5.4 billion? (CrowdStrike is a vendor to over 50% of them.) Nancy Leveson, professor of aeronautics and astronautics at MIT, did.

She emphasizes that the software did exactly what it was told to do but was given the wrong instructions. Software failures are failures of understanding and imagination.  

How can we work around that? 

 

The problem is efficiency over resilience

Nassim Nicholas Taleb’s Black Swan theory, which describes unpredictable, high-impact events with severe consequences, pretty much sums up the CrowdStrike outage. So, too, does his concept of antifragility: “Systems should not just survive shocks but bounce back stronger.”

While efficiency often drives system design, these streamlined processes lack redundancy, making systems vulnerable to unexpected disruptions.  

The CrowdStrike incident exposed the risks of prioritizing efficiency over security. While mass software updates can streamline operations, they should be implemented in phases to detect problems before they become widespread.

  

Beyond Efficiency: Six lessons learned from CrowdStrike 

It’s heartening to see the cybersecurity community share their insights and lessons learned from CrowdStrike. Here are our CrowdStrike outage takeaways to ensure your security posture remains efficient yet antifragile. 

 

1. Critical systems must have robust backup and failover mechanisms

Failover mechanisms such as redundant systems, automatic switchover protocols, and geographically dispersed data centers are essential to ensure uninterrupted operations during outages (whether triggered by human error or foreign adversary). 

Also, distributing security infrastructure across multiple Cloud Service Providers reduces the risk of a single point of failure. Even if on-premises systems are affected, critical data and operations can be quickly restored without disrupting mission continuity.  

For example, IPKeys assisted DLA in migrating from on-premises data centers to the cloud. This included transitions to milCloud 1.0 and commercial cloud platforms, establishing authorizations for Impact Level 4 and 5 cloud enclaves. Learn more. 
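
As a rough illustration of the automatic switchover idea in this lesson, the sketch below tries a primary backend and falls through to redundant ones when it fails. The function name call_with_failover and the backends themselves are hypothetical stand-ins, not part of any IPKeys or vendor product; a production setup would typically add health checks, timeouts, and alerting.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in priority order and return the first success.

    `backends` is a hypothetical list of zero-argument callables, for
    example one per data center or cloud provider.
    """
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # any failure triggers a switchover
            errors.append(exc)
    # Every redundant path failed, so surface the collected errors.
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")


def primary_site() -> str:
    # Simulated outage at the primary location (illustrative only).
    raise ConnectionError("primary data center unreachable")


def secondary_site() -> str:
    return "served from secondary site"


result = call_with_failover([primary_site, secondary_site])
```

Ordering the list by preference keeps the switchover logic simple: the caller never needs to know which site answered, only that one did.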

 

2. Enhanced monitoring and incident response 

The Cybersecurity and Infrastructure Security Agency (CISA) recommends enhanced monitoring and rapid incident response to maintain a resilient security posture. CISA’s incident response playbooks provide federal agencies with standardized procedures to identify, coordinate, remediate, and recover from cybersecurity incidents.

This approach helps agencies quickly address vulnerabilities and reduce the impact of cyber threats. According to a recent GAO report, federal agencies have made progress but need to implement incident response requirements fully. 

 

3. Consideration of Tail Risks 

Organizations must evaluate not only immediate risks but also the potential for extreme outcomes, particularly those arising from third-party dependencies, that standard risk assessments typically do not account for. They also need a phased approach to rolling out updates and a thorough risk mitigation test strategy, run in a simulated operational environment, before or as part of any automation.

For example, multiple federal agencies were compromised during the SolarWinds cyberattack due to their reliance on a single software provider, demonstrating the critical need for thorough risk assessments considering potential catastrophic failure or Black Swan events. 

Embracing chaos engineering practices can help organizations understand how systems behave under stress, allowing them to build more resilient infrastructures. Netflix pioneered this approach with their Chaos Monkey tool, randomly turning off production instances to test system robustness. 
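
To make the chaos engineering idea concrete, here is a toy, in-memory sketch of one experiment: terminate randomly chosen instances, then check whether the service can still answer a request. It is not Netflix's Chaos Monkey and does not touch real infrastructure; ServicePool and chaos_experiment are hypothetical names invented for this illustration.

```python
import random
from typing import Iterable, List, Tuple


class ServicePool:
    """A simulated pool of redundant service instances."""

    def __init__(self, instances: Iterable[str]) -> None:
        self.healthy = set(instances)

    def terminate(self, instance: str) -> None:
        self.healthy.discard(instance)

    def handle_request(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        # Route to any remaining instance (deterministic pick for clarity).
        return sorted(self.healthy)[0]


def chaos_experiment(
    pool: ServicePool, rng: random.Random, kills: int = 1
) -> Tuple[bool, List[str]]:
    """Terminate `kills` random instances and report whether service survived."""
    candidates = sorted(pool.healthy)
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        pool.terminate(victim)
    try:
        pool.handle_request()
        return True, victims
    except RuntimeError:
        return False, victims


rng = random.Random(7)  # fixed seed so the experiment is repeatable
redundant = ServicePool(["a", "b", "c"])
single = ServicePool(["only"])
survived_redundant, _ = chaos_experiment(redundant, rng)
survived_single, _ = chaos_experiment(single, rng)
```

The redundant pool keeps serving after losing one instance, while the single-instance pool does not, which is the kind of weakness such experiments are meant to surface before a real outage does.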

 

4. Vulnerability of Monocultures

The CrowdStrike outage exposed the risks of overreliance on single platforms and vendors. Windows’s dominance in the corporate world creates a large, homogeneous attack surface for cybercriminals. Similarly, the concentration of cybersecurity services in a few major companies amplifies the potential impact of a breach.

This consolidation of services creates an attractive attack surface for adversaries, as demonstrated by the SolarWinds attack, which affected critical US government departments, including Homeland Security, State, Commerce, and Treasury. To mitigate these risks, organizations should consider diversifying their technology stack and security providers and implementing a multi-layered defense strategy that reduces dependence on any single vendor or system.

 

5. Building a crisis management framework 

Involving IT, cybersecurity, PR, legal, and risk management teams ensures a comprehensive incident response. CISA advocates for such integrated approaches in its incident response playbooks.

Open communication with stakeholders is essential for rebuilding trust after a crisis. Organizations can mitigate reputational damage by promptly acknowledging issues, providing regular updates, and demonstrating a commitment to accountability. The Federal Communications Commission (FCC) emphasizes clear communication strategies in disaster response protocols. 

 

6. Importance of Phased Rollouts 

Hindsight is 20/20, but regular automated updates to endpoint detection and response (EDR) tools, like CrowdStrike’s Falcon, should be implemented in phases to detect problems before they become widespread. (Admittedly, this is not as efficient as blasting a new security update to 8.5 million devices without these controls.)

A phased rollout approach involves: 

  1. Initial deployment to a small, diverse subset of systems 
  2. Monitoring for unexpected behaviors or conflicts 
  3. Gradual expansion to larger groups 
  4. Maintaining the ability to roll back quickly if problems arise
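
The four steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a description of how CrowdStrike or any other vendor ships updates: the function phased_rollout, its deploy, is_healthy, and rollback callbacks, and the stage fractions are all hypothetical. The caller is assumed to order the host list so that the first slice is small and diverse.

```python
from typing import Callable, List, Sequence


def phased_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy to growing slices of hosts; roll everything back on failure."""
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        # Monitor this stage's hosts before expanding to a larger group.
        failures = [h for h in batch if not is_healthy(h)]
        if batch and len(failures) / len(batch) > max_failure_rate:
            # Undo the update everywhere it was applied, newest first.
            for host in reversed(updated):
                rollback(host)
            return False
    return True


hosts = [f"host-{i}" for i in range(200)]

good_state = {}
good_ok = phased_rollout(
    hosts,
    deploy=lambda h: good_state.__setitem__(h, "v2"),
    is_healthy=lambda h: True,
    rollback=lambda h: good_state.__setitem__(h, "v1"),
)

bad_state = {}
bad_ok = phased_rollout(
    hosts,
    deploy=lambda h: bad_state.__setitem__(h, "v2"),
    is_healthy=lambda h: False,  # simulate an update that breaks every host
    rollback=lambda h: bad_state.__setitem__(h, "v1"),
)
```

In the failing run, the faulty update reaches only the first small slice of hosts before it is rolled back, rather than the whole fleet.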

By combining these strategies, organizations can create a more resilient and antifragile IT infrastructure. 

Read more: What should you do if you suspect you’ve fallen victim to a cyber attack?

 

CEO vs. CTO/CIO: Should the CTO/CIO play a higher role in running the show? 

Naturally, blame has begun to target the person at the center of it all: CrowdStrike CEO George Kurtz. Tech industry analyst Anshel Sag pointed out that this isn’t the first time Kurtz has played a major role in a historic IT blowout. When asked about Kurtz’s history at McAfee, a CrowdStrike spokesperson said, “George was there as a sales-facing CTO, not in charge of engineering, technology, or operations.” 

Many high-level tech roles are often held by those whose strengths are more in management or marketing than technology. On Reddit, users shared personal experiences where non-technical executives made poor decisions affecting operations and product integrity.  

The US Cyber Command (USCYBERCOM) was established in response to the growing importance and vulnerability of computers and networks. The information age is upon us; therefore, we need unified direction for cybersecurity operations, endorsement of cyber capabilities, and an increased ability to operate information and communication networks, counter cyber threats, assure access to cyber resources, and interoperate with a global information environment.

In the context of the CrowdStrike outage and broader industry practices, the debate over whether the CEO or CTO/CIO should run the show becomes more nuanced. The ideal solution likely lies in a collaborative leadership model where CEOs and CTOs/CIOs work in tandem, leveraging their complementary skills, but ensuring CTOs/CIOs have cyber operational responsibilities.  

 

How to ensure an IT outage doesn’t happen on your watch 

We understand the complex challenges federal agencies face in maintaining resilient, secure, and adaptable IT infrastructures.  

IPKeys CLaaS®, designed in collaboration with the DoD, provides AI/ML analytics and continuous monitoring of potential outages, tail risks and cyber threats. Contact us to learn more about improving mission-critical continuity. 
