Grounded planes, disrupted government operations, stalled healthcare systems: it all sounds suspiciously like a malware attack or an extensive hacking job. In fact, the cause was far less sinister. It was a flawed software update from CrowdStrike that left over 8.5 million Windows systems showing the panic-inducing blue screen of death. It is estimated to be one of the largest, and possibly the most costly, IT outages in history, and it could have been prevented. In the aftermath of the incident, CrowdStrike’s CEO has been called to testify before the U.S. House Subcommittee on Cybersecurity and Infrastructure Protection, the company’s shareholders have filed a lawsuit for damages, and companies across the globe have lost billions.
So, what happened, what lessons can be learned from the failed CrowdStrike update, and how can you ensure something similar doesn’t happen to your systems or your clients’ systems? We’ll walk you through some important takeaways and set you up for success when it comes to dealing with updates, application security, testing strategies, and outages.
CrowdStrike update crashes 8.5 million machines in less than 2 hours
On July 19, 2024, cybersecurity company CrowdStrike pushed a flawed content configuration update to Windows hosts and effectively crashed over 8.5 million machines in less than 2 hours, and that estimate only includes Windows machines that reported crash data. According to CrowdStrike’s incident report published on their website, “The issue on Friday involved a Rapid Response Content update with an undetected error.” Rapid Response Content is pushed to sensors so they can gather telemetry and detect rapidly changing threats. What went wrong on July 19 was a bug in the Content Validator that allowed problematic content data to pass validation. The crash happened when the sensor received and loaded the flawed channel file, resulting in “an out-of-bounds memory read triggering an exception.” The sensor could not handle the exception gracefully, so the Windows operating system crashed, producing the blue screen of death.
The channel file was fixed and a new one uploaded roughly an hour and a half later, but by then the damage was done: critical services and systems had been hit by widespread outages and delays. Even worse, it was not until July 29, a full 10 days later, that CrowdStrike announced that 99% of machines were back online. As far as consequences go, the CrowdStrike outage led to:
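To make the failure mode concrete, here is a minimal sketch of the kind of bounds check that a validator and a consumer can share. It is not CrowdStrike's code or file format: the `ContentEntry` shape, the `validate_entry` and `read_fields` functions, and the idea that an entry lists the input-field indexes it reads are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


class ContentValidationError(Exception):
    """Raised when a content entry refers to data the consumer cannot supply."""


@dataclass(frozen=True)
class ContentEntry:
    # Hypothetical shape: each entry names the input-field indexes it wants to read.
    name: str
    field_indexes: Tuple[int, ...]


def validate_entry(entry: ContentEntry, available_fields: int) -> None:
    """Reject an entry if any index falls outside the fields actually supplied."""
    for index in entry.field_indexes:
        # Negative indexes are rejected too: Python would silently wrap them around.
        if not 0 <= index < available_fields:
            raise ContentValidationError(
                f"{entry.name}: index {index} is out of range "
                f"for {available_fields} available fields"
            )


def read_fields(entry: ContentEntry, inputs: Sequence[str]) -> List[str]:
    """Consumer-side guard: re-check bounds at read time instead of trusting the validator."""
    validate_entry(entry, len(inputs))
    return [inputs[i] for i in entry.field_indexes]
```

The design point is that the same check runs twice: once before release, and again where the data is consumed, so a validator bug alone is not enough to cause an out-of-range read.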
- A $500 million USD loss for Delta Air Lines
- Billions in losses for the Fortune 500, currently estimated at $5.4 billion
- 3,000 flights into, out of, and within the US canceled, and thousands more delayed, on July 19th alone
- Outages for healthcare systems and emergency response operators, with some healthcare groups canceling elective and non-essential procedures until systems were restored, and
- Disruption to millions of other critical services, from banking and financial institutions to mass transit and emergency services
 
The challenges of cloud-dependent architectures and performing a robust analysis
The CrowdStrike outage serves as a timely reminder for organizations to periodically evaluate their cloud strategy and network security. Although this should be a foundational process for any medium-sized to enterprise-level company, it is often not done regularly enough to prevent both minor and serious issues. In the real world, incidents of the magnitude of the CrowdStrike outage can severely impact critical systems, triggering unforeseen consequences and harmful backlash.
One significant challenge enterprises face is vendor lock-in with cloud providers. Relying on a single vendor can create single points of failure, potentially bringing down entire systems. To mitigate this risk, diversifying vendors and adopting a hybrid or multi-cloud approach can ensure workloads are distributed across multiple platforms. This strategy not only enhances protection against cybersecurity threats but also makes it easier to close off compromised access points. As cybersecurity technology becomes increasingly integrated into day-to-day systems, the importance of reliability and diversification grows.
Performing an analysis of your architecture
Conducting a thorough analysis of your architecture is crucial to maintaining a robust and resilient cloud strategy. Start by examining what processes are automated within your environment. Automation can significantly enhance efficiency and reduce the likelihood of human error, but it’s essential to ensure these automated processes are secure and functioning as intended. Next, consider whether test servers are utilized effectively. Test servers and staging environments can simulate different scenarios and help identify potential issues before they impact the live environment. They also provide a safe space to implement and refine automation scripts. Finally, evaluate how diversified your cloud environment is. A diversified cloud strategy, involving multiple cloud providers and platforms, reduces the risk of system-wide failures and enhances overall security. This diversification should be regularly assessed and adjusted to align with evolving threats and organizational needs. By addressing these areas, organizations can build a more resilient architecture capable of withstanding and swiftly recovering from potential disruptions.
Go for gold: implement proper testing procedures
To say testing is an important part of the software development lifecycle is a gross understatement. CrowdStrike admitted that the Rapid Response Content did not undergo the same rigorous testing as its Sensor Content, indicating that a simple test could have caught the bug that wreaked havoc, cost companies billions of dollars, and disrupted millions of lives. In software development, it is easy to become complacent when previous versions of software have passed tests with flying colors or when you just want to get updates or patches deployed quickly. However, it is incredibly important to build out robust testing practices, because flawed files can have downstream impacts that cost time, money, and, at the most extreme, lives. CrowdStrike has indicated that they will now be applying different testing types, including local developer tests, stress testing, fuzzing and fault injection, content update and rollback testing, stability testing, and content interface testing to prevent future issues. However, such a gesture seems like too little too late.
One of the morals of this story is to build out a robust testing strategy and make sure it is implemented and discussed ad nauseam across the organization. Testing gives you rich insight into the performance of the solutions and products you are building, helping you catch errors early and deploy quality solutions to your end users. Continuous testing throughout the software development lifecycle is required to avoid system crashes, failures, and security breaches. Here are some of the key takeaways from the CrowdStrike episode which may help to strengthen or enhance parts of your existing testing strategy:
 
Incorporate pre-deployment testing
Imagine deploying an update to users without testing or peer review. A comprehensive testing process covers edge cases and real-world scenarios, so that you already know what will happen when the update reaches users, because you tested it.
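One way to exercise edge cases before deployment is a small fuzz test: feed a parser large numbers of random, mostly malformed inputs and confirm it only ever fails in the controlled way you expect. The sketch below is illustrative only. The record format (one count byte, then a length byte and payload per field) and the names `parse_record` and `fuzz_parse_record` are hypothetical and are not taken from any real product.

```python
import random
from typing import List


def parse_record(data: bytes) -> List[bytes]:
    """Parse a hypothetical format: 1 count byte, then per field a 1-byte length plus payload."""
    if not data:
        raise ValueError("empty record")
    count, pos = data[0], 1
    fields: List[bytes] = []
    for _ in range(count):
        if pos >= len(data):
            raise ValueError("truncated: missing length byte")
        length = data[pos]
        pos += 1
        if pos + length > len(data):
            raise ValueError("truncated: payload shorter than declared length")
        fields.append(data[pos:pos + length])
        pos += length
    if pos != len(data):
        raise ValueError("trailing bytes after last field")
    return fields


def fuzz_parse_record(iterations: int = 2000, seed: int = 0) -> int:
    """Feed random byte strings to the parser and return how many were rejected.

    Only ValueError is treated as a controlled rejection; any other exception
    propagates and fails the run, which is the signal the fuzz test is looking for.
    """
    rng = random.Random(seed)  # fixed seed so a failing input can be reproduced
    rejected = 0
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 16)))
        try:
            parse_record(blob)
        except ValueError:
            rejected += 1
    return rejected
```

Running `fuzz_parse_record()` as part of a pre-deployment check would surface any input that makes the parser fail in an uncontrolled way. A real fuzzing setup would typically use a coverage-guided tool rather than plain random bytes; this sketch only shows the principle.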
Avoid an over-reliance on automation
Test automation is an incredible tool and has improved the testing experience for developers, testers, and users alike. However, manual oversight has to be incorporated from the start and should be part of a robust testing strategy, so that problems in the automation itself do not go unnoticed.
Hire expert QA professionals
Underappreciated? Not today! QA professionals can help you build out a testing strategy that works for your organization, tech stack, and solutions. This seems like an obvious approach until companies start trying to cut costs and assume fewer QA professionals can just use AI tools and automations to make everything more efficient. Unfortunately, that does not always work, and it leaves the organization vulnerable to issues.
Phased rollouts
Pushing updates to a subsection of users can help pinpoint issues before they bring down whole systems. Updates and configuration files, no matter how mundane, should be tested locally first on a representative sample before being pushed globally. Think of the chaos CrowdStrike could have avoided with a phased rollout.
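A phased rollout can be sketched in a few lines. The example below is a simplified illustration, not a description of any vendor's deployment system: the `phased_rollout` function, the stage fractions (1%, 10%, 100%), and the `deploy_and_check` callback, which is assumed to push the update to one host and report whether that host is still healthy, are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def phased_rollout(
    hosts: Sequence[str],
    deploy_and_check: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> Tuple[List[str], bool]:
    """Deploy in widening stages and stop after the first unhealthy stage.

    Returns the hosts that received the update and whether the rollout was halted.
    """
    updated: List[str] = []
    for fraction in stage_fractions:
        done = len(updated)
        # Cumulative target for this stage; always try at least one new host.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = list(hosts[done:target])
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy_and_check(host))
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            return updated, True  # halt: the remaining hosts are never touched
    return updated, False
```

With 100 hosts and an update that breaks every machine it reaches, this sketch stops after the first stage, so only one host is affected instead of all of them. In practice each stage would also wait for a soak period and collect health telemetry before widening.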
Review AI and machine learning models!
Yes, AI testing is a huge benefit for those that have incorporated it into their systems, like CrowdStrike. However, it still requires substantial human oversight and should not be solely relied upon to predict problems. Remember, it is still only as capable as the data it is trained on. So, if it received faulty training data, those issues might be amplified during a rollout. Additionally, if you do not test something, you will never know if it contains bugs.
Putting together a response plan
Even with robust testing practices, mistakes and breaches will happen. For this reason, it is not enough to implement continuous testing practices or security software; organizations must also have a response plan ready for when things go awry. A strong response plan should address security breaches and unexpected downtime, outline a communications management plan for when comms fail (as they did for those using Microsoft Teams), and prepare for the unexpected, or a black swan. It is also essential to consider some form of compensation, both for affected parties, since most losses may go uninsured, and for the teams that help remedy the situation. Whatever form it takes, the compensation should be meaningful.
Addressing security breaches is a fundamental aspect of any response plan. Organizations should have predefined procedures for identifying, containing, and mitigating breaches. This includes regular security audits, employee training on recognizing phishing attempts, and having cybersecurity experts ready to respond quickly. Unexpected downtime is another critical area. A plan for downtime should include backup systems, regular data backups, and a clear process for restoring services. Throughout any breach or outage, communication is the key to success. Unfortunately, sometimes internal communications systems can fail, leaving organizations in the dark and floundering. Organizations should have alternative communication channels in place, such as satellite phones or external messaging services, and ensure all employees are aware of these alternatives. Lastly, preparing for the unexpected requires a flexible and adaptable approach. This might involve scenario planning exercises, maintaining a reserve of critical resources, and fostering a culture of resilience within the organization. All of these activities can help you build a robust response plan that may mitigate the impacts of an event.
One of the most important lessons here has nothing to do with software or lost money. Rather, it’s about complacency, and understanding that most systems are interconnected now. As large software companies that touch most aspects of people’s lives, not just operating systems, we have a responsibility to address downstream consequences and think about the ramifications of our actions. The CrowdStrike outage, and subsequent Windows system crash, made it even more obvious that the “simple” systems we rely on every day control important parts of our lives. And, just as there are government entities responsible for ensuring companies uphold high standards and avoid harming the populace, technology companies must also avoid complacency and ensure the same high standards for all of their solutions.
If you’re looking to strengthen your system’s resilience, review your architecture, or develop a robust disaster recovery plan, we’re here to help. Get in touch with us today to explore how we can enhance your organization’s technological resilience and safeguard your operations against potential crises. Let’s work together to build a more secure and reliable future for your business.