Warning: this post contains graphic descriptions of enterprise system failures.
On July 19th, 2024, you’d be forgiven for thinking you’d awoken to an unannounced war game in which the world’s governments tested their mettle against the initial cyber salvos of WW3.
Such was the scale of the CrowdStrike outage:
- Around 8.5 million Windows devices affected
- Upward of $5 billion in losses for the US Fortune 500 (more on this later), only a fraction of which was covered by insurance
- And, most distressingly, serious disruptions to vital services like 911 and healthcare provision
A brief refresher for the uninformed and/or the smugly Linux/macOS-leaning: the CrowdStrike outage was a faulty update to CrowdStrike's Falcon security sensor that crashed a huge swath of Windows devices on July 19th, 2024.
So, what happened?
Anatomy of a fall – how Falcon fell to earth
We’re not the first company to pick the bones of this particular carcass. And we won’t be the last.
Many in the DevOps and DevSecOps space have lined up to tear off a scrap of I-told-you-so or wouldn't-happen-on-my-watch – as they're entitled to do.
However, as a company that provides 24/7 support and DevOps to enterprise-grade clients, our hearts go out to the Falcon team.
So, rather than crowing (more bird puns to follow), we’ll try to understand exactly what went wrong.
Anatomy of a fall:
Mid-flight: pre-July 19th
- The Falcon team happily develop an update to Channel File 291, one of the Falcon sensor's configuration files. At this point, everything's gravy
- The file goes into content validation via CrowdStrike’s imaginatively named Content Validator tool; it’s validated
- Next, File 291 is tested. The testing, as far as the Falcon team is concerned, goes great
Some turbulence: the small hours of July 19th
- Around 04:09 UTC, Channel File 291 (unsatisfactorily adding up to just one short of 13 for the numerophiles) is deployed
- Almost immediately, Windows systems running the Falcon sensor begin crashing and rebooting in a never-ending loop (we sketch the mechanics just after this list)
- CrowdStrike is inundated with user reports
Earthward bound: the big and unpleasant hours of July 19th
- Within hours, the CrowdStrike team diagnose the issue and revert the update to prevent further damage
- However, by this point, the extant damage is considerable. See our opening bullets. Millions affected. Billions lost. Life-saving services on the fritz
- Falcon hits the ground with this mea culpa – tl;dr: it's our fault
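To make the crash loop concrete, here's a toy Python analogy – emphatically not CrowdStrike's code, which lives in a kernel-mode driver. Per CrowdStrike's own post-incident reports, the new template type defined 21 input fields while Channel File 291's instances supplied only 20, so the sensor read past the end of the data it was given. In kernel mode that's an invalid memory access and a blue screen; and because the sensor loads at boot, the machine crashes, reboots, loads it again and crashes again.

```python
# Toy analogy only – not CrowdStrike's code. A parser that expects 21 fields is
# handed 20, and indexing the 21st reads past the end of the data. In Python
# that's an IndexError; in a kernel-mode driver it's an invalid memory read and
# a system crash.
EXPECTED_FIELDS = 21                # what the new template type defined
supplied_fields = ["value"] * 20    # what Channel File 291's instance carried


def interpret(fields: list[str]) -> None:
    for i in range(EXPECTED_FIELDS):
        value = fields[i]           # i == 20 is out of bounds -> crash
        # ...evaluate the detection rule against `value` here...


def boot() -> None:
    # The sensor loads early in the boot sequence, so the crash repeats on
    # every restart: crash, reboot, crash, reboot.
    interpret(supplied_fields)


if __name__ == "__main__":
    boot()                          # raises IndexError – our stand-in for the BSOD
```

An IndexError in a script is an inconvenience; the equivalent read in a boot-start kernel driver takes the whole machine down before anyone can log in to fix it.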
What went wrong?
Looking at the above, it’s pretty obvious the validation and testing mechanisms were the points of failure.
Once they were aware of the problem, the Falcon team behaved admirably and speedily – though not quite as speedily as we.
The validation failure
Essentially, there was a bug in the Content Validator tool itself that let File 291 slip through the net.
Previous updates using the same IPC Template Type had passed checks with no issues, so extra dynamic tests (which could have caught the issue) weren’t performed.
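For a sense of what "validation" means here, the sketch below shows the sort of invariant a content validator exists to enforce: a template instance must supply every input field its template type declares. It's a minimal Python illustration, not CrowdStrike's actual Content Validator, and the field counts are the ones reported in their post-incident review.

```python
# Illustrative sketch, not CrowdStrike's actual Content Validator. The single
# invariant shown: a template instance must supply every input field that its
# template type declares.
from dataclasses import dataclass


@dataclass
class TemplateType:
    name: str
    declared_fields: int            # how many input fields instances must supply


@dataclass
class TemplateInstance:
    channel_file: str
    supplied_fields: list[str]


def validate(instance: TemplateInstance, template_type: TemplateType) -> None:
    supplied = len(instance.supplied_fields)
    if supplied != template_type.declared_fields:
        raise ValueError(
            f"{instance.channel_file}: {supplied} fields supplied, "
            f"{template_type.declared_fields} declared by {template_type.name}"
        )


# A mismatch like the one reported for Channel File 291 should fail loudly here,
# rather than shipping to several million machines:
ipc_type = TemplateType(name="IPC Template Type", declared_fields=21)
bad_instance = TemplateInstance(channel_file="Channel File 291",
                                supplied_fields=["v"] * 20)
validate(bad_instance, ipc_type)    # raises ValueError
```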
The testing failure
Falcon relied on admittedly robust automated testing. However, it only needs to be not quite robust enough once.
The tests failed to catch the bug, and there was no human developer in the loop to provide a final layer of oversight.
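A dynamic test, by contrast, actually exercises the new content before release rather than just inspecting it. A minimal, hypothetical pytest sketch – reusing the toy interpreter from the crash analogy above – might gate deployment like this:

```python
# Illustrative only: the sort of dynamic test the post-incident write-ups say
# wasn't run for this particular content push – feed the candidate channel file
# through the interpreter in a test environment and block release if it blows up.
# The interpreter is the same toy as in the crash sketch above.
import pytest

EXPECTED_FIELDS = 21


def interpret(fields: list[str]) -> None:
    for i in range(EXPECTED_FIELDS):
        _ = fields[i]               # an out-of-bounds read here means a crash in the field


def test_candidate_channel_file_is_interpretable():
    candidate_content = ["value"] * 20          # the new Channel File 291 payload
    try:
        interpret(candidate_content)
    except Exception as exc:                    # any crash should block the release
        pytest.fail(f"content update crashed the interpreter: {exc!r}")
```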
WWJD?
What would Just After Midnight do?
At JAM, we look after mission-critical products and websites for a range of enterprise partners.
From innovative tech-led advertising running during the Super Bowl to giants in the software field, our raison d'être is to prevent, detect and resolve failure and downtime to a tighter-than-tight SLA.
We do this through a mixture of:
- Comprehensive and far-reaching automated testing – using tools such as the TeamCity Matrix Build feature (see the sketch after this list)
- Follow-the-sun developer distribution – full-stack engineers always-on in every time zone
- A buck-stopping culture – it’s always our problem no matter where the point of failure occurs
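As a flavour of that first bullet: matrix-style testing means running the same checks across every combination of the variables that matter (OS build, content version, region, and so on). TeamCity's Matrix Build feature automates that in CI; the pytest sketch below shows the same idea in miniature. The build names, content channels and the deploy_and_probe helper are all hypothetical stand-ins.

```python
# A minimal, illustrative take on matrix-style testing using pytest – the same
# idea TeamCity's Matrix Build feature automates in CI. The build names, content
# channels and the deploy_and_probe helper below are all hypothetical stand-ins.
import itertools
from dataclasses import dataclass

import pytest

WINDOWS_BUILDS = ["win10-22H2", "win11-23H2", "server-2022"]
CONTENT_CHANNELS = ["current", "n-1", "n-2"]    # staged content versions


@dataclass
class ProbeResult:
    booted: bool
    crash_looped: bool


def deploy_and_probe(windows_build: str, content_channel: str) -> ProbeResult:
    # Hypothetical: in a real pipeline this would provision a disposable VM for
    # this cell of the matrix, push the content update and reboot-test the box.
    return ProbeResult(booted=True, crash_looped=False)


@pytest.mark.parametrize(
    "windows_build,content_channel",
    list(itertools.product(WINDOWS_BUILDS, CONTENT_CHANNELS)),
)
def test_update_survives_every_cell(windows_build, content_channel):
    result = deploy_and_probe(windows_build, content_channel)
    assert result.booted and not result.crash_looped
```

The point isn't the tooling; it's that a config that only ever gets tested on one happy-path combination is a config waiting to take down the other nine.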
So, if you don’t fancy being the subject of a blog like this one, give us a shout.