Warning: this post contains graphic descriptions of enterprise system failures.
On July 19th, 2024, you’d be forgiven for thinking you’d awoken to an unannounced war game in which the world’s governments tested their mettle against the initial cyber salvos of WW3.
Such was the scale of the CrowdStrike outage:
- Around 8.5 million Windows devices affected
- Upward of $5 billion in losses for the US Fortune 500 (more on this later), only a fraction of which was covered by insurance
- And, most distressingly, serious disruptions to vital services like 911 and healthcare provision
A brief refresher for the uninformed and/or the smugly Linux/macOS-leaning: the CrowdStrike outage was a faulty update to CrowdStrike's Falcon security sensor that crashed a huge swath of Windows devices on July 19th, 2024.
So, what happened?
Anatomy of a fall – how Falcon fell to earth
We’re not the first company to pick the bones of this particular carcass. And we won’t be the last.
Many in the DevOps and DevSecOps space have lined up to tear off a scrap of I-told-you-so or wouldn't-happen-on-my-watch – as they're entitled to do.
However, as a company that provides 24/7 support and DevOps to enterprise-grade clients, our hearts go out to the Falcon team.
So, rather than crowing (more bird puns to follow), we’ll try to understand exactly what went wrong.
Anatomy of a fall:
Mid-flight: pre-July 19th
- The Falcon team happily develop an update to Channel File 291, one of the Falcon sensor's configuration files. At this point, everything's gravy
- The file goes into content validation via CrowdStrike’s imaginatively named Content Validator tool; it’s validated
- Next, File 291 is tested. The testing, as far as the Falcon team is concerned, goes great
Some turbulence: the small hours of July 19th
- Around 04:09 UTC, Channel File 291 (unsatisfactorily adding up to just one short of 13 for the numerophiles) is deployed
- Almost immediately, Windows systems running the Falcon sensor begin crashing and rebooting in a never-ending loop (we sketch the mechanics just after this list)
- CrowdStrike is inundated with user reports
Earthward bound: the big and unpleasant hours of July 19th
- Within hours, the CrowdStrike team diagnose the issue and revert the update to prevent further damage
- However, by this point, the extant damage is considerable. See our opening bullets. Millions affected. Billions lost. Life-saving services on the fritz
- Falcon hits the ground with this mea culpa – tl;dr: it's our fault
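To make the crash loop concrete, here's a toy Python analogy – emphatically not CrowdStrike's code, which lives in a kernel-mode driver. Per CrowdStrike's own post-incident reports, the new template type defined 21 input fields while Channel File 291's instances supplied only 20, so the sensor read past the end of the data it was given. In kernel mode that's an invalid memory access and a blue screen; and because the sensor loads at boot, the machine crashes, reboots, loads it again and crashes again.

```python
# Toy analogy only – not CrowdStrike's code. A parser that expects 21 fields is
# handed 20, and indexing the 21st reads past the end of the data. In Python
# that's an IndexError; in a kernel-mode driver it's an invalid memory read and
# a system crash.
EXPECTED_FIELDS = 21                # what the new template type defined
supplied_fields = ["value"] * 20    # what Channel File 291's instance carried


def interpret(fields: list[str]) -> None:
    for i in range(EXPECTED_FIELDS):
        value = fields[i]           # i == 20 is out of bounds -> crash
        # ...evaluate the detection rule against `value` here...


def boot() -> None:
    # The sensor loads early in the boot sequence, so the crash repeats on
    # every restart: crash, reboot, crash, reboot.
    interpret(supplied_fields)


if __name__ == "__main__":
    boot()                          # raises IndexError – our stand-in for the BSOD
```

An IndexError in a script is an inconvenience; the equivalent read in a boot-start kernel driver takes the whole machine down before anyone can log in to fix it.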
What went wrong?
Looking at the above, it’s pretty obvious the validation and testing mechanisms were the points of failure.
Once they were aware of the problem, the Falcon team behaved admirably and speedily – though not quite as speedily as we.
The validation failure
Essentially, there was a bug in the Content Validator tool itself that let File 291 slip through the net.
Previous updates using the same IPC Template Type had passed checks with no issues, so extra dynamic tests (which could have caught the issue) weren’t performed.
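For a sense of what "validation" means here, the sketch below shows the sort of invariant a content validator exists to enforce: a template instance must supply every input field its template type declares. It's a minimal Python illustration, not CrowdStrike's actual Content Validator, and the field counts are the ones reported in their post-incident review.

```python
# Illustrative sketch, not CrowdStrike's actual Content Validator. The single
# invariant shown: a template instance must supply every input field that its
# template type declares.
from dataclasses import dataclass


@dataclass
class TemplateType:
    name: str
    declared_fields: int            # how many input fields instances must supply


@dataclass
class TemplateInstance:
    channel_file: str
    supplied_fields: list[str]


def validate(instance: TemplateInstance, template_type: TemplateType) -> None:
    supplied = len(instance.supplied_fields)
    if supplied != template_type.declared_fields:
        raise ValueError(
            f"{instance.channel_file}: {supplied} fields supplied, "
            f"{template_type.declared_fields} declared by {template_type.name}"
        )


# A mismatch like the one reported for Channel File 291 should fail loudly here,
# rather than shipping to several million machines:
ipc_type = TemplateType(name="IPC Template Type", declared_fields=21)
bad_instance = TemplateInstance(channel_file="Channel File 291",
                                supplied_fields=["v"] * 20)
validate(bad_instance, ipc_type)    # raises ValueError
```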
The testing failure
Falcon relied on admittedly robust automated testing. However, it only needs to be not quite robust enough once.
The tests failed to catch the bug, and there was no human developer in the loop to provide a final layer of oversight.
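A dynamic test, by contrast, actually exercises the new content before release rather than just inspecting it. A minimal, hypothetical pytest sketch – reusing the toy interpreter from the crash analogy above – might gate deployment like this:

```python
# Illustrative only: the sort of dynamic test the post-incident write-ups say
# wasn't run for this particular content push – feed the candidate channel file
# through the interpreter in a test environment and block release if it blows up.
# The interpreter is the same toy as in the crash sketch above.
import pytest

EXPECTED_FIELDS = 21


def interpret(fields: list[str]) -> None:
    for i in range(EXPECTED_FIELDS):
        _ = fields[i]               # an out-of-bounds read here means a crash in the field


def test_candidate_channel_file_is_interpretable():
    candidate_content = ["value"] * 20          # the new Channel File 291 payload
    try:
        interpret(candidate_content)
    except Exception as exc:                    # any crash should block the release
        pytest.fail(f"content update crashed the interpreter: {exc!r}")
```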
WWJD?
What would Just After Midnight do?
At JAM, we look after mission-critical products and websites for a range of enterprise partners.
From innovative tech-led advertising running during the Super Bowl to giants in the software field, our raison d'être is to prevent, detect and resolve failure and downtime to a tighter-than-tight SLA.
We do this through a mixture of:
- Comprehensive and far-reaching automated testing – using tools such as the TeamCity Matrix Build feature (see the sketch after this list)
- Follow-the-sun developer distribution – full-stack engineers always-on in every time zone
- A buck-stopping culture – it’s always our problem no matter where the point of failure occurs
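As a flavour of that first bullet: matrix-style testing means running the same checks across every combination of the variables that matter (OS build, content version, region, and so on). TeamCity's Matrix Build feature automates that in CI; the pytest sketch below shows the same idea in miniature. The build names, content channels and the deploy_and_probe helper are all hypothetical stand-ins.

```python
# A minimal, illustrative take on matrix-style testing using pytest – the same
# idea TeamCity's Matrix Build feature automates in CI. The build names, content
# channels and the deploy_and_probe helper below are all hypothetical stand-ins.
import itertools
from dataclasses import dataclass

import pytest

WINDOWS_BUILDS = ["win10-22H2", "win11-23H2", "server-2022"]
CONTENT_CHANNELS = ["current", "n-1", "n-2"]    # staged content versions


@dataclass
class ProbeResult:
    booted: bool
    crash_looped: bool


def deploy_and_probe(windows_build: str, content_channel: str) -> ProbeResult:
    # Hypothetical: in a real pipeline this would provision a disposable VM for
    # this cell of the matrix, push the content update and reboot-test the box.
    return ProbeResult(booted=True, crash_looped=False)


@pytest.mark.parametrize(
    "windows_build,content_channel",
    list(itertools.product(WINDOWS_BUILDS, CONTENT_CHANNELS)),
)
def test_update_survives_every_cell(windows_build, content_channel):
    result = deploy_and_probe(windows_build, content_channel)
    assert result.booted and not result.crash_looped
```

The point isn't the tooling; it's that a config that only ever gets tested on one happy-path combination is a config waiting to take down the other nine.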
So, if you don’t fancy being the subject of a blog like this one, give us a shout.