Fastly went down yesterday, and it literally broke the internet.
A CDN (content delivery network) used by the likes of Amazon, Reddit and gov.uk, Fastly is responsible for serving content from servers geographically closer to users, cutting down on latency and improving performance.
But yesterday, it broke instead.
When commonly used third-party services go down, a lot of customers’ tech stacks run into problems, and it can feel like the internet is broken.
But what can you do in these situations to mitigate the damage? Well, as 24/7 full-stack support specialists, we like to think we know a thing or two about damage control.
What to do
Diagnose the issue: which part of the CDN is actually down? Is it one region, a few, or all of them?
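As a rough illustration of that first diagnostic step, here is a minimal Python sketch. It assumes you have some way of testing your site through each CDN region (for example, monitoring probes or per-region test URLs); the `probe` helper, the region names and the wording of the results are all hypothetical, not part of any provider's API.

```python
from urllib.request import urlopen


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), timeouts and refused connections.
        return False


def classify_scope(results: dict[str, bool]) -> str:
    """Summarise per-region probe results (region name -> reachable?)."""
    if not results:
        raise ValueError("no probe results to classify")
    failed = sorted(region for region, ok in results.items() if not ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "all regions down"
    if len(failed) == 1:
        return f"one region down: {failed[0]}"
    return "several regions down: " + ", ".join(failed)
```

Feeding it something like `{"eu": False, "us": True, "asia": True}` gives "one region down: eu", which tells you whether you are dealing with a local blip or a full outage before you start changing anything.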
If everything has gone down, one option is to update your DNS records to bypass the CDN, sending traffic straight to the origin without any of the CDN's features. However, depending on how the site is set up, this may mean opening up firewall rules or security groups, as some will only allow traffic coming from the CDN. Whilst there is an element of security risk associated with this, it is only a temporary workaround, so the exposure is limited as long as you restore the restrictions once the CDN is back.
Alternatively, if you have a disaster recovery site or holding page set up for situations like this, you could send users there temporarily. This will redirect traffic to a simplified website held elsewhere, with fewer IP restrictions in place.
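The choice between these fallbacks can be written down ahead of time so nobody has to improvise mid-incident. The sketch below is one possible way to express it in Python; the `SiteState` fields, the target names and the order of preference (origin first, holding page second) are our own assumptions for illustration, and your setup may call for a different order.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    cdn_healthy: bool
    # True once firewall rules / security groups allow non-CDN traffic.
    origin_accepts_direct_traffic: bool
    # True if a disaster recovery site or holding page is ready to serve.
    holding_page_ready: bool


def choose_dns_target(state: SiteState) -> str:
    """Pick where DNS should point, given the current state of the site."""
    if state.cdn_healthy:
        return "cdn"
    if state.origin_accepts_direct_traffic:
        return "origin"
    if state.holding_page_ready:
        return "holding-page"
    # No fallback available: leave DNS alone and escalate with the provider.
    return "cdn"
```

For example, `choose_dns_target(SiteState(False, False, True))` returns "holding-page": the CDN is down, the origin is still locked to CDN traffic, so the holding page is the safest place to send users.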
Meanwhile, chase your provider for updates on when things will go back to normal – make lots of noise and try to get frequent updates.
If the CDN you’re using has failed multiple times, you may even want to consider changing providers. However, we have seen similar outages from most of the top CDN providers over the last year or two; this is not limited to Fastly.
How to do it
In terms of incident management, it’s all about communication.
As soon as the problem is identified, you need to send out comms to your team and clients.
They’ll be frustrated they can’t reach you – so let them know what the problem is and that you’re working on it.
Send regular updates (so they know you’re actually doing something) rather than a single message when it all kicks off. Our team sends updates at least every 30 minutes during major incidents.
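If you want tooling to nudge you about that cadence, the check itself is tiny. This is a minimal Python sketch assuming the 30-minute interval mentioned above; the function names are hypothetical and how you record the time of the last update is up to you.

```python
from datetime import datetime, timedelta

# Assumed cadence: at least one update every 30 minutes during a major incident.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the latest time by which the next update should go out."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True once the interval since the last update has elapsed."""
    return now >= next_update_due(last_update, interval)
```

Something as simple as a chat reminder driven by `is_update_overdue` is enough to stop updates slipping while everyone is heads-down on the fix.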
Keep talking with all relevant stakeholders, including your technical team and the CDN provider, and pass what you learn along to clients/customers.
If you’re a B2B business, send out a summary of exactly what happened to your clients once everything’s under control. This ensures they know what actions were taken throughout.
To make sure everything above runs smoothly, have a set of instructions ready and waiting for your team to access should something like this happen. Knowing who to contact and when is key – and means that you’ll likely be reaching out to your clients before they come to you.
After each event, update these instructions with what you learned, so the incident management process is continually refined.
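Those instructions do not need to be elaborate. As a sketch of the "who to contact and when" part, the Python below keeps the contact list as plain data so it is easy to review and update after each incident. The roles and the minute thresholds are made-up examples, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    role: str
    # How long after detection this contact should have been notified.
    notify_within_minutes: int


# Hypothetical runbook entries; replace with your own roles and timings.
RUNBOOK = [
    Contact("on-call engineer", 0),
    Contact("CDN provider support", 5),
    Contact("clients", 15),
]


def contacts_due(runbook: list[Contact], minutes_since_detection: int) -> list[str]:
    """List every role that should have been contacted by this point."""
    return [
        contact.role
        for contact in runbook
        if contact.notify_within_minutes <= minutes_since_detection
    ]
```

Ten minutes in, `contacts_due(RUNBOOK, 10)` returns the on-call engineer and CDN provider support, so it is obvious at a glance whether anyone has been missed.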
How we can help
As full-stack support specialists, we take incident management and resolution entirely off your plate.
With a tech-enabled support offering and a global team who work just like an extension of your business, we’re perfectly set up to help you through the most difficult times a business can face, and, dare we say it, fix the internet.
On top of that, we can help you plan for disaster recovery before anything goes wrong, identifying weak spots, assessing third parties, and building for resilience.
To find out more about our 24/7 full-stack support service and how we can help you avoid outages in the first place (or even avoid pulling a Fastly), just get in touch.