DevOps incident management – how to (and how to outsource)

by Ned Hallett
As Digital Marketing Manager and JAM’s primary pair of lungs, I provide the JAM-y take on the ever-evolving worlds of DevOps, SaaS, MACH - and acronyms yet to be coined.
Published in April 2024

If you’re reading this, you’re either well prepared or well…not. If the latter, don’t waste any time. Get in touch with our support team straightaway and Godspeed. If the former, let’s get to it.

Why “DevOps” incident management

Incident management or IM is a little different in the context of DevOps. For one, you stop using an initialism because you’d get DIM, which carries the connotation of general stupidity.

The other and more relevant differences are:

  • A DevOps incident covers (though is not limited to) points of failure in your DevOps tooling and pipeline
  • DevOps incident management (more commonly) refers to how you enrich IM using a DevOps methodology/tooling  

In short, DevOps incident management is IM for people using DevOps for enhanced automation and collaboration. 

How incident response and management can get a DevOps-y shine

To illustrate this, let’s walk through a step-by-step, best-practice IM flow and highlight the DevOps-y parts.

Step one, incident identification

In one sense, a DevOps approach to this step doesn’t differ too much from general best practice. But in another sense, it does.

In both, you will have a focus on proactive monitoring tools that may encompass:

  • Application performance monitoring (APM), for real-time application metrics
  • Infrastructure monitoring, to track server, storage and network health
  • Network performance monitoring for insights into traffic distribution
  • Log management tools, to aggregate and analyze logs
  • Security monitoring to detect vulnerabilities
  • Cloud monitoring for cloud resources and services

However, the main difference in a DevOps response would be the extent to which each monitoring tool is integrated into an automated incident response.

An example

You may have a log monitoring tool configured to trigger a sequence of actions if it detects a spike in internal 500 errors.

Using the ELK Stack, for example, you might decide that if more than 0.5% of requests return a 500 error within a 5-minute window, you’d generate an alert to your team and/or begin a series of automated fixes, which we’ll explore in the next section.
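To make that threshold concrete, here is a minimal sketch of the check in plain Python. It is not ELK configuration: the `ErrorRateMonitor` class and its method names are hypothetical, and it assumes the 0.5% figure means "share of requests in the last 5 minutes that returned a 500". In a real setup your log tooling would do this maths for you.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 5 * 60        # the 5-minute window from the example
ERROR_RATE_THRESHOLD = 0.005   # 0.5%, expressed as a fraction


@dataclass
class ErrorRateMonitor:
    """Hypothetical sliding-window check; not part of any ELK API."""
    window_seconds: int = WINDOW_SECONDS
    threshold: float = ERROR_RATE_THRESHOLD
    # Each entry is (timestamp_in_seconds, http_status_code), oldest first.
    _events: deque = field(default_factory=deque)

    def record(self, timestamp: float, status_code: int) -> None:
        """Store one request and drop anything older than the window."""
        self._events.append((timestamp, status_code))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        """Fraction of requests in the window that returned a 500."""
        if not self._events:
            return 0.0
        errors = sum(1 for _, status in self._events if status == 500)
        return errors / len(self._events)

    def should_alert(self) -> bool:
        """True once the error rate climbs above the threshold."""
        return self.error_rate() > self.threshold
```

Whatever calls `should_alert()` is then free to page a human, kick off an automated fix, or both.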

The key difference

A traditional approach to IM would still focus on setting up monitoring for incident alerts but may rely on a plucky team to fix things the ol’-fashioned way.

Step two, incident response

Incident response is shaped by whether you’ve outsourced your IM or not. But for today’s exercise, we’ll assume you haven’t.

In the best-case scenario, your internal incident management workflow will involve some sort of immediate response by the on-call engineer who then escalates if they’re unable to resolve things for themselves.

Using tools like Jira Service Management or PagerDuty, you then coordinate your response team and fixes, hopefully relying on a defined structure, exhaustive runbook and willingness to take these learnings forward to future incidents.
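The "respond, then escalate" structure can be sketched as a simple escalation policy. This is an illustration only: the `EscalationLevel` type, the responder names and the timeouts are all made up, and tools like PagerDuty or Jira Service Management have their own ways of configuring this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationLevel:
    """One rung of a hypothetical escalation ladder."""
    responder: str
    ack_timeout_minutes: int  # how long this rung has before we escalate


def current_responder(policy: Sequence[EscalationLevel],
                      minutes_since_alert: float) -> str:
    """Return who should own an unacknowledged alert right now."""
    if not policy:
        raise ValueError("escalation policy must have at least one level")
    elapsed = 0
    for level in policy:
        elapsed += level.ack_timeout_minutes
        if minutes_since_alert < elapsed:
            return level.responder
    # Past every timeout: the alert stays with the last rung.
    return policy[-1].responder


# Example policy; names and timings are assumptions for illustration.
POLICY = [
    EscalationLevel("on-call engineer", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("incident commander", 30),
]
```

With this policy, an alert that has gone unacknowledged for 20 minutes belongs to the secondary on-call.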

The differences, when DevOps is properly integrated into the incident management process, are:

  • The incident management team will likely straddle dev and ops (internal structure may vary)
  • Many of the first-line fixes will be built into the workflow, triggered by the kinds of alerts and thresholds we looked at in step one
  • Even if the fixes are not entirely automatic, there will be a greater leveraging of automation in whatever fix or resolution is eventually carried out

An example

You’re sitting pretty, then all of a sudden Prometheus fills your monitoring Slack channel with tales of woe.

An ops and dev supergroup determines a rollback is the way to go. Then, using a GitLab webhook, the command can be sent straight to GitLab from Slack itself. 

A low-touch, highly automated rollback.
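For a feel of what sits behind a Slack command like that, here is one plausible sketch. It assumes the rollback is a GitLab pipeline started through GitLab’s pipeline trigger API, which may not match how your webhook is wired up. The `ROLLBACK_TO` variable is hypothetical, and your pipeline would need a job that actually reads it. The function only builds the request, so nothing is sent until you choose to.

```python
import urllib.parse
import urllib.request


def build_rollback_request(gitlab_url: str, project_id: int,
                           trigger_token: str, ref: str,
                           rollback_to: str) -> urllib.request.Request:
    """Build (but do not send) a POST to GitLab's pipeline trigger endpoint."""
    endpoint = (f"{gitlab_url.rstrip('/')}/api/v4/projects/"
                f"{project_id}/trigger/pipeline")
    form = urllib.parse.urlencode({
        "token": trigger_token,                 # a pipeline trigger token
        "ref": ref,                             # branch or tag to run against
        "variables[ROLLBACK_TO]": rollback_to,  # hypothetical variable name
    }).encode("utf-8")
    return urllib.request.Request(endpoint, data=form, method="POST")
```

The Slack side would call this with the version the team agreed on, then pass the result to `urllib.request.urlopen` to set the pipeline running.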

The key difference

The internal structure of your team and the nature of your response reflect DevOps’ focus on collaboration and automation.

Step three, incident resolution and future planning

How you resolve incidents carries the question of what you’ll do if the same incident crops up in the future.

Some incidents truly are acts of God. And we can’t always carry out future-proofing and root cause analysis for each trip and wobble.

However, in lots of cases, we can.

A major incident should prompt a thorough incident report that will look at improving the response itself and the systems that went down in the first place. In the case of the former we might have:

  • Altering monitoring and alert systems for quick detection
  • Updating documentation and runbooks for faster incident response
  • Revising on-call schedules to ensure better coverage
  • Conducting training sessions to address knowledge gaps

And in the stack itself:

  • Adopting more robust and scalable cloud services 
  • Switching to databases with better fault tolerance 
  • Incorporating extra security measures
  • Working with a CDN or headless approach to take work off the server

Again, DevOps-enriched incident management, DevOps-enabled incident management, whatever phrase you want to use, means a greater focus on automation and automated fixes being added to the response process itself. 

An example

Let’s say we have a new deployment that causes a major memory leak. It’s a big disruption to overall service quality with much downtime and unhappiness.

The monitoring tools didn’t pick it up. The incident management tools didn’t get people’s ducks in a row. The incident response process was, in short, not good. 

DevOps team. Incident commander. What can we do?

We can in fact:

  • Add Valgrind (a memory leak detection tool) to our pipeline, so every new build is checked for memory leaks before it is deployed
  • If the code causes a leak, it can no longer leave the CI stage
  • This automatically creates a ticket and sends it to the relevant team(s)
  • The dev and ops team can now expand their knowledge of memory leaks at their leisure rather than in the flaming panic of a live incident 
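The steps above can be sketched as a small CI gate. Treat it as a rough outline under a few assumptions: the binary path, the ticket fields and the `leak_gate` helper are hypothetical, and the "ticket" is just a dictionary you would hand to whatever ticketing tool you use. The Valgrind flags are real: `--leak-check=full` reports leaks in detail and `--error-exitcode=1` makes Valgrind exit with 1 when it finds errors.

```python
import subprocess
from typing import Callable, Optional, Sequence


def valgrind_command(binary: str, args: Sequence[str] = ()) -> list:
    """Command line that runs a binary under Valgrind's leak checker."""
    return ["valgrind", "--leak-check=full", "--error-exitcode=1",
            binary, *args]


def leak_gate(binary: str,
              run: Callable = subprocess.run) -> Optional[dict]:
    """Return None if the build may leave CI, else a ticket payload.

    Caveat: a non-zero exit can also mean the program itself failed,
    so a real gate would inspect the report before blaming a leak.
    """
    result = run(valgrind_command(binary), capture_output=True, text=True)
    if result.returncode == 0:
        return None
    return {
        "title": f"Possible memory leak in {binary}",
        # Valgrind writes its report to stderr by default.
        "details": result.stderr[-2000:],
        "teams": ["dev", "ops"],  # hypothetical routing
    }
```

In a pipeline job you would call `leak_gate("./build/app")` and fail the job whenever it returns a ticket rather than `None`.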

The key difference

There is a greater focus on iterative, automated improvements across the pipeline.

How and why to outsource incident management and DevOps

As you can see, incident management is one thing, and DevOps-enriched incident management is another, slightly better thing.

However, for many teams, even getting a solid, 24/7 incident management capability is a tricky proposition. Let alone one enriched by DevOps.

This can be down to:

  • It being too expensive to upskill your existing team + buy all the incident management tools and DevOps doodahs
  • Even when you do get all that stuff, it’s hard to scale, requiring a pretty linear cost spike if you need to, say, double your capacity
  • 24/7 coverage is hard to achieve outside a follow-the-sun model which itself is hard to achieve outside already being a globetrotter with offices in multiple timezones
  • You can’t exactly sue or legally punish your own team for falling outside an SLA, whereas you can with a provider (in practice, that very real possibility makes providers very good at sticking to their SLAs)
  • Same point as above for security
  • In general, an outsourced team will have all the tech and all the tips for doing IM perfectly which is their bread and butter; your bread and butter is whatever you do; so it’s best you do that and they do this

How we can help

As the team that literally monitors and supports a major DevOps tool used for detection and rollbacks (identity to be disclosed shortly), we can actually answer the question ‘who watches the watchmen?’

The answer is that we are so good at watching, we actually watch the watchmen. It’s us. We were chosen to do it.

And if that meta watching-and-fixing-the-thing-that-watches-and-fixes doesn’t convince you, then you need watching and fixing, in our opinion.

See below for examples of how we’ve:

  • Provided DevOps, support and IM for legal SaaS product StructureFlow
  • Provided 24/7 support and IM for the product that journey plans for the likes of Five Guys and Taco Bell, LineTen
  • Provided support and IM for the customer-facing website of global leading law firm DLA Piper

DevOps support can be added to our custom package. If you want to talk about outsourcing your monitoring and IM, or anything else, just get in touch.

CONTACT US

With partners across the USA, Europe and APAC, we provide a truly global service. So wherever you or your clients are based, contact us today to find out what we can do.