If you’re reading this, you’re either well prepared or well…not. If the latter, don’t waste any time. Get in touch with our support team straightaway and Godspeed. If the former, let’s get to it.
Why “DevOps” incident management
Incident management or IM is a little different in the context of DevOps. For one, you stop using an initialism because you’d get DIM, which carries the connotation of general stupidity.
The other and more relevant differences are:
- A DevOps incident covers (though is not limited to) points of failure in your DevOps tooling and pipeline
- DevOps incident management (more commonly) refers to how you enrich IM using a DevOps methodology/tooling
In short, DevOps incident management is IM for people using DevOps for enhanced automation and collaboration.
How incident response and management can get a DevOps-y shine
To illustrate this, let’s walk through a step-by-step best-practice IM flow and highlight the DevOps-y parts.
Step one, incident identification
In one sense, a DevOps approach to this step doesn’t differ too much from general best practice. But in another sense, it does.
In both, you will have a focus on proactive monitoring tools that may encompass:
- Application performance monitoring (APM), for real-time application metrics
- Infrastructure monitoring, to track server, storage and network health
- Network performance monitoring for insights into traffic distribution
- Log management tools, to aggregate and analyze logs
- Security monitoring to detect vulnerabilities
- Cloud monitoring for cloud resources and services
However, the main difference in a DevOps response would be the extent to which each monitoring tool is integrated into an automated incident response.
An example
You may have a log monitoring tool configured to trigger a sequence of actions if it detects a spike in internal 500 errors.
Using ELK Stack, for example, you might decide that if the 500 error rate goes above 0.5% in a 5-minute window, you’d send an alert to your team and/or begin a series of automated fixes, which we’ll explore in the next section.
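To make that threshold concrete, here is a minimal Python sketch of the check itself, independent of any particular monitoring tool. It assumes the 0.5% figure means a share of all requests in the window; the `ErrorRateMonitor` class and its method names are hypothetical and not part of ELK.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 5 * 60        # the 5-minute window from the example
ERROR_RATE_THRESHOLD = 0.005   # 0.5%, read as a share of all requests (assumption)


@dataclass
class ErrorRateMonitor:
    """Hypothetical helper: tracks (timestamp, status) pairs in a sliding window."""

    window_seconds: int = WINDOW_SECONDS
    threshold: float = ERROR_RATE_THRESHOLD
    events: deque = field(default_factory=deque)

    def record(self, timestamp: float, status: int) -> None:
        """Store one request and drop anything older than the window."""
        self.events.append((timestamp, status))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def error_rate(self) -> float:
        """Share of requests in the current window that returned a 500."""
        if not self.events:
            return 0.0
        errors = sum(1 for _, status in self.events if status == 500)
        return errors / len(self.events)

    def should_alert(self) -> bool:
        """True once the 500 rate in the window exceeds the threshold."""
        return self.error_rate() > self.threshold
```

In a real setup the log tool would own this logic. The point is only that the alert condition is a plain, testable rule that an automated response can hang off.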
The key difference
A traditional approach to IM would still focus on setting up monitoring for incident alerts but may rely on a plucky team to fix things the old-fashioned way.
Step two, incident response
Incident response is shaped by whether you’ve outsourced your IM or not. But for today’s exercise, we’ll assume you haven’t.
In the best-case scenario, your internal incident management workflow will involve some sort of immediate response by the on-call engineer who then escalates if they’re unable to resolve things for themselves.
Using tools like Jira Service Management or PagerDuty, you then coordinate your response team and fixes, hopefully relying on a defined structure, exhaustive runbook and willingness to take these learnings forward to future incidents.
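As a rough illustration of the escalation step, the sketch below picks who should be handling an incident based on how long it has gone unresolved. The roles and time limits are assumptions made up for this example; tools like PagerDuty let you configure an equivalent policy rather than code it by hand.

```python
from typing import List, Tuple

# Hypothetical policy: (minutes unresolved, who takes over at that point).
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "on-call engineer"),
    (15, "secondary on-call"),
    (30, "incident commander"),
]


def current_responder(minutes_unresolved: int,
                      policy: List[Tuple[int, str]] = ESCALATION_POLICY) -> str:
    """Return the last role in the policy whose time limit has been reached."""
    responder = policy[0][1]
    for after_minutes, role in policy:
        if minutes_unresolved >= after_minutes:
            responder = role
    return responder
```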
The differences, when DevOps is properly integrated into the incident management process, are:
- The incident management team will likely straddle dev and ops (internal structure may vary)
- Many of the first-line fixes will be built into the workflow, triggered by the kinds of alerts and thresholds we looked at in step one
- Even if the fixes are not entirely automatic, automation will play a bigger part in whatever fix or resolution is eventually carried out
An example
You’re sitting pretty, then all of a sudden Prometheus fills your monitoring Slack channel with tales of woe.
An ops and dev supergroup determines a rollback is the way to go. Then, using a GitLab webhook, the command can be sent straight to GitLab from Slack itself.
A low-touch, highly automated rollback.
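The exact wiring between Slack and GitLab will vary. As one possible sketch, the Python below builds (but does not send) a request to GitLab’s pipeline trigger endpoint, which a Slack command handler could then fire. The `ROLLBACK_TO` variable, the function name and the example values are assumptions; your own CI configuration would need a job that reads that variable and redeploys the named version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_rollback_request(gitlab_url: str, project_id: int, trigger_token: str,
                           ref: str, rollback_to: str) -> Request:
    """Build a POST to GitLab's pipeline trigger endpoint without sending it."""
    endpoint = f"{gitlab_url}/api/v4/projects/{project_id}/trigger/pipeline"
    form = urlencode({
        "token": trigger_token,   # a pipeline trigger token for the project
        "ref": ref,               # branch or tag whose pipeline should run
        # Hypothetical variable: the project's CI config decides what it means.
        "variables[ROLLBACK_TO]": rollback_to,
    })
    return Request(endpoint, data=form.encode("utf-8"), method="POST")


# Example with placeholder values; urllib.request.urlopen(request) would send it.
request = build_rollback_request(
    "https://gitlab.example.com", 42, "trigger-token", "main", "v1.2.3"
)
```

Keeping the request building separate from the sending makes it easy to test the handler without touching a live GitLab instance.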
The key difference
The internal structure of your team and the nature of your response reflect DevOps’ focus on collaboration and automation.
Step three, incident resolution and future planning
How you resolve incidents raises the question of what you’ll do if the same incident crops up in the future.
Some incidents truly are acts of God. And we can’t always carry out future-proofing and root cause analysis for each trip and wobble.
However, in lots of cases, we can.
A major incident should prompt a thorough incident report that will look at improving the response itself and the systems that went down in the first place. In the case of the former we might have:
- Altering monitoring and alert systems for quick detection
- Updating documentation and runbooks for faster incident response
- Revising on-call schedules to ensure better coverage
- Conducting training sessions to address knowledge gaps
And in the stack itself:
- Adopting more robust and scalable cloud services
- Switching to databases with better fault tolerance
- Incorporating extra security measures
- Working with a CDN or headless approach to take work off the server
Again, DevOps-enriched incident management, DevOps-enabled incident management, whatever phrase you want to use, means a greater focus on automation and automated fixes being added to the response process itself.
An example
Let’s say we have a new deployment that causes a major memory leak. It’s a big disruption to overall service quality with much downtime and unhappiness.
The monitoring tools didn’t pick it up. The incident management tools didn’t get people’s ducks in a row. The incident response process was, in short, not good.
DevOps team. Incident commander. What can we do?
We can in fact:
- Add Valgrind (a memory leak detection tool) to our pipeline, so it scans for memory leaks whenever new code goes through CI
- If the code causes a leak, it can no longer leave the CI stage
- A failed check automatically creates a ticket and sends it to the relevant team(s)
- The dev and ops team can now expand their knowledge of memory leaks at their leisure rather than in the flaming panic of a live incident
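The steps above could be sketched as a small CI gate in Python. It assumes Valgrind is installed on the CI runner and that the build produces a binary to run under it. The ticket payload is a hypothetical shape, not the format of any specific ticketing tool, and `LEAK_EXIT_CODE` is an arbitrary value chosen for this example.

```python
import subprocess
from typing import List, Optional

# Arbitrary code we ask Valgrind to exit with when it reports errors.
# Pick one the program under test never returns on its own.
LEAK_EXIT_CODE = 42


def valgrind_command(binary: str, args: Optional[List[str]] = None) -> List[str]:
    """Build the Valgrind invocation for the CI stage."""
    return [
        "valgrind",
        "--leak-check=full",                    # report each leak in detail
        f"--error-exitcode={LEAK_EXIT_CODE}",   # non-zero exit fails the job
        binary,
        *(args or []),
    ]


def leak_ticket(returncode: int, commit: str) -> Optional[dict]:
    """Return a hypothetical ticket payload if Valgrind reported errors."""
    if returncode != LEAK_EXIT_CODE:
        return None
    return {
        "title": f"Valgrind reported memory errors in {commit}",
        "labels": ["memory-leak", "ci-blocked"],
    }


def run_gate(binary: str, commit: str) -> Optional[dict]:
    """Run the binary under Valgrind; a returned ticket means the gate failed."""
    result = subprocess.run(valgrind_command(binary), check=False)
    return leak_ticket(result.returncode, commit)
```

Note that `--error-exitcode` fires on any error Valgrind reports, not only leaks, so the gate is slightly stricter than the bullet list suggests.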
The key difference
There is a greater focus on iterative, automated improvements across the pipeline.
How and why to outsource incident management and DevOps
As you can see, incident management is one thing, and DevOps-enriched incident management is another, slightly better thing.
However, for many teams, even getting a solid, 24/7 incident management capability is a tricky proposition. Let alone one enriched by DevOps.
This can be down to:
- It being too expensive to upskill your existing team + buy all the incident management tools and DevOps doodahs
- Even when you do get all that stuff, it’s hard to scale, requiring a pretty linear cost spike if you need to, say, double your capacity
- 24/7 coverage is hard to achieve outside a follow-the-sun model, which itself is hard to achieve unless you’re already a globetrotter with offices in multiple timezones
- You can’t exactly sue or legally punish your own team for falling outside of an SLA, whereas you can with a provider (what actually happens is that this much more real possibility makes them very good at sticking within their SLAs)
- Same point as above for security
- In general, an outsourced team will have all the tech and all the tips for doing IM well, because that is their bread and butter. Your bread and butter is whatever you do, so it’s best you do that and they do this
How we can help
As the team that literally monitors and supports a major DevOps tool used for detection and rollbacks (identity to be disclosed shortly), we can actually answer the question ‘who watches the watchmen?’
The answer is that we are so good at watching, we actually watch the watchmen. It’s us. We were chosen to do it.
And if that meta watching-and-fixing-the-thing-that-watches-and-fixes doesn’t convince you, then you need watching and fixing, in our opinion.
See below for examples of how we’ve:
- Provided DevOps, support and IM for legal SaaS product StructureFlow
- Provided 24/7 support and IM for LineTen, the product that handles journey planning for the likes of Five Guys and Taco Bell
- Provided support and IM for the customer-facing website of leading global law firm DLA Piper
DevOps support can be added to our custom package. If you want to talk about outsourcing your monitoring and IM, or anything else, just get in touch.