Keep calm and stay online: how to manage application and infrastructure incidents

by Lauren Davis
Published in August 2019

Providing 24/7 application support services means our team is always ready to react when an incident affects one of our clients’ websites. Whilst our technical team works to ensure this rarely happens, it is our team of incident managers who make sure that when it does, it is dealt with quickly and efficiently.

So, what do our team of incident managers do, and what tips do they have for putting out the fires? Read on to find out.

Going back to basics

Incident management is the process of resolving an issue. This may sound simple, but it involves many different people from within a company coming together to achieve this. Different incident management processes are required for different business scenarios, but they usually involve these basic steps:

  1. Incident identification: this is when the problem is reported and understood as an issue which needs to be resolved.
  2. Incident investigation/triage: this is when the incident is investigated and an attempt is made to understand why it is happening. This is also when the incident will be prioritised.
  3. Assignment or escalation: all incidents are assigned to a person in the team to be resolved.
  4. Resolution: this is when the team has been able to understand why the issue is occurring and correct it.
  5. Reporting and closure: a report is produced after each incident to log the process for the team and client to review. Following that, the incident is closed.
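
If you track incidents in your own tooling, the five steps above can be sketched as a small state machine. This is only an illustration in Python, not a description of JAM's systems: the Incident class, the stage names and priority labels such as "P1" are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """The five basic steps, in the order an incident moves through them."""
    IDENTIFIED = 1
    TRIAGED = 2
    ASSIGNED = 3
    RESOLVED = 4
    CLOSED = 5


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.IDENTIFIED
    priority: Optional[str] = None   # hypothetical label, e.g. "P1"
    assignee: Optional[str] = None
    report: Optional[str] = None
    history: List[Stage] = field(default_factory=lambda: [Stage.IDENTIFIED])

    def _advance(self, expected: Stage, new: Stage) -> None:
        # Refuse to skip a step, e.g. resolving before anyone is assigned.
        if self.stage is not expected:
            raise ValueError(f"cannot move to {new.name} from {self.stage.name}")
        self.stage = new
        self.history.append(new)

    def triage(self, priority: str) -> None:
        self._advance(Stage.IDENTIFIED, Stage.TRIAGED)
        self.priority = priority

    def assign(self, person: str) -> None:
        self._advance(Stage.TRIAGED, Stage.ASSIGNED)
        self.assignee = person

    def resolve(self) -> None:
        self._advance(Stage.ASSIGNED, Stage.RESOLVED)

    def close(self, report: str) -> None:
        # Closure requires a report, mirroring step 5.
        if not report.strip():
            raise ValueError("a report is required before closure")
        self._advance(Stage.RESOLVED, Stage.CLOSED)
        self.report = report
```

Calling triage, assign, resolve and close in that order walks an incident through every stage. Calling them out of order raises an error, which is one simple way to make sure no step is missed.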

At Just After Midnight (JAM), we have a team of dedicated incident managers who lead and oversee this process, ensuring that every step is covered and clients are kept well informed throughout.

Before going live with a new client, our incident managers are trained on the ins and outs of each client and website. We produce a detailed runbook, decision tree and training video, which are made compulsory viewing and reading for our team of incident managers before they start to support the client. Consider the ways you want to collect and communicate the essential information to your support provider before they start monitoring so that they’re aware of what actions to take in all emergency situations.

The nature of JAM’s work means that each support solution is highly individualised to the needs of each business, and those needs can change over time. Our incident managers are trained to update the runbook as soon as new information comes in from the client or internal team, so that colleagues are aware of it as soon as they come online.

If you want to make sure your site is supported 24/7, make sure that whoever is monitoring your website is aware of the following basics about the application:

  • The URLs that need to be monitored
  • How quickly an issue needs to be responded to (SLAs) – do checks need to be performed every 15, 30, or 45 minutes for example?
  • The key functionalities on each page that provide the customer with a digital experience, and how to test that each one is working
  • Any known issues that might crop up, as well as any deployment or scheduled work that would impact the performance of the site (our incident managers are trained to ignore alerts during this time)
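
One way to hand these basics over is as a small structured record rather than free text. The sketch below is a hypothetical example in Python: the class names, field names and the idea of modelling scheduled work as a time window are assumptions made for illustration, not a format that JAM or any monitoring tool prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class MaintenanceWindow:
    """A deployment or scheduled work period during which alerts are ignored."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


@dataclass
class MonitoredSite:
    urls: List[str]                       # the URLs that need to be monitored
    check_interval: timedelta             # how often checks are performed
    key_functions: Dict[str, str]         # functionality -> how to test it
    known_issues: List[str] = field(default_factory=list)
    maintenance: List[MaintenanceWindow] = field(default_factory=list)

    def should_alert(self, moment: datetime) -> bool:
        # Alerts raised during scheduled work are suppressed.
        return not any(w.covers(moment) for w in self.maintenance)

    def next_check(self, last_check: datetime) -> datetime:
        return last_check + self.check_interval
```

Keeping the record this small makes it easy to review with the client before go-live and to update when a new deployment is scheduled.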

Go here for even more info on application support best practices.

Who you gonna call? (probably not Ghostbusters)

Whether the site is down, or a key area of functionality appears to be faulty, the individual responsible for monitoring your website will need to be aware of exactly who to call in each situation.

  • Phone numbers of key technical staff should be provided, as well as any times when they might be unavailable
  • The roles of each individual responsible for looking after the different elements of the site should be clear, so the right person can be contacted straight away
  • Third-party contacts are also important so that the incident manager can get to the root of the problem quickly, rather than unnecessarily going through other contacts
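
The contact details above can also be kept in a form that answers "who do I call right now?" directly. The following Python sketch is a minimal illustration under assumed names: Contact, who_to_call and the role labels are hypothetical, and unavailability is simplified to one daily time window per person.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional


@dataclass
class Contact:
    name: str
    role: str                                # the element of the site they look after
    phone: str
    unavailable_from: Optional[time] = None  # start of daily unavailable window
    unavailable_to: Optional[time] = None    # end of daily unavailable window

    def is_available(self, at: time) -> bool:
        if self.unavailable_from is None or self.unavailable_to is None:
            return True
        start, end = self.unavailable_from, self.unavailable_to
        if start <= end:
            return not (start <= at < end)
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return not (at >= start or at < end)


def who_to_call(contacts: List[Contact], role: str, at: time) -> Optional[Contact]:
    """Return the first listed contact for a role who is available at that time."""
    for contact in contacts:
        if contact.role == role and contact.is_available(at):
            return contact
    return None
```

Listing contacts in escalation order means the first match is also the preferred one, and a None result tells the incident manager that a third-party or fallback contact is needed.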

In an emergency situation, it’s vital that all stakeholders are kept up to date with the resolution process, including all the steps being taken to return the application to full functionality. At JAM, we use many different communication channels to make sure all our internal and external teams are informed. Different ways we do this include:

  • Instantly messaging them via Slack on a dedicated client channel once we are alerted to an issue, and keeping this channel updated regularly with our resolution progress
  • Emailing them with the alerts we receive from our monitoring system so that they understand exactly what we are reacting to
  • Calling key nominated contacts from the agency or client so that they are personally informed of the situation
  • Starting a bridge call with relevant members of the client, agency and third-party teams to triage the problem and begin resolving it as efficiently as possible
  • Producing detailed incident reports within 24 hours of the incident’s resolution, including a step-by-step account of our actions as well as recommendations JAM plans to implement to avoid a repeat of the same issue

Getting technical

Incident managers are trained to react to and deal with the issues that come in, but they do not work alone. Our technical developers are essential to the incident management process, as they are the ones who will resolve the issue. However, we have found that giving our incident managers a good introduction to the technical issues being reported makes the incident management process flow more smoothly. At Just After Midnight, our incident managers have strong technical knowledge, which means they can clearly communicate an incident to the relevant people and help clear up the problem quickly and efficiently.

Some of the key website management areas that an incident manager needs to feel confident with and have a good understanding of include:

  • CPU
  • Memory
  • Uptime
  • Database
  • Disk space

While this technical knowledge is not essential, it will make your incident resolution process much more efficient, as the incident manager will be able to diagnose and prioritise the issue much faster, and feed the relevant information to the developer.
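
To make that concrete, here is a rough Python sketch of how readings from those five areas might be turned into a priority and a short summary for the developer. Everything in it is an assumption for illustration: the Snapshot fields, the 90% and 10% thresholds and the "P1"/"P2" labels are hypothetical, and real values would come from your own monitoring system and SLAs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    cpu_percent: float          # CPU in use
    memory_percent: float       # memory in use
    disk_free_percent: float    # disk space still free
    site_up: bool               # uptime check result
    database_reachable: bool    # database check result


def triage(s: Snapshot) -> Tuple[str, List[str]]:
    """Return an assumed priority label and the findings that justify it."""
    findings: List[str] = []

    # Outages come first: the site or its database is not responding.
    if not s.site_up:
        findings.append("site is down")
    if not s.database_reachable:
        findings.append("database is unreachable")
    if findings:
        return "P1", findings

    # Resource pressure: the site is up but at risk (thresholds are assumed).
    if s.cpu_percent >= 90:
        findings.append(f"CPU at {s.cpu_percent:.0f}%")
    if s.memory_percent >= 90:
        findings.append(f"memory at {s.memory_percent:.0f}%")
    if s.disk_free_percent <= 10:
        findings.append(f"disk space at {s.disk_free_percent:.0f}% free")
    if findings:
        return "P2", findings

    return "OK", []
```

The findings list doubles as the message passed to the developer, so they start with the relevant facts rather than a bare alert.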

Keep calm and carry on!

By having someone who can own the incident management process for your website, you can keep calm with the knowledge that your website is in safe hands even when a critical issue occurs. This means you can prioritise what’s important: running your business and improving the customer experience.

Incident management of application issues might seem time-consuming, but it is an essential aspect of maintaining a healthy and successful website. Supporting both infrastructure and applications 24/7 is the bread and butter of what JAM does, and so if you’d like to speak to some experts about how we can help with ensuring your site has a record-high uptime, get in touch now.