Some of you may have spotted this screen through the hours of 21st March 2019. As 1000s of people rushed to the parliament website to sign the crucial petition to remain in the EU, the website crashed and suffered intermittent instability throughout the whole day. The petition committee claimed that the signing rate was the highest they have ever seen with nearly 2000 signatures completed every minute and a consistent number of between 80,000 and 100,000 people concurrently viewing the petition page at any time.
Whilst it is incredible to see the number of people that have been able to voice their opinion using digital platforms such as this, it is also eye opening to see how quickly that ability can be shut down in these unprecedented circumstances. The petitions committee dealt with this well and kept people updated on the status of the website through ongoing communication.
We asked our performance engineering team to give us some top tips on what can be done to be better prepared for a similar situation:
- Benchmarking and war gaming for the worse case scenarios – Historical data is not always enough, like the banking systems, you have to stress test both realistic historical and extreme scenarios.
- Architecting solutions that can scale – Once you have your theoretical max throughput you can design a cloud hosting solution. There are a huge number of design decisions, tools and tactics that can be used including.
Auto scaling across the stack to allow the system to react to demand in real-time.
Use of CDN edge networks to take load from the page.
- Queuing systems / waiting rooms (like queue it or Netacea) which sit in front of the site to monitor traffic and divert to intelligent queues when certain thresholds are met (you see this type of thing frequently used on ticket sites for gigs and sporting events).
- Check the budget – You can pretty much design the most robust solution in the world but you have to look at the cost. Autoscaling, CDNs and queue systems do not always come cheap. You always have to weigh the risk, potential reputational / financial loss against the anticipated costs.
- Load testing, stress testing – Once you have completed your design, agreed the costs and stood the website or application up you have to test back against your worse case scenarios. You can do this by using scripted tests and virtual users from a variety of tools. You need to build up slowly to the theoretical max and see what creaks first, you can then tweak your design and plan for incident management scenarios.
- Continuous monitoring and 24/7 Incident Management – Finally once you are live you need to setup continuous monitoring across the board (from the servers to the traffic) and set thresholds to alert you to potential issues. And of course a full-time 24/7 incident management team is the ideal solution.
In summary, preparation and planning for these situations is key whilst being realistic that in certain unprecedented circumstances, when people want (or in this case don’t want) something bad enough, you are always going to be up against it.