In this piece, we’ll be covering why SaaS products are switching from a traditional monitoring approach to one led by observability, especially as they gain the efficiencies that come with distributed computing.
Traditional monitoring relies on trip wires poised to catch known intruders. These are the failure scenarios you’re well aware of. But they aren’t the only ones you’ll encounter, unfortunately.
As the technology underpinning SaaS products has changed, these failure scenarios have become more obscure, leading to a new need for a more predictive, exhaustive approach to detection.
Enter observability: the catch-all term for the tools and processes aimed at going beyond a set-and-watch approach to SaaS telemetry.
In this piece we’ll get to grips with observability, covering:
- A working definition of telemetry data as it applies here
- What a traditional SaaS monitoring approach looks like
- What an observability approach looks like
Let’s get started.
What is telemetry? Skip if this feels remedial
Telemetry refers to the automatic collection and centralisation of data. So, as we’ll see, both monitoring and observability are underpinned by telemetry.
Types of telemetry
Metrics
These are system-level numbers that capture all kinds of application states, ranging from API response times to user behaviour.
Logs
Logs, on the other hand, provide more context-rich, qualitative data on individual events. For example, a failed API call at a specific time.
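To make that a little more concrete, here’s what a structured log entry for a failed API call could look like. This is just an illustrative sketch in Python; the field names and values are made up rather than taken from any real logging schema.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("media-api")

# A context-rich record of a single failed API call.
# All field names and values here are illustrative, not a real schema.
logger.error(json.dumps({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event": "API_CALL_FAILED",
    "endpoint": "/media/start",
    "status_code": 503,
    "user_id": "u-12345",
    "region": "us-east-1",
    "detail": "upstream streaming service timed out",
}))
```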
Traces
Traces “trace” the journey of requests through a system and can be used to identify bottlenecks in complex or distributed architectures. They’re a little harder to picture than metrics or logs, so let’s take a look at an example.
Example of a trace
Our trace might begin when a user clicks the play button of a SaaS media player. This triggers an API call to /media/start, which interacts with the media streaming service to initialize the video stream and queries the database to retrieve metadata, such as the user’s access permissions, playback settings, and media file location.
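As a rough sketch of how that trace might be produced, here’s what the instrumentation could look like using the OpenTelemetry Python API. The span names, attributes and helper functions (fetch_media_metadata, initialize_stream) are hypothetical, and a real setup would also configure a tracer provider and an exporter (to Zipkin, X-Ray or similar).

```python
from opentelemetry import trace

tracer = trace.get_tracer("exemplary.media")

def fetch_media_metadata(user_id: str, media_id: str) -> dict:
    # Stand-in for the real database query.
    return {"permissions": "ok", "segment_url": "https://cdn.example.com/v/123"}

def initialize_stream(metadata: dict) -> dict:
    # Stand-in for the call to the media streaming service.
    return {"stream_url": metadata["segment_url"], "status": "ready"}

def start_media(user_id: str, media_id: str) -> dict:
    # Root span: the API call triggered by the play button.
    with tracer.start_as_current_span("POST /media/start") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("media.id", media_id)

        # Child span: metadata lookup (permissions, playback settings, file location).
        with tracer.start_as_current_span("db.fetch_media_metadata"):
            metadata = fetch_media_metadata(user_id, media_id)

        # Child span: initialising the video stream.
        with tracer.start_as_current_span("streaming.initialize"):
            return initialize_stream(metadata)
```

Each nested span records its own timing, so the resulting trace shows exactly where a slow request spent its time.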
What a traditional monitoring approach looks like in SaaS
As we said at the outset, traditional monitoring relies on setting alerts to trigger on known failure or attack scenarios.
This is still complex work, and instrumenting an effective traditional monitoring strategy is no easy feat.
Example of a traditional monitoring setup
To keep running with our media player example (a product we’ll call Exemplary Media Streaming), let’s look at how monitoring is implemented at the metric, log and trace levels.
Exemplary Metrics
These will be used to monitor overall system health: things like playback start times, CPU usage and the number of active streamers.
The monitoring team may have set up an automated workflow based on the system metric us-east-1-avg-playback (the average amount of time between the user clicking play and the video beginning in that AWS region).
If this metric climbs above, say, 3 seconds, the cavalry rock up (sketched in code below):
- Step one: an API call scales up CDN nodes in the us-east-1 region
- Step two: stale cached content is purged from the CDN edge servers
- Step three: traffic is rebalanced across the newly added us-east-1 nodes
- Step four: the on-call engineer receives a ping saying that steps one to three have been taken because the needle hit the red zone
- Step five: the needle either drops back down and the engineer logs it, or it doesn’t, and they begin to work their magic
Either way, the problem’s being dealt with.
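Here’s a minimal sketch of what that automated workflow could look like in code. Everything here is hypothetical: the threshold, the metric name and the helper functions are stand-ins for whatever autoscaling, cache-purging, traffic-management and paging APIs Exemplary actually use.

```python
# Hypothetical stand-ins for real infrastructure calls (CDN autoscaling,
# cache purges, traffic management and paging).
def scale_cdn_nodes(region: str, extra_nodes: int) -> None:
    print(f"Scaling up {extra_nodes} CDN nodes in {region}")

def purge_stale_cache(region: str) -> None:
    print(f"Purging stale cached content from {region} edge servers")

def rebalance_traffic(region: str) -> None:
    print(f"Rebalancing traffic across {region} CDN nodes")

def page_on_call(message: str) -> None:
    print(f"PAGE: {message}")

PLAYBACK_THRESHOLD_SECONDS = 3.0

def handle_playback_latency(region: str, avg_playback_seconds: float) -> None:
    """Steps one to four above, triggered when the playback metric breaches the threshold."""
    if avg_playback_seconds <= PLAYBACK_THRESHOLD_SECONDS:
        return  # needle never hit the red zone

    scale_cdn_nodes(region, extra_nodes=2)   # step one
    purge_stale_cache(region)                # step two
    rebalance_traffic(region)                # step three
    page_on_call(                            # step four
        f"{region}-avg-playback at {avg_playback_seconds:.1f}s "
        f"(threshold {PLAYBACK_THRESHOLD_SECONDS}s); auto-remediation applied"
    )

handle_playback_latency("us-east-1", avg_playback_seconds=3.4)
```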
Exemplary Logs
Logs capture defined events. The obvious ones to keep an eye on record definitive issues like corrupted files or videos failing to start. But there’s more to it than that.
For example, you could set a monitoring alert or workflow to detect patterns of behaviour that suggest a cyber attack; if we’re seeing far too many video-start requests from a single user, we might have a bot.
This time the response might look like this (again, sketched in code after the steps):
- Step one: a firewall rule is applied to block the user IP in question
- Step two: using dynamic monitoring, playback logs are sampled more frequently, looking for patterns similar to the initial spike
- Step three: this all gets passed to the 24/7 support engineer who can decide where to go from there
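A rough sketch of that detection logic, again in Python, with hypothetical helpers (block_ip, increase_log_sampling, notify_support) standing in for the real firewall, log-sampling and support-tooling APIs:

```python
from collections import defaultdict, deque
import time

REQUEST_LIMIT = 50      # video-start requests allowed per user...
WINDOW_SECONDS = 60     # ...within this sliding window

# In-memory record of recent video-start events per user; a real system would
# run this kind of check inside its log pipeline rather than in the API itself.
recent_starts: dict[str, deque] = defaultdict(deque)

# Hypothetical stand-ins for firewall, log-sampling and support-alerting calls.
def block_ip(ip_address: str) -> None:
    print(f"Firewall rule added for {ip_address}")

def increase_log_sampling(log_group: str, rate: float) -> None:
    print(f"Now sampling {log_group} logs at rate {rate}")

def notify_support(message: str) -> None:
    print(f"SUPPORT ALERT: {message}")

def on_video_start_log(user_id: str, ip_address: str) -> None:
    """Check each video-start log event for bot-like behaviour."""
    now = time.time()
    window = recent_starts[user_id]
    window.append(now)

    # Discard events older than the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) > REQUEST_LIMIT:
        block_ip(ip_address)                          # step one
        increase_log_sampling("playback", rate=1.0)   # step two
        notify_support(                               # step three
            f"Possible bot: {len(window)} video-start requests from "
            f"user {user_id} ({ip_address}) in the last {WINDOW_SECONDS}s"
        )
```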
Exemplary Traces
In monitoring, traces are essentially metrics. By this we mean that although data is gathered by tracing requests through systems, it really just ends up being aggregated and put in a metrics dashboard.
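To illustrate the point, here’s a tiny sketch of that aggregation step: per-request trace durations get boiled down to an average and a rough 95th percentile for a dashboard. The numbers are made up.

```python
from statistics import quantiles

# Imaginary span durations (ms) for traced /media/start requests.
span_durations_ms = [120, 135, 150, 180, 210, 240, 1100, 130, 145, 160]

avg_ms = sum(span_durations_ms) / len(span_durations_ms)
p95_ms = quantiles(span_durations_ms, n=20)[-1]  # rough 95th percentile

print(f"/media/start avg: {avg_ms:.0f}ms, p95: {p95_ms:.0f}ms")
# What the dashboard keeps: two numbers. What it loses: which hop was slow and why.
```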
We’ll see how traces are used differently in the next section.
Why this approach has the unfashionable “traditional” as its descriptor
As you can see from the above, this “traditional” monitoring set-up is an intelligent, worthwhile use of telemetry.
By combining data on system performance, CPU and memory usage with intelligent, pattern-based analysis, Exemplary were able to detect and respond to issues in real time and keep their SaaS product chugging along.
So why is this not enough?
A metaphorical explanation of why observability > monitoring
Imagine you have a pet. It’s a dog. A good reliable pet. It’s companionable and protective and it can be used to strike up conversations with people in the park.
If there’s a problem with your dog, you consult 101 Things That Could Be Wrong With Your Dog And What To Do About Them.
You monitor your dog for these signs. A little redness beneath the eyes.
Now let’s say you have a shape-shifter. This is undoubtedly better than your dog. For one, it can be a dog. In addition, it can be whatever animal you need it to be at whatever time.
There is no book for this creature.
There is no predefined guide for what happens when a stomach virus picked up in rat form ends up in your parrot.
There are too many unique, dynamic interactions.
What you do instead is map out the undulating, Lovecraftian biology of your pet as it moves from form to form.
From this map, you dynamically generate the scenarios and issues which could arise.
This is essentially the situation with SaaS products.
As technologies like microservices, serverless and containers drive ephemerality, abstraction and distribution, the interplay of components becomes too complex to map out from first principles. The only way to understand them is to see them in action: to observe them.
A practical explanation of why observability > monitoring
This next example again draws on cloud telemetry, performance data and real-time monitoring to generate valuable insights.
However, we’ll depart from the above in that Exemplary will now use traces to discover issues they hadn’t foreseen.
Exemplary traces
Users have been consistently reporting blurred videos. However, none of Exemplary’s self-devised monitoring thresholds are showing an issue.
All software systems, cloud environments and performance metrics are in normal ranges. Everything indicates optimal performance. If only.
But Exemplary aren’t the sort to take it lying down. Using an observability platform like AWS X-Ray or Zipkin, they set about tracing the request path end to end (there’s a rough sketch of the analysis in code after the steps):
- Step one: Exemplary uses Zipkin to trace requests made by the browser-based player throughout their entire system. The trace follows each request from the video player through the CDN and backend services to the origin server, capturing detailed timing and flow data.
- Step two: The trace reveals that video segments are loading more slowly from the us-east-1 CDN. This region shows significantly higher latency than others, pointing to a performance issue specific to the us-east-1 CDN nodes.
- Step three: Further analysis reveals a caching problem with the CDN; video segments are frequently not found in the cache. When a cache miss occurs, the browser bypasses the CDN and fetches the segment directly from the origin server, introducing additional delays.
- Step four: The browser compensates for this latency by lowering the video’s bitrate. As the player’s adaptive bitrate logic prioritises continuous playback over resolution, users experience blurry video as a result of the slower segment loads.
- Step five: After the caching policy is reviewed, the BITRATE_DOWNGRADE log event is incorporated into Exemplary’s monitoring strategy. By adding this event as a monitored metric, the team ensures that future video quality drops can be detected and addressed proactively.
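Here’s a rough sketch of what the analysis in steps two and three could look like once the trace data has been exported. The span structure and tag names (cdn.region, cache.hit) are hypothetical rather than a real Zipkin or X-Ray schema; the idea is simply to group CDN fetch spans by region and compare latency and cache-miss rates.

```python
from collections import defaultdict

# Simplified, already-exported spans for the CDN segment-fetch step.
# Tag names and values are illustrative only.
spans = [
    {"cdn.region": "us-east-1", "cache.hit": False, "duration_ms": 950},
    {"cdn.region": "us-east-1", "cache.hit": False, "duration_ms": 880},
    {"cdn.region": "us-east-1", "cache.hit": True,  "duration_ms": 140},
    {"cdn.region": "eu-west-1", "cache.hit": True,  "duration_ms": 130},
    {"cdn.region": "eu-west-1", "cache.hit": True,  "duration_ms": 150},
]

by_region = defaultdict(list)
for span in spans:
    by_region[span["cdn.region"]].append(span)

for region, region_spans in by_region.items():
    avg_ms = sum(s["duration_ms"] for s in region_spans) / len(region_spans)
    miss_rate = sum(not s["cache.hit"] for s in region_spans) / len(region_spans)
    print(f"{region}: avg segment fetch {avg_ms:.0f}ms, cache miss rate {miss_rate:.0%}")
```

The output points straight at us-east-1’s cache misses, which is exactly the kind of unforeseen issue the original monitoring thresholds could never have flagged; from there, adding BITRATE_DOWNGRADE as a monitored event (step five) is a small change.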
What does this tell us about observability?
The use of a CDN to cache video files close to users is a great example of the benefits of a distributed, cloud-native approach. Tactics like this can be used to optimise performance while reducing costs for the SaaS company in question.
However, as this example shows, those efficiencies can come with their own drawbacks. In essence, observability is a way of compensating for the larger surface area for failure that ever-more distributed systems open up.
Observability isn’t magic – but it’s close
If you’ve made it this far, you can probably see the writing on the wall: traditional monitoring, while useful, can’t keep up with the complexity of today’s distributed systems.
Implementing observability doesn’t have to be a headache. That’s where we come in. We help SaaS teams move beyond “set-it-and-forget-it” monitoring to systems that actually show you what’s happening under the hood. The kind that solves problems you didn’t even know were there.
Curious? Ready to dive in? Get in touch with a member of our team. We’d be happy to help.