In the real-time software world, 24×7 uptime is critical for core systems that process millions of transactions every second. In 2018, Amazon’s Prime Day event experienced a 13-minute outage that, according to some estimates, may have cost the company up to $99 million in lost sales. Reliability is paramount when the business depends on it for revenue, customer experience, and competitive advantage. Data-driven teams track every metric and piece of system performance data they can to ensure that systems perform and scale as expected.
To improve reliability and ensure constant uptime, engineers and managers are commonly on call for the services they own. An on-call shift involves being ready to acknowledge alerts, mitigate incidents, escalate to the right people, and handle post-incident follow-ups. It’s an incredibly important role, as the on-call engineer is often the first line of defense in ensuring the reliability and availability of a company’s services.
Here’s what different levels of availability could mean for your team:
| Availability | Downtime Per Year |
|--------------|-------------------|
| 99%          | 3.65 days         |
| 99.9%        | 8.76 hours        |
| 99.99%       | 52.6 minutes      |
| 99.999%      | 5.26 minutes      |
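To make the math concrete, here’s a quick Python sketch that derives the downtime budget from an availability target, assuming a 365-day year:

```python
# Downtime budget implied by an availability target, assuming a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the allowed downtime, in minutes per year, for an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):,.1f} minutes/year")
# 99.0% -> 5,256.0 minutes/year (about 3.65 days)
# 99.999% -> 5.3 minutes/year
```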
But here’s the problem: A bad on-call rotation with a low signal-to-noise ratio can lead to developer burnout, engineering churn, and lost focus on real engineering work. It also increases the mean time to incident detection, since developers must spend time sifting through the noise to identify the right set of issues to act on.
So, how do you ensure a healthy on-call experience?
In this post, you’ll learn:
- Tips for teams and engineering leaders to improve on-call hygiene
- Examples of companies with effective on-call approaches
- Ideas worth considering for your own team
Identify Issues Weekly
The first step toward a healthy on-call is to identify issues regularly and keep the signal-to-noise ratio high. On-call hygiene is not a one-time fix, but an ongoing process. Set up a weekly review to analyze alerts and determine which ones provide valuable signal and which are just noise, then ruthlessly eliminate noisy alerts that don’t require immediate attention. A common example is an alert that fires on a small, self-recovering blip in metrics while the overall system stays healthy. In such cases, identify the root cause and address it rather than letting the alert fire over and over and divert developer attention.
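Here’s a minimal sketch of what such a weekly review script could look like, assuming you can export your alert history as records with hypothetical `name`, `fired_at`, `resolved_at`, and `acknowledged` fields; it flags alerts that self-resolved quickly with no human action:

```python
# Minimal sketch: flag likely-noisy alerts from a week of exported alert history.
# Assumes each record has hypothetical fields: name, fired_at, resolved_at, acknowledged.
from collections import defaultdict
from datetime import datetime, timedelta

AUTO_RECOVERY_WINDOW = timedelta(minutes=5)  # resolved this fast with no ack -> likely noise

def find_noisy_alerts(history: list[dict]) -> dict[str, int]:
    """Count, per alert name, how often it self-resolved quickly with no human action."""
    noisy_counts: dict[str, int] = defaultdict(int)
    for alert in history:
        duration = alert["resolved_at"] - alert["fired_at"]
        if duration <= AUTO_RECOVERY_WINDOW and not alert["acknowledged"]:
            noisy_counts[alert["name"]] += 1
    return dict(sorted(noisy_counts.items(), key=lambda kv: kv[1], reverse=True))

# Example usage with a couple of records:
history = [
    {"name": "api-latency-blip", "fired_at": datetime(2024, 1, 1, 3, 0),
     "resolved_at": datetime(2024, 1, 1, 3, 2), "acknowledged": False},
    {"name": "db-disk-full", "fired_at": datetime(2024, 1, 1, 9, 0),
     "resolved_at": datetime(2024, 1, 1, 10, 30), "acknowledged": True},
]
print(find_noisy_alerts(history))  # {'api-latency-blip': 1}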
Prioritize Repeat Offenders
Alerts that fire repeatedly demand special attention. If not addressed, these problems snowball and lead to even more alerts in the future. Prioritize fixing these repeat offenders to get ahead of the alert fatigue curve.
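During the weekly review, a lightweight way to spot repeat offenders is to count firings per alert name over the window; a small sketch, again assuming exported alert records with a `name` field:

```python
# Minimal sketch: rank repeat offenders by how often each alert fired in the review window.
from collections import Counter

def top_repeat_offenders(history: list[dict], min_firings: int = 3) -> list[tuple[str, int]]:
    """Return (alert_name, firing_count) pairs for alerts that fired at least `min_firings` times."""
    counts = Counter(alert["name"] for alert in history)
    return [(name, n) for name, n in counts.most_common() if n >= min_firings]

# Example usage:
history = [{"name": "checkout-5xx"}] * 7 + [{"name": "cache-evictions"}] * 2
print(top_repeat_offenders(history))  # [('checkout-5xx', 7)]
```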
De-Duplicate and Group Related Alerts
During a major incident, the last thing you want is developers being paged hundreds of times for the same underlying issue. Work to de-duplicate related alerts into a single notification so your team can stay focused on the actual problem rather than getting buried in redundant pages. For example, instead of alerting on error rates on every individual host or server, see whether an aggregate, service-level alert provides the same reliability and detection capability. That single alert gives a clear signal that there’s an application-wide problem, without overwhelming the on-call engineer with noise.
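Here’s a rough sketch of the grouping idea, assuming incoming alerts carry hypothetical `service`, `alert_type`, and `host` fields; it collapses per-host pages into one notification per service and alert type:

```python
# Minimal sketch: collapse per-host alerts into one grouped notification per (service, alert_type).
from collections import defaultdict

def group_alerts(incoming: list[dict]) -> list[dict]:
    """Group raw alerts by a dedup key so one page covers all affected hosts."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for alert in incoming:
        key = (alert["service"], alert["alert_type"])
        grouped[key].append(alert["host"])
    return [
        {"service": svc, "alert_type": kind, "hosts": hosts,
         "summary": f"{kind} on {len(hosts)} hosts of {svc}"}
        for (svc, kind), hosts in grouped.items()
    ]

# Example usage: fifty per-host alerts become a single page.
incoming = [
    {"service": "checkout", "alert_type": "high-error-rate", "host": f"web-{i}"}
    for i in range(1, 51)
]
pages = group_alerts(incoming)
print(len(pages), pages[0]["summary"])  # 1 high-error-rate on 50 hosts of checkout
```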
Automate Manual Toil
On-call often involves executing the same manual steps repeatedly. Look for opportunities to automate these repeated tasks. This could be as simple as a runbook script or a more sophisticated auto-remediation system. The more you can automate, the easier on-call becomes.
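As an illustration, here’s a minimal auto-remediation sketch that runs the same restart step an on-call engineer might otherwise perform by hand; the health endpoint and service name are hypothetical placeholders for your own system:

```python
# Minimal auto-remediation sketch: restart a service if its health check fails.
# The health URL and service name are hypothetical placeholders.
import subprocess
import urllib.request
from urllib.error import URLError

HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint
SERVICE_NAME = "checkout-api"                 # hypothetical systemd unit

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except URLError:
        return False

if not is_healthy(HEALTH_URL):
    # The same step the runbook would have the on-call engineer run manually.
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)
    print(f"Restarted {SERVICE_NAME}; re-check health and escalate if it fails again.")
```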
Foster an On-Call-Friendly Culture
Improving on-call is not just a technical challenge but also a cultural one. Work to develop a culture emphasizing the importance of a healthy on-call experience. This means giving engineers time to work on alert hygiene, sharing best practices across teams, and celebrating alert reduction wins.
Maintain a Secondary On-Call
It’s also important that teams maintain an on-call setup with both primary and secondary on-call engineers. The specific roles and responsibilities of each can vary depending on the team’s needs: some teams use the secondary on-call as a backup for any pages that the primary might miss, while others have the primary handle only urgent pages and route low-priority tickets to the secondary.
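As an illustration of the backup pattern, here’s a minimal sketch of an escalation chain that pages the secondary only when the primary hasn’t acknowledged within a set delay; the names and timings are hypothetical:

```python
# Minimal sketch of an escalation chain: page the primary first, then the secondary
# if the page is not acknowledged within the escalation delay. Names are hypothetical.
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class EscalationStep:
    engineer: str
    escalate_after: timedelta  # wait this long for an acknowledgement before moving on

ESCALATION_POLICY = [
    EscalationStep(engineer="primary-oncall", escalate_after=timedelta(minutes=5)),
    EscalationStep(engineer="secondary-oncall", escalate_after=timedelta(minutes=5)),
]

def who_to_page(minutes_since_fired: float, acked: bool) -> str | None:
    """Walk down the policy as time passes without an acknowledgement."""
    if acked:
        return None
    elapsed = timedelta(minutes=minutes_since_fired)
    cumulative = timedelta()
    for step in ESCALATION_POLICY:
        cumulative += step.escalate_after
        if elapsed < cumulative:
            return step.engineer
    return ESCALATION_POLICY[-1].engineer  # keep paging the last person in the chain

print(who_to_page(2, acked=False))  # primary-oncall
print(who_to_page(8, acked=False))  # secondary-oncall (primary did not ack within 5 minutes)
```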
Regardless, having a secondary is especially crucial during incident mitigation. During an incident, the secondary on-call can take on important tasks like investigating dashboards of dependency services, communicating with stakeholders and downstream customers, or documenting the incident, enabling the primary on-call to focus on mitigating the incident at hand.
Additionally, in case of a prolonged incident, the secondary on-call can take over the primary role, ensuring that the service remains supported and monitored throughout the incident.
Wrapping Up
Identifying and fixing problems in your on-call process can lead to enormous benefits: happier teammates, reduced engineering churn, and more focus on the work that matters most.
The key takeaways:
- Regularly review alerts to maintain a high signal-to-noise ratio
- Prioritize fixing repeat offenders
- De-duplicate related alerts
- Automate manual toil
- Foster a culture that values a healthy on-call experience