On-call: reasons it's not getting better and ways to improve
Examples of cases from running products that ruined on-call experience and techniques that helped to mitigate them
Once a product moves from internal testing to real users, it starts receiving complaints about things that don't work. The more critical the problem, the less time users are willing to wait. That is why almost every company eventually arrives at the idea of having an on-call process. Unfortunately, if badly organized, it can result in a drop in product metrics, slow down development, and ruin the engineers' work-life balance, which in the worst-case scenario leads to burnout and the decision to leave the project. If you start noticing that on-call is seen as punishment, it's time to take action.
Overall, most of the issues are caused by either technical debt or process problems, and one often becomes the cause of the other. It's good to remember a simple rule: every issue that appears during on-call must be followed by an action that prevents its recurrence. It can be a code change, a process improvement, or a documentation amendment. Now, let's dive into the details.
Poor service knowledge is common in teams whose composition changes frequently without proper knowledge transfer. Sometimes a company reorganization is followed by a service transfer to new owners, which makes the first on-call shifts the most difficult. Also, if the product consists of many services, a typical two-pizza team can't physically manage all of them in depth, so it's unlikely that one engineer has deep knowledge of every dependency.
In addition, Customer Support is rarely aware of the internal architecture and creates a ticket for the first service users interact with, leaving further triaging to engineers. If the team is not confident in the service's stability or is not aware of all the dependencies, an on-call engineer is likely to spend time looking into services that were operating fine from the start.
An on-call engineer received a task to triage a payout discrepancy, as the difference between the numbers users saw in their web accounts and in their bank statements was quite big. It took a day to validate the UI, the API, and the data pipelines, while in parallel emailing the customer for confirmation to access their profile. When it became obvious that the services were healthy (which was supported by monitoring data), it was decided to re-assign the issue to the team responsible for collecting the source data. They resolved the task within 5 minutes, as the issue was well known to them.
A good way to minimize the risk is to make sure everybody knows where to start when an incident occurs. You can begin by creating a monitoring dashboard with key metrics and an on-call runbook with the most frequent cases, and include both in the onboarding plan. Once an incident happens, every engineer should be able to evaluate the system's state based on the information from these two sources. If a case appears for the first time, the engineer should either extend the monitoring or add a new case to the runbook as a follow-up action.
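A runbook stays useful only if every case follows the same shape, so that an engineer under pressure knows exactly where to look. Below is a minimal sketch of such an entry kept as a template; all field names, links, and the example case are hypothetical and should be adapted to your product:

```python
from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    """One frequent on-call case, kept in the same shape as all the others."""
    symptom: str                  # what the alert or ticket usually says
    dashboard: str                # link to the key-metrics dashboard panel
    first_checks: list[str] = field(default_factory=list)  # ordered triage steps
    escalation: str = ""          # who to involve if the checks don't explain it

    def to_markdown(self) -> str:
        steps = "\n".join(f"1. {step}" for step in self.first_checks)
        return (f"### {self.symptom}\n"
                f"Dashboard: {self.dashboard}\n\n{steps}\n\n"
                f"Escalate to: {self.escalation}\n")

# Hypothetical example entry
entry = RunbookEntry(
    symptom="Payout numbers differ between UI and bank statement",
    dashboard="https://grafana.example.com/d/payouts",  # placeholder URL
    first_checks=[
        "Check the payout pipeline freshness metric",
        "Compare the API response with the latest pipeline output",
        "If both are healthy, reassign to the source-data team",
    ],
    escalation="#source-data-oncall",
)
print(entry.to_markdown())
```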
If you know which services live behind yours and could be the source of the issue, it's fine to involve the responsible team in triaging to speed up the process. However, remember that your team owns the on-call issue until you provide a valid reason why it must be re-assigned, so continue the investigation even if somebody has agreed to help.
Once you can clearly see that you're getting similar issues by mistake (we'll talk a bit later about how to track this), it's time to change the task assignment process. If you use an automatic assignment system, think about adding rules that distinguish your services from others, as in the sketch below. Also, make sure Customer Support's documentation contains steps they can follow to triage an issue in more detail and assign it to the right team.
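As an illustration, here is a minimal sketch of such an assignment rule, assuming tickets arrive as free text and ownership is kept in a simple keyword mapping; all team and service names are made up:

```python
# Hypothetical keyword-based routing: the first matching rule wins,
# everything else falls back to the team that owns the entry point.
ROUTING_RULES = [
    ({"payout", "bank statement", "discrepancy"}, "source-data-team"),
    ({"token", "expired", "integration"}, "integrations-team"),
]
DEFAULT_TEAM = "web-platform-team"

def route_ticket(ticket_text: str) -> str:
    text = ticket_text.lower()
    for keywords, team in ROUTING_RULES:
        if any(keyword in text for keyword in keywords):
            return team
    return DEFAULT_TEAM

print(route_ticket("Client N doesn't get data, token expired"))  # integrations-team
```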
During on-call shifts, I often received invalid requests that could have been resolved without involving an engineer. All I could do was leave a comment like "this is by design" and a reference to the documentation.
It's quite easy to recognize this type of task: you don't take any action that actually affects the service, like creating a PR or re-running a job. Most likely the reason is poor UX, overwhelming public documentation, or both. If customers aren't sure how to navigate sophisticated features, they can produce a curious but invalid configuration that won't work the way they expect. Or, when the next step isn't clear, it's easier for customers to ask for support than to deal with the issue themselves.
Sometimes even an improvement can be the reason, if you change the service without notifying customers. A new flow might be confusing or even break existing integrations, which will also raise the number of on-call tickets.
To deal with these, an engineer needs deep domain and service knowledge, which only comes after working on the product for a while. Otherwise, be ready to spend some time before you figure out that "blocking an account after 3 wrong password attempts" is done on purpose.
The service was designed to integrate with 3rd parties by generating a token with an expiration period. Once the period passed, the user had to repeat the process. Because of poor product design, the token generation step was included in the onboarding flow, and there was no need to return to that page once the setup was completed, as it didn't contain any other useful information. As you can probably guess, every time a token expired, the on-call engineer got a new issue like "Client N hasn't received data for more than M hours, API responds with an error", and after triaging had to redirect the customer to the onboarding page to regenerate the token.
You can often see a relationship between such tasks and insufficient information at the time of the error. Luckily, the solution is pretty elegant: change the service in such a way that the user can understand what exactly is wrong. For example, the message "Something went wrong" is a much worse error description than "Price changes can't be applied while your campaign is active. If you want to change the price without affecting other settings, you can finish this campaign and run a new one by cloning its settings". Work on such tasks together with UX specialists to find the optimal solution.
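The same idea in code might look like the sketch below, built around a hypothetical campaign-update handler: instead of letting a generic error bubble up, validate the state and raise an error whose message tells the user what to do next. The names and the rule itself are assumptions for illustration only.

```python
class CampaignUpdateError(Exception):
    """Raised with a message the user can act on, not just 'Something went wrong'."""

def update_campaign_price(campaign: dict, new_price: float) -> None:
    # Hypothetical rule: active campaigns can't have their price edited in place.
    if campaign.get("status") == "active":
        raise CampaignUpdateError(
            "Price changes can't be applied while your campaign is active. "
            "Finish this campaign and clone its settings into a new one "
            "to change the price without affecting other settings."
        )
    campaign["price"] = new_price
```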
Do not expect that adding all the details to the public documentation will dramatically change the number of incoming tasks: in the case of large systems with complex configurations, users prefer to contact support directly - it's faster and more efficient. Review the documentation Customer Support or CSMs use for triaging and make sure the information there is up to date, as it often contains detailed instructions with screenshots and links that can get outdated.
This large category is typical for companies that cultivate expertise in certain technologies, as well as those that encourage growing individual product experts within the team. Add outdated documentation to this and you'll get a very painful on-call with many constraints, which for some engineers turns into the lifelong role of a support engineer. It also increases the risk of slowing down product development, not to mention confining domain knowledge to a very narrow circle of specialists. Onboarding new employees to such products is slower and of lower quality, as information is passed on selectively, and some services can be skipped because of a tech-stack mismatch. Often it all happens in a live, unrecorded meeting with a freshly drawn whiteboard of dependencies that is lost right after the call. I once worked with a product where, after an expert decided to quit, the team wasn't able to upgrade the service safely because nobody knew how a couple of features functioned.
In terms of the on-call process, the knowledge that matters most is how to triage and take quick action in an emergency. If more experienced engineers don't share it, the other engineers will always ask for help, distracting teammates from regular product work.
After another reorganization, the teams that used to be divided by technology stack were united into product teams supporting UI, API, and data management. Since most of the issues were related to data inconsistency and unexpected shutdowns of instances, front-end engineers were not initially involved in the on-call. At some point, rotations among back-end engineers became too frequent, which slowed down product development, and it was decided to involve everybody. The first shifts revealed that front-end engineers could not cope on their own: there was no easy way to understand why an alert had triggered, which instance was affected, or how to recover it. It turned out that because the front end and back end used different tools, the engineers had neither links nor access to each other's monitoring systems. All knowledge was either passed on verbally (hey, tribal knowledge) or acquired while touching the actual service code base.
There are several effective methods here. First of all, make sure that domain and service data are stored in a place accessible to everyone. Follow the same template for all services, and automate where possible. For example, if you store public URLs for different environments on Github, create a README template that already contains this section, and add linting rules to remind engineers to fill it in (a sketch of such a check follows below). For sensitive information like credentials, specify where to find them for each environment. It's acceptable to use multiple documentation services simultaneously, as sometimes one tool works well for sales while another works for engineers. Just make sure they are up to date and consistent, and that everyone on the team knows how to find them. To make on-call easier you can create a runbook - as a rule, a separate document that gathers the most useful information for analyzing tasks, especially for beginners and under the pressure of urgency and severity.
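A minimal sketch of such a lint rule, assuming the template is plain markdown and the check runs in CI; the section names and file path are examples, not a prescribed standard:

```python
import pathlib
import sys

# Hypothetical sections every service README is expected to fill in.
REQUIRED_SECTIONS = ["## Public URLs", "## Monitoring", "## Where to find credentials"]

def missing_sections(readme_path: str) -> list[str]:
    text = pathlib.Path(readme_path).read_text(encoding="utf-8")
    return [section for section in REQUIRED_SECTIONS if section not in text]

if __name__ == "__main__":
    missing = missing_sections("README.md")
    if missing:
        print("README is missing required sections:", ", ".join(missing))
        sys.exit(1)  # fail the CI job to remind engineers to fill in the template
```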
It's vital not just to have documentation but to make sure every engineer knows it exists. Include the links in the onboarding plan, appoint a more experienced teammate as a buddy to help with questions that aren't covered, and conduct workshops. Hint: try to make those who have recently completed onboarding become buddies for the next newcomers - a deeper understanding of the service is guaranteed. Allow newcomers to update the onboarding guide with missing information.
If someone decides to leave the team, make sure they prepare a handover plan with information about their active projects. Notify other engineers in advance and give them the opportunity to prepare questions, as this will help to cover edge cases. Record the meeting: this knowledge will stay relevant for some time, and a recording also solves the attendance problem.
The practice of shadow on-call also shows positive results. Commonly, it requires two engineers per shift: the main on-call is responsible for incoming tickets, and the shadow is ready to cover for the main in an emergency, such as a one-day absence or support during a severe incident. Newcomers are put on shadow duty for their initial shifts to understand the process. You can arrange paired sessions with the main on-call engineer on a task from the moment it arrives until it's solved. That helps to nail down tiny details such as which commands to run or which internal tools to use to look for resources. In some cases, shadows can take the lowest-priority tasks and try to solve them on their own.
The last thing I'd like to mention is the ending of an on-call shift. In many companies it's pretty straightforward: one person's shift has ended and the next person's shift has begun. However, you risk losing context for unfinished tasks. Try to introduce the practice of a small team meeting where the on-call engineer sums up their shift: which tasks were completed and what caused them, where they stopped on the tasks in progress, and which tasks are blocked. Thus, all team members will be equally aware and better oriented if a similar task lands during their own shift.
For large systems, it's common to receive several on-call tasks at the same time. It might seem that prioritization doesn't really make a difference, because in the end they all have to be resolved. But if you look at on-call more closely and evaluate how many people were affected and how much profit was lost, the importance of prioritization becomes clearer. Moreover, a late reaction to more critical issues often causes cascading crashes, leading to more affected services and more drastic measures to restore them. This is quite typical for services with multiple servers and load balancing: if one instance shuts down, there is a high probability that the others will follow (if autoscaling is not in place). For data-processing systems, a late reaction means longer and more infrastructure-consuming re-processing.
An on-call engineer, working on a first-come-first-served basis, picked up a task created by another engineer about problems with a test account. A little lower in the same list there were several similar tasks from the notification system about a daily pause in data processing. By the time emails from customers began to arrive in bulk, the processing hadn't been working for several days. The issue was tiny: a new data type wasn't handled properly, and instead of ignoring the unknown input, all calculations for further rows were paused. The error was fixed and the processing was restarted, but for some time the system used all available computing resources, not to mention that the company had to offer some compensation to customers.
The solution can be divided into two parts: choose a prioritization system and implement it in the task creation process. For the first part, it is necessary to determine which metrics are most significant for the product. For example, if you're developing mobile games, your metric might be the number of users (existing or anticipated) affected by an issue. For B2B, you can evaluate the criticality of the problem based on the daily loss of profit or the expected downtime compensation. The best way to develop the prioritization scale is to collaborate with PMs, engineers, and Customer Support. It's also good to introduce a deadline for each task, as it helps to order tasks for triaging properly.
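As a sketch of what such a scale might look like once agreed with PMs and Customer Support: the weights and thresholds below are invented, and the only point is that the ordering can be computed from a couple of product metrics plus a deadline.

```python
from datetime import datetime, timedelta

def priority(affected_users: int, daily_revenue_loss: float,
             deadline: datetime) -> int:
    """Smaller number = more urgent. All thresholds are placeholders."""
    score = 3                                    # default: low
    if affected_users > 10_000 or daily_revenue_loss > 50_000:
        score = 1                                # critical
    elif affected_users > 500 or daily_revenue_loss > 1_000:
        score = 2                                # high
    # A close deadline bumps the task up one level.
    if deadline - datetime.utcnow() < timedelta(hours=12):
        score = max(1, score - 1)
    return score

# Two simultaneous tickets: the paused data processing wins over the test account.
print(priority(20_000, 80_000, datetime.utcnow() + timedelta(days=1)))  # 1
print(priority(1, 0, datetime.utcnow() + timedelta(days=7)))            # 3
```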
Hotfixes are those little elves who come to the rescue when the amount of potential damage is large and there is no time for an elegant solution. Which of us hasn't added sleep(N) because otherwise "the previous method hasn't been completed yet"? But a hotfix is not only a hard-coded solution but any inefficient action taken by an engineer. Manually updating configuration files right on the server or manually reinitiating instances isn't acceptable day-to-day with an existing CI/CD, but it is quite handy in a critical situation. Since this is an almost guaranteed working approach, tasks for proper fixes usually live somewhere at the bottom of the backlog. Having a task or a TODO in the code is the best-case scenario; at worst, engineers will use the "unplug and plug back in" technique for any issue. Sticking to the hotfix practice raises risks to system stability and maintainability, creates potential conflicts between hotfixes and the standard codebase, and, as a result, produces more on-call issues.
The team developed a new version of the API and, under the pressure of a deadline, decided to start using it without properly configuring DNS. Instead, it was decided to simply configure dependent services to call it by IP addresses. Tasks for replacing the hard-coded configuration with a standard one were planned but, as often happens, postponed due to other priorities. One day, the infra department did some maintenance; as a result, the IPs changed and the services outside DNS became unavailable. It took time and effort to find the dependencies (it turned out that a few of them had changed owners and the new team wasn't aware of the hotfix), reconfigure, and restart the services, and during this time the product was unavailable to customers.
Remember the rule: each task must be followed by an action that prevents recurrence. Therefore, it's important to record the hotfix usage as a comment on the on-call task or as a paragraph in the postmortem document. Later it'll be converted into a tech-debt task whose status can be tracked. Always link tasks to keep context and to visualize the size of the issue if the hotfix is applied regularly. If you mark code with TODO, provide a link to the task next to it, as untied TODOs often live in the codebase forever and never lead to a solution.
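This is easy to enforce with an automated check. A sketch of one, assuming task IDs look like Jira-style keys (e.g. ABC-123) and the code lives under the current directory; both the pattern and the path are placeholders:

```python
import pathlib
import re
import sys

# A TODO line with no task key (two+ uppercase letters, dash, digits) after it.
TODO_RE = re.compile(r"\bTODO\b(?!.*\b[A-Z]{2,}-\d+\b)")

def untracked_todos(root: str = ".") -> list[str]:
    """Return 'file:line' locations of TODOs that don't reference a task ID."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if TODO_RE.search(line):
                hits.append(f"{path}:{lineno}")
    return hits

if __name__ == "__main__":
    hits = untracked_todos()
    if hits:
        print("TODOs without a linked task:\n" + "\n".join(hits))
        sys.exit(1)
```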
If the task was created a long time ago but remains untouched, you need to review the prioritization policy. Sometimes it's difficult to negotiate such tasks with the leadership, since there's confidence that the 10-25% of total development time allocated for tech debt should cover such work. This is where on-call tracking becomes handy. Try tracking tasks whose postmortem follow-ups aren't all resolved, as well as the number of on-call tasks linked to a hotfix-removal task. The metric will look much worse but be more truthful, and it will allow the leadership to make better estimations based on the technical condition of the product.
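As a sketch of what that tracking could look like, assuming your task tracker can export tasks with their links as plain records; the field names here are made up:

```python
# Hypothetical exported records: each task carries its type and linked-task state.
tasks = [
    {"key": "OPS-1", "type": "postmortem", "followups_resolved": [True, False]},
    {"key": "OPS-2", "type": "oncall", "linked_to_hotfix_removal": True},
    {"key": "OPS-3", "type": "oncall", "linked_to_hotfix_removal": True},
]

open_postmortems = sum(
    1 for task in tasks
    if task["type"] == "postmortem" and not all(task["followups_resolved"])
)
oncall_caused_by_hotfixes = sum(
    1 for task in tasks
    if task["type"] == "oncall" and task.get("linked_to_hotfix_removal")
)
print(f"Postmortems with unresolved follow-ups: {open_postmortems}")
print(f"On-call tasks linked to a pending hotfix removal: {oncall_caused_by_hotfixes}")
```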
As far as prioritization within a team is concerned, post-incident tasks should be given extremely high priority, on a par with security-risk tasks. If it's not possible to quickly eliminate the hotfix, evaluate the changes from a business perspective - affected revenue, infrastructure cost, cost of maintenance, and so on. As a result, you either convert the work into a project for the whole team, or realize that the cost of changes is too high and describe the hotfix in the runbook as an acceptable solution.
Using personal accounts, even corporate ones, in the development pipeline rarely ends well. The very fact that using personal accounts outside the corporate domain is acceptable creates additional security and data privacy risks. It also adds extra work for IT teams by complicating activity monitoring and 3P service cost control, and by making it harder to disable employee accounts when a contract is terminated. Even if the company doesn't allow personal accounts outside the system, you can still find products and services tied to an individual engineer's account: a personal email used for server access, a DB user, a user for autotests, or project documentation stored in a personal Google Drive. In an emergency, an on-call engineer has to improvise, contact the IT team to grant access, or be ready to blindly "change and redeploy" until the issue is resolved. If the team is distributed across continents, be ready to wait for a while: once I received an issue about data processing and didn't have access to the job-running tool, so I had to put the task on hold for 8 hours until access was granted.
The code analysis system found a "dead" endpoint, created a task about its automatic removal within two weeks, and notified an engineer to take action in case the endpoint was still needed. The notification rule had been set up to send all notifications to one particular engineer, who at some point decided to leave the team. The team missed the notification, and the endpoint, which turned out not to be dead, was deleted. The on-call engineer had to deal with re-creating configs and reverting all the changes.
As obvious as “don’t allow personal accounts” may sound, the reality is much more complicated, especially for early-stage startups.
The first weak point is onboarding. On the first working day, a newcomer gets an empty laptop and the opportunity to install everything to their liking, and this is how you end up with unlicensed software and services registered under personal accounts. Pre-setup and SSO help to reduce the effect. Quite often engineers are forced to create accounts just to take a glance at the product. To add a bit of control, make the first task for a new project (or one that you've received from another team) the creation of all possible accounts and placeholders, even if they remain unused for a while. This could be email addresses for the team and test users, a documentation page in a shared space, and so on. In the future, engineers will have much less need to create accounts themselves.
If you still allow the use of personal accounts, make sure that all team members have extended permissions. For example, if your Github organization has only one admin, the whole team will be blocked even on small improvements like adding a webhook.
The introduction of regular automatic checks will help deal with personal accounts that have already entered the system. Static code analyzers can help find account usage in the codebase, and scheduled jobs can collect data from databases and flag accounts with unwanted settings (e.g. non-corporate email + test user + admin role). 3P services usually offer good analytics, so you can easily get user data and take action to restrict access if personal accounts are used.
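A minimal sketch of such a scheduled check, assuming user records can be pulled from a database or a 3P service's API into plain dictionaries; the corporate domain, field names, and example data are placeholders:

```python
CORPORATE_DOMAIN = "@example.com"   # placeholder corporate domain

def flag_personal_accounts(users: list[dict]) -> list[dict]:
    """Flag accounts that combine a non-corporate email with risky settings."""
    flagged = []
    for user in users:
        non_corporate = not user["email"].endswith(CORPORATE_DOMAIN)
        risky = user.get("role") == "admin" or user.get("is_test_user", False)
        if non_corporate and risky:
            flagged.append(user)
    return flagged

# Example run on a hypothetical export
users = [
    {"email": "alice@example.com", "role": "admin"},
    {"email": "bob.personal@gmail.com", "role": "admin"},
    {"email": "autotest@gmail.com", "role": "viewer", "is_test_user": True},
]
for user in flag_personal_accounts(users):
    print("Review access for:", user["email"])
```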
It is not always possible to get rid of unwanted accounts quickly: in extreme cases, essential flows and accesses can be tied to a single email. Organizing a separate event like a hackathon is helpful here as you can bring in engineers from different teams and be sure that the replacement won’t break CI/CD in the middle of the release.
Most of the time, flakiness shows up in alerts or tests, and the most annoying categories are "too many false positives" and "too frequent notifications". Both lead to the same outcome: engineers just stop paying attention and react to the real issue later than they could.
Alerts are usually not documented and sometimes carry a description that doesn't explain their purpose. Too-frequent alerts are an example of over-engineering: with the good intention of making a system as safe as possible, engineers trigger alerts on every exception and set thresholds too low. Another way to make on-call worse is to raise the same alert in every available channel: send an email, create a task, use all the Slack options, and so on.
As for tests, the number of QA engineers on a team is usually much smaller than the number of developers, so tests are released later than features and become outdated quite fast (the same applies to other team compositions such as "no QAs", "QA platform", etc.). Tests for UI components are at the highest risk of becoming flaky, especially if they are meant to be pixel-perfect. E2E tests for complex systems with authentication, long-lasting data processing, and stateful dependencies can become flaky due to timeouts. Tests themselves aren't the reason on-call gets worse; however, the capacity to create them usually comes from the same budget as tech-debt tasks, and if they're designed badly they distract on-call engineers more than they contribute to service resilience.
That was one of the funniest cases in my practice. I worked on a feature and introduced a UI bug in a list container - scrolling was blocked and only the visible set of items was available. During development I used a minimal setup, so, obviously, I missed it. At that time, a few integration tests were flaky as they relied on a find-and-click approach. The production environment had better resources than preproduction, so the latter faced flakiness much more often. When I ran the preproduction tests they failed, but as that wasn't unusual I just notified the QA engineer and moved forward to production. When the production tests failed, the QA engineer started checking the tests themselves (hi, low service confidence) and we started receiving on-call tickets. Luckily, the revert was done pretty quickly, but my colleagues still remind me about this case.
If the percentage of flakiness is really high, consider switching off the alerts or tests. It might sound crazy, but if the outcome is a false positive 9 times out of 10, it distracts you more than it helps. Do a proper review and remove the reason for the flakiness before returning a check to the pipeline. For example, check that alert thresholds are consistent with the service's throughput, especially after the service has been upgraded and scaled. For tests, it's good to start by getting rid of hardcoded timeout values or details that can easily change (an object's position on the screen, the value of an input label, data loaded in a collapsed component, etc.).
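For the timeouts specifically, a common replacement for a hardcoded sleep(N) is a polling wait with an upper bound. A minimal sketch, where the condition is whatever your test currently sleeps for; the job_status call in the comment is hypothetical:

```python
import time

def wait_until(condition, timeout: float = 30.0, interval: float = 0.5) -> None:
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"Condition not met within {timeout}s")

# Instead of time.sleep(60) and hoping the previous step has finished:
# wait_until(lambda: job_status("nightly-export") == "done", timeout=120)
```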
To reduce noise, try minimizing the number of notification channels. My choice is to have a task assigned to the on-call engineer for low- and mid-priority cases, and a task plus a DM for high priority. You might keep a team channel for the most severe cases, but only if you expect somebody besides the on-call engineer to be involved.
There are plenty of examples of using tests as a service health check by running them continuously in production. If you rely on this approach, focus on the parts that are most likely to fail unexpectedly: a 3P service can become unavailable or change its response format, which can cause a system failure. Just remember that tests might be long-running and cover only core functionality; if you want a better result, consider building a dedicated scheduled job instead.
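A sketch of such a dedicated check, assuming the dependency exposes a JSON endpoint and that an alert hook already exists in your monitoring setup; the URL and the expected fields are placeholders:

```python
import json
import urllib.request

ENDPOINT = "https://api.partner.example.com/v1/status"   # placeholder 3P endpoint
EXPECTED_FIELDS = {"status", "updated_at"}                # fields we rely on

def check_dependency(alert) -> None:
    """Scheduled job: alert if the 3P service is down or changed its format."""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=10) as resp:
            payload = json.load(resp)
    except Exception as exc:                  # network error, timeout, bad JSON
        alert(f"3P dependency unreachable: {exc}")
        return
    missing = EXPECTED_FIELDS - set(payload)
    if missing:
        alert(f"3P response format changed, missing fields: {sorted(missing)}")

# Example wiring with a trivial alert hook:
# check_dependency(alert=lambda msg: print("ALERT:", msg))
```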
After a short time of using the labels, the first on-call analysis helped to identify that most of the load fell on small tasks (triage required less than an hour) that were either closed without a fix or transferred to another team. Grouping the tasks gave a list of high-level issues:
The team received enough information to come up with solutions that would not only solve the current problems but prevent similar ones in the future. For example, the code review system began to automatically notify the specialists working with content every time it detected a new or changed text message. This allowed them to stop changes that weren't aligned with the content strategy and, therefore, avoid message inconsistency.
There is no magic pill to completely eliminate on-call, and I wouldn't say one is needed: on-call activity shows that your product is alive, evolving, and ready to address customer requests. However, it's essential to keep the balance between the time spent creating new features and maintaining the existing ecosystem. I hope you'll find techniques here that will help you make the on-call process less frightening and the product more resilient.