Ask HN: Does your on-call rotation suck? Can I join it?

9 points by asciifree a day ago

Hi HN

I'm doing some field research on unique ways on-call rotations can be unhealthy. It would be great to hear some anecdata from the community about /why/ you feel your on-call rotation sucks - and I figure it would be even better to experience it firsthand :)

I do understand asking to join/shadow your rotation is probably not practical; however, I am 100% serious and happy to sign whatever.

Cheers & may your pager stay silent

dakiol 14 hours ago

On call sucks because: You need to stay at home or very close 24h. You cannot go for a run, or to the cinema, or to have a dinner if you are on call (please don’t tell me: “just take your laptop with you”. How on earth would I run with a laptop?)

No matter if you get paged or not, you need to be available, and that sucks.

tra3 a day ago

You know it's broken, everyone knows it's broken but it keeps alerting.

  • nik736 15 hours ago

    Yes, this is probably the #1 reason. Alerts go off, but you don't even know if they are for real issues this time.

  • andrewmcwatters a day ago

    And you're allocated the time for on-call, salaried for it, but never assigned the task to fix the damn thing. So instead, you and your team burn maybe-I-have-to-get-up-at-night-time instead of definitely-attempting-to-fix-it-time.

    • asciifree 19 hours ago

      I strongly believe whoever is on-call should have free rein to modify anything about their ops environment. The pitch for management is that while in the short term it might take time away from project work, eventually the reduction in interruptions will result in higher productivity.

      p.s. Knew I recognised the name - loved following development of Planimeter/Grid a while ago!

      • andrewmcwatters 6 hours ago

        Thank you so much! My company is shipping a finance platform at the moment, and I’d love to get back to Planimeter’s work when I am able.

    • tra3 a day ago

      It's like hitting snooze on the alarm clock. It's an illusion. You gonna have to wake up. You're not getting any more sleep.

      I stopped hitting snooze as I got older. I either wake up or give up and sleep in.

ferguess_k 8 hours ago

On call sucks because it's usually 24/7 for the on-call person. Any work that is 24/7 standby sucks anyway.

brudgers a day ago

On-call issues are staffing issues not tools issues.

Inadequate staff is the only reason on-call exists. Sure, people might be mostly sitting around all night being paid and not being terribly busy.

But if a company needs someone at night, they need someone at night. Companies getting away with not paying for that is why oncall sucks.

In other words oncall sucks because companies don’t pay for solving the problems that require it. There’s no self correcting feedback.

A tool can’t fix that and oncall is not inevitable. Good luck.

  • muzani 19 hours ago

    I assume it's part of the pay. You can't be a firefighter or a cop and then complain that there's night shifts. I've had nearly 4 years of it at a payment gateway and IIRC only one time was there something that had to be solved that night. When it happened, it was sort of my fault anyway; a good deal of the problems are (should be?) within the control of the people being on-call. And I think companies like payment gateways and cloud services which need people active at all times are also far more tolerant of things like spending a week reviewing a PR and such, so the frequency of downtime is lower even if the impact is much higher.

    Though I'd agree it's a staffing issue. 5 people in a cycle is fine. If you had a concert or something that week, just swap places with a colleague. When we reduced it to 2 people, it was not cool to spend half your time on-call.

    There's also policies like don't release on Fridays, don't release on a vacation week. If there's a tool for it, it would be flagging these behaviors. Unfortunately, we can't really control when partners go down.

  • al_borland 21 hours ago

    I used to work the night shift handling most off hours issues for a couple dozen teams. We would occasionally have to call someone, but not that often compared to the alternative. Most of the time it was just to get sign off on what we already planned to do.

    When I started people were paid for any hours they worked on-call. By the end, the company changed the policy so on-call was part of base pay. For those who were on-call during the change over, their last year of on-call pay was averaged and added to their salary. For everyone who came after that, they got screwed (that includes me).

    Once I changed to the day shift I got called a few times for on-call. Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority. If I ever get called two times for the same issue, that’s my fault. So far, it’s never happened.

    • asciifree 19 hours ago

      > When I started people were paid for any hours they worked on-call

      I've yet to hear of any alternative compensation model that actually works. Just pay people in their choice of money or time off in lieu. Sorry to hear you got screwed.

      > Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority.

      100% agree, I think people are far too tolerant of being paged. Especially management - the productivity impact of constant interrupts is huge. In a previous job one of my favourite things to do was go out to teams and just disable alerts they said were noisy or unactionable. If there was any pushback/consequence I was happy to accept responsibility (but never had to).

      • muzani 19 hours ago

        Disabling non-actionable alerts actually lowered the error rate in my experience, because people would start paying attention to the alerts. Even if they were being lazy, they'd be able to see a pattern after getting rid of the noise.

        • asciifree 19 hours ago

          Exactly! Cut the noise, boost the signal. Every alert outside business hours should mean "drop everything and investigate this". Otherwise it can wait until the morning.

  • asciifree 19 hours ago

    I think we somewhat agree. Uncompensated on-call is not acceptable. Even if you're not busy, there is an ever-present burden to knowing you could be interrupted at any moment that takes a toll on your personal time.

    But as long as the expected cost of downtime outweighs the financial cost of keeping someone available to fix it, on-call in some form will be inevitable. (There are a lot of instances where the cost doesn't make sense, and we should just accept the system being broken until 9am)

    I don't think on-call needs to suck though. IMO "staffing issues" (whether it's headcount, time, competing priorities, etc) are resourcing issues and I believe better tooling can absolutely help with that - either by reducing the resources required to fix it or by making the cost of the issues quantifiable. Thanks for the good luck :)

joshstrange 11 hours ago

“Hi, welcome to being on-call, watch these channels and respond to any alerts that fire, except those alerts, you can ignore this subset of alerts, also you can ignore this alert unless it fires multiple times in quick succession, and this alert only matters if….”
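
For what it's worth, a briefing like that is really a set of routing rules that lives in people's heads. A minimal sketch of writing those rules down as code, assuming made-up alert names, a made-up "3 times in 10 minutes" threshold, and a hypothetical `should_page` helper (none of this comes from a real paging tool):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    name: str
    fired_at: datetime


# Hypothetical rule tables; the names and thresholds are illustrative only.
IGNORED = {"disk-usage-staging"}  # "you can ignore this subset of alerts"
BURST_RULES = {"queue-lag": (3, timedelta(minutes=10))}  # "unless it fires multiple times"


def should_page(alert: Alert, history: list[Alert]) -> bool:
    """Decide whether an alert should page, given earlier alerts in `history`."""
    if alert.name in IGNORED:
        return False
    rule = BURST_RULES.get(alert.name)
    if rule is None:
        return True  # no special rule: every firing pages
    min_count, window = rule
    # Count earlier firings of the same alert that fall inside the window.
    recent = [
        a for a in history
        if a.name == alert.name
        and timedelta(0) <= alert.fired_at - a.fired_at <= window
    ]
    return len(recent) + 1 >= min_count  # +1 for the current firing
```

With the rules in one place, the exceptions are at least reviewable, and an entry that sits in `IGNORED` forever becomes an obvious candidate for deleting the alert itself.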

Quitschquat a day ago

Ask the poor bastards at Discovery

johncole a day ago

Doing some customer discovery?

dgunay 17 hours ago

I'm certainly not at liberty to invite a random to cover on-call shifts for us, but here's some anecdata about things I've witnessed that made on-call suck.

We began with free food delivery over the weekend, and the expectation that you'd take a day off the next week ("unlimited" PTO policy). Eventually they stopped letting us do that and now the "unlimited" in our PTO policy has an invisible limit, so you can't actually do that without it counting towards the invisible limit on your unlimited PTO for the year.

Our monitoring and alerting is unusably noisy. Deviance is fully normalized. All our postmortems typically have a section stating that alerts were issued, but ignored until customers began complaining. Attempts to cut the noise down to a sane level have all been defeated by the ever present pressure to feature factory. TBF this is mostly an engineering self-own and I feel partially responsible for this outcome.
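
One way to make that kind of noise visible is to count how often each alert fires without anyone acting on it. A rough sketch, assuming a hypothetical log of `(alert_name, acted_on)` pairs and arbitrary cut-offs (80% ignored, at least 5 firings); the log format and the `noisy_alerts` name are assumptions, not taken from any real monitoring product:

```python
from collections import Counter


def noisy_alerts(log, max_ignored_ratio=0.8, min_fires=5):
    """Return alert names that fired often and were mostly ignored.

    `log` is an iterable of (alert_name, acted_on) pairs, where acted_on
    is True if someone actually responded to that firing.
    """
    fired = Counter()
    ignored = Counter()
    for name, acted_on in log:
        fired[name] += 1
        if not acted_on:
            ignored[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_fires and ignored[name] / count >= max_ignored_ratio
    )
```

A list like this will not fix the prioritization problem, but it turns "the alerts are noisy" into a concrete set of candidates to delete or retune.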

The on-call engineer does a shocking amount of manual labor to paper over bugs in the product and un-stick users who fall through the (many) cracks. It is effectively a T3 tech support rotation. We've taken steps to tone it down to mere triage and channel this into pressure against offending teams' timelines, but there's a huge amount of silent cultural resistance and no one is being held accountable when a feature increases support load. I suspect this issue alone would make most bigtech engineers quit.

For the (many) issues that require manual intervention, the on-call engineer cannot actually do anything unless 2 other engineers sign off on a PR (either to run a SQL query or to deploy some tool or bugfix to resolve the problem).

This is more specific to the product I work on, but the sheer amount of 3rd party services we rely on means that something is constantly acting up and there's not a lot we can do about it. Our API client code for each service we use typically contains _at least_ one service-specific hacky workaround to keep things running in the face of bad behavior.

The frontend team has no on-call rotation despite causing plenty of bugs on their own. Backend engineers are expected to triage what are clearly frontend problems. We stood up a lot of observability tooling for the frontend but it took years for them to even start to use it.

More than anything, it feels like the moment I stop championing the issue, everyone stops paying attention and the on-call experience reverts to the mean. Other on-call engineers just sort of stop boyscouting and let the chaos wash over them while focusing on sprint obligations (can't blame them), and leadership takes their eye off the ball to chase growth (also can't blame them). Hugely fucked lack of accountability and the buck eventually stops at whoever is the poor guy holding the pager that week.

  • arcfour 16 hours ago

    I wonder what would happen if you sent this nearly verbatim to executive leadership. It is quite a thorough, candid description of a serious problem.

    At least I certainly wouldn't be happy to learn that my product was bursting at the seams and nobody was being held accountable. But I'm not an executive leader. (Maybe that's why?)

    • dgunay 3 hours ago

      All of these issues were at some point raised to leadership. I've spent a lot of political capital on the issue and decided that it's not a hill I'm prepared to die on. Either a crop of new hires will come along and improve the situation with their fresh-eyed optimism, or it'll just keep happening and I'll try to remain zen.

      And there's certainly a calculus to it that changes when you're an executive. To me, craftsmanship, diligence, and engineering excellence are important, not just because I love programming but also because I'm an IC and it affects me directly. To an executive, I am just some weird nerd they have to pay a lot of money to make computers do things. Beautiful code and a serene on-call experience are nice but they don't usually get a company acquired.

    • ferguess_k 8 hours ago

      It probably isn't worth it, considering it might impact the replier's career negatively. I'd never do that. I'd speak to my manager and if he just gets by then I just get by.