yashap 4 years ago

That’s what stood out to me too. Although they’d been slowly rolling it out for a while, their last major rollout was quite close to the outage start:

> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%

Consul was clearly the culprit early on, and they had just made a significant Consul-related infrastructure change, so you’d think rolling that back would be one of the first things to try. One of the absolute first steps in any outage is “is there any recent change that could plausibly be causing this? If so, try rolling it back.”

They’ve obviously got a lot of strong engineers there, and it’s easy to critique from the outside, but this certainly struck me as odd. Sounds like they never even tried “let’s roll back the Consul-related changes”; rather, 50+ hours into a full outage, they’d done some deep profiling and discovered the streaming issue. But IMO root cause analysis is for later, “resolve ASAP” is the first response, and that often involves rollbacks.

I wonder if this actually hindered their response:

> Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.

i.e. earlier on, were there HashiCorp peeps saying “naw, we tested streaming very thoroughly, can’t be that”?

otterley 4 years ago

When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.

Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes that happened in the recent past, and certainly changes in client load could have been a possible culprit.

  • yashap 4 years ago

    Some changes are extremely hard to roll back, but this doesn’t sound like one of them. From their report, it sounds like the rollback process simply involved making a config change to disable the streaming feature; it took a bit to roll out to all nodes, and then Consul performance almost immediately returned to normal.

    Blind rollbacks are one thing, but they identified Consul as the issue early on, and clearly made a significant Consul config change shortly before the outage started, that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, nevermind the first 50 hours.
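    For reference, and purely as a sketch of what that rollback might have looked like: on recent Consul versions the streaming feature is controlled by a single client-agent config key. Assuming they were using the documented `use_streaming_backend` option, the revert is roughly:

    ```hcl
    # Consul client agent config (sketch; assumes Consul 1.9+, where
    # the streaming backend is controlled by this key).
    # Setting it to false and restarting the agent falls back to the
    # older long-poll blocking queries for service discovery.
    use_streaming_backend = false
    ```

    A config-only, per-node revert, which matches their description of the eventual fix.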

  • mypalmike 4 years ago

    In most cases, if you've planned your deployment well (meaning in part that you've specified the rollback steps for your deployment) it's almost impossible to imagine rollback being slower than any other approach.

    When I worked at Amazon, oncalls within our large team initially had leeway over whether to roll backwards or try to fix problems in situ ("roll forward"). Eventually, the amount of time wasted trying to fix things, and new problems introduced by this ad hoc approach, led to a general policy of always rolling back if there were problems (I think VP approval became required for post-deploy fixes that weren't just rolling back).

    In this case, though, the deployment happened ages (a whole day!) before the problems erupted. The rollback steps wouldn't necessarily be valid (to your "multiple confounding changes" point). So there was no avoiding at least some time spent analyzing and strategizing before deciding to roll back.

  • uvdn7 4 years ago

    > When you're at Roblox's scale

    Yet a regional Consul deployment is the single point of failure. I apologize if that sounds sarcastic. There are obviously a lot of lessons to be learned, and blame has no place in this type of situation - nor do excuses.

notacoward 4 years ago

In a not-too-distant alternate universe, they made the rookie assumption that every change to every system is trivially reversible, only to find that it's not always true (especially for storage or storage-adjacent systems), and ended up making things worse. Naturally, people in alternate-universe HN bashed them for that too.

  • yashap 4 years ago

    Obviously I'm on the outside looking in here - can't say anything with confidence. But I've been on call consistently for the past 9 years, for some decent sized products (not Roblox scale, but on the order of 1 million active users), mitigating more outages than I can count. For any major outage, the playbook has always been something like this:

    1. Which system is broken?

    2. Are there any recent changes to this system? If so, can we try reverting them?

    They did "1", quickly identified Consul as the issue. They made a significant Consul change the day before, one they were clearly cautious/worried about (i.e. they'd been slowly adopting the new Consul streaming feature, service by service, for over a month, and did a big rollout of it the previous day). And once they did identify streaming as the issue, it was indeed quick to roll back. It just seems like they never tried "2" above, which is strange to me, very contrary to my experience being on call at multiple companies.

    • notacoward 4 years ago

      What do you do when you're working on a storage system and rolling back a change leaves some data in a state that the old code can't grok properly? I've seen that cause other parts of the system (e.g. repair, re-encoding, rebalancing) to mangle it even further, overwrite it, or even delete it as useless. Granted, these mostly apply to code changes rather than config, but it can also happen if code continues to evolve on both sides of a feature flag, and both versions are still in active use in some of the dozens of clusters you run. Yes, speaking from experience here.

      While it's true that rolling back recent changes is always one of the first things to consider, we should acknowledge that sometimes it can be worse than finding a way to roll forward. Maybe the Roblox engineers had good reason to be wary of pulling that trigger too quickly when Consul or BoltDB were involved. Maybe it even turned out, in perfect 20/20 hindsight, that foregoing that option was the wrong decision and prolonged the outage. But one of the cardinal rules of incident management is that learning depends on encouraging people to be open and honest, which we do by giving involved parties liberal benefit of the doubt for trying to do the right thing based on information they had at the time. Yes, even if that means allowing them to make mistakes.

    • Karrot_Kream 4 years ago

      If you're doing a slow rollout, it's not always easy to tell whether the thing you're rolling out is the culprit. I've been on the other side of this outage where we had an outage and suspected a slow change we had been rolling out, especially because we opted something new into it minutes before an incident, only to realize later when the dust settled that it was completely unrelated. When you're running at high scale like Roblox and have lots of monitoring in place and multiple pieces of infrastructure at multiple levels of slow-rollout, outages like this one don't quickly point to a smoking gun.

  • erosenbe0 4 years ago

    Spot on. And some things are easily reversible to the extent that they alleviate the downtime, yet still leave a large data sync or ETL job to complete in their wake. Until that completes, the effect is continued loss of function or customer data at some lesser level of severity.

hughrr 4 years ago

As a fairly regular Consul cluster admin for the last 6 years or so, though not at that scale, I can safely say that you generally have no idea if rolling back will work. I’ve experienced everything up to complete cluster collapses before. I spent an entire night blasting and reseeding a 200-node cluster once after a well-tested forward migration went into a leadership battle it never resolved. Even if you test it beforehand, that’s no guarantee it’ll be alright on the night.

Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.

That applies to vault as well.

  • darkwater 4 years ago

    > Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.

    Consul (and Vault) are for sure complex pieces of software that 99% of the time "just work", but when they fail they can fail big time, I concur. But calling it a mortal risk seems a bit far-fetched in my opinion.

    • hughrr 4 years ago

      When you run 5 nines that’s a risk.

  • miyuru 4 years ago

    I also run 3 small clusters of consul, and I went ahead and read the raft paper[1] so I can debug consul election problems if they occur.

    Consul is awesome when it works, but when it breaks it can be hell to get it working again. Thankfully it usually works fine. I only had 1 outage, and it fixed itself after restarting the service.

    [1] https://raft.github.io/raft.pdf
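    If it helps anyone in the same boat, Consul's CLI has a few read-only commands that make election trouble less of a black box. A minimal debugging sketch, assuming a reachable local agent:

    ```shell
    # Which servers does the cluster think are raft peers, and who leads?
    consul operator raft list-peers

    # Agent/raft counters: commit index, last_log_index, known leader, etc.
    consul info

    # Stream agent logs at debug level to watch an election play out live
    consul monitor -log-level=debug
    ```

    All three are observational, so they're safe to run against an already-unhappy cluster.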

    • rfoo 4 years ago

      > so I can debug consul election problems if it occurs

      Interestingly, reading this reminded me of a HashiCorp Nomad marketing piece [1]:

      > "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."

      I was always thinking "but what if something goes wrong? just call HashiCorp engs?" :p

      [1] https://www.hashicorp.com/case-studies/roblox

    • mrweasel 4 years ago

      That seems to be a general problem with these types of solutions. You have the exact same issue with something like ZooKeeper. It's awesome when it works, but good luck trying to figure out why it's broken.

      Just the thought of the author of the previous post relying on these types of services is something that can keep me up at night.

throwdbaaway 4 years ago

At first I thought it was a well-written post-mortem with proper root cause analysis. After reading it for the second time, though, it doesn't sound like the root cause was actually identified. At one point they disabled streaming across the board, and the Consul cluster started to become sort of stable. Is streaming to blame here? Why would streaming, an enhancement over the existing blocking query, which is read-only, end up causing "elevated write latency"? Why did some voter nodes encounter the BoltDB freelist issue, while other voter nodes didn't?
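For context on that blocking query vs. streaming distinction: a blocking query is just a long-poll against Consul's HTTP API, while streaming pushes incremental updates over a gRPC subscription instead. A blocking query sketch against a hypothetical `web` service (the index value is whatever `X-Consul-Index` the previous response returned):

```shell
# Long-poll: this call hangs until the service's raft index advances
# past 123 or the 5-minute wait elapses, then returns a full snapshot.
curl "http://localhost:8500/v1/health/service/web?index=123&wait=5m"
```

Both mechanisms are reads from the client's point of view, which is part of what makes the "elevated write latency" symptom so puzzling.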

And there is still no satisfying explanation for this:

> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why its performance had changed.

But I totally agree with you that the first thing they should have looked into is rolling back the 2 changes made to the traffic routing service the day before, as soon as they discovered that the Consul cluster had become unhealthy.

londons_explore 4 years ago

"just roll back" gets risky when you roll back more than a few hours in many cases.

Frequently the feature you want to roll back now has other services depending on it, has already written data into the datastore that the old version of the code won't be able to parse, has already been released to customers in a way that will be a big PR disaster if it vanishes, etc.

Many teams only require developers to maintain rollback ability for a single release. Everything beyond that is just luck, and there's a good chance you're going to be manually cherry picking patches and having to understand the effects and side effects of tens of conflicting commits to get something that works.