This outage has it all, distributed systems, non-uniform memory access contention (aka "you wanted scale up? how about instead we make your CPU a distributed system that you have to reason about?"), a defect in a log-structured merge tree based data store, malfunctioning heartbeats affecting scheduling, wow wow wow.
Kind of curious about this. I know this is probably company specific but how do outages get handled at large orgs? Would the on-calls have been called in first then called in the rest of the relevant team?
Is their a leadership structure that takes command of the incident to make big coordinated decisions to manage the risk of different approaches?
Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed?
Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, at any reasonable team, everyone that chipped in nights get to take off equivalent days and sprint tasks are all punted.
Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."
Kinda. Definitely would have represented crunch time, but a very very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyways. Keep in mind these typically only last an hour or 3, at most they last a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".
After a certain length of outage, you have to start prioritizing differently though. I only have our own anecdotes there. But if someone was at a problem for 8 - 12 consecutive hours under pressure, the quality of their work is going to drop sharply. At such a point, it becomes more and more likely for them to make the situation worse instead of fixing it.
And at or beyond that point, you pretty much have to take inspiration from fire fighters and emergency services: You need to organize the experts on subsystems to rest and sleep in shifts, ideally during simpler but time consuming tasks. Otherwise these persons will crash and you lose their skills and knowledge during that outage for good. And that might render an outage almost impossible to handle.
I think I didn't explain myself very well: clearly on-duty must sleep if it's a multi-day incident, but they also need extra help when they are awake! If the business is completely down, there isn't normal work to do for other engineers so, even if they are out of their typical domain, they might give good insights, novel ideas or fix some side issues that will help the ones with more domain knowledge.
The problem is that you don’t know how long the outage will be when it starts. I once saw a large outage start, everyone jumped on to troubleshoot, thinking it would be an hour. 8 hours later it’s still an outage, and everyone is still on and burned out. Management should have told half the people who jumped on at the start to go away and be prepared for a phone call in 8 hours to provide relief.
Oncalls get paged first and then escalate. As they assess impact to other teams and orgs, they usually post their tickets to a shared space. Once multiple team/org impact is determined, leadership and relevant ops groups (networking, eg) get pulled in to a call. A single ticket gets designated the Master Ticket for the Event, and oncalls dump diagnostic info there. Root cause is found (hopefully), affected teams work to mitigate while RC team rushes to fix.
The largest of these calls I've seen was well into the hundreds of sw engineers, managers, network engineers, etc.
Wow, that makes complete sense for something that is impacting this many people and by extension lots of money.
Thanks for the answer, I have only ever worked with such a small team that we are all on a call every day.
I can imagine it can probably get a little hectic in large group calls? On the engineering side is there a command structure? Like say the root cause was found and RC team is rushing to fix it. But another team wants to mitigate in the mean time in a slightly risky way. Would their manager make a case with leadership? Would the proposed plan just be put out for general comment as a response to that main ticket?
It depends. I’ve managed major incidents with hundreds of participants.
Our major incident process generally had a “suit” call with non-technical executives and people who would be coordinating customer triage, outreach, etc. Then we would have a tech bridge where the key stakeholders did their thing.
We used the Federal incident command system as a model. It’s a great reference point to use as an inspiration.
In addition, you can look into ITIL/ITSM Incident Management plans, they have well developed process structure to work from as a guideline.
I have also seen organizations recommend Kepner Tregoe method training for real time high pressure problem solving based off Nasa Mission Control systems.
each company is different,
from my experience it would depend on the severity of the fix, and the severity of the issue. the problem would get resolved by any means ie temporary sticky plaster if necessary.
Another team would then assess and analyse the root cause from a company wide perspective and then assess the risks, costs and impact and then make any modifications (possibly redoing the temporary fix, and fixing it properly)
Real issue, a call center main telephony system and one of the management servers kept crashing causing over 1400 call center people to stop working. Temporary fix was to re boot the servers every 4 hours causing minor pain, but the call staff was up and running.
After a whole stupid week of the engineers not being able to find the route cause it was escalated extremely high and our team was brought in and we found the root cause in seconds (literally)The servers was VMs and the engineers hadn't checked the physical ESX server they were hosted on. another VM on the box caused the server to go unstable (ESX not configured correctly).
BAU project set up to audit/ report and fix all the ESX servers in the company for other stupid config issues
The person you're responding to is not exactly wrong. But since the users dropped to 0 pretty quickly it's likely that every team with any monitoring at all got paged. At least that's what would happen at the moderately large company I work for.
I'm giving a much broader example of what a large company might do for high impact events. I have no idea what the insides of Roblox look like specifically.
Not to mention a VP or three. A well-led company is going to have management in the line of fire, so to speak, so an outage of this scale would wake them as well.
So 1-3 people actually figure it out while everyone else gets in the way? There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Former Google SRE here, I can share my experience although I've never been involved in a large serious outage (thankfully). I've had my fair share of smaller multi-team outages though.
Usually the way it works is so that we have multiple clearly-identified and properly-handed-off roles. There's an Incident Commander (IC) role, whose job is to basically oversee the whole situation, there's various responders (including a primary one) whose job is to mitigate/fix the problems usually relating their own teams/platform/infra (networking, security, virtualization clusters, capacity planning, logging, etc. depends on the outage). There's also sometimes a communication person (I forget the role name specifically) whose job is to keep people updated, both internal to the outage (responders, etc) and outsiders (dealing with public-facing comms, either to other internal teams affected by the outage or even external customers).
Depending on the size of the outage, the IC may establish a specific "war room" channel (used to be an IRC chatroom, not sure what they use these days though) where most communication from various interested parties will take place. The advantage of a chatroom is that it lets you maintain communication logs and timestams (useful for postmortem and timeline purposes), and it helps when handing off to the next oncaller during a shift change (they can read the history of what happened).
> There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Most people will not really be doing much but when you need to diagnose a problem, having a lot of brains with various expertise in different domains helps, especially if those people are the ones that have implemented a certain service that might be obscure to the other oncallers. Generally speaking, it wouldn't be unheard of to have 30-40 people in the same irc channel brainstorming and coordinating a cross-team effort to mitigate a problem, but into the hundreds? Not quite sure about that much.
Just my two cents. You can probably get more info by reading the Google SRE book https://sre.google/books/
Yeah, I've read the Google SRE book and the product I work on follows Google's SRE model. Sometimes I wonder though if it's all one big anti-pattern. Maybe more precisely it's a pattern designed to work even if nobody knows what's going on. Things are so vastly (over?) complicated. The original designers are long gone. But you still somehow have to keep things going and address any issues that pop up. In our org that SRE model leads that some very weird things because the SREs know the infrastructure (to some degree) but don't really understand the stuff running over it. But I guess we're delivering the service so that's something.
I think the "real world" doesn't work like that. The way the real world works is that things are decoupled in a way that one system's failure doesn't bring the entire world down. So things can be solved in isolation by people that actually understand the system and/or systems are designed in a way that they are serviceable etc.
When the power fails in my neighbourhood, you don't get 100 engineers on a hotline, one van comes down, troubleshoots the problem, and fixes it. Like 3 technicians.
I know there are some exceptions like some power failures that cascaded or the global supply shortages. But those are design failures IMO. A computer system that goes down for this length of time and nobody can figure out why or recover, that seems like a total failure to me on multiple levels. We're just doing this wrong.
Speaking from personal experience, most outages are contained and mitigated within a specific service before they end up impacting other services too. Cascade effects are rare, you just notice them more often because they affect multiple people and usually external-facing customers too. In reality, most things will (or, rather, *should*) page you well before it becomes a cascade-effect incident that multiple teams will have to take care of.
If your problem is that nobody knows what's going on and that stuff constantly brings down a bunch of different systems, you either need to finetune your alerting so the affected system tells you something is wrong *before* it reaches other people (monitor your partial rollouts, canary releases, capacity bursts, etc), or you have a problem with playbooks.
The person that implemented the system doesn't need to be the person that fixes it in case there's a problem. We have playbooks that tell us exactly what to do, where to go, which flags to flip, which machine to bring down/bring up, etc in case of various problems. These should be written by the person that implemented the system and any following SRE who's been in charge of fixing bugs or finding issues as a way for the next SRE oncall to not be lost when navigating that space. Remember that the person oncall is not the one responsible for fixing the issue, they are the person responsible for mitigating the problem until the most appropriate person can fix it (preferably not outside working hours).
Again, there can be exceptions that require multiple engineers to work together on multiple services, but in reality that should not be the norm. Most of the pages I handled as an SRE were "silly" things that were self-contained to our team and our service and our customers never even noticed anything was wrong in the first place.
In a really large company, you're talking maybe ~100-200 people per org. EC2 alone has a massive footprint, for instance. Hundreds of engineers, of whom a dozen are maybe oncall for their respective components. If something goes wrong in, let's say' cloudwatch, but EC2 is impacted, that's dozens of people working to weight their services out of the impacted AZ, change cache settings, bounce fleets, etc.
A lot of the time root cause is solved by a smaller number of people. But identifying root cause and mitigating impact during an event -- and then communicating specifics of that impact -- can fall to a much larger group.
If 1-3 people are actively solving the issue, they do so alone, and give periodic updates to the broader group through a manager or other communication liason.
3 people to fix the Vital Component That Must Work At All Times.
97 people to check/restart/monitor their team's system, because the Vital Component has never failed before so their graceful recovery code is untested or nonexistent.
For the on call system that I ran until recently, there are about a dozen on call teams responsible for parts of the service. Each team has a primary and backup engineer, generally on a 7x24 shift that lasts a week. Most weeks it's not very busy.
Working with them during an incident is an on call comms lead, who handles outside-of-team comms (protecting the engineers), and an engineering lead (who is a consultant, advisor, and can approve certain actions).
For big incidents, an exec incident manager is involved. They primarily help with getting resources from other teams.
Where I work there is an incident team that handles things like creating a master ticket, starting a call bridge, getting the on-calls into the bridge, keeping track of what teams (and who from those teams) have been brought in, manages the call (keeping chatter down and focused when there are 100 people in a call is important), periodically comments on the master ticket with status and a list of impacted teams, marks down milestone times like when the impact started, when it was detected, mitigated, root cause found, etc. This person is also responsible for stuff like when they hear you want to engage team X, they'll go track down an on-call for you, or summarizing known impact for the outward-facing status pages, etc. They also create the postmortem template and follow up with all involved teams to get them to contribute their detailed impact statement there.
Edit: sometimes when it's a really gnarly problem and there are huge numbers of people on the call, the set of people who are actively trying to come up with mitigations and need to just be able to talk freely at each other will break off into a less noisy call and leave a representative to relay status to the main call.
At Google an oncaller typically gets paged, triages the incident and, if it's bad, they page other oncallers and or team members for help. For more serious incidents, people take on different roles like communications lead, incident commander etc.
During the worst outage I was a involved in basically the entire org including all of the most senior engineers worked around the clock for two weeks to fix everything
As someone with 8 years of experience in SRE in Google: I wouldn't be so sure about that. Most outages require only rudimentary understanding of the particular service. Pretty much "have you tried turning it off and on?", with the extra step of figuring out which piece of the stack needs the kick. Hence, there are many SRE teams that onboard lots of services with this kind of half-support. The on call only performs generic investigation and repair attempts. If that doesn't help, they escalate to the relevant dev team, who likely will only respond in office hours.
Only the important services get dedicated oncalls. Most important ones will have both 24/7 SRE and dev oncalls.
What processes are there (and how effective are they?) to determine if a non-expert SRE should fix something there-and-then (and potentially making things worse) vs. assigning it to a dev team for a correctly engineered fix, at the cost of delays?
"We enjoyed seeing some of our most dedicated players figure out our DNS steering scheme and start exchanging this information on Twitter so that they could get “early” access as we brought the service back up."
Why do I have a feeling "enjoyed" wasn't really enjoyed so much as "WTF", followed by "oh shit..." at the thought that their main way to balance load may have gone out the window.
At their scale, it was probably an insignificant minority. I read that as nothing more than a wink and nod of "we see what you did ;)" ; which I appreciate. Some companies would have a fit and go nuclear on people for that, for no particular reason. As long as it is an insignificant minority, it doesn't matter, and ideally it's teenagers learning how something works on the side, and that helped grow some future hacker (in the HN sense) somewhere.
It's difficult to know how quickly word could have spread, but I enjoy knowing a few 11 year olds learned something about the Internet in order to play a game an hour early.
With social media etc, I can see it spreading really fast. That would be my bigger fear trying to get a service back up from a very long outage like that.
The intentionally slow bringup is to handle the thundering herd of having the system come back online to 100% at once. If a couple hundred users (small percentage of userbase) here or there are able to jump to queue, it's no real big deal.
As far as players figuring out the DNS steering scheme; the company has no responsibility to keep a non-advertised backend up. If it was a problem, disallow new connection to it and remove it from the main pool.
Love the "Note on Public Cloud", and their stance on owning and operating their own hardware in general. I know there has to be people thinking this could all be avoided/the blame could be passed if they used a public cloud solution. Directly addressing that and doubling down on your philosophies is a badass move, especially after a situation like this.
It's interesting, I don't see that being on cloud would have avoided or helped this situation much. They were able to ramp up their hardware very quickly - who knows where they got it that fast - and it actually made the problem worse, so being on cloud and having the ability to do that with keystrokes would not have helped. You could say they might be using a different set of components if they were on cloud which may not have suffered the same issues, but you can play the what if game all day it's not related to pros/cons of public cloud.
It's weird it took them so long to disable streaming. One of the first things you do in this case is roll back the last software and config updates, even innocent looking ones.
That’s what stood out to me too. Although they’d been slowly rolling it out for awhile, their last major rollout was quite close to the outage start:
> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%
Consul was clearly the culprit early on, and you just made a significant Consul-related infrastructure change, you’d think rolling that back would be one of the first things you’d try. One of the absolute first steps in any outage is “is there any recent change we could possibly see causing this? If so, try rolling it back.”
They’ve obviously got a lot of strong engineers there, and it’s easy to critique from the outside, but this certainly struck me as odd. Sounds like they never even tried “let’s try rolling back Consul-related changes”, it was more that, 50+ hours into a full outage, they’d done some deep profiling, and discovered the steaming issue. But IMO root cause analysis is for later, “resolve ASAP” is the first response, and that often involves rollbacks.
I wonder if this actually hindered their response:
> Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.
i.e. earlier on, were there HashiCorp peeps saying “naw, we tested streaming very thoroughly, can’t be that”?
When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.
Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes that happened in the recent past, and certainly changes in client load could have been a possible culprit.
In most cases, if you've planned your deployment well (meaning in part that you've specified the rollback steps for your deployment) it's almost impossible to imagine rollback being slower than any other approach.
When I worked at Amazon, oncalls within our large team initially had leeway over whether to roll backwards or try to fix problems in situ ("roll forward"). Eventually, the amount of time wasted trying to fix things, and new problems introduced by this ad hoc approach, led to a general policy of always rolling back if there were problems (I think VP approval became required for post-deploy fixes that weren't just rolling back).
In this case, though, the deployment happened ages (a whole day!) before the problems erupted. The rollback steps wouldn't necessarily be valid (to your "multiple confounding changes" point). So there was no avoiding at least some time spent analyzing and strategizing before deciding to roll back.
Some changes are extremely hard to rollback, but this doesn’t sound like one of them. From their report, sounds like the rollback process involved simply making a config change to disable the streaming feature, it took a bit to rollout to all nodes, and then Consul performance almost immediately returned to normal.
Blind rollbacks are one thing, but they identified Consul as the issue early on, and clearly made a significant Consul config change shortly before the outage started, that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, nevermind the first 50 hours.
Yet a regional Consul deployment is the single point of failure. I apologize if that sounds sarcastic. There’re obviously a lot of lessons to be learned and blames have no places in this type of situations - excuses as well.
In a not-too-distant alternate universe, they made the rookie assumption that every change to every system is trivially reversible, only to find that it's not always true (especially for storage or storage-adjacent systems), and ended up making things worse. Naturally, people in alternate-universe HN bashed them for that too.
Obviously I'm on the outside looking in here - can't say anything with confidence. But I've been on call consistently for the past 9 years, for some decent sized products (not Roblox scale, but on the order of 1 million active users), mitigating more outages than I can count. For any major outage, the playbook has always been something like this:
1. Which system is broken?
2. Are there any recent changes to this system? If so, can we try reverting them?
They did "1", quickly identified Consul as the issue. They made a significant Consul change the day before, one they were clearly cautious/worried about (i.e. they'd been slowly adopting the new Consul streaming feature, service by service, for over a month, and did a big rollout of it the previous day). And once they did identify streaming as the issue, it was indeed quick to roll back. It just seems like they never tried "2" above, which is strange to me, very contrary to my experience being on call at multiple companies.
If you're doing a slow rollout, it's not always easy to tell whether the thing you're rolling out is the culprit. I've been on the other side of this outage where we had an outage and suspected a slow change we had been rolling out, especially because we opted something new into it minutes before an incident, only to realize later when the dust settled that it was completely unrelated. When you're running at high scale like Roblox and have lots of monitoring in place and multiple pieces of infrastructure at multiple levels of slow-rollout, outages like this one don't quickly point to a smoking gun.
What do you do when you're working on a storage system and rolling back a change leaves some data in a state that the old code can't grok properly? I've seen that cause other parts of the system (e.g. repair, re-encoding, rebalancing) mangle it even further, overwrite it, or even delete it as useless. Granted, these mostly apply to code changes rather than config, but it can also happen if code continue to evolve on both sides of a feature flag, and both versions are still in active use in some of the dozens of clusters you run. Yes, speaking from experience here.
While it's true that rolling back recent changes is always one of the first things to consider, we should acknowledge that sometimes it can be worse than finding a way to roll forward. Maybe the Roblox engineers had good reason to be wary of pulling that trigger too quickly when Consul or BoltDB were involved. Maybe it even turned out, in perfect 20/20 hindsight, that foregoing that option was the wrong decision and prolonged the outage. But one of the cardinal rules of incident management is that learning depends on encouraging people to be open and honest, which we do by giving involved parties liberal benefit of the doubt for trying to do the right thing based on information they had at the time. Yes, even if that means allowing them to make mistakes.
Spot on. And some things are easily reversible to the extent that they alleviate the downtime, yet still leave a large data sync or etl job to complete in their wake. The effect of which, until resolved, is continued loss of function or customer data at some lesser level of severity.
As a fairly regular consul cluster admin for the last 6 years or so, but not on that scale i can safely say that you generally have no idea if rolling back will work. I’ve experienced everything up to complete cluster collapses before. I spent an entire night blasting and reseeding a 200 node cluster once after a well tested forward migration went into a leadership battle it never resolved. Even if you test it before that’s no guarantee it’ll be alright on the night.
Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.
I also run 3 small clusters of consul and I went ahead and read the raft paper[1] so I can debug consul election problems if it occurs.
Consul is awesome when it works, but when it breaks it can be hell to get it working again.
thankfully it usually works fine. I only had 1 outage and it fixed itself after restarting the service.
> so I can debug consul election problems if it occurs
Interestingly, reading this remind me of a HashiCorp Nomad marketing piece [1]:
> "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."
I was always thinking "but what if something goes wrong? just call HashiCorp engs?" :p
That seems to be a general problem with these types of solutions. You have the exact same issue with something like ZooKeeper. It awesome when it works, but good luck trying to figure out why it's broken.
Just the author of the previous post relying on these types of services is something that can keep me up at night.
> Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.
Consul (and Vault) for sure a complex softwares that 99% of the time "just work", but when they fail they can fail big time, I concur. But calling it a mortal risk seems a bit far fetched in my opinion.
At first I thought it is a well-written post-mortem with proper root cause analysis. After reading it for the second time though, it doesn't sound like the root cause has been identified? At one point, they disabled streaming across the board, and the consul cluster started to become sort of stable. Is streaming to be blamed here? Why would streaming, an enhancement over the existing blocking query, which is read-only, end up causing "elevated write latency"? Why did some voter nodes encounter the boltdb freelist issue, while some other voter nodes didn't?
And there is still no satisfying explanation for this:
> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.
But I totally agree with you that the first thing they should look into is to rollback the 2 changes made to the traffic routing service the day before, as soon as they discovered that the consul cluster had become unhealthy.
"just roll back" gets risky when you roll back more than a few hours in many cases.
Frequently the feature you want to roll back now has other services depending on it, has already written data into the datastore that the old version of the code won't be able to parse, has already been released to customers in a way that will be a big PR disaster if it vanishes, etc.
Many teams only require developers to maintain rollback ability for a single release. Everything beyond that is just luck, and there's a good chance you're going to be manually cherry picking patches and having to understand the effects and side effects of tens of conflicting commits to get something that works.
The post indicates they'd been rolling it out for months, and indicate the feature went live "several months ago".
With the behaviour matching other types of degradation (hardware), it's entirely reasonable that it could have taken quite a while to recognise that software and configurations that have proven stable for several months, that is still there working, wasn't quite so stable as it seemed.
Right, but it only went live on the DB that failed the day before. Obviously, hindsight is 20/20, but it's strange that the oversight didn't rate a mention in the postmortem.
- The write up is amazing. There is a great level of detail.
- When they had the first indication of a problem, instead of looking if the problem was the hardware (disk I/O, etc.) the team went full cattle/cloud: bring down the node, launch a new one. Apparently that cost them a few hours. We would probably have done the same but I wonder if there's a lesson there.
- The obvious thing to do was revert configs. It is very strange that it took so long to revert. After being down for hours and having no idea what gives, it's the reasonable thing to try.
- The problem was consul. But consul is a key component and Roblox seem to be running a fairly large infrastructure. The company's valuation is sky-high, I assume the infra team is quite large. Consul is an open source project. Wouldn't make sense instead of relying on hashicorp so heavily to bring-in or train ppl around consul internals at this point? (maybe not possible/feasible/optimal, just wondering)
Would be a nice touch to check if bbolt has the bug and possibly push a fix. That said, the post-mortem is state-of-art. Way better than anything we've seen from much much bigger companies.
Honestly I would guess part of it is that streaming is supposed to be a performance increase. So during a performance related outage, it might be easy to overlook. Am I really going to turn off a feature that I think is actually helping the problem?
If that feature was the one most recently deployed or updated? Yes, if possible. That could be a big if, though right? Maybe rolling back such a change isn't trivial, or imposes other costs to returning to service that are more expensive than simply working through the problem.
The htop screenshot was an immediate, appropriately-colored red flag for me: that much red (kernel time) on the CPU utilization bars for a system running etcd/consul is not right in my experience.
It's a spicy read. Really could have happened to anyone. All very reasonable assumptions and steps taken. You could argue they could have more thoroughly load tested Consul, but doubtful any of us would have done more due diligence than they did with the slow rollout of streaming support.
(Ignoring the points around observability dependencies on the system that went down causing the failure to be extended)
The main mistake IMO is that, the day before the outage, they made a significant Consul-related infra change. Then they have this massive outage, where Consul is clearly the root cause, but nobody ever tries rolling that recent change back? That’s weird.
The outage occurring could certainly happen to anyone, but it taking 72 hours to resolve seems like a pretty fundamental SRE mistake. It’s also strange that “try rollbacks of changes related to the affected system” isn’t even acknowledged as a learning in their learnings/action items section.
It's possible they deal with so much load that they considered a day's worth of traffic to be sufficient load testing:
> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.
And a short note later on how much load their caching system sees:
> These databases were unaffected by the outage, but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation, was unhealthy.
That doesn't sound accurate. Wasn't the major change they ended up rolling back Consul streaming, which they'd enabled months before, and had been slowly rolling out?
Right, but the day before the outage, they enabled streaming for a service that didn't have it turned on. That's a discrete config change, the day before the outage.
> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%
So they rolled out a pretty significant Consul related change the day before their massive Consul outage began. They’d been doing a slow rollout, but ramping it up a bunch is a significant change.
Admittedly this is armchair architecture talk, but it seems like either consul or Roblox's use of Consul is falling into a CAP-trap: they are using a CP system when what they need is an eventually-consistent AP system. Granted, the use of consul seems heterogenous, but it seems like the main root cause was service discovery. And service discovery loves stale data.
Service discovery largely doesn't change that often. Especially in an outage where a lot of things that churn service discovery are disabled (e.g. deploys), returning stale responses should work fine. There's a reason DNS works this way - it prioritizes having any response, even if stale, since most DNS entries don't change that frequently. That said, DNS is not a great service discovery mechanism for other reasons. Not sure if there's an off-the-shelf solution that relies more on fast invalidation rather than distributed consistent stores.
Good catch. If Roblox only uses consul for service discovery, things should continue to work, just slowly degrade over the hours/days. There should at least be one consul agent running on each physical hosts, and this consul agent has cache and can continue to provide service discovery functionality with stale data.
Dissecting this paragraph from the post-mortem...
> When a Roblox service wants to talk to another service, it relies on Consul to have up-to-date knowledge of the location of the service it wants to talk to.
OK.
> However, if Consul is unhealthy, servers struggle to connect.
Why? The local "client-side" consul agents running on each hosts should be the authoritative source for service discovery, not the "server-side" consul agents running on the 5 voter nodes.
> Furthermore, Nomad and Vault rely on Consul, so when Consul is unhealthy, the system cannot schedule new containers or retrieve production secrets used for authentication.
Now that's one very bad setup, similar to deploying all services in a single k8s cluster.
Didn’t realize consul had that. Seems like the right approach - though I wonder why Roblox wasn’t using it.
Fwiw I believe kubernetes did this right - if you shoot the entire set of leaders, nothing really happens. Yes if containers die they aren’t restarted and things that create new pods (eg cron jobs) won’t run, but you don’t immediately lose cluster connectivity or the (built-in) service discovery. Not to say you can survive az failures or the like - or that kubernetes upgrades are easy/fun.
And don’t run dev stuff in your prod kube cluster. Just…don’t.
Their comment implies "are totally fine with stale data".
Their argument is that the membership set for a service (especially on-prem) doesn't change all that frequently, and even if it's out of date, it's likely that most of the endpoints are still actually servicing the thing you were looking for. That plus client retries and you're often pretty good.
Maybe I'm just working on an idiosyncratic version of the service discovery problem, but "stale data" is basically my bête noire. Part of it is that I don't control all my clients, and I can't guarantee they have sane retry logic; what service discovery tells them is the best place to go had better be responsive, or we're effectively having an outage. For us, service discovery is exquisitely sensitive to stale data.
I'm not saying I totally agree with the original comment there, just confirming that they meant it.
If you own your clients, sometimes you can say "it's on you to retry" (deadlines and retries are effectively mandatory, and often automatic, at Google). Having user facing services hand out bad addresses / endpoints would be really bad.
However, even for things like databases, you really want to know who the leader / primary is (and it's not really okay to get a stale answer).
So I dunno, some things are just fine with it, and some aren't. It's better if it just works :). Besides, if the data isn't changing, the write rate isn't high!
Yes - exactly what boulos said - I’m coming from the Google “you control your clients” perspective. That said, in some sense you always control your clients - you can always set up localhost proxies that speak your protocol or just tcp proxies.
The thing is service discovery is _always_ inconsistent. By the time you get your set of endpoints from your discovery, it can be out of date by the time you open your socket. Certainly for something like databases you need followers or standby leaders to reject writes from clients - service discovery can’t 100% save you here.
Super interesting. A place where ipvs or ebpf rules per-host for the discovery of services seems much more resilient than this heavy reliance on a functional consul service. The team shared a great postmortem here. I know the feeling well of testing something like a full redeploy and seeing no improvement…easy to lose hope at that point. 70+ hours of a full outage, multiple failed attempts to restore, has got to result in a few grey hairs worth of stress. Well done to all the sre, frontline, support engineers, devs, and whoever else rolled up their sleeves and got after it. The lessons learned here could only have been learned in an infra this big.
The BoltDB issue seems like straight up bad design. Needing a freelist is fine, needing to sync the entire freelist to disk after every append is pants on head.
BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.
LMDB uses a regular bucket for the freelist whereas Bolt simply saved the list as an array. It simplified the logic quite a bit and generally didn't cause a problem for most use cases. It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers.
I, for one, appreciate you owning this. It takes humility and strength of character to admit one's errors. And Heaven knows we all make them, large and small.
I also appreciate the honesty, but I don't see the error in the author, quite the opposite.
Afaiu, Bolt is a personal OSS project, github repo is archived with last commit 4 years ago, and the first thing you see in the readme is the "author no longer has time nor energy to continue".
Commercial cash cows like Roblox (a) shouldn't expect free labor and (b) should be wise enough to recognize tech debt or immaturity in their dependencies. Heck, even as a solo dev I review every direct dependency I take on, at least to a minimal level.
I can't speak to the incident response as I'm not an sre, but as a dev this screams of fragile "ship fast" culture, despite all the back patting in the post. I'm all for blameless postmortems, but a culture of rigor is a collective property worthy of attention and criticism.
I think the design choice is mine to own but, as with most OSS software, liability rests on the end user. It always sucks to see a bug cause so much grief to other folks.
As for HashiCorp, they're an awesome group of folks. There are few developers I esteem higher than their CTO, Armond Dadger. Wicked smart guy. That all being said, there's a lot of moving parts and sometimes bugs get through. ¯\_(ツ)_/¯
Consul is much older than 4 years old (public availability in 2014; 1.0 release in 2017, with a lot of sites using 0.x in production long before). And the fact that they didn't encounter this pathological case until Q4 2021 tells us that they got a lot of useful life out of BoltDB. They also were planning to switch over to bbolt back in 2020[1].
The developers at Hashicorp are top-tier, and this doesn't substantially change their reputation in my eyes. Hindsight is always 20/20.
Let's end this thread; blaming doesn't help anyone.
I share the sentiment, but not for Roblox. Hashicorp, with a recent IPO, 200 mil operating revenue, and supposedly a good engineering reputation has one of its flagship products critically depend on a "toy project".
How does this happen so often? It's awesome to get the authors take on things. Also thank you for explaining and owning it. Where you part of this incident response?
For now this is (relatively) easy since bindings for GoLang, Rust NodeJS/Deno, etc are available and the API is mostly the same in general.
---
The ideas that MDBX uses to solve these issues are simple: zero-cost micro-defragmentation, coalescing short GC/freelist records, chunking too long GC/freelist records, LIFO for GC/freelist reclaiming, etc.
Many of the ideas mentioned seems simple to implement in BoldDB. However the complete solution is not documented and too complicated (in accordance with the traditions inherited from LMDB ;)
Having written a commercial memory allocator a quarter century ago, I remember dealing with freelists, and decided they were too much of a pain to manage if fragmentation got out of control. I chose a different architecture that was less fragile under load. Interesting that this can still be an issue even on today's hardware.
It's also interesting how much a tiny detail can derail a huge organization. My former employer lost all services worldwide because of a single incorrect routing in a DNS server.
I can’t remember off the top of my head. We had an issue where every couple of months the database would quickly grow to consume the entire disk. We checked that there were no long running read transactions and did some other debugging but couldn’t figure it out. Swapping out for Bolt seemed to fix that issue.
I haven’t heard of anyone else with the same issue since then so I assume it’s probably fixed.
I had a few folks offer to sponsor at individual levels but no corporate sponsorship except Shopify paying my salary in the early days.
That being said, I gave away the software for free so I don’t have any expectation of payment. I agree the ecosystem is broken but I don’t know how to fix it or even if it can be fixed.
This is a great post-mortem - thank you to the Roblox engineering team for being this transparent about the issue and the process you took to fix it. It couldn't have been easy and it sounds like it was a beast to track down (under pressure no less). gg
> On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%.
Seems like the smoking gun, this should have been identified and rolled back much earlier.
If reading a postmortem makes the smoking gun obvious, then the postmortem is doing its job. Don't mistake the amount of investigation that goes into a postmortem for the available information and mental headspace during an outage.
I've been in my fair share of incidents so I'm aware of how they work. But they knew it was an issue related to Consul within hours. It shouldn't take more than two days before they check for recent deployments made to Consul.
In my experience, there's typically more than one "smoking gun". The problem isn't finding one, it's eliminating all of the "smoking guns" that aren't actually related to the outage.
If I worked at an organization with many teams deploying updates multiple times per day and several same day events seemed related, I would probably also put less weight on a gradual, months-long deployment that had completed a day prior.
It's obvious when it's pointed out in an article like this. It's less clear when it's one of many changes that could have been happening in a day, and it was an operation that was considered "safe" given that it had been done multiple times for other services in the preceding months.
">circular dependencies in our observability stack"
This appears to be why the outage was extended, and was referenced elsewhere too. It's hard to diagnose something when part of the diagnostic tool kit is also malfunctioning.
aaaalllllllll the way down at the bottom is this gem:
>Some core Roblox services are using Consul’s KV store directly as a convenient place to store data, even though we have other storage systems that are likely more appropriate.
Yeah, don't use consul as redis, they are not the same.
But you can... which is what some engineers were thinking. In my experience they do this because:
A) they're afraid to ask for permission and would rather ask for forgiveness
B) management refused to provision extra infra to support the engineers need, but they needed to do this "one thing" anyways
C) security was lax and permissions were wide open so people just decided to take advantage of it to test a thing that then became a feature and so they kept it but "put it on the backlog" to refactor to something better later
It seems that Consul does not have the ability to use the newer hashmap implementation of freelist that Alibaba implemented for etcd. I cannot find any reference to setting this option in Consul's configuration.
Unfortunate, given it has been around for a while.
Just to be clear, we are talking about this item from the post-mortem right?
> We are working closely with HashiCorp to deploy a new version of Consul that replaces BoltDB with a successor called bbolt that does not have the same issue with unbounded freelist growth.
EDIT: I see what you mean. The freelist improvement has to be enabled by setting the `FreelistType` config to "hashmap" (default is "array"). Indeed it doesn't look like consul has done that...
I think we're still talking about different things, but that is a good move on their part regardless. :)
I mean the optional called `FreelistType` has a new option called `FreelistMapType` and the default is `FreelistArrayType`. There is no option in Consul from what I can tell to configure that option. They did have to upgrade from the old boltdb code to the etcd boltdb code to do this though.
I have this little idea I think about called the "status update chain". When I worked in small organizations and we had issues the status update chain looked like this: ceo-->me, as the organizations got larger the chain got longer first it was ceo-->manager-->me then ceo-->director-->manager-->me and so on. I wonder how long the status update chains are at companies like this? How long does at status update take to make it end to end?
If the situation is serious enough, you'll have several layers sitting together at the status update meetings to hear it straight from the dog's mouth.
Both directions, he is asking "What is going on" and I am telling him. As the org gets larger the request to know what is going on passes down the chain and the reply passes back up.
Usually there’s a central place where status is being updated and shared by everyone (a Slack channel for example) and everyone in the chain can just read/ping/respond there. Less of a chain.
For me as a roblox user/programmer, the most annoying part of this was,
that their desktop development tools refused to run during this outage,
because they insist on "phoning home" when you launch them.
It is annoying, because the tools actually run perfectly fine on a local desktop,
once you are past the "mothership handshake".
I spent that week reading roblox dev documentation instead.
Wow, that's a very bad trend we see emerging in those past years. Did you have any chance to investigate the request at play and whether you could impersonate it via DNS on your local network (or if on the opposite, TLS certificates were stapled into the app)?
Also, i'm curious about your experiences with Roblox. I've only heard about it from these HN threads (no i don't know a single person using it) so if you have feedback to share regarding how to program it and how it compares to a modern game engine/editor like Godot, i'm all ears. Also, if you know of a free-software "alternative" to Roblox ; i'm amazed we run proprietary software in the first place, worried when it doesn't run because it can't phone home, but i'm actually ashamed we end up *producing* (eg. developing) content with proprietary tools that these companies can take away from us any minute.
Slightly offtopic; “the team decided to replace all the nodes in the Consul cluster with new, more powerful machines”. How do teams usually do this quickly? Is it a combination of Terraform to provision servers and something like Ansible to install and configure software on it?
Totally depends on how “disciplined” the team’s DevOps practices are. In theory it should be as easy as updating a config parameter as you say, but my experience tells me that it’s sometimes not the case.
Especially with these kind of fundamental, core services such as Consul provides, it’s not unheard of to have templates with static machine allocations (as opposed to everything in a single auto-scaling group). It’s a bit of a shortcut, but it’s often a bit hairy to implement these services using true auto-scaling.
Having said all this, doing these types of migrations when things are already completely broken / on fire makes things a lot easier: you don’t care about downtime. So then it can be as simple as restarting all instances using a new instance type, downtime be damned.
I still don't understand how the elevated Consul latency ended up bringing all of the fleet to halt, fail health checks and drop user traffic. I guess use cases calling consul directly (e.g. service discovery) or indirectly (e.g. vault) could not tolerate having stale reads or sticking with what they've read? If anyone can shed a light on this, I appreciate.
IMO looking at the root causes here isn't that helpful. Software is complicated and there will always be some unknown bottleneck or bug lurking to knock you over on a bad day. The important lessons here are about:
* How their system architecture made them particularly vulnerable to this kind of issue
* Their actions to diagnose and attempt to mitigate the issue
* The whole later part about effectively cold-starting their entire infrastructure, all while millions of users were banging on their metaphorical door to start using the service again.
I think this outage was made worse by them not being properly in a big cloud provider.
In a cloud provider, having a few people working simultaneously on spinning up instances with different potential fixes, running different tests, and then directing all traffic to the first one that works properly is a viable path to a solution.
When you have your own hardware, you can really only try one thing at a time.
> When you have your own hardware, you can really only try one thing at a time.
How so? What would prevent you from hiring 5-10 people for Ops heavy stuff and getting a bit more hardware resources and doing those things in staging environments with load tests and whatnot? I mean, isn't that how you should do things, regardless of where your infra and software is?
If you own your own hardware, for a given service you probably have enough hardware for the production workload, plus maybe 50% more for dev, test, staging, experiments, etc. All those other environments will probably be scaled down versions. Sure, they can be used in an emergency situation, but they can't withstand the full production load, and anyway they're likely on a separate physical hardware and network (usually you want good isolation between production and test environments).
If you use AWS, then you probably on average use the same day to day, but in an emergency you can spin up 5 versions of full production scale to test 5 things at once, and just edit a configfile to direct production traffic to any.
Tldr: We made a single point of failure, then we made it super reliable, then the stuff it was doing to maintain itself made it slow itself down, then our single point of failure took down our service.
Would be interesting to compare this result to the classic paper on Tandem failures:
A. Thakur, R. K. Iyer, L. Young and I. Lee, "Analysis of failures in the Tandem NonStop-UX Operating System," Proceedings of Sixth International Symposium on Software Reliability Engineering. ISSRE'95, 1995, pp. 40-50, doi: 10.1109/ISSRE.1995.497642.
Sounds like they didn't check what had changed first, before starting to fix things with best guesses ... not saying I wouldn't do the same, but arguably lost them a lot of time.
Was on a call with a bank VP that had moved to AWS. Asked how it was going. Said it was going great after six months but just learning about availability zones so they were going to have to rework a bunch of things.
Astonishing how our important infrastructure is moved to AWS with zero knowledge of how AWS works.
Most startups I've worked at literally have a script to deploy their whole setup to a new region when desired. Then you just need latency-based routing running on top of it to ensure people are processed in the closest region to them. Really not expensive. You can do this with under $200/month in terms of complexity and the bandwidth + database costs are going to be roughly the same as they normally are because you're splitting your load between regions. Now if you stupidly just duplicate your current infrastructure entirely, yes it would be expensive because you'd be massively overpaying on DB.
In theory the only additional cost should be the latency-based routing itself, which is $50/month. Other than that, you'll probably save money if you choose the right regions.
Are the same instance sizes available in all regions?
Are there enough instances of the sizes you need?
Do you have reserved instances in the other region?
Are your increased quotas applied to all regions?
What region are your S3 assets in? Are you going to migrate those as well?
Is it acceptable for all user sessions to be terminated?
Have you load tested the other region?
How often are you going to test the region fail over? Yearly? Quarterly? With every code change?
What is the acceptable RTO and RPO with executives and board-members?
And all of that is without thinking about cache warming, database migration/mirror/replication, solr indexing (are you going to migrate the index or rebuild? Do you know how long it takes to rebuild your solr index?).
The startups you worked at probably had different needs the Roblox. I was the tech leach on a Rails app that was embedded in TurboTax and QuickBooks and was rendered on each TT screen transition and reading your comment in that context shows a lot of inexperience in large, production systems.
A lot of this can also be mitigated by going all in on API gateway + Lambda, like we have at Arist. We only need to worry about DB scaling and a few considerations with S3 (that are themselves mitigated by using CloudFront).
Are you implying that Roblox should move their entire system to the API Gateway + Lambda to solve their availability problems?
Seriously though, what is your RTO and RPO? We are talking systems that when they are down you are on the news. Systems where minutes of downtime are millions of dollars. I encourage you to setup some time with your CTO at Arist and talk through these questions.
1. When a company of Robolox's size is still in single-region mode by the time they've gone public, that is quite a red flag. As you and others have mentioned, game servers have some unique requirements not shared by traditional web apps (everyone knows this), however Roblox's constraints seem to be self-imposed and ridiculous considering their size. It is quite obvious they have very fragile and highly manual infrastructure, which is dangerous after series A, nevermind after going public! At this point their entire infrastructure should be completely templated and scripted to the point where if all their cloud accounts were deleted they could be up and running within an hour or two. Having 18,000 servers or 5 servers doesn't make much of a difference -- you're either confident you can replicate your infrastructure because you've put in the work to make it completely reproducible and automated, or you haven't. Orgs that have taken these steps have no problem deploying additional regions because they have tackled all of those problems (db read clones, latency-based routing, consistency, etc) and the solutions are baked into their infrastructure scripts and templates. The fact that there exists a publicly traded company in the tech space that hasn't done this shocks me a bit, and rightly so.
2. I mentioned API Gateway and Lambda because OP asked if in general it is difficult to go multi-region (not specifically asking about Roblox), and most startups, and most companies in general, do not have the same technical requirements in terms of managing game state that Roblox has (and are web app based), and thus in general doing a series of load balancers + latency based routing or API Gateway + Lambda + latency based routing is good approach for most companies especially now with ala carte solutions like Ruby on Jets, serverless framework, etc. that will do all the work for you.
3. That said, I do think that we are on the verge of seeing a really strong viable serverless-style option for game servers in the next few years, and when that happens costs are going to go way way down because the execution context will live for the life of the game, and that's it. No need to over-provision. The only real technical limitation is the hard 15 minute execution time limit and mapping users to the correct running instance of the lambda. I have a side project where I'm working on resolving the first issue but I've resolved the second issue already by having the lambda initiate the connection to the clients directly to ensure they are all communicating with the same instance of the lambda. The first problem I plan to solve by pre-emptively spinning up a new lambda when time is about to run out and pre-negotiate all clients with the new lambda in advance before shifting control over to the new lambda. It's not done yet but I believe I can also solve the first issue with zero noticable lag or stuttering during the switch-over, so from a technical perspective, yes, I think serverless can be a panacea if you put in the effort to fully utilize it. If you're at the point where you're spinning up tens of thousands of servers that are doing something ephemeral that only needs to exist for 5-30 minutes, I think you're at the point where it's time to put in that effort.
4. I am in fact the CTO at Arist. You shouldn't assume people don't know what they're talking about just because they find the status quo of devops at [insert large gaming company here] a little bit antiquated. In particular, I think you're fighting a losing battle if you have to even think about what instance type is cheapest for X workload in Y year. That sounds like work that I'd rather engineer around with a solution that can handle any scale and do so as cheaply as possible even if I stop watching it for 6 months. You may say it's crazy, but an approach like this will completely eat your lunch if someone ever gets it working properly and suddenly can manage a Roblox-sized workload of game states without a devops team. Why settle for anything less?
5. Regarding the systems I work with -- we send ~50 million messages a day (at specific times per day, mostly all at once) and handle ~20 million user responses a day on behalf of more than 15% of the current roster of fortune 500 companies. In that case, going 100% lambda works great and scales well, for obvious reasons. This is nowhere near the scale Roblox deals with, but they also have a completely different problem (managing game state) than we do (ensuring arbitrarily large or small numbers of messages go out at exactly the right time based on tens of thousands of complex messaging schedules and course cadences)
Anyway, I'm quite aware devops at scale is hard -- I just find it puzzling when small orgs have it perfectly figured out (plenty of gaming startups with multi-region support) but a company on the NYSE is still treating us-east-1 or us-east-2 like the only region in existence. Bad look.
Also, still sounding like you don’t understand how large systems like Roblox/Twitter/Apple/Facebook/etc are designed, deployed, and maintained-which is fine; most people don’t–but saying they should just move to llamda shows inexperience in these systems. If it is "puzzling" to you, maybe there is something you are missing in your understanding of how these systems work.
Correctly handling failure edge cases in a active-active multi-region distributed database requires work. SaaS DBs do a lot of the heavy lifting but they are still highly configurable and you need to understand the impact of the config you use. Not to mention your scale-up runbooks need to be established so a stampede from a failure in one region doesn't cause the other region to go down. You also need to avoid cross-region traffic even though you might have stateful services that aren't replicated across regions. That might mean changes in config or business logic across all your services.
It is absolutely not as simple as spinning up a cluster on AWS at Roblox's scale.
Roblox is not a startup, and has a significant sized footprint (18,000 servers isn't something that's just available, even within clouds. They're not magically scalable places, capacity tends to land just ahead of demand). It's not even remotely a simple case of just "run a script and wee we have redundancy" There are lots of things to consider.
18k servers is also not cheap, at all. They suggest at least some of their clusters are running on 64 cores, some on 128. I'm guessing they probably have a fair spread of cores.
Just to give a sense of cost, AWS's calculator estimates 18,0000 32 core instances would set you back $9m per month. That's just the EC2 cost, and assuming a lower core count is used by other components in the platform. 64 core would bump that to $18m. Per month. Doing nothing but sitting waiting ready. That's not considering network bandwidth costs, load balancers etc. etc.
When you're talking infrastructure on that scale, you have to contact cloud companies in advance, and work with them around capacity requirements, or you'll find you're barely started on provisioning and you won't find capacity available (you'll want to on that scale anyway because you'll get discounts but it's still going to be very expensive)
This was in reply to OP who said deploying to a new region is insanely complicated. In general it is not. For Roblox, if they are manually doing stuff in EC2, it could be quite complicated.
So Roblox need a button to press to (re)deploy 18,000 servers and 170,000 containers? They already have multiple core data centres, as well as many edge locations.
You will note the problem was with the software provided and supported by Hashicorp.
> It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage
Not sure I agree. Yes, network costs are higher, but your overall costs may not be depending on how you architect. Independent services across AZs? Sure. You'll have multiples of your current costs. Deploying your clusters spanning AZs? Not that much - you'll pay for AZ traffic though.
The usual way this works (and I assume this is the case for Roblox) is not by constructing buildings, but by renting space in someone else's datacentre.
Pretty much every city worldwide has at least one place providing power, cooling, racks and (optionally) network. You rent space for one or more servers, or you rent racks, or parts of a floor, or whole floors. You buy your own servers, and either install them yourself, or pay the datacentre staff to install them.
Yes. If you are running in two zones in the hopes that you will be up if one goes down, you need to be handling less than 50% load in each zone. If you can scale up fast enough for your use case, great. But when a zone goes down and everyone is trying to launch in the zone still up, there may not be instances for you available at that time. Our site had a billion in revenue or something based on a single day, so for us it was worth the cost, but it not easy (or at least it wasn't at the time).
How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.
Right, outages get more expensive the larger you grow. What else needs to be thought of is not just the loss of revenue for the time your service is down but also it's affect on user trust and usability. Customers will gladly leave you for a more reliable competitor once they get fed up.
There are definitely cost and other considerations you have to think about when going multi-AZ.
Cross-AZ network traffic has charges associated with it. Inter-AZ network latency is higher than intra-AZ latency. And there are other limitations as well, such as EBS volumes being attachable only to an instance in the same AZ as the volume.
That said, AWS does recommend using multiple Availability Zones to improve overall availability and reduce Mean Time to Recovery (MTTR).
(I work for AWS. Opinions are my own and not necessarily those of my employer.)
This is very true, the costs and performance impacts can be significant if your architecture isn't designed to account for it. And sometimes even if it is.
In addition, unless you can cleanly survive an AZ going down, which can take a bunch more work in some cases, then being multi-AZ can actually reduce your availability by giving more things to fail.
AZs are a powerful tool but are not a no-brainer for applications at scale that are not designed for them, it is literally spreading your workload across multiple nearby data centers with a bit (or a lot) more tooling and services to help than if you were doing it in your own data centers.
Data Transfer within the same AWS Region
Data transferred "in" to and "out" from Amazon EC2, Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon ElastiCache instances, Elastic Network Interfaces or VPC Peering connections across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction.
Wrong. Depends on the use case AWS can be very cheap.
> splitting amongst AZ's is of no additional cost.
Wrong.
"
across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction. Effectively, cross-AZ data transfer in AWS costs 2¢ per gigabyte and each gigabyte transferred counts as 2GB on the bill: once for sending and once for receiving."
Availability Zones aren't the same thing as regions. AWS regions have multiple Availability Zones. Independent availability zones publishes lower reliability SLAs so you need to load balance across multiple independent availability zones in a region to reach higher reliability. Per AZ SLAs are discussed in more detail here [1]
(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
> (N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
What he said was perfectly cogent.
Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.
Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
So, if you span multiple availability zones, you are not spared from events that will impact all of them.
> Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.
It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up for longer. AWS clearly states in their SLA pages what their EC2 instance SLAs are in a given AZ; it's 99.5% availability for a given EC2 instance in a given region and AZ. This is roughly ~1.82 days, or ~ 43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ then your system has a 99.5% availability SLA. Remember the cloud is all about leveraging large amounts commodity hardware instead of leveraging large, high-reliability mainframe style design. This isn't a secret. It's openly called out, like in Nishtala et al's "Scaling Memcache at Facebook" [1] from 2013!
The background of all of this is that it costs money, in terms of knowledgable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple days a year. But if you want to design high reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.
If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.
During a recent AWS outage, the STS service running in us-east-1 was unavailable. Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.
For high availability, STS offers regional endpoints -- and AWS recommends using them[1] -- but the SDKs don't use them by default. The author of the client code, or the person configuring
the software, has to enable them.
The client code which defaults to STS in us-east-1 includes the AWS console website, as far as I can tell.
Real question, though - are those genuinely separate endpoints that remained up and operational during the outage? I don’t think I saw or knew a single person unaffected by this outage, so either there’s still some bleed over on the backend or knowledge of the regional STS endpoints is basically zero (which I Can believe, y’all run a big shop)
My team didn't use STS but I know other teams at the company did. Those that did rely on non-us-east-1 endpoints did stay up IIRC. Our company barely use the AWS console at all and base most of our stuff around their APIs to hook into our deployment/CI processes. But I don't work at AWS so I don't know if it's true or if there was some other backend replication lag or anything else going on that was impacted by us-east-1 being down. We had some failures for some of our older services that were not properly sharded out, but most of our stuff failed over and continued to work as expected.
> Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
That's not true. STS offers regional endpoints, for example if you're in Australia and don't want to pay the latency cost to transit to us-east-1 [1]. It's up to the user to opt into them though. And that goes back to what I was saying earlier, you need engineers willing to read their docs closely and architect systems properly.
> knowledgable engineers (not like the kinds in this comment thread who are conflating availability zones and regions)
I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
That is, I've read the comments to say "they're not only in different AZ's, they're in different regions". It seems you seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
> Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
> That is, I've read the comments to say "they're not only in different AZ's, they're in different regions"
So I've read. The earlier example about STS that someone brought up was incorrect; both I and another commenter linked to the doc with the correct information.
> It seems you seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day. When I see people here make easily falsifiable comments full of hearsay ("I had a friend of a friend who works at Amazon and they did X, Y, Z bad things") and use that to drum up a frenzy, it flies in the face of what I do everyday. There's lots of issues with cloud providers as a whole and AWS in particular but to get to that level of conversation you need to understand what the system is actually doing, not just get angry and guess why it's failing.
> > being in a different region implies being in a different availability zone.
> Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
Right.. So if you are in a different region, you are by definition in a different availability zone.
> You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Yah, I really thought about it and you're just reeking unkindness. And the people above that you're replying to and mocking are not wrong.
> Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day.
If you're unable to be civil about this, maybe you should avoid the threads. Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
I've got >20 years of experience in building geographically distributed, sharded, and consensus-based systems. I think you are being unfair to the people you're discussing with. Be nice.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions).
there is a distinction between azs within a region vs azs in different regions. the overwhelming majority of services are offered regionally and provide slas at that level. services are expected to have entirely independent infrastructure for each region, and cross-regional/global services are built to scope down online cross regional dependencies as much as possible.
the specific example brought up (cross regional sts) is wrong in the sense that sts is fully regionalized as evidenced by the overwhelming number of aws services that leverage sts not having a global meltdown. but as others mentioned in a lot of ways it’s even worse because customers are opted into the centralized endpoint implicitly.
> If you're unable to be civil about this, maybe you should avoid the threads.
I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heed when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway point noted and I'll try to keep my snark down.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage, SLA-impacting or not, occurs, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things that go wrong which cause a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures and the defensive guarantees that failed in order show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.
> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Sure if you're not operating high reliability services at high scale, it's true, you don't need cross-AZ or cross-region failover. But if you chose, through balance sheet or ignorance, not to take advantage of AWS's reliability features then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.
> I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something?
... I still don't think your overall starting assertions about the other people not understanding regions vs. AZs is correct, and it triggered you to repeatedly assert that the people you were talking to are unskilled.
I could very easily use the same words as them, and I have decade-old spreadsheets where I was playing with different combinations of latencies for commits and correlation coefficients for failures to try and estimate availability.
> Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
I remember 2011, where EBS broke across all US-EAST AZs and lots of control plane services were impacted and you couldn't launch instances across all AZs in all regions for 12 hours.
Now maybe you'll be like "pfft, a decade ago!". I do think Amazon has significantly improved architecture. At the same time, AZs and regions being engineered to be independent doesn't mean they really are. We don't attain independent, uncorrelated failures on passenger aircraft, let alone these more complicated, larger, and less-engineered systems.
Further, even if AWS gets it right, going multi-AZ introduces new failure modes. Depending on the complexity of data model and operations on it, this stuff can be really hard to get right. Building a geographically distributed system with current tools is very expensive and there's no guarantee that your actual operational experience will be better than in a single site for quite some time of climbing the maturity curve.
> Their guarantees are written on their SLA pages.
Yup, and it's interesting to note that their thresholds don't really assume independence of failures. E.g. .995/.990/.95 are the thresholds for instances and .999/.990/.950 are the uptime thresholds for regions.
If Amazon's internal costing/reliability engineering model assumed failures would be independent, they could offer much better SLAs for regions safely. (e.g. back of the envelope, 1- (.005 * .005) * 3C2 =~ .999925 ) Instead, they imply that they expect multi-AZ has a failure distribution that's about 5x better for short outages and about the same for long outages.
And note there's really no SLA asserting independence of regions... You just have the instance level and region level guarantees.
Further, note that the SLA very clearly excludes some causes of multi-AZ failures within a region. Force majeure, and regional internet access issues beyond the "demarcation point" of the service.
Yes, but the underlying point you're willfully missing is:
You can't engineer around AWS AZ common-mode failures using AWS.
The moment that you have failures that are not independent and common mode, you can't just multiply together failure probabilities to know your outage times.
Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
It's not. .1% of 36524 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!
They only refund 100% when they fall below 95% of availability! 95-99= 30%. I believe the real target is above 99.9% though, as that results in 0 refund to the customer. What that means is, 3 days of downtime is acceptable!
Alternatively, you can return to your own datacenter and find out first hand that it's not particularly as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdowns.
Anywho, they have a lot more room in their published SLA's than you think.
Edit: as someone correctly pointed out i did a typo in my math. it is only ~9 hours of aloted downtime. Keeping in mind that this is per service though - meaning each service can have a different 9 hours of downtime before they need to pay out 10% of that one service. I still stand by my statement thier SLA's have a lot of wiggle room that people should take more seriously.
Your computation is incorrect, 3 days out of 365 is 1% of downtime, not 0.1%. I believe your error stems from reporting .1% as 0.1. Indeed:
0.001 (.1%) * 8760 (365d*24h) = 8.76h
Alternatively, the common industry standard in infrastructure (the place I work at at least,) is 4 nines, so 99.99% availability, which is around 52 mins a year or 4 mins a month iirc. There's not as much room as you'd think! :)
> Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.
>> you need to load balance across multiple independent availability zones
The only problem with that is, there are no independent availability zones.
What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split brain scenario, and then, half the internet goes down.
> The only problem with that is, there are no independent availability zones.
There are - they can be as independent as you need them to be.
Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.
I'm not talking about my global app. I'm talking about the system I deploy to, the actual plumbing, and how a huge turd in a western toilet causes east's sewerage system to over-flow.
That's not how they work.
They exist, and work extremely well within their defined engineering / design goals. It's much more nuanced than 'everything works independently'.
Wouldn't it be possible to create fully independent zones with multiple cloud providers, like AWS, GCP, Azure? This is assuming that your workloads don't rely on proprietary services from a given provider.
Yes, and would also protect you from administrative outages like, "AWS shut off our account because we missed the email about our credit card expiring."
(But wouldn't protect you from software/configuration issues if you're running the same stack in every zone.)
There have been multiple discussions on HN about cloud vs not cloud and there are endless amount of opinions of "cloud is a waste blah blah".
This is exactly one of the reasons people go cloud. Introducing an additional AZ is a click of a button and some relatively trivial infrastructure as code scripting, even at this scale.
Running your own data center and AZ on the other hand requires a very tight relationship with your data center provider at global scale.
For a platform like Roblox where downtime equals money loss (i.e. every hour of the day people make purchases), then there is a real tangible benefit to using something like AWS. 72 hours downtime is A LOT, and we're talking potentially millions of dollars of real value lost and millions of potential in brand value lost. I'm not saying definitively they would save money (in this case profit impact) by going to AWS, but there is definitely a story to be had here.
> Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. We have efforts underway to move to multiple availability zones within these data centers; we have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.
If they were in AWS they could have used Consul across multi-AZs and done changes in a roll out fashion.
So that next time they can spend 96 hours on recovery, this time adding a split brain issue to the list of problems to deal with. Jokes aside, the write-up is quite good after after thinking about all the problems they had to deal with, I was quite humbled.
It doesn't really explain how they reached the conclusion that that would help. Like, yes, it's a problem that they had a giant Consul cluster that was a SPOF, but you can run multiple smaller Consul clusters within a single AZ if you want.
Honestly it reads to me like an internal movement for a multi-AZ deployment successfully used this crisis as an opportunity.
For example parts of AWS itself. us-east-1 having issues? Looks like aws console all over the world have issues.
You constantly hear about multi zone, region, cloud. But in practice when things break you hear all these stories of them running in a single region+zone
I would normally not call this out, but it is repeated so often in the text that it is jarring. Just call it "median" as it is everywhere else, please.
On the other hand, I must commend the author(s) for not using "based off of" :-)
Excellent write up. Reading a thorough, detailed and open postmortem like this makes me respect the company. They may have issues but it sounds like the type of company that (hopefully) does not blame, has open processes, and looks to improve - the type of company I'd want to work for!
The first video reveals a more general issue that is not specific to Roblox: child labor in the marketplace of monetized user generated content. There are plenty of under-18 YouTubers. It's not even just online content: these questions came up in the entertainment industry a long time ago, but in that industry at least some safeguards were put in place.
But do those other places pay the creators such small percentages, and also do everything in their power to avoid paying real $? As far as I know Youtube doesn't have their own currency.
The % cut of $revenue is outside the main scope of my comment, which concerned child labor, a somewhat independent issue. It doesn't matter too much to me the % monthly revenue a 12 year old kid gets, I'm more concerned with the promise of unlikely riches encouraging kids to work long hours outside the oversight of traditional child labor laws. If that 12 year old is putting in a 30 hour work weeks then I think it's problematic regardless of the revenue with some minimal enforceable guardrails. I don't think parental signoff is sufficient either: the entertainment industry has plenty of examples of how that can go wrong, and also how some of those minimum guardrails might work.
Do those other places host creators' applications for free regardless of scale, and without injecting ads into your entertainment?
Youtube doesn't have their own currency simply because they don't have to, not because they are kind. Their major source of income is ads and subscription. Neither of these needs their own currency. Highly unsure what your point is.
host creators' applications for free regardless of scale
Yes? Your Youtube video / livestream can have tens of thousands of simultaneous viewers across the globe.
> without injecting ads into your entertainment?
That's an odd complaint that hyper-specific to how Youtube monetizes content. Look at Steam, Play Store, AppStore, all those host your app/game for free, regardless of the scale, and only take a 30% cut.
> Highly unsure what your point is.
Fine, here's another way to explain it. The minimum amount to take out money in most places, including Youtube, is 100$. Why is it 1000$ in Roblox?
1. You are comparing interactive game servers to streaming service? Sure there are free CDN you can use, but are there free servers alternative you can use on the market regardless of scale and traffic? The pricing of these two services has fundamental difference on the market. If you know one of such service, name it. I would be glad to use it.
2. You are comparing Steam, Play Store, App store to Roblox? Are they providing free servers for multi-player experience? Does Steam let you host free servers? Sure they provide a way to download the app, but none of them provides free servers along with free network traffic. What makes you think they should be priced the same way when the services they provide are fundamentally different?
3. What makes you think $100 is reasonable, while $1000 is not reasonable? Every business has their own way for monetization. If you want to claim $100 is reasonable and should be the industry standard, what is your argument to support it? If Youtube is so kind, why don't they just not have any minimum take out amount in the first place, like most e-commerce companies do? Your questions are baseless in the first place by assuming there is certain non-existent rules that need to be followed. You can dislike it, but claiming $1000 limit it set to abuse child labor is completely un-based.
I'm not sure either. Particularly how, even in the presence of any policies, they could police the system: Does YouTube send someone around to check on the Ryan's World kid and the work environment?
Even with child labor policies it doesn't seem like platforms would be much better at managing them than content moderation.
A partial solution might be that games by developers without legal age verification can't utilize real money transactions via the Robux currency. That way Roblox isn't rewarded directly for it either.
Your solution to children who want to sell things on Roblox getting exploited by unscrupulous middlemen agencies is to make it impossible to bypass those agencies?
To add, there is a nice documentary here[1] which also has a followup[2] that show even more of the issue at hand. Kids making games and only getting 24.5% of the profit is one thing, but everything else that Roblox does is much worse.
The 24.5% cut is fine, you have to consider the 30% app store fees for a majority mobile playerbase, all hosting is free, moderation is a major expense, and engine and platform development.
Successful games subsidize everyone else, which is not comparable to Steam or anything else.
Collectible items are fine and can't be exchanged for USD, Roblox can't arbitrate developer disputes, "black markets" are an extremely tiny niche. A lot of misinformation.
It's annoying to see these videos brought up every single time Roblox is mentioned anywhere for these reasons. Part of the blame lies with Roblox for one of the worst PR responses I have seen in tech, I suppose.
> The 24.5% cut is fine, you have to consider the 30% app store fees for a majority mobile playerbase, all hosting is free, moderation is a major expense, and engine and platform development.
You have successfully made the case for a 45% fee and being considered approximately normal, or a 60% fee and being considered pretty high still. 75+% is crazy.
I can't think of any other platform with comparable expenses. Traditional game engines have the R&D component, but not moderation, developer services, or subsidizing games that don't succeed.
It helps that seriously marketing a Roblox game costs < $1k USD always, usually <
$200 USD. It's not easy to generate a loss, even when including production costs. That's the tradeoff.
I have less a problem with the cut, and more a problem with how they achieve it. It harkens back to company towns paying workers in company credit that is expensive to convert to USD.
This % includes cost of all game server hosting, databases, memory stores, etc. even with millions of concurrents, app store fees, etc. All included in that number. Developer gets effectively pure profit for the sole cost of programming/designing a great game. Taught me how to program, & changed my entire future. Disclosure: My game is one of most popular on the platform.
And that's a reasonable decision for an adult to make, and if they were targeting an adult developer community.
I don't think anyone objects to adults making that choice over say, using Unity or Unreal, and targeting other platforms.
In practice, explaining to my son who is growing into an avid developer why I won't a) help him build on Roblox, or b) fund his objectives of advertising and promoting his work in Roblox (by spending Roblox company scrip) on the platform has necessitated helping him to learn and understand what exploitation means and how to recognize it.
It's a learning experience for him, and a challenging issue for me as a technically proficient and financially literate parent who actually owns and run businesses related to intellectual property. It's got to be much more painful for parents who lack in any of those three areas.
Are you really suggesting that Roblox's cut should be lower purely because the target market is children? Why? If anything, the fact that a kid can code a game in a high-level environment and immediately start making money—without any of the normal challenges of setting up infrastructure, let alone marketing and discovery—is amazing, and a feat for which Roblox should definitely be rewarded.
In any case, what's the alternative? To teach your son how to build the game from scratch in Unity, spin up a server infrastructure that won't crumble with more than a few concurrent players (not to mention the cash required for this), figure out distribution, and then actually get people to find and play the game? That seems quite unreasonable for most children/parents.
If this were easy, a competitor would have come in and offered the same service with significantly lower fees.
Yes, I agree that the deception is a problem, although I admit I'm not well versed in the issue. (I'm watching the documentary linked elsewhere now.) But the original claim was that they were exploiting young developers by taking a big cut of revenues, which I disagree with.
> And that's a reasonable decision for an adult to make, and if they were targeting an adult developer community.
If it's a reasonable decision for an adult to make because the trade-offs might be worth it, doesn't that mean that it would also be reasonable for a child to make the same decision for the same reason?
It's either exploitative or it isn't, the age of the developer doesn't alter the trade-offs involved.
Western society says that some decisions are only able to be made by people who are old enough. If you think about other decisions like gambling at a casino, joining the army or purchasing alcohol, then it might help you understand where they're coming from.
Very cool, the Jailbreak creator! Do such popular games earn enough to be able to retire? (although you wouldn't actually retire, since working is more fun)
Again, as across-thread: this is a tangent unrelated to the actual story, which is interesting for reasons having nothing at all to do with Roblox (I'll never use it, but operating HashiStack at this scale is intensely relevant to me). We need to be careful with tangents like this, because they're easier for people to comment on than the vagaries of Raft and Go synchronization primitives, and --- as you can see here --- they quickly drown out the on-topic comments.
No matter what the cut is I think there are some legitimate social questions to ask about whether want young people to be potentially exposed to economic pressure to earn or whether we'd rather push back aggressively against youth monetization to preserve a childhood where, ideally, children get to play.
I know there are lots of child actors and plenty of household situations that make enjoying childhood difficult for many youths - but just because we're already bad at a thing doesn't mean we should let it get worse. Child labour laws were some of the first steps of regulation in the industrial revolution because inflation works in such a way where opening the door up to child labour can put significant financial pressure on families that choose not to participate when demand adjusts to that participation being normal.
More egregiously, they're (per your article) manipulating kids into buying real ads for their creations, with the false promise that "you could get rich -- if you pay us".
>"As there are no discoverability tools, users are only able to see a tiny selection of the millions of experiences available. One of the ways boost to discoverability is to pay to advertise on the platform using its virtual currency, Robux."
(Note that this "virtual" currency is real money, bidirectionally exchangeable with USD).
The sales pitch is "get rich fast":
>"Under the platform’s ‘Create’ tab, it sells the idea that users can “make anything”, “reach millions of players”, and “earn serious cash”, while its official tutorials and support website both “assume” they are looking for help with monetisation."
I agree that this doesn't really look like a labor issue. That's distracting and contentious tangent; it's easier to just label this as a kind of consumer exploitation. (Most of the people involved aren't earning money -- but they are all paying money). It's a scam either way.
I am naive about the reality on the ground when it comes to this issue, but doesn't this hinge on transparency? If they can show they are covering costs + the going market rate, which seems to be 30% (at best), then wouldn't it be reasonable? So is a 45% cut for infra ok or not seems to be the question.
This is an interesting debate to have somewhere, but it has nothing to do with this thread. We need to be careful about tangents like this, because it's a lot easier to have an opinion about the cut a platform should take from UGC than it is to have opinion about Raft, channel-based concurrency, and on-disk freelists. If we're not careful, we basically can't have the technical threads, because they're crowded out by the "easier" debate.
True, it is off topic to the postmortem. However, the top comment talks about wanting to work there. I get is is very relevant to see a bigger picture. Personally, I could never work for them. I have a kid and the services and culture they created around their product is sickening and should be made illegal.
Even so, if you're responding to the "wanting to work there" you could do so in a less snide manner that's more in line with HN's guidelines on constructive commenting.
While I personally think digitalengineer's comment was low-effort and knee-jerk, I think this general thread of discussion is on topic for the comment replied to, which was specifically about how the postmortem increased the commenter's respect for Roblox as a company and made them want to work there. I think an acceptable compromise between "ethical considerations drown out any technical discussion" and "any non-technical discussion gets downvoted/flagged to oblivion" would be to quarantine comments about the ethics of Roblox's business model to a single thread of discussion, and this one seems as good as any.
The guidelines and zillions of moderation comments are pretty explicit that doesn't count as 'on topic'. You can always hang some rage-subthread off the unrelated perfidy, real or perceived, of some entity or another. This one is extra tenuous and forced given that 'the type of company I'd want to work for' is a generic expression of approval/admiration.
As one of the top developers on the platform (& 22 y/o, taught myself how to program through Roblox, ~13 years ago), I can say that it seems a majority of us in the developer community are quite unhappy with the image this video portrays. We love Roblox.
Yeah as long as Roblox is exploiting children they're just flat-out not respectable. This video is a good look at a phenomenon most people are unaware of.
Players of your game creating content for it is not exploitation. It's just how it works in the gaming world. When I was a kid I spent time creating a minecraft mod that hundreds of people used. Did Mojang or anyone else ever pay me? No. I did it because I wanted to.
The way they're paying kids and what they're telling them is a big part of the problem... they're pushing a lot of the problematic game development industry onto kids that are sometimes as young as 10.
If this was free content creation when kids want to do it, then it would be an entirely different story.
I kid of course. One of the best post-mortems I've seen. I'm sure there are K8s horror stories out there of etcd giving up the ghost in a similar fashion.
The one thing you can say about Nomad is that's generally incredibly scalable compared to Kubernetes. At 1000+ nodes over multiple datacenters, things in Kube seem to break down.
>Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.
which gives me goosebumps whenever I hear people proselytizing everything run on Kubernetes. At some point, it makes good sense to keep capabilities isolated from each other, especially when those functions are key to keeping the lights on. Mapping out system dependencies (either systems, software components, etc) is really the soft underbelly of most tech stacks.
Could you please stop posting unsubstantive comments to HN? You've done it quite a bit, unfortunately. We're trying for a different quality of discussion here.
> Note all dates and time in this blog post are in Pacific Standard Time (PST).
But the incident was during PDT. Just use UTC or colloquial "Pacific time" or equiv and never be wrong!
My heart goes out to these people. I can imagine how much sustained terror they were feeling, stare hard and harder at your terminals and still nothing makes sense.
>for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem
I don't understand this logic. are they basically saying that their servers are on average closer to the user than mainstream cloud infra? are they e.g. choosing to have N satellite servers around a city instead of N instances at one cloud provider location in the centre of the city? is it the sparseness of the servers that decreases the latency?
or is it more to do with avoiding the herd, i.e. less trafficky routes / beating the queues?
it's also unclear whether they use their own hardware on rented rackspace as that could potentially lower costs too
Cloud providers are rarely in cities. Google's biggest region is in the middle of Iowa, Amazon's is in Virginia.
If you have a latency sensitive application (like multiplayer games) it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.
As they point out elsewhere, the cost of infrastructure directly impacts their ability to pay creators on the platform. Doing it yourself will always be cheaper, and they hired the smart people to make it happen.
> it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.
Large cloud providers have a backbone network with interconnects to many ISPs reducing the amount of Hops a client has to take across the internet.
> Doing it yourself will always be cheaper
Treating the Cloud as a traditional IAAS Datacenter extension will be more expensive.
By utilising PAAS, only using resource that's needed and when it's needed, etc.. is much cheaper.
Read the article "
It has been 2.5 months since the outage. What have we been up to? We used this time to learn as much as we could from the outage, to adjust engineering priorities based on what we learned, and to aggressively harden our systems. One of our Roblox values is Respect The Community, and while we could have issued a post sooner to explain what happened, we felt we owed it to you, our community, to make significant progress on improving the reliability of our systems before publishing."
They wanted to make sure everything was fixed before publishing
They just got out of their busiest time of year, and taking the time to write an accurate post mortem with data gleamed afterwards seems sensible to me.
I would not have guessed Roblox was on-prem with such little redundancy. Later in the post, they address the obvious “why not public cloud question”? They argue that running their own hardware gives them advantages to cost and performance. But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up. It will be interesting to see how well this architecural decision ages if they keep scaling to their ambitions. I wonder about their ability to recruit the level of talent required to run a service at this scale.
Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties.
Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.
This is not true, if they handled the rollout properly. Companies like Uber have two entirely different data centers and during outages they failover you either datacenter.
Everything is duplicated which is potentially wasteful but ensures complete redundancy and it’s an insurance policy. If you rollout, you rollout to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.
The parent poster said that it would have happened even if they had cloud, ie. another datacenter. That's my assumption for the comment.
As far as I can tell from reading, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if that's not true, then my point would be incorrect. If it is true, then if they completely duplicated their datacenters, they would be able to make the switch in one datacenter to streaming while keeping the other datacenter the old setting until they validated that everything was fine. That would have caught the problem, having slow rollout across datacenters.
Uber is also a service that has a much lower tolerance for downtime: If people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers apps stop working suddenly, the stranded people get very upset in a hurry, and the company loses a lot of customers.
It can be totally reasonable for Uber to pay for 2x the amount of infra they need for serving their products while not being worth it for a company like Roblox.
You didn't read it properly. The changes were rolled out months before, but the switch to streaming based on that rollout was made 1 day before the incident. That was the root cause.
I think the public cloud is a good choice for startups, teams, and projects which don't have infrastructure experience. Plenty of companies still have their own infrastructure expertise and roll their own CDNs, as an example.
Not only can one save a significant amount of money, it can also be simpler to troubleshoot and resolve issues when you have a simpler backend tech stack. Perhaps that doesn’t apply in this case, but there are plenty of use cases which don’t need a hundred micro services on AWS, none of which anyone fully understands.
> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up
You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.
This outage has it all, distributed systems, non-uniform memory access contention (aka "you wanted scale up? how about instead we make your CPU a distributed system that you have to reason about?"), a defect in a log-structured merge tree based data store, malfunctioning heartbeats affecting scheduling, wow wow wow.
Big props to the on-calls during this.
> Big props to the on-calls during this.
Kind of curious about this. I know this is probably company specific but how do outages get handled at large orgs? Would the on-calls have been called in first then called in the rest of the relevant team?
Is their a leadership structure that takes command of the incident to make big coordinated decisions to manage the risk of different approaches?
Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed?
Typically:
Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, at any reasonable team, everyone that chipped in nights get to take off equivalent days and sprint tasks are all punted.
Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."
Kinda. Definitely would have represented crunch time, but a very very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyways. Keep in mind these typically only last an hour or 3, at most they last a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".
> Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it.
Well, also your business is 100% down, all the capable engineering eyes should be looking at the issue.
After a certain length of outage, you have to start prioritizing differently though. I only have our own anecdotes there. But if someone was at a problem for 8 - 12 consecutive hours under pressure, the quality of their work is going to drop sharply. At such a point, it becomes more and more likely for them to make the situation worse instead of fixing it.
And at or beyond that point, you pretty much have to take inspiration from fire fighters and emergency services: You need to organize the experts on subsystems to rest and sleep in shifts, ideally during simpler but time consuming tasks. Otherwise these persons will crash and you lose their skills and knowledge during that outage for good. And that might render an outage almost impossible to handle.
I think I didn't explain myself very well: clearly on-duty must sleep if it's a multi-day incident, but they also need extra help when they are awake! If the business is completely down, there isn't normal work to do for other engineers so, even if they are out of their typical domain, they might give good insights, novel ideas or fix some side issues that will help the ones with more domain knowledge.
The problem is that you don’t know how long the outage will be when it starts. I once saw a large outage start, everyone jumped on to troubleshoot, thinking it would be an hour. 8 hours later it’s still an outage, and everyone is still on and burned out. Management should have told half the people who jumped on at the start to go away and be prepared for a phone call in 8 hours to provide relief.
Google has his Site Reliability Engineering book, which might answer some of your questions
https://sre.google/sre-book/table-of-contents/
It is an interesting read. Here's the pdf:
https://github.com/captn3m0/google-sre-ebook/releases/downlo...
Is this the same as the O'Reilly dead tree book of the same name?
Yes.
Oncalls get paged first and then escalate. As they assess impact to other teams and orgs, they usually post their tickets to a shared space. Once multiple team/org impact is determined, leadership and relevant ops groups (networking, eg) get pulled in to a call. A single ticket gets designated the Master Ticket for the Event, and oncalls dump diagnostic info there. Root cause is found (hopefully), affected teams work to mitigate while RC team rushes to fix.
The largest of these calls I've seen was well into the hundreds of sw engineers, managers, network engineers, etc.
Wow, that makes complete sense for something that is impacting this many people and by extension lots of money.
Thanks for the answer, I have only ever worked with such a small team that we are all on a call every day.
I can imagine it can probably get a little hectic in large group calls? On the engineering side is there a command structure? Like say the root cause was found and RC team is rushing to fix it. But another team wants to mitigate in the mean time in a slightly risky way. Would their manager make a case with leadership? Would the proposed plan just be put out for general comment as a response to that main ticket?
It depends. I’ve managed major incidents with hundreds of participants.
Our major incident process generally had a “suit” call with non-technical executives and people who would be coordinating customer triage, outreach, etc. Then we would have a tech bridge where the key stakeholders did their thing.
We used the Federal incident command system as a model. It’s a great reference point to use as an inspiration.
Any guides on the "Federal incident command system" to read from (e.g. without blindly googling for it). Thanks?
In addition, you can look into ITIL/ITSM Incident Management plans, they have well developed process structure to work from as a guideline.
I have also seen organizations recommend Kepner Tregoe method training for real time high pressure problem solving based off Nasa Mission Control systems.
https://training.fema.gov/emiweb/is/icsresource/jobaids/
Here’s a good place to start
https://training.fema.gov/nims/ is a great entry point.
each company is different, from my experience it would depend on the severity of the fix, and the severity of the issue. the problem would get resolved by any means ie temporary sticky plaster if necessary.
Another team would then assess and analyse the root cause from a company wide perspective and then assess the risks, costs and impact and then make any modifications (possibly redoing the temporary fix, and fixing it properly)
Real issue, a call center main telephony system and one of the management servers kept crashing causing over 1400 call center people to stop working. Temporary fix was to re boot the servers every 4 hours causing minor pain, but the call staff was up and running.
After a whole stupid week of the engineers not being able to find the route cause it was escalated extremely high and our team was brought in and we found the root cause in seconds (literally)The servers was VMs and the engineers hadn't checked the physical ESX server they were hosted on. another VM on the box caused the server to go unstable (ESX not configured correctly).
BAU project set up to audit/ report and fix all the ESX servers in the company for other stupid config issues
The person you're responding to is not exactly wrong. But since the users dropped to 0 pretty quickly it's likely that every team with any monitoring at all got paged. At least that's what would happen at the moderately large company I work for.
I'm giving a much broader example of what a large company might do for high impact events. I have no idea what the insides of Roblox look like specifically.
Not to mention a VP or three. A well-led company is going to have management in the line of fire, so to speak, so an outage of this scale would wake them as well.
So 1-3 people actually figure it out while everyone else gets in the way? There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Former Google SRE here, I can share my experience although I've never been involved in a large serious outage (thankfully). I've had my fair share of smaller multi-team outages though.
Usually the way it works is so that we have multiple clearly-identified and properly-handed-off roles. There's an Incident Commander (IC) role, whose job is to basically oversee the whole situation, there's various responders (including a primary one) whose job is to mitigate/fix the problems usually relating their own teams/platform/infra (networking, security, virtualization clusters, capacity planning, logging, etc. depends on the outage). There's also sometimes a communication person (I forget the role name specifically) whose job is to keep people updated, both internal to the outage (responders, etc) and outsiders (dealing with public-facing comms, either to other internal teams affected by the outage or even external customers).
Depending on the size of the outage, the IC may establish a specific "war room" channel (used to be an IRC chatroom, not sure what they use these days though) where most communication from various interested parties will take place. The advantage of a chatroom is that it lets you maintain communication logs and timestams (useful for postmortem and timeline purposes), and it helps when handing off to the next oncaller during a shift change (they can read the history of what happened).
> There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Most people will not really be doing much but when you need to diagnose a problem, having a lot of brains with various expertise in different domains helps, especially if those people are the ones that have implemented a certain service that might be obscure to the other oncallers. Generally speaking, it wouldn't be unheard of to have 30-40 people in the same irc channel brainstorming and coordinating a cross-team effort to mitigate a problem, but into the hundreds? Not quite sure about that much.
Just my two cents. You can probably get more info by reading the Google SRE book https://sre.google/books/
Yeah, I've read the Google SRE book and the product I work on follows Google's SRE model. Sometimes I wonder though if it's all one big anti-pattern. Maybe more precisely it's a pattern designed to work even if nobody knows what's going on. Things are so vastly (over?) complicated. The original designers are long gone. But you still somehow have to keep things going and address any issues that pop up. In our org that SRE model leads that some very weird things because the SREs know the infrastructure (to some degree) but don't really understand the stuff running over it. But I guess we're delivering the service so that's something.
I think the "real world" doesn't work like that. The way the real world works is that things are decoupled in a way that one system's failure doesn't bring the entire world down. So things can be solved in isolation by people that actually understand the system and/or systems are designed in a way that they are serviceable etc.
When the power fails in my neighbourhood, you don't get 100 engineers on a hotline, one van comes down, troubleshoots the problem, and fixes it. Like 3 technicians.
I know there are some exceptions like some power failures that cascaded or the global supply shortages. But those are design failures IMO. A computer system that goes down for this length of time and nobody can figure out why or recover, that seems like a total failure to me on multiple levels. We're just doing this wrong.
Speaking from personal experience, most outages are contained and mitigated within a specific service before they end up impacting other services too. Cascade effects are rare, you just notice them more often because they affect multiple people and usually external-facing customers too. In reality, most things will (or, rather, *should*) page you well before it becomes a cascade-effect incident that multiple teams will have to take care of.
If your problem is that nobody knows what's going on and that stuff constantly brings down a bunch of different systems, you either need to finetune your alerting so the affected system tells you something is wrong *before* it reaches other people (monitor your partial rollouts, canary releases, capacity bursts, etc), or you have a problem with playbooks.
The person that implemented the system doesn't need to be the person that fixes it in case there's a problem. We have playbooks that tell us exactly what to do, where to go, which flags to flip, which machine to bring down/bring up, etc in case of various problems. These should be written by the person that implemented the system and any following SRE who's been in charge of fixing bugs or finding issues as a way for the next SRE oncall to not be lost when navigating that space. Remember that the person oncall is not the one responsible for fixing the issue, they are the person responsible for mitigating the problem until the most appropriate person can fix it (preferably not outside working hours).
Again, there can be exceptions that require multiple engineers to work together on multiple services, but in reality that should not be the norm. Most of the pages I handled as an SRE were "silly" things that were self-contained to our team and our service and our customers never even noticed anything was wrong in the first place.
In a really large company, you're talking maybe ~100-200 people per org. EC2 alone has a massive footprint, for instance. Hundreds of engineers, of whom a dozen are maybe oncall for their respective components. If something goes wrong in, let's say' cloudwatch, but EC2 is impacted, that's dozens of people working to weight their services out of the impacted AZ, change cache settings, bounce fleets, etc.
A lot of the time root cause is solved by a smaller number of people. But identifying root cause and mitigating impact during an event -- and then communicating specifics of that impact -- can fall to a much larger group.
If 1-3 people are actively solving the issue, they do so alone, and give periodic updates to the broader group through a manager or other communication liason.
3 people to fix the Vital Component That Must Work At All Times.
97 people to check/restart/monitor their team's system, because the Vital Component has never failed before so their graceful recovery code is untested or nonexistent.
For the on call system that I ran until recently, there are about a dozen on call teams responsible for parts of the service. Each team has a primary and backup engineer, generally on a 7x24 shift that lasts a week. Most weeks it's not very busy.
Working with them during an incident is an on call comms lead, who handles outside-of-team comms (protecting the engineers), and an engineering lead (who is a consultant, advisor, and can approve certain actions).
For big incidents, an exec incident manager is involved. They primarily help with getting resources from other teams.
Where I work there is an incident team that handles things like creating a master ticket, starting a call bridge, getting the on-calls into the bridge, keeping track of what teams (and who from those teams) have been brought in, manages the call (keeping chatter down and focused when there are 100 people in a call is important), periodically comments on the master ticket with status and a list of impacted teams, marks down milestone times like when the impact started, when it was detected, mitigated, root cause found, etc. This person is also responsible for stuff like when they hear you want to engage team X, they'll go track down an on-call for you, or summarizing known impact for the outward-facing status pages, etc. They also create the postmortem template and follow up with all involved teams to get them to contribute their detailed impact statement there.
Edit: sometimes when it's a really gnarly problem and there are huge numbers of people on the call, the set of people who are actively trying to come up with mitigations and need to just be able to talk freely at each other will break off into a less noisy call and leave a representative to relay status to the main call.
Approaches vary company-to-company, but https://response.pagerduty.com/ is a good resource for understanding how it often looks.
At Google an oncaller typically gets paged, triages the incident and, if it's bad, they page other oncallers and or team members for help. For more serious incidents, people take on different roles like communications lead, incident commander etc.
During the worst outage I was a involved in basically the entire org including all of the most senior engineers worked around the clock for two weeks to fix everything
The on calls ARE the relevant team lol. You're doing it wrong otherwise
As someone with 8 years of experience in SRE in Google: I wouldn't be so sure about that. Most outages require only rudimentary understanding of the particular service. Pretty much "have you tried turning it off and on?", with the extra step of figuring out which piece of the stack needs the kick. Hence, there are many SRE teams that onboard lots of services with this kind of half-support. The on call only performs generic investigation and repair attempts. If that doesn't help, they escalate to the relevant dev team, who likely will only respond in office hours.
Only the important services get dedicated oncalls. Most important ones will have both 24/7 SRE and dev oncalls.
What processes are there (and how effective are they?) to determine if a non-expert SRE should fix something there-and-then (and potentially making things worse) vs. assigning it to a dev team for a correctly engineered fix, at the cost of delays?
"We enjoyed seeing some of our most dedicated players figure out our DNS steering scheme and start exchanging this information on Twitter so that they could get “early” access as we brought the service back up."
Why do I have a feeling "enjoyed" wasn't really enjoyed so much as "WTF", followed by "oh shit..." at the thought that their main way to balance load may have gone out the window.
At their scale, it was probably an insignificant minority. I read that as nothing more than a wink and nod of "we see what you did ;)" ; which I appreciate. Some companies would have a fit and go nuclear on people for that, for no particular reason. As long as it is an insignificant minority, it doesn't matter, and ideally it's teenagers learning how something works on the side, and that helped grow some future hacker (in the HN sense) somewhere.
> Some companies would have a fit and go nuclear on people for that, for no particular reason
Sometimes it's even the Missouri state governor doing that too.
It's difficult to know how quickly word could have spread, but I enjoy knowing a few 11 year olds learned something about the Internet in order to play a game an hour early.
With social media etc, I can see it spreading really fast. That would be my bigger fear trying to get a service back up from a very long outage like that.
The intentionally slow bringup is to handle the thundering herd of having the system come back online to 100% at once. If a couple hundred users (small percentage of userbase) here or there are able to jump to queue, it's no real big deal.
As far as players figuring out the DNS steering scheme; the company has no responsibility to keep a non-advertised backend up. If it was a problem, disallow new connection to it and remove it from the main pool.
I think it mostly consisted of "KEEP PRESSING REFRESH AND YOU'LL GET LET IN AT SOME POINT" so there wasn't any additional unplanned load for Roblox.
enjoyed as in having dedicated fans that would go through hops to have access
Love the "Note on Public Cloud", and their stance on owning and operating their own hardware in general. I know there has to be people thinking this could all be avoided/the blame could be passed if they used a public cloud solution. Directly addressing that and doubling down on your philosophies is a badass move, especially after a situation like this.
It's interesting, I don't see that being on cloud would have avoided or helped this situation much. They were able to ramp up their hardware very quickly - who knows where they got it that fast - and it actually made the problem worse, so being on cloud and having the ability to do that with keystrokes would not have helped. You could say they might be using a different set of components if they were on cloud which may not have suffered the same issues, but you can play the what if game all day it's not related to pros/cons of public cloud.
It's weird it took them so long to disable streaming. One of the first things you do in this case is roll back the last software and config updates, even innocent looking ones.
That’s what stood out to me too. Although they’d been slowly rolling it out for awhile, their last major rollout was quite close to the outage start:
> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%
Consul was clearly the culprit early on, and you just made a significant Consul-related infrastructure change, you’d think rolling that back would be one of the first things you’d try. One of the absolute first steps in any outage is “is there any recent change we could possibly see causing this? If so, try rolling it back.”
They’ve obviously got a lot of strong engineers there, and it’s easy to critique from the outside, but this certainly struck me as odd. Sounds like they never even tried “let’s try rolling back Consul-related changes”, it was more that, 50+ hours into a full outage, they’d done some deep profiling, and discovered the steaming issue. But IMO root cause analysis is for later, “resolve ASAP” is the first response, and that often involves rollbacks.
I wonder if this actually hindered their response:
> Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.
i.e. earlier on, were there HashiCorp peeps saying “naw, we tested streaming very thoroughly, can’t be that”?
When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.
Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes that happened in the recent past, and certainly changes in client load could have been a possible culprit.
In most cases, if you've planned your deployment well (meaning in part that you've specified the rollback steps for your deployment) it's almost impossible to imagine rollback being slower than any other approach.
When I worked at Amazon, oncalls within our large team initially had leeway over whether to roll backwards or try to fix problems in situ ("roll forward"). Eventually, the amount of time wasted trying to fix things, and new problems introduced by this ad hoc approach, led to a general policy of always rolling back if there were problems (I think VP approval became required for post-deploy fixes that weren't just rolling back).
In this case, though, the deployment happened ages (a whole day!) before the problems erupted. The rollback steps wouldn't necessarily be valid (to your "multiple confounding changes" point). So there was no avoiding at least some time spent analyzing and strategizing before deciding to roll back.
Some changes are extremely hard to rollback, but this doesn’t sound like one of them. From their report, sounds like the rollback process involved simply making a config change to disable the streaming feature, it took a bit to rollout to all nodes, and then Consul performance almost immediately returned to normal.
Blind rollbacks are one thing, but they identified Consul as the issue early on, and clearly made a significant Consul config change shortly before the outage started, that was also clearly quite reversible. Not even trying to roll that back is quite strange to me - that’s gotta be something you try within the first hour of the outage, nevermind the first 50 hours.
> When you're at Roblox's scale
Yet a regional Consul deployment is the single point of failure. I apologize if that sounds sarcastic. There’re obviously a lot of lessons to be learned and blames have no places in this type of situations - excuses as well.
In a not-too-distant alternate universe, they made the rookie assumption that every change to every system is trivially reversible, only to find that it's not always true (especially for storage or storage-adjacent systems), and ended up making things worse. Naturally, people in alternate-universe HN bashed them for that too.
Obviously I'm on the outside looking in here - can't say anything with confidence. But I've been on call consistently for the past 9 years, for some decent sized products (not Roblox scale, but on the order of 1 million active users), mitigating more outages than I can count. For any major outage, the playbook has always been something like this:
1. Which system is broken?
2. Are there any recent changes to this system? If so, can we try reverting them?
They did "1", quickly identified Consul as the issue. They made a significant Consul change the day before, one they were clearly cautious/worried about (i.e. they'd been slowly adopting the new Consul streaming feature, service by service, for over a month, and did a big rollout of it the previous day). And once they did identify streaming as the issue, it was indeed quick to roll back. It just seems like they never tried "2" above, which is strange to me, very contrary to my experience being on call at multiple companies.
If you're doing a slow rollout, it's not always easy to tell whether the thing you're rolling out is the culprit. I've been on the other side of this outage where we had an outage and suspected a slow change we had been rolling out, especially because we opted something new into it minutes before an incident, only to realize later when the dust settled that it was completely unrelated. When you're running at high scale like Roblox and have lots of monitoring in place and multiple pieces of infrastructure at multiple levels of slow-rollout, outages like this one don't quickly point to a smoking gun.
What do you do when you're working on a storage system and rolling back a change leaves some data in a state that the old code can't grok properly? I've seen that cause other parts of the system (e.g. repair, re-encoding, rebalancing) mangle it even further, overwrite it, or even delete it as useless. Granted, these mostly apply to code changes rather than config, but it can also happen if code continue to evolve on both sides of a feature flag, and both versions are still in active use in some of the dozens of clusters you run. Yes, speaking from experience here.
While it's true that rolling back recent changes is always one of the first things to consider, we should acknowledge that sometimes it can be worse than finding a way to roll forward. Maybe the Roblox engineers had good reason to be wary of pulling that trigger too quickly when Consul or BoltDB were involved. Maybe it even turned out, in perfect 20/20 hindsight, that foregoing that option was the wrong decision and prolonged the outage. But one of the cardinal rules of incident management is that learning depends on encouraging people to be open and honest, which we do by giving involved parties liberal benefit of the doubt for trying to do the right thing based on information they had at the time. Yes, even if that means allowing them to make mistakes.
Spot on. And some things are easily reversible to the extent that they alleviate the downtime, yet still leave a large data sync or etl job to complete in their wake. The effect of which, until resolved, is continued loss of function or customer data at some lesser level of severity.
As a fairly regular consul cluster admin for the last 6 years or so, but not on that scale i can safely say that you generally have no idea if rolling back will work. I’ve experienced everything up to complete cluster collapses before. I spent an entire night blasting and reseeding a 200 node cluster once after a well tested forward migration went into a leadership battle it never resolved. Even if you test it before that’s no guarantee it’ll be alright on the night.
Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.
That applies to vault as well.
I also run 3 small clusters of consul and I went ahead and read the raft paper[1] so I can debug consul election problems if it occurs.
Consul is awesome when it works, but when it breaks it can be hell to get it working again. thankfully it usually works fine. I only had 1 outage and it fixed itself after restarting the service.
[1] https://raft.github.io/raft.pdf
> so I can debug consul election problems if it occurs
Interestingly, reading this remind me of a HashiCorp Nomad marketing piece [1]:
> "We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."
I was always thinking "but what if something goes wrong? just call HashiCorp engs?" :p
[1] https://www.hashicorp.com/case-studies/roblox
That seems to be a general problem with these types of solutions. You have the exact same issue with something like ZooKeeper. It awesome when it works, but good luck trying to figure out why it's broken.
Just the author of the previous post relying on these types of services is something that can keep me up at night.
> Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.
Consul (and Vault) for sure a complex softwares that 99% of the time "just work", but when they fail they can fail big time, I concur. But calling it a mortal risk seems a bit far fetched in my opinion.
When you run 5 nines that’s a risk.
At first I thought it is a well-written post-mortem with proper root cause analysis. After reading it for the second time though, it doesn't sound like the root cause has been identified? At one point, they disabled streaming across the board, and the consul cluster started to become sort of stable. Is streaming to be blamed here? Why would streaming, an enhancement over the existing blocking query, which is read-only, end up causing "elevated write latency"? Why did some voter nodes encounter the boltdb freelist issue, while some other voter nodes didn't?
And there is still no satisfying explanation for this:
> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.
But I totally agree with you that the first thing they should look into is to rollback the 2 changes made to the traffic routing service the day before, as soon as they discovered that the consul cluster had become unhealthy.
"just roll back" gets risky when you roll back more than a few hours in many cases.
Frequently the feature you want to roll back now has other services depending on it, has already written data into the datastore that the old version of the code won't be able to parse, has already been released to customers in a way that will be a big PR disaster if it vanishes, etc.
Many teams only require developers to maintain rollback ability for a single release. Everything beyond that is just luck, and there's a good chance you're going to be manually cherry picking patches and having to understand the effects and side effects of tens of conflicting commits to get something that works.
The post indicates they'd been rolling it out for months, and indicate the feature went live "several months ago".
With the behaviour matching other types of degradation (hardware), it's entirely reasonable that it could have taken quite a while to recognise that software and configurations that have proven stable for several months, that is still there working, wasn't quite so stable as it seemed.
Right, but it only went live on the DB that failed the day before. Obviously, hindsight is 20/20, but it's strange that the oversight didn't rate a mention in the postmortem.
Totally agree with nightpool. That's a very strange oversight.
Some comments:
- The write up is amazing. There is a great level of detail. - When they had the first indication of a problem, instead of looking if the problem was the hardware (disk I/O, etc.) the team went full cattle/cloud: bring down the node, launch a new one. Apparently that cost them a few hours. We would probably have done the same but I wonder if there's a lesson there. - The obvious thing to do was revert configs. It is very strange that it took so long to revert. After being down for hours and having no idea what gives, it's the reasonable thing to try. - The problem was consul. But consul is a key component and Roblox seem to be running a fairly large infrastructure. The company's valuation is sky-high, I assume the infra team is quite large. Consul is an open source project. Wouldn't make sense instead of relying on hashicorp so heavily to bring-in or train ppl around consul internals at this point? (maybe not possible/feasible/optimal, just wondering)
Would be a nice touch to check if bbolt has the bug and possibly push a fix. That said, the post-mortem is state-of-art. Way better than anything we've seen from much much bigger companies.
Honestly I would guess part of it is that streaming is supposed to be a performance increase. So during a performance related outage, it might be easy to overlook. Am I really going to turn off a feature that I think is actually helping the problem?
If that feature was the one most recently deployed or updated? Yes, if possible. That could be a big if, though right? Maybe rolling back such a change isn't trivial, or imposes other costs to returning to service that are more expensive than simply working through the problem.
Well its a feature that affects the specific area where your having issues (performance), so yes would be the right thing to start with.
The htop screenshot was an immediate, appropriately-colored red flag for me: that much red (kernel time) on the CPU utilization bars for a system running etcd/consul is not right in my experience.
The post mortem is really well written but I had the same thoughts. They upgraded the machines hardware before rolling back the latest config updates.
Hindsight is 20-20
I shouldn't have drank that many
Hindsight is 20-20
Stop.
–
Little elevators are far too small for me
So I ride the big ones
It's not so fun unless you're OCD
And you like buttons
It's a spicy read. Really could have happened to anyone. All very reasonable assumptions and steps taken. You could argue they could have more thoroughly load tested Consul, but doubtful any of us would have done more due diligence than they did with the slow rollout of streaming support.
(Ignoring the points around observability dependencies on the system that went down causing the failure to be extended)
The main mistake IMO is that, the day before the outage, they made a significant Consul-related infra change. Then they have this massive outage, where Consul is clearly the root cause, but nobody ever tries rolling that recent change back? That’s weird.
I went into more detail here: https://news.ycombinator.com/item?id=30015826
The outage occurring could certainly happen to anyone, but it taking 72 hours to resolve seems like a pretty fundamental SRE mistake. It’s also strange that “try rollbacks of changes related to the affected system” isn’t even acknowledged as a learning in their learnings/action items section.
It's possible they deal with so much load that they considered a day's worth of traffic to be sufficient load testing:
> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.
And a short note later on how much load their caching system sees:
> These databases were unaffected by the outage, but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation, was unhealthy.
That doesn't sound accurate. Wasn't the major change they ended up rolling back Consul streaming, which they'd enabled months before, and had been slowly rolling out?
Right, but the day before the outage, they enabled streaming for a service that didn't have it turned on. That's a discrete config change, the day before the outage.
> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%
So they rolled out a pretty significant Consul related change the day before their massive Consul outage began. They’d been doing a slow rollout, but ramping it up a bunch is a significant change.
Admittedly this is armchair architecture talk, but it seems like either consul or Roblox's use of Consul is falling into a CAP-trap: they are using a CP system when what they need is an eventually-consistent AP system. Granted, the use of consul seems heterogenous, but it seems like the main root cause was service discovery. And service discovery loves stale data.
Service discovery largely doesn't change that often. Especially in an outage where a lot of things that churn service discovery are disabled (e.g. deploys), returning stale responses should work fine. There's a reason DNS works this way - it prioritizes having any response, even if stale, since most DNS entries don't change that frequently. That said, DNS is not a great service discovery mechanism for other reasons. Not sure if there's an off-the-shelf solution that relies more on fast invalidation rather than distributed consistent stores.
Good catch. If Roblox only uses consul for service discovery, things should continue to work, just slowly degrade over the hours/days. There should at least be one consul agent running on each physical hosts, and this consul agent has cache and can continue to provide service discovery functionality with stale data.
Dissecting this paragraph from the post-mortem...
> When a Roblox service wants to talk to another service, it relies on Consul to have up-to-date knowledge of the location of the service it wants to talk to.
OK.
> However, if Consul is unhealthy, servers struggle to connect.
Why? The local "client-side" consul agents running on each hosts should be the authoritative source for service discovery, not the "server-side" consul agents running on the 5 voter nodes.
> Furthermore, Nomad and Vault rely on Consul, so when Consul is unhealthy, the system cannot schedule new containers or retrieve production secrets used for authentication.
Now that's one very bad setup, similar to deploying all services in a single k8s cluster.
Didn’t realize consul had that. Seems like the right approach - though I wonder why Roblox wasn’t using it.
Fwiw I believe kubernetes did this right - if you shoot the entire set of leaders, nothing really happens. Yes if containers die they aren’t restarted and things that create new pods (eg cron jobs) won’t run, but you don’t immediately lose cluster connectivity or the (built-in) service discovery. Not to say you can survive az failures or the like - or that kubernetes upgrades are easy/fun.
And don’t run dev stuff in your prod kube cluster. Just…don’t.
Can you say more about service discovery "loving stale data"? Loves in the sense of "generates a lot of it; is constantly plagued by it"?
Their comment implies "are totally fine with stale data".
Their argument is that the membership set for a service (especially on-prem) doesn't change all that frequently, and even if it's out of date, it's likely that most of the endpoints are still actually servicing the thing you were looking for. That plus client retries and you're often pretty good.
Maybe I'm just working on an idiosyncratic version of the service discovery problem, but "stale data" is basically my bête noire. Part of it is that I don't control all my clients, and I can't guarantee they have sane retry logic; what service discovery tells them is the best place to go had better be responsive, or we're effectively having an outage. For us, service discovery is exquisitely sensitive to stale data.
Yep!
I'm not saying I totally agree with the original comment there, just confirming that they meant it.
If you own your clients, sometimes you can say "it's on you to retry" (deadlines and retries are effectively mandatory, and often automatic, at Google). Having user facing services hand out bad addresses / endpoints would be really bad.
However, even for things like databases, you really want to know who the leader / primary is (and it's not really okay to get a stale answer).
So I dunno, some things are just fine with it, and some aren't. It's better if it just works :). Besides, if the data isn't changing, the write rate isn't high!
Yes - exactly what boulos said - I’m coming from the Google “you control your clients” perspective. That said, in some sense you always control your clients - you can always set up localhost proxies that speak your protocol or just tcp proxies.
The thing is service discovery is _always_ inconsistent. By the time you get your set of endpoints from your discovery, it can be out of date by the time you open your socket. Certainly for something like databases you need followers or standby leaders to reject writes from clients - service discovery can’t 100% save you here.
Super interesting. A place where ipvs or ebpf rules per-host for the discovery of services seems much more resilient than this heavy reliance on a functional consul service. The team shared a great postmortem here. I know the feeling well of testing something like a full redeploy and seeing no improvement…easy to lose hope at that point. 70+ hours of a full outage, multiple failed attempts to restore, has got to result in a few grey hairs worth of stress. Well done to all the sre, frontline, support engineers, devs, and whoever else rolled up their sleeves and got after it. The lessons learned here could only have been learned in an infra this big.
The BoltDB issue seems like straight up bad design. Needing a freelist is fine, needing to sync the entire freelist to disk after every append is pants on head.
BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.
LMDB uses a regular bucket for the freelist whereas Bolt simply saved the list as an array. It simplified the logic quite a bit and generally didn't cause a problem for most use cases. It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers.
I, for one, appreciate you owning this. It takes humility and strength of character to admit one's errors. And Heaven knows we all make them, large and small.
I also appreciate the honesty, but I don't see the error in the author, quite the opposite.
Afaiu, Bolt is a personal OSS project, github repo is archived with last commit 4 years ago, and the first thing you see in the readme is the "author no longer has time nor energy to continue".
Commercial cash cows like Roblox (a) shouldn't expect free labor and (b) should be wise enough to recognize tech debt or immaturity in their dependencies. Heck, even as a solo dev I review every direct dependency I take on, at least to a minimal level.
I can't speak to the incident response as I'm not an sre, but as a dev this screams of fragile "ship fast" culture, despite all the back patting in the post. I'm all for blameless postmortems, but a culture of rigor is a collective property worthy of attention and criticism.
I think the design choice is mine to own but, as with most OSS software, liability rests on the end user. It always sucks to see a bug cause so much grief to other folks.
As for HashiCorp, they're an awesome group of folks. There are few developers I esteem higher than their CTO, Armond Dadger. Wicked smart guy. That all being said, there's a lot of moving parts and sometimes bugs get through. ¯\_(ツ)_/¯
Consul is much older than 4 years old (public availability in 2014; 1.0 release in 2017, with a lot of sites using 0.x in production long before). And the fact that they didn't encounter this pathological case until Q4 2021 tells us that they got a lot of useful life out of BoltDB. They also were planning to switch over to bbolt back in 2020[1].
The developers at Hashicorp are top-tier, and this doesn't substantially change their reputation in my eyes. Hindsight is always 20/20.
Let's end this thread; blaming doesn't help anyone.
[1] https://github.com/hashicorp/consul/issues/8442
The onus is more on HashiCorp here by this logic. Consul itself is open source but HashiCorp sells an enterprise version.
I share the sentiment, but not for Roblox. Hashicorp, with a recent IPO, 200 mil operating revenue, and supposedly a good engineering reputation has one of its flagship products critically depend on a "toy project".
> BoltDB author here.
How does this happen so often? It's awesome to get the authors take on things. Also thank you for explaining and owning it. Where you part of this incident response?
It's on the front page of HN so it's pretty visible. However, I also use f5bot to notify on terms like "boltdb" and my other project "litestream".
You also made litestream?! Awesome, I love that project.
Yeah, that's me too. Hopefully I don't crash another multi-billion dollar public company in 8 years with it though... :)
Sounds like pretty good success criteria to me!
These issues partially solved in the libmdbx (a deeply revised and extended descendant of LMDB).
So BoltDB and LMDB affected users may switch to libmdbx as the Erigon (Ethereum implementation) does year ago https://github.com/ledgerwatch/erigon/wiki/Criteria-for-tran...
For now this is (relatively) easy since bindings for GoLang, Rust NodeJS/Deno, etc are available and the API is mostly the same in general.
---
The ideas that MDBX uses to solve these issues are simple: zero-cost micro-defragmentation, coalescing short GC/freelist records, chunking too long GC/freelist records, LIFO for GC/freelist reclaiming, etc.
Many of the ideas mentioned seems simple to implement in BoldDB. However the complete solution is not documented and too complicated (in accordance with the traditions inherited from LMDB ;)
> The project was never intended to go to production
:)
Having written a commercial memory allocator a quarter century ago, I remember dealing with freelists, and decided they were too much of a pain to manage if fragmentation got out of control. I chose a different architecture that was less fragile under load. Interesting that this can still be an issue even on today's hardware.
It's also interesting how much a tiny detail can derail a huge organization. My former employer lost all services worldwide because of a single incorrect routing in a DNS server.
> At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months
Is there an issue/bug for this somewhere?
I can’t remember off the top of my head. We had an issue where every couple of months the database would quickly grow to consume the entire disk. We checked that there were no long running read transactions and did some other debugging but couldn’t figure it out. Swapping out for Bolt seemed to fix that issue.
I haven’t heard of anyone else with the same issue since then so I assume it’s probably fixed.
By any chance could you (or bbolt folks?) update README to include this information?
Your answer should be voted to the top! :)
OSS contributors are rarely noticed or appreciated. Did HashiCorp ever sponsor you or share any revenue with you? The OSS ecosystem is broken.
I had a few folks offer to sponsor at individual levels but no corporate sponsorship except Shopify paying my salary in the early days.
That being said, I gave away the software for free so I don’t have any expectation of payment. I agree the ecosystem is broken but I don’t know how to fix it or even if it can be fixed.
This is a great post-mortem - thank you to the Roblox engineering team for being this transparent about the issue and the process you took to fix it. It couldn't have been easy and it sounds like it was a beast to track down (under pressure no less). gg
> On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%.
Seems like the smoking gun, this should have been identified and rolled back much earlier.
If reading a postmortem makes the smoking gun obvious, then the postmortem is doing its job. Don't mistake the amount of investigation that goes into a postmortem for the available information and mental headspace during an outage.
I've been in my fair share of incidents so I'm aware of how they work. But they knew it was an issue related to Consul within hours. It shouldn't take more than two days before they check for recent deployments made to Consul.
In my experience, there's typically more than one "smoking gun". The problem isn't finding one, it's eliminating all of the "smoking guns" that aren't actually related to the outage.
If I worked at an organization with many teams deploying updates multiple times per day and several same day events seemed related, I would probably also put less weight on a gradual, months-long deployment that had completed a day prior.
It's obvious when it's pointed out in an article like this. It's less clear when it's one of many changes that could have been happening in a day, and it was an operation that was considered "safe" given that it had been done multiple times for other services in the preceding months.
">circular dependencies in our observability stack"
This appears to be why the outage was extended, and was referenced elsewhere too. It's hard to diagnose something when part of the diagnostic tool kit is also malfunctioning.
Like the Facebook outage a few months ago, when their DNS being down prevented them from communicating interally.
aaaalllllllll the way down at the bottom is this gem: >Some core Roblox services are using Consul’s KV store directly as a convenient place to store data, even though we have other storage systems that are likely more appropriate.
Yeah, don't use consul as redis, they are not the same.
But you can... which is what some engineers were thinking. In my experience they do this because:
A) they're afraid to ask for permission and would rather ask for forgiveness
B) management refused to provision extra infra to support the engineers need, but they needed to do this "one thing" anyways
C) security was lax and permissions were wide open so people just decided to take advantage of it to test a thing that then became a feature and so they kept it but "put it on the backlog" to refactor to something better later
Yes, this and having such a big consul cluster where the recommendation is to have more smaller clusters.
That said, could've happened to anyone and it was a great write up.
It seems that Consul does not have the ability to use the newer hashmap implementation of freelist that Alibaba implemented for etcd. I cannot find any reference to setting this option in Consul's configuration.
Unfortunate, given it has been around for a while.
https://www.alibabacloud.com/blog/594750
I think they just made the switch to the fork that does contain the freelist improvement in https://github.com/hashicorp/consul/pull/11720
Took a major incident to swallow your pride? (consul, powered by go.etcd.io/bbolt)
Is this option enabled by default? I don't this it is and I don't think they actually set it manually anywhere.
EDIT: I think we're talking about two different options. I meant the ability to leave sync turned on but change the data structure.
This is the PR for the freelist improvement from that alibaba article: https://github.com/etcd-io/bbolt/pull/141
Just to be clear, we are talking about this item from the post-mortem right?
> We are working closely with HashiCorp to deploy a new version of Consul that replaces BoltDB with a successor called bbolt that does not have the same issue with unbounded freelist growth.
EDIT: I see what you mean. The freelist improvement has to be enabled by setting the `FreelistType` config to "hashmap" (default is "array"). Indeed it doesn't look like consul has done that...
I think we're still talking about different things, but that is a good move on their part regardless. :)
I mean the optional called `FreelistType` has a new option called `FreelistMapType` and the default is `FreelistArrayType`. There is no option in Consul from what I can tell to configure that option. They did have to upgrade from the old boltdb code to the etcd boltdb code to do this though.
I have this little idea I think about called the "status update chain". When I worked in small organizations and we had issues the status update chain looked like this: ceo-->me, as the organizations got larger the chain got longer first it was ceo-->manager-->me then ceo-->director-->manager-->me and so on. I wonder how long the status update chains are at companies like this? How long does at status update take to make it end to end?
If the situation is serious enough, you'll have several layers sitting together at the status update meetings to hear it straight from the dog's mouth.
I am sorry, I didn't have enough context to understand what your saying.
When you say: status update chain: ceo --> me. What information is flowing from the CEO to you? or is it the other way around?
Both directions, he is asking "What is going on" and I am telling him. As the org gets larger the request to know what is going on passes down the chain and the reply passes back up.
Usually there’s a central place where status is being updated and shared by everyone (a Slack channel for example) and everyone in the chain can just read/ping/respond there. Less of a chain.
In well designed incident comms systems, the upward comms occurs automatically, not on request.
My goal has always been that my execs know what is going on, so that they are never caught short by status queries.
Thanks, that makes sense. I haven't experienced that myself yet so I wasn't sure.
For me as a roblox user/programmer, the most annoying part of this was, that their desktop development tools refused to run during this outage, because they insist on "phoning home" when you launch them.
It is annoying, because the tools actually run perfectly fine on a local desktop, once you are past the "mothership handshake". I spent that week reading roblox dev documentation instead.
Wow, that's a very bad trend we see emerging in those past years. Did you have any chance to investigate the request at play and whether you could impersonate it via DNS on your local network (or if on the opposite, TLS certificates were stapled into the app)?
Also, i'm curious about your experiences with Roblox. I've only heard about it from these HN threads (no i don't know a single person using it) so if you have feedback to share regarding how to program it and how it compares to a modern game engine/editor like Godot, i'm all ears. Also, if you know of a free-software "alternative" to Roblox ; i'm amazed we run proprietary software in the first place, worried when it doesn't run because it can't phone home, but i'm actually ashamed we end up *producing* (eg. developing) content with proprietary tools that these companies can take away from us any minute.
Is there any tutorial on how go get a pref report like the one show in this screenshot? https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...
Yes. It's the default output for "perf report". I recommend reading this: https://www.brendangregg.com/perf.html
However, the short 2-line way to get that output is the following:
You'll get similar output to what they show if you have consul running with a similar load :)Slightly offtopic; “the team decided to replace all the nodes in the Consul cluster with new, more powerful machines”. How do teams usually do this quickly? Is it a combination of Terraform to provision servers and something like Ansible to install and configure software on it?
Totally depends on how “disciplined” the team’s DevOps practices are. In theory it should be as easy as updating a config parameter as you say, but my experience tells me that it’s sometimes not the case.
Especially with these kind of fundamental, core services such as Consul provides, it’s not unheard of to have templates with static machine allocations (as opposed to everything in a single auto-scaling group). It’s a bit of a shortcut, but it’s often a bit hairy to implement these services using true auto-scaling.
Having said all this, doing these types of migrations when things are already completely broken / on fire makes things a lot easier: you don’t care about downtime. So then it can be as simple as restarting all instances using a new instance type, downtime be damned.
If I wasn't using AWS I would have no idea how to do this.
I still don't understand how the elevated Consul latency ended up bringing all of the fleet to halt, fail health checks and drop user traffic. I guess use cases calling consul directly (e.g. service discovery) or indirectly (e.g. vault) could not tolerate having stale reads or sticking with what they've read? If anyone can shed a light on this, I appreciate.
One thing I don't see mentioned -- why is the write load so high? Can anyone from Roblox say? (I have a specific reason for asking.)
What I learn from this is issue is partly because of not proper use "Go Channels" and open source product "BoltDB"
IMO looking at the root causes here isn't that helpful. Software is complicated and there will always be some unknown bottleneck or bug lurking to knock you over on a bad day. The important lessons here are about:
* How their system architecture made them particularly vulnerable to this kind of issue
* Their actions to diagnose and attempt to mitigate the issue
* The whole later part about effectively cold-starting their entire infrastructure, all while millions of users were banging on their metaphorical door to start using the service again.
That and going all-in on Hashicorp.
I think this outage was made worse by them not being properly in a big cloud provider.
In a cloud provider, having a few people working simultaneously on spinning up instances with different potential fixes, running different tests, and then directing all traffic to the first one that works properly is a viable path to a solution.
When you have your own hardware, you can really only try one thing at a time.
> When you have your own hardware, you can really only try one thing at a time.
How so? What would prevent you from hiring 5-10 people for Ops heavy stuff and getting a bit more hardware resources and doing those things in staging environments with load tests and whatnot? I mean, isn't that how you should do things, regardless of where your infra and software is?
If you own your own hardware, for a given service you probably have enough hardware for the production workload, plus maybe 50% more for dev, test, staging, experiments, etc. All those other environments will probably be scaled down versions. Sure, they can be used in an emergency situation, but they can't withstand the full production load, and anyway they're likely on a separate physical hardware and network (usually you want good isolation between production and test environments).
If you use AWS, then you probably on average use the same day to day, but in an emergency you can spin up 5 versions of full production scale to test 5 things at once, and just edit a configfile to direct production traffic to any.
>>> The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.
That's impressive.
Tldr: We made a single point of failure, then we made it super reliable, then the stuff it was doing to maintain itself made it slow itself down, then our single point of failure took down our service.
Would be interesting to compare this result to the classic paper on Tandem failures:
A. Thakur, R. K. Iyer, L. Young and I. Lee, "Analysis of failures in the Tandem NonStop-UX Operating System," Proceedings of Sixth International Symposium on Software Reliability Engineering. ISSRE'95, 1995, pp. 40-50, doi: 10.1109/ISSRE.1995.497642.
Does anyone know what tool this one is?
https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...
Is it really perf?
It's perf-report[0] which reads the output of a perf data file and displays the profile.
[0]: https://man7.org/linux/man-pages/man1/perf-report.1.html
> /wp-content/uploads/2021/11/3-perf-report.png
It's perf.
Sounds like they didn't check what had changed first, before starting to fix things with best guesses ... not saying I wouldn't do the same, but arguably lost them a lot of time.
They were aware of the changes, but as they stated: it seemed to be working fine and, therefore, was ruled out early on as a potential problem.
Just curious, does Roblox push engineers to learn internals of critical software they operate or they lean on vendors.
If vendors, it is reckless.
>> We are working to move to multiple availability zones and data centers.
Surprised it was a single availability zone, without redundancy. Having multiple fully independent zones seems more reliable and failsafe.
Was on a call with a bank VP that had moved to AWS. Asked how it was going. Said it was going great after six months but just learning about availability zones so they were going to have to rework a bunch of things.
Astonishing how our important infrastructure is moved to AWS with zero knowledge of how AWS works.
> Surprised it was a single availability zone, without redundancy. Having multiple fully independent zones seems more reliable and failsafe.
It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage
Most startups I've worked at literally have a script to deploy their whole setup to a new region when desired. Then you just need latency-based routing running on top of it to ensure people are processed in the closest region to them. Really not expensive. You can do this with under $200/month in terms of complexity and the bandwidth + database costs are going to be roughly the same as they normally are because you're splitting your load between regions. Now if you stupidly just duplicate your current infrastructure entirely, yes it would be expensive because you'd be massively overpaying on DB.
In theory the only additional cost should be the latency-based routing itself, which is $50/month. Other than that, you'll probably save money if you choose the right regions.
Are the same services available in all regions?
Are the same instance sizes available in all regions?
Are there enough instances of the sizes you need?
Do you have reserved instances in the other region?
Are your increased quotas applied to all regions?
What region are your S3 assets in? Are you going to migrate those as well?
Is it acceptable for all user sessions to be terminated?
Have you load tested the other region?
How often are you going to test the region fail over? Yearly? Quarterly? With every code change?
What is the acceptable RTO and RPO with executives and board-members?
And all of that is without thinking about cache warming, database migration/mirror/replication, solr indexing (are you going to migrate the index or rebuild? Do you know how long it takes to rebuild your solr index?).
The startups you worked at probably had different needs the Roblox. I was the tech leach on a Rails app that was embedded in TurboTax and QuickBooks and was rendered on each TT screen transition and reading your comment in that context shows a lot of inexperience in large, production systems.
A lot of this can also be mitigated by going all in on API gateway + Lambda, like we have at Arist. We only need to worry about DB scaling and a few considerations with S3 (that are themselves mitigated by using CloudFront).
Are you implying that Roblox should move their entire system to the API Gateway + Lambda to solve their availability problems?
Seriously though, what is your RTO and RPO? We are talking systems that when they are down you are on the news. Systems where minutes of downtime are millions of dollars. I encourage you to setup some time with your CTO at Arist and talk through these questions.
1. When a company of Robolox's size is still in single-region mode by the time they've gone public, that is quite a red flag. As you and others have mentioned, game servers have some unique requirements not shared by traditional web apps (everyone knows this), however Roblox's constraints seem to be self-imposed and ridiculous considering their size. It is quite obvious they have very fragile and highly manual infrastructure, which is dangerous after series A, nevermind after going public! At this point their entire infrastructure should be completely templated and scripted to the point where if all their cloud accounts were deleted they could be up and running within an hour or two. Having 18,000 servers or 5 servers doesn't make much of a difference -- you're either confident you can replicate your infrastructure because you've put in the work to make it completely reproducible and automated, or you haven't. Orgs that have taken these steps have no problem deploying additional regions because they have tackled all of those problems (db read clones, latency-based routing, consistency, etc) and the solutions are baked into their infrastructure scripts and templates. The fact that there exists a publicly traded company in the tech space that hasn't done this shocks me a bit, and rightly so.
2. I mentioned API Gateway and Lambda because OP asked if in general it is difficult to go multi-region (not specifically asking about Roblox), and most startups, and most companies in general, do not have the same technical requirements in terms of managing game state that Roblox has (and are web app based), and thus in general doing a series of load balancers + latency based routing or API Gateway + Lambda + latency based routing is good approach for most companies especially now with ala carte solutions like Ruby on Jets, serverless framework, etc. that will do all the work for you.
3. That said, I do think that we are on the verge of seeing a really strong viable serverless-style option for game servers in the next few years, and when that happens costs are going to go way way down because the execution context will live for the life of the game, and that's it. No need to over-provision. The only real technical limitation is the hard 15 minute execution time limit and mapping users to the correct running instance of the lambda. I have a side project where I'm working on resolving the first issue but I've resolved the second issue already by having the lambda initiate the connection to the clients directly to ensure they are all communicating with the same instance of the lambda. The first problem I plan to solve by pre-emptively spinning up a new lambda when time is about to run out and pre-negotiate all clients with the new lambda in advance before shifting control over to the new lambda. It's not done yet but I believe I can also solve the first issue with zero noticable lag or stuttering during the switch-over, so from a technical perspective, yes, I think serverless can be a panacea if you put in the effort to fully utilize it. If you're at the point where you're spinning up tens of thousands of servers that are doing something ephemeral that only needs to exist for 5-30 minutes, I think you're at the point where it's time to put in that effort.
4. I am in fact the CTO at Arist. You shouldn't assume people don't know what they're talking about just because they find the status quo of devops at [insert large gaming company here] a little bit antiquated. In particular, I think you're fighting a losing battle if you have to even think about what instance type is cheapest for X workload in Y year. That sounds like work that I'd rather engineer around with a solution that can handle any scale and do so as cheaply as possible even if I stop watching it for 6 months. You may say it's crazy, but an approach like this will completely eat your lunch if someone ever gets it working properly and suddenly can manage a Roblox-sized workload of game states without a devops team. Why settle for anything less?
5. Regarding the systems I work with -- we send ~50 million messages a day (at specific times per day, mostly all at once) and handle ~20 million user responses a day on behalf of more than 15% of the current roster of fortune 500 companies. In that case, going 100% lambda works great and scales well, for obvious reasons. This is nowhere near the scale Roblox deals with, but they also have a completely different problem (managing game state) than we do (ensuring arbitrarily large or small numbers of messages go out at exactly the right time based on tens of thousands of complex messaging schedules and course cadences)
Anyway, I'm quite aware devops at scale is hard -- I just find it puzzling when small orgs have it perfectly figured out (plenty of gaming startups with multi-region support) but a company on the NYSE is still treating us-east-1 or us-east-2 like the only region in existence. Bad look.
You didn’t answer my only question.
Also, still sounding like you don’t understand how large systems like Roblox/Twitter/Apple/Facebook/etc are designed, deployed, and maintained-which is fine; most people don’t–but saying they should just move to llamda shows inexperience in these systems. If it is "puzzling" to you, maybe there is something you are missing in your understanding of how these systems work.
Correctly handling failure edge cases in a active-active multi-region distributed database requires work. SaaS DBs do a lot of the heavy lifting but they are still highly configurable and you need to understand the impact of the config you use. Not to mention your scale-up runbooks need to be established so a stampede from a failure in one region doesn't cause the other region to go down. You also need to avoid cross-region traffic even though you might have stateful services that aren't replicated across regions. That might mean changes in config or business logic across all your services.
It is absolutely not as simple as spinning up a cluster on AWS at Roblox's scale.
Roblox is not a startup, and has a significant sized footprint (18,000 servers isn't something that's just available, even within clouds. They're not magically scalable places, capacity tends to land just ahead of demand). It's not even remotely a simple case of just "run a script and wee we have redundancy" There are lots of things to consider.
18k servers is also not cheap, at all. They suggest at least some of their clusters are running on 64 cores, some on 128. I'm guessing they probably have a fair spread of cores.
Just to give a sense of cost, AWS's calculator estimates 18,0000 32 core instances would set you back $9m per month. That's just the EC2 cost, and assuming a lower core count is used by other components in the platform. 64 core would bump that to $18m. Per month. Doing nothing but sitting waiting ready. That's not considering network bandwidth costs, load balancers etc. etc.
When you're talking infrastructure on that scale, you have to contact cloud companies in advance, and work with them around capacity requirements, or you'll find you're barely started on provisioning and you won't find capacity available (you'll want to on that scale anyway because you'll get discounts but it's still going to be very expensive)
You have no idea what you're talking about when comparing their setup to "most startups you've worked with"
This was in reply to OP who said deploying to a new region is insanely complicated. In general it is not. For Roblox, if they are manually doing stuff in EC2, it could be quite complicated.
So Roblox need a button to press to (re)deploy 18,000 servers and 170,000 containers? They already have multiple core data centres, as well as many edge locations.
You will note the problem was with the software provided and supported by Hashicorp.
> It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage
Not sure I agree. Yes, network costs are higher, but your overall costs may not be depending on how you architect. Independent services across AZs? Sure. You'll have multiples of your current costs. Deploying your clusters spanning AZs? Not that much - you'll pay for AZ traffic though.
It is when you run your own date centers and have to shell out a large capital outlays to spin up a new datacenter.
The usual way this works (and I assume this is the case for Roblox) is not by constructing buildings, but by renting space in someone else's datacentre.
Pretty much every city worldwide has at least one place providing power, cooling, racks and (optionally) network. You rent space for one or more servers, or you rent racks, or parts of a floor, or whole floors. You buy your own servers, and either install them yourself, or pay the datacentre staff to install them.
Yes. If you are running in two zones in the hopes that you will be up if one goes down, you need to be handling less than 50% load in each zone. If you can scale up fast enough for your use case, great. But when a zone goes down and everyone is trying to launch in the zone still up, there may not be instances for you available at that time. Our site had a billion in revenue or something based on a single day, so for us it was worth the cost, but it not easy (or at least it wasn't at the time).
How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.
Right, outages get more expensive the larger you grow. What else needs to be thought of is not just the loss of revenue for the time your service is down but also it's affect on user trust and usability. Customers will gladly leave you for a more reliable competitor once they get fed up.
Multi-AZ is free at Amazon. Having things split amongst 3 AZ's cost no more than having in a single AZ.
Multi-Region is a different story.
There are definitely cost and other considerations you have to think about when going multi-AZ.
Cross-AZ network traffic has charges associated with it. Inter-AZ network latency is higher than intra-AZ latency. And there are other limitations as well, such as EBS volumes being attachable only to an instance in the same AZ as the volume.
That said, AWS does recommend using multiple Availability Zones to improve overall availability and reduce Mean Time to Recovery (MTTR).
(I work for AWS. Opinions are my own and not necessarily those of my employer.)
This is very true, the costs and performance impacts can be significant if your architecture isn't designed to account for it. And sometimes even if it is.
In addition, unless you can cleanly survive an AZ going down, which can take a bunch more work in some cases, then being multi-AZ can actually reduce your availability by giving more things to fail.
AZs are a powerful tool but are not a no-brainer for applications at scale that are not designed for them, it is literally spreading your workload across multiple nearby data centers with a bit (or a lot) more tooling and services to help than if you were doing it in your own data centers.
Having bare metal may not be less stress but AWS is by no means cheap.
AWS is not cheap, but splitting amongst AZ's is of no additional cost.
False
Data Transfer within the same AWS Region Data transferred "in" to and "out" from Amazon EC2, Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon ElastiCache instances, Elastic Network Interfaces or VPC Peering connections across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction.
https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_...
> AWS is not cheap
Wrong. Depends on the use case AWS can be very cheap.
> splitting amongst AZ's is of no additional cost.
Wrong.
" across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction. Effectively, cross-AZ data transfer in AWS costs 2¢ per gigabyte and each gigabyte transferred counts as 2GB on the bill: once for sending and once for receiving."
>> Having multiple fully independent zones seems more reliable
I don't think these independent zones exist. See AWS's recent outages, where east cripples west and vice versa.
Availability Zones aren't the same thing as regions. AWS regions have multiple Availability Zones. Independent availability zones publishes lower reliability SLAs so you need to load balance across multiple independent availability zones in a region to reach higher reliability. Per AZ SLAs are discussed in more detail here [1]
(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
[1]: https://aws.amazon.com/compute/sla/
> (N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
What he said was perfectly cogent.
Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.
Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
So, if you span multiple availability zones, you are not spared from events that will impact all of them.
> Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.
It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up for longer. AWS clearly states in their SLA pages what their EC2 instance SLAs are in a given AZ; it's 99.5% availability for a given EC2 instance in a given region and AZ. This is roughly ~1.82 days, or ~ 43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ then your system has a 99.5% availability SLA. Remember the cloud is all about leveraging large amounts commodity hardware instead of leveraging large, high-reliability mainframe style design. This isn't a secret. It's openly called out, like in Nishtala et al's "Scaling Memcache at Facebook" [1] from 2013!
The background of all of this is that it costs money, in terms of knowledgable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple days a year. But if you want to design high reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.
If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.
[1]: https://www.usenix.org/system/files/conference/nsdi13/nsdi13...
During a recent AWS outage, the STS service running in us-east-1 was unavailable. Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.
It's more subtle than that.
For high availability, STS offers regional endpoints -- and AWS recommends using them[1] -- but the SDKs don't use them by default. The author of the client code, or the person configuring the software, has to enable them.
[1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
(I work for AWS. Opinions are my own and not necessarily those of my employer.)
The client code which defaults to STS in us-east-1 includes the AWS console website, as far as I can tell.
Real question, though - are those genuinely separate endpoints that remained up and operational during the outage? I don’t think I saw or knew a single person unaffected by this outage, so either there’s still some bleed over on the backend or knowledge of the regional STS endpoints is basically zero (which I Can believe, y’all run a big shop)
My team didn't use STS but I know other teams at the company did. Those that did rely on non-us-east-1 endpoints did stay up IIRC. Our company barely use the AWS console at all and base most of our stuff around their APIs to hook into our deployment/CI processes. But I don't work at AWS so I don't know if it's true or if there was some other backend replication lag or anything else going on that was impacted by us-east-1 being down. We had some failures for some of our older services that were not properly sharded out, but most of our stuff failed over and continued to work as expected.
> Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
That's not true. STS offers regional endpoints, for example if you're in Australia and don't want to pay the latency cost to transit to us-east-1 [1]. It's up to the user to opt into them though. And that goes back to what I was saying earlier, you need engineers willing to read their docs closely and architect systems properly.
[1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
> knowledgable engineers (not like the kinds in this comment thread who are conflating availability zones and regions)
I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
That is, I've read the comments to say "they're not only in different AZ's, they're in different regions". It seems you seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
> Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
> That is, I've read the comments to say "they're not only in different AZ's, they're in different regions"
So I've read. The earlier example about STS that someone brought up was incorrect; both I and another commenter linked to the doc with the correct information.
> It seems you seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day. When I see people here make easily falsifiable comments full of hearsay ("I had a friend of a friend who works at Amazon and they did X, Y, Z bad things") and use that to drum up a frenzy, it flies in the face of what I do everyday. There's lots of issues with cloud providers as a whole and AWS in particular but to get to that level of conversation you need to understand what the system is actually doing, not just get angry and guess why it's failing.
> > being in a different region implies being in a different availability zone.
> Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
Right.. So if you are in a different region, you are by definition in a different availability zone.
> You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Yah, I really thought about it and you're just reeking unkindness. And the people above that you're replying to and mocking are not wrong.
> Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day.
If you're unable to be civil about this, maybe you should avoid the threads. Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
I've got >20 years of experience in building geographically distributed, sharded, and consensus-based systems. I think you are being unfair to the people you're discussing with. Be nice.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions).
there is a distinction between azs within a region vs azs in different regions. the overwhelming majority of services are offered regionally and provide slas at that level. services are expected to have entirely independent infrastructure for each region, and cross-regional/global services are built to scope down online cross regional dependencies as much as possible.
the specific example brought up (cross regional sts) is wrong in the sense that sts is fully regionalized as evidenced by the overwhelming number of aws services that leverage sts not having a global meltdown. but as others mentioned in a lot of ways it’s even worse because customers are opted into the centralized endpoint implicitly.
> If you're unable to be civil about this, maybe you should avoid the threads.
I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heed when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway point noted and I'll try to keep my snark down.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage, SLA-impacting or not, occurs, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things that go wrong which cause a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures and the defensive guarantees that failed in order show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.
> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Sure if you're not operating high reliability services at high scale, it's true, you don't need cross-AZ or cross-region failover. But if you chose, through balance sheet or ignorance, not to take advantage of AWS's reliability features then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.
> I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something?
... I still don't think your overall starting assertions about the other people not understanding regions vs. AZs is correct, and it triggered you to repeatedly assert that the people you were talking to are unskilled.
I could very easily use the same words as them, and I have decade-old spreadsheets where I was playing with different combinations of latencies for commits and correlation coefficients for failures to try and estimate availability.
> Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
I remember 2011, where EBS broke across all US-EAST AZs and lots of control plane services were impacted and you couldn't launch instances across all AZs in all regions for 12 hours.
Now maybe you'll be like "pfft, a decade ago!". I do think Amazon has significantly improved architecture. At the same time, AZs and regions being engineered to be independent doesn't mean they really are. We don't attain independent, uncorrelated failures on passenger aircraft, let alone these more complicated, larger, and less-engineered systems.
Further, even if AWS gets it right, going multi-AZ introduces new failure modes. Depending on the complexity of data model and operations on it, this stuff can be really hard to get right. Building a geographically distributed system with current tools is very expensive and there's no guarantee that your actual operational experience will be better than in a single site for quite some time of climbing the maturity curve.
> Their guarantees are written on their SLA pages.
Yup, and it's interesting to note that their thresholds don't really assume independence of failures. E.g. .995/.990/.95 are the thresholds for instances and .999/.990/.950 are the uptime thresholds for regions.
If Amazon's internal costing/reliability engineering model assumed failures would be independent, they could offer much better SLAs for regions safely. (e.g. back of the envelope, 1- (.005 * .005) * 3C2 =~ .999925 ) Instead, they imply that they expect multi-AZ has a failure distribution that's about 5x better for short outages and about the same for long outages.
And note there's really no SLA asserting independence of regions... You just have the instance level and region level guarantees.
Further, note that the SLA very clearly excludes some causes of multi-AZ failures within a region. Force majeure, and regional internet access issues beyond the "demarcation point" of the service.
Yes, but the underlying point you're willfully missing is:
You can't engineer around AWS AZ common-mode failures using AWS.
The moment that you have failures that are not independent and common mode, you can't just multiply together failure probabilities to know your outage times.
Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
It's not. .1% of 36524 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!
For a more complete list of their SLA's for every service: https://aws.amazon.com/legal/service-level-agreements/?aws-s...
They only refund 100% when they fall below 95% of availability! 95-99= 30%. I believe the real target is above 99.9% though, as that results in 0 refund to the customer. What that means is, 3 days of downtime is acceptable!
Alternatively, you can return to your own datacenter and find out first hand that it's not particularly as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdowns.
Anywho, they have a lot more room in their published SLA's than you think.
Edit: as someone correctly pointed out i did a typo in my math. it is only ~9 hours of aloted downtime. Keeping in mind that this is per service though - meaning each service can have a different 9 hours of downtime before they need to pay out 10% of that one service. I still stand by my statement thier SLA's have a lot of wiggle room that people should take more seriously.
As someone else said, your math is off. Your point is still reasonable, though.
The uptime.is website is a handy resource for these calculations. For example, http://uptime.is/99.9 says
"SLA level of 99.9 % uptime/availability results in the following periods of allowed downtime/unavailability:
Your computation is incorrect, 3 days out of 365 is 1% of downtime, not 0.1%. I believe your error stems from reporting .1% as 0.1. Indeed:
0.001 (.1%) * 8760 (365d*24h) = 8.76h
Alternatively, the common industry standard in infrastructure (the place I work at at least,) is 4 nines, so 99.99% availability, which is around 52 mins a year or 4 mins a month iirc. There's not as much room as you'd think! :)
> Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.
>> you need to load balance across multiple independent availability zones
The only problem with that is, there are no independent availability zones.
What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split brain scenario, and then, half the internet goes down.
> The only problem with that is, there are no independent availability zones.
There are - they can be as independent as you need them to be.
Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.
I'm not talking about my global app. I'm talking about the system I deploy to, the actual plumbing, and how a huge turd in a western toilet causes east's sewerage system to over-flow.
That's not how they work. They exist, and work extremely well within their defined engineering / design goals. It's much more nuanced than 'everything works independently'.
If the design goal of these zones is that they should be independent of each other then, no, they do not work extremely well.
> I don't think these independent zones exist.
Wouldn't it be possible to create fully independent zones with multiple cloud providers, like AWS, GCP, Azure? This is assuming that your workloads don't rely on proprietary services from a given provider.
Yes, and would also protect you from administrative outages like, "AWS shut off our account because we missed the email about our credit card expiring."
(But wouldn't protect you from software/configuration issues if you're running the same stack in every zone.)
There have been multiple discussions on HN about cloud vs not cloud and there are endless amount of opinions of "cloud is a waste blah blah".
This is exactly one of the reasons people go cloud. Introducing an additional AZ is a click of a button and some relatively trivial infrastructure as code scripting, even at this scale.
Running your own data center and AZ on the other hand requires a very tight relationship with your data center provider at global scale.
For a platform like Roblox where downtime equals money loss (i.e. every hour of the day people make purchases), then there is a real tangible benefit to using something like AWS. 72 hours downtime is A LOT, and we're talking potentially millions of dollars of real value lost and millions of potential in brand value lost. I'm not saying definitively they would save money (in this case profit impact) by going to AWS, but there is definitely a story to be had here.
But it wasn't a hardware issue. It was a software one and that would have crossed AZ boundaries.
So then why does the post mortem suggest setting up multi-az to address the problems they encountered?
I took that to mean sharding Roblox instead of spanning it across data center AZs.
FTA:
> Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. We have efforts underway to move to multiple availability zones within these data centers; we have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.
If they were in AWS they could have used Consul across multi-AZs and done changes in a roll out fashion.
So that next time they can spend 96 hours on recovery, this time adding a split brain issue to the list of problems to deal with. Jokes aside, the write-up is quite good after after thinking about all the problems they had to deal with, I was quite humbled.
It doesn't really explain how they reached the conclusion that that would help. Like, yes, it's a problem that they had a giant Consul cluster that was a SPOF, but you can run multiple smaller Consul clusters within a single AZ if you want.
Honestly it reads to me like an internal movement for a multi-AZ deployment successfully used this crisis as an opportunity.
I'm more impressed that it hasn't been an issue until now.
> Having multiple fully independent zones seems more reliable and failsafe.
This also introduces new modes of failure which did not exist before. There are no silver bullets for this problem.
There are no silver bullets to any problem, but there are other ways of implementing services and architecture that can sidestep these things.
No surprised at all. Multi AZ is a PITA. You'd be surprised how many 7fig+/month infra is single region/az
For example parts of AWS itself. us-east-1 having issues? Looks like aws console all over the world have issues.
You constantly hear about multi zone, region, cloud. But in practice when things break you hear all these stories of them running in a single region+zone
A guess would be that game servers are distributed across the globe but backend services l are in one place. A common pattern in game companies.
> 50th percentile
I would normally not call this out, but it is repeated so often in the text that it is jarring. Just call it "median" as it is everywhere else, please.
On the other hand, I must commend the author(s) for not using "based off of" :-)
Great write-up, otherwise.
Love NATS for not having to deal with service discovery at all.
Excellent write up. Reading a thorough, detailed and open postmortem like this makes me respect the company. They may have issues but it sounds like the type of company that (hopefully) does not blame, has open processes, and looks to improve - the type of company I'd want to work for!
> the type of company I'd want to work for!
I recommend watching the following:
https://www.youtube.com/watch?v=_gXlauRB1EQ
https://www.youtube.com/watch?v=vTMF6xEiAaY
The first video reveals a more general issue that is not specific to Roblox: child labor in the marketplace of monetized user generated content. There are plenty of under-18 YouTubers. It's not even just online content: these questions came up in the entertainment industry a long time ago, but in that industry at least some safeguards were put in place.
But do those other places pay the creators such small percentages, and also do everything in their power to avoid paying real $? As far as I know Youtube doesn't have their own currency.
The % cut of $revenue is outside the main scope of my comment, which concerned child labor, a somewhat independent issue. It doesn't matter too much to me the % monthly revenue a 12 year old kid gets, I'm more concerned with the promise of unlikely riches encouraging kids to work long hours outside the oversight of traditional child labor laws. If that 12 year old is putting in a 30 hour work weeks then I think it's problematic regardless of the revenue with some minimal enforceable guardrails. I don't think parental signoff is sufficient either: the entertainment industry has plenty of examples of how that can go wrong, and also how some of those minimum guardrails might work.
Do those other places host creators' applications for free regardless of scale, and without injecting ads into your entertainment?
Youtube doesn't have their own currency simply because they don't have to, not because they are kind. Their major source of income is ads and subscription. Neither of these needs their own currency. Highly unsure what your point is.
host creators' applications for free regardless of scale
Yes? Your Youtube video / livestream can have tens of thousands of simultaneous viewers across the globe.
> without injecting ads into your entertainment?
That's an odd complaint that hyper-specific to how Youtube monetizes content. Look at Steam, Play Store, AppStore, all those host your app/game for free, regardless of the scale, and only take a 30% cut.
> Highly unsure what your point is.
Fine, here's another way to explain it. The minimum amount to take out money in most places, including Youtube, is 100$. Why is it 1000$ in Roblox?
1. You are comparing interactive game servers to streaming service? Sure there are free CDN you can use, but are there free servers alternative you can use on the market regardless of scale and traffic? The pricing of these two services has fundamental difference on the market. If you know one of such service, name it. I would be glad to use it.
2. You are comparing Steam, Play Store, App store to Roblox? Are they providing free servers for multi-player experience? Does Steam let you host free servers? Sure they provide a way to download the app, but none of them provides free servers along with free network traffic. What makes you think they should be priced the same way when the services they provide are fundamentally different?
3. What makes you think $100 is reasonable, while $1000 is not reasonable? Every business has their own way for monetization. If you want to claim $100 is reasonable and should be the industry standard, what is your argument to support it? If Youtube is so kind, why don't they just not have any minimum take out amount in the first place, like most e-commerce companies do? Your questions are baseless in the first place by assuming there is certain non-existent rules that need to be followed. You can dislike it, but claiming $1000 limit it set to abuse child labor is completely un-based.
> There are plenty of under-18 YouTubers.
...I guess I always assumed that the ones who were monetized had parental involvement. Does Youtube allow minors to do that themselves?
I don't understand how the stuff Robox does (and maybe Youtube?) doesn't run afoul of child labor laws...
I'm not sure either. Particularly how, even in the presence of any policies, they could police the system: Does YouTube send someone around to check on the Ryan's World kid and the work environment?
Even with child labor policies it doesn't seem like platforms would be much better at managing them than content moderation.
Well, which is why I don't think Roblox should be allowed to pay children at all.
That seems like it would be far worse for the children in question than the current situation.
A partial solution might be that games by developers without legal age verification can't utilize real money transactions via the Robux currency. That way Roblox isn't rewarded directly for it either.
Your solution to children who want to sell things on Roblox getting exploited by unscrupulous middlemen agencies is to make it impossible to bypass those agencies?
This youtuber needs to make a living.
Roblox is great that it built up an ecosystem where people can contribute and get rewarded. It is a positive feedback loop.
Not like open source software, where the financial loop is broken. I am pretty sure the Bolt creator did not get anything from HashCorp for his work.
Too bad they exploit young game developers by taking a 75.5% cut of their earnings. Big yikes of a red flag for me. https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp...
To add, there is a nice documentary here[1] which also has a followup[2] that show even more of the issue at hand. Kids making games and only getting 24.5% of the profit is one thing, but everything else that Roblox does is much worse.
[1] https://youtu.be/_gXlauRB1EQ
[2] https://youtu.be/vTMF6xEiAaY
The 24.5% cut is fine, you have to consider the 30% app store fees for a majority mobile playerbase, all hosting is free, moderation is a major expense, and engine and platform development.
Successful games subsidize everyone else, which is not comparable to Steam or anything else.
Collectible items are fine and can't be exchanged for USD, Roblox can't arbitrate developer disputes, "black markets" are an extremely tiny niche. A lot of misinformation.
It's annoying to see these videos brought up every single time Roblox is mentioned anywhere for these reasons. Part of the blame lies with Roblox for one of the worst PR responses I have seen in tech, I suppose.
> The 24.5% cut is fine, you have to consider the 30% app store fees for a majority mobile playerbase, all hosting is free, moderation is a major expense, and engine and platform development.
You have successfully made the case for a 45% fee and being considered approximately normal, or a 60% fee and being considered pretty high still. 75+% is crazy.
I can't think of any other platform with comparable expenses. Traditional game engines have the R&D component, but not moderation, developer services, or subsidizing games that don't succeed.
It helps that seriously marketing a Roblox game costs < $1k USD always, usually < $200 USD. It's not easy to generate a loss, even when including production costs. That's the tradeoff.
Crazy compared to what? Nobody else is offering what Roblox offers, and building it yourself is a non-starter for almost everyone.
I have less a problem with the cut, and more a problem with how they achieve it. It harkens back to company towns paying workers in company credit that is expensive to convert to USD.
This % includes cost of all game server hosting, databases, memory stores, etc. even with millions of concurrents, app store fees, etc. All included in that number. Developer gets effectively pure profit for the sole cost of programming/designing a great game. Taught me how to program, & changed my entire future. Disclosure: My game is one of most popular on the platform.
And that's a reasonable decision for an adult to make, and if they were targeting an adult developer community.
I don't think anyone objects to adults making that choice over say, using Unity or Unreal, and targeting other platforms.
In practice, explaining to my son who is growing into an avid developer why I won't a) help him build on Roblox, or b) fund his objectives of advertising and promoting his work in Roblox (by spending Roblox company scrip) on the platform has necessitated helping him to learn and understand what exploitation means and how to recognize it.
It's a learning experience for him, and a challenging issue for me as a technically proficient and financially literate parent who actually owns and run businesses related to intellectual property. It's got to be much more painful for parents who lack in any of those three areas.
Are you really suggesting that Roblox's cut should be lower purely because the target market is children? Why? If anything, the fact that a kid can code a game in a high-level environment and immediately start making money—without any of the normal challenges of setting up infrastructure, let alone marketing and discovery—is amazing, and a feat for which Roblox should definitely be rewarded.
In any case, what's the alternative? To teach your son how to build the game from scratch in Unity, spin up a server infrastructure that won't crumble with more than a few concurrent players (not to mention the cash required for this), figure out distribution, and then actually get people to find and play the game? That seems quite unreasonable for most children/parents.
If this were easy, a competitor would have come in and offered the same service with significantly lower fees.
The problem is that robolox essentially lies to kids (by omission) in an attempt to get free labor out of them.
Yes, I agree that the deception is a problem, although I admit I'm not well versed in the issue. (I'm watching the documentary linked elsewhere now.) But the original claim was that they were exploiting young developers by taking a big cut of revenues, which I disagree with.
> And that's a reasonable decision for an adult to make, and if they were targeting an adult developer community.
If it's a reasonable decision for an adult to make because the trade-offs might be worth it, doesn't that mean that it would also be reasonable for a child to make the same decision for the same reason?
It's either exploitative or it isn't, the age of the developer doesn't alter the trade-offs involved.
No, because a child is not deemed to have the necessary faculties to make these decisions.
The question should not be posed to a child, that is the law for child labour, and why we do not have children gambling on roulette wheels.
Western society says that some decisions are only able to be made by people who are old enough. If you think about other decisions like gambling at a casino, joining the army or purchasing alcohol, then it might help you understand where they're coming from.
Does your son have other alternatives to learn programming and make money other than Roblox?
If there are, then it's a great lesson about looking outside of one's immediate circumstance and striving towards something better.
Very cool, the Jailbreak creator! Do such popular games earn enough to be able to retire? (although you wouldn't actually retire, since working is more fun)
Pretty darn cool hearing from you! But yes, you're correct on both points :)
Congrats, I haven't made it that far yet, buy enjoy working anyway.
Again, as across-thread: this is a tangent unrelated to the actual story, which is interesting for reasons having nothing at all to do with Roblox (I'll never use it, but operating HashiStack at this scale is intensely relevant to me). We need to be careful with tangents like this, because they're easier for people to comment on than the vagaries of Raft and Go synchronization primitives, and --- as you can see here --- they quickly drown out the on-topic comments.
The idea that these children would otherwise be making their own games is knowingly, generally wrong.
No matter what the cut is I think there are some legitimate social questions to ask about whether want young people to be potentially exposed to economic pressure to earn or whether we'd rather push back aggressively against youth monetization to preserve a childhood where, ideally, children get to play.
I know there are lots of child actors and plenty of household situations that make enjoying childhood difficult for many youths - but just because we're already bad at a thing doesn't mean we should let it get worse. Child labour laws were some of the first steps of regulation in the industrial revolution because inflation works in such a way where opening the door up to child labour can put significant financial pressure on families that choose not to participate when demand adjusts to that participation being normal.
More egregiously, they're (per your article) manipulating kids into buying real ads for their creations, with the false promise that "you could get rich -- if you pay us".
>"As there are no discoverability tools, users are only able to see a tiny selection of the millions of experiences available. One of the ways boost to discoverability is to pay to advertise on the platform using its virtual currency, Robux."
(Note that this "virtual" currency is real money, bidirectionally exchangeable with USD).
The sales pitch is "get rich fast":
>"Under the platform’s ‘Create’ tab, it sells the idea that users can “make anything”, “reach millions of players”, and “earn serious cash”, while its official tutorials and support website both “assume” they are looking for help with monetisation."
I agree that this doesn't really look like a labor issue. That's distracting and contentious tangent; it's easier to just label this as a kind of consumer exploitation. (Most of the people involved aren't earning money -- but they are all paying money). It's a scam either way.
I am naive about the reality on the ground when it comes to this issue, but doesn't this hinge on transparency? If they can show they are covering costs + the going market rate, which seems to be 30% (at best), then wouldn't it be reasonable? So is a 45% cut for infra ok or not seems to be the question.
By that logic, Dreams is "exploiting" developers by taking a 100% cut of their earnings. Making money isn't the point of either of these platforms.
Or how about giving a free platform to get into games development for young people that otherwise wouldn't have become interested.
The solution is creating a competing platform and offering a better cut. You up for the task?
Edit to add: lazy people downvote.
This is an interesting debate to have somewhere, but it has nothing to do with this thread. We need to be careful about tangents like this, because it's a lot easier to have an opinion about the cut a platform should take from UGC than it is to have opinion about Raft, channel-based concurrency, and on-disk freelists. If we're not careful, we basically can't have the technical threads, because they're crowded out by the "easier" debate.
True, it is off topic to the postmortem. However, the top comment talks about wanting to work there. I get is is very relevant to see a bigger picture. Personally, I could never work for them. I have a kid and the services and culture they created around their product is sickening and should be made illegal.
Even so, if you're responding to the "wanting to work there" you could do so in a less snide manner that's more in line with HN's guidelines on constructive commenting.
You’ve pretty much articulated for me why I’ve been commenting on Reddit less and less frequently.
I loathe the constant riffing on <related and yet nothing indicates it is actually related/> topics.
Sadly it is happening here on HN too, < insert the next blurb about corporatism/>
Guess we need to find the next space lol
Took long enough to find this place.
While I personally think digitalengineer's comment was low-effort and knee-jerk, I think this general thread of discussion is on topic for the comment replied to, which was specifically about how the postmortem increased the commenter's respect for Roblox as a company and made them want to work there. I think an acceptable compromise between "ethical considerations drown out any technical discussion" and "any non-technical discussion gets downvoted/flagged to oblivion" would be to quarantine comments about the ethics of Roblox's business model to a single thread of discussion, and this one seems as good as any.
The guidelines and zillions of moderation comments are pretty explicit that doesn't count as 'on topic'. You can always hang some rage-subthread off the unrelated perfidy, real or perceived, of some entity or another. This one is extra tenuous and forced given that 'the type of company I'd want to work for' is a generic expression of approval/admiration.
You can get to a sane place just by mentally substituting "engineering team" for the word "company" in the original comment.
As one of the top developers on the platform (& 22 y/o, taught myself how to program through Roblox, ~13 years ago), I can say that it seems a majority of us in the developer community are quite unhappy with the image this video portrays. We love Roblox.
That's kind of on Roblox then for not answering their questions transparently.
I think what bothers me the most is the effective 'pay to play' aspect
My son loves it, I think it is a great way to learn.
Yeah as long as Roblox is exploiting children they're just flat-out not respectable. This video is a good look at a phenomenon most people are unaware of.
Players of your game creating content for it is not exploitation. It's just how it works in the gaming world. When I was a kid I spent time creating a minecraft mod that hundreds of people used. Did Mojang or anyone else ever pay me? No. I did it because I wanted to.
The way they're paying kids and what they're telling them is a big part of the problem... they're pushing a lot of the problematic game development industry onto kids that are sometimes as young as 10.
If this was free content creation when kids want to do it, then it would be an entirely different story.
Mojang was likely not selling you on making a mod with promises of making money though. Roblox did that, maybe they still do it.
Please review the video. The problem is not ‘players creating content’.
Sounds like they need to switch to Kubernetes?
I kid of course. One of the best post-mortems I've seen. I'm sure there are K8s horror stories out there of etcd giving up the ghost in a similar fashion.
The one thing you can say about Nomad is that's generally incredibly scalable compared to Kubernetes. At 1000+ nodes over multiple datacenters, things in Kube seem to break down.
Do they still? GKE supports 15,000 nodes per cluster.
you joke, but it's precisely this:
>Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.
which gives me goosebumps whenever I hear people proselytizing everything run on Kubernetes. At some point, it makes good sense to keep capabilities isolated from each other, especially when those functions are key to keeping the lights on. Mapping out system dependencies (either systems, software components, etc) is really the soft underbelly of most tech stacks.
>Sounds like they need to switch to Kubernetes?
Hah! Good one!
Could you please stop posting unsubstantive comments to HN? You've done it quite a bit, unfortunately. We're trying for a different quality of discussion here.
https://news.ycombinator.com/newsguidelines.html
warning, completely pedantic pet peeve.
> Note all dates and time in this blog post are in Pacific Standard Time (PST).
But the incident was during PDT. Just use UTC or colloquial "Pacific time" or equiv and never be wrong!
My heart goes out to these people. I can imagine how much sustained terror they were feeling, stare hard and harder at your terminals and still nothing makes sense.
>for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem
I don't understand this logic. are they basically saying that their servers are on average closer to the user than mainstream cloud infra? are they e.g. choosing to have N satellite servers around a city instead of N instances at one cloud provider location in the centre of the city? is it the sparseness of the servers that decreases the latency?
or is it more to do with avoiding the herd, i.e. less trafficky routes / beating the queues?
it's also unclear whether they use their own hardware on rented rackspace as that could potentially lower costs too
Cloud providers are rarely in cities. Google's biggest region is in the middle of Iowa, Amazon's is in Virginia.
If you have a latency sensitive application (like multiplayer games) it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.
As they point out elsewhere, the cost of infrastructure directly impacts their ability to pay creators on the platform. Doing it yourself will always be cheaper, and they hired the smart people to make it happen.
> it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.
Large cloud providers have a backbone network with interconnects to many ISPs reducing the amount of Hops a client has to take across the internet.
> Doing it yourself will always be cheaper
Treating the Cloud as a traditional IAAS Datacenter extension will be more expensive. By utilising PAAS, only using resource that's needed and when it's needed, etc.. is much cheaper.
So the outage lasted 3 days and the postmortem took 3 months!
Read the article " It has been 2.5 months since the outage. What have we been up to? We used this time to learn as much as we could from the outage, to adjust engineering priorities based on what we learned, and to aggressively harden our systems. One of our Roblox values is Respect The Community, and while we could have issued a post sooner to explain what happened, we felt we owed it to you, our community, to make significant progress on improving the reliability of our systems before publishing."
They wanted to make sure everything was fixed before publishing
They just got out of their busiest time of year, and taking the time to write an accurate post mortem with data gleamed afterwards seems sensible to me.
I would not have guessed Roblox was on-prem with such little redundancy. Later in the post, they address the obvious “why not public cloud question”? They argue that running their own hardware gives them advantages to cost and performance. But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up. It will be interesting to see how well this architecural decision ages if they keep scaling to their ambitions. I wonder about their ability to recruit the level of talent required to run a service at this scale.
Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties.
Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.
This is not true, if they handled the rollout properly. Companies like Uber have two entirely different data centers and during outages they failover you either datacenter.
Everything is duplicated which is potentially wasteful but ensures complete redundancy and it’s an insurance policy. If you rollout, you rollout to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.
> So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.
But this has nothing to do with cloud vs. colo.
The parent poster said that it would have happened even if they had cloud, ie. another datacenter. That's my assumption for the comment.
As far as I can tell from reading, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if that's not true, then my point would be incorrect. If it is true, then if they completely duplicated their datacenters, they would be able to make the switch in one datacenter to streaming while keeping the other datacenter the old setting until they validated that everything was fine. That would have caught the problem, having slow rollout across datacenters.
Uber is also a service that has a much lower tolerance for downtime: If people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers apps stop working suddenly, the stranded people get very upset in a hurry, and the company loses a lot of customers.
It can be totally reasonable for Uber to pay for 2x the amount of infra they need for serving their products while not being worth it for a company like Roblox.
The Consul streaming changes were rolled out months before the incident occurred.
You didn't read it properly. The changes were rolled out months before, but the switch to streaming based on that rollout was made 1 day before the incident. That was the root cause.
I think the public cloud is a good choice for startups, teams, and projects which don't have infrastructure experience. Plenty of companies still have their own infrastructure expertise and roll their own CDNs, as an example.
Not only can one save a significant amount of money, it can also be simpler to troubleshoot and resolve issues when you have a simpler backend tech stack. Perhaps that doesn’t apply in this case, but there are plenty of use cases which don’t need a hundred micro services on AWS, none of which anyone fully understands.
> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up
You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.
>I wonder about their ability to recruit the level of talent required to run a service at this scale.
According to this user's comments, it doesn't look like it'll be that tough for them:
https://news.ycombinator.com/item?id=30014748