Tell HN: AWS connectivity issues, but health dashboard says everything fine

114 points by ccleve a year ago

About 15 minutes ago I got a call from a customer that our site was down. "Works for me," I said, because I could bring it up on my laptop. The customer said he couldn't get it on his phone, and then I confirmed I couldn't get it on my phone either.

The AWS Health Dashboard (https://health.aws.amazon.com/health/status) reports no issues at all. But DownDetector (https://downdetector.com/status/aws-amazon-web-services/) shows a spike in reports.

I can't even reach the AWS console through my phone.

So, AWS has connectivity issues to certain networks and their own health dashboard is lying to us about it. What gives?

(All of this accurate as of 2:10 pm CST).

Update: as of 2:26 pm CST, the health dashboard reports that they are "investigating an issue". So, 45 minutes after Down Detector sees it, they do.

hedgehog_irl a year ago

Everyone seems to overlook the point here: that yet again Amazon were slow as hell to be honest with their customers. I get it, up/down reports help, but why do you keep using a service which lies to you about availability? I've read on HN in the past that the dashboard can only be updated to reflect an issue with approval (comments section on a similar posting, believe it if you wish). So why not move to a hosting company that is transparent and open about its status? I'll not make suggestions, as I don't want to be accused of trying to shill for a specific provider, but there are plenty out there. 45 minutes to update their public dash is too slow. Either they don't care, they don't monitor, or they are trying to hide their stats for fear Jeff will beat the staff for SLA violations. If any other provider lied to customers the way AWS does, it wouldn't be tolerated. Why do you tolerate this behaviour from AWS?

Edited to fix auto correct issues

  • salil999 a year ago

    Good luck convincing your company "Hey because they were 45 minutes late in informing us we need to move all our cloud to a different provider."

    Updating a dashboard can easily be an automated process, but for business reasons it is not. AWS did not "lie" about the incident - they are extremely transparent about all outages and disruptions (btw, this was a disruption, not an outage). They stated on the issue the exact time frame for when it started and when it ended.

    Is it bad they were late? Definitely. AWS has a history of being late due to the sheer scale it works at. I caused an outage myself when I used to work there, and updating the dashboard requires several higher-ups to understand what exactly the issue is and what is considered worthy of "informing of an incident." These processes take time. Is it perfect? Absolutely not. But there are legitimate reasons for it.

    I'm not sure why you think Jeff is involved here. This kind of disruption isn't enough to warrant someone as high as Jeff to be involved.

    As for SLA violations, AWS publishes public SLAs for every service and credits your account if it ever dips below those defined thresholds. And as for caring, I don't know a single cloud provider with the level of great customer support AWS has. This is extremely opinionated, but it's what I've observed in the industry.

    I would recommend people to use AWS monitoring. But some basic internal dashboards / metrics of your own are also worth having.

    • merek a year ago

      > I would recommend people to use AWS monitoring

      Use a monitoring service to monitor the provider of the monitoring service? Wouldn't it be better to use a monitoring service hosted on a totally different provider?

      I'm not even sure running your own monitoring is sufficient in this case. Sure it's useful to have, but when something goes wrong, the first thing I want to know is if it's us or them. If it's us, I/the team scramble to fix it in a panicked frenzy. If it's them (the cloud provider), and they acknowledge it early, even a simple "we're investigating an issue with X", we can at least take some comfort from the fact that it's out of our hands.

      If we just don't know the cause, we assume it's us and jump into panicked frenzy mode. Panicked frenzy days are the worst days of my life, especially if it's discovered that it was all in vain.
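
      To make the "us or them" check cheap, the sort of thing I mean is a tiny off-cloud probe (Python here; the hostnames are made up) running on a box hosted anywhere except your main cloud provider, hitting both the real app and a dumb static canary served from the same region:

        # Minimal off-cloud probe. Run from a host *outside* AWS.
        # The two URLs are placeholders for your own app and a bare-bones
        # static canary served from the same AWS region.
        import urllib.request
        import urllib.error

        TIMEOUT = 5  # seconds

        def probe(url):
            """Return True if the URL answers with an HTTP status below 500."""
            try:
                with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
                    return resp.status < 500
            except (urllib.error.URLError, OSError):
                return False

        app_ok = probe("https://app.example.com/healthz")
        canary_ok = probe("https://canary.example.com/")

        if app_ok:
            print("reachable from this vantage point")
        elif canary_ok:
            print("app down, canary up: probably us -> start the frenzy")
        else:
            print("app AND canary down: likely the provider or network path")

      It doesn't prove anything by itself, but "both down from outside" is usually enough to stop tearing apart your own deploy first.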

  • remus a year ago

    I understand the frustration, but I'm not convinced monitoring at large scale is that straightforward.

    The core question is: what constitutes degraded service? Would you say a service is experiencing downtime every time a 500 response is served? If you're serving millions to billions of requests/sec, it seems a bit disproportionate to mark a service down after a single 500 error, so you need to work out some kind of acceptable threshold.

    What about latency? Again you're just going to draw a line in the sand somewhere.

    You end up with this big mix of metrics that define service quality, so you then have a kind of meta problem of deciding which metrics you should alert users on. Get too trigger happy and it's going to cost you money and customer trust, and your customers are going to get alert fatigue when it turns out the issue you alerted them about was more of a false alarm. Set the bar too high and you'll have angry customers wondering wtf is going on.
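
    Just to make the line-in-the-sand concrete, here's a toy version (Python; every number in it is arbitrary, not anything AWS actually uses) of the kind of rule you end up writing:

      # Toy "is the service degraded?" rule: flag it only when the 5xx rate
      # stays above a threshold for several consecutive windows.
      from collections import deque

      ERROR_RATE_THRESHOLD = 0.001   # 0.1% of requests may 5xx before we care
      CONSECUTIVE_BAD_WINDOWS = 3    # require sustained badness, not one blip

      recent = deque(maxlen=CONSECUTIVE_BAD_WINDOWS)

      def record_window(total_requests, error_responses):
          """Call once per time window with that window's counters."""
          rate = error_responses / total_requests if total_requests else 0.0
          recent.append(rate > ERROR_RATE_THRESHOLD)

      def service_degraded():
          """True only if every tracked window breached the threshold."""
          return len(recent) == CONSECUTIVE_BAD_WINDOWS and all(recent)

      # A lone 500 among millions of requests never trips this; a sustained
      # spike does.
      record_window(2_000_000, 12)       # ~0.0006% errors -> fine
      record_window(1_800_000, 9_000)    # 0.5% errors     -> bad window
      record_window(1_900_000, 11_000)   # bad window
      print(service_degraded())          # False: only 2 consecutive bad windows

    Every constant in there is a judgment call, which is exactly the problem.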

    All that to say I don't think there's a right answer.

    • leesalminen a year ago

      We were pretty liberal with posting to our status page for years and thought it was The Right Thing to do. I still do, to a point.

      But what ended up happening was that a competitor who didn't have a status page at all would use our status page against us in the sales process. They just never mentioned their own lack of a status page to compare against.

      This was the same competitor who went 100% down for ~4 days during the busiest month of the year and only posted updates to a private Facebook group. There was data loss that was never publicly admitted to.

      So, yeah, we implemented reasonable boundaries on what constitutes a post to the status page. We also adopted a new status page provider that lets us get more granular with categorizing posts and allows users to subscribe only to the "urgent" channels that pertain to them.

      • lamontcg a year ago

        Before 2003-ish, Amazon used to have a static "gonefishing" page on www.amazon.com that was manually triggered during outages. Because newspaper reporters wrote scripts to detect the GF pages, they were removed and the site was allowed to just spew 500s for whatever segment of critical pages was busted.

    • hedgehog_irl a year ago

      Very fair, but 45 minutes of an outage/disruption before manually updating the public status is poor service. Why is that acceptable for AWS to deliver to users?

  • ioman a year ago

    AWS is the 800 pound gorilla in the cloud space. Are any of the other cloud providers better with customer honesty?

    • autotune a year ago

      Also, good luck trying to convince your company to migrate to another cloud provider over, say, implementing a multi-region strategy, which you should have been doing in the first place.

      • hedgehog_irl a year ago

        Highlight the lack of transparency in reporting outages and that's a start. If your MPLS or ISP provider operated in the same way, the company wouldn't accept it.

        • autotune a year ago

          My company is not going to spend hundreds of thousands of dollars or more, and months or even years of effort, and add additional constraints to the given pool of candidates we are hiring for, to migrate to GCP or Azure or DigitalOcean or Hetzner or wherever is considered more trendy than AWS right now due to "a lack of transparency" lmao. I would look completely incompetent to even suggest the idea to anyone internally.

          • hedgehog_irl a year ago

            But your company is willing to accept poor service and, as a result, spend more money with the same provider to ensure continuity. So essentially you reward AWS for hiding their stats: they can claim high uptime figures, and when an outage happens it's the users' fault for not spending enough money with them to have many, many instances across the availability zones to cover for the AWS mess-up. I get it, redundancy is needed in systems, but with a lack of proper reporting users are forced to overspend out of fear. It's a great business model: hook the clients in with lies and then get them to reward you for hiding facts. Clearly your company has money to burn, wasting it like this. Everyone knows they lie, and are blatant about it, so why is it tolerated? As I said, I don't see other enterprise providers getting away with this kind of behaviour towards clients.

            • autotune a year ago

              If you are willing to host your critical infra on some dodgy startup alternative that might go away in 3 months, because you refuse to bend on your personal values and separate them from what the typical organization actually cares about, best of luck. I know HN tends to love the underdog, but there is a time and place for that, and a time and place to accept what you need to do to keep your services online.

              • hedgehog_irl a year ago

                So your logic is to accept poor-quality service to keep your service online rather than trying to do better and improve service. You are saying that rather than rewarding a company trying to do better, we should just accept poor service from AWS. How is this better than "hosting on some dodgy start-up"? This has nothing to do with my personal beliefs or opinions; I'm trying to understand why it's accepted from AWS but not others.

                Edited to add point

                • autotune a year ago

                  My logic is to build highly resilient infrastructure given the constraints available. Your definition of "poor service" is not what I have experienced in my 10 year career as a SRE, because I build around your definition of what makes it poor and make it work as it should. It's called chaos engineering, and companies like Netflix have been doing it for years with their Chaos Monkey tool and SRE practices. Doesn't matter what cloud provider you go to, there is ALWAYS unexpected and unannounced downtime unless you build around, plan for, and expect it. But sure, go ahead and tell us all how industry leaders like them are wrong for sticking with what you call "poor service."

                  • hedgehog_irl a year ago

                    OK, simple question: would you accept any other infra service provider having such poor customer service and not providing an updated status for an outage/disruption for 45 minutes?

                    • autotune a year ago

                      You are deliberately avoiding the counter-points I already specifically addressed in response to that question to the point we are stuck in a loop, so I am going to leave this thread now. If you feel you can do a better job at SRE with your current mindset and believe you are better at choosing which cloud providers are worth using for an org, I welcome you to try.

          • simfree a year ago

            Your company is hiring and retaining people who can't work with tooling outside Amazon Web Services?

            • karamanolev a year ago

              Many companies are hiring and retaining specialists in AWS-lock-in-technology, who lack experience with another-cloud-provider-technology, so I don't know what's surprising.

            • autotune a year ago

              Training and getting up to speed takes time and money, neither of which are unlimited for any organization. It's not that they/we can't work with other cloud services, it's that it would likely add up to months of additional on-boarding time to get someone who wasn't familiar with another cloud provider productive with infra at scale on said provider.

    • hedgehog_irl a year ago

      Didn't say you have to go "cloud". Rent hardware in a DC and run that yourself, or use a VPS. I mean, the cloud is just "other people's hardware". And I'd thank you not to insult gorillas like that by comparing them to Amazon.

      • koksik202 a year ago

        This just brings a whole load of new expenses for staffing physical locations. It creates more problems than it solves.

        • simfree a year ago

          Colocation or especially server rental generally requires no persistent staffing. The datacenter has their own staff for tasks requiring physical intervention, and you have IPMI/iLO access to your servers for doing reboots and similar.

        • hedgehog_irl a year ago

          Not really. Renting from a DC provider means you just run the host yourself; they deal with power, space, cooling, etc.

    • uuddlrlrbaba a year ago

      I'd ask instead if there are any cloud providers worse with customer honesty.

      • hedgehog_irl a year ago

        I'm sure there are two-bit VPS providers that claim to be cloud and are terrible. But for the price and claims of service like AWS's, I dunno, they're at the scale where they don't have to care about customers.

dixie_land a year ago

Keep in mind the AWS status dashboard solely reflects the product-owning manager's discretion.

And the number of yellows ("green I" if you're old enough) is definitely a material input to a PIP :)

  • joecot a year ago

    Correct. No matter how down AWS is, their status page will only show a disruption if a manager approves showing a disruption. There is nothing automated to display the status, so the status page is mostly worthless except for whatever AWS admits is down.

    • whoknew1122 a year ago

      All this is compounded by the fact that AWS builds on AWS. So there can be a disruption of a service, but it's not really the service's fault -- it's an upstream failure.

nnf a year ago

I'm hearing from customers and other employees that our stuff at AWS (us-east-2) is unreachable, but I'm able to get to it all without any issue (via http & ssh). Perhaps there's a problem upstream of AWS that's only affecting some ISPs?

  • thedougd a year ago

    It's only impacting some ISPs. The outage varies across my office locations. At my location it is out; however, I was able to get access via the Cloudflare WARP VPN.

    Edit: Sounds like AT&T

cathintexas a year ago

On our team, we are seeing that if you are on AT&T cell service or have AT&T as your ISP, you can't reach AWS or our site in US-east-2.

everfrustrated a year ago

A reminder that the public and personal health dashboards are not the only port of call.

If you pay for the top tier of AWS support and you have a suspected outage, you'd be paging AWS, who will pick up the phone and start debugging your problem.

If your business depends on AWS you don't sit around clicking refresh on a status page hoping it might be updated.

  • coredog64 a year ago

    At some level of spend, your account team will know what services you use and know when those services are having higher than normal error rates.

daneel_w a year ago

About two weeks ago all three of our Aurora DB instances in eu-central-1 suddenly crashed and were offline for almost 55 minutes, no matter what we tried. Simultaneously we had random network problems going on within our eu-central-1 VPC which we were unable to diagnose. We still don't know what happened because we're not getting any answers to our support request. The AWS health dashboard was all green the entire time. No notifications were sent out.

jamroom a year ago

We're still up on us-east-2, but lots of customers are calling in that they can't connect - makes me think there's some network down somewhere.

_justinfunk a year ago

It does seem to be a networking issue. I have an EC2 instance in us-east-2 that is accessible through a "Global Accelerator" but not externally through my ISP.

That EC2 instance can talk to other EC2 instances that are in us-east-2 - but none of those other instances are accessible externally.
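
A quick way to compare the two paths from the same machine (Python; the hostnames below are placeholders, not real endpoints) is just a pair of TCP connects:

  # Compare reachability of the same service via two paths:
  # directly, and through its Global Accelerator endpoint.
  # Hostnames are placeholders.
  import socket

  ENDPOINTS = {
      "direct (us-east-2)": ("myapp-direct.example.com", 443),
      "via Global Accelerator": ("myapp-ga.example.com", 443),
  }

  for label, (host, port) in ENDPOINTS.items():
      try:
          with socket.create_connection((host, port), timeout=5):
              print(f"{label}: TCP connect OK")
      except OSError as exc:
          print(f"{label}: failed ({exc})")

If the accelerator path connects and the direct path times out from the same machine, that points at the network path rather than the instance itself.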

  • leesalminen a year ago

    Can confirm that Global Accelerator helped us avoid this issue today.

xup a year ago

I'm still seeing it on my end. Our currently-running EC2 instances are working fine, but the EC2 us-east-2 console webpage doesn't load, and an EC2 instance in us-east-2 I rebooted has yet to come back online.

jeremib a year ago

They've finally updated their status.

12:26 PM PST We are investigating an issue, which may be impacting Internet connectivity between some customer networks and the US-EAST-2 Region.

BWStearns a year ago

Seeing issues in Florida for us-east-2. Coworkers in NY can still get to us-east-2.

mplanchard a year ago

We're seeing a similar thing for our us-east-2 properties. Some of our team is able to reach them, but others aren't. Folks in the midwest (Oklahoma and Michigan) can't even load the AWS console, while people in Texas, California, Arizona, and Pennsylvania can.

Analemma_ a year ago

There are three kinds of lies: lies, damned lies, and cloud status dashboards.

ocdtrekkie a year ago

A vendor's cloud product is having significant issues. Figured HN would tell me which major public cloud infrastructure fell over to cause it. Never fails.

baq a year ago

It only turns yellow if the datacenter gets flooded by lava. Red is probably a tactical nuke.

  • agilob a year ago

    I read here on HN before that yellow requires a manual signature on paper from a higher manager, because such a fault affects their compensation and decreases the stock value. Red requires a signature from the C-level. It's not automated at all; the dashboard is almost worthless.

muttantt a year ago

We've got our main stack all in us-east-2. It all seems to be running currently.

  • joshuanapoli a year ago

    We couldn't detect any problem accessing resources via CloudFront/AppSync. Maybe the issue was specific to ELB.

  • ccleve a year ago

    Try it on your phone.

    • jamroom a year ago

      Yeah I can't access our services on us-east-2 on AT&T 5G but CAN on my CenturyLink fiber.

    • mrobins a year ago

      Works from my computer but not my phone (AT&T in NY).

agilob a year ago

Anyone with affected RDS instances? We were occasionally getting random connectivity issues today... New pods, within 1-2 minutes of startup, were suddenly getting timeouts connecting to MySQL DBs.

bloaf a year ago

There was definitely something going on last night too; I noticed a number of sites having intermittent issues, confirmed by Down Detector.

jstimps a year ago

We're seeing that external requests to ALBs in us-east-2 are affected.

rshm a year ago

Could be resolved; the us-east-2 console is working fine on my end.