JohnMakin 13 days ago

I'm fine with forcing upgrades this way - however, from an operations standpoint, it is an absolute nightmare.

For one, depending on your situation/CRDs/automation, doing these upgrades in-place can be next to impossible. EKS minor version upgrades can only be done one version at a time - e.g., if you want to go from 1.24 -> 1.28, you need to do 1.25, then 1.26, then 1.27, then 1.28. So teams without a lot of resources are probably in a tough spot depending on how far behind they are. Often, it's far more efficient to build an entirely new cluster from scratch and then cut over - which seems ridiculous.
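
(For the curious, the one-version-at-a-time dance looks roughly like this with boto3 - the cluster name is a placeholder, and node groups/add-ons still need their own upgrades between each hop:)

  # Sketch: step the control plane up one minor version at a time.
  import time
  import boto3

  eks = boto3.client("eks")
  cluster = "my-cluster"  # placeholder

  for target in ["1.25", "1.26", "1.27", "1.28"]:
      update_id = eks.update_cluster_version(name=cluster, version=target)["update"]["id"]
      # Wait for this hop to finish before starting the next one; node groups
      # and add-ons still need their own upgrades in between.
      while eks.describe_update(name=cluster, updateId=update_id)["update"]["status"] == "InProgress":
          time.sleep(60)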

Why is upgrading EKS versions such a pain? Well, if you're using any cluster add-ons, for one, all of those need to be upgraded to the correct versions, and the compatibility matrix there can be rough. Stuff often breaks at this stage. Care needs to be taken around PVs, the CNI, and god help you if you have some helm charts or CRDs that rely on a deprecated Kubernetes API - even if the upstream repository has a fix for it, you will often find yourself in a yak-shaving nightmare of fixing all the stuff that breaks when you upgrade that, and then whatever downstream services THAT service breaks - etc.
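
(A rough way to sanity-check the managed add-on side before a hop, assuming boto3 - the add-on name and target cluster version here are just examples:)

  # Sketch: list add-on versions compatible with the target cluster version.
  import boto3

  eks = boto3.client("eks")
  resp = eks.describe_addon_versions(kubernetesVersion="1.28", addonName="vpc-cni")
  for addon in resp["addons"]:
      for v in addon["addonVersions"]:
          print(addon["addonName"], v["addonVersion"])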

What is the solution? I don't know. I'm not a kubernetes architect, but I work with it a lot. I understand there are security patches and improvements constantly, but the release cycle, at least from an infrastructure/operations perspective, IME places considerable strain on teams, to the point where I have literally seen a role in a company whose primary responsibility was upgrading EKS cluster versions.

I have a sneaking suspicion this is to try to encourage people to migrate to more expensive managed container orchestration services.

  • watermelon0 13 days ago

    The EKS release cycle follows the Kubernetes release cycle. I'm not sure it's fair to expect AWS to freely support outdated K8s versions that don't have upstream support.

    If K8s were backwards compatible, upgrading would be a lot easier, and if it supported LTS releases, like other projects do, manual upgrades would only be needed every X years.

    For example, the reason you can run the same PostgreSQL major version for 5 years on RDS is that PostgreSQL actively supports it, and minor versions are non-breaking and can be applied seamlessly (a restart or failover to the standby replica is still needed during the upgrade).
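
    (Roughly what that looks like on the RDS side, assuming boto3 - the instance identifier is a placeholder:)

      # Sketch: opt an instance into automatic, non-breaking minor upgrades.
      import boto3

      rds = boto3.client("rds")
      rds.modify_db_instance(
          DBInstanceIdentifier="my-postgres",  # placeholder
          AutoMinorVersionUpgrade=True,        # minor versions applied automatically
          ApplyImmediately=False,              # wait for the maintenance window
      )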

    • JohnMakin 13 days ago

      Completely understand why it is this way, and like I said I don't know the solution - unless AWS were able or willing to fork Kubernetes in the same way they did Elasticsearch, though it's understandable why they may not want to do that. Was mostly just griping that this process is a complete pain in the ass for tons of people (IME).

  • rho138 13 days ago

    I recently did the upgrade from 1.24 -> 1.28 on a neglected cluster after testing the upgrade in a dev environment, and it was honestly not that terrible. It really comes down to having the capability and the man-hours to manage the procedure. In reality, the longest part was waiting for cluster nodes to upgrade to X version of k8s, but the complete upgrade only took 3 weeks of testing and a single 4-hour outage, with no loss in processing over the period.

    Realistically, the workloads being run would have been better suited to a horizontally-scaling EC2 deployment, but that was a future goal that never came to fruition.

    • JohnMakin 13 days ago

      Like I said, it depends on your situation. Sometimes a v1beta1 API gets deprecated and causes complete chaos for a deployment. Sometimes your IaC is resistant to these kinds of frequent changes. There are really a billion scenarios.

      For reference, I have done upgrades from 1.12 -> 1.28, and most of the time, if I get into a messy project and can get away with it, I will just rebuild the cluster from scratch.

  • easton 13 days ago

    > try to encourage people to migrate to more expensive managed container orchestration services.

    The question is: what service? ECS is the competing Amazon-built service, and it’s entirely free for management; you just pay for compute. We don’t use k8s because ECS is free and we don’t plan on leaving AWS.

    Sure, you’re more locked in with ECS, but if you aren’t doing funky stuff with the APIs you can probably off-ramp to k8s pretty easily. I know I could move us in a week or less; we’d have far bigger problems with the other AWS services we use.

    • JohnMakin 13 days ago

      Lightsail? I’m sure there are more examples other than ECS.

    • pas 13 days ago

      How are you managing/configuring/monitoring/understanding ECS? (Terraform?) To me it's complete spaghetti with a thick WTF-sauce. At least with k8s there's YAML and only YAML. (And there's cdk8s, which supports tests, and spinning up a new cluster to test things is straightforward.)

      Sure, I guess there's a whole industry that offers services to help manage AWS, but at that point the whole thing could just be spun off to the lowest bidder. (Although I understand that people can come up with a never-ending list of reasons (excoughses) to keep paying the AWS tax.)

      ... Okay, I'm probably too salty. If it works, it works; if it's profitable and the business is happy, it's hard to argue with the stack.

      • easton 12 days ago

        CloudFormation for us. CF is wack, but the ECS concepts seemed as easy as k8s.

  • noctarius 13 days ago

    Hadn't thought of this suspicion beforehand, but it doesn't sound like a total miss.

  • pid-1 13 days ago

    As K8s matures it's likely we will get some kind of LTS versioning scheme.

    Having new releases so often for such a core infrastructure component is kinda insane unless it was explicitly architected to allow seamless upgrades.

    • mdaniel 13 days ago

      There's a tiny bit of nuance there about "allow seamless upgrades" in that they do what I think is a fantastic job of version skew toleration between all the parts that interact (kubectl, kubelet, apiserver, etc). So that part, I think, is not the long pole in any such tent, especially because if the control plane gets wiped out, kubelet will continue to manage the last state of affairs it knew about, and traffic will continue to flow to those pods.
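
      (A rough skew sanity check, assuming kubectl is on the PATH and pointed at the cluster:)

        # Sketch: warn if kubectl and the apiserver are more than one minor apart.
        import json
        import subprocess

        out = json.loads(subprocess.check_output(["kubectl", "version", "-o", "json"]))
        client = int("".join(c for c in out["clientVersion"]["minor"] if c.isdigit()))
        server = int("".join(c for c in out["serverVersion"]["minor"] if c.isdigit()))
        if abs(client - server) > 1:
            print("kubectl/apiserver skew exceeds the supported window")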

      The hairy bit is the rando junk that gets shoved into clusters, without any sane packaging scheme to roll it up or back. I even recently had to learn the deep guts of the sh.helm.v1.foo secret because we accidentally left an old release in a cluster that no longer supported its apiVersion. No problem, says I, $(helm uninstall && helm install --version new-thing), but har-de-har-har, helm uses that Secret to fully rehydrate the whole manifest of the release before deleting it, so when helm tries (effectively) kubectl delete thing/v1beta1/oldthing and pukes, well, no uninstall for you, even if those objects are already gone.
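
      (For anyone who ends up in the same place, a rough sketch of peeking inside one of those release Secrets - the secret name below is made up; real ones follow the sh.helm.release.v1.<name>.v<revision> pattern:)

        # Sketch: decode the release payload Helm stores in the Secret.
        import base64
        import gzip
        import json
        import subprocess

        raw = subprocess.check_output([
            "kubectl", "get", "secret", "sh.helm.release.v1.oldthing.v1",  # placeholder
            "-n", "default", "-o", "jsonpath={.data.release}",
        ])
        # .data.release is base64 (from the API) wrapping base64(gzip(json)).
        release = json.loads(gzip.decompress(base64.b64decode(base64.b64decode(raw))))
        print(release["manifest"])  # the full manifest Helm will try to delete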

    • noctarius 13 days ago

      I hope you're right. Apart from that, yes I think it's necessary.

  • cjk2 13 days ago

    Yeah this. My average day when I go near EKS upgrades: Waltz in, fuck up the ALB ingress controller in some new and interesting way, spend all day bouncing AWS support tickets around, find out it was AWS's fault, find half the manifest YAML schema in the universe is now deprecated, sob into my now soaking wet trousers and wonder why the fuck I ended up doing this for a living.

    Yesterday I spent 3 hours trying to fix something, only to find it was an indentation error somewhere.

htrp 14 days ago

This is also the right way to deprecate. Charge people an arm and a leg to keep things running (and eventually force them to migrate).

  • solatic 14 days ago

    100%. People are responsible for an ever-increasing number of things; people will focus on business priorities, and stuff that is working will be left the hell alone. As long as the bills are manageable and the business pays - the lights will be kept on forever. Passing increasing support costs to customers realigns interests between customer and provider without the danger of user impact.

    And for Kubernetes, honestly, charging 6x for extended support is probably a bargain, considering the pace of change and difficulty of hiring engineers for unsexy maintenance work.

    • mdaniel 14 days ago

      I do appreciate that the devil is always in the details, but I'll be straight: their new(?) "Upgrade insights" tab/API <https://docs.aws.amazon.com/eks/latest/userguide/cluster-ins...> goes a long way toward driving down the upgrade risk from a "well, what are we using that's going to get cut in the new version?" standpoint.
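
      (A rough sketch of pulling those insights programmatically, assuming a recent boto3 that includes the EKS insights API - the cluster name is a placeholder:)

        # Sketch: dump upgrade insights for a cluster before planning a bump.
        import boto3

        eks = boto3.client("eks")
        for insight in eks.list_insights(clusterName="my-cluster")["insights"]:
            print(insight["name"], insight["insightStatus"]["status"])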

      We just rolled off of their extended version, and it was about 19 minutes to upgrade the control plane with no downtime, and then anywhere between 10 minutes and over an hour to upgrade the vpc-cni add-on. It seemed just completely random, and there was no cancel button. We also had to manually patch the kube-proxy container version, which OT1H they did document, but OTOH, well, I didn't put those DaemonSets on the Nodes, so why do I suddenly have to manage its version? Weird.
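
      (Roughly what those two chores look like, assuming boto3 plus kubectl - the cluster name, add-on version, and kube-proxy image below are placeholders; take the real values from the EKS docs for your target version:)

        import subprocess
        import boto3

        eks = boto3.client("eks")
        # Managed add-on: bump vpc-cni to a version compatible with the new cluster.
        eks.update_addon(
            clusterName="my-cluster",           # placeholder
            addonName="vpc-cni",
            addonVersion="v1.16.0-eksbuild.1",  # placeholder
            resolveConflicts="OVERWRITE",
        )

        # Self-managed kube-proxy DaemonSet: patch the container image by hand.
        subprocess.check_call([
            "kubectl", "set", "image", "daemonset/kube-proxy", "-n", "kube-system",
            "kube-proxy=<regional-eks-registry>/eks/kube-proxy:<target-version>",  # placeholder
        ])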

      Touching the CNI is always a potentially downtime-inducing event, but for the most part it was manageable.

  • TheP1000 14 days ago

    Agreed. I would imagine the previous approach of forced upgrades ended up burning lots of customers in worse ways than just their pocketbook.

  • noctarius 14 days ago

    True, but I guess it'll be a surprise to many. And, unfortunately, upgrading isn't always the easiest thing with deprecations and stuff

noctarius 14 days ago

Article by Mary Henry. I was shocked to see how much higher the extended support cost (per hour) is for Kubernetes on AWS.

Haven't had that situation myself on AWS yet, but I ran into it a few times on Azure.

I can't remember having paid extra on Azure though, but maybe we did. Certainly not 6x the price, though.

PS: Not sure why it got flagged the first time, but I think it's because I used a different title. Sorry.

  • res0nat0r 13 days ago

    We just got emails yesterday about the EKS price increase. It's another reason we're trying to move the main app to the vendor's SaaS: I don't have enough time and resources to be a full-time k8s admin. The ecosystem moves way too fast, and upgrades/deprecations happen way too quickly to keep up and to have time to test / plan / roll out proper upgrades without breaking our critical production workloads.

  • qqtt 14 days ago

    AWS also recently ended support for MySQL 5, so if you had an RDS instance with that version running past the cutoff, your support costs ballooned exorbitantly.
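
    (A rough way to check whether you're exposed, assuming boto3 - this just lists instances still on 5.7:)

      # Sketch: spot RDS instances headed for extended-support billing.
      import boto3

      rds = boto3.client("rds")
      for page in rds.get_paginator("describe_db_instances").paginate():
          for db in page["DBInstances"]:
              if db["Engine"] == "mysql" and db["EngineVersion"].startswith("5.7"):
                  print(db["DBInstanceIdentifier"], db["EngineVersion"])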

    • VectorLock 14 days ago

      Yup this one hit me hard. USE2-ExtendedSupport:Yr1-Yr2:MySQL5.7 sent my bill up 70%.

      • hughesjj 14 days ago

        How long was it between the notice and you getting charged extra?

    • noctarius 14 days ago

      Seems like I'm one of the lucky ones - using neither RDS nor MySQL. But seriously, ouch. I mean, I get why they want people to migrate to supported versions, but ...

      • SteveNuts 14 days ago

        I wish we could implement this internally via chargebacks. The teams that refuse to upgrade their stuff should be forced to pay for the externalities they cause.

chrisjj 14 days ago

> running unsupported versions makes it harder to get help from a community that’s currently focused on the latest version

Great example of misuse of that simple word 'that'.

Should be 'which'.

  • TecoAndJix 14 days ago

    Always learning something new[1]:

    "The difference between which and that depends on whether the clause is restrictive or nonrestrictive.

    In a restrictive clause, use that.

    In a nonrestrictive clause, use which.

    Remember, which is as disposable as a sandwich wrapper. If you can remove the clause without destroying the meaning of the sentence, the clause is nonessential (another word for nonrestrictive), and you can use which."

    [1] https://www.grammarly.com/blog/which-vs-that/#:~:text=Which%....

thebeardisred 13 days ago

This is something most people don't realize is an aspect of Red Hat's value. Extended Lifecycle Support (ELS) + Extended Update Support (EUS) are available _just in case_ you really can't figure out how to migrate off of those Red Hat Enterprise Linux 6 systems running on x86 (32 bit). https://access.redhat.com/support/policy/updates/errata

VectorLock 14 days ago

Had this bite me on my small-scale personal AWS setup. I have an AWS account I run some personal sites on, a Mastodon instance, etc. Some Billing Alarms I set up alerted me that my bill went from the usual $100 to $180: a $75 charge for USE2-ExtendedSupport:Yr1-Yr2:MySQL5.7. I mean, I'm very used to Amazon's ridiculous fee structure, but even this one threw me for a loop.
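
(For reference, a rough sketch of that kind of billing alarm with boto3 - the threshold and SNS topic ARN are placeholders, and billing metrics only live in us-east-1:)

  # Sketch: alarm when the month-to-date bill crosses a threshold.
  import boto3

  cw = boto3.client("cloudwatch", region_name="us-east-1")
  cw.put_metric_alarm(
      AlarmName="monthly-bill-over-150",
      Namespace="AWS/Billing",
      MetricName="EstimatedCharges",
      Dimensions=[{"Name": "Currency", "Value": "USD"}],
      Statistic="Maximum",
      Period=21600,  # 6 hours
      EvaluationPeriods=1,
      Threshold=150.0,
      ComparisonOperator="GreaterThanThreshold",
      AlarmActions=["arn:aws:sns:us-east-1:111122223333:billing-alerts"],  # placeholder
  )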

  • steelaz 14 days ago

    To be fair to AWS, they announced the deprecation of MySQL 5.7 in January 2021, and many emails warned of this change throughout 2024.

    • VectorLock 5 days ago

      Deprecating a service is one thing.

      Charging an arm and a leg for a deprecated service is another.

  • noctarius 14 days ago

    Ouch. Glad you had the alarm (and that it reacted "early enough"). Anyhow, I think you may not be alone with that surprise.

bushbaba 13 days ago

Why the AWS hate when this is an issue with the K8s/CNCF team's constant churn? There needs to be a CNCF-blessed LTS release of k8s. AWS is just filling a gap here, with all the headaches involved in backporting security patches.

abrookewood 13 days ago

Doesn't just apply to EKS - we are currently going through the same thing with MySQL on RDS. It's a big jump in support cost, but at the same time, I understand why they are doing it.