Show HN: K8s Cleaner – Roomba for Kubernetes
sveltos.projectsveltos.io
Hello HN community!
I'm excited to share K8s Cleaner, a tool designed to help you clean up your Kubernetes clusters.
As Kubernetes environments grow, they often accumulate unused resources, leading to confusion, waste, and clutter. K8s-cleaner simplifies the process of identifying and removing unnecessary components.
The tool scans your Kubernetes clusters for unused or orphaned resources—including pods, services, ingresses, and secrets—and removes them safely. You can fully customize which resources to scan and delete, maintaining complete control over what stays and what goes.
Getting Started:
Visit https://sveltos.projectsveltos.io/k8sCleaner.html and click the "Getting Started" button to try K8s-cleaner.
Key Features:
- Easy to Use: No complex setup or configuration required; perfect for developers and operators alike
- Open Source: Modify the code to better fit your specific needs
- Community Driven: We welcome your feedback, feature ideas, and bug reports to help improve K8s-cleaner for everyone
I'm here to answer questions, address feedback, and discuss ideas for future improvements.
Looking forward to your thoughts! And make sure all your Kubernetes clusters are sparkling clean for the holidays. :-)
Simone
For resources that are supposed to be cleaned up automatically, fixing your operator/finalizer is a better approach. Using this tool is just kicking the can down the road, which may cause even bigger problems.
If you have resources that need to be regularly created and deleted, I feel a cronjob running `kubectl delete -l <your-label-selector>` should be more than enough, and less risky than installing third-party software with cluster-wide list/delete permissions.
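Something like this is all it takes (untested sketch; the image, the "cleanup" ServiceAccount, and the cleanup=nightly label are placeholders, and the ServiceAccount needs RBAC permission to delete the listed resources):

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: label-cleanup
    spec:
      schedule: "0 3 * * *"                # every night at 03:00
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: cleanup  # placeholder; needs delete rights on the targets
              restartPolicy: Never
              containers:
                - name: kubectl
                  image: bitnami/kubectl:latest   # placeholder image that ships kubectl
                  args: ["delete", "pods,services,configmaps", "-l", "cleanup=nightly"]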
How should I discover the things that need deletion?
Presumably running some sort of analysis tool in a dry-run mode would help, no?
https://github.com/yonahd/kor
It appears to cover all that kor covers and more. I like the flexibility of cleaner. Both useful tools IMHO.
I think it can be useful as a discovery tool. How do you know that your operator is leaking resources in the first place? What if one of your cluster operators manually modified or created resources?
Most of the time you don't own the operator/finalizer.
And most of the time, when it is a shared cluster, you don't even know what else is being deployed.
If you find yourself using something like this, you seriously fucked up as DevOps / cloud admin / whatever.
I understand where you're coming from, and ideally we all strive for well-managed Kubernetes environments. However, as DevOps practitioners, we often face complexities that lead to stale or orphaned resources due to fast deployment cycles, changing application needs, or changing teams. Even the public clouds make a lot of money from services that are left running but unused; some companies make a living just helping clean those up.
K8s-cleaner serves as a helpful safety net to automate the identification and cleanup of these resources, reducing manual overhead and minimizing human error. It allows teams to focus on strategic tasks instead of day-to-day resource management.
> However, as DevOps practitioners, we often face complexities that lead to stale or orphaned resources due to fast deployment cycles
So, as a DevOps practitioner myself, I had enough say within the organizations I worked at (who are now clients), and with my other clients, that anything not in a dev environment goes through our GitOps pipeline. Other than the GitOps pipeline, there is zero write access to anything that isn't dev.
If we stop using a resource, we remove a line or two (usually just one) in a manifest file, the GitOps pipeline takes care of the rest.
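For what it's worth, one common way to get exactly that behavior (illustrative only, not our exact setup; names and URLs below are placeholders) is an Argo CD Application with automated sync and pruning enabled, so anything deleted from the branch is deleted from the cluster on the next sync:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: myapp
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://git.example.com/org/deploy.git   # placeholder repo
        targetRevision: staging-featurebranch             # branch == environment
        path: k8s
      destination:
        server: https://kubernetes.default.svc
        namespace: staging-featurebranch
      syncPolicy:
        automated:
          prune: true      # objects removed from Git are removed from the cluster
          selfHeal: true   # manual drift gets reverted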
Not a single thing is unaccounted for, even if indirectly.
That said, the DevOps-in-name-only clowns far outnumber actual DevOps people, and there is no doubt a large market for your product.
edited: added clarity
> I had enough say within the organizations I worked at, who are now clients
This sounds like experience that’s mainly at small/medium sized orgs. At large orgs the devops/cloud people are constantly under pressure to install random stuff from random vendors. That pressure comes from every direction because every department head (infosec/engineering/data science) is trying to spend huge budgets to justify their own salary/headcount and maintain job security, because it’s harder to fire someone if you’re in the middle of a migrate-to-vendor process they championed, and you’re locked into the vendor contract, etc etc. People also will seek to undermine every reasonable standard about isolation and break down the walls you design between environments, so that even QA or QC type vendors want their claws in prod. Best practice or not, you can’t really say no to all of it all the time or it’s perceived as obstructionist.
Thus there’s constant churn of junk you don’t want and don’t need that’s “supposed to be available” everywhere and the list is always changing. Of course in the limit there is crusty unused junk and we barely know what’s running anywhere in clouds or clusters. Regardless of the state of the art with Devops, most orgs are going to have clutter because those orgs are operating in a changing world and without a decisive or even consistent vision of what they want/need.
> At large orgs the devops/cloud people are constantly
Two of our clients are large orgs (15,000+ and 22,000+ employees). Their tech execs are happy with our work, specifically our software delivery pipeline with guard rails and where we emphasize a "Heroku-like experience".
One of their projects needed HITRUST, and we made it happen for them in under four weeks (no, we're not geniuses; we stole the smarts from the AWS DoD-compliant architecture & deployment patterns), and the tone of the execs seemed to change pretty significantly after that.
One of these clients laid off more than half their IT staff suddenly this year.
When I was in an individual contributor role at a mid-size org (just under 3,000 employees), I wrote up my thoughts, an "internal whitepaper" or whatever, being fully candid about the absurd struggles we were having (why does instantiating a VM take over three weeks?), and sent it to the CTO (and also the CEO, but the CTO didn't know about that), and some things changed pretty quickly.
But yeah, things suck in large orgs; that's why large orgs are outsourcing, which is in the most-downstream customers' (the American people's) best interests too -- a win-win-win all around.
A tool that would be useful for organizations that don't have superstar, ultra-competent devops people on the full-time payroll sounds pretty useful in general. There are a lot of companies that just aren't at the scale to justify hiring someone for that kind of role, and even then, it's hard to find really good people who know what they are doing.
> for organizations that don't have superstar, ultra-competent
Just outsource.
Outsource to those who do have DevOps people who know what they're doing -- most companies do this already in one form or another.
How do you realize that you have stopped using a resource? Can there be cases when you're hesitant to remove a resource just yet, because you want a possible rollback to be fast, and then it lingers, forgotten?
With our GitOps patterns, anything you could call an "environment" has a Git branch that reflects "intended state".
So a K8s namespace named test02, another named test-featurebranch, and another named staging-featurebranch, all have their own Git branch with whatever Helm, OpenTofu, etc. they need.
With this pattern, and other patterns, we have a principle: "if it's in the Git branch, we intend for it to be there; if it's not in the Git branch, it can't be there".
We use Helm to apply and remove things -- and we loved version 3 when it came out -- so there's not really any way for anything to linger.
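Concretely (release and namespace names here are just illustrative):

    # Everything the environment needs comes in via a release...
    helm upgrade --install myapp ./charts/myapp -n staging-featurebranch
    # ...and goes away with it, so nothing the release owns can linger
    helm uninstall myapp -n staging-featurebranch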
No one likes leaking resources, but it happens to the very best teams.
It seems like a tool that can propose a good guess about where you landed is strictly useful and good?
> but it happens to the very best teams.
I am very convinced it does not. I think the crux of where our viewpoints diverge is what we recognize (or not) as "best" teams.
Sometimes you show up to your job and find that you have 10,000 square pegs that need to go through a circular hole. You can fix this problem over time, but sometimes you need a stopgap. You may or may not subscribe to the Google SRE creed, but the goal is to get stuff working today with what you have, however ugly it may be. Some hacky tool that restarts programs to fix a memory leak is sometimes necessary due to time constraints, or as a stopgap. Migrations at large companies can take multiple years, whereas I can install this tool with Helm in approximately half a day.
> Some hacky tool that restarts programs to fix a memory leak is sometimes necessary due to time constraints, or as a stopgap.
This makes sense for K8s resources that ARE still serving production traffic. But this overall thread is about a tool to remove applications that ARE NOT serving production traffic.
> Migrations at large companies can take multiple years
Depends who is in charge and who management considers worth listening to (some of us don't struggle so hard in this area).
> I can install this tool with helm in approximately half a day.
A script I wrote to find unused resources took less than 10 minutes to write.
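For the curious, a minimal sketch of that kind of script (this one only flags ConfigMaps that no Pod in the namespace references via volumes, envFrom, or env; anything consumed some other way would be a false positive, so treat the output as a report, not a delete list):

    #!/usr/bin/env bash
    # List ConfigMaps in $NS that no Pod currently references.
    NS=${1:-default}
    used=$(kubectl -n "$NS" get pods -o json | jq -r '
      [ .items[].spec
        | (.volumes[]?.configMap.name?,
           .containers[].envFrom[]?.configMapRef.name?,
           .containers[].env[]?.valueFrom.configMapKeyRef.name?) ]
      | map(select(. != null)) | unique | .[]')
    all=$(kubectl -n "$NS" get configmaps -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n')
    # Names present in the cluster but absent from the "used" list
    comm -23 <(echo "$all" | sort) <(echo "$used" | sort)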
If you know how to do serious software without spilling any memory then you should write it up and collect the most significant Turing Award ever.
Hackers would be immediately bifurcated into those who follow that practice and those who are helpless.
> If you know how to do serious software without spilling any memory
This thread is about K8s Pods (and other K8s resources) that have been sitting idle, not memory leaks in software.
As far as "spilling" memory, the problem has already been solved by Rust, which does not do garbage collection because ownership and lifetimes are tracked at compile time. Does this mean egregious amounts of memory won't be used by some Rust programs? No. But unlike languages with garbage collection, when Rust is using that memory it is actually doing something with it.
I can assure you that memory leaks specifically and resource leaks generally are possible in any language more expressive than a stack machine with an arena.
Rust makes an interesting and demonstrably pragmatic set of tradeoffs here as opposed to e.g. C/C++: treating std::move as a default (aka linear typing) prevents a lot of leaks. But it still has pointers (Arc, etc.) and it still has tables: it’s still easy to leak in a long-lived process doing interesting things. For a lot of use cases it’s the better default and it’s popular as a result.
But neither Rust nor k8s have solved computer science.
Or the person who worked that role before you did, and you are trying to manage the situation as best as possible.
As a cloud/DevOps consultant, I don't believe in letting management drag on the life support of failed deployments.
We (the team I've built out over the last couple of years) carve out a new AWS subaccount / GCP project or bare-metal K8s or whatever the environment is, instantiate our GitOps pipeline, and services get cut over in order to get supported.
When I was working in individual contributor roles, I managed to "manage upward" enough that I could establish boundaries on what could be supported or not. Yes this did involve a lot of "soft skills" (something I'm capable of doing even though my posts on this board are rather curt).
"Every DevOps job is a political job" rings true.
Do not support incumbent K8s clusters; as a famous -- more like infamous -- radio host used to say on 97.1 FM in the '90s: dump that bitch.
It's really hard to parse what you are trying to say in the middle of this colorful imagery and quotations, but if you are spinning up a new Kubernetes cluster (and AWS/GCP account) per deployment, it's obvious you don't need this tool.
Specifically, leaving the corporate political dynamics aside, we move the workloads to the deployments that are up to standard and then archive+drop the old ones. Very simple.
A more cynical man than I would say that if you need to recreate all your workloads on a new cluster to bring them up to standard, "you seriously fucked up as DevOps / cloud admin / whatever"
Heh, they aren't ""our"" workloads, before we get there, and not until they get deployed correctly for the first time ever at which point then and only then are they ""our"" workloads.
I don't agree, you can see it as just another linting tool. Just like your IDE will warn of unused variables and functions, and offer a quick-fix to remove them. You wouldn't call someone a fucked up programmer for using an IDE.
[flagged]
I am assuming that the tool provides a list and asks for confirmation before deleting? Does it just go and automatically delete stuff in the background at any time?
edit: yes I see a "dry run" option, which is how I would use it. I also see a "scheduled" option which is probably what you're criticizing. Hard to tell, you're quicker with the insulting than the arguing.
Not all third party Kubernetes controllers are created equal, unfortunately.
And it is also the reality that not every infra team gets final say on which operators have to be deployed to their clusters (yet we're stuck cleaning up the messes they leave behind).
Agreed, this feels like the kubernetes descheduler as a service or something. Wild.
Possible, but if someone needs it and it works well, why not?
Or you're budget constrained, and haven't yet been able to allocate resources to things like GitOps.
[flagged]
How does it work in an IaC/CD scenario, with things like Terraform or ArgoCD creating resources and syncing their lifecycle inside the cluster? A stale resource, as identified and cleaned by K8s Cleaner, would be recreated in the next sync cycle, right?
In that case I would use it in DryRun mode and have it generate a report. Then look at the report and, if it makes sense, fix the Terraform or ArgoCD configuration.
More on report: https://gianlucam76.github.io/k8s-cleaner/reports/k8s-cleane...
Nice, that's something already. But, as someone else was saying, it would probably be cool to iterate and improve on this.
Because, if you want to turn this into a product to sell (and I guess you do?), your number-one customers will be (IMO) big enterprises, which usually have a lot of cruft in their clusters and go after cost optimizations. But they also usually don't shove manifests in directly; they have at least one solution on top that takes care of the resource lifecycle. So, IMO, if you want your product to be useful for those kinds of customers, this is something that needs to be improved.
Improve it! Teach it to figure out if the resource is managed by ArgoCD or FluxCD and then suspend reconciliation.
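A rough sketch of what that detection could look like from the outside (the labels shown are the defaults Argo CD and Flux's kustomize-controller put on managed objects; "myapp" and the angle-bracket names are placeholders):

    # Managed by Argo CD? (default tracking label)
    kubectl get deploy myapp -o json | jq -r '.metadata.labels["app.kubernetes.io/instance"] // empty'
    # Managed by Flux's kustomize-controller?
    kubectl get deploy myapp -o json | jq -r '.metadata.labels["kustomize.toolkit.fluxcd.io/name"] // empty'
    # If Flux-managed, pause reconciliation before cleaning up:
    flux suspend kustomization <name> -n flux-system
    # For Argo CD, disabling auto-sync has a similar effect:
    argocd app set <app-name> --sync-policy none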
It is open source. The least we could do if we find something else useful would be to file an enhancement request. Be nice.
I've been using it for a bit now and very happy with it. The stale-persistent-volume-claim detection has been almost a 100% hit in my case; it's a real game-changer for cleaning up disk space.
Kubernetes clutter can quickly become a headache, and having a tool like this to identify and remove unused resources has made my workflow so much smoother.
When I saw the headline I was pretty excited, but looking at your examples, I'm really curious about why you decided to make everything work via CRDs. Also, having to write code inside those CRDs for the cleanup logic seems like a pretty steep learning curve, and honestly I'd be pretty scared to end up writing something that would delete my entire cluster.
Any reason why you chose this approach over something like a CLI tool you can run on your cluster?
It has a DryRun mode that generates a report on which resources would be affected, so you can see which resources it identifies as unused before asking it to remove them. I would be scared as well otherwise. Agreed.
A CRD controller is expected, especially since it lets your cleanup logic run long-lived in the cluster, or periodically. But writing code inside YAML is very weird; at least there is CEL as an option here.
One more thing: it comes with a huge library of use cases already covered, so you don't necessarily need to write any new instances.
You would need to do that only if you have your own proprietary CRDs or some use case that is not already covered.
Don't let these naysayers here discourage you. I used CCleaner on Windows 20 years ago, so why not finally on my kube cluster now.
Why isn't this just a CLI tool? I don't see any reason it needs to be installed on a cluster. There should at least be an option to scan a cluster from the CLI.
Ironically, I'd bet environments most desperately in need of tools like this are some of the ones where there has been lots of "just run this Helm chart/manifest!" admining over the years...
Agreed, the ability to also run it as a binary would be helpful.
So is this the first instance of a Cloud C-Cleaner then? You could call it CCCleaner!
Looking at other comments and drawing Windows parallel, I propose kkCleaner
Useful project nevertheless!
Seems like a simple and effective tool!
I've been saying for a while that most of the time we didn't replace pets with cattle, but pet servers with pet clusters. The need for a tool like this proves my point.
Sounds useful.
I feel like the fact that this even needs to exist is a damning indictment of k8s.
Oh, good to know that your environment never loses track of any cloud resource. Maybe apply to Netflix, since it seems they're still trying to solve that problem https://www.theregister.com/2024/12/18/netflix_aws_managemen... <https://news.ycombinator.com/item?id=42448541>
Sure, I’ll grant you that at huge companies it’s probably easy to lose track of who is responsible for what and when resources ought to be cleaned up. But for small/medium-sized companies, using Terraform is sufficient. And while you can also use Terraform to manage resources on k8s, there’s a lot more friction to doing so than using YAML + `kubectl apply`. It’s far too easy to `kubectl apply` and create a bunch of garbage that you forget to clean up.
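That friction is easy to see in the tooling itself (the label selector below is illustrative, and --prune is an alpha feature):

    # Plain apply never deletes what you removed from the YAML; orphans just pile up.
    kubectl apply -f manifests/
    # The closest built-in mitigation is apply with pruning, scoped by a selector:
    kubectl apply -f manifests/ --prune -l app.kubernetes.io/part-of=myapp
    # Terraform, by contrast, destroys a resource on the next apply once its
    # block is removed from the configuration.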