The real lesson they should learn is to not rely on running images and then using "docker commit" to turn it into an image, but instead to use proper image building tools.
If you absolutely have to do it that way, be very deliberate about what you actually need. Don't run an SSH daemon, don't run cron, don't run an SMTP daemon, don't run the suite of daemons that run on a typical Linux server. Only run precisely what you need to create the files that you need for a "docker commit".
Each service that you run can potentially generate log files, lock files, temp files, named pipes, unix sockets and other things you don't want in your image.
Taking a snapshot from a working, regular VM and using that as a docker image is one of the worst ways to build one.
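If it helps, a minimal sketch of the Dockerfile route instead of docker commit - the image name and file layout here are made up, but the point is that the image only ever contains what you explicitly copy in:

    cat > Dockerfile <<'EOF'
    FROM debian:bookworm-slim
    # copy in only what the service needs; no sshd, cron or MTA ever runs during the build
    COPY app/ /opt/app/
    CMD ["/opt/app/run.sh"]
    EOF
    docker build -t myapp:1.0 .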
My first reaction: 800GB? Who committed that?!? This size alone screams something is wrong.

To be fair, even with basic Dockerfiles it's easy to build up a lot of junk. But there should be a general size limit in any workflow that alerts when something grows out of proportion. We had this in our shop just a few weeks ago: a Docker image for some AI training grew too big and nobody got alerted about its final size. It got committed and pushed to JFrog, and from there the image synced to a lot of machines. JFrog informed us that something was off with the amount of data we shuffle around. So on the one hand this should not happen, but it seems it can easily end up in production without warning.
Given that JFrog bills on egress for these container images, I'm sure you guys saw an eye-watering bill for the privilege of distributing your bloated container.
Yes. But fair enough that we got a warning the very next day.
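For what it's worth, that kind of size gate can be a couple of lines in CI; a rough sketch, with an arbitrary 5GB threshold and a made-up image name:

    # fail the pipeline if the built image exceeds ~5GB
    SIZE=$(docker image inspect myapp:latest --format '{{.Size}}')
    if [ "$SIZE" -gt $((5 * 1024 * 1024 * 1024)) ]; then
      echo "image is ${SIZE} bytes - too big" >&2
      exit 1
    fi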
What if I need cron in my docker container? And ssh? And a text editor? And a monitoring agent? :P
Thankfully LXD is here to serve this need: very lightweight containers for systems, where your app runs in a complete ecosystem, but very light on the ram usage.
>What if I need cron in my docker container? And ssh? And a text editor? And a monitoring agent? :P
How are you going to orchestrate all those daemons without systemd? :P
As you mentioned, a container running systemd and a suite of background services is the typical use case of LXD, not docker. But the difference seems to be cultural -- there's nothing preventing one from using systemd as the entry point of a docker container.
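For the curious, a rough sketch of what that looks like in practice - the image name is hypothetical and the exact cgroup flags vary by host setup:

    # run systemd as PID 1 inside an ordinary docker container
    docker run -d --name sysd \
      --tmpfs /run --tmpfs /run/lock \
      -v /sys/fs/cgroup:/sys/fs/cgroup \
      --cgroupns=host \
      some-systemd-enabled-image /sbin/init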
fwiw I recently bootstrapped a small Debian image for myself, originally intended to sandbox coding agents I was evaluating. Shortly after, I got annoyed by baseline vim and added my tmux & nvim dotfiles; now I find myself working inside the container regularly. It definitely works and is actually not the worst experience if your workflow is cli-focused.
Even putting GUI apps in a container isn't too bad once one develops the right incantation for x11/wayland forwarding.
My experience is that if the tooling is set up right it's not painful; it's the fiddling around with volume mounts, folder permissions, debug points, and "what's inside the container and what isn't" etc. that is always the big pain point.
Very accurate - that was one of the steps that caused me to fiddle quite a bit. Had to add an entrypoint to chown the mounts and also some Buildkit cache volumes for all the package managers.
You can skip the uid/chown stuff if you work with userns mappings, but this was my work machine so I didn't want to globally touch the docker daemon.
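For reference, the daemon-wide knob being avoided here is userns-remap; a minimal sketch of what it would have meant (and why it's unattractive on a shared work machine - it affects every container and image on the host):

    # merge this into /etc/docker/daemon.json rather than blindly overwriting it
    sudo tee /etc/docker/daemon.json <<'EOF'
    { "userns-remap": "default" }
    EOF
    sudo systemctl restart docker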
The answer is naturally kubernetes, alongside rootless and noshell containers.
Ideally, you have a separate docker container for each process (e.g. a separate container for the ssh service, one for cron, etc). The text editor can be installed if it's needed - that's not an issue apart from slightly increasing the container size. Most of the time, the monitoring agent would be running on the host machine and set up to monitor aspects of the container - containers should be thought of as running a single process and not as running a VM along with all its services.
Initially I didn't understand how they were getting the log files into the image. I had no idea that people abuse "docker commit" - do they know nothing about containers? If you want persistent logs, then have a separate volume for them so they can't pollute any image (plus they are readable when the image restarts etc).
When I saw the HN title, I thought this was going to be something subtle like deleting package files (e.g. apt) in a separate layer, so you end up with a layer containing the files and then a subsequent layer that hides them.
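On the volumes point above, a minimal sketch with made-up names - data in a named volume lives outside the layer stack, so even a later docker commit won't capture it:

    docker volume create applogs
    # logs land in the volume, not in the container's writable layer
    docker run -d -v applogs:/var/log myapp:latest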
I’m shocked that a company would share how amazingly bad their layer management had become. This may be a great internal blog, but I wouldn’t share it publicly.
On the other hand, I'm impressed that a company is owning up to the problem. Is it a dumb problem to have? Definitely. Are they the only ones to have it? Almost certainly not.
People are going to use the tools at their disposal, and they aren't all going to learn their tools at a high level. Think of every insane misuse of Excel you've ever heard of, for instance.
IT has the choice in this case to mitigate, or limit the access to the tools. Choosing mitigation prevents the growth of shadow IT and helps ensure that IT remains a trusted partner and not an obstacle to be worked around. This reflects well on the company, especially if they then go and provide better training to their users as well.
Yeah: I can usually tell from public information when a company has problems like this, and that makes me disinclined to want to work for them. Seeing how they deal with those problems, though? … Well, in this case, it shows that the company doesn't know how to deal with these problems properly, and thinks ChatGPT is appropriate for write-ups, so I still might not want to work there – but I might bother interviewing there, just to check how deep these problems go. (If they're just a case of "they didn't know better, but they're happy to learn", then I might actually take the job offer: an environment where others are willing to learn without fear of losing face is an environment where I can learn without worrying about that either.)
I'm confused. I had the same initial reaction as you and then read further and it sounds like the image was actually provided by a client?
This sounds like a case of "We are in growth mode and will accept any garbage the customer will throw at us" without calculating the tech debt costs.
As someone who is currently there, it's a very frustrating place.
Oh just wait till it’s time for your company to stop the ‘growth mode’ shenanigans and get serious about acceptable levels of tech debt and feature bloat. It’s where we are.
You can’t just flip a switch. There is no “Hey, that was fun, but it’s time to start designing these things with a purpose and vision”. Beyond the totally unreasonable expectations that have been set by Product and the C-level, you still have the mountain of tech debt that is coming due, and changes slow to a crawl, or outages skyrocket, or both. Plus, hiring has been based on ‘getting things done’, so you have this group of people who are actually really skilled in hacking things together and getting it out the door. It’s tough and calls for an entire culture shift. How do you stop being a reactionary startup and become a vision-based, purposeful organization?
This is the job of tech leadership IMO. People respond to incentive changes. If these items are properly prioritized on the roadmap, and credit and recognition follows tech debt remediation efforts to a similar degree as feature delivery, the work can be done.
But this requires strong tech leadership who can interface well with the C-suite and get buy-in for delaying feature delivery. In the absence of this buy-in, you pretty much need to control the narrative and create a rogue skunkworks initiative to wrap these improvements _into_ the feature delivery.
Many companies don't have strong tech leadership though, and will perpetually churn VPs and Directors, forever chasing A Change without addressing the culture and incentive system that created that culture.
From what I understood they provide a kind of shared platform where anyone can run things, and it was one of their clients/users performing the commits.
So they don't set reasonable expectations with the customers and accept any and all garbage. As an Ops person, I can say this is a path to Ops hell, as customers throw more and more garbage at you and the toil of dealing with customer problems becomes unbearable.
This is a case of the Product Team not working with customers, finding out what is reasonable, and allowing the system to set reasonable limits.
I would give them some leeway, sometimes you have to learn the hard way. But I was also kind of surprised they didn't mention contacting the client anywhere.
It's like a car repair company sharing how they dramatically improved ride comfort, speed and fuel usage by using air to fill tyres rather than concrete.
After asking chatgpt for suggestions and trying them all.
Transparency breeds trust.
Sure, it frightens away the short-sighted or particularly excitable people, but anyone who understands how unrealistic perfection is will be comforted by such transparency. Exposing the warts not only sets expectations, but it also assures people that things will (likely) not be just swept under the rug in a company culture of denialism and obfuscation.
Interesting, although something about the language makes me think it was written by an LLM; I like the ending though:
"The key insight is to treat container images not as opaque black boxes, but as structured, manipulable archives. Deeply understanding the underlying technology, like the OCI image specification, allows for advanced optimization and troubleshooting that goes far beyond standard tooling. This knowledge is essential for preventing issues like Kubernetes disk space exhaustion before they start."
Not a guaranteed tell but I noticed the word "surgically" in the opening paragraphs, and from personal experience I find that appears in ChatGPT a lot for me.
One of the common phrase tropes I find is something like "Here's a set of small, surgical steps you can take to..."
Yeah, but just also the overall feel of the article is kind of LLM-y and not human in some hard to articulate way
This whole article could have been much better written as: learn to build images with a Dockerfile/Containerfile or similar tooling, and store logs in a volume rather than the image filesystem. Everyone that builds a process around `docker commit` is simply in a race against time before they learn this lesson.
In the comments: People who didn't read the article assuming they were literally building 800GB images (the example in the article is an 11GB image that was amplified by copying behaviors)
In fairness the article is LLM vomit and could be two paragraphs, can't blame people for not reading it.
The TLDR:
> We tackled critical container image bloat on our Sealos platform, fixing a severe disk space exhaustion issue by shrinking an 800GB, 272-layer image to just 2.05GB.
They say they made a 800GB container image, so your issue is about singular vs plural?
Regardless, I don't really get why anyone would self-report like this. Is the next article going to be about how they don't encrypt passwords, and how, when they accidentally dropped the prod DB, they could restore accounts from the logs because they had the passwords in clear text?
>the example in the article is an 11GB image that was amplified by copying behaviors
It's not, it's an 800GB image caused by multiple full writes of an 11GB file into the image's layers. I read the article.
TIL of `docker commit`. What is the use case for this command? Quick debugging or something, to share with a coworker?
Same as snapshotting a vm, or as an interactive version of "docker build". But rarely useful, since most workflows don't really need statefulness.
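A sketch of that ad-hoc use, with made-up names - poke around interactively, then freeze the state for a coworker:

    docker run -it --name scratchpad ubuntu:24.04 bash
    # ...install tools, reproduce the bug inside the shell...
    docker commit scratchpad myteam/repro:bug-123
    docker push myteam/repro:bug-123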
It seems like their whole platform depends on it though… to my read they’re providing their users with cloud devcontainers to connect to from their local VS Code, then deploying to production by snapshotting the container with docker commit. Those containers have SSH enabled to the internet, which is where all of the auth logs came from that wound up baked into the images.
272 layers in a single image seems really unusual, is that just due to my lack of experience with containers? I've never seen an image with more than maybe a few dozen in my career...
I can't think of anything that would justify that many layers. If I have that much complexity, I would split up the container or start writing bash scripts.
The automation of containers looks simple but developers with systems experience know the actual complexity of operating systems and running applications.
People who know javascript but don't know how a file system works can build and deploy containers. They just copy and paste stuff until it runs. The automation of containers makes brute force iteration a viable option. It was a lot more difficult trying to run a Linux server, which would force you to learn something or use a platform as a service instead.
You can build docker images with nix, in which case you can have every dependency be its own layer.
That is clearly not what these people are doing, though.
Well, as described...
> Here's how the disaster unfolded:
> 1. A user's container is under a brute-force attack, and /var/log/btmp grows to 11GB.
> 2. The user performs a commit, creating a new image layer.
> 3. A single new failed login is appended to /var/log/btmp.
> 4. Because of CoW, OverlayFS doesn't just write the new line. It copies the entire 11GB file into the new, upper layer.
> 5. This process repeated 271 times.
So the user is creating hundreds of layers for unclear reasons. The article refers to this as "exponential growth", but for that to be the case those commits would need to be triggered in proportion to the number of existing layers, which seems unlikely. Assuming the commits are caused by the user for reasons unrelated to the size of the existing image, this is growth that is quadratic† (in the number of layers; it's hard to characterize as a function of time or whatever), and it'd be nice to know why there were so many layers.
† Note that while the growth is technically quadratic, I don't think that impacted them. They say that the problem occurred when one 11GB file got copied into each of 272 image layers. That would require 2,992 GB, but they also say that the image exhibiting this problem was only 800GB.
I suspect that the answer here is that only some of the layers modified (and therefore copied) the log file. Probably about 72 of the layers. This is more like growth that's linear (still technically slightly superlinear, but probably not quadratic) in the number of failed SSH login attempts. ~75% of layers aren't contributing to the problem at all.
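If anyone wants to check this against a real image, per-layer sizes are visible with docker history; the layers that re-copied /var/log/btmp should show up as the ~11GB rows (the image name is a placeholder):

    # one row per layer; look for the ~11GB entries
    docker history --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' bloated-image:latest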
> image-manip squash: This is the key to reclaiming disk space and the core of our strategy to squash the image layers. The tool creates a temporary container, applies all 272 layers in sequence to an empty root filesystem, and then exports the final, merged filesystem as a single new layer. This flattens the image's bloated history into a lean, optimized final state.
Wouldn't a multistage Dockerfile have accomplished the same thing? smth like
    FROM bigimage
    # drop the offending file in a throwaway first stage
    RUN rm bigfile
    # copy the merged filesystem into an empty image as a single new layer
    FROM scratch
    COPY --from=0 / /
I think yep, pretty much. Maybe they didn't know this existed?
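There's also a way to get the same flattening without writing any Dockerfile, assuming the bloated container still exists: export/import collapses everything into a single layer, though it drops the image metadata (ENV, CMD, exposed ports), which then has to be reapplied:

    docker export bloated-container | docker import - myimage:flattened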
Given that I was using multi-stage builds in 2020, when I finally got involved in projects with Docker, five years is already some time to learn about this stuff, and I bet the feature is much older - not bothering to look it up.
Fascinating deep dive into OverlayFS CoW behavior. The 11GB btmp file getting copied 271 times is a perfect storm scenario. Did they consider mounting /var/log outside the image layers? Seems like that would prevent any log file from causing this amplification. Also interested in image-manip... Does it handle metadata differently than docker export/import?
This is less of a deep dive and more an illustration of the worst way to use containers.
Having /var/log set as a persistent volume would have worked, but ultimately they were using "docker commit" to amend/update their images, which is definitely the wrong way to do it.
Is it fascinating?
Do people not know that each layer comes with its own downsides?
Do people just do 272 layers and think that it’s normal?
This seems like people discovering that water is wet and fire is hot.
I feel like I'm having an LLM fever dream
Seriously. Honestly this whole thing feels kinda like…using an LLM to write a blog post about debugging weird problems that only exist because the whole platform was built by an LLM in the first place. The multiple top level comments that are clearly written by an LLM are icing on the (layer) cake.
Our platform is designed to solve a very specific workflow, and the DevBox is only the first step in that process.
Our users need to connect their local VS Code, Cursor, or JetBrains IDEs to the cloud environment. The industry-standard extensions for this only speak the SSH protocol. So, to give our users the tools they love, the container must run an SSHD to act as the host.
We aren't just a CDE like Coder or Codespaces. We're trying to provide a fully integrated, end-to-end application lifecycle in one place.
The idea is that a developer on Sealos can:
1. Spin up their DevBox instantly.
2. Code and test their feature in that environment (using their local IDE).
3. Then, from that same platform, package their application into a production-ready, versioned image.
4. And finally, deploy that image directly to a production Kubernetes environment with one click.
That "release" feature was how we let a developer "snapshot" their entire working environment into a deployable image without ever having to write a Dockerfile.
Agree with other commenters that this seems like a bad idea. Why on earth should the release image contain all of the cruft of development?? Why on earth should it contain historical versions of all that cruft??
This does seem bonkers to me. All but guaranteed to have worse issues than bloated container images in the future.
The irony is that Kubernetes already provides a "ssh into any container" ability, and it's provided directly by k8s, no sshd needed (it's not the ssh protocol but it's good enough to get a shell). Not sure it's advisable to do with any user but an admin, but the standard workflow with k8s is not to shell into running containers anyway, it's to rebuild the container and redeploy the pod.
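Concretely that's kubectl exec, plus kubectl debug for images that don't even ship a shell; pod and container names here are placeholders:

    kubectl exec -it my-pod -- /bin/sh
    # or attach an ephemeral debug container alongside a shell-less image
    kubectl debug -it my-pod --image=busybox --target=my-container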
Who are you, exactly? There is practically no publicly available information about your company, other than that it appears to be held by a Chinese entity called Labring.
Reliable systems require hard limits imposed by designers. When systems hit the hard limits, it's a sign somebody's assumptions are wrong: either the designer built too small, or there's some bug pushing up against the hard limit. Either you catch the bug or make an intentional decision on how to scale further. This is basic engineering and is a requisite part of any undergraduate engineering degree worth its salt.
Allowing eight hundred gigabyte containers is gross incompetence. Trying to fix it by scaling the node disk from 2 TB to 2.5 TB is further evidence of incompetence. Understanding that you need to build a hard cap, but not concluding with action items to actually build one - instead just building monitoring for image size - is a clear sign to stay away.
It boggles my mind that the author could understand copy-on-write filesystem semantics but can't imagine how to engineer actual size limits on said filesystem. How is that possible?
.... oh right, the blogpost is LLM slop. So nobody knows what the author actually learned.
I'm not a sysadmin but doesn't the root cause sound like a missing fail2ban or something? (Sounds like a whole bunch of problems stacked on top of each other honestly.)
Yes, the article does list multiple root causes, including that one.
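For completeness, the usual knee-jerk fixes on the SSH side look roughly like this (assuming an OpenSSH server and a Debian-ish userland inside the container):

    # key-only auth stops /var/log/btmp growing from password brute-force attempts
    sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
    # restart sshd (or the container) for it to take effect
    sudo apt-get install -y fail2ban   # and/or rate-limit the attackers outright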
This seems very much like a ‘we mis configured our containers; then we realised, then we fixed it, then we blogged about it’ post of very little value.
Right? Blog posts like these make me question competence instead of attributing it.
Competence comes from experience built on lots of screw-ups. I like when people blog their mistakes.
This is crazy. And they created an entire business around containers without even understanding the basics of how building them works? Yikes.
Images don't seem to be working:
https://sealos.io/_next/image?url=.%2Fimages%2Fcontainerd-hi...
https://sealos.io/_next/image?url=.%2Fimages%2Fbloated-conta...
Either way, hope the user was communicated with or alerted to what's going on.
At the same time, someone said that 800 GB container images are a problem in of themselves no matter the circumstances and they got downvoted for saying so - yet I mostly agree.
Most of mine are about 50-250 MB at most and even if you need big ones with software that's GB in size, you will still be happier if you treat them as something largely immutable. I've never had odd issues with them thanks to this. If you really care about data persistence, then you can use volumes/bind mounts or if you don’t then just throw things into tmpfs.
I'm not sure whether treating containers as something long-lived with additional commits/layers is a great idea, but if it works for other people, then good for them. Must be a pain to run something so foundational for your clients, though, because you'll be exposed to most of the edge cases imaginable sooner or later.
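On the tmpfs point above, a minimal sketch (size and image name are arbitrary) - anything written there stays in RAM and can never end up in a committed layer:

    docker run -d --tmpfs /tmp:rw,size=512m myapp:latest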
Is it spooky that they said they looked inside a customer's image to fix this? A bunch of engineers just had access to their customer's intellectual property, security keys, git repos, ...
If you are adding security keys and git repos to your final shipped image you are doing things very wrong - a container image is literally a tarball and some metadata about how to run the executables inside. Even if you need that data to build your application you should use a multi-stage build to include only the final artifacts in the image you ship.
For stuff like security keys you should typically add them as build --args-- secrets, not as content in the image.
> For stuff like security keys you should typically add them as build args, not as content in the image.
Build args are content in the image: https://docs.docker.com/reference/build-checks/secrets-used-...
> For stuff like security keys you should typically add them as build args, not as content in the image.
Do not use build arguments for anything secret. The values are committed into the image layers.
Yep. The only valid usecase I think of is using the secret for something else, eg connecting to an internal package registry, in which case the secret mounts may help.
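A rough sketch of that secret-mount flow with BuildKit - the ids and paths are made up; the token is readable during the RUN step but never written into a layer:

    docker build --secret id=npm_token,src=$HOME/.npm_token -t myapp .
    # and in the Dockerfile:
    #   RUN --mount=type=secret,id=npm_token \
    #       NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci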
Yeah, typically, but in this case they're commiting and commiting in the container image, and saving changes from running software. Not only that, they're commiting log files into the image, which is crazy.
The thing here is that they're using Docker container images as if they were VM disks, and they end up with images with almost 300 layers, like in this case. I think LXC or VMs would be a better fit for this (but I don't know if they've tested it or why they're using Docker).
That’s nice, but you still shouldn’t be looking into your customer’s containers.
How else do they diagnose issues? Sorry to break it to you, this is absolutely standard across the entire industry.
Evict the containers, let the customer know and get customer approval to work with their images.
You have approval in the terms of service. This is absolutely known and expected across the entire industry. It's why your employees have clauses in their contracts about respecting third party confidentiality.
What about this case where the container was working but was consuming overhead due to an infrastructure issue? Customer hasn't done anything wrong. If you stop their containers they'll likely leave for a competitor.
I did a little research on this company. It’s related to (or wholly owned by) a Chinese entity called Labring. LinkedIn shows practically nobody related to the company other than its marketing team. Something smells incredibly fishy.
I did something on a smaller scale by ripping out large parts of Boost, which was nearly 50% of the image size
Title makes it seem like 800GB images are a normal occurrence: it is not.
2GB is the expected and default size for a docker image. It's a bit bloated even.
this reads like it was written by a clanker
Defence № 2 and № 3 are ones I would do everywhere as a knee-jerk reaction, regardless of any justification to not bother with them. It’s just an ingrained habit at this point.
It’s № 1 which I could not have guessed at or gone for. Good write-up, love the transparency.
What's up with the images that are supposed to appear in the article? They appear to be coded to load from "./images/containerd-high-disk-io-iotop.png", but https://sealos.io/blog/images/containerd-high-disk-io-iotop.... and https://sealos.io/images/containerd-high-disk-io-iotop.png both fail.
(And indeed, the images are broken in Firefox and Edge. Is there another browser where they're not broken?)
If your image is 800GB you are doing something wrong in the first place.
You didn't read the article.
I did, and the image had problems to begin with. Whether it's a bad image, a bad configuration of your hypervisor, or bad configuration in the image doesn't matter. If your images can bloat to over 800GB, you are doing the basics wrong. Hint: using commit to create your images...