rektide 2 years ago

There's a ton of good technical detail to dive into, but in my mind it's almost all aimed at addressing one leading sore point of containers:

With composefs, if two containers have the same file (wherever on the filesystem it lives), that file will only be stored once on the host, and (equally if not more importantly) it will be shared in the page cache (the kernel's cache of file contents). Currently, in most systems, different images end up with replicas of the same file that are shared neither on disk nor in the page cache.

If, for example, your org has a handful of base images it works from, this could drastically reduce the footprint of containers, both on disk and in memory, by effectively sharing the things that can be shared.
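
Not how composefs is implemented, but a minimal way to see why "same backing file" matters for the page cache (hypothetical file names, assumes a Linux-style host): two paths that resolve to the same inode are cached once by the kernel, while two byte-identical copies cost disk and memory twice.

  # Two names for the same inode share one cached copy; two independent copies do not.
  import os, shutil, tempfile

  tmp = tempfile.mkdtemp()
  orig = os.path.join(tmp, "libfoo.so")         # stand-in for a shared library
  with open(orig, "wb") as f:
      f.write(b"\x7fELF" + b"\x00" * 1024)

  link = os.path.join(tmp, "image-b-libfoo.so")
  copy = os.path.join(tmp, "image-c-libfoo.so")
  os.link(orig, link)                           # hard link: same inode
  shutil.copyfile(orig, copy)                   # plain copy: new inode

  print(os.stat(orig).st_ino == os.stat(link).st_ino)  # True: one copy on disk, one in cache
  print(os.stat(orig).st_ino == os.stat(copy).st_ino)  # False: duplicated on disk and in cache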

okso 2 years ago

Combining this with IPFS could be pretty interesting

  • lifty 2 years ago

    Could you elaborate?

mathfailure 2 years ago

I've read the project's description and still failed to understand what it does and what it is useful for.

  • ecnahc515 2 years ago

    The main advantage is the content-addressable part. Existing overlay filesystems only handle the overlay aspect, and then tools like docker/containerd attempt to reuse layers efficiently, but it's not perfect. A file in two different layers may have identical content, but it's still stored twice because the layer is, roughly speaking, the "unit" of storage. By making a single filesystem which handles both content addressability and the overlay aspect, you can avoid duplicating files that are the same but live in different layers.
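
    As a toy illustration (all file contents and image names here are made up): key storage on the hash of the bytes rather than on the layer, and the duplicate simply never gets written.

      # Toy content-addressed store: identical files arriving via different
      # layers/images land on the same object.
      import hashlib

      store = {}   # digest -> bytes (stand-in for on-disk objects)
      layer_a = {"/usr/bin/python3": b"...python...", "/etc/os-release": b"debian"}
      layer_b = {"/usr/bin/python3": b"...python...", "/app/main.py": b"print('hi')"}

      images = {}
      for name, layer in (("image-a", layer_a), ("image-b", layer_b)):
          manifest = {}
          for path, data in layer.items():
              digest = hashlib.sha256(data).hexdigest()
              store[digest] = data          # re-writing the same digest is a no-op
              manifest[path] = digest
          images[name] = manifest

      print(len(store))   # 3 objects, not 4: the shared binary is stored once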

  • PlutoIsAPlanet 2 years ago

    Another way of describing it: rather than a Docker/container image being a group of layered archives, each with changes, what gets distributed is a list of file hashes detailing where each file needs to be in the mounted filesystem and with which permissions.

    Since everything is named based on hashes, content is naturally deduplicated if two images share the same files and all files are stored in the same place.

    If you boot on top of a composed filesystem, you also get easy file verification as long as the booted list is signed and unmodified. If you modify the local files, the hashes won't match.
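
    A rough sketch of what that distributed list plus verification could look like (the format and field names are invented for illustration, not the actual composefs image format):

      # Hypothetical manifest: path -> (content digest, mode). Ship the signed
      # manifest; the bytes can come from any shared object store.
      import hashlib

      def digest(data):
          return hashlib.sha256(data).hexdigest()

      files = {"/usr/bin/sh": b"#!/bin/sh stub", "/etc/motd": b"hello"}
      manifest = {"/usr/bin/sh": (digest(files["/usr/bin/sh"]), 0o755),
                  "/etc/motd":   (digest(files["/etc/motd"]),   0o644)}

      def verify(on_disk, manifest):
          # A locally modified file no longer matches the signed list.
          return all(digest(on_disk[p]) == d for p, (d, _mode) in manifest.items())

      print(verify(files, manifest))                                # True
      print(verify({**files, "/etc/motd": b"tampered"}, manifest))  # False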

  • __MatrixMan__ 2 years ago

    Imagine a workflow:

    - clone a repo

    - run a command

    - run another command

    It runs several times daily. Maybe it's CI or something.

    Now suppose you want to cache the filesystem state after each command so that it can be rerun in a debug scenario, where you'd expect it to behave the same as it did the first time because it sees the same filesystem. (Having recreated the bug, you could then start making changes towards a fix.)

    You either end up with many, many copies of that repo, or you use something like this to store only the unique files and instead keep many, many small indices into that store.
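
    A minimal sketch of that "one store, many indices" option (names and contents are hypothetical): hash every file after each command; unchanged files cost nothing extra, and each snapshot is just a small path-to-hash index.

      # Toy snapshotting: one shared blob store, one small index per command.
      import hashlib

      blobs = {}   # digest -> bytes, shared across all snapshots

      def snapshot(tree):
          index = {}
          for path, data in tree.items():
              d = hashlib.sha256(data).hexdigest()
              blobs.setdefault(d, data)     # unchanged files are not stored again
              index[path] = d
          return index

      after_clone = {"repo/a.c": b"int main(){}", "repo/Makefile": b"all: a"}
      after_cmd1  = {**after_clone, "repo/a.o": b"\x7fELF..."}   # command added one file

      snap0 = snapshot(after_clone)
      snap1 = snapshot(after_cmd1)
      print(len(blobs))   # 3 blobs, not 5: the two unchanged files are shared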

    • sluongng 2 years ago

      Git has 2 object stores: the loose object store and the packed object store.

      What you said applies to the loose object store, where a full copy of each file's content is stored as a standalone (zlib-compressed) object. Those can be deduplicated quite nicely, and git does just that: the loose object store is a CAS.

      However, the packed object store is trickier. It stores objects delta-compressed against one another and then zlib-compressed, so deduplicating packfiles at the filesystem level is almost never worth it.

      Git is moving toward using the packed object store more and more. With some of the latest patches, you can effectively use git with very little loose object storage utilization (zero if you are on a server hosting git repositories).
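
      For the loose half, the CAS-ness is easy to see; this mirrors git's actual loose object layout (the hash covers a small header plus the full content, so identical blobs always land on the same path), though the snippet itself is just an illustration:

        # A loose object's path is derived from sha1("blob <len>\0" + content),
        # stored zlib-compressed under .git/objects/<2 hex chars>/<remaining 38>.
        import hashlib, zlib

        def loose_object(content):
            header = b"blob %d\x00" % len(content)
            digest = hashlib.sha1(header + content).hexdigest()
            path = ".git/objects/%s/%s" % (digest[:2], digest[2:])
            return path, zlib.compress(header + content)

        path, blob = loose_object(b"hello world\n")
        print(path)   # identical content always maps to the same object path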

      • __MatrixMan__ a year ago

        Hmm that's good to know, thanks.

        It still only applies to content that git stores, though. If your workflow applies a patch and then invokes a compiler which generates intermediate files, neither the post-patch file nor the intermediate files will end up in the packed object store. So if you're taking filesystem snapshots of those states, you'll still want to deduplicate them some other way.

classified 2 years ago

I briefly went chasing that mythical composef, of which composefs might be the plural, to no avail.

  • kdmytro 2 years ago

    Don't forget about the singular zf and btrf.

    • classified 2 years ago

      Those couldn't be pronounced like single words. I'd pronounce composef like compose-eff.

ryukafalz 2 years ago

Oh, interesting! I wonder if this could be used for an alternative implementation of Nix/Guix's store/profiles. It seems conceptually very similar, but implemented as a filesystem rather than a big bundle of symlinks.

  • aidenn0 2 years ago

    Nix at least is not content-addressable; it's derivation-contents addressable. Under ideal circumstances the same derivation will result in the same contents, but it's not a guarantee.
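
    In toy form, the distinction is roughly this (purely illustrative, not Nix's real hashing scheme): the store path is keyed on the recipe and its inputs, not on whatever bytes the build happens to produce.

      # Derivation-addressed vs content-addressed, in miniature.
      import hashlib

      def h(x):
          return hashlib.sha256(x).hexdigest()[:12]

      recipe = b"gcc -O2 hello.c -o hello"     # build instructions + inputs
      output = b"\x7fELF...hello binary..."    # whatever the build produced

      derivation_addressed = "/nix/store/%s-hello" % h(recipe)   # keyed on the recipe
      content_addressed    = "/store/%s" % h(output)             # keyed on the output bytes

      # Rebuilding the same recipe reuses the derivation-addressed path even if the
      # output bytes differ (timestamps, nondeterminism); content addressing only
      # collapses things that are byte-for-byte identical.
      print(derivation_addressed)
      print(content_addressed)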

    • __MatrixMan__ 2 years ago

      I've often imagined a system that tries to build consensus around which (content-addressed) code snippets can be treated as pure functions with (content-addressed) memoized outputs and which ones need to be rerun.

      If you're on a well-worn path you could operate mostly by lookup and only run the code if something doesn't smell right.
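
      A stripped-down version of that lookup-first idea (everything here is hypothetical, and a real system would consult a shared, consensus-backed table rather than a local dict):

        # Memoize snippets keyed on a hash of (code, inputs); only execute on a miss.
        import hashlib, json

        cache = {}   # hash(code + inputs) -> recorded output

        def run_or_lookup(code, inputs):
            key = hashlib.sha256((code + json.dumps(inputs, sort_keys=True)).encode()).hexdigest()
            if key in cache:
                return cache[key]       # well-worn path: pure lookup
            env = dict(inputs)
            exec(code, env)             # assumes the snippet really is pure
            cache[key] = env["result"]
            return cache[key]

        snippet = "result = x * x"
        print(run_or_lookup(snippet, {"x": 7}))   # executes: 49
        print(run_or_lookup(snippet, {"x": 7}))   # cache hit, no execution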

      • wereallterrrist 2 years ago

        Unison (lang) ?

        • __MatrixMan__ 2 years ago

          Indeed. I'm watching that one closely, although I haven't made time to do much coding in it.

          Even if it turns out to be perfect in every way though, it takes a long time for the masses to adopt new languages. I think it might be worth finding ways to build consensus around claims like:

          > this particular bit of python can be treated like a pure function

          ...even though there aren't any guarantees built into the language.

hawski 2 years ago

I guess it's a solution to Yocto build directories being huge while containing lots and lots of duplicates. I had thought about a filesystem layer that would do dedup like this, and this seems to address exactly that problem.

dathinab 2 years ago

fun consideration:

any file system can act as a key-value store

any key-value store can act as a content addressable data store

though with limitations, like fundamental constraints not being enforced, performance problems, and potential unexpected issues (like the number of files per folder)

For the above reasons, and given that content-addressable storage is a poster child for fast(1), distributed, reliable solutions (1: at least in the "single entity owns the backend database" case), I would never recommend using the filesystem for it outside of prototyping (and system-internal use cases like docker storage).
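
For concreteness, a prototyping sketch of that two-step reduction (hypothetical paths; the two-level directory fan-out is there exactly because of the files-per-folder problem, and none of the other constraints are enforced):

  # Filesystem as key-value store, key-value store as CAS:
  # put() derives the key from the content hash, so identical data is stored once.
  import hashlib, os

  ROOT = "/tmp/toy-cas"

  def put(data):
      digest = hashlib.sha256(data).hexdigest()
      d = os.path.join(ROOT, digest[:2], digest[2:4])
      os.makedirs(d, exist_ok=True)
      path = os.path.join(d, digest[4:])
      if not os.path.exists(path):      # identical content is written only once
          with open(path, "wb") as f:
              f.write(data)
      return digest

  def get(digest):
      with open(os.path.join(ROOT, digest[:2], digest[2:4], digest[4:]), "rb") as f:
          return f.read()

  key = put(b"some payload")
  assert get(key) == b"some payload"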