> Al Viro asked if there is a plan to allow mounting hand-crafted XFS or ext4
> filesystem images. That is an easy way for an attacker to run their own code
> in ring 0, he said. The filesystems are not written to expect that kind of
> (ab)use. When asked if it really was that easy to crash the kernel with a
> hand-crafted filesystem image, Viro said: "is water wet?"
Somehow reminds me of this conversation [0]:
[0] https://lwn.net/Articles/718639/
This is why an rsync.net account that is enabled to allow zfs send/recv is actually inside a VM and the customer is given their own zpool and their own root login.
It's really resource intensive to do it this way and there are other, much simpler and scalable ways to provide the ability to zfs send into cloud storage ...
However, there is universal agreement among the ZFS coding community[1] that allowing someone to 'zfs send' an arbitrary datastream (in this case, a snapshot) is tremendously dangerous.
In the best case, the malicious actor can crash the kernel and deny service. In the worst case, the malicious actor could destroy the underlying zpool.
[1] Please consider attending the OpenZFS developer Summit in November if you have any interest in this ...
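For context, a minimal sketch of what such a send/receive flow looks like (the pool and dataset names are hypothetical); the danger described above is that `zfs receive` on the far end parses an attacker-controlled stream in kernel context:

```shell
# Snapshot a dataset, then replicate the snapshot to a remote pool
# over SSH. "tank/data" and "backup/data" are hypothetical datasets.
zfs snapshot tank/data@nightly
zfs send tank/data@nightly | ssh user@backuphost zfs receive backup/data

# Incremental follow-up: send only the delta between two snapshots.
zfs send -i tank/data@nightly tank/data@nightly2 | \
    ssh user@backuphost zfs receive backup/data
```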
I agree fully with the desire for resilience throughout the kernel, especially in today's world. It was very different 20 years ago when Linux was still young.
OTOH, it's one of the things that I think made Linux succeed in the beginning. Everyone could upstream a half-arsed driver for something and it would get fixed while people used it and encountered bugs. Now that Linux is used professionally everywhere, that just isn't feasible anymore.
On another note, I remember that some time ago there was a talk about Linux file system fuzzing given at some conference, and ext4 fared the best by far, which is why I'm still using that exclusively, although some of the features of btrfs would come in handy at times.
I think the main reason why btrfs has not fully replaced the ext family by now is its bugs. It's understandable that they don't want to add more buggy filesystems.
Personally, I’ve never noticed bugs, but the performance characteristics are truly weird. If you fill up a large partition, expect to spend a few days running various incantations to free space, even after removing plenty of files.
I’m completely amazed at how fast snapshots can be done, though.
Yeah, I'd classify that bad performance with nearly full drives as a bug. Compare that to one of my ext4 partitions, which has been >90% full through years of active use without major issues. Occasionally I get errors that it can't write any more due to capacity; I just delete some unneeded stuff and go on with life.
It's not really a bug, though, taken in the context of how these file systems work with copy on write. You need somewhere to write, so if you're low on free space and have the drive heads seeking...
It really makes it seem like handling free space was an afterthought in btrfs.
Infinite memory and infinite storage are great abstractions, easy to use models and a lot of software runs on those models... but in this case it seems like ENOSPC should have been designed in from the start.
I have noticed lots of things break when the resource horizon shrinks. Weirdness happens as one crosses into 1/2 (for things that like to double, like amortized data structures) and then again at <5% of free space/CPU/network/etc. Often the best mitigation is to have excess reserve and remove the false blocker (file on disk, optional process eating RAM, etc.).
Just more anecdotes: I accidentally filled up a 2 TB btrfs filesystem on an SSD with a buggy backup script. I deleted the stuff that shouldn't be there, ran a few recommended incantations, and was back to normal in about ten minutes. It was much less of a big deal than I expected based on what I'd read.
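For reference, the "recommended incantations" for a full btrfs filesystem typically amount to inspecting block-group usage and running a filtered balance; a sketch (the mount point and usage thresholds are illustrative):

```shell
# See how space is split between data and metadata block groups.
btrfs filesystem usage /mnt

# Compact mostly-empty data block groups, starting with the emptiest;
# raise the threshold gradually if more space needs to be reclaimed.
btrfs balance start -dusage=5 /mnt
btrfs balance start -dusage=25 /mnt
```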
Btrfs has had its fair share of bugs, but rarely do they result in the inability to at least mount the file system read-only and get your data out. What Btrfs has further exposed are device bugs, in particular in firmware. When write ordering doesn't happen the way the file system expects, it's a big problem, and Btrfs does not tolerate it: it abruptly goes read-only in order to prevent amplifying file system confusion. This is separate from silent data corruption, which Btrfs also detects and complains about, but which usually just results in EIO, as Btrfs will not propagate data it thinks is corrupt due to csum mismatches. But even in that case you can get your data out.
Yep. I personally won't use btrfs for this reason. The last thing I want to deal with is filesystem corruption. If I needed the advanced features, I'd just use zfs.
"just use zfs" is its own can of worms.
Maybe Canonical will solve that; we'll see how distributing ZFS together with the kernel goes.
> "just use zfs" is its own can of worms.
it's been fine for some of us for, what, nearly 10 years now?
If you consider it normal to self-compile kernel modules on your production machines (or to have a compiler there, or to use a mutable system at all), well, then yes.
Otherwise, it is a massive PITA. I do have one machine with ZFS, and I made the mistake of placing the root into a subvolume. Won't do that again.
A few distros include ZoL drivers in their repos.
There's also FreeBSD, as well as the various platforms derived from the now-defunct OpenSolaris.
I've never had to compile ZFS myself (though I have, on occasion, chosen to because I wanted to try features before they hit the repos).
I'd rather do that than use a buggy filesystem!
ext4 and XFS are perfectly fine; you don't have to choose only between ZFS and btrfs. In their case, being the boring ones is an important feature.
And they are supported by any Linux distro out of the box.
Oh sure. I generally stick to ext4. But IF I needed the advanced features of btrfs/zfs, then I'd choose zfs.
More than 10 years in my case. I've run ZFS on production systems on Solaris, OpenSolaris, Nexenta, FreeBSD (vanilla, not FreeNAS) and, more recently, Ubuntu. Never had an issue.
That said, I did once run into an issue running ZFS on Arch Linux which caused data loss. That was a highly experimental setup, though, and it was before ZoL really took off (incidentally, I've also run Btrfs on Arch Linux and that also caused me data loss).
Hopefully I'm not jinxing things by saying this, but ZFS has saved me from excessive downtime on a number of occasions. It has even recovered from corrupted-superblock failures (when a RAID controller was faulty and randomly dropping devices during heavy load).
After an experience where btrfs decided to break my combined filesystem I vow to never use it again.
Could you explain/describe what happened and why and how to prevent it?
Sure, I was trying to remove a disk from my array (1 TB, with 50 GB used; the other disk was 4 TB and had ~600 GB used). I tried `btrfs device remove /dev/disk1 /mnt` and it refused, claiming there was no free space. No amount of arcane commands worked. Eventually I just copied all the files somewhere else and nuked the filesystem entirely.
Due to the way btrfs allocates in "collections" of extents called block groups, it's possible all the space was allocated to mostly empty block groups, which could make ENOSPC possible. But that's a rather old set of bugs that I haven't seen in a very long time, and it predates all the modern space handling code. It must have been pre-4.0 kernels. And I did run into it myself, on purpose, many times while trying to help improve the behavior.
A non-obvious but very straightforward way around such a wedged file system is to add a 3rd device. It could even be a USB stick; back then I was using small 4 GiB sticks and it would work. That was enough to allocate a couple of metadata-only block groups on the stick, to write out the file system changes necessary to back out the second device. Once that completes, a brief filtered balance (e.g. `btrfs balance start -dusage=10` is usually sufficient) allows enough free space on the 1st device to back out the 3rd (the USB stick).
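A rough sketch of that recovery sequence (device paths and the mount point are hypothetical):

```shell
# 1. Add a small third device so the allocator has somewhere to write.
btrfs device add /dev/sdc /mnt

# 2. Removing the second device can now complete, since the metadata
#    changes that removal requires have free space to land in.
btrfs device remove /dev/sdb /mnt

# 3. Compact mostly-empty data block groups to free space on the
#    first device...
btrfs balance start -dusage=10 /mnt

# 4. ...then back out the temporary device as well.
btrfs device remove /dev/sdc /mnt
```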
The non-obvious thing about any COW file system is that deletion always requires free space. There is no such thing as deleting a file with COW unless the fs can write that deletion change to all the affected trees into free space. Once the entire set of metadata changes is committed, then the data and metadata extents for those deleted files can be freed.
Anyway, a lot has changed even in one year in Btrfs, let alone the past five years. It's thousands of line changes per kernel release.
Wondering if there is a flash-memory-friendly filesystem that does not overwrite existing blocks until the card is full, but instead writes any changes to new memory cells rather than overwriting existing ones. This is not a problem for devices like cameras, since they only create new files, so wear is distributed evenly among all cells (assuming you remove all files when the card is full, before taking new photos), but it's a real problem for devices like a Raspberry Pi running Raspbian, which do lots of small updates (log files, ...). And yes, I understand the root partition could be set to read-only and logs stored on tmpfs, but I'm still curious.
PS: I still struggle to understand why I'm getting downvotes. Please be so kind as to write why, so I can eventually delete this comment.
It's called a log-structured file system. LWN has a good overview:
https://lwn.net/Articles/353411/
SSDs implement them internally, and some Android devices use the F2FS filesystem:
https://en.wikipedia.org/wiki/F2FS
Log-structured designs are conceptually robust, but SSDs/NVMe drives have failure modes that can do things like "in this 64 MB extent, all bytes have their 6th bit erased". So it's an illusion to think that in case of a crash, the things you did not touch remained unharmed.
here's a paper that will shatter some illusions: https://www.usenix.org/system/files/conference/fast13/fast13...
Considering how the hardware doesn't provide the serializability guarantees that it claims, why do we pay the high software performance and complexity hit to try to get the same?
Lockfiles, O_SYNC, flush(), etc. all become unnecessary if we just assume that all data is at risk in case of improper poweroff. libeatmydata does this, and dramatically increases performance for some workloads.
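libeatmydata is typically used as an `LD_PRELOAD` wrapper that turns `fsync()` and friends into no-ops for the wrapped process, which is only sane where the data is disposable anyway (test databases, CI runs, throwaway package installs). The library path below varies by distro and is illustrative:

```shell
# Run a write-heavy command with fsync()/fdatasync()/sync()/msync()
# turned into no-ops via the eatmydata wrapper script.
eatmydata dpkg -i somepackage.deb

# Roughly equivalent without the wrapper (library path varies by distro).
LD_PRELOAD=/usr/lib/libeatmydata/libeatmydata.so dpkg -i somepackage.deb
```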
I/O has always been a happy-path-only adventure in mainstream software and hardware.
Attempting consistent I/O (kernel and hardware will try their best to thwart any attempt) etc. may help in some cases, but ultimately there are hardly any guarantees when it comes to power loss, and your data may be gone or corrupted no matter what you did.
I thought most flash devices did their own block mapping to implement this. Or is that just SSDs?
Only tiny tiny embedded devices don't use a wear-leveling controller of some kind.
AFAIU, you are correct, but apparently, "most flash devices" does not include MicroSD cards. At least that is how I remember it.
No, they've definitely got a Flash translation layer in there, and a microcontroller to run it. It's pretty much unavoidable for making a working flash device with reasonable performance on traditional filesystems.
But SD cards, CF cards, eMMCs, and USB sticks (unless they're enterprise-grade and claim otherwise) all have either very primitive wear leveling in their FTL or none at all; it's only SSDs that have full-blown log-structured algorithms.
There are a few https://en.wikipedia.org/wiki/Flash_file_system
It feels surprising to me that a Linux system can be bricked or rooted by a maliciously constructed filesystem and this is not considered a major bug. Surely this is an obvious attack vector? (Dropped USB sticks, etc.)
It's a pretty good argument for FUSE-style userspace filesystems. Then the code reading the filesystem need not have any more permissions than the user mounting it.
HURD has it also.
https://www.gnu.org/software/hurd/hurd/documentation/transla...
That might prevent privilege escalation but I'm not sure it prevents a kernel wedge.
If you can wedge the kernel with a FUSE filesystem, that is certainly a severe bug, because it means an unprivileged user is able to DoS the system for everyone.
It's the reason libguestfs exists.
... or block storage systems in container environments, where the user somehow has write access to the block device.
Aren't those the usual way to provision some databases?
One argument is that allowing non-root to mount(2) is already a big security problem.
Fuzzing was suggested in the past to shake out some corrupt filesystem image bugs: https://lwn.net/Articles/685182/
User Mode Linux is fresh in my mind due to a recent HN post on it – wouldn't it be easier to fuzz the kernel as a user-mode program instead of futzing around with a `/dev/afl` device?
Being so pluggable, a file system should even be fuzzable inside of a highly sandboxed environment.
1) compile fs code into wasm
2) generate in-memory disk images
3) run fs code over in-memory disk image
4) use a neural net to search the fuzz space
At that point one could couple something like profile-guided optimization, but with branch-predicted adversarial input differentiation. This would automatically find patterns between on-disk data structures and the code that is executed from changes in those structures.
Zero-syscall, zero-VM-exit file system fuzzing, all in user space. One could easily get thousands of cores working on this problem in short order.
NetBSD has been sort of doing this; they have rump kernels to run kernel pieces in userspace, and testing/fuzzing is a significant use.
Noob question - what is the advantage for Huawei in getting this moved out of staging? All of their devices are already using it despite it being in staging. Is it to reduce maintenance burden on themselves because it's now a shared responsibility?
Filesystems are fairly pluggable, so them keeping it private isn't that big a burden.
They would be expected to maintain it either way.
The biggest benefit would probably be if they hope to get Google to make it a standard part of Android, or hope for other manufacturers to start using it (and therefore sharing the work of feature development and maintenance).
Exactly.
But also to make sure that it isn't kicked out of the kernel. Staging is not meant to be a permanent location for any kernel code.
Desktop Linux systems often feature some kind of automount capability so you can just plug in an external drive and have it work. Windows and macOS provide similar facilities. Per the article, is this actually insecure? Will plugging in a corrupt/maliciously modified ext4-formatted drive on an automount Linux system enable kernel compromise? If so, why is it that this is generally OK on Windows/macOS (barring silly things like autostart viruses)?
As an aside, if I were to plug an ext4-formatted memory stick in, and the system automounts it, and I've placed a setuid binary on there, will it work? Or does the automounter predict that and mount with nosuid?
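For what it's worth, desktop automounters such as udisks2 do mount removable media with options that defang setuid binaries and device nodes; a manual mount can do the same (device and mount point are hypothetical):

```shell
# Mount removable media so setuid bits and device nodes are ignored,
# as desktop automounters do by default.
mount -o nosuid,nodev /dev/sdb1 /media/usb

# Verify the effective mount options.
findmnt -n -o OPTIONS /media/usb
```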
On a side note, how does one go about adding data to a read-only file system such as EROFS?
This is common in the embedded world with squashfs - unpack it somewhere, add/modify the filesystem there, re-pack it.
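With squashfs, that unpack/modify/repack cycle looks roughly like this (image and file names are hypothetical):

```shell
# Unpack an existing squashfs image into a working directory...
unsquashfs -d rootfs_work rootfs.squashfs

# ...modify the tree...
cp my.conf rootfs_work/etc/

# ...and repack it. -noappend creates a fresh image rather than
# appending to an existing one.
mksquashfs rootfs_work rootfs-new.squashfs -noappend
```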
If it's a working environment you could add an overlayfs and repack it after working on it. Used to do that when I ran Gentoo in ram. Emerge and squash.
At least I think that's what I did... Man time flies
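The overlayfs-plus-repack approach described above looks roughly like this: the squashfs image is the read-only lower layer, a writable directory sits on top, and the merged view gets squashed back into a new image (all paths are hypothetical):

```shell
# Read-only squashfs image as the lower layer.
mount -t squashfs -o loop,ro rootfs.squashfs /mnt/lower

# Writable upper layer plus the work directory overlayfs requires;
# both must live on the same writable filesystem.
mkdir -p /tmp/upper /tmp/work /mnt/merged
mount -t overlay overlay \
    -o lowerdir=/mnt/lower,upperdir=/tmp/upper,workdir=/tmp/work \
    /mnt/merged

# Work in /mnt/merged, then repack the result into a fresh image.
mksquashfs /mnt/merged rootfs-new.squashfs -noappend
```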
SquashFS also supports append-only modifications in-place.
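To illustrate: when the destination image already exists, mksquashfs appends the new source files to it by default instead of recreating it (that is what `-noappend` switches off):

```shell
# Append the contents of new_files/ to an existing image in place;
# mksquashfs only rebuilds from scratch if -noappend is given.
mksquashfs new_files/ rootfs.squashfs
```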