EscapeFromNY a year ago

Was I supposed to be maintaining my btrfs partition all this time? I just formatted my disk as btrfs when I bought my laptop and haven't thought about it since.

  • xtracto a year ago

    Just recently I had a worst-case nightmare scenario with a 5TB disk holding a single BTRFS partition (no RAID or anything crazy): a single-disk, single-partition external USB drive running under Linux Mint.

    At some point I restarted the computer and it wouldn't mount the partition, saying the mount path was already in use (even after restarting??), even though the disk was not mounted. Searching around, apparently there is some kind of bug where Linux will "cache" the BTRFS UUID [1].

    At the time, the "solution" I read was to change the UUID of the BTRFS partition running btrfstune. Which I did... and it supposedly changed successfully. Except that after trying to mount the partition again, it will tell me that there was an error in the partition: the tree had mismatching UUIDs :-/. I tried updating the UUID several times, and the command ended in success, but the error remained. My BTRFS disk was officially broken...

    After spending several hours trying to fix the issue following the BTRFS documentation (pretty shitty TBH), in the end I had to do `btrfs restore -iv ...` to extract the data from the disk onto some other external disk (formatted NTFS this time!!). The command is still going about 2 weeks later, and I am slowly recovering my 5TB of data.
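
    For reference, the general shape of that recovery command is roughly this (device and destination are placeholders, not my actual paths):

      # read-only extraction from the broken filesystem onto another disk;
      # -i ignores errors on individual files, -v prints each file recovered
      btrfs restore -iv /dev/sdX1 /mnt/recovery-disk/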

    But the one thing clear to me is that I don't trust BTRFS or its admin commands AT ALL after this experience. Once I finish recovering my data, I'll nuke the partition, format the disk as NTFS and forget about this sour experience.

    [1] https://unix.stackexchange.com/questions/603528/why-am-i-get...

    • Arnavion a year ago

      >Searching around apparently there is some kind of bug where Linux will "cache" the BTRFS UUID [1].

      It's not a bug. It's just a fact that every BTRFS filesystem visible to the kernel (mounted or not) should have a unique ID. Otherwise operations on one filesystem (including mounting the partition that contains it) can end up applying to the other filesystem.

      >At the time, the "solution" I read was to change the UUID of the BTRFS partition running btrfstune.

      The solution would've been to find out why you have two partitions with the same filesystem ID and remove one of them. One example: if you have the BTRFS filesystem on an LVM partition and then you clone the LVM partition, you end up with two BTRFS filesystems with the same ID. I accidentally encountered it while converting an unencrypted BTRFS partition into an encrypted one by dd'ing it into a new LUKS device - when I tried to mount the encrypted partition it ended up mounting the unencrypted one.

      Since you say it's an external USB drive, perhaps it disconnected and reconnected uncleanly and the kernel thought there were two of that drive connected at the same time.
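
      A quick way to check whether that's what happened (just the stock tools, nothing exotic):

        # list every block device the system sees as btrfs, with its UUID;
        # the same UUID showing up on two devices is the red flag
        blkid -t TYPE=btrfs
        # show which devices the btrfs module has registered, mounted or not
        btrfs filesystem show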

      • tbrownaw a year ago

        > It's not a bug. It's just a fact that every BTRFS filesystem visible to the kernel (mounted or not) should have a unique ID.

        Maybe it's not a code bug, but imposing global uniqueness requirements where you can't just `cp /dev/blkdev1 myfs.img` and be able to work with the new image seems rather like a design bug.

        • yrro a year ago

          It seems like a bit of a tall order to expect this to work with a multi-device filesystem. To the kernel it looks like a second multipath connection to the same disk just appeared; how should it handle this? Refusing to mount the cloned block device looks sensible.

          While not a multi-device filesystem, I believe XFS has the same behaviour and you have to change the UUID (or use xfs_copy in the first place) before you can mount a cloned block device.
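
          For XFS the fix is at least a one-liner, if I remember the tooling right:

            # give the cloned, unmounted XFS filesystem a fresh UUID so both copies can be mounted
            xfs_admin -U generate /dev/sdY1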

          (I wonder how ZFS handles this?)

      • bandrami a year ago

        Indeed, before we had FS UUIDs the kernel would simply mount partitions at random, there being no other possible way to distinguish one filesystem from another. It was a wild time.

      • londons_explore a year ago

        Sounds like a kernel bug ... If it encounters another filesystem with the same ID, it should warn in dmesg and assign a null ID to one of them.

        • _flux a year ago

          If it were this simple, they'd have done it, right? The idea is that filesystems that have the same UUID are actually parts of the same filesystem. It's a long-standing, hm, feature of Btrfs.

          I think the actual dynamic filesystem discovery happens in user space and the kernel just gets a list of devices, so it should be even easier to change that.

          • thrtythreeforty a year ago

            That's fine but we can have a "device UUID" and "filesystem UUID" as separate constructs and then the complaint remains valid. It's okay for it to fail if you dd and try to mount (suboptimal, but okay), but it should never silently start breaking things while appearing to work.

            • _flux a year ago

              Btrfs already has separate fsid and device id. Anyway, if you clone the device then those also stay the same, so I'm not sure what this would solve. What would the complete solution look like?

              AFAIK the filesystem UUID, the device UUID, or both are sprinkled all around the filesystem, I suppose to identify data belonging to the fs in case of fsck. According to my quick research the device id cannot be changed. I suppose if it could be changed by some means, then btrfs rescue clear-uuid-tree could be used to fix the other uuids.

    • generalizations 10 months ago

      That doesn't even sound like a BTRFS problem; I recently (couple weekends ago) had the same UUID cache problem with an XFS partition.

    • bravetraveler a year ago

      I can echo/support this claim - I wandered down the same rabbit hole.

      It doesn't even get much better with many drives and avoiding the docs/tooling

      BTRFS arrays will consistently corrupt with my reset button -- RAID10 on gen4 NVMe drives... while LVM/dm-raid + traditional file systems are absolutely fine

      • xtracto a year ago

        >BTRFS arrays will consistently corrupt with my reset button

        YES!!! Sorry to beat a dead horse, but it has been so frustrating for me. I thought I did something wrong - like, how could the partition break for doing nothing? Where did I fuck up? I cannot imagine the experience with a RAIDed BTRFS array *shudders*

        • phire a year ago

          I had a BTRFS raid corrupt because one of my HDDs had a firmware bug and was silently discarding writes.

          Sure... HDDs shouldn't do that, but it's literally the job of a BTRFS raid to protect me from such issues. The data was all still intact on the other drive. I could see it with dd, but BTRFS refused to read it.

      • curt15 a year ago

        Isn't copy-on-write supposed to be inherently resilient to crashes[1]? Why would btrfs be more susceptible to resets than traditional filesystems?

        [1]: https://unix.stackexchange.com/questions/634050/in-what-ways...

        • londons_explore a year ago

          I think most of the bugs are in the drives themselves - doing things like claiming data is flushed when it is not.

          However, btrfs is a very fragile filesystem and any inconsistencies caused by the drive bugs tend to lead to it breaking and refusing to mount, whereas on other filesystems you'll just get a warning in dmesg.

          It doesn't help that btrfs developers refuse to work around drive bugs.

          • phire 10 months ago

            > It doesn't help that btrfs developers refuse to work around drive bugs.

            The entire selling point of the filesystem is that it's resilient to data corruption. Drive bugs are just another class of data corruption.

        • bravetraveler a year ago

          I doubt that because of the experience I just outlined

          In theory perhaps, but absolutely not in practice. Making things atomic is tricky.

      • yrro a year ago

        > BTRFS arrays will consistently corrupt with my reset button

        I know you don't want to get asked this, and I wouldn't if you'd just said "I tried this once a while ago and it happened", but if it's consistent... have you filed a bug?

        • bravetraveler a year ago

          It's a fair ask! I don't mind.

          It's occurred to me, but I expect a fair bit of push back; 'you abuse the array and expect it to... what?'

          I'm not really interested in debating that, and it's remarkably easy to get hung up on. I've floated this issue a few times unofficially and it's a constant sticking point

          I know expecting coherency in this situation is a little silly, but BTRFS is notably less reliable/robust than the alternatives

          All in all... I'm avoiding a situation that may or may not occur, by not participating.

          Not saying it's a good thing, but it's easier for me to just use What Works

          • yrro a year ago

            With properly functioning hardware (i.e., hardware that doesn't lie to you when it says "that has been committed to durable storage"), metadata or data corruption following power loss is definitely a bug! It didn't even occur to me that you might get a different response.

            I'd be kind of annoyed if I went to the trouble of reporting a bug and got that response. But I'd also not assume it would end that way...

            • bravetraveler a year ago

              With where we're at with SSDs chasing the performance dragon... I suspect some deception is at play!

              The DRAM caches are pretty hefty compared to metadata, so I wouldn't be surprised - though that's about the extent of my understanding.

              Even then, though - other filesystems are certainly more durable, dealing with whatever these drives are doing.

              FWIW, this was with four Sabrent Rocket 4.0 Plus drives in RAID10 with 4K sectors. I'm not sure if these are particularly well known one way or another for deception/trickery

              They hold up just fine with either dm-raid or native LVM raid10 under more traditional filesystems (I tested EXT4/XFS, using the latter consistently)

    • _flux a year ago

      Surely this incident highlights the importance of backups, right? 5 TB is even a manageable amount of data.

      I also used to run btrfs in btrfs-RAID10 configuration until apparently a flapping SATA link and fsck attempts were able to break the fs completely. Full system backups were great that day. I run https://kopia.io/ nowadays every three hours during day time and I've been quite happy with it.

      Nowadays I run bcachefs.. Backups are still handy :).

      I suppose the reason why you chose NTFS was to be able to access the data from Windows, at least in case of emergency? Because there are a lot of filesystems that are presumably more mature on Linux than NTFS is.

      • xtracto a year ago

        The funny thing is that THAT is my backup drive, mostly. It has lots of crap, including GBs synced from Google Drive (photos being the most important). Other than that, it's stuff I could re-download from elsewhere, but 3TB is a lot to re-download. Ah yeah, and it also has the TimeShift backups.

        I thought of going EXT4, but as you said, NTFS is more widely supported.

    • MaKey a year ago

      I'm not sure you can fault BTRFS for the issue you've experienced. It sounds like your system had a problem which led to you breaking your (likely perfectly fine) BTRFS file system.

      • omniglottal a year ago

        A filesystem breaks by the user pushing a button on their computer and you blame the user...?!

        • MaKey a year ago

          He had trouble mounting it on his system because of a duplicate UUID - this doesn't mean that the filesystem was broken. As Arnavion wrote:

          > Since you say it's an external USB drive, perhaps it disconnected and reconnected uncleanly and the kernel thought there were two of that drive connected at the same time.

  • bravetraveler a year ago

    You're not benefiting from it as much as you could be; others have mentioned scrubbing

    That will read all of the data from the storage device(s), checking for coherency. In the case of redundancy (ie: RAID1) and corruption/rot, it would use the other copy to make things whole

    Beyond that and trim, most of what's in here is fairly specific. ie: avoiding ENOSPC or dealing with array changes

    edit: I think running this more than ~monthly is overzealous... and only really meaningful if you have redundancy

    If memory serves, the checksums are validated when you read things anyway - so I question doing passes too aggressively. I'll accept a bit for bitrot
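
    If you do want to run one, it's something like this (mountpoint is whatever yours is):

      # start a scrub of the mounted filesystem (runs in the background)
      btrfs scrub start /
      # check progress and any checksum errors found or repaired
      btrfs scrub status /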

    • londons_explore a year ago

      Thing is, the drive itself typically is in a much better position to protect against bitrot.

      Both SSDs and spinning metal drives have ECC bits stored with the data, and the drive can detect whether the data was read easily or required lots of error correction. Based on that they can make the optimal call of how often to re-read data to see if it is 'nearly rotted' and needs a rewrite.

      The filesystem has no knowledge of any of that, so has to do dumb periodic scans.

  • sp332 a year ago

    Since you have checksums available, you might as well do a periodic "scrub" and see if any of your data has been corrupted. Trim is useful for SSDs because it can make your drive's firmware more effective at wear leveling. Balance is pretty workload-specific and I don't have a heuristic for when it might make a difference, but being super unbalanced might cause performance issues.
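
    If you want to see how (un)balanced the allocation actually is before bothering, something like this shows it (path is your mountpoint):

      # shows space allocated to data/metadata chunks vs. what's actually used, per device
      btrfs filesystem usage /mountpoint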

    • EscapeFromNY a year ago

      Scrubbing sounds useful. I'll start doing that every so often.

      I thought trim was enabled by default (https://wiki.archlinux.org/title/Btrfs#SSD_TRIM). Does fstrim do something more than that or is it the same thing?

      • sweettea a year ago

        If there is no discard work to do, a fstrim on btrfs will do nothing. Discard=async, the new default, should be enough to trim deleted data without using fstrim. But nothing bad happens by using both.
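
        For reference, the mount-option form is just an fstab entry along these lines (UUID and mountpoint are placeholders):

          # trim freed blocks asynchronously as they're released
          UUID=xxxxxxxx-xxxx  /data  btrfs  defaults,discard=async  0 0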

    • kevincox a year ago

      IIUC the filesystem should balance itself. So unless you have added a new device to a mostly full filesystem you should be fine.

      • sp332 a year ago

        If you delete a bunch of files it might get unbalanced? That wouldn't degrade performance though, I think. Or is it one of those FSes that makes sure everything is balanced on disk anyway?

        • kevincox a year ago

          In theory yes. But it is unlikely. The files are very likely balanced before you delete them so it should be mostly balanced after. It will also self-correct the difference after a bit more use.

          Unbalanced drives can affect performance as one drive will have to do more of the work, often becoming the bottleneck. But except for extreme cases it is probably a rounding error.

        • foobarqux a year ago

          I don't understand the details but I know one case it can fix problems is after deleting files to free up space on a nearly full disk.

          • tremon a year ago

            Btrfs has separate allocation pools for data and metadata. If you delete files, the freed-up space is returned to the data pool, but that does not make it generally available for use in the metadata pool. So in the case where the entire disk is already allocated to one of the pools, neither pool can grow beyond its current size.

            One of the effects of balancing the tree is to release unused space from the allocation pools to general availability, solving that problem.
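
            That's what the usage filters on balance are for - something like this (the 10% threshold is just a common starting point, not gospel):

              # repack only chunks that are mostly empty, returning them to unallocated space
              btrfs balance start -dusage=10 -musage=10 /mountpoint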

  • kevincox a year ago

    Running an occasional scrub is probably a good idea. I don't think any of the other commands mentioned here should be required. Naturally the disk should be balanced as it is written, so you shouldn't need explicit balances unless you added a new drive when the others were quite full.

    • Gigachad a year ago

      If it was a good idea, I assume it should already be happening for me. What would be the point of mindlessly running some command periodically?

      • lmm a year ago

        > If it was a good idea, I assume it should already be happening for me.

        It would, if you were running a well-designed/maintained operating system. Unfortunately many Linux distributions make poor decisions.

      • bravetraveler a year ago

        Does your car wash itself? /s

        It very well may be! Depending on your distribution... they may already have similar services/timers for periodic scrubbing

        On Fedora they aren't provided; but are with Arch

        Unless you run an array it's fairly meaningless, beyond being an administrative-reporting tool. It can't fix failed checksums without mirrors or parity
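
          Where the upstream units are shipped, it's just a timer per mountpoint (the instance name is the systemd-escaped path):

            # periodic scrub (monthly by default, I believe) of the filesystem mounted at /home
            systemctl enable --now btrfs-scrub@home.timer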

        I think the checksums are validated on read anyway, so it's mostly to mitigate bitrot - very specific to 'glacial' storage

      • kevincox a year ago

        Because different use cases have different requirements. So having the maintenance command be separate from the filesystem gives ultimate flexibility.

        • Gigachad a year ago

          The OP comment didn't give any context for what judgement call you're supposed to be making - just blindly "run this command periodically". If I'm not making any actual decision, it should just be run for me by default.

  • heavyset_go a year ago

    Do a scrub every once in a while. Mount with discard=async or turn on systemd's fstrim service if you're using SSDs.

    If it would actually affect performance, turn on autodefrag. Be aware that running a manual defrag, instead of using the mount option, will break reflinks.
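
    e.g., on a systemd distro the periodic-trim option is just:

      # weekly trim of all mounted filesystems that support it
      systemctl enable --now fstrim.timer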

Locrin a year ago

I have happily used BTRFS in RAID1 on my Plex server for a good while now. I originally had a shucked WD 8TB drive and started dreading having to rebuild my library if the drive failed. So I got an 8TB Seagate disk and created a RAID1 setup with one drive missing. I then copied over the data and added the old disk to the new raid. It took a good while to balance itself but it's been problem-free since. I might use ZFS if I switch to Ubuntu for my home server, but I am using Fedora now and want to stick to what is natively supported.
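
For anyone wanting to do something similar, a common route (sketch only; device and mountpoint names are made up) is to copy onto a single-device filesystem first, then add the old disk and convert:

  # after copying the library onto the new single-device btrfs filesystem
  btrfs device add /dev/sd_old /mnt/plex
  # rewrite data and metadata so both are mirrored across the two disks
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/plex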

DiabloD3 a year ago

There is no reason to ever use btrfs, IMO.

btrfs is the result of Oracle trying to clone, badly, ZFS. When they bought Sun, they discontinued development on btrfs completely, as they (thought they) owned ZFS. Nobody sponsors btrfs development anymore, and its development has completely stagnated; while ZFS, under the OpenZFS project, continues to accelerate and absolutely dominates the enterprise mission critical file system sort of market.

Just use the real deal: use ZFS.

Due to the massive data loss issues with btrfs, for example, Red Hat removed btrfs support entirely in RHEL 8, after the preview feature in 7 bombed. They default to, and highly recommend, XFS (as do I, it's a good file system).

  • AnonC a year ago

    > There is no reason to ever use btrfs, IMO.

    > ...

    > Due to the massive data loss issues with btrfs, for example, Red Hat removed btrfs support entirely in RHEL 8, after the preview feature in 7 bombed. They default to, and highly recommend, XFS (as do I, it's a good file system).

    Synology strongly recommends btrfs in its NAS systems and promotes that over ext4, which is also supported by Synology. [1] On the consumer side, I guess this would probably be larger than ZFS/XFS usage.

    [1]: https://www.synology.com/en-us/dsm/Btrfs

    • DiabloD3 a year ago

      I wouldn't wish a Synology box on anyone.

      If you need a well supported NAS that anyone can operate, just slap TrueNAS onto a budget NAS box and be done with it. TrueNAS uses ZFS, and actively discourages btrfs due to a history of data loss issues.

  • akvadrako a year ago

    A good reason to use Btrfs over ZFS is if, like me, you use Fedora Silverblue. It's supported by default while ZFS is essentially impossible to use – I use VMs for the occasional need.

    Another good reason is simplified tooling – ZFS has a lot of special ways of doing things different from all other filesystems.

    Red Hat seemed to remove Btrfs because they want to move towards Stratis, not because ZFS or XFS is good enough.

    None of them are perfect yet and my best hope for the future is bcachefs.

  • callahad a year ago

    > Nobody sponsors btrfs development anymore

    Doesn't SUSE still use btrfs as their default rootfs?

    Looking at `git shortlog -sne --since="1 Jan, 2023" fs/btrfs/` it seems like folks at least at Suse, Oracle, and Western Digital have all made more than a dozen commits this year.

    • yrro a year ago

      As does Fedora Workstation (as of a couple of releases ago)

      Being able to sit back and have a big bunch o'disks that I can just add and remove devices from is convenient when combined with BTRFS raid1: data will be present on at least two disks, I don't care which. I don't think ZFS is able to do this. It's obviously more suited for serious enterprise use where you have plans and budgets... ;)

      (Once I too have my data eaten by it I'll no doubt be very grumpy, then again that's what backups are for so...)

      More seriously, I'd give ZFS a go but it's still too much of a pain in the arse to use with Linux. If even Oracle Unbreakable Linux can't integrate it properly then what hope does anyone else have?

      (Talking about making me build my own kernel modules here)

      • DiabloD3 a year ago

        Building kernel modules doesn't have to be scary.

        Debian and SUSE both do this perfectly and automatically.

        • yrro a year ago

          It's not scary. It's just annoying and I don't want to do it any more.

  • candiddevmike a year ago

    Anecdotally, I've had so much trouble running Linux on a ZFS root. Failing DKMS builds or installing a kernel that is too new or old for the ZFS components has rendered my system unbootable enough that I switched to BTRFS.

    • DiabloD3 a year ago

      The only time I've heard this happening is on Ubuntu. Don't use Ubuntu, and switch to Debian or SUSE, and your problems go away.

      The only times I've ever heard of this not being Ubuntu pathology is people running builds of the kernel from git, and not keeping track of the extremely rare breaking changes that affect ZFS (OpenZFS's git repo for the kernel modules is updated quickly, and breaking changes are often handled ahead of time; all of this is fixed before stable kernels get released nowadays).

seized a year ago

I just scrub my ZFS pools once in a while and that's it... OpenIndiana's autosnap handles the rest, and it's built in. Then Monit checks pool health and alerts me automatically.
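
For anyone new to it, that amounts to little more than (pool name is a placeholder):

  # start a scrub, then check its progress and any errors found
  zpool scrub tank
  zpool status -v tank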

eternityforest a year ago

This does not exactly inspire confidence in BTRFS, if I need a script to maintain it.

I only use it for one thing: read-only compressed lower layers. It's great for that because it can be written to at setup time with standard tools, as opposed to needing an extra image-compile step like true compressed read-only FSes, but I wouldn't use it for anything that gets written in normal use.

chasil a year ago

Why is nobody mentioning this?:

  btrfs fi defrag -r /

This is a recovery that ZFS cannot make.

https://www.usenix.org/system/files/login/articles/login_sum...

  • foobarqux a year ago

    Defrag breaks COW so if you use COW a lot (e.g. snapshots) you don't want to do it.

    • fsckboy a year ago

      what do you mean it breaks? does it purge* the older versions, or does something break?

      * (yes, DEC reference)

      • jraph a year ago

        It reduplicates everything. Don't defrag Btrfs lightly, unless you have a specific issue to solve and you've run out of other solutions. Even then, only do it if you know what you are doing.

        edit: I initially wrote deduplicates, thanks the8472, and agree with the rest of your message as well

        • the8472 a year ago

          *reduplicates

          Anyway, deduplication can be recovered after a defrag by running duperemove or similar tools. But it's a lot of added work so it's only really worth it if fragmentation is an issue in the first place.
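
          Something along these lines, with the path being whatever you defragged:

            # find duplicate extents and re-share them via the kernel's dedupe ioctl
            duperemove -dr /mnt/data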

          • Dylan16807 a year ago

            And you have to cross your fingers and hope it picks the defragmented copy, as far as I know.

            It's also very slow to go through that on multiple snapshots.

            And I think some of the metadata doesn't like to recombine?

  • nubinetwork a year ago

    ZFS doesn't need it, if the file is accessed enough, it stays in cache. If the file isn't important enough, who cares that it takes a little bit longer to load into memory?

    • chasil a year ago

      If a ZFS dataset goes over 80% utilization, I understand it will switch from "first fit" to "best fit" which can permanently and severely impact performance.

      https://serverfault.com/questions/733817/does-the-max-80-use...

      • LeoPanthera a year ago

        I actually tested this, and could not measure any performance hit at all until I hit 95% full.

        It sped right back up again when I deleted some stuff.

        I wouldn't worry about it.

        • chasil a year ago

          Interesting test. Thanks for finding this out.

      • seized a year ago

        You can ZFS send back and forth between pools or to a new dataset to defrag/fix that, so it's not "permanent" as in can't ever be fixed. It also depends on workload, write once/read many is different from VM disk files.
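
        Roughly this shape, with pool and dataset names as placeholders:

          # replicate the dataset to another pool, then swap or copy it back
          zfs snapshot tank/data@migrate
          zfs send tank/data@migrate | zfs receive otherpool/data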

        • chasil a year ago

          Btrfs can "rebalance" in place. This has had bugs in the past, but seems safe now.

          On the side of ZFS, you can take one volume of a mirrorset, move it to another system and mount it in degraded mode, and copy a bunch of new files to it. When you return the updated drive to the full set and mount it, it will resilver the older drive with the newer content automatically.

          That's a killer feature. With btrfs, you have to rebalance to mirror the new files, which takes a long time as every block must be rewritten between the mirrors. ZFS knows how to move just the new files in a resilver.