I made a command line utility called `dedup` a while back to do the same thing. It has a dry-run mode, will “intelligently” choose the best clone source, understands hard links and other clones, preserves metadata, deals with HFS compressed files properly. It hasn’t destroyed any of my own data, but like any file system tool, use at your own risk.
Replying to myself now that I've had a chance to try the scan, but not the deduplication. I work with disc images, program binaries, intermediate representations in a workspace that's 7.6G.
A few notes:
* By default it doesn't scan everything. It ignores all files except those in an allow list. The way the allow list is structured, it seems like Hyperspace needs to understand the content of a file. As an end user, I have no idea what the difference between a Text file and a Source Code file would be, or how Hyperspace would know. Hyperspace only found 360MB to dedup. Allowing all files increased that to 842MB.
* It doesn't scan files smaller than 100 KB by default. Disabling the size limit along with allowing all files increased that to 1.1GB.
* With all files and no size limit it scanned 67,309 of 68,874 files. `dedup` scans 67,426.
* It says 29,522 files are eligible. Eligible means they can be deduped. `dedup` only finds 29,447. There are 76 already deduped files, which is an off-by-one, so I'm not sure what the difference is.
* Scanning files in Hyperspace took around 50s vs `dedup` at 14s
* It seems to scan the file system, then do a duplicate calculation, then do the deduplication. I'm not sure why the first two aren't done together. I chose to queue filesystem metadata as it was scanned and, in parallel, start calculating duplicates. The vast majority of the time, candidate files can be ruled out by size alone, which is available from `fts_read` "for free" while traversing the directory (see the fts(3) sketch after this list).
* Hyperspace found 1.1GB to save, `dedup` finds 1.04GB and 882MB already saved (from previous deduping)
* I'm not going to buy Hyperspace at this time, so I don't know how long it takes to dedup or if it preserves metadata or deals with strange files. `dedup` took 31s to scan and deduplicate.
* After deduping with `dedup`, Hyperspace thinks there are still 2 files that can be deduped.
* Hyperspace seems to understand it can't dedup files with multiple hard links, empty files, and some of the other things `dedup` also checks for.
* I can't test ACLs or any other attribute preservation like that without paying. `strings` suggests those are handled. HFS Compression is a tricky edge case, but I haven't tested how Hyperspace's scan deals with those.
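To illustrate the size-first point from the scanning bullet above: a minimal sketch (not `dedup`'s actual code) of an fts(3) walk that picks up each file's size from the stat data `fts_read` has already filled in, so grouping candidates by size costs no extra I/O.

```c
/*
 * Minimal sketch (not dedup's actual code): a single fts(3) walk that records
 * each regular file's size straight from the stat data fts_read() already
 * filled in, so the first size-based grouping of duplicate candidates costs
 * no extra I/O.
 */
#include <fts.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    char *paths[] = { argc > 1 ? argv[1] : ".", NULL };
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (fts == NULL) {
        perror("fts_open");
        return 1;
    }

    FTSENT *ent;
    while ((ent = fts_read(fts)) != NULL) {
        if (ent->fts_info != FTS_F)            /* regular files only */
            continue;
        if (ent->fts_statp->st_nlink > 1)      /* skip multiply hard-linked files */
            continue;
        /* A real tool would queue (path, size, dev, inode) here for a
         * duplicate-finding thread instead of printing. */
        printf("%lld\t%s\n",
               (long long)ent->fts_statp->st_size, ent->fts_path);
    }
    fts_close(fts);
    return 0;
}
```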
I'm a little surprised that folks here are investing so much time into this app. It's closed source, only available for a non-obvious price on a time-limited or subscription basis, and lots of details of how it works are missing.
With a FOSS project this would have been expected, but with a shareware-style model? Idk...
John has reiterated multiple times on his podcast that he doesn't want to deal with the thousands of support requests he'd get if he made his apps open source and free. All his apps are personal itches he scratched, and he sells them not to make a profit but to set the barrier to entry high enough to keep user feedback manageable.
Absolutely no judgement for however people want to licence and distribute their software, but I've seen the support burden used as justification for closed source/selling software quite a bit recently, and wonder how often people might be conflating open source with open development. There's no reason an open source project has to accept bug reports or pull requests from anyone. See SQLite or many of the tools from Fabrice Bellard for example.
Again, I've got no problem with people selling software or closed source models, but I've never understood using this justification. Maybe in this instance he's a well known public figure with published contact info that people will abuse?
Didn’t SQLite developer(s) famously receive a flood of phone calls because McAfee antivirus used it in a way that was visible (and “suspicious”) to its users?
> he doesn't want to deal with the thousands of support requests he'd get if he made his apps open source and free.
Who says you have to deal with support requests if you open source something?
> All his apps are personal itches he scratched, and he sells them not to make a profit but to set the barrier to entry high enough to keep user feedback manageable.
> Who says you have to deal with support requests if you open source something?
Almost anyone who has ever maintained popular open-source software, even if dealing with them means putting up a notice that says "Don't ask support questions" and having to delete angrily posted issues.
My understanding from listening to his explanation is he wants to be able to support users and have an income stream to incentivize that.
As an open-source maintainer of a popular piece of software, I'm very empathetic.
Sorry, that's a BS reason. If you don't want that, just ignore all opened issues. That's it. If you are nice, you put a README that explains this in a sentence or two. If a community forms that wants to fix issues (for example, critical ones that could lead to data loss), then the community will deal with it, e.g. by forking.
Just keeping everything closed is really missing the point of how trust in infra that handles critical data is built nowadays.
As I understand it, from listening to the podcast, a better summary is that if it becomes popular, he wants it to be worthwhile for him to keep working on.
Apps like this can easily bit rot, and more users does often mean more work e.g. answering or filtering emails, finding more edge cases, etc.
From his perspective that means having an income to dedicate time to this. I don't think he's interested in being an "infra" app as you would think of it.
As someone who maintains critical open-source software, I can strongly empathize, even if it’s not an approach I would take.
Just because the creator, John Siracusa, is famous. If a no-name developer had made the app, it wouldn't get this many upvotes and this much attention. He used to write very detailed OS reviews, and I learned a lot from him, including Apple's logical volume manager functions (`diskutil cs`).
Just tried it, and it works well! I didn't realize the potential of this technique until I saw just how many dupes there were of certain types of files, especially in node_modules. It wasn't uncommon to see it replace 50 copies of some js file with one, and that was just in a specific subdirectory.
I see it is "pre-release" and sort of low GH stars (== usage?), so I'm curious about the stability since this type of tool is relatively scary if buggy.
I use it on my family photos, work documents, etc. and have not run into an issue that I haven’t added a test for. I didn’t commercialize it because I didn’t know what I didn’t know, but the utility does try to fail quickly if any of the file system operations fail (cloning, metadata duplication, atomic swaps, etc.).
Whenever using it on something sensitive that I can’t back up first for whatever reason, I make checksum files and compare them afterwards. I’ve done this many times on hundreds of GB and haven’t seen corruption. Caveat emptor.
There is one huge caveat I should add to the README - block corruption happens. Having a second copy of a file is a crude form of backup. Cloning causes all instances to use the same blocks, so if that one instance is corrupted, all clones are. That’s fine for software projects with generated files that can be rebuilt or checked out again, but introduces some risk for files that may not otherwise be replaceable. I keep multiple backups of all that stuff on hardware other than where I’m deduping, so I dedup with abandon.
I’m a nobody with no audience. Maybe some attention here will get some users.
Somewhat unrelated but I believe the dupe issue with node_modules is the main reason to use pnpm instead of npm - pnpm just uses a single global package repo on your machine and creates links inside node_modules as needed.
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?
This is the standard API for deduplication on Linux (used for btrfs and XFS); you ask the OS nicely to deduplicate a given set of ranges, and it responds by locking the ranges, verifying that they are indeed identical and only then deduplicates for you (you get a field back saying how many bytes were deduplicated from each range). So there's no way a userspace program can mess up your files.
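For anyone curious, a rough sketch of what that call looks like from userspace (one whole-file range, minimal error handling; the struct and the FIDEDUPERANGE constant come from `<linux/fs.h>`):

```c
/*
 * Rough sketch of Linux's FIDEDUPERANGE ioctl, as used on btrfs and XFS:
 * ask the kernel to share DST's blocks with SRC. The kernel locks the ranges,
 * verifies they are byte-identical, and only then dedupes, so userspace
 * can't corrupt anything by asking.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 2;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    struct stat st;
    if (src < 0 || dst < 0 || fstat(src, &st) != 0) {
        perror("open/fstat");
        return 1;
    }

    /* One destination range; the struct ends in a flexible array of these.
     * Very large files may need to be deduped in several chunks. */
    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    req->src_offset = 0;
    req->src_length = (__u64)st.st_size;
    req->dest_count = 1;
    req->info[0].dest_fd = dst;
    req->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, req) != 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }
    if (req->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduplicated %llu bytes\n",
               (unsigned long long)req->info[0].bytes_deduped);
    else if (req->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
        printf("kernel found the contents differ; nothing was shared\n");
    else
        fprintf(stderr, "dedupe failed: status %d\n", req->info[0].status);

    free(req);
    return 0;
}
```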
Probably because APFS runs on everything from the Apple Watch to the Mac Pro and everything in between.
You probably don’t want your phone or watch de-duping stuff.
There are tons of knobs you can tweak in macOS but Apple has always been pretty conservative when it comes to what should be default behavior for the vast majority of their users.
Certainly when you duplicate a file using the Finder or use cp -c at the command line, the Copy-on-Write functionality is being used; most users don’t need to know that.
On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.
It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.
In regards to the second point, this isn't correct for ZFS: "If several files contain the same pieces (blocks) of data or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. Instead of storing many copies of a book it stores one copy and an arbitrary number of pointers to that one copy." [0]. So changing one byte of a large file will not suddenly result in writing the whole file to disk again.
It reads like that is what they meant: "modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again"
Yeah, I did not write it very clearly. On ZFS, you're right. On a file system that applied deduplication to files and not individual blocks, the file would need to be duplicated again, no matter where and what kind of change was made.
That's a ZFS online dedup limitation. I think XFS and btrfs are better prior art here since they use extent-based deduplication and can do offline dedup, which means they don't have to keep it in memory and the on-disk metadata is smaller too.
Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size, because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.
APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.
I believe as soon as you change a single byte you get a complete copy that’s your own.
And that’s how this program works. It finds perfect duplicates and then effectively deletes and replaces them with a copy of the existing file so in the background there’s only one copy of the bits on the disk.
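I don't know Hyperspace's internals, but the core of that replace step on macOS can be sketched with clonefile(2) plus an atomic rename swap; a real tool additionally verifies contents, preserves metadata, and guards against files changing underneath it.

```c
/*
 * Rough sketch of the "replace a duplicate with a clone" step on APFS
 * (not Hyperspace's actual code): clone the kept file to a temporary name,
 * atomically swap it with the duplicate, then drop the old copy.
 */
#include <stdio.h>          /* renamex_np()/RENAME_SWAP are Apple extensions declared here */
#include <sys/clonefile.h>
#include <unistd.h>

static int replace_with_clone(const char *keep, const char *dup) {
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.clone-tmp", dup);   /* illustrative temp name */

    /* Create a copy-on-write clone of `keep`; no data blocks are duplicated. */
    if (clonefile(keep, tmp, CLONE_NOFOLLOW) != 0) {
        perror("clonefile");
        return -1;
    }
    /* Atomically exchange the clone with the duplicate so the path never
     * disappears, even briefly. */
    if (renamex_np(tmp, dup, RENAME_SWAP) != 0) {
        perror("renamex_np");
        unlink(tmp);
        return -1;
    }
    /* After the swap, `tmp` holds the old duplicate's data; remove it. */
    return unlink(tmp);
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s KEEP DUPLICATE\n", argv[0]);
        return 2;
    }
    return replace_with_clone(argv[1], argv[2]) == 0 ? 0 : 1;
}
```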
I suppose this means that you could find yourself unexpectedly out of disk space in unintuitive ways, if you're only trying to change one byte in a cloned file but there isn't enough space to copy its entire contents?
It doesn't work like you think.
If you change one byte of a deduplicated file, only that "byte" is changed on disk (a "byte" in quotes because technically it's not a byte, but a block).
As far as I understand, it works like the reflink feature in modern Linux filesystems.
If so, that's really cool, and also a bit better than ZFS snapshots.
I'm a newbie on macOS, but it looks amazing.
That’s true as long as the writing application only writes to blocks that have changed. Is a VM tool going to write blocks, or write multi-MB segments of a sparse image that can be swapped atomically? Unfortunately, once a file changes there are no APIs to check which blocks are still shared (at least there weren’t as of macOS 13).
I’m not sure if it works on a file or block level for CoW, but yes.
However APFS gives you a number of space related foot-guns if you want. You can overcommit partitions, for example.
It also means if you have 30 GB of files on disk that could take up anywhere from a few hundred K to 30 GB of actual data depending on how many dupes you have.
It’s a crazy world, but it provides some nice features.
If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.
But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.
Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.
It's certainly an interesting concept but might be more trouble than it's worth.
ZFS does this by de-duplicating at the block level, not the file level. It means you can do what you want without needing to keep track of a chain of differences between files. Note that de-duplication on ZFS has had issues in the past, so there is definitely a trade-off. A newer version of de-duplication sounds interesting, but I don't have any experience with it: https://www.truenas.com/docs/references/zfsdeduplication/
ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)
APFS shares blocks so only blocks that changed are no longer shared. Since a block is the smallest atomic unit (except maybe an inode) in a FS, that’s the best level of granularity to expect.
VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store them only once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. It stores a reference block for each unique hash, and when new data comes in and is hashed, the most similar reference block is used to create byte-level deltas against. In practice this works extremely well.
Records (which are of variable size) are already checksummed, and there were checksum-hashes which made it vanishingly unlikely that one could choose two different records with the same (optionally cryptographically strong) checksum-hash. When a newly-created record's checksum is generated, one could look into a table of existing checksum-hashes and avoid the record write if it already exists, substituting an incremented refcount for that table entry.
ZFS is essentially an object store database at one layer; the checksum-hash deduplication table is an object like any other (file, metadata, bookmarks, ...). There is one deduplication table per pool, shared among all its datasets/volumes.
On reads, one does not have to consult the dedup table.
The mechanism was fairly easy to add. And for highly-deduplicatable data that is streaming-write-once-into-quiescent-pool-and-never-modify-or-delete-what's-written-into-a-deduplicated-dataset-or-volume, it was a reasonable mechanism.
In other applications, the deduplication table would tend to grow and spread out, requiring extra seeks for practically every new write into a deduplicated dataset or volume, even if it's just to increment or decrement the refcount for a record.
Destroying a deduplicated dataset has to decrement all its refcounts (and remove entries from the table where it's the only reference), and if your table cannot all fit in ram, the additional IOPS onto spinning media hurt, often very badly. People experimenting with deduplication and who wanted to back out after running into performance issues for typical workloads sometimes determined it was much MUCH faster to destroy the entire pool and restore from backups, rather than wait for a "zfs destroy" on a set of deduplicated snapshots/datasets/volumes to complete.
I have no specialized knowledge (just a ZFS user for over a decade). I suspect the reason is that in addition to files, ZFS will also allow you to create volumes. These volumes act like block devices, so if you want to dedup them, you need to do it at the block level.
This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.
Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.
This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.
Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.
Yep. At a previous job we had a file server that we published Windows build output to.
There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.
It is worth pointing out though, that on Windows Server this deduplication is a background process; When new duplicate files are created, they genuinely are duplicates and take up extra space, but once in a while the background process comes along and "reclaims" them, much like the Hyperspace app here does.
Because of this (the background sweep process is expensive), it doesn't run all the time and you have to tell it which directories to scan.
If you want "real" de-duplication, where a duplicate file will never get written in the first place, then you need something like ZFS
Both ZFS and WinSvr offer "real" dedupe. One is on-write, which requires a significant amount of available memory, the other is on a defined schedule, which uses considerably less memory (300MB + 10MB/TB).
ZFS is great if you believe you'll exceed some threshold of space while writing. I don't personally plan my volumes with that in mind but rather make sure I have some amount of excess free space.
WinSvr allows you to disable dedupe if you want (don't know why you would), whereas ZFS is a one-way street without exporting the data.
Both have pros and cons. I can live with the WinSvr cons while ZFS cons (memory) would be outside of my budget, or would have been at the particular time with the particular system.
Dedupe seemed more interesting when storage was expensive, but nowadays it feels like the overhead you get from running dedupe is, in most cases, priced in. At least with software like CommVault for backups, dedupe requires beefy hardware and low-latency SSDs for the database; if there are even a few extra milliseconds of latency or the server can’t handle requests fast enough, your backup throughput absolutely tanks. Depending on your data, though, you could see some ridiculous savings that make it worth the trouble.
I’ve heard many horror stories of dedupe related corruption or restoration woes though, especially after a ransomware attack.
Even using sha-256 or greater type of hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algo decision trees. I know that any mistake I made would not be of malice but just ignorance or other stupid mistake.
I've done the whole compare-every-file-via-hashing approach, logging each of the matches for humans to compare, but none of that has ever been allowed to mv/rm/ln -s anything. I feel my imposter syndrome in this regard is not a bad thing.
Now you understand why this app costs more than 2x the price of alternatives such as diskDedupe.
Any halfway-competent developer can write some code that does a SHA256 hash of all your files and uses the Apple filesystem APIs to replace duplicates with shared clones. I know Swift; I could probably do it in an hour or two. Should you trust my bodgy quick script? Heck no.
The author - John Siracusa - has been a professional programmer for decades and is an exceedingly meticulous kind of person. I've been listening to the ATP podcast where they've talked about it, and the app has undergone an absolute ton of testing. Look at the guardrails on the FAQ page https://hypercritical.co/hyperspace/ for an example of some of the extra steps the app takes to keep things safe. Plus you can review all the proposed file changes before you touch anything.
You're not paying for the functionality, but rather the care and safety that goes around it. Personally, I would trust this app over just about any other on the mac.
Best course of action is to not trust John, and just wait for a year of the app being out in the wild, until everyone else trusts John. I have enough hard drive space in the meantime to not rush into trusting John.
Yeah, the lack of involvement was more in response to ZFS doing this, not this app. I could have crossed the streams with other threads about ZFS, since it's not directly in this thread.
Most EULAs would disclaim liability for data loss and suggest users keep good backups. I haven’t read a EULA for a long time, but I think most of them do so.
I can't find a specific EULA or disclaimer for the Hyperspace app, but given that the EULAs for major things like Microsoft Office basically say "we offer you no warranty or recourse no matter what this software does", I would hardly expect an indie app to offer anything like that.
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
If Apple is anything like where I work, there's probably a three-year-old bug ticket in their system about it and no real mandate from upper management to allocate resources for it.
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.
After all, how many perfect duplicate files do you accidentally create in a month?
There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.
And you can always rerun it for free to see if you have enough stuff worth paying for again.
Am I really that old that I remember this being the default for most software about 10 years ago? Are people already so used to the subscription trap that they think this is a new model?
I grew up with shareware in the 90s that often adopted a similar model (though having to send $10 in the mail and wait a couple weeks for a code or a disk to come back was a bit of a grind!) but yes, it's refreshing in the current era where developers will even attempt to charge $10 a week for a basic coloring in app on the iPad..
It’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards which you can lock, so if you forget to cancel the charges are blocked).
However, has anyone been able to find out from the website how much the license actually costs?
Just want to mention: Apple ships a modified version of the copy command (good old cp) that supports the ability to use the cloning feature of APFS by using the -c flag.
And in case your cp doesn't support it, you could also do it by invoking Python. Something like `import Foundation; Foundation.NSFileManager.defaultManager().copyItemAtPath_toPath_error_(...)`.
Correct. Foundation's NSFileManager / FileManager will automatically use clone for same-volume copies if the underlying filesystem supports it. This makes all file copies in all apps that use Foundation support cloning even if the app does nothing.
libcopyfile also supports cloning via two flags: COPYFILE_CLONE and COPYFILE_CLONE_FORCE. The former clones if supported (same volume and filesystem supports it) and falls back to actual copy if not. The force variant fails if cloning isn't supported.
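A minimal example of the fall-back-capable variant (clone when the destination volume supports it, ordinary copy otherwise):

```c
/*
 * Minimal sketch of libcopyfile's clone support on macOS: COPYFILE_CLONE asks
 * for an APFS clone and quietly falls back to a regular copy if the volume
 * can't clone; COPYFILE_CLONE_FORCE fails instead of falling back. Note that
 * COPYFILE_CLONE implies the destination must not already exist.
 */
#include <copyfile.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 2;
    }
    if (copyfile(argv[1], argv[2], NULL, COPYFILE_CLONE) != 0) {
        perror("copyfile");
        return 1;
    }
    return 0;
}
```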
They might not have ported it. They could have a Python __getattr__ implementation that returns a callable that simply uses objc_msgSend under the hood.
What algorithm does the application use to figure out if two files are identical? There's a lot of interesting algorithms out there. Hashes, bit by bit comparison etc. But these techniques have their own disadvantages. What is the best way to do this for a large amount of files?
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
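Purely as an illustration (no claim this is what Hyperspace ships), computing such a content key is a few lines of streaming CommonCrypto on macOS:

```c
/*
 * Illustration only: stream a file through SHA-256 with CommonCrypto to get a
 * 32-byte content key for duplicate detection. Swift code would more likely
 * use CryptoKit today; the idea is the same.
 */
#include <CommonCrypto/CommonDigest.h>
#include <stdio.h>

static int sha256_file(const char *path, unsigned char out[CC_SHA256_DIGEST_LENGTH]) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    CC_SHA256_CTX ctx;
    CC_SHA256_Init(&ctx);

    unsigned char buf[1 << 16];   /* 64 KB chunks: large files never live in memory */
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        CC_SHA256_Update(&ctx, buf, (CC_LONG)n);

    int bad = ferror(f);
    fclose(f);
    if (bad)
        return -1;

    CC_SHA256_Final(out, &ctx);
    return 0;
}

int main(int argc, char *argv[]) {
    unsigned char digest[CC_SHA256_DIGEST_LENGTH];
    if (argc != 2 || sha256_file(argv[1], digest) != 0) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    for (int i = 0; i < CC_SHA256_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);
    printf("  %s\n", argv[1]);
    return 0;
}
```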
You can start with the file size, which will already be unique for a large fraction of files. That would likely cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.
Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.
I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.
This is the default for ZFS deduplication, and git does something similar with size and the far weaker SHA-1. I would add a test for SHA-256 collisions, but no one seems to have found a working example yet.
How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.
FWIW, when I wrote a tool like this I used same size + some hash function, not MD5 but maybe SHA1, don't remember. First and last bytes is a good idea, didn't think of that.
Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.
>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash
Definitely, for comparing any two files. But, if you're searching for duplicates across the entire disk, then you're theoretically checking each file multiple times, and each file is checked against multiple times. So, hashing them on first pass could conceivably be more efficient.
>if you just compare the bytes there is no chance of hash collision
You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.
But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.
You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
To have a mere one in a billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best bitcoin mining efficiency, actual file hashing needs way more energy).
The probability of a collision is so astronomically small, that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within range of probabilities that actually happen).
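To put a rough number on it: with n files, the chance of any two colliding under a 256-bit hash is about n^2 / 2^257 (birthday bound). Even with a billion files that's roughly 10^18 / 2.3x10^77, or about 4x10^-60, dozens of orders of magnitude below the chance of a stray bit flip corrupting the comparison itself.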
You can group all files into buckets, and as soon as a bucket contains fewer than two files, discard it. If in the end there are still files in the same bucket, they are duplicates.
Initially all files are in the same bucket.
You now iterate over differentiators which given two files tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run the differentiator on all files in a bucket to split the bucket into finer equivalence classes.
For example:
* Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.
* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.
* Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.
* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.
Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you’ve read at most 2 blocks per file. I haven’t figured out a good differentiator between that and hashing, but ties are already so rare at that point that it’s not worth it, in my experience.
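To make the cheap end of that chain concrete, here's a rough sketch (not dedup's actual code) that rules a pair out by size and then by first/last byte before anything gets hashed:

```c
/*
 * Sketch of the cheap differentiators: same size, then first byte, then last
 * byte. A mismatch proves two files differ; a match only means "maybe equal",
 * and a full hash or byte-by-byte compare (not shown) has to settle it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

enum verdict { DEFINITELY_DIFFERENT, MAYBE_EQUAL, CHECK_FAILED };

static enum verdict first_last_byte(const char *a, const char *b, off_t size) {
    enum verdict v = CHECK_FAILED;
    unsigned char ba[2], bb[2];
    int fa = open(a, O_RDONLY), fb = open(b, O_RDONLY);

    if (fa < 0 || fb < 0)
        goto out;
    /* pread() keeps no file offset state, so two tiny reads per file suffice. */
    if (pread(fa, &ba[0], 1, 0) != 1 || pread(fb, &bb[0], 1, 0) != 1 ||
        pread(fa, &ba[1], 1, size - 1) != 1 || pread(fb, &bb[1], 1, size - 1) != 1)
        goto out;

    v = (ba[0] == bb[0] && ba[1] == bb[1]) ? MAYBE_EQUAL : DEFINITELY_DIFFERENT;
out:
    if (fa >= 0) close(fa);
    if (fb >= 0) close(fb);
    return v;
}

int main(int argc, char *argv[]) {
    struct stat sa, sb;
    if (argc != 3) {
        fprintf(stderr, "usage: %s FILE_A FILE_B\n", argv[0]);
        return 2;
    }
    if (stat(argv[1], &sa) != 0 || stat(argv[2], &sb) != 0) {
        perror("stat");
        return 1;
    }
    if (sa.st_size != sb.st_size || sa.st_size == 0) {
        puts("different size (or empty): not duplicate candidates");
        return 0;
    }
    switch (first_last_byte(argv[1], argv[2], sa.st_size)) {
    case DEFINITELY_DIFFERENT: puts("different: first or last byte mismatch"); break;
    case MAYBE_EQUAL:          puts("maybe equal: hash or byte-compare next"); break;
    default:                   puts("couldn't check");                         break;
    }
    return 0;
}
```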
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:
- compute SHA256 hashes for each file on the source side
- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)
- mirror the source directory structure to the destination
- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.
In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.
Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.
Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real-world scenarios it performs just as well as non-crypto hashes like xxHash.
I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application that was using SHA hashes in the background. I don't recall all the details; it's improbable, but possible.
This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.
I'd hash the first 1024 bytes of all files, and start from there if there is any collision. That way you don't need to hash the whole (large) files, only those with the same hashes.
I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.
Also, use the length of the file for a fast check.
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".
If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes (e.g. perhaps all the Photoshop documents start with the same header). You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.
Far more efficient - and less work - if you just use a SHA256 of the file's contents. That gets you a much smaller 32 byte key, and you don't need to bother with 2-stage comparisons.
I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.
I don't think it would be far more efficient to hash the entire contents though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you hash the whole contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
If you're going to keep partial hashes in memory, may as well align it on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than it takes to fetch it from SSD (much less disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.
... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.
If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.
etc, and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.
A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip quine files, it gets weird. Probably not worth it to go down this road.
> Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
I'm not sure if this is what you intended, but just to be sure: writing changes to a cloned file doesn't immediately duplicate the entire file again in order to write those changes — they're actually written out-of-line, and the identical blocks are only stored once. From the docs [^1] posted in a sibling comment:
> Modifications to the data are written elsewhere, and both files continue to share the unmodified blocks. You can use this behavior, for example, to reduce storage space required for document revisions and copies. The figure below shows a file named “My file” and its copy “My file copy” that have two blocks in common and one block that varies between them. On file systems like HFS Plus, they’d each need three on-disk blocks, but on an Apple File System volume, the two common blocks are shared.
The key is “unmodified”, and how APFS knows or doesn’t know whether blocks are modified. How many apps write on block boundaries, or mutate just the on-disk data that has changed, versus overwriting or replacing the file atomically? For most applications there is no benefit and a significant risk of corruption.
So APFS supports it, but there is no way to control what an app is going to do, and after it’s done it, no way to know what APFS has done.
For apps which write a new file and replace atomically, the CoW mechanism doesn't come into play at all. The new file is a new file.
I don't understand what makes you think there's a significant risk of corruption. Are you talking about the risk of something modifying a file while the dedupe is happening? Or do you think there's risk associated with just having deduplicated files on disk?
What happens when the original file is deleted? Often this is handled by block reference counters, which are simply decremented. How does APFS handle this? Is there a master/copy concept, or just block references?
Oh wow, what a funny coincidence. I hadn't visited the site in a couple of years but someone linked me "Front and Center" yesterday, so I saw the icon for this app and had no clue it had only appeared there maybe hours earlier.
The idea is not new, of course, and I wrote one of these (for Linux, with hardlinks) years ago, but in the end I just deleted all the duplicate files in my mp3 collection and didn't touch the rest of the files on the disk, because not a lot of space was reclaimed.
I wonder for whom this really saves a lot of space. (I saw someone mentioning node_modules, had to chuckle there).
But today I learned about this APFS feature, nice.
I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings on an 8.1GB folder.
I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
I tried to scan System and Library but it refused to do so because of permission issues.
I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.
Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.
More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.
But the way you put them after each other makes it sound like npm does de-duplication, and since pnpm tries to be a drop-in replacement for npm, so does pnpm.
So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.
For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune left over Unreal Engine files and docker caches when they put them not in my home folder. The tool asks for a ton of permissions when you run it in order to do the scan though, which is a bit annoying.
It’s not obvious but the system folder is on a separate, secure volume; the Finder does some trickery to make the system volume and the data volume appear as one.
I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." Still, it's a neat tool in concept.
When I ran it on my home folder with 165GB of data it only found 1.3GB of savings. This isn't that significant to me and it isn't really worth paying for.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.
He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
When I run it on my home folder (Roughly 500GB of data) I find 124 MB of duplicated files.
At this stage I'd like it to tell me what those files are - The dupes are probably dumb ones that I can simply go delete by hand, but I can understand why he'd want people to pay up first, as by simply telling me what the dupes are he's proved the app's value :-)
> He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
You misunderstood my comment. I ran it on my home folder, which contains 165GB of data, and it found 1.3GB in savings. That isn't significant enough for me to care about because I currently have 225GB free of my 512GB drive.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
His comment is pretty understandable if you've done frontend work in javascript.
Node_modules is so ripe for duplicate content that some tools explicitly call out that they're disk efficient (It's literally in the tagline for PNPM "Fast, disk space efficient package manager": https://github.com/pnpm/pnpm)
So he got ok results (~13% savings) on possibly the best target content available in a user's home directory.
Then he got results so bad it's utterly not worth doing on the rest (0.10% - not 10%, literally 1/10 of a single percent).
---
Deduplication isn't super simple, isn't always obviously better, and can require other system resources in unexpected ways (ex - lots of CPU and RAM). It's a cool tech to fiddle with on a NAS, and I'm generally a fan of modern CoW filesystems (incl APFS).
But I want to be really clear - this is people picking spare change out of the couch style savings. Penny wise, pound foolish. The only people who are likely to actually save anything buying this app probably already know it, and have a large set of real options available. Everyone else is falling into the "download more ram" trap.
Another 30% on top of the 1GB saved in node_modules, for 1.3GB total. Not 30% of total disk space.
For reference, from the comment they’re talking about:
> I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
This is basically only a win on macOS, and only because Apple charges through the nose for disk space.
Ex - On my non-apple machines, 8GB is trivial. I load them up with the astoundingly cheap NVMe drives in the multiple terabyte range (2TB for ~$100, 4TB for ~$250) and I have a cheap NAS.
So that "big win" is roughly 40 cents of hardware costs on the direct laptop hardware. Hardly worth the time and effort involved, even if the risk is zero (and I don't trust it to be zero).
If it's just "storage" and I don't need it fast (the perfect case for this type of optimization) I throw it on my NAS where it's cheaper still... Ex - it's not 40 cents saved, it's ~10.
---
At least for me, 8GB is no longer much of a win. It's a rounding error on the last LLM model I downloaded.
And I'd suggest that basically anyone who has the ability to not buy extortionately priced drives soldered onto a mainboard is not really winning much here either.
I picked up a quarter off the ground on my walk last night. That's a bigger win.
> This is basically only a win on macOS, and only because Apple charges through the nose for disk space
You do realize that this software is only available on macOS, and only works because of Apple's APFS filesystem? You're essentially complaining that medicine is only a win for people who are sick.
This is NOT a novel or new feature in filesystems... Basically any CoW file system will do it, and lots of other filesystems have hacks built on top to support this kind of feature.
---
My point is that "people are only sick" because the company is pricing storage outrageously. Not that Apple is the only offender in this space - but man are they the most egregious.
Absolutely, 100% backwards. The tool cannot save space from disk space that is not scanned. Your "not a big win" comment assumes that there is no space left to be reclaimed on the rest of the disk. Or that the disk is not empty, or that the rest of the disk can't be reclaimed at an even higher rate.
Didn't have time to try it myself, but there is an option for the minimum file size to consider, clearly visible in the App Store screenshot. I suppose it was introduced to minimize comparison buffers. It is possible that node_modules files slide under this size and weren't considered.
I have both a Mac and an iPhone, but since I happen to be using my Linux computer right now, the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) is not showing the price, probably because I'm not actively on an Apple device. Seems like poor UX even for us Mac users.
It’s a free app because you don’t have to buy it to run it. It will tell you how much space it can save you for free. So you don’t have to waste $20 to find out it only would’ve been 2kb.
But that means the parts you actually have to buy are in app purchases, which are always hidden on the store pages.
> Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).
How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.
IIRC migrating from HFS+ to APFS can be done without touching any of the data blocks; a parallel set of APFS metadata blocks and superblocks is written to disk. In the test migrations, Apple did the entire migration, including generating APFS superblocks, but stopped short of committing the change that would permanently replace the HFS+ superblocks with APFS ones. To roll back they “just” needed to clean up all the generated APFS superblocks and metadata blocks.
Let’s say for simplification we have three metadata regions that represent the entirety of what the file system might be tracking, things like file names, time stamps, where the blocks actually live on disk, and that we also have two regions labeled file data, and if you recall during the conversion process the goal is to only replace the metadata and not touch the file data.
We want that to stay exactly where it is as if nothing had happened to it.
So the first thing that we’re going to do is identify exactly where the metadata is, and as we’re walking through it we’ll start writing it into the free space of the HFS+ volume.
And what this gives us is crash protection and the ability to recover in the event that conversion doesn’t actually succeed.
Now the metadata is identified.
We’ll then start to write it out to disk, and at this point, if we were doing a dry-run conversion, we’d end here.
If we’re completing the process, we will write the new superblock on top of the old one, and now we have an APFS volume.
I think that’s what they did too. And it was a genius way of testing. They did it more than once too I think.
Run the real thing, throw away the results, report all problems back to the mothership so you have a high chance of catching them all even on their multi-hundred million device fleet.
You lack imagination. This is not some crown jewel only achievable by Apple. In the open source world we have tools to convert ext file systems to btrfs and (1) you could revert back; (2) you could mount the original ext file system while using the btrfs file system.
I watched the section from the talk [0] and there are no details given really, other than that it was done as a test of consistency. I've blown so many things up in production that I'm not sure I could ever pull the trigger on such a large migration.
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.
If you'd like. In the blog post he says he wrote the prototype in an afternoon. Hyperspace does try hard to preserve unique metadata as well as other protections.
CoW is a function of ReFS, shipped with Server 2016. "DevDrive" is just a marketing term for a ReFS volume which has file system filters placed in async mode or optionally disabled altogether.
Would be nice if git could make use of this on macOS.
Each worktree I usually work on is several gigs of (mostly) identical files.
Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.
(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.
There is also ".git/objects/info/alternates", accessed via "--shared"/"--reference" option of "git clone", that allows only sharing of object storage and not branches etc... but it is has caveats, and I've only used it in some special circumstances.
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?
Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?
> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.
Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.
The author of the software is a file system enthusiast (so much so that on the podcast he's a part of, they have a dedicated sound effect for every time "filesystem" comes up), a long-time blogger and macOS reviewer. So you'll have to see it in that context: documenting every bit of it and the technical details behind it is important to him, even if it's longer than a tag line on a landing page.
In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.
No because it isn't getting rid of the duplicate, it's using a feature of APFS that allows for duplicates to exist separately but share the same internal data.
My understanding is that it is a copy-on-write clone, not a hard link. [1]
> Q: Are clone files the same thing as symbolic links or hard links?
> A: No. Symbolic links ("symlinks") and hard links are ways to make two entries in the file system that share the same data. This might sound like the same thing as the space-saving clones used by Hyperspace, but there’s one important difference. With symlinks and hard links, a change to one of the files affects all the files.
> The space-saving clones made by Hyperspace are different. Changes to one clone file do not affect other files. Cloned files should look and behave exactly the same as they did before they were converted into clones.
What kind of changes could you make to one clone that would still qualify it as a clone? If there are changes, it's no longer the same file. Even after reading the How It Works[0] link, I'm not grokking how it works. Is it making some sort of delta/diff that is applied to the original file? That's not possible for every file format, like large media files. I could see that being interesting for text-based files, but that gets complicated for complex files.
If I understand correctly, a COW clone references the same contents (just like a hardlink) as long as all the filesystem references are pointing to identical file contents.
Once you open one of the reference handles and modify the contents, the copy-on-write process is invoked by the filesystem, and the underlying data is copied into a new, separate file with your new changes, breaking the link.
Comparing with a hardlink, there is no copy-on-write, so any changes made to the contents when editing the file opened from one reference would also show up if you open the other hardlinks to the same file contents.
Almost, but the difference is that if you change one of hardlinked files, you change "all of them". (It's really the same file but with different paths.)
With a hard link, the content of each of the two 'files' are identical in perpetuity.
With APFS Clones, the contents start off identical, but can be changed independently. If you change a small part of a file, those block(s) will need to be created, but the existing blocks will continue to be shared with the clone.
It’s not the same because clones can have separate metadata; in addition, if a cloned file changes, it stores a diff of the changes from the original.
Replacing duplicates with hard links would be extremely dangerous. Software which expects to be able to modify file A without modifying previously-identical file B would break.
Right, but the concept is the same: "remove duplicates" in order to save storage space. Whether it's using reflinks, softlinks, APFS clones or whatever is more or less an implementation detail.
I know that internally it isn't actually "removing" anything, and that it uses fancy new technology from Apple. But in order to explain the project to strangers, I think my tagline gets the point across pretty well.
> Right, but the concept is the same, "remove duplicates" in order to save storage space.
The duplicates aren't removed, though. Nothing changes from the POV of users or software that use those files, and you can continue to make changes to them independently.
De-duplication does not mean the duplicates completely disappear. If I download a deduplication utility I expect it to create some sort of soft/hard link. I definitely don’t want it to completely remove random files on the filesystem, that’s just going to wreak havoc.
But it can still wreak havoc if you use hardlinks or softlinks, because maybe there was a good reason for having a duplicate file! Imagine you have a photo “foo.jpg.” You make a copy of it “foo2.jpg” You’re planning on editing that file, but right now, it’s a duplicate. At this point you run your “deduper” that turns the second file into a hardlink. Then a few days later you go and edit the file, but wait, the original “backup” file is now modified too! You lost your original.
That’s why Copy-on-write clones are completely different than hardlinks.
Judging by this sub-thread, the process really is harder to explain than it appears on the surface. The basic idea is simple, but the implementation requires deeper knowledge.
But why would you discuss the implementation with end users who probably wouldn't even understand what "implementation" means? The discussion you see in this subthread is not one that would appear on less technical forums, and I wouldn't draw any broader conclusions from HN conversations in general.
Because the implementation leaks to the user experience. The user at least needs to know whether after running the utility, the duplicate files will be gone, or whether changing one of the files will change the other.
Symbolic links, hard links, ref links are all part of the file system interface, not the implementation.
Also models that various AI libraries and plugins love to autodownload into custom locations. Python folks definitely need to learn caching, symlinks, asking a user where to store data, or at least logging where they actually do it.
Interesting idea, and I like the idea of people getting paid for making useful things.
Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.
> I like the idea of people getting paid for making useful things
> It would be nice if it was open source
> I get a data security itch having a random piece of software from the internet scan every file on an HD
With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.
I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting default fairer or safer.
--
The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.
You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.
Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.
> Q: Does Hyperspace preserve file metadata during reclamation?
> A: When Hyperspace replaces a file with a space-saving clone, it attempts to preserve all metadata associated with that file. This includes the creation date, modification date, permissions, ownership, Finder labels, Finder comments, whether or not the file name extension is visible, and even resource forks. If the attempt to preserve any of these pieces of metadata fails, then the file is not replaced.
Yes. Hyperspace is finding the identical files and then replacing all but one copy with a reflink copy using the filesystem's reflink functionality.
When you asked about the filesystem, I assumed you were asking about which filesystem feature was being used, since hyperspace itself is not provided by the filesystem.
Someone else mentioned[0] fclones, which can do this task of finding and replacing duplicates with reflinks on more than just macOS, if you were looking for a userspace tool.
Hyperspace uses built in APFS features, it just applies them to existing files.
You only get CoW on APFS if you copy a file with certain APIs or tools.
If you have a program that does it manually, if you copied a duplicate to somewhere on your disk from some other source, or if your files already existed on the file system when you converted to APFS because you've been carrying them around for a long time, then you'd have duplicates.
APFS doesn’t look for duplicates at any point. It just keeps track of those that it knows are duplicates because of copy operations.
Yes, Linux has a system call to do this for any filesystem with reflink support (and it is safe and atomic). You need a "driver" program to identify duplicates, but there are a handful out there. I've used https://github.com/markfasheh/duperemove and was very pleased with how it worked.
He spoke to this on No Longer Very Good, episode 626 of The Accidental Tech Podcast. Time stamp ~1:32:30
It tries, but there are some things it can't perfectly preserve, like the last access time. In instances where it can't duplicate certain types of extended attributes or ownership permissions, it will not perform the operation.
Well, the FAQ also states that people should let them know if some piece of metadata isn't preserved, so it really sounds like it's a predefined list rather than an enumeration of everything.
No word about alternate data streams. I'll pass for now... although it's nice to see how many duplicates you have.
Q: Does Hyperspace preserve file metadata during reclamation?
A: When Hyperspace replaces a file with a space-saving clone, it attempts to preserve all metadata associated with that file. This includes the creation date, modification date, permissions, ownership, Finder labels, Finder comments, whether or not the file name extension is visible, and even resource forks. If the attempt to preserve any of these pieces of metadata fails, then the file is not replaced.
If you find some piece of file metadata that is not preserved, please let us know.
Q: How does Hyperspace handle resource forks?
A: Hyperspace considers the contents of a file’s resource fork to be part of the file’s data. Two files are considered identical only if their data and resource forks are identical to each other.
When a file is replaced by a space-saving clone during reclamation, its resource fork is preserved.
TL;DR: He wrote an OS X dedup app which finds files with the same contents and tells the filesystem that their contents are identical, so it can save space (using copy-on-write features).
He points out it's dangerous but could be worth it because of the space savings.
I wonder if the implementation is using a hash only or does an additional step to actually compare the contents to avoid hash collision issues.
It's not open source, so we'll never know. He chose a pay model instead.
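For what it's worth, the usual belt-and-braces pattern in dedup tools (I can't say whether Hyperspace does this) is to use file sizes and hashes only to find candidates, then do a full byte-for-byte comparison before replacing anything, which removes hash collisions as a risk entirely. A minimal C sketch of that final check:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Final safety check: confirm two candidate duplicates byte for byte. */
    static bool files_identical(const char *a, const char *b) {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        bool same = (fa != NULL && fb != NULL);
        while (same) {
            unsigned char ba[65536], bb[65536];
            size_t na = fread(ba, 1, sizeof ba, fa);
            size_t nb = fread(bb, 1, sizeof bb, fb);
            if (na != nb || memcmp(ba, bb, na) != 0)
                same = false;   /* lengths or contents diverged */
            else if (na == 0)
                break;          /* both hit EOF together: identical */
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }

Only pairs that pass a check like this would then be handed to the clone-and-replace step.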
Also, some files might not be identical but still have identical blocks; that could be explored too. Other filesystems support that, either in their tooling, or online, or both.
In my experience, Macs use up a ridiculous amount of "System" storage, for no discernible reason, that users can't delete. I've grown tired of family members asking me to help them free up storage that I can't even find. That's the major issue from what I've seen; unless this app prevents Apple from deliberately eating up 50%+ of a machine's storage space, it doesn't do much for the people I know.
There's no magic around it, macOS just doesn't do a good job explaining it using the built in tools. Just use Daisy Disk or something. It's all there and can be examined.
In earlier episodes of ATP when they were musing on possible names, one listener suggested the frankly amazing "Dupe Nukem". I get that this is a potential IP problem, which is why John didn't use it, but surely Duke Nukem is not a zealously-defended brand in 2025. I think interest in that particular name has been stone dead for a while now.
It's a genius name, but Gearbox owns Duke Nukem. They're not exactly dormant. Duke Nukem as a franchise made over a billion in revenue. In 2023, Zen released a licensed Duke Nukem pinball table, so there is at least some ongoing interest in the franchise.
Reminds me of Avira's Luke Filewalker - I wonder if they needed any special agreement with Lucasfilm/Disney. I couldn't find any info on it, and their website doesn't mention Star Wars at all.
Downloaded. Ran it. Tells me "900" files can be cleaned. No summary, no list. But I was at least asked to buy the app. Why would I buy the app if I have no idea if it'll help?
If you don’t mind CLI tools, you can try dedup - https://github.com/ttkb-oss/dedup . Use the --dry-run option to get a list of the files that would be merged, and how much space would be saved, without modifying anything.
On good file systems (see <https://news.ycombinator.com/item?id=43174685>), identical chunks of files can also be merged, resulting in more savings than with just whole files. As of now, dedup cannot help with this, but duperemove or jdupes can.
Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs. They all inevitably found at least a few KB to reclaim through their magic, even if the same optimizer was run back to back with itself and allowed to "optimize" things. That is, on each run they always found extra memory to free. They ultimately did nothing but claim they did the work. Must've sold pretty well nonetheless.
> Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs. They ultimately did nothing but claim they did the work. Must've sold pretty well nonetheless.
QEMM worked by remapping stuff into extended memory - in a time that most software wasn't interested in using it. It worked as advertised.
Quarterdeck made good stuff all around. Desq and DesqView/X were amazing multitaskers. Way snappier than Windows and ran on little to nothing.
I remember using MemTurbo in the Windows 2000 era, though now I know it was mostly smoke and mirrors. My biggest gripe these days is too many "hardware accelerated" apps eating away VRAM, which is less of a problem with Windows (better over-commit) but which causes me a few crashes a month on KDE.
> If some eligible files were found, the amount of disk space that can be reclaimed is shown next to the “Potential Savings” label. To proceed any further, you will have to make a purchase. Once the app’s full functionality is unlocked, a “Review Files” button will become available after a successful scan. This will open the Review Window.
I half remember this being discussed on ATP; the logic being that if you have the list of files, you will just go and de-dupe them yourself.
Many comments here offering similar solutions based on hardlinks or symlinks.
This uses a specific feature of APFS that allows the creation of copy-on-write clones. [1] If a clone is written to, then it is copied on demand and the original file is unmodified. This is distinct from the behavior of hardlinks or symlinks.
brew install fclones
cd ~
fclones group . | fclones dedupe
I've used fclones before in the default mode (create hard links) but this is the first time I've run it at the top level of my home folder, in dedupe mode (i.e. using APFS clones). Fingers crossed it didn't wreck anything.
> You can also create [APFS (copy on write) clones] in Terminal using the command `cp -c oldfilename newfilename` where the c option requires cloning rather than a regular copy.
`fclones dedupe` uses the same command[1]:
if cfg!(target_os = "macos") {
result.push(format!("cp -c {target} {link}"));
Nice. Also, compression at the file system level can save a lot of space, and with current CPU speeds it is completely transparent. It is a feature from HFS+ that still works in APFS but is not officially supported anymore. What is wrong with you, Apple?
This tool to enable compression is free and open source
Also, a note about APFS vs HFS+: if you use an HDD, e.g. as backup media for Time Machine, HFS+ is a must-have over APFS, since APFS is optimised only for SSDs (random access).
The not-so-smart Time Machine setup utility forcefully re-creates APFS on HDD media, so you have to manually create an HFS+ volume (e.g. with Disk Utility) and then use a terminal command to add this volume as a TM destination.
Nice, but I'm not getting a subscription for a filesystem utility. Had it been a one-time $5 license, I would have bought it. At the current price, it's literally cheaper to put files in an S3 bucket or outright buy an SSD.
They had long discussions about the pricing on the podcast the author is a part of (atp.fm). It went through a few iterations of one time purchase, fee for each time you free up space and a subscription. There will always be people unhappy about either choice.
I worry about procreate too. It’s way too cheap for what it is and it’s in the small set of apps that can justify a subscription.
This app though? No chance. Parent comment says “if you want to support the app’s development” but not all apps need to be “developed” continuously, least of all system utilities.
Claude 3.7 just rewrote the whole thing (just based on reading the webpage description) as a commandline app for me, so there's that.
And because it has no Internet access yet (and because I prompted it to use a workaround like this in that circumstance), the first thing it asked me to do (after hallucinating the functionality first, and then catching itself) was run `curl https://hypercritical.co/hyperspace/ | sed 's/<[^>]*>//g' | grep -v "^$" | clip`
("clip" is a bash function I wrote to pipe things onto the clipboard or spit them back out in a cross-platform linux/mac way)
clip() {
    # Copy stdin to the clipboard, or print the clipboard when stdin is a terminal.
    if command -v pbcopy > /dev/null; then
        # macOS: pbcopy/pbpaste
        [ -t 0 ] && pbpaste || pbcopy
    elif command -v xclip > /dev/null; then
        # Linux/X11: xclip
        [ -t 0 ] && xclip -o -selection clipboard || xclip -selection clipboard
    else
        echo "clip function error: Neither pbcopy/pbpaste nor xclip are available." >&2
        return 1
    fi
}
Trust, but verify. We should always read and understand what gets spit out anyways.
When doing something with any risk potential I first ask the model for potential risks with the output, and then I manually read the code.
I also "recreated" this tool with Sonnet 3.7. The initial bash script worked (but was slow), and after a few iterations we landed on an fclones one-liner. I hadn't heard of fclones before, but works great! Saved a bunch of disk space today.
Ooh, could you share the source code? That seems like a perfect example for my "relying on AI code generation will subtly destroy your data" presentation.
The price does seem very high. It’s probably a niche product and I’d imagine developers are the ones who would see the biggest savings. Hopefully it works out for them
Well I do value software, I'm paid $86/h to write some! I just find that for $20/year or $50 one time, you can get way more than 12G of hard drive space. I also don't think that this piece of software requires so much maintenance that it wouldn't be worth making at a lower price. I'm not saying that it's bad software, it's really great, just too expensive... Personally, my gut feeling is that the dev would have had more sales with a one time $5, and made more money overall.
The first option presented is a one month non-renewing subscription for $10. I think the intention is periodically (once a year, once every few years?) you run it to reclaim space. If it was reclaiming more than a few gigs I would do it.
The author talked about being very conservative on launch; skipping directories like the Photo library or others apps that actively manage data or looking across user directories. He stumbled into writing this app because he noticed the duplicated data of shared Photo libraries between different users on the same machine. That use case isn't even supported in this version. He said he plans future development to safely dedup more data--making a one time purchase less sustainable for them.
I'm pretty sure some of them also work on MacOS. rmlint[1], for example can output a script that reflinks duplicates (or run any script for both files):
rmlint -c sh:handler=reflink .
I'm not sure if reflink works out of the box, but you can write your own alternative script that just links both files
I feel like that's true for most of the relatively low-level disk and partition management tooling. As unpopular an opinion as it may lately be around here, I'm enough of a pedagogical traditionalist to remain convinced that introductory logical volume management is best left at least till kindergarten.
Despite knowing this is the correct interpretation, I still consistently make the same incorrect interpretation as the parent comment. It would be nice if they made this more intuitive. Glad I’m not the only one that’s made that mistake.
Swift 6 is not the problem. It's backward compatible.
The problem is SwiftUI. It's very new, still barely usable on the Mac, but they are adding lots of new features every macOS release.
If you want to support older versions of macOS you can't use the nice stuff they just released. Eg. pointerStyle() is a brand new macOS 15 API that is very useful.
It's not bad, just limited. I think it's getting usable, but just barely so.
They are working on it, and making it better every year. I've started using it for small projects and it's pretty neat how fast you can work with it -- but not everything can be done yet.
Since they are still adding pretty basic stuff every year, it really hurts if you target older versions. AppKit is so mature that for most people it doesn't matter if you can't use new features introduced in the last 3 years. For SwiftUI it still makes a big difference.
Came here to post the same thing. Would love to try the application, but I guess not if the developer is deliberately excluding my device (which cannot run the bleeding edge OS).
In fairness, I don't think you can describe it as bleeding edge when we're 5 months into the annual 12-month upgrade cycle. It's recent, but not exactly an early-adopter version at this point.
Expensive. Keeping us on the expensive hardware treadmill. My guess is that it cannot be listed in the Apple App Store unless it's only for Macs released in the last 11 months.
This isn’t true; you can set the deployment target multiple versions back. The main problem right now is a huge amount of churn in the language, APIs, and multiple UI frameworks, which means everything is a moving target. SwiftUI has only really become usable in the last couple of versions.
Every time Xcode updates, it seems a few more older macOS and iOS versions are removed from the list of "Minimum Deployment Versions". My current Xcode lets me target macOS back to 10.13 (High Sierra, 7 years old) and iOS 12.0 (6 years old). This seems... rather limiting. Like, I'd be leaving a lot of users out in the cold if I were actually releasing apps anymore. And this is Xcode 15.2, on a dev host Mac forever stuck on macOS 13.7. I'm sure newer Mac/Xcode combinations are even more limiting.
I used to be a hardcore Apple/Mac guy, but I'm kind of giving up on the ecosystem. Even the dev tools are keeping everyone on the treadmill.
You can keep using an older version of Xcode if you like. I mean, every other tool chain that I can think of does more or less the same thing. There are plenty of reasons to criticise Apple's developer tooling and relations, but I don't see this as being especially different to other platforms
I don't understand why a simple, closed source de-dup app is at the top of the front page with 160+ comments? What is so interesting about it? I read the blog and the comments here and I still don't get it.
I assume it’s because it’s from John Siracusa, a long-time Mac enthusiast, blogger, and podcaster. If you listen to him on ATP, it’s hard not to like him, and anything he does is bound to get more than the usual upvotes on HN.
As a web dev, it’s been fun listening to Accidental Tech Podcast where Siracusa has been talking (or ranting) about the ins and outs of developing modern mac apps in Swift and SwiftUI.
The part where he said making a large table in HTML and rendering it with a web view was orders of magnitude faster than using the SwiftUI native platform controls made me bash my head against my desk a couple times. What are we doing here, Apple.
SwiftUI is a joke when it comes to performance. Even Marco's Overcast stutters when displaying a table of a dozen rows (of equal height).
That being said, it's not quite an apples to apples comparison, because SwiftUI or UIKit can work with basically an infinite number of rows, whereas HTML will eventually get to a point where it won't load.
Shoutout to iced, my favorite GUI toolkit, which isn't even in 1.0 yet but can do that with ease and faster than anything I've ever seen: https://github.com/iced-rs/iced
It's easy to write a quick and clean UI toolkit, but when you add all the stuff for localization (like support for RTL languages, which also means swapping which side icons go on) and accessibility (all the screen reader support), that's where you really get bogged down and start wanting to add all the abstractions that slow things down.
I wish there were modern benchmarks against browser engines. A long time ago native apps were much faster at rendering UI than the browser, but that was many performance rewrites ago, so I wonder how browsers perform now.
Hacker News loves to hate Electron apps. In my experience ChatGPT on Mac (which I assume is fully native) is nearly impossible to use because I have a lot of large chats in my history but the website works much better and faster. ChatGPT website packed in Electron would've been much better. In fact, I am using a Chrome "PWA App" for ChatGPT now instead of the native app.
> In my experience ChatGPT on Mac (which I assume is fully native)
If we are to believe ChatGPT itself: "The ChatGPT macOS desktop app is built using Electron, which means it is primarily written in JavaScript, HTML, and CSS"
Someone more experienced than me could probably comment on this more, but theoretically is it possible for Electron production builds to become more efficient by having a much longer build process and stripping out all the unnecessary parts of Chromium?
For those mentioning that there's no price listed, it's not that easy as in the App Store the price varies by country. You can open the App Store link and then look at "In App Purchases" though.
For me on the German store it looks like this:
Unlock for One Year 22,99 €
Unlock for One Month 9,99 €
Lifetime Unlock 59,99 €
It would be interesting if payments bought a certain amount of saved space, and the rate was based on current storage prices, to keep it competitive with the cost of just expanding storage.
It's interesting how Linux tools are all free when even trivial Mac tools are being sold. Nothing against someone trying to monetize, but the Linux culture sure is nice!
I don't think they meant it in a disparaging way, except maybe against Apple. More so that, typically, filesystems that support deduplication include a deduplication tool in their standard suite of FS tools. I too find it odd that Apple does not do this.
I have yet to see a GUI variant of deduplication software for Linux. There are plenty of command line tools, which probably can be ported to macOS, but there's no user friendly tool to just click through as far as I know.
There's value in convenience. I wouldn't pay for a yearly license (that price seems more than fair for a "version lifetime" price to me?) but seeing as this tool will probably need constant maintenance as Apple tweaks and changes APFS over time, combined with the mandatory Apple taxes for publishing software like this, it's not too awful.
The Mac App Store (and all of Apple's App Stores) doesn't enable this sort of licensing. It's exactly the sort of thing that drives a lot of developers to independent distribution.
That's why we see so many more subscription-based apps these days, application development is an ongoing process with ongoing costs, so it needs to have ongoing income. But the traditional buy-it-once app pricing doesn't enable that long-term development and support. The app store supports subscriptions though, so now we get way more subscription-based apps.
I really think Siracusa came up with a clever pricing scheme here, given his want to use the app store for distribution.
The cost reflects the fact that people won't use it regularly. The developer is offering lifetime unlocks, lower-cost tiers for shorter timeframes, etc.
A ~20 y.o. account with perhaps hundreds of devices in history across different continents and countries along with family sharing.
Every time I need to purchase something via Apple, it becomes a quest. Enter password, validate card, welcome to endless login loop. Reboot. Click purchase, enter password, confirm OTP on another device, then nothing happens, purchase button is active, clicks ignored. Reboot. Click "Get", program begins downloading, wait 30s, app cannot be download, go to Settings to verify account. Sure. Account is perfectly fine in Settings. Reboot. Click "Get". Finally program installed. Click in-app purchase. Enter password again. Choose Apple Pay. Proceed with purchase. You need to verify your account. Account is fine in Settings. Reboot. Click purchase. Cannot be completed at this time. Wait couple of hours, try again. Purchase successful.
All. The. Time. For years. On almost all of the devices which I upgrade annually.
Oh, I was agreeing with you. My 2015 iMac died two weeks ago. Pretty sure it's the SSD I installed when I first got it. And while most of my files are in cloud storage, I also had a series of chained external drives running Time Machine. Guess what? I can't use any apple tools to grab any files because different file system type between that machine and my M1 MB Pro (permissions issues).
I'm going to have to clone the drive, then use terminal to chmod the /usr dirs to extract the files I want (mostly personal music production).
I immediately ordered a Mac mini, but since I didn't want a 256GB drive and 16GB RAM, I'm still waiting for it to arrive from China.
Also, the M1 MB Pro was the most expensive* and worst computer I have ever owned. I wish I had just bought an air. No tactile volume controls. As a musician that is the worst.
(*company I worked for a while in school paid $12k for a Mac 2 FX. Lol.)
I would never dismiss such a complaint with a glib "works for me". And yet, your experience is so utterly, completely different from mine that I have to think something's busted in your account somewhere. I've had an account for about as long, with family sharing and all the rest. I never, ever, have anywhere near that level of difficulty. For me it works as documented: I click "Get", it asks for Face ID to confirm it's really me, then a few seconds later I have the app installed and ready to use.
Again, I don't think you're doing anything wrong, and I don't doubt your experience. But I really think something's fundamentally wrong somewhere, because what you're dealing with is not normal. It's not the common experience others are tolerating.
Oh yes, it's pretty clear to me that something is wrong on Apple's side specifically with my account. Obviously people are having close to zero friction with Apple's stuff.
I can't complain though, because I've had this account for a couple of decades and losing it would be painful. Apple did ban my account twice on the grounds that it's a US account while I'm not physically located there. I was able to revert the ban by explaining that I've got a US legal entity (account, banking card, etc.), and thus I beg to continue using it. Not taking chances for the third time, so I silently endure.
> Obviously people are having close to zero friction with Apple's stuff.
I don't believe that to be true, I have been having issues with Apple bugs for the last 7-8 years. Totally unnecessary friction due to features I do not want and do not use.
Edit: Let’s be real here, Tim Cook is keeping the lights on. He isn’t a product guy. They lack leadership and vision at present and are committed to a foolish release cycle based on the calendar year rather than the quality of the product. These wouldn’t have come to pass had Steve lived till today. Yes, it's an opinion, but I doubt it's an unpopular one.
So I’ll say it again succinctly, to answer “what happened?”
It is not a hard link. A clone is an independent file which is backed by the same storage. So far mostly the same as a hard link you’ll say. However if you modify a clone, it will be “uncloned” and will be modified independently of its clones.
If you replace `sh:link` with `sh:clone` instead, it will.
> clone: reflink-capable filesystems only. Try to clone both files with the FIDEDUPERANGE ioctl(3p) (or BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free up duplicate extents while preserving the metadata of both. Needs at least kernel 4.2.
I don't think czkawa supports deduplication via reflink so it's not exactly the same thing. fclones as linked by another user is more similar: https://news.ycombinator.com/item?id=43173713
The fact that copying doesn't copy seems dangerous. Like what if I wanted to copy for the purpose of modifying the file while retaining the original. A trivial example of this might be I have a meme template and I want to write text in it while still keeping a blank copy of the template.
There's a place for alias file pointers, but lying to the user and pretending like an alias is a copy is bound to lead to unintended and confusing results
Copy-on-write means the copy is only performed when you make the first change (and only the part that changes is copied; the rest is shared with the original file). Until then, copying is free.
Requires macOS 15.0 or later. – Oh god, this is so stupid and the most irritating thing about macOS "application development".
It is really unfair to call it "software"; it is more like "glued to a recent version of the OS"-ware. Meanwhile, I can still run an .exe compiled in 2006, and with Wine even on Mac or Linux.
I would also have appreciated a version that's compatible with a non-latest macOS release.
Then again, this app was written with SwiftUI, which hasn't received some handy features before macOS 12 and is still way behind AppKit.
When I see an app that's not compatible with the second most recent macOS, I assume the dev either didn't know better or they were too lazy to write workarounds / shims for the latest-and-greatest shiny stuff.
I have to confess: it miffs me that a utility that would normally fly completely under the radar is likely to make the creator thousands of dollars just because he runs a popular podcast. (Am I jealous? Oh yes. But only because I tried to sell similar apps in the past and could barely get any downloads no matter how much I marketed them. Selling software without an existing network seems nigh-on impossible these days.)
Anyway, congrats to Siracusa on the release, great idea, etc. etc.
I can understand your criticism as it's easy to arrive at that conclusion (Also a common occurrence when levelsio launches a new product, as his Twitter following is large) but it's also not fair to discount it as "just because he runs a popular podcast".
The author has been a "household" name in the macOS / Apple scene for a long time, even before the podcast. If someone has spent their life blogging about all things Apple on outlets like Ars Technica and is consistently putting out new content on podcasts for decades, they will naturally have better distribution.
How many years did you spend on building up your marketing and distribution reach?
I know! I actually like him and wish him the best. I just get a bit annoyed when one of the ATP folks releases some small utility with an unclear niche and then later talks about how they've "merely" earned thousands of dollars from it. When I was an app developer, I would have counted myself lucky to have made just a hundred bucks from a similar release. The gang's popularity gives them a distorted view of the market sometimes, IMHO.
I did this with two scripts - one that produces and caches sha1 sums of files, and another that consumes the output of the first (or any of the *sum progs) and produces stats about duplicate files, with options to delete or hard-link them.
If a file is not going to be modified in place (in the low-level sense of open("w") on the filename, as opposed to rename-and-create-new), then reflinks (what this app does) and hardlinks act somewhat identically.
For example if you have multiple node_modules, or app installs, or source photos/videos (ones you don't edit), or music archives, then hardlinks work just fine.
I’ve experimented with reflinks and other APFS operations.
Here’s a question though: how does this work with transparently compressed files on APFS?
In my past experience, using reflinks is fine and using transparent compression is fine, but combining them leads to hard-to-debug file corruption.
Again, I've got no problem with people selling software or closed source models, but I've never understood using this justification. Maybe in this instance he's a well known public figure with published contact info that people will abuse?
Didn’t SQLite developer(s) famously receive a flood of phone calls because McAfee antivirus used it in a way that was visible (and “suspicious”) to its users?
One does not simply “not accept bug reports”.
https://github.com/sqlite/sqlite/blob/e8346d0a889c89ec8a78e6...
> he doesn't want to deal with thousands of support requests when making his apps open source and free.
Who says you have to deal with support requests if you open source something?
> All his apps are personal itches he scratched and he sells them not to make a profit but to make the barrier of entry high enough to make user feedback manageable.
That makes no sense
> Who says you have to deal with support requests if you open source something?
Almost anyone who has ever maintained popular open-source software, even if dealing with them means putting up a notice that says "Don't ask support questions" and having to delete angrily posted issues.
My understanding from listening to his explanation is he wants to be able to support users and have an income stream to incentivize that.
As an open-source maintainer of a popular piece of software, I'm very empathetic.
Sorry, that's a BS reason. If you don't want that, just ignore all opened issues. That's it. If you are nice then you put a README that explains this in a sentence or two. If a community forms that wants to fix issues, for example critical ones that could lead to data loss then the community will deal with it, e.g. by forking.
Just keeping everything closed is really missing the point of how trust in infra that handles critical data is built nowadays.
As I understand it, from listening to the podcast, a better summary is that if it becomes popular, he wants it to be worthwhile for him to keep working on.
Apps like this can easily bit rot, and more users does often mean more work e.g. answering or filtering emails, finding more edge cases, etc.
From his perspective that means having a income to dedicate time to this. I don't think he's interested in being an "infra" app as you would think of it.
As someone who maintains critical open-source software, I can strongly empathize, even if it’s not an approach I would take.
Just because the creator, John Siracusa is famous. If a no-name developer did the app, it wouldn't get this many upvotes and this much attention. He used to write very detailed OS reviews, and I learned a lot from him, including Apple's logical volume manager functions (`diskutil cs`).
On metadata, the excellent faq addresses that specifically (it does preserve). (I had the same question)
Thank you for creating and sharing this utility.
I ran it over my Postgres development directories that have almost identical files. It saved me about 1.7GB.
The project doesn't have any license associated with it. If you don't mind, can you please license this project with a license of your choice.
As a gesture of thanks, I have attempted to improve the installation step slightly and have created this pull request: https://github.com/ttkb-oss/dedup/pull/6
Just tried it, and it works well! I didn't realize the potential of this technique until I saw just how many dupes there were of certain types of files, especially in node_modules. It wasn't uncommon to see it replace 50 copies of some js file with one, and that was just in a specific subdirectory.
I see it is "pre-release" and sort of low GH stars (== usage?), so I'm curious about the stability since this type of tool is relatively scary if buggy.
I use it on my family photos, work documents, etc. and have not run into an issue that I haven’t added a test for. I didn’t commercialize it because I didn’t know what I didn’t know, but the utility does try to fail quickly if any of the file system operations fail (cloning, metadata duplication, atomic swaps, etc.).
Whenever using it on something sensitive that I can’t back up first for whatever reason, I make checksum files and compare them afterwards. I’ve done this many times on hundreds of GB and haven’t seen corruption. Caveat emptor.
There is one huge caveat I should add to the README - block corruption happens. Having a second copy of a file is a crude form of backup. Cloning causes all instances to use the same block, so if that one instance is corrupted, all clones are. That’s fine for software projects with generated files that can be rebuilt or checked out again, but introduces some risk for files that may not otherwise be replaceable. I keep multiple backups of all that stuff in hardware other than where I’m deduping, so I dedup with abandon.
I’m a nobody with no audience. Maybe some attention here will get some users.
Somewhat unrelated, but I believe the dupe issue with node_modules is the main reason to use pnpm instead of npm - pnpm just uses a single global package repo on your machine and creates links inside node_modules as needed.
See the comments on https://news.ycombinator.com/item?id=38113396 for a list of alternatives. I used https://github.com/sahib/rmlint in the past and can't complain.
Wow, that's some excellent documentation.
I was also really impressed that `make` ran basically instantly.
Thanks!
I love the documentation from FreeBSD and OpenBSD. Only having to target one platform and only system libraries makes building simple.
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?
This is the standard API for deduplication on Linux (used for btrfs and XFS); you ask the OS nicely to deduplicate a given set of ranges, and it responds by locking the ranges, verifying that they are indeed identical and only then deduplicates for you (you get a field back saying how many bytes were deduplicated from each range). So there's no way a userspace program can mess up your files.
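For the curious, here's a minimal C sketch of that Linux call (the file names and the 1 MiB length are placeholders; the kernel reports FILE_DEDUPE_RANGE_DIFFERS and shares nothing if the ranges don't actually match):

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int src = open("a.bin", O_RDONLY);
        int dst = open("b.bin", O_RDWR);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        size_t len = 1 << 20;  /* assume the first 1 MiB of both files is identical */
        struct file_dedupe_range *arg =
            calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        arg->src_offset = 0;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        /* The kernel locks and verifies both ranges before sharing extents. */
        if (ioctl(src, FIDEDUPERANGE, arg) < 0) { perror("FIDEDUPERANGE"); return 1; }

        if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
            printf("deduplicated %llu bytes\n",
                   (unsigned long long)arg->info[0].bytes_deduped);
        else
            printf("ranges differed; nothing shared\n");
        return 0;
    }

Tools like duperemove are essentially drivers that find candidate ranges and issue this call for you.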
Yup. This is the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
Probably because APFS runs on everything from the Apple Watch to the Mac Pro and everything in between.
You probably don’t want your phone or watch de-duping stuff.
There are tons of knobs you can tweak in macOS but Apple has always been pretty conservative when it comes to what should be default behavior for the vast majority of their users.
Certainly when you duplicate a file using the Finder or use cp -c at the command line, the Copy-on-Write functionality is being used; most users don’t need to know that.
On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.
It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.
In regards to the second point, this isn't correct for ZFS: "If several files contain the same pieces (blocks) of data or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. Instead of storing many copies of a book it stores one copy and an arbitrary number of pointers to that one copy." [0]. So changing one byte of a large file will not suddenly result in writing the whole file to disk again.
[0] https://www.truenas.com/docs/references/zfsdeduplication/
This applies to modifying a byte. But inserting a byte will change every block from then on, and will force a rewrite.
Of course, that is true of most filesystems.
Not the whole file but it would duplicate the block. GP didn't claim that the whole file is copied.
It reads like that is what they meant: "modifying one byte of a large file would result in a lot disk activity, as the file system would need to duplicate the file again"
Yeah, I did not write it very clearly. On ZFS, you're right. On a file system that applied deduplication to files and not individual blocks, the file would need to be duplicated again, no matter where and what kind of change was made.
That's a ZFS online dedup limitation. I think XFS and btrfs are better prior art here since they use extent-based deduplication and can do offline dedup, which means they don't have to keep it in memory, and the on-disk metadata is smaller too.
Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size, because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.
APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.
I believe as soon as you change a single byte you get a complete copy that’s your own.
And that’s how this program works. It finds perfect duplicates and then effectively deletes and replaces them with a copy of the existing file so in the background there’s only one copy of the bits on the disk.
I suppose this means that you could find yourself unexpectedly out of disk space in unintuitive ways, if you're only trying to change one byte in a cloned file but there isn't enough space to copy its entire contents?
It doesn't work the way you think. If you change one byte of a duplicated file, only that "byte" will be changed on disk (a "byte" because, technically, it is not a byte but a block).
As far as I understand, it works like the reflink feature in modern Linux filesystems. If so, that's really cool, and that's also a bit better than ZFS's snapshots. I am a newbie on macOS, but it looks amazing.
That’s true as long as the writing application only writes to blocks that have changed. Is a VM tool going to write blocks, or write multi-MB segments of a sparse image that can be swapped atomically? Unfortunately, once a file changes there are no APIs to check which blocks are still shared (at least there weren’t as of macOS 13).
I’m not sure if it works on a file or block level for CoW, but yes.
However APFS gives you a number of space related foot-guns if you want. You can overcommit partitions, for example.
It also means if you have 30 GB of files on disk that could take up anywhere from a few hundred K to 30 GB of actual data depending on how many dupes you have.
It’s a crazy world, but it provides some nice features.
> I believe as soon as you change a single byte you get a complete copy that’s your own.
I think it stores a delta:
https://en.m.wikipedia.org/wiki/Apple_File_System#Clones
That’s not how this works. Nothing is deleted. It creates zero-space clones of existing files.
https://en.wikipedia.org/wiki/Apple_File_System?wprov=sfti1#...
Is there a FS that keeps only diffs in clone files? It would be neat
I wondered that too.
If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.
But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.
Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.
It's certainly an interesting concept but might be more trouble than it's worth.
ZFS does this by de-duplicating at the block level, not the file level. It means you can do what you want without needing to keep track of a chain of differences between files. Note that de-duplication on ZFS has had issues in the past, so there is definitely a trade-off. A newer version of de-duplication sounds interesting, but I don't have any experience with it: https://www.truenas.com/docs/references/zfsdeduplication/
Not an FS, but attempting to mimic NTFS, SharePoint does this within its content database(s).
https://www.microsoft.com/en-us/download/details.aspx?id=397...
ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)
https://www.truenas.com/docs/references/zfsdeduplication/
With extent-based filesystems you can clone extents and then overwrite one extent and only that becomes unshared.
That’s how APFS works; it uses delta extents for tracking differences in clones: https://en.wikipedia.org/wiki/Delta_encoding?wprov=sfti1#Var...
APFS shares blocks so only blocks that changed are no longer shared. Since a block is the smallest atomic unit (except maybe an inode) in a FS, that’s the best level of granularity to expect.
VAST storage does something like this. Unlike how most storage arrays identify the same block by hash and only store it once VAST uses a content aware hash so hashes of similar blocks are also similar. They store a reference block for each unique hash and then when new data comes in and is hashed the most similar block is used to create byte level deltas against. In practice this works extremely well.
https://www.vastdata.com/blog/breaking-data-reduction-trade-...
From a file system design perspective, does anyone know why ZFS chose to use block clones, instead of file clones?
Records (which are of variable size) are already checksummed, and there were checksum-hashes which made it vanishingly unlikely that one could choose two different records with the same (optionally cryptographically strong) checksum-hash. When a newly-created record's checksum is generated, one could look into a table of existing checksum-hashes and avoid the record write if it already exists, substituting an incremented refcount for that table entry.
ZFS is essentially an object store database at one layer; the checksum-hash deduplication table is an object like any other (file, metadata, bookmarks, ...). There is one deduplication table per pool, shared among all its datasets/volumes.
On reads, one does not have to consult the dedup table.
The mechanism was fairly easy to add. And for highly-deduplicatable data that is streaming-write-once-into-quiescent-pool-and-never-modify-or-delete-what's-written-into-a-deduplicated-dataset-or-volume, it was a reasonable mechanism.
In other applications, the deduplication table would tend to grow and spread out, requiring extra seeks for practically every new write into a deduplicated dataset or volume, even if it's just to increment or decrement the refcount for a record.
Destroying a deduplicated dataset has to decrement all its refcounts (and remove entries from the table where it's the only reference), and if your table cannot all fit in ram, the additional IOPS onto spinning media hurt, often very badly. People experimenting with deduplication and who wanted to back out after running into performance issues for typical workloads sometimes determined it was much MUCH faster to destroy the entire pool and restore from backups, rather than wait for a "zfs destroy" on a set of deduplicated snapshots/datasets/volumes to complete.
I have no specialized knowledge (just a ZFS user for over a decade). I suspect the reason is that in addition to files, ZFS will also allow you to create volumes. These volumes act like block devices, so if you want to dedup them, you need to do it at the block level.
This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.
Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.
This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.
> This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background.
I think that ZFS actually does this. https://www.truenas.com/docs/references/zfsdeduplication/
It's considered an "expensive" configuration that is only good for certain use-cases, though, due to its memory requirements.
Yes true, but that page also covers some recent improvements to de-duplication that might assist.
Really? I haven't looked at this ZFS feature in a few years so I will take a look
EDIT: Is this referring to the "fast" dedup feature?
Yes. It doesn't solve everything, but it looks promising
Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.
Yep. At a previous job we had a file server that we published Windows build output to.
There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.
It is worth pointing out though, that on Windows Server this deduplication is a background process; When new duplicate files are created, they genuinely are duplicates and take up extra space, but once in a while the background process comes along and "reclaims" them, much like the Hyperspace app here does.
Because of this (the background sweep process is expensive), it doesn't run all the time and you have to tell it which directories to scan.
If you want "real" de-duplication, where a duplicate file will never get written in the first place, then you need something like ZFS
Both ZFS and WinSvr offer "real" dedupe. One is on-write, which requires a significant amount of available memory, the other is on a defined schedule, which uses considerably less memory (300MB + 10MB/TB).
ZFS is great if you believe you'll exceed some threshold of space while writing. I don't personally plan my volumes with that in mind but rather make sure I have some amount of excess free space.
WinSvr allows you to disable dedupe if you want (don't know why you would), whereas ZFS is a one-way street without exporting the data.
Both have pros and cons. I can live with the WinSvr cons while ZFS cons (memory) would be outside of my budget, or would have been at the particular time with the particular system.
hey, it's defrag all over again!
(not really, since it's not fragmentation, but conceptually similar)
data loss is the largest concern
I still do not trust de-duplication software.
Agreed, "I made a deduplication software in my garage! Do you want to try it?" is a terrifying pitch.
I've been writing a similar thing to dedupe my photo collection and I'm so paranoid of pulling the trigger I just keep writing more tests.
Dedupe seemed more interesting when storage was expensive, but nowadays it feels like the overhead you get from running dedupe, in most cases, is priced in. At least with software like CommVault for backups, dedupe requires beefy hardware and low-latency SSDs for the database; if there are even a few extra milliseconds of latency or the server can’t handle requests fast enough, your backup throughput absolutely tanks. Depending on your data, though, you could see some ridiculous savings here that make it worth the trouble.
I’ve heard many horror stories of dedupe related corruption or restoration woes though, especially after a ransomware attack.
Even using SHA-256 or stronger hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algorithmic decision trees. I know that any mistake I made would not be out of malice but just ignorance or some other stupid mistake.
I've done the whole compare-every-file-via-hashing thing, logging each of the matches for humans to compare, but none of that has ever been allowed to mv/rm/ln -s anything. I feel my imposter syndrome in this regard is not a bad thing.
Now you understand why this app costs more than 2x the price of alternatives such as diskDedupe.
Any halfway-competent developer can write some code that does a SHA256 hash of all your files and uses the Apple filesystem APIs to replace duplicates with shared clones. I know Swift; I could probably do it in an hour or two. Should you trust my bodgy quick script? Heck no.
The author - John Siracusa - has been a professional programmer for decades and is an exceedingly meticulous kind of person. I've been listening to the ATP podcast where they've talked about it, and the app has undergone an absolute ton of testing. Look at the guardrails on the FAQ page https://hypercritical.co/hyperspace/ for an example of some of the extra steps the app takes to keep things safe. Plus you can review all the proposed file changes before you touch anything.
You're not paying for the functionality, but rather the care and safety that goes around it. Personally, I would trust this app over just about any other on the mac.
Best course of action is to not trust John, and just wait for the app to be out in the wild for a year, until everyone else trusts John. I have enough hard drive space in the meantime to not rush into trusting John.
Having listened to John for 10 years, he’d be the first to encourage you to wait around to trust his app.
More than TeX or SQLite?
> I'd still have concerns about letting a system make deletion decisions without my involvement
You are involved. You see the list of duplicates and can review them as carefully as you'd like before hitting the button to write the changes.
Yeah, the lack of involvement was more in response to ZFS doing this not this app. I could have crossed the streams with other threads about ZFS if it's not directly in this thread
Question for the developer: what's your liability if user files are corrupted?
Most EULAs would disclaim liability for data loss and suggest users keep good backups. I haven’t read a EULA in a long time, but I think most of them do so.
I can't find a specific EULA or disclaimer for the Hyperspace app, but given that the EULAs for major things like Microsoft Office basically say "we offer you no warranty or recourse no matter what this software does," I would hardly expect an indie app to offer anything like that.
Disk Utility.app manages to keep the OS running while making the disk exclusive-access.. I wonder how it does that.
It doesn't? It freezes when committing operations.
I was musing about this comment:
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
Maybe not solving the same problem.
NTFS supports deduplication but it is only available on Server versions which is very annoying.
a content addressed block store with pointers and skiplists for file continuity would be kinda neat.
What's the source of that quote? Does it mean it's not safe to use Hyperspace?
If Apple is anything like where I work, there's probably a three-year-old bug ticket in their system about it and no real mandate from upper management to allocate resources for it.
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.
After all how many perfect duplicate files do you probably create a month accidentally?
There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.
And you can always rerun it for free to see if you have enough stuff worth paying for again.
am I really that old that I remember this being the default for most of the software about 10 years ago? Are people already that used to the subscription trap that they think this is a new model ?
I grew up with shareware in the 90s that often adopted a similar model (though having to send $10 in the mail and wait a couple weeks for a code or a disk to come back was a bit of a grind!) but yes, it's refreshing in the current era where developers will even attempt to charge $10 a week for a basic coloring in app on the iPad..
I also really like this pricing model.
I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.
it’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards which you can lock for those so if you forget to cancel the charges are blocked)
however has anyone been able to find out from the website how much the license actually costs?
Doesn’t the Mac App Store listing list the IAP SKUs like it does on iOS?
It does. It's reasonably clear for this app but I wish they made it clearer for other apps where the IAP SKUs often have meaningless descriptions.
Start with a story, narrow it down to a problem and show how your solution magically solves that problem. Such a fine example of GREAT marketing.
Just want to mention: Apple ships a modified version of the copy command (good old cp) that supports the ability to use the cloning feature of APFS by using the -c flag.
And in case your cp doesn't support it, you could also do it by invoking Python. Something like `import Foundation; Foundation.NSFileManager.defaultManager().copyItemAtPath_toPath_error_(...)`.
Apparently the cp command in coreutils also supports copy-on-write on macOS: https://unix.stackexchange.com/questions/311536/cp-reflink-a...
Correct. Foundation's NSFileManager / FileManager will automatically use clone for same-volume copies if the underlying filesystem supports it. This makes all file copies in all apps that use Foundation support cloning even if the app does nothing.
libcopyfile also supports cloning via two flags: COPYFILE_CLONE and COPYFILE_CLONE_FORCE. The former clones if supported (same volume and filesystem supports it) and falls back to actual copy if not. The force variant fails if cloning isn't supported.
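For illustration, a minimal Swift sketch of the libcopyfile route described above (file paths are hypothetical; COPYFILE_CLONE falls back to a plain copy when cloning isn't possible):

```swift
import Darwin

// Clone src to dst if the destination volume supports it (e.g. same APFS
// volume); otherwise fall back to a regular copy. The destination must not
// already exist, since COPYFILE_CLONE implies exclusive creation.
let src = "original.bin"   // hypothetical paths
let dst = "clone.bin"
if copyfile(src, dst, nil, copyfile_flags_t(COPYFILE_CLONE)) != 0 {
    perror("copyfile")     // e.g. destination already exists or permission denied
}
```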
that function name `copyItemAtPath_toPath_error_` sounds like it was ported directly from objective-c!
They might not have ported it. They could have a Python __getattr__ implementation that returns a callable that simply uses objc_msgSend under the hood.
What algorithm does the application use to figure out if two files are identical? There's a lot of interesting algorithms out there. Hashes, bit by bit comparison etc. But these techniques have their own disadvantages. What is the best way to do this for a large amount of files?
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
You can start with the size, which is probably really unique. That would likely cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.
Plus, if you find a difference at byte 1290, you can just stop there instead of reading the whole thing to finish the hash.
I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.
This is the default for ZFS deduplication, and git does something similar with size and the far weaker SHA-1. I would add a test for SHA-256 collisions, but no one seems to have found a working example yet.
0 - https://github.com/ttkb-oss/dedup
Reading just the first byte is probably wasting a read of the whole block.
Hashing the whole file after that is wasteful. You need to read (and hash) only as much as needed to demonstrate uniqueness of the file in the set.
The tree concept can be extended to every byte in the file:
https://github.com/kornelski/dupe-krill?tab=readme-ov-file#n...
How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.
Same size, same first and last bytes, and same SHA-256.
…and you’re not worried about shark attacks, are you?
FWIW, when I wrote a tool like this I used same size + some hash function, not MD5 but maybe SHA1, don't remember. First and last bytes is a good idea, didn't think of that.
>which is probably really unique
Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.
>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash
Definitely, for comparing any two files. But, if you're searching for duplicates across the entire disk, then you're theoretically checking each file multiple times, and each file is checked against multiple times. So, hashing them on first pass could conceivably be more efficient.
>if you just compare the bytes there is no chance of hash collision
You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.
But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.
You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
> exceedingly rare
To have a mere one in a billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best bitcoin mining efficiency, actual file hashing needs way more energy).
The probability of a collision is so astronomically small, that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within range of probabilities that actually happen).
This can be done much faster and safer.
You can group all files into buckets, and as soon as a bucket is empty, discard it. If in the end there are still files in the same bucket, they are duplicates.
Initially all files are in the same bucket.
You now iterate over differentiators which given two files tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run the differentiator on all files in a bucket to split the bucket into finer equivalence classes.
For example (a rough code sketch follows this list):
* Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.
* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.
* Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.
* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.
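To make the bucket-refinement idea above concrete, here's a minimal Swift sketch (not Hyperspace's or dedup's actual code; the 4 KiB prefix size and the use of SHA-256 are my own assumptions):

```swift
import Foundation
import CryptoKit

// Each differentiator maps a file to a key; files whose keys differ cannot be
// duplicates. Differentiators get progressively more expensive and precise.
typealias Differentiator = (URL) throws -> AnyHashable

let differentiators: [Differentiator] = [
    // 1. File size from metadata only -- no file contents are read.
    { url in AnyHashable(try url.resourceValues(forKeys: [.fileSizeKey]).fileSize ?? 0) },
    // 2. Hash of the first 4 KiB -- one small read per file, O(1) in file size.
    { url in
        let handle = try FileHandle(forReadingFrom: url)
        defer { try? handle.close() }
        let head = try handle.read(upToCount: 4096) ?? Data()
        return AnyHashable(Data(SHA256.hash(data: head)))
    },
    // 3. Hash of the whole file -- O(N), but only reached for likely duplicates.
    { url in AnyHashable(Data(SHA256.hash(data: try Data(contentsOf: url)))) },
]

// Refine one big bucket into groups of probable duplicates, dropping any
// bucket that shrinks to a single member along the way.
func duplicateGroups(of files: [URL]) -> [[URL]] {
    var buckets: [[URL]] = [files]
    for differentiate in differentiators {
        var refined: [[URL]] = []
        for bucket in buckets where bucket.count > 1 {
            var split: [AnyHashable: [URL]] = [:]
            for url in bucket {
                // Unreadable files simply drop out of consideration.
                guard let key = try? differentiate(url) else { continue }
                split[key, default: []].append(url)
            }
            refined.append(contentsOf: split.values.filter { $0.count > 1 })
        }
        buckets = refined
    }
    return buckets
}
```

A real tool would add a final bit-for-bit comparison (differentiator 4) and would stream large files instead of reading them into memory in one go.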
Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you’ve read at most 2 blocks per file. I haven’t figured out a good differentiator 4 prior to hashing, but collisions at that point are already so rare that it’s not worth it, in my experience.
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:
- compute SHA256 hashes for each file on the source side
- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)
- mirror the source directory structure to the destination
- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
Then I got too scared to actually use it :)
Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.
In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.
Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.
Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real world scenarios it performs just as well as non-crypto hashes like xx.
I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application that was using SHA hashes in the background. I do not recall all the details; it is improbable, but possible.
The probability is truly, obscenely, low. If you read about a collision then you surely weren't reading about SHA256.
https://crypto.stackexchange.com/questions/47809/why-havent-...
LOL nope, I seriously doubt that was the result of a SHA256 collision.
Or just use whatever algorithm rsync uses.
This reminds me of https://en.wikipedia.org/wiki/Venti_(software), a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.
I'd hash the first 1024 bytes of all files, and start from there if there are any collisions. That way you don't need to hash the whole (large) files, but only those with the same hashes.
I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.
Also, use the length of the file for a fast check.
At that point, why hash them instead of just using the first 1024 bytes as-is?
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".
If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, then that's roughly 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.
Far more efficient - and less work - if you just use a SHA256 of the file's contents. That gets you a much smaller 32 byte key, and you don't need to bother with 2-stage comparisons.
I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.
I don't think it would be far more efficient to hash the entire contents, though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you do a comparison of the whole hashed contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
If you're going to keep partial hashes in memory, may as well align it on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than it takes to fetch it from SSD (much less disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.
... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.
If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.
In memory, per-file, you can keep something like
etc., and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.

A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip quine files, it gets weird. Probably not worth it to go down this road.
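A guess at what such a per-file record might look like, as a minimal Swift sketch (the field names and the idea of an 8 KiB first block are my own assumptions, not the commenter's elided snippet):

```swift
import Foundation
import CryptoKit

// Cheap facts are always present; the more expensive partial hashes are
// computed lazily, and only when an earlier field collides with another file's.
struct FileKey {
    let path: URL
    let length: Int                        // from metadata, always known
    var firstBlockHash: SHA256.Digest?     // hash of the first 8 KiB, filled on first collision
    var fullHash: SHA256.Digest?           // hash of the entire file, filled only if still colliding
}
```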
Probably because you need to keep a lot of those in memory.
I suspect that a computer with so many files that this would be useful probably has a lot of RAM in it, at least in the common case.
But you need to constantly process them too, not just store them.
And why the first 1024? You could pick from predefined points.
Depending on the medium, the penalty of reading single bytes in sparse locations could be comparable with reading the whole file. Maybe not a big win.
Deleted comment based on a misunderstanding.
> This tool simply identifies files that point at literally the same data on disk because they were duplicated in a copy-on-write setting.
You misunderstood the article, as it's basically doing the opposite of what you said.
This tool finds duplicate data that is specifically not duplicated via copy-on-write, and then turns it into a copy-on-write copy.
Fair. Deleted.
I have file A that's in two places and I run this.
I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?
It's called copy-on-write because when you modify A_0, the filesystem makes a copy for your changes while leaving A_1 untouched.
https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...
Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
But if you have the same 500MB of node_modules in each of your dozen projects, this might actually durably save some space.
> Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
I'm not sure if this is what you intended, but just to be sure: writing changes to a cloned file doesn't immediately duplicate the entire file again in order to write those changes — they're actually written out-of-line, and the identical blocks are only stored once. From [the docs](^1) posted in a sibling comment:
> Modifications to the data are written elsewhere, and both files continue to share the unmodified blocks. You can use this behavior, for example, to reduce storage space required for document revisions and copies. The figure below shows a file named “My file” and its copy “My file copy” that have two blocks in common and one block that varies between them. On file systems like HFS Plus, they’d each need three on-disk blocks, but on an Apple File System volume, the two common blocks are shared.
[^1]: https://developer.apple.com/documentation/foundation/file_sy...
The key is “unmodified,” and how APFS knows or doesn’t know whether blocks are modified. How many apps write on block boundaries, or mutate only the on-disk data that has changed, versus overwriting or atomically replacing the file? For most applications there is no benefit and a significant risk of corruption.
So APFS supports it, but there is no way to control what an app is going to do, and after it’s done it, no way to know what APFS has done.
For apps which write a new file and replace atomically, the CoW mechanism doesn't come into play at all. The new file is a new file.
I don't understand what makes you think there's a significant risk of corruption. Are you talking about the risk of something modifying a file while the dedupe is happening? Or do you think there's risk associated with just having deduplicated files on disk?
Thanks for the clarification!
Thanks for the clarification. I expected it worked like that but couldn't find it spelled out after a brief perusal of the docs.
What happens when the original file is deleted? Often this is handled by block reference counters, which are simply decremented. How does APFS handle this? Is there a master/copy concept, or just block references?
He's using the "copy on write" feature of the file system. So it should leave A_1 untouched, creating a new copy for A_0's modifications. More info: https://developer.apple.com/documentation/foundation/file_sy...
Oh wow, what a funny coincidence. I hadn't visited the site in a couple of years but someone linked me "Front and Center" yesterday, so I saw the icon for this app and had no clue it had only appeared there maybe hours earlier.
The idea is not new, of course, and I've written one of these (for Linux, with hardlinks) years ago but in the end just deleted all the duplicate files in my mp3 collection and didn't touch the rest of the files on the disk, because not a lot of size was reclaimed.
I wonder for whom this really saves a lot of space. (I saw someone mentioning node_modules, had to chuckle there).
But today I learned about this APFS feature, nice.
I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings on a 8.1GB folder.
I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
I tried to scan System and Library but it refused to do so because of permission issues.
I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.
Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.
pnpm tries to be a drop-in replacement for npm, and dedupes automatically.
More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.
npm's --install-strategy=linked flag is supposed to do this too, but it has been broken in several ways for years.
> pnpm tries to be a drop-in replacement for npm
True
> and dedupes automatically
Also true.
But the way you put them after each other makes it sound like npm does de-duplication, and since pnpm tries to be a drop-in replacement for npm, so does pnpm.
So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.
> I tried to scan System and Library but it refused to do so because of permission issues.
macOS has a sealed volume which is why you're seeing permission errors.
https://support.apple.com/guide/security/signed-system-volum...
For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune left over Unreal Engine files and docker caches when they put them not in my home folder. The tool asks for a ton of permissions when you run it in order to do the scan though, which is a bit annoying.
It’s not obvious but the system folder is on a separate, secure volume; the Finder does some trickery to make the system volume and the data volume appear as one.
In general, you don’t want to mess with that.
> it only found 1GB of savings on a 8.1GB folder.
You "only" found that 12% of the space you are using is wasted? Am I reading this right?
I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." It is a neat tool still in concept.
When I ran it on my home folder with 165GB of data it only found 1.3GB of savings. This isn't that significant to me and it isn't really worth paying for.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
Your original comment did not mention that your home folder was 165 GB, which is extremely relevant here
The relevant number (missing from above) is the total amount of space on that storage device. If it saves 1GB on a 8TB drive, it's not a big win.
It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.
He picked node_modules because it's highly likely to encounter redundant files there.
If you read the rest of the comment he only saved another 30% running his entire user home directory through it.
So this is not a linear trend based on space used.
He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
When I run it on my home folder (Roughly 500GB of data) I find 124 MB of duplicated files.
At this stage I'd like it to tell me what those files are - The dupes are probably dumb ones that I can simply go delete by hand, but I can understand why he'd want people to pay up first, as by simply telling me what the dupes are he's proved the app's value :-)
> He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
You misunderstood my comment. I ran it on my home folder which contains 165GB of data and it found 1.3GB is savings. That isn't significant for me to care about because I currently have 225GB free of my 512GB drive.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
Everyone misunderstood your comment for a reason.
You wrote: but it only found 1GB of savings on a 8.1GB folder.
It’s quite a saving and that’s what everyone understood from your comment.
I think this is somewhat funny.
His comment is pretty understandable if you've done frontend work in javascript.
Node_modules is so ripe for duplicate content that some tools explicitly call out that they're disk efficient (It's literally in the tagline for PNPM "Fast, disk space efficient package manager": https://github.com/pnpm/pnpm)
So he got ok results (~13% savings) on possibly the best target content available in a user's home directory.
Then he got results so bad it's utterly not worth doing on the rest (0.10% - not 10%, literally 1/10 of a single percent).
---
Deduplication isn't super simple, isn't always obviously better, and can require other system resources in unexpected ways (ex - lots of CPU and RAM). It's a cool tech to fiddle with on a NAS, and I'm generally a fan of modern CoW filesystems (incl APFS).
But I want to be really clear - this is people picking spare change out of the couch style savings. Penny wise, pound foolish. The only people who are likely to actually save anything buying this app probably already know it, and have a large set of real options available. Everyone else is falling into the "download more ram" trap.
Another 30% more than the 1GB saved in node modules, for 1.3GB total. Not 30% of total disk space.
For reference, from the comment they’re talking about:
> I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
If it saved 8.1GB, by your measure it'd also not be a big win?
This is basically only a win on macOS, and only because Apple charges through the nose for disk space.
Ex - On my non-apple machines, 8GB is trivial. I load them up with the astoundingly cheap NVMe drives in the multiple terabyte range (2TB for ~$100, 4TB for ~$250) and I have a cheap NAS.
So that "big win" is roughly 40 cents of hardware costs on the direct laptop hardware. Hardly worth the time and effort involved, even if the risk is zero (and I don't trust it to be zero).
If it's just "storage" and I don't need it fast (the perfect case for this type of optimization) I throw it on my NAS where it's cheaper still... Ex - it's not 40 cents saved, it's ~10.
---
At least for me, 8GB is no longer much of a win. It's a rounding error on the last LLM model I downloaded.
And I'd suggest that basically anyone who has the ability to not buy extortionately priced drives soldered onto a mainboard is not really winning much here either.
I picked up a quarter off the ground on my walk last night. That's a bigger win.
If Apple charges extortionately through the nose for storage hardware, that just makes this tool more valuable.
> This is basically only a win on macOS, and only because Apple charges through the nose for disk space
You do realize that this software is only available on macOS, and only works because of Apple's APFS filesystem? You're essentially complaining that medicine is only a win for people who are sick.
> and only works because of Apple's APFS filesystem
There are lots of other file systems that support this kind of deduplication...
Like ZFS that the author of the software explicitly mentions in his write up https://www.truenas.com/docs/references/zfsdeduplication/
Or Btrfs ex: https://kb.synology.com/en-id/DSM/help/DSM/StorageManager/vo...
Or hell, even NTFS: https://learn.microsoft.com/en-us/windows-server/storage/dat...
This is NOT a novel or new feature in filesystems... Basically any CoW file system will do it, and lots of other filesystems have hacks built on top to support this kinds of feature.
---
My point is that "people are only sick" because the company is pricing storage outrageously. Not that Apple is the only offender in this space - but man are they the most egregious.
Absolutely, 100% backwards. The tool cannot save space from disk space that is not scanned. Your "not a big win" comment assumes that there is no space left to be reclaimed on the rest of the disk. Or that the disk is not empty, or that the rest of the disk can't be reclaimed at an even higher rate.
Didn't have time to try it myself, but there is an option for the minimum file size to consider, clearly visible in the App Store screenshot. I suppose it was introduced to minimize comparison buffers. It's possible that node modules slide under this size and weren't considered.
What's the price? It doesn't seem to be published anywhere.
It's on the Mac App Store so you'll find the pricing there. Looks like $10 for one month (one time use maybe?), $20 for a year, $50 lifetime.
Even though I have both a Mac and an iPhone, but happen to be using my Linux computer right now, the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) is not showing the price, probably because I'm not actively on an Apple device? Seems like poor UX even for us Mac users.
It's buried under a drop-down in the "Information" section, under "In-App Purchases". I agree, it's not the greatest.
It’s a side effect of the terrible store design.
It’s a free app because you don’t have to buy it to run it. It will tell you how much space it can save you for free. So you don’t have to waste $20 to find out it only would’ve been 2kb.
But that means the parts you actually have to buy are in app purchases, which are always hidden on the store pages.
Åh, you're absolutely right, missed that completely. Buried at the bottom of the page :) Thanks for pointing it out.
I see it on my android phone. It's a free app but the subs are an in-app purchase so you need to hunt that section down.
£9.99 a month, £19.99 for one year, £49.99 for life (app store purchase prices visible once you've scanned a directory).
What jumped out to me:
> Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).
How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.
IIRC migrating from HFS+ to APFS can be done without touching any of the data blocks and a parallel set of APFS metadata blocks and superblocks are written to disk. In the test migrations Apple did the entire migration including generating APFS superblocks but held short of committing the change that would permanently replace the HFS+ superblocks with APFS ones. To roll back they “just” needed to clean up all the generated APFS superblocks and metadata blocks.
Yes, that's how it's described in this talk transcript:
https://asciiwwdc.com/2017/sessions/715
Let’s say for simplification we have three metadata regions that report all the entirety of what the file system might be tracking, things like file names, time stamps, where the blocks actually live on disk, and that we also have two regions labeled file data, and if you recall during the conversion process the goal is to only replace the metadata and not touch the file data.
We want that to stay exactly where it is as if nothing had happened to it.
So the first thing that we’re going to do is identify exactly where the metadata is, and as we’re walking through it we’ll start writing it into the free space of the HFS+ volume.
And what this gives us is crash protection and the ability to recover in the event that conversion doesn’t actually succeed.
Now the metadata is identified.
We’ll then start to write it out to disk, and at this point, if we were doing a dry-run conversion, we’d end here.
If we’re completing the process, we will write the new superblock on top of the old one, and now we have an APFS volume.
I think that’s what they did too. And it was a genius way of testing. They did it more than once too I think.
Run the real thing, throw away the results, report all problems back to the mothership so you have a high chance of catching them all even on their multi-hundred million device fleet.
You lack imagination. This is not some crown jewel only achievable by Apple. In the open source world we have tools to convert ext file systems to btrfs and (1) you could revert back; (2) you could mount the original ext file system while using the btrfs file system.
I watched the section from the talk [0] and there are no details given really, other than that it was done as a test of consistency. I've blown so many things up in production that I'm not sure if I could ever pull the trigger on such a large migration.
[0] https://www.youtube.com/watch?v=IcyaadNy9Jk&t=1670s
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.
https://github.com/albertz/system-tools/blob/master/bin/merg...
This does not use hard links or symlinks; this uses a feature of the filesystem that allows the creation of copy-on-write clones. [1]
[1] https://en.wikipedia.org/wiki/Apple_File_System#Clones
So albertzeyer's script can be adapted to use `cp -c` command, to achieve the same effect as Hyperspace.
If you'd like. In the blog post he says he wrote the prototype in an afternoon. Hyperspace does try hard to preserve unique metadata as well as other protections.
uv does this out of the box, I think other tools (poetry, hatch, pdm, etc.) do as well but I have less experience with the details.
On Windows there is "Dev Drive" which I believe does a similar "copy-on-write" -thing.
If it works it's a no-brainer so why isn't it the default?
https://learn.microsoft.com/en-us/windows/dev-drive/#dev-dri...
CoW is a function of ReFS, shipped with Server 2016. "DevDrive" is just a marketing term for a ReFS volume which has file system filters placed in async mode or optionally disabled altogether.
requires ReFS, which still isn't supported on the system drive on Windows, IIRC
Would be nice if git could make use of this on macOS.
Each worktree I usually work on is several gigs of (mostly) identical files.
Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.
(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.
There is also ".git/objects/info/alternates", accessed via "--shared"/"--reference" option of "git clone", that allows only sharing of object storage and not branches etc... but it is has caveats, and I've only used it in some special circumstances.
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?
Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?
> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.
Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.
The author of the software is a file system enthusiast (so much that in the podcast he's a part of they have a dedicated sound effect every time "filesystem" comes up), a long time blogger and macOS reviewer. So you'll have to see it in that context while documenting every bit and the technical details behind it is important to him...even if it's longer than a tag line on a landing page.
In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.
Before that he was known for his exhaustive reviews of OS X on Ars Technica
https://arstechnica.com/author/john-siracusa/
No because it isn't getting rid of the duplicate, it's using a feature of APFS that allows for duplicates to exist separately but share the same internal data.
Is it not the same as a hard link (which I believe are supported on Mac too)?
My understanding is that it is a copy-on-write clone, not a hard link. [1]
> Q: Are clone files the same thing as symbolic links or hard links?
> A: No. Symbolic links ("symlinks") and hard links are ways to make two entries in the file system that share the same data. This might sound like the same thing as the space-saving clones used by Hyperspace, but there’s one important difference. With symlinks and hard links, a change to one of the files affects all the files.
> The space-saving clones made by Hyperspace are different. Changes to one clone file do not affect other files. Cloned files should look and behave exactly the same as they did before they were converted into clones.
[1] https://hypercritical.co/hyperspace/
What kind of changes could you make to one clone that would still qualify it as a clone? If there are changes, it's no longer the same file. Even after reading the How It Works[0] link, I'm not grokking how it works. Is it making some sort of delta/diff that is applied to the original file? That's not possible for every file format, like large media files. I could see that being interesting for text-based files, but that gets complicated for complex files.
[0] https://hypercritical.co/hyperspace/#how-it-works
If I understand correctly, a COW clone references the same contents (just like a hardlink) as long as all the filesystem references are pointing to identical file contents.
Once you open one of the reference handles and modify the contents, the copy-on-write process is invoked by the filesystem, and the underlying data is copied into a new, separate file with your new changes, breaking the link.
Comparing with a hardlink, there is no copy-on-write, so any changes made to the contents when editing the file opened from one reference would also show up if you open the other hardlinks to the same file contents.
ah, that's where the copy-on-write takes place. sometimes, just reading it written by someone else is the knock upside the head I need.
That’s correct.
Almost, but the difference is that if you change one of hardlinked files, you change "all of them". (It's really the same file but with different paths.)
https://hypercritical.co/hyperspace/#how-it-works
APFS apparently allows for creating "link files" which, when changed, start to diverge.
A copy-on-write clone is not the same thing as a hard link.
With a hard link, the content of each of the two 'files' are identical in perpetuity.
With APFS Clones, the contents start off identical, but can be changed independently. If you change a small part of a file, those block(s) will need to be created, but the existing blocks will continue to be shared with the clone.
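A minimal Swift sketch of that difference, assuming the files live on the same APFS volume (FileManager's copyItem produces a clone there; the file names are hypothetical):

```swift
import Foundation

do {
    let original = URL(fileURLWithPath: "original.txt")   // hypothetical paths on one APFS volume
    let clone = URL(fileURLWithPath: "clone.txt")

    try Data("hello".utf8).write(to: original)
    // On APFS this creates a copy-on-write clone: no data blocks are
    // duplicated until one of the two files is modified.
    try FileManager.default.copyItem(at: original, to: clone)

    // Writing to the clone stores the changed blocks separately;
    // the original is untouched, unlike a hard link.
    try Data("goodbye".utf8).write(to: clone)
    print(try String(contentsOf: original, encoding: .utf8))  // still "hello"
} catch {
    print("demo failed: \(error)")
}
```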
It’s not the same because clones can have separate meta data; in addition, if a cloned file changes, it stores a diff of the changes from the original.
It does get rid of the duplicate. The duplicate data is deleted and a hard link is created in its place.
It does not make hard links. It makes copy-on-write clones.
Replacing duplicates with hard links would be extremely dangerous. Software which expects to be able to modify file A without modifying previously-identical file B would break.
No, because it's not actually a hard link -- if you modify one of the files they'll diverge.
Sounds like jdupes with -B
Cursory googling suggests that it's using the same filesystem feature, yeah.
Right, but the concept is the same, "remove duplicates" in order to save storage space. If it's using reflinks, softlinks, APFS clones or whatever is more or less an implementation detail.
I know that internally it isn't actually "removing" anything, and that it uses fancy new technology from Apple. But in order to explain the project to strangers, I think my tagline gets the point across pretty well.
> Right, but the concept is the same, "remove duplicates" in order to save storage space.
The duplicates aren't removed, though. Nothing changes from the POV of users or software that use those files, and you can continue to make changes to them independently.
De-duplication does not mean the duplicates completely disappear. If I download a deduplication utility I expect it to create some sort of soft/hard link. I definitely don’t want it to completely remove random files on the filesystem, that’s just going to wreak havoc.
But it can still wreak havoc if you use hardlinks or softlinks, because maybe there was a good reason for having a duplicate file! Imagine you have a photo “foo.jpg.” You make a copy of it “foo2.jpg” You’re planning on editing that file, but right now, it’s a duplicate. At this point you run your “deduper” that turns the second file into a hardlink. Then a few days later you go and edit the file, but wait, the original “backup” file is now modified too! You lost your original.
That’s why Copy-on-write clones are completely different than hardlinks.
I've been using `fclones` [1] to do this, with its `dedupe` subcommand, which uses reflink/clonefile.
https://github.com/pkolaczk/fclones
Judging by this sub-thread, the process really is harder to explain than it appears on the surface. The basic idea is simple but the implementation requires deeper knowledge.
But why would you discuss the implementation with end users who probably wouldn't even understand what "implementation" means? The discussion you see in this subthread is not one that would appear on less technical forums, and I wouldn't draw any broader conclusions based on HN conversations in general.
Because the implementation leaks to the user experience. The user at least needs to know whether after running the utility, the duplicate files will be gone, or whether changing one of the files will change the other.
Symbolic links, hard links, ref links are all part of the file system interface, not the implementation.
What are examples of files that make up the "dozens of gigabytes" of duplicated data?
There are some CUDA files, multiple GB each, that every local AI app installs.
Also models that various AI libraries and plugins love to autodownload into custom locations. Python folks definitely need to learn caching, symlinks, asking a user where to store data, or at least logging where they actually do it.
.terraform, rust target directory, node_modules.
iMovie used to copy video files etc. into its "library".
audio files; renders, etc.
My personal head canon is that Steve Jobs personally cancelled ZFS in OSX because Jonathan Schwartz prematurely announced it.
I assumed it was patent encumbrance that could not be reconciled by either party.
This is cool!
Wait a minute, what happens to copies on different physical drives. Are they cloned too?
This operates within one drive. Possibly within one magical APFS “partition” (or whatever APFS calls partitions), I can’t remember
You must mean "APFS volume".
Wasn't able to use it on a few directories I tried as they were inside iCloud Drive.
Interesting idea, and I like the idea of people getting paid for making useful things.
Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.
> I like the idea of people getting paid for making useful things
> It would be nice if it was open source
> I get a data security itch having a random piece of software from the internet scan every file on an HD
With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.
I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting default fairer or safer.
--
The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.
You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.
Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.
Hopefully it doesn’t have a bug similar to the one jdupes did
https://web.archive.org/web/20210506130542/https://github.co...
From the FAQ
> Q: Does Hyperspace preserve file metadata during reclamation?
> A: When Hyperspace replaces a file with a space-saving clone, it attempts to preserve all metadata associated with that file. This includes the creation date, modification date, permissions, ownership, Finder labels, Finder comments, whether or not the file name extension is visible, and even resource forks. If the attempt to preserve any of these piece of metadata fails, then the file is not replaced.
On a related note: are there any utilities that can measure disk usage of a folder taking (APFS) cloned files into account?
Is this the dedup function provided by other FS?
I think the term to search for is reflink. Btrfs is one example: https://btrfs.readthedocs.io/en/latest/Reflink.html
Like with Hyperspace, you would need to use a tool that can identify which files are duplicates, and then convert them into reflinks.
I thought reflink is provided by the underlying FS, and Hyperspace is a dedup tool that finds the duplicates.
Yes. Hyperspace is finding the identical files and then replacing all but one copy with a reflink copy using the filesystem's reflink functionality.
When you asked about the filesystem, I assumed you were asking about which filesystem feature was being used, since hyperspace itself is not provided by the filesystem.
Someone else mentioned[0] fclones, which can do this task of finding and replacing duplicates with reflinks on more than just macOS, if you were looking for a userspace tool.
[0]: https://news.ycombinator.com/item?id=43173713
Hyperspace uses built in APFS features, it just applies them to existing files.
You only get CoW on APFS if you copy a file with certain APIs or tools.
If a program copies files manually, if you copied a duplicate onto your disk from some other source, or if your files already existed on the file system when you converted to APFS because you’ve been carrying them around for a long time, then you’d have duplicates.
APFS doesn’t look for duplicates at any point. It just keeps track of those that it knows are duplicates because of copy operations.
You can do the same with `cp -c` on macOS, or `cp --reflink=always` on Linux, if your filesystem supports it.
Yes, Linux has a system call to do this for any filesystem with reflink support (and it is safe and atomic). You need a "driver" program to identify duplicates, but there are a handful out there. I've used https://github.com/markfasheh/duperemove and was very pleased with how it worked.
btrfs, xfs, FIDEDUPERANGE for the sake of people coming from search engines.
Was pleasantly surprised to see that this is John Siracusa - the original GOAT when it comes to macOS release articles on Ars.
What would an equivalent tool be on linux? I guess it depends on the filesystem?
I've used bees successfully. https://github.com/Zygo/bees
OmniDiskSweeper (I know, not exactly the same thing, but still...)
Does it preserve all metadata, extended attributes, and alternate streams/named forks?
He spoke to this on No Longer Very Good, episode 626 of The Accidental Tech Podcast. Time stamp ~1:32:30
It tries, but there are some things it can't perfectly preserve, like the last access time. In cases where it can't preserve certain types of extended attributes or ownership permissions, it will not perform the operation.
https://podcasts.apple.com/podcast/id617416468?i=10006919599...
Well, the FAQ also states that people should report any attributes that aren't preserved, so it really sounds like it's a predefined list rather than enumerating through everything.
No word about alternate data streams. I'll pass for now.. Although it's nice to see how much duplicates you have
The FAQ talks about this a little:
Q: Does Hyperspace preserve file metadata during reclamation?
A: When Hyperspace replaces a file with a space-saving clone, it attempts to preserve all metadata associated with that file. This includes the creation date, modification date, permissions, ownership, Finder labels, Finder comments, whether or not the file name extension is visible, and even resource forks. If the attempt to preserve any of these piece of metadata fails, then the file is not replaced.
If you find some piece of file metadata that is not preserved, please let us know.
Q: How does Hyperspace handle resource forks?
A: Hyperspace considers the contents of a file’s resource fork to be part of the file’s data. Two files are considered identical only if their data and resource forks are identical to each other.
When a file is replaced by a space-saving clone during reclamation, its resource fork is preserved.
What are the potential risks or problems of such conversion of duplicates into APFS clones?
The linked docs cover this in detail.
Wish John went with the name superdeduper :/ :P
I was hoping for Storacusa
Any way it can be built for 14? It requires macOS 15.
TL;DR: He wrote a macOS dedup app which finds files with the same contents and tells the filesystem that their contents are identical, so it can save space (using copy-on-write features).
He points out it's dangerous but could be worth it because of the space savings.
I wonder if the implementation is using a hash only or does an additional step to actually compare the contents to avoid hash collision issues.
It's not open source, so we'll never know. He chose a pay model instead.
Also, some files might not be identical but have identical blocks. Something that could be explored too. Other filesystems have that either in their tooling or do it online or both.
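We can't know what Hyperspace does internally, but the collision concern upthread is straightforward for a tool like this to address: after sizes and hashes match, do one final byte-for-byte comparison before cloning. A hedged sketch of that last step:

    // identical.c — confirm two candidate files match byte-for-byte.
    // A hash match alone leaves a (vanishingly small) chance of collision;
    // this check removes it before any clone/replace operation.
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool files_identical(const char *a, const char *b) {
        FILE *fa = fopen(a, "rb");
        FILE *fb = fopen(b, "rb");
        bool same = (fa != NULL && fb != NULL);
        while (same) {
            unsigned char ba[1 << 16], bb[1 << 16];
            size_t na = fread(ba, 1, sizeof ba, fa);
            size_t nb = fread(bb, 1, sizeof bb, fb);
            if (na != nb || memcmp(ba, bb, na) != 0)
                same = false;        // lengths or contents diverge
            else if (na == 0)
                break;               // both files hit EOF together
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }

    int main(int argc, char *argv[]) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file1> <file2>\n", argv[0]);
            return 2;
        }
        return files_identical(argv[1], argv[2]) ? 0 : 1;
    }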
What's the difference with jdupes?
Nothing much, jdupes already does the best thing.
https://manpages.debian.org/jdupes/jdupes.1#B
;( requires macos 15
Am also still on macOS 14. Keep postponing the upgrade. I've got stuff to do.
In my experience, Macs use up a ridiculous amount of "System" storage for no apparent reason, and users can't delete it. I've grown tired of family members asking me to help them free up storage that I can't even find. That's the major issue from what I've seen; unless this app prevents Apple from deliberately eating up 50%+ of a machine's storage, it doesn't do much for the people I know.
These are often Time Machine snapshots. Nuking those can free up quite a bit of space.
You don't need Terminal even, you can view and delete them in Disk Utility.
Even without Time Machine, loads of storage is spent on “system”. Especially now with Apple Intelligence (even when turned off).
Apple "Intelligence" gets its own category in 15.3.1.
There's no magic around it, macOS just doesn't do a good job explaining it using the built in tools. Just use Daisy Disk or something. It's all there and can be examined.
John is a legend.
In earlier episodes of ATP when they were musing on possible names, one listener suggested the frankly amazing "Dupe Nukem". I get that this is a potential IP problem, which is why John didn't use it, but surely Duke Nukem is not a zealously-defended brand in 2025. I think interest in that particular name has been stone dead for a while now.
It's a genius name, but Gearbox owns Duke Nukem. They're not exactly dormant. Duke Nukem as a franchise made over a billion in revenue. In 2023, Zen released a licensed Duke Nukem pinball table, so there is at least some ongoing interest in the franchise.
I probably wouldn't have risked it, either.
Reminds me of Avira's Luke Filewalker - I wonder if they needed any special agreement with Lucasfilm/Disney. I couldn't find any info on it, and their website doesn't mention Star Wars at all.
Downloaded. Ran it. Tells me "900" files can be cleaned. No summary, no list. But I was at least asked to buy the app. Why would I buy the app if I have no idea if it'll help?
If you don’t mind CLI tools, you can try dedup - https://github.com/ttkb-oss/dedup . Use the `--dry-run` option to get a list of files that would be merged without modifying anything, and how much space would be saved.
On good file systems (see <https://news.ycombinator.com/item?id=43174685>) also identical chunks of files can be merged, resulting in more savings than with just whole files. As of now, dedup cannot help with this, but duperemove or jdupes do.
I'll check it out! Thanks!
This reminds me -
Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs. They all inevitably found at least a few KB to be reclaimed through their magic, even if the same optimizer was run back to back with itself and allowed to "optimize" things. That is, on each run they would always find extra memory to be freed. They ultimately did nothing but claim they did the work. Must've sold pretty well nonetheless.
> Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs
You may find this interesting: Investigations into SoftRAM 95 by Raymond Chen [1] and Mark Russinovich [2] respectively.
"They implemented only one compression algorithm.
It was memcpy."
[1] https://devblogs.microsoft.com/oldnewthing/20211111-00/?p=10...
[2] https://www.drdobbs.com/parallel/inside-softram-95/184409937
> Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs. They ultimately did nothing but claim they did the work. Must've sold pretty well nonetheless.
QEMM worked by remapping stuff into extended memory - in a time that most software wasn't interested in using it. It worked as advertised.
Quarterdeck made good stuff all around. Desq and DesqView/X were amazing multitaskers. Way snappier than Windows and ran on little to nothing.
I remember using MemTurbo in the Windows 2000 era, though now I know it was mostly smoke and mirrors. My biggest gripe these days is too many "hardware accelerated" apps eating away VRAM, which is less of a problem with Windows (better over-commit) but which causes me a few crashes a month on KDE.
From the FAQ:
> If some eligible files were found, the amount of disk space that can be reclaimed is shown next to the “Potential Savings” label. To proceed any further, you will have to make a purchase. Once the app’s full functionality is unlocked, a “Review Files” button will become available after a successful scan. This will open the Review Window.
I half remember this being discussed on ATP; the logic being that if you have the list of files, you will just go and de-dupe them yourself.
> the logic being that if you have the list of files, you will just go and de-dupe them yourself.
If you can do that, you can check for duplicates yourself anyway. It's not like there aren't already dozens of great apps that dedupe.
> there aren't already dozens of great apps that dedupe.
Most of those delete rather than use the features of APFS.
It didn't tell you how much disk space? It's supposed to. It only told you the number of files?
Only told me the number of files and it didn't even provide a list of files (their paths, etc.)
This seems like a bug. It is supposed to tell you the amount of disk space.
(Obviously it won't show the list until payment. That part is expected behavior.)
List is available after paying. Which makes a lot of sense.
Many comments here offering similar solutions based on hardlinks or symlinks.
This uses a specific feature of APFS that allows the creation of copy-on-write clones. [1] If a clone is written to, then it is copied on demand and the original file is unmodified. This is distinct from the behavior of hardlinks or symlinks.
[1] https://en.wikipedia.org/wiki/Apple_File_System#Clones
Also called reflinks on Linux, which are supported by bcachefs, Btrfs, CIFS, NFS 4.2, OCFS2, overlayfs, XFS, and OpenZFS.
Sources: https://unix.stackexchange.com/questions/631237/in-linux-whi... https://forums.veeam.com/veeam-backup-replication-f2/openzfs...
https://github.com/pkolaczk/fclones can do the same thing, and it's perfectly free and open source. terminal based though
Hyperspace said I can save 10GB.
But then I ran this command and saved over 20GB:
I've used fclones before in the default mode (create hard links) but this is the first time I've run it at the top level of my home folder, in dedupe mode (i.e. using APFS clones). Fingers crossed it didn't wreck anything.[I was wrong, see below.—cw] It doesn't do the same thing. An APFS clone/copy-on-write clone is not the same as a hard or soft link. https://eclecticlight.co/2019/01/05/aliases-hard-links-symli...
Your source points out that:
> You can also create [APFS (copy on write) clones] in Terminal using the command `cp -c oldfilename newfilename` where the c option requires cloning rather than a regular copy.
`fclones dedupe` uses the same command[1]:
[1] https://github.com/pkolaczk/fclones/blob/555cde08fde4e700b25...
I stand corrected, thank you!
Nice. Also, compression at the filesystem level can save a lot of space, and with current CPU speeds it is completely transparent. It is a feature from HFS+ that still works in APFS but is no longer officially supported. What is wrong with you, Apple?
This tool to enable compression is free and open source
https://github.com/RJVB/afsctool
Also, a note about APFS vs HFS+: if you use an HDD, e.g. as backup media for Time Machine, HFS+ is a must-have over APFS, since APFS is optimised only for SSDs (random access).
https://bombich.com/blog/2019/09/12/analysis-apfs-enumeratio...
https://larryjordan.com/blog/apfs-is-not-yet-ready-for-tradi...
The not-so-smart Time Machine setup utility forcibly re-creates APFS on HDD media, so you have to manually create an HFS+ volume (e.g. with Disk Utility) and then use a terminal command to add that volume as the TM destination:
`sudo tmutil setdestination /Volumes/TM07T`
Nice, but I'm not getting a subscription for a filesystem utility. Had it been a one-time $5 license, I would have bought it. At the current price, it's literally cheaper to put files in an S3 bucket or outright buy an SSD.
They had long discussions about the pricing on the podcast the author is a part of (atp.fm). It went through a few iterations: a one-time purchase, a fee for each time you free up space, and a subscription. There will always be people unhappy with any choice.
Edit: Apparently both is possible in the end: https://hypercritical.co/hyperspace/#purchase
Who would be unhappy with $5 owned forever? Other than the author of course for making less money.
People who want the app to stick around and continue to be developed.
I worry about that with Procreate. It feels like it's priced too low to be sustainable.
I worry about procreate too. It’s way too cheap for what it is and it’s in the small set of apps that can justify a subscription.
This app though? No chance. Parent comment says “if you want to support the app’s development” but not all apps need to be “developed” continuously, least of all system utilities.
Is the author's desire not important here?
> Two kinds of purchases are possible: one-time purchases and subscriptions.
https://hypercritical.co/hyperspace/#purchase
Claude 3.7 just rewrote the whole thing (just based on reading the webpage description) as a commandline app for me, so there's that.
And because it has no Internet access yet (and because I prompted it to use a workaround like this in that circumstance), the first thing it asked me to do (after hallucinating the functionality first, and then catching itself) was run `curl https://hypercritical.co/hyperspace/ | sed 's/<[^>]*>//g' | grep -v "^$" | clip`
("clip" is a bash function I wrote to pipe things onto the clipboard or spit them back out in a cross-platform linux/mac way)
Would you trust Claude with your hard drive?
Trust, but verify. We should always read and understand what gets spit out anyways.
When doing something with any risk potential I first ask the model for potential risks with the output, and then I manually read the code.
I also "recreated" this tool with Sonnet 3.7. The initial bash script worked (but was slow), and after a few iterations we landed on an fclones one-liner. I hadn't heard of fclones before, but works great! Saved a bunch of disk space today.
Ooh, could you share the source code? That seems like a perfect example for my "relying on AI code generation will subtly destroy your data" presentation.
Hah, I have not actually had a chance to run it as a test yet
Best upload the source code somewhere, before you do. Otherwise, we'll never know why you lost all your files.
I think it's priced reasonably. A one-time $5 license wouldn't be sustainable.
Since it's the kind of thing you will likely only need every couple of years, $10 each time feels fair.
If putting all your data online or into an SSD makes more sense, then this app isn't for you and that's okay too.
I can't even find the price anywhere. Do you have to install the software to see it?
The Mac App Store page has the pricing at the bottom in the In-App Purchases section..
TL;DR - $49 for a lifetime subscription, or $19/year or $9/month.
It could definitely be easier to find.
The price does seem very high. It’s probably a niche product and I’d imagine developers are the ones who would see the biggest savings. Hopefully it works out for them
"I don't value software but that's not a respectable opinion so I'll launder that opinion via subscriptions"
Well I do value software, I'm paid $86/h to write some! I just find that for $20/year or $50 one time, you can get way more than 12G of hard drive space. I also don't think that this piece of software requires so much maintenance that it wouldn't be worth making at a lower price. I'm not saying that it's bad software, it's really great, just too expensive... Personally, my gut feeling is that the dev would have had more sales with a one time $5, and made more money overall.
The first option presented is a one month non-renewing subscription for $10. I think the intention is periodically (once a year, once every few years?) you run it to reclaim space. If it was reclaiming more than a few gigs I would do it.
The author talked about being very conservative at launch: skipping directories like the Photos library or other apps that actively manage data, and not looking across user directories. He stumbled into writing this app because he noticed the duplicated data of shared Photos libraries between different users on the same machine. That use case isn't even supported in this version. He said he plans future development to safely dedup more data, which makes a one-time purchase less sustainable for him.
There are several such tools for Linux, and they are free, so maybe just change operating systems.
I'm pretty sure some of them also work on macOS. rmlint[1], for example, can output a script that reflinks duplicates (or runs any script for both files):
I'm not sure if reflink works out of the box, but you can write your own alternative script that just links both files[1]: https://github.com/sahib/rmlint
It does not support APFS: https://github.com/sahib/rmlint/issues/421
I don't think either of them supports APFS deduplication though?
> Hyperspace can’t be installed on “Macintosh HD” because macOS version 15 or later is required.
macOS 15 was released in September 2024, this feels far too soon to deprecate older versions.
Can it really be seen as deprecating an old version when it’s a brand new app?
+1. He's not taking anything away because you never had it.
I'm a bit confused as the Mac App Store says it's over 4 years old.
The 4+ Age rating is like, who can use the app. Not for 3 year olds, apparently.
I feel like that's true for most of the relatively low-level disk and partition management tooling. As unpopular an opinion as it may lately be around here, I'm enough of a pedagogical traditionalist to remain convinced that introductory logical volume management is best left at least till kindergarten.
Despite knowing this is the correct interpretation, I still consistently make the same incorrect interpretation as the parent comment. It would be nice if they made this more intuitive. Glad I’m not the only one that’s made that mistake.
The way they specify this has always confused me, because I actually care more about how old the app is than what age range it's aimed for
He wanted to write it in Swift 6. Does it support older OS versions?
Swift 6 is not the problem. It's backward compatible.
The problem is SwiftUI. It's very new, still barely usable on the Mac, but they are adding lots of new features every macOS release.
If you want to support older versions of macOS you can't use the nice stuff they just released. Eg. pointerStyle() is a brand new macOS 15 API that is very useful.
I can’t remember for sure but there may also have been a recent file system API he said he needed. Or a bug that he had to wait for a fix on.
It's been a while since I last looked at SwiftUI on the Mac. Is it really still that bad?
It's not bad, just limited. I think it's getting usable, but just barely so.
They are working on it, and making it better every year. I've started using it for small projects and it's pretty neat how fast you can work with it -- but not everything can be done yet.
Since they are still adding pretty basic stuff every year, it really hurts if you target older versions. AppKit is so mature that for most people it doesn't matter if you can't use new features introduced in the last 3 years. For SwiftUI it still makes a big difference.
I wonder why they haven't tried to back port SwiftUI improvements/versions to the older OSs. Seems like this should have been possible.
Came here to post the same thing. Would love to try the application, but I guess not if the developer is deliberately excluding my device (which cannot run the bleeding edge OS).
In fairness, I don't think you can describe it as bleeding edge when we're 5 months into the annual 12-month upgrade cycle. It's recent, but not exactly an early-adopter version at this point.
The developer deliberately chose to write it in Swift 6. Apple is the one who deliberately excluded Swift 6 from your device.
Yea, too bad :( Everyone involved with macOS and iOS development seems to be (intentionally or unintentionally) keeping us on the hardware treadmill.
Expensive. Keeping us on the expensive hardware treadmill. My guess is that it cannot be listed in the App Store unless it's only for Macs released in the last 11 months.
This isn't true; you can set the target multiple versions back. The main problem right now is a huge amount of churn in the language, APIs, and multiple UI frameworks, which means everything is a moving target. SwiftUI has only really become usable in the last couple of versions.
Every time Xcode updates, it seems a few more older macOS and iOS versions are removed from the list of "Minimum Deployment Versions". My current Xcode lets me target macOS back to 10.13 (High Sierra, 7 years old) and iOS 12.0 (6 years old). This seems... rather limiting. Like, I'd be leaving a lot of users out in the cold if I were actually releasing apps anymore. And this is Xcode 15.2, on a dev host Mac forever stuck on macOS 13.7. I'm sure newer Mac/Xcode combinations are even more limiting.
I used to be a hardcore Apple/Mac guy, but I'm kind of giving up on the ecosystem. Even the dev tools are keeping everyone on the treadmill.
You can keep using an older version of Xcode if you like. I mean, every other tool chain that I can think of does more or less the same thing. There are plenty of reasons to criticise Apple's developer tooling and relations, but I don't see this as being especially different to other platforms
I don't understand why a simple, closed source de-dup app is at the top of the front page with 160+ comments? What is so interesting about it? I read the blog and the comments here and I still don't get it.
I assume it’s because it’s from John Siracusa, a long-time Mac enthusiast, blogger, and podcaster. If you listen to him on ATP, it’s hard not to like him, and anything he does is bound to get more than the usual upvotes on HN.
The developer is popular and APFS cloning is genuinely technically interesting.
(no, it's not a symlink)
CoW filesystems are older than macOS, so no surprises for me. Maybe people just aren't that aware of them?
CoW - Copy on Write. Most probably on older mainframes. (Actually, newer mainframes.)
"CoW is used as the underlying mechanism in file systems like ZFS, Btrfs, ReFS, and Bcachefs"
Obligatory: https://en.wikipedia.org/wiki/Copy-on-write
As a web dev, it’s been fun listening to Accidental Tech Podcast where Siracusa has been talking (or ranting) about the ins and outs of developing modern mac apps in Swift and SwiftUI.
The part where he said making a large table in HTML and rendering it with a web view was orders of magnitude faster than using the SwiftUI native platform controls made me bash my head against my desk a couple times. What are we doing here, Apple.
SwiftUI is a joke when it comes to performance. Even Marco's Overcast stutters when displaying a table of a dozen rows (of equal height).
That being said, it's not quite an apples to apples comparison, because SwiftUI or UIKit can work with basically an infinite number of rows, whereas HTML will eventually get to a point where it won't load.
I love the new Overcast's habit of mistaking my scroll gestures for taps when browsing the sections of a podcast.
Shoutout to iced, my favorite GUI toolkit, which isn't even in 1.0 yet but can do that with ease and faster than anything I've ever seen: https://github.com/iced-rs/iced
https://github.com/tarkah/iced_table is a third-party widget for tables, but you can roll out your own or use other alternatives too
It's in Rust, not Swift, but I think switching from the latter to the former is easier than when moving away from many other popular languages.
It's easy to write a quick and clean UI toolkit, but when you add all the stuff for localization (like support for RTL languages, which also means swapping where icons go) and accessibility (all the screen reader support), that's where you really get bogged down and start wanting to add all these abstractions that slow things down.
RTL and accessibility are on the roadmap, the latter for this next version IIRC
I'd argue there's a lot more to iced than just being a quick toolkit. the Elm Architecture really shines for GUI apps
I wish there were modern benchmarks against browser engines. A long time ago native apps were much faster at rendering UI than the browser, but that was many performance rewrites ago, so I wonder how browsers perform now.
Hacker News loves to hate Electron apps. In my experience ChatGPT on Mac (which I assume is fully native) is nearly impossible to use because I have a lot of large chats in my history but the website works much better and faster. ChatGPT website packed in Electron would've been much better. In fact, I am using a Chrome "PWA App" for ChatGPT now instead of the native app.
It's possible to make bad apps with anything. The difference is that, as far as I can tell, it's not possible to make good apps with Electron.
> In my experience ChatGPT on Mac (which I assume is fully native)
If we are to believe ChatGPT itself: "The ChatGPT macOS desktop app is built using Electron, which means it is primarily written in JavaScript, HTML, and CSS"
Someone more experienced than me could probably comment on this more, but theoretically, is it possible for Electron production builds to become more efficient by having a much longer build process and stripping out all the unnecessary parts of Chromium?
As a web dev I must say that this segment made me happy and thankful for the browser team that really knows how to optimize.
For those mentioning that there's no price listed, it's not that easy as in the App Store the price varies by country. You can open the App Store link and then look at "In App Purchases" though.
For me on the German store it looks like this:
So it supports both one-time purchases and subscriptions, depending on what you prefer. More about that here: https://hypercritical.co/hyperspace/#purchase
It would be interesting if payments bought a certain amount of saved space, and the rate was based on current storage prices, to keep it competitive with the cost of just expanding storage.
It's interesting how Linux tools are all free while even trivial Mac tools are being sold. Nothing against someone trying to monetize, but the Linux culture sure is nice!
It's not that nice to call someone's work they spent months on "trivial" without knowing anything about the internals and what they ran into.
I don't think they meant it in a disparaging way, except maybe against Apple. More so that filesystems that support deduplication typically include a deduplication tool in their standard suite of FS tools. I too find it odd that Apple does not do this.
I didn't mean that the work was trivial, but the tool seems like something that could be a simple CLI.
Skimming comments here, there are at least 3 open source versions of this, eg https://news.ycombinator.com/item?id=43179010
So it's neither novel nor new.
I think author being “famous” around tech circles and Apple fans’ engineered hatred of open source contributed to the rise of this article.
There might be a difference in robustness. There's a monetary consequence to this developer for getting it wrong.
The linux tools can't get it wrong, the kernel checks that the files submitted for deduplication are actually identical.
CLI tool to find duplicate files unbelievably quickly:
https://github.com/twpayne/find-duplicates
It lacks the deduplication part that makes its competitors useful.
A $20 one-year licence for something that probably has a FOSS equivalent on Linux...
However, considering Apple will never ever ever allow user replaceable storage on a laptop, this might be worth it.
The developer does need to make up for the $100 yearly privilege of publishing the app to the App Store.
I have yet to see a GUI variant of deduplication software for Linux. There are plenty of command line tools, which probably can be ported to macOS, but there's no user friendly tool to just click through as far as I know.
There's value in convenience. I wouldn't pay for a yearly license (that price seems more than fair for a "version lifetime" price to me?) but seeing as this tool will probably need constant maintenance as Apple tweaks and changes APFS over time, combined with the mandatory Apple taxes for publishing software like this, it's not too awful.
$50 for a lifetime license.
Which really means up until the dev gets bored, which can be as short as 18 months.
I wouldn't mind something like this versioned to the OS: $20 for the current OS, and $10 for every significant update.
The Mac App Store (and all of Apple's App Stores) doesn't enable this sort of licensing. It's exactly the sort of thing that drives a lot of developers to independent distribution.
That's why we see so many more subscription-based apps these days: application development is an ongoing process with ongoing costs, so it needs ongoing income. The traditional buy-it-once app pricing doesn't enable that long-term development and support. The App Store supports subscriptions, though, so now we get way more subscription-based apps.
I really think Siracusa came up with a clever pricing scheme here, given his desire to use the App Store for distribution.
Okay I stand corrected.
The cost reflects the fact that people won't use it regularly. The developer is offering lifetime unlocks, lower-cost tiers for shorter timeframes, etc.
A ~20 y.o. account with perhaps hundreds of devices in history across different continents and countries along with family sharing.
Every time I need to purchase something via Apple, it becomes a quest. Enter password, validate card, welcome to endless login loop. Reboot. Click purchase, enter password, confirm OTP on another device, then nothing happens, purchase button is active, clicks ignored. Reboot. Click "Get", program begins downloading, wait 30s, app cannot be downloaded, go to Settings to verify account. Sure. Account is perfectly fine in Settings. Reboot. Click "Get". Finally program installed. Click in-app purchase. Enter password again. Choose Apple Pay. Proceed with purchase. You need to verify your account. Account is fine in Settings. Reboot. Click purchase. Cannot be completed at this time. Wait a couple of hours, try again. Purchase successful.
All. The. Time. For years. On almost all of the devices which I upgrade annually.
Oh, I was agreeing with you. My 2015 iMac died two weeks ago. Pretty sure it's the SSD I installed when I first got it. And while most of my files are in cloud storage, I also had a series of chained external drives running Time Machine. Guess what? I can't use any apple tools to grab any files because different file system type between that machine and my M1 MB Pro (permissions issues).
I'm going to have to clone the drive, then use terminal to chmod the /usr dirs to extract the files I want (mostly personal music production).
I immediately ordered a Mac mini, but since I didn't want a 256GB drive and 16GB RAM, I'm still waiting for it to arrive from China.
Also, the M1 MB Pro was the most expensive* and worst computer I have ever owned. I wish I had just bought an air. No tactile volume controls. As a musician that is the worst.
(*company I worked for a while in school paid $12k for a Mac 2 FX. Lol.)
I would never dismiss such a complaint with a glib "works for me". And yet, your experience is so utterly, completely different from mine that I have to think something's busted in your account somewhere. I've had an account for about as long, with family sharing and all the rest. I never, ever, have anywhere near that level of difficulty. For me it works as documented: I click "Get", it asks for Face ID to confirm it's really me, then a few seconds later I have the app installed and ready to use.
Again, I don't think you're doing anything wrong, and I don't doubt your experience. But I really think something's fundamentally wrong somewhere, because what you're dealing with is not normal. It's not the common experience others are tolerating.
Oh yes, it's pretty clear to me that something is wrong on Apple's side specifically with my account. Obviously people are having close to zero friction with Apple's stuff.
I can't complain though, because I have had this account for a couple of decades and losing it would be painful. Apple did ban my account twice on the grounds that it's a US account while I'm not physically located there. I was able to revert the ban by explaining that I've got a US legal entity (account, banking card, etc.) and begging to continue using it. Not taking chances a third time, so I silently endure.
> Obviously people are having close to zero friction with Apple's stuff.
I don't believe that to be true, I have been having issues with Apple bugs for the last 7-8 years. Totally unnecessary friction due to features I do not want and do not use.
Steve died.
Edit: Let's be real here, Tim Cook is keeping the lights on. He isn't a product guy. They lack leadership and vision at present and are committed to a foolish release cycle based on the calendar year, not quality of product. These wouldn't have come to pass had Steve lived till today. Yes, it's an opinion, but I doubt it's an unpopular one.
So I’ll say it again succinctly, to answer “what happened?”
Steve died.
Steve was the one who delivered MobileMe, arguably the most broken service by Apple. So it's not like he was a saint.
It is not a hard link. A clone is an independent file which is backed by the same storage. So far, mostly the same as a hard link, you'll say. However, if you modify a clone, it will be “uncloned” and modified independently of its clones.
Does APFS have the extent-level deduplication Linux filesystems have or is it only file-level deduplication?
While it's not very user-friendly software (command line, after all), duperemove has proven quite useful for my filesystems.
And I believe that only the modified blocks are “uncloned”, the rest of the data is still shared between the two files
Aside from the other point, macOS does, of course, have hard link functionality.
$ rmlint -c sh:link -L -y s -p -T duplicates
will produce a script which, if run, will hardlink duplicates
That's not what this app is doing though. APFS clones are copy-on-write pointers to the same data, not hardlinks.
If you replace `sh:link` with `sh:clone` instead, it will.
> clone: reflink-capable filesystems only. Try to clone both files with the FIDEDUPERANGE ioctl(3p) (or BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free up duplicate extents while preserving the metadata of both. Needs at least kernel 4.2.
On Linux
There's a cross-platform open-source version of this program: https://github.com/qarmin/czkawka
I don't think czkawka supports deduplication via reflink, so it's not exactly the same thing. fclones, as linked by another user, is more similar: https://news.ycombinator.com/item?id=43173713
That’s not remotely comparable.
The fact that copying doesn't copy seems dangerous. Like, what if I wanted to copy for the purpose of modifying the file while retaining the original? A trivial example: I have a meme template and I want to write text on it while still keeping a blank copy of the template.
There's a place for alias file pointers, but lying to the user and pretending like an alias is a copy is bound to lead to unintended and confusing results
Copy-on-write means that it performs copy only when you make the first change (and only copies part that changes, rest is used from the original file), until then copying is free.
Is it file level or block level copy? The latter, I hope.
Update: whoops, missed it in your comment. Block (changed bytes) level.
It’s not a symbolic link - it copies on modification. No need to worry!
CoW is not aliasing. It will perform the actual copying when you modify the file content.
It‘s Copy On Write. When you modify either one it does get turned into an actual copy
It's copy on write.
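If it helps with the meme-template worry above, here's a tiny macOS sketch (the file names are made up for illustration) showing that writing to a clone never touches the original:

    // cow_demo.c — writing to an APFS clone leaves the original untouched.
    // Build: cc -o cow_demo cow_demo.c (macOS, APFS volume)
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/clonefile.h>
    #include <unistd.h>

    int main(void) {
        // Create the "original" file.
        int fd = open("template.txt", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        write(fd, "blank template\n", 15);
        close(fd);

        // Clone it: no data is copied, both names share the same blocks.
        clonefile("template.txt", "meme.txt", 0);

        // Modify the clone: only the changed blocks get new storage.
        fd = open("meme.txt", O_WRONLY | O_APPEND);
        write(fd, "caption goes here\n", 18);
        close(fd);

        // The original still contains only its own 15 bytes.
        char buf[64] = {0};
        fd = open("template.txt", O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        close(fd);
        printf("original is still %ld bytes: %s", (long)n, buf);
        return 0;
    }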
Requires macOS 15.0 or later. – Oh god, this is the most stupid and irritating thing about macOS "application development".
It is really unfair to call it "software"; it is more like "glued to the most recent version of the OS"-ware. Meanwhile, I can still run an .exe compiled in 2006, and with Wine even on Mac or Linux.
I would also have appreciated a version that's compatible with a non-latest macOS release.
Then again, this app was written with SwiftUI, which only gained some handy features in macOS 12 and later, and is still way behind AppKit.
When I see an app that's not compatible with the second most recent macOS, I assume the dev either didn't know better or they were too lazy to write workarounds / shims for the latest-and-greatest shiny stuff.
However, you can't run an app targeted for Windows 11 on Windows XP. How unfair is that? Curse you, Microsoft.
I have to confess: it miffs me that a utility that would normally fly completely under the radar is likely to make the creator thousands of dollars just because he runs a popular podcast. (Am I jealous? Oh yes. But only because I tried to sell similar apps in the past and could barely get any downloads no matter how much I marketed them. Selling software without an existing network seems nigh-on impossible these days.)
Anyway, congrats to Siracusa on the release, great idea, etc. etc.
I can understand your criticism as it's easy to arrive at that conclusion (Also a common occurrence when levelsio launches a new product, as his Twitter following is large) but it's also not fair to discount it as "just because he runs a popular podcast".
The author has been a "household" name in the macOS / Apple scene for a long time, even before the podcast. If someone spends their whole life blogging about all things Apple on outlets like Ars Technica and consistently puts out new content on podcasts for decades, they will naturally have better distribution.
How many years did you spend on building up your marketing and distribution reach?
I know! I actually like him and wish him the best. I just get a bit annoyed when one of the ATP folks releases some small utility with an unclear niche and then later talks about how they've "merely" earned thousands of dollars from it. When I was an app developer, I would have counted myself lucky to have made just a hundred bucks from a similar release. The gang's popularity gives them a distorted view of the market sometimes, IMHO.
Lovely idea, but way too expensive for me.
I don’t need this, storage is cheap, but I’m glad it exists.
Storage isn't cheap on Macs, though. One has to pay $2k to get an 8 TB SSD.
Storage comes in many forms. It doesn't need to be soldered to the mainboard to satisfy most use cases.
But cleaning up / making space on your main soldered drive, where the OS lives, is quite important.
I did this with two scripts - one that produces and caches SHA-1 sums of files, and another that consumes the output of the first (or of any of the *sum programs) and produces stats about duplicate files, with options to delete or hard-link them.
I wonder how many comments about hard links there will be in this thread from people misunderstanding what this app does.
If a file is not going to be modified in place (in the low-level sense of open("w") on the filename, as opposed to rename-and-create-new), then reflinks (what this app does) and hardlinks act somewhat identically.
For example if you have multiple node_modules, or app installs, or source photos/videos (ones you don't edit), or music archives, then hardlinks work just fine.