Hyperspace
268 comments
February 25, 2025
bob1029
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?
taneliv
On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.
It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.
gmueckl
Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.
MBCook
APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.
I believe as soon as you change a single byte you get a complete copy that’s your own.
And that’s how this program works. It finds perfect duplicates and then effectively deletes and replaces them with a copy of the existing file, so in the background there’s only one copy of the bits on the disk.
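For the curious, here is roughly what that replacement can look like at the API level. This is a hypothetical sketch, not Siracusa's actual code; it leans on Apple's documented behavior that FileManager's copy creates a copy-on-write clone on APFS, and it assumes the two files have already been verified byte-for-byte identical:

```swift
import Foundation

/// Replace `duplicate` with a space-saving clone of `original`.
/// Assumes both files have already been verified as byte-for-byte identical.
/// On APFS, FileManager's copy is documented to create a copy-on-write clone,
/// so the new file shares its data blocks with `original`.
func replaceWithClone(original: URL, duplicate: URL) throws {
    let fm = FileManager.default

    // Clone into a temporary name next to the duplicate first, so the duplicate
    // is never destroyed before a valid replacement exists.
    let temp = duplicate.deletingLastPathComponent()
        .appendingPathComponent(".clone-\(UUID().uuidString)")
    try fm.copyItem(at: original, to: temp)   // clone on APFS, plain copy elsewhere

    // Atomically swap the temporary clone into the duplicate's place.
    _ = try fm.replaceItemAt(duplicate, withItemAt: temp)
}
```

A real tool also has to preserve metadata, permissions, and extended attributes, and cope with files changing underneath it mid-operation, which is exactly the coordination problem the Hyperspace documentation warns about.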
amzin
Is there a FS that keeps only diffs in clone files? It would be neat
rappatic
I wondered that too.
If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.
But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.
Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.
It's certainly an interesting concept but might be more trouble than it's worth.
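As a thought experiment, the bookkeeping might look something like this (hypothetical Swift; no real filesystem stores clones this way as far as I know):

```swift
// Each clone is either full content or a diff against a specific version of
// its parent, as described above. Purely illustrative.
struct Version { let id: Int }

enum StoredFile {
    case full(data: [UInt8])                                           // an original like A
    case diff(parent: String, parentVersion: Version, delta: [UInt8])  // B1, B2, ...
}

// Deleting A while B1 and B2 still reference it forces a choice: materialize one
// of them (apply its delta, promote it to .full) and rebase the other onto it,
// or keep A's blocks around, unreachable by name but still referenced.
```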
UltraSane
VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and only store them once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. They store a reference block for each unique hash, and when new data comes in and is hashed, the most similar reference block is used to create byte-level deltas against. In practice this works extremely well.
https://www.vastdata.com/blog/breaking-data-reduction-trade-...
ted_dunning
This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.
Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.
This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.
p_ing
Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.
borland
Yep. At a previous job we had a file server that we published Windows build output to.
There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.
albertzeyer
> This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background.
I think that ZFS actually does this. https://www.truenas.com/docs/references/zfsdeduplication/
pmarreck
It's considered an "expensive" configuration that is only good for certain use-cases, though, due to its memory requirements.
UltraSane
NTFS supports deduplication, but it is only available on Server versions, which is very annoying.
pizzafeelsright
Data loss is the largest concern.
I still do not trust de-duplication software.
dylan604
Even using sha-256 or greater type of hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algo decision trees. I know that any mistake I made would not be of malice but just ignorance or other stupid mistake.
I've done the whole compare-every-file-via-hashing thing and then logged each of the matches for humans to compare, but never has any of that ever been allowed to mv/rm/link -s anything. I feel my imposter syndrome in this regard is not a bad thing.
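As an aside, a report-only scanner along those lines is short enough to sketch in Swift (hypothetical code: it hashes everything with SHA-256 and only prints candidate groups, never touching the files):

```swift
import Foundation
import CryptoKit

// Report-only duplicate finder: hash everything, print the matches, and let a
// human decide. Nothing is moved, removed, or linked.
func reportDuplicates(under root: URL) {
    var groups: [String: [URL]] = [:]  // hex digest -> files with that digest

    let keys: [URLResourceKey] = [.isRegularFileKey]
    guard let walker = FileManager.default.enumerator(at: root, includingPropertiesForKeys: keys) else { return }

    for case let url as URL in walker {
        guard (try? url.resourceValues(forKeys: [.isRegularFileKey]))?.isRegularFile == true,
              let data = try? Data(contentsOf: url, options: .mappedIfSafe) else { continue }
        let digest = SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
        groups[digest, default: []].append(url)
    }

    for (digest, files) in groups where files.count > 1 {
        print("Possible duplicates (\(digest.prefix(12))…):")
        files.forEach { print("  \($0.path)") }
    }
}
```

The size pre-filter discussed a bit further down the thread would avoid reading most files at all; this brute-force version hashes everything.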
asdfman123
If Apple is anything like where I work, there's probably a three-year-old bug ticket in their system about it and no real mandate from upper management to allocate resources for it.
petercooper
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
MBCook
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.
After all, how many perfect duplicate files do you accidentally create in a month?
There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.
And you can always rerun it for free to see if you have enough stuff worth paying for again.
sejje
I also really like this pricing model.
I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.
astennumero
What algorithm does the application use to figure out if two files are identical? There's a lot of interesting algorithms out there. Hashes, bit by bit comparison etc. But these techniques have their own disadvantages. What is the best way to do this for a large amount of files?
borland
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
MBCook
You can start with the size, which is probably really unique. That would likely cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.
Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.
I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
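For illustration, a minimal Swift sketch of that size-first, early-exit comparison (hypothetical, and not necessarily how Hyperspace does it):

```swift
import Foundation

// Only files whose sizes match are ever read, and the comparison bails out at
// the first differing chunk instead of reading both files to the end.
// Assumes regular local files.
func filesAreIdentical(_ a: URL, _ b: URL, chunkSize: Int = 1 << 20) throws -> Bool {
    let sizeA = try a.resourceValues(forKeys: [.fileSizeKey]).fileSize
    let sizeB = try b.resourceValues(forKeys: [.fileSizeKey]).fileSize
    guard sizeA == sizeB else { return false }   // cheap metadata check, no content I/O

    let ha = try FileHandle(forReadingFrom: a)
    let hb = try FileHandle(forReadingFrom: b)
    defer { try? ha.close(); try? hb.close() }

    while true {
        let ca = try ha.read(upToCount: chunkSize) ?? Data()
        let cb = try hb.read(upToCount: chunkSize) ?? Data()
        if ca != cb { return false }             // early exit on the first mismatch
        if ca.isEmpty { return true }            // both files ended at the same point
    }
}
```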
rzzzt
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:
- compute SHA256 hashes for each file on the source side
- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)
- mirror the source directory structure to the destination
- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
Then I got too scared to actually use it :)
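For reference, a rough (and untested) Swift equivalent of those four steps, using `FileManager.linkItem` for the hard links, might look like this:

```swift
import Foundation
import CryptoKit

// "Hardlink farm" sketch: canonical copies keyed by content hash, plus a
// mirrored tree of hard links. A real script would also handle re-runs.
func snapshot(source: URL, destination: URL) throws {
    let fm = FileManager.default
    let canonical = destination.appendingPathComponent("canonical")
    let tree = destination.appendingPathComponent(source.lastPathComponent)
    try fm.createDirectory(at: canonical, withIntermediateDirectories: true)

    guard let walker = fm.enumerator(at: source, includingPropertiesForKeys: [.isRegularFileKey]) else { return }
    for case let file as URL in walker {
        guard (try? file.resourceValues(forKeys: [.isRegularFileKey]))?.isRegularFile == true else { continue }

        // 1. Hash the file; the hex digest doubles as the canonical file name.
        let data = try Data(contentsOf: file, options: .mappedIfSafe)
        let name = SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
        let store = canonical.appendingPathComponent(name)

        // 2. Only copy contents we haven't seen before.
        if !fm.fileExists(atPath: store.path) {
            try fm.copyItem(at: file, to: store)
        }

        // 3. Mirror the directory structure and 4. hard-link to the canonical copy.
        let relative = String(file.path.dropFirst(source.path.count + 1))
        let target = tree.appendingPathComponent(relative)
        try fm.createDirectory(at: target.deletingLastPathComponent(), withIntermediateDirectories: true)
        try fm.linkItem(at: store, to: target)
    }
}
```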
pmarreck
xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.
Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.
f1shy
I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application which was using SHA in the background. I do not recall all the details; it is improbable, but possible.
kittoes
The probability is truly, obscenely, low. If you read about a collision then you surely weren't reading about SHA256.
https://crypto.stackexchange.com/questions/47809/why-havent-...
sgerenser
LOL nope, I seriously doubt that was the result of a SHA256 collision.
amelius
Or just use whatever algorithm rsync uses.
w4yai
I'd hash the first 1024 bytes of all files and start from there if there is any collision. That way you don't need to hash the whole (large) files, only the ones whose first-1024-byte hashes match.
amelius
I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.
Also, use the length of the file for a fast check.
kstrauser
At that point, why hash them instead of just using the first 1024 bytes as-is?
borland
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".
If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the Photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.
Far more efficient - and less work - if you just use a SHA256 of the file's contents. That gets you a much smaller 32-byte key, and you don't need to bother with 2-stage comparisons.
sedatk
Probably because you need to keep a lot of those in memory.
smusamashah
And why the first 1024? You could pick from predefined points instead.
diegs
This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.
williamsmj
Deleted comment based on a misunderstanding.
Sohcahtoa82
> This tool simply identifies files that point at literally the same data on disk because they were duplicated in a copy-on-write setting.
You misunderstood the article, as it's basically doing the opposite of what you said.
This tool finds duplicate data that is specifically not duplicated via copy-on-write, and then turns it into a copy-on-write copy.
williamsmj
Fair. Deleted.
BWStearns
I have file A that's in two places and I run this.
I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?
madeofpalk
It's called copy-on-write because when you modify A_0, it'll make a copy of the file if you write to it but not A_1.
https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...
bsimpson
Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
But if you have the same 500MB of node_modules in each of your dozen projects, this might actually durably save some space.
lgdskhglsa
He's using the "copy on write" feature of the file system. So it should leave A_1 untouched, creating a new copy for A_0's modifications. More info: https://developer.apple.com/documentation/foundation/file_sy...
bhouston
I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings in an 8.1GB folder.
I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
I tried to scan System and Library but it refused to do so because of permission issues.
I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.
Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.
p_ing
> I tried to scan System and Library but it refused to do so because of permission issues.
macOS has a sealed volume which is why you're seeing permission errors.
https://support.apple.com/guide/security/signed-system-volum...
bhouston
For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune leftover Unreal Engine files and Docker caches when they ended up outside my home folder. The tool asks for a ton of permissions when you run it in order to do the scan, though, which is a bit annoying.
zamalek
pnpm tries to be a drop-in replacement for npm, and dedupes automatically.
MrJohz
More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.
spankalee
npm's --install-strategy=linked flag is supposed to do this too, but it has been broken in several ways for years.
diggan
> pnpm tries to be a drop-in replacement for npm
True
> and dedupes automatically
Also true.
But the way you put them one after the other makes it sound like npm does de-duplication, and since pnpm tries to be a drop-in replacement for npm, so does pnpm.
So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.
lou1306
> it only found 1GB of savings on a 8.1GB folder.
You "only" found that 12% of the space you are using is wasted? Am I reading this right?
bhouston
I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." It is a neat tool still in concept.
When I ran it on my home folder with 165GB of data it only found 1.3GB of savings. This isn't that significant to me and it isn't really worth paying for.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
timerol
Your original comment did not mention that your home folder was 165 GB, which is extremely relevant here
warkdarrior
The relevant number (missing from above) is the total amount of space on that storage device. If it saves 1GB on a 8TB drive, it's not a big win.
oneeyedpigeon
It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.
jy14898
If it saved 8.1GB, by your measure it'd also not be a big win?
rconti
Absolutely, 100% backwards. The tool cannot save space from disk space that is not scanned. Your "not a big win" comment assumes that there is no space left to be reclaimed on the rest of the disk. Or that the disk is not empty, or that the rest of the disk can't be reclaimed at an even higher rate.
modzu
What's the price? It doesn't seem to be published anywhere.
scblock
It's on the Mac App Store so you'll find the pricing there. Looks like $10 for one month (one time use maybe?), $20 for a year, $50 lifetime.
diggan
Even though I have both a Mac and an iPhone, since I happen to be using my Linux computer right now the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) doesn't show the price, probably because I'm not actively on an Apple device? Seems like poor UX even for us Mac users.
piqufoh
£9.99 a month, £19.99 for one year, £49.99 for life (app store purchase prices visible once you've scanned a directory).
jamesfmilne
Would be nice if git could make use of this on macOS.
Each worktree I usually work on is several gigs of (mostly) identical files.
Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.
(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)
theamk
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.
There is also ".git/objects/info/alternates", accessed via "--shared"/"--reference" option of "git clone", that allows only sharing of object storage and not branches etc... but it is has caveats, and I've only used it in some special circumstances.
diggan
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that; I'm assuming you've already investigated that for the large artifacts?
diggan
> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.
Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.
dewey
The author of the software is a file system enthusiast (so much so that the podcast he's a part of has a dedicated sound effect for every time "filesystem" comes up), a long-time blogger, and a macOS reviewer. So you'll have to see it in that context: documenting every bit and the technical details behind it is important to him... even if it's longer than a tag line on a landing page.
In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.
protonbob
No because it isn't getting rid of the duplicate, it's using a feature of APFS that allows for duplicates to exist separately but share the same internal data.
yayoohooyahoo
Is it not the same as a hard link (which I believe are supported on Mac too)?
andrewla
My understanding is that it is a copy-on-write clone, not a hard link. [1]
> Q: Are clone files the same thing as symbolic links or hard links?
> A: No. Symbolic links ("symlinks") and hard links are ways to make two entries in the file system that share the same data. This might sound like the same thing as the space-saving clones used by Hyperspace, but there’s one important difference. With symlinks and hard links, a change to one of the files affects all the files.
> The space-saving clones made by Hyperspace are different. Changes to one clone file do not affect other files. Cloned files should look and behave exactly the same as they did before they were converted into clones.
actionfromafar
Almost, but the difference is that if you change one of the hardlinked files, you change "all of them". (It's really the same file but with different paths.)
https://hypercritical.co/hyperspace/#how-it-works
APFS apparently allows for creating "clone files" which, when changed, start to diverge.
zippergz
A copy-on-write clone is not the same thing as a hard link.
rahimnathwani
With a hard link, the contents of the two 'files' are identical in perpetuity.
With APFS clones, the contents start off identical but can be changed independently. If you change a small part of a file, new copies of the affected block(s) will be created, but the remaining blocks will continue to be shared with the clone.
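A toy way to see that difference on an APFS volume, as a Swift script (relying on the documented behavior that FileManager's copy creates a clone there):

```swift
import Foundation

// Create a file, clone it, modify only the clone, and observe that the
// original is unaffected. With a hard link, both names would show the edit.
let fm = FileManager.default
let dir = fm.temporaryDirectory
let original = dir.appendingPathComponent("original.txt")
let clone = dir.appendingPathComponent("clone.txt")
try? fm.removeItem(at: original)
try? fm.removeItem(at: clone)

try Data("shared contents\n".utf8).write(to: original)
try fm.copyItem(at: original, to: clone)          // clone: shares blocks with the original

// Appending to the clone only allocates new blocks for the changed part...
let handle = try FileHandle(forWritingTo: clone)
_ = try handle.seekToEnd()
try handle.write(contentsOf: Data("edited\n".utf8))
try handle.close()

// ...and the original is untouched, because clones diverge on write.
print(try String(contentsOf: original, encoding: .utf8))  // "shared contents"
print(try String(contentsOf: clone, encoding: .utf8))     // "shared contents" + "edited"
```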
diggan
Right, but the concept is the same, "remove duplicates" in order to save storage space. If it's using reflinks, softlinks, APFS clones or whatever is more or less an implementation detail.
I know that internally it isn't actually "removing" anything, and that it uses fancy new technology from Apple. But in order to explain the project to strangers, I think my tagline gets the point across pretty well.
CharlesW
> Right, but the concept is the same, "remove duplicates" in order to save storage space.
The duplicates aren't removed, though. Nothing changes from the POV of users or software that use those files, and you can continue to make changes to them independently.
zerd
I've been using `fclones` [1] to do this, with `dedupe`, which uses reflink/clonefile.
bsimpson
Interesting idea, and I like the idea of people getting paid for making useful things.
Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.
Nevermark
> I like the idea of people getting paid for making useful things
> It would be nice if it was open source
> I get a data security itch having a random piece of software from the internet scan every file on an HD
With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.
I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting fairer or safer by default.
--
The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.
benced
You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.
Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.
albertzeyer
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.
https://github.com/albertz/system-tools/blob/master/bin/merg...
andrewla
This does not use hard links or symlinks; this uses a feature of the filesystem that allows the creation of copy-on-write clones. [1]
radicality
Hopefully it doesn’t have a bug similar to the one jdupes had.
https://web.archive.org/web/20210506130542/https://github.co...
exitb
What are examples of files that make up the "dozens of gigabytes" of duplicated data?
xnx
There are some CUDA files, taking up multiple GB, that every local AI app installs.
wruza
Also models that various AI libraries and plugins love to autodownload into custom locations. Python folks definitely need to learn caching, symlinks, asking a user where to store data, or at least logging where they actually do it.
zerd
.terraform, rust target directory, node_modules.
password4321
iMovie used to copy video files etc. into its "library".
butlike
audio files; renders, etc.