Hyperspace

268 comments

February 25, 2025

bob1029

> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.

This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?

taneliv

On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.

It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.

gmueckl

Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files, at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size, because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.
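A toy illustration of that idea (a purely hypothetical Python model, nothing like APFS's real on-disk structures): blocks are reference-counted, cloning a file just bumps refcounts, and writing one byte copies only the affected block.

```python
# Toy block store: files are lists of block IDs, blocks are refcounted.
# Writing to a shared block copies just that block (copy-on-write).
BLOCK_SIZE = 4

class BlockStore:
    def __init__(self):
        self.blocks = {}    # block id -> bytes
        self.refs = {}      # block id -> reference count
        self.next_id = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def clone(self, block_ids):
        # Cloning a file just bumps refcounts; no data is copied.
        for bid in block_ids:
            self.refs[bid] += 1
        return list(block_ids)

    def write_byte(self, block_ids, offset, value):
        # Copy-on-write: only the block containing `offset` is duplicated.
        idx, within = divmod(offset, BLOCK_SIZE)
        bid = block_ids[idx]
        data = bytearray(self.blocks[bid])
        data[within] = value
        if self.refs[bid] > 1:          # shared -> copy this one block
            self.refs[bid] -= 1
            block_ids[idx] = self.alloc(bytes(data))
        else:                           # private -> modify in place
            self.blocks[bid] = bytes(data)

store = BlockStore()
original = [store.alloc(b"abcd"), store.alloc(b"efgh"), store.alloc(b"ijkl")]
copy = store.clone(original)            # instant, shares all blocks
store.write_byte(copy, 5, ord("X"))     # duplicates only the middle block
print(len(store.blocks))                # 4 blocks total, not 6
```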

MBCook

APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.

I believe as soon as you change a single byte you get a complete copy that’s your own.

And that’s how this program works. It finds perfect duplicates, then effectively deletes them and replaces them with clones of the one remaining file, so behind the scenes there’s only one copy of the bits on the disk.
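A minimal sketch of that replace-with-a-clone step (my own illustration, not Siracusa's implementation), calling clonefile(2), the macOS system call that creates copy-on-write clones, from Python via ctypes. It deliberately ignores the safety problems quoted at the top of the thread (other processes touching the file mid-swap, preserving metadata and permissions), and the paths are made up.

```python
# Minimal sketch (macOS only): replace a verified byte-identical duplicate
# with an APFS clone of the copy we keep.
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def replace_with_clone(keep: str, dup: str) -> None:
    tmp = dup + ".clone-tmp"
    # clonefile(2) creates a copy-on-write clone of `keep` at `tmp`.
    if libc.clonefile(keep.encode(), tmp.encode(), 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    os.replace(tmp, dup)  # atomically swap the clone in over the duplicate

# replace_with_clone("/some/keep.bin", "/some/duplicate.bin")  # hypothetical paths
```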

amzin

Is there a FS that keeps only diffs in clone files? It would be neat

rappatic

I wondered that too.

If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.

But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.

Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.

It's certainly an interesting concept but might be more trouble than it's worth.

UltraSane

VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store them only once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. They store a reference block for each unique hash, and when new data comes in and is hashed, the most similar existing block is used as the base for byte-level deltas. In practice this works extremely well.

https://www.vastdata.com/blog/breaking-data-reduction-trade-...

ted_dunning

This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.

Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.

This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.

p_ing

Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.

borland

Yep. At a previous job we had a file server that we published Windows build output to.

There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.

albertzeyer

> This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background.

I think that ZFS actually does this. https://www.truenas.com/docs/references/zfsdeduplication/

pmarreck

It's considered an "expensive" configuration that is only good for certain use-cases, though, due to its memory requirements.

UltraSane

NTFS supports deduplication but it is only available on Server versions which is very annoying.

pizzafeelsright

data loss is the largest concern

I still do not trust de-duplication software.

dylan604

Even using sha-256 or greater type of hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algo decision trees. I know that any mistake I made would not be of malice but just ignorance or other stupid mistake.

I've done the whole "compare every file via hashing and log each of the matches for humans to review" exercise, but never has any of that ever been allowed to mv/rm/link -s anything. I feel my imposter syndrome in this regard is not a bad thing.

axus

Question for the developer: what's your liability if user files are corrupted?

codazoda

Most EULAs would disclaim liability for data loss and suggest users keep good backups. I haven’t read a EULA for a long time, but I think most of them do so.

asdfman123

If Apple is anything like where I work, there's probably a three-year-old bug ticket in their system about it and no real mandate from upper management to allocate resources for it.

petercooper

I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!

MBCook

He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.

After all, how many perfect duplicate files do you accidentally create in a month?

There’s a subscription or buy-forever option for people who think it would actually be quite useful to them. But for a ton of people, a one-time IAP that gives them a limited amount of time to use the program really does make a lot of sense.

And you can always rerun it for free to see if you have enough stuff worth paying for again.

sejje

I also really like this pricing model.

I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.

astennumero

What algorithm does the application use to figure out if two files are identical? There's a lot of interesting algorithms out there. Hashes, bit by bit comparison etc. But these techniques have their own disadvantages. What is the best way to do this for a large amount of files?

borland

I don't know exactly what Siracusa is doing here, but I can take an educated guess:

For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.

The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
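A minimal sketch of that approach (my guess at the general shape, not the app's actual code): hash every file's contents with SHA-256 and group paths by digest; any group with more than one path holds byte-identical candidates.

```python
# Group files by SHA-256 of their contents; any group with more than one
# path holds byte-identical candidates for cloning.
import hashlib, os
from collections import defaultdict

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str) -> dict:
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```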

MBCook

You can start with the file size, which is probably close to unique already. That would likely cut down the search space fast.

At that point maybe it’s better to just compare byte by byte? You’d have to read the whole file to generate the hash anyway, and if you just compare the bytes there is no chance of a hash collision, however small.

Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.

I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
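For comparison, a sketch of the size-first, byte-by-byte idea from the comment above (again just an illustration, not necessarily what Hyperspace does):

```python
# Bucket files by size (free), then compare equal-sized files byte-by-byte,
# stopping at the first difference instead of hashing everything.
import os, sys
from collections import defaultdict
from itertools import combinations

def same_bytes(a: str, b: str, chunk: int = 1 << 20) -> bool:
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if ca != cb:
                return False   # early exit on the first differing chunk
            if not ca:
                return True    # both files exhausted, contents identical

def report_duplicates(paths):
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    for group in by_size.values():
        if len(group) < 2:
            continue           # a unique size can't have a duplicate
        for a, b in combinations(group, 2):   # pairwise is fine for small groups
            if same_bytes(a, b):
                print("duplicate:", a, b)

if __name__ == "__main__":
    report_duplicates(sys.argv[1:])   # pass candidate file paths as arguments
```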

rzzzt

I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:

- compute SHA256 hashes for each file on the source side

- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)

- mirror the source directory structure to the destination

- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.

Then I got too scared to actually use it :)
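Roughly those same steps in Python rather than bash (a hypothetical sketch with made-up paths, not the original script):

```python
# Hardlink-farm snapshot: one canonical copy per unique content hash,
# plus a mirrored directory tree of hard links pointing at those copies.
import hashlib, os, shutil

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(src_root, dst_root):
    canonical = os.path.join(dst_root, ".canonical")
    os.makedirs(canonical, exist_ok=True)
    for dirpath, _, names in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)     # mirror the tree
        for name in names:
            src = os.path.join(dirpath, name)
            blob = os.path.join(canonical, sha256_of(src))          # hash as file name
            if not os.path.exists(blob):
                shutil.copy2(src, blob)                             # first copy of this content
            os.link(blob, os.path.join(dst_root, rel, name))        # hard link per snapshot

# snapshot("/data/projects", "/backups/2025-02-25")   # hypothetical paths
```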

pmarreck

xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.

Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.

f1shy

I think the probability is not so low. I remember reading here about a person receiving a photo from another chat in a chat application that used SHA hashes in the background. I don't recall all the details; it is improbable, but possible.

kittoes

The probability is truly, obscenely, low. If you read about a collision then you surely weren't reading about SHA256.

https://crypto.stackexchange.com/questions/47809/why-havent-...

sgerenser

LOL nope, I seriously doubt that was the result of a SHA256 collision.

amelius

Or just use whatever algorithm rsync uses.

w4yai

I'd hash the first 1024 bytes of all files, and start from there if there are any collisions. That way you don't need to hash the whole (large) files, only those whose prefix hashes match.

amelius

I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.

Also, use the length of the file for a fast check.

kstrauser

At that point, why hash them instead of just using the first 1024 bytes as-is?

borland

In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".

If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, then that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the Photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.

Far more efficient, and less work, if you just use a SHA256 of the file's contents. That gets you a much smaller 32-byte key, and you don't need to bother with 2-stage comparisons.
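The two ideas combine naturally into a staged approach (my own sketch, not any particular tool's implementation): group by size, then by a hash of a small prefix, and only compute full-file hashes for files that still collide.

```python
# Staged duplicate detection: size -> hash of first 4 KiB -> full SHA-256.
# Each stage only processes the files that still collide, so most files
# never get read end-to-end.
import hashlib, os
from collections import defaultdict

def prefix_key(path, n=4096):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n)).hexdigest()

def full_key(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by(paths, key):
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [g for g in buckets.values() if len(g) > 1]   # keep only colliding groups

def duplicate_groups(paths):
    groups = group_by(paths, os.path.getsize)                          # stage 1: size (free)
    groups = [sub for g in groups for sub in group_by(g, prefix_key)]  # stage 2: first 4 KiB
    return [sub for g in groups for sub in group_by(g, full_key)]      # stage 3: whole file
```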

sedatk

Probably because you need to keep a lot of those in memory.

smusamashah

And why the first 1024? You could pick from predefined points instead.

diegs

This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.

williamsmj

Deleted comment based on a misunderstanding.

Sohcahtoa82

> This tool simply identifies files that point at literally the same data on disk because they were duplicated in a copy-on-write setting.

You misunderstood the article, as it's basically doing the opposite of what you said.

This tool finds duplicate data that is specifically not duplicated via copy-on-write, and then turns it into a copy-on-write copy.

williamsmj

Fair. Deleted.

BWStearns

I have file A that's in two places and I run this.

I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?

madeofpalk

It's called copy-on-write because when you write to A_0, the filesystem makes a new copy for A_0's changes and leaves A_1 untouched.

https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...

bsimpson

Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.

But if you have the same 500MB of node_modules in each of your dozen projects, this might actually durably save some space.

lgdskhglsa

He's using the "copy on write" feature of the file system. So it should leave A_1 untouched, creating a new copy for A_0's modifications. More info: https://developer.apple.com/documentation/foundation/file_sy...
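A small demonstration of that behaviour on macOS/APFS (a sketch calling clonefile(2) from Python via ctypes, with made-up scratch file names): after cloning, writing to the original leaves the clone's contents unchanged.

```python
# macOS/APFS demo: clone a file with clonefile(2), then modify the original;
# the clone keeps the old contents because changed blocks are copied on write.
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

for p in ("a.txt", "a_clone.txt"):
    if os.path.exists(p):
        os.remove(p)                           # clonefile fails if the target exists

with open("a.txt", "wb") as f:                 # hypothetical scratch file
    f.write(b"original contents\n")

if libc.clonefile(b"a.txt", b"a_clone.txt", 0) != 0:   # instant, shares blocks
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

with open("a.txt", "wb") as f:                 # write to the original only
    f.write(b"changed contents\n")

print(open("a_clone.txt", "rb").read())        # still b'original contents\n'
```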

bhouston

I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings in an 8.1GB folder.

I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)

I tried to scan System and Library but it refused to do so because of permission issues.

I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.

Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.

p_ing

> I tried to scan System and Library but it refused to do so because of permission issues.

macOS has a sealed volume which is why you're seeing permission errors.

https://support.apple.com/guide/security/signed-system-volum...

bhouston

For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune leftover Unreal Engine files and Docker caches when they were put outside my home folder. The tool asks for a ton of permissions when you run it in order to do the scan though, which is a bit annoying.

zamalek

pnpm tries to be a drop-in replacement for npm, and dedupes automatically.

MrJohz

More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.

spankalee

npm's --install-strategy=linked flag is supposed to do this too, but it has been broken in several ways for years.

diggan

> pnpm tries to be a drop-in replacement for npm

True

> and dedupes automatically

Also true.

But the way you put them after each other makes it sound like npm does de-duplication and, since pnpm tries to be a drop-in replacement for npm, so does pnpm.

So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.

lou1306

> it only found 1GB of savings on a 8.1GB folder.

You "only" found that 12% of the space you are using is wasted? Am I reading this right?

bhouston

I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." It is a neat tool still in concept.

When I ran it on my home folder with 165GB of data it only found 1.3GB of savings. This isn't that significant to me and it isn't really worth paying for.

BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.

timerol

Your original comment did not mention that your home folder was 165 GB, which is extremely relevant here

warkdarrior

The relevant number (missing from above) is the total amount of space on that storage device. If it saves 1GB on a 8TB drive, it's not a big win.

oneeyedpigeon

It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.

jy14898

If it saved 8.1GB, by your measure it'd also not be a big win?

rconti

Absolutely, 100% backwards. The tool cannot save space in disk space that was never scanned. Your "not a big win" comment assumes that there is no space left to be reclaimed on the rest of the disk, or that the rest of the disk isn't empty, or that it can't be reclaimed at an even higher rate.

modzu

What's the price? It doesn't seem to be published anywhere.

scblock

It's on the Mac App Store so you'll find the pricing there. Looks like $10 for one month (one time use maybe?), $20 for a year, $50 lifetime.

diggan

Even though I have both a Mac and an iPhone, because I happen to be on my Linux computer right now the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) is not showing the price, probably because I'm not actively on an Apple device? Seems like a poor UX even for us Mac users.

piqufoh

£9.99 a month, £19.99 for one year, £49.99 for life (app store purchase prices visible once you've scanned a directory).

jamesfmilne

Would be nice if git could make use of this on macOS.

Each worktree I usually work on is several gigs of (mostly) identical files.

Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.

(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)

theamk

"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.

There is also ".git/objects/info/alternates", accessed via the "--shared"/"--reference" options of "git clone", which allows sharing only the object storage and not branches etc... but it has caveats, and I've only used it in some special circumstances.

diggan

Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?

diggan

> Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.

Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.

dewey

The author of the software is a file system enthusiast (so much so that on the podcast he's a part of, they have a dedicated sound effect every time "filesystem" comes up), a long-time blogger, and a macOS reviewer. So you'll have to see it in that context: documenting every bit and the technical details behind it is important to him... even if it's longer than a tag line on a landing page.

In times where documentation is often an afterthought, and technical details get hidden away from users all the time ("Ooops some error occurred") this should be celebrated.

protonbob

No, because it isn't getting rid of the duplicate; it's using a feature of APFS that allows duplicates to exist separately but share the same underlying data.

yayoohooyahoo

Is it not the same as a hard link (which I believe are supported on Mac too)?

andrewla

My understanding is that it is a copy-on-write clone, not a hard link. [1]

> Q: Are clone files the same thing as symbolic links or hard links?

> A: No. Symbolic links ("symlinks") and hard links are ways to make two entries in the file system that share the same data. This might sound like the same thing as the space-saving clones used by Hyperspace, but there’s one important difference. With symlinks and hard links, a change to one of the files affects all the files.

> The space-saving clones made by Hyperspace are different. Changes to one clone file do not affect other files. Cloned files should look and behave exactly the same as they did before they were converted into clones.

[1] https://hypercritical.co/hyperspace/

actionfromafar

Almost, but the difference is that if you change one of the hardlinked files, you change "all of them". (It's really the same file but with different paths.)

https://hypercritical.co/hyperspace/#how-it-works

APFS apparently allows for creating "clone files" which, when changed, start to diverge.

zippergz

A copy-on-write clone is not the same thing as a hard link.

rahimnathwani

With a hard link, the contents of the two 'files' are identical in perpetuity.

With APFS clones, the contents start off identical, but can be changed independently. If you change a small part of a file, new blocks will be written for the changed portion, but the remaining blocks will continue to be shared with the clone.

diggan

Right, but the concept is the same, "remove duplicates" in order to save storage space. If it's using reflinks, softlinks, APFS clones or whatever is more or less an implementation detail.

I know that internally it isn't actually "removing" anything, and that it uses fancy new technology from Apple. But in order to explain the project to strangers, I think my tagline gets the point across pretty well.

CharlesW

> Right, but the concept is the same, "remove duplicates" in order to save storage space.

The duplicates aren't removed, though. Nothing changes from the POV of users or software that use those files, and you can continue to make changes to them independently.

dingnuts

It does get rid of the duplicate. The duplicate data is deleted and a hard link is created in its place.

zippergz

It does not make hard links. It makes copy-on-write clones.

kemayo

No, because it's not actually a hard link -- if you modify one of the files they'll diverge.

zerd

I've been using `fclones` [1] to do this, with `dedupe`, which uses reflink/clonefile.

https://github.com/pkolaczk/fclones

bsimpson

Interesting idea, and I like the idea of people getting paid for making useful things.

Also, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.

Nevermark

> I like the idea of people getting paid for making useful things

> It would be nice if it was open source

> I get a data security itch having a random piece of software from the internet scan every file on an HD

With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.

I am not arguing anything, except pondering how software economics and security issues are full of unresolved holes, and the world isn't getting fairer or safer by default.

--

The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.

benced

You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.

Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.

albertzeyer

I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.

My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.

https://github.com/albertz/system-tools/blob/master/bin/merg...

andrewla

This does not use hard links or symlinks; this uses a feature of the filesystem that allows the creation of copy-on-write clones. [1]

[1] https://en.wikipedia.org/wiki/Apple_File_System#Clones

radicality

Hopefully it doesn’t have a bug like the one jdupes had

https://web.archive.org/web/20210506130542/https://github.co...

exitb

What are examples of files that make up the "dozens of gigabytes" of duplicated data?

xnx

There are some CUDA files, multiple GB each, that every local AI app installs.

wruza

Also models that various AI libraries and plugins love to autodownload into custom locations. Python folks definitely need to learn caching, symlinks, asking a user where to store data, or at least logging where they actually do it.

zerd

.terraform, rust target directory, node_modules.

password4321

iMovie used to copy video files etc. into its "library".

butlike

Audio files, renders, etc.