Hyperspace
502 comments
February 25, 2025
jonhohle
Replying to myself now that I've had a chance to try the scan, but not the deduplication. I work with disc images, program binaries, intermediate representations in a workspace that's 7.6G.
A few notes:
* By default it doesn't scan everything. It ignores all files but those in an allow list. The way the allow list is structured, it seems like Hyperspace needs to understand the content of a file. As an end user, I have no idea what the difference between a Text file and a Source Code file would be, or how Hyperspace would know. Hyperspace only found 360MB to dedup. Allowing all files increased that to 842MB.
* It doesn't scan files smaller than 100 KB by default. Disabling the size limit along with allowing all files increased that to 1.1GB
* With all files and no size limit it scanned 67,309 of 68,874 files. `dedup` scans 67,426.
* It says 29,522 files are eligible. Eligible means they can be deduped. `dedup` only finds 29,447, plus 76 already-deduped files, which is an off-by-one (29,523 vs. 29,522), so I'm not sure where the difference comes from.
* Scanning files in Hyperspace took around 50s vs `dedup` at 14s
* It seems to scan the file system, then do a duplication calculation, then do the deduplication. I'm not sure why the first two aren't done together. I chose to queue filesystem metadata as it was scanned and, in parallel, start calculating duplicates. The vast majority of the time files can be ruled out by size alone, which is available from `fts_read` "for free" while traversing the directory (see the sketch after this list).
* Hyperspace found 1.1GB to save, `dedup` finds 1.04GB and 882MB already saved (from previous deduping)
* I'm not going to buy Hyperspace at this time, so I don't know how long it takes to dedup or if it preserves metadata or deals with strange files. `dedup` took 31s to scan and deduplicate.
* After deduping with `dedup`, Hyperspace thinks there are still 2 files that can be deduped.
* Hyperspace seems to understand it can't dedup files with multiple hard links, empty files, and some of the other things `dedup` also checks for.
* I can't test ACLs or any other attribute preservation like that without paying. `strings` suggests those are handled. HFS Compression is a tricky edge case, but I haven't tested how Hyperspace's scan deals with those.
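A minimal sketch of that size-first pass (an illustration only, not code from `dedup` or Hyperspace): `fts_read` hands back the `stat` data with every entry, so files can be grouped by size before any content is read, and only same-size files ever need a content comparison.

```c
// Walk a tree with fts(3), record (size, path) for regular files, and flag
// only same-size files as candidates for content comparison.
#include <fts.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct rec { off_t size; char *path; };

static int by_size(const void *a, const void *b) {
    off_t sa = ((const struct rec *)a)->size, sb = ((const struct rec *)b)->size;
    return (sa > sb) - (sa < sb);
}

int main(int argc, char *argv[]) {
    char *roots[] = { argc > 1 ? argv[1] : ".", NULL };
    FTS *fts = fts_open(roots, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (!fts) { perror("fts_open"); return 1; }

    struct rec *recs = NULL;
    size_t n = 0, cap = 0;
    FTSENT *ent;
    while ((ent = fts_read(fts)) != NULL) {
        if (ent->fts_info != FTS_F || ent->fts_statp->st_size == 0)
            continue;                           /* regular, non-empty files only */
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            recs = realloc(recs, cap * sizeof *recs);
            if (!recs) { perror("realloc"); return 1; }
        }
        recs[n].size = ent->fts_statp->st_size; /* size is "free" during traversal */
        recs[n].path = strdup(ent->fts_path);
        n++;
    }
    fts_close(fts);

    /* Only files that share a size with another file need any content checks. */
    qsort(recs, n, sizeof *recs, by_size);
    for (size_t i = 0; i + 1 < n; i++)
        if (recs[i].size == recs[i + 1].size)
            printf("candidate pair (%lld bytes): %s <-> %s\n",
                   (long long)recs[i].size, recs[i].path, recs[i + 1].path);
    return 0;
}
```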
karparov
I'm a little surprised that folks here are investing so much time into this app. It's closed source, only available for a non-obvious amount, time-limited or subscription-based, and lots of details of how it works are missing.
With a FOSS project this would have been expected, but with a ShareWare-style model? Idk..
codemusings
John has reiterated multiple times on his podcast that he doesn't want to deal with thousands of support requests when making his apps open source and free. All his apps are personal itches he scratched and he sells them not to make a profit but to make the barrier of entry high enough to make user feedback manageable.
crowselect
Non obvious amount? Scroll to the bottom of the app store page, all the in app purchase prices are listed there.
App store prices are localized. If the blog post said it costs “$10” or whatever, that doesn’t mean anything to millions of potential customers who live where they don’t use $, and is confusing for millions more that do use $ but don’t know if the price is in their local $ or USD
kccqzy
Just because the creator, John Siracusa, is famous. If a no-name developer had made the app, it wouldn't get this many upvotes and this much attention. He used to write very detailed OS reviews, and I learned a lot from him, including about Apple's logical volume manager (`diskutil cs`).
eduction
On metadata, the excellent FAQ addresses that specifically (it does preserve it). (I had the same question.)
gurjeet
Thank you for creating and sharing this utility.
I ran it over my Postgres development directories that have almost identical files. It saved me about 1.7GB.
The project doesn't have any license associated with it. If you don't mind, can you please license this project with a license of your choice?
As a gesture of thanks, I have attempted to improve the installation step slightly and have created this pull request: https://github.com/ttkb-oss/dedup/pull/6
jonhohle
Thanks for the PR. Individual files were licensed, but I’ve added a LICENSE file as well.
LVB
Just tried it, and it works well! I didn't realize the potential of this technique until I saw just how many dupes there were of certain types of files, especially in node_modules. It wasn't uncommon to see it replace 50 copies of some js file with one, and that was just in a specific subdirectory.
I see it is "pre-release" and has fairly few GH stars (== usage?), so I'm curious about the stability, since this type of tool is relatively scary if buggy.
jonhohle
I use it on my family photos, work documents, etc. and have not run into an issue that I haven’t added a test for. I didn’t commercialize it because I didn’t know what I didn’t know, but the utility does try to fail quickly if any of the file system operations fail (cloning, metadata duplication, atomic swaps, etc.).
Whenever using it on something sensitive that I can’t back up first for whatever reason, I make checksum files and compare them afterwards. I’ve done this many times on hundreds of GB and haven’t seen corruption. Caveat emptor.
There is one huge caveat I should add to the README - block corruption happens. Having a second copy of a file is a crude form of backup. Cloning causes all instances to use the same blocks, so if those blocks are corrupted, all clones are. That’s fine for software projects with generated files that can be rebuilt or checked out again, but it introduces some risk for files that may not otherwise be replaceable. I keep multiple backups of all that stuff on hardware other than where I’m deduping, so I dedup with abandon.
I’m a nobody with no audience. Maybe some attention here will get some users.
ncann
Somewhat unrelated, but I believe the dupe issue with node_modules is the main reason to use pnpm instead of npm - pnpm just uses a single global package repo on your machine and creates links inside node_modules as needed.
actinium226
Wow, that's some excellent documentation.
I was also really impressed that `make` ran basically instantly.
jonhohle
Thanks!
I love the documentation from FreeBSD and OpenBSD. Only having to target one platform and only using system libraries makes building simple.
Recursing
See the comments on https://news.ycombinator.com/item?id=38113396 for a list of alternatives. I used https://github.com/sahib/rmlint in the past and can't complain.
bob1029
> There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.
This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?
taneliv
On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.
It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.
abrookewood
In regards to the second point, this isn't correct for ZFS: "If several files contain the same pieces (blocks) of data or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. Instead of storing many copies of a book it stores one copy and an arbitrary number of pointers to that one copy." [0]. So changing one byte of a large file will not suddenly result in writing the whole file to disk again.
[0] https://www.truenas.com/docs/references/zfsdeduplication/
btilly
This applies to modifying a byte. But inserting a byte will change every block from then on, and will force a rewrite.
Of course, that is true of most filesystems.
karparov
Not the whole file but it would duplicate the block. GP didn't claim that the whole file is copied.
taneliv
Yeah, I did not write it very clearly. On ZFS, you're right. On a file system that applied deduplication to files and not individual blocks, the file would need to be duplicated again, no matter where and what kind of change was made.
gmueckl
Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size, because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.
MBCook
APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.
I believe as soon as you change a single byte you get a complete copy that’s your own.
And that’s how this program works. It finds perfect duplicates and then effectively deletes and replaces them with a copy of the existing file so in the background there’s only one copy of the bits on the disk.
the8472
That's a ZFS online dedup limitation. I think xfs and btrfs are better prior art here since they use extent-based deduplication and can do offline dedup, which means they don't have to keep it in memory and the on-disk metadata is smaller too.
amzin
Is there a FS that keeps only diffs in clone files? It would be neat
rappatic
I wondered that too.
If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.
But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.
Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.
It's certainly an interesting concept but might be more trouble than it's worth.
UltraSane
VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store them only once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. They store a reference block for each unique hash, and when new data comes in and is hashed, the most similar reference block is used to create byte-level deltas against. In practice this works extremely well.
https://www.vastdata.com/blog/breaking-data-reduction-trade-...
abrookewood
ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)
p_ing
Not an FS, but attempting to mimic NTFS, SharePoint does this within its content database(s).
https://www.microsoft.com/en-us/download/details.aspx?id=397...
alwillis
That’s how APFS works; it uses delta extents for tracking differences in clones: https://en.wikipedia.org/wiki/Delta_encoding?wprov=sfti1#Var...
jonhohle
APFS shares blocks so only blocks that changed are no longer shared. Since a block is the smallest atomic unit (except maybe an inode) in a FS, that’s the best level of granularity to expect.
the8472
With extent-based filesystems you can clone extents and then overwrite one extent and only that becomes unshared.
throwaway2037
From a file system design perspective, does anyone know why ZFS chose to use block clones, instead of file clones?
raattgift
Records (which are of variable size) are already checksummed, and there were checksum-hashes which made it vanishingly unlikely that one could choose two different records with the same (optionally cryptographically strong) checksum-hash. When a newly-created record's checksum is generated, one could look into a table of existing checksum-hashes and avoid the record write if it already exists, substituting an incremented refcount for that table entry.
ZFS is essentially an object store database at one layer; the checksum-hash deduplication table is an object like any other (file, metadata, bookmarks, ...). There is one deduplication table per pool, shared among all its datasets/volumes.
On reads, one does not have to consult the dedup table.
The mechanism was fairly easy to add. And for highly-deduplicatable data that is streaming-write-once-into-quiescent-pool-and-never-modify-or-delete-what's-written-into-a-deduplicated-dataset-or-volume, it was a reasonable mechanism.
In other applications, the deduplication table would tend to grow and spread out, requiring extra seeks for practically every new write into a deduplicated dataset or volume, even if it's just to increment or decrement the refcount for a record.
Destroying a deduplicated dataset has to decrement all its refcounts (and remove entries from the table where it's the only reference), and if your table cannot all fit in ram, the additional IOPS onto spinning media hurt, often very badly. People experimenting with deduplication and who wanted to back out after running into performance issues for typical workloads sometimes determined it was much MUCH faster to destroy the entire pool and restore from backups, rather than wait for a "zfs destroy" on a set of deduplicated snapshots/datasets/volumes to complete.
mbreese
I have no specialized knowledge (just a ZFS user for over a decade). I suspect the reason is that in addition to files, ZFS will also allow you to create volumes. These volumes act like block devices, so if you want to dedup them, you need to do it at the block level.
ted_dunning
This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.
Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.
This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.
Sesse__
This is the standard API for deduplication on Linux (used for btrfs and XFS); you ask the OS nicely to deduplicate a given set of ranges, and it responds by locking the ranges, verifying that they are indeed identical and only then deduplicates for you (you get a field back saying how many bytes were deduplicated from each range). So there's no way a userspace program can mess up your files.
kccqzy
Yup. This is the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
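A minimal sketch of how a userspace tool drives it (an illustration only; `FIDEDUPERANGE` is the filesystem-agnostic name for the same ioctl in current kernels, and some filesystems cap how many bytes a single call may dedupe, so large files may need to be processed in chunks):

```c
// Ask the kernel to share extents between SRC and DEST after it verifies the
// ranges are identical. Requires a filesystem with reflink support (btrfs, XFS).
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 3) { fprintf(stderr, "usage: %s SRC DEST\n", argv[0]); return 1; }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    struct stat st;
    if (src < 0 || dst < 0 || fstat(src, &st) < 0) { perror("open/fstat"); return 1; }

    /* One destination range; the struct ends in a flexible array member. */
    struct file_dedupe_range *arg =
        calloc(1, sizeof *arg + sizeof(struct file_dedupe_range_info));
    if (!arg) return 1;
    arg->src_offset = 0;
    arg->src_length = (unsigned long long)st.st_size;  /* whole file */
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, arg) < 0) { perror("FIDEDUPERANGE"); return 1; }

    if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
        fprintf(stderr, "kernel says the ranges differ; nothing was shared\n");
    else if (arg->info[0].status < 0)
        fprintf(stderr, "dedupe failed: status %d\n", arg->info[0].status);
    else
        printf("deduplicated %llu bytes\n",
               (unsigned long long)arg->info[0].bytes_deduped);

    free(arg);
    return 0;
}
```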
albertzeyer
> This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background.
I think that ZFS actually does this. https://www.truenas.com/docs/references/zfsdeduplication/
pmarreck
It's considered an "expensive" configuration that is only good for certain use-cases, though, due to its memory requirements.
abrookewood
Yes true, but that page also covers some recent improvements to de-duplication that might assist.
p_ing
Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.
borland
Yep. At a previous job we had a file server that we published Windows build output to.
There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.
It is worth pointing out though, that on Windows Server this deduplication is a background process; When new duplicate files are created, they genuinely are duplicates and take up extra space, but once in a while the background process comes along and "reclaims" them, much like the Hyperspace app here does.
Because of this (the background sweep process is expensive), it doesn't run all the time and you have to tell it which directories to scan.
If you want "real" de-duplication, where a duplicate file will never get written in the first place, then you need something like ZFS
p_ing
Both ZFS and WinSvr offer "real" dedupe. One is on-write, which requires a significant amount of available memory, the other is on a defined schedule, which uses considerably less memory (300MB + 10MB/TB).
ZFS is great if you believe you'll exceed some threshold of space while writing. I don't personally plan my volumes with that in mind but rather make sure I have some amount of excess free space.
WinSvr allows you to disable dedupe if you want (don't know why you would), whereas ZFS is a one-way street without exporting the data.
Both have pros and cons. I can live with the WinSvr cons while ZFS cons (memory) would be outside of my budget, or would have been at the particular time with the particular system.
sterlind
hey, it's defrag all over again!
(not really, since it's not fragmentation, but conceptually similar)
UltraSane
NTFS supports deduplication but it is only available on Server versions which is very annoying.
alwillis
Probably because APFS runs on everything from the Apple Watch to the Mac Pro and everything in between.
You probably don’t want your phone or watch de-duping stuff.
There are tons of knobs you can tweak in macOS but Apple has always been pretty conservative when it comes to what should be default behavior for the vast majority of their users.
Certainly when you duplicate a file using the Finder or use cp -c at the command line, the Copy-on-Write functionality is being used; most users don’t need to know that.
pizzafeelsright
data loss is the largest concern
I still do not trust de-duplication software.
throwaway439080
Agreed, "I made a deduplication software in my garage! Do you want to try it?" is a terrifying pitch.
I've been writing a similar thing to dedupe my photo collection and I'm so paranoid of pulling the trigger I just keep writing more tests.
czk
Dedupe seemed more interesting when storage was expensive, but nowadays it feels like the overhead you get from running dedupe, in most cases, is priced in. At least with software like CommVault for backups, dedupe requires beefy hardware and low-latency SSDs for the database. If there are even a few extra milliseconds of latency or the server can’t handle requests fast enough, your backup throughput absolutely tanks. Depending on your data, though, you could see some ridiculous savings here that make it worth the trouble.
I’ve heard many horror stories of dedupe related corruption or restoration woes though, especially after a ransomware attack.
dylan604
Even using sha-256 or greater type of hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algo decision trees. I know that any mistake I made would not be of malice but just ignorance or other stupid mistake.
I've done the whole compare-every-file-via-hashing thing and logged each of the matches for humans to compare, but never has any of that ever been allowed to mv/rm/ln -s anything. I feel my imposter syndrome in this regard is not a bad thing.
borland
Now you understand why this app costs more than 2x the price of alternatives such as diskDedupe.
Any halfway-competent developer can write some code that does a SHA256 hash of all your files and uses the Apple filesystem APIs to replace duplicates with shared clones. I know Swift, I could probably do it in an hour or two. Should you trust my bodgy quick script? Heck no.
The author - John Siracusa - has been a professional programmer for decades and is an exceedingly meticulous kind of person. I've been listening to the ATP podcast where they've talked about it, and the app has undergone an absolute ton of testing. Look at the guardrails on the FAQ page https://hypercritical.co/hyperspace/ for an example of some of the extra steps the app takes to keep things safe. Plus you can review all the proposed file changes before you touch anything.
You're not paying for the functionality, but rather the care and safety that goes around it. Personally, I would trust this app over just about any other on the mac.
criddell
> I'd still have concerns about letting a system make deletion decisions without my involvement
You are involved. You see the list of duplicates and can review them as carefully as you'd like before hitting the button to write the changes.
petercooper
I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!
MBCook
He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.
After all how many perfect duplicate files do you probably create a month accidentally?
There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.
And you can always rerun it for free to see if you have enough stuff worth paying for again.
brailsafe
For me the value in a dedup app like this isn't as much the space savings, since I just don't generate huge amounts, but it's the lack of duplicated files, some of which or all in aggregate may be large. There are some weird scenarios where this occurs, usually due to having to reconcile a hard drive recovery with another location for the files, or a messy download directory with an organized destination.
For example, I discovered my Time Machine backup kicked out the oldest versions of files I didn't know it had a record of and thought I'd long since lost, but it destroyed the names of the directories and obfuscated the contents somewhat. Thousands of numerically named directories, some of which have files I may want to hang onto, but I don't know whether I already have them or not, or where they are, since it's completely unstructured. Likewise, many of them may just have one of the same build system text file I can obvs toss away.
mentalgear
am I really that old that I remember this being the default for most of the software about 10 years ago? Are people already that used to the subscription trap that they think this is a new model ?
petercooper
I grew up with shareware in the 90s that often adopted a similar model (though having to send $10 in the mail and wait a couple weeks for a code or a disk to come back was a bit of a grind!) but yes, it's refreshing in the current era where developers will even attempt to charge $10 a week for a basic coloring in app on the iPad..
sejje
I also really like this pricing model.
I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.
jedbrooke
it’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards which you can lock for those so if you forget to cancel the charges are blocked)
however has anyone been able to find out from the website how much the license actually costs?
xp84
Doesn’t the Mac App Store listing list the IAP SKUs like it does on iOS?
petercooper
It does. It's reasonably clear for this app but I wish they made it clearer for other apps where the IAP SKUs often have meaningless descriptions.
astennumero
What algorithm does the application use to figure out if two files are identical? There are a lot of interesting algorithms out there: hashes, bit-by-bit comparison, etc. But these techniques have their own disadvantages. What is the best way to do this for a large number of files?
borland
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
MBCook
You can start with the size, which is probably really unique. That would likely cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.
Plus, if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.
I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
jonhohle
To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.
This is the default for ZFS deduplication and git does something similar with size and far weaker SHA-1. I would add a test for SHA-256 collisions, but no one has seemed to find a working example yet.
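A rough sketch of that kind of cascade for a single pair of files (an illustration only, not `dedup`'s actual code: it omits the device-id check and the tree, and `same_content`/`byte_at`/`sha256_file` are made-up helper names). Size first, then first byte, then last byte, and only a full CommonCrypto SHA-256 when everything cheaper matches:

```c
// Cheap-to-expensive comparison cascade for two candidate files.
#include <CommonCrypto/CommonDigest.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static bool sha256_file(const char *path, unsigned char out[CC_SHA256_DIGEST_LENGTH]) {
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    CC_SHA256_CTX ctx;
    CC_SHA256_Init(&ctx);
    unsigned char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        CC_SHA256_Update(&ctx, buf, (CC_LONG)n);
    CC_SHA256_Final(out, &ctx);
    fclose(f);
    return true;
}

static bool byte_at(const char *path, long offset, int whence, unsigned char *out) {
    FILE *f = fopen(path, "rb");
    if (!f || fseek(f, offset, whence) != 0 || fread(out, 1, 1, f) != 1) {
        if (f) fclose(f);
        return false;
    }
    fclose(f);
    return true;
}

bool same_content(const char *a, const char *b) {
    struct stat sa, sb;
    if (stat(a, &sa) || stat(b, &sb)) return false;
    if (sa.st_size != sb.st_size || sa.st_size == 0) return false;  /* cheap reject */

    unsigned char ba, bb;
    if (!byte_at(a, 0, SEEK_SET, &ba) || !byte_at(b, 0, SEEK_SET, &bb) || ba != bb)
        return false;                                  /* first byte differs */
    if (!byte_at(a, -1, SEEK_END, &ba) || !byte_at(b, -1, SEEK_END, &bb) || ba != bb)
        return false;                                  /* last byte differs */

    unsigned char ha[CC_SHA256_DIGEST_LENGTH], hb[CC_SHA256_DIGEST_LENGTH];
    return sha256_file(a, ha) && sha256_file(b, hb) && /* full-content hash */
           memcmp(ha, hb, sizeof ha) == 0;
}

int main(int argc, char *argv[]) {
    if (argc == 3)
        printf("%s\n", same_content(argv[1], argv[2]) ? "same content" : "different");
    return 0;
}
```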
unclebucknasty
>which is probably really unique
Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.
>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash
Definitely, for comparing any two files. But, if you're searching for duplicates across the entire disk, then you're theoretically checking each file multiple times, and each file is checked against multiple times. So, hashing them on first pass could conceivably be more efficient.
>if you just compare the bytes there is no chance of hash collision
You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.
But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.
You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
karparov
This can be done much faster and safer.
You can group all files into buckets, and as soon as a bucket contains only a single file, discard it. If, at the end, there are still multiple files in the same bucket, they are duplicates.
Initially all files are in the same bucket.
You now iterate over differentiators which, given two files, tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run each differentiator on all files in a bucket to split the bucket into finer equivalence classes.
For example:
* Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.
* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.
* Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.
* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.
jonhohle
Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you’ve read at most 2 blocks per file. I haven’t figured out a good differentiator between that and hashing, but it’s already so rare that it’s not worth it, in my experience.
rzzzt
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:
- compute SHA256 hashes for each file on the source side
- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)
- mirror the source directory structure to the destination
- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
Then I got too scared to actually use it :)
kccqzy
Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.
In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
620gelato
Oh, the SHA-256 hashes are precisely what I used for a quick script I put together to parse through various backups of my work laptop in different places (tool changes and laziness). I had 10 different backups going back 4 years, and I wanted to make sure I 1) preserved all unique files, and 2) preserved the latest folder structure they showed up in.
Using sha256 was a no-brainer, at least for me.
pmarreck
xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.
Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.
PhilipRoman
Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real world scenarios it performs just as well as non-crypto hashes like xx.
f1shy
I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application that was using SHA hashes in the background. I don't recall all the details; it's improbable, but possible.
sgerenser
LOL nope, I seriously doubt that was the result of a SHA256 collision.
kittoes
The probability is truly, obscenely, low. If you read about a collision then you surely weren't reading about SHA256.
https://crypto.stackexchange.com/questions/47809/why-havent-...
diegs
This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.
w4yai
I'd hash the first 1024 bytes of all files, and start from there if there is any collision. That way you don't need to hash the whole (large) files, but only those with the same hashes.
amelius
I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.
Also, use the length of the file for a fast check.
kstrauser
At that point, why hash them instead of just using the first 1024 bytes as-is?
borland
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".
If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the Photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.
Far more efficient - and less work - If you just use a SHA256 of the file's contents. That gets you a much smaller 32 byte key, and you don't need to bother with 2-stage comparisons.
sedatk
Probably because you need to keep a lot of those in memory.
smusamashah
And why first 1024, can pick from predefined points.
williamsmj
Deleted comment based on a misunderstanding.
Sohcahtoa82
> This tool simply identifies files that point at literally the same data on disk because they were duplicated in a copy-on-write setting.
You misunderstood the article, as it's basically doing the opposite of what you said.
This tool finds duplicate data that is specifically not duplicated via copy-on-write, and then turns it into a copy-on-write copy.
williamsmj
Fair. Deleted.
alwillis
Just want to mention: Apple ships a modified version of the copy command (good old cp) that can use the cloning feature of APFS via the -c flag.
kccqzy
And in case your cp doesn't support it, you could also do it by invoking Python. Something like `import Foundation; Foundation.NSFileManager.defaultManager().copyItemAtPath_toPath_error_(...)`.
xenadu02
Correct. Foundation's NSFileManager / FileManager will automatically use clone for same-volume copies if the underlying filesystem supports it. This makes all file copies in all apps that use Foundation support cloning even if the app does nothing.
libcopyfile also supports cloning via two flags: COPYFILE_CLONE and COPYFILE_CLONE_FORCE. The former clones if supported (same volume and filesystem supports it) and falls back to actual copy if not. The force variant fails if cloning isn't supported.
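A minimal sketch of the libcopyfile path (an illustration; error handling trimmed, and it behaves much like `cp -c`):

```c
// Clone SRC to DST. COPYFILE_CLONE falls back to a real copy when cloning
// isn't possible (different volume, non-APFS filesystem); COPYFILE_CLONE_FORCE
// would fail instead of copying.
#include <copyfile.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc != 3) { fprintf(stderr, "usage: %s SRC DST\n", argv[0]); return 1; }

    if (copyfile(argv[1], argv[2], NULL, COPYFILE_CLONE) != 0) {
        perror("copyfile");
        return 1;
    }
    return 0;
}
```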
alwillis
Apparently the cp command in CoreUtilities also supports copy-on-write on macOS: https://unix.stackexchange.com/questions/311536/cp-reflink-a...
BWStearns
I have file A that's in two places and I run this.
I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?
madeofpalk
It's called copy-on-write because when you write to A_0, the filesystem makes a copy for A_0's changed data while leaving A_1 untouched.
https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...
bsimpson
Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
But if you have the same 500MB of node_modules in each of your dozen projects, this might actually durably save some space.
_rend
> Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.
I'm not sure if this is what you intended, but just to be sure: writing changes to a cloned file doesn't immediately duplicate the entire file again in order to write those changes — they're actually written out-of-line, and the identical blocks are only stored once. From [the docs](^1) posted in a sibling comment:
> Modifications to the data are written elsewhere, and both files continue to share the unmodified blocks. You can use this behavior, for example, to reduce storage space required for document revisions and copies. The figure below shows a file named “My file” and its copy “My file copy” that have two blocks in common and one block that varies between them. On file systems like HFS Plus, they’d each need three on-disk blocks, but on an Apple File System volume, the two common blocks are shared.
[^1]: https://developer.apple.com/documentation/foundation/file_sy...
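To make the out-of-line write behavior concrete, a small sketch using clonefile(2) on an APFS volume (an illustration only; `original.bin` and `clone.bin` are made-up names):

```c
// Clone a file, then append to the clone. Only the changed/added blocks get
// their own storage; the original file and the shared blocks are untouched.
#include <sys/clonefile.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (clonefile("original.bin", "clone.bin", 0) != 0) {   /* instant, no data copied */
        perror("clonefile");
        return 1;
    }

    int fd = open("clone.bin", O_WRONLY | O_APPEND);
    if (fd < 0 || write(fd, "x", 1) != 1) { perror("append"); return 1; }
    close(fd);

    /* original.bin still has its old contents; unmodified blocks remain shared. */
    return 0;
}
```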
BWStearns
Thanks for the clarification. I expected it worked like that but couldn't find it spelled out after a brief perusal of the docs.
kdmtctl
What happens when the original file is deleted? Often this is handled by block reference counters, which are just decremented. How does APFS handle this? Is there any master/copy concept, or just block references?
lgdskhglsa
He's using the "copy on write" feature of the file system. So it should leave A_1 untouched, creating a new copy for A_0's modifications. More info: https://developer.apple.com/documentation/foundation/file_sy...
mattgreenrocks
What jumped out to me:
> Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).
How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.
quux
IIRC migrating from HFS+ to APFS can be done without touching any of the data blocks and a parallel set of APFS metadata blocks and superblocks are written to disk. In the test migrations Apple did the entire migration including generating APFS superblocks but held short of committing the change that would permanently replace the HFS+ superblocks with APFS ones. To roll back they “just” needed to clean up all the generated APFS superblocks and metadata blocks.
k1t
Yes, that's how it's described in this talk transcript:
https://asciiwwdc.com/2017/sessions/715
Let’s say for simplification we have three metadata regions that report all the entirety of what the file system might be tracking, things like file names, time stamps, where the blocks actually live on disk, and that we also have two regions labeled file data, and if you recall during the conversion process the goal is to only replace the metadata and not touch the file data.
We want that to stay exactly where it is as if nothing had happened to it.
So the first thing that we’re going to do is identify exactly where the metadata is, and as we’re walking through it we’ll start writing it into the free space of the HFS+ volume.
And what this gives us is crash protection and the ability to recover in the event that conversion doesn’t actually succeed.
Now the metadata is identified.
We’ll then start to write it out to disk, and at this point, if we were doing a dry-run conversion, we’d end here.
If we’re completing the process, we will write the new superblock on top of the old one, and now we have an APFS volume.
MBCook
I think that’s what they did too. And it was a genius way of testing. They did it more than once too I think.
Run the real thing, throw away the results, report all problems back to the mothership so you have a high chance of catching them all even on their multi-hundred million device fleet.
anitil
I watched the section from the talk [0] and there are no details given really, other than that it was done as a test of consistency. I've blown so many things up in production that I'm not sure if I could ever pull the trigger on such a large migration
kccqzy
You lack imagination. This is not some crown jewel only achievable by Apple. In the open source world we have tools to convert ext file systems to btrfs and (1) you could revert back; (2) you could mount the original ext file system while using the btrfs file system.
mattgreenrocks
Cool!
TrapLord_Rhodo
Start with a story, narrow it down to a problem and show how your solution magically solves that problem. Such a fine example of GREAT marketing.
bhouston
I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings on a 8.1GB folder.
I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
I tried to scan System and Library but it refused to do so because of permission issues.
I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.
Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.
lou1306
> it only found 1GB of savings on a 8.1GB folder.
You "only" found that 12% of the space you are using is wasted? Am I reading this right?
bhouston
I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." It is a neat tool still in concept.
When I ran it on my home folder with 165GB of data it only found 1.3GB of savings. This isn't that significant to me and it isn't really worth paying for.
BTW I highly recommend the free "disk-inventory-x" utility for MacOS space management.
timerol
Your original comment did not mention that your home folder was 165 GB, which is extremely relevant here
warkdarrior
The relevant number (missing from above) is the total amount of space on that storage device. If it saves 1GB on a 8TB drive, it's not a big win.
oneeyedpigeon
It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.
jy14898
If it saved 8.1GB, by your measure it'd also not be a big win?
rconti
Absolutely, 100% backwards. The tool cannot save space from disk space that is not scanned. Your "not a big win" comment assumes that there is no space left to be reclaimed on the rest of the disk. Or that the disk is not empty, or that the rest of the disk can't be reclaimed at an even higher rate.
zamalek
pnpm tries to be a drop-in replacement for npm, and dedupes automatically.
MrJohz
More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.
spankalee
npm's --install-strategy=linked flag is supposed to do this too, but it has been broken in several ways for years.
diggan
> pnpm tries to be a drop-in replacement for npm
True
> and dedupes automatically
Also true.
But the way you put them after each other makes it sound like npm does de-duplication, and since pnpm tries to be a drop-in replacement for npm, so does pnpm.
So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.
p_ing
> I tried to scan System and Library but it refused to do so because of permission issues.
macOS has a sealed volume which is why you're seeing permission errors.
https://support.apple.com/guide/security/signed-system-volum...
bhouston
For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune leftover Unreal Engine files and Docker caches when those tools put them outside my home folder. The tool asks for a ton of permissions when you run it in order to do the scan, though, which is a bit annoying.
alwillis
It’s not obvious but the system folder is on a separate, secure volume; the Finder does some trickery to make the system volume and the data volume appear as one.
In general, you don’t want to mess with that.
kdmtctl
Didn't have time to try it myself, but there is an option for the minimum file size to consider, clearly visible on the App Store screenshot. I suppose it was introduced to minimize comparison buffers. It is possible that node modules slide under this size and weren't considered.
modzu
What's the price? It doesn't seem to be published anywhere.
scblock
It's on the Mac App Store so you'll find the pricing there. Looks like $10 for one month (one time use maybe?), $20 for a year, $50 lifetime.
diggan
Even though I have both a Mac and an iPhone, I happen to be using my Linux computer right now, and the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) is not showing the price, probably because I'm not actively on an Apple device? Seems like poor UX even for us Mac users.
piqufoh
£9.99 a month, £19.99 for one year, £49.99 for life (app store purchase prices visible once you've scanned a directory).
albertzeyer
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.
https://github.com/albertz/system-tools/blob/master/bin/merg...
andrewla
This does not use hard links or symlinks; this uses a feature of the filesystem that allows the creation of copy-on-write clones. [1]
notatallshaw
uv does this out of the box, I think other tools (poetry, hatch, pdm, etc.) do as well but I have less experience with the details.
galaxyLogic
On Windows there is "Dev Drive", which I believe does a similar copy-on-write thing.
If it works it's a no-brainer so why isn't it the default?
https://learn.microsoft.com/en-us/windows/dev-drive/#dev-dri...
p_ing
CoW is a function of ReFS, shipped with Server 2016. "DevDrive" is just a marketing term for a ReFS volume which has file system filters placed in async mode or optionally disabled altogether.
siranachronist
requires ReFS, which still isn't supported on the system drive on Windows, iirc
galaxyLogic
Good to know. It is interesting to think how much electricity is wasted when most people don't have Copy-on-Write, all over the world.
jamesfmilne
Would be nice if git could make use of this on macOS.
Each worktree I usually work on is several gigs of (mostly) identical files.
Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.
(Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)
theamk
"git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.
There is also ".git/objects/info/alternates", accessed via "--shared"/"--reference" option of "git clone", that allows only sharing of object storage and not branches etc... but it is has caveats, and I've only used it in some special circumstances.
globular-toast
Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?
diggan
Git is a really poor fit for a project like that since it's snapshot based instead of diff based... Luckily, `git lfs` exists for working around that, I'm assuming you've already investigated that for the large artifacts?
I made a command line utility called `dedup` a while back to do the same thing. It has a dry-run mode, will “intelligently” choose the best clone source, understands hard links and other clones, preserves metadata, deals with HFS compressed files properly. It hasn’t destroyed any of my own data, but like any file system tool, use at your own risk.
0 - https://github.com/ttkb-oss/dedup