DiffX – Next-Generation Extensible Diff Format

154 comments

June 4, 2025

laserbeam

I really don’t like the highly hierarchical format, where there’s a “..meta” in one place and a “...meta” somewhere else. I can imagine we want to annotate the whole diff, each file and each chunk. That’s a total of 3 levels of depth. Let’s just give them distinct names and not go full yaml with a format for once?

This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error-prone (it makes little sense to me why the metadata associated with a whole diff should have the same fields as the metadata of a file).

Furthermore, why do we have two formats, JSON and key=value pairs? Is there any reason not to use just one format? It sounds like the number of things we’d want to annotate is quite small. Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq - but not both at once).

Other notes:

- please allow trailing commas in lists

- diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?

- revisions are a file property? Not a commit checksum? (I might just be dumb here)

chipx86

In the early drafts, we played with a number of approaches for the structure. Things like "commit-meta", etc. In the end, we broke it down into `#<section_level><section_type>`, just to simplify the parsing requirements. Every meta block is a meta block, and knowing what section level you're supposed to be in and comparing to what section level you get becomes a matter of "count the dots".
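As a rough illustration of the "count the dots" rule (a sketch only, not the DiffX reference parser; the regular expression and the name `section_level` are assumptions made for this example):

```python
import re

# Hypothetical pattern: '#', zero or more dots, a lowercase section type, ':'.
SECTION_RE = re.compile(r"^#(\.*)([a-z]+):")

def section_level(header_line):
    """Return (level, section_type), where level is the number of dots."""
    match = SECTION_RE.match(header_line)
    if match is None:
        raise ValueError(f"not a section header: {header_line!r}")
    return len(match.group(1)), match.group(2)

# Two meta headers that differ only in how many dots they carry.
change_meta = section_level("#..meta: format=json, length=270")
file_meta = section_level("#...meta: format=json, length=82")
```

The two calls return the same section type at different levels, which is the only thing that tells the two meta blocks apart.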

The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.

JSON was chosen after a lot of discussion between us and outside parties and after experimentation with other grammars. The header for a meta block can specify a format used to serialize the data, in case down the road something supplants JSON in a meaningful way. We didn't want to box ourselves in, but we also don't want to just let any format sit in there (as that brings us back to the format compatibility headaches we face today).

For the other notes:

1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

2. If your goal is to simply feed to GNU patch (or similar), you can still split it. This extra data is in the Unified Diff "garbage" areas, so they'll be ignored anyway (so long as they don't conflict, and we take care to ensure that in our recommendations on encoding).

If your goal is to split into two DiffX files, it does become more complicated in that you'd need to re-add the leading headers.

That said, not all diff formats used in the wild can be split and still retain all metadata. Mercurial diffs, for example, have a header that must be present at the top to indicate parent commit information. You can remove that and still feed to GNU patch, but Mercurial (or tools supporting the format) will no longer have the information on the parent commit.

3. Revisions depend heavily on the SCM. Some SCMs use a commit identifier. Some use per-file identifiers. Some use a combination of the two. Some use those plus additional information that either gets injected into the diff or needs to be known out-of-band. There's a wide variety of requirements here across the SCM landscape.

laserbeam

> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for.

One more thing you should prepare for whenever you have "free-form bits of metadata". They somehow turn into: "some user was storing 100MB blobs in there, and that broke our other thing".

laserbeam

> 1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

This is what I was referring to. This is not json:

> #..meta: format=json, length=270

> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.

Exactly my point. That level of flexibility for a .patch format to support another language embedded in it is overwhelming. Keep in mind that you are proposing a textual format, not a binary format. So people will use 3rd party text parsing tools to play with it. And having 2 distinct languages in there makes that annoying and a pain.

hdjrudni

How do they reasonably work around that though? If they want the ability to move away from JSON, you have to know that it is JSON before trying to parse it. And then you need to know how much data to read. So I can see why they put those 2 tidbits of info above the data block.

Maybe they could have said too bad, JSON for life, we'll never change it. OK. But then you still need the length or a delimiter for the "end of json".
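A minimal sketch of why a reader wants both pieces of information up front, assuming the header looks like the `#..meta: format=json, length=270` line quoted above and that `length` counts the payload bytes. The function name `read_meta_block` is hypothetical, and this is not the official parser:

```python
import io
import json

def read_meta_block(stream):
    """Read one header line, then exactly `length` bytes of payload."""
    header = stream.readline().decode("utf-8").rstrip("\r\n")
    name, _, raw_opts = header.partition(":")
    # Header options are comma-separated key=value pairs.
    opts = {}
    for pair in raw_opts.split(","):
        if pair.strip():
            key, _, value = pair.strip().partition("=")
            opts[key] = value
    fmt = opts.get("format", "json")
    length = int(opts["length"])
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("payload is shorter than the declared length")
    if fmt != "json":
        raise ValueError(f"unsupported metadata format: {fmt}")
    return name, json.loads(payload)

# Build a tiny example block; the metadata content itself is made up.
body = b'{"stats": {"files": 1}}\n'
data = b"#..meta: format=json, length=%d\n" % len(body) + body
name, meta = read_meta_block(io.BytesIO(data))
```

The format is checked before the payload is parsed, and the length tells the reader where the payload stops without scanning for a delimiter.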

WhyNotHugo

What was your reasoning for discarding the existing header format used by git?

quotemstr

> Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

Everyone has access to a JSON5 parser. Everyone has to suffer for the sake of a few people who don't want to pay the trifling tax of pip installing something --- when they're using an external library for a novel file format _anyway_?

genocidicbunny

> Everyone has access to a JSON5 parser.

That's just a lack of imagination. When you're making a product for teams that span everything from a brand new startup using the latest tooling to teams that are working on software that runs on embedded systems from the 90's, you need to consider things like this.

HelloNurse

A staggering amount of unnecessary and counterproductive scope creep in just 4 items:

    A single diff can’t represent a list of commits

    There’s no standard way to represent binary patches

    Diffs don’t know about text encodings (which is more of a problem than you might think)

    Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.

Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.

chipx86

We build a code review product that interfaces with over a dozen SCMs. In about 20 years of writing diff parsers, we've encountered all kinds of problems and limitations in SCM-generated diff files (which we have to process) that we wouldn't ever have expected to even consider thinking about before. This all comes from the pain points and lessons learned in that work, and has been a huge help in solving these for us.

These aren't problems end users should hopefully ever need to worry about, but they're problems that tools need to worry about and work around. Especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.

HelloNurse

Better file formats cannot, by themselves, improve an inferior SCM tool that, for instance, processes files with the wrong text encoding or forgets deleted and renamed files: they would only have helped you for the purpose of developing your code review tool.

Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.

tankenmate

Not so, obviously it is less common these days, but I still use patch(1) and friends enough to run into problems from time to time. This is especially true when you have devs on different platforms (don't even get me started on filename mangling / case-folding issues).

Borg3

Oh, then this is a management issue, not a tooling one. You need to sit down and analyze where your stuff will be developed. Some very basic rules to start with: file names need to be all lower case (they are case-insensitive), use 7-bit ASCII encoding for source code files. And voilà :)

NavinF

Poe's law at work. Replies are taking you literally, but I'm almost certain that you're joking. Very few large projects exclusively have lowercase filenames

bawolff

What exactly is the lowest common denominator platform we are trying to target here where we need 7-bit ASCII? MS-DOS?

blacklion

So a self-delimiting format (JSON) is embedded in a format with lengths? I change one space in the JSON: the JSON is still valid, but the whole DiffX file is invalid.

Nice, nice.

The format looks very clunky and messy, to be honest: a mixture of self-invented headers and JSON payloads, a strange structure (without the comments here I would not have noticed the different number of dots in `.meta`), and a need for essentially two parsers.

The idea of having an extended diff with a standard way to put metadata is good.

This implementation looks bad, sorry.

The patch format addresses all of these issues, no?

https://git-scm.com/docs/git-format-patch

laserbeam

TIL these are a thing. Thanks! (Just a regular joe on the internet, not the author)

genocidicbunny

It might solve it for git, but this looks like something the Review Board team came up with, and they have to integrate with many other version control systems like SVN, CVS, Perforce, etc. Seems like this is meant to address supporting many different version control systems with a single format.

I've worked at a place that used Review Board, and SVN as their primary vcs, but many devs used a local git-svn mirror for their work. Sometimes this caused problems with uploading diffs, especially if svn and git-svn were being mixed in one review. Having the Review Board cli generate a common diff format for both would have helped with that.

motorest

> Seems like this is meant to address supporting many different version control systems with a single format.

I'm sorry, this is simply wrong on so many levels. You're lauding this as a solution in search of a problem. As OP pointed out, this is already a solved problem as proven by Git. Git is not using a proprietary format. The problem of "integrate with many other version control systems" depends on whether those version control systems want to work on adding support for this feature. I guarantee you there isn't a single SVN or Mercurial maintainer complaining that they would love to share patches with Git but they are blocked because they cannot implement, let alone design, a format to exchange patches. That is not the hard part. That doesn't even register as a concern.

chipx86

Git is using a proprietary variant on top of Unified Diffs. Unified Diffs themselves convey very little information about the file being modified, focusing solely on the line-based contents of text files and allowing vendors to provide their own "garbage" lines containing anything else. Every SCM that tracks information beyond line changes in a diff fills out the garbage data differently.

The intent here isn't to let you copy changes from one type of repository to another, but to have a format that can be generated from many SCMs that a tool could parse in a consistent way.

Right now, tools working with diffs from multiple types of SCMs need at least one diff parser per SCM (some provide multiple formats, or have significantly changed compatibility between releases).

For SCMs that lack a diff format (there are several) or lack one that contains enough information to identify a file or its changes (there are several), tools also need to choose a method to represent that information. That often means yet another custom diff format that is specific to the tool consuming the diff.

We've spent over 20 years dealing with the headaches and pain points here, giving it a lot of thought. DiffX (which is now a few years old itself) has worked out very well as a solution for us. This wasn't done in a vacuum, but rather has gone through many rounds of discussion with developers at a few different SCM vendors who have given thought to these issues and supplied much valuable feedback and improvements for the spec.

genocidicbunny

>I guarantee you there isn't a single SVN or Mercurial maintainer complaining that they would love to share patches with Git

I was one of those maintainers. So you're already wrong there. As I described in my parent comment, I've worked somewhere this was an actual problem I encountered. I was responsible for both maintaining our SVN repository, and our Review Board instance, so I have had to actually deal with this.

chipx86

Exactly that. They all do things so differently that you end up creating and maintaining a separate parser for every SCM's diff format, and sometimes doing a lot of normalization of content or modification to include information the format lacks that's needed to apply the patches. And those are just for the ones that actually have a diff format -- many don't.

We needed something for ourselves at the very least. Much of DiffX came from thinking about these pain points and from talking to other SCM vendors whose engineers have also given some thought to these problems.

genocidicbunny

DiffX would have been nice to have available way back when I was trying to add support for our custom in-house vcs to Review Board. We had to either contort the diffs from our vcs to some format already understood by Review Board, which was sometimes difficult due to how the vcs structured the data it stored, or add a whole new parser to our Review Board instance, which would have been a major maintenance pain.

As an aside, I applaud you for creating Review Board. I've introduced its usage with several teams that I've worked with, and it really helped change how those teams operated, from a fly-by-night sort of development to actually having a process. The reduction in bugs and improvement in code quality were quite useful too.

motorest

> They all do things so differently that you end up creating and maintaining a separate parser for every SCM's diff format (...)

...and you seriously believe that pushing yet another ad-hoc format, and one which no one at all uses, is a way to address your concern?

bawolff

Are these really problems? I feel like i've never really encountered any of these issues and have trouble imagining when they would crop up (except binary files).

- encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

-why would i want a single diff to represent multiple commits? Having multiple diffs seems much more natural.

-metadata... i guess, but also the metadata seems like it would mostly only be useful inside a single system.

account42

> - encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

Yeah I don't see a use-case for a patch encoding either - just treat the patch data as ascii-delimited binary mystery goo. Patch files need to be able to deal with mixed encoding text (e.g. to fix it) so you can't really just have one encoding anyway.

rwmj

They're not problems at all. They probably should have asked people who regularly use diffs what actual problems they have, rather than trying to reinvent some overengineered yaml in a vacuum.

chipx86

Generally-speaking, you probably shouldn't have to deal with these problems unless you're writing a tool that has to interface with certain SCMs or SCMs used in certain environments. I'll give you some examples for each of these points:

1. There are two important areas where encoding can matter: The filename and the diff content.

Git pays attention to filename encoding, but most SCMs don't, so when a diff is generated, it's based on the local encoding. If there are any non-ASCII characters in that filename, a diff generated in one environment with one encoding set can end up not applying to another (or, in our case, not being able to be looked up from a repository). This isn't common but it can happen (we've seen this on Perforce and Subversion).

Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change. This allows people in different environments to operate on the same file regardless of what encoding they're working with.

This doesn't necessarily make its way into the diff, though. So when you send a diff from a less-common encoding to a tool to process it, and that tries to apply it to the file checked out with its encoding, it can fail to patch.

The solution is to either know the encoding of the file you're processing, or try to guess it (some tools, like ours, let you specify a list of preferred encodings to try).

It's best if you can know it up-front.

Bonus Fun Fact: On some SCMs (Perforce comes to mind), checking out a file on Windows and then diffing it on Linux via a shared mount can get you a diff with `\r\r\n` newlines. It's a bad time and breaks patching. And it used to come up a lot, until we worked around it.

Also, Perforce for a while would sometimes normalize encodings incorrectly and you'd end up with BOMs in the diff, breaking GNU patch.

2. It does when you're working with them directly for applying and patching. If you're handing them off to a tool for processing, if there's any risk of one file in a sequence not being included, you can end up with breakages that maybe you don't see until later processing.

It's also just really nice having all the state and metadata up-front so we can process it in one go in a consistent way without having to sanity-check all the diffs against each other.

When working locally, it also depends on your tooling. `git format-patch` and `git am` are great, but are for Git. If I'm working with (let's just say) Subversion, I need to do my own thing or find another tool.

3. It's critical for the kind of information needed to locate files in a repository. Some systems need a commit-wide identifier. Some need per-file identifiers. Some need a combination of the two. Some need those plus additional data not otherwise represented in the path or revision (generally more enterprise SCMs targeting certain use cases).

It's also critical for representing information that isn't in the Unified Diff format (namely, anything but the filename). So, symlink information, file modes, SCM-specific properties on a file or directory, to name a few. This information needs to live somewhere if a SCM provides it, and it's up to every SCM to choose how and where to store that data (and then how it's encoded, etc.).
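The two workarounds mentioned under point 1 (trying a list of preferred encodings, and coping with `\r\r\n` newlines) can be sketched roughly as follows. This is an illustration under assumptions, not how Review Board actually does it: the helper names and the default encoding list are made up for the example.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

def normalize_newlines(data):
    """Collapse the doubled carriage return back to a plain CRLF."""
    return data.replace(b"\r\r\n", b"\r\n")

# A filename-like string saved in a non-UTF-8 encoding.
text, used = decode_with_fallbacks("café".encode("cp1252"))

# A diff line that picked up an extra carriage return.
fixed = normalize_newlines(b"+new line\r\r\n context\r\n")
```

Guessing like this can succeed with the wrong encoding (`latin-1` accepts any byte sequence), which is why knowing the encoding up-front is preferable.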

account42

> Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change.

Yeah, don't do that.

> This allows people in different environments to operate on the same file regardless of what encoding they're working with.

No, it causes hard-to-understand bugs because now what people see on their device and what is tracked in source control differs, defeating the entire purpose of having source control in the first place. This isn't theoretical at all btw.

> The solution is to either know the encoding of the file you're processing

In general, there is no such encoding - source control tools need to be able to deal with files not valid in any single encoding.

ris

Binary data - definitely a problem.

xyzzy_plugh

I find this whole document hard to read. A "diff" colloquially refers to the difference between two things -- files, directory trees, whatever. What TFA refers to as a diff has been always known as a patch, at least to me.

This is nothing about diffs, but entirely about patch metadata management. I mean, sure, noble goal, but this is just shuffling bits around. If they proposed that metadata was required to be JSON that would be one thing, but instead it's some weird self-describing length-delimited nonsense that just disguises the same problems that exist today. It's already extensible! Just type words!

I've spent a lot of time parsing things out of git commits and patch files and while some standardization would be neat, this isn't it.

That said I find the argument that git diff style is more or less canonical more compelling than I have in the past. So there's that.

> A single diff can't represent a list of commits

A patch set can! Why on earth you would want that represented by a single diff is beyond me.

eddd-ddde

The last bit is the first thing I thought. Just use multiple diffs!

redleader55

What actual problem is this trying to solve? They mention patch/diff format not being good enough, but they don't explain for whom. Are GNU Patch people complaining? What are these people building that needs a better patch format?

chipx86

I have a much-too-long-for-one-comment write-up about this, but it's basically for those who build SCMs or tools that need to work with SCMs. End users shouldn't have to care about this.

There's not one patch/diff format. There's often at least one per SCM. A couple are pretty good (Git's), many are okay (Subversion's), and many are really bad or non-existent.

I founded one of the older code review products, Review Board (turning 20 next year), and we deal with the problems this is trying to solve all the time, across over a dozen SCMs. So we're the ones complaining :) And much of this is based on extensive feedback from SCM vendors we've spoken to about this at length.

Most people shouldn't have to care. But it benefits tools like ours that have to deal with the nightmare that is the world of diff formats.

genocidicbunny

Looks like this is being used by Review Board, which heavily relies on diffs for source code reviews, and supports a whole bunch of version control systems.

itake

One of my issues that remains unsolved with diff tools is they are dependent on new line attributes.

Reviewing changes on a long line (like compressed json or long array) is too difficult.

chipx86

Absolutely agree. I think there's a lot of avenues to explore for better diff representations for structured data (which would also be great for ASTs, something we've been thinking about).

This format is meant to be an extension of Unified Diffs (much like the diff formats of most SCMs), and not something entirely new and focusing on other areas. But if more specific diff formats become widespread, we could directly support encoding them within DiffX as well, as we do for binary diff formats.

saghm

I think part of the problem is that the common format used is somewhat of a compromise between being human readable and parseable by tools. I can sort of see where the author of this tool is coming from with trying to address some of that with metadata, but I feel like the better way might be to come up with a format that isn't reliant on plaintext but instead can be rendered to something more readable. Coming up with a way to calculate reasonable diffs on files like you mention that doesn't generate worse diffs than we have now for existing stuff would be challenging, but it doesn't feel like it would be impossible to solve.

devman0

git does have word diffing if you need something more granular than line diffing, the default delimiter being whitespace.

itake

Oh, I didn't realize that. That seems pretty close to what I want, but the git tools (CLI or GitHub Desktop) still print the entire line.

For line diffing, it clips to only show the ~3 lines before and after the change. But the word diff still prints the entire line and you have to scroll to find the change in the line wrapping.

eddd-ddde

You can use `delta` as a diff pager and it includes word-diff-syntax-highlight.

tlb

The most general and unambiguous way to represent a diff is to just include the contents of the two files. It's more data, but that's rarely an issue these days.

So instead of `diff a b | patch c`, where the data through the pipe needs to be in some interchange format, you'd run `apply a b c` and the apply command can use whatever internal representation it likes.

Diffs also aren't great for human reading. A color-coded side-by-side view is better. For which you also want to start with the two files.

There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.

layer8

A diff between two files isn’t unique, meaning there can be better or worse diffs between the same two versions of a file, depending on the file format and possibly the purpose of the diff. Similarly, there can be different strategies for applying a diff as a patch.

Having a diff format allows decoupling the implementation of diff creation from the implementation of diff application, turning a potential n*m problem into an n+m problem.

vintermann

> It's more data, but that's rarely an issue these days.

I do think it's a bit annoying that a program gets updated, and you have to download the whole 130 gb again.

But especially how quickly and easily you can compress two almost-identical files, I think your approach has a lot going for it. It may even be possible to get clever and send over just a hash of the original file, and a version of the new file which has been compressed with the original file as prefix (but without the actual compressed data for that).
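One way to sketch the "original file as prefix" idea is zlib's preset dictionary. This assumes both sides already hold an identical copy of the original; the helper names are hypothetical. zlib only looks back through a 32 KiB window, so this toy version only pays off for small files, and anything larger would need a compressor with a bigger window.

```python
import zlib

def compress_against(old, new):
    """Compress `new`, letting zlib reference bytes from `old`."""
    compressor = zlib.compressobj(zdict=old)
    return compressor.compress(new) + compressor.flush()

def decompress_against(old, blob):
    """Rebuild the new file from the blob plus the shared original."""
    decompressor = zlib.decompressobj(zdict=old)
    return decompressor.decompress(blob) + decompressor.flush()

# A small "original" file and an almost identical updated version.
old = b"".join(b"line %d of the original file\n" % i for i in range(200))
new = old.replace(b"line 100 of", b"line one hundred of")

blob = compress_against(old, new)
restored = decompress_against(old, blob)
```

Only `blob` travels over the wire. The receiver cannot decode it without the same original, so in practice a hash of the original would be sent along to confirm both sides agree.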

blacklion

It is hard to manipulate pairs of files without a special container. For example, you want to attach changes to an e-mail, and the changes cover 10 files + 1 removed file + 2 added files. Will you pack it into a tar/zip with two folders `old` and `new` inside, or what? Looks like a pre-VCS era solution, when we did manual "version control" by copying `project` to `project-19950112-final-for-sure` :)

thecupisblue

>There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.

Well, it depends on what you are doing, and in 2025, they are more relevant than ever.

Asking an LLM to output a diff for an edit can save you a staggering amount of tokens and cut the latency of its response by 5-10x. I've done it your way, a custom diff way and then added a standard diff one, and even back then with GPT 3.5 there was a huge difference, let alone now with way larger models.

There are a lot of diffs in the dataset, so telling it to create a standard diff is usually no different than asking it to create a whole file in terms of accuracy (depending on the task), but saves you all the output tokens and reduces the amount of compute/time required to infer all those tokens.

Updating code running in a sandbox on a 3rd machine over the wire, where speed is relevant? You want a diff. I did it your way first for ease, but knowing how much data and compute I was wasting on that, it was a low-hanging optimisation to use a diff, and it worked wonders. Yeah, for most use cases it would be overkill, but for this use case milliseconds were important.

If you have file A and A2 and a diff AxA2, constructing file A2 is easy and saves you all the A-diff data.

Merging them is the only potential issue due to conflicts, and that is where a human or LLM has to come in and that is where just having a file A2 to overwrite the original one would be easy, but conflict occurs only in cases where you might not want that to happen.

TLDR; diff good.

koiueo

> format (string – recommended):
>
> This would indicate the metadata format. Currently, only json is officially supported, and is the default if not provided.

JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO

motorest

> JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO

That's an odd statement. Can you please explain why you believe that JSON is "unnecessarily complicated" to represent metadata?

quotemstr

What's wrong with JSON?

* JSON is just barely powerful enough to need a library to parse it, but not powerful enough to have comments or trailing commas, so editing is needlessly annoying.

* It's human-readable, but deciphering nested data structures is annoying, especially when things are formatted as long lines. If you have to pipe something to jq to be able to read it, it's broken as a text-based document format.

* JSON is needlessly strict. If I write {foo: 5}, my intent is crystal clear. I shouldn't have to write {"foo": 5}. Come on. Who's really helped by this kind of syntactic hairshirt?

* despite being a strict schoolmarm of text formats, JSON is still vague. Yes, it has numbers. How big can they be? Who knows?

I mean, JSON is fine-ish, I guess, as an interchange format, in which I'm looking at it only for the occasional debugging session. But as a format for documents meant to be read by humans? Ugh. Anything, anything at all but JSON.

dolmen

How often do you edit patch files?

When editing a patch file, how often do you edit metadata beyond the file content differences?

It seems that the proposed DiffX is meant to be produced/edited only by machines, so JSON doesn't seem too much of a problem for this use case.

watusername

> editing is needlessly annoying.

Luckily, this does not matter in DiffX because the whole thing goes up in flames when you change the length anyways :)

signa11

difftastic: https://difftastic.wilfred.me.uk/ uses tree-sitter for better diff-info, and is, imho, superior to this.

chipx86

difftastic is great!

This isn't a tool for viewing changes to files or to ASTs. This is a way of being able to generate a single diff file for processing or patching that addresses the kinds of problems we've encountered in over 20 years of building diff parsing tooling and working with over a dozen SCMs with varying levels of completeness or brokenness of bespoke custom diff formats.

It's not an end user tool, but a useful format for tools like code review products to use.

JanisErdmanis

This looks great. The diff is quite inefficient for patching with the C preprocessor branches.

Since it patches the code, looking at its tree structure, is the diff human readable, and can it be edited directly? This is a major contributor to why I opt for sed for patching.

touristtam

Seen it before, but I might try this time.

FWIW: mise install is breaking due to the submodule. I had to resort to brew install

greatgib

Extending/reworking the format is probably good, but I don't think that using multiline (indentation-dependent) JSON or YAML would be good for such a thing.

One of the interesting points of diff files is that all commands are on single lines. You can easily parse or manipulate them with simple shell tools, just stripping lines out.

eddd-ddde

So much this. At the very least metadata should be inside BEGIN and END markers that allow for easy extraction with something like awk. Not sprinkled around _multiple_ JSON objects you have to merge in manually.
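To make the suggestion concrete, here is a sketch of how marker-delimited metadata could be pulled out. The marker strings `#BEGIN-META` and `#END-META` and the function name are invented for this example; they are not part of DiffX or any existing diff format.

```python
def extract_meta_blocks(text, begin="#BEGIN-META", end="#END-META"):
    """Collect the lines found between each begin/end marker pair."""
    blocks = []
    current = None  # None means we are outside a metadata block.
    for line in text.splitlines():
        if line == begin:
            current = []
        elif line == end and current is not None:
            blocks.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return blocks

# A made-up patch with one metadata block ahead of an ordinary hunk.
sample = "\n".join([
    "#BEGIN-META",
    '{"author": "example"}',
    "#END-META",
    "--- a/file.txt",
    "+++ b/file.txt",
    "@@ -1 +1 @@",
    "-old",
    "+new",
])
blocks = extract_meta_blocks(sample)
```

The same line-by-line scan is what an awk range pattern would do, and no length field has to be kept in sync when the metadata is edited by hand.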
