Is OOXML Artificially Complex?
77 comments
· September 5, 2025 · s20n
gregopet
My wife worked in one of the national standardization organizations. She was urgently called into her boss' office: "Please be on this meeting with me, I think they will try to bribe me if I'm alone". It only happened once while my wife worked there and it was right before the vote where Microsoft tried to fast track their office format.
monocasa
Specifically what I heard on the grapevine was that Microsoft sponsored a collection of small island nations into the ISO process, in exchange for their vote on OOXML.
CorrectHorseBat
Both can be true at once.
They didn't want a standard other people could adapt easily nor do the work to make Word adhere to one and it had to happen fast. By doing it the way they did they got everything they wanted and only needed to buy ISO.
fsflover
Sounds exactly like deliberate sabotage to me.
quotemstr
Some myths just won't die.
OOXML is complex because it has to be. It has to losslessly round trip through an open format every single feature of Office. That's a lot of features.
Yes, it's complex. Should Microsoft have cut features of Office just to make OOXML simpler? That's ridiculous. What about users who relied on those cut features?
It was fair to ask Microsoft to open the file format. It wasn't fair to expect them to cut features and compatibility. The complaints about complexity from RMS and others represent outsiders seeing the sausage factory and realizing that the sausage making is complicated and needs a lot of moving parts. Maybe life wasn't as simple as the Slashdot "Micro$oft" narrative would suggest. Maybe the complexity of the product was downstream of the shit ton of complexity and sweat and thought that had gone into it.
But admitting that would have been hard. Easier to come up with conspiracy theories.
clort
You are wrong. Microsoft was not asked to open the file format. There was an open file format already accepted as an ISO standard, so now they needed to make their product compliant with an ISO standard because companies around the world were going to prioritise that in their purchases. They did everything they could to ensure that their format was both an ISO standard, and impossible for somebody else to implement.
hdjrudni
From the article,
> First, OOXML was, in material part, a defensive posture under intensifying antitrust and “open standards” pressure. Microsoft announced OOXML in late 2005 while appealing an adverse European Commission judgment centered on interoperability disclosures. Thus, it was only a matter of time before Office file compatibility came under the regulatory microscope. (The Commission indeed opened a probe in 2008.)
> Meanwhile, the rival ODF matured and became an ISO standard in May 2006. Governments, especially in Europe, began to mandate open standards in public procurement. If Microsoft did nothing, Office risked exclusion from government deals.
So... maybe they weren't directly asked to open their file format, but what then? Adopt ODF which is surely incompatible with their feature set, and... just corrupt every .doc file when converting into the new format? And also have to reimplement all their apps?
user3939382
So you put extensions in the spec; you don’t make it impossible for anyone else to implement. They knew open source suites were competing with them, and they did it on purpose.
quotemstr
> So you put extensions in the spec
... which are either public, in which case people complain that the spec+extensions is too long instead of that the spec is too long, or
... which aren't public, in which case people complain that there's no interoperability.
You can't win.
> impossible for anyone else to implement
Except for all the people who did implement it?
dullcrisp
The…sausage has a lot of moving parts?
troupo
> OOXML is complex because it has to be.
What it didn't have to be is sections upon sections of "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97" without any further specification or context.
The main struggle for independent implementors was reverse engineering all the implicit and explicit assumptions and inner workings of MS Office software.
> But admitting that would have been hard. Easier to come up with conspiracy theories.
I actually read through a lot of that spec at the time. A lot of it was just lip service to open standards at a time when MS was under a lot of regulatory pressure.
qcnguy
That stuff happens because Microsoft don't know what the behavior is. It's just a bit which forks Word down some ancient code path that nobody understands and isn't properly documented. Given the huge effort that would have gone into producing this thousand plus page specification, it is understandable why the spec writers would have given up at times.
I expect most people posting on Hacker News would not be able to write a satisfactory specification for their own software if they are working on a large legacy code base.
nneonneo
Microsoft seems to have known that they could ram basically anything through a standards body, so they presumably didn't bother to actually try and simplify the standard. Instead, it's basically an XML serialization of their older binary formats, complete with all of the quirks and bugs that have to be emulated for 100% compatibility.
To be fair, we're talking about a product line with over 35 years of history here. Cruft in the format builds up but can never be removed, so long as you commit to strong backwards compatibility - which Microsoft has always done.
Fun trivia: many of the old binary formats use a meta-format called OLE2 (Object Linking and Embedding). The file format is essentially a FAT-style filesystem packed into a single file, with a FAT allocation chain, file blocks aligned to a specific power-of-two size, etc. This made saving files very fast, but raised the possibility of internal fragmentation (where individual sub-files are scattered over many non-contiguous blocks); hence, users were recommended to "Save As..." periodically for large/complex files to optimize the internal storage.
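To make the chained-block idea concrete, here is a toy sketch in Python. It is not the real MS-CFB layout: the sector contents, the `END_OF_CHAIN` sentinel value, and the `read_stream` helper are all made up for illustration. It only shows how a sub-file is reassembled by following an allocation table, and why a fragmented sub-file ends up scattered across non-contiguous blocks.

```python
# Toy model of a FAT-style allocation table: fat[i] names the sector that
# follows sector i in the same sub-file. Not the actual MS-CFB on-disk layout.
END_OF_CHAIN = -2  # hypothetical sentinel chosen for this sketch


def read_stream(sectors, fat, start):
    """Concatenate the sectors of one sub-file by following its chain."""
    data = bytearray()
    seen = set()
    sector = start
    while sector != END_OF_CHAIN:
        if sector in seen:
            raise ValueError("cycle in allocation chain")
        seen.add(sector)
        data += sectors[sector]
        sector = fat[sector]
    return bytes(data)


# Two sub-files interleaved in one container:
# A occupies sectors 0 -> 2 -> 3, B occupies sectors 1 -> 4.
sectors = [b"AAAA", b"BBBB", b"aaaa", b"1234", b"bbbb"]
fat = [2, 4, 3, END_OF_CHAIN, END_OF_CHAIN]

stream_a = read_stream(sectors, fat, 0)
stream_b = read_stream(sectors, fat, 1)
```

Rewriting the container so that each chain occupies consecutive sectors is, roughly, what the periodic "Save As..." achieved.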
rtpg
"You have to standardize the format"
"OK we will standardize our serialization format"
It's... I guess malicious compliance, though also if you don't care about interop you're not going to try to abstract away your internal application structures, are you!
I appreciate the standard existing rather than it not existing. Trying to have the standard exist in this way has always felt like an uphill battle, and at least now there's _something_.
Just you will have a better time if you emulate how Office does things. But you have a bit more documentation to go along with it.
flomo
Officially now MS-CFB (I think). OLE2 generally refers to a predecessor to COM, and not just the file format.
https://learn.microsoft.com/en-us/openspecs/windows_protocol...
Lammy
I love this screen that shows you exactly why they named it “Office Open” XML: https://i.imgur.com/hnj3sdv.png
It was a pretty big deal when OpenOffice.org's 2.0 release came with OpenDocument as the default file format. Very easy for someone to misread this MSOffice screen and click on OOXML expecting it to mean OO.o.
zamadatix
Oh wow. I must have clicked through that page dozens of times, selecting "Keep Current" after a quick scan and thinking the 2nd option was talking about Open Office.
CobrastanJorji
The OOXML fight is near and dear to my heart because, when it happened, I was a baby developer, and I cared about the issue for some reason I can barely recall, and I found an expert on the issue on Twitter. That guy would regularly tweet about everything that was going on and the problems with the spec and the shenanigans, and I was one of the, like, 20 people who were hanging on his every word. And sometimes he'd talk about beekeeping instead. It was my first introduction to Twitter at its best. You got these unfiltered whole views of the lives and concerns of real people who were, in part, experts at what you cared about. So sometimes you had to listen to them talk about other random stuff they thought was neat. And that's great!
lorenzohess
> In my view, OOXML is indeed complex, convoluted, and obscure. But that’s likely less about a plot to block third-party compatibility and more about a self-interested negligence: Microsoft prioritized the convenience of its own implementation and neglected the qualities of clarity, simplicity, and universality that a general-purpose standard should have.
The author only provides arguments for "self-interested negligence". He provides no counterarguments to the claim that OOXML complexity was "a plot to block third-party compatibility". Therefore, he cannot compare "negligence" and "a plot". Therefore, his claim that "negligence" is a better explanation for OOXML complexity than "a plot" cannot follow.
To restate:
> If we dig into the context of OOXML’s creation, it can be argued that harming competitors was not Microsoft’s primary aim.
The author provides no evidence to support this claim. The evidence provided in this section at most supports the claim that "negligence" played a role in OOXML complexity. From this evidence alone, no conclusions can be drawn about the "primariness" of "negligence" vs "harming competitors".
unscaled
Unless we ever get the full archive of Microsoft emails, meeting minutes and recordings from all the secret microphones they didn't have in their meeting rooms, I don't think you can ever disprove this claim. It's generally impossible to conclusively disprove conspiracy theories, because you could always claim you're only showing there are no documents proving the conspiracy, but there are no documents disproving it.
The author is just implicitly appealing to Occam's razor here, as people often do in the face of accusations of a plot. They can show that Microsoft has backed the ANSI accreditation of ODF[1] and eventually implemented support for ODF import and export in Office, but that's not enough to prove there was no conspiracy.
Instead, the article just provides a very plausible explanation for the complexity in OOXML. Does this explanation thoroughly disprove the accusations of a plot? Clearly not. Is it more plausible than a great plot to crush a bunch of competitors that had no market share and kill a better standard document format that Microsoft did end up implementing in Office? Yes. This is probably as far as we can get.
[1] https://news.microsoft.com/source/2007/05/16/microsoft-votes...
airstrike
Both things can be true. It had a genuine purpose, but the fact that Microsoft will go out of its way to not implement anything better and less temperamental is an indication it's not really open. There's plenty of evidence of Microsoft dragging their feet at playing nice with the rest of the office ecosystem.
I'm not saying they shouldn't do that as a company maximizing shareholder value. But we should all collectively groan every time the topic comes up, not applaud them.
to11mtm
I mean sometimes you gotta ship a product (and remember back then, that meant masters for CDs,) and it's perfectly possible that whatever team was in charge of handling 'conversion' stuff for old format (remember that old excel formats have OLE type cruft going on, the sorts of things that led to VBA viruses, imagine what other functionality needs to be implemented) just plain had to take shortcuts in uglifying the spec to support all the jank.
tannhaeuser
Worth keeping in mind that the native MSO formats were using "structured storage", a horrible binary chunked serialization and metadata format from an era where binary embedding of document streams in other application documents via "Object linking and embedding" (OLE, see also Apple's OpenDoc format) was deemed desirable, with zero consideration given to third-party apps and segment formats tied to C++ data structures. Compared to that, OOXML is still huge progress, and while it's complex I wouldn't say it's maliciously so.
The Shakespeare example is a good one where the sentence is split into multiple spans to apply style rules yet the bare text content could be extracted by just removing all XML tags. Whereas the ODF variant is actually less recommendable as it relies on an unnecessarily complex formatting and text addressing language on top of XML.
The article says
> Even at a glance [ODF's markup] is more intelligible. Strip the text: namespaces and it’s nearly valid HTML. The only thing that needs explaining is that ODF doesn’t wrap To be with a dedicated “bold” tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.
but this definitely makes things more complex for data exchange compared to OOXML.
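As a rough illustration of the "just remove all XML tags" point, here is a small Python sketch. The fragment below is a hand-written stand-in in the style of WordprocessingML, not the markup quoted in the article, so treat its exact shape as an assumption. It shows a bold run and a plain run collapsing back into one sentence once the tags are dropped.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a WordprocessingML paragraph: one bold run ("To be")
# followed by one plain run. The exact structure is an assumption for this sketch.
fragment = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>To be</w:t></w:r>'
    '<w:r><w:t xml:space="preserve">, or not to be</w:t></w:r>'
    '</w:p>'
)

root = ET.fromstring(fragment)
# itertext() yields only character data, so the formatting markup disappears.
plain_text = "".join(root.itertext())
```

The styling lives entirely in the run properties, so nothing outside the paragraph has to be consulted to recover the text.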
quotemstr
Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?
> zero consideration given to third-party apps and segment formats
The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!
> C++ data structures
What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's.
But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so.
In many ways, we've regressed.
> Markup
The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.
Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.
Mikhail_Edoshin
I remember Spreadsheet ML, an older format compatible with Excel. It had a subset of features, I think, but it was a rather powerful subset: formatting, formulae, multiple sheets. And it was rather simple. (Had a silly design mistake: for some reason MS gave namespace to attributes, which is not necessary, only for rather specific purposes).
Another XML standard from MS that also seems relatively simple is XPS, a PDF alternative. But it uses Open Packaging and that is somewhat hard to read.
themerone
It's as complex as it needs to be to losslessly convert old binary office files.
A better format would have made us geeks a lot happier, but the average user just wants things to work the way they always have.
Gigachad
My possibly incomplete understanding was that the original office file format was basically just raw dumps of the internal C data structures. Not designed or specified in any way.
The XML version likely carries a lot of baggage having to be compatible with that.
lmkg
They weren't "just" raw dumps of internal C structures. It takes careful design work to dump raw memory in a usable fashion. Consider: You can't just write a pointer to disk and then read it back next week.
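A minimal Python sketch of the pointer problem, with a made-up record layout (the `RECORD` struct and the `dump`/`load` helpers are hypothetical, not anything from Office): links between records are stored as record indices rather than memory addresses, so the bytes still make sense when read back in a different process.

```python
import struct

# Hypothetical record layout for this sketch: a 4-byte little-endian value,
# then the 4-byte index of the next record (-1 means "no next record").
RECORD = struct.Struct("<ii")


def dump(values):
    """Serialize a linked list; links are record indices, not memory addresses."""
    out = bytearray()
    for i, value in enumerate(values):
        next_index = i + 1 if i + 1 < len(values) else -1
        out += RECORD.pack(value, next_index)
    return bytes(out)


def load(blob):
    """Rebuild the list by walking the stored indices from record 0."""
    values = []
    index = 0 if blob else -1
    while index != -1:
        value, index = RECORD.unpack_from(blob, index * RECORD.size)
        values.append(value)
    return values


blob = dump([10, 20, 30])
restored = load(blob)
```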
Binary MS Office format is a phenomenal piece of engineering to achieve a goal that's no longer relevant: fast save/load on late-80's hard drives. Other programs took minutes to save a spreadsheet, Excel took seconds. It did this by making sure its in-memory data structures for a document could be dumped straight to disk without transformation.
But yes, this approach carries a shitton of baggage. And that achievement is no longer relevant in a world where consumer hardware can parse XML documents on the fly.
I have heard it argued, though, that the "baggage" isn't the file format. It's actually the full historical featureset of Excel. Being backwards-compatible means being able to faithfully represent the features of old Excel, and the essential complexity of that far outweighs the incidental complexity of how those features were encoded.
charlieyu1
I once dug through the 5000 page specification. There was a lot of useless stuff that only old Microsoft Word supported like WordArt items.
bawolff
Does office no longer support word art?
When I was a kid, making cool wordart headers for school projects was like 50% of what we used office for.
lblume
Office does still support word art. [0]
[0]: https://support.microsoft.com/en-us/office/insert-wordart-c5...
bjoli
How else would terminally uncool church youth groups advertise in their local church?
It might be a Swedish thing, but I always laugh when I see them. Not nearly as common today as ten years ago, but I see them a couple of times a year.
PaulHoule
People who were developing "office" programs in the early 1990s were thinking about the problem of serializing arbitrary object graphs into documents to support technologies like
https://en.wikipedia.org/wiki/Object_Linking_and_Embedding
where you could embed an Excel spreadsheet inside a Word document or actually embed any of a large range of COM objects into a Word document, which on one hand is a really appealing vision but on the other hand means you have to have and be able to run all the binaries for all the objects that live in a document, which ties the whole thing to Windows.
PDF is a different sort of document format which privileges viewing over editing but it is also really about serializing an object graph when it comes down to it and then having various sorts of filters and transformations and a range of objects defined in the spec as opposed to open ended access to an object library.
This kind of system has a lot of overlap with the serdes problem you get with RPC frameworks that used to be filed under "Sun RPC sucks", "DCOM Sucks", "CORBA Sucks" and "WS-* Sucks". Those things are mostly forgotten these days because well... they sucked, and now the usual complaint is "protobuf sucks" but you rarely hear "JSON sucks" because it gave up on graphs for trees, if you don't have a type system people can't say the type system sucks, and the only thing that really sucks about it is that people won't just use ISO 8601 dates but you can always rise above that by just using ISO 8601 dates without asking for permission. But we all agree YAML sucks.
That points to any flexible document format sucking, but OOXML also sucks because it has lots of poorly specified and obscure features that amount to "format this the same way Word 95 formatted it if you used a certain obscure option".
From a glass is half empty perspective it sucks because it's close to impossible to make a Microsoft Office replacement that renders 100% of documents 100% correctly.
From a glass is half full perspective it rules because if you want to make a Python script that writes an Excel file with formulas it is easy. If you want to extract the images out of a Word document it is easy because a Word document is just a ZIP file. If you want to do anything with an OOXML document short of writing an Office replacement it's actually a pretty good situation.
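For instance, pulling the images out can be sketched with nothing but Python's standard `zipfile` module. This assumes the images sit under `word/media/`, which is where Word conventionally puts them; the `list_media` helper and the stand-in archive built below are made up so the sketch runs without a real document.

```python
import io
import zipfile


def list_media(docx_bytes):
    """Return {entry name: bytes} for everything under word/media/ in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        }


# Build a tiny stand-in archive (not a valid Word document) to exercise the helper.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG fake image bytes")

media = list_media(buf.getvalue())
```

With a real file you would pass `open("report.docx", "rb").read()` instead of the stand-in bytes (`report.docx` being whatever document you have on hand).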
com2kid
> but you rarely hear "JSON sucks" because it gave up on graphs for trees
Except it also spawned a thousand custom formats that include $ref support of some type, so we are right back to having graphs. :-D
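A quick Python sketch of what that looks like, using a made-up document shape and a simplified `$ref` convention (real formats such as JSON Schema use richer reference syntax): once the references are resolved, the "tree" contains a cycle.

```python
import json

# Hypothetical document: each node may point at another node via {"$ref": key}.
doc = json.loads("""
{
  "nodes": {
    "a": {"name": "A", "next": {"$ref": "b"}},
    "b": {"name": "B", "next": {"$ref": "a"}}
  }
}
""")


def resolve(nodes):
    """Replace {"$ref": key} placeholders with the referenced node objects, in place."""
    for node in nodes.values():
        for field, value in node.items():
            if isinstance(value, dict) and "$ref" in value:
                node[field] = nodes[value["$ref"]]
    return nodes


graph = resolve(doc["nodes"])
```

After resolution, `graph["a"]["next"]["next"]` is the same object as `graph["a"]`, which plain JSON cannot express on its own.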
eirikbakke
Microsoft Office has many features. Each feature must be reflected in the file format somehow.
(I wonder what the specification-pages-to-man-years ratio is...)
> Why Microsoft’s Motive Wasn’t Deliberate Sabotage
I absolutely do not agree.
Not only is the standard overly complex, Microsoft also indulged in all sorts of unscrupulous activities to corrupt various National Standards Organisations to get it approved through the ISO <https://en.wikipedia.org/wiki/Standardization_of_Office_Open...>, which is clear evidence of malicious intent.
This is a quote from Richard Stallman:
> The specifications document was so long that it would be difficult for anyone else to implement it properly. When the proposed standard was submitted through the usual track, experienced evaluators rejected it for many good reasons. Microsoft responded using a special override procedure in which its money bought the support of many of the voting countries, thus bypassing proper evaluation and demonstrating that ISO can be bought.