A love letter to the CSV format

701 comments · March 26, 2025

jwr

I so hate CSV.

I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.

I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.

Then there are the TSV and semicolon-separated variants.

Did I mention that field quoting was optional?

And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
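As an illustration of the cleanup that pushes onto the consumer, here is a minimal Python sketch that splits a prettified amount like "34 593,12 USD" back into a number and a currency code. It assumes a space (or non-breaking space) thousands separator and a comma decimal separator, which is only a guess at the bank's formatting:

    from decimal import Decimal

    def parse_pretty_amount(text):
        # "34 593,12 USD" -> (Decimal('34593.12'), 'USD')
        amount, currency = text.rsplit(" ", 1)
        amount = amount.replace("\u00a0", "").replace(" ", "").replace(",", ".")
        return Decimal(amount), currency

    print(parse_pretty_amount("34 593,12 USD"))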

mjw_byrne

I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".

Free database like Postgres? Perfect every time.

Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".

recursive

"Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.

mjw_byrne

Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.

Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".

ethbr1

> Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...

It's because above a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...

... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.

And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.

(Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)

deepsun

Just a nitpick about consultancy shops -- I've had a chance to work in one in Eastern Europe and noticed that its approach to quality was way better than the client's. It also helped that the client paid by the hour, so the consultancy was incentivized to spend more time on refactoring, improvements and testing (with constant pushback from the client).

So I don't buy the sentiment about consultancy shops; it always boils down to engineers and incentives.

raxxorraxor

Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.

There would have been many better separators... but good idea to bring formatting into it as well...

Moru

There was a long period of my life that I thought .csv meant cemicolon separated because all I saw was cemicolon separated files and I had no idea of the pain.

JadeNB

Although it is spelled "semicolon," so that doesn't quite fit.

mrweasel

Not sure if they still do this, but Klarna would send us ", " separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most CSV parsers don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for kronor and one for öre, and that most fields would need to have a leading space removed.
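One possible workaround, sketched below, is to use a parser that accepts a regular expression as the separator, so ", " (comma followed by a space) becomes the delimiter; pandas does this with its Python engine. The file name is made up, and this assumes the export is otherwise regular (no quoted fields containing ", "):

    import pandas as pd

    # Multi-character separators are treated as regular expressions by the
    # Python engine, so ", " only splits on comma-followed-by-space.
    df = pd.read_csv("klarna_export.csv", sep=", ", engine="python")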

raxxorraxor

Microsoft did this very extensively. Many non-English versions of Excel save CSV files with a semicolon as the separator, and it was probably handled differently in normal Excel files too. But it goes even further: it affects their scripting languages to this day, including newer ones like their BI script (I forget the name of the language). For example, parameters of function calls aren't separated by ',' anymore; ';' is used instead. But only in the localized versions.

That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...

zaxomi

There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.
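A minimal sketch of what that could look like, using the unit separator (0x1F) between fields and the record separator (0x1E) between records; it still assumes the data itself never contains those bytes:

    US, RS = "\x1f", "\x1e"  # unit separator, record separator

    def dump(rows):
        return RS.join(US.join(fields) for fields in rows)

    def load(blob):
        return [record.split(US) for record in blob.split(RS) if record]

    data = [["id", "name"], ["1", "Doe, John"]]
    assert load(dump(data)) == data  # commas and newlines need no escaping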

Sami_Lehtinen

TSV should do it for you. Been there done that.

danso

> Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.

To be fair, that's not a problem with CSV but with the provider's lack of data literacy.

remram

Yeah, you can also use Parquet/JSON/protobuf/XLSX and store numbers as strings in this format. CSV is just a container.

ahoka

But somehow CSV is the PHP of serialization formats, attracts the wrong kind of developers and projects.

TRiG_Ireland

I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.

Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.

Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.

recursive

> I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.

Have you tried RFC 4180?

https://www.ietf.org/rfc/rfc4180.txt
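For one concrete reference point, here is a sketch of emitting RFC-4180-style output with Python's standard csv module (CRLF record delimiters, fields quoted only when they contain the delimiter, quotes or line breaks). The file name is arbitrary:

    import csv

    with open("out.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
        w.writerow(["id", "comment"])
        w.writerow(["1", 'said "hi", then left'])
    # The second row is written as: 1,"said ""hi"", then left"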

hermitcrab

I've written a commercial, point and click, data wrangling tool (Easy Data Transform) that can deal with a lot of these issues:

-different delimiters (comma, semi-colon, tab, pipe etc)

-different encodings (UTF8, UTF16 etc)

-different line endings (CR, LF, CR+LF)

-ragged rows

-splitting and merging columns

And much more besides.

However, if you have either:

-line feeds and/or carriage returns in data values, but no quoting

or

-quoting, but quotes in data values aren't properly handled

Then you are totally screwed and you have my sympathies!

imtringued

I agree and as a result I have completely abandoned CSV.

I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.

Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.

Hackbraten

Off the top of my head:

https://learn.microsoft.com/en-us/office/troubleshoot/excel/...

Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.

And the year-1904 issue is just the one example that I happen to know.

I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.

Gormo

Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.

eternauta3k

But everyone already has a GUI installed for editing xlsx files...

byyll

I was recently writing a parser for a weird CSV. It had multiple header column rows in it as well as other header rows indicating a folder.

jll29

The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown"). And the omission of that means I'm going to have to call this out for its bias (but then it is a love letter, and love makes blind...).

Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping and UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities that cannot be resolved by machine).

Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON, I would say JSON is programmatically the most convenient to use regarding client code, but its lack of types makes it less broadly useful than SGML, which is rightly used by e.g. airlines for technical documentation and by digital humanities researchers to encode/annotate historic documents, for which it is very suitable, but which programmatically puts more burden on developers. You can't have it all...

XML is simpler than SGML, has perhaps the broadest scope and a good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), though I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments about what to use each time, but some people go ideological.

(Waiting for someone to write a love letter to the infamous Windows INI file format...)

sramsay64

In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?

jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.

Also, encoding things like IEEE NaN/Infinity and raw byte arrays has to be done in proprietary ways.

d0mine

JSON Lines is not JSON; it is built on top of it. The .jsonl extension can be used to make that clear: https://jsonlines.org/

joquarky

Back in my day it was called NDJSON.

The industry is so chaotic now we keep giving the same patterns different names, adding to the chaos.

thiht

> How do you handle multiple copies of the same key

That’s unambiguously allowed by the JSON spec, because it’s just a grammar. The semantics are up to the implementation.

sbergot

Interestingly, other people in this thread are answering the opposite.

yrro

Internet JSON (RFC 7493) forbids objects to have members with duplicate names.

_flux

As it says:

I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.

So it's not JSON, but a restricted version of it.

I wonder if use of these restrictions is popular. I had never heard of I-JSON.

zzo38computer

> How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?

This is also an issue due to the way the order of keys works in JavaScript, too.

> record separator separated JSON, newline separated JSON.

There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.

> Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.

Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.

I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)

A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)

diekhans

Plus the 64-bit integer problem (really 53-bit safe integers), due to JS not having a proper integer type.

d0mine

JSON itself is not limited to either 52- or 64-bit integers.

    integer = -? (digit | onenine digit+)
    
https://json.org/

0cf8612b2e1e

That’s a JavaScript problem, not JSON.

dtech

Most good parsers have an option to parse to integers or arbitrary precision decimals.

tobyhinloopen

bigint exists

realitysballs

They do specifically mention this:

“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”

thayne

They even seem to think it is a good thing. But I don't see how having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, JSON).

Gormo

That's true of the vast majority of protocols that people use in real life to exchange data. We're using one of them right now, in fact.

rkagerer

Waiting for someone to write a love letter to the infamous Windows INI file format

I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.

I wrote my own library for encoding/writing/reading various datatypes and structures into INI files, in a couple of different languages, and it served me well for years.

isoprophlex

TOML is nice like that... elegant like INI, only with lists.

eddythompson80

> instead of being left behind all over a bloated registry

Really? I think the idea of a central, generic, key-value pair database for all the settings on a system is probably the most elegant reasonable implementation there could be.

The initial implementation of the Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (whatever that means) of the registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner-type software that promise to "speed up your computer" if you just pay $6.99.

How many rows do you need in a sqlite database for it to be "bloated"?

xp84

I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.

estebank

Whenever I ask myself "should I use YAML?" I answer myself "Norway".

lelanthran

> I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.

It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.

TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.

ElectricalUnion

Given how YAML does magic and sometimes accidental type conversions of potentially nested objects, I think TOML is the well-defined successor to .ini

consp

YAML is readable? No way: there are too many ways to do the same thing, and nested structures are unclear to the untrained eye (what is a list? what is nested?), let alone that indentation in large files is an issue, especially with the default 2-space unreadable standard so many people adhere to.

YAML simple? Its spec is larger than XML's... Parsing of numbers and strings is ambiguous, leading zeros are not strings but octal (implicit conversion...). Lists as keys? Oh ffs, and you said readable. And do not get me started about "Yes" being a boolean; reminds me of the MS Access localizations which had different decimal values for true and [local variant of true] (1 vs -1).

Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me; XML is fine, though unreadable.

afiori

I think GRON[1] would fit the bill better

[1] https://github.com/tomnomnom/gron

otabdeveloper4

People who say that CSV is "simpler" are talking about whatever format Excel exports.

Also these people have only ever had to deal with the American Excel localization.

So yeah, with the caveat of "only ever use Excel and only ever the American edition" CSV is pretty nice.

mbnielsen

As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).

That said, I don't really know of any alternative that won't be handled even worse by my colleagues...

cgio

Also keeping in mind all the locales where comma is the decimal point…tsv for the world.

matwood

And all the 'simple' formats start failing when dealing with blocks of text.

lelanthran

To be honest, I'm wondering why you are rating JSON higher than CSV.

> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,

There is, actually, RFC 4180 IIRC.

> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.

"There are many flavours that deviate from the spec" is a JSON problem too.

> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).

Also a problem in JSON

> Quoting, escaping, UTF-8 support are particular problem areas,

Sure, but they are no more nor no less a problem in JSON as well.

IanCal

Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.

There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.

Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.

This is because for datasets, almost exclusively, the file is machine written and rarely messed with.

CSV files have all kinds of separators and quote characters; some parsers accept multi-line records and some don't; people sort files, which mostly works until there's a multi-line record. All kinds of line endings, encodings, and mixed encodings where people have combined files.

I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.

Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.

Also they're broken for clients if the first character is a capital I.

dspillett

WRT JSON:

> Headers are in each line

This might be my old “space and network cost savings” reflex, which is a lot less necessary these days, kicking in, but that feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.

Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.

> utf8 has never been an issue for me

The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).

> Csv files have all kinds of separators

I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
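On the BOM point: one hedged workaround, if you know the file is destined for Excel, is to write it with Python's "utf-8-sig" codec, which prepends the BOM that Excel keys on (the file name and columns here are just for illustration):

    import csv

    # "utf-8-sig" writes the UTF-8 BOM, so Excel doesn't fall back to Win-1252.
    with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "city"])
        writer.writerow(["Żółć", "Łódź"])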

recursive

> There aren't vast numbers of different JSON formats.

Independent variations I have seen:

* Trailing commas allowed or not

* Comments allowed or not

* Multiple kinds of date serialization conventions

* Divergent conventions about distinguishing floating point types from integers

* Duplicated key names tolerated or not

* Different string escaping policies, such as, but not limited to, "\n" vs "\x0a"

There are bazillions of JSON variations.

lelanthran

Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.

I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.

I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.

And yet, they work.

The number of people uploading excel files to be processed or downloading excel files for local visualisation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.

And yet, they work.

I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.

gpvos

What's the problem with capital I?

SkyBelow

I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).

I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.

If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)

Someone

> There is, actually, RFC 4180 IIRC.

Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:

- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.

- “Spaces are considered part of a field and should not be ignored”

- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.

no_wizard

INI was for a long time the seemingly preferred configuration format in the Python community, as I recall.

I haven't been a full-time Python dev in some time, though; it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built-in INI parser and serializer.

pcwalton

> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files

There is such a document: RFC 4180. It may not be a good document, but it does exist.

mjw_byrne

CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:

1. It's tricky to parallelise processing of CSV.

2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).

So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
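A minimal sketch of that escape-only scheme, assuming the same field/record layout as CSV but with backslash, comma and newline as the only escaped characters, so the meaning of a delimiter never depends on distant bytes:

    def escape(field):
        return (field.replace("\\", "\\\\")
                     .replace(",", "\\,")
                     .replace("\n", "\\n"))

    def serialize(rows):
        return "\n".join(",".join(escape(f) for f in row) for row in rows)

    def parse(text):
        rows, row, field, i = [], [], [], 0
        while i < len(text):
            c = text[i]
            if c == "\\" and i + 1 < len(text):
                nxt = text[i + 1]
                field.append("\n" if nxt == "n" else nxt)  # \n, \, and \\ decode here
                i += 2
            elif c == ",":                      # unescaped comma: field boundary
                row.append("".join(field)); field = []
                i += 1
            elif c == "\n":                     # unescaped newline: record boundary
                row.append("".join(field)); field = []
                rows.append(row); row = []
                i += 1
            else:
                field.append(c)
                i += 1
        row.append("".join(field))
        rows.append(row)
        return rows

    data = [["a,b", "line1\nline2"], ["back\\slash", ""]]
    assert parse(serialize(data)) == data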

koolba

JSON serialized without extra white space with one line per record is superior to CSV.

If you want CSV-ish, enforce an array of strings for each record. Or go further with actual objects and non-string types.

You can even jump to an arbitrary point and then seek till you see an actual new line as it’s always a record boundary.

It’s not that CSV is an invalid format. It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
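A small sketch of that shape, assuming each record is a JSON array dumped without extra whitespace; any newline inside a value is serialized as \n, so a raw line break is always a record boundary:

    import json

    rows = [["id", "note"], ["1", "line1\nline2"]]
    blob = "\n".join(json.dumps(r, separators=(",", ":")) for r in rows)

    # A raw line break in the blob only ever occurs between records.
    parsed = [json.loads(line) for line in blob.splitlines()]
    assert parsed == rows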

derriz

> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.

This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.

nukem222

> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.

Yea and it's still a partially-parseable shit show with guessed values. But we can and could have and should have done better by simply defining a format to use.

otikik

Meanwhile, Excel exports to CSV as “semicolon separated values” depending on your OS locale

hajile

Can you point me to a language with any significant number of users that does NOT have a JSON library?

I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).

The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).

kentm

I’ve found that the number of parsers that don’t handle multiline records is pretty high though.

recursive

> Despite the lack of any sort of specification

People keep saying this but RFC 4180 exists.

consteval

It's widely, but inconsistently, supported. The behavior of importers varies a lot, which is generally not the case for JSON.

autoexec

> it's easily the most widely supported data format in existence in terms of tools and language support.

Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.

benwilber0

JSON is a textual encoding no different than CSV.

It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")

I have seen people try to build up JSON strings like that too, and then you have all the same problems.

So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.

int_19h

The problem with CSV is that there's no clear standard, so even if you do reach for a library to parse it, that doesn't ensure compatibility.

gthompson512

> It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")

You really super can't just split on commas for csv. You need to handle the string encodings since records can have commas occur in a string, and you need to handle quoting since you need to know when a string ends and that string may have internal quote characters. For either format unless you know your data super well you need to use a library.
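A small illustration of that point: a naive split breaks on a quoted field with an embedded comma and quote, while a real CSV parser (here Python's stdlib csv module) handles it:

    import csv, io

    line = '1,"Doe, John ""JD""",NYC'

    print(line.split(","))
    # ['1', '"Doe', ' John ""JD"""', 'NYC']  -- wrong

    print(next(csv.reader(io.StringIO(line))))
    # ['1', 'Doe, John "JD"', 'NYC']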

kec

Until you have a large amount of data & need either random access or to work on multiple full columns at once. Duplicated key names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing for it isn't amenable to streaming.

magicalhippo

> JSON [...] with one line per record

Couple of standards that I know of that does this, primarily intended for logging:

https://jsonlines.org/

https://clef-json.org/

Really easy to work with in my experience.

Sure some space is usually wasted on keys but compression takes care of that.

nmz

This is simply not true, parsing json v csv is a difference of thousands of lines.

klysm

You serialize the keys on every row which is a bit inefficient but it’s a text format anyway

zeroimpl

Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.

packetlost

Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.

koolba

You can easily represent it as an array:

    ["foo","bar",123]
That’s as tabular as CSV but you now have optional types. You can even have lists of lists. Lists of objects. Lists of lists of objects…

juliansimioni

What happens when you need to encode the newline character in your data? That makes splitting _either_ CSV or LDJSON files difficult.

koolba

The new line character in a JSON string would always be \n. The new line in the record itself as whitespace would not be acceptable as that breaks the one line record contract.

Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.

afiori

Actually, even whitespace-separated JSON would be a valid format, and if you forbid JSON documents that are a single integer or float, then even just concatenating JSON gives a valid format, as JSON is then a prefix-free language.

That is[0], if a string s of length n is valid JSON, then there is no prefix s[0..i] for i < n that is valid JSON.

So you could just consume as many bytes you need to produce a json and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document)

That is, [][]""[][]""[] is unambiguous to parse[1]

[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root

[1] still this is not a good format as a single missing " can destroy the entire document.

mananaysiempre

When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.

nmz

It's tricky, but simple enough. The RFC states that " must be used for quoting, and inserting a literal " is done with "". This makes knowing where a record ends difficult, since you must keep a buffer holding the entire record.

How do you do this simply? You read each line, and if there's an odd number of ", then you have an incomplete record, and you keep appending lines until the total number of " is even. After having the full string, parsing the fields correctly is harder, but you can do it in regex or PEGs or a disgusting state machine.
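A sketch of that heuristic: a physical line only ends a record once the running count of double quotes is even (escaped quotes come doubled, so they add two and keep the parity right):

    def records(lines):
        buf, quotes = [], 0
        for line in lines:
            buf.append(line)
            quotes += line.count('"')
            if quotes % 2 == 0:          # balanced quotes: record is complete
                yield "\n".join(buf)
                buf, quotes = [], 0

    raw = ['a,"multi', 'line",b', 'c,d']
    print(list(records(raw)))
    # ['a,"multi\nline",b', 'c,d']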

andrepd

That would be solved by using the ASCII control chars Record Separator / Unit Separator! I don't get how this is not widely used as standard.

sundarurfriend

I remembered seeing a comment like this before, and...

comment: https://news.ycombinator.com/item?id=26305052

comment: https://news.ycombinator.com/item?id=39679662

"ASCII Delimited Text – Not CSV or Tab Delimited Text" post [2014]: https://news.ycombinator.com/item?id=7474600

same post [2024]: https://news.ycombinator.com/item?id=42100499

comment: https://news.ycombinator.com/item?id=15440801

(...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D

You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).

scythe

>we just need some major text editor to get the ball rolling and start making some attempts to understand these characters

Many text editors offer extension APIs, including Vim, Emacs, Notepad++. But the ideal behavior would be to auto-align record separators and treat unit separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input record separator as shift+space and unit separator as shift+enter.

nukem222

Excel needs to default its export to this. Unfortunately excel is proprietary software and therefore fucked.

ryandrake

Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!

mjevans

The _entire_ point of a CSV file is that it's fully human readable and write-able.

The characters you mention could be used in a custom delimiter variant of the format, but at that point it's back to a binary machine format.

mbreese

And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.

This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.

mikepurvis

As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.

Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.

kevmo314

What do you mean? I just push the Record Separator key on my keyboard.

/s in case :)

EGreg

Can't there be some magic sequence (like two unescaped newlines) to start a new record?

jandrese

If there were visible well known characters that could be printed for those and keys on a keyboard for inputting them we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.

thesuitonym

> If there were visible well known characters that could be printed...

...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.

andrewflnr

Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.

marxisttemp

I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.

orthoxerox

But what if one of your columns contains arbitrary binary data?

ndsipa_pomu

You'd likely need to uuencode it or similar as CSV isn't designed for binary data.

zoover2020

Perhaps the popularity, or lack thereof? More often than not, the bad standard wins the long-term market.

masklinn

> I don't get how this is not widely used as standard.

It requires bespoke tools for editing, and while CSV is absolute garbage it can be ingested and produced by most spreadsheet software, as well as databases.

taeric

Reminds me of a fatal flaw of yaml. Turns out truncating a yaml file doesn't make it invalid. Which can lead to some rather non-obvious failures.

nextts

What is the failure mode where a yaml file gets truncated? They are normally config files in Git. Or uploaded to S3 or Kubernetes etc.

CSV has the same failure mode. As does HTML. (But not XML)

taeric

I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, and since it was YAML it was more difficult to figure out that that was what had happened. I think it was in AWS, but I can't find the story, so I can't really remember.

And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.

Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.

bobmcnamara

Bad merges.

pasc1878

Same is true of CSV/TSV.

taeric

I think you are a bit more likely to notice in a CSV/TSV, as it is unlikely to truncate at a newline?

Still, fair point. And is part of why I said it is a flaw, not the flaw. Plenty of other reasons to not like YAML, to me. :D

dietr1ch

I don't understand why CSV became a thing instead of TSV, or a format using the nowadays-weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, or the file/group/record/unit separators.

It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.

niccl

Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it

oever

It's evitable by stating the number of bytes in a field and then the field. No escaping needed and faster parsing.

wvenable

Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.

noosphr

The difference is that the comma and newline characters are much more common in text than 0x1F and 0x1E, which, if you restrict your data to alphanumeric characters (which you really should), will never appear anywhere else.

solidsnack9000

In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.

mjw_byrne

Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.

I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.

dietr1ch

The characters would likely be unique, maybe even by the spec.

Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine. The problem with CSV is that commas aren't easy to recognize, because they might be within a single- or double-quoted string, or might just be a separator.

Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.

wodenokoto

It's a lot easier to type comma than control characters and it's a lot easier to view comma than a tab (which might look like a space).

For automated serialization, plain text formats won out, because they are easy to implement a minimal working solution (both import and export) and more importantly, almost all systems agree on what plain text is.

We don't really have Apple formatted text, that will show up as binary on windows. Especially if you are just transferring id's and numbers, those will fall within ascii and that will work even if you are expecting unicode.

pasc1878

The ASCII control characters do not display well, and are not easily editable, in a plain text editor.

I did always use TSV and I think the original use of CSV could have used that.

But TSV would still have many issues.

solidsnack9000

What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.

hajile

I don't understand why CSV became a thing in the 70s when S-expressions existed since at least the 50s and are better in practically every way.

fragmede

I want to push Sqlite as a data interchange format! It has the benefit of being well defined, and can store binary data, like images for product pictures, inside the database. Not a good idea if you're trying to serve users behind a web app, but as interchange, better than a zip file with filenames that have to be "relinked".

Someone1234

For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.

Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:

- Too much freedom is bad.

Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and/or scenario the format itself can support.

This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.

Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.

My favorite interchange formats are, in order:

- CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)

More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.

ttyprintk

I’m not sure you need to support every SQLite feature. I’m unconvinced of binary formats, but the .dump output is text and simple SQL.

nomel

Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?

What about Parquet and the like?

sadcodemonkey

I love the wisdom in this comment!

fragmede

Interesting! I've dealt with file interchange between closed source (and a couple open source) programs, but that was a while ago. I've also had to deal with csvs and xslts between SaaS vendors for import export of customer's data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.

What you say is fair. CSV is underspecified, though: there's no company called CSV that's gonna sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some exist, but there are so many naive implementations written from first principles, because how hard could it be? Output records and use a comma and a newline (of which there are three possible options).)

How often do you deal with multiple CSV files representing the multiple tables that are actually what's used by vendors internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".

With sqlite, we're in the Internet age and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but sqlite isn't XML and isn't CSV.

It's hard to have this discussion in the abstract so I'll be forthcoming about where I'm coming from, which is CSV import/export between vendors for stores, think like Doordash to UberEATS. The biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.

Ultimately the North Star I want to push towards is moving beyond CSVs, because it'll let people who currently have to hand-edit the CSV so every row imports properly not have to do that. Problems would still exist, but instead you'd have to deal with, well, what you see with XML files, which has its shortcomings, as you mention, but at least once how a vendor is using it is understood, individual records are generally understandable.

I was moved so I don't deal with import export currently, but it's because sqlite is so nice to work with on personal projects where it's appropriate that I want to push the notion of moving to sqlite over csvs.

ThatPlayer

> not a good idea if you're trying to serve users behind a web app

I use Sqlite for a static site! Generating those pages out as individual static files would involve millions of individual files. So instead I serve up a sqlite database over HTTP, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!

The only feature I'm missing is compression, which is complicated because popular extensions like sqlite-zstd are written in Rust.

[0] https://github.com/mmomtchev/sqlite-wasm-http

0cf8612b2e1e

One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.

msla

CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!

lelanthran

> So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.

For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.

mjw_byrne

I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.

Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.

evnp

Anyone with a love of CSV hasn't been asked to deal with CSV-injection prevention in an enterprise setting, without breaking various customer data formats.

There's a dearth of good resources about this around the web, this is the best I've come across: https://georgemauer.net/2017/10/07/csv-injection.html

Suppafly

That mostly breaks down to "excel is intentionally stupid with csv files if you don't use the import function to open them" along with the normal "don't trust customer input without stripping or escaping it" concerns you'd have with any input.

evnp

That was my initial reaction as well – it's a vulnerability in MS software, not ours, not our problem. Unfortunately, reality quickly came to bear: our customers and employees ubiquitously use excel and other similar spreadsheet software, which exposes us and them to risk regardless where the issue lies. We're inherently vulnerable because of the environment we're operating in, by using CSV.

"don't trust customer input without stripping or escaping it" feels obvious, but I don't think it stands up to scrutiny. What exactly do you strip or escape when you're trying to prevent an unknown multitude of legacy spreadsheet clients that you don't control from mishandling data in an unknown variety of ways? How do you know you're not disrupting downstream customer data flows with your escaping? The core issue, as I understand it, stems from possible unintended formula execution – which can be prevented by prefixing certain cells with a space or some invisible character (mentioned in the linked post above). This _does_ modify customer data, but hopefully in a way that unobtrusive enough to be acceptable. All in all, it seems to be a problem without a perfect solution.

togakangaroo

Hey, I'm the author of the linked article, cool to see this is still getting passed around.

Definitely agree there's no perfect solution. There's some escaping that seems to work ok, but that's going to break CSV-imports.

An imperfect solution is that applications should be designed with task-driven UIs so that they know the intended purpose of a CSV export and can make the decision to escape/not escape then. Libraries can help drive this by designing their interfaces in a similar manner. Something like `export_csv_for_eventual_import()`, `export_csv_for_spreadsheet_viewing()`.

Another imperfect solution would be to ... ugh...generate exports in Excel format rather than CSV. I know, I know, but it does solve the problem.

Or we could just get everyone in the world to switch to emacs csv-mode as a csv viewer. I'm down with that as well.

conductr

I’ve almost always found the simple way around Excel users not knowing how to safely use CSV files is to just give the file another extension: I prefer .txt or .dat

Then, the user doesn't have Excel as the default program for opening the file and has to jump through a couple of safety hoops.

imtringued

If your customers and employees are using Excel then stop going against the grain with your niche software developer focused formats that need a lot of explanations.

I need to interface with a lot of non-technical people who exclusively use Excel. I give them .xlsx files. It's just as easy to export .xlsx as it is to export .CSV and my customers are happy.

aorth

Someone filed a bug report on a project I work on, saying that it was a security vulnerability that we don't prefix cell values with a single quote (') when the cell content contains certain values like an equal sign (=). They said this can cause Excel to evaluate the content and potentially run unsafe code.

I responded that this was Excel's problem, not ours, and that nobody would assign a CVE to our product for such a "vulnerability". How naive I was! They forwarded me several such CVEs assigned to products that create CSVs that are "unsafe" for Excel.

Terrible precedent. Ridiculous security theater.

evnp

There are a lot of these sorts of bug reports running around, to the point that Google's bug bounty program has classified them as invalid: https://bughunters.google.com/learn/invalid-reports/google-p...

I agree with the characterization ("security theater") of these bug reports. The problem is that the intentions of these reports don't make the potential risk less real, depending on the setting, and I worry that the "You're just looking for attention" reaction (a very fair one!) leads to a concerning downplaying of this issue across the web.

As a library author, I agree this very well may not be something that needs to be addressed. But as someone working in a company responsible for customers, employees, and their sensitive information, disregarding this issue disregards the reality of the tools these people will invariably use, downstream of software we _are_ responsible for. Aiming to make this downstream activity as safe as possible seems like a worthy goal.

yrro

The next version of CVSS needs to add a metric for these kind of bullshit non-vulnerabilities so that we can ignore them at source.

hdjrudni

I didn't know about the formula injection, I just knew that Excel and Sheets mangle my dates every time and it drives me bonkers. Why is that the default? It makes no sense.

beautron

I also love CSV for its simplicity. A key part of that love is that it comes from the perspective of me as a programmer.

Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.

I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?

But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.

Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.

The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that it's easy to write specific CSV parsers on the spot.
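
For example, here's the kind of throwaway parser I mean for one specific file – say it turned out to be semicolon-delimited, with doubled quotes inside quoted fields and no embedded newlines (a sketch; you adjust it after peeking at the actual file):

    def parse_row(line):
        fields, field, in_quotes = [], [], False
        i = 0
        while i < len(line):
            c = line[i]
            if in_quotes:
                if c == '"':
                    if i + 1 < len(line) and line[i + 1] == '"':
                        field.append('"')   # doubled quote -> literal quote
                        i += 1
                    else:
                        in_quotes = False
                else:
                    field.append(c)
            elif c == '"':
                in_quotes = True
            elif c == ';':
                fields.append(''.join(field))
                field = []
            else:
                field.append(c)
            i += 1
        fields.append(''.join(field))
        return fields

    with open("export.csv", encoding="utf-8") as f:   # made-up filename
        rows = [parse_row(line.rstrip("\r\n")) for line in f]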

Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.

There's something so appealing to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.

MarceliusK

Totally agree that its biggest strength is how approachable it is for quick, ad hoc tooling. Need to convert formats? Join two datasets? Normalize a weird export? CSV gives you just enough structure to work with and not so much that it gets in your way.

dkarl

> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file

This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.

Also, I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody gives you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.

Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.

beautron

> If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."

But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.

I consider CSV's level of "easy" more reliable.

And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.

Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.

beautron

> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.

But the premise of CSV is so simple that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped quote.

I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."

And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.

If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), then I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.
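
As an aside, Python's stdlib can even guess the first two quirks for you – a small sketch using csv.Sniffer (the filename is made up):

    import csv

    with open("mystery.csv", newline="", encoding="utf-8") as f:
        dialect = csv.Sniffer().sniff(f.read(4096))   # guesses delimiter/quotechar
        f.seek(0)
        rows = list(csv.reader(f, dialect))

    print(dialect.delimiter, dialect.quotechar)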

999900000999

The best part about CSV: anyone can write a parser in 30 minutes, meaning I can take data from the early '90s and import it into a modern web service.

The worst part about CSV: anyone can write a parser in about 30 minutes, meaning it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear, JSON and YAML also have issues with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.

Xelbair

until you find someone abusing XSD schemas, or designing a "dynamically typed" XML... or sneaking extra data into comments - it happened to me way more often than it should.

MathMonkeyMan

You know what grinds my gears about using XSD for message definitions? Namespaces. Namespaces are a good idea and were done well in XML, as far as I can see, but with XSD you run into this [problem][1]:

Namespaces are used to qualify tags and attributes in XML elements. But they're also used by XSD to qualify the names of types defined in the schema. A sequence element's type is indicated by the value of its "type" attribute. The attribute value is a string that is the namespace-qualified name of the type.

So, if you want to change the alias of an XML namespace in an XSD schema, you can't just use your XML library's facilities for namespace management. You also have to go find the "type" attributes (but not all of the "type" attributes), parse their values, and do the corresponding alias change in the type name.

Don't use a string for a thing that is not a string! I guess in XML attributes you have no choice. XAML improved on the situation a bit.

[1]: https://github.com/dgoffredo/stag/tree/master/src/stag/xsd-u...

999900000999

My condolences. Any open standard runs the risk of this happening. It's not a problem I think we'll ever solve.

marcosdumay

In principle, if you make your standard extensible enough, people should stop sneaking data into comments or strings.

... which makes the GP's problem so much more amusing. XML was the last place I'd expect to see it.

a3w

For some side projects, I would like to use XML, but cannot read the spec, as that costs money.

So I cannot trust XML in depth, and depend on using a library that bought the spec and hopefully adheres to it.

account-5

I like CSV for the same reasons I like INI files. It's simple, text-based, and there's no typing encoded in the format; it's just strings. You don't need a library.

They're not without their drawbacks, like no official standard etc., but they do their job well.

I will be bookmarking this like I have the ini critique of toml: https://github.com/madmurphy/libconfini/wiki/An-INI-critique...

I think the first line of the toml critique applies to CSV: it's a federation of dialects.

deepsun

Similarly, I once loved schemaless data stores. They are so much simpler!

Until I worked with them quite a bit and realized that there's always a schema in the data; otherwise it's just random noise. The question is who maintains the schema, you or the DBMS.

Re. formats -- the usefulness comes from features (like format enforcement). E.g. you could skip .ini entirely and just go with lines in a text file, but somewhere you still need to convert those lines into your data; there's no way around it. The question is who's going to do that (and report sane error messages).

nukem222

Schemaless can be accomplished with well-formed formats like JSON, XML, YAML, TOML, etc. From the producer side these are roughly equivalent interfaces. There's zero upside to using CSVs except to comfort your customer. Or maybe you have centered your actual business on importing CSVs, in which case you should probably not exist.

nukem222

> It's simple

My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (OK, probably not true if you think about HTML etc.; just take this as hyperbole). Just use a well-specified format and save your brain cells.

Occasionally, we must work with people who can only export to CSV. This does not imply CSV is a reasonable way to represent data compared to other options.

PaulHoule

The HTML 5 spec says exactly how you're supposed to deal with broken HTML files.

nukem222

Yes, that is a single spec with correspondingly small importance. Generally, parsing HTML remains extremely difficult.

kokada

> CSVs are the only "structured" format nobody can claim to parse 100%

You don't need to, though, since in most cases you just need to support whatever CSV format the tool you're handling produces, unless of course you're trying to write the next Excel/Google Sheets competitor.

samdung

CSV works because CSV is understood by non-technical people who have to deal with some amount of technicality. CSV is the friendship bridge that prevents technical and non-technical people from going to war.

I can tell an MBA guy to upload a CSV file and I'll take care of it. Imagine I tell him I need everything in a PARQUET file!!! I'm no longer a team player.

imtringued

This is incorrect. Everyone uses Excel, not CSV. There are billions of people on this planet who know what to do with an .xlsx file.

Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column, etc. It's a non-trivial problem that forces you down to a few million people.

.CSV is a niche format for inexperienced software developers.

tim333

Indeed, my main use is that most financial services will output your records in CSV, although I mostly open that in Excel, which sometimes gets a bit confused.

MarceliusK

"Friendship bridge" is the perfect phrase

tucnak

This is so relatable to all data-eng people from a SWE background!

Thanks

Foobar8568

Among the shit I have seen in CSV: no quotes around strings (including strings containing a return character), innovative separators, creative date and number formats, no escaping of quotes within strings, extra rows added by the reporting tool used to export the CSV, etc.

bell-cot

True. But most of those problems are pretty easy for the non-technical person to see, understand, and (often) fix. Which strengthens the "friendship bridge".

(I'm assuming the technical person can easily write a basic parsing script for the CSV data - which can flag, if not fix, most of the format problems.)

For a dataset of any size, my experience is that most of the time & effort goes into handling records which do not comply with the non-technical person's beliefs about their data. That data came from (say) an old customer database - and between bugs in the DB software and abuse by frustrated, lazy, or just ill-trained CSRs, there are all sorts of "interesting" things that need cleaning up.

PaulHoule

I wish this was a joke. I'm always trying to convince data scientists with a foot in the open source world that their life will be so much better if they use parquet or Stata or Excel or any other kind of file but CSV.

On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to ASCII.
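
It's easy to check for yourself – a back-of-the-envelope sketch, assuming numpy (exact timings vary by machine, but the text round trip loses badly):

    import time
    import numpy as np

    a = np.random.rand(2000, 2000)

    t0 = time.perf_counter(); np.savetxt("m.csv", a, delimiter=","); t1 = time.perf_counter()
    b = np.loadtxt("m.csv", delimiter=","); t2 = time.perf_counter()
    c = a @ a; t3 = time.perf_counter()   # ~16 GFLOP of real work

    print(f"serialize {t1-t0:.2f}s, parse {t2-t1:.2f}s, matmul {t3-t2:.2f}s")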

jcattle

What advantages does Excel give you over CSV?

PaulHoule

Accurate data typing (never confuse a string with a number)

Maybe it's circular, but: it always loads correctly into Excel, and if you want to load it into a spreadsheet you can add text formatting and even formulas, checkboxes, and other stuff that can be a lot of fun.

kec

That is very much not true, Excel does type coercion, especially around things that happen to look like dates: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

k_bx

I've recently been developing a Raspberry Pi-based solution that works with telemetry logs. The first implementation used an SQLite database (with WAL log) – only to find it corrupted after just a couple of days of extensive power on/off cycles.

I've since started looking at Parquet files – which turned out not to be friendly to append-only operations. I ended up writing events into IPC files, which then periodically get "flushed" into the Parquet files. It works and it's efficient – but man, is it non-trivial to implement properly!
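
For anyone curious, the pattern looks roughly like this – a minimal sketch assuming the IPC files are Arrow IPC streams written with pyarrow (the schema and filenames here are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("ts_ms", pa.int64()), ("value", pa.float64())])

    # Append phase: keep a stream writer open and write small batches as
    # telemetry arrives; the streaming format has no footer to corrupt.
    sink = pa.OSFile("buffer.arrows", "wb")
    writer = pa.ipc.new_stream(sink, schema)
    writer.write_batch(pa.record_batch(
        [pa.array([1710000000000]), pa.array([21.5])], schema=schema))
    writer.close()
    sink.close()

    # Flush phase: read the buffered batches back and rewrite as Parquet.
    with pa.OSFile("buffer.arrows", "rb") as src:
        table = pa.ipc.open_stream(src).read_all()
    pq.write_table(table, "telemetry.parquet")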

My point here is: for a regular developer – CSV (or jsonl) is still the king.

theoryofx

> The first implementation used an SQLite database (with WAL log) – only to find it corrupted after just a couple of days of extensive power on/off cycles.

Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.

That should be all that's required if you're using an NVMe SSD.
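
For reference, from Python's sqlite3 that's just (the database name is made up):

    import sqlite3

    conn = sqlite3.connect("telemetry.db")
    conn.execute("PRAGMA journal_mode=WAL")   # keep the WAL
    conn.execute("PRAGMA synchronous=FULL")   # fsync on every commit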

But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.

I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.

k_bx

The Raspberry Pi uses a microSD card. Just using fsync after every write would be a bit devastating, but batching might've worked OK in this case. Anyway, too late to check now.

bobmcnamara

There's definitely a place for it. I ran into the same problem with a battery-powered event logger: basically alternate between sleep-until-event and sample-until-event-over.

SQLite was fine until the realities of that environment hit.

0) I need to save the most data over time, and my power budget is unpredictable due to environmentals.

1) When should I commit? A SQLite commit per insert slows things down, impacts battery life, and impacts sample rate. Practically you could get away with batching all data for a small period.

2) SQLite is slow to repair databases. A partially written file would often take longer to repair than we had battery to run.

A CSV-based format filled that niche. The first column was the line's column count, to support firmware upgrades. The last column was a line checksum. Another column indicated whether this line was the last one for an event. The parser skips corrupted lines/entries.
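
Roughly the shape of that kind of line format – a sketch (the field order and naming here are illustrative, not the exact layout):

    import zlib

    def encode_line(fields, is_last):
        cols = [str(len(fields) + 3)] + [str(f) for f in fields] + ["1" if is_last else "0"]
        body = ",".join(cols)                      # fields assumed comma-free
        return body + "," + format(zlib.crc32(body.encode()), "08x") + "\n"

    def decode_line(line):
        body, _, checksum = line.rstrip("\n").rpartition(",")
        if format(zlib.crc32(body.encode()), "08x") != checksum:
            return None                            # bad checksum: skip line
        cols = body.split(",")
        if not cols or not cols[0].isdigit() or int(cols[0]) != len(cols) + 1:
            return None                            # truncated line: skip it
        return cols[1:-1], cols[-1] == "1"         # (fields, was_last_for_event)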

bobbylarrybobby

If sqlite ends up corrupted, why wouldn't a CSV? What happens if the system dies partway through a write?

k_bx

It would, but it'd be very easy to skip corrupted lines. With SQLite, I ended up losing all data from the corrupted entry onward.

theLiminator

> I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!

I think the industry standard for supporting this is something like Iceberg or Delta. It's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.

notatallshaw

What isn't fun about CSV is quickly written parsers and serializers repeatedly making the common mistake of not handling, or badly handling, quoting.

For a long time I was very wary of CSV, until I learnt Python and started using its excellent csv standard library module.
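
It handles exactly the round trip that naive split/join code gets wrong – a small sketch:

    import csv, io

    row = ["val1", "val2", "unexpected,comma", 'embedded "quote"']

    buf = io.StringIO()
    csv.writer(buf).writerow(row)     # quotes/escapes only where needed
    print(buf.getvalue())             # val1,val2,"unexpected,comma","embedded ""quote"""

    assert next(csv.reader(io.StringIO(buf.getvalue()))) == row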

codeulike

That's true, in recent years it's been less of a disaster, with lots of good CSV libraries for various languages. In the 90s CSV was a constant footgun; perhaps that's why they went crazy and came up with XML.

Macha

Even widely used libraries that you might expect to get it right don't. (Like Spark, which uses Java-style backslash escaping.)

goatlover

Why not Pandas, since you're working with tabular data anyway?

notatallshaw

When I first started, installing packages that required compiling native code was not easy on either my work Windows machine or the old Unix servers.

So I largely stuck to the Python standard library where I could, and most of the operations I had at the time did not require data analysis on a server; that was mostly done in a database. Often the job was validating and transforming the data before inserting it into a database.

As the Python packaging ecosystem matured and I found I could easily use Pandas everywhere, it still wasn't the first thing I'd reach for. And occasionally it'd be very helpful to iterate through files with the csv module, using only a few MB of memory, vs. loading the entire dataset into memory with Pandas.

FridgeSeal

Because maybe they’re not doing something column oriented? Because it has a notoriously finicky API? A dozen other reasons?

hajile

The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.

Compare this CSV

    field1,field2,fieldN
    "value (0,0)","value (0,1)","value (0,n)"
    "value (1,0)","value (1,1)","value (1,n)"
    "value (2,0)","value (2,1)","value (2,n)"
To the directly-equivalent JSON

    [["field1","field2","fieldN"],
     ["value (0,0)","value (0,1)","value (0,n)"],
     ["value (1,0)","value (1,1)","value (1,n)"],
     ["value (2,0)","value (2,1)","value (2,n)"]]
The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.

    person,val2,val3,valN
    fname=john&lname=doe&age=55&children=[jill|jim|joey],v2,v3,vN
And in these cases, JSON's objects are WAY better.

Because CSV is so simple, it's common for people to avoid using a parsing/encoding library at all. Over the years, I've run into this particular kind of issue a bunch:

    //outputs `val1,val2,unexpected,comma,valN` which has one too many items
    ["val1", "val2", "unexpected,comma", "valN"].join(',')
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).

Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc.? Does it use CR, LF, or CRLF? Does it allow escaping quotation marks? Does it allow quotation marks to escape commas? Is it UTF-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.

JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.

While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.

munchler

> the directly-equivalent JSON

    [["field1","field2","fieldN"],
     ["value (0,0)","value (0,1)","value (0,n)"],
     ["value (1,0)","value (1,1)","value (1,n)"],
     ["value (2,0)","value (2,1)","value (2,n)"]]
I think it's a big stretch to use that JSON for comparison. In practice, one is much more likely to see this:

    [
      {
        "field1": "value (0,0)",
        "field2": "value (0,1)",
        "fieldN": "value (0,n)"
      },
      {
        "field1": "value (1,0)",
        "field2": "value (1,1)",
        "fieldN": "value (1,n)"
      },
      {
        "field1": "value (2,0)",
        "field2": "value (2,1)",
        "fieldN": "value (2,n)"
      }
    ]

hajile

While most people would prefer the second version, the first version is also valid JSON and will definitely see use when you want/need JSON but want to reduce data over the wire, though you'd probably still see a wrapper object like:

    {
      "headers": ["field1","field2","fieldN"],
      "values": [["value (0,0)","value (0,1)","value (0,n)"],
                 ["value (1,0)","value (1,1)","value (1,n)"],
                 ["value (2,0)","value (2,1)","value (2,n)"]]
    }

croes

> Because CSV is so simple, it's common for them to avoid using a parsing/encoding library.

A bit unfair to compare CSV without a parser library to JSON with a library.

hajile

Essentially nobody uses JSON without a library, but tons of people (maybe even most people) use CSV without a library.

Part of the problem here is standards. There are a TON of encoding variations, all using the same .csv extension. Making a library that can accurately detect exactly which one is in use is a big problem once you leave the handful of most common variants. If you are doing subfield encoding, you are almost certainly on your own for decoding at least that part of your system.

JSON has just one standard and everyone adheres to that standard which makes fast libraries possible.

pphysch

The flexibility of JSON is a downside when you just want to stream large volumes of row-oriented tabular data

hajile

If you want to stream large volumes of row-oriented data, you aren't reading it yourself, and you should be using a binary format, which is going to be significantly smaller (especially for numeric data).

pphysch

Yeah, that would be the next step in optimization. In the meantime, raw-text CSV streaming (for data that isn't purely numeric) is still extremely fast and easy to set up.

bobmcnamara

> JSON parsers will not only output the expected values every time

Unless you need appendability, but then you should probably just use NDJSON/JSONL for a lot of cases.