Tabular-JSON: Combining the best of JSON and CSV

Last couple of years I regularly think about what could come after JSON. As a developer, I’m using JSON data on a daily basis, and I develop JSON Editor Online, the most popular online JSON Editor. JSON is a great and hugely popular data format. Reason for that is that it is so simple, unambiguous, and ubiquitous. However, it is not perfect: it is relatively verbose for tabular data. For tabular data, CSV is really a great data format because it is so simple and compact, and I use CSV a lot too. Both JSON and CSV have their limitations. Can we come up with a new data format that combines the best of both JSON and CSV?

JSON is typically used for two very different purposes: as a data format and as a configuration format. In this experiment we’ll focus on a solution for data, not for configuration files. Of course, we can come up with a solution that works well for both data and config, however, such an all-in-one solution is most likely more complex than needed for “just” data purposes. We want to explore here what the needs are for a minimalistic data format.

The limitations of JSON and CSV

So, what’s the problem concretely? What are limitations of JSON and CSV that people encounter?

The upsides of JSON are that the format is simple, ubiquitous, unambiguous, and that it supports nested data structures (thanks to supporting object and array). The upsides of CSV are that, like JSON, it is simple and ubiquitous. But unlike JSON, CSV is very compact and has streaming support.

Downsides of JSON are that it can be verbose because of two reasons. The first is that JSON is verbose for tabular data due to repeating the object properties for every record in an array, for example [{"id":1,"name":"joe"},{"id":2,"name":"sarah"}] which can be represented much more compact in CSV as: id,name\n1,joe\n2,sarah. The second reason is that it is verbose due to the required quotes around keys and string values. Furthermore, JSON supports only a limited data types. It has for example no type for dates. And it doesn’t support streaming data like with logs, since an array must have a start and an end (NDJSON solves that). Lastly, JSON doesn’t support comments, though this is not an issue when using it for data, only when using it for configuration.

The downsides of CSV are that it only supports tabular data, unlike JSON which supports any nested data structure. A practical issue with CSV is that it is ambiguous in multiple ways. Based on just a CSV file itself, there is no way to know whether the first line is a header, and what separator is used (comma, semicolon, tab, …). Also, CSV has no data types at all (string/number/boolean/etc.). Lastly, CSV parsers and applications have different ways on how they interpret whitespacing around values, so there is ambiguity there.

Summarizing, there are two downsides that jump out:

The first is that JSON can be quite verbose, mostly for tabular data. That is exactly where CSV shines.
The second is that CSV is ambiguous and requires additional information or a popup where you specify whether the document has a header and what separator is used before you can open the document. That is where JSON shines: it is unambiguous.

Can we get rid of these downsides by combining the best of JSON and CSV?

Characteristics of the ideal data format

Depending on your use case, the requirements of a suitable data format can differ. Let’s not go in details about that here. Looking at the huge popularity of JSON and CSV though, let’s see what properties they both share, making them so successful. Based on that, the ideal data format will have the following characteristics:

Human-readable (text based)
Simple (like JSON and CSV)
Unambiguous (no configuration like with CSV needed)
Compact data format (like CSV, unlike JSON)
Supports streaming (like CSV and NDJSON)
Supports any nested data structure (like JSON)
Easy to parse/stringify
Easy to edit manually
White-spacing has no meaning
Extra: being a superset of JSON or CSV would be very powerful

The idea: JSON with tables

Let’s start cooking and take JSON and CSV as the main ingredients.

What we can do is create a superset of JSON which adds a new structure named “table”. We will call this data format Tabular-JSON. This allows us to embed CSV like data structures, both nested and at the root level. Also, we can make the quotes around keys and values optional to make the data more compact and easier to read. This can look as follows:

{
  name: rob,
  hobbies: [
    swimming,
    biking
  ],
  friends: ---
    id, name,  address.city, address.street
    2,  joe,   New York,     "1st Ave"
    3,  sarah, Washington,   "18th Street NW"
  ---,
  address: {
    city: New York,
    street: "1st Ave"
  }
}

Try it out

Important to note in the example above is that:

There is a nested table wrapped inside --- blocks.
The table contains nested object properties like address.city and address.street, which means that address is an object with nested properties city and street.
The quotes around keys and values is optional. It is required when it contains a delimiter, starts or ends with a whitespace, or starts with a digit.
The contents of the table can be aligned in columns.

At root level, the enclosing table block delimiter --- can be omitted. This results in streamable table rows, similar to CSV:

id, name,  address.city, address.street
2,  joe,   New York,     "1st Ave"
3,  sarah, Washington,   "18th Street NW"

Note that it is not possible to make a standard that is a superset of both JSON and CSV, since they use a different and incompatible way of escaping control characters.

Design decisions

The specification of this new data format, Tabular-JSON, is currently a work in progress. There are a number of topics that need to be decided upon, weighing the simplicity of the format against useful features. Please feel welcome to join the discussion.

Support for optional quotes (or not)
Support for comments (or not)
Support for dates (or not)
Support for fractions (or not)
Support for Infinity and NaN (or not)
Support for trailing commas (or not)

Next steps

So, what is next? The first next step is collecting feedback, to see whether the idea of Tabular-JSON resonates with people. We need to make choices for each of the design decisions and work out the specification accordingly. Then, we need a couple of reference implementations of a parser, for example in JavaScript and Python, so you can actually start using the data format. We also need a language plugin for IDE’s like VS Code.

By now, you’re probably thinking: this sounds interesting, but can this data format every become a success and dethrone JSON and CSV? Realistically speaking, the chance of that is about zero. The data formats JSON and CSV are so ubiquitous that it is nearly impossible to replace them. This is of course a classic chicken-egg problem: people will only use a data format when is available in all of their development environments, but application builders will only add support for a data format when it is used a lot. It can probably only happen when a major application like Excel, DataBricks, or RStudio implements native support for it. Time will tell.

Tabular-JSON: Combining the best of JSON and CSV

The limitations of JSON and CSV

Characteristics of the ideal data format

The idea: JSON with tables

Design decisions

Next steps

Contents

Recent posts

Categories

About