I wasn’t going to write a CSV parser, really…

Mar 24, 2023 | Parse

…but I ended up writing one of the fastest JavaScript CSV parsers out there.

Some time ago, I implemented CSV export functionality for JSON Editor Online. I used the excellent json2csv library to convert JSON to CSV. I do not remember why I chose that specific library, but it worked like a charm. Recently, I wanted to implement CSV import for JSON Editor Online. I expected this to be a no-brainer: just do the same as the CSV export, but the other way around. However, the CSV library I was using did not support CSV to JSON. No problem, I thought, I’ll just pick another CSV library; there are plenty of them. So I picked another popular library, only to find out that it was more than twice as slow as the library I used before, which was a bit too much for me. So I selected yet another library, only to find out that it was huge in size, too large for my taste to use in a web application in the browser. Other libraries are focused on node.js rather than the browser. All in all, it was quite a struggle to find a library that suited my needs, which is odd, since there are so many CSV libraries out there.

One limitation that most JavaScript CSV libraries have in common is the lack of built-in support for nested JSON objects. To me this is an essential feature, since nested data structures are so common in JavaScript. You can work around it with a pre- and post-processing step to flatten your data, but that feels like something a CSV library should do for you.

In the end, I was a bit disappointed and decided to just write my own CSV parser to see if it could address my needs: small, fast, and with support for nested objects. It felt a bit wrong to write yet another CSV parser since there are so many libraries out there already. But well, it’s also just fun to do, even if it doesn’t work out.

To my big surprise, it turned out that my new CSV parser, csv42, was actually faster than the popular CSV libraries out there. Since CSV is such an old data format, I had expected that the existing libraries would be highly optimized and unbeatable. So, I hadn’t expected to see my library beating them with ease.

In this article I’ll explain what the library can do, show a benchmark comparing the library to the most popular CSV libraries, and explain what makes csv42 so fast.

csv42: a new CSV library

The new CSV library is called csv42 and is open source on GitHub: https://github.com/josdejong/csv42

Why this name? Well, you can imagine that it is hard to come up with a name for a CSV library that is not yet taken on npm 😅. All combinations of “csv” and “json” already exist. So, why the name csv42? Just because 42 is a beautiful number and to remind us that there is a whole universe of beautiful CSV libraries out there.

The API of csv42 is limited yet flexible and powerful. It has two main functions: csv2json and json2csv.

To convert JSON to CSV (all options are optional):

const json = [...]
const options = {
  header,     // whether or not to include a header row
  delimiter,  // a delimiter like , ; tab or space
  eol,        // end of line character like \n
  flatten,    // whether or not to flatten nested objects
  fields,     // a custom list with fields
  formatValue // a custom formatter to stringify values
}
const csv = json2csv(json, options)

To convert CSV to JSON (all options are optional):

const csv = '...'
const options = {
  header,     // whether the first line contains a header row
  delimiter,  // a delimiter like , ; tab or space
  nested,     // whether or not to parse fields into nested objects
  fields,     // a custom list with fields
  parseValue  // a custom parser to parse values (like into numbers)
}
const json = csv2json(csv, options)
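
For example, a round trip with nested data could look as follows. This is a minimal sketch: the CSV output in the comments reflects my understanding of the library’s defaults (nested fields mapped to dot-separated columns), so treat it as an illustration.

import { json2csv, csv2json } from 'csv42'

const users = [
  { id: 1, name: 'Joe', address: { city: 'Rotterdam', street: 'Main street' } },
  { id: 2, name: 'Sarah', address: { city: 'Amsterdam', street: 'Side street' } }
]

// nested fields become dot-separated columns like "address.city"
const csv = json2csv(users)
// id,name,address.city,address.street
// 1,Joe,Rotterdam,Main street
// 2,Sarah,Amsterdam,Side street

// and csv2json reconstructs the nested objects from those columns
const back = csv2json(csv)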

The library also comes with a set of utility functions; you can consult the documentation to read more about those.

Benchmark

To see how well csv42 performs, I’ve worked out a benchmark. First: doing benchmarks right is hard. Please let me know if you see a flaw in the benchmarks. 

The source code of the benchmark is publicly available at https://github.com/josdejong/csv42/tree/main/benchmark, so you can run the benchmarks yourself.

Benchmark setup

The benchmark tries to test realistic, real-world scenarios. Considerations were:

  • I opted not to tweak the configuration for individual libraries, but to go with the “standard” way of using them, since that is how they will normally be used
  • Since nested data structures are so common in JSON, the benchmark tests with two different data sets: one with flat JSON data, and one with nested JSON data. Not all CSV libraries support flattening nested JSON objects though. To make those libraries work with nested data, the data is flattened using the library flat (see the snippet after this list). The use of flat is denoted with a (+flat) suffix in the benchmark results. Note that flattening is only applied in the nested JSON benchmarks and NOT in the tests with flat JSON data.
  • The performance depends, of course, on what kind of data a single row contains and how much of it. The benchmark generates test data that contains a bit of everything: strings, numbers, and strings that need escaping. In the case of nested data, it also contains a nested object and a nested array.
  • The CSV libraries have different defaults when parsing values. Some leave all values as strings, which is fast (nothing needs to be parsed). Most parse, for example, numeric values into numbers, which is more useful in real-world scenarios. In the benchmarks, numeric values are parsed into numbers.
  • In the benchmark, we want to see the performance for both small and large amounts of data. Therefore, the benchmark runs tests for various numbers of rows.
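
To illustrate what this flattening step does, here is roughly how the flat library transforms a nested object (a sketch; the named flatten import assumes a recent version of flat):

import { flatten } from 'flat'

// nested fields are converted to dot-separated keys before the CSV conversion
flatten({ location: { geo: { latitude: 51.9280712 } } })
// -> { 'location.geo.latitude': 51.9280712 }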

The benchmarks are run for the following data sets (see next section for details):

  • Flat JSON to CSV
  • Flat CSV to JSON
  • Nested JSON to CSV
  • Nested CSV to JSON

The benchmarks are run for different numbers of rows:

  • 100 rows (16 KB)
  • 1000 rows (159 KB)
  • 10k rows (1.6 MB)
  • 100k rows (16 MB)

The benchmarks are run for the following CSV libraries:

  1. csv42
  2. json2csv
  3. csv
  4. papaparse
  5. fast-csv
  6. json-2-csv

Benchmark datasets

The actual generated datasets look as follows. We have two data sets (flat and nested), and we have two data formats (JSON and CSV).

Flat JSON data example:

[
  {"_type":"item","name":"Item 0","description":"Item 0 description in text","city":"Rotterdam","street":"Main street","latitude":51.9280712,"longitude":4.4207888,"speed":5.4,"heading":128.3,"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  {"_type":"item","name":"Item 1","description":"Item 1 description in text","city":"Rotterdam","street":"Main street","latitude":51.9280712,"longitude":4.4207888,"speed":5.4,"heading":128.3,"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  {"_type":"item","name":"Item 2","description":"Item 2 description in text","city":"Rotterdam","street":"Main street","latitude":51.9280712,"longitude":4.4207888,"speed":5.4,"heading":128.3,"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  ...
]

Flat CSV data example:

_type,name,description,city,street,latitude,longitude,speed,heading,"field with , delimiter","field with "" double quote"
item,Item 0,Item 0 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,"value with , delimiter","value with "" double quote"
item,Item 1,Item 1 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,"value with , delimiter","value with "" double quote"
item,Item 2,Item 2 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,"value with , delimiter","value with "" double quote"
...

Nested JSON data example:

[
  {"_type":"item","name":"Item 0","description":"Item 0 description in text","location":{"city":"Rotterdam","street":"Main street","geo":{"latitude":51.9280712,"longitude":4.4207888}},"speed":5.4,"heading":128.3,"sizes":{"small":0.9,"medium":3.4,"large":5.1},"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  {"_type":"item","name":"Item 1","description":"Item 1 description in text","location":{"city":"Rotterdam","street":"Main street","geo":{"latitude":51.9280712,"longitude":4.4207888}},"speed":5.4,"heading":128.3,"sizes":{"small":0.9,"medium":3.4,"large":5.1},"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  {"_type":"item","name":"Item 2","description":"Item 2 description in text","location":{"city":"Rotterdam","street":"Main street","geo":{"latitude":51.9280712,"longitude":4.4207888}},"speed":5.4,"heading":128.3,"sizes":{"small":0.9,"medium":3.4,"large":5.1},"field with , delimiter":"value with , delimiter","field with \" double quote":"value with \" double quote"},
  ...
]

Nested CSV data example:

_type,name,description,location.city,location.street,location.geo.latitude,location.geo.longitude,speed,heading,sizes.small,sizes.medium,sizes.large,"field with , delimiter","field with "" double quote"
item,Item 0,Item 0 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,0.9,3.4,5.1,"value with , delimiter","value with "" double quote"
item,Item 1,Item 1 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,0.9,3.4,5.1,"value with , delimiter","value with "" double quote"
item,Item 2,Item 2 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,0.9,3.4,5.1,"value with , delimiter","value with "" double quote"
...

Benchmark results

Here are the four charts showing the results for the two different datasets (flat and nested) in two directions (JSON to CSV and CSV to JSON). The charts are normalized to show the number of rows that are processed per second, so we can easily compare performance of small vs large documents.

[Chart: flat JSON to CSV benchmark]
[Chart: flat CSV to JSON benchmark]
[Chart: nested JSON to CSV benchmark]
[Chart: nested CSV to JSON benchmark]

Benchmark conclusions

Looking at the benchmark results, we can observe the following:

  • The csv42 library does really well. It is faster than the popular CSV libraries, sometimes by a big margin, and is only outpaced in a single case: converting a small, flat JSON document. The libraries json2csv and papaparse do very well too in many of the tests, though their performance drops quite a bit for larger documents when converting JSON to CSV.
  • Most CSV libraries seem optimized mainly for converting small JSON documents to CSV. There is much room for improvement in converting CSV to JSON and in processing large amounts of data.
  • Apart from csv42, none of the libraries performs well with nested data. Most libraries do not support flattening nested JSON data and thus are not designed with this in mind. The separate conversion step to flatten nested data is an expensive one.
  • Keep in mind that this benchmark only looks at performance. Besides performance, there are more reasons to consider a library: size, features, support, and more.

What makes csv42 so fast?

The csv42 library applies the following general techniques to achieve good performance.

Measure

The first important thing is to optimize based on measurements, not on assumptions. In general, it is best to write your code to be simple to read, understand, and maintain. Only when a performance issue turns up can you effectively pinpoint the bottleneck and try to improve it. When you start optimizing, make sure you have a good and trustworthy setup to benchmark the performance. Otherwise, it may turn out that you’re spending time optimizing something that isn’t actually causing performance issues.

Before you start optimizing your code, agree with yourself (and your team) on how much time you’re willing to spend on it. You can spend an endless amount of time improving little things, and it is very addictive. You can keep yourself busy, but at some point you’re micro-optimizing without adding much business value.

Minimize creation of temporary arrays and objects

Creating values, objects, and arrays takes time and memory, and having the garbage collector clean them up afterwards takes time too. We regularly create temporary objects to pass data from one function to another, and this often makes sense. To optimize performance, though, we should minimize the number of temporary objects and arrays.

One clear example is what we see in the benchmark with the CSV libraries that do not support nested JSON objects. In that case, we first convert the nested JSON objects into flattened objects, and then the CSV library loops over the flattened objects and generates the CSV output. The flattened JSON objects are temporary, and this intermediate step hurts the performance. csv42 takes a different approach: it works with getter and setter functions which can read nested fields and directly output them to CSV, without the need to create a temporary object.
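
To make this concrete, here is a minimal sketch of the getter idea (simplified, not the actual csv42 internals): build one getter function per column up front, and apply these directly to every row.

// build one getter per column once, instead of flattening every row
function createGetter(path) {
  // path is an array of keys, like ['location', 'geo', 'latitude']
  return (item) => {
    let value = item
    for (const key of path) {
      value = value?.[key]
    }
    return value
  }
}

const columns = [['name'], ['location', 'city'], ['location', 'geo', 'latitude']]
const getters = columns.map(createGetter)

// convert a row to CSV without creating an intermediate flattened object
// (escaping of delimiters and quotes is omitted for brevity)
function rowToCsv(item) {
  return getters.map((getValue) => getValue(item) ?? '').join(',')
}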

Minimize the amount of indirection and abstractions

Abstractions are useful to keep code maintainable in a growing code base. Sometimes, however, we add more abstractions than needed. Too much indirection and abstraction can make it harder to understand what a piece of code actually does, and it can negatively impact the performance. A small example is creating a class and setting internal state at construction, but subsequently only using a single method. Such code can be simplified to just a plain function.
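
A hypothetical before/after of that simplification:

// before: a class whose only purpose is to carry state for a single method
class RowFormatter {
  constructor(delimiter) {
    this.delimiter = delimiter
  }

  format(values) {
    return values.join(this.delimiter)
  }
}

// after: a plain function with the same behavior and less indirection
function formatRow(values, delimiter) {
  return values.join(delimiter)
}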

Be careful not to add abstractions because you think they will come in handy in the future (YAGNI); add them only when you really need them. Try to keep it simple (KISS). That makes code easier to understand, and better performance often comes as a free bonus.

Understand the characteristics of basic data structures

You should have an understanding of basic data structures, like lists and maps. For example, finding an item in an Array is much slower than finding an item in a Set, Map, or Object: in an Array, the find operation must read and compare every item, whereas a Set, Map, or Object uses a lookup table (a “hash table”) that has all items indexed and returns a result in near-constant time.

Another common performance bottleneck is caused by nested loops. In JavaScript, that can be nested for loops, or nested operations like .map(), .forEach(), .filter(), or .find(). A single loop becomes twice as slow when looping over twice as much data, but with nested loops, the number of operations grows quadratically. It is not always trivial to refactor a nested loop, but sometimes you can replace an inner loop with a lookup in a Map or Set, as in the sketch below.
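
A sketch of such a refactoring, using hypothetical data:

const items = [{ id: 'a' }, { id: 'b' }, { id: 'c' }]
const selectedIds = ['b', 'c']

// quadratic: for every item, the inner .includes() scans the whole array
const slow = items.filter((item) => selectedIds.includes(item.id))

// linear: build a Set once, then every lookup is near-constant time
const idSet = new Set(selectedIds)
const fast = items.filter((item) => idSet.has(item.id))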

In short: keep an eye out for these known bottlenecks and use data structures that play nice with the kind of operations you need to perform on your data.

Understand your process

Now, this is the hardest one, but it can also be the most important one.

To understand how you can improve the performance of your code, you have to know where your bottlenecks are. You can get insight into the bottlenecks of your code by profiling your application, which will reveal the “hot” code paths. Often, you can improve the performance with “micro” optimizations, but sometimes you need to zoom out, look at your process as a whole, and rethink it in order to get real performance gains.

For example, on a blog, you may fetch a list of posts from a certain period. Then, you need to enrich each post with information like the details of its author. That can introduce a lot of small additional requests, which may be slow. Since this request for author details is “hot” code, you may try to optimize it with caching and by optimizing the database requests. But optimizing the functions and endpoints as they are may not be enough: maybe you need to combine all these small requests into a single bulk request to get serious performance gains.
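
A sketch of that difference, assuming hypothetical fetchAuthor and fetchAuthors functions:

// before: one request per post, so n posts cause n round trips
const authors = await Promise.all(
  posts.map((post) => fetchAuthor(post.authorId))
)

// after: collect the unique ids and fetch them in a single bulk request
const uniqueIds = [...new Set(posts.map((post) => post.authorId))]
const authorsById = await fetchAuthors(uniqueIds)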

Another example is the function getNestedPaths(items) in the csv42 library. This function detects the CSV columns from the JSON data by looping over all items and collecting all nested fields. The first version of the library did this in two steps: first, for every item, collect the nested paths of that item, and next, merge all collected paths together. For every item, a temporary array with serialized paths was created, which is slow. To make this faster, the code was refactored into a different model: in the first step, a merged object is built up by looping over all items and appending their nested fields to this object when missing. In the next step, the paths are extracted from this merged object. This way, only a single temporary object is created, which is much faster. Refactoring individual functions as they were wasn’t enough: the operation as a whole needed to be turned upside down. The original function took 7 seconds to scan a large 405 MB JSON file. The optimized function took only 0.35 seconds (20 times faster), and it only took a couple of hours of work. That is definitely worth the effort.
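
A simplified sketch of the optimized approach (illustrative, not the actual csv42 code):

// step 1: merge the structure of all items into a single object
function mergeStructure(target, item) {
  for (const key of Object.keys(item)) {
    const value = item[key]
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      target[key] = target[key] ?? {}
      mergeStructure(target[key], value)
    } else if (!(key in target)) {
      target[key] = true // we only care about the structure, not the values
    }
  }
}

// step 2: extract the dot-separated paths from the merged object in one pass
function extractPaths(obj, parent = '') {
  return Object.keys(obj).flatMap((key) => {
    const path = parent ? `${parent}.${key}` : key
    return typeof obj[key] === 'object' ? extractPaths(obj[key], path) : [path]
  })
}

function getNestedPaths(items) {
  const merged = {}
  for (const item of items) {
    mergeStructure(merged, item)
  }
  return extractPaths(merged)
}

// getNestedPaths([{ a: 1, b: { c: 2 } }, { b: { d: 3 } }])
// -> ['a', 'b.c', 'b.d']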

Conclusion

It was a lot of fun to write the csv42 library. Sometimes fun is enough of a reason to spend time on something :). I’m happily surprised that the library performs so well. Ensuring the best performance when working with large datasets is one of the goals of JSON Editor Online, and csv42 is doing very well in that regard. Want to know more? Read about how to use CSV data in JSON Editor Online.

[2023-09-07] Update

It looks like we’re in a great time for innovation around CSV parsers. There is a new CSV parser, μDSV, which is much faster than csv42 and any other CSV library out there, roughly 2-4 times as fast. The library currently only supports converting CSV into JSON, but maybe in the future it will also support converting JSON to CSV. The reason this library is so fast is that it reads the first few lines of the CSV to detect the columns and data types, and then dynamically creates optimized JavaScript using new Function() to parse the rest. This requires the data to be homogeneous, which is often the case in practice. The use of new Function() (the sibling of eval) in JavaScript is generally considered dangerous, since it can make it possible for attackers to inject malicious code, and in some JavaScript environments its use is forbidden via a Content Security Policy (CSP). The author of μDSV is aware of these security risks and has taken measures to protect against code injection.
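
To give an idea of the technique, here is a toy illustration of such code generation (not μDSV’s actual code):

// after inspecting the header and a few rows, generate a specialized parser
const body = `
  const cols = line.split(',')
  return { name: cols[0], latitude: Number(cols[1]), longitude: Number(cols[2]) }
`
const parseRow = new Function('line', body)

parseRow('Item 0,51.9280712,4.4207888')
// -> { name: 'Item 0', latitude: 51.9280712, longitude: 4.4207888 }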