How to write a streaming parser

Dec 19, 2023 | Parse

A parser can turn stringified data (or code) into a structured object that can be operated on. For example a JSON parser can parse text into a JSON document, and a CSV parser parses text into a structured CSV document.

In some cases, the input data is so large that it cannot fit in memory. In that case, you cannot use a regular parser since it would run out of memory. Instead, you need a streaming parser. This article explains what a streaming parser is, when you need one, and how it works under the hood.

How does a parser work in the first place?

When writing a parser, the easiest way is to first load the data in memory, and then loop over it character by character to interpret the data. The code will follow the structure of the data. For example, a CSV file consists of rows separated by a newline, with on each row a list of comma separated values. The structure of a CSV parser can look as follows in pseudo code, aligning with the data structure of CSV:

FUNCTION parseCSV(text)
   FUNCTION parseRow()
       WHILE end of row is not reached
           CALL parseValue()
       END WHILE
       RETURN values
   END FUNCTION


   FUNCTION parseValue()
       WHILE end of value is not reached
           get the next character and add it to value
       END WHILE
       RETURN value
   END FUNCTION
  
   WHILE end of text is not reached
       CALL parseRow()
   END WHILE

   RETURN rows
END FUNCTION

Here, the function parseCsv has text as input, and returns the parsed rows with values as output. The parser contains a loop that repeats parsing a row one by one. The function to parse a row in turn contains a loop to parse each value in the row one by one. And the function to parse a value will loop over each character until the end of the value is reached.

The following interactive CodePen shows a minimal, non-streaming CSV parser. In the next sections we will implement the same parser in different ways, so we can compare pros and cons:

The CSV parser has only two levels: rows and values. This keeps things clear.

Let’s now shortly look at the JSON data format, which has nested structures. JSON contains object, array, string, number, boolean, null. It contains recursion, since every object and array can contain nested values including nested objects and arrays. In pseudo code, a JSON parser can look like:

FUNCTION parseJSON(text)
   FUNCTION parseValue()
       RETURN parseObject() OR ELSE
           parseArray() OR ELSE
           parseString() OR ELSE
           parseNumber() OR ELSE
           parseBoolean() OR ELSE
           parseNull()
   END FUNCTION


   FUNCTION parseObject()
       WHILE end of object is not reached
           CALL parseKey() followed by parseValue() and 
           add the pair to the object
       END WHILE
       RETURN the object
   END FUNCTION


   FUNCTION parseArray()
       WHILE end of array is not reached
           CALL parseValue() and add the item to the array
       END WHILE
       RETURN array
   END FUNCTION


   FUNCTION parseKey()
       RETURN parseString()
   END FUNCTION


   FUNCTION parseString()
       parse and return a string if any, otherwise return nothing
   END FUNCTION


   FUNCTION parseBoolean()
       parse and return a boolean value if any, otherwise return nothing
   END FUNCTION


   FUNCTION parseNull()
       parse and return a null value if any, otherwise return nothing
   END FUNCTION


   RETURN parseValue()
END FUNCTION

What is important to realize here is that this parser recurses into nested arrays and objects. This is what can make a streaming parser challenging to write. We will get back to that later.

When do I need a streaming parser?

Suppose that you have a large JSON file of 200 MB and a regular JSON parser. This means that you first need to load 200 MB of bytes into memory, and after that parse it, and at last, throw away the 200 MB of bytes from memory, and then you are left with the parsed data. It costs time to load a large amount of bytes in memory, and it is quite a waste that this is needed just temporarily. And if the data is larger than the total amount of available memory, it is simply not possible to parse the data: you will run out of memory. There may be even more waste, for example when you need to filter the data on a specific condition: then you will throw away a large part of the parsed data directly after filtering too.

When parsing the data in a streaming way instead, there is no need to first load the full document into memory and parse the full document before being able to operate on it. A streaming parser processes chunks of data as soon as they come in, and allows to directly apply a filtering step or other processing step without needing all of the data to be parsed beforehand. Therefore, there is not much memory needed to process a possibly endlessly large amount of data.

A use case can be reading a large log file and filtering the logs that match a search request, or receiving a large query result from a database in a backend, transforming the data and then sending it to the client. As an illustration, the library ijson is a the streaming JSON parser for Python and it can parse some nested array “earth.europe” and let you directly process the array items one by one whilst they are being parsed (see docs):

import ijson

f = urlopen('http://.../')
objects = ijson.items(f, 'earth.europe.item')
cities = (o for o in objects if o['type'] == 'city')
for city in cities:
    do_something_with(city)

How to write a streaming parser?

There are three different approaches that to implement a streaming parser:

  1. Parse a flat collection row by row
  2. Parse a nested data structure (A): generator functions
  3. Parse a nested data structure (B): pause and resume using state

We’ll discuss how to work with that in the following sections.

Parse a flat collection row by row

Parsing a collection in a streaming way is quite straightforward. A collection consists of a set of items that are separated by a delimiter like a newline character. Each item or row can be processed one by one. Examples of this are NDJSON and CSV, which are popular formats for logging for example. 

To parse NDJSON or CSV data, you can read the data until you encounter a newline character, then parse the row with a regular parser, and repeat that for the next rows until you reach the end of the file. Appending a new row can be done without having to parse the data that is already there. 

When writing the streaming parser, we can’t simply write a function that processes some input and returns the output:

FUNCTION parseData(input)
   parse input
   RETURN output
END FUNCTION

Instead, we need an API which can pause processing, wait for new data to come in, process the data that is received so far, and wait again for more data:

WAIT until a new chunk of data is received
   append the chunk of data to the data that was received before


   IF the received data contains the end of the line
       cut the data in two at the line end
       parse and process the line
       keep the remaining data
   END IF
END WAIT

We can change the non-streaming CSV parser example such that it has a streaming API where you can pass the data chunk by chunk, and the data is processed line by line:

The most important difference is the API with methods push and flush, and a callback onRow. The parser section itself is reduced to only parsing a single row, parsing multiple rows is handle on top.

Parse a nested data structure (A): generator functions

When dealing with an arbitrary nested data structure it is not possible to process items one by one in an easy way like when dealing with a collection. You have to keep the full data structure in memory whilst parsing. There are two ways to achieve this: either use generator functions when available, or write a parser that can pause and resume and manually keeps track of the state.

The first way to write a streaming parser is to use generator functions which can “pause” execution of a function using yield. Using a generator function, we can pause the parsing process when we need more input data, and then continue as soon as a new chunk of data comes in. Not every programming language has support for generator functions, but you can use them for example in JavaScript and Python.

The nice thing about this approach is that the code and the flow can be the same as when writing a regular (non-streaming) parser. The only thinking that is needed is to change all functions into generator functions, and to replace the function that reads the next character from the input data into a generator that will pause (yield) when there is no new data, and will continue as soon as new data is received. The CSV parser from before will look like:

FUNCTION parseCSV (text)
   GENERATOR_FUNCTION parseRow ()
       WHILE end of row is not reached
           CALL parseValue()
       END WHILE
       RETURN values
   END GENERATOR_FUNCTION


   GENERATOR_FUNCTION parseValue ()
       WHILE end of value is not reached
           CALL nextCharacter() and add it to value
       END WHILE
       RETURN value
   END GENERATOR_FUNCTION
  
   GENERATOR_FUNCTION nextCharacter()
       IF no new data
           WAIT until more data is received
       END IF
       RETURN the next character
   END GENERATOR_FUNCTION
  
   WHILE end of text is not reached
       CALL parseRow()
   END WHILE


   RETURN rows
END FUNCTION

We can adjust the CSV parser shared before to use generator functions. The nice thing is that the logic of the parser itself is the same as the original, non-steaming parser, and only the public API changed, and the inner nextCharacter function that can now pause to wait for more data:

Parse a nested data structure (B): pause and resume using state

The second approach to make a parser streaming is to write the parser in such a way that there are a lot of points at which you can pause the parser, such as after parsing a single value. When the parser pauses, awaiting new data, it needs to remember where it left off. When new data comes in, it must be able to resume processing where it left off purely based on the stored state. 

In pseudo code, this structure looks as follows: there is a function push that you can use to append new data. This function will process next steps, one by one, as long as there is data. At the end, a function flush is called to process any leftover data.

VARIABLE state
   keeps the parsing state, including a stack keeping track on nested structures


FUNCION push(chunk)
   append the chunk to a buffer with data that was received before


   DO
       CALL processNext()
   WHILE processNext was able to parse another step
  
   remove the data that has been parsed from the buffer
END FUNCTION


FUNCTION flush()
   WHILE end of data is not reached
       CALL processNext()
   END WHILE
END FUNCTION


FUNCTION processNext()
   SWITCH (currentState)
       Go through all possible states, try to process the next
       step based on the current state. After parsing a piece of data,
       return the new state. For example: if the currentState
       expects a new value, and we encounter an open array bracket "[",
       we add a new array to the stack, and as new state we return
       that we're now inside an array, expecting a value.
   END SWITCH
END FUNCTION

The upside of this approach is that there is no need for generator functions (which are not available in all languages). The downside is that you need to split the code of the parser into separate pieces that each process a single “token”, the smallest unit of data that we process: a delimiter, number, boolean, etc. The parser must be able to determine what token is expected next based on some state, and it must return both a new state and optional parsed output. This indirection results in more complex code that is harder to reason about.

The challenges of writing a parser this we becomes clear when looking at the rewritten version of the CSV parser shared before:

You see that the original flow that was still present in the three earlier versions of the CSV parser is put upside down and split apart. The structure of this parser is prone to issues like infinite loops when the different states do not correctly follow each other up. It is harder to see the overall flow of the parser. In this case the parser is quite minimal and it is still doable, but you can imagine that the complexity and room for bugs grows when parsing a more advanced data format.

More complex cases

So far, we used examples of a CSV parser and a JSON parser. These parsers are relatively straightforward. In other cases, you may come across more complex needs, such as the need to not just read the current character, but ahead or behind to determine what to do. That is for example the case in the jsonrepair library, which recently got streaming support. The library for example has to look behind to fix trailing commas after white space, or has to revert parsing of a string when it discovers that the string misses an end quote. In that case, it needs to parse the string again with a different strategy, stopping at the first next delimiter instead of an end quote. In the jsonrepair library, this is implemented using an input buffer and an output buffer, which keep a limited “moving window” of input and output available to read from and write to.

It is important to think through possible implications for memory. When using a streaming parser, the assumption is that memory usage will be limited. If parts of the data like a large string require more memory, the parser either has to throw an error (prompting the user to configure a large buffer), or it has to use more memory without informing the user, possibly blowing up memory usage. In general, this last option is not preferable.

Conclusion about writing a streaming parser

A streaming parser can be needed when processing large amounts of data. It is quite common that the data to be processed is a collection, like rows in a log file or items from a database. In that case, processing this data in a streaming way is quite straightforward. Processing an arbitrary nested data structure in a streaming is more challenging, but luckily also quite a niche. There are two main approaches to go about that, and in essence, it must be possible to pause the parser to await receiving more data, and then continue.

To summarize, here are the links to the four CSV parsers discussed throughout the article. You can compare them side by side and play around with the different concepts yourself:

 

  1. CSV Parser (non-streaming)
  2. CSV Parser (streaming, line by line)
  3. CSV Parser (streaming, generator functions)
  4. CSV Parser (streaming, pause and resume keeping state)