Fabrizio Damicelli
July 26, 2024
JSON Lines is a common format in modern data applications. As the format’s documentation puts it:
The JSON Lines text format, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time. It’s a great format for log files. It’s also a flexible format for passing messages between cooperating processes.
For example, Google Cloud’s BigQuery exports tables in this format by default.
I will cover a few simple tips that can speed up the parsing significantly. I’ve got a file on my computer containing ~9,600 such JSON lines:
!wc -l sample1.jsonl
9611 sample1.jsonl
This is a toy example: in the real-world workloads I deal with, I typically have something like 300 million JSON lines to process. So a job might literally spend hours parsing JSON – and, you guessed it, every minute of running time costs 💸.
Regardless of the very specific content, our data look like this:
{
    "key1": true,
    "key2": ["hello", "world", ...],
    "key3": [1231123, 1234192, ...],
    "key4": ["super", "coool", ...],
    "key5": ["very", "niiice", ...],
    ...
}
where the lists can be anywhere between 8 and ~4,400 elements long:
import json
from pathlib import Path

lines = (json.loads(line) for line in Path("sample1.jsonl").read_text().splitlines())

minlen, maxlen = 1e6, 0
for line in lines:
    for v in line.values():
        if isinstance(v, (list, str)):
            if (l := len(v)) > maxlen:
                maxlen = l
            if l < minlen:
                minlen = l

print("Min length:", minlen)
print("Max length:", maxlen)
Min length: 8
Max length: 4403
Code benchmarks are always tricky. I believe the tips I’ll show apply in general, but the relative differences might still vary depending on several factors, so you should profile the parsers with your own data to see which variant suits your case best.
Also, I will assume we don’t want to use schema information about the data, i.e., we want each line back as a dictionary.
For the sake of comparison, each method will be a function that receives a file path and returns a generator of dictionaries (each dictionary corresponding to one JSON line in the file). We use a generator to avoid counting the time it takes to create a container (e.g., a list).
By the way, that is already our Tip Number 1: Consume the lines lazily (if you only need them one by one) to reduce the memory footprint and avoid the time spent allocating large container objects.
Here’s the canonical “pure-python” way of doing it, using the json module from the standard library:
from typing import Callable, Generator

def get_lines_text(p: Path) -> Generator[str, None, None]:
    "Notice we read as text"
    for line in p.read_text().splitlines():
        yield line

def read_python(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    """Since we do not use third-party libraries, I call this _python"""
    for line in line_reader(p):
        yield json.loads(line)
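For reference, the timings shown throughout this post come from IPython cells along these lines (a minimal sketch of the benchmark – the exact invocation may differ), which exhaust the generator so that every line actually gets parsed:
%%timeit
# Exhaust the generator so each line is actually parsed
for _ in read_python(Path("sample1.jsonl"), get_lines_text):
    pass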
2.64 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s our baseline.
Tip Number 2: Consume bytes directly
Notice our function get_lines_text uses the method read_text to grab the text. Under the hood, that first decodes the bytes into a string, which is then parsed into a dictionary. But we don’t need that! The function json.loads (same as for the third-party libraries) accepts bytes as well, so let’s change our get_lines function to read bytes:
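A minimal sketch of such a bytes-based reader (the name get_lines_bytes is my choice), passed to read_python in place of get_lines_text:
def get_lines_bytes(p: Path) -> Generator[bytes, None, None]:
    "Notice we read raw bytes and never decode them to str"
    for line in p.read_bytes().splitlines():
        yield line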
1.82 s ± 79.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s a ~40% improvement for free – I don’t know about you, but I’d take it ;)
But we can push further.
Tip Number 3: Create the stream manually (skip pathlib)
We all love pathlib as it is super handy, but there’s a tiny overhead in this case that can add up (if you have, say, thousands of files to read) – let’s see:
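A minimal sketch, opening the file directly with open instead of going through pathlib methods (the name get_lines_stream is assumed):
def get_lines_stream(p: Path) -> Generator[bytes, None, None]:
    "Open the file in binary mode and iterate over its lines lazily"
    with open(p, "rb") as f:
        for line in f:
            yield line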
1.65 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s not a huge deal, but still a ~10% improvement for free – I’ll take it!
Starting from here, we’ll see a few libraries beyond the Python standard library. Here’s also where you might want to take results with a pinch of salt, as different libraries have different implementations under the hood, which might take advantage of different aspects or structure of the data for optimization. Thus, you definitely want to try them out with your own data to check the extent to which the following results apply.
Tip Number 4: Use Pydantic’s from_json
This little function is a bit of a hidden gem and almost a drop-in replacement for json.loads:
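A minimal sketch of that variant (from_json is importable from pydantic_core as of Pydantic v2):
from pydantic_core import from_json

def read_pydantic(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    for line in line_reader(p):
        # from_json accepts str or bytes, just like json.loads
        yield from_json(line)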
1.22 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s another 25% improvement (on top of what we already had so far). Not bad, I’d say.
Tip Number 5: Use msgspec
The performance of this library has blown my mind a few times in the past. Unfortunately, it lives a bit in the shadow of more visible frameworks, but I think it deserves much more attention!
import msgspec

def read_msgspec(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    # You could also instantiate the decoder once outside of the
    # function (msgspec.json.Decoder()) and call its .decode()
    # method here. For this use case, I didn't find that to have
    # better performance.
    for line in line_reader(p):
        yield msgspec.json.decode(line)
989 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Almost 20% faster than the previous solution – and we are still talking about almost drop-in replacements!
Tip Number 6: Use orjson
This is a popular library claiming to be the fastest Python JSON library – let’s see:
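A minimal sketch, following the same pattern as before:
import orjson

def read_orjson(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    for line in line_reader(p):
        # orjson.loads accepts bytes as well as str
        yield orjson.loads(line)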
863 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We keep stacking improvements: ~13% faster than the previous solution.
To sum up, we managed to push the parsing time down from ~2.64 seconds (or ~1.82 seconds if we exclude the “naive” text-reading case) to ~0.86 seconds. All in all, that means more than 3 times faster (or more than 2 times if we start from reading bytes)!
A quick back-of-the-envelope calculation for my concrete, full use case yields a reduction in running time of almost 10 hours, which can definitely mean real money depending on the hardware being used (for example, GPUs).
Bonus Tip: Use polars
We only considered the case reading and parsing the lines into a dictionary.
But if you don’t mind having the data in a polars DataFrame, you can try it out:
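A minimal sketch using polars’ NDJSON reader, which parses the whole file into a DataFrame in one go:
%%time
import polars as pl

df = pl.read_ndjson("sample1.jsonl")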
CPU times: user 4.56 s, sys: 729 ms, total: 5.29 s
Wall time: 900 ms
Here are the links to the third-party libraries used in this post:
pydantic: https://github.com/pydantic/pydantic
msgspec: https://github.com/jcrist/msgspec
orjson: https://github.com/ijl/orjson
polars: https://github.com/pola-rs/polars
Go show them some gratitude for their contributions to the community! ❤️
/Fin
Any bugs, questions, comments, suggestions? Ping me on Twitter or drop me an e-mail (fabridamicelli at gmail).