Fabrizio Damicelli
July 26, 2024
JSON Lines is a common format in modern data applications. As the format’s documentation puts it:
The JSON Lines text format, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time. It’s a great format for log files. It’s also a flexible format for passing messages between cooperating processes.
For example, Google Cloud’s BigQuery exports tables in this format by default.
I will cover a few simple tips that can speed up the parsing significantly. I’ve got a file on my computer containing ~9,600 such JSON lines:
!wc -l sample1.jsonl
9611 sample1.jsonl
This is a toy example: in the real-world workloads I deal with, I typically have something like 300 million JSON lines to process. So a job might literally spend hours parsing JSON – and, you guessed it, every minute of running time costs 💸.
Regardless of the very specific content, our data look like this:
{
    "key1": true,
    "key2": ["hello", "world", ...],
    "key3": [1231123, 1234192, ...],
    "key4": ["super", "coool", ...],
    "key5": ["very", "niiice", ...],
    ...
}
where the lists can be anywhere between 8 and ~4,400 elements long:
import json
from pathlib import Path

lines = (json.loads(line) for line in Path("sample1.jsonl").read_text().splitlines())

minlen, maxlen = 1e6, 0
for line in lines:
    for v in line.values():
        if isinstance(v, (list, str)):
            if (l := len(v)) > maxlen:
                maxlen = l
            if l < minlen:
                minlen = l

print("Min length:", minlen)
print("Max length:", maxlen)
Min length: 8
Max length: 4403
Code benchmarks are always tricky. I believe the tips I’ll show apply in general, but the relative differences might still vary depending on several factors, so you should profile the parsers with your own data to see which variant suits your case best.
Also, I will assume we don’t want to use schema information about the data, i.e., we want each line back as a dictionary.
For the sake of comparison, each method will be a function that receives a file path and returns a generator of dictionaries (each dictionary corresponding to one JSON line in the file). We use a generator to avoid counting the time it takes to create a container (e.g., a list).
By the way, that is already our Tip Number 1: Consume the lines lazily (if you only need them one by one) to reduce the memory footprint and avoid the time spent allocating large container objects.
Here’s the canonical “pure-python” way of doing it, using the json module from the standard library:
from typing import Callable, Generator

def get_lines_text(p: Path) -> Generator[str, None, None]:
    "Notice we read as text"
    for line in p.read_text().splitlines():
        yield line

def read_python(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    """Since we do not use third-party libraries, I call this _python"""
    for line in line_reader(p):
        yield json.loads(line)
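For reference, the timings shown throughout this post come from IPython cells along these lines (a minimal sketch of the benchmark – the exact invocation may differ), which exhaust the generator so that every line actually gets parsed:
%%timeit
# Exhaust the generator so each line is actually parsed
for _ in read_python(Path("sample1.jsonl"), get_lines_text):
    pass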
2.64 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s our baseline.
Tip Number 2: Consume bytes directly
Notice our function get_lines_text uses the method read_text to grab the text. Under the hood, that first decodes the bytes into a string, which is then parsed into a dictionary. But we don’t need that! The function json.loads (same as for the third-party libraries) accepts bytes as well, so let’s change our get_lines function to read bytes:
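A minimal sketch of such a bytes-based reader (the name get_lines_bytes is my choice), passed to read_python in place of get_lines_text:
def get_lines_bytes(p: Path) -> Generator[bytes, None, None]:
    "Notice we read raw bytes and never decode them to str"
    for line in p.read_bytes().splitlines():
        yield line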
1.82 s ± 79.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s a ~40% improvement for free – I don’t know about you, but I’d take it ;)
But we can push further.
Tip Number 3: Create the stream manually (skip pathlib)
We all love pathlib as it is super handy, but there’s a tiny overhead in this case that can add up (if you have, say, thousands of files to read) – let’s see:
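A minimal sketch, opening the file directly with open instead of going through pathlib methods (the name get_lines_stream is assumed):
def get_lines_stream(p: Path) -> Generator[bytes, None, None]:
    "Open the file in binary mode and iterate over its lines lazily"
    with open(p, "rb") as f:
        for line in f:
            yield line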
1.65 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s not a huge deal, but still a ~10% improvement for free – I’ll take it!
Starting from here, we’ll see a few libraries beyond the Python standard library. Here’s also where you might want to take results with a pinch of salt, as different libraries have different implementations under the hood, which might take advantage of different aspects or structure of the data for optimization. Thus, you definitely want to try them out with your own data to check the extent to which the following results apply.
Tip Number 4: Use Pydantic’s from_json
This little function is a bit of a hidden gem and almost a drop-in replacement for json.loads:
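A minimal sketch of that variant (from_json is importable from pydantic_core as of Pydantic v2):
from pydantic_core import from_json

def read_pydantic(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    for line in line_reader(p):
        # from_json accepts str or bytes, just like json.loads
        yield from_json(line)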
1.22 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s another 25% improvement (on top of what we already had so far). Not bad, I’d say.
Tip Number 5: Use msgspec
The performance of this library has blown my mind a few times in the past. Unfortunately, it lives a bit in the shadow of more visible frameworks, but I think it deserves much more attention!
import msgspec

def read_msgspec(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    # You could also instantiate the decoder once outside of the
    # function (msgspec.json.Decoder()) and call its .decode()
    # method here. For this use case, I didn't find that to have
    # better performance.
    for line in line_reader(p):
        yield msgspec.json.decode(line)
989 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Almost 20% faster than the previous solution – and we are still talking about almost drop-in replacements!
Tip Number 6: Use orjson
This is a popular library claiming to be the fastest Python JSON library – let’s see:
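A minimal sketch, following the same pattern as before:
import orjson

def read_orjson(p: Path, line_reader: Callable) -> Generator[dict, None, None]:
    for line in line_reader(p):
        # orjson.loads accepts bytes as well as str
        yield orjson.loads(line)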
863 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We keep stacking improvements: ~13% faster than the previous solution.
To sum up, we managed to push the parsing time down from ~2.64 seconds (or ~1.82 seconds if we exclude the “naive” text-reading case) to ~0.86 seconds. All in all, that means more than 3 times faster (or more than 2 times if we start from reading bytes)!
A quick back-of-the-envelope calculation for my concrete, full use case yields a reduction in running time of almost 10 hours, which can definitely mean real money depending on the hardware being used (for example, GPUs).
Bonus Tip: Use polars
We only considered the case reading and parsing the lines into a dictionary.
But if you don’t mind having the data in a polars DataFrame, you can try it out:
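A minimal sketch using polars’ NDJSON reader, which parses the whole file into a DataFrame in one go:
%%time
import polars as pl

df = pl.read_ndjson("sample1.jsonl")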
CPU times: user 4.56 s, sys: 729 ms, total: 5.29 s
Wall time: 900 ms
Here are the links to the third-party libraries used in this post:
pydantic: https://github.com/pydantic/pydantic
msgspec: https://github.com/jcrist/msgspec
orjson: https://github.com/ijl/orjson
polars: https://github.com/pola-rs/polars
Go show them some gratitude for their contributions to the community! ❤️
/Fin
Any bugs, questions, comments, suggestions? Ping me on Twitter or drop me an e-mail (fabridamicelli at gmail).