from pathlib import Path
import json
import sqlite3
import pickle
import numpy as np
Fabrizio Damicelli
June 23, 2024
Numpy’s bytes format can be considerably faster than other formats to deserialize. When storing/retrieving vector arrays, just use the methods array.tobytes() and numpy.frombuffer() (instead of, for example, pickle.dumps/loads).
Let’s say you have a bunch of entities, e.g. product-ids of an online shop, for which you have a vector representation (think, for example, of a word or an image embedding) stored somewhere like a database.
Now it’s time to put your model in production and you need to retrieve the vectors from the database in order to deliver your predictions (for example, you might need to get an embedding representation of product images in order to calculate similarities and show product recommendations). The most typical (and probably sane) thing to do is to have an SQLite database for storage.
It turns out we can store the vectors in different formats, and which format we use can heavily affect the retrieval (loading and deserialization) speed in our application. That can be critical if the predictions behind your use case need to be fast (e.g. under 100 ms).
Let’s create some fake data consisting of vectors of length 256 filled with random numbers. We will use that data to profile the retrieval (from the database) performance of different formats.
vectors = np.random.random(size=(20_000, 256))
data = {f"id_{i}": vec for i, vec in enumerate(vectors)}
data["id_0"][:5]
array([0.93501425, 0.70911416, 0.54255025, 0.84465434, 0.92316 ])
Our first option is to just dump the numbers into the database as text.
def dump_vectors_as_json(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector TEXT
        )
        """
    )
    for pid, vec in data.items():
        db_vec = json.dumps(list(vec))  # serialize the vector as JSON text
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
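(The loading helper load_vectors_from_db used below isn’t shown in the post, and neither are path_json and ids. Here is a minimal sketch of what they presumably look like, given how they are used; the exact implementation and names of path_json/ids are assumptions.)

def load_vectors_from_db(db_path, ids):
    # Sketch (assumption): fetch the raw stored values for a list of ids,
    # one simple query per id, keeping the order of the ids.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    out = {}
    for id_ in ids:
        row = cur.execute("SELECT vector FROM vectors WHERE id = ?", (id_,)).fetchone()
        out[id_] = row[0]
    conn.close()
    return out

# Assumed setup: write the JSON table and pick 5000 ids to query
path_json = Path("vectors_json.db")
dump_vectors_as_json(data, path_json)
ids = [f"id_{i}" for i in range(5_000)]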
Now let’s take a look at the data by querying 5000 specific ids:
for id_, vec in load_vectors_from_db(path_json, ids).items():
    print(id_, json.loads(vec)[:3])  # show only first 3 values
    break
id_0 [0.9350142536057673, 0.7091141633646703, 0.5425502463856475]
Looks fine, but we cannot do much with the values as text; let’s convert them back into numpy arrays:
vecs = {id_: np.array(json.loads(vec)) for id_, vec in load_vectors_from_db(path_json, ids).items()}
vecs["id_0"][:10]
array([0.93501425, 0.70911416, 0.54255025, 0.84465434, 0.92316 ,
0.04012891, 0.38365326, 0.69617891, 0.93489605, 0.32225334])
We should wrap that up into a function for later comparison:
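(That wrapped-up function isn’t shown in the post; here is a minimal sketch consistent with the output below. The name load_json_vectors is my own choice.)

def load_json_vectors(db_path, ids):
    # Sketch (assumption): query + JSON-deserialize into numpy arrays
    return {
        id_: np.array(json.loads(vec))
        for id_, vec in load_vectors_from_db(db_path, ids).items()
    }

for id_, vec in list(load_json_vectors(path_json, ids).items())[:4]:
    print(id_, vec[:5])  # show only first 5 values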
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
id_2 [0.02500795 0.20845864 0.90890601 0.71998367 0.33240879]
id_3 [0.48994333 0.24024084 0.62285499 0.18298199 0.34209958]
That’s alright. Let’s quickly check how long it takes for all our 5000 ids:
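(The timing cell isn’t shown; in a notebook it presumably looks roughly like this, using the sketch function from above.)

%timeit load_json_vectors(path_json, ids)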
362 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That might be good enough. But we can definitely do much better (without paying much of a complexity price).
We can store the arrays directly as binary data in the database.
def dump_vectors_as_pickle_blob(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector BLOB
        )
        """
    )
    for pid, vec in data.items():
        db_vec = pickle.dumps(vec)  # serialize the array with pickle
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
We need another function to deserialize them:
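(That deserialization function isn’t shown; a minimal sketch, together with the assumed database path path_pickle:)

def deserialize_pickle(buf):
    # Sketch (assumption): pickle.loads restores the original numpy array from the BLOB
    return pickle.loads(buf)

path_pickle = Path("vectors_pickle.db")  # assumed name
dump_vectors_as_pickle_blob(data, path_pickle)

for id_, vec in list(load_vectors_from_db(path_pickle, ids).items())[:2]:
    print(id_, deserialize_pickle(vec)[:5])  # show only first 5 values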
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
What does our clock say?
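(Presumably something along these lines:)

%timeit {id_: deserialize_pickle(vec) for id_, vec in load_vectors_from_db(path_pickle, ids).items()}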
40.8 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s roughly an order of magnitude faster. We’re cruising :)
But we can still do better. After all, we are working with numpy arrays, so we can avoid the pickling path altogether. Numpy arrays have a method called .tobytes() that does the job:
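(The input array isn’t shown here, but the bytes below correspond to the float64 values 0.0 to 4.0, e.g.:)

np.arange(5, dtype=np.float64).tobytes()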
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@\x00\x00\x00\x00\x00\x00\x10@'
When calling .tobytes(), numpy does not store information about the data type. The default of np.frombuffer is np.float64, but if that does not match your data, the loaded data will be wrong! Make sure to explicitly pass the dtype argument, np.frombuffer(..., dtype=YOUR-TYPE), for it to work properly.
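A quick illustration of that pitfall (my own example): serializing a float32 array and reading it back with the default float64 dtype silently produces garbage, while passing the matching dtype round-trips correctly.

vec32 = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
buf = vec32.tobytes()

np.frombuffer(buf)                    # wrong dtype: 2 meaningless float64 values
np.frombuffer(buf, dtype=np.float32)  # correct: array([1., 2., 3., 4.], dtype=float32)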
def dump_vectors_as_npbytes_blob(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector BLOB
        )
        """
    )
    for pid, vec in data.items():
        db_vec = vec.tobytes()  # <-- HERE: raw numpy bytes, no pickling
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
def deserialize_npbytes(buf, dtype=np.float64):
    return np.frombuffer(buf, dtype=dtype)
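(The loading and printing code isn’t shown; presumably something like this, with path_npbytes being an assumed name:)

path_npbytes = Path("vectors_npbytes.db")  # assumed name
dump_vectors_as_npbytes_blob(data, path_npbytes)

for id_, vec in list(load_vectors_from_db(path_npbytes, ids).items())[:2]:
    print(id_, deserialize_npbytes(vec)[:5])  # show only first 5 values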
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
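(And the corresponding timing, presumably:)

%timeit {id_: deserialize_npbytes(vec) for id_, vec in load_vectors_from_db(path_npbytes, ids).items()}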
19.1 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s roughly twice as fast as the pickle version – for free!
Let’s recap with a side-by-side comparison, using a somewhat larger list of ids to query:
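(The comparison code isn’t shown; here is a sketch of what it presumably does. The size of the larger id list, 10,000, is my assumption; the three timings below correspond to JSON, pickle and raw numpy bytes, in that order.)

ids_large = [f"id_{i}" for i in range(10_000)]  # assumed size

%timeit {id_: np.array(json.loads(vec)) for id_, vec in load_vectors_from_db(path_json, ids_large).items()}
%timeit {id_: pickle.loads(vec) for id_, vec in load_vectors_from_db(path_pickle, ids_large).items()}
%timeit {id_: deserialize_npbytes(vec) for id_, vec in load_vectors_from_db(path_npbytes, ids_large).items()}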
JSON:        729 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pickle:      67.7 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy bytes: 37.1 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sometimes 40, 70 or even 500 ms don’t matter. But sometimes they do (and can cost significant money). For example, in the context of high-traffic, real-time recommender systems, shaving 20 to 30 ms off the response time (for free!) is definitely not a bad thing to have.
/Fin
Any bugs, questions, comments, suggestions? Ping me on twitter or drop me an e-mail (fabridamicelli at gmail).