from pathlib import Path
import json
import sqlite3
import pickle
import numpy as np
Fabrizio Damicelli
June 23, 2024
Numpy’s bytes format can be considerably faster than other formats to deserialize. When storing/retrieving vector arrays, just use the methods array.tobytes() and numpy.frombuffer() (instead of, for example, pickle.dumps/loads).
Let’s say you have a bunch of entities, e.g. product-ids of an online shop, for which you have a vector representation (think, for example, of a word or an image embedding) stored somewhere like a database.
Now it’s time to put your model in production and you need to retrieve the vectors from the database in order to deliver your predictions (for example, you might need to get an embedding representation of product images in order to calculate similarities and show product recommendations). The most typical (and probably sane) thing to do is to have an SQLite database for storage.
It turns out we can store the vectors in different formats, and which format we use can heavily affect the retrieval (loading and deserialization) speed in our application. That can be critical if the predictions behind your use case need to be fast (e.g. under 100 ms).
Let’s create some fake data consisting of vectors of length 256 filled with random numbers. We will use that data to profile the retrieval (from the database) performance of different formats.
vectors = np.random.random(size=(20_000, 256))
data = {f"id_{i}": vec for i, vec in enumerate(vectors)}
data["id_0"][:5]
array([0.93501425, 0.70911416, 0.54255025, 0.84465434, 0.92316 ])
Our first option is to just dump the numbers into the database as text.
def dump_vectors_as_json(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector TEXT
        )
        """
    )
    for pid, vec in data.items():
        db_vec = json.dumps(list(vec))  # serialize the vector as JSON text
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
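(The loading helper load_vectors_from_db used below isn’t shown in the post, and neither are path_json and ids. Here is a minimal sketch of what they presumably look like, given how they are used; the exact implementation and names of path_json/ids are assumptions.)

def load_vectors_from_db(db_path, ids):
    # Sketch (assumption): fetch the raw stored values for a list of ids,
    # one simple query per id, keeping the order of the ids.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    out = {}
    for id_ in ids:
        row = cur.execute("SELECT vector FROM vectors WHERE id = ?", (id_,)).fetchone()
        out[id_] = row[0]
    conn.close()
    return out

# Assumed setup: write the JSON table and pick 5000 ids to query
path_json = Path("vectors_json.db")
dump_vectors_as_json(data, path_json)
ids = [f"id_{i}" for i in range(5_000)]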
Now let’s take a look at the data by querying 5000 specific ids:
for id_, vec in load_vectors_from_db(path_json, ids).items():
    print(id_, json.loads(vec)[:3])  # show only first 3 values
    break
id_0 [0.9350142536057673, 0.7091141633646703, 0.5425502463856475]
Looks fine, but we cannot do much with the values as text; let’s convert them back into numpy arrays:
vecs = {id_: np.array(json.loads(vec)) for id_, vec in load_vectors_from_db(path_json, ids).items()}
vecs["id_0"][:10]
array([0.93501425, 0.70911416, 0.54255025, 0.84465434, 0.92316 ,
0.04012891, 0.38365326, 0.69617891, 0.93489605, 0.32225334])
We should wrap that up into a function for later comparison:
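(That wrapped-up function isn’t shown in the post; here is a minimal sketch consistent with the output below. The name load_json_vectors is my own choice.)

def load_json_vectors(db_path, ids):
    # Sketch (assumption): query + JSON-deserialize into numpy arrays
    return {
        id_: np.array(json.loads(vec))
        for id_, vec in load_vectors_from_db(db_path, ids).items()
    }

for id_, vec in list(load_json_vectors(path_json, ids).items())[:4]:
    print(id_, vec[:5])  # show only first 5 values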
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
id_2 [0.02500795 0.20845864 0.90890601 0.71998367 0.33240879]
id_3 [0.48994333 0.24024084 0.62285499 0.18298199 0.34209958]
That’s alright. Let’s quickly check how long it takes for all our 5000 ids:
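(The timing cell isn’t shown; in a notebook it presumably looks roughly like this, using the sketch function from above.)

%timeit load_json_vectors(path_json, ids)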
362 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That might be good enough. But we can definitely do much better (without paying much of a complexity price).
We can store the arrays directly as binary data in the database.
def dump_vectors_as_pickle_blob(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector BLOB
        )
        """
    )
    for pid, vec in data.items():
        db_vec = pickle.dumps(vec)  # serialize the array with pickle
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
We need another function to deserialize them:
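(That deserialization function isn’t shown; a minimal sketch, together with the assumed database path path_pickle:)

def deserialize_pickle(buf):
    # Sketch (assumption): pickle.loads restores the original numpy array from the BLOB
    return pickle.loads(buf)

path_pickle = Path("vectors_pickle.db")  # assumed name
dump_vectors_as_pickle_blob(data, path_pickle)

for id_, vec in list(load_vectors_from_db(path_pickle, ids).items())[:2]:
    print(id_, deserialize_pickle(vec)[:5])  # show only first 5 values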
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
What does our clock say?
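(Presumably something along these lines:)

%timeit {id_: deserialize_pickle(vec) for id_, vec in load_vectors_from_db(path_pickle, ids).items()}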
40.8 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s roughly an order of magnitude faster. We’re cruising :)
But we can still do better. After all, we are working with numpy arrays, so we can avoid the pickling path altogether. Numpy arrays have a method called .tobytes() that does the job:
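(The input array isn’t shown here, but the bytes below correspond to the float64 values 0.0 to 4.0, e.g.:)

np.arange(5, dtype=np.float64).tobytes()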
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@\x00\x00\x00\x00\x00\x00\x10@'
When calling .tobytes(), numpy does not store information about the data type. The default of np.frombuffer is np.float64, but if that does not match your data, the loaded data will be wrong! Make sure to explicitly pass the dtype argument, np.frombuffer(..., dtype=YOUR-TYPE), for it to work properly.
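A quick illustration of that pitfall (my own example): serializing a float32 array and reading it back with the default float64 dtype silently produces garbage, while passing the matching dtype round-trips correctly.

vec32 = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
buf = vec32.tobytes()

np.frombuffer(buf)                    # wrong dtype: 2 meaningless float64 values
np.frombuffer(buf, dtype=np.float32)  # correct: array([1., 2., 3., 4.], dtype=float32)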
def dump_vectors_as_npbytes_blob(data, db_path):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE vectors (
            id TEXT PRIMARY KEY,
            vector BLOB
        )
        """
    )
    for pid, vec in data.items():
        db_vec = vec.tobytes()  # <-- HERE: raw numpy bytes, no pickling
        cur.execute("INSERT INTO vectors VALUES (?,?)", (pid, db_vec))
    conn.commit()
    conn.close()
def deserialize_npbytes(buf, dtype=np.float64):
    return np.frombuffer(buf, dtype=dtype)
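(The loading and printing code isn’t shown; presumably something like this, with path_npbytes being an assumed name:)

path_npbytes = Path("vectors_npbytes.db")  # assumed name
dump_vectors_as_npbytes_blob(data, path_npbytes)

for id_, vec in list(load_vectors_from_db(path_npbytes, ids).items())[:2]:
    print(id_, deserialize_npbytes(vec)[:5])  # show only first 5 values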
id_0 [0.93501425 0.70911416 0.54255025 0.84465434 0.92316 ]
id_1 [0.73054276 0.69741267 0.71045242 0.41007697 0.27032626]
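(And the corresponding timing, presumably:)

%timeit {id_: deserialize_npbytes(vec) for id_, vec in load_vectors_from_db(path_npbytes, ids).items()}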
19.1 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s roughly twice as fast as the pickle version – for free!
Let’s recap with a side-by-side comparison, using a somewhat larger list of ids to query:
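(The comparison code isn’t shown; here is a sketch of what it presumably does. The size of the larger id list, 10,000, is my assumption; the three timings below correspond to JSON, pickle and raw numpy bytes, in that order.)

ids_large = [f"id_{i}" for i in range(10_000)]  # assumed size

%timeit {id_: np.array(json.loads(vec)) for id_, vec in load_vectors_from_db(path_json, ids_large).items()}
%timeit {id_: pickle.loads(vec) for id_, vec in load_vectors_from_db(path_pickle, ids_large).items()}
%timeit {id_: deserialize_npbytes(vec) for id_, vec in load_vectors_from_db(path_npbytes, ids_large).items()}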
JSON:        729 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pickle:      67.7 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy bytes: 37.1 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sometimes 40, 70 or even 500 ms don’t matter. But sometimes they do (and can cost significant money). For example, in the context of high-traffic, real-time recommender systems, shaving 20 to 30 ms off the response time (for free!) is definitely not a bad thing to have.
/Fin
Any bugs, questions, comments, suggestions? Ping me on twitter or drop me an e-mail (fabridamicelli at gmail).