13  Common Data Formats

13.1 CSV

Comma Separated Values

Code
data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""
print(data)
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
Code
from pathlib import Path
filepath = "/tmp/data.csv"
Path(filepath).write_text(data)

Loading data into a dataframe is not the only but one the most common ways to to load this data. We will use here pandas, a very popular library for data wrangling in python.

Install pandas:

uv add pandas
import pandas as pd

df = pd.read_csv(filepath)
df
date id age
0 2020-01-01 x12 19
1 2020-01-02 x11 23
2 2020-01-02 x3 22
3 2020-01-03 x19 28

We can also save a dataframe as csv:

df.to_csv("/tmp/data.csv", index=None)
!cat /tmp/data.csv
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28

13.2 Excel

We can also read in data coming from an excel spreadsheet.

df.to_excel("/tmp/data.xlsx", index=None)
pd.read_excel("/tmp/data.xlsx")
date id age
0 2020-01-01 x12 19
1 2020-01-02 x11 23
2 2020-01-02 x3 22
3 2020-01-03 x19 28

13.3 JSON

JSON (JavaScript Object Notation) is by far one of the most used data formats, nowadays the default format to transfer data over the internet. It is also very commonly used for configuration files and logging.

flat = {
  "name": "John Doe",
  "age": 25
}

nested = {
  "name": {
    "firstName": "John",
    "lastName": "Doe",
    "middleName": "Smith"
  },
  "age": 25,
  "hobbies": ["reading", "writing"]
}

13.3.1 Encoding

Also called “serialization”.

import json

flat_str = json.dumps(flat)
flat_str
'{"name": "John Doe", "age": 25}'
nested_str = json.dumps(nested)
nested_str
'{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, "age": 25, "hobbies": ["reading", "writing"]}'
arr_str = json.dumps(arr)
arr_str
'{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'
Path("/tmp/flat.json").write_text(flat_str)
Path("/tmp/nested.json").write_text(nested_str)

13.3.2 Decoding

Also called “deserialization”.

json.load(
    Path("/tmp/flat.json").open()
)
{'name': 'John Doe', 'age': 25}
json.loads(
    Path("/tmp/flat.json").read_text()
)
{'name': 'John Doe', 'age': 25}
json.load(
    Path("/tmp/nested.json").open()
)
{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}
json.loads(
    Path("/tmp/nested.json").read_text()
)
{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}

Notice we load the data into a python dictionary:

out = json.loads(
    Path("/tmp/nested.json").read_text()
)
type(out)
dict

We can also store a list as JSON array:

Path("/tmp/arr.json").write_text(
    json.dumps([
        {"a": 1},
        {"b": 2},
    ])
)
out = json.load(Path("/tmp/arr.json").open())
print(type(out))
out
<class 'list'>
[{'a': 1}, {'b': 2}]

13.4 Parquet

Parquet is a column oriented format. For a number of reasons, this format is much more efficient than csv and other formats.

df
date id age
0 2020-01-01 x12 19
1 2020-01-02 x11 23
2 2020-01-02 x3 22
3 2020-01-03 x19 28

With pandas we can save data to a parquet file:

df.to_parquet("/tmp/data.parquet")

And read in:

pd.read_parquet("/tmp/data.parquet")
date id age
0 2020-01-01 x12 19
1 2020-01-02 x11 23
2 2020-01-02 x3 22
3 2020-01-03 x19 28
Tip

Prefer parquet format when possible. It is faster to read and it stores metadata that can be used by libraries for optimization, for example, applying some filters.