13 Common Data Formats

13.1 CSV

Code

data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""
print(data)

date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28

Code

from pathlib import Path
filepath = "/tmp/data.csv"
Path(filepath).write_text(data)

Loading data into a dataframe is not the only but one the most common ways to to load this data. We will use here pandas, a very popular library for data wrangling in python.

Install pandas:

uv add pandas

import pandas as pd

df = pd.read_csv(filepath)
df

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

We can also save a dataframe as csv:

df.to_csv("/tmp/data.csv", index=None)

!cat /tmp/data.csv

date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28

13.2 Excel

We can also read in data coming from an excel spreadsheet.

df.to_excel("/tmp/data.xlsx", index=None)

pd.read_excel("/tmp/data.xlsx")

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

13.3 JSON

JSON (JavaScript Object Notation) is by far one of the most used data formats, nowadays the default format to transfer data over the internet. It is also very commonly used for configuration files and logging.

flat = {
  "name": "John Doe",
  "age": 25
}

nested = {
  "name": {
    "firstName": "John",
    "lastName": "Doe",
    "middleName": "Smith"
  },
  "age": 25,
  "hobbies": ["reading", "writing"]
}

13.3.1 Encoding

Also called “serialization”.

import json

flat_str = json.dumps(flat)
flat_str

'{"name": "John Doe", "age": 25}'

nested_str = json.dumps(nested)
nested_str

'{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, "age": 25, "hobbies": ["reading", "writing"]}'

arr_str = json.dumps(arr)
arr_str

'{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'

Path("/tmp/flat.json").write_text(flat_str)
Path("/tmp/nested.json").write_text(nested_str)

13.3.2 Decoding

Also called “deserialization”.

json.load(
    Path("/tmp/flat.json").open()
)

{'name': 'John Doe', 'age': 25}

json.loads(
    Path("/tmp/flat.json").read_text()
)

{'name': 'John Doe', 'age': 25}

json.load(
    Path("/tmp/nested.json").open()
)

{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}

json.loads(
    Path("/tmp/nested.json").read_text()
)

{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
 'age': 25,
 'hobbies': ['reading', 'writing']}

Notice we load the data into a python dictionary:

out = json.loads(
    Path("/tmp/nested.json").read_text()
)
type(out)

dict

We can also store a list as JSON array:

Path("/tmp/arr.json").write_text(
    json.dumps([
        {"a": 1},
        {"b": 2},
    ])
)

out = json.load(Path("/tmp/arr.json").open())
print(type(out))
out

<class 'list'>

[{'a': 1}, {'b': 2}]

13.4 Parquet

Parquet is a column oriented format. For a number of reasons, this format is much more efficient than csv and other formats.

df

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

With pandas we can save data to a parquet file:

df.to_parquet("/tmp/data.parquet")

And read in:

pd.read_parquet("/tmp/data.parquet")

	date	id	age
0	2020-01-01	x12	19
1	2020-01-02	x11	23
2	2020-01-02	x3	22
3	2020-01-03	x19	28

Tip

Prefer parquet format when possible. It is faster to read and it stores metadata that can be used by libraries for optimization, for example, applying some filters.