data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""
print(data)
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
Loading data into a dataframe is not the only way to work with this data, but it is one of the most common. Here we will use pandas, a very popular Python library for data wrangling.
Install pandas:
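pandas is typically installed with `pip install pandas`. Once it is available, the CSV text can be parsed into a dataframe. The following is a minimal sketch, assuming the CSV lives in a string like the `data` variable printed above; `io.StringIO` is used here only so that `pd.read_csv` can treat the string like a file, and for a file on disk a path would be passed instead.

```python
import io

import pandas as pd

# Same CSV text as the `data` string shown earlier.
data = """\
date,id,age
2020-01-01,x12,19
2020-01-02,x11,23
2020-01-02,x3,22
2020-01-03,x19,28
"""

# read_csv accepts a path or any file-like object; StringIO wraps the string.
df = pd.read_csv(io.StringIO(data))
print(df)
```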
| | date | id | age |
---|---|---|---|
0 | 2020-01-01 | x12 | 19 |
1 | 2020-01-02 | x11 | 23 |
2 | 2020-01-02 | x3 | 22 |
3 | 2020-01-03 | x19 | 28 |
We can also save a dataframe as CSV:
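As a small sketch of saving, the example below rebuilds the same four rows and writes them with `DataFrame.to_csv`. The file name `people.csv` and the temporary directory are assumptions made for illustration; `index=False` is a choice that keeps the row labels out of the file.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
        "id": ["x12", "x11", "x3", "x19"],
        "age": [19, 23, 22, 28],
    }
)

# Hypothetical output location inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "people.csv")

# index=False leaves the 0..n-1 row labels out of the file.
df.to_csv(out_path, index=False)

# Read the file back as plain text to check what was written.
with open(out_path) as f:
    saved = f.read()
print(saved)
```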
We can also read in data from an Excel spreadsheet.
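A hedged sketch of reading a spreadsheet with `pd.read_excel` follows. The helper name `read_spreadsheet` and the file name in the comment are hypothetical, and pandas needs an Excel engine installed for this to work (for `.xlsx` files that is usually `openpyxl`).

```python
import pandas as pd


def read_spreadsheet(path, sheet_name=0):
    """Load one sheet of an Excel workbook into a dataframe."""
    # sheet_name accepts a position (0 = first sheet) or the sheet's name.
    return pd.read_excel(path, sheet_name=sheet_name)


# Example call (hypothetical file name):
# df = read_spreadsheet("people.xlsx")
```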
JSON (JavaScript Object Notation) is one of the most widely used data formats and is now the default format for transferring data over the internet. It is also commonly used for configuration files and logging.
Converting a Python object to a JSON string is also called “serialization”.
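A minimal sketch of serialization with the standard library `json` module; the variable names `person` and `json_string` are chosen here for illustration.

```python
import json

person = {
    "name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"},
    "age": 25,
    "hobbies": ["reading", "writing"],
}

# dumps ("dump string") turns the Python object into JSON text.
json_string = json.dumps(person)
print(json_string)
```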
'{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, "age": 25, "hobbies": ["reading", "writing"]}'
'{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'
Converting a JSON string back into a Python object is also called “deserialization”.
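The reverse direction can be sketched with `json.loads`; the JSON text below mirrors the string shown earlier, and the variable names are illustrative.

```python
import json

json_string = (
    '{"name": {"firstName": "John", "lastName": "Doe", "middleName": "Smith"}, '
    '"age": 25, "hobbies": ["reading", "writing"]}'
)

# loads ("load string") parses JSON text into Python objects.
person = json.loads(json_string)
print(person)
```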
{'name': {'firstName': 'John', 'lastName': 'Doe', 'middleName': 'Smith'},
'age': 25,
'hobbies': ['reading', 'writing']}
Notice that the data is loaded into a Python dictionary:
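One way to check this is to inspect the types after parsing. A short sketch, using an illustrative JSON string: a JSON object becomes a `dict` and a JSON array becomes a `list`.

```python
import json

person = json.loads(
    '{"name": "John Doe", "age": 25, "hobbies": ["reading", "writing"]}'
)

# A JSON object maps to dict, a JSON array to list, a JSON number to int/float.
print(type(person))
print(type(person["hobbies"]))
print(type(person["age"]))
```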
We can also store a list as a JSON array:
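A brief sketch of the list case, again with illustrative variable names: the top-level JSON value is an array rather than an object, and it round-trips back to a Python list.

```python
import json

hobbies = ["reading", "writing"]

# A Python list serializes to a JSON array.
json_array = json.dumps(hobbies)
print(json_array)

# Parsing the array gives back an equal list.
restored = json.loads(json_array)
```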
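Parquet is a column-oriented binary format. Because values are stored column by column, together with their types and usually compressed, it is typically much more efficient to store and read than CSV and similar text formats.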
With pandas we can save data to a Parquet file:
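A hedged sketch of saving with `DataFrame.to_parquet`. The helper name `save_parquet` and the file name in the comment are hypothetical, and pandas relies on an installed Parquet engine (`pyarrow` or `fastparquet`) to do the actual writing.

```python
import pandas as pd


def save_parquet(df, path):
    """Write a dataframe to a Parquet file."""
    # to_parquet delegates to an installed engine (pyarrow or fastparquet).
    # index=False keeps the 0..n-1 row labels out of the file.
    df.to_parquet(path, index=False)


df = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
        "id": ["x12", "x11", "x3", "x19"],
        "age": [19, 23, 22, 28],
    }
)

# Example call (hypothetical file name):
# save_parquet(df, "people.parquet")
```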
And read in:
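Reading can be sketched with `pd.read_parquet`. The helper name `load_parquet` and the file name in the comment are hypothetical; as with writing, a Parquet engine such as `pyarrow` is assumed to be installed.

```python
import pandas as pd


def load_parquet(path, columns=None):
    """Read a Parquet file into a dataframe."""
    # columns=None reads every column; a list reads only the named columns,
    # which a columnar format can do without scanning the rest of the file.
    return pd.read_parquet(path, columns=columns)


# Example calls (hypothetical file name):
# df = load_parquet("people.parquet")
# ages = load_parquet("people.parquet", columns=["age"])
```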
| | date | id | age |
---|---|---|---|
0 | 2020-01-01 | x12 | 19 |
1 | 2020-01-02 | x11 | 23 |
2 | 2020-01-02 | x3 | 22 |
3 | 2020-01-03 | x19 | 28 |
Prefer the Parquet format when possible. It is faster to read, and it stores metadata that libraries can use for optimization, for example to skip data when applying filters.
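As an illustration of filtering at read time, the sketch below passes a `filters` argument to `pd.read_parquet`. It assumes the `pyarrow` engine is installed; the helper name `load_older_than`, the `age` column, and the file name in the comment are assumptions for this example. How much data is actually skipped depends on the engine and on how the file was written.

```python
import pandas as pd


def load_older_than(path, min_age):
    """Read only the rows whose `age` exceeds min_age."""
    # Each filter is a (column, operator, value) tuple. The engine can use
    # the statistics stored in the file to avoid reading irrelevant chunks.
    return pd.read_parquet(
        path,
        engine="pyarrow",
        filters=[("age", ">", min_age)],
    )


# Example call (hypothetical file name):
# adults = load_older_than("people.parquet", 20)
```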