Achtung: Watch out, German csv readers!

python
pandas
text
Little pandas Gotcha dealing with German text.
Author

Fabrizio Damicelli

Published

May 20, 2020

TL;DR: pandas.read_csv considers the word “null” as a NaN, which also means “zero” in German. The arguments na_values and keep_default_na offer a solution.

It’s Friday an you set out to build a very sophisticated numbers translator in several languages:

import pandas as pd

numbers = pd.DataFrame({
   "Spanish": ["cero", "uno", "dos", "tres"],
   "English": ["zero", "one", "two", "three"],
   "German": ["null", "eins", "zwei", "drei"],
})
numbers.index = numbers.index.rename("Number")
numbers
Spanish English German
Number
0 cero zero null
1 uno one eins
2 dos two zwei
3 tres three drei

If you want to know how to say 3 in Spanish, you do:

numbers.loc[3, "Spanish"]
'tres'

Nice. You save the super advanced translator for later use and go off for a relaxed weekend.

numbers.to_csv("numbers")

Back to work on Monday, your German friend drops by your office and you want to proudly show what you’ve created.
So you load your “translator” and go like:
– Ask me how to say any number!
– OK. Let’s start easy: how do you say zero in German?

“That I can do”, you think and type:

# Load the awesome translator
numbers_loaded = pd.read_csv("numbers", index_col="Number")
# And get the translation
numbers_loaded.loc[0, "German"]
nan

Oh no, that’s no good!
You get out of the embarrassing situation saying it is actually a beta version, and, and, and. The harm is done and your friend leaves the office skeptical – to say the least.

What’s was the problem?
The answer is in the docstrings of the function pd.read_csv. If we look carefully at which values pandas considers as NaN per default we find the following:

"""
na_values : scalar, str, list-like, or dict, optional
    Additional strings to recognize as NA/NaN. If dict passed, specific
    per-column NA values.  By default the following values are interpreted as
    NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
    'null'.
"""

There it is: “null” is in the list!

Solution:
We need to do two things: - Pass other values without “null” (and “NULL” if you’re not sure everything is lowercase). - Tell pandas not to keep the defaults (otherwise it will use both the defaults and the passed values).

na_values = [
    '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', 
    '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NaN', 'n/a', 'nan'
]

numbers_loaded = pd.read_csv(
    "numbers", 
     index_col="Number",
     na_values=na_values, 
     keep_default_na=False
)
                            
numbers_loaded
Spanish English German
Number
0 cero zero null
1 uno one eins
2 dos two zwei
3 tres three drei

Now we can ask:

numbers_loaded.loc[0, "German"]
'null'

That will keep your German friends happy :)

/Fin

Any bugs, questions, comments, suggestions? Ping me on twitter or drop me an e-mail (fabridamicelli at gmail).
Share this article on your favourite platform: