Pandas Column and Row Types

Columns in a Pandas DataFrame Contain the Same Type

Like the columns in a database or in a spreadsheet, we generally expect that all the values in a single column will be of the same type -- all ints or floats or strings or datetimes or what have you. Consider the following table:

In [1]:
import pandas as pd
import numpy as np

# Create DataFrame with different types:
df = pd.DataFrame({"ints": [1, 2, 3], "strings": ["Moe", "Larry", "Curly"], "floats": [2.1, 9.9, 8.5]}, index=list("XYZ"))
df
   floats  ints strings
X     2.1     1     Moe
Y     9.9     2   Larry
Z     8.5     3   Curly

Each of the columns, when selected, is a single Pandas series. The dtype of the column is the same as the dtype of the individual elements.

In [2]:
# Column dtypes represent the homogeneous values they contain:
print("Column types: ", df["floats"].dtype, df["ints"].dtype, df["strings"].dtype,  "\n")

# Prove that these are actually dtypes:
print("Column dtypes: ", df["floats"].dtype.__repr__(), df["ints"].dtype.__repr__(), df["strings"].dtype.__repr__(),  "\n")

# And they're all Pandas series objects.
print("Type of a column: ", type(df["ints"]))
Column types:  float64 int64 object 

Column dtypes:  dtype('float64') dtype('int64') dtype('O') 

Type of a column:  <class 'pandas.core.series.Series'>
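To see the same promotion rule at work without a DataFrame, here is a small sketch (not from the original notebook) building individual Series with progressively more mixed values:

```python
import pandas as pd

# An all-int column keeps the specific int64 dtype:
all_ints = pd.Series([1, 2, 3])
print(all_ints.dtype)       # int64

# Mixing ints and floats promotes the whole column to float64:
mixed_numeric = pd.Series([1, 2.5, 3])
print(mixed_numeric.dtype)  # float64

# Mixing in a string forces the generic object dtype:
mixed_all = pd.Series([1, 2.5, "three"])
print(mixed_all.dtype)      # object
```

Each column's dtype is the most specific type that can hold all of its values, which is exactly the behavior the row examples below run into.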

Rows Contain Different Types, So What Type Does the Whole Row Have?

If we show the value of a row, we get an idea that the type of the row is a kind of "least common denominator" of the different column types. It represents the most specific type that can hold every value in the row.

In [3]:
# Accessing a row gives the most generic common type as the dtype of the series.
# Get the last row:
last_row = df.loc["Z", :]

# Show it:
last_row
floats       8.5
ints           3
strings    Curly
Name: Z, dtype: object
In [4]:
# A row is also a Series, but its dtype is the most generic common type.
print("Type of a row: ", type(last_row))

# dtype for the whole row
print("Row dtype: ", last_row.dtype)
Type of a row:  <class 'pandas.core.series.Series'>
Row dtype:  object

What About the Values in the Row?

If the type of the row represents the type that can be applied to all of its values, what type do the elements in the row have? This is where things get a little spooky, because the answer is that sometimes they have the original dtype, and sometimes they have the scalar type of the object.

In [5]:
# Individual row values sometimes still have their original dtype!

# Get the value in each column of the row:
an_int = last_row["ints"]
a_float = last_row["floats"]
a_string = last_row["strings"]

# Here we get a dtype, then get the value of the python scalar.
print("A dtype of: ", type(an_int), " with a value of: ", an_int.item())
print("A dtype of: ", type(a_float), " with a value of: ", a_float.item())

# But this doesn't work for the string, because strings are stored as the scalar directly.
# Error!
# print("A dtype of: ", type(a_string), " with a value of: ", a_string.item())

# Instead, work with the scalar directly.
print("A scalar python type of: ", type(a_string), " with a value of: ", a_string)
A dtype of:  <class 'numpy.int64'>  with a value of:  3
A dtype of:  <class 'numpy.float64'>  with a value of:  8.5
A scalar python type of:  <class 'str'>  with a value of:  Curly
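Since a row element may come back as either a NumPy scalar or a plain Python object, one defensive approach is to branch on `np.generic`, the common base class of all NumPy scalar types. This is a sketch, not a Pandas API: the `to_python_scalar` helper below is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ints": [1, 2, 3],
                   "strings": ["Moe", "Larry", "Curly"],
                   "floats": [2.1, 9.9, 8.5]},
                  index=list("XYZ"))
last_row = df.loc["Z", :]

def to_python_scalar(value):
    """Convert a row element to a plain Python scalar.

    NumPy scalar types (np.int64, np.float64, ...) all inherit from
    np.generic and support .item(); anything else is assumed to
    already be a plain Python object and is returned unchanged.
    """
    if isinstance(value, np.generic):
        return value.item()
    return value

for name in ("ints", "floats", "strings"):
    scalar = to_python_scalar(last_row[name])
    print(name, "->", scalar, type(scalar))
```

With this check, the string case that raised an error above is handled by the same code path as the numeric columns.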

Promotion of Rows to Series

Originally I thought that what was behind this weirdness in the type of a single row is the same sort of weirdness that happens when using pandas.DataFrame.iterrows -- the row is being promoted to a series. From the Pandas documentation for DataFrame.iterrows: "Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). ... To preserve dtypes while iterating over the rows, it is better to use itertuples which returns namedtuples of the values and which is generally faster than iterrows."

However, it turns out that itertuples demonstrates the same behavior: the string still comes back as a plain old Python string. This is not a feature of Pandas, but of NumPy. The answer to the mystery is hidden in plain sight in the NumPy docs:

  • Note The data actually stored in object arrays (i.e., arrays having dtype object) are references to Python objects, not the objects themselves. Hence, object arrays behave more like usual Python lists, in the sense that their contents need not be of the same Python type. The object type is also special because an array containing object items does not return an object_ object on item access, but instead returns the actual object that the array item refers to.
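The note above can be reproduced with a few lines of plain NumPy. This sketch (independent of the original notebook) builds an object array and inspects what item access returns:

```python
import numpy as np

# An object array stores references to the original Python objects:
arr = np.array([1, "two", 3.0], dtype=object)
print(arr.dtype)     # object

# Item access returns the referenced objects themselves,
# not numpy.object_ wrappers:
print(type(arr[0]))  # <class 'int'>
print(type(arr[1]))  # <class 'str'>
print(type(arr[2]))  # <class 'float'>
```

Since an object-dtype string column is backed by exactly this kind of array, indexing it hands back the original `str`, which is why `.item()` fails on it.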

Observing the same behavior for itertuples

Generally, the fastest Pandas operations are the Cython-optimized, vectorized operations on a column, but if you do need to iterate over rows efficiently while preserving the underlying types, as you would for an ETL tool, for example, you would use itertuples.
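As an illustration of that trade-off, here is a sketch (reusing the example DataFrame) computing the same result with a vectorized column operation and with an itertuples loop:

```python
import pandas as pd

df = pd.DataFrame({"ints": [1, 2, 3],
                   "floats": [2.1, 9.9, 8.5]},
                  index=list("XYZ"))

# Vectorized: one optimized operation over the whole column.
vectorized = df["ints"] * 2

# Row-by-row: the same result built by iterating with itertuples.
by_rows = pd.Series([row.ints * 2 for row in df.itertuples()],
                    index=df.index)

print(vectorized.equals(by_rows))  # True
```

Both produce an identical int64 Series; the vectorized form is the idiomatic default, and the loop is the fallback when per-row logic genuinely requires it.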

itertuples returns an iterator of namedtuple objects, which are tuples whose fields can also be accessed by name.

In [6]:
# Let's hang onto the last row so we can look at it.
for row in df.itertuples():
    last_tuple = row

print(last_tuple)

print("last_tuple.floats value:", last_tuple.floats)
print("last_tuple.floats dtype:", last_tuple.floats.dtype)
print("last_tuple.floats type:", type(last_tuple.floats))

print("last_tuple.strings value: ", last_tuple.strings)

# Note type is plain class str, not numpy object!
print("last_tuple.strings type: ", type(last_tuple.strings))

# Won't work, as detailed above:
# >> AttributeError: 'str' object has no attribute 'dtype'
# print(last_tuple.strings.dtype)
Pandas(Index='Z', floats=8.5, ints=3, strings='Curly')
last_tuple.floats value: 8.5
last_tuple.floats dtype: float64
last_tuple.floats type: <class 'numpy.float64'>
last_tuple.strings value:  Curly
last_tuple.strings type:  <class 'str'>