Indexing Data in Pandas

We set up a test DataFrame and use it to demonstrate many ways to index data, directly using [ ], and using loc and iloc.

TODO -- We exlplore things like return values along the way, but this means the order is all wrong. The return value discussion should come out as a separate notebook.

Better Outline:

  1. Indexing using []
    1. How strings are treated
    2. How ranges are treated
    3. Problems
    4. Relation of [] to getitem
  2. loc and iloc
  3. pandasql
In [1]:
# Do imports and set up a test DataFrame
import pandas as pd
import numpy as np

cols = ['x' + n for n in list('01234')]
indexes = ['y' + n for n in list('0123456789')]
np.random.seed(42)
df = pd.DataFrame(np.random.rand(10,5), columns=cols, index=indexes)
df
Out[1]:
x0 x1 x2 x3 x4
y0 0.374540 0.950714 0.731994 0.598658 0.156019
y1 0.155995 0.058084 0.866176 0.601115 0.708073
y2 0.020584 0.969910 0.832443 0.212339 0.181825
y3 0.183405 0.304242 0.524756 0.431945 0.291229
y4 0.611853 0.139494 0.292145 0.366362 0.456070
y5 0.785176 0.199674 0.514234 0.592415 0.046450
y6 0.607545 0.170524 0.065052 0.948886 0.965632
y7 0.808397 0.304614 0.097672 0.684233 0.440152
y8 0.122038 0.495177 0.034389 0.909320 0.258780
y9 0.662522 0.311711 0.520068 0.546710 0.184854

Strings or lists of strings passed to [] are treated as columns

In [2]:
# A string
df['x0']
Out[2]:
y0    0.374540
y1    0.155995
y2    0.020584
y3    0.183405
y4    0.611853
y5    0.785176
y6    0.607545
y7    0.808397
y8    0.122038
y9    0.662522
Name: x0, dtype: float64
In [3]:
# A list of strings returns more than one column as a DataFrame
df[['x0', 'x2']]
Out[3]:
x0 x2
y0 0.374540 0.731994
y1 0.155995 0.866176
y2 0.020584 0.832443
y3 0.183405 0.524756
y4 0.611853 0.292145
y5 0.785176 0.514234
y6 0.607545 0.065052
y7 0.808397 0.097672
y8 0.122038 0.034389
y9 0.662522 0.520068

Ranges are treated as rows

In [4]:
# Note this is equivalent to df.tail(4)
df[-4:]
Out[4]:
x0 x1 x2 x3 x4
y6 0.607545 0.170524 0.065052 0.948886 0.965632
y7 0.808397 0.304614 0.097672 0.684233 0.440152
y8 0.122038 0.495177 0.034389 0.909320 0.258780
y9 0.662522 0.311711 0.520068 0.546710 0.184854

DataFrame.loc

Use "loc" with brackets to specify the rows and columns you want by name. Syntax is frame_name.loc[rows,columns]

In [5]:
# Passing a single row and column name
df.loc["y4","x2"]
Out[5]:
0.29214464853521815
In [6]:
# Passing a "slice" of rows and columns
# Todo, link to and write a slicing review.
df.loc["y":,"x1":"x2"]
Out[6]:
x1 x2
y0 0.950714 0.731994
y1 0.058084 0.866176
y2 0.969910 0.832443
y3 0.304242 0.524756
y4 0.139494 0.292145
y5 0.199674 0.514234
y6 0.170524 0.065052
y7 0.304614 0.097672
y8 0.495177 0.034389
y9 0.311711 0.520068

What gets returned

In [7]:
# Type returned depends on whether you pass a label or a list.

print("Passing a label gives: " + str(type(df["x0"])))
print("Passing a list even of one label gives: " + str(type(df[["x0"]])))
Passing a label gives: <class 'pandas.core.series.Series'>
Passing a list even of one label gives: <class 'pandas.core.frame.DataFrame'>
In [8]:
# More on data types -- move out

# Columns
print(type(df.loc[:,["x0"]]))
print(type(df.loc[:,"x0"]))

# Rows
print(type(df.loc[["y1"],:]))
print(type(df.loc["y1",:]))

# Single element
print(type(df.loc["y1","x1"]))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'numpy.float64'>
In [9]:
# Indexing directly can lead to chained badness?  No, ok


df['x0'][0:3] = 2

df
Out[9]:
x0 x1 x2 x3 x4
y0 2.000000 0.950714 0.731994 0.598658 0.156019
y1 2.000000 0.058084 0.866176 0.601115 0.708073
y2 2.000000 0.969910 0.832443 0.212339 0.181825
y3 0.183405 0.304242 0.524756 0.431945 0.291229
y4 0.611853 0.139494 0.292145 0.366362 0.456070
y5 0.785176 0.199674 0.514234 0.592415 0.046450
y6 0.607545 0.170524 0.065052 0.948886 0.965632
y7 0.808397 0.304614 0.097672 0.684233 0.440152
y8 0.122038 0.495177 0.034389 0.909320 0.258780
y9 0.662522 0.311711 0.520068 0.546710 0.184854

Selecting rows by values in columns

Set up a new test dataframe to try locating specific rows by column values

In [10]:
cols = ['x' + n for n in list('01234')]
indexes = ['y' + n for n in list('0123456789')]
df = pd.DataFrame(np.arange(1, 51).reshape(10,5), columns=cols, index=indexes)
df
Out[10]:
x0 x1 x2 x3 x4
y0 1 2 3 4 5
y1 6 7 8 9 10
y2 11 12 13 14 15
y3 16 17 18 19 20
y4 21 22 23 24 25
y5 26 27 28 29 30
y6 31 32 33 34 35
y7 36 37 38 39 40
y8 41 42 43 44 45
y9 46 47 48 49 50

Show rows using loc

In [25]:
# Get a single matching row
display(df.loc[df.x0 == 11])

# Another example matching row
display(df.loc[df['x2'] == 38])
x0 x1 x2 x3 x4
y2 11 12 13 14 15
x0 x1 x2 x3 x4
y7 36 37 38 39 40
In [26]:
# Return rows where x0 > 31

display(df.loc[df['x0'] > 31])
x0 x1 x2 x3 x4
y7 36 37 38 39 40
y8 41 42 43 44 45
y9 46 47 48 49 50
In [27]:
# Using where