Intro to Pandas in Python

Published 9/20/2016


Quick Note

This tutorial is part of the Python Basic Resources 4-part series produced by Galvanize.

This tutorial moves very fast. If you would like a slower, more in-depth intro to Python, we suggest you take our Intro to Python Evening Course. It's the perfect way to become familiar with Python + gain experience using Python to solve challenging problems.

Prerequisites

This web tutorial follows the Jupyter notebook found on GitHub. If you are viewing it using GitHub, then you cannot execute the cells that contain Python code. To view and run this notebook you'll need to install Jupyter on your computer before you continue. See these installation instructions for help!

Overview

In this tutorial, we will use a very awesome Python library named "Pandas". We will look at how to get data into a Pandas DataFrame and how to view and manipulate the data.

Pandas Import

import pandas as pd

Loading External Data

The Pandas documentation will show you all of the ways that you can load external data into a DataFrame (there are also ways to build a DataFrame from data already in your Python program, but for now we'll focus on loading external data). Basically, Pandas can read data in just about any format you might encounter (CSV, JSON, SQL, Excel, HTML), and each of these loaders takes some form of a read_ method. So, if we wanted to load data in from a CSV, we would simply use:

df = pd.read_csv('my_data.csv')

Note that by default, read_csv expects the column names to be in the first row of the .csv.
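If your file doesn't have a header row, you can tell read_csv so explicitly. Here's a minimal sketch using an in-memory CSV (the contents and column names below are made up for illustration):

```python
import io
import pandas as pd

# An in-memory stand-in for a CSV file with no header row.
raw = io.StringIO("1,2\n3,4")

# header=None tells Pandas the first row is data, not column names;
# names= supplies the column labels ourselves.
df = pd.read_csv(raw, header=None, names=['col_a', 'col_b'])
print(df.shape)  # (2, 2)
```

The other loaders (pd.read_json(), pd.read_excel(), pd.read_sql(), etc.) follow the same read_ pattern.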

In [1]:
import pandas as pd
df = pd.read_csv('data/forestfires.csv')
In [2]:
# Gives us a very high level overview of our data. 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
X        517 non-null int64
Y        517 non-null int64
month    517 non-null object
day      517 non-null object
FFMC     517 non-null float64
DMC      517 non-null float64
DC       517 non-null float64
ISI      517 non-null float64
temp     517 non-null float64
RH       517 non-null int64
wind     517 non-null float64
rain     517 non-null float64
area     517 non-null float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB
In [3]:
# Gives us a more detailed look at each of the columns in our dataset. Note that it 
# doesn't include non-numeric columns in this summary. 
df.describe()
Out[3]:
X Y FFMC DMC DC ISI temp RH wind rain area
count 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000 517.000000
mean 4.669246 4.299807 90.644681 110.872340 547.940039 9.021663 18.889168 44.288201 4.017602 0.021663 12.847292
std 2.313778 1.229900 5.520111 64.046482 248.066192 4.559477 5.806625 16.317469 1.791653 0.295959 63.655818
min 1.000000 2.000000 18.700000 1.100000 7.900000 0.000000 2.200000 15.000000 0.400000 0.000000 0.000000
25% 3.000000 4.000000 90.200000 68.600000 437.700000 6.500000 15.500000 33.000000 2.700000 0.000000 0.000000
50% 4.000000 4.000000 91.600000 108.300000 664.200000 8.400000 19.300000 42.000000 4.000000 0.000000 0.520000
75% 7.000000 5.000000 92.900000 142.400000 713.900000 10.800000 22.800000 53.000000 4.900000 0.000000 6.570000
max 9.000000 9.000000 96.200000 291.300000 860.600000 56.100000 33.300000 100.000000 9.400000 6.400000 1090.840000

In [4]:
# Shows us the first 5 rows of our data set. 
df.head()
Out[4]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
In [5]:
# Shows us the last 5 rows of our data set. 
df.tail()
Out[5]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
512 4 3 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 6.44
513 2 4 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 54.29
514 7 4 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 0.0 11.16
515 1 4 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.0 0.00
516 6 3 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.0 0.00

A little bit more of Pandas...

There are TONS of things you can do with a Pandas DataFrame. They were inspired by R DataFrames, and so most (if not all) of the stuff you could do in R, you can probably do here. In this tutorial we will focus on some of the more common things that you can do with DataFrames (it would take too long to cover everything). If you want to view all of the available attributes and methods of DataFrames, you can check out the Pandas Docs. If you want practical examples of how you might use DataFrames, I might suggest getting a copy of Python for Data Analysis (it's written by Wes McKinney, the creator of Pandas).

Alright, let's dive in...

Indexing to Grab your Data

Let's start off with some basic indexing. This will allow you to grab the columns, rows, etc. that you want. When we index into DataFrames, we will always be using some form of bracket notation [], with one or two sets of numbers/strings in there. If it's two sets, then they will be separated by a comma (something like [number/letter, number/letter]). The first number will always be a reference to rows, while the second will always be a reference to columns.

If we want to grab entire rows, then we can simply index into our DataFrame almost like we would a list - df[rows_to_grab]. The only caveat is that you have to use slice notation (two indices separated by a :). If we want to grab entire columns, then we can grab those by using bracket notation (df[column_name]) or dot notation (df.column_name). The one caveat here is that dot notation only works if the column name contains no spaces (and doesn't clash with an existing DataFrame attribute).

If we want to grab only certain rows and columns, there are three common methods that we can use to index into a Pandas DataFrame: .loc[], .iloc[], and .ix[].

.loc[] is a purely label-location based indexer, .iloc[] is a purely integer-location based indexer, and .ix[] is a primarily label-location based indexer that falls back to integer indexing. (Note that .ix[] has since been deprecated in Pandas, so prefer .loc[] and .iloc[] in new code.)

In [6]:
# Let's look at this mini-DataFrame. 
import pandas as pd
data_lst = [{'a': 1, 'b': 2, 'c':3}, {'a': 4, 'b':5, 'c':6, 'd':7}]
df = pd.DataFrame(data_lst) # Create the DataFrame.
df
Out[6]:
a b c d
0 1 2 3 NaN
1 4 5 6 7.0
In [7]:
df[:1]
Out[7]:
a b c d
0 1 2 3 NaN
What do you think would happen if we tried to access the first or second row using df[0] and df[1]?
In [8]:
# Empty cell to test the answer to the question above. 
In [9]:
df['c']  # Can access columns via brackets or by dot notation. 
df.c
Out[9]:
0    3
1    6
Name: c, dtype: int64
In [10]:
df.loc[:, 'a'] # Note that df.loc[:, 1] would fail here. Why?
Out[10]:
0    1
1    4
Name: a, dtype: int64
In [11]:
df.iloc[:, 0]
Out[11]:
0    1
1    4
Name: a, dtype: int64
What happens if we try df.iloc[:, 'a'] here?
In [12]:
# Empty cell to test answer. 
In [13]:
print(df.ix[:, 'a']) # With .ix[], either .ix[:, 'a'] or .ix[:, 0] works.
print(df.ix[:, 0])   # Note, though, that .ix[] was deprecated in later Pandas versions.
0    1
1    4
Name: a, dtype: int64
0    1
1    4
Name: a, dtype: int64
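Since .ix[] was later deprecated (and eventually removed) in Pandas, a safer habit in newer code is to reach for .loc[] and .iloc[] explicitly. A quick sketch on the same mini-DataFrame:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}])

# Label-based and integer-based lookups of the same column.
by_label = df.loc[:, 'a']
by_position = df.iloc[:, 0]
print(by_label.equals(by_position))  # True
```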

Querying to grab your data

What if you don't know exactly what rows/columns you want to grab beforehand? Is there a way that you can grab the data you want by simply specifying some query parameters? Yes! There are a couple of ways that you can do this. The first way we'll look at is just through masking, whereas the second actually uses a .query() method available on the Pandas DataFrame.

In [14]:
df # Remind ourselves of what the data looks like. 
Out[14]:
a b c d
0 1 2 3 NaN
1 4 5 6 7.0
In [15]:
df['c'] == 3 # Note that this just gives us a True/False for each observation, for whether
             # or not that observation's 'c' value equals 3 (this is called a mask). To grab 
             # the entire row we have to do the following...
Out[15]:
0     True
1    False
Name: c, dtype: bool
In [16]:
df[df['c'] == 3]  # Here we use the mask to grab the entire row.
Out[16]:
a b c d
0 1 2 3 NaN

While using a mask like above works, this can get really messy if you have multiple conditions that you want to specify... df[(condition1) & (condition2) & (condition3)]. Using the .query() method on DataFrames is typically preferred, as it makes your code much cleaner and easier to read. All you do is put your query into a string, and then place it into the .query() method.
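For comparison, here's what the mask version of a multi-condition filter looks like on the mini-DataFrame from above. Note that each condition needs its own parentheses, and that we use & rather than the keyword and:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}])

# Combine masks with & (element-wise AND); the parentheses around
# each condition are required.
result = df[(df['c'] == 3) & (df['b'] == 2) & (df['a'] == 1)]
print(len(result))  # 1
```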

In [17]:
df.query('c == 3')
Out[17]:
a b c d
0 1 2 3 NaN
In [18]:
df.query('c == 3 and b == 2 and a == 1') # Now we can do arbitrarily long queries more easily.
Out[18]:
a b c d
0 1 2 3 NaN
In [19]:
c = 3
b = 2
a = 1
df.query('c == @c and b == @b and a == @a')
Out[19]:
a b c d
0 1 2 3 NaN

What else can I do with my data?

Anything! The world is your oyster! Seriously, though, chances are high that you can do whatever you're imagining (within reason) with your data in a Pandas DataFrame. I've already mentioned that Pandas DataFrames are based on R's data frames, and so anything you can do with R's data frames, you can most likely do in Pandas DataFrames. For those of you coming from SQL, you can also do pretty much anything you would want to do in SQL (with some slightly different syntax). Let's go back to our forest-fires data.

Starting with groupby's...

In [20]:
import pandas as pd
df = pd.read_csv('data/forestfires.csv')
df.head()
Out[20]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
In [21]:
# Let's group the data by month. Note that it gives me back a groupby object until 
# I do something with it. 
df.groupby('month')
Out[21]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11286c0d0>
In [22]:
# Note that we can store the results of a groupby to then perform all kinds of operations on
# it (this is actually the preferred method if we're going to perform more than one calculation
# on it). We have tons of operations we can perform on it. 
groupby_obj = df.groupby('month')
groupby_obj.mean()
groupby_obj.max()
groupby_obj.count() 
Out[22]:
X Y day FFMC DMC DC ISI temp RH wind rain area
month
apr 9 9 9 9 9 9 9 9 9 9 9 9
aug 184 184 184 184 184 184 184 184 184 184 184 184
dec 9 9 9 9 9 9 9 9 9 9 9 9
feb 20 20 20 20 20 20 20 20 20 20 20 20
jan 2 2 2 2 2 2 2 2 2 2 2 2
jul 32 32 32 32 32 32 32 32 32 32 32 32
jun 17 17 17 17 17 17 17 17 17 17 17 17
mar 54 54 54 54 54 54 54 54 54 54 54 54
may 2 2 2 2 2 2 2 2 2 2 2 2
nov 1 1 1 1 1 1 1 1 1 1 1 1
oct 15 15 15 15 15 15 15 15 15 15 15 15
sep 172 172 172 172 172 172 172 172 172 172 172 172

Check out the Group By documentation to look at what all you can do with the Pandas .groupby().
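One especially handy groupby tool from those docs is .agg(), which lets you compute several statistics at once. Here's a minimal sketch on a tiny stand-in for the forest-fires data (the values below are made up for illustration):

```python
import pandas as pd

# A tiny stand-in for the forest-fires data.
df = pd.DataFrame({
    'month': ['mar', 'mar', 'aug', 'aug'],
    'temp':  [8.2, 11.4, 22.2, 24.1],
    'area':  [0.0, 1.5, 6.4, 0.0],
})

# Compute a different statistic per column in a single pass.
summary = df.groupby('month').agg({'temp': 'mean', 'area': 'max'})
print(summary.loc['mar', 'area'])  # 1.5
```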

You can sort as well...

In [24]:
df.sort_values('temp') # Note this is ascending by default. 
df.sort_values('temp', ascending=False)
Out[24]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
498 6 5 aug tue 96.1 181.1 671.2 14.3 33.3 26 2.7 0.0 40.54
484 2 5 aug sun 94.9 130.3 587.1 14.1 33.1 25 4.0 0.0 26.43
496 4 5 aug mon 96.2 175.5 661.8 16.8 32.6 26 3.1 0.0 2.77
492 1 3 aug fri 95.9 158.0 633.6 11.3 32.4 27 2.2 0.0 0.00
491 4 4 aug thu 95.8 152.0 624.1 13.8 32.4 21 4.5 0.0 0.00
497 3 4 aug tue 96.1 181.1 671.2 14.3 32.3 27 2.2 0.0 14.68
483 8 6 aug sun 94.9 130.3 587.1 14.1 31.0 27 5.4 0.0 0.00
421 2 4 aug wed 95.2 217.7 690.0 18.0 30.8 19 4.5 0.0 0.00
494 6 6 aug sat 96.0 164.0 643.0 14.0 30.8 30 4.9 0.0 8.59
485 2 4 aug mon 95.0 135.5 596.3 21.3 30.6 28 3.6 0.0 2.07
480 9 9 jul thu 93.2 114.4 560.0 9.5 30.2 25 4.5 0.0 2.75
481 4 3 jul thu 93.2 114.4 560.0 9.5 30.2 22 4.9 0.0 0.00
28 6 3 sep sat 93.4 145.4 721.4 8.1 30.2 24 2.7 0.0 0.00
154 1 5 sep sat 93.4 145.4 721.4 8.1 29.6 27 2.7 0.0 1.46
422 8 6 jul sun 88.9 263.1 795.9 5.2 29.3 27 3.6 0.0 6.30
503 2 4 aug wed 94.5 139.4 689.1 20.0 29.2 30 4.9 0.0 1.95
504 4 3 aug wed 94.5 139.4 689.1 20.0 28.9 29 4.9 0.0 49.59
490 4 4 aug wed 95.1 141.3 605.8 17.7 28.7 33 4.0 0.0 0.00
248 8 6 aug wed 93.1 157.3 666.7 13.5 28.7 28 2.7 0.0 0.00
156 2 4 sep sat 93.4 145.4 721.4 8.1 28.6 27 2.2 0.0 1.61
228 4 6 sep sun 93.5 149.3 728.6 8.1 28.3 26 3.1 0.0 64.10
260 2 4 aug mon 93.6 97.9 542.0 14.4 28.3 32 4.0 0.0 8.85
382 8 6 aug wed 95.2 217.7 690.0 18.0 28.2 29 1.8 0.0 5.86
399 9 5 jun wed 93.3 49.5 297.7 14.0 28.0 34 4.5 0.0 0.00
400 9 5 jun wed 93.3 49.5 297.7 14.0 28.0 34 4.5 0.0 8.16
427 2 4 aug mon 92.1 207.0 672.6 8.2 27.9 33 2.2 0.0 2.35
405 2 2 aug tue 94.6 212.1 680.9 9.5 27.9 27 2.2 0.0 0.00
512 4 3 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 6.44
231 1 5 sep sun 93.5 149.3 728.6 8.1 27.8 27 3.1 0.0 95.18
511 8 6 aug sun 81.6 56.7 665.6 1.9 27.8 35 2.7 0.0 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
378 6 5 mar thu 90.9 18.9 30.6 8.0 8.7 51 5.8 0.0 0.00
77 1 3 mar mon 87.6 52.2 103.8 5.0 8.3 72 3.1 0.0 0.00
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.00
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.00
411 7 4 feb fri 84.6 3.2 43.6 3.3 8.2 53 9.4 0.0 4.62
7 8 6 aug mon 91.5 145.4 608.2 10.7 8.0 86 2.2 0.0 0.00
390 7 4 feb mon 84.7 9.5 58.3 4.1 7.5 71 6.3 0.0 9.96
284 5 4 feb fri 85.2 4.9 15.8 6.3 7.5 46 8.0 0.0 24.24
75 9 9 feb thu 84.2 6.8 26.6 7.7 6.7 79 3.1 0.0 0.00
196 6 5 apr thu 81.5 9.1 55.2 2.7 5.8 54 5.8 0.0 10.93
176 6 5 apr thu 81.5 9.1 55.2 2.7 5.8 54 5.8 0.0 4.61
61 2 2 mar sun 89.3 51.3 102.2 9.6 5.5 59 6.3 0.0 0.00
394 6 5 feb mon 84.1 4.6 46.7 2.2 5.3 68 1.8 0.0 0.00
165 6 5 mar thu 84.9 18.2 55.0 3.0 5.3 70 4.5 0.0 2.14
104 2 4 jan sat 82.1 3.7 9.3 2.9 5.3 78 3.1 0.0 0.00
379 4 5 jan sun 18.7 1.1 171.4 0.0 5.2 100 0.9 0.0 0.00
464 6 4 feb tue 75.1 4.4 16.2 1.9 5.1 77 5.4 0.0 2.14
274 8 6 dec wed 84.0 27.8 354.6 5.3 5.1 61 8.0 0.0 11.19
211 7 4 aug sat 93.5 139.4 594.2 20.3 5.1 96 5.8 0.0 26.00
281 6 5 dec tue 85.4 25.4 349.7 2.6 5.1 24 8.5 0.0 24.77
275 4 6 dec thu 84.6 26.4 352.0 2.0 5.1 61 4.9 0.0 5.38
273 4 6 dec sun 84.4 27.2 353.5 6.8 4.8 57 8.5 0.0 8.98
465 2 2 feb sat 79.5 3.6 15.3 1.8 4.6 59 0.9 0.0 6.84
279 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 9.77
463 6 5 feb tue 75.1 4.4 16.2 1.9 4.6 82 6.3 0.0 5.39
278 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 22.03
277 3 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 10.73
276 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 17.85
282 6 3 feb sun 84.9 27.5 353.5 3.4 4.2 51 4.0 0.0 0.00
280 4 6 dec fri 84.7 26.7 352.6 4.1 2.2 59 4.9 0.0 9.27

517 rows × 13 columns

We can also sort by multiple columns...

In [26]:
df.sort_values(['temp', 'wind'], ascending=False)
Out[26]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
498 6 5 aug tue 96.1 181.1 671.2 14.3 33.3 26 2.7 0.0 40.54
484 2 5 aug sun 94.9 130.3 587.1 14.1 33.1 25 4.0 0.0 26.43
496 4 5 aug mon 96.2 175.5 661.8 16.8 32.6 26 3.1 0.0 2.77
491 4 4 aug thu 95.8 152.0 624.1 13.8 32.4 21 4.5 0.0 0.00
492 1 3 aug fri 95.9 158.0 633.6 11.3 32.4 27 2.2 0.0 0.00
497 3 4 aug tue 96.1 181.1 671.2 14.3 32.3 27 2.2 0.0 14.68
483 8 6 aug sun 94.9 130.3 587.1 14.1 31.0 27 5.4 0.0 0.00
494 6 6 aug sat 96.0 164.0 643.0 14.0 30.8 30 4.9 0.0 8.59
421 2 4 aug wed 95.2 217.7 690.0 18.0 30.8 19 4.5 0.0 0.00
485 2 4 aug mon 95.0 135.5 596.3 21.3 30.6 28 3.6 0.0 2.07
481 4 3 jul thu 93.2 114.4 560.0 9.5 30.2 22 4.9 0.0 0.00
480 9 9 jul thu 93.2 114.4 560.0 9.5 30.2 25 4.5 0.0 2.75
28 6 3 sep sat 93.4 145.4 721.4 8.1 30.2 24 2.7 0.0 0.00
154 1 5 sep sat 93.4 145.4 721.4 8.1 29.6 27 2.7 0.0 1.46
422 8 6 jul sun 88.9 263.1 795.9 5.2 29.3 27 3.6 0.0 6.30
503 2 4 aug wed 94.5 139.4 689.1 20.0 29.2 30 4.9 0.0 1.95
504 4 3 aug wed 94.5 139.4 689.1 20.0 28.9 29 4.9 0.0 49.59
490 4 4 aug wed 95.1 141.3 605.8 17.7 28.7 33 4.0 0.0 0.00
248 8 6 aug wed 93.1 157.3 666.7 13.5 28.7 28 2.7 0.0 0.00
156 2 4 sep sat 93.4 145.4 721.4 8.1 28.6 27 2.2 0.0 1.61
260 2 4 aug mon 93.6 97.9 542.0 14.4 28.3 32 4.0 0.0 8.85
228 4 6 sep sun 93.5 149.3 728.6 8.1 28.3 26 3.1 0.0 64.10
382 8 6 aug wed 95.2 217.7 690.0 18.0 28.2 29 1.8 0.0 5.86
399 9 5 jun wed 93.3 49.5 297.7 14.0 28.0 34 4.5 0.0 0.00
400 9 5 jun wed 93.3 49.5 297.7 14.0 28.0 34 4.5 0.0 8.16
405 2 2 aug tue 94.6 212.1 680.9 9.5 27.9 27 2.2 0.0 0.00
427 2 4 aug mon 92.1 207.0 672.6 8.2 27.9 33 2.2 0.0 2.35
231 1 5 sep sun 93.5 149.3 728.6 8.1 27.8 27 3.1 0.0 95.18
511 8 6 aug sun 81.6 56.7 665.6 1.9 27.8 35 2.7 0.0 0.00
512 4 3 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 6.44
... ... ... ... ... ... ... ... ... ... ... ... ... ...
378 6 5 mar thu 90.9 18.9 30.6 8.0 8.7 51 5.8 0.0 0.00
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.00
77 1 3 mar mon 87.6 52.2 103.8 5.0 8.3 72 3.1 0.0 0.00
411 7 4 feb fri 84.6 3.2 43.6 3.3 8.2 53 9.4 0.0 4.62
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.00
7 8 6 aug mon 91.5 145.4 608.2 10.7 8.0 86 2.2 0.0 0.00
284 5 4 feb fri 85.2 4.9 15.8 6.3 7.5 46 8.0 0.0 24.24
390 7 4 feb mon 84.7 9.5 58.3 4.1 7.5 71 6.3 0.0 9.96
75 9 9 feb thu 84.2 6.8 26.6 7.7 6.7 79 3.1 0.0 0.00
176 6 5 apr thu 81.5 9.1 55.2 2.7 5.8 54 5.8 0.0 4.61
196 6 5 apr thu 81.5 9.1 55.2 2.7 5.8 54 5.8 0.0 10.93
61 2 2 mar sun 89.3 51.3 102.2 9.6 5.5 59 6.3 0.0 0.00
165 6 5 mar thu 84.9 18.2 55.0 3.0 5.3 70 4.5 0.0 2.14
104 2 4 jan sat 82.1 3.7 9.3 2.9 5.3 78 3.1 0.0 0.00
394 6 5 feb mon 84.1 4.6 46.7 2.2 5.3 68 1.8 0.0 0.00
379 4 5 jan sun 18.7 1.1 171.4 0.0 5.2 100 0.9 0.0 0.00
281 6 5 dec tue 85.4 25.4 349.7 2.6 5.1 24 8.5 0.0 24.77
274 8 6 dec wed 84.0 27.8 354.6 5.3 5.1 61 8.0 0.0 11.19
211 7 4 aug sat 93.5 139.4 594.2 20.3 5.1 96 5.8 0.0 26.00
464 6 4 feb tue 75.1 4.4 16.2 1.9 5.1 77 5.4 0.0 2.14
275 4 6 dec thu 84.6 26.4 352.0 2.0 5.1 61 4.9 0.0 5.38
273 4 6 dec sun 84.4 27.2 353.5 6.8 4.8 57 8.5 0.0 8.98
276 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 17.85
277 3 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 10.73
278 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 22.03
279 4 4 dec mon 85.4 25.4 349.7 2.6 4.6 21 8.5 0.0 9.77
463 6 5 feb tue 75.1 4.4 16.2 1.9 4.6 82 6.3 0.0 5.39
465 2 2 feb sat 79.5 3.6 15.3 1.8 4.6 59 0.9 0.0 6.84
282 6 3 feb sun 84.9 27.5 353.5 3.4 4.2 51 4.0 0.0 0.00
280 4 6 dec fri 84.7 26.7 352.6 4.1 2.2 59 4.9 0.0 9.27

517 rows × 13 columns

Apply

Using the .apply() method, we can apply an arbitrary function to a DataFrame, Series, or groupby object.

In [27]:
# Here I'm squaring the average burned area per month. 
df.groupby('month').mean()['area'].apply(lambda area: area**2)
Out[27]:
month
apr     79.051857
aug    155.977022
dec    177.688900
feb     39.375625
jan      0.000000
jul    206.487919
jun     34.119343
mar     18.980544
may    370.177600
nov      0.000000
oct     44.063044
sep    321.937479
Name: area, dtype: float64

Dropping and filling nulls

Pandas has methods both for filling nulls (or N/As) with whatever value you want and for dropping nulls. To fill nulls, we use the .fillna() method on the DataFrame, and to drop nulls, we call the .dropna() method on the DataFrame. The .fillna() method can take either a default value to fill in or one of a number of fill strategies (padding, back filling, forward filling). You can read about dealing with missing data in the docs here.

In [28]:
df.fillna(0, inplace=True) # Fill with the number 0 (passing '0' would fill with a string).
df.dropna(inplace=True) # Notice the addition of the inplace argument here. 
What do you think would have happened if I didn't add the inplace argument above?
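To see the fill strategies mentioned above in action, here's a minimal sketch on a toy DataFrame that actually contains a null (as df.info() showed, the forest-fires data has none, so the calls above don't change anything):

```python
import pandas as pd

# A toy column with a missing value (None becomes NaN in a float column).
df = pd.DataFrame({'a': [1.0, None, 3.0]})

# Fill nulls with a constant value...
filled_zero = df.fillna(0)

# ...or carry the previous valid value forward (forward filling).
filled_ffill = df.ffill()

print(filled_zero['a'].tolist())   # [1.0, 0.0, 3.0]
print(filled_ffill['a'].tolist())  # [1.0, 1.0, 3.0]
```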

Creating and Dropping Columns

Creating columns is done in one of three ways: (1) using bracket notation, (2) using the .eval() method on the Pandas DataFrame, or (3) using the .assign() method on the Pandas DataFrame (we'll look at the first two). Dropping columns is done using the .drop() method on the Pandas DataFrame. When dropping columns, you have to be careful to either tell the DataFrame to drop them in place or assign the DataFrame with dropped columns to a new variable. You also need to tell the .drop() method what axis the thing you're trying to drop is on (rows are axis=0, and columns are axis=1).

In [29]:
df.drop('rain', axis=1) # Note the lack of the inplace argument.
Out[29]:
X Y month day FFMC DMC DC ISI temp RH wind area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.00
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.00
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.00
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.00
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.00
5 8 6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.00
6 8 6 aug mon 92.3 88.9 495.6 8.5 24.1 27 3.1 0.00
7 8 6 aug mon 91.5 145.4 608.2 10.7 8.0 86 2.2 0.00
8 8 6 sep tue 91.0 129.5 692.6 7.0 13.1 63 5.4 0.00
9 7 5 sep sat 92.5 88.0 698.6 7.1 22.8 40 4.0 0.00
10 7 5 sep sat 92.5 88.0 698.6 7.1 17.8 51 7.2 0.00
11 7 5 sep sat 92.8 73.2 713.0 22.6 19.3 38 4.0 0.00
12 6 5 aug fri 63.5 70.8 665.3 0.8 17.0 72 6.7 0.00
13 6 5 sep mon 90.9 126.5 686.5 7.0 21.3 42 2.2 0.00
14 6 5 sep wed 92.9 133.3 699.6 9.2 26.4 21 4.5 0.00
15 6 5 sep fri 93.3 141.2 713.9 13.9 22.9 44 5.4 0.00
16 5 5 mar sat 91.7 35.8 80.8 7.8 15.1 27 5.4 0.00
17 8 5 oct mon 84.9 32.8 664.2 3.0 16.7 47 4.9 0.00
18 6 4 mar wed 89.2 27.9 70.8 6.3 15.9 35 4.0 0.00
19 6 4 apr sat 86.3 27.4 97.1 5.1 9.3 44 4.5 0.00
20 6 4 sep tue 91.0 129.5 692.6 7.0 18.3 40 2.7 0.00
21 5 4 sep mon 91.8 78.5 724.3 9.2 19.1 38 2.7 0.00
22 7 4 jun sun 94.3 96.3 200.0 56.1 21.0 44 4.5 0.00
23 7 4 aug sat 90.2 110.9 537.4 6.2 19.5 43 5.8 0.00
24 7 4 aug sat 93.5 139.4 594.2 20.3 23.7 32 5.8 0.00
25 7 4 aug sun 91.4 142.4 601.4 10.6 16.3 60 5.4 0.00
26 7 4 sep fri 92.4 117.9 668.0 12.2 19.0 34 5.8 0.00
27 7 4 sep mon 90.9 126.5 686.5 7.0 19.4 48 1.3 0.00
28 6 3 sep sat 93.4 145.4 721.4 8.1 30.2 24 2.7 0.00
29 6 3 sep sun 93.5 149.3 728.6 8.1 22.8 39 3.6 0.00
... ... ... ... ... ... ... ... ... ... ... ... ...
487 5 4 aug tue 95.1 141.3 605.8 17.7 26.4 34 3.6 16.40
488 4 4 aug tue 95.1 141.3 605.8 17.7 19.4 71 7.6 46.70
489 4 4 aug wed 95.1 141.3 605.8 17.7 20.6 58 1.3 0.00
490 4 4 aug wed 95.1 141.3 605.8 17.7 28.7 33 4.0 0.00
491 4 4 aug thu 95.8 152.0 624.1 13.8 32.4 21 4.5 0.00
492 1 3 aug fri 95.9 158.0 633.6 11.3 32.4 27 2.2 0.00
493 1 3 aug fri 95.9 158.0 633.6 11.3 27.5 29 4.5 43.32
494 6 6 aug sat 96.0 164.0 643.0 14.0 30.8 30 4.9 8.59
495 6 6 aug mon 96.2 175.5 661.8 16.8 23.9 42 2.2 0.00
496 4 5 aug mon 96.2 175.5 661.8 16.8 32.6 26 3.1 2.77
497 3 4 aug tue 96.1 181.1 671.2 14.3 32.3 27 2.2 14.68
498 6 5 aug tue 96.1 181.1 671.2 14.3 33.3 26 2.7 40.54
499 7 5 aug tue 96.1 181.1 671.2 14.3 27.3 63 4.9 10.82
500 8 6 aug tue 96.1 181.1 671.2 14.3 21.6 65 4.9 0.00
501 7 5 aug tue 96.1 181.1 671.2 14.3 21.6 65 4.9 0.00
502 4 4 aug tue 96.1 181.1 671.2 14.3 20.7 69 4.9 0.00
503 2 4 aug wed 94.5 139.4 689.1 20.0 29.2 30 4.9 1.95
504 4 3 aug wed 94.5 139.4 689.1 20.0 28.9 29 4.9 49.59
505 1 2 aug thu 91.0 163.2 744.4 10.1 26.7 35 1.8 5.80
506 1 2 aug fri 91.0 166.9 752.6 7.1 18.5 73 8.5 0.00
507 2 4 aug fri 91.0 166.9 752.6 7.1 25.9 41 3.6 0.00
508 1 2 aug fri 91.0 166.9 752.6 7.1 25.9 41 3.6 0.00
509 5 4 aug fri 91.0 166.9 752.6 7.1 21.1 71 7.6 2.17
510 6 5 aug fri 91.0 166.9 752.6 7.1 18.2 62 5.4 0.43
511 8 6 aug sun 81.6 56.7 665.6 1.9 27.8 35 2.7 0.00
512 4 3 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 6.44
513 2 4 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 54.29
514 7 4 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 11.16
515 1 4 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.00
516 6 3 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.00

517 rows × 12 columns

In [30]:
df.head() # Our dataframe still contains the rain column! But why!?
Out[30]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
In [31]:
df.drop('rain', axis=1, inplace=True) 
df.head()
Out[31]:
X Y month day FFMC DMC DC ISI temp RH wind area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0
How do you think I would create a new column that is the sum of the wind and temp columns?
In [32]:
# Cell for answer
In [34]:
df.eval('wind_plus_temp2 = wind + temp', inplace=True) # This is generally the preferred way, as it's a little cleaner. 
df.head()
Out[34]:
X Y month day FFMC DMC DC ISI temp RH wind area wind_plus_temp2
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 14.9
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 18.9
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 15.9
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.0 12.3
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 13.2
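For completeness, here are the other two column-creation approaches mentioned above, bracket notation and .assign(), sketched on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'wind': [6.7, 0.9], 'temp': [8.2, 18.0]})

# (1) Bracket notation: assign a Series to a new column name.
df['wind_plus_temp'] = df['wind'] + df['temp']

# (3) .assign() returns a NEW DataFrame instead of modifying df in place.
df2 = df.assign(temp_f=df['temp'] * 9 / 5 + 32)

print('temp_f' in df.columns, 'temp_f' in df2.columns)  # False True
```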

We also have a special function for getting dummy variables for a given column...

In [35]:
pd.get_dummies(df.month, prefix='month')
Out[35]:
month_apr month_aug month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
12 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
14 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
16 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
19 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
21 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
22 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
23 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
24 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
29 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
487 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
488 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
489 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
490 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
491 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
492 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
493 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
494 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
495 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
496 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
497 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
498 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
499 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
500 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
501 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
502 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
503 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
504 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
505 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
506 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
507 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
508 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
509 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
510 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
511 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
512 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
513 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
514 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
515 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
516 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

517 rows × 12 columns

Combining data sets

Pandas' ways of combining two sets of data include pd.merge(), df.join(), and pd.concat() (note that merge is also available as df.merge() on the DataFrame itself). For the most part, these three do largely the same things, although you'll notice the slight syntax difference: .concat() is called via the Pandas module, .join() on the DataFrame itself, and .merge() either way. There are some cases where one of them might let you write less code or perform some kind of data combining more easily. The major differences between them, though, lie in what they do by default: .merge() joins on common columns, .join() joins on common indices, and .concat() just appends along a given axis.

You can read about the differences between all three of these in the docs. Below I'll show one example using .join().

In [36]:
month_dummies = pd.get_dummies(df.month, prefix='month') 
df.join(month_dummies) # Here we combine on the indices. 
df.head() 
Out[36]:
X Y month day FFMC DMC DC ISI temp RH wind area wind_plus_temp2
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 14.9
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 18.9
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 15.9
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.0 12.3
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 13.2
Any ideas on why our dataframe doesn't contain the dummy variables, even though we did the join?
In [37]:
# Cell for ideas. 
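To round out the picture, here's a quick sketch of the other two combiners on a pair of made-up DataFrames, showing their defaults: .merge() joins on the common column, while .concat() just stacks rows:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

# merge() joins on common columns by default (here, 'key').
merged = pd.merge(left, right)
print(merged.columns.tolist())  # ['key', 'x', 'y']

# concat() appends along an axis (rows by default).
stacked = pd.concat([left, left])
print(len(stacked))  # 4
```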

Next Steps

If you want to see Python in action exploring a real dataset, have a look at Exploring Data with Python using Jupyter Notebooks.


Want to learn more?

Galvanize offers a 6-week part-time workshop, as well as a 12-week full-time program in Data Science that teaches you how to make an impact as a contributing member of a data analytics team.

