Exploring Data with Python using Jupyter Notebooks

Published 10/04/2016

Quick Note

If you want a more in-depth dive into Python + practice using Python, we suggest you take our Intro to Python Evening Course.


By the end of this tutorial, you'll be able to:

  • Install Anaconda
  • Launch, run, and modify a Jupyter notebook
  • Use Python and Pandas to analyze a dataset of Hotel Receipts

Possible Business Goals

This document will get part-way (more-or-less) to each of the following business goals:

  • Inform an investor where he should build his next hotel (a first-pass market research effort).
  • Hilton has hired you to do a SWOT analysis for their hotels in Texas.
  • Identify best months for hotels in San Antonio to offer sales and promotions to drive profit.


The following instructions assume that you have little to no experience using GitHub or Jupyter. In that case, you are probably viewing this document on GitHub.com. That's fine for viewing purposes, but if you want to be able to run the code in this document, you'll have to do some extra work. See the instructions below:

Prep your computer:

  • Install Anaconda (choose the Python 3 version).
  • Download this tutorial's starter code and unzip it somewhere easy to find.

Load our starter code using Jupyter:

  • Launch Jupyter. There are two ways to do this:
      • If you're comfortable in the terminal, run the command jupyter notebook
      • Otherwise, use Anaconda Navigator to launch Jupyter.
  • Using the Jupyter window that opens in your browser, navigate to where you saved & unzipped the starter code.
  • Click on THIS notebook to open it. You should now see this notebook on your own computer screen!

Begin Programming!

Congrats! You are now ready to run the code in this notebook! Continue below by reading every English sentence and running every cell of code. When you reach the end, write some extra code to answer one of the "Possible Business Goals" in the section above.

Import the libraries we'll need.

In Python, you use the import statement to load libraries into your script.

In [ ]:
# matplotlib is used for...you guessed it: plotting!
import matplotlib.pyplot as plt

# Pandas is the "Python Data Analysis Library". We use it for loading data from disk and
# for manipulating and printing data.
import pandas as pd
from pandas.plotting import scatter_matrix  # in pandas < 0.20 this lived in pandas.tools.plotting

# This next line is a Jupyter directive. It tells Jupyter that we want our plots to show
# up right below the code that creates them.
%matplotlib inline

Pandas by default will not print every column in our data if there are A LOT of columns. For this notebook, we would prefer that Pandas print all the columns no matter what, so let's tell Pandas our wishes:

In [ ]:
pd.set_option('display.max_columns', None)

Let's read some data!

This data was obtained from https://data.texas.gov/

In [ ]:
raw_dataframe = pd.read_csv("data/texas_open_data/Hotel_Occupancy_Tax_Receipts__Monthly.csv")

Woo! We have data. Now let's look at it!

Pandas makes it easy to peek into your data and see what you have. The first step in any data project is to get to know your data.

Let's start by looking at the first 3 rows of data. That's easy to do using the head() method on our DataFrame:

In [ ]:
raw_dataframe.head(3)

Looks good at first glance. We'll dig in later, but first, let's ask Pandas to give us info about the datatype of each column:

In [ ]:
raw_dataframe.info()

Most of the stuff above looks correct:

  • Month is encoded as an integer, so pandas is correct in identifying the Month column as an integer.
  • Name, Address, City, and State are correctly identified as strings (that's what pandas means when it says something is an object).
  • County is also encoded as an integer, so pandas is correct again.
  • ... etc...

There are a few things above that don't look right:

  • Why is Taxpayer Zip a floating point number?
  • Why is Location County a floating point number?
  • Location Tot Room Receipts has a dollar sign in it which is why pandas thinks it's a string instead of a number.
  • Ditto for Location Taxable Receipts.
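An aside on those two float questions: pandas represents missing values as NaN, which is a float, so a column of whole numbers that contains blanks gets promoted to float64. A tiny synthetic example (made-up values, not the hotel data):

```python
import pandas as pd

demo = pd.DataFrame({
    'count': [1, 2, 3],                                  # whole numbers -> int64
    'zip_code': [78701, None, 78702],                    # a blank forces float64
    'receipts': ['$1,200.00', '$980.50', '$3,400.00'],   # '$' and ',' -> strings (object)
})
print(demo.dtypes)
```

That's likely why Taxpayer Zip and Location County show up as floats: those columns contain blanks, which is also why the cleanup code below calls fillna before converting to int.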

Let's clean our data.

We identified four issues above, so let's fix each one.

In [ ]:
# Let's work in a new DataFrame so that we don't lose our original data.
# (It's often useful to have the original data on-hand.)
dataframe = raw_dataframe.copy()

# Convert _Taxpayer Zip_ to an integer.
dataframe['Taxpayer Zip'] = dataframe['Taxpayer Zip'].fillna(0.0).astype(int)

# Convert _Location County_ to an integer.
dataframe['Location County'] = dataframe['Location County'].fillna(0.0).astype(int)

# Convert _Location Tot Room Receipts_ to a float.
dataframe['Location Tot Room Receipts'] = \
    dataframe['Location Tot Room Receipts'] \
        .replace(r'[\$,]', '', regex=True).astype(float)

# Convert _Location Taxable Receipts_ to a float.
dataframe['Location Taxable Receipts'] = \
    dataframe['Location Taxable Receipts'] \
        .replace(r'[\$,]', '', regex=True).astype(float)
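Here's the same dollar-sign cleanup as a self-contained sketch on a throwaway Series (synthetic values, not the real receipts), so you can see what the regex replace does before the astype:

```python
import pandas as pd

receipts = pd.Series(['$1,200.00', '$980.50', '$3,400.00'])  # made-up strings
cleaned = receipts.replace(r'[\$,]', '', regex=True).astype(float)
print(cleaned.sum())  # 5580.5
```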

Again, let's have pandas give us info about what it thinks the datatype of each column is. It should all look good now:

In [ ]:
dataframe.info()

Woot! This time, the datatypes look correct. So, datatypes are good... next, we'll look at the content of our data.

Exploratory Data Analysis (EDA)

Sanity Check

Let's look at the head again. It should be the same as before, but we'll look again to remind ourselves what it looks like.

In [ ]:
dataframe.head(3)

I'm noticing something that might be wrong. We'll have to keep this in mind as we move forward: Why is 'Location Room Capacity' equal to 1 for these rows? Are these hotels with just one room each? Maybe... we'll come back to this.

For now let's get an idea about the distribution of the 'Location Room Capacity' column. We'll create a histogram over this column.

In [ ]:
dataframe['Location Room Capacity'].hist(bins=100)

... well... that wasn't very informative! That's okay, let's try again. This time we'll only look at rows where the 'Location Room Capacity' is less than 50; this lets us "zoom in" on the most interesting part of the distribution:

In [ ]:
rows_of_interest = (dataframe['Location Room Capacity'] < 50)
dataframe[rows_of_interest]['Location Room Capacity'].hist(bins=100)

Hmm... is that spike at 'Location Room Capacity' == 1? Let's find out.

In [ ]:
(dataframe['Location Room Capacity'] == 1).sum()

... yep, that spike is at 'Location Room Capacity' == 1.

There seem to be far too many hotels with 'Location Room Capacity' == 1. It's possible that whoever created this dataset filled all "unknown values" with the value 1. It's too soon to tell... we'll come back to this later.
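A quick way to spot a suspicious sentinel like this in any column is value_counts(); here is a sketch on made-up capacities (not the real data):

```python
import pandas as pd

capacities = pd.Series([1, 1, 1, 1, 120, 85, 1, 200, 64, 1])  # made-up capacities
print(capacities.value_counts().head())  # the value 1 dominates the counts
```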

In [ ]:
# Don't uncomment the lines below on the FIRST pass through this notebook.
# We'll have you come back here later. The lines below drop the 'bad' rows
# in the dataset.

#print("Orig size:", dataframe.shape)
#dataframe = dataframe[dataframe['Location Room Capacity'] > 1]
#print("New size:", dataframe.shape)

Visualization (histograms and scatterplots)

Number of States Represented

Let's see what states are represented in this data. (It should only be Texas, but let's check.)

In [ ]:
agg = dataframe.groupby('Location State').apply(len)
agg.plot(kind='bar').set_ylabel('# records')

... We have some non-Texas stuff in here! Let's get a textual printout for more detail.

In [ ]:
dataframe['Location State'].value_counts()

... Indiana... what's up with that? Let's keep only the Texas rows.

In [ ]:
texas_df = dataframe[dataframe['Location State'] == 'TX']

Number of Cities Represented

How many unique cities are represented?

In [ ]:
# How many unique cities are in our dataset?
texas_df['Location City'].describe()

Histogram: Number of unique hotels in each city

In [ ]:
# Count the number of unique hotels in each city:
city_counts = texas_df.groupby('Location City')['Taxpayer Number'].nunique()
city_counts.sort_values(axis=0, ascending=False, inplace=True)
top_cities = city_counts > 50
city_counts[top_cities].plot(kind='bar').set_ylabel('# unique hotels')

Exploring the numeric columns

In [ ]:
# Now let's focus on the numeric columns.
numeric_columns = ['Month', 'Year', 'Outlet Number', 'Location Room Capacity',
                   'Location Tot Room Receipts', 'Location Taxable Receipts']
In [ ]:
# Let's look at some stats for our numeric columns:
texas_df[numeric_columns].describe()

Pairplot over a few numeric columns-of-interest

For this dataset, this turns out not to be very interesting, but it's always worth a look anyway!

A pairplot (also known as a scatter matrix) will plot every column vs every other column. It's a quick way to see which pairs of columns are correlated (or not).

In [ ]:
cols_of_interest = ['Location Room Capacity', 'Location Tot Room Receipts',
                    'Location Taxable Receipts']
scatter_matrix(texas_df[cols_of_interest], figsize=(10, 8))

Compare two cities: Scatterplot 'Location Room Capacity' vs 'Location Tot Room Receipts' for each city

Does it seem that Austin and El Paso are different, or are they basically the same? (We'll explore this further later on, but for now let's just look at the pretty picture and make our best guess.)

In [ ]:
cities_of_interest = ['AUSTIN', 'EL PASO']

# Scatter plot by city.
for city in cities_of_interest:
    city_mask = (texas_df['Location City'] == city)
    size_mask = (texas_df['Location Room Capacity'] < 220)
    plt.plot(texas_df[city_mask & size_mask]['Location Room Capacity'],
             texas_df[city_mask & size_mask]['Location Tot Room Receipts'],
             '.', label=city)
plt.xlabel('Location Room Capacity')
plt.ylabel('Location Tot Room Receipts')
plt.legend()

Hmm... I don't know about you, but I didn't expect this! There are two things here that surprised me:

  1. The trend-line is not very strong. There is a general upward trend, but it is riddled with noise.
  2. These cities do not differ much! Average income per room is roughly the same in each city.

The takeaway (and this is usually the takeaway) is two-fold:

  1. Your data is probably not as clean and pretty as you'd like, and
  2. Often your assumptions about what you will find are just plain wrong.

... Seeing this now changes the direction I was planning to head. That's why we do EDA!

Also: Imagine this scenario: Your investor friend was told that Austin is in desperate need of hotels--that there is more demand than supply and that it would be a great time to build a hotel in Austin because you could get an amazing ROI. Your investor friend comes to you to verify this. We're not finished with our analysis yet, but it's not looking good for that theory so far.

Histogram of every numeric column

This is another graph that you should make for every dataset you get. It is often useful. For this dataset, it's less useful though. :(

In [ ]:
# Here's another one-liner:
texas_df[numeric_columns].hist(bins=10, figsize=(10, 6))
# ... although in this case this isn't helpful... :(

Years and Months, what do we have of each?

We only have one year (2015), see:

In [ ]:
texas_df['Year'].value_counts()

We only have two months (July and August), see:

In [ ]:
# Looks like these are all from 2015 and include only July and August. Let's see how many
# of each month we have.
texas_df['Month'].value_counts()

Histogram: Avg 'room capacity' per city (top cities only)

In [ ]:
city_counts = texas_df.groupby('Location City')['Location Room Capacity'].mean()
city_counts.sort_values(axis=0, ascending=False, inplace=True)
city_counts[city_counts>110].plot(kind='bar', figsize=(10, 4)).set_ylabel('avg room capacity')

Histogram: Avg income per room per night per city (top cities only)

In [ ]:
num_days = 31 + 31  # July and August each have 31 days.

cities = texas_df.groupby('Location City')
city_receipts = cities['Location Tot Room Receipts'].sum()
city_rooms = cities['Location Room Capacity'].sum()
city_income_per_room_per_night = city_receipts / city_rooms / num_days
city_income_per_room_per_night.sort_values(axis=0, ascending=False, inplace=True)
most_profitable_cities = city_income_per_room_per_night > 100
city_income_per_room_per_night[most_profitable_cities].plot(kind='bar', figsize=(10, 4)) \
    .set_ylabel('avg income per room per night')

RED ALERT: What's up with SOUTHLAKE?

In [ ]:
# What's up with SOUTHLAKE?
texas_df[texas_df['Location City'] == 'SOUTHLAKE']

Our data says that this hotel in Southlake has only one room. Let's look into whether or not that's true. Here's the website for that hotel in Southlake: http://www3.hilton.com/en/hotels/texas/hilton-dallas-southlake-town-square-DFWSLHF/index.html

Okay... that hotel above definitely has more than one room! Remember how we considered dropping rows where 'Location Room Capacity' == 1? Well, I think it's time to pull the trigger on that decision.
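Pulling the trigger means applying the boolean-mask filter from the commented-out cell earlier in this notebook. Here it is sketched self-contained on a made-up frame (column names from this dataset, values invented):

```python
import pandas as pd

df = pd.DataFrame({'Location Room Capacity': [1, 248, 1, 96],  # invented values
                   'Location Tot Room Receipts': [5e4, 9e5, 4.2e4, 3.1e5]})
df = df[df['Location Room Capacity'] > 1]   # drop the sentinel rows
print(df.shape)  # (2, 2)
```

On the real data: go back, uncomment that earlier cell, and re-run the notebook from there.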


Is there much of a difference between the month of July and the month of August?

In [ ]:
# How does the specific month affect the avg income per room per night?

cities = texas_df.groupby(['Location City', 'Month'])
city_receipts = cities['Location Tot Room Receipts'].sum()
city_rooms = cities['Location Room Capacity'].sum()
city_income_per_room_per_night = city_receipts / city_rooms / num_days
city_income_per_room_per_night = city_income_per_room_per_night.unstack()
city_income_per_room_per_night['avg_7_8'] = (city_income_per_room_per_night[7] + \
                                             city_income_per_room_per_night[8]) / 2
city_income_per_room_per_night.sort_values(by='avg_7_8', ascending=False, inplace=True)
most_profitable_cities = city_income_per_room_per_night['avg_7_8'] > 100
city_income_per_room_per_night[most_profitable_cities].plot(kind='bar', figsize=(10, 4)) \
    .set_ylabel('avg income per room per night')

Haha. Buda had a great month of July! (Or it was a data entry error.) It does seem that, in general, July was better for hotels than August. The graph above only shows the top cities; let's look at ALL cities and compare July vs August.

In [ ]:
# Seems July (7) was a better month for the top cities than August (8). Let's see if this
# was overall true for the entire dataset.
print('July Avg:', city_income_per_room_per_night[7].mean())
print('Aug Avg: ', city_income_per_room_per_night[8].mean())

Yep. It seems that, on average, July was better for hotels than August.

... but... remember how we saw that Buda had a REALLY GOOD month of July? We need to look into that more. We'll leave that as an exercise for you! Figure out if:

  1. Buda really did have a REALLY GOOD month in July, or
  2. We are seeing some data-entry error... in which case we need to fix it and re-run our analysis.

Want More?

If you enjoyed this tutorial, learn more about our Data Science Immersive program. It is 12 weeks of intense training in:

  • statistics,
  • machine learning, and
  • software engineering.
