Intro to Scikit-Learn

Published 9/26/2016


Quick Note

This tutorial moves very fast. If you would like a slower, more in-depth intro to Python, we suggest you take our Intro to Python Evening Course. It's the perfect way to become familiar with Python and gain experience using it to solve challenging problems.

Prerequisites

This document is a Jupyter notebook. If you are viewing it using GitHub, then you cannot execute the cells that contain Python code. To view and run this notebook you'll need to install Jupyter on your computer before you continue. See these installation instructions for help!

Overview

Here we're going to walk through running a model and looking at the results. Before we get started, let's go over some terminology...

Terminology

1.) features - Another word for the X variables, independent variables, predictors, regressors...
2.) target - Another word for the Y variable, dependent variable, outcome variable, response...
3.) model - What we use to relate one set of variables to another.
4.) feature engineering - Refers to the process of manipulating your features, creating new ones, etc. before feeding the data into your model.
5.) training data set - Refers to the observations from your data set that are used to train/learn the statistical model.
6.) testing data set - Refers to the observations from your data that are not used to train/learn the statistical model. They are held out, and not seen by the model during training (see the sketch just after this list).
7.) hyperparameters - Stay tuned... it's a little hard to put into words, but remind me if I don't discuss it later.
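Before moving on, here's a minimal sketch (my addition, using the same older sklearn API as the rest of this notebook) of what splitting data into training and testing sets looks like in practice:

import numpy as np
from sklearn.cross_validation import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features.
y = np.arange(10)                 # One target value per observation.
# Hold out 20% of the observations; the model never sees X_test/y_test
# while it is being trained.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)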

Scikit-learn import

import sklearn

Typically we're actually going to be importing something from one of the modules/libraries in sklearn. The sklearn main page can help you determine where to find what you're looking for, and the API reference is also pretty helpful. The large majority of the machine learning algorithms you might run can be found somewhere within sklearn. Today we're going to talk through using a Random Forest Regressor.

General workflow

Here are the steps by which we train a model...

1.) Import whatever model you'll be fitting.
2.) Instantiate the model (i.e. create a variable that holds your model object). Set any hyperparameters as you see fit (we'll discuss what these are shortly).
3.) Feed in the X and Y variables (features and target) to the .fit() method.
4.) Call the .score() or .predict() method to see how well the model does on the training data (or new data).

What would be another word/term we might use to describe the new data from step (4) above?
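To make those four steps concrete before we touch real data, here's a minimal sketch (my addition) using a different estimator, LinearRegression, on made-up data; the same workflow applies to nearly every sklearn model:

import numpy as np
from sklearn.linear_model import LinearRegression  # Step 1: import the model.

X = np.random.rand(100, 2)                           # 100 observations, 2 features.
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.rand(100)  # A noisy target.

model = LinearRegression()  # Step 2: instantiate it.
model.fit(X, y)             # Step 3: feed in the features and target.
model.score(X, y)           # Step 4: R^2 on the training data.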

We'll be working with a RandomForestRegressor tonight, which you can see the documentation for here.

In [1]:
import pandas as pd
df = pd.read_csv('data/forestfires.csv') # Get the data. 
df.head()
Out[1]:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
In [2]:
from sklearn.ensemble import RandomForestRegressor # Import our model. 
random_forest = RandomForestRegressor(n_estimators=100) # Instantiate it. 

Let's create our features (X variables) and target (Y variable). I'm using the forest-fire data, and for now am only going to use the X and Y columns (which are the spatial coordinates of the fires) for the features, and the area column for the target (this is defined as the dependent variable on the UCI website where I got this data). A link to the data and its description can be found here.

How do I pull the X and Y columns from our df to use as the features? How about the area?
In [3]:
features = df[['X','Y']]
target = df['area']
In [4]:
# Fit/train the model (i.e. build the model based off the training data)
random_forest.fit(features, target) 
Out[4]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
In [5]:
random_forest.score(features, target) # This .score() gives you the R^2. 
Out[5]:
0.032062770222170189
In [6]:
predictions = random_forest.predict(features) # This gives us back a vector of predictions
                                              # (one for each observation). 

In terms of metrics, the sklearn.metrics documentation will give you an idea of the metrics you can use to judge a model. The majority of these take the format of a function call where you input (y_observations, y_predictions), and they output the calculated metric. We'll look at mean squared error below.

In [8]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(target, predictions)
print mse
3914.55650291

This looks terrible! We're doing awfully. But how do we know it's terrible? One quick sanity check (my addition, not in the original notebook) is to compare against a baseline that always predicts the mean burned area; if our MSE isn't much lower than the baseline's, the model isn't adding much:
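baseline = [target.mean()] * len(target)  # Always predict the mean burned area.
mean_squared_error(target, baseline)      # Baseline MSE to compare against.

With that context in mind, let's see if we can add in something else and do better.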

In [9]:
features = df[['X','Y', 'wind']]
target = df['area']
random_forest.fit(features, target)
print random_forest.score(features, target)
0.203887096898
In [10]:
predictions = random_forest.predict(features)
print mean_squared_error(target, predictions)
3219.66016598

Much better! But let's try one more variable. I imagine the month could be pretty important.

In [12]:
features = df[['X','Y', 'wind', 'month']]
target = df['area']
random_forest.fit(features, target)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-5b5ffb984971> in <module>()
      1 features = df[['X','Y', 'wind', 'month']]
      2 target = df['area']
----> 3 random_forest.fit(features, target)

/Users/sallamander/anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    193         """
    194         # Validate or convert input data
--> 195         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    196         if issparse(X):
    197             # Pre-sort indices to avoid that each individual tree of the

/Users/sallamander/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ValueError: could not convert string to float: nov
What do you think went wrong here? Don't look ahead!

What went wrong here? It turns out that most of the algorithms we use don't accept strings as inputs, but rather expect numeric values. The way to fix this would be to create dummy variables for the months (I'll leave that as an exercise for you, though the sketch below gives one possible starting point).
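If you'd like a head start on that exercise, here's a minimal sketch of one approach using pandas' get_dummies, which creates one 0/1 indicator column per month:

month_dummies = pd.get_dummies(df['month'], prefix='month')  # One column per month.
features = pd.concat([df[['X', 'Y', 'wind']], month_dummies], axis=1)
features.head()  # Everything is numeric now, so .fit() will accept it.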

So, we have run two models now, and seen that the second one performed better. We could keep adding variables into our model, checking the R^2 and our MSE after each addition. But we're not actually running our model on any data that we aren't training it on, so how do we know that what we are putting into our model would actually help on data that we've never seen? In other words, how do we tell if our model will generalize well? The answer is cross validation.

The way that cross validation works is that we break our data into k folds (typically 5 or 10). We train our model on k-1 of those folds, and then predict on the kth fold. We take those predictions and compute our scoring metric (mean squared error, in our case) on them. We then do this again, and again, until each of the k folds has been used for predictions (so with 5 folds we do this 5 times, with 10 folds 10 times, etc.).
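Here's a minimal sketch (my addition) of that procedure written out by hand, using the same older sklearn.cross_validation module this notebook imports from; the cross_val_score function we'll meet below wraps up exactly this loop:

import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error

features = df[['X', 'Y']]
target = df['area']
fold_mses = []
for train_idx, test_idx in KFold(len(target), n_folds=5):
    # Train on 4 folds, predict on the held-out 5th.
    random_forest.fit(features.iloc[train_idx], target.iloc[train_idx])
    preds = random_forest.predict(features.iloc[test_idx])
    fold_mses.append(mean_squared_error(target.iloc[test_idx], preds))
np.mean(fold_mses)  # Average held-out MSE across the 5 folds.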


Using cross validation, we can get an idea of how our model would perform on data it hasn't seen before, and then when we add variables into our model (or change model hyperparameters), we can be more sure that they were actually worth including.

Best of all, it turns out that sklearn has a module we can use for this! Check out the cross validation module for all the details. Today we'll be looking at the cross_val_score function, which allows you to pass in a model, a feature set (X), a target (Y), a number of folds (5 or 10, for example), and a scoring function (we'll use our mean squared error).

In [14]:
from sklearn.cross_validation import cross_val_score
features = df[['X','Y']]
target = df['area']
results = cross_val_score(random_forest, features, target, cv=20, scoring='mean_squared_error')
results
Out[14]:
array([  -375.47767068,   -162.20055005,   -603.25384842,   -536.51811759,
          -62.73033068,   -280.80067275,   -341.6422815 ,   -173.85800086,
        -1912.14482145, -50478.40726929,   -118.81598123,   -528.28201791,
         -419.59203963,   -139.91659542,  -1225.51294279, -21430.62382299,
        -1533.13269162,   -509.65316076,  -3340.51172585,   -345.3483658 ])
Anybody want to take a guess at why we're getting negative mean squared error values? (Hint: sklearn's scoring convention is that a higher score is always better, so error metrics like MSE get their sign flipped.)
In [15]:
-results.mean()
Out[15]:
4225.9211453648695
In [16]:
features = df[['X', 'Y', 'wind']]
target = df['area']
results = cross_val_score(random_forest, features, target, cv=20, scoring='mean_squared_error')
-results.mean()
Out[16]:
5270.6323296025976

So it looks like wind might not have been as helpful as we thought. Good thing we used cross-validation!

Cross-validation is a crucial part of a data scientist's workflow. We have to make sure that our model will generalize well, and cross-validation is a way to make sure that we are putting the right variables into our model. It can also be used to check our model hyperparameters (for a random forest, this might be the number of trees, the depth of each tree, etc.). Sklearn also has a built-in way to perform cross-validation over hyperparameters. It is located in the sklearn.grid_search module, and it is called GridSearchCV. As arguments, it takes an estimator/model (such as our Random Forest) and a parameter grid (a dictionary). We instantiate it with these, and then call the .fit() method on it, passing in our features and target. It then tells us the best parameters to use for our model.

In [19]:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
param_grid = {'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]}
grid_search_cv = GridSearchCV(random_forest, param_grid, scoring='mean_squared_error')
In [20]:
features = df[['X', 'Y', 'wind']]
target = df['area']
grid_search_cv.fit(features, target)
Out[20]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='mean_squared_error', verbose=0)
In [21]:
best_model = grid_search_cv.best_estimator_ # Get a copy of the best model. 
best_params = grid_search_cv.best_params_ # Get a dictionary of the best parameters. 
best_score = grid_search_cv.best_score_ # Get the best score from the scoring function we passed in.
In [22]:
best_params
Out[22]:
{'max_depth': 1, 'n_estimators': 100}
In [23]:
best_score
Out[23]:
-4173.1288852755588
In [24]:
features = df[['X', 'Y']]
target = df['area']
grid_search_cv.fit(features, target)
Out[24]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='mean_squared_error', verbose=0)
In [25]:
best_model = grid_search_cv.best_estimator_ # Get a copy of the best model. 
best_params = grid_search_cv.best_params_ # Get a dictionary of the best parameters. 
best_score = grid_search_cv.best_score_ # Get the best score from the scoring function we passed in.
In [26]:
best_params
Out[26]:
{'max_depth': 1, 'n_estimators': 500}
In [27]:
best_score
Out[27]:
-4171.6458630324541
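One quick note (my addition): since refit=True (the default, visible in the output above), the grid search has already refit the winning model on the full data set, so best_model can be used directly:

best_model.predict(features[:5])  # Predictions for the first five observations.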

With all this being said, we can kind of re-define/re-work the steps in our general workflow...

1.) Import whatever model you'll be fitting.
2.) Instantiate the model (i.e. create a variable that holds your model object). Set any hyperparameters as you see fit.
3.) Feed in the X and Y variables (features and target) to the .fit() method.
4.) Call the .score() or .predict() method, or better yet use cross-validation, to see how well the model does on the training data (or new data).
5.) Repeat steps (2) - (4) to find the best model given your chosen scoring metric.

Note: This assumes that all of your feature engineering/variable manipulation is done.

Want some practice?

Have a look at the intro_sklearn_practice.ipynb notebook!

Next Steps

If you want to see Python in action exploring a real dataset, have a look at Exploring Data with Python using Jupyter Notebooks.


Want to learn more?

Galvanize offers a 6-week part-time workshop, as well as a 12-week full-time program in Data Science that teaches you how to make an impact as a contributing member of a data analytics team.
