Decision Tree Classifiers

Example

Important Considerations

PROS                                  CONS
Easy to visualize and interpret       Prone to overfitting
No normalization of data necessary    Ensembles needed for better performance
Handles mixed feature types

Iris Example

Use the four sepal and petal measurements to predict the species.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
In [3]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
Out[3]:
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa
In [4]:
#split the data (note: this reassigns `iris` from the seaborn DataFrame
#to the sklearn Bunch, which also carries .target and .target_names)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
In [5]:
len(X_test)
Out[5]:
38
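train_test_split holds out 25% of the data by default, so 38 of the 150 iris samples (150 × 0.25 = 37.5, rounded up) end up in the test set.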
In [6]:
#instantiate the classifier with default settings (no pruning)
clf = tree.DecisionTreeClassifier()
In [7]:
#fit on the training data
clf = clf.fit(X_train, y_train)
In [8]:
#accuracy on the training data
clf.score(X_train, y_train)
Out[8]:
1.0
In [9]:
#accuracy on the held-out test set
clf.score(X_test, y_test)
Out[9]:
0.92105263157894735
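A perfect training score next to a noticeably lower test score is the overfitting flagged in the pros-and-cons table above: an unpruned tree keeps splitting until it classifies every training sample correctly.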

How would a specific flower be classified?

If we have a flower that has:

  • sepal length = 1.0
  • sepal width = 0.3
  • petal length = 1.4
  • petal width = 2.1
In [10]:
clf.predict_proba([[1.0, 0.3, 1.4, 2.1]])
Out[10]:
array([[ 0.,  1.,  0.]])
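The three probabilities line up with the integer class labels 0, 1, 2. As a minimal sketch (assuming `iris` is still the sklearn Bunch loaded with load_iris() above, which carries .target_names), the predicted index can be mapped back to a species name:

# sketch: translate the predicted class index into a species name
pred = clf.predict([[1.0, 0.3, 1.4, 2.1]])
print(iris.target_names[pred])   # class 1 is 'versicolor'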
In [11]:
#10-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X_train, y_train, cv=10)
Out[11]:
array([ 0.83333333,  1.        ,  1.        ,  0.91666667,  0.91666667,
        1.        ,  0.90909091,  1.        ,  1.        ,  0.9       ])
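The fold-to-fold spread matters as much as the individual scores. A common follow-up (a sketch using only the calls already imported above) is to summarize the ten folds:

# summarize the cross-validation folds with mean and spread
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))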

How important are different features?

In [12]:
#feature importances (they sum to 1)
clf.feature_importances_
Out[12]:
array([ 0.06184963,  0.        ,  0.03845214,  0.89969823])
In [13]:
imp = clf.feature_importances_
In [14]:
plt.bar(['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], imp)
Out[14]:
<Container object of 4 artists>
_images/010-Decision-Trees_18_1.png
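Petal width dominates with an importance near 0.9, and sepal width is never used in a split; the pairplot later in this section shows why, as the petal measurements separate the species most cleanly.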

Visualizing the Decision Tree

pip install pydotplus

In [15]:
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[15]:
_images/010-Decision-Trees_20_0.png
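Each node in the rendered tree shows its split test, Gini impurity, the number of samples reaching it, and the per-class sample counts (value), so a prediction can be read off as a chain of if/else comparisons.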

What’s Happening with the Decision Tree

In [16]:
import seaborn as sns
iris = sns.load_dataset('iris')   # `iris` is the seaborn DataFrame again from here on
sns.pairplot(data=iris, hue='species');
_images/010-Decision-Trees_22_0.png

Pre-pruning: Avoiding Overfitting

  • max_depth: limits the depth of the tree
  • max_leaf_nodes: limits the number of leaf nodes
  • min_samples_leaf: requires a minimum number of samples in each leaf before a split is kept (see the sketch just below)
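The cells that follow vary max_depth; as a minimal sketch of the other two parameters (the values 4 and 5 here are illustrative, not tuned):

# pre-prune by capping the number of leaves and requiring 5 samples per leaf
clf_pruned = DecisionTreeClassifier(max_leaf_nodes=4, min_samples_leaf=5)
clf_pruned.fit(X_train, y_train)
print(clf_pruned.score(X_train, y_train), clf_pruned.score(X_test, y_test))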
In [17]:
clf = DecisionTreeClassifier(max_depth = 1).fit(X_train, y_train)
In [18]:
clf.score(X_train, y_train)
Out[18]:
0.6875
In [19]:
clf.score(X_test, y_test)
Out[19]:
0.60526315789473684
In [20]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[20]:
_images/010-Decision-Trees_27_0.png
In [21]:
clf = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
In [22]:
clf.score(X_train, y_train)
Out[22]:
0.9642857142857143
In [23]:
clf.score(X_test, y_test)
Out[23]:
0.94736842105263153
In [24]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[24]:
_images/010-Decision-Trees_31_0.png
In [25]:
clf = DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
clf.score(X_train, y_train)
Out[25]:
0.9732142857142857
In [26]:
clf.score(X_test, y_test)
Out[26]:
0.97368421052631582
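Going from max_depth=1 to 3 improves both training and test accuracy; pushing the depth further mainly buys back the training-set memorization seen with the unpruned tree at the start of the section.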

Confusion Matrix

In [29]:
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn.metrics import confusion_matrix

#refit the pruned (max_depth=3) tree and predict on the test set
classifier = clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

mat = confusion_matrix(y_test, predictions)
#transpose so columns are the true labels and rows the predictions
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
_images/010-Decision-Trees_35_0.png
In [30]:
sklearn.metrics.confusion_matrix(y_test, predictions)
Out[30]:
array([[10,  0,  0],
       [ 0, 13,  0],
       [ 0,  1, 14]])
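classification_report was imported above but never called; a quick sketch of its use on the same predictions (the class names are written out literally, since `iris` now refers to the seaborn DataFrame):

# per-class precision, recall, and F1 scores
print(classification_report(y_test, predictions,
                            target_names=['setosa', 'versicolor', 'virginica']))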
In [27]:
sklearn.metrics.accuracy_score(y_test, predictions)
Out[27]:
0.94736842105263153
In [28]:
dot_data2 = StringIO()
export_graphviz(clf, out_file=dot_data2,
                filled=True, rounded=True,
                special_characters=True)
graph2 = pydotplus.graph_from_dot_data(dot_data2.getvalue())
Image(graph2.create_png())
Out[28]:
_images/010-Decision-Trees_38_0.png

Example with Adolescent Health Data

In [33]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report
import sklearn.metrics
In [34]:
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()   # listwise deletion of rows with any missing values
data_clean.dtypes
Out[34]:
BIO_SEX      float64
HISPANIC     float64
WHITE        float64
BLACK        float64
NAMERICAN    float64
ASIAN        float64
age          float64
TREG1        float64
ALCEVR1      float64
ALCPROBS1      int64
marever1       int64
cocever1       int64
inhever1       int64
cigavail     float64
DEP1         float64
ESTEEM1      float64
VIOL1        float64
PASSIST        int64
DEVIANT1     float64
SCHCONN1     float64
GPA1         float64
EXPEL1       float64
FAMCONCT     float64
PARACTV      float64
PARPRES      float64
dtype: object
In [35]:
data_clean.describe()
Out[35]:
BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN age TREG1 ALCEVR1 ALCPROBS1 ... ESTEEM1 VIOL1 PASSIST DEVIANT1 SCHCONN1 GPA1 EXPEL1 FAMCONCT PARACTV PARPRES
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 ... 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000
mean 1.521093 0.111038 0.683279 0.236066 0.036284 0.040437 16.493052 0.176393 0.527432 0.369180 ... 40.952131 1.618579 0.102514 2.645027 28.360656 2.815647 0.040219 22.570557 6.290710 13.398033
std 0.499609 0.314214 0.465249 0.424709 0.187017 0.197004 1.552174 0.381196 0.499302 0.894947 ... 5.381439 2.593230 0.303356 3.520554 5.156385 0.770167 0.196493 2.614754 3.360219 2.085837
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12.676712 0.000000 0.000000 0.000000 ... 18.000000 0.000000 0.000000 0.000000 6.000000 1.000000 0.000000 6.300000 0.000000 3.000000
25% 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 15.254795 0.000000 0.000000 0.000000 ... 38.000000 0.000000 0.000000 0.000000 25.000000 2.250000 0.000000 21.700000 4.000000 12.000000
50% 2.000000 0.000000 1.000000 0.000000 0.000000 0.000000 16.509589 0.000000 1.000000 0.000000 ... 40.000000 0.000000 0.000000 1.000000 29.000000 2.750000 0.000000 23.700000 6.000000 14.000000
75% 2.000000 0.000000 1.000000 0.000000 0.000000 0.000000 17.679452 0.000000 1.000000 0.000000 ... 45.000000 2.000000 0.000000 4.000000 32.000000 3.500000 0.000000 24.300000 9.000000 15.000000
max 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 21.512329 1.000000 1.000000 6.000000 ... 50.000000 19.000000 1.000000 27.000000 38.000000 4.000000 1.000000 25.000000 18.000000 15.000000

8 rows × 25 columns

In [36]:
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
    'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1',
    'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT',
    'PARACTV','PARPRES']]

targets = data_clean.TREG1

#hold out 40% of the observations for testing
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)
(2745, 24) (1830, 24) (2745,) (1830,)
In [37]:
#build the model on the training data
classifier = DecisionTreeClassifier(max_depth=4)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
Out[37]:
array([[1415,   99],
       [ 193,  123]])
In [38]:
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[38]:
0.84043715846994538
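Accuracy works out to (1415 + 123) / 1830 ≈ 0.840, but the classes are imbalanced: the model misses 193 of the 316 positive cases in the second row of the matrix, so accuracy alone flatters it.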
In [39]:
from io import StringIO  # replaces the removed sklearn.externals.six import
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data2 = StringIO()
export_graphviz(classifier, out_file=dot_data2,
                filled=True, rounded=True,
                special_characters=True)
graph2 = pydotplus.graph_from_dot_data(dot_data2.getvalue())
Image(graph2.create_png())
Out[39]:
_images/010-Decision-Trees_47_0.png