In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Intro to Machine Learning

One of the main ideas of machine learning is to split the data into training and testing sets. The training set is used to fit the model, and the held-out test set is then used to measure its accuracy on data the model has never seen. Later, we will repeat this process a number of times to get an even better model. Machine learning can be thought of as a philosophy of model building: we improve our models by iteratively fitting them and testing their performance on held-out data.
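scikit-learn provides a helper that performs this split in one call; as a preview of what we will use later in this notebook, a minimal sketch (assuming paired arrays x and y of equal length):

    from sklearn.model_selection import train_test_split

    # splitting x and y in one call keeps each feature paired with its target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)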

In [2]:
x = np.random.randn(400)
y = np.random.randn(400)
In [3]:
x.shape
Out[3]:
(400,)
In [4]:
plt.scatter(x[:350], y[:350], color = 'red', alpha = 0.4, label = 'training set')
plt.scatter(x[350:], y[350:], color = 'blue', alpha = 0.4, label = 'test set')
plt.legend(loc = 'best', frameon = False)
plt.title("Idea of Test and Train Split \nin Machine Learning", loc = 'left')
Out[4]:
Text(0,1,'Idea of Test and Train Split \nin Machine Learning')
_images/09_ML_Intro_4_1.png
In [5]:
X_train, x_test = x[:350].reshape(-1,1), x[350:].reshape(-1,1)
y_train, y_test = y[:350].reshape(-1,1), y[350:].reshape(-1,1)
In [6]:
X_train.shape
Out[6]:
(350, 1)
In [7]:
from sklearn import linear_model
In [8]:
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [9]:
reg.coef_
Out[9]:
array([[-0.0010095]])
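The fitted slope is essentially zero, which is exactly what we should expect: x and y are independent draws of random noise, so there is no linear relationship for the model to recover.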
In [10]:
y_predict = reg.predict(x_test)  # x_test was already reshaped to (n, 1) above
In [11]:
plt.scatter(X_train, y_train, alpha = 0.3)
plt.scatter(x_test, y_test, alpha = 0.3)
plt.plot(x_test, y_predict, color = 'black')
Out[11]:
[<matplotlib.lines.Line2D at 0x1a159ebcf8>]
_images/09_ML_Intro_11_1.png

Regression Example: Loading and Structuring Data

Predicting a quantitative measure of diabetes disease progression from body mass index (BMI).

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
In [17]:
diabetes = datasets.load_diabetes()
In [18]:
diabetes
Out[18]:
{'DESCR': 'Diabetes dataset\n================\n\nNotes\n-----\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\nData Set Characteristics:\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attributes:\n    :Age:\n    :Sex:\n    :Body mass index:\n    :Average blood pressure:\n    :S1:\n    :S2:\n    :S3:\n    :S4:\n    :S5:\n    :S6:\n\nNote: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource URL:\nhttp://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n\nFor more information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.\n(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n',
 'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990842, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06832974, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286377, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04687948,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452837, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00421986,  0.00306441]]),
 'feature_names': ['age',
  'sex',
  'bmi',
  'bp',
  's1',
  's2',
  's3',
  's4',
  's5',
  's6'],
 'target': array([ 151.,   75.,  141.,  206.,  135.,   97.,  138.,   63.,  110.,
         310.,  101.,   69.,  179.,  185.,  118.,  171.,  166.,  144.,
         ...,
         173.,   72.,   49.,   64.,   48.,  178.,  104.,  132.,  220.,   57.])}
In [19]:
diabetes.DESCR
Out[19]:
'Diabetes dataset\n================\n\nNotes\n-----\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\nData Set Characteristics:\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attributes:\n    :Age:\n    :Sex:\n    :Body mass index:\n    :Average blood pressure:\n    :S1:\n    :S2:\n    :S3:\n    :S4:\n    :S5:\n    :S6:\n\nNote: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource URL:\nhttp://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n\nFor more information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.\n(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n'
In [20]:
diabetes.data
Out[20]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])
In [34]:
diabetes.feature_names[2]
Out[34]:
'bmi'
In [21]:
diabetes.data[:, np.newaxis, 2]
Out[21]:
array([[ 0.06169621],
       [-0.05147406],
       [ 0.04445121],
       [-0.01159501],
       [-0.03638469],
       ...,
       [-0.01590626],
       [-0.01590626],
       [ 0.03906215],
       [-0.0730303 ]])
In [22]:
diabetes_X = diabetes.data[:, np.newaxis, 2]
In [23]:
diabetes.target
Out[23]:
array([ 151.,   75.,  141.,  206.,  135.,   97.,  138.,   63.,  110.,
        310.,  101.,   69.,  179.,  185.,  118.,  171.,  166.,  144.,
         97.,  168.,   68.,   49.,   68.,  245.,  184.,  202.,  137.,
         85.,  131.,  283.,  129.,   59.,  341.,   87.,   65.,  102.,
        265.,  276.,  252.,   90.,  100.,   55.,   61.,   92.,  259.,
         53.,  190.,  142.,   75.,  142.,  155.,  225.,   59.,  104.,
        182.,  128.,   52.,   37.,  170.,  170.,   61.,  144.,   52.,
        128.,   71.,  163.,  150.,   97.,  160.,  178.,   48.,  270.,
        202.,  111.,   85.,   42.,  170.,  200.,  252.,  113.,  143.,
         51.,   52.,  210.,   65.,  141.,   55.,  134.,   42.,  111.,
         98.,  164.,   48.,   96.,   90.,  162.,  150.,  279.,   92.,
         83.,  128.,  102.,  302.,  198.,   95.,   53.,  134.,  144.,
        232.,   81.,  104.,   59.,  246.,  297.,  258.,  229.,  275.,
        281.,  179.,  200.,  200.,  173.,  180.,   84.,  121.,  161.,
         99.,  109.,  115.,  268.,  274.,  158.,  107.,   83.,  103.,
        272.,   85.,  280.,  336.,  281.,  118.,  317.,  235.,   60.,
        174.,  259.,  178.,  128.,   96.,  126.,  288.,   88.,  292.,
         71.,  197.,  186.,   25.,   84.,   96.,  195.,   53.,  217.,
        172.,  131.,  214.,   59.,   70.,  220.,  268.,  152.,   47.,
         74.,  295.,  101.,  151.,  127.,  237.,  225.,   81.,  151.,
        107.,   64.,  138.,  185.,  265.,  101.,  137.,  143.,  141.,
         79.,  292.,  178.,   91.,  116.,   86.,  122.,   72.,  129.,
        142.,   90.,  158.,   39.,  196.,  222.,  277.,   99.,  196.,
        202.,  155.,   77.,  191.,   70.,   73.,   49.,   65.,  263.,
        248.,  296.,  214.,  185.,   78.,   93.,  252.,  150.,   77.,
        208.,   77.,  108.,  160.,   53.,  220.,  154.,  259.,   90.,
        246.,  124.,   67.,   72.,  257.,  262.,  275.,  177.,   71.,
         47.,  187.,  125.,   78.,   51.,  258.,  215.,  303.,  243.,
         91.,  150.,  310.,  153.,  346.,   63.,   89.,   50.,   39.,
        103.,  308.,  116.,  145.,   74.,   45.,  115.,  264.,   87.,
        202.,  127.,  182.,  241.,   66.,   94.,  283.,   64.,  102.,
        200.,  265.,   94.,  230.,  181.,  156.,  233.,   60.,  219.,
         80.,   68.,  332.,  248.,   84.,  200.,   55.,   85.,   89.,
         31.,  129.,   83.,  275.,   65.,  198.,  236.,  253.,  124.,
         44.,  172.,  114.,  142.,  109.,  180.,  144.,  163.,  147.,
         97.,  220.,  190.,  109.,  191.,  122.,  230.,  242.,  248.,
        249.,  192.,  131.,  237.,   78.,  135.,  244.,  199.,  270.,
        164.,   72.,   96.,  306.,   91.,  214.,   95.,  216.,  263.,
        178.,  113.,  200.,  139.,  139.,   88.,  148.,   88.,  243.,
         71.,   77.,  109.,  272.,   60.,   54.,  221.,   90.,  311.,
        281.,  182.,  321.,   58.,  262.,  206.,  233.,  242.,  123.,
        167.,   63.,  197.,   71.,  168.,  140.,  217.,  121.,  235.,
        245.,   40.,   52.,  104.,  132.,   88.,   69.,  219.,   72.,
        201.,  110.,   51.,  277.,   63.,  118.,   69.,  273.,  258.,
         43.,  198.,  242.,  232.,  175.,   93.,  168.,  275.,  293.,
        281.,   72.,  140.,  189.,  181.,  209.,  136.,  261.,  113.,
        131.,  174.,  257.,   55.,   84.,   42.,  146.,  212.,  233.,
         91.,  111.,  152.,  120.,   67.,  310.,   94.,  183.,   66.,
        173.,   72.,   49.,   64.,   48.,  178.,  104.,  132.,  220.,   57.])
In [24]:
diabetes_y = diabetes.target
In [25]:
from sklearn.model_selection import train_test_split
In [26]:
# NOTE: calling train_test_split separately on X and y shuffles each array
# independently, so features and targets no longer line up. This mistake
# will show up below as a near-zero (even negative) R^2 score.
X_train, x_test = train_test_split(diabetes_X)
y_train, y_test = train_test_split(diabetes_y)
In [32]:
plt.figure(figsize = (12, 9))
plt.scatter(X_train, y_train, label = 'Training Set')
plt.scatter(x_test, y_test, label = 'Test Set')
plt.legend(frameon = False)
plt.title("Example Test Train Split from Diabetes Data", loc = 'left', size = 20)
Out[32]:
Text(0,1,'Example Test Train Split from Diabetes Data')
_images/09_ML_Intro_25_1.png

Linear Regression: Fitting and Evaluating the Model

In [35]:
regr = linear_model.LinearRegression()
In [36]:
regr.fit(X_train, y_train)
Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [38]:
predictions = regr.predict(x_test)
In [40]:
print("The coefficients of the model are: \n", regr.coef_)
The coefficients of the model are:
 [ 6.29641819]
In [41]:
print("The intercept of the model are: \n", regr.intercept_)
The intercept of the model are:
 152.512205614
In [43]:
print("The Equation for the Line of Best Fit is \n y = ", regr.coef_, 'x +', regr.intercept_)
The Equation for the Line of Best Fit is
 y =  [ 6.29641819] x + 152.512205614
In [44]:
def l(x):
    # line of best fit from the fitted model: y = coef * x + intercept
    return regr.coef_*x + regr.intercept_
In [45]:
l(30)
Out[45]:
array([ 341.40475121])
In [46]:
x = np.linspace(X_train.min(), X_train.max(), 1000)
In [47]:
plt.figure(figsize = (12, 9))
plt.scatter(X_train, y_train, label = 'Training Set')
plt.scatter(x_test, y_test, label = 'Test Set')
plt.plot(x, l(x), label = 'Line of Best Fit')
plt.legend(frameon = False)
plt.title("Example Test Train Split from Diabetes Data", loc = 'left', size = 20)
Out[47]:
Text(0,1,'Example Test Train Split from Diabetes Data')
_images/09_ML_Intro_36_1.png
In [48]:
print("The Mean Squared Error of the model is", mean_squared_error(y_test, predictions))
The Mean Squared Error of the model is 6126.13411338
In [49]:
print("The Variance Score is ", r2_score(y_test, predictions))
The Variance Score is  -0.000950748287665
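An R² below zero means the model predicts worse than simply guessing the mean of the test targets. That is the symptom of the split mistake flagged above: shuffling X and y independently destroyed the feature/target pairing, so the model was fit to noise. A corrected version splits both arrays in one call (a minimal sketch, not the original notebook's code):

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score

    # one call keeps each BMI value paired with its disease-progression target
    X_tr, x_te, y_tr, y_te = train_test_split(diabetes_X, diabetes_y, random_state=0)

    regr_fixed = linear_model.LinearRegression()
    regr_fixed.fit(X_tr, y_tr)
    preds = regr_fixed.predict(x_te)
    print(regr_fixed.coef_)       # a slope near the ~949 that the statsmodels fit below recovers
    print(r2_score(y_te, preds))  # should now be clearly positive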
In [51]:
regr.get_params
Out[51]:
<bound method BaseEstimator.get_params of LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)>
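Note that regr.get_params without parentheses only returns the bound method shown above; calling regr.get_params() returns the hyperparameters as a dictionary, e.g. {'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}.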

Using StatsModels and Seaborn

In [57]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
In [60]:
df = pd.DataFrame()
In [67]:
df['bmi'] = diabetes.data[:, 2]
In [68]:
df['disease'] = diabetes.target
In [69]:
df.head()
Out[69]:
        bmi  disease
0  0.061696    151.0
1 -0.051474     75.0
2  0.044451    141.0
3 -0.011595    206.0
4 -0.036385    135.0
In [73]:
len(df['bmi'])
Out[73]:
442
In [75]:
results = smf.ols('disease ~ bmi', data = df).fit()
In [76]:
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                disease   R-squared:                       0.344
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     230.7
Date:                Sat, 10 Feb 2018   Prob (F-statistic):           3.47e-42
Time:                        14:16:19   Log-Likelihood:                -2454.0
No. Observations:                 442   AIC:                             4912.
Df Residuals:                     440   BIC:                             4920.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    152.1335      2.974     51.162      0.000     146.289     157.978
bmi          949.4353     62.515     15.187      0.000     826.570    1072.301
==============================================================================
Omnibus:                       11.674   Durbin-Watson:                   1.848
Prob(Omnibus):                  0.003   Jarque-Bera (JB):                7.310
Skew:                           0.156   Prob(JB):                       0.0259
Kurtosis:                       2.453   Cond. No.                         21.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
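Because the bmi feature was mean-centered when the dataset was standardized, the intercept (152.13) is simply the mean of the disease-progression target, and the bmi coefficient of roughly 949 is the slope in the scaled units of the feature.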
In [77]:
df2 = df[:300]
In [78]:
df2.head()
Out[78]:
        bmi  disease
0  0.061696    151.0
1 -0.051474     75.0
2  0.044451    141.0
3 -0.011595    206.0
4 -0.036385    135.0
In [79]:
df2b = df[300:]
In [80]:
df2b.head()
Out[80]:
          bmi  disease
300  0.073552    275.0
301 -0.024529     65.0
302  0.033673    198.0
303  0.034751    236.0
304 -0.038540    253.0
In [83]:
split_results = smf.ols('disease ~ bmi', data = df2).fit()
In [84]:
print(split_results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                disease   R-squared:                       0.342
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     154.8
Date:                Sat, 10 Feb 2018   Prob (F-statistic):           6.61e-29
Time:                        14:18:03   Log-Likelihood:                -1668.4
No. Observations:                 300   AIC:                             3341.
Df Residuals:                     298   BIC:                             3348.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    151.0306      3.651     41.372      0.000     143.846     158.215
bmi          975.5736     78.405     12.443      0.000     821.276    1129.872
==============================================================================
Omnibus:                        9.498   Durbin-Watson:                   1.764
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                6.672
Skew:                           0.238   Prob(JB):                       0.0356
Kurtosis:                       2.446   Cond. No.                         21.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [87]:
predictions = split_results.predict(df2b['bmi'])
In [88]:
predictions[:10]
Out[88]:
300    222.786110
301    127.100973
302    183.881164
303    184.932649
304    113.431668
305    112.380183
306    149.182158
307    120.792063
308    106.071273
309    152.336613
dtype: float64
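To evaluate these held-out predictions the same way we scored the scikit-learn model, we can reuse the metrics imported earlier (a quick sketch; the exact numbers depend on this particular 300/142 split):

    print(mean_squared_error(df2b['disease'], predictions))
    print(r2_score(df2b['disease'], predictions))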
In [95]:
import seaborn as sns
# note: newer seaborn versions require keyword arguments and rename size -> height:
# sns.jointplot(x='bmi', y='disease', data=df, height=10)
sns.jointplot('bmi', 'disease', data = df, size = 10)
Out[95]:
<seaborn.axisgrid.JointGrid at 0x1c216fc438>
_images/09_ML_Intro_57_1.png

Other Examples of Machine Learning

  • What category does this belong to?
  • What is this a picture of?
In [12]:
from sklearn import datasets
In [13]:
iris = datasets.load_iris()
digits = datasets.load_digits()
In [14]:
print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
In [15]:
digits.target
Out[15]:
array([0, 1, 2, ..., 8, 9, 8])
In [16]:
digits.images[0]
Out[16]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
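Each image is an 8×8 grid of pixel intensities ranging from 0 to 16; digits.data holds the same grids flattened into 64-element rows, which is the form the estimators below expect.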
In [17]:
iris.data[:5]
Out[17]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])
In [18]:
iris.target
Out[18]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
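The targets 0, 1, and 2 encode the three iris species; their names are stored in iris.target_names.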

What kind of Flower is This?

  • K-Means Clustering
  • Naive Bayes Classifier
  • Decision Tree
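Any of these methods could be applied to the iris measurements. As one illustrative sketch (not part of the original notebook), a decision tree classifier with a properly paired split:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # split features and targets together so the pairs stay aligned
    X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

    tree = DecisionTreeClassifier()
    tree.fit(X_tr, y_tr)
    print(accuracy_score(y_te, tree.predict(X_te)))  # typically above 0.9 on iris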
In [19]:
plt.subplot(1, 3, 1)
plt.imshow(digits.images[1])

plt.subplot(1, 3, 2)
plt.imshow(digits.images[2])

plt.subplot(1, 3, 3)
plt.imshow(digits.images[3])
Out[19]:
<matplotlib.image.AxesImage at 0x1a15e43630>
_images/09_ML_Intro_67_1.png

Learning and Predicting with Digits

Given an image, which digit does it represent? Here, we fit an estimator on labeled images and use it to predict which class an unseen image belongs to. To do this, we will use a support vector classifier (SVC).

In [20]:
from sklearn import svm
In [21]:
clf = svm.SVC(gamma = 0.001, C = 100)
In [22]:
# fit on all but the last data point
clf.fit(digits.data[:-1], digits.target[:-1])
Out[22]:
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [23]:
clf.predict(digits.data[-1:])
Out[23]:
array([8])
In [24]:
plt.imshow(digits.images[-1])
Out[24]:
<matplotlib.image.AxesImage at 0x1a15db4278>
_images/09_ML_Intro_73_1.png
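The prediction of 8 matches the true label (the target array shown earlier ends in 8), but predicting a single held-out image is a sanity check, not an evaluation. A more honest assessment uses a proper split (a sketch, reusing the same digits data and svm import):

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, random_state=0)
    clf2 = svm.SVC(gamma=0.001, C=100)
    clf2.fit(X_tr, y_tr)
    print(accuracy_score(y_te, clf2.predict(X_te)))  # an SVC with these settings typically scores ~0.99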