In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Intro to Machine Learning

One of the main ideas of machine learning, is to split data into testing and training sets. These sets are used to develop the model, and subsequently test its accuracy. Later, we will repeat this process a number of times to get an even better model. Machine learning can be thought of as representing a philosophy to model building, where we improve our models by iteratively building the model and testing it’s performance on held out data.

In [2]:
x = np.random.randn(400)
y = np.random.randn(400)
In [3]:
In [4]:
plt.scatter(x[:350], y[:350], color = 'red', alpha = 0.4, label = 'training set')
plt.scatter(x[350:], y[350:], color = 'blue', alpha = 0.4, label = 'test set')
plt.legend(loc = 'best', frameon = False)
plt.title("Idea of Test and Train Split \nin Machine Learning", loc = 'left')
Text(0,1,'Idea of Test and Train Split \nin Machine Learning')
In [5]:
X_train, x_test, y_train, y_test = x[:350].reshape(-1,1), x[350:].reshape(-1,1), y[:350].reshape(-1,1), y[350:].reshape(-1,1)
In [6]:
(350, 1)
In [7]:
from sklearn import linear_model
In [8]:
reg = linear_model.LinearRegression(), y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [9]:
In [10]:
y_predict = reg.predict(x_test.reshape(-1,1))
In [11]:
plt.scatter(X_train, y_train, alpha = 0.3)
plt.scatter(x_test, y_test, alpha = 0.3)
plt.plot(x_test, y_predict, color = 'black')
[<matplotlib.lines.Line2D at 0x1a159ebcf8>]

Regression Example: Loading and Structuring Data

Predicting level of diabetes based on body mass index measures.

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
In [17]:
diabetes = datasets.load_diabetes()
In [18]:
{'DESCR': 'Diabetes dataset\n================\n\nNotes\n-----\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\nData Set Characteristics:\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attributes:\n    :Age:\n    :Sex:\n    :Body mass index:\n    :Average blood pressure:\n    :S1:\n    :S2:\n    :S3:\n    :S4:\n    :S5:\n    :S6:\n\nNote: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource URL:\n\n\nFor more information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.\n(\n',
 'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990842, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06832974, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286377, -0.02593034],
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04687948,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452837, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00421986,  0.00306441]]),
 'feature_names': ['age',
 'target': array([ 151.,   75.,  141.,  206.,  135.,   97.,  138.,   63.,  110.,
         310.,  101.,   69.,  179.,  185.,  118.,  171.,  166.,  144.,
          97.,  168.,   68.,   49.,   68.,  245.,  184.,  202.,  137.,
          85.,  131.,  283.,  129.,   59.,  341.,   87.,   65.,  102.,
         265.,  276.,  252.,   90.,  100.,   55.,   61.,   92.,  259.,
          53.,  190.,  142.,   75.,  142.,  155.,  225.,   59.,  104.,
         182.,  128.,   52.,   37.,  170.,  170.,   61.,  144.,   52.,
         128.,   71.,  163.,  150.,   97.,  160.,  178.,   48.,  270.,
         202.,  111.,   85.,   42.,  170.,  200.,  252.,  113.,  143.,
          51.,   52.,  210.,   65.,  141.,   55.,  134.,   42.,  111.,
          98.,  164.,   48.,   96.,   90.,  162.,  150.,  279.,   92.,
          83.,  128.,  102.,  302.,  198.,   95.,   53.,  134.,  144.,
         232.,   81.,  104.,   59.,  246.,  297.,  258.,  229.,  275.,
         281.,  179.,  200.,  200.,  173.,  180.,   84.,  121.,  161.,
          99.,  109.,  115.,  268.,  274.,  158.,  107.,   83.,  103.,
         272.,   85.,  280.,  336.,  281.,  118.,  317.,  235.,   60.,
         174.,  259.,  178.,  128.,   96.,  126.,  288.,   88.,  292.,
          71.,  197.,  186.,   25.,   84.,   96.,  195.,   53.,  217.,
         172.,  131.,  214.,   59.,   70.,  220.,  268.,  152.,   47.,
          74.,  295.,  101.,  151.,  127.,  237.,  225.,   81.,  151.,
         107.,   64.,  138.,  185.,  265.,  101.,  137.,  143.,  141.,
          79.,  292.,  178.,   91.,  116.,   86.,  122.,   72.,  129.,
         142.,   90.,  158.,   39.,  196.,  222.,  277.,   99.,  196.,
         202.,  155.,   77.,  191.,   70.,   73.,   49.,   65.,  263.,
         248.,  296.,  214.,  185.,   78.,   93.,  252.,  150.,   77.,
         208.,   77.,  108.,  160.,   53.,  220.,  154.,  259.,   90.,
         246.,  124.,   67.,   72.,  257.,  262.,  275.,  177.,   71.,
          47.,  187.,  125.,   78.,   51.,  258.,  215.,  303.,  243.,
          91.,  150.,  310.,  153.,  346.,   63.,   89.,   50.,   39.,
         103.,  308.,  116.,  145.,   74.,   45.,  115.,  264.,   87.,
         202.,  127.,  182.,  241.,   66.,   94.,  283.,   64.,  102.,
         200.,  265.,   94.,  230.,  181.,  156.,  233.,   60.,  219.,
          80.,   68.,  332.,  248.,   84.,  200.,   55.,   85.,   89.,
          31.,  129.,   83.,  275.,   65.,  198.,  236.,  253.,  124.,
          44.,  172.,  114.,  142.,  109.,  180.,  144.,  163.,  147.,
          97.,  220.,  190.,  109.,  191.,  122.,  230.,  242.,  248.,
         249.,  192.,  131.,  237.,   78.,  135.,  244.,  199.,  270.,
         164.,   72.,   96.,  306.,   91.,  214.,   95.,  216.,  263.,
         178.,  113.,  200.,  139.,  139.,   88.,  148.,   88.,  243.,
          71.,   77.,  109.,  272.,   60.,   54.,  221.,   90.,  311.,
         281.,  182.,  321.,   58.,  262.,  206.,  233.,  242.,  123.,
         167.,   63.,  197.,   71.,  168.,  140.,  217.,  121.,  235.,
         245.,   40.,   52.,  104.,  132.,   88.,   69.,  219.,   72.,
         201.,  110.,   51.,  277.,   63.,  118.,   69.,  273.,  258.,
          43.,  198.,  242.,  232.,  175.,   93.,  168.,  275.,  293.,
         281.,   72.,  140.,  189.,  181.,  209.,  136.,  261.,  113.,
         131.,  174.,  257.,   55.,   84.,   42.,  146.,  212.,  233.,
          91.,  111.,  152.,  120.,   67.,  310.,   94.,  183.,   66.,
         173.,   72.,   49.,   64.,   48.,  178.,  104.,  132.,  220.,   57.])}
In [19]:
'Diabetes dataset\n================\n\nNotes\n-----\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\nData Set Characteristics:\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attributes:\n    :Age:\n    :Sex:\n    :Body mass index:\n    :Average blood pressure:\n    :S1:\n    :S2:\n    :S3:\n    :S4:\n    :S5:\n    :S6:\n\nNote: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource URL:\n\n\nFor more information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.\n(\n'
In [20]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])
In [34]:
In [21]:[:, np.newaxis, 2]
array([[ 0.06169621],
       [ 0.04445121],
       [ 0.06169621],
       [ 0.03906215],
       [ 0.01750591],
       [ 0.04229559],
       [ 0.01211685],
       [-0.0105172 ],
       [ 0.06061839],
       [ 0.03582872],
       [ 0.05954058],
       [ 0.04445121],
       [ 0.12528712],
       [ 0.02289497],
       [ 0.01103904],
       [ 0.07139652],
       [ 0.01427248],
       [-0.0105172 ],
       [ 0.06816308],
       [-0.0730303 ],
       [ 0.01427248],
       [ 0.0164281 ],
       [ 0.0250506 ],
       [ 0.04121778],
       [ 0.00457217],
       [-0.0374625 ],
       [-0.046085  ],
       [ 0.03367309],
       [ 0.00241654],
       [ 0.02828403],
       [-0.0374625 ],
       [ 0.01211685],
       [ 0.00996123],
       [ 0.07139652],
       [ 0.0519959 ],
       [ 0.00457217],
       [ 0.00996123],
       [ 0.08864151],
       [ 0.01750591],
       [ 0.02828403],
       [ 0.04121778],
       [ 0.06492964],
       [ 0.04984027],
       [ 0.04552903],
       [ 0.00457217],
       [ 0.02073935],
       [ 0.01427248],
       [ 0.11019775],
       [ 0.00133873],
       [ 0.05846277],
       [-0.0105172 ],
       [ 0.00457217],
       [ 0.01750591],
       [ 0.08109682],
       [ 0.0347509 ],
       [ 0.02397278],
       [ 0.0164281 ],
       [ 0.09618619],
       [ 0.0433734 ],
       [ 0.05630715],
       [-0.0816528 ],
       [ 0.04984027],
       [ 0.11127556],
       [ 0.06169621],
       [ 0.01427248],
       [ 0.04768465],
       [ 0.01211685],
       [ 0.00564998],
       [ 0.04660684],
       [ 0.12852056],
       [ 0.05954058],
       [ 0.09295276],
       [ 0.01535029],
       [ 0.0703187 ],
       [ 0.02073935],
       [ 0.06061839],
       [-0.0105172 ],
       [ 0.0433734 ],
       [ 0.06385183],
       [ 0.03043966],
       [ 0.07247433],
       [-0.0191397 ],
       [ 0.06924089],
       [ 0.05954058],
       [-0.046085  ],
       [ 0.07139652],
       [ 0.00996123],
       [ 0.01966154],
       [ 0.02720622],
       [ 0.00457217],
       [ 0.00564998],
       [ 0.02397278],
       [ 0.04229559],
       [-0.0547075 ],
       [ 0.0250506 ],
       [-0.046085  ],
       [ 0.00349435],
       [ 0.05415152],
       [ 0.00133873],
       [ 0.03043966],
       [ 0.00672779],
       [ 0.04660684],
       [ 0.02612841],
       [ 0.04552903],
       [ 0.04013997],
       [ 0.01427248],
       [ 0.03690653],
       [ 0.00349435],
       [ 0.09403057],
       [ 0.03582872],
       [ 0.03151747],
       [ 0.03259528],
       [-0.046085  ],
       [ 0.07139652],
       [ 0.00026092],
       [ 0.03690653],
       [ 0.03906215],
       [ 0.00672779],
       [ 0.01966154],
       [ 0.07462995],
       [-0.046085  ],
       [ 0.05415152],
       [-0.0816528 ],
       [ 0.04768465],
       [ 0.06061839],
       [ 0.05630715],
       [ 0.09834182],
       [ 0.05954058],
       [ 0.03367309],
       [ 0.05630715],
       [ 0.16085492],
       [ 0.12744274],
       [ 0.02828403],
       [ 0.08864151],
       [ 0.03043966],
       [ 0.00888341],
       [ 0.00672779],
       [ 0.02612841],
       [ 0.01858372],
       [-0.0902753 ],
       [-0.0547075 ],
       [ 0.05522933],
       [ 0.07678558],
       [ 0.01858372],
       [ 0.09295276],
       [ 0.03906215],
       [-0.0374625 ],
       [ 0.07355214],
       [ 0.03367309],
       [ 0.0347509 ],
       [-0.046085  ],
       [ 0.00133873],
       [ 0.06492964],
       [ 0.04013997],
       [ 0.05307371],
       [ 0.04013997],
       [ 0.01427248],
       [ 0.00672779],
       [ 0.00457217],
       [ 0.03043966],
       [ 0.0519959 ],
       [ 0.06169621],
       [ 0.00564998],
       [ 0.05415152],
       [ 0.114509  ],
       [ 0.06708527],
       [ 0.03043966],
       [ 0.10480869],
       [ 0.08540807],
       [ 0.05954058],
       [ 0.02181716],
       [ 0.01858372],
       [ 0.01750591],
       [ 0.06061839],
       [ 0.04552903],
       [ 0.04984027],
       [ 0.00564998],
       [ 0.02073935],
       [ 0.10480869],
       [ 0.13714305],
       [ 0.17055523],
       [ 0.00241654],
       [ 0.03798434],
       [-0.0105172 ],
       [ 0.06816308],
       [ 0.00996123],
       [ 0.00241654],
       [ 0.02612841],
       [ 0.06061839],
       [-0.0191397 ],
       [ 0.01535029],
       [ 0.00133873],
       [ 0.06924089],
       [-0.046085  ],
       [ 0.01858372],
       [ 0.00133873],
       [ 0.01535029],
       [ 0.02289497],
       [ 0.04552903],
       [ 0.097264  ],
       [ 0.05415152],
       [ 0.12313149],
       [ 0.09295276],
       [-0.0277622 ],
       [ 0.05846277],
       [ 0.08540807],
       [ 0.00672779],
       [ 0.00888341],
       [ 0.08001901],
       [ 0.07139652],
       [-0.0547075 ],
       [ 0.0164281 ],
       [ 0.07786339],
       [ 0.01103904],
       [ 0.00564998],
       [ 0.08864151],
       [ 0.05522933],
       [ 0.00133873],
       [ 0.01966154],
       [ 0.03906215],
       [-0.0730303 ]])
In [22]:
diabetes_X =[:, np.newaxis, 2]
In [23]:
array([ 151.,   75.,  141.,  206.,  135.,   97.,  138.,   63.,  110.,
        310.,  101.,   69.,  179.,  185.,  118.,  171.,  166.,  144.,
         97.,  168.,   68.,   49.,   68.,  245.,  184.,  202.,  137.,
         85.,  131.,  283.,  129.,   59.,  341.,   87.,   65.,  102.,
        265.,  276.,  252.,   90.,  100.,   55.,   61.,   92.,  259.,
         53.,  190.,  142.,   75.,  142.,  155.,  225.,   59.,  104.,
        182.,  128.,   52.,   37.,  170.,  170.,   61.,  144.,   52.,
        128.,   71.,  163.,  150.,   97.,  160.,  178.,   48.,  270.,
        202.,  111.,   85.,   42.,  170.,  200.,  252.,  113.,  143.,
         51.,   52.,  210.,   65.,  141.,   55.,  134.,   42.,  111.,
         98.,  164.,   48.,   96.,   90.,  162.,  150.,  279.,   92.,
         83.,  128.,  102.,  302.,  198.,   95.,   53.,  134.,  144.,
        232.,   81.,  104.,   59.,  246.,  297.,  258.,  229.,  275.,
        281.,  179.,  200.,  200.,  173.,  180.,   84.,  121.,  161.,
         99.,  109.,  115.,  268.,  274.,  158.,  107.,   83.,  103.,
        272.,   85.,  280.,  336.,  281.,  118.,  317.,  235.,   60.,
        174.,  259.,  178.,  128.,   96.,  126.,  288.,   88.,  292.,
         71.,  197.,  186.,   25.,   84.,   96.,  195.,   53.,  217.,
        172.,  131.,  214.,   59.,   70.,  220.,  268.,  152.,   47.,
         74.,  295.,  101.,  151.,  127.,  237.,  225.,   81.,  151.,
        107.,   64.,  138.,  185.,  265.,  101.,  137.,  143.,  141.,
         79.,  292.,  178.,   91.,  116.,   86.,  122.,   72.,  129.,
        142.,   90.,  158.,   39.,  196.,  222.,  277.,   99.,  196.,
        202.,  155.,   77.,  191.,   70.,   73.,   49.,   65.,  263.,
        248.,  296.,  214.,  185.,   78.,   93.,  252.,  150.,   77.,
        208.,   77.,  108.,  160.,   53.,  220.,  154.,  259.,   90.,
        246.,  124.,   67.,   72.,  257.,  262.,  275.,  177.,   71.,
         47.,  187.,  125.,   78.,   51.,  258.,  215.,  303.,  243.,
         91.,  150.,  310.,  153.,  346.,   63.,   89.,   50.,   39.,
        103.,  308.,  116.,  145.,   74.,   45.,  115.,  264.,   87.,
        202.,  127.,  182.,  241.,   66.,   94.,  283.,   64.,  102.,
        200.,  265.,   94.,  230.,  181.,  156.,  233.,   60.,  219.,
         80.,   68.,  332.,  248.,   84.,  200.,   55.,   85.,   89.,
         31.,  129.,   83.,  275.,   65.,  198.,  236.,  253.,  124.,
         44.,  172.,  114.,  142.,  109.,  180.,  144.,  163.,  147.,
         97.,  220.,  190.,  109.,  191.,  122.,  230.,  242.,  248.,
        249.,  192.,  131.,  237.,   78.,  135.,  244.,  199.,  270.,
        164.,   72.,   96.,  306.,   91.,  214.,   95.,  216.,  263.,
        178.,  113.,  200.,  139.,  139.,   88.,  148.,   88.,  243.,
         71.,   77.,  109.,  272.,   60.,   54.,  221.,   90.,  311.,
        281.,  182.,  321.,   58.,  262.,  206.,  233.,  242.,  123.,
        167.,   63.,  197.,   71.,  168.,  140.,  217.,  121.,  235.,
        245.,   40.,   52.,  104.,  132.,   88.,   69.,  219.,   72.,
        201.,  110.,   51.,  277.,   63.,  118.,   69.,  273.,  258.,
         43.,  198.,  242.,  232.,  175.,   93.,  168.,  275.,  293.,
        281.,   72.,  140.,  189.,  181.,  209.,  136.,  261.,  113.,
        131.,  174.,  257.,   55.,   84.,   42.,  146.,  212.,  233.,
         91.,  111.,  152.,  120.,   67.,  310.,   94.,  183.,   66.,
        173.,   72.,   49.,   64.,   48.,  178.,  104.,  132.,  220.,   57.])
In [24]:
diabetes_y =
In [25]:
from sklearn.model_selection import train_test_split
In [26]:
X_train, x_test = train_test_split(diabetes_X)
y_train, y_test = train_test_split(diabetes_y)
In [32]:
plt.figure(figsize = (12, 9))
plt.scatter(X_train, y_train, label = 'Training Set')
plt.scatter(x_test, y_test, label = 'Test Set')
plt.legend(frameon = False)
plt.title("Example Test Train Split from Diabetes Data", loc = 'left', size = 20)
Text(0,1,'Example Test Train Split from Diabetes Data')

Linear Regression: Fitting and Evaluating the Model

In [35]:
regr = linear_model.LinearRegression()
In [36]:, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [38]:
predictions = regr.predict(x_test)
In [40]:
print("The coefficients of the model are: \n", regr.coef_)
The coefficients of the model are:
 [ 6.29641819]
In [41]:
print("The intercept of the model are: \n", regr.intercept_)
The intercept of the model are:
In [43]:
print("The Equation for the Line of Best Fit is \n y = ", regr.coef_, 'x +', regr.intercept_)
The Equation for the Line of Best Fit is
 y =  [ 6.29641819] x + 152.512205614
In [44]:
def l(x):
    return regr.coef_*x + regr.intercept_
In [45]:
array([ 341.40475121])
In [46]:
x = np.linspace(min(X_train), max(X_train), 1000)
In [47]:
plt.figure(figsize = (12, 9))
plt.scatter(X_train, y_train, label = 'Training Set')
plt.scatter(x_test, y_test, label = 'Test Set')
plt.plot(x, l(x), label = 'Line of Best Fit')
plt.legend(frameon = False)
plt.title("Example Test Train Split from Diabetes Data", loc = 'left', size = 20)
Text(0,1,'Example Test Train Split from Diabetes Data')
In [48]:
print("The Mean Squared Error of the model is", mean_squared_error(y_test, predictions))
The Mean Squared Error of the model is 6126.13411338
In [49]:
print("The Variance Score is ", r2_score(y_test, predictions))
The Variance Score is  -0.000950748287665
In [51]:
<bound method BaseEstimator.get_params of LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)>

Using StatsModels and Seaborn

In [57]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
In [60]:
df = pd.DataFrame()
In [67]:
df['bmi'] =[:, 2]
In [68]:
df['disease'] =
In [69]:
bmi disease
0 0.061696 151.0
1 -0.051474 75.0
2 0.044451 141.0
3 -0.011595 206.0
4 -0.036385 135.0
In [73]:
In [75]:
results = smf.ols('disease ~ bmi', data = df).fit()
In [76]:
                            OLS Regression Results
Dep. Variable:                disease   R-squared:                       0.344
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     230.7
Date:                Sat, 10 Feb 2018   Prob (F-statistic):           3.47e-42
Time:                        14:16:19   Log-Likelihood:                -2454.0
No. Observations:                 442   AIC:                             4912.
Df Residuals:                     440   BIC:                             4920.
Df Model:                           1
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    152.1335      2.974     51.162      0.000     146.289     157.978
bmi          949.4353     62.515     15.187      0.000     826.570    1072.301
Omnibus:                       11.674   Durbin-Watson:                   1.848
Prob(Omnibus):                  0.003   Jarque-Bera (JB):                7.310
Skew:                           0.156   Prob(JB):                       0.0259
Kurtosis:                       2.453   Cond. No.                         21.0

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [77]:
df2 = df[:300]
In [78]:
bmi disease
0 0.061696 151.0
1 -0.051474 75.0
2 0.044451 141.0
3 -0.011595 206.0
4 -0.036385 135.0
In [79]:
df2b = df[300:]
In [80]:
bmi disease
300 0.073552 275.0
301 -0.024529 65.0
302 0.033673 198.0
303 0.034751 236.0
304 -0.038540 253.0
In [83]:
split_results = smf.ols('disease ~ bmi', data = df2).fit()
In [84]:
                            OLS Regression Results
Dep. Variable:                disease   R-squared:                       0.342
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     154.8
Date:                Sat, 10 Feb 2018   Prob (F-statistic):           6.61e-29
Time:                        14:18:03   Log-Likelihood:                -1668.4
No. Observations:                 300   AIC:                             3341.
Df Residuals:                     298   BIC:                             3348.
Df Model:                           1
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    151.0306      3.651     41.372      0.000     143.846     158.215
bmi          975.5736     78.405     12.443      0.000     821.276    1129.872
Omnibus:                        9.498   Durbin-Watson:                   1.764
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                6.672
Skew:                           0.238   Prob(JB):                       0.0356
Kurtosis:                       2.446   Cond. No.                         21.5

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [87]:
predictions = split_results.predict(df2b['bmi'])
In [88]:
300    222.786110
301    127.100973
302    183.881164
303    184.932649
304    113.431668
305    112.380183
306    149.182158
307    120.792063
308    106.071273
309    152.336613
dtype: float64
In [95]:
import seaborn as sns
sns.jointplot('bmi', 'disease', data = df, size = 10)
<seaborn.axisgrid.JointGrid at 0x1c216fc438>

Other Examples of Machine Learning

  • What category does this belong to?
  • What is this a picture of?
In [12]:
from sklearn import datasets
In [13]:
iris = datasets.load_iris()
digits = datasets.load_digits()
In [14]:
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
In [15]:
array([0, 1, 2, ..., 8, 9, 8])
In [16]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
In [17]:[:5]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])
In [18]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

What kind of Flower is This?

  • K-Means Clustering
  • Naive Bayes Classifier
  • Decision Tree
In [19]:
plt.subplot(1, 3, 1)

plt.subplot(1, 3, 2)

plt.subplot(1, 3, 3)
<matplotlib.image.AxesImage at 0x1a15e43630>

Learning and Predicting with Digits

Given an image, which digit does it represent? Here, we will fit an estimator to predict which class unknown images belong to. To do this, we will use the support vector classifier.

In [20]:
from sklearn import svm
In [21]:
clf = svm.SVC(gamma = 0.001, C = 100)
In [22]:
#fit on all but last data point[:-1],[:-1])
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [23]:
In [24]:
<matplotlib.image.AxesImage at 0x1a15db4278>