Data Accession

For today’s workshop we will be using the pandas library, the matplotlib library, and the seaborn library. Also, we will read data from the web with the pandas-datareader. By the end of the workshop, participants should be able to use Python to tell a story about a dataset they build from an open data source.

GOALS:

  • Understand how to load data as .csv files into Pandas
  • Import data from web with pandas-datareader and compare development indicators from the World Bank
  • Use API’s and requests to pull data from web

.csv files

In the first session, we explored built-in datasets. Typically, we would want to use our own data for analysis. A common filetype is the .csv or comma separated values type. You have probably used a spreadsheet program before, something like Microsoft Excel or Google Sheets. These programs allow you to save the data as a universally recognized formats, including the .csv extension. This is important as the .csv filetype can be understood and read by most data analysis languages including Python and R.

To begin, we will use Python to load a .csv file. Starting with the tips dataset from last lesson, we will save this data as a csv file in our data folder. Then, we can read the data in using Pandas read_csv method.

In [1]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
tips = sns.load_dataset("tips")
In [3]:
tips.head()
Out[3]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [4]:
tips.to_csv('data/tips.csv')
In [5]:
tips = pd.read_csv('data/tips.csv')
In [6]:
tips.head()
Out[6]:
Unnamed: 0 total_bill tip sex smoker day time size
0 0 16.99 1.01 Female No Sun Dinner 2
1 1 10.34 1.66 Male No Sun Dinner 3
2 2 21.01 3.50 Male No Sun Dinner 3
3 3 23.68 3.31 Male No Sun Dinner 2
4 4 24.59 3.61 Female No Sun Dinner 4
In [13]:
# add a column for tip percent
tips['tip_pct'] = tips['tip']/tips['total_bill']
In [14]:
# create variable grouped that groups the tips by sex and smoker
grouped = tips.groupby(['sex', 'smoker'])
In [15]:
# create variable grouped_pct that contains the tip_pct column from grouped
grouped_pct = grouped['tip_pct']
In [16]:
#what does executing this cell show?  Explain the .agg method.
grouped_pct.agg('mean')
Out[16]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
In [19]:
# What other options can you pass to the .agg function?
grouped_pct.agg(['mean', 'std'])
Out[19]:
mean std
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
In [20]:
grouped_pct.agg?

Reading .csv files from web

If we have access to the file as a url, we can use the Pandas read_csv method to pass the url of the csv file instead of loading it from our local machine. For example, the Data and Software Carpentry organizations have a .csv file located in their github repository as seen below.

The first file on asia_gdp_per_capita can be loaded by using the link to the raw file on github:

https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/asia_gdp_per_capita.csv

hence, we pass this url to the read_csv function and have a new dataframe.

In [7]:
asia_gdp = pd.read_csv('https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/asia_gdp_per_capita.csv')
In [8]:
asia_gdp.head()
Out[8]:
'year' 'Afghanistan' 'Bahrain' 'Bangladesh' 'Cambodia' 'China' 'Hong Kong China' 'India' 'Indonesia' 'Iran' ... 'Philippines' 'Saudi Arabia' 'Singapore' 'Sri Lanka' 'Syria' 'Taiwan' 'Thailand' 'Vietnam' 'West Bank and Gaza' 'Yemen Rep.'
0 1952 779.445314 9867.084765 684.244172 368.469286 400.448611 3054.421209 546.565749 749.681655 3035.326002 ... 1272.880995 6459.554823 2315.138227 1083.532030 1643.485354 1206.947913 757.797418 605.066492 1515.592329 781.717576
1 1957 820.853030 11635.799450 661.637458 434.038336 575.987001 3629.076457 590.061996 858.900271 3290.257643 ... 1547.944844 8157.591248 2843.104409 1072.546602 2117.234893 1507.861290 793.577415 676.285448 1827.067742 804.830455
2 1962 853.100710 12753.275140 686.341554 496.913648 487.674018 4692.648272 658.347151 849.289770 4187.329802 ... 1649.552153 11626.419750 3674.735572 1074.471960 2193.037133 1822.879028 1002.199172 772.049160 2198.956312 825.623201
3 1967 836.197138 14804.672700 721.186086 523.432314 612.705693 6197.962814 700.770611 762.431772 5906.731805 ... 1814.127430 16903.048860 4977.418540 1135.514326 1881.923632 2643.858681 1295.460660 637.123289 2649.715007 862.442146
4 1972 739.981106 18268.658390 630.233627 421.624026 676.900092 8315.928145 724.032527 1111.107907 9613.818607 ... 1989.374070 24837.428650 8597.756202 1213.395530 2571.423014 4062.523897 1524.358936 699.501644 3133.409277 1265.047031

5 rows × 34 columns

Problems

Try to locate and load some .csv files using the internet. There are many great resources out there. Also, I want you to try the pd.read_clipboard method, where you’ve copied a data table from the internet. In both cases create a brief exploratory notebook for the data that contains the following:

  • Jupyter notebook with analysis and discussion
  • Data folder with relevant .csv files
  • Images folder with at least one image loaded into the notebook

Accessing data through API

Pandas has the functionality to access certain data through a datareader. We will use the pandas_datareader to investigate information about the World Bank. For more information, please see the documentation:

http://pandas-datareader.readthedocs.io/en/latest/remote_data.html

We will explore other examples with the datareader later, but to start let’s access the World Bank’s data. For a full description of the available data, look over the source from the World Bank.

https://data.worldbank.org/indicator

In [38]:
from pandas_datareader import wb
In [39]:
import datetime
In [40]:
wb.search('gdp.*capita.*const').iloc[:,:2]
Out[40]:
id name
646 6.0.GDPpc_constant GDP per capita, PPP (constant 2011 internation...
8064 NY.GDP.PCAP.KD GDP per capita (constant 2010 US$)
8066 NY.GDP.PCAP.KN GDP per capita (constant LCU)
8068 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2011 internation...
8069 NY.GDP.PCAP.PP.KD.87 GDP per capita, PPP (constant 1987 internation...
In [41]:
dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US','CA','MX'], start = 2005, end = 2016)
In [43]:
dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
Out[43]:
country
Canada           48601.353408
Mexico            9236.997678
United States    49731.965366
Name: NY.GDP.PCAP.KD, dtype: float64
In [44]:
wb.search('cell.*%').iloc[:,:2]
Out[44]:
id name
6339 IT.CEL.COVR.ZS Population covered by mobile cellular network (%)
6394 IT.MOB.COV.ZS Population coverage of mobile cellular telepho...
In [45]:
ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
In [46]:
dat = wb.download(indicator=ind, country = 'all', start = 2011, end = 2011).dropna()
In [47]:
dat.columns = ['gdp', 'cellphone']
dat.tail()
Out[47]:
gdp cellphone
country year
Swaziland 2011 3704.140658 94.9
Tunisia 2011 4014.916793 100.0
Uganda 2011 629.240447 100.0
Zambia 2011 1499.728311 62.0
Zimbabwe 2011 813.834010 72.4
In [48]:
dat.plot(x ='cellphone', y = 'gdp', kind = 'scatter')
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2215fe80>
_images/03-datafiles_28_1.png
In [49]:
sns.distplot(dat['gdp']);
_images/03-datafiles_29_0.png
In [50]:
sns.distplot(dat['cellphone']);
_images/03-datafiles_30_0.png
In [51]:
sns.jointplot(dat['cellphone'], np.log(dat['gdp']))
Out[51]:
<seaborn.axisgrid.JointGrid at 0x1a22099cf8>
_images/03-datafiles_31_1.png
In [52]:
sns.jointplot(dat['cellphone'], np.log(dat['gdp']), kind = 'hex')
Out[52]:
<seaborn.axisgrid.JointGrid at 0x1a2144fc88>
_images/03-datafiles_32_1.png

StatsModels

StatsModels is a library that contains a wealth of classical statistical techniques. Depending on your comfort or interest in deeper use of classical statistics, you can consult the documentation at http://www.statsmodels.org/stable/index.html . Below, we show how to use statsmodels to perform a basic ordinary least squares fit with our \(y\) or dependent variable as cellphone and \(x\) or independent variable as log(gdp).

In [53]:
import numpy as np
import statsmodels.formula.api as smf
mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
In [54]:
mod.summary()
Out[54]:
OLS Regression Results
Dep. Variable: cellphone R-squared: 0.321
Model: OLS Adj. R-squared: 0.296
Method: Least Squares F-statistic: 13.21
Date: Sat, 13 Jan 2018 Prob (F-statistic): 0.00111
Time: 12:28:27 Log-Likelihood: -127.26
No. Observations: 30 AIC: 258.5
Df Residuals: 28 BIC: 261.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.3708 24.082 -0.098 0.922 -51.700 46.959
np.log(gdp) 11.9971 3.301 3.635 0.001 5.236 18.758
Omnibus: 27.737 Durbin-Watson: 2.064
Prob(Omnibus): 0.000 Jarque-Bera (JB): 62.978
Skew: -1.931 Prob(JB): 2.11e-14
Kurtosis: 8.956 Cond. No. 56.3