Data and Design with Python

OVERVIEW

This short course introduces participants to the Python programming language. We will use Python to perform data analysis, access and structure information from the web, and build and deploy applications such as web pages and message boards using Django. Students will be expected to complete a small project for each week’s topics, described below.

Topics

  • Introduction to Data and Visualizations: The first class will focus on using Pandas and Seaborn to explore data in .csv files and through APIs. We emphasize using the computer to explore the data and look for patterns and differences. Our first project involves writing an analysis of New York City’s 8th-grade mathematics scores.
  • Introduction to Pandas and Seaborn
  • Pandas and Seaborn
  • Assignment: Access and Analyze Data
  • Introduction to Web Scraping: Today, we investigate the use of web scraping to pull and clean data from websites. We will cover some basics of HTML and CSS, and use the requests and BeautifulSoup libraries to pull this information.
  • Introduction to webscraping
  • Scraping Part II
  • Natural Language Processing and Scraping: Today, we extend our web scraping work to analyze the text of the documents we scrape. We will use the Natural Language Toolkit (NLTK) to analyze text, and introduce regular expressions for searching and navigating text.
  • Webscraping and Natural Language Processing
  • Sentiment Analysis of Text
  • More Machine Learning
  • Web Design with Django: In this workshop, we will use the Django framework to design and deploy a basic web application. We discuss how Django projects and applications let us use Python to build a basic website. Our assignment will be a basic website ready to display our earlier work with Jupyter notebooks.
  • Basic WebSite with Django
  • Applications with Django
  • Data and our Website: The final class connects our earlier work with data and Python through Django models, where we build a database for our website. We will add a Blog application to our site, post some information, and access these posts as data in the shell. Finally, we use Django’s ListView and DetailView, together with template logic, to display these posts.
  • Databases and Django: A Basic Blog
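
As a rough illustration of the first topic's workflow (load tabular data, group it, compare groups), here is a minimal sketch. In class we use Pandas and Seaborn; this version swaps in Python's standard library so that it runs without any installs. The column names and values are made up for illustration and are not the New York City scores.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample data standing in for a downloaded .csv file.
SAMPLE_CSV = """district,year,mean_score
A,2015,290
A,2016,294
B,2015,301
B,2016,305
"""


def mean_by_group(csv_text, group_col, value_col):
    """Return {group: mean of value_col} for the rows in csv_text."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))
    return {name: mean(values) for name, values in groups.items()}


# Compare the (hypothetical) districts by their average score.
averages = mean_by_group(SAMPLE_CSV, "district", "mean_score")
for district, score in sorted(averages.items()):
    print(district, score)
```

With Pandas, the same comparison is typically a single groupby call on a DataFrame, which is why the course moves to that library right away.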

Lessons Learned

  • Student Computers: A number of students experienced difficulties with their computers at different points during the semester. In the first weeks, students who lacked access to their own functioning laptops dropped the course, as did a few students who were unaware of the level of coding involved. It would help a great deal to identify an IT support person who can assist students with installing software on, and configuring, their personal computers.

Technology Work

Providing a web-based coding environment could also alleviate many of these computer issues. Below are three such options:

  • Open edX: A learning management system built by MIT and Harvard as part of their open course initiatives. It is freely available; however, we would need a person competent in full-stack web development to run it. Alternatively, third-party companies will launch and manage these applications for a fee that, based on my initial research, would be in the neighborhood of $10,000.

  • CoCalc: A collaborative computing platform that supports many languages. We should be able to launch some version of this ourselves, using the service’s Jupyter notebook and text editor execution capabilities. This would again require support from someone who understands servers and how to deploy interactive software applications on them.

  • JupyterHub: Some institutions have integrated Jupyter notebooks and other code-related interfaces into their Learning Management Systems through JupyterHub. The most popular example is the Data8 course at UC Berkeley (http://data8.org/), which integrates JupyterHub with a virtual textbook. I am close to having a similar setup; however, I don’t yet have full control over my JupyterHub. You can check it out at http://hub.dubmathematics.com

My goal is to integrate this within a website that students can access using some kind of login token.

Suggestions for Course

Despite some bumps in the road, many students were able to complete excellent work. Here are some examples of student GitHub repositories that house three projects and a completed website built with Django:

If I were to do the course over again, I would keep both Data Analysis and Web Design as the focus. Ideally, the class would be a regular 3- or 4-hour class where we can spend more time on all three areas. I would also be interested in connecting with other instructors who work in web design and data visualization to standardize on specific technologies.


Hypothetical Semester Length Version

Here is a prospective outline for such a class:

Section I: Data Analysis and Machine Learning

  • Week I: Introduction to Python

Basic introduction to the Python language. Jupyter notebooks and plotting. Saving and reusing programs.

  • Week II: Introduction to Pandas

Introduction to Data Structures and the Pandas library. Students will work with built-in and external datasets.

  • Week III: Introduction to Machine Learning

We introduce machine learning through the Regression and Clustering algorithms. We will see how to implement each of these algorithms on our data structured with Pandas.
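
To give a sense of what the regression half of this week involves, here is a minimal sketch of simple (one-variable) least-squares regression written out by hand. It assumes plain Python lists rather than Pandas structures, and fit_line is a hypothetical helper name; in class a library routine would do this work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Points lying exactly on y = 2x + 1, so the fit should recover that line.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Writing the formula out once makes it easier to interpret the coefficients a library reports later.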

  • Week IV: Machine Learning with TensorFlow

In this week, we introduce applications of machine learning to visual and audio problems with the Google TensorFlow machine learning library. Here we will discuss neural networks and their use in solving computer vision problems.

Section II: Data and the Internet

  • Week VI: Introduction to WebScraping

This week focuses on accessing data from the web. To start, we will scrape numerical tables into a Pandas DataFrame and use our earlier work with visualization and data analysis to explore the web data. Next, we will focus on accessing and structuring textual data from tables in Wikipedia articles.
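
The table-scraping step can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's html.parser instead of the libraries used in class, TableParser is a hypothetical helper, and the HTML snippet is made up rather than fetched from a real page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>1000</td></tr>
  <tr><td>Shelbyville</td><td>2500</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
# One dict per table row, keyed by the header cells.
records = [dict(zip(header, row)) for row in body]
print(records)
```

A list of dicts like records can be passed directly to the Pandas DataFrame constructor, which is where the earlier visualization work picks up.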

  • Week VII: WebCrawling

This week we will use Scrapy to set up a web crawler that will extract data from multiple websites with a similar structure.

  • Week VIII: Natural Language Processing I

Building on our earlier work with data analysis, we start to turn text into data using the NLTK library. We discuss some introductory Natural Language Processing techniques and visualize novels from Project Gutenberg.
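
Turning text into data usually starts with tokenizing and counting. Here is a minimal sketch of that idea using a regular expression and a Counter from the standard library in place of NLTK's tokenizers. The function name word_counts, the stopword set, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def word_counts(text, stopwords=frozenset()):
    """Lowercase the text, split it into word tokens, and count them."""
    # A word is a run of letters, optionally followed by an apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# Made-up sentence standing in for the text of a novel.
sample = "The whale, the white whale! The sea and the whale."
counts = word_counts(sample, stopwords={"the", "and"})
print(counts.most_common(2))
```

Counts like these are the raw material for the frequency plots we draw from full novels.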

  • Week IX: Machine Learning and Text

This week we focus on using Machine Learning to understand the sentiment and important topics in a range of text. This will take place with reviews on Yelp and Amazon.com.

Section III: Web Design with Django

  • Week X: Introduction to Django

Set up a basic static website using the Python web framework Django. We will discuss the basics of how the internet works and complete a basic website made up of static HTML files that include some basic p5.js animations.

  • Week XI: Django and Models

This week we explore the use of databases with Django applications. We will build a blog for our site and begin to post entries based on our earlier projects. Next, we see how we can analyze this data using our Jupyter notebooks.

  • Week XII: Serving our Site

This week we complete our work with styling the basic site and serve it live to the internet using the Heroku service.

  • Week XIII: User Authentication and Site Access

Adding to our website, we build a user authentication interface that allows us to restrict access to all or part of our website.

  • Week XIV: Packaging your site as a reusable application

Finally, we will package our site for public use. We will follow Python packaging standards to share our work with the larger world, so that others can launch our framework on their own computers with a simple pip install.