Introduction to Web Scraping

GOALS:

  • Introduce structure of webpage
  • Use requests to get website data
  • Use Beautiful Soup to parse basic HTML page
In [20]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/dFKwcFJHLhE?ecver=1" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

What is a website

Behind every website is HTML code. This HTML code is accessible to you on the internet. If we navigate to a website that contains 50 interesting facts about Kanye West (http://www.boomsbeat.com/articles/2192/20140403/50-interesting-facts-about-kanye-west-had-a-near-death-experience-in-2002-his-stylist-went-to-yale.htm), we can view the HTML behind it using the source code.

I’m using a macintosh computer and browsing with chrome. To get the source code I hit control and click on the page to see the page source option. Other browsers are similar. The result will be a new tab containing HTML code. Both are shown below.

HTML Tags

Tags are used to identify different objects on a website, and every tag has the same structure. For example, to write a paragraph on a webpage we would use the paragraph tags and put our text between the tags, as shown below.

<p>
This is where my text would go.
</p>

Here, the <p> starts the paragraph and the </p> ends the paragraph. Tags can be embedded within other tags. If we wanted to make a word bold and insert an image within the paragraph, we could write the following HTML code.

<p>
This is a <strong>heavy</strong> paragraph.  Here's a heavy picture.
<img src="images/heavy_pic.jpg"/img>
</p>

Also, tags may be given attributes. This may be used to apply a style using CSS. For example, the first fact about Kanye uses the dir attribute, and it was named ltr. This differentiates it from the opening paragraph that uses no attribute.

<div class="caption">Source: Flickr</div>
</div>
<p>Kanye West is a Grammy-winning rapper who is currently engaged to Kim Kardashian and he is well known for his outrageous statements and for his broad musical palette.</p>
<ol>
<li dir="ltr">
<p dir="ltr">Kanye Omari West was born June 8, 1977 in Atlanta.</p>

We can use Python to pull the HTML of a webpage into a Jupyter notebook, and then use libraries with functions that know how to read HTML. We will use the attributes to further fine tune parsing the pieces of interest on the webpage.

Getting the HTML with Requests

The requests library can be used to fetch the HTML content of our website. We will assign the content of the webpage to a variable k. We can peek at this after, printing the first 400 characters of the request.

In [1]:
import requests
k = requests.get('http://www.boomsbeat.com/articles/2192/20140403/50-interesting-facts-about-kanye-west-had-a-near-death-experience-in-2002-his-stylist-went-to-yale.htm')
In [2]:
print(k.text[:400])
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>50 interesting facts about Kanye West: Had a near death-experience in 2002, his stylist went to Yale : People : BOOMSbeat</title>
<meta content="width=device-width" name="viewport">

<meta name="Keywords" content="Kanye West, Kanye West facts, Kanye West net worth, Kanye West full name" />
<meta name="Description" content="Kanye West is a

As we wanted, we have all the HTML content that we saw in our source view.

Parsing HTML with Beautiful Soup

Now, we will use the Beautiful Soup library to parse the HTML. Beautiful soup knows how to read the HTML and has many functions we can use to pull specific pieces of interest out. To begin, we turn our request object into a beautiful soup object named soup.

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(k.text, 'html.parser')

Now, let us take a look at the source again and locate the structure surrounding the interesting facts. By searching on the source page for the first fact, I find the following.

Here, it’s important to notice that the facts lie inside <p> paragraph tags. These tags also have an attribute dir = "ltr". We can use beautiful soup to locate all these instances. If we are correct, we should have 50 interesting facts.

In [4]:
facts = soup.find_all('p')
In [5]:
len(facts)
Out[5]:
89
In [6]:
facts[0]
Out[6]:
<p class="art-date">Apr 03, 2014 11:57 AM EDT</p>
In [7]:
facts[0].text
Out[7]:
'Apr 03, 2014 11:57 AM EDT'
In [8]:
facts[2:53]
Out[8]:
[<p>Kanye West is a Grammy-winning rapper who is currently engaged to Kim Kardashian and he is well known for his outrageous statements and for his broad musical palette.</p>,
 <p>Kanye Omari West was born June 8, 1977 in Atlanta.</p>,
 <p>His father Ray West was a black panther in the 60s and 70s and he later became one of the first black photojournalists at the Atlanta-Journal Constitution and later became a Christian counselor. His mother Donda was English professor at Clark Atlanta University. He later moved to Chicago at the age of three when his parents divorced.</p>,
 <p>The name Kanye means "the only one" in Swahilli.</p>,
 <p>Kanye lived in China for more than a year with his mother when he was in fifth grade. His mother was a visiting professor there at the time and he joined her.</p>,
 <p>Kanye attended Chicago State University/Columbia College in Chicago.  He dropped out to pursue music which is why he named his 2004 debut album, "The College Dropout."</p>,
 <p>Kanye's struggle to transition from producer to MC is well documented throughout his music. However, 'Ye didn't keep to quiet about his desire to become a full-fledged rapper. Def Jam A&amp;R Chris Anokute recalled that Yeezy would often play his demo for him in his cubicle when he would stop by Def Jam's offices to pick up his production checks.</p>,
 <p>At the start of his music career, Kanye apparently kept his business dealings all in the family. West's late mother Donda-a professor of English at Clark Atlanta University and later Chicago State University-retired from teaching to become her son's manager for the early part of his career.</p>,
 <p>He sold his first beat to local Chicago rapper Gravity for $8,800.</p>,
 <p>He got his first big break through No I.D. (born Dion Ernest Wilson) is a veteran hip hop producer and current VP at Def Jam. He taught Kanye how to produce beats and gave him his start in the music business-all because their moms forced the two to hang out.</p>,
 <p>No I.D.'s mother convinced him to meet this "energetic" kid, and the lessons paid off: "At first it was just like, 'Alright man take this, learn this, go, git git git.' But eventually he started getting good and then I started managing him." West's subsequent success only bolstered No I.D.'s reputation outside of Chicago-a powerful lesson in why you should probably listen to your mother.</p>,
 <p>He initially rose to fame as a producer for Roc-A-Fella Records. He is was influential on Jay-Z's 2001 album, 'The Blueprint', producing 5 of the 13 tracks and two bonus tracks, more than any of the other producers on the album.</p>,
 <p>He dropped out of college and had a slew of random jobs. He worked as a telemarketer and sold insurance to Montgomery Ward credit card holders. "I was way better than most of the people there," he told Playboy. "I could sit around and draw pictures, basically do other [things] while I was reading the teleprompter."</p>,
 <p>Kanye was in a near fatal car accident while he was driving home from the studio in October 2002. The injuries were bad and he had to have a metal plate put into his chin.</p>,
 <p>While he was recovering in hospital, he didn't want to stop recording music so he asked for an electronic drum machine which he used to continue composing new music.</p>,
 <p>He admits that the idea of becoming a male porn star crossed his mind once or twice before.</p>,
 <p>His single debut is "Through the Wire" because he recorded it while he was still wearing the metal brace in his mouth.</p>,
 <p>Chaka Khan initially refused to grant Kanye permission to use the pitched-up sample of her vocals from "Through the Fire" on College Dropout. But according to the video's co-director Coodie Simmons who told <a href="http://espn.go.com/blog/music/post/_/id/4151/coodie-breaks-down-his-music-videos" rel="nofollow">ESPN.com</a> , a Sunday barbecue at Coodie's house changed hip-hop history. "Kanye brought Chaka Khan's son and I was like, 'We've got to shoot this video,' so we showed him the "Through The Wire" video. He was like, 'Aw man, I've got to show my mom this and tell her we're trying to get this done.' And I would say about two weeks later, she cleared the sample."</p>,
 <p><iframe class="videocontent" height="480" src="http://www.youtube.com/embed/uvb-1wjAtk4" width="853"></iframe></p>,
 <p>He is a huge fan of Fiona Apple and her music. Yeezy told Apple she was "possibly [his] favorite" and that the lyrics and singing on her debut Tidal made him want to work with producer Jon Brion "so I could be like the rap version of you." West went so far as to say "I hold you higher than Lauryn Hill in my eyes."</p>,
 <p>'College Dropout' was album of the year by almost every publication (New York Times, Time Magazine, GQ, Spin, XXL, Rolling Stone). NME voted Franz Ferdinand's debut number one and Kanye's album number seven.</p>,
 <p>West was the most nominated artist at the 47th Annual Grammy Awards with 10 nods, and he took home three trophies - Best Rap Album, Best Rap Song for "Jesus Walks" and Best R&amp;B Song for producing Alicia Keys' "You Don't Know My Name."</p>,
 <p>Following the success of his The College Dropout album, he treated himself by purchasing an 18th century aquarium with about 30 koi fish in it.</p>,
 <p>With the headline "Hip-Hop's Class Act," West becomes one of the rare entertainers to appear on the cover of Time. The lengthy article details the contradictions of The College Dropout and of West himself, who admits that when starting out in hip-hop, "It was a strike against me that I didn't wear baggy jeans and jerseys and that I never hustled, never sold drugs."</p>,
 <p>He used the money from the "Diamonds from Sierra Leone" music video to raise awareness about blood diamonds and the abuse of human rights that happen in the mining process.</p>,
 <p>He caused controversy when he strayed  from his scripted monologue at the live televised Concert for Hurricane Relief when he said "George Bush doesn't care about black people."  With a shocked looking Mike Myers at his side, West's comments air live on NBC on the East Coast but are edited out of the West Coast version later that night. "People have lost their lives, lost their families," he says a week later on The Ellen DeGeneres Show. "It's the least I could do to go up there and say something from my heart."</p>,
 <p><iframe class="videocontent" height="480" src="http://www.youtube.com/embed/zIUzLpO1kxI" width="853"></iframe></p>,
 <p>His nicknames include Ye, The Louis Vuitton Don, Yeezy or konman.</p>,
 <p>Even after being named Best Hip-Hop Artist at the MTV Europe Music Awards in Copenhagen, a fuming West storms on stage as the award for Best Video is being given to Parisian duo Justice vs. Simian. In a profanity-laced tirade, West says he should have won the prize for "Touch the Sky," because his video "cost a million dollars, Pamela Anderson was in it."</p>,
 <p><iframe class="videocontent" height="480" src="http://www.youtube.com/embed/YkwQbuAGLj4" width="853"></iframe></p>,
 <p>Kanye was named International Man of the Year by GQ in 2007 at a ceremony at Covent Garden's Opera House in London.</p>,
 <p>Unfortunately, his mother died that same year following complications while getting plastic surgery.</p>,
 <p>Kanye says he realizes, "Nothing is promised in life except for death." and he lives everyday with that in mind.</p>,
 <p>Kanye broke down at a concert in Paris, a week after the passing of his mother, Dr. Donda West, as he tried to sing the verses of "Hey Mama," a song he had written earlier on in his career in dedication to her.</p>,
 <p><iframe class="videocontent" height="480" src="http://www.youtube.com/embed/2ZXlnJ5o63g" width="853"></iframe></p>,
 <p>He launched an online travel company called "Kanye Travel Ventures" (KTV) through his official website in 2008.</p>,
 <p>After the infamous  Taylor Swift Gate VMAs incident in 2009, he decided to leave the country for a while. He went to Japan, then Rome, before finally moving to Hawaii for 6 months to start working on music again.</p>,
 <p><iframe class="videocontent" height="480" src="http://www.youtube.com/embed/UhL2LoYaZ90" width="853"></iframe></p>,
 <p>In addition to avoiding the VMAs backlash, 'Ye was able to slow down and spend time reflecting. "It was the first time I got to stop, since my mom had passed, because I never stopped and never tried to even soak in what all had happened," he later told Ellen Degeneres. Plus he got to do fun stuff like intern at Fendi.</p>,
 <p>The Eternal Sunshine of the Spotless Mind director visited the studio on the same the day West was recording "Diamonds From Sierra Leone," producer Jon Brion <a href="http://www.mtv.com/news/articles/1507538/kanyes-co-pilot-talks-late-registration.jhtml" rel="nofollow">told MTV</a> . In addition to playing drums on the Grammy-winning song, Gondry's more famous Late Registration contribution is the video for "Heard 'Em Say" featuring Adam Levine.</p>,
 <p>He said once in an interview that he prefers finalizing a song in post production more than having sex.</p>,
 <p>One of his favorite bands is  Scottish rock group Franz Ferdinand.</p>,
 <p>The song, 'Stronger', famously used a sample of Daft Punk's 'Harder, Better, Faster, Stronger'. But Kanye has also created some unreleased songs that contain samples from Broadway hit music, 'Wicked'.</p>,
 <p>Kanye was engaged to designer Alexis  Phifer for 18 months before he began a relationship Amber Rose. The couple makes a fashionable pair, wearing attention-grabbing ensembles around the world. "He'll pick out something and he'll be like 'Babe, just ... no.,'"</p>,
 <p>For Kanye, being famous has always been an unbearable drain. In his new track 'New Slaves', he compares being a celebrity to, erm, being a slave. Ironically, he is currently engaged to reality TV star Kim Kardashian who is known for loving the media.</p>,
 <p>At one point in his career-circa the release of Graduation in 2007-Kanye was slated to star in a TV series. Back by producers Rick Rubin and Larry Charles, the show was set to be a half-hour scripted sitcom based West's life and music career. Despite numerous mentions of the show to the press, it ultimately never made it to TV.</p>,
 <p>He is a sensitive person at heart. An episode of South Park depicting Kanye as an egomaniac is said to have "hurt [his] feelings".</p>,
 <p>Kanye and Royce have a long-standing feud stemming from a 2003 song that West produced for the Detroit rhymer titled "Heartbeat." West alleges that Nickel Nine never paid for the beat, but recorded to it and released it on Build And Destroy: The Lost Sessions regardless. He has since stated that he will never work with Royce again.</p>,
 <p>Although 'Ye has a penchant for left field collaborations-most notably Chris Martin of Coldplay, Daft Punk, Bon Iver and Katy Perry-one of his most unexpected collabs came with rock group 30 Seconds To Mars.</p>,
 <p>He is a budding fashion designer and he and he collaborated with French brand A.P.C. He garnered attention for selling a plain white t-shirt for $120. <img class="imgNone magnify" id="36487" src="//image.boomsbeat.com/data/images/full/36487/white-jpg.png" title="white shirt" width="600"/></p>,
 <p>Kanye opened up a burger chain in Chicago called Fatburger in 2008. When he opened it, he said he had plans to open 10 stores. He opened two before running into some financial problems and so he closed them down in 2011.</p>]
In [9]:
facts[2].text
Out[9]:
'Kanye West is a Grammy-winning rapper who is currently engaged to Kim Kardashian and he is well known for his outrageous statements and for his broad musical palette.'
In [10]:
facts = facts[3:53]

Creating a Table of Facts

Now, we can create a table that contains each interesting fact. To do so, we will start with an empty list and append each interesting fact using our above syntax and a for loop.

In [11]:
table = []
for i in facts:
    fact = i.text
    table.append(fact)
In [12]:
len(table)
Out[12]:
50
In [13]:
table[:5]
Out[13]:
['Kanye Omari West was born June 8, 1977 in Atlanta.',
 'His father Ray West was a black panther in the 60s and 70s and he later became one of the first black photojournalists at the Atlanta-Journal Constitution and later became a Christian counselor. His mother Donda was English professor at Clark Atlanta University. He later moved to Chicago at the age of three when his parents divorced.',
 'The name Kanye means "the only one" in Swahilli.',
 'Kanye lived in China for more than a year with his mother when he was in fifth grade. His mother was a visiting professor there at the time and he joined her.',
 'Kanye attended Chicago State University/Columbia College in Chicago.  He dropped out to pursue music which is why he named his 2004 debut album, "The College Dropout."']

Pandas and DataFrames

The standard library for data analysis in Python is Pandas. Here, the typical row and column format for data used is called a DataFrame. We can convert our table data to a dataframe as follows.

In [14]:
import pandas as pd
df = pd.DataFrame(table, columns=['Interesting Facts'])

We can use the head() function to examine the top 5 rows of our new DataFrame.

In [15]:
df.head()
Out[15]:
Interesting Facts
0 Kanye Omari West was born June 8, 1977 in Atla...
1 His father Ray West was a black panther in the...
2 The name Kanye means "the only one" in Swahilli.
3 Kanye lived in China for more than a year with...
4 Kanye attended Chicago State University/Columb...

Save our Data

Now, we can convert the dataframe to a comma separated value file on our computer. We could read this back in at any time as shown with the read_csv file.

In [17]:
df.to_csv('kanye_facts.csv', index=False, encoding='utf-8')
In [18]:
df = pd.read_csv('kanye_facts.csv', encoding='utf-8')
In [19]:
df.head(7)
Out[19]:
Interesting Facts
0 Kanye Omari West was born June 8, 1977 in Atla...
1 His father Ray West was a black panther in the...
2 The name Kanye means "the only one" in Swahilli.
3 Kanye lived in China for more than a year with...
4 Kanye attended Chicago State University/Columb...
5 Kanye's struggle to transition from producer t...
6 At the start of his music career, Kanye appare...