Day 28 Project: Web Scraping

Welcome to the day 28 project in the 30 Days of Python series! Today we're going to continue our exploration of third-party modules by writing a program to automatically grab content from web pages for us!

In order to do this we're going to use a library called BeautifulSoup, which is for parsing and searching HTML documents. HTML is a markup language used to give structure and meaning to the content of web pages.

Before we look at BeautifulSoup in more detail, we need to learn a little bit about how web pages are structured, and how we can navigate them programmatically.

Remember, we've got a video walkthrough for this blog post, if you prefer that medium!

A quick primer on HTML

As I mentioned in the introduction, HTML is a markup language used to give structure and meaning to the content of a web page.

A markup language is a language which annotates text in such a way that a computer can understand it. Another example of a markup language is Markdown, which is used on various forums, blogs, and chat applications to indicate how the text should be formatted.

Here is some HTML describing a simple web page.

<!DOCTYPE html>

<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="description" content="A simple page that greets the world with enthusiasm.">
    <title>Hello World!</title>
</head>

<body>
    <h1>Hello World!</h1>
</body>

</html>

The very first line contains <!DOCTYPE html> which is really just a label letting the computer know what format to expect the following content in. This is important for browsers, because while HTML5 is now the standard, some websites use other, older formats with different syntax.

The main content of an HTML document is made up of tags, so let's talk a little bit about how those work.

HTML tags

A tag is made up of a few components, and there are many different types of element in HTML that we can create tags for, each with its own name and special meaning. We don't have time to talk about all of them here, and we don't need to, because knowing which element to use is really only a concern when building websites.

When we want to create an HTML element, the name of that element is placed between angled brackets like this: <html>. This construct is called a tag.

Usually tags have an accompanying closing tag which has a / before the name of the element. We can see an example of this in the document above for the html, head, body, title, and h1 elements.

When an HTML element has an opening and closing tag, anything placed between those tags is considered to be inside that element. By nesting elements inside one another we can give a page structure.

For example, let's look at the following piece of HTML code:

<div>
    <p>
        Spicy jalapeno bacon ipsum dolor amet id tongue pork belly andouille.
    </p>
</div>

Here we have some spicy bacon ipsum text sitting inside a set of <p> tags, which are used to denote a paragraph of text. These <p> tags are inside a set of <div> tags, which are used to define some generic group of content.

Understanding the hierarchical nature of HTML is really important, because it's going to allow us to easily traverse the document later on.
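
To make the nesting a bit more concrete, here is a small sketch that uses Python's built-in html.parser module (not BeautifulSoup, which we'll get to shortly) to print each opening tag indented by how deeply it is nested. The OutlinePrinter class name is just something made up for this illustration, and the sketch assumes every element has both an opening and a closing tag, so it wouldn't cope with elements made of only a single tag.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Records each opening tag along with how deeply nested it is."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Remember the tag at the current depth, then step one level in.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag means we step back out one level.
        self.depth -= 1


snippet = "<div><p>Spicy jalapeno bacon ipsum dolor amet.</p></div>"

printer = OutlinePrinter()
printer.feed(snippet)

for depth, tag in printer.outline:
    print("    " * depth + tag)
```

The p element is printed one level further in than the div, which mirrors the way it sits inside the div in the HTML.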

Note

Not all HTML elements have an opening and closing tag. If we look at the elements inside the head tags of our Hello, World page we can see that we have a couple of meta elements which are composed of only a single tag each.

These elements are not allowed to have content, and there are many such elements in HTML.

HTML attributes

In addition to a name, we can define a set of attributes for a given HTML element. For this project we need to be aware of a couple of very common attributes called class and id.

Here is an example of an HTML element with a class and an id attribute set.

<body>
    <h1 class="header" id="pageTitle">Hello World!</h1>
</body>

The syntax for this is very simple. We just have the name of the attribute followed by an = symbol, and then a string describing the value associated with that attribute.

The class and id attributes give us ways of referring to different parts of our HTML document, and they're generally used as anchors for applying styles, or for interacting with elements on the page with JavaScript.

In our case, they're going to be very useful for getting hold of elements on the page.

A quick primer on CSS selectors

CSS is another language (short for Cascading Style Sheets) which is used for defining how the elements of an HTML document should actually look.

We're not going to be doing any styling in this project, but we do need to know a little bit about the selector syntax, which is how we describe which elements a style applies to. This is relevant to us because BeautifulSoup lets us search an HTML document using these selectors, and they're exceptionally powerful.

Basic selectors

First things first, let's talk about how to select by element names, by classes, and by ids.

Selecting all instances of an element on a page is very easy, since the selector is just the element's name. If we wanted to select all the h1 tags, for example, we could just write h1.

If we want to select by a class, we have to write the name of the class, prefixed with a period (.).

Returning to this example,

<body>
    <h1 class="header" id="pageTitle">Hello World!</h1>
</body>

We could grab this h1 tag using the selector .header, since the header class is listed among the classes for this particular h1 tag.

To search by id, we instead need to use # followed by the name of the id. For the above example, the selector would look like this: #pageTitle.

In theory, ids should be unique, so selecting by id should provide a single element, but this is not enforced by HTML, so we can't count on it.

Combining CSS selectors

Sometimes we can't specify exactly what we mean with a single term, so we can combine selectors into more specific selectors.

For example, if we wanted to ensure that we only select h1 elements with the header class, and not, for example, h2 elements with a header class, we could write this:

h1.header

By putting a space between selectors, we can imply a hierarchy. The following means "h1 elements with the header class that are inside a div":

div h1.header

We can get far more complicated than this, but these relatively simple selectors are enough to get us started for this project.

You can find a lot more information on the different CSS selectors you can use here.
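
As a preview of how these selectors get used with BeautifulSoup (installation and parsing are covered in the next section), here is a rough sketch that runs each selector from this primer against a tiny document. The h2 element and its text are assumptions added just so that the class selector has more than one thing to match.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <h1 class="header" id="pageTitle">Hello World!</h1>
    <h2 class="header">A smaller heading</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_name = soup.select("h1")                # every h1 element
by_class = soup.select(".header")          # anything with the header class
by_id = soup.select("#pageTitle")          # anything with the pageTitle id
combined = soup.select("h1.header")        # h1 elements with the header class
nested = soup.select("div h1.header")      # the same, but only inside a div

print(len(by_name), len(by_class), len(by_id), len(combined), len(nested))
```

Only the .header selector matches two elements here, because both the h1 and the h2 carry that class. Adding the element name narrows it back down to one.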

Getting started with `BeautifulSoup`

By this point you know the drill when it comes to installing third-party modules, so there's no need to go over that again here. You can find information in the post for day 27 if you get stuck, or you can follow the BeautifulSoup installation guide.

Parsing HTML

Let's have a go at parsing some sample HTML from the BeautifulSoup documentation, which looks like this:

html_doc = """
<html>

<head>
    <title>The Dormouse's story</title>
</head>

<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

</body>

</html>
"""

Here we're working with a multi-line string which contains some text in HTML format. Later on we'll be getting the HTML directly from a website. The process is the same.

The import for BeautifulSoup is a little strange.

from bs4 import BeautifulSoup

Take care not to try to import from BeautifulSoup. That won't work.

With BeautifulSoup imported, all we need to do to parse the file is this:

from bs4 import BeautifulSoup

html_doc = """
<html>

<head>
    <title>The Dormouse's story</title>
</head>

<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

</body>

</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

Now we can work with soup to easily find what we're looking for in the source document.

Getting data out of the soup

The items inside the soup are Tag objects. This is a special type made by the BeautifulSoup authors for storing information about a given element in the HTML document. When we search through the soup, to get hold of specific elements on the page, what we'll get back is a Tag or list of Tag objects.

Tag objects have a lot of useful methods and attributes, such as contents, which gives us a list of everything inside a given element. If the tag contains only text content, we can use the string attribute to get hold of that content. We'll see an example of this in a moment.

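If we want to get the value for an attribute, we can treat the tag like a dictionary, where the attribute names are the keys. So if we wanted the content of the class attribute for a given tag, we could write tag["class"]. Because an element can have several classes, tag["class"] gives us a list of class names rather than a single string.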

We can search the soup using the select and select_one methods, which accept a CSS selector as a string.

For example, let's say we want to grab the names of the sisters from the HTML document above. We could do something like this:

from bs4 import BeautifulSoup

html_doc = """
<html>

<head>
    <title>The Dormouse's story</title>
</head>

<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

</body>

</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
sisters = soup.select(".sister")

for sister in sisters:
    print(sister.string)

Here soup.select(".sister") gave us back a list of <a> elements as Tag objects. We then looped over the list, which means that for any given iteration, sister contained a single Tag.

Since the <a> elements contain nothing but text, we were then able to extract that text content by accessing the string attribute of that tag.
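
Here is a minimal sketch of the other things mentioned above: select_one, the contents attribute, and dictionary-style attribute access. It uses a cut-down version of the sample document with only one sister in it, which is a simplification made for this example.

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# select_one gives back the first matching Tag rather than a list.
paragraph = soup.select_one("p.story")
link = soup.select_one("#link1")

print(len(paragraph.contents))  # 1, the paragraph contains just the <a> tag
print(link.string)              # Elsie
print(link["href"])             # http://example.com/elsie
print(link["class"])            # ['sister']

# Using get avoids a KeyError when an attribute might be missing.
print(link.get("title"))        # None
```

Note that link["class"] is a list even though there is only one class on the element.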

Hopefully that all made sense. If you want to dive deeper into BeautifulSoup, you can find documentation here.

The brief

Phew! That was a lot to cover, but we're now ready to tackle the actual project.

For this project we're going to be scraping a site called http://books.toscrape.com/. It's a purpose-built site for learning how to scrape real websites.

What you need to do is scrape the front page of this site, and grab some data for every book on that page.

For each book, I want you to grab the title, the star rating, and the price. You should write all of this information to a new file in CSV format.

Before you can do any of this, you need to get hold of the actual HTML document. To do this, add the following lines to your file:

import requests

data = requests.get("http://books.toscrape.com/").content

Note that you will have to install the requests library, as it's not part of the standard library.

You have a few options for actually viewing the HTML. You can get hold of the data in a nice format by using the prettify method on your soup, which you can write to a file or print to the console. You can also go to the site directly and right click on the page. You should see an option to view page source. If you can't see this, try pressing ctrl + u.
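
If you haven't used prettify before, this is roughly what it looks like. The sketch below uses a short hard-coded HTML string in place of the real page, so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello World!</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify returns the document as a string, with one tag per line
# and indentation that reflects the nesting.
pretty = soup.prettify()
print(pretty)
```

Since pretty is an ordinary string, you can also write it to a file and open it in your editor, which is much nicer for a long page.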

Good luck with the project!

Our solution

First, let's get all the imports and initial set up in order.

import requests
from bs4 import BeautifulSoup

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

Before we continue, we really need to have a look at the HTML code and see how best to grab the data we're looking for.

Each of the books is stored as part of an ordered list (<ol> tag) as a list item (<li> tag), which looks like this:

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
        <div class="image_container">
            <a href="catalogue/a-light-in-the-attic_1000/index.html">
                <img
                    src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
                    alt="A Light in the Attic"
                    class="thumbnail"
                >
            </a>
        </div>

        <p class="star-rating Three">
            <i class="icon-star"></i>
            <i class="icon-star"></i>
            <i class="icon-star"></i>
            <i class="icon-star"></i>
            <i class="icon-star"></i>
        </p>

        <h3>
            <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
                A Light in the ...
            </a>
        </h3>

        <div class="product_price">
            <p class="price_color">£51.77</p>
            <p class="instock availability">
                <i class="icon-ok"></i>
                In stock
            </p>

            <form>
                <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">
                    Add to basket
                </button>
            </form>
        </div>
    </article>
</li>

The majority of this information we don't need to worry about. For example, the first section of each <li> is dedicated to the image. There's also a little form element at the bottom for adding a book to the basket.

Let's break this up into chunks so we can better reason about what is going on.

For the price, we're concerned with this section:

<div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
        <i class="icon-ok"></i>
        In stock
     </p>

     <form>
        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">
            Add to basket
        </button>
     </form>
</div>

Rather conveniently, the price is in its own <p> tag, and it has a class associated with it. If we scan the document, we'll also find that it's a unique class only used for the prices.

Getting the p tags containing our prices can therefore be done with a very simple selector: ".price_color".

The title is a little trickier. It's located here:

<h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
        A Light in the ...
    </a>
</h3>

We don't have any classes or ids to work with here, and the title of the book is also not actually string content inside the <a> tag. If we look, longer titles get truncated, so we need to look at the value in the title attribute instead.

In this case, I think the selector we should use is this:

".product_pod h3 a"

This is looking for a elements that live inside of h3 elements, that are inside an element with the class .product_pod. Where did product_pod come from? It's a class added to the article element which contains our h3 tags.

This selector is specific enough that we're unlikely to accidentally select something we didn't intend. If it weren't, we would have to use an even more specific selector.
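
To see the difference between the text inside the <a> tag and its title attribute, here is a quick sketch that runs the selector against just this fragment of the page, wrapped in the article element it lives in.

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
            A Light in the ...
        </a>
    </h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one(".product_pod h3 a")

visible_text = link.string.strip()  # the truncated text shown on the page
full_title = link["title"]          # the complete title from the attribute

print(visible_text)  # A Light in the ...
print(full_title)    # A Light in the Attic
```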

Finally, let's look at the HTML for the rating.

<p class="star-rating Three">
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
</p>

Here we have another simple option for our selector: we can just look for the star-rating class. Getting the information we need is going to be a little more difficult though, because the number of stars a book has is represented by another class. We're therefore going to have to create some way of retrieving this second class and converting it to something we can use more easily in our program.
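
As a quick illustration of what we'll be working with, this sketch parses a shortened version of the rating HTML (only one of the star icons is kept) and looks at the classes on the <p> tag.

```python
from bs4 import BeautifulSoup

html = '<p class="star-rating Three"><i class="icon-star"></i></p>'
soup = BeautifulSoup(html, "html.parser")

rating_tag = soup.select_one(".star-rating")
classes = rating_tag["class"]

print(classes)             # ['star-rating', 'Three']
print("Three" in classes)  # True
```

The rating we care about is sitting in that list as the word Three, right next to the star-rating class we used to find the tag.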

Now that we've analysed the HTML, let's implement the code we need to grab the tags we want.

As a small note, I'm going to be storing our selectors in variables, just so they're more reusable.

import requests
from bs4 import BeautifulSoup

price_selector = ".price_color"
title_selector = ".product_pod h3 a"
rating_selector = ".star-rating"

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

prices = soup.select(price_selector)
titles = soup.select(title_selector)
ratings = soup.select(rating_selector)

One of the nice things here is that our tags have a consistent order, because we're dealing with lists. We can therefore group the different values together into the original books using zip.

We can then use a for loop to iterate over the zip object so that we can work with all the data for a given book. For now I'm just going to focus on printing the values, but this for loop will eventually move into a context manager.

import requests
from bs4 import BeautifulSoup

price_selector = ".price_color"
title_selector = ".product_pod h3 a"
rating_selector = ".star-rating"

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

prices = soup.select(price_selector)
titles = soup.select(title_selector)
ratings = soup.select(rating_selector)

for price, title, rating in zip(prices, titles, ratings):
    pass
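
If it's been a while since you used zip, this small sketch shows what it's doing for us here. It uses plain lists of strings in place of the lists of tags, with values taken from the first three books on the page.

```python
titles = ["A Light in the Attic", "Tipping the Velvet", "Soumission"]
prices = ["£51.77", "£53.74", "£50.10"]
ratings = ["Three", "One", "One"]

# Each tuple groups the values found at the same position in every list.
books = list(zip(prices, titles, ratings))

for price, title, rating in books:
    print(f"{title}: {price} ({rating})")

# zip stops at the shortest input, so a missing value in one list
# would silently drop books rather than raise an error.
shorter = list(zip(prices, titles[:2], ratings))
print(len(shorter))  # 2
```

That last point is worth keeping in mind: this approach relies on every book having exactly one price, one title, and one rating, which is the case for the page we're scraping.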

We can now process each of the tags to get the values we want. For price, this is easy: we just need to find the string attribute for the tag.

Title is a little harder, but still manageable, because we have an attribute that contains the full title. As we discussed in the short intro to BeautifulSoup, we can treat tags like dictionaries, accessing attribute values like keys.

import requests
from bs4 import BeautifulSoup

price_selector = ".price_color"
title_selector = ".product_pod h3 a"
rating_selector = ".star-rating"

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

prices = soup.select(price_selector)
titles = soup.select(title_selector)
ratings = soup.select(rating_selector)

for price, title, rating in zip(prices, titles, ratings):
    print(f"{title['title']} costs {price.string}")

The final piece of the puzzle is the rating. In order to convert the class name to an actual star rating, I'm going to use a function.

I'm also going to define a dictionary so that we can map a given class name to something more useful. In this case, I'm going to map the class names to strings of "★" characters.

rating_mappings = {
    "One":   "★",
    "Two":   "★ ★",
    "Three": "★ ★ ★",
    "Four":  "★ ★ ★ ★",
    "Five":  "★ ★ ★ ★ ★"
}

def get_rating(tag):
    for term, rating in rating_mappings.items():
        if term in tag["class"]:
            return rating

The function I've defined above takes in a tag, and then uses our rating_mappings dictionary to determine which of our terms is a class name used in this Tag.

We do this by checking if our rating_mappings key is a member of the list associated with the "class" key for this Tag.

If we find a match, we return the star rating string associated with the key in our rating_mappings dictionary.

import requests
from bs4 import BeautifulSoup

def get_rating(tag):
    for term, rating in rating_mappings.items():
        if term in tag["class"]:
            return rating

rating_mappings = {
    "One":   "★",
    "Two":   "★ ★",
    "Three": "★ ★ ★",
    "Four":  "★ ★ ★ ★",
    "Five":  "★ ★ ★ ★ ★"
}

price_selector = ".price_color"
title_selector = ".product_pod h3 a"
rating_selector = ".star-rating"

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

prices = soup.select(price_selector)
titles = soup.select(title_selector)
ratings = soup.select(rating_selector)

for price, title, rating in zip(prices, titles, ratings):
    print(f"{title['title']} costs {price.string} - {get_rating(rating)}")
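
One way to check get_rating on its own, without going to the website, is to pass it something that behaves like a Tag. In this sketch I'm using plain dictionaries as stand-ins, which works only because get_rating never does anything with the tag except look up "class". The "Zero" class is made up to show what happens when nothing matches.

```python
rating_mappings = {
    "One":   "★",
    "Two":   "★ ★",
    "Three": "★ ★ ★",
    "Four":  "★ ★ ★ ★",
    "Five":  "★ ★ ★ ★ ★"
}

def get_rating(tag):
    for term, rating in rating_mappings.items():
        if term in tag["class"]:
            return rating

# Plain dictionaries standing in for Tag objects (an assumption for this sketch).
three_stars = {"class": ["star-rating", "Three"]}
unknown = {"class": ["star-rating", "Zero"]}

print(get_rating(three_stars))  # ★ ★ ★
print(get_rating(unknown))      # None
```

If none of the keys match, the loop finishes without returning anything, so we get None back. That doesn't happen on this page, but it's good to know how the function behaves.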

Once we've verified everything is working, we can modify our for loop so that we're writing to a file instead. In this case, because I'm using a special ★ character, we need to explicitly specify an encoding for our file.

import requests
from bs4 import BeautifulSoup

def get_rating(tag):
    for term, rating in rating_mappings.items():
        if term in tag["class"]:
            return rating

rating_mappings = {
    "One":   "★",
    "Two":   "★ ★",
    "Three": "★ ★ ★",
    "Four":  "★ ★ ★ ★",
    "Five":  "★ ★ ★ ★ ★"
}

price_selector = ".price_color"
title_selector = ".product_pod h3 a"
rating_selector = ".star-rating"

data = requests.get("http://books.toscrape.com/").content

soup = BeautifulSoup(data, "html.parser")

prices = soup.select(price_selector)
titles = soup.select(title_selector)
ratings = soup.select(rating_selector)

with open("books.csv", "w", encoding="utf-8") as book_file:
    for price, title, rating in zip(prices, titles, ratings):
        book_file.write(f"{title['title']},{price.string},{get_rating(rating)}\n")

With that, we're done!

If we look at our books.csv file, we now have something like this inside:

A Light in the Attic,£51.77,★ ★ ★
Tipping the Velvet,£53.74,★
Soumission,£50.10,★
Sharp Objects,£47.82,★ ★ ★ ★
Sapiens: A Brief History of Humankind,£54.23,★ ★ ★ ★ ★
The Requiem Red,£22.65,★
The Dirty Little Secrets of Getting Your Dream Job,£33.34,★ ★ ★ ★
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull,£17.93,★ ★ ★
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics,£22.60,★ ★ ★ ★
The Black Maria,£52.15,★
Starving Hearts (Triangular Trade Trilogy, #1),£13.99,★ ★
Shakespeare's Sonnets,£20.66,★ ★ ★ ★
Set Me Free,£17.46,★ ★ ★ ★ ★
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1),£52.29,★ ★ ★ ★ ★
Rip it Up and Start Again,£35.02,★ ★ ★ ★ ★
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991,£57.25,★ ★ ★
Olio,£23.88,★
Mesaerion: The Best Science Fiction Stories 1800-1849,£37.59,★
Libertarianism for Beginners,£51.33,★ ★
It's Only the Himalayas,£45.17,★ ★
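
One thing you might notice is that a few of the titles contain commas, so a program reading this file back would see more than three columns on those lines. A possible refinement, which isn't part of the solution above, is to let the csv module from the standard library do the writing, since it puts quotes around any value that contains a comma. The sketch below writes to an in-memory buffer so that it runs without creating a file. A file opened with open("books.csv", "w", encoding="utf-8", newline="") can be passed to csv.writer in the same way.

```python
import csv
import io

rows = [
    ("A Light in the Attic", "£51.77", "★ ★ ★"),
    ("Starving Hearts (Triangular Trade Trilogy, #1)", "£13.99", "★ ★"),
]

buffer = io.StringIO()  # stand-in for the open file
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives three columns per row,
# even for the title with a comma in it.
parsed = list(csv.reader(io.StringIO(csv_text)))
for row in parsed:
    print(len(row), row[0])
```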

If you want to extend this project a little bit, try to scrape multiple pages. There are 50 pages of books on the site, and the pagination has a consistent URL format, so you can scrape the pages one at a time if you want to.
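
As a starting point, here is a sketch of how you might build the list of page URLs. The catalogue/page-N.html pattern is an assumption based on what the site's "next" links looked like to me, so do check it in your browser before relying on it. The sketch only builds the URLs and doesn't request anything.

```python
# Assumed URL pattern for the paginated catalogue; verify it against the site.
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

# range(1, 51) gives us the page numbers 1 through 50.
page_urls = [base_url.format(page_number) for page_number in range(1, 51)]

print(page_urls[0])
print(page_urls[-1])
print(len(page_urls))
```

From here, each URL could be passed to requests.get in a loop, with the parsing code from the solution run on each response.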

You might also want to try to scrape some of the details pages for the books to get more information for a specific book.

Working on your own