Welcome to the day 27 project in the 30 Days of Python series! In this project we're going to be doing a bit of data analysis using a third party library called pandas.

pandas is a data analysis library designed to make working with relational data a lot easier.

There's quite a bit we need to cover before we can start working with pandas, so let's dive right in!

We've got a video walkthrough of this project below. But we'd recommend reading through the pandas primer before watching the video!

A quick primer on pandas

Installation

pandas is not part of the Python standard library, so we do have to install it before we can use it. If you're working with PyCharm, you can follow the instructions for installing packages from today's post.

If you're working in a different editor, you're probably going to want to install pandas using pip. There are instructions on how to do this in the pandas documentation.

The Series and DataFrame types

There are two basic types that we need to get familiar with in order to work with pandas.

The first of these is the DataFrame type, which is used to store tabular data. It consists of rows and columns which can be populated with data.

We can create a DataFrame in several ways, one of which is using a dictionary. In this case the dictionary keys become the headings for the columns, and the values associated with those keys populate the cells of that column.

import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}

df = pd.DataFrame(movies)

If we print df, we actually get some really nice tabular output.

                                               title           director  year
0                                          Inception  Christopher Nolan  2010
1  Pirates of the Caribbean: The Curse of the Bla...     Gore Verbinski  2003

The first column contains an index, and the other columns contain the data we specified in our movies dictionary.

If we want to see some subset of this DataFrame, we can print df.head(), passing in the number of rows we want to see (if we don't pass anything, we get the first five rows). This is going to be very useful for this project, because the data set we're using has over 18,000 rows!
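For example, using the DataFrame we just created, we can limit the output to a single row like this:

```python
import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}
df = pd.DataFrame(movies)

# head returns a new DataFrame containing just the first n rows;
# with no argument, it gives us the first five rows
print(df.head(1))
```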

Our DataFrame is actually composed of several other structures called Series. Each column in our DataFrame is a Series, which just consists of the values in a given column, along with an associated index for each value.
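We can see this for ourselves by selecting a single column from the DataFrame above:

```python
import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}
df = pd.DataFrame(movies)

# Selecting a single column gives us a Series, which keeps the
# row index alongside the values
years = df["year"]
print(type(years))  # <class 'pandas.core.series.Series'>
```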

Style note

There's a common convention when importing pandas to use the following syntax:

import pandas as pd

I'd advise you to follow this convention. Don't import pandas without the alias, or with an alias of your own creation, as it's going to look very strange to people who use pandas regularly.

Modifying a DataFrame

Often we'll want to make modifications to a DataFrame, such as removing unnecessary columns, or renaming headers. We'll have to do both of these things in this project.

First, let's talk about renaming headings. We can do this by calling the rename method on a DataFrame, passing in a dictionary. The keys of this dictionary represent the names we want to replace, and the values associated with those keys show what we want to replace the existing names with.

import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}

df = pd.DataFrame(movies).rename(columns={"year": "release_year"})

Like many of the methods in pandas, the rename method can be called with a keyword argument for the inplace parameter. This tells pandas whether we want to get back a modified copy of the DataFrame, or whether we want to modify the existing DataFrame in place.

We can therefore do this:

...

df = pd.DataFrame(movies)
df.rename(columns={"year": "release_year"}, inplace=True)

Now let's talk about throwing away columns. There are actually a lot of different ways we can do this, but I'm only going to discuss two here for the sake of brevity. The first option we're going to be using is drop.

drop is great if we just want to remove one or two columns from our DataFrame.

Let's use drop to get rid of the director column from the DataFrame we've been working with.

...

df = pd.DataFrame(movies)

df.rename(columns={"year": "release_year"}, inplace=True)
df.drop(columns="director", inplace=True)

As we can see, the format is very similar to using rename.

If we instead want to specify a few columns to keep, we can create a new DataFrame from an existing one like this:

...
df = pd.DataFrame(movies)

df.rename(columns={"year": "release_year"}, inplace=True)
df = df[["title", "release_year"]]

Note the use of double square brackets here: we're passing a list of column names. This is really important, because with single brackets pandas will assume you want a Series.
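Here's a quick sketch of the difference, using the movies DataFrame from earlier:

```python
import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}
df = pd.DataFrame(movies)

# Single brackets with a column name give us a Series
titles = df["title"]

# Double brackets (a list of column names) give us a DataFrame
smaller_df = df[["title", "year"]]

print(type(titles).__name__)      # Series
print(type(smaller_df).__name__)  # DataFrame
```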

Note

When using inplace=True, you may run into a message like this in your console:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In these cases, you should call the copy method on the DataFrame you're trying to drop a column from. You can find more information about this here.

Filtering by values

Another important part of working with data is being able to focus in on some subset of the values.

Once again, we have a lot of options for how to do this using pandas. In this post I'm just going to talk about the query method, since it's one of the easiest options to understand, and it's very efficient for larger data sets.

Using query is fairly straightforward: we just need to provide a string which contains some condition to filter our DataFrame with. The condition uses normal Python logic, so we don't have to learn anything new there.

For example, let's say our DataFrame from before has now been populated with many movies, and I want to get all the movies from 2020. We can do something like this:

...

latest_movies = df.query("year == 2020")

Note that inside the query string, the column headings are not wrapped in quotation marks, but string data does need to be wrapped in an inner set of quotation marks.

For example, if we just wanted films by Christopher Nolan, we would have to do this:

...

nolan_movies = df.query("director == 'Christopher Nolan'")

If we want to make further modifications to these DataFrame objects, we should make a copy, otherwise pandas will complain. See the note from the previous section.

Note

If we want to refer to a variable name inside of a query string, we have to use a special piece of syntax so that pandas doesn't think we're referring to a column heading or an index.

The way we specify that something is a variable is by putting an @ before the variable name.
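For example, here's the year query from before, rewritten to use a variable:

```python
import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}
df = pd.DataFrame(movies)

# The @ tells pandas that target_year refers to a Python variable,
# not a column heading or an index
target_year = 2010
matches = df.query("year == @target_year")
print(matches)
```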

Where to learn more

We've covered a very small amount of what pandas has to offer here, and I would strongly recommend you go through their excellent getting started tutorials to learn more.

The brief

For this project we're going to be working with a data set on Kaggle, which is an awesome site if you're interested in doing data science with Python. It contains tonnes of real data sets for you to use, absolutely free.

The data we're going to be working with today can be found here.

It contains thousands of records of avocado prices across several years in different regions of the US.

For this project I want to know a few different things about the data in this data set:

  1. I want to know which region had the lowest average price for conventionally grown avocados each year, and I want to know the same information for organic avocados.
  2. I want to know which region had the highest average price for both types of avocado for each given year.
  3. I want to know the lowest all time price for both conventionally grown and organic avocados, and I want to know the highest price as well.

pandas has some built in tools for calculating averages, so you may want to look in the documentation to see how to do that. There are also methods for finding the minimum and maximum value for a Series.
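As a pointer in the right direction, here's what those Series methods look like on a few made up prices (the values below are purely illustrative):

```python
import pandas as pd

# Hypothetical prices, just to demonstrate the Series methods
prices = pd.Series([1.25, 0.75, 1.00, 1.50])

print(prices.mean())  # 1.125
print(prices.min())   # 0.75
print(prices.max())   # 1.5
```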

As a final note, the source data has many fields we don't need. Consider trimming the data down to just the region, the year, the type of avocado, and the price.

Good luck!

Our solution

First things first, we need to actually download our source data and place the CSV document in our project workspace. I'm going to be leaving my avocado.csv file in the root folder alongside my app.py.

Now that we have the data, we need to import pandas and read the data from the file.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	pass

There are several ways we can get the data out of the file and into a DataFrame. We could parse the data manually, or we could use the csv module to do the work for us, for example. However, this is such a common operation that pandas has some built in tools to create a DataFrame directly from a CSV document.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices)

Here the read_csv function is going to parse the data in avocado_prices and turn it into a DataFrame for us. There are plenty of other read_ functions available in pandas for working with other formats too. The tutorial talks about this in more depth.
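Incidentally, read_csv is also happy to accept a file path directly, in which case it takes care of opening and closing the file for us. Here's a small sketch, using a made up sample file so the example stands on its own:

```python
import pandas as pd

# Write a tiny sample file so this example is self-contained
with open("sample.csv", "w") as f:
	f.write("year,region,AveragePrice,type\n")
	f.write("2015,Albany,1.22,conventional\n")

# Passing a path instead of an open file object works just as well
df = pd.read_csv("sample.csv")
print(df)
```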

Now that we have our data, the next step is going to be trimming off all the fields we don't need. We only have four that we want to keep in this case, so we're not going to use drop here.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices)

df = df[["year", "region", "AveragePrice", "type"]]

One problem here is that the headings are not very nice. They don't follow the Python naming conventions we're used to, so I'm going to rename AveragePrice to price when creating the initial DataFrame.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices).rename(columns={"AveragePrice": "price"})

df = df[["year", "region", "price", "type"]]

Next I'm going to split df into two smaller DataFrame objects: one for conventionally grown avocados, and one for organic avocados.

The reason I'm doing this is because all of the tasks we've been set treat organic and conventional avocados differently, so we'd constantly be filtering one set out while working with the data.

In order to get the conventional avocados, I'm going to use the query method. Our query in this case is quite simple, we want all of the rows where the type of the avocado has the value "conventional".

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices).rename(columns={"AveragePrice": "price"})

df = df[["year", "region", "price", "type"]]

conventional = df.query("type == 'conventional'")

At this point we no longer really need the type column for this smaller DataFrame, since all the values are the same. We can drop the column all on the same line, or we can do an in place drop instead.

If we do an in place drop, we have to remember to call copy on the line where we assign to conventional.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices).rename(columns={"AveragePrice": "price"})

df = df[["year", "region", "price", "type"]]

conventional = df.query("type == 'conventional'").copy()
conventional.drop(columns="type", inplace=True)

Now we can do the same for our organic avocados.

import pandas as pd

with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices).rename(columns={"AveragePrice": "price"})

df = df[["year", "region", "price", "type"]]

conventional = df.query("type == 'conventional'").copy()
conventional.drop(columns="type", inplace=True)

organic = df.query("type == 'organic'").copy()
organic.drop(columns="type", inplace=True)

Now we have to figure out a way to calculate and store the average prices for each year in each region. Since this is likely to be a complex process, and we have to do it multiple times, it's probably best that we define some functions.

We're also going to need to know the years covered by our data set. We can write this manually, or we can grab this information from our original df by looking for unique values in the year column.

We can do that by calling the unique method on the year column:

years = df.year.unique()
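For our two-film movies DataFrame from the primer, that would look like this:

```python
import pandas as pd

movies = {
	"title": ("Inception", "Pirates of the Caribbean: The Curse of the Black Pearl"),
	"director": ("Christopher Nolan", "Gore Verbinski"),
	"year": (2010, 2003)
}
df = pd.DataFrame(movies)

# unique returns a NumPy array of the distinct values, in order
# of first appearance
years = df.year.unique()
print(years)  # [2010 2003]
```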

In order to easily store the averages, I'm going to be creating a dictionary. The keys of this dictionary are going to be the different years in our data set, and the values associated with those keys will be DataFrame objects that contain the data for that specific year.

These DataFrame objects are going to contain a single row for each region, with the price for that region reflecting the average price for the whole year.
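The heavy lifting for those averages is done by groupby, which is worth seeing on its own with some made up numbers before we use it for real:

```python
import pandas as pd

# Hypothetical prices for two regions, two observations each
prices = pd.DataFrame({
	"region": ["Albany", "Albany", "Boston", "Boston"],
	"price": [1.00, 1.50, 0.75, 1.25]
})

# groupby gathers the rows for each region, and mean averages
# the numeric columns within each group; the regions become
# the index of the resulting DataFrame
averages = prices.groupby("region").mean()
print(averages)
```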

First things first, let's define a function that can grab us just the values for a given year, returning a new DataFrame for that year.

def filter_by_year(df, year):
	return df.query("year == @year").drop(columns="year")

Next let's define a function which arranges yearly averages for each region into a dictionary.

def get_average_by_year(df):
	averages = {}
	years = df.year.unique()

	for year in years:
		averages_for_year = filter_by_year(df, year).groupby("region").mean()
		averages.update({year: averages_for_year})

	return averages

We can now plug these into our existing code like so:

import pandas as pd


def get_average_by_year(df):
	averages = {}
	years = df.year.unique()

	for year in years:
		averages_for_year = filter_by_year(df, year).groupby("region").mean()
		averages.update({year: averages_for_year})

	return averages


def filter_by_year(df, year):
	return df.query("year == @year").drop(columns="year")


with open("avocado.csv", "r") as avocado_prices:
	df = pd.read_csv(avocado_prices).rename(columns={"AveragePrice": "price"})

df = df[["year", "region", "price", "type"]]

conventional = df.query("type == 'conventional'").copy()
conventional.drop(columns="type", inplace=True)

organic = df.query("type == 'organic'").copy()
organic.drop(columns="type", inplace=True)

conventional_averages = get_average_by_year(conventional)
organic_averages = get_average_by_year(organic)

Now that we have all our data organised how we want it, we just need a few for loops to output all the information we were asked for.

I would recommend writing a function or two for this, since we need to repeat the code a fair number of times, but the basic loop structure here is going to be something like this:

...

for year, data in conventional_averages.items():
	highest_value = data.price.max()
	location = data.query("price == @highest_value").index[0]
	print(f"Highest price for conventional avocados in {year} was ${highest_value:.2f} in {location}.")

print()

for year, data in conventional_averages.items():
	lowest_value = data.price.min()
	location = data.query("price == @lowest_value").index[0]
	print(f"Lowest price for conventional avocados in {year} was ${lowest_value:.2f} in {location}.")

print()

Everything here we've seen before, but we do need to mention this .index[0]. When we performed our groupby operation by region, the regions were turned into the row indices for the new DataFrame. We therefore have to use the index attribute to get the region. This gives us a collection of index labels, and we're just grabbing the first one in this case.
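As an aside, Series also have idxmax and idxmin methods that return the index label of the maximum or minimum value directly, which would let us skip the query step. Here's a sketch with made up grouped data:

```python
import pandas as pd

# A hypothetical grouped result: regions as the index, average
# prices as the values, the same shape as our groupby output
data = pd.DataFrame(
	{"price": [1.12, 1.45, 0.98]},
	index=["Albany", "Boston", "Chicago"]
)

# idxmax gives us the index label of the maximum value directly
location = data.price.idxmax()
print(location)  # Boston
```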

The final thing we need to do is calculate the overall lowest and highest values for both types of avocados.

highest_conventional = conventional.price.max()
print(f"The highest price for conventional avocados was ${highest_conventional:.2f}")

lowest_conventional = conventional.price.min()
print(f"The lowest price for conventional avocados was ${lowest_conventional:.2f}")

highest_organic = organic.price.max()
print(f"The highest price for organic avocados was ${highest_organic:.2f}")

lowest_organic = organic.price.min()
print(f"The lowest price for organic avocados was ${lowest_organic:.2f}")

Make sure to have a go at refactoring the loops for outputting the data. You may also want to think about separating the different parts of the program into different files.

If you want to share your solution, come chat with us over on our Discord server.