Practicing plotting with `pandas`

In this suite of exercises, we will practice creating various types of plots using the pandas library in Python.

As usual, we will root our work on actual data and perform a realistic (albeit minimal) scientifc analysis.

Deforestation Data

We use the Our World in Data deforestation dataset.

It contains data on forest area and deforestation data for various countries over time.

Its link is

https://ourworldindata.org/grapher/annual-deforestation.csv?v=1&csvType=full&useColumnShortNames=false

Exercise 1: Loading and pivoting the data

Read in the dataset using pandas.

Then, pivot the data so that you have years as the index, countries as the columns, and deforestation as the values.

Exercise 2 : Plotting deforestation for regions

The datasets contains rates in ha/year (hectares per year) for many different countries, as well as the various regions including the world as a whole.

Find the names of such regions in the dataset and create a plot with multiple lines, one for each region, showing deforestation over time. Include also the world as a whole.

Plot requirements:

The plot should have appropriate labels and a legend.
The lines should also have markers.
The lines should be distinguishable by color and line style.
The line for the World should be thicker than the others.
The deforestation values should be shown in millions of hectares per year (i.e., divide the values by 1,000,000).

Notes. - If you look carefully at the original dataset, you will notice that you have a Code column that contains codes for countries and regions. - You can provide a list of styles to the style argument of the plot method to have different line styles for each line. - If ax is an axis returned by a plot object, you can access its last plotted line as ax.lines[-1] and sets its properties with methods such as set_linewidth() or set_color().

Exercise 4: Stacked area plot of regional deforestation

Create a stacked area plot showing how deforestation is distributed across continents over time (1990-2020).

Requirements: - Use only the continental regions (Africa, Asia, Europe, North America, Oceania, South America) - Show values in millions of hectares per year - Only include positive values (deforestation, not reforestation) - Add appropriate title and labels

Hint: Use kind='area' and set stacked=True.

Exercise 5: Bar plot of top 10 countries with highest deforestation in 2010

Using the data for 2010, create a bar plot showing the top 10 countries with the highest deforestation rates in absolute terms.

Plot them so that they are sorted (ascending or descending, as you prefer).

You need to exclude regions and only consider countries.

Note. - Boolean operators can be used to filter dataframes. For example, if we want to filter a dataframe to include only rows where the column A doe not contain the value foo, we can do:

filtered_df = df[~(df['A'] == 'foo')]

where the ~ operator negates the boolean condition. - You can use barh as the kind of plot to create horizontal bar plots.

Exercise 6: Merging datasets and normalisation by area

The comparison above is in part unfair because countries have different sizes (and hence extent of forested surfaces).

As a first approximation, we can normalise deforestation rates by the total area of each country.

To retrieve the total area data for countries, you’ll need to load a CSV file containing country areas. We can use the Our World in Data dataset on land area.

Here’s the link to the dataset:

https://ourworldindata.org/grapher/land-area-km.csv

We can read this second dataset into a separate dataframe.

A powerful feature of pandas is the ability to merge dataframes based on common columns.

This is done using the pd.merge function as follows:

merged_df = pd.merge(df1, df2, on=list_of_common_columns)

Using merging, combine the deforestation data for 2010 with the land area data.

Then, create a new column in the merged dataframe that contains the deforestation rate per square kilometer for each country.

Finally, create a bar plot showing the top 10 countries with the highest deforestation rates per square kilometer in 2010.

Notes.

When merging, provide a list of common columns to the on argument. In this case, the common columns are likely to be Entity, Year and Code.

Exercise 7: Bringing vegetable oil data in

We can further extend our analysis by bringing in data on vegetable oil production, which is a significant driver of deforestation in some regions.

First, we load the vegetable oil production dataset from Our World in Data, which contains production data for various types of oils (palm oil, soybean oil, sunflower oil, etc.).

Since deforestation is driven by various oil crops (not just palm oil), sum all vegetable oil production types to get a total for each country and year and store it as a new column. This provides a more comprehensive view of agricultural oil pressure.

Exercise 8: Pivot by country the vegetable oil data

To get a by country view of vegetable oil production, pivot the vegetable oil dataset so that you have years as the index, countries as the columns, and total vegetable oil production as the values.

Exercise 9: Select countries for comparison

Filter to include only countries that appear in both the top 10 deforestation list (from 2010, Exercise 5) AND have oil production data.

Exercise 10: Plot vegetable oil production over time for selected countries

We normalize all oil production values relative to 1990 (baseline = 100) to make it easier to compare growth rates across countries of different sizes.

Produce two plots:

first the normalised vegetable oil production over time for the selected countries
then a smoothened version of the same plot (using a rolling mean with a window of 10 years).

Exercise 11: Comparing deforestation and oil production over time

For each of the following countries (Brazil, the Democratic Republic of Congo, India and Indonesia) plot a figure with the two lines:

the normalised vegetable oil production over time (use the smoothed version from Exercise 10)
the deforestation rate over time (also normalised to 1990)

Both lines should be on the same plot with the same y-axis scale (since both use 1990 = 100 as baseline).

Note: You may observe inverse correlations in some countries - this is expected and scientifically meaningful (see interpretation below).

Hints.

You may need to drop NaN values with the dropna() method to get continuous lines
You need first to normalise your deforestation data
You may want to create a new dataframe using pd.DataFrame and a suitable dictionary

Interpretation of Results

You should observe different trends depending on the countries of consideration that can reflect different modes of production, deforestation and shifts in policies:

Some countries show correlations between deforestation rates and vegetable oil production: this is typically when vegetable oil production requires direct area increases and deforestation is diffuse (small-scale clearing of land)
Other countries can show inverse correlation: this can have various causes, for example time-lags between extensive deforestation and usage of the land, as well as different means of expanding vegetable oil production (conversion of already cleared land, e.g. pasture, into crops)

This exercise demonstrates that macroscale data analysis can be rather complex and requires multiple layers of information and datasets to be properly interpreted. In this specific case, we would need more information on the modes of production, land use changes, policies and economic drivers to make sense of the observed trends.

--- title: Practicing plotting with `pandas` jupyter: python3 --- In this suite of exercises, we will practice creating various types of plots using the `pandas` library in Python. As usual, we will root our work on `actual data` and perform a realistic (albeit minimal) scientifc analysis. ## Deforestation Data We use the [Our World in Data deforestation dataset](https://ourworldindata.org/forests-and-deforestation). It contains data on forest area and deforestation data for various countries over time. Its link is [https://ourworldindata.org/grapher/annual-deforestation.csv?v=1&csvType=full&useColumnShortNames=false](https://ourworldindata.org/grapher/annual-deforestation.csv?v=1&csvType=full&useColumnShortNames=false) ## Exercise 1: Loading and pivoting the data Read in the dataset using `pandas`. Then, pivot the data so that you have years as the index, countries as the columns, and deforestation as the values. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION import pandas as pd import requests # Fetch the data. df = pd.read_csv("https://ourworldindata.org/grapher/annual-deforestation.csv?v=1&csvType=full&useColumnShortNames=false") df ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION df_pivot = df.pivot(index="Year", columns="Entity", values="Deforestation") df_pivot ``` ## Exercise 2 : Plotting deforestation for regions The datasets contains rates in ha/year (hectares per year) for many different countries, as well as the various regions including the world as a whole. Find the names of such regions in the dataset and create a plot with multiple lines, one for each region, showing deforestation over time. Include also the world as a whole. **Plot requirements:** - The plot should have appropriate labels and a legend. - The lines should also have markers. - The lines should be distinguishable by color and line style. - The line for the World should be thicker than the others. - The deforestation values should be shown in millions of hectares per year (i.e., divide the values by 1,000,000). **Notes.** - If you look carefully at the original dataset, you will notice that you have a `Code` column that contains codes for countries and regions. - You can provide a list of styles to the style argument of the `plot` method to have different line styles for each line. - If `ax` is an axis returned by a plot object, you can access its last plotted line as `ax.lines[-1]` and sets its properties with methods such as `set_linewidth()` or `set_color()`. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # the unique codes are # clearly only the WORLD has its own code OWID_WRL # hence the regions have NaN df['Code'].unique() ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Find entities without country codes - these are typically regions regions = df[df['Code'].isna()]['Entity'].unique() print("Regions (non-country entities):") print(regions) # add the world to the list all_regions = list(regions) + ['World'] ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION markers = ['o', 's', '^', 'v', 'D', 'p', '*', 'h', 'x', '+', 'd', 'H'] ax = (df_pivot[all_regions] / 1_000_000).plot(figsize=(12, 6), title="Annual Deforestation by Region", ylabel="Deforestation (Mha/year)", xlabel="Year", style=[f'-{m}' for m in markers], legend=True) # Make the World line thicker ax.lines[-1].set_linewidth(4) ``` ## Exercise 3 : Share of deforestation by continent in 2020 Pick now year 2020 and create a pie chart showing the share of deforestation for that year between the various continents (i.e., Africa, Asia, Europe, North America, Oceania, South America). The share is the fraction of the total deforestation in the world that is due to each continent. **Notes.** - You can improve the rendering of the pie chart by adding formatted percentage labels with the `autopct` to plot using the `pie` kind. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" #SOLUTION continents = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America'] year = 2020 shares = df_pivot.loc[year, continents] / df_pivot.loc[year, 'World'] * 100 shares.plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8), title=f"Deforestation Share by Continent in {year}", ylabel='') ``` ## Exercise 4: Stacked area plot of regional deforestation Create a stacked area plot showing how deforestation is distributed across continents over time (1990-2020). **Requirements:** - Use only the continental regions (Africa, Asia, Europe, North America, Oceania, South America) - Show values in millions of hectares per year - Only include positive values (deforestation, not reforestation) - Add appropriate title and labels **Hint:** Use `kind='area'` and set `stacked=True`. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Filter for continental regions and positive deforestation values df_continents = df_pivot[continents].loc[1990:2020] df_continents = df_continents[df_continents > 0] # Convert to millions of hectares df_continents_mha = df_continents / 1_000_000 # Create stacked area plot df_continents_mha.plot(kind='area', stacked=True, figsize=(12, 6), title='Deforestation by Continent (1990-2020)', xlabel='Year', ylabel='Deforestation (Mha/year)') ``` ## Exercise 5: Bar plot of top 10 countries with highest deforestation in 2010 Using the data for 2010, create a **bar plot** showing the top 10 countries with the highest deforestation rates in absolute terms. Plot them so that they are sorted (ascending or descending, as you prefer). You need to **exclude regions** and only consider countries. **Note.** - Boolean operators can be used to filter dataframes. For example, if we want to filter a dataframe to include only rows where the column `A` doe not contain the value `foo`, we can do: ```python filtered_df = df[~(df['A'] == 'foo')] ``` where the `~` operator negates the boolean condition. - You can use `barh` as the kind of plot to create horizontal bar plots. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION year = 2010 df_no_regions = df[~df['Entity'].isin(all_regions)] top_10 = df_no_regions.loc[df['Year'] == year].sort_values(by='Deforestation', ascending=False).head(10)[['Entity', 'Deforestation']] top_10 ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION top_10.plot(kind='barh', x='Entity', y='Deforestation', figsize=(10, 6), title=f"Top 10 Countries by Deforestation in {year}", xlabel="Deforestation (ha/year)", ylabel="Country", legend=False) ``` ## Exercise 6: Merging datasets and normalisation by area The comparison above is in part unfair because countries have different sizes (and hence extent of forested surfaces). As a first approximation, we can normalise deforestation rates by the total area of each country. To retrieve the total area data for countries, you'll need to load a CSV file containing country areas. We can use the Our World in Data dataset on land area. Here's the link to the dataset: [https://ourworldindata.org/grapher/land-area-km.csv](https://ourworldindata.org/grapher/land-area-km.csv) We can read this second dataset into a separate dataframe. A powerful feature of `pandas` is the ability to merge dataframes based on common columns. This is done using the `pd.merge` function as follows: ```python merged_df = pd.merge(df1, df2, on=list_of_common_columns) ``` Using merging, combine the deforestation data for 2010 with the land area data. Then, create a new column in the merged dataframe that contains the deforestation rate per square kilometer for each country. Finally, create a bar plot showing the top 10 countries with the highest deforestation rates per square kilometer in 2010. **Notes.** - When merging, provide a list of common columns to the `on` argument. In this case, the common columns are likely to be `Entity`, `Year` and `Code`. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION df_areas = pd.read_csv("https://ourworldindata.org/grapher/land-area-km.csv") df_areas ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION df_merged = pd.merge(df, df_areas, on=['Entity',"Year","Code"]) df_merged ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION df_merged['Fraction of deforested surface'] = df_merged['Deforestation'] / (df_merged['Land area (sq. km)'] * 100) ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION df_no_regions = df_merged[~df_merged['Entity'].isin(all_regions)] top_10_def_area = df_no_regions.loc[df_merged['Year'] == year].sort_values(by='Fraction of deforested surface', ascending=False).head(10)[['Entity', 'Fraction of deforested surface']] top_10_def_area ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION top_10_def_area.plot(kind='barh', x='Entity', y='Fraction of deforested surface', figsize=(10, 6), title=f"Top 10 Countries by Deforestation in {year}", xlabel='Fraction of deforested surface', ylabel="Country", legend=False) ``` ## Exercise 7: Bringing vegetable oil data in We can further extend our analysis by bringing in data on vegetable oil production, which is a significant driver of deforestation in some regions. First, we load the vegetable oil production dataset from Our World in Data, which contains production data for various types of oils (palm oil, soybean oil, sunflower oil, etc.). ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION import pandas as pd # Fetch the data. df_oil = pd.read_csv("https://ourworldindata.org/grapher/vegetable-oil-production.csv?v=1&csvType=full&useColumnShortNames=false") # df_oil = df_oil.dropna(subset=['Palm oil | 00000257 || Production | 005510 || tonnes']) df_oil ``` Since deforestation is driven by various oil crops (not just palm oil), **sum all vegetable oil production types to get a total for each country and year** and store it as a new column. This provides a more comprehensive view of agricultural oil pressure. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # sum all vegetable oils oil_columns = [col for col in df_oil.columns if 'Production' in col and 'tonnes' in col] print("Available oils:", oil_columns) # Sum all oil production for each country-year df_oil['Total_Oil_Production'] = df_oil[oil_columns].sum(axis=1) # df_oil['Total vegetable oil production'] = ``` ## Exercise 8: Pivot by country the vegetable oil data To get a by country view of vegetable oil production, pivot the vegetable oil dataset so that you have years as the index, countries as the columns, and total vegetable oil production as the values. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Pivot by country df_total_oil = df_oil.pivot(index="Year", columns="Entity", values="Total_Oil_Production") df_total_oil ``` ## Exercise 9: Select countries for comparison Filter to include only countries that appear in both the top 10 deforestation list (from 2010, Exercise 5) AND have oil production data. ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # take a selction of countires presnet in the top deforestation and in teh oil data country_selection = top_10['Entity'].tolist() country_selection = [country for country in country_selection if country in df_total_oil.columns] print(country_selection) ``` ## Exercise 10: Plot vegetable oil production over time for selected countries We normalize all oil production values relative to 1990 (baseline = 100) to make it easier to compare growth rates across countries of different sizes. Produce two plots: - first the normalised vegetable oil production over time for the selected countries - then a smoothened version of the same plot (using a rolling mean with a window of 10 years). ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Make values relative to year 1990 (base year = 100) df_total_oil_relative = (df_total_oil / df_total_oil.loc[1990]) * 100 ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Plot relative growth df_total_oil_relative[country_selection].plot(figsize=(12, 6), title='Vegetable Oil Production Growth (Relative to Year 1990)', ylabel='Index (1990 = 100)', xlabel='Year') ``` ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # smoothened df_total_oil_relative_smooth = df_total_oil_relative.rolling(window=10, center=True).mean() # Plot relative growth df_total_oil_relative_smooth[country_selection].plot(figsize=(12, 6), title='Vegetable Oil Production Growth - Smoothed (Relative to Year 1990)', ylabel='Index (1990 = 100)', xlabel='Year') ``` ## Exercise 11: Comparing deforestation and oil production over time For each of the following countries (Brazil, the Democratic Republic of Congo, India and Indonesia) plot a figure with the two lines: - the normalised vegetable oil production over time (use the smoothed version from Exercise 10) - the deforestation rate over time (also normalised to 1990) Both lines should be on the same plot with the same y-axis scale (since both use 1990 = 100 as baseline). **Note:** You may observe inverse correlations in some countries - this is expected and scientifically meaningful (see interpretation below). **Hints.** - You may need to drop `NaN` values with the `dropna()` method to get continuous lines - You need first to normalise your deforestation data - You may want to create a new dataframe using pd.DataFrame and a suitable dictionary ```{pyodide} #| caption: "▶ Ctrl/Cmd+Enter | ⇥ Ctrl/Cmd+] | ⇤ Ctrl/Cmd+[" # SOLUTION # Normalize deforestation to 1990 baseline df_deforestation_relative = (df_pivot / df_pivot.loc[1990]) * 100 countries = 'Brazil, Democratic Republic of Congo, India, Indonesia'.split(', ') # Create plots for the first 4 countries for country in countries: # Combine both series into a single dataframe df_combined = pd.DataFrame({ 'Deforestation Index': df_deforestation_relative[country], 'Oil Production Index': df_total_oil_relative_smooth[country] }) # Drop rows with NaN values so that the lines are continuous df_combined = df_combined.dropna() # Plot using pandas with both lines on the same scale df_combined.plot(figsize=(5, 3), title=f'{country} (1990 = 100)', xlabel='Year', ylabel='Index (1990 = 100)', ) ``` ## Interpretation of Results You should observe different trends depending on the countries of consideration that can reflect different modes of production, deforestation and shifts in policies: - Some countries show correlations between deforestation rates and vegetable oil production: this is typically when vegetable oil production requires direct area increases and deforestation is diffuse (small-scale clearing of land) - Other countries can show inverse correlation: this can have various causes, for example time-lags between extensive deforestation and usage of the land, as well as different means of expanding vegetable oil production (conversion of already cleared land, e.g. pasture, into crops) This exercise demonstrates that macroscale data analysis can be rather complex and requires multiple layers of information and datasets to be properly interpreted. In this specific case, we would need more information on the modes of production, land use changes, policies and economic drivers to make sense of the observed trends.