Pandas Basics
Setup
First, let’s import the necessary libraries:
Exercise 1: Creating DataFrames from dictionaries
Create a pandas DataFrame called elements_df containing information about chemical elements with the following data:
- Element names “Hydrogen”, “Helium”, “Lithium”, “Beryllium”, “Boron”
- Symbols “H”, “He”, “Li”, “Be”, “B”
- Atomic numbers 1, 2, 3, 4, 5
- Atomic masses 1.008, 4.003, 6.941, 9.012, 10.81
The column names should be: “name”, “symbol”, “atomic_number”, “atomic_mass”
Exercise 2: Setting and using indices
Using the elements_df DataFrame from Exercise 1:
Set the “symbol” column as the index
Access the row for “Li” using
.loc[]Extract just the atomic mass for “He”
Exercise 3: Working with real data - Tips Dataset
Let’s load a real dataset from an online source. We’ll use the famous “tips” dataset which contains information about restaurant tips.
Using the tips_df DataFrame:
Display the column names
Display the shape of the DataFrame (number of rows and columns)
Extract the “total_bill” column
Find the data type (dtype) of the “day” column
Display the last 8 rows of the DataFrame
Exercise 4: Statistical operations
Using the tips_df DataFrame:
Calculate the mean of the “total_bill” column
Find the maximum tip amount
Find the minimum tip amount
Calculate the standard deviation of the “tip” column
Find the index (row position) of the maximum total bill
Use
.describe()to get summary statistics for all numeric columns
Exercise 5: Arithmetic operations
Using the tips_df DataFrame:
Create a new column called “tip_percentage” that calculates the tip as a percentage of the total bill (multiply by 100 to get percentage)
Create a new column called “total_cost” that adds the total_bill and tip together
Calculate the mean tip percentage
Exercise 6: Filtering data (single conditions)
Using the tips_df DataFrame:
Filter to show only rows where the total_bill is greater than 30
Filter to show only rows where the day is “Sun”
Filter to show only rows where the tip percentage (from Exercise 5) is greater than 20%
How many rows have a total bill greater than 40? (Hint: use
.shapeorlen())
Exercise 7: Filtering with multiple conditions
Using the tips_df DataFrame:
Filter to show rows where total_bill is greater than 20 and the day is “Sat”
Filter to show rows where tip is greater than 5 and size (party size) is greater than or equal to 4
Filter to show rows where the day is “Sat” or the day is “Sun”
Filter to show rows where total_bill is between 15 and 25 (inclusive)
Exercise 8: Working with another dataset - Iris
Let’s work with the classic Iris dataset, which contains measurements of iris flowers.
Using the iris_df DataFrame:
Set the index to be a range from 1 to the number of rows (so it starts at 1 instead of 0)
Find how many unique species are in the dataset (Hint: use
.unique()or.nunique())Calculate the mean sepal_length for all flowers
Filter to show only the “setosa” species
For the “setosa” species, what is the maximum petal_length?
Create a new column called “sepal_area” that multiplies sepal_length by sepal_width
Which species has the largest average sepal_area? (This requires filtering by each species)
Exercise 9: Series operations
Create two pandas Series and perform operations on them:
Create a Series called
temps_celsiuswith the following data:- Index: [“London”, “Paris”, “Berlin”, “Madrid”, “Rome”]
- Values: [15, 18, 14, 22, 24]
Create a Series called
temps_fahrenheitthat converts the Celsius temperatures to Fahrenheit using the formula: F = (C × 9/5) + 32Create a new Series called
humiditywith the same cities as index and values: [75, 65, 70, 55, 60]Calculate the average temperature across all cities (in Celsius)
Find which city has the highest temperature
Exercise 10: Challenge - Planets Dataset
Let’s work with a dataset about planets in our solar system.
This dataset contains information about discovered exoplanets (planets outside our solar system).
Challenge tasks:
How many planets were discovered in the year 2010?
What is the most common discovery method?
Calculate the average orbital period for all planets (note: there may be missing values)
Filter to show planets discovered by the “Transit” method with an orbital period less than 10 days
Create a new column called “discovery_decade” that shows which decade each planet was discovered (e.g., 2000s, 2010s)
- Hint: You can use integer division:
(year // 10) * 10to get the decade
- Hint: You can use integer division:
What percentage of all planets were discovered using the “Radial Velocity” method?