Pandas Data Preprocessing - Exercises
In this notebook, you’ll practice essential data preprocessing skills with two real-world datasets:
- UK Met Office Weather Data (Yeovilton station) - Time series analysis
- UCI Wine Quality Dataset - Cross-sectional analysis
Through these exercises, you’ll learn:
- Loading data from various sources
- Data exploration and understanding
- Type conversions and datetime handling
- Handling missing values
- Feature engineering and derived variables
- Data aggregation and grouping
- Correlation and statistical analysis
- Data export and documentation
Part 1: Weather Data Analysis (UK Met Office - Yeovilton Station)
The first five exercises focus on analyzing historical weather data from the UK Met Office.
Exercise 1: Load Weather Data
The UK Met Office provides historical weather data for stations across the UK. Your task is to load data from Yeovilton station in Somerset, England. The data file is text-based but contains metadata lines at the top before the actual data begins.
Data Source: https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/yeoviltondata.txt
The Met Office weather data contains monthly measurements from Yeovilton station:
- yyyy: Year (4 digits, e.g., 1964)
- mm: Month (1-12, where 1 = January, 12 = December)
- tmax: Maximum temperature for the month (°C)
- tmin: Minimum temperature for the month (°C)
- af: Air frost days - number of days when the temperature fell below 0°C
- rain: Total monthly rainfall (millimeters)
- sun: Total monthly sunshine duration (hours)
Goal: Load the weather data, skipping metadata, and display the first rows to verify.
Hints:
- pd.read_csv(url, ...): Reads data from a URL or file path
- The option skiprows=[0,1,2,3] skips specific lines (in this case lines 0, 1, 2, and 3)
- The option sep='\s+' uses whitespace as the separator (regex for “one or more spaces/tabs”)
- The option on_bad_lines='skip' skips malformed lines
- .head(): Displays the first 5 rows to confirm the data loaded correctly
- .shape: Returns (rows, columns), showing the DataFrame dimensions
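If you get stuck, here is a minimal sketch following the hints above. The number of metadata lines to skip is an assumption; inspect the raw file and adjust skiprows (and, if needed, the column handling) accordingly.

```python
import pandas as pd

# A minimal loading sketch; the skiprows value is an assumption and may
# need adjusting once you look at the raw text file.
url = ("https://www.metoffice.gov.uk/pub/data/weather/uk/"
       "climate/stationdata/yeoviltondata.txt")
weather = pd.read_csv(url, skiprows=[0, 1, 2, 3], sep=r"\s+",
                      on_bad_lines="skip")

print(weather.shape)    # (rows, columns)
print(weather.head())   # first 5 rows to verify the load
```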
Exercise 2: Clean and Prepare Weather Data
Raw data requires transformation before analysis. The Met Office data uses separate year and month columns, has '---' for missing values, and some numeric columns may be stored as strings.
Goal: Create a clean DataFrame with a datetime index, proper numeric types, and NaN for missing values. We will deal with the NaN values in the next exercise.
Expected columns: yyyy (year), mm (month), tmax (max temp °C), tmin (min temp °C), af (air frost days), rain (mm), sun (hours)
Hints:
- Use .dtypes to check the data types
- df.info(): Shows columns, data types, and non-null counts
- pd.to_datetime({'year': df['yyyy'], 'month': df['mm'], 'day': 1}): Creates a datetime from a dictionary built from multiple columns
- df.set_index('column', inplace=True): Sets a column as the DataFrame index
- df.replace('---', np.nan): Replaces '---' strings with NaN
- pd.to_numeric(series, errors='coerce'): Converts to numeric, replacing failures with NaN
- df.sort_index(inplace=True): Sorts by index to ensure chronological order
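One possible cleaning sequence, assuming the DataFrame from Exercise 1 is called weather and contains the expected columns with plain numbers or '---' entries:

```python
import numpy as np
import pandas as pd

# Mark '---' entries as missing
weather = weather.replace('---', np.nan)

# Convert measurement columns to numeric; unparseable entries become NaN
for col in ['tmax', 'tmin', 'af', 'rain', 'sun']:
    weather[col] = pd.to_numeric(weather[col], errors='coerce')

# Build a datetime index from the year and month columns (day fixed to 1)
weather['date'] = pd.to_datetime({'year': weather['yyyy'],
                                  'month': weather['mm'],
                                  'day': 1})
weather.set_index('date', inplace=True)
weather.sort_index(inplace=True)   # ensure chronological order

weather.info()
```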
Exercise 3: Handle Missing Values in Weather Data
Historical weather records often have gaps due to equipment failures or missing measurements. Your approach should depend on the nature of each measurement.
Goal: Analyze missing value patterns and apply appropriate filling strategies. Check that no missing values remain.
Data characteristics:
- Temperature: Changes gradually (good for interpolation)
- Rainfall: Variable; 0 mm and missing are different things
- Sunshine/frost: Could be 0 (none) or missing (not recorded)
Hints:
- df.isnull().sum(): Counts missing (NaN) values per column
- .interpolate(method='linear'): Estimates missing values from surrounding values (good for gradual changes)
- .fillna(0): Replaces NaN with 0 (use carefully: only when 0 is plausible or to mark a period where data is missing)
- .ffill(): Forward fill: propagates the last valid value forward
- .bfill(): Backward fill: propagates the next valid value backward
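Here is one defensible strategy per column, assuming the cleaned weather DataFrame from Exercise 2. These choices are illustrative; you should justify your own based on the data characteristics above.

```python
# Inspect the missing-value pattern first
print(weather.isnull().sum())

# Temperature changes gradually: interpolate between neighbours
weather['tmax'] = weather['tmax'].interpolate(method='linear')
weather['tmin'] = weather['tmin'].interpolate(method='linear')

# Frost and sunshine: assume missing means "none recorded" (an assumption!)
weather['af'] = weather['af'].fillna(0)
weather['sun'] = weather['sun'].fillna(0)

# Rainfall: carry the last observed value forward (one choice among several)
weather['rain'] = weather['rain'].ffill()

# Verify: expect 0 (leading gaps may still need .bfill())
print(weather.isnull().sum().sum(), "missing values remain")
```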
Reflect: Why is interpolation better for temperature than rainfall?
Exercise 4: Extract Date Components and Create Weather Features
Feature engineering creates new variables to reveal patterns. Extract date components (month, season) and create derived measurements (temperature range, extremes).
Goal: Add columns for
- year
- month name
- season
- temperature range
- mean temp
- hot months (maximum temperature above 25 °C)
Hints:
- df.index.year, df.index.month, df.index.quarter: Extract year, month (1-12), or quarter (1-4)
- df.index.month_name(): Returns the full month name (‘January’, etc.)
- Custom function + .apply(): Create a season mapper: def get_season(month): ... then df['season'] = df['month'].apply(get_season)
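A sketch of these steps, assuming the cleaned weather DataFrame with its datetime index; the new column names (temp_range, mean_temp, hot_month) are illustrative choices, not required names.

```python
def get_season(month):
    """Map a month number (1-12) to a Northern Hemisphere season."""
    if month in (12, 1, 2):
        return 'Winter'
    elif month in (3, 4, 5):
        return 'Spring'
    elif month in (6, 7, 8):
        return 'Summer'
    return 'Autumn'

weather['year'] = weather.index.year
weather['month'] = weather.index.month
weather['month_name'] = weather.index.month_name()
weather['season'] = weather['month'].apply(get_season)
weather['temp_range'] = weather['tmax'] - weather['tmin']
weather['mean_temp'] = (weather['tmax'] + weather['tmin']) / 2
weather['hot_month'] = weather['tmax'] > 25   # boolean flag

print(weather.head())
```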
Exercise 5: Analyze Weather Patterns with GroupBy
The groupby() operation splits data into groups, applies functions (like mean), then combines results.
You can group by month, season, year, or decade to analyze climate patterns.
Goal: Calculate the typical (mean) monthly temperature, rainfall, and sunshine for each month and season.
Hints:
- df.groupby('column')['measurement1'].mean(): Groups and calculates the mean of one column
- df.groupby('column')[['col1', 'col2']].mean(): Groups and calculates means for multiple columns
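For example, assuming the season and month_name columns from Exercise 4:

```python
# Typical (mean) temperature, rainfall, and sunshine per month and per season
monthly = weather.groupby('month_name')[['tmax', 'rain', 'sun']].mean()
seasonal = weather.groupby('season')[['tmax', 'rain', 'sun']].mean()

print(monthly)
print(seasonal)
```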
More advanced question: Can you define decades and analyze trends over decades?
To analyse over decades, we need a new column identifying the 1960s, 1970s, and so on. We can create this by integer division of the year by 10, then multiplying back by 10. For example, for the year 1964: (1964 // 10) * 10 = 1960.
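In code, using the year column from Exercise 4 (the decade column name is an illustrative choice):

```python
# Decade via integer division, as described above: 1964 -> 1960
weather['decade'] = (weather['year'] // 10) * 10

decadal = weather.groupby('decade')[['tmax', 'tmin', 'rain', 'sun']].mean()
print(decadal)   # look for temperature or rainfall trends across decades
```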
Part 2: Wine Quality Data Analysis (UCI Wine Quality Dataset)
The next five exercises focus on analyzing wine quality data with chemical properties.
Exercise 6: Load and Explore Wine Quality Data
Shift from time series to cross-sectional data. This UCI dataset contains physicochemical measurements of Portuguese red wines with quality ratings from experts.
Dataset: 11 chemical properties + 1 quality score (0-10).
Features (chemical properties, units in parentheses):
- fixed acidity (g/L tartaric acid)
- volatile acidity (g/L acetic acid)
- citric acid (g/L)
- residual sugar (g/L)
- chlorides (g/L sodium chloride)
- free sulfur dioxide (mg/L)
- total sulfur dioxide (mg/L)
- density (g/cm³)
- pH (0-14 scale)
- sulphates (g/L potassium sulphate)
- alcohol (% vol)
Target:
- quality (score 0-10, median of expert ratings)
Each row = different wine sample.
Important: File uses semicolons as delimiter, not commas.
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Goal: Load data correctly and look at its shape and summary statistics.
Hints:
- pd.read_csv(url, sep=';'): Specify the semicolon delimiter
- .head(), .tail(): Show the first/last rows
- .shape: Returns (n_rows, n_columns)
- .info(): Summary of columns, types, and non-null counts
- .describe(): Statistics (mean, std, min, max, quartiles) for numeric columns
- .isnull().sum(): Count missing values per column
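A minimal loading and first-look sketch; note the semicolon delimiter:

```python
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(url, sep=';')   # comma delimiter would fail here

print(wine.shape)            # (n_rows, n_columns)
wine.info()                  # column types and non-null counts
print(wine.describe())       # summary statistics
print(wine.isnull().sum())   # missing values per column
```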
Exercise 7: Create Wine Quality Categories and Clean Data
Convert continuous variables into categorical groupings for easier pattern recognition. This is called binning or discretization.
Goal: We will create two categorical columns:
- quality_cat: ‘Poor’ (0-4), ‘Average’ (5-6), ‘Good’ (7-10)
- sugar_level: ‘Dry’ (<9 g/L), ‘Medium-Dry’ (9-18 g/L), ‘Sweet’ (>18 g/L)
We then construct a table showing the counts of wines in each quality category by sugar level. This is called a contingency table, and it can be produced with pd.crosstab().
Hints:
- pd.cut(series, bins=[...], labels=[...]): Bins continuous data into categories
- bins=[0, 4, 6, 10] creates the bins 0-4, 4-6, and 6-10 (right edges inclusive by default)
- .value_counts(): Shows how many items fall in each category
- pd.crosstab(df['cat1'], df['cat2']): Contingency table showing counts for category combinations
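A binning sketch using the edges from the goal above; how you handle values exactly on a boundary (pd.cut is right-inclusive by default) is a choice you should make deliberately.

```python
# Quality: Poor (0-4], Average (4-6], Good (6-10]
wine['quality_cat'] = pd.cut(wine['quality'], bins=[0, 4, 6, 10],
                             labels=['Poor', 'Average', 'Good'])

# Residual sugar: Dry, Medium-Dry, Sweet (open-ended upper bin)
wine['sugar_level'] = pd.cut(wine['residual sugar'],
                             bins=[0, 9, 18, float('inf')],
                             labels=['Dry', 'Medium-Dry', 'Sweet'])

print(wine['quality_cat'].value_counts())
print(pd.crosstab(wine['quality_cat'], wine['sugar_level']))
```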
Exercise 8: Create Derived Features
Combine existing variables to create more informative features.
Goal: Engineer new features by combining chemical measurements:
1. Total acidity: Sum fixed + volatile acidity
2. Acidity ratio: Fixed acidity divided by volatile acidity
3. Balance ratio: Create a sweetness-to-total-acidity ratio to measure wine balance
4. Standardized scores: Calculate z-scores for alcohol to normalize its scale. For a value in a population with a given mean and standard deviation, the z-score is (value - mean) / standard deviation.
Then analyze which derived features correlate most strongly with wine quality.
Hints:
- The df[selected_cols].corr() method can help identify strong relationships between selected columns
- The .describe() method allows you to check that derived features have reasonable values
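A sketch of the four derived features and the correlation check; the new column names are hypothetical choices for illustration.

```python
# Derived features from the chemical measurements
wine['total_acidity'] = wine['fixed acidity'] + wine['volatile acidity']
wine['acidity_ratio'] = wine['fixed acidity'] / wine['volatile acidity']
wine['sweetness_balance'] = wine['residual sugar'] / wine['total_acidity']

# Z-score: (value - mean) / standard deviation
wine['alcohol_z'] = ((wine['alcohol'] - wine['alcohol'].mean())
                     / wine['alcohol'].std())

derived = ['total_acidity', 'acidity_ratio', 'sweetness_balance',
           'alcohol_z', 'quality']
print(wine[derived].describe())   # sanity-check the new values

# Which derived feature correlates most strongly with quality?
print(wine[derived].corr()['quality'].sort_values(ascending=False))
```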
Exercise 9: Wine Discovery Through Filtering
Let’s use filtering to discover interesting wines in the dataset! Think of this as being a wine explorer—you’ll use boolean conditions to find wines that match specific criteria.
Goal: Use pandas filtering to discover
- Hidden gems: Good wines that don’t rely on high alcohol (less than 11%)
- Fresh & balanced: Wines with low volatile acidity (< 0.4) and good citric acid (>0.3)
Print the top 5 results for each query, sorted by quality descending. In particular, print out the following columns: citric acid, total sulfur dioxide, sulphates, alcohol, acidity ratio (from Exercise 8), quality.
Hints:
- df[df['column'] > value] filters rows where the condition is true
- Combine conditions: df[(condition1) & (condition2)] (AND), df[(condition1) | (condition2)] (OR)
- .sort_values('column', ascending=False) sorts by a column
- .head(n) shows the top n results
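Filtering sketches for both queries, taking “good” to mean quality ≥ 7 (matching the ‘Good’ category from Exercise 7) and using the acidity_ratio column from Exercise 8:

```python
show = ['citric acid', 'total sulfur dioxide', 'sulphates',
        'alcohol', 'acidity_ratio', 'quality']

# Hidden gems: good quality without relying on high alcohol
hidden_gems = wine[(wine['quality'] >= 7) & (wine['alcohol'] < 11)]
print(hidden_gems.sort_values('quality', ascending=False)[show].head(5))

# Fresh & balanced: low volatile acidity, good citric acid
fresh_balanced = wine[(wine['volatile acidity'] < 0.4)
                      & (wine['citric acid'] > 0.3)]
print(fresh_balanced.sort_values('quality', ascending=False)[show].head(5))
```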
Exercise 10: Export and Document Your Analysis
(This exercise can only be completed on Noteable, since it writes files to disk.)
Preserve your work by exporting cleaned data and documenting your process. Specifically, export your data in at least two different formats.
Hints:
- .to_csv('file.csv'): Exports to CSV. Use index=False to exclude the index (wine) or index=True to include it (weather)
- .to_excel('file.xlsx'): Exports to Excel (requires openpyxl)
- .to_json('file.json', orient='index'): Exports to JSON (orient='records' or orient='index')
- pd.read_csv('file.csv'): Read the file back to verify the export
- pd.DataFrame(dict): Create documentation tables from dictionaries
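A sketch exporting in two formats and verifying the round trip; the file names are illustrative.

```python
# CSV: drop the row index for wine, keep the datetime index for weather
wine.to_csv('wine_clean.csv', index=False)
weather.to_csv('weather_clean.csv', index=True)

# JSON as a second format: one record per wine
wine.to_json('wine_clean.json', orient='records')

# Read back and check the round trip preserved the data
check = pd.read_csv('wine_clean.csv')
print(check.shape == wine.shape)   # expect True
```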