Numpy vs Pandas: Exercises
We explore a numerical dataset freely avaible under MIT license on Kaggle, a platform for data science competitions and datasets.
The dataset focuses on solar system objects and contains several attributes.
Exercise 1: Extracting numpy arrays
- Select the columns
semimajorAxis,density, andgravityfrom the DataFramedfand convert them into anumpyarray namednp_solar_data. - Then, design a test (e.g. using an
if/elsestatement or anassertstatement) to verify thatnp_solar_datais indeed anumpyarray. Notice that simply printing the type ofnp_solar_datais not a test!
Reminder. An assert statement is a way to test if a condition in your code returns True. If not, the program will raise an AssertionError.
It uses the keyword assert followed by the condition to test and an optional error message.
assert condition, "Error message if condition is False"Hint. If the conversion to numpy array is correct, then the type of np_solar_data should be np.ndarray
Exercise 2: Extracting basic statistics
- Using the
numpyarraynp_solar_datacreated in Exercise 1, compute the mean and standard deviation for each of the three columns:semimajorAxis,density, andgravity. Store the results in two separatenumpyarrays namedmean_valuesandstd_values. - Can you find way to compute these same statistics from the original DataFrame
dfwithout converting it to anumpyarray? Store these results in two separatepandasSeries namedmean_seriesandstd_series. - Inspect the stand deviations in the numpy and pandas result. Can you choose parameters such that the results are identical?
Exercise 3: Understanding Pandas SettingWithCopyWarning
You want to analyze only the rocky planets by creating a subset and adding a calculated column.
Task: - Create a subset rocky_planets from df filtering for isPlanet == True and density > 3 - Add a new column mass_estimate to rocky_planets calculated as density * gravity - Observe any warnings that pandas displays - What does the warning mean? Why did it appear?
Exercise 4: When Views and Copies Actually Break Your Code
The warning is nice, but let’s see actual unexpected behavior! You’ll try to add a column categorizing planets as inner or outer solar system objects, but the code won’t work as expected.
Note. There are two types of chained indexing, both problematic:
- Column-then-row:
df['column'][condition] = value(select column, then filter) - Row-then-column:
df[condition]['column'] = value(filter rows, then select column)
Both create intermediate objects (potential copies) before assignment, so modifications may be lost!
Reload the DataFrame:
df = pd.read_csv("kaggle_solar/sol_data.csv")Filter for planets:
planets = df[df['isPlanet'] == True].copy()Initialize a column:
planets['outer_planet'] = FalseTry row-then-column chained indexing:
planets[planets['semimajorAxis'] > 5]['outer_planet'] = TrueNote: Objects with semimajorAxis > 5 AU are beyond Jupiter’s orbit (outer solar system).Check if it worked, then use
.loc[]to do it correctly
Exercise 5: extracting data and processing it with numpy and matpltolib
We now want to extract data and process it with numpy functions.
Using
numpyorpandasextract the escape velocity and the mass in kg for all the objects and plot them in a scatter plot usingmatplotlib.Consider that th escape velocity can be calculated as follows:
\[ v_{e s c}=\sqrt{2 g R} \]
where: - \(g\) is the surface gravity - \(R\) is the radius of the object in
- Improve your plotting to illustrate this relationship explicitly (e.g. by changing the scales, plotting guides to the eye, etc.). Note that the radius is in km, the escape veclocity is in m/s, and the gravity is in m/s².
Exercise 6: String Operations - Pandas vs NumPy
Extract planet names and convert them to uppercase using both pandas and numpy approaches.
Task: - Filter the DataFrame to get only planets (where isPlanet == True) - Extract the planet names (column eName) using pandas methods and convert them to uppercase (UPPERCASE). Store this in a variable named pandas_upper_names - Now try to do the same using numpy: convert the eName column to a numpy array and apply string operations:
- you can use a for loop to iterate through the names and convert them to uppercase (e.g. via list comprehension)
- or use `numpy.char` sub module for vectorized string operations
- Which approach is easier for string operations?
Exercise 7: Grouping and Aggregation - Pandas vs NumPy
Calculate the mean density for planets vs non-planets using pandas groupby, then try to replicate this in numpy.
Task: - Use pandas groupby() to compute the mean density grouped by isPlanet (True/False) - Now try to replicate this result using only numpy arrays: - Convert the relevant columns to numpy arrays - Manually separate the data into two groups (planets and non-planets) - Calculate the mean for each group
- Which approach is more convenient? What happens if you need to group by multiple columns?
Exercise 8: Correlation Matrix - Pandas vs NumPy
Calculate the correlation matrix for physical properties using pandas and numpy.
Background:
The covariance between two variables \(X\) and \(Y\) measures how they vary together: \[\text{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]
The correlation coefficient normalizes covariance by the standard deviations: \[r_{XY} = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}\] where \(\sigma_X = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}\) is the standard deviation.
Correlation ranges from -1 (perfect negative) to +1 (perfect positive), with 0 meaning no linear relationship.
Standardization: Transform each variable to have mean = 0 and standard deviation = 1: \[z_i = \frac{x_i - \bar{x}}{\sigma_x}\]
After standardization, the correlation matrix can be computed using matrix multiplication.
Task:
- Use pandas
.corr()to compute the correlation matrix fordensity,gravity, andmeanRadius - Now replicate this using
numpyby following these steps:- Extract data: Get the three columns as an \(n \times 3\) matrix \(X\) (n rows/objects, 3 columns/variables)
- Center the data: Subtract the mean of each column to get \(X_{\text{centered}}\) (remember that you can use the
axisparameter innumpyfunctions) - Standardize: Divide each centered column by its standard deviation to get \(Z\) (the standardized matrix)
- Compute correlation: The correlation matrix is \(R = \frac{1}{n-1} Z^T Z\) (a \(3 \times 3\) matrix)
- Compare with the pandas result
Note: The matrix \(Z\) has shape \((n \times 3)\), so \(Z^T\) has shape \((3 \times n)\), and their product gives a \((3 \times 3)\) correlation matrix.
Exercise 9: Creating a Theoretical Model Visualization with NumPy
Create a contour plot showing how escape velocity varies with surface gravity and radius, then overlay real solar system objects.
Scientific Background:
From \(v_{esc} = \sqrt{2gR}\), we can predict escape velocity for any combination of surface gravity \(g\) and radius \(R\). This creates a “phase space” showing where different types of objects fall.
Task:
- Create a grid of theoretical values using numpy:
- Use
np.linspace()to create arrays of gravity values (0.1 to 30 m/s²) and radius values (100 to 100,000 km) - Use
np.meshgrid()to create 2D grids for both variables - Calculate theoretical escape velocity on this grid using the formula (remember unit conversions!)
- Use
- Create a contour plot:
- Use
plt.contourf()orplt.contour()to plot the theoretical escape velocity - Add contour labels showing escape velocity values
- Filter the dataset for the 8 true planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune
- Overlay these planets on the contour plot using
plt.scatter()
- Use
- Interpretation:
- Do the 8 planets fall on the theoretical contours?
- Can you identify different groups among the planets (rocky vs gas/ice giants)?
Use the matplotlib documentation at https://matplotlib.org/stable/contents.html if needed.