Pandas vs Numpy

Looking back at our trajectory in Python, we can see how, building from the elementary data structures, we have progressed to more complex data types provided by various libraries.

In particular, we have seen how the efficient handling of numerical data types is performed in Python using the numpy numerical library, and how this is further extended to handle tabular data using the pandas library.

pandas is built on top of numpy and provides high-level data structures and functions designed to make data analysis and manipulation easier and more intuitive.

Pandas chiefly operates with tabulated data structures called DataFrames, which are essentially 2-dimensional labeled data structures with columns of potentially different types. This makes pandas particularly well-suited for handling structured data, such as CSV files, SQL tables, and Excel spreadsheets.

On the other hand the building block of numpy is the ndarray, which is a powerful n-dimensional array, of which tables are just a particular instance (2-dimensional arrays). numpy is optimized for numerical computations and provides a wide range of mathematical functions to operate on these arrays efficiently.

In this lecture we will explore similarities and differences between pandas and numpy, and how to make them interoperate effectively.

Reading data in

We have seen that both pandas and numpy provide functions to read data from various file formats, but they do so in different ways.

In numpy we have functions like numpy.loadtxt() and numpy.genfromtxt() to read data from text files, while pandas offers more versatile functions like pandas.read_csv(), pandas.read_excel(), and pandas.read_sql() to read data from a variety of sources.

Consider a CSV file named sample_data.csv with the following content:

id,age,score,grade
1,20,85,4.0
2,21,,3.0
3,19,92,4.0
4,22,88,
5,20,76,3.0
6,,91,4.0
7,21,79,3.0
8,23,,2.0
9,20,95,4.0
10,22,82,

Notice that it contains some missing values.

We can read it in numpy simply with genfromtxt:
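A minimal sketch; the file contents are built inline via io.StringIO here so the example is self-contained:

```python
import io
import numpy as np

# the content of sample_data.csv, built inline for a self-contained example
csv_text = """id,age,score,grade
1,20,85,4.0
2,21,,3.0
3,19,92,4.0
4,22,88,
5,20,76,3.0
6,,91,4.0
7,21,79,3.0
8,23,,2.0
9,20,95,4.0
10,22,82,
"""

# skip_header=1 drops the header row; delimiter="," splits the fields
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(data.shape)   # (10, 4)
print(data.dtype)   # float64 -- one homogeneous dtype for the whole array
```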

numpy by default reads data directly into ndarrays, which contain homogeneous data (a single dtype for all entries).

Notice that the headers have been lost, and the missing values have been interpreted as nan, numpy's floating-point representation of missing data.

In pandas, we can read the same file using read_csv()
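A sketch with the same inline file contents (via io.StringIO) for self-containment:

```python
import io
import pandas as pd

csv_text = """id,age,score,grade
1,20,85,4.0
2,21,,3.0
3,19,92,4.0
4,22,88,
5,20,76,3.0
6,,91,4.0
7,21,79,3.0
8,23,,2.0
9,20,95,4.0
10,22,82,
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
print(df.dtypes)   # each column gets its own inferred dtype
```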

This produces a DataFrame object, which is a bit richer: it retains the headers and the explicit indices, and the data type of each column is inferred separately.

Cleaning the data from the missing values in numpy requires some ad hoc work: for example, we could decide to remove any row that contains a nan value using numpy.isnan() and boolean indexing:
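A sketch of this cleanup on a small array with nan values (illustrative data, standing in for the genfromtxt result):

```python
import numpy as np

# a small array with missing values (nan), as produced e.g. by genfromtxt
data = np.array([
    [1.0, 20.0, 85.0, 4.0],
    [2.0, 21.0, np.nan, 3.0],
    [3.0, 19.0, 92.0, 4.0],
    [4.0, 22.0, 88.0, np.nan],
])

# keep only the rows in which no entry is nan
mask = ~np.isnan(data).any(axis=1)
clean = data[mask]
print(clean)   # only the first and third rows survive
```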

The same is a bit more immediate in pandas, where we can use the dropna() method to remove rows with any missing values:
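For instance (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [85.0, np.nan, 92.0, 88.0],
    "grade": [4.0, 3.0, 4.0, np.nan],
})

clean = df.dropna()   # drop every row that contains any missing value
print(clean)          # column names and the original indices (0 and 2) survive
```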

Notice that the resulting DataFrame still retains the column names and indices.

To completely forget the indices of the original data, we can reset them using the reset_index() method:
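A small sketch (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [85.0, np.nan, 92.0]})

# drop=True discards the old index instead of keeping it as a column
clean = df.dropna().reset_index(drop=True)
print(clean.index.tolist())   # [0, 1] -- a fresh default index
```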

A pandas DataFrame can easily be converted back to a numpy ndarray using the to_numpy() method:

This is also equivalent to accessing the values attribute of the DataFrame:
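A sketch showing both routes (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21], "score": [85.0, 92.0]})

arr1 = df.to_numpy()   # preferred: accepts dtype=, copy=, na_value= options
arr2 = df.values       # equivalent attribute access

print(arr1)
print(arr1.dtype)      # float64 -- the common dtype of the mixed columns
```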

The to_numpy() method is preferred as it provides more options for controlling the conversion process. It is also a particular instance of the many to_...() methods that allow you to convert DataFrames to various formats:

Among the most immediately useful are:

  • DataFrame.to_dict(): Converts the DataFrame to a dictionary.
  • DataFrame.to_json(): Converts the DataFrame to a JSON string.
  • DataFrame.to_excel(): Writes the DataFrame to an Excel file.
  • DataFrame.to_csv(): Writes the DataFrame to a CSV file.

We can also operate in the opposite direction, converting a numpy ndarray to a pandas DataFrame using the pandas.DataFrame() constructor: for this purpose, we may want to specify the column names explicitly using the columns parameter.
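For instance (the column names here are illustrative choices, not dictated by the data):

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 20.0],
                [2.0, 21.0]])

# wrap the ndarray in a DataFrame, labelling the columns explicitly
df = pd.DataFrame(arr, columns=["id", "age"])
print(df)
```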

Indexing and Slicing

numpy uses the familiar zero-based integer indexing that we have seen in lists and arrays. We can access elements using integer indices, and we can slice arrays using the colon (:) operator.

Given the following array, we can pick individual elements and slices:
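For instance (array values are illustrative):

```python
import numpy as np

a = np.arange(10, 20)   # array([10, 11, ..., 19])

print(a[0])      # 10 -- first element
print(a[-1])     # 19 -- last element
print(a[2:5])    # [12 13 14] -- slice, end excluded
print(a[::2])    # every second element
```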

Pandas DataFrames have (as we have seen) a rather different syntax for indexing and slicing.

The closest to numpy’s indexing is the .iloc[] method (integer location), which allows integer-based indexing similar to numpy arrays.
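For instance (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21, 19], "score": [85, 92, 88]})

print(df.iloc[0, 1])   # row 0, column 1 -> 85
print(df.iloc[1:3])    # rows 1 and 2, sliced like a numpy array
```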

More generally, pandas provides label-based indexing using the .loc[] method, which allows us to access data using row and column labels.

In this case the index is numerical:

And the columns are labelled with strings:

So we can do the following:
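A sketch with a numerical index and string column labels (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21, 19], "score": [85, 92, 88]})

# .loc uses labels: here the row labels happen to be the integers 0..2
print(df.loc[0, "age"])       # 20
print(df.loc[0:1, ["age"]])   # NOTE: .loc slices include BOTH endpoints
```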

Using [] in pandas without loc or iloc

Some care needs to be taken when using the square brackets [] operator in pandas, as it can behave differently based on the context.

When we use [] directly with a DataFrame, it primarily serves two purposes: selecting columns and filtering rows.

Selecting columns: When we pass a single string or a list of strings to [], it selects the corresponding columns from the DataFrame.
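For instance (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21], "score": [85, 92]})

print(df["age"])              # a single string selects one column -> Series
print(df[["age", "score"]])   # a list of strings -> DataFrame
```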

Filtering rows: When we pass a boolean array or Series to [], it filters the rows based on the boolean values.
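For instance (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 21, 19], "score": [85, 92, 88]})

high = df[df["score"] > 86]   # boolean Series selects the matching rows
print(high)
```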

Oddly enough, if we pass a single integer, pandas will interpret it as a column selection, not as a row selection as one might expect, and this will raise a KeyError if there is no column with that name.

Important. These inconsistencies should motivate you to prefer .loc and .iloc for DataFrame indexing.

Mathematical operations on data tables

Basic statistics

Numpy arrays readily provide a few mathematical methods that aggregate over the array data, such as sum(), mean(), std(), etc.

They operate on rows or columns depending on the axis parameter:

  • axis=0 indicates that the operation runs along the rows, producing one result per column (column-wise).
  • axis=1 indicates that the operation runs along the columns, producing one result per row (row-wise).
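The bullets above can be sketched as (illustrative data):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum(axis=0))   # column totals: [5 7 9]
print(m.sum(axis=1))   # row totals:    [ 6 15]
print(m.mean())        # grand mean over all elements: 3.5
```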

Similar operations exist in pandas and work column-wise (default) or row-wise depending on the axis parameter as well.
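For instance (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})

print(df.sum())          # column-wise by default
print(df.sum(axis=1))    # row-wise
```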

We can also compute standard deviations:
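For instance, on the same values (illustrative data):

```python
import numpy as np
import pandas as pd

values = [85.0, 92.0, 88.0, 76.0]
arr = np.array(values)
s = pd.Series(values)

print(arr.std())   # numpy default
print(s.std())     # pandas default -- a different number!
```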

Oh, the two appear to give quantitatively different results! This may come as a surprise, but it is an example of how different libraries can implement similar functions with slightly different conventions, and of how important it is to read the documentation carefully.

If you read the documentation carefully, you will find that numpy.std() by default computes the standard deviation with degrees of freedom ddof=0, which is the population standard deviation

\[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}\]

This is different from the sample standard deviation formula, which uses ddof=1

\[s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}\]

The two have different meanings:

  • the population standard deviation is used when you have data for the entire population
  • the sample standard deviation is used when you have a sample from a larger population and want to estimate the population standard deviation.

pandas by default uses the sample standard deviation formula (ddof=1).

In order to match the two results, we can specify ddof=0 in the function calls
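For instance (illustrative data):

```python
import numpy as np
import pandas as pd

values = [85.0, 92.0, 88.0, 76.0]
arr = np.array(values)
s = pd.Series(values)

# both with ddof=0 (population) -- now the results agree
print(arr.std(ddof=0), s.std(ddof=0))
# equivalently, both with ddof=1 (sample):
print(arr.std(ddof=1), s.std(ddof=1))
```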

More general mathematical operations

Numpy has a large set of mathematical functions that can be applied element-wise to arrays, such as np.sin(), np.exp(), np.log(), etc.

What if we want to apply these functions to pandas DataFrames?

We can surely create new columns by applying numpy functions to existing columns:
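For instance (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# numpy ufuncs accept a Series and return a Series, preserving the index
df["log_x"] = np.log(df["x"])
df["exp_x"] = np.exp(df["x"])
print(df)
```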

We can even apply the functions to the entire DataFrame, and numpy will apply them element-wise:
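For instance (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 4.0], "b": [9.0, 16.0]})

roots = np.sqrt(df)   # applied element-wise; the result is still a DataFrame
print(roots)
```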

These are fast operations because they are vectorized.

Views vs Copies: A Critical Difference

When working with numpy arrays and pandas DataFrames, understanding the distinction between views and copies is crucial to avoid unexpected behavior.

A view is a reference to the original data structure - modifying a view will change the original data.

A copy is an independent duplicate - modifying a copy leaves the original unchanged.

This distinction is particularly important when subsetting data or performing operations that might create either views or copies.

Views and copies in NumPy

In numpy, slicing creates views by default. This is done for memory efficiency, but it means that modifying a slice will affect the original array.
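A minimal sketch (the variable names original and sliced are illustrative):

```python
import numpy as np

original = np.arange(5)   # [0 1 2 3 4]
sliced = original[1:4]    # a view, not a copy

sliced[0] = 99            # modifies the view...
print(original)           # ...and the original: [ 0 99  2  3  4]
```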

Notice that modifying the slice changed the original array! This is because sliced is a view of original, not an independent copy.

We can verify this using np.shares_memory():
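For instance:

```python
import numpy as np

original = np.arange(5)
sliced = original[1:4]

# True: the slice shares its buffer with the original array
print(np.shares_memory(original, sliced))
```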

To create an independent copy that won’t affect the original, use the .copy() method explicitly:
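For instance:

```python
import numpy as np

original = np.arange(5)
independent = original[1:4].copy()   # an explicit, independent copy

independent[0] = 99
print(original)   # unchanged: [0 1 2 3 4]
print(np.shares_memory(original, independent))   # False
```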

Views and copies in Pandas

The situation in pandas is more complex. Pandas tries to avoid unnecessary copying for performance, but the behavior is not always obvious.

Chained indexing (using multiple [] operations in sequence) can be particularly problematic and may trigger a SettingWithCopyWarning.

If we try to filter rows and then modify the result, pandas may issue a warning because it’s unclear whether we’re modifying a view or a copy:
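A sketch of the problematic pattern (illustrative data; the exact warning depends on the pandas version and copy-on-write settings):

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 92, 76], "passed": [True, True, False]})

# Chained indexing: two [] operations in sequence.
# The first [] returns a copy here, so the assignment never reaches df,
# and pandas may emit a SettingWithCopyWarning (or, with copy-on-write,
# a ChainedAssignmentError warning) to flag the ambiguity.
df[df["score"] > 80]["passed"] = False

print(df)   # typically unchanged: the assignment hit a temporary copy
```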

Best practice 1: Use .copy() explicitly when you want to work with a subset independently:
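For instance (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 92, 76]})

subset = df[df["score"] > 80].copy()   # explicitly independent
subset["score"] = 0                    # safe: no warning, df untouched

print(df)   # the original is unchanged
```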

Best practice 2: If you want to modify the original DataFrame directly, use .loc[] in a single operation:

| Library | Default behavior               | How to get a copy       | When it matters                |
|---------|--------------------------------|-------------------------|--------------------------------|
| NumPy   | Slicing creates views          | Use .copy() explicitly  | When modifying slices          |
| Pandas  | Complex (depends on operation) | Use .copy() or .loc[]   | When subsetting and modifying  |

Interoperability with Plotting

While plotting libraries like matplotlib can work directly with pandas DataFrames, there are situations where converting to numpy arrays is useful or even necessary.

Let’s explore when and why you might convert data for plotting purposes.

Plotting directly with pandas vs numpy

Both approaches work, but have different advantages:
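A sketch of the two approaches on the same data (illustrative sine curve; the Agg backend is selected so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.linspace(0, 2 * np.pi, 50)})
df["y"] = np.sin(df["x"])

# pandas approach: plot straight from the DataFrame columns
ax = df.plot(x="x", y="y", title="pandas plot")

# numpy approach: convert to arrays first, then use matplotlib directly
x = df["x"].to_numpy()
y = df["y"].to_numpy()
fig, ax2 = plt.subplots()
ax2.plot(x, y)
ax2.set_title("numpy + matplotlib plot")
```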

Both approaches produce identical plots. The pandas approach is more concise for simple plots, while the numpy approach can be clearer when you need the data for other operations.

For example, creating contour plots to represent fields over a grid can be more straightforward using numpy arrays.

In the example below, we take tabulated data, arrange it into a grid using a pivot table with the x and y coordinates as columns and rows, build the (x, y) coordinate pairs via numpy.meshgrid(), and then plot the contour using matplotlib.
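A self-contained sketch of those steps (the (x, y, value) triples are illustrative data; the Agg backend avoids needing a display):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# tabulated field values: one (x, y, value) triple per row (illustrative data)
df = pd.DataFrame({
    "x":     [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "y":     [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 1.0],
})

# arrange the values on a grid: rows indexed by y, columns by x
grid = df.pivot_table(index="y", columns="x", values="value")

# build matching 2-D coordinate arrays from the grid's labels
X, Y = np.meshgrid(grid.columns.to_numpy(), grid.index.to_numpy())

fig, ax = plt.subplots()
cs = ax.contourf(X, Y, grid.to_numpy())
fig.colorbar(cs)
```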