Pandas

The pandas module is a powerful Python library for data manipulation and analysis. It makes reading tabulated data easier and allows various arithmetic or logical operations to be performed. The results can be outputted as tabular forms or plots.

Working with pandas objects is similar to working with lists, numpy arrays, and dictionaries - you can add new elements, update them, and manipulate them. However, there are some subtle differences and new features that make pandas particularly powerful for data analysis.

We shall start by importing the pandas module. Conventionally, pandas is imported as “pd”:
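For example:

```python
# Conventional import alias for pandas
import pandas as pd

print(pd.__version__)
```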

DataFrames - The primary pandas structure

Just as numpy introduced the array object, pandas introduces new structures which allow more advanced functionality. The primary structure is called a DataFrame, which looks a lot like a table of data.

One way to create a DataFrame is using a dictionary where the keys are column names and the values are the column entries:
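A minimal sketch using a small chemistry dataset (the molar masses are standard values, included for illustration):

```python
import pandas as pd

# Dictionary keys become column names; values become the column entries
data = {
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
}
df = pd.DataFrame(data)
print(df)
```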

If we examine the pandas DataFrame above, we can see the following features:

- This is a two-dimensional “table” of data with rows and columns
- Initially, each row has been labelled by number (0, 1, 2). This is known as the index
- Each column has been labelled by a column name (“formula”, “name”, “mol_mass”)

Working with Rows: The Index

The index is an important feature of pandas DataFrames. It builds upon other ordered objects (like lists and numpy arrays) but also adds in labelling (like dictionaries). By default, rows are indexed as 0, 1, 2… but we can assign this to anything we like to label our rows in a useful way.

To make the rows more meaningful, we can label them using the formula column. We can set a new index from a column using the set_index method:

Note: by default the set_index() method creates a new DataFrame. To update our original DataFrame this needs to be assigned back to our original variable name.
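A sketch, reusing the illustrative chemistry dataset from above:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
})

# set_index returns a new DataFrame, so assign it back to keep the change
df = df.set_index("formula")
print(df)
```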

The benefit is that we can now access a row using our meaningful index name. We do this using the .loc[] syntax:

Note that we use square brackets not round brackets after .loc - this is unique pandas syntax for indexing a DataFrame.
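For example, with the formula column set as the index:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
}).set_index("formula")

# Square brackets, not round: .loc is an indexer, not a method call
row = df.loc["CO2"]
print(row)
```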

We can access the full set of index values using the .index attribute:

Even though this returns an Index object, we can access elements using positional indexing as we would for a list or numpy array:
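A sketch showing both the attribute and positional indexing into it:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
}).set_index("formula")

idx = df.index
print(idx)

# Positional indexing works just as for a list or numpy array
first = idx[0]
print(first)
```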

Working with Columns

Each column can be accessed from a pandas DataFrame by using the name of the column:

This is similar to accessing values using keys in a dictionary. Each column is stored as a pandas Series, which is a one-dimensional version of a DataFrame. This is similar to a list or a 1D numpy array but has an index just like a DataFrame.

To find the full names of all columns in a DataFrame, we can access the .columns attribute:
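A sketch showing both column access and the .columns attribute:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
})

# Accessing a column by name returns a pandas Series
names = df["name"]
print(names)

# All column names at once
print(df.columns)
```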

Working with Values and Data Types

A pandas DataFrame contains values for each row and column. Each column (Series) has a particular dtype, which is the data type (e.g. float, int, str) associated with that data. This is the same terminology we saw with numpy.

To access one element in a DataFrame, we can:

- Select the row using .loc[] and then the column using [] indexing
- Use .loc[] with both row and column (similar to multi-indexing in numpy)
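Both approaches in a short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "name": ["Water", "Carbon Dioxide", "Glucose"],
    "mol_mass": [18.01528, 44.0095, 180.156],
}).set_index("formula")

# Option 1: select the row, then the column
m1 = df.loc["H2O"]["mol_mass"]

# Option 2: row and column together in one .loc call
m2 = df.loc["H2O", "mol_mass"]
```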

Applying Operations

The pandas module is built on underlying numpy arrays and brings across similar functionality. You can even access the underlying numpy array using the .values attribute:
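For example:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "mol_mass": [18.01528, 44.0095, 180.156],
})

# .values exposes the underlying numpy array of a column
arr = df["mol_mass"].values
print(type(arr))
```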

As with numpy arrays, we can perform arithmetic or logical operations across a whole column (a pandas Series) at once:
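A sketch of both an arithmetic and a logical operation on a column:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "mol_mass": [18.01528, 44.0095, 180.156],
})

# Arithmetic applies element-wise: convert g/mol to kg/mol
kg = df["mol_mass"] / 1000

# A logical operation returns a Boolean Series
heavy = df["mol_mass"] > 40
print(heavy)
```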

We can also combine multiple pandas Series with the same indices:

Just as with numpy, this operation combines these two objects element by element. For a pandas Series, this matches on the index rather than the position.
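A small sketch making the index matching visible (note the indices are deliberately in different orders):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "a"])

# Addition aligns on index labels, not on position:
# "a" -> 1.0 + 20.0, "b" -> 2.0 + 10.0
total = s1 + s2
print(total)
```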

Statistical Operations

Similar to numpy arrays, pandas provides many methods for performing statistical operations. It is generally better to use the pandas methods when dealing with pandas objects, as they handle the data more intelligently (especially missing data).

Common statistical operations include:
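For example, mean, min/max, and the describe() summary on a column:

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["H2O", "CO2", "C6H12O6"],
    "mol_mass": [18.01528, 44.0095, 180.156],
})

col = df["mol_mass"]
print(col.mean(), col.min(), col.max())

# describe() gives a summary of several statistics at once
print(col.describe())
```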

Updating DataFrames: Adding New Columns

Let’s create a new DataFrame to demonstrate how to update and add new information:
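A sketch using a small table of cities (the population and area figures are illustrative approximations, not authoritative data):

```python
import pandas as pd

# population in millions, area in km^2 (illustrative values)
cities = pd.DataFrame({
    "city": ["Tokyo", "Delhi", "Shanghai", "London"],
    "population": [37.4, 31.0, 24.9, 9.0],
    "area": [2194, 1484, 6341, 1572],
}).set_index("city")
print(cities)
```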

We can extract a column and perform operations on it:

To add this as a new column, we use syntax similar to assigning new values to dictionaries - define a new key and assign the new data:

When assigning a new value, it must be the same length as the number of rows in the DataFrame.

We can also apply operations between columns to create new columns. For instance, we can calculate population density:
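The steps above can be sketched together, reusing the illustrative city figures:

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["Tokyo", "Delhi", "Shanghai", "London"],
    "population": [37.4, 31.0, 24.9, 9.0],  # millions, illustrative
    "area": [2194, 1484, 6341, 1572],       # km^2, illustrative
}).set_index("city")

# Extract a column and operate on it: population as raw counts
pop_people = cities["population"] * 1e6

# Assigning to a new key adds a column (its length must match the rows)
cities["pop_people"] = pop_people

# Operations between columns: density in people per km^2
cities["density"] = cities["pop_people"] / cities["area"]
print(cities)
```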

Filtering Data

Similar to Boolean array indexing in numpy, we can filter pandas Series or DataFrames based on conditions.

Let’s filter to find cities with populations over 15 million:

We can break this down into steps. First, create the filter:

The filter contains True and False values for each index. We can apply this filter to our original column:

We can also filter the entire DataFrame:

This is equivalent to writing:
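The whole sequence, step by step and as the equivalent one-liner (same illustrative city figures as above):

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["Tokyo", "Delhi", "Shanghai", "London"],
    "population": [37.4, 31.0, 24.9, 9.0],  # millions, illustrative
    "area": [2194, 1484, 6341, 1572],       # km^2, illustrative
}).set_index("city")

# Step 1: create the Boolean filter
mask = cities["population"] > 15
print(mask)

# Step 2: apply it to a single column...
big_pops = cities["population"][mask]

# ...or to the entire DataFrame
big = cities[mask]

# Equivalent one-liner
big_equiv = cities[cities["population"] > 15]
```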

Filtering with Multiple Conditions

We can filter using multiple conditions with bitwise operators & (and) and | (or). Each condition must be surrounded by round brackets.

To find populations in a range between 15 and 25 million:

We can filter on multiple columns at once:

This shows all rows where the city population is greater than 15 million and the area is greater than 2000 square kilometres.
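Both combined-condition filters as a sketch (note each condition sits in its own round brackets):

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["Tokyo", "Delhi", "Shanghai", "London"],
    "population": [37.4, 31.0, 24.9, 9.0],  # millions, illustrative
    "area": [2194, 1484, 6341, 1572],       # km^2, illustrative
}).set_index("city")

# Population in a range: between 15 and 25 million
in_range = cities[(cities["population"] > 15) & (cities["population"] < 25)]

# Conditions on two different columns at once
big_and_large = cities[(cities["population"] > 15) & (cities["area"] > 2000)]
print(big_and_large)
```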

Reading Data from Files

One of the main strengths of pandas is its powerful tools for reading tabulated data. The read_csv() function is extremely versatile for reading comma separated value (csv) files.

What is a CSV file?

CSV stands for “comma separated values”. These are plain text files where data columns are separated by commas. They are much simpler than Excel files and can be read by any software capable of reading text.

Example of a CSV file content:

H2O,Water,18.01528
CO2,Carbon Dioxide,44.0095
C6H12O6,Glucose,180.156

Reading CSV files with pandas

Here’s an example of reading a csv file:

In this code:

- We provide the read_csv() function with the filename (required input)
- We use the optional skiprows parameter to ignore header lines
- This creates a DataFrame object from the data in the file
- Column names are pulled from the first row that is read (after skipped rows)

This reads the data from the csv file into a pandas DataFrame. From here, we can manipulate and analyze the data using all the tools we’ve discussed above.
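A self-contained sketch: the filename and the two comment lines at the top of the file are hypothetical, so the snippet writes a temporary file first and then reads it back with skiprows.

```python
import os
import tempfile
import pandas as pd

# Hypothetical file: two header lines, then the column names, then the data
content = """# Molar masses dataset
# illustrative example file
formula,name,mol_mass
H2O,Water,18.01528
CO2,Carbon Dioxide,44.0095
C6H12O6,Glucose,180.156
"""

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(content)
    path = f.name

# skiprows=2 ignores the two comment lines; column names come from the next row
df = pd.read_csv(path, skiprows=2)
os.remove(path)
print(df)
```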

A key option for read_csv() is the sep parameter, which lets you specify a different delimiter when your data is not comma-separated. For example, for tab-separated data you can use sep='\t'.

We can see some examples below where (for convenience) we use the StringIO class from the io module to simulate reading from a file, so that the content of the “file” is immediately apparent in the code snippet itself:
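```python
import pandas as pd
from io import StringIO

# StringIO wraps a string so read_csv can treat it like a file
csv_data = StringIO("formula,name,mol_mass\nH2O,Water,18.01528\nCO2,Carbon Dioxide,44.0095\n")
df = pd.read_csv(csv_data)
print(df)
```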

A similar example below, but with triple quotes for the multi-line string and an arbitrarily different separator (;):
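```python
import pandas as pd
from io import StringIO

# A triple-quoted multi-line string, with ';' as the (arbitrary) separator
text = """formula;name;mol_mass
H2O;Water;18.01528
CO2;Carbon Dioxide;44.0095
C6H12O6;Glucose;180.156
"""
df = pd.read_csv(StringIO(text), sep=";")
print(df)
```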

Conclusion

We covered the basics of the pandas module, including creating and manipulating DataFrames, accessing rows and columns, performing operations, filtering data, and reading data from CSV files.

pandas is the central tool for data analysis in Python.

It may appear to duplicate some of the operations we have seen earlier with numpy or vanilla Python. However, the case for using pandas lies in its data organisation and the construction of coherent pipelines.

Caveats

  • pandas has a rather specific syntax, and learning it can take a while. It is important to explore the documentation, notably its tutorials:

    • https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html
  • pandas is also an area where copying and adapting code snippets from online resources (e.g. Stack Overflow) is common practice.

  • It is also a good idea to keep a “cheat sheet” of common operations handy for reference.

  • Finally, LLMs normally have a good knowledge of pandas and can be a useful resource for quick questions.