
Pandas - Introduction

This notebook explains how to use the pandas library for the analysis of tabular data.

# Start using pandas (default import convention)
import pandas as pd
import numpy as np
# Let pandas speak for themselves
print(pd.__doc__)

pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.

Visit the official website for well-written documentation: https://pandas.pydata.org

# Current version (2.x at the time of writing)
print(pd.__version__)
2.2.3

Basic objects

The pandas library has a vast API with many useful functions. However, most of it revolves around two important classes:

  • Series

  • DataFrame

In this introduction, we will focus on them - what each of them does and how they relate to each other and numpy objects.

Series

Series is a one-dimensional data structure, central to pandas.

For a complete API, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

# My first series
series = pd.Series([11, 12, 13])
series
0    11
1    12
2    13
dtype: int64

This looks a bit like a NumPy array, does it not?

Actually, in most cases the Series wraps a NumPy array...

series.values  # The result is a NumPy array
array([11, 12, 13])

But there is something more. Alongside the values, we see that each item (or “row”) has a certain label. The collection of labels is called the index.

series.index
RangeIndex(start=0, stop=3, step=1)

This index (see below) can be used, as its name suggests, to index items of the series.

# Return an element from the series
series.loc[1]
np.int64(12)
# Or
series[1]
np.int64(12)
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab
a    2
b    4
dtype: int64
series_ab.loc["a"]
np.int64(2)

Exercise: Create a series with 5 elements.

result = ...

DataFrame

A DataFrame is pandas’ answer to Excel sheets - it is a collection of named columns (or, in our case, a collection of Series). Quite often, we directly read data frames from an external source, but it is possible to create them from:

  • a dict of Series, numpy arrays or other array-like objects

  • an iterable of rows (where rows are Series, lists, dictionaries, ...)

# List of lists (no column names)
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(table)
table_df
Loading...
# Dict of Series (with column names)
df = pd.DataFrame({
    'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
    'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df
Loading...
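Rows can also be given as an iterable of dictionaries, where the column names are taken from the keys. A minimal sketch (with made-up values):

# Each dictionary becomes one row; keys become column names
pd.DataFrame([
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 3},
])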
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))

df = pd.DataFrame(data, columns=['a', 'b'])
df
Loading...
# A DataFrame also has an index.
df.index
RangeIndex(start=0, stop=10, step=1)
# ...that is shared by all columns
df.index is df["a"].index
True
# The columns also form an index.
df.columns
Index(['a', 'b'], dtype='object')

Exercise: Create a DataFrame whose x column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, \ldots, 2\pi$, whose y column is $\cos(x)$, and whose index are the fractions 0, 1/4, 1/2, ..., 2.

import fractions

index = [fractions.Fraction(n, ___) for n in range(___)]
x = np.___([___ for ___ in ___])
y = ___

df = pd.DataFrame(___, index = ___)

# display
df
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 3
      1 import fractions
----> 3 index = [fractions.Fraction(n, ___) for n in range(___)]
      4 x = np.___([___ for ___ in ___])
      5 y = ___

TypeError: 'RangeIndex' object cannot be interpreted as an integer

D(ata) types

Pandas builds upon the numpy data types (mentioned earlier) and adds a couple more.

typed_df = pd.DataFrame({
  "bool": np.arange(5) % 2 == 0,
  "int": range(5),
  "int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
  "float": np.arange(5) * 3.14,
  "complex": np.array([1 + 2j, 2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j]),
  "object": [None, 1, "2", [3, 4], 5 + 6j],
  "string?": ["a", "b", "c", "d", "e"],
  "string!": pd.Series(["a", "b", "c", "d", "e"], dtype="string"),
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
  "timedelta": pd.timedelta_range(0, freq="1s", periods=5),
  "category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category"),
  "period": pd.period_range('2018-01-01', periods=5, freq='M'),
})
typed_df
/var/folders/dm/gbbql3p121z0tr22r2z98vy00000gn/T/ipykernel_95281/1417085050.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
Loading...
typed_df.dtypes
bool bool int int64 int[nan] Int64 float float64 complex complex128 object object string? object string! string[python] datetime datetime64[ns] timedelta timedelta64[ns] category category period period[M] dtype: object

We will see some of these types used in practice in the analysis below.

Indices & indexing

abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series
a    0
b    1
c    2
dtype: int64
abc_series.index
Index(['a', 'b', 'c'], dtype='object')
abc_series.index = ["c", "d", "e"]  # Changes the labels in-place!
abc_series.index.name = "letter"
abc_series
letter
c    0
d    1
e    2
dtype: int64
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(
    table,
    index=["first", "second", "third"],
    columns=["alpha", "beta"]
)
table_df
Loading...
alpha = table_df["alpha"]  # Simple [] indexing in DataFrame returns Series
alpha
first     a
second    b
third     c
Name: alpha, dtype: object
alpha.loc["second"]             # Indexing a Series with a single label returns a scalar value.
'b'

Indexing with a ["list", "of", "columns"] yields a DataFrame with those columns.

For example:

table_df[["beta", "alpha"]]
Loading...

[["column_name"]] returns a DataFrame as well, not a Series:

table_df[["alpha"]]
Loading...

There are two ways to properly index rows & cells in a DataFrame:

  • loc for label-based indexing

  • iloc for positional (integer-based) indexing (it does not use the index labels at all)

Note the square brackets: these attributes are not methods but special “indexer” objects. They accept one or two arguments specifying the selection along one or both axes.

loc

first = table_df.loc["first"]
first
alpha    a
beta     1
Name: first, dtype: object
table_df.loc["first", "beta"]
np.int64(1)
table_df.loc["first":"second", "beta"]   # Label-based slices include both endpoints
first     1
second    3
Name: beta, dtype: int64

iloc

table_df.iloc[1]
alpha    b
beta     3
Name: second, dtype: object
table_df.iloc[0:4:2]   # Select every second row
Loading...
table_df.at["first", "beta"]   # at gives fast access to a single value by label
np.int64(1)
type(table_df.at)
pandas.core.indexing._AtIndexer
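The at indexer has a positional counterpart, iat, which takes integer positions instead of labels. A small sketch addressing the same cell as above:

# Row 0, column 1 - the same cell as table_df.at["first", "beta"]
table_df.iat[0, 1]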

Modifying DataFrames

Adding a new column is like adding a key/value pair to a dict. Note that this operation, unlike most others, does modify the DataFrame.

table_df
Loading...
from datetime import datetime
table_df["now"] = datetime.now()
table_df
Loading...

A non-destructive version that returns a new DataFrame uses the assign method:

table_df.assign(delta = [True, False, True]).drop(columns=["now"])
Loading...
# However, the original DataFrame is not changed
table_df
Loading...

Deleting a column is very easy too.

del table_df["now"]
table_df
Loading...

The drop method works with both rows and columns, returning a new DataFrame and leaving the original untouched.

table_df.drop("beta", axis=1)
Loading...
table_df.drop("second", axis=0)
Loading...

Exercise: Use a combination of reset_index, drop and set_index to transform table_df into pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])

result = table_df.___.___.___

# display
result

Let’s get some real data!

I/O in pandas

Pandas can read (and write to) a huge variety of file formats. More details can be found in the official documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Most of the functions for reading data are named pandas.read_XXX, where XXX is the format used. We will look at three commonly used ones.

# List functions for input in pandas.

print("\n".join(method for method in dir(pd) if method.startswith("read_")))
read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_orc
read_parquet
read_pickle
read_sas
read_spss
read_sql
read_sql_query
read_sql_table
read_stata
read_table
read_xml

Read CSV

Nowadays, a lot of data comes in the textual Comma-separated values format (CSV). Although not properly standardized, it is the de-facto standard for files that are not huge and are meant to be read by human eyes too.

Let’s read the population of U.S. states that we will need later:

territories = pd.read_csv("data/us_state_population.csv")
territories.head(9)
Loading...

The automatic data type parsing converts columns to appropriate types:

territories.dtypes
Territory           object
Population         float64
Population 2010      int64
Code                object
dtype: object

Sometimes the CSV input does not work out of the box. Although pandas automatically understands and reads zipped files, it usually does not automatically infer the file format and its variations - for details, see the read_csv documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

pd.read_csv('data/iris.tsv.gz')
Loading...

...in this case, the CSV file does not use commas to separate values. Therefore, we need to specify an extra argument:

pd.read_csv("data/iris.tsv.gz", sep='\t')
Loading...

See the difference?

Read Excel

Let’s read the list of U.S. incidents when lasers interfered with airplanes.

pd.read_excel("data/laser_incidents_2019.xlsx")
Loading...

Note: This reads just the first sheet from the file. If you want to extract more sheets, you will need to use the pandas.ExcelFile class. See the relevant part of the documentation.
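A minimal sketch of working with multiple sheets (assuming the workbook actually contains more than one sheet):

# Inspect the sheet names and read one of them explicitly
excel_file = pd.ExcelFile("data/laser_incidents_2019.xlsx")
print(excel_file.sheet_names)
first_sheet = excel_file.parse(excel_file.sheet_names[0])

# Alternatively, sheet_name=None returns a dict of DataFrames, one per sheet
all_sheets = pd.read_excel("data/laser_incidents_2019.xlsx", sheet_name=None)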

Read HTML (Optional)

Pandas is able to scrape data from tables embedded in web pages using the read_html function. This might or might not bring good results, and you will probably have to tweak the resulting data frame manually. But it is a good starting point - much better than being forced to parse the HTML ourselves!

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_laser_types")
type(tables), len(tables)
(list, 9)
tables[1]
Loading...
tables[2]
Loading...

Write CSV

Pandas is able to write to many different formats, and the usage is similar.

tables[1].to_csv("gas_lasers.csv", index=False)
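For illustration, a hedged sketch of a couple of other writers with analogous signatures (the file names are arbitrary; the Excel writer requires an engine such as openpyxl to be installed):

# The same table written to JSON and Excel
tables[1].to_json("gas_lasers.json", orient="records")
tables[1].to_excel("gas_lasers.xlsx", index=False)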

Data analysis (very basics)

Let’s extend the data of laser incidents to a broader time range and read the data from a summary CSV file:

laser_incidents_raw = pd.read_csv("data/laser_incidents_2015-2020.csv")

Let’s see what we have here...

laser_incidents_raw.head()
Loading...
laser_incidents_raw.tail()
Loading...

For an unknown, potentially unevenly distributed dataset, looking at the beginning / end is typically not the best idea. We’d rather sample randomly:

# Show a few examples
laser_incidents_raw.sample(10)
Loading...
laser_incidents_raw.dtypes
Unnamed: 0         int64
Incident Date     object
Incident Time    float64
Flight ID         object
Aircraft          object
Altitude         float64
Airport           object
Laser Color       object
Injury            object
City              object
State             object
timestamp         object
dtype: object

The topic of data cleaning and pre-processing is very broad. We will limit ourselves to dropping unused columns and converting one to a proper type.

# The first three are not needed
laser_incidents = laser_incidents_raw.drop(columns=laser_incidents_raw.columns[:3])

# We convert the timestamp
laser_incidents = laser_incidents.assign(
    timestamp = pd.to_datetime(laser_incidents["timestamp"])
)
laser_incidents
Loading...
laser_incidents.dtypes
Flight ID              object
Aircraft               object
Altitude              float64
Airport                object
Laser Color            object
Injury                 object
City                   object
State                  object
timestamp      datetime64[ns]
dtype: object

Categorical dtype (Optional)

To analyze Laser Color, we can look at its typical values.

laser_incidents["Laser Color"].describe()
count     36461
unique       73
top       green
freq      32787
Name: Laser Color, dtype: object

Not too many different values.

laser_incidents["Laser Color"].unique()
array(['green', 'purple', 'blue', 'unknown', 'red', 'white', 'green and white', 'white and green', 'green and yellow', 'multiple', 'unknwn', 'green and purple', 'green and red', 'red and green', 'green and blue', 'blue and purple', 'red white and blue', 'blue and green', 'blue or purple', 'blue or green', 'yellow/orange', 'blue/purple', 'unkwn', 'orange', 'multi', 'yellow and white', 'blue and white', 'white or amber', 'red and white', 'yellow', 'amber', 'yellow and green', 'white and blue', 'red, blue, and green', 'purple-blue', 'red and blue', 'magenta', 'phx', 'green or blue', 'red or green', 'green or red', 'green, blue or purple', 'blue and red', 'unkn', 'blue-green', 'multi-colored', nan, 'blue-yellow', 'white or green', 'green and orange', 'white-green-red', 'multicolored', 'green-white', 'blue or white', 'green red blue', 'green or white', 'blue -green', 'green-red', 'green-blue', 'multi-color', 'green-yellow', 'red-white', 'blue-purple', 'white-yellow', 'green-purple', 'lavender', 'orange-red', 'blue-white', 'blue-red', 'yellow-white', 'red-green', 'white-green', 'white-blue', 'white-red'], dtype=object)
laser_incidents["Laser Color"].value_counts(normalize=True)
Laser Color green 0.899235 blue 0.046790 red 0.012260 white 0.010395 unkn 0.009051 ... red or green 0.000027 white or green 0.000027 blue-yellow 0.000027 multi-colored 0.000027 white-red 0.000027 Name: proportion, Length: 73, dtype: float64

This column is a very good candidate to turn into the pandas-specific Categorical data type. (See https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

laser_incidents["Laser Color"].memory_usage(deep=True)   # ~60 bytes per item
1969532
color_category = laser_incidents["Laser Color"].astype("category")
color_category.sample(10)
19194 green 1357 green 8379 green 9911 green 26650 green 22866 green 16266 green 32446 green 23370 green 16057 green Name: Laser Color, dtype: category Categories (73, object): ['amber', 'blue', 'blue -green', 'blue and green', ..., 'yellow and green', 'yellow and white', 'yellow-white', 'yellow/orange']
color_category.memory_usage(deep=True)              # ~1-2 bytes per item
43088
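As a side note, categorical columns expose their distinct categories and the underlying integer codes through the .cat accessor; a quick sketch:

# The distinct categories and the per-row integer codes that back them
color_category.cat.categories
color_category.cat.codes.head()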

Exercise: Are there any other columns in the dataset that you would suggest for conversion to categorical?

laser_incidents.describe(include="all")
Loading...

Integer vs. float

Pandas is generally quite good at guessing (inferring) number types. You may wonder why Altitude is float and not int, though. This is a consequence of numpy not having an integer NaN. There have been many discussions about this.

laser_incidents["Altitude"]
0 8500.0 1 40000.0 2 2500.0 3 3000.0 4 11000.0 ... 36458 8000.0 36459 11000.0 36460 2000.0 36461 300.0 36462 1000.0 Name: Altitude, Length: 36463, dtype: float64
laser_incidents["Altitude"].astype(int)
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[82], line 1
----> 1 laser_incidents["Altitude"].astype(int)

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
   6637     results = [
   6638         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6639     ]
   6641 else:
   6642     # else, only a single dtype is given
-> 6643     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6644     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6645     return res.__finalize__(self, method="astype")

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
    427 elif using_copy_on_write():
    428     copy = False
--> 430 return self.apply(
    431     "astype",
    432     dtype=dtype,
    433     copy=copy,
    434     errors=errors,
    435     using_cow=using_copy_on_write(),
    436 )

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755         raise ValueError("Can not squeeze with more than one column.")
    756     values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234     dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:101, in _astype_nansafe(arr, dtype, copy, skipna)
     96     return lib.ensure_string_array(
     97         arr, skipna=skipna, convert_na_value=False
     98     ).reshape(shape)
    100 elif np.issubdtype(arr.dtype, np.floating) and dtype.kind in "iu":
--> 101     return _astype_float_to_int_nansafe(arr, dtype, copy)
    103 elif arr.dtype == object:
    104     # if we have a datetime/timedelta array of objects
    105     # then coerce to datetime64[ns] and use DatetimeArray.astype
    107     if lib.is_np_dtype(dtype, "M"):

File ~/workspace/fjfi/python-fjfi/.venv/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:145, in _astype_float_to_int_nansafe(values, dtype, copy)
    141 """
    142 astype with a check preventing converting NaN to an meaningless integer value.
    143 """
    144 if not np.isfinite(values).all():
--> 145     raise IntCastingNaNError(
    146         "Cannot convert non-finite values (NA or inf) to integer"
    147     )
    148 if dtype.kind == "u":
    149     # GH#45151
    150     if not (values >= 0).all():

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Quite recently, Pandas introduced nullable types for working with missing data, for example nullable integer.

laser_incidents["Altitude"].astype("Int64")
0 8500 1 40000 2 2500 3 3000 4 11000 ... 36458 8000 36459 11000 36460 2000 36461 300 36462 1000 Name: Altitude, Length: 36463, dtype: Int64
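A quick check (just a sketch) that the missing altitudes survive the conversion as pd.NA instead of being silently coerced:

# Number of missing values before and after converting to nullable Int64
alt_int = laser_incidents["Altitude"].astype("Int64")
laser_incidents["Altitude"].isna().sum(), alt_int.isna().sum()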

Filtering

Indexing in pandas Series / DataFrames ([]) also supports boolean (mask) arrays. Such arrays can be obtained by applying boolean operations to the data.

You can also use standard comparison operators like <, <=, ==, >=, >, !=.

It is possible to perform logical operations on boolean series too. You need to use the |, &, ^ operators though, not the and, or, not keywords.

As an example, find all California incidents:

is_california = laser_incidents.State == "California"
is_california.sample(10)
27390 False 5156 False 27181 False 29581 False 25194 False 19519 False 8354 True 23656 False 9596 True 5121 False Name: State, dtype: bool

Now we can directly apply the boolean mask. (Note: This is no magic. You can construct the mask yourself)

laser_incidents[is_california].sample(10)
Loading...

Or maybe we should include the whole West coast?

# isin takes an array of possible values
west_coast = laser_incidents[laser_incidents.State.isin(["California", "Oregon", "Washington"])]
west_coast.sample(10)
Loading...

Or low-altitude incidents?

laser_incidents[laser_incidents.Altitude < 300]
Loading...
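Boolean masks can also be combined with the & and | operators mentioned above. For example, a sketch selecting low-altitude incidents in California (note that each condition needs its own parentheses):

# Parentheses are required because & binds tighter than the comparisons
laser_incidents[(laser_incidents.State == "California") & (laser_incidents.Altitude < 300)]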

Visualization intermezzo

Without much further ado, let’s create our first plot.

# Most frequent states
laser_incidents["State"].value_counts()[:20]
State California 7268 Texas 3620 Florida 2702 Arizona 1910 Colorado 988 Washington 982 Kentucky 952 Illinois 946 New York 921 Puerto Rico 912 Oregon 895 Tennessee 888 Nevada 837 Pennsylvania 826 Indiana 812 Utah 789 Ohio 750 Georgia 714 North Carolina 605 Missouri 547 Name: count, dtype: int64
laser_incidents["State"].value_counts()[:20].plot(kind="bar");
<Figure size 640x480 with 1 Axes>

Sorting

# Display 5 incidents with the highest altitude
laser_incidents.sort_values("Altitude", ascending=False).head(5)
Loading...
# Alternative
laser_incidents.nlargest(5, "Altitude")
Loading...

Exercise: Find the last 3 incidents with blue laser.

Arithmetic and string manipulation

Standard arithmetic operators work on numerical columns too, and so do mathematical functions. Note that all such operations are performed element-wise, in a vectorized fashion.

altitude_meters = laser_incidents["Altitude"] * .3048
altitude_meters.sample(10)
23557 2743.20 3239 1310.64 12955 1219.20 2326 1524.00 23309 914.40 6378 1828.80 18797 1828.80 16242 3352.80 1413 5486.40 21954 1066.80 Name: Altitude, dtype: float64
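NumPy functions also apply element-wise to a Series; a small sketch (np.sqrt is an arbitrary choice here, NaN values simply propagate):

# Element-wise square root of the altitude column
np.sqrt(laser_incidents["Altitude"]).sample(10)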

You may mix columns and scalars; string arithmetic also works as expected.

laser_incidents["City"] + ", " + laser_incidents["State"]
0 Santa Barbara, California 1 San Antonio, Texas 2 Tampa, Florida 3 Fort Worth , Texas 4 Modesto, California ... 36458 Las Vegas, Nevada 36459 Lincoln, California 36460 Westhampton Beach, New York 36461 Guam, Guam 36462 Naples, Florida Length: 36463, dtype: object

Summary statistics

The describe method shows summary statistics for all the columns:

laser_incidents.describe()
Loading...
laser_incidents.describe(include="all")
Loading...
laser_incidents["Altitude"].mean()
np.float64(7358.314263625822)
laser_incidents["Altitude"].std()
np.float64(7642.6867120945535)
laser_incidents["Altitude"].max()
np.float64(240000.0)

Basic string operations (Optional)

These are typically accessed using the .str “accessor” of the Series like this:

  • series.str.lower

  • series.str.split

  • series.str.startswith

  • series.str.contains

  • ...

See more in the documentation.

laser_incidents[laser_incidents["City"].str.contains("City", na=False)]["City"].unique()
array(['Panama City', 'Oklahoma City', 'Salt Lake City', 'Bullhead City', 'Garden City', 'Atlantic City', 'Panama City ', 'New York City', 'Jefferson City', 'Kansas City', 'Rapid City', 'Tremont City', 'Boulder City', 'Traverse City', 'Cross City', 'Brigham City', 'Carson City', 'Midland City', 'Johnson City', 'Ponca City', 'Panama City Beach', 'Sioux City', 'Bay City', 'Silver City', 'Pueblo City', 'Iowa City', 'Calvert City', 'Crescent City', 'Oak City', 'Falls City', 'Salt Lake City ', 'Royse City', 'Kansas City ', 'Bossier City', 'Baker City', 'Ellwood City', 'Dodge City', 'Garden City ', 'Union City', 'King City', 'Kansas City ', 'Mason City', 'Plant City ', 'Lanai City', 'Tell City', 'Yuba City', 'Kansas City ', 'Salt Lake City ', 'Kansas City ', 'Ocean City', 'Cedar City', 'City of Commerce', 'Lake City', 'Beach City', 'Alexander City', 'Siler City', 'Charles City', 'Malad City ', 'Rush City', 'Webster City', 'Plant City'], dtype=object)
laser_incidents[laser_incidents["City"].str.contains("City", na=False)]["City"].str.strip().unique()
array(['Panama City', 'Oklahoma City', 'Salt Lake City', 'Bullhead City', 'Garden City', 'Atlantic City', 'New York City', 'Jefferson City', 'Kansas City', 'Rapid City', 'Tremont City', 'Boulder City', 'Traverse City', 'Cross City', 'Brigham City', 'Carson City', 'Midland City', 'Johnson City', 'Ponca City', 'Panama City Beach', 'Sioux City', 'Bay City', 'Silver City', 'Pueblo City', 'Iowa City', 'Calvert City', 'Crescent City', 'Oak City', 'Falls City', 'Royse City', 'Bossier City', 'Baker City', 'Ellwood City', 'Dodge City', 'Union City', 'King City', 'Mason City', 'Plant City', 'Lanai City', 'Tell City', 'Yuba City', 'Ocean City', 'Cedar City', 'City of Commerce', 'Lake City', 'Beach City', 'Alexander City', 'Siler City', 'Charles City', 'Malad City', 'Rush City', 'Webster City'], dtype=object)
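The other accessors listed above work the same way. For instance, a sketch using str.startswith to find city names beginning with "New" (na=False again handles missing city names):

# Cities whose names start with "New"; missing values are treated as False
laser_incidents[laser_incidents["City"].str.startswith("New", na=False)]["City"].unique()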

Merging data

It is a common situation to have two or more datasets with different columns that we need to bring together. This operation is called merging, and the pandas machinery for it is described in great detail in the documentation.

In our case, we would like to attach the state populations to the dataset.

population = pd.read_csv("data/us_state_population.csv")
population
Loading...

We will of course use the state name as the merge key. Before actually doing the merge, we can explore a bit whether all state names from the laser incidents dataset are present in our population table.

unknown_states = laser_incidents.loc[~laser_incidents["State"].isin(population["Territory"]), "State"]
print(f"There are {unknown_states.count()} rows with unknown states.")
print(f"Unknown state values are: \n{list(unknown_states.unique())}.")
There are 82 rows with unknown states.
Unknown state values are: 
[nan, 'Virgin Islands', 'Miami', 'North Hampshire', 'Marina Islands', 'Teas', 'Mexico', 'DC', 'VA', 'Northern Marina Islands', 'Mariana Islands', 'Oho', 'Northern Marianas Is', 'UNKN', 'Massachussets', 'FLorida', 'D.C.', 'MIchigan', 'Northern Mariana Is', 'Micronesia'].

We could certainly clean the data by correcting some of the typos. Since the number of rows with unknown states is not large (compared to the length of the whole dataset), we will deliberately not fix the state names. Instead, we will remove those rows from the merged dataset by using the inner type of merge. All the merge types (left, inner, outer and right) are well explained by the schema below:

(figure: schema of the merge types)

We can use the merge function to add the "Population" values.

laser_incidents_w_population = pd.merge(
    laser_incidents, population, left_on="State", right_on="Territory", how="inner"
)
laser_incidents_w_population
Loading...
laser_incidents_w_population.describe(include="all")
Loading...

Grouping & aggregation

A common pattern in data analysis is grouping (or binning) data based on some property and getting some aggregate statistics.

Example: Group this workshop's participants by nationality and get the cardinality (the size) of each group.

Possibly the simplest grouping and aggregation is the value_counts method, which groups by the respective column value and yields the number (or normalized frequency) of each unique value in the data.

laser_incidents_w_population["State"].value_counts(normalize=False)
State California 7268 Texas 3620 Florida 2702 Arizona 1910 Colorado 988 Washington 982 Kentucky 952 Illinois 946 New York 921 Puerto Rico 912 Oregon 895 Tennessee 888 Nevada 837 Pennsylvania 826 Indiana 812 Utah 789 Ohio 750 Georgia 714 North Carolina 605 Missouri 547 Minnesota 531 New Jersey 519 Michigan 505 Hawaii 500 Alabama 473 Virginia 412 Oklahoma 412 New Mexico 401 Louisiana 351 Massachusetts 346 South Carolina 306 Maryland 255 Idaho 237 Arkansas 237 Wisconsin 207 Iowa 200 Connecticut 185 District of Columbia 183 Kansas 172 Mississippi 156 Montana 134 Nebraska 112 West Virginia 108 North Dakota 92 New Hampshire 86 Rhode Island 81 Alaska 67 Maine 66 South Dakota 52 Delaware 43 Guam 31 Vermont 28 Wyoming 22 U.S. Virgin Islands 1 Name: count, dtype: int64

This is just a primitive grouping and aggregation operation; we will look into more advanced patterns. Let us say we would like to get some statistics for individual states. We can group the dataset by the "State" column using groupby:

grouped_by_state = laser_incidents_w_population.groupby("State")

What did we get?

grouped_by_state
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x127bf6f60>

What is this DataFrameGroupBy object? Its use case is:

  • Splitting the data into groups based on some criteria.

  • Applying a function to each group independently.

  • Combining the results into a data structure.

Let’s try a simple aggregate: the mean of altitude for each state:

grouped_by_state["Altitude"].mean().sort_values()
State Puerto Rico 3552.996703 Hawaii 4564.536585 Florida 4970.406773 Alaska 5209.848485 Wisconsin 5529.951220 New York 5530.208743 Guam 5800.000000 Maryland 6071.739130 District of Columbia 6087.144444 New Jersey 6204.306950 Illinois 6306.310566 Massachusetts 6473.763848 Texas 6487.493759 Delaware 6602.380952 Arizona 6678.333158 Nevada 6730.037485 California 6919.705613 Washington 7110.687629 Louisiana 7276.276353 Nebraska 7277.321429 Michigan 7330.459082 Oregon 7411.285231 South Dakota 7419.607843 North Dakota 7455.434783 Ohio 7482.409880 Pennsylvania 7518.614724 Connecticut 7519.562842 Vermont 7610.714286 Idaho 7636.756410 Oklahoma 7678.803440 Montana 7780.620155 Virginia 7903.889976 Rhode Island 8186.875000 Minnesota 8191.869811 South Carolina 8593.535948 Kansas 8661.994152 Indiana 8664.055693 Maine 8733.333333 Alabama 8821.210191 Mississippi 8828.685897 Tennessee 8987.354402 North Carolina 9251.180763 New Hampshire 9591.764706 Utah 9892.935197 Iowa 10174.619289 Missouri 10548.161468 New Mexico 10714.706030 U.S. Virgin Islands 11000.000000 Georgia 11130.663854 Arkansas 11203.483051 Colorado 11301.869388 Kentucky 11583.086225 West Virginia 12108.386792 Wyoming 18238.095238 Name: Altitude, dtype: float64
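Several statistics can be computed in one pass by giving agg a list of aggregation names; a sketch:

# Mean, median and count of Altitude for each state
grouped_by_state["Altitude"].agg(["mean", "median", "count"])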

What if we were to group by year? We don’t have a year column but we can just extract the year from the date and use it for groupby.

grouped_by_year = laser_incidents_w_population.groupby(laser_incidents_w_population["timestamp"].dt.year)

You may have noticed how we extracted the year using the .dt accessor. We will use .dt even more below.
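Other .dt attributes work the same way. For instance, a sketch counting incidents per calendar month across all years (dt.month returns 1-12):

# Number of incidents per calendar month, aggregated over all years
laser_incidents_w_population["timestamp"].dt.month.value_counts().sort_index()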

Let’s calculate the mean altitude of laser incidents per year. Are the lasers getting more powerful? 🤔

mean_altitude_per_year = grouped_by_year["Altitude"].mean().sort_index()
mean_altitude_per_year
timestamp
2015.0    6564.621830
2016.0    7063.288912
2017.0    7420.971064
2018.0    7602.049323
2019.0    8242.586268
2020.0    8618.242465
Name: Altitude, dtype: float64

We can also quickly plot the results; more on plotting in the next lessons.

mean_altitude_per_year.plot(kind="bar");
<Figure size 640x480 with 1 Axes>

Exercise: Calculate the sum of injuries per year. Use the fact that True + True = 2 ;)

We can also create a new Series (if the corresponding column does not exist in the dataframe) and group it by another Series, which in this case is a column from the dataframe. What matters is that the grouped series and the series we group by share the same index.

# how many incidents per million inhabitants are there for each state?
incidents_per_million = (1_000_000 / laser_incidents_w_population["Population"]).groupby(laser_incidents_w_population["State"]).sum()
incidents_per_million.sort_values(ascending=False)
State Hawaii 347.174968 Puerto Rico 283.072541 District of Columbia 272.401284 Nevada 263.392087 Arizona 259.539186 Utah 233.376716 Oregon 211.078085 Kentucky 210.978412 New Mexico 189.746676 California 186.218871 Colorado 169.180226 Washington 126.127279 Tennessee 125.933528 Idaho 122.225872 Florida 121.466464 Texas 120.547839 Montana 119.337375 Indiana 118.834422 North Dakota 118.060573 Oklahoma 102.492661 Alabama 93.214901 Minnesota 92.877892 Alaska 91.332542 Missouri 88.540597 Arkansas 77.816234 Louisiana 76.466573 Illinois 75.186584 Rhode Island 74.058226 Georgia 65.427299 Ohio 63.796895 Pennsylvania 63.675570 Iowa 62.489904 New Hampshire 61.638539 West Virginia 60.839723 Kansas 58.560169 South Carolina 57.925648 South Dakota 57.153911 Nebraska 56.912796 North Carolina 56.547484 New Jersey 56.037235 Mississippi 53.060196 Connecticut 51.017524 Michigan 50.328315 Massachusetts 49.556186 Maine 47.641734 Virginia 47.445656 New York 46.805556 Vermont 43.272381 Delaware 42.223261 Maryland 41.364812 Wyoming 37.840934 Wisconsin 35.129169 U.S. Virgin Islands 0.000000 Guam 0.000000 Name: Population, dtype: float64
incidents_per_million.sort_values(ascending=False).plot(kind="bar", figsize=(15, 3));
<Figure size 1500x300 with 1 Axes>

Time series operations (Optional)

We will briefly look at some more specific operations for time series data (data with a natural time axis). Typical operations for time series are resampling or rolling window transformations such as filtering. Note that Pandas is not a general digital signal processing library - there are other (more capable) tools for that purpose.

First, we set the index to "timestamp" to make our dataframe inherently time indexed. This will make doing further time operations easier.

incidents_w_time_index = laser_incidents.set_index("timestamp")
incidents_w_time_index
Loading...

Next, we turn the data into a time series of incidents per hour. This can be done by resampling to 1-hour bins and counting the rows in each bin; here we first build a boolean series marking rows with at least one non-NA value and count those.

incidents_hourly = incidents_w_time_index.notna().any(axis="columns").resample("1h").count().rename("incidents per hour")
incidents_hourly
timestamp 2015-01-01 02:00:00 1 2015-01-01 03:00:00 2 2015-01-01 04:00:00 1 2015-01-01 05:00:00 3 2015-01-01 06:00:00 0 .. 2020-08-01 06:00:00 0 2020-08-01 07:00:00 1 2020-08-01 08:00:00 1 2020-08-01 09:00:00 0 2020-08-01 10:00:00 3 Name: incidents per hour, Length: 48945, dtype: int64

Looking at the data at this hourly resolution gives us overly detailed information.

incidents_hourly.sort_index().plot(kind="line", figsize=(15, 3));
<Figure size 1500x300 with 1 Axes>

A daily mean (the result of resampling to 1-day periods and calculating the mean) is already more digestible, though still a bit noisy.

incidents_daily = incidents_hourly.resample("1D").mean()
incidents_daily.plot.line(figsize=(15, 3));
<Figure size 1500x300 with 1 Axes>

We can smooth the data further with a rolling mean, using e.g. a 28-day window.

incidents_daily_filtered = incidents_daily.rolling("28D").mean()
incidents_daily.plot.line(figsize=(15, 3));
incidents_daily_filtered.plot.line(figsize=(15, 3));
<Figure size 1500x300 with 1 Axes>