How pandas analyze data

Pandas - Introduction

This notebook explains how to use the pandas library for analyzing tabular data.

In [1]:
# Start using pandas (default import convention)
import pandas as pd
import numpy as np
In [2]:
# Let pandas speak for themselves
print(pd.__doc__)
pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.

Visit the official website for well-written documentation: https://pandas.pydata.org

In [3]:
# Current version (2.x at the time of writing)
print(pd.__version__)
2.2.2

Basic objects

The pandas library has a vast API with many useful functions. However, most of it revolves around two important classes:

  • Series
  • DataFrame

In this introduction, we will focus on these two - what each of them does and how they relate to each other and to numpy objects.

Series

Series is a one-dimensional data structure, central to pandas.

For a complete API, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [4]:
# My first series
series = pd.Series([1, 2, 3])
series
Out[4]:
0    1
1    2
2    3
dtype: int64

This looks a bit like a Numpy array, does it not?

Actually, in most cases the Series wraps a Numpy array...

In [5]:
series.values  # The result is a Numpy array
Out[5]:
array([1, 2, 3])

But there is something more. Alongside the values, we see that each item (or "row") has a certain label. The collection of labels is called the index.

In [6]:
series.index
Out[6]:
RangeIndex(start=0, stop=3, step=1)

This index (see below) can be used, as its name suggests, to index items of the series.

In [7]:
# Return an element from the series
series.loc[1]
Out[7]:
2
In [8]:
# Or
series[1]
Out[8]:
2
In [9]:
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab
Out[9]:
a    2
b    4
dtype: int64

Exercise: Create a series with 5 elements.

In [10]:
result = ...
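
One possible solution (just a sketch; any five values and any index will do):

result = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])
result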

DataFrame

A DataFrame is pandas' answer to Excel sheets - it is a collection of named columns (or, in our case, a collection of Series). Quite often, we read data frames directly from an external source, but it is possible to create them from:

  • a dict of Series, numpy arrays or other array-like objects
  • an iterable of rows (where rows can be Series, lists, dictionaries, ...)
In [11]:
# List of lists (no column names)
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(table)
table_df
Out[11]:
0 1
0 a 1
1 b 3
2 c 5
In [12]:
# Dict of Series (with column names)
df = pd.DataFrame({
    'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
    'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df
Out[12]:
number letter
0 1 a
1 2 b
2 3 c
3 4 d
In [13]:
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))

df = pd.DataFrame(data, columns=['a', 'b'])
df
Out[13]:
a b
0 -0.623829 -1.850478
1 -0.541544 -0.816918
2 1.353791 0.470390
3 -0.411840 2.111382
4 1.180242 -0.081106
5 -0.773012 1.676229
6 0.667790 -0.968450
7 -1.076493 0.050577
8 0.920813 -1.891053
9 -0.132454 -0.447746
In [14]:
# A DataFrame also has an index.
df.index
Out[14]:
RangeIndex(start=0, stop=10, step=1)
In [15]:
# ...that is shared by all columns
df.index is df["a"].index
Out[15]:
True
In [16]:
# The columns also form an index.
df.columns
Out[16]:
Index(['a', 'b'], dtype='object')

Exercise: Create a DataFrame whose x column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, \ldots, 2\pi$, whose y column is $\cos(x)$, and whose index consists of the fractions $0, \frac{1}{4}, \frac{1}{2}, \ldots, 2$.

In [18]:
import fractions

index = [fractions.Fraction(n, ___) for n in range(___)]
x = np.___([___ for ___ in ___])
y = ___

df = pd.DataFrame(___, index = ___)

# display
df
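One possible way to fill in the template (a sketch, assuming nine fractions from 0 to 2 in steps of 1/4; other approaches work just as well):

import fractions

# Fractions 0, 1/4, 1/2, ..., 2 (nine values) used as the index
index = [fractions.Fraction(n, 4) for n in range(9)]

# x runs over the corresponding multiples of pi, y is cos(x)
x = np.array([float(f) * np.pi for f in index])
y = np.cos(x)

df = pd.DataFrame({"x": x, "y": y}, index=index)
df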

D(ata) types

Pandas builds upon the numpy data types (mentioned earlier) and adds a few more.

In [19]:
typed_df = pd.DataFrame({
  "bool": np.arange(5) % 2 == 0,
  "int": range(5),
  "int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
  "float": np.arange(5) * 3.14,
  "complex": np.array([1 + 2j, 2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j]),
  "object": [None, 1, "2", [3, 4], 5 + 6j],
  "string?": ["a", "b", "c", "d", "e"],
  "string!": pd.Series(["a", "b", "c", "d", "e"], dtype="string"),
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
  "timedelta": pd.timedelta_range(0, freq="1s", periods=5),
  "category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category"),
  "period": pd.period_range('2018-01-01', periods=5, freq='M'),
})
typed_df
/var/folders/dm/gbbql3p121z0tr22r2z98vy00000gn/T/ipykernel_78670/1417085050.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
Out[19]:
bool int int[nan] float complex object string? string! datetime timedelta category period
0 True 0 <NA> 0.00 1.0+2.0j None a a 2018-01-31 0 days 00:00:00 animal 2018-01
1 False 1 0 3.14 2.0+3.0j 1 b b 2018-04-30 0 days 00:00:01 plant 2018-02
2 True 2 1 6.28 3.0+4.0j 2 c c 2018-07-31 0 days 00:00:02 animal 2018-03
3 False 3 2 9.42 4.0+5.0j [3, 4] d d 2018-10-31 0 days 00:00:03 animal 2018-04
4 True 4 3 12.56 5.0+6.0j (5+6j) e e 2019-01-31 0 days 00:00:04 plant 2018-05
In [20]:
typed_df.dtypes
Out[20]:
bool                    bool
int                    int64
int[nan]               Int64
float                float64
complex           complex128
object                object
string?               object
string!       string[python]
datetime      datetime64[ns]
timedelta    timedelta64[ns]
category            category
period             period[M]
dtype: object

We will see some of these types used in practice in the further analysis.

Indices & indexing

In [21]:
abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series
Out[21]:
a    0
b    1
c    2
dtype: int64
In [22]:
abc_series.index
Out[22]:
Index(['a', 'b', 'c'], dtype='object')
In [23]:
abc_series.index = ["c", "d", "e"]  # Changes the labels in-place!
abc_series.index.name = "letter"
abc_series
Out[23]:
letter
c    0
d    1
e    2
dtype: int64
In [24]:
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(
    table,
    index=["first", "second", "third"],
    columns=["alpha", "beta"]
)
table_df
Out[24]:
alpha beta
first a 1
second b 3
third c 5
In [25]:
alpha = table_df["alpha"]  # Simple [] indexing in DataFrame returns Series
alpha
Out[25]:
first     a
second    b
third     c
Name: alpha, dtype: object
In [26]:
alpha["second"]             # Simple [] indexing in Series returns scalar values.
Out[26]:
'b'

Indexing with a ["list", "of", "columns"] yields a DataFrame with those columns.

For example:

In [ ]:
table_df[["beta", "alpha"]]

[["column_name"]] returs a DataFrame as well, not Series:

In [28]:
table_df[["alpha"]]
Out[28]:
alpha
first a
second b
third c

There are two ways to properly index rows & cells in a DataFrame:

  • loc for label-based indexing
  • iloc for position-based indexing (it does not use the index labels at all)

Note the square brackets: these attributes are not methods but special "indexer" objects. They accept one or two arguments specifying the position along one or both axes.

loc
In [29]:
first = table_df.loc["first"]
first
Out[29]:
alpha    a
beta     1
Name: first, dtype: object
In [30]:
table_df.loc["first", "beta"]
Out[30]:
1
In [31]:
table_df.loc["first":"second", "beta"]   # Use ranges (inclusive)
Out[31]:
first     1
second    3
Name: beta, dtype: int64
iloc
In [32]:
table_df.iloc[1]
Out[32]:
alpha    b
beta     3
Name: second, dtype: object
In [33]:
table_df.iloc[0:4:2]   # Select every second row
Out[33]:
alpha beta
first a 1
third c 5
In [34]:
table_df.at["first", "beta"]
Out[34]:
1
In [35]:
type(table_df.at)
Out[35]:
pandas.core.indexing._AtIndexer

Modifying DataFrames

Adding a new column is like adding a key/value pair to a dict. Note that this operation, unlike most others, modifies the DataFrame in place.

In [36]:
from datetime import datetime
table_df["now"] = datetime.now()
table_df
Out[36]:
alpha beta now
first a 1 2024-04-25 12:45:05.181057
second b 3 2024-04-25 12:45:05.181057
third c 5 2024-04-25 12:45:05.181057

A non-destructive version that returns a new DataFrame uses the assign method:

In [37]:
table_df.assign(delta = [True, False, True])
Out[37]:
alpha beta now delta
first a 1 2024-04-25 12:45:05.181057 True
second b 3 2024-04-25 12:45:05.181057 False
third c 5 2024-04-25 12:45:05.181057 True
In [38]:
# However, the original DataFrame is not changed
table_df
Out[38]:
alpha beta now
first a 1 2024-04-25 12:45:05.181057
second b 3 2024-04-25 12:45:05.181057
third c 5 2024-04-25 12:45:05.181057

Deleting a column is very easy too.

In [39]:
del table_df["now"]
table_df
Out[39]:
alpha beta
first a 1
second b 3
third c 5

The drop method works with both rows and columns, returning a new DataFrame.

In [40]:
table_df.drop("beta", axis=1)
Out[40]:
alpha
first a
second b
third c
In [41]:
table_df.drop("second", axis=0)
Out[41]:
alpha beta
first a 1
third c 5

Exercise: Use a combination of reset_index, drop and set_index to transform table_df into pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])

In [ ]:
result = table_df.___.___.___

# display
result
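
One possible chain (a sketch; it assumes table_df currently has just the alpha and beta columns):

# reset_index turns the current index into an "index" column,
# drop removes the unneeded beta column,
# set_index makes alpha the new index
result = table_df.reset_index().drop(columns="beta").set_index("alpha")
result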

Let's get some real data!

I/O in pandas

Pandas can read (and write to) a huge variety of file formats. More details can be found in the official documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Most of the functions for reading data are named pandas.read_XXX, where XXX is the format used. We will look at three commonly used ones.

In [43]:
# List functions for input in pandas.

print("\n".join(method for method in dir(pd) if method.startswith("read_")))
read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_orc
read_parquet
read_pickle
read_sas
read_spss
read_sql
read_sql_query
read_sql_table
read_stata
read_table
read_xml

Read CSV

Nowadays, a lot of data comes in the textual comma-separated values (CSV) format. Although not properly standardized, it is the de-facto standard for files that are not huge and are meant to be readable by human eyes too.

Let's read the population of U.S. states that we will need later:

In [45]:
territories = pd.read_csv("data/us_state_population.csv")
territories.head(9)
Out[45]:
Territory Population Population 2010 Code
0 California 39029342.0 37253956 CA
1 Texas 30029572.0 25145561 TX
2 Florida 22244823.0 18801310 FL
3 New York 19677151.0 19378102 NY
4 Pennsylvania 12972008.0 12702379 PA
5 Illinois 12582032.0 12830632 IL
6 Ohio 11756058.0 11536504 OH
7 Georgia 10912876.0 9687653 GA
8 North Carolina 10698973.0 9535483 NC

The automatic data type parsing converts columns to appropriate types:

In [46]:
territories.dtypes
Out[46]:
Territory           object
Population         float64
Population 2010      int64
Code                object
dtype: object

Sometimes the CSV input does not work out of the box. Although pandas automatically understands and reads zipped files, it usually does not automatically infer the file format and its variations - for details, see the read_csv documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [48]:
pd.read_csv('data/iris.tsv.gz')
Out[48]:
Sepal length\tSepal width\tPetal length\tPetal width\tSpecies
0 5.1\t3.5\t1.4\t0.2\tI. setosa
1 4.9\t3.0\t1.4\t0.2\tI. setosa
2 4.7\t3.2\t1.3\t0.2\tI. setosa
3 4.6\t3.1\t1.5\t0.2\tI. setosa
4 5.0\t3.6\t1.4\t0.2\tI. setosa

...in this case, the CSV file does not use commas to separate values. Therefore, we need to specify an extra argument:

In [49]:
pd.read_csv("data/iris.tsv.gz", sep='\t')
Out[49]:
Sepal length Sepal width Petal length Petal width Species
0 5.1 3.5 1.4 0.2 I. setosa
1 4.9 3.0 1.4 0.2 I. setosa
2 4.7 3.2 1.3 0.2 I. setosa
3 4.6 3.1 1.5 0.2 I. setosa
4 5.0 3.6 1.4 0.2 I. setosa

See the difference?

Read Excel

Let's read the list of U.S. incidents in which lasers interfered with airplanes.

In [51]:
pd.read_excel("data/laser_incidents_2019.xlsx")
Out[51]:
Incident Date Incident Time Flight ID Aircraft Altitude Airport Laser Color Injury City State
0 2019-01-01 35 N3EG C421 2500 SAT Green No San Antonio Texas
1 2019-01-01 43 RPA3469 E75S 4000 IAH Green No Houston Texas
2 2019-01-01 44 UAL1607 A319 4000 IAH Green No Houston Texas
3 2019-01-01 110 N205TM BE20 2500 HDC Green No Hammond Louisiana
4 2019-01-01 115 JIA5233 CRJ9 2000 JAX Green No Jacksonville Florida
... ... ... ... ... ... ... ... ... ... ...
6131 2019-12-31 845 ASH5861 CRJ9 3000 JAN Green No Jackson Mississippi
6132 2019-12-31 929 N22P CRUZ 2500 HNL Green No Honolulu Hawaii
6133 2019-12-31 2310 GTH530 GLF4 500 SJU White No Carolina Puerto Rico
6134 2019-12-31 2312 AMF6916 SW4 600 SJU Green No Carolina Puerto Rico
6135 2019-12-31 2327 N715TH C172 3000 CHO Green No Charlottesville Virginia

6136 rows × 10 columns

Note: This reads just the first sheet from the file. If you want to extract more sheets, you will need to use the pandas.ExcelFile class (or the sheet_name argument of read_excel). See the relevant part of the documentation.
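
A sketch of how that might look (our example workbook has only a single sheet, so the calls below are purely illustrative):

# Open the workbook once and list its sheet names
excel_file = pd.ExcelFile("data/laser_incidents_2019.xlsx")
print(excel_file.sheet_names)

# Parse one particular sheet into a DataFrame
first_sheet = excel_file.parse(excel_file.sheet_names[0])

# Alternatively, sheet_name=None reads all sheets into a dict of DataFrames
all_sheets = pd.read_excel("data/laser_incidents_2019.xlsx", sheet_name=None)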

Read HTML (Optional)

Pandas is able to scrape data from tables embedded in web pages using the read_html function. This might or might not bring good results, and you will probably have to tweak your data frame manually. But it is a good starting point - much better than being forced to parse the HTML yourself!

In [53]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_laser_types")
type(tables), len(tables)
Out[53]:
(list, 9)
In [54]:
tables[1]
Out[54]:
Laser gain medium and type Operation wavelength(s) Pump source Applications and notes
0 Helium–neon laser 632.8 nm (543.5 nm, 593.9 nm, 611.8 nm, 1.1523... Electrical discharge Interferometry, holography, spectroscopy, barc...
1 Argon laser 454.6 nm, 488.0 nm, 514.5 nm (351 nm, 363.8, 4... Electrical discharge Retinal phototherapy (for diabetes), lithograp...
2 Krypton laser 416 nm, 530.9 nm, 568.2 nm, 647.1 nm, 676.4 nm... Electrical discharge Scientific research, mixed with argon to creat...
3 Xenon ion laser Many lines throughout visible spectrum extendi... Electrical discharge Scientific research.
4 Nitrogen laser 337.1 nm Electrical discharge Pumping of dye lasers, measuring air pollution...
5 Carbon dioxide laser 10.6 μm, (9.4 μm) Transverse (high-power) or longitudinal (low-p... Material processing (laser cutting, laser beam...
6 Carbon monoxide laser 2.6 to 4 μm, 4.8 to 8.3 μm Electrical discharge Material processing (engraving, welding, etc.)...
7 Excimer laser 157 nm (F2), 193.3 nm (ArF), 248 nm (KrF), 308... Excimer recombination via electrical discharge Ultraviolet lithography for semiconductor manu...
In [55]:
tables[2]
Out[55]:
Laser gain medium and type Operation wavelength(s) Pump source Applications and notes
0 Hydrogen fluoride laser 2.7 to 2.9 μm for hydrogen fluoride (<80% atmo... Chemical reaction in a burning jet of ethylene... Used in research for laser weaponry, operated ...
1 Deuterium fluoride laser ~3800 nm (3.6 to 4.2 μm) (~90% atm. transmitta... chemical reaction US military laser prototypes.
2 COIL (chemical oxygen–iodine laser) 1.315 μm (<70% atmospheric transmittance) Chemical reaction in a jet of singlet delta ox... Military lasers, scientific and materials rese...
3 Agil (All gas-phase iodine laser) 1.315 μm (<70% atmospheric transmittance) Chemical reaction of chlorine atoms with gaseo... Scientific, weaponry, aerospace.

Write CSV

Pandas is able to write to many different formats, and the usage is similar across them.

In [56]:
tables[1].to_csv("gas_lasers.csv", index=False)

Data analysis (very basics)

Let's extend the laser incidents data to a broader time range and read it from a summary CSV file:

In [58]:
laser_incidents_raw = pd.read_csv("data/laser_incidents_2015-2020.csv")

Let's see what we have here...

In [59]:
laser_incidents_raw.head()
Out[59]:
Unnamed: 0 Incident Date Incident Time Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
0 0 2020-01-01 148.0 N424RP DA42/A 8500.0 SBA green False Santa Barbara California 2020-01-01 01:48:00
1 1 2020-01-01 155.0 AMF1829 B190 40000.0 SSF green False San Antonio Texas 2020-01-01 01:55:00
2 2 2020-01-01 214.0 NKS1881 A320 2500.0 TPA green False Tampa Florida 2020-01-01 02:14:00
3 3 2020-01-01 217.0 FDX3873 B763 3000.0 DFW green False Fort Worth Texas 2020-01-01 02:17:00
4 4 2020-01-01 218.0 SWA3635 B739 11000.0 MOD green False Modesto California 2020-01-01 02:18:00
In [60]:
laser_incidents_raw.tail()
Out[60]:
Unnamed: 0 Incident Date Incident Time Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
36458 36458 2015-12-31 525.0 VRD917 A320 (AIRBUS - A-32 8000.0 LAS green False Las Vegas Nevada 2015-12-31 05:25:00
36459 36459 2015-12-31 623.0 DAL2371 B738 (BOEING - 737- 11000.0 LHM green False Lincoln California 2015-12-31 06:23:00
36460 36460 2015-12-31 1111.0 Unknown Unknown 2000.0 FOK green False Westhampton Beach New York 2015-12-31 11:11:00
36461 36461 2015-12-31 1147.0 UAL197 B737 300.0 GUM green False Guam Guam 2015-12-31 11:47:00
36462 36462 2015-12-31 2314.0 EJA336 E55P/L 1000.0 APF green False Naples Florida 2015-12-31 23:14:00

For an unknown, potentially unevenly distributed dataset, looking at the beginning / end is typically not the best idea. We'd rather sample randomly:

In [61]:
# Show a few examples
laser_incidents_raw.sample(10)
Out[61]:
Unnamed: 0 Incident Date Incident Time Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
13749 13749 2018-10-23 306.0 DAL356 B738 8000.0 MSP green False Minneapolis Minnesota 2018-10-23 03:06:00
14397 14397 2018-12-04 17.0 LBQ784 PC12/G 15000.0 JAX green False Jacksonville Florida NaN
11645 11645 2018-06-01 601.0 AAL301 B738 6000.0 SLC red False Salt Lake City Utah 2018-06-01 06:01:00
5772 5772 2019-06-20 355.0 PFT574 C560 5000.0 LAS blue False Las Vegas Nevada 2019-06-20 03:55:00
22561 22561 2016-02-08 600.0 UAL249 Unknown 13000.0 T41 green False LA Porte Texas 2016-02-08 06:00:00
18574 18574 2017-08-11 445.0 DAL2974 B712 4000.0 PDX green False Portland Oregon 2017-08-11 04:45:00
14960 14960 2018-12-31 623.0 SWA2042 B737 6000.0 HWD green False Hayward California 2018-12-31 06:23:00
24922 24922 2016-06-17 524.0 N522LG COL4 7000.0 ELP green False El Paso Texas 2016-06-17 05:24:00
31425 31425 2015-06-10 213.0 RPA4266 E170 2500.0 IND green False Indianapolis Indiana 2015-06-10 02:13:00
668 668 2020-02-07 248.0 SWA1251 B737 10500.0 SAT green and red False San Antonio Texas 2020-02-07 02:48:00
In [62]:
laser_incidents_raw.dtypes
Out[62]:
Unnamed: 0         int64
Incident Date     object
Incident Time    float64
Flight ID         object
Aircraft          object
Altitude         float64
Airport           object
Laser Color       object
Injury            object
City              object
State             object
timestamp         object
dtype: object

The topic of data cleaning and pre-processing is very broad. We will limit ourselves to dropping unused columns and converting one to a proper type.

In [63]:
# The first three are not needed
laser_incidents = laser_incidents_raw.drop(columns=laser_incidents_raw.columns[:3])

# We convert the timestamp
laser_incidents = laser_incidents.assign(
    timestamp = pd.to_datetime(laser_incidents["timestamp"])
)
laser_incidents
Out[63]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
0 N424RP DA42/A 8500.0 SBA green False Santa Barbara California 2020-01-01 01:48:00
1 AMF1829 B190 40000.0 SSF green False San Antonio Texas 2020-01-01 01:55:00
2 NKS1881 A320 2500.0 TPA green False Tampa Florida 2020-01-01 02:14:00
3 FDX3873 B763 3000.0 DFW green False Fort Worth Texas 2020-01-01 02:17:00
4 SWA3635 B739 11000.0 MOD green False Modesto California 2020-01-01 02:18:00
... ... ... ... ... ... ... ... ... ...
36458 VRD917 A320 (AIRBUS - A-32 8000.0 LAS green False Las Vegas Nevada 2015-12-31 05:25:00
36459 DAL2371 B738 (BOEING - 737- 11000.0 LHM green False Lincoln California 2015-12-31 06:23:00
36460 Unknown Unknown 2000.0 FOK green False Westhampton Beach New York 2015-12-31 11:11:00
36461 UAL197 B737 300.0 GUM green False Guam Guam 2015-12-31 11:47:00
36462 EJA336 E55P/L 1000.0 APF green False Naples Florida 2015-12-31 23:14:00

36463 rows × 9 columns

In [64]:
laser_incidents.dtypes
Out[64]:
Flight ID              object
Aircraft               object
Altitude              float64
Airport                object
Laser Color            object
Injury                 object
City                   object
State                  object
timestamp      datetime64[ns]
dtype: object
Categorical dtype (Optional)

To analyze Laser Color, we can look at its typical values.

In [65]:
laser_incidents["Laser Color"].describe()
Out[65]:
count     36461
unique       73
top       green
freq      32787
Name: Laser Color, dtype: object

Not too many different values.

In [66]:
laser_incidents["Laser Color"].unique()
Out[66]:
array(['green', 'purple', 'blue', 'unknown', 'red', 'white',
       'green and white', 'white and green', 'green and yellow',
       'multiple', 'unknwn', 'green and purple', 'green and red',
       'red and green', 'green and blue', 'blue and purple',
       'red white and blue', 'blue and green', 'blue or purple',
       'blue or green', 'yellow/orange', 'blue/purple', 'unkwn', 'orange',
       'multi', 'yellow and white', 'blue and white', 'white or amber',
       'red and white', 'yellow', 'amber', 'yellow and green',
       'white and blue', 'red, blue, and green', 'purple-blue',
       'red and blue', 'magenta', 'phx', 'green or blue', 'red or green',
       'green or red', 'green, blue or purple', 'blue and red', 'unkn',
       'blue-green', 'multi-colored', nan, 'blue-yellow',
       'white or green', 'green and orange', 'white-green-red',
       'multicolored', 'green-white', 'blue or white', 'green red blue',
       'green or white', 'blue -green', 'green-red', 'green-blue',
       'multi-color', 'green-yellow', 'red-white', 'blue-purple',
       'white-yellow', 'green-purple', 'lavender', 'orange-red',
       'blue-white', 'blue-red', 'yellow-white', 'red-green',
       'white-green', 'white-blue', 'white-red'], dtype=object)
In [67]:
laser_incidents["Laser Color"].value_counts(normalize=True)
Out[67]:
Laser Color
green             0.899235
blue              0.046790
red               0.012260
white             0.010395
unkn              0.009051
                    ...   
red or green      0.000027
white or green    0.000027
blue-yellow       0.000027
multi-colored     0.000027
white-red         0.000027
Name: proportion, Length: 73, dtype: float64

This column is a very good candidate for conversion to the pandas-specific Categorical data type. (See https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

In [68]:
laser_incidents["Laser Color"].memory_usage(deep=True)   # ~60 bytes per item
Out[68]:
2261216
In [69]:
color_category = laser_incidents["Laser Color"].astype("category")
color_category.sample(10)
Out[69]:
16027    green
17144    green
18133    green
6309     green
26757    green
6907     green
36042    green
21744    green
34013    green
17587    green
Name: Laser Color, dtype: category
Categories (73, object): ['amber', 'blue', 'blue -green', 'blue and green', ..., 'yellow and green', 'yellow and white', 'yellow-white', 'yellow/orange']
In [70]:
color_category.memory_usage(deep=True)              # ~1-2 bytes per item
Out[70]:
43668

Exercise: Are there any other columns in the dataset that you would suggest for conversion to categorical?
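
One way to look for candidates (a sketch): compare the number of unique values in each column with the total number of rows. Columns with only a handful of distinct values, such as State or Injury, benefit the most from the category dtype.

# Number of unique values per column vs. the total number of rows
print(laser_incidents.nunique())
print(len(laser_incidents))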

Integer vs. float

Pandas is generally quite good at guessing (inferring) number types. You may wonder why Altitude is float and not int, though. This is a consequence of NumPy not having an integer NaN. There have been many discussions about this.
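
A minimal illustration of the underlying issue: as soon as a NaN appears among integers, NumPy (and therefore pandas, when it uses NumPy dtypes) has to fall back to a floating-point dtype.

# A single NaN forces the whole array to float64
np.array([1, 2, np.nan]).dtype   # dtype('float64')

# The same happens for a pandas Series backed by a NumPy array
pd.Series([1, 2, np.nan]).dtype  # dtype('float64')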

In [71]:
laser_incidents["Altitude"]
Out[71]:
0         8500.0
1        40000.0
2         2500.0
3         3000.0
4        11000.0
          ...   
36458     8000.0
36459    11000.0
36460     2000.0
36461      300.0
36462     1000.0
Name: Altitude, Length: 36463, dtype: float64
In [72]:
laser_incidents["Altitude"].astype(int)
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[72], line 1
----> 1 laser_incidents["Altitude"].astype(int)

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
   6637     results = [
   6638         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6639     ]
   6641 else:
   6642     # else, only a single dtype is given
-> 6643     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6644     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6645     return res.__finalize__(self, method="astype")

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
    427 elif using_copy_on_write():
    428     copy = False
--> 430 return self.apply(
    431     "astype",
    432     dtype=dtype,
    433     copy=copy,
    434     errors=errors,
    435     using_cow=using_copy_on_write(),
    436 )

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755         raise ValueError("Can not squeeze with more than one column.")
    756     values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234     dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:101, in _astype_nansafe(arr, dtype, copy, skipna)
     96     return lib.ensure_string_array(
     97         arr, skipna=skipna, convert_na_value=False
     98     ).reshape(shape)
    100 elif np.issubdtype(arr.dtype, np.floating) and dtype.kind in "iu":
--> 101     return _astype_float_to_int_nansafe(arr, dtype, copy)
    103 elif arr.dtype == object:
    104     # if we have a datetime/timedelta array of objects
    105     # then coerce to datetime64[ns] and use DatetimeArray.astype
    107     if lib.is_np_dtype(dtype, "M"):

File ~/mambaforge/envs/python-fjfi/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:145, in _astype_float_to_int_nansafe(values, dtype, copy)
    141 """
    142 astype with a check preventing converting NaN to an meaningless integer value.
    143 """
    144 if not np.isfinite(values).all():
--> 145     raise IntCastingNaNError(
    146         "Cannot convert non-finite values (NA or inf) to integer"
    147     )
    148 if dtype.kind == "u":
    149     # GH#45151
    150     if not (values >= 0).all():

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

More recently, pandas introduced nullable types for working with missing data, for example the nullable integer type Int64.

In [73]:
laser_incidents["Altitude"].astype("Int64")
Out[73]:
0         8500
1        40000
2         2500
3         3000
4        11000
         ...  
36458     8000
36459    11000
36460     2000
36461      300
36462     1000
Name: Altitude, Length: 36463, dtype: Int64

Filtering

Indexing in pandas Series / DataFrames ([]) also supports boolean (mask) arrays. These arrays can be obtained by applying boolean operations to Series or DataFrames.

You can also use standard comparison operators like <, <=, ==, >=, >, !=.

It is possible to perform logical operations on boolean series too. You need to use the |, &, ^ operators though, not the and, or, not keywords.

As an example, find all California incidents:

In [74]:
is_california = laser_incidents.State == "California"
is_california.sample(10)
Out[74]:
2330     False
8288     False
16691    False
35831    False
26067    False
21937    False
22719    False
6498     False
24045     True
11533    False
Name: State, dtype: bool

Now we can directly apply the boolean mask. (Note: this is no magic - you can construct the mask yourself.)

In [75]:
laser_incidents[is_california].sample(10)
Out[75]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
10611 N377YG EVSS 2000.0 SNA green False Santa Ana California 2018-03-19 04:53:00
29171 N254CA GLF4 4000.0 SNA green False Santa Ana California 2015-01-05 02:36:00
29266 QXE472 DH8D (DE HAVILLAND 8000.0 STS green False Santa Rosa California 2015-01-12 05:39:00
31416 VVT5423 UH60 4000.0 SNA green False Santa Ana California 2015-06-09 03:36:00
4202 JBU471 A320 2000.0 SJC green False San Jose California 2019-03-09 04:46:00
544 SWR41 B77W 50000.0 LAX green False Los Angeles California 2020-01-30 03:31:00
22330 N2121V PA28 4000.0 ONT green False Ontario California 2016-01-29 04:20:00
27317 N615PG E135 6000.0 SBP green False San Luis Obispo California 2016-10-26 02:00:00
32297 SWA4201 B737 5000.0 SAN green False San Diego California 2015-07-25 04:44:00
28747 SKW5786 CRJ2 3800.0 BUR green False Burbank California 2016-12-17 01:52:00

Or maybe we should include the whole West coast?

In [76]:
# isin takes an array of possible values
west_coast = laser_incidents[laser_incidents.State.isin(["California", "Oregon", "Washington"])]
west_coast.sample(10)
Out[76]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
8140 DAL8945 B752 2000.0 LAX green False Los Angeles California 2019-11-02 07:16:00
11670 JBU687 A321 10500.0 SMO green False Santa Monica California 2018-06-03 03:30:00
16188 N01J R44 500.0 LGB green False Long Beach California 2017-03-10 01:50:00
31950 SHERFF2 HELO 12000.0 SEA green False Seattle Washington 2015-07-11 04:30:00
14821 SKW3258 CRJ7 6000.0 LGB green False Long Beach California 2018-12-23 05:48:00
24680 AAL95 B739 9000.0 SAN green False San Diego California 2016-05-31 03:09:00
4749 JCM615 H25B 23000.0 DAG green False Daggett California 2019-04-08 06:16:00
32642 UAL1247 B739/L 15000.0 OXR green False Oxnard California 2015-08-10 04:09:00
33940 REH1 EC35 1200.0 STS green True Santa Rosa California 2015-10-05 03:10:00
1127 N816KW C172 2500.0 BFL green False Bakersfield California 2020-03-05 02:25:00

Or low-altitude incidents?

In [77]:
laser_incidents[laser_incidents.Altitude < 300]
Out[77]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
71 AAL633 A21N 0.0 ELP unknown False El Paso Texas 2020-01-04 04:02:00
267 17223 C172 200.0 SRQ green False Sarasota Florida 2020-01-14 01:12:00
400 N106NK C172 0.0 ADS red False Addison Texas 2020-01-21 20:49:00
613 FDX57 DC10 100.0 BQN green False Aguadilla Puerto Rico 2020-02-03 01:50:00
1066 CR6562 HELI 200.0 PBI green False West Palm Beach Florida 2020-03-01 05:14:00
... ... ... ... ... ... ... ... ... ...
35801 N80298 C172 200.0 MIA red False Miami Florida NaT
35892 N488SR C525 160.0 DUA unkn False Durant Oklahoma 2015-12-11 01:35:00
36089 UPS1337 B763 170.0 LEX green False Lexington Kentucky 2015-12-16 03:51:00
36156 UPS1295 A306 170.0 LEX green False Lexington Kentucky 2015-12-18 04:54:00
36206 NKS631 A320 172.0 TDZ green False Toledo Ohio 2015-12-19 23:53:00

274 rows × 9 columns
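
Boolean masks can also be combined; here is a short sketch using & for "and" and ~ for negation (the exact filter is made up for illustration):

# California incidents below 1000 ft that did not involve a green laser
mask = (
    (laser_incidents.State == "California")
    & (laser_incidents.Altitude < 1000)
    & ~(laser_incidents["Laser Color"] == "green")
)
laser_incidents[mask]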

Visualization intermezzo

Without much further ado, let's create our first plot.

In [78]:
# Most frequent states
laser_incidents["State"].value_counts()[:20]
Out[78]:
State
California        7268
Texas             3620
Florida           2702
Arizona           1910
Colorado           988
Washington         982
Kentucky           952
Illinois           946
New York           921
Puerto Rico        912
Oregon             895
Tennessee          888
Nevada             837
Pennsylvania       826
Indiana            812
Utah               789
Ohio               750
Georgia            714
North Carolina     605
Missouri           547
Name: count, dtype: int64
In [79]:
laser_incidents["State"].value_counts()[:20].plot(kind="bar");

Sorting

In [80]:
# Display 5 incidents with the highest altitude
laser_incidents.sort_values("Altitude", ascending=False).head(5)
Out[80]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
21173 ROU1628 B763 240000.0 PBI green False West Palm Beach Florida 2017-12-04 11:49:00
12017 UPS797 A306 125000.0 ABQ green False Albuquerque New Mexico 2018-06-30 03:15:00
27807 LSFD1 EC 100000.0 SJC blue False San Jose California 2016-11-13 02:53:00
21049 ASQ5334 CRJ7 100000.0 RDU green False Raleigh North Carolina 2017-12-01 01:32:00
27785 ASH6193 CRJ7 98400.0 IND green False Indianapolis Indiana 2016-11-12 23:11:00
In [81]:
# Alternative
laser_incidents.nlargest(5, "Altitude")
Out[81]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
21173 ROU1628 B763 240000.0 PBI green False West Palm Beach Florida 2017-12-04 11:49:00
12017 UPS797 A306 125000.0 ABQ green False Albuquerque New Mexico 2018-06-30 03:15:00
21049 ASQ5334 CRJ7 100000.0 RDU green False Raleigh North Carolina 2017-12-01 01:32:00
27807 LSFD1 EC 100000.0 SJC blue False San Jose California 2016-11-13 02:53:00
27785 ASH6193 CRJ7 98400.0 IND green False Indianapolis Indiana 2016-11-12 23:11:00

Exercise: Find the last 3 incidents with a blue laser.
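
One possible solution (a sketch), combining a boolean filter with sorting by the timestamp:

blue = laser_incidents[laser_incidents["Laser Color"] == "blue"]
blue.sort_values("timestamp").tail(3)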

Arithmetic and string manipulation

Standard arithmetic operators work on numerical columns too, and so do mathematical functions. Note that all such operations are performed element-wise, in a vectorized fashion.

In [82]:
altitude_meters = laser_incidents["Altitude"] * .3048
altitude_meters.sample(10)
Out[82]:
21880    2438.4
7556     1828.8
5124      762.0
1619     2286.0
28230    1066.8
8119     1066.8
4171      609.6
20169    4267.2
5014     2743.2
22018    1219.2
Name: Altitude, dtype: float64

You may mix columns and scalars; string arithmetic also works as expected.

In [83]:
laser_incidents["City"] + ", " + laser_incidents["State"]
Out[83]:
0          Santa Barbara, California
1                 San Antonio, Texas
2                     Tampa, Florida
3                 Fort Worth , Texas
4                Modesto, California
                    ...             
36458              Las Vegas, Nevada
36459            Lincoln, California
36460    Westhampton Beach, New York
36461                     Guam, Guam
36462                Naples, Florida
Length: 36463, dtype: object

Summary statistics

The describe method shows summary statistics for all the columns:

In [84]:
laser_incidents.describe()
Out[84]:
Altitude timestamp
count 36218.000000 33431
mean 7358.314264 2017-08-31 03:32:36.253776384
min 0.000000 2015-01-01 02:00:00
25% 2500.000000 2016-03-25 06:09:30
50% 5000.000000 2017-08-01 04:10:00
75% 9700.000000 2019-01-14 17:07:00
max 240000.000000 2020-08-01 10:49:00
std 7642.686712 NaN
In [87]:
laser_incidents.describe(include="all")
Out[87]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp
count 36451 36411 36218.000000 36450 36461 36445 36460 36457 33431
unique 24788 1731 NaN 2019 73 2 2254 73 NaN
top UNKN B737 NaN LAX green False Phoenix California NaN
freq 49 3817 NaN 988 32787 36261 1157 7268 NaN
mean NaN NaN 7358.314264 NaN NaN NaN NaN NaN 2017-08-31 03:32:36.253776384
min NaN NaN 0.000000 NaN NaN NaN NaN NaN 2015-01-01 02:00:00
25% NaN NaN 2500.000000 NaN NaN NaN NaN NaN 2016-03-25 06:09:30
50% NaN NaN 5000.000000 NaN NaN NaN NaN NaN 2017-08-01 04:10:00
75% NaN NaN 9700.000000 NaN NaN NaN NaN NaN 2019-01-14 17:07:00
max NaN NaN 240000.000000 NaN NaN NaN NaN NaN 2020-08-01 10:49:00
std NaN NaN 7642.686712 NaN NaN NaN NaN NaN NaN
In [88]:
laser_incidents["Altitude"].mean()
Out[88]:
7358.314263625822
In [89]:
laser_incidents["Altitude"].std()
Out[89]:
7642.6867120945535
In [90]:
laser_incidents["Altitude"].max()
Out[90]:
240000.0

Basic string operations (Optional)

These are typically accessed using the .str "accessor" of the Series like this:

  • series.str.lower
  • series.str.split
  • series.str.startswith
  • series.str.contains
  • ...

See more in the documentation.

In [96]:
laser_incidents[laser_incidents["City"].str.contains("City", na=False)]["City"].unique()
Out[96]:
array(['Panama City', 'Oklahoma City', 'Salt Lake City', 'Bullhead City',
       'Garden City', 'Atlantic City', 'Panama City  ', 'New York City',
       'Jefferson City', 'Kansas City', 'Rapid City', 'Tremont City',
       'Boulder City', 'Traverse City', 'Cross City', 'Brigham City',
       'Carson City', 'Midland City', 'Johnson City', 'Ponca City',
       'Panama City Beach', 'Sioux City', 'Bay City', 'Silver City',
       'Pueblo City', 'Iowa City', 'Calvert City', 'Crescent City',
       'Oak City', 'Falls City', 'Salt Lake City ', 'Royse City',
       'Kansas City ', 'Bossier City', 'Baker City', 'Ellwood City',
       'Dodge City', 'Garden City ', 'Union City', 'King City',
       'Kansas City    ', 'Mason City', 'Plant City   ', 'Lanai City',
       'Tell City', 'Yuba City', 'Kansas City  ', 'Salt Lake City    ',
       'Kansas City   ', 'Ocean City', 'Cedar City', 'City of Commerce',
       'Lake City', 'Beach City', 'Alexander City', 'Siler City',
       'Charles City', 'Malad City ', 'Rush City', 'Webster City',
       'Plant City'], dtype=object)
In [97]:
laser_incidents[laser_incidents["City"].str.contains("City", na=False)]["City"].str.strip().unique()
Out[97]:
array(['Panama City', 'Oklahoma City', 'Salt Lake City', 'Bullhead City',
       'Garden City', 'Atlantic City', 'New York City', 'Jefferson City',
       'Kansas City', 'Rapid City', 'Tremont City', 'Boulder City',
       'Traverse City', 'Cross City', 'Brigham City', 'Carson City',
       'Midland City', 'Johnson City', 'Ponca City', 'Panama City Beach',
       'Sioux City', 'Bay City', 'Silver City', 'Pueblo City',
       'Iowa City', 'Calvert City', 'Crescent City', 'Oak City',
       'Falls City', 'Royse City', 'Bossier City', 'Baker City',
       'Ellwood City', 'Dodge City', 'Union City', 'King City',
       'Mason City', 'Plant City', 'Lanai City', 'Tell City', 'Yuba City',
       'Ocean City', 'Cedar City', 'City of Commerce', 'Lake City',
       'Beach City', 'Alexander City', 'Siler City', 'Charles City',
       'Malad City', 'Rush City', 'Webster City'], dtype=object)

Merging data

It is a common situation to have two or more datasets with different columns that we need to bring together. This operation is called merging, and the pandas machinery for it is described in great detail in the documentation.

In our case, we would like to attach the state populations to the dataset.

In [98]:
population = pd.read_csv("data/us_state_population.csv")
population
Out[98]:
Territory Population Population 2010 Code
0 California 39029342.0 37253956 CA
1 Texas 30029572.0 25145561 TX
2 Florida 22244823.0 18801310 FL
3 New York 19677151.0 19378102 NY
4 Pennsylvania 12972008.0 12702379 PA
5 Illinois 12582032.0 12830632 IL
6 Ohio 11756058.0 11536504 OH
7 Georgia 10912876.0 9687653 GA
8 North Carolina 10698973.0 9535483 NC
9 Michigan 10034113.0 9883640 MI
10 New Jersey 9261699.0 8791894 NJ
11 Virginia 8683619.0 8001024 VA
12 Washington 7785786.0 6724540 WA
13 Arizona 7359197.0 6392017 AZ
14 Tennessee 7051339.0 6346105 TN
15 Massachusetts 6981974.0 6547629 MA
16 Indiana 6833037.0 6483802 IN
17 Missouri 6177957.0 5988927 MO
18 Maryland 6164660.0 5773552 MD
19 Wisconsin 5892539.0 5686986 WI
20 Colorado 5839926.0 5029196 CO
21 Minnesota 5717184.0 5303925 MN
22 South Carolina 5282634.0 4625364 SC
23 Alabama 5074296.0 4779736 AL
24 Louisiana 4590241.0 4533372 LA
25 Kentucky 4512310.0 4339367 KY
26 Oregon 4240137.0 3831074 OR
27 Oklahoma 4019800.0 3751351 OK
28 Connecticut 3626205.0 3574097 CT
29 Utah 3380800.0 2763885 UT
30 Puerto Rico 3221789.0 3725789 PR
31 Iowa 3200517.0 3046355 IA
32 Nevada 3177772.0 2700551 NV
33 Arkansas 3045637.0 2915918 AR
34 Mississippi 2940057.0 2967297 MS
35 Kansas 2937150.0 2853118 KS
36 New Mexico 2113344.0 2059179 NM
37 Nebraska 1967923.0 1826341 NE
38 Idaho 1939033.0 1567582 ID
39 West Virginia 1775156.0 1852994 WV
40 Hawaii 1440196.0 1360301 HI
41 New Hampshire 1395231.0 1316470 NH
42 Maine 1385340.0 1328361 ME
43 Montana 1122867.0 989415 MT
44 Rhode Island 1093734.0 1052567 RI
45 Delaware 1018396.0 897934 DE
46 South Dakota 909824.0 814180 SD
47 North Dakota 779261.0 672591 ND
48 Alaska 733583.0 710231 AK
49 District of Columbia 671803.0 601723 DC
50 Vermont 647064.0 625741 VT
51 Wyoming 581381.0 563626 WY
52 Guam NaN 159358 GU
53 U.S. Virgin Islands NaN 106405 VI
54 American Samoa NaN 55519 AS
55 Northern Mariana Islands NaN 53883 MP

We will of course use the state name as the merge key. Before actually doing the merge, we can explore a bit whether all state names from the laser incidents dataset are present in our population table.

In [99]:
unknown_states = laser_incidents.loc[~laser_incidents["State"].isin(population["Territory"]), "State"]
print(f"There are {unknown_states.count()} rows with unknown states.")
print(f"Unknown state values are: \n{list(unknown_states.unique())}.")
There are 82 rows with unknown states.
Unknown state values are: 
[nan, 'Virgin Islands', 'Miami', 'North Hampshire', 'Marina Islands', 'Teas', 'Mexico', 'DC', 'VA', 'Northern Marina Islands', 'Mariana Islands', 'Oho', 'Northern Marianas Is', 'UNKN', 'Massachussets', 'FLorida', 'D.C.', 'MIchigan', 'Northern Mariana Is', 'Micronesia'].

We could certainly clean the data by correcting some of the typos. Since the number of rows with unknown states is small (compared to the length of the whole dataset), we will deliberately not fix the state names. Instead, we will remove those rows from the merged dataset by using the inner type of merge. All the merge types - left, inner, outer and right - are well explained by the schema below:

merge types
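
A tiny illustration of the difference between merge types on made-up data (a sketch):

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# inner keeps only keys present in both frames, outer keeps everything
pd.merge(left, right, on="key", how="inner")   # just "b"
pd.merge(left, right, on="key", how="outer")   # "a", "b", "c" with NaNs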

We can use the merge function to add the "Population" values.

In [103]:
laser_incidents_w_population = pd.merge(
    laser_incidents, population, left_on="State", right_on="Territory", how="inner"
)
In [104]:
laser_incidents_w_population
Out[104]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp Territory Population Population 2010 Code
0 N424RP DA42/A 8500.0 SBA green False Santa Barbara California 2020-01-01 01:48:00 California 39029342.0 37253956 CA
1 AMF1829 B190 40000.0 SSF green False San Antonio Texas 2020-01-01 01:55:00 Texas 30029572.0 25145561 TX
2 NKS1881 A320 2500.0 TPA green False Tampa Florida 2020-01-01 02:14:00 Florida 22244823.0 18801310 FL
3 FDX3873 B763 3000.0 DFW green False Fort Worth Texas 2020-01-01 02:17:00 Texas 30029572.0 25145561 TX
4 SWA3635 B739 11000.0 MOD green False Modesto California 2020-01-01 02:18:00 California 39029342.0 37253956 CA
... ... ... ... ... ... ... ... ... ... ... ... ... ...
36370 VRD917 A320 (AIRBUS - A-32 8000.0 LAS green False Las Vegas Nevada 2015-12-31 05:25:00 Nevada 3177772.0 2700551 NV
36371 DAL2371 B738 (BOEING - 737- 11000.0 LHM green False Lincoln California 2015-12-31 06:23:00 California 39029342.0 37253956 CA
36372 Unknown Unknown 2000.0 FOK green False Westhampton Beach New York 2015-12-31 11:11:00 New York 19677151.0 19378102 NY
36373 UAL197 B737 300.0 GUM green False Guam Guam 2015-12-31 11:47:00 Guam NaN 159358 GU
36374 EJA336 E55P/L 1000.0 APF green False Naples Florida 2015-12-31 23:14:00 Florida 22244823.0 18801310 FL

36375 rows × 13 columns

In [106]:
laser_incidents_w_population.describe(include="all")
Out[106]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State timestamp Territory Population Population 2010 Code
count 36363 36323 36137.000000 36365 36374 36359 36374 36375 33361 36375 3.634300e+04 3.637500e+04 36375
unique 24735 1726 NaN 2009 73 2 2239 54 NaN 54 NaN NaN 54
top UNKN B737 NaN LAX green False Phoenix California NaN California NaN NaN CA
freq 49 3811 NaN 988 32715 36177 1156 7268 NaN 7268 NaN NaN 7268
mean NaN NaN 7363.934333 NaN NaN NaN NaN NaN 2017-08-31 14:04:42.552681472 NaN 1.679960e+07 1.542564e+07 NaN
min NaN NaN 0.000000 NaN NaN NaN NaN NaN 2015-01-01 02:00:00 NaN 5.813810e+05 1.064050e+05 NaN
25% NaN NaN 2500.000000 NaN NaN NaN NaN NaN 2016-03-26 02:47:00 NaN 5.282634e+06 4.779736e+06 NaN
50% NaN NaN 5000.000000 NaN NaN NaN NaN NaN 2017-08-02 02:40:00 NaN 1.069897e+07 9.687653e+06 NaN
75% NaN NaN 9800.000000 NaN NaN NaN NaN NaN 2019-01-15 04:00:00 NaN 3.002957e+07 2.514556e+07 NaN
max NaN NaN 240000.000000 NaN NaN NaN NaN NaN 2020-08-01 10:49:00 NaN 3.902934e+07 3.725396e+07 NaN
std NaN NaN 7645.507063 NaN NaN NaN NaN NaN NaN NaN 1.378730e+07 1.287534e+07 NaN

Grouping & aggregation

A common pattern in data analysis is grouping (or binning) data based on some property and getting some aggregate statistics.

Example: Group this workshop's participants by nationality and get the cardinality (the size) of each group.

Possibly the simplest grouping and aggregation is the value_counts method, which groups by the respective column value and yields the count (or normalized frequency) of each unique value in the data.

In [107]:
laser_incidents_w_population["State"].value_counts(normalize=False)
Out[107]:
State
California              7268
Texas                   3620
Florida                 2702
Arizona                 1910
Colorado                 988
Washington               982
Kentucky                 952
Illinois                 946
New York                 921
Puerto Rico              912
Oregon                   895
Tennessee                888
Nevada                   837
Pennsylvania             826
Indiana                  812
Utah                     789
Ohio                     750
Georgia                  714
North Carolina           605
Missouri                 547
Minnesota                531
New Jersey               519
Michigan                 505
Hawaii                   500
Alabama                  473
Virginia                 412
Oklahoma                 412
New Mexico               401
Louisiana                351
Massachusetts            346
South Carolina           306
Maryland                 255
Idaho                    237
Arkansas                 237
Wisconsin                207
Iowa                     200
Connecticut              185
District of Columbia     183
Kansas                   172
Mississippi              156
Montana                  134
Nebraska                 112
West Virginia            108
North Dakota              92
New Hampshire             86
Rhode Island              81
Alaska                    67
Maine                     66
South Dakota              52
Delaware                  43
Guam                      31
Vermont                   28
Wyoming                   22
U.S. Virgin Islands        1
Name: count, dtype: int64

This is just a primitive grouping and aggregation operation; we will look into more advanced patterns. Let us say we would like to get some statistics for individual states. We can group the dataset by the "State" column:

In [108]:
grouped_by_state = laser_incidents_w_population.groupby("State")

What did we get?

In [109]:
grouped_by_state
Out[109]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16d45faf0>

What is this DataFrameGroupBy object? Its use case is:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.

Let's try a simple aggregate: the mean of altitude for each state:

In [110]:
grouped_by_state["Altitude"].mean().sort_values()
Out[110]:
State
Puerto Rico              3552.996703
Hawaii                   4564.536585
Florida                  4970.406773
Alaska                   5209.848485
Wisconsin                5529.951220
New York                 5530.208743
Guam                     5800.000000
Maryland                 6071.739130
District of Columbia     6087.144444
New Jersey               6204.306950
Illinois                 6306.310566
Massachusetts            6473.763848
Texas                    6487.493759
Delaware                 6602.380952
Arizona                  6678.333158
Nevada                   6730.037485
California               6919.705613
Washington               7110.687629
Louisiana                7276.276353
Nebraska                 7277.321429
Michigan                 7330.459082
Oregon                   7411.285231
South Dakota             7419.607843
North Dakota             7455.434783
Ohio                     7482.409880
Pennsylvania             7518.614724
Connecticut              7519.562842
Vermont                  7610.714286
Idaho                    7636.756410
Oklahoma                 7678.803440
Montana                  7780.620155
Virginia                 7903.889976
Rhode Island             8186.875000
Minnesota                8191.869811
South Carolina           8593.535948
Kansas                   8661.994152
Indiana                  8664.055693
Maine                    8733.333333
Alabama                  8821.210191
Mississippi              8828.685897
Tennessee                8987.354402
North Carolina           9251.180763
New Hampshire            9591.764706
Utah                     9892.935197
Iowa                    10174.619289
Missouri                10548.161468
New Mexico              10714.706030
U.S. Virgin Islands     11000.000000
Georgia                 11130.663854
Arkansas                11203.483051
Colorado                11301.869388
Kentucky                11583.086225
West Virginia           12108.386792
Wyoming                 18238.095238
Name: Altitude, dtype: float64

What if we were to group by year? We don't have a year column, but we can extract the year from the timestamp and use it for the groupby.

In [111]:
grouped_by_year = laser_incidents_w_population.groupby(laser_incidents_w_population["timestamp"].dt.year)

You may have noticed how we extracted the year using the .dt accessor. We will use .dt even more below.

Let's calculate the mean altitude of laser incidents per year. Are the lasers getting more powerful? 🤔

In [112]:
mean_altitude_per_year = grouped_by_year["Altitude"].mean().sort_index()
mean_altitude_per_year
Out[112]:
timestamp
2015.0    6564.621830
2016.0    7063.288912
2017.0    7420.971064
2018.0    7602.049323
2019.0    8242.586268
2020.0    8618.242465
Name: Altitude, dtype: float64

We can also quickly plot the results; more on plotting in the next lessons.

In [113]:
mean_altitude_per_year.plot(kind="bar");

Exercise: Calculate the sum of injuries per year. Use the fact that True + True = 2 ;)
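
One possible solution (a sketch; it assumes the Injury column holds boolean values, as the data preview suggests):

# Booleans sum as integers, so summing counts the injuries per year.
# Comparing with True also maps missing values to False.
injuries = laser_incidents_w_population["Injury"] == True
injuries.groupby(laser_incidents_w_population["timestamp"].dt.year).sum()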

We can also create a new Series (if the corresponding column does not exist in the dataframe) and group it by another Series, which in this case is a column from the dataframe. What is important is that the grouped series and the grouping series share the same index.

In [114]:
# how many incidents per million inhabitants are there for each state?
incidents_per_million = (1_000_000 / laser_incidents_w_population["Population"]).groupby(laser_incidents_w_population["State"]).sum()
incidents_per_million.sort_values(ascending=False)
Out[114]:
State
Hawaii                  347.174968
Puerto Rico             283.072541
District of Columbia    272.401284
Nevada                  263.392087
Arizona                 259.539186
Utah                    233.376716
Oregon                  211.078085
Kentucky                210.978412
New Mexico              189.746676
California              186.218871
Colorado                169.180226
Washington              126.127279
Tennessee               125.933528
Idaho                   122.225872
Florida                 121.466464
Texas                   120.547839
Montana                 119.337375
Indiana                 118.834422
North Dakota            118.060573
Oklahoma                102.492661
Alabama                  93.214901
Minnesota                92.877892
Alaska                   91.332542
Missouri                 88.540597
Arkansas                 77.816234
Louisiana                76.466573
Illinois                 75.186584
Rhode Island             74.058226
Georgia                  65.427299
Ohio                     63.796895
Pennsylvania             63.675570
Iowa                     62.489904
New Hampshire            61.638539
West Virginia            60.839723
Kansas                   58.560169
South Carolina           57.925648
South Dakota             57.153911
Nebraska                 56.912796
North Carolina           56.547484
New Jersey               56.037235
Mississippi              53.060196
Connecticut              51.017524
Michigan                 50.328315
Massachusetts            49.556186
Maine                    47.641734
Virginia                 47.445656
New York                 46.805556
Vermont                  43.272381
Delaware                 42.223261
Maryland                 41.364812
Wyoming                  37.840934
Wisconsin                35.129169
U.S. Virgin Islands       0.000000
Guam                      0.000000
Name: Population, dtype: float64
In [115]:
incidents_per_million.sort_values(ascending=False).plot(kind="bar", figsize=(15, 3));

Time series operations (Optional)

We will briefly look at some more specific operations for time series data (data with a natural time axis). Typical operations for time series are resampling and rolling window transformations such as filtering. Note that pandas is not a general digital signal processing library - there are other (more capable) tools for this purpose.

First, we set the index to "timestamp" to make our dataframe inherently time-indexed. This will make further time operations easier.

In [116]:
incidents_w_time_index = laser_incidents.set_index("timestamp")
incidents_w_time_index
Out[116]:
Flight ID Aircraft Altitude Airport Laser Color Injury City State
timestamp
2020-01-01 01:48:00 N424RP DA42/A 8500.0 SBA green False Santa Barbara California
2020-01-01 01:55:00 AMF1829 B190 40000.0 SSF green False San Antonio Texas
2020-01-01 02:14:00 NKS1881 A320 2500.0 TPA green False Tampa Florida
2020-01-01 02:17:00 FDX3873 B763 3000.0 DFW green False Fort Worth Texas
2020-01-01 02:18:00 SWA3635 B739 11000.0 MOD green False Modesto California
... ... ... ... ... ... ... ... ...
2015-12-31 05:25:00 VRD917 A320 (AIRBUS - A-32 8000.0 LAS green False Las Vegas Nevada
2015-12-31 06:23:00 DAL2371 B738 (BOEING - 737- 11000.0 LHM green False Lincoln California
2015-12-31 11:11:00 Unknown Unknown 2000.0 FOK green False Westhampton Beach New York
2015-12-31 11:47:00 UAL197 B737 300.0 GUM green False Guam Guam
2015-12-31 23:14:00 EJA336 E55P/L 1000.0 APF green False Naples Florida

36463 rows × 8 columns

Next, we turn the data into a time series of incidents per hour. This can be done by resampling to 1-hour bins and using count to count the number of incidents (here applied to a boolean series marking rows that have at least one non-NA value).

In [117]:
incidents_hourly = incidents_w_time_index.notna().any(axis="columns").resample("1H").count().rename("incidents per hour")
incidents_hourly
/var/folders/dm/gbbql3p121z0tr22r2z98vy00000gn/T/ipykernel_78670/3245646514.py:1: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
  incidents_hourly = incidents_w_time_index.notna().any(axis="columns").resample("1H").count().rename("incidents per hour")
Out[117]:
timestamp
2015-01-01 02:00:00    1
2015-01-01 03:00:00    2
2015-01-01 04:00:00    1
2015-01-01 05:00:00    3
2015-01-01 06:00:00    0
                      ..
2020-08-01 06:00:00    0
2020-08-01 07:00:00    1
2020-08-01 08:00:00    1
2020-08-01 09:00:00    0
2020-08-01 10:00:00    3
Name: incidents per hour, Length: 48945, dtype: int64

Looking at these data gives us a bit too much detail.

In [118]:
incidents_hourly.sort_index().plot(kind="line", figsize=(15, 3));

A daily mean, obtained by resampling to 1-day periods and calculating the mean, is already more digestible, though still a bit noisy.

In [119]:
incidents_daily = incidents_hourly.resample("1D").mean()
incidents_daily.plot.line(figsize=(15, 3));

We can look at the data filtered by a rolling mean with, e.g., a 28-day window.

In [120]:
incidents_daily_filtered = incidents_daily.rolling("28D").mean()
incidents_daily.plot.line(figsize=(15, 3));
incidents_daily_filtered.plot.line(figsize=(15, 3));
