More tabular data with pandas¶

print view
notebook

  • Accessing data in pandas
  • Data processing and rearrangement
  • Combining data frames

SPARCS medical record data¶

Download this (and, optionally, unzip it): https://mscbio2025-2025.github.io/files/SPARCS_2014_compressed.csv.gz

More information here: https://navigator.med.nyu.edu/ace/sparcs/

Analyzing the data with pandas¶

Note: This is a large dataset, so we have already dropped a number of columns and are reading the data in compressed format (the uncompressed file is 6x larger!)

In [ ]:
import pandas as pd
data = pd.read_csv('../files/SPARCS_2014_compressed.csv.gz')
In [ ]:
data

Format conversion¶

In [ ]:
data.to_numpy()

Data access¶

Columns (and rows) have names that you can use to access them

In [ ]:
data.columns
In [ ]:
data.Gender  # get gender column

Data access¶

In [ ]:
data["CCS Diagnosis Code"][:3] # can't use dot syntax if column name has spaces

[] slices by rows, but indexes by column name - must provide range or it interprets the index as a column label.

In [ ]:
data[:1]

No (default) access by number¶

In [ ]:
data[0] 

.iloc: Position indexing¶

If you want to reference a pandas data frame with position based indexing, use .iloc - works just like numpy

In [ ]:
data.iloc[3,1]
In [ ]:
data.iloc[0,:]

Pandas uses NaN to indicate missing data

.loc: Label indexing¶

You can also index by the label names

Note that rows are indexed by their named index

In [ ]:
data.loc[:3, 'Gender']
In [ ]:
data.loc[[0,3,5],['CCS Diagnosis Description','CCS Diagnosis Code']]

Row slices¶

In [ ]:
data[10:20]

Note: .loc indices are absolute, not relative¶

In [ ]:
data[10:20].iloc[[0,3,5]]
In [ ]:
data[10:20].iloc[[0,3,5]]

Boolean Indexing¶

Just like numpy we can index by a boolean array or an array of indices.

In [ ]:
data[data.Gender == 'M'][:3]
In [ ]:
data.iloc[:,[0,3,5]][:3]

Sorting¶

In [ ]:
data.sort_values(by="Total Charges").head(3)

Is $99,999.65 the most expensive record?¶

In [ ]:
data.sort_values(by="Total Charges",ascending=False).head(3)

String methods¶

Can apply standard string functions to all cells
This returns the changed value; it does not mutate in place

In [ ]:
data['Total Charges'] = data['Total Charges'].str.lstrip('$').astype(float)

The above overwrites the previous Total Charges column to be a floating point number instead of a string with a dollar sign.

Correct sorting¶

In [ ]:
data.sort_values(by="Total Charges",ascending=False).head(3)

Creating new columns¶

In [ ]:
# setting errors='coerce' below will result in NaN values for invalid input
data['Length of Stay'] = pd.to_numeric(data['Length of Stay'], errors='coerce')
data['Length of Stay'] = data['Length of Stay'].fillna(120)
In [ ]:
data['Charge per day'] = data['Total Charges']/data['Length of Stay']
In [ ]:
data

Error handling¶

In [ ]:
pd.to_numeric('120 +',errors='coerce')
In [ ]:
pd.to_numeric('120 +',errors='ignore')
In [ ]:
pd.to_numeric('120 +')

Group by¶

Group records that have the same value for a column

In [ ]:
data.groupby('Facility ID')

We can then apply an aggregating function to these groups.

In [ ]:
# note: make sure mean() is only applied to columns with numerical values, or you will get an error
data.groupby('Facility ID').mean(numeric_only=True).sort_values(by='Total Charges')

Example¶

In [ ]:
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').loc[:,['Total Charges']]

The group by column has become an indexing column. Need to reset_index to convert index to columns.

In [ ]:
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').reset_index().loc[:,['Gender','Total Charges']]

Example¶

In [ ]:
data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean()

unstack pivots a level of the (necessarily hierarchical) index labels¶

In [ ]:
gendercosts = data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean().unstack()
gendercosts
In [ ]:
genderdiffs = gendercosts.F - gendercosts.M
genderdiffs
In [ ]:
genderdiffs.dropna(inplace=True)  #remove rows with NaN, modify genderdiffs in place
genderdiffs
In [ ]:
genderdiffs.sort_values().head(5)
In [ ]:
genderdiffs.sort_values().tail(5)

Combining DataFrames¶

pd.concat concatenates rows (i.e., default axis=0) while merging columns with the same name

In [ ]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
In [ ]:
df1
In [ ]:
df2
In [ ]:
pd.concat([df1,df2])

Join¶

You can join two tables on a specific column (or columns)
Rows that has the same value (or key) in that column will be combined

In [ ]:
left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"],})


right = pd.DataFrame({"key": [ "K1", "K2", "K4"],
                      "C": ["C1", "C2", "C4"],
                      "D": ["D1", "D2", "D4"],})
In [ ]:
left
In [ ]:
right

Inner join¶

Key must exist in both tables

In [ ]:
pd.merge(left,right,on='key',how='inner')

Outer join¶

Key can exist in either table

In [ ]:
pd.merge(left,right,on='key',how='outer')

Left (right) join¶

Key must exist in left (right) table

In [ ]:
pd.merge(left, right, on='key', how='left')
In [ ]:
pd.merge(left, right, on='key', how='right')

Questions¶

Download: https://mscbio2025-2024.github.io/files/SPARCS_2014_compressed.csv.gz

  • How many data records are there?
  • How many coronary bypasses are there?
  • What is the average cost? Standard deviation?
  • What is the most common diagnosis that leads to a coronary bypass?
  • What percent of people with that diagnosis get a coronary bypass?
  • What are the facilities whose average cost for this operation is in the top 10%? Bottom 10%?
  • How correlated is the length of stay to the cost?
  • Is the percentage of people who go to these facilities with the most common diagnosis and receive a coronary bypass significantly different between these two groups?
  • What about knee replacements?
  • How well can a decision tree predict the cost of the operation? What are the most important features?

For next time¶

Python packages for sequence analysis.