More tabular data with pandas¶
- Accessing data in pandas
- Data processing and rearrangement
- Combining data frames
SPARCS medical record data¶
Download this (and, optionally, unzip it): https://mscbio2025-2025.github.io/files/SPARCS_2014_compressed.csv.gz
More information here: https://navigator.med.nyu.edu/ace/sparcs/
Analyzing the data with pandas¶
Note: This is a large dataset, so we have already dropped a number of columns and are reading the data in compressed format (the uncompressed file is 6x larger!)
import pandas as pd
data = pd.read_csv('../files/SPARCS_2014_compressed.csv.gz')
data
Format conversion¶
data.to_numpy()
Data access¶
Columns (and rows) have names that you can use to access them
data.columns
data.Gender # get gender column
Data access¶
data["CCS Diagnosis Code"][:3] # can't use dot syntax if column name has spaces
[] slices by rows but indexes by column name; you must provide a range, or pandas interprets the index as a column label.
data[:1]
No (default) access by number¶
data[0] # raises KeyError: 0 is interpreted as a column name, not a row position
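A minimal sketch of this behavior on a hypothetical toy frame (not the SPARCS data): an integer inside `[]` is looked up as a column label and raises `KeyError`, while a slice selects rows.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})  # toy frame for illustration

try:
    df[0]              # looks for a COLUMN named 0, which does not exist
    result = "ok"
except KeyError:
    result = "KeyError"

row = df[0:1]          # a slice, however, selects by row position
```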
.iloc: Position indexing¶
If you want to reference a pandas data frame with position based indexing, use .iloc
- works just like numpy
data.iloc[3,1]
data.iloc[0,:]
Pandas uses NaN to indicate missing data
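A quick sketch of how NaN shows up and how to handle it, on a hypothetical toy Series: `isna()` flags missing values and `fillna()` replaces them.

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])  # toy data with one missing value

mask = s.isna()       # boolean mask: True where the value is missing
filled = s.fillna(0)  # returns a new Series with NaN replaced by a default
```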
.loc: Label indexing¶
You can also index by label names.
Note that rows are indexed by their named index, and .loc slices include the end label.
data.loc[:3, 'Gender']
data.loc[[0,3,5],['CCS Diagnosis Description','CCS Diagnosis Code']]
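One difference worth seeing on a hypothetical toy frame: a `.loc` slice includes the end label (which is why `data.loc[:3, 'Gender']` returns four rows), while a `.iloc` slice excludes the end position, as in numpy.

```python
import pandas as pd

df = pd.DataFrame({"Gender": list("MFMF"), "Age": [30, 40, 50, 60]})

by_label = df.loc[:2, "Gender"]  # label slice INCLUDES the end label 2 -> 3 rows
by_pos = df.iloc[:2, 0]          # position slice excludes position 2 -> 2 rows
```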
Row slices¶
data[10:20]
Note: .loc indices are absolute, not relative¶
data[10:20].iloc[[0,3,5]]
Boolean Indexing¶
Just like numpy, we can index by a boolean array or an array of indices.
data[data.Gender == 'M'][:3]
data.iloc[:,[0,3,5]][:3]
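Boolean conditions can also be combined, a point worth noting before moving on: use `&` and `|` with parentheses (not Python's `and`/`or`). A sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"],
                   "Total Charges": [100.0, 250.0, 80.0, 300.0]})

# each condition must be parenthesized; combine with & (and) or | (or)
subset = df[(df.Gender == "F") & (df["Total Charges"] > 100)]
```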
Sorting¶
data.sort_values(by="Total Charges").head(3)
Is $99,999.65 the most expensive record?¶
data.sort_values(by="Total Charges",ascending=False).head(3)
String methods¶
You can apply standard string functions to every cell through the .str accessor.
These return a new Series; they do not mutate in place.
data['Total Charges'] = data['Total Charges'].str.lstrip('$').astype(float)
The above overwrites the previous Total Charges column to be a floating point number instead of a string with a dollar sign.
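The same strip-and-convert pattern, sketched on a hypothetical toy Series so the before/after is visible: `.str.lstrip('$')` returns a new Series and leaves the original unchanged.

```python
import pandas as pd

charges = pd.Series(["$120.50", "$99.99"])        # strings with dollar signs
as_float = charges.str.lstrip("$").astype(float)  # a NEW Series of floats
```

The original `charges` still holds strings; only an explicit assignment (as in the cell above) overwrites a column.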
Correct sorting¶
data.sort_values(by="Total Charges",ascending=False).head(3)
Creating new columns¶
# setting errors='coerce' below will result in NaN values for invalid input
data['Length of Stay'] = pd.to_numeric(data['Length of Stay'], errors='coerce')
data['Length of Stay'] = data['Length of Stay'].fillna(120)
data['Charge per day'] = data['Total Charges']/data['Length of Stay']
data
Error handling¶
pd.to_numeric('120 +',errors='coerce') # invalid input becomes NaN
pd.to_numeric('120 +',errors='ignore') # invalid input returned unchanged (deprecated in recent pandas versions)
pd.to_numeric('120 +') # default errors='raise': raises ValueError
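The coerce-then-fill pattern used for Length of Stay above, sketched on a hypothetical toy Series: invalid entries become NaN, which we then replace with a chosen value.

```python
import pandas as pd

stays = pd.Series(["5", "120 +", "12"])       # "120 +" is not a valid number
nums = pd.to_numeric(stays, errors="coerce")  # invalid entries -> NaN
nums = nums.fillna(120)                       # then impute a value for them
```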
Group by¶
Group records that have the same value for a column
data.groupby('Facility ID')
We can then apply an aggregating function to these groups.
# note: make sure mean() is only applied to columns with numerical values, or you will get an error
data.groupby('Facility ID').mean(numeric_only=True).sort_values(by='Total Charges')
Example¶
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').loc[:,['Total Charges']]
The group-by column has become an indexing column. Use reset_index to convert the index back into columns.
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').reset_index().loc[:,['Gender','Total Charges']]
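Groups are not limited to a single aggregate like mean(); agg() computes several statistics at once. A sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "F"],
                   "Total Charges": [100.0, 200.0, 300.0, 400.0]})

# one row per group, one column per aggregating function
stats = df.groupby("Gender")["Total Charges"].agg(["mean", "std", "count"])
```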
Example¶
data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean()
unstack pivots a level of the (necessarily hierarchical) index labels¶
gendercosts = data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean().unstack()
gendercosts
genderdiffs = gendercosts.F - gendercosts.M
genderdiffs
genderdiffs.dropna(inplace=True) #remove rows with NaN, modify genderdiffs in place
genderdiffs
genderdiffs.sort_values().head(5)
genderdiffs.sort_values().tail(5)
Combining DataFrames¶
pd.concat concatenates rows (i.e., default axis=0) while aligning columns that share the same name
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"B": ["B2", "B3", "B6", "B7"],
"D": ["D2", "D3", "D6", "D7"],
"F": ["F2", "F3", "F6", "F7"],
},
index=[2, 3, 6, 7],
)
df1
df2
pd.concat([df1,df2])
Join¶
You can join two tables on a specific column (or columns).
Rows that have the same value (the key) in that column are combined.
left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],})
right = pd.DataFrame({"key": [ "K1", "K2", "K4"],
"C": ["C1", "C2", "C4"],
"D": ["D1", "D2", "D4"],})
left
right
Inner join¶
Key must exist in both tables
pd.merge(left,right,on='key',how='inner')
Outer join¶
Key can exist in either table
pd.merge(left,right,on='key',how='outer')
Left (right) join¶
Key must exist in left (right) table
pd.merge(left, right, on='key', how='left')
pd.merge(left, right, on='key', how='right')
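When it is unclear which table each merged row came from, merge's `indicator=True` option adds a `_merge` column recording the provenance. A sketch on hypothetical toy frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right = pd.DataFrame({"key": ["K1", "K2"], "C": ["C1", "C2"]})

# _merge is 'left_only', 'right_only', or 'both' for each row
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
```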
Questions¶
Download: https://mscbio2025-2024.github.io/files/SPARCS_2014_compressed.csv.gz
- How many data records are there?
- How many coronary bypasses are there?
- What is the average cost? Standard deviation?
- What is the most common diagnosis that leads to a coronary bypass?
- What percent of people with that diagnosis get a coronary bypass?
- What are the facilities whose average cost for this operation is in the top 10%? Bottom 10%?
- How correlated is the length of stay to the cost?
- Among people with the most common diagnosis at these facilities, is the percentage who receive a coronary bypass significantly different between the two groups?
- What about knee replacements?
- How well can a decision tree predict the cost of the operation? What are the most important features?
For next time¶
Python packages for sequence analysis.