RNA sequencing II¶

print view
notebook

Today ... lets go through an example¶

Simple analysis of gene expression data from samples with healthy and apoptotic cells

Dataset¶

  • Gene expression count matrix for healthy vs apoptotic cells

RNA-Seq analysis¶

  • Input: gene expression count matrix for healthy vs apoptotic cells
  • Analysis:
    • Quality control
    • Variance stabilization and normalization
    • Differential expression analysis
    • Result visualization
In [64]:
import pandas as pd
gene_df = pd.read_csv('https://mscbio2025-2025.github.io/files/data.csv',index_col=0);
gene_df.head()
Out[64]:
healthy_1 healthy_2 healthy_3 healthy_4 healthy_5 healthy_6 healthy_7 healthy_8 healthy_9 healthy_10 ... apoptotic_11 apoptotic_12 apoptotic_13 apoptotic_14 apoptotic_15 apoptotic_16 apoptotic_17 apoptotic_18 apoptotic_19 apoptotic_20
GAPDH 23 20 13 21 14 16 13 17 17 25 ... 16 14 29 20 27 21 20 26 14 10
ACTB 18 24 22 18 25 19 23 13 16 15 ... 10 20 12 25 16 16 15 20 10 14
RPL13A 15 22 16 18 19 26 17 20 21 37 ... 28 26 21 18 16 15 40 18 25 21
EEF1A1 33 16 16 7 28 22 17 24 17 23 ... 34 19 13 23 15 17 26 22 18 28
HPRT1 22 29 17 16 26 29 16 13 14 17 ... 24 16 23 16 28 14 17 20 23 23

5 rows × 40 columns

Quiz¶

  • How many samples are there
  • How many genes are there
  • How many healthy samples and how many apoptotic samples are there

Quality control and filtering¶

  • Essential steps to ensure that only high-quality is used for downstream analysis
  • E.g., keep only those genes with at least 10 counts in at least 2 samples
In [66]:
# Filtering
qc_filter = (gene_df > 10).sum(axis=1) >= 2
filtered_df = gene_df[qc_filter]

QUIZ¶

  • Were any genes removed?
    • If no, why not
    • If yes, which ones

Normalization and variance stabilization¶

  • To ensure that observed differences in gene expression reflect true biological variation, rather than technical bias or noise.
  • Normalization ensures that differences in normalized read counts reflect genuine differences in gene expression
  • Stabilization ensures highly expressed genes do not dominate analyses as genes with higher expression have higher variance.

varianceStabilization.png

In [67]:
# Normalization and Variance Stabilization

# Counts per Million normalization (Better approach TMM -- next time)
cpm = filtered_df.div(filtered_df.sum(axis=0), axis=1) * 1e6

# Variance stabilization (Better approach VST -- next time)
log_cpm = np.log1p(cpm)

# 
log_cpm.head()
Out[67]:
healthy_1 healthy_2 healthy_3 healthy_4 healthy_5 healthy_6 healthy_7 healthy_8 healthy_9 healthy_10 ... apoptotic_11 apoptotic_12 apoptotic_13 apoptotic_14 apoptotic_15 apoptotic_16 apoptotic_17 apoptotic_18 apoptotic_19 apoptotic_20
GAPDH 10.815460 10.731331 10.258002 10.810320 10.268394 10.543120 10.198412 10.615662 10.613267 10.961359 ... 9.830059 9.827904 10.549514 10.079074 10.462391 10.360278 10.140516 10.396554 9.806935 9.423614
ACTB 10.570343 10.913649 10.784081 10.656172 10.848197 10.714966 10.768941 10.347406 10.552644 10.450545 ... 9.360087 10.184563 9.667162 10.302209 9.939163 10.088354 9.852847 10.134199 9.470484 9.760064
RPL13A 10.388027 10.826640 10.465635 10.656172 10.573766 11.028618 10.466667 10.778177 10.824571 11.353396 ... 10.389651 10.446918 10.226751 9.973718 9.939163 10.023818 10.833643 10.028843 10.386729 10.165509
EEF1A1 11.176467 10.508193 10.465635 9.711748 10.961524 10.861567 10.466667 10.960496 10.613267 10.877979 ... 10.583802 10.133272 9.747200 10.218831 9.874628 10.148976 10.402871 10.229506 10.058237 10.453182
HPRT1 10.771009 11.102888 10.526258 10.538392 10.887417 11.137816 10.406045 10.347406 10.419116 10.575705 ... 10.235506 9.961429 10.317719 9.855941 10.498758 9.954828 9.978004 10.134199 10.303350 10.256478

5 rows × 40 columns

In [68]:
# Differential Expression Testing
import numpy as np
from scipy.stats import nbinom, ttest_ind


# Generate label groupings
group_labels = np.array(["healthy"] * n_healthy + ["apoptotic"] * n_apoptotic)

# Check
pvals = []
logFC = []
for gene in filtered_df.index:
    grp_healthy = log_cpm.loc[gene, group_labels == "healthy"]
    grp_apoptotic = log_cpm.loc[gene, group_labels == "apoptotic"]
    t_stat, p = ttest_ind(grp_healthy, grp_apoptotic, equal_var=False)
    pvals.append(p)
    # Log2 fold change: apoptotic - healthy
    logFC.append(grp_apoptotic.mean() - grp_healthy.mean())

P-value¶

pvalue.png

In [69]:
# Multiple testing correction (Benjamini-Hochberg FDR)

from statsmodels.stats.multitest import multipletests

adj_pvals = multipletests(pvals, method="fdr_bh")[1]

# Results
results = pd.DataFrame({
    "gene": filtered_df.index,
    "log2FC": logFC,
    "pval": pvals,
    "padj": adj_pvals
}).set_index("gene")
results.head(20)
Out[69]:
log2FC pval padj
gene
GAPDH -0.444492 1.282722e-04 0.000564
ACTB -0.564394 4.491119e-06 0.000035
RPL13A -0.629518 4.597034e-07 0.000010
EEF1A1 -0.500151 4.774840e-06 0.000035
HPRT1 -0.508990 2.328428e-05 0.000128
CASP3 0.095433 2.024409e-01 0.318121
CASP8 0.284745 8.543167e-04 0.002685
BAX 0.059504 5.758734e-01 0.603296
BCL2 0.219998 4.148730e-02 0.076060
FAS 0.248024 4.447152e-03 0.010871
FASLG 0.149073 2.365633e-02 0.047313
XIAP 0.127083 4.583502e-02 0.077567
TRADD 0.053356 4.696941e-01 0.603296
TNFRSF10B 0.015661 8.365509e-01 0.836551
JAK1 0.059407 5.402031e-01 0.603296
JAK3 0.099942 2.691609e-01 0.394769
STAT4 0.049702 5.516172e-01 0.603296
IRF9 0.278713 1.287967e-02 0.028335
CHMP1A 0.271356 1.413113e-03 0.003886
CHMP5 -0.084359 3.388759e-01 0.465954
In [56]:
#  Volcano Plot

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
colors = []
for idx, row in results.iterrows():
    if row["log2FC"] > 0.2 and row["padj"] < 0.01:
        colors.append("red")
    elif row["log2FC"] < -0.2 and row["padj"] < 0.01:
        colors.append("blue")
    else:
        colors.append("grey")
plt.scatter(results["log2FC"], -np.log10(results["padj"]), 
            c=colors, alpha=0.7)

# Dashed vertical lines at log2FC threshold
plt.axvline(x=0.2, color='black', linestyle='dashed', linewidth=1)
plt.axvline(x=-0.2, color='black', linestyle='dashed', linewidth=1)

# Dashed horizontal line at -log10(0.01)
plt.axhline(y=2, color='black', linestyle='dashed', linewidth=1)

plt.xlabel("Log2 fold change (apoptotic vs healthy)")
plt.ylabel("-log10 adjusted p-value")
plt.title("Volcano plot: Apoptosis DE Genes")
# optionally annotate significant genes as before
for gene in results.index:
    if abs(results.loc[gene, "log2FC"]) > 2 and results.loc[gene, "padj"] < 0.01:
        plt.text(results.loc[gene, "log2FC"], -np.log10(results.loc[gene, "padj"]), gene, fontsize=9)
plt.show()
No description has been provided for this image
In [63]:
# Heatmap

import seaborn as sns

# healthy (green), apoptotic (magenta)
sample_type = pd.Series(["healthy"] * n_healthy + ["apoptotic"] * n_apoptotic, 
                        index=log_cpm.columns)
col_colors = sample_type.map({"healthy": "green", "apoptotic": "magenta"})

top_genes = results.sort_values("padj").head(10).index
data = log_cpm.loc[top_genes]

sns.clustermap(data, 
               cmap="vlag", 
               col_colors=col_colors,
               row_cluster=True,       # enable row clustering and dendrogram
               col_cluster=True,       # enable column clustering and dendrogram
               xticklabels=False, 
               yticklabels=top_genes,
               cbar_kws={"label": "Log1p-CPM"},
               figsize=(10, 6))
plt.suptitle("Heatmap: Top 10 DE Genes", y=1.05)
Out[63]:
Text(0.5, 1.05, 'Heatmap: Top 10 DE Genes')
No description has been provided for this image

Next time --> More RNA-Seq¶