Tutorial¶

print view
notebook

Experiment: Arabidopsis thaliana (Thale Cress)¶

Comparing wild type (WT) and atrx-1 mutant plants to analyze how the loss of function of ATRX protein results in changes in gene expression.
The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression.
RNA was isolated from three WT replicates and three mutant replicates.
RNA-seq was carried out on an Illumina Hiseq 2500.
The sequencing reads are paired-end data, hence we have 2 files per replicate.

Create directory in which we will be doing the analysis¶

create directory: mkdir -p tutorial/RNAseq
Go to the directory: cd tutorial/RNAseq

Download the data:¶

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_2.fastq.gz

Get the Genome reference file and associated GTF/GFF file for Arabidopsis.¶

Create a directory for the genome: mkdir Genome
Go to the directory: cd Genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz

Unzip the files¶

gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz
gunzip GCF_000001735.4_TAIR10.1_genomic.gtf.gz
Check if they have been unzipped

Quality Check¶

Create a folder for fastqc files: mkdir fastqcReports
Check the number of cores:
- import os
- os.cpu_count()
fastqc -t 4 -o ./fastqcReports/ ./*.gz

Detour: When more than one QC file: Compile all the results together using multiqc¶

install multiqc: pip3 install multiqc
multiqc -p -o ./fastqcReports/ ./fastqcReports/ --title "Arabidopsis_RNAseq"

Detour: adapter trimming¶

installation: conda install bioconda::trimmomatic
mkdir adapter
cd adapter
download the adapter fasta:
- wget https://github.com/usadellab/Trimmomatic/blob/main/adapters/TruSeq3-PE.fa
apply trimmomatic
trimmomatic PE -threads 4 SRR4420293_1.fastq.gz SRR4420293_2.fastq.gz SRR4420293_1_trimmed_paired.fastq.gz SRR4420293_1_trimmed_unpaired.fastq.gz SRR4420293_2_trimmed_paired.fastq.gz SRR4420293_2_trimmed_unpaired.fastq.gz ILLUMINACLIP:./adapter/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Aligning the reads:¶

Install hisat2: conda install -c bioconda hisat2

Extract splice sites and exons from GTF (optional but recommended)¶

hisat2_extract_splice_sites.py ./Genome/GCF_000001735.4_TAIR10.1_genomic.gtf > ./splicesites.txt
hisat2_extract_exons.py ./Genome/GCF_000001735.4_TAIR10.1_genomic.gtf > ./exons.txt

Build HISAT2 index with splice site information¶

hisat2-build \ -p 4 \ --ss ./splicesites.txt \ --exon ./exons.txt \ ./GCF_000001735.4_TAIR10.1_genomic.fna \ hisat_index/TAIR10

Now run HISAT2 with the index PREFIX to align the samples¶

hisat2 -p 4 \ -x ./genome/hisat_index/TAIR10 \ -1 SRR4420293_1.fastq.gz \ -2 SRR4420293_2.fastq.gz \ -S aligned_reads/SRR4420293.sam \ --summary-file ./aligned_reads/SRR4420293_alignment_summary.txt

install samtools¶

conda install -c bioconda samtools
verify the installation: samtools --version

Convert to sorted BAM¶

samtools view -@ 4 -bS aligned_reads/SRR4420293.sam | \ samtools sort -@ 4 -o aligned_reads/SRR4420293.sorted.bam && \ samtools index aligned_reads/SRR4420293.sorted.bam && \ rm aligned_reads/SRR4420293.sam

install featurecounts¶

conda install -c bioconda subread
verify installation: featureCounts -v

Let's count the features¶

featureCounts \ -a ./genome/GCF_000001735.4_TAIR10.1_genomic.gtf \ -o ./gene_counts/SRR4420293_counts.txt \ -p \ -t exon \ -g gene_id \ -T 4 \ ./aligned_reads/SRR4420293.sorted.bam

Quiz¶

Convert and save the counts txt file(s) into a csv file