Tutorial¶
Experiment: Arabidopsis thaliana (Thale Cress)¶
Comparing wild type (WT) and atrx-1 mutant plants to analyze how the loss of function of ATRX protein results in changes in gene expression.
The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression.
RNA was isolated from three WT replicates and three mutant replicates.
RNA-seq was carried out on an Illumina Hiseq 2500.
The sequencing reads are paired-end data, hence we have 2 files per replicate.
Create directory in which we will be doing the analysis¶
- create directory: mkdir -p tutorial/RNAseq
- Go to the directory: cd tutorial/RNAseq
Download the data:¶
- wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
- wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_2.fastq.gz
- wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_1.fastq.gz
- wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/006/SRR4420296/SRR4420296_2.fastq.gz
Get the Genome reference file and associated GTF/GFF file for Arabidopsis.¶
- Create a directory for the genome: mkdir Genome
- Go to the directory: cd Genome
- wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz
- wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz
Unzip the files¶
- gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz
- gunzip GCF_000001735.4_TAIR10.1_genomic.gtf.gz
- Check if they have been unzipped
Quality Check¶
- Create a folder for fastqc files: mkdir fastqcReports
- Check the number of cores:
- import os
- os.cpu_count()
- fastqc -t 4 -o ./fastqcReports/ ./*.gz
Detour: When more than one QC file: Compile all the results together using multiqc¶
- install multiqc: pip3 install multiqc
- multiqc -p -o ./fastqcReports/ ./fastqcReports/ --title "Arabidopsis_RNAseq"
Detour: adapter trimming¶
- installation: conda install bioconda::trimmomatic
- mkdir adapter
- cd adapter
- download the adapter fasta:
- apply trimmomatic
trimmomatic PE -threads 4
SRR4420293_1.fastq.gz SRR4420293_2.fastq.gz
SRR4420293_1_trimmed_paired.fastq.gz SRR4420293_1_trimmed_unpaired.fastq.gz
SRR4420293_2_trimmed_paired.fastq.gz SRR4420293_2_trimmed_unpaired.fastq.gz
ILLUMINACLIP:./adapter/TruSeq3-PE.fa:2:30:10
LEADING:3 TRAILING:3
SLIDINGWINDOW:4:15
MINLEN:36
Aligning the reads:¶
- Install hisat2: conda install -c bioconda hisat2
Extract splice sites and exons from GTF (optional but recommended)¶
- hisat2_extract_splice_sites.py ./Genome/GCF_000001735.4_TAIR10.1_genomic.gtf > ./splicesites.txt
- hisat2_extract_exons.py ./Genome/GCF_000001735.4_TAIR10.1_genomic.gtf > ./exons.txt
Build HISAT2 index with splice site information¶
hisat2-build \
-p 4 \
--ss ./splicesites.txt \
--exon ./exons.txt \
./GCF_000001735.4_TAIR10.1_genomic.fna \
hisat_index/TAIR10
Now run HISAT2 with the index PREFIX to align the samples¶
hisat2 -p 4 \
-x ./genome/hisat_index/TAIR10 \
-1 SRR4420293_1.fastq.gz \
-2 SRR4420293_2.fastq.gz \
-S aligned_reads/SRR4420293.sam \
--summary-file ./aligned_reads/SRR4420293_alignment_summary.txt
install samtools¶
- conda install -c bioconda samtools
- verify the installation: samtools --version
Convert to sorted BAM¶
samtools view -@ 4 -bS aligned_reads/SRR4420293.sam | \
samtools sort -@ 4 -o aligned_reads/SRR4420293.sorted.bam && \
samtools index aligned_reads/SRR4420293.sorted.bam && \
rm aligned_reads/SRR4420293.sam
install featurecounts¶
- conda install -c bioconda subread
- verify installation: featureCounts -v
Let's count the features¶
featureCounts \
-a ./genome/GCF_000001735.4_TAIR10.1_genomic.gtf \
-o ./gene_counts/SRR4420293_counts.txt \
-p \
-t exon \
-g gene_id \
-T 4 \
./aligned_reads/SRR4420293.sorted.bam
Quiz¶
- Convert and save the counts txt file(s) into a csv file