Lab 6: RNA-Seq

The data for this project was taken from this paper: Kurasz et al.. In this paper, they studied gene expression in Salmonella enterica serovar Typhimurium and the effect of nucleic acid damage induced by mitomycin (MMC), which is a chemotherapy drug, and other compounds. We will look at the MMC data in this project.

This lab also involves samtools, a set of programs that are useful for working with SAM and BAM files. To install these, login through a large server. You can then use mamba to install samtools as follows, assuming you have already created a bb485 mamba instance:

        eval "$(mamba shell hook --shell bash)"
        mamba create -n cuff_env -c conda-forge -c bioconda samtools cufflinks
        mamba activate cuff_env

Later on, after this lab, to exit that environment, you can type the following, but don't do that yet until later::

	mamba deactivate cuff_env

This "cuff_env" environment should be the best option as it creates the version of python necessary and also should install samtools along with it.

Step 1: Download the genome and build a hisat2 index.

First thing we should do is download the genome for Salmonella. For reference, I found this by searching NCBI in the "Assembly" section and found this link here: https://www.ncbi.nlm.nih.gov/assembly/?term=Salmonella+Typhimurium. I renamed the file and made it available at this link:

	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_genome.fasta

Next, we should build a hisat2 index with the following command:

	hisat2-build salmonella_genome.fasta sal

In this command, I have created a filebase called "sal", but you could call it what you want and change the subsequent commands accordingly. Note that several files are created as a result:

	[hendrixd@bb485 data]$ ls sal.*
	sal.1.ht2  sal.2.ht2  sal.3.ht2  sal.4.ht2  sal.5.ht2  sal.6.ht2  sal.7.ht2  sal.8.ht2

Step 2: Download the FASTQ files and align with hisat2

First, we will download the FASTQ file. I found this experiment on GEO here: GEO record GSE82167. As a side note, what I did was go to the "Samples" section, expanded that section, clicked the links to find the "SRR..." IDs for samples. I ran a command like this (but please don't run this because the original files are too big!):

	~/sratoolkit.2.9.4-centos_linux64/bin/fastq-dump -A SRR3621123 >& SRR3621123.err
        ~/sratoolkit.2.9.4-centos_linux64/bin/fastq-dump -A SRR3621125 >& SRR3621125.err

To make this more feasible in class, I have created the following shortened version of the resulting files:

	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_ctrl.fastq
	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_MMC.fastq

Question: How many lines are in these files and how many reads does it have?

We might want to quality trim these files, as well as remove any potential adapters. You can trim the files with "skewer" using the commands:

	skewer -x AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -q 30 -Q 30 salmonella_ctrl.fastq -o salmonella_ctrl
	skewer -x AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -q 30 -Q 30 salmonella_MMC.fastq -o salmonella_MMC

For this command, it is requiring a PHRED score of 30 for both the 3' end as well as the average score. Meaning, nucleotides will be removed from the 3' end if they are lower quality than 30, and will discard reads with an average score less than 30.

The alignment itself is done with hisat2 using mostly default options to create a SAM file as the output.

	hisat2 -x sal -U salmonella_ctrl-trimmed.fastq -S salmonella_ctrl.sam
	hisat2 -x sal -U salmonella_MMC-trimmed.fastq -S salmonella_MMC.sam

Step 3: Convert SAM to BAM, sort, and index the files.

In order for cuffdiff to run properly, we need to sort and index the files. SAM will work actually, but it needs to be sorted. Perhaps the easiest and fastest thing is do do this with samtools, which is like a swiss-army knife for SAM and BAM files. First, convert to BAM using samtools

	samtools view -b -S -o salmonella_ctrl.bam salmonella_ctrl.sam 
	samtools view -b -S -o salmonella_MMC.bam salmonella_MMC.sam

Next, we sort and index these files:

samtools sort -o salmonella_ctrl.sort.bam salmonella_ctrl.bam
samtools index salmonella_ctrl.sort.bam 
	
samtools sort -o salmonella_MMC.sort.bam salmonella_MMC.bam
samtools index salmonella_MMC.sort.bam

Step 4: Identify differentially expressed genes with Cuffdiff.

To do this, you'll need to download the GFF file for the organism.

	    wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella.gff

Please read more information on the options of cuffdiff, check here: Cuffdiff Manual. We are going to run a basic version here, and the only option we are setting is to create a label for the output file columns with meaningful labels.

	cuffdiff -L WT,MMC salmonella.gff salmonella_ctrl.sort.bam salmonella_MMC.sort.bam

Do you find any differentially expressed genes? These will correspond to those significant at an FDR of 0.05, by default. They make this easy to putting a "yes" on the last column for genes that are significant.

	grep yes gene_exp.diff

To get just the gene IDs that are on the 3rd column, try something like this to just print that column:

	grep yes gene_exp.diff | awk '{print $3}'

Do these genes make sense given the paper, and can you find information about their function at NCBI?

You can look for enriched funcional annotations using the "GO Enrichment" tool at http://geneontology.org/. GO Terms are part of a controlled vocabulary for gene functions. It is interesting that Salmonella is an available organism on this page. By extracting the gene IDs, you could paste them into this GO Enrichment tool, and compute enriched funcational categories.