Lab 7: RNA-Seq



The data for this project was taken from this paper: Kurasz et al.. In this paper, they studied gene expression in Salmonella enterica serovar Typhimurium and the effect of nucleic acid damage induced by mitomycin (MMC), which is a chemotherapy drug, and other compounds. We will look at the MMC data in this project.

Due to the large amount of data generated, I ask that you work in groups today. Please delete any old files that you may have on the server that you aren't using anymore. For example, the original chr*fa files from the Drosophila genome.

Step 1: Download the genome and build a hisat2 index.

First thing we should do is download the genome for Salmonella. For reference, I found this by searching NCBI in the "Assembly" section and found this link here: https://www.ncbi.nlm.nih.gov/assembly/?term=Salmonella+Typhimurium. I renamed the file and made it available at this link:

	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_genome.fasta
      

Next, we should build a hisat2 index with the following command:

	hisat2-build salmonella_genome.fasta sal
      

In this command, I have created a filebase called "sal", but you could call it what you want and change the subsequent commands accordingly. Note that several files are created as a result:

	[hendrixd@bb485 data]$ ls sal.*
	sal.1.ht2  sal.2.ht2  sal.3.ht2  sal.4.ht2  sal.5.ht2  sal.6.ht2  sal.7.ht2  sal.8.ht2
      

Step 2: Download the FASTQ files and align with hisat2

First, we will download the FASTQ file. I found this experiment on GEO here: GEO record GSE82167. As a side note, what I did was go to the "Samples" section, expanded that section, clicked the links to find the "SRR..." IDs for samples. I ran a command like this (but please don't run this because the original files are too big!):

	~/sratoolkit.2.9.4-centos_linux64/bin/fastq-dump -A SRR3621123 >& SRR3621123.err
        ~/sratoolkit.2.9.4-centos_linux64/bin/fastq-dump -A SRR3621125 >& SRR3621125.err
      

To make this more feasible in class, I have created the following shortened version of the resulting files:

	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_ctrl.fastq
	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella_MMC.fastq
      

Question: How many lines are in these files and how many reads does it have?

We might want to quality trim these files, as well as remove any potential adapters. You can trim the files with "skewer" using the commands:

	skewer -x AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -q 30 -Q 30 salmonella_ctrl.fastq -o salmonella_ctrl
	skewer -x AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -q 30 -Q 30 salmonella_MMC.fastq -o salmonella_MMC
      

For this command, it is requiring a PHRED score of 30 for both the 3' end as well as the average score. Meaning, nucleotides will be removed from the 3' end if they are lower quality than 30, and will discard reads with an average score less than 30.

The alignment itself is done with hisat2 using mostly default options to create a SAM file as the output.

	hisat2 -x sal -U salmonella_ctrl-trimmed.fastq -S salmonella_ctrl.sam
	hisat2 -x sal -U salmonella_MMC-trimmed.fastq -S salmonella_MMC.sam
      

Step 3: Convert SAM to BAM, sort, and index the files.

In order for cuffdiff to run properly, we need to sort and index the files. SAM will work actually, but it needs to be sorted. Perhaps the easiest and fastest thing is do do this with samtools, which is like a swiss-army knife for SAM and BAM files. First, convert to BAM using samtools

	samtools view -b -S -o salmonella_ctrl.bam salmonella_ctrl.sam 
	samtools view -b -S -o salmonella_MMC.bam salmonella_MMC.sam
      

Next, we sort and index these files:

	samtools sort salmonella_ctrl.bam salmonella_ctrl.sort
	samtools index salmonella_ctrl.sort.bam 
	
	samtools sort salmonella_MMC.bam salmonella_MMC.sort
	samtools index salmonella_MMC.sort.bam 	
      

Step 4: Identify differentially expressed genes with Cuffdiff.

To do this, you'll need to download the GFF file for the organism.

	    wget https://hendrixlab.cgrb.oregonstate.edu/teaching/salmonella.gff
	  

Please read more information on the options of cuffdiff, check here: Cuffdiff Manual. We are going to run a basic version here, and the only option we are setting is to create a label for the output file columns with meaningful labels.

	cuffdiff -L WT,MMC salmonella.gff salmonella_ctrl.sort.bam salmonella_MMC.sort.bam
      

Do you find any differentially expressed genes? These will correspond to those significant at an FDR of 0.05, by default. They make this easy to putting a "yes" on the last column for genes that are significant.

	grep yes gene_exp.diff
      
To get just the gene IDs that are on the 3rd column, try something like this to just print that column:
	grep yes gene_exp.diff | awk '{print $3}'
      

Do these genes make sense given the paper, and can you find information about their function at NCBI?

You can look for enriched funcional annotations using the "GO Enrichment" tool at http://geneontology.org/. GO Terms are part of a controlled vocabulary for gene functions. It is interesting that Salmonella is an available organism on this page. By extracting the gene IDs, you could paste them into this GO Enrichment tool, and compute enriched funcational categories.