Lab 7: RNA-Seq

The data for this project was taken from this paper: Wu et al.

Please download the data in the directory here: Yeast Data

What this data contains is 4 RNA-Seq files (reads from just chromosome VII), a tar file containing a bowtie index for the latest assembly of yeast, the genome sequence, and an annotation gtf file.

You will need to untar the tar file with the command:

tar xvf sce_R64_1_1.tar

Step 1: Align the sequences with tophat.

Please read more information on the options of tophat, check here: Tophat Manual

      tophat -o sce_1n_R1_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_1n_R1_VII.fastq
      tophat -o sce_1n_R2_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_1n_R2_VII.fastq
      tophat -o sce_4n_R1_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_4n_R1_VII.fastq
      tophat -o sce_4n_R2_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_4n_R2_VII.fastq

In the above commands, the

flag is a time-saving option, the
restricts splice junctions to those in the provided GTF file.

Alternatives: You could also use a more updated aligner like Hisat2. An example of a similar command would be something like this:

	 hisat --max-intronlen 10000 -q -x sce_R64_1_1 -U sce_R64_1_1 sce_1n_R1_VII.fastq -S sce_1n_R1_VII_sce_R64_1_1_hisat.sam

In this command, the

--max-intronlen 10000
flag restricts the considered splice junctions by length, the
flag simply specifies FASTQ as input. Note: Hisat2 uses a different index file! .ht2 suffix instead of .ebwt

Step 2: Identify differentially expressed genes with Cuffdiff.

Please read more information on the options of cuffdiff, check here: Cuffdiff Manual

	cuffdiff -L 4n,1n sce_R64_1_1_ENSEMBL.gtf sce_4n_R1_VII_sce_R64_1_1_tophat/accepted_hits.bam,sce_4n_R2_VII_sce_R64_1_1_tophat/accepted_hits.bam sce_1n_R1_VII_sce_R64_1_1_tophat/accepted_hits.bam,sce_1n_R2_VII_sce_R64_1_1_tophat/accepted_hits.bam

Do you find any differentially expressed genes? These will correspond to those significant at an FDR of 0.05, by default. There should be 4 genes. They make this easy to putting a "yes" on the last column for genes that are significant.

	grep yes gene_exp.diff

Step 3: Are these 4 genes associated with any GO terms?

GO Terms, or "Gene Ontology" terms are a controlled vocabulary of gene functions that are described by GO annotations. Now that you have a list of genes, you can head to GOEAST web tool here: GOEAST. Select yeast (S. cerevisiae) from the species pull-down menu. I think you also need to select the Gene Symbol check box. The Step Optional button opens up numerous options. Let's use Hochberg (Benjamini-Hochberg) with an FDR of 0.05.

Bonus: If you want to take it further, you could re-do this project with the full dataset here: GEO link, although you should expect a decent runtime to do the full set.