The data for this project was taken from this paper: Wu et al.
Please download the data in the directory here: Yeast Data
What this data contains is 4 RNA-Seq files (reads from just chromosome VII), a tar file containing a bowtie index for the latest assembly of yeast, the genome sequence, and an annotation gtf file.
You will need to untar the tar file with the command:
tar xvf sce_R64_1_1.tar
Please read more information on the options of tophat, check here: Tophat Manual
tophat -o sce_1n_R1_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_1n_R1_VII.fastq tophat -o sce_1n_R2_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_1n_R2_VII.fastq tophat -o sce_4n_R1_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_4n_R1_VII.fastq tophat -o sce_4n_R2_VII_sce_R64_1_1_tophat --no-coverage-search --no-novel-juncs -G sce_R64_1_1_ENSEMBL.gtf sce_R64_1_1 sce_4n_R2_VII.fastq
In the above commands, the
--no-coverage-searchflag is a time-saving option, the
-no-novel-juncsrestricts splice junctions to those in the provided GTF file.
Alternatives: You could also use a more updated aligner like Hisat2. An example of a similar command would be something like this:
hisat --max-intronlen 10000 -q -x sce_R64_1_1 -U sce_R64_1_1 sce_1n_R1_VII.fastq -S sce_1n_R1_VII_sce_R64_1_1_hisat.sam
In this command, the
--max-intronlen 10000flag restricts the considered splice junctions by length, the
-qflag simply specifies FASTQ as input. Note: Hisat2 uses a different index file! .ht2 suffix instead of .ebwt
Please read more information on the options of cuffdiff, check here: Cuffdiff Manual
cuffdiff -L 4n,1n sce_R64_1_1_ENSEMBL.gtf sce_4n_R1_VII_sce_R64_1_1_tophat/accepted_hits.bam,sce_4n_R2_VII_sce_R64_1_1_tophat/accepted_hits.bam sce_1n_R1_VII_sce_R64_1_1_tophat/accepted_hits.bam,sce_1n_R2_VII_sce_R64_1_1_tophat/accepted_hits.bam
Do you find any differentially expressed genes? These will correspond to those significant at an FDR of 0.05, by default. There should be 4 genes. They make this easy to putting a "yes" on the last column for genes that are significant.
grep yes gene_exp.diff
GO Terms, or "Gene Ontology" terms are a controlled vocabulary of gene functions that are described by GO annotations. Now that you have a list of genes, you can head to GOEAST web tool here: GOEAST. Select yeast (S. cerevisiae) from the species pull-down menu. I think you also need to select the Gene Symbol check box. The Step Optional button opens up numerous options. Let's use Hochberg (Benjamini-Hochberg) with an FDR of 0.05.
Bonus: If you want to take it further, you could re-do this project with the full dataset here: GEO link, although you should expect a decent runtime to do the full set.