Lab 2: Fastq and Quality Scores



Today we will review the fastq file

First, let's download a script discussed in class, and a fastq file.

	wget http://hendrixlab.cgrb.oregonstate.edu/teaching/printAverageQualityScores.py
	wget http://hendrixlab.cgrb.oregonstate.edu/teaching/reads.fastq
      

Move the script into your scripts folder with a command like "mv printAverageQualityScores.py Scripts/.". Next, let's count the number of lines in the fastq file. Given the result, how many reads do you think are contained in this file?

	wc -l reads.fastq
      

Next,let's check what happens when you run the script without any input, or with the "-h" flag.

	python Scripts/printAverageQualityScores.py
	python Scripts/printAverageQualityScores.py -h
      

This is a usage statement printed out, making it clear how to run the script. This is very helpful for writing scripts and is good practice. Now let's run the script with the input file, which will plot the average error probability as a function of position.:

	python Scripts/printAverageQualityScores.py reads.fastq	
      

How does this plot look? What does this say about errors in sequencing data?

Now modify the script to instead of printing the average probability, to print the average quality score. How does this plot compare to the probability plot?.

...

Now let's use a quality trimmer called "fastq_quality_trimmer" to trim our reads based on a quality threshold of 20. First, let's see how this program is run.

$ fastq_quality_trimmer -h

usage: fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-t N]       = Quality threshold - nucleotides with lower 
                  quality will be trimmed (from the end of the sequence).
   [-l N]       = Minimum length - sequences shorter than this (after trimming)
                  will be discarded. Default = 0 = no minimum length. 
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = FASTQ output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

      

Gotcha: The help menu doesn't tell you what value to subtract for some reason. This is done with the "-Q 33" flag. The full command is this:

	fastq_quality_trimmer -t 20 -i reads.fastq -o reads.qc.fastq -Q 33
      

Note: the "printAverageQualityScores.py" script will no longer work for data of varying length. Now that we've trimmed the reads, let's see how the resulting length distribution looks. This can be achieved by a second modification of the script to simply print the lengths of the trimmed reads.

	wget http://hendrixlab.cgrb.oregonstate.edu/teaching/printFastqReadLengths.py
      

Move this script into your scripts folder. This script can be run by the following command, similar to the first script:

	python Scripts/printFastqReadLengths.py reads.qc.fastq
      

Please take a look at this script and see how it works. This script will also produce a pdf file, showing the resulting read length distribution. The difference is, it is on a logscale on the y-axis using the line:

	pyplot.yscale('log')
      

Question: How does the read quality data compare to the length data after trimming? Is the result reasonable?