Lab 6: Genomic Data

Part I: Storing a Genome to a Dictionary

A dictionary is a data structure with key-value pairs. Dictionaries are indexed by a "key", which can be any string. This could be a useful concept for dealing with many sequences from a fasta file, with the defline as the key.

	>>> dict = {}
        >>> dict["R2D2"] = "droid"
        >>> dict["Vader"] = "Sith"
        >>> dict["Yoda"] = "Jedi"
        >>> print dict
        {'Vader': 'Sith', 'R2D2': 'droid', 'Yoda': 'Jedi'}
        >>> print dict["R2D2"]
        droid

Note how the dictionary looks when you print it out. You see a list of key-value pairs, separated by a ": ". Moreoever, whereas the list is enclosed by square brackets when printed out, the dictionary is enclosed by curly braces when printed out. Nevertheless, both of them allow you to access a particular term with square brackets such as list[2] or dict["R2D2"]

It turns out, it is a pretty good way to store a genome, namely with the chromosome names (deflines) as the key, and the sequence as the value.

      genome = {}
      sequences = SeqIO.parse(genomeFasta,"fasta")
      for record in sequences:
         genome[record.id] = record.seq

Part II: Storing a GFF to a list

Next, let's review creating a list and appending values to it. In this case, we will append (chrom,start,stop) genomic locations corresponding to CDS regions:

coords = []

GFF = open(gffFile,'r')
for line in GFF:
    if "#" not in line:
      chrom,source,seqtype,start,stop,score,strand,frame,attributes = line.strip().split("\t")
        if "CDS" in seqtype:
            if name in attributes:
                coords.append((chrom,int(start),int(stop)))

A couple of things to note. First, we have a gene name in mind, stored in the variable "name". We expect this to be a transcript ID because a gene ID could end up printing too many lines (one for each transcript associated with that gene). If we want to extract coding regions, it should be on a per-transcript basis. Second, we note that we are converting the positions to integers upon reading in. By default they are stored as strings, so the conversion is important. Third, note that we are excluding lines with a hash (#) because these are typically comments that do not contain the required number of columns.

Part III: Find a gene of interest in Drosophila menanogaster

Let's use NCBI Protein or NCBI Gene to find a your favorite gene. Let's then BLAST it to Drosophila (by setting Drosophila as the taxa), and find the best Drosophila ortholog.

Next, let's find a flybase transcript ID for one of the isoforms of the gene by navigating the flybase database and Gbrowse from flybase. Note: A flybase transcript ID looks like this: FBtr001828

Go to Flybase and search for the gene you found here: flybase, then paste your gene name into the "Jump to Gene" box on the upper left.

New Script: extractGeneProtein.py

To use this script, you'll need to download a flybase annotation. Download this one with wget:

	wget https://hendrixlab.cgrb.oregonstate.edu/teaching/flybase_r5.56.gff3

Next, let's look at the gff file. Let's use less, but turn off word-wrap with the -S option.

	less -S flybase_r5.56.gff3

Next, you can create a symbolic link to the genome file you created in Lab 4 to your current directory with a command like this, but you may need to updated it based on where the file is:

	ln -s ../BLAST/dm3.fa genome.fa

In this command, the first file "../BLAST/dm3.fa" is your original genome file. The second file name "genome.fa" is the name of the symbolic link you will be creating (but you can name it what you want). Now, let's take a look at the script.

import sys
import re
from Bio import SeqIO
from Bio.Seq import Seq

# this section takes care of reading in data from user
usage = "Usage: " + sys.argv[0] + " <genome fasta> <gff file> <transcript ID>"
if len(sys.argv) != 4:
    print usage
    sys.exit()

# read the input files/args.
genomeFasta = sys.argv[1]
gffFile = sys.argv[2]
name = sys.argv[3]

# read the fasta file into a dictionary.
genome = {}
sequences = SeqIO.parse(genomeFasta,"fasta")
for record in sequences:
    genome[record.id] = record.seq

# initialize some variables
revcomp = False
coords = []

# read through the gffFile          
GFF = open(gffFile,'r')
for line in GFF:
    if "#" not in line:
        chrom,source,seqtype,start,stop,score,strand,frame,attributes = line.strip().split("\t")
        if "CDS" in seqtype:
            if name in attributes:
                coords.append((chrom,int(start),int(stop)))
                if strand == "-":
                    revcomp = True

# reverse the order of the exons if on "-" strand
coords.sort(reverse = revcomp)

# collect the coding exons of the transcript 
ORF = Seq('')
for chrom,start,stop in coords:
    CDS = genome[chrom][start-1:stop]
    if revcomp:
        CDS = genome[chrom][start-1:stop].reverse_complement()
    # concatenate the coding sequence 
    ORF += CDS

# transcribe,translate, and print 
RNA = ORF.transcribe()
Protein = RNA.translate()
print Protein

Finally, after you copy the text of the above script into a file called "extractGeneAndProtein.py", you can run the script using the transcript ID for your favorite gene:

	python extractGeneAndProtein.py genome.fa flybase_r5.56.gff3 FBtr0089362

Note, here we are using the genome fasta file created in a previous lab, Lab 4.