Lab 6: Genomic Data



Some new python concepts

First, let's review creating a list and appending values to it. Moreover, let's try sorting the list. To make things more interesting, let's also append ordered pairs, thinking about them like (start,stop) positions of a genomic location:

>>> list = []
>>> list
[]
>>> list.append((172,233))
>>> list.append((22,73))
>>> list.append((56,87))
>>> list
[(172, 233), (22, 73), (56, 87)]
>>> list.sort()
>>> list
[(22, 73), (56, 87), (172, 233)]
>>> list[1]
(56, 87)
      

Next, let's introduce the concept of a dictionary. A list can be retrieved from using the index of a particular element of the list. For example consider the last command in the previous example.

A dictionary is a data structure that is indexed by a "key", that can be any string. This could be a useful concept for dealing with many sequences from a fasta file, with the defline as the key.

>>> dict = {}
>>> dict["R2D2"] = "droid"
>>> dict["Vader"] = "Sith"
>>> dict["Yoda"] = "Jedi"
>>> print dict
{'Vader': 'Sith', 'R2D2': 'droid', 'Yoda': 'Jedi'}
>>> print dict["R2D2"]
droid
      

Note how the dictionary looks when you print it out. You see a list of key-value pairs, separated by a ": ". Moreoever, whereas the list is enclosed by square brackets when printed out, the dictionary is enclosed by curly braces when printed out. Nevertheless, both of them allow you to access a particular term with square brackets such as list[2] or dict["R2D2"]

Now let's look at booleans. A boolean is a built-in datatype in python that can take on the value of True or False. You can assign the value to a boolean by setting it equal to a true or false statement.

>>> bool = 23 > 72
>>> bool
False
>>> bool = 2+2==4
>>> bool
True
      

Many functions take booleans as part of the input. For example, let's revisit the sort function for lists mentioned above. We can specify list.sort(reverse = True) when we want to reverse sort the list.

>>> list = [(22, 73), (56, 87), (172, 233)]
>>> bool = False
>>> bool = 2 < 3
>>> bool
True
>>> list.sort(reverse = bool)
>>> list
[(172, 233), (56, 87), (22, 73)]

      

Find a gene of interest in Drosophila menanogaster

Let's use NCBI Protein or NCBI Gene to find a your favorite gene. Let's then BLAST it to Drosophila (by setting Drosophila as the taxa), and find the best Drosophila ortholog.

Next, let's find a flybase transcript ID for one of the isoforms of the gene by navigating the flybase database and Gbrowse from flybase. Note: A flybase transcript ID looks like this: FBtr001828

Go to Flybase and search for the gene you found here: flybase, then paste your gene name into the "Jump to Gene" box on the upper left.

New Script: extractGeneProtein.py

To use this script, you'll need to download a flybase annotation. Download this one with wget:

	wget http://hendrixlab.cgrb.oregonstate.edu/teaching/flybase_r5.56.gff3
      

Next, let's look at the gff file. Let's use less, but turn off word-wrap with the -S option.

	less -S flybase_r5.56.gff3
      

Next, let's take a look at the script.

import sys
import re
from Bio import SeqIO
from Bio.Seq import Seq

# this section takes care of reading in data from user
if len(sys.argv) != 4 or "-h" in sys.argv or "--help" in sys.argv:
    # sys.argv[0] is the name of the script itself.
    print >> sys.stderr, "Usage: ", sys.argv[0]," <genome fasta> <gff file> <gene ID>\n"
    sys.exit()

# read the input files/args.
# sys.argv is a list containing the terms entered on the command line for the input files.
genomeFasta = sys.argv[1]
gffFile = sys.argv[2]
name = sys.argv[3]

# read the fasta file into a dictionary
genome = {}
sequences = SeqIO.parse(open(genomeFasta),"fasta")
for record in sequences:
    genome[record.id] = record.seq

# initialize some variables
revcomp = False
coords = []
# read through the gffFile
for line in open(gffFile,"r"):
    if "#" in line: continue
    chromname,source,seqtype,start,stop,score,strand,frame,info = line.strip().split("\t")
    if "CDS" in seqtype:
        match = re.search('Parent=(.*?);',info)
        if match:
            if name in match.group(1):
                coords.append((chromname,int(start),int(stop)))
                revcomp = strand is "-"

# if the gene was on the reverse strand, 
# reverse the order of the exons
coords.sort(reverse = revcomp)

# collect the coding exons of the transcript
ORF = Seq('')
for chromname,start,stop in coords:
    CDS = genome[chromname][start-1:stop]
    if revcomp: 
        CDS = genome[chromname][start-1:stop].reverse_complement()
    # concatenate the coding sequence
    ORF += CDS

# transcribe,translate, and print
RNA = ORF.transcribe()
Protein = RNA.translate()
print Protein	      
      

Finally, after you copy the text of the above script into a file called "extractGeneAndProtein.py", and put into your scripts directory (~/Scripts/) you can run the script using the transcript ID for your favorite gene:

	python ~/Scripts/extractGeneAndProtein.py dm3.fa flybase_r5.56.gff3 FBtr0089362	
      
Note, here we are using the genome fasta file created in a previous lab, Lab 4.