Today we are going to learn about various ways to produce motifs from data. The purpose is to show what works and what doesn't.
First, let's download a list of sequences that contain a motif at random positions and and some flanking sequence.
How many lines are in this file? How many sequences does that mean are in the file?
Let's try and create a motif from this data. What do you expect to happen?
>>> from Bio import SeqIO >>> from Bio import motifs >>> from Bio.Seq import Seq >>> sequences = SeqIO.parse("gt.fasta","fasta") >>> instances =  >>> for data in sequences: ... instance = data.seq ... instances.append(instance) ... >>> motif = motifs.create(instances)
Now the motif is created in the "motif" object. What does the count matrix look like? You can print it out by this command:
>>> print motif.counts
Create a motif LOGO from these counts. We can go through the web, or we can create one right now with Biopython:
>>> motif.weblogo("giant_1.pdf",format="pdf") >>> quit()
Does this look like a good motif? How can you tell?
Now go to the website JASPAR at http://jaspar.genereg.net/. Go to "JASPAR CORE Insecta". Then find the motif for "giant", labeled "gt" (motif ID MA0447.1). Click the motif LOGO image to get a pop-up of the motif details (you may need to enable pop-ups on your browser). How does the motif LOGO compare to what we found just by building a LOGO from the fasta file of all the sites? (If you have issues downloading the file, you can get it here.)
Now let's download the sites in JASPAR format, which is a variation on fasta, with a little more information. Click the link that says "(Show me all binding sites)..as fasta file" in the lower left portion of the pop-up. Transfer the downloaded file to crick by using the folder icon for file transfer. By looking at the file, what differences do you see?
Now, let's load this data using a special function that is part of the motifs module:
>>> from Bio import motifs >>> motif = motifs.read(open("MA0447.1.sites"),"sites") >>> print motif.counts
How does this compare to what you found previously? Now create a motif LOGO and exit:
>>> motif.weblogo("giant_2.pdf",format="pdf") >>> quit()
How does this motif compare to the image on the JASPAR webpage?
The reason why our first method didn't work is because we just looked at sequenes without performing motif finding. The JASPAR sites file encodes this information in the capital letters, letting it know where the motif instances are. Now let's try motif finding on the first set of data to see if we can recover it.
meme gt.fasta -dna -mod oops -nmotifs 1 -maxw 12 -o gt_3
What do each of these parameters mean? Take a look at the output file. Note the "information content" information, as well as probability scores (p-values) for the instances.
Finally, by using the file transfer icon, let's look at the meme.html output file and associated LOGO image generated by MEME. How does this compare to the actual motif in JASPAR?
Bonus question: What happens when you add the "-revcomp" flag to MEME? What is the difference in the output and why do you think this is?
meme gt.fasta -dna -revcomp -mod oops -nmotifs 1 -maxw 12 -o gt_4