Lab 4: Phylogenetics

Download Sequences from NCBI

Download sequences in FASTA format for a gene of interest from NCBI Nucleotide. I downloaded 5S rRNA for Human (Homo sapiens), Mouse (Mus musculus), Rat (Rattus norvegicus), Frog (Xenopus laevis), Chicken (Gallus gallus), Fly (Drosophila melanogaster), and Arabidopsis (A. thaliana). Build a FASTA file containing each sequence and its defline. I shortened the defline for each species to make it easier to read later.

Alternatively, you could download 18S rRNA sequences here: 18S rRNAs
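
If you would rather script the defline shortening than edit the file by hand, something like the following could work. This is only a sketch: the function name `shorten_deflines`, the accession-style IDs, and the short sequences are made up for illustration and are not real NCBI records.

```python
def shorten_deflines(fasta_text, short_names):
    """Replace each defline with a short label.

    fasta_text  -- contents of a FASTA file as one string
    short_names -- dict mapping a substring of the original defline
                   (for example the species name) to the short label
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Use the first key that occurs in this defline.
            for key, label in short_names.items():
                if key in line:
                    line = ">" + label
                    break
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up records for illustration only.
example = (
    ">ACC0001.1 Homo sapiens 5S ribosomal RNA\n"
    "GTCTACGGCC\n"
    ">ACC0002.1 Mus musculus 5S ribosomal RNA\n"
    "GTCTACGGCA\n"
)
short = shorten_deflines(
    example, {"Homo sapiens": "Human", "Mus musculus": "Mouse"}
)
print(short)
```

Deflines that match none of the keys are left unchanged, so it is worth checking the output before running the alignment.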

Create a Multiple Sequence Alignment and Phylogenetic Tree with ClustalW

Create a multiple sequence alignment using ClustalW2.

	clustalw2 -infile=18S_rRNAs.fasta -type=DNA -outfile=18S_rRNAs.aln

Using the *.dnd file that you created with ClustalW, you can create an image of a phylogenetic tree using Biopython:

Start the Python interpreter by typing python:

	>>> import matplotlib as mpl
	>>> mpl.use('Agg')
	>>> import matplotlib.pyplot as pyplot
	>>> from Bio import Phylo
	>>> tree = Phylo.read('18S_rRNAs.dnd','newick')
	>>> Phylo.draw(tree, do_show=False)
	>>> pyplot.savefig("myTree.png")
	>>> quit()

How does the resulting tree compare to what you expect given these species?
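
To help answer this question without opening the image, the tree can also be inspected as text. The sketch below assumes Biopython is installed and uses a made-up Newick string in place of the real *.dnd file; the species labels and branch lengths are invented for illustration.

```python
from io import StringIO

from Bio import Phylo

# Made-up Newick string standing in for the contents of a *.dnd file.
newick = "((Human:0.1,Mouse:0.2):0.05,Frog:0.4);"
tree = Phylo.read(StringIO(newick), "newick")

# Text rendering of the tree, useful on a server without a display.
Phylo.draw_ascii(tree)

# Names at the tips of the tree.
leaf_names = [leaf.name for leaf in tree.get_terminals()]
print(leaf_names)

# Sum of the branch lengths along the path between two tips.
human_mouse = tree.distance("Human", "Mouse")
human_frog = tree.distance("Human", "Frog")
print(human_mouse, human_frog)
```

For the real data, replace the StringIO handle with the file name, for example Phylo.read('18S_rRNAs.dnd','newick'), and use the labels from your own deflines.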

Create a Multiple Sequence Alignment and Phylogenetic Tree with PhyML

The program PhyML provides much more flexibility in the kinds of trees it can compute. To use it, we'll need to convert our alignment file to PHYLIP format. We can do this with the Biopython module AlignIO:

	>>> from Bio import AlignIO
	>>> alignment = AlignIO.parse("18S_rRNAs.aln","clustal")
	>>> AlignIO.write(alignment,"18S_rRNAs.phy","phylip")
	1
	>>> quit()

You will need to install phyml using conda. Importantly, you must do the installation on the large server, because it uses a lot of RAM. You can install it with the following series of commands:

	conda create -n test
	conda activate test
	conda install bioconda::phyml

The simplest way to run phyml is to type "phyml" and then enter the name of your alignment file at the prompt:

	$ phyml
	
	Enter the sequence file name > 18S_rRNAs.phy

The program will then show you a set of options, which you can change if you wish. To change the model, type "+" and then "M" to toggle through the models.

You can also run phyml using command line options described here: https://github.com/stephaneguindon/phyml/blob/master/doc/phyml-manual.pdf
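
As a rough sketch of the non-interactive route, the call can be assembled and launched from Python. The helper name `phyml_command` is made up for this example. The flags used are -i (input alignment), -d nt (nucleotide data), and -m (substitution model, here JC69 for Jukes-Cantor); check them against the manual for your installed version before relying on them.

```python
import os
import shutil
import subprocess


def phyml_command(alignment, model="JC69"):
    """Build a non-interactive phyml call for a nucleotide alignment."""
    return ["phyml", "-i", alignment, "-d", "nt", "-m", model]


cmd = phyml_command("18S_rRNAs.phy")
print(" ".join(cmd))

# Only launch phyml if it is installed and the alignment file exists.
if shutil.which("phyml") is not None and os.path.exists("18S_rRNAs.phy"):
    subprocess.run(cmd, check=True)
```

Running it this way makes the model choice explicit and repeatable, which is convenient when you want to compare trees built under different models.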

Let's compare the phyml tree to the ClustalW tree:

	>>> import matplotlib as mpl
	>>> mpl.use('Agg')
	>>> import matplotlib.pyplot as pyplot
	>>> from Bio import Phylo
	>>> tree = Phylo.read('18S_rRNAs.phy_phyml_tree.txt','newick')
	>>> Phylo.draw(tree, do_show=False)
	>>> pyplot.savefig("myTreeML.png")
	>>> quit()

For the following sequences, compute the Jukes-Cantor distance between each pair using a nested for-loop over the sequences. You will need to loop through each pair of sequences and compare them character by character to compute the proportion of mutations p for each pair. Then input the sequences into clustalw2 and phyml to compute the tree using a Jukes-Cantor model, and visualize the tree with Biopython. Finally, compare the branch lengths with your computed distances for a few examples; you can do this manually by looking at the corresponding tree figure you create with Biopython. Download here: fake sequences

	>A
	GCTGATGCTAGCTAGCATGCATGCATGCTAGCTATCGTCGTGACTTACGTAGCTAGCTATCG
	>B
	GCTGATGCTAGTTAGCGTGCATGCAACCTAGCTATCGTCGAGACTTACCTAGCTAGTCATCG	
	>C
	GCTGGTGCAAGTTAGCGTGCAAGCAGCCTACCTATCGTCGAGACTAACCTAGCTAGTCGTCG	
	>D
	GCTGCTGCAAGTTAGCGTGCAGGCAGCCTACCTATCGACGAGACTATCCTAGCTAGTCGTGG	
	>E
	GCTGCTGCAAGTTAGCGTGCAGGCAGCATACCTATCGACGAGACTATCTTAGCTAGTCGTGG
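
As a starting point, here is a minimal sketch of the calculation for a single pair, assuming the standard Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p). The two short sequences and the function names `p_distance` and `jukes_cantor` are made up for illustration; the nested loop over all pairs of the sequences above is left to you.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differing sites p."""
    # The correction is undefined once p reaches 3/4 (saturation).
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up sequences that differ at one of ten positions, so p = 0.1.
x = "ACGTACGTAC"
y = "ACGTACGTAA"
p = p_distance(x, y)
d = jukes_cantor(p)
print(p, d)
```

The corrected distance d is slightly larger than p because the model allows for multiple substitutions at the same site.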