Lab 4: Phylogenetics

Download Sequences from NCBI

Download sequences in FASTA format for a gene of interest from NCBI Nucleotide. I downloaded 5S rRNA for Human (Homo sapiens), Mouse (Mus musculus), Rat (Rattus norvegicus), Frog (Xenopus laevis), Chicken (Gallus gallus), Fly (Drosophila melanogaster), and Arabidopsis (A. thaliana). Build a single FASTA file containing each sequence and its defline (the header line that begins with ">"). I shortened the defline for each species to make the tree labels easier to read later.
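
Shortening the deflines by hand is fine for seven sequences, but the same step can be sketched in Python. This is only an illustration: the function name shorten_deflines, the record IDs, and the sequences below are made-up placeholders, not real NCBI entries.

```python
def shorten_deflines(fasta_text, short_names):
    """Replace each defline with '>' plus a short label.

    short_names maps a substring to look for in the original defline
    (here, the species name) to the label that should replace it.
    Deflines that match nothing are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            for key, label in short_names.items():
                if key.lower() in line.lower():
                    line = ">" + label
                    break
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder records: IDs and sequences are invented for the example.
example = """>ID0001 Homo sapiens 5S ribosomal RNA
ACGTACGT
>ID0002 Mus musculus 5S ribosomal RNA
ACGTTCGT
"""

names = {"Homo sapiens": "Human", "Mus musculus": "Mouse"}
print(shorten_deflines(example, names), end="")
```

Short labels without spaces also keep the tip names readable when the tree is drawn.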

Alternatively, you could download 18S rRNA sequences here: 18S rRNAs

Create a Multiple Sequence Alignment and Phylogenetic Tree with ClustalW

Create a multiple sequence alignment using ClustalW2.

	clustalw -infile=18S_rRNAs.fasta -type=DNA -outfile=18S_rRNAs.aln

Using the *.dnd file that ClustalW created alongside the alignment, you can draw an image of the phylogenetic tree with Biopython.

Start the Python interpreter by typing python, then enter:

	import matplotlib as mpl
	mpl.use('Agg')
	import matplotlib.pyplot as pyplot
	from Bio import Phylo
	tree = Phylo.read('18S_rRNAs.dnd','newick')
	Phylo.draw(tree)
	pyplot.savefig("myTree.png")

Or you can run this in a Jupyter notebook and display the figure directly, which is easier (no need to set the matplotlib backend or save a file):

	import matplotlib.pyplot as pyplot
	from Bio import Phylo
	tree = Phylo.read('18S_rRNAs.dnd','newick')
	Phylo.draw(tree)

How does the resulting tree compare to what you expect given these species?

Create a Multiple Sequence Alignment and Phylogenetic Tree with phyML

The program PhyML provides much more flexibility in the kinds of trees it can compute. To use it, we'll need to convert our alignment file to PHYLIP format. We can do this with the Biopython module AlignIO:

	from Bio import AlignIO
	alignment = AlignIO.parse("18S_rRNAs.aln","clustal")
	AlignIO.write(alignment,"18S_rRNAs.phy","phylip")
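
One thing to check before converting, assuming Biopython's strict "phylip" writer behaves as its documentation describes: sequence names are cut to 10 characters, and two names that become identical after truncation cause an error. The sketch below, with a hypothetical helper truncated_name_clashes and made-up names, shows how such clashes could be spotted ahead of time.

```python
def truncated_name_clashes(names, width=10):
    """Group names that become identical once cut to `width` characters.

    Returns a dict mapping each clashing truncated name to the list of
    original names that collapse onto it. An empty dict means no clashes.
    """
    seen = {}
    for name in names:
        seen.setdefault(name[:width], []).append(name)
    return {short: full for short, full in seen.items() if len(full) > 1}


# Made-up deflines: the first two share their first 10 characters.
deflines = ["Homo_sapiens_5S", "Homo_sapiens_18S", "Mouse"]
print(truncated_name_clashes(deflines))
```

Short, distinct deflines such as Human and Mouse avoid the problem entirely.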

You will need to install phyml using mamba. Importantly, you must do the installation on the large server because of its RAM usage. You can install it with the following series of commands.

First, you'll need to create some writable directories for mamba:

        mkdir -p ~/mamba/envs
        mkdir -p ~/mamba/pkgs

Next, these commands tell mamba to use those directories and to apply strict channel priority:

        mamba config append envs_dirs ~/mamba/envs
        mamba config append pkgs_dirs ~/mamba/pkgs
        mamba config set channel_priority strict

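Next, create the mamba environment, activate it, and install phyml. The shell hook has to be loaded before mamba activate will work:

        mamba create -n bb485 -c conda-forge python=3.11
        eval "$(mamba shell hook --shell bash)"
        mamba activate bb485
        mamba install bioconda::phyml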

The simplest way to run phyml is to type "phyml" and then enter the name of your alignment file at the prompt:

	$ phyml
	
	Enter the sequence file name > 18S_rRNAs.phy

The program will give you a set of options, and you can optionally change them. To change the model, type "+" and then "M" to toggle through the models.

You can also run phyml using command line options described here: https://github.com/stephaneguindon/phyml/blob/master/doc/phyml-manual.pdf
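
As a rough sketch of the non-interactive route, the snippet below assembles a phyml command with the -i (input file), -d (data type) and -m (substitution model) options from the manual and runs it only if phyml and the alignment file are both present. The helper name build_phyml_command is hypothetical, and JC69 is chosen here only as an example model.

```python
import os
import shutil
import subprocess


def build_phyml_command(alignment, model="JC69"):
    """Return the phyml argument list for a nucleotide alignment."""
    return ["phyml", "-i", alignment, "-d", "nt", "-m", model]


cmd = build_phyml_command("18S_rRNAs.phy")
print(" ".join(cmd))

# Only attempt the run when phyml is installed and the alignment exists.
if shutil.which("phyml") and os.path.exists("18S_rRNAs.phy"):
    subprocess.run(cmd, check=True)
```

The printed command can also be pasted directly into the shell instead of going through Python.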

Let's compare the phyml tree to the clustalw tree:

	>>> import matplotlib as mpl
	>>> mpl.use('Agg')
	>>> import matplotlib.pyplot as pyplot
	>>> from Bio import Phylo	
	>>> tree = Phylo.read('18S_rRNAs.phy_phyml_tree.txt','newick')
	>>> Phylo.draw(tree)
	>>> pyplot.savefig("myTreeML.png")
	>>> quit()

For the following sequences, compute the Jukes-Cantor distance between each pair using a nested for-loop over the sequences. For each pair, compare the two sequences character by character to compute the proportion of differing sites, p. Then run the sequences through clustalw2 and phyml to compute a tree under the Jukes-Cantor model, and visualize the tree with Biopython. Finally, compare the branch lengths with your computed distances for a few pairs; you can do this manually by reading the branch lengths off the tree figure you create with Biopython. Download here: fake sequences

	>A
	GCTGATGCTAGCTAGCATGCATGCATGCTAGCTATCGTCGTGACTTACGTAGCTAGCTATCG
	>B
	GCTGATGCTAGTTAGCGTGCATGCAACCTAGCTATCGTCGAGACTTACCTAGCTAGTCATCG	
	>C
	GCTGGTGCAAGTTAGCGTGCAAGCAGCCTACCTATCGTCGAGACTAACCTAGCTAGTCGTCG	
	>D
	GCTGCTGCAAGTTAGCGTGCAGGCAGCCTACCTATCGACGAGACTATCCTAGCTAGTCGTGG	
	>E
	GCTGCTGCAAGTTAGCGTGCAGGCAGCATACCTATCGACGAGACTATCTTAGCTAGTCGTGG
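
For reference, the Jukes-Cantor distance is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of sites that differ between two aligned sequences. The sketch below shows one possible shape for the nested loop, using short made-up sequences (x, y, z) rather than the ones above; the function names p_distance and jukes_cantor are illustrative choices, not part of any library.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return float("inf")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


# Toy sequences, invented for the example.
toy = {"x": "ACGTACGT", "y": "ACGTACGA", "z": "ACCTACGA"}
names = list(toy)

# Visit each unordered pair once: j always starts after i.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        p = p_distance(toy[names[i]], toy[names[j]])
        print(names[i], names[j], round(p, 3), round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as p, because it accounts for sites that may have changed more than once.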