Download the domain file from here: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gzThis file contains a set of HMM models for protein domains defined for the Pfam database. To use this file, we must first unzip the file with gunzip.
This will produce the unzipped file "Pfam-A.hmm", which can be used to scan for these domains using the HMMer software program hmmscan as part of the HMMer package. However,we still need to index this .hmm domain file. This can be done with the HMMer software.
As we have seen in this course, there are many ways to download a protein sequence. Since we will work on protein secondary structure prediction, let's consider downloading something from PDB, the Protein Databank.
The PDZ domain is a structural domain involved in signaling in many organisms ranging from bacteria to animals. The structure consists of rougly 5 beta-strands and 2 alpha-helicies, as demonstrated by this image of the tertiary structure::
You can see more detail on this structure here at pdb
One useful piece of data in this database entry is the secondary strucutre information found here in this image, also present at the PDB database:
The sequence can be obtained in the upper right hand corner of the PDB database page under "Downloaded Files" and then under "FASTA sequence". Here you can download the following sequence:
>2NB4:A|PDBID|CHAIN|SEQUENCE PLTRPYLGFRVAVGRDSSGCTTLSIQEVTQTYTGSNGGADLMGPAFAAGLRVGDQLVRFAGYTVTELAAFNTVVARHVRP SASIPVVFSRDGVVMSATIVVGELE
Challenge: Download the fasta file here PDZ sequences and compute a multiple sequence alignment for the seuqences to see if it improves the domain identification.
Now we can save this sequence to a file called pdz.fasta. We can then run hmmscan to see if there are any domains in this sequence:
hmmscan --domtblout domains_2NB4.txt Pfam-A.hmm pdz.fasta
Does this output make sense? you can view it with the following command, using "ls -S" to turn off word-wrapping.
ls -S domains_2NB4.txt
As expected the only domains are PDZ. What is unexpected, perhaps, is the positions of the two Pfam domains pertaining to PDZ (defined by the "ali coord" entry and "from" and "to" columns.) don't always start at the same position, suggesting that to some degree the boundaries are a bit "fuzzy" at times.
jnet -p pdz.fasta
This produces a file format that gives the sequence and a letter designation for whether the region is an alph-helix, designated by an "H", or a beta-strand, designated by an "E". How does the predicted structure correspond to the true structure?
You can run this program on an alignment, but you need to remove gaps from the multiple sequence alignment. To do this, first compute an alignment. I used the PDZ file above
clustalw2 -infile=PDZ_sequences.fasta -type=Protein -outfile=PDZ_sequences.aln
Next, you need to convert to FASTA alignment format and remove gaps. You can covert to FASTA using AlignIO (see Lab 5), but this will not remove gaps. To remove gaps, very industrious students might try to write a python script to do this. Note that you need to remove gaps a column of the alignment. Meaning, if one sequene has a gap, that specific position of the alignment would be removed for all sequences. Alternatively, you an use the program "trimal" to do this. You can remove the gaps and convert to FASTA in one step:
trimal -in PDZ_sequences.aln -fasta -nogaps > PDZ_sequences.fa
Next, you can run jnet on the resulting alignment:
jnet -p PDZ_sequences.fa
The problem here is whether or not the alignment with gaps removed is informative. If your sequences are too far away, a lot of gaps could be removed, rendering the resuling gap-free alignment less "true" to the original sequences.