Download the domain file from here: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gzThis file contains a set of HMM models for protein domains defined for the Pfam database. To use this file, we must first unzip the file with gunzip.
This will produce the unzipped file "Pfam-A.hmm", which can be used to scan for these domains using the HMMer software program hmmscan as part of the HMMer package. However,we still need to index this .hmm domain file. This can be done with the HMMer software.
As we have seen in this course, there are many ways to download a protein sequence. Since we will work on protein secondary structure prediction, let's consider downloading something from PDB, the Protein Databank.
The PDZ domain is a structural domain involved in signaling in many organisms ranging from bacteria to animals. The structure consists of rougly 5 beta-strands and 2 alpha-helicies, as demonstrated by this image of the tertiary structure::
You can see more detail on this structure here at pdb
One useful piece of data in this database entry is the secondary strucutre information found here in this image, also present at the PDB database:
The sequence can be obtained in the upper right hand corner of the PDB database page under "Downloaded Files" and then under "FASTA sequence". Here you can download the following sequence:
>2NB4:A|PDBID|CHAIN|SEQUENCE PLTRPYLGFRVAVGRDSSGCTTLSIQEVTQTYTGSNGGADLMGPAFAAGLRVGDQLVRFAGYTVTELAAFNTVVARHVRP SASIPVVFSRDGVVMSATIVVGELE
Now we can save this sequence to a file called pdz.fasta. We can then run hmmscan to see if there are any domains in this sequence:
hmmscan --domtblout domains_2NB4.txt Pfam-A.hmm pdz.fasta
Does this output make sense? you can view it with the following command, using "ls -S" to turn off word-wrapping.
ls -S domains_2NB4.txt
As expected the only domains are PDZ. What is unexpected, perhaps, is the positions of the two Pfam domains pertaining to PDZ (defined by the "ali coord" entry and "from" and "to" columns.) don't always start at the same position, suggesting that to some degree the boundaries are a bit "fuzzy" at times.
jnet -p pdz.fasta
This produces a file format that gives the sequence and a letter designation for whether the region is an alph-helix, designated by an "H", or a beta-strand, designated by an "E". How does the predicted structure correspond to the true structure?