Lab 4: BLAST, Alignment



Part 1: BLASTing to a protein database

Let’s first build a database. I downloaded the swissprot database from NCBI with the following command:

	wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/swissprot.gz
      

and then unzip the downloaded file with the following command:

	gunzip swissprot.gz
      

Although there is no file extension, the file is a FASTA file. Let's rename it so that we know it is a FASTA file.

	mv swissprot swissprot.fa
      

Next, let’s build a database with the following command:

	makeblastdb -in swissprot.fa -input_type fasta -title swissprot -dbtype prot
      

Not all of these options are required. Can you figure out which options are required by the help message printed with you run this command?

	makeblastdb -help
      

Download the protein sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link. Save it to a file called brca1_pep.fa. Copy the sequence, and paste it into a file after opening it with nano:

	nano brca1_pep.fasta
      

To save with nano, type Ctrl-X, then type Y. Next, we can BLAST the brca1 pep.fasta file we created.

	blastp -query brca1_pep.fasta -db swissprot.fa > brca1_swissprot
      

Do the top hits make sense to you? You can search NCBI Protein for some of the IDs.

	less brca1_swissprot
      

What are the best hits? Do the order of the sequence hits make sense in terms of what you know of the biology?

You can also BLAST the sequence to the "non-redundant" database "nr" by pasting it to the NCBI BLAST page here: NCBI BLAST. Note that you could do theoretically do this by specifying "nr" for the database, but the BB485 server doesn't have downloaded (it's a very big file!)

Biopython You can analyze your blast hits using Biopython. To do this, you need to set the outout format to XML with the following command.

	
	blastp -query brca1_pep.fasta -db swissprot.fa -outfmt 5 > brca1_swissprot.xml
      

The XML can be difficult to read, but can be parsed easily. For example, you can print the alignment for each BLAST hit in the results with something like this:

      >>> from Bio.Blast import NCBIXML
      >>> result_handle = open("brca1_swissprot.xml")
      >>> blast_record = NCBIXML.read(result_handle)
      >>> for alignment in blast_record.alignments:
      ...      for hsp in alignment.hsps:
      ...          if hsp.expect < 1e-10:
      ...               print('sequence:', alignment.title)
      ...               print('length:', alignment.length)
      ...               print('e value:', hsp.expect)
      ...               print(hsp.query)
      ...               print(hsp.match)
      ...               print(hsp.sbjct)
     

Part 2: BLASTing to a genome

Drosophila Genome data link: Drosphila genome version dm3. Download chromFa.tar.gz with the command at the terminal:

wget https://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz

Unzip the file with the command:

tar xvfz chromFa.tar.gz

Combine all the chromosome fasta files into one genome file:

cat chr*fa > dm3.fa

You can clean up the directory by moving the chr*fa files into a directory. First create the directory:

mkdir genome

Move the chromosome files into the directory with this command:

mv chr*fa genome/.

Build a blast database:

makeblastdb -in dm3.fa -title dm3 -dbtype nucl

Download the sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link.

BLAST The sequence to the genome:

blastn -query brca1.fa -db dm3.fa > brca1_dm3.blast

Download the RefSeq mRNA annotations refMrna.fa.gz with the command at the terminal:

wget https://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/refMrna.fa.gz

gunzip the file with the command:

gunzip refMrna.fa.gz

Create a database for the RefSeq annotations:

makeblastdb -in refMrna.fa -title refMrna -dbtype nucl

BLAST The sequence to the refMrna database:

blastn -query brca1.fa -db refMrna.fa > brca1_refMrna.blast

Discussion questions: the difference between the two results? How do you explain the difference? What is chrUextra anyway? To answer this you should look at the BLAST output with "less" in the same way you looked at other BLAST output above.


Going beyond: How does this e-value compare to BLASTing to mouse through the NCBI website? Can you find a gene in human that has a significant hit to the E. coli genome?