Lab 4: BLAST, Alignment



Part 1: Smith-Waterman Example

Let's work out on the board the Smith-Waterman alignment for the following sequences as a class:

	>x
	ACAGTGACTA
	>y
	ACAGGCTCGA
      

Part 2: BLAST on the command line

Let’s first build a database. I downloaded the swissprot database from NCBI with the following command:

	wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/swissprot.gz
      

and then unzip the downloaded file with the following command:

	gunzip swissprot.gz
      

Next, let’s build a database with the following command:

	makeblastdb -in swissprot -input_type fasta -title swissprot -dbtype prot
      

Not all of these options are required. Can you figure out which options are required by the help message printed with you run this command:

	makeblastdb -help
      

Download the protein sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link. Save it to a file called brca1_pep.fa. Copy the sequence, and paste it into a file after opening it with nano:

	nano brca1_pep.fasta
      

To save with nano, type Ctrl-X, then type Y. Next, we can BLAST the brca1 pep.fasta file we created.

	blastp -query brca1_pep.fasta -db swissprot > brca1_swissprot
      

Do the top hits make sense to you? You can search NCBI Protein for some of the IDs.

	less brca1_swissprot
      

What are the best hits? Do the order of the sequence hits make sense in terms of what you know of the biology?

You can also BLAST the sequence to the "non-redundant" database "nr" by pasting it to the NCBI BLAST page here: NCBI BLAST. Note that you could do this on the command line with the following command. This is something you could try later.

Part 3: Formatting a database

Drosophila Genome data link: Drosphila genome version dm3. Download chromFa.tar.gz with the command at the terminal:

wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz

Unzip the file with the command:

tar xvfz chromFa.tar.gz

Combine all the chromosome fasta files into one genome file:

cat chr*fa > dm3.fa

Build a blast database:

makeblastdb -in dm3.fa -title dm3 -dbtype nucl

Download the sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link.

BLAST The sequence to the genome:

blastn -query brca1.fa -db dm3.fa > brca1_dm3.blast

Download the RefSeq mRNA annotations refMrna.fa.gz with the command at the terminal:

wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/refMrna.fa.gz

gunzip the file with the command:

gunzip refMrna.fa.gz

Create a database for the RefSeq annotations:

makeblastdb -in refMrna.fa -title refMrna -dbtype nucl

BLAST The sequence to the refMrna database:

blastn -query brca1.fa -db refMrna.fa > brca1_refMrna.blast

Discussion questions: the difference between the two results? How do you explain the difference? What is chrUextra anyway? To answer this you should look at the BLAST output with "less" in the same way you looked at other BLAST output above.


Going beyond: How does this e-value compare to BLASTing to mouse through the NCBI website? Can you find a gene in human that has a significant hit to the E. coli genome?