Let’s first build a database. I downloaded the swissprot database from NCBI with the following command:
and then unzip the downloaded file with the following command:
Although there is no file extension, the file is a FASTA file. Let's rename it so that we know it is a FASTA file.
mv swissprot swissprot.fa
Next, let’s build a database with the following command:
makeblastdb -in swissprot.fa -input_type fasta -title swissprot -dbtype prot
Not all of these options are required. Can you figure out which options are required by the help message printed with you run this command?
Download the protein sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link. Save it to a file called brca1_pep.fa. Copy the sequence, and paste it into a file after opening it with nano:
To save with nano, type Ctrl-X, then type Y. Next, we can BLAST the brca1 pep.fasta file we created.
blastp -query brca1_pep.fasta -db swissprot.fa > brca1_swissprot
Do the top hits make sense to you? You can search NCBI Protein for some of the IDs.
What are the best hits? Do the order of the sequence hits make sense in terms of what you know of the biology?
You can also BLAST the sequence to the "non-redundant" database "nr" by pasting it to the NCBI BLAST page here: NCBI BLAST. Note that you could do theoretically do this by specifying "nr" for the database, but the BB485 server doesn't have downloaded (it's a very big file!)
Biopython You can analyze your blast hits using Biopython. To do this, you need to set the outout format to XML with the following command.
blastp -query brca1_pep.fasta -db swissprot -outfmt 5 > brca1_swissprot.xml
The XML can be difficult to read, but can be parsed easily. For example, you can print the alignment for each BLAST hit in the results with something like this:
>>> from Bio.Blast import NCBIXML >>> result_handle = open("brca1_swissprot.xml") >>> blast_record = NCBIXML.read(result_handle) >>> for alignment in blast_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < 1e-10: ... print('sequence:', alignment.title) ... print('length:', alignment.length) ... print('e value:', hsp.expect) ... print(hsp.query) ... print(hsp.match) ... print(hsp.sbjct)
Drosophila Genome data link: Drosphila genome version dm3. Download chromFa.tar.gz with the command at the terminal:
Unzip the file with the command:
tar xvfz chromFa.tar.gz
Combine all the chromosome fasta files into one genome file:
cat chr*fa > dm3.fa
You can clean up the directory by moving the chr*fa files into a directory. First create the directory:
Move the chromosome files into the directory with this command:
mv chr*fa genome/.
Build a blast database:
makeblastdb -in dm3.fa -title dm3 -dbtype nucl
Download the sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link.
BLAST The sequence to the genome:
blastn -query brca1.fa -db dm3.fa > brca1_dm3.blast
Download the RefSeq mRNA annotations refMrna.fa.gz with the command at the terminal:
gunzip the file with the command:
Create a database for the RefSeq annotations:
makeblastdb -in refMrna.fa -title refMrna -dbtype nucl
BLAST The sequence to the refMrna database:
blastn -query brca1.fa -db refMrna.fa > brca1_refMrna.blast
Discussion questions: the difference between the two results? How do you explain the difference? What is chrUextra anyway? To answer this you should look at the BLAST output with "less" in the same way you looked at other BLAST output above.
Going beyond: How does this e-value compare to BLASTing to mouse through the NCBI website? Can you find a gene in human that has a significant hit to the E. coli genome?