Let's work out on the board the Smith-Waterman alignment for the following sequences as a class:
>x ACAGTGACTA >y ACAGGCTCGA
Let’s first build a database. I downloaded the swissprot database from NCBI with the following command:
and then unzip the downloaded file with the following command:
Next, let’s build a database with the following command:
makeblastdb -in swissprot -input_type fasta -title swissprot -dbtype prot
Not all of these options are required. Can you figure out which options are required by the help message printed with you run this command:
Download the protein sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link. Save it to a file called brca1_pep.fa. Copy the sequence, and paste it into a file after opening it with nano:
To save with nano, type Ctrl-X, then type Y. Next, we can BLAST the brca1 pep.fasta file we created.
blastp -query brca1_pep.fasta -db swissprot > brca1_swissprot
Do the top hits make sense to you? You can search NCBI Protein for some of the IDs.
What are the best hits? Do the order of the sequence hits make sense in terms of what you know of the biology?
You can also BLAST the sequence to the "non-redundant" database "nr" by pasting it to the NCBI BLAST page here: NCBI BLAST. Note that you could do this on the command line with the following command. This is something you could try later.
Biopython You can analyze your blast hits using Biopython. To do this, you need to set the outout format to XML with the following command.
blastp -query brca1_pep.fasta -db swissprot -outfmt 5 > brca1_swissprot.xml
The XML can be difficult to read, but can be parsed easily. For example, you can print the alignment for each BLAST hit in the results with something like this:
>>> from Bio.Blast import NCBIXML >>> result_handle = open("my_blast.xml") >>> blast_records = NCBIXML.read(result_handle) >>> for blast_record in blast_records: ... for alignment in blast_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < E_VALUE_THRESH: ... print('sequence:', alignment.title) ... print('length:', alignment.length) ... print('e value:', hsp.expect) ... print(hsp.query) ... print(hsp.match) ... print(hsp.sbjct
Drosophila Genome data link: Drosphila genome version dm3. Download chromFa.tar.gz with the command at the terminal:
Unzip the file with the command:
tar xvfz chromFa.tar.gz
Combine all the chromosome fasta files into one genome file:
cat chr*fa > dm3.fa
Build a blast database:
makeblastdb -in dm3.fa -title dm3 -dbtype nucl
Download the sequence infomation for human BRCA1 and create a fasta file for the sequence NCBI human BRCA1 link.
BLAST The sequence to the genome:
blastn -query brca1.fa -db dm3.fa > brca1_dm3.blast
Download the RefSeq mRNA annotations refMrna.fa.gz with the command at the terminal:
gunzip the file with the command:
Create a database for the RefSeq annotations:
makeblastdb -in refMrna.fa -title refMrna -dbtype nucl
BLAST The sequence to the refMrna database:
blastn -query brca1.fa -db refMrna.fa > brca1_refMrna.blast
Discussion questions: the difference between the two results? How do you explain the difference? What is chrUextra anyway? To answer this you should look at the BLAST output with "less" in the same way you looked at other BLAST output above.
Going beyond: How does this e-value compare to BLASTing to mouse through the NCBI website? Can you find a gene in human that has a significant hit to the E. coli genome?