Note: Some steps of this project take a while to run, though usually no longer than 15-20 minutes, so plan accordingly.
The data for this project was taken from this paper: Integrated approaches reveal determinants of genome-wide binding and function of the transcription factor Pho4, Zhou et al. 2011. Please take a look at this paper. We will be looking at the data corresponding to Pho4 under the No Pi conditions.
The paper studies the binding of the transcription factor Pho4, which binds to a sequence-specific motif described in the paper.
The raw data for this project is available at this GEO entry: GSE29506, but I have prepared the fastq files for download below.
Step 1: Download the script from this directory with wget: Pho4 ChIP-Seq data and scripts. NOTE: You don't need to download the data if you're using the bb485 server, as it is already in /usr/share/data. Please don't download it there, to save file space on the servers!
You will find 2 fastq files, and a perl script. The other data you will use was downloaded in Project 8. It would be a good idea to check the file sizes with "ls -l" to confirm that the download was complete.
$ ls -l *fastq
-rw-r--r-- 1 dhendrix dhendrix 2817509732 Mar 10 23:23 sce_INPUT_NoPi_ChIPSeq.fastq
-rw-r--r-- 1 dhendrix dhendrix 746715226 Mar 10 23:31 sce_Pho4_NoPi_ChIPSeq.fastq
Step 2: Align the fastq files to the genome: You need to align both the ChIP-Seq data and the INPUT (control) data.
If you're using the bb485 server, in consideration of filespace, let's use the pre-downloaded fastq files, as in these commands:
bowtie2 -x /usr/share/data/sce_R64_1_1 -U /usr/share/data/sce_Pho4_NoPi_ChIPSeq.fastq -S sce_Pho4_NoPi_ChIPSeq.sam
bowtie2 -x /usr/share/data/sce_R64_1_1 -U /usr/share/data/sce_INPUT_NoPi_ChIPSeq.fastq -S sce_INPUT_NoPi_ChIPSeq.sam
If you're using another server, or have the filespace, you can download the fastq files and run these commands:
bowtie2 -x sce_R64_1_1 -U sce_Pho4_NoPi_ChIPSeq.fastq -S sce_Pho4_NoPi_ChIPSeq.sam >& bwt1
bowtie2 -x sce_R64_1_1 -U sce_INPUT_NoPi_ChIPSeq.fastq -S sce_INPUT_NoPi_ChIPSeq.sam >& bwt2
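Before moving on, it can be worth checking how many reads actually aligned. The following is a minimal Python sketch, not part of the project scripts, that counts mapped and unmapped reads in a SAM file by testing the "unmapped" bit (0x4) of the FLAG column. The function name count_sam_alignments and the example records are made up for illustration, and the sketch assumes one record per read, which should be the case for single-end bowtie2 runs with default settings.

```python
def count_sam_alignments(lines):
    """Count mapped and unmapped reads from an iterable of SAM lines."""
    mapped = 0
    unmapped = 0
    for line in lines:
        if line.startswith("@"):
            # Header lines (@HD, @SQ, @PG, ...) carry no alignments.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # A SAM record has at least 11 mandatory columns; skip anything else.
            continue
        flag = int(fields[1])
        if flag & 0x4:
            # Bit 0x4 of the FLAG means the read did not align.
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Tiny in-memory example with made-up records (not real Pho4 data).
example = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "read1\t0\tchrI\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n",
    "read3\t16\tchrII\t5\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n",
]
mapped, unmapped = count_sam_alignments(example)
print("mapped:", mapped, "unmapped:", unmapped)

# To run on a real file, pass the open file handle instead:
# with open("sce_Pho4_NoPi_ChIPSeq.sam") as sam:
#     print(count_sam_alignments(sam))
```

The file is read line by line, so this should work on the large SAM files without loading them into memory.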
Step 3: Find the ChIP-Seq peaks with macs. Note that we are using a stringent p-value threshold here to pull out only the strongest peaks.
macs14 -t sce_Pho4_NoPi_ChIPSeq.sam -c sce_INPUT_NoPi_ChIPSeq.sam -n Pho4_vs_INPUT -g 1.2e7 -f SAM -p 1e-10 >& macs14.err
Macs will produce both peak bed files and summit bed files. As you may have guessed, the peaks are the full enriched regions, and the summits are the tips of the peaks, constituting only a small window of 2bp.
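To get a feel for the difference between the two kinds of bed files, you can read them in and compare the region widths. This is a minimal Python sketch, assuming only that the first three whitespace-separated columns are chromosome, start, and end, as in the standard BED format. The function names read_bed and region_widths and the example coordinates are made up for illustration.

```python
def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        records.append((fields[0], int(fields[1]), int(fields[2])))
    return records


def region_widths(records):
    """Width in bp of each region (BED ends are exclusive, so end - start)."""
    return [end - start for _chrom, start, end in records]


# Made-up example lines, not real MACS output.
example = [
    "chrI\t100\t350\tMACS_peak_1\t120.5",
    "chrII\t2000\t2400\tMACS_peak_2\t310.2",
]
peaks = read_bed(example)
print("number of regions:", len(peaks))
print("widths:", region_widths(peaks))

# To run on a real file:
# with open("Pho4_vs_INPUT_peaks.bed") as bed:
#     print(len(read_bed(bed)))
```

The length of the returned list is the number of regions in the file.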
Step 4: Use the bed file of the summits to create a fasta file of +/- 30bp around the summits:
perl ~/Scripts/bedToFasta.pl Pho4_vs_INPUT_summits.bed sce_R64_1_1.fa 30
Note: if you have the yeast genome file sce_R64_1_1.fa in another directory, specify the path to that directory.
This perl script was in the directory of Pho4 stuff you downloaded above. Basically, it takes the genomic locations in the bed file, and prints the corresponding sequences to a fasta file. Note that in general, this program is run with the inputs as follows:
perl bedToFasta.pl [bed file] [genome fasta] [sequence buffer]
Check out the perl script to see what it looks like. It's similar to python in many ways, but different in many as well.
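For comparison, here is a minimal Python sketch of the same idea: look up each bed region in the genome and pull out the sequence with a buffer on each side. This is not a translation of bedToFasta.pl, and the actual script may define the window and the fasta headers differently. The function names read_fasta and summit_windows, the header format, and the toy genome are all assumptions made for illustration.

```python
def read_fasta(lines):
    """Read fasta lines into a dict of {sequence name: sequence}."""
    genome = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            # Use the first word after '>' as the sequence name.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def summit_windows(bed_lines, genome, buffer):
    """Yield (header, sequence) for each bed region extended by buffer bp."""
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        seq = genome[chrom]
        # Clip the window so it never runs off the ends of the chromosome.
        lo = max(0, start - buffer)
        hi = min(len(seq), end + buffer)
        yield f"{chrom}:{lo}-{hi}", seq[lo:hi]


# Toy genome and summit, made up for illustration.
genome = read_fasta([">chrI toy chromosome", "ACGTACGTAC", "GGGTTTAAAC"])
summits = ["chrI\t9\t10\tMACS_peak_1\t50.0"]
for header, seq in summit_windows(summits, genome, 3):
    print(">" + header)
    print(seq)
```

With a buffer of 30, as in the command above, a 1bp bed region would give a 61bp sequence in this sketch, unless the window is clipped at a chromosome end.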
Step 5: Run MEME to find motifs in the peak regions:
meme Pho4_vs_INPUT_summits.fasta -mod zoops -dna -nmotifs 3 -maxw 8 -revcomp
If you did all the steps correctly, the binding site for Pho4 should be one of the motifs in the MEME output.
Questions to answer for this project:
How many peaks do you find at this threshold? You can use the command "wc -l bedfile" to count the number of lines in a bed file produced for the peaks.
Which motif corresponds to the motif in the paper? Include the LOGO for the motif you found in your results.
How many instances of the motif did you find?
What is the information content of the motif?
Even though this was run with "zoops", you might still expect every peak to have an instance of the motif. How many of the peaks have an instance of the motif?
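As a reminder of what "information content" means, here is a minimal Python sketch that computes it from a position probability matrix, assuming a uniform background of 0.25 per base. Under that assumption each column contributes 2 bits minus its Shannon entropy. MEME reports the information content in its own output, computed against its background model, so its number may differ somewhat from this simple version. The function name information_content and the example matrix are made up for illustration and are not the Pho4 motif.

```python
import math


def information_content(ppm):
    """Total information content in bits of a DNA motif.

    ppm is a list of columns, each a list of four probabilities (A, C, G, T)
    that sum to 1. A uniform background (0.25 per base) is assumed.
    """
    total = 0.0
    for column in ppm:
        # Shannon entropy of the column; terms with p = 0 contribute 0.
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        # The maximum entropy for 4 bases is 2 bits.
        total += 2.0 - entropy
    return total


# Made-up 3-column motif: one fixed base, one uninformative column,
# and one column split evenly between two bases.
example_ppm = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.5, 0.0, 0.0],
]
print("information content (bits):", information_content(example_ppm))
```

A fully conserved position contributes 2 bits and a completely random one contributes 0, so a strongly conserved 8bp motif would approach 16 bits.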