Lab 10: ChIP-Seq

Note: Some steps of this project take a while to run, typically up to 15-20 minutes, so plan accordingly.

The data for this project were taken from the paper "Integrated approaches reveal determinants of genome-wide binding and function of the transcription factor Pho4" (Zhou et al., 2011). Please take a look at this paper. We will be looking at the data corresponding to Pho4 under the No Pi conditions.

The paper studies the binding of the transcription factor Pho4, which recognizes a sequence-specific motif described in the paper.

The raw data for this project is available at this GEO entry: GSE29506, but I have prepared the fastq files for download below.

Step 1: Get the data from /usr/share/data/ by creating symbolic links to the following files:

	[hendrixd@bb485 Lab10]$ ln -s /usr/share/data/sce_Pho4_NoPi_ChIPSeq.fastq .
	[hendrixd@bb485 Lab10]$ ln -s /usr/share/data/sce_INPUT_NoPi_ChIPSeq.fastq .
	[hendrixd@bb485 Lab10]$ ls
	sce_INPUT_NoPi_ChIPSeq.fastq  sce_Pho4_NoPi_ChIPSeq.fastq

Step 2: Align the fastq files to the genome: You need to align both the ChIP-Seq data and the INPUT (control) data.

The genome file itself is available here: /usr/share/data/

You could first create an index to the genome with this command:

	bowtie-build sce_R64_1_1.fa sce_R64_1_1

If you're using the bb485 server, please save file space by using the pre-downloaded fastq files and genome index in /usr/share/data/ with these commands:

	bowtie -m 1 -S -q /usr/share/data/sce_R64_1_1 /usr/share/data/sce_Pho4_NoPi_ChIPSeq.fastq sce_Pho4_NoPi_ChIPSeq.sam

	bowtie -m 1 -S -q /usr/share/data/sce_R64_1_1 /usr/share/data/sce_INPUT_NoPi_ChIPSeq.fastq  sce_INPUT_NoPi_ChIPSeq.sam

If you're using another server, or have the file space, you can download the fastq files and run these commands:

	bowtie -m 1 -S -q sce_R64_1_1 sce_Pho4_NoPi_ChIPSeq.fastq sce_Pho4_NoPi_ChIPSeq.sam

	bowtie -m 1 -S -q sce_R64_1_1 sce_INPUT_NoPi_ChIPSeq.fastq  sce_INPUT_NoPi_ChIPSeq.sam
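
Once the alignments finish, it can be worth checking how many reads actually received an alignment before moving on. The sketch below is only an illustration, not part of the required workflow: it reads SAM lines and uses the standard SAM flag bit 0x4 (read unmapped) to split reads into mapped and unmapped. The function name and the example records are made up for this handout.

```python
def count_sam_alignments(lines):
    """Return (mapped, unmapped) read counts from an iterable of SAM lines."""
    mapped = 0
    unmapped = 0
    for line in lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        flag = int(fields[1])
        if flag & 0x4:  # bit 0x4 set means the read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up SAM records, for illustration only.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchrI\t100\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tchrII\t500\t255\t36M\t*\t0\t0\tACGT\tIIII",
]
print(count_sam_alignments(example))
```

To run this on your own output, pass an open file instead of the example list, for instance open("sce_Pho4_NoPi_ChIPSeq.sam").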

Step 3: Find the ChIP-Seq peaks with MACS. Note that we are using a stringent p-value threshold here to pull out only the strongest peaks.

	macs14 -t sce_Pho4_NoPi_ChIPSeq.sam -c sce_INPUT_NoPi_ChIPSeq.sam -n Pho4_vs_INPUT -g 1.2e7 -f SAM -p 1e-10 >& macs14.err

MACS will produce both peak bed files and summit bed files. As you may have guessed, the peaks are the full enriched regions, and the summits are the tips of the peaks, each covering only a small window of 2bp.
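
If you want to confirm the difference between the two kinds of bed files yourself, a short script can report how many intervals a bed file holds and how wide they are. This is a minimal sketch under the assumption that the file is ordinary tab- or space-delimited BED with chromosome, start, and end in the first three columns. The function name and the example rows are hypothetical, not real MACS output.

```python
def summarize_bed(lines):
    """Return (interval count, narrowest width, widest width) for BED lines."""
    widths = []
    for line in lines:
        # Skip blank lines and optional BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        widths.append(end - start)  # BED is 0-based, half-open
    if not widths:
        return 0, None, None
    return len(widths), min(widths), max(widths)


# Hypothetical rows, for illustration only.
example_bed = [
    "track name=example",
    "chrI\t100\t450\tpeak_1\t120.5",
    "chrII\t2000\t2180\tpeak_2\t88.1",
]
print(summarize_bed(example_bed))
```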

Step 4: Use the bed file of the summits to create a fasta file of +/- 30bp around the summits:

	perl ~/Scripts/bedToFasta.pl Pho4_vs_INPUT_summits.bed sce_R64_1_1.fa 30

Note: if you have the yeast genome file sce_R64_1_1.fa in another directory, specify the path to that directory.

You will need to write a script to map a bed file and genome to the corresponding genomic sequences. In the past, I provided a perl script to do this, but at this point you should be able to write this! Your script will need to read in a BED file and a genomic FASTA file, and use the genomic coordinates defined in the BED file to extract subsequences of the genome corresponding to the peaks:

	$ python bedToFasta.py sce_R64_1_1.fa Pho4_vs_INPUT_summits.bed

Your script should create an output FASTA file with the sequences of the summits +/- 30bp, meaning it should use an updated genomic coordinate that adds 30bp to each side of the window. In other words, it should update with something like this:

	start = start - 30
	end = end + 30
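
As a starting point, here is one possible outline of the core logic, offered as a sketch rather than the expected solution. It assumes that BED coordinates are 0-based and half-open (so they match Python slicing directly), that the chromosome names in the BED file match the first word of each FASTA header, and that windows should be clamped so they never run off either end of a chromosome. The function names and the tiny toy genome are invented for illustration. Reading the command-line arguments and writing the output file are left to you.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence name: sequence}."""
    seqs = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the first word of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def bed_to_fasta(bed_lines, genome, flank=30):
    """Yield (header, sequence) for each BED interval widened by `flank` bp."""
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in genome:
            continue  # assumption: silently skip unknown chromosome names
        start = max(0, start - flank)               # clamp at chromosome start
        end = min(len(genome[chrom]), end + flank)  # clamp at chromosome end
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end]


# Toy genome and summit, for illustration only.
toy_genome = read_fasta([">chrI toy chromosome", "ACGTACGTAC", "GGGGCCCCTT"])
for header, seq in bed_to_fasta(["chrI\t9\t11\tsummit_1\t50.0"], toy_genome, flank=3):
    print(f">{header}\n{seq}")
```

The clamping matters for summits that fall within 30bp of a chromosome end, where a negative start would otherwise make the slice return the wrong sequence.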

Step 5: Run MEME to find motifs in the peak region:

	meme Pho4_vs_INPUT_summits.fasta -mod zoops -dna -nmotifs 3 -maxw 8 -revcomp

If you did all the steps correctly, the binding site for Pho4 should be one of the motifs in the MEME output.

Questions to answer for this project:

How many peaks do you find at this threshold? You can use the command "wc -l bedfile" to count the number of lines in a bed file produced for the peaks.

Which motif corresponds to the motif in the paper? Include the LOGO for the motif you found in your results.

How many instances of the motif did you find?

What is the information content of the motif?

Even though this was run with "zoops", you might still expect every peak to have an instance of the motif. How many of the peaks have an instance of the motif?
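
For the information content question, it may help to see how the quantity is computed from a motif's position probability matrix. The sketch below uses the common definition for DNA with a uniform background: each position contributes 2 minus its Shannon entropy in bits, and the motif total is the sum over positions. This is an assumption about the definition, so the value MEME reports may differ somewhat, for example if it accounts for background base frequencies. The matrix shown is a made-up example, not the Pho4 motif.

```python
import math


def information_content(ppm):
    """Total information content in bits of a DNA position probability matrix.

    `ppm` is a list of columns; each column holds the probabilities of
    A, C, G, T at that position and should sum to 1.
    Assumes a uniform background (0.25 per base).
    """
    total = 0.0
    for column in ppm:
        # Shannon entropy of this position; zero-probability terms contribute 0.
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        total += 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet
    return total


# Made-up 3-position matrix, for illustration only.
demo_ppm = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two equally likely bases: 1 bit
    [0.25, 0.25, 0.25, 0.25],  # uninformative position: 0 bits
]
print(information_content(demo_ppm))
```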