Changes between Initial Version and Version 1 of SOPs/InProgressChipSeq


Ignore:
Timestamp:
05/14/14 11:33:29 (11 years ago)
Author:
ibarrasa
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/InProgressChipSeq

    v1 v1  
     1
     2== Using ChIP-Seq to identify and/or quantify bound regions (peaks) ==
     3 
     4
     5=== Reviews ===
     6[[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]] Wilbanks EG, and Facciotti MT (2010; PLoS ONE)
     7
     8[[http://nar.oxfordjournals.org/content/early/2010/11/25/nar.gkq1187.abstract?etoc|A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs]] (NAR Nov. 2010)
     9
     10=== Step 1: Map reads ===
     11
     12 * Use [[http://bowtie-bio.sourceforge.net/tutorial.shtml|Bowtie]], [[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|bowtie2]], or another unspliced mapping tool[[br]]
     13
     14 * See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details.
     15
     16=== Step 2: Call peaks (bound regions) ===
     17Some of the parameters to consider when comparing programs are:
     18  * Adjustment of sequence tags to better represent the original DNA fragment (by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments)
     19  * Background model used
     20  * Use of strand dependent bimodality
     21
     22BaRC performed a [[chipSeqAnalysisPackages|ChIP-Seq bake off]] in the summer of 2011, with these [[chipSeqBakeOff|ChIP-Seq bake off results]].
     23
     24Based on our ChIP-Seq bake off and on a published review ([[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]]), MACs and SISSRs are good programs to try.
     25
     26==== MACS ====
     27  * For MACS to work the header of the sequences has to have no spaces.
     28  * macs points to macs14 on WIBR local machines
     29  * MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does).  As a result, you may need to filter out unmapped reads with a command like
     30{{{
     31samtools view -hS -F 4 s8_control_mouse_mm9.k1.n2.l36.best.sam > s8_control_mouse_mm9.k1.n2.l36.best.mapped_only.sam
     32}}}
     33
     34[[http://liulab.dfci.harvard.edu/MACS/|MACS]] ([[http://liulab.dfci.harvard.edu/MACS/00README.html|README]]) ([[http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=pubmed&term=18798982|Zhang et al., 2008]])[[br]][[br]]
     35
     36  * MACS is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions.  However, for broader peaks changing values of the parameters may be needed: eg. using --nomodel, --nolambda (if there's no control), --call-subpeaks.  Running MACS with different parameters and viewing in IGV the results can help in choosing the optimal values.
     37
     38Sample commands to run MACS (current version as of March 5 2012): 1.4.2 using mapped reads in map or sam format:
     39
     40{{{
     41macs -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map -g 1.87e9 --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30
     42macs -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam -g 1.87e9 --name=outputName --format=SAM --tsize=36 --wig --space=25  --mfold=10,30
     43}}}
     44
     45The parameters used on the command are:
     46  * -t TFILE Treatment file
     47  * -c CFILE Control file
     48  * --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA"
     49  * --format=FORMAT       Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"
     50  * --tsize=TSIZE         Tag size. DEFAULT: auto detected tag size
     51  * --wig                 Whether or not to save shifted raw tag count at every bp into a wiggle file
     52  * --mfold=MFOLD      Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
     53  * -g GSIZE  Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs
     54  * --keep-dup=1   Controls the MACS behavior towards duplicate tags at the exact same location.  DEFAULT: 1 in MACS 1.4; auto in MACS2. 
     55
     56{{{
     57bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30"
     58bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam --name=outputName --format=SAM--tsize=36 --wig --space=25  --mfold=10,30"
     59}}}
     60
     61'''Note:''' The wig files that macs14 generates are not normalized.
     62==== SISSRs ====
     63[[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/|SiSSRs]] ([[http://nar.oxfordjournals.org/cgi/content/full/36/16/5221|Reference]]) ([[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf|Manual]])
     64[[chipSeqBakeOff|ChIP-Seq bake off]]
     65
     66SISSRs uses strand bimodality to try to find the summit of the peak. The summit should be very close to the DNA bound by the TF.  It is more appropriate for TFs because they tend to bind in specific narrow regions.
     67
     68Map with Bowtie, use --sam parameter to get a sam ouput file
     69{{{
     70bsub "bowtie -t  -m 3 -n 3 -l 36  --strata --best --solexa1.3-quals  --sam   inputSeq bowtieOutput.sam"
     71}}}
     72
     73SISRs input is a bed file. Convert mapped reads from SAM to BAM and from BAM to bed format
     74
     75{{{
     76bsub "samtools view  -S -b -o bowtieOutput.bam   bowtieOutput.sam"
     77-S       input is SAM
     78-b       output BAM
     79}}}
     80
     81{{{
     82bsub "bamToBed -i bowtieOutput.bam > bowtieOutput.bed"
     83}}}
     84
     85Run SISSRs with a sample command like
     86{{{
     87sissrs.pl -i  bowtieOutput.bed -o outputFile  -s 2716965481 -b Background.bed -L 200
     88 }}}
     89   
     90The parameters used in the sample command:
     91  * -s is the size of the genome
     92  * -L is the maximum length of the fragment
     93  * -m is the percentage of mappable bps. Default is .8 for Eland in human.
     94
     95For more detailed description of parameters see [[chipSeqBakeOff|ChIP-Seq bake off]]
     96
     97
     98=== Step 3: Linking bound regions to genes ===
     99Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.
     100
     101To link this regions to genes check out this SOP: [[genome_regions_annotations|Linking genome regions to genome annotations]]
     102
     103Below it is a detailed example of how to do the following annotation:
     104 A. Find the genes that overlap with your peaks
     105 A. Find the genes that have a peak within -D distance upstream of the TSS/gene
     106 A. Find the genes that have a peak -D distance downstream of the gene
     107 A. Find the genes that have a peak -D distance downstream of the TSS
     108Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).
     109
     110=== Comparing binding across different samples ===
     111
     112  * Method 1: Intersect bound regions (peaks)
     113    * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
     114    * This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
     115  * Method 2: Compare affinity (number of mapped reads) across bound regions
     116    * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
     117    * Merge peaks of all samples (and make non-redundant):
     118    * mergeBed -i Peaks_all_redundant.bed > Peaks.bed
     119    * Count reads in each sample that overlaps each peak
     120    * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed
     121    * Modify counts per peak to account for differing numbers of mapped reads.
     122  * Method 3: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample
     123
     124----
     125**Step 1**
     126
     127**Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
     128{{{
     129i.e., on the table browser select:
     130Mammal->Mouse
     131group: Genes and Gene Prediction Tracks
     132track: RefSeq Genes
     133table: refGene
     134region: genome
     135output format: bed
     136(name your file)
     137get output
     138}}}
     139
     140 {{{
     141for A: Select Whole Gene
     142for B: Select Upstream by D bases
     143for C: Downstream by D bases
     144 }}}
     145 
     146You will have now 3 annotation files named for example:
     147
     148{{{
     149mm9_REFseqgenes.bed   (gene bodies)
     150mm9_REFseqgenes1kbUp.bed  (1Kb upstream of the gene)
     151mm9_REFseqgenes1kbDownofGene.bed (1Kb downstream of the gene)
     152}}}   
     153   
     154The gene bodies file has more columns that you need, so keep only the first 6 columns.
     155
     156{{{
     157cut -f1-6 mm9_REFseqgenes.bed > mm9_REFseqgenes_TRIM.bed
     158}}}
     159
     160For annotation D, to get the coordinates of the regions from the TSS to a certain distance downstream of the TSS (beginning of the gene):
     161
     1621. Take the genes files and keep the coordinate that represents the start of the gene (based on the strand),
     163
     164{{{
     165awk -F "\t" '{if ($6 == "+") {print  $1"\t"$2"\t"$2+1"\t"$4"\t"$5"\t"$6} else {print  $1"\t"$3-1"\t"$3"\t"$4"\t"$5"\t"$6}  }' mm9_REFseqgenes.bed >| mm9_REFseqgenesTSS.bed
     166}}}
     167
     1682. Then use the tool "slopBed" to increase the range of genomic coordinates to include downstream of the TSS
     169
     170The tool "slopBed" needs a file with the chromosome sizes that can be created with this command
     171
     172{{{
     173mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome
     174}}}
     175
     176Slop the file with the TSS coordinates to get the desired distance downstream (3') of the TSS
     177
     178{{{
     179slopBed -i mm9_REFseqgenesTSS.bed  -g mm9.genome -l 0 -r 1000 -s  >|  mm9_REFseqgenes_1kbdownofTSS.bed
     180}}}
     181
     182
     183----
     184
     185**Step 2: Intersect your peaks and your annotation files**
     186{{{
     187intersectBed -a genes.bed -b peaks.bed -wa -wb
     188-wa returns the list from file A
     189-wb returns the list from file B
     190}}}
     191
     192{{{
     193intersectBed -a mm9_1KbUpDownofTSS.bed -b peaks.bed -wa -wb >| 1KbUpDownofTSS_peaks.bed
     194}}}
     195
     196