Changes between Initial Version and Version 1 of SOPs/chip_seq_peaks


Ignore:
Timestamp:
01/23/13 16:49:43 (12 years ago)
Author:
trac
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/chip_seq_peaks

    v1 v1  
     1=== Reviews ===
     2[[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]] Wilbanks EG, and Facciotti MT (2010; PLoS ONE)
     3
     4[[http://nar.oxfordjournals.org/content/early/2010/11/25/nar.gkq1187.abstract?etoc|A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs]] (NAR Nov. 2010)
     5
     6=== Step 1: Map reads ===
     7
     8 * Use [[http://bowtie-bio.sourceforge.net/tutorial.shtml|Bowtie]], [[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|bowtie2]], or another unspliced mapping tool[[br]]
     9 Sample bowtie command: Output ".map" file
     10
     11{{{
     12 bsub "bowtie  -k 1 -n 2 -l 70 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt ../s7_mouse_mm9.k1.n2.l36.best.map"
     13}}}
     14
     15The parameters used on the sample command are:
     16  * -l/--seedlen <int> seed length for -n (default: 28) -- Set to longest possible length of high-quality bases.
     17  * -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
     18  * -k <int>           report up to <int> good alignments per read (default: 1) -- If you want only uniquely mapped reads, however, also use '-m 1' to ignore multi-mapped reads
     19  * --solexa1.3-quals  (if input quals are from GA Pipeline ver. >= 1.3)[[br]]
     20 To find out the solexa version of your reads: [[version_of_solexa_fastq|version_of_solexa_fastq]]
     21  * --best             (in the case of multi-mapped reads, keep only the best hit(s))
     22
     23
     24Sample command: Output ".sam" file
     25{{{
     26bsub "bowtie  -k 1 -n 2 -l 70 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt ../s7_mouse_mm9.k1.n2.l36.best.sam"
     27}}}
     28
     29Parameters     
     30  * --sam  To get SAM output format
     31
     32To see other parameters log into tak and type '''bowtie'''
     33
     34
     35=== Step 2: Call peaks (bound regions) ===
     36Some of the parameters to consider when comparing programs are:
     37  * Adjustment of sequence tags to better represent the original DNA fragment (by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments)
     38  * Background model used
     39  * Use of strand dependent bimodality
     40
     41BaRC performed a [[chipSeqAnalysisPackages|ChIP-Seq bake off]] in the summer of 2011, with these [[chipSeqBakeOff|ChIP-Seq bake off results]].
     42
     43Based on our ChIP-Seq bake off and on a published review ([[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]]), MACs and SISSRs are good programs to try.
     44
     45==== MACS ====
     46  * For MACS to work the header of the sequences has to have no spaces.
     47  * macs points to macs14 on WIBR local machines
     48  * MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does).  As a result, you may need to filter out unmapped reads with a command like
     49{{{
     50samtools view -hS -F 4 s8_control_mouse_mm9.k1.n2.l36.best.sam > s8_control_mouse_mm9.k1.n2.l36.best.mapped_only.sam
     51}}}
     52
     53[[http://liulab.dfci.harvard.edu/MACS/|MACS]] ([[http://liulab.dfci.harvard.edu/MACS/00README.html|README]]) ([[http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=pubmed&term=18798982|Zhang et al., 2008]])[[br]][[br]]
     54
     55  * MACS is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions.  However, for broader peaks changing values of the parameters may be needed: eg. using --nomodel, --nolambda (if there's no control), --call-subpeaks.  Running MACS with different parameters and viewing in IGV the results can help in choosing the optimal values.
     56
     57Sample commands to run MACS (current version as of March 5 2012): 1.4.2 using mapped reads in map or sam format:
     58
     59{{{
     60macs -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map -g 1.87e9 --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30
     61macs -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam -g 1.87e9 --name=outputName --format=SAM --tsize=36 --wig --space=25  --mfold=10,30
     62}}}
     63
     64The parameters used on the command are:
     65  * -t TFILE Treatment file
     66  * -c CFILE Control file
     67  * --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA"
     68  * --format=FORMAT       Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"
     69  * --tsize=TSIZE         Tag size. DEFAULT: auto detected tag size
     70  * --wig                 Whether or not to save shifted raw tag count at every bp into a wiggle file
     71  * --mfold=MFOLD      Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
     72  * -g GSIZE  Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs
     73  * --keep-dup=1   Controls the MACS behavior towards duplicate tags at the exact same location.  DEFAULT: 1 in MACS 1.4; auto in MACS2. 
     74
     75{{{
     76bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30"
     77bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam --name=outputName --format=SAM--tsize=36 --wig --space=25  --mfold=10,30"
     78}}}
     79
     80==== SISSRs ====
     81[[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/|SiSSRs]] ([[http://nar.oxfordjournals.org/cgi/content/full/36/16/5221|Reference]]) ([[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf|Manual]])
     82[[chipSeqBakeOff|ChIP-Seq bake off]]
     83
     84SISSRs uses strand bimodality to try to find the summit of the peak. The summit should be very close to the DNA bound by the TF.  It is more appropriate for TFs because they tend to bind in specific narrow regions.
     85
     86Map with Bowtie, use --sam parameter to get a sam ouput file
     87{{{
     88bsub "bowtie -t  -m 3 -n 3 -l 36  --strata --best --solexa1.3-quals  --sam   inputSeq bowtieOutput.sam"
     89}}}
     90
     91SISRs input is a bed file. Convert mapped reads from SAM to BAM and from BAM to bed format
     92
     93{{{
     94bsub "samtools view  -S -b -o bowtieOutput.bam   bowtieOutput.sam"
     95-S       input is SAM
     96-b       output BAM
     97}}}
     98
     99{{{
     100bsub "bamToBed -i bowtieOutput.bam > bowtieOutput.bed"
     101}}}
     102
     103Run SISSRs with a sample command like
     104{{{
     105sissrs.pl -i  bowtieOutput.bed -o outputFile  -s 2716965481 -b Background.bed -L 200
     106 }}}
     107   
     108The parameters used in the sample command:
     109  * -s is the size of the genome
     110  * -L is the maximum length of the fragment
     111  * -m is the percentage of mappable bps. Default is .8 for Eland in human.
     112
     113For more detailed description of parameters see [[chipSeqBakeOff|ChIP-Seq bake off]]
     114
     115
     116=== Step 3: Linking bound regions to genes ===
     117Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.
     118
     119To link this regions to genes check out this SOP: [[genome_regions_annotations|Linking genome regions to genome annotations]]
     120
     121Below it is a detailed example of how to do the following annotation:
     122 A. Find the genes that overlap with your peaks
     123 A. Find the genes that have a peak within -D distance upstream of the TSS/gene
     124 A. Find the genes that have a peak -D distance downstream of the gene
     125 A. Find the genes that have a peak -D distance downstream of the TSS
     126Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).
     127
     128----
     129**Step 1**
     130
     131**Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
     132{{{
     133i.e., on the table browser select:
     134Mammal->Mouse
     135group: Genes and Gene Prediction Tracks
     136track: RefSeq Genes
     137table: refGene
     138region: genome
     139output format: bed
     140(name your file)
     141get output
     142}}}
     143
     144 {{{
     145for A: Select Whole Gene
     146for B: Select Upstream by D bases
     147for C: Downstream by D bases
     148 }}}
     149 
     150You will have now 3 annotation files named for example:
     151
     152{{{
     153mm9_REFseqgenes.bed   (gene bodies)
     154mm9_REFseqgenes1kbUp.bed  (1Kb upstream of the gene)
     155mm9_REFseqgenes1kbDownofGene.bed (1Kb downstream of the gene)
     156}}}   
     157   
     158The gene bodies file has more columns that you need, so keep only the first 6 columns.
     159
     160{{{
     161cut -f1-6 mm9_REFseqgenes.bed > mm9_REFseqgenes_TRIM.bed
     162}}}
     163
     164For annotation D, to get the coordinates of the regions from the TSS to a certain distance downstream of the TSS (beginning of the gene):
     165
     1661. Take the genes files and keep the coordinate that represents the start of the gene (based on the strand),
     167
     168{{{
     169awk -F "\t" '{if ($6 == "+") {print  $1"\t"$2"\t"$2+1"\t"$4"\t"$5"\t"$6} else {print  $1"\t"$3-1"\t"$3"\t"$4"\t"$5"\t"$6}  }' mm9_REFseqgenes.bed >| mm9_REFseqgenesTSS.bed
     170}}}
     171
     1722. Then use the tool "slopBed" to increase the range of genomic coordinates to include downstream of the TSS
     173
     174The tool "slopBed" needs a file with the chromosome sizes that can be created with this command
     175
     176{{{
     177mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome
     178}}}
     179
     180Slop the file with the TSS coordinates to get the desired distance downstream (3') of the TSS
     181
     182{{{
     183slopBed -i mm9_REFseqgenesTSS.bed  -g mm9.genome -l 0 -r 1000 -s  >|  mm9_REFseqgenes_1kbdownofTSS.bed
     184}}}
     185
     186
     187----
     188
     189**Step 2: Intersect your peaks and your annotation files**
     190{{{
     191intersectBed -a genes.bed -b peaks.bed -wa -wb
     192-wa returns the list from file A
     193-wb returns the list from file B
     194}}}
     195
     196{{{
     197intersectBed -a mm9_1KbUpDownofTSS.bed -b peaks.bed -wa -wb >| 1KbUpDownofTSS_peaks.bed
     198}}}
     199
     200