Context Navigation

← Previous Change
Wiki History
Next Change →

Changes between Initial Version and Version 1 of SOPs/chip_seq_peaks

Timestamp:: 01/23/13 16:49:43 (12 years ago)
Author:: trac
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/chip_seq_peaks

               v1
+=== Reviews ===
+[[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]] Wilbanks EG, and Facciotti MT (2010; PLoS ONE)
+[[http://nar.oxfordjournals.org/content/early/2010/11/25/nar.gkq1187.abstract?etoc|A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs]] (NAR Nov. 2010)
+=== Step 1: Map reads ===
+ * Use [[http://bowtie-bio.sourceforge.net/tutorial.shtml|Bowtie]], [[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|bowtie2]], or another unspliced mapping tool[[br]]
+ Sample bowtie command: Output ".map" file
+{{{
+ bsub "bowtie  -k 1 -n 2 -l 70 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt ../s7_mouse_mm9.k1.n2.l36.best.map"
+}}}
+The parameters used on the sample command are:
+  * -l/--seedlen <int> seed length for -n (default: 28) -- Set to longest possible length of high-quality bases.
+  * -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
+  * -k <int>           report up to <int> good alignments per read (default: 1) -- If you want only uniquely mapped reads, however, also use '-m 1' to ignore multi-mapped reads
+  * --solexa1.3-quals  (if input quals are from GA Pipeline ver. >= 1.3)[[br]]
+ To find out the solexa version of your reads: [[version_of_solexa_fastq|version_of_solexa_fastq]]
+  * --best             (in the case of multi-mapped reads, keep only the best hit(s))
+Sample command: Output ".sam" file
+{{{
+bsub "bowtie  -k 1 -n 2 -l 70 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt ../s7_mouse_mm9.k1.n2.l36.best.sam"
+}}}
+Parameters
+  * --sam  To get SAM output format
+To see other parameters log into tak and type '''bowtie'''
+=== Step 2: Call peaks (bound regions) ===
+Some of the parameters to consider when comparing programs are:
+  * Adjustment of sequence tags to better represent the original DNA fragment (by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments)
+  * Background model used
+  * Use of strand dependent bimodality
+BaRC performed a [[chipSeqAnalysisPackages|ChIP-Seq bake off]] in the summer of 2011, with these [[chipSeqBakeOff|ChIP-Seq bake off results]].
+Based on our ChIP-Seq bake off and on a published review ([[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]]), MACs and SISSRs are good programs to try.
+==== MACS ====
+  * For MACS to work the header of the sequences has to have no spaces.
+  * macs points to macs14 on WIBR local machines
+  * MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does).  As a result, you may need to filter out unmapped reads with a command like
+{{{
+samtools view -hS -F 4 s8_control_mouse_mm9.k1.n2.l36.best.sam > s8_control_mouse_mm9.k1.n2.l36.best.mapped_only.sam
+}}}
+[[http://liulab.dfci.harvard.edu/MACS/|MACS]] ([[http://liulab.dfci.harvard.edu/MACS/00README.html|README]]) ([[http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=pubmed&term=18798982|Zhang et al., 2008]])[[br]][[br]]
+  * MACS is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions.  However, for broader peaks changing values of the parameters may be needed: eg. using --nomodel, --nolambda (if there's no control), --call-subpeaks.  Running MACS with different parameters and viewing in IGV the results can help in choosing the optimal values.
+Sample commands to run MACS (current version as of March 5 2012): 1.4.2 using mapped reads in map or sam format:
+{{{
+macs -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map -g 1.87e9 --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30
+macs -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam -g 1.87e9 --name=outputName --format=SAM --tsize=36 --wig --space=25  --mfold=10,30
+}}}
+The parameters used on the command are:
+  * -t TFILE Treatment file
+  * -c CFILE Control file
+  * --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA"
+  * --format=FORMAT       Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"
+  * --tsize=TSIZE         Tag size. DEFAULT: auto detected tag size
+  * --wig                 Whether or not to save shifted raw tag count at every bp into a wiggle file
+  * --mfold=MFOLD      Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
+  * -g GSIZE  Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs
+  * --keep-dup=1   Controls the MACS behavior towards duplicate tags at the exact same location.  DEFAULT: 1 in MACS 1.4; auto in MACS2.
+{{{
+bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.map -c ./s8_control_mouse_mm9.k1.n2.l36.best.map --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30"
+bsub "macs14 -t ./s7_mouse_mm9.k1.n2.l36.best.sam -c ./s8_control_mouse_mm9.k1.n2.l36.best.sam --name=outputName --format=SAM--tsize=36 --wig --space=25  --mfold=10,30"
+}}}
+==== SISSRs ====
+[[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/|SiSSRs]] ([[http://nar.oxfordjournals.org/cgi/content/full/36/16/5221|Reference]]) ([[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf|Manual]])
+[[chipSeqBakeOff|ChIP-Seq bake off]]
+SISSRs uses strand bimodality to try to find the summit of the peak. The summit should be very close to the DNA bound by the TF.  It is more appropriate for TFs because they tend to bind in specific narrow regions.
+Map with Bowtie, use --sam parameter to get a sam ouput file
+{{{
+bsub "bowtie -t  -m 3 -n 3 -l 36  --strata --best --solexa1.3-quals  --sam   inputSeq bowtieOutput.sam"
+}}}
+SISRs input is a bed file. Convert mapped reads from SAM to BAM and from BAM to bed format
+{{{
+bsub "samtools view  -S -b -o bowtieOutput.bam   bowtieOutput.sam"
+-S       input is SAM
+-b       output BAM
+}}}
+{{{
+bsub "bamToBed -i bowtieOutput.bam > bowtieOutput.bed"
+}}}
+Run SISSRs with a sample command like
+{{{
+sissrs.pl -i  bowtieOutput.bed -o outputFile  -s 2716965481 -b Background.bed -L 200
+ }}}
+The parameters used in the sample command:
+  * -s is the size of the genome
+  * -L is the maximum length of the fragment
+  * -m is the percentage of mappable bps. Default is .8 for Eland in human.
+For more detailed description of parameters see [[chipSeqBakeOff|ChIP-Seq bake off]]
+=== Step 3: Linking bound regions to genes ===
+Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.
+To link this regions to genes check out this SOP: [[genome_regions_annotations|Linking genome regions to genome annotations]]
+Below it is a detailed example of how to do the following annotation:
+ A. Find the genes that overlap with your peaks
+ A. Find the genes that have a peak within -D distance upstream of the TSS/gene
+ A. Find the genes that have a peak -D distance downstream of the gene
+ A. Find the genes that have a peak -D distance downstream of the TSS
+Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).
+----
+**Step 1**
+**Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
+{{{
+i.e., on the table browser select:
+Mammal->Mouse
+group: Genes and Gene Prediction Tracks
+track: RefSeq Genes
+table: refGene
+region: genome
+output format: bed
+(name your file)
+get output
+}}}
+ {{{
+for A: Select Whole Gene
+for B: Select Upstream by D bases
+for C: Downstream by D bases
+ }}}
+You will have now 3 annotation files named for example:
+{{{
+mm9_REFseqgenes.bed   (gene bodies)
+mm9_REFseqgenes1kbUp.bed  (1Kb upstream of the gene)
+mm9_REFseqgenes1kbDownofGene.bed (1Kb downstream of the gene)
+}}}
+The gene bodies file has more columns that you need, so keep only the first 6 columns.
+{{{
+cut -f1-6 mm9_REFseqgenes.bed > mm9_REFseqgenes_TRIM.bed
+}}}
+For annotation D, to get the coordinates of the regions from the TSS to a certain distance downstream of the TSS (beginning of the gene):
+. Take the genes files and keep the coordinate that represents the start of the gene (based on the strand),
+{{{
+awk -F "\t" '{if ($6 == "+") {print  $1"\t"$2"\t"$2+1"\t"$4"\t"$5"\t"$6} else {print  $1"\t"$3-1"\t"$3"\t"$4"\t"$5"\t"$6}  }' mm9_REFseqgenes.bed >| mm9_REFseqgenesTSS.bed
+}}}
+. Then use the tool "slopBed" to increase the range of genomic coordinates to include downstream of the TSS
+The tool "slopBed" needs a file with the chromosome sizes that can be created with this command
+{{{
+mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome
+}}}
+Slop the file with the TSS coordinates to get the desired distance downstream (3') of the TSS
+{{{
+slopBed -i mm9_REFseqgenesTSS.bed  -g mm9.genome -l 0 -r 1000 -s  >|  mm9_REFseqgenes_1kbdownofTSS.bed
+}}}
+----
+**Step 2: Intersect your peaks and your annotation files**
+{{{
+intersectBed -a genes.bed -b peaks.bed -wa -wb
+-wa returns the list from file A
+-wb returns the list from file B
+}}}
+{{{
+intersectBed -a mm9_1KbUpDownofTSS.bed -b peaks.bed -wa -wb >| 1KbUpDownofTSS_peaks.bed
+}}}