Context Navigation

Changes between Version 59 and Version 60 of SOPs/atac_Seq

Timestamp:: 06/15/21 12:35:51 (4 years ago)
Author:: byuan
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/atac_Seq

-              v59
+              v60
   * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores, fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads.
   * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
+  * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
           * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample:
 …
 MACS v2 is applicable for ATAC-Seq using the appropriate options/parameters.
+  * ENCODE pipeline macs2 code: using pair-end tags as macs2 input. From [[https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit |ATAC-seq pipeline protocol specification ]].  It works for hg19, hg38, mm9 and mm10.
+{{{
+   bedtools bamtobed -bedpe -mate1 -i $name_sorted.bam | gzip -nc > foo.bedpe.gz
+   # Create final TA file
+   zcat foo.bedpe.gz | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | gzip -nc > tagAlign.gz
+  # Call peaks: ENCODE uses very small p-value, so it could get enough peaks for IDR
+  # If you don't run IDR, you need to make p-value more stringent, such as with q 0.01
+  macs2 callpeak -t tagAlign.gz -n foo.narrow -f BED -g ${species} -p 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all -B --SPMR --call-summits
+}}}
+  * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
+       * Codes below were implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ).
+       * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]
+  * Run [[https://github.com/ENCODE-DCC/atac-seq-pipeline|ENCODE ATAC-seq Pipeline]] . It works well with human (hg38, hg19) and mouse (mm10, mm9) samples with biological replicates. The pipeline takes fastq files, cleans and maps the reads, filters aligned reads and does peak calls. Here is the [[https://www.encodeproject.org/pipelines/ENCPL787FUN/|workflow]].  In addition, it does quality controls. The steps below shows you how to run it on our Whitehead server.
+       * Verify there is no preexisting conda startup code with the command below:
+{{{
+conda env list
+}}}
+         You have no preexisting conda if you get "conda: command not found". Otherwise, log out, log back in, start the new conda instance, and activate encode-atac-seq-pipeline
+      * Ignore the developer's instructions, use your home directory for conda and the pipeline.
+{{{
+# Be sure to keep the first dot in the command below:
+. /nfs/BaRC_Public/conda/start_barc_conda
+conda activate encode-atac-seq-pipeline
+}}}
+      * Run. Files could be url or fullpath. [[https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md | Detail information about .json file]]
+{{{
+caper run /nfs/BaRC_Public/atac-seq-pipeline/atac.wdl -i sample.json
+# After the job finishes, you can deactivate conda with
+conda deactivate
+}}}
+      * content in sample.json:
+{{{
+{
+    "atac.pipeline_type" : "atac",
+    "atac.genome_tsv" : "/nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv",
+    "atac.fastqs_rep1_R1" : [
+        "/fullpath/sample_rep1_1.fastq.gz"
+    ],
+    "atac.fastqs_rep1_R2" : [
+        "/fullpath/sample_rep1_2.fastq.gz"
+    ],
+    "atac.fastqs_rep2_R1" : [
+        "/fullpath/sample_rep2_1.fastq.gz"
+    ],
+    "atac.fastqs_rep2_R2" : [
+        "/fullpath/sample_rep2_2.fastq.gz"
+    ],
+    "atac.paired_end" : true,
+    "atac.auto_detect_adapter" : true,
+    "atac.enable_tss_enrich" : true,
+    "atac.title" : "sample",
+    "atac.description" : "ATAC-seq mouse sample"
+}
+}}}
+      * Supported genome files for hg19, hg38, mm9 and mm10 can be found in /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline, and atac.genome_tsv used for .json is
+          * hg19: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg19/hg19.tsv
+          * hg38: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg38/hg38.tsv
+          * mm9: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm9/mm9.tsv
+          * mm10: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv
+     * If you are working with other species or don't have replicates, you can run macs2 with pair-end bed as an input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
 {{{
     # convert bam to bed
+    bedtools bamtobed -i $bam > ${name}_pe.bed
+    macs2 callpeak -t ${name}_pe.bed -n ${name}_narrow -f BED -g ${species} -q 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all
+    bedtools bamtobed -i foo.bam > foo_pe.bed
+    # shift reads
+    cat foo.pe.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| foo_tn5_pe.bed
+    # call peaks.
+    # --keep-dup all: since duplicates have been removed in previous step
+    macs2 callpeak -t foo_tn5_pe.bed -n foo -f BED -g mm -q 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all
 }}}
    * Run macs2 with BAMPE option: If using the 'BAMPE' option with paired-end reads, let MACS run the pileup and calculate 'extsize'.
+    * If you are working with NFR, macs2 BAMPE option also works. When using'BAMPE' option with paired-end reads, we let MACS run the pileup and calculate 'extsize'.
        * BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual,
        * "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup."
 {{{
+    #note: if duplicates were removed, use --keep-dup all; if not, use --keep-dup 1.  Removing duplicates is recommended.
+    macs2 callpeak -f BAMPE -t file.sorted.bam  --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks
+}}}
+   * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads.
+      * For PE reads, it ignores the the right mate, and not recommended ( https://github.com/macs3-project/MACS/issues/145 ).
+      * In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ( https://github.com/macs3-project/MACS/issues/145 )
+{{{
+   macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n ${prefix} --nomodel --shift -75 --extsize 150
+    macs2 callpeak -f BAMPE -t NFR.bam  --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks
+}}}
+   * We strongly suggest to use pair-end reads. But if you have single-end reads, you can run macs2 using --nomodel --shift s --extsize 2s.
+{{{
+   macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150
 }}}
-    * Which macs option?
-       * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size -  works for finding accessible regions within cut sites.  The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
-[[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]]
-{{{
-# Start by sorting mapped reads by read name
-samtools sort -n bowtie/ATAC.bam > ATAC_sortedByName.bam
-# Run Genrich
-Genrich -e chrM -j -r -m 30 -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log
-where
--j              ATAC-seq mode (must be specified)  intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignment are analyzed by default
--r              Remove PCR duplicates
--m  <int>       Minimum MAPQ to keep an alignment (def. 0)
--d <int>        Expand cut sites to the given length (default 100bp)
--e <arg>        Chromosomes (reference sequences) to exclude. Can be a comma-separated  list, e.g. -e chrM,chrY.
--E <file>       Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions
--t <file>       Input SAM/BAM file(s) for experimental sample(s)
--o <file>       Output peak file (in ENCODE narrowPeak format)
--f <LOG>        Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f <LOG> -o <OUT2> -p 0.01  -a 200 -v
-}}}
 ==== Other Tools or Options ====