Changes between Version 59 and Version 60 of SOPs/atac_Seq


Ignore:
Timestamp:
06/15/21 12:35:51 (4 years ago)
Author:
byuan
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/atac_Seq

    v59 v60  
    8080  * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores, fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. 
    8181
    82   * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
     82  * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
    8383          * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample:
    8484
     
    118118MACS v2 is applicable for ATAC-Seq using the appropriate options/parameters.
    119119
    120   * ENCODE pipeline macs2 code: using pair-end tags as macs2 input. From [[https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit |ATAC-seq pipeline protocol specification ]].  It works for hg19, hg38, mm9 and mm10.
    121 {{{
    122 
    123    bedtools bamtobed -bedpe -mate1 -i $name_sorted.bam | gzip -nc > foo.bedpe.gz
    124 
    125    # Create final TA file
    126    zcat foo.bedpe.gz | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | gzip -nc > tagAlign.gz
    127 
    128 
    129   # Call peaks: ENCODE uses very small p-value, so it could get enough peaks for IDR
    130   # If you don't run IDR, you need to make p-value more stringent, such as with q 0.01
    131   macs2 callpeak -t tagAlign.gz -n foo.narrow -f BED -g ${species} -p 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all -B --SPMR --call-summits
    132 
    133 }}}
    134 
    135 
    136   * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
    137        * Codes below were implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ).
    138        * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]
     120  * Run [[https://github.com/ENCODE-DCC/atac-seq-pipeline|ENCODE ATAC-seq Pipeline]] . It works well with human (hg38, hg19) and mouse (mm10, mm9) samples with biological replicates. The pipeline takes fastq files, cleans and maps the reads, filters aligned reads and does peak calls. Here is the [[https://www.encodeproject.org/pipelines/ENCPL787FUN/|workflow]].  In addition, it does quality controls. The steps below shows you how to run it on our Whitehead server.
     121
     122       * Verify there is no preexisting conda startup code with the command below:
     123{{{
     124conda env list
     125}}}
     126         You have no preexisting conda if you get "conda: command not found". Otherwise, log out, log back in, start the new conda instance, and activate encode-atac-seq-pipeline
     127
     128      * Ignore the developer's instructions, use your home directory for conda and the pipeline.
     129{{{
     130# Be sure to keep the first dot in the command below:
     131. /nfs/BaRC_Public/conda/start_barc_conda
     132conda activate encode-atac-seq-pipeline
     133}}}
     134
     135      * Run. Files could be url or fullpath. [[https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md | Detail information about .json file]]
     136{{{
     137caper run /nfs/BaRC_Public/atac-seq-pipeline/atac.wdl -i sample.json
     138# After the job finishes, you can deactivate conda with
     139conda deactivate
     140}}}
     141
     142
     143      * content in sample.json:
     144
     145{{{
     146{
     147    "atac.pipeline_type" : "atac",
     148    "atac.genome_tsv" : "/nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv",
     149    "atac.fastqs_rep1_R1" : [
     150        "/fullpath/sample_rep1_1.fastq.gz"
     151    ],
     152    "atac.fastqs_rep1_R2" : [
     153        "/fullpath/sample_rep1_2.fastq.gz"
     154    ],
     155    "atac.fastqs_rep2_R1" : [
     156        "/fullpath/sample_rep2_1.fastq.gz"
     157    ],
     158    "atac.fastqs_rep2_R2" : [
     159        "/fullpath/sample_rep2_2.fastq.gz"
     160    ],
     161    "atac.paired_end" : true,
     162    "atac.auto_detect_adapter" : true,
     163    "atac.enable_tss_enrich" : true,
     164    "atac.title" : "sample",
     165    "atac.description" : "ATAC-seq mouse sample"
     166}
     167}}}
     168      * Supported genome files for hg19, hg38, mm9 and mm10 can be found in /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline, and atac.genome_tsv used for .json is
     169          * hg19: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg19/hg19.tsv
     170          * hg38: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg38/hg38.tsv
     171          * mm9: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm9/mm9.tsv
     172          * mm10: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv
     173
     174
     175
     176     * If you are working with other species or don't have replicates, you can run macs2 with pair-end bed as an input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
    139177{{{
    140178    # convert bam to bed
    141     bedtools bamtobed -i $bam > ${name}_pe.bed
    142     macs2 callpeak -t ${name}_pe.bed -n ${name}_narrow -f BED -g ${species} -q 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all
     179    bedtools bamtobed -i foo.bam > foo_pe.bed
     180    # shift reads
     181    cat foo.pe.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| foo_tn5_pe.bed
     182    # call peaks.
     183    # --keep-dup all: since duplicates have been removed in previous step
     184    macs2 callpeak -t foo_tn5_pe.bed -n foo -f BED -g mm -q 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all
    143185 
    144186}}}
    145187
    146    * Run macs2 with BAMPE option: If using the 'BAMPE' option with paired-end reads, let MACS run the pileup and calculate 'extsize'.
     188    * If you are working with NFR, macs2 BAMPE option also works. When using'BAMPE' option with paired-end reads, we let MACS run the pileup and calculate 'extsize'.
    147189       * BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual,
    148190       * "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup."
    149191
    150192{{{
    151 
    152     #note: if duplicates were removed, use --keep-dup all; if not, use --keep-dup 1.  Removing duplicates is recommended.
    153     macs2 callpeak -f BAMPE -t file.sorted.bam  --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks
    154 }}}
    155 
    156    * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads.
    157       * For PE reads, it ignores the the right mate, and not recommended ( https://github.com/macs3-project/MACS/issues/145 ).
    158       * In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ( https://github.com/macs3-project/MACS/issues/145 )
    159 {{{
    160 
    161    macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n ${prefix} --nomodel --shift -75 --extsize 150
     193    macs2 callpeak -f BAMPE -t NFR.bam  --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks
     194}}}
     195
     196   * We strongly suggest to use pair-end reads. But if you have single-end reads, you can run macs2 using --nomodel --shift s --extsize 2s.
     197{{{
     198
     199   macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150
    162200}}}
    163201 
    164     * Which macs option?
    165        * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size -  works for finding accessible regions within cut sites.  The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
    166 
    167 [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]]
    168 
    169 {{{
    170 
    171 # Start by sorting mapped reads by read name
    172 samtools sort -n bowtie/ATAC.bam > ATAC_sortedByName.bam
    173 # Run Genrich
    174 Genrich -e chrM -j -r -m 30 -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log
    175 
    176 where
    177 -j              ATAC-seq mode (must be specified)  intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignment are analyzed by default
    178 -r              Remove PCR duplicates
    179 -m  <int>       Minimum MAPQ to keep an alignment (def. 0)
    180 -d <int>        Expand cut sites to the given length (default 100bp)
    181 -e <arg>        Chromosomes (reference sequences) to exclude. Can be a comma-separated  list, e.g. -e chrM,chrY.
    182 -E <file>       Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions
    183 -t <file>       Input SAM/BAM file(s) for experimental sample(s)
    184 -o <file>       Output peak file (in ENCODE narrowPeak format)
    185 -f <LOG>        Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f <LOG> -o <OUT2> -p 0.01  -a 200 -v
    186 }}}
    187 
    188202==== Other Tools or Options ====
    189203