120 | | * ENCODE pipeline macs2 code: using pair-end tags as macs2 input. From [[https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit |ATAC-seq pipeline protocol specification ]]. It works for hg19, hg38, mm9 and mm10. |
121 | | {{{ |
122 | | |
123 | | bedtools bamtobed -bedpe -mate1 -i $name_sorted.bam | gzip -nc > foo.bedpe.gz |
124 | | |
125 | | # Create final TA file |
126 | | zcat foo.bedpe.gz | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | gzip -nc > tagAlign.gz |
127 | | |
128 | | |
129 | | # Call peaks: ENCODE uses very small p-value, so it could get enough peaks for IDR |
130 | | # If you don't run IDR, you need to make p-value more stringent, such as with q 0.01 |
131 | | macs2 callpeak -t tagAlign.gz -n foo.narrow -f BED -g ${species} -p 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all -B --SPMR --call-summits |
132 | | |
133 | | }}} |
134 | | |
135 | | |
136 | | * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 ) |
137 | | * Codes below were implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ). |
138 | | * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]] |
| 120 | * Run [[https://github.com/ENCODE-DCC/atac-seq-pipeline|ENCODE ATAC-seq Pipeline]] . It works well with human (hg38, hg19) and mouse (mm10, mm9) samples with biological replicates. The pipeline takes fastq files, cleans and maps the reads, filters aligned reads and does peak calls. Here is the [[https://www.encodeproject.org/pipelines/ENCPL787FUN/|workflow]]. In addition, it does quality controls. The steps below shows you how to run it on our Whitehead server. |
| 121 | |
| 122 | * Verify there is no preexisting conda startup code with the command below: |
| 123 | {{{ |
| 124 | conda env list |
| 125 | }}} |
| 126 | You have no preexisting conda if you get "conda: command not found". Otherwise, log out, log back in, start the new conda instance, and activate encode-atac-seq-pipeline |
| 127 | |
| 128 | * Ignore the developer's instructions, use your home directory for conda and the pipeline. |
| 129 | {{{ |
| 130 | # Be sure to keep the first dot in the command below: |
| 131 | . /nfs/BaRC_Public/conda/start_barc_conda |
| 132 | conda activate encode-atac-seq-pipeline |
| 133 | }}} |
| 134 | |
| 135 | * Run. Files could be url or fullpath. [[https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md | Detail information about .json file]] |
| 136 | {{{ |
| 137 | caper run /nfs/BaRC_Public/atac-seq-pipeline/atac.wdl -i sample.json |
| 138 | # After the job finishes, you can deactivate conda with |
| 139 | conda deactivate |
| 140 | }}} |
| 141 | |
| 142 | |
| 143 | * content in sample.json: |
| 144 | |
| 145 | {{{ |
| 146 | { |
| 147 | "atac.pipeline_type" : "atac", |
| 148 | "atac.genome_tsv" : "/nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv", |
| 149 | "atac.fastqs_rep1_R1" : [ |
| 150 | "/fullpath/sample_rep1_1.fastq.gz" |
| 151 | ], |
| 152 | "atac.fastqs_rep1_R2" : [ |
| 153 | "/fullpath/sample_rep1_2.fastq.gz" |
| 154 | ], |
| 155 | "atac.fastqs_rep2_R1" : [ |
| 156 | "/fullpath/sample_rep2_1.fastq.gz" |
| 157 | ], |
| 158 | "atac.fastqs_rep2_R2" : [ |
| 159 | "/fullpath/sample_rep2_2.fastq.gz" |
| 160 | ], |
| 161 | "atac.paired_end" : true, |
| 162 | "atac.auto_detect_adapter" : true, |
| 163 | "atac.enable_tss_enrich" : true, |
| 164 | "atac.title" : "sample", |
| 165 | "atac.description" : "ATAC-seq mouse sample" |
| 166 | } |
| 167 | }}} |
| 168 | * Supported genome files for hg19, hg38, mm9 and mm10 can be found in /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline, and atac.genome_tsv used for .json is |
| 169 | * hg19: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg19/hg19.tsv |
| 170 | * hg38: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg38/hg38.tsv |
| 171 | * mm9: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm9/mm9.tsv |
| 172 | * mm10: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv |
| 173 | |
| 174 | |
| 175 | |
| 176 | * If you are working with other species or don't have replicates, you can run macs2 with pair-end bed as an input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 ) |
151 | | |
152 | | #note: if duplicates were removed, use --keep-dup all; if not, use --keep-dup 1. Removing duplicates is recommended. |
153 | | macs2 callpeak -f BAMPE -t file.sorted.bam --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks |
154 | | }}} |
155 | | |
156 | | * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads. |
157 | | * For PE reads, it ignores the the right mate, and not recommended ( https://github.com/macs3-project/MACS/issues/145 ). |
158 | | * In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ( https://github.com/macs3-project/MACS/issues/145 ) |
159 | | {{{ |
160 | | |
161 | | macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n ${prefix} --nomodel --shift -75 --extsize 150 |
| 193 | macs2 callpeak -f BAMPE -t NFR.bam --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks |
| 194 | }}} |
| 195 | |
| 196 | * We strongly suggest to use pair-end reads. But if you have single-end reads, you can run macs2 using --nomodel --shift s --extsize 2s. |
| 197 | {{{ |
| 198 | |
| 199 | macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150 |
164 | | * Which macs option? |
165 | | * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size - works for finding accessible regions within cut sites. The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable. |
166 | | |
167 | | [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]] |
168 | | |
169 | | {{{ |
170 | | |
171 | | # Start by sorting mapped reads by read name |
172 | | samtools sort -n bowtie/ATAC.bam > ATAC_sortedByName.bam |
173 | | # Run Genrich |
174 | | Genrich -e chrM -j -r -m 30 -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log |
175 | | |
176 | | where |
177 | | -j ATAC-seq mode (must be specified) intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignment are analyzed by default |
178 | | -r Remove PCR duplicates |
179 | | -m <int> Minimum MAPQ to keep an alignment (def. 0) |
180 | | -d <int> Expand cut sites to the given length (default 100bp) |
181 | | -e <arg> Chromosomes (reference sequences) to exclude. Can be a comma-separated list, e.g. -e chrM,chrY. |
182 | | -E <file> Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions |
183 | | -t <file> Input SAM/BAM file(s) for experimental sample(s) |
184 | | -o <file> Output peak file (in ENCODE narrowPeak format) |
185 | | -f <LOG> Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f <LOG> -o <OUT2> -p 0.01 -a 200 -v |
186 | | }}} |
187 | | |