58 | | * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions. The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/]. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists |
59 | | |
60 | | * Reads overlapping the blacklist can be filtered using alignmentSieve (from the deepTools package) or 'intersectBed -v' (from the bedtools suite): |
61 | | {{{ |
62 | | alignmentSieve -b Reads.bam --blackListFileName hg38-blacklist.bed -o Reads.no_blackList.bam |
63 | | intersectBed -v -a Reads.bam -b hg38-blacklist.bed > Reads.no_blackList.bam |
64 | | }}} |
65 | | |
66 | | * An alternative for handling blacklist regions is to keep the mapped reads as is but (further downstream) remove peaks overlapping blacklist regions. Between-sample normalization will typically differ whether one filters these regions in reads or in peaks. |
67 | | |
90 | | * In addition to do varies quality controls, [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together. |
91 | | |
92 | | First, run ataqv on each bam file to generate JSON files. |
93 | | Here is a sample command for a bulk ATAC_seq sample: |
| 81 | * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive HTML page, which also allows you to view multiple samples together.
| 82 | * First, run ataqv on each BAM file to generate JSON files. Here is a sample command for a bulk ATAC-seq sample:
148 | | * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. [[ [[ https://twitter.com/XiChenUoM/status/1336658454866325506 | Explaination ]]. Codes implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ) |
149 | | [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]] |
| 135 | * Use paired-end BED as MACS2 input. This considers the ends of both mates and focuses on the enrichment of cutting/insertion sites in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
| 136 | * The code below was implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ).
| 137 | * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]] |
167 | | * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads. For PE reads, [[ https://github.com/macs3-project/MACS/issues/145 | it ignores the the right mate, and not recommended ]]. [[ https://github.com/macs3-project/MACS/issues/145 | In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ]] |
| 155 | * Run MACS2 using --nomodel --shift -s --extsize 2s (for example, --shift -100 --extsize 200). This can be used for single-end reads.
| 156 | * For PE reads, this approach ignores the right mate and is not recommended ( https://github.com/macs3-project/MACS/issues/145 ).
| 157 | * When the short-fragment population is extremely dominant, the final output won't differ much from that of BAMPE ( https://github.com/macs3-project/MACS/issues/145 ).
173 | | * Which macs option? MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size - works for finding accessible regions within cut sites. The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
174 | | * Shifting reads, pos. strand +4 and neg strand -5 (see recommendations below) may be needed as well to find //exact// cut sites. |
175 | | |
176 | | |
177 | | * [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]] |
| 163 | * Which macs option? |
| 164 | * MACS' author, T. Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]]. With the BAMPE option, MACS piles up the fragments and calculates the extension size itself, which works for finding accessible regions within cut sites. Alternatively, the --nomodel, --shift and --extsize parameters can be used to look only at the //exact// Tn5 cut sites instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]]; in that case, -f BAMPE may not be suitable.
| 165 | * Shifting reads (positive strand +4, negative strand -5; see recommendations below) may also be needed to find the //exact// cut sites.
| 166 | |
| 167 | |
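The +4/-5 read shift mentioned above can be sketched with plain awk on a BED6 file. This is a minimal illustration, not the pipeline's own code: the function name `shift_tn5` and the BED6 input layout (strand in column 6) are assumptions made for the example. deepTools' alignmentSieve also offers an `--ATACshift` option for BAM files, which may be more convenient in practice.

```shell
# Hypothetical helper: shift BED6 reads to account for the Tn5 insertion offset.
# Plus-strand reads move +4 bp, minus-strand reads move -5 bp.
# Reads BED6 (chrom, start, end, name, score, strand) on stdin, writes BED6 to stdout.
shift_tn5() {
  awk 'BEGIN { FS = OFS = "\t" }
       $6 == "+" { $2 += 4; $3 += 4 }      # plus strand: offset by +4
       $6 == "-" { $2 -= 5; $3 -= 5 }      # minus strand: offset by -5
       { if ($2 < 0) $2 = 0; print }'      # clamp starts that fall below 0
}

# Usage (file names are placeholders):
#   shift_tn5 < reads.bed > reads.shifted.bed
```

After shifting, the 5' end of each read (start for plus-strand reads, end for minus-strand reads) marks the approximate cut site.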
| 168 | [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) processing multiple replicates. Detailed information can be found in the [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]].
| 200 | === [=#Blacklist Blacklist filtering for peaks ] === |
| 201 | * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing peaks that overlap "blacklisted" regions. The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/]. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
| 202 | {{{ |
| 203 | bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \ |
| 204 | | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \ |
| 205 | | grep -P 'chr[\dXY]+[ \t]' | gzip -nc > ${FILTERED_PEAK} |
| 206 | |
| 207 | }}} |
| 208 | |
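The two steps that follow `bedtools intersect -v` in the command above (capping the score column at 1000 and keeping only numbered, X and Y chromosomes) can be tried on their own without bedtools. The sketch below is an illustration under assumptions: the function name `cap_and_keep_canonical` is hypothetical, and it uses `grep -E` with a `^` anchor instead of `grep -P`, so that it also runs where Perl-style regular expressions are unavailable.

```shell
# Hypothetical helper: post-process a peak file read from stdin.
#  1. Cap column 5 (score) at 1000, the maximum allowed in BED-style files.
#  2. Keep only lines whose chromosome is chr + digits/X/Y (drops chrM, chrUn_*, *_random).
cap_and_keep_canonical() {
  awk 'BEGIN { OFS = "\t" } { if ($5 > 1000) $5 = 1000; print $0 }' \
    | grep -E '^chr[0-9XY]+[[:space:]]'
}

# Usage (file names are placeholders; bedtools must be installed for the first step):
#   bedtools intersect -v -a peaks.narrowPeak -b blacklist.bed \
#     | cap_and_keep_canonical | gzip -nc > peaks.filtered.narrowPeak.gz
```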