| | 66 | }}} |
| | 67 | |
| | 68 | === [=#postAlignQC Post-alignment quality control and calculate QC metrics] === |
| | 69 | |
| | 70 | * Calculate the fragment size distribution using BaRC code |
| | 71 | (/nfs/BaRC_Public/BaRC_code/R/calculate_ATACseq_fragment_size_distribution/calculate_ATACseq_fragment_size_distribution.R) |
| | 72 | {{{ |
| | 73 | # USAGE: calculate_ATACseq_fragment_size_distribution.R input.bam output.pdf title_of_the_figure |
| | 74 | calculate_ATACseq_fragment_size_distribution.R sample.bam sample_fragment_sizes.pdf Sample |
| | 75 | }}} |
| | 76 | |
| | 77 | This script is a wrapper for fragSizeDist from the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package: |
| | 78 | {{{ |
| | 79 | library("ATACseqQC") |
| | 80 | pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5) |
| | 81 | fragSizeDist("Mapped_reads.bam", "My_sample") |
| | 82 | dev.off() |
| | 83 | }}} |
| | 84 | See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes. |
| | 85 | |
| | 86 | * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py) |
| | 87 | {{{ |
| | 88 | # USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM |
| | 89 | |
| | 90 | ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt |
| 154 | | |
| 155 | | |
| 156 | | |
| 157 | | === [=#Blacklist Blacklist filtering for peaks ] === |
| 158 | | * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions. The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/]. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists |
| 159 | | {{{ |
| 160 | | bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \ |
| 161 | | | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \ |
| 162 | | | grep -P 'chr[\dXY]+[ \t]' | gzip -nc > ${FILTERED_PEAK} |
| 163 | | |
| 164 | | }}} |
| 165 | | |
| 166 | | === Subset peaks into reproducible peaks === |
| | 180 | ==== Subset peaks into reproducible peaks ==== |
| 178 | | === [=#QC Run quality control and calculate QC metrics] === |
| 179 | | |
| 180 | | * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package: |
| 181 | | {{{ |
| 182 | | library("ATACseqQC") |
| 183 | | pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5) |
| 184 | | fragSizeDist("Mapped_reads.bam", "My_sample") |
| 185 | | dev.off() |
| 186 | | }}} |
| 187 | | See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes. |
| 188 | | |
| 189 | | * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py) |
| 190 | | {{{ |
| 191 | | # USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM |
| 192 | | |
| 193 | | ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt |
| 194 | | }}} |
| 195 | | |
| | 192 | |
| | 193 | === [=#AfterPeakQC Quality control after calling peaks] === |
| | 234 | |
| | 235 | }}} |
| | 236 | |
| | 237 | |
| | 238 | === [=#Blacklist Blacklist filtering for peaks ] === |
| | 239 | * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions. The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/]. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists |
| | 240 | {{{ |
| | 241 | bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \ |
| | 242 | | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \ |
| | 243 | | grep -P 'chr[\dXY]+[ \t]' | gzip -nc > ${FILTERED_PEAK} |