Context Navigation

Changes between Version 79 and Version 80 of SOPs/atac_Seq

Timestamp:: 07/09/21 13:06:53 (4 years ago)
Author:: byuan
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/atac_Seq

-              v79
+              v80
 === Basic Approach ===
  * [#preprocess Pre-process reads] (remove adapters and other "contamination")
+ * [#map Map reads to the genome] (with an unspliced mapping tool)
+ * [#map Map reads to the genome, and post-alignment filtering]
+ * [#postAlignQC Post-alignment quality control and calculate QC metrics]
  * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads
+ * [#AfterPeakQC Quality control after calling peaks]
  * [#Blacklist Blacklist filtering for peaks ]
  * [#QC Run quality control and calculate QC metrics]
 …
 {{{
 samtools view -h file.bam | grep -v chrM | samtools view -b -h -f 0x2 - | samtools sort - > file.sorted.bam
+}}}
+=== [=#postAlignQC Post-alignment quality control and calculate QC metrics] ===
+   * Calculate the fragment size distribution using BaRC code
+     (/nfs/BaRC_Public/BaRC_code/R/calculate_ATACseq_fragment_size_distribution/calculate_ATACseq_fragment_size_distribution.R)
+{{{
+# USAGE: calculate_ATACseq_fragment_size_distribution.R input.bam output.pdf title_of_the_figure
+calculate_ATACseq_fragment_size_distribution.R sample.bam sample_fragment_sizes.pdf Sample
+}}}
+    This script is a wrapper for fragSizeDist from the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
+{{{
+library("ATACseqQC")
+pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
+fragSizeDist("Mapped_reads.bam", "My_sample")
+dev.off()
+}}}
+    See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
+   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
+{{{
+# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
+./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
 }}}
 …
 macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150
 }}}
 ==== Other Tools or Options ====
 …
  *  [[https://github.com/jsh58/Genrich | Genrich]]
+=== [=#Blacklist Blacklist filtering for peaks ] ===
+   * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
+{{{
+bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
+                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
+                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
+}}}
+=== Subset peaks into reproducible peaks ===
+==== Subset peaks into reproducible peaks ====
   * If you have biological replicates, you can identify reproducible peaks with the 'idr' function, based on the method of [[https://projecteuclid.org/journals/annals-of-applied-statistics/volume-5/issue-3/Measuring-reproducibility-of-high-throughput-experiments/10.1214/11-AOAS466.full | Qunhua Li et al]] and using [[https://github.com/nboley/idr | code from Nathan Boley]].
 …
+=== [=#QC Run quality control and calculate QC metrics] ===
+   * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
+{{{
+library("ATACseqQC")
+pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
+fragSizeDist("Mapped_reads.bam", "My_sample")
+dev.off()
+}}}
+      See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
+   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
+{{{
+# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
+./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
+}}}
+=== [=#AfterPeakQC Quality control after calling peaks] ===
    * Assessing Peak Calls by calculating the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
 {{{
 …
 ./calculate_FRiP_score.py SampleA.bam SampleA_peaks.bed
 }}}
    * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores (ENCODE uses one tss to represent a gene, not one tss per transcript.  TSS score is much smaller with one tss per transcript), fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads.
 …
 # If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will be in the same plots inside folder called "all_samples"
 mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
+}}}
+=== [=#Blacklist Blacklist filtering for peaks ] ===
+   * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
+{{{
+bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
+                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
+                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
 }}}