Changes between Version 79 and Version 80 of SOPs/atac_Seq


Ignore:
Timestamp:
07/09/21 13:06:53 (4 years ago)
Author:
byuan
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/atac_Seq

    v79 v80  
    77=== Basic Approach ===
    88 * [#preprocess Pre-process reads] (remove adapters and other "contamination")
    9  * [#map Map reads to the genome] (with an unspliced mapping tool)
     9 * [#map Map reads to the genome, and post-alignment filtering]
     10 * [#postAlignQC Post-alignment quality control and calculate QC metrics]
    1011 * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads
     12 * [#AfterPeakQC Quality control after calling peaks]
    1113 * [#Blacklist Blacklist filtering for peaks ]
    1214 * [#QC Run quality control and calculate QC metrics]
     
    6264{{{
    6365samtools view -h file.bam | grep -v chrM | samtools view -b -h -f 0x2 - | samtools sort - > file.sorted.bam
     66}}}
     67
     68=== [=#postAlignQC Post-alignment quality control and calculate QC metrics] ===
     69
     70   * Calculate the fragment size distribution using BaRC code
     71     (/nfs/BaRC_Public/BaRC_code/R/calculate_ATACseq_fragment_size_distribution/calculate_ATACseq_fragment_size_distribution.R)
     72{{{
     73# USAGE: calculate_ATACseq_fragment_size_distribution.R input.bam output.pdf title_of_the_figure
     74calculate_ATACseq_fragment_size_distribution.R sample.bam sample_fragment_sizes.pdf Sample
     75}}}
     76
     77    This script is a wrapper for fragSizeDist from the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
     78{{{
     79library("ATACseqQC")
     80pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
     81fragSizeDist("Mapped_reads.bam", "My_sample")
     82dev.off()
     83}}}
     84    See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
     85
     86   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
     87{{{
     88# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
     89
     90./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
    6491}}}
    6592
     
    145172macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150
    146173}}}
    147  
    148174
    149175==== Other Tools or Options ====
     
    152178 *  [[https://github.com/jsh58/Genrich | Genrich]]
    153179
    154 
    155 
    156 
    157 === [=#Blacklist Blacklist filtering for peaks ] ===
    158    * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
    159 {{{
    160 bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
    161                  | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
    162                  | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
    163 
    164 }}}
    165 
    166 === Subset peaks into reproducible peaks ===
     180==== Subset peaks into reproducible peaks ====
    167181
    168182  * If you have biological replicates, you can identify reproducible peaks with the 'idr' function, based on the method of [[https://projecteuclid.org/journals/annals-of-applied-statistics/volume-5/issue-3/Measuring-reproducibility-of-high-throughput-experiments/10.1214/11-AOAS466.full | Qunhua Li et al]] and using [[https://github.com/nboley/idr | code from Nathan Boley]].
     
    176190
    177191
    178 === [=#QC Run quality control and calculate QC metrics] ===
    179 
    180    * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
    181 {{{
    182 library("ATACseqQC")
    183 pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
    184 fragSizeDist("Mapped_reads.bam", "My_sample")
    185 dev.off()
    186 }}}
    187       See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
    188 
    189    * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
    190 {{{
    191 # USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
    192 
    193 ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
    194 }}}
    195 
     192
     193=== [=#AfterPeakQC Quality control after calling peaks] ===
    196194   * Assessing Peak Calls by calculating the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
    197195{{{
     
    201199./calculate_FRiP_score.py SampleA.bam SampleA_peaks.bed
    202200}}}
    203 
    204201
    205202   * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores (ENCODE uses one tss to represent a gene, not one tss per transcript.  TSS score is much smaller with one tss per transcript), fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. 
     
    235232# If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will be in the same plots inside folder called "all_samples"
    236233mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
     234
     235}}}
     236
     237
     238=== [=#Blacklist Blacklist filtering for peaks ] ===
     239   * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
     240{{{
     241bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
     242                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
     243                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
    237244
    238245}}}