Changes between Version 67 and Version 68 of SOPs/atac_Seq


Ignore:
Timestamp:
07/07/21 12:08:29 (4 years ago)
Author:
byuan
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/atac_Seq

    v67 v68  
    88 * [#preprocess Pre-process reads] (remove adapters and other "contamination")
    99 * [#map Map reads to the genome] (with an unspliced mapping tool)
    10  * [#QC Run quality control and calculate QC metrics]
    1110 * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads
    1211 * [#Blacklist Blacklist filtering for peaks ]
     12 * [#QC Run quality control and calculate QC metrics]
    1313 * [#Analyze Analyze peak regions for binding motifs]
    1414 * Identify differentially accessible regions (for multiple-sample experiments)
     
    5858  * Check deduplication level with [[http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads | 'fastqc']].
    5959 
    60 
    61 === [=#QC Run quality control and calculate QC metrics] ===
    62 
    63    * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
    64 {{{
    65 library("ATACseqQC")
    66 pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
    67 fragSizeDist("Mapped_reads.bam", "My_sample")
    68 dev.off()
    69 }}}
    70       See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
    71 
    72    * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
    73 {{{
    74 # USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
    75 
    76 ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
    77 }}}
    78 
    79 
    80   * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores, fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. 
    81 
    82   * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
    83           * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample:
    84 
    85 
    86 {{{
    87 
    88 # --peak-file: peak file in bed format
    89 # --tss-file: tss in bed format
    90 # --excluded-region-file A bed file containing excluded regions, in this case, we exclude the regions in the ENCODE blast list
    91 # --metrics-file: output in json format
    92 # --ignore-read-groups: Even if read groups are present in the BAM file, ignore them and combine metrics for all reads under a single sample and library named with the --name option. This also implies that a single peak file will be used for all reads.
    93 
    94 # The duplicated reads need to be labeled by Picard MarkDuplicates ( REMOVE_DUPLICATES=FALSE) before running the ataqv
    95 
    96 # ataqc supports human/mouse/rat/fly/worm/yeast currently. If your genome is not listed, you can still run it by adding --autosomal-reference-file and --mitochondrial-reference-name.
    97 #    --autosomal-reference-file: a file containing autosomal reference names, one per line
    98 #    --mitochondrial-reference-name: name for the mitochondrial DNA
    99 
    100 ataqv --peak-file sample1_peak.bed --name sample1 --metrics-file sample1.ataqv.json.gz --excluded-region-file /nfs/genomes/human_hg38_dec13/anno/hg38-blacklist.v2.bed.gz --tss-file hg38.tss.refseq.bed.gz --ignore-read-groups human sample1.bam > sample1.ataqv.out
    101 
    102 }}}
    103          Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you built your viewer instance, refer to the information on how to configure the reference with mkarv -h.
    104 
    105 {{{
    106 
    107 # This will create a folder named as sample1, whose index.html contains the interactive plots.
    108 mkarv sample1 sample1.ataqv.json.gz
    109 
    110 # If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will be in the same plots inside folder called "all_samples"
    111 mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
    112 
    113 }}}
    11460
    11561
     
    175121# convert bam to bed
    176122bedtools bamtobed -i foo.bam > foo_pe.bed
    177 # shift reads. Tn5 produces 5’ overhangs of 9 bases long: pos. strand +4 and neg strand -5
     123# shift reads. Reads should be shifted + 4 bp and − 5 bp for positive and negative strand respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase
    178124cat foo.pe.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| foo_tn5_pe.bed
    179125# call peaks.
     
    206152
    207153
    208 === [=#Asessing Assessing Peak Calls ] ===
    209 
    210 Calculate the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
     154
     155
     156=== [=#Blacklist Blacklist filtering for peaks ] ===
     157   * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
     158{{{
     159bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
     160                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
     161                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
     162
     163}}}
     164
     165
     166=== [=#QC Run quality control and calculate QC metrics] ===
     167
     168   * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
     169{{{
     170library("ATACseqQC")
     171pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
     172fragSizeDist("Mapped_reads.bam", "My_sample")
     173dev.off()
     174}}}
     175      See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
     176
     177   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
     178{{{
     179# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
     180
     181./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
     182}}}
     183
     184   * Assessing Peak Calls by calculating the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
    211185{{{
    212186# Using a 'narrowPeak' file listing MACS2 peaks
     
    217191
    218192
    219 === [=#Blacklist Blacklist filtering for peaks ] ===
    220    * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
    221 {{{
    222 bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
    223                  | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
    224                  | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
    225 
    226 }}}
     193   * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores, fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. 
     194
     195   * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
     196          * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample:
     197
     198
     199{{{
     200
     201# --peak-file: peak file in bed format
     202# --tss-file: tss in bed format
     203# --excluded-region-file A bed file containing excluded regions, in this case, we exclude the regions in the ENCODE blast list
     204# --metrics-file: output in json format
     205# --ignore-read-groups: Even if read groups are present in the BAM file, ignore them and combine metrics for all reads under a single sample and library named with the --name option. This also implies that a single peak file will be used for all reads.
     206
     207# The duplicated reads need to be labeled by Picard MarkDuplicates ( REMOVE_DUPLICATES=FALSE) before running the ataqv
     208
     209# ataqc supports human/mouse/rat/fly/worm/yeast currently. If your genome is not listed, you can still run it by adding --autosomal-reference-file and --mitochondrial-reference-name.
     210#    --autosomal-reference-file: a file containing autosomal reference names, one per line
     211#    --mitochondrial-reference-name: name for the mitochondrial DNA
     212
     213ataqv --peak-file sample1_peak.bed --name sample1 --metrics-file sample1.ataqv.json.gz --excluded-region-file /nfs/genomes/human_hg38_dec13/anno/hg38-blacklist.v2.bed.gz --tss-file hg38.tss.refseq.bed.gz --ignore-read-groups human sample1.bam > sample1.ataqv.out
     214
     215}}}
     216         Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you built your viewer instance, refer to the information on how to configure the reference with mkarv -h.
     217
     218{{{
     219
     220# This will create a folder named as sample1, whose index.html contains the interactive plots.
     221mkarv sample1 sample1.ataqv.json.gz
     222
     223# If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will be in the same plots inside folder called "all_samples"
     224mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
     225
     226}}}
     227
    227228
    228229=== [=#Analyze Analyze peak regions for binding motifs] ===