
Changes between Version 67 and Version 68 of SOPs/atac_Seq

Timestamp:: 07/07/21 12:08:29 (4 years ago)
Author:: byuan
Comment:: --

SOPs/atac_Seq

-              v67
+              v68
  * [#preprocess Pre-process reads] (remove adapters and other "contamination")
  * [#map Map reads to the genome] (with an unspliced mapping tool)
- * [#QC Run quality control and calculate QC metrics]
  * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads
  * [#Blacklist Blacklist filtering for peaks ]
+ * [#QC Run quality control and calculate QC metrics]
  * [#Analyze Analyze peak regions for binding motifs]
  * Identify differentially accessible regions (for multiple-sample experiments)
 …
   * Check deduplication level with [[http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads | 'fastqc']].
-=== [=#QC Run quality control and calculate QC metrics] ===
-   * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
-{{{
-library("ATACseqQC")
-pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
-fragSizeDist("Mapped_reads.bam", "My_sample")
-dev.off()
-}}}
-      See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
-   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
-{{{
-# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
-./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
-}}}
-  * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores and fragment size distribution. You can also download the ENCODE pipeline and analyze your samples with it. Its QC output HTML file includes quality control results and their interpretation. It also estimates library complexity based on the uniqueness of reads.
-  * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive HTML page, which also allows you to view multiple samples together.
-          * First, run ataqv on each BAM file to generate JSON files. Here is a sample command for a bulk ATAC-seq sample:
-{{{
-# --peak-file: peak file in bed format
-# --tss-file: tss in bed format
-# --excluded-region-file: a bed file containing excluded regions; in this case, we exclude the regions in the ENCODE blacklist
-# --metrics-file: output in json format
-# --ignore-read-groups: Even if read groups are present in the BAM file, ignore them and combine metrics for all reads under a single sample and library named with the --name option. This also implies that a single peak file will be used for all reads.
-# Duplicate reads need to be marked by Picard MarkDuplicates (REMOVE_DUPLICATES=FALSE) before running ataqv
-# ataqv currently supports human/mouse/rat/fly/worm/yeast. If your genome is not listed, you can still run it by adding --autosomal-reference-file and --mitochondrial-reference-name.
-#    --autosomal-reference-file: a file containing autosomal reference names, one per line
-#    --mitochondrial-reference-name: name for the mitochondrial DNA
-ataqv --peak-file sample1_peak.bed --name sample1 --metrics-file sample1.ataqv.json.gz --excluded-region-file /nfs/genomes/human_hg38_dec13/anno/hg38-blacklist.v2.bed.gz --tss-file hg38.tss.refseq.bed.gz --ignore-read-groups human sample1.bam > sample1.ataqv.out
-}}}
-         Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you build your viewer instance; see mkarv -h for how to configure the reference.
-{{{
-# This will create a folder named sample1, whose index.html contains the interactive plots.
-mkarv sample1 sample1.ataqv.json.gz
-# If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will appear in the same plots inside a folder called "all_samples"
-mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
-}}}
 …
 # convert bam to bed
 bedtools bamtobed -i foo.bam > foo_pe.bed
 # shift reads. Tn5 produces 5’ overhangs of 9 bases long: pos. strand +4 and neg strand -5
+# shift reads. Reads should be shifted +4 bp on the positive strand and -5 bp on the negative strand, to account for the 9-bp duplication created by DNA repair of the nick made by Tn5 transposase
 cat foo_pe.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| foo_tn5_pe.bed
 # call peaks.
 …
+=== [=#Asessing Assessing Peak Calls ] ===
+Calculate the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
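The FRiP definition above can be illustrated with a small, self-contained sketch. This is not the BaRC `calculate_FRiP_score.py` script; the function names (`merge_peaks`, `read_in_peak`, `frip`) are hypothetical, and it assumes that a read counts as "in a peak" if it overlaps a peak by at least 1 bp, with BED-style half-open coordinates.

```python
from bisect import bisect_right

def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) peaks into a per-chromosome index."""
    by_chrom = {}
    for chrom, start, end in sorted(peaks):
        merged = by_chrom.setdefault(chrom, [])
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous peak: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Store parallel lists of starts and ends for binary search
    return {c: ([s for s, _ in iv], [e for _, e in iv]) for c, iv in by_chrom.items()}

def read_in_peak(index, chrom, start, end):
    """True if the half-open read interval [start, end) overlaps any merged peak."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last peak whose start lies before the end of the read
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start

def frip(reads, peaks):
    """Fraction of reads (chrom, start, end) that overlap at least one peak."""
    index = merge_peaks(peaks)
    reads = list(reads)
    if not reads:
        return 0.0
    in_peaks = sum(read_in_peak(index, c, s, e) for c, s, e in reads)
    return in_peaks / len(reads)

# Toy example (made-up coordinates, for illustration only)
example_peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
example_reads = [("chr1", 90, 110), ("chr1", 300, 350), ("chr2", 55, 58), ("chr3", 1, 10)]
example_frip = frip(example_reads, example_peaks)
```

In the toy example, two of the four reads overlap a peak, so the score is 0.5. On real data the reads would come from the filtered BAM file and the peaks from the peak caller output.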
+=== [=#Blacklist Blacklist filtering for peaks ] ===
+   * For samples from human, mouse, fly, or C. elegans, one can remove some probable false-positive peaks by discarding peaks that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
+{{{
+bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
+                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
+                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
+}}}
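For readers who want to see what the filtering pipeline above does step by step, here is a rough Python sketch of the same three operations: drop peaks that overlap a blacklisted region, cap the score column at 1000, and keep only chromosomes named chr followed by digits, X, or Y. The function names are hypothetical, the score column is assumed to be numeric, and the chromosome check is applied to the first column only, which is slightly stricter than the `grep` pattern. It is meant as an illustration, not a replacement for `bedtools`.

```python
import re

# Chromosome names such as chr1, chr22, chrX, chrY (excludes chrM, chrUn_*, *_random)
CANONICAL = re.compile(r"chr[\dXY]+")

def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap, as used for BED coordinates."""
    return a_start < b_end and b_start < a_end

def filter_peaks(peak_lines, blacklist):
    """Filter tab-separated peak lines against a list of (chrom, start, end) regions."""
    kept = []
    for line in peak_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Equivalent of 'bedtools intersect -v': drop peaks with any overlap
        if any(chrom == c and overlaps(start, end, s, e) for c, s, e in blacklist):
            continue
        # Equivalent of the grep step: keep canonical chromosomes only
        if not CANONICAL.fullmatch(chrom):
            continue
        # Equivalent of the awk step: cap the score (column 5) at 1000
        if len(fields) >= 5 and float(fields[4]) > 1000:
            fields[4] = "1000"
        kept.append("\t".join(fields))
    return kept

# Toy example (made-up coordinates, for illustration only)
toy_peaks = [
    "chr1\t100\t200\tp1\t1500",
    "chr1\t500\t600\tp2\t300",
    "chrM\t10\t20\tp3\t50",
    "chr2\t0\t50\tp4\t10",
]
toy_blacklist = [("chr1", 550, 700)]
toy_filtered = filter_peaks(toy_peaks, toy_blacklist)
```

Here `p2` is removed because it overlaps the blacklisted region, `p3` is removed because it lies on chrM, and the score of `p1` is capped at 1000.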
+=== [=#QC Run quality control and calculate QC metrics] ===
+   * Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
+{{{
+library("ATACseqQC")
+pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5)
+fragSizeDist("Mapped_reads.bam", "My_sample")
+dev.off()
+}}}
+      See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
+   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
+{{{
+# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM
+./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
+}}}
+   * Assess peak calls by calculating the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
 {{{
 # Using a 'narrowPeak' file listing MACS2 peaks
 …
+=== [=#Blacklist Blacklist filtering for peaks ] ===
+   * For samples from human, mouse, fly, or C. elegans, one can remove some probable false-positive peaks by discarding peaks that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
+{{{
+bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
+                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
+                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
+}}}
+   * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores and fragment size distribution. You can also download the ENCODE pipeline and analyze your samples with it. Its QC output HTML file includes quality control results and their interpretation. It also estimates library complexity based on the uniqueness of reads.
+   * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive HTML page, which also allows you to view multiple samples together.
+          * First, run ataqv on each BAM file to generate JSON files. Here is a sample command for a bulk ATAC-seq sample:
+{{{
+# --peak-file: peak file in bed format
+# --tss-file: tss in bed format
+# --excluded-region-file: a bed file containing excluded regions; in this case, we exclude the regions in the ENCODE blacklist
+# --metrics-file: output in json format
+# --ignore-read-groups: Even if read groups are present in the BAM file, ignore them and combine metrics for all reads under a single sample and library named with the --name option. This also implies that a single peak file will be used for all reads.
+# Duplicate reads need to be marked by Picard MarkDuplicates (REMOVE_DUPLICATES=FALSE) before running ataqv
+# ataqv currently supports human/mouse/rat/fly/worm/yeast. If your genome is not listed, you can still run it by adding --autosomal-reference-file and --mitochondrial-reference-name.
+#    --autosomal-reference-file: a file containing autosomal reference names, one per line
+#    --mitochondrial-reference-name: name for the mitochondrial DNA
+ataqv --peak-file sample1_peak.bed --name sample1 --metrics-file sample1.ataqv.json.gz --excluded-region-file /nfs/genomes/human_hg38_dec13/anno/hg38-blacklist.v2.bed.gz --tss-file hg38.tss.refseq.bed.gz --ignore-read-groups human sample1.bam > sample1.ataqv.out
+}}}
+         Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you build your viewer instance; see mkarv -h for how to configure the reference.
+{{{
+# This will create a folder named sample1, whose index.html contains the interactive plots.
+mkarv sample1 sample1.ataqv.json.gz
+# If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will appear in the same plots inside a folder called "all_samples"
+mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz
+}}}
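To make the TSS enrichment score mentioned earlier in this section more concrete, the sketch below shows one common way to compute it. It assumes the ENCODE-style definition: aggregate the read signal in a window around all TSSs, take the mean signal in the outer flanks of that window as background, and report the fold enrichment at the TSS itself. The BaRC `calculate_TSS_enrichment_score.py` script may differ in its details, and the function names, window size, and flank size here are illustrative assumptions.

```python
from collections import Counter

def aggregate_profile(tss_list, cut_sites, half_width=2000):
    """Sum cut-site counts at each offset from the TSS, oriented by strand.

    tss_list: iterable of (chrom, position, strand)
    cut_sites: iterable of (chrom, position), one entry per read
    """
    counts = Counter(cut_sites)
    profile = [0] * (2 * half_width + 1)
    for chrom, tss, strand in tss_list:
        for offset in range(-half_width, half_width + 1):
            # Flip the window for minus-strand genes so upstream is always on the left
            pos = tss + offset if strand == "+" else tss - offset
            profile[offset + half_width] += counts[(chrom, pos)]
    return profile

def tss_enrichment(profile, flank=100):
    """Signal at the window center divided by the mean signal in the two outer flanks."""
    if len(profile) < 2 * flank + 1:
        raise ValueError("profile is shorter than the two flanks plus the center")
    background = (sum(profile[:flank]) + sum(profile[-flank:])) / (2 * flank)
    if background == 0:
        raise ValueError("no signal in the flanks; cannot normalize")
    return profile[len(profile) // 2] / background

# Toy example with a tiny window (made-up data, for illustration only)
toy_tss = [("chr1", 100, "+")]
toy_cuts = [("chr1", 100)] * 6 + [("chr1", 97), ("chr1", 103)]
toy_profile = aggregate_profile(toy_tss, toy_cuts, half_width=3)
toy_score = tss_enrichment(toy_profile, flank=1)
```

In the toy example, the flanks average one cut per position and the TSS itself has six, giving a score of 6.0. Real data would use a much wider window and many thousands of TSSs.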
 === [=#Analyze Analyze peak regions for binding motifs] ===