
Run ENCODE ATAC-seq Pipeline on the Whitehead Server

If you have human (hg38, hg19) or mouse (mm10, mm9) samples with biological replicates, you can run the ENCODE ATAC-seq pipeline. The pipeline takes FASTQ files, cleans and maps the reads, filters the aligned reads, and calls peaks. It also runs quality-control checks. Here is the schema of the workflow, and here is a sample QC report. The steps below show how to run it on our Whitehead server. Note: it only works with Python 2.

  • Create an input file, sample.json, with content like the following:
    {
        "atac.pipeline_type" : "atac",
        "atac.genome_tsv" : "/nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv",
        "atac.fastqs_rep1_R1" : [
            "/fullpath/sample_rep1_1.fastq.gz"
        ],
        "atac.fastqs_rep1_R2" : [
            "/fullpath/sample_rep1_2.fastq.gz"
        ],
        "atac.fastqs_rep2_R1" : [
            "/fullpath/sample_rep2_1.fastq.gz"
        ],
        "atac.fastqs_rep2_R2" : [
            "/fullpath/sample_rep2_2.fastq.gz"
        ],
        "atac.paired_end" : true,
        "atac.auto_detect_adapter" : true,
        "atac.enable_tss_enrich" : true,
        "atac.title" : "sample",
        "atac.description" : "ATAC-seq mouse sample"
    }
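
Before launching a run, it can help to confirm that every FASTQ path in sample.json actually exists. The sketch below is an assumption-laden convenience, not part of the pipeline: `missing_fastqs` is a hypothetical helper that pulls quoted strings ending in `.fastq.gz` out of the file with `grep`, skips remote URLs, and prints any local path that is not on disk.

```shell
# Hypothetical helper (not part of the pipeline): print local FASTQ paths
# named in a pipeline JSON file that do not exist on disk.
missing_fastqs() {
    local json="${1:-}" path
    [ -f "$json" ] || { echo "no such file: $json" >&2; return 1; }
    # Extract every quoted string that ends in .fastq.gz, then strip the quotes.
    grep -o '"[^"]*\.fastq\.gz"' "$json" | tr -d '"' | while IFS= read -r path; do
        case "$path" in
            http://*|https://*|ftp://*) continue ;;  # remote inputs cannot be checked locally
        esac
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}
```

Running `missing_fastqs sample.json` prints nothing when all local files are present. It is a plain-text scan, so it does not validate the JSON syntax itself.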
    
  • Supported genome files for hg19, hg38, mm9 and mm10 are in /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline. Set atac.genome_tsv in the .json file to the path for your genome:
    • hg19: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg19/hg19.tsv
    • hg38: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/hg38/hg38.tsv
    • mm9: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm9/mm9.tsv
    • mm10: /nfs/BaRC_datasets/ENCODE_ATAC-seq_Pipeline/mm10/mm10.tsv
  • To initiate conda inside Whitehead:
    # Be sure to keep the first dot in the command below:
    . /nfs/BaRC_Public/conda/start_barc_conda
    
  • Before running the ENCODE pipeline, verify there is no preexisting conda startup code with the command below:
    conda env list
    
    If you get "conda: command not found", you have no preexisting conda. Otherwise, log out, log back in, start the new conda instance, and activate encode-atac-seq-pipeline.
  • Ignore the developer's instructions and use your home directory for conda and the pipeline.
    conda activate encode-atac-seq-pipeline
    
  • Run the pipeline. Files listed in the .json file can be given as URLs or full paths; see the detailed information about the .json file.
    caper run /nfs/BaRC_Public/atac-seq-pipeline/atac.wdl -i sample.json
    # After the job finishes, you can deactivate conda with
    conda deactivate
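
The steps above can be collected into one shell function. This is only a sketch: `run_atac_pipeline` is a hypothetical wrapper name, and it assumes the conda check is meant to happen in a fresh login shell before the BaRC startup script is sourced. The paths and commands are the ones given in this page.

```shell
# Hypothetical wrapper around the commands in this page; not part of the pipeline.
run_atac_pipeline() {
    local json="${1:-}" rc
    if [ -z "$json" ] || [ ! -f "$json" ]; then
        echo "usage: run_atac_pipeline /path/to/sample.json" >&2
        return 1
    fi
    # Assumption: a conda already visible here means preexisting startup code,
    # so stop and let the user start again from a fresh login.
    if command -v conda >/dev/null 2>&1; then
        echo "conda is already set up; log out, log back in, and retry" >&2
        return 1
    fi
    # Keep the leading dot so the startup script runs in the current shell.
    . /nfs/BaRC_Public/conda/start_barc_conda || return 1
    conda activate encode-atac-seq-pipeline || return 1
    caper run /nfs/BaRC_Public/atac-seq-pipeline/atac.wdl -i "$json"
    rc=$?
    conda deactivate
    return "$rc"
}
```

Call it as `run_atac_pipeline sample.json` from the directory where you want the output written. The function returns the exit status of `caper run`, so a failed run is still visible after conda is deactivated.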
    
  • The QC report is at call-qc_report/execution/qc.html
  • IDR peak files:
    • rep1: call-idr_pr/shard-0/execution/rep1-pr1_vs_rep1-pr2.idr0.05.bfilt.narrowPeak.gz
    • rep2: call-idr_pr/shard-1/execution/rep2-pr1_vs_rep2-pr2.idr0.05.bfilt.narrowPeak.gz
    • Note: shard-0 refers to the first biological replicate, shard-1 refers to the 2nd biological replicate, and so on
    • rep1 and rep2: call-idr/shard-1/execution/rep1_vs_rep2.idr0.05.bfilt.narrowPeak.gz
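
Because the shard index is the replicate number minus one, the per-replicate IDR peak path can be built from the replicate number. The helper below is a minimal sketch; `idr_pr_peak_path` is a hypothetical name, and it simply reproduces the relative paths listed above for rep1 and rep2, extended by the same pattern to later replicates.

```shell
# Hypothetical helper: print the relative path of the pseudo-replicate IDR
# peak file for biological replicate N (1-based), following the pattern above.
idr_pr_peak_path() {
    local rep="${1:-}" shard
    case "$rep" in
        ''|*[!0-9]*|0*)
            echo "replicate must be a positive integer without leading zeros" >&2
            return 1
            ;;
    esac
    shard=$((rep - 1))   # shard-0 is the first replicate, shard-1 the second, ...
    printf 'call-idr_pr/shard-%s/execution/rep%s-pr1_vs_rep%s-pr2.idr0.05.bfilt.narrowPeak.gz\n' \
        "$shard" "$rep" "$rep"
}
```

For example, `zcat "$(idr_pr_peak_path 1)" | wc -l` would count the rep1 peaks when run from the directory that contains the call-* folders.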