== What is Nextflow?

Nextflow is a workflow system for creating scalable, portable, and reproducible workflows. It was developed specifically to ease the creation and execution of bioinformatics pipelines. It allows you to run your analysis pipeline on large-scale datasets in a streamlined and parallel manner. Nextflow can deploy workflows on a variety of execution platforms, including your local machine, HPC schedulers, and the cloud. Additionally, Nextflow supports a range of compute environments, software container runtimes, and package managers, allowing workflows to be executed in reproducible and isolated environments.

== Why use Nextflow?

The rise of big data has made it increasingly necessary to analyze and perform experiments on large datasets in a portable and reproducible manner. Nextflow has several features that help with reproducible and efficient pipeline implementation.

1. Reproducibility: Nextflow supports Docker and Singularity container technology. This, along with the integration of the GitHub code-sharing platform, allows you to write self-contained pipelines, manage versions, and rapidly reproduce any former configuration.

2. Continuous checkpoints: All the intermediate results produced during pipeline execution are automatically tracked. This allows you to resume execution from the last successfully completed step, no matter what caused the run to stop.

3. Portability: Nextflow can be executed on multiple platforms without changing its code. It supports various executors, including batch schedulers such as SLURM, LSF, and PBS, and cloud platforms such as Kubernetes, Amazon AWS, Google Cloud, and Microsoft Azure.

== Installation of Nextflow

Nextflow can be used on Linux, macOS, and Windows. It requires Bash 3.2 (or later) and Java 17 (or later, up to 23) to be installed. For instructions to install Nextflow, please refer to this page: [https://www.nextflow.io/docs/latest/install.html]

The Nextflow command line tool has been installed on the WI slurm cluster: /nfs/BaRC_Public/apps/nextflow/nextflow

The current version is nextflow version 24.04.4.5917.

The main purpose of the Nextflow CLI is to run Nextflow pipelines with the run command. Nextflow can execute a local script (e.g. ./main.nf) or a remote project (e.g. github.com/foo/bar). To launch the execution of a pipeline project hosted in a remote code repository, you simply need to specify its qualified name or the repository URL after the run command. The qualified name is formed by two parts: the owner name and the repository name, separated by a / character. In other words, if a Nextflow project is hosted, for example, in a GitHub repository at the address http://github.com/foo/bar, it can be executed by entering the following command in your shell terminal:

{{{
nextflow run foo/bar
}}}

or using the project URL:

{{{
nextflow run http://github.com/foo/bar
}}}

If the project is found, it will be automatically downloaded to the Nextflow home directory ($HOME/.nextflow by default) and cached for subsequent runs.

Try this simple example by running the following command:

{{{
nextflow run nextflow-io/hello
}}}

This is a simple script showing the basic 'Hello World!' example for the Nextflow framework. It will download a trivial example pipeline from the repository published at http://github.com/nextflow-io/hello and execute it on your computer. Run this example to confirm that all tools are installed properly.
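As a quick check of the cluster installation you can also print its version, and, to see the continuous-checkpoint feature described above in action, re-launch a run with the -resume option (both -version and -resume are standard Nextflow CLI options). A minimal sketch:

{{{
# Print the version of the Nextflow installation on the WI slurm cluster
/nfs/BaRC_Public/apps/nextflow/nextflow -version

# If a run was interrupted, adding -resume restarts it from the last
# successfully completed step instead of re-running everything from scratch
nextflow run nextflow-io/hello -resume
}}}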
== What is nf-core?

nf-core is a global community effort to collect a curated set of open-source analysis pipelines built using Nextflow. There are 128 pipelines that are currently available as part of nf-core. Browse them at https://nf-co.re/pipelines/.

== How to run nf-core pipelines

To run a pipeline:

1. Configure Nextflow to run on your system. The simplest way to run is with -profile docker (or singularity). This instructs Nextflow to execute jobs locally, using Docker (or Singularity) to fulfill software dependencies. Please note that if you are running the pipeline on the slurm cluster, you can only use -profile singularity, because you don't have permission to run Docker on it. Conda is also supported with -profile conda. However, this option is not recommended, as reproducibility of results can't be guaranteed without containerization.

2. Run the tests for your pipeline in the terminal to confirm everything is working:

{{{
nextflow run nf-core/<pipeline_name> -profile test,singularity --outdir <OUTDIR>
}}}

Replace <pipeline_name> with the name of an nf-core pipeline and <OUTDIR> with your desired output directory. Nextflow will pull the code from the GitHub repository and fetch the software requirements automatically, so there is no need to download anything first.

3. Read the pipeline documentation to see which command-line parameters are required. This will be specific to your data type and usage.

4. To launch the pipeline with real data, omit the test config profile and provide the required pipeline-specific parameters. For example, to run the CUTandRun pipeline, your command will be similar to this:

{{{
nextflow run nf-core/cutandrun -profile singularity --input samplesheet.csv --peakcaller 'seacr,MACS2' --genome GRCh38 --outdir nextflow_cutandrun
}}}

5. Once complete, check the pipeline execution and quality control reports (such as multiqc_report.html files for MultiQC reports). Each pipeline's documentation describes the outputs to expect.

Please refer to the nf-core documentation for more details (https://nf-co.re/docs/usage/getting_started/introduction).

== Run nf-core pipelines on the slurm cluster

The slurm executor allows you to run your pipeline script using the SLURM resource manager. Nextflow manages each process as a separate job that is submitted to the cluster using the sbatch command. The jobs can be distributed across multiple nodes depending on the requested computing resources. The pipeline must be launched from a node where the sbatch command is available, which is typically the cluster login node. To enable the SLURM executor, set process.executor = 'slurm' in the nextflow.config file. SLURM partitions can be specified with the queue directive.
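For example, a minimal sketch of a nextflow.config using just these two settings, sending every process to SLURM on the '20' partition (the partition used in the cluster examples below), would look like this:

{{{
// Minimal sketch: run all processes through SLURM, on partition '20'
process {
    executor = 'slurm'
    queue    = '20'
}
}}}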
To submit the nf-core pipelines to the slurm cluster, you could provide a configuration file like this (taking the nf-core/cutandrun pipeline as an example; save it as 'cutandrun.config'):

{{{
process {
    executor = 'slurm'
    queue    = '20'
    memory   = '200 GB'
    cpus     = 36

    withName: 'NFCORE_CUTANDRUN:CUTANDRUN:DEDUPLICATE_PICARD:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_SORT' {
        cpus   = { 6 * task.attempt }
        memory = { 15.GB * task.attempt }
    }
    withName: 'NFCORE_CUTANDRUN:CUTANDRUN:PREPARE_PEAKCALLING:BEDTOOLS_SORT' {
        cpus   = { 1 * task.attempt }
        memory = { 12.GB * task.attempt }
    }
    withName: 'NFCORE_CUTANDRUN:CUTANDRUN:SAMTOOLS_CUSTOMVIEW' {
        cpus   = { 2 * task.attempt }
        memory = { 8.GB * task.attempt }
    }
    withName: 'NFCORE_CUTANDRUN:CUTANDRUN:FRAG_LEN_HIST' {
        cpus   = { 4 * task.attempt }
        memory = { 12.GB * task.attempt }
    }
    withName: 'NFCORE_CUTANDRUN:CUTANDRUN:DEEPTOOLS_PLOTHEATMAP_GENE_ALL' {
        cpus   = { 4 * task.attempt }
        memory = { 32.GB * task.attempt }
    }
}

// Limit the number of jobs queued on SLURM at any one time
executor {
    queueSize = 10
}
}}}

And then submit the job using a command line like the one below:

{{{
sbatch --partition=20 --job-name=NextF --output=NextF-%j.out --mem=200gb --nodes=1 --ntasks=1 --cpus-per-task=36 --wrap \
"/nfs/BaRC_Public/apps/nextflow/nextflow run nf-core/cutandrun -profile singularity --normalisation_binsize 1 --input samplesheet.csv -c cutandrun.config --normalisation_mode CPM \
--peakcaller 'MACS2' --replicate_threshold 2 --end_to_end FALSE --multiqc_title 'multiQCReport' --skip_removeduplicates true \
--skip_preseq false --skip_dt_qc false --skip_multiqc false --skip_reporting false --dump_scale_factors true --email 'username@wi.mit.edu' --genome GRCh38 \
--extend_fragments false --macs2_qvalue 0.01 --outdir ./nextFlow_CUTTAG"
}}}

Reference link: https://www.nextflow.io/docs/latest/executor.html

== Recommendations for running individual nf-core pipelines

1. nf-core CUTandRun pipeline

Please refer to the BaRC Best Practice of CUT&Tag analysis: http://barcwiki.wi.mit.edu/wiki/SOPs/CUT%26Tag

----

2. nf-core Ribo-seq pipeline

Example usage:

(1) If you have two conditions with at least three replicates:

{{{
sbatch --partition=20 --job-name=NextF_Ribo --output=NextF_Ribo-%j.out --mem=300gb --nodes=1 --ntasks=1 \
--cpus-per-task=40 --wrap "/nfs/BaRC_Public/apps/nextflow/nextflow run nf-core/riboseq \
-profile singularity \
--input samplesheet.csv \
--contrasts contrasts.csv \
--email 'your.email@wi.mit.edu' \
--multiqc_title 'multiQCReport' \
--fasta /nfs/genomes/human_hg38_dec13_no_random/fasta_whole_genome/hg38.fa \
--gtf /nfs/genomes/human_hg38_dec13_no_random/gtf/Homo_sapiens.GRCh38.106.canonical.gtf \
--outdir ./nextflow_RiboSeq"
}}}

(2) If you have two conditions with fewer than three replicates:

{{{
sbatch --partition=20 --job-name=NextF_Ribo --output=NextF_Ribo-%j.out --mem=300gb --nodes=1 --ntasks=1 \
--cpus-per-task=20 --wrap "/nfs/BaRC_Public/apps/nextflow/nextflow run nf-core/riboseq \
-profile singularity \
--input samplesheet.csv \
--email 'your.email@wi.mit.edu' \
--multiqc_title 'multiQCReport' \
--fasta /nfs/genomes/human_hg38_dec13_no_random/fasta_whole_genome/hg38.fa \
--gtf /nfs/genomes/human_hg38_dec13_no_random/gtf/Homo_sapiens.GRCh38.106.canonical.gtf \
--outdir ./nextflow_RiboSeq"
}}}

Note: The difference between (1) and (2) is that the contrast file is not provided in (2). By doing this, we skip the translational efficiency analysis performed by the 'anota2seq' package. The reason to skip this step is that, with two conditions, 'anota2seq' can only perform translational efficiency analysis if there are at least three replicates. However, you could still estimate translational efficiency yourself as the ratio between the Ribo-seq and RNA-seq signal.
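A minimal shell sketch of such a ratio calculation is shown below. It assumes two hypothetical tab-delimited tables, ribo_tpm.tsv and rna_tpm.tsv (gene ID in column 1, a normalized expression value such as TPM in column 2), which you would derive from the pipeline's quantification outputs; the file names and columns are placeholders for illustration only:

{{{
# Join the Ribo-seq and RNA-seq tables on gene ID and compute TE = Ribo-seq / RNA-seq
join -t $'\t' <(sort -k1,1 ribo_tpm.tsv) <(sort -k1,1 rna_tpm.tsv) \
  | awk 'BEGIN{OFS="\t"; print "gene_id","ribo","rna","TE"}
         $3 > 0 {print $1, $2, $3, $2/$3}' \
  > translational_efficiency.tsv
}}}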
----

3. nf-core ATAC-seq pipeline

Sample command using a configuration file with the settings we recommend for macs2:

{{{
sbatch --partition=20 --job-name=NextF_ATAC_macs2ConfigFull --output=NextF_ATAC_macs2ConfigFull-%j.out --mem=200gb --nodes=1 --ntasks=1 --cpus-per-task=20 --wrap \
"/nfs/BaRC_Public/apps/nextflow/nextflow run nf-core/atacseq -profile singularity -c ./macs2Custom.config --input ./atacseq_sampleSheetFullFastq.csv --min_trimmed_reads 0 --aligner bowtie2 --keep_dups TRUE --narrow_peak TRUE --email 'username@wi.mit.edu' --genome mm10 --read_length 50 --outdir ./OutNextF_ATAC"
}}}

This is the content of the "macs2Custom.config" file:

{{{
process {
    withName: '.*:MERGED_LIBRARY_CALL_ANNOTATE_PEAKS:MACS2_CALLPEAK' {
        ext.args = [
            '--keep-dup auto',
            '--nomodel',
            '--shift -75',
            '--extsize 150',
            '--format BAM',
            '--bdg',
            '--qvalue 0.01'
        ].join(' ').trim()
    }
}
}}}
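Before submitting the job, one way to sanity-check that the custom MACS2 arguments (and any other overrides) are picked up is to print the resolved configuration with Nextflow's config command. This is only a suggested check, not a required step:

{{{
# Print the configuration that Nextflow would actually use for this run,
# including the ext.args override from macs2Custom.config
/nfs/BaRC_Public/apps/nextflow/nextflow -c ./macs2Custom.config config nf-core/atacseq -profile singularity
}}}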