Building Your First Nextflow Pipeline: PART 3 - Modularization & Configurable Settings

Nextflow is a workflow management system designed to simplify the process of writing and executing complex computational workflows. The main advantages of Nextflow are:

  • Reproducibility: ensures workflows can be run consistently across environments.
  • Portability: supports execution on laptops, HPC clusters, cloud (AWS, GCP, Azure), and container platforms (Docker, Singularity).
  • Scalability: handles anything from small local runs to massive parallel workloads.
  • Modularity: pipelines are built from reusable processes that can be easily combined.
  • Container support: seamless integration with Docker, Singularity, and Conda for reproducible software environments.
  • Dataflow programming model: channels handle complex data parallelism naturally.
  • Error handling & resume: automatic checkpointing allows workflows to resume from failure points.

In previous posts of this series, we first built a workflow that only processes one sample, and then updated it to work with multiple samples. In today's post, we will learn how to modularize the code and make the settings configurable.

Modularization

For this part, I will follow the Nextflow tutorial Part 3: Moving code into modules.

Basically, we will create a modules folder and move each process to a separate file. Then we will import those modules in the main script. This will make our code cleaner and easier to maintain or share among different projects.

The first step is creating a folder named modules in the same directory as the main script. Inside it, we can arrange the modules (processes) as we wish, but I recommend creating subfolders by tool, as follows. We also create a main.nf file inside each folder to hold the process code.

INDEX process

mkdir -p modules/simpleaf/index
touch modules/simpleaf/index/main.nf

We move the process code from the main script to the new main.nf file and replace it with an include statement:

include { SIMPLEAF_INDEX as INDEX } from './modules/simpleaf/index/main.nf'

As you can see, we can even conveniently rename processes on the fly to make our lives easier.
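
The process body itself is unchanged, it just lives in the module file now. As a reminder, modules/simpleaf/index/main.nf holds something along these lines (a sketch: the exact simpleaf flags and process directives should be whatever your INDEX process from the previous posts already uses):

// modules/simpleaf/index/main.nf
process SIMPLEAF_INDEX {

    input:
    path fasta   // reference genome FASTA
    path gtf     // gene annotation GTF

    output:
    // emit name matches the index.index_path used in the workflow
    path "simpleaf_index", emit: index_path

    script:
    """
    simpleaf index \\
        --fasta ${fasta} \\
        --gtf ${gtf} \\
        --output simpleaf_index
    """
}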

Now do the same for the QUANT process (see the bonus section if you need help).

And that is it; the folder structure should now look like this:

.
├── modules
│   └── simpleaf
│       ├── index
│       │   └── main.nf
│       └── quant
│           └── main.nf
├── nextflow.config
├── post.1.nf
├── post.2.nf
├── post.3.nf
└── post.4.nf

and we just need to run the pipeline as before (you can check the post.4.nf file in the bonus section):

nextflow run post.4.nf -resume

The code works the same way as before. Since we used the -resume flag, no new tasks needed to be run.

Configurable settings

The next thing to do is to avoid hardcoding paths and settings in the script.

Nextflow provides three ways, besides hardcoding, to set parameters:

  1. Command line parameters
  2. Params files
  3. Configuration files

Parameters can be declared in the main script using the params object, as we did in previous posts. This is helpful for developing and testing, but not ideal for production pipelines.
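
For example, post.4.nf (see the bonus section) declares defaults such as:

params.sample_info = 'sample_info.tsv'
params.outdir = 'results_full'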

Parameters can also be configured on the command line by prefixing the parameter name with a double dash (--). This is useful for pipelines with a small number of parameters that change frequently.
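
For example, all of our pipeline's parameters could be passed directly on the command line:

nextflow run post.4.nf --sample_info sample_info.tsv --outdir results_full

Note that pipeline parameters take a double dash (--outdir), while Nextflow's own options, such as -resume, take a single dash.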

Parameters specified on the command line can also be specified in a params file using the -params-file option when running Nextflow. The -params-file option loads parameters for your pipeline from a JSON or YAML file. This is useful for pipelines with many parameters and for documentation purposes.
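
For example, a hypothetical params.yaml holding the same values we will later put in the config file:

referenceFASTA: 'reference/refdata-gex-GRCh38-2024-A/fasta/genome.fa'
referenceGTF: 'reference/refdata-gex-GRCh38-2024-A/genes/genes.gtf.gz'
sample_info: 'sample_info.tsv'
outdir: 'results_full'

would be loaded with:

nextflow run post.4.nf -params-file params.yaml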

Finally, parameters can be set in configuration files. Configuration files are written in a custom DSL and can be used to set parameters, configure executors, define profiles, and more. This is useful for pipelines with many parameters that change infrequently and need to be shared among multiple users or environments.

Parameters are applied in the following order, from lowest to highest priority:

  1. Parameters defined in pipeline scripts (useful for hardcoded defaults)
  2. Parameters defined in config files
  3. Parameters specified in a params file (-params-file)
  4. Parameters specified on the command line (--parameter_name)
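
As a quick illustration of this precedence: once the nextflow.config below sets outdir = 'results_full', we can still override just that value at runtime (results_test is a placeholder value):

nextflow run post.4.nf --outdir results_test

The command-line value results_test wins, while all other parameters keep the values from the config file.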

I tend to use config files to set the parameters because they also allow setting other configurations (e.g. Docker). Keeping everything in the same file is very useful for documentation purposes. Create a file named nextflow.config in the same directory as the main script with the following content:

// Enable Docker
docker {
    enabled = true
}

// setting parameters
params {
    referenceFASTA = 'reference/refdata-gex-GRCh38-2024-A/fasta/genome.fa'
    referenceGTF = 'reference/refdata-gex-GRCh38-2024-A/genes/genes.gtf.gz'
    sample_info = 'sample_info.tsv'
    outdir = 'results_full'
}
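
Config files can do more than set parameters. As mentioned above, they can also define profiles; as an illustration (not needed for this post, and the profile names and cluster settings are just placeholders), we could later extend nextflow.config to switch between Docker locally and Singularity plus SLURM on an HPC cluster:

profiles {
    local_docker {
        docker.enabled = true
    }
    cluster {
        singularity.enabled = true
        process.executor = 'slurm'
    }
}

A profile is then selected at runtime with nextflow run post.4.nf -profile cluster.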

Now we can remove the hardcoded paths from the main script and just use the params object to access the parameters.

nextflow run post.4.nf -resume

Again, nothing was re-run since we used the -resume flag and made no functional modifications.

Great, our workflow can process multiple samples using configurable settings, and it is modularized. The next steps will be polishing the workflow and adding further analysis steps, which we will do in upcoming posts.

BONUS

QUANT module

Here is the code to create the folder and the main.nf file:

mkdir -p modules/simpleaf/quant
touch modules/simpleaf/quant/main.nf

This is the include statement to add to the main script:

include { QUANT } from './modules/simpleaf/quant/main.nf'
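
For completeness, a sketch of what modules/simpleaf/quant/main.nf contains (again, the simpleaf quant flags shown are assumptions; keep the process body you already have from the previous post):

// modules/simpleaf/quant/main.nf
process QUANT {

    input:
    path index                                                        // simpleaf index directory
    tuple val(sample_id), val(chemistry), path(reads1), path(reads2)  // per-sample inputs

    output:
    path "${sample_id}", emit: quant_path

    script:
    // Note: simpleaf quant takes comma-separated read lists; adjust the
    // interpolation below if your samples have multiple FASTQ files per read
    """
    simpleaf quant \\
        --index ${index} \\
        --chemistry ${chemistry} \\
        --reads1 ${reads1} \\
        --reads2 ${reads2} \\
        --output ${sample_id}
    """
}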

post.4.nf

/*
 * Pipeline parameters
 */

// reference genome path
params.referenceFASTA = 'reference/refdata-gex-GRCh38-2024-A/fasta/genome.fa'
params.referenceGTF = 'reference/refdata-gex-GRCh38-2024-A/genes/genes.gtf.gz'
// sample info
params.sample_info = 'sample_info.tsv' // path to sample information file

// output directory
params.outdir      = "results_full"

// Include modules
include { SIMPLEAF_INDEX as INDEX } from './modules/simpleaf/index/main.nf'
include { QUANT } from './modules/simpleaf/quant/main.nf'

workflow {

    // define input files
    fasta = file("${params.referenceFASTA}") // reference fasta
    gtf = file("${params.referenceGTF}") // annotation GTF

    // Load sample information from TSV file
    sample_info = Channel.fromPath(params.sample_info)
        .splitCsv(header: true, sep: '\t')

    // INDEX creation
    index = INDEX(
        fasta,
        gtf
    )

    // prep for quantification: find the FASTQ files matching each sample's patterns
    quant_input = sample_info
        .map { x ->
            def files_R1 = file("${x.sample_path}/*${x.reads1_pat}*", checkIfExists: true)
            def files_R2 = file("${x.sample_path}/*${x.reads2_pat}*", checkIfExists: true)
            tuple(x.sample_id, x.chemistry, files_R1, files_R2)
        }

    // Alevin-fry quantification
    quant = QUANT(
        index.index_path,
        quant_input
    )
}