Building Your First Nextflow Pipeline: PART 4 - Polish your pipeline

Nextflow is a workflow management system designed to simplify writing and executing complex computational workflows. Its main advantages are:

  • Reproducibility: ensures workflows can be run consistently across environments.
  • Portability: supports execution on laptops, HPC clusters, cloud (AWS, GCP, Azure), and container platforms (Docker, Singularity).
  • Scalability: handles anything from small local runs to massive parallel workloads.
  • Modularity: pipelines are built from reusable processes that can be easily combined.
  • Container support: seamless integration with Docker, Singularity, and Conda for reproducible software environments.
  • Dataflow programming model: channels handle complex data parallelism naturally.
  • Error handling & resume: automatic checkpointing allows workflows to resume from failure points.

In the previous posts of this series (a workflow that processes a single sample, one that handles multiple samples, and modularization with configurable settings) we built a fully functional pipeline. Today we will polish it to make it more professional and easier to share.

Workflow outputs

Our current pipeline uses publishDir to move the results of each process to the desired output directory. For example, in the INDEX process:

publishDir params.outdir, mode: 'symlink'

Recent versions of Nextflow (>=24.04.0) introduced a new way to define workflow outputs, using an output block in the main workflow script. This approach is more flexible and powerful than publishDir: it lets you specify exactly which files or channels should be treated as outputs of the workflow. The official documentation explains how to migrate from publishDir to workflow outputs.

We start by adding a main: block at the beginning of the workflow:

workflow {
    main:
    // define input files & variables
    ...

Then add a publish: block at the end of the workflow and an output block at the end of the script. Each path is relative to the workflow output directory. Note that we can specify the mode, as we did for the publishDir directive.

    ...
    publish:
    simpleaf_index = index.index_path
    quantification_results = quant.quant_path
}

output {
    simpleaf_index {
        path 'index'
        mode 'symlink'
    }

    quantification_results {
        path 'quant'
        mode 'symlink'
    }
}

This feature requires the nextflow.preview.output feature flag to be enabled. I prefer to add it just before the output block for clarity:

    ...
    }

// Workflow outputs
nextflow.preview.output = true

output {
    ...

All files in the published channels are copied into the output directory, which is results by default. The output directory can be set using the outputDir config setting or the -output-dir command-line option.

To keep using the outdir parameter, we can add the following line to the nextflow.config file:

// Set outputDir based on parameter (with default)
outputDir = params.outdir

The last step is to remove all the publishDir directives from the processes.

Now we can test the workflow. With -resume, it should not need to rerun anything.

nextflow run post.5.nf -resume

According to the Nextflow developers, this approach is intended to replace the publishDir directive, so it is wise to start using it now. Keep an eye on the official documentation, though: this is a new feature and may still evolve, as noted in the image.

Help

If you want to share your pipeline, or just document it (your future self will thank you), it is good practice to add a help message that tells users how to run it and which parameters are required.

Nextflow's official documentation doesn't provide a specific guide for adding help messages to pipelines. Using params.help to show a help message is a community-established best practice, popularized in particular by nf-core.

Groovy is a programming language that runs on the Java platform. Nextflow scripts are written in a Groovy-based language, so we can keep the help text in a Groovy file. Create the following file, help.groovy, in a lib folder at the root of the pipeline:

def helpMessage(Map params) {
    return """
    ========================================
     SimpleAF Single-Cell RNA-seq Pipeline
    ========================================
    
    Usage:
      nextflow run main.nf [options]
    
    Required Parameters:
      --referenceFASTA    Path to reference FASTA file
      --referenceGTF      Path to reference GTF file
      --sample_info       Path to sample information TSV file
    
    Optional Parameters:
      --outdir            Output directory
      --help              Show this help message and exit
    
    Examples:
      nextflow run main.nf --help
      nextflow run main.nf --sample_info samples.tsv --outdir my_results
    
    Sample Info File Format:
        The sample_info TSV file should contain the following columns:
        - sample_id: Unique sample identifier
        - chemistry: 10x chemistry (e.g., 10xv4-3p)
        - sample_path: Path to directory containing FASTQ files
        - reads1_pat: Pattern for R1 files (e.g., *R1*)
        - reads2_pat: Pattern for R2 files (e.g., *R2*)
    
    Output Structure:
        Output_folder/
        ├── index/
        │   └── index_simpleaf/
        └── quant/
            ├── sample1_quant/
            ├── sample2_quant/
    """
}

Then add the following at the beginning of the main script:

// Help setting
params.help = false
include { helpMessage } from './lib/help.groovy'

And add this at the beginning of the workflow block:

// Show help message if --help is provided
if (params.help) {
    println( helpMessage(params) )
    exit 0
}

Now, before running your pipeline, you can check the help message with:

nextflow run post.5.nf --help

Parameter validation

A final touch that makes your pipeline more professional and shareable is validating the required parameters. Just add the following within the workflow, after the if (params.help) block:

// Validate required parameters
if (!params.referenceFASTA || !params.referenceGTF || !params.sample_info) {
    error "Missing required parameters. Use --help for usage information."
}

This is a very quick and simple validation. You can make it more thorough by checking each parameter individually and providing more specific error messages.
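As a minimal sketch of that per-parameter check, the two helper functions below collect every missing parameter and build a single message naming them. The names missingParams and missingParamsMessage are hypothetical (they are not part of Nextflow or of the pipeline above), and the sketch assumes params behaves like a plain Map, as it does when passed to helpMessage(Map params).

```groovy
// Return the names of the required parameters that are unset or empty.
// `params` is handled as a plain Map, the same way helpMessage(Map params) receives it.
def missingParams(Map params, List<String> required) {
    return required.findAll { name -> !params[name] }
}

// Build one error message that names every missing parameter as a CLI flag.
def missingParamsMessage(List<String> missing) {
    def flags = missing.collect { name -> '--' + name }.join(', ')
    return 'Missing required parameter(s): ' + flags + '. Use --help for usage information.'
}

// Hypothetical usage inside the workflow block, after the --help check:
// def missing = missingParams(params, ['referenceFASTA', 'referenceGTF', 'sample_info'])
// if (missing) {
//     error missingParamsMessage(missing)
// }
```

Because the check relies on Groovy truthiness, a parameter set to false, 0 or an empty string is also reported as missing. That is fine for the path parameters used here, but would need adjusting for boolean or numeric parameters.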

BONUS

Final pipeline

Folder structure

.
├── lib
│   └── help.groovy
├── modules
│   └── simpleaf
│       ├── index
│       │   └── main.nf
│       └── quant
│           └── main.nf
├── nextflow.config
├── post.1.nf
├── post.2.nf
├── post.3.nf
├── post.4.nf
└── post.5.nf

post.5.nf

This would be the main.nf file of the final pipeline:

// Help setting
params.help = false
include { helpMessage } from './lib/help.groovy'

// Include modules
include { SIMPLEAF_INDEX as INDEX } from './modules/simpleaf/index/main.nf'
include { QUANT } from './modules/simpleaf/quant/main.nf'

workflow {
    main:

    // Show help message if --help is provided
    if (params.help) {
        println( helpMessage(params) )
        exit 0
    }

    // Parameter validation
    if (!params.outdir || !params.referenceFASTA || !params.referenceGTF || !params.sample_info) {
        error "Missing required parameters. Use --help for usage information."
    }

    // define input files & variables
    fasta = file("${params.referenceFASTA}") // reference fasta
    gtf = file("${params.referenceGTF}") // annotation GTF

    // Load sample information from TSV file
    sample_info = Channel.fromPath(params.sample_info)
        .splitCsv(header: true, sep: '\t')

    // INDEX creation
    index = INDEX(
        fasta,
        gtf
    )

    // prep for quantification: resolve the FASTQ files of each sample
    quant_input = sample_info
        .map { x ->
            def files_R1 = file("${x.sample_path}/*${x.reads1_pat}*", checkIfExists: true)
            def files_R2 = file("${x.sample_path}/*${x.reads2_pat}*", checkIfExists: true)
            tuple(x.sample_id, x.chemistry, files_R1, files_R2)
        }

    // Alevin-fry quantification
    quant = QUANT(
        index.index_path,
        quant_input
    )

    publish:
    simpleaf_index = index.index_path
    quantification_results = quant.quant_path
}

// Workflow outputs
nextflow.preview.output = true

output {
    simpleaf_index {
        path "index"
        mode 'symlink'
    }

    quantification_results {
        path "quant"
        mode 'symlink'
    }
}

Config file

// setting parameters
params {
    referenceFASTA = 'reference/refdata-gex-GRCh38-2024-A/fasta/genome.fa'
    referenceGTF = 'reference/refdata-gex-GRCh38-2024-A/genes/genes.gtf.gz'
    sample_info = 'sample_info.tsv'
    outdir = 'results_full'
}

// Enable Docker
docker {
    enabled = true
}

// Set outputDir based on parameter (with default)
outputDir = params.outdir

Help file

def helpMessage(Map params) {
    return """
    ========================================
     SimpleAF Single-Cell RNA-seq Pipeline
    ========================================
    
    Usage:
      nextflow run main.nf [options]
    
    Required Parameters:
      --referenceFASTA    Path to reference FASTA file
      --referenceGTF      Path to reference GTF file
      --sample_info       Path to sample information TSV file
    
    Optional Parameters:
      --outdir            Output directory
      --help              Show this help message and exit
    
    Examples:
      nextflow run main.nf --help
      nextflow run main.nf --sample_info samples.tsv --outdir my_results
    
    Sample Info File Format:
        The sample_info TSV file should contain the following columns:
        - sample_id: Unique sample identifier
        - chemistry: 10x chemistry (e.g., 10xv4-3p)
        - sample_path: Path to directory containing FASTQ files
        - reads1_pat: Pattern for R1 files (e.g., _R1_)
        - reads2_pat: Pattern for R2 files (e.g., _R2_)
    
    Output Structure:
        Output_folder/
        ├── index/
        │   └── index_simpleaf/
        └── quant/
            ├── sample1_quant/
            ├── sample2_quant/
    """
}

Processes

INDEX process

/*
 * Generate simpleaf index
 */
process SIMPLEAF_INDEX {

    container 'quay.io/biocontainers/simpleaf:0.19.5--ha6fb395_0'

    // publishDir params.outdir, mode: 'symlink'

    input:
        path fasta
        path gtf

    output:
        path "index_simpleaf", emit: index_path

    script:
    """
    # export required var
    export ALEVIN_FRY_HOME=.

    # set maximum number of file descriptors for temp files
    ulimit -n 2048

    # prep simpleaf
    simpleaf set-paths

    simpleaf index \
        --output index_simpleaf \
        --fasta ${fasta} \
        --gtf ${gtf} \
        --threads 4 \
        --work-dir ./workdir.noindex
    """
}

QUANT process

/*
 * Quantify Gene Expression
 */
process QUANT {

    container 'quay.io/biocontainers/simpleaf:0.19.5--ha6fb395_0'

    // publishDir "${params.outdir}/quant", mode: 'symlink'

    input:
        path index_path
        tuple val(sample_id), val(chemistry), path(reads1_files), path(reads2_files)

    output:
        path "${sample_id}_quant", emit: quant_path

    script:
    def R1_FILES = reads1_files.collect().join(',')
    def R2_FILES = reads2_files.collect().join(',')
    """
    # Download chemistry file
    wget -O chemistries.json https://raw.githubusercontent.com/COMBINE-lab/simpleaf/dev/resources/chemistries.json

    # export required var
    export ALEVIN_FRY_HOME=.

    # prep simpleaf
    simpleaf set-paths

    # run simpleaf quant
    simpleaf quant \
        --reads1 $R1_FILES \
        --reads2 $R2_FILES \
        --threads 4 \
        --index ${index_path}/index \
        --t2g-map ${index_path}/ref/t2g_3col.tsv \
        --chemistry ${chemistry} \
        --resolution cr-like \
        --unfiltered-pl --anndata-out \
        --output ${sample_id}_quant
    """
}