Customize a pipeline
Cloning a pipeline
You don’t need to modify a pipeline just to change a parameter or to adapt the execution to your local environment: pipeline execution is highly customizable through custom configuration files and parameters. Most of the time you will be able to run a pipeline without modifying it, but cloning a pipeline is useful when you need to add new features or fix bugs in a pipeline you are working on:
git clone https://github.com/nf-core/rnaseq
Hint
nextflow itself can clone a pipeline like git does:
nextflow clone nf-core/rnaseq
The nf-core prefix of the pipeline is the organization name, and the
rnaseq is the repository name, as you find on GitHub.
Configuring a pipeline
You can customize a pipeline by creating custom configuration files. This may be necessary if you need to lower the requirements of a pipeline, for example to run it with limited resources, or if you need to track the parameters you have been using for a particular analysis. You can also specify a custom configuration file to run a pipeline with a different profile, for example to enable options required by a specific environment. A custom configuration file has a higher priority than the default configuration file, but a lower priority than the parameters provided on the command line. Moreover, it is possible to use an institutional configuration file in which you specify the default parameters for all the pipelines you plan to execute within the infrastructure provided by your institution (see here for an example). For a complete list of configuration options and priorities, please see the nextflow configuration documentation.
nextflow.config
Before starting a new custom configuration file, you should take a look at
the default configuration file provided by the pipeline you are working on. For
a standard nextflow pipeline, the default configuration file is named nextflow.config
and is located in the root of the pipeline directory. This file defines
the default parameters that affect pipeline execution. In a DSL2 pipeline, you can
also find the conf/base.config file, in which the requirements for each job
are defined.
Hint
The community recommends providing pipeline parameters, such as the input files,
the reference database or other user-defined values, through a parameters
file, which is written as a JSON file and is specified with the -params-file
option. This lets you run a pipeline without
providing parameters on the command line. All the settings which
cannot be specified on the command line (for example the amount of
memory required by a certain step) can be defined in the custom configuration file.
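As a minimal sketch of the second case, a custom configuration file could raise the memory of a single step. The file name custom.config, the process name FASTQC and the value are placeholders chosen for illustration; use the names defined by your pipeline:

```groovy
// custom.config (hypothetical file name)
process {
    // FASTQC is only an example process name
    withName: 'FASTQC' {
        memory = 8.GB
    }
}
```

A file like this would be passed with the -c option, while input files and similar values stay in the parameters file.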
Institutional configuration files
The nf-core community maintains a GitHub repository where institutional configuration files can be stored and shared among users. This means that users belonging to the same institution can share configuration files that are specific to their infrastructure. This repository is located at https://github.com/nf-core/configs and is organized in two main sections: configurations that are shared among all pipelines and configurations that are specific to a single pipeline. The first kind usually holds information about executors, queues and resources, and applies to every pipeline run in a particular computing environment at your institute. The second kind is specific to a single pipeline and can be used to customize a single pipeline step, for example to change the number of CPUs or the amount of memory required by a single process, overriding the pipeline default configuration.
Institutional configuration files are managed through the profile scope, and the nf-core community pipelines are usually already configured to use them. This means that if an institutional configuration file is available in the nf-core configs repository, it can be used by passing the profile name to the pipeline execution, for example:
nextflow run nf-core/rnaseq -profile <my_institution> ...
This is enough to apply the global institutional configuration to the pipeline execution, together with the pipeline-specific configuration if available. For more information see the Shared nf-core/configs and the Step-by-step guide to writing an institutional profile documents.
Tip
We have a custom institutional configuration repository at ibba. To use it
with nf-core pipelines, you should add the repository
cnr-ibba/nf-configs using the
--custom_config_base option, and specify ibba and your working environment
profile, for example:
nextflow run nf-core/rnaseq \
    --custom_config_base https://raw.githubusercontent.com/cnr-ibba/nf-configs/ibba \
    -profile ibba,core \
    ...
cnr-ibba pipelines, like cnr-ibba/nf-resequencing-mem, are already configured to use our local institutional configuration repository. See nf-core/configs: IBBA Configuration for more information.
Hint
The institutional configuration files are accessed remotely during pipeline execution:
if you need to work offline, you should download and manage a local copy, then provide
the path to the institutional configuration file using the -config option and
the path to the institutional configuration git repository through the --custom_config_base
option. More information can be found in the Running nextflow offline
and Clone institutional configuration files sections of this documentation.
Custom configuration files
There are other configuration files that can be used to customize a single pipeline
and can be stored in the pipeline directory or in the directory where you are running
the pipeline. Among configuration files, these have the highest priority, and they can be used
to customize a single pipeline execution for a particular project. They
should be specified using the -c or -config option when running the pipeline,
for example:
nextflow run nf-core/rnaseq -c custom.config ...
Warning
Avoid naming your custom config file nextflow.config, since it is a reserved
name for the default configuration file, which is loaded automatically by nextflow
if present in your project directory. If you give your custom configuration file
a different name, you can control when it’s loaded using the -c or
-config option when running nextflow.
More information about configuration customization can be found in the official nextflow Configuration. The reference of all configuration options can be found at nextflow Configuration options reference. Here we provide some examples of how to customize a pipeline using custom configuration files.
Process selectors
Nextflow lets you specify the behavior of a process or a group of processes
using process selectors
in the configuration files. There are two main types of selectors,
withLabel and withName: the first one lets you specify the requirements
for every process having the same label, the second one lets you specify the
requirements for a process by name. More precisely, in DSL2 pipelines, these requirements
are specified in conf/base.config and conf/modules.config, where the first
file is used to specify the requirements for a group of jobs using labels and
the second one is used to specify the requirements for a single process using
names.
The Nextflow community recommends specifying the requirements for
a group of processes with withLabel when possible: when there’s
the need to specify the requirements for a single process, you can use the withName
selector. For example, to lower resource requirements, it’s better to
start by redefining the most used labels, like process_high and process_medium,
and only then redefine single processes. Start with an empty custom configuration
file and add a process scope like this:
process {
    withLabel: process_low {
        ...
    }
    withLabel: process_medium {
        ...
    }
    withLabel: process_high {
        ...
    }
    withName: FASTQC {
        ...
    }
}
You may want to explore the imported modules to understand which processes will
be affected by each label.
For this file to take effect, you need to provide it with the nextflow -c
or -config option:
nextflow run -c custom.config ...
Hint
Since these parameters will override the default ones, it’s better to declare only the minimal parameters required by your pipeline. See nextflow documentation for Process selectors for more information.
Dynamic allocation of resources
It is possible that different instances of a process require different resources
in terms of computing power, memory, or time. In such situations, requesting, for example,
too little memory will cause some tasks to fail. Conversely, using a
higher limit that fits all the tasks in your execution could significantly
decrease the execution priority of your jobs. In such cases,
Dynamic directives
can be used to increase the resources required by a process if the task fails
and is retried. Nextflow lets you specify the resources
required by a process dynamically using the task.attempt variable. This variable
is a counter that is incremented each time a task is retried. For example, you can
specify the resources required by a process like this:
process {
    withLabel:process_medium {
        cpus = { 6 * task.attempt }
        memory = { 12.GB * task.attempt }
        time = { 8.h * task.attempt }
    }
}
This means that every time a task is retried, the amount of resources required by
the process will be increased by a factor equal to the number of attempts. However,
the maximum amount of attempts and resources should be specified in configuration
files to avoid infinite loops or excessive resource requirements.
The directives that control the dynamic allocation of resources when a task is retried
are errorStrategy
and maxRetries:
the first one lets you specify the behavior of a process when an error occurs,
and you can configure it to terminate the pipeline when an error is found or to
continue with the workflow ignoring the error. The second one lets you specify the
maximum number of retries for a process: once that value is exceeded, the entire
pipeline will be terminated. Usually, these directives are defined by default in
the conf/base.config file of the pipeline like this:
process {
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries = 1
}
However, you can override these directives for a particular process using
the withName or withLabel process selectors in the custom configuration file.
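For instance, a custom configuration file could change the retry policy only for the heaviest jobs. This is a sketch with illustrative values, assuming the pipeline uses the process_high label:

```groovy
process {
    // applies only to processes labelled process_high
    withLabel: process_high {
        errorStrategy = 'retry'   // retry on any failure, whatever the exit status
        maxRetries    = 3         // allow up to three retries per task
    }
}
```

Processes without this label keep the defaults defined by the pipeline.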
Handling failing jobs
You can use more complex closures to define the behavior of a process when an error occurs. For example, you can specify that a process should be retried when it fails, until a maximum number of attempts is reached; after that, the error is ignored and the workflow continues. This is an example of how to specify this behavior in a custom configuration file:
process {
    withName: VCFTOOLS_TSTV_COUNT {
        errorStrategy = { task.attempt <= 2 ? 'retry' : 'ignore' }
    }
}
The same can be defined directly in the process declaration in a nextflow file:
process MY_PROCESS {
    tag "$meta.id"
    label 'process_single'
    errorStrategy { task.attempt <= maxRetries ? 'retry' : 'ignore' }
    <other process directives>
}
Tip
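Note that we declare errorStrategy = ... in a nextflow configuration file, but
we declare errorStrategy { ... } in the process declaration in a nextflow file:
configuration files set directives by assignment, while process definitions
declare them as directive statements without the equals sign.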
Hint
This is possible only if there are no dependent processes that require the output of the process that failed. Take a look at the Handling failing jobs with Nextflow Medium article to get more hints on how to handle failing jobs in Nextflow.
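One detail worth checking is the interaction with maxRetries. The sketch below assumes that Nextflow grants a retry only while the number of failures does not exceed maxRetries (one by default, as in the conf/base.config excerpt shown earlier), so the retry budget is raised explicitly to let the closure reach the 'ignore' branch:

```groovy
process {
    withName: 'VCFTOOLS_TSTV_COUNT' {
        // allow two retries, so that attempts 1 and 2 can both be retried
        maxRetries    = 2
        // a failure on the third attempt is ignored and the workflow continues
        errorStrategy = { task.attempt <= 2 ? 'retry' : 'ignore' }
    }
}
```

If the pipeline still terminates after the second failure, comparing the closure threshold with the effective maxRetries value is a reasonable first check.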
Setting max amount of resources for a process
Nextflow also lets you cap the resources that a process can request using the resourceLimits directive: it can be set for specific processes or globally in the process scope. In the latter case, the limit applies to every process called by the pipeline. An example of how to set a global limit is shown below:
process {
    resourceLimits = [
        cpus: 32,
        memory: 64.GB
    ]
}
Warning
When using the resourceLimits directive, you only declare the maximum amount of resources that a single process can request: you are not specifying the total amount of resources that will be used by all the processes during the pipeline execution.
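Limits can also be narrowed for one process with a selector. The following is a sketch under assumptions: the process name FASTQC and all the values are illustrative, and the time entry is included only to show that the walltime can be capped as well:

```groovy
process {
    // global cap applied to every process
    resourceLimits = [ cpus: 32, memory: 64.GB, time: 240.h ]

    // tighter cap for a single (hypothetical) process
    withName: 'FASTQC' {
        resourceLimits = [ cpus: 4, memory: 8.GB ]
    }
}
```

With such settings, a request like memory = { 12.GB * task.attempt } is expected to be reduced to the limit whenever it grows beyond it, rather than being submitted as is.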
Hint
The resourceLimits directive was introduced in Nextflow version 24.04.0:
the pipeline options --max_cpus, --max_memory and --max_time are
deprecated and will be removed in future versions. If you need to work
with pipelines developed with older versions of Nextflow, you should use the
old check_max function to ensure that resource requirements don’t exceed
a maximum limit. See the Dynamic allocation of resources (old syntax)
section for more information.
Tip
If you need to know whether your pipeline supports the newer resourceLimits directive,
take a look at the nextflow.config file in the pipeline directory and at the
conf/base.config file: if the dynamic allocation of resources is managed by
the check_max function and by the max_cpus, max_memory and max_time
parameters, you should use the old syntax to manage resources.
Dynamic allocation of resources (old syntax)
Before the resourceLimits directive became available in Nextflow 24.04.0, nf-core pipelines let you specify the maximum resources required
by a process using the --max_cpus, --max_memory and --max_time parameters.
The resources were capped using the check_max function, which
needs to be included in the custom configuration file or in any other file that makes
use of the check_max function to dynamically allocate resources.
You should remember to specify a default value for max_memory, max_cpus,
and max_time in your custom configuration file to avoid warnings
when the check_max function is evaluated. An example of how to specify the maximum
resources required by a process with the old syntax is shown below:
params {
    // Max resource options
    // Defaults only, expecting to be overwritten
    // They need to be specified for the check_max function to work
    max_memory = '64.GB'
    max_cpus = 32
    max_time = '240.h'
}
process {
    withLabel:process_medium {
        cpus = { check_max( 6 * task.attempt, 'cpus' ) }
        memory = { check_max( 12.GB * task.attempt, 'memory' ) }
        time = { check_max( 8.h * task.attempt, 'time' ) }
    }
}
// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
    if (type == 'memory') {
        try {
            if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
                return params.max_memory as nextflow.util.MemoryUnit
            else
                return obj
        } catch (all) {
            println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj"
            return obj
        }
    } else if (type == 'time') {
        try {
            if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
                return params.max_time as nextflow.util.Duration
            else
                return obj
        } catch (all) {
            println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj"
            return obj
        }
    } else if (type == 'cpus') {
        try {
            return Math.min( obj, params.max_cpus as int )
        } catch (all) {
            println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
            return obj
        }
    }
}
Hint
The --max_cpus, --max_memory and --max_time parameters are the maximum
allowed values for dynamic job requirements: by setting these parameters you can
ensure that a single job will not allocate more resources than the ones you have
declared. Those parameters have no effect on the global resources used or on the
number of jobs submitted.
Tip
--max_cpus, --max_memory and --max_time are parameters that can be
submitted using the nextflow params file or command line interface.
Remove process limits
Sometimes it is convenient to remove the limits set for a process, for example
for a very long task that requires a lot of time to complete: in this case, it can
be more convenient to avoid setting a walltime limit and let the executor choose
the maximum allowed value. You can simply unset the time limit for a process by setting
a null value for the time directive in the custom configuration file, for example:
process {
    withLabel:unlimited_time {
        time = null
    }
}
This will override all the time limits set for the matching processes and will let the executor choose the maximum allowed value (if supported).
Provide custom parameters to a process
Some modules may require additional parameters to be provided in order to work
correctly. These parameters can be specified with the ext.args variable within
the process scope in the custom configuration file, for example:
process {
    withName:process_fastqc {
        ext.args = '-t 4'
    }
}
When a process is composed of two (or more) tools, you can specify parameters for
each tool independently, using ext.args, ext.args2, ext.args3:
ext.args will be used for the first tool, ext.args2 for the second and
so on. In a DSL2 pipeline, custom variables for each process are usually defined in
the conf/modules.config file: take a look at this file to understand which variables
are set by default in your pipeline before adding new variables to a process.
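As a sketch, for a module that runs two tools in sequence the configuration could look like the following. The process name and both option strings are placeholders, since the real options depend on the tools wrapped by the module:

```groovy
process {
    // ALIGN_AND_SORT is a hypothetical process that chains two tools
    withName: 'ALIGN_AND_SORT' {
        ext.args  = '--first-tool-option'    // passed to the first tool
        ext.args2 = '--second-tool-option'   // passed to the second tool
    }
}
```

Which tool reads which variable is decided in the script section of the module, so it is worth checking the module source before setting them.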
Provide custom parameters to a container runtime
Sometimes it is useful to provide custom parameters to the container runtime
used to run a process. For example, you may want to provide custom Singularity
options to a process in order to mount a specific directory or to provide a
custom environment variable. This can be done using the runOptions variable within the
container runtime scope in the custom configuration file, for example:
singularity {
    runOptions = '--bind /data/project:/mnt/project'
}
docker {
    runOptions = '--env MY_ENV_VAR=value'
}
Warning
By default, docker.runOptions is set to '-u $(id -u):$(id -g)': this
is required to run processes as the current user in order to create files with
proper permissions. Remember to include '-u $(id -u):$(id -g)' when providing
your custom docker options.
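A minimal sketch of a Docker scope that keeps the user mapping while adding a custom option (the environment variable is the same illustrative one used above):

```groovy
docker {
    // keep the user mapping and append the custom option;
    // single quotes prevent Groovy from interpreting the $(...) expressions
    runOptions = '-u $(id -u):$(id -g) --env MY_ENV_VAR=value'
}
```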
In addition, there’s also the containerOptions process directive that can be
used to provide custom options to the container runtime for a specific process.
However, container runtimes like Singularity and Docker may have different ways
to specify those options, so it’s better to use the container runtime scope
with runOptions in the custom configuration file to provide custom options that will be applied
to all the processes using that container runtime. If you need to provide custom
options to a specific process, and you need to distinguish between different container
runtimes, you can use a closure to define the options dynamically based on the
container runtime used by the process, for example if you require GPU support:
process {
    withName: process_with_gpu {
        containerOptions = {
            workflow.containerEngine == "singularity" ? '--nv' :
            ( workflow.containerEngine == "docker" ? '--gpus all' : null )
        }
    }
}
This will try to set the proper options based on the container runtime used by the process, or will not set any options if the container runtime is not Singularity or Docker.
Provide custom parameters to executors
There are parameters that can be provided to the executor used to run a process: these parameters don’t affect the process behavior, but can be used to customize the job submission to the computing environment. A list of all the available parameters for each executor can be found in the nextflow documentation at Executors.
There’s one parameter for the SLURM executor that is quite useful to customize
the job submission: the clusterOptions parameter lets you provide custom
parameters to the sbatch command used to submit jobs to the SLURM scheduler
(options that are not already covered by Nextflow directives such as cpus, memory, time or queue).
For example, you may want to specify a custom partition or quality of service
for a specific process, like this:
process {
    withName: process_name {
        clusterOptions = '--partition=long --qos=normal'
    }
}
This will add the --partition=long --qos=normal options to the sbatch
command used to submit jobs for the specified process.
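When a scheduler option has a dedicated Nextflow directive, the directive can be used instead. The sketch below requests the partition through the queue directive and keeps clusterOptions only for the quality of service; the process name and the values are the same illustrative ones used above:

```groovy
process {
    executor = 'slurm'

    withName: 'process_name' {
        queue          = 'long'           // the partition, requested through the queue directive
        clusterOptions = '--qos=normal'   // no dedicated directive, passed to sbatch as is
    }
}
```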
Change output file names
Sometimes it is useful to change the output file names of a process, for example when a process keeps the input file name for its output. Ideally, the output file name prefix is defined at the process level like this:
script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
So it is possible to set the ext.prefix variable (read as task.ext.prefix by the process) in the custom configuration
file to define the output file name prefix, for example:
process {
    withName: SEQKIT_RMDUP_R1 {
        ext.prefix = { "${meta.id}_R1" }
    }
}
In this example we use a closure to define the output file name prefix dynamically, and this is useful to keep the sample name in the output file. Alternatively, it is possible to modify meta.id using the map operator, but this cannot be done in the custom configuration file: it has to be defined in the pipeline workflow or subworkflow, for example:
channel.map { meta, it -> [[id: "${meta.id}_updated"], it] }
However, this will override the old meta.id value with the new one, and all
the processes will then use the new value to define their output file name prefix.
A third option could be to use the
publishDir
directive with a saveAs closure that renames the published files, for example:
publishDir 'results', saveAs: { filename -> "foo_$filename" }
See Store outputs renaming files on nextflow patterns for more information.
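The same directive can also be set from a custom configuration file through a process selector. This is a sketch assuming the SEQKIT_RMDUP_R1 process used earlier; the path, the mode and the foo_ prefix are illustrative:

```groovy
process {
    withName: 'SEQKIT_RMDUP_R1' {
        publishDir = [
            path: 'results',
            mode: 'copy',
            // prepend a fixed prefix to every published file name
            saveAs: { filename -> "foo_${filename}" }
        ]
    }
}
```

Note that saveAs renames only the published copies: the files in the work directory, and those passed to downstream processes, keep their original names.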
Create a custom profile
A profile is a set of parameters that can be used to run a pipeline in a specific
environment. For example, you can define a profile to run a pipeline in a cluster
environment, or to run a pipeline using a specific container engine. You can also
define a profile to run a pipeline with a specific set of parameters, for example
test data.
A profile is defined in a configuration file and is selected
using the -profile option when running nextflow. A profile requires a name,
which is used to identify it, and a set of parameters. For example, you
can define a profile like this in your custom.config file:
profiles {
    cineca {
        process {
            clusterOptions = { "--partition=g100_usr_prod --qos=normal" }
        }
    }
}
In this example, each process will be submitted to the g100_usr_prod partition
using the normal quality of service, and those parameters may depend on the
environment in which this pipeline is supposed to run. In another environment,
those parameters will not apply, so there’s no need to use this specific profile
in a different environment. You can then call your pipeline using the -profile
option:
nextflow run -profile cineca,singularity ...
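A profile can also bundle pipeline parameters with resource settings, for example to run a small test. In this sketch the profile name small_test, the input parameter and the values are hypothetical, and resourceLimits assumes a Nextflow version that supports that directive:

```groovy
profiles {
    // hypothetical profile for a quick run on a small machine
    small_test {
        params {
            input = 'data/input_file.txt'
        }
        process {
            resourceLimits = [ cpus: 4, memory: 8.GB ]
        }
    }
}
```

It would be selected in the same way as any other profile, for example with -profile small_test,singularity.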
Params file
A Nextflow JSON parameter file is a way of providing configuration parameters for a Nextflow pipeline in a structured format using JSON (JavaScript Object Notation). It allows users to define the parameters required by the pipeline in a file rather than passing them directly via the command line. The key features of a Nextflow JSON parameter file are:
Structure: The JSON file contains key-value pairs that define different parameters. This structure makes it easy to read and modify parameters without needing to remember command line syntax.
Use Case: JSON parameter files are particularly useful for complex workflows with many parameters or when those parameters are subject to frequent changes. Users can manage their configurations in one place.
Access in Pipeline: Parameters defined in the JSON file can be accessed directly in your Nextflow scripts using the params object.
Here’s a simple example of what a Nextflow JSON parameter file might look like:
{
    "input": "data/input_file.txt",
    "output": "results/",
    "other_param": "value"
}
where input, output, and other_param are the parameters required by the
pipeline, which can also be declared or overridden on the command line by prepending -- to the
parameter name (e.g. --input, --output, --other_param).
To use a JSON parameter file in a Nextflow pipeline, you can specify it on the
command line using the -params-file option:
nextflow run <your pipeline> -params-file params.json
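For comparison, the same values could be written in the params scope of a configuration file. This is only a sketch reusing the illustrative names from the JSON example; values set this way have a lower priority than those given in a params file or on the command line:

```groovy
// equivalent defaults declared in a configuration file
params {
    input       = 'data/input_file.txt'
    output      = 'results/'
    other_param = 'value'
}
```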
The benefits of using a JSON parameter file include:
Readability: JSON files are quite structured and make it easy to see the settings needed for a pipeline.
Convenience: It’s more convenient to edit a JSON file for changing parameters than to modify and remember long command-line options.
Version Control: JSON files can be easily tracked and managed using version control systems like Git, which is particularly useful for collaborative projects.
Compatibility: JSON is widely supported across different programming languages, making it easy to generate or manipulate if needed.