Customize a pipeline
====================

.. contents:: Table of Contents

Cloning a pipeline
------------------

You don't need to modify a pipeline if you only need to change a pipeline parameter or adapt the execution to your local environment: pipeline execution is highly customizable through *custom configuration files* and *parameters*. Most of the time you will be able to run a pipeline without modifying it, but cloning a pipeline is useful when you need to add new features or fix bugs in a pipeline you are working on::

    git clone https://github.com/nf-core/rnaseq

.. hint::

    Nextflow itself can clone a pipeline like git does::

        nextflow clone nf-core/rnaseq

    The ``nf-core`` prefix of the pipeline is the *organization* name, and ``rnaseq`` is the *repository* name, as you find them on GitHub.

.. _configuring-a-pipeline:

Configuring a pipeline
----------------------

You can customize a pipeline by creating custom configuration files: this may be necessary if you need to lower the requirements of a pipeline, for example to run it with limited resources, or if you need to track the parameters you have been using for a particular analysis. You can also specify a custom configuration file to run a pipeline with a different profile, for example to enable options required by a specific environment. A custom configuration file has a higher priority than the default configuration file, but a lower priority than the parameters provided on the command line. Moreover, it is possible to use an *institutional configuration file* in which you can specify the default parameters for all the pipelines you plan to execute within the infrastructure provided by your institution (see `here `_ for an example). For a complete list of configuration options and priorities, please see the `nextflow configuration `_ documentation.
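
The priority order described above can be illustrated with a minimal custom configuration file. This is only a sketch: the file name ``custom.config``, the ``outdir`` parameter and the resource values are assumptions chosen for illustration, and the ``process_high`` label only has an effect if the pipeline actually uses it.

.. code-block:: groovy

    // custom.config (hypothetical file name)
    params {
        // hypothetical pipeline parameter: a value given on the
        // command line (--outdir) would still win over this one
        outdir = 'results'
    }

    process {
        // lower the requirements of the processes carrying this label
        withLabel: process_high {
            cpus   = 8
            memory = 32.GB
        }
    }

Passing this file with ``-c custom.config`` overrides the matching defaults from the pipeline's ``nextflow.config``, while an explicit ``--outdir other_results`` on the command line would take precedence over the value set here.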

nextflow.config
---------------

Before starting a new custom configuration file, you should take a look at the default configuration file provided by the pipeline you are working on. For a standard Nextflow pipeline, the default configuration file is named ``nextflow.config`` and is located in the root of the pipeline directory. This file defines the default parameters that affect pipeline execution. In a DSL2 pipeline, you can also find the ``conf/base.config`` file, in which the requirements for each job are defined.

.. hint::

    The community recommends providing pipeline parameters, like the input files, the reference database or other user-defined values, through a *parameters* file, which is a JSON file specified with the ``-params-file`` option. This lets you run a pipeline without typing parameters on the command line. All the settings which cannot be specified using the command line interface (for example the amount of memory required by a certain step) can be defined in the custom configuration file.

.. _institutional-configuration-files:

Institutional configuration files
---------------------------------

The nf-core community offers a GitHub repository where institutional configuration files can be stored and shared among users. This means that users belonging to the same institution can share configuration files that are specific to their infrastructure. This repository is located at ``_ and is organized in two main sections: configurations that are shared among all pipelines and configurations that are specific to a single pipeline. The first kind of configuration file usually holds information about *executors*, *queues* and *resources*, and can be applied to every pipeline run in a particular computing environment at your institute.
The second kind of configuration file is specific to a single pipeline and can be used to customize a single pipeline step, for example to change the number of CPUs or the amount of memory required by a single process, overriding the pipeline default configuration.

Institutional configuration files are managed through the `profile scope `_ and the *nf-core* community pipelines are usually already configured to use them. This means that if an institutional configuration file is available in the nf-core configs repository, it can be used by passing the profile name to the pipeline execution, for example:

.. code-block:: bash

    nextflow run nf-core/rnaseq -profile ...

This is enough to apply the global institutional configuration to the pipeline execution, together with the pipeline-specific configuration if available. For more information see the `Shared nf-core/configs `_ and the `Step-by-step guide to writing an institutional profile `_ documents.

.. tip::

    We have a custom *institutional* configuration repository at *ibba*. To use it with nf-core pipelines, you should add the repository `cnr-ibba/nf-configs `_ using the ``--custom_config_base`` option, and specify ``ibba`` and your working environment profile, for example:

    .. code-block:: bash

        nextflow run nf-core/rnaseq \
            --custom_config_base https://raw.githubusercontent.com/cnr-ibba/nf-configs/ibba \
            -profile ibba,core \
            ...

    cnr-ibba pipelines, like `cnr-ibba/nf-resequencing-mem `_, are already configured to use our local institutional configuration repository. See `nf-core/configs: IBBA Configuration `_ for more information.

.. hint::

    The institutional configuration files are accessed remotely during pipeline execution: if you need to work offline, you should download and manage a local copy, then provide the path to the institutional configuration file using the ``-config`` option and the path to the institutional configuration git repository through the ``--custom_config_base`` option. More information can be found in :ref:`running-nextflow-offline` and :ref:`clone-institutional-configuration-files` of this documentation.

Custom configuration files
--------------------------

There are other configuration files that can be used to customize a single pipeline; they can be stored in the pipeline directory or in the directory where you are running the pipeline. These configuration files have the highest priority among configuration files and can be used to customize a single pipeline execution for a particular project. They should be specified using the ``-c`` or ``-config`` option when running the pipeline, for example:

.. code-block:: bash

    nextflow run nf-core/rnaseq -c custom.config ...

.. warning::

    Avoid naming your custom config file ``nextflow.config``, since it is a reserved name for the default configuration file, which is loaded automatically by Nextflow if present in your project directory. If you give your custom configuration file a different name, you can control when it is loaded using the ``-c`` or ``-config`` option when running Nextflow.

More information about configuration customization can be found in the official Nextflow `Configuration `_ documentation. The reference of all configuration options can be found in the Nextflow `Configuration options `_ reference. Here we provide some examples of how to customize a pipeline using custom configuration files.

Process selectors
~~~~~~~~~~~~~~~~~

Nextflow lets you specify the behavior of a process or a group of processes using process `selectors `_ in the configuration files. There are two main types of selectors, ``withLabel`` and ``withName``: the first one lets you specify the requirements for every process having the same label, the second one lets you specify the requirements for a process by name.
More precisely, in DSL2 pipelines these requirements are specified in ``conf/base.config`` and ``conf/modules.config``: the first file is used to specify the requirements for a group of jobs using *labels*, and the second one is used to specify the requirements for a single process using *names*. The Nextflow community recommends specifying the requirements for a group of processes with ``withLabel`` whenever possible, and using the ``withName`` selector only when you need to specify the requirements for a single process. For example, to lower resource requirements, it's better to start by redefining the most used labels, like ``process_high`` and ``process_medium``, and only then redefine single processes. Start with an empty *custom configuration* file and add a ``process`` scope like this:

.. code-block:: groovy

    process {
        withLabel: process_low {
            ...
        }
        withLabel: process_medium {
            ...
        }
        withLabel: process_high {
            ...
        }
        withName: FASTQC {
            ...
        }
    }

You may want to explore the imported modules to understand which processes will be affected by which label. To take effect, this file needs to be provided with the Nextflow ``-c`` or ``-config`` option:

.. code-block:: bash

    nextflow run -c custom.config ...

.. hint::

    Since these settings will override the default ones, it's better to declare only the minimal settings required by your pipeline.

See the Nextflow documentation on `Process selectors `_ for more information.

.. _dynamic-allocation-resources:

Dynamic allocation of resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Different instances of a process may require different resources in terms of computing power, memory, or time. In such situations, requesting too little of a resource, for example memory, will cause some tasks to fail. Conversely, requesting a higher limit that fits all the tasks in your execution could significantly decrease the execution priority of your jobs.
In such cases, `Dynamic directives `_ can be used to increase the resources required by a process when a task fails and is retried. Nextflow lets you specify the resources required by a process dynamically using the ``task.attempt`` variable, a counter that is incremented each time a task is retried. For example, you can specify the resources required by a process like this:

.. code-block:: groovy

    process {
        withLabel:process_medium {
            cpus   = { 6     * task.attempt }
            memory = { 12.GB * task.attempt }
            time   = { 8.h   * task.attempt }
        }
    }

This means that every time a task is retried, the base amount of resources is multiplied by the attempt number. However, the *maximum* number of attempts and the maximum resources should be specified in the configuration files to avoid infinite loops or excessive resource requests. The directives that control what happens when a task fails are `errorStrategy `_ and `maxRetries `_. The first one lets you specify the behavior of a process when an error occurs: you can configure it to terminate the pipeline when an error is found, to retry the task, or to continue with the workflow ignoring the error. The second one lets you specify the maximum number of retries for a process: once that value is reached, the entire pipeline is terminated. Usually, these directives are defined by default in the ``conf/base.config`` file of the pipeline like this:

.. code-block:: groovy

    process {
        errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
        maxRetries    = 1
    }

However, you can override these directives for a particular process using the ``withName`` or ``withLabel`` process selectors in the custom configuration file.

.. _handling-failing-jobs:

Handling failing jobs
~~~~~~~~~~~~~~~~~~~~~

You can use more complex *closures* to define the behavior of a process when an error occurs.
For example, you can specify that a failing process should be retried until a maximum number of attempts is reached, after which the error is simply ignored and the workflow continues. This is how to specify that behavior in a custom configuration file:

.. code-block:: groovy

    process {
        withName: VCFTOOLS_TSTV_COUNT {
            // retry on the first two attempts, then ignore the failure
            errorStrategy = { task.attempt <= 2 ? 'retry' : 'ignore' }
        }
    }

The same can be defined directly in the process declaration in a Nextflow file:

.. code-block:: groovy

    process MY_PROCESS {
        tag "$meta.id"
        label 'process_single'

        errorStrategy { task.attempt <= maxRetries ? 'retry' : 'ignore' }
    }

.. tip::

    Note that we declare ``errorStrategy =`` in a Nextflow configuration file, but ``errorStrategy { ... }`` in the process declaration in a Nextflow file. This behavior will be further investigated.

.. hint::

    Ignoring an error is possible only if no dependent process requires the output of the process that failed.

Take a look at the `Handling failing jobs with Nextflow `_ Medium article for more hints on how to handle failing jobs in Nextflow.

Setting max amount of resources for a process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nextflow also lets you specify the maximum resources that a process can request using the `resourceLimits `_ directive: this can be specified for a single process or globally in the ``process`` scope. In the latter case, you set the maximum resources for every process called by the pipeline. An example of how to specify the maximum resources for a process is shown below:

.. code-block:: groovy

    process {
        resourceLimits = [
            cpus: 32,
            memory: 64.GB
        ]
    }

.. warning::

    When using the ``resourceLimits`` directive, you only declare the maximum amount of resources that a single process can request: you are not specifying the total amount of resources that will be used by all the processes during the pipeline execution.

.. hint::

    The ``resourceLimits`` directive was introduced in Nextflow version ``24.04.0``: the pipeline options ``--max_cpus``, ``--max_memory`` and ``--max_time`` are deprecated and will be removed in future versions. If you need to work with pipelines developed for older versions of Nextflow, you should use the old ``check_max`` function to ensure that resource requirements don't exceed a maximum limit. See the `Dynamic allocation of resources (old syntax)`_ section for more information.

.. tip::

    To find out whether your pipeline supports the newer ``resourceLimits`` directive, take a look at the ``nextflow.config`` file in the pipeline directory and at the ``conf/base.config`` file: if the dynamic allocation of resources is managed by the ``check_max`` function and by the ``max_cpus``, ``max_memory`` and ``max_time`` parameters, you should use the old syntax to manage resources.

Dynamic allocation of resources (old syntax)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before version ``24.04.0``, the maximum resources required by a process were specified through the ``--max_cpus``, ``--max_memory`` and ``--max_time`` pipeline parameters. The resources were allocated dynamically using the ``check_max`` function, which needs to be included in the custom configuration file or in any file that uses it to dynamically allocate resources. Remember to specify a default value for ``max_memory``, ``max_cpus`` and ``max_time`` in your *custom configuration file* to avoid warnings when the ``check_max`` function is evaluated. An example of how to specify the maximum resources required by a process with the old syntax is shown below:

.. code-block:: groovy

    params {
        // Max resource options
        // Defaults only, expecting to be overwritten:
        // they must be set for the check_max function to work
        max_memory = '64.GB'
        max_cpus   = 32
        max_time   = '240.h'
    }

    process {
        withLabel:process_medium {
            cpus   = { check_max( 6     * task.attempt, 'cpus'   ) }
            memory = { check_max( 12.GB * task.attempt, 'memory' ) }
            time   = { check_max( 8.h   * task.attempt, 'time'   ) }
        }
    }

    // Function to ensure that resource requirements don't go beyond
    // a maximum limit
    def check_max(obj, type) {
        if (type == 'memory') {
            try {
                if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
                    return params.max_memory as nextflow.util.MemoryUnit
                else
                    return obj
            } catch (all) {
                println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
                return obj
            }
        } else if (type == 'time') {
            try {
                if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
                    return params.max_time as nextflow.util.Duration
                else
                    return obj
            } catch (all) {
                println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
                return obj
            }
        } else if (type == 'cpus') {
            try {
                return Math.min( obj, params.max_cpus as int )
            } catch (all) {
                println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
                return obj
            }
        }
    }

.. hint::

    The ``--max_cpus``, ``--max_memory`` and ``--max_time`` parameters are the maximum allowed values for dynamic job requirements: by setting these parameters you ensure that a *single job* will not allocate more resources than the ones you have declared. These parameters have no effect on the *global* resources used or on the number of jobs submitted.

.. tip::

    ``--max_cpus``, ``--max_memory`` and ``--max_time`` are parameters that can be provided using the Nextflow *params file* or the *command line interface*.
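
As a worked example of the capping logic above, assume a pipeline that already defines ``check_max`` and the ``process_medium`` closures shown earlier. In that case a custom configuration file only needs to lower the ``max_*`` parameters. The snippet below is a sketch, and the ceiling values are hypothetical:

.. code-block:: groovy

    // custom.config: hypothetical ceilings for a small workstation
    params {
        max_memory = '32.GB'
        max_cpus   = 16
        max_time   = '24.h'
    }

With these values, the memory request ``12.GB * task.attempt`` evaluates to 12 GB on the first attempt and 24 GB on the second, and would be capped at 32 GB (instead of 36 GB) on a third attempt, if ``maxRetries`` allows one. Likewise, ``6 * task.attempt`` CPUs gives 6, 12 and then 16 (capped from 18).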

Remove process limits
~~~~~~~~~~~~~~~~~~~~~

Sometimes it is convenient to remove the limits set for a process, for example for a very long task: in this case it is more convenient to avoid setting a walltime limit and let the *executor* choose the maximum allowed value. You can unset the time limit for a process by setting the ``time`` directive to ``null`` in the custom configuration file, for example:

.. code-block:: groovy

    process {
        withLabel:unlimited_time {
            time = null
        }
    }

This overrides any time limit set for the process and lets the *executor* choose the maximum allowed value (if supported).

Provide custom parameters to a process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some modules may require additional parameters in order to work correctly. These parameters can be specified with the ``ext.args`` variable within the process scope in the custom configuration file, for example:

.. code-block:: groovy

    process {
        withName:process_fastqc {
            ext.args = '-t 4'
        }
    }

When a process is composed of two (or more) tools, you can specify parameters for each tool independently using ``ext.args``, ``ext.args2``, ``ext.args3``: ``ext.args`` will be used for the first tool, ``ext.args2`` for the second and so on. In a DSL2 pipeline, custom variables for each process are defined in the ``conf/modules.config`` file: take a look at this file to understand which variables are set by default in your pipeline before adding new variables to a process.

Provide custom parameters to a container runtime
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes it is useful to provide custom parameters to the container runtime used to run a process. For example, you may want to provide custom Singularity options to a process in order to mount a specific directory or to provide a custom environment variable.
This can be done using the ``runOptions`` variable within the container runtime scope in the custom configuration file, for example:

.. code-block:: groovy

    singularity {
        runOptions = '--bind /data/project:/mnt/project'
    }

    docker {
        runOptions = '--env MY_ENV_VAR=value'
    }

.. warning::

    By default, ``docker.runOptions`` is set to ``'-u $(id -u):$(id -g)'``: this is required to run processes as the current user, so that files are created with proper permissions. Remember to include ``'-u $(id -u):$(id -g)'`` when providing your custom Docker options.

In addition, the ``containerOptions`` process directive can be used to provide custom options to the container runtime for a specific process. However, container runtimes like Singularity and Docker may have different ways to specify those options, so it's better to use the container runtime scope with ``runOptions`` in the custom configuration file for options that should apply to all the processes using that container runtime. If you need to provide custom options to a specific process, and you need to distinguish between different container runtimes, you can use a **closure** to define the options dynamically based on the container runtime in use, for example if you require *GPU* support:

.. code-block:: groovy

    process {
        withName: process_with_gpu {
            containerOptions = {
                workflow.containerEngine == "singularity" ? '--nv' :
                ( workflow.containerEngine == "docker" ? '--gpus all' : null )
            }
        }
    }

This sets the proper options based on the container runtime in use, and sets no options if the container runtime is neither Singularity nor Docker.

Provide custom parameters to executors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some parameters can be provided to the executor used to run a process: these parameters don't affect the process behavior, but can be used to customize the job submission to the computing environment.
A list of all the available parameters for each executor can be found in the Nextflow documentation at `Executors `_. One parameter of the ``SLURM`` executor is quite useful to customize job submission: ``clusterOptions`` lets you provide custom parameters to the ``sbatch`` command used to submit jobs to the SLURM scheduler (that is, options which are not covered by Nextflow directives like ``cpus``, ``memory``, ``time`` or ``queue``). For example, you may want to specify a custom partition or quality of service for a specific process, like this:

.. code-block:: groovy

    process {
        withName: process_name {
            clusterOptions = '--partition=long --qos=normal'
        }
    }

This adds the ``--partition=long --qos=normal`` options to the ``sbatch`` command used to submit jobs for the specified process.

Change output file names
~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes it is useful to change the output file names of a process, for example when a process would otherwise give its output the same name as its input. Ideally, the output file name *prefix* is defined at process level like this:

.. code-block:: groovy

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"

It is therefore possible to set ``ext.prefix`` (read as ``task.ext.prefix`` inside the process) in the custom configuration file to define the output file name prefix, for example:

.. code-block:: groovy

    process {
        withName: SEQKIT_RMDUP_R1 {
            ext.prefix = { "${meta.id}_R1" }
        }
    }

In this example we use a *closure* to define the output file name prefix *dynamically*, which is useful to keep the *sample name* in the output file. Alternatively, it is possible to modify ``meta.id`` using the `map operator `_: this cannot be done in the custom configuration file and must be defined in the pipeline *workflow* or *subworkflow*, for example:

.. code-block:: groovy

    channel.map { meta, it -> [[id: "${meta.id}_updated"], it] }

However, this replaces the old ``meta.id`` value with the new one, and all the downstream processes will then use the new value to define their output file name prefix. A third option is to use the `publishDir `_ directive with a closure that defines the output file name, for example:

.. code-block:: groovy

    publishDir 'results', saveAs: { filename -> "foo_$filename" }

See `Store outputs renaming files `_ on `nextflow patterns `_ for more information.

Create a custom profile
~~~~~~~~~~~~~~~~~~~~~~~

A profile is a set of configuration settings that can be used to run a pipeline in a specific environment. For example, you can define a profile to run a pipeline in a cluster environment, or to run a pipeline using a specific container engine. You can also define a profile to run a pipeline with a specific set of parameters, for example test data. A profile is defined in a configuration file and is selected using the ``-profile`` option when running Nextflow. A profile requires a name, which is used to identify it, and a set of settings. For example, you can define a profile like this in your ``custom.config`` file:

.. code-block:: groovy

    profiles {
        cineca {
            process {
                clusterOptions = { "--partition=g100_usr_prod --qos=normal" }
            }
        }
    }

In this example, each process will be submitted to the ``g100_usr_prod`` partition using the ``normal`` quality of service. These settings depend on the environment in which the pipeline is supposed to run: in another environment they will not apply, so there's no need to use this specific profile there. You can then call your pipeline using the ``-profile`` option::

    nextflow run -profile cineca,singularity ...

Params file
-----------

A Nextflow JSON parameter file is a way of providing configuration parameters for a Nextflow pipeline in a structured format using JSON (JavaScript Object Notation).
It allows users to define the parameters required by the pipeline in a file rather than passing them directly on the command line. The key features of a Nextflow JSON parameter file are:

1. **Structure**: The JSON file contains key-value pairs that define the different parameters. This structure makes it easy to read and modify parameters without needing to remember command line syntax.
2. **Use Case**: JSON parameter files are particularly useful for complex workflows with many parameters or when those parameters are subject to frequent changes. Users can manage their configurations in one place.
3. **Access in Pipeline**: Parameters defined in the JSON file can be accessed directly in your Nextflow scripts using the ``params`` object.

Here's a simple example of what a Nextflow JSON parameter file might look like:

.. code-block:: json

    {
        "input": "data/input_file.txt",
        "output": "results/",
        "other_param": "value"
    }

where ``input``, ``output``, and ``other_param`` are the parameters required by the pipeline. They can also be declared or overridden on the command line by prepending ``--`` to the parameter name (e.g. ``--input``, ``--output``, ``--other_param``). To use a JSON parameter file in a Nextflow pipeline, specify it on the command line using the ``-params-file`` option:

.. code-block:: bash

    nextflow run -params-file params.json

The benefits of using a JSON parameter file include:

- **Readability**: JSON files are structured and make it easy to see the settings needed for a pipeline.
- **Convenience**: It's more convenient to edit a JSON file to change parameters than to modify and remember long command-line options.
- **Version Control**: JSON files can be easily tracked and managed using version control systems like Git, which is particularly useful for collaborative projects.
- **Compatibility**: JSON is widely supported across different programming languages, making it easy to generate or manipulate if needed.
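
To illustrate the *Access in Pipeline* point above, here is a minimal sketch of a script that reads the values from the JSON example. The file name ``main.nf`` and the default values are assumptions made for illustration; real pipelines usually declare their defaults in ``nextflow.config``.

.. code-block:: groovy

    // main.nf (hypothetical): print the parameters loaded from the params file
    params.input  = null        // default, replaced by the JSON value
    params.output = 'results/'  // default, replaced by the JSON value

    workflow {
        log.info "input : ${params.input}"
        log.info "output: ${params.output}"
    }

Running this script with ``-params-file params.json`` would replace the defaults with the values from the JSON file, and a ``--input`` option on the command line would in turn override the value coming from the file.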