Running Nextflow

A note on containers

Although nextflow can manage software dependencies with conda, singularity, docker or other container runtimes, the recommended solution is singularity: it packages all the software dependencies in a single image file, which can be cached and reused to speed up later runs (see Setting NXF_SINGULARITY_CACHEDIR for more information). You can find more information about singularity in the singularity section of these guidelines.

You can select the type of container runtime to use with the -profile option, for example:

nextflow run nf-core/rnaseq -profile test,singularity -resume

Warning

Downloading software dependencies can take a long time and is subject to network errors, which are unrelated to the pipeline or your data but can slow down or break pipeline execution. For this reason, it’s better to configure a cache for the downloaded software: the singularity cache can be configured in the singularity scope of the nextflow configuration or, better, with the $NXF_SINGULARITY_CACHEDIR environment variable. See Setting NXF_SINGULARITY_CACHEDIR for more information.
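As a minimal sketch of the second approach, the cache directory can be set once in your shell before running any pipeline. The default path used below is only an assumption: choose any directory with enough free space.

```shell
# Keep an already defined value, otherwise fall back to a hypothetical default path.
export NXF_SINGULARITY_CACHEDIR="${NXF_SINGULARITY_CACHEDIR:-$HOME/nxf_singularity_cache}"

# Create the directory if it does not exist yet.
mkdir -p "$NXF_SINGULARITY_CACHEDIR"

echo "Singularity images will be cached in: $NXF_SINGULARITY_CACHEDIR"
```

Adding the export line to your shell startup file makes the setting persistent across sessions.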

Nextflow parameters and pipeline parameters

There are two types of parameters you can pass to nextflow: nextflow parameters and pipeline parameters. Nextflow parameters are related to nextflow itself, like -resume or -log. Pipeline parameters are related to the pipeline you are running, like --input or --output. In general, nextflow parameters have only one - before the parameter name, while pipeline parameters have two (--). To get a full list of the available options, you can call nextflow with the -h option or without any argument:

$ nextflow -h

To get the list of parameters of a specific pipeline, call the pipeline with the --help option, for example:

$ nextflow run nf-core/rnaseq --help

Another important aspect is that pipeline parameters can be written in a JSON file and provided to nextflow with the -params-file option. This is useful when you have a lot of parameters to provide to the pipeline, or when you want to save a configuration for later use. For example, to provide a JSON file with parameters to the pipeline, you can do:

$ nextflow run nf-core/rnaseq -params-file params.json

where params.json is a JSON file with the following content:

{
  "input": "samplesheet.csv",
  "fasta": "path/to/genome.fasta"
}
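If you prefer to create this file from the command line, a sketch like the following writes the same content and, when python3 happens to be available (an assumption, it is not required by nextflow), checks that the file is valid JSON before you use it. The values are the same placeholders shown above.

```shell
# Write the parameters file (placeholder values, adapt them to your data).
cat > params.json <<'EOF'
{
  "input": "samplesheet.csv",
  "fasta": "path/to/genome.fasta"
}
EOF

# Optional sanity check: report whether the file parses as JSON.
# Skipped silently when python3 is not installed.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool params.json >/dev/null && echo "params.json is valid JSON"
fi
```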

Nextflow parameters and pipeline parameters are not the only way to customize a pipeline: nextflow lets you define custom configuration files in which you can customize other aspects of the pipeline, like the number of CPUs to use, the memory to allocate, environment variables and also settings specific to the environment in which the pipeline runs. For more information, see the Configuration file section of the nextflow documentation and the Configuring a pipeline section of these guidelines. To get more information on CLI and pipeline options, please see Command line, and both CLI reference and Pipeline parameters in the nextflow documentation.
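As an illustration of such a custom configuration file, the sketch below writes a small custom.config that could then be passed to nextflow with the -c option. The file name, the resource values and the MY_TOOL_THREADS environment variable are hypothetical: adapt them to your environment and to the documentation of the pipeline you are running.

```shell
# Write a minimal custom configuration file (example values only).
cat > custom.config <<'EOF'
// Default resources requested by every process
process {
    cpus   = 4
    memory = '16 GB'
}

// Environment variables exported to the task environment
env {
    MY_TOOL_THREADS = '4'
}
EOF

# The file can then be passed to nextflow with -c
# (shown as a comment, it is not executed here):
#   nextflow run nf-core/rnaseq -profile singularity -c custom.config -params-file params.json
```

Keeping resource settings in a configuration file and pipeline parameters in a params file separates where the pipeline runs from what it analyses.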

Execute a community pipeline

Nextflow lets you build and share bioinformatics pipelines across the community. The simplest way to use nextflow is to identify the pipeline you need, check its requirements and then launch it on your data. Since all the nextflow community pipelines are public, you can download and modify them according to your needs.

Search for a community pipeline

Community pipelines are available on the nf-core pipelines site, where you can search for a pipeline and browse its documentation. For example, by searching for rnaseq you can reach the rnaseq pipeline page and read how to use it by clicking on the Usage tab.

You can download a pipeline using nextflow pull followed by the pipeline name in the form <organization name>/<pipeline>, for example:

nextflow pull nf-core/rnaseq

This will download a copy of the pipeline into the nextflow cache folder, which usually is $HOME/.nextflow/assets: the pipeline will be placed in a subfolder named after the organization and the pipeline (in this case nf-core/rnaseq). The container files required to execute the pipeline will be downloaded when the pipeline is executed for the first time, so make sure you have an internet connection during that run. If it is not possible to download the containers, you can run nextflow offline: please see Running nextflow offline in this documentation and the official Running offline nextflow documentation for more information.

Hint

The organization name is the GitHub organization which hosts the pipeline, like nf-core or cnr-ibba, while the pipeline name is the name of the GitHub repository which contains the pipeline. You can derive the full <organization name>/<pipeline> identifier by removing https://github.com/ from the repository URL. For example, from https://github.com/nf-core/rnaseq you can derive nf-core/rnaseq.

Tip

You can get a list of the available nf-core pipelines using nf-core/tools with the nf-core pipelines list command. You can also add a pattern to search for a specific pipeline, for example:

nf-core pipelines list rna

to get a list of pipelines related to RNA analysis.

To download the pipeline and its software and test everything in your local environment (which is recommended to check that everything works as intended, see Run a pipeline with test data), you can directly run the pipeline on test data. For example, for the rnaseq pipeline:

mkdir nf-rnaseq
cd nf-rnaseq
nextflow run nf-core/rnaseq -profile test,singularity -resume

Hint

Calling nextflow run with a remote pipeline will place the work and results directories in the current working directory, together with some hidden files used to log the pipeline execution. For this reason, it’s better to call nextflow run from a new, empty directory created for the project in which you plan to run the pipeline.

Tip

The community pipelines have a --help option that shows all the supported parameters. Try:

nextflow run nf-core/rnaseq --help

to get the full list of the available options.

Warning

The nextflow version required by the pipeline may be different from the nextflow version you have installed, in which case you will not be able to execute the pipeline. Please see this section of nextflow troubleshooting.

When calling nextflow on a community pipeline, like nextflow run nf-core/rnaseq, nextflow will download the latest pipeline version and place a local copy of the pipeline in your $HOME/.nextflow/assets folder. This local copy is used whenever you call nextflow run on the same pipeline. If you need a particular version or branch of the pipeline, you can request it with the -r option, for example:

$ nextflow pull nf-core/rnaseq -r 3.12.0

Warning

Whenever you pull a pipeline version different from the latest, you MUST declare the same version or branch when calling nextflow, for example:

$ nextflow run nf-core/rnaseq -r 3.12.0 --help

If you need to update your local pipeline to the latest version, see the Update a pipeline section.

Manage community pipelines with nf-core

Search for a pipeline

Whenever you run a community pipeline, nextflow will download and cache it (in your $HOME/.nextflow/assets/ folder). You can check your installed community pipelines with:

nextflow list

You can list all the available nf-core pipelines with:

nf-core pipelines list

You can search for a specific pipeline by providing a name as an argument:

nf-core pipelines list rna

Download a pipeline

You can download a pipeline together with its container dependencies. This is helpful when running nextflow in an environment without an internet connection:

nf-core pipelines download nf-core/rnaseq -r 3.12.0

This command can also amend the singularity images in your $NXF_SINGULARITY_CACHEDIR, which means that missing images will be placed in your local $NXF_SINGULARITY_CACHEDIR folder instead of the downloaded archive.

Hint

Using the option --download-configuration yes you can also download the institutional configuration files for offline usage. This is useful when you need to run a pipeline in an environment without an internet connection. For more information see Institutional configuration files and Running nextflow offline.

Run a pipeline

One of the most useful features of nf-core/tools is the possibility to configure pipeline parameters interactively with:

$ nf-core pipelines launch rnaseq

This command will download the pipeline into the assets folder and then open a web browser or an interactive CLI session to let you configure the pipeline parameters. You can also save the configuration in a file and use it later with the nextflow -params-file option.

See Install nf-core/tools to get the nf-core/tools software installed.

Tip

Nextflow creates a lot of files in the current working directory. It’s better to create a dedicated directory from which to call nextflow.

Execute a shared custom pipeline

Nextflow can also run pipelines outside the scope of the nf-core team, provided they are shared in public repositories. For example, to execute a pipeline available on GitHub, call nextflow with the <organization name>/<pipeline> identifier, as in the following example:

nextflow run cnr-ibba/nf-resequencing-mem -resume -profile singularity \
  --input <samplesheet.csv> --genome_fasta <path/to/genome.fasta>

where cnr-ibba/nf-resequencing-mem is the repository which contains the nextflow pipeline.

Tip

You can configure nextflow to store your GitHub access credentials, see Access to private repositories in these guidelines.

Nextflow best-practices

Here are some tips that could be useful while running nextflow.

Run a pipeline with test data

When you run a pipeline for the first time, it’s better to use test data in order to check that the pipeline works as expected. All the community pipelines have a test profile (-profile test) which downloads a small dataset and runs the pipeline on it. For example, to run the nf-core/rnaseq pipeline with test data, you can do:

nextflow run nf-core/rnaseq -profile test,singularity -resume

This will also download the required dependencies (like the singularity images). The next time you run the pipeline, nextflow will use the cached images and will not download them again.

Getting information from logs

By calling nextflow log you can get information on your last nextflow runs, which includes timestamp, duration, status, run name and the command used when the pipeline was called:

$ nextflow log
TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                              COMMAND
2021-10-27 12:40:32     54.8s           serene_engelbart        OK      c44b10f3aa      598f0939-a7b0-497f-a16f-b2431a7e5ee3    nextflow run . -profile test,docker
2021-10-27 12:49:05     43.6s           evil_ride               OK      c44b10f3aa      a70a75e2-61fc-4407-aba4-19ac33f31774    nextflow run . -profile test,docker

RUN NAME is an arbitrary name assigned to each pipeline run. By calling nextflow log again and providing this name you can retrieve more information on the single execution steps:

$ nextflow log serene_engelbart
/home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/5d/6ff357b9b679198557bf22d24adf1e
/home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/ff/dd919f582e8583a16aecc58f6cc093
/home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/74/944e234214bcca20209637a94c0ac2
/home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/31/b075adb744673b9cc8fb214729c455

By default, nextflow log <run name> returns only the working directories. To get more informative results, you need to specify some columns using the -f parameter, for example:

$ nextflow log serene_engelbart -f 'process,status,exit,hash,duration,workdir'
NFCORE_RESEQUENCING:RESEQUENCING:INPUT_CHECK:SAMPLESHEET_CHECK  COMPLETED       0       5d/6ff357       1.8s    /home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/5d/6ff357b9b679198557bf22d24adf1e
NFCORE_RESEQUENCING:RESEQUENCING:FASTQC COMPLETED       0       ff/dd919f       7.2s    /home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/ff/dd919f582e8583a16aecc58f6cc093
NFCORE_RESEQUENCING:RESEQUENCING:FASTQC COMPLETED       0       74/944e23       5.2s    /home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/74/944e234214bcca20209637a94c0ac2
NFCORE_RESEQUENCING:RESEQUENCING:FASTQC COMPLETED       0       31/b075ad       7.2s    /home/paolo/Projects/NEXTFLOWetude/nf-core-resequencing/work/31/b075adb744673b9cc8fb214729c455

Call nextflow log -l to get the full list of available columns.
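For example, to focus on the tasks that did not complete, the -F option of nextflow log filters the entries with an expression on the same fields. The helper below is only a sketch: the function name failed_tasks is hypothetical, and the run name must be one of those listed by nextflow log.

```shell
# List the tasks of a run whose status is FAILED, with their work directories.
failed_tasks() {
  run_name="$1"
  nextflow log "$run_name" \
    -f 'process,status,exit,workdir' \
    -F "status == 'FAILED'"
}

# Usage (inside the project directory where the pipeline was launched):
#   failed_tasks serene_engelbart
```

The work directories printed this way are a good starting point for inspecting the files of a failing task.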

Resume calculations

Nextflow, by default, executes every calculation in a subfolder of the work directory, which is created in your current working directory. Every step is executed in a separate subfolder, and nextflow takes care of passing inputs and outputs between related steps. It is common to call nextflow multiple times, for example while modifying a pipeline, tuning parameters or solving issues. In such cases, you can save a lot of space (and calculation time) by resuming a pipeline, that is, by not running again the jobs that already completed successfully. To achieve this, it is important to add the -resume option when calling nextflow:

$ nextflow run <pipeline> -resume <pipeline parameters>

Note

Nextflow parameters have only one - before the parameter name. Pipeline parameters always have -- in front of them. Nextflow commands, like run, info, log, ... don’t have any - in front of them.

Cleanup

After a pipeline has completed successfully, it’s better to clean up the work directory in order to save space. All the desired outputs need to be saved outside this folder, so that temporary data can be removed safely. The nextflow clean command safely removes temporary files and nextflow logs. You can get information on your nextflow runs by calling nextflow log inside your project folder:

$ nextflow log
TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                              COMMAND
2021-01-14 18:31:18     34m 17s         magical_roentgen        OK      3643a94411      fa1714cf-1dbf-45ec-9910-9dcb27aab52b    nextflow run nf-core/rnaseq -profile test,singularity -resume --max_cpus=24
2021-01-15 15:38:02     -               magical_rosalind        -       3643a94411      fa1714cf-1dbf-45ec-9910-9dcb27aab52b    nextflow run nf-core/rnaseq -profile test,singularity -resume --max_cpus=24

Then you can remove a specific run by its name, for example:

$ nextflow clean magical_roentgen -f

See nextflow clean documentation for more info.
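Before deleting anything, it can be useful to preview what would be removed. The sketch below relies on the -n (dry run) option of nextflow clean; the function name preview_clean is hypothetical, and the run name is the one taken from the nextflow log output.

```shell
# Print the work directories that would be removed for a run, deleting nothing.
preview_clean() {
  run_name="$1"
  nextflow clean "$run_name" -n
}

# Usage (inside the project directory):
#   preview_clean magical_roentgen          # dry run, only lists what would be removed
#   nextflow clean magical_roentgen -f      # actually remove the files
```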

Note

When calling log, you can inspect the command line used to execute the pipeline. You could also get information about execution times. For more information, take a look at nextflow log documentation.

Hint

Although singularity writes images in $NXF_SINGULARITY_CACHEDIR, there are also cache files stored inside your $HOME/.singularity/cache directory. Free some space with:

$ singularity cache clean

The previous command will not affect the singularity images downloaded in your $NXF_SINGULARITY_CACHEDIR folder. If you want to remove them, you have to do it manually. See the Clean up Singularity section of these guidelines for more information.

Warning

Calling nextflow clean -f without a session ID or run name will only remove the temporary files of the last nextflow run, without removing files from previous sessions. If you want to remove ALL your nextflow cache directories with a single command, you can do:

$ nextflow clean $(nextflow log -q) -f

where nextflow log -q returns only the run names of all the nextflow runs in your working folder.

Update a pipeline

If you manage community pipelines using nextflow or nf-core software (not using git), you can get information on outdated pipelines with the nf-core pipelines list command:

$ nf-core pipelines list
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Pipeline Name     ┃ Stars ┃ Latest Release ┃      Released ┃  Last Pulled ┃ Have latest release? ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ rnaseq            │   323 │            3.1 │   2 weeks ago │  2 hours ago │ Yes (v3.1)           │
│ methylseq         │    66 │          1.6.1 │   3 weeks ago │ 4 months ago │ No (v1.5)            │

In this example, we can see that the local copy of the rnaseq pipeline is up to date, while methylseq is quite old and needs to be updated.

Hint

You can search for a specific pipeline with nf-core pipelines list <pattern>, for example:

$ nf-core pipelines list rnaseq

Note

When you manage pipelines using the nextflow software, pipelines are downloaded locally in your $HOME/.nextflow/assets/ folder (see Manage community pipelines with nf-core): the information you see reflects the updates of the community pipelines compared to your local assets.

In order to update a community pipeline, you need to call nextflow pull, for example:

$ nextflow pull nf-core/rnaseq

This will update your local assets by downloading the latest default revision of the pipeline. If you need a specific version (or branch), you need to specify it with the -r option:

$ nextflow pull nf-core/rnaseq -r 3.12.0

Tip

You can get a list of the available revisions and versions with:

$ nextflow info nf-core/rnaseq

This information comes from the local copy of the pipeline in your assets folder: make sure to run this command after a nextflow pull to collect the latest information.

Hint

The same considerations apply to custom shared pipelines, for example:

$ nextflow pull cnr-ibba/nf-resequencing-mem -r issue-1

Warning

If you download a specific version with nextflow pull, you have to specify it with the same -r option when you call nextflow run. This is required if you need to run your analyses with an old pipeline version, or if your nextflow executable doesn’t support the latest pipeline version.

Delete the local copy of a pipeline

In order to remove a local copy of a pipeline (a pipeline installed in your cache using nextflow pull or nextflow run), simply type:

$ nextflow drop <pipeline_name>

where <pipeline_name> is a single row returned by nextflow list (GitHub organization/pipeline name).