Create a new pipeline
Start from scratch
If you can’t find a suitable pipeline in the community, you can create one yourself. The Learning Nextflow section of these guidelines collects a lot of material on working with nextflow. The most interesting feature of nextflow, however, is the DSL2 syntax: with it, you can reuse modules in which the calculation steps are already defined by the community. This way, you avoid writing a full pipeline by yourself.
The minimal set of files required for a pipeline is
main.nf, nextflow.config and modules.json inside your project folder.
You should also have a modules directory inside your project:
mkdir -p my-new-pipeline/modules
cd my-new-pipeline
touch main.nf nextflow.config modules.json README.md .nf-core.yml
Next, edit modules.json so that it contains the minimal required information:
{
    "name": "<your pipeline name>",
    "homePage": "<your pipeline repository URL>",
    "repos": { }
}
Without these prerequisites you will not be able to add community modules to your
pipeline using nf-core/tools.
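The setup steps above can be collected in one small script. This is only a sketch: the pipeline name and the repository URL are placeholders to replace with your own values, and the final JSON check assumes that python3 is available (it is skipped otherwise).

```shell
set -eu

# Placeholder values (assumptions): replace with your own name and URL.
PIPELINE_NAME="my-new-pipeline"
PIPELINE_URL="https://github.com/<your-account>/${PIPELINE_NAME}"

# Project folder, modules directory and the empty files listed above.
mkdir -p "${PIPELINE_NAME}/modules"
touch "${PIPELINE_NAME}/main.nf" \
      "${PIPELINE_NAME}/nextflow.config" \
      "${PIPELINE_NAME}/README.md" \
      "${PIPELINE_NAME}/.nf-core.yml"

# Minimal modules.json with the three keys described above.
cat > "${PIPELINE_NAME}/modules.json" <<EOF
{
    "name": "${PIPELINE_NAME}",
    "homePage": "${PIPELINE_URL}",
    "repos": {}
}
EOF

# Optional sanity check: only runs when python3 is installed.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "${PIPELINE_NAME}/modules.json" >/dev/null \
        && echo "modules.json is valid JSON"
fi
```

Run it from the folder in which you want the project to be created, then move into the new folder before calling nf-core/tools.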
Tip
It’s a good idea to track your pipeline with a version control system (VCS) like git
Hint
You could also create a new pipeline using the nf-core template:
nf-core pipelines create
This utility command will configure a pipeline to be submitted to the nf-core
community, or let you customize all the options to include in a pipeline that
is kept private or stand-alone (not to be submitted to the community).
Please see the join the community
section and get in contact with the developers if you plan to contribute to the
community.
Browsing modules list
You can get a list of modules by using nf-core/tools (see here
how you can install it):
nf-core modules list remote
You could also browse modules inside a different repository and branch, for example:
nf-core modules --git-remote https://github.com/cnr-ibba/nf-modules.git \
    --branch master list remote
Hint
You can work on a new module and open a pull request to add it to the community. See the Custom pipeline modules section to work with custom modules. See also the nf-core guidelines to understand how you can contribute to the community.
Adding a module to a pipeline
You can download and add a module to your pipeline using nf-core/tools:
nf-core modules install --dir . fastqc
Note
The --dir . option is optional: the default installation path is the current
working directory (which needs to be your pipeline source directory)
Hint
If you don’t provide the module name, nf-core will search
and prompt for a module in the nf-core/modules GitHub repository
Add a simple workflow
In order to have a minimal pipeline, you need to add at least an unnamed workflow
to your pipeline. Moreover, you should declare the input channels and the modules
or the processes you plan to use. Suppose you want to create a minimal pipeline that
runs a fastqc analysis on a set of reads. You can install the fastqc module as described
above and then add a workflow like this to your main.nf:
// Declare syntax version
nextflow.enable.dsl=2

include { FASTQC } from './modules/nf-core/fastqc/main'

workflow {
    reads_ch = Channel.fromFilePairs(params.input, checkIfExists: true)
        .map { it ->
            [[id: it[1][0].baseName], it[1]]
        }
        // .view()

    FASTQC(reads_ch)
}
In this case FASTQC expects to receive a channel with meta information, which
is why we create an input channel and then add a meta map derived from the file names.
Please refer to the module main.nf file to understand how to call a module
and how to pass parameters to it. Next, you will also need a minimal
nextflow.config configuration file to run your pipeline, in order
to define where software can be found and other useful options:
params {
    input = null
}

profiles {
    docker {
        docker.enabled = true
        docker.userEmulation = true
    }
}

docker.registry = 'quay.io'
Next, you can call your pipeline like this:
nextflow run main.nf -profile docker --input "data/*_{1,2}.fastq.gz"
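Note the quotes around the --input pattern: they keep the shell from expanding the glob, so that nextflow receives the pattern itself and can pair the files. The sketch below shows the difference in a throwaway directory with made-up sample names; count_args is a helper defined only for this illustration.

```shell
set -eu

# Throwaway directory with hypothetical paired-end file names.
demo_dir=$(mktemp -d)
touch "${demo_dir}/sampleA_1.fastq.gz" "${demo_dir}/sampleA_2.fastq.gz" \
      "${demo_dir}/sampleB_1.fastq.gz" "${demo_dir}/sampleB_2.fastq.gz"

# Illustrative helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

# Unquoted: the shell expands the glob into one argument per matching file.
unquoted=$(count_args "${demo_dir}"/*.fastq.gz)

# Quoted: the pattern is passed through untouched, as a single argument.
quoted=$(count_args "${demo_dir}/*.fastq.gz")

echo "unquoted pattern: ${unquoted} arguments"
echo "quoted pattern:   ${quoted} argument"

rm -rf "${demo_dir}"
```

With an unquoted pattern, nextflow would typically receive only the first expanded file name as the value of --input.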
You can create different workflows and call them in your main workflow, or you
can install a subworkflow just as you install a module. You can also add
more options to your nextflow.config file, or define a custom profile
for modules, in order to provide more options to your pipeline. Please refer
to the nextflow documentation for more information on how to customize your
pipeline.
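To see what an installed module expects, as suggested above, it is often enough to print the input: and output: sections of its main.nf. The helper below is a minimal sketch: show_module_io is a hypothetical function name, and the path in the usage lines assumes that the fastqc module was installed in the default location used by the include statement above.

```shell
# Print the input: and output: blocks of a module file, stopping at the
# first "when:" or "script:" line. show_module_io is an illustrative name.
show_module_io() {
    awk '
        /^[[:space:]]*input:/         { show = 1 }
        /^[[:space:]]*(when|script):/ { show = 0 }
        show
    ' "$1"
}

# Usage, assuming the default installation path of the fastqc module.
module_file="modules/nf-core/fastqc/main.nf"
if [ -f "${module_file}" ]; then
    show_module_io "${module_file}"
fi
```

The printed input: block tells you which tuple (for example a meta map plus the read files) the module call must receive.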
List all modules in a pipeline
You can have a full list of installed modules using:
nf-core modules list local
Update a pipeline module
You can update a module simply by calling:
nf-core modules update fastqc
Hint
Call nf-core modules update --help to get a list of the available options,
for example if you need to install a specific version of a module.
Custom pipeline modules
We provide custom DSL2 modules (not implemented by the nf-core community) in our
repository at cnr-ibba/nf-modules.
This repository is not maintained by the nf-core community: it is intended
to share modules across pipelines and to test things locally. It is organized in a
similar way to nf-core/modules, so it is
possible to take a module from here and share it with the nextflow community (please see
their documentation).
To get a list of the available custom modules, specify the custom modules repository
with the -g parameter (short option for --git-remote), for example:
nf-core modules -g https://github.com/cnr-ibba/nf-modules.git list remote
Add a custom module to a pipeline
To add a custom module to your pipeline, move into your pipeline folder and call
nf-core modules install with your custom modules repository as a parameter, for example:
nf-core modules --git-remote https://github.com/cnr-ibba/nf-modules.git install freebayes/single
Create a new module
You can create a new module inside a pipeline folder or inside a cloned modules
git repository. If you create a module inside a pipeline, it is created in the
modules/local/ folder of the pipeline and exists only in that
pipeline. If you create a module inside a modules repository, you can then install
it in any pipeline using nf-core modules install. Creating a module
in a modules repository is also the way to contribute to the Nextflow community.
The command works the same way in both scenarios: nf-core modules inspects your
project to determine whether the folder is a pipeline or a clone of a modules
repository:
nf-core modules create freebayes/single --author <your GitHub account> --label process_high --meta
Tip
For more information on creating modules, see the Create a module guide.
Testing a new module
The custom modules repository is configured to use GitHub workflows to run
tests on all modules. Please try to define tests and configuration files as the other
modules do (you can take a look at the community modules for some examples). You can
test modules locally before submitting a pull request to the custom modules
repository. The python package pytest-workflow is required to run such tests.
You also need to specify an environment among conda, docker and singularity
in order to run the tests. Use tags to specify which tests need to be run:
NF_CORE_MODULES_TEST=1 PROFILE=docker pytest --symlink --keep-workflow-wd \
--git-aware --tag freebayes/single
You also need to check the module with the nf-core linter, specifying which
module to check:
nf-core modules lint freebayes/single
If both checks succeed, there is a higher chance that your tests will run without errors in the GitHub workflow.
Subworkflows
A subworkflow is an experimental feature which allows you to chain several modules
together (for example bam_sort_stats_samtools, which executes samtools sort and samtools
index and then calls bam_stats_samtools, which is another subworkflow).
Subworkflows are imported in the main workflow (pipeline) like any other module. It is
possible to manage subworkflows in the same way as modules, using nf-core
tools. For example:
nf-core subworkflows list remote
to get a list of available subworkflows. Similarly, you can install a subworkflow
using nf-core tools:
nf-core subworkflows install bam_sort_stats_samtools
See also Subworkflow Specifications for more information.
Pipeline best practices
Use DSL2 syntax when possible
Warning
Starting from nextflow 22.12.0-edge version, DSL1 was removed and DSL2
is the new standard. You cannot use DSL1 with a recent version of nextflow.
DSL2 is the newest pipeline standard and the nextflow community is currently moving to this format. This means that community pipelines will be updated to fully support this standard and if you plan to submit your pipeline to the community you will probably need to write code using this format.
The major change provided by the DSL2 format is modules, as described in these docs, which let you reuse software managed and provided by the community, simplifying your pipeline: the code required to run the software and to provide inputs and collect outputs is provided by the modules, which can be installed or updated as described in this guide.
Another change introduced in DSL2 is the way you can pass data between pipeline steps. With the old standard, the only way was to use channels: this implied that after consuming values from a channel you could not reuse those values in another pipeline step. For example, if one step produces an output required by two or more steps, you have to put the data in two or more channels, like this:
output:
file '*.fq' into trimmed_reads, quantifier_input_reads
and once trimmed_reads values are consumed, you cannot read these values in
another step. Another example could be a step in which
you align reads to an indexed genome made by a different step: since the genome
index is emitted once from the indexing step, you will be able to align only one
sample if you pass the channels as they are in input: the only way to align all
your samples is to use the
combine operator
and put all values in a new channel:
trimmed_reads.combine(genome_index).set{ align_input }
and then read those values as a tuple:
input:
tuple file(sample), file(genome) from align_input
In DSL2, you can pass the outputs of a module directly to another module, without declaring intermediate channels, for example:
BWA_MEM(TRIMGALORE.out.reads, BWA_INDEX.out.index)
and values from a module step can be read as many times as needed.
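The difference can be tried out with a toy workflow. The script below writes a small DSL2 file, reuse_demo.nf (a hypothetical name), in which one channel feeds two consumers; the sample names are made up, and the map calls merely stand in for real alignment and quantification modules.

```shell
set -eu

# Write a toy DSL2 workflow; the quoted 'EOF' keeps ${...} literal for nextflow.
cat > reuse_demo.nf <<'EOF'
nextflow.enable.dsl=2

workflow {
    // Stand-in for the output of a trimming step (made-up file names)
    trimmed_reads = Channel.of('sampleA.fq', 'sampleB.fq')

    // In DSL2 both consumers receive every value of the same channel
    trimmed_reads.map { read -> "align: ${read}" }.view()
    trimmed_reads.map { read -> "quantify: ${read}" }.view()
}
EOF

echo "Run it with: nextflow run reuse_demo.nf"
```

When run, each of the two view calls should print both sample names; in DSL1 the second consumer would have needed its own copy of the channel.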
Warning
The from and into keywords used in DSL1 process declarations, as well as the into operator, are removed in DSL2 (the set operator is still available).
See the DSL 2 nextflow documentation
for an overview of the major changes.
Write the configuration stuff outside your pipeline
Since the aim of nextflow pipelines is reproducibility and portability,
you should avoid placing analysis-specific parameters in your main pipeline
script: doing so forces users to modify your pipeline according to their needs, and
leads to multiple pipeline scripts which differ only in a few details, for example
where the input files are. If you keep your configuration outside your main
script, you can reuse the same parameters with different scripts and keep
your main file unmodified: this keeps things simple and lets you focus only
on the important changes in your VCS. For example, you could define a
custom params.json JSON config file in which you specify your
requirements:
{
    "readPaths": "$baseDir/fastq/*.fastq.gz",
    "outdir": "results",
    "genome": "/path/to/genome.fasta"
}
All the other parameters which cannot be specified using the command line interface need to be provided in a custom configuration file using the standard nextflow syntax:
profiles {
    slurm {
        process.executor = 'slurm'
        process.queue = 'testing'
    }
}
Then, you can call nextflow by providing your custom parameters and configuration file:
nextflow run -resume main.nf -params-file params.json \
-config custom.config -profile singularity
Hint
nextflow looks for configuration in several locations, and the locations are
ranked in order to decide which settings are applied: you can override the
default configuration by using a configuration source with a higher priority.
For example, a file given with -c <config file>, a file given with
-params-file <file> and parameters provided on the command line are three such
sources, listed from lower to higher priority. See the
Configuration file
section of the nextflow documentation.
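As a minimal sketch of this ranking, the script below writes the two files used above into a scratch folder (config-demo, a hypothetical name) and prints a command in which --outdir is also given on the command line. Under the precedence just described, the command-line value (the made-up results_cli) would win over the "outdir" entry of params.json.

```shell
set -eu

mkdir -p config-demo

# Parameters file; the quoted 'EOF' keeps $baseDir literal.
cat > config-demo/params.json <<'EOF'
{
    "readPaths": "$baseDir/fastq/*.fastq.gz",
    "outdir": "results",
    "genome": "/path/to/genome.fasta"
}
EOF

# Settings that are not pipeline parameters go in a config file.
cat > config-demo/custom.config <<'EOF'
profiles {
    slurm {
        process.executor = 'slurm'
        process.queue = 'testing'
    }
}
EOF

# Command-line parameters rank above -params-file, which ranks above -c.
echo "nextflow run main.nf -params-file config-demo/params.json -c config-demo/custom.config -profile slurm --outdir results_cli"
```

The printed command is meant to be run from a pipeline folder; here it is only echoed so that the sketch works without a nextflow installation.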
Add test data to your pipeline
It is frustrating to write a pipeline against a real dataset: steps can take a lot
of time to complete, and if you make a mistake when calling software or when
collecting outputs you will only notice it after a long wait, and you may have
no way to recover the data produced before the nextflow error.
In the testing and revision stages, or when adding new features, consider
working with a reference dataset like the
one provided by the nextflow community,
or add some public data to your pipeline. Please remember not to track big files
with your VCS: you should provide only the minimal data required to get your pipeline
running as intended in the shortest time. You should also consider
providing a test profile with the required parameters, which lets you test
your pipeline like this:
nextflow run . -profile test,singularity
Where the test profile is specified in nextflow.config and refers to
the test dataset you provide with your pipeline:
profiles {
    ...
    test {
        params {
            // test input reads
            reads_path = "./testdata/GSE110004/*{1,2}.fastq.gz"
            // Genome references
            genome_path = "./testdata/genome.fa"
        }
    }
}
This kind of test can also be used with a CI system, such as GitHub workflows.
Lower resources usage
You should consider lowering the resources required by your pipeline. This avoids the cost of allocating more resources than needed and lets you complete your analysis in a shorter time when resources are limited. Take a look at the Dynamic allocation of resources documentation section. You can also provide an institutional configuration to your pipeline. See Institutional configuration files for more information.
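As an illustration of dynamic allocation, the sketch below writes a small config file (resources.config, a hypothetical name) that starts FASTQC with modest resources and raises them on each retry. The numbers are placeholders, not recommendations; task.attempt, errorStrategy and maxRetries are standard nextflow settings.

```shell
set -eu

# Quoted 'EOF' so the shell leaves the nextflow closures untouched.
cat > resources.config <<'EOF'
process {
    withName: 'FASTQC' {
        cpus          = 2
        // Start small and scale with the attempt number (1, 2, 3)
        memory        = { 2.GB * task.attempt }
        time          = { 1.h * task.attempt }
        // Re-submit a failed task up to two more times
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
EOF

echo "Use it with: nextflow run main.nf -c resources.config -profile docker"
```

Starting low and retrying with more memory and time keeps most tasks cheap while still letting the occasional heavy sample finish.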
Patch a module/workflow
If you need to patch a module (or a subworkflow), you can do it by editing the module
files in your modules directory. However, if the module (or subworkflow) is provided by the
community, the linting tests will fail and it will be difficult to update the
module in the future using nf-core tools. In this case, you should
run nf-core modules patch
followed by the name of the module you are modifying, for example:
nf-core modules patch fastqc
which will create a patch file in your module directory, tracking the changes you
made to the module. This patch file is applied again whenever the module is updated,
which solves the issues with the linter and lets you customize the module while
still managing it with nf-core tools. See the Patch a module
section of the nf-core documentation for more information.