R environment
=============

.. contents:: Table of Contents

This document illustrates a series of best practices for working with an ``R``
environment on our shared infrastructure. ``R`` has excellent *IDEs* like
*RStudio*; however, such IDEs are not installed on our remote infrastructure,
since there is no *X* session available to display software with a graphical
interface. For this reason, the recommended way to work with our infrastructure
is to develop your analyses locally and then move your code and data to the
remote resources when you need to scale up the calculations.

Since your local environment differs from the remote one, you have to keep the
differences between them minimal: track all your required dependencies and
refer to file locations relative to your project or script (without using
*absolute paths*, which are not available remotely). Moreover, the code executed
remotely and locally should be the same, so that you don't have to adapt your
remote script every time you change or fix your local script. All changes in
your code behavior between local and remote environments should be supplied
using the *CLI* or parameter files (the latter are preferred, since they make
your analysis *reproducible*).

R local installation
--------------------

``R`` is not installed by default on our system. If you plan to use ``R`` in
your analyses, you have to install it locally in your ``$HOME`` directory: since
this directory is mounted at the same location on every instance of our
infrastructure, you can install ``R`` in your *login* environment and then
execute all your computationally intensive tasks on the *worker* nodes without
worrying about synchronizing your libraries and code between our machines.
Here we describe two different approaches to manage an ``R`` installation, using
:ref:`conda ` and :ref:`singularity ` respectively.

.. _R-conda:

Create an R environment with conda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although you can install R from *source*, the recommended (and easiest!) way to
install ``R`` is by using conda. Conda provides several ``R`` packages and a
dedicated `channel `_ in which ``R`` dependencies are resolved. The
``r-essentials`` conda package provides a lot of packages on top of ``r-base``,
which provides the minimal software to run an ``R`` session. However, if you plan
to manage package installation yourself, if you require the most up-to-date
packages, or if your packages need to be installed from source because they are
not included in the conda repositories, the most effective way is to install as
little as possible from conda and then install your required packages from
source. To create a new environment in ``conda`` with the latest version
(*4.3.1* at the moment) you can run::

    conda create --channel R --name R-4.3 r-base=4.3

.. hint::

    If you require a different R version, you should also search the
    ``conda-forge`` channel; please see our documentation section
    :ref:`on channels `.

Even if we provide a lot of compilation libraries within our server instances,
it is possible that you can't compile and install an ``R`` package because of
missing dependencies. Conda provides several packages for compiling libraries,
and ``R`` is smart enough to report which library is missing. Using this
information you should use ``conda search`` and ``conda install``, followed by
the name of the missing library, to resolve the dependency. Sometimes the
library is already installed on the system, but you need to export an
environment variable to install and use a particular ``R`` package. For example,
for the ``rJava`` package we need to export a *JAVA* environment variable to
install it, as described in the section :ref:`Setting environment variables `
in our documentation.
Briefly, in such a case we need to add an environment variable to the conda
environment using ``conda env config vars set``::

    $ JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    $ conda env config vars set LD_LIBRARY_PATH="$JAVA_HOME/lib/server":$LD_LIBRARY_PATH

After that, you need to restart your conda environment (simply exit and re-enter
your ``R`` conda environment) for the change to take effect: you will then be
able to complete the installation process and to use this compiled library
within your projects.

.. _R-singularity:

Build an R environment through singularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes it is difficult to solve all your dependencies with conda, or it could
be complex to figure out all the environment variables and library paths
required to install your R packages properly. In these cases, it could be better
to manage the R version using :doc:`singularity <../general/singularity>`. In
the following example, we install all the dependencies required to compile and
install packages like ``rgdal``, ``terra`` and ``raster`` through a singularity
build which starts from an ``R`` base image coming from docker, to which we add
the system libraries required to compile these packages. Here is what the
singularity ``.def`` file looks like::

    Bootstrap: docker
    From: rocker/r-base:4.3.1
    Stage: build

    %post
        apt-get update && apt-get install -y \
            libgdal-dev \
            libudunits2-dev \
            libfontconfig1-dev \
            libharfbuzz-dev \
            libfribidi-dev

        NOW=`date`
        echo "export NOW=\"${NOW}\"" >> $SINGULARITY_ENVIRONMENT

    %runscript
        echo "Container was created $NOW"
        echo "Arguments received: $*"
        exec "$@"

Next, you will need an account on `Sylabs `_: as a normal user you can't build
a singularity image locally, so you have to do it *remotely* using a
*singularity build service*. *Sylabs* can provide you time and space to build
images using a free tier.
Once you have created an account, log in through your terminal with::

    $ singularity remote login

After that, you can build your custom images. Supposing that you have created a
definition file like the one above and named it ``rgdal.def``, you can build
your image using the ``--remote`` parameter::

    $ singularity build --remote rgdal.sif rgdal.def

Your built singularity image will be the ``rgdal.sif`` file. After that, you can
start the ``R`` you have just built, with all your defined dependencies, with::

    singularity run rgdal.sif R

This will start an ``R`` terminal using your singularity image. Please see our
section on :doc:`singularity <../general/singularity>` to get more information.
Please see also the `singularity documentation `_ to understand how to create a
*definition* file and which commands and parameters are supported when calling
``singularity``.

Manage dependencies with Renv
-----------------------------

Instead of installing your ``R`` packages *globally*, you can use `Renv `_ to
manage the transition between your local environment and the remote environment,
and to ensure reproducibility across your projects. Briefly, ``renv`` installs
your dependencies within each project, which means that you can work on projects
with different dependencies at the same time. Moreover, this helps you when
resuming a project started a long time ago: you keep working with the same
library versions you used when you started the project, and your code does not
break because a more recent version of a package has been installed *globally*.
Unlike `packrat `_, which builds and installs packages inside your project
folder, ``renv`` builds each package once and links it into the project library
when needed: if you use the same package in different projects, the package
cache is built *once* and reused every time it is needed, saving you time when
re-using the same dependency across your projects.
Your package dependencies will be tracked using the ``renv.lock`` file, which is
created and managed through ``renv`` commands. There will also be a ``renv``
folder, which contains some files required by ``R`` to find and load your
packages correctly. Simply manage your packages as usual, and then call:

.. code-block:: R

    renv::snapshot()

to save the state of your libraries to the ``renv.lock`` file. Once you are
ready to move your code to the remote environment, remember to synchronize your
``renv.lock`` file. After that, you can use:

.. code-block:: R

    renv::restore()

to install your required libraries in your remote environment, without
installing them one by one after testing for their presence.

Sometimes it could be impossible to restore all your dependencies from the
``renv.lock`` file: the ``renv`` developers can't guarantee that this process
will be successful every time. This can happen, for example, with packages which
require a certain system library or R version. If you can't restore a particular
package, you can try to update the ``renv.lock`` file using:

.. code-block:: R

    renv::record("<package>@<version>")

to update the recorded package version (when no version is given, the default
behavior is to use the latest version available from CRAN). This lets you
install a more recent version of the package, which could be compatible with
your system.

.. hint::

    If you have trouble when restoring an environment, you can call
    ``renv::purge()`` with the name of the package which gave you issues, in
    order to clean up the problematic package. Sometimes you need to restart
    your R session to see the changes in your working environment.

Tracking ``renv.lock`` with your code using ``git`` (or backing up your
``renv.lock`` file) is *strongly recommended*. There can also be cases in which
you have to clean up your environment; please refer to the
`renv documentation `_.
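
On the remote side, the restore step usually runs in a batch job where nobody
can answer an interactive prompt. The following is a minimal sketch of a
non-interactive restore, not an official ``renv`` recipe: the helper name
``restore_project`` is a hypothetical name chosen for illustration, and the
sketch assumes that ``renv`` is already installed and that ``renv.lock`` has
been synchronized to the project folder.

.. code-block:: R

    # Hypothetical helper (not part of renv): restore a project without prompts.
    restore_project <- function(project = ".") {
      lockfile <- file.path(project, "renv.lock")

      # Fail early with a clear message if the lockfile was not synchronized
      if (!file.exists(lockfile)) {
        stop("No renv.lock found in ", normalizePath(project, mustWork = FALSE),
             ": synchronize it from your local project first")
      }

      # Assumption: renv itself is installed by the user beforehand
      if (!requireNamespace("renv", quietly = TRUE)) {
        stop("The renv package is not installed: run install.packages(\"renv\") first")
      }

      # prompt = FALSE skips the confirmation that a batch job cannot answer
      renv::restore(project = project, prompt = FALSE)
      invisible(lockfile)
    }

    # Usage, from the project root on the remote environment:
    # restore_project(".")

Checking for the lockfile first makes a forgotten synchronization fail with an
explicit message instead of a less obvious error later in the analysis.
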

Shared environment with renv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, ``renv`` creates a folder named ``renv`` within your project folder,
which contains some files required by ``R`` to find and load your packages
correctly. Packages are usually installed in a user-wide cache located in your
home directory (``$HOME/.cache/R/renv/cache``), then linked to your project when
needed. However, if several users work on the same project, it could be better
to share the same package library among them. To do that, you can define an
environment variable named ``RENV_PATHS_CACHE`` which points to a shared
location that is writable only by a trusted group of collaborators (and
read-only or inaccessible to other users). For example, you can define this
variable in your ``.bashrc`` file like this::

    export RENV_PATHS_CACHE="/path/to/shared/location/renv_cache"

Alternatively, you can define this variable in the ``~/.Renviron`` file, which
is read by ``R`` at the beginning of each session. Make sure that the directory
used for ``RENV_PATHS_CACHE`` is not world-writable, otherwise other users on
the system could tamper with the shared packages and execute arbitrary code in
your R sessions. After that, every time you restore your environment using
``renv::restore()``, the packages will be installed in the shared location and
then linked to your project when needed.

The here package
----------------

You should avoid referring to your scripts or data files using *absolute paths*,
since the paths in your local R installation are different from the paths you
will find in the remote environment. Using a package like "`here `_" can help
you to build your paths relative to your ``R`` project.
The ``here()`` function (which has the same name as the package) returns the
absolute location of your ``R`` project. If you provide the *relative path* of a
file with respect to your project as arguments, you receive an *absolute path*
as return value, which works across different operating systems (like Windows
and Linux, for instance) and different project locations. For example:

.. code-block:: R

    here("directory", "file")

will return the absolute path of ``directory/file``, resolved relative to the
location of your ``.Rproj`` file.

.. important::

    In order to use the ``here`` package you have to define a ``.Rproj`` file at
    the top of your project. Creating the project using *RStudio* is the
    recommended way of doing that.

Calling rmarkdown from terminal
-------------------------------

Since ``RStudio`` is not available on our remote infrastructure, you cannot
render a ``.Rmd`` file by clicking on the *Knit* button as you would in your
*RStudio* IDE. However, you can call ``rmarkdown::render()`` and provide the
location of your ``.Rmd`` script as a parameter. For example, if you define a
``.Rmd`` file like this:

.. code-block:: markdown

    ---
    title: "A sample Rmarkdown file"
    author: "Paolo Cozzi"
    date: "`r Sys.Date()`"
    output: html_document
    params:
      param1: "a simple string param"
      param2: 42
    ---

    ```{r setup, include=FALSE}
    knitr::opts_chunk$set(echo = TRUE)
    ...

Then you can render this file using ``rmarkdown::render()``, or better by
defining a new script which calls your ``.Rmd`` file, for example by loading
your libraries using ``renv`` and finding your paths using the ``here``
package:

.. code-block:: R

    #! /usr/bin/env -S Rscript --slave --vanilla

    # activate the renv environment of this project
    library(renv)
    renv::activate()

    library(rmarkdown)

    # render the report, overriding the defaults declared in the YAML header
    rmarkdown::render(
      "",
      params = list(
        param1 = "override param1 in your .Rmd file",
        param2 = 101
      )
    )

In this example, we define some parameters in our ``.Rmd`` script, and then we
override them using the ``Rscript`` file.
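
The override values do not have to be hard-coded in the render script. As a
minimal sketch using only base ``R``, the values can be read from the command
line with ``commandArgs()``. The helper name ``parse_cli_params`` and the
``--name=value`` convention are assumptions made for this illustration; the
defaults mirror the ``params`` block of the sample ``.Rmd`` file.

.. code-block:: R

    # Hypothetical helper: turn "--name=value" arguments into a params list.
    parse_cli_params <- function(args = commandArgs(trailingOnly = TRUE),
                                 defaults = list(param1 = "a simple string param",
                                                 param2 = 42)) {
      params <- defaults
      for (arg in args) {
        # m holds the full match, the parameter name and the value
        m <- regmatches(arg, regexec("^--([^=]+)=(.*)$", arg))[[1]]
        if (length(m) != 3) {
          stop("Unrecognized argument: ", arg)
        }
        name <- m[2]
        value <- m[3]
        if (!name %in% names(defaults)) {
          stop("Unknown parameter: ", name)
        }
        # Keep numeric parameters numeric, everything else stays a string
        if (is.numeric(defaults[[name]])) {
          value <- as.numeric(value)
        }
        params[[name]] <- value
      }
      params
    }

    # With no arguments the defaults are returned unchanged
    params <- parse_cli_params()
    str(params)

The resulting list can be passed as the ``params`` argument of
``rmarkdown::render()``. Assuming the render script is saved as ``render.R``
(a hypothetical file name), it could be called as
``Rscript render.R --param2=101``.
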
We could also make use of the ``optparse`` package in order to accept parameters
from the CLI, and then provide them when calling the render script. This brings
us some advantages: we can define our ``.Rmd`` file to render some test data by
default in our local environment, by clicking *Knit* directly in our RStudio
session. Then, on our remote infrastructure, we can provide the real data from
the *command line* without modifying our script.

.. hint::

    If you usually cache results with rmarkdown, you may need to clean up the
    cache of your project in your ``Rscript``.

R best practices
----------------

At the end of this document, we sum up some behaviors that should be adopted
when porting your local projects to a remote shared infrastructure without
``RStudio``:

- Managing your ``R`` projects using ``git`` is *strongly recommended*.
- Don't use *absolute* paths: use *relative* paths when possible, or manage your
  file locations using R packages like ``here``.
- When using ``rmarkdown``, customize your analysis using parameters.
- Create a very simple ``R`` script in which you render your ``.Rmd`` file.
- Never create a big script: instead, split your code into steps and save an
  ``RDS`` data file after each step is completed.
- If you need *RStudio* to manage your plots, copy your ``RDS`` data locally and
  then work in your preferred environment.
- Don't start a CPU-intensive calculation without ensuring that all your
  dependencies are installed correctly and without testing your analysis
  workflow on a small set of data, which can return results *immediately* or
  *very quickly*.
- Track your dependencies in a file, preferably by managing them with ``renv``.
- If you use ``renv`` to manage dependencies between your environments, remember
  to save your environment every time you add or remove a dependency, and to
  synchronize your local environment with your remote environment.
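
The advice about splitting the code into steps and saving an ``RDS`` file after
each one can be sketched as follows. This is only an illustration under stated
assumptions: ``run_step`` is a hypothetical helper, the file names are
invented, and the data are a tiny test set; in a real project the output
folder would be a project-relative path rather than a temporary directory.

.. code-block:: R

    # Hypothetical helper: compute a step once, then reuse the saved RDS file.
    run_step <- function(path, compute) {
      if (file.exists(path)) {
        return(readRDS(path))   # step already completed: load its result
      }
      result <- compute()
      dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
      saveRDS(result, path)
      result
    }

    # Assumption: a temporary folder keeps the sketch self-contained
    out_dir <- file.path(tempdir(), "results")

    # Step 1: build a small test data set
    raw <- run_step(file.path(out_dir, "01_raw.rds"), function() {
      data.frame(x = 1:10, y = (1:10)^2)
    })

    # Step 2: fit a model on the output of step 1
    fit <- run_step(file.path(out_dir, "02_fit.rds"), function() {
      lm(y ~ x, data = raw)
    })

If the job is interrupted, running the script again skips the steps whose
``RDS`` file already exists, and the saved files can be copied locally to
prepare plots in *RStudio*.
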