
Creating and using virtual environments

Conda environments

Start by downloading and installing conda somewhere that has enough room to hold lots of applications (so not your home directory).

# Get the Miniforge installer
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# When the installer says "Miniforge3 will now be installed into this location:",
# enter your preferred location, for example:
# /nfs/BaRC/USER/conda

Create your desired environment

# Initialize conda in the current shell (pointing to where you installed conda)
eval "$(/nfs/BaRC/USER/conda/bin/conda shell.bash hook)"
# Create a new environment
# '--no-default-packages' skips any packages listed under 'create_default_packages' in your .condarc
/nfs/BaRC/USER/conda/bin/conda create --name RNAseq_2024a --no-default-packages

Activate the environment

conda activate RNAseq_2024a

Add applications to your environment, specifying versions if you want the install commands to be reproducible. These will be installed under your original conda location. The newest version of some software can cause problems, such as conda incompatibilities or, with STAR, a mismatch with an existing genome index ("Genome version: 2.7.1a is INCOMPATIBLE with running STAR version: 2.7.11b").

conda install -c bioconda STAR=2.7.11b
conda install -c bioconda multiqc=1.25.2
conda install -c bioconda fastqc=0.12.1
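# If the newest STAR is incompatible with an existing genome index,
# install the older version instead (this replaces 2.7.11b in the environment)
conda install -c bioconda STAR=2.7.1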
conda install -c bioconda subread=2.0.8

Get a list of packages in our environment

conda list -n RNAseq_2024a

Leave the environment

conda deactivate

Go back to the environment

conda activate RNAseq_2024a

The name of your current environment is shown at the start of the command prompt.

(RNAseq_2024a) gbell@sparky ~$
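
Inside a script there is no prompt to look at, so it can help to check the active environment explicitly. The sketch below relies on the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated; the function name 'require_env' is made up for this example.

```shell
# Return success only if the named conda environment is currently active.
# require_env is a hypothetical helper, not part of conda.
require_env() {
    local expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "Expected conda environment '$expected' but found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}
```

For example, 'require_env RNAseq_2024a || exit 1' near the top of an analysis script would stop the script early if the wrong environment is active.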

Make sure that you can use the virtual environment from a jupyter notebook.

conda activate RNAseq_2024a
conda install ipykernel
python -m ipykernel install --user --name RNAseq_2024a --display-name "Python (RNAseq_2024a)"
conda deactivate

Afterward, when you create a new notebook or change the kernel of an existing one, you will find "Python (RNAseq_2024a)" as an available kernel.

You can also use the virtual environment from a jupyter notebook running on the fry cluster. This lets you request more computational resources (such as multiple CPUs/GPUs or more memory).

To do this, first install jupyterlab in your virtual environment:

conda activate RNAseq_2024a
conda install jupyterlab
conda deactivate

Next, create a slurm script as below -- here we will save it as 'jupyterOnCluster.sbatch'

#!/bin/bash
# Configuration values for a SLURM batch job.
# A line starting with '#SBATCH' (one hash) is a directive, not a comment; '##SBATCH' (two hashes) is a comment.
#SBATCH --job-name=jobName
#SBATCH --nodes=1                 # Ensure that all cores are on one machine
#SBATCH --ntasks=1                # Run a single task
#SBATCH --cpus-per-task=8         # Enter number of cores/threads you wish to request
#SBATCH --time 02:00:00           # max time that the job/jupyter instance actually needs to be running (in hh:mm:ss format) 
#SBATCH --mem=16gb                # Enter amount of memory you wish to request
#SBATCH --partition=20            # partition (queue) to use
#SBATCH --output %x-%j.out        # name of output file.  %x is the job-name %j is jobid
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.name@wi.mit.edu

# If using a conda environment
# (if not, delete or comment out the 'eval' and 'conda activate' lines below):
# Activate conda itself
eval "$(/nfs/BaRC/USER/conda/bin/conda shell.bash hook)"
# Activate your specific conda environment
conda activate RNAseq_2024a
# If using a python virtual environment 
# (if not, delete or comment out next line):
#source /path/to/your/python/virtualenv/directory/bin/activate


# Workaround for jupyter bug
unset XDG_RUNTIME_DIR


jupyter-lab \
--no-browser \
--port-retries=0 \
--ip=0.0.0.0 \
--port="$(shuf -i 8900-10000 -n 1)" \
--notebook-dir=/ \
--LabApp.default_url="/lab/tree/home/$(whoami)"

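# Uncomment the next line (remove the leading hash) if you want your job output emailed to you.
# The shell does not expand %x and %j, so the file name is built from the SLURM variables instead.
#/usr/bin/mail -s "$SLURM_JOB_NAME $SLURM_JOB_ID" yourwhiteheadusername@wi.mit.edu < "${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out"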

Then submit the slurm job:

sbatch jupyterOnCluster.sbatch

After the slurm job starts running on the cluster, an output file named 'jobName-JOBID.out' will be created, where JOBID is the numeric slurm job ID.

Open the output file and you will find a URL pointing to the jupyter notebook. Copy and paste the URL into your web browser to get access to the jupyter notebook with your virtual environment.
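
Instead of reading through the whole output file, the URL can be pulled out with grep. This is a minimal sketch, assuming the jupyter log contains a URL of the form http://HOST:PORT/lab?token=...; the function name 'find_jupyter_url' is made up for this example.

```shell
# Print the first URL in a job output file that carries a jupyter login token.
# find_jupyter_url is a hypothetical helper; it assumes the log has a line
# containing a URL such as http://HOST:PORT/lab?token=...
find_jupyter_url() {
    local logfile="$1"
    grep -Eo 'https?://[^[:space:]]+token=[^[:space:]]+' "$logfile" | head -n 1
}
```

It would be called with the name of your own output file, such as 'find_jupyter_url jobName-JOBID.out'.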

Make sure to set a reasonable time limit to your slurm job or cancel it after you finish working on it, to avoid occupying computational resources unnecessarily.
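
One way to cancel the job when you are done is by job name. The sketch below assumes the slurm commands are on your PATH and that the job was submitted with '--job-name=jobName' as in the script above; the function name 'stop_jupyter_job' is made up for this example.

```shell
# Show your jobs with the given name, then cancel them.
# stop_jupyter_job is a hypothetical helper; 'jobName' is the default because
# it matches the --job-name used in the sbatch script above.
stop_jupyter_job() {
    local name="${1:-jobName}"
    local user
    user="$(whoami)"
    if ! command -v scancel >/dev/null 2>&1; then
        echo "scancel not found; are the slurm commands on your PATH?" >&2
        return 1
    fi
    squeue --user="$user" --name="$name"
    scancel --user="$user" --name="$name"
}
```

Running 'stop_jupyter_job' with no argument cancels your jobs named 'jobName'; pass a different name if you changed '--job-name'.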

Reference to IT's instructions on this topic: https://docs.google.com/document/d/1eYGVn5M402n2b9pueWdHoeLx84Ue-IGQG8t6e2QK7To

Save the environment

conda env export -n RNAseq_2024a > RNAseq_2024a.environment.yml

Someone else should be able to create a new environment from this YAML file

conda env create -f RNAseq_2024a.environment.yml
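
The exported file lists every package in the environment, including dependencies. A shorter alternative is a hand-written file that names only the tools you asked for. The sketch below is one way that could look; the file name 'RNAseq_2024a.minimal.yml' and the channel list are assumptions, and the versions mirror the install commands earlier on this page (with STAR taken from the last STAR command).

```shell
# Write a minimal environment file by hand (the file name is arbitrary).
# Channels are an assumption: conda-forge plus bioconda.
cat > RNAseq_2024a.minimal.yml <<'EOF'
name: RNAseq_2024a
channels:
  - conda-forge
  - bioconda
dependencies:
  - star=2.7.1
  - multiqc=1.25.2
  - fastqc=0.12.1
  - subread=2.0.8
EOF
```

The file can then be passed to 'conda env create -f' in the same way as the exported one.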

Remove a problematic piece of software from the environment

conda remove STAR

If you no longer want the environment

conda remove -n ENV_NAME --all

If we want to use slurm, we need to add the path to the slurm commands. Is there a better way to do this?

export PATH=$PATH:/opt/slurm/bin
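
One possible refinement, not necessarily the better way asked about above, is to append the folder only when it is missing, so that running the line repeatedly (for example from a startup file) does not keep lengthening PATH. The function name 'path_append' is made up for this example.

```shell
# Append a directory to PATH only if it is not already listed.
# path_append is a hypothetical helper name.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$1" ;;     # not present, add it at the end
    esac
}
path_append /opt/slurm/bin
export PATH
```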

To test the environment -- the RNA-seq Hot Topics exercises should work.

See also the Whitehead IT conda page: https://clusterguide.wi.mit.edu/software/conda/

Singularity environments

Singularity allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. Some software is now distributed this way, so that "installation" is simply downloading a SIF file.

One example is AGAT (Another Gtf/Gff Analysis Toolkit), which provides instructions on how to download and run the AGAT container that includes a series of applications:

# Download
singularity pull docker://quay.io/biocontainers/agat:1.0.0--pl5321hdfd78af_0
# Run
singularity run agat_1.0.0--pl5321hdfd78af_0.sif
# When finished
exit

Then one can run commands such as 'agat_convert_sp_gff2gtf.pl'. The trouble is that the container doesn't include our usual filesystem, which makes it not very useful. The 'singularity' command needs to be modified to also bind the required folder(s), as in the following one-line command:

singularity run -B /lab/BaRC_projects:/lab/BaRC_projects --cleanenv --pwd /lab/BaRC_projects /nfs/BaRC_Public/apps/AGAT/agat_1.0.0--pl5321hdfd78af_0.sif
# Go where we want
cd /lab/BaRC_projects
# Check that the environment includes our desired files/folders
ls
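
To run a single command without opening an interactive session, 'singularity exec' accepts the same bind options. The wrapper below is a minimal sketch using the image path and folder from the command above; the function name 'agat_exec' is made up for this example.

```shell
# Run one command inside the AGAT container with the project folder bound.
# agat_exec is a hypothetical wrapper name; the image path and bind folder
# are the ones used in the 'singularity run' command above.
agat_exec() {
    singularity exec \
        -B /lab/BaRC_projects:/lab/BaRC_projects \
        --cleanenv \
        --pwd /lab/BaRC_projects \
        /nfs/BaRC_Public/apps/AGAT/agat_1.0.0--pl5321hdfd78af_0.sif \
        "$@"
}
```

For example, 'agat_exec ls' lists the bound folder from inside the container, and 'agat_exec agat_convert_sp_gff2gtf.pl' followed by that script's arguments runs the converter.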

One problem is that this is an older version of AGAT (v1.0.0). Another problem is that some of the commands require samtools, which is not present in the container. What can we do about this?

One way to build a customized singularity container is to use the Seqera Container Builder. We can search for and add AGAT (not the first hit) and samtools, specify that we want a singularity container, and then click "Get Container". When it's ready, run 'singularity pull' on the oras link, like

singularity pull oras://community.wave.seqera.io/library/agat_samtools:d30ed34317069fe6

We end up downloading a file like 'agat_samtools_d30ed34317069fe6.sif'. Then we can do the 'singularity run' command like above and get to run both AGAT and samtools.

By the way, '--cleanenv' in the 'singularity run' command prevents the container from inheriting environment variables from your host shell (including those set in your .bashrc file). If you want the container to see those settings, remove '--cleanenv'.

See also the Whitehead IT singularity page: https://clusterguide.wi.mit.edu/software/singularity/
