Creating and using virtual environments
Conda environments
Start by downloading and installing conda somewhere with enough room to hold lots of applications (so not your home directory).
# Get the Miniforge installer
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
# Miniforge3 will now be installed into this location:
# [choose your preferred location] /nfs/BaRC/USER/conda
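To confirm the installation worked, you can ask the newly installed conda for its version (a quick sanity check; the path below assumes the install location chosen above):
# Should print something like 'conda 24.x.x'
/nfs/BaRC/USER/conda/bin/conda --version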
Create your desired environment
# Activate conda itself (pointing to where you installed it)
eval "$(/nfs/BaRC/USER/conda/bin/conda shell.bash hook)"
# Create a new environment
# If you don't include '--no-default-packages' you'll also get everything on your PATH
/nfs/BaRC/USER/conda/bin/conda create --name RNAseq_2024a --no-default-packages
Activate the environment
conda activate RNAseq_2024a
Add applications to your environment, specifying versions if you want the install commands to be reproducible. These will be installed under your original conda location. Note that the newest version of some software can cause problems, such as version mismatches (with STAR: "Genome version: 2.7.1a is INCOMPATIBLE with running STAR version: 2.7.11b") or conda incompatibilities.
conda install -c bioconda STAR=2.7.11b
conda install -c bioconda multiqc=1.25.2
conda install -c bioconda fastqc=0.12.1
# Downgrade STAR to match the version used to build the genome index (see the error above)
conda install -c bioconda STAR=2.7.1
conda install -c bioconda subread=2.0.8
Get a list of packages in your environment
conda list -n RNAseq_2024a
Leave the environment
conda deactivate
Go back to environment
conda activate RNAseq_2024a
The name of your current environment should be obvious from the command-line prompt.
(RNAseq_2024a) gbell@sparky ~$
Make sure that you can use the virtual environment from a jupyter notebook.
conda activate RNAseq_2024a
conda install ipykernel
python -m ipykernel install --user --name RNAseq_2024a --display-name "Python (RNAseq_2024a)"
conda deactivate
Afterward, when you create a new notebook or change the kernel of an existing one, you will find "Python (RNAseq_2024a)" as an available kernel.
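To double-check that the kernel was registered, you can list the installed kernelspecs (run this wherever jupyter is available; kernel names may be shown in lowercase):
# The list should include the kernel registered above
jupyter kernelspec list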
You can also use the virtual environment from a jupyter notebook on the fry cluster. This lets you request more computational resources (such as multiple CPUs/GPUs or large amounts of memory).
To do this, first install jupyterlab in your virtual environment:
conda activate RNAseq_2024a
conda install jupyterlab
conda deactivate
Next, create a slurm script like the one below -- here we will save it as 'jupyterOnCluster.sbatch':
#!/bin/bash
# Configuration values for a SLURM batch job.
# One leading hash (#) before the word SBATCH is not a comment, but two are.
#SBATCH --job-name=jobName
#SBATCH --nodes=1            # Ensure that all cores are on one machine
#SBATCH --ntasks=1           # Run a single task
#SBATCH --cpus-per-task=8    # Enter number of cores/threads you wish to request
#SBATCH --time=02:00:00      # Max time the job/jupyter instance needs to run (hh:mm:ss)
#SBATCH --mem=16gb           # Enter amount of memory you wish to request
#SBATCH --partition=20       # Partition (queue) to use
#SBATCH --output=%x-%j.out   # Name of the output file. %x is the job name, %j is the job ID
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.name@wi.mit.edu

# If using a conda environment
# (if not, delete or comment out the next 3 lines):
# Activate conda itself
eval "$(/nfs/BaRC/USER/conda/bin/conda shell.bash hook)"
# Activate your specific conda environment
conda activate RNAseq_2024a

# If using a python virtual environment
# (if not, delete or comment out the next line):
#source /path/to/your/python/virtualenv/directory/bin/activate

# Workaround for jupyter bug
unset XDG_RUNTIME_DIR

jupyter-lab \
  --no-browser \
  --port-retries=0 \
  --ip=0.0.0.0 \
  --port=`shuf -i 8900-10000 -n 1` \
  --notebook-dir=/ \
  --LabApp.default_url="/lab/tree/home/$(whoami)"

# Uncomment (remove the hash at the beginning of the next line) if you want your job output emailed to you
#/usr/bin/mail -s "$SLURM_JOB_NAME $SLURM_JOB_ID" yourwhiteheadusername@wi.mit.edu < "$SLURM_JOB_NAME-$SLURM_JOB_ID.out"
Then submit the slurm job:
sbatch jupyterOnCluster.sbatch
After the slurm job starts running on the cluster, an output file named 'jobName-<jobid>.out' will be created (following the '%x-%j.out' pattern in the script).
Open the output file and you will find a URL pointing to the jupyter notebook. Copy and paste the URL into your web browser to get access to the jupyter notebook with your virtual environment.
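If the output file is long, one quick way to pull out the connection URL is to grep for it (the filename below assumes the default job name from the script above; adjust it to match yours):
# Show the URL(s) that jupyter-lab printed to the job's output file
grep http jobName-*.out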
Make sure to set a reasonable time limit for your slurm job, or cancel it after you finish working with it, to avoid occupying computational resources unnecessarily.
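To find and cancel a running job, use the standard slurm commands:
# List your running jobs to find the job ID
squeue -u $USER
# Cancel the job when you are done (replace JOBID with the ID from squeue)
scancel JOBID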
See also IT's instructions on this topic: https://docs.google.com/document/d/1eYGVn5M402n2b9pueWdHoeLx84Ue-IGQG8t6e2QK7To
Save the environment
conda env export > RNAseq_2024a.environment.yml
Someone else should be able to create a new environment from this YAML file
conda env create -f RNAseq_2024a.environment.yml
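To confirm that the environment was created, you can list all known environments (a standard conda command):
# The new environment should appear in this list
conda env list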
Remove a problematic piece of software from the environment
conda remove STAR
If you no longer want the environment
conda remove -n ENV_NAME --all
If we want to use slurm from within the environment, we need to add the path to the slurm commands. (Is there a better way to do this?)
export PATH=$PATH:/opt/slurm/bin
To test the environment, check that the RNA-seq Hot Topics exercises work.
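As a quicker smoke test, each tool installed above should at least report its version (assuming the packages listed earlier on this page):
# Each command should print a version without errors
STAR --version
fastqc --version
multiqc --version
featureCounts -v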
See also the Whitehead IT conda page: https://clusterguide.wi.mit.edu/software/conda/
Singularity environments
Singularity allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. Some software is now distributed this way, so that "installation" is simply downloading a SIF file.
One example is AGAT (Another Gtf/Gff Analysis Toolkit), which provides instructions on how to download and run the AGAT container that includes a series of applications:
# Download
singularity pull docker://quay.io/biocontainers/agat:1.0.0--pl5321hdfd78af_0
# Run
singularity run agat_1.0.0--pl5321hdfd78af_0.sif
# When finished
exit
Then one can run commands such as 'agat_convert_sp_gff2gtf.pl'. The trouble is that the container doesn't include our usual filesystem, making it not very useful. The 'singularity' command needs to be modified to also bind the required folder(s), as in the following one-line command:
singularity run -B /lab/BaRC_projects:/lab/BaRC_projects --cleanenv --pwd /lab/BaRC_projects /nfs/BaRC_Public/apps/AGAT/agat_1.0.0--pl5321hdfd78af_0.sif
# Go where we want
cd /lab/BaRC_projects
# Check that the environment includes our desired files/folders
ls
One problem is that this is an older version of AGAT (v1.0.0). Another problem is that some of the commands require samtools, which is not present in the container. What can we do about this?
One way to build a customized singularity container is to use Seqera Containers. We can search for and add AGAT (not the first hit) and samtools, then specify that we want a singularity container and click "Get Container". When it's ready, run 'singularity pull' on the oras link, like
singularity pull oras://community.wave.seqera.io/library/agat_samtools:d30ed34317069fe6
We end up downloading a file like 'agat_samtools_d30ed34317069fe6.sif'. Then we can use the same 'singularity run' command as above and run both AGAT and samtools.
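For example, once inside the container, both toolsets should be available (the GFF/GTF filenames below are hypothetical placeholders):
# Convert a GFF file to GTF with AGAT (input/output names are placeholders)
agat_convert_sp_gff2gtf.pl --gff genes.gff -o genes.gtf
# samtools is now available too
samtools --version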
By the way, including '--cleanenv' in the 'singularity run' command prevents the container from reading the environment from your .bashrc file. If you want to include those aliases, etc., then remove '--cleanenv'.
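Relatedly, if you only need to run a single command rather than an interactive shell, 'singularity exec' runs it directly. A sketch using the same bind and environment options as above:
# Run one command inside the container and exit
singularity exec -B /lab/BaRC_projects:/lab/BaRC_projects --cleanenv agat_samtools_d30ed34317069fe6.sif samtools --version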
See also the Whitehead IT singularity page: https://clusterguide.wi.mit.edu/software/singularity/