== Using AlphaFold multimer to predict the structure of protein complexes ==

=== Background ===

As soon as the effectiveness of AlphaFold2 for protein structure prediction became evident, researchers began adapting it to predict the structures of protein ''complexes''. This effort led to [https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2 AlphaFold-Multimer]. While the best place to start looking for a predicted structure of a single protein sequence is an online database, you will likely have to compute predicted structures for multimeric protein complexes yourself.

=== Running AlphaFold3 ===

AlphaFold3 can predict structures of multiple types of molecules, including protein, DNA, RNA, ligands, and ions. The code for AlphaFold3 is not publicly available, but it can be run on the Google AlphaFold server: [https://golgi.sandbox.google.com/ https://golgi.sandbox.google.com/]

As of May 2024, one can run as many as 20 jobs per day. You can download the results (for the 5 predictions from each job), including structure (cif) files that can be opened in PyMOL or other structural viewers.

=== Running AlphaFold-Multimer using ChimeraX ===

As with structure prediction for monomeric proteins, [https://www.cgl.ucsf.edu/chimerax/ ChimeraX] is a good [https://www.youtube.com/watch?v=6lXeCPuTePs starting point due to its intuitive graphical user interface] and convenient visualization tools. You will need to install ChimeraX on a desktop or laptop computer, but the AlphaFold predictions will be made using computing resources in the cloud via the [https://www.nature.com/articles/s41592-022-01488-1 ColabFold] implementation of AlphaFold, which uses [https://www.nature.com/articles/nbt.3988 MMseqs2] to efficiently compute an initial multiple sequence alignment (MSA).

=== Running AlphaFold using computing resources at Whitehead ===

It may happen that the freely available computational resources accessed via ChimeraX are a constraint on completing your AlphaFold-Multimer predictions. In that case, there are multiple ways to make the predictions locally. One way is to use the ColabFold implementation of AlphaFold-Multimer, which uses MMseqs2 for the initial MSA step.

{{{
sbatch RunColabFold_multimer_1.5.5.slurm
}}}

In the command above, the job script (i.e. RunColabFold_multimer_1.5.5.slurm) that is submitted to the SLURM scheduler might look like:

{{{
#!/bin/bash
#SBATCH --job-name=AFbatch
#SBATCH --nodes=1                 # ensure cores are on one node
#SBATCH --ntasks=1                # run a single task
#SBATCH --cpus-per-task=8         # number of cores/threads requested
#SBATCH --mem=64gb                # memory requested
#SBATCH --partition=nvidia-t4-20  # partition (queue) to use
#SBATCH --output AFbatch.out      # write output to file
#SBATCH --gres=gpu:1              # required for GPU access

export PATH="/nfs/apps/test/colab155test/localcolabfold/colabfold-conda/bin:$PATH"

workpath=/lab/MY_LAB/my_project
cd ${workpath}

colabfold_batch --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer fasta/proteins.fa output
}}}

In the commands above, you will need to substitute the path to your working directory along with the paths to your fasta file and output directory. In the example above, the fasta file (i.e. proteins.fa) is within a subdirectory of the working directory called "fasta". Likewise, the output will be written to a subdirectory of the working directory called "output".
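If it helps to see the setup end to end, here is a minimal sketch of preparing the directory layout that the script above assumes, submitting the job, and then locating the top-ranked predictions. The "rank_001" file pattern reflects ColabFold's usual output naming and should be checked against your own run's output directory:

{{{
# Minimal sketch: create the layout assumed by the script above, stage the
# fasta file, and submit the job. Directory names match the example above.
workpath=/lab/MY_LAB/my_project
mkdir -p ${workpath}/fasta ${workpath}/output
cp proteins.fa ${workpath}/fasta/

sbatch RunColabFold_multimer_1.5.5.slurm

# After the job finishes, list the top-ranked model for each prediction.
# The "rank_001" pattern follows ColabFold's usual naming; confirm against
# the files actually written to your output directory.
ls ${workpath}/output/*rank_001*.pdb
}}}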
When using ColabFold, be sure to separate the amino acid sequences of the individual proteins with a colon, as in this example:

{{{
>proteins
RMKQLEDKVEELLSKNYHLENEVARLKKLVGER:
RMKQLEDKVEELLSKNYHLENEVARLKKLVGER
}}}

The following instructions allow you to run AlphaFold-Multimer locally without using ColabFold:

{{{
sbatch --export=ALL,FASTA_NAME=example.fa,USERNAME='user',FASTA_PATH=/path/to/fasta/file,AF2_WORK_DIR=/path/to/working/directory ./RunAlphaFold_multimer_2.3.2_slurm.sh
}}}

In the command above, substitute your own user id and fasta file, along with the paths to both the fasta file and the working directory. In this example, the job script (i.e. RunAlphaFold_multimer_2.3.2_slurm.sh) that is submitted to the SLURM scheduler might look like:

{{{
#!/bin/bash
#SBATCH --job-name=AF2M           # friendly name for job
#SBATCH --nodes=1                 # ensure cores are on one node
#SBATCH --ntasks=1                # run a single task
#SBATCH --cpus-per-task=8         # number of cores/threads requested
#SBATCH --mem=64gb                # memory requested
#SBATCH --partition=nvidia-t4-20  # partition (queue) to use
#SBATCH --output output-%j.out    # %j inserts jobid into STDOUT filename
#SBATCH --gres=gpu:1              # required for GPU access

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4
export OUTPUT_NAME='model_1'
export ALPHAFOLD_DATA_PATH='/alphafold/data.2023b'  # Specify ALPHAFOLD_DATA_PATH

cd $AF2_WORK_DIR

singularity run -B $AF2_WORK_DIR:/af2 -B $ALPHAFOLD_DATA_PATH:/data -B .:/etc \
    --pwd /app/alphafold --nv /alphafold/alphafold_2.3.2.sif \
    --data_dir=/data/ \
    --output_dir=/af2/$FASTA_PATH \
    --fasta_paths=/af2/$FASTA_PATH/$FASTA_NAME \
    --max_template_date=2050-01-01 \
    --db_preset=full_dbs \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref30_database_path=/data/uniref30/UniRef30_2023_02 \
    --uniref90_database_path=/data/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=multimer \
    --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
    --uniprot_database_path=/data/uniprot/uniprot.fasta \
    --num_multimer_predictions_per_model=1

# Email the STDOUT output file to the specified address.
/usr/bin/mail -s "$SLURM_JOB_NAME $SLURM_JOB_ID" $USERNAME@wi.mit.edu < $AF2_WORK_DIR/output-${SLURM_JOB_ID}.out
}}}

Unlike with ColabFold, when running AlphaFold as above, the input fasta file "example.fa" should be a list of fasta entries, one per amino acid sequence within the multimeric complex. For example:

{{{
>proteinA
RMKQLEDKVEELLSKNYHLENEVARLKKLVGER
>proteinB
RMKQLEDKVEELLSKNYHLENEVARLKKLVGER
}}}
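If you already have a multi-entry fasta file in the AlphaFold-Multimer format above and want to reuse it with ColabFold, a small shell snippet can join the entries with colons. This is a hedged sketch, not part of either pipeline; the input and output file names (example.fa, proteins.fa) simply follow the examples above:

{{{
# Minimal sketch: convert a multi-entry fasta (one entry per chain, as in
# example.fa above) into the single colon-separated entry that ColabFold
# expects. Handles sequences that span multiple lines.
awk '/^>/ {if (seq) seqs = seqs (seqs ? ":" : "") seq; seq = ""; next}
     {seq = seq $0}
     END {seqs = seqs (seqs ? ":" : "") seq; print ">proteins"; print seqs}' \
    example.fa > proteins.fa
}}}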