== Predicting protein structure from sequence using AlphaFold ==

=== Background ===

The success of [https://www.nature.com/articles/s41586-021-03819-2 DeepMind's AlphaFold protein folding algorithm] in the [https://onlinelibrary.wiley.com/toc/10970134/2021/89/12 CASP14 structural prediction assessment] has been widely [https://www.nature.com/articles/d41586-020-03348-4 celebrated] and has profoundly invigorated the structural biology community. Today, if you have a protein sequence for which you'd like a high-quality predicted structure, an excellent place to start is the [https://alphafold.ebi.ac.uk/ AlphaFold Protein Structure Database]. An alternative database to search is the [https://esmatlas.com/resources?action=fold ESM Metagenomic Atlas], where you may find predicted structures for orphan proteins with few sequence homologs.

=== Running AlphaFold using ChimeraX ===

If you cannot find a predicted structure for your protein in the databases listed above, perhaps because your sequence carries amino acid substitutions relative to the reference sequence, [https://www.cgl.ucsf.edu/chimerax/ ChimeraX] is an [https://www.youtube.com/watch?v=gIbCAcMDM7E easy place to start due to its graphical user interface] and convenient visualization tools. You will need to install ChimeraX on a desktop or laptop computer, but the AlphaFold predictions themselves are made using computing resources in the cloud via the [https://www.nature.com/articles/s41592-022-01488-1 ColabFold] implementation of AlphaFold, which uses [https://www.nature.com/articles/nbt.3988 MMseqs2] to efficiently compute an initial multiple sequence alignment.

=== Running AlphaFold using computing resources at Whitehead ===

The freely available computational resources accessed via ChimeraX may not be sufficient to complete your AlphaFold predictions.
In that case, you can make the predictions locally using a command like the following:

{{{
sbatch --export=ALL,FASTA_NAME=example.fa,USERNAME='user',FASTA_PATH=/path/to/fasta/file,AF2_WORK_DIR=/path/to/working/directory ./RunAlphaFold_2.3.2_slurm.sh
}}}

In the command above, substitute your own user ID, your FASTA file name, and the paths to the FASTA file and the working directory. In this example, the job script (named RunAlphaFold_2.3.2_slurm.sh above) that is submitted to the SLURM scheduler might look like:

{{{
#!/bin/bash

#SBATCH --job-name=AF2             # friendly name for job.
#SBATCH --nodes=1                  # ensure cores are on one node
#SBATCH --ntasks=1                 # run a single task
#SBATCH --cpus-per-task=8          # number of cores/threads requested.
#SBATCH --mem=64gb                 # memory requested.
#SBATCH --partition=nvidia-t4-20   # partition (queue) to use
#SBATCH --output output-%j.out     # %j inserts the job ID into the STDOUT file name
#SBATCH --gres=gpu:1               # Required for GPU access

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4

export OUTPUT_NAME='model_1'
export ALPHAFOLD_DATA_PATH='/alphafold/data.2023b'   # Specify ALPHAFOLD_DATA_PATH

cd "$AF2_WORK_DIR"

# Run AlphaFold inside the Singularity container; the trailing backslashes
# continue a single command across lines.
singularity run \
    -B "$AF2_WORK_DIR":/af2 \
    -B "$ALPHAFOLD_DATA_PATH":/data \
    -B .:/etc \
    --pwd /app/alphafold \
    --nv /alphafold/alphafold_2.3.2.sif \
    --data_dir=/data/ \
    --output_dir=/af2/$FASTA_PATH \
    --fasta_paths=/af2/$FASTA_PATH/$FASTA_NAME \
    --max_template_date=2050-01-01 \
    --db_preset=full_dbs \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref30_database_path=/data/uniref30/UniRef30_2023_02 \
    --uniref90_database_path=/data/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=monomer \
    --pdb70_database_path=/data/pdb70/pdb70

# Email the STDOUT output file to specified address.
/usr/bin/mail -s "$SLURM_JOB_NAME $SLURM_JOB_ID" "${USERNAME}@wi.mit.edu" < "$AF2_WORK_DIR/output-${SLURM_JOB_ID}.out"
}}}
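
Because the submission command packs four values into a single comma-separated `--export` argument, it can be convenient to assemble that argument with a small helper and to sanity-check the FASTA file before queueing a long GPU job. The following is a minimal sketch only, not part of the Whitehead setup: the function names `build_export_arg` and `looks_like_fasta` are hypothetical, and the sketch assumes that none of the values contain commas, since commas separate the variables in `--export`.

```shell
#!/bin/bash

# Hypothetical helper: assemble the value passed to sbatch --export.
# Arguments: FASTA file name, user ID, FASTA path, working directory.
# Assumes none of the values contain a comma.
build_export_arg() {
    local fasta_name=$1 username=$2 fasta_path=$3 work_dir=$4
    printf 'ALL,FASTA_NAME=%s,USERNAME=%s,FASTA_PATH=%s,AF2_WORK_DIR=%s\n' \
        "$fasta_name" "$username" "$fasta_path" "$work_dir"
}

# Hypothetical check: succeed only if the file is non-empty and its
# first character is '>', i.e. it begins with a FASTA header line.
looks_like_fasta() {
    [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}
```

With these defined, a submission could be guarded as `looks_like_fasta /path/to/example.fa && sbatch --export="$(build_export_arg example.fa user /path/to/fasta/file /path/to/working/directory)" ./RunAlphaFold_2.3.2_slurm.sh`, so that an empty or malformed input is caught before the job enters the queue.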