== Predicting protein structure from sequence using AlphaFold ==

=== Background ===
The success of [https://www.nature.com/articles/s41586-021-03819-2 DeepMind's AlphaFold protein folding algorithm] in the [https://onlinelibrary.wiley.com/toc/10970134/2021/89/12 CASP14 structural prediction assessment] has been widely [https://www.nature.com/articles/d41586-020-03348-4 celebrated] and has invigorated the structural biology community. Today, if you want a high-quality predicted structure for a protein sequence, an excellent place to start is the [https://alphafold.ebi.ac.uk/ AlphaFold Protein Structure Database]. An alternative database to search is the [https://esmatlas.com/resources?action=fold ESM Metagenomic Atlas], where you may find predicted structures for orphan proteins with few sequence homologs.
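
If you already know the UniProt accession of your protein, the database lookup can also be scripted. The following is a minimal sketch, assuming the AlphaFold Protein Structure Database exposes a per-accession API endpoint of the form shown; the helper name `afdb_api_url` is hypothetical, and the endpoint should be checked against the database's current API documentation before relying on it.

```shell
# Hypothetical helper (not part of any AlphaFold tool): build the
# AlphaFold DB API URL for a UniProt accession.
# Assumption: the endpoint layout is /api/prediction/<accession>.
afdb_api_url() {
    local accession=$1
    echo "https://alphafold.ebi.ac.uk/api/prediction/${accession}"
}

# Example query (requires network access); P69905 is human hemoglobin
# subunit alpha. The response is expected to be JSON metadata describing
# the prediction, including links to the coordinate files.
# curl -s "$(afdb_api_url P69905)"
```

The `curl` call is left commented out so the snippet can be sourced without network access.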

=== Running AlphaFold using ChimeraX ===

If you cannot find a predicted structure for your protein in the databases listed above, perhaps because your sequence carries amino acid substitutions relative to the reference sequence, [https://www.cgl.ucsf.edu/chimerax/ ChimeraX] is an [https://www.youtube.com/watch?v=gIbCAcMDM7E easy place to start due to its graphical user interface] and convenient visualization tools. You will need to install ChimeraX on a desktop or laptop computer, but the AlphaFold predictions themselves run on cloud computing resources via the [https://www.nature.com/articles/s41592-022-01488-1 ColabFold] implementation of AlphaFold, which uses [https://www.nature.com/articles/nbt.3988 MMseqs2] to efficiently compute an initial multiple sequence alignment.

=== Running AlphaFold using computing resources at Whitehead ===

The freely available computing resources accessed via ChimeraX may be too limited to complete your AlphaFold predictions. In that case, you can run the predictions on Whitehead computing resources by submitting a job to the SLURM scheduler with a command like the following:
 
{{{
sbatch --export=ALL,FASTA_NAME=example.fa,USERNAME='user',FASTA_PATH=/path/to/fasta/file,AF2_WORK_DIR=/path/to/working/directory ./RunAlphaFold_2.3.2_slurm.sh
}}}

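In the command above, substitute your own user id, the name of your FASTA file, and the paths to the FASTA file and the working directory. Note that the job script reads the FASTA file from $AF2_WORK_DIR/$FASTA_PATH/$FASTA_NAME and writes its results to $AF2_WORK_DIR/$FASTA_PATH, so FASTA_PATH is interpreted relative to the working directory. The job script submitted to the SLURM scheduler (named RunAlphaFold_2.3.2_slurm.sh above) might look like: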

{{{
#!/bin/bash

#SBATCH --job-name=AF2  		# friendly name for job.
#SBATCH --nodes=1 			# ensure cores are on one node
#SBATCH --ntasks=1 			# run a single task
#SBATCH --cpus-per-task=8 		# number of cores/threads requested.
#SBATCH --mem=64gb 			# memory requested.
#SBATCH --partition=nvidia-t4-20	# partition (queue) to use
#SBATCH --output output-%j.out  	# %j inserts jobid to STDOUT
#SBATCH --gres=gpu:1  			# Required for GPU access

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4

export OUTPUT_NAME='model_1'
export ALPHAFOLD_DATA_PATH='/alphafold/data.2023b' # Specify ALPHAFOLD_DATA_PATH

cd "$AF2_WORK_DIR"
singularity run \
  -B "$AF2_WORK_DIR":/af2 \
  -B "$ALPHAFOLD_DATA_PATH":/data \
  -B .:/etc \
  --pwd /app/alphafold \
  --nv /alphafold/alphafold_2.3.2.sif \
  --data_dir=/data/ \
  --output_dir="/af2/$FASTA_PATH" \
  --fasta_paths="/af2/$FASTA_PATH/$FASTA_NAME" \
  --max_template_date=2050-01-01 \
  --db_preset=full_dbs \
  --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --uniref30_database_path=/data/uniref30/UniRef30_2023_02 \
  --uniref90_database_path=/data/uniref90/uniref90.fasta \
  --mgnify_database_path=/data/mgnify/mgy_clusters_2022_05.fa \
  --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
  --use_gpu_relax=True \
  --model_preset=monomer \
  --pdb70_database_path=/data/pdb70/pdb70

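# Email the STDOUT output file to the specified address.
# Note: the --output path above is relative to the directory sbatch was run from,
# so this line assumes the job was submitted from $AF2_WORK_DIR.
/usr/bin/mail -s "$SLURM_JOB_NAME $SLURM_JOB_ID" "$USERNAME@wi.mit.edu" < "$AF2_WORK_DIR/output-${SLURM_JOB_ID}.out"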
}}}
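
Because the job script resolves the FASTA file as $AF2_WORK_DIR/$FASTA_PATH/$FASTA_NAME, a mistyped path only shows up after the job has waited in the queue. The sketch below is one way to catch that before submitting; the function name `check_af2_inputs` and its argument order are hypothetical and are not part of the job script above.

```shell
# Hypothetical pre-submission check; not part of the original job script.
# Usage: check_af2_inputs WORK_DIR FASTA_SUBDIR FASTA_NAME
check_af2_inputs() {
    local work_dir=$1 fasta_subdir=$2 fasta_name=$3
    # The job script reads /af2/$FASTA_PATH/$FASTA_NAME with /af2 bound to
    # $AF2_WORK_DIR, so mirror that layout on the host side.
    local fasta_file="$work_dir/$fasta_subdir/$fasta_name"

    if [ ! -d "$work_dir" ]; then
        echo "working directory not found: $work_dir" >&2
        return 1
    fi
    if [ ! -r "$fasta_file" ]; then
        echo "FASTA file not readable: $fasta_file" >&2
        return 1
    fi
    # A FASTA record starts with a '>' header line.
    if ! head -n 1 "$fasta_file" | grep -q '^>'; then
        echo "first line is not a FASTA header: $fasta_file" >&2
        return 1
    fi
    # Print the resolved path so the caller can confirm it.
    echo "$fasta_file"
}
```

After sourcing the function, the check can gate the submission, for example by running `check_af2_inputs "$AF2_WORK_DIR" "$FASTA_PATH" "$FASTA_NAME"` and calling sbatch only if it succeeds.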