AlphaFold 2
About
AlphaFold 2 allows users to predict the 3-D structure of arbitrary proteins. It was published in Nature (Jumper et al. 2021).
AlphaFold is available on the campus cluster as a Singularity container, together with some scripts that help run the container against your particular files. It works best when GPUs are used for the computation.
Preparing to run
Here we will show how to load the modules for it, create a submission script, and submit the job.
To get started, obtain the FASTA file you want to run a prediction for. If you just want to try it out, you can grab a FASTA file from a public source.
Next load the environment modules to put the software in your path:
module load singularity/3.8.0 alphafold/2.2.0
This example assumes you have the FASTA file in a directory called fasta_files in your home directory. It also assumes you are writing the output files to the scratch directory. You can make the directories like this:
mkdir -p ~/fasta_files
mkdir -p /central/scratch/$USER/alphafold/out
There is an example FASTA file at /central/software/alphafold/examples which can be copied to your fasta_files directory:
cp /central/software/alphafold/examples/rcsb_pdb_3DMW-EDS.fasta ~/fasta_files/.
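The wrapper's -f option expects a FASTA file containing one sequence, so it can be worth counting the records in a file before submitting. The sketch below is only an illustration: the function name count_fasta_records is made up for this example, and it simply counts the header lines, which start with ">".

```shell
# Hypothetical helper (not part of the alphafold module): count the
# sequence records in a FASTA file by counting lines that start with ">".
count_fasta_records() {
  grep -c '^>' "$1"
}
```

For example, count_fasta_records ~/fasta_files/rcsb_pdb_3DMW-EDS.fasta prints the number of records in the example file. A result greater than 1 means the file holds several sequences.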
Next you will want to create a submission script. An example file is available as well, which you can copy to your home directory:
cp /central/software/alphafold/examples/alphafold.sub ~/.
The Submission Script
First we give the job a name, which will show up in the scheduler:
#SBATCH --job-name=alphafold_run
Then we say how long the job may run. The job will be killed when it reaches this limit. We start with the maximum time of 7 days, but once you are more familiar with your job runtimes you may want to lower this to a realistic value. A realistic limit keeps jobs that are doing the wrong thing from incurring additional costs, and it can also get your jobs through the queue faster, since a shorter job may fit into a backfill slot.
#SBATCH --time=7-00:00
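To choose a more realistic limit, you can look at how long an earlier job actually ran. This is a minimal sketch, assuming Slurm accounting is enabled on the cluster so that sacct can report on finished jobs. The function name job_usage is hypothetical and you supply the job ID yourself.

```shell
# Hypothetical helper: show elapsed time and peak memory for a finished job.
# Assumes Slurm accounting (sacct) is available on the cluster.
job_usage() {
  sacct -j "$1" --format=JobID,JobName,Elapsed,MaxRSS,State
}
```

Call it as job_usage followed by the job ID that sbatch printed when you submitted. The time limit format is days-hours:minutes, so --time=1-12:00 would request one day and twelve hours.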
The next lines are all about the resources your job will use. In this case we use a single node and do not allow other jobs to run on it (exclusive). The job uses one task, but that task can use 28 cores. We are requesting 4 GPUs on the node and 32G of memory.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4 # At least one GPU is needed to run AlphaFold properly; this example requests 4
#SBATCH --exclusive
#SBATCH --cpus-per-task=28 # adjust this if you are using parallel commands
#SBATCH --mem=32G # adjust this according to the memory requirement per node you need
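To confirm that the job received what these lines request, the script body can print the allocation at startup. This is a sketch under the assumption that the cluster's Slurm sets SLURM_CPUS_PER_TASK and CUDA_VISIBLE_DEVICES for the job, as it normally does when --cpus-per-task and a GPU gres are requested. The function name report_allocation is hypothetical.

```shell
# Hypothetical helper: print what Slurm actually allocated to this job.
report_allocation() {
  echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-not set}"
  echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
}
```

Place the call after all of the #SBATCH lines, since Slurm stops reading directives at the first command in the script.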
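The next two lines have the scheduler keep you informed about when the job starts and ends. Make sure to put your actual email address in place of $USER@caltech.edu, because the scheduler does not expand shell variables such as $USER in #SBATCH lines. You can also leave these lines out if you prefer not to be emailed.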
#SBATCH --mail-user=$USER@caltech.edu
#SBATCH --mail-type=ALL
Next we get to the commands that actually run on the compute node.
First we set some variables for where your input files are, where to put the output files, and where the AlphaFold data directories are. The download directory is only necessary if you are using non-standard data:
DOWNLOAD_DIR=/central/software/alphafold/data/ # Set the appropriate path to your downloaded data
INPUT_DIR=/home/$USER/fasta_files/
OUTPUT_DIR=/central/scratch/$USER/alphafold/out
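Before the run starts, it can help to fail early if the input directory is missing and to make sure the output directory exists. The following is a sketch, not part of the provided alphafold.sub, and the function name prepare_dirs is hypothetical.

```shell
# Hypothetical helper: verify the input directory and create the output directory.
prepare_dirs() {
  local input_dir="$1" output_dir="$2"
  if [ ! -d "$input_dir" ]; then
    echo "Input directory not found: $input_dir" >&2
    return 1
  fi
  # -p creates missing parent directories and is a no-op if the path exists.
  mkdir -p "$output_dir"
}
```

In the submission script this could be called as prepare_dirs "$INPUT_DIR" "$OUTPUT_DIR" || exit 1, so that the job stops right away instead of failing later inside the container.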
Next we load the modules, in case you did not load them before submitting:
module load singularity/3.8.0 alphafold/2.2.0
Wrapper script options
Please make sure all required parameters are given
Usage: /central/software/alphafold/2.2.0/bin/run_alphafold.sh <OPTIONS>
Required Parameters:
-o <output_dir> Path to a directory that will store the results.
-m <model_names> Name of model to use <monomer|monomer_casp14|monomer_ptm|multimer>
-f <fasta_path> Path to a FASTA file containing one sequence
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-b <benchmark> Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'False')
-d <data_dir> Path to directory of supporting data
-g <use_gpu> Enable NVIDIA runtime to run with GPUs (default: True)
-a <gpu_devices> Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-n <number> How many predictions (each with a different random seed) will be generated per model
-p <preset> Choose preset model configuration - no ensembling (full_dbs) or 8 model ensemblings (casp14) (default: 'full_dbs')
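Putting the required options together, a run line in the submission script might look like the sketch below. It is an illustration rather than the contents of the provided alphafold.sub: the function name print_alphafold_cmd is hypothetical, and the model name and template date in the usage line are example values. The function only prints the command, so you can inspect it before running it.

```shell
# Hypothetical helper: print (not run) a wrapper invocation that passes
# the four required options for one FASTA file.
print_alphafold_cmd() {
  local fasta="$1" out_dir="$2" model="$3" max_date="$4"
  echo /central/software/alphafold/2.2.0/bin/run_alphafold.sh \
    -o "$out_dir" -m "$model" -f "$fasta" -t "$max_date"
}
```

With the variables set earlier in the script, a call could be print_alphafold_cmd "$INPUT_DIR/rcsb_pdb_3DMW-EDS.fasta" "$OUTPUT_DIR" monomer 2022-01-01. Once the printed line looks right, the same command can be run directly in the script.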
Submitting your job
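The job is handed to the scheduler with sbatch, and squeue shows its state afterwards. This is a minimal sketch, assuming the example script was copied to ~/alphafold.sub as described earlier. The function name submit_alphafold is hypothetical.

```shell
# Hypothetical helper: submit the script, then list your queued and running jobs.
submit_alphafold() {
  local script="${1:-$HOME/alphafold.sub}"
  if [ ! -f "$script" ]; then
    echo "Submission script not found: $script" >&2
    return 1
  fi
  sbatch "$script" && squeue -u "$USER"
}
```

Calling submit_alphafold with no argument submits ~/alphafold.sub. sbatch prints the job ID when the submission is accepted. Unless the script sets --output, Slurm writes the job's output to slurm-<jobid>.out in the directory you submitted from.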
Troubleshooting
Error:
2022-07-19 14:56:53.454079: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:618] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle.
2022-07-19 14:56:53.487217: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
Solution: Try unsetting TF_FORCE_UNIFIED_MEMORY, either in your interactive session or in your sbatch file, *and* increase the memory you request to two or three times its current value as a test. (You can drop the memory back down once you have verified that things are working.)
unset TF_FORCE_UNIFIED_MEMORY
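In an sbatch file, the two parts of this fix could look like the following sketch. The 64G figure is only an example of doubling the 32G requested earlier, and the function name clear_unified_memory is hypothetical. The echo is there purely so that the change shows up in the job output.

```shell
#SBATCH --mem=64G # example value: double the earlier 32G request while testing

# Hypothetical helper: clear the variable if it is set, and say so in the job log.
clear_unified_memory() {
  if [ -n "${TF_FORCE_UNIFIED_MEMORY+x}" ]; then
    echo "Unsetting TF_FORCE_UNIFIED_MEMORY (was: ${TF_FORCE_UNIFIED_MEMORY})"
    unset TF_FORCE_UNIFIED_MEMORY
  fi
}
```

Call clear_unified_memory in the script before the AlphaFold run line. If the variable was never set, the function does nothing and prints nothing.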