Update of README.md

DeepMind · copybara-github · commit e3231de13513 · 2023-01-11T07:42:36.000-08:00
PiperOrigin-RevId: 501279176
Change-Id: I9cf92212322b29691844973ded9e337e81b3a9fd
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 # AlphaFold
 
 This package provides an implementation of the inference pipeline of AlphaFold
-v2.0. For simplicity, we refer to this model as AlphaFold throughout the rest of
+v2. For simplicity, we refer to this model as AlphaFold throughout the rest of
 this document.
 
 We also provide:
@@ -36,21 +36,51 @@ If you have any questions, please contact the AlphaFold team at
 
 ![CASP14 predictions](imgs/casp14_predictions.gif)
 
-## First time setup
+## Installation and running your first prediction
 
 You will need a machine running Linux, AlphaFold does not support other
-operating systems.
+operating systems. Full installation requires up to 3 TB of disk space to keep
+genetic databases (SSD storage is recommended) and a modern NVIDIA GPU (GPUs
+with more memory can predict larger protein structures).
 
-The following steps are required in order to run AlphaFold:
+Please follow these steps:
 
 1.  Install [Docker](https://www.docker.com/).
     *   Install
         [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
         for GPU support.
     *   Setup running
         [Docker as a non-root user](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user).
-1.  Download genetic databases (see below).
-1.  Download model parameters (see below).
+
+1.  Clone this repository and `cd` into it.
+
+    ```bash
+    git clone https://github.com/deepmind/alphafold.git
+    cd ./alphafold
+    ```
+
+1.  Download genetic databases and model parameters:
+
+    *   Install `aria2c` (on most Linux distributions it is available via the
+    package manager).
+
+    *   Please use the script `scripts/download_all_data.sh` to download
+    and set up full databases. This may take substantial time (download size is
+    556 GB), so we recommend running this script in the background:
+
+    ```bash
+    scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
+    ```
+
+    *   **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
+    subdirectory in the AlphaFold repository directory.** If it is, the Docker
+    build will be slow as the large databases will be copied into the docker
+    build context.
+
+    *   It is possible to run AlphaFold with reduced databases; please refer to
+    the [complete documentation](#genetic-databases).
+
+
 1.  Check that AlphaFold will be able to use a GPU by running:
 
     ```bash
@@ -63,10 +93,58 @@ The following steps are required in order to run AlphaFold:
     or take a look at the following
     [NVIDIA Docker issue](https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573).
 
-If you wish to run AlphaFold using Singularity (a common containerization
-platform on HPC systems) we recommend using some of the third party Singularity
-setups as linked in https://github.com/deepmind/alphafold/issues/10 or
-https://github.com/deepmind/alphafold/issues/24.
+    If you wish to run AlphaFold using Singularity (a common containerization
+    platform on HPC systems) we recommend using some of the third party Singularity
+    setups as linked in https://github.com/deepmind/alphafold/issues/10 or
+    https://github.com/deepmind/alphafold/issues/24.
+
+1.  Build the Docker image:
+
+    ```bash
+    docker build -f docker/Dockerfile -t alphafold .
+    ```
+
+    If you encounter the following error:
+
+    ```
+    W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
+    E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
+    ```
+
+    use the workaround described in
+    https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
+
+1.  Install the `run_docker.py` dependencies. Note: You may optionally wish to
+    create a
+    [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
+    to prevent conflicts with your system's Python environment.
+
+    ```bash
+    pip3 install -r docker/requirements.txt
+    ```
+
+1.  Make sure that the output directory exists (the default is `/tmp/alphafold`)
+    and that you have sufficient permissions to write into it.
+
+1.  Run `run_docker.py` pointing to a FASTA file containing the protein
+    sequence(s) for which you wish to predict the structure (`--fasta_paths`
+    parameter). AlphaFold will search for the available templates before the
+    date specified by the `--max_template_date` parameter; this could be used to
+    avoid certain templates during modeling. `--data_dir` is the directory with
+    downloaded genetic databases and `--output_dir` is the absolute path to the
+    output directory.
+
+    ```bash
+    python3 docker/run_docker.py \
+      --fasta_paths=your_protein.fasta \
+      --max_template_date=2022-01-01 \
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
+    ```
+
+1.  Once the run is over, the output directory shall contain predicted
+    structures of the target protein. Please check the documentation below for
+    additional options and troubleshooting tips.
 
 ### Genetic databases
 
@@ -86,22 +164,24 @@ AlphaFold needs multiple genetic (sequence) databases to run:
 We provide a script `scripts/download_all_data.sh` that can be used to download
 and set up all of these databases:
 
-*   Default:
+*   Recommended default:
 
     ```bash
     scripts/download_all_data.sh <DOWNLOAD_DIR>
     ```
 
     will download the full databases.
 
-*   With `reduced_dbs`:
+*   With `reduced_dbs` parameter:
 
     ```bash
     scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
     ```
 
     will download a reduced version of the databases to be used with the
-    `reduced_dbs` database preset.
+    `reduced_dbs` database preset. This shall be used with the corresponding
+    AlphaFold parameter `--db_preset=reduced_dbs` later during the AlphaFold run
+    (please see [AlphaFold parameters](#running-alphafold) section).
 
 :ledger: **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
 subdirectory in the AlphaFold repository directory.** If it is, the Docker build
@@ -111,7 +191,7 @@ We don't provide exactly the database versions used in CASP14 – see the
 [note on reproducibility](#note-on-casp14-reproducibility). Some of the
 databases are mirrored for speed, see [mirrored databases](#mirrored-databases).
 
-:ledger: **Note: The total download size for the full databases is around 415 GB
+:ledger: **Note: The total download size for the full databases is around 556 GB
 and the total size when unzipped is 2.62 TB. Please make sure you have a large
 enough hard drive space, bandwidth and time to download. We recommend using an
 SSD for better genetic search performance.**
@@ -230,58 +310,11 @@ To use the deprecated v2.1.0 AlphaFold-Multimer model weights:
 **The simplest way to run AlphaFold is using the provided Docker script.** This
 was tested on Google Cloud with a machine using the `nvidia-gpu-cloud-image`
 with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
-3 TB disk, and an A100 GPU.
-
-1.  Clone this repository and `cd` into it.
+3 TB disk, and an A100 GPU. For your first run, please follow the instructions
+from [Installation and running your first prediction](#installation-and-running-your-first-prediction)
+section.
 
-    ```bash
-    git clone https://github.com/deepmind/alphafold.git
-    ```
-
-1.  Build the Docker image:
-
-    ```bash
-    docker build -f docker/Dockerfile -t alphafold .
-    ```
-
-    If you encounter the following error:
-
-    ```
-    W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
-    E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
-    ```
-
-    use the workaround described in
-    https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
-
-1.  Install the `run_docker.py` dependencies. Note: You may optionally wish to
-    create a
-    [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
-    to prevent conflicts with your system's Python environment.
-
-    ```bash
-    pip3 install -r docker/requirements.txt
-    ```
-
-1.  Make sure that the output directory exists (the default is `/tmp/alphafold`)
-    and that you have sufficient permissions to write into it.
-
-1.  Run `run_docker.py` pointing to a FASTA file containing the protein
-    sequence(s) for which you wish to predict the structure. If you are
-    predicting the structure of a protein that is already in PDB and you wish to
-    avoid using it as a template, then `max_template_date` must be set to be
-    before the release date of the structure. You must also provide the path to
-    the directory containing the downloaded databases. For example, for the
-    T1050 CASP14 target:
-
-    ```bash
-    python3 docker/run_docker.py \
-      --fasta_paths=T1050.fasta \
-      --max_template_date=2020-05-14 \
-      --data_dir=$DOWNLOAD_DIR
-    ```
-
-    By default, Alphafold will attempt to use all visible GPU devices. To use a
+1.  By default, Alphafold will attempt to use all visible GPU devices. To use a
     subset, specify a comma-separated list of GPU UUID(s) or index(es) using the
     `--gpu_devices` flag. See
     [GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
@@ -325,9 +358,24 @@ with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
       --max_template_date=2020-05-14 \
       --model_preset=monomer \
       --db_preset=reduced_dbs \
-      --data_dir=$DOWNLOAD_DIR
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
     ```
 
+1.  After generating the predicted model, by default AlphaFold runs a relaxation
+    step to improve geometrical quality. You can control this via `--run_relax=true`
+    (default) or `--run_relax=false`.
+
+1.  The relaxation step can be run on GPU (faster, but could be less stable) or
+    CPU (slow, but stable). This can be controlled with `--enable_gpu_relax=true`
+    (default) or `--enable_gpu_relax=false`.
+
+1.  AlphaFold can re-use MSAs (multiple sequence alignments) for the same
+    sequence via `--use_precomputed_msas=true` option; this can be useful for
+    trying different AlphaFold parameters. This option assumes that the
+    directory structure generated by the first AlphaFold run in the output
+    directory exists and that the protein sequence is the same.
+
 ### Running AlphaFold-Multimer
 
 All steps are the same as when running the monomer system, but you will have to
@@ -342,7 +390,8 @@ python3 docker/run_docker.py \
   --fasta_paths=multimer.fasta \
   --max_template_date=2020-05-14 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 By default the multimer system will run 5 seeds per model (25 total predictions)
@@ -402,7 +451,8 @@ python3 docker/run_docker.py \
   --fasta_paths=monomer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=monomer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding a homomer
@@ -426,7 +476,8 @@ python3 docker/run_docker.py \
   --fasta_paths=homomer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding a heteromer
@@ -454,7 +505,8 @@ python3 docker/run_docker.py \
   --fasta_paths=heteromer.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding multiple monomers one after another
@@ -468,7 +520,8 @@ python3 docker/run_docker.py \
   --fasta_paths=monomer1.fasta,monomer2.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=monomer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 #### Folding multiple multimers one after another
@@ -482,7 +535,8 @@ python3 docker/run_docker.py \
   --fasta_paths=multimer1.fasta,multimer2.fasta \
   --max_template_date=2021-11-01 \
   --model_preset=multimer \
-  --data_dir=$DOWNLOAD_DIR
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
 ```
 
 ### AlphaFold output
@@ -633,6 +687,7 @@ If you use the code or data in this package, please cite:
 
 In addition, if you use the AlphaFold-Multimer mode, please cite:
 
+
 ```bibtex
 @article {AlphaFold-Multimer2021,
   author       = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
@@ -754,4 +809,5 @@ reference to the following:
     (unmodified), by Mitchell AL et al., available free of all copyright
     restrictions and made fully and freely available for both non-commercial and
     commercial use under
-    [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
+    [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
+