|
| 1 | +--- |
| 2 | +layout: userdoc |
| 3 | +title: "Estimating amino acid substitution models" |
| 4 | +author: Cuong Dang |
| 5 | +date: 2021-10-16 |
| 6 | +docid: 8 |
| 7 | +icon: info-circle |
| 8 | +doctype: tutorial |
| 9 | +tags: |
| 10 | +- tutorial |
| 11 | +description: Estimating amino acid substitution models. |
| 12 | +sections: |
| 13 | +- name: Estimating a model from a single concatenated alignment |
| 14 | + url: Estimating a model from a single concatenated alignment |
| 15 | +- name: Estimating a model from a folder of alignments |
| 16 | + url: Estimating a model from a folder of alignments |
| 17 | +--- |
| 18 | + |
| 19 | + |
| 20 | +Estimating amino acid substitution models |
| 21 | +========================== |
| 22 | +Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. |
| 23 | +In our recent papers ([Dang et al., 2021], [Minh et al., 2021]), we introduced QMaker and nQMaker, both using maximum likelihood approach that respectively allow the estimation of time reversible and time non-reversible amino acid substitution models from a single concatenated alignment or from a set of protein sequence alignments. Both reversible models estimated with QMaker and non-reversible models estimated with nQMaker fit better to empirical alignments than existing models, across a wide range of datasets. Moreover, the non-reversible models recovered correctly the commonly known root placements for the plant and bird trees with high statistical support without the need to use an outgroup. |
| 24 | +Six new reversible models (Q.pfam, Q.plant, Q.mammal, Q.bird, Q.insect, Q.yeast) and six new non-reversible models (NQ.pfam, NQ.plant, NQ.mammal, NQ.bird, NQ.insect, NQ.yeast) are implemented in the latest version of IQ-TREE. |
| 25 | +This tutorial provides an easy-to-use approach for users to estimate non-reversible models and rooted phylogenies from their own protein datasets. Please make sure |
| 26 | +that you use the lateset version of IQ-TREE for full features and cite our manuscripts: |
| 27 | + |
| 28 | +> Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, and Robert Lanfear (2021), QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution. <https://doi.org/10.1093/sysbio/syab010> |
| 29 | +
|
| 30 | +> Cuong Cao Dang, Bui Quang Minh, Hanon McShea, Joanna Masel, Jennifer Eleanor James, Le Sy Vinh, and Robert Lanfear (2021), nQMaker: estimating time non-reversible amino acid substitution models. <https://doi.org/10.1101/2021.10.18.464754> |
| 31 | +
|
| 32 | +## Estimating a model from a single concatenated alignment |
| 33 | +### Estimating a reversible model |
| 34 | +We first demonstrate the estimation of a reversible model for a clade-specific dataset. Please download and extract the [sample training data](data/plant_10loci.zip). This sample data was exctracted from plant data ([Ran et al., 2018]). There are two files in the downloaded folder: |
| 35 | + |
| 36 | +* `alignment.nex` contains the alignment in NEXUS format. |
| 37 | +* `train.nex` contains the training partitions in NEXUS format. |
| 38 | + |
| 39 | +The estimation (training) then can be accomplished with three commands in IQ-TREE. The first command is: |
| 40 | + |
| 41 | + # step 1: infer an single edge-linked tree with reversible models as initial models |
| 42 | + iqtree2 --seed 1 -T AUTO -s alignment.nex -p train.nex -m MFP -mset LG,WAG,JTT -cmax 4 --prefix train_plant |
| 43 | + |
| 44 | +- `-seed 1` sets the random seed. |
| 45 | +- `-T AUTO` sets the number of computing threads to automatically detection, `-T 10` will let IQ-TREE utilize up to 10 CPU threads. |
| 46 | +- `-s alignment.nex` specifies the NEXUS file containing the concatenated alignment. |
| 47 | +- `-p train.nex` specifies the NEXUS file containing the training loci, `-p` option will estimate an edge-linked partition model with a single tree topology shared across all loci. This -p option is typically used with concatenation tree estimation that assumes a single species tree but rescale the branch lengths of the locus trees. It was shown to perform best among other partition models ([Duchêne et al., 2019]). |
| 48 | +Note: If you don't have the partition file `train.nex`, simply ignore the `-p` option. |
| 49 | +- `-mset LG,WAG,JTT` defines the initial candidate matrices to reduce computational burden. |
| 50 | +- `-cmax 4` restricts up to four categories for the rate heterogeneity across sites. |
| 51 | +- `--prefix train_plant` or `-pre train_plant` sets the name of the output files. |
| 52 | + |
| 53 | +This will run ModelFinder to find the best model for each loci. The best models and the best trees will be saved to `train_plant.best_model.nex` and to `train_plant.treefile`, respectively. These two files will be used as the input for the second command that estimate a join reversible matrix: |
| 54 | + |
| 55 | + # step 2: estimate a join non-reversible matrix across all loci |
| 56 | + iqtree2 -seed 1 -T AUTO -s alignment.nex -p train_plant.best_model.nex -te train_plant.treefile --init-model LG --model-joint GTR20+FO --prefix plant_GTR20_FO |
| 57 | +- `--init-model LG` option specifies the initial matrix |
| 58 | +- `-p train_plant.best_model.nex` and `-te train_plant.treefile` options specify the best models and trees found in step 1. |
| 59 | + |
| 60 | +The resulting matrix (`Q.plant`) which is included in the file `plant_GTR20_FO.iqtree` can be obtained with the command: |
| 61 | + |
| 62 | + # step 3: extract the resulting reversible matrix |
| 63 | + grep -A 21 "can be used as input for IQ-TREE" plant_GTR20_FO.iqtree | tail -n20 > Q.plant |
| 64 | + |
| 65 | +We can open `Q.plant` in a text viewer as below, or use with IQ-TREE (e.g. `-m Q.plant` option). The `Q.plant` is in PAML format, it contains the lower diagonal part of the AA exchange rates matrix and 20 AA frequencies. |
| 66 | + |
| 67 | + 0.120955 |
| 68 | + 0.048462 0.458565 |
| 69 | + 0.436023 0.037551 4.799928 |
| 70 | + 0.470505 1.707355 0.338537 0.000100 |
| 71 | + 0.261114 3.736017 0.605443 0.147890 0.286279 |
| 72 | + 0.837596 0.016734 0.388965 5.815020 0.000100 4.138646 |
| 73 | + 1.939708 1.199364 0.911021 1.413132 0.681747 0.419571 0.886564 |
| 74 | + 0.232927 3.431828 4.195096 1.867861 1.850593 4.826298 0.328092 0.170808 |
| 75 | + 0.118245 0.101767 0.246939 0.041486 0.186430 0.018556 0.000100 0.000100 0.015294 |
| 76 | + 0.137216 0.190052 0.000100 0.000100 0.362263 0.642152 0.028449 0.025460 0.554159 2.930110 |
| 77 | + 0.135127 5.915805 3.499350 0.090416 0.000100 2.539190 1.229355 0.196033 0.301659 0.149798 0.013760 |
| 78 | + 0.359701 0.983896 0.049506 0.000100 0.380313 0.236000 0.389444 0.000100 0.031711 5.188215 4.848803 0.701453 |
| 79 | + 0.203942 0.034729 0.000100 0.000100 1.671509 0.000100 0.000100 0.048554 0.091040 1.015749 3.012478 0.000154 0.873360 |
| 80 | + 1.317722 0.441669 0.000100 0.128632 0.000100 1.357928 0.123957 0.048531 0.824914 0.000100 0.502828 0.122503 0.000100 0.000100 |
| 81 | + 3.425467 1.067776 4.114253 0.435618 2.788606 0.280845 0.104001 1.549784 0.361033 0.041339 0.384561 0.290811 0.188779 0.615963 2.474679 |
| 82 | + 4.260751 0.715157 2.261495 0.178679 0.000100 0.277338 0.227034 0.186275 0.000100 1.909182 0.041874 0.966718 3.385793 0.072074 0.806596 4.938908 |
| 83 | + 0.245284 0.698218 0.000100 0.000100 1.273897 0.161369 0.000100 0.405661 0.536027 0.012605 0.896675 0.000100 0.341461 0.993273 0.031731 0.131122 0.000100 |
| 84 | + 0.012568 0.009986 0.621948 0.200102 2.122422 0.031013 0.010158 0.000100 5.748250 0.031443 0.000100 0.000100 0.156847 7.335297 0.173676 0.241389 0.000100 1.228204 |
| 85 | + 2.630085 0.043992 0.071593 0.042649 0.326353 0.036449 0.277692 0.162678 0.140434 9.733184 1.730443 0.035985 1.921040 0.678498 0.149578 0.089526 0.989870 0.000100 0.234116 |
| 86 | + |
| 87 | + 0.076028 0.051084 0.039819 0.051496 0.012072 0.038467 0.064409 0.052997 0.017632 0.064148 0.102707 0.073142 0.019406 0.048651 0.034180 0.077354 0.046007 0.011774 0.033201 0.085428 |
| 88 | + |
| 89 | +The amino-acid order in this file is: |
| 90 | + |
| 91 | + A R N D C Q E G H I L K M F P S T W Y V |
| 92 | + Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val |
| 93 | + |
| 94 | + |
| 95 | +### Estimating a non-reversible model |
| 96 | +To estimate a non-reversible model, we use `--model-joint NONREV+FO` option instead of `--model-joint GTR20+FO` (in step 2). The three commands now are: |
| 97 | + |
| 98 | + # step 1: infer an single edge-linked tree with reversible models as initial models |
| 99 | + iqtree2 --seed 1 -T AUTO -s alignment.nex -p train.nex -m MFP -mset LG,WAG,JTT -cmax 4 --prefix train_plant |
| 100 | + |
| 101 | + # step 2: estimate a join non-reversible matrix across all loci |
| 102 | + iqtree2 -seed 1 -T AUTO -s alignment.nex -p train_plant.best_model.nex -te train_plant.treefile --model-joint NONREV+FO --prefix plant_NONREV_FO |
| 103 | + |
| 104 | + # step 3: extract the resulting non-reversible matrix |
| 105 | + grep -A 22 "can be used as input for IQ-TREE" plant_NONREV_FO.iqtree | tail -n21 > NQ.plant |
| 106 | + |
| 107 | +The resulting `NQ.plant` matrix may now look like: |
| 108 | + |
| 109 | + -1.061624 0.007785 0.002655 0.022039 0.006589 0.014742 0.062839 0.090131 0.001704 0.002323 0.003087 0.001751 0.012195 0.012809 0.041760 0.320267 0.210023 0.003323 0.000857 0.244743 |
| 110 | + 0.008289 -1.080907 0.020365 0.000987 0.026624 0.177551 0.001802 0.086672 0.075705 0.009088 0.021643 0.491866 0.024868 0.000653 0.006541 0.072228 0.039720 0.013123 0.000986 0.002196 |
| 111 | + 0.004032 0.019786 -1.313687 0.267488 0.005393 0.008927 0.032858 0.067708 0.062829 0.023892 0.000094 0.252395 0.000094 0.000094 0.000094 0.438096 0.096599 0.000094 0.027919 0.005295 |
| 112 | + 0.041666 0.003896 0.227284 -0.890796 0.000094 0.000094 0.400738 0.099378 0.050443 0.002720 0.000094 0.007025 0.000094 0.000094 0.000094 0.027181 0.013001 0.000094 0.009413 0.007395 |
| 113 | + 0.051903 0.094902 0.010786 0.000094 -0.893500 0.026469 0.000094 0.037475 0.106883 0.010559 0.070577 0.000094 0.000094 0.102084 0.000094 0.236907 0.000094 0.012537 0.099632 0.032222 |
| 114 | + 0.013065 0.194182 0.039046 0.017941 0.000305 -1.075698 0.292962 0.027675 0.095402 0.001329 0.081175 0.209181 0.004369 0.000094 0.062599 0.012804 0.017520 0.000094 0.000094 0.005861 |
| 115 | + 0.064413 0.000094 0.012826 0.355911 0.000094 0.181070 -0.822679 0.053530 0.003000 0.000094 0.002512 0.097327 0.005370 0.000094 0.005713 0.002772 0.012020 0.000094 0.000094 0.025650 |
| 116 | + 0.187614 0.053999 0.033263 0.066321 0.009740 0.014380 0.058840 -0.621387 0.003268 0.000094 0.005951 0.015482 0.000094 0.000544 0.000094 0.145278 0.009973 0.000144 0.000094 0.016213 |
| 117 | + 0.033332 0.235841 0.278083 0.091410 0.023465 0.257662 0.049843 0.011802 -1.428426 0.004130 0.053383 0.021902 0.000094 0.007900 0.054172 0.039777 0.000094 0.000094 0.250585 0.014858 |
| 118 | + 0.023950 0.001632 0.004395 0.002772 0.003342 0.000717 0.000094 0.000837 0.000094 -1.493125 0.272227 0.010394 0.124246 0.059446 0.000094 0.003597 0.087965 0.000297 0.003057 0.893969 |
| 119 | + 0.026527 0.011287 0.000094 0.000094 0.002177 0.021633 0.001975 0.000094 0.012033 0.226203 -0.745614 0.000094 0.089496 0.149580 0.006638 0.026123 0.003215 0.013687 0.002033 0.152631 |
| 120 | + 0.017415 0.341123 0.171696 0.006194 0.000094 0.105871 0.089788 0.009665 0.007083 0.009163 0.002472 -0.829488 0.013473 0.000094 0.003097 0.013503 0.038477 0.000094 0.000094 0.000094 |
| 121 | + 0.000094 0.043168 0.007303 0.000094 0.011450 0.011888 0.047858 0.000094 0.004725 0.350677 0.567158 0.061171 -1.532577 0.026166 0.000094 0.027269 0.182166 0.012220 0.014137 0.164845 |
| 122 | + 0.007119 0.004300 0.000094 0.000094 0.023494 0.000094 0.002681 0.008815 0.000094 0.054752 0.348000 0.001664 0.026142 -0.836377 0.000094 0.051269 0.011247 0.006273 0.217177 0.072974 |
| 123 | + 0.115066 0.035090 0.000094 0.011471 0.000094 0.046128 0.005991 0.007020 0.013780 0.000094 0.071513 0.009870 0.000094 0.000094 -0.550036 0.167147 0.048381 0.001156 0.005462 0.011491 |
| 124 | + 0.238188 0.080851 0.127209 0.032161 0.040145 0.018437 0.014191 0.071781 0.006455 0.002646 0.050029 0.035037 0.001082 0.032565 0.125243 -1.151350 0.243892 0.000094 0.012692 0.018650 |
| 125 | + 0.361018 0.036699 0.129824 0.003747 0.000094 0.010391 0.015700 0.012775 0.000630 0.135648 0.002005 0.101363 0.080632 0.000094 0.013000 0.417972 -1.388325 0.000094 0.000094 0.066546 |
| 126 | + 0.014117 0.017973 0.000094 0.000094 0.025311 0.008866 0.000094 0.034917 0.015069 0.000094 0.068188 0.000094 0.000094 0.037263 0.000094 0.016805 0.000094 -0.251399 0.012044 0.000094 |
| 127 | + 0.000094 0.000094 0.010861 0.001684 0.031664 0.000094 0.001058 0.000094 0.132692 0.000094 0.000094 0.000094 0.000094 0.405258 0.006474 0.003214 0.000094 0.007815 -0.601659 0.000094 |
| 128 | + 0.203068 0.005150 0.004281 0.000094 0.003602 0.000454 0.018986 0.008201 0.003052 0.638676 0.198251 0.009524 0.040070 0.033033 0.008064 0.000094 0.062714 0.000094 0.007267 -1.244676 |
| 129 | + |
| 130 | + 0.076646 0.049413 0.038372 0.049451 0.010780 0.037824 0.063761 0.052468 0.015186 0.065770 0.104298 0.072672 0.019435 0.049968 0.035179 0.078294 0.045874 0.013490 0.033377 0.087742 |
| 131 | + |
| 132 | +Note: To assess the statistical support of the root position with bootstraping (-B 1000 option), users can use [this tutorial](Rootstrap). |
| 133 | + |
| 134 | +## Estimating a model from a folder of alignments |
| 135 | +### Estimating a reversible model |
| 136 | +We will now estimate a reversible model from a folder of alignments. Please first download the file [plant_10alignments.zip](data/plant_10alignments.zip). There is a sub-folder named `train_plant` in the downloaded folder. We use `-S` option instead of `-s` and `-p` options to allow each alignment having a separate tree. This -S option is typically used with a folder of alignments. The three commands are: |
| 137 | + |
| 138 | + # step 1: infer a separate tree for each alignment with reversible models as initial models |
| 139 | + iqtree2 -seed 1 -T AUTO -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant |
| 140 | + |
| 141 | + # step 2: estimate a join reversible matrix across all alignments |
| 142 | + iqtree2 -seed 1 -T AUTO -S train_plant.best_model.nex -te train_plant.treefile --model-joint GTR20+FO --init-model LG -pre train_plant.GTR20 |
| 143 | + |
| 144 | + # step 3: extract the resulting reversible matrix |
| 145 | + grep -A 21 "can be used as input for IQ-TREE" train_plant.GTR20.iqtree | tail -n20 > Q.plant |
| 146 | + |
| 147 | +### Estimating a non-reversible model |
| 148 | +Same as estimating a non-reversible model from a single concatenated alignment, we change `--model-joint GTR20+FO` option (in step 2) to `--model-joint NONREV+FO`. |
| 149 | + |
| 150 | + # step 1: infer a separate tree for each alignment with reversible models as initial models |
| 151 | + iqtree2 -seed 1 -T AUTO -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant |
| 152 | + |
| 153 | + # step 2: estimate a join non-reversible matrix across all alignments |
| 154 | + iqtree2 -seed 1 -T AUTO -S train_plant.best_model.nex -te train_plant.treefile --model-joint NONREV+FO --init-model LG -pre train_plant.NONREV |
| 155 | + |
| 156 | + # step 3: extract the resulting non-reversible matrix |
| 157 | + grep -A 22 "can be used as input for IQ-TREE" train_plant.NONREV.iqtree | tail -n21 > NQ.plant |
| 158 | + |
| 159 | +[Dang et al., 2021]: https://doi.org/10.1101/2021.10.18.464754 |
| 160 | +[Minh et al., 2021]: https://doi.org/10.1093/sysbio/syab010 |
| 161 | +[Naser-Khdour et al., 2021]: https://doi.org/10.1093/sysbio/syab067 |
| 162 | +[El-Gebali et al., 2018]: https://doi.org/10.1093/nar/gky995 |
| 163 | +[Duchêne et al., 2019]: https://doi.org/10.1093/molbev/msz291 |
| 164 | +[Ran et al., 2018]: https://doi.org/10.1098/rspb.2018.1012 |
0 commit comments