/usr/share/doc/phylip/html/doc/promlk.html is in phylip-doc 1:3.696+dfsg-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>promlk</TITLE>
<META NAME="description" CONTENT="promlk">
<META NAME="keywords" CONTENT="promlk">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.696
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>Promlk -- Protein maximum likelihood program<BR>with molecular clock</H1>
</DIV>
<P>
© Copyright 2000-2014 by Joseph Felsenstein. All rights reserved.
License terms <a href="main.html#copyright">here</a>.
<P>
This program implements the maximum likelihood method for protein amino
acid sequences under the constraint that the trees estimated must be
consistent with a molecular clock. The molecular clock is the
assumption that the tips of the tree are all equidistant, in branch
length, from its root. This program is indirectly related to Proml.
It uses the Dayhoff probability model of change between amino acids.
Its algorithmic details are not yet published, but many of them are
similar to Dnamlk.
<P>
The assumptions of the model are:
<OL>
<LI>Each position in the sequence evolves independently.
<LI>Different lineages evolve independently.
<LI>Each position undergoes substitution at an expected rate which is
chosen from a series of rates (each with a probability of occurrence)
which we specify.
<LI>All relevant positions are included in the sequence, not just those that
have changed or those that are "phylogenetically informative".
<LI>The probabilities of change between amino acids are given by the
model of Jones, Taylor, and Thornton (1992), the PMB model
of Veerassamy, Smith and Tillier (2003), or the DCMut version
(Kosiol and Goldman, 2005) of the PAM model of
Dayhoff (Dayhoff and Eck, 1968; Dayhoff et. al., 1979).
</OL>
<P>
Note the assumption that we are looking at all positions, including those
that have not changed at all. It is important not to restrict attention
to some positions based on whether or not they have changed; doing that
would bias branch lengths by making them too long, and that in turn
would cause the method to misinterpret the meaning of those positions that
had changed.
<P>
This program uses a Hidden Markov Model (HMM)
method of inferring different rates of evolution at different amino acid
positions. This
was described in a paper by me and Gary Churchill (1996). It allows us to
specify to the program that there will be
a number of different possible evolutionary rates, what the prior
probabilities of occurrence of each is, and what the average length of a
patch of positions all having the same rate. The rates can also be chosen
by the program to approximate a Gamma distribution of rates, or a
Gamma distribution plus a class of invariant positions. The program computes the
likelihood by summing it over all possible assignments of rates to positions,
weighting each by its prior probability of occurrence.
<P>
For example, if we have used the C and A options (described below) to specify
that there are three possible rates of evolution, 1.0, 2.4, and 0.0,
that the prior probabilities of a position having these rates are 0.4, 0.3, and
0.3, and that the average patch length (number of consecutive positions
with the same rate) is 2.0, the program will sum the likelihood over
all possibilities, but giving less weight to those that (say) assign all
positions to rate 2.4, or that fail to have consecutive positions that have the
same rate.
<P>
The Hidden Markov Model framework for rate variation among positions
was independently developed by Yang (1993, 1994, 1995). We have
implemented a general scheme for a Hidden Markov Model of
rates; we allow the rates and their prior probabilities to be specified
arbitrarily by the user, or by a discrete approximation to a Gamma
distribution of rates (Yang, 1995), or by a mixture of a Gamma
distribution and a class of invariant positions.
<P>
This feature effectively removes the artificial assumption that all positions
have the same rate, and also means that we need not know in advance the
identities of the positions that have a particular rate of evolution.
<P>
Another layer of rate variation also is available. The user can assign
categories of rates to each positions (for example, we might want
amino acid positions in the active site of a protein to change more slowly
than other positions. This is done with the categories input file and the
C option. We then specify (using the menu) the relative rates of evolution of
amino acid positions
in the different categories. For example, we might specify that positions
in the active site
evolve at relative rates of 0.2 compared to 1.0 at other positions. If we
are assuming that a particular position maintains a cysteine bridge to another,
we may want to put it in a category of positions (including perhaps the
initial position of the protein sequence which maintains methionine) which
changes at a rate of 0.0.
<P>
If both user-assigned rate categories and Hidden Markov Model rates
are allowed, the program assumes that the
actual rate at a position is the product of the user-assigned category rate
and the Hidden Markov Model regional rate. (This may not always make
perfect biological sense: it would be more natural to assume some upper
bound to the rate, as we have discussed in the Felsenstein and Churchill
paper). Nevertheless you may want to use both types of rate variation.
<P>
<H2>INPUT FORMAT AND OPTIONS</H2>
<P>
Subject to these assumptions, the program is a
correct maximum likelihood method. The
input is fairly standard, with one addition. As usual the first line of the
file gives the number of species and the number of amino acid positions.
<P>
Next come the species data. Each
sequence starts on a new line, has a ten-character species name
that must be blank-filled to be of that length, followed immediately
by the species data in the one-letter amino acid code. The sequences must
either be in the "interleaved" or "sequential" formats
described in the Molecular Sequence Programs document. The I option
selects between them. The sequences can have internal
blanks in the sequence but there must be no extra blanks at the end of the
terminated line. Note that a blank is not a valid symbol for a deletion.
<P>
The options are selected using an interactive menu. The menu looks like this:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Amino acid sequence
Maximum Likelihood method with molecular clock, version 3.69
Settings for this run:
U Search for best tree? Yes
P JTT, PMB or PAM probability model? Jones-Taylor-Thornton
C One category of substitution rates? Yes
R Rate variation among sites? constant rate of change
G Global rearrangements? No
W Sites weighted? No
J Randomize input order of sequences? No. Use input order
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
5 Reconstruct hypothetical sequences? No
Are these settings correct? (type Y or the letter for one to change)
</PRE>
</TD></TR></TABLE>
<P>
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit corresponding
to an option that is to be changed.
<P>
The options U, W, J, O, M, and 0 are the usual ones. They are described in the
main documentation file of this package. Option I is the same as in
other molecular sequence programs and is described in the documentation file
for the sequence programs.
<P>
The P option toggles between three models of amino acid change. One
is the Jones-Taylor-Thornton model, another the PMB (Probability
Matrix from Blocks) model of Veerassamy, Smith and Tillier (2003),
another the DCMut model (Kosiol and Goldman, 2005) based on the
the Dayhoff PAM matrix
model. These are all based on Margaret Dayhoff's (Dayhoff and Eck, 1968;
Dayhoff et. al., 1979) method of empirical tabulation of changes of
amino acid sequences, and conversion of these to a probability
model of amino acid change which is used to make a transition probability
matrix which allows prediction of the probability of changing from any
one amino acid to any other, and also predicts equilibrium amino acid
composition.
<P>
The R (Hidden Markov Model rates) option allows the user to
approximate a Gamma distribution of rates among positions, or a
Gamma distribution plus a class of invariant positions, or to specify how
many categories of
substitution rates there will be in a Hidden Markov Model of rate
variation, and what are the rates and probabilities
for each. By repeatedly selecting the R option one toggles among
no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
<P>
If you choose Gamma or Gamma+I the program will ask how many rate
categories you want. If you have chosen Gamma+I, keep in mind that
one rate category will be set aside for the invariant class and only
the remaining ones used to approximate the Gamma distribution.
For the approximation we do not use the quantile method of Yang (1995)
but instead use a quadrature method using generalized Laguerre
polynomials. This should give a good approximation to the Gamma
distribution with as few as 5 or 6 categories.
<P>
In the Gamma and Gamma+I cases, the user will be
asked to supply the coefficient of variation of the rate of substitution
among positions. This is different from the parameters used by Nei and Jin
(1990) but
related to them: their parameter <EM>a</EM> is also known as "alpha",
the shape parameter of the Gamma distribution. It is
related to the coefficient of variation by
<P>
CV = 1 / a<SUP>1/2</SUP>
<P>
or
<P>
a = 1 / (CV)<SUP>2</SUP>
<P>
(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
As we consider cases in which the rates are less variable we should set <EM>a</EM>
larger and larger, as <EM>CV</EM> gets smaller and smaller.
<P>
If the user instead chooses the general Hidden Markov Model option,
they are first asked how many HMM rate categories there
will be (for the moment there is an upper limit of 9,
which should not be restrictive). Then
the program asks for the rates for each category. These rates are
only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4
have the exact same effect as rates 2.0, 4.0, and 4.8. Note that an
HMM rate category
can have rate of change 0, so that this allows us to take into account that
there may be a category of amino acid positions that are invariant. Note that
the run time
of the program will be proportional to the number of HMM rate categories:
twice as
many categories means twice as long a run. Finally the program will ask for
the probabilities of a random amino acid position falling into each of these
regional rate categories. These probabilities must be nonnegative and sum to
1. Default
for the program is one category, with rate 1.0 and probability 1.0 (actually
the rate does not matter in that case).
<P>
If more than one HMM rate category is specified, then another
option, A, becomes
visible in the menu. This allows us to specify that we want to assume that
positions that have the same HMM rate category are expected to be clustered
so that there is autocorrelation of rates. The
program asks for the value of the average patch length. This is an expected
length of patches that have the same rate. If it is 1, the rates of
successive positions will be independent. If it is, say, 10.25, then the
chance of change to a new rate will be 1/10.25 after every position. However
the "new rate" is randomly drawn from the mix of rates, and hence could
even be the same. So the actual observed length of patches with the same
rate will be a bit larger than 10.25. Note below that if you choose
multiple patches, there will be an estimate in the output file as to
which combination of rate categories contributed most to the likelihood.
<P>
Note that the autocorrelation scheme we use is somewhat different
from Yang's (1995) autocorrelated Gamma distribution. I am unsure
whether this difference is of any importance -- our scheme is chosen
for the ease with which it can be implemented.
<P>
The C option allows user-defined rate categories. The user is prompted
for the number of user-defined rates, and for the rates themselves,
which cannot be negative but can be zero. These numbers, which must be
nonnegative (some could be 0),
are defined relative to each other, so that if rates for three categories
are set to 1 : 3 : 2.5 this would have the same meaning as setting them
to 2 : 6 : 5.
The assignment of rates to amino acid positions
is then made by reading a file whose default name is "categories".
It should contain a string of digits 1 through 9. A new line or a blank
can occur after any character in this string. Thus the categories file
might look like this:
<P>
<PRE>
122231111122411155
1155333333444
</PRE>
<P>
With the current options R, A, and C the program has a good
ability to infer different rates at different positions and estimate
phylogenies under a more realistic model. Note that Likelihood Ratio
Tests can be used to test whether one combination of rates is
significantly better than another, provided one rate scheme represents
a restriction of another with fewer parameters. The number of parameters
needed for rate variation is the number of regional rate categories, plus
the number of user-defined rate categories less 2, plus one if the
regional rate categories have a nonzero autocorrelation.
<P>
The G (global search) option causes, after the last species is added to
the tree, each possible group to be removed and re-added. This improves the
result, since the position of every species is reconsidered. It
approximately triples the run-time of the program.
<P>
The User tree (option U) is read from a file whose default name is
<TT>intree</TT>. The trees can be multifurcating. This allows us to test the
hypothesis that a given branch has zero length.
<P>
If the U (user tree) option is chosen another option appears in
the menu, the L option. If it is selected,
it signals the program that it
should take any branch lengths that are in the user tree and
simply evaluate the likelihood of that tree, without further altering
those branch lengths. In the case of a clock, if some branches have lengths
and others do not, the program does not estimate the lengths of those that
do not have lengths given in the user tree. If any of the branches
do not have lengths, the program re-estimates the lengths of all of them.
This is done because estimating some and not others is hard in the
case of a clock.
<P>
The W (Weights) option is invoked in the usual way, with only weights 0
and 1 allowed. It selects a set of positions to be analyzed, ignoring the
others. The positions selected are those with weight 1. If the W option is
not invoked, all positions are analyzed.
The Weights (W) option
takes the weights from a file whose default name is "weights". The weights
follow the format described in the main documentation file.
<P>
The M (multiple data sets) option will ask you whether you want to
use multiple sets of weights (from the weights file) or multiple data sets
from the input file.
The ability to use a single data set with multiple weights means that
much less disk space will be used for this input data. The bootstrapping
and jackknifing tool Seqboot has the ability to create a weights file with
multiple weights. Note also that when we use multiple weights for
bootstrapping we can also then maintain different rate categories for
different sites in a meaningful way. If you use the multiple
data sets option rather than multiple weights, you should not at the
same time use the user-defined rate categories option (option C), because
the user-defined rate categories could then be associated with the
wrong sites. This is not a concern when the M option is used
by using multiple weights.
<P>
The algorithm used for searching among trees is faster than it was in
version 3.5, thanks to using a technique invented by David Swofford
and J. S. Rogers. This involves not iterating most branch lengths on most
trees when searching among tree topologies, This is of necessity a
"quick-and-dirty" search but it saves much time.
<P>
<H2>OUTPUT FORMAT</H2>
<P>
The output starts by giving the number of species, the number of amino
acid positions.
<P>
If the R (HMM rates) option is used a table of the relative rates of
expected substitution at each category of positions is printed, as well
as the probabilities of each of those rates.
<P>
There then follow the data sequences, if the user has selected the menu
option to print them out, with the base sequences printed in
groups of ten amino acids. The
trees found are printed as a rooted
tree topology. The
internal nodes are numbered arbitrarily for the sake of
identification. The number of trees evaluated so far and the log
likelihood of the tree are also given. The branch lengths in the diagram are
roughly proportional to the estimated branch lengths, except that very short
branches are printed out at least three characters in length so that the
connections can be seen. The unit of branch length is the expected
fraction of amino acids changed (so that 1.0 is 100 PAMs).
<P>
A table is printed
showing the length of each tree segment, and the time (in units of
expected amino acid substitutions per position) of each fork in the tree,
measured from the root of the tree. I have not attempted to include
code for approximate confidence limits on branch points, as I have done
for branch lengths in Proml, both because of the extreme crudeness of
that test, and because the variation of times for different forks would be
highly correlated.
<P>
The log likelihood printed out with the final tree can be used to perform
various likelihood ratio tests. One can, for example, compare runs with
different values of the relative rate of change in the active site and in
the rest of the protein to determine
which value is the maximum likelihood estimate, and what is the allowable range
of values (using a likelihood ratio test, which you will find described in
mathematical statistics books). One could also estimate the base frequencies
in the same way. Both of these, particularly the latter, require multiple runs
of the program to evaluate different possible values, and this might get
expensive.
<P>
This program makes possible a (reasonably) legitimate
statistical test of the molecular clock. To do such a test, run Proml
and Promlk on the same data. If the trees obtained are of the same
topology (when considered as unrooted), it is legitimate to compare
their likelihoods by the likelihood ratio test. In Proml the likelihood
has been computed by estimating 2n-3 branch lengths, if there are n tips
on the tree. In Promlk it has been computed by estimating n-1 branching
times (in effect, n-1 branch lengths). The difference in the number of
parameters is (2n-3)-(n-1) = n-2. To perform the test take the
difference in log likelihoods between the two runs (Proml should be the
higher of the two, barring numerical iteration difficulties) and double
it. Look this up on a chi-square distribution with n-2 degrees of
freedom. If the result is significant, the log likelihood has been
significantly increased by allowing all 2n-3 branch lengths to be
estimated instead of just n-1, and molecular clock may be rejected.
<P>
If the U (User Tree) option is used and more than one tree is supplied,
and the program is not told to assume autocorrelation between the
rates at different amino acid positions, the
program also performs a statistical test of each of these trees against the
one with highest likelihood. If there are two user trees, the test
done is one which is due to Kishino and Hasegawa (1989), a version
of a test originally introduced by Templeton (1983). In this
implementation it uses the mean and variance of
log-likelihood differences between trees, taken across amino acid
positions. If the two
trees' means are more than 1.96 standard deviations different
then the trees are
declared significantly different. This use of the empirical variance of
log-likelihood differences is more robust and nonparametric than the
classical likelihood ratio test, and may to some extent compensate for the
any lack of realism in the model underlying this program.
<P>
If there are more than two trees, the test done is an extension of
the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
that a correction for the number of trees was necessary, and they
introduced a resampling method to make this correction. In the version
used here the variances and covariances of the sum of log likelihoods across
amino acid positions are computed for all pairs of trees. To test whether the
difference between each tree and the best one is larger than could have
been expected if they all had the same expected log-likelihood,
log-likelihoods for all trees are sampled with these covariances and equal
means (Shimodaira and Hasegawa's "least favorable hypothesis"),
and a P value is computed from the fraction of times the difference between
the tree's value and the highest log-likelihood exceeds that actually
observed. Note that this sampling needs random numbers, and so the
program will prompt the user for a random number seed if one has not
already been supplied. With the two-tree KHT test no random numbers
are used.
<P>
In either the KHT or the SH test the program
prints out a table of the log-likelihoods of each tree, the differences of
each from the highest one, the variance of that quantity as determined by
the log-likelihood differences at individual sites, and a conclusion as to
whether that tree is or is not significantly worse than the best one. However
the test is not available if we assume that there
is autocorrelation of rates at neighboring positions (option A) and is not
done in those cases.
<P>
The branch lengths printed out are scaled in terms of 100 times the
expected numbers of
amino acid substitutions, scaled so that the average rate of
change, averaged over all the positions analyzed, is set to 100.0,
if there are multiple categories of positions. This means that whether or not
there are multiple categories of positions, the expected percentage of change
for very small branches is equal to the branch length. Of course,
when a branch is twice as
long this does not mean that there will be twice as much net change expected
along it, since some of the changes occur in the same position and overlie or
even reverse each
other.
underlying numbers of changes. That means that a branch of length 26
is 26 times as long as one which would show a 1% difference between
the amino acid sequences at the beginning and end of the branch, but we
would not expect the sequences at the beginning and end of the branch to be
26% different, as there would be some overlaying of changes.
<P>
Because of limitations of the numerical
algorithm, branch length estimates of zero will often print out as small
numbers such as 0.00001. If you see a branch length that small, it is really
estimated to be of zero length.
<P>
Another possible source of confusion is the existence of negative values for
the log likelihood. This is not really a problem; the log likelihood is not a
probability but the logarithm of a probability. When it is
negative it simply means that the corresponding probability is less
than one (since we are seeing its logarithm). The log likelihood is
maximized by being made more positive: -30.23 is worse than -29.14.
<P>
At the end of the output, if the R option is in effect with multiple
HMM rates, the program will print a list of what amino acid position
categories contributed the most to the final likelihood. This combination of
HMM rate categories need not have contributed a majority of the likelihood,
just a plurality. Still, it will be helpful as a view of where the
program infers that the higher and lower rates are. Note that the
use in this calculations of the prior probabilities of different rates,
and the average patch length, gives this inference a "smoothed"
appearance: some other combination of rates might make a greater
contribution to the likelihood, but be discounted because it conflicts
with this prior information. See the example output below to see
what this printout of rate categories looks like.
A second list will also be printed out, showing for each position which
rate accounted for the highest fraction of the likelihood. If the fraction
of the likelihood accounted for is less than 95%, a dot is printed instead.
<P>
Option 3 in the menu controls whether the tree is printed out into
the output file. This is on by default, and usually you will want to
leave it this way. However for runs with multiple data sets such as
bootstrapping runs, you will primarily be interested in the trees
which are written onto the output tree file, rather than the trees
printed on the output file. To keep the output file from becoming too
large, it may be wisest to use option 3 to prevent trees being
printed onto the output file.
<P>
Option 4 in the menu controls whether the tree estimated by the program
is written onto a tree file. The default name of this output tree file
is "outtree". If the U option is in effect, all the user-defined
trees are written to the output tree file.
<P>
Option 5 in the menu controls whether ancestral states are estimated
at each node in the tree. If it is in effect, a table of ancestral
sequences is printed out (including the sequences in the tip species which
are the input sequences).
The symbol printed out is for the amino acid which accounts for the
largest fraction of the likelihood at that position.
In that table, if a position has an amino acid which
accounts for more than 95% of the likelihood, its symbol printed in capital
letters (W rather than w). One limitation of the current
version of the program is that when there are multiple HMM rates
(option R) the reconstructed amino acids are based on only the single
assignment of rates to positions which accounts for the largest amount of the
likelihood. Thus the assessment of 95% of the likelihood, in tabulating
the ancestral states, refers to 95% of the likelihood that is accounted
for by that particular combination of rates.
<P>
<H2>PROGRAM CONSTANTS</H2>
<P>
The constants defined at the beginning of the program include "maxtrees",
the maximum number of user trees that can be processed. It is small (100)
at present to save some further memory but the cost of increasing it
is not very great. Other constants
include "maxcategories", the maximum number of position
categories, "namelength", the length of species names in
characters, and three others, "smoothings", "iterations", and "epsilon", that
help "tune" the algorithm and define the compromise between execution speed and
the quality of the branch lengths found by iteratively maximizing the
likelihood. Reducing iterations and smoothings, and increasing epsilon, will
result in faster execution but a worse result. These values
will not usually have to be changed.
<P>
The program spends most of its time doing real arithmetic.
The algorithm, with separate and independent computations
occurring for each pattern, lends itself readily to parallel processing.
<P>
<H2>PAST AND FUTURE OF THE PROGRAM</H2>
<P>
This program was developed in version 3.6 by Lucas Mix by combining code
from Dnamlk and from Proml.
<P>
<HR>
<P>
<H3>TEST DATA SET</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
5 13
Alpha AACGTGGCCAAAT
Beta AAGGTCGCCAAAC
Gamma CATTTCGTCACAA
Delta GGTATTTCGGCCT
Epsilon GGGATCTCGGCCC
</PRE>
</TD></TR></TABLE>
<P>
<HR>
<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
<P>
(It was run with HMM rates having gamma-distributed rates
approximated by 5 rate categories,
with coefficient of variation of rates 1.0, and with patch length
parameter = 1.5. Two user-defined rate categories were used, one for
the first 6 positions, the other for the last 7, with rates 1.0 : 2.0.
Weights were used, with sites 1 and 13 given weight 0, and all others
weight 1.)
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Amino acid sequence Maximum Likelihood method, version 3.69
5 species, 13 sites
Site categories are:
1111112222 222
Sites are weighted as follows:
01111 11111 110
Jones-Taylor-Thornton model of amino acid change
Name Sequences
---- ---------
Alpha AACGTGGCCA AAT
Beta ..G..C.... ..C
Gamma C.TT.C.T.. C.A
Delta GGTA.TT.GG CC.
Epsilon GGGA.CT.GG CCC
Discrete approximation to gamma distributed rates
Coefficient of variation of rates = 1.000000 (alpha = 1.000000)
States in HMM Rate of change Probability
1 0.264 0.522
2 1.413 0.399
3 3.596 0.076
4 7.086 0.0036
5 12.641 0.000023
Expected length of a patch of sites having the same rate = 1.500
Site category Rate of change
1 1.000
2 2.000
+Beta
|
| +Epsilon
| +-------------------------------------3
1-------2 +--------Delta
| |
| +----------Gamma
|
+------Alpha
remember: this is an unrooted tree!
Ln Likelihood = -104.53314
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
1 Alpha 0.46548 ( zero, 1.16234) **
1 Beta 0.00010 ( zero, 0.56371)
1 2 0.53585 ( zero, 1.53611) *
2 3 2.52202 ( zero, 5.51952) **
3 Epsilon 0.00010 ( zero, 0.70102)
3 Delta 0.56179 ( zero, 1.37921) **
2 Gamma 0.72465 ( zero, 1.87900) **
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
Combination of categories that contributes the most to the likelihood:
1122111111 111
Most probable category at each site if > 0.95 probability ("." otherwise)
....1....1 1..
Probable sequences at interior nodes:
node Reconstructed sequence (caps if > 0.95)
1 .AGGTCGCCA AA.
Beta AAGGTCGCCA AAC
2 .AggTCGCCA CA.
3 .GGATCTCGG CC.
Epsilon GGGATCTCGG CCC
Delta GGTATTTCGG CCT
Gamma CATTTCGTCA CAA
Alpha AACGTGGCCA AAT
</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>
|