<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>contml</TITLE>
<META NAME="description" CONTENT="contml">
<META NAME="keywords" CONTENT="contml">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.696
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>Contml - Gene Frequencies and Continuous Characters Maximum Likelihood method</H1>
</DIV>
<P>
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms <a href="main.html#copyright">here</a>.
<P>
This program estimates phylogenies by the restricted maximum likelihood method
based on the Brownian motion model. It is based on the model of Edwards and
Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966),
Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work
leading to efficient algorithms. Contml uses restricted maximum
likelihood estimation (REML), which is the criterion used by Felsenstein
(1973b). The actual algorithm is an iterative EM Algorithm (Dempster,
Laird, and Rubin, 1977) which is guaranteed to always give increasing
likelihoods. The algorithm is described in detail in a paper of mine
(Felsenstein, 1981c), which you should definitely consult if you are
going to use this program. Some simulation tests of it are given
by Rohlf and Wooten (1988) and Kim and Burgman (1988).
<P>
The default (gene frequency) mode treats the input as gene frequencies at a
series of loci, and
square-root-transforms the allele frequencies (constructing the frequency of
the missing allele at each locus first). This enables us to use the
Brownian motion model on the resulting coordinates, in an approximation
equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure
of genetic distance and taking that to give distance between particles
undergoing pure Brownian motion. It assumes that each locus evolves
independently by pure genetic drift.
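<P>
As a rough illustration of the transformation just described, the following
Python sketch fills in the missing allele at one locus and takes square roots.
It is only a sketch of the idea, not Contml's actual code: the function names
are made up for this example, and the distance it prints is the plain
straight-line chord between two points on the unit sphere, whereas published
forms of the chord measure may include additional scaling constants.

```python
import math

def sqrt_transform(freqs_given):
    """Square-root-transform one locus (hypothetical helper).

    freqs_given holds the frequencies of all alleles but one; the
    missing allele's frequency is reconstructed as 1 minus their sum.
    """
    missing = 1.0 - sum(freqs_given)
    full = list(freqs_given) + [max(missing, 0.0)]
    return [math.sqrt(p) for p in full]

def chord_length(x, y):
    """Euclidean distance between two transformed loci."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Locus 1 of the test data set: European 0.2868, African 0.1356
european = sqrt_transform([0.2868])
african = sqrt_transform([0.1356])

# The squared coordinates sum to 1, so each point lies on a unit sphere
print(sum(c * c for c in european))
print(chord_length(european, african))
```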
<P>
The alternative continuous characters mode (menu option C) treats the input
as a series of coordinates of each species in N dimensions. It assumes
that we have transformed the characters to remove correlations and to
standardize their variances.
<P>
<H2>A word about microsatellite data</H2>
<P>
Many current users of Contml use it to analyze microsatellite data.
There are three ways to do this:
<P>
<UL>
<LI> Code each copy number as an allele, and feed in the
frequencies of these alleles. As Contml's gene frequency mode assumes that
all change is by genetic drift, this amounts to assuming that no copy number arises by
mutation during the divergence of the populations. Since microsatellite
loci have very high mutation rates, this is questionable.
<LI> Use some other
program, one not in the PHYLIP package, to compute distances among the
populations. Some of the programs that can do this are RSTCalc, poptrfdos,
Microsat, and Populations. Links to them can be found at my Phylogeny
Programs web site at <A HREF="http://evolution.gs.washington.edu/phylip/software.html">
<CODE>http://evolution.gs.washington.edu/phylip/software.html</CODE></A>.
<P>
Those distance measures allow for mutation during the divergence of the
populations. But even they are not perfect -- they do not allow us to use
all the information contained in the gene frequency differences
within a copy number allele. There is a need for a more complete
statistical treatment of inference of phylogenies from microsatellite models,
ones that take both mutation and genetic drift fully into account.
<LI> Alternatively, there is the Brownian motion approximation to mean population
copy number. This is described in my book (Felsenstein, 2004, Chapter 15,
pp. 242-245), and it is implicit also in the microsatellite distances.
Each locus is coded as a single continuous character, the mean of the copy
number at that microsatellite locus in that species. Thus if the species
(or population) has frequencies 0.10, 0.24, 0.60, and 0.06 of alleles that
have 18, 19, 20, and 21 copies, it is coded as having
<P>
0.10 &times; 18 + 0.24 &times; 19 + 0.60 &times; 20 + 0.06 &times; 21 = 19.62
<P>
copies. These values can, I believe, be calculated by a spreadsheet program.
Each microsatellite is represented by one character, and the continuous
character mode of Contml is used (not the gene frequencies mode). This
coding allows for mutation that changes copy number. It
does not make complete use of all data, but neither does the treatment
of microsatellite gene frequencies as changing only by genetic drift.
</UL>
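<P>
The mean copy number coding in the third approach can be computed with a few
lines of code instead of a spreadsheet. The sketch below reproduces the worked
example above; the function name is hypothetical and the check on the sum of
the frequencies is an added safeguard, not something Contml requires.

```python
def mean_copy_number(freqs, copies):
    """Frequency-weighted mean copy number at one microsatellite locus."""
    total = sum(freqs)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("allele frequencies should sum to 1")
    return sum(f * c for f, c in zip(freqs, copies))

# Frequencies 0.10, 0.24, 0.60, 0.06 for alleles with 18, 19, 20, 21 copies
value = mean_copy_number([0.10, 0.24, 0.60, 0.06], [18, 19, 20, 21])
print(round(value, 2))  # 19.62
```

Each locus then contributes one such value per species to a continuous
characters data set.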
<P>
<H2>The input file</H2>
<P>
The input file is as described in the continuous characters
documentation file above. Options are selected using a menu:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Continuous character Maximum Likelihood method version 3.69

Settings for this run:
  U                       Search for best tree?  Yes
  C  Gene frequencies or continuous characters?  Gene frequencies
  A   Input file has all alleles at each locus?  No, one allele missing at each
  O                              Outgroup root?  No, use as outgroup species 1
  G                      Global rearrangements?  No
  J           Randomize input order of species?  No. Use input order
  M                 Analyze multiple data sets?  No
  0         Terminal type (IBM PC, ANSI, none)?  ANSI
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes
  3                              Print out tree  Yes
  4             Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change
</PRE>
</TD></TR></TABLE>
<P>
Option U is the usual User Tree option. Options C (Continuous Characters)
and A (All alleles present) have been described
in the Gene Frequencies and Continuous Characters Programs documentation
file. The options G, J, O and M are the usual Global Rearrangements, Jumble
order of species, Outgroup root, and Multiple Data Sets options.
<P>
The M (Multiple data sets) option does not allow multiple sets of weights
instead of multiple data sets, as there are no weights in this program.
<P>
The G and J options have no effect if the User Tree option is selected. User
trees are given with a trifurcation (three-way split) at the base. They
can start from any interior node. Thus the tree:
<P>
<PRE>
A
!
*--B
!
*-----C
!
*--D
!
E
</PRE>
<P>
can be represented by any of the following:
<P>
<PRE>
(A,B,(C,(D,E)));
((A,B),C,(D,E));
(((A,B),C),D,E);
</PRE>
<P>
(there are of course 69 other representations as well obtained from these
by swapping the order of branches at an interior node).
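<P>
One way to convince yourself that the three strings above describe the same
unrooted tree is to compare the sets of tips that their interior branches
separate. The Python sketch below does this with a minimal Newick reader that
handles only names, commas and parentheses (no branch lengths); all function
names are made up for this illustration. It also counts the representations:
a fully resolved unrooted tree with n tips has n - 2 interior nodes to start
from, 6 orderings of the three branches at the base, and 2 orderings at each of
the remaining n - 3 interior nodes, which for five tips gives the 3 + 69 = 72
strings mentioned above.

```python
def parse_newick(s):
    """Parse a Newick string without branch lengths into nested lists."""
    s = s.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # skip ")"
            return children
        start = pos
        while pos != len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    return node()

def tips(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def splits(tree):
    """Interior-branch bipartitions, each stored as the side lacking a reference tip."""
    all_tips = tips(tree)
    ref = min(all_tips)
    found = set()

    def visit(sub):
        if isinstance(sub, str):
            return
        side = tips(sub)
        if len(side) >= 2 and len(all_tips) - len(side) >= 2:
            found.add(all_tips - side if ref in side else side)
        for child in sub:
            visit(child)

    visit(tree)
    return found

forms = ["(A,B,(C,(D,E)));", "((A,B),C,(D,E));", "(((A,B),C),D,E);"]
split_sets = [splits(parse_newick(f)) for f in forms]
print(split_sets[0] == split_sets[1] == split_sets[2])

n = 5
n_representations = (n - 2) * 6 * 2 ** (n - 3)
print(n_representations)
```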
<P>
<H2>The output file</H2>
<P>
The output has a standard appearance. The topology of the tree
is given by an unrooted tree diagram. The lengths (in
expected amounts of variance) are given in a table below the topology,
and a rough confidence interval is given for each length. Negative lower
bounds on length indicate that rearrangements may be acceptable.
<P>
The units of length are amounts of expected accumulated variance (not
time). The
log likelihood (natural log) of each tree is also given, and it is
indicated how many topologies have been tried. The tree does not
necessarily have all tips contemporary. The log likelihood may be
either positive or negative, which simply corresponds to whether the
density function does or does not exceed 1, so a negative log
likelihood does not indicate any error. The log likelihood allows
various formal likelihood ratio hypothesis tests. The description of
the tree includes approximate standard errors on the lengths of segments
of the tree. These are calculated by considering only the curvature of
the likelihood surface as the length of the segment is varied, holding
all other lengths constant. As such they are most probably underestimates of
the variance, and hence may give too much confidence in the given tree.
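<P>
The remark about the sign of the log likelihood can be illustrated with a
single normal density. This is only a toy calculation in Python, not the
likelihood that Contml computes: when the variance is small the density at
the mean exceeds 1 and its logarithm is positive, and when the variance is
large the logarithm is negative.

```python
import math

def normal_log_density(x, mean, variance):
    """Natural log of the normal density at x."""
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (x - mean) ** 2 / (2.0 * variance))

narrow = normal_log_density(0.0, 0.0, 0.001)  # density about 12.6
wide = normal_log_density(0.0, 0.0, 10.0)     # density about 0.126
print(narrow, wide)
```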
<P>
One should use caution in interpreting the likelihoods that are printed
out. If the model is wrong, it will not be possible to use the
likelihoods to make formal statistical statements. Thus, if gene
frequencies are being analyzed, but the gene frequencies change not only
by genetic drift, but also by mutation, the model is not correct. It
would be as well-justified in this case to use Gendist to compute the
Nei (1972) genetic distance and then use Fitch, Kitsch or Neighbor to make a
tree. If continuous characters are being analyzed, but the
characters have not been transformed to new coordinates that evolve
independently and at equal rates, then the model is also violated and no
statistical analysis is possible. Doing such a transformation is not
easy, and usually not even possible.
<P>
If the U (User Tree) option is used and more than one tree is supplied,
the program also performs a statistical test of each of these trees against the
one with highest likelihood. If there are two user trees, the test
done is one which is due to Kishino and Hasegawa (1989), a version
of a test originally introduced by Templeton (1983). In this
implementation it uses the mean and variance of
log-likelihood differences between trees, taken across loci. If the two
trees' means are more than 1.96 standard deviations different, then the trees are
declared significantly different. This use of the empirical variance of
log-likelihood differences is more robust and nonparametric than the
classical likelihood ratio test, and may to some extent compensate for
any lack of realism in the model underlying this program.
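<P>
The two-tree comparison described above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the code used in
Contml: the per-locus log-likelihoods are invented numbers, the function name
is hypothetical, and the standard deviation of the summed difference is
estimated as the square root of the number of loci times the sample variance
of the per-locus differences.

```python
import math

def kh_test(loglik_a, loglik_b, critical=1.96):
    """Compare two trees from per-locus log-likelihoods (illustrative sketch)."""
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance per locus
    total = sum(diffs)              # difference in total log-likelihood
    sd_total = math.sqrt(n * var)   # standard deviation of that total
    return total, sd_total, abs(total / sd_total) > critical

# Invented per-locus log-likelihoods for two user trees at ten loci
tree1 = [3.9, 4.1, 3.7, 4.0, 3.8, 4.2, 3.6, 4.0, 3.9, 3.5]
tree2 = [3.5, 4.0, 3.1, 3.9, 3.6, 3.7, 3.5, 3.4, 3.8, 3.3]

total, sd_total, significant = kh_test(tree1, tree2)
print(total, sd_total, significant)
```

The sketch does not guard against the case where all per-locus differences
are equal, which would make the estimated standard deviation zero.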
<P>
If there are more than two trees, the test done is an extension of
the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
that a correction for the number of trees was necessary, and they
introduced a resampling method to make this correction. The version
used here is a multivariate normal approximation to their test; it is
due to Shimodaira (1998). The variances and covariances of the sum of
log likelihoods across loci are computed for all pairs of trees. To test
whether the difference between each tree and the best one is larger than
could have been expected if they all had the same expected log-likelihood,
log-likelihoods for all trees are sampled with these covariances and equal
means (Shimodaira and Hasegawa's "least favorable hypothesis"),
and a P value is computed from the fraction of times the difference between
the tree's value and the highest log-likelihood exceeds that actually
observed. Note that this sampling needs random numbers, and so the
program will prompt the user for a random number seed if one has not
already been supplied. With the two-tree KHT test no random numbers
are used.
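<P>
The sampling step of the multiple-tree test can be sketched in Python using
only the standard library. Everything here is an assumption made for
illustration rather than a description of Contml's code: the data are
invented, the function name is hypothetical, and the multivariate normal draws
are generated by weighting the centered per-locus values with independent
standard normal numbers, which is one convenient way to obtain the required
covariances without a matrix decomposition.

```python
import math
import random

def sh_test(per_locus, replicates=10000, seed=1):
    """Normal-approximation SH test (illustrative sketch).

    per_locus[t][i] is the log-likelihood of tree t at locus i.
    Returns one P value per tree.
    """
    rng = random.Random(seed)
    n_trees = len(per_locus)
    n_loci = len(per_locus[0])
    totals = [sum(row) for row in per_locus]
    observed = [max(totals) - t for t in totals]

    # Centered per-locus values. Summing them with independent standard
    # normal weights gives a zero-mean multivariate normal draw whose
    # covariance matrix is (n_loci - 1) times the sample covariance.
    means = [t / n_loci for t in totals]
    centered = [[v - m for v in row] for row, m in zip(per_locus, means)]
    scale = math.sqrt(n_loci / (n_loci - 1))  # rescale to n_loci * covariance

    exceed = [0] * n_trees
    for _ in range(replicates):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_loci)]
        draw = [scale * sum(w * c for w, c in zip(z, row)) for row in centered]
        best = max(draw)
        for t in range(n_trees):
            if best - draw[t] >= observed[t]:
                exceed[t] += 1
    return [count / replicates for count in exceed]

# Invented per-locus log-likelihoods for three user trees at eight loci
trees = [
    [3.9, 4.1, 3.7, 4.0, 3.8, 4.2, 3.6, 4.0],
    [3.8, 4.0, 3.8, 3.9, 3.7, 4.1, 3.6, 4.1],
    [3.1, 3.6, 3.0, 3.5, 3.2, 3.4, 3.1, 3.3],
]
p_values = sh_test(trees, replicates=2000)
print(p_values)
```

Because every draw has the same (zero) mean for all trees, the simulation
corresponds to the equal-means hypothesis described above. The fixed seed
plays the role of the random number seed that the program asks for.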
<P>
In either the KHT or the SH test the program
prints out a table of the log-likelihoods of each tree, the differences of
each from the highest one, the variance of that quantity as determined by
the log-likelihood differences at individual loci, and a conclusion as to
whether that tree is or is not significantly worse than the best one.
<P>
One problem which sometimes arises is that the program is fed two species
(or populations) with identical transformed gene frequencies: this can
happen if sample sizes are small and/or many loci are monomorphic. In
this case the program "gets its knickers in a twist" and can divide by
zero, usually causing a crash. If you suspect that this has happened,
check for two species with identical coordinates. If you find them,
eliminate one from the problem: the two must always show up as being at the
same point on the tree anyway.
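<P>
A check for such duplicates can be done before running the program. The
following Python sketch compares the frequencies of every pair of species;
the function name, the dictionary layout and the tolerance parameter are
assumptions made for this example. Identical input frequencies necessarily
give identical transformed coordinates.

```python
from itertools import combinations

def identical_pairs(data, tol=0.0):
    """Return pairs of names whose coordinate vectors coincide (within tol)."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(data.items(), 2):
        same_length = len(a) == len(b)
        if same_length and not any(abs(x - y) > tol for x, y in zip(a, b)):
            pairs.append((name_a, name_b))
    return pairs

# Invented example: populations P1 and P3 have the same frequencies
example = {
    "P1": [0.25, 1.0, 0.5],
    "P2": [0.30, 1.0, 0.5],
    "P3": [0.25, 1.0, 0.5],
}
print(identical_pairs(example))
```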
<P>
The constants
available for modification at the beginning of the
program include "epsilon1",
a small quantity used in the iterations of branch lengths,
"epsilon2", another not quite so small quantity used to check
whether gene frequencies that were fed in for all alleles do not add up to 1,
"smoothings", the number of passes through a
given tree in the iterative likelihood maximization for a given topology,
"maxtrees", the maximum number of user trees that will be used for the
Kishino-Hasegawa-Templeton test, and
"namelength", the length of species names.
There is no provision in this program for saving multiple trees that are
tied for having the highest likelihood, mostly because an exact tie is
unlikely anyway.
<P>
The algorithm does not run as quickly as the discrete character
methods but is not enormously slower. Like them, its execution time
should rise as the cube of the number of species.
<P>
<H3>TEST DATA SET</H3>
<P>
This data set was compiled by me from the compilation of human gene
frequencies by Mourant (1976). It appeared in a paper of mine
(Felsenstein, 1981c) on maximum likelihood phylogenies from gene
frequencies. The names of the loci and alleles are given in that
paper.
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
5 10
2 2 2 2 2 2 2 2 2 2
European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976
</PRE>
</TD></TR></TABLE>
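<P>
For readers who want to inspect such a data set outside PHYLIP, the layout of
the test data can be read with a short Python sketch. It is a simplified,
hypothetical reader and not the parser used by Contml: it assumes
whitespace-separated values, single-word species names (PHYLIP itself reads
names as a fixed-width field), and one allele omitted per locus, as when menu
option A is left at its default.

```python
def parse_gene_freqs(text):
    """Parse a gene-frequency data set (simplified, hypothetical reader)."""
    tokens = text.split()
    n_species, n_loci = int(tokens[0]), int(tokens[1])
    alleles = [int(t) for t in tokens[2:2 + n_loci]]
    per_species = sum(a - 1 for a in alleles)  # values listed for each species
    pos = 2 + n_loci
    data = {}
    for _ in range(n_species):
        name = tokens[pos]
        values = [float(t) for t in tokens[pos + 1:pos + 1 + per_species]]
        data[name] = values
        pos += 1 + per_species
    return alleles, data

sample = """
5 10
2 2 2 2 2 2 2 2 2 2
European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976
"""

alleles, data = parse_gene_freqs(sample)
print(len(data), len(data["European"]))
```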
<P>
<HR>
<P>
<H3>TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON)</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Continuous character Maximum Likelihood method version 3.69
5 Populations, 10 Loci
Numbers of alleles at the loci:
------- -- ------- -- --- -----
2 2 2 2 2 2 2 2 2 2
Name Gene Frequencies
---- ---- -----------
locus: 1 2 3 4 5 6
7 8 9 10
European 0.28680 0.56840 0.44220 0.42860 0.38280 0.72850
0.63860 0.02050 0.80550 0.50430
African 0.13560 0.48400 0.06020 0.03970 0.59770 0.96750
0.95110 0.06000 0.75820 0.62070
Chinese 0.16280 0.59580 0.72980 1.00000 0.38110 0.79860
0.77820 0.07260 0.74820 0.73340
American 0.01440 0.69900 0.32800 0.74210 0.66060 0.86030
0.79240 0.00000 0.80860 0.86360
Australian 0.12110 0.22740 0.58210 1.00000 0.20180 0.90000
0.98370 0.03960 0.90970 0.29760
  +-----------------------------------------------------------African
  !
  !             +-------------------------------Australian
  1-------------3
  !             !     +-----------------------American
  !             +-----2
  !                   +Chinese
  !
  +European
remember: this is an unrooted tree!
Ln Likelihood = 38.71914
 Between     And             Length      Approx. Confidence Limits
 -------     ---             ------      ------- ---------- ------
   1       African        0.09693444   (  0.03123910,  0.19853605)
   1          3           0.02252816   (  0.00089799,  0.05598045)
   3       Australian     0.05247406   (  0.01177094,  0.11542376)
   3          2           0.00945315   ( -0.00897717,  0.03795670)
   2       American       0.03806240   (  0.01095938,  0.07997877)
   2       Chinese        0.00208822   ( -0.00960622,  0.02017434)
   1       European       0.00000000   ( -0.01627246,  0.02516630)
</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>