/usr/share/doc/phylip/html/doc/discrete.html is in phylip-doc 1:3.696+dfsg-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>discrete</TITLE>
<META NAME="description" CONTENT="discrete">
<META NAME="keywords" CONTENT="discrete">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.696
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>DOCUMENTATION FOR (0,1) DISCRETE CHARACTER PROGRAMS</H1>
</DIV>
<P>
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms <a href="main.html#copyright">here</a>.
<P>
These programs are intended for the use of morphological
systematists who are dealing with discrete characters,
or by molecular evolutionists dealing with presence-absence data on
restriction sites. One of the programs (Pars) allows multistate
characters, with up to 8 states, plus the unknown state symbol "?".
For the others, the characters
are assumed to be coded into a series of (0,1) two-state characters. For
most of the programs there are two other states possible, "P", which
stands for the state of Polymorphism for both states (0 and 1), and "?",
which stands for the state of ignorance: it is the state "unknown", or
"does not apply". The state "P" can also be denoted by "B", for "both".
<P>
There is a method invented by Sokal and Sneath (1963) for linear
sequences of character states, and fully developed for branching sequences
of character states
by Kluge and Farris (1969) for recoding a multistate character
into a series of two-state (0,1) characters. Suppose we had a character
with four states whose character-state tree had the rooted form:
<P>
<PRE>
1 ---> 0 ---> 2
|
|
V
3
</PRE>
<P>
<P>
so that 1 is the ancestral state and 0, 2 and 3 derived states. We can
represent this as three two-state characters:
<P>
<PRE>
Old State New States
--- ----- --- ------
0 001
1 000
2 011
3 101
</PRE>
<P>
The three new states correspond to the three arrows in the above character
state tree. Possession of one of the new states corresponds to whether or not
the old state had that arrow in its ancestry. Thus the first new state
corresponds to the bottommost arrow, which only state 3 has in its ancestry,
the second state to the rightmost of the top arrows, and the third state to
the leftmost top arrow. This coding will guarantee that the number of times
that states arise on the tree (in programs Mix, Move, Penny and Boot)
or the number of polymorphic states in a tree segment (in the Polymorphism
option of Dollop, Dolmove, Dolpenny and Dolboot) will correctly
correspond to what would have been the case had our programs been able to take
multistate characters into account. Although I have shown the above character
state tree as rooted, the recoding method works equally well on unrooted
multistate characters as long as the connections between the states are known
and contain no loops.
<P>
However, in the default option of programs Dollop, Dolmove, Dolpenny
and Dolboot the multistate recoding does not necessarily work properly, as it
may lead the program to reconstruct nonexistent state combinations such as
010. An example of this problem is given in my paper on alternative
phylogenetic methods (1979).
<P>
If you have multistate character data where the states are connected in a
branching "character state tree" you may want to do the binary recoding
yourself. Thanks to Christopher Meacham, the package contains
a program, Factor, which will do the recoding itself. For details see
the documentation file for Factor.
<P>
We now also have the program Pars, which can do parsimony for unordered
character states.
<P>
<H2>COMPARISON OF METHODS</H2>
<P>
The methods used in these programs make different assumptions about
evolutionary rates, probabilities of different kinds of events, and our
knowledge about the characters or about the character state trees.
Basic references on these assumptions are my 1979, 1981b and 1983b
papers, particularly the latter. The
assumptions of each method are briefly described in the documentation
file for the corresponding program. In most cases my assertions about what are
the assumptions of these methods are challenged by others, whose papers I also
cite at that point. Personally, I believe that they are wrong and I am
right. I must emphasize the importance of
understanding the assumptions underlying the methods you are using. No
matter how fancy the algorithms, how maximum the likelihood or how
minimum the number of steps, your results can only be as good as the
correspondence between biological reality and your assumptions!
<P>
<H2>INPUT FORMAT</H2>
<P>
The input format is as described in the general documentation file. The
input starts with a line containing the number of
species and the number of characters.
<P>
In Pars, each character can have up to 8 states plus a "?" state. In any
character, the first 8 symbols encountered will be taken to represent
these states. Any of the digits 0-9, letters A-Z and a-z, and even symbols
such as + and -, can be used (and in fact which 8 symbols are used can
be different in different characters).
<P>
In the other discrete characters programs the allowable states are,
0, 1, P, B, and ?. Blanks
may be included between the states (i. e. you can have a
species whose data is DISCOGLOSS0 1 1 0 1 1 1). It is possible for
extraneous information to follow the end of the character state data on
the same line. For example, if there were 7 characters in the data set,
a line of species data could read "DISCOGLOSS0110111 Hello there").
<P>
The discrete character data can continue to a new line whenever needed.
The characters are not in the "aligned" or "interleaved" format used by the
molecular sequence programs: they have the name and entire set of characters
for one species, then the name and entire set of characters for the next
one, and so on. This is known as the sequential format. Be particularly
careful when you use restriction sites
data, which can be in either the aligned or the sequential format for use in
Restml but must be in the sequential format for these discrete character
programs.
<P>
For Pars the discrete character data can be in either Sequential or
Interleaved format; the latter is the default.
<P>
Errors in the input data will often be detected by the programs, and this will
cause them to issue an error message such as 'BAD OUTGROUP NUMBER: ' together
with information as to which species, character, or in this case outgroup
number is the incorrect one. The program will then terminate; you will have
to look at the data and figure out what went wrong and fix it. Often an error
in the data causes a lack of synchronization between what is in the data file
and what the program thinks is to be there. Thus a missing character may
cause the program to read part of the next species name as a character and
complain about its value. In this type of case you should look for the error
earlier in the data file than the point about which the program is
complaining.
<P>
<H2>OPTIONS GENERALLY AVAILABLE</H2>
<P>
Specific information on options will be given in the documentation
file associated with each program. However, some options occur in many
programs. Options are selected from the menu in each
program.
<P>
<UL>
<LI>The A (Ancestral states) option. This indicates that we are
specifying the ancestral states for each character. In the menu the
ancestors (A) option must be selected.
An ancestral states input file is read, whose default name is
<TT>ancestors</TT>. It contains
a line or lines giving the ancestral states for each character.
These may be 0, 1 or ?, the latter
indicating that the ancestral state is unknown.
<P>
An example is:
<P>
001??11
<P>
The ancestor information can be continued to a new line and can have blanks
between any of the characters in the same way that species character data
can.
In the program Clique the ancestor is instead to be included as a
regular species and
no A option is available.
<P>
<LI>The F (Factors) option. This is used in programs Move, Dolmove,
and Factor. It specifies which binary characters correspond
to which multistate characters. To use the F option you
choose the F option in the program menu. After that the program
will read a factors file (default name <TT>factors</TT>)
which consists of a line or lines containing a symbol
for each binary character. The
symbol can be anything, provided that it is the same for binary characters
that correspond to the same multistate character, and changes between
multistate characters. A good practice is to make it the lower-order digit
of the number of the multistate character.
<P>
For example, if there were 20 binary characters that had been generated by
nine multistate characters having respectively 4, 3, 3, 2, 1, 2, 2, 2, and 1
binary factors you would make the factors file be:
<P>
11112223334456677889
<P>
although it could equivalently be:
<P>
aaaabbbaaabbabbaabba
<P>
All that is important is that the symbol
for each binary character change only when adjacent binary characters
correspond to different mutlistate characters. The factors
file contents
can continue to a new line at any time except during the initial characters
filling out the length of a species name.
<P>
<LI>The J (Jumble) option. This causes the species to be entered into the
tree in a random order rather than in their order in the input file. The
program prompts you for a random number seed. This option is described in
the main documentation file.
<P>
<LI>The M (Multiple data sets) option. This has also been described in the
main documentation file. It is not to be confused with the M option specified
in the input file, which is the Mixture of methods option (yes, I know
this is confusing).
<P>
<LI>The O (outgroup) option. This has also already been discussed in the
general documentation file. It specifies the number of the particular species
which will be used as the outgroup in rerooting the final tree when it is
printed out. It will not have any effect if the tree is already rooted or is
a user-defined tree. This option is not available in Dollop, Dolmove,
or Dolpenny, which always infer a rooted tree, or Clique, which
requires you to work out the rerooting by hand. The menu selection will
cause you to be prompted for the number of the outgroup.
<P>
<LI>The T (threshold) option. This sets a threshold such that if the
number of steps counted in a character is higher than the threshold, it
will be taken to be the threshold value rather than the actual number of
steps. This option has already been described in the main documentation
file. The user is prompted for the threshold value. My 1981 paper
(Felsenstein, 1981b)
explains the logic behind the Threshold option, which is an attractive
alternative to successive weighting of characters.
<P>
<LI>The U (User tree) option. This has already been described in the
main documentation file. For all of these programs user trees are to be
specified as bifurcating trees, even in the cases where the tree that
is inferred by the programs is to be regarded as unrooted.
<P>
<LI>The W (Weights) option. This allows us to specify weights on the
characters, including the possibility of omitting characters from the
analysis. It has already been described in the main documentation file.
<P>
<LI>The X (miXture) option. In the programs Mix, Move, and Penny
the user can specify for each character which parsimony method is
in effect. This is done by selecting menu option X (not M) and having
an input mixture file, whose default name is <TT>mixture</TT>.
It contains a line or lines with one letter for
each character. These letters are C or S if the character is to
be reconstructed according to Camin-Sokal parsimony, W or ? if the
character is to be reconstructed according to Wagner parsimony. So if
there are 20 characters the line giving the mixture might look like this:
<P>
<PRE>
WWWCC WWCWC
</PRE>
<P>
Note that blanks in the seqence of characters (after the first ones that
are as long as the species names) will be ignored, and the information
can go on to a new line at any point. So this could equally well have been
specified by
<P>
<PRE>
WW
CCCWWCWC
</PRE>
</UL>
<P>
<H2>INFORMATION IN THE OUTPUT</H2>
<P>
On the line in that table corresponding to each branch of the tree will also
be printed "yes", "no" or "maybe" as an answer to the question of whether this
branch is of nonzero length. If there is no evidence that any character has
changed in that branch, then "no" will be printed. If there is definite
evidence that one has changed, then "yes" will be printed. If the matter is
ambiguous, then "maybe" will be printed. You should keep in mind that all of
these conclusions assume that we are only interested in the assignment of
states that require the least amount of change. In reality, the confidence
limit on tree topology usually includes many different topologies, and
presumably also then the confidence limits on amounts of change in branches
are also very broad.
<P>
In addition to the table showing numbers of events, a table may be printed out
showing which ancestral state causes the fewest events for each
character. This will not always be done, but only when the tree is rooted and
some ancestral states are unknown. This can be used to infer states of
occurred and making it easy for the user to reconstruct all the alternative
patterns of the characters states in the hypothetical ancestral nodes.
In Pars you can, using the menu, turn off this dot-differencing
convention and see all states at all hypothetical ancestral nodes of the tree.
<P>
If you select the proper menu option, a table of the number of events
required in each character can also be printed, to help in reconstructing the
placement of changes on the tree.
<P>
This table may not be obvious at first. A typical example looks like
this:
<PRE>
steps in each character:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 2 2 2 2 1 1 2 2 1
10! 1 2 3 1 1 1 1 1 1 2
20! 1 2 2 1 2 2 1 1 1 2
30! 1 2 1 1 1 2 1 3 1 1
40! 1
</PRE>
The numbers across the top and down the side indicate which character is being
referred to. Thus character 23 is column "3" of row "20" and has 2 steps in
this case.
<P>
I cannot emphasize too strongly that just because the tree diagram which
the program prints out contains a particular branch MAY NOT MEAN THAT WE HAVE
EVIDENCE THAT THE BRANCH IS OF NONZERO LENGTH.
In some of the older programs, the procedure which prints out
the tree cannot cope with a trifurcation, nor can the internal data structures
used in some of my programs. Therefore, even when we have no resolution and
a multifurcation, successive bifurcations may be printed out, although some of
the branches shown will in fact actually be of zero length. To find out which,
you will have to work out character by character where the placements of the
changes on the tree are, under all possible ways that the changes can be placed
on that tree.
<P>
In Pars, Mix, Penny, Dollop, and Dolpenny the trees will be (if the user selects
the option to see them) accompanied by tables showing the reconstructed states
of the characters in the hypothetical ancestral nodes in the interior of the
tree. This will enable you to reconstruct where the changes were in each of
the characters. In some cases the state shown in an interior node will be "?",
which means that either 0 or 1 would be possible at that point. In such cases
you have to work out the ambiguity by hand. A unique assignment of locations
of changes is often not possible in the case of the Wagner parsimony method.
There may be multiple ways of assigning changes to segments of the tree with
that method. Printing only one would be misleading, as it might imply that
certain segments of the tree had no change, when another equally valid
assignment would put changes there. It must be emphasized that all these
multiple assignments have exactly equal numbers of total changes, so that none
is preferred over any other.
<P>
I have followed the convention of having a "." printed out in the table of
character states of the hypothetical ancestral nodes whenever a state is 0 or 1
and its immediate ancestor is the same. This has the effect of highlighting
the places where changes might have occurred and making it easy for the user to
reconstruct all the alternative patterns of the characters states in the
hypothetical ancestral nodes.
In Pars you can, using the menu, turn off this dot-differencing
convention and see all states at all hypothetical ancestral nodes of the tree.
<P>
On the line in that table corresponding to each branch of the tree will also
be printed "yes", "no" or "maybe" as an answer to the question of whether this
branch is of nonzero length. If there is no evidence that any character has
changed in that branch, then "no" will be printed. If there is definite
evidence that one has changed, then "yes" will be printed. If the matter is
ambiguous, then "maybe" will be printed. You should keep in mind that all of
these conclusions assume that we are only interested in the assignment of
states that requires the least amount of change. In reality, the confidence
limit on tree topology usually includes many different topologies, and
presumably also then the confidence limits on amounts of change in branches
are also very broad.
<P>
In addition to the table showing numbers of events, a table may be printed out
showing which ancestral state causes the fewest events for each
character. This will not always be done, but only when the tree is rooted and
some ancestral states are unknown. This can be used to infer states of
ancestors. For example, if you use the O (Outgroup) and A (Ancestral states)
options together, with at least some of the ancestral states being given as
"?", then inferences will be made for those characters, as the outgroup makes
the tree rooted if it was not already.
<P>
In programs Mix and Penny, if you are using the Camin-Sokal parsimony option
with ancestral state "?" and it turns out that the program cannot decide
between ancestral states 0 and 1, it will fail to even attempt reconstruction
of states of the hypothetical ancestors, printing them all out as "." for
those characters. This is done for internal bookkeeping reasons -- to
reconstruct their changes would require a fair amount of additional code and
additional data structures. It is not too hard to reconstruct the internal
states by hand, trying the two possible ancestral states one after the
other. A similar comment applies to the use of ancestral state "?" in the
Dollo or Polymorphism parsimony methods (programs Dollop and Dolpenny) which
also can result in a similar hesitancy to print the estimate of the states of
the hypothetical ancestors. In all of these cases the program will print "?"
rather than "no" when it describes whether there are any changes in a branch,
since there might or might not be changes in those characters which are not
reconstructed.
<P>
For further information see the documentation files for the
individual programs.
</BODY>
</HTML>
|