/usr/share/doc/phylip/html/doc/protpars.html is in phylip-doc 1:3.696+dfsg-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>protpars</TITLE>
<META NAME="description" CONTENT="protpars">
<META NAME="keywords" CONTENT="protpars">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.696
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>Protpars -- Protein Sequence Parsimony Method</H1>
</DIV>
<P>
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms <a href="main.html#copyright">here</a>.
<P>
</EM>
<P>
This program infers an unrooted phylogeny from protein sequences, using a
new method intermediate between the approaches of Eck and Dayhoff (1966) and
Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to
any other, and counted the number of such changes needed to evolve the
protein sequences on each given phylogeny. This has the problem that it
allows replacements which are not consistent with the genetic code, counting
them equally with replacements that are consistent. Fitch, on the other hand,
counted the minimum number of nucleotide substitutions that would be
needed to achieve the given protein sequences. This counts silent
changes equally with those that change the amino acid.
<P>
The present method insists that any changes of amino acid be consistent
with the genetic code so that, for example, lysine is allowed to change
to methionine but not to proline. However, changes between two amino acids
via a third are allowed and counted as two changes if each of the two
replacements is individually allowed. This sometimes allows changes that
at first sight you would think should be outlawed. Thus we can change from
phenylalanine to glutamine via leucine in two steps
total. Consulting the genetic code, you will find that there is a leucine
codon one step away from a phenylalanine codon, and a leucine codon one
step away from glutamine. But they are not the same leucine codon. It
actually takes three base substitutions to get from either of the
phenylalanine codons TTT and TTC to either of the glutamine codons
CAA or CAG. Why then does this program count only two? The answer
is that recent DNA sequence comparisons seem to show that synonymous
changes are considerably faster and easier than ones that change the
amino acid. We are assuming that, in effect, synonymous changes occur
so much more readily that they need not be counted. Thus, in the chain
of changes TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Glu), the middle
one is not counted because it does not change the amino acid (leucine).
<P>
To maintain consistency with the genetic code, it is necessary for the
program internally to treat serine as two separate states (ser1 and ser2)
since the two groups of serine codons are not adjacent in the
code. Changes to the state "deletion" are counted as three steps to prevent the
algorithm from assuming unnecessary deletions. The state "unknown" is
simply taken to mean that the amino acid, which has not been determined,
will in each part of a tree that is evaluated be assumed to be whichever one
causes the fewest steps.
<P>
The assumptions of this method (which has not been described in the
literature), are thus something like this:
<P>
<OL>
<LI>Changes in different sites are independent.
<LI>Changes in different lineages are independent.
<LI>The probability of a base substitution that changes the amino
acid sequence is small over the lengths of time involved in
a branch of the phylogeny.
<LI>The expected amounts of change in different branches of the phylogeny
do not vary by so much that two changes in a high-rate branch
are more probable than one change in a low-rate branch.
<LI>The expected amounts of change do not vary enough among sites that two
changes in one site are more probable than one change in another.
<LI>The probability of a base change that is synonymous is much higher
than the probability of a change that is not synonymous.
</OL>
<P>
That these are the assumptions of parsimony methods has been documented
in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For
an opposing view arguing that the parsimony methods make no substantive
assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
<P>
The input for the program is fairly standard. The first line contains the
number of species and the number of amino acid positions (counting any
stop codons that you want to include).
<P>
Next come the species data. Each
sequence starts on a new line, has a ten-character species name
that must be blank-filled to be of that length, followed immediately
by the species data in the one-letter code. The sequences must either
be in the "interleaved" or "sequential" formats
described in the Molecular Sequence Programs document. The I option
selects between them. The sequences can have internal
blanks in the sequence but there must be no extra blanks at the end of the
terminated line. Note that a blank is not a valid symbol for a deletion.
<P>
The protein sequences are given by the one-letter codes
described in the <A HREF="sequence.html">Molecular Sequence Programs documentation file</A>. Note that
if two polypeptide chains are being used that are of different lengths
owing to one terminating before the other, they should be coded as (say)
<P><PRE>
HIINMA*????
HIPNMGVWABT
</PRE><P>
since after the stop codon we do not definitely know that
there has been a deletion, and do not know what amino acid would
have been there. If DNA studies tell us that there is
DNA sequence in that region, then we could use "X" rather than "?". Note
that "X" means an unknown amino acid, but definitely an amino acid,
while "?" could mean either that or a deletion. The distinction is often
significant in regions where there are deletions: one may want to encode
a six-base deletion as "-?????" since that way the program will only count
one deletion, not six deletion events, when the deletion arises. However,
if there are overlapping deletions it may not be so easy to know what
coding is correct.
<P>
One will usually want to
use "?" after a stop codon, if one does not know what amino acid is there. If
the DNA sequence has been observed there, one probably ought to resist
putting in the amino acids that this DNA would code for, and one should use
"X" instead, because under the assumptions implicit in this parsimony
method, changes to any noncoding sequence are much easier than
changes in a coding region that change the amino acid, so that they
shouldn't be counted anyway!
<P>
The form of this information
is the standard one described in the main documentation file. For the U option
the tree
provided must be a rooted bifurcating tree, with the root placed anywhere
you want, since that root placement does not affect anything.
<P>
The options are selected using an interactive menu. The menu looks like this:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Protein parsimony algorithm, version 3.69
Setting for this run:
U Search for best tree? Yes
J Randomize input order of sequences? No. Use input order
O Outgroup root? No, use as outgroup species 1
T Use Threshold parsimony? No, use ordinary parsimony
C Use which genetic code? Universal
W Sites weighted? No
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Print out steps in each site No
5 Print sequences at all nodes of tree No
6 Write out trees onto tree file? Yes
Are these settings correct? (type Y or the letter for one to change)
</PRE>
</TD></TR></TABLE>
<P>
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit corresponding
to an option that is to be changed.
<P>
The options U, J, O, T, W, M, and 0 are the usual ones. They are described in
the main documentation file of this package. Option I is the same as in
other molecular sequence programs and is described in the documentation file
for the sequence programs. Option C allows the user to select among various
nuclear and mitochondrial genetic codes. There is no provision for coping
with data where different genetic codes have been used in different
organisms.
<P>
In the U (User tree) option, the trees should
not be preceded by a line with the number of trees on it.
<P>
Output is standard: if option 1 is toggled on, the data is printed out,
with the convention that "." means "the same as in the first species".
Then comes a list of equally parsimonious trees, and (if option 2 is
toggled on) a table of the
number of changes of state required in each position. If option 5 is toggled
on, a table is printed
out after each tree, showing for each branch whether there are known to be
changes in the branch, and what the states are inferred to have been at the
top end of the branch. This is a reconstruction of the ancestral sequences
in the tree. If you choose option 5, a menu item "." appears which gives you
the opportunity to turn off dot-differencing so that complete ancestral
sequences are shown. If the inferred state is a "?" there will be multiple
equally-parsimonious assignments of states; the user must work these out for
themselves by hand. If option 6 is left in its default state the trees
found will be written to a tree file, so that they are available to be used
in other programs. If the program finds multiple
trees tied for best, all of these are written out onto the output tree
file. Each is followed by a numerical weight in square brackets (such as
[0.25000]). This is needed when we use the trees to make a consensus
tree of the results of bootstrapping or jackknifing, to avoid overrepresenting
replicates that find many tied trees.
<P>
If the U (User Tree) option is used and more than one tree is supplied, the
program also performs a statistical test of each of these trees against the
best tree. This test is a version of the test proposed by
Alan Templeton (1983), and evaluated in a test case by me (1985a). It is
closely parallel to a test using log likelihood differences
due to Kishino and Hasegawa (1989), and uses the mean
and variance of
step differences between trees, taken across positions. If the mean
is more than 1.96 standard deviations different then the trees are declared
significantly different. The program
prints out a table of the steps for each tree, the differences of
each from the best one, the variance of that quantity as determined by
the step differences at individual positions, and a conclusion as to
whether that tree is or is not significantly worse than the best one.
<P>
The program is derived from Mix but has had some rather elaborate
bookkeeping using sets of bits installed. It is not a very fast
program but is speeded up substantially over version 3.2.
<P>
<HR>
<H3>TEST DATA SET</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
5 10
Alpha ABCDEFGHIK
Beta AB--EFGHIK
Gamma ?BCDSFG*??
Delta CIKDEFGHIK
Epsilon DIKDEFGHIK
</PRE>
</TD></TR></TABLE>
<P>
<HR>
<P>
<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Protein parsimony algorithm, version 3.69
5 species, 10 sites
Name Sequences
---- ---------
Alpha ABCDEFGHIK
Beta ..--......
Gamma ?...S..*??
Delta CIK.......
Epsilon DIK.......
3 trees in all found
+--------Gamma
!
+--2 +--Epsilon
! ! +--4
! +--3 +--Delta
1 !
! +-----Beta
!
+-----------Alpha
remember: this is an unrooted tree!
requires a total of 16.000
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 ANCDEFGHIK
1 2 no ..........
2 Gamma yes ?B..S..*??
2 3 yes ..?.......
3 4 yes ?IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 Beta yes .B--......
1 Alpha maybe .B........
+--Epsilon
+--4
+--3 +--Delta
! !
+--2 +-----Gamma
! !
1 +--------Beta
!
+-----------Alpha
remember: this is an unrooted tree!
requires a total of 16.000
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 ANCDEFGHIK
1 2 no ..........
2 3 maybe ?.........
3 4 yes .IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 Gamma yes ?B..S..*??
2 Beta yes .B--......
1 Alpha maybe .B........
+--Epsilon
+-----4
! +--Delta
+--3
! ! +--Gamma
1 +-----2
! +--Beta
!
+-----------Alpha
remember: this is an unrooted tree!
requires a total of 16.000
steps in each position:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 3 1 5 3 2 0 0 2 0
10! 0
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 ANCDEFGHIK
1 3 no ..........
3 4 yes ?IK.......
4 Epsilon maybe D.........
4 Delta yes C.........
3 2 no ..........
2 Gamma yes ?B..S..*??
2 Beta yes .B--......
1 Alpha maybe .B........
</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>
|