                                  fdnainvar



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Nucleic acid sequence invariants method

Description

   For nucleic acid sequence data on four species, computes Lake's and
   Cavender's phylogenetic invariants, which test alternative tree
   topologies. The program also tabulates the frequencies of occurrence of
   the different nucleotide patterns. Lake's invariants are the method
   which he calls "evolutionary parsimony".

Algorithm

   This program reads in nucleotide sequences for four species and
   computes the phylogenetic invariants discovered by James Cavender
   (Cavender and Felsenstein, 1987) and James Lake (1987). Lake's method
   is also called by him "evolutionary parsimony". I prefer Cavender's
   more mathematically precise term "invariants", as the method bears
   somewhat more relationship to likelihood methods than to parsimony. The
   invariants are mathematical formulas (in the present case linear or
   quadratic) in the EXPECTED frequencies of site patterns which are zero
   for all trees of a given tree topology, irrespective of branch lengths.
   We can consider at a given site that if there are no ambiguities, we
   could have for four species the nucleotide patterns (considering the
   same site across all four species) AAAA, AAAC, AAAG, ... through TTTT,
   256 patterns in all.

   The invariants are formulas in the expected pattern frequencies, not
   the observed pattern frequencies. When they are computed using the
   observed pattern frequencies, we will usually find that they are not
   precisely zero even when the model is correct and we have the correct
   tree topology. Only as the number of nucleotides scored becomes
   infinite will the observed pattern frequencies approach their
   expectations; otherwise, we must do a statistical test of the
   invariants.

   Some explanation of invariants will be found in the above papers, and
   also in my recent review article on statistical aspects of inferring
   phylogenies (Felsenstein, 1988b). Although invariants have some
   important advantages, their validity also depends on symmetry
   assumptions that may not be satisfied. In the discussion below suppose
   that the possible unrooted phylogenies are I: ((A,B),(C,D)), II:
   ((A,C),(B,D)), and III: ((A,D),(B,C)).

  Lake's Invariants, Their Testing and Assumptions

   Lake's invariants are fairly simple to describe: the patterns involved
   are only those in which there are two purines and two pyrimidines at a
   site. Thus a site with AACT would affect the invariants, but a site
   with AAGG would not. Let us use (as Lake does) the symbols 1, 2, 3, and
   4, with the proviso that 1 and 2 are either both of the purines or both
   of the pyrimidines; 3 and 4 are the other two nucleotides. Thus 1 and 2
   always differ by a transition; so do 3 and 4. Lake's invariants,
   expressed in terms of expected frequencies, are the three quantities:

(1)      P(1133) + P(1234) - P(1134) - P(1233),

(2)      P(1313) + P(1324) - P(1314) - P(1323),

(3)      P(1331) + P(1342) - P(1341) - P(1332),

   He showed that invariants (2) and (3) are zero under Topology I, (1)
   and (3) are zero under topology II, and (1) and (2) are zero under
   Topology III. If, for example, we see a site with pattern ACGC, we can
   start by setting 1=A. Then 2 must be G. We can then set 3=C (so that 4
   is T). Thus its pattern type, making those substitutions, is 1323.
   P(1323) is the expected probability of the type of pattern which
   includes ACGC, TGAG, GTAT, etc.

   Lake's invariants are easily tested with observed frequencies. For
   example, the first of them is a test of whether there are as many sites
   of types 1133 and 1234 as there are of types 1134 and 1233; this is
   easily tested with a chi-square test or, as in this program, with an
   exact binomial test. Note that with several invariants to test, we risk
   overestimating the significance of results if we simply accept the
   nominal 95% levels of significance (Li and Guoy, 1990).

   Lake's invariants assume that each site is evolving independently, and
   that starting from any base a transversion is equally likely to end up
   at each of the two possible bases (thus, an A undergoing a transversion
   is equally likely to end up as a C or a T, and similarly for the other
   four bases from which one could start. Interestingly, Lake's results do
   not assume that rates of evolution are the same at all sites. The
   result that the total of 1133 and 1234 is expected to be the same as
   the total of 1134 and 1233 is unaffected by the fact that we may have
   aggregated the counts over classes of sites evolving at different
   rates.

  Cavender's Invariants, Their Testing and Assumptions

   Cavender's invariants (Cavender and Felsenstein, 1987) are for the case
   of a character with two states. In the nucleic acid case we can
   classify nucleotides into two states, R and Y (Purine and Pyrimidine)
   and then use the two-state results. Cavender starts, as before, with
   the pattern frequencies. Coding purines as R and pyrimidines as Y, the
   patterns types are RRRR, RRRY, and so on until YYYY, a total of 16
   types. Cavender found quadratic functions of the expected frequencies
   of these 16 types that were expected to be zero under a given
   phylogeny, irrespective of branch lengths. Two invariants (called K and
   L) were found for each tree topology. The L invariants are particularly
   easy to understand. If we have the tree topology ((A,B),(C,D)), then in
   the case of two symmetric states, the event that A and B have the same
   state should be independent of whether C and D have the same state, as
   the events determining these happen in different parts of the tree. We
   can set up a contingency table:

                                 C = D         C =/= D
                           ------------------------------
                          |
                   A = B  |   YYYY, YYRR,     YYYR, YYRY,
                          |   RRRR, RRYY      RRYR, RRRY
                          |
                 A =/= B  |   YRYY, YRRR,     YRYR, YRRY,
                          |   RYYY, RYRR      RYYR, RYRY

   and we expect that the events C = D and A = B will be independent.
   Cavender's L invariant for this tree topology is simply the negative of
   the crossproduct difference,

      P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D).

   One of these L invariants is defined for each of the three tree
   topologies. They can obviously be tested simply by doing a chi-square
   test on the contingency table. The one corresponding to the correct
   topology should be statistically indistinguishable from zero. Again,
   there is a possible multiple tests problem if all three are tested at a
   nominal value of 95%.

   The K invariants are differences between the L invariants. When one of
   the tables is expected to have crossproduct difference zero, the other
   two are expected to be nonzero, and also to be equal. So the difference
   of their crossproduct differences can be taken; this is the K
   invariant. It is not so easily tested.

   The assumptions of Cavender's invariants are different from those of
   Lake's. One obviously need not assume anything about the frequencies
   of, or transitions among, the two different purines or the two
   different pyrimidines. However one does need to assume independent
   events at each site, and one needs to assume that the Y and R states
   are symmetric, that the probability per unit time that a Y changes into
   an R is the same as the probability that an R changes into a Y, so that
   we expect equal frequencies of the two states. There is also an
   assumption that all sites are changing between these two states at the
   same expected rate. This assumption is not needed for Lake's
   invariants, since expectations of sums are equal to sums of
   expectations, but for Cavender's it is, since products of expectations
   are not equal to expectations of products.

   It is helpful to have both sorts of invariants available; with further
   work we may appreciate what other invaraints there are for various
   models of nucleic acid change.

Usage

   Here is a sample session with fdnainvar


% fdnainvar -printdata
Nucleic acid sequence invariants method
Input (aligned) nucleotide sequence set(s): dnainvar.dat
Phylip weights file (optional):
Phylip dnainvar program output file [dnainvar.fdnainvar]:


Output written to output file "dnainvar.fdnainvar"

Done.



   Go to the input files for this example
   Go to the output files for this example

Command line arguments

Nucleic acid sequence invariants method
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqsetall  File containing one or more sequence
                                  alignments
   -weights            properties Phylip weights file (optional)
  [-outfile]           outfile    [*.fdnainvar] Phylip dnainvar program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -printdata          boolean    [N] Print data at start of run
*  -[no]dotdiff        boolean    [Y] Use dot-differencing to display results
   -[no]printpattern   boolean    [Y] Print counts of patterns
   -[no]printinvariant boolean    [Y] Print invariants
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -scircular1         boolean    Sequence is circular
   -squick1            boolean    Read id and sequence only
   -sformat1           string     Input sequence format
   -iquery1            string     Input query fields or ID list
   -ioffset1           integer    Input start position offset
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   fdnainvar reads any normal sequence USAs.

  Input files for usage example

  File: dnainvar.dat

   4   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT

Output file format

   fdnainvar output consists first (if option 1 is selected) of a
   reprinting of the input data, then (if option 2 is on) tables of
   observed pattern frequencies and pattern type frequencies. A table will
   be printed out, in alphabetic order AAAA through TTTT of all the
   patterns that appear among the sites and the number of times each
   appears. This table will be invaluable for computation of any other
   invariants. There follows another table, of pattern types, using the
   1234 notation, in numerical order 1111 through 1234, of the number of
   times each type of pattern appears. In this computation all sites at
   which there are any ambiguities or deletions are omitted. Cavender's
   invariants could actually be computed from sites that have only Y or R
   ambiguities; this will be done in the next release of this program.

   If option 3 is on the invariants are then printed out, together with
   their statistical tests. For Lake's invariants the two sums which are
   expected to be equal are printed out, and then the result of an
   one-tailed exact binomial test which tests whether the difference is
   expected to be this positive or more. The P level is given (but
   remember the multiple-tests problem!).

   For Cavender's L invariants the contingency tables are given. Each is
   tested with a one-tailed chi-square test. It is possible that the
   expected numbers in some categories could be too small for valid use of
   this test; the program does not check for this. It is also possible
   that the chi-square could be significant but in the wrong direction;
   this is not tested in the current version of the program. To check it
   beware of a chi-square greater than 3.841 but with a positive
   invariant. The invariants themselves are computed, as the difference of
   cross-products. Their absolute magnitudes are not important, but which
   one is closest to zero may be indicative. Significantly nonzero
   invariants should be negative if the model is valid. The K invariants,
   which are simply differences among the L invariants, are also printed
   out without any test on them being conducted. Note that it is possible
   to use the bootstrap utility SEQBOOT to create multiple data sets, and
   from the output from sunning all of these get the empirical variability
   of these quadratic invariants.

  Output files for usage example

  File: dnainvar.fdnainvar


Nucleic acid sequence Invariants method, version 3.69.650

 4 species,  13  sites

Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.



   Pattern   Number of times

     AAAC         1
     AAAG         2
     AACC         1
     AACG         1
     CCCG         1
     CCTC         1
     CGTT         1
     GCCT         1
     GGGT         1
     GGTA         1
     TCAT         1
     TTTT         1


Symmetrized patterns (1, 2 = the two purines  and  3, 4 = the two pyrimidines
                  or  1, 2 = the two pyrimidines  and  3, 4 = the two purines)

     1111         1
     1112         2
     1113         3
     1121         1
     1132         2
     1133         1
     1231         1
     1322         1
     1334         1

Tree topologies (unrooted):

    I:  ((Alpha,Beta),(Gamma,Delta))
   II:  ((Alpha,Gamma),(Beta,Delta))
  III:  ((Alpha,Delta),(Beta,Gamma))



  [Part of this file has been deleted for brevity]

different purine:pyrimidine ratios from 1:1.

  Tree I:

   Contingency Table

      2     8
      1     2

   Quadratic invariant =             4.0

   Chi-square =    0.23111 (not significant)


  Tree II:

   Contingency Table

      1     5
      1     6

   Quadratic invariant =            -1.0

   Chi-square =    0.01407 (not significant)


  Tree III:

   Contingency Table

      1     2
      6     4

   Quadratic invariant =             8.0

   Chi-square =    0.66032 (not significant)




Cavender's quadratic invariants (type K) using purines vs. pyrimidines
 (these are expected to be zero for the correct tree topology)
They will be misled if there are substantially
different evolutionary rate between sites, or
different purine:pyrimidine ratios from 1:1.
No statistical test is done on them here.

  Tree I:              -9.0
  Tree II:              4.0
  Tree III:             5.0


Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                          Description
                    distmat      Create a distance matrix from a multiple sequence alignment
                    ednacomp     DNA compatibility algorithm
                    ednadist     Nucleic acid sequence distance matrix program
                    ednainvar    Nucleic acid sequence invariants method
                    ednaml       Phylogenies from nucleic acid maximum likelihood
                    ednamlk      Phylogenies from nucleic acid maximum likelihood with clock
                    ednapars     DNA parsimony algorithm
                    ednapenny    Penny algorithm for DNA
                    eprotdist    Protein distance algorithm
                    eprotpars    Protein parsimony algorithm
                    erestml      Restriction site maximum likelihood method
                    eseqboot     Bootstrapped sequences algorithm
                    fdiscboot    Bootstrapped discrete sites algorithm
                    fdnacomp     DNA compatibility algorithm
                    fdnadist     Nucleic acid sequence distance matrix program
                    fdnaml       Estimate nucleotide phylogeny by maximum likelihood
                    fdnamlk      Estimates nucleotide phylogeny by maximum likelihood
                    fdnamove     Interactive DNA parsimony
                    fdnapars     DNA parsimony algorithm
                    fdnapenny    Penny algorithm for DNA
                    fdolmove     Interactive Dollo or polymorphism parsimony
                    ffreqboot    Bootstrapped genetic frequencies algorithm
                    fproml       Protein phylogeny by maximum likelihood
                    fpromlk      Protein phylogeny by maximum likelihood
                    fprotdist    Protein distance algorithm
                    fprotpars    Protein parsimony algorithm
                    frestboot    Bootstrapped restriction sites algorithm
                    frestdist    Calculate distance matrix from restriction sites or fragments
                    frestml      Restriction site maximum likelihood method
                    fseqboot     Bootstrapped sequences algorithm
                    fseqbootall  Bootstrapped sequences algorithm

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Please report all bugs to the EMBOSS bug team
                    (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

                    Written (2004) - Joe Felsenstein, University of Washington.

                    Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.

Comments

None
