PHYLIP_NJ_Example

Purpose

This is an example workflow that demonstrates how to use PHYLIP components to infer and plot a phylogenetic tree from a multiple sequence alignment, using the neighbor-joining method. It is also to demonstrate how to run this program in non-interactive mode, the first step to programmatic wrapping. This work flow covers the fundamentals; using PHYLIP in this way is more for prototyping or for one-off analyses. Integration into a high level application or batch processing will involve more abstraction.

Prerequisites

  • Access to a unix/linux shell
  • A multiple sequence alignment file from CLUSTALW (or some other source).
  • This work flow assumes that you have the BioPerl libraries and the PHYLIP binary executables compiled/installed on you system.

The DNA sequences

Original data

Multiple sequence alignment format

The input format for DNA sequences is described in the PHYLIP documentation.

  • The starting point is aligned sequences in CLUSTALW format.
  • One easy way to convert this to PHYLIP (aka Felsenstein) format is with a BioPerl script:
    #!/usr/bin/perl -w 
    use strict;
    use Bio::AlignIO;
    
    my $infile = shift || die "Usage ./clustal2phylip.pl clustal_file\n";
    -e $infile || die "Input file $infile does not exist!\n";
    
    my $in  = Bio::AlignIO->new( -file => $infile, -format => 'clustalw');
    my $out = Bio::AlignIO->new( -format => 'phylip');
    
    my $aln = $in->next_aln;
    $out->write_aln($aln);
    
  • This would be run with this incantation
    $ ./clustal2phylip.pl actin.aln >actin.fel
  • Below is an excerpt from actin.fel:
     6 1134
    c_elegans    ATGTGTGACG ACGA---GGT TGCCGCTCTT GTTGTAGACA ATGGATCCGG AATGTGCAAG 
    c_briggsae   ATGTGTGACG ACGA---GGT TGCAGCTCTC GTAGTGGACA ATGGCTCCGG AATGTGCAAG 
    b_xylophil   ATGTGTGACG AAGA---AGT TGCCGCTCTT GTTGTGGACA ATGGCTCCGG TATGTGCAAA 
    c_oncophor   ATGTGTGACG ACGA---GGT TGCTGCTCTT GTGGTTGACA ATGGATCCGG AATGTGCAAA 
    p_magellan   ATGTGTGACG ACGA---GGT AGCAGCTTTA GTAGTAGACA ATGGCTCCGG TATGTGCAAG 
    s_salar      ATGTGTGACG ACGACGAGAC TACTGCTCTT GTGTGCGACA ACGGCTCCGG CCTTGTGAAG 
    
                 GCCGGATTCG CCGGAGACGA CGCTCCACGC GCCGTGTTCC CATCCATTGT CGGAAGACCA 
                 GCCGGATTTG CCGGAGACGA TGCTCCACGC GCCGTCTTCC CATCCATCGT TGGACGCCCA 
                 GCCGGTTTCG CCGGAGATGA TGCCCCACGT GCCGTCTTCC CCTCCATTGT CGGAAGACCC 
                 GCCGGATTTG CCGGAGATGA CGCTCCTCGA GCTGTCTTCC CCTCCATCGT CGGCCGACCC 
                 GCCGGGTTCG CCGGAGACGA TGCTCCACGC GCTGTGTTCC CCTCCATTGT TGGAAGGCCC 
                 GCTGGCTTCG CCGGTGATGA CGCACCCAGG GCTGTCTTCC CCTCCATCGT TGGTCGCCCT 
    

    Alternate method

    • The above script is but one example of the many ways to get to a phylip format multiple sequence alignment file.
    • Another example is the script run_clustal.pl to do the alignment on the original fasta file and write the output directly to phylip format.
      $ ./run_clustal actin.fa actin.fel phylip
    • PHYLIP formatted multiple sequence alignment data can also be directly output from the CLUSTALW program (menu/keystroke: 2. Multiple Alignments -> 9. Output format options -> 4. Toggle PHYLIP format output)

    Calculating the distant matrix

    The PHYLIP program DNADIST is used to pre-process the sequence data into a DNA distance matrix.

    DNADIST Menu-driven interface

  • PHYLIP programs all use a command line menu
  • Below is the menu for DNADIST
    Nucleic acid sequence Distance Matrix program, version 3.69
    
    Settings for this run:
      D  Distance (F84, Kimura, Jukes-Cantor, LogDet)?  F84
      G          Gamma distributed rates across sites?  No
      T                 Transition/transversion ratio?  2.0
      C            One category of substitution rates?  Yes
      W                         Use weights for sites?  No
      F                Use empirical base frequencies?  Yes
      L                       Form of distance matrix?  Square
      M                    Analyze multiple data sets?  No
      I                   Input sequences interleaved?  Yes
      0            Terminal type (IBM PC, ANSI, none)?  ANSI
      1             Print out the data at start of run  No
      2           Print indications of progress of run  Yes
    
      Y to accept these or type the letter for one to change
    
  • We will use the default parameters
    1) Call the program
    $ dnadist
    2) Enter the file name at the prompt
    dnadist: can't find input file "infile"
    Please enter a new file name>actin.fel
    
    3) Enter 'Y' at the menu to calculate the matrix using the default parameters
    4) Save your work. The matrix will be saved to 'outfile', which you should rename to something else to clear the path for subsequent PHYLIP programs and also to save an audit trail that will not be accidentally overwritten.
    $ mv outfile actin.dist
  • The resulting distance matrix looks like this:
        6
    c_elegans  0.000000 0.058113 0.108458 0.148962 0.154771 0.209534
    c_briggsae 0.058113 0.000000 0.110660 0.168952 0.149338 0.201134
    b_xylophil 0.108458 0.110660 0.000000 0.163680 0.130987 0.185266
    c_oncophor 0.148962 0.168952 0.163680 0.000000 0.195618 0.206068
    p_magellan 0.154771 0.149338 0.130987 0.195618 0.000000 0.175117
    s_salar    0.209534 0.201134 0.185266 0.206068 0.175117 0.000000
    

    Distance matrices in PHYLIP

    Documentation of inputs for distance-based methods in PHYLIP can be found here

    DNADIST Non-interactive mode

  • As with all PHYLIP programs, you can bypass the menu by creating a text file with the commands and options listed in the sequence you would enter them via the prompts and menu:
    actin.fel
    Y
    
  • Then use this incantation:
    $ dnadist < dnadist_commands.txt
    $ mv outfile actin.dist
    

Making the Neighbor Joining Tree

The PHYLIP implementation of the Neighbor-Joining algorithm is in the program NEIGHBOR

NEIGHBOR Menu-driven interface

Neighbor-Joining/UPGMA method version 3.69

Settings for this run:
  N       Neighbor-joining or UPGMA tree?  Neighbor-joining
  O                        Outgroup root?  No, use as outgroup species  1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes


  Y to accept these or type the letter for one to change
  • we are using the defaults here
    1) Launch NEIGHBOR
    $ neighbor 
    2) Input sequence
    neighbor: can't find input file "infile"
    Please enter a new file name> actin.dist
    
    3) Enter 'Y'
    4) Save your work, this time there is an outfile and a treefile
    $ mv outfile actin.nj.out
    $ mv outtree actin.nj.tree
    

    cleanup

    For low-level use of PHYLIP programs, be sure to remove or rename 'outfile' and 'outtree', or you will be prompted to overwrite them interactively, which screws up the non-interactive use described below

  • The output file looks like this:
       6 Populations
    
    Neighbor-Joining/UPGMA method version 3.69
    
    
     Neighbor-joining method
    
     Negative branch lengths allowed
    
    
      +-c_briggsae
      ! 
      ! +--b_xylophil
      1-3 
      ! ! +-----c_oncophor
      ! +-4 
      !   ! +---p_magellan
      !   +-2 
      !     +------s_salar   
      ! 
      +-c_elegans 
    
    
    remember: this is an unrooted tree!
    
    Between        And            Length
    -------        ---            ------
       1          c_briggsae      0.03010
       1             3            0.02968
       3          b_xylophil      0.05082
       3             4            0.00966
       4          c_oncophor      0.09688
       4             2            0.01641
       2          p_magellan      0.06789
       2          s_salar         0.10723
       1          c_elegans       0.02801
    
  • Note that the tree is unrooted.
  • The tree file looks like this:
    (c_briggsae:0.03010,(b_xylophil:0.05082,(c_oncophor:0.09688,
    (p_magellan:0.06789,s_salar:0.10723):0.01641):0.00966):0.02968,c_elegans:0.02801);
    
  • This is Newick format, with branch lengths

NEIGHBOR Non-interactive mode

  • Because we are using defaults, the sequence of inputs is simple
    actin.dist
    Y
    
  • Which is run like so:
    $ neighbor < neighbor_commands.txt
    $ mv outfile actin.nj.out
    $ mv outtree actin.nj.tree
    

Viewing the tree

We will use the PHYLIP program DRAWTREE to render the tree

  • Note if you have not done so already, copy one of the font files from the source code distribution (font1, font2, etc) to 'fontfile' in your working directory.
  • The menu for DRAWTREE is shown below.
    Unrooted tree plotting program version 3.69
    
    Here are the settings: 
    
     0  Screen type (IBM PC, ANSI)?  ANSI
     P       Final plotting device:  Postscript printer
     V           Previewing device:  X Windows display
     B          Use branch lengths:  Yes
     L             Angle of labels:  branch points to Middle of label
     R            Rotation of tree:  90.0
     I     Iterate to improve tree:  Equal-Daylight algorithm
     D  Try to avoid label overlap?  No
     S      Scale of branch length:  Automatically rescaled
     C   Relative character height:  0.3333
     F                        Font:  Times-Roman
     M          Horizontal margins:  1.65 cm
     M            Vertical margins:  2.16 cm
     #           Page size submenu:  one page per tree
    
     Y to accept these or type the letter for one to change
    
  • The options are documented here. We work out the display parameters we like through trial and error, end end up with a file like so:
    outtree
    0
    P
    W
    1024
    768
    V
    N
    R
    45
    L
    F
    0
    Y
    
  • annotated:
    Toggle off onscreen display
    0
    Plotting device (file format), windows bitmap, 1024 x 768 pixels
    P
    W
    1024
    768
    
    Set preview to 'None'
    V
    N
    
    Set angle of rotation to 45 degrees
    R
    45
    
    Set label angles to fixed, horizontal
    L
    F
    0
    
    and 'Y' to run.
$ drawtree < drawtree-commands.txt
$ mv plotfile actin.nj.tree.bmp
  • and this is the tree, rendered as an unrooted network

Wrappers for PHYLIP

  • There is limited support for running some PHYLIP components via BioPerl.
  • Below is a generic perl script to run a PHYLIP program using a command file.
    #!/usr/bin/perl -w                                                                                                                                                                                               
    # run_phylip.pl                                                                                                                                                                                                  
    use strict;
    
    my $program = shift;
    my $command = shift;
    my $outfile = shift;
    my $outtree = shift;
    
    $program && $command && $outfile
      || die "Usage: ./run_phylip.pl program command_file output_file [output_tree]\n";
    
    # get rid of previous work                                                                                                                                                                                       
    unlink "infile"  if -e "infile";
    unlink "intree"  if -e "intree";
    unlink "outtree" if -e "outtree";
    unlink "outfile" if -e "outfile";
    
    # run the specified program with command file                                                                                                                                                                    
    system "$program < $command > $program.out";
    system "mv outfile $outfile";
    system "mv outtree $outtree" if $outtree;
    
    print "Done. Outfile saved as $outfile. Program output saved as '$program.out'\n";