PHYLIP_Documentation

Metadata Considerations

  • PHYLIP input formats are very simple and only allow 10 characters for each species/taxon label.
  • Working with taxon labels longer than 10 characters without data loss therefore requires a workaround. We will need to:
    1) track metadata to link truncated or simplified taxon labels with the original, full-length labels
    2) ensure that the 10-character taxon labels are unique within the file

    Note: we can modify PHYLIP to use longer taxon names! See here
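
The two requirements above (a mapping back to the original labels, and uniqueness within the file) can be sketched roughly as follows. This is a minimal illustration, not an existing tool: the function name make_phylip_labels, the counter-suffix scheme, and the example taxon names are all assumptions made for the sketch.

```python
def make_phylip_labels(names, width=10):
    """Map each original name to a unique label of at most `width` characters."""
    mapping = {}   # original name -> short label (the metadata to keep)
    used = set()   # short labels already handed out
    for name in names:
        if name in mapping:
            raise ValueError(f"duplicate taxon name: {name}")
        label = name[:width]
        counter = 1
        while label in used:
            # Collision: overwrite the tail of the truncated name with a counter
            counter += 1
            suffix = str(counter)
            label = name[:width - len(suffix)] + suffix
        used.add(label)
        mapping[name] = label
    return mapping


# Hypothetical taxon names that all truncate to the same 10 characters
names = ["Shorebird_alpha", "Shorebird_beta", "Shorebird_gamma"]
labels = make_phylip_labels(names)
for original, short in labels.items():
    print(f"{short:<10} {original}")
```

Saving the returned mapping next to the PHYLIP file is what allows the full-length labels to be restored on the results afterwards.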

File and program documentation.

Inventory of sample data for PHYLIP

DNA Sequence

actin.fel, a multiple sequence alignment file in interleaved format. This file is suitable for input into molecular-sequence-based methods, such as DNADIST, DNAML, and DNAPARS. Format specification is described here.
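
As a rough illustration of the interleaved layout, the sketch below reads a small alignment into a dictionary. It assumes the common layout: a header line with the taxon and site counts, a first block whose lines start with a 10-character name, and later blocks that carry sequence only. The function name and the three-taxon alignment are invented for the example and are not taken from actin.fel.

```python
def read_interleaved_phylip(text, name_width=10):
    """Return {name: sequence} from an interleaved PHYLIP alignment."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < ntax:
            # First block: fixed-width name, then the first chunk of sequence
            names.append(line[:name_width].strip())
            seqs.append("".join(line[name_width:].split()))
        else:
            # Later blocks: sequence only, in the same taxon order
            seqs[i % ntax] += "".join(line.split())
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} sites, found {len(seq)}")
    return dict(zip(names, seqs))


example = """\
  3 12
Alpha     ACGTAC
Beta      ACGTAA
Gamma     ACCTAC

GTTACA
GTTACC
GTTACA
"""
alignment = read_interleaved_phylip(example)
print(alignment)
```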

Distance Matrices

actin.dist, a DNA distance matrix, calculated using DNADIST, suitable for input into distance-based methods such as NEIGHBOR. Details on the generation of this file are covered in PHYLIP_NJ_Example. Format specification is here.
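
For orientation, a square distance matrix of this kind can be read with a few lines of code. The sketch assumes the square (not lower-triangular) form, 10-character names, and rows that may wrap onto continuation lines. The function name and the matrix values are made up for the example.

```python
def read_square_distances(text, name_width=10):
    """Return (names, rows) from a square PHYLIP distance matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:]:
        if rows and len(rows[-1]) < ntax:
            # Continuation of a wrapped row: values only
            rows[-1].extend(float(x) for x in line.split())
        else:
            names.append(line[:name_width].strip())
            rows.append([float(x) for x in line[name_width:].split()])
    if len(rows) != ntax or any(len(row) != ntax for row in rows):
        raise ValueError("matrix is not square or does not match the header")
    return names, rows


example = """\
    3
Alpha      0.0000  0.1000  0.2000
Beta       0.1000  0.0000  0.3000
Gamma      0.2000  0.3000  0.0000
"""
names, rows = read_square_distances(example)
print(names, rows[1][2])
```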

Continuous Character Data

PDAP.fel, a two-character data set for 49 mammalian species. This file is suitable for use in CONTRAST. The format specification is described above and here.

50K.continuous.fel, a synthetic data set for 50000 species, suitable for use in CONTRAST. Provenance of the original data is described in 50K_Synthetic_Data. Conversion to PHYLIP format is described in PHYLIP_CONTRAST_Example.

Tree Data

The PHYLIP tree data format is the same as the Newick format.

actin.nj.treefile, the tree resulting from the phylogenetic analysis described in PHYLIP_NJ_Example. This format is also suitable as input data for PHYLIP programs that consume trees.

50K_final_newick.tre, a 50000-taxon synthetic tree provided by Brian Omeara. This tree has been tested with CONTRAST.

crimson50Knewick.tre, a 50000-taxon synthetic tree provided by Val Tannen.

mult_treefile_example.fel, an example of a PHYLIP treefile with multiple trees.
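
A treefile with multiple trees is a sequence of Newick strings, each ending in a semicolon. The sketch below splits such a file into individual trees. It assumes that no quoted labels contain semicolons or blanks, and both the function name and the example trees are illustrative rather than taken from mult_treefile_example.fel.

```python
def split_newick_trees(text):
    """Split the text of a treefile into one Newick string per tree."""
    trees = []
    for chunk in text.split(";"):
        tree = "".join(chunk.split())  # drop line breaks and blanks
        if tree:
            trees.append(tree + ";")
    return trees


example = "((A:1,B:1):1,C:2);\n(A:1,(B:1,\nC:1):1);\n"
trees = split_newick_trees(example)
print(len(trees), trees)
```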

Shore bird data set

Considerations

  • The fifth character in the trait data is non-numeric. I convert it to numeric in the script below, but it would likely be best to throw a warning as well.
  • The tree has a few minor polytomies. A warning would be good, but I think the analysis should run anyway.
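
Both warnings could look roughly like the sketch below. It is only an illustration: the function names, the trait states, and the example tree are hypothetical. The polytomy count treats any node with more than two children as a polytomy (including a trifurcating root) and does not handle quoted labels or bracketed comments.

```python
import warnings


def encode_discrete(values):
    """Recode non-numeric trait states as 1, 2, 3, ... in first-seen order."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    warnings.warn(f"discrete trait recoded as integers: {codes}")
    return [codes[v] for v in values], codes


def count_polytomies(newick):
    """Count internal nodes with more than two children."""
    stack = []        # number of children seen so far at each open node
    polytomies = 0
    for char in newick:
        if char == "(":
            stack.append(1)
        elif char == "," and stack:
            stack[-1] += 1
        elif char == ")" and stack:
            if stack.pop() > 2:
                polytomies += 1
    return polytomies


encoded, codes = encode_discrete(["coastal", "inland", "coastal"])
n_poly = count_polytomies("((A:1,B:1,C:1):1,D:2);")
if n_poly:
    # Warn, but let the analysis proceed
    warnings.warn(f"tree contains {n_poly} polytomies; running anyway")
```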

Files

shorebirds.txt
shorebirds.fel
shorebirds.tree.fel

Conversion

  • The script below consumes shorebirds.txt and converts it to correct PHYLIP format.
  • Note that the Perl script is not the important part; the PHYLIP file format is.
  • Note also that the Perl script is specific to this file and is not generally applicable.
 perl fix_shorebirds.pl shorebirds.txt >shorebirds.fel

Note, I am using a 30-character limit for species names (see here)

#!/usr/bin/perl -w
# Convert shorebirds.txt into a phylip file for CONTRAST
use strict;
my (%seen,$idx,%idx,@output,$traits);


my $treefile = 'shorebirds.tree.fel';

while (<>) {
  next if /Species/; # No header line!

  my ($taxon,undef,@traits) = split;
  $taxon && @traits > 0 || next;

  $traits ||= @traits;
  $traits == @traits
      || die "$taxon has ".scalar(@traits)." traits.  Should have $traits\n";

  # The last trait value is actually discrete and non-numeric
  # we convert it here to numeric (maybe should delete it?)
  $idx{$traits[-1]} ||= ++$idx;
  $traits[-1] = $idx{$traits[-1]};

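  # Pad short names to 30 characters and truncate longer ones
  my $label = sprintf('%-30.30s', $taxon);
  if ((my $n = ++$seen{$label}) > 1) {
    # Repeated label: overwrite its last non-blank character with the
    # occurrence count (the padding, if any, is preserved)
    $label =~ s/\S(\s*)$/$n$1/;
  }
  (my $unpadded_label = $label) =~ s/\s+$//;
  # Rename the taxon in the tree file, but only if its label changed.
  # The list form of system() bypasses the shell; \Q...\E escapes regex
  # metacharacters in the taxon name.
  if ($unpadded_label ne $taxon) {
    system('perl', '-i', '-pe', "s/\Q$taxon\E/$unpadded_label/", $treefile) == 0
      or die "Could not update $treefile\n";
  }
  push @output, $label . join("\t",@traits);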
}

print "  " . scalar(@output) . " $traits\n";
print join("\n", @output), "\n";
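
Since the file layout matters more than the script, here is a minimal sketch of the layout the script emits: a header line with the species and trait counts, then one line per species with the name padded to a fixed width and followed by the trait values. The function name, the species labels, and the trait values are invented for illustration. The 30-character width mirrors the limit used above and assumes a PHYLIP build modified to accept it.

```python
def format_continuous(rows, name_width=30):
    """rows: list of (label, [trait values]); returns the file contents."""
    ntraits = len(rows[0][1])
    lines = [f"  {len(rows)} {ntraits}"]
    for label, traits in rows:
        if len(traits) != ntraits:
            raise ValueError(f"{label} has {len(traits)} traits, expected {ntraits}")
        # Truncate or pad the name to exactly name_width characters
        name = label[:name_width].ljust(name_width)
        lines.append(name + "\t".join(str(t) for t in traits))
    return "\n".join(lines) + "\n"


# Hypothetical species and values
rows = [("Species_one", [48.0, 1.5]), ("Species_two", [57.0, 2.0])]
text = format_continuous(rows)
print(text, end="")
```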