DNAML_bootstrap_Workflow

Purpose

The purpose of this document is to demonstrate how doing bootstrap replicate analysis with PHYLIP's SEQBOOT, DNAML and CONSENSE could be decomposed for use on Condor.

Assumptions:

  • Perl installed
  • PHYLIP installed

Alignment file

The starting point is an alignment of nine mitochondrial genomes. The alignments were done with MUSCLE.
The alignment file was then converted into PHYLIP format.

DNAML

The program used for the basic phylogenetic inference operation is DNAML. It is an implementation of the maximum likelihood method described here

PHYLIP programs all have interactive menus but, for convenience, I use a little perl wrapper that accepts as arguments the program name (DNAML), a command file to be passed via STDIN and output file name(s)

#!/usr/bin/perl -w                                                                                                                                                                
#run_phylip.pl                                                                                                                                                                   
use strict;

my $program = shift;
my $command = shift;
my $outfile = shift;
my $outtree = shift;

$program && $command && $outfile
    || die "Usage: ./run_phylip.pl program command_file output_file [output_tree]\n";

# get rid of previous work                                                                                                                                                        
unlink "infile"  if -e "infile";
unlink "intree"  if -e "intree";
unlink "outtree" if -e "outtree";
unlink "outfile" if -e "outfile";

# run the specified program with command file                                                                                                                                     
system "$program < $command > $program.out";
system "mv outfile $outfile";
system "mv outtree $outtree" if $outtree;

print "Done. Outfile saved as $outfile. Program output saved as '$program.out'\n";

  • The command file is simple, it just says the name of the alignment file and 'Y' to proceed
fish_aligned.fel
Y
  • The DNAML run would be invoked as follows (using time)
$ time ./run_phylip.pl dnaml command.txt results.txt results.tre
Done. Outfile saved as results.txt. Program output saved as 'dnaml.out'

real	0m6.088s
user	0m6.059s
sys	0m0.029s

  • So a single run takes about six seconds.

Bootstrap replicates

The thing that makes it a bit more complicated is the requirement to do bootstrap replicate analysis.

  • Consider doing doing 2000 replicates, that would be about 12,000 seconds or > 3 hours. This is the part that could be paralellized
  • The program used to create the 2000 replicate data sets is SEQBOOT. The command file for this is:
fish_aligned.fel
R
2000
Y
777

This encodes a command to do 2000 bootstrap replicates using 777 as a random number string.

It is run with the incantation:

$ ./run_phylip.pl seqboot command.txt replicates.txt

dividing up the work (This is the bit where Condor would come in).

The replicate data file uses this header, we can use it as a delimiter for splitting

    9 17913
C_lavaretu ---------- ---------- ---------- ---------- ---------- ----------
Ssalar     CCTTCA---- ---------- ---------- ------GTAA AATTTAAACA AAAAAATTGT
Ogorbuscha CCCCATATTT TTCCACCAAT CTTTAATCCG CGAATTATTT TTCCCAAACT TTTTGATTGT
Omasou     CCCCTAA--- ---------- ---------- ------A-AA AATTTAAACT TTTTGATTGT
Onerka     CCCCACA--- ---------- ---------- ------A-AA AACCCAAACT TTTTGATTGT
Oketa      CCCCATA--- ---------- ---------- ------ATAA AATTTAAACT TTTTGATTGT
Otshawytsc CCCCGTA--- ---------- ---------- ------ATAA AATTTAAACT TTTTGATTGT
Omykiss    ---------- ---------- ---------- ---------- ---------- ----------
Oclarki    CCCCATA--- ---------- ---------- ------ATAA AATTTAAACA ATTTGATTGT
... etc

Now we have to split up the one big file with 2000 data sets to 2000 data files. The script below will do this:

#!/usr/bin/perl -w                                                                                                                                                                
# split.pl -- split big replicate file into one replicate/file                                                                                                                    
use strict;

-d 'split' || mkdir('split');

my $idx;
open OUT, ">./split/".++$idx or die $!;
while (<>) {
  if (/^\s+\d+/) {
    close OUT;
    open OUT, ">./split/".++$idx or die $!;
    print OUT $_;
    print STDERR "Replicate $idx       \r";
  }
  else {
    print OUT $_;
  }
}

  • We now have a "split" sub-directory with 2000 replicate files in it. This would be deployed in parallel and the tree files would need to be re-consolidated after.

After Condor

We will now have run 2000 instances of DNAML on the 2000 files. There will be 2000 trees (each named 'outtree' by DNAML).

  • The trees can be consolidated into one file that would look like this (this example has 10 trees):
    (Ssalar:0.04898,(Omasou:0.03914,((Otshawytsc:0.00409,
    Oketa:0.00432):0.02963,((Ogorbuscha:0.04560,Onerka:0.03225):0.01396,
    (Omykiss:0.02449,Oclarki:0.02578):0.01332):0.00409):0.00670):0.03735,
    C_lavaretu:0.10258);
    (Ssalar:0.04777,(Omasou:0.03462,(((Onerka:0.02829,
    Ogorbuscha:0.04473):0.01513,(Omykiss:0.02278,Oclarki:0.02603):0.01420):0.00480,
    (Otshawytsc:0.00500,Oketa:0.00427):0.02871):0.00963):0.03944,
    C_lavaretu:0.09908);
    (Ssalar:0.04738,(((Omykiss:0.02303,Oclarki:0.02558):0.01479,
    ((Ogorbuscha:0.04929,Onerka:0.02981):0.01347,(Otshawytsc:0.00454,
    Oketa:0.00469):0.02989):0.00413):0.00763,Omasou:0.03291):0.03470,
    C_lavaretu:0.10251);
    (((Oketa:0.00435,Otshawytsc:0.00591):0.02830,((Omasou:0.03957,
    (Omykiss:0.02261,Oclarki:0.02665):0.01460):0.00387,
    (Onerka:0.02864,Ogorbuscha:0.04717):0.01277):0.00669):0.03765,
    Ssalar:0.05094,C_lavaretu:0.10451);
    ((Omasou:0.03245,(((Omykiss:0.02312,Oclarki:0.02885):0.01412,
    (Ogorbuscha:0.04532,Onerka:0.02934):0.01328):0.00573,
    (Otshawytsc:0.00492,Oketa:0.00445):0.02820):0.00973):0.03989,
    Ssalar:0.04698,C_lavaretu:0.10134);
    (((Omykiss:0.02173,Oclarki:0.02647):0.01142,((Onerka:0.03011,
    Ogorbuscha:0.04546):0.01317,((Otshawytsc:0.00417,
    Oketa:0.00458):0.02683,Omasou:0.03915):0.00582):0.00400):0.04298,
    Ssalar:0.04537,C_lavaretu:0.09640);
    (Ssalar:0.04449,((((Otshawytsc:0.00505,Oketa:0.00525):0.02775,
    Omasou:0.03917):0.00684,(Ogorbuscha:0.04631,Onerka:0.03077):0.01193):0.00449,
    (Oclarki:0.02821,Omykiss:0.02270):0.01167):0.04288,
    C_lavaretu:0.09818);
    (Ssalar:0.04860,(Omasou:0.03679,((Onerka:0.02849,
    Ogorbuscha:0.04435):0.01337,((Omykiss:0.02336,
    Oclarki:0.02535):0.01430,(Otshawytsc:0.00540,Oketa:0.00455):0.03091):0.00409):0.00787):0.03759,
    C_lavaretu:0.10208);
    (((((Onerka:0.03149,Ogorbuscha:0.04389):0.01369,
    Omasou:0.03719):0.00494,(Omykiss:0.02325,Oclarki:0.02560):0.01416):0.00625,
    (Oketa:0.00442,Otshawytsc:0.00471):0.02498):0.04066,
    Ssalar:0.04512,C_lavaretu:0.10274);
    (Ssalar:0.04654,(((Omykiss:0.01983,Oclarki:0.02605):0.01525,
    (Omasou:0.04124,(Ogorbuscha:0.04070,Onerka:0.02885):0.01212):0.00426):0.00684,
    (Otshawytsc:0.00476,Oketa:0.00523):0.02523):0.03550,
    C_lavaretu:0.09941);
    

Now, we can take the majority rule consensus tree (again using the example above) using the program CONSENSE.

The command file:

treefile.txt
y

where treefile.txt is the consolidated trees.

$ ./run_phylip.pl consense command.txt consensus.txt consensus.tree
Done. Outfile saved as consensus.txt. Program output saved as 'consense.out'

and the consensus file looks like:

Consensus tree program, version 3.69

Species in order: 

  1. Ssalar
  2. Omasou
  3. Otshawytsc
  4. Oketa
  5. Ogorbuscha
  6. Onerka
  7. Omykiss
  8. Oclarki
  9. C lavaretu


Sets included in the consensus tree

Set (species in order)     How many times out of   10.00

.*******.                  10.00
..**.....                  10.00
......**.                  10.00
....**...                  10.00
..******.                   5.00
....****.                   3.00


Sets NOT included in consensus tree:

Set (species in order)     How many times out of   10.00

.*..****.                   3.00
.*****...                   2.00
.***.....                   2.00
.*..**...                   2.00
..****...                   1.00
.*....**.                   1.00
..**..**.                   1.00


Extended majority rule consensus tree

CONSENSUS TREE:
the numbers on the branches indicate the number
of times the partition of the species into the two sets
which are separated by that branch occurred
among the trees, out of  10.00 trees

                                          +-------Oclarki
                                  +--10.0-|
                                  |       +-------Omykiss
                          +--3.00-|
                          |       |       +-------Onerka
                          |       +--10.0-|
                  +--5.00-|               +-------Ogorbuscha
                  |       |
                  |       |               +-------Otshawytsc
          +--10.0-|       +----------10.0-|
          |       |                       +-------Oketa
  +-------|       |
  |       |       +-------------------------------Omasou
  |       |
  |       +---------------------------------------C lavaretu
  |
  +-----------------------------------------------Ssalar


  remember: this is an unrooted tree!