We have run the software package Transrate on the Thousand Plants (1kP) assemblies. This program characterizes the quality of an assembly and classifies each of its scaffolds as good or bad. The software comes from Steve Kelly's lab, the same group that provided the OrthoFinder algorithm for gene family inference.
In essence, the analysis aligns reads to the assembled scaffolds/contigs and looks for four kinds of problem: base mismatches between reads and scaffolds, regions without aligned reads, violations of normal paired-end read structure, and unusually large variations in mapped read depth. These features are combined to give a score for each scaffold.
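As a rough illustration of how four such features could be combined into one score, here is a minimal sketch. It assumes each feature has already been expressed as a value between 0 and 1 and that the values are simply multiplied; the function and parameter names are hypothetical, and the preprint remains the reference for how Transrate actually defines and combines its components.

```python
from math import prod


def scaffold_score(mismatch_ok, covered, pairing_ok, depth_ok):
    """Combine four per-scaffold feature values into a single score.

    Each argument is assumed (for this sketch only) to lie in [0, 1],
    with 1 meaning "no evidence of a problem" for that feature.
    """
    components = (mismatch_ok, covered, pairing_ok, depth_ok)
    for value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature value out of range: {value}")
    # A product lets a single poor feature pull the whole score down.
    return prod(components)
```

With a product, a scaffold has to do reasonably well on every feature to score well; a simple arithmetic mean would let one good feature mask a bad one.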
Once the scaffold scores have been evaluated, Transrate selects a threshold score separating good-quality from bad-quality scaffolds. The threshold is chosen to maximize a total assembly score, which combines the scores of the good scaffolds with the fraction of reads that map to them.
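The threshold search can be sketched as follows. This is only an illustration under stated assumptions: the total assembly score is taken here to be the geometric mean of the retained scaffold scores multiplied by the fraction of all reads that map to the retained scaffolds, and every distinct scaffold score is tried as a candidate threshold. The function names and the input layout are hypothetical; see the preprint for the definition Transrate actually uses.

```python
import math


def assembly_score(scaffolds, total_reads):
    """Score a set of scaffolds given as (score, n_mapped_reads) pairs.

    Assumed formula (sketch only): geometric mean of the scaffold
    scores times the fraction of all reads mapped to these scaffolds.
    Scores must be positive.
    """
    if not scaffolds or total_reads <= 0:
        return 0.0
    mean_log = sum(math.log(score) for score, _ in scaffolds) / len(scaffolds)
    read_fraction = sum(n for _, n in scaffolds) / total_reads
    return math.exp(mean_log) * read_fraction


def best_cutoff(scaffolds, total_reads):
    """Return (cutoff, assembly_score) for the cutoff that scores highest.

    Scaffolds with score >= cutoff are treated as good.
    """
    best_score, best_cut = 0.0, None
    for cutoff in sorted({score for score, _ in scaffolds}):
        kept = [(s, n) for s, n in scaffolds if s >= cutoff]
        candidate = assembly_score(kept, total_reads)
        if candidate > best_score:
            best_score, best_cut = candidate, cutoff
    return best_cut, best_score
```

Raising the threshold improves the mean score of what is kept but loses the reads that mapped to the discarded scaffolds, so the maximum sits where those two effects balance.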
A preprint describing Transrate is available at http://dx.doi.org/10.1101/021626. If you have any questions about the method, you may contact Steve Kelly <firstname.lastname@example.org>.
One quick observation is that very short 1kP scaffolds tend to be classified as bad, but the fraction of good scaffolds increases with length (see TransratePassRate.png). Scaffolds of 300 bp or longer are overwhelmingly (89%) rated as good. We previously recommended ignoring sub-300 bp scaffolds for most analyses; this result offers an additional argument for doing so.
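If you want to check the length dependence on your own samples, a pass rate per length bin can be computed along these lines. This is a minimal sketch: it assumes you have already extracted a (length, good/bad) pair for each scaffold from the Transrate results, and the function name and bin boundaries are hypothetical choices, not part of Transrate.

```python
from bisect import bisect_right
from collections import defaultdict


def pass_rate_by_length(records, bin_edges):
    """Fraction of good scaffolds in each length bin.

    records   -- iterable of (length_bp, is_good) pairs
    bin_edges -- sorted lower bounds of the bins, e.g. [0, 300, 1000]
    Returns {lower_bound: fraction_good} for every non-empty bin.
    """
    counts = defaultdict(lambda: [0, 0])  # lower bound -> [good, total]
    for length, is_good in records:
        idx = bisect_right(bin_edges, length) - 1
        if idx < 0:
            continue  # shorter than the first bin; ignore
        tally = counts[bin_edges[idx]]
        tally[0] += bool(is_good)
        tally[1] += 1
    return {low: good / total for low, (good, total) in sorted(counts.items())}
```

Each bin runs from its lower bound up to, but not including, the next one, so a 300 bp scaffold falls in the bin that starts at 300.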
Steve Kelly cautions that the automated scaffold cut-off is great for rejecting badly assembled scaffolds, but it also discards some correctly assembled ones. He suggests that for OneKP this should generally be fine, as most SOAPdenovo-Trans errors result from multiple fragments of larger scaffolds that are already present in the assembly. A very conservative alternative cut-off is to discard only scaffolds with the minimum score, i.e. 0.01, which indicates that the scaffold has no supporting reads.
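The conservative alternative amounts to a very simple filter. The sketch below assumes the scaffold scores have been loaded into a dictionary keyed by scaffold name; the function name, the dictionary layout, and the small tolerance are hypothetical conveniences for illustration.

```python
MIN_SCORE = 0.01  # minimum score quoted above: no supporting reads


def drop_unsupported(scores, min_score=MIN_SCORE, tol=1e-9):
    """Keep only scaffolds whose score is above the minimum.

    scores -- dict mapping scaffold name to its score
    tol    -- guards against floating-point noise around min_score
    """
    return {name: s for name, s in scores.items() if s > min_score + tol}
```

Unlike the automated cut-off, this keeps every scaffold with at least some read support, including low-scoring ones, so it is better suited to analyses where losing a correct scaffold costs more than keeping a doubtful one.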