All assembled scaffolds longer than 300 bp were queried against all NCBI RefSeq plant sequences (Release 54, July 2012) using BLASTX. The best-matching protein-coding genes were used to generate GeneWise translations using a modified TransPipes pipeline (described here). Inferred amino acid sequences can be found in the assembly directory for each taxon (CODE-SOAPdenovo-Trans-translated.tar.bz2). Translated assembly names match the nucleotide assemblies from which they are derived.
Unfortunately, the pipeline relabels entries in the FASTA file using 0, 1, 2, 3, ... (i.e. counting from zero). Because the original SOAPdenovo-Trans assembly output is numbered 2000001, 2000002, ... (i.e. counting from one), the obvious conversion between the two sets of numbers, which simply drops the leading 200, is off by one. For example, scaffold-DFYF-2008707-Ilex_sp. corresponds to 8706, not 8707, in this tar file.
This issue does not affect the orthogroup analysis, which labels sequences using the original seven-digit numbers.
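The off-by-one conversion for the translated files can be sketched in Python as follows. This is a minimal illustration, not part of the pipeline: the function names are hypothetical, the scaffold ID pattern is inferred from the single example given above, and the sketch assumes the offset of 2000001 holds for every entry in a translated file.

```python
import re

# Assumed ID layout, inferred from the example above:
#   scaffold-<CODE>-<seven-digit number>-<taxon>
SCAFFOLD_RE = re.compile(r"^scaffold-([^-]+)-(\d{7})-(.+)$")

# The SOAPdenovo-Trans output starts at 2000001, while the translated
# files start at 0, so the two numberings differ by this constant.
FIRST_SCAFFOLD_NUMBER = 2000001


def translated_label(scaffold_id: str) -> int:
    """Return the zero-based label used in the translated FASTA file."""
    match = SCAFFOLD_RE.match(scaffold_id)
    if match is None:
        raise ValueError(f"unrecognised scaffold ID: {scaffold_id!r}")
    return int(match.group(2)) - FIRST_SCAFFOLD_NUMBER


def scaffold_number(label: int) -> int:
    """Return the seven-digit scaffold number for a translated-file label."""
    if label < 0:
        raise ValueError("labels in the translated file start at 0")
    return label + FIRST_SCAFFOLD_NUMBER


if __name__ == "__main__":
    print(translated_label("scaffold-DFYF-2008707-Ilex_sp."))
    print(scaffold_number(8706))
```

For the example above, translated_label maps scaffold-DFYF-2008707-Ilex_sp. to 8706, and scaffold_number maps 8706 back to 2008707.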