APWEB2 design document

APweb2: Design considerations for upgrading an invaluable web resource for botany

Campbell Webb, Peter Stevens, Amy Zanne

29 May 2009 (Version 1)

Background

The Angiosperm Phylogeny Website (http://www.mobot.org/MOBOT/research/APWeb/), designed and populated by Peter Stevens, has since 2001 become the top internet resource for data about plant classification and characters. It offers:

A continually revised, knowledge-filtered, single hypothesis for the phylogenetic relationships of families, which offers users a ‘best estimate’ of the true tree
A synonymy of plant families
Collected information about apomorphies (derived characters shared by the members of a clade) for all family and order nodes, and many other nodes
Comprehensive character descriptions and glossary
A very comprehensive, up-to-date bibliography
Educational resources

These data are accessible through a series of well-organized web pages, including links to and from (static) images of phylogenies. The site works very well as a stand-alone tool for human self-education. However, the key limitation for integrating this resource with other projects is that the data occur in generally unstructured text strings. Peter has long desired to be able to offer these data in a more universally accessible manner, and serious discussions about upgrading APweb to major version 2.0 began in 2005. In 2008, we constructed a demo of the key design components (see: http://phylodiversity.net/apweb2/demo/).

Goals

Key content components of APweb2 will be:

Most apomorphy and other data refer to a node of the angiosperm phylogeny. These data will be collected as ‘node units’ and then linked to an independent phylogeny.
Multiple phylogenies can be used, from Peter’s ‘best estimate’ to other published hypotheses, and even a user’s phylogeny.
Within each ‘node unit’, data will be structured by class (e.g., anatomy, ecology, chemistry...). Data within each class will remain unstructured character strings at this stage, but encoding characters independently is the primary concern for the next version (3.0).
Where appropriate, data for a node that are presented to a user will be aggregated ‘on the fly’ from data at subordinate nodes. For example, the number of genera or species in an order will be calculated from the family nodes it contains, and so will depend on the phylogeny used.
The appearance to the user should not be radically different from current APweb.
A simple URL-based web service will enable the data to be accessible to other web-enabled software.
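
The ‘on the fly’ aggregation described in the list above could work roughly as follows. This is only a minimal sketch, written in Python purely for illustration; the toy trees, the placeholder node names and counts, and the function name aggregate_count are all hypothetical and are not taken from APweb or its schema.

```python
# Hypothetical toy phylogenies: each node maps to its list of child nodes.
# All names and numbers are placeholders, not real APweb data.
TREE_A = {"OrderA": ["Fam1", "Fam2"], "Fam1": [], "Fam2": []}
TREE_B = {"OrderA": ["Fam1"], "Fam1": [], "Fam2": []}  # alternative hypothesis

# 'Node unit' counts are stored only where they were recorded (here, families).
GENERA = {"Fam1": 3, "Fam2": 5}


def aggregate_count(node, tree, counts):
    """Return the count for `node`, summing over subordinate nodes if needed."""
    if node in counts:  # a value was recorded directly at this node
        return counts[node]
    # Otherwise pass values up from the children defined by this phylogeny.
    return sum(aggregate_count(child, tree, counts) for child in tree.get(node, []))


total_a = aggregate_count("OrderA", TREE_A, GENERA)
total_b = aggregate_count("OrderA", TREE_B, GENERA)
print(total_a, total_b)
```

The same node-unit data give different totals for OrderA under the two trees, which is the sense in which aggregated values depend on the phylogeny used.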

Key software design components of APweb2 will be:

Tools and data will be developed in a revision control system (e.g., SVN), allowing versioning and multiple author access.
Coding choices (platform, language) will reflect the need for long-term sustainability and easy comprehension by key project personnel. Perl is strongly recommended by Webb (please, no Java!).
The basic data model should be expressed and implemented in XML. A preliminary schema already exists: http://svn.phylodiversity.net/apweb2/schema/apweb.rnc.
Storage in a native XML database (e.g., Sedna, eXist) is recommended, allowing native XQuery queries and easy return of XML to a browser, where it can be viewed via XSL transformations (see demo).
Phylogenies should be stored in a separate database; this might be the phylomatic tree-of-trees DB (http://svn.phylodiversity.net/tot/trees/) and/or other web-accessible databases.
The web-tools will read in Newick (for the trees) and XML (for the APweb node-based data) and integrate them.
Trees will be viewable either by re-using other appropriate (fast/easy) tree viewers, or using an integrated converter. SVG is recommended by Webb as the viewing medium, since click-able URL links can be very easily integrated (see demo).
A simple (PHP-based?) web-form tool will be created to allow Peter to easily populate and edit the database.
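
The integration of Newick trees with XML node data mentioned in the list above might look something like the sketch below. It is written in Python for illustration only, and rests on several assumptions: the XML element and attribute names (nodes, node, name, data, class) are invented for this example and do not reproduce the preliminary schema; the Newick reader handles only simple, fully labelled, unquoted trees; and the function names parse_newick and link_node_units are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs; labels and text are placeholders, not APweb data.
NEWICK = "((Fam1:1.0,Fam2:1.0)OrderA:2.0,Fam3:3.0)Root;"
NODE_XML = """<nodes>
  <node name="OrderA"><data class="ecology">placeholder text</data></node>
  <node name="Fam1"><data class="anatomy">placeholder text</data></node>
</nodes>"""


def parse_newick(text):
    """Parse a simple Newick string into (root_label, {label: [child labels]}).

    Assumes every node is labelled, labels are unquoted, and there is no
    whitespace inside the tree; branch lengths (":1.0") are skipped.
    """
    text = text.strip().rstrip(";")
    children = {}
    pos = 0

    def read_node():
        nonlocal pos
        kids = []
        if text[pos] == "(":
            pos += 1  # consume "("
            kids.append(read_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                kids.append(read_node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        if pos < len(text) and text[pos] == ":":  # skip the branch length
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
        children[label] = kids
        return label

    return read_node(), children


def link_node_units(children, xml_text):
    """Map each tree label to its node-unit data ({class: text}), if any."""
    units = {}
    for node in ET.fromstring(xml_text).iter("node"):
        units[node.get("name")] = {d.get("class"): d.text for d in node.iter("data")}
    return {label: units.get(label, {}) for label in children}


root, children = parse_newick(NEWICK)
linked = link_node_units(children, NODE_XML)
print(root, children["OrderA"], linked["Fam1"])
```

Matching on node labels keeps the node data independent of any one tree: the same XML can be linked to a different Newick string, and nodes without a matching unit simply come back empty.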

Actions

These steps are required to complete this upgrade to version 2.0:

1. Seeking design input from the user community,
2. Finalizing the data schema,
3. Converting the flat HTML into structured data (ultimately XML),
4. Setting up the online XML database, importing the data,
5. Creating an online tool for data management by Peter and others,
6. Choosing and fortifying the phylogeny data storage system,
7. Crafting the reference phylogeny that reflects current APG ordinal and family structure,
8. Designing and coding the algorithms for integration of node-based and phylogeny data, including the ‘up-passing’ of data through the tree,
9. Creating the web-interface to this APweb2 engine,
10. Prettifying the interface to make it appear similar to APweb1,
11. Testing and documenting the web-services,
12. Designing and implementing a long-term software and data maintenance plan.

Future considerations

Conversion from APweb 1.0 to 2.0 is part of a larger plan for plant biodiversity informatics. Further projects, which will be integral parts of APweb, include:

Creating dynamic linkages between text citations and bibliographic references, and encoding the bibliography as structured data (RIS, BibTeX...), with links to DOIs where possible. The vast size of the bibliography precludes this transformation in the current project.
Creating an up-to-date genus synonymy resource that will be nested, via the reference phylogeny, within the APG ordinal and family classification system. Institutional and data linkages to the NCBI taxonomy should be attempted.
Encoding (tagging) an increasing number of characters inside the free-text fields according to a robust reference ontology for plant morphology. The design of this ontology, including its need to be explicitly phylogenetic and evolutionary, is already under discussion, and elements of this facility may make their way into APweb version 2.0.
Allowing community contributions to the data. Some sort of wiki model may be appropriate. In fact, an alternative approach to this whole project might be to encode APweb2 in Semantic MediaWiki (http://semantic-mediawiki.org/).