TNRS Workshop

Toward a Taxonomic Name Resolution Service

Missouri Botanical Garden, St. Louis, MO
March 31-April 2, 2010

CONTENTS

Introduction
Why New World Plants
Relationship to Other Initiatives
Goals of the Meeting
Proposed Requirements of the TNRS
Meeting agenda
Meeting participants
Use Cases
Synonymy lookup examples
Architecture
Presentations
Post-meeting summary
Summary and prioritization of TNRS components

** Download pdf of TNRS Summary and prioritization, plus all use cases **

Meeting Overview

INTRODUCTION

The past decade has seen explosive growth in large biological databases aggregated from multiple sources. Continental and global-scale data warehouses and networks such as GBIF (www.gbif.org/), SpeciesLink (http://splink.cria.org.br/), and REMIB (http://www.conabio.gob.mx/remib_ingles/doctos/remib_ing.html) provide access to millions of records from biological collections worldwide. Thousands of ecological inventories and species trait measurements are now available through portals such as VegBank (www.vegbank.org/), SALVIAS (www.salvias.net) and TraitNet (www.columbia.edu/cu/traitnet/). Electronic archives such as GenBank (www.ncbi.nlm.nih.gov/Genbank/) and TreeBase (www.treebase.org/) house millions of records of sequence data and phylogenies for hundreds of thousands of organisms. Such "mega datasets" represent a major new tool for the study of biodiversity, and have made possible analyses at spatial and temporal scales unimaginable even a decade ago (e.g., Loarie et al. 2008, Weiser et al. 2007, García 2006, Peterson et al. 2002).

Unfortunately, the increasing use of mega datasets has highlighted a major obstacle within the biological sciences: the taxonomic impediment. Do two different names represent two species or one? Does the same name used in different data sets at different times refer to the same species? How can the intended meaning of misspelled names, abbreviations and variant spellings be recovered? Given the tools currently available, answering such questions for a large biological database, with hundreds or even thousands of taxon name strings, can be daunting, time-consuming and error-prone.

As a contribution toward resolving the taxonomic impediment, we will be discussing a Taxonomic Name Resolution Service (TNRS), to be developed by  the Missouri Botanical Garden (MBG) in collaboration with iPlant, BIEN, IPNI and other collaborators. The TNRS will be a suite of applications for automated and computer-assisted correction and standardization of taxonomic names, and will draw upon the extensive taxonomic resources of MBG's TROPICOS database  (www.tropicos.org), the International Plant Names Index  (www.ipni.org), and additional digitized monographic and regional taxonomies. Our primary goal is to facilitate taxonomic standardization of very large biological datasets through machine-to-machine transfer and manipulation of taxonomic data.

WHY NEW WORLD PLANTS?

Whereas extensive digitized taxonomy is available for most temperate regions of the world, the same cannot be said for the tropics. However, taxonomic knowledge is far more complete for the Americas than for the Old World tropics, and a large fraction of that knowledge is accessible digitally from a single source: the TROPICOS database. For many botanists and ecologists working in the Neotropics, TROPICOS is the de facto reference for both names and synonymy.

Our proposed focus on plants of the New World is largely pragmatic: combining TROPICOS taxonomy with additional sources of digitized monographic and regional synonymy will enable us to develop an Americas-wide TNRS essentially immediately, without the need for a major additional digitization effort. Focusing initially on the Americas will also meet the immediate needs of BIEN, whose compilation of plant inventories and specimens is overwhelmingly centered on the New World. However, the proposed TNRS architecture must be scalable to allow global coverage as additional taxonomic data become available.

RELATIONSHIP TO OTHER INITIATIVES

We recognize and support recent efforts to develop an open and global solution to the taxonomic impediment, in particular the Global Names Architecture (GNA) currently being developed under the leadership of GBIF (http://www.gbif.org/informatics/name-services/global-names-architecture/). The goals of the present meeting - to develop a TNRS based on existing taxonomic resources and focused predominantly on plants of the Americas - are more modest, and for that reason more quickly attainable. Rather than re-inventing the wheel, our intention is to use currently available digitized taxonomic resources to resolve taxonomy over a geographically limited domain. Our hope is that the proposed New World Plant TNRS will inform and improve global initiatives by providing a rich array of use cases and direct end-user testing of a functional TNRS.

GOALS OF THE MEETING

The primary goal of this BIEN/MBG collaboration will be to work together with iPlant and the iPlant Tree of Life (iPToL) project to draft detailed requirements for (1) a botanical Taxonomic Name Resolution Service (TNRS) and (2) a dynamic New World Plant Names Checklist (NWPC). The TNRS will allow New World plant occurrence data to be mapped to a standard set of taxon concepts based on existing digitized sources of regional and monographic taxonomy. Although the TNRS should be capable of encompassing all plant taxa, focusing  on the Americas and using existing taxonomic resources will allow rapid development and deployment of this urgently-needed application.

Technology to be developed with iPlant

The TNRS & NWPC will be web services, with batch-processing capability and an intuitive user interface, developed with the assistance of programmers and technical support from iPlant. The TNRS & NWPC services will be available to any user interested in correcting and harmonizing names of plant taxa. The development of the TNRS & NWPC necessitates short- and long-term goals. It is up to us to define these goals in order to develop this technology.

Specific Goals

(i) Overview of science goals. Science goals will help define use cases, needs, and structure of the TNRS.
(ii) Agree upon outcomes. Ensure MBG, BIEN, iPToL and collaborators are clear on respective needs and desired outcomes.
(iii) Articulate specific ‘use cases’. These will help define what the TNRS should and should not do as well as clarify programming and technology needs.
(iv) Articulate short and long-term goals for the TNRS. Goals will be translated into requirements documents and work flows to be implemented by iPlant and MBG informatics teams.
(v) Detail the longer-term vision, the goals for subsequent TNRS meetings, and the technology development goals between now and those meetings.

PROPOSED REQUIREMENTS OF THE TNRS

The first meeting will need to clearly articulate short- and long-term goals. Based on past discussions, there are two clear lines of development:

  1. Name matching
  2. Synonymy

These lines of development can start separately but will ultimately merge.

Short-term goals (to be addressed immediately after the meeting, with the aim of implementation within the first year):

(1) Name matching
      •    Data
         o    Compile a complete list of names within TROPICOS and ideally IPNI. Flag taxa occurring within the Americas.
      •    Applications
         o    Define and articulate the basic functions of proposed web service, including but not limited to:
               o    Match a user-submitted list of names against authoritative list.
               o    Catch and correct common spelling mistakes, atomize standard data elements, extract additional concatenated information, etc.
               o    Interface for user interpretation and adjudication of ambiguous cases based on match rankings
(2) Synonymy
      •    Data:
         o    Identify mechanisms for accessing all digitized synonymized checklists within MBG. The expectation is that these will provide a common framework for combining various authoritative lists into a dynamic New World plant checklist.
         o    Identify other digitized sources of monographic and regional synonymy
         o    Determine gaps and develop plan for capturing additional non-digitized sources of synonymy
      •    Applications
         o    Define functions of application for checking submitted names against list of synonymized names.
         o    Rank degrees of ambiguity
         o    Interface for user interpretation and adjudication of ambiguous cases
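
The two lines of development outlined above can be illustrated with a minimal Python sketch. It is not a proposed implementation: the name lists are tiny illustrative stand-ins for data that would come from TROPICOS, IPNI or a synonymized checklist, the function names (`normalize`, `rank_matches`, `resolve`) are hypothetical, and the similarity measure (the standard library's `difflib.SequenceMatcher`) and the 0.8 cutoff are assumptions chosen for brevity rather than anything agreed at the meeting.

```python
from difflib import SequenceMatcher

# Illustrative data only (assumption): a real service would load these from
# TROPICOS, IPNI, or a synonymized checklist, not from hard-coded lists.
ACCEPTED_NAMES = ["Quercus alba", "Quercus rubra", "Cedrela odorata"]
SYNONYMS = {"Cedrela mexicana": "Cedrela odorata"}  # synonym -> accepted name


def normalize(name):
    """Collapse whitespace, capitalize the genus, lowercase the remainder."""
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def rank_matches(submitted, cutoff=0.8):
    """Return (name, score) pairs scoring at least `cutoff`, best first."""
    query = normalize(submitted)
    known = ACCEPTED_NAMES + list(SYNONYMS)
    scored = [(n, SequenceMatcher(None, query, n).ratio()) for n in known]
    kept = [(n, s) for n, s in scored if s >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def resolve(submitted, cutoff=0.8):
    """Match one submitted name, then map it to an accepted name.

    Status is "matched", "ambiguous" (several inexact candidates, left for
    the user to adjudicate), or "no_match".
    """
    candidates = rank_matches(submitted, cutoff)
    result = {
        "submitted": submitted,
        "status": "no_match",
        "accepted": None,
        "candidates": candidates,
    }
    if not candidates:
        return result
    best, best_score = candidates[0]
    if best_score < 1.0 and len(candidates) > 1:
        # More than one plausible spelling: do not guess on the user's behalf.
        result["status"] = "ambiguous"
        return result
    result["status"] = "matched"
    # Synonymy step: replace a matched synonym with its accepted name.
    result["accepted"] = SYNONYMS.get(best, best)
    return result


if __name__ == "__main__":
    # Batch processing is simply a loop over the submitted list.
    for name in ["quercus  ALBBA", "Cedrela mexicana", "Xyz abc"]:
        print(resolve(name))
```

In this sketch the ranked `candidates` list is what an adjudication interface would present to the user for ambiguous cases. Keeping name matching (`rank_matches`) separate from the synonym lookup mirrors the expectation that the two lines of development start separately and merge later.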

Literature cited: