DataMatch

Name: DataMatch

Start date: 3/1/2011

Target Release: 0.4 (11Q2)

Status: Landed

Lead: Barb Banbury

Background

To perform post-tree analyses, trait data need to be matched to phylogenies. If the two datasets have different origins or are stored in different files, the following problems can occur:

  • More taxa in the tree than taxa with data;
  • More taxa with data than in the tree;
  • The taxa in the tree differ from the taxa for which data have been collected. This case can have different origins:
    • Different taxa have been collected;
    • The taxa names have changed;
    • The taxa names are incomplete;
    • The taxa names are misspelled.
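
The cases listed above can be told apart with plain set operations on the two name lists. The following is a minimal sketch, not the DataMatch implementation: the function name `compare_taxa`, the `cutoff` value and the example names are all assumptions made for illustration. It uses Python's standard `difflib` to suggest likely misspellings among the names that fail to match.

```python
from difflib import get_close_matches

def compare_taxa(tree_taxa, data_taxa, cutoff=0.8):
    """Classify taxa by where they occur and suggest likely misspellings.

    Hypothetical helper: `cutoff` is an arbitrary similarity threshold.
    """
    tree_set, data_set = set(tree_taxa), set(data_taxa)
    only_tree = sorted(tree_set - data_set)   # in the tree, no data
    only_data = sorted(data_set - tree_set)   # data, not in the tree
    # For each unmatched tree name, look for a similar unmatched data name.
    suggestions = {}
    for name in only_tree:
        close = get_close_matches(name, only_data, n=1, cutoff=cutoff)
        if close:
            suggestions[name] = close[0]
    return {
        "matched": sorted(tree_set & data_set),
        "only_in_tree": only_tree,
        "only_in_data": only_data,
        "suggestions": suggestions,
    }

# Illustrative input only.
tree = ["Quercus alba", "Quercus rubra", "Acer saccharum", "Pinus strobus"]
data = ["Quercus alba", "Quercus rubrra", "Acer saccharum", "Betula nigra"]
report = compare_taxa(tree, data)
```

String similarity can only hint at misspelled or incomplete names. Names that have changed (synonyms) generally need an authoritative source to be recognized.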

[On top of that, if multiple traits are stored in a table, a taxon name may be present in the file while one or more traits have not been recorded for one or more taxa. Depending on the type of analysis performed, three resolutions exist: complete independence, pairwise matching and minimum set. But this is to be moved somewhere else.]
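
The note above does not define the three resolutions, so the sketch below rests on an assumed reading: "complete independence" keeps, for each trait separately, the taxa with that trait recorded; "pairwise matching" keeps, for each pair of traits, the taxa with both recorded; "minimum set" keeps only the taxa with every trait recorded. The trait names and values are invented placeholders.

```python
from itertools import combinations

# None marks a trait that was not recorded for a taxon.
# All values here are made-up placeholders, not measurements.
traits = {
    "Quercus alba":   {"height": 25.0, "seed_mass": 3.2,  "leaf_area": None},
    "Acer saccharum": {"height": 30.0, "seed_mass": None, "leaf_area": 55.0},
    "Pinus strobus":  {"height": 40.0, "seed_mass": 0.02, "leaf_area": 1.1},
}
trait_names = ["height", "seed_mass", "leaf_area"]

def recorded(trait):
    """Return the set of taxa for which `trait` has a value."""
    return {taxon for taxon, values in traits.items()
            if values[trait] is not None}

# Assumed reading of "complete independence": one taxon set per trait.
per_trait = {t: recorded(t) for t in trait_names}

# Assumed reading of "pairwise matching": one taxon set per pair of traits.
pairwise = {(a, b): recorded(a) & recorded(b)
            for a, b in combinations(trait_names, 2)}

# Assumed reading of "minimum set": taxa with every trait recorded.
minimum_set = set.intersection(*per_trait.values())
```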

The goal of this project is to solve a critical problem that will arise once large trees become available in the Discovery Environment (DE), whether through ToL efforts, automatic tree-building or user uploads. Matching tree and data by manually correcting typos and identifying synonyms will be a major limitation on the use of these large phylogenies for post-tree analyses. The DE currently has a very limited tool to match tree and data, which doesn't allow subsetting or correcting taxa names. It also doesn't offer the possibility of checking the taxa names against accepted nomenclatures. For this latter aspect, the DE already implements a Taxonomic Name Resolution Service (TNRS) that allows the automatic correction and standardization of plant names using authoritative sources. In addition, having standardized taxa names across datasets will enable much wider sharing of the data.

Project Description

This project aims to address this problem by providing a tool that will allow the user to automatically match taxa across files. The problem can be split into two components: name resolution and name matching. In the first step, the taxa names in the two datasets are standardized via the TNRS service; in the second step, the taxa names appearing in the tree and the data file are compared. The user should be able to run the tool in a supervised or an unsupervised mode. In the unsupervised mode, names are standardized automatically and matched between tree and data, and all non-matching taxa are dropped from the tree and the dataset (and stored in a file for the user's review). In the supervised mode, the user will be able to control the various steps, including picking synonyms and manually correcting and matching data.
Update: Given the program's features, it is trivially simple to extend DataMatch to accept lists of taxa as input, resulting in a generic subsetter.
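
The unsupervised mode described above can be sketched as a two-step function. This is an illustration under stated assumptions, not the tool's actual code: `match_unsupervised` and `toy_resolve` are hypothetical names, and the lookup table stands in for the name resolution service, whose real interface is not described here. Because the function takes plain lists of names, it also works as the generic subsetter mentioned in the update.

```python
def match_unsupervised(tree_taxa, data_taxa, resolve):
    """Standardize both name lists, then keep only taxa present in both.

    `resolve` is any callable mapping an original name to a standardized
    one (hypothetical stand-in for a name resolution service). If two
    original names resolve to the same standardized name, the later one
    wins in this simple sketch.
    """
    # Step 1: name resolution (standardized name -> original name).
    tree_std = {resolve(name): name for name in tree_taxa}
    data_std = {resolve(name): name for name in data_taxa}
    # Step 2: name matching on the standardized names.
    shared = sorted(tree_std.keys() & data_std.keys())
    matched = [(tree_std[k], data_std[k], k) for k in shared]
    # Non-matching taxa are reported so they can be saved for review.
    dropped = {
        "tree": sorted(tree_std[k] for k in tree_std.keys() - data_std.keys()),
        "data": sorted(data_std[k] for k in data_std.keys() - tree_std.keys()),
    }
    return matched, dropped

# Toy resolver: a fixed lookup table used only for this example.
LOOKUP = {"Q. alba": "Quercus alba", "Quercus albba": "Quercus alba"}

def toy_resolve(name):
    return LOOKUP.get(name, name)

matched, dropped = match_unsupervised(
    ["Q. alba", "Acer saccharum", "Pinus strobus"],
    ["Quercus albba", "Acer saccharum", "Betula nigra"],
    toy_resolve,
)
```

Each entry of `matched` holds the original tree name, the original data name and the standardized name they share, so either file can be rewritten with the standardized names afterwards.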

Deliverables

A tool that will allow a user to:

  1. Select two files containing taxa names: any combination of a phylogeny, a trait table or a list of taxa.
  2. Run the two files through the taxonomic name resolution service. If necessary, the user should be allowed to indicate the genus, in case it is not present or has been abbreviated. The original taxa names will be replaced with the ones returned by the service if their score is above a certain threshold. Otherwise the original name will be retained, but the closest hits will also be returned. In case of synonyms, the preferred (how?) name will be returned, but the synonyms will also be reported.
  3. Review the matches and manually:
    1. Pick a synonym (names with synonyms should be flagged);
    2. Remove taxa;
    3. Attach a taxon to a suggested (i.e. below the threshold) name;
    4. Modify taxa names;
      1. Send the modified name to the resolution service?
    5. Match taxa.
  4. Automatically reduce the dataset to the matched taxa only (unsupervised mode).
  5. Save the modified files.
  6. Launch an analysis of the matched dataset.
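
The score rule in step 2 can be sketched as follows. Everything specific here is an assumption: the function name `apply_resolution`, the `(name, score, synonyms)` candidate format, the 0.9 threshold and the limit of three suggestions are placeholders, since the source states neither the service's response format nor the cutoff. The taxon names are dummy placeholders.

```python
def apply_resolution(original, candidates, threshold=0.9, max_hits=3):
    """Decide whether to replace `original` with a resolved name.

    `candidates` is a list of (name, score, synonyms) tuples, a
    hypothetical format chosen for this sketch.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if ranked and ranked[0][1] > threshold:
        name, _score, synonyms = ranked[0]
        return {
            "name": name,                # replaced by the best hit
            "replaced": True,
            "suggestions": [],
            "synonyms": list(synonyms),  # reported alongside the name
            "flagged": bool(synonyms),   # flag names that have synonyms
        }
    return {
        "name": original,                # below threshold: keep original
        "replaced": False,
        "suggestions": [c[0] for c in ranked[:max_hits]],  # closest hits
        "synonyms": [],
        "flagged": False,
    }

# Placeholder names and scores, for illustration only.
accepted = apply_resolution(
    "Aus buss", [("Aus bus", 0.97, ["Aus bussi"]), ("Aus cus", 0.40, [])]
)
retained = apply_resolution(
    "Aus xyz", [("Aus bus", 0.50, []), ("Aus cus", 0.60, [])]
)
```

In a supervised run, the `suggestions` of a retained name are the candidates a user could attach a taxon to in step 3, and `flagged` marks the names for which a synonym can be picked.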

Milestones

Milestone                                               ETA                 Reached
Tool to match taxa name across tree-data                Beginning of April  COMPLETED
Tool to match taxa name across tree-tree and data-data  Beginning of April  COMPLETED
Extend to list                                          End of April        COMPLETED
UI                                                      End of April        COMPLETED: Mockup
Inclusion of TNRS service                               End of April        COMPLETED
GUI for manual matching                                 TBD                 SPUN OFF

Team members (if applicable)

Name             Role              Contact
Barb Banbury     Lead              bbanbury@utk.edu
Jeremy Beaulieu                    jeremy.beaulieu@yale.edu
Naim Matasci     DE Integration    nmatasci@iplantcollaborative.org
Bill Piel        TNRS Integration  william.piel@yale.edu

https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8

Dependencies/Related projects

TNRS; DE 0.4