Exploring Taxon Concepts (ETC) through Analyzing Fine-Grained Semantic Markup of Descriptive Literature
NSF Award: DBI-1147266
Project Duration: 07/01/2012 - 06/30/2016
Scientific names are the primary identifiers for organisms and the anchor for the communication and comparison of biological knowledge. However, experts constantly revise the circumscription of taxa, which makes interpreting names through time challenging. This problem is exacerbated by experts’ lack of consensus on species concepts, both within and among major groups of organisms, so that opinions vary greatly from conservative (“lumpers”) to more liberal (“splitters”). As a result, it is very difficult for non-taxonomists and other non-experts to evaluate the modern interpretation of the scientific names used in a given dataset or publication, whether to compare it with another or to combine several for broader analysis. Building on our prior work on fine-grained semantic parsing of descriptive literature, we can for the first time contemplate a more quantifiable method for evaluating taxonomic concepts. Morphological descriptions drawn from both born-digital and digitized legacy literature provide a comprehensive history of what is known about the variability that taxa encompass.
The ETC project will develop novel ways of tying scientific names directly to published biological characteristics of organisms, and will implement a new user-friendly program, the Explorer of Taxon Concepts, to assist with the disambiguation of scientific names of species and of taxa at all other ranks. ETC integrates prototypes from several successful NSF-funded projects to enable (i) text-mining extraction of taxonomic knowledge from scientific literature, (ii) analysis and integration of this knowledge using logic-based reasoning and information-theoretic methods, and (iii) visualization of the results. The results will shed light on similarities and differences among scientists’ understanding of a particular species, as well as on relations between the terminology used by different scientists, allowing more accurate integration of data gathered by different investigators. One component of the ETC project is computer science research aimed at a novel integration of state-of-the-art logic inference and information-theoretic approaches for taxonomic science.
ETC’s components add scientific knowledge value to their inputs, making those inputs useful in many biodiversity information applications. For example, the selection of Rosaceae (the rose family) and Apoidea (the bee superfamily) as the ETC project’s case study will further research into critical pollination systems, which are currently of great concern because global reductions in bee populations have the potential to reduce the yield of many staple food crops. Character and anatomy ontologies built and enhanced by the ETC project will benefit all knowledge-based applications in biology.
The integration of ETC components with existing biological computing infrastructure will broaden their impact further. ETC partners with iPlant’s Education, Outreach, and Training (EOT) group to document the software for instructional use and encourage its adoption in the classroom. Components of the research and the final products will be packaged into learning modules for college- and graduate-level courses at the University of Arizona, the University of California at Davis, and other universities.
Project outcomes will become accessible through this website.