Untangling attribution in biodiversity data records

Author

Ariño, Arturo H.

Caballero-López, Berta

Lobato-Vila, Irene

Galicia, David

Publication date

2025-12-23



Abstract

The exactness, fitness-for-purpose (FFP) and reliability of primary biodiversity data can be enhanced by additional data beyond the basic taxon-location-date triad (Hill et al. 2010). Often, the only available data are the labels in legacy specimen collections. Digitization is most efficient if all available information is captured in a single specimen-handling event rather than in separate phases. There is, however, a trade-off between quickly producing a catalogue for immediate use and housekeeping, and building an accurate database with wider FFP in which all data have been thoroughly checked. Recognizing potential sources of error at digitization time may help in making these choices. During a dataset integration procedure, the quality and reliability of the data capture were analyzed. The dataset consisted of transcribed label data from over 58,000 pinned insects of agriculturally relevant groups in 20th-century collections at six institutions in Spain, which yielded almost 6000 collector strings. However, collector names could be (1) unidentified; (2) misread; (3) ambiguous; (4) duplicated under variants; or (5) misplaced or misattributed to or from another entity, e.g. a location. This resulted in a high level of entropy, with a single collector databased in multiple ways, artificially inflating the corresponding catalogues. Entropy was much higher in collections where collectors contributed few specimens, as is the case for university-based collections. Simple indexing and cross-referencing techniques significantly reduced the roster of names, but full disambiguation of collectors required mining ancillary sources and consulting people with long-standing knowledge of the collections. Overall, 42% of collector names were in error, resulting in excess entropy.
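
The abstract does not describe the indexing technique in detail, so the following is only a minimal sketch of one way that collector strings duplicated under variants might be indexed, and not the authors' method. It assumes a crude matching key (accents stripped, lowercase, punctuation removed, tokens sorted); the function names `index_key` and `group_variants` and the sample labels are invented for illustration.

```python
import unicodedata
from collections import defaultdict


def index_key(name: str) -> str:
    """Build a crude matching key (an assumption, not the paper's rule):
    strip accents, lowercase, drop punctuation, sort the tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    return " ".join(sorted(cleaned.split()))


def group_variants(strings):
    """Map each matching key to the set of raw strings that share it."""
    groups = defaultdict(set)
    for raw in strings:
        groups[index_key(raw)].add(raw)
    return dict(groups)


# Hypothetical label transcriptions, invented for this example.
labels = ["García, J.", "J. Garcia", "GARCIA J", "López, M.", "M. Lopez"]
groups = group_variants(labels)

for key, variants in sorted(groups.items()):
    print(key, "<-", sorted(variants))
```

A key of this kind only collapses spelling and ordering variants. It cannot resolve misread, ambiguous or misattributed names, which, as the abstract notes, required ancillary sources and people familiar with the collections.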

Document Type

Article

Document version

Published version

Language

English

UDC Subject

59 - Zoology

Subject

Collectors and collecting; Insects; Spain; Databases

Pages

6 p.

Version of

BISS: Biodiversity Information Science and Standards, no. 9 (2025), p. 1-6, e183199

Documents

Ariño_2025.pdf

225.7Kb

Rights

© Ariño A et al.

Attribution 4.0 International

