Untangling attribution in biodiversity data records

dc.contributor.author
Ariño, Arturo H.
dc.contributor.author
Caballero-López, Berta
dc.contributor.author
Lobato-Vila, Irene
dc.contributor.author
Galicia, David
dc.date.accessioned
2025-12-30T09:36:02Z
dc.date.available
2025-12-30T09:36:02Z
dc.date.issued
2025-12-23
dc.identifier.uri
http://hdl.handle.net/2072/489050
dc.description.abstract
The exactness, fitness-for-purpose (FFP) and reliability of primary biodiversity data can be enhanced by additional data beyond the basic taxon-location-date triad (Hill et al. 2010). Often, the only available data are the labels in legacy specimen collections. The digitization process is most efficient if all available information can be collected at once in a single specimen-handling event, rather than in separate phases. There is, however, a trade-off between producing a faster catalogue for immediate use and housekeeping, and producing an accurate, wider-FFP database in which all data have been thoroughly checked. Recognizing potential sources of error at digitization time may help in making these choices. During a dataset integration procedure, the quality and reliability of the data capture were analyzed. The dataset consisted of transcribed label data of over 58K pinned insects of agriculturally relevant groups in 20th-century collections at six institutions in Spain, which yielded almost 6000 collector strings. However, collector names could be (1) unidentified; (2) misread; (3) ambiguous; (4) duplicated under variants; or (5) misplaced or misattributed to/from another entity, e.g. a location. This resulted in a high entropy level, where one collector could be databased in multiple ways, artificially inflating the corresponding catalogues. The entropy was much higher in collections where collectors contributed few specimens, which is the case for university-based collections. By using simple indexing and cross-referencing techniques, the roster of names was significantly reduced, but full disambiguation of collectors required mining ancillary sources and consulting people with long-standing knowledge of the collections. Overall, 42% of collector names were in error, resulting in excess entropy.
ca
dc.format.extent
6 p.
ca
dc.language.iso
eng
ca
dc.relation.ispartof
BISS: Biodiversity Information Science and Standards, no. 9 (2025), p. 1-6, e183199
ca
dc.rights
© Ariño A et al.
ca
dc.rights
Attribution 4.0 International
*
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
*
dc.source
RECERCAT (Dipòsit de la Recerca de Catalunya)
dc.subject.other
Collectors and collections
ca
dc.subject.other
Insects
ca
dc.subject.other
Spain
ca
dc.subject.other
Databases
ca
dc.title
Untangling attribution in biodiversity data records
ca
dc.type
info:eu-repo/semantics/article
ca
dc.subject.udc
59
ca
dc.description.version
info:eu-repo/semantics/publishedVersion
ca
dc.embargo.terms
none
ca
dc.identifier.doi
https://doi.org/10.3897/biss.9.183199
ca
dc.rights.accessLevel
info:eu-repo/semantics/openAccess


Documents

Ariño_2025.pdf

225.7Kb PDF
