dc.contributor.author
Ariño, Arturo H.
dc.contributor.author
Caballero-López, Berta
dc.contributor.author
Lobato-Vila, Irene
dc.contributor.author
Galicia, David
dc.date.accessioned
2025-12-30T09:36:02Z
dc.date.available
2025-12-30T09:36:02Z
dc.date.issued
2025-12-23
dc.identifier.uri
http://hdl.handle.net/2072/489050
dc.description.abstract
The exactness, fitness-for-purpose (FFP) and reliability of primary biodiversity data can
be enhanced by additional data beyond the basic taxon-location-date triad (Hill et al.
2010). Often, the only available data are the labels in legacy specimen collections. The
digitization process is most efficient if all available information can be collected at once in
a single event of specimen handling, rather than in separate phases. There is, however,
a trade-off between quickly producing a catalogue for immediate use and
housekeeping, and building an accurate database with wider FFP in which all data have been
thoroughly checked.
Recognizing potential sources of error at digitization time may help in making these choices.
During a dataset integration procedure, the quality and reliability of the data capture were
analyzed. The dataset consisted of transcribed label data from over 58,000 pinned insects of
agriculturally relevant groups in 20th-century collections at six institutions in Spain, which
yielded almost 6000 collector strings. However, collector names could be:
1. unidentified;
2. misread;
3. ambiguous;
4. duplicated under variants; or
5. misplaced or misattributed to/from another entity, e.g. a location.
This resulted in a high entropy level where one collector could be databased in multiple
ways, artificially inflating the corresponding catalogues. The entropy was much higher in
collections where collectors contributed few specimens, which is the case for university-based
collections.
By using simple indexing and cross-referencing techniques, the roster of names was
significantly reduced, but full disambiguation of collectors required mining ancillary
sources and consulting with people with long-standing knowledge of the collections.
Overall, 42% of collector names were in error, resulting in excess entropy.
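
The "simple indexing and cross-referencing" step mentioned above might look roughly like the following. This is a minimal sketch, not the authors' procedure: the key rule (accent-free surname plus first initial), the function names `name_key` and `group_variants`, and the sample collector strings are all assumptions made for illustration. Keys this crude can both over-merge and under-merge, which is consistent with the abstract's point that full disambiguation needed ancillary sources and expert knowledge.

```python
import unicodedata
from collections import defaultdict


def _tokens(text: str) -> list[str]:
    """Split text into alphanumeric tokens, treating punctuation as spaces."""
    return "".join(ch if ch.isalnum() else " " for ch in text).split()


def name_key(raw: str) -> str:
    """Hypothetical index key: accent-free surname plus first given-name initial.

    Assumes 'Surname, Given' when a comma is present, otherwise 'Given Surname'.
    """
    # Decompose accented characters and drop the combining marks (ñ -> n).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    if "," in text:
        surname_part, given_part = text.split(",", 1)
    else:
        parts = _tokens(text)
        surname_part = parts[-1] if parts else ""
        given_part = " ".join(parts[:-1])
    surname = "".join(_tokens(surname_part))
    initials = "".join(tok[0] for tok in _tokens(given_part))
    return f"{surname}|{initials[:1]}"


def group_variants(strings: list[str]) -> dict[str, set[str]]:
    """Cross-reference raw collector strings that share the same index key."""
    index: dict[str, set[str]] = defaultdict(set)
    for s in strings:
        index[name_key(s)].add(s)
    return dict(index)


# Invented label transcriptions, not taken from the dataset.
labels = ["Pérez, J.", "J. Pérez", "Perez, Juan M.", "Ruiz, L.", "L. Ruiz"]
groups = group_variants(labels)
for key, variants in sorted(groups.items()):
    print(key, sorted(variants))
```

Each group with more than one member is a candidate set of variants for manual review rather than an automatic merge.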
ca
dc.relation.ispartof
BISS: Biodiversity Information Science and Standards, no. 9 (2025), p. 1-6, e183199
ca
dc.rights
© Ariño A et al.
ca
dc.rights
Attribution 4.0 International
*
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
*
dc.source
RECERCAT (Dipòsit de la Recerca de Catalunya)
dc.subject.other
Collectors and collections
ca
dc.subject.other
Insects
ca
dc.subject.other
Spain
ca
dc.subject.other
Databases
ca
dc.title
Untangling attribution in biodiversity data records
ca
dc.type
info:eu-repo/semantics/article
ca
dc.description.version
info:eu-repo/semantics/publishedVersion
ca
dc.identifier.doi
https://doi.org/10.3897/biss.9.183199
ca
dc.rights.accessLevel
info:eu-repo/semantics/openAccess