Attacking misinformation detection using adversarial examples generated by language models

dc.contributor.author
Przybyła, Piotr
dc.contributor.author
McGill, Euan
dc.contributor.author
Saggion, Horacio
dc.date.accessioned
2026-03-07T08:52:46Z
dc.date.available
2026-03-07T08:52:46Z
dc.date.issued
2025
dc.identifier
Przybyła P, McGill E, Saggion H. Attacking misinformation detection using adversarial examples generated by language models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9. Suzhou, China. Kerrville: ACL; 2025. p. 27626-42. DOI: 10.18653/v1/2025.emnlp-main.1405
dc.identifier
9798891763326
dc.identifier
https://hdl.handle.net/10230/72718
dc.identifier
http://dx.doi.org/10.18653/v1/2025.emnlp-main.1405
dc.identifier.uri
https://hdl.handle.net/10230/72718
dc.description.abstract
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms on social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform quantitative evaluation using various prompts, models and query limits, targeted manual assessment of the generated text and qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.
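The core loop described in the abstract (LLM rephrasings decomposed into small edits, searched under a query budget until the victim classifier flips) can be illustrated with a minimal sketch. This is not the authors' TREPAT implementation: all names (atomic_edits, beam_search_attack, the victim callable, the budget and beam-width defaults) are hypothetical, and it assumes the victim classifier returns a label and a confidence score for each query.

```python
import difflib

def atomic_edits(original_words, rephrased_words):
    """Decompose one LLM rephrasing into small word-level edits."""
    matcher = difflib.SequenceMatcher(a=original_words, b=rephrased_words)
    return [(i1, i2, tuple(rephrased_words[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_edits(original_words, edits):
    """Apply edits right to left so earlier indices stay valid
    (assumes the chosen edits do not overlap)."""
    words = list(original_words)
    for i1, i2, replacement in sorted(edits, reverse=True):
        words[i1:i2] = replacement
    return words

def beam_search_attack(text, rephrasings, victim, query_budget=100, beam_width=5):
    """Search over combinations of small edits until the victim classifier
    changes its decision or the query budget is exhausted."""
    orig_words = text.split()
    original_label, _ = victim(text)          # victim: text -> (label, confidence)
    queries = 1
    # Candidate edits come from all LLM-generated rephrasings of the input.
    all_edits = [e for r in rephrasings
                 for e in atomic_edits(orig_words, r.split())]
    beam = [frozenset()]                      # each entry: a set of applied edits
    while queries < query_budget:
        scored = []
        for applied in beam:
            for edit in all_edits:
                if edit in applied:
                    continue
                candidate = applied | {edit}
                cand_text = " ".join(apply_edits(orig_words, candidate))
                label, confidence = victim(cand_text)
                queries += 1
                if label != original_label:
                    return cand_text          # decision flipped: attack succeeded
                scored.append((confidence, candidate))
                if queries >= query_budget:
                    break
        if not scored:
            break
        scored.sort(key=lambda c: c[0])       # keep the least-confident variants
        beam = [cand for _, cand in scored[:beam_width]]
    return None                               # budget exhausted, attack failed
```

In the paper the initial rephrasings are produced by prompting large language models with meaning-preserving tasks such as text simplification and style transfer; here they are simply a list of strings passed to the function.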
dc.description.abstract
The work of P. Przybyła is part of the ERINIA project, which received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101060930. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the funders. Neither the European Union nor the granting authority can be held responsible for them. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019. We also acknowledge support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. Finally, we are grateful for the participation of Alba Táboas García in the manual evaluation effort.
dc.format
application/pdf
dc.language
eng
dc.publisher
ACL (Association for Computational Linguistics)
dc.relation
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9. Suzhou, China. Kerrville: ACL; 2025.
dc.relation
info:eu-repo/grantAgreement/EC/HE/101060930
dc.rights
© ACL, Creative Commons Attribution 4.0 License
dc.rights
http://creativecommons.org/licenses/by/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Misinformation detection
dc.subject
Adversarial examples
dc.subject
Language models
dc.title
Attacking misinformation detection using adversarial examples generated by language models
dc.type
info:eu-repo/semantics/bookPart
dc.type
info:eu-repo/semantics/publishedVersion

