Attacking misinformation detection using adversarial examples generated by language models

dc.contributor.author
Przybyła, Piotr
dc.contributor.author
McGill, Euan
dc.contributor.author
Saggion, Horacio
dc.date.accessioned
2026-03-07T08:52:46Z
dc.date.available
2026-03-07T08:52:46Z
dc.date.issued
2025
dc.identifier
Przybyła P, McGill E, Saggion H. Attacking misinformation detection using adversarial examples generated by language models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9. Suzhou, China. Kerrville: ACL; 2025. p. 27626-42. DOI: 10.18653/v1/2025.emnlp-main.1405
dc.identifier
9798891763326
dc.identifier
https://hdl.handle.net/10230/72718
dc.identifier
http://dx.doi.org/10.18653/v1/2025.emnlp-main.1405
dc.identifier.uri
https://hdl.handle.net/10230/72718
dc.description.abstract
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms on social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform quantitative evaluation using various prompts, models and query limits, targeted manual assessment of the generated text and qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.
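The core loop described in the abstract (LLM rephrasings decomposed into small edits, searched under a query budget until the victim classifier flips) can be illustrated with a minimal sketch. This is not the authors' TREPAT implementation: all names (atomic_edits, beam_search_attack, the victim callable, the budget and beam-width defaults) are hypothetical, and it assumes the victim classifier returns a label and a confidence score for each query.

```python
import difflib

def atomic_edits(original_words, rephrased_words):
    """Decompose one LLM rephrasing into small word-level edits."""
    matcher = difflib.SequenceMatcher(a=original_words, b=rephrased_words)
    return [(i1, i2, tuple(rephrased_words[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_edits(original_words, edits):
    """Apply edits right to left so earlier indices stay valid
    (assumes the chosen edits do not overlap)."""
    words = list(original_words)
    for i1, i2, replacement in sorted(edits, reverse=True):
        words[i1:i2] = replacement
    return words

def beam_search_attack(text, rephrasings, victim, query_budget=100, beam_width=5):
    """Search over combinations of small edits until the victim classifier
    changes its decision or the query budget is exhausted."""
    orig_words = text.split()
    original_label, _ = victim(text)          # victim: text -> (label, confidence)
    queries = 1
    # Candidate edits come from all LLM-generated rephrasings of the input.
    all_edits = [e for r in rephrasings
                 for e in atomic_edits(orig_words, r.split())]
    beam = [frozenset()]                      # each entry: a set of applied edits
    while queries < query_budget:
        scored = []
        for applied in beam:
            for edit in all_edits:
                if edit in applied:
                    continue
                candidate = applied | {edit}
                cand_text = " ".join(apply_edits(orig_words, candidate))
                label, confidence = victim(cand_text)
                queries += 1
                if label != original_label:
                    return cand_text          # decision flipped: attack succeeded
                scored.append((confidence, candidate))
                if queries >= query_budget:
                    break
        if not scored:
            break
        scored.sort(key=lambda c: c[0])       # keep the least-confident variants
        beam = [cand for _, cand in scored[:beam_width]]
    return None                               # budget exhausted, attack failed
```

In the paper the initial rephrasings are produced by prompting large language models with meaning-preserving tasks such as text simplification and style transfer; here they are simply a list of strings passed to the function.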
dc.description.abstract
The work of P. Przybyła is part of the ERINIA project, which received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101060930. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the funders. Neither the European Union nor the granting authority can be held responsible for them. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019. We also acknowledge support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. Finally, we are grateful for the participation of Alba Táboas García in the manual evaluation effort.
dc.format
application/pdf
dc.language
eng
dc.publisher
ACL (Association for Computational Linguistics)
dc.relation
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9. Suzhou, China. Kerrville: ACL; 2025.
dc.relation
info:eu-repo/grantAgreement/EC/HE/101060930
dc.rights
© ACL, Creative Commons Attribution 4.0 License
dc.rights
http://creativecommons.org/licenses/by/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Misinformation detection
dc.subject
Adversarial examples
dc.subject
Language models
dc.title
Attacking misinformation detection using adversarial examples generated by language models
dc.type
info:eu-repo/semantics/bookPart
dc.type
info:eu-repo/semantics/publishedVersion

