2026-03-06T14:48:51Z
2025
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms on social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulating content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. These modifications are then decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform quantitative evaluation using various prompts, models and query limits, targeted manual assessment of the generated text, and qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.
The work of P. Przybyła is part of the ERINIA project, which received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101060930. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the funders. Neither the European Union nor the granting authority can be held responsible for them. We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019. We also acknowledge support from the Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the María de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. Finally, we are grateful for the participation of Alba Táboas García in the manual evaluation effort.
Chapter or part of a book
Published version
English
Misinformation detection; Adversarial examples; Language models
ACL (Association for Computational Linguistics)
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9; Suzhou, China. Kerrville: ACL; 2025.
info:eu-repo/grantAgreement/EC/HE/101060930
© ACL, Creative Commons Attribution 4.0 License
http://creativecommons.org/licenses/by/4.0/