2026-03-06T14:48:51Z
2025
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms on social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulating content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. These modifications are then decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform quantitative evaluation using various prompts, models and query limits, targeted manual assessment of the generated text, and qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.
The work of P. Przybyła is part of the ERINIA project, which received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101060930. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the funders. Neither the European Union nor the granting authority can be held responsible for them. We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019. We also acknowledge support from the Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the María de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. Finally, we are grateful for the participation of Alba Táboas García in the manual evaluation effort.
Chapter or part of a book
Published version
English
Misinformation detection; Adversarial examples; Language models
ACL (Association for Computational Linguistics)
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9; Suzhou, China. Kerrville: ACL; 2025.
info:eu-repo/grantAgreement/EC/HE/101060930
© ACL, Creative Commons Attribution 4.0 License
http://creativecommons.org/licenses/by/4.0/