dc.contributor.author
Niayesh, Nazanin
dc.date.accessioned
2025-11-05T20:27:34Z
dc.date.available
2025-11-05T20:27:34Z
dc.date.issued
2025-11-04T16:35:21Z
dc.identifier.uri
http://hdl.handle.net/10230/71767
dc.description.abstract
Master's thesis of the Erasmus Mundus Joint Master in Artificial Intelligence (EMAI)
dc.description.abstract
Supervisor: Prof. Lejla Batina
Co-supervisor: Dr. Maria-Irina Nicolae
dc.description.abstract
Large language models (LLMs) are increasingly used in domains such as healthcare, research, and education. As these models are deployed more widely, particularly in safety-critical systems (e.g., self-driving cars), their security becomes ever more important. Automated red teaming aims to uncover the security vulnerabilities of LLMs and LLM-based applications efficiently and effectively, so that they can be mitigated before malicious parties exploit them. Red teaming often relies on jailbreak attacks. Most work on jailbreak attacks in the literature focuses on single-turn attacks, which are executed as a single input prompt to the target LLM within one conversation turn. Multi-turn attacks, however, better resemble manual red teaming, which is typically carried out over several conversation turns, and they have been shown to uncover a greater number and variety of weaknesses and security vulnerabilities than single-turn attacks. While research on multi-turn attacks has grown recently, most work on automatic jailbreaking still centers on single-turn techniques. This work aims to improve multi-turn attacks and broaden their adoption so that they can be used in automated red teaming to strengthen the security of LLM applications. The focus lies on the automatic black-box multi-turn attack Generative Offensive Agent Tester (GOAT). Specifically, we extend GOAT with new jailbreak strategies and implement both GOAT and these additional strategies in the Azure Python Risk Identification Tool (PyRIT) red-teaming and security-evaluation framework. Extensive experiments compare the performance of the proposed changes against the original GOAT method in a variety of setups involving several tuned attacker and scorer models. A comprehensive analysis of the experimental results shows that the proposed modifications yield promising improvements in most setups.
dc.format
application/pdf
dc.rights
CC Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Artificial intelligence
dc.title
Multi-turn attacks for automated LLM red teaming
dc.type
info:eu-repo/semantics/masterThesis