dc.contributor.author
Niayesh, Nazanin
dc.date.accessioned
2025-11-05T20:27:34Z
dc.date.available
2025-11-05T20:27:34Z
dc.date.issued
2025-11-04T16:35:21Z
dc.identifier.uri
http://hdl.handle.net/10230/71767
dc.description.abstract
Master's thesis of the Erasmus Mundus Joint Master in Artificial Intelligence (EMAI)
dc.description.abstract
Supervisor: Prof. Lejla Batina
Co-supervisor: Dr. Maria-Irina Nicolae
dc.description.abstract
Large language models (LLMs) are increasingly used in domains such as healthcare, research, and education. As these models are deployed more widely, particularly in safety-critical systems (e.g., self-driving cars), their security becomes ever more important. Automated red teaming aims to uncover the security vulnerabilities of LLMs and LLM-based applications efficiently and effectively, so that they can be mitigated before malicious parties exploit them. Red teaming often relies on jailbreak attacks. Most work on jailbreak attacks in the literature focuses on single-turn attacks, which are executed as a single input prompt to the target LLM within one conversation turn. Multi-turn attacks, however, better resemble manual red teaming, which is typically carried out over several conversation turns, and they have been shown to uncover a greater number and variety of weaknesses and security vulnerabilities than single-turn attacks. While research on multi-turn attacks has grown recently, most work on automatic jailbreaking still centers on single-turn techniques. This work aims to improve multi-turn attacks and broaden their adoption so that they can be used in automated red teaming to strengthen the security of LLM applications. The focus lies on the automatic black-box multi-turn attack Generative Offensive Agent Tester (GOAT). Specifically, we extend GOAT with new jailbreak strategies and implement both GOAT and these additional strategies in the Azure Python Risk Identification Tool (PyRIT) red-teaming and security-evaluation framework. Extensive experiments compare the performance of the proposed changes against the original GOAT method in a variety of setups involving several tuned attacker and scorer models. A comprehensive analysis of the experimental results shows that the proposed modifications yield promising improvements in most setups.
dc.format
application/pdf
dc.rights
CC Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Artificial intelligence
dc.title
Multi-turn attacks for automated LLM red teaming
dc.type
info:eu-repo/semantics/masterThesis