2024 IEEE Symposium on Security and Privacy (SP)

Abstract

Large Language Models (LLMs) are increasingly employed in numerous applications. It is hence important to ensure that their ethical standards align with those of humans. However, existing jail-breaking efforts show that such alignment can be compromised by well-crafted prompts. In this paper, we disclose a new threat to LLM alignment when a malicious actor has access to the top-k token predictions at each output position of the model, as is the case for all open-source LLMs and many commercial LLMs that provide the needed APIs (e.g., some GPT versions). The attack does not require crafting any prompt. Instead, it leverages the observation that even when an LLM declines a toxic query, the harmful response is concealed deep within the output logits. We can coerce the model to disclose it by forcibly selecting low-ranked output tokens during auto-regressive generation, and such forcing is needed at only a small number of selected output positions. We call this technique model interrogation. Since our method operates differently from jail-breaking, it is more effective than state-of-the-art jail-breaking techniques (92% versus 62%) and is 10 to 20 times faster. The toxic content elicited by our method is also of higher quality. More importantly, it is complementary to jail-breaking, and a synergetic integration of the two outperforms either method alone. We also find that, with interrogation, harmful content can be extracted even from models customized for coding tasks.
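To make the core mechanism concrete, the sketch below illustrates the general idea the abstract describes: during auto-regressive decoding, instead of always taking the most likely token, a low-ranked token from the top-k candidates is forced at a few chosen output positions. This is not the authors' released code; the model name, the forced positions, and the ranks chosen at them are hypothetical placeholders.

```python
# Minimal sketch of forcing low-ranked tokens during greedy decoding.
# Assumes any open-source causal LM whose per-position logits are accessible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; substitute any causal LM exposing logits
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Example query"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Offsets into the generated continuation where the greedy choice is overridden,
# mapped to the rank (0 = most likely) to force instead. Values are hypothetical.
forced_rank_at_position = {0: 3, 5: 2}

max_new_tokens = 40
generated = input_ids
with torch.no_grad():
    for step in range(max_new_tokens):
        logits = model(generated).logits[:, -1, :]        # next-token logits
        topk = torch.topk(logits, k=10, dim=-1)           # top-k candidate tokens
        rank = forced_rank_at_position.get(step, 0)       # default: greedy (rank 0)
        next_token = topk.indices[:, rank].unsqueeze(-1)  # pick the forced rank
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In the paper's setting, the key point is that only a handful of such positions need to be overridden; the remaining tokens are generated greedily as usual, so the forced low-ranked choices steer the model away from its refusal continuation.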