Invited speaker 1: Gelei Deng (邓格雷)
In this talk, we explore the emerging challenge of LLM jailbreaks by surveying a range of attack methodologies, from handcrafted adversarial prompts and automated frameworks such as MasterKey to RAG-based poisoning, which manipulates externally retrieved content to induce harmful outputs. We then review common defensive approaches from both industry and academia, and introduce a new methodology, proactive safety reasoning: a self-scrutinizing mechanism that detects and thwarts malicious prompts at their inception.
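
As a rough, hypothetical illustration of the proactive safety reasoning idea (not the speaker's actual implementation), the Python sketch below wraps a generic chat-completion call in a self-scrutiny step that audits the incoming prompt, along with any retrieved RAG context, before answering. Both call_llm and the SAFETY_PROBE template are assumed placeholders.

    # Minimal sketch of proactive safety reasoning: the model first reasons
    # explicitly about whether the request (including retrieved RAG context)
    # is an attempt to elicit harmful output, and only then answers.
    # call_llm and SAFETY_PROBE are hypothetical placeholders.

    SAFETY_PROBE = (
        "You are a safety auditor. Inspect the user request and any attached "
        "context below. Answer exactly 'SAFE' or 'UNSAFE', then give one "
        "sentence of justification.\n\nRequest:\n{request}\n\nContext:\n{context}"
    )

    def call_llm(prompt: str) -> str:
        """Placeholder for whatever chat-completion API is in use."""
        raise NotImplementedError

    def guarded_answer(request: str, retrieved_context: str = "") -> str:
        # Step 1: proactive self-scrutiny of the prompt and retrieved content.
        verdict = call_llm(
            SAFETY_PROBE.format(request=request, context=retrieved_context)
        )
        if verdict.strip().upper().startswith("UNSAFE"):
            return "Request declined: flagged as a potential jailbreak."
        # Step 2: answer only after the request passes the safety check.
        return call_llm(f"{retrieved_context}\n\n{request}")

The key design point is that the safety check runs before generation rather than filtering output afterwards, which is what lets such a mechanism stop malicious prompts, including poisoned retrieved content, at their inception.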