LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment
Qiuyu Chen, Shingo Yamaguchi, Yudai Yamamoto

In recent years, the misuse of large language models (LLMs) has emerged as a significant issue. This paper focuses on a specific attack method, the greedy coordinate gradient (GCG) jailbreak attack, which compels LLMs to generate responses beyond ethical boundaries. We have developed a tool to suppress the improper use of LLMs by employing a high-precision detection method that combines syntactic tree analysis with the perplexity of the input text. Furthermore, the tool incorporates a small language model (SLM), DistilBERT, to evaluate the harmfulness of sentences, thereby preventing harmful content from reaching the LLM. Experimental results demonstrate that the tool effectively detects GCG jailbreak attacks and contributes to the secure usage of LLMs. In our tests, the defense success rate reached 90.8%.
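
The abstract describes a two-stage filter: a perplexity-based check for GCG-style adversarial suffixes followed by a DistilBERT harmfulness judgment. The sketch below illustrates one possible realization of that pipeline; it is not the authors' implementation. The perplexity threshold, the use of GPT-2 as the scoring model, and the classifier checkpoint are illustrative assumptions (a deployed system would use a DistilBERT model fine-tuned on harmful/benign data, and the paper additionally uses syntactic tree analysis, which is omitted here).

```python
# Minimal sketch (assumptions noted above): flag likely GCG adversarial
# suffixes via perplexity, then screen the prompt with a DistilBERT
# classifier before it is forwarded to the LLM.
import torch
from transformers import (
    GPT2LMHeadModel, GPT2TokenizerFast,
    DistilBertTokenizerFast, DistilBertForSequenceClassification,
)

ppl_tok = GPT2TokenizerFast.from_pretrained("gpt2")
ppl_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clf_tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
# Assumption: in practice this would be a checkpoint fine-tuned for
# harmful/benign classification; the base model is used here only as a stand-in.
clf_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
).eval()

PPL_THRESHOLD = 1000.0  # illustrative cutoff; GCG suffixes tend to have very high perplexity


def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (exp of the mean token cross-entropy)."""
    ids = ppl_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = ppl_model(ids, labels=ids).loss
    return torch.exp(loss).item()


def is_harmful(text: str) -> bool:
    """Ethics judgment: True if the classifier's 'harmful' class (label 1) wins."""
    enc = clf_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf_model(**enc).logits
    return logits.argmax(dim=-1).item() == 1


def allow_prompt(prompt: str) -> bool:
    """Gate a user prompt before it reaches the protected LLM."""
    if perplexity(prompt) > PPL_THRESHOLD:
        return False  # likely GCG-style adversarial suffix
    if is_harmful(prompt):
        return False  # flagged by the ethics classifier
    return True
```

In this arrangement the cheap perplexity check runs first, so most adversarial suffixes are rejected without invoking the classifier; the threshold would need to be calibrated on benign prompts to keep false positives low.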