TSA: A Two-Stage Jailbreak Attack Exploiting Logical Consistency of Large Language Models
Yi Du, Wenjuan Lian, Hongbao Zhang, Tong Liu, Fengnian Cai

Large language models (LLMs) are widely deployed in high-stakes decision-making tasks, raising growing security concerns. Jailbreak attacks, a major threat to LLMs, have evolved from superficial semantic evasion to exploiting inherent model properties. LLMs exhibit a tendency toward logical consistency: once a model accepts a premise, it tends to follow the resulting reasoning chain, which may lead it to generate harmful content even when the final output violates safety rules. This tendency is a potential vulnerability that jailbreak attacks could exploit. To study this vulnerability, this paper proposes TSA (Two-Stage Jailbreak Attack), a lightweight two-stage jailbreak framework. The method consists of two core steps: first, logic presetting, which guides the model to generate a structured analysis report of harmful behavior and thereby establishes a compliant logical premise; second, intent enhancement, which extracts execution paths from the generated analysis and prompts the model to autonomously produce harmful outputs. Evaluations on nine mainstream LLMs show that TSA achieves an average attack success rate (ASR) of 84.83% on MiniAdvBench with only 3.61 Queries Per Successful Jailbreak (QPS), and an average ASR of 51.44% on MiniHarmBench, performing favorably compared with the evaluated baseline methods under our experimental settings. The findings suggest that this logical-consistency-based vulnerability may exist across the tested mainstream LLMs, highlighting the need to strengthen safety alignment to defend against this category of reasoning-driven jailbreak attacks.
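
To make the two reported metrics concrete, the following minimal sketch shows one way ASR and QPS could be computed from per-behavior evaluation logs. It is an illustration under stated assumptions, not the authors' evaluation code: the `AttemptRecord` layout and both function names are hypothetical, and QPS is assumed here to be the mean number of queries over successful attempts only, which the abstract does not define explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AttemptRecord:
    """Outcome for one benchmark behavior (hypothetical record layout)."""
    succeeded: bool     # whether a judge marked the attempt as a successful jailbreak
    queries_used: int   # number of queries sent to the target model for this behavior


def attack_success_rate(records: Sequence[AttemptRecord]) -> float:
    """Return ASR as a percentage: successful behaviors / all behaviors."""
    if not records:
        raise ValueError("at least one record is required")
    successes = sum(1 for r in records if r.succeeded)
    return 100.0 * successes / len(records)


def queries_per_success(records: Sequence[AttemptRecord]) -> Optional[float]:
    """Return mean queries over successful attempts (assumed QPS definition).

    Returns None when no attempt succeeded, since the ratio is undefined.
    """
    successful = [r for r in records if r.succeeded]
    if not successful:
        return None
    return sum(r.queries_used for r in successful) / len(successful)


# Toy data for illustration only; these are not results from the paper.
records = [
    AttemptRecord(succeeded=True, queries_used=2),
    AttemptRecord(succeeded=False, queries_used=10),
    AttemptRecord(succeeded=True, queries_used=4),
    AttemptRecord(succeeded=True, queries_used=3),
]
print(attack_success_rate(records))   # percentage of behaviors jailbroken
print(queries_per_success(records))   # mean queries among successes
```

Under this assumed definition, failed attempts count toward ASR but not toward QPS, so the two metrics should be read together: a low QPS is only meaningful alongside the ASR it was measured at. An average over several models, as reported above, would then be the mean of the per-model values.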