Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

doi:10.1145/3808209

DOI: 10.1145/3808209 ISSN: 2994-970X

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Leizhen Zhang, Shuhan Chen, Sheng Chen

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions—Vertex Cover and a discrete 3D-packing formulation—designed to probe representation-invariant reasoning. Our evaluation begins with the conventional lens (accuracy/precision/recall/F1) and the phase-transition setting. We find that traditional metrics are frequently misleading. Models achieve high scores even though they tend to classify all formulas as satisfiable, fail to reproduce the classical easy--hard--easy signature around the 3-SAT threshold, and degrade sharply as the number of variables N grows.

To address this, we introduce a paired-formula protocol (minimally different satisfiable/unsatisfiable instances) and a new measure, Accurate Differentiation Rate (ADR), which requires prediction on both members of each pair correct. ADR cleanly separates reasoning-oriented models from heuristic ones and correlates with witness validity (truth assignments that actually satisfy the formula). Extending beyond CNF, we test cross-representation consistency via standard reductions: (i) Convert CNF to Vertex Cover and (ii) Convert 3-SAT to discrete 3D packing with verifiable placement constraints. Decisions made on CNF and on their graph/packing counterparts agree for most models on >80 percent of instances, revealing stable decision rules across representations. A leading model (e.g., GPT-5) achieves both high invariance and correctness on small N, but still suffers scale-induced degradation.

Taken together, our results support the thesis that SAT is a conservative probe for LLM reasoning: performance on SAT predicts transfer to other NP-style reductions, while paired evaluation with ADR provides a faithful, representation-robust assessment beyond conventional metrics.

Outline

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

More from our Archive