Comparative Evaluation of Gemini and DeepSeek for LLM-Generated Code Quality and Architectural Robustness in Backend Software Engineering

doi:10.3390/electronics15132805

ISSN: 2079-9292

Marko Horvat, Iva Ursić, Klara Krmpotić

The increasing integration of large language models (LLMs) into software engineering workflows, a practice commonly referred to as vibe-coding, necessitates systematic empirical evaluation of their code generation capabilities, especially in complex backend development and architectural decision-making. This study compares two popular foundational models, Google Gemini 3 Pro and DeepSeek-V3.1, on the development of a Java/Spring Boot backend application, using a structured prompt-chaining protocol that follows a typical vibe-coding process. The generated solutions were evaluated against several quantitative and qualitative criteria, including the number of corrective prompts, the extent of required manual code interventions, functional correctness, architectural robustness, maintainability-related design choices, latency, and test quality. The results show substantial differences between the two models. DeepSeek required twice as many corrective natural language prompts as Gemini, but both models required a similar number of manual interventions in the generated code: 23 for DeepSeek and 20 for Gemini. The most pronounced difference was in architectural reasoning. Gemini autonomously introduced the Data Transfer Object (DTO) design pattern, resulting in a decoupled architecture, although at the cost of a minor performance issue. In contrast, DeepSeek was better at producing boilerplate code but exposed raw JPA entities through the application interface, leading to tight coupling and other issues. Gemini’s solution satisfied 90.25% of the evaluated requirements, compared to 68.08% for DeepSeek. Additionally, the tests generated by Gemini showed a higher success rate and broader code coverage, with 95.7% successful test execution and 55.9% code coverage, compared to 74.1% and 45.6% for DeepSeek, respectively.
The results indicate that within the paradigm of vibe-coding, even the best available foundational LLMs may still require expert human supervision, especially when the generated code is expected to satisfy specific requirements in production-oriented backend systems.
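
The architectural difference described above, returning a Data Transfer Object instead of a raw persistence entity, can be illustrated with a minimal sketch. It is not code from the study: the names `DtoSketch`, `UserEntity`, `UserDto`, and the `passwordHash` field are hypothetical, and the entity is a plain class standing in for a JPA entity so that the example runs without Spring Boot or a persistence provider.

```java
public class DtoSketch {

    // Hypothetical stand-in for a JPA entity. In a Spring Boot application this
    // class would carry persistence annotations; they are omitted here so the
    // sketch compiles with the JDK alone (Java 16+ for records).
    static final class UserEntity {
        final long id;
        final String email;
        final String passwordHash; // internal field that should not reach API clients

        UserEntity(long id, String email, String passwordHash) {
            this.id = id;
            this.email = email;
            this.passwordHash = passwordHash;
        }
    }

    // Hypothetical DTO: only the fields the API contract is meant to expose.
    record UserDto(long id, String email) {

        // Explicit mapping step between the persistence model and the API model.
        static UserDto from(UserEntity entity) {
            return new UserDto(entity.id, entity.email);
        }
    }

    public static void main(String[] args) {
        UserEntity entity = new UserEntity(1L, "ana@example.com", "not-a-real-hash");

        // A controller that returned `entity` directly would tie the response
        // shape to the persistence schema; returning the DTO keeps them separate.
        UserDto dto = UserDto.from(entity);
        System.out.println(dto);
    }
}
```

In this sketch the API shape can change independently of the persistence model, and internal fields stay out of responses. The mapping step adds one extra object per response, which is a general trade-off of the pattern; the sketch makes no claim about the cause of the performance issue reported for Gemini's solution.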
