Multi-Model Ensemble Evaluation of Student Design Projects in Higher Education: A Comparative Analysis of AI and Human Expert Grading
Filip Cvitić, Tajana Koren Ivančević, Nikolina Stanić Loknar

This study investigates the potential, limitations, and pedagogical implications of applying a parallel multi-model AI evaluation workflow, using ChatGPT, DeepSeek, and Uizard, to assess student design projects in higher education. Because design assessment involves both formal criteria and subjective creative interpretation, the study first established a human expert baseline based on three independent university professors. The human inter-rater reliability was low to moderate, with a mean pairwise Spearman’s ρ of 0.36 and Cronbach’s α of 0.60 for packaging design, and ρ of 0.43 and α of 0.69 for web design. This finding is central to the study, as it shows that the human benchmark in creative design assessment is itself variable and interpretive. Against this baseline, AI–human alignment remained limited and task-dependent. For packaging design, the AI ensemble showed only a weak positive association with the human expert baseline (Spearman’s ρ = 0.30, p = 0.031), which should be interpreted cautiously given the Bonferroni-adjusted significance threshold used in the study. For web design, no significant AI–human association was observed. Qualitative analysis of AI-generated rationales identified recurring limitations, including hallucination, aesthetic shield effects, and missed context, where visually polished work was rewarded despite deeper conceptual or structural weaknesses. The findings suggest that current AI systems can provide useful formative feedback on visible formal features, but they are not reliable as autonomous grading tools for complex creative work. AI-assisted assessment is therefore best understood as a supervised formative support mechanism, while final evaluation should remain grounded in human pedagogical judgment.
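
For readers less familiar with the reliability statistics reported above, the following is a minimal sketch of how a mean pairwise Spearman's ρ, Cronbach's α (treating each rater as an "item"), and a Bonferroni-adjusted threshold are commonly computed. It is not the authors' analysis code. All function names are illustrative, the example grades are invented, and the number of comparisons `m` is a hypothetical value, since the study's actual count is not stated here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(raters):
    """Average Spearman's rho over every pair of raters."""
    return mean(spearman_rho(a, b) for a, b in combinations(raters, 2))


def cronbach_alpha(raters):
    """Cronbach's alpha with each rater treated as one item."""
    k = len(raters)
    totals = [sum(scores) for scores in zip(*raters)]  # per-project sums
    rater_variance = sum(variance(r) for r in raters)
    return k / (k - 1) * (1 - rater_variance / variance(totals))


def bonferroni_threshold(alpha, m):
    """Per-test significance threshold for m comparisons."""
    return alpha / m


# Invented grades: three raters, six projects (not data from the study).
raters = [
    [4, 3, 5, 2, 4, 3],
    [3, 3, 4, 2, 5, 4],
    [5, 2, 4, 3, 3, 4],
]
print(mean_pairwise_spearman(raters), cronbach_alpha(raters))

# Hypothetical m = 2; p = 0.031 is the packaging-design value reported above.
print(0.031 < bonferroni_threshold(0.05, 2))
```

Under the hypothetical `m = 2`, the adjusted threshold is 0.025, so a p-value of 0.031 would fall short of it. This illustrates why the packaging-design association calls for cautious interpretation, although the study's own threshold may differ.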