GoldFormer: A Texture-Aware Vision Transformer-Based Algorithm for Detecting Near-Identical Images

DOI: 10.3390/a19070530 ISSN: 1999-4893

Zobeir Raisi

Distinguishing authentic gold products from high-quality counterfeits is a challenging fine-grained computer vision problem: counterfeit items are engineered to replicate surface texture, hallmark engravings, color, and geometry with remarkable fidelity, making visual discrimination unreliable even for trained professionals. In this paper, we address visual gold authentication from unconstrained smartphone imagery through three main contributions. First, we introduce GoldNet, a public benchmark dataset designed for this task, comprising 2127 images of authentic and counterfeit gold items collected under diverse real-world conditions. Second, we evaluate fourteen classification architectures spanning classical handcrafted texture descriptors, convolutional neural networks (CNNs), and vision transformers under a rigorous transfer learning protocol, establishing the first comprehensive baseline for this problem. Third, we propose GoldFormer, a hybrid dual-stream algorithm that combines the local texture representations of ResNet-50 with the global contextual modeling capability of the Swin Transformer (Swin-T) through a newly designed Texture-Aware Attention Gate (TAAG) module. The TAAG dynamically modulates Swin feature dimensions using CNN-derived texture energy, providing improved discriminability and per-prediction interpretability without requiring post hoc attribution. Experimental results show that, under matched-resolution 5-fold cross-validation, the proposed GoldFormer attains the highest overall accuracy (95.02%, F1-score 0.9502) at roughly half the FLOPs of its higher-resolution setting. This accuracy is statistically tied with the strongest individual backbone (ViT-B/16, 94.31%; McNemar p = 0.23) and on par with a training-free soft-voting ensemble (94.92%), while significantly improving on GoldFormer's own Swin-T backbone (93.65%) and adding built-in, attribution-free texture-gate interpretability. GoldFormer also surpasses trained human-expert performance (89.80%) by approximately 5 percentage points.
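
The abstract does not give the exact formulation of the TAAG, so the following is only a minimal NumPy sketch of the general idea it describes: a per-dimension gate, computed from CNN texture energy, that rescales transformer features. Everything specific here is an assumption made for illustration: the function names `texture_energy` and `texture_aware_gate`, the definition of texture energy as the mean squared activation per channel, the single linear projection followed by a sigmoid, and the random placeholder weights. The shapes follow the usual final-stage outputs of ResNet-50 (2048 channels on a 7x7 grid) and Swin-T (49 tokens of dimension 768) at 224x224 input.

```python
import numpy as np


def texture_energy(cnn_features: np.ndarray) -> np.ndarray:
    """Mean squared activation per channel of a (C, H, W) feature map.

    Assumption: this is one common notion of "texture energy"; the
    paper's definition may differ.
    """
    return np.mean(cnn_features ** 2, axis=(1, 2))


def texture_aware_gate(swin_tokens, cnn_features, weight, bias):
    """Rescale each transformer feature dimension by a texture-driven gate.

    swin_tokens:  (N, D) token features from the transformer stream
    cnn_features: (C, H, W) feature map from the CNN stream
    weight, bias: (D, C) and (D,) parameters of a hypothetical projection
    Returns the gated tokens and the gate itself, which can be inspected
    directly as a per-prediction explanation.
    """
    energy = texture_energy(cnn_features)      # (C,)
    logits = weight @ energy + bias            # (D,)
    gate = 1.0 / (1.0 + np.exp(-logits))       # sigmoid, values in (0, 1)
    return swin_tokens * gate, gate            # gate broadcasts over tokens


# Illustrative inputs only: random stand-ins for real backbone outputs.
rng = np.random.default_rng(0)
cnn_features = rng.standard_normal((2048, 7, 7))
swin_tokens = rng.standard_normal((49, 768))
weight = rng.standard_normal((768, 2048)) * 0.01  # placeholder, untrained
bias = np.zeros(768)

gated_tokens, gate = texture_aware_gate(swin_tokens, cnn_features, weight, bias)
print(gated_tokens.shape, gate.shape)
```

Because the gate is an explicit vector with one value per feature dimension, it can be read off for each prediction without a separate attribution method, which is the kind of built-in interpretability the abstract attributes to the TAAG.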
