Abstract A052: Lymformer: A Domain-Specific, Self-Supervised Vision Transformer for Survival Prediction Using Whole Slide Images in Diffuse Large B-Cell Lymphoma
Chen Zhou, Jie Xu, Carlos A. Torres-Cabala, Francisco Vega, Jason Westin, Soundar Kumara, Dennis O'Malley, Christopher R. Flowers, L. Jeffrey Medeiros, Swaminathan P. Iyer
Background:
Diffuse large B-cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, yet outcomes remain heterogeneous. Extracting prognostically relevant morphological patterns from whole-slide images (WSIs) is challenging because of their gigapixel scale. Existing multiple-instance learning (MIL) pipelines often use feature extractors pre-trained on natural or solid-tumor images, which may not capture lymphoma-specific cytologic and architectural patterns. We hypothesized that a self-supervised, lymphoma-only vision transformer pre-trained on a large collection of DLBCL WSIs could learn domain-specific morphologic representations predictive of overall survival (OS).
Methods:
Lymformer is a Swin-based foundation vision transformer trained with self-supervision on 1,186 unannotated DLBCL WSIs (∼2 million tiles) to learn morphologic representations. For downstream OS prediction after frontline R-CHOP, we analyzed two complementary cohorts with clinical and imaging data: DLBCL-TCGA (n=42; younger patients with earlier-stage disease, long-term OS follow-up, and genomic data) and DLBCL-Stanford (n=209; older patients with more advanced-stage disease, IPI scores, and Hans cell-of-origin surrogates, but no sequencing). Within these cohorts, 119 Stanford cases were used for training with 5-fold cross-validation, and an additional 30 Stanford cases and 39 TCGA cases served as the in-domain and out-of-domain test sets, respectively. WSIs were tiled at 40× magnification into 1,024×1,024-pixel patches. To interpret the learned morphology, we clustered tile embeddings into 10 embedding-derived morphologic groups using K-means. HoVer-Net was applied to representative tiles to segment nuclei and quantify four morphometric features: area, crowdedness, roundness, and roughness. Survival prediction performance was evaluated using the concordance index (C-index) and compared with image-only baselines built on ResNet-50, CTransPath, and Phikon. One-way ANOVA was used to test differences in nuclear features across clusters.
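The C-index named in the Methods can be illustrated with a minimal sketch of Harrell's concordance index for right-censored survival data. This is a generic textbook formulation, not the authors' evaluation code; the function name harrell_c_index is hypothetical, and pairs with tied follow-up times are simply skipped rather than handled as in full implementations.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    times:  observed follow-up times (time of death or censoring)
    events: 1 if death was observed, 0 if the patient was censored
    risks:  model risk scores; higher means worse predicted prognosis
    """
    if not (len(times) == len(events) == len(risks)):
        raise ValueError("inputs must have equal length")
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed death.
        # Tied times are skipped in this simplified sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # earlier death was assigned the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients; the third is censored.
print(harrell_c_index([5, 10, 15, 20], [1, 1, 0, 1], [0.9, 0.6, 0.4, 0.2]))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk. Established implementations such as lifelines.utils.concordance_index also handle tied times; note that it expects predicted survival scores (higher means longer survival), so risk scores must be negated before passing them in.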
Results:
On the Stanford test set, the Lymformer-MIL model achieved a C-index of 0.817 for OS, outperforming ResNet-50 (0.730), CTransPath (0.763), and Phikon (0.743). Lymformer also generalized to the out-of-domain TCGA cohort. Attention maps associated with poor-risk predictions frequently highlighted regions of necrosis and high cellular density. Across the 10 embedding-derived clusters, the distributions of nuclear area and crowdedness differed markedly (ANOVA P < 1×10⁻³¹), with some clusters enriched for small, sparsely distributed, smooth nuclei and others dominated by large, crowded, irregular nuclei.
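The cluster-wise comparison of nuclear features rests on the one-way ANOVA F statistic. The minimal standard-library sketch below assumes each group holds one nuclear feature (for example, per-nucleus area) measured in the tiles of one cluster; one_way_anova_f is a hypothetical helper, and the toy values are illustrative, not study data.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: one sequence of feature values per cluster (hypothetical input).
    """
    groups = [list(g) for g in groups]
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need >= 2 non-empty groups and more values than groups")
    grand_mean = fmean(x for g in groups for x in g)
    means = [fmean(g) for g in groups]
    # Between-cluster sum of squares: spread of cluster means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-cluster sum of squares: spread of values around their own cluster mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy example with three hypothetical clusters.
print(one_way_anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

The P value is the upper tail of the F distribution with (k−1, N−k) degrees of freedom, where k is the number of clusters and N the total number of measurements; scipy.stats.f_oneway(*groups) returns both the statistic and the P value.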
Conclusions:
Lymformer is a domain-specific foundation vision transformer trained exclusively on lymphoma WSIs. It enables image-based OS prediction in DLBCL, and post-hoc analysis links its extracted features to morphologic phenotypes. These retrospective analyses are limited by partial clinical and genomic annotation, the lack of direct comparison with established prognosticators, and the black-box nature of the model. Rigorous head-to-head evaluation in fully annotated, prospectively collected cohorts is planned.
Citation Format:
Chen Zhou, Jie Xu, Carlos A. Torres-Cabala, Francisco Vega, Jason Westin, Soundar Kumara, Dennis O'Malley, Christopher R. Flowers, L. Jeffrey Medeiros, Swaminathan P. Iyer. Lymformer: A Domain-Specific, Self-Supervised Vision Transformer for Survival Prediction Using Whole Slide Images in Diffuse Large B-Cell Lymphoma [abstract]. In: Proceedings of the Fifth AACR International Meeting on Advances in Malignant Lymphoma: From Discovery to Clinical Impact; 2026 Jun 24-27; Philadelphia, PA. Philadelphia (PA): AACR; Blood Cancer Discov 2026;7(3_Suppl):Abstract nr A052.