Detection of cancer recurrence from Thai-English electronic medical records using sentence embeddings
Ekapob Sangariyavanich, Wanchana Ponthongmak, Nawanan Theera-Ampornpunt, Nat Tangchitnob, Gareth J McKay, Ammarin Thakkinstian
Objective
This study developed and validated monolingual and bilingual sentence-bidirectional encoder representations from transformers (SBERT) models for detecting cancer recurrence within Thai-English electronic medical records (EMRs) from Thai cancer hospitals.
Method
A multicentre dataset of 32 436 documents from 1250 patients was used for model development. External validation involved an independent dataset of 9244 documents from 384 patients across two Thai cancer hospitals. Performance was benchmarked against a fine-tuned PubMedBERT (MetBERT).
Results
The development dataset included breast (43.9%), colorectal (12.1%), cervical (28.0%) and head and neck (16.0%) cancers. MetBERT achieved the highest area under the precision-recall curve (AUPRC) for locoregional versus no recurrence (11.1%) and locoregional versus distant recurrence (91.7%), while monolingual-SBERT excelled at distant versus no recurrence (32.0%). External validation demonstrated MetBERT superiority for locoregional versus no recurrence (9.30%–21.50%). For distant versus no recurrence, bilingual-SBERT performed best with AUPRC 17.55%–24.39%. While MetBERT led in distinguishing locoregional versus distant recurrence (88.30%–94.70%), bilingual-SBERT demonstrated robust external validation performance (AUPRC 85.25%–91.80%).
Discussion
The low AUPRC values (9%–32%) for comparisons against no recurrence reflect the extreme class imbalance in real-world data (~1% recurrence prevalence). Despite this imbalance, fine-tuned MetBERT achieved the highest performance, while bilingual-SBERT demonstrated superior robustness during external validation. These findings support the use of sentence embedding models for handling mixed Thai-English medical records in multilingual clinical environments.
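To illustrate why AUPRC should be read against prevalence, the following is a minimal sketch on synthetic data, not the study's pipeline. It computes step-wise average precision (a common estimator of AUPRC) and shows that an uninformative scorer lands near the positive-class prevalence, so at roughly 1% prevalence even a modest-looking AUPRC can sit well above chance. The function names, the simulated score distributions and the sample size are all assumptions made for illustration.

```python
import random


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank documents from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def simulate(prevalence, n_docs=20000, seed=0):
    """Compare a random scorer with a moderately informative one.

    Hypothetical setup: scores are synthetic, not model outputs.
    """
    rng = random.Random(seed)
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n_docs)]
    # Uninformative scorer: scores are independent of the label.
    random_scores = [rng.random() for _ in labels]
    # Informative scorer: positives are shifted upwards by 1.5 standard deviations.
    informative_scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
    return {
        "prevalence": sum(labels) / n_docs,
        "random": average_precision(labels, random_scores),
        "informative": average_precision(labels, informative_scores),
    }


if __name__ == "__main__":
    for prevalence in (0.01, 0.5):
        print(prevalence, simulate(prevalence))
```

Under these assumptions the same informative scorer yields a far lower AUPRC at 1% prevalence than at 50%. This would be consistent with the much higher values reported for locoregional versus distant recurrence, if neither class is rare in that comparison.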
Conclusion
Sentence embedding frameworks provide a practical, generalisable solution for detecting cancer recurrence within multilingual EMRs. Despite text-length constraints, these models are suitable for clinical integration as a screening tool for cancer registry workflows.