DOI: 10.3390/app16136630 ISSN: 2076-3417

A Pipeline for Domain-Specialized Small Language Models from Unstructured Data: A Test Case Using Malaysian Clinical Practice Guidelines

Campanale Haakim bin Yusuf, Lee-Yeng Ong

The adoption of Large Language Models (LLMs) in highly regulated, domain-specific sectors is constrained by high computational costs, cloud dependency, and strict data privacy regulations. Furthermore, domain-specific knowledge is usually locked in static, unstructured document formats, which prevents automated reasoning over it. To address these challenges, this study proposes a generalizable end-to-end pipeline that starts from unstructured data and produces domain-specialized Small Language Models (SLMs) optimized for resource-constrained environments. To validate the proposed pipeline, Malaysian Clinical Practice Guidelines (CPGs) in PDF format were used as a test case. The methodology systematically digitizes these unstructured documents into a NoSQL database and employs an isomorphic teacher model to generate a strictly grounded synthetic instruction-tuning dataset. Through Quantized Low-Rank Adaptation (QLoRA) and 4-bit Post-Training Quantization (PTQ), a general-purpose model is transformed into a highly compressed, domain-specialized SLM, named SpecioSLM. Systematic workstation benchmarking across four candidate architectures identified the Microsoft Phi-3-Mini (3.8B) variant as the optimal model. This model achieved a throughput of 91.59 tokens per second (TPS), a Time to First Token (TTFT) of 0.17 s, and a semantic fidelity BERTScore of 90.27. A preliminary ARM64-based simulation targeting a specific edge device was also conducted to validate architectural and memory footprint viability.
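The two latency metrics reported above, TTFT and TPS, can be illustrated with a minimal sketch. The abstract does not state how the authors measured them, so the definitions here are assumptions: TTFT is taken as the time from request start to the first streamed token, and TPS as the total token count divided by the total elapsed time. The function name `measure_stream` and its `clock` parameter are hypothetical and not taken from the paper.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for one streamed response.

    Assumed definitions (not confirmed by the source):
      TTFT = time of first token - request start
      TPS  = token_count / (time after last token - request start)
    """
    start = clock()          # request start
    first = None             # timestamp of the first token, once seen
    count = 0
    for _ in tokens:         # consume the stream token by token
        now = clock()
        if first is None:
            first = now
        count += 1
    end = clock()            # stream exhausted

    if first is None:
        raise ValueError("stream produced no tokens")

    ttft = first - start
    total = end - start
    tps = count / total if total > 0 else float("inf")
    return ttft, tps, count
```

Any iterable that yields tokens as they are generated, such as a streaming response from a local inference runtime, could be passed as `tokens`. The injectable `clock` exists only so that the measurement logic can be checked deterministically without real delays.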
