DOI: 10.1302/1358-992x.2024.1.047 ISSN: 1358-992X


J. Grammens, L. F.A. Pereira, F. Danckaers, J. Vanlommel, A. Van Haver, P. Verdonk, J. Sijbers

Accuracy metrics currently implemented in open-source libraries for segmentation by supervised machine learning are typically one-dimensional scores [1]. The anatomical location of segmentation errors, although highly relevant when evaluating clinical applicability, is often neglected.

This study aims to incorporate three-dimensional (3D) spatial information into a novel framework for evaluating segmentation accuracy and comparing different methods.

Predicted and ground truth (manually segmented) segmentation masks are meshed into 3D surfaces. A template mesh of the same anatomical structure is then registered to all ground truth 3D surfaces. This ensures that all surface points on the ground truth meshes are in the same anatomically homologous order. Next, point-wise surface deviations between the registered ground truth mesh and the meshed segmentation prediction are calculated, which allows color plotting of point-wise descriptive statistics. Statistical parametric mapping includes point-wise false discovery rate (FDR) adjusted p-values (also referred to as q-values).
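The FDR adjustment mentioned above can be illustrated with a minimal sketch. The abstract does not state which FDR procedure is used, so the Benjamini-Hochberg step-up procedure is assumed here; the function name `benjamini_hochberg` and the example p-values are hypothetical and not taken from the framework.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order.

    Assumption: the abstract only says "FDR adjusted p-values"; the
    Benjamini-Hochberg procedure is chosen here for illustration.
    """
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, scaling each by m / rank
    # and taking a running minimum so q-values never decrease with rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values, one per homologous surface point.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
```

Because every surface point is tested separately, the number of tests equals the number of template points, which is why a multiple-comparison adjustment of this kind is needed before the q-values are plotted on the surface.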

The framework reads volumetric image data containing the segmentation masks of both the ground truth and the segmentation prediction. 3D color plots containing descriptive statistics (mean absolute value, maximal value, …) on point-wise segmentation errors are rendered. As an example, we compared segmentation results of nnUNet [2], UNet++ [3] and UNETR [4] by visualizing the mean absolute error (surface deviation from ground truth) as a color plot on the 3D model of bone and cartilage of the mean distal femur.
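As a rough sketch of how such point-wise descriptive statistics could be computed, assume the signed surface deviations are already available as one row per case and one column per homologous template point. The function name `pointwise_stats` and the input layout are assumptions made for illustration, not the framework's actual interface.

```python
from statistics import fmean


def pointwise_stats(deviations):
    """Compute per-point mean absolute and maximal absolute deviation.

    deviations[i][j] is the signed deviation (hypothetical layout) of
    case i at template point j; all rows must have the same length.
    """
    n_points = len(deviations[0])
    if any(len(row) != n_points for row in deviations):
        raise ValueError("every case must have one deviation per template point")

    mean_abs, max_abs = [], []
    # zip(*deviations) iterates over points, yielding all cases per point.
    for column in zip(*deviations):
        abs_column = [abs(d) for d in column]
        mean_abs.append(fmean(abs_column))
        max_abs.append(max(abs_column))
    return mean_abs, max_abs


# Two hypothetical cases with three surface points each (arbitrary units).
example = [
    [0.5, -1.0, 0.0],
    [-1.5, 2.0, 0.0],
]
mean_abs, max_abs = pointwise_stats(example)
```

Each resulting list holds one value per template point, so it can be mapped directly to vertex colors on the mean surface model.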

A novel framework to evaluate segmentation accuracy is presented. Output includes anatomical information on the segmentation errors, as well as point-wise comparative statistics on different segmentation algorithms. This allows for a better-informed decision-making process when selecting the best algorithm for a specific clinical application.
