DOI: 10.1002/alz.081841 ISSN: 1552-5260

Automated FreeSurfer segmentation and visual quality control in 10,000 MRI scans from a large memory clinic cohort

Diana I. Bocancea, Anouk den Braber, Chenyang Jiang, Emma M. Coomans, Annemartijn A. J. M. van Unnik, Julia M.L. van Veen, Marieke Ribberink, Niels Reijner, Shalina Saddal, Sophie E Mastenbroek, Aniek M. van Gils, Caro M. Kluin, Diederick Martijn de Leeuw, Eleonora M. Vromen, Lotta Bekkers, Margarita Georgallidou, Sophie M. van der Landen, Jana M. Baumann, Joost Heuvelink, Lianne M. Reus, Luigi Lorenzini, Suzie Kamps, Sophie P.H. Bouman, Ellen Hanna Singleton, Sterre C.M. de Boer, Rebecca Z. Rousset, Bart Kuijper, Eline Verhagen, Dimas Smit, Roos M. Rikken, Pieter Jelle Visser, Femke H. Bouwman, Frederik Barkhof, Afina W. Lemstra, Yolande A.L. Pijnenburg, Wiesje M. van der Flier, Vikram Venkatraghavan, Betty M. Tijms
  • Psychiatry and Mental health
  • Cellular and Molecular Neuroscience
  • Geriatrics and Gerontology
  • Neurology (clinical)
  • Developmental Neuroscience
  • Health Policy
  • Epidemiology



Automated image segmentation methods, together with increasing computing power, allow brain alterations to be quantified in great detail in large neuroimaging datasets. The growing size of these datasets complicates visual quality control (QC), which is the gold standard, and research increasingly relies on automated QC methods instead. Here, we aimed to investigate the performance of automated FreeSurfer segmentation through visual QC in a large memory clinic cohort with more than ten thousand images, and to compare an automated QC measure with the visual QC.


A total of 10,400 T1‐weighted MRI scans from the Amsterdam Dementia Cohort were segmented with FreeSurfer (v7.1). Quality control was performed using an adapted version of the ENIGMA QC protocol. Twenty‐six raters visually assessed the cortical segmentations, rating each as “fail”, “moderate”, or “pass” and identifying failure reasons and affected lobes. We investigated error occurrence in scans of failed or moderate quality, and whether segmentation failures depend on clinical diagnosis in a subset of 4,990 baseline scans with the most common diagnoses (i.e., SCD, MCI, AD, FTD, VaD and DLB). We compared an automated QC measure (i.e., SurfaceHoles thresholded at median ± 3×IQR) with the visual QC output.
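The automated QC rule named above (median ± 3×IQR on SurfaceHoles) can be illustrated with a minimal sketch. It assumes the SurfaceHoles count per scan has already been extracted from the FreeSurfer output, and that quartiles are computed with linear interpolation; the abstract does not specify either detail. The function name and the example counts are hypothetical and are not study data.

```python
from statistics import median, quantiles


def flag_surface_hole_outliers(surface_holes, k=3.0):
    """Return one boolean per scan: True if the value lies outside median ± k*IQR.

    Assumption: quartiles use linear interpolation ("inclusive" method);
    the abstract does not state how the IQR was computed.
    """
    values = [float(v) for v in surface_holes]
    center = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = center - k * iqr, center + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical SurfaceHoles counts for eight scans (illustrative only).
counts = [20, 22, 25, 27, 30, 31, 35, 400]
flags = flag_surface_hole_outliers(counts)
```

In this toy example only the last scan is flagged, because its count lies far above the upper threshold while the other seven fall inside the interval.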


The majority (78.3%) of the 10,400 cortical segmentations were rated as “pass” quality, 16.2% as “moderate” quality and 5.4% as “fail”. Concordance between automated QC and visual QC was high for pass ratings (84%) but low for fail ratings (22.5%) (Table‐1). Within failed segmentations, the most common reasons were processing errors (51.7%), image artifacts (14.3%) and underestimation (13.3%); for moderate ratings, the most common reasons were inclusion of meninges (47.8%) and underestimation of cortical thickness (33.2%) (Figure‐1). Stratified per diagnosis, segmentation failed in 2.3% of SCD scans, 3% of MCI, 4.7% of DLB, 5.4% of AD, 7% of FTD and 14.5% of VaD. Further, we observed variation in affected lobes among diagnostic groups (Figure‐2b).
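As a rough illustration of how a per-rating concordance between a binary automated flag and the three-level visual rating could be computed, consider the sketch below. The abstract does not define concordance, so the definition used here is an assumption: for “pass”, the share of visually passed scans that the automated measure did not flag; for “fail”, the share of visually failed scans that it did flag. “Moderate” ratings are skipped because they have no obvious binary counterpart. The function name and the example data are hypothetical.

```python
def concordance_by_visual_rating(visual, auto_flagged):
    """Share of scans per visual rating on which the automated flag agrees.

    Assumed definition (not taken from the abstract):
      "pass" agrees when the scan is NOT flagged,
      "fail" agrees when the scan IS flagged,
      "moderate" is ignored.
    """
    agree = {"pass": 0, "fail": 0}
    total = {"pass": 0, "fail": 0}
    for rating, flagged in zip(visual, auto_flagged):
        if rating not in total:
            continue  # skip "moderate" and any unexpected label
        total[rating] += 1
        if flagged == (rating == "fail"):
            agree[rating] += 1
    return {r: agree[r] / total[r] for r in total if total[r] > 0}


# Hypothetical ratings for seven scans (illustrative only).
visual = ["pass", "pass", "pass", "pass", "moderate", "fail", "fail"]
auto = [False, False, False, True, False, True, False]
result = concordance_by_visual_rating(visual, auto)
```

Reporting agreement separately per rating, rather than as a single overall figure, keeps a low fail-side agreement from being masked by the much larger number of passed scans.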


The majority of scans passed visual QC, in high concordance with the automated QC for pass ratings. Images of moderate or failed quality occurred more often in VaD and FTD. The most common mis‐segmentations were overestimation and underestimation of cortical thickness. This very large, visually quality‐controlled dataset could serve as a benchmark for testing future automated QC pipelines.
