ARM: Active Region Masking for 3D Medical Image Analysis
Can Wang, Zesheng Cheng

Self-supervised learning can reduce the dependence of 3D medical image analysis on expensive voxel-level annotations, but masked image modeling remains inefficient when uniform random masking is applied to volumetric data with large redundant backgrounds. Random masks often generate easy reconstruction targets that can be recovered through local interpolation, limiting the learning of anatomical semantics. Therefore, an effective masking strategy should adapt to image-specific structural difficulty rather than sample regions uniformly. We propose Active Region Masking (ARM), a self-supervised pre-training method that treats 3D mask generation as a patch-wise actor–critic decision process. Patch-level decision units share an actor–critic policy and use reconstruction error with a masking-ratio constraint as an intrinsic reward to identify regions that are difficult to reconstruct and potentially informative for anatomical representation learning. The reconstructor is trained with an asymmetric 3D Swin Transformer encoder–decoder, encouraging global anatomical reasoning from visible context. For segmentation tasks, the ARM pre-trained encoder is used to initialize the downstream Swin-UNETR framework; classification is evaluated with a matched downstream protocol. Across 12 task-level downstream evaluations, including BTCV, MSD, MM-WHS, AMOS22, FLARE22, CC-CCII, and BraTS21, ARM consistently improves over the evaluated contrastive and masked-modeling baselines. With 1k scans as pre-training data, ARM achieves an average Dice score of 89.80% on the BTCV-and-unseen-dataset benchmark, and scaling to 10k scans increases the average Dice score to 90.66%. These results indicate that active region masking improves label efficiency, segmentation robustness, and CT-to-MRI transfer in 3D medical image analysis.
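
As a rough illustration of the intrinsic reward described above (reconstruction error combined with a masking-ratio constraint), the sketch below scores a candidate patch mask. The abstract does not give the exact reward formula, so the functional form here is an assumption: the mean reconstruction error over masked patches, minus a penalty on the deviation from a target masking ratio. The names `intrinsic_reward`, `sample_mask`, `target_ratio`, and `penalty_weight` are hypothetical and not taken from the paper.

```python
import random


def sample_mask(mask_probs, rng):
    """Sample a binary patch mask from per-patch masking probabilities.

    mask_probs stands in for the output of a shared actor policy
    (hypothetical; the real policy network is not shown here).
    """
    return [1 if rng.random() < p else 0 for p in mask_probs]


def intrinsic_reward(patch_errors, mask, target_ratio=0.75, penalty_weight=1.0):
    """Assumed reward form: mean error on masked patches minus a ratio penalty.

    patch_errors: per-patch reconstruction errors (one float per patch).
    mask: binary list, 1 = patch is masked, 0 = patch is visible.
    """
    masked_errors = [e for e, m in zip(patch_errors, mask) if m]
    mean_error = sum(masked_errors) / len(masked_errors) if masked_errors else 0.0
    ratio = sum(mask) / len(mask)
    # Penalize masks whose masking ratio drifts from the target ratio.
    return mean_error - penalty_weight * abs(ratio - target_ratio)


# Toy example: two hard-to-reconstruct patches and two easy background patches.
errors = [0.9, 0.8, 0.1, 0.05]
hard_mask = [1, 1, 0, 0]   # masks the difficult patches
easy_mask = [0, 0, 1, 1]   # masks the easy patches
r_hard = intrinsic_reward(errors, hard_mask, target_ratio=0.5)
r_easy = intrinsic_reward(errors, easy_mask, target_ratio=0.5)
r_full = intrinsic_reward(errors, [1, 1, 1, 1], target_ratio=0.5)
sampled = sample_mask([0.9, 0.9, 0.1, 0.1], random.Random(0))
```

Under this assumed form, a mask covering the hard patches earns a higher reward than one covering easy background at the same masking ratio, while masking everything is penalized by the ratio term. In an actor–critic setup, a reward of this kind would be the signal the critic estimates and the actor maximizes.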