Ecological Vision Hypothesis: Training Deep Neural Networks for Robustness and Human Alignment
Frank Tong, Hojin Jang
Deep neural networks (DNNs) offer highly promising neurocomputational models of the visual system, yet vast gaps remain between DNNs and human observers. By some accounts, DNNs are approaching ceiling levels in their ability to predict human neural responses to clear real-world images. However, even modest departures toward more ambiguous viewing conditions can readily expose the brittle and inflexible nature of these networks. Human vision remains robust when faced with noise, blur, occlusion, and other challenges, whereas DNNs trained to classify large image datasets typically lack such robustness. Here, we discuss the ecological vision hypothesis, which proposes that the robustness of human vision is acquired by learning from frequent encounters with challenging viewing conditions, such that DNNs trained with similar challenges should become more robust and human-aligned. In particular, the prevalence of blur in everyday vision may enhance sensitivity to global shape and attenuate reliance on local textural cues. We conjecture that providing DNNs with ecologically relevant information to learn 3D scene and shape properties will further advance DNN-to-human alignment.