Beyond Simple Averaging: Combining Scores, Reliabilities, and Validities of Multiple Intelligence Tests
Dominik Weber, Nicolas Becker, Frank M. Spinath, Marco Koch

In applied diagnostics, scores from multiple intelligence tests are often combined by simple arithmetic averaging. Although convenient, this practice is statistically imprecise: it neglects (a) that tests are positively correlated but not identical, each capturing somewhat different aspects of intelligence, and (b) that averaging imperfectly correlated measures yields a composite with less variance than the individual tests. Consequently, arithmetic means underestimate intelligence above the center of the IQ scale and overestimate it below the center. Such distortions can lead to serious misjudgments in high-stakes assessments, such as educational placement or forensic evaluations of legal culpability. To obtain exploratory evidence on how prevalent simple averaging is in practice, we surveyed n = 75 individuals with psychological training who were familiar with IQ metrics. In response to a case vignette requiring the combination of several IQ scores, 58 participants (77.33%) applied simple averaging, and none considered intercorrelations or variance reduction. This article therefore proposes a more valid method for combining the scores (as well as the corresponding reliabilities and validities) of multiple tests. To validate the method, we performed Monte Carlo simulations covering a wide range of test characteristics. Results showed virtually perfect accuracy (r = 1.00, p < .001), and the outcomes were robust against variations in the number of tests combined, score distributions, and intercorrelations. To facilitate adoption in practical diagnostics, we developed an open-source R package and an accompanying open-access Shiny web application.
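The distortion introduced by simple averaging can be illustrated with a minimal sketch. It assumes the standard textbook formula for a unit-weighted composite of standardized scores (sum of z-scores divided by the square root of the sum of all entries of the intercorrelation matrix), which is not necessarily the exact procedure proposed in the article. The function name `combine_iq`, the example scores, and the intercorrelation of .60 are hypothetical and chosen only for illustration.

```python
import math


def combine_iq(scores, corr, mean=100.0, sd=15.0):
    """Combine IQ scores into a composite on the same IQ metric.

    Assumption: unit-weighted composite of standardized scores, rescaled
    so that the composite again has the given mean and sd. `corr` is the
    full intercorrelation matrix of the tests (1.0 on the diagonal).
    """
    k = len(scores)
    # Standardize each score on the IQ metric.
    z = [(s - mean) / sd for s in scores]
    # Variance of the sum of k standardized variables equals the sum of
    # all entries of their correlation matrix.
    var_of_sum = sum(corr[i][j] for i in range(k) for j in range(k))
    z_composite = sum(z) / math.sqrt(var_of_sum)
    return mean + sd * z_composite


def simple_average(scores):
    """Arithmetic mean, the practice the abstract criticizes."""
    return sum(scores) / len(scores)


# Hypothetical case: two tests correlating .60, both yielding IQ 130.
scores = [130.0, 130.0]
corr = [[1.0, 0.6],
        [0.6, 1.0]]

averaged = simple_average(scores)
combined = combine_iq(scores, corr)
print(round(averaged, 2))  # 130.0
print(round(combined, 2))  # 133.54
```

In this hypothetical case the arithmetic mean stays at 130, whereas the rescaled composite is about 133.5, matching the direction of bias described above: averaging underestimates scores above the scale center. For two scores of 70 the same function returns about 66.5, and with perfectly correlated tests the two methods coincide.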