Bioinformatic assessment of the potential amyloidogenicity of the human and evolutionarily more ancient proteomes
Douglas B. Kell, Ivayla Roberts, Xiaomian Tan, J. Bernadette Moore, Etheresia Pretorius

AmyloGram is a computer program that uses n-gram encoding and a random forest classifier to produce a numerical score between 0 and 1 for the predicted amyloidogenicity of a given protein sequence. In a variety of recent studies we have used AmyloGram to obtain an overall amyloidogenicity score for members of the human proteome. Of 83,567 full-length canonical human polypeptides, 79.2% had a score exceeding 0.7 (the median was 0.813), consistent with the view that most natural protein sequences contain elements that are potentially amyloidogenic. Here we first asked whether this operational threshold is supported by orthogonal predictors and curated amyloid proteins, and then whether similarly high scores are also observed in evolutionarily ancient proteomes. For the human proteome, PASTA2 values correlated positively with AmyloGram scores (r² = 0.374 for minimum free energy and 0.321 for average free energy), and proteins with AmyloGram scores ≥0.7 were significantly enriched for strongly negative PASTA2 values (for average free energy < -10 PEU: 4,724/66,190 versus 75/17,377; χ² p = 2.4 × 10⁻²⁵⁰). In AmyPro, 117 curated amyloid proteins had substantially higher median AmyloGram scores than did the eight curated non-amyloids, although the imbalance of this dataset demands caution. AMYPred-FRL showed only a weak and partly discordant relationship with AmyloGram in the archaeal test proteome examined. We then computed AmyloGram score distributions for the proteomes of 130 other organisms, including archaea, Gram-negative bacteria, Gram-positive bacteria and viruses, representing 475,999 proteins in total.
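The enrichment quoted above can be checked from the four counts alone. The following is a minimal sketch, not the authors' analysis code: it assumes a plain Pearson chi-squared test on the 2 × 2 table with one degree of freedom and no continuity correction (the text does not say which variant was used, so the exact p-value may differ slightly from the one reported). The function names `chi2_2x2` and `p_value_1df` are illustrative only.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]].

    Assumption: no Yates continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts taken from the text: proteins with average PASTA2 free energy < -10 PEU,
# split by whether the AmyloGram score is >= 0.7.
hits_high, n_high = 4724, 66190   # AmyloGram >= 0.7
hits_low, n_low = 75, 17377       # AmyloGram < 0.7

chi2 = chi2_2x2(hits_high, n_high - hits_high, hits_low, n_low - hits_low)
p = p_value_1df(chi2)

print(f"fraction (score >= 0.7): {hits_high / n_high:.4f}")
print(f"fraction (score <  0.7): {hits_low / n_low:.4f}")
print(f"chi-squared = {chi2:.1f}, p = {p:.2e}")
```

The two group sizes sum to the 83,567 polypeptides stated above, and the proportions (roughly 7% versus under 0.5%) make the direction of the enrichment explicit, which the p-value alone does not.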