CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions

doi:10.1093/nar/gkad989

Max Schubach, Thorben Maass, Lusiné Nazaretyan, Sebastian Röner, Martin Kircher

Genetics

Abstract: Machine learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are available to the community via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/.
