ADSP Whole Genome Sequencing (WGS) Release 4 Data Update from Genome Center for Alzheimer’s Disease
Yuk Yee Leung, Wan‐Ping Lee, Amanda B Kuzma, Prabhakaran Gangadharan, Heather Issen Nicaretta, Liming Qu, Youli Ren, Laura B Cantwell, Otto Valladares, Yi Zhao, Taha Iqbal, Michael A. Schmidt, Pedro R. Mena, Badri N Vardarajan, Clifton L. Dalgard, Brian W. Kunkle, William S. Bush, Eden R. Martin, Adam C. Naj, Jonathan L. Haines, Margaret A. Pericak‐Vance, Li‐San Wang, Gerald D. Schellenberg
Abstract
Background
The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration of all available Alzheimer’s disease (AD) relevant whole genome sequencing (WGS) data, with the goal of identifying genetic variants that confer AD risk or protection and, eventually, therapeutic targets. The WGS datasets are generated through collaboration between investigators from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. To minimize data heterogeneity introduced by different sequencing protocols and assays, GCAD processes all samples using standardized pipelines and performs quality control (QC)/quality assurance (QA) checks.
Methods
Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 with BWA; variant calling and joint genotyping of single nucleotide variants (SNVs) and insertions and deletions (indels) were performed with GATK. Structural variants (SVs) were called per sample using the Smoove, Manta, and Strelka packages. Preliminary QA checks, including sex check, contamination, and genotype concordance, were performed, followed by QC per ADSP protocol to evaluate the quality of samples and variants. To facilitate access to and use of the massive joint‐genotyped VCF files, a compact version storing only variant information and sample genotypes was released first.
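The sample-level QA checks described above can be illustrated with a minimal Python sketch. It is not the GCAD pipeline: the `SampleQA` record, the `qa_status` function, and the contamination and concordance cutoffs are hypothetical names and values chosen for illustration. Only the 20× coverage threshold is taken from this abstract (see Results), and the rule that any failed check yields a "flag" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Only MIN_COVERAGE comes from the abstract; the other cutoffs are
# illustrative assumptions, not ADSP protocol values.
MIN_COVERAGE = 20.0        # samples below 20x coverage were dropped
MAX_CONTAMINATION = 0.05   # hypothetical contamination-fraction cutoff
MIN_CONCORDANCE = 0.95     # hypothetical genotype-concordance cutoff


@dataclass
class SampleQA:
    """Hypothetical per-sample QA summary."""
    sample_id: str
    mean_coverage: float   # average genome coverage
    reported_sex: str      # "M" or "F" from sample records
    inferred_sex: str      # "M" or "F" inferred from sequence data
    contamination: float   # estimated contamination fraction
    concordance: float     # concordance with a comparison genotype set


def qa_status(s: SampleQA) -> Tuple[str, List[str]]:
    """Return ("drop" | "flag" | "pass", reasons) for one sample."""
    # Low coverage is the only criterion the abstract gives for dropping.
    if s.mean_coverage < MIN_COVERAGE:
        return "drop", ["low_coverage"]
    reasons: List[str] = []
    if s.reported_sex != s.inferred_sex:
        reasons.append("sex_mismatch")
    if s.contamination > MAX_CONTAMINATION:
        reasons.append("contamination")
    if s.concordance < MIN_CONCORDANCE:
        reasons.append("low_concordance")
    return ("flag" if reasons else "pass"), reasons
```

Under these assumptions, a sample with 18× coverage would be dropped regardless of its other metrics, while a 40× sample with mismatched reported and inferred sex would be kept but flagged for review.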
Results
We dropped 275 (0.7%) samples with poor coverage (<20×) and flagged 219 (0.6%) samples of borderline quality. As a result, the dataset (ADSP Release 4, 2022) includes 36,361 genomes from 40 diverse cohorts spanning 4 major ancestries: 16,573 Non‐Hispanic Whites; 11,358 Hispanics; 5,422 African Americans; and 2,802 Asians. The data are deeply sequenced (average genome coverage: 40×). All samples’ CRAMs and gVCFs from GATK were deposited into the NIAGADS Data Sharing Service (DSS) (
Conclusion
The ADSP and GCAD generate high‐quality SNV, indel, and SV calls. GCAD is currently preparing the next release of ∼60,000 more ancestrally‐diverse WGS samples, sequenced primarily through the ADSP Follow‐Up Study; we anticipate that this release, planned for 2023, will greatly benefit the AD genetics community.