DOI: 10.1002/alz.077351 ISSN: 1552-5260

ADSP Whole Genome Sequencing (WGS) Release 4 Data Update from Genome Center for Alzheimer’s Disease

Yuk Yee Leung, Wan‐Ping Lee, Amanda B Kuzma, Prabhakaran Gangadharan, Heather Issen Nicaretta, Liming Qu, Youli Ren, Laura B Cantwell, Otto Valladares, Yi Zhao, Taha Iqbal, Michael A. Schmidt, Pedro R. Mena, Badri N Vardarajan, Clifton L. Dalgard, Brian W. Kunkle, William S. Bush, Eden R. Martin, Adam C. Naj, Jonathan L. Haines, Margaret A. Pericak‐Vance, Li‐San Wang, Gerald D. Schellenberg



The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration of all available Alzheimer’s disease (AD)‐relevant whole genome sequencing (WGS) data, with the goal of identifying genetic variants that confer AD risk or protection and, eventually, therapeutic targets. The WGS datasets are generated through collaboration between investigators from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. To minimize the data heterogeneity introduced by different sequencing protocols and assays, GCAD processes all samples using standardized pipelines and performs quality control (QC) and quality assurance (QA) checks.


Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 with BWA, and variant calling and joint genotyping of single nucleotide variants (SNVs) and insertions and deletions (indels) were done with GATK. Structural variants (SVs) were called per sample using the Smoove, Manta, and Strelka packages. Preliminary QA checks, including sex check, contamination, and genotype concordance, were performed, followed by QC per the ADSP protocol to evaluate the quality of samples and variants. To facilitate access to and use of the very large joint‐genotyped VCF files, a compact version storing only variant information and sample genotypes was released first.
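As an illustration of the genotype concordance check named above, the following is a minimal sketch and not the GCAD pipeline itself. It assumes two call sets for the same sample, each represented as a hypothetical dictionary mapping a site key to an unphased diploid genotype (or `None` for a missing call); the function name, the data layout, and the rule of skipping sites missing in either set are all assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: allele indices for a diploid call, None if missing.
Genotype = Optional[Tuple[int, int]]


def genotype_concordance(calls_a: Dict[str, Genotype],
                         calls_b: Dict[str, Genotype]) -> float:
    """Fraction of sites called in both sets whose unphased genotypes agree."""
    compared = 0
    matched = 0
    for site, gt_a in calls_a.items():
        gt_b = calls_b.get(site)
        # Skip sites that are absent or uncalled in either set (an assumption).
        if gt_a is None or gt_b is None:
            continue
        compared += 1
        # Sort alleles so that 0/1 and 1/0 count as the same unphased genotype.
        if sorted(gt_a) == sorted(gt_b):
            matched += 1
    if compared == 0:
        raise ValueError("no sites were called in both call sets")
    return matched / compared


if __name__ == "__main__":
    wgs = {"chr1:100": (0, 1), "chr1:200": (1, 1),
           "chr1:300": (0, 0), "chr1:400": None}
    array = {"chr1:100": (1, 0), "chr1:200": (0, 1),
             "chr1:300": (0, 0), "chr1:400": (0, 1)}
    print(f"concordance = {genotype_concordance(wgs, array):.3f}")
```

A low value from a comparison like this would typically prompt a closer look at the sample (for example, for a possible swap), though the thresholds used in practice are not stated in this abstract.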


We dropped 275 (0.7%) samples with poor coverage (<20×) and flagged 219 (0.6%) samples that were of borderline quality. As a result, the dataset (ADSP Release 4, 2022) includes 36,361 genomes from 40 diverse cohorts with 4 major ancestries: 16,573 Non‐Hispanic Whites; 11,358 Hispanics; 5,422 African Americans; and 2,802 Asians. Data are deeply sequenced (average genome coverage: 40×). CRAMs and GATK gVCFs for all samples were deposited into the NIAGADS Data Sharing Service (DSS) for public distribution. The joint‐genotyped VCFs are undergoing a full QC/annotation process and will be made available. The joint‐genotyped VCF contains >362M bi‐allelic variants and >58M multi‐allelic variants, with 95% of variants remaining after QC. SV calling is ongoing, and data will be ready prior to the conference.
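The coverage‐based sample triage described above can be sketched as follows. This is only an illustration under stated assumptions: the 20× drop threshold comes from the text, but the abstract does not say how "borderline quality" was defined, so the 25× borderline cutoff, the function name, and the input layout (sample ID mapped to mean genome coverage) are hypothetical.

```python
from typing import Dict, List

MIN_COVERAGE = 20.0         # from the abstract: samples below 20x were dropped
BORDERLINE_COVERAGE = 25.0  # hypothetical cutoff; the real criterion is not stated


def triage_by_coverage(mean_coverage: Dict[str, float],
                       min_cov: float = MIN_COVERAGE,
                       borderline_cov: float = BORDERLINE_COVERAGE
                       ) -> Dict[str, List[str]]:
    """Sort sample IDs into dropped, flagged (kept but marked), and passed."""
    result: Dict[str, List[str]] = {"dropped": [], "flagged": [], "passed": []}
    for sample_id, cov in mean_coverage.items():
        if cov < min_cov:
            result["dropped"].append(sample_id)
        elif cov < borderline_cov:
            result["flagged"].append(sample_id)
        else:
            result["passed"].append(sample_id)
    return result


if __name__ == "__main__":
    # Made-up sample IDs and coverage values, for illustration only.
    coverage = {"S1": 18.4, "S2": 22.0, "S3": 41.7, "S4": 20.0}
    for status, ids in triage_by_coverage(coverage).items():
        print(status, ids)
```

In this sketch, flagged samples stay in the dataset with a marker rather than being removed, which mirrors the distinction the text draws between dropped and flagged samples.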


The ADSP and GCAD generate high‐quality SNV, indel, and SV calls. GCAD is currently preparing the next release of ∼60,000 more ancestrally diverse WGS samples, sequenced primarily through the ADSP Follow‐Up Study, which we anticipate will be released in 2023 to greatly benefit the AD genetics community.
