Genetic resources

CKB research has been greatly enhanced by large-scale genotyping of study individuals. Together with other exposure and outcome data, genetic data enable a wide range of investigations, including discovery of genetic determinants of disease risk, lifestyle factors and quantitative traits (e.g. blood pressure, adiposity); Mendelian randomisation assessment of the causal contribution of risk factors and behaviours to disease; and phenome-wide analyses of potential drug targets.

SNP genotyping data

Using the multiplex Illumina GoldenGate® platform, approximately 100,000 DNA samples were genotyped for panels of 384 single nucleotide polymorphisms (SNPs) during 2012-2013. These SNPs were selected to support a range of projects, including investigation of genetic variants affecting the function of genes encoding potential drug targets (JACC 2016, IJE 2016, JAMA Cardiol. 2018). These early SNP data also provided an important check of sample linkage and DNA quality. For example, participants’ reported gender and genetically determined sex differed for just 0.1% of samples, and only 2.5% of samples failed quality control (similar to other studies, e.g. UK Biobank genotyping). Together, these data provided high confidence in subsequent use of the extracted DNA for genome-wide genotyping.

Genome-wide genotyping data

There is substantial genetic diversity both between populations of different ancestries and across China. When CKB genome-wide genotyping commenced in 2015, the available genotyping arrays did not fully capture such variation. Hence, CKB designed a custom Affymetrix (now ThermoFisher) Axiom® genotyping array with improved genome-wide coverage of common and low-frequency variation in Chinese populations. The array also included a series of probes for detection and classification of circulating hepatitis B virus (HBV), which is prevalent in the Chinese population. The final array design assayed a total of 803,030 genetic variants, including more than 80,000 variants with predicted functional effects on specific genes.

Using the CKB custom genotyping array, we genotyped just over 100,000 CKB samples: a population-representative subset of 77,176 participants and an additional 23,542 selected for studies of specific diseases (e.g. stroke, COPD). Genotyping data quality was high: for example, there was 99.9% concordance between pairs of duplicates. We then conducted imputation to statistically infer genotypes using the 1000 Genomes Phase 3 reference panel, which yielded genotypes for more than 21 million variants. These data have supported many studies within the CKB group and as part of collaborations and consortia. This work has included Mendelian randomisation studies of the contribution of blood lipids to different types of stroke (Nat Med 2019, Ann Neurol 2020), assessment of polygenic risk scores (PRS) for risk of fracture (Genome Med 2021) and breast cancer (Gen in Med 2021), and genome-wide association studies (GWAS) of lung function and respiratory disease (Eur Respir J 2021). A recent further round of imputation using two larger reference panels (TOPMed, Westlake BioBank for Chinese) has provided genotypes for more than 50 million variants.
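The duplicate-concordance figure quoted above is the kind of quality metric that can be illustrated with a short calculation. The following is a minimal Python sketch only, assuming genotypes coded as 0, 1 or 2 copies of the alternate allele with None for a missing call; the function name and the example calls are hypothetical and do not describe the CKB quality control pipeline.

```python
from typing import Optional, Sequence

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def duplicate_concordance(first: Sequence[Genotype],
                          second: Sequence[Genotype]) -> float:
    """Fraction of variants called in both runs whose genotype calls agree."""
    if len(first) != len(second):
        raise ValueError("both runs must cover the same variants in the same order")
    compared = 0
    matching = 0
    for call_a, call_b in zip(first, second):
        if call_a is None or call_b is None:
            continue  # skip variants with a missing call in either run
        compared += 1
        if call_a == call_b:
            matching += 1
    if compared == 0:
        raise ValueError("no variants were called in both runs")
    return matching / compared


# Hypothetical calls for one sample genotyped twice (illustrative values only).
run_1 = [0, 1, 2, 1, None, 0, 2, 1, 0, 0]
run_2 = [0, 1, 2, 1, 1, 0, 2, 2, 0, 0]
concordance = duplicate_concordance(run_1, run_2)
print(f"concordance = {concordance:.3f}")
```

In this toy example nine variants are called in both runs and eight agree. In practice the metric would be averaged over many duplicate pairs and several hundred thousand variants.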

Population structure

Appropriate analyses of genetic data rely on proper understanding of the genetic relationships between individuals in the study. We have identified substantial relatedness among CKB participants, with 24% (28% in rural and 18% in urban areas) having at least one parent, child, or sibling in the study and 32% (39% rural, 23% urban) having one or more second-degree relatives (i.e. grandparent, grandchild, uncle/aunt, niece/nephew). We also found evidence of past consanguinity, e.g. marriages between second cousins.

These aspects of the data have been used in analysis of the impact of inbreeding on reproductive success (Nat Commun 2019), and for within-family GWAS to understand how shared environment can influence the results of genetic association studies (bioRxiv 2021).
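To illustrate how proportions of participants with first- or second-degree relatives can be derived from pairwise genetic data, the sketch below classifies pairs by kinship coefficient and counts the participants involved. It assumes the conventional cut-offs popularised by the KING software (about 0.354, 0.177, 0.0884 and 0.0442); the source does not state which method or thresholds CKB used, and all identifiers and values here are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

Pair = Tuple[str, str, float]  # (participant A, participant B, kinship coefficient)


def relationship_degree(kinship: float) -> str:
    """Map a kinship coefficient to a relationship class.

    Thresholds follow the convention used by KING; applying them to CKB
    is an assumption made for illustration.
    """
    if kinship > 0.354:
        return "duplicate_or_mz_twin"
    if kinship > 0.177:
        return "first_degree"
    if kinship > 0.0884:
        return "second_degree"
    if kinship > 0.0442:
        return "third_degree"
    return "unrelated_or_distant"


def fraction_with_relative(pairs: Iterable[Pair],
                           participants: Sequence[str],
                           degree: str) -> float:
    """Share of participants with at least one relative of the given degree."""
    with_relative: Set[str] = set()
    for id_a, id_b, kinship in pairs:
        if relationship_degree(kinship) == degree:
            with_relative.update((id_a, id_b))
    return len(with_relative & set(participants)) / len(participants)


# Hypothetical participants and pairwise kinship estimates.
participants = ["P1", "P2", "P3", "P4", "P5"]
pairs = [("P1", "P2", 0.25), ("P2", "P3", 0.125), ("P4", "P5", 0.01)]
first_degree_share = fraction_with_relative(pairs, participants, "first_degree")
second_degree_share = fraction_with_relative(pairs, participants, "second_degree")
print(first_degree_share, second_degree_share)
```

Kinship alone does not distinguish a parent-child pair from siblings; that usually requires additional information such as the proportion of the genome sharing no alleles, or participants' ages.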

Analysis of genome-wide variation between individuals has also identified substantial genetic differences between individuals from different regions of China. Principal Component Analysis (PCA) groups study participants into discrete clusters largely reflecting the regions from which individuals were recruited, in a pattern strongly correlated with longitude and latitude. PCA within particular study regions revealed further patterns of genetic variation corresponding to participants’ specific recruitment clinics. This was, in general, much more pronounced in rural regions, reflecting established communities with little population movement, and was less strongly observed in urban regions in which there had often been recent inward migration. These findings have enhanced our overall understanding of the cohort, and have informed many of our genetic analyses (Cell Genomics 2023).
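As a rough illustration of how PCA separates individuals by ancestry, the sketch below computes the first principal component of a tiny genotype matrix using power iteration in plain Python. The genotypes, group labels and function names are all hypothetical, and the approach is a teaching simplification: analyses at CKB scale would rely on specialised software and on steps such as variant filtering and pruning that are omitted here.

```python
import math
from typing import List

Matrix = List[List[float]]


def centre_columns(genotypes: Matrix) -> Matrix:
    """Subtract each variant's mean allele count from its column."""
    n = len(genotypes)
    means = [sum(column) / n for column in zip(*genotypes)]
    return [[g - m for g, m in zip(row, means)] for row in genotypes]


def first_principal_component(genotypes: Matrix, iterations: int = 200) -> List[float]:
    """Leading eigenvector of the individual-by-individual Gram matrix.

    Each entry is one individual's position on PC1 (up to sign and scale).
    """
    x = centre_columns(genotypes)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("genotypes show no variation between individuals")
        vec = [v / norm for v in nxt]
    return vec


# Hypothetical allele counts (rows: individuals, columns: variants) for two
# groups with different allele frequencies, standing in for two regions.
region_a = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1]]
region_b = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
pc1 = first_principal_component(region_a + region_b)
print([round(score, 2) for score in pc1])
```

In this toy data the three individuals from each hypothetical region fall on opposite sides of zero along the first component, which is the kind of clustering by recruitment region described above.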

Whole genome sequencing

In 2021, a pilot of whole genome sequencing of 10,000 individuals was completed at BGI, Shenzhen, with 122 million distinct genetic variants identified in at least one sample. We have developed an online server offering a free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/), and have shown that imputation using the CKB panel greatly increases the number of well-imputed variants and improves imputation accuracy, leading to improved GWAS performance (Nucleic Acids Research 2023). Further whole genome sequencing commenced in early 2024, and sequencing of the entire CKB cohort will be complete in the first quarter of 2025. Work is ongoing to establish a cloud-based platform to facilitate analyses using these data.

Further resources and data sharing

We continue to expand the available genetic resources. Data on DNA methylation (a chemical change in DNA involved in turning genes on and off) are available for approximately 1,000 individuals, and have provided evidence for involvement of methylation at the ANKS1A and SNX30 genes in cardiovascular risk (eLife 2021). We plan to supplement these data with long-read sequencing and DNA methylation using Nanopore technology (pilot study under discussion). Finally, online resources to access CKB data continue to be developed and expanded: the results of CKB GWAS can now be browsed and downloaded through a PheWeb browser (https://pheweb.ckbiobank.org/), and a remote access platform (RAP) will soon be implemented to give researchers the ability to conduct their own analyses using CKB genetic data.

Impact of research

Together with the many available phenotypes and disease endpoints in CKB, these genetic resources are enabling a broad range of projects led by CKB and external researchers (links to Genetics Collaborations). Alongside other large biobanks in diverse populations, CKB will help to correct the strong Euro-centric bias of the genetic literature. With further development of the genetic resources, in combination with the growing range of diverse molecular assays, CKB will continue to make significant contributions to genetic discovery and to improved disease prediction, prevention, and treatment.
