Genome-wide association study
In genomics, a genome-wide association study (GWA study, or GWAS) is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.
When applied to human data, GWA studies compare the DNA of participants having varying phenotypes for a particular trait or disease. These participants may be people with a disease (cases) and similar people without the disease (controls), or they may be people with different phenotypes for a particular trait, for example blood pressure. This approach is known as phenotype-first, in which the participants are classified first by their clinical manifestation(s), as opposed to genotype-first. Each person gives a sample of DNA, from which millions of genetic variants are read using SNP arrays. If there is significant statistical evidence that one type of the variant (one allele) is more frequent in people with the disease, the variant is said to be associated with the disease. The associated SNPs are then considered to mark a region of the human genome that may influence the risk of disease.
GWA studies investigate the entire genome, in contrast to methods that specifically test a small number of pre-specified genetic regions. Hence, GWAS is a non-candidate-driven approach, in contrast to gene-specific candidate-driven studies. GWA studies identify SNPs and other variants in DNA associated with a disease, but they cannot on their own specify which genes are causal.[1][2][3]
The first successful GWAS, published in 2002, studied myocardial infarction.
Background
Any two human genomes differ in millions of different ways. There are small variations in the individual nucleotides of the genomes (SNPs) as well as many larger variations, such as deletions, insertions and copy number variations. Any of these may cause alterations in an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as height.[8] Around the year 2000, prior to the introduction of GWA studies, the primary method of investigation was through inheritance studies of genetic linkage in families. This approach had proven highly useful for single-gene disorders.[9][8][10] However, for common and complex diseases the results of genetic linkage studies proved hard to reproduce.[8][10] A suggested alternative to linkage studies was the genetic association study. This study type asks whether an allele of a genetic variant is found more often than expected in individuals with the phenotype of interest (e.g. with the disease being studied). Early calculations on statistical power indicated that this approach could be better than linkage studies at detecting weak genetic effects.[11]
In addition to the conceptual framework several additional factors enabled the GWA studies. One was the advent of
Methods
The most common approach of GWA studies is the
Example: suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).
When the allele frequency in the case group is higher than in the control group, the odds ratio is greater than 1, and vice versa for lower allele frequency. Additionally, a p-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[16] Because so many variants are tested, it is standard practice to require the p-value to be lower than 5×10⁻⁸ to consider a variant significant.
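The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `allelic_association` and the counts are hypothetical, the test is a plain Pearson chi-squared test on the 2×2 table without continuity correction, and the p-value uses the closed form for one degree of freedom. The 5×10⁻⁸ threshold is the one given in the text; it is commonly motivated as a Bonferroni correction of 0.05 for roughly one million independent tests.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # significance threshold quoted in the text


def allelic_association(a, b, x, y):
    """Odds ratio and 1-df chi-squared test for a 2x2 allele table.

    a: cases with allele T      b: controls with allele T
    x: cases with allele C      y: controls with allele C
    Assumes all four counts are positive.
    """
    odds_ratio = (a / b) / (x / y)
    n = a + b + x + y
    # Pearson chi-squared statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * y - b * x) ** 2 / ((a + b) * (x + y) * (a + x) * (b + y))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Made-up counts, for illustration only.
odds_ratio, chi2, p_value = allelic_association(a=1200, b=1000, x=800, y=1000)
significant = p_value < GENOME_WIDE_THRESHOLD
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.1f}, p = {p_value:.2e}, "
      f"genome-wide significant: {significant}")
```

In practice this test is repeated for every variant, which is why the per-variant threshold has to be so strict.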
Variations on the case-control approach. A common alternative to case-control GWA studies is the analysis of quantitative phenotypic data, e.g. height or
A key step in the majority of GWA studies is the imputation of genotypes at SNPs not on the genotype chip used in the study.[23] This process greatly increases the number of SNPs that can be tested for association, increases the power of the study, and facilitates meta-analysis of GWAS across distinct cohorts. Genotype imputation is carried out by statistical methods that combine the GWAS genotype data with a reference panel of haplotypes, which typically have been densely genotyped using whole-genome sequencing. These methods take advantage of sharing of haplotypes between individuals over short stretches of sequence to impute alleles. Existing software packages for genotype imputation include IMPUTE2,[24] Minimac, Beagle[25] and MaCH.[26]
In addition to the calculation of association, it is common to take into account any variables that could potentially
After odds ratios and
Results
Attempts have been made at creating comprehensive catalogues of SNPs that have been identified from GWA studies.[33] As of 2009, SNPs associated with diseases numbered in the thousands.[34]
The first GWA study, conducted in 2005, compared 96 patients with
Another landmark publication in the history of GWA studies was the
Since these first landmark GWA studies, there have been two general trends.[38] One has been towards larger and larger sample sizes. Since 2018, several genome-wide association studies have reached a total sample size of over 1 million participants, including 1.1 million in a genome-wide study of educational attainment,[39] followed by another in 2022 with 3 million individuals,[40] and a study of insomnia containing 1.3 million individuals.[41] The reason is the drive towards reliably detecting risk-SNPs that have smaller effect sizes and lower allele frequency. Another trend has been towards the use of more narrowly defined phenotypes, such as blood lipids, proinsulin or similar biomarkers.[42][43] These are called intermediate phenotypes, and their analyses may be of value to functional research into biomarkers.[44]
A variation of GWAS uses participants who are first-degree relatives of people with a disease. This type of study has been named genome-wide association study by proxy (GWAX).[45]
A central point of debate on GWA studies has been that most of the SNP variations found by GWA studies are associated with only a small increased risk of the disease, and have only a small predictive value. The median odds ratio is 1.33 per risk-SNP, with only a few showing odds ratios above 3.0.
Clinical applications and examples
A challenge for future successful GWA study is to apply the findings in a way that accelerates
Hepatitis C treatment
One such success is related to identifying the genetic variant associated with response to anti-
eQTL, LDL and cardiovascular disease
The goal of elucidating pathophysiology has also led to increased interest in the association between risk-SNPs and the gene expression of nearby genes, the so-called expression quantitative trait loci (eQTL) studies.[55] The reason is that GWA studies identify risk-SNPs, but not risk-genes, and specification of genes is one step closer towards actionable drug targets. As a result, major GWA studies by 2011 typically included extensive eQTL analysis.[56][57][58] One of the strongest eQTL effects observed for a GWA-identified risk SNP is the SORT1 locus.[42] Functional follow-up studies of this locus using small interfering RNA and gene knock-out mice have shed light on the metabolism of low-density lipoproteins, which has important clinical implications for cardiovascular disease.[42][59][60]
Atrial fibrillation
For example, a
Schizophrenia
Research using a High-Precision Protein Interaction Prediction (HiPPIP) computational model discovered 504 new protein-protein interactions (PPIs) associated with genes linked to schizophrenia.[62][63][64] While the evidence supporting the genetic basis of schizophrenia is not controversial, one study found that 25 candidate schizophrenia genes discovered from GWAS had little association with schizophrenia, demonstrating that GWAS alone may be insufficient to identify candidate genes.[65]
Conservation applications
Population-level GWA studies may be used to identify adaptive genes to help evaluate the ability of species to adapt to changing environmental conditions as the global climate becomes warmer.[66] This could help determine extirpation risk for species and could therefore be an important tool for conservation planning. Utilizing GWA studies to determine adaptive genes could help elucidate the relationship between neutral and adaptive genetic diversity.
Agricultural applications
Plant growth stages and yield components
GWA studies are an important tool in plant breeding. With large genotyping and phenotyping data sets, GWAS are powerful in analyzing complex inheritance modes of traits that are important yield components, such as number of grains per spike, weight of each grain and plant structure. A GWA study in spring wheat revealed a strong correlation of grain production with booting data, biomass and number of grains per spike.[67] GWA studies have also been successful in studying the genetic architecture of complex traits in rice.[68]
Plant pathogens
The emergence of plant pathogens has posed serious threats to plant health and biodiversity. For this reason, identifying wild types that have natural resistance to certain pathogens could be of vital importance, as could predicting which alleles are associated with the resistance. GWA studies are a powerful tool for detecting the relationships between certain variants and resistance to a plant pathogen, which is beneficial for developing new pathogen-resistant cultivars.[69]
Chicken
The first GWA study in chickens was done by Abasht and Lamont[70] in 2007. This GWA study was used to study the fatness trait in a previously described F2 population. Significantly related SNPs were found on 10 chromosomes (1, 2, 3, 4, 7, 8, 10, 12, 15 and 27).
Limitations
GWA studies have several issues and limitations that can be taken care of through proper quality control and study setup. Lack of well defined case and control groups, insufficient sample size, control for
Additionally, GWA studies identify candidate risk variants for the population in which the analysis is performed. Because most GWA studies have historically drawn on European databases, the identified risk variants often translate poorly to non-European populations.
Fine-mapping
Genotyping arrays designed for GWAS rely on linkage disequilibrium to provide coverage of the entire genome by genotyping a subset of variants. Because of this, the reported associated variants are unlikely to be the actual causal variants. Associated regions can contain hundreds of variants spanning large regions and encompassing many different genes, making the biological interpretation of GWAS loci more difficult. Fine-mapping is a process to refine these lists of associated variants to a credible set most likely to include the causal variant.
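How strongly a genotyped variant stands in for a nearby ungenotyped one is usually summarised by the squared correlation r² between the two variants. The sketch below computes r² from phased haplotypes; the function name `ld_r_squared`, the 0/1 allele coding and the toy haplotypes are assumptions made for illustration and do not come from any particular GWAS toolkit.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic variants.

    haplotypes: list of (allele_at_variant_1, allele_at_variant_2) pairs,
    each allele coded 0 or 1, one pair per phased chromosome.
    Assumes both variants are polymorphic (otherwise r^2 is undefined).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n  # frequency of allele 1 at variant 1
    p_b = sum(h[1] for h in haplotypes) / n  # frequency of allele 1 at variant 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n  # both alleles together
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Toy data: the two variants always travel together (perfect LD) ...
perfect = [(1, 1)] * 4 + [(0, 0)] * 6
# ... versus variants that are statistically independent.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r_squared(perfect))      # close to 1
print(ld_r_squared(independent))  # 0.0
```

When r² is close to 1, an association test on the genotyped variant gives nearly the same result as a test on its ungenotyped neighbour, which is why an associated array SNP need not be the causal variant.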
Fine-mapping requires all variants in the associated region to have been genotyped or imputed (dense coverage), very stringent quality control resulting in high-quality genotypes, and sample sizes large enough to separate highly correlated signals. There are several different methods to perform fine-mapping, and all methods produce a posterior probability that a variant in that locus is causal. Because the requirements are often difficult to satisfy, there are still limited examples of these methods being more generally applied.
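Given per-variant posterior probabilities, a credible set can be formed by ranking the variants and adding them until a chosen coverage is reached. The sketch below illustrates this step only; the function name `credible_set`, the 95% coverage level, the variant IDs and the probabilities are all hypothetical, and the approach assumes that the posteriors sum to about 1 across the locus (a single causal variant).

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of top-ranked variants whose posteriors sum to >= coverage.

    posteriors: dict mapping variant ID -> posterior probability of causality.
    Returns the selected variant IDs and the probability mass they cover.
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for variant, probability in ranked:
        chosen.append(variant)
        total += probability
        if total >= coverage:
            break
    return chosen, total


# Hypothetical variant IDs and posterior probabilities, for illustration only.
example = {"rsA": 0.60, "rsB": 0.25, "rsC": 0.08, "rsD": 0.04, "rsE": 0.03}
variants, covered = credible_set(example)
print(variants, round(covered, 2))
```

A smaller credible set means the data narrow the signal down to fewer candidates; highly correlated variants tend to share the posterior mass and so enlarge the set.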
See also
- Association mapping
- Transcriptome-wide association study
- Epidemiology
- Genetic diversity
- Gene–environment interaction
- Genomics
- Linkage disequilibrium
- Molecular epidemiology
- Polygenic score
- Population genetics
- Genetic epidemiology
- Common disease-common variant hypothesis
- Microbiome-wide association study
- Conservation biology
References
- ^ PMID 20647212.
- ^ PMID 18349094.
- ^ "Genome-Wide Association Studies". National Human Genome Research Institute.
- S2CID 21414260.
- PMID 15761122.
- ^ "GWAS Catalog: The NHGRI-EBI Catalog of published genome-wide association studies". European Molecular Biology Laboratory. Retrieved 18 April 2017.
- ^ PMID 23300413.
- ^ ISBN 978-0-8153-4149-9.
- ^ "Online Mendelian Inheritance in Man". Archived from the original on 5 December 2011. Retrieved 6 December 2011.
- ^ PMID 11565063.
- S2CID 5228523.
- PMID 17550341.
- S2CID 4387110.
- S2CID 6720459.
- ^ PMID 17554300.
- ^ PMID 21293453.
- PMID 17701901.
- PMID 26072488.
- PMID 28194175.
- S2CID 5942777.
- ^ Carré C, Carluer JB, Chaux C, Estoup-Streiff C, Roche N, Hosy E, Mas A, Krouk G (March 2024). "Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction". Genome Biology. doi:10.1186/s13059-024-03202-0. PMID 38523316. S2CID 146570
- ^ ISSN 1474-760X.
- S2CID 1465707.
- PMID 22384356.
- PMID 19200528.
- PMID 21058334.
- PMID 18758442.
- PMID 27906529.
- PMID 24473445.
- PMID 22792080.
- PMID 33875891.
- PMID 21829380.
- PMID 19474294.
- PMID 19161620.
- S2CID 32716116.
- PMID 21860027.
- ^ "Largest ever study of genetics of common diseases published today" (Press release). Wellcome Trust Case Control Consortium. 6 June 2007. Archived from the original on 4 June 2008. Retrieved 19 June 2008.
- S2CID 6463743.
- PMID 30038396.
- PMID 35361970.
- PMID 30804565.
- ^ PMID 19060906.
- PMID 21873549.
- PMID 19901186.
- S2CID 5598845.
- PMID 20300123.
- ^ PMID 18987709.
- S2CID 7602652.
- PMID 20837927.
- PMID 20159871.
- PMID 20522751.
- S2CID 1707096.
- PMID 19759533.
- PMID 25059740.
- PMID 20562444.
- PMID 22055160.
- PMID 22100073.
- PMID 21462369.
- S2CID 24020035.
- PMID 29892015.
- PMID 27336055.
- ^ "New Schizophrenia Study Focuses on Protein-Protein Interactions". psychcentral.com. 3 May 2016. Archived from the original on 11 January 2020. Retrieved 22 April 2023.
- PMC 5887623.
- PMID 28823710.
- PMID 34930821.
- PMID 29143598.
- PMID 21915109.
- PMID 28588588.
- PMID 17894563.
- ^ MacArthur D (8 July 2010). "Serious flaws revealed in "longevity genes" study". Wired. Retrieved 7 December 2011.
- PMID 21778381.
- PMID 22279548.
- S2CID 148570302.
- PMID 20395969.
- PMID 10762547.
- )
- PMID 21670730.
External links
- Genotype-phenotype interaction software tools and databases on omicX
- Statistical Methods for the Analysis of Genome-Wide Association Studies [video lecture series]
- Whole genome association studies — by the National Human Genome Research Institute
- GWAS Central — a central database of summary-level genetic association findings
- Barrett J (18 July 2010). "How to read a genome-wide association study". Genomes Unzipped.
- Consortia of genome-wide association studies (GWAS) Archived 26 February 2018 at the Wayback Machine — by Bennett SN, Caporaso, NE, et al.
- PLINK — whole genome association analysis toolset
- ENCODE threads explorer Impact of functional information on understanding variation. Nature (journal)