Population structure (genetics)
Population structure (also called genetic structure and population stratification) is the presence of a systematic difference in allele frequencies between subpopulations. In a randomly mating (or panmictic) population, allele frequencies are expected to be roughly similar between groups. However, mating tends to be non-random to some degree, causing structure to arise. For example, a barrier like a river can separate two groups of the same species and make it difficult for potential mates to cross; if a mutation occurs, over many generations it can spread and become common in one subpopulation while being completely absent in the other.
Genetic variants do not necessarily cause observable changes in organisms, but they can be correlated with traits by coincidence because of population structure: a variant that is common in a population with a high rate of disease may erroneously be thought to cause the disease. For this reason, population structure is a common confounding variable in genetic association studies.
Description
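The basic cause of population structure in sexually reproducing species is non-random mating between groups: if all individuals in a population mated at random, allele frequencies would be similar between groups.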
Measures
Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures.[3][4] Many statistical methods rely on simple population models to infer historical demographic changes, such as population bottlenecks, admixture events or population divergence times. These methods often assume panmixia, or homogeneity in an ancestral population. Misspecifying such models, for instance by ignoring structure in an ancestral population, can give rise to heavily biased parameter estimates.[5] Simulation studies show that historical population structure can even produce genetic signals that are easily misinterpreted as historical changes in population size or as admixture events, even when no such events occurred.[6]
Heterozygosity
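One of the results of population structure is a reduction in heterozygosity. Wright's F-statistics quantify this reduction by comparing observed with expected heterozygosity. For example, $F_{IS}$ measures the inbreeding coefficient at a single locus for an individual $I$ relative to a subpopulation $S$:
$$F_{IS} = 1 - \frac{H_I}{H_S}$$
Here, $H_I$ is the fraction of individuals in subpopulation $S$ that are heterozygous. Assuming there are two alleles, $A_1$ and $A_2$, that occur at respective frequencies $p_S$ and $q_S$, it is expected that under random mating the subpopulation $S$ will have a heterozygosity rate of $H_S = 2 p_S q_S$. Then:
$$F_{IS} = 1 - \frac{H_I}{2 p_S q_S}$$
Similarly, for the total population $T$, we can define $H_T = 2 p_T q_T$, allowing us to compute the expected heterozygosity of subpopulation $S$ and the value $F_{ST}$ as:[9]
$$F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2 p_S q_S}{2 p_T q_T}$$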
If F is 0, then the allele frequencies between populations are identical, suggesting no structure. The theoretical maximum value of 1 is attained when an allele reaches total fixation, but most observed maximum values are far lower.
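The behaviour at the two extremes can be checked numerically. The following is a minimal sketch, assuming a single biallelic locus, equally sized subpopulations, and the fixation index computed as one minus the ratio of the mean subpopulation heterozygosity to the heterozygosity of the pooled population. The function names are illustrative and do not come from any particular library.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus under random mating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p)


def fst(subpop_freqs):
    """Fixation index 1 - H_S / H_T, assuming equally sized subpopulations."""
    if not subpop_freqs:
        raise ValueError("need at least one subpopulation")
    # H_S: expected heterozygosity averaged over the subpopulations.
    h_s = sum(expected_heterozygosity(p) for p in subpop_freqs) / len(subpop_freqs)
    # H_T: expected heterozygosity of the pooled (total) population.
    p_total = sum(subpop_freqs) / len(subpop_freqs)
    h_t = expected_heterozygosity(p_total)
    if h_t == 0.0:
        raise ValueError("locus is monomorphic in the total population")
    return 1.0 - h_s / h_t


identical = fst([0.3, 0.3])   # same frequency everywhere: no structure
moderate = fst([0.2, 0.8])    # diverged frequencies
fixed = fst([0.0, 1.0])       # each subpopulation fixed for a different allele
print(identical, moderate, fixed)
```

Identical frequencies give a value of 0 and opposite fixation gives 1, matching the two limits described above; intermediate divergence falls in between.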
Admixture inference
An individual's genotype can be modelled as an admixture between K discrete clusters of populations.[9] Each cluster is defined by the frequencies of its genotypes, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo, modelling allele frequencies at each locus with a Dirichlet distribution.[11] Since then, algorithms (such as ADMIXTURE) have been developed using other estimation techniques.[12][13] Estimated proportions can be visualized using bar plots — each bar represents an individual, and is subdivided to represent the proportion of an individual's genetic ancestry from one of the K populations.[9]
Varying K can illustrate different scales of population structure; using a small K for the entire human population will subdivide people roughly by continent, while using large K will partition populations into finer subgroups.[9] Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of K, but rather an approximation considered useful for a given question.[3] They are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested.[3] Clusters may be admixed themselves,[9] and may not have a useful interpretation as source populations.[14]
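The admixture model can be made concrete with a small likelihood calculation. The sketch below is not the STRUCTURE or ADMIXTURE implementation; it only assumes the binomial model that such methods typically build on, in which an individual's allele frequency at a locus is the ancestry-weighted average of the cluster frequencies. All names and the toy data are illustrative.

```python
import math


def admixture_log_likelihood(genotypes, q, p):
    """Log-likelihood of genotypes under a simple binomial admixture model.

    genotypes[i][j]: copies (0, 1 or 2) of a reference allele in individual i at locus j
    q[i][k]: fraction of individual i's ancestry from cluster k (each row sums to 1)
    p[k][j]: reference-allele frequency at locus j in cluster k
    """
    eps = 1e-9  # keeps log() finite when a frequency is exactly 0 or 1
    total = 0.0
    for g_row, q_row in zip(genotypes, q):
        for j, g in enumerate(g_row):
            # Individual-specific frequency: ancestry-weighted mix of clusters.
            f = sum(q_k * p_k[j] for q_k, p_k in zip(q_row, p))
            f = min(max(f, eps), 1.0 - eps)
            total += (math.log(math.comb(2, g))
                      + g * math.log(f)
                      + (2 - g) * math.log(1.0 - f))
    return total


# Two clusters with very different frequencies at three loci (toy values).
p = [[0.9, 0.9, 0.9],
     [0.1, 0.1, 0.1]]
genotypes = [[2, 2, 2]]  # one individual, homozygous for the reference allele

ll_cluster_a = admixture_log_likelihood(genotypes, [[1.0, 0.0]], p)
ll_mixed = admixture_log_likelihood(genotypes, [[0.5, 0.5]], p)
ll_cluster_b = admixture_log_likelihood(genotypes, [[0.0, 1.0]], p)
print(ll_cluster_a, ll_mixed, ll_cluster_b)
```

For this individual the likelihood is highest when all ancestry is assigned to the first cluster, which is the kind of comparison an estimator makes when it searches over ancestry proportions.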
Dimensionality reduction
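Genetic data are high-dimensional, and dimensionality reduction techniques such as principal component analysis (PCA) can capture population structure in a small number of dimensions.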
PCA projects the data onto orthogonal axes ordered by the amount of variance they capture; given enough data, when each individual is visualized as a point on a plot of the leading axes, discrete clusters can form.
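A minimal sketch of this idea follows, using only the standard library. It assumes a small matrix of allele counts (rows are individuals, columns are variants) and finds the leading principal axis by power iteration on the covariance matrix. The toy genotypes and the function name are made up for illustration, and the per-variant scaling used in practical analyses is omitted.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (axis, scores) for the leading principal component.

    rows: list of individuals, each a list of allele counts (0, 1 or 2).
    """
    n, m = len(rows), len(rows[0])
    # Centre each variant (column) on its mean allele count.
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sample covariance between variants.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    # Limitation: a start vector orthogonal to the leading axis would not converge to it.
    axis = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            raise ValueError("data have no variance")
        axis = [x / norm for x in nxt]
    # Project each centred individual onto the leading axis.
    scores = [sum(r[j] * axis[j] for j in range(m)) for r in centred]
    return axis, scores


# Toy data: the first three individuals resemble each other, as do the last three.
genotype_rows = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
    [0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1],
]
axis, scores = first_principal_component(genotype_rows)
group_a, group_b = scores[:3], scores[3:]
print(group_a, group_b)
```

In this toy example the two groups fall on opposite sides of zero along the first axis, which is the one-dimensional analogue of clusters separating on a PCA plot.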
Demographic inference
Population structure is an important aspect of evolutionary and population genetics. Events like migrations and interactions between groups leave a genetic imprint on populations. Admixed populations will have haplotype chunks from their ancestral groups, which gradually shrink over time because of recombination. By exploiting this fact and matching shared haplotype chunks from individuals within a genetic dataset, researchers may trace and date the origins of population admixture and reconstruct historic events such as the rise and fall of empires, slave trades, colonialism, and population expansions.[26]
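The link between chunk length and time can be illustrated with a rough rule of thumb. The sketch below assumes a single admixture pulse and treats the mean length of an ancestry chunk after g generations as roughly 100/g centimorgans. This is a simplifying assumption that ignores the admixture proportions, drift, and errors in chunk detection, so the function (a hypothetical helper) gives only an order-of-magnitude estimate.

```python
def generations_since_admixture(mean_tract_cm):
    """Crude estimate of generations since a single admixture pulse.

    Assumes the mean ancestry chunk length is about 100 / g centimorgans
    after g generations (a rule-of-thumb assumption, not a fitted model).
    """
    if mean_tract_cm <= 0:
        raise ValueError("mean tract length must be positive")
    return 100.0 / mean_tract_cm


# Shorter chunks imply an older admixture event under this assumption.
recent = generations_since_admixture(20.0)
older = generations_since_admixture(5.0)
print(recent, older)
```

Published dating methods fit the full distribution of chunk lengths rather than a single mean, but the inverse relationship between chunk length and time is the same.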
Role in genetic epidemiology
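Population structure can be a problem for association studies, where it can produce a spurious association between a genetic variant and the trait of interest.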
Phenotypes (measurable traits), such as height or risk for heart disease, are the product of some combination of genes and environment. These traits can be predicted using polygenic scores, which seek to isolate and estimate the contribution of genetics to a trait by summing the effects of many individual genetic variants. To construct a score, researchers first enrol participants in an association study to estimate the contribution of each genetic variant. Then, they can use the estimated contributions of each genetic variant to calculate a score for the trait for an individual who was not in the original association study. If structure in the study population is correlated with environmental variation, then the polygenic score is no longer measuring the genetic component alone.[29]
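The summation step can be sketched in a few lines. The example assumes an additive score in which each variant contributes its estimated effect size multiplied by the number of effect alleles the individual carries. The variant identifiers and effect sizes are hypothetical placeholders, not values from any study.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Additive polygenic score: sum of effect size x effect-allele count.

    allele_counts: variant id -> copies of the effect allele (0, 1 or 2)
    effect_sizes: variant id -> per-allele effect estimated in an association study
    """
    score = 0.0
    for variant, effect in effect_sizes.items():
        if variant not in allele_counts:
            raise KeyError(f"no genotype for variant {variant!r}")
        score += effect * allele_counts[variant]
    return score


# Hypothetical effect sizes from an association study (illustrative only).
effects = {"variant_1": 0.5, "variant_2": -0.25, "variant_3": 1.0}
# Genotype of a new individual who was not part of that study.
individual = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

score = polygenic_score(individual, effects)
print(score)
```

Because the effect sizes are taken as given, any confounding from structure that inflated them in the original study is carried directly into the score.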
Several methods can at least partially control for this confounding effect. The genomic control method was introduced in 1999 and is a relatively nonparametric method for controlling the inflation of test statistics.[30] It is also possible to use unlinked genetic markers to estimate each individual's ancestry proportions from some K subpopulations, which are assumed to be unstructured.[31] More recent approaches use principal component analysis (PCA), as demonstrated by Alkes Price and colleagues,[32] or derive a genetic relationship matrix (also called a kinship matrix) and include it in a linear mixed model (LMM).[33][34]
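The genomic control idea can be sketched as follows. The example assumes one-degree-of-freedom chi-square association statistics, estimates the inflation factor as the observed median divided by the median expected under no association (about 0.455), and rescales the statistics by that factor. Capping the factor at 1 from below is a convention assumed here, and the statistics are invented for illustration.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its expected null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Deflate test statistics by the inflation factor (never inflate them)."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return [s / lam for s in chi2_stats]


# Invented association statistics whose median is twice the null expectation.
observed = [0.2, 0.5, 0.9098728, 3.0, 5.0]
lam = inflation_factor(observed)
corrected = genomic_control(observed)
print(lam, corrected)
```

A single rescaling factor is applied to every variant, so the approach corrects overall inflation but does not account for variants whose frequencies differ across subpopulations more than average.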
PCA and LMMs have become the most common methods to control for confounding from population structure. Though they are likely sufficient for avoiding false positives in association studies, they are still vulnerable to overestimating effect sizes of marginally associated variants and can substantially bias estimates of polygenic scores and trait heritability.[35][36] If environmental effects are related to a variant that exists in only one specific region (for example, a pollutant is found in only one city), it may not be possible to correct for this population structure effect at all.[29] For many traits, the role of structure is complex and not fully understood, and incorporating it into genetic studies remains a challenge and is an active area of research.[37]
References
- S2CID 14255234.
- McVean G (2001). "Population Structure" (PDF). Archived from the original (PDF) on 2018-11-23. Retrieved 2020-11-14.
- PMID 30108219.
- S2CID 24403040.
- PMID 30007846.
- PMID 30293985.
- OCLC 37481398.
- PMID 24540312.
- Coop G (2019). Population and Quantitative Genetics. pp. 22–44.
- PMID 31132375.
- PMID 10835412.
- PMID 19648217.
- PMID 21801023.
- PMID 27729489.
- PMID 22253600.
- PMID 22927824.
- PMID 356262.
- PMID 17194218.
- PMID 18758442.
- PMID 19834557.
- S2CID 10739417.
- PMID 28718343.
- PMID 31675358.
- PMID 32218440.
- PMID 33561250.
- PMID 24531965.
- S2CID 9760182.
- PMID 10364535.
- PMID 33355092.
- S2CID 6297807.
- PMID 10827107.
- S2CID 8127858.
- S2CID 8507433.
- PMID 25642633.
- PMID 33200985.
- PMID 30895926.
- PMID 31030318.