Single-nucleotide polymorphism
In genetics and bioinformatics, a single-nucleotide polymorphism (SNP /snɪp/; plural SNPs /snɪps/) is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of the population under consideration (generally regarded as 1% or more).[1][2]
For example, a G nucleotide present at a specific location in a reference genome may be replaced by an A in a minority of individuals. The two possible nucleotide variations of this SNP – G or A – are called alleles.[3]
SNPs can help explain differences in susceptibility to a wide range of diseases across a population. For example, a common SNP in the CFH gene is associated with increased risk of age-related macular degeneration.[4] Differences in the severity of an illness or response to treatments may also be manifestations of genetic variations caused by SNPs. For example, two common SNPs in the APOE gene, rs429358 and rs7412, lead to three major APO-E alleles with different associated risks for development of Alzheimer's disease and age at onset of the disease.[5]
Single nucleotide substitutions with an allele frequency of less than 1% are sometimes called single-nucleotide variants (SNVs).[6] "Variant" may also be used as a general term for any single nucleotide change in a DNA sequence,[2] encompassing both common SNPs and rare mutations, whether germline or somatic.[7][8] The term SNV has therefore been used to refer to point mutations found in cancer cells.[9] DNA variants must also commonly be taken into consideration in molecular diagnostics applications such as designing PCR primers to detect viruses, in which the viral RNA or DNA sample may contain SNVs.[citation needed] However, this nomenclature uses arbitrary distinctions (such as an allele frequency of 1%) and is not used consistently across all fields; the resulting disagreement has prompted calls for a more consistent framework for naming differences in DNA sequences between two samples.[10][11]
Types
Single-nucleotide
SNPs in the coding region are of two types: synonymous SNPs and nonsynonymous SNPs. Synonymous SNPs do not affect the protein sequence, while nonsynonymous SNPs change the amino acid sequence of the protein.[13]
- SNPs in eQTL (expression quantitative trait locus).
- SNPs in coding regions:
- synonymous substitutions by definition do not result in a change of amino acid in the protein, but can still affect its function in other ways. For example, a seemingly silent mutation in the multidrug resistance gene 1 (MDR1), which codes for a cellular membrane pump that expels drugs from the cell, can slow down translation and allow the peptide chain to fold into an unusual conformation, making the mutant pump less functional. In the MDR1 protein, for instance, the C1236T polymorphism changes a GGC codon to GGT at amino acid position 412 of the polypeptide (both encode glycine), and the C3435T polymorphism changes ATC to ATT at position 1145 (both encode isoleucine).[16]
- nonsynonymous substitutions:
- progeria syndrome)
- mRNA, and in a truncated, incomplete, and usually nonfunctional protein product (e.g. Cystic fibrosis caused by the G542X mutation in the cystic fibrosis transmembrane conductance regulator gene).[18]
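The distinction between synonymous and nonsynonymous substitutions can be illustrated with a short Python sketch. It assumes the standard genetic code and in-frame codons read on the coding strand; the function name `classify_substitution` and the category labels are illustrative and are not taken from any particular tool.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid (or stop) before and after
    if alt_aa == "*":
        return "nonsense"        # an amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"       # a stop codon becomes an amino acid codon
    return "missense"            # one amino acid is replaced by another


if __name__ == "__main__":
    print(classify_substitution("GGC", "GGT"))  # synonymous (glycine to glycine)
    print(classify_substitution("ATC", "ATT"))  # synonymous (isoleucine to isoleucine)
    print(classify_substitution("GGA", "TGA"))  # nonsense (glycine to stop)
```

The first two calls use the MDR1 codon changes quoted above; the third is a generic glycine-to-stop change chosen for illustration. A classification like this says nothing about effects on translation speed, splicing or folding, which is why a synonymous change can still alter function.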
SNPs that are not in protein-coding regions may still affect
Frequency
More than 600 million SNPs have been identified across the human genome in the world's population.[19] A typical genome differs from the reference human genome at 4 to 5 million sites, most of which (more than 99.9%) consist of SNPs and short indels.[20]
Within a genome
The genomic distribution of SNPs is not homogenous; SNPs occur in
SNP density can be predicted by the presence of
Within a population
There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another. However, this pattern of variation is relatively rare; in a global sample of 67.3 million SNPs, the Human Genome Diversity Project "found no such private variants that are fixed in a given continent or major region. The highest frequencies are reached by a few tens of variants present at >70% (and a few thousands at >50%) in Africa, the Americas, and Oceania. By contrast, the highest frequency variants private to Europe, East Asia, the Middle East, or Central and South Asia reach just 10 to 30%."[24]
Within a population, SNPs can be assigned a minor allele frequency (MAF), the lowest allele frequency at a locus that is observed in a particular population.[25] For a biallelic SNP, this is simply the lesser of the two allele frequencies.
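As a minimal sketch of this calculation, the following Python function derives the minor allele frequency of a biallelic SNP from genotype counts in a diploid sample. The function names, the example counts and the use of the 1% cut-off as a "common variant" check are illustrative assumptions.

```python
def minor_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Return the minor allele frequency for a biallelic SNP.

    Each diploid individual carries two alleles, so the allele total is
    twice the number of genotyped individuals.
    """
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def is_common(maf: float, threshold: float = 0.01) -> bool:
    """Apply the conventional 1% frequency cut-off (an arbitrary convention)."""
    return maf >= threshold


if __name__ == "__main__":
    # Hypothetical sample: 70 G/G, 25 G/A and 5 A/A individuals.
    maf = minor_allele_frequency(70, 25, 5)
    print(round(maf, 3), is_common(maf))  # 0.175 True
```

Swapping the two homozygous counts gives the same result, because the function reports the frequency of whichever allele is rarer in the sample.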
With this knowledge, scientists have developed new methods for analyzing population structure in less-studied species.[26][27][28] By using pooling techniques, the cost of the analysis is significantly lowered.[citation needed] These techniques are based on sequencing a population in a pooled sample instead of sequencing every individual within the population separately. With new bioinformatics tools, it is possible to investigate population structure, gene flow and gene migration by observing the allele frequencies within the entire population. With these protocols, it is possible to combine the advantages of SNPs with those of microsatellite markers.[29][30] However, some information is lost in the process, such as linkage disequilibrium and zygosity information.
Applications
- Association studies can determine whether a genetic variant is associated with a disease or trait.[31]
- A tag SNP is a representative single-nucleotide polymorphism in a region of the genome with high linkage disequilibrium (the non-random association of alleles at two or more loci). Tag SNPs are useful in whole-genome SNP association studies, in which hundreds of thousands of SNPs across the entire genome are genotyped.
- Haplotype mapping: sets of alleles or DNA sequences can be clustered so that a single SNP can identify many linked SNPs.
- Linkage disequilibrium (LD), a term used in population genetics, indicates non-random association of alleles at two or more loci, not necessarily on the same chromosome. It refers to the phenomenon that SNP alleles or DNA sequences that are close together in the genome tend to be inherited together. LD can be affected by two parameters (among other factors, such as population stratification): 1) the distance between the SNPs (the larger the distance, the lower the LD); 2) the recombination rate (the lower the recombination rate, the higher the LD).[32]
- In genetic epidemiology SNPs are used to estimate transmission clusters.[33]
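The non-random association described in the linkage disequilibrium entry above is commonly summarized by the coefficient D and the squared correlation r². The sketch below computes both from phased haplotype counts for two biallelic loci; the function name, argument names and example counts are hypothetical, and real analyses usually have to estimate haplotype frequencies from unphased genotypes first.

```python
def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    A/a are the alleles at the first locus and B/b those at the second.
    D is the difference between the observed AB haplotype frequency and
    the frequency expected if the two loci were independent.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype is required")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r^2 is undefined when either locus is monomorphic in the sample.
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


if __name__ == "__main__":
    # Alleles always travel together: complete LD.
    print(linkage_disequilibrium(50, 0, 0, 50))    # (0.25, 1.0)
    # All four haplotypes equally frequent: no LD.
    print(linkage_disequilibrium(25, 25, 25, 25))  # (0.0, 0.0)
```

An r² close to 1 means that genotyping one SNP predicts the other almost perfectly, which is the property that makes a tag SNP informative about its neighbours.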
Importance
Variations in the DNA sequences of humans can affect how humans develop
Examples include biomedical research, forensics, pharmacogenetics, and disease causation, as outlined below.
Clinical research
Genome-wide association study (GWAS)
One of the main contributions of SNPs to clinical research is the genome-wide association study (GWAS).[35] Genome-wide genetic data can be generated by multiple technologies, including SNP arrays and whole-genome sequencing. GWAS is commonly used to identify SNPs associated with diseases or clinical phenotypes or traits. Because GWAS is a genome-wide assessment, a large sample size is required to obtain sufficient statistical power to detect all possible associations. Some SNPs have relatively small effects on diseases or clinical phenotypes or traits. To estimate study power, the genetic model for the disease needs to be considered, such as dominant, recessive, or additive effects. Due to genetic heterogeneity, GWAS analysis must be adjusted for race.
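The genetic models mentioned above are often expressed as different numeric codings of the same genotype before it enters a statistical test. The following is a minimal sketch, assuming a biallelic SNP whose genotype is stored as the number of copies (0, 1 or 2) of the allele of interest; the function name and model labels are illustrative.

```python
def encode_genotype(allele_copies: int, model: str) -> int:
    """Encode a genotype (0, 1 or 2 copies of an allele) under a genetic model."""
    if allele_copies not in (0, 1, 2):
        raise ValueError("allele_copies must be 0, 1 or 2")
    if model == "additive":
        # Each extra copy contributes the same increment of effect.
        return allele_copies
    if model == "dominant":
        # One copy is enough to show the full effect.
        return int(allele_copies >= 1)
    if model == "recessive":
        # Two copies are needed to show the effect.
        return int(allele_copies == 2)
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for model in ("additive", "dominant", "recessive"):
        print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
    # additive [0, 1, 2]
    # dominant [0, 1, 1]
    # recessive [0, 0, 1]
```

Because heterozygotes are grouped differently under each coding, the assumed model changes how many individuals count as carrying the effect, which is one reason it matters for power estimates.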
Candidate gene association study
Candidate gene association studies were commonly used in genetic research before the invention of high-throughput genotyping and sequencing technologies.[36] A candidate gene association study investigates a limited number of pre-specified SNPs for association with diseases or clinical phenotypes or traits, so it is a hypothesis-driven approach. Since only a limited number of SNPs are tested, a relatively small sample size is sufficient to detect the association. The candidate gene association approach is also commonly used to confirm findings from GWAS in independent samples.
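One way to see why testing fewer SNPs eases the sample-size requirement is through multiple-testing correction. The sketch below uses a Bonferroni correction, which is one common convention and is an assumption here rather than something the studies above prescribe; the SNP counts are hypothetical round numbers.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical candidate gene study testing 10 pre-specified SNPs.
    print(bonferroni_threshold(0.05, 10))
    # Hypothetical genome-wide scan treated as one million independent tests.
    print(bonferroni_threshold(0.05, 1_000_000))
```

The genome-wide threshold is five orders of magnitude stricter in this example, so an effect of the same size needs a much larger sample to reach it.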
Homozygosity mapping in disease
Genome-wide SNP data can be used for homozygosity mapping.[37] Homozygosity mapping is a method used to identify homozygous autosomal recessive loci, which can be a powerful tool to map genomic regions or genes that are involved in disease pathogenesis.
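The core idea can be sketched as a scan for long stretches of consecutive homozygous genotypes along a chromosome. The example below is a simplified illustration, assuming genotypes are ordered by position and coded as 0, 1 or 2 copies of the alternate allele; the function name and the `min_snps` parameter are hypothetical, and practical tools also account for physical distance, missing calls and genotyping error.

```python
def runs_of_homozygosity(genotypes, min_snps=3):
    """Return (start, end) index pairs of homozygous runs of at least min_snps.

    Genotypes 0 and 2 are homozygous; 1 is heterozygous and ends a run.
    Indices are inclusive positions in the input sequence.
    """
    runs = []
    start = None
    for i, genotype in enumerate(genotypes):
        if genotype in (0, 2):
            if start is None:
                start = i              # a new homozygous stretch begins
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None               # a heterozygous call breaks the stretch
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


if __name__ == "__main__":
    calls = [0, 2, 2, 1, 0, 0, 0, 0, 1, 2]
    print(runs_of_homozygosity(calls))  # [(0, 2), (4, 7)]
```

Regions where affected individuals share such runs are candidates for harbouring a recessive disease locus.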
Methylation patterns
Recently, preliminary results reported SNPs as important components of the epigenetic program in organisms.[38][39] Moreover, cosmopolitan studies in European and South Asian populations have revealed the influence of SNPs on the methylation of specific CpG sites.[40] In addition, meQTL enrichment analysis using GWAS databases demonstrated that those associations are important for the prediction of biological traits.[40][41][42]
Forensic sciences
SNPs have historically been used to match a forensic DNA sample to a suspect but has been made obsolete due to advancing
The disadvantages of SNPs compared with STRs are that SNPs yield less information than STRs, so more SNPs must be analyzed before a profile of a suspect can be created. Additionally, SNPs rely heavily on the presence of a database for comparative analysis of samples. However, for degraded or small-volume samples, SNP techniques are an excellent alternative to STR methods. SNPs (as opposed to STRs) offer an abundance of potential markers, can be fully automated, and may allow the required fragment length to be reduced to less than 100 bp.[26]
Pharmacogenetics
Pharmacogenetics focuses on identifying genetic variations, including SNPs, that are associated with differential responses to treatment.[43] Many drug-metabolizing enzymes, drug targets, or target pathways can be influenced by SNPs. SNPs involved in drug-metabolizing enzyme activity can change drug pharmacokinetics, while SNPs involved in a drug target or its pathway can change drug pharmacodynamics. Therefore, SNPs are potential genetic markers that can be used to predict drug exposure or the effectiveness of treatment. The genome-wide study of pharmacogenetics is called pharmacogenomics. Pharmacogenetics and pharmacogenomics are important in the development of precision medicine, especially for life-threatening diseases such as cancers.
Disease
Only a small number of SNPs in the human genome may have an impact on human diseases. Large-scale GWAS have been done for the most important human diseases, including heart diseases, metabolic diseases, autoimmune diseases, and neurodegenerative and psychiatric disorders.[35] Most of the SNPs with relatively large effects on these diseases have been identified. These findings have significantly improved understanding of disease pathogenesis and molecular pathways, and facilitated the development of better treatments. Further GWAS with larger sample sizes will reveal the SNPs with relatively small effects on diseases. For common and complex diseases, such as type 2 diabetes, rheumatoid arthritis, and Alzheimer's disease, multiple genetic factors are involved in disease etiology. In addition, gene-gene interaction and gene-environment interaction also play an important role in disease initiation and progression.[44]
Examples
- Serotonin 5-HT2A receptor gene on human chromosome 13.[45]
- The SNP −3279C/A (rs3761548), one of the SNPs located in the promoter region of the Foxp3 gene, might be involved in cancer progression.[46]
- A SNP in the F5 gene causes Factor V Leiden thrombophilia.[47]
- rs3091244 is an example of a triallelic SNP in the CRP gene on human chromosome 1.[48]
- TAS2R38 codes for PTC tasting ability, and contains 6 annotated SNPs.[49]
- rs148649884 and rs138055828 in the FCN1 gene encoding M-ficolin crippled the ligand-binding capability of the recombinant M-ficolin.[50]
- An intronic SNP in DNA mismatch repair gene PMS2 (rs1059060, Ser775Asn) is associated with increased sperm DNA damage and risk of male infertility.[51]
Databases
As there are for genes, bioinformatics databases exist for SNPs.
- dbSNP is a SNP database from the National Center for Biotechnology Information (NCBI). As of June 8, 2015, dbSNP listed 149,735,377 SNPs in humans.[52][53]
- Kaviar[54] is a compendium of SNPs from multiple data sources including dbSNP.
- SNPedia is a wiki-style database supporting personal genome annotation, interpretation and analysis.
- The OMIM database describes the association between polymorphisms and diseases (e.g., gives diseases in text form)
- dbSAP – single amino-acid polymorphism database for protein variation detection[55]
- The Human Gene Mutation Database provides gene mutations causing or associated with human inherited diseases and functional SNPs
- The International HapMap Project, where researchers are identifying Tag SNPs to be able to determine the collection of haplotypes present in each subject.
- genome-wide association studies.
The International SNP Map working group mapped the sequence flanking each SNP by alignment to the genomic sequence of large-insert clones in GenBank. These alignments were converted to chromosomal coordinates, which are shown in Table 1.[56] This list has greatly increased since, with, for instance, the Kaviar database now listing 162 million single nucleotide variants (SNVs).
Chromosome | Length (bp) | All SNPs: total | All SNPs: kb per SNP | TSC SNPs: total | TSC SNPs: kb per SNP |
---|---|---|---|---|---|
1 | 214,066,000 | 129,931 | 1.65 | 75,166 | 2.85 |
2 | 222,889,000 | 103,664 | 2.15 | 76,985 | 2.90 |
3 | 186,938,000 | 93,140 | 2.01 | 63,669 | 2.94 |
4 | 169,035,000 | 84,426 | 2.00 | 65,719 | 2.57 |
5 | 170,954,000 | 117,882 | 1.45 | 63,545 | 2.69 |
6 | 165,022,000 | 96,317 | 1.71 | 53,797 | 3.07 |
7 | 149,414,000 | 71,752 | 2.08 | 42,327 | 3.53 |
8 | 125,148,000 | 57,834 | 2.16 | 42,653 | 2.93 |
9 | 107,440,000 | 62,013 | 1.73 | 43,020 | 2.50 |
10 | 127,894,000 | 61,298 | 2.09 | 42,466 | 3.01 |
11 | 129,193,000 | 84,663 | 1.53 | 47,621 | 2.71 |
12 | 125,198,000 | 59,245 | 2.11 | 38,136 | 3.28 |
13 | 93,711,000 | 53,093 | 1.77 | 35,745 | 2.62 |
14 | 89,344,000 | 44,112 | 2.03 | 29,746 | 3.00 |
15 | 73,467,000 | 37,814 | 1.94 | 26,524 | 2.77 |
16 | 74,037,000 | 38,735 | 1.91 | 23,328 | 3.17 |
17 | 73,367,000 | 34,621 | 2.12 | 19,396 | 3.78 |
18 | 73,078,000 | 45,135 | 1.62 | 27,028 | 2.70 |
19 | 56,044,000 | 25,676 | 2.18 | 11,185 | 5.01 |
20 | 63,317,000 | 29,478 | 2.15 | 17,051 | 3.71 |
21 | 33,824,000 | 20,916 | 1.62 | 9,103 | 3.72 |
22 | 33,786,000 | 28,410 | 1.19 | 11,056 | 3.06 |
X | 131,245,000 | 34,842 | 3.77 | 20,400 | 6.43 |
Y | 21,753,000 | 4,193 | 5.19 | 1,784 | 12.19 |
RefSeq | 15,696,674 | 14,534 | 1.08 | | |
Totals | 2,710,164,000 | 1,419,190 | 1.91 | 887,450 | 3.05 |
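The "kb per SNP" columns in the table follow from dividing each chromosome length by its SNP count. The short check below, a sketch using three rows copied from the table and a hypothetical helper name, reproduces the published values to two decimal places.

```python
def kb_per_snp(length_bp: int, snp_count: int) -> float:
    """Average spacing between SNPs in kilobases (1 kb = 1,000 bp)."""
    if snp_count <= 0:
        raise ValueError("snp_count must be positive")
    return length_bp / snp_count / 1000


if __name__ == "__main__":
    # (label, length in bp, total SNPs) taken from the "All SNPs" columns.
    rows = [
        ("1", 214_066_000, 129_931),
        ("Y", 21_753_000, 4_193),
        ("Totals", 2_710_164_000, 1_419_190),
    ]
    for label, length_bp, snp_count in rows:
        print(label, round(kb_per_snp(length_bp, snp_count), 2))
    # 1 1.65
    # Y 5.19
    # Totals 1.91
```

A lower value means denser SNP coverage, so chromosome 22 (1.19 kb per SNP) is the most densely covered chromosome in this table and the Y chromosome the most sparsely covered.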
Nomenclature
The nomenclature for SNPs includes several variations for an individual SNP and lacks a common consensus.
The rs### standard is that which has been adopted by dbSNP and uses the prefix "rs", for "reference SNP", followed by a unique and arbitrary number.[57] SNPs are frequently referred to by their dbSNP rs number, as in the examples above.
The Human Genome Variation Society (HGVS) uses a standard which conveys more information about the SNP. Examples are:
- c.76A>T: "c." for coding region, followed by a number for the position of the nucleotide, followed by a one-letter abbreviation for the nucleotide (A, C, G, T or U), followed by a greater than sign (">") to indicate substitution, followed by the abbreviation of the nucleotide which replaces the former[58][59][60]
- p.Ser123Arg: "p." for protein, followed by a three-letter abbreviation for the amino acid, followed by a number for the position of the amino acid, followed by the abbreviation of the amino acid which replaces the former.[61]
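The two HGVS-style examples above have a regular structure that can be parsed with regular expressions. The sketch below handles only these simple substitution forms; the function name and the returned dictionary layout are illustrative assumptions, and the full HGVS nomenclature covers many more cases (for example intronic offsets, deletions and stop codons) that this does not attempt.

```python
import re

# "c." + position + reference nucleotide + ">" + replacement nucleotide
CODING_RE = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
# "p." + three-letter amino acid + position + three-letter amino acid
PROTEIN_RE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_substitution(description: str) -> dict:
    """Split a simple c. or p. substitution into its components."""
    match = CODING_RE.match(description)
    if match:
        position, ref, alt = match.groups()
        return {"level": "coding", "position": int(position), "ref": ref, "alt": alt}
    match = PROTEIN_RE.match(description)
    if match:
        ref, position, alt = match.groups()
        return {"level": "protein", "position": int(position), "ref": ref, "alt": alt}
    raise ValueError(f"unsupported description: {description}")


if __name__ == "__main__":
    print(parse_substitution("c.76A>T"))
    print(parse_substitution("p.Ser123Arg"))
```

Note that the position precedes the nucleotides in the coding form but sits between the two amino acids in the protein form, which is why the two patterns capture their groups in a different order.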
SNP analysis
SNPs can be easily assayed because they typically have only two possible alleles and three possible genotypes involving the two alleles (homozygous A, homozygous B and heterozygous AB), leading to many possible techniques for analysis. Some include: DNA sequencing; capillary electrophoresis; mass spectrometry; single-strand conformation polymorphism (SSCP); single base extension; electrochemical analysis; denaturing HPLC and gel electrophoresis; restriction fragment length polymorphism; and hybridization analysis.
Programs for prediction of SNP effects
An important group of SNPs are those that corresponds to
- SIFT: This program provides insight into how a laboratory-induced missense or nonsynonymous mutation will affect protein function, based on physical properties of the amino acid and sequence homology.
- LIST[63][64] (Local Identity and Shared Taxa) estimates the potential deleteriousness of mutations resulting from alterations to protein function. It is based on the assumption that variations observed in closely related species are more significant when assessing conservation than those in distantly related species.
- SNAP2
- SuSPect
- PolyPhen-2
- PredictSNP
- MutationTaster
- Variant Effect Predictor from the Ensembl project
- SNPViz:[65] This program provides a 3D representation of the affected protein, highlighting the amino acid change so that doctors can determine the pathogenicity of the mutant protein.
- PROVEAN
- PhyreRisk is a database which maps variants to experimental and predicted protein structures.[66]
- Missense3D is a tool which provides a stereochemical report on the effect of missense variants on protein structure.[67]
See also
- Affymetrix
- HapMap
- Illumina
- International HapMap Project
- Short tandem repeat (STR)
- Single-base extension
- SNP array
- SNP genotyping
- SNPedia
- Snpstr
- SNV calling from NGS data
- Suspension array technology
- Tag SNP
- TaqMan
- Variome
References
- ISBN 978-0-12-383834-6, retrieved 2023-05-02
- S2CID 82415195
- PMID 28696921.
- PMID 24702844.
- PMID 33679311.
- "Definition of single nucleotide variant - NCI Dictionary of Genetics Terms". www.cancer.gov. 2012-07-20. Retrieved 2023-05-02.
- PMID 20130035.
- PMID 25234433.
- S2CID 14433306.
- PMID 26173390.
- Li, Heng (March 15, 2021). "SNP vs SNV". Heng Li's blog. Retrieved May 3, 2023.
- PMID 24688635.
- PMID 30991970.
- PMID 25276428.
- PMID 26531896.
- S2CID 15146955.
- PMID 22549407.
- PMID 22137130.
- "What are single nucleotide polymorphisms (SNPs)?: MedlinePlus Genetics". medlineplus.gov. Retrieved 2023-03-22.
- PMID 26432245.
- S2CID 205357396.
- PMID 11525814.
- PMID 20026267.
- PMID 32193295.
- PMID 26207627.
- PMID 30061425.
- PMID 21139633.
- PMID 24139972.
- PMID 31236247.
- PMID 28386419.
- PMID 15078859.
- Gupta PK, Roy JK, Prasad M (25 February 2001). "Single nucleotide polymorphisms: a new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants". Current Science. 80 (4): 524–535. Archived from the original on 13 February 2017.
- PMID 30690464.
- Mary Ann Liebert, Inc. Archived from the original on 26 December 2010. Retrieved 2008-07-06.
(subtitle) Medical applications are where the market's growth is expected
- PMID 28686856.
- PMID 18505952.
- S2CID 10789932.
- S2CID 221886624.
- PMID 34041244.
- S2CID 256821844.
- PMID 34475392.
- PMID 27886132.
- PMID 29040422.
- PMID 31410638.
- PMID 16814396.
- S2CID 226218796.
- PMID 21116184.
- PMID 17161935.
- PMID 15466815.
- PMID 23209787.
- PMID 22594646.
- National Center for Biotechnology Information, United States National Library of Medicine. 2014. NCBI dbSNP build 142 for human. "[DBSNP-announce] DBSNP Human Build 142 (GRCh38 and GRCh37.p13)". Archived from the original on 2017-09-10. Retrieved 2017-09-11.
- National Center for Biotechnology Information, United States National Library of Medicine. 2015. NCBI dbSNP build 144 for human. Summary Page. "DBSNP Summary". Archived from the original on 2017-09-10. Retrieved 2017-09-11.
- PMID 21965822.
- PMID 27903894.
- PMID 11237013.
- "Clustered RefSNPs (rs) and Other Data Computed in House". SNP FAQ Archive. Bethesda (MD): U.S. National Center for Biotechnology Information. 2005.
- J.T. Den Dunnen (2008-02-20). "Recommendations for the description of sequence variants". Human Genome Variation Society. Archived from the original on 2008-09-14. Retrieved 2008-09-05.
- PMID 10612815.
- PMID 17251329.
- "Sequence Variant Nomenclature". varnomen.hgvs.org. Retrieved 2019-12-02.
- PMID 20031630.
- PMID 30952844.
- PMID 32352516.
- doi:10.18547/gcb.2018.vol4.iss1.e100048. Archived from the original on 2020-08-07. Retrieved 2018-10-20.
- PMID 31075275.
- PMID 30995449.
Further reading
- "Glossary". Nature Reviews.
- Human Genome Project Information — SNP Fact Sheet
External links
- NCBI resources – Introduction to SNPs from NCBI
- The SNP Consortium LTD – SNP search
- NCBI dbSNP database – "a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms"
- HGMD – the Human Gene Mutation Database, includes rare mutations and functional SNPs
- GWAS Central – a central database of summary-level genetic association findings
- 1000 Genomes Project – A Deep Catalog of Human Genetic Variation
- WatCut – an online tool for the design of SNP-RFLP assays
- SNPStats – a web tool for analysis of genetic association studies
- Restriction HomePage – a set of tools for DNA restriction and SNP detection, including design of mutagenic primers
- American Association for Cancer Research Cancer Concepts Factsheet on SNPs
- PharmGKB – The Pharmacogenetics and Pharmacogenomics Knowledge Base, a resource for SNPs associated with drug response and disease outcomes.
- GEN-SNiP – Online tool that identifies polymorphisms in test DNA sequences.
- Rules for Nomenclature of Genes, Genetic Markers, Alleles, and Mutations in Mouse and Rat
- HGNC Guidelines for Human Gene Nomenclature
- SNP effect predictor with galaxy integration
- Open SNP – a portal for sharing own SNP test results
- dbSAP – SNP database for protein variation detection