Pan-genome
In the fields of
The genetic repertoire of a bacterial species is much larger than the gene content of an individual strain.[7] Some species have open (or extensive) pangenomes, while others have closed pangenomes.[2] For species with a closed pan-genome, very few genes are added per sequenced genome (after sequencing many strains), and the size of the full pangenome can be theoretically predicted. Species with an open pangenome have enough genes added per additional sequenced genome that predicting the size of the full pangenome is impossible.[4] Population size and niche versatility have been suggested as the most influential factors in determining pan-genome size.[2]
Pangenomes were originally constructed for species of
Etymology
The term 'pangenome' was defined with its current meaning by Tettelin et al. in 2005;[2] it derives 'pan' from the Greek word παν, meaning 'whole' or 'everything', while 'genome' is a commonly used term for an organism's complete genetic material. Tettelin et al. applied the term specifically to bacteria, whose pangenome "includes a core genome containing genes present in all strains and a dispensable genome composed of genes absent from one or more strains and genes that are unique to each strain."[2]
Parts of the pangenome
Core
The core is the part of the pangenome that is shared by every genome in the tested set. Some authors divide the core pangenome into the hard core, those families of homologous genes that have at least one copy in every genome (100% of genomes), and the soft core or extended core,[15] those families distributed above a certain threshold (90%). In a study of the pangenomes of Bacillus cereus and Staphylococcus aureus, some of them isolated from the International Space Station, the thresholds used for segmenting the pangenomes were as follows: "Cloud," "Shell," and "Core" corresponding to gene families with presence in <10%, 10 to 95%, and >95% of the genomes, respectively.[16]
The size of the core genome and its proportion of the pangenome depend on several factors, especially on the phylogenetic similarity of the genomes considered. For example, the core of two identical genomes would also be the complete pangenome. The core of a genus will always be smaller than the core genome of a species. Genes that belong to the core genome are often related to
Shell
The shell is the part of the pangenome shared by the majority of the genomes in a pangenome.[18] There is no universally accepted threshold to define the shell genome; some authors consider a gene family part of the shell pangenome if it is shared by more than 50% of the genomes in the pangenome.[19] A family can become part of the shell through several evolutionary dynamics, for example by gene loss in a lineage where it was previously part of the core genome, as is the case for enzymes of the tryptophan operon in Actinomyces,[20] or by gain and fixation of a gene family that was previously part of the dispensable genome, as is the case for the trpF gene in several Corynebacterium species.[21]
Cloud
The cloud genome consists of those gene families shared by a minimal subset of the genomes in the pangenome;[22] it includes singletons, that is, genes present in only one of the genomes. It is also known as the peripheral genome, or accessory genome. Gene families in this category are often related to ecological adaptation.[citation needed]
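The core, shell and cloud partitions described above can be illustrated with a minimal Python sketch. It assumes the thresholds quoted from the Bacillus cereus and Staphylococcus aureus study (cloud <10%, shell 10 to 95%, core >95%); other authors use different cut-offs. The function name `partition_pangenome`, its parameters and the toy gene-family labels are hypothetical and do not come from any particular pangenome tool.

```python
from collections import Counter


def partition_pangenome(genomes, cloud_max=0.10, core_min=0.95):
    """Split gene families into core, shell and cloud by prevalence.

    genomes maps a genome name to the collection of gene families it carries.
    cloud_max and core_min are illustrative thresholds, not a standard.
    """
    n_genomes = len(genomes)
    if n_genomes == 0:
        raise ValueError("at least one genome is required")

    # Count in how many genomes each family occurs (presence, not copy number).
    presence = Counter(fam for fams in genomes.values() for fam in set(fams))

    parts = {"core": set(), "shell": set(), "cloud": set()}
    for family, count in presence.items():
        fraction = count / n_genomes
        if fraction > core_min:
            parts["core"].add(family)
        elif fraction < cloud_max:
            parts["cloud"].add(family)
        else:
            parts["shell"].add(family)
    return parts


# Toy data: 20 genomes; famA is in all, famB in half, famC in one.
toy = {f"g{i}": {"famA"} for i in range(20)}
for i in range(10):
    toy[f"g{i}"].add("famB")
toy["g0"].add("famC")

toy_parts = partition_pangenome(toy)
print(toy_parts)
```

With few genomes the cloud threshold can never be met (one genome out of four is already 25%), which is one reason such thresholds are only meaningful for reasonably large genome sets.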
Classification
The pan-genome can be somewhat arbitrarily classified as open or closed based on the α value of Heaps' law:[23][15]
$N = k\,n^{-\alpha}$
- $N$: Number of gene families.
- $n$: Number of genomes.
- $k$: Constant of proportionality.
- $\alpha$: Exponent calculated in order to adjust the curve of number of gene families vs new genome.
If $\alpha \leq 1$ then the pangenome is considered open. If $\alpha > 1$ then the pangenome is considered closed.
Usually, pangenome software can calculate the parameters of Heaps' law that best describe the behavior of the data.
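As a rough illustration of how such parameters might be estimated, the sketch below assumes the power-law form N = k·n^(−α), reading N as the number of new gene families found when the n-th genome is added, together with the convention that α ≤ 1 indicates an open pangenome and α > 1 a closed one. Both the reading of N and the fitting method are assumptions: the code uses ordinary least squares on log-transformed values, whereas real tools may fit the curve differently. The names `fit_heaps` and `classify_pangenome` are hypothetical.

```python
import math


def fit_heaps(genome_counts, new_family_counts):
    """Estimate k and alpha in N = k * n**(-alpha) by log-log least squares.

    Points with a zero count are skipped because their logarithm is undefined.
    """
    points = [
        (math.log(n), math.log(c))
        for n, c in zip(genome_counts, new_family_counts)
        if n > 0 and c > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points with positive counts")

    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("genome counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx          # log N = log k + slope * log n
    alpha = -slope             # slope equals -alpha under the assumed form
    k = math.exp(mean_y - slope * mean_x)
    return k, alpha


def classify_pangenome(alpha):
    """Apply the assumed convention: alpha <= 1 is open, alpha > 1 is closed."""
    return "open" if alpha <= 1 else "closed"


# Synthetic example: new families decay quickly, so the fit suggests "closed".
ns = list(range(2, 21))
k_est, alpha_est = fit_heaps(ns, [500 * n ** -1.5 for n in ns])
print(round(k_est, 3), round(alpha_est, 3), classify_pangenome(alpha_est))
```

Because the classification hinges on a single fitted exponent, it is sensitive to how many genomes are sampled and to the order in which they are added, which is consistent with the text's remark that the distinction is somewhat arbitrary.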
Open pangenome
An open pangenome occurs when the number of new gene families in a taxonomic lineage keeps increasing and this increment does not seem to be
Closed pangenome
A closed pangenome occurs in a lineage when only a few gene families are added when new
History
Pangenome
The original pangenome concept was developed by Tettelin et al.
The pangenome of a genomic lineage accounts for the intra-lineage variability in gene content. The pangenome evolves through gene duplication, gene gain and loss dynamics, and interaction of the genome with mobile elements, all of which are shaped by selection and drift.[26] Some studies suggest that prokaryote pangenomes are the result of adaptive, not neutral, evolution that gives species the ability to migrate to new niches.[27]
Supergenome
The supergenome can be thought of as the real pangenome size if all genomes from a species were sequenced.[28] It is defined as all genes accessible for being gained by a certain species. It cannot be calculated directly, but its size can be estimated from the pangenome size calculated from the available genome data. Estimating the size of the cloud genome can be difficult because of its dependence on the occurrence of rare genes and genomes. In 2011 genomic fluidity was proposed as a measure to categorize the gene-level similarity among groups of sequenced isolates.[29] In some lineages the supergenome appears infinite,[30] as is the case for the Bacteria domain.[31]
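The text does not give a formula for genomic fluidity. The sketch below assumes the pairwise definition usually associated with that proposal: for every pair of genomes, the number of gene families found in only one of the two is divided by the total number of families in both, and the ratios are averaged over all pairs. Treat this definition, the function name `genomic_fluidity` and the toy data as assumptions for illustration only.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Average, over genome pairs, of unique families / total families.

    genomes maps a genome name to the collection of gene families it carries.
    Returns 0.0 for identical gene content and 1.0 for fully disjoint content.
    """
    family_sets = [set(fams) for fams in genomes.values()]
    if len(family_sets) < 2:
        raise ValueError("at least two genomes are required")

    ratios = []
    for a, b in combinations(family_sets, 2):
        total = len(a) + len(b)        # families in both genomes, counted per genome
        if total == 0:
            raise ValueError("a pair of empty genomes has no defined ratio")
        unique = len(a ^ b)            # families present in exactly one of the pair
        ratios.append(unique / total)
    return sum(ratios) / len(ratios)


toy_isolates = {
    "isolate1": {"famA", "famB", "famC"},
    "isolate2": {"famA", "famB", "famD"},
}
print(genomic_fluidity(toy_isolates))
```

Under this definition the measure depends only on pairwise comparisons, so it can be computed from a modest number of isolates without first estimating the size of the cloud genome.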
Metapangenome
'Metapangenome' has been defined as the outcome of the analysis of pangenomes in conjunction with the environment where the abundance and prevalence of gene clusters and genomes are recovered through shotgun metagenomes.[32] The combination of metagenomes with pangenomes, also referred to as "metapangenomics", reveals the population-level results of habitat-specific filtering of the pangenomic gene pool.[33]
Other authors consider that Metapangenomics expands the concept of pangenome by incorporating
Examples
Prokaryote pangenome
In 2018, 87% of the available whole genome sequences were bacterial, fueling researchers' interest in calculating prokaryote pangenomes at different taxonomic levels.[22] In 2015, a pangenome of 44 strains of Streptococcus pneumoniae showed few new genes discovered with each new genome sequenced (see figure). In fact, the predicted number of new genes dropped to zero when the number of genomes exceeded 50 (note, however, that this pattern is not found in all species). This would mean that S. pneumoniae has a 'closed pangenome'.[37] The main source of new genes in S. pneumoniae was Streptococcus mitis, from which genes were transferred horizontally. The pan-genome size of S. pneumoniae increased logarithmically with the number of strains and linearly with the number of polymorphic sites of the sampled genomes, suggesting that acquired genes accumulate proportionately to the age of clones.[36] Another example of a prokaryote pan-genome is Prochlorococcus, whose core genome set is much smaller than the pangenome, which is used by different ecotypes of Prochlorococcus.[38] An open pan-genome has been observed in environmental isolates such as Alcaligenes sp.[39] and Serratia sp.,[40] showing a sympatric lifestyle. Nevertheless, an open pangenome is not exclusive to free-living microorganisms: a 2015 study on Prevotella bacteria isolated from humans compared the gene repertoires of its species derived from different body sites and also reported an open pan-genome with a vast diversity in its gene pool.[41]
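The idea of tracking how many new genes each additional genome contributes can be sketched as a simple accumulation curve. The function below walks through genomes in a fixed order and records the running pangenome size, the running core size and the number of newly seen gene families. It is a toy illustration with made-up family labels, not the procedure used in the studies cited above; in practice such curves are usually averaged over many random orderings of the genomes.

```python
def accumulation_curve(genome_family_sets):
    """Return (pan size, core size, new families) after each genome is added.

    genome_family_sets is an ordered sequence of gene-family collections,
    one per genome, in the order the genomes are added.
    """
    pan = set()
    core = None
    rows = []
    for families in genome_family_sets:
        families = set(families)
        new_count = len(families - pan)   # families never seen before
        pan |= families                   # the pangenome can only grow
        core = families if core is None else core & families  # the core can only shrink
        rows.append((len(pan), len(core), new_count))
    return rows


toy_order = [
    {"famA", "famB", "famC"},
    {"famA", "famB", "famD"},
    {"famA", "famB"},
]
curve = accumulation_curve(toy_order)
for n, (pan_size, core_size, new_count) in enumerate(curve, start=1):
    print(n, pan_size, core_size, new_count)
```

A new-family count that falls toward zero as genomes are added is the pattern described for a closed pangenome, while a count that stays well above zero is the pattern described for an open one.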
Archaea also have some pangenome studies.
Eukaryote pangenome
In animals, the human pangenome is being studied. In 2010 a study estimated that a complete human pan-genome would contain ~19–40 Megabases of novel sequence not present in the extant reference human genome.[44] The Human Pangenome consortium aims to account for the diversity of the human genome. In 2023, a draft human pangenome reference was published.[45] It is based on 47 genomes from persons of varied ethnicity.[45] Plans are underway for an improved reference capturing still more biodiversity from a still wider sample.[45]
Among plants, there are examples of pangenome studies in model species, both diploid[9] and polyploid,[10] and a growing list of crops.[46][47] Pangenomes have shown promise as a tool in plant breeding by accounting for structural variants and SNPs in non-reference genomes, which helps to solve the problem of missing heritability that persists in genome-wide association studies.[48] An emerging plant-based concept is that of pan-NLRome, which is the repertoire of nucleotide-binding leucine-rich repeat (NLR) proteins, intracellular immune receptors that recognize pathogen proteins and confer disease resistance.[49]
Virus pangenome
Viruses do not necessarily have genes that are extensively shared across clades, as is the case for 16S in bacteria, and therefore the core genome of the full virus domain is empty. Nevertheless, several studies have calculated the pangenome of some viral lineages. The core genome from six species of pandoraviruses comprises 352 gene families, only 4.7% of the pangenome, resulting in an open pangenome.[50]
Data structures
The number of sequenced genomes is continuously growing, and "simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets".[51] Pangenome graphs are emerging data structures designed to represent pangenomes and to efficiently map reads to them. They have been reviewed by Eizenga et al.[52]
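To give a feel for what a pangenome graph stores, the following is a minimal sketch of a sequence graph in which nodes hold DNA segments and each genome is recorded as a path through the nodes. It is an illustrative toy under simplifying assumptions (no strand orientation, no read mapping, no indexing) and does not reproduce the data model of any specific tool; the class `SequenceGraph` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy pangenome graph: segments are nodes, genomes are paths over nodes."""

    segments: dict = field(default_factory=dict)  # node id -> DNA string
    paths: dict = field(default_factory=dict)     # genome name -> list of node ids

    def add_segment(self, node_id, sequence):
        self.segments[node_id] = sequence

    def add_path(self, genome, node_ids):
        missing = [n for n in node_ids if n not in self.segments]
        if missing:
            raise KeyError(f"unknown segments: {missing}")
        self.paths[genome] = list(node_ids)

    def spell(self, genome):
        """Reconstruct a genome's sequence by concatenating its path."""
        return "".join(self.segments[n] for n in self.paths[genome])

    def edges(self):
        """Edges are the consecutive node pairs used by at least one path."""
        return {(a, b) for p in self.paths.values() for a, b in zip(p, p[1:])}

    def shared_nodes(self):
        """Nodes visited by every genome, a graph analogue of the core."""
        node_sets = [set(p) for p in self.paths.values()]
        return set.intersection(*node_sets) if node_sets else set()


# Two genomes that differ by a single base form a small "bubble" in the graph.
graph = SequenceGraph()
for node_id, seq in [(1, "ACG"), (2, "T"), (3, "G"), (4, "TTA")]:
    graph.add_segment(node_id, seq)
graph.add_path("genome_ref", [1, 2, 4])
graph.add_path("genome_alt", [1, 3, 4])

print(graph.spell("genome_ref"), graph.spell("genome_alt"))
print(sorted(graph.edges()), sorted(graph.shared_nodes()))
```

Storing shared segments once and representing variation as alternative routes is what lets such a structure grow slowly as more similar genomes are added, in contrast to keeping every genome as a separate linear sequence.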
Software tools
As interest in pangenomes has increased, several software tools have been developed to help analyze this kind of data.
To start a pangenomic analysis the first step is the homogenization of genome annotation.
The two most cited software tools for pangenomic analysis at the end of 2014[55] were Panseq[56] and the pan-genomes analysis pipeline (PGAP).[57] Other options include BPGA – A Pan-Genome Analysis Pipeline for prokaryotic genomes,[58] GET_HOMOLOGUES,[59] Roary[60] and PanDelos.[61] In 2015 a review focused on prokaryote pangenomes[62] and another on plant pan-genomes were published.[63] Among the first software packages designed for plant pangenomes were PanTools[64] and GET_HOMOLOGUES-EST.[11][59] In 2018 panX was released, an interactive web tool that allows inspection of the evolutionary history of gene families.[65] panX can display an alignment of genomes, a phylogenetic tree, mapping of mutations and inference about gain and loss of the family on the core-genome phylogeny. In 2019 OrthoVenn 2.0[66] allowed comparative visualization of families of homologous genes in Venn diagrams for up to 12 genomes. In 2023, BRIDGEcereal was developed to survey and graph indel-based haplotypes from a pan-genome through a gene model ID.[67]
In 2020 Anvi'o[1] was available as a multiomics platform that contains pangenomic and metapangenomic analyses as well as visualization workflows. In Anvi'o, genomes are displayed in concentric circles and each radius represents a gene family, allowing for comparison of more than 100 genomes in its interactive visualization. In 2020, a computational comparison of tools for extracting gene-based pangenomic contents (such as GET_HOMOLOGUES, PanDelos, Roary, and others) was released.[68] Tools were compared from a methodological perspective, analyzing the causes that lead a given methodology to outperform other tools. The analysis took into account different bacterial populations, which were synthetically generated by changing evolutionary parameters. The results show that the performance of each tool depends on the composition of the input genomes. Also in 2020, several tools introduced a graphical representation of pangenomes showing the contiguity of genes (PPanGGOLiN,[46] Panaroo[65]).
See also
- Metagenomics
- Pathogenomics
- Quasispecies
- Human Pangenome Reference
References
- PMID 33349678.
- PMID 16172379.
- PMID 16185861.
- PMID 25483351.
- PMID 24548794.
- PMID 23241446.
- PMID 20890839.
- PMID 17300983.
- PMID 29259172.
- PMID 32728126.
- PMID 28261241.
- PMID 19435847.
- PMID 19015323.
- S2CID 217167361.
- PMID 22174796.
- PMID 30637341.
- PMID 26754847.
- PMID 21304685.
- PMID 30946645.
- PMID 28362260.
- PMID 23800623.
- S2CID 219011507.
- PMID 32843837.
- PMID 25722247.
- PMID 30126366.
- S2CID 204823648.
- S2CID 19612970.
- PMID 25894542.
- PMID 21232151.
- PMID 25141959.
- PMID 19168257.
- PMID 29423345.
- PMID 33323129.
- S2CID 219067583.
- PMID 33841754.
- PMID 21034474.
- PMID 26442149.
- PMID 18159947.
- PMID 29483539.
- arXiv:1610.04160 [q-bio.GN].
- PMID 25887946.
- PMID 33273480.
- PMID 30714895.
- S2CID 205274447.
- PMID 37165242.
- S2CID 152283283.
- PMID 33239781.
- PMID 35676474.
- PMID 31442410.
- PMID 30042742.
- PMID 27769991.
- PMID 32453966.
- PMID 11410670.
- PMID 18261238.
- PMID 25721608.
- PMID 20843356.
- PMID 22130594.
- PMID 27071527.
- PMID 24096415.
- PMID 26198102.
- PMID 30497358.
- PMID 27006628.
- PMID 26593040.
- PMID 27587666.
- PMID 29077859.
- PMID 31053848.
- PMID 37202927.
- PMID 32893299.