Comparative genomics

taxa. Each chromosome has been laid out horizontally and homologous blocks in each genome are shown as identically colored regions linked across genomes. Regions that are inverted relative to Y. pestis KIM are shifted below a genome's center axis.^[1]

Comparative genomics is a field of

orthologous sequences (sequences that share a common ancestry) in the aligned genomes and checking to what extent those sequences are conserved. Based on these, genome and molecular evolution are inferred and this may in turn be put in the context of, for example, phenotypic evolution or population genetics.^[7]

Virtually started as soon as the whole genomes of two organisms became available (that is, the genomes of the bacteria

chimpanzees, and, more surprisingly, similarity between seemingly distantly related organisms, such as humans and the yeast Saccharomyces cerevisiae.^[4]

It has also shown the extreme diversity of gene composition in different evolutionary lineages.^[8]

History

See also: History of genomics

Comparative genomics has a root in the comparison of
Epstein-Barr virus that contained more than 100 genes each.^[11]

The first complete genome sequence of a cellular organism, that of Haemophilus influenzae Rd, was published in 1995.^[12] The second genome sequencing paper was of the small parasitic bacterium Mycoplasma genitalium published in the same year.^[13] Starting from this paper, reports on new genomes inevitably became comparative-genomic studies.^[8]
Microbial genomes. The first high-resolution whole-genome comparison system for microbial genomes of 10-15 kbp was developed in 1998 by Art Delcher, Simon Kasif and Steven Salzberg and applied, with their collaborators at the Institute for Genomic Research (TIGR), to the comparison of entire, closely related microbial genomes. The system, called MUMmer, was described in a 1999 publication in Nucleic Acids Research. It helps researchers identify large rearrangements, single-base mutations, reversals, tandem repeat expansions and other polymorphisms. In bacteria, MUMmer enables the identification of polymorphisms that are responsible for virulence, pathogenicity, and antibiotic resistance. The system was also applied to the Minimal Organism Project at TIGR and subsequently to many other comparative genomics projects.
Eukaryote genomes.
eukaryotes D. melanogaster, C. elegans, and S. cerevisiae, as well as the prokaryote H. influenzae.^[17] At the same time, Bonnie Berger, Eric Lander, and their team published a paper on whole-genome comparison of human and mouse.^[18]

With the publication of the large genomes of vertebrates in the 2000s, including
Japanese pufferfish Takifugu rubripes, and mouse, precomputed results of large genome comparisons have been released for downloading or for visualization in a genome browser. Instead of undertaking their own analyses, most biologists can access these large cross-species comparisons and avoid the impracticality caused by the size of the genomes.^[19]

annotated reference genome, and thus provide a list of possible gene differences that may be the basis for any functional variation among strains.^[9]

Evolutionary principles

Main article: Evolution

Evolution is a defining characteristic of biology, and evolutionary theory is the theoretical foundation of comparative genomics; at the same time, the results of comparative genomics have greatly enriched and developed the theory of evolution. When two or more genome sequences are compared, one can deduce the evolutionary relationships of the sequences in a phylogenetic tree. Based on a variety of biological genome data and the study of vertical and horizontal evolution processes, one can understand vital parts of gene structure and its regulatory function.
Similarity of related genomes is the basis of comparative genomics. If two organisms have a recent common ancestor, the differences between their genomes have evolved from the ancestor's genome. The closer the relationship between two organisms, the greater the similarity between their genomes. Closely related genomes display conserved gene order (synteny), that is, some or all of the genetic sequences are conserved in the same linear arrangement. Thus, genome sequences can be used to identify gene function by analyzing their homology (sequence similarity) to genes of known function.

Human FOXP2 gene and its evolutionary conservation, shown in a multiple alignment (at bottom of figure) in this image from the UCSC Genome Browser. Note that conservation tends to cluster around coding regions (exons).

Orthologous sequences are related sequences in different species: a gene exists in an ancestral species, that species diverges into two species, and the resulting genes in the new species are orthologous to each other. Paralogous sequences arise by gene duplication: if a particular gene in the genome is copied, the two copies are paralogous to each other. A pair of orthologous sequences is called an orthologous pair (orthologs), and a pair of paralogous sequences is called a paralogous pair (paralogs). Orthologous pairs usually have the same or similar function, which is not necessarily the case for paralogous pairs, whose sequences tend to evolve different functions.
Comparative genomics exploits both similarities and differences in the
selection has acted upon these elements. Those elements that are responsible for similarities between different species should be conserved through time (stabilizing selection), while those elements responsible for differences among species should be divergent (positive selection). Finally, those elements that are unimportant to the evolutionary success of the organism will be unconserved (selection is neutral).
One of the important goals of the field is the identification of the mechanisms of eukaryotic genome evolution. This is, however, often complicated by the multiplicity of events that have taken place throughout the history of individual lineages, leaving only distorted and superimposed traces in the genome of each living organism. For this reason, comparative genomics studies of small model organisms (for example the model Caenorhabditis elegans and the closely related Caenorhabditis briggsae) are of great importance to advance our understanding of general mechanisms of evolution.^[20]^[21]

Methods

Computational approaches are necessary for genome comparisons, given the large amount of data encoded in genomes. Many tools are now publicly available, ranging from whole genome comparisons to gene expression analysis.^[22] These include approaches from systems and control, information theory, string analysis and data mining.^[23] Computational approaches will remain critical for research and teaching, especially when information science and genome biology are taught in conjunction.^[24]

Phylogenetic tree of descendant species and reconstructed ancestors. The branch color represents breakpoint rates in RACFs (breakpoints per million years). Black branches represent nondetermined breakpoint rates. Tip colors depict assembly contiguity: black, scaffold-level genome assembly; green, chromosome-level genome assembly; yellow, chromosome-scale scaffold-level genome assembly. Numbers next to species names indicate diploid chromosome number (if known).^[25]

Comparative genomics starts with basic comparisons of genome size and gene density. For instance, genome size is important for coding capacity and possibly for regulatory reasons. High gene density facilitates genome annotation and the analysis of environmental selection. By contrast, low gene density, as in the human genome, hampers the mapping of genetic disease.
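The comparison of gene density described above comes down to simple arithmetic. The following is a minimal sketch in Python; the function name `gene_density` and the gene counts and genome sizes are hypothetical round numbers chosen for illustration, not measurements of any real genome.

```python
def gene_density(n_genes, genome_size_bp):
    """Return the number of genes per megabase (Mb) of sequence."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_genes / (genome_size_bp / 1_000_000)


# Hypothetical values: a compact genome versus a large, gene-sparse genome.
compact = gene_density(n_genes=2_000, genome_size_bp=2_000_000)
sparse = gene_density(n_genes=20_000, genome_size_bp=3_000_000_000)
print(f"compact genome: {compact:.1f} genes/Mb")
print(f"sparse genome:  {sparse:.1f} genes/Mb")
```

Expressing density per megabase puts genomes of very different sizes on a common scale.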

Sequence alignment

Alignments are used to capture information about similar sequences such as ancestry, common evolutionary descent, or common structure and function. Alignments can be done for both nucleotide and protein sequences.^[26]^[27] Alignments consist of local or global pairwise alignments, and multiple sequence alignments. One way to find global alignments is to use a dynamic programming algorithm known as the Needleman–Wunsch algorithm. This algorithm can be modified and used to find local alignments.
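As an illustration of the dynamic programming idea, the following is a minimal Python sketch of Needleman–Wunsch global alignment. The scoring values (match +1, mismatch −1, gap −1) and the function name `needleman_wunsch` are assumptions made for this example; practical tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so time and memory grow with the product of the sequence lengths. This is one reason whole-genome comparison relies on faster heuristics rather than full dynamic programming.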

Example of a phylogenetic tree created from an alignment of 250 unique spike protein sequences from the Betacoronavirus family.

Phylogenetic reconstruction

Another computational method for comparative genomics is phylogenetic reconstruction. It is used to describe evolutionary relationships in terms of common ancestors. The relationships are usually represented in a tree called a phylogenetic tree. Similarly, coalescent theory is a retrospective model to trace alleles of a gene in a population to a single ancestral copy shared by members of the population. This is also known as the most recent common ancestor. Analysis based on coalescent theory tries to predict the amount of time between the introduction of a mutation and a particular allele or gene distribution in a population. This time period is equal to how long ago the most recent common ancestor existed. The inheritance relationships are visualized in a form similar to a phylogenetic tree. Coalescence (or the gene genealogy) can be visualized using dendrograms.^[28]
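The text above does not name a specific tree-building algorithm. As one simple illustration, the following Python sketch builds a tree from aligned sequences with UPGMA, an average-linkage clustering method that assumes a constant rate of evolution, using the uncorrected proportion of differing sites (p-distance). The function names and the toy sequences are hypothetical; maximum-likelihood or Bayesian methods are generally preferred for real analyses.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


def upgma(seqs):
    """Cluster {name: aligned sequence} into a tree of nested tuples."""
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> leaf count
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {
        (i, j): p_distance(seqs[names[i]], seqs[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        for k in list(trees):
            if k in (i, j):
                continue
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted average distance from the merged cluster to k.
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (
                sizes[i] + sizes[j]
            )
        del dist[(i, j)]
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


# Toy aligned sequences (hypothetical).
toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AATTTTTT", "D": "AATTTTGG"}
print(upgma(toy))
```

The nested tuples encode only the branching order; branch lengths are omitted to keep the sketch short.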

Example of synteny block and break. Genes located on chromosomes of two species are denoted in letters. Each gene is associated with a number representing the species they belong to (species 1 or 2). Orthologous genes are connected by dashed lines and genes without an orthologous relationship are treated as gaps in synteny programs.^[29]
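The idea in the caption above can be sketched in Python as follows: orthologous gene pairs act as anchors, genes without an ortholog count as gaps, and a block ends when the gene order stops being collinear. The function name `synteny_blocks`, the `max_gap` parameter and the gene names are hypothetical; this greedy pass is a simplification and is not the algorithm of any particular synteny program.

```python
def synteny_blocks(order1, order2, orthologs, max_gap=1):
    """Group orthologous gene pairs into collinear blocks.

    order1, order2: gene names in chromosomal order for species 1 and 2.
    orthologs: dict mapping a species-1 gene to its species-2 ortholog.
    max_gap: number of intervening genes tolerated between adjacent anchors.
    """
    pos2 = {gene: j for j, gene in enumerate(order2)}
    # Anchors are (index in species 1, index in species 2) for ortholog pairs.
    anchors = [
        (i, pos2[orthologs[g]])
        for i, g in enumerate(order1)
        if orthologs.get(g) in pos2
    ]
    blocks, current, direction = [], [], 0
    for i, j in anchors:
        if current:
            prev_i, prev_j = current[-1]
            di, dj = i - prev_i, j - prev_j
            step = 1 if dj > 0 else -1
            collinear = (
                dj != 0
                and di - 1 <= max_gap
                and abs(dj) - 1 <= max_gap
                and direction in (0, step)  # keep one orientation per block
            )
            if collinear:
                direction = step
            else:
                blocks.append(current)  # synteny break: start a new block
                current, direction = [], 0
        current.append((i, j))
    if current:
        blocks.append(current)
    return [[(order1[i], order2[j]) for i, j in block] for block in blocks]


# Toy example: D1 has no ortholog, and E1-G1 are inverted in species 2.
species1 = ["A1", "B1", "C1", "D1", "E1", "F1", "G1"]
species2 = ["A2", "B2", "C2", "X2", "G2", "F2", "E2"]
pairs = {"A1": "A2", "B1": "B2", "C1": "C2", "E1": "E2", "F1": "F2", "G1": "G2"}
for block in synteny_blocks(species1, species2, pairs):
    print(block)
```

In this toy input the first three genes form one block and the last three form a second, inverted block.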

Genome maps

An additional method in comparative genomics is
genetic diseases associated with DNA rearrangements.^[citation needed]

Solid green squares indicate mammalian chromosomes maintained as a single synteny block (either as a single chromosome or fused with another MAM), with shades of the color indicating the fraction of the chromosome affected by intra-chromosomal rearrangements (the lightest shade is most affected). Split blocks demarcate mammalian chromosomes affected by inter-chromosomal rearrangements. Upper (green) triangles show the fraction of the chromosome affected by intra-chromosomal rearrangements, and lower (red) triangles show the fraction affected by inter-chromosomal rearrangements. Syntenic relationships of each MAM to the human genome are given at the right of the diagram. MAMX appears split in goat because its X chromosome is assembled as two separate fragments. BOR, boreoeutherian ancestor chromosome; EUA, Euarchontoglires ancestor chromosome; EUC, Euarchonta ancestor chromosome; EUT, eutherian ancestor chromosome; PMT, Primatomorpha ancestor chromosome; PRT, primates (Hominidae) ancestor chromosome; THE, therian ancestor chromosome.
Image from the study Evolution of the ancestral mammalian karyotype and syntenic regions. It is a visualization of the evolutionary history of reconstructed mammalian chromosomes based on the human lineage.^[25]

Tools

Computational tools for analyzing sequences and complete genomes are developing quickly due to the availability of large amounts of genomic data. At the same time, comparative analysis tools continue to progress and improve. Among the challenges of these analyses, visualizing the comparative results is particularly important.^[31]
Visualizing sequence conservation is a difficult task in comparative sequence analysis, as it is highly inefficient to examine alignments of long genomic regions manually. Internet-based genome browsers provide many useful tools for investigating genomic sequences because they integrate all sequence-based biological information on genomic regions. They make extracting large amounts of relevant biological data easy and less time-consuming.^[31]

UCSC Browser: This site contains the reference sequence and working draft assemblies for a large collection of genomes.^[32]

Ensembl: The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.^[33]

MapView: The Map Viewer provides a wide variety of genome mapping and sequencing data.^[34]

VISTA: A comprehensive suite of programs and databases for comparative analysis of genomic sequences. It was built to visualize the results of comparative analysis based on DNA alignments. The presentation of comparative data generated by VISTA suits both small- and large-scale data.^[35]

BlueJay Genome Browser: a stand-alone visualization tool for the multi-scale viewing of annotated genomes and other genomic elements.^[36]

An advantage of using online tools is that these websites are being developed and updated constantly. Many new settings and content are available online to improve efficiency.^[31]

Selected applications

Agriculture

genome wide association study conducted on 517 rice landraces revealed 80 loci associated with several categories of agronomic performance, such as grain weight, amylose content, and drought tolerance. Many of the loci were previously uncharacterized.^[37] Not only is this methodology powerful, it is also quick. Previous methods of identifying loci associated with agronomic performance required several generations of carefully monitored breeding of parent strains, a time-consuming effort that is unnecessary for comparative genomic studies.^[38]

Medicine

Vaccine development

The medical field also benefits from the study of comparative genomics. In an approach known as
commensal and pathogenic strains of E. coli to identify pathogen-specific genes as a basis for finding antigens that result in immune response against pathogenic strains but not commensal ones.^[41] In May 2019, using the Global Genome Set, a team in the UK and Australia sequenced thousands of globally-collected isolates of Group A Streptococcus, providing potential targets for developing a vaccine against the pathogen, also known as S. pyogenes.^[42]

zinc-finger protein; OR, olfactory receptor genes; DAD1, defender against cell death; The sites of species-specific, processed pseudogenes are shown by gray triangles. See also GenBank accession numbers AE000658-62. Modified after Glusman et al. 2001.^[43]

Mouse models in immunology

animal models. Comparative medicine research is built on the ability to use information from one species to understand the same processes in another. Comparing human and mouse T cells and their effects on the immune system using comparative genomics can give new insights into molecular pathways. To characterize T cell receptors (TCRs) and their genes, Glusman and colleagues studied the sequences of the human and mouse T cell receptor loci. TCR genes are well characterized and serve as a significant resource for supporting functional genomics and understanding how genes and intergenic regions of the genome contribute to biological processes.^[43]
T cell receptors are central to how the cellular immune system recognizes pathogens. One of the reasons for sequencing the human and mouse TCR loci was to match the orthologous gene family sequences and discover conserved areas using comparative genomics. These, it was thought, would reflect two sorts of biological information: (1) exons and (2) regulatory sequences. In fact, the majority of V, D, J, and C exons could be identified by this method. The variable regions are encoded by multiple unique DNA elements that are rearranged and joined during T cell differentiation: variable (V), diversity (D), and joining (J) elements for the β and δ polypeptides; and V and J elements for the α and γ polypeptides.[Figure 1] In addition, several short noncoding conserved blocks of the genome were found. Both human and mouse motifs are largely clustered in the 200 bp [Figure 2], the known 3′ enhancers in the TCR/ were identified, and a conserved region of 100 bp in the mouse J intron was subsequently shown to have a regulatory function.

[Figure 2] Gene structure of the human (top) and mouse (bottom) V, D, J, and C gene segments. The arrows represent the transcriptional direction of each TCR gene. The squares and circles indicate direct and reverse orientation. Modified after Glusman et al. 2001.^[43]

Comparisons of the genomic sequences within each locus (the physical site or location of a specific gene on a chromosome) and across species allow for research on other mechanisms and other regulatory signals. Some suggest new hypotheses about the evolution of TCRs, to be tested (and improved) by comparison to the TCR gene complement of other vertebrate species. A comparative genomic investigation of humans and mice should allow for the discovery and annotation of many other genes, as well as the identification of regulatory sequences in other species.^[43]

Research

Comparative genomics also opens up new avenues in other areas of research. As DNA sequencing technology has become more accessible, the number of sequenced genomes has grown. With the increasing reservoir of available genomic data, the potency of comparative genomic inference has grown as well.
A notable case of this increased potency is found in recent primate research. Comparative genomic methods have allowed researchers to gather information about genetic variation, differential gene expression, and evolutionary dynamics in primates that were indiscernible using previous data and methods.^[44]

Great Ape Genome Project

The Great Ape Genome Project used comparative genomic methods to investigate genetic variation with reference to the six great ape species, finding healthy levels of variation in their gene pool despite shrinking population size.^[45] Another study showed that patterns of DNA methylation, which are a known regulation mechanism for gene expression, differ in the prefrontal cortex of humans versus chimps, and implicated this difference in the evolutionary divergence of the two species.^[46]

See also

Data mining

Molecular evolution

Comparative anatomy

Homology

Sequence mining

Alignment-free sequence analysis

References

[1] PMID 18650965.

[2] Touchman J (2010). "Comparative Genomics". Nature Education Knowledge. 3 (10): 13.

[3] S2CID 5491782.

[4] Russell PJ, Hertz PE, McMillan B (2011). Biology: The Dynamic Science (2nd ed.). Belmont, CA: Brooks/Cole. pp. 409–410.

[5] ISBN 9781405101202.

[6] PMID 14624258.

[7] S2CID 43171654.

[8] Koonin EV, Galperin MY (2003). Sequence - Evolution - Function: Computational approaches in comparative genomics. Dordrecht: Springer Science+Business Media.

[9] PMID 22199376.

[10] PMID 6384934.

[11] PMID 3012465.

[12] PMID 7542800.

[13] S2CID 29825758.

[14] S2CID 16763139.

[15] PMID 9851916.

[16] PMID 10731132.

[17] PMID 10731134.

[18] PMID 10899144.

[19] S2CID 2037634.

[20] PMID 14624247.

[21] PMC 261884.

[22] ISBN 978-0-521-67191-0.

[23] PMID 25984837.

[24] PMID 22046119.

[25] PMID 36161960.

[26] PMID 29206392. Retrieved 2022-12-18.

[27] S2CID 226247797.

[28] S2CID 2041895.

[29] PMID 29382321.

[30] PMID 19347649.

[31] PMID 21250292.

[32] "UCSC Browser".

[33] "Ensembl Genome Browser". Archived from the original on 2013-10-21.

[34] "Map Viewer".

[35] "VISTA tools".

[36] S2CID 34553139.

[37] S2CID 439442.

[38] S2CID 13358998.

[39] PMID 22882709.

[40] PMID 15994562.

[41] PMID 18676672.

[42] "Group a Streptococcus Vaccine Target Candidates Identified from Global Genome Set". 28 May 2019.

[43] PMID 11567625.

[44] PMID 24709753.

[45] PMID 23823723.

[46] PMID 22922032.

Further reading

Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (May 2003). "Sequencing and comparison of yeast species to identify genes and regulatory elements". Nature. 423 (6937): 241–254. S2CID 1530261.

Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, et al. (July 2003). "Finding functional features in Saccharomyces genomes by phylogenetic footprinting". Science. 301 (5629): 71–76. S2CID 1305166.

Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM (February 2003). "Phylogenetic shadowing of primate sequences to find functional regions of the human genome". Science. 299 (5611): 1391–1394. S2CID 17217612.

Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, et al. (July 2004). "Genome evolution in yeasts". Nature. 430 (6995): 35–44. S2CID 4399964.

Filipski A, Kumar S (2005). "Comparative genomics in eukaryotes". In Gregory TR (ed.). The Evolution of the Genome. San Diego: Elsevier. pp. 521–583.

Gregory TR, DeSalle R (2005). "Comparative genomics in prokaryotes". In Gregory TR (ed.). The Evolution of the Genome. San Diego: Elsevier. pp. 585–675.

Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, et al. (March 2005). "Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals". Nature. 434 (7031): 338–345. PMID 15735639.

Champ PC, Binnewies TT, Nielsen N, Zinman G, Kiil K, Wu H, et al. (March 2006). "Genome update: purine strand bias in 280 bacterial genomes". Microbiology. 152 (Pt 3): 579–583. PMID 16514138.

Kumar L, Breakspear A, Kistler C, Ma LJ, Xie X (March 2010). "Systematic discovery of regulatory motifs in Fusarium graminearum by comparing four Fusarium genomes". BMC Genomics. 11: 208. PMID 20346147.

Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES (July 2000). "Human and mouse gene structure: comparative analysis and application to exon prediction". Genome Research. 10 (7): 950–958. PMID 10899144.

External links

Genomes OnLine Database (GOLD)

Genome News Network

JCVI Comprehensive Microbial Resource

Pathema: A Clade Specific Bioinformatics Resource Center

CBS Genome Atlas Database Archived 2016-05-16 at the Portuguese Web Archive

The UCSC Genome Browser

The U.S. National Human Genome Research Institute

Ensembl: The Ensembl Genome Browser

Genolevures, comparative genomics of the Hemiascomycetous yeasts

Phylogenetically Inferred Groups (PhIGs), a recently developed method incorporates phylogenetic signals in building gene clusters for use in comparative genomics.

Metazome Archived 2006-08-10 at the Wayback Machine, a resource for the phylogenomic exploration and analysis of Metazoan gene families.

IMG The Integrated Microbial Genomes system, for comparative genome analysis by the DOE-JGI.

Dcode.org Dcode.org Comparative Genomics Center.

SUPERFAMILY Protein annotations for all completely sequenced organisms

Comparative Genomics

Blastology and Open Source: Needs and Deeds

Alignment-free comparative Genomics tool
