Genomics

Source: Wikipedia, the free encyclopedia.

Genomics is an interdisciplinary field of

proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes.[6][7] Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.[8]

The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the genome.[9]

History

Etymology

From the Greek ΓΕΝ[10] gen, "gene" (gamma, epsilon, nu, epsilon) meaning "become, create, creation, birth", and subsequent variants: genealogy, genesis, genetics, genic, genomere, genotype, genus etc. While the word genome (from the German Genom, attributed to Hans Winkler) was in use in English as early as 1926,[11] the term genomics was coined by Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beers with Jim Womack, Tom Shows and Stephen O’Brien at a meeting held in Maryland on the mapping of the human genome in 1986,[12] first as the name for a new journal and then as a whole new scientific discipline.[13]

Early sequencing efforts

Following

University of Ghent (Ghent, Belgium) were the first to determine the sequence of a gene: the gene for Bacteriophage MS2 coat protein.[18] Fiers' group expanded on their MS2 coat protein work, determining the complete nucleotide-sequence of bacteriophage MS2-RNA (whose genome encodes just four genes in 3569 base pairs [bp]) and Simian virus 40 in 1976 and 1978, respectively.[19][20]

DNA-sequencing technology developed

Frederick Sanger and Walter Gilbert shared half of the 1980 Nobel Prize in Chemistry for independently developing methods for the sequencing of DNA.

In addition to his seminal work on the amino acid sequence of insulin,

Maxam-Gilbert method (also known as the chemical method) of DNA sequencing, involving the preferential cleavage of DNA at known bases, a less efficient method.[26][27] For their groundbreaking work in the sequencing of nucleic acids, Gilbert and Sanger shared half the 1980 Nobel Prize in Chemistry with Paul Berg (recombinant DNA).

Complete genomes

The advent of these technologies resulted in a rapid intensification in the scope and speed of completion of

"Hockey stick" graph showing the exponential growth of public sequence databases.
The number of genome projects has increased as technological improvements continue to lower the cost of sequencing. (A) Exponential growth of genome sequence databases since 1995. (B) The cost in US Dollars (USD) to sequence one million bases. (C) The cost in USD to sequence a 3,000 Mb (human-sized) genome on a log-transformed scale.

Most of the microorganisms whose genomes have been completely sequenced are problematic

Pan troglodytes) are all important model animals in medical research.[27]

A rough draft of the human genome was completed by the Human Genome Project in early 2001, creating much fanfare.[41] This project, completed in 2003, sequenced the entire genome for one specific person, and by 2007 this sequence was declared "finished" (less than one error in 20,000 bases and all chromosomes assembled).[41] In the years since then, the genomes of many other individuals have been sequenced, partly under the auspices of the 1000 Genomes Project, which announced the sequencing of 1,092 genomes in October 2012.[42] Completion of this project was made possible by the development of dramatically more efficient sequencing technologies and required the commitment of significant bioinformatics resources from a large international collaboration.[43] The continued analysis of human genomic data has profound political and social repercussions for human societies.[44]

The "omics" revolution

General schema showing the relationships of the genome, transcriptome, proteome, and metabolome (lipidome)

The English-language neologism omics informally refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome, or metabolome (lipidome) respectively. The suffix -ome as used in molecular biology refers to a totality of some sort; similarly, omics has come to refer generally to the study of large, comprehensive biological data sets. While the growth in the use of the term has led some scientists (Jonathan Eisen, among others[45]) to claim that it has been oversold,[46] it reflects the change in orientation towards the quantitative analysis of the complete or near-complete assortment of all the constituents of a system.[47] In the study of symbioses, for example, researchers who were once limited to the study of a single gene product can now simultaneously compare the total complement of several types of biological molecules.[48][49]

Genome analysis

After an organism has been selected, genome projects involve three components: the sequencing of DNA, the assembly of that sequence to create a representation of the original chromosome, and the annotation and analysis of that representation.[9]

DOE JGI). Third, the genome sequence is annotated at several levels: DNA, protein, gene pathways, or comparatively.

Sequencing

Historically, sequencing was done in sequencing centers: centralized facilities (ranging from large independent institutions such as the Joint Genome Institute, which sequence dozens of terabases a year, to local molecular biology core facilities) that contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast-turnaround benchtop sequencers has come within reach of the average academic laboratory.[50][51] On the whole, genome sequencing approaches fall into two broad categories, shotgun and high-throughput (or next-generation) sequencing.[9]

Shotgun sequencing

An ABI PRISM 3100 Genetic Analyzer. Such capillary sequencers automated early large-scale genome sequencing efforts.

Shotgun sequencing is a sequencing method designed for analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes.[52] It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. Since gel electrophoresis sequencing can only be used for fairly short sequences (100 to 1000 base pairs), longer DNA sequences must be broken into random small segments which are then sequenced to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.[52][53] Shotgun sequencing is a random sampling process, requiring over-sampling to ensure a given nucleotide is represented in the reconstructed sequence; the average number of reads by which a genome is over-sampled is referred to as coverage.[54]
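The notion of coverage can be illustrated with a short calculation. The following Python sketch is not taken from the cited sources: it assumes the common idealization that reads land uniformly at random on the genome, so that the number of reads covering a base is approximately Poisson distributed. The function names and the example numbers are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length


def expected_fraction_covered(coverage: float) -> float:
    """Poisson approximation (an assumption of this sketch):
    P(a base is covered by at least one read) = 1 - exp(-c)."""
    return 1.0 - math.exp(-coverage)


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Hypothetical project: 50,000 reads of 500 bp on a 5 Mb genome.
    c = mean_coverage(num_reads=50_000, read_length=500, genome_length=5_000_000)
    print(f"mean coverage: {c:.1f}x")
    print(f"expected fraction of bases covered: {expected_fraction_covered(c):.4f}")
    print(f"reads needed for 8x: {reads_needed(8, 500, 5_000_000)}")
```

Under these assumptions, even several-fold coverage leaves a small expected fraction of bases unsampled, which is one reason over-sampling is required.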

For much of its history, the technology underlying shotgun sequencing was the classical chain-termination method or '

OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in DNA sequencers.[9] Typically, these machines can sequence up to 96 DNA samples in a single batch (run) in up to 48 runs a day.[57]

High-throughput sequencing

The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once.[58][59] High-throughput sequencing is intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods. In ultra-high-throughput sequencing, as many as 500,000 sequencing-by-synthesis operations may be run in parallel.[60][61]

Illumina Genome Analyzer II System. Illumina technologies have set the standard for high-throughput massively parallel sequencing.[50]

The Illumina dye sequencing method is based on reversible dye-terminators and was developed in 1996 at the Geneva Biomedical Research Institute, by

fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle.[63]

An alternative approach, ion semiconductor sequencing, is based on standard DNA replication chemistry. This technology measures the release of a hydrogen ion each time a base is incorporated. A microwell containing template DNA is flooded with a single

homopolymer is present in the template sequence multiple nucleotides will be incorporated in a single flood cycle, and the detected electrical signal will be proportionally higher.[64]
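The relationship between homopolymer length and signal strength can be sketched as an idealized simulation. The Python code below is only an illustration and not a description of any instrument's software: it assumes a noise-free signal exactly proportional to the number of incorporated bases, a hypothetical flow order of T, A, C, G, and, for simplicity, it works directly with the strand being synthesized rather than its complementary template.

```python
def flow_signals(synthesized: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Return (nucleotide, signal) pairs for successive floods.

    The signal is the number of bases incorporated during that flood:
    0 if the flooded nucleotide does not match the next base, and
    n for a homopolymer run of length n (idealized, noise-free).
    """
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not present in flow order: {sorted(unknown)}")

    signals = []
    pos = 0
    while pos < len(synthesized):
        for nucleotide in flow_order:
            run = 0
            # Every consecutive matching base is incorporated in one flood.
            while pos < len(synthesized) and synthesized[pos] == nucleotide:
                run += 1
                pos += 1
            signals.append((nucleotide, run))
            if pos == len(synthesized):
                break
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Reconstruct the sequence from idealized flow signals."""
    return "".join(nucleotide * count for nucleotide, count in signals)


if __name__ == "__main__":
    flows = flow_signals("TTAGGGC")
    print(flows)
    print(call_bases(flows))
```

In this sketch the run "GGG" produces a single signal of 3 in one G flood, mirroring the proportionally higher signal described above. Real signals are noisy, so long homopolymers are harder to call exactly than this idealization suggests.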

Assembly

Overlapping reads form contigs; contigs and gaps of known length form scaffolds.
Paired end reads of next generation sequencing data mapped to a reference genome.
Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.

Sequence assembly refers to

gene transcripts (ESTs).[9]

Assembly approaches

Assembly can be broadly categorized into two approaches: de novo assembly, for genomes which are not similar to any sequenced in the past, and comparative assembly, which uses the existing sequence of a closely related organism as a reference during assembly.

NP-hard), making it less favourable for short-read NGS technologies. Within the de novo assembly paradigm there are two primary strategies for assembly: Eulerian path strategies and overlap-layout-consensus (OLC) strategies. OLC strategies ultimately try to create a Hamiltonian path through an overlap graph, which is an NP-hard problem. Eulerian path strategies are computationally more tractable because they try to find an Eulerian path through a de Bruijn graph.[54]
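The Eulerian path strategy can be illustrated with a toy example. The following Python sketch is a simplification under stated assumptions, not the method of any particular assembler: reads are error-free and come from one strand, each distinct k-mer is used exactly once (so repeat multiplicity is ignored), and an Eulerian path is assumed to exist. The function names are hypothetical; the traversal uses Hierholzer's algorithm.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every
    distinct k-mer contributes one edge from its prefix to its suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes a non-empty graph with an Eulerian path."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    nodes = set(graph) | set(in_degree)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = min(nodes)
    for node in sorted(nodes):
        if len(graph.get(node, [])) - in_degree.get(node, 0) == 1:
            start = node
            break

    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads: list[str], k: int) -> str:
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ATGGCG", "GCGTGC", "GTGCA"], k=4))
```

Finding an Eulerian path takes time linear in the number of edges, which is what makes this formulation tractable. With a smaller k the same reads can admit more than one Eulerian path, a toy version of the ambiguity that repeats cause in real genomes.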

Finishing

Finished genomes are defined as having a single contiguous sequence with no ambiguities representing each replicon.[67]

Annotation

The DNA sequence assembly alone is of little value without additional analysis.

sequences, and consists of three main steps:[68]

  1. identifying portions of the genome that do not code for proteins
  2. identifying elements on the genome, a process called gene prediction, and
  3. attaching biological information to these elements.

Automatic annotation tools try to perform these steps in silico, as opposed to manual annotation (a.k.a. curation) which involves human expertise and potential experimental verification.[69] Ideally, these approaches co-exist and complement each other in the same annotation pipeline (also see below).

Traditionally, the basic level of annotation is using

Ensembl) rely on both curated data sources as well as a range of software tools in their automated genome annotation pipeline.[70] Structural annotation consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. Functional annotation consists of attaching biological information to genomic elements.
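As a rough illustration of the structural side of annotation, the sketch below scans a DNA string for open reading frames (ORFs). It is a deliberately naive example and not how production gene predictors work: it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, ignores introns and alternative start codons, and uses hypothetical function names and a hypothetical minimum-length parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Find ORFs in the three forward reading frames.

    Returns (start, end, orf) tuples with 0-based, end-exclusive
    coordinates; the ORF runs from ATG through the first in-frame stop.
    min_codons counts codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    print(find_orfs(dna))
    # The opposite strand can be scanned by passing its reverse complement.
    print(find_orfs(reverse_complement(dna)))
```

Functional annotation would then attach biological information to each reported element, for example by comparing the translated ORF against curated protein databases.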

Sequencing pipelines and databases

The need for reproducibility and efficient management of the large amount of data associated with genome projects means that computational pipelines have important applications in genomics.[71]

Research areas

Functional genomics

Functional genomics is a field of

DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "gene-by-gene" approach.

A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics.

Structural genomics

An example of a protein structure determined by the Midwest Center for Structural Genomics

Structural genomics seeks to describe the

3-dimensional structure of every protein encoded by a given genome.[72][73] This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of large numbers of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs. Structural genomics involves taking a large number of approaches to structure determination, including experimental methods using genomic sequences, modeling-based approaches based on sequence or structural homology to a protein of known structure, and approaches based on chemical and physical principles for a protein with no homology to any known structure. As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein function. This raises new challenges in structural bioinformatics, i.e. determining protein function from its 3D structure.[74]

Epigenomics

Epigenomics is the study of the complete set of

epigenetic modifications on the genetic material of a cell, known as the epigenome.[75] Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence (Russell 2010 p. 475). Two of the best-characterized epigenetic modifications are DNA methylation and histone modification.[76] Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as differentiation/development[77] and tumorigenesis.[75] The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.[78]

Metagenomics

Environmental Shotgun Sequencing (ESS) is a key technique in metagenomics. (A) Sampling from habitat; (B) filtering particles, typically by size; (C) Lysis and DNA extraction; (D) cloning and library construction; (E) sequencing the clones; (F) sequence assembly into contigs and scaffolds.

Metagenomics is the study of metagenomes,

Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities.[80] Because of its power to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.[81][82]

Model systems

Viruses and bacteriophages

phage evolution. Bacteriophage genome sequences can be obtained through direct sequencing of isolated bacteriophages, but can also be derived as part of microbial genomes. Analysis of bacterial genomes has shown that a substantial amount of microbial DNA consists of prophage sequences and prophage-like elements.[83] Detailed database mining of these sequences offers insights into the role of prophages in shaping the bacterial genome. Overall, this method verified many known bacteriophage groups, making it a useful tool for predicting the relationships of prophages from bacterial genomes.[84][85]

Cyanobacteria

At present there are 24

Acaryochloris and Prochloron, the N2-fixing filamentous cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as bacteriophages infecting marine cyanobacteria. Thus, the growing body of genome information can also be tapped in a more general way to address global problems by applying a comparative approach. Some new and exciting examples of progress in this field are the identification of genes for regulatory RNAs, insights into the evolutionary origin of photosynthesis, and estimation of the contribution of horizontal gene transfer to the genomes that have been analyzed.[86]

Applications

autosomal chromosome pairs, both the female (XX) and male (XY) versions of the two sex chromosomes, as well as the mitochondrial genome (at bottom left).

Genomics has provided applications in many fields, including

Genomic medicine

Next-generation genomic technologies allow clinicians and biomedical researchers to drastically increase the amount of genomic data collected on large study populations.

Brigham and Women’s Hospital, Broad Institute and Harvard Medical School was established in 2012 to conduct empirical research in translating genomics into health. Brigham and Women's Hospital opened a Preventive Genomics Clinic in August 2019, with Massachusetts General Hospital following a month later.[93][94] The All of Us research program aims to collect genome sequence data from 1 million participants to become a critical component of the precision medicine research platform.[95]

Synthetic biology and bioengineering

The growth of genomic knowledge has enabled increasingly sophisticated applications of

Population and conservation genomics

phylogenetic history and demography of a population.[98] Population genomic methods are used for many different fields including evolutionary biology, ecology, biogeography, conservation biology and fisheries management. Similarly, landscape genomics has developed from landscape genetics to use genomic methods to identify relationships between patterns of environmental and genetic variation.

Conservationists can use the information gathered by genomic sequencing in order to better evaluate genetic factors key to species conservation, such as the

evolutionary processes and to detect patterns in variation throughout a given population, conservationists can formulate plans to aid a given species without as many variables left unknown as those unaddressed by standard genetic approaches.[100]

References

  1. S2CID 4268222.
  2. .
  3. .
  4. .
  5. ^ "WHO definitions of genetics and genomics". World Health Organization. Archived from the original on June 30, 2004.
  6. .
  7. .
  8. .
  9. ^ .
  10. .
  11. ^ "Genome, n". Oxford English Dictionary (Third ed.). Oxford University Press. 2008. Retrieved 2012-12-01.(subscription required)
  12. PMID 18166670.
  13. .
  14. .
  15. .
  16. .
  17. .
  18. .
  19. .
  20. .
  21. .
  22. ^ Sanger F (1980). "Nobel lecture: Determination of nucleotide sequences in DNA" (PDF). Nobelprize.org. Retrieved 2010-10-18.
  23. ^ S2CID 4206886.
  24. .
  25. .
  26. .
  27. ^ a b Darden L, Tabery J (2010). "Molecular Biology". In Zalta EN (ed.). The Stanford Encyclopedia of Philosophy (Fall 2010 ed.).
  28. S2CID 4355527. (subscription required)
  29. .
  30. .
  31. .
  32. .
  33. .(subscription required)
  34. ^ "Complete genomes: Viruses". NCBI. 17 November 2011. Retrieved 2011-11-18.
  35. ^ "Genome Project Statistics". Entrez Genome Project. 7 October 2011. Retrieved 2011-11-18.
  36. ISSN 0362-4331. Retrieved 2012-12-21.
  37. .
  38. ^ "Human gene number slashed". BBC. 20 October 2004. Retrieved 2012-12-21.
  39. S2CID 21797344.
  40. ^ National Human Genome Research Institute (14 July 2004). "Dog Genome Assembled: Canine Genome Now Available to Research Community Worldwide". Genome.gov. Retrieved 2012-01-20.
  41. ^ .
  42. .
  43. .
  44. ^ .
  45. .
  46. . Retrieved 2013-01-04.
  47. ^ Scudellari M (1 October 2011). "Data Deluge". The Scientist. Retrieved 2013-01-04.
  48. PMID 22983030.
  49. .
  50. ^ a b Baker M (14 September 2012). "Benchtop sequencers ship off" (Blog). Nature News Blog. Retrieved 2012-12-22.
  51. PMID 22827831.
  52. ^ .
  53. .
  54. ^ .
  55. .
  56. .
  57. ^ Illumina, Inc. (28 February 2012). An Introduction to Next-Generation Sequencing Technology (PDF). San Diego, California, USA: Illumina, Inc. p. 12. Retrieved 2012-12-28.
  58. PMID 17449817.
  59. .
  60. .
  61. .
  62. ^ US 20050100900, Kawashima EH, Farinelli L, Mayer P, "Method of nucleic acid amplification", published 12 May 2005, issued 26 July 2011, assigned to Solexa Ltd Great Britain. 
  63. PMID 18576944. Archived from the original (PDF) on 2013-05-18. Retrieved 2013-01-04.
  64. ^ Davies K (2011). "Powering Preventative Medicine". Bio-IT World (September–October).
  65. ^ "Home". PacBio.
  66. ^ "home". Oxford Nanopore Technologies.
  67. PMID 19815760.
  68. .
  69. S2CID 20412451. Archived from the original (PDF) on 2013-05-29. Retrieved 2013-01-04.
  70. .
  71. .
  72. .
  73. .
  74. .
  75. ^ .
  76. .
  77. .
  78. .
  79. .
  80. .
  81. .
  82. .
  83. .
  84. .
  85. .
  86. .
  87. .
  88. .
  89. .
  90. .
  91. .
  92. .
  93. ^ Robbins R (16 August 2019). "Top U.S. medical centers roll out DNA sequencing clinics for healthy (and often wealthy) clients". STAT News.
  94. ^ "Two Boston Health Systems Enter the Growing Direct-to-Consumer Gene Sequencing Market by Opening Preventative Genomics Clinics, but Can Patients Afford the Service?". Dark Daily. The Dark Intelligence Group. 3 January 2020.
  95. ^ "NIH-funded genome centers to accelerate precision medicine discoveries". National Institutes of Health: All of Us Research Program. National Institutes of Health. 25 September 2018.
  96. .
  97. .
  98. .
  99. .
  100. .
