Reference genome

Source: Wikipedia, the free encyclopedia.
The first printout of the human reference genome presented as a series of books, displayed at the Wellcome Collection, London

A reference genome (also known as a reference assembly) is a digital

Properties of reference genomes

Measures of length

The length of a genome can be measured in multiple different ways.

A simple way to measure genome length is to count the number of base pairs in the assembly.[3]

The golden path is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions.[4][5] It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a 'best estimate' of what the genome will look like and typically includes gaps, which makes it longer than the simple base-pair count of the assembly.[6]
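As a rough illustration of the simple base-pair count, the sketch below measures a toy assembly held in memory and reports both its total length and its length with gap characters excluded. It does not compute a golden path. The function name `assembly_lengths`, the toy sequences, and the convention that gaps are written as runs of `N` are assumptions made for this example only.

```python
def assembly_lengths(sequences):
    """Return (total_length, ungapped_length) for a dict of name -> sequence.

    Assumption: gaps are represented by the character 'N' (any case).
    """
    total = 0
    ungapped = 0
    for seq in sequences.values():
        total += len(seq)
        # Count every position that is not a gap placeholder.
        ungapped += sum(1 for base in seq.upper() if base != "N")
    return total, ungapped


# Hypothetical toy assembly: two sequences with gap runs of 4 and 2 bases.
toy_assembly = {
    "chr1": "ACGTACGTNNNNACGT",
    "chr2": "GGCCNNTTAA",
}

total, ungapped = assembly_lengths(toy_assembly)
print(total, ungapped)  # 26 20
```

The difference between the two numbers shows why a length measure that includes gaps is larger than a count of sequenced bases alone.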

Contigs and scaffolds

Diagram of how reads are arranged into contigs, which can in turn be assembled into scaffolds during the sequencing and assembly of a reference genome. The gap between contigs 1 and 2 is shown as sequenced, so the two contigs form a single scaffold, while the other gap is not sequenced and separates scaffolds 1 and 2.

Reference genome assembly requires overlapping reads, which are merged into contigs: contiguous DNA regions of consensus sequence.[7] If there are gaps between contigs, these can be filled by scaffolding, either by amplifying the contigs with PCR and sequencing them or by Bacterial Artificial Chromosome (BAC) cloning.[8][7] Filling these gaps is not always possible; in that case, the reference assembly contains multiple scaffolds.[9] Scaffolds are classified into three types: 1) placed, whose chromosome, genomic coordinates and orientation are known; 2) unlocalised, for which only the chromosome is known but not the coordinates or orientation; 3) unplaced, whose chromosome is not known.[10]
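The three scaffold types can be expressed as a small decision rule. The sketch below is only an illustration of that rule: the `Scaffold` record, its field names, and the example scaffold names are hypothetical and do not correspond to any particular database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scaffold:
    """Hypothetical record; None means the property is unknown."""
    name: str
    chromosome: Optional[str] = None
    start: Optional[int] = None        # genomic coordinate on the chromosome
    orientation: Optional[str] = None  # "+" or "-"


def classify_scaffold(scaffold: Scaffold) -> str:
    # Unplaced: not even the chromosome is known.
    if scaffold.chromosome is None:
        return "unplaced"
    # Unlocalised: chromosome known, but coordinates or orientation missing.
    if scaffold.start is None or scaffold.orientation is None:
        return "unlocalised"
    # Placed: chromosome, coordinates and orientation are all known.
    return "placed"


examples = [
    Scaffold("scaffold_A", chromosome="1", start=10_000, orientation="+"),
    Scaffold("scaffold_B", chromosome="1"),
    Scaffold("scaffold_C"),
]
for s in examples:
    print(s.name, classify_scaffold(s))
```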

The number of contigs and scaffolds, as well as their average lengths, are among the many parameters used to assess the quality of a reference genome assembly, since they indicate how continuous the final assembly is relative to the original genome. The fewer scaffolds per chromosome, down to a single scaffold spanning an entire chromosome, the greater the continuity of the genome assembly.[11][12][13] Two related parameters are N50 and L50. N50 is the contig/scaffold length such that 50% of the assembly is contained in contigs/scaffolds of this length or greater, while L50 is the number of contigs/scaffolds whose length is N50 or greater. A higher N50 generally goes with a lower L50, and both indicate high continuity in the assembly.[14][15][16]
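A minimal sketch of how N50 and L50 can be computed from a list of contig or scaffold lengths follows. It uses the common convention that contigs are sorted from longest to shortest and accumulated until at least half of the total assembly length is covered: N50 is the length of the contig that reaches this point, and L50 is the number of contigs accumulated so far. The function name `n50_l50` and the example lengths are assumptions made for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first contig where the running sum covers half the assembly.
        if running >= half:
            return length, count


# Hypothetical contig lengths (total 300, so half is 150).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(contigs))  # (70, 2)
```

Here the two longest contigs (80 + 70 = 150) already cover half of the assembly, so N50 is 70 and L50 is 2. A more fragmented assembly of the same total length would give a smaller N50 and a larger L50.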

Mammalian genomes

The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC). The GRC continues to improve reference genomes by building new alignments that contain fewer gaps and by fixing misrepresentations in the sequence.

Human reference genome

The original human reference genome was derived from thirteen anonymous volunteers from

Evolution of the cost of sequencing a human genome from 2001 to 2021

As the cost of

single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all.[21][22] For regions where there is known to be large-scale variation, sets of alternate loci are assembled alongside the reference locus.

Chromosome ideogram of the human reference genome assembly GRCh38/hg38. Characteristic banding patterns are displayed in black, grey and white, while gaps and partially assembled regions are displayed in blue and rose, respectively. Reference: Genome Data Viewer of the NCBI.[24]

The latest human reference genome assembly released by the Genome Reference Consortium is GRCh38, first released in December 2013.[25] Several patches have been added to update it, the latest being GRCh38.p14, published in March 2022.[26][27] This build has only 349 gaps across the entire assembly, a great improvement over the first version, which had roughly 150,000 gaps.[18] The gaps are mostly in regions such as telomeres, centromeres, and long repetitive sequences; the largest gap lies along the long arm of the Y chromosome, a region of ~30 Mb (~52% of the Y chromosome's length).[28] The number of genomic clone libraries contributing to the reference has increased steadily over the years to more than 60, although a single individual, RP11, still accounts for 70% of the reference genome.[1] Genomic analysis of this anonymous male suggests that he is of African-European ancestry.[1]

In 2022, the Telomere-to-Telomere (T2T) Consortium,[29] an open, community-based effort, published the first completely assembled human reference genome (version T2T-CHM13), without any gaps in the assembly.[30][31] Besides generating the first complete assembly of a human genome, the project provides an opportunity to examine how centromeric and pericentromeric (near the centromere) sequences evolve. The effort relied on careful measures to assemble, polish, and validate entire centromeric and pericentromeric repeat arrays. By deeply characterizing these newly assembled sequences, the consortium presented a high-resolution, genome-wide atlas of the sequence content and organization of human centromeric and pericentromeric regions.[32] Meanwhile, according to the GRC website, the next GRC assembly release for the human genome (version GRCh39) is currently "indefinitely postponed".[33]

Recent genome assemblies are as follows:[34]

Release name       Date of release             Equivalent UCSC version
GRCh39             Indefinitely postponed[33]  -
T2T-CHM13          Jan 2022                    hs1
GRCh38             Dec 2013                    hg38
GRCh37             Feb 2009                    hg19
NCBI Build 36.1    Mar 2006                    hg18
NCBI Build 35      May 2004                    hg17
NCBI Build 34      Jul 2003                    hg16
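Because the same assembly goes by different names at the GRC/NCBI and at UCSC, analysis scripts often need to translate between them. The sketch below encodes the human releases listed above as a lookup table. The names `UCSC_NAME_BY_RELEASE` and `ucsc_name` are hypothetical, and treating a patch suffix such as `.p14` as belonging to the same UCSC version is an assumption of this example.

```python
# Release name -> equivalent UCSC version, as listed in the table above.
UCSC_NAME_BY_RELEASE = {
    "T2T-CHM13": "hs1",
    "GRCh38": "hg38",
    "GRCh37": "hg19",
    "NCBI Build 36.1": "hg18",
    "NCBI Build 35": "hg17",
    "NCBI Build 34": "hg16",
}


def ucsc_name(release):
    """Return the UCSC version for a release name, or None if not listed."""
    # Assumption: a patch suffix (e.g. "GRCh38.p14") maps to the base release.
    base = release.split(".p", 1)[0]
    return UCSC_NAME_BY_RELEASE.get(base)


print(ucsc_name("GRCh38.p14"))       # hg38
print(ucsc_name("NCBI Build 36.1"))  # hg18
print(ucsc_name("GRCh39"))           # None
```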

Limitations

For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals.[35][36][37] Because the reference genome is a single distinct sequence, which is what makes it useful as an index or locator of genomic features, it is limited in how faithfully it can represent the human genome and its variability. Most of the initial samples used for reference genome sequencing came from people of European ancestry. In 2010, de novo assembly of genomes from African and Asian populations and their comparison with the NCBI reference genome (version NCBI36) showed that these genomes contained ~5 Mb of sequence that did not align to any region of the reference genome.[38]

Projects that followed the Human Genome Project have sought a deeper and more diverse characterization of human genetic variability, which the reference genome cannot represent. The HapMap Project, active from 2002 to 2010, aimed to create a map of haplotypes and their most common variations among different human populations. Up to 11 populations of different ancestry were studied, including Han individuals from China, Gujaratis from India, the Yoruba people from Nigeria and Japanese people, among others.[39][40][41][42] The 1000 Genomes Project, carried out between 2008 and 2015, aimed to create a database covering more than 95% of the variations present in the human genome, with results that can be used in genome-wide association studies (GWAS) of diseases such as diabetes, cardiovascular disease or autoimmune diseases. A total of 26 ethnic groups were studied, expanding the scope of the HapMap Project to new groups such as the Mende people of Sierra Leone, the Vietnamese people and the Bengali people.[43][44][45][46] The Human Pangenome Project, which began its initial phase in 2019 with the creation of the Human Pangenome Reference Consortium, seeks to create the largest map of human genetic variability, taking the results of previous studies as a starting point.[47][48]

Mouse reference genome

Recent mouse genome assemblies are as follows:[34]

Release name     Date of release    Equivalent UCSC version
GRCm39           Jun 2020           mm39
GRCm38           Dec 2011           mm10
NCBI Build 37    Jul 2007           mm9
NCBI Build 36    Feb 2006           mm8
NCBI Build 35    Aug 2005           mm7
NCBI Build 34    Mar 2005           mm6

Other genomes

Since the Human Genome Project was finished, multiple international projects have started that focus on assembling reference genomes for many organisms. Model organisms (e.g., zebrafish (Danio rerio), chicken (Gallus gallus) and Escherichia coli) are of special interest to the scientific community, as are endangered species (e.g., the Asian arowana (Scleropages formosus) and the American bison (Bison bison)). As of August 2022, the NCBI database supports 71 886 partially or completely sequenced and assembled genomes from different species, including 676 mammals, 590 birds and 865 fishes. Also noteworthy are the genomes of 1796 insects, 3747 fungi, 1025 plants, 33 724 bacteria, 26 004 viruses and 2040 archaea.[49] Many of these species have annotation data associated with their reference genomes that can be publicly accessed and visualized in genome browsers such as Ensembl and the UCSC Genome Browser.[50][51]

Some examples of these international projects are the Chimpanzee Genome Project, carried out between 2005 and 2013 jointly by the Broad Institute and the McDonnell Genome Institute of Washington University in St. Louis, which generated the first reference genomes for 4 subspecies of Pan troglodytes;[52][53] the 100K Pathogen Genome Project, which started in 2012 with the main goal of creating a database of reference genomes for 100 000 pathogenic microorganisms for use in public health, outbreak detection, agriculture and the environment;[54] and the Earth BioGenome Project, which started in 2018 and aims to sequence and catalog the genomes of all eukaryotic organisms on Earth to support biodiversity conservation projects. Within this big-science project there are up to 50 smaller-scale affiliated projects, such as the Africa BioGenome Project and the 1000 Fungal Genomes Project.[55][56][57]

References

  1. "How many individuals were sequenced for the human reference genome assembly?". Genome Reference Consortium. Retrieved 2022-04-07.
  2. PMID 18000006.
  3. "Help - Glossary - Homo sapiens - Ensembl genome browser 87". www.ensembl.org.
  4. "Golden path length | VectorBase". www.vectorbase.org. Archived from the original on 2020-08-07. Retrieved 2016-12-12.
  5. "Help - Glossary - Homo sapiens - Ensembl genome browser 87". www.ensembl.org.
  6. "Whole assembly vs Golden path length in Ensembl? - SEQanswers". seqanswers.com. 2014-07-31. Retrieved 2016-12-12.
  7. .
  8. "Help - Glossary - Homo_sapiens - Ensembl genome browser 107". www.ensembl.org. Retrieved 2022-09-26.
  9. PMID 33634311.
  10. "Chromosomes, scaffolds and contigs". www.ensembl.org. Retrieved 2022-09-26.
  11. PMID 20305016.
  12. .
  13. .
  14. .
  15. .
  16. .
  17. .
  18. .
  19. .
  20. .
  21. Wade N (2007-05-31). "Genome of DNA Pioneer Is Deciphered". New York Times. Retrieved 2009-02-21.
  22. PMID 18421352.
  23. J. Craig Venter whose DNA was sequenced and assembled using shotgun sequencing methods.
  24. "Genome Data Viewer - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-08-18.
  25. PMID 28396521.
  26. "GRCh38.p14 - hg38 - Genome - Assembly - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-08-19.
  27. Genome Reference Consortium (2022-05-09). "GenomeRef: GRCh38.p14 is now released!". GRC Blog (GenomeRef). Retrieved 2022-08-19.
  28. "GRCh38.p14 - hg38 - Genome - Assembly - NCBI - Statistics Report". www.ncbi.nlm.nih.gov. Retrieved 2022-08-18.
  29. "Telomere-to-Telomere". NHGRI. Retrieved 2022-08-16.
  30. S2CID 247854936.
  31. "T2T-CHM13v2.0 - Genome - Assembly - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-08-16.
  32. PMID 35357911.
  33. "Genome Reference Consortium". www.ncbi.nlm.nih.gov. Retrieved 2022-08-18.
  34. "UCSC Genome Bioinformatics: FAQ". genome.ucsc.edu. Retrieved 2016-08-18.
  35. S2CID 186243515.
  36. .
  37. .
  38. .
  39. .
  40. .
  41. .
  42. "International HapMap Project". Genome.gov. Retrieved 2022-08-18.
  43. PMID 20981092.
  44. .
  45. .
  46. .
  47. .
  48. .
  49. "Genome List - Genome - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-08-18.
  50. "Species List". uswest.ensembl.org. Archived from the original on 2022-08-06. Retrieved 2022-08-18.
  51. "GenArk: UCSC Genome Archive". hgdownload.soe.ucsc.edu. Retrieved 2022-08-18.
  52. "Chimpanzee Genome Project". BCM-HGSC. 2016-03-04. Retrieved 2022-08-18.
  53. PMID 23823723.
  54. "100K Pathogen Genome Project – Genomes for Public Health & Food Safety". Retrieved 2022-08-18.
  55. PMID 29686065.
  56. "African BioGenome Project – Genomics in the service of conservation and improvement of African biological diversity". Retrieved 2022-08-18.
  57. "1000 Fungal Genomes Project". mycocosm.jgi.doe.gov. Retrieved 2022-08-18.

External links