Ensembl genome database project

Source: Wikipedia, the free encyclopedia.

Ensembl genome database project.
genomic
information.

Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).

History

The human genome consists of three billion

pattern-matching of protein to DNA.[5][6] The Ensembl project was launched in 1999 in response to the imminent completion of the Human Genome Project, with the initial goals of automatically annotating the human genome, integrating this annotation with available biological data, and making all this knowledge publicly available.[2]

In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in Perl), which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available for download,[7] and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.

Over time the project has expanded to include additional species (including key

protists, focusing on providing taxonomic and evolutionary context to genes, whilst the original project continues to focus on vertebrates.[8][9]

As of 2020, Ensembl supported over 50,000 genomes across the Ensembl and Ensembl Genomes databases and had added new features such as Rapid Release, a website designed to make genome annotation data available to users more quickly, and COVID-19, a website providing access to the SARS-CoV-2 reference genome.

Displaying genomic data

Gene SGCB aligned to the human genome

Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.

Other displays show data at varying levels of resolution, from whole

trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.
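FASTA, named above as an export format, is a plain-text format in which each record consists of a header line beginning with ">" followed by one or more lines of sequence. The following Python sketch illustrates writing and reading such records; the function names and the 60-character line width are illustrative assumptions and are not taken from Ensembl's own code.

```python
from typing import Dict, Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Serialise (identifier, sequence) pairs as FASTA text."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Wrap the sequence so that no line exceeds `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text: str) -> Dict[str, str]:
    """Read FASTA text back into a dict of identifier -> sequence."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records
```

A round trip such as parse_fasta(format_fasta([("demo_seq", "ACGT" * 20)])) returns the original identifier and sequence, which is a convenient check when post-processing an exported file.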

Externally produced data can also be added to the display by uploading a suitable file in one of the supported formats, such as BED or PSL.
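When preparing a BED file for upload, it helps to remember that BED start positions are zero-based and end positions are exclusive, whereas genome browsers such as Ensembl display one-based, inclusive coordinates. The Python sketch below reads the first four BED columns and converts them to displayed coordinates; the Region type and the parse_bed function are hypothetical names introduced for illustration only.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def parse_bed(text: str) -> List[Region]:
    """Parse BED text and convert coordinates to 1-based inclusive."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and "track"/"browser" header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 fields: {line!r}")
        chrom = fields[0]
        start0, end = int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        # BED start is 0-based; the exclusive end equals the 1-based inclusive end.
        regions.append(Region(chrom, start0 + 1, end, name))
    return regions
```

For example, the BED line "4 100 200 featureA" describes the 100 bases displayed as positions 101 to 200 on chromosome 4.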

Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.

Alternative access methods

In addition to its website, Ensembl provides a REST

scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API (for comparative genomics data), the variation API (for accessing SNPs, SNVs, CNVs, etc.), and the functional genomics API (to access regulatory data). The Ensembl website provides extensive information on how to install and use the API.
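The REST interface mentioned at the start of this section can be called from any language that speaks HTTP. The sketch below, in Python, assumes the public server at https://rest.ensembl.org and its lookup/id endpoint with the optional expand parameter; the helper names are hypothetical, and the stable identifier in the usage note is only an example value.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def build_lookup_url(stable_id: str, expand: bool = False) -> str:
    """Build the URL for looking up one Ensembl stable identifier."""
    url = REST_SERVER + "/lookup/id/" + urllib.parse.quote(stable_id)
    if expand:
        # Ask the server to include child objects such as transcripts.
        url += "?" + urllib.parse.urlencode({"expand": "1"})
    return url


def fetch_json(url: str, timeout: float = 30.0):
    """Send a GET request and decode the JSON response (needs network access)."""
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as fetch_json(build_lookup_url("ENSG00000157764")) would be expected to return a dictionary describing the gene. The server may rate-limit requests, so a script of this kind should pause between calls.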

This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. Users can also retrieve data from the MySQL database with direct SQL queries, but this requires extensive knowledge of the current database schema.
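To give a sense of what a direct SQL query involves, the following Python sketch runs a region-overlap query against a toy database. It rests on several assumptions: an in-memory SQLite database stands in for the public MySQL server, the two tables are a heavily simplified imitation of the gene and seq_region tables of the Ensembl core schema, and every row is invented for illustration. The real schema has many more tables and columns and changes between releases.

```python
import sqlite3


def build_toy_database() -> sqlite3.Connection:
    """Create a tiny stand-in database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE seq_region (
            seq_region_id INTEGER PRIMARY KEY,
            name TEXT
        );
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            stable_id TEXT,
            seq_region_id INTEGER REFERENCES seq_region(seq_region_id),
            seq_region_start INTEGER,
            seq_region_end INTEGER,
            seq_region_strand INTEGER
        );
    """)
    conn.executemany("INSERT INTO seq_region VALUES (?, ?)",
                     [(1, "4"), (2, "X")])
    # Invented genes; the identifiers are deliberately not real Ensembl IDs.
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)", [
        (1, "TOY_GENE_1", 1, 1000, 2000, 1),
        (2, "TOY_GENE_2", 1, 5000, 6000, -1),
        (3, "TOY_GENE_3", 2, 1500, 2500, 1),
    ])
    return conn


def genes_overlapping(conn, region_name, start, end):
    """Return genes on `region_name` that overlap [start, end] (inclusive)."""
    query = """
        SELECT g.stable_id, g.seq_region_start, g.seq_region_end,
               g.seq_region_strand
        FROM gene AS g
        JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
        WHERE sr.name = ?
          AND g.seq_region_start <= ?
          AND g.seq_region_end >= ?
        ORDER BY g.seq_region_start
    """
    return conn.execute(query, (region_name, end, start)).fetchall()
```

Even this small query needs a join to translate a chromosome name into an internal identifier, which hints at why the API is usually the more convenient route.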

Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for downloading datasets using complex queries.
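Besides the web interface, BioMart queries are commonly expressed as a small XML document naming a dataset, the filters to apply and the attributes to return. The Python sketch below builds such a document; the Query, Dataset, Filter and Attribute layout and the example names (hsapiens_gene_ensembl, chromosome_name, ensembl_gene_id, external_gene_name) reflect the author's understanding of BioMart conventions and should be checked against the current BioMart documentation. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_biomart_query(dataset: str,
                        filters: Dict[str, str],
                        attributes: List[str]) -> str:
    """Build a BioMart-style XML query as a string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",        # tab-separated output
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Filters restrict which rows are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes choose which columns are returned.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            "<!DOCTYPE Query>\n" + body)
```

For instance, build_biomart_query("hsapiens_gene_ensembl", {"chromosome_name": "4"}, ["ensembl_gene_id", "external_gene_name"]) produces a query that would request gene identifiers and names for one chromosome.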

Finally, there is an FTP server which can be used to download entire MySQL databases as well as some selected data sets in other formats.

Current species

The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes. As of 2022, 271 species are registered, including:[11]

Species
Chordata
  Mammalia
    Euarchontoglires
      Primates: Angola colobus, black-capped squirrel monkey, black snub-nosed monkey, bonobo, bushbaby, capuchin, chimpanzee, common marmoset, Coquerel's sifaka, crab-eating macaque, drill, human, macaque, mouse lemur, gelada, gibbon, golden snub-nosed monkey, gorilla, greater bamboo lemur, green monkey, Ma's night monkey, olive baboon, orangutan, pig-tailed macaque, sooty mangabey, tarsier, Ugandan red colobus
      Scandentia: tree shrew
      Glires (Rodents + Lagomorphs)
    Laurasiatheria
    Afrotheria: hyrax, tenrec
    Xenarthra: armadillo, sloth
    Marsupialia: common wombat, koala, opossum, Tasmanian devil, wallaby
    Monotremes: platypus
  Reptilia: Argentine black and white tegu, blue-ringed sea krait, central bearded dragon, Chinese softshell turtle, common snapping turtle, common wall lizard, desert tortoise, eastern brown snake, saltwater crocodile, Goode's thornscrub tortoise, green anole, Indian cobra, Komodo dragon, mainland tiger snake, painted turtle, Pinta Island tortoise, three-toed box turtle, tuatara, West African mud turtle
  Birds
  Lissamphibia: Leisan spiny toad, Xenopus tropicalis
  Teleosts
  Cyclostomata: Petromyzon marinus (sea lamprey)
  Tunicates: Ciona intestinalis, Ciona savignyi
Invertebrates
  Insects: Drosophila melanogaster (fruitfly), Anopheles gambiae (mosquito), Aedes aegypti (mosquito)
  Worms: Caenorhabditis elegans
Yeast: Saccharomyces cerevisiae (baker's yeast)

Open source/mirrors

All data that are part of the Ensembl project are open access, and all software is open source, freely available to the scientific community under a CC BY 4.0 license. Currently, the Ensembl website is mirrored at four locations worldwide to improve the service.

Official mirror sites
UK (Sanger Institute) ---- main website
US West (Amazon AWS) ---- Cloud-based mirror on West Coast of United States
US East (Amazon AWS) ---- Cloud-based mirror on East Coast of United States
Asia (Amazon AWS) ---- Cloud-based mirror in Singapore

References

  1. PMID 31691826.
  2. .
  3. .
  4. .
  5. Davis, Charles Patrick (29 March 2021). "Medical definition of Genome Annotation". Archived from the original on 14 June 2021. Retrieved 7 August 2022.
  6. PMID 15123590.
  7. .
  8. .
  9. .
  10. .
  11. "Species List". uswest.ensembl.org. Archived from the original on 6 August 2022. Retrieved 5 August 2022.
