Chemical database

Source: Wikipedia, the free encyclopedia.

A chemical database is a database designed to store chemical information, such as chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.

Types of chemical databases

Bioactivity database

Bioactivity databases correlate structures or other chemical information to bioactivity results taken from bioassays in literature, patents, and screening programs.

Name Developer(s) Initial release
ScrubChem Jason Bret Harris 2016[1][2]
PubChem-BioAssay NIH 2004[3][4]
ChEMBL EMBL-EBI 2009[5]

Chemical structures

terabytes of physical memory.

Literature database

Chemical literature databases correlate structures or other chemical information to relevant references such as academic papers or patents. Databases of this type include STN, SciFinder, and Reaxys. Many databases that focus on chemical characterization also include links to the literature.

Crystallographic database

Crystallographic databases store X-ray crystal structure data. Common examples include the Protein Data Bank and the Cambridge Structural Database.

NMR spectra database


Reactions database

Most chemical databases store information on stable molecules, but reaction databases also store intermediates and transient, unstable molecules. Reaction databases contain information about products, educts (reactants), and reaction mechanisms.

Thermophysical database

Thermophysical data are information about

Chemical structure representation

There are two principal techniques for representing chemical structures in digital databases

These approaches have been refined to allow representation of

organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.

Search

Substructure

Chemists can search databases using parts of structures, parts of their

hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store these structural keys and fingerprints can be reduced by 'folding', which combines parts of the key using bitwise operations and thereby reduces its overall length.[8]
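The folding step described above can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular database: it assumes a fingerprint is held as a Python integer of known bit length, and it folds by OR-ing the upper half onto the lower half. The function names `fold_fingerprint` and `passes_screen` are hypothetical, and the screening test (every query bit must also be set in the target) is an assumed, commonly used way of applying such keys before a full substructure match.

```python
def fold_fingerprint(bits: int, length: int) -> tuple[int, int]:
    """Fold a fingerprint of `length` bits in half using bitwise OR.

    Returns the folded fingerprint and its new length.
    """
    if length % 2 != 0:
        raise ValueError("length must be even to fold in half")
    if bits >> length:
        raise ValueError("fingerprint has bits set beyond its stated length")
    half = length // 2
    low = bits & ((1 << half) - 1)   # lower half of the bit string
    high = bits >> half              # upper half, shifted down
    return low | high, half


def passes_screen(query: int, target: int) -> bool:
    """True if every bit set in the query is also set in the target."""
    return query & target == query


# A 16-bit toy fingerprint with bits 15, 8 and 5 set.
target = 0b1000_0001_0010_0000
query = 0b0000_0001_0000_0000        # bit 8 only

folded_target, n = fold_fingerprint(target, 16)
folded_query, _ = fold_fingerprint(query, 16)

print(f"{folded_target:0{n}b}")                    # folded 8-bit fingerprint
print(passes_screen(query, target))                # screen on full fingerprints
print(passes_screen(folded_query, folded_target))  # screen on folded fingerprints
```

Because OR-folding only merges bit positions, a query that passes the screen on the full fingerprints also passes on the folded ones. The reverse does not hold, so folding saves memory at the cost of more false positives that a later exact match must reject.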

Conformation

Searching by matching the 3D conformation of molecules or by specifying spatial constraints is another feature, and is particularly useful in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, including BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles.[9][10][11][12][13]

Giga Search

Databases of synthesizable and virtual chemicals grow larger each year, so the ability to mine them efficiently is critical for drug discovery projects. MolSoft's MolCart Giga Search (http://www.molsoft.com/giga-search.html) is presented as the first method designed for substructure search of billions of chemicals.

Descriptors

All properties of molecules beyond their structure can be split up into either physico-chemical or

molecular weight, (partial) charge, solubility, etc. can mostly be computed directly from the molecule's structure, whereas pharmacological descriptors can be derived only indirectly, using involved multivariate statistics or experimental (screening, bioassay) results. All of these descriptors can, for reasons of computational effort, be stored along with the molecule's representation, and usually are.
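As a small illustration of a descriptor that is computed once and then stored with the record, the sketch below derives a molecular weight from a molecular formula. It is a simplified example under stated assumptions: the formula is flat (no parentheses, charges, or isotopes), only four elements are covered by the `ATOMIC_MASS` table, and the record layout and the name `molecular_weight` are hypothetical.

```python
import re

# Standard atomic masses (g/mol), limited to a few elements for illustration.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Sum atomic masses over a flat formula such as 'C6H6' or 'C2H6O'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # An element symbol without a number counts once.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


# Compute the descriptor once and keep it next to the structure.
record = {"name": "benzene", "smiles": "c1ccccc1", "formula": "C6H6"}
record["mol_weight"] = round(molecular_weight(record["formula"]), 3)
print(record)
```

Storing the value avoids recomputing it for every query, for example when filtering a large table by a molecular weight range.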

Similarity

There is no single definition of molecular similarity, however the concept may be defined according to the application and is often described as an

MCS) based substructure search[7] (similarity or distance measure) is also very common. MCS is also used for screening drug-like compounds by finding molecules that share a common subgraph (substructure).[14]
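One widely used similarity measure for bit-string fingerprints, not named in the text above, is the Tanimoto (Jaccard) coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes fingerprints are stored as Python integers; the function name `tanimoto` and the value returned for two empty fingerprints are choices made for this example only.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints held as integers."""
    common = bin(a & b).count("1")   # bits set in both
    union = bin(a | b).count("1")    # bits set in at least one
    # Two empty fingerprints share nothing; 0.0 is the convention chosen here.
    return common / union if union else 0.0


fp_a = 0b1100
fp_b = 0b0110

similarity = tanimoto(fp_a, fp_b)
distance = 1.0 - similarity          # a matching distance measure
print(similarity, distance)
```

The coefficient ranges from 0 (no bits in common) to 1 (identical bit strings), and one minus the coefficient can serve as a distance in clustering.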

Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.[15]
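The idea behind Jarvis-Patrick clustering is that two items belong together when they appear in each other's nearest-neighbour lists and share enough neighbours. The sketch below is a simplified variant for illustration, not a reference implementation: it takes a full distance matrix, excludes each item from its own neighbour list, and merges qualifying pairs with a small union-find. The parameter names `k` (neighbour list size) and `k_min` (required shared neighbours) are assumptions of this example.

```python
from itertools import combinations


def jarvis_patrick(dist, k, k_min):
    """Cluster items given a square distance matrix `dist`.

    Two items are joined if each is among the other's k nearest
    neighbours and their neighbour lists share at least k_min items.
    """
    n = len(dist)
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(others[:k]))

    parent = list(range(n))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = i in neighbours[j] and j in neighbours[i]
        shared = len(neighbours[i] & neighbours[j])
        if mutual and shared >= k_min:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: one descriptor value per molecule, two well-separated groups.
values = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in values] for a in values]
clusters = jarvis_patrick(dist, k=2, k_min=1)
print(clusters)
```

In practice the distances would come from a molecular similarity measure over fingerprints or descriptors rather than from a single number per molecule.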

In

QSAR methods.

Registration systems

Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases.

Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique/'

hash codes to achieve the same objective.
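The way a unique representation or a hash code enforces uniqueness can be sketched with a toy registry. Everything here is hypothetical and for illustration only: the `Registry` class, its `register` method, and the stand-in `sort_components` function, which merely orders the dot-separated components of a SMILES string. A real system would use a full canonicalization algorithm instead.

```python
import hashlib


def sort_components(smiles: str) -> str:
    """Stand-in canonicalizer: order the '.'-separated components.

    This is NOT a real canonical SMILES algorithm; it only removes
    one trivial source of variation (component order).
    """
    return ".".join(sorted(smiles.split(".")))


class Registry:
    """Assigns one registry number per unique canonical representation."""

    def __init__(self, canonicalize):
        self._canonicalize = canonicalize
        self._numbers = {}           # hash code -> registry number

    def register(self, structure: str) -> int:
        canonical = self._canonicalize(structure)
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]


registry = Registry(sort_components)
first = registry.register("[Na+].[Cl-]")
second = registry.register("[Cl-].[Na+]")   # same salt, different order
third = registry.register("CCO")
print(first, second, third)
```

The two spellings of sodium chloride map to the same registry number because they reduce to the same canonical string, and therefore to the same hash code.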

A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with

racemic. Each of these would be considered a different record in a chemical registry system.

Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.

An example is the CAS Registry Number.

List of Chemical Cartridges

List of Chemical Registration Systems

Web-based

Name Developer(s) Initial release
CDD Vault Collaborative Drug Discovery  2018[26][27][28]
Adroit Repository[29] Adroit DI[30] 2023[31][32]

Tools

The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.

There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is Open Babel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle- and PostgreSQL-based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column, could be:

 SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')
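The general idea of extending SQL with a chemical search condition can be imitated with SQLite's user-defined functions, which ship with Python's standard library. The sketch below is only an analogy to a cartridge, not an Oracle or PostgreSQL cartridge: the function `SMILES_CONTAINS` is hypothetical, and its body is a plain substring test standing in for a real substructure matcher. A substring test would miss a phenyl ring written differently, such as `c1ccc(O)cc1`.

```python
import sqlite3


def smiles_contains(smiles, fragment):
    """Placeholder match: plain substring test, NOT real substructure search."""
    return 1 if fragment in smiles else 0


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it.
conn.create_function("SMILES_CONTAINS", 2, smiles_contains)

conn.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
conn.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("benzene", "c1ccccc1"), ("toluene", "Cc1ccccc1"), ("ethanol", "CCO")],
)

rows = conn.execute(
    "SELECT NAME FROM CHEMTABLE "
    "WHERE SMILES_CONTAINS(SMILESCOL, ?) ORDER BY NAME",
    ("c1ccccc1",),
).fetchall()
hits = [name for (name,) in rows]
print(hits)
conn.close()
```

A production cartridge replaces the placeholder with graph-based matching, usually preceded by a fingerprint screen, so that the SQL text stays the same while the match becomes chemically meaningful.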

Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC nomenclature. Work is under way to establish a unique IUPAC standard (see InChI).

References

  1. http://www.scrubchem.org
  2. S2CID 73493315.
  3. "PubChem". pubchem.ncbi.nlm.nih.gov.
  4. PMID 27899599.
  5. "ChEMBL Database".
  6. S2CID 17268751.
  7. .
  8. .
  9. .
  10. .
  11. .
  12. .
  13. .
  14. .
  15. .
  16. "BIOVIA Direct - BIOVIA - Dassault Systèmes®".
  17. "JChem Engines | ChemAxon".
  18. "Chemistry – Oracle Cartridge | Inside Informatics".
  19. PMC 2867114.
  20. "Small Molecule Drug Discovery Software". Small Molecule Drug Discovery Software.
  21. "BIOVIA Chemical Registration - BIOVIA - Dassault Systèmes®". www.3ds.com.
  22. "Register". Archived from the original on 2021-12-10. Retrieved 2021-03-13.
  23. "Scilligence RegMol | Scilligence". 6 June 2016.
  24. "Compound Registration". chemaxon.com.
  25. "Signals Notebook - PerkinElmer Informatics". perkinelmerinformatics.com.
  26. "CDD Vault Update: CDD Vault is Now an ELN". 16 February 2018.
  27. "CDD Electronic Lab Notebook (ELN)". 14 August 2019.
  28. "Electronic Lab Notebooks: What they are (And why you need one)". 4 August 2019.
  29. "Review of SDF Pro from Adroit DI. June 2023 – Macs in Chemistry". 2023-11-05. Retrieved 2024-03-11.
  30. "Adroit DI main page". adroitdi.com. Retrieved 2024-03-10.
  31. "Adroit DI's SDF Pro: The Fast and Affordable Solution to Storing, Sorting and Wrangling 10 Million Molecules in Seconds". www.businesswire.com. 2023-05-16. Retrieved 2024-03-10.
  32. "Best of the Best Entity Registration". 20Visioneers15. Retrieved 2024-03-10.