Quantitative structure–activity relationship
Quantitative structure–activity relationship models (QSAR models) are
In QSAR modeling, the predictors consist of physico-chemical properties or theoretical
Related terms include quantitative structure–property relationships (QSPR) when a chemical property is modeled as the response variable.[5][6] "Different properties or behaviors of chemical molecules have been investigated in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs), quantitative structure–chromatography relationships (QSCRs) and, quantitative structure–toxicity relationships (QSTRs), quantitative structure–electrochemistry relationships (QSERs), and quantitative structure–
As an example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can find a mathematical relationship, or quantitative structure-activity relationship, between the two. The mathematical expression, if carefully validated,[8][9][10][11] can then be used to predict the modeled response of other chemical structures.[12]
A QSAR has the form of a mathematical model:
- Activity = f (physicochemical properties and/or structural properties) + error
The error includes model error (bias) and observational variability, that is, the variability in observations even on a correct model.
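As a minimal sketch of the model form above, the simplest choice of f is a linear function fitted by ordinary least squares. The descriptor values and activities below are hypothetical numbers made up for illustration, and the function name `fit_linear_qsar` is an assumption of this sketch, not an established API.

```python
import numpy as np

def fit_linear_qsar(descriptors, activity):
    """Least-squares fit of activity = b0 + sum_j b_j * descriptor_j + error."""
    # Prepend a column of ones so the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(descriptors)), descriptors])
    coefficients, *_ = np.linalg.lstsq(design, activity, rcond=None)
    fitted = design @ coefficients
    residuals = activity - fitted  # the "error" term of the fitted model
    return coefficients, fitted, residuals

# Hypothetical data: five compounds, two made-up descriptors each.
descriptors = np.array([
    [1.0, 0.5],
    [2.0, 1.5],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 2.5],
])
activity = np.array([2.1, 3.9, 5.2, 7.8, 9.1])

coefficients, fitted, residuals = fit_linear_qsar(descriptors, activity)
print("intercept and slopes:", coefficients)
print("residuals:", residuals)
```

The residuals here mix both sources of error named above: a linear f may be the wrong model (bias), and the measured activities carry their own variability.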
Essential steps in QSAR studies
The principal steps of QSAR/QSPR include:[7]
- Selection of data set and extraction of structural/empirical descriptors
- Variable selection
- Model construction
- Validation evaluation
SAR and the SAR paradox
The basic assumption for all molecule-based
In general, one is more interested in finding strong
The SAR paradox refers to the fact that not all similar molecules have similar activities.
Types
Fragment based (group contribution)
Analogously, the "
Group or fragment-based QSAR is also known as GQSAR.[17] GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in a congeneric set of molecules, or could be defined by pre-defined chemical rules in the case of non-congeneric sets. GQSAR also considers cross-term fragment descriptors, which can help identify key fragment interactions that determine variation in activity.[17] Lead discovery using fragnomics is an emerging paradigm. In this context, FB-QSAR is a promising strategy for fragment library design and fragment-to-lead identification.[18]
An advanced approach to fragment- or group-based QSAR, based on the concept of pharmacophore similarity, has been developed.[19] This method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological pharmacophoric descriptors to develop QSAR models. The resulting activity predictions may help assess the contribution of the pharmacophore features encoded by individual fragments toward improving or degrading activity.[19]
3D-QSAR
The acronym 3D-QSAR or 3-D QSAR refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) by either experimental data (e.g. based on ligand-protein crystallography) or molecule superimposition software. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields[20] which were correlated by means of partial least squares regression (PLS).
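To illustrate what a computed steric field looks like, the following sketch evaluates a 12-6 Lennard-Jones potential between a probe and a set of atoms at each point of a grid. It is not the CoMFA implementation: the function names, the single shared `epsilon` and `sigma`, the reduced units, and the energy cap are all assumptions made for illustration.

```python
import math

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones potential at distance r (reduced units assumed)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def steric_field(atoms, grid_points, epsilon=1.0, sigma=1.0, cap=30.0):
    """Sum the probe-atom potential at every grid point.

    Values are truncated at `cap` (an illustrative choice) so that grid
    points sitting on or very near an atom do not dominate the field.
    """
    field = []
    for point in grid_points:
        total = 0.0
        for atom in atoms:
            r = math.dist(point, atom)
            total += lennard_jones(r, epsilon, sigma) if r > 0 else float("inf")
        field.append(min(total, cap))
    return field

# Hypothetical two-atom "molecule" and a few grid points (coordinates made up).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
grid_points = [(0.0, 0.0, 2.0), (0.75, 1.5, 0.0), (0.0, 0.0, 0.0)]
field = steric_field(atoms, grid_points)
print(field)
```

In a 3D-QSAR setting, one such vector of field values per aligned molecule would form a row of the descriptor matrix passed to the regression step.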
The created data space is then usually reduced by a following
On June 18, 2011, the Comparative Molecular Field Analysis (CoMFA) patent dropped any restriction on the use of GRID and partial least-squares (PLS) technologies.[citation needed]
Chemical descriptor based
In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR.[23] This approach differs from the fragment (or group contribution) approach in that the descriptors are computed for the system as a whole rather than from the properties of individual fragments. It differs from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields.
An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.[24][25]
String based
It has been shown that activity prediction is even possible based purely on the
Graph based
Similarly to string-based methods, the molecular graph can be used directly as input for QSAR models,[29][30] but such models usually yield inferior performance compared to descriptor-based QSAR models.[31][32]
Modeling
In the literature it can be often found that chemists have a preference for
Data mining approach
Computer SAR models typically calculate a relatively large number of features. Because these features are not readily interpretable in structural terms, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining.
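One simple data-driven way to approach the feature selection problem is a correlation filter: rank each descriptor by the absolute value of its Pearson correlation with the response and keep the top few. This is only one of many possible selection schemes; the function name `select_by_correlation` and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Return indices of the k descriptors with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.zeros(X.shape[1])
    usable = denom > 0  # constant descriptors get r = 0 instead of NaN
    r[usable] = (Xc[:, usable] * yc[:, None]).sum(axis=0) / denom[usable]
    order = np.argsort(-np.abs(r))
    return order[:k], r

# Synthetic data: six descriptors, of which only columns 0 and 3 drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 3] - 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

selected, r = select_by_correlation(X, y, k=2)
print("selected descriptor columns:", sorted(selected.tolist()))
```

A univariate filter like this ignores interactions between descriptors, so in practice it is often followed by a model-based selection step.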
A typical
Matched molecular pair analysis
Typically, QSAR models derived from non-linear machine learning are seen as a "black box" that fails to guide medicinal chemists. A relatively new concept, matched molecular pair analysis[35] or prediction-driven MMPA, is coupled with QSAR models in order to identify activity cliffs.[36]
Evaluation of the quality of QSAR models
QSAR modeling produces predictive models derived from application of statistical tools correlating biological activity (including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure or properties. QSARs are being applied in many disciplines, for example: risk assessment, toxicity prediction, and regulatory decisions[37] in addition to drug discovery and lead optimization.[38] Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.
For validation of QSAR models, usually various strategies are adopted:[39]
- internal validation or cross-validation (cross-validation is a measure of model robustness: the more robust a model is (higher q2), the less the removal of data perturbs the original model);
- external validation by splitting the available data set into a training set for model development and a prediction set for checking model predictivity;
- blind external validation by application of the model to new external data; and
- data randomization or Y-scrambling for verifying the absence of chance correlation between the response and the modeling descriptors.
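Two of the strategies listed above, leave-one-out cross-validation and Y-scrambling, can be sketched as follows. The sketch assumes a plain linear least-squares model and one common definition of q2 (1 minus PRESS divided by the total sum of squares about the overall mean); other definitions exist. All function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit a linear model with intercept on the training data, predict X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def q2_loo(X, y):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # leave compound i out
        prediction = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - prediction) ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

def y_scrambling(X, y, n_rounds=50, seed=0):
    """q2 values obtained after randomly permuting the response vector."""
    rng = np.random.default_rng(seed)
    return np.array([q2_loo(X, rng.permutation(y)) for _ in range(n_rounds)])

# Synthetic data with a genuine descriptor-response relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=30)

q2 = q2_loo(X, y)
scrambled = y_scrambling(X, y)
print("q2 (real response):", round(q2, 3))
print("mean q2 (scrambled):", round(scrambled.mean(), 3))
```

If the scrambled q2 values were close to the q2 of the real response, the original correlation would be suspect as a chance result.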
The success of any QSAR model depends on the accuracy of the input data, the selection of appropriate descriptors and statistical tools, and most importantly the validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models, validation must mainly address robustness, predictive performance and the applicability domain (AD) of the models.[8][9][11][40][41]
Some validation methodologies can be problematic. For example, leave-one-out cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published.
Different aspects of validation of QSAR models that need attention include methods of selection of training set compounds,[42] setting training set size[43] and impact of variable selection[44] for training set models for determining the quality of prediction. Development of novel validation parameters for judging quality of QSAR models is also important.[11][45][46]
Application
Chemical
One of the first historical QSAR applications was to predict boiling points.[47]
It is well known for instance that within a particular
Other applications that remain of interest are the Hammett equation, the Taft equation and pKa prediction methods.[48]
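For reference, the Hammett equation is conventionally written as below, where K is the equilibrium constant of the substituted compound, K_0 that of the unsubstituted parent, σ the substituent constant and ρ the reaction constant. The notation is the conventional one and is not taken from this article's sources; the second line follows from pK_a = -log10 K and shows how the relationship is used for pKa prediction.

```latex
\log_{10}\frac{K}{K_0} = \sigma\rho
\qquad\Longrightarrow\qquad
\mathrm{p}K_a = \mathrm{p}K_{a,0} - \sigma\rho
```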
Biological
The biological activity of molecules is usually measured in
While many quantitative structure activity relationship analyses involve the interactions of a family of molecules with an
Reducing the risk of a SAR paradox is part of the machine learning method, especially given that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding[50] and learning.[51]
Applications
(Q)SAR models have been used for risk management. QSARs are suggested by regulatory authorities; in the European Union, they are suggested by the REACH regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals". Regulatory application of QSAR methods includes in silico toxicological assessment of genotoxic impurities.[52] Commonly used QSAR assessment software such as DEREK or CASE Ultra (MultiCASE) is used to assess the genotoxicity of impurities according to ICH M7.[53]
The chemical descriptor space whose convex hull is generated by a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain uses extrapolation, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic.[citation needed]
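The convex-hull definition above can be made concrete with a small sketch restricted to two descriptors, where the hull is a polygon. Real descriptor spaces have many more dimensions and would need a general-purpose method, so this is an illustration of the idea only; the function names and the training points are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn o -> a -> b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_applicability_domain(hull, query):
    """True if query lies inside or on the hull (assumes >= 3 non-collinear vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], query) >= 0 for i in range(n))

# Hypothetical training set described by two descriptors per compound.
training = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(training)

inside = in_applicability_domain(hull, (1.0, 1.0))   # interpolation
outside = in_applicability_domain(hull, (5.0, 1.0))  # extrapolation
print(hull, inside, outside)
```

A prediction for the second query point would be an extrapolation in the sense described above and should be treated as less reliable.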
The QSAR equations can be used to predict biological activities of newer molecules before their synthesis.
Examples of machine learning tools for QSAR modeling include:[54]
S.No. | Name | Algorithms | External link |
---|---|---|---|
1. | R | RF, SVM, Naïve Bayesian, and ANN | "R: The R Project for Statistical Computing". |
2. | libSVM | SVM | "LIBSVM -- A Library for Support Vector Machines". |
3. | Orange | RF, SVM, and Naïve Bayesian | "Orange Data Mining". |
4. | RapidMiner | SVM, RF, Naïve Bayes, DT, ANN, and k-NN | "RapidMiner | #1 Open Source Predictive Analytics Platform". |
5. | Weka | RF, SVM, and Naïve Bayes | "Weka 3 - Data Mining with Open Source Machine Learning Software in Java". |
6. | Knime | DT, Naïve Bayes, and SVM | "KNIME | Open for Innovation". |
7. | AZOrange[55] | RT, SVM, ANN, and RF | "AZCompTox/AZOrange: AstraZeneca add-ons to Orange". GitHub. 2018-09-19. |
8. | Tanagra | SVM, RF, Naïve Bayes, and DT | "TANAGRA - A free DATA MINING software for teaching and research". Archived from the original on 2017-12-19. Retrieved 2016-03-24. |
9. | Elki | k-NN | "ELKI Data Mining Framework". Archived from the original on 2016-11-19. |
10. | MALLET | | "MALLET homepage". |
11. | MOA | | "MOA Massive Online Analysis | Real Time Analytics for Data Streams". Archived from the original on 2017-06-19. |
12. | Deep Chem | Logistic Regression, Naive Bayes, RF, ANN, and others | "DeepChem". deepchem.io. Retrieved 20 October 2017. |
13. | alvaModel[56] | Regression (SVM and Consensus) | "alvaModel: a software tool to create QSAR/QSPR models". alvascience.com. |
14. | scikit-learn (Python) [57] | Logistic Regression, Naive Bayes, kNN, RF, SVM, GP, ANN, and others | "SciKit-Learn". scikit-learn.org. Retrieved 13 August 2023. |
See also
- ADME
- Cheminformatics
- Computer-assisted drug design (CADD)
- Conformation–activity relationship
- Differential solubility
- Matched molecular pair analysis
- Molecular descriptor
- Molecular design software
- Partition coefficient
- Pharmacokinetics
- Pharmacophore
- Q-RASAR
- QSAR & Combinatorial Science – Scientific journal
- Software for molecular mechanics modeling
- List of predicted structure based properties
References
- ISBN 978-3-527-31852-0.
- ISBN 978-3-319-27282-5.
- ISBN 978-3-319-17281-1.
- S2CID 49418479.
- .
- S2CID 17622541.
- ^ .
- ^ .
- ^ hdl:11383/1668881.
- PMID 26110025.
- ^ PMID 22721530.
- S2CID 23564249.
- PMID 11848856.
- ISBN 978-3-527-33015-7.
- PMID 17597897.
- .
- ^ a b Ajmani S, Jadhav K, Kulkarni SA, Group-Based QSAR (G-QSAR)
- S2CID 1171860.
- ^ S2CID 45364247.
- ISBN 978-0-582-38210-7.
- ISBN 978-0-262-19509-6.
- .
- .
- PMID 17348648.
- .
- arXiv:1602.06289 [cs.CL].
- arXiv:1703.07076 [cs.LG].
- PMID 30155234.
- PMID 16180893.
- PMID 27558503.
- PMID 33597034.
- PMID 36456532.
- ISBN 978-0-521-58519-4.
- ISBN 978-0-8247-2397-2.
- PMID 23557664.
- PMID 25544551.
- .
- S2CID 21518449.
- ISBN 978-3-527-30044-0.
- S2CID 21305783.
- PMID 22534664.
- .
- .
- PMID 17933600.
- PMID 19471190.
- PMID 21800825.
- ISBN 978-0-85626-454-2.
- ISBN 9780124095472.
- PMID 12668435.
- ISBN 978-3-527-29913-3.
- ISBN 978-0-471-05669-0.
- S2CID 2714861.
- ^ ICH M7 Assessment and control of DNA reactive (mutagenic) impurities in pharmaceuticals to limit potential carcinogenic risk - Scientific guideline [1]
- PMID 25448759.
- PMID 21798025.
- PMID 36361669.
- ^ Fabian Pedregosa; Gaël Varoquaux; Alexandre Gramfort; Vincent Michel; Bertrand Thirion; Olivier Grisel; Mathieu Blondel; Peter Prettenhofer; Ron Weiss; Vincent Dubourg; Jake Vanderplas; Alexandre Passos; David Cournapeau; Matthieu Perrot; Édouard Duchesnay (2011). "scikit-learn: Machine Learning in Python". Journal of Machine Learning Research. 12: 2825–2830.
Further reading
- Selassie CD (2003). "History of Quantitative Structure-Activity Relationships" (PDF). In Abraham DJ (ed.). Burger's medicinal Chemistry and Drug Discovery. Vol. 1 (6th ed.). New York: Wiley. pp. 1–48. ISBN 978-0-471-27401-8.
- Shityakov S, Puskás I, Roewer N, Förster C, Broscheit J (2014). "Three-dimensional quantitative structure-activity relationship and docking studies in a series of anthocyanin derivatives as cytochrome P450 3A4 inhibitors". Advances and Applications in Bioinformatics and Chemistry. 7: 11–21. PMID 24741320.
External links
- "The Cheminformatics and QSAR Society". Retrieved 2009-05-11.
- "The 3D QSAR Server". Retrieved 2011-06-18.
- Verma, Rajeshwar P.; Hansch, Corwin (2007). "Nature Protocols: Development of QSAR models using C-QSAR program". Protocol Exchange. doi:10.1038/nprot.2007.125. Archived from the original on 2007-05-01. Retrieved 2009-05-11.
A regression program that has dual databases of over 21,000 QSAR models
- "QSAR World". Archived from the original on 2009-04-25. Retrieved 2009-05-11.
A comprehensive web resource for QSAR modelers
- Chemoinformatics Tools, Drug Theoretics and Cheminformatics Laboratory
- Multiscale Conceptual Model Figures for QSARs in Biological and Environmental Science