Binary classification
Binary classification is the task of classifying the elements of a set into one of two groups (each called class) on the basis of a classification rule. Typical binary classification problems include:
- Medical testing to determine if a patient has a certain disease or not;
- Quality control in industry, deciding whether a specification has been met;
- In information retrieval, deciding whether a page should be in the result set of a search or not.
In many practical binary classification problems, the two groups are not symmetric; rather than overall accuracy, the relative proportion of the different types of errors is often of interest.
Statistical binary classification
Statistical classification is a problem studied in machine learning. It is a type of supervised learning, a method of machine learning where the categories are predefined, and is used to categorize new observations into said categories. When there are only two categories the problem is known as statistical binary classification.
Some of the methods commonly used for binary classification are:
- Decision trees
- Random forests
- Bayesian networks
- Support vector machines
- Neural networks
- Logistic regression
- Probit model
- Genetic programming
- Multi expression programming
- Linear genetic programming
Each classifier is best in only a select domain based upon the number of observations, the dimensionality of the feature vector, the noise in the data, and many other factors. For example, random forests have been reported to perform better than support vector machine classifiers for 3D point clouds.
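As an illustration of one method from the list above, a logistic regression classifier can be fit with plain gradient descent. This is a minimal sketch on made-up one-dimensional data, not a reference implementation:

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit a 1-D logistic regression p = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            w -= lr * (p - y) * x                      # gradient of the log-loss
            b -= lr * (p - y)
    return w, b

def classify(x, w, b, threshold=0.5):
    """Assign the positive class when the predicted probability exceeds the threshold."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > threshold else 0

# Toy data: values below ~3 belong to class 0, values above to class 1.
xs = [1.0, 2.0, 2.5, 3.5, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

After training, `classify` maps a new observation to one of the two classes by thresholding the predicted probability at 0.5.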
Evaluation of binary classifiers
There are many metrics that can be used to measure the performance of a classifier or predictor; different fields have different preferences for specific metrics due to different goals. In medicine sensitivity and specificity are often used, while in information retrieval precision and recall are preferred. An important distinction is between metrics that are independent of how often each category occurs in the population (the prevalence), and metrics that depend on the prevalence – both types are useful, but they have very different properties.
Given a classification of a specific data set, there are four basic combinations of actual data category and assigned category:
| | Test outcome positive | Test outcome negative |
|---|---|---|
| Condition positive | True positive | False negative |
| Condition negative | False positive | True negative |
These can be arranged into a 2×2 contingency table, with rows corresponding to actual value – condition positive or condition negative – and columns corresponding to classification value – test outcome positive or test outcome negative.
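The four cells of this table can be tallied directly from paired label sequences. A small sketch (the function name and the example labels are made up for illustration):

```python
def confusion_counts(actual, predicted):
    """Tally the four cells of the 2x2 contingency table.

    `actual` and `predicted` are sequences of 0/1 labels, where 1 means
    "condition positive" / "test outcome positive".
    """
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fn, fp, tn

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(actual, predicted)  # (3, 1, 1, 3)
```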
The eight basic ratios
There are eight basic ratios that one can compute from this table, which come in four complementary pairs (each pair summing to 1). These are obtained by dividing each of the four numbers by the sum of its row or column, yielding eight numbers, which can be referred to generically in the form "true positive row ratio" or "false negative column ratio".
There are thus two pairs of column ratios and two pairs of row ratios, and one can summarize these with four numbers by choosing one ratio from each pair – the other four numbers are the complements.
The row ratios are:
- true positive rate (TPR) = TP/(TP+FN), also called sensitivity or recall. These are the proportion of the population with the condition for which the test is correct.
- with complement the false negative rate (FNR) = FN/(TP+FN)
- true negative rate (TNR) = TN/(TN+FP), also called specificity (SPC). These are the proportion of the population without the condition for which the test is correct.
- with complement the false positive rate (FPR) = FP/(TN+FP)
The row ratios are independent of prevalence.
The column ratios are:
- positive predictive value (PPV) = TP/(TP+FP), also called precision. These are the proportion of the population with a given test result for which the test is correct.
- with complement the false discovery rate (FDR) = FP/(TP+FP)
- negative predictive value (NPV) = TN/(TN+FN)
- with complement the false omission rate (FOR) = FN/(TN+FN)
The column ratios depend on the prevalence.
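The four chosen ratios (one from each complementary pair) can be computed straightforwardly from the table counts; each complement is simply one minus the corresponding ratio. A sketch with illustrative, made-up counts:

```python
def basic_ratios(tp, fn, fp, tn):
    """Compute one ratio from each of the four complementary pairs."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall  (complement: FNR)
        "TNR": tn / (tn + fp),  # specificity           (complement: FPR)
        "PPV": tp / (tp + fp),  # precision             (complement: FDR)
        "NPV": tn / (tn + fn),  #                       (complement: FOR)
    }

# Example counts: 100 with the condition, 900 without.
r = basic_ratios(tp=90, fn=10, fp=30, tn=870)
```

Note how prevalence enters only the column ratios: with the same TPR and TNR, rerunning this on a population with fewer condition positives would lower the PPV.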
In diagnostic testing, the main ratios used are the true row ratios – true positive rate and true negative rate – where they are known as sensitivity and specificity. In information retrieval, the main ratios are the true positive ratios (row and column) – true positive rate and positive predictive value – where they are known as recall and precision. There is no general theory that sets out which pair should be used in which circumstances; each discipline has its own reason for the choice it has made.
One can take ratios of a complementary pair of ratios, yielding four likelihood ratios (two from the row ratios, two from the column ratios). This is primarily done for the row (condition) ratios, yielding the likelihood ratios used in diagnostic testing. Taking the ratio of these two likelihood ratios yields a final ratio, the diagnostic odds ratio (DOR). This can also be defined directly as (TP×TN)/(FP×FN) = (TP/FN)/(FP/TN); it has a useful interpretation – as an odds ratio – and is prevalence-independent.
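The two likelihood ratios and the diagnostic odds ratio follow directly from the row ratios; a sketch (function name and counts are illustrative only), which also confirms that the two definitions of the DOR agree:

```python
def diagnostic_ratios(tp, fn, fp, tn):
    """Likelihood ratios and diagnostic odds ratio from the 2x2 table."""
    tpr = tp / (tp + fn)         # sensitivity
    fnr = fn / (tp + fn)
    fpr = fp / (tn + fp)
    tnr = tn / (tn + fp)         # specificity
    lr_plus = tpr / fpr          # positive likelihood ratio
    lr_minus = fnr / tnr         # negative likelihood ratio
    dor = lr_plus / lr_minus     # diagnostic odds ratio
    return lr_plus, lr_minus, dor

lr_plus, lr_minus, dor = diagnostic_ratios(tp=90, fn=10, fp=30, tn=870)
# dor agrees (up to rounding) with the direct form (TP*TN)/(FP*FN)
```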
There are a number of other metrics, most simply the accuracy, (TP+TN)/(TP+TN+FP+FN), which measures the fraction of all instances that are correctly categorized; its complement is the fraction incorrect. The F-score combines precision and recall into a single number via their harmonic mean.
Converting continuous values to binary
Tests whose results are of continuous values, such as most blood values, can artificially be made binary by defining a cutoff value, with test results being designated as positive or negative depending on whether the resultant value is higher or lower than the cutoff.
However, such conversion causes a loss of information, as the resultant binary classification does not tell how much above or below the cutoff a value is. As a result, when converting a continuous value that is close to the cutoff to a binary one, the resultant binary designation conveys an inappropriately high certainty, while the underlying value is in fact in an interval of uncertainty.
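The conversion and its information loss can be shown in a few lines (cutoff and values are made up for illustration):

```python
def dichotomize(value, cutoff):
    """Convert a continuous test result to a binary outcome at a cutoff.

    Note the information loss: with a cutoff of 6.0, the values 6.2 and
    60.0 both map to "positive", even though one is barely above the
    cutoff and the other far above it.
    """
    return "positive" if value > cutoff else "negative"

results = [dichotomize(v, cutoff=6.0) for v in (5.9, 6.2, 60.0)]
# results == ["negative", "positive", "positive"]
```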
See also
- Approximate membership query filter
- Examples of Bayesian inference
- Classification rule
- Confusion matrix
- Detection theory
- Kernel methods
- Multiclass classification
- Multi-label classification
- One-class classification
- Prosecutor's fallacy
- Receiver operating characteristic
- Thresholding (image processing)
- Uncertainty coefficient, aka proficiency
- Qualitative property
- Precision and recall (equivalent classification schema)
References
- Y. Lu and C. Rasmussen (2012). "Simplified markov random fields for efficient semantic labeling of 3D point clouds" (PDF). IROS.
Bibliography
- Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000. ISBN 0-521-78019-5
- John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- Bernhard Schölkopf and A. J. Smola: Learning with Kernels. MIT Press, Cambridge, Massachusetts, 2002. ISBN 0-262-19475-9