Feature selection
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for several reasons:
- simplification of models to make them easier to interpret by researchers/users,[2]
- shorter training times,[3]
- avoidance of the curse of dimensionality,[4]
- improved compatibility of the data with a learning model class,[5]
- encoding of inherent symmetries present in the input space.[6][7][8][9]
The central premise when using a feature selection technique is that the data contains some features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.[10] Redundant and irrelevant are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.[11]
Introduction
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.[11]
- Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem.
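- Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include mutual information, the Pearson correlation coefficient, and the scores of significance tests for each class/feature combination. Another popular approach is recursive feature elimination, commonly used with support vector machines to repeatedly construct a model and remove features with low weights.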
- Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso, which bootstraps samples;[17] elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect, which scores all the features based on combinatorial analysis of regression coefficients.[18] AEFS further extends LASSO to nonlinear scenarios with autoencoders.[19] These approaches tend to be between filters and wrappers in terms of computational complexity.
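The LASSO behaviour described in the list above (an L1 penalty that shrinks some coefficients exactly to zero) can be sketched with cyclic coordinate descent. This is a minimal illustration, not a reference implementation: the objective scaling (1/2n), the absence of an intercept, the function names and the toy data are all assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of targets. No intercept is fitted,
    so the columns of X and y are assumed to be centred (an assumption
    of this sketch).
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                prediction = sum(X[i][k] * coef[k] for k in range(p))
                # Partial residual: the fit with feature j left out.
                partial = y[i] - prediction + X[i][j] * coef[j]
                rho += X[i][j] * partial
                z += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return coef


def selected_features(coef):
    """Indices of the features whose coefficient was not shrunk to zero."""
    return [j for j, b in enumerate(coef) if b != 0.0]


# Toy data (hypothetical): y depends only on the first column.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
coef = lasso_coordinate_descent(X, y, lam=0.5)
print(coef, selected_features(coef))
```

On this toy data the second coefficient is driven to exactly zero, so only the first feature counts as selected; the surviving coefficient is also shrunk relative to the least-squares value, which is the usual price of the L1 penalty.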
In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network.
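The greedy forward variant of stepwise selection can be sketched as follows. The scoring function is left abstract: in practice it might be a cross-validated accuracy or a penalised criterion, but the `toy_score` used here is purely hypothetical and only serves to make the example runnable. The stopping rule (stop when no candidate improves the score) is one common choice among several.

```python
def forward_selection(n_features, score, max_features=None):
    """Greedy forward selection over feature indices 0..n_features-1.

    score(subset) must return a number where higher is better.
    The search stops when no single added feature improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    limit = n_features if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in sorted(remaining)]
        cand_score, cand_feature = max(candidates, key=lambda t: t[0])
        if cand_score <= best_score:
            break  # stopping rule: no improvement
        selected.append(cand_feature)
        remaining.remove(cand_feature)
        best_score = cand_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical score: features 0 and 2 help, every feature costs 0.1."""
    gains = {0: 1.0, 2: 0.5}
    return sum(gains.get(f, 0.0) for f in subset) - 0.1 * len(subset)


subset, value = forward_selection(4, toy_score)
print(subset, value)
```

Because a feature is never removed once added, the subsets produced at successive rounds are nested, which is the nesting problem mentioned above.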
Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and carry a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.
Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and checks whether the new subset is an improvement over the old one.
Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.
Search approaches include:
- Exhaustive[20]
- Best first
- Simulated annealing
- Genetic algorithm[21]
- Greedy forward selection[22][23][24]
- Greedy backward elimination
- Particle swarm optimization[25]
- Targeted projection pursuit
- Scatter search[26][27]
- Variable neighborhood search[28][29]
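As a point of comparison for the search approaches listed above, the exhaustive strategy can be sketched in a few lines. The scoring function is again an assumption supplied by the caller (higher is better), and the toy score below is hypothetical.

```python
from itertools import combinations


def exhaustive_search(n_features, score):
    """Score every non-empty subset and return the best one with its score."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


def count_subsets(n_features):
    """Number of non-empty subsets that exhaustive search must evaluate."""
    return 2 ** n_features - 1


def toy_score(subset):
    """Hypothetical score: features 0 and 2 help, every feature costs 0.1."""
    gains = {0: 1.0, 2: 0.5}
    return sum(gains.get(f, 0.0) for f in subset) - 0.1 * len(subset)


print(exhaustive_search(4, toy_score))
print(count_subsets(20))
```

The subset count doubles with every added feature, which is why the other entries in the list trade optimality guarantees for tractability.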
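Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or "distance measure" in the mathematical sense, since both fail to obey the triangle inequality; they are better regarded as scores. These scores are computed between a candidate feature (or set of features) and the desired output category.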
Other available filter metrics include:
- Class separability
- Error probability
- Inter-class distance
- Probabilistic distance
- Entropy
- Consistency-based feature selection
- Correlation-based feature selection
Optimality criteria
The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include the Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature.
Other criteria, expressed on the same scale, are the Bayesian information criterion (BIC), which uses a penalty of $\log n$ for each added feature; minimum description length (MDL), which asymptotically uses $\log n$; Bonferroni / RIC, which use $2\log p$; maximum dependency feature selection; and a variety of new criteria motivated by the false discovery rate (FDR), which use something close to $2\log\frac{p}{q}$. Here $n$ is the number of samples, $p$ the number of candidate features, and $q$ the target false discovery rate. A maximum entropy rate criterion may also be used to select the most relevant subset of features.[33]
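To make the idea of "accuracy penalised by the number of features" concrete, the following sketch compares AIC and BIC for two nested regression models. It assumes a Gaussian linear model, for which minus twice the maximised log-likelihood equals n·ln(RSS/n) up to an additive constant; on that scale the per-parameter penalties are 2 for AIC and ln n for BIC. The residual sums of squares are made-up numbers for illustration.

```python
import math


def gaussian_neg2_loglik(rss, n):
    """-2 * maximised log-likelihood of a Gaussian linear model, up to a constant."""
    return n * math.log(rss / n)


def aic(rss, n, k):
    """Akaike information criterion: penalty of 2 per parameter."""
    return gaussian_neg2_loglik(rss, n) + 2 * k


def bic(rss, n, k):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return gaussian_neg2_loglik(rss, n) + k * math.log(n)


n = 100
small = {"rss": 50.0, "k": 3}  # hypothetical 3-feature model
large = {"rss": 48.0, "k": 4}  # hypothetical model with one extra feature

for name, criterion in (("AIC", aic), ("BIC", bic)):
    s = criterion(small["rss"], n, small["k"])
    l = criterion(large["rss"], n, large["k"])
    print(name, "prefers", "small" if s < l else "large")
```

With these numbers the extra feature reduces the residual error enough to pay the AIC penalty but not the heavier BIC penalty, so the two criteria disagree about which subset is better.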
Structure learning
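Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable, whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a Bayesian network, so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian network there is a unique Markov blanket for each node.[34]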
Information theory based feature selection mechanisms
There are several feature selection mechanisms that use mutual information to score the different features. They usually all follow the same algorithm:
1. Calculate the mutual information as a score between every feature ($f_i \in F$) and the target class ($c$)
2. Select the feature with the largest score (e.g. $\operatorname{argmax}_{f_i \in F} I(f_i; c)$) and add it to the set of selected features ($S$)
3. Calculate the score, which might be derived from the mutual information
4. Select the feature with the largest score and add it to the set of selected features (e.g. $\operatorname{argmax}_{f_i \in F} I_{derived}(f_i; c)$)
5. Repeat steps 3 and 4 until a certain number of features is selected (e.g. $|S| = l$)
The simplest approach uses the mutual information as the "derived" score.[35]
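The simplest variant, where the mutual information itself is the score, can be sketched as below. The sketch assumes discrete features and uses the plug-in (empirical frequency) estimate of mutual information in bits; the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to n_xy * n / (n_x * n_y).
    return sum(
        (nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in pxy.items()
    )


def select_by_mutual_information(features, target, n_select):
    """Rank features by I(f_i; c) and keep the n_select highest-scoring ones."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_select], scores


# Toy data (hypothetical): one perfect copy of the class, one noisy copy,
# and one constant column that carries no information.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "constant": [0, 0, 0, 0, 0, 0, 0, 0],
    "noisy": [0, 0, 1, 1, 0, 1, 1, 0],
    "copy": [0, 0, 1, 1, 0, 0, 1, 1],
}
top, scores = select_by_mutual_information(features, target, 2)
print(top, scores)
```

Because each feature is scored in isolation, this ranking would happily select two identical columns.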
However, there are different approaches, that try to reduce the redundancy between features.
Minimum-redundancy-maximum-relevance (mRMR) feature selection
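Peng et al.[36] proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set $S$ for the class $c$ is defined by the average value of all mutual information values between the individual feature $f_i$ and the class $c$ as follows:
$$D(S,c) = \frac{1}{|S|}\sum_{f_i \in S} I(f_i; c).$$
The redundancy of all features in the set $S$ is the average value of all mutual information values between the feature $f_i$ and the feature $f_j$:
$$R(S) = \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j).$$
The mRMR criterion is a combination of the two measures given above and is defined as follows:
$$\mathrm{mRMR} = \max_{S}\left[\frac{1}{|S|}\sum_{f_i \in S} I(f_i; c) - \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j)\right].$$
Suppose that there are $n$ full-set features. Let $x_i$ be the set membership indicator function for feature $f_i$, so that $x_i = 1$ indicates presence and $x_i = 0$ indicates absence of the feature $f_i$ in the globally optimal feature set. Let $c_i = I(f_i; c)$ and $a_{ij} = I(f_i; f_j)$. The above may then be written as an optimization problem:
$$\mathrm{mRMR} = \max_{x \in \{0,1\}^n}\left[\frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} x_i} - \frac{\sum_{i,j=1}^{n} a_{ij} x_i x_j}{\left(\sum_{i=1}^{n} x_i\right)^2}\right].$$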
The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance[35] when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy.
mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.[35][37]
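The relevancy/redundancy trade-off can be sketched with the incremental (greedy) search that is commonly paired with mRMR: at each step the candidate maximising its relevance minus its average redundancy with the already selected features is added. The sketch assumes discrete features, a plug-in mutual information estimate, and a non-empty feature dictionary; names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in pxy.items()
    )


def mrmr_select(features, target, n_select):
    """Greedy mRMR: relevance I(f;c) minus mean redundancy with selected features."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    # Start from the single most relevant feature.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(n_select, len(features)):
        def criterion(name):
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=criterion))
    return selected


# Toy data (hypothetical): "b" duplicates "a"; "d" is less relevant on its own
# but almost independent of "a".
u = [0, 0, 0, 0, 1, 1, 1, 1]
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = {"a": u, "b": list(u), "d": [0, 0, 1, 1, 0, 0, 1, 0]}
print(mrmr_select(features, target, 2))
```

A pure relevance ranking would pick the duplicated pair "a" and "b" here, since both have the highest individual mutual information with the class; the redundancy term is what steers the second choice to "d".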
Quadratic programming feature selection
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programming optimization problem as follows:[38]
$$\mathrm{QPFS}: \min_{x}\left[\alpha\, x^{T} H x - x^{T} F\right] \quad \text{s.t.}\ \sum_{i=1}^{n} x_i = 1,\ x_i \geq 0$$
where $F_{n \times 1} = [I(f_1; c), \ldots, I(f_n; c)]^{T}$ is the vector of feature relevancy assuming there are $n$ features in total, $H_{n \times n} = [I(f_i; f_j)]_{i,j=1 \ldots n}$ is the matrix of feature pairwise redundancy, and $x_{n \times 1}$ represents relative feature weights. QPFS is solved via quadratic programming. It has been shown that QPFS is biased towards features with smaller entropy,[39] due to its placement of the feature self-redundancy term $I(f_i; f_i)$ on the diagonal of $H$.
Conditional mutual information
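Another score derived from the mutual information is based on the conditional relevancy:[39]
$$\mathrm{SPEC_{CMI}}: \max_{x}\left\{x^{T} Q x\right\} \quad \text{s.t.}\ \|x\| = 1,\ x_i \geq 0$$
where $Q_{ii} = I(f_i; c)$ and $Q_{ij} = \left(I(f_i; c \mid f_j) + I(f_j; c \mid f_i)\right)/2,\ i \neq j$.
An advantage of SPECCMI is that it can be solved simply by finding the dominant eigenvector of $Q$, and is therefore very scalable. SPECCMI also handles second-order feature interaction.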
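Finding the dominant eigenvector of a symmetric matrix with non-negative entries can be sketched with power iteration. This is an illustrative sketch only: the matrix below contains made-up values rather than estimated (conditional) mutual informations, and ranking features by the size of their eigenvector entry is assumed here as the way the solution is turned into a selection.

```python
import math


def dominant_eigenvector(Q, n_iters=1000, tol=1e-12):
    """Power iteration for a square matrix Q given as a list of rows.

    Starts from the uniform unit vector, so for a non-negative Q the
    iterates stay non-negative.
    """
    n = len(Q)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iters):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return x  # Q maps x to zero; nothing to iterate on
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical score matrix: diagonal = individual relevancy,
# off-diagonal = averaged conditional relevancy of each pair.
Q = [
    [0.9, 0.1, 0.0],
    [0.1, 0.5, 0.0],
    [0.0, 0.0, 0.1],
]
weights = dominant_eigenvector(Q)
ranking = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
print(weights, ranking)
```

Each iteration costs one matrix-vector product, which is the source of the scalability noted above.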
Joint mutual information
In a study of different scores, Brown et al.[35] recommended the joint mutual information[40] as a good score for feature selection. The score tries to find the feature that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:
$$JMI(f_i) = \sum_{f_j \in S} I(f_i, f_j; c) = \sum_{f_j \in S}\left(I(f_j; c) + I(f_i; c \mid f_j)\right).$$
The score uses the conditional mutual information and the mutual information to estimate the redundancy between the already selected features ($f_j \in S$) and the feature under investigation ($f_i$).
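A small sketch of a joint mutual information score follows. It assumes the joint form, in which a candidate is scored by summing I(f_i, f_j; c) over the already selected features f_j, and it treats each feature pair as a single tuple-valued variable so that a plain plug-in estimator suffices. The data is a deliberately extreme parity example and all names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in pxy.items()
    )


def jmi_score(candidate, selected_columns, target):
    """Sum over selected features f_j of I((candidate, f_j); target)."""
    return sum(
        mutual_information(list(zip(candidate, s)), target)
        for s in selected_columns
    )


# Parity example: the class is the XOR of x1 and x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
constant = [0, 0, 0, 0]
target = [0, 1, 1, 0]

print(mutual_information(x2, target))       # x2 alone
print(jmi_score(x2, [x1], target))          # x2 jointly with x1
print(jmi_score(constant, [x1], target))    # an uninformative candidate
```

Taken alone, `x2` shares no information with the class, yet paired with the already selected `x1` it determines the class completely; a score built only from pairwise feature/class terms cannot see this.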
Hilbert-Schmidt Independence Criterion Lasso based feature selection
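For high-dimensional and small sample data (e.g., dimensionality > $10^5$ and the number of samples < $10^3$), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful.[41] The HSIC Lasso optimization problem is given as
$$\mathrm{HSIC_{Lasso}}: \min_{x}\ \frac{1}{2}\sum_{k,l=1}^{n} x_k x_l\, \mathrm{HSIC}(f_k, f_l) - \sum_{k=1}^{n} x_k\, \mathrm{HSIC}(f_k, c) + \lambda \|x\|_1, \quad \text{s.t.}\ x_1, \ldots, x_n \geq 0,$$
where $\mathrm{HSIC}(f_k, c) = \operatorname{tr}(\bar{K}^{(k)} \bar{L})$ is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC), $\operatorname{tr}(\cdot)$ denotes the trace, $\lambda$ is the regularization parameter, $\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma$ and $\bar{L} = \Gamma L \Gamma$ are input and output centered Gram matrices, $K^{(k)}_{i,j} = K(u_{k,i}, u_{k,j})$ and $L_{i,j} = L(c_i, c_j)$ are Gram matrices, $K(u, u')$ and $L(c, c')$ are kernel functions, $\Gamma = I_m - \frac{1}{m} 1_m 1_m^{T}$ is the centering matrix, $I_m$ is the m-dimensional identity matrix (m: the number of samples), $1_m$ is the m-dimensional vector with all ones, and $\|\cdot\|_1$ is the $\ell_1$-norm. HSIC always takes a non-negative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used.
The HSIC Lasso can be written, up to an additive constant, as
$$\mathrm{HSIC_{Lasso}}: \min_{x}\ \frac{1}{2}\left\|\bar{L} - \sum_{k=1}^{n} x_k \bar{K}^{(k)}\right\|_F^2 + \lambda \|x\|_1, \quad \text{s.t.}\ x_1, \ldots, x_n \geq 0,$$
where $\|\cdot\|_F$ is the Frobenius norm.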
Correlation feature selection
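The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other".[42][43] The following equation gives the merit of a feature subset $S$ consisting of $k$ features:
$$\mathrm{Merit}_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}.$$
Here, $\overline{r_{cf}}$ is the average value of all feature-classification correlations, and $\overline{r_{ff}}$ is the average value of all feature-feature correlations. The CFS criterion is defined as follows:
$$\mathrm{CFS} = \max_{S_k}\left[\frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2\,(r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_{k-1}})}}\right].$$
The $r_{cf_i}$ and $r_{f_i f_j}$ variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ.
Let $x_i$ be the set membership indicator function for feature $f_i$, and write $a_i = r_{cf_i}$ and $b_{ij} = r_{f_i f_j}$; then the above can be rewritten as an optimization problem over the squared merit:
$$\mathrm{CFS} = \max_{x \in \{0,1\}^n}\left[\frac{\left(\sum_{i=1}^{n} a_i x_i\right)^2}{\sum_{i=1}^{n} x_i + \sum_{i<j} 2\, b_{ij} x_i x_j}\right].$$
The combinatorial problems above are, in fact, mixed 0–1 linear programming problems that can be solved by using branch-and-bound algorithms.[44]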
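The merit of a feature subset under the CFS hypothesis can be sketched as a small function. The form used is the merit k·mean(r_cf) / sqrt(k + k(k−1)·mean(r_ff)), taken from Hall's formulation; the correlation values in the example are made up, and any measure of relatedness could be substituted for them.

```python
import math


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a subset of k features.

    feature_class_corrs: one correlation per feature (length k).
    feature_feature_corrs: one correlation per unordered feature pair
    (length k*(k-1)/2; empty when k == 1).
    """
    k = len(feature_class_corrs)
    mean_cf = sum(feature_class_corrs) / k
    mean_ff = (
        sum(feature_feature_corrs) / len(feature_feature_corrs)
        if feature_feature_corrs
        else 0.0
    )
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


single = cfs_merit([0.6], [])
uncorrelated_pair = cfs_merit([0.6, 0.6], [0.0])
duplicated_pair = cfs_merit([0.6, 0.6], [1.0])
print(single, uncorrelated_pair, duplicated_pair)
```

Adding a second feature that is perfectly correlated with the first leaves the merit unchanged, whereas adding an uncorrelated feature of equal relevance raises it, which is the behaviour the hypothesis quoted above asks for.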
Regularized trees
The features used by a decision tree or a tree ensemble have been shown to be redundant. A method called regularized tree[45] can be used for feature subset selection. Regularized trees penalize the use of a variable similar to the variables selected at previous tree nodes when splitting the current node. Regularized trees need to build only one tree model (or one tree ensemble model) and are thus computationally efficient.
Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as normalization. Regularized random forest (RRF)[46] is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.
Overview on metaheuristics methods
A metaheuristic is a general description of an algorithm dedicated to solving difficult (typically NP-hard) optimization problems for which there are no classical solving methods. Generally, a metaheuristic is a stochastic algorithm that tends to reach a global optimum. There are many metaheuristics, from a simple local search to a complex global search algorithm.
Main principles
The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building.
Filter method
Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.[47]
Filter methods tend to select redundant variables when they do not consider the relationships between variables. However, more elaborate methods try to minimize this problem by removing variables that are highly correlated with each other, such as the Fast Correlation Based Filter (FCBF) algorithm.[48]
Wrapper method
Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions amongst variables.[49] The two main disadvantages of these methods are:
- The increasing overfitting risk when the number of observations is insufficient.
- The significant computation time when the number of variables is large.
Embedded method
Embedded methods have been recently proposed that try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, such as the FRMT algorithm.[50]
Application of feature selection metaheuristics
This is a survey of recent applications of feature selection metaheuristics in the literature. The survey was conducted by J. Hamon in her 2013 thesis.[47]
Application | Algorithm | Approach | Classifier | Evaluation Function | Reference |
---|---|---|---|---|---|
SNPs | Feature Selection using Feature Similarity | Filter | | r² | Phuong 2005[49] |
SNPs | Genetic algorithm | Wrapper | Decision Tree | Classification accuracy (10-fold) | Shah 2004[51] |
SNPs | Hill climbing | Filter + Wrapper | Naive Bayesian | Predicted residual sum of squares | Long 2007[52] |
SNPs | Simulated annealing | | Naive Bayesian | Classification accuracy (5-fold) | Ustunkar 2011[53] |
Speech segments | Ant colony | Wrapper | Artificial Neural Network | MSE | Al-ani 2005 [citation needed] |
Marketing | Simulated annealing | Wrapper | Regression | AIC, r² | Meiri 2006[54] |
Economics | Simulated annealing, genetic algorithm | Wrapper | Regression | BIC | Kapetanios 2007[55] |
Spectral Mass | Genetic algorithm | Wrapper | Multiple Linear Regression, Partial Least Squares | Root-mean-square error of prediction | Broadhurst et al. 1997[56] |
Spam | Binary PSO + Mutation | Wrapper | Decision tree | Weighted cost | Zhang 2014[25] |
Microarray | Tabu search + PSO | Wrapper | Support Vector Machine, K Nearest Neighbors | Euclidean distance | Chuang 2009[57] |
Microarray | PSO + Genetic algorithm | Wrapper | Support Vector Machine | Classification accuracy (10-fold) | Alba 2007[58] |
Microarray | Genetic algorithm + Iterated Local Search | Embedded | Support Vector Machine | Classification accuracy (10-fold) | Duval 2009[59] |
Microarray | Iterated local search | Wrapper | Regression | Posterior probability | Hans 2007[60] |
Microarray | Genetic algorithm | Wrapper | K Nearest Neighbors | Classification accuracy (leave-one-out cross-validation) | Jirapech-Umpai 2005[61] |
Microarray | Hybrid genetic algorithm | Wrapper | K Nearest Neighbors | Classification accuracy (leave-one-out cross-validation) | Oh 2004[62] |
Microarray | Genetic algorithm | Wrapper | Support Vector Machine | Sensitivity and specificity | Xuan 2011[63] |
Microarray | Genetic algorithm | Wrapper | All paired Support Vector Machine | Classification accuracy (leave-one-out cross-validation) | Peng 2003[64] |
Microarray | Genetic algorithm | Embedded | Support Vector Machine | Classification accuracy (10-fold) | Hernandez 2007[65] |
Microarray | Genetic algorithm | Hybrid | Support Vector Machine | Classification accuracy (leave-one-out cross-validation) | Huerta 2006[66] |
Microarray | Genetic algorithm | | Support Vector Machine | Classification accuracy (10-fold) | Muni 2006[67] |
Microarray | Genetic algorithm | Wrapper | Support Vector Machine | EH-DIALL, CLUMP | Jourdan 2005[68] |
Alzheimer's disease | Welch's t-test | Filter | Support Vector Machine | Classification accuracy (10-fold) | Zhang 2015[69] |
Computer vision | Infinite Feature Selection | Filter | Independent | ROC AUC | Roffo 2015[70] |
Microarrays | Eigenvector Centrality FS | Filter | Independent | Average Precision, Accuracy, ROC AUC | Roffo & Melzi 2016[71] |
XML | Symmetrical Tau (ST) | Filter | Structural Associative Classification | Accuracy, Coverage | Shaharanee & Hadzic 2014 |
Feature selection embedded in learning algorithms
Some learning algorithms perform feature selection as part of their overall operation. These include:
- $\ell_1$-regularization techniques, such as sparse regression, LASSO, and $\ell_1$-SVM
- Regularized trees,[45] e.g. regularized random forest implemented in the RRF package[46]
- Decision tree[72]
- Memetic algorithm
- Random multinomial logit (RMNL)
- Auto-encoding networks with a bottleneck-layer
- Submodular feature selection[73][74][75]
- Local learning based feature selection.[76] Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features.
- Recommender system based on feature selection.[77] The feature selection methods are introduced into recommender system research.
See also
- Cluster analysis
- Data mining
- Dimensionality reduction
- Feature extraction
- Hyperparameter optimization
- Model selection
- Relief (feature selection)
References
- S2CID 220665533.
- ^ Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. 204.
- ISBN 978-0-387-30768-8, retrieved 2021-07-13
- ISSN 1547-5905.
- ISSN 1533-7928.
- S2CID 8368258.
- S2CID 8849753.
- S2CID 13745401.
- S2CID 8501814.
- PMID 25988841.
- ^ a b c Guyon, Isabelle; Elisseeff, André (2003). "An Introduction to Variable and Feature Selection". JMLR. 3.
- ^ a b Yang, Yiming; Pedersen, Jan O. (1997). A comparative study on feature selection in text categorization (PDF). ICML.
- PMID 30031057.
- ^ Forman, George (2003). "An extensive empirical study of feature selection metrics for text classification" (PDF). Journal of Machine Learning Research. 3: 1289–1305.
- .
- .
- S2CID 609778.
- PMID 23369194.
- ^ Kai Han; Yunhe Wang; Chao Zhang; Chao Li; Chao Xu (2018). Autoencoder inspired unsupervised feature selection. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- arXiv:2004.06152 [stat.CO].
- PMID 25719748.
- .
- ^ Figueroa, Alejandro; Guenter Neumann (2013). Learning to Rank Effective Paraphrases from Query Logs for Community Question Answering. AAAI.
- hdl:10533/196878.
- ^ .
- ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving feature subset selection problem by a Parallel Scatter Search, European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.
- S2CID 235770316.
- ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving Feature Subset Selection Problem by a Hybrid Metaheuristic. In First International Workshop on Hybrid Metaheuristics, pp. 59–68, 2004.
- ^ M. Garcia-Torres, F. Gomez-Vela, B. Melian, J.M. Moreno-Vega. High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach, Information Sciences, vol. 326, pp. 102-118, 2016.
- )
- ^ Akaike, H. (1985), "Prediction and entropy", in Atkinson, A. C.; Fienberg, S. E. (eds.), A Celebration of Statistics (PDF), Springer, pp. 1–24, archived (PDF) from the original on August 30, 2019.
- ISBN 9780387953649.
- S2CID 49555941.
- ^ Aliferis, Constantin (2010). "Local causal and markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation" (PDF). Journal of Machine Learning Research. 11: 171–234.
- ^ a b c d Brown, Gavin; Pocock, Adam; Zhao, Ming-Jie; Luján, Mikel (2012). "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection". Journal of Machine Learning Research. 13: 27–66.[1]
- ^ Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [2]
- ^ Rodriguez-Lujan, I.; Huerta, R.; Elkan, C.; Santa Cruz, C. (2010). "Quadratic programming feature selection" (PDF). JMLR. 11: 1491–1516.
- ^ a b Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014. "[3]"
- ^ Yang, Howard Hua; Moody, John (2000). "Data visualization and feature selection: New algorithms for nongaussian data" (PDF). Advances in Neural Information Processing Systems: 687–693.
- S2CID 2742785.
- ^ Hall, M. (1999). Correlation-based Feature Selection for Machine Learning (PDF) (PhD thesis). University of Waikato.
- S2CID 8398495.
- ^ Nguyen, Hai; Franke, Katrin; Petrovic, Slobodan (December 2009). "Optimizing a class of feature selection measures". Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML). Vancouver, Canada.
- ^ a b H. Deng, G. Runger, "Feature Selection via Regularized Trees", Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012
- ^ CRAN
- ^ a b Hamon, Julie (November 2013). Optimisation combinatoire pour la sélection de variables en régression en grande dimension: Application en génétique animale (Thesis) (in French). Lille University of Science and Technology.
- ^ Yu, Lei; Liu, Huan (August 2003). "Feature selection for high-dimensional data: a fast correlation-based filter solution" (PDF). ICML'03: Proceedings of the Twentieth International Conference on International Conference on Machine Learning: 856–863.
- ^ PMID 16447987.
- PMID 28934234.
- PMID 15302085.
- PMID 21749471.
- S2CID 8075318.
- .
- .
- .
- PMID 20047491.
- ^ E. Alba, J. Garcia-Nieto, L. Jourdan and E.-G. Talbi. Gene Selection in Cancer Classification using PSO-SVM and GA-SVM Hybrid Algorithms. Congress on Evolutionary Computation, Singapore, 2007.
- ^ B. Duval, J.-K. Hao and J. C. Hernandez Hernandez. A memetic algorithm for gene selection and molecular classification of cancer. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO '09, pages 201-208, New York, NY, USA, 2009. ACM.
- ^ C. Hans, A. Dobra and M. West. Shotgun stochastic search for 'large p' regression. Journal of the American Statistical Association, 2007.
- PMID 15958165.
- PMID 15521491.
- PMID 21491369.
- PMID 14644442.
- ISBN 978-3-540-71782-9.
- ISBN 978-3-540-33237-4.
- S2CID 2073035.
- .
- PMID 26082713.
- S2CID 3223980.
- ^ Roffo, Giorgio; Melzi, Simone (September 2016). "Features Selection via Eigenvector Centrality" (PDF). NFmcp2016. Retrieved 12 November 2016.
- ^ R. Kohavi and G. John, "Wrappers for feature subset selection", Artificial intelligence 97.1-2 (1997): 273-324
- ].
- ^ Liu et al., Submodular feature selection for high-dimensional acoustic score spaces
- ^ Zheng et al., Submodular Attribute Selection for Action Recognition in Video
- PMID 20634556.
- Knowledge-Based Systems, 157: 1-9
Further reading
- Guyon, Isabelle; Elisseeff, Andre (2003). "An Introduction to Variable and Feature Selection". Journal of Machine Learning Research. 3: 1157–1182.
- Harrell, F. (2001). Regression Modeling Strategies. Springer. ISBN 0-387-95232-2.
- Liu, Huan; Motoda, Hiroshi (1998). Feature Selection for Knowledge Discovery and Data Mining. Springer. ISBN 0-7923-8198-X.
- Liu, Huan; Yu, Lei (2005). "Toward Integrating Feature Selection Algorithms for Classification and Clustering". IEEE Transactions on Knowledge and Data Engineering. 17 (4): 491–502. S2CID 1607600.
External links
- Feature Selection Package, Arizona State University (Matlab Code)
- NIPS challenge 2003 (see also NIPS)
- Naive Bayes implementation with feature selection in Visual Basic (includes executable and source code)
- Minimum-redundancy-maximum-relevance (mRMR) feature selection program
- FEAST (Open source Feature Selection algorithms in C and MATLAB)